ClaudeBot and a Pandemic of Inconsiderate Coding

The Curious Codex

Published 2024-05-18, Updated 2024-06-17




By Matt (Virtualisation), GEN UK Blog

Matt has been with the firm since 2015.

Introduction


Inconsiderate bots can have a significant negative impact on websites by consuming excessive bandwidth and slowing down site performance. When bots aggressively crawl a website, making numerous requests in a short period of time, they can strain server resources and eat up the site's allocated bandwidth. This is especially problematic for smaller websites on limited hosting plans with strict bandwidth allowances.

As the bots consume more and more of the available bandwidth, less is left for legitimate users trying to access the site. This can lead to slower page load times, increased latency, and even periods of unavailability if the bandwidth is completely exhausted. The website owner may then be forced to upgrade to a more expensive hosting plan with higher bandwidth limits to accommodate the bot traffic and ensure their site remains accessible to real users.

In essence, inconsiderate bots cost website owners money by driving up their hosting costs and hurting the user experience for their visitors. It's important for bot developers to be responsible and respect website owners by following crawling best practices, honouring robots.txt directives, and limiting their request rate to avoid unduly burdening websites.


Who are they

These bots scrape (trawl) your website to use its data for building search indexes or looking for content. Most bots, like Google's and Bing's, are very careful to scrape your site at a low rate and low priority, and indeed you can even tell both of these what hours they can and cannot crawl, but not everyone is quite so respectful.
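
For the well-behaved crawlers, robots.txt is where you express your preferences. The snippet below is only a rough sketch: some crawlers (Bing, for example) honour the Crawl-delay directive, Google ignores it and takes its crawl-rate settings from Search Console instead, and a genuinely bad bot will ignore the whole file.

# Ask all crawlers to wait 10 seconds between requests (only some honour this)
User-agent: *
Crawl-delay: 10

# Ask a specific crawler to stay away entirely - only effective if it obeys robots.txt
User-agent: ClaudeBot
Disallow: /

The bots in the next section are listed precisely because they tend not to honour any of this.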


Known Bad Bots

There have always been 'bad' bots, like PetalBot, but recently these have increased in number substantially. Here's a list of a few:


  • AhrefsBot/6.1 - A bot from Ahrefs, a Singapore-based SEO company, known for regular crawling that can strain server resources.
  • Amazonbot - A bot from Amazon, the global e-commerce giant, which crawls websites for product information and can consume significant bandwidth.
  • facebookexternalhit - Facebook's (Meta's) crawler, used among other things to fetch link previews; it pays little attention to crawl limits and can flood sites with bursts of requests.
  • Baiduspider - The web crawler for Baidu, the dominant search engine in China, which has been reported to ignore robots.txt rules and crawl excessively.
  • BLEXBot - A bot from WebMeUp, a German SEO company, that has been known to crawl websites aggressively, causing performance issues.
  • SemrushBot - A bot from Semrush, an SEO tool provider, which can consume considerable bandwidth while gathering data for its services.
  • claudebot - ClaudeBot, the web crawler operated by Anthropic, the AI company behind Claude (as the log example later in this post shows); it has been observed crawling websites at a very high rate from many IP addresses, leading to increased server load.
  • YandexBot - The crawler for Yandex, the leading search engine in Russia, which has been reported to ignore crawl delay directives and cause bandwidth issues.
  • Bytespider - A bot from Bytedance, the Chinese company behind TikTok, which has been known to crawl websites aggressively for data collection purposes.
  • Mb2345Browser - A bot disguised as a mobile browser, originating from China, which has been observed to crawl websites excessively and consume bandwidth.
  • LieBaoFast - A lesser-known bot, possibly from China, that has been reported to crawl websites at a high rate, leading to performance issues.
  • MicroMessenger - A bot associated with WeChat, the popular Chinese messaging app, which has been known to crawl websites aggressively for content indexing.
  • Kinza - A bot related to a Japanese web browser, which has been observed to crawl websites excessively, consuming bandwidth and resources.
  • Datanyze - A bot from a US-based company that provides technographic data, known to crawl websites heavily for information gathering.
  • serpstatbot - A bot from Serpstat, an SEO platform, which has been reported to crawl websites aggressively, causing strain on server resources.
  • spaziodati - An Italian company's bot that collects data from websites, sometimes at a high rate, leading to increased bandwidth consumption.
  • OPPO\sA33 - A bot disguised as a mobile device from the Chinese smartphone manufacturer OPPO, known to crawl websites excessively.
  • AspiegelBot - A crawler operated by Aspiegel Ltd., Huawei's Irish subsidiary and the predecessor of PetalBot, which has been observed crawling websites aggressively.
  • PetalBot - A bot from Huawei, the Chinese telecommunications company, known to crawl websites heavily for its mobile search engine.

A customer of ours recently had 100GB of bandwidth quota consumed by facebookexternalhit in a few days. That's the problem right there.


Can we block them

Yes, in most cases you can block them from your site, but this means having access to create or edit a special file called .htaccess in your site's root directory (this applies to Apache and Apache-compatible webservers).

Assuming you can, then add the following to that file:


# Match known bad bots by their User-Agent string (case-insensitive)
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} Ahrefs|facebookexternalhit|Amazonbot|Baiduspider|BLEXBot|SemrushBot|claudebot|YandexBot|Bytespider|Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Datanyze|serpstatbot|spaziodati|OPPO\sA33|AspiegelBot|PetalBot [NC]
# Answer every matching request with 403 Forbidden instead of the page
RewriteRule ^ - [F]
# Note: zh-CN and zh_CN will also match legitimate browsers that report a Chinese locale in their user agent

This code looks for clues in the HTTP_USER_AGENT, compares it against a list of known bad bots, and instead of returning the page, it returns a Forbidden response code. After a series of these, the bad bot should give up and go and wreck someone else's bandwidth allowance. This list isn't exhaustive, but it's a good start, and you can amend it from time to time by checking your weblogs and picking out the bad bots that are causing you issues.
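
A quick way to check the rule is working is to impersonate one of the blocked user agents from the command line and confirm you get a 403 back. This is just a sanity check; www.example.com below is a placeholder for your own domain.

# Fetch only the response headers, pretending to be claudebot
curl -I -A "claudebot" https://www.example.com/

# Once the rule is active, the first line of the reply should be something like:
# HTTP/2 403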


Your Logfiles

Your hosting provider *should* allow you access to your logfiles. If you're a GEN customer using shared private hosting, you'll find this in the left-hand menu under Logs and Reports, and it's called "Apache Access Log".

Within this file, in its default configuration, you will be able to see requests coming into your website. A line of this file will look something like:


18.188.60.124 - - [18/May/2024:11:54:44 +0100] "GET /doku.php/archive/alpha_microsystems HTTP/2.0" 200 245 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"

Looking at the line above, we can see the IP address on the left (18.188.60.124), the date/time (18/May/2024:11:54:44 +0100), the page requested (/doku.php/archive/alpha_microsystems), the response code (200 = OK), and finally the HTTP_USER_AGENT (Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)).

We can see here that this request is indeed a bad bot, coming from ClaudeBot, which is well known for hitting websites from multiple IP addresses and, in some cases, exhausting the webserver's resources and causing an outage.
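
If you have shell access, or you download the log to your own machine, a quick way to see which user agents are hitting you hardest is simply to count them. This sketch assumes the standard Apache 'combined' log format shown above, where the user agent is the sixth double-quote-delimited field, and that the file is called access.log:

# Count requests per user agent and show the top 20
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head -20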

We already have ClaudeBot added to the blocklist above, but let's assume that we don't, in which case you simply extract the identifiable part 'ClaudeBot' and stick it in the search line between | and |, like this:


        |ClaudeBot|
    

And you're now blocking this as well. It may take a minute or two for the changes to take effect, especially if your hosting provider is running a cluster, but it will take effect.
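
For clarity, an abbreviated version of the amended RewriteCond line would look like this (your real line keeps the full list of names):

RewriteCond %{HTTP_USER_AGENT} SemrushBot|ClaudeBot|PetalBot [NC]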


Conclusion

Assuming you've done everything correctly, your website load (and traffic) should reduce. However, even though you're sending a 'Forbidden' response, it doesn't mean the bad bot is written to understand it. Most will, but ClaudeBot, for example, will just carry on hammering your webserver with endless requests until it's had a poke at every single page, even if you send it a 403 for every request. This is just poor coding on the part of the bot developer, and there's nothing you can do about it... or is there?

You can block the IP addresses of the bad bots using a firewall. Blocking with a firewall is preferable because the requests won't even reach your webserver, but that all depends on what level of access you have to a 'firewall' (there's a sketch at the end of this section if you do). If you're a GEN customer and need specific IPs blocking, just raise a ticket at the HelpDesk and they'll make those changes for you, but if not, and you don't have access to a firewall, you can still do it with .htaccess, like this...


# Evaluate Allow rules first, then Deny rules (Apache 2.2 syntax; needs mod_access_compat on 2.4)
Order allow,deny
# Let everyone in by default...
Allow from all
# ...except the addresses listed here (one Deny from line per IP or range)
Deny from 123.456.789.0

Your logfiles will show the IP address of every request, and you've already extracted the HTTP_USER_AGENT in order to set up your rules in .htaccess. Now, using those same logfiles, check each line for 403 responses and extract the IP address, then add it to .htaccess on a Deny from line. In our example above we're blocking 123.456.789.0 (note: I know that's not a valid IPv4 address, it's just an example). In this configuration, the first directive, 'Order', states that we're going to evaluate Allow rules and then Deny rules, and the order is important. On the next line we allow from all, so no restrictions, then on the following line(s) we deny from xxx.xxx.xxx.xxx, so those addresses are 'denied'.
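
If you'd rather not eyeball the log by hand, the same combined log format makes this easy to script; the status code is the ninth whitespace-separated field, so a sketch along these lines pulls out the addresses that are collecting 403s (again assuming the file is called access.log):

# List the IP addresses receiving 403 responses, busiest first
awk '$9 == 403 {print $1}' access.log | sort | uniq -c | sort -rn | head -20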

Sometimes you can find a list of IPs on the internet; for example, at the time of writing I was able to find several good resources listing IP addresses for some bad bots, but you need to fight your way through the endless AI-generated waffle to find them. Be careful with range blocking - some websites may provide a list of IPs to block in CIDR format, something like 123.200.0.0/16, in which case we're blocking 123.200.anything.anything, which is 65,536 IP addresses. If possible stick to individual IPs or you may wind up blocking legitimate traffic unknowingly.
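
Finally, as mentioned above, if you do have root access to your own Linux server you can apply the same idea at the firewall so the requests never reach Apache at all. The sketch below uses iptables with the ClaudeBot address from the log example; it's illustrative only, your setup may prefer nftables, ufw or firewalld, and the rules won't survive a reboot unless you save them.

# Drop all traffic from a single offending address
iptables -A INPUT -s 18.188.60.124 -j DROP

# Drop a whole CIDR range - think twice, this is 65,536 addresses
iptables -A INPUT -s 123.200.0.0/16 -j DROP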



--- This content is not legal or financial advice & Solely the opinions of the author ---


Version 1.009  Copyright © 2024 GEN, its companies and the partnership. All Rights Reserved, E&OE.