ClaudeBot and a Pandemic of Inconsiderate Coding

The Curious Codex


2024-05-18 Published
2024-07-12 Updated
1992 Words, 10 Minute Read

The Author
GEN UK Blog

By Matt (Virtualisation)

Matt has been with the firm since 2015.

 

Introduction


Inconsiderate bots can have a significant negative impact on websites by consuming excessive bandwidth and slowing down site performance. When bots aggressively crawl a website, making numerous requests in a short period of time, they strain server resources and eat into the site's allocated bandwidth. This is especially problematic for smaller websites on limited hosting plans with strict bandwidth allowances.

As the bots consume more of the available bandwidth, less is left for legitimate users trying to access the site. That means slower page load times, increased latency, and even periods of unavailability if the bandwidth is completely exhausted. The website owner may then be forced onto a more expensive hosting plan with higher bandwidth limits just to accommodate the bot traffic and keep the site accessible to real users. In essence, inconsiderate bots cost website owners money by driving up hosting costs and hurting the experience of genuine visitors.

Bot developers should be responsible and respect website owners by following crawling best practices, honouring robots.txt directives, and limiting their request rate to avoid unduly burdening the sites they crawl.


Who are they?

These bots scrape (trawl) your website to use its data for building search indexes or to look for content. Most bots, like Google's and Bing's, are very careful to scrape your site at a low rate and low priority, and indeed you can even tell both of these what hours they can and cannot scrape, but not everyone is quite so respectful.


Known Bad Bots

There have always been 'bad' bots, like Petal, but recently these have increased in number substantially. Here's a list of a few:


  • AhrefsBot/6.1 - A bot from Ahrefs, a Singapore-based SEO company, known for regular crawling that can strain server resources.
  • Amazonbot - A bot from Amazon, the global e-commerce giant, which crawls websites for product information and can consume significant bandwidth.
  • facebookexternalhit - A bot from Facebook (Meta) that fetches pages to generate link previews, doesn't respect boundaries, and can flood sites with requests.
  • Baiduspider - The web crawler for Baidu, the dominant search engine in China, which has been reported to ignore robots.txt rules and crawl excessively.
  • BLEXBot - A bot from WebMeUp, a German SEO company, that has been known to crawl websites aggressively, causing performance issues.
  • SemrushBot - A bot from Semrush, an SEO tool provider, which can consume considerable bandwidth while gathering data for its services.
  • claudebot - A crawler whose user agent points to Anthropic (it carries the contact address claudebot@anthropic.com), observed crawling websites at a very high rate and driving up server load.
  • YandexBot - The crawler for Yandex, the leading search engine in Russia, which has been reported to ignore crawl delay directives and cause bandwidth issues.
  • Bytespider - A bot from Bytedance, the Chinese company behind TikTok, which has been known to crawl websites aggressively for data collection purposes.
  • Mb2345Browser - A bot disguised as a mobile browser, originating from China, which has been observed to crawl websites excessively and consume bandwidth.
  • LieBaoFast - A lesser-known bot, possibly from China, that has been reported to crawl websites at a high rate, leading to performance issues.
  • MicroMessenger - A bot associated with WeChat, the popular Chinese messaging app, which has been known to crawl websites aggressively for content indexing.
  • Kinza - A bot related to a Japanese web browser, which has been observed to crawl websites excessively, consuming bandwidth and resources.
  • Datanyze - A bot from a US-based company that provides technographic data, known to crawl websites heavily for information gathering.
  • serpstatbot - A bot from Serpstat, an SEO platform, which has been reported to crawl websites aggressively, causing strain on server resources.
  • spaziodati - An Italian company's bot that collects data from websites, sometimes at a high rate, leading to increased bandwidth consumption.
  • OPPO A33 - A bot disguised as a mobile device from the Chinese smartphone manufacturer OPPO, known to crawl websites excessively.
  • AspiegelBot - A bot from Aspiegel Ltd, Huawei's Ireland-based subsidiary (the operator behind PetalBot), which has been observed to crawl websites aggressively.
  • PetalBot - A bot from Huawei, the Chinese telecommunications company, known to crawl websites heavily for its mobile search engine.

Just a quick look at our own access_log for the last day shows...

  • Googlebot 1314 occurrences
  • YandexBot 1168 occurrences
  • bingbot 1102 occurrences
  • AhrefsBot 863 occurrences
  • SemrushBot 688 occurrences
  • DotBot 684 occurrences
  • GPTBot 651 occurrences
  • Twitterbot 632 occurrences
  • SemrushBot-BA 536 occurrences
  • MJ12bot 495 occurrences
  • PetalBot 428 occurrences
  • Googlebot 367 occurrences
  • SeobilityBot 269 occurrences
  • SeznamBot 221 occurrences
  • Applebot 169 occurrences
  • coccocbot-web 151 occurrences
  • Googlebot-Image 127 occurrences
  • Amazonbot 108 occurrences
  • coccocbot-image 99 occurrences
  • WellKnownBot 65 occurrences
  • ZoominfoBot 62 occurrences
  • BLEXBot 54 occurrences
  • YandexImages 45 occurrences
  • PetalBot 35 occurrences
  • GPTBot 33 occurrences
  • CCBot 33 occurrences
  • AdsBot-Google 32 occurrences
  • Slack-ImgProxy 30 occurrences
  • DuckDuckBot 23 occurrences
  • Googlebot 17 occurrences
  • download bot 16 occurrences
  • Mail.RU_Bot 14 occurrences
  • ImagesiftBot 14 occurrences
  • bingbot 13 occurrences
  • bingbot 12 occurrences
  • Owler 11 occurrences
  • AdsBot-Google-Mobile 10 occurrences
  • LinkedInBot 10 occurrences
  • AwarioBot 8 occurrences
  • Pinterestbot 7 occurrences
  • MojeekBot 6 occurrences
  • DataForSeoBot 4 occurrences
  • Googlebot 4 occurrences
  • YandexFavicons 4 occurrences
  • YandexBot 4 occurrences
  • Googlebot 3 occurrences
  • 2ip bot 3 occurrences
  • SEMrushBot 3 occurrences
  • DomainStatsBot 2 occurrences
  • AdsBot-Google-Mobile 2 occurrences
  • YaK 2 occurrences
  • AndersPinkBot 2 occurrences
  • AhrefsSiteAudit 2 occurrences
  • DuckDuckBot-Https 2 occurrences
  • FullStoryBot 2 occurrences
  • ClaudeBot 1 occurrence
  • Gensparkbot 1 occurrence
  • fluid 1 occurrence
  • trendictionbot 1 occurrence

Which is a crazy amount of activity, and everyone's got a bot! You'll notice that some bots appear more than once, and that's because they crawl with different user agents, pretending to be a desktop browser, then pretending to be a mobile browser, and so on. Google, Bing and the others use this to see the two different versions of your site, if it has one, which is not common these days with responsive CSS. We have quite an active site with lots of new content and changes, so the search engines crawl most pages daily, but for a small site bad bots can be a real menace.
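
To produce a similar count from your own logs, a rough shell one-liner will do. This is just a sketch: it assumes your access_log is in the current directory (the path varies by host) and it only counts the user agents you name in the pattern, so extend the list to taste:

grep -oiE 'Googlebot|bingbot|YandexBot|AhrefsBot|SemrushBot|PetalBot|GPTBot|ClaudeBot|Bytespider' access_log | sort | uniq -c | sort -rn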

A customer of ours recently had 100GB of bandwidth quota consumed by facebookexternalhit in a few days. That's the problem right there.


Can we block them?

Yes, in most cases you can block them from your site, but this means having access to create or edit two special files, robots.txt and .htaccess, in your site's root directory.

robots.txt

robots.txt is supposed to be a way to tell bots what they can and can't crawl, for example:

User-agent: *
Disallow: /files/
User-agent: Googlebot
Disallow: /admin
User-agent: ClaudeBot
Disallow: /

In this example, we're telling everyone not to crawl /files/, Google not to crawl /admin, and ClaudeBot not to crawl anything at all. The problem with this is that only Google and Bing really take any notice; others, like ClaudeBot, simply ignore it, so we need to use another method.
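
For completeness, the better-behaved crawlers also understand a Crawl-delay directive, which asks them to pause a number of seconds between requests rather than staying out altogether. Bing honours it, Google ignores it, and the bad bots we're worried about here don't care either way, so treat it as a nice-to-have:

User-agent: bingbot
Crawl-delay: 10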

.htaccess

.htaccess is a directives file that configures your webserver to do various things, such as translating URLs, locating the error page, and performing redirects. Using this file we can add the following lines (assuming the file is empty or doesn't already exist):


# Return a 403 Forbidden response to any request whose user agent matches the list below
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} AhrefsBot/6.1|facebookexternalhit|Amazonbot|Ahrefs|Baiduspider|BLEXBot|SemrushBot|claudebot|YandexBot/3.0|Bytespider|YandexBot|Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Datanyze|serpstatbot|spaziodati|OPPO\sA33|AspiegelBot|PetalBot [NC]
RewriteRule ^ - [F]

This code looks for clues in the HTTP_USER_AGENT, compares it to a list of known bad bots, and then instead of returning the page, it returns a Forbidden (403) response code. After a series of these, the bad bot should give up and go and wreck someone else's bandwidth allowance. This list isn't exhaustive, but it's a good start. You can amend it from time to time by checking your weblogs and picking out the bad bots that are causing you issues.
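
A quick way to check the rule is working is to request a page while pretending to be one of the listed bots. As a sketch, using curl against your own domain (www.example.com here is a placeholder), the first request should come back 403 Forbidden while a normal request still returns 200:

curl -I -A "ClaudeBot/1.0" https://www.example.com/
curl -I https://www.example.com/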

It's worth pointing out that some webservers may not take any notice of .htaccess. They should, and it's the default behaviour, but some restrictive providers configure their servers to ignore it, and you might need to contact them to get it enabled, or change provider.
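
For reference, the server-side setting that controls whether .htaccess is read at all is AllowOverride, which lives in the Apache virtual host or directory configuration, not in anything you can edit from shared hosting. What you're asking your provider to have in place looks roughly like this (the path is illustrative):

<Directory /var/www/yoursite>
    AllowOverride All
</Directory>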

Your Logfiles

Your hosting provider *should* allow you access to your logfiles. If you're a GEN customer using shared private hosting, you'll find this in the left-hand menu under Logs and Reports, and it's called "Apache Access Log".

Within this file, in its default configuration, you will be able to see requests coming into your website. A line of this file will look something like:


18.188.60.124 - - [18/May/2024:11:54:44 +0100] "GET /doku.php/archive/alpha_microsystems HTTP/2.0" 200 245 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"

Looking at the line above, we can see the IP address on the left (18.188.60.124), the date/time (18/May/2024:11:54:44 +0100), the page requested (/doku.php/archive/alpha_microsystems), the response code (200 = OK), and finally the HTTP_USER_AGENT (Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)).

We can see here that this request is indeed from a bad bot, ClaudeBot, which is well known for hitting websites from multiple IP addresses and, in some cases, exhausting the webserver's resources and causing an outage.
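
You can see that multi-address behaviour for yourself by pulling out every IP a given user agent has used. A rough one-liner, assuming the default log format shown above where the IP is the first field:

grep -i "ClaudeBot" access_log | awk '{print $1}' | sort -u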

We already have ClaudeBot added to the blocklist above, but let's assume that we don't, in which case you simply extract the identifiable part, 'ClaudeBot', and stick it in the search line between | separators, like this:


        |ClaudeBot|
    

And you're now blocking this one as well. It may take a minute or two for the change to take effect, especially if your hosting provider is running a cluster, but it will take effect.
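
Continuing the pretence that ClaudeBot wasn't already listed, the amended condition would end up looking something like this (shortened here for readability; your real line carries the full list from earlier):

RewriteCond %{HTTP_USER_AGENT} facebookexternalhit|ClaudeBot|PetalBot|Bytespider [NC]
RewriteRule ^ - [F]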


Conclusion

Assuming you've done everything correctly, your website load (and traffic) should reduce. However, even though you're sending a 'Forbidden' response, that doesn't mean the bad bot is written to understand it. Most will be, but ClaudeBot, for example, will just carry on hammering your webserver with endless requests until it has had a poke at every single page, even if you send it a 403 for every request. This is just poor coding on the part of the bot developer, and there's nothing you can do about it... or is there?

You can block the IP addresses of the bad bots using a firewall. Blocking with a firewall is preferable because the requests then never even reach your server, but that all depends on what level of access you have to a 'firewall'. If you're a GEN customer and need specific IPs blocking, just raise a ticket at the HelpDesk and they'll make those changes for you; if you don't have access to a firewall, you can still do it with .htaccess, like this...


Order allow,deny
Allow from all
Deny from 123.456.789.0

Your logfiles will show the IP address of every request, and you've already extracted the HTTP_USER_AGENT in order to set up your rules in .htaccess. Now, using those same logfiles, check each line for 403 responses and extract the IP address, then add that to .htaccess on a 'Deny from' line. In our example above we're blocking 123.456.789.0 (note: I know that's not a valid IPv4 address, it's just an example, OK). In this configuration, the first directive, 'Order', states that we're going to allow first and then deny, and that order is important. On the next line we allow from all, so no restrictions, and then on the following line(s) we deny from xxx.xxx.xxx.xxx, so those addresses are denied.
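
Rather than reading the log by eye, you can let the shell do the extraction. In the default log format shown earlier the status code is the ninth field and the IP address is the first, so something like this lists the blocked offenders along with how many 403s each has collected:

awk '$9 == 403 {print $1}' access_log | sort | uniq -c | sort -rn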

Sometimes you can find lists of IPs on the internet; at the time of writing I was able to find several good resources listing IP addresses for some bad bots, but you need to fight your way through the endless AI-generated waffle to find them. Be careful with range blocking: some websites provide the IPs to block in CIDR format, something like 123.200.0.0/16, which blocks 123.200.anything.anything, i.e. 65,536 IP addresses (you can use our Subnet Tool to check ranges). If possible, stick to individual IPs or you may wind up blocking legitimate traffic unknowingly.
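
And to close the loop on the firewall option mentioned above: if you do run your own server with root access (which shared hosting won't give you), you can drop a bad bot's address before Apache ever sees it. A minimal sketch assuming a Linux box with iptables, using the ClaudeBot address from our log example:

iptables -A INPUT -s 18.188.60.124 -j DROP

The same caution about CIDR ranges applies here too.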



Comments (3)

Zazzy P · 2024-07-12 17:12 UTC
WellKnownBot! that made me laugh, defo block dat one.

A Shia Q · 2024-07-10 23:50 UTC
Perfect and finally stopped it so ty and keep up the good work!

Allan Dx · 2024-07-08 15:54 UTC
Fantastic! hammering my site for a week now and now all quiet again.


