ClaudeBot and a Pandemic of Inconsiderate Coding

The Curious Codex


2024-05-18 Published
2024-07-12 Updated
1992 Words, 10 Minute Read

The Author
GEN UK Blog

By Matt (Virtualisation)

Matt has been with the firm since 2015.

 

Introduction


Inconsiderate bots can have a significant negative impact on websites by consuming excessive bandwidth and slowing down site performance. When bots aggressively crawl a website, making numerous requests in a short period of time, they strain server resources and eat into the site's allocated bandwidth. This is especially problematic for smaller websites on limited hosting plans with strict bandwidth allowances.

As the bots consume more of the available bandwidth, less is left for legitimate users trying to access the site. That means slower page load times, increased latency, and even periods of unavailability if the bandwidth is completely exhausted. The website owner may then be forced onto a more expensive hosting plan with higher bandwidth limits just to accommodate the bot traffic and keep the site accessible to real users. In essence, inconsiderate bots cost website owners money by driving up hosting costs and hurting the experience of genuine visitors.

Bot developers should be responsible and respect website owners by following crawling best practices, honouring robots.txt directives, and limiting their request rate to avoid unduly burdening the sites they crawl.


Who are they?

These bots scrape (trawl) your website to use its data for building search indexes or to look for content. Most bots, like Google's and Bing's, are very careful to scrape your site at a low rate and low priority, and indeed you can even tell both of these what hours they can and cannot scrape, but not everyone is quite so respectful.


Known Bad Bots

There have always been 'bad' bots, like Petal, but recently these have increased in number substantially. Here's a list of a few:


  • AhrefsBot/6.1 - A bot from Ahrefs, a Singapore-based SEO company, known for regular crawling that can strain server resources.
  • Amazonbot - A bot from Amazon, the global e-commerce giant, which crawls websites for product information and can consume significant bandwidth.
  • facebookexternalhit - A bot from Facebook (Meta) that fetches pages to generate link previews, doesn't respect boundaries, and can flood sites with requests.
  • Baiduspider - The web crawler for Baidu, the dominant search engine in China, which has been reported to ignore robots.txt rules and crawl excessively.
  • BLEXBot - A bot from WebMeUp, a German SEO company, that has been known to crawl websites aggressively, causing performance issues.
  • SemrushBot - A bot from Semrush, an SEO tool provider, which can consume considerable bandwidth while gathering data for its services.
  • claudebot - A crawler whose user agent points to Anthropic (it carries the contact address claudebot@anthropic.com), observed crawling websites at a very high rate and driving up server load.
  • YandexBot - The crawler for Yandex, the leading search engine in Russia, which has been reported to ignore crawl delay directives and cause bandwidth issues.
  • Bytespider - A bot from Bytedance, the Chinese company behind TikTok, which has been known to crawl websites aggressively for data collection purposes.
  • Mb2345Browser - A bot disguised as a mobile browser, originating from China, which has been observed to crawl websites excessively and consume bandwidth.
  • LieBaoFast - A lesser-known bot, possibly from China, that has been reported to crawl websites at a high rate, leading to performance issues.
  • MicroMessenger - A bot associated with WeChat, the popular Chinese messaging app, which has been known to crawl websites aggressively for content indexing.
  • Kinza - A bot related to a Japanese web browser, which has been observed to crawl websites excessively, consuming bandwidth and resources.
  • Datanyze - A bot from a US-based company that provides technographic data, known to crawl websites heavily for information gathering.
  • serpstatbot - A bot from Serpstat, an SEO platform, which has been reported to crawl websites aggressively, causing strain on server resources.
  • spaziodati - An Italian company's bot that collects data from websites, sometimes at a high rate, leading to increased bandwidth consumption.
  • OPPO A33 - A bot disguised as a mobile device from the Chinese smartphone manufacturer OPPO, known to crawl websites excessively.
  • AspiegelBot - A bot from Aspiegel Ltd, Huawei's Ireland-based subsidiary (the operator behind PetalBot), which has been observed to crawl websites aggressively.
  • PetalBot - A bot from Huawei, the Chinese telecommunications company, known to crawl websites heavily for its mobile search engine.

Just a quick look at our own access_log for the last day shows...

  • Googlebot 1314 occurrences
  • YandexBot 1168 occurrences
  • bingbot 1102 occurrences
  • AhrefsBot 863 occurrences
  • SemrushBot 688 occurrences
  • DotBot 684 occurrences
  • GPTBot 651 occurrences
  • Twitterbot 632 occurrences
  • SemrushBot-BA 536 occurrences
  • MJ12bot 495 occurrences
  • PetalBot 428 occurrences
  • Googlebot 367 occurrences
  • SeobilityBot 269 occurrences
  • SeznamBot 221 occurrences
  • Applebot 169 occurrences
  • coccocbot-web 151 occurrences
  • Googlebot-Image 127 occurrences
  • Amazonbot 108 occurrences
  • coccocbot-image 99 occurrences
  • WellKnownBot 65 occurrences
  • ZoominfoBot 62 occurrences
  • BLEXBot 54 occurrences
  • YandexImages 45 occurrences
  • PetalBot 35 occurrences
  • GPTBot 33 occurrences
  • CCBot 33 occurrences
  • AdsBot-Google 32 occurrences
  • Slack-ImgProxy 30 occurrences
  • DuckDuckBot 23 occurrences
  • Googlebot 17 occurrences
  • download bot 16 occurrences
  • Mail.RU_Bot 14 occurrences
  • ImagesiftBot 14 occurrences
  • bingbot 13 occurrences
  • bingbot 12 occurrences
  • Owler 11 occurrences
  • AdsBot-Google-Mobile 10 occurrences
  • LinkedInBot 10 occurrences
  • AwarioBot 8 occurrences
  • Pinterestbot 7 occurrences
  • MojeekBot 6 occurrences
  • DataForSeoBot 4 occurrences
  • Googlebot 4 occurrences
  • YandexFavicons 4 occurrences
  • YandexBot 4 occurrences
  • Googlebot 3 occurrences
  • 2ip bot 3 occurrences
  • SEMrushBot 3 occurrences
  • DomainStatsBot 2 occurrences
  • AdsBot-Google-Mobile 2 occurrences
  • YaK 2 occurrences
  • AndersPinkBot 2 occurrences
  • AhrefsSiteAudit 2 occurrences
  • DuckDuckBot-Https 2 occurrences
  • FullStoryBot 2 occurrences
  • ClaudeBot 1 occurrence
  • Gensparkbot 1 occurrence
  • fluid 1 occurrence
  • trendictionbot 1 occurrence

Which is a crazy amount of activity, and everyone's got a bot! You'll notice that some bots appear more than once, and that's because they crawl with different user agents, pretending to be a desktop browser, then pretending to be a mobile browser, and so on. Google, Bing and the others use this to see the two different versions of your site, if it has one, which is not common these days with responsive CSS. We have quite an active site with lots of new content and changes, so the search engines crawl most pages daily, but for a small site bad bots can be a real menace.
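
To produce a similar count from your own logs, a rough shell one-liner will do. This is just a sketch: it assumes your access_log is in the current directory (the path varies by host) and it only counts the user agents you name in the pattern, so extend the list to taste:

grep -oiE 'Googlebot|bingbot|YandexBot|AhrefsBot|SemrushBot|PetalBot|GPTBot|ClaudeBot|Bytespider' access_log | sort | uniq -c | sort -rn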

A customer of ours recently had 100GB of bandwidth quota consumed by facebookexternalhit in a few days. That's the problem right there.


Can we block them?

Yes, in most cases you can block them from your site, but this means having access to create or edit two special files, robots.txt and .htaccess, in your site's root directory.

robots.txt

robots.txt is supposed to be a way to tell bots what they can and can't crawl, for example:

User-agent: *
Disallow: /files/
User-agent: Googlebot
Disallow: /admin
User-agent: ClaudeBot
Disallow: /

In this example, we're telling everyone not to crawl /files/, Google not to crawl /admin, and ClaudeBot not to crawl anything at all. The problem with this is that only Google and Bing really take any notice; others, like ClaudeBot, simply ignore it, so we need to use another method.
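
For completeness, the better-behaved crawlers also understand a Crawl-delay directive, which asks them to pause a number of seconds between requests rather than staying out altogether. Bing honours it, Google ignores it, and the bad bots we're worried about here don't care either way, so treat it as a nice-to-have:

User-agent: bingbot
Crawl-delay: 10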

.htaccess

.htaccess is a directives file that configures your webserver to do various things, such as translating URLs, locating the error page, and performing redirects. Using this file we can add the following lines (assuming the file is empty or doesn't already exist):


# Return a 403 Forbidden response to any request whose user agent matches the list below
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} AhrefsBot/6.1|facebookexternalhit|Amazonbot|Ahrefs|Baiduspider|BLEXBot|SemrushBot|claudebot|YandexBot/3.0|Bytespider|YandexBot|Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Datanyze|serpstatbot|spaziodati|OPPO\sA33|AspiegelBot|PetalBot [NC]
RewriteRule ^ - [F]

This code looks for clues in the HTTP_USER_AGENT, compares it to a list of known bad bots, and then instead of returning the page, it returns a Forbidden (403) response code. After a series of these, the bad bot should give up and go and wreck someone else's bandwidth allowance. This list isn't exhaustive, but it's a good start. You can amend it from time to time by checking your weblogs and picking out the bad bots that are causing you issues.
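
A quick way to check the rule is working is to request a page while pretending to be one of the listed bots. As a sketch, using curl against your own domain (www.example.com here is a placeholder), the first request should come back 403 Forbidden while a normal request still returns 200:

curl -I -A "ClaudeBot/1.0" https://www.example.com/
curl -I https://www.example.com/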

It's worth pointing out that some webservers may not take any notice of .htaccess. They should, and it's the default behaviour, but some restrictive providers configure their servers to ignore it, and you might need to contact them to get it enabled, or change provider.
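
For reference, the server-side setting that controls whether .htaccess is read at all is AllowOverride, which lives in the Apache virtual host or directory configuration, not in anything you can edit from shared hosting. What you're asking your provider to have in place looks roughly like this (the path is illustrative):

<Directory /var/www/yoursite>
    AllowOverride All
</Directory>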

Your Logfiles

Your hosting provider *should* allow you access to your logfiles. If you're a GEN customer using shared private hosting, you'll find this in the left-hand menu under Logs and Reports, and it's called "Apache Access Log".

Within this file, in its default configuration, you will be able to see requests coming into your website. A line of this file will look something like:


18.188.60.124 - - [18/May/2024:11:54:44 +0100] "GET /doku.php/archive/alpha_microsystems HTTP/2.0" 200 245 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"

Looking at the line above, we can see the IP address on the left (18.188.60.124), the date/time (18/May/2024:11:54:44 +0100), the page requested (/doku.php/archive/alpha_microsystems), the response code (200 = OK), and finally the HTTP_USER_AGENT (Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)).

We can see here that this request is indeed from a bad bot, ClaudeBot, which is well known for hitting websites from multiple IP addresses and, in some cases, exhausting the webserver's resources and causing an outage.
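
You can see that multi-address behaviour for yourself by pulling out every IP a given user agent has used. A rough one-liner, assuming the default log format shown above where the IP is the first field:

grep -i "ClaudeBot" access_log | awk '{print $1}' | sort -u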

We already have ClaudeBot added to the blocklist above, but let's assume that we don't, in which case you simply extract the identifiable part, 'ClaudeBot', and stick it in the search line between | separators, like this:


        |ClaudeBot|
    

And you're now blocking this one as well. It may take a minute or two for the change to take effect, especially if your hosting provider is running a cluster, but it will take effect.
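
Continuing the pretence that ClaudeBot wasn't already listed, the amended condition would end up looking something like this (shortened here for readability; your real line carries the full list from earlier):

RewriteCond %{HTTP_USER_AGENT} facebookexternalhit|ClaudeBot|PetalBot|Bytespider [NC]
RewriteRule ^ - [F]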


Conclusion

Assuming you've done everything correctly, your website load (and traffic) should reduce. However, even though you're sending a 'Forbidden' response, that doesn't mean the bad bot is written to understand it. Most will be, but ClaudeBot, for example, will just carry on hammering your webserver with endless requests until it has had a poke at every single page, even if you send it a 403 for every request. This is just poor coding on the part of the bot developer, and there's nothing you can do about it... or is there?

You can block the IP addresses of the bad bots using a firewall. Blocking with a firewall is preferable because the requests then never even reach your server, but that all depends on what level of access you have to a 'firewall'. If you're a GEN customer and need specific IPs blocking, just raise a ticket at the HelpDesk and they'll make those changes for you; if you don't have access to a firewall, you can still do it with .htaccess, like this...


Order allow,deny
Allow from all
Deny from 123.456.789.0

Your logfiles will show the IP address of every request, and you've already extracted the HTTP_USER_AGENT in order to set up your rules in .htaccess. Now, using those same logfiles, check each line for 403 responses and extract the IP address, then add that to .htaccess on a 'Deny from' line. In our example above we're blocking 123.456.789.0 (note: I know that's not a valid IPv4 address, it's just an example, OK). In this configuration, the first directive, 'Order', states that we're going to allow first and then deny, and that order is important. On the next line we allow from all, so no restrictions, and then on the following line(s) we deny from xxx.xxx.xxx.xxx, so those addresses are denied.
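
Rather than reading the log by eye, you can let the shell do the extraction. In the default log format shown earlier the status code is the ninth field and the IP address is the first, so something like this lists the blocked offenders along with how many 403s each has collected:

awk '$9 == 403 {print $1}' access_log | sort | uniq -c | sort -rn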

Sometimes you can find lists of IPs on the internet; at the time of writing I was able to find several good resources listing IP addresses for some bad bots, but you need to fight your way through the endless AI-generated waffle to find them. Be careful with range blocking: some websites provide the IPs to block in CIDR format, something like 123.200.0.0/16, which blocks 123.200.anything.anything, i.e. 65,536 IP addresses (you can use our Subnet Tool to check ranges). If possible, stick to individual IPs or you may wind up blocking legitimate traffic unknowingly.
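
And to close the loop on the firewall option mentioned above: if you do run your own server with root access (which shared hosting won't give you), you can drop a bad bot's address before Apache ever sees it. A minimal sketch assuming a Linux box with iptables, using the ClaudeBot address from our log example:

iptables -A INPUT -s 18.188.60.124 -j DROP

The same caution about CIDR ranges applies here too.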



Comments (3)

Zazzy P · 2024-07-12 17:12 UTC
WellKnownBot! that made me laugh, defo block dat one.

A Shia Q · 2024-07-10 23:50 UTC
Perfect and finally stopped it so ty and keep up the good work!

Allan Dx · 2024-07-08 15:54 UTC
Fantastic! hammering my site for a week now and now all quiet again.


