Inconsiderate bots can do real damage to a website by consuming excessive bandwidth and dragging down performance. When a bot aggressively crawls a site, making many requests in a short period, it strains the server and eats into the site's bandwidth allowance. This is especially painful for smaller sites on hosting plans with strict bandwidth limits: as bots consume more of the available bandwidth, less is left for legitimate visitors, which means slower page loads, higher latency, and even periods of unavailability if the allowance is exhausted completely. The site owner may then be forced onto a more expensive hosting plan with higher bandwidth limits just to accommodate the bot traffic and keep the site accessible to real users. In essence, inconsiderate bots cost website owners money by driving up hosting costs and hurting the experience for their visitors. Bot developers should be responsible and respect website owners by following crawling best practices, honouring robots.txt directives, and limiting their request rate.
These bots scrape (trawl) your website to use its data for building search indexes, or to look for content. Most bots, like Google's and Bing's, are very careful to scrape your site at a low rate and low priority, and indeed you can even tell both of these what hours they can and cannot scrape, but not everyone is quite so respectful.
There have always been 'bad' bots, like PetalBot, but recently these have increased in number substantially. Here's a list of a few:
Just a quick look at our own access_log for the last day and I can see...
Which is a crazy amount of activity, and everyone's got a bot! You'll notice that some bots are crawling more than once, and that's because they crawl with different user agents, pretending to be a desktop browser, then pretending to be a mobile browser, and so on. Google, Bing and the like use this to see the different versions of your site, if it has them, which is not common these days with responsive CSS. We have quite an active site with lots of new content and changes, so the search engines crawl most pages daily, but for a small site bad bots can be a real menace.
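If you'd rather not eyeball the log by hand, you can tally bot activity programmatically. This is a minimal Python sketch (the keyword list and sample lines are illustrative, not from our actual logs) that counts requests per bot keyword in Apache combined-format log lines, where the User-Agent is the last quoted field:

```python
import re
from collections import Counter

# Sample keywords to look for in the User-Agent string (extend as needed)
BOT_KEYWORDS = ["bot", "spider", "crawl", "facebookexternalhit"]

def count_bots(lines):
    """Count requests per matching bot keyword in combined-format log lines."""
    counts = Counter()
    for line in lines:
        # The User-Agent is the last double-quoted field on the line
        fields = re.findall(r'"([^"]*)"', line)
        if not fields:
            continue
        ua = fields[-1].lower()
        for kw in BOT_KEYWORDS:
            if kw in ua:
                counts[kw] += 1
    return counts

sample = [
    '1.2.3.4 - - [18/May/2024:11:54:44 +0100] "GET / HTTP/2.0" 200 245 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"',
    '5.6.7.8 - - [18/May/2024:11:55:01 +0100] "GET /about HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0) Chrome/125.0"',
]
print(count_bots(sample))  # -> Counter({'bot': 1})
```

Point it at your real access_log (one line per request) and sort the counter to see who's hitting you hardest.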
A customer of ours recently had 100GB of bandwidth quota consumed by facebookexternalhit in a few days. That's the problem right there.
Yes, in most cases you can block them from your site, but this means having access to create/edit special files called robots.txt and .htaccess in your site's root directory.
robots.txt is supposed to be a way to tell bots what they can and can't crawl, for example:
User-agent: *
Disallow: /files/

User-agent: Googlebot
Disallow: /admin

User-agent: ClaudeBot
Disallow: /
In this example, we're telling everyone not to crawl /files/, Googlebot not to crawl /admin, and ClaudeBot not to crawl anything. The problem with this is that only Google and Bing really take any notice; others, like ClaudeBot, simply ignore it, so we need to use another method.
.htaccess is a directives file that configures your webserver to do various things like translate URLs, locate the error page, and perform redirects. Using this file we can add the following lines (assuming the file is empty or doesn't exist):
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} AhrefsBot/6.1|facebookexternalhit|Amazonbot|Ahrefs|Baiduspider|BLEXBot|SemrushBot|claudebot|YandexBot/3.0|Bytespider|YandexBot|Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Datanyze|serpstatbot|spaziodati|OPPO\sA33|AspiegelBot|PetalBot [NC]
RewriteRule ^ - [F]
This code looks for clues in the HTTP_USER_AGENT, compares that to a list of known bad bots, and then instead of returning the page, it returns a Forbidden response code. After a series of these, the bad bot should give up and go and wreck someone else's bandwidth allowance. This list isn't exhaustive, but it's a good start. You can amend it from time to time by checking your weblogs and picking out the bad bots that are causing you issues.
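Before deploying, it's worth sanity-checking the alternation against real user-agent strings. This is a minimal Python sketch using a trimmed subset of the list above (RewriteCond does a case-insensitive substring regex match when [NC] is set, which re.IGNORECASE plus search() approximates):

```python
import re

# A trimmed subset of the RewriteCond alternation above ([NC] = case-insensitive)
BADBOT_PATTERN = re.compile(
    r"AhrefsBot/6\.1|facebookexternalhit|Amazonbot|Ahrefs|Baiduspider|"
    r"BLEXBot|SemrushBot|claudebot|Bytespider|YandexBot|PetalBot",
    re.IGNORECASE,
)

def is_blocked(user_agent):
    # RewriteCond matches anywhere in the string, not the full string
    return bool(BADBOT_PATTERN.search(user_agent))

print(is_blocked("Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"))  # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Chrome/125.0"))  # False
```

If a legitimate browser string comes back True, your pattern is too broad and needs tightening before it goes live.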
It's worth pointing out that some webservers may not take any notice of .htaccess. They should, and it's the default behaviour, but some restrictive providers configure their servers to ignore it, and you might need to contact them to get it enabled, or change provider.
Your hosting provider *should* allow you access to your logfiles. If you're a GEN customer using shared private hosting, then you'll find this in the left-hand menu under Logs and Reports, and it's called "Apache Access Log".
Within this file, in its default configuration, you will be able to see requests coming into your website. A line of this file will look something like:
18.188.60.124 - - [18/May/2024:11:54:44 +0100] "GET /doku.php/archive/alpha_microsystems HTTP/2.0" 200 245 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
Looking at the line above, we can see the IP address on the left (18.188.60.124), the date/time (18/May/2024:11:54:44 +0100), the page requested (/doku.php/archive/alpha_microsystems), the response code (200 = OK), and finally the HTTP_USER_AGENT (Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)).
We can see here that this request is indeed from a bad bot, ClaudeBot, which is well known for hitting websites from multiple IP addresses and, in some cases, exhausting the webserver's resources and causing an outage.
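If you want to pull those fields out programmatically rather than by eye, a regular expression does the job. This is a minimal Python sketch for the combined log format laid out above (simplified, and not robust to every log variation):

```python
import re

# Simplified regex for the Apache combined log format:
# IP, identd, user, [timestamp], "request", status, bytes, "referer", "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

line = ('18.188.60.124 - - [18/May/2024:11:54:44 +0100] '
        '"GET /doku.php/archive/alpha_microsystems HTTP/2.0" 200 245 "-" '
        '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
        'ClaudeBot/1.0; +claudebot@anthropic.com)"')

m = LOG_RE.match(line)
print(m.group("ip"))      # 18.188.60.124
print(m.group("status"))  # 200
print("ClaudeBot" in m.group("ua"))  # True
```

Once each line is broken into named fields, filtering by user agent or response code becomes a one-liner.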
We already have ClaudeBot in the blocklist above, but let's assume we don't, in which case you simply extract the identifiable part, 'ClaudeBot', and stick it in the RewriteCond line between | and |, like this:
|ClaudeBot|
And you'll now be blocking this as well. It may take a minute or two for the changes to take effect, especially if your hosting provider is running a cluster, but it will take effect.
Assuming you've done everything correctly, your website load (and traffic) should reduce. However, even though you're sending a 'Forbidden' response, it doesn't mean the bad bot is written to understand it. Most will be, but ClaudeBot, for example, will just carry on hammering your webserver with endless requests until it's had a poke at every single page, even if you send it a 403 for every request. This is just poor coding on the part of the bot developer, and there's nothing you can do about it... or is there?
You can block the IP addresses of the bad bots using a firewall. Blocking with a firewall is preferable because then the requests won't even reach your servers, but that all depends on what level of access you have to a 'firewall'. If you're a GEN customer and need specific IPs blocking, then just raise a ticket at the HelpDesk and they'll make those changes for you. If not, and you don't have access to a firewall, you can still do it with .htaccess, like this...
Order allow,deny
Allow from all
Deny from 123.456.789.0
Your logfiles will show the IP address of every request, and you've already extracted the HTTP_USER_AGENT to set up your rules in .htaccess. Now, using those same logfiles, check each line for 403 responses and extract the IP address. Add that to the list in .htaccess on a Deny from line. In our example above we're blocking 123.456.789.0 (note: I know that's not a valid IPv4 address, it's an example, ok). In this configuration, the first directive, 'Order', states that we're going to process Allow first, then Deny, and this is important. On the next line we allow from all, so no restrictions, then on the following line(s) we deny from xxx.xxx.xxx.xxx, so those addresses are denied.
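The extraction step above can be automated. This is a minimal Python sketch (the helper name and sample lines are illustrative, assuming the default combined log format) that pulls the IP out of every 403 line and prints ready-made Deny from entries:

```python
import re
from collections import Counter

def denied_ips(log_lines, min_hits=1):
    """Collect IPs that received 403 (Forbidden) responses from combined-format log lines."""
    hits = Counter()
    for line in log_lines:
        # IP is the first field; status is the three-digit code after the quoted request
        m = re.match(r'(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ', line)
        if m and m.group(2) == "403":
            hits[m.group(1)] += 1
    return [ip for ip, n in hits.items() if n >= min_hits]

sample = [
    '18.188.60.124 - - [18/May/2024:11:54:44 +0100] "GET /a HTTP/2.0" 403 199 "-" "ClaudeBot/1.0"',
    '203.0.113.9 - - [18/May/2024:11:55:00 +0100] "GET /b HTTP/1.1" 200 512 "-" "Chrome/125.0"',
]
for ip in denied_ips(sample):
    print(f"Deny from {ip}")  # prints: Deny from 18.188.60.124
```

Raising min_hits filters out one-off 403s so you only block persistent offenders.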
Sometimes you can find lists of IPs on the internet; at the time of writing I was able to find several good resources listing IP addresses for some bad bots, but you need to fight your way through the endless AI-generated waffle to find them. Be careful with range blocking: some websites provide lists of IPs to block in CIDR format, something like 123.200.0.0/16, in which case we'd be blocking 123.200.anything.anything, which is 65,536 IP addresses (you can use our Subnet Tool to check ranges). If possible, stick to individual IPs, or you may wind up blocking legitimate traffic unknowingly.
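As a quick sanity check before you deny a range (separate from our Subnet Tool), Python's standard ipaddress module will tell you exactly how many addresses a CIDR block covers:

```python
import ipaddress

# Check how many addresses a CIDR block really covers before you deny it
for cidr in ["123.200.0.0/16", "123.200.4.0/24", "123.200.4.17/32"]:
    net = ipaddress.ip_network(cidr)
    print(f"{cidr} covers {net.num_addresses} address(es)")
# 123.200.0.0/16 covers 65536 address(es)
# 123.200.4.0/24 covers 256 address(es)
# 123.200.4.17/32 covers 1 address(es)
```

A /32 is a single address, so when in doubt, list offenders individually rather than reaching for a wide prefix.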
Zazzy P
· 2024-07-12 17:12 UTC
WellKnownBot! that made me laugh, defo block dat one.
A Shia Q
· 2024-07-10 23:50 UTC
Perfect and finally stopped it so ty and keep up the good work!
Allan Dx
· 2024-07-08 15:54 UTC
Fantastic! hammering my site for a week now and now all quiet again.
--- This content is not legal or financial advice & Solely the opinions of the author ---