Blocking abusive webcrawlers

People often talk about how bad AI is for the environment, but they tend to focus only on the operation of the LLMs themselves, ignoring the much larger impact of what the AI scrapers are doing: not only do those take massive amounts of energy and bandwidth to run, but they’re also driving up the server requirements and bandwidth utilization of every single website operator on the planet. That makes it everyone’s problem, since everyone ends up having to foot the bill. It’s asinine and disgusting.

At one point, fully 94% of all of my web traffic was coming from a single botnet like this. These bots do not respect robots.txt or the nofollow link rels that I put on my site to prevent robots from getting stuck in a trap of navigating every single tag combination on my site, and it’s ridiculous just how many resources — both mine and theirs — are constantly being wasted like this.

I’ve been using the nginx ultimate bad bot blocker to subscribe to lists of known bots, but this one particular botnet (which operates on Alibaba’s subnets) has gotten ridiculous, and enough is enough.

So, I finally did something I should have done ages ago: I set up UFW with some basic rules.

UPDATE: This article has started getting linked to from elsewhere (including Coyote’s excellent article about the problem), but I no longer use this approach for blocking crawlers as it’s become completely ineffective thanks to the crawlers now behaving like a massive DDoS. These days I’m using a combination of gated access, sentience checks, and, unfortunately, CloudFlare.

Initial setup

First, you need to make sure that you’re allowing in the traffic that you do want; in particular, on a remote machine, make sure the ssh rule is in place before you run ufw enable, or you risk locking yourself out. Here are the allowlist rules I have set up (run these commands as root or under sudo):

ufw allow ssh
ufw allow 80,443/tcp
ufw allow finger
ufw allow 60000:61000/udp  # mosh

ufw enable

After this, ufw status should look something like this:

Status: active

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
60000:61000/udp            ALLOW       Anywhere
80,443/tcp                 ALLOW       Anywhere
79/tcp                     ALLOW       Anywhere
79/tcp (v6)                ALLOW       Anywhere (v6)
80,443/tcp (v6)            ALLOW       Anywhere (v6)
22/tcp (v6)                ALLOW       Anywhere (v6)
60000:61000/udp (v6)       ALLOW       Anywhere (v6)

Adding bad hosts

When a connection comes in, it gets allowed or rejected by the first matching rule. As a result, if you want to block anything, it’ll have to come before the other rules. You can add a higher-priority rule with the command ufw insert 1.
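For example, this would wedge a deny rule in ahead of everything else (using the documentation-reserved 203.0.113.0/24 subnet as a stand-in for a real offender):

ufw insert 1 deny from 203.0.113.0/24 to any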

But first, it’s good to know which hosts to actually add! Here is how I found the hosts that were participating in the most recent protracted attack:

$ cut -f1 -d\  /var/log/nginx/access.log | cut -f1-2 -d. | sort | uniq -c | sort -n | tail -n 15
    602 157.55
    661 4.231
    813 85.208
    879 144.76
   1081 52.167
   1498 111.62
   1542 40.77
   5031 96.126
   5201 66.249
   9231 2600:3c01::f03c:91ff:fedf:b0de
  13795 37.27
  45815 47.242
  45898 8.218
 141104 47.76
 427121 8.210

This command parses out all of the IP addresses from the server access log, collapses each one down to the first two octets (i.e. its /16 subnet), and then counts how many accesses came from each subnet. (IPv6 addresses contain no dots, so cut passes them through unchanged, which is why the one IPv6 host shows up in full.) From here you can spot-check what the top offending subnets are actually up to:

$ grep ^37.27 /var/log/nginx/access.log | tail -n 3
37.27.51.144 - - [05/Mar/2025:16:26:21 -0800] "GET /blog/?id=249&tag=feed-on-feeds&tag=import HTTP/1.1" 500 784 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)" "publ.beesbuzz.biz" http
37.27.51.144 - - [05/Mar/2025:16:26:25 -0800] "GET /toc/1886-TOC-with-postprocessing HTTP/1.1" 444 0 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)" "beesbuzz.biz" https
37.27.51.144 - - [05/Mar/2025:16:26:27 -0800] "GET /blog/?id=249&tag=feed-on-feeds&tag=mt2publ HTTP/1.1" 500 786 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)" "publ.beesbuzz.biz" http
$ grep ^47.242 /var/log/nginx/access.log | tail -n 3
47.242.217.111 - - [05/Mar/2025:16:26:42 -0800] "HEAD /blog/?id=6615&tag=bandcamp-friday HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3447.27 Safari/537.36" "beesbuzz.biz" https
47.242.217.111 - - [05/Mar/2025:16:26:42 -0800] "HEAD /blog/?id=589&tag=animals HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3112.101 Safari/537.36" "beesbuzz.biz" http
47.242.217.111 - - [05/Mar/2025:16:26:43 -0800] "HEAD /blog/?id=8540&tag=smart-house HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.2567.173 Safari/537.36" "beesbuzz.biz" http
$ grep ^8.218 /var/log/nginx/access.log | tail -n 3
8.218.91.49 - - [05/Mar/2025:16:27:02 -0800] "HEAD /blog/?id=9644&tag=mozilla HTTP/1.1" 444 0 "-" "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.2945.58 Safari/537.36" "beesbuzz.biz" http
8.218.91.49 - - [05/Mar/2025:16:27:02 -0800] "HEAD /blog/?id=8689&tag=comments HTTP/1.1" 444 0 "-" "Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3872.74 Safari/537.36" "beesbuzz.biz" https
8.218.91.49 - - [05/Mar/2025:16:27:03 -0800] "HEAD /blog/?id=8083&tag=gaming HTTP/1.1" 444 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.2342.157 Safari/537.36" "beesbuzz.biz" http
$ grep ^47.76 /var/log/nginx/access.log | tail -n 3
47.76.222.244 - - [05/Mar/2025:16:20:46 -0800] "HEAD /blog/?id=1524&tag=streaming HTTP/1.1" 444 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3262.152 Safari/537.36" "beesbuzz.biz" http
47.76.209.138 - - [05/Mar/2025:16:20:47 -0800] "GET /blog/?id=1128&tag=origin-story HTTP/1.1" 444 0 "-" "Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3395.185 Safari/537.36" "beesbuzz.biz" https
47.76.99.127 - - [05/Mar/2025:16:20:48 -0800] "GET /blog/?id=9982&tag=ios HTTP/1.1" 444 0 "-" "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3826.99 Safari/537.36" "beesbuzz.biz" http
$ grep ^8.210 /var/log/nginx/access.log | tail -n 3
8.210.189.26 - - [05/Mar/2025:16:20:48 -0800] "HEAD /blog/?id=365&tag=rambles HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.2932.180 Safari/537.36" "beesbuzz.biz" http
8.210.179.35 - - [05/Mar/2025:16:20:48 -0800] "GET /blog/?id=1128&tag=origin-story HTTP/1.1" 200 9013 "-" "Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3395.185 Safari/537.36" "beesbuzz.biz" https
8.210.179.35 - - [05/Mar/2025:16:20:49 -0800] "GET /blog/?id=98&tag=speedrunning HTTP/1.1" 200 3324 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3678.171 Safari/537.36" "beesbuzz.biz" https

So, we can see that the 37.27 subnet is at least a legitimate-looking crawler: it declares itself to be a bot and doesn’t completely destroy the site (although its behavior could be better, since it isn’t respecting my nofollow rels). For now I give it a pass, since it’s only hitting my site every few seconds.
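If I ever do want it gone, the polite-channel option would be a robots.txt rule, assuming the bot honors robots.txt the way a self-declared crawler should:

User-agent: BLEXBot
Disallow: /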

But the other four are all coming from the same crawler, which pretends to be a browser and issues multiple requests per second. Plus, combined they’re taking up 94% of my server traffic and pegging my CPU. So this one needs to go the hell away.

For looking up who owns an IP address, there’s the whois command; there are many web-based frontends for it out there, but your Linux or macOS system will have it on the command line.

$ whois 47.242.217.111
[...]
NetRange:       47.235.0.0 - 47.246.255.255
CIDR:           47.236.0.0/14, 47.240.0.0/14, 47.246.0.0/16, 47.244.0.0/15, 47.235.0.0/16
NetName:        AL-3
NetHandle:      NET-47-235-0-0-1
Parent:         NET47 (NET-47-0-0-0-0)
NetType:        Direct Allocation
OriginAS:
Organization:   Alibaba Cloud LLC (AL-3)
RegDate:        2016-04-15
Updated:        2017-04-26
Ref:            https://rdap.arin.net/registry/ip/47.235.0.0
[...]

So this tells us that the IP address in question belongs to Alibaba Cloud, and the CIDR line tells us which address ranges make up that allocation. So, blocking the whole allocation takes a series of commands, one per CIDR range:

ufw insert 1 deny from 47.236.0.0/14 to any
ufw insert 1 deny from 47.240.0.0/14 to any
ufw insert 1 deny from 47.246.0.0/16 to any
ufw insert 1 deny from 47.244.0.0/15 to any
ufw insert 1 deny from 47.235.0.0/16 to any

We can then go down the list of other IP addresses and do the same process to find their subnets and block them:

$ whois 8.218.91.49 | grep CIDR
CIDR:           8.208.0.0/12
$ whois 47.76.222.244 | grep CIDR
CIDR:           47.74.0.0/15, 47.80.0.0/13, 47.76.0.0/14
$ whois 8.210.189.26 | grep CIDR
CIDR:           8.208.0.0/12

So this gives us a few more subnets to block:

ufw insert 1 deny from 8.208.0.0/12 to any
ufw insert 1 deny from 47.74.0.0/15 to any
ufw insert 1 deny from 47.80.0.0/13 to any
ufw insert 1 deny from 47.76.0.0/14 to any

After all this, here’s what my ufw rule list looks like:

$ ufw status numbered
Status: active

     To                         Action      From
     --                         ------      ----
[ 1] Anywhere                   DENY IN     8.208.0.0/12
[ 2] Anywhere                   DENY IN     47.235.0.0/16
[ 3] Anywhere                   DENY IN     47.244.0.0/15
[ 4] Anywhere                   DENY IN     47.246.0.0/16
[ 5] Anywhere                   DENY IN     47.240.0.0/14
[ 6] Anywhere                   DENY IN     47.236.0.0/14
[ 7] Anywhere                   DENY IN     47.80.0.0/13
[ 8] Anywhere                   DENY IN     47.74.0.0/15
[ 9] Anywhere                   DENY IN     47.76.0.0/14
[10] Anywhere                   DENY IN     8.210.0.0/16
[11] 22/tcp                     ALLOW IN    Anywhere
[12] 60000:61000/udp            ALLOW IN    Anywhere
[13] 80,443/tcp                 ALLOW IN    Anywhere
[14] 79/tcp                     ALLOW IN    Anywhere
[15] 79/tcp (v6)                ALLOW IN    Anywhere (v6)
[16] 80,443/tcp (v6)            ALLOW IN    Anywhere (v6)
[17] 22/tcp (v6)                ALLOW IN    Anywhere (v6)
[18] 60000:61000/udp (v6)       ALLOW IN    Anywhere (v6)

And that has stopped Alibaba’s awful AI scraper from hammering my poor little server for now. My CPU utilization has dropped from an average of 120% to a much more reasonable 4%, and everything is much more responsive.
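If you want to double-check that a block has taken, you can watch the live access log for any further hits from a blocked range; for example, this should print nothing new for a range covered by the deny rules (adjust the prefix to whatever you blocked):

tail -f /var/log/nginx/access.log | grep '^47\.76\.'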

Automating the process

Here is a simple script you can use to take the nuclear option for a badly-behaved serial offender:

block-subnet.sh
#!/bin/sh

# Blocks the entire subnet allocation for each IP address given
# Usage: block-subnet.sh ip [ip ...]

for ip in "$@"; do
    # Pull the CIDR ranges out of the whois record, one per line
    whois "$ip" | grep '^CIDR:' | cut -f2- -d: | tr ',' '\n'
done | sort -u | while read -r subnet; do
    echo "Blocking subnet $subnet"
    ufw insert 1 deny from "$subnet" to any
done

Run it with block-subnet.sh [list of ip addresses].
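For example, feeding it two of the offenders identified above (as root or under sudo, like the other ufw commands):

block-subnet.sh 8.218.91.49 47.76.222.244

Given the whois results from earlier, that should insert deny rules for 8.208.0.0/12, 47.74.0.0/15, 47.80.0.0/13, and 47.76.0.0/14 in one shot. And then be free of this nonsense.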