Blocking abusive webcrawlers
People often talk about how bad AI is for the environment, but they focus only on the operation of the LLMs themselves. They seem to ignore the much larger impact of what the AI scrapers are doing: not only do those take massive amounts of energy and bandwidth to run, but they're also impacting every single website operator on the planet by increasing their server requirements and bandwidth utilization. And that makes it everyone's problem, since everyone ends up having to foot the bill. It's asinine and disgusting.
At one point, fully 94% of my web traffic was coming from a single botnet like this. These bots respect neither robots.txt nor the nofollow link rels I added to keep crawlers from getting stuck in a trap of navigating every single tag combination on my site, and it's ridiculous just how many resources, both mine and theirs, are constantly being wasted like this.
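For context, this is roughly the sort of directive these crawlers blow straight past (the /tags/ path is a placeholder here, not my actual URL structure):

```
# robots.txt: asks crawlers to stay out of the tag-combination pages
User-agent: *
Disallow: /tags/
```

The tag links themselves also carry rel="nofollow", which is supposed to tell crawlers not to bother following them. These bots ignore both.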
I’ve been using the nginx ultimate bad bot blocker to subscribe to lists of known bots, but this one particular botnet (which operates on Alibaba’s subnets) has gotten ridiculous, and enough is enough.
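The blocker itself ships as generated include files plus an update script, so I won't reproduce its config here, but the underlying mechanism is the same as a hand-rolled nginx user-agent blacklist. A minimal sketch of that idea (with made-up user-agent patterns, not the blocker's real lists) looks like this, placed in the http context (e.g. a conf.d include):

```nginx
# Not the bad bot blocker's actual config, just a sketch of the mechanism:
# match known-bad user agents and refuse to serve them.
map $http_user_agent $bad_bot {
    default          0;
    ~*BadBotExample  1;  # placeholder patterns; real lists are much longer
    ~*AnotherScraper 1;
}

server {
    listen 80;
    server_name example.com;

    if ($bad_bot) {
        return 444;  # nginx-specific: close the connection without a response
    }
}
```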
So, I finally did something I should have done ages ago, and set up UFW with some basic firewall rules.
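The rules themselves are nothing fancy: just denying the offending subnets outright at the firewall, ahead of the usual allow rules. A sketch of what that looks like, using a documentation CIDR range as a stand-in for the actual subnets (pull the real ones from your access logs and whois):

```sh
# Deny the offending subnet, inserting at position 1 so the rule is
# evaluated before any existing allow rules.
# 203.0.113.0/24 is a documentation range used purely as a placeholder.
sudo ufw insert 1 deny from 203.0.113.0/24 to any

# Verify where the rule landed
sudo ufw status numbered
```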
UPDATE: This article has started getting linked to from elsewhere (including Coyote’s excellent article about the problem), but I no longer use this approach for blocking crawlers as it’s become completely ineffective thanks to the crawlers now behaving like a massive DDoS. These days I’m using a combination of gated access, sentience checks, and, unfortunately, CloudFlare.