Post-Cloudflare update fluffy rambles

It’s been nearly a week since I removed Cloudflare from my sites. As a quick followup, I did get a slight surge in traffic that lasted for a day or so after a bunch of bots' DNS caches expired, but they seem to have all given up after the Cloudflare “managed challenge” interstitial turned into an HTTP 401 error for them.

Read more…

A simple anti-AI measure for Flask

After figuring out a basic anti-bot measure for Publ, I decided to try building a simple experiment for Flask in general.

Here is an extremely simple implementation that has worked amazingly well since I deployed it on The Flickr Random Image Generatr.

Read more…

Preventing bot scraping on Publ and Flask

This morning I was once again thinking about how to put some proper anti-bot behavior onto my websites without relying on Cloudflare. There are plenty of fronting proxies, like Anubis and Go Away, which put a simple proof-of-work task in front of a website. This is pretty effective, but it adds more of an admin tax (and is often quite difficult to configure for servers that host multiple websites, such as mine), and the false positives can have other bad effects, such as locking out feed readers and the like.

I started going down the path of integrating anti-bot checks directly into Flask, using an @app.before_request rule that would do much of the same work as Anubis et al. But really, the various bots are very stupid, and the reason the challenge works at all is that they aren’t running any JavaScript. This made me think that a better approach would be to just look for a simple signed cookie and, if that cookie isn’t there, insert an interstitial page that sets it via a form POST (with a JavaScript wrapper to automatically submit the form).

But then I realized, Publ already provides this sort of humanity test: the login page!

UPDATE: This approach is a bit obsolete; here is a better approach that uses HTTP 429 responses (which also serve the purpose of signaling to crawlers that they are unwelcome). I also no longer recommend the g.is_bot approach to removing page elements, as Publ now has user.is_bot built in, which works better with caching.

Read more…

Building a lyric search engine

Y'all probably know that my views on AI are somewhat nuanced. I’m not 100% “AI BAD!!!” but I’m also hesitant to rely on AI for a lot of things, and generally do not care for generative AI or any situation where you need AI to “reason” on things.

But recently I’ve been trying to remember the name of a song that I listened to a lot, whose lyrics (the parts I can remember, anyway) don’t come up in any of the major lyrics databases. I listen to a lot of obscure indie music that tends to get lost by the major platforms, and I’ve been packratting music for decades now.

Further, it’s only fairly recently that music started getting lyrics embedded in its ID3 tags (thanks to Bandcamp really pushing for that), and even the streaming platforms have taken forever to pick it up. So a lot of the music I listen to has never had its lyrics entered in any sort of machine-searchable way.

But, hey, there are plenty of AI models for vocal extraction and text transcription… so why not actually use them?
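Once the lyrics have been transcribed, the search half of the problem is pretty mundane; here’s a rough sketch using SQLite’s FTS5 full-text index (the song titles and lyrics here are made-up placeholders, and in practice the rows would come from ID3 tags or from the transcription output):

```python
import sqlite3

# Build an in-memory full-text index over song lyrics
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE lyrics USING fts5(title, body)")
db.executemany("INSERT INTO lyrics VALUES (?, ?)", [
    ("Example Song A", "we danced beneath the neon rain"),
    ("Example Song B", "half-remembered summer fading slow"),
])


def search(query):
    """Return titles of songs whose lyrics match, best match first."""
    rows = db.execute(
        "SELECT title FROM lyrics WHERE lyrics MATCH ? ORDER BY rank",
        (query,))
    return [title for (title,) in rows]
```

The AI part (vocal extraction and transcription) is just what populates the table; the searching itself needs no “reasoning” at all.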

Read more…

Fuck AI LLM scrapers

Wellp, my whack-a-mole approach finally got to be too much to maintain. The last day or so my server has been absolutely inundated with traffic from thousands of IP blocks, all coming from China, and I got sick of trying to keep up with it myself.

I looked into setting up Anubis and preparing to just whitelist a lot of IndieWeb things, but it’s all just so very overwhelming and for now I’ve gone with Cloudflare, problematic as they are, because the amount of energy I can put into this shrinks every day and sometimes I just want things to stop sucking for a while.

All of my DNS has propagated but of course it’ll be a while before the bots decide to update their own DNS caches, so my server is still getting absolutely hammered, but hopefully things will subside, and in the meantime things are at least responsive.

I guess at some point I’ll have to figure out how to actually set up TLS with Cloudflare (since I’ve been using Let’s Encrypt wildcard certs, but obviously those don’t work anymore when Cloudflare is handling my DNS), but that’s a problem for future me. Also I’ll definitely be on the lookout to make sure that Cloudflare is properly honoring my login cookies. It’d definitely be unfortunate if it gets confused about logins, which is one of the more common failure modes with HTTP proxies.

I’m also super worried that this will interfere with IndieWeb stuff, because of course most of the anti-bot things assume that any traffic coming from data centers or from headless/scriptless user agents is abusive. Which is, y'know, 99.99% accurate, but that 0.01% is stuff I really care about (namely interop).

Anyway. I resent that this is the state of the Internet right now. It’s getting really difficult for me to find anything positive about AI when this is how the industry treats everyone.

Comma 3X: Initial impressions

About a week ago I bought a Comma 3X from comma.ai, based on seeing a bunch of quite glowing reviews of it (and other FSD systems) from a number of car and tech reviewers I trust. In particular, since Kate of Transport Evolved has one and also has the exact same car as mine (2019 Kia Niro EV EX Premium in Galaxy Blue) and speaks highly of it, I decided that this might be a useful thing for handling my ongoing driving anxiety and vertigo issues.

Luckily enough it happened to be during a flash sale, where they included the harness for free ($99 off from usual), so my total cost was $999 (shipping was included and there was no sales tax either).

It arrived last Wednesday, and I installed and calibrated it soon after. I didn’t really get a chance to try it out until Sunday, but so far I’m very impressed with it.

Read more…

Blocking abusive webcrawlers

People often talk about how bad AI is for the environment, but only focus on the operation of the LLMs themselves. They seem to ignore the much larger impact of what the AI scrapers are doing: not only do those take massive amounts of energy and bandwidth to run, but they’re impacting every single website operator on the planet by increasing their server requirements and bandwidth utilization as well. And this makes it everyone’s problem, since everyone ends up having to foot the bill. It’s asinine and disgusting.

At one point, fully 94% of all of my web traffic was coming from a single botnet like this. These bots do not respect robots.txt or the nofollow link rels that I put on my site to prevent robots from getting stuck in a trap of navigating every single tag combination on my site, and it’s ridiculous just how many resources — both mine and theirs — are constantly being wasted like this.
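For reference, the sort of rule a well-behaved crawler would honor is trivial to state; something like this in robots.txt (the /tags/ path here is illustrative, not my actual site layout):

```
# Ask crawlers to stay out of the combinatorial tag pages
User-agent: *
Disallow: /tags/
```

The bots in question ignore this entirely, which is the whole problem.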

I’ve been using the nginx ultimate bad bot blocker to subscribe to lists of known bots, but this one particular botnet (which operates on Alibaba’s subnets) has gotten ridiculous, and enough is enough.

So, I finally did something I should have done ages ago, and set up UFW with some basic rules.
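The rules themselves are one-liners; something along these lines (the CIDR ranges below are examples only — verify the current allocations with whois before blocking anything):

```shell
# Block example cloud-provider ranges wholesale (check whois first!)
sudo ufw deny from 47.74.0.0/15 to any
sudo ufw deny from 8.208.0.0/16 to any

# Make sure SSH stays reachable before turning the firewall on
sudo ufw allow OpenSSH
sudo ufw enable
```

Dropping the traffic at the firewall means nginx never even sees the requests, which is much cheaper than matching user agents per-request.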

UPDATE: This article has started getting linked to from elsewhere (including Coyote’s excellent article about the problem), but I no longer use this approach for blocking crawlers, as it’s become completely ineffective thanks to the crawlers now behaving like a massive DDoS. These days I’m using a combination of gated access, sentience checks, and, unfortunately, Cloudflare. UPDATE TO THE UPDATE: no I’m not.

ANOTHER UPDATE: I’ve had to go back to using this technique selectively as some of the crawlers have managed to get around my other mitigations. If only these bot authors would spend 1% as much time on making their bots not be utterly broken as they do on trying to inflict themselves on everyone.

Read more…