Preventing bot scraping on Publ and Flask
This morning I was once again thinking about how to put some proper antibot behavior onto my websites without relying on Cloudflare. There are plenty of fronting proxies like Anubis and Go Away which put a simple proof-of-work task in front of a website. This is pretty effective, but it adds an admin tax (and is often quite difficult to configure for servers that host multiple websites, such as mine), and the false positives can have bad side effects, such as blocking feed readers and the like.
I started going down the path of integrating antibot behavior directly into Flask, using an @app.before_request rule that would do much of the same work as Anubis et al. But really, the various bots are very stupid, and the reason the proof-of-work challenge works at all is that they aren’t running any JavaScript. This made me think that a better approach would be to just look for a simple signed cookie, and if that cookie isn’t there, insert an interstitial page that sets it via form POST (with a JavaScript wrapper to automatically submit the form).
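As a rough sketch of that idea (not what Publ actually does; the cookie name, endpoint, and secret here are all made up for illustration), the @app.before_request hook checks for a signed cookie and, failing that, serves the self-submitting interstitial form:

```python
# Hypothetical sketch of the signed-cookie interstitial; cookie name,
# endpoint path, and secret are placeholders, not Publ internals.
from flask import Flask, request, redirect, make_response
from itsdangerous import URLSafeSerializer, BadSignature
from markupsafe import escape

app = Flask(__name__)
app.secret_key = "change-me"  # use a real secret in production
signer = URLSafeSerializer(app.secret_key, salt="not-a-bot")

# Form that a human's browser auto-submits via JavaScript; a <noscript>
# button keeps scriptless humans from being locked out entirely.
CHALLENGE_FORM = """<!DOCTYPE html>
<form method="POST" action="/challenge">
  <input type="hidden" name="dest" value="{dest}">
  <noscript><button>Continue to site</button></noscript>
</form>
<script>document.forms[0].submit();</script>
"""


def has_valid_cookie() -> bool:
    """Return True if the visitor bears a correctly-signed cookie."""
    token = request.cookies.get("human")
    if not token:
        return False
    try:
        signer.loads(token)
        return True
    except BadSignature:
        return False


@app.before_request
def check_human():
    # Let the challenge endpoint itself through, plus anyone already vetted
    if request.path == "/challenge" or has_valid_cookie():
        return None
    # No valid cookie: serve the interstitial instead of the requested page
    return CHALLENGE_FORM.format(dest=escape(request.path)), 403


@app.route("/challenge", methods=["POST"])
def challenge():
    # The form POST cheaply proves a script ran; issue the signed cookie
    resp = make_response(redirect(request.form.get("dest", "/")))
    resp.set_cookie("human", signer.dumps("ok"), max_age=86400, httponly=True)
    return resp


@app.route("/")
def index():
    return "the real page"
```

A dumb scraper that fetches HTML without executing JavaScript never completes the POST, so it only ever sees the interstitial; a real browser bounces through it once and then carries the cookie for subsequent requests.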
But then I realized: Publ already provides this sort of humanity test, namely the login page!
UPDATE: This approach is a bit obsolete; here is a better approach that uses HTTP 429 responses (which also serve the purpose of signalling to crawlers that they are unwelcome). I also no longer recommend the g.is_bot approach to removing page elements, as Publ now has user.is_bot as a built-in function that works better with caching.