Preventing bot scraping on Publ and Flask
This morning I was once again thinking about how to put some proper antibot behavior onto my websites without relying on Cloudflare. There are plenty of fronting proxies like Anubis and Go Away which put a simple proof-of-work task in front of a website. This is pretty effective, but it adds an admin tax (and is often quite difficult to configure for servers that host multiple websites, such as mine), and false positives can have other bad effects, such as blocking feed readers and the like.
I started going down the path of integrating antibot behavior directly into Flask, using an @app.before_request hook that would do much of the same work as Anubis et al. But really, the various bots are very stupid, and the reason the challenge even works is that they aren’t running any JavaScript at all. This made me think that a better approach would be to just look for a simple signed cookie and, if that cookie isn’t there, insert an interstitial page that sets it via form POST (with a JavaScript wrapper to automatically submit the form).
But then I realized, Publ already provides this sort of humanity test: the login page!
So anyway, here’s a recipe that seems to work for Publ, assuming authentication is enabled. It requires adding the user-agents package to the environment (with e.g. poetry add user-agents).
```python
import flask
import publ.user
import user_agents
import werkzeug.exceptions

@app.before_request
def antiscraper():
    # Flag bots to remove page elements
    if user_agents.parse(flask.request.headers.get('User-Agent', '')).is_bot:
        flask.g.is_bot = True

    # Logged-in users have passed the test already
    if publ.user.get_active():
        return

    # Send possible crawlers to the login page
    if (('id' in flask.request.args and 'tag' in flask.request.args)
            or len(flask.request.args.getlist('tag')) > 1):
        raise werkzeug.exceptions.Unauthorized("Sentience test")
```
Then the other thing I did was to change my page templates to only show tag browsers if g.is_bot isn’t set; for example, in my entry template:
```jinja
{%- if not g.is_bot and entry.tags -%}
{%- block entrytags scoped %}
<nav class="tags">
<ul>{%- for tag in entry.tags -%}
    <li><a class="p-category" rel="tag" href="{{view(tag=tag).link(template=template)}}">{{tag}}</a></li>
{%- endfor -%}</ul>
</nav>
{% endblock %}
{%- endif -%}
```
This isn’t really relevant to the antiscraper thing so much as to keeping search indexes away from pages that don’t do them any good in the first place. (Unfortunately, rel="nofollow" and robots noindex directives don’t actually stop most search engines from indexing the combinatoric explosion of tag combinations, which is where the scrapers become a problem for Publ sites to begin with.)
Notably, after enabling this simple test, there’s no significant CPU load difference whether I have Cloudflare’s antibot rules active or not.
Anyway. This should be pretty easy to adapt to any Flask application that already supports a login mechanism. For things that don’t, I might end up building a simple Flask version of the form-posting thing I mentioned, along the lines of the sketch below.
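Here’s a rough sketch of what that could look like, using itsdangerous (which Flask already depends on) to sign the cookie. Everything here is illustrative rather than anything Publ actually ships: the humanity cookie name and humanity_check hook are made up, and a real version would need to exempt feeds, legitimate POST endpoints, and other machine-facing routes:

```python
import flask
import itsdangerous

app = flask.Flask(__name__)
app.secret_key = 'change-me'  # a real deployment needs a proper secret

signer = itsdangerous.Signer(app.secret_key)

# Tiny interstitial that a JS-capable browser auto-submits; bots that
# don't run JavaScript never get past it
INTERSTITIAL = """<!DOCTYPE html>
<form method="post">
  <noscript><button>Continue</button></noscript>
</form>
<script>document.forms[0].submit()</script>"""

@app.before_request
def humanity_check():
    try:
        # A valid signature means this client has already passed the test
        signer.unsign(flask.request.cookies.get('humanity', ''))
        return
    except itsdangerous.BadSignature:
        pass

    if flask.request.method == 'POST':
        # The interstitial posts back to the same URL; set the signed
        # cookie and redirect so the original request can proceed
        response = flask.redirect(flask.request.url)
        response.set_cookie('humanity', signer.sign('ok').decode())
        return response

    return flask.Response(INTERSTITIAL, status=403)

@app.route('/')
def index():
    return 'hello, human'
```

Since this runs in-process, exemptions for feed URLs and known-good user agents are just ordinary Flask code, which sidesteps the feed-reader false positives that the fronting proxies struggle with.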