Preventing bot scraping on Publ and Flask

This morning I was once again thinking about how to put some proper antibot behavior onto my websites without relying on Cloudflare. There are plenty of fronting proxies, like Anubis and Go Away, which put a simple proof-of-work task in front of a website. This is pretty effective, but it adds more of an admin tax (and is often quite difficult to configure for servers that host multiple websites, such as mine), and false positives can have bad side effects, such as blocking feed readers and the like.

I started going down the path of integrating antibot stuff directly into Flask, using an @app.before_request rule that would do much of the same work as Anubis et al. But really, the various bots are very stupid, and the only reason the challenge even works is that they aren’t running any JavaScript at all. This made me think that a better approach would be to have it just look for a simple signed cookie, and if that cookie isn’t there, insert an interstitial page that sets it via form POST (with a JavaScript wrapper to automatically submit the form).
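As a rough sketch of the signed-cookie idea, here is the check in plain Python using only the standard library. The names (`SECRET`, `issue_cookie`, `verify_cookie`, `handle_request`) and the one-day lifetime are assumptions for illustration, not anything Publ or Flask provides; in Flask the `handle_request` logic would be called from an `@app.before_request` hook.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical signing key; keep the real one private


def _mac(value: str, secret: bytes) -> str:
    """Hex HMAC-SHA256 of the cookie payload."""
    return hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()


def issue_cookie(secret: bytes, now: int) -> str:
    """Build a cookie of the form '<issue time>.<signature>'."""
    value = str(int(now))
    return f"{value}.{_mac(value, secret)}"


def verify_cookie(cookie, secret: bytes, now: int, max_age: int = 86400) -> bool:
    """True only if the cookie is present, correctly signed, and not expired."""
    if not cookie:
        return False
    value, sep, mac = cookie.rpartition(".")
    if not sep:
        return False
    # Compare as bytes in constant time to avoid leaking signature prefixes.
    if not hmac.compare_digest(mac.encode(), _mac(value, secret).encode()):
        return False
    try:
        issued = int(value)
    except ValueError:
        return False
    return 0 <= now - issued <= max_age


def handle_request(cookie, now: int) -> str:
    """Decide whether to serve the page or the auto-submitting interstitial."""
    return "pass" if verify_cookie(cookie, SECRET, now) else "interstitial"
```

A client that never runs the interstitial's form POST never receives a valid cookie, so it keeps getting the interstitial instead of the content.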

But then I realized, Publ already provides this sort of humanity test: the login page!

UPDATE: This approach is a bit obsolete; here is a better approach that uses HTTP 429 responses (which also serve the purpose of signalling to crawlers that they are unwelcome). I also no longer recommend the g.is_bot approach to removing page elements, as Publ now has user.is_bot as a built-in function that works better with caching.


Random verb selection on a MUCK

On SpinDizzy MUCK there are a bunch of “hug” verbs which are a bit whimsical and a bit nonsensical, and for reasons that are too silly to get into, I have been locked in an eternal battle with Austin in which I am constantly creating more.

A while back I ran into an issue with a few miscellaneous world scripts breaking around me. It turned out that one of the global scripts, for reasons I’m still unclear on, attempts to parse every verb attached to a character object, and for other reasons I am also unclear on, it ends up attempting to push every name for the verb onto the stack.


macOS Dequarantine

Tired of dealing with the annoying processes necessary to run an unsigned application on macOS?

Here’s a simple thing to make your life a lot easier: dequarantine.zip

Download this file, open it up, double-click the dequarantine.workflow file, and then install it as a Quick Action. Now if you want to let an unsigned application run, right-click (or ctrl-click) it, select “Quick Actions,” then “dequarantine.” And then, done.
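For the curious, removing the quarantine flag by hand is usually a matter of deleting the `com.apple.quarantine` extended attribute with the stock macOS `xattr` tool. The sketch below shows that general technique from Python; it is an assumption about how such a Quick Action could work, not a copy of what the workflow file actually contains, and the function names are made up for illustration.

```python
import subprocess
import sys


def dequarantine_command(path: str) -> list:
    """Build the xattr invocation that strips the quarantine attribute.

    -d deletes the named attribute; -r recurses into the .app bundle.
    """
    return ["xattr", "-d", "-r", "com.apple.quarantine", path]


def dequarantine(path: str) -> None:
    """Run the command, raising CalledProcessError if xattr fails."""
    subprocess.run(dequarantine_command(path), check=True)


if __name__ == "__main__" and sys.platform == "darwin":
    # Only touches files when run directly on macOS with paths as arguments.
    for app_path in sys.argv[1:]:
        dequarantine(app_path)
```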

dequarantine.png

Have fun.


Radix sort revisited

Around a year and a half ago I wrote an article on the perils of relying on big-O notation, and in it I focused on a comparison between comparison-based sorting (via std::sort) and radix sort, based on the common bucketing approach.

Recently I came across a video on radix sort which presents an alternate counting-based implementation at the end, and claims that the tradeoff point between radix and comparison sort comes much sooner. My intuition said that even counting-based radix sort would still be slower than a comparison sort for any meaningful input size, but it’s always good to test one’s intuitions.
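For reference, a counting-based radix sort looks roughly like the sketch below. It is written in Python for readability rather than speed, assumes non-negative integers, and uses one byte per pass; it is a generic illustration of the technique, not the implementation from the video or from my benchmarks.

```python
def radix_sort(values, bits_per_pass=8):
    """LSD radix sort for non-negative integers using counting passes."""
    values = list(values)
    if not values:
        return values
    if min(values) < 0:
        raise ValueError("this sketch only handles non-negative integers")

    mask = (1 << bits_per_pass) - 1
    max_value = max(values)
    shift = 0
    while (max_value >> shift) > 0:
        # Pass 1: count how many values fall into each digit bucket.
        counts = [0] * (mask + 1)
        for v in values:
            counts[(v >> shift) & mask] += 1

        # Turn counts into starting offsets (exclusive prefix sum).
        total = 0
        for digit in range(len(counts)):
            counts[digit], total = total, total + counts[digit]

        # Pass 2: place each value at its offset; scanning in order keeps
        # the pass stable, which LSD radix sort depends on.
        output = [0] * len(values)
        for v in values:
            digit = (v >> shift) & mask
            output[counts[digit]] = v
            counts[digit] += 1

        values = output
        shift += bits_per_pass
    return values
```

Unlike the bucketing approach, this never grows per-bucket lists: each pass uses one fixed-size count array and one output array.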

So, hey, it turns out I was wrong about something. (But my greater point still stands.)


The danger of big-O notation

A common pitfall I see programmers run into is putting way too much stock into Big O notation and using it as a rough analog for overall performance. It’s important to understand what the Big O represents, and what it doesn’t, before deciding to optimize an algorithm based purely on the runtime complexity.
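To make that concrete, here is a toy cost model. The constant factors are made up purely for illustration (they are not measurements of any real algorithm): an O(n) algorithm that costs 50 units per item against an O(n log n) algorithm that costs 1 unit per item per level.

```python
def crossover_exponent(linear_const: int, nlogn_const: int = 1) -> int:
    """Smallest k such that, at n = 2**k, the linear algorithm is cheaper.

    Costs are modeled as linear_const * n versus nlogn_const * n * log2(n),
    evaluated only at powers of two so the arithmetic stays in integers.
    """
    k = 1
    while linear_const * 2**k >= nlogn_const * 2**k * k:
        k += 1
    return k


if __name__ == "__main__":
    k = crossover_exponent(50)
    print(f"linear only wins once n reaches 2**{k} = {2**k} items")
```

Under these assumed constants the "asymptotically better" algorithm only pulls ahead past 2**51 items, which is far more than most programs will ever sort.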


How not to shuffle a list

A frequent thing that people want to do in making games or interactive applications is to shuffle a list. One common and intuitive approach that people take is to simply sort the list, but use a random number generator as the comparison operation. (For example, this is what’s recommended in Fuzzball’s MPI documentation, and it is a common answer that comes up on programming forums as well.)

This way is very, very wrong.
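A small sketch of why, under a simplifying assumption: suppose the sort is an insertion sort and every comparison is a fair coin flip. (Real sort implementations differ, so the exact numbers will too, but the problem is the same.) The code below computes the exact probability of each ordering, then shows a Fisher-Yates shuffle as the usual correct alternative. All function names here are my own.

```python
import random
from collections import defaultdict
from fractions import Fraction


def coin_flip_sort_distribution(items):
    """Exact outcome probabilities of insertion sort with a coin-flip comparator."""
    dist = defaultdict(Fraction)

    def step(sorted_part, rest, prob):
        if not rest:
            dist[sorted_part] += prob
            return
        x, rest = rest[0], rest[1:]
        pos = len(sorted_part)
        while pos > 0:
            # Coin says "x is not smaller": x settles at this position.
            step(sorted_part[:pos] + (x,) + sorted_part[pos:], rest, prob / 2)
            # Coin says "x is smaller": keep scanning left.
            prob /= 2
            pos -= 1
        step((x,) + sorted_part, rest, prob)

    step((), tuple(items), Fraction(1))
    return dict(dist)


def fisher_yates(items, rng=random):
    """Uniform shuffle: every permutation is equally likely."""
    result = list(items)
    for i in range(len(result) - 1, 0, -1):
        j = rng.randrange(i + 1)  # pick from the not-yet-fixed prefix
        result[i], result[j] = result[j], result[i]
    return result


if __name__ == "__main__":
    for order, p in sorted(coin_flip_sort_distribution([0, 1, 2]).items()):
        print(order, p)
```

For three items this model gives two orderings a probability of 1/4 and the other four 1/8 each, rather than 1/6 across the board. Since every probability is a sum of powers of 1/2, no amount of coin flipping can produce exactly 1/6.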


Making a hash of data

When I was replacing peewee with PonyORM, I was evaluating a few options, including moving away from an ORM entirely and simply storing the metadata in indexed tables in memory. This would have also helped to solve a couple of minor annoying design issues (such as improper encapsulation of the actual content state into the application instance), but I ended up not doing this.

A big reason why is that there don’t actually seem to be any useful in-memory indexed table libraries for Python. Or many other languages.
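To illustrate what I mean by an in-memory indexed table, here is a minimal sketch: rows are plain dicts, and each indexed field gets a dict mapping values to row IDs. The class name, its methods, and the example fields (`category`, `status`) are all hypothetical, and it only supports exact-match lookups, not the range queries or ordering a real replacement would need.

```python
from collections import defaultdict


class IndexedTable:
    """Rows stored by ID, with hash indexes on selected fields."""

    def __init__(self, indexed_fields):
        self._rows = {}
        self._indexes = {field: defaultdict(set) for field in indexed_fields}
        self._next_id = 0

    def insert(self, row: dict) -> int:
        """Store a copy of the row and register it in every index."""
        row_id = self._next_id
        self._next_id += 1
        self._rows[row_id] = dict(row)
        for field, index in self._indexes.items():
            index[row.get(field)].add(row_id)
        return row_id

    def delete(self, row_id: int) -> None:
        """Remove the row and its index entries."""
        row = self._rows.pop(row_id)
        for field, index in self._indexes.items():
            index[row.get(field)].discard(row_id)

    def find(self, **criteria):
        """Return rows matching all criteria; fields must be indexed."""
        ids = None
        for field, value in criteria.items():
            matches = self._indexes[field].get(value, set())
            ids = set(matches) if ids is None else ids & matches
        if ids is None:
            ids = set(self._rows)
        return [self._rows[i] for i in sorted(ids)]


if __name__ == "__main__":
    entries = IndexedTable(["category", "status"])
    entries.insert({"title": "Pushl", "category": "code", "status": "published"})
    entries.insert({"title": "Draft", "category": "code", "status": "draft"})
    print(entries.find(category="code", status="published"))
```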


Pushl

Pushl: A tool for generating WebMention, Pingback, and WebSub notifications from arbitrary websites regardless of their underlying publishing system.