Mastodon instance rambling

Lately most of my social networking has been via Mastodon, which is basically an open source, semi-distributed equivalent to Twitter. When I first joined a few years ago I got an account on the flagship instance, but not much later I ended up switching to queer.party. Unfortunately, queer.party has had several scaling issues – as have a lot of the other small instances – and while it hasn’t gone down entirely, it’s so backlogged that it’s become pretty much useless.

On Mastodon there’s a general feeling that anyone with a mastodon.social address isn’t savvy because they don’t “get” Mastodon – that the whole point of it is that it’s distributed, that you don’t have to be on a single central instance, and so on. But the problem is that most of the instances – and there are quite a lot of them – aren’t run in a way that can be expected to scale over time.

Most instances are maintained as a spare-time thing by someone, but instance management is more and more becoming a full-time job. I am incredibly grateful that Maffsie is willing to run the instance even on that basis, don’t get me wrong! But all the same I’d like to be on an instance that doesn’t regularly go down, develop massive backlogs (7 hours, at present), or hit random weird federation problems.

The problem with Mastodon in this case is that any Mastodon instance, regardless of its user count (or any user limit), will keep growing without bound for as long as it’s being used; and as the ActivityPub network grows, the amount of stuff that every instance needs to keep track of grows too.

It’s a pretty simple combinatoric problem. Any time anyone posts a message to their instance, that instance has to work out which other instances have people who should receive the message, and send a copy to each of them. If one of those other instances has gone down, the origin instance needs to keep track of that and retry later. And retry logic is very difficult to get right, especially when you don’t know when the other system might be coming back – or if it’s coming back at all!
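
To make that concrete, here’s a rough sketch of per-status fan-out with a retry queue. All of the names here (`deliver`, `fan_out`, and so on) are made up for illustration – none of this is Mastodon’s actual code:

```python
import random
import time

def deliver(instance: str, payload: dict) -> bool:
    """Pretend to POST the payload to another instance's inbox."""
    return random.random() > 0.5  # simulate half the instances being down

def fan_out(payload: dict, follower_instances: set[str], retry_queue: list) -> None:
    # One delivery attempt per distinct instance with at least one recipient.
    for instance in follower_instances:
        if not deliver(instance, payload):
            # The origin has to remember the failure and retry later,
            # without knowing if (or when) the target comes back.
            retry_queue.append((instance, payload, time.time()))

retry_queue: list = []
fan_out({"id": 1, "text": "hello, fediverse"},
        {f"instance{n}.example" for n in range(400)}, retry_queue)
print(f"{len(retry_queue)} deliveries queued for retry")
```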

There are two specific… um, okay, the word “instance” is getting very overloaded here. There are two specific events I’m thinking of that caused a perfect storm of retry problems: witches.town went offline permanently, and scifi.fyi had a certificate issue. And hey, these things happen! But both of these had enough users who were visible to enough other instances that a lot of other instances had trouble dealing with it.

Different retry logic would probably have scaled better, but any retry logic still has to scale.

And people being on smaller instances can actually end up hurting a lot. If you have 400 followers, and each of those followers is on a different instance, your instance has to send out 400 notifications for every message you post. And if half of those instances are down? Then 200 of those deliveries have to be retried until the instances come back up.

And if an instance is down permanently, your instance admin has to go through and manually say “yep, gone forever, never retry.”

(That said, batching all send and retry logic at an instance level would help this a lot, but that probably has a lot of other issues to consider as well.)
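
A hedged sketch of what that instance-level batching could look like – coalescing everything queued for a given destination into a single retry job (the queue format here is the same hypothetical one as above):

```python
from collections import defaultdict

# Hypothetical retry queue entries: (destination instance, payload, queued-at).
retry_queue = [
    ("a.example", {"id": 1}, 0.0),
    ("a.example", {"id": 2}, 1.0),
    ("b.example", {"id": 3}, 2.0),
]

# Group pending deliveries by destination, so a recovering instance gets one
# batched request instead of one request per queued status.
batches: dict[str, list[dict]] = defaultdict(list)
for instance, payload, _queued_at in retry_queue:
    batches[instance].append(payload)

for instance, payloads in batches.items():
    print(f"retry {instance} with {len(payloads)} statuses in one request")
```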

But sending updates isn’t the only problem. Receiving updates is usually what causes smaller instances to crumble; it’s the issue that led me to switch back to mastodon.social, for example. And no amount of batching can mitigate that one: if your users follow enough other users, and those other users start to get very prolific, the incoming flood of statuses will simply overwhelm a small instance.

Or heck, let’s say there’s a bad actor on the network that’s just sending out updates to all instances they can find. There’s no distributed mechanism for blocking things like that (nor should there be, since that would also be very easy to abuse).

Another thing that I feel is a design mistake in Mastodon (which I touched on in my ActivityPub hot take) is that media gets replicated by default as well. This has several problems:

  • Propagating media takes time and bandwidth
  • Propagating deletion of media takes time and bandwidth (and is one of the two Hardest Problems in programming: naming things, cache invalidation, and off-by-one errors)
  • There are often severe legal implications to the media that gets propagated, and “common carrier” status is very unclear in this space
  • The propagated media also has to be stored somewhere
  • Most media which gets propagated won’t even be seen by most of the recipients anyway!

Mastodon’s implementation of media propagation is essentially a push-based CDN, which is typically not how CDNs are designed in this day and age. Pull-based – namely, only requesting a piece of media in response to a client request – scales way better, because only media that’s actively being viewed ends up being distributed anywhere. And in the case of Mastodon there’s no reason for the individual instances to be doing that pulling anyway; most media will only be seen a handful of times, and can be served up by the originating instance’s media store when it is. If a piece of content actually goes viral, well, hopefully the originating instance has a fronting CDN for their media! (Many of them do, as many of them just use S3 for their media storage anyway.)
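
As a sketch of the pull-based alternative (a hypothetical helper, not Mastodon’s actual media code): fetch an attachment from its origin only the first time one of your own users asks for it, and serve it from a local cache after that.

```python
import urllib.parse
import urllib.request
from pathlib import Path

CACHE_DIR = Path("./media-cache")  # hypothetical local cache directory

def get_media(origin_url: str) -> bytes:
    """Fetch a remote attachment only when a local user actually requests it,
    then serve the cached copy from then on. Media nobody views costs nothing."""
    CACHE_DIR.mkdir(exist_ok=True)
    cached = CACHE_DIR / urllib.parse.quote(origin_url, safe="")
    if cached.exists():
        return cached.read_bytes()
    with urllib.request.urlopen(origin_url) as resp:  # hits the origin (or its CDN)
        data = resp.read()
    cached.write_bytes(data)
    return data
```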

Even for the rare spike of hugely visible content on Mastodon, preemptively trying to pre-cache it on every instance where someone might see it doesn’t actually make a lot of sense.

(ETA: I am told that distributed attachments are actually deleted after a week or two, which at least means they’re not a permanent scaling issue, but that doesn’t help with the initial impact. If anything it makes things worse, since every instance pays the up-front cost of media propagation and then has to pay another cost later to delete it! The only part this helps with, scaling-wise, is ongoing storage costs, and that’s still more expensive than it’d be to simply not propagate attachments in the first place.)

As usual, it’s better to distribute the metadata, rather than the content.

On that note, I wonder if Mastodon statuses would make more sense that way too. Most statuses that are being posted aren’t going to be seen by most of the users they’re replicated to. Why send all the metadata plus the status payload, when the recipients’ instances could pull the payload if the status ever becomes visible to someone? Replicate the push notification, not the content the notification refers to. This would also make status deletion somewhat safer; today, if someone posts something and then immediately deletes it, the original post still goes out to be seen by everyone before the deletion notification can go out, whereas under a pull model, any instance that hadn’t fetched the payload yet would simply never get it. (You still have to trust every recipient instance to honor deletions, of course.)
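
Here’s roughly what “replicate the notification, pull the content” could look like on the receiving side. All of these names are hypothetical, and this is not how ActivityPub currently works:

```python
import json
import urllib.request

stubs: dict[str, dict] = {}     # status id -> the tiny pushed metadata record
payloads: dict[str, dict] = {}  # status id -> full content, fetched lazily

def on_notification(meta: dict) -> None:
    """Receiving side: store only the metadata that was pushed to us,
    e.g. {"id": ..., "author": ..., "url": ...}."""
    stubs[meta["id"]] = meta

def render_status(status_id: str) -> dict:
    """Pull the body only when a user actually views the status; a status
    deleted before anyone here saw it simply never gets fetched."""
    if status_id not in payloads:
        with urllib.request.urlopen(stubs[status_id]["url"]) as resp:
            payloads[status_id] = json.load(resp)
    return payloads[status_id]
```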

Most Mastodon statuses are fairly small, to be fair, but some instances, like dragon.style (which is, incidentally, down at the time of this writing!), allow huge posts – many kilobytes long. And oh boy, do people take advantage of that. Do these statuses all need to be fully replicated everywhere for the handful of people who are actually going to read them? Probably not!

(Also, decoupling the transmission of the notification from the transmission of the content means that smarter instances or clients could batch up the content requests. Instance foo tells instance bar, “Hey, I’m gonna need the content for these 10 statuses,” and then when all of them come through, those statuses actually become fully visible to the requesting clients. Optimizing this is also a difficult thing to get right but at least there’s the option of doing it.)
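
Continuing the sketch above, that batched fetch might look something like this – the endpoint is made up, and nothing like it exists in ActivityPub today:

```python
import json
import urllib.request

def fetch_statuses(remote: str, status_ids: list[str]) -> dict[str, dict]:
    """Ask one instance for many status bodies in a single request."""
    req = urllib.request.Request(
        f"https://{remote}/api/statuses/batch",  # hypothetical endpoint
        data=json.dumps({"ids": status_ids}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # id -> full status payload
```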

Anyway, to bring this back to the original issue at hand: Mastodon might be decentralized, but in a way that shares its failure modes across the network. Not only is every instance a single point of failure for all its users, but its failure modes can in turn cause failures in other instances as well. And an instance is way more likely to have scaling issues than you might expect – even if it’s your own self-hosted single-user instance. In a sense, Mastodon works best if you have a handful of large instances, and falls apart much more easily with multitudes of small instances – the opposite of how you want a decentralized system to work!

As usual, I feel like this is something the Atom/RSS world has a better handle on; namely, doing periodic (but not instantaneous!) pulls of content, where it doesn’t matter if the publishing site is temporarily down, because anything that got missed will be picked up on the next pull. On top of that there’s very basic push support via WebSub (which only pushes very basic metadata – specifically, “hey, there are updates”), and if the recipient is down it doesn’t matter, because they’ll just pick everything up on their next pull when they come back up! There’s still an \(\mathcal{O}(N^2)\) scaling issue in terms of communication overhead, but failures don’t contribute to it – and failures don’t generate any additional overhead of their own, since retries are just plain not a thing.
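
A minimal sketch of that model, assuming a plain Atom feed: poll periodically, remember which entry IDs you’ve seen, and treat a failed fetch as a non-event rather than something to queue up and retry.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
seen: set[str] = set()

def poll(feed_url: str) -> list[str]:
    try:
        with urllib.request.urlopen(feed_url) as resp:
            tree = ET.parse(resp)
    except OSError:
        return []  # publisher is down; no retry state, the next poll catches up
    new = [entry_id for entry in tree.iter(f"{ATOM}entry")
           if (entry_id := entry.findtext(f"{ATOM}id")) and entry_id not in seen]
    seen.update(new)
    return new

while True:
    for entry_id in poll("https://example.com/feed.atom"):
        print("new entry:", entry_id)
    time.sleep(15 * 60)  # periodic, not instantaneous
```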

It’s like things that are designed for immediacy only end up making their problems worse, while things that are designed for casual propagation of non-urgent data end up scaling for immediacy better. Weird.

(Incidentally, I really ought to write Subl at some point; at its most basic it’s going to just be an RSS/Atom reader with WebSub support, but I also intend on putting social features in it using WebMention and a sharing feed, and there’s no reason it couldn’t become more Mastodon-like! But now more than ever I have approximately zero interest in directly supporting ActivityPub for its social features.)

Anyway, tl;dr: larger instances are a better bet for Mastodon as currently designed, and for now I’m using mastodon.social because I can expect it to be generally reliable, and I’m pretty okay with that.
