The average Mastodon user age

A question keeps coming up on Mastodon (and the Fediverse in general): what is the overall age distribution of the userbase?

However, every single poll that tries to answer it ends up becoming quite contentious, leading to a lot of conflict and drama, primarily due to how Mastodon polls work. The most common problem is that polls allow only a limited number of options, so it is very difficult to make one that gives both range and precision. Commonly, the person running the poll ends up being attacked by people in an age bracket who feel disenfranchised by the specific poll splits, and there’s this… tunnel-vision hyperfocus that a lot of people get into where if they see an “other” or “X or [older|younger]” category, even if there’s a whole thread of polls that dig into splitting up those larger buckets, they deluge the poster with very angry responses about the lack of representation.

Attempting to run the poll on a more suitable polling platform (SurveyMonkey, Google Forms, Straw Poll, etc.) always runs into similar objections, such as worries about privacy or an unwillingness to use the survey platform of choice for whatever technological hill people want to die on. And, again, users who feel put off by this make it their mission to let the perceived offender (and everyone around them) know.

I started out riffing on this with a couple of threads that were not meant to be taken seriously, and which I thought were obvious jokes, but, you know how the Internet can be. Anyway, my third poll (doing it as a median-seeking binary search) was also meant as a joke (which I intended to just be a one-off), but people actually responded to it fairly well. When I posted a followup in which I explained that it was a joke but asked if people wanted me to follow through with it, the overwhelming response was a resounding “yes.”

Clearly there is enough interest in trying to figure out some sort of statistical distribution for the age of Mastodon users (otherwise this wouldn’t keep on happening!), so, #MastoAge was born, for better or worse.

This poll series is absolutely not meant to be definitive, and is, again, the result of me doing a bit and taking it way too far (as I tend to do). Any results derived from this are meant to be taken as seriously as I intended it, which is to say, not very.

I am also not a statistician (beyond having taken a couple of college courses and done some trivial amounts of data science in my career) and my analysis is probably ridiculously flawed.

Methodology

This poll series simply tries to determine the median posting age of users of Mastodon by using a simple binary search. In particular, the methodology is:

  1. Start with a known upper bound and lower bound
  2. Take a sample at the halfway point between them
  3. If more than 50% are older than the split point, use the split point as the next upper bound; otherwise, use it as the lower bound
  4. Continue until either there’s a 50/50 split, or both bounds are on the same day

Each of these is done as a simple poll on Mastodon. The assumptions are as follows:

  • The random sample of users will be enough of a representative sample that a simple majority is enough to move forward
  • The demographics of each responding group are roughly the same (i.e. there’s not a lot of skew in either direction on any given sample point)
  • People who overtly decide to provide misinformation will be roughly the same on either side of the split
  • People who are concerned about privacy and refuse to participate in such polls will also be roughly the same on either side of the split
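The four-step search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `fraction_older` is a hypothetical stand-in for running a poll and reading off the share of respondents born before the split, the simulated population is made up, and "both bounds are on the same day" is approximated as the bounds being at most one day apart.

```python
from bisect import bisect_left

ONE_DAY = 86_400  # seconds in a day


def median_search(lower, upper, fraction_older):
    """Binary-search for the birth timestamp that splits respondents 50/50.

    fraction_older(split) is a hypothetical stand-in for a poll: it returns
    the fraction of respondents born before the timestamp `split`.
    """
    while upper - lower > ONE_DAY:
        split = (lower + upper) // 2
        older = fraction_older(split)
        if older == 0.5:
            return split      # exact 50/50 split: this is the median
        if older > 0.5:
            upper = split     # majority is older, so the median is earlier
        else:
            lower = split     # majority is younger, so the median is later
    return (lower + upper) // 2


# Made-up population: one person born on each of days 0..1000 after the epoch.
births = [day * ONE_DAY for day in range(1001)]


def simulated_poll(split):
    # Share of the simulated population born strictly before the split.
    return bisect_left(births, split) / len(births)


estimate = median_search(-10_000 * ONE_DAY, 10_000 * ONE_DAY, simulated_poll)
print(estimate // ONE_DAY)  # lands within a day of the true median, day 500
```

A simulated poll always answers consistently, which is exactly what real polls do not do, so this only shows the mechanics of the search and not how it behaves with noisy samples.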

There were, of course, plenty of objections stated quite clearly to me; in particular, many users made it clear to me that they didn’t want me violating their privacy (not that I was collecting any specific user-level information and they could simply opt not to participate in any or all of the polls!) or they wanted to make clear that they were way younger or, more commonly, way older than the current split point. Like, yes, I know, that’s how binary searches work, as well as the nature of being statistical outliers.

There are also some obvious sampling biases present:

  • My instance of choice uses a fairly extensive blocklist for badly-behaved instances, limiting the reach of the polls (although this same thing does filter out a lot of known bad actors who would be more likely to purposefully pollute the data)
  • Outliers have a tendency to be more invested in providing data proving their existence
  • The initial data points and subsequent network effects were biased towards those who already follow me, an English-speaking neurodivergent socialist transgender furry who people somehow keep thinking to be a lot younger than I am despite how frequently I point out that I am older than Pac-Man
  • Poll fatigue is a thing
  • And of course all the polls were run in English (and had an extremely nerdy framing to begin with)

Anyway.

On the first sample I just somewhat arbitrarily decided to use January 1, 1970 (i.e. (time_t)0) as the sample point, figuring that that’s as good of a reference point as any, and based on the results would have chosen an appropriate opposing bound. Thankfully for my mentions, well over 50% of users reported being younger than that, so it was easy enough to simply use the then-current timestamp ((time_t)1694398510) as the upper bound, making it September 10, 2023, plus or minus a bit for timezone stuff (which extremely, severely, truly does not matter for this).

Anyway, with the initial upper bound being September 10, 2023, that retroactively implies an initial lower bound of April 22, 1916, so, good enough for me.
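For anyone who wants to check the bounds, here is a small Python sketch that mirrors the upper-bound timestamp around the epoch. The helper name `from_time_t` is made up for illustration; it adds a `timedelta` to the epoch rather than calling `datetime.fromtimestamp`, since negative timestamps are not handled the same way on every platform.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_time_t(seconds):
    """Convert a (possibly negative) UNIX timestamp to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


upper_bound = 1694398510      # the then-current timestamp from the text
lower_bound = -upper_bound    # the same distance on the other side of the epoch

print(from_time_t(lower_bound).date())  # 1916-04-22
print(from_time_t(upper_bound))         # 2023-09-11 02:15:10+00:00
```

In UTC the upper bound falls in the early hours of September 11, which is the "plus or minus a bit for timezone stuff" in action.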

So, then it was time to run a series of polls to search for the median (i.e. 50th percentile) Mastodon user age!

Each poll was run for 24 hours (the Mastodon default), but in the interest of time, the next poll was posted early when one side had a distinct majority over the other and future votes were unlikely to change the result (typically some time after the percentage spread was greater than the margin of error as estimated by \(\frac{\sqrt{N}}{N}\), and after several hours had passed to try to mitigate the aforementioned sampling biases).
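As a rough illustration of that stopping rule, here is a Python sketch. The function names and the vote count of 400 are assumptions made up for the example; only the \(\frac{\sqrt{N}}{N}\) estimate comes from the text.

```python
import math


def margin_of_error(n):
    """Rough margin of error sqrt(N)/N (the same as 1/sqrt(N)), as a fraction."""
    return math.sqrt(n) / n


def result_is_settled(older_pct, younger_pct, n):
    """True when the spread between the two sides exceeds the estimated margin."""
    spread = abs(older_pct - younger_pct) / 100
    return spread > margin_of_error(n)


votes = 400  # hypothetical number of votes so far
print(margin_of_error(votes))            # 0.05, i.e. about 5%
print(result_is_settled(72, 28, votes))  # True: a 44-point spread is decisive
print(result_is_settled(52, 48, votes))  # False: a 4-point spread is within the margin
```

This only covers the numeric half of the rule; waiting several hours to let different groups see the poll is a judgment call that code does not capture.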

Results

The times on this chart are, per the methodology, given as UNIX timestamps. The individual poll links are given for the reference of others.

Poll  LB           UB          Split      Date        % older  % younger
0     -1694398510  1694398510  0          1970-01-01  25       75
1     0            1694398510  847199255  1996-11-05  80       20
2     0            847199255   423594588  1983-06-04  42       58
3     423594588    847199255   635396921  1990-02-18  72       29
4     423594588    635396921   529495754  1986-10-12  52       48
5     423594588    529495754   476545171  1985-02-06  55       45
6     423594588    476545171   450069879  1984-04-05  47       53

So, something slightly odd happened between 1985 and 1986, which doesn’t become obvious until you graph it as the ogive of a cumulative distribution function:

[Figure: mastoage cdf borked.svg]

So, you see that downturn between 1985 and 1986? That means that there is a negative population size between those times, at least per the polling methodology. Unfortunately, that makes absolutely no sense, which means we can only come to one logical conclusion: a large portion of Mastodon’s population are ghosts.

Millennial ghosts.

Spooky!

Well, okay. The margin of error at those sample points is around 3%, around the same as the gap between those samples. It’s quite conceivable that the actual values are quite different. Honestly, this is exactly one of the outcomes I was expecting when I started this thing, and was part of the probably-funny-only-to-me joke in attempting to use a binary search in the first place.
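The downturn can also be spotted without a graph. Here is a minimal Python sketch that sorts the binary-search samples from the results table by date and reports any place where the "% older" figure goes down; the function name `find_downturns` is made up for illustration.

```python
from datetime import date

# (split date, % older) for polls 0 through 6, in the order they were run
polls = [
    (date(1970, 1, 1), 25),
    (date(1996, 11, 5), 80),
    (date(1983, 6, 4), 42),
    (date(1990, 2, 18), 72),
    (date(1986, 10, 12), 52),
    (date(1985, 2, 6), 55),
    (date(1984, 4, 5), 47),
]


def find_downturns(samples):
    """Return adjacent pairs (sorted by date) where % older decreases."""
    ordered = sorted(samples)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b[1] < a[1]]


for (d0, p0), (d1, p1) in find_downturns(polls):
    print(f"{d0} ({p0}%) -> {d1} ({p1}%): {p1 - p0:+d} points")
```

With the numbers above, the only pair flagged is 1985-02-06 to 1986-10-12, a drop of 3 points, which is about the size of the margin of error mentioned above.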

Some real possibilities for what happened (based on some of the Mastodon replies I got):

  • Enough people misread/misunderstood the dates to get confused and pick the wrong side

    @fluffy@plush.city I think I answered correctly but I’ve answered enough of these now that I’m starting to forget how dates work

    Foxis The Cookie Dragon

    @fluffy as someone who did her date math wrong in one of those polls, I take the blame in increasing your error

    Elizabeth

    @fluffy this ones in the right range for me to almost click “well duh I was born after; that’s so long ago”

    but no. I was born before 💀 lol — Jack

  • Poll fatigue meant that people were only voting on polls that they felt applied strongly to them

    @fluffy I wonder if there’s a new possibility that the group that’s “losing” (ie , me and my fellow olds 😂) might start participating less – the incremental poll results could be impacting the next poll round in a weird way?

    Jeremy Nickurak

  • People thought they were only supposed to vote on polls that directly applied to them

    @fluffy I presume those born after 1990 shouldn’t vote, to avoid messing up your results…

    Adrian Cochrane

  • Good old-fashioned sampling error

    @fluffy I think part of it might be that some of the previous polls didn’t federate as well as the first ones (or at least I missed some of them)

    Oblomov

    @fluffy seeing the proportions shift, I’m wondering how much this is affected by different demographics in different timezones. The longer poll times should help with that, I guess?

    Dominic

    @fluffy damn it I clicked on the wrong one while scrolling by accident and now it won’t let me change it

    Canageek

  • Fucking millennials ruining everything as usual

    @fluffy I think this graph is saying that the median respondent is only comfortable sharing so many bits about their birthday

    Lioness With Mane

    @fluffy This one is my favorite poll so far, because I’m in what is clearly the problem bucket.

    panasaurusrex

    @fluffy mastodon is a Millennial App confirmed

    Raptor

Just to try to polish the turd, though, I posted a followup poll to attempt to reconcile the data, which improved things somewhat:

Span        % older
1985-02-02  49
1986-10-11  60
2023-09-13  100

Only now I just realized I meant for the first split to be February 6, not February 2, but I’m willing to overlook a four-day difference here.

So, okay. The binary search approach is basically done with, at this point, and it’s easier for me to just arbitrarily add more sample points into a master table as I go. I ran a few more polls to try to get some data points around the outliers, and here’s what I came up with:

Date        % older  Data source
1916-04-22  0%       axiomatic
1960-01-01  3%       poll
1970-01-01  23%      poll
1983-06-04  42%      poll
1984-04-05  47%      poll
1985-02-06  49%      poll
1986-10-12  60%      poll
1990-02-18  69%      poll
1996-11-05  81%      poll
2000-01-01  90%      poll
2023-09-10  100%     axiomatic

So now we have what appears to be a credible cumulative distribution function:

[Figure: mastoage cdf fixed.svg]

And this lets us also infer a histogram by taking the slope and midpoint of each segment:

[Figure: mastoage histogram.svg]

WHY. WHY ARE YOU DOING THIS TO ME.

Anyway. The average age (median) of the Mastodon userbase is, as of September 2023, around 38. Probably.
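For reference, the slope-and-midpoint step used to infer the histogram can be sketched in Python from the master table. This is an approximation under assumptions: the function name is made up, a year is taken as 365.25 days, and each bar's height is expressed as percent of respondents per year of birthdate.

```python
from datetime import date

# (date, % older) from the master table
cdf = [
    (date(1916, 4, 22), 0),
    (date(1960, 1, 1), 3),
    (date(1970, 1, 1), 23),
    (date(1983, 6, 4), 42),
    (date(1984, 4, 5), 47),
    (date(1985, 2, 6), 49),
    (date(1986, 10, 12), 60),
    (date(1990, 2, 18), 69),
    (date(1996, 11, 5), 81),
    (date(2000, 1, 1), 90),
    (date(2023, 9, 10), 100),
]


def histogram_from_cdf(points):
    """Turn CDF samples into (midpoint date, % of respondents per year) bars."""
    bars = []
    for (d0, p0), (d1, p1) in zip(points, points[1:]):
        years = (d1 - d0).days / 365.25   # width of the segment
        midpoint = d0 + (d1 - d0) / 2     # date arithmetic drops the half day
        bars.append((midpoint, (p1 - p0) / years))
    return bars


for midpoint, density in histogram_from_cdf(cdf):
    print(f"{midpoint}  {density:5.2f}% per year")
```

Short segments make the slopes jumpy, since a point or two of polling noise gets divided by a small span of time.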

Future work

So, a couple of folks had suggested using Newton’s method for finding the sample points, which is definitely something I’d considered, but, just to be completely clear, this was a joke which was not intended to go this far.

Anyway, to summarize, in Newton’s method you use the slope of the line to determine the best approximation of the next sample point. Given that we just have discrete data rather than a continuous function, the best bet we have for that is doing a simple linear interpolation between adjacent data points. In particular, given two adjacent points \((x_0,y_0)\) and \((x_1,y_1)\) and a target \(y\) position, use linear interpolation to find an \(x\) position for its estimate:

\[ x = (y - y_0) \cdot \frac{x_1 - x_0}{y_1 - y_0} + x_0 \]

Using this approach to chase the median (rather than simple binary search) would have been far more efficient (although I was already committed to the bit of doing a simple binary search, and after the fourth sample it ended up not mattering that much anyway), but this does let us refine our data pretty quickly.

So, from this data we can get some initial estimates of the birthdates at the lower and upper quartiles, as well as at the 5% and 95% marks:

Target % (y)  x0          y0  x1          y1   Estimate (x)
5             1960-01-01  3   1970-01-01  23   1960-12-31
25            1970-01-01  23  1983-06-04  42   1971-05-31
75            1990-02-18  69  1996-11-05  81   1993-06-28
95            2000-01-01  90  2023-09-10  100  2011-11-05
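The interpolation formula can be applied directly to calendar dates. The Python sketch below uses a made-up helper name, `interpolate_date`, and works with whole dates rather than timestamps, so presumably because of that (and timezones) it can land a day away from the estimates in the table.

```python
from datetime import date


def interpolate_date(y, x0, y0, x1, y1):
    """Linear interpolation x = (y - y0) * (x1 - x0) / (y1 - y0) + x0 on dates."""
    return x0 + (x1 - x0) * (y - y0) / (y1 - y0)


# (target %, x0, y0, x1, y1) rows from the table
rows = [
    (5, date(1960, 1, 1), 3, date(1970, 1, 1), 23),
    (25, date(1970, 1, 1), 23, date(1983, 6, 4), 42),
    (75, date(1990, 2, 18), 69, date(1996, 11, 5), 81),
    (95, date(2000, 1, 1), 90, date(2023, 9, 10), 100),
]

estimates = [interpolate_date(*row) for row in rows]
for (target, *_), estimate in zip(rows, estimates):
    print(f"{target:>2}%: {estimate}")
```

The same function works for chasing the median: feed it a target of 50 and the two samples on either side of it.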

Using these as starting points I started some more polls. However, this article has taken too long to get written, and I just want to get it posted, damnit. In a few days I’ll post a followup with my final data and graphs thereof, probably.

Additionally, I got an interesting response from an actual data scientist who presumably knows what the hell he’s doing. He indicated interest in doing a Bayesian analysis, and when that happens I will be sure to link to it as well.

Conclusions

Tumblr polls make this way easier, and we can easily infer that among the people who saw my poll there, the average age is around 30. However, that poll was also way less fun and got only a tiny fraction of the responses!

My real takeaway: Sometimes things being shitty and terrible can make them a lot more fun and interesting, and that’s worth a lot.

Also, I have a tendency to take a dumb joke way too far.

Addendum

The final-ish poll results are plotted below with a third-order polynomial fit:

[Figure: mastodon cdf finalish.svg]
