Federated video streaming
With the recently announced changes to Twitch Prime, people are, understandably, upset about Twitch changing its monetization strategy, and are, predictably, wondering about the possibility of building a federated live-streaming platform.
The good news is that all of the stuff necessary to make federated live streaming happen already exists, and it wouldn’t even be all that hard to build.
The bad news is that it’ll probably be expensive to do well.
I have worked on various streaming providers' stacks in the past: some of the streaming aspects of the PlayStation Video store, tangential involvement in the streaming aspects of HBO GO more recently, and my fingers in a few other services' offerings insofar as they were built on the Trilithium media platform for PlayStation. I’ve also played with real-time video encoding while optimizing my own Twitch stream. That said, I’ve never actually built a streaming system like this myself; I am speaking mostly theoretically, based on what I know about the current state of video streaming protocols, CDNs, and general federation stuff.
It’s also possible (read: extremely likely) that there’s something I’ve missed or otherwise just plain don’t know about. I’m also probably being a bit pessimistic about the costs involved, and at the same time overly optimistic about the ease of actually building this sort of thing.
There are a few open protocols for live video streaming; the two most worth focusing on are HTTP Live Streaming (also known as HLS) and MPEG-DASH. Both of these operate on the same basic principle: you have a master playlist file that refers to the various quality-level playlist files, and those quality playlists contain a list of short (typically a few seconds long) video segments.
There are three basic configurations of these playlists: an offline, fully-seekable stream (where the playlist is already complete); a live stream (where the playlist basically only has the last segment or two); or a DVR-style seekable stream (where the playlist grows as time goes on, with a marker at the end indicating it should be periodically reloaded).
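To make the structure concrete, here’s a sketch of building the two kinds of static playlist as plain text. The tags are standard HLS, but the bitrates, resolutions, and file names are made up for illustration.

```python
# Sketch: an HLS master playlist plus a finished (VOD-style) media
# playlist. Bitrates, resolutions, and segment names are illustrative.

def master_playlist(variants):
    """variants: list of (bandwidth_bps, resolution, playlist_uri)."""
    lines = ["#EXTM3U"]
    for bw, res, uri in variants:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bw},RESOLUTION={res}")
        lines.append(uri)
    return "\n".join(lines) + "\n"

def vod_playlist(segment_uris, duration=6):
    """A complete, fully-seekable playlist; ENDLIST marks it finished."""
    lines = ["#EXTM3U",
             "#EXT-X-VERSION:3",
             f"#EXT-X-TARGETDURATION:{duration}",
             "#EXT-X-PLAYLIST-TYPE:VOD"]
    for uri in segment_uris:
        lines.append(f"#EXTINF:{duration}.0,")
        lines.append(uri)
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"

print(master_playlist([(5_000_000, "1920x1080", "1080p.m3u8"),
                       (1_400_000, "1280x720", "720p.m3u8")]))
print(vod_playlist([f"seg{i:05d}.ts" for i in range(3)]))
```

The client fetches the master playlist once, picks a variant, and then works its way down that variant’s segment list.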
On the client side, the client is responsible for choosing a quality level’s stream, and is able to switch between them as network conditions permit. A typical naïve implementation simply fills a buffer of upcoming segments; if the buffer dips too low it switches to a lower quality, and if the buffer is full it switches to a higher one.
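That naïve heuristic can be sketched in a few lines; the watermark thresholds here are arbitrary illustrative numbers, not anything a real player mandates.

```python
# A naive adaptive-bitrate heuristic: watch the buffer level and step
# between quality indices. Thresholds are illustrative.

QUALITIES = ["160p", "480p", "720p", "1080p"]  # low -> high

def next_quality(current_index, buffered_seconds,
                 low_water=10, high_water=30):
    """Return the quality index to use for the next segment request."""
    if buffered_seconds < low_water and current_index > 0:
        return current_index - 1      # buffer draining: step down
    if buffered_seconds > high_water and current_index < len(QUALITIES) - 1:
        return current_index + 1      # buffer comfortably full: step up
    return current_index              # otherwise hold steady
```

Real players layer on throughput estimation and hysteresis so they don’t oscillate, but this is the core of the idea.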
This setup has several great advantages over traditional streaming solutions; notably, since the data is all just files served up over HTTP, and since the files are all static (aside from the live playlist), they can be served up via a CDN, which can take advantage of all sorts of caching behavior and viewer locality and so on.
HLS and its near-twin MPEG-DASH are the underlying technology used by many of the largest video sites on the Internet; in particular, Twitch uses HLS and YouTube uses DASH, both for their offline videos and for their live-streaming functionality.
For the purpose of this article I’ll refer to the technology as HLS, but this could also mean other similar protocols such as MPEG-DASH or Microsoft’s Smooth Streaming, which operate in the same basic way.
Also, the really good news: there’s nothing secret or magic about how HLS works. All of the tools you need to build an HLS playlist are totally free and open-source, and there are several scripts out there for converting an existing video file to an HLS stream using FFmpeg.
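As a sketch of what that conversion looks like, here’s the shape of an FFmpeg invocation using its built-in HLS muxer; the input and output names are placeholders, and the codec choices are just common defaults.

```python
# Sketch: assembling an FFmpeg command line that segments a video into
# an HLS stream via FFmpeg's hls muxer. File names are placeholders.
import shlex

def hls_command(source, out_dir, segment_seconds=6):
    return ["ffmpeg", "-i", source,
            "-c:v", "libx264", "-c:a", "aac",      # common codec choices
            "-hls_time", str(segment_seconds),      # target segment length
            "-hls_playlist_type", "vod",            # complete, seekable playlist
            "-hls_segment_filename", f"{out_dir}/seg%05d.ts",
            f"{out_dir}/index.m3u8"]

print(shlex.join(hls_command("input.mp4", "out")))
```

Running that produces `index.m3u8` plus a pile of `segNNNNN.ts` files, which are exactly the static assets a CDN (or IPFS, later in this article) can serve.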
How a typical streaming setup works
In a typical configuration, you have a streamer running software such as OBS, which is live-encoding the highest quality stream and sending it to a broadcast server. The broadcast server then segments this video and builds the dynamic playlist (Twitch seems to use the single-segment mode, whereas YouTube uses the DVR-esque mode) and simultaneously transcodes the segments into lower quality levels. The segments themselves can all be stored on a CDN for easy retrieval and caching close to the viewers.
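The live playlist the broadcast server keeps rewriting looks like a sliding window over the most recent segments; a sketch, with illustrative numbers:

```python
# Sketch of the live-mode playlist: only the last few segments are
# listed, and MEDIA-SEQUENCE tells the client where the window starts.
from collections import deque

def live_playlist(window, first_sequence, duration=6):
    """window: the most recent segment URIs, oldest first."""
    lines = ["#EXTM3U",
             "#EXT-X-VERSION:3",
             f"#EXT-X-TARGETDURATION:{duration}",
             f"#EXT-X-MEDIA-SEQUENCE:{first_sequence}"]
    for uri in window:
        lines.append(f"#EXTINF:{duration}.0,")
        lines.append(uri)
    # No ENDLIST tag: its absence is what tells the client to re-poll.
    return "\n".join(lines) + "\n"

window = deque(maxlen=3)
for i in range(10):                   # simulate ten segments arriving
    window.append(f"seg{i:05d}.ts")
print(live_playlist(window, first_sequence=10 - len(window)))
```

The DVR-esque mode is the same thing without the window cap: old segments stay in the playlist, so viewers can seek backwards.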
The transcoding process is where things get a little bit hairy; if you’ve ever been on a Twitch stream where you’re having network problems or are watching it on a mobile device or whatever, you might have noticed that you’re seeing a lot more latency than people who are able to watch the highest-quality/source stream. This is because there’s an extra step between you and the broadcaster; you have to wait for the most recent segment to be transcoded into the lower quality so that it can become available. This generally leads to an entire segment’s worth of delay at best.
Also, the overhead of doing this transcoding is quite expensive; not only do you have the source stream to contend with (where at the very least you’re remuxing and partitioning the video, if not actually transcoding it on the fly), but every quality setting needs its own transcode to take place in real time, and each transcoded segment has to wait for the source segment to finish before it can start.
And by expensive I mean it costs money. Live-transcoding video at 1080p30 needs a pretty high-end CPU. Boost it to 1080p60 and you need twice as much, and 4K30 needs around four times as much. Then, given the overhead of the lower quality levels as well, you’re looking at needing at least one high-end virtual machine per simultaneous streamer just to cover the basics. Currently this means that even without accounting for bandwidth, you’re probably looking at spending around $1/hour for every active streamer (this being a wild guess based on current AWS pricing; and yes, I realize the irony of using Amazon services in a discussion of how to get away from an Amazon service).
Fortunately, compute resources do have a tendency to get cheaper over time, but unfortunately, video streaming is one of those areas where higher compute availability is very easy to fill in with higher video quality instead.
How federated video streaming might work
For offline videos (i.e. archived recordings and whatever – think classic YouTube) the setup is pretty straightforward. A complete video is uploaded somewhere, and the place it’s uploaded to splits and transcodes the video into HLS segments. These segments and playlists, all being static data, could be easily stored in any number of federated storage mechanisms such as IPFS.
However, the latency added by IPFS retrieval means that it’ll need a different sort of tuning for the actual streaming; popular videos would likely stream well, but more obscure ones would almost certainly not be in the IPFS swarm already, which could lead to an interesting feedback scenario where a latency penalty occurs every time a quality switch happens (causing the player to keep dropping down to the lowest quality). Ideally, IPFS-based video streaming clients would ask their local IPFS node to prefetch at least the next few segments at the current quality level (regardless of whether that node is running on their own machine or is part of the web gateway of the site they’re using to watch the video).
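A minimal sketch of that prefetch pass, assuming the client can reach an IPFS HTTP gateway; the gateway URL is a hypothetical placeholder, and the playlist parsing is deliberately simplified.

```python
# Sketch: pick the next few segment URLs to warm through an IPFS
# gateway so a later fetch doesn't eat the full lookup latency.
# The gateway address is a hypothetical local default.

GATEWAY = "http://127.0.0.1:8080/ipfs/"   # placeholder gateway URL

def segment_uris(playlist_text):
    """Pull the segment URIs (non-comment lines) out of an HLS playlist."""
    return [line for line in playlist_text.splitlines()
            if line and not line.startswith("#")]

def prefetch_urls(playlist_text, already_played, count=3):
    """URLs the client should warm next, given how far it has played."""
    upcoming = segment_uris(playlist_text)[already_played:]
    return [GATEWAY + uri for uri in upcoming[:count]]
```

The client (or the site’s gateway on its behalf) would then issue plain HTTP GETs for those URLs ahead of playback.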
For live streaming, though, things get more complicated. The individual segments could easily be stored through IPFS et al., but you have the dynamic playlists to contend with. IPFS itself cannot handle this, since you need a file’s complete hash before you can look it up in the first place. Fortunately, IPFS has a companion protocol, IPNS, which allows the content behind a name to change. Unfortunately, actually propagating updates through the IPFS network can be slow – much too slow for live streaming, where the playlist has to be updated every time a new segment lands.
(Also, this means that segments can’t actually be added to a playlist until IPFS has handed the storage hash back to the broadcaster, which adds yet more latency, this time on the broadcaster’s side.)
So, for the purpose of at least the live streaming case, it makes more sense for the playlists (both master and lower-quality) to be served up over plain ol' HTTP. Once a stream is finished, though, its now-solidified playlists can certainly be stored on IPFS.
In any case, there are plenty of mechanisms for sharing the master playlists in a federated manner – for example, Mastodon (yes, I know, I don’t like ActivityPub but this is one of those cases where push actually is useful) – or of course if you’re scheduling it in advance you could also have your playlist just be a placeholder that shows the same segment repeatedly until the stream actually starts. Or something. Anyway, there’s no special sauce necessary for actually providing the playlist itself; it’s just HTTP.
But okay, you have your federated video streaming service, and now you have a bunch of people trying to stream at once. And this is getting very expensive very fast – remember, $1/streamer/hour at a very rough estimate – so maybe it’d be great to have the community share the load of the quality-stream transcodes. Maybe we can even get random viewers joining in on an encoding cloud, or at least share the load across instances that people have added to the federation network. Hey, good idea! But how can we actually do this?
Well, the segments themselves can be stored on IPFS as always, but how do we handle the playlist files? IPFS doesn’t know a file’s address until the file is already complete. A typical solution might be to have multiple encoders transcoding in parallel and then use consensus on the hash to decide which one wins, but this doesn’t actually work for video encoding; even if you ignore or omit the encode-time metadata, multicore encoding is nondeterministic, since you don’t know what order the CPU cores will finish their work in (and modern codecs are designed specifically to allow sub-blocks to be emitted in arbitrary order for this reason). So you have to have a trust relationship that the encoder is actually encoding the data you’re asking it to, rather than, say, replacing it with porn or trolling or whatever.
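A toy illustration of why hash consensus falls apart here: two honest encoders producing the same pictures still emit different bytes, so their hashes never agree. The “encodes” below are fake stand-in byte strings, not real encoder output.

```python
# Toy illustration: byte-identical *pictures*, different *bytes* out
# (thread ordering, encoder metadata), therefore different hashes.
import hashlib

frames = b"identical picture data"
encode_a = frames + b"|threads=8 block-order=0,2,1"
encode_b = frames + b"|threads=8 block-order=1,0,2"

hash_a = hashlib.sha256(encode_a).hexdigest()
hash_b = hashlib.sha256(encode_b).hexdigest()
assert hash_a != hash_b   # no consensus, even though both are "correct"
```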
So, it’s probably much more reasonable to expect each instance owner to simply have their own elastic scaling to spin up compute systems as necessary to do the segment transcoding and playlist building.
The main expenses for an instance owner to worry about are CPU (again, roughly $1/hour/streamer) and ongoing storage for its IPFS endpoint (to at least guarantee the availability of the data for less-popular streams), although storage is at least cheap if you don’t care about archiving old streams, since you only need to keep the last few minutes' worth of segments.
Also, it turns out that when compute is this expensive, the bandwidth costs (currently around $0.02/GB on Amazon, and even lower on other VPS providers) aren’t that big a deal by comparison. So what is stream federation even buying you in this case, aside from worse latency?
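A back-of-the-envelope version of that comparison, using the article’s rough figures (~$1/hour of transcode compute per streamer, ~$0.02/GB of egress); the viewer count and bitrate are made-up inputs.

```python
# Rough cost model: fixed per-streamer compute vs per-viewer bandwidth.
# Both dollar figures are the article's wild guesses, not quotes.

COMPUTE_PER_STREAMER_HOUR = 1.00   # USD/hour, rough guess from the text
EGRESS_PER_GB = 0.02               # USD/GB

def hourly_cost(viewers, avg_mbps=3.0):
    """Return (compute_cost, bandwidth_cost) in USD for one stream-hour."""
    gb_per_viewer_hour = avg_mbps / 8 * 3600 / 1000   # Mbps -> GB/hour
    bandwidth = viewers * gb_per_viewer_hour * EGRESS_PER_GB
    return COMPUTE_PER_STREAMER_HOUR, bandwidth

compute, bandwidth = hourly_cost(viewers=50)
```

At 3 Mbps, each viewer pulls about 1.35 GB/hour, so bandwidth only catches up with the $1/hour transcode bill somewhere around 35–50 viewers; for the long tail of small streams, compute dominates.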
So, it seems to me that the sensible point of federation isn’t the actual broadcast, but discovery/publication: simply use ActivityPub or the like to announce your stream locations, and then the actual data gets served up from an ordinary HTTP-based website. For bonus points, the website could prioritize transcoding only the quality levels that are actually being watched. Then when the stream is over and it’s time to post the VOD, that’s when everything gets shuttled over to IPFS or whatever. (You could even be smart and put the encoded segments into IPFS during the initial stream so that the VOD playlist is ready to go as soon as possible. Or, heck, use those to make a seekable DVR-style version of the playlist during the live broadcast, so that when people rewind they’re getting the files off IPFS instead of your live server!)
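The announcement itself could be as simple as an ActivityStreams activity pointing at the master playlist. A minimal sketch; the actor and URLs are placeholders, and a real ActivityPub server would add ids, signatures, and delivery on top.

```python
# Sketch of a stream-discovery announcement: a bare-bones
# ActivityStreams "Create" wrapping a Video object whose URL is the
# HLS master playlist. All names/URLs here are hypothetical.
import json

def stream_announcement(actor, playlist_url, title):
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Create",
        "actor": actor,
        "object": {
            "type": "Video",
            "name": title,
            "url": playlist_url,
        },
    }

note = stream_announcement("https://example.social/users/streamer",
                           "https://stream.example.com/live/master.m3u8",
                           "Going live!")
print(json.dumps(note, indent=2))
```

Followers’ instances receive the activity as a push, and their players just fetch the playlist over plain HTTP from there.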
Other stream elements
There’s a bit more to a streaming service than just the online transcoding and replication of the data. Many Twitch streams use various Twitch APIs to overlay a bunch of widgets onto the stream itself (often from a third-party provider like Streamlabs, which, again, has no real secret sauce here, just a pretty good implementation that’s easy to leverage).
Even in a single-endpoint scenario like Twitch, it’s already difficult to determine how many people are actively watching a stream; you can only really get a rough estimate based on how many unique clients are polling the playlist files. (Ever notice how 20 people watching the same Twitch stream will see 20 different viewer counts?)
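That rough estimate amounts to counting distinct clients that polled the live playlist recently; a sketch, where the client identifiers and window length are illustrative.

```python
# Sketch: estimate live viewers by counting distinct clients that
# fetched the playlist within the last polling window or two.

def estimate_viewers(poll_log, now, window_seconds=60):
    """poll_log: list of (client_id, unix_timestamp) playlist fetches."""
    recent = {client for client, ts in poll_log
              if now - ts <= window_seconds}
    return len(recent)
```

With segments a few seconds long, any active player has to re-poll within the window, so the count is a decent lower bound while the stream is live; once it goes to VOD, nothing forces re-polls and the signal disappears.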
In a live-streaming context you’re probably going to still be able to keep track of who’s polling the individual playlists, but once it goes offline/archival, all bets are off.
As far as chat goes, good news – even Twitch’s chat speaks plain IRC, an open protocol that already supports federation. (And of course this is one of those things ActivityPub is also pretty good at.)
But I just want to get off twitch!
Okay, that’s fair. I get it; Amazon is evil/capitalist/whatever, and they’ve pulled a big bait-and-switch here (my theory is that they didn’t realize just how many people already had Prime and how much this would cause their hemorrhage of funds to intensify). Fortunately, there are a bunch of other streaming providers out there, each with pluses and minuses.
- A really obvious one is YouTube; they support live streaming (and are trying to compete directly with Twitch on that front), and some people find Google less objectionable than Amazon for whatever reason.
- Picarto is pretty okay, with the caveat that you can only stream creative content (i.e. art, music, game dev, etc.), and as far as I can tell they only rebroadcast the source stream (which has a bunch of downsides, but at least it keeps their service relatively cheap to operate).
- OBS also directly supports a bunch of other services out of the box. I haven’t used any of these so YMMV:
- Smashcast seems like a pretty compelling one (if only because they let you stream just about anything legal), although their focus is on eSports.
- mixer.com seems to only allow game content, but I’m having trouble finding an actual list of allowed content (also, apparently they’re owned by Microsoft?).
- DailyMotion is mostly known as a place to watch bootlegged video copied from elsewhere; I had no idea they even supported streaming.
- Facebook live oh hell no
- Restream.io seems to be a frontend for rebroadcasting your single stream to a bunch of other stream backends? I’m not sure how that’s useful (especially since it violates pretty much every streaming platform’s partnership agreement).
- LiveEdu.tv is, as its name implies, only for broadcasting educational content.
- Periscope, which I had no idea supported streaming outside of the Twitter app, so that’s neat I guess.