The problem with select() vs. poll()

The UNIX select() API should have been deprecated years ago. While unsafe operations like sscanf(), sprintf(), gets(), and so forth all provide compile-time deprecation warnings, select() is also incredibly dangerous and has a more modern, safer replacement (poll()), yet people continue to use it.

The problem is that it doesn’t scale. In this case, “not scaling” doesn’t just mean it’s bad for performance, “not scaling” means it can destroy your call stack, crash your process, and leave it in a state that is incredibly difficult to debug.

There are actually two problems, and they are closely related. The hint to both of them comes from the signature of select() and its associated macros:

void FD_CLR(fd, fd_set *fdset);

void FD_COPY(fd_set *fdset_orig, fd_set *fdset_copy);

int FD_ISSET(fd, fd_set *fdset);

void FD_SET(fd, fd_set *fdset);

void FD_ZERO(fd_set *fdset);

int select(int nfds, fd_set *restrict readfds, fd_set *restrict writefds,
           fd_set *restrict errorfds, struct timeval *restrict timeout);

Do you see that nfds parameter? Here is what the BSD libc documentation has to say about it:

The first nfds descriptors are checked in each set; i.e., the descriptors from 0 through nfds-1 in the descriptor sets are examined. (Example: If you have set two file descriptors “4” and “17”, nfds should not be “2”, but rather “17 + 1” or “18”.)

What this is hinting at is that fd_set is not a sparse container holding some number of file descriptors; instead, it is a bit vector saying whether to check the fd at a given offset. That is, in pseudocode, people expect it to be something like:

struct fd_set {
    int count;
    struct fd_record {
        int fd;
        bool check;
    } fds[FD_SETSIZE];
};

but it’s really more like:

struct fd_set {
    bool check[FD_SETSIZE];
};

This immediately shows what the two problems are, but I'll still spell them out (since the hint alone wasn't enough when I was helping diagnose a problem with some critical production code that was breaking). Both involve what happens when your process has a large number of file descriptors open, which is very common in large-scale services.

First problem: performance

Let’s say you are trying to check the state of a single socket with fd of 907 — which happens to be within FD_SETSIZE on a modern Linux. So your code looks something like this:

fd_set fds;
FD_ZERO(&fds);
FD_SET(907, &fds);
select(908, &fds, NULL, NULL, NULL);

Here is what select() is doing internally (again, pseudocodeishly):

for (int i = 0; i < 908; i++) {
    if (readfds && readfds->check[i]) {
        readfds->check[i] = check_readable(i);
    }
    if (writefds && writefds->check[i]) {
        writefds->check[i] = check_writeable(i);
    }
    if (errorfds && errorfds->check[i]) {
        errorfds->check[i] = check_errored(i);
    }
}

In this case, since only fd 907 is actually being checked, it's looping futilely over 907 entries which don't need to be checked. If you're doing select() a lot, this can add up (remember that the structures described above are just pseudocode, and in reality there's a lot of bit-fiddling/masking/etc. going on as well).

But this isn’t the dire problem.

Problem 2: How it can destroy your stack

So let’s look at the same thing as above, except now the fd in question is, say, 2048, which is quite a lot larger than FD_SETSIZE. Theoretically this is not possible, but in practice it is — it’s pretty simple to raise the socket ulimit for circumstances where you need a lot of connections open. (For example, in a large-scale distributed cache. Again, speaking from experience on this.)

Let’s annotate the code that calls it, first:

fd_set fds; // allocates a 1024-bit vector on the stack
FD_ZERO(&fds); // clears the 1024-bit vector
FD_SET(2048, &fds); // Sets the 2048th bit; congratulations, you've corrupted your stack a little bit
select(2049, &fds, NULL, NULL, NULL); // congratulations, you've just BLOWN AWAY your stack

Why does this blow away the stack? Well, keep in mind what select() is doing internally:

for (int i = 0; i < 2049; i++) {
    if (readfds && readfds->check[i]) {
        readfds->check[i] = check_readable(i);
    }
    if (writefds && writefds->check[i]) {
        writefds->check[i] = check_writeable(i);
    }
    if (errorfds && errorfds->check[i]) {
        errorfds->check[i] = check_errored(i);
    }
}

That is, for each value up to 2048 (which, remember, is outside the fd_set and well into your stack at this point), it checks a bit to see if a garbage fd needs to be examined (relatively harmless), checks that fd (relatively harmless, but it kills your performance), and then writes the result back into memory based on whether the fd is in that particular state (which, no matter what, essentially randomizes the stack).

Congratulations, now not only will you get a weird error, you won’t even be able to tell where it comes from because your debugger will give you complete garbage on the call trace.

The easy fix: use poll()

Hindsight is 20/20, of course, but poll() is the API that select() should have been in the first place. When select() was designed, UNIX had no asynchronous I/O mechanism at all, and the whole concept of a large-scale network server was unheard of. Everything was its own process, and you'd call listen() with a backlog of something like 8 to allow only a handful of pending connections at a time, and each of those connections would likely open up a handful of files to serve up static content or whatever. In the early days, FD_SETSIZE was only 16; Winsock raised it to 64, and early Linux raised it to 256. That was still more than adequate for a reasonably-sized server at the time. It was also still generally the case that you'd be checking all of your fds at once in a single-threaded event loop (heck, back then, the whole concept of a "thread" was just an abuse of vfork() and not something that any reasonable programmer would expect to use), and that there wouldn't be that many to check; the overhead of managing a variable-sized buffer would have been greater than the savings of skipping the conditional loop.

But nowadays, any given Internet server can be easily handling thousands, if not tens of thousands, of connections at once, and the FD_SET/select() API is fundamentally broken in a way which can never, ever make this safe.

Fortunately, there’s another API, poll(), which is much more like what people expect out of select() in the first place.

struct pollfd {
    int    fd;       /* file descriptor */
    short  events;   /* events to look for */
    short  revents;  /* events returned */
};

int poll(struct pollfd fds[], nfds_t nfds, int timeout);

Essentially, you tell it exactly how many fds you’re asking about, and for each fd, exactly which events you want to know about. So now the poll() call only has to scan the actual specified fds, and only check the actual specified events — no checking readable, writable, and error conditions for ALL fds that are ≤ the highest one you’re asking about.
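In the same pseudocode style as the select() loops above (and with the same hypothetical check_* helpers), poll()'s scan looks more like:

```
for (nfds_t i = 0; i < nfds; i++) {
    fds[i].revents = 0;
    if ((fds[i].events & POLLIN) && check_readable(fds[i].fd))
        fds[i].revents |= POLLIN;
    if ((fds[i].events & POLLOUT) && check_writeable(fds[i].fd))
        fds[i].revents |= POLLOUT;
    /* error conditions (POLLERR, POLLHUP) are always reported */
}
```

The loop bound is the number of entries you passed, not the numeric value of the largest fd, and each entry names its fd explicitly, so there's nothing to scribble outside of.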

In the common case, poll() is a simple drop-in replacement, and is generally simpler and easier to read; this code:

fd_set fds;
FD_ZERO(&fds);
FD_SET(500, &fds);
struct timeval tv;
tv.tv_sec = 5;
tv.tv_usec = 0;
if (select(501, &fds, NULL, NULL, &tv) > 0) {
    if (FD_ISSET(500, &fds)) { ... }
}

becomes:

struct pollfd pfd;
pfd.fd = 500;
pfd.events = POLLIN;
if (poll(&pfd, 1, 5000) > 0) {
    if (pfd.revents & POLLIN) { ... }
}

And, obviously, if you want to check multiple types of events on a socket, it becomes even simpler, as you only need a single set of pollfd structures and simple bit masking on the calling side.

Why do people still write code with select() then?

I have a couple of theories. select() is the legacy API that was provided in all of the classic textbooks; universities probably still teach networking using it, there's a lot of sample code out there that uses select(), and even the best programmers still have a tendency to copy-paste at times.

Additionally, the fact that gcc doesn’t issue a deprecation warning on it means that people haven’t had any reason to change their practices. The man pages tend to contain warnings, but people generally just read the signature and return codes.

Current versions of certain libc variants will return an error if nfds ≥ FD_SETSIZE (the one on OSX 10.8 does, for example, although even current Linux glibc apparently gives no such warning), but by that point the stack is already corrupted. And if your server has gotten to the point where a subtle, hard-to-understand libc error is occurring, you're already having a bad day and probably won't realize where the issue is coming from.

For what it’s worth, one version of the Linux manpage for select() has this to say:

An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.

“Undefined behavior” is putting it lightly. (And it makes no mention of how select() itself is also a ticking time bomb in that situation.)

What about epoll and libEvent?

I purposefully did not discuss them in this article. While both are far superior to poll(), they aren't simple drop-in replacements for select(), and at least in the basic case of an Internet server, the more important thing is not crashing mysteriously. Obviously with a better asynchronous eventing mechanism you will scale better, but scaling better is often a longer-term goal than scaling at all.
