The danger of big-O notation

A common pitfall I see programmers run into is putting way too much stock into Big O notation and using it as a rough analog for overall performance. It’s important to understand what the Big O represents, and what it doesn’t, before deciding to optimize an algorithm based purely on the runtime complexity.

What is Big O?

Big O notation is a commonly-used notation in the field of complexity. Roughly, saying that an algorithm is \(\mathcal{O}(f(N))\) means that for an input of size \(N\), the overall time taken will grow at a rate that is at worst \(f(N)\).

Or, in other words, if you zoom out on the graph of time-taken by an algorithm, you can overlay a curve of \(f(N)\) that is scaled in some way such that the time taken by the algorithm will always be underneath that curve.

More formally, \(f(x)\) is \(\mathcal{O}(g(x))\) if there are some values \(K\) and \(x_0\) such that \(f(x) \leq K \cdot g(x)\) for all \(x \geq x_0\).

What does being “big-O something” tell you?

Not that much, in practical terms.

If a function is \(\mathcal{O}(N)\), then it means that for a sufficiently-large input of size \(N\), if you double the input then it will take roughly \(2N\) as long to execute.

Similarly, if it’s \(\mathcal{O}(N^2)\), then for a sufficiently-large input of size \(N\), if you double the input then it will take roughly \(4N\) as long to execute.

And so on.

It allows you to sort various algorithms based on their overall complexity class; whatever term is most-dominating in an algorithm is what will eventually dominate in terms of runtime, for a sufficiently-large input. If your algorithm has two phases, one that’s \(\mathcal{O}(N)\) and one that’s \(\mathcal{O}(N^2)\), then over the long term, for a sufficiently-large input, the \(\mathcal{O}(N^2)\) portion will dominate in terms of runtime, and so the algorithm as a whole is \(\mathcal{O}(N^2)\).

So if I have two algorithms, one \(\mathcal{O}(N)\) and one \(\mathcal{O}(N^2)\), the \(\mathcal{O}(N)\) one will be faster, right?

Oh heck no. Remember, these are just expressions of the dominant factor over long-term increases.

Bubble sort is \(\mathcal{O}(N^2)\) so we should never use it, right?

Well, not necessarily. If you’re only ever sorting small arrays – like on the order of 10 elements or so – it can quite likely be a lot faster than something like merge sort or quicksort. (Incidentally, quicksort is also \(\mathcal{O}(N^2)\) by a strict definition; the average case tends towards \(\mathcal{O}(N \lg{} N)\) but that’s assuming pivot selections that cut your sets in half every time. And doing a perfect pivot selection turns out to be really hard.)

Remember, \(\mathcal{O}\) only describes the long-term growth of an algorithm. For an extreme example, trying to find a string that generates a specific fixed-size hash (colloquially called “reversing a hash” but that’s also a terrible misnomer) is \(\mathcal{O}(1)\) – after all, the search space is always a constant size. There will only ever be 2128 possible md5 hashes, so all you have to do to find a string that produces a particular md5sum (given no other inputs) is iterate over every 128-bit number represented as strings, right?

Well, that’ll take quite a long time to do. But it’ll always take on average the same amount of time! (Technically, a random amount which takes on average the same amount of time as generating 2127 hashes.)

For a more practical example: general-purpose comparison-based sorting is provably only ever going to be, at best, \(\mathcal{O}(N \lg{} N)\). But there are actually sorting algorithms that execute with lower complexity; for example, radix sort can be implemented to take only \(\mathcal{O}(N)\) time, at least given a fixed maximum key size. So, why don’t we always use radix sort when possible? Because in the general case, it’s hecking slow. It just happens to appear to scale very nicely.

For example, here are some graphs of std::sort vs. a radix sort using both a 4-bit and an 8-bit radix, for various input sizes; consider that regardless of radix size, radix sort is \(\mathcal{O}(N)\) while std::sort is \(\mathcal{O}(N \lg{} N)\): (source code, raw data)

big-o-100000.pngbig-o-524288.pngbig-o-1000000.pngbig-o-1000000-scaled.png

So, at least with this simple test, it takes an input size of around 40,000 before a 4-bit radix beats std::sort, and around 200,000 before 8-bit radix beats std::sort – even though both radix sorts are \(\mathcal{O}(N)\) compared to the \(\mathcal{O}(N \lg{} N)\) of std::sort.

And it only gets worse since radix sort has various time jumps up around certain power-of-two boundaries; this is because as the buckets get bigger, more allocations (and reallocations) need to occur as they hit power-of-two boundary sizes, as well as the various cache thrashing that occurs as a result. This also reflects the fact that radix sort also requires a lot more memory than a comparison sort; radix sort can take up to two times the size of the input array in addition to the input itself, whereas the comparison sorts generally do as much in-place as possible.

(You can smooth out the power-of-two boundary issues by using a linked list for the buckets instead of a vector, incidentally, but that only hurts performance – by a factor of around 30, in my tests – and significantly increases memory usage. But at least it makes the graphs slightly prettier.)

The bucket-size overhead of 4-bit radix also makes it scale up much more quickly than 8-bit, and after a certain point it’s not even worth looking at the 4-bit radix time since it’s so much slower. But at least it scales linearly! Maybe after a few billion numbers it’ll start to beat std::sort again.

And yes, std::sort does scale with \(N \lg{} N\), and radix does scale linearly:

big-o-n_lg_n.pngbig-o-n.png

Other code metrics

Another thing to note is that the lower-big O algorithms tend to also be much more complicated to implement and maintain. std::sort is literally one line of code; my radix sort implementation is 15 lines just for sorting numbers, without comments, and using an algorithm that doesn’t make intuitive sense to most people.

A general rule of thumb is that the more lines of code you’re implementing yourself, the slower it’s likely to be in the common case, and the worse it is to maintain. Standard library implementations and idioms exist for a reason. Sure, it isn’t always the case that shorter code will be faster (after all, bubble sort is easily done in around 5 lines of code but is way worse than radix sort, performance-wise) but why implement something that at best only slightly improves performance while taking on extra maintenance burden?

This gets even more extreme in scripting languages like Python; an idiomatic sort of a list in Python is just sorted(input) and goes through mostly-native code, which has been optimized to bits by compilers or even going through standard library functions. Implementing your own sort, no matter how efficient, is going to still go entirely through the Python interpreter.

Conclusion

I hope that rather than worrying only about the theoretical computational complexity of an algorithm, you’ll consider using it as a guideline for implementation and pay more attention to the practical performance.

Comments

Before commenting, please read the comment policy.

Avatars provided via Libravatar