- ….com, .net, .edu, or .org address
- ….fr → French).
- …, _, and -)
- …^_^@example.com or +&#@example.com is invalid
- …From: address
- …Date: header on a message is legitimate
- …Received: headers will always be no earlier than the Date: header

See also: email is bad
Update: I’m getting some good additions from folks' responses and I’ll be adding them as I see them.

From elainemorisi:

Additional suggestions from a reddit thread:

- …+suffixes from email addresses (e.g. john+doe@example.com → john@example.com)

From Jens Alfke:

A good one that was posted by many folks when this post went viral but seems to first be by benoliver999 on lobste.rs:
Update, 9/1/2022: Wow, I am getting overwhelmed with comments and suggestions. This really struck a nerve, huh!
I don’t think I’ll be adding any more things to this list. But here’s a few threads you can follow and post comments on:
Also, I moderate every single comment that comes to this site, so please be patient if a comment you post doesn’t appear immediately.
Finally, see this followup.
#Email #Communication

…(std::sort) and radix sort, based on the common bucketing approach.

Recently I came across a video on radix sort which presents an alternate counting-based implementation at the end, and claims that the tradeoff point between radix and comparison sort comes much sooner. My intuition said that even counting-based radix sort would still be slower than a comparison sort for any meaningful input size, but it’s always good to test one’s intuitions.
So, hey, it turns out I was wrong about something. (But my greater point still stands.)
Here are my implementations of bucket-based and count-based radix sort, in C++:
For the full source, see big-o-2.cpp
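Roughly, the two approaches look like this (a simplified sketch for 32-bit unsigned values; the actual code in big-o-2.cpp differs in the details):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bucket-based LSD radix sort: distribute values into 2^BITS buckets per pass,
// then concatenate the buckets back into the array in order.
template <unsigned BITS>
void radix_sort_buckets(std::vector<uint32_t>& values)
{
    constexpr uint32_t kBuckets = 1u << BITS;
    constexpr uint32_t kMask = kBuckets - 1;

    std::vector<std::vector<uint32_t>> buckets(kBuckets);
    for (unsigned shift = 0; shift < 32; shift += BITS) {
        for (uint32_t v : values)
            buckets[(v >> shift) & kMask].push_back(v);

        size_t out = 0;
        for (auto& bucket : buckets) {
            for (uint32_t v : bucket)
                values[out++] = v;
            bucket.clear();
        }
    }
}

// Counting-based LSD radix sort: count each digit, turn the counts into
// destination offsets, then scatter into a scratch buffer (stable per pass).
template <unsigned BITS>
void radix_sort_counting(std::vector<uint32_t>& values)
{
    constexpr uint32_t kBuckets = 1u << BITS;
    constexpr uint32_t kMask = kBuckets - 1;

    std::vector<uint32_t> scratch(values.size());
    for (unsigned shift = 0; shift < 32; shift += BITS) {
        size_t counts[kBuckets] = {};
        for (uint32_t v : values)
            ++counts[(v >> shift) & kMask];

        size_t offsets[kBuckets];
        size_t total = 0;
        for (uint32_t i = 0; i < kBuckets; ++i) {
            offsets[i] = total;
            total += counts[i];
        }

        for (uint32_t v : values)
            scratch[offsets[(v >> shift) & kMask]++] = v;

        values.swap(scratch);
    }
}
```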
And here are the time comparisons between std::sort
, and both bucket and counting radix sort using a 4- and 8-bit radix: (raw data)
So, what’s going on here? And why are even the bucket-radix sort graphs different than last time?
It’s hard to do a like-for-like comparison with the previous set of implementations; this time around I was running on a very different computer (a Mac mini running on the M1 chip), with a newer version of the C++ compiler, a newer version of the C++ standard library, and who knows how many other differences. It’s pretty interesting that the power-of-two allocation overhead from bucketed radix, in particular, has more or less gone away; there’s possibly something about the M1 architecture which makes the vector resize take much less time, and making use of clang’s robust C++17 support may have also reduced some of the copy overhead due to implicit move semantics being used.
But it’s pretty interesting to me that the following things are pretty apparent:
- …std::sort at around N=13000

Now, all that said, this still demonstrates a problem with just assuming that a lower big-O factor is better. All four of those radix sort implementations are \(\mathcal{O}(N)\), but the bucket-based ones still are slower than the \(\mathcal{O}(N \lg N)\) std::sort until fairly large input sizes, even with all of the overall performance improvements which have happened (after all, std::sort has gotten faster too).
And, of course, all four radix sorts have the same time complexity as each other, but they all scale with different factors; in particular, it doesn’t take that long for the radix+bucket sorts to overtake the 4-bit counting sort (which is, frankly, pretty surprising to me).
As always, the fact that an algorithm scales better in the long term doesn’t mean it’s suitable for all input sizes. Even in this best-case situation, std::sort
still wins for input sizes of a few hundred values, and of course the maintenance overhead of using std::sort
is way lower.
It’s also important to remember that these sorting functions can still only work on unsigned integers, or signed integers with a little tweaking. They are not (easily) applicable to floating-point values, much less things with more complicated tiebreaking rules (such as database rows).
And, heck, it’s also really easy to write code which is \(\mathcal{O}(N)\) but not optimal! As we saw with the previous article.
So, my conclusion is still the same: practical concerns trump theoretical complexity. But as a bonus conclusion, it’s okay to revisit things to see if they’ve changed.
Oh, and you might also want to consider that just because the parts of your algorithm have a certain complexity factor doesn’t mean that’s what the runtime performance will be; it’s easy to make something accidentally quadratic.
#complexity #performance

Big O notation is a commonly-used notation in the field of complexity. Roughly, saying that an algorithm is \(\mathcal{O}(f(N))\) means that for an input of size \(N\), the overall time taken will grow at a rate that is at worst \(f(N)\).
Or, in other words, if you zoom out on the graph of time-taken by an algorithm, you can overlay a curve of \(f(N)\) that is scaled in some way such that the time taken by the algorithm will always be underneath that curve.
More formally, \(f(x)\) is \(\mathcal{O}(g(x))\) if there are some values \(K\) and \(x_0\) such that \(f(x) \leq K \cdot g(x)\) for all \(x \geq x_0\).
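For example, \(f(x) = 3x^2 + 5x\) is \(\mathcal{O}(x^2)\): taking \(K = 4\) and \(x_0 = 5\), we have \(3x^2 + 5x \leq 4x^2\) whenever \(x \geq 5\).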
Not that much, in practical terms.
If a function is \(\mathcal{O}(N)\), then it means that for a sufficiently-large input of size \(N\), if you double the input then it will take roughly twice as long to execute.
Similarly, if it’s \(\mathcal{O}(N^2)\), then for a sufficiently-large input of size \(N\), if you double the input then it will take roughly four times as long to execute.
And so on.
It allows you to sort various algorithms based on their overall complexity class; whatever term is most-dominating in an algorithm is what will eventually dominate in terms of runtime, for a sufficiently-large input. If your algorithm has two phases, one that’s \(\mathcal{O}(N)\) and one that’s \(\mathcal{O}(N^2)\), then over the long term, for a sufficiently-large input, the \(\mathcal{O}(N^2)\) portion will dominate in terms of runtime, and so the algorithm as a whole is \(\mathcal{O}(N^2)\).
Oh heck no. Remember, these are just expressions of the dominant factor over long-term increases.
Bubble sort is \(\mathcal{O}(N^2)\) so we should never use it, right?
Well, not necessarily. If you’re only ever sorting small arrays – like on the order of 10 elements or so – it can quite likely be a lot faster than something like merge sort or quicksort. (Incidentally, quicksort is also \(\mathcal{O}(N^2)\) by a strict definition; the average case tends towards \(\mathcal{O}(N \lg{} N)\) but that’s assuming pivot selections that cut your sets in half every time. And doing a perfect pivot selection turns out to be really hard.)
Remember, \(\mathcal{O}\) only describes the long-term growth of an algorithm. For an extreme example, trying to find a string that generates a specific fixed-size hash (colloquially called “reversing a hash” but that’s also a terrible misnomer) is \(\mathcal{O}(1)\) – after all, the search space is always a constant size. There will only ever be \(2^{128}\) possible md5 hashes, so all you have to do to find a string that produces a particular md5sum (given no other inputs) is iterate over every 128-bit number represented as strings, right?
Well, that’ll take quite a long time to do. But it’ll always take on average the same amount of time! (Technically, a random amount which takes on average the same amount of time as generating \(2^{127}\) hashes.)
For a more practical example: general-purpose comparison-based sorting is provably only ever going to be, at best, \(\mathcal{O}(N \lg{} N)\). But there are actually sorting algorithms that execute with lower complexity; for example, radix sort can be implemented to take only \(\mathcal{O}(N)\) time, at least given a fixed maximum key size. So, why don’t we always use radix sort when possible? Because in the general case, it’s hecking slow. It just happens to appear to scale very nicely.
UPDATE: It turns out that there is actually an optimization of radix sort that is generally much faster, and which sort of invalidates this test (but not really).
For example, here are some graphs of std::sort
vs. a radix sort using both a 4-bit and an 8-bit radix, for various input sizes; consider that regardless of radix size, radix sort is \(\mathcal{O}(N)\) while std::sort
is \(\mathcal{O}(N \lg{} N)\): (source code, raw data)
So, at least with this simple test, it takes an input size of around 40,000 before a 4-bit radix beats std::sort
, and around 200,000 before 8-bit radix beats std::sort
– even though both radix sorts are \(\mathcal{O}(N)\) compared to the \(\mathcal{O}(N \lg{} N)\) of std::sort
.
And it only gets worse since radix sort has various time jumps up around certain power-of-two boundaries; this is because as the buckets get bigger, more allocations (and reallocations) need to occur as they hit power-of-two boundary sizes, as well as the various cache thrashing that occurs as a result. This also reflects the fact that radix sort also requires a lot more memory than a comparison sort; radix sort can take up to two times the size of the input array in addition to the input itself, whereas the comparison sorts generally do as much in-place as possible.
(You can smooth out the power-of-two boundary issues by using a linked list for the buckets instead of a vector, incidentally, but that only hurts performance – by a factor of around 30, in my tests – and significantly increases memory usage. But at least it makes the graphs slightly prettier.)
The bucket-size overhead of 4-bit radix also makes it scale up much more quickly than 8-bit, and after a certain point it’s not even worth looking at the 4-bit radix time since it’s so much slower. But at least it scales linearly! Maybe after a few billion numbers it’ll start to beat std::sort
again.
And yes, std::sort
does scale with \(N \lg{} N\), and radix does scale linearly:
Another thing to note is that the lower-big O algorithms tend to also be much more complicated to implement and maintain. Using std::sort
is literally one line of code; my radix sort implementation is 15 lines just for sorting numbers, without comments, and using an algorithm that doesn’t make intuitive sense to most people. (Granted, if you were to implement std::sort
yourself it’d be a lot longer, but the point is that you don’t have to!)
A general rule of thumb is that the more lines of code you’re implementing yourself, the slower it’s likely to be in the common case, and the worse it is to maintain. Standard library implementations and idioms exist for a reason. Sure, it isn’t always the case that shorter code will be faster (after all, bubble sort is easily done in around 5 lines of code but is way worse than radix sort, performance-wise) but why implement something that at best only slightly improves performance while taking on extra maintenance burden?
This gets even more extreme in scripting languages like Python; an idiomatic sort of a list in Python is just sorted(input)
and goes through mostly-native code, which has been optimized to bits by compilers or even going through standard library functions. Implementing your own sort, no matter how efficient, is going to still go entirely through the Python interpreter.
I hope that rather than worrying only about the theoretical computational complexity of an algorithm, you’ll consider using it as a guideline for implementation and pay more attention to the practical performance.
#complexity #performance

…This way is very, very wrong.
First, there are two approaches which work really well that I’ll describe briefly; one is the Fisher-Yates shuffle algorithm, which works by walking down the array and randomly swapping the current element with any of the elements which follow it (including itself). This is an \(\mathcal{O}(N)\) operation which is trivial to implement in pretty much any imperative language and can be done in-place.
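A minimal sketch of Fisher-Yates in C++ (the choice of RNG and element type here is just illustrative):

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Fisher-Yates: swap element i with a uniformly-chosen element in [i, N),
// which may be itself. One pass, in place, O(N).
void fisher_yates_shuffle(std::vector<int>& values, std::mt19937& rng)
{
    for (size_t i = 0; i + 1 < values.size(); ++i) {
        std::uniform_int_distribution<size_t> pick(i, values.size() - 1);
        std::swap(values[i], values[pick(rng)]);
    }
}
```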
If you are in an environment which does not allow array-level swapping like that (such as, say, a functional programming language or a spreadsheet), an alternate approach is to map the array to one where each element is prefixed with a random number, then sort based on that random prefix, and then remove the prefix. Here is how to do it in Excel, and how to do it in Fuzzball MPI. This generally is an \(\mathcal{O}(N \log_2 N)\) operation: not as efficient, but it still generates high-quality results.
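The same idea sketched in C++ rather than a spreadsheet: decorate each element with a random key, sort by the key, then throw the key away.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

std::vector<int> random_prefix_shuffle(const std::vector<int>& values, std::mt19937& rng)
{
    // Pair each value with a random key...
    std::vector<std::pair<uint32_t, int>> decorated;
    decorated.reserve(values.size());
    for (int v : values)
        decorated.emplace_back(rng(), v);

    // ...sort by the key...
    std::sort(decorated.begin(), decorated.end());

    // ...and strip the key back off.
    std::vector<int> shuffled;
    shuffled.reserve(decorated.size());
    for (const auto& p : decorated)
        shuffled.push_back(p.second);
    return shuffled;
}
```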
But why are these approaches better? Well, both of them have an equal chance of any input cell ending up in any output cell; there is absolutely no bias to where things might end up, because all of their placements are independent from one another, and each placement happens effectively once.
But the coin-flip sorting approach introduces a lot of bias; cells can move an arbitrary number of times, and where they can move to depends on where they are located within the array to begin with.
So, let’s measure this bias using a \(\chi^2\) test.
Briefly, \(\chi^2\) measures how close your output distribution is to an expected distribution. If we were shuffling our lists uniformly, then we’d expect every value to end up in every cell about the same number of times; for an array of length \(N\), with a shuffle repeated \(K\) times, you’d expect each value to be in each cell \(\frac{K}{N}\) times.
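Concretely, if \(O_{ij}\) is the number of times value \(i\) actually landed in cell \(j\), and \(E = \frac{K}{N}\) is the expected count, the statistic is \(\chi^2 = \sum_{i,j} \frac{(O_{ij} - E)^2}{E}\); the further it is from zero, the further the shuffle is from uniform.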
Here is what the \(\chi^2\) distribution is for the good algorithms, with \(N=235\) and \(K=100000\):
And here it is for three different random-sort algorithms, bubble sort, std::stable_sort
, and merge sort, all using a coin-toss ordering function:
(The reason I went with std::stable_sort
instead of std::sort
is that the latter uses some optimizations that assume that the sort is transitive – which isn’t the case when you’re randomly flipping a coin. So std::sort
tends to crash when this constraint is violated.)
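For reference, the coin-toss ordering amounts to something like this; the comparator deliberately is not a strict weak ordering, which is exactly the behavior being measured:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// The "random sort" anti-pattern: shuffle by sorting with a coin-flip comparator.
void coin_toss_shuffle(std::vector<int>& values, std::mt19937& rng)
{
    std::bernoulli_distribution coin(0.5);
    std::stable_sort(values.begin(), values.end(),
                     [&](int, int) { return coin(rng); });
}
```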
For a better comparison of how the different algorithms stack up as \(N\) changes, here’s a chart:
| N | Fisher-Yates | Random prefix | Bubble sort | std::stable_sort | Merge sort |
|---|---|---|---|---|---|
| 4 | 0.00005 | 0.00004 | 0.139569 | 0.158157 | 0.00004 |
| 5 | 0.00002 | 0.00005 | 0.218803 | 0.273358 | 0.0487565 |
| 6 | 0.00006 | 0.00008 | 0.29487 | 0.400601 | 0.0330905 |
| 7 | 0.0000 | 0.00006 | 0.364549 | 0.527574 | 0.0342361 |
| 8 | 0.00006 | 0.00006 | 0.430965 | 0.660637 | 0.00005 |
| 10 | 0.0001 | 0.00010 | 0.557523 | 0.93249 | 0.0335068 |
| 12 | 0.000127262 | 0.000108561 | 0.674842 | 1.19897 | 0.024186 |
| 15 | 0.000148504 | 0.000128046 | 0.831853 | 1.60815 | 0.0178444 |
| 18 | 0.000186571 | 0.000189251 | 0.971268 | 2.01685 | 0.0259307 |
| 22 | 0.000212744 | 0.000221075 | 1.14734 | 2.56097 | 0.0321786 |
| 27 | 0.000249649 | 0.000268954 | 1.34988 | 3.24243 | 0.0271932 |
| 33 | 0.000348344 | 0.000328459 | 1.56487 | 4.06064 | 0.0116926 |
| 41 | 0.000417497 | 0.000404196 | 1.82521 | 5.1443 | 0.0301371 |
| 51 | 0.000494652 | 0.000514062 | 2.11672 | 6.50646 | 0.026951 |
| 63 | 0.000612439 | 0.000614844 | 2.4303 | 8.14154 | 0.0057482 |
| 78 | 0.000753832 | 0.000759035 | 2.7903 | 10.1829 | 0.0270242 |
| 97 | 0.000952913 | 0.000986311 | 3.19424 | 12.7711 | 0.022091 |
| 121 | 0.0011958 | 0.0012227 | 3.65048 | 16.0263 | 0.0137545 |
| 151 | 0.00147693 | 0.0015004 | 4.16317 | 2.2788 | 0.026883 |
| 188 | 0.00185188 | 0.00186338 | 4.7293 | 2.72105 | 0.0242407 |
| 235 | 0.00231242 | 0.00233842 | 5.3744 | 3.24268 | 0.0189719 |
And here is a logarithmic graph showing how the total \(\chi^2\) error changes as \(N\) increases:
So, as N increases, you would expect the \(\chi^2\) to increase if \(K\) remains constant, as it did in this case (as there are fewer overlapping samples). And in fact the bubble sort and std::stable_sort
approaches increase pretty much in parallel with the Fisher-Yates and random prefix approaches (aside from the discontinuity at \(N=128\) due to the performance-tuning heuristics that algorithm uses) – but this is in a logarithmic graph, meaning that the random sorts increase orders of magnitude faster than the other ones.
Merge sort is a bit interesting, in that it is highly dependent on the actual size of the array, and in particular how close it is to a power of 2, with perfect powers of 2 having the same \(\chi^2\) as the ideal algorithms. (I leave this as an exercise to the reader to consider why this would be.) But there’s still a lot of inherent bias, and it’s not even casually-predictable bias.
Anyway, this is why when shuffling a list, you should use Fisher-Yates if possible, or sort based on a randomly-generated unassociated number if not.
(Here is the rather ugly source code if you want to play with it yourself.)
#statistics #boredom #randomness

A big reason why is that there don’t actually seem to be any useful in-memory indexed table libraries for Python. Or many other languages.
Back in the 90s when I was a larval software engineer, the common practice for teaching data structures always started with the basics: linked lists, then binary trees (first naïve, then self-balancing ones such as AVL or red-black trees, possibly with a foray into B-trees), and then onto hash tables, typically starting with fixed hashes and then moving on to dynamic hashes. Hash tables themselves were often even treated as an afterthought, a performance optimization that wasn’t useful in the general case! In the meantime, algorithms based on binary search on sorted lists and the like would also be a major part of the curriculum.
Somewhere along the lines, though, it seems like everyone has moved away from ordered, indexed structures and towards everything being based on hash tables. Somehow as an industry we’ve all decided that it’s only important to be able to do \(\mathcal{O}(1)\) constant-time lookup on fixed keys with no ordering necessary between them.1
C++ and Java both provide ordered associative structures. std::map
and TreeMap<T>
, for example. But whenever you ask a programmer (especially a younger one) about them these days, people just point out that they’re “slow” because they’re \(\mathcal{O}(\log_2 N)\) for key lookup, and that you should only use std::unordered_map
or HashMap<T>
instead, because they’re \(\mathcal{O}(1)\).
But focusing on this forgets a few things:
Ordered indexes are really freaking useful for a lot of stuff. For example, trying to find all of the keys matching a prefix, or finding values which are greater than or less than a key (or doing a range query in general). Or quickly finding the next or previous entry in a content store.
All of these things require a full table scan – meaning an \(\mathcal{O}(N)\) operation – in a hash-only scenario. Or if you want to do a range query and then sort it at the end, it takes \(\mathcal{O}(N \log_2 N)\), as you have to first filter the table (which is \(\mathcal{O}(N)\)) and then sort it (which is \(\mathcal{O}(N \log_2 N)\)). How is this more efficient than just using an \(\mathcal{O}(\log_2 N)\) one-time lookup?
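For instance, with an ordered map a range query is two \(\mathcal{O}(\log_2 N)\) lookups plus a walk over the matches, which come out already in key order (the key and row types here are just illustrative):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Index = std::map<std::string, uint64_t>;  // key -> row id, kept sorted

// All rows whose key falls in [lo, hi): find the start in O(log N), then iterate.
std::vector<uint64_t> rows_in_range(const Index& index,
                                    const std::string& lo, const std::string& hi)
{
    std::vector<uint64_t> rows;
    for (auto it = index.lower_bound(lo); it != index.end() && it->first < hi; ++it)
        rows.push_back(it->second);
    return rows;  // already in key order; no post-filter sort needed
}
```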
One of my recurring duties as a full-time software engineer was to interview other engineers as job candidates. One of my go-to problems was writing a Boggle solver. There are generally three phases to the solution:
Phase 3 is usually the hard part that most candidates have the most trouble with. There are very simple solutions to this problem, though. You could start out by sorting the wordlist in an array (\(\mathcal{O}(N \log_2 N)\)) and then do an \(\mathcal{O}(\log_2 N)\) prefix search for each word, or you can store it in a tree-type map (also \(\mathcal{O}(N \log_2 N)\) for the initial storage) and then do an \(\mathcal{O}(\log_2 N)\) lower-bound search for each word, or you can do what most candidates go with and either store the word list in a hash table (\(\mathcal{O}(N)\)) and do a search through the whole table for every check (also \(\mathcal{O}(N)\)), or they build a trie to store all the words with a flag as to whether a node is a leaf (i.e. if the word is complete), which is essentially an \(\mathcal{O}(N)\) initial phase and an \(\mathcal{O}(L)\) lookup (where \(L\) is the length of the word). These are all acceptable solutions, but the amount of code you have to write for each thing is… highly variable.
For example, in C++, here is how you do a prefix search on a std::set
:
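Something along these lines (a sketch of the idea rather than the exact listing):

```cpp
#include <set>
#include <string>

// A prefix is present iff the first word not less than it actually starts with it.
bool has_prefix(const std::set<std::string>& words, const std::string& prefix)
{
    auto it = words.lower_bound(prefix);
    return it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
}
```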
Or here is how you do it in Java with a TreeSet<String>
:
Or if you’re in a language without a tree-based set concept, such as Python, and you store your dictionary in a sorted list:
I’m not sure where along the line this concept got lost, or why whenever I ask people about indexed data structures in their language of choice they look at me like I’ve grown an extra head or two. Even among engineers who are aware of indexed data structures, the pushback I get is that it’s not really relevant, and that we should all just be building everything out of hash tables or doing full table scans all the time. I feel like this all came from about a decade ago when suddenly Map-Reduce thinking took over the industry, and all software development was based around Big Data and Massive Scaling and parallel clustering of stuff. Which is just really weird to me. Like, not all problems exist at that scale, you know?
(And I mean Map-Reduce is useful at small scales too, it’s just not the universal way of reasoning about things. Especially when the excuse for it boils down to, “Eh, there’s CPU power to spare, why bother?”)
When trying to introspect into software engineers who just rely on a hash table for everything, I try to figure out if they even know how a hash table works.
Almost without fail, candidates who are asked about the underlying data structure end up starting out by taking the key, building a hash on it, and then inserting the key into a binary tree. This is pretty maddening! It means that they’re paying for the complexity of a binary tree and having the limited capabilities of a hash table – a worst of both worlds situation. No in-order traversal combined with \(\mathcal{O}(\log_2 N)\) lookups. And when I complain to others about candidates not understanding these fundamentals, the feedback I get is that maybe I shouldn’t be expecting people to remember introductory computer science courses (usually with an implication that the person I’m talking to doesn’t know it either).
How have we gotten to this point?
Anyway, just out of curiosity I decided to do a timing comparison between a few different ways of implementing a Boggle solver; the only difference between these algorithms is the implementation of has_prefix
based on the data structure in question. (Also note that the code itself is crusty stuff I wrote many years ago for a very specific challenge and all I’ve done with it for this article is to change the data structures in use. It is certainly not high-quality code and I suspect if I were to actually look at most of the code I’d be a bit aghast at it right now.)
Using an ordered set:
Using a sorted vector:
Using a hash table:
Using a hand-rolled trie:
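The ordered-set version is the same lower_bound trick shown earlier; the other three look roughly like this (simplified sketches rather than the original listings, with the trie stripped down to lowercase a-z words):

```cpp
#include <algorithm>
#include <array>
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

// Sorted vector: same idea as the set, but lower_bound over contiguous storage.
bool has_prefix(const std::vector<std::string>& words, const std::string& prefix)
{
    auto it = std::lower_bound(words.begin(), words.end(), prefix);
    return it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
}

// Hash table: no ordering to exploit, so the prefix check is a full scan.
bool has_prefix(const std::unordered_set<std::string>& words, const std::string& prefix)
{
    for (const auto& word : words)
        if (word.compare(0, prefix.size(), prefix) == 0)
            return true;
    return false;
}

// Hand-rolled trie, assuming lowercase words only.
struct TrieNode {
    bool is_word = false;
    std::array<std::unique_ptr<TrieNode>, 26> children;
};

void trie_insert(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word) {
        auto& child = node->children[c - 'a'];
        if (!child)
            child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_word = true;
}

// A prefix exists iff we can walk the whole prefix without falling off the trie.
bool has_prefix(const TrieNode& root, const std::string& prefix)
{
    const TrieNode* node = &root;
    for (char c : prefix) {
        node = node->children[c - 'a'].get();
        if (!node)
            return false;
    }
    return true;
}
```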
In the above solutions, the surprising thing is that the sorted vector is so much faster than the ordered set; however, what’s not surprising is that both of those are many orders of magnitude faster than the hash table approach, and that the trie approach is somewhat slower than the ordered set. Granted, the trie implementation could be a bit better (for example, the actual search algorithm could track where it is in the trie rather than searching from root every time), but the amount of code written is also important:
So, the sorted vector is only slightly more complicated to implement (in fact the only line count difference is the call to std::sort
after the dictionary is loaded – which is actually not even necessary if your dictionary is sorted to begin with), and yet it’s the fastest of all these by far.
Okay, so that algorithm doesn’t actually make use of an indexed data structure. But the thought process that leads to implementing it is along the same lines as the thought processes that leads to using an ordered indexed data structure; in effect, the vector
is an index, viewed in the greater sense. And, whenever I’ve interviewed a candidate with this problem, not one has gone with a sorted array and a binary search! (I have had a couple at least go with an ordered set
though. But nearly everyone goes with the trie – and most of them have never even heard of a trie before, and just sort of, like, invent it on the spot. Which is cool, but still…)
Also, the sorted vector
approach only really works performance-wise if your input set is static. Once you start adding stuff to it, well, each addition is a potentially \(\mathcal{O}(N)\) operation, which can end up becoming incredibly costly very fast. For example, if you’re writing, say, a software load balancer, and your table keeps track of your backing servers' load levels, every update to that requires changing a row in the index, and if you have a lot of rows (say, you’re working at the sort of scale where map-reduce thinking takes over), every single update starts to add up very quickly.
Anyway, with just standard C++ you can build your own indexes yourself, or you can use Boost.MultiIndex, which maintains arbitrarily many indexes for you. It’s pretty neat.
Anyway, back to my original conundrum. I wanted to investigate moving Publ’s data store into an in-memory table, with various indexes for the necessary sort criteria. The easiest approach was to stick with a database, despite that having very poor encapsulation (since the object storage is essentially global state). But what else could I have done?
When I was asking around, everyone kept on pointing me towards collections.OrderedDict
, which is a dict
which maintains the order of its keys. But by “maintains the order” it actually means maintaining the insertion order, not any ongoing sort order. This isn’t actually useful for the purpose of maintaining an index. (And even if it were, it doesn’t provide any primitives for performing a range query or sibling iteration or whatever.)
However, the blist
package is a rather comprehensive B-tree implementation, and in particular it provides a sorteddict
, which does indeed maintain the sort order of its keys. It also provides a KeysView
which allows you to bisect_left
the keys themselves, in order to traverse items starting at a particular point. It’s certainly not the worst thing to do. So, in the future, if I ever decide to get rid of an ORM entirely, this is what I will probably focus on using. However, this adds a bit more complexity in that if you know your sort key is going to change at all, you need to remember to remove your item from the b-tree and then readd it after it’s been updated. (Fortunately this isn’t all that different from a typical CRUD mechanism anyway.)
And of course, Publ’s use case calls for a lot of reads and not a lot of writes, so it’s also not that big a deal to periodically rebuild indexes as a sorted list and just use bisect_left
that way. It’s still something that needs to be managed directly, though.
Maybe if/when I decide to de-ORMify Publ I’ll just end up writing a library to make indexed tables easier to manage. I sure as heck can’t find anything that exists as it is. (If anyone does know of anything, please let me know!)
(I mean, unless I can find something that’s similar to Berkeley DB and is actually supported in Python 3 and is MIT-license-compatible…)
(note to self: lmdb is a good candidate)
As mentioned previously, Java provides both TreeMap/TreeSet
and HashMap/HashSet
. But for some reason people are constantly taught to only use the HashMap/HashSet
versions, and this leads to people never even knowing that TreeMap/TreeSet
even exist or even consider why they might want to use them. I find this incredibly baffling.
Lua and JavaScript are very popular because of their simplicity; both of them make the same simplifying assumption in that all data structures are hash tables – including basic arrays. In many cases these get optimized under the hood to act as basic arrays but there are also many situations where that ends up falling apart, and this is why in Lua in particular you generally want to use ipairs
instead of pairs
, especially if order matters.
The result of this is that there’s also absolutely no concept of an indexed, ordered associative array. Either your array traversal is out-of-order (or, in JavaScript, insertion-order, as JS’s array acts more or less like Python’s collections.OrderedDict
, except when it doesn’t), or you’re getting all of your keys out of the array, sorting that, and then iterating on that array instead. Which adds even more layers of complexity, and reduces performance further.
In Lua you’re not likely to be writing anything which makes use of indexed structures (and if you are, you’re probably implementing that stuff in C or C++ and calling into it with ffi
), but JavaScript? That’s used to run a lot of web services, and while node.js provides an ffi
binding, that’s generally seen as a last resort. So what do people do when they need to handle indexes in node.js services?
Well, in my experience, they seem to just defer it to an ORM, or shrug and suffer the poor performance.
I haven’t touched Go in a long time, but the last time I did, the golang orthodoxy was that you don’t need ordered indexes. From some cursory websearches I’m finding that this appears to still be the case.
Fortunately, there are container libraries which correct this. It looks like TreeMap
even now provides floor
and ceiling
which can then be used to implement range queries. It looks like it only gained this functionality incredibly recently (i.e. a month ago as of this writing).
C# provides a Dictionary
class.
You know what you can do with a real-life dictionary? You can quickly find a word you want, and then see which words come before and after it.
You know what you can’t do with a C# Dictionary
?
Okay, so C# does also provide OrderedDictionary
but this doesn’t, as far as I can tell, provide any iteration methods for getting the next or previous entries, or any sort of range queries at all. Presumably these are possible via LINQ, though.
So yeah uh. It’s so weird to me that this is the way software has gone. Everything’s a big pile of hash tables and nobody is expected to treat anything differently or learn any algorithms that make use of other storage representations. I find that weird and sad.
As an aside, I don’t think it has anything to do with experience, and everything to do with the environment in which people are being taught data structures; my CS education was very much focused on the way that data structures and algorithms work, but it seems like all of the younger programmers I talk to (oh god I sound so old now) were taught that data structures themselves aren’t that important beyond being an implementation detail or the basic theory, and furthermore that computers have such vast resources available that you don’t really need to care about this (oh god I am so old now). ↩
Whenever I build a piece of software for the web, almost invariably somebody asks why I’m not using PHP to do it. While much has been written on this subject from a standpoint of what’s wrong with the language (and with which I agree quite a lot!), that isn’t, to me, the core of the problem with PHP on the web.
So, I want to talk a bit about some of the more fundamental issues with PHP, which actually go back well before PHP even existed and are inextricably linked with the way PHP applications themselves are installed and run.
(I will be glossing over a lot of details here.)
Back when the web was first created, it was all based around serving up static
files. You’d have an HTML file (usually served up from a public_html
directory
inside your user account on some server you had access to, which was sometimes
named or aliased www
but more often was just some random machine living on
your university’s network), and it acted much like a simplified version of FTP —
someone would go to a URL like http://example.com/~username/
and you’d see an
ugly directory index of the files in there (if you didn’t override it with an
index.html
or, more often in those days, index.htm
), and then someone would
click on the page they wanted to look at like homepage3.html
and it would
retrieve this file and whatever flaming skull .gif files it linked to in an
<img>
tag and the copy of canyon.mid
you put an <embed>
around, and that
would be that. The web server was really just a file server that happened to
speak HTTP.
Then one day, servers started supporting things called SSIs, short for “server-side includes.” This let you do some very simple templatization of your site; the server wouldn’t just serve up the HTML file directly, but it would scan it for simple SSI tags that told the server to replace this tag with another file, so that you could, for example, have a single navigation header that was shared between all your pages, and a common footer or whatever.
But this mechanism was still pretty limited, and so about two minutes later
someone came up with the idea of the Common Gateway
Interface, or CGI; this
would make it so the server would see a special URL like /cgi-bin/formail.pl
and instead of serving up the content of the file, it would run the file as a
separate program and serve up its output.
At this time, HTTP generally used just a single verb, GET
, which would get a
resource. CGI needed a way of passing in parameters to the program. Instead of
just running the program like a command line (which would be very insecure),
they passed in parameters through environment variables; for example, if the
user requested the “file” at /cgi-bin/formail.pl?email=fwiffo@example.com&text=Hi+I+like+your+site!
, the web
server would set the environment variable QUERY_STRING
to the value of
everything after the ?
, which formail.pl
would then parse out.
If the POST
verb were used instead, then the server would also read some
additional data from the user’s web browser and then send that to the script via
its standard input.
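As a rough sketch of what that looks like from the program’s side, here is a hypothetical toy CGI program (not anything from a real formail.pl):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// A toy CGI program: the query string arrives in an environment variable,
// any POST body arrives on stdin, and whatever we print (headers first)
// is what the web server hands back to the browser.
int main()
{
    const char* query = std::getenv("QUERY_STRING");   // e.g. "email=...&text=..."

    std::string body;
    if (const char* length = std::getenv("CONTENT_LENGTH")) {
        int n = std::atoi(length);
        if (n > 0) {
            body.resize(n);
            std::cin.read(&body[0], n);                 // POST data, if any
        }
    }

    std::cout << "Content-Type: text/html\r\n\r\n";
    std::cout << "<p>Query string: " << (query ? query : "") << "</p>\n";
}
```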
Basically, the web server was no longer just a file server, but a primitive command processor.
Back when this first started, system administrators knew better than to let just anyone run just any program from the web server. After all, people might do silly things like make it very easy to execute arbitrary commands on the server — and since the web server often ran as the root/administrator user, this would be very bad indeed. Even the admins who were savvy enough to set up a special sandbox user for the HTTP server would still need it to run everything from a common, trusted account that might have had access to common areas of the server.
So, the usual approach was to have just a single /cgi-bin/
directory with
trusted programs that were vetted and installed by the administrator, for
things that they felt were important or useful for everyone to have. Usually
this would be things like standard guest books (the great-great-grandfather to
comment sections) or email contact forms (since spam was starting to become a
problem and it was already dangerous to put your email address on the public
web).
Back in these days people generally didn’t have a database — after all, Oracle
was expensive — and it didn’t really matter anyway; if you wanted to have a
complex website you’d just run some sort of static site generator (which was
often written in tcsh or Perl or something) and if you needed scheduled posts
you’d do it by having a cron
job periodically update things. So, it wasn’t
really that much of an impediment to have this setup.
If you were really savvy and wanted to run, say, an interactive online
multiplayer game of your own design, you’d simply run your own server (often
under your desk in your dorm room) and you’d have root access and could install
everything you wanted in /cgi-bin/
.
Because everything in /cgi-bin/
was run as a program, you knew better than
to let your scripts save other files into that same directory; if it was a thing
where people could upload files or post comments, it’s not like it would do any
good anyway (since then the server would try to run them as programs, and you
can’t run a .jpg).
Then as the web really started to take off, shared hosting providers started
appearing, and CGI access became a pretty commonly-requested high-end feature.
Generally the shared hosting providers didn’t want to let just anyone upload a
script to be run by the server, but they also didn’t want to have to manually
vet each and every script that users wanted to install. So, as a compromise,
they set up special rules so that within your own server space you could have a
/cgi-bin/
directory and that things run from that directory would run under
your account, rather than as the web server (using a mechanism called suexec
).
This provided a pretty good compromise; users still had to know what they were
doing in order to install their scripts, but they still ran from a little
sandboxed location, and because of the way suexec
worked it was pretty
unlikely for even a very badly-written script to cause problems, because if the
script tried to save out an executable file into the cgi-bin
directory, it
wouldn’t be saved out with execute privileges, so it would just cause an error
500 to occur. After all, /cgi-bin/picture.jpg
wasn’t a program, so why should
it run?
But then things started to get a little more complicated. People wanted their
main index page to be able to run as a script, without it forwarding the page to
/cgi-bin/index.pl
or whatever.
So, another compromise happened: the CGI mechanism, which previously was set up
to only run the scripts from the /cgi-bin/
directory, got a few new rules,
such as “if the filename ends in .cgi
(or other common extensions like .pl
or .py
) run it as a script.” It still needed permissions to be set correctly,
though, and by this point suexec
was generally set up so that there were even
more rigorous checks before it would run the script. And there were so many
safety checks in place that this was still generally okay.
Around this time it also started becoming common to have access to a database such as mySQL or Postgresql, which allowed more flexibility and more two-way content. Forums became a thing. So did early blogs. Most of this software started out by having the database just for storage and the software would simply write out static files, but this started to have scaling problems and the web server got busy with the software writing these files out all the time, so it became more common for the software to simply read from the database directly as it ran. This helped somewhat, but it also shifted a significant amount of load over to constantly establishing short-lived database connections, because every time the forum program ran it had to connect.
At some point, PHP started to get popular.
PHP itself was originally intended as another way of adding server-side
scripting into HTML files; it was in effect a templating system for HTML. In the
earliest days it was often just treated as another scripting language; the
server would be configured to consider .php
as another name for .cgi
or
.pl
or whatever, and the file would still be run as a script. In some cases it
even needed to start with #!/usr/local/bin/php
and it needed to be set
executable with the correct permissions and so on (although this setup was
uncommon).
However, most sites used mod_php
, a server extension that allowed the web
server to handle PHP files directly. In many respects it was very similar to
mod_cgi
, except it did a few interesting things. One of the undeniable
benefits was that it was now able to maintain the database connection
persistently, rather than having to re-establish a connection every time a
script ran. It was also generally a bit nicer for speed because commonly-used
PHP scripts could stay in memory and not have to be re-interpreted every time a
page was loaded.
But there were a couple of other implications this led to. In particular:
There were a few different variations on this and it didn’t always just run PHP from the web server (for example, some of the better hosts figured out that they could have each user run their own separate per-user FastCGI server that would run the PHP programs as the separate users, or whatever) but regardless of the setup, you now had PHP always running and not having to care about the permissions of the file, meaning you now had some persistent process running what was essentially executable code without the usual safeguards that a shared server would have.
This actually seemed like a good thing at the time, but then many, many pieces of software started allowing arbitrary people to upload images, and often wouldn’t make sure that what was supposedly an image was actually an image…
And so that’s where we stand today.
This makes sites potentially vulnerable even if they aren’t written in PHP
themselves; for example, if your HTML directory permissions are set to be
slightly too permissive, and another site on the server gets hacked, that hacked
site can potentially be used to place a .php
file into your site, and since
mod_php
doesn’t check ownership permissions it now runs on your site with
whatever permissions PHP would normally run in your account. (And this isn’t
just a theoretical; I’ve had sites hacked in this way! Now I run a nightly
script that ensures that my directory permissions are correct and tells me about
new .php
files that appeared since the last check, just to be sure.)
So, long story short, one of the biggest problems with PHP isn’t with the language itself, but with the way that PHP gets run; people (and their bots) can find ways to upload arbitrary files with a .php extension and, if that upload is visible to the web server (which it often will be), then a request to view that file will execute that file, regardless of its origin, and from there it can do anything that your own site can.
Granted, the erroneously-executable upload feature is only responsible for some of the security exploits I’ve seen in the wild. I wasn’t really intending to get into language-specific issues (after all, I linked to much better, more-comprehensive articles about it in the introduction), but it’s worth mentioning some of them anyway, as I have seen all of these be used to hack websites I’ve helped to clean up and secure.
The biggest one: For a very long time, the include()
function would happily support
any arbitrary URL and would download and run whatever URL it was given. And it
was very easy for a PHP script to be accidentally written to allow an arbitrary
user to provide such an arbitrary URL. (And by “a very long time” I mean that
this was the default configuration until very recently, and many hosts still
configure it that way for backwards compatibility.)
Some might be looking at the PHP docs I linked to there and thinking, “wait, but it’s not running the PHP code locally.” What the docs mean is that if you do something like include('http://example.com/foo.php'); it’s the output of foo.php that gets included. However, that output could in turn be more PHP code, which would then be executed locally, meaning on your server. And PHP doesn’t even care what the file extension is; doing an include() on asdf.txt or pony.jpg will happily execute whatever <?php ?> blocks exist inside of it as well.
There’s also a few other features of PHP that lend themselves to arbitrary code
execution. One particularly fun one was the PCRE e
flag, which indicated
that the result of the regular expression should be executed as arbitrary code;
and as PCRE flags are embedded into the regular expression itself, a carefully-crafted search term (on a less-carefully-crafted search page) could run
arbitrary code. Fortunately, this has been removed in PHP 7; unfortunately, a
lot of web hosts still run PHP 5 (or older!) and so this option — which never
had a single legitimate usage — is still available on the vast majority of web
servers out there.
So, I originally wrote this article for the Publ blog, which implies that I’m trying to build a favorable comparison for Publ. And that’s a perfectly fine inference to take.
Publ is built on Flask, which runs in a self-contained server. There are a few different ways to deploy it such as WSGI (Web Server Gateway Interface) or by having a “reverse proxy” configuration or the like. This is a bit more complicated than I want to get into but the short version is that rather than the web server running a program based on the URL, Publ stays running as a standalone program that the web server sends commands to as requests come in. So, it’s never asking a file how it should be run, but instead it’s telling a single program to handle a request. So, there’s no danger of some random file being executed when it shouldn’t be.
“But wait,” you might ask, “isn’t that exactly what you were complaining about
mod_php
doing?” Well, sort of; mod_php
works by always having the PHP
interpreter running and able to execute whatever arbitrary code it comes across.
However, in this setup, code is kept separate from data. Loading a URL in Flask
isn’t mapping to a script file that gets loaded and run, it’s simply passing a
URL to a single fixed application that handles the URL accordingly. In the case
of Publ it loads a content file and formats it through a template.
Another thing that Flask does is it separates out template content (which is executable) from static file content. Static files aren’t executable by default. Templates can embed arbitrarily-complex code, but they can only use functions that are provided to them — there’s no direct access to the entire Python standard library, for example, and so the most dangerous functions aren’t included by default. (And Publ does not provide any of those functions either, at least not purposefully.)
Important note: When I say static files aren’t executable by default, this simply refers to how Publ sees them. If your site is configured to serve up static files where PHP or CGI scripts are executable, then any such scripts that end up in your static files will indeed be executable. This is going to be the case on pretty much any shared hosting provider, for example.
Also, regardless of the server setup, Publ can’t magically protect your content or template directory from outright misconfigurations with permissions. Even classic static sites need to be secured from third-party/unauthorized access.
Publ itself also only knows how to handle a handful of content formats —
Markdown, HTML, and images — and ignores everything else. So if a .php
file
somehow ends up in the content directory, it won’t matter at all — Publ just
ignores it. It will never attempt to run code that’s embedded in a content file,
nor does it even know how to. And Publ doesn’t handle arbitrary user
uploads anyway (nor is there any plan to ever support this); anything that would
be potentially hazardous would have been put there by some other means.
Publ’s design is basically just a fancy way of presenting static files, just like in the early days of the web. It just serves up the static files dynamically. Or, as I keep on saying, Publ is like a static publishing system, only dynamic.
(Of course, if your directory permissions are set wrong, someone can still use someone else’s exploited PHP-based site to attack your account and modify Publ’s code. But there’s nothing that Flask or Publ can do to prevent that, and this is just a general security problem that impacts everyone regardless of what they’re running.)
It would of course be foolish of me to claim that Publ itself is 100% secure and impossible to hack. And at least on Dreamhost there’s the very real possibility that somehow an arbitrary .php file gets injected into the static files (perhaps by an incorrect directory permission or whatever), which isn’t a flaw in Publ itself but the end result (a hacked site) is the same. So far as I can tell there’s no way to entirely disable PHP on a Dreamhost-based Publ instance, and it’s really the ability to run PHP that makes PHP so dangerous in this world.
So, I’m not going to claim that Publ is 100% secure or unhackable. But it sure has one heck of a head start.
#programming #php

(Also the software that powers this website.)

#tools #CMS #Python

…#pragma once and #ifndef guards vs. the argument of correctness or not (I was taking the side of #pragma once based on some relatively recent indoctrination to that end), I decided to finally test the theory that #pragma once is faster because the compiler doesn’t have to try to re-#include a file that had already been included.

For the test, I automatically generated 500 header files with complex interdependencies, and had a .c
file that #include
s them all. I ran the test three ways, once with just #ifndef
, once with just #pragma once
, and once with both. I performed the test on a fairly modern system (a 2014 MacBook Pro running OSX, using XCode’s bundled Clang, with the internal SSD).
First, the test code:
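Each generated header was shaped roughly like this (an illustrative reconstruction rather than the actual generator output; which guard lines are present varies by test run):

```cpp
// test123.h -- illustrative only; each generated header pulls in a handful
// of the other generated headers to create the interdependencies.
#ifndef TEST123_H        // present in the "#ifndef" and "both" runs
#define TEST123_H
#pragma once             // present in the "#pragma once" and "both" runs

#include "test017.h"
#include "test348.h"

static const int test123_value = 123;

#endif // TEST123_H
```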
And now, my various test runs:
As you can see, the versions with #pragma once
were indeed slightly faster to preprocess than the #ifndef
-only one, but the difference was quite negligible, and would be far overshadowed by the amount of time that actually building and linking the code would take. Perhaps with a large enough codebase it might actually lead to a difference in build times of a few seconds, but between modern compilers being able to optimize #ifndef
guards, the fact that OSes have good disk caches, and the increasing speeds of storage technology, it seems that the performance argument is moot, at least on a typical developer system in this day and age. Older and more exotic build environments (e.g. headers hosted on a network share, building from tape, etc.) may change the equation somewhat but in those circumstances it seems more useful to simply make a less fragile build environment in the first place.
The fact of the matter is, #ifndef
is standardized with standard behavior whereas #pragma once
is not, and #ifndef
also handles weird filesystem and search path corner cases whereas #pragma once
can get very confused by certain things, leading to incorrect behavior which the programmer has no control over. The main problem with #ifndef
is programmers choosing bad names for their guards (with name collisions and so on) and even then it’s quite possible for the consumer of an API to override those poor names using #undef
- not a perfect solution, perhaps, but it’s possible, whereas #pragma once
has no recourse if the compiler is erroneously culling an #include
.
Thus, even though #pragma once
is demonstrably (slightly) faster, I don’t agree that this in and of itself is a reason to use it over #ifndef
guards.
Let’s say you want to make a single-binary application that has embedded resources (images, GLSL shaders, etc.). Let’s say you want to automatically wrap your resources in a storage container to make it easier to deal with stuff. Let’s also say that you might even be using CMake as your build system.
CMake doesn’t provide a way of making a custom build rule, and using extern
data is a little unwieldy. So here’s an easy-ish way to do both parts, making use of C++11 language features (and a scary preprocessor hack).
The C++11 bit is also useful on its own, even if you aren’t using CMake, although things will have to be adapted to your build system of choice.
Note: Back when I wrote this the general compiler ecosystem was different, especially on macOS. If you just want a library that does all this stuff for you in a platform-independent manner, check out this resource embedding script. Or you might be interested in a CMake-only approach for the resource generation and using that in conjunction with the rest of this article.
resource.o
Using GNU toolchains, the easiest and most efficient way to convert an arbitrary binary into a .o file is with ld -r -b binary -o outfile.o infile
, which will generate a .o file with a few symbols that are based on the mangled full path of infile, of the form _binary_[mangled path]_start
and _binary_[mangled path]_end
. The mangling rule is just that any character that’s not a valid C symbol character turns into _
.
Note that this uses the full path of infile, though, and this is not settable at runtime; so, you should run the ld
command from the source root directory so that you end up with predictable but unique symbol names.
CMake doesn’t provide a facility for custom build rules. They apparently have a good reason for this. Whatever it is, it’s inconvenient. Fortunately, I managed to find a mailing list post that explains how to fake it. So here’s my CMake snippet that simulates a custom build rule:
So now you have a means of building and linking the symbols. Now you just have to make them usable, easily.
The typical usage would be to do something annoying like:
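Something like this, assuming the file that got linked in was src/test.txt (the symbol names follow the mangling rule described above):

```cpp
#include <cstddef>
#include <iostream>

// The start/end symbols that ld generated for src/test.txt:
extern const char _binary_src_test_txt_start[];
extern const char _binary_src_test_txt_end[];

void dump_test_txt()
{
    const size_t size = _binary_src_test_txt_end - _binary_src_test_txt_start;
    std::cout.write(_binary_src_test_txt_start, size);
}
```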
(There’s also a symbol _binary_test_txt_size
but it’s kind of unclear about how to actually use it and what the data type is and so on, and so just doing pointer math is a lot more predictable.)
This is okay if you only have a single resource, but that gets really freaking annoying if you’re trying to do a whole bunch of resources. So let’s wrap it! This is where a C++11 feature comes in handy:
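A hedged reconstruction of the idea; the names and details here are mine rather than the exact original:

```cpp
#include <cstddef>

class Resource {
public:
    Resource(const char* start, const char* end)
        : mData(start), mSize(static_cast<int>(end - start)) {}

    // Reference-returning accessors (the reason for this shows up below);
    // size is stored as an int so &size() can be handed to APIs that want a GLint*.
    const char* const& data() const { return mData; }
    const int& size() const { return mSize; }

    // Iterator accessors, so ranged-for and the STL algorithms still work.
    const char* begin() const { return mData; }
    const char* end() const { return mData + mSize; }

private:
    const char* mData;
    int mSize;
};

// LOAD_RESOURCE(src_test_txt) pastes the mangled name into the ld-generated
// symbols and immediately invokes a lambda that wraps them in a Resource.
#define LOAD_RESOURCE(x) ([]() {                                        \
        extern const char _binary_##x##_start, _binary_##x##_end;       \
        return Resource(&_binary_##x##_start, &_binary_##x##_end);      \
    }())
```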
Picking this apart, what it does is creates a generic “Resource” blob object that takes a start and end pointer, and provides a data pointer, a size accessor, and standard iterator accessors (so that you can still use STL access methods and ranged for
), and it defines a macro that defines and immediately calls a lambda that instantiates the resource wrapper (##
is the preprocessor concatenation operator). Using it is simple; let’s say that the resource’s path (relative to the top-level CMakeLists.txt
) is src/test.txt
:
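Hypothetical usage, continuing from the sketch above (the macro argument is just the path with the same mangling applied):

```cpp
#include <iostream>

void print_test_txt()
{
    Resource text = LOAD_RESOURCE(src_test_txt);   // wraps src/test.txt
    std::cout.write(text.data(), text.size());
}
```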
The astute may have wondered why the Resource::data
and Resource::size
accessors return references rather than values. The reason is because OpenGL doesn’t take a single pointer and length when specifying shader source; the API is:
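The signature is roughly the following (recent GL headers declare it this way; older ones differ slightly in const-ness):

```cpp
void glShaderSource(GLuint shader, GLsizei count,
                    const GLchar* const* string, const GLint* length);
```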
i.e. it takes an array of them. And having to assign the Resource
values to local lvalues just so that you can take the address is annoying; why do that when you could just do this?
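With the reference-returning accessors (and size() stored as an int), it can be as simple as this hypothetical snippet, assuming the wrapper sketched above and a working GL context:

```cpp
GLuint shader = glCreateShader(GL_VERTEX_SHADER);
Resource vert = LOAD_RESOURCE(src_shaders_test_vert);   // src/shaders/test.vert
glShaderSource(shader, 1, &vert.data(), &vert.size());  // addresses passed straight through
glCompileShader(shader);
```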
Okay, if you only want to load shaders like this and don’t care about supporting other resource types, you can simplify things a lot more (to the extent that you don’t even need C++11 anymore):
So now your code to load the shader is just:
…#include in your shaders?

Normally, to use #include
you need to make use of the shading_language_include
extension, which requires either pre-registering all of your includes or providing a virtual filesystem driver. Creating a virtual filesystem driver is difficult because you can’t (easily) dynamically generate symbol names to pull the included resource from. (It is possible using dlsym
and so on, but that’s annoying.)
Fortunately, you can run the C preprocessor on its own, and generate the preprocessed vert/frag files accordingly. (Obviously, any #include
d fragment will be repeated in your resulting binary every time it’s used. If this is a problem, use the dlsym
method instead.) This requires some changes to the CMake rules. I haven’t done this work just yet but it should be fairly straightforward to make a custom command PREPROCESS_RESOURCE
that runs cpp
on the resource file into the build directory (maintaining the appropriate subdirectory path) and then ADD_RESOURCE
can depend on PREPROCESS_RESOURCE
’s output instead (or you can make a separate command, ADD_PREPROCESSED_RESOURCE
, that does this all in one step or whatever). If I ever decide to implement this I’ll add it to this writeup, or someone else could post it in the comments, I suppose.
The select() API should have been deprecated years ago. While unsafe operations like sscanf(), sprintf(), gets(), and so forth all provide compile-time deprecation warnings, select() is also incredibly dangerous and has a more modern, safer replacement (poll()), yet people continue to use it.

The problem is that it doesn’t scale. In this case, “not scaling” doesn’t just mean it’s bad for performance, “not scaling” means it can destroy your call stack, crash your process, and leave it in a state that is incredibly difficult to debug.
There are actually two problems, and they are closely-related. The hint to both of them comes from the actual signature of the select function:
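Here it is, as declared in sys/select.h:

```cpp
#include <sys/select.h>

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);
```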
Do you see that nfds
parameter? Here is what the BSD libc documentation has to say about it:
The first
nfds
descriptors are checked in each set; i.e., the descriptors from 0 through nfds-1 in the descriptor sets are examined. (Example: If you have set two file descriptors “4” and “17”, nfds should not be “2”, but rather “17 + 1” or “18”.)
What this is hinting at is that fd_set’s implementation is not a container that contains a maximum number of sparsely-populated file descriptors; instead, it is a bit vector saying whether to check the fds at a given offset. That is, in pseudocode, people expect it to be something like:
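Something like this (illustrative types only, not the real declarations):

```cpp
#include <sys/select.h>   // for FD_SETSIZE

// Not the real definition, just the mental model many people have:
struct imagined_fd_set {
    int count;             // how many descriptors are in the list
    int fds[FD_SETSIZE];   // a sparse list of the descriptors to check
};
```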
but it’s really more like:
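Again illustrative, but this captures the actual shape:

```cpp
#include <sys/select.h>

// Much closer to reality: a plain bit vector, one bit per possible
// descriptor number, with nfds telling select() how far to scan.
struct actual_fd_set {
    unsigned long bits[FD_SETSIZE / (8 * sizeof(unsigned long))];
};
```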
So this immediately shows what the two problems are, but I’ll still explain it further (since this wasn’t enough when I was helping diagnose a problem with some critical production code that was breaking). Both of them involve what happens when your process has a large number of file descriptors open (which is very common in large-scale services).
Let’s say you are trying to check the state of a single socket with fd of 907 — which happens to be within FD_SETSIZE
on a modern Linux. So your code looks something like this:
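A sketch (sock is just the descriptor that happens to have the value 907):

```cpp
#include <sys/select.h>

void check_socket(int sock) {          // sock happens to be 907
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(sock, &readfds);            // the one and only fd we care about

    struct timeval tv;
    tv.tv_sec = 1; tv.tv_usec = 0;
    if (select(sock + 1, &readfds, nullptr, nullptr, &tv) > 0 &&
        FD_ISSET(sock, &readfds)) {
        // sock is readable
    }
}
```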
Here is what select() is doing internally (again, pseudocodeishly):
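Roughly this (the helper names are made up for illustration):

```cpp
for (int fd = 0; fd < nfds; fd++) {
    if (bit_is_set(readfds, fd)) {          // did the caller ask about this fd?
        bool ready = check_readable(fd);    // poll the descriptor
        set_bit(readfds, fd, ready);        // write the answer back into the set
    }
}
```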
In this case, since only fd 907 is actually being checked, it’s looping futilely over 906 entries which don’t need to be checked. If you’re doing select()
a lot, this can add up (remember that the structures described above are just pseudocode, and in reality there’s a lot of bitfiddling/masking/etc. going on as well).
But this isn’t the dire problem.
So let’s look at the same thing as above, except now the fd in question is, say, 2048, which is quite a lot larger than FD_SETSIZE
. Theoretically this is not possible, but in practice it is — it’s pretty simple to raise the socket ulimit for circumstances where you need a lot of connections open. (For example, in a large-scale distributed cache. Again, speaking from experience on this.)
Let’s annotate the code that calls it, first:
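A sketch, with the descriptor value made explicit:

```cpp
#include <sys/select.h>

void check_socket(int sock) {          // sock happens to be 2048 this time
    fd_set readfds;                    // only FD_SETSIZE (1024) bits of storage, on the stack
    FD_ZERO(&readfds);
    FD_SET(sock, &readfds);            // bit 2048: already past the end of readfds,
                                       // quietly scribbling on whatever else lives there
    struct timeval tv;
    tv.tv_sec = 1; tv.tv_usec = 0;
    select(sock + 1, &readfds, nullptr, nullptr, &tv);   // and this makes it much worse
}
```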
Why does this blow away the stack? Well, keep in mind what select()
is doing internally:
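It’s the same scan as before, only now nfds is 2049:

```cpp
for (int fd = 0; fd < 2049; fd++) {
    if (bit_is_set(readfds, fd)) {          // past bit 1023 this reads stack garbage
        bool ready = check_readable(fd);
        set_bit(readfds, fd, ready);        // and writes results back over your stack
    }
}
```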
That is, for all the values up to 2048 (which, remember, is outside of the fd_set
and well into your stack at this point), it’s checking a bit to see if a garbage fd needs to be checked (which is relatively harmless), checking that fd (which is relatively harmless but kills your performance), and then setting that value in memory based on whether the fd is in that particular state (which is, no matter what, essentially randomizing the stack).
Congratulations, now not only will you get a weird error, you won’t even be able to tell where it comes from because your debugger will give you complete garbage on the call trace.
Hindsight is 20/20, of course, but poll()
is the API that select()
should have been in the first place. When select()
was designed, UNIX had no asynchronous I/O mechanism at all, and the whole concept of a large-scale network server was unheard of. Everything was its own process, and you’d do something like listen(sock, 8)
so that only 8 connections could be pending at a time, and each of those connections would likely open up a handful of files to serve up static content or whatever. In the early days, FD_SETSIZE
was only 16; Winsock raised it to 64, and early Linux raised it to 256. That was still more than adequate for a reasonably-sized server at the time, and it was still generally the case that you’d be checking all of your fds at once in a single-threaded event loop (heck, back then, the whole concept of “thread” was just an abuse of vfork()
and not something that any reasonable programmer would expect to use) and that there wouldn’t be that many to check; the overhead of managing a variable-sized buffer would be greater than the savings of not having the conditional loop.
But nowadays, any given Internet server can be easily handling thousands, if not tens of thousands, of connections at once, and the FD_SET
/select()
API is fundamentally broken in a way which can never, ever make this safe.
Fortunately, there’s another API, poll()
, which is much more like what people expect out of select()
in the first place.
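Its declaration, and the pollfd structure it takes, look like this:

```cpp
#include <poll.h>

int poll(struct pollfd *fds, nfds_t nfds, int timeout);   // timeout in milliseconds

struct pollfd {
    int   fd;        // the descriptor to check
    short events;    // the events you care about (POLLIN, POLLOUT, ...)
    short revents;   // the events that actually occurred
};
```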
Essentially, you tell it exactly how many fds you’re asking about, and for each fd, exactly which events you want to know about. So now the poll()
call only has to scan the actual specified fds, and only check the actual specified events — no checking readable, writable, and error conditions for ALL fds that are ≤ the highest one you’re asking about.
In the common case, poll()
is a simple drop-in replacement, and is generally simpler and easier to read; this code:
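For example, a single-socket readability check with select() (a sketch):

```cpp
#include <sys/select.h>

bool socket_readable_select(int sock) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(sock, &readfds);
    struct timeval tv;
    tv.tv_sec = 5; tv.tv_usec = 0;
    return select(sock + 1, &readfds, nullptr, nullptr, &tv) > 0
        && FD_ISSET(sock, &readfds);
}
```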
becomes:
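the poll() equivalent:

```cpp
#include <poll.h>

bool socket_readable_poll(int sock) {
    struct pollfd pfd;
    pfd.fd = sock;
    pfd.events = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, 5000) > 0 && (pfd.revents & POLLIN);
}
```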
And, obviously, if you want to check multiple types of events on a socket, it becomes even simpler, as you only need a single set of pollfd structures and simple bit masking on the calling side.
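For instance (a sketch):

```cpp
#include <poll.h>

void check_events(int sock) {
    struct pollfd pfd;
    pfd.fd = sock;
    pfd.events = POLLIN | POLLOUT;   // readable and writable in one call
    pfd.revents = 0;
    if (poll(&pfd, 1, 5000) > 0) {
        if (pfd.revents & POLLIN)  { /* data to read */ }
        if (pfd.revents & POLLOUT) { /* safe to write */ }
        if (pfd.revents & (POLLERR | POLLHUP)) { /* error or hangup */ }
    }
}
```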
So why do people still use select(), then?
I have a couple of theories. select()
is the legacy API that was provided in all of the classic textbooks; universities possibly still teach networking using it, and there’s a lot of sample code out there that uses select
, and even the best programmers still have a tendency to copy-paste at times.
Additionally, the fact that gcc doesn’t issue a deprecation warning on it means that people haven’t had any reason to change their practices. The man pages tend to contain warnings, but people generally just read the signature and return codes.
Current versions of certain libc variants will return an error if nfds ≥ FD_SETSIZE
(for example, the one on OSX 10.8 will, although there is apparently no warning to this effect on even current Linux glibc), but at that point the stack is already corrupted and if your server has gotten to the point that a subtle hard-to-understand glibc error is occurring, you’re already having a bad day and probably don’t realize where the issue is coming from.
For what it’s worth, one version of the Linux manpage for select()
has this to say:
An fd_set is a fixed size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.
“Undefined behavior” is putting it lightly. (And it makes no mention of how select()
itself is also a ticking time bomb in that situation.)
What about epoll and libEvent?
I purposefully did not discuss them in this article. While both are far superior to poll
, they aren’t simple drop-in replacements to select
, and at least in the basic case of an Internet server the more important thing is not crashing mysteriously. Obviously with a better asynchronous eventing mechanism you will scale better, but scaling better is often a longer-term goal than scaling at all.
Further, in this day and age, you might actually be travelling between different countries, and so you can’t really predict what your outgoing call routing will be like!
So, here’s a really simple C++ program that wraps libphonenumber to normalize the phone numbers in a .vcf file (used by most modern address book systems) based on your current locale. It can be used both on local .vcf stores as well as ones stored on a CardDAV/CalDAV server.
See the code comments for usage details.
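As a rough idea of the core step, assuming libphonenumber’s C++ API (the .vcf handling and locale detection around it are omitted):

```cpp
#include <string>
#include <phonenumbers/phonenumber.pb.h>
#include <phonenumbers/phonenumberutil.h>

// Normalize one raw number string for the given region (e.g. "US", "FR");
// anything unparseable is returned unchanged.
std::string normalizeNumber(const std::string& raw, const std::string& region) {
    using i18n::phonenumbers::PhoneNumber;
    using i18n::phonenumbers::PhoneNumberUtil;

    PhoneNumberUtil* util = PhoneNumberUtil::GetInstance();
    PhoneNumber number;
    if (util->Parse(raw, region, &number) != PhoneNumberUtil::NO_PARSING_ERROR)
        return raw;

    std::string formatted;
    util->Format(number, PhoneNumberUtil::INTERNATIONAL, &formatted);
    return formatted;
}
```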
#tools #c++ #formatting ]]>Note: The timing and memory analysis has been updated as of May 24, 2018.
Okay, so, a pretty simple test case is our friend, recursive Fibonacci. It’s a pretty simple algorithm which is pretty straightforward for demonstrating the major issues. So, here are two versions; this first one is using trivial checked error returns: (fib-1.cpp)
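A sketch of that shape (illustrative, not the original fib-1.cpp): checked error returns everywhere, overflow as the error case, and a crude stack-depth measurement via a local variable’s address:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <limits>

enum FibError { FIB_OK, FIB_OVERFLOW };

// Checked-error-return version: every call site has to test the result.
// 'deepest' records the lowest stack address seen, to measure stack growth.
FibError fib(uint64_t n, uint64_t& out, const char*& deepest) {
    char marker;
    if (&marker < deepest) deepest = &marker;

    if (n < 2) { out = n; return FIB_OK; }

    uint64_t a, b;
    FibError err = fib(n - 1, a, deepest);
    if (err != FIB_OK) return err;
    if ((err = fib(n - 2, b, deepest)) != FIB_OK) return err;

    if (a > std::numeric_limits<uint64_t>::max() - b) return FIB_OVERFLOW;
    out = a + b;
    return FIB_OK;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    char top;
    const char* deepest = &top;
    uint64_t result;
    if (fib(std::strtoull(argv[1], nullptr, 10), result, deepest) != FIB_OK) {
        std::fprintf(stderr, "overflow\n");
        return 1;
    }
    std::printf("%llu (stack grew %ld bytes)\n",
                static_cast<unsigned long long>(result),
                static_cast<long>(&top - deepest));
    return 0;
}
```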
and here is using exceptions: (fib-2.cpp)
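And a matching sketch of the exception-based shape (again illustrative, not the original fib-2.cpp):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <limits>
#include <stdexcept>

// Exception-throwing version: the happy path just returns values, and
// overflow is reported by throwing std::overflow_error.
uint64_t fib(uint64_t n, const char*& deepest) {
    char marker;
    if (&marker < deepest) deepest = &marker;

    if (n < 2) return n;

    uint64_t a = fib(n - 1, deepest);
    uint64_t b = fib(n - 2, deepest);
    if (a > std::numeric_limits<uint64_t>::max() - b)
        throw std::overflow_error("fib overflowed");
    return a + b;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    char top;
    const char* deepest = &top;
    try {
        uint64_t result = fib(std::strtoull(argv[1], nullptr, 10), deepest);
        std::printf("%llu (stack grew %ld bytes)\n",
                    static_cast<unsigned long long>(result),
                    static_cast<long>(&top - deepest));
    } catch (const std::overflow_error& e) {
        std::fprintf(stderr, "%s\n", e.what());
        return 1;
    }
    return 0;
}
```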
(The error condition in this case is whether the operation overflowed.) Both of them use the same mechanism for determining how large the stack has grown, and the error checking version is based on how it works in real code that’s based on error checking.
(Note: I realize that the stack value isn’t going to be exactly precise in terms of how much is used, but it is precise in showing how much it grows based on the depth of the call tree. That’s all I care about here.)
(Also, to stave off any criticisms about the error-checking version, while yes, this could have used 0 to indicate error, that’s meaningless in the context of most applications; 0 could be a valid result value. In real applications, you end up always using the errorType function(input, input, input, output&)
method if you want to keep things consistent.)
In terms of human-readable code size, the exceptions version is somewhat shorter, with 55 lines instead of 62. That isn’t very much for a single function, but across an entire codebase it can add up quickly.
I compiled both versions with clang-900.0.39.2 (on macOS 10.13) with g++ -O3
and stripped the executables. The error-return version weighs in at 8620 bytes, while the exception version is 9488 bytes - a difference of 868 bytes.
In order to determine how much of that is due to generated code and how much is due to the inclusion of <exception>
, I modified fib-2.cpp
to only throw an int
instead of std::overflow_error
, and then produced assembly dumps of all three versions. Stripping those assembly dumps of all labels (showing just the generated opcodes) I get the following sizes:
fib-1.s: 103 opcodes
fib-2.s: 117 opcodes
fib-2-throwint.s: 99 opcodes
This indicates that exceptions, in all likelihood, generate less code on a per-throw/catch site basis than the checked-error equivalent, with the initial increase in size due to per-exception-class overhead, which will be shared across every instance of an exception class.
So, the next thing to compare would be the amount of stack used. I ran each version from n=1 to 30 using seq 1 30 | xargs -n 1 ./fib-1
. The error return version grew the stack exactly 12 ints for each additional n
. The exception version grew the stack… exactly 12 words (48 bytes) for each additional n
. The exception version ended up using slightly less stack overall, too, albeit with
just a difference of one int
.
This was a simple test running on my late-2017 iMac (3.8GHz core i5):
So despite the “common knowledge” that exceptions are “slow,” the exception-throwing version actually ran about 20% faster.
For code size and memory usage, it probably depends a lot on the complexity of the application, and a trivial application won’t do a very good job at capturing the difference between the two. It requires further study with a less-trivial example.
For execution time, in the (hopefully common) case that errors don’t occur, exceptions are faster. This makes sense; checked error returns require checking every result, whether it’s good or bad, while exceptions only have to do anything special when an error occurs. Now, given a sufficiently-complicated example (with lots of different exception types with a complex inheritance tree, for example), the error condition could very well take a lot of time to handle compared to simply checking an error return. In an environment where you’d expect there to be a lot of error conditions to handle (transitory failures, network hiccups, etc.) then error returns could very well turn out to be faster. But it seems that it’d be better to just use error returns for the circumstances which call for it, and to use exceptions everywhere else.
There is, of course, one major missing component to the memory use issue: the static data structures which have to be set up to begin with when the first try
block starts. I don’t know of a quick, platform-independent way to measure that. I also don’t know what the impact is on a per-exception-type basis. However, those are essentially one-time costs, shared across every use of the exception type.
So, basically, my conclusion is that the only way you can be sure of which one works better for a given application is to write the application both ways, and see which one works better. However, my suspicion is that the small amount of memory overhead for exceptions is far outweighed by the time savings, reduced code complexity, and guarantee that errors are not simply ignored (unless they are explicitly ignored, and even ignoring exceptions means code was written to do so — and just putting a try{}catch(...){}
at the top level will still do a lot more to “handle” an error than letting a return code drop on the floor).
Also, a larger, more meaningful comparison would do a better job of determining whether the generated code size (which is an issue for embedded systems! executables load entirely into memory on MMU-less systems) grows larger with exceptions or with error returns. I suspect that error returns will grow faster — there’s more code that has to go into more places, unless you absolutely need fine-grained error checking. If you’re wrapping a single line of code in a try/catch
block, that would certainly be a good situation to use error returns in. But if you’re dealing with larger functional units where one of several things could go wrong and abort an entire process, exceptions are probably the way to go.
Anyway, the most important thing is to not simply assume things based on “common knowledge,” but to actually test them. People who have been working in a field for a long time tend to accumulate Ways of Doing Things and collections of assumptions based on how things were when they started. I used to assume exceptions were bad because that’s what older, more experienced programmers than myself told me. But change can be good, and just because someone isn’t as experienced in a given field doesn’t mean they don’t understand the issues. A fresh pair of eyes is always a good thing.
#c++ #performance ]]>Note: This isn’t about converting between RGB and HSV; this is only about applying an HSV-space modification to an RGB value and getting another RGB value out. There is no affine transformation to convert between RGB and HSV, as there is not a linear mapping between the two.
All of the operations here require multiplication of matrices and vectors. As a quick primer/refresher, multiplying a matrix with a vector looks like this:
\[ \begin{bmatrix} A_x & A_y & A_z \\ B_x & B_y & B_z \\ C_x & C_y & C_z \end{bmatrix} \begin{bmatrix} V_x \\ V_y \\ V_z \end{bmatrix} = \begin{bmatrix} A_xV_x + A_yV_y + A_zV_z \\ B_xV_x + B_yV_y + B_zV_z \\ C_xV_x + C_yV_y + C_zV_z \end{bmatrix} \]
Matrix multiplication, like numerical multiplication, is associative, but unlike numerical multiplication, is not commutative. This means that if you have multiple matrix operations to perform on a single vector, like \(v'=ABCDv\), you can combine them together into a single master matrix where \(Z=ABCD\) and then \(v'=Zv\). This will be useful later on.
The math involved assumes that we are dealing with a linear color space. However, most displays use an exponential curve. For this to behave completely accurately, you’ll have to convert your colors to linear color before, and to exponential color afterwards. A pretty close approximation (assuming a display gamma of 2.2) is as follows, where \(L\) is the linear-space value, \(G\) is the gamma-space value, and \(M\) is the maximum value; for traditional 8-bit/channel images this is 255.
\[ L = M \left( \frac{G}{M} \right)^{2.2} \ \]
\[ G = M \left( \frac{L}{M} \right)^{0.455} \]
Note that if you are going to do this on a per-pixel basis in real time you will almost certainly want to precompute this as a lookup table. However, if you’re doing it in a GPU, treating gamma as 2 is Close Enough™, and you can just square or sqrt()
for the conversions as appropriate.
For the remainder of the math, we don’t care about the scale of the values, as long as they start at 0.
RGB values aren’t very convenient for doing complex transforms on, especially hue. The math for doing a hue rotation on RGB is nasty. However, the math for doing a hue rotation on YIQ is very easy; YIQ is a color space which uses the perceptive-weighted brightness of the red, green and blue channels to provide a luminance (Y) channel, and places the chroma values for red, green and blue roughly 120 degrees apart in the I-Q plane.
Note that there are many color spaces that you can use for this transform which have different hue-mapping characteristics; strictly-speaking, there is no single natural “angle” between any given colors, and different effects can be achieved by using different color spaces such as YPbPr or YUV. (An earlier version of this page used YPbPr for simplicity, but that led to much confusion when people expected the 180-degree rotation of red to be cyan, which is roughly the case in YIQ but not in YPbPr.)
The transformation from RGB to YIQ is best expressed as a matrix, with the RGB value multiplied as a \(3\times1\) column vector on the right:
\[ T_{YIQ} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.274 & -0.321 \\ 0.211 & -0.523 & 0.311 \end{bmatrix} \]
One convenient property of the YIQ color space is that it is unit-independent, so we don’t have to care about what our range of colors is, as long as it starts at 0.
Now that our color is in YIQ format, doing a hue transform is pretty simple – we’re just rotating the color around the Y axis. The math for this is as follows (where \(H\) is the hue transform amount, in degrees):
\[ \theta = \frac{H\pi}{180} \\ U = \cos(\theta) \\ W = \sin(\theta) \\ T_h = \begin{bmatrix} 1 & 0 & 0 \\ 0 & U & -W \\ 0 & W & U \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{bmatrix} \]
Saturation is just the distance between the color and the gray (Y) axis; you just scale the I and Q channels. So its matrix is:
\[ T_s = \begin{bmatrix} 1 & 0 & 0 \\ 0 & S & 0 \\ 0 & 0 & S \end{bmatrix} \]
Finally, the value transformation is a simple scaling of the color as a whole:
\[ T_v = \mathrm{I}V = \begin{bmatrix} V & 0 & 0 \\ 0 & V & 0 \\ 0 & 0 & V \end{bmatrix} \]
To convert YIQ back to RGB, it’s also pretty simple; we just use the inverse of the first matrix: \[ T_{RGB} = \begin{bmatrix} 1 & 0.956 & 0.621 \\ 1 & -0.272 & -0.647 \\ 1 & -1.107 & 1.705 \end{bmatrix} \]
The final composed transform \(T_{HSV}\) is \(T_{RGB}T_hT_sT_vT_{YIQ}\) (we compose the matrices from right-to-left since the original color is on the right), which you get by multiplying all the above matrices together. Because matrix multiplication is associative, it’s easiest to work out \(T_hT_sT_v\) first, which is
\[ \begin{bmatrix} V & 0 & 0 \\ 0 & VSU & -VSW \\ 0 & VSW & VSU \end{bmatrix} \]
where \(U\) and \(W\) are the same as in \(T_h\), above.
From here we can work out the master transform:
\[ T_{HSV} = \begin{bmatrix} 1 & 0.956 & 0.621 \\ 1 & -0.272 & -0.647 \\ 1 & -1.107 & 1.705 \end{bmatrix} \begin{bmatrix} V & 0 & 0 \\ 0 & VSU & -VSW \\ 0 & VSW & VSU \end{bmatrix} \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.274 & -0.321 \\ 0.211 & -0.523 & 0.311 \end{bmatrix} \\ = \begin{bmatrix} .299V+.701VSU+.168VSW & .587V-.587VSU+.330VSW & .114V-.114VSU-.497VSW \\ .299V-.299VSU-.328VSW & .587V+.413VSU+.035VSW & .114V-.114VSU+.292VSW \\ .299V-.3VSU+1.25VSW & .587V-.588VSU-1.05VSW & .114V+.886VSU-.203VSW \end{bmatrix} \]
As a quick smoke test, we test the matrix with \(V=S=1\) and \(H=0\) (meaning \(U=1\) and \(W=0\)) and as a result we get something very close to the identity matrix (with a little divergence due to roundoff error).
Here’s some simple C++ code to do an HSV transformation on a single Color
(where Color
is a struct
containing three members, r
, g
, and b
, in whatever data format you want):
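A sketch of what that can look like, assuming Color holds floats in a zero-based linear range; it simply applies the composed \(T_{HSV}\) matrix from above:

```cpp
#include <cmath>

struct Color { float r, g, b; };

// H is the hue rotation in degrees; S and V are multipliers.
Color transformHSV(const Color& in, float H, float S, float V) {
    const float pi = 3.14159265358979f;
    float U = std::cos(H * pi / 180.0f);
    float W = std::sin(H * pi / 180.0f);

    Color out;
    out.r = (.299f*V + .701f*V*S*U + .168f*V*S*W) * in.r
          + (.587f*V - .587f*V*S*U + .330f*V*S*W) * in.g
          + (.114f*V - .114f*V*S*U - .497f*V*S*W) * in.b;
    out.g = (.299f*V - .299f*V*S*U - .328f*V*S*W) * in.r
          + (.587f*V + .413f*V*S*U + .035f*V*S*W) * in.g
          + (.114f*V - .114f*V*S*U + .292f*V*S*W) * in.b;
    out.b = (.299f*V - .300f*V*S*U + 1.25f*V*S*W) * in.r
          + (.587f*V - .588f*V*S*U - 1.05f*V*S*W) * in.g
          + (.114f*V + .886f*V*S*U - .203f*V*S*W) * in.b;
    return out;
}
```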
Let’s say you want to perfectly replicate the affine color transformation done by some other library or image-manipulation package (say, Flash, for example) but don’t know the exact colorspace they use for their intermediate transformation. Well, for any affine transformation (i.e. not involving gamma correction or whatnot), this is actually pretty simple. First, just run the unit colors of red, green, and blue through that color filter, and then those become the columns of your transformation matrix.
Code for that would look something like this:
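Something like this, reusing the Color struct from the previous sketch (filter stands in for whatever black-box transform you’re replicating):

```cpp
#include <functional>

struct Matrix3 { float m[3][3]; };

Matrix3 deriveMatrix(const std::function<Color(const Color&)>& filter) {
    // Run the unit colors through the black-box transform...
    Color red   = filter(Color{1.0f, 0.0f, 0.0f});
    Color green = filter(Color{0.0f, 1.0f, 0.0f});
    Color blue  = filter(Color{0.0f, 0.0f, 1.0f});

    // ...and each result becomes a column of the matrix (for the
    // column-vector-on-the-right convention used throughout this post).
    Matrix3 t;
    t.m[0][0] = red.r;  t.m[0][1] = green.r;  t.m[0][2] = blue.r;
    t.m[1][0] = red.g;  t.m[1][1] = green.g;  t.m[1][2] = blue.g;
    t.m[2][0] = red.b;  t.m[2][1] = green.b;  t.m[2][2] = blue.b;
    return t;
}
```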
Note that this only works for simple affine transforms, and the usual precautions about gamma vs. linear apply; a much more general approach is to use a CLUT, which can encode arbitrarily many color manipulations into a single color mapping image.
The YIQ color space as used here (as well as in NTSC) attempts to keep the perceptive brightness the same, regardless of channel. This leads to some fairly major problems with the response range; for example, since pure green is perceived as about twice as bright as pure red, rotating pure green to red would produce a super-red – and meanwhile, the value on some channels can also be pulled down past 0. This is not a flaw in the algorithm, so much as a fundamental flaw in how color works to begin with, and a disconnect between the intuitive notion of how “hue” works vs. the actual physics involved. (In a sense, a hue “rotation” is a pretty ridiculous thing to even try to do to begin with; while red and blue shift are very real phenomena, there is no actual fundamental property of light or color which places red, green and blue at 120-degree intervals apart from each other.)
The easy solution to this issue is to always clamp the color values between 0 and 1. A more correct solution is to actually keep track of where the colors are beyond the range and spread that excess energy (or lack thereof) across the image (i.e. to neighboring pixels), such as in HDR rendering. As more display devices move to floating-point-based color representation, this will become easier to deal with. However, at present (May 2018), no commonly-supported image formats support anything other than a clamped integer color space (although there are a few emerging standards such as Radiance HDR and OpenEXR which are finally gaining traction).
In many cases, if you’re making heavy use of hue transformation (in e.g. a game), you are better off splitting your assets into different layers which you can tint appropriately and composite them together for display.
As mentioned earlier, these equations only apply to linear color, whereas most image formats and displays assume an exponential color space. While an HSV transform will generally look acceptable on a gamma colorspace, it tends to bias things weirdly and won’t look quite right. So, you probably want to convert your gamma RGB values to linear before you apply the HSV transform:
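A sketch of that conversion, using the gamma-2.2 approximation from earlier (max is 255 for 8-bit channels, or 1.0 for normalized floats):

```cpp
#include <cmath>

float gammaToLinear(float g, float max) {
    return max * std::pow(g / max, 2.2f);
}

float linearToGamma(float l, float max) {
    return max * std::pow(l / max, 1.0f / 2.2f);
}
```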
In this case, both the linear and gamma colors will be stored in the same range \( \left(0,max\right) \). In real-world real-time implementations you will probably do this transformation using either a pair of lookup tables or a polynomial approximation so you don’t have to do multiple pow()
calls per pixel. (Or, better yet, convert your assets to be linear in the first place.)
Photoshop does not use a linear transform to do its HSL adjustments. Instead, it does an ad-hoc “intuitive” approach where it uses the lowest channel value to determine the saturation, and uses the ratio between the remaining two channels to determine the “angle” on the traditional 6-spoke print color wheel (red/yellow/green/cyan/blue/magenta). This provides perfect numeric values based on people's expectations of hues vs. RGB channels, but it actually makes for some pretty terrible chroma adjustment which tends to only work well on solid colors. Even smooth gradients between two colors are completely fouled up by this approach.
This can be seen clearly in this side-by-side comparison; notice the severely obvious banding on the Photoshop version:
In my opinion, Photoshop’s HSL transform mechanism isn’t really something that should be emulated since its utility is limited to begin with.
#graphics #c++ #GLSL #color ]]>