Code: Making a hash of data

Making a hash of data

October 23, 2018 10:07 AM (5 years ago)

When I was replacing peewee with PonyORM, I was evaluating a few options, including moving away from an ORM entirely and simply storing the metadata in indexed tables in memory. This would have also helped to solve a couple of minor annoying design issues (such as improper encapsulation of the actual content state into the application instance), but I ended up not doing this.

A big reason why is that there don’t actually seem to be any useful in-memory indexed table libraries for Python. Or many other languages.

Back in the 90s when I was a larval software engineer, the common practice for teaching data structures always started with the basics: linked lists, then binary trees (first naïve, then self-balancing ones such as AVL or red-black trees, possibly with a foray into B-trees), and then onto hash tables, typically starting with fixed hashes and then moving on to dynamic hashes. Hash tables themselves were often even treated as an afterthought, a performance optimization that wasn’t useful in the general case! In the meantime, algorithms based on binary search on sorted lists and the like would also be a major part of the curriculum.

Somewhere along the lines, though, it seems like everyone has moved away from ordered, indexed structures and towards everything being based on hash tables. Somehow as an industry we’ve all decided that it’s only important to be able to do \(\mathcal{O}(1)\) constant-time lookup on fixed keys with no ordering necessary between them.¹

C++ and Java both provide ordered associative structures. std::map and TreeMap<T>, for example. But whenever you ask a programmer (especially a younger one) about them these days, people just point out that they’re “slow” because they’re \(\mathcal{O}(\log_2 N)\) for key lookup, and that you should only use std::unordered_map or HashMap<T> instead, because they’re \(\mathcal{O}(1)\).

But focusing on this forgets a few things:

The big-O complexity factor matters for really large values of \(N\), ignoring the constant overhead of the underlying computations (which, in the case of string hashing, can be pretty slow, especially compared to a handful of lexical comparisons)
Single-key lookup isn’t the only dang thing you might want to be doing on your data!

Ordered indexes are really freaking useful for a lot of stuff. For example, trying to find all of the keys matching a prefix, or finding values which are greater than or less than a key (or doing a range query in general). Or quickly finding the next or previous entry in a content store.

All of these things require a full table scan – meaning an \(\mathcal{O}(N)\) operation – in a hash-only scenario. Or if you want to do a range query and then sort it at the end, it takes \(\mathcal{O}(N \log_2 N)\), as you have to first filter the table (which is \(\mathcal{O}(N)\)) and then sort it (which is \(\mathcal{O}(N \log_2 N)\)). How is this more efficient than just using an \(\mathcal{O}(\log_2 N)\) one-time lookup?

Interviewing candidates

One of my recurring duties as a full-time software engineer was to interview other engineers as job candidates. One of my go-to problems was writing a Boggle solver. There are generally three phases to the solution:

Determine an overall algorithm
Given a function for determining if a prefix exists in the word list, traverse the board to find the solutions
Implement the function for determining if the prefix exists

Phase 3 is usually the hard part that most candidates have the most trouble with. There are very simple solutions to this problem, though. You could start out by sorting the wordlist in an array (\(\mathcal{O}(N \log_2 N)\)) and then do an \(\mathcal{O}(\log_2 N)\) prefix search for each word, or you can store it in a tree-type map (also \(\mathcal{O}(N \log_2 N)\) for the initial storage) and then do an \(\mathcal{O}(\log_2 N)\) lower-bound search for each word, or you can do what most candidates go with and either store the word list in a hash table (\(\mathcal{O}(N)\)) and do a search through the whole table for every check (also \(\mathcal{O}(N)\)), or they build a trie to store all the words with a flag as to whether a node is a leaf (i.e. if the word is complete), which is essentially an \(\mathcal{O}(N)\) initial phase and an \(\mathcal{O}(L)\) lookup (where \(L\) is the length of the word). These are all acceptable solutions, but the amount of code you have to write for each thing is… highly variable.

For example, in C++, here is how you do a prefix search on a std::set:

bool has_prefix(const std::set<string>& wordlist, std::string prefix) {
    auto iter = wordlist.lower_bound(prefix);
    return iter != wordlist.end && iter->substr(0, prefix.length()) == prefix;
}

Or here is how you do it in Java with a TreeSet<String>:

boolean has_prefix(TreeSet<String> wordlist, string prefix) {
    String first = wordlist.floor(prefix);
    return first != null && first.startsWith(prefix);
}

Or if you’re in a language without a tree-based set concept, such as Python, and you store your dictionary in a sorted list:

def has_prefix(wordlist, prefix):
    import bisect
    index = bisect.bisect_left(wordlist, prefix)
    return index < len(wordlist) and wordlist[index].startswith(prefix)

I’m not sure where along the line this concept got lost, or why whenever I ask people about indexed data structures in their language of choice they look at me like I’ve grown an extra head or two. Even among engineers who are aware of indexed data structures, the pushback I get is that it’s not really relevant, and that we should all just be building everything out of hash tables or doing full table scans all the time. I feel like this all came from about a decade ago when suddenly Map-Reduce thinking took over the industry, and all software development was based around Big Data and Massive Scaling and parallel clustering of stuff. Which is just really weird to me. Like, not all problems exist at that scale, you know?

(And I mean Map-Reduce is useful at small scales too, it’s just not the universal way of reasoning about things. Especially when the excuse for it boils down to, “Eh, there’s CPU power to spare, why bother?”)

Further interview frustrations

When trying to introspect into software engineers who just rely on a hash table for everything, I try to figure out if they even know how a hash table works.

Almost without fail, candidates who are asked about the underlying data structure end up starting out by taking the key, building a hash on it, and then insert the key into a binary tree. This is pretty maddening! It means that they’re paying for the complexity of a binary tree and having the limited capabilities of a hash table – a worst of both worlds situation. No in-order traversal combined with \(\mathcal{O}(\log_2 N)\) lookups. And when I complain to others about candidates not understanding these fundamentals, the feedback I get is that maybe I shouldn’t be expecting people to remember introductory computer science courses (usually with an implication that the person I’m talking to doesn’t know it either).

How have we gotten to this point?

C++

Anyway, just out of curiosity I decided to do a timing comparison between a few different ways of implementing a Boggle solver; the only difference between these algorithms is the implementation of has_prefix based on the data structure in question. (Also note that the code itself is crusty stuff I wrote many years ago for a very specific challenge and all I’ve done with it for this article is to change the data structures in use. It is certainly not high-quality code and I suspect if I were to actually look at most of the code I’d be a bit aghast at it right now.)

Using an ordered set:

$ g++ -O3 -Wall --std=c++11 boggle-orderedset.cpp
$ time ./a.out < board.txt > /dev/null

real    0m0.173s
user    0m0.160s
sys     0m0.010s

Using a sorted vector:

$ g++ -O3 -Wall --std=c++11 boggle-sortedvector.cpp
$ time ./a.out < board.txt > /dev/null

real    0m0.048s
user    0m0.039s
sys     0m0.007s

Using a hash table:

$ g++ -O3 -Wall --std=c++11 boggle-hashtable.cpp
$ time ./a.out < board.txt > /dev/null

real    0m44.075s
user    0m43.867s
sys     0m0.110s

Using a hand-rolled trie:

$ g++ -O3 -Wall --std=c++11 boggle-trie.cpp
$ time ./a.out < board.txt > /dev/null

real    0m0.362s
user    0m0.320s
sys     0m0.038s

In the above solutions, the surprising thing is that the sorted vector is so much faster than the ordered set; however, what’s not surprising is that both of those are many orders of magnitude faster than the hash table approach, and that the trie approach is somewhat slower than the ordered set. Granted, the trie implementation could be a bit better (for example, the actual search algorithm could track where it is in the trie rather than searching from root every time), but the amount of code written is also important:

$ wc -l *.cpp | sort -n
     131 boggle-orderedset.cpp
     132 boggle-sortedvector.cpp
     137 boggle-hashtable.cpp
     161 boggle-trie.cpp
     561 total

So, the sorted vector is only slightly more complicated to implement (in fact the only line count difference is the call to std::sort after the dictionary is loaded – which is actually not even necessary if your dictionary is sorted to begin with), and yet it’s the fastest of all these by far.

Okay, so that algorithm doesn’t actually make use of an indexed data structure. But the thought process that leads to implementing it is along the same lines as the thought processes that leads to using an ordered indexed data structure; in effect, the vector is an index, viewed in the greater sense. And, whenever I’ve interviewed a candidate with this problem, not one has gone with a sorted array and a binary search! (I have had a couple at least go with an ordered set though. But nearly everyone goes with the trie – and most of them have never even heard of a trie before, and just sort of, like, invent it on the spot. Which is cool, but still…)

Also, the sorted vector approach only really works performance-wise if your input set is static. Once you start adding stuff to it, well, each addition is a potentially \(\mathcal{O}(N)\) operation, which can end up becoming incredibly costly very fast. For example, if you’re writing, say, a software load balancer, and your table keeps track of your backing servers' load levels, every update to that requires changing a row in the index, and if you have a lot of rows (say, you’re working at the sort of scale where map-reduce thinking takes over), every single update starts to add up very quickly.

Anyway, with just standard C++ you can build your own indexes yourself, or you can use Boost.MultiIndex, which maintains arbitrarily many indexes for you. It’s pretty neat.

Python

Anyway, back to my original conundrum. I wanted to investigate moving Publ’s data store into an in-memory table, with various indexes for the necessary sort criteria. The easiest approach was to stick with a database, despite that having very poor encapsulation (since the object storage is essentially global state). But what else could I have done?

When I was asking around, everyone kept on pointing me towards collections.OrderedDict, which is a dict which maintains the order of its keys. But by “maintains the order” it actually means maintaining the insertion order, not any ongoing sort order. This isn’t actually useful for the purpose of maintaining an index. (And even if it were, it doesn’t provide any primitives for performing a range query or sibling iteration or whatever.)

However, the blist package is a rather comprehensive B-tree implementation, and in particular it provides a sorteddict, which does indeed maintain the sort order of its keys. It also provides a KeysView which allows you to bisect_left the keys themselves, in order to traverse items starting at a particular point. It’s certainly not the worst thing to do. So, in the future, if I ever decide to get rid of an ORM entirely, this is what I will probably focus on using. However, this adds a bit more complexity in that if you know your sort key is going to change at all, you need to remember to remove your item from the b-tree and then readd it after it’s been updated. (Fortunately this isn’t all that different from a typical CRUD mechanism anyway.)

And of course, Publ’s use case calls for a lot of reads and not a lot of writes, so it’s also not that big a deal to periodically rebuild indexes as a sorted list and just use bisect_left that way. It’s still something that needs to be managed directly, though.

Maybe if/when I decide to de-ORMify Publ I’ll just end up writing a library to make indexed tables easier to manage. I sure as heck can’t find anything that exists as it is. (If anyone does know of anything, please let me know!)

(I mean, unless I can find something that’s similar to Berkeley DB and is actually supported in Python 3 and is MIT-license-compatible…)

(note to self: lmdb is a good candidate)

Java

As mentioned previously, Java provides both TreeMap/TreeSet and HashMap/HashSet. But for some reason people are constantly taught to only use the HashMap/HashSet versions, and this leads to people never even knowing that TreeMap/TreeSet even exist or even consider why they might want to use them. I find this incredibly baffling.

Lua and JavaScript

Lua and JavaScript are very popular because of their simplicity; both of them make the same simplifying assumption in that all data structures are hash tables – including basic arrays. In many cases these get optimized under the hood to act as basic arrays but there are also many situations where that ends up falling apart, and this is why in Lua in particular you generally want to use ipairs instead of pairs, especially if order matters.

The result of this is that there’s also absolutely no concept of an indexed, ordered associative array. Either your array traversal is out-of-order (or, in JavaScript, insertion-order, as JS’s array acts more or less like Python’s collections.OrderedDict, except when it doesn’t), or you’re getting all of your keys out of the array, sorting that, and then iterating on that array instead. Which adds even more layers of complexity, and reduces performance further.

In Lua you’re not likely to be writing anything which makes use of indexed structures (and if you are, you’re probably implementing that stuff in C or C++ and calling into it with ffi), but JavaScript? That’s used to run a lot of web services, and while node.js provides an ffi binding, that’s generally seen as a last resort. So what do people do when they need to handle indexes in node.js services?

Well, in my experience, they seem to just defer it to an ORM, or shrug and suffer the poor performance.

Go

I haven’t touched Go in a long time, but the last time I did, the golang orthodoxy was that you don’t need ordered indexes. From some cursory websearches I’m finding that this appears to still be the case.

Fortunately, there are container libraries which correct this. It looks like TreeMap even now provides floor and ceiling which can then be used to implement range queries. It looks like it only gained this functionality incredibly recently (i.e. a month ago as of this writing).

C#

C# provides a Dictionary class.

You know what you can do with a real-life dictionary? You can quickly find a word you want, and then see which words come before and after it.

You know what you can’t do with a C# Dictionary?

Okay, so C# does also provide OrderedDictionary but this doesn’t, as far as I can tell, provide any iteration methods for getting the next or previous entries, or any sort of range queries at all. Presumably these are possible via LINQ, though.

Oh I guess I forgot to write a conclusion

So yeah uh. It’s so weird to me that this is the way software has gone. Everything’s a big pile of hash tables and nobody is expected to treat anything differently or learn any algorithms that make use of other storage representations. I find that weird and sad.