RSS LJ

August 10, 2008

Hash tables (, , )

by fluffy at 12:33 AM
Lately I've noticed that a lot of people are confused about what a hash table is and how one goes about implementing it. (This often comes up when I'm interviewing candidates and I ask them how they'd implement an associative array without the use of std::map.)

A hash table is not using a hash function to distill a key into a value which is used to order it in an associative array/sorted list/etc.

A hash table is using a hash function to distill a key into a value which is used to index into a linear, non-sparse array.

(Note: I will be using STL conventions for referring to actual data structures, as a shorthand. As a very quick overview if you're not familiar with STL, just think of map<A,B> as a binary tree with A as the sort key and B as the stored value, multimap<A,B> as the same except allowing multiple values with the same key, vector<A> as a linear array of type A, and pair<A,B> as a structure which encapsulates exactly one of each, and sorts with A as the dominant term.)

The common case of people who believe the first case is something like, they'll use a hash function to convert a string into an integer, then use that to select where to insert it into a linked list or an array or a binary tree or whatever. That is no better than actually using ordinary comparison functions - you're just reducing the maximum number of steps required to do the comparisons, but you're not actually implementing a hash table. At best you could say that you're implementing some sort of proxy data structure, and instead of doing map<KEY,VALUE> (for example) they're just implementing a form of multimap<int,pair<KEY,VALUE> >, and it has the exact same overall efficiency as if they'd just done a map<KEY,VALUE> to begin with (since searches still take O(log N) and inserts still take anywhere from O(1) to O(N) depending on what sort of further stupidity you're expressing in the underlying structure).

In reality, a hash table looks more like vector<pair<KEY,VALUE> >, where the items are placed within the vector based on the hash of the key. That is, you have an array — with fixed, non-sparse indices — and a given element with hash value N is placed at vector element N mod K if the hash table has K elements. Collisions are handled by searching for the next suitable location for the key (often as simple as seeking forward in the array until you find an empty slot); if you are colliding often, it's because your hash function sucks, and if you run out of slots, it's because your hash table is full. (One counterpoint is that it's a perfectly reasonable strategy for each hash slot to actually be a linked list of items, rather than a single item, although this only applies to static hash tables where you absolutely cannot resize it later for whatever reason, and then it's debatable as to whether you can really call it "static.")

In a dynamic hash table, when your hash table fills up, you increase (usually, double) the size of the table and then transfer your elements to it, reevaluating the hash function. In the case of doubling the size, if your hash function is any good, this means that roughly half of your elements will go to the first half of the table, and the other half will go to the second half of the table.

Now, the implication of this is that in a pure hash table (dynamic or non), consulting the table for an item which is in it will generally be pretty fast (usually O(1) in the case of a decent hash function — what constitutes a "decent hash function" is of course very problem-domain-specific and what works well for some data sets will be hideous for others) but consulting the table for an item which isn't in it can be quite expensive; in the case of a full table of size N and a seek-forward collision strategy, a miss can end up taking O(N) time (since even with an absolutely wonderful and perfect hash function it's still quite possible that the table gets filled with N-1 items whose keys all hash the same. As an example, N-1 items who all have the same key. Also, you can't short-circuit it by bailing if the next slot has a key which hashes differently — the next slot for one hashed value is the first slot for plenty of other ones.)

So, again, depending on problem domain you may be better off with, say, different collision avoidance strategies, or for making each bucket an associative array. Keep in mind, however, that if you use an associative container for each hash bucket, you're somewhat ruining the point to using a hash table, since lookups will have two parts, one O(1) and one O(log(N/K)), which is (from a complexity standpoint) the same as O(log N). (Of course real-world performance will be a lot better anyway.)

But in any case, whatever you do, do not store a map<hash,pair<key,value> > and call it a "hash table."

Anyway, to summarize:

  • A hash function distills a key type into a smaller numerical domain
  • A hash table is an array which is indexed using a hash function, not a sparse array using the hash value as a key
  • Using a hash function to order things in an associative array may improve the fine-grained efficiency of your comparison function, but it does not improve your algorithmic complexity

Comments