Hash tables (code, job stuff, rant)
std::map.)A hash table is not using a hash function to distill a key into a value which is used to order it in an associative array/sorted list/etc.
A hash table
(Note: I will be using STL conventions for referring to actual data structures, as a shorthand. As a very quick overview if you're not familiar with STL, just think of
The common case of people who believe the first case is something like, they'll use a hash function to convert a string into an integer, then use that to select where to insert it into a linked list or an array or a binary tree or whatever. That is no better than actually using ordinary comparison functions - you're just reducing the maximum number of steps required to do the comparisons, but you're not actually implementing a hash table. At best you could say that you're implementing some sort of proxy data structure, and instead of doing
In reality, a hash table looks more like
In a dynamic hash table, when your hash table fills up, you increase (usually, double) the size of the table and then transfer your elements to it, reevaluating the hash function. In the case of doubling the size, if your hash function is any good, this means that roughly half of your elements will go to the first half of the table, and the other half will go to the second half of the table.
Now, the implication of this is that in a pure hash table (dynamic or non), consulting the table for an item which is in it will generally be pretty fast (usually O(1) in the case of a decent hash function — what constitutes a "decent hash function" is of course very problem-domain-specific and what works well for some data sets will be hideous for others) but consulting the table for an item which isn't in it can be quite expensive; in the case of a full table of size N and a seek-forward collision strategy, a miss can end up taking O(N) time (since even with an absolutely wonderful and perfect hash function it's still quite possible that the table gets filled with N-1 items whose keys all hash the same. As an example, N-1 items who all have the same key. Also, you can't short-circuit it by bailing if the next slot has a key which hashes differently — the next slot for one hashed value is the first slot for plenty of other ones.)
So, again, depending on problem domain you may be better off with, say, different collision avoidance strategies, or for making each bucket an associative array. Keep in mind, however, that if you use an associative container for each hash bucket, you're somewhat ruining the point to using a hash table, since lookups will have two parts, one O(1) and one O(log(N/K)), which is (from a complexity standpoint) the same as O(log N). (Of course real-world performance will be a lot better anyway.)
But in any case, whatever you do, do not store a
Anyway, to summarize:
map<A,B> as a binary tree with A as the sort key and B as the stored value, multimap<A,B> as the same except allowing multiple values with the same key, vector<A> as a linear array of type A, and pair<A,B> as a structure which encapsulates exactly one of each, and sorts with A as the dominant term.)map<KEY,VALUE> (for example) they're just implementing a form of multimap<int,pair<KEY,VALUE> >, and it has the exact same overall efficiency as if they'd just done a map<KEY,VALUE> to begin with (since searches still take O(log N) and inserts still take anywhere from O(1) to O(N) depending on what sort of further stupidity you're expressing in the underlying structure).vector<pair<KEY,VALUE> >, where the items are placed within the vector based on the hash of the key. That is, you have an array — with fixed, non-sparse indices — and a given element with hash value N is placed at vector element N mod K if the hash table has K elements. Collisions are handled by searching for the next suitable location for the key (often as simple as seeking forward in the array until you find an empty slot); if you are colliding often, it's because your hash function sucks, and if you run out of slots, it's because your hash table is full. (One counterpoint is that it's a perfectly reasonable strategy for each hash slot to actually be a linked list of items, rather than a single item, although this only applies to static hash tables where you absolutely cannot resize it later for whatever reason, and then it's debatable as to whether you can really call it "static.")map<hash,pair<key,value> > and call it a "hash table."
Comments