Hashing and Hash tables
Hash tables are an abstract data type that provides near-constant-time lookups, but does not guarantee any ordering of elements.
The technique underlying hash tables is called hashing.
A hash table is essentially an array: a key, i.e., a data member identifying a specific item, is passed through a hash function, which returns the index of that item.
Therefore, each key is mapped onto some index in the range $[0, \text{TableSize} - 1]$.
The hash function can work on anything (integers, strings, and so on); we just have to design it well.
The hash function:
- Must be simple to compute
- Must distribute keys evenly among the cells of the hash table.
Some methods of hashing
- Truncation : Map to a table of smaller size by picking a subset of the key, e.g., 3 out of 9 digits.
- Folding : Break the key into disjoint parts, apply some operation to each, then combine the results.
- Key mod $n$ : Take the key modulo $n$, where $n$ is the size of the table; $n$ is best chosen prime.
- Radix conversion : Treat the key as a number in a different base, then run relevant operations.
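The methods above can be sketched in code; the digit counts and the table size 997 here are illustrative assumptions, not values from the notes.

```python
# Sketches of the hashing methods above (illustrative, not production hashes).

def truncation_hash(key: int, digits: int = 3) -> int:
    """Truncation: keep only the last `digits` digits of the key."""
    return key % (10 ** digits)

def folding_hash(key: int, chunk: int = 3, table_size: int = 997) -> int:
    """Folding: split the key's digits into chunks, sum them, reduce mod table size."""
    total, k = 0, key
    base = 10 ** chunk
    while k > 0:
        total += k % base   # take the next chunk of digits
        k //= base
    return total % table_size

def mod_hash(key: int, table_size: int = 997) -> int:
    """Key mod n, with n the (preferably prime) table size."""
    return key % table_size
```

For example, `truncation_hash(123456789)` keeps the last three digits, 789, while `folding_hash` sums the chunks 123, 456, and 789 before reducing.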
Collision resolution
A collision occurs when a new key, when hashed, maps to an index that is already occupied.
There are several methods to deal with it:
- Separate chaining
- Open Addressing
- Linear probing
- Quadratic probing
- Double hashing
Separate chaining
Each index in the hash table stores a pointer to the head of a linked list, and all keys that hash to that index are stored in that list.
This gives much better collision handling, allows for quick insertion, and handles overflow well.
Load factor ($\lambda$) : the ratio of the number of stored keys $N$ to the table size $M$, i.e., $\lambda = N/M$.
In this case, the average length of a list is also $\lambda$.
And since we are chaining, hash tables can be small, and therefore $\lambda$ can exceed $1$.
Cost of searching
Cost = Time to evaluate hash function + Time to traverse the list at the index
Assume constant time to evaluate the hash function, so the cost is dominated by the list traversal.
Unsuccessful search : We have to look through the entire list at the bucket, meaning the average cost is $\lambda$.
Successful search : We have to look through, on average, half the list, since we assume that the target node can be anywhere with equal probability, meaning the average cost is $1 + \lambda/2$.
Therefore, the average search cost is $O(1 + \lambda)$.
In summary, what matters is the load factor, and for separate chaining collision resolution, it is best to keep $\lambda \approx 1$.
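A minimal sketch of a separate-chaining table, using Python lists in place of linked lists for brevity (the class and method names are mine):

```python
# Separate chaining: each bucket holds all (key, value) pairs that hash to it.

class ChainedHashTable:
    def __init__(self, size: int = 11):           # small prime table size
        self.buckets = [[] for _ in range(size)]
        self.size = size

    def _index(self, key) -> int:
        return hash(key) % self.size              # map key into [0, size - 1]

    def insert(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                          # key already present: update
                bucket[i] = (key, value)
                return
        bucket.append((key, value))               # otherwise append to the chain

    def find(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                               # unsuccessful search
```

Note that inserting more than 11 keys simply makes the chains longer ($\lambda > 1$) rather than failing, which is the point of chaining.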
Open Addressing
All data goes into the table itself, so the load factor must satisfy $\lambda \leq 1$.
In this strategy, cells $h_0(x), h_1(x), h_2(x), \ldots$ are tried in succession,
where $h_i(x) = (\text{hash}(x) + f(i)) \bmod \text{TableSize}$, with $f(0) = 0$; the function $f$ is the collision resolution strategy.
The first empty cell found is where the data is inserted.
Linear probing
Here the collision resolution function is $f(i) = i$: we probe cells sequentially, wrapping around, until an empty cell is found.
Find and delete operations :
The same strategy is used as for insert. We start from the "ideal" cell, which is given by $\text{hash}(x)$, and probe forward from there.
The important issue here is that, since we are attempting to make search fast, the search stops at the first empty cell, which is taken to mean the key is absent.
Because of this, delete operations become somewhat tricky, and we use lazy deletion.
This essentially means that when we delete a key, we mark its cell as deleted, instead of removing the key and marking the cell as empty.
This ensures we do not break the probe sequence during search, and still have better than $O(n)$ search.
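Lazy deletion can be sketched as follows, assuming the table never fills completely (a completely full table would make these probe loops spin forever); the sentinel names and class are mine:

```python
# Linear probing with lazy deletion: deleted slots are tombstoned, not emptied,
# so later probe sequences are not cut short.

EMPTY, DELETED = object(), object()   # sentinel markers for slot states

class LinearProbingTable:
    def __init__(self, size: int = 11):
        self.slots = [EMPTY] * size
        self.size = size

    def insert(self, key) -> None:
        i = hash(key) % self.size
        while self.slots[i] not in (EMPTY, DELETED):
            if self.slots[i] == key:
                return                      # already present
            i = (i + 1) % self.size         # linear probe: try the next cell
        self.slots[i] = key

    def find(self, key) -> bool:
        i = hash(key) % self.size
        while self.slots[i] is not EMPTY:   # tombstones keep the chain intact
            if self.slots[i] == key:
                return True
            i = (i + 1) % self.size
        return False                        # first truly empty cell: key absent

    def delete(self, key) -> None:
        i = hash(key) % self.size
        while self.slots[i] is not EMPTY:
            if self.slots[i] == key:
                self.slots[i] = DELETED     # lazy deletion: mark, don't empty
                return
            i = (i + 1) % self.size
```

If `delete` set the slot back to `EMPTY` instead, a later `find` for a key stored further along the same probe sequence would stop early and wrongly report it missing.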
Clustering problems
Since the probe sequence is linear, keys that collide tend to congregate into contiguous blocks of occupied cells around the same bucket.
This effect is known as primary clustering.
Any key that hashes into such a cluster requires multiple probes to resolve the collision, and then adds to the cluster.
Analysis
The average number of cells that are required to be examined during insertion is the same as for an unsuccessful search, and it is $$\frac{1}{2}\left( 1 + \frac{1}{(1-\lambda)^2} \right)$$ where $\lambda$ is the load factor.
For a half-full table ($\lambda = 0.5$), we obtain 2.5 as the average number of cells required to be examined.
Therefore, primary clustering makes the cost grow quadratically in $\frac{1}{1-\lambda}$ as the load factor approaches 1.
For a successful search operation, the average number of cells required to be probed is $$\frac{1}{2}\left( 1 + \frac{1}{1-\lambda} \right)$$
It is derived by averaging the insertion cost over the table's history as it filled from empty to $\lambda$: $$\frac{1}{\lambda} \int_{0}^{\lambda} \frac{1}{2}\left( 1 + \frac{1}{(1-x)^2} \right)dx$$
Quadratic probing
This eliminates the primary clustering problem we saw. Here we have a resolution function of the form $$f(i) = i^2$$
Therefore, it examines cells $\text{hash}(x) + 1, \text{hash}(x) + 4, \text{hash}(x) + 9, \ldots$, all taken mod the table size.
The problems with this are :
- Since the probe sequence skips cells, it cannot scan every cell; there are bound to be misses, so there is no guarantee of finding an empty cell if the table is more than half full.
- If the hash table size is not prime, this problem is much more severe.
However, there is a theorem saying that:
If the table size is prime and the load factor is no larger than 0.5, all probes will be to different locations and an item can always be inserted.
The phenomenon of distributed clustering seen in this method is called secondary clustering.
Double hashing
A second hash function is used to drive the collision resolution: $$f(i) = i \times h_{2}(x)$$
The second hash function must never evaluate to 0, otherwise probing would stall in place.
An example function is $$h_{2}(x) = R - (x \bmod R)$$ where $R$ is a prime smaller than the table size.
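A sketch of the resulting probe sequence; the table size 11 and $R = 7$ are illustrative choices, not values from the notes:

```python
# Double hashing: probe cell h_i(x) = (h1(x) + i * h2(x)) mod TABLE_SIZE,
# with h2(x) = R - (x mod R) for a prime R smaller than the table size.

TABLE_SIZE = 11   # prime table size (assumed)
R = 7             # prime smaller than the table size (assumed)

def h1(x: int) -> int:
    return x % TABLE_SIZE

def h2(x: int) -> int:
    return R - (x % R)        # always in 1..R, never 0

def probe_sequence(x: int, attempts: int = 5):
    """First few cells examined when inserting key x under double hashing."""
    return [(h1(x) + i * h2(x)) % TABLE_SIZE for i in range(attempts)]
```

For key 23, $h_1 = 1$ and $h_2 = 5$, so the probes land at cells 1, 6, 0, 5, 10: the step size depends on the key, which is what breaks up the clusters.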
Applications of hashing
- Compilers use hash tables to implement the symbol table, to keep track of declared symbols.
- Transposition tables in games
- Spell checkers
Overall summary
- Hash tables can be used to provide near constant time lookup and insert
- Performance depends on the load factor
- Important to have a prime table-size and a correct choice of load factor
- For separate chaining, the load factor should be close to 1
- For open addressing the load factor should not exceed 0.5
- Rehashing can be used to grow or shrink the table.