Hashing and Hash tables
Hash tables are an abstract data type that provides near-constant-time lookups, but does not guarantee any ordering of elements.
The technique underlying hash tables is called hashing.
A hash table is essentially an array: a key, i.e., a data member identifying a specific item, is passed through a hash function, which returns the index of that item.
Therefore, each key is mapped onto some index in the range $[0, \text{TableSize} - 1]$.
The hash function can work on anything (integers, strings, and so on); we just have to design it well.
The hash function:
- Must be simple to compute
- Must distribute keys evenly among the cells of the hash table.
Some methods of hashing
- Truncation : Map to a table of smaller size by picking a subset of the key, e.g., 3 out of 9 digits.
- Folding : Break the key into disjoint parts, apply some operation to each, then combine the results.
- Key mod $n$ : Take the key modulo $n$, where $n$ is the size of the table; $n$ is best chosen prime.
- Radix conversion : Treat the key as a number in a different base, then run relevant operations.
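The methods above can be sketched in code; the digit counts and the table size 997 here are illustrative assumptions, not values from the notes.

```python
# Sketches of the hashing methods above (illustrative, not production hashes).

def truncation_hash(key: int, digits: int = 3) -> int:
    """Truncation: keep only the last `digits` digits of the key."""
    return key % (10 ** digits)

def folding_hash(key: int, chunk: int = 3, table_size: int = 997) -> int:
    """Folding: split the key's digits into chunks, sum them, reduce mod table size."""
    total, k = 0, key
    base = 10 ** chunk
    while k > 0:
        total += k % base   # take the next chunk of digits
        k //= base
    return total % table_size

def mod_hash(key: int, table_size: int = 997) -> int:
    """Key mod n, with n the (preferably prime) table size."""
    return key % table_size
```

For example, `truncation_hash(123456789)` keeps the last three digits, 789, while `folding_hash` sums the chunks 123, 456, and 789 before reducing.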
Collision resolution
A collision occurs when a new key, when hashed, maps to an index that is already occupied.
There are several methods to deal with it:
- Separate chaining
- Open Addressing
- Linear probing
- Quadratic probing
- Double hashing
Separate chaining
Each index in the hash table stores a pointer to the head of a linked list, and all keys that hash to that index are stored in that list.
This gives much better collision handling, allows for quick insertion, and handles overflow well.
Load factor ($\lambda$) : the ratio of the number of stored keys $N$ to the table size $M$, i.e., $\lambda = N/M$.
In this case, the average length of a list is also $\lambda$.
And since we are chaining, hash tables can be small, and therefore $\lambda$ can exceed $1$.
Cost of searching
Cost = Time to evaluate hash function + Time to traverse the list at the index
Assume constant time to evaluate the hash function, so the cost is dominated by the list traversal.
Unsuccessful search : We have to look through the entire list at the bucket, meaning the average cost is $\lambda$.
Successful search : We have to look through, on average, half the list, since we assume that the target node can be anywhere with equal probability, meaning the average cost is $1 + \lambda/2$.
Therefore, the average search cost is $O(1 + \lambda)$.
In summary, what matters is the load factor, and for separate chaining collision resolution, it is best to keep $\lambda \approx 1$.
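A minimal sketch of a separate-chaining table, using Python lists in place of linked lists for brevity (the class and method names are mine):

```python
# Separate chaining: each bucket holds all (key, value) pairs that hash to it.

class ChainedHashTable:
    def __init__(self, size: int = 11):           # small prime table size
        self.buckets = [[] for _ in range(size)]
        self.size = size

    def _index(self, key) -> int:
        return hash(key) % self.size              # map key into [0, size - 1]

    def insert(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                          # key already present: update
                bucket[i] = (key, value)
                return
        bucket.append((key, value))               # otherwise append to the chain

    def find(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                               # unsuccessful search
```

Note that inserting more than 11 keys simply makes the chains longer ($\lambda > 1$) rather than failing, which is the point of chaining.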
Open Addressing
All data goes into the table itself, so the load factor must satisfy $\lambda \leq 1$.
In this strategy, cells $h_0(x), h_1(x), h_2(x), \ldots$ are tried in succession,
where $h_i(x) = (\text{hash}(x) + f(i)) \bmod \text{TableSize}$, with $f(0) = 0$; the function $f$ is the collision resolution strategy.
The first empty cell found is where the data is inserted.
Linear probing
Here the collision resolution function is $f(i) = i$: we probe cells sequentially, wrapping around, until an empty cell is found.
Find and delete operations :
The same strategy is used as for insert. We start from the "ideal" cell, which is given by $\text{hash}(x)$, and probe forward from there.
The important issue here is that, since we are attempting to make search fast, the search stops at the first empty cell, which is taken to mean the key is absent.
Because of this, delete operations become somewhat tricky, and we use lazy deletion.
This essentially means that when we delete a key, we mark its cell as deleted, instead of removing the key and marking the cell as empty.
This ensures we do not break the probe sequence during search, and still have better than $O(n)$ search.
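Lazy deletion can be sketched as follows, assuming the table never fills completely (a completely full table would make these probe loops spin forever); the sentinel names and class are mine:

```python
# Linear probing with lazy deletion: deleted slots are tombstoned, not emptied,
# so later probe sequences are not cut short.

EMPTY, DELETED = object(), object()   # sentinel markers for slot states

class LinearProbingTable:
    def __init__(self, size: int = 11):
        self.slots = [EMPTY] * size
        self.size = size

    def insert(self, key) -> None:
        i = hash(key) % self.size
        while self.slots[i] not in (EMPTY, DELETED):
            if self.slots[i] == key:
                return                      # already present
            i = (i + 1) % self.size         # linear probe: try the next cell
        self.slots[i] = key

    def find(self, key) -> bool:
        i = hash(key) % self.size
        while self.slots[i] is not EMPTY:   # tombstones keep the chain intact
            if self.slots[i] == key:
                return True
            i = (i + 1) % self.size
        return False                        # first truly empty cell: key absent

    def delete(self, key) -> None:
        i = hash(key) % self.size
        while self.slots[i] is not EMPTY:
            if self.slots[i] == key:
                self.slots[i] = DELETED     # lazy deletion: mark, don't empty
                return
            i = (i + 1) % self.size
```

If `delete` set the slot back to `EMPTY` instead, a later `find` for a key stored further along the same probe sequence would stop early and wrongly report it missing.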
Clustering problems
Since the probe sequence is linear, keys that collide tend to congregate into contiguous blocks of occupied cells around the same bucket.
This effect is known as primary clustering.
Any key that hashes into such a cluster requires multiple probes to resolve the collision, and then adds to the cluster.
Analysis
The average number of cells that are required to be examined during insertion is the same as for an unsuccessful search, and it is $$\frac{1}{2}\left( 1 + \frac{1}{(1-\lambda)^2} \right)$$ where $\lambda$ is the load factor.
For a half-full table ($\lambda = 0.5$), we obtain 2.5 as the average number of cells required to be examined.
Therefore, primary clustering makes the cost grow quadratically in $\frac{1}{1-\lambda}$ as the load factor approaches 1.
For a successful search operation, the average number of cells required to be probed is $$\frac{1}{2}\left( 1 + \frac{1}{1-\lambda} \right)$$
It is derived by averaging the insertion cost over the table's history as it filled from empty to $\lambda$: $$\frac{1}{\lambda} \int_{0}^{\lambda} \frac{1}{2}\left( 1 + \frac{1}{(1-x)^2} \right)dx$$
Quadratic probing
This eliminates the primary clustering problem we saw. Here we have a resolution function of the form $$f(i) = i^2$$
Therefore, it examines cells $\text{hash}(x) + 1, \text{hash}(x) + 4, \text{hash}(x) + 9, \ldots$, all taken mod the table size.
The problems with this are :
- Since the probe sequence skips cells, it cannot scan every cell; there are bound to be misses, so there is no guarantee of finding an empty cell if the table is more than half full.
- If the hash table size is not prime, this problem is much more severe.
However, there is a theorem saying that:
If the table size is prime and the load factor is no larger than 0.5, all probes will be to different locations and an item can always be inserted.
The phenomenon of distributed clustering seen in this method is called secondary clustering.
Double hashing
A second hash function is used to drive the collision resolution: $$f(i) = i \times h_{2}(x)$$
The second hash function must never evaluate to 0, otherwise probing would stall in place.
An example function is $$h_{2}(x) = R - (x \bmod R)$$ where $R$ is a prime smaller than the table size.
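A sketch of the resulting probe sequence; the table size 11 and $R = 7$ are illustrative choices, not values from the notes:

```python
# Double hashing: probe cell h_i(x) = (h1(x) + i * h2(x)) mod TABLE_SIZE,
# with h2(x) = R - (x mod R) for a prime R smaller than the table size.

TABLE_SIZE = 11   # prime table size (assumed)
R = 7             # prime smaller than the table size (assumed)

def h1(x: int) -> int:
    return x % TABLE_SIZE

def h2(x: int) -> int:
    return R - (x % R)        # always in 1..R, never 0

def probe_sequence(x: int, attempts: int = 5):
    """First few cells examined when inserting key x under double hashing."""
    return [(h1(x) + i * h2(x)) % TABLE_SIZE for i in range(attempts)]
```

For key 23, $h_1 = 1$ and $h_2 = 5$, so the probes land at cells 1, 6, 0, 5, 10: the step size depends on the key, which is what breaks up the clusters.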
Applications of hashing
- Compilers use hash tables to implement the symbol table, to keep track of declared symbols.
- Transposition tables in games
- Spell checkers
Overall summary
- Hash tables can be used to provide near constant time lookup and insert
- Performance depends on the load factor
- Important to have a prime table-size and a correct choice of load factor
- For separate chaining, the load factor should be close to 1
- For open addressing the load factor should not exceed 0.5
- Rehashing can be used to grow or shrink the table.