Hash tables
Reasons to use a hash table:
- Search: Retrieval of a particular piece of information from large volumes of previously stored data
- To access information within the item (not just the key)
- Insertions: Arrays, linked lists are worst case \(O(n)\) for insertions and deletions
We can speed these operations up to constant time with hash maps!
Introduction to Dictionary ADT
A dictionary is an abstract data structure of items with keys that supports two basic operations: insert a new item, and search for (and return) the item with a given key.
Other commonly supported operations:
- Remove (inefficient)
- Sorting the symbol table (inefficient)
- Select the k-th largest element
- Join two symbol tables
- Construct, test if empty, destroy, copy, etc.
What if you have a large set of keys?
- No problem! With direct addressing, each item gets its own bucket indexed by its key, so you can look up a particular item in \(O(1)\) time.
What if your range of integers does not fit into memory? Let's say the range of keys is up to 100,000,000 (but you only use 10 of them). You don't want to allocate an array of size 100,000,000, since that wouldn't fit in memory – use hashing!
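As a quick illustration of the sparse-key situation above (using Python's built-in dict, which is itself a hash table, so only the stored entries take memory):

```python
# A hash table stores only the keys you actually use: a handful of keys
# drawn from a range of 100,000,000 costs memory for those entries,
# not for an array of 100,000,000 slots.
table = {}
for key in [7, 9_999_999, 42_000_000, 99_999_999]:
    table[key] = f"value-{key}"

print(table[42_000_000])  # constant-time lookup by key
print(len(table))         # only the inserted keys occupy space
```

The keys and values here are arbitrary placeholders; the point is that memory scales with the number of stored items, not with the key range.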
Hashing
What it lets us do: Reference items in a table by keys.
- Input: Some key.
- Output: A table address
You need:
- A hash function, which transforms the search key into a table address
- A collision resolution strategy, for dealing with search keys that hash to the same table address
How important is this?
- Databases are actually just symbol tables!
- Symbol tables are supported in the C and C++ standard libraries!
Hashing is good at:
- Insert, search, and remove (expected constant time)
Hashing is bad at:
- Selection of a k-th largest item
- Sorting
Hash Function
Two parts:
- Hash code: maps the key into an integer
- Compression map: maps the integer into the range [0, M)
Good hash functions:
- Must:
- Easily computable
- Deterministic (the same key always produces the same hash)
- Computes a hash for every key
- Want:
- Minimize number of collisions
Floats
If you are hashing floats in the range [0, 1), then you can just multiply the float by M and convert into an integer for the hash!
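A minimal sketch of this float-scaling idea (the table size \(M = 97\) is a hypothetical choice):

```python
M = 97  # table size (a hypothetical choice for illustration)

def hash_float(x: float) -> int:
    """Hash a float in [0, 1) by scaling it up to the table size."""
    return int(x * M)   # truncation gives an index in [0, M)

print(hash_float(0.0))       # 0
print(hash_float(0.5))       # 48
print(hash_float(0.999999))  # 96
```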
Integers
- Modular hash function: \(h(\text{key}) = \text{key} \mod M\)
- Works well if keys are effectively random
- Performs poorly if keys share structure (e.g., common factors with \(M\))
- \(M\) must be prime
- Multiplication method: \(h(\text{key}) = \lfloor M \cdot ((\text{key} \cdot \alpha) \bmod 1) \rfloor\), i.e., multiply by \(\alpha\), keep the fractional part, and scale up to the table size
- Say \(\alpha = 0.618033\) (inverse of golden ratio)
- And M can be prime or not
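The two integer hash functions above can be sketched as follows (the table sizes and the function names are hypothetical choices; the multiplicative version uses the fractional-part form of the multiplication method with \(\alpha \approx 0.618033\) as in the notes):

```python
M = 101                 # a prime table size (hypothetical choice)
ALPHA = 0.618033        # inverse of the golden ratio, as in the notes

def modular_hash(key: int) -> int:
    """Modular hash: key mod M."""
    return key % M

def multiplicative_hash(key: int) -> int:
    """Multiplication method: keep the fractional part of key * alpha,
    then scale it up to the table size."""
    frac = (key * ALPHA) % 1.0
    return int(M * frac)

print(modular_hash(105))          # 4
print(multiplicative_hash(105))   # some index in [0, 101)
```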
Strings
Consider the following strings:
stop, tops, pots, spot
The ASCII sums of these strings are all equal, so summing character codes won't work. A good string hash should take into account the position of each letter as well as its value. A polynomial hash code does this: \(t(s) = s_0 a^{n-1} + s_1 a^{n-2} + \cdots + s_{n-2}\,a + s_{n-1}\), where \(s_i\) is the character code at position \(i\).
If \(a = 31\), then:
- t("tops") is 3,566,014.
- t("pots") is 3,446,974.
This hashing is somewhat costly, but it can be worth the tradeoff!
But wait... where did \(a = 31\) come from? You want \(a\) to share no common factors with likely character values (note that 31 is prime). It should also be greater than the alphabet size (26 for lowercase letters): if \(a\) were smaller, two distinct letters would map to the same value modulo \(a\), reintroducing collisions.
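The polynomial hash code above can be computed with Horner's rule, reproducing the values for "tops" and "pots" from the notes:

```python
def poly_hash(s: str, a: int = 31) -> int:
    """Polynomial hash code: t(s) = s[0]*a^(n-1) + s[1]*a^(n-2) + ... + s[n-1]."""
    h = 0
    for ch in s:
        h = h * a + ord(ch)   # Horner's rule: position-dependent weighting
    return h

print(poly_hash("tops"))  # 3566014
print(poly_hash("pots"))  # 3446974
```

Note that the four anagrams now all hash differently, even though their ASCII sums are identical.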
Compression Mapping
Map the integer to the address range
- Division method: \(\left|\text{intmap}\right| \mod M\), where \(M\) is prime
- Multiply and divide method (MAD): \(\left|a* \text{intmap} + b\right| \mod M\), where \(M\) is prime and \(a, b\) are non-negative integers and \(a \mod M\) is not equal to zero.
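Both compression schemes are one-liners; a sketch with hypothetical constants (\(M = 109\), \(a = 3\), \(b = 7\), chosen only to satisfy the stated conditions):

```python
M = 109          # prime table size (hypothetical choice)
A, B = 3, 7      # non-negative integers with A % M != 0 (hypothetical choices)

def division_compress(hash_code: int) -> int:
    """Division method: |hash_code| mod M."""
    return abs(hash_code) % M

def mad_compress(hash_code: int) -> int:
    """MAD method: |a * hash_code + b| mod M; helps spread hash codes
    that share patterns relative to M."""
    return abs(A * hash_code + B) % M

print(division_compress(110))  # 1
print(mad_compress(0))         # 7
```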
Complexity
Assuming perfect hashing (no collisions):
- Insertion cost: \(O(1)\)
- Search cost: \(O(1)\)
- Removal cost: \(O(1)\)
Summary: Hash Tables
- Efficient ADT for insert, search, and remove
- Hash function turns key into integer
- Compression map turns integer to address
Collision Resolution
A method to handle the case when two keys hash to the same address.
Separate Chaining
Description
Instead of \(m\) table addresses, you have \(m\) linked lists, one for each table address.
Properties
- Allows for collision resolution
- Uses more memory because of the pointers in the linked lists
- Reduces the number of comparisons for sequential search by a factor of \(M\) (on average), using extra space for \(M\) links
- In a separate chaining hash table with \(M\) lists and \(N\) keys, the probability that the number of keys in each list is within a small constant factor of \(N/M\) is extremely close to 1, if the hash function is good.
Complexity
- Insertions in constant time
- Search and remove in \(O(N/M)\) time.
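A minimal separate-chaining table sketch (the class name and table size are hypothetical; each slot holds a Python list standing in for a linked list):

```python
class ChainedHashTable:
    """Separate chaining: each of the M addresses holds a chain of (key, value) pairs."""

    def __init__(self, m=11):
        self.m = m
        self.buckets = [[] for _ in range(m)]   # one chain per table address

    def _addr(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        bucket = self.buckets[self._addr(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # constant-time append to the chain

    def search(self, key):
        for k, v in self.buckets[self._addr(key)]:   # expected O(N/M) scan
            if k == key:
                return v
        return None

    def remove(self, key):
        a = self._addr(key)
        self.buckets[a] = [(k, v) for k, v in self.buckets[a] if k != key]
```

With a good hash function the chains stay near length \(N/M\), which is what makes the sequential scan cheap.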
Open Addressing
Description
Use empty places in a table to resolve collisions, using probing. When a collision occurs, check the next position in the table. This is called linear probing.
Properties
The probe sequence could run past the end of the table, so indices must wrap around: probe \(c\) examines cell \((h(\text{key}) + c) \bmod M\).
Probe outcomes:
- Miss: Probe finds empty cell in table
- Hit: Probe finds cell that contains item whose key matches search key
- Full: Probe finds cell that has an occupant, but the key doesn't match the search key
- If full, advance to next table cell
Cluster: contiguous group of occupied table cells
- In a table that is half-full:
- Best case distribution is that there are alternating occupied/not occupied cells.
- Worst case distribution is one huge cluster occupying half of the cells.
Deletion
- Option 1: Remove it, then re-hash the rest of the cluster
- Option 2: Use a "dummy" element
  - Not a real element, but not empty either
  - We'll call this "deleted" (a tombstone)
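Option 2 can be sketched as a linear-probing table with tombstones (the class name, sentinel objects, and table size are hypothetical choices; the insert path is simplified and does not reuse tombstone slots optimally):

```python
EMPTY = object()     # never-used cell: ends a probe sequence
DELETED = object()   # "dummy" marker left behind by a removal

class LinearProbingTable:
    def __init__(self, m=11):
        self.m = m
        self.keys = [EMPTY] * m
        self.vals = [None] * m

    def insert(self, key, value):
        for c in range(self.m):                       # probe wraps mod M
            i = (hash(key) + c) % self.m
            cell = self.keys[i]
            if cell is EMPTY or cell is DELETED or cell == key:
                self.keys[i], self.vals[i] = key, value
                return
        raise RuntimeError("table full")

    def search(self, key):
        for c in range(self.m):
            i = (hash(key) + c) % self.m
            if self.keys[i] is EMPTY:                 # miss: empty cell stops the probe
                return None
            if self.keys[i] is not DELETED and self.keys[i] == key:
                return self.vals[i]                   # hit
        return None

    def remove(self, key):
        for c in range(self.m):
            i = (hash(key) + c) % self.m
            if self.keys[i] is EMPTY:
                return
            if self.keys[i] is not DELETED and self.keys[i] == key:
                self.keys[i] = DELETED                # tombstone keeps the cluster intact
                self.vals[i] = None
                return
```

Because a removal leaves a tombstone rather than an empty cell, later searches still probe past it, so keys inserted after the removed one remain reachable.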
Load factor \(\alpha\)
The load factor is \(\alpha = N/M\), the fraction of the table that is occupied. In linear probing, the expected number of probes for a search hit is \(\frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)\).
Expected number of probes for a search miss: \(\frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)\)
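Plugging a few load factors into these formulas shows how linear probing degrades as the table fills:

```python
def expected_probes_hit(alpha: float) -> float:
    """Expected probes for a successful search: 1/2 * (1 + 1/(1 - alpha))."""
    return 0.5 * (1 + 1 / (1 - alpha))

def expected_probes_miss(alpha: float) -> float:
    """Expected probes for an unsuccessful search: 1/2 * (1 + 1/(1 - alpha)^2)."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

print(expected_probes_hit(0.5))    # 1.5: a half-full table averages 1.5 probes per hit
print(expected_probes_miss(0.5))   # 2.5
print(expected_probes_miss(0.9))   # ~50.5: cost blows up as the table nears full
```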