Data Structures and Algorithms

Hash tables

Reasons to use a hash table:

  • Search: Retrieval of a particular piece of information from large volumes of previously stored data
    • To access information within the item (not just the key)
  • Insertions: Arrays and linked lists are worst case \(O(n)\) for insertions and deletions

We can speed these operations up to constant time with hash maps!

Introduction to Dictionary ADT

A dictionary is an abstract data structure of items with keys that supports two basic operations: insert a new item, and search for the item with a given key.

Other commonly supported operations:

  • Remove (inefficient)
  • Sorting the symbol table (inefficient)
  • Select the k-th largest element
  • Join two symbol tables
  • Construct, test if empty, destroy, copy, etc.
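The Dictionary ADT operations above can be sketched with Python's built-in `dict`, which is itself a hash table (the keys and values here are illustrative):

```python
# Python's built-in dict is a hash table, so it models the Dictionary ADT.
table = {}

# Insert: associate a value with a key.
table["alice"] = 30
table["bob"] = 25

# Search: retrieve the value stored under a key.
age = table.get("alice")     # .get returns None on a miss

# Remove (supported, though listed as "inefficient" for some designs):
del table["bob"]

# Test if empty, copy, destroy:
empty = len(table) == 0
backup = table.copy()
table.clear()
```

Note that the "inefficient" operations (sorting, selecting the k-th largest) have no direct `dict` support; you'd have to extract and sort the keys.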

What if you have a large set of keys?

  • No problem! If the keys are integers in a small range, use each key directly as an array index: each item gets its own bucket, so you can find any item in \(O(1)\) time.

What if your range of integers does not fit into memory? Let's say the range of keys is up to 100,000,000 (but you only use 10 of them). You don't want to allocate an array of size 100,000,000, since that wouldn't fit in memory – use hashing!

Hashing

What it lets us do: Reference items in a table by keys.

  • Input: Some key.
  • Output: A table address

You need:

  • A hash function transforms the search key into a table address.
  • A collision resolution system: dealing with search keys that hash to the same table address.

How important is this?

  • Databases are actually just symbol tables!
  • Symbol tables are built into the C++ standard library (e.g. std::map and std::unordered_map)!

Hashing is good at:

  • Insertion
  • Search
  • Removal

Hashing is bad at:

  • Selection of a k-th largest item
  • Sorting

Hash Function

Two parts:

  • Hash code: maps the key into an integer
  • Compression map: maps the integer into the range [0, M)

Good hash functions:

  • Must:
    • Easily computable
    • Deterministic (the same key always hashes to the same address)
    • Computes a hash for every key
  • Want:
    • Minimize number of collisions

Floats

If you are hashing floats in the range [0, 1), then you can just multiply the float by M and convert into an integer for the hash!
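A minimal sketch of this float-hashing idea (the table size \(M = 97\) is an arbitrary choice for illustration):

```python
M = 97  # table size (illustrative choice)

def hash_unit_float(x, m=M):
    """Hash a float in [0, 1) by scaling it by the table size
    and truncating to an integer index."""
    return int(x * m)
```

Since \(0 \le x < 1\), the result always lands in the valid index range \([0, M)\).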

Integers

  • Modular hash function: \(h(\text{key}) = \text{key} \bmod M\)
    • Works well if the keys are randomly distributed
    • Bad if the keys share a pattern with \(M\) (e.g. all share a factor with \(M\))
    • \(M\) should be prime
  • Multiplication method: \(h(\text{key}) = \lfloor M \cdot ((\text{key} \cdot \alpha) \bmod 1) \rfloor\)
    • Say \(\alpha = 0.618033\) (inverse of the golden ratio)
    • \(M\) need not be prime
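Both integer methods can be sketched as follows (table size and key values are illustrative):

```python
import math

M = 101  # prime table size (illustrative)
ALPHA = (math.sqrt(5) - 1) / 2   # inverse golden ratio, ~0.618033

def modular_hash(key, m=M):
    """Modular method: key mod M. Works well when keys are
    randomly distributed; M should be prime."""
    return key % m

def multiplicative_hash(key, m=M, alpha=ALPHA):
    """Multiplication method: take the fractional part of key * alpha,
    then scale it by the table size. M need not be prime here."""
    frac = (key * alpha) % 1.0
    return int(m * frac)
```

Both return an index in \([0, M)\) no matter how large the key is, which is exactly what the 100,000,000-key scenario above needs.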

Strings

Consider the following strings:

stop, tops, pots, spot

The ASCII sum of each is the same, so simply summing character codes would make all four collide. You need a hash that takes into account the position of each letter as well as its value. A polynomial hash code does this: \(h(s) = s[0] \cdot a^{n-1} + s[1] \cdot a^{n-2} + \dots + s[n-1]\).

If \(a = 31\), then:

  • \(h(\text{"tops"})\) is 3,566,014.
  • \(h(\text{"pots"})\) is 3,446,974.

This hashing is somewhat costly, but it can be worth the tradeoff!

But wait... where does \(a\) come from? You should try to eliminate common factors (note that 31 is prime). It should also be larger than the alphabet size (greater than 26 for lowercase letters), because otherwise two distinct characters would map to the same value after the mod.
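The polynomial hash code above is usually evaluated with Horner's rule, which avoids computing explicit powers of \(a\):

```python
def poly_hash(s, a=31):
    """Polynomial hash code: h(s) = s[0]*a^(n-1) + ... + s[n-1],
    evaluated with Horner's rule so letter position matters."""
    h = 0
    for ch in s:
        h = h * a + ord(ch)
    return h
```

This reproduces the values quoted above: `poly_hash("tops")` is 3,566,014 and `poly_hash("pots")` is 3,446,974, so the anagrams no longer collide.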

Compression Mapping

Map the integer to the address range

  • Division method: \(\left|\text{intmap}\right| \bmod M\), where \(M\) is prime
  • Multiply, add, and divide method (MAD): \(\left|a \cdot \text{intmap} + b\right| \bmod M\), where \(M\) is prime, \(a, b\) are non-negative integers, and \(a \bmod M \neq 0\)
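Both compression maps are one-liners; the constants below (\(M = 101\), \(a = 33\), \(b = 7\)) are illustrative choices satisfying the constraints:

```python
M = 101      # prime table size (illustrative)
A, B = 33, 7 # A mod M != 0 (illustrative constants)

def division_compress(intmap, m=M):
    """Division method: map a hash code to an address in [0, M)."""
    return abs(intmap) % m

def mad_compress(intmap, a=A, b=B, m=M):
    """Multiply, add, and divide (MAD) method: scatter hash codes
    before reducing mod M."""
    return abs(a * intmap + b) % m
```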

Complexity

Assuming perfect hashing (no collisions):

  • Insertion cost: \(O(1)\)
  • Search cost: \(O(1)\)
  • Removal cost: \(O(1)\)

Summary: Hash Tables

  • Efficient ADT for insert, search, and remove
  • Hash function turns key into integer
  • Compression map turns integer to address

Collision Resolution

A method to handle the case when two keys hash to the same address.

Separate Chaining

Description

Instead of \(m\) table addresses, you have \(m\) linked lists, one for each table address.

Properties

  • Allows for collision resolution
  • Uses more memory because of the pointers
  • Reduces the number of comparisons for sequential search by a factor of \(M\) (on average), using extra space for \(M\) links
  • In a separate chaining hash table with \(M\) lists and \(N\) keys, the probability that the number of keys in each list is within a small constant factor of \(N/M\) is extremely close to 1, if the hash function is good.

Complexity

  • Insertions in constant time
  • Search and remove in \(O(N/M)\) time.
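A minimal sketch of a separate-chaining table (class and method names are illustrative; Python lists stand in for the linked lists):

```python
class ChainedHashTable:
    """Separate chaining: one list per table address."""

    def __init__(self, m=11):
        self.m = m
        self.buckets = [[] for _ in range(m)]  # M lists, one per address

    def _bucket(self, key):
        return self.buckets[hash(key) % self.m]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # O(1) append on a miss

    def search(self, key):
        # Sequential search, but only within one list of ~N/M items.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return
```

Search and remove scan a single bucket, which is why their cost is \(O(N/M)\) on average.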

Open Addressing

Description

Use empty places in a table to resolve collisions, using probing. When a collision occurs, check the next position in the table. This is called linear probing.

Properties

Probing could run past the end of the allocated table, so probe positions must wrap around: check \((h(\text{key}) + c) \bmod M\) for \(c = 0, 1, 2, \dots\)

Probe outcomes:

  • Miss: Probe finds empty cell in table
  • Hit: Probe finds cell that contains item whose key matches search key
  • Full: Probe finds cell that has an "occupant", but its key doesn't match the search key
    • If full, advance to the next table cell

Cluster: contiguous group of occupied table cells

  • In a table that is half-full:
    • Best case distribution is that there are alternating occupied/not occupied cells.
    • Worst case distribution is that one huge chunk occupies half of the cells.

Deletion

  • Option 1: Remove the item, then re-hash the rest of its cluster
  • Option 2: Use a "dummy" element
    • Not an element, but not empty either
    • We'll call this "deleted"
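A sketch of linear probing with "deleted" markers (Option 2), using a tombstone sentinel. Names are illustrative, and insert is simplified: it reuses the first free slot it finds rather than first scanning for a duplicate key past a tombstone:

```python
DELETED = object()  # tombstone: not an element, not empty either

class LinearProbingTable:
    """Open addressing with linear probing and tombstone deletion."""

    def __init__(self, m=11):
        self.m = m
        self.slots = [None] * m  # None = empty, DELETED = tombstone

    def insert(self, key, value):
        i = hash(key) % self.m
        for _ in range(self.m):
            slot = self.slots[i]
            if slot is None or slot is DELETED or slot[0] == key:
                self.slots[i] = (key, value)
                return
            i = (i + 1) % self.m          # full: advance, wrapping around
        raise RuntimeError("table full")

    def search(self, key):
        i = hash(key) % self.m
        for _ in range(self.m):
            slot = self.slots[i]
            if slot is None:              # miss: empty cell ends the probe
                return None
            if slot is not DELETED and slot[0] == key:
                return slot[1]            # hit
            i = (i + 1) % self.m          # full or tombstone: keep probing
        return None

    def remove(self, key):
        i = hash(key) % self.m
        for _ in range(self.m):
            slot = self.slots[i]
            if slot is None:
                return
            if slot is not DELETED and slot[0] == key:
                self.slots[i] = DELETED   # tombstone, so later probes continue
                return
            i = (i + 1) % self.m
```

The tombstone matters: if `remove` wrote `None` instead, a later search for an item further along the cluster would stop early at the empty cell and report a false miss.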

Load factor \(\alpha\)

  • \(\alpha = N/M\), where \(N\) keys are placed in an \(M\)-sized table
  • Separate chaining:
    • \(\alpha\) is the average number of items per list
    • \(\alpha\) can be larger than 1
  • Linear probing:
    • \(\alpha\) is the fraction of table positions occupied
    • \(\alpha\) must be \(\le 1\)

With linear probing, the expected number of probes for a search hit is \(\frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)\).

Expected number of probes for a search miss: \(\frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)\)
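These two cost formulas are easy to evaluate directly; for a half-full table (\(\alpha = 0.5\)) they give 1.5 and 2.5 expected probes:

```python
def expected_probes_hit(alpha):
    """Expected probes for a successful search under linear probing:
    (1/2) * (1 + 1/(1 - alpha))."""
    return 0.5 * (1 + 1 / (1 - alpha))

def expected_probes_miss(alpha):
    """Expected probes for an unsuccessful search under linear probing:
    (1/2) * (1 + 1/(1 - alpha)^2)."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)
```

Note how both blow up as \(\alpha \to 1\), which is why linear-probing tables are resized well before they fill.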