Time Complexity | Note | |
---|---|---|
Worst Case: | O(n) | Linear, depending on collision handling |
Best Case: | O(1) | Thanks to hashing |
- Each item must have a unique key— or the value is converted into a unique key through hashing.
- Underlying data structure is either an incredibly large array or a linked list
- Hash (maps a unique key to a position in the list)
- Insert
- Search
- Delete
Converts a specified word into a unique key that can be used to determine its position in the underlying array.
The goal is to have a function that maps a unique key for any input value— this is to ensure no 'collisions' occur (that is, two differnt words having the same key). A good hash function requires the following:
- type dependent (otherwise it cannot compute a value)
- must return a value within the array range (otherwise it will return an index error when inserting)
- not have too many operations (otherwise slowness)
- use as much of the key as possible (minimise collisions)
- avoid common factors (minimise collisions)
Below is a universal hash function implemented in Python. It is best to use prime numbers for TABLE_SIZE
, a
and b
.
def hash_function(word, TABLE_SIZE):
value = 0
# random seeds
a = 10651
b = 10531
for i in range(len(word)):
value = (a*value + ord(word[i])) % TABLE_SIZE
a = a * b % (TABLE_SIZE-1)
return value
- Use the hash function to get a position
- Try to insert at position
- Otherwise, deal with the collision
- Use hash function to get a position
- If the query matches, return True
- Otherwise, deal with the collision until found
- If not found, return False
- Use the hash function to get a position
- Search for the item, deal with collisions
- If it exists and is found, remove the item
Clustering occurs when elements of similar keys create groups in a hash table array (called a probe chain)
This is a problem because when a search occurs for a particular key and the actual key location has another item that doesn't belong there, it has to traverse over the entire cluster— which would affect performance.
- Checking every item in the array until the position's conditions are perfect
- Check query position, then look at the next, then the next... wrapping around the entire table until you're back to the original position
- Like Linear Probing, but moving in larger and larger steps, wrapping around the list
- A problem with probing is that items with similar hash values tend to 'cluster'
- Using a second hash function to determine the probe 'step' to prevent clustering
- Should not hash to 0, and the table size and step size are co-prime
- Using linked lists for every item in the large array
- If a collision occurs, just append a node to the end of the chain in the array position and traverse
class Dictionary:
def __init__(self, table_size, prime):
self.prime = prime
self.item_count = 0
self.table_size = table_size
self.array = [None] * table_size
def hash(self, key):
index = 0
a = 2692
b = 8523
for i in range(len(key)):
index = (a * index + ord(key[i])) % self.table_size
a = a * b % (self.table_size-1)
return index
def is_full(self):
return self.item_count >= self.table_size
def insert(self, key, value):
index = self.hash(key)
if self.is_full():
found = False
end = (index - 1) % self.table_size
while index != end:
if self.array[index][0] == key:
self.array[index] = [key, value]
found = True
break
index = (index + 1) % self.table_size
if not found:
raise Exception("Dictionary is full!")
else:
while not self.is_full():
if self.array[index] != None and self.array[index][0] == key:
self.array[index] = [key, value]
break
if self.array[index] == None:
self.array[index] = [key, value]
self.item_count += 1
break
index = (index + 1) % self.table_size
def search(self, key):
index = self.hash(key)
if self.array[index][0] == key:
return self.array[index][1]
else:
end = (index - 1) % self.table_size
if self.array[end][0] == key:
return self.array[end][1]
index = (index + 1) % self.table_size
while index != end:
if self.array[index][0] == key:
return self.array[index][1]
else:
index = (index + 1) % self.table_size
raise KeyError(str(key) + " not found!")
Given 100 items, the probability of collision for:
- 23 items is ~50%
- 50 items is 97%
- 70 items is 99.9%