Java HashMap Implementation

HashMap uses an array to store data. It initializes this array when created and resizes it dynamically based on the amount of data being stored.

Keys are converted to index positions in the array. When two keys map to the same index, a collision occurs. The colliding key/value pairs are then stored as a LinkedList at that index.

If the LinkedList gets too long, it's converted to a balanced binary search tree (a red-black tree). If the tree later gets too small, it's converted back into a LinkedList.

The HashMap hashing technique spreads the bits of each key's hashCode() value to create a better distribution of index values across the array. This minimizes collisions and is central to the benefits of using a HashMap.

Hashing

When you do something like this:
myMap.get("key1")

You're really doing something like this...

tab[35]

This is because HashMap uses hashing to convert the key into an index in the array. This hashing function looks like this:

// From java.util.HashMap: XORs the high 16 bits of the hashCode into the low 16
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

Notice how 0 is returned if the key is null. Otherwise, the key's hashCode() function is used to calculate the integer hash value.

(h >>> 16) shifts the bits of h sixteen places to the right, and the result is XORed with the original h. This spreads the high-order bits into the low-order bits, allowing them to influence the index calculation.
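
To see the spread in action, here's a small runnable sketch (the key "key1" is just an illustrative value):

public class HashSpreadDemo {
    public static void main(String[] args) {
        int h = "key1".hashCode();
        int spread = h ^ (h >>> 16); // same computation HashMap.hash() performs
        System.out.println("hashCode: " + Integer.toBinaryString(h));
        System.out.println("spread:   " + Integer.toBinaryString(spread));
    }
}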

Calculating the index

The index for the array is found like this:

i = (n - 1) & hash;

n is the length of the array, which is always a power of two. Because of that, (n - 1) & hash is a fast bitwise equivalent of taking the remainder hash % n.
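
For example, with the default capacity (a sketch using a made-up hash value):

int n = 16;              // table length, always a power of two
int hash = 45;           // example hash value: 0b101101
int i = (n - 1) & hash;  // 15 & 45 = 0b001111 & 0b101101 = 0b001101 = 13
// same result as 45 % 16 = 13, but without a division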

Why all the bit manipulation?

By spreading the bits via (h = key.hashCode()) ^ (h >>> 16), indexes are better distributed across the array. This is especially true for smaller table lengths (like the default 16 or 32), where the high bits of the hash would otherwise never be considered by the (n - 1) & hash mask.
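
For instance, consider two made-up hash values that differ only in their high bits. Without the spread they collide; with it they don't:

int n = 16;                             // default table length
int h1 = 0x10000, h2 = 0x20000;         // differ only above bit 15
int i1 = (n - 1) & h1;                  // 0 -- high bits masked away
int i2 = (n - 1) & h2;                  // 0 -- same index, collision
int s1 = (n - 1) & (h1 ^ (h1 >>> 16));  // 1 -- spread bits reach the mask
int s2 = (n - 1) & (h2 ^ (h2 >>> 16));  // 2 -- no collision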

Why is the index so important?

The index is the position where the key/value is stored for a given key. If keys have the same index, then collision occurs. When collision occurs, elements having the same index are stored as a LinkedList or Tree depending on how many collisions there are.

The performance of the HashMap significantly decreases with collisions. Instead of accessing an array element in O(1) time, the bin's LinkedList has to be traversed in O(n) time to find a match. This is why collision is not good, and why having good index distribution matters.
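
Here's a sketch of the worst case, using a deliberately bad, made-up key class whose hashCode() is constant, so every entry collides:

import java.util.HashMap;
import java.util.Map;

// BadKey is a hypothetical class for illustration only.
class BadKey {
    final int id;
    BadKey(int id) { this.id = id; }
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
    @Override public int hashCode() { return 42; } // every key collides
}

public class CollisionDemo {
    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            map.put(new BadKey(i), i); // all 1000 entries land in one bin
        }
        // Every lookup must now search a single overcrowded bin
        // instead of hitting its own array slot in O(1).
        System.out.println(map.get(new BadKey(999)));
    }
}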

Resizing

The underlying array is initialized with an initial capacity (default 16). Once the number of entries exceeds the capacity multiplied by the load factor (default 0.75), the array is automatically doubled to the next power of 2.

The load factor is simply how full the map can get before resizing takes place. A load factor of 0.75 means the map can be 75% full before resizing: with the default capacity of 16, adding the 13th entry (16 × 0.75 = 12) triggers a resize to 32.

resize() gets called any time the number of entries exceeds this threshold (capacity × load factor). This evaluation takes place whenever you call put().

Resizing is an expensive operation because a new (larger) array has to be created and populated with the same data as the original array. This takes time and should be avoided when possible.
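
If you know roughly how many entries you'll store, you can size the map up front so it never needs to resize (a sketch; 10_000 is an arbitrary example count):

import java.util.HashMap;
import java.util.Map;

public class Presized {
    public static void main(String[] args) {
        int expected = 10_000; // example expected entry count
        // Capacity chosen so 'expected' entries stay under the 0.75 threshold.
        Map<Integer, String> map = new HashMap<>((int) (expected / 0.75f) + 1);
        for (int i = 0; i < expected; i++) {
            map.put(i, "value" + i); // no resize() calls along the way
        }
    }
}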

LinkedList vs Tree?

When a collision occurs, two keys lead to the same index calculation, either because they have the same hash value or because different hash values are masked to the same index. When two keys have the same index, they are stored as a LinkedList at that index.

To improve performance, Java 8 added logic to convert this LinkedList to a tree of nodes based on a threshold. If the length of the LinkedList grows greater than TREEIFY_THRESHOLD (8), and the table is large enough, the bin is converted to a red-black tree. If the bin later shrinks below UNTREEIFY_THRESHOLD (6), the tree is converted back into a LinkedList.

Similar to resizing, converting between trees and lists is an expensive operation. The reasoning behind tree conversion is that a balanced binary search tree has much faster lookup, O(log n), than a LinkedList's O(n).
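
The thresholds are fixed constants in the OpenJDK HashMap source (MIN_TREEIFY_CAPACITY means small tables are resized rather than treeified):

static final int TREEIFY_THRESHOLD = 8;     // bin becomes a tree above this
static final int UNTREEIFY_THRESHOLD = 6;   // tree shrinks back to a list below this
static final int MIN_TREEIFY_CAPACITY = 64; // below this, resize instead of treeify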

HashMap put() Step by Step

HashMap<String, String> myMap = new HashMap<>();
myMap.put("Name", "Sam");

1) The key "Name" is used to calculate the hash.

2) The resulting hash is used to calculate the index via i = (n - 1) & hash.

3) The underlying array is checked for an element at position i.

4) If nothing is stored at i, insert the key/value pair at i.

5) If something exists at i, use equals() and hash to determine equality.

6) If equal, replace the entry at i.

7) If unequal, iterate through the LinkedList or Tree to find the matching entry using equals().

8) If match found, replace within the LinkedList or Tree.

9) If no match found, append to the end of the LinkedList or Tree.
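
Here's a condensed sketch of that flow. It is not the real JDK putVal(): treeification and resizing are omitted, and SimpleMap and Node are minimal illustrative classes.

// Illustrative only: a bare-bones chained hash table mirroring the steps above.
class SimpleMap<K, V> {

    static class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;
        Node(int hash, K key, V value) {
            this.hash = hash;
            this.key = key;
            this.value = value;
        }
    }

    @SuppressWarnings("unchecked")
    Node<K, V>[] tab = (Node<K, V>[]) new Node[16];

    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    V put(K key, V value) {
        int hash = hash(key);                       // step 1: hash the key
        int i = (tab.length - 1) & hash;            // step 2: hash -> index
        if (tab[i] == null) {                       // steps 3-4: empty bin, insert
            tab[i] = new Node<>(hash, key, value);
            return null;
        }
        for (Node<K, V> n = tab[i]; ; n = n.next) { // steps 5-7: walk the bin
            if (n.hash == hash && (n.key == key || (key != null && key.equals(n.key)))) {
                V old = n.value;                    // step 8: match found, replace value
                n.value = value;
                return old;
            }
            if (n.next == null) {                   // step 9: no match, append
                n.next = new Node<>(hash, key, value);
                return null;
            }
        }
    }
}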

HashMap get() Step by Step

myMap.get("Name");

1) The key "Name" is used to calculate the hash.

2) The resulting hash is used to calculate the index via i = (n - 1) & hash.

3) Use i to retrieve the entry in near constant time, O(1). If the bin holds multiple entries, the LinkedList or Tree is traversed using equals() to find the match.
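
A matching sketch of get(), as a method of the illustrative SimpleMap class above:

V get(Object key) {
    int hash = hash(key);                 // step 1: hash the key
    int i = (tab.length - 1) & hash;      // step 2: hash -> index
    for (Node<K, V> n = tab[i]; n != null; n = n.next) {
        // step 3: usually the first node; collisions mean walking the bin
        if (n.hash == hash && (n.key == key || (key != null && key.equals(n.key)))) {
            return n.value;
        }
    }
    return null;                          // no mapping for this key
}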

Performance

Remember that things like resizing arrays and converting between lists and trees are expensive. They take time because you have to create new objects and populate them with the existing entries.

The idea is that hashCode() implementations are strong enough to create a good distribution. This minimizes collisions, and everything can be looked up in near constant time, O(1). The more collisions you have, the more O(n) work you create traversing LinkedLists. And when those lists are converted to Trees, there is a performance cost there too. This is why having a strong hashCode() is so important.

Also remember that iteration takes time proportional to the capacity plus the number of entries. For these reasons, it's recommended that you use a high initial capacity for a large number of entries with little iteration, and a low initial capacity for a small number of entries with lots of iteration.
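
To see the iteration side of that tradeoff, compare two maps holding the same ten entries (the capacity and timings here are illustrative, not a rigorous benchmark):

import java.util.HashMap;
import java.util.Map;

public class IterationCost {
    public static void main(String[] args) {
        Map<Integer, Integer> small = new HashMap<>();        // capacity 16
        Map<Integer, Integer> huge = new HashMap<>(1 << 24);  // ~16M buckets
        for (int i = 0; i < 10; i++) { small.put(i, i); huge.put(i, i); }

        long t0 = System.nanoTime();
        long sum1 = 0;
        for (int v : small.values()) sum1 += v;  // walks 16 buckets
        long t1 = System.nanoTime();
        long sum2 = 0;
        for (int v : huge.values()) sum2 += v;   // walks ~16 million buckets
        long t2 = System.nanoTime();

        System.out.println("small: " + (t1 - t0) + " ns, huge: " + (t2 - t1) + " ns");
    }
}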

Your thoughts?