Skip to content

Instantly share code, notes, and snippets.

@mtroiani
Created January 25, 2018 02:52
Show Gist options
  • Save mtroiani/e03e48aa23b7e46b151d52ac042a8fa3 to your computer and use it in GitHub Desktop.
Save mtroiani/e03e48aa23b7e46b151d52ac042a8fa3 to your computer and use it in GitHub Desktop.

Hashmaps and Hashsets

What are hashmaps?

Hashmaps are a way to store key/value pairs with a fast (O(1) on average) lookup time.

We commonly use arrays for constant time lookups, but hashmaps give us more flexability as to what we can store. For example, you may have used an array's index to 'map' to character values:

Let's count the number of times each character appears: "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus sed posuere nunc, sit amet blandit felis. Ut non orci eget."

Index Data
97 5
98 1
99 5
100 4
101 12
102 1

Using an array we can look up the count of any particular character in constant time just by using its ASCII value. But what if we wanted to look up something that didn't have an ASCII value, say for example an entire word?

We could try to save the words alphabetically, but then how would we know which word maps to which index? And what would happen if we need to add a new word? We need a way to consistently map a word to a space in the array. We can do this using something called a hashing function.

Hashing Functions

Hash functions are a way to map a piece of data from one form to a number - typically an integer or a long. So, for example, let's take one of the words from our sample text: "Lorem". We already have one way of converting characters to numbers using the ASCII chart - let's start with that.

L (76) o (111) r (114) e (101) m (109)

There are many, many ways these could be combined, but let's keep it simple and just add them.

76 + 111 + 114 + 101 + 109 = 511

Great, the word "Lorem" could map to array index 511. This could make for a very long array though, so let's map it to a smaller index. Say our array is of length 5. We can use the modulo operator: 511 % 5 = 1. Now our word belongs in index 1. We can continue this for the next words in our string.

  • ipsum -> 558 % 5 = 3
  • dolor -> 544 % 5 = 4
  • sit -> 337 % 5 = 2
Index Key Value
0
1 Lorem 1
2 sit 1
3 ipsum 1
4 dolor 1

Collisions

amet -> 423 % 5 = 3

Oh no! The word "amet" should hash to index 3, but we already have the count for the word ipsum there. This is called a collision, and it happens when there are two or more hashed values that map to the same spot. What can we do about this? Here are two common approaches for resolving collisions:

  1. Linear probing: In linear probing we just put the hashed key into the next clostest empty space. (In this particular example it would be index 0). Later, when we're searching for the key we start at the original hash location and keep searching until we either find the correct key or an empty space.
  2. Linked Lists: Instead of putting the key/value pair directly into the array we could store a pointer to a linked list. This linked list continues to grow as necessary. While we'd need to search through the entire linked list for the value we're looking for as long as there are enough 'buckets' in the hashmap none of the lists would get too long.

Preventing Collisions

Our goal for using a hashmap was to have constant time O(1) lookups, but when there are too many collisions our lookup times can get longer if we need to search through a linked list, or if we need to search many locations due to linear probing. One of the main ways to prevent our lookup time from getting too long is by increasing the size of our hashtable. This is called dynamic resizing. With dynamic resizing once our table reaches a certain percentage of 'fullness' it automatically resizes. This typically occurs at around 2/3 load (Python) to 3/4 load (Java).

Resizing takes extra time to rehash the existing items in the hashmap and takes more space than the previous map, but it reduces the chance of future collisions. Resizing typically doesn't happen very often, so while our worst case insertion is O(n) the typical case is O(1).

Why would we use a hashmap?

  • Hashmaps are very useful - they allow us to map a key to a value.
  • Hashmaps have constant time insertion, deletion, and lookup in the average case (and O(n) in the worst case)

Why wouldn't we use a hashmap?

  • Items in a hashmap are unordered, so if you know your use case would include iterating through items in order you'd want to consider other options.

Hashsets

Hashmaps are such an important datastructure that they can lay the foundation for other data structures such as sets. A hashset is a set construction that uses a hashtable underneath to implement a set. There are other ways to construct a set, but we'll focus on hashsets.

What is a set?

A set is an unordered collection of items that cannot have duplicates. Sets typically have fast lookup for membership (hashsets do this in O(1) on the average case), and allow for set operations such as union, intersection, difference, and subset.

How are hashsets related to hashmaps?

Let's think back to the hashmap we made before:

Index Key Value
0
1 Lorem 1
2 sit 1
3 ipsum 1
4 dolor 1

What if, instead of counting the occurances of the words in our string we only wanted to check if they are present or not? In that case, the value becomes unnecessary - the key is what really matters. Instead, the value can be changed to constant dummy value.

Index Key Value
0
1 Lorem Present
2 sit Present
3 ipsum Present
4 dolor Present

In this case we use essentially the same hashing methods, collison prevention methods, and lookup methods as we do with hashmaps.

Why would we use a hashset?

  • Useful for checking membership
  • Constant time insertion, deletion, and lookup in the average case (and O(n) in the worst case)

Why wouldn't we use a hashset?

  • Items in a hashset are unordered (Note that some language have ordered sets, but they typically use a tree to do so - not a hashmap)

Where can I learn more?

Check your favorite language's docs!

Hashmaps

Sets

Let's Use Hashmaps and Sets

Given an array of integers, find if the array contains any duplicates. Your function should return true if any value appears a least twice in the array, and it should return false if every element is distinct.

Given a List of integers, find two numbers contained in it such that they sum up to a specific target. The function couple_sum should return the indices of the two numbers such that they add up to the target, where index1 must be less than index2. Please note that the indices are not zero based, and you can assume that each input has exactly one solution. Example: Input: [2, 3, 4, 7], Target = 7, Output = [2, 3]

Given two strings s and t, write a function to determine if t is an anagram of s.

For example, s = "anagram", t = "nagaram", return true. s = "rat", t = "car", return false.

Note: You may assume the string contains only lowercase alphabets.

Follow up: What if the inputs contain unicode characters? How would you adapt your solution to such case?

Given a List of words, return the words that can be typed using letters of alphabet on only one row's of American keyboard like the image below.

Example 1: Input: ["Hello", "Alaska", "Dad", "Peace"] Output: ["Alaska", "Dad"] Note: You may use one character in the keyboard more than once. You may assume the input string will only contain letters of alphabet.

Given two strings s and t which consist of only lowercase letters.

String t is generated by random shuffling string s and then add one more letter at a random position.

Find the letter that was added in t.

Example: Input: s = "abcd", t = "abcde"

Output: e


Gist content by Mary Troiani for presentation she gave at the Algorithms and Technical Interview Study Group - Jan 24, 2018 meetup

This is a part of a series of presentations for Learn Teach Code - Algorithm Study Group

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment