To explore hash collisions, I first wrote the following utility method. It uses reflection to access HashMap's internal 'table' field, the array whose slots (buckets) hold the key-value entries. (Note: on Java 9+ this kind of reflective access into java.util requires opening the package, e.g. --add-opens java.base/java.util=ALL-UNNAMED.)
// Reads HashMap's internal 'table' array via reflection and reports how
// many of its buckets actually hold at least one entry.
private void analyzeBuckets(HashMap<?, ?> hashMap) throws NoSuchFieldException, IllegalAccessException {
    Field field = hashMap.getClass().getDeclaredField("table");
    field.setAccessible(true);
    Object[] table = (Object[]) field.get(hashMap);
    System.out.printf("Total Buckets : %s Used : %s%n",
            table.length,
            Arrays.stream(table).filter(Objects::nonNull).count());
}
When I store Integers directly as keys, all 10,000 entries end up in 10,000 distinct buckets. This happens because hashCode() of Integer returns the integer value itself, so consecutive keys spread evenly across the bucket range. So if it is known in advance that the Integer keys are evenly distributed across a range, they are usually the preferable key type for the map.
(Java's HashMap has a load factor, which defaults to 0.75: every time the number of entries reaches 75% of the table's bucket capacity (the inner array size), the table doubles its capacity.)
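To see why consecutive Integer keys fill exactly one bucket each, here is a standalone sketch of the bucket-index arithmetic. The spread step below follows the Java 8 HashMap source (h ^ (h >>> 16)); the capacity of 16384 is the table size the tests below end up with after resizing for 10,000 entries.

```java
public class IntegerKeyBuckets {
    public static void main(String[] args) {
        // Integer.hashCode() simply returns the wrapped value
        System.out.println(Integer.valueOf(12345).hashCode()); // prints 12345

        // Java 8's HashMap spreads the hash as h ^ (h >>> 16) and then picks
        // the bucket with hash & (capacity - 1). For small non-negative keys
        // the spread is a no-op, so key i lands in bucket i % capacity --
        // which is why 10,000 consecutive Integer keys use 10,000 buckets.
        int capacity = 16384;        // table size after resizing for 10K entries
        int key = 12345;
        int h = key ^ (key >>> 16);  // equals key, since key < 2^16
        System.out.println(h & (capacity - 1)); // prints 12345
    }
}
```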
@Test
public void testHashMap_Integer() throws Exception {
    HashMap<Integer, String> hashMap = new HashMap<>();
    IntStream.rangeClosed(1, 10_000)
            .forEach(i -> hashMap.put(i, Integer.toString(i)));
    analyzeBuckets(hashMap);
}
// Total Buckets : 16384 Used : 10000
Now let's see what happens if we store the same integers as String keys.
@Test
public void testHashMap_String_incrementing() throws Exception {
    HashMap<String, String> hashMap = new HashMap<>();
    IntStream.rangeClosed(1, 10_000)
            .mapToObj(String::valueOf) // Integer::toString would be ambiguous here
            .forEach(s -> hashMap.put(s, s));
    analyzeBuckets(hashMap);
}
// Total Buckets : 16384 Used : 6765
Here, unlike with Integer keys, only 6765 buckets were used. What is the reason? Hash collisions. Although a good hash function tends to distribute values uniformly across its discrete output domain in the limit of many inputs, for any finite set of inputs we cannot expect perfect spreading, because the outputs collide by chance (see the Birthday Paradox). In this case, if a random number series is used instead of the incremental series, bucket usage rises from ~67% to ~75%.
@Test
public void testHashMap_String_random() throws Exception {
    HashMap<String, String> hashMap = new HashMap<>();
    ThreadLocalRandom.current().ints().limit(10_000)
            .mapToObj(String::valueOf)
            .forEach(s -> hashMap.put(s, s));
    analyzeBuckets(hashMap);
}
// Total Buckets : 16384 Used : 7486
// Total Buckets : 16384 Used : 7531
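The ~75% figure for random keys is roughly what the classic balls-into-bins model predicts: if n keys each pick one of m buckets uniformly at random, the expected number of non-empty buckets is m * (1 - (1 - 1/m)^n). A quick sketch, plugging in the 16384 buckets and 10,000 entries from the tests above:

```java
public class ExpectedOccupancy {
    public static void main(String[] args) {
        int m = 16_384;  // buckets (table capacity from the tests above)
        int n = 10_000;  // entries
        // A given bucket stays empty with probability (1 - 1/m)^n, so the
        // expected count of used buckets is m * (1 - (1 - 1/m)^n).
        double expectedUsed = m * (1 - Math.pow(1 - 1.0 / m, n));
        System.out.printf("Expected used buckets: %.0f%n", expectedUsed); // roughly 7485
    }
}
```

That expectation (~7485) sits right next to the observed 7486 and 7531 for random keys, while the incremental String keys (6765) fall short of even this ideal, suggesting their hashes are less uniform than random.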
However, we cannot expect it to reach 100%. If we look closely at hashCode() of the String class, it attempts to generate a number identifying the string, but makes no attempt to distribute those numbers evenly (and in any case there is no requirement for a hash function to be uniformly distributed):
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
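Because the hash only folds character values together with the multiplier 31, distinct strings collide easily. The classic example, which you can verify by hand from the loop above (31 * 'A' + 'a' = 31 * 65 + 97 = 2112, and 31 * 'B' + 'B' = 31 * 66 + 66 = 2112):

```java
public class StringCollision {
    public static void main(String[] args) {
        // Two different strings with identical hash codes
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
        // Both keys therefore land in the same HashMap bucket.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
    }
}
```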
TL;DR If the integers are guaranteed to come from a well-distributed range, use Integers for HashMap keys instead of Strings. Otherwise String keys still work, but expect a noticeable fraction of buckets to stay empty due to collisions.
- Hash Table https://en.wikipedia.org/wiki/Hash_table
- Open Addressing https://en.wikipedia.org/wiki/Open_addressing
- Why did the language designers of Java preferred chaining over open addressing for most hash based structures except for some like ThreadLocal? https://stackoverflow.com/questions/12019434/why-did-the-language-designers-of-java-preferred-chaining-over-open-addressing-f
- Properties of Hash Function https://en.wikipedia.org/wiki/Cryptographic_hash_function