Skip to content

Instantly share code, notes, and snippets.

@asela38
Last active August 10, 2018 10:09
Show Gist options
  • Save asela38/2fcdadd40e74f99307e6ffb4a0f28022 to your computer and use it in GitHub Desktop.
Save asela38/2fcdadd40e74f99307e6ffb4a0f28022 to your computer and use it in GitHub Desktop.
Why it's good to keep Strings as key of a hash map over Integers?

First to explore the hash collisions i used following utility methods which access the field 'table' array which stores all the key-value pairs in the hash table.

    private void analyzeBuckets(HashMap<?, ?> hashMap) throws NoSuchFieldException, IllegalAccessException {
        Class<?> class1 = hashMap.getClass();
        
        Field field = class1.getDeclaredField("table");
        field.setAccessible(true);
        Object[] objects = (Object[]) field.get(hashMap);

        System.out.printf("Total Buckets : %s Used : %s ", 
                            Arrays.stream(objects).count(), 
                            Arrays.stream(objects).filter(Objects::nonNull).count());
    }

When I directly store integers as my key for the entries it can be seen that 10K slots in the array have been used. This happens because hashcode() of Integer returns itself as the hashcode. so it evenly get distributed between the bucket range. So if it's a prior knowladge that Integer keys are evenly distributed across a range it'll be mostly prefered to be used as key for the map.

(Java HashMap has load factor, which is defaulted to 0.75, so everytime number of entries reached 75% of the table bucket capacity(inner array size) it double itself)

    
    @Test
    public void testHashMap_Integer() throws Exception {
        HashMap<Integer, String> hashMap = new HashMap<>();
        IntStream.iterate(1, i -> i + 1).limit(10000).mapToObj(Integer::toString)
                .forEach(i -> hashMap.put(Integer.valueOf(i), i));

        analyzeBuckets(hashMap); 
    }
    // Total Buckets : 16384 Used : 10000 

Now let's see what happens if we stores key as String using same integer

    @Test
    public void testHashMap_String_incrementing() throws Exception {
        HashMap<String, String> hashMap = new HashMap<>();
        IntStream intStream = IntStream.iterate(1, i -> i + 1);
        intStream.limit(10000).mapToObj(Integer::toString).forEach(i -> hashMap.put(i, i));
        analyzeBuckets(hashMap); 
    }
    // Total Buckets : 16384 Used : 6765 

Here, unlike when we used integers, it used only 6765 buckets. What is the Reason? Hash collisions. Though hash functions tend to yield uniformly distribute values across its discreet output domain when tried with the infinite amount of inputs, with finite input domain we cannot expect it's to be held because of the randomness of output based on the input( see Birthday Paradox). In this case, Instead of using incremental series, if a random number series is used bucket percentage increased to ~75% from this 67%.

    
    @Test
    public void testHashMap_String_random() throws Exception {
        HashMap<String, String> hashMap = new HashMap<>();
        IntStream intStream =ThreadLocalRandom.current().ints();
        intStream.limit(10000).mapToObj(Integer::toString).forEach(i -> hashMap.put(i, i));
        analyzeBuckets(hashMap); 
    }
    // Total Buckets : 16384 Used : 7486 
    // Total Buckets : 16384 Used : 7531 

However we cannot expect it to reach 100%. If we look close to hasCode() of String class, what it attept to genrate random uniquie number for a given string which doesn't consider about it's distribution (any way there's no requirement for hashfunction to be uniformly distributed)

    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

TL;DR If integers are grarateed to be from a properly distributed range use Integers for HashMap keys insertead of Strings. Otherwise using

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment