Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
JS simple hash function
var hash = function(s) {
/* Simple hash function. */
var a = 1, c = 0, h, o;
if (s) {
a = 0;
/*jshint plusplus:false bitwise:false*/
for (h = s.length - 1; h >= 0; h--) {
o = s.charCodeAt(h);
a = (a<<6&268435455) + o + (o<<14);
c = a & 266338304;
a = c!==0?a^c>>21:a;
}
}
return String(a);
};
@u01jmg3
Copy link

u01jmg3 commented Apr 22, 2021

@bryc: would funhash() be efficient to use against a JSON string to see quickly whether two JSON strings match each other (rather than comparing full JSON strings)?

@bryc
Copy link

bryc commented Apr 22, 2021

@u01jmg3

@bryc: would funhash() be efficient to use against a JSON string to see quickly whether two JSON strings match each other (rather than comparing full JSON strings)?

Yes, actually. If you're just comparing two strings at once, and not storing the hash results or crosschecking all possibilities, then a simple 32-bit hash will do the job without any problems. But keep in mind, hashes only tell you if two messages don't match. If two messages do match, it's not absolute proof that the messages are identical.

For example, with FunHash/TSH, if you iterate 100s of millions of random JSON string comparisons, you will eventually find some collisions, where two different JSON strings produce the same hash and erroneously pass the comparison test. However, this is extremely uncommon and you would need to do up to 2 billion iterations to find collisions. Through my own testing, I've found it typically requires 650 million iterations before a single collision is found. But it can take as few as 35 million or as many as 1.44 billion iterations to find a collision. In fact, I once ran my 1.5 billion test 4 times in a row, and got 0 collisions each, but it does happen.

So unless you need to do more than 30 million JSON comparisons in one go, you should not have to worry about collisions.

Here's some simple test code that tests for this. It's set to 2 million, because JS isn't really fast enough to do 1.5 billion.

// TSH (TinySimpleHash) is the same as FunHash with slightly different numbers and a new name.
TSH=s=>{for(var i=0,h=9;i<s.length;)h=Math.imul(h^s.charCodeAt(i++),9**9);return h^h>>>9}

for(var i = 0; i < 2000000; i++) {
    var hash1 = TSH(Math.random().toString(16)+Math.random().toString(16)); // 32-bit hash result
    var hash2 = TSH(Math.random().toString(16)+Math.random().toString(16)); // 32-bit hash result

    if (hash1 === hash2) console.warn(`Dupe at ${i}, Number ${hash1}`); 
   
}
console.info("Done");

However, 32-bit hashes can actually be a poor choice in some situations. Here's a test where 250,000 iterations is practically assured to give at least one collision against the full set, when storing them all in a hash map:

TSH=s=>{for(var i=0,h=9;i<s.length;)h=Math.imul(h^s.charCodeAt(i++),9**9);return h^h>>>9}

var hashMap = {};
for(var i = 0; i < 250000; i++) {
    var hash1 = TSH(Math.random().toString(16)+Math.random().toString(16)); // 32-bit hash result
    if (hashMap[hash1]) console.warn(`Dupe at ${i}, Index ${hash1}`); 
    hashMap[hash1] = (hashMap[hash1] || 0) + 1;
}
console.info("Done");

With the above test, collisions can be found after as low as 20,000 iterations.

This is why sometimes it's better to just use a larger hash size than 32-bit. See my answer on StackOverflow: https://stackoverflow.com/a/52171480/815680. cyrb53 is a 53-bit hash that's still comparably fast to the 32-bit one, but the fact that it has 53 bits reduces collisions by an exponential scale. TSH (TinySimpleHash) at the bottom of that post is basically the same algorithm as FunHash just with a different name.

If you want something highly robust, I have an efficient port of 128-bit MurmurHash3, but it needs 4 comparisons instead of one: https://github.com/bryc/code/blob/master/jshash/hashes/murmurhash3_128.js.

Further reading:
https://en.wikipedia.org/wiki/Pigeonhole_principle
https://en.wikipedia.org/wiki/Birthday_problem

@u01jmg3
Copy link

u01jmg3 commented Apr 24, 2021

@bryc - super explanation, thank you

Going with the 32-bit hash for now but will change to the 53-bit version if we experience frequent collisions.

@ArashPartow
Copy link

ArashPartow commented Jan 26, 2022

A comprehensive list of general purpose hash functions and their implementations can found here:

https://www.partow.net/programming/hashfunctions/index.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment