Skip to content

Instantly share code, notes, and snippets.

@iperelivskiy
Created November 19, 2012 14:39
Show Gist options
  • Star 27 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save iperelivskiy/4110988 to your computer and use it in GitHub Desktop.
Save iperelivskiy/4110988 to your computer and use it in GitHub Desktop.
JS simple hash function
var hash = function(s) {
/* Simple hash function. */
var a = 1, c = 0, h, o;
if (s) {
a = 0;
/*jshint plusplus:false bitwise:false*/
for (h = s.length - 1; h >= 0; h--) {
o = s.charCodeAt(h);
a = (a<<6&268435455) + o + (o<<14);
c = a & 266338304;
a = c!==0?a^c>>21:a;
}
}
return String(a);
};
@florianfrey1
Copy link

Great hash function! Works very good.
What's the name of this function?

@bryc
Copy link

bryc commented Sep 4, 2018

Seems to be from ga.js (Google Analytics), and was used to hash domain name strings. It's not that good honestly. Distribution and collision resistance is very poor. Only useful for its original purpose: hashing domain name strings.

For example, hash for "a" is 00184061 and "b" is 00188062. "revenge" is 097599b9 and "revenue" is 0575be19. They have many common bits, which implies bad mixing. And collisions have higher-than-chance probability: "1zbN" and "1t6g" = 00e8b236, "RivW" and "R1G7" = 10063cd2.

Did a quick sanity test:

Test: Single rolling bit over 4096 bytes (FAILED) Collisions: 10224 of 10240 (100%)
Test: Incrementing bytes to 255          (FAILED) Collisions: 8829 of 18360 (48%)
Test: Four unique random ASCII chars     (FAILED) Collisions: 484 of 102401 (0%)

Fails most tests, but 484 in 102,401 is not bad. I can see why they used it for domain strings. But an even simpler function is faster AND better:

var funhash = function(s) {
    for(var i = 0, h = 0xdeadbeef; i < s.length; i++)
        h = Math.imul(h ^ s.charCodeAt(i), 2654435761);
    return (h ^ h >>> 16) >>> 0;
};

@Undefitied
Copy link

@bryc, can you, please, share your test functions?

@bryc
Copy link

bryc commented Sep 20, 2019

@Undefitied Here you go. It's set to 4096 iterations per test, just for faster loading. And it also tests the bit distribution under "Bit averages range", where 50% is the expected value of true randomness. Just keep in mind these are just basic tests with small sample sizes. SMHasher is better but you might need to be a rocket scientist to use it.

@u01jmg3
Copy link

u01jmg3 commented Apr 22, 2021

@bryc: would funhash() be efficient to use against a JSON string to see quickly whether two JSON strings match each other (rather than comparing full JSON strings)?

@bryc
Copy link

bryc commented Apr 22, 2021

@u01jmg3

@bryc: would funhash() be efficient to use against a JSON string to see quickly whether two JSON strings match each other (rather than comparing full JSON strings)?

Yes, actually. If you're just comparing two strings at once, and not storing the hash results or crosschecking all possibilities, then a simple 32-bit hash will do the job without any problems. But keep in mind, hashes only tell you if two messages don't match. If two messages do match, it's not absolute proof that the messages are identical.

For example, with FunHash/TSH, if you iterate 100s of millions of random JSON string comparisons, you will eventually find some collisions, where two different JSON strings produce the same hash and erroneously pass the comparison test. However, this is extremely uncommon and you would need to do up to 2 billion iterations to find collisions. Through my own testing, I've found it typically requires 650 million iterations before a single collision is found. But it can take as few as 35 million or as many as 1.44 billion iterations to find a collision. In fact, I once ran my 1.5 billion test 4 times in a row, and got 0 collisions each, but it does happen.

So unless you need to do more than 30 million JSON comparisons in one go, you should not have to worry about collisions.

Here's some simple test code that tests for this. It's set to 2 million, because JS isn't really fast enough to do 1.5 billion.

// TSH (TinySimpleHash) is the same as FunHash with slightly different numbers and a new name.
TSH=s=>{for(var i=0,h=9;i<s.length;)h=Math.imul(h^s.charCodeAt(i++),9**9);return h^h>>>9}

for(var i = 0; i < 2000000; i++) {
    var hash1 = TSH(Math.random().toString(16)+Math.random().toString(16)); // 32-bit hash result
    var hash2 = TSH(Math.random().toString(16)+Math.random().toString(16)); // 32-bit hash result

    if (hash1 === hash2) console.warn(`Dupe at ${i}, Number ${hash1}`); 
   
}
console.info("Done");

However, 32-bit hashes can actually be a poor choice in some situations. Here's a test where 250,000 iterations is practically assured to give at least one collision against the full set, when storing them all in a hash map:

TSH=s=>{for(var i=0,h=9;i<s.length;)h=Math.imul(h^s.charCodeAt(i++),9**9);return h^h>>>9}

var hashMap = {};
for(var i = 0; i < 250000; i++) {
    var hash1 = TSH(Math.random().toString(16)+Math.random().toString(16)); // 32-bit hash result
    if (hashMap[hash1]) console.warn(`Dupe at ${i}, Index ${hash1}`); 
    hashMap[hash1] = (hashMap[hash1] || 0) + 1;
}
console.info("Done");

With the above test, collisions can be found after as low as 20,000 iterations.

This is why sometimes it's better to just use a larger hash size than 32-bit. See my answer on StackOverflow: https://stackoverflow.com/a/52171480/815680. cyrb53 is a 53-bit hash that's still comparably fast to the 32-bit one, but the fact that it has 53 bits reduces collisions by an exponential scale. TSH (TinySimpleHash) at the bottom of that post is basically the same algorithm as FunHash just with a different name.

If you want something highly robust, I have an efficient port of 128-bit MurmurHash3, but it needs 4 comparisons instead of one: https://github.com/bryc/code/blob/master/jshash/hashes/murmurhash3_128.js.

Further reading:
https://en.wikipedia.org/wiki/Pigeonhole_principle
https://en.wikipedia.org/wiki/Birthday_problem

@u01jmg3
Copy link

u01jmg3 commented Apr 24, 2021

@bryc - super explanation, thank you

Going with the 32-bit hash for now but will change to the 53-bit version if we experience frequent collisions.

@ArashPartow
Copy link

A comprehensive list of general purpose hash functions and their implementations can found here:

https://www.partow.net/programming/hashfunctions/index.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment