-
-
Save vaiorabbit/5657561 to your computer and use it in GitHub Desktop.
// 32 bit FNV-1a hash | |
// Ref.: http://isthe.com/chongo/tech/comp/fnv/ | |
function fnv32a( str ) | |
{ | |
var FNV1_32A_INIT = 0x811c9dc5; | |
var hval = FNV1_32A_INIT; | |
for ( var i = 0; i < str.length; ++i ) | |
{ | |
hval ^= str.charCodeAt(i); | |
hval += (hval << 1) + (hval << 4) + (hval << 7) + (hval << 8) + (hval << 24); | |
} | |
return hval >>> 0; | |
} | |
/* | |
print( "Lorem -> " + fnv32a('Lorem') ); // Lorem -> 1789342528 | |
*/ | |
/* for Mac OSX JSC: | |
$ cat words.txt | /System/Library/Frameworks/JavaScriptCore.framework/Resources/jsc -f fnv32a.js > tmp_js.txt | |
var line; | |
while((line = readline()) != "") | |
{ | |
var re = /\"(.*)\"\,/; | |
var word = re.exec(line); | |
print( word[1] + " -> " + fnv32a(word[1]) ); | |
} | |
*/ |
No, that's impossible. A hash function is lossy—that is, it maps many possible values to a smaller number of them. Hence you can never recover the original string.
Anyway, thanks for writing this!
I think you're missing a bit. Don't you need to add hval << 2?
@prdoyle No. The shifts add the components that would be added when the hash is multiplied with the prime - by adding the shifts to the variable containing the hash, the right-most bit (which would be hval << 0
, or a multiplication by 1) is incorporated inherently. (hval << 1
, in turn, adds the multiplication by the second-rightmost bit's power of 2 (2), and so on, and so forth.)
hval ^= str.charCodeAt(i);
only works for 8-bit encodings.
Try "Я".charCodeAt(0)
.
@max0x7ba How would you adjust it for non-8-bit encodings?
It appears to work for me without problem if I try a Russian string in Firefox (57.0), IE11 and Chrome (64):
fnv32a("Сегодня я банан")
< 3081921716
"Ya" is code 1071 in my browser(s):
"Я".charCodeAt(0)
< 1071
charCodeAt()
covers 16-bits not 8, so you're good up to (EDIT: Not exactly.. you do get a 16-bit value, but it's UTF-16 encoded, see below). 0xFFFF
(65535)codePointAt()
is useful for chars larger than those values, such as some emoji 👩🏿⚖️
(composite of several emoji/chars).
// Length 4, not 2 as it might look..
const unicorn = '🦄🌈';
console.log(`The character code is: ${unicorn.charCodeAt(0)} ,it's full codePoint is: ${unicorn.codePointAt(0)}`);
// Outputs: "The character code is: 55358 ,it's full codePoint is: 129412"
// Hexadecimal equivalents('.toString(16)'): charCode == 0xD83E, codePoint == 0x1F984
// charCode is acquired as explained for encoding to UTF-16 here:
// https://en.wikipedia.org/wiki/UTF-16#Examples
// 55358 ==0xD83E, is the "high surrogate" from the unicode value, calculated like this:
0xF984 / 0x400 + 0xD800 == 0xD83E
// The next value returned from 'charCodeAt()' when the index value increments, is not for the rainbow,
// but the 2nd UTF-16 value part of the unicorn.. Note that the 'codePointAt()' is now matching the charCode value
// It no longer points to the codePoint for the unicorn.
// 56708 == 0xDD84, is the "low surrogate" calculated like follows:
0xF984 % 0x400 + 0xDC00 == 0xDD84
So what happens here is that you'll get a result through charCodeAt()
but it's UTF-16 surrogates(two 16-bit pairs). Even with the information above about why we're getting these values, you'll find that FNV is meant to operate on one byte at a time(not one character), JS strings are UTF-16 by default, but for anything exceeding one byte, we need to encode to UTF-8 instead.
We can take the codepoint for the unicorn 🦄
(0x1f984) and then convert to octal notation as an easy way to visual the conversion to UTF-8 as the conversion is explained/visualized in wikipedia:
// This will convert to octal(base 8), output "374604"
console.log( Number(0x1f984).toString(8) )
console.log( '🦄'.codePointAt().toString(8) )
// codepoint exceeds 0x10000, so will be 4 bytes represented in UTF-8
// Initial octals for 4 byte UTF-8: 36x 2xx 2xx 2xx + split octal pairs to the last 3 bytes
// '0' prefix followed by number is JS syntax for octal values
const UTF8_bytes = [0360 + 000. 0200 + 037, 0200 + 046, 0200 + 004]
const UTF8_bytes = [0360, 0237, 0246, 0204]
// Which is equivalent to decimal values:
const UTF8_bytes = [240, 159, 166, 132]
You would then need to either modify the method to use bytes directly or convert those bytes into a string. In JS you can also have code handle that conversion like so:
const UTF8_string = unescape(encodeURIComponent('🦄'))
// Results in a string of length 4: "�", the two characters are 2 bytes long each, because their first bytes exceed 0x7F(128)
For those same reasons Я
(codepoint 1071 == 0x42f), it is two bytes(length of 1 in UTF-16), the expected hexadecimal value of the hash with that as a string input would be 0x80c353e0
(not 0x2a064b5e
), you can see that is the case with this FNV calculator.
UTF-8 string + char codes/bytes:"Я" == [208, 175]
Might as well point out for anyone else landing here and curious, the bitshifting instead of usual multiplying by prime you might see is required for JS, not just as an optimization, due to JS being a 53-bit float not 32-bit uint, otherwise you'll get inaccurate results as you exceed 32-bit values. In JS the bitwise/bitshift operators will operate on 32-bit numbers instead avoiding the issue.
The final return hval >>> 0
is also important to get a positive number(uint), otherwise if you expect to get the same results such as hexadecimal strings(toString(16)
), it won't work as expected if negative values are converted.
For anyone that wants a version that handles char code values beyond 128, or more than 32-bit hashes, see sindresorhus NPM package.
Really really great
Can we reverse it? I mean get the original string of an
FNV hash
.If so, please add that code too.