Skip to content

Instantly share code, notes, and snippets.

@luoq
Last active August 20, 2020 11:34
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save luoq/b4c374b5cbabe3ae76ffacdac22750af to your computer and use it in GitHub Desktop.
Save luoq/b4c374b5cbabe3ae76ffacdac22750af to your computer and use it in GitHub Desktop.
how feature hash is calculated in vw

feature hashing in vw

base hash function

Murmur32 hash is implemented in uniform_hash in hash.cc. It takes a string and a seed and return uint64_t.

There are two hash modes specified by --hash option.

Strings mode (hashstring in parse_primitives.cc) use parsed uint64_t value plus seed for integer string(matched by [0-9]+) and murmur32 for other type.

uint64_t hash(string s, uint64_t seed) =
  if s match [0-9]+
    return uint64_t(s)+seed
  else:
    return murmur32(s, seed)

All mode (hashall in parse_primitives.cc) use murmur32.

hash(str, seed) = murmur32(str, seed)

hash for constant term

const uint64_t constant = 11650396; // defined in constant.h
uint64_t constant_hash = 11650396 % (1<<b);

index for feature without namesapce

uint64_t feature_hash = hash(feature_name, 0)
uint64_t feature_index = feature_hash % (1<<b)

where 0 is seed, b is num of bits used.

hash for feature with name space

uint64_t namespace_hash = hash(namespace, 0)
uint64_t feature_hash = hash(feature_name, namespace_hash)
uint64_t feature_index = feature_hash % (1<<b)

hash for quadratic feature

from interactions.h

given hashes(indices) h1, h2 of two features

const uint32_t FNV_prime = 16777619; //defined in constant.h
uint64_t h = (h1*FNV_prime) ^  h2 // ^ is xor
uint64_t i = h % (1<<b)

Whether h1, h2 are feature hashes(before mode) or indices (after mod) will give the same result.

example

sample:

1 | 123456789 has_car |gender 男 |interest 篮球 |age 42

index calculation(strings mode, b=18):

constant -> 11650396 % 1<<18 = 116060

123456789 -> 123456789 % 1<<18 = 249109

hashstring(hash_car, 0) = 2086702313
has_car -> 2086702313 % (1<<18) = 36073

hashstring(gender, 0) = 2678092942
hashstring(男, 2678092942) = 430992541
gender^男 -> 430992541 % 1<<18 = 27805

hashstring(interest, 0) = 570245443
hashstring(篮球, 570245443) = 2287792271
interest^篮球 -> 2287792271 % 1<<18 = 61583

hashstring(age, 0) -> 717653329
hashstring(42, 717653329) -> 717653329+42 = 717653371
age^42 -> 717653371 % 1<<18 = 165243

gender^男*interest^篮球 -> ((27805 * 16777619) ^ 61583) % 1<<18 = 134056
gender^男*gender^男 -> ((27805*16777619)^27805) % 1<<18 = 169914
interest^篮球*interest^篮球 -> ((61583*16777619)^61583) % 1<<18 = 147858
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment