Skip to content

Instantly share code, notes, and snippets.

@laughinghan
Last active January 26, 2023 00:55
Show Gist options
  • Save laughinghan/9c3511e946f2fd86802e92255b9703c1 to your computer and use it in GitHub Desktop.
Save laughinghan/9c3511e946f2fd86802e92255b9703c1 to your computer and use it in GitHub Desktop.

Thoughts about how to do (minimal-ish) perfect hashing of Mechanical/EffectScript tag IDs for branch tables for pattern-matching:

  • the goal is to minimize CPU clock cycles to jump to the correct branch; the classic minimal perfect hashing algorithm is a 2-level hashing scheme which is undesirable overhead. We should be able to get it down to at worst one integer multiply, bitmask, and bitshift
  • for actual minimal perfect hashing would have to do integer modulo, which is a slow instruction, instead hash to range from 0 to nearest power-of-2 (eg for 5 items, has to 0-7) which only requires bitmasking, call that DESIRED_RANGE
  • tag IDs aren't randomly distributed across 0-INT_MAX, they're small-ish numbers maxing out in like thousands for big codebases presumably
  • find min and max tag ID; if max - min ≤ DESIRED_RANGE then we can just hash by subtracting min tag ID; otherwise let MAX_BIT_SIZE = ceil(log2(max tag ID))
  • if we can get by with just bitmask and bitshift that'd be great. Let MAX_BIT_POSITION = MAX_BIT_SIZE - log2(DESIRED_RANGE) (note DESIRED_RANGE was defined to be a power-of-2), scan a bitmask from position 0 to MAX_BIT_POSITION (no point scanning past there, just zero bits) and see if any such shifted bitmask is a perfect hash
    • each pattern might be multiple tag IDs; assume there's a mapping from each tag ID to a canonical ID for that pattern
    • to validate a perfect hash, make an array of size DESIRED_RANGE, then for each tag ID, hash it and look at that index in the array. If nothing's there continue loop, if something's there but it's the current tag ID's canonical ID then also continue loop, else report that this hash has a collision (not a perfect hash)
  • finally, using a fixed bitmask at MAX_BIT_POSITION and a constant that starts from floor(2**MAX_BIT_POSITION/second-smallest tag ID)-1 and increments, try a hash where you multiply by the constant and bitmask+bitshift. Validate as above
    • the reason for the bitmask at MAX_BIT_POSITION is the idea is that multiplying should mix the low bits into the high bits; the high bits don't get mixed back into low bits without a prime modulus which we don't have (not to mention multiplying by a large constant, which would be a slow search)
    • start from floor(...) because the constant needs to be at least big enough to bring the high bits of small tag IDs within range of the bitmask. Second-smallest tag ID, not min tag ID, because it's ok for one tag ID to hash to all zeroes. Subtract one just to be sure we don't overshoot

I'm not sure how I'd prove that in some number of iterations less than the max tag ID, we'd find a perfect hash, but it feels like it ought to be able to mix the bits in the right way. Of course multiplication doesn't come close to covering every possible permutation of a bitstring, maybe it won't find the necessary permutation for a perfect hash?

Note that hashing doesn't make sense for vtables, because the lookup key is a number known at compile-time: the index of the method being called. (Unless there could be multiple possible vtables, and we decide which vtable and method-index-number based on a lookup table or branch table off tag ID?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment