Skip to content

Instantly share code, notes, and snippets.

@paniq
Last active May 25, 2023 10:19
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save paniq/2ff7b30c699378cb97e56eeb725e30c1 to your computer and use it in GitHub Desktop.
Save paniq/2ff7b30c699378cb97e56eeb725e30c1 to your computer and use it in GitHub Desktop.
Hash Integrals: Slice Hashing below O(log n)

Hash Integrals: Slice Hashing below O(log n)

Update: the technique employed here is called a polynomial rolling hash as I have recently learned.

Some notes on how to enable fast computation of hashes of slices of vectors (below complexity O(log n)), as well as hashes of joined vectors.

We start from first principles. To simplify matters, a vector has the same element type as our hash, which stands in for the actual data element we wish to hash.

Each vector v is paired with a secondary summed-hash vector h of which each element i stores hash(v[0..i]) to form a tuple of vectors (v,h). Appending a new element p to v is then performed in O(1) as follows, with a simple auxiliary linear operation:

append((v, h), p) == append(v, p), append(h, (last(h) * C + hash(p))
last(h) == h[countof(h)-1]
h[-1] == hash_type(0)

Where C is a fixed prime constant close to the maximum value of hash_type. Note that the hashing of p with the user-chosen function hash() is optional and needs only to be performed if p is of larger size than sizeof(hash_type).

For 64-bit hashes, we can use C := 0x66d6cf4cc5ddd26d:u64 (its modulo-inverse is C^-1 := 0x24ffe0c7fcc70765:u64).

We compute the hash of any left-aligned half-open subrange v[0..x) by accessing the cached accumulated value in h:

hash(v[0..x)) == hash(h[x-1])

The hash of an arbitrary subrange [a..b) of v can be computed in O(log (b - a)) by a pseudolinear operation:

hash(v[a..b)) == hash(h[b-1] - h[a-1] * (C ** (b - a)))

where ** denotes integer exponentiation.

Finally, two vectors (u, g) and (v, h) can be joined in O(n + log countof(h)) for n elements in v as follows:

join((u, g), (v, h)) == join(u, v), join(g, last(g) * (C ** countof(h)) + h)

The hash of the combined range of two joined vectors can therefore be computed in O(log countof(h)) just from the very last value of h':

hash(join((u, g), (v, h))) == hash(last(g) * (C ** countof(h)) + last(h))

Substring Search

An arbitrary vector q can be found in v in O(n log m) amortized for m == countof(q), n == countof(v) - m by testing all n slice hashes of size m in v until a match with the hash of q has been found. This is faster than the naive approach with complexity O(n * m) for large sizes of q. This is equivalent to the Rabin-Karp algorithm.

An improved technique allows us to find a substring in O(m log m), by precomputing bucketed maps of total size 2 * n which map every possible log-sized, log-aligned subtile hash to its location in advance. We can then break down the substring into all phase-shifted variants of its log-sized and log-aligned subtiles, and search for their adjacent occurrence in the map.

Parallelization

The computation of h from v is massively parallelizable on the GPU using prefix sums or jump flooding.

Higher Dimensions

As this solution is analog to integral arithmetic, it readily extends to higher dimensions. In 2D, we represent the accumulated hash as a summed-area table, where each top-left-aligned half-open range hash can be computed from (v, h) as follows:

h[x, y] == C * (h[x - 1, y] + h[x, y - 1] - C * h[x - 1, y - 1]) + hash(v[x,y])
h[-1, y] == h[x, -1] == hash_type(0)

All other operations can be derived accordingly, and analog to how integrals of SAT subranges are classically performed.

Efficient substring (subtile) search also extends to higher dimensions with a precomputed map.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment