Instantly share code, notes, and snippets.

# paniq/slice_hashing.md

Last active May 25, 2023 10:19
Show Gist options
• Save paniq/2ff7b30c699378cb97e56eeb725e30c1 to your computer and use it in GitHub Desktop.
Hash Integrals: Slice Hashing below O(log n)

# Hash Integrals: Slice Hashing below O(log n)

Update: the technique employed here is called a polynomial rolling hash as I have recently learned.

Some notes on how to enable fast computation of hashes of slices of vectors (below complexity `O(log n)`), as well as hashes of joined vectors.

We start from first principles. To simplify matters, a vector has the same element type as our hash, which stands in for the actual data element we wish to hash.

Each vector `v` is paired with a secondary summed-hash vector `h` of which each element `i` stores `hash(v[0..i])` to form a tuple of vectors `(v,h)`. Appending a new element `p` to `v` is then performed in `O(1)` as follows, with a simple auxiliary linear operation:

``````append((v, h), p) == append(v, p), append(h, (last(h) * C + hash(p))
last(h) == h[countof(h)-1]
h[-1] == hash_type(0)
``````

Where `C` is a fixed prime constant close to the maximum value of `hash_type`. Note that the hashing of `p` with the user-chosen function `hash()` is optional and needs only to be performed if `p` is of larger size than `sizeof(hash_type)`.

For 64-bit hashes, we can use `C := 0x66d6cf4cc5ddd26d:u64` (its modulo-inverse is `C^-1 := 0x24ffe0c7fcc70765:u64`).

We compute the hash of any left-aligned half-open subrange `v[0..x)` by accessing the cached accumulated value in `h`:

``````hash(v[0..x)) == hash(h[x-1])
``````

The hash of an arbitrary subrange `[a..b)` of `v` can be computed in `O(log (b - a))` by a pseudolinear operation:

``````hash(v[a..b)) == hash(h[b-1] - h[a-1] * (C ** (b - a)))
``````

where `**` denotes integer exponentiation.

Finally, two vectors `(u, g)` and `(v, h)` can be joined in `O(n + log countof(h))` for `n` elements in `v` as follows:

``````join((u, g), (v, h)) == join(u, v), join(g, last(g) * (C ** countof(h)) + h)
``````

The hash of the combined range of two joined vectors can therefore be computed in `O(log countof(h))` just from the very last value of `h'`:

``````hash(join((u, g), (v, h))) == hash(last(g) * (C ** countof(h)) + last(h))
``````

## Substring Search

An arbitrary vector `q` can be found in `v` in `O(n log m)` amortized for `m == countof(q), n == countof(v) - m` by testing all `n` slice hashes of size `m` in `v` until a match with the hash of `q` has been found. This is faster than the naive approach with complexity `O(n * m)` for large sizes of `q`. This is equivalent to the Rabin-Karp algorithm.

An improved technique allows us to find a substring in `O(m log m)`, by precomputing bucketed maps of total size `2 * n` which map every possible log-sized, log-aligned subtile hash to its location in advance. We can then break down the substring into all phase-shifted variants of its log-sized and log-aligned subtiles, and search for their adjacent occurrence in the map.

## Parallelization

The computation of `h` from `v` is massively parallelizable on the GPU using prefix sums or jump flooding.

## Higher Dimensions

As this solution is analog to integral arithmetic, it readily extends to higher dimensions. In 2D, we represent the accumulated hash as a summed-area table, where each top-left-aligned half-open range hash can be computed from `(v, h)` as follows:

``````h[x, y] == C * (h[x - 1, y] + h[x, y - 1] - C * h[x - 1, y - 1]) + hash(v[x,y])
h[-1, y] == h[x, -1] == hash_type(0)
``````

All other operations can be derived accordingly, and analog to how integrals of SAT subranges are classically performed.

Efficient substring (subtile) search also extends to higher dimensions with a precomputed map.