Skip to content

Instantly share code, notes, and snippets.

@Yoric
Last active September 6, 2019 14:12
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Yoric/a9267822dee8776186624800169b217c to your computer and use it in GitHub Desktop.
Save Yoric/a9267822dee8776186624800169b217c to your computer and use it in GitHub Desktop.
Faster Huffman lookup

We currently use the slowest Huffman lookup ever. Let's fix this :)

Replacing (bits, bitLength) with just a key.

We're currently using a mapping (bits, bitLength) -> value. We can, however, turn this into a mapping paddedBits -> (bitLength, value).

Consider the following Huffman table

Symbol Binary Code Int value of Code Bit Length
A 00111 7 5
B 00110 6 5
C 0010 2 4
D 011 3 3
E 010 2 3
F 000 0 3
G 11 3 2
H 10 2 2

Unfortunately, we cannot use the "Int value of Code" because it lacks unicity. But if we pad "Binary Code" with additional bits, we can change this:

Symbol Binary Code Int value of Code Bit Length
A 00111 7 5
B 00110 6 5
C 00100 4 4
C 00101 5 4
D 01100 12 3
D 01101 13 3
D 01110 14 3
D 01111 15 3
E 01000 8 3
E 01001 9 3
E 01010 10 3
E 01011 11 3
F 00000 0 3
F 00001 1 3
F 00010 2 3
F 00011 3 3
G 11000 24 2
G 11001 25 2
G 11010 26 2
G 11011 27 2
G 11100 28 2
G 11101 29 2
G 11110 30 2
G 11111 31 2
H 10000 16 2
H 10001 17 2
H 10010 18 2
H 10011 19 2
H 10100 20 2
H 10101 21 2
H 10110 22 2
H 10111 23 2

Or, in a different order words

Symbol Binary Code Int value of Code Bit Length
F 00000 0 3
F 00001 1 3
F 00010 2 3
F 00011 3 3
C 00100 4 4
C 00101 5 4
B 00110 6 5
A 00111 7 5
E 01000 8 3
E 01001 9 3
E 01010 10 3
E 01011 11 3
D 01100 12 3
D 01101 13 3
D 01110 14 3
D 01111 15 3
H 10000 16 2
H 10001 17 2
H 10010 18 2
H 10011 19 2
H 10100 20 2
H 10101 21 2
H 10110 22 2
H 10111 23 2
G 11000 24 2
G 11001 25 2
G 11010 26 2
G 11011 27 2
G 11100 28 2
G 11101 29 2
G 11110 30 2
G 11111 31 2

With this transformation, "Int value of Code" is now unique in the table.

Now, if we represent our table as an array indexed by "Int value of Code", we have a constant-time lookup of (Symbol, Bit Length).

Initialization Algorithm

  1. Determine the largest bit length
  2. If the largest bit length is reasonable (TBD)
  3. Represent the Huffman table as a Vector indexed by "Int value of Code"
  4. Fill all the possible values of "Int value of Code" (Note: Could possibly be done lazily)
  5. Otherwise...
  6. see other sections :)

Reading algorithm

  1. If the table is represented as a single table of largest bit length N.
  2. Fetch the next N bits. If fewer than N bits are available, pad with trailing 0s.
  3. Perform an array lookup for (Symbol, Bit Length).
  4. Advance by Bit Length bits.
  5. Result is Symbol.
  6. Otherwise...
  7. See other sections :)

Dealing with large tables (simple version)

Now, consider the case in which storing a table of 2^(max bitlength) entry would take too much memory/initialization time. We can make the lookup *~ O(max bitlength)*.

The key idea is the following:

  • Since bitLength is limited to 20, (bits, bitLength) can be represented as a single uint32_t.
  • We can therefore represent all our Huffman tables as HashMap<uint32_t, Value>.

Initialization Algorithm

  1. Determine the largest bit length
  2. If the largest bit length is reasonable (TBD)
  3. see other sections :)
  4. Otherwise...
  5. Initialize a HashMap for P entries.
  6. For each (code, bitLength, value), insert value at key (code << 12) | bitLength

Reading algorithm

  1. If the table is represented as a single table of largest bit length N.
  2. see other sections :)
  3. Otherwise...
  4. For bitLength in 0 .. MAX_KEY_BIT_LENGTH
    1. Let bits be the next bitLength bits.
    2. Let key be (bits << 12) | bitLength.
    3. If HashMap lookup at key returns Symbol
      1. Advance by bitLength.
      2. Result is Symbol.
      3. We're done.

Dealing with large tables (probably faster, more complex)

Now, consider the case in which storing a table of 2^(max bitlength) entry would take too much memory/initialization time. We can make the lookup constant time :)

(Max bit length: 5)

Symbol Binary Code Int value of Code Bit Length
A 00111 7 5
B 00110 6 5
C 0010 2 4
D 011 3 3
E 010 2 3
F 000 0 3
G 11 3 2
H 10 2 2

Our objective is to subdivide this table into several tables that require smaller bit lengths.

The prefix table

First, we decide of a prefix length L and bucket our symbols depending on the first L bits.

TBD: How do we determine the length of the prefix?

Prefix Table
00 Table 1 (A, B, C, F)
01 Table 2 (D, E)
10 Table 3 (H)
11 Table 4 (G)

A lookup in the prefix table is constant-time.

Suffix tables

We now write one table per value of Prefix and store only the bitLength - L bits.

Table 1 (max bit length: 3) - everything prefixed by 00

Symbol Binary Code Int value of Code Bit Length
A 111 7 3
B 110 6 3
C 10 2 2
F 0 0 1

Table 2 (max bit length: 1) - everything prefixed by 01

Symbol Binary Code Int value of Code Bit Length
D 1 1 1
E 0 0 1

Table 3 (max bit length: 0) - everything prefixed by 10

Symbol Binary Code Int value of Code Bit Length
H (empty) 0 0

Table 4 (max bit length: 0) - everything prefixed by 11

Symbol Binary Code Int value of Code Bit Length
G (empty) 0 0

Making lookup linear

As seen above, we convert the Suffix Huffman tables into arrays:

Symbol Binary Code Int value of Code Bit Length
A 111 7 3
B 110 6 3
C 10 2 2
F 0 0 1

becomes

Symbol Binary Code Int value of Code Bit Length
F 000 0 1
F 001 1 1
F 010 2 1
F 011 3 1
C 100 4 2
C 101 5 2
B 110 6 3
A 111 7 3

Initialization Algorithm

  1. Determine the largest bit length
  2. If the largest bit length is reasonable (TBD)
  3. see other sections :)
  4. Otherwise...
  5. Determine a prefix length (TBD)
  6. Compute Prefix Table.
  7. For each value of Prefix 4. Compute Suffix Table. 5. Represent Suffix Table as Array (see other sections).

Reading algorithm

  1. If the table is represented as a single table of largest bit length N.
  2. see other sections :)
  3. Otherwise, if the table is represented as a prefix table with a prefix of length L
  4. Fetch the next L bits. If fewer than N bits are available, pad with trailing 0s.
  5. Perform an array lookup to determine the table in which the suffix is stored.
  6. Fetch the next N bits. If fewer than N bits are available, pad with trailing 0s.
  7. Perform an array lookup for Symbol.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment