We currently use the slowest Huffman lookup ever. Let's fix this :)
We're currently using a mapping (bits, bitLength) -> value
.
We can, however, turn this into a mapping paddedBits -> (bitLength, value)
.
Consider the following Huffman table
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
A | 00111 | 7 | 5 |
B | 00110 | 6 | 5 |
C | 0010 | 2 | 4 |
D | 011 | 3 | 3 |
E | 010 | 2 | 3 |
F | 000 | 0 | 3 |
G | 11 | 3 | 2 |
H | 10 | 2 | 2 |
Unfortunately, we cannot use the "Int value of Code" because it lacks unicity. But if we pad "Binary Code" with additional bits, we can change this:
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
A | 00111 | 7 | 5 |
B | 00110 | 6 | 5 |
C | 00100 | 4 | 4 |
C | 00101 | 5 | 4 |
D | 01100 | 12 | 3 |
D | 01101 | 13 | 3 |
D | 01110 | 14 | 3 |
D | 01111 | 15 | 3 |
E | 01000 | 8 | 3 |
E | 01001 | 9 | 3 |
E | 01010 | 10 | 3 |
E | 01011 | 11 | 3 |
F | 00000 | 0 | 3 |
F | 00001 | 1 | 3 |
F | 00010 | 2 | 3 |
F | 00011 | 3 | 3 |
G | 11000 | 24 | 2 |
G | 11001 | 25 | 2 |
G | 11010 | 26 | 2 |
G | 11011 | 27 | 2 |
G | 11100 | 28 | 2 |
G | 11101 | 29 | 2 |
G | 11110 | 30 | 2 |
G | 11111 | 31 | 2 |
H | 10000 | 16 | 2 |
H | 10001 | 17 | 2 |
H | 10010 | 18 | 2 |
H | 10011 | 19 | 2 |
H | 10100 | 20 | 2 |
H | 10101 | 21 | 2 |
H | 10110 | 22 | 2 |
H | 10111 | 23 | 2 |
Or, in a different order words
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
F | 00000 | 0 | 3 |
F | 00001 | 1 | 3 |
F | 00010 | 2 | 3 |
F | 00011 | 3 | 3 |
C | 00100 | 4 | 4 |
C | 00101 | 5 | 4 |
B | 00110 | 6 | 5 |
A | 00111 | 7 | 5 |
E | 01000 | 8 | 3 |
E | 01001 | 9 | 3 |
E | 01010 | 10 | 3 |
E | 01011 | 11 | 3 |
D | 01100 | 12 | 3 |
D | 01101 | 13 | 3 |
D | 01110 | 14 | 3 |
D | 01111 | 15 | 3 |
H | 10000 | 16 | 2 |
H | 10001 | 17 | 2 |
H | 10010 | 18 | 2 |
H | 10011 | 19 | 2 |
H | 10100 | 20 | 2 |
H | 10101 | 21 | 2 |
H | 10110 | 22 | 2 |
H | 10111 | 23 | 2 |
G | 11000 | 24 | 2 |
G | 11001 | 25 | 2 |
G | 11010 | 26 | 2 |
G | 11011 | 27 | 2 |
G | 11100 | 28 | 2 |
G | 11101 | 29 | 2 |
G | 11110 | 30 | 2 |
G | 11111 | 31 | 2 |
With this transformation, "Int value of Code" is now unique in the table.
Now, if we represent our table as an array indexed by
"Int value of Code", we have a constant-time
lookup of (Symbol, Bit Length)
.
- Determine the largest bit length
- If the largest bit length is reasonable (TBD)
- Represent the Huffman table as a Vector indexed by "Int value of Code"
- Fill all the possible values of "Int value of Code" (Note: Could possibly be done lazily)
- Otherwise...
- see other sections :)
- If the table is represented as a single table of largest bit length
N
. - Fetch the next
N
bits. If fewer thanN
bits are available, pad with trailing 0s. - Perform an array lookup for
(Symbol, Bit Length)
. - Advance by
Bit Length
bits. - Result is
Symbol
. - Otherwise...
- See other sections :)
Now, consider the case in which storing a table of 2^(max bitlength) entry would take too much memory/initialization time. We can make the lookup *~ O(max bitlength)*.
The key idea is the following:
- Since
bitLength
is limited to 20,(bits, bitLength)
can be represented as a singleuint32_t
. - We can therefore represent all our Huffman tables as
HashMap<uint32_t, Value>
.
- Determine the largest bit length
- If the largest bit length is reasonable (TBD)
- see other sections :)
- Otherwise...
- Initialize a
HashMap
forP
entries. - For each
(code, bitLength, value)
, insertvalue
at key(code << 12) | bitLength
- If the table is represented as a single table of largest bit length
N
. - see other sections :)
- Otherwise...
- For
bitLength
in0 .. MAX_KEY_BIT_LENGTH
- Let
bits
be the nextbitLength
bits. - Let
key
be(bits << 12) | bitLength
. - If
HashMap
lookup atkey
returnsSymbol
- Advance by
bitLength
. - Result is
Symbol
. - We're done.
- Advance by
- Let
Now, consider the case in which storing a table of 2^(max bitlength) entry would take too much memory/initialization time. We can make the lookup constant time :)
(Max bit length: 5)
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
A | 00111 | 7 | 5 |
B | 00110 | 6 | 5 |
C | 0010 | 2 | 4 |
D | 011 | 3 | 3 |
E | 010 | 2 | 3 |
F | 000 | 0 | 3 |
G | 11 | 3 | 2 |
H | 10 | 2 | 2 |
Our objective is to subdivide this table into several tables that require smaller bit lengths.
First, we decide of a prefix length L
and bucket our symbols
depending on the first L
bits.
TBD: How do we determine the length of the prefix?
Prefix | Table |
---|---|
00 | Table 1 (A, B, C, F) |
01 | Table 2 (D, E) |
10 | Table 3 (H) |
11 | Table 4 (G) |
A lookup in the prefix table is constant-time.
We now write one table per value of Prefix
and store only the bitLength - L
bits.
Table 1 (max bit length: 3) - everything prefixed by 00
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
A | 111 | 7 | 3 |
B | 110 | 6 | 3 |
C | 10 | 2 | 2 |
F | 0 | 0 | 1 |
Table 2 (max bit length: 1) - everything prefixed by 01
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
D | 1 | 1 | 1 |
E | 0 | 0 | 1 |
Table 3 (max bit length: 0) - everything prefixed by 10
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
H | (empty) | 0 | 0 |
Table 4 (max bit length: 0) - everything prefixed by 11
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
G | (empty) | 0 | 0 |
As seen above, we convert the Suffix Huffman tables into arrays:
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
A | 111 | 7 | 3 |
B | 110 | 6 | 3 |
C | 10 | 2 | 2 |
F | 0 | 0 | 1 |
becomes
Symbol | Binary Code | Int value of Code | Bit Length |
---|---|---|---|
F | 000 | 0 | 1 |
F | 001 | 1 | 1 |
F | 010 | 2 | 1 |
F | 011 | 3 | 1 |
C | 100 | 4 | 2 |
C | 101 | 5 | 2 |
B | 110 | 6 | 3 |
A | 111 | 7 | 3 |
- Determine the largest bit length
- If the largest bit length is reasonable (TBD)
- see other sections :)
- Otherwise...
- Determine a prefix length (TBD)
- Compute Prefix Table.
- For each value of Prefix 4. Compute Suffix Table. 5. Represent Suffix Table as Array (see other sections).
- If the table is represented as a single table of largest bit length
N
. - see other sections :)
- Otherwise, if the table is represented as a prefix table with a prefix of length
L
- Fetch the next
L
bits. If fewer thanN
bits are available, pad with trailing 0s. - Perform an array lookup to determine the table in which the suffix is stored.
- Fetch the next
N
bits. If fewer thanN
bits are available, pad with trailing 0s. - Perform an array lookup for
Symbol
.