Yoric/lookup.md

## lookup.md

      
    Raw
  

              lookup.md
            
          
    We currently use the slowest Huffman lookup ever. Let's fix this :)
Replacing (bits, bitLength) with just a key.

We're currently using a mapping (bits, bitLength) -> value.
We can, however, turn this into a mapping paddedBits -> (bitLength, value).
Consider the following Huffman table


Symbol
Binary Code
Int value of Code
Bit Length


A
00111
7
5


B
00110
6
5


C
0010
2
4


D
011
3
3


E
010
2
3


F
000
0
3


G
11
3
2


H
10
2
2


Unfortunately, we cannot use the "Int value of Code" because it lacks unicity. But if we pad "Binary Code" with
additional bits, we can change this:


Symbol
Binary Code
Int value of Code
Bit Length


A
00111
7
5


B
00110
6
5


C
00100
4
4


C
00101
5
4


D
01100
12
3


D
01101
13
3


D
01110
14
3


D
01111
15
3


E
01000
8
3


E
01001
9
3


E
01010
10
3


E
01011
11
3


F
00000
0
3


F
00001
1
3


F
00010
2
3


F
00011
3
3


G
11000
24
2


G
11001
25
2


G
11010
26
2


G
11011
27
2


G
11100
28
2


G
11101
29
2


G
11110
30
2


G
11111
31
2


H
10000
16
2


H
10001
17
2


H
10010
18
2


H
10011
19
2


H
10100
20
2


H
10101
21
2


H
10110
22
2


H
10111
23
2


Or, in a different order words


Symbol
Binary Code
Int value of Code
Bit Length


F
00000
0
3


F
00001
1
3


F
00010
2
3


F
00011
3
3


C
00100
4
4


C
00101
5
4


B
00110
6
5


A
00111
7
5


E
01000
8
3


E
01001
9
3


E
01010
10
3


E
01011
11
3


D
01100
12
3


D
01101
13
3


D
01110
14
3


D
01111
15
3


H
10000
16
2


H
10001
17
2


H
10010
18
2


H
10011
19
2


H
10100
20
2


H
10101
21
2


H
10110
22
2


H
10111
23
2


G
11000
24
2


G
11001
25
2


G
11010
26
2


G
11011
27
2


G
11100
28
2


G
11101
29
2


G
11110
30
2


G
11111
31
2


With this transformation, "Int value of Code" is now unique in the table.
Now, if we represent our table as an array indexed by
"Int value of Code", we have a constant-time
lookup of (Symbol, Bit Length).
Initialization Algorithm


Determine the largest bit length
If the largest bit length is reasonable (TBD)
Represent the Huffman table as a Vector indexed by "Int value of Code"
Fill all the possible values of "Int value of Code" (Note: Could possibly be done lazily)
Otherwise...
see other sections :)

Reading algorithm


If the table is represented as a single table of largest bit length N.
Fetch the next N bits. If fewer than N bits are available, pad with trailing 0s.
Perform an array lookup for (Symbol, Bit Length).
Advance by Bit Length bits.
Result is Symbol.
Otherwise...
See other sections :)

Dealing with large tables (simple version)

Now, consider the case in which storing a table of 2^(max bitlength) entry would take too much memory/initialization time. We can make the lookup *~ O(max bitlength)*.
The key idea is the following:

Since bitLength is limited to 20, (bits, bitLength) can be represented as a single uint32_t.
We can therefore represent all our Huffman tables as HashMap<uint32_t, Value>.

Initialization Algorithm


Determine the largest bit length
If the largest bit length is reasonable (TBD)
see other sections :)
Otherwise...
Initialize a HashMap for P entries.
For each (code, bitLength, value), insert value at key (code << 12) | bitLength

Reading algorithm


If the table is represented as a single table of largest bit length N.
see other sections :)
Otherwise...
For bitLength in 0 .. MAX_KEY_BIT_LENGTH

Let bits be the next bitLength bits.
Let key be (bits << 12) | bitLength.
If HashMap lookup at key returns Symbol

Advance by bitLength.
Result is Symbol.
We're done.


Dealing with large tables (probably faster, more complex)

Now, consider the case in which storing a table of 2^(max bitlength) entry would take too much memory/initialization time. We can make the lookup constant time :)
(Max bit length: 5)


Symbol
Binary Code
Int value of Code
Bit Length


A
00111
7
5


B
00110
6
5


C
0010
2
4


D
011
3
3


E
010
2
3


F
000
0
3


G
11
3
2


H
10
2
2


Our objective is to subdivide this table into several tables
that require smaller bit lengths.
The prefix table

First, we decide of a prefix length L and bucket our symbols
depending on the first L bits.
TBD: How do we determine the length of the prefix?


Prefix
Table


00
Table 1 (A, B, C, F)


01
Table 2 (D, E)


10
Table 3 (H)


11
Table 4 (G)


A lookup in the prefix table is constant-time.
Suffix tables

We now write one table per value of Prefix and store only the bitLength - L bits.
Table 1 (max bit length: 3) - everything prefixed by 00


Symbol
Binary Code
Int value of Code
Bit Length


A
111
7
3


B
110
6
3


C
10
2
2


F
0
0
1


Table 2 (max bit length: 1) - everything prefixed by 01


Symbol
Binary Code
Int value of Code
Bit Length


D
1
1
1


E
0
0
1


Table 3 (max bit length: 0) - everything prefixed by 10


Symbol
Binary Code
Int value of Code
Bit Length


H
(empty)
0
0


Table 4 (max bit length: 0) - everything prefixed by 11


Symbol
Binary Code
Int value of Code
Bit Length


G
(empty)
0
0


Making lookup linear

As seen above, we convert the Suffix Huffman tables into arrays:


Symbol
Binary Code
Int value of Code
Bit Length


A
111
7
3


B
110
6
3


C
10
2
2


F
0
0
1


becomes


Symbol
Binary Code
Int value of Code
Bit Length


F
000
0
1


F
001
1
1


F
010
2
1


F
011
3
1


C
100
4
2


C
101
5
2


B
110
6
3


A
111
7
3


Initialization Algorithm


Determine the largest bit length
If the largest bit length is reasonable (TBD)
see other sections :)
Otherwise...
Determine a prefix length (TBD)
Compute Prefix Table.
For each value of Prefix
4. Compute Suffix Table.
5. Represent Suffix Table as Array (see other sections).

Reading algorithm


If the table is represented as a single table of largest bit length N.
see other sections :)
Otherwise, if the table is represented as a prefix table with a prefix of length L
Fetch the next L bits. If fewer than N bits are available, pad with trailing 0s.
Perform an array lookup to determine the table in which the suffix is stored.
Fetch the next N bits. If fewer than N bits are available, pad with trailing 0s.
Perform an array lookup for Symbol.
Symbol	Binary Code	Int value of Code	Bit Length
A	00111	7	5
B	00110	6	5
C	00100	4	4
C	00101	5	4
D	01100	12	3
D	01101	13	3
D	01110	14	3
D	01111	15	3
E	01000	8	3
E	01001	9	3
E	01010	10	3
E	01011	11	3
F	00000	0	3
F	00001	1	3
F	00010	2	3
F	00011	3	3
G	11000	24	2
G	11001	25	2
G	11010	26	2
G	11011	27	2
G	11100	28	2
G	11101	29	2
G	11110	30	2
G	11111	31	2
H	10000	16	2
H	10001	17	2
H	10010	18	2
H	10011	19	2
H	10100	20	2
H	10101	21	2
H	10110	22	2
H	10111	23	2
Prefix	Table
00	Table 1 (A, B, C, F)
01	Table 2 (D, E)
10	Table 3 (H)
11	Table 4 (G)