Compressing token info

I measured my biggest code base written in Vox (which is mostly Vulkan bindings atm) and got the following numbers:

  • 88140 tokens
  • 17771 lines
  • 4.95977 tokens/line on average
  • 1 305 227 bytes of source code
  • 1 410 240 bytes of token information

Then I measured how many lines there are with a given number of tokens per line (distributed into 64 buckets).
Only 3% (2819) of all tokens are on lines with more than 32 tokens per line,
and 16% (14168) of all tokens are on lines with more than 16 tokens per line.

  1. number of tokens per line
  2. number of lines with this number of tokens
  3. column 2 as % of total (17771 lines)
  4. total number of tokens on those lines (column 1 x column 2)
  5. column 4 as % of total (88140 tokens)
tok/line  # of lines  % of lines  # of tokens  % of tokens
0 2911 lines 16,4% 0 tokens 0,0%
1 1019 lines 5,7% 1019 tokens 1,2%
2 52 lines 0,3% 104 tokens 0,1%
3 3435 lines 19,3% 10305 tokens 11,7%
4 3565 lines 20,1% 14260 tokens 16,2%
5 1443 lines 8,1% 7215 tokens 8,2%
6 356 lines 2,0% 2136 tokens 2,4%
7 2785 lines 15,7% 19495 tokens 22,1%
8 162 lines 0,9% 1296 tokens 1,5%
9 106 lines 0,6% 954 tokens 1,1%
10 41 lines 0,2% 410 tokens 0,5%
11 39 lines 0,2% 429 tokens 0,5%
12 36 lines 0,2% 432 tokens 0,5%
13 897 lines 5,0% 11661 tokens 13,2%
14 241 lines 1,4% 3374 tokens 3,8%
15 43 lines 0,2% 645 tokens 0,7%
16 52 lines 0,3% 832 tokens 0,9%
17 119 lines 0,7% 2023 tokens 2,3%
18 53 lines 0,3% 954 tokens 1,1%
19 31 lines 0,2% 589 tokens 0,7%
20 36 lines 0,2% 720 tokens 0,8%
21 35 lines 0,2% 735 tokens 0,8%
22 68 lines 0,4% 1496 tokens 1,7%
23 29 lines 0,2% 667 tokens 0,8%
24 8 lines 0,0% 192 tokens 0,2%
25 44 lines 0,2% 1100 tokens 1,2%
26 26 lines 0,1% 676 tokens 0,8%
27 16 lines 0,1% 432 tokens 0,5%
28 15 lines 0,1% 420 tokens 0,5%
29 9 lines 0,1% 261 tokens 0,3%
30 13 lines 0,1% 390 tokens 0,4%
31 10 lines 0,1% 310 tokens 0,4%
32 12 lines 0,1% 384 tokens 0,4%
33 6 lines 0,0% 198 tokens 0,2%
34 1 lines 0,0% 34 tokens 0,0%
35 7 lines 0,0% 245 tokens 0,3%
36 6 lines 0,0% 216 tokens 0,2%
37 6 lines 0,0% 222 tokens 0,3%
38 1 lines 0,0% 38 tokens 0,0%
39 3 lines 0,0% 117 tokens 0,1%
40 7 lines 0,0% 280 tokens 0,3%
41 3 lines 0,0% 123 tokens 0,1%
42 0 lines 0,0% 0 tokens 0,0%
43 1 lines 0,0% 43 tokens 0,0%
44 2 lines 0,0% 88 tokens 0,1%
45 1 lines 0,0% 45 tokens 0,1%
46 1 lines 0,0% 46 tokens 0,1%
47 2 lines 0,0% 94 tokens 0,1%
48 0 lines 0,0% 0 tokens 0,0%
49 0 lines 0,0% 0 tokens 0,0%
50 1 lines 0,0% 50 tokens 0,1%
51 1 lines 0,0% 51 tokens 0,1%
52 1 lines 0,0% 52 tokens 0,1%
53 0 lines 0,0% 0 tokens 0,0%
54 0 lines 0,0% 0 tokens 0,0%
55 0 lines 0,0% 0 tokens 0,0%
56 0 lines 0,0% 0 tokens 0,0%
57 0 lines 0,0% 0 tokens 0,0%
58 1 lines 0,0% 58 tokens 0,1%
59 0 lines 0,0% 0 tokens 0,0%
60 0 lines 0,0% 0 tokens 0,0%
61 0 lines 0,0% 0 tokens 0,0%
62 0 lines 0,0% 0 tokens 0,0%
63+ 13 lines 0,1% 819 tokens 0,9%
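
For reference, a minimal sketch of how such a histogram can be collected (the token representation here is an assumption; any token stream that exposes the line of each token works):

// Bucket lines by their token count into 64 buckets (last bucket is 63+).
// `tokenLines` is assumed to hold the line number of every token in the file.
uint[64] tokensPerLineHistogram(const uint[] tokenLines, uint numLines)
{
    auto tokensOnLine = new uint[numLines]; // tokens starting on each line
    foreach (line; tokenLines) ++tokensOnLine[line];

    uint[64] buckets;
    foreach (count; tokensOnLine)
        ++buckets[count < 63 ? count : 63];
    return buckets;
}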

The current implementation stores 16 bytes per token, i.e. 1 410 240 bytes of token info in total:

struct SourceLocation {
    uint start; // byte offset of the token's first character in the source
    uint end;   // byte offset just past the token's last character
    uint line;  // line the token starts on
    uint col;   // column the token starts at
}

Each AST node stores a 32-bit index into an array of SourceLocation entries.

Column and end can be calculated cheaply by relexing (the relex is short, since a line token covers at most 16 or 32 tokens).

Let's assume 8 bytes per line token. Lines longer than the limit (16/32 tokens) will need multiple line start tokens.

If no line exceeded the limit, the line index would equal the line number in the file, but with long lines this no longer holds. There should be some clever way of recovering it compactly without storing it explicitly (which would take 50% of the space).
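
For reference, a hedged sketch of what such an 8-byte line start token could look like (field names are assumptions, not the actual Vox layout):

struct LineStartToken {
    uint startOffset; // byte offset of the first token covered by this line token
    uint line;        // line number; dropping this field gives the 4-byte variant below
}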

Limit to 16 tokens per line token:

  • 28 bits for line index = limit of 268 435 456 lines per compilation
  • 16% (14168) of tokens are on lines with more than 16 tokens
  • 24456 line start tokens would be needed with 17771 LoC (+6685 (+37%) over real line count)
  • 24456 * 8 bytes = 195648 bytes (13.87% of current token info (1410240))
  • 24456 * 4 bytes = 97824 bytes (6.93% of current token info (1410240))

Limit to 32 tokens per line token:

  • 27 bits for line index = limit of 134 217 728 lines per compilation
  • 3% (2819) of tokens are on lines with more than 32 tokens
  • 20848 line start tokens would be needed with 17771 LoC (+3077 (+17%) over real line count)
  • 20848 * 8 bytes = 166784 bytes (11.82% of current token info (1410240))
  • 20848 * 4 bytes = 83392 bytes (5.91% of current token info (1410240))

The 4-byte rows show the savings if we managed to avoid storing the line number in memory.

So, there is definitely no point in targeting the >32 tokens/line case (only 3% of tokens),
and the situation with more than 16 tokens/line is relatively rare too (16% of tokens).


An alternative idea would be to store a byte offset inside the AST node instead of a token index.
To avoid relexing the whole file, source location info would be stored every X bytes.

If we divide the source code size (1305227) by the number of line starts (20848) in the smallest case, we get ~62 bytes per line start token, which suggests a pretty good chunk size of 64 bytes for relexing if we store just 4 bytes of information per chunk. The good thing is that we already implicitly know the byte offsets of both the token (it is stored in the AST node) and the chunk (chunk offset is just chunk index times chunk size). What we don't know is where the first token in the chunk starts, and what line and column it is at. We can encode the line with 32 bits, the token offset with 6 bits (enough for a 64-byte chunk), and use the remaining 26 bits for the column number, which is pretty decent. Of course, that gives us 8 bytes of chunk info per 64 bytes of source code, so we can increase the chunk size to 128 bytes to compensate.
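
A minimal sketch of that packing, assuming 64-byte chunks (struct name and helpers are illustrative, not the actual implementation):

struct ChunkInfo {
    uint line;         // line of the first token that starts in this chunk
    uint colAndOffset; // low 6 bits: offset of that token inside the chunk,
                       // high 26 bits: its column
    uint firstTokenOffset() { return colAndOffset & 0x3F; }
    uint firstTokenColumn() { return colAndOffset >> 6; }
}

enum CHUNK_SIZE = 64; // adjustable; larger chunks mean less info, but a longer relex

// The AST node stores the token's byte offset, so its chunk is implicit.
uint chunkIndexOf(uint tokenByteOffset) { return tokenByteOffset / CHUNK_SIZE; }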

One problem is that a random chunk may contain no token boundary at all. But if we use the offset obtained from the AST node, this cannot happen, because that offset is guaranteed to point to a token start within the chunk.

To relex the token we want, we restart the lexer on the first token of the chunk, using the offset/line/col from the chunk info. Ideally the token we want is that first one; otherwise we skip tokens produced by the lexer until the byte offset matches the one from the AST node.
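
A hedged sketch of that lookup, assuming the ChunkInfo layout above and a lexer that can be restarted at an arbitrary offset with a known line/column (the Lexer type and its methods are hypothetical):

// Recover full location info for the token whose byte offset the AST node stores.
SourceLocation findToken(uint byteOffset, ChunkInfo[] chunks, const(char)[] source)
{
    uint chunkIndex = byteOffset / CHUNK_SIZE;
    ChunkInfo chunk = chunks[chunkIndex];
    uint firstTokenStart = chunkIndex * CHUNK_SIZE + chunk.firstTokenOffset();

    // Restart the lexer on the first token of the chunk...
    auto lexer = Lexer(source, firstTokenStart, chunk.line, chunk.firstTokenColumn());
    // ...then skip tokens until the start offset matches the one from the AST node.
    SourceLocation loc = lexer.nextToken();
    while (loc.start != byteOffset) loc = lexer.nextToken();
    return loc;
}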

Finding the file a token belongs to is easy too. Since all file infos are stored in a single array, and all file sources are stored sequentially in a single arena in the same order, I can simply do a binary search on the files' begin/end byte offsets.
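
A sketch of that lookup, assuming each file info records its begin/end byte range within the shared source arena (struct and field names are assumptions):

struct FileInfo {
    uint begin; // byte offset of this file's source within the arena
    uint end;   // one past the last byte of this file's source
}

// Binary search for the file whose [begin, end) range contains byteOffset.
// Files are stored in arena order, so begin offsets are monotonically increasing.
size_t findFile(const FileInfo[] files, uint byteOffset)
{
    size_t lo = 0, hi = files.length;
    while (lo + 1 < hi) {
        size_t mid = (lo + hi) / 2;
        if (files[mid].begin <= byteOffset) lo = mid;
        else hi = mid;
    }
    return lo;
}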

What's good about the byte offset approach is that the chunk size is easily adjustable.

Previous implementation:

  • 32-bit token index in the AST node
  • 16-byte token info per token
  • token info size 1 410 240 bytes (108% of source size)

New implementation:

  • 32-bit source code byte offset in the AST node
  • 8-byte chunk info with adjustable chunk size (64+ bytes)
    chunk size | total info size
           64B | 163153 bytes (12.5% of source size)
          128B |  81576 bytes ( 6.2% of source size)
          256B |  40788 bytes ( 3.1% of source size)
          512B |  20394 bytes ( 1.6% of source size)
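
These sizes follow directly from dividing the source into chunks; a quick sanity check (values differ slightly from the table above due to rounding up the last partial chunk):

// chunk info size = number of chunks * 8 bytes
uint chunkInfoBytes(uint sourceSize, uint chunkSize)
{
    return (sourceSize + chunkSize - 1) / chunkSize * 8; // round up to whole chunks
}
// chunkInfoBytes(1305227, 64) = 163160, (128) = 81584, (256) = 40792, (512) = 20400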
    