This is a draft for a space-efficient HTML5 character reference parser.
The parser contains a table that maps from a character reference name to a UTF-8 encoded codepoint list. The table is encoded as a trie.
Each node contains a list of children in the 'keys' array, and a list of offsets in the 'offsets' array.
Each entry in 'keys' is either the character that leads to a child node, or 0 to mark the end of the list. The matching entry in 'offsets' is the position in the 'keys'/'offsets' arrays where that child's node data is located.
For leaf nodes, the list stored in 'keys' is empty (keys[offset] == 0), and the 'offsets' entry holds the offset into the codepoint array.
convert.cpp contains a sample implementation. It is only a proof of concept and may not be entirely correct; in particular, it may read out of bounds.
Note that for the nGt and nLt keys, the implementation in convert.cpp may need to write more bytes than are available: "&nGt;" and "&nLt;" are 5 bytes each, but each expands to two codepoints that take 6 bytes in UTF-8.
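To make the nGt case concrete, here is a small compile-time check of the sizes involved. It assumes the standard HTML mapping of nGt to U+226B U+20D2; the constant names are made up for this sketch and do not come from convert.cpp.

```cpp
#include <cstddef>

// The reference as it appears in the input text.
constexpr char kReference[] = "&nGt;";
// U+226B followed by U+20D2, each encoded as 3 bytes of UTF-8.
constexpr char kExpansion[] = "\xE2\x89\xAB\xE2\x83\x92";

// sizeof counts the trailing NUL, so subtract it.
constexpr std::size_t kReferenceLen = sizeof(kReference) - 1;
constexpr std::size_t kExpansionLen = sizeof(kExpansion) - 1;

static_assert(kReferenceLen == 5, "reference is 5 bytes");
static_assert(kExpansionLen == 6, "expansion is 6 bytes");
static_assert(kExpansionLen > kReferenceLen, "output is longer than input");
```

A converter that rewrites the text in place therefore cannot assume the output always fits in the space occupied by the reference; a separate output buffer with some slack is one way around this.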
To build this, run the following sequence of commands:
wget https://www.w3.org/TR/html5/entities.json
python3 entities.py >table.h
c++ -Os convert.cpp