@nitely
Last active September 16, 2017 06:23
How Python's unicodedata (unicodenames) works

Fredrik Lundh's unicodenames

From: "Fredrik Lundh" effbot@telia.com Date: Sun, 16 Jul 2000 20:40:46 +0200

The unicodenames database consists of two parts: a name database which maps character codes to names, and a code database, mapping names to codes.

  • The Name Database (getname)

First, the 10538 text strings are split into 42193 words, and combined into a 4949-word lexicon (a 29k byte array).

Each word is given a unique index number (common words get lower numbers), and there's a "lexicon offset" table mapping from numbers to words (10k).

To get back to the original text strings, I use a "phrase book". For each original string, the phrase book stores a list of word numbers. Numbers 0-127 are stored in one byte; higher numbers (less common words) use two bytes. At this time, about 65% of the words can be represented by a single byte. The result is a 56k array.

The final data structure is an offset table, which maps code points to phrase book offsets. Instead of using one big table, I split each code point into a "page number" and a "line number" on that page.

offset = line[ (page[code>>SHIFT]<<SHIFT) + (code&MASK) ]

Since the unicode space is sparsely populated, it's possible to split the code so that many pages end up empty. I use a brute force search to find the optimal SHIFT value.
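A minimal sketch of building and querying such a two-level table, assuming identical pages (e.g. all-zero pages for unassigned ranges) are stored only once; the lookup uses the same expression as the formula above:

```python
# Two-level offset table: a dense per-code-point array is cut into
# fixed-size pages, and duplicate pages are shared.

SHIFT = 6
MASK = (1 << SHIFT) - 1

def build_tables(values):
    """Split `values` (len a multiple of 2**SHIFT) into deduplicated pages."""
    page, line, seen = [], [], {}
    for start in range(0, len(values), 1 << SHIFT):
        chunk = tuple(values[start:start + (1 << SHIFT)])
        if chunk not in seen:
            seen[chunk] = len(line) >> SHIFT   # page number in `line`
            line.extend(chunk)
        page.append(seen[chunk])
    return page, line

def lookup(page, line, code):
    # Same expression as in the text above.
    return line[(page[code >> SHIFT] << SHIFT) + (code & MASK)]
```

For sparse data the `line` table stays tiny: two nonzero entries in a 4096-slot array need only three unique pages (192 entries).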

In the current database, the page table has 1024 entries (SHIFT is 6), and there are 199 unique pages in the line table. The total size of the offset table is 26k.

  • The code database (getcode)

For the code table, I use a straightforward hash table to store name-to-code mappings. It's basically the same implementation as in Python's dictionary type, but with a different hash algorithm. The table lookup loop simply uses the name database to check for hits.

In the current database, the hash table is 32k.
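The trick described above is that each hash slot needs to store only a code point, not the name: a probe is confirmed by regenerating that code's name via the name database and comparing it to the query. A hypothetical sketch with linear probing (the hash function and class names here are invented for illustration, not the actual implementation):

```python
# Name->code hash table that stores only code points per slot and
# verifies hits through the name database (getname).

def name_hash(name, size):
    # Hypothetical multiplicative string hash; the real algorithm differs.
    h = 0
    for ch in name:
        h = (h * 424601 + ord(ch)) & 0xFFFFFFFF
    return h % size

class CodeTable:
    def __init__(self, getname, size=1024):
        self.slots = [None] * size      # each slot holds a code point or None
        self.getname = getname          # name database: code -> name

    def insert(self, name, code):
        i = name_hash(name, len(self.slots))
        while self.slots[i] is not None:        # linear probing on collision
            i = (i + 1) % len(self.slots)
        self.slots[i] = code

    def getcode(self, name):
        i = name_hash(name, len(self.slots))
        while self.slots[i] is not None:
            if self.getname(self.slots[i]) == name:  # confirm via name db
                return self.slots[i]
            i = (i + 1) % len(self.slots)
        return None                              # empty slot: no match
```

Using a plain dict as a stand-in name database shows the lookup loop in action.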
