js-choi/compact-unicode-character-names.md

## compact-unicode-character-names.md

      
    Raw
  

              compact-unicode-character-names.md
            
          
    Concise Unicode character names in JavaScript

J. S. Choi, 2022
⚠️ Warning: This article is not finished. The code will not yet run without
errors.
All programmers must manipulate text; JavaScript programmers are no exception.
Text manipulation often refers to specific characters, usually by their code
points in hexadecimal – or by embedding the characters directly in source code.
Almost every encoded character used by programmers has one or more standard,
unique, and immutable Unicode names. Unicode names are the most prominent,
if not the most important, properties of any Unicode character.
Unicode names are much more memorable, readable, visually distinctive, and
self-documenting than hexadecimal codes. And they will always be visible,
printable, and supported by system fonts, unlike most directly embedded
characters.
Yet Unicode names are not often used in JavaScript programming, despite
JavaScript being perhaps the most ubiquitous programming language in the world.
This is perhaps most due to the sheer number of Unicode names (285,275
names for 284,805 named Unicode values, as of Unicode 14.0 in 2021). The
canonical source of those 285,275 Unicode names, UnicodeData.txt, weighs
1,897.8 kB. Embedding such large files in every language runtime would impose a
large storage and memory burden on resource-constrained environments such as
mobile devices.
But that heavy memory can be made light, without compromising retrieval time,
by losslessly compressing UnicodeData.txt into a concise data structure.
Use cases

When manipulating text, developers often need to refer to non-ASCII characters,
non-printable characters, precomposed characters (or character sequences that
may be precomposed), or RTL characters.
Embedding the characters directly in source code can often be problematic:

Non-ASCII characters might be supported by no font in the human reader’s
system, making it unable to directly display them to the human reader.
Non-printable characters are, by their nature, invisible, which also makes
both reading and editing source code that directly embeds them inscrutable.
Precomposed characters and character sequences that may be precomposed
are visually ambiguous. Whether or not a character sequence is precomposed
(e.g., é) or not (e.g., e then combining ◌́) is often significant, such
as when matching text.
JavaScript syntax is based on English, which uses Latin letters. Many
characters resemble these Latin letters yet have very different meanings;
for instance, С (U+0421 “Cyrillic Capital Letter Es”) resembles C (U+0043
“Latin Capital Letter C”), and so async () => 'С' (which would return the
Cyrillic letter) is difficult to distinguish from async () => 'C' (which
would return the Latin letter).
Directly embedding RTL characters in source code (like Arabic or Hebrew
letters) can cause the flow of its text to switch unpredictably. Directly
embedded RTL characters can even cause security bugs.

Instead of directly embedding those characters, developers may instead refer
to these characters with \u syntax. But these require using the characters’
hex codes, which can be inscrutable to human readers, unless there are nearby
comments redundantly explaining which character belongs to each hex code.
With Gzemnid, we can see how people already use these characters in their
JavaScript source code.
From debug@4.1.1/src/node.js:
const colorCode = '\u001B[3' + (c < 8 ? c : '8;5;' + c);
From js-yaml@3.13.1/lib/js-yaml/loader.js:
(c === 0x4C/* L */) ? '\u2028' :
(c === 0x50/* P */) ? '\u2029' : '';
From uglify-js@3.6.0/lib/output.js:
str = str.replace(/[\\\b\f\n\r\v\t\x22\x27\u2028\u2029\0\ufeff]/g,
  function(s, i) {
    switch (s) {
      case '"': ++dq; return '"';
      case "'": ++sq; return "'";
      case "\\": return "\\\\";
      case "\n": return "\\n";
      case "\r": return "\\r";
      case "\t": return "\\t";
      case "\b": return "\\b";
      case "\f": return "\\f";
      case "\x0B": return options.ie8 ? "\\x0B" : "\\v";
      case "\u2028": return "\\u2028";
      case "\u2029": return "\\u2029";
      case "\ufeff": return "\\ufeff";
      case "\0":
          return /[0-9]/.test(str.charAt(i+1)) ? "\\x00" : "\\0";
    }
    return s;
});
From trim-right@1.0.1.mjs:
while (/[\s\uFEFF\u00A0]/.test(str[tail - 1])) {
From highlight.js@9.15.8/lib/languages/fix.js:
begin: /[^\u2401\u0001]+/,
end: /[\u2401\u0001]/,
From react-select@3.0.4/src/diacritics.js:
{ base:'AA',letters:/[\uA732]/g },
{ base:'AE',letters:/[\u00C6\u01FC\u01E2]/g },
{ base:'AO',letters:/[\uA734]/g },
{ base:'AU',letters:/[\uA736]/g },
{ base:'AV',letters:/[\uA738\uA73A]/g },
{ base:'AY',letters:/[\uA73C]/g },
{ base:'B', letters:/[\u0042\u24B7\uFF22\u1E02\u1E04\u1E06\u0243\u0182\u0181]/g
},
From katex@0.10.2/src/unicodeSymbols.js:
"\u014f": "\u006f\u0306",  // ŏ = \u{o}
From regexgen@1.3.0/test/test.js:
'\u{1F937}\u{1F3FB}\u200D',
'\u{1F937}\u{1F3FC}\u200D',
'\u{1F937}\u{1F3FD}\u200D',
'\u{1F937}\u{1F3FE}\u200D',
'\u{1F937}\u{1F3FF}\u200D',
'\u{1F937}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FB}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FC}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FD}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FE}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FF}\u200D\u2640\uFE0F'
From regexp-support@1.0.25/lib/pattern.js:
['^\\u{20bb7}$', 'u', '𠮷', true, 'test'],
Basic definitions

As defined in the Unicode Standard, a code point is an integer between
0 and 0x10_FFFF – this range of 16**5 + 16**6 (or 17,825,792) unsigned
integers is also called the Unicode codespace. We represent code points
with hexes (uppercase hexadecimal strings); two hexes (16**2) make up one
byte (2**8).
Each code point encodes one of three things:

A scalar character,
A noncharacter, or
A UTF-16 surrogate

– or the code point may yet be unassigned.
A UTF-16 surrogate is a code point that is between D800 and DFFF. There
are therefore 16**4 - (16**2 + 8 * 16**3) (or 32,512) UTF-16 surrogates.
UTF-16 surrogates are not characters, and they are not even valid code points
on their own, in well-formed Unicode text. They are used only as special byte
units to encode large code points in a certain form of Unicode, UTF-16.
A code point that is not a UTF-16 surrogate is also called a scalar;.
A scalar always encodes a scalar character or a noncharacter – or is yet
unassigned. We will prefer the term “scalar” where possible.
A character is a string that represents a single unit of writing.
Most characters that we care about are encoded by single scalars, which we
call scalar characters. All scalar characters have at least one name. In
addition, we can refer to scalar characters by U+ with their code points’ hex
codes, like U+0000 or U+10FFFF.
Note that not all characters are encoded by only single scalars. Named
character sequences – which are encoded by sequences of scalars – are also a
type of character, in the sense that they are string that each represent a
single unit of writing. They are characters in the same way that precomposed
characters (which are encoded by single scalars but may decompose into
character sequences) are also characters. Our algorithms will often split each
named character sequence into its single head scalar (the zeroth scalar in
the sequence) and their tail scalars (the rest of the scalars in the
sequence).
A noncharacter is a code point reserved for usage internally within an
computer system. They are not meant to mean any character but are rather meant
to be useful as “sentinel values” for internal string manipulation.
Noncharacters occur in a regular pattern: they are FFFE and FFFF, 1FFFE
and 1FFFF, 2FFFE and 2FFFF, and so on, until 10FFFE and 10FFFF.
A private-use character is a character to which the Unicode Standard itself
has given no specific meaning. It can be freely used for any purpose. There are
three sets of private-use characters: U+E000–F900 (totaling 6,400),
U+F0000–FFFFD (totaling 65,534), and U+100000–10FFFD (totaling 65,534).
When we combine names of many different types (including UTF-16 surrogates)
together, we will refer to their code points as head points. The head point
of a named character sequence is its head scalar.
A character may have zero or more character names, which includes strict
character names, character name aliases, and the names of named character
sequences.
Every scalar character has one strict character name that is defined as its
Name/na property, as declared by the Unicode Standard’s clause D4
(in Chapter 3) and § 4.8 Name (in Chapter 4). For example, the strict character
name of U+0020 is “SPACE”, and the strict character name of U+0021 is
“EXCLAMATION MARK”.
Every scalar character also has zero or more character name aliases. For
example, U+0020 has one alias: the abbreviation SP. In contrast, U+0021
has no aliases.
Several characters made of multiple scalars are named character sequences,
which have their own unique names.
For each named character sequence, any scalars following its first scalar are
its tail scalars. It is notable that many named character sequences share
the same tail scalars, varying only in their first code scalars. For instance,
all Keycap named character sequences share the same tail U+FE0F 20E3.
Lastly, many scalar characters (and all noncharacters and UTF-16 surrogates) do
not have any strict character name. These include the control characters,
private-use characters, UTF-16 surrogates, noncharacters, and reserved
characters. The Standard therefore gives them code-point labels such as
CONTROL-0009 for U+0009, SURROGATE-D800 for U+D800, and
NONCHARACTER-FFFF for U+FFFF.
Character names and code-point labels are collectively called Unicode
names. Other libraries sometimes call these unames.
Lastly, in order to distinguish general Unicode strings from Unicode strings
that have names, we call the latter named Unicode values. We avoid simply
calling named Unicode values “characters” because not all named Unicode values
are characters – i.e., the UTF-16 surrogates and the noncharacters.
The Unicode codespace is made of seventeen equally sized planes, each made
of 16**4 (or 65,536) code points. The design is such that, for any code
point, the most significant byte of its three-byte (six-hex) representation is
the plane number: from 00 (#0) to 10 (#16). For example:

The code point A1CC (00A1CC with three bytes or six hexes) would belong
to plane 00 (#0).
The code point BA1CC (0BA1CC) would belong to plane 0B (#11).
The code point 10A1CC would belong to plane 10 (#16).

There are two noncharacters at the end of each plane.
Some planes have special names:

Plane 00 is the Basic Multilingual Plane (BMP).
Plane 01 is the Supplementary Multilingual Plane (SMP).
Plane 02 is the Supplementary Ideographic Plane (SIP).
Plane 03 is the Tertiary Ideographic Plane (TIP).
(Planes 04–0D are completely unassigned except for the two noncharacters
at the end of each plane.)
Plane 0E is the Supplementary Special-purpose Plane (SSP).
Planes 0F and 10 are the Private Use Planes.

Name characteristics

The majority of Unicode names are strict character names, which are names
that are assigned to scalar characters by the Name/na property, as
declared by the Unicode Standard’s clause D4 (in Chapter 3) and § 4.8 Name
(in Chapter 4). There are also other kinds of Unicode names, including
character name aliases, the names of named character
sequences, and code-point labels.
No matter their type, all Unicode names are permanent and immutable for ever
and ever. Each name also denotes exactly one string (either a single code point
or a sequence of them). Together, they form a single shared namespace, the
Unicode namespace for character names.

The Unicode Name property values are unique for all non-null values, but not
every Unicode code point has a unique Unicode Name property value.
Furthermore, because Unicode character names, character name aliases,
named character sequences, and code point labels constitute a single,
unique namespace, the Name property value uniqueness requirement applies to
all three kinds of names and to code point labels.

All Unicode names also follow several simple orthographic rules as declared
in § 4.8 Name’s clauses R1–R4, so that they may “easily be transposed into
formal identifiers in another context, such as a computer language.”

Names will always contain only (case-insensitive) Latin letters, Arabic
digits, spaces, and ASCII hyphens: that’s 38 total characters. (We’ll call
chunks of names separated by spaces “words”. For instance, “Zero Width
Space” consists of three words.)
Words in names will never start with number digits.
Names will never start or end with ASCII hyphens; also, they will never have
double hyphens.
Rule #3 also applies to spaces too.

A corollary of these rules is that all spaces are ignorable. Furthermore, all
medial ASCII hyphens (where “medial” means “inside a space-delimited word, not
at the word’s beginning or its end”) are ignorable, with one current exception:
the Korean character U+1180 Hangul Jungseong O-E, whose name needs its hyphen
to be distinguishable from U+116C Hangul Jungseong OE. Fuzzy name
matching is defined in UAX #44, and we implement its rules in the
#core/fuzzy-matcher module.
Code hexes

Several Unicode names require the formatting of code points into specially
formatted 4-, 5-, or 6-digit hexadecimal strings (or hexes), which we will call
code hexes. A hex that would be less than four digits long (e.g., 1F) is
padded by 0s at the start (e.g., 001F).
We implement these rules in the #core/util/hex module:
// This function converts hex strings into their code-point integers.
// If the hex string is invalid malformed (e.g., `'123XYZ'`),
// then it returns `undefined`.
export function getscalarFromHex (scalarHex) { … }

// This function converts code-point integers into their hex strings. The
// string is padded with zeroes as necessary to make it at least four digits
// long.
export function getHexFromscalar (scalar) { … }
Strict character names

Strict character names (the Name/na property) follow the above
restrictions, but they’re actually determined by the same chapter’s clauses
NR1–NR4:


Precomposed Korean Hangul syllables (U+AC00–D7AF) always are named
“Hangul Syllable ” followed by a phrase denoting the sounds of their
components, which in turn are known as jamo. Korean Hangul-syllable names
are discussed more in Hangul syllables. We implement their rules in the
#core/hangul-syllables module.


Most ideographs have dynamically generated hex names: each hex name is a
prefix “CJK Unified Ideograph-”, “Tangut Ideograph-”, “Nushu Character-”, or
“CJK Compatibility Ideograph-” followed by a hexadecimal number indicating
its scalar. For example, U+4E00 is automatically named “CJK Unified
Ideograph-4E00”.
These hex names are defined in the Unicode Standard’s § 4.8 Name’s Table
4-8. In addition, their ranges are also defined in UnicodeData.txt, some
of which are in a compressed format. We implement hex names’ rules in the
#core/name-counter module.


If a character has an entry in the Unicode Character Database’s
UnicodeData.txt with a value explicitly in its first field, then
that is its strict character name (unless there the value is delimited by
< and >, in which case it is just an informal label). We call these
explicitly defined strict character names, rather than the strict
character names that are dynamically generated by the previous two rules.
UnicodeData.txt makes up the vast bulk of the data that we need to compress.
We implement these rules in the #core/name-range/ucd module.


If none of the rules above apply, then that means it has no strict
character name. This applies to control characters, private-use characters,
UTF-16 surrogates, noncharacters, and reserved characters. However, these
characters do have code-point labels, and some of these may have
character name aliases.


Strict Unicode character names therefore form an injective, non-surjective
function over all code points.
The longest name is “Arabic Ligature Uighur Kirghiz Yeh with Hamza Above with
Alef Maksura Isolated Form” (83 characters long). The shortest name is “Ox”
(two characters).
UnicodeData.txt

The bulk of the names are in the Unicode Character Database’s UnicodeData.txt
source. It has 34,627 lines weighing 1,897.8 kB, and it looks like this
(line breaks are significant):
0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0001;<control>;Cc;0;BN;;;;;N;START OF HEADING;;;;
0002;<control>;Cc;0;BN;;;;;N;START OF TEXT;;;;
0003;<control>;Cc;0;BN;;;;;N;END OF TEXT;;;;
0004;<control>;Cc;0;BN;;;;;N;END OF TRANSMISSION;;;;
0005;<control>;Cc;0;BN;;;;;N;ENQUIRY;;;;
0006;<control>;Cc;0;BN;;;;;N;ACKNOWLEDGE;;;;
0007;<control>;Cc;0;BN;;;;;N;BELL;;;;
0008;<control>;Cc;0;BN;;;;;N;BACKSPACE;;;;

…and then this:
001E;<control>;Cc;0;B;;;;;N;INFORMATION SEPARATOR TWO;;;;
001F;<control>;Cc;0;S;;;;;N;INFORMATION SEPARATOR ONE;;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;

…and then this:
004C;LATIN CAPITAL LETTER L;Lu;0;L;;;;;N;;;;006C;
004D;LATIN CAPITAL LETTER M;Lu;0;L;;;;;N;;;;006D;
004E;LATIN CAPITAL LETTER N;Lu;0;L;;;;;N;;;;006E;
004F;LATIN CAPITAL LETTER O;Lu;0;L;;;;;N;;;;006F;
0050;LATIN CAPITAL LETTER P;Lu;0;L;;;;;N;;;;0070;
0051;LATIN CAPITAL LETTER Q;Lu;0;L;;;;;N;;;;0071;

…and then later this:
090D;DEVANAGARI LETTER CANDRA E;Lo;0;L;;;;;N;;;;;
090E;DEVANAGARI LETTER SHORT E;Lo;0;L;;;;;N;;;;;
090F;DEVANAGARI LETTER E;Lo;0;L;;;;;N;;;;;
0910;DEVANAGARI LETTER AI;Lo;0;L;;;;;N;;;;;
0911;DEVANAGARI LETTER CANDRA O;Lo;0;L;;;;;N;;;;;
0912;DEVANAGARI LETTER SHORT O;Lo;0;L;;;;;N;;;;;

…then this:
33FF;SQUARE GAL;So;0;ON;<square> 0067 0061 006C;;;;N;;;;;
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
4DC0;HEXAGRAM FOR THE CREATIVE HEAVEN;So;0;ON;;;;;N;;;;;
4DC1;HEXAGRAM FOR THE RECEPTIVE EARTH;So;0;ON;;;;;N;;;;;
4DC2;HEXAGRAM FOR DIFFICULTY AT THE BEGINNING;So;0;ON;;;;;N;;;;;

…then this:
4DFE;HEXAGRAM FOR AFTER COMPLETION;So;0;ON;;;;;N;;;;;
4DFF;HEXAGRAM FOR BEFORE COMPLETION;So;0;ON;;;;;N;;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
A000;YI SYLLABLE IT;Lo;0;L;;;;;N;;;;;
A001;YI SYLLABLE IX;Lo;0;L;;;;;N;;;;;

…then this:
ABF8;MEETEI MAYEK DIGIT EIGHT;Nd;0;L;;8;8;8;N;;;;;
ABF9;MEETEI MAYEK DIGIT NINE;Nd;0;L;;9;9;9;N;;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
D7B0;HANGUL JUNGSEONG O-YEO;Lo;0;L;;;;;N;;;;;
D7B1;HANGUL JUNGSEONG O-O-I;Lo;0;L;;;;;N;;;;;

…then this:
D7FA;HANGUL JONGSEONG PHIEUPH-SIOS;Lo;0;L;;;;;N;;;;;
D7FB;HANGUL JONGSEONG PHIEUPH-THIEUTH;Lo;0;L;;;;;N;;;;;
D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;
F901;CJK COMPATIBILITY IDEOGRAPH-F901;Lo;0;L;66F4;;;;N;;;;;
F902;CJK COMPATIBILITY IDEOGRAPH-F902;Lo;0;L;8ECA;;;;N;;;;;
F903;CJK COMPATIBILITY IDEOGRAPH-F903;Lo;0;L;8CC8;;;;N;;;;;
F904;CJK COMPATIBILITY IDEOGRAPH-F904;Lo;0;L;6ED1;;;;N;;;;;

…then this:
16FF0;VIETNAMESE ALTERNATE READING MARK CA;Mc;6;L;;;;;N;;;;;
16FF1;VIETNAMESE ALTERNATE READING MARK NHAY;Mc;6;L;;;;;N;;;;;
17000;<Tangut Ideograph, First>;Lo;0;L;;;;;N;;;;;
187F7;<Tangut Ideograph, Last>;Lo;0;L;;;;;N;;;;;
18800;TANGUT COMPONENT-001;Lo;0;L;;;;;N;;;;;
18801;TANGUT COMPONENT-002;Lo;0;L;;;;;N;;;;;
18802;TANGUT COMPONENT-003;Lo;0;L;;;;;N;;;;;
18803;TANGUT COMPONENT-004;Lo;0;L;;;;;N;;;;;

…then this:
1FBF8;SEGMENTED DIGIT EIGHT;Nd;0;EN;<font> 0038;8;8;8;N;;;;;
1FBF9;SEGMENTED DIGIT NINE;Nd;0;EN;<font> 0039;9;9;9;N;;;;;
20000;<CJK Ideograph Extension B, First>;Lo;0;L;;;;;N;;;;;
2A6DF;<CJK Ideograph Extension B, Last>;Lo;0;L;;;;;N;;;;;
2A700;<CJK Ideograph Extension C, First>;Lo;0;L;;;;;N;;;;;
2B738;<CJK Ideograph Extension C, Last>;Lo;0;L;;;;;N;;;;;
2B740;<CJK Ideograph Extension D, First>;Lo;0;L;;;;;N;;;;;
2B81D;<CJK Ideograph Extension D, Last>;Lo;0;L;;;;;N;;;;;
2B820;<CJK Ideograph Extension E, First>;Lo;0;L;;;;;N;;;;;
2CEA1;<CJK Ideograph Extension E, Last>;Lo;0;L;;;;;N;;;;;
2CEB0;<CJK Ideograph Extension F, First>;Lo;0;L;;;;;N;;;;;
2EBE0;<CJK Ideograph Extension F, Last>;Lo;0;L;;;;;N;;;;;
2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
2F801;CJK COMPATIBILITY IDEOGRAPH-2F801;Lo;0;L;4E38;;;;N;;;;;
2F802;CJK COMPATIBILITY IDEOGRAPH-2F802;Lo;0;L;4E41;;;;N;;;;;

…then this:
2FA1B;CJK COMPATIBILITY IDEOGRAPH-2FA1B;Lo;0;L;9F16;;;;N;;;;;
2FA1C;CJK COMPATIBILITY IDEOGRAPH-2FA1C;Lo;0;L;9F3B;;;;N;;;;;
2FA1D;CJK COMPATIBILITY IDEOGRAPH-2FA1D;Lo;0;L;2A600;;;;N;;;;;
30000;<CJK Ideograph Extension G, First>;Lo;0;L;;;;;N;;;;;
3134A;<CJK Ideograph Extension G, Last>;Lo;0;L;;;;;N;;;;;
E0001;LANGUAGE TAG;Cf;0;BN;;;;;N;;;;;
E0020;TAG SPACE;Cf;0;BN;;;;;N;;;;;
E0021;TAG EXCLAMATION MARK;Cf;0;BN;;;;;N;;;;;
E0022;TAG QUOTATION MARK;Cf;0;BN;;;;;N;;;;;

…until getting to the end:
E01EC;VARIATION SELECTOR-253;Mn;0;NSM;;;;;N;;;;;
E01ED;VARIATION SELECTOR-254;Mn;0;NSM;;;;;N;;;;;
E01EE;VARIATION SELECTOR-255;Mn;0;NSM;;;;;N;;;;;
E01EF;VARIATION SELECTOR-256;Mn;0;NSM;;;;;N;;;;;
F0000;<Plane 15 Private Use, First>;Co;0;L;;;;;N;;;;;
FFFFD;<Plane 15 Private Use, Last>;Co;0;L;;;;;N;;;;;
100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;
10FFFD;<Plane 16 Private Use, Last>;Co;0;L;;;;;N;;;;;

1,410 (about 4.4% as of Unicode 14.0) lines in UnicodeData.txt are ideographic
characters with dynamically generated hex names, like CJK Compatibility
Ideograph-2F9FD. They are included in UnicodeData.txt in order to declare those
characters’ values for other Unicode properties. We will be able to discard
these characters, since they are already generated by the [#core/hex-names/
module][].
Korean Hangul-syllable names

The Korean Hangul script is made up of jamo characters which may be composed
into syllable characters.
The jamo’s names are encoded in UnicodeData.txt, as usual. However, many
composed syllables have been encoded in U+AC00–D7A3, and their names are
generated from their jamo’s sounds.
The algorithm for generating the names and scalars of these composed syllables
is defined in the Unicode Standard’s § 3.12. For example, U+D4DB
Hangul Syllable Pwilh 퓛 is made of three jamo:

U+1111 Hangul Choseong Phieuph (“P”),
U+1171 Hangul Jungseong Wi (“WI”), and
U+11B6 Hangul Jongseong Rieul-Hieul (“LH”).

Those three jamo’s sounds compose into a single name: P + WI + LH → PWILH:
hence, the name “Hangul Syllable Pwilh”.
§ 3.12 also conveniently gives some sample Java code. From this, a producer and
parser of their names in JavaScript can be derived. We implement these in the
[#core/hangul-syllables/ module][].
NamedSequences.txt

Several characters are sequences made of multiple scalars. They are stored in
the Unicode Character Database’s NamedSequences.txt source. It has 612
lines weighing 20.8 kB, and it looks like this (line breaks are significant):
# NamedSequences-13.0.0.txt
# Date: 2020-01-22, 19:12:00 GMT [KW, LI]
# © 2020 Unicode®, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# Unicode Character Database
# For documentation, see http://www.unicode.org/reports/tr44/
#
# Unicode Named Character Sequences
#
# This file is a normative contributory data file in the Unicode
# Character Database.

…then this:
# Named keycap sequences for telephone keypad (used for emoji)
# Provisional, 2015-05-05
# FE0F added to the sequences, 2016-05-11
# Approved 2017-05-12

KEYCAP NUMBER SIGN;0023 FE0F 20E3
KEYCAP ASTERISK;002A FE0F 20E3
KEYCAP DIGIT ZERO;0030 FE0F 20E3
KEYCAP DIGIT ONE;0031 FE0F 20E3
KEYCAP DIGIT TWO;0032 FE0F 20E3
KEYCAP DIGIT THREE;0033 FE0F 20E3
KEYCAP DIGIT FOUR;0034 FE0F 20E3
KEYCAP DIGIT FIVE;0035 FE0F 20E3
KEYCAP DIGIT SIX;0036 FE0F 20E3
KEYCAP DIGIT SEVEN;0037 FE0F 20E3
KEYCAP DIGIT EIGHT;0038 FE0F 20E3
KEYCAP DIGIT NINE;0039 FE0F 20E3

…and eventually ends with this:
KHMER VOWEL SIGN OM;17BB 17C6
KHMER VOWEL SIGN AAM;17B6 17C6

# Entries for JIS X 0213 compatibility mapping.
# Provisional 2008-11-07, Approved 2010-05-14
#
# Two of these were part of the original set of approved named sequences
# for Unicode 4.1. 2005.

HIRAGANA LETTER BIDAKUON NGA;304B 309A
HIRAGANA LETTER BIDAKUON NGI;304D 309A
HIRAGANA LETTER BIDAKUON NGU;304F 309A
HIRAGANA LETTER BIDAKUON NGE;3051 309A
HIRAGANA LETTER BIDAKUON NGO;3053 309A
KATAKANA LETTER BIDAKUON NGA;30AB 309A
KATAKANA LETTER BIDAKUON NGI;30AD 309A
KATAKANA LETTER BIDAKUON NGU;30AF 309A
KATAKANA LETTER BIDAKUON NGE;30B1 309A
KATAKANA LETTER BIDAKUON NGO;30B3 309A
KATAKANA LETTER AINU CE;30BB 309A
KATAKANA LETTER AINU TU;30C4 309A
KATAKANA LETTER AINU TO;30C8 309A
KATAKANA LETTER AINU P;31F7 309A
MODIFIER LETTER EXTRA-HIGH EXTRA-LOW CONTOUR TONE BAR;02E5 02E9
MODIFIER LETTER EXTRA-LOW EXTRA-HIGH CONTOUR TONE BAR;02E9 02E5

# EOF

For each named character sequence, any scalars following its first scalar are
its tail scalars. It is notable that many named character sequences share
the same tail scalars, varying only in their first code scalars. For instance,
all Keycap named character sequences share the same tail U+FE0F U+20E3.
NameAliases.txt

Some Unicode names are aliases to scalar characters. These character name
aliases are separately declared from strict character names, which are
immutable. Sometimes an alias is added because the strict character name has a
misspelling, like the infamous U+FE18 Presentation Form for Vertical Right
White Lenticular Brakcet. U+FE18 got an alias with the correct spelling
“Bracket”. Other characters have other types of aliases: for widely used
alternate names and for abbreviations.
A character may have zero, one, or many aliases: Although strict character
names are an injective function, character name aliases are a non-injective
function.
Aliases are stored in the Unicode Character Database’s NameAliases.txt
source. It has 567 lines weighing 16.1 kB, and it looks like this (line
breaks are significant):
# NameAliases-13.0.0.txt
# Date: 2019-09-09, 19:47:00 GMT [KW, LI]
# © 2019 Unicode®, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# Unicode Character Database
# For documentation, see http://www.unicode.org/reports/tr44/
#
# This file is a normative contributory data file in the
# Unicode Character Database.
#
# This file defines the formal name aliases for Unicode characters.

…then like this:
0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation
0003;END OF TEXT;control
0003;ETX;abbreviation
0004;END OF TRANSMISSION;control
0004;EOT;abbreviation
0005;ENQUIRY;control
0005;ENQ;abbreviation
0006;ACKNOWLEDGE;control
0006;ACK;abbreviation

…and eventually ending with:
E01EA;VS251;abbreviation
E01EB;VS252;abbreviation
E01EC;VS253;abbreviation
E01ED;VS254;abbreviation
E01EE;VS255;abbreviation
E01EF;VS256;abbreviation

# EOF

Code-point labels

Many scalar characters (and all noncharacters and UTF-16 surrogates) do not
have any strict character name. These include the control characters,
private-use characters, UTF-16 surrogates, noncharacters, and reserved
characters. The Standard therefore gives them labels such as “control-0009”
(U+0009) and “noncharacter-FFFF” (U+FFFF). These code-point labels are
also determined by several rules:

A code point has a “special code-point type” if it fulfills its respective
condition:

control: The code point is within U+0000–001F, or it is U+007F
Delete. (The code point is considered a scalar.)
reserved: The code point not a noncharacter. (Although the Standard
defines reserved code points as those with a General Category of Cn
(“not assigned”) as defined in the third field of UnicodeData.txt,
this definition causes formerly reserved characters to no longer be
reserved characters, over time. In order to keep application code
backwards compatible, we use a broader definition of reserved character.
noncharacter: The code point is within U+FDD0–FDEF, or its hex ends
with FFFE or FFFF. This code point does not count encode a
character.
private-use: The code point is within U+E000–F8FF, or it is a
character from one of the two Private Use Planes (i.e.,
U+F0000–FFFFD, U+100000–10FFFD). (The code point is always a
scalar.)
surrogate: The code point is within U+D800–DFFF. (The code point
does not count as a scalar.)


When a code point has a “special code-point type”, then it has a code-point
label that is its special code-point type—concatenated with a -, then its
code hex.

We implement these rules in the [#core/name-counter/ module][], as well as
the #core/name-range/ucd module and #core/name-range/noncharacter
module.
Scalar distribution

The vast majority of scalars are currently unassigned. Nearly every scalar in
planes 04–0D is currently unassigned (the only exceptions are the two
noncharacters at the end of each plane).
Of the relatively few scalars that have been assigned, many of them have only
dynamically generated names. These include all noncharacters, all scalars for
private-use characters, and nearly all scalars for ideographic and
Hangul-syllable characters. Additionally, every single scalar in blocks 03
and 04 (the SIP and TIP) has an ideograph with a dynamically generated name.
When we exclude scalars with dynamically generated names, every scalar left –
i.e., scalars with explicitly defined names – is in the first two planes 00
and 01 (the BMP and SMP) or the antepenultimate plane (the 0E SSP). Of
these, the large majority of characters belong to the BMP or SMP; the SSP has
relatively few characters.
In other words, virtually every scalar character is in the BMP, SMP, SIP, and
TIP, and virtually every scalar character with an explicitly defined name is
in the BMP and SMP – with a large gap of scalars separating them from the few
characters with explicitly defined names in the SSP.
API and goals

We want a Node.js API that looks like this:
import UninameLibrary from 'uniname';

// Create a Uniname library object from a database file.
const database = await fetch('/uniname/database');
const u = new UninameLibrary(database);

{
  // Get the character that has a certain Unicode name, with fuzzy matching.
  const character0 = u.get('Devanagari Letter E');
  // Get multiple characters in a single string.
  const character1 = u.get('Latin Small Letter O', 'Combining Breve');
}

{
  const character = '\u0000';
  // Get the preferred Unicode name: 'NULL'.
  const name = u.getPreferredName(character);
  // Get the strict value of the character Name property: undefined.
  const strictName = u.getStrictName(character);
  // Get all names of a certain character with their types:
  // `[ [ 'NULL', 'control' ], [ 'NUL', 'abbreviation' ],
  // [ 'control-0000', 'label' ] ]`.
  const nameEntries = u.getNameEntries(character);
}
We can see what API would look like with the previous real-world code.
From debug@4.1.1/src/node.js:
const colorCode = '\u001B[3' + (c < 8 ? c : '8;5;' + c);

const colorCode = `${ u.get('escape') }[3${ c < 8 ? c : '8;5;' + c }`;

const colorCode = `\u[escape][3${ c < 8 ? c : '8;5;' + c }`;
From js-yaml@3.13.1/lib/js-yaml/loader.js:
(c === 0x4C/* L */) ? '\u2028' :
(c === 0x50/* P */) ? '\u2029' : '';

(c === 0x4C) ? u.get('line separator') :
(c === 0x50) ? u.get('paragraph separator') :
'';
From uglify-js@3.6.0/lib/output.js:
str = str.replace(/[\\\b\f\n\r\v\t\x22\x27\u2028\u2029\0\ufeff]/g,
  function(s, i) {
    switch (s) {
      case '"': ++dq; return '"';
      case "'": ++sq; return "'";
      case "\\": return "\\\\";
      case "\n": return "\\n";
      case "\r": return "\\r";
      case "\t": return "\\t";
      case "\b": return "\\b";
      case "\f": return "\\f";
      case u.get('line tabulation'):
        return options.ie8 ? "\\x0B" : "\\v";
      case u.get('line separator'):
        return u.get('backslash', 'line separator');
      case u.get('paragraph separator'):
        return u.get('backslash', 'paragraph separator');
      case u.get('zero width no-break space'):
        return u.get('backslash', 'zero width no-break space');
      case "\0":
        return /[0-9]/.test(str.charAt(i+1)) ? "\\x00" : "\\0";
    }
    return s;
});

str = str.replace(
  new RegExp(
    "[${
      u.get(
        "backslash", "backspace", "form feed", "line feed", "carriage return",
        "vertical tabulation", "quotation mark", "apostrophe",
        "line separator", "paragraph separator", "null", "byte order mark",
      )
    }]",
    "g",
  ),
  function(s, i) {
    switch (s) {
      case u.get("quotation mark"): ++dq; return s;
      case u.get("apostrophe"): ++sq; return s;
      case u.get("backslash"): return s;
      case u.get("line feed"): return s;
      case u.get("carriage return"): return s;
      case u.get("tab"): return s;
      case u.get("backspace"): return s;
      case u.get("backslash"): return s;
      case u.get("form feed"): return s;
      case u.get("vertical tabulation"): return options.ie8 ? "\\x0B" : s;
      case u.get("line separator"): return "\\u2028";
      case u.get("paragraph separator"): return "\\u2029";
      case u.get("byte order mark"): return "\\ufeff";
      case u.get("null"):
        return /[0-9]/.test(str.charAt(i+1)) ? "\\x00" : "\\0";
    }
    return s;
});
From trim-right@1.0.1.mjs:
while (/[\s\uFEFF\u00A0]/.test(str[tail - 1])) {

const rightTrimmableRegExp = new RegExp('[\s${
    u.get('zero width no-break space', 'no-break space')
  }]'
);
while (rightTrimmableRegExp.test(str[tail - 1])) {
From highlight.js@9.15.8/lib/languages/fix.js:
begin: /[^\u2401\u0001]+/,
end: /[\u2401\u0001]/,

begin: /[^${
  u.get('Symbol for Start of Heading')
}${
  u.get('Start of Heading')
}]+/,
end: /[${
  u.get('Symbol for Start of Heading')
}${
  u.get('Start of Heading')
}]+/,
From react-select@3.0.4/src/diacritics.js:
{ base:'AA',letters:/[\uA732]/g },
{ base:'AE',letters:/[\u00C6\u01FC\u01E2]/g },
{ base:'AO',letters:/[\uA734]/g },
{ base:'AU',letters:/[\uA736]/g },
{ base:'AV',letters:/[\uA738\uA73A]/g },
{ base:'AY',letters:/[\uA73C]/g },
{ base:'B', letters:/[\u0042\u24B7\uFF22\u1E02\u1E04\u1E06\u0243\u0182\u0181]/g
},

{
  base: 'AA',
  letters: new RegExp(`[${
    u.get(
     'Latin Capital Letter AE',
    )
  }]`, 'g')
},
{
  base: 'AE',
  letters: new RegExp(`[${
    u.get(
      'Latin Capital Letter AE',
      'Latin Capital Letter AE with Acute',
      'Latin Capital Letter AE with Macron',
    )
  }]`, 'g')
},
{
  base: 'AO',
  letters: new RegExp(`[${
    u.get(
      'Latin Capital Letter AO',
    )
  }]`, 'g')
},
{
  base: 'AU',
  letters: new RegExp(`[${
    u.get(
      'Latin Capital letter AU',
    )
  }]`, 'g')
},
{
  base: 'AV',
  letters: new RegExp(`[${
    u.get(
      'Latin Capital Letter AV',
      'Latin Capital Letter AV with Horizontal Bar',
    )
  }]`, 'g')
},
{
  base: 'AY',
  letters: new RegExp(`[${
    u.get(
      'Latin Capital Letter AY',
    )
  }]`, 'g')
},
{
  base: 'B',
  letters: new RegExp(`[${
    u.get(
      'Latin Capital Letter B',
      'Circled Latin Capital Letter B',
      'Fullwidth Latin Capital Letter B',
    )
  }]`, 'g')
},
{ base:'B', letters:/[\u0042\u24B7\uFF22\u1E02\u1E04\u1E06\u0243\u0182\u0181]/g
},
From katex@0.10.2/src/unicodeSymbols.js:
"\u014f": "\u006f\u0306",  // ŏ = \u{o}

u.get('Latin Small Letter O with Breve'):
  u.get('Latin Small Letter O', 'Combining Breve'),
From regexgen@1.3.0/test/test.js:
'\u{1F937}\u{1F3FB}\u200D',
'\u{1F937}\u{1F3FC}\u200D',
'\u{1F937}\u{1F3FD}\u200D',
'\u{1F937}\u{1F3FE}\u200D',
'\u{1F937}\u{1F3FF}\u200D',
'\u{1F937}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FB}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FC}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FD}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FE}\u200D\u2640\uFE0F',
'\u{1F937}\u{1F3FF}\u200D\u2640\uFE0F'

u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-1-2', 'Zero Width Joiner'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-3', 'Zero Width Joiner'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-4', 'Zero Width Joiner'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-5', 'Zero Width Joiner'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-6', 'Zero Width Joiner'),
u.get('Shrug', 'Zero Width Joiner', 'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Zero Width Joiner', 'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-1-2', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-3', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-3', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-3', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
u.get('Shrug', 'Emoji Modifier Fitzpatrick Type-3', 'Zero Width Joiner',
  'Female Sign', 'Variation Selector-16'),
From regexp-support@1.0.25/lib/pattern.js:
['^\\u{20bb7}$', 'u', '𠮷', true, 'test'],

['^\\u{u.get('CJK Unified Ideograph-20BB7')}$', 'u', '𠮷', true, 'test'],

Our API (specifically the getNameEntries method) uses eight name types:

null: Strict character names – i.e., property values of the Name property.
'sequence': Names of named character sequences.
'label': Code-point labels.
'correction': Corrections for serious problems in strict character names.
'control': Commonly occurring names for control codes.
'alternate': Widely used alternate names for format characters.
'figment': Names of control codes which were documented but never actually
approved in any standard.
'abbreviation': Common abbreviations for many characters.

Our goals include (ordered by priority):


Name→value lookup: We want to retrieve the character (or UTF-16
surrogate) corresponding to a given Unicode name.


Value→names access: We want to retrieve all of the Unicode names that
correspond to a given character (or UTF-16 surrogate).


Named sequences, name aliases, and code-point labels: We want to
be able to retrieve data about not only scalar characters’ strict character
names but also named character sequences, name
aliass, and code-point labels.


Fuzzy matching: We want our name→value lookup to support fuzzy name
matching as defined in UAX #44: ignoring spaces, underscores, and most
medial hyphens.


Memory efficiency: The data structure needs to occupy as little memory
as as possible while it is being used. This is especially important for
mobile devices that have constrained memory. We do not want compressed data
structure that needs to be entirely decompressed from storage into memory.
Ideally, the data structure would be a memory-mapped file.


Storage efficiency: The data structure needs to occupy as little storage
as possible. This is especially important for mobile devices that have
constrained network bandwidth.


Simplicity: We want to keep our code and data simple. In particular, we
want to avoid dependency on large third-party libraries.


Time efficiency: This is least important. We want the speed of retrieval
and random access to be fast (both in the average and worst cases),
although we care less about speed than about storage and memory.
We also care more about the speed of name→value lookup than the
speed of value→names access. Although both are goals, the former is more
frequently useful to developers. The latter is useful for end-user-facing
internationalized applications and for program transpilation.


Non-goals include:

No dynamic updates: The data structure needs to be static. It would need
to change only every several months, whenever the Unicode Consortium updates
the Standard.
No fast compilation: The building and compression of the data structure
happen long before the structure is actually used, and so we will allow
these to occur more slowly than data decompression and retrieval.
No substring matching: We do not need to match names partially,
including with name prefixes or with name suffixes.
No file-format compatibility: The data structure’s file format does not
need to be backwards or forwards compatible. It can have one version number
that must exactly match the version number expected by the application.

Test and benchmark suites

In order to ensure that our code’s behavior remains the same as we refactor it,
we will use test suites. One of the test suites, the
[test/node/complete.mjs module][], tests every single Unicode name and named
Unicode value on the database.
We also measure the temporal performance of our database, using a benchmark
suite in a [src/node/benchmark.mjs module][]. This module first creates two
query sets: it randomly chooses a specified number of Unicode names and of
randomly chosen characters. It then calls uniname.get on all of those names
and uniname.getNameEntries on all of those hex sequences’ characters.
It also applies uniname.get to a nonexistent name and
uniname.getNameEntries to a nonexistent character – also 1,000 times each.
JSON array

The first version of our database is a JSON file that weighs 2,310.4 kB
(154.4 kB when compressed with Brotli v1.0.9).
The database file structure looks like this (whitespace added):
[{"scalar":0,"name":"NULL","nameType":"control"},
 {"scalar":0,"name":"NUL","nameType":"abbreviation"},
 {"scalar":1,"name":"START OF HEADING","nameType":"control"},
 {"scalar":1,"name":"SOH","nameType":"abbreviation"},
 {"scalar":2,"name":"START OFTEXT","nameType":"control"},
 {"scalar":2,"name":"STX","nameType":"abbreviation"},
 {"scalar":3,"name":"END OF TEXT","nameType":"control"},
 {"scalar":3,"name":"ETX","nameType":"abbreviation"},
 {"scalar":4,"name":"END OF TRANSMISSION","nameType":"control"},
 {"scalar":4,"name":"EOT","nameType":"abbreviation"},
 {"scalar":5,"name":"ENQUIRY","nameType":"control"},
 {"scalar":5,"name":"ENQ","nameType":"abbreviation"},
 {"scalar":6,"name":"ACKNOWLEDGE","nameType":"control"},
 {"scalar":6,"name":"ACK","nameType":"abbreviation"},
 {"scalar":7,"name":"ALERT","nameType":"control"},
 {"scalar":7,"name":"BEL","nameType":"abbreviation"},
 {"scalar":8,"name":"BACKSPACE","nameType":"control"},
 {"scalar":8,"name":"BS","nameType":"abbreviation"},
 {"scalar":9,"name":"CHARACTER TABULATION","nameType":"control"},

…and ending with this:
 {"initialHeadPoint":917998,"nameStem":"VS255","nameType":"abbreviation"},
 {"initialHeadPoint":917998,"nameStem":"VARIATION SELECTOR-255"},
 {"initialHeadPoint":917999,"nameStem":"VS256","nameType":"abbreviation"},
 {"initialHeadPoint":917999,"nameStem":"VARIATION SELECTOR-256"},
 {"initialHeadPoint":983038,"length":2,"nameStem":"NONCHARACTER",
  "nameCounterType":"hyphenHex","nameType":"label"},
 {"initialHeadPoint":983040,"length":65534,
  "nameCounterType":"hyphenHex","nameStem":"PRIVATE-USE","nameType":"label"},
 {"initialHeadPoint":1048574,"length":2,"nameStem":"NONCHARACTER",
  "nameCounterType":"hyphenHex","nameType":"label"},
 {"initialHeadPoint":1048576,"length":65534,
  "nameCounterType":"hyphenHex","nameStem":"PRIVATE-USE","nameType":"label"},
 {"initialHeadPoint":1114110,"length":2,"nameStem":"NONCHARACTER",
  "nameCounterType":"hyphenHex","nameType":"label"}
]

As of Unicode 14.0, these JSON objects correspond to 35,558 name ranges,
which cover 285,275 Unicode names and 284,805 named Unicode values.
Our time benchmark (10,000 operations each) gives these results for Unicode
14.0, by Node v18.3.0, on a MacBook Air (M1, 2020) with 16 GB of memory:


Operation
2%
25%
50%
75%
98%


Valid N → V
10,243 µs
20,020 µs
25,122 µs
26,467 µs
32,408 µs


Invalid N → no V
25,973 µs
28,988 µs
29,211 µs
31,090 µs
33,781 µs


Named V → N
9,849 µs
10,918 µs
11,010 µs
12,164 µs
14,123 µs


Unnamed V → no N
10,839 µs
12,194 µs
12,576 µs
13,688 µs
16,892 µs


When retrieving data, the JSON file must be loaded into memory, either in its
entirety at once or sequentially as a stream. When Node v18.3.0 loads the
entire JSON file, its memory heap usage increases by approximately 2,166 kB,
which is about the same size as the file itself.
Extended name ranges

We already are using the concept of name ranges, but there are many singleton name ranges in UnicodeData.txt that may still be combined with their neighbors.
For example:

There are more than 1,000 CJK Compatibility characters named “CJK
COMPATIBILITY IDEOGRAPH-F900”, “CJK COMPATIBILITY IDEOGRAPH-F901”, etc. that
are individually named in UnicodeData.txt.
There are 26 characters from U+1F130 Squared Latin Capital Letter A to
U+1F149 Squared Latin Capital Letter Z, one for each English letter A–Z.
There are 48 characters from U+1F031 Domino Tile Horizontal-00-00 to U+1F061
Domino Tile Horizontal-06-06, and they all start with “Domino Tile
Horizontal-” followed by “00-00”, “00-01”, …, “00-06”, “01-00”, …, “06-06”.
There are another 48 similar characters from U+1F062 Domino Tile
Vertical-00-00 to U+1F093 Domino Tile Vertical-06-06.
There are 766 characters from U+18801 Tangut Component-002 to U+18AFF
Tangut Component-768, and they all start with “Tangut Component-” followed by
a three-digit number between 002 and 768.

We save further space by consolidating these names’ singleton name ranges into
multiplex name ranges.


DIGIT ZERO
…
DIGIT NINE


DIGIT :RANGE:NUMWORD:10


LATIN CAPITAL LETTER A
LATIN CAPITAL LETTER B
…
LATIN CAPITAL LETTER Z


LATIN CAPITAL LETTER :RANGE:LETTER:1A


SINGLE SHIFT TWO
SINGLE SHIFT THREE


SINGLE SHIFT :RANGE:NUMWORD:2:2


TAMIL NUMBER ONE HUNDRED
TAMIL NUMBER ONE THOUSAND


TAMIL NUMBER :RANGE:NUMWORDHUNDRED
TAMIL NUMBER :RANGE:NUMWORDTHOUSAND


ETHIOPIC NUMBER TEN
ETHIOPIC NUMBER TWENTY
…
ETHIOPIC NUMBER NINETY
ETHIOPIC NUMBER HUNDRED


ETHIOPIC NUMBER :RANGE:NUMWORD:9:A:A
ETHIOPIC NUMBER HUNDRED


CIRCLED NUMBER TWENTY ONE
CIRCLED NUMBER TWENTY TWO
…
CIRCLED NUMBER TWENTY NINE
CIRCLED NUMBER THIRTY
CIRCLED NUMBER THIRTY ONE
…
CIRCLED NUMBER THIRTY NINE


CIRCLED NUMBER :RANGE:NUMWORDWITHSPACES:21:19


DOMINO TILE HORIZONTAL-00-00
DOMINO TILE HORIZONTAL-00-01
…
DOMINO TILE HORIZONTAL-00-06
DOMINO TILE HORIZONTAL-01-00
…
DOMINO TILE HORIZONTAL-01-06


DOMINO TILE HORIZONTAL-00-:RANGE:DIGITSPADDED2:7
DOMINO TILE HORIZONTAL-01-:RANGE:DIGITSPADDED2:7


LINEAR A SIGN AB001
LINEAR A SIGN AB002
…
LINEAR A SIGN AB021
LINEAR A SIGN AB021F
LINEAR A SIGN AB021M
LINEAR A SIGN AB022


LINEAR A SIGN AB:RANGE:DIGITSPADDED3:16:1
LINEAR A SIGN AB021F
LINEAR A SIGN AB021M
LINEAR A SIGN AB022


IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ONE
IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY TWO
…
IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY THIRTY-ONE


IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY :RANGE:NUMWORD:1F:1


VARIATION SELECTOR-1
…
VARIATION SELECTOR-16


VARIATION SELECTOR-:RANGE:DIGITS:10:1


AEGEAN NUMBER ONE
AEGEAN NUMBER TWO
…
AEGEAN NUMBER TEN
AEGEAN NUMBER TWENTY
AEGEAN NUMBER THIRTY
…
AEGEAN NUMBER NINETY
AEGEAN NUMBER ONE HUNDRED
AEGEAN NUMBER TWO HUNDRED
…
AEGEAN NUMBER NINE HUNDRED
AEGEAN NUMBER ONE THOUSAND
AEGEAN NUMBER TWO THOUSAND
…
AEGEAN NUMBER TEN THOUSAND
AEGEAN NUMBER TWENTY THOUSAND
AEGEAN NUMBER THIRTY THOUSAND
…
AEGEAN NUMBER NINETY THOUSAND


AEGEAN NUMBER :RANGE:NUMWORD:A:1
AEGEAN NUMBER :RANGE:NUMWORD:8:14:10
AEGEAN NUMBER :RANGE:NUMWORDHUNDRED:9:1
AEGEAN NUMBER :RANGE:NUMWORDTHOUSAND:A:1
AEGEAN NUMBER :RANGE:NUMWORDTHOUSAND:8:B


NUSHU CHARACTER-1B170
NUSHU CHARACTER-1B171
…
NUSHU CHARACTER-1B2FB


NUSHU CHARACTER-:RANGE:HEX:1B170:18B


WHITE CHESS KING ROTATED NINETY DEGREES


WHITE CHESS KING ROTATED NINETY DEGREES


VARIATION SELECTOR-17
VARIATION SELECTOR-18
…
VARIATION SELECTOR-256


VARIATION SELECTOR-:RANGE:DIGITS:EF:11


VS1:ABBREVIATION
VS2:ABBREVIATION
…
VS16:ABBREVIATION
VS17:ABBREVIATION
VS18:ABBREVIATION
…
VS256:ABBREVIATION


VS:RANGE:DIGIT:10:::ABBREVIATION
VS:RANGE:DIGIT:EF:11::ABBREVIATION


KEYCAP DIGIT ZERO:SEQUENCE:FE0F:20E3
KEYCAP DIGIT ONE:SEQUENCE:FE0F:20E3
…
KEYCAP DIGIT NINE:SEQUENCE:FE0F:20E3


KEYCAP DIGIT :RANGE:NUMWORD:9:::SEQUENCE:FE0F:20E3


Database 3.0: Text-line sequence

Our next database is a sequence of lines. In this case, both name→value lookup
and value→names access require linearly checking each line for matches.
The database file structure now looks like this (line breaks are significant):
0:NULL:CONTROL
0:NUL:ABBREVIATION
1:START OF HEADING:CONTROL
1:SOH:ABBREVIATION
2:START OF TEXT:CONTROL
2:STX:ABBREVIATION
3:END OF TEXT:CONTROL
3:ETX:ABBREVIATION
4:END OF TRANSMISSION:CONTROL
4:EOT:ABBREVIATION
5:ENQUIRY:CONTROL

…then later:
22:QUOTATION MARK
23:NUMBER SIGN
23:KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3
24:DOLLAR SIGN
25:PERCENT SIGN
26:AMPERSAND
27:APOSTROPHE

…and so on.
Each line in the file looks like: ‹scalarHex›:‹name›‹nameInfo›.

The ‹scalarHex› is a single scalar hex, stripped of leading 0s. (For
named character sequences, this is the hex of the first scalar:
the character’s head point.)
The ‹name› is a Unicode name, in all caps, and using spaces and -.
The ‹nameInfo› is one of the following.

For strict Name property values: an empty string.
For name aliases: :CORRECTION for correction, :CONTROL for
control, :ALTERNATE for alternate, :FIGMENT for figment, or
:ABBREVIATION for abbreviation.
For named character sequences: :SEQUENCE followed by a sequence of
hexes for the remaining scalar hexes that follow the first scalar.
These are called the character’s tail scalars.
Each hex is also preceded by : and is stripped of leading 0s.


For example:

The line for U+0021 Exclamation Mark is 21:EXCLAMATION MARK.
The two lines for U+0000’s two aliases NULL (a control alias) and NUL
(an abbreviation alias) are 0:NULL:CONTROL and 0:NUL:ABBREVIATION.
The line for the named character sequence U+0023 U+FE0F U+20E3 Keycap
Number Sign is 23:KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3.

Database 4.0: Text-line delta encoding

Scalars in the Basic Multilingual Plane (the BMP: 0000–FFFF) currently use
one, two, three, or four hexadecimal digits. And scalars in other planes
(10000–10FFFF) currently use six hexadecimal digits. Yet the majority of
entries in the database deal with contiguous (or, at least, mutually close)
blocks of scalars.
All of those poorly compressible hexadecimal codes can be converted into very
compressible runs of 1 increments (for contiguous scalars), with occasional
larger increments. This is an example of delta encoding.
The database is now a plain-text string weighing 962.9 kB (104.6 kB
when compressed with Brotli v1.0.9). It looks like:
0:NULL:CONTROL
0:NUL:ABBREVIATION
1:START OF HEADING:CONTROL
0:SOH:ABBREVIATION
1:START OF TEXT:CONTROL
0:STX:ABBREVIATION
1:END OF TEXT:CONTROL
0:ETX:ABBREVIATION
1:END OF TRANSMISSION:CONTROL
0:EOT:ABBREVIATION
1:ENQUIRY:CONTROL

…then later:
1:QUOTATION MARK
1:KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3
0:NUMBER SIGN
1:DOLLAR SIGN
1:PERCENT SIGN
1:AMPERSAND
1:APOSTROPHE

…and so on.
Each line in the file looks like: ‹scalarDeltaHex›:‹name›‹nameInfo›.

The ‹scalarDeltaHex› is a single hex of the scalar delta from the
previous line’s scalar. This will usually be 1 (one scalar after the
previous line’s) or 0 (the same scalar as the previous line’s).
(For named character sequences, this is the delta of the first scalar’s hex.
Similarly, if the previous line was for a named character sequence, then the
delta is compared to that sequence’s first scalar.)
The ‹name› is a Unicode name, in all caps, and using spaces and -.
The ‹nameInfo› is one of the following.

For strict Name property values: an empty string.
For name aliases: :CORRECTION for correction, :CONTROL for
control,  :ALTERNATE for alternate, :FIGMENT for figment, or
:ABBREVIATION for abbreviation.
For named character sequences: :SEQUENCE followed by a sequence of
hexes for the remaining scalar hexes that follow the first scalar.
Each hex is also preceded by : and is stripped of leading 0s.


For example:

The line for U+0021 Exclamation Mark is 1:EXCLAMATION MARK, because its
preceding line is for U+0020, and 0021 − 0020 = 1.
The two lines for U+0003’s two aliases END OF TRANSMISSION (a control
alias) and EOT (an abbreviation alias) are 1:END OF TRANSMISSION:CONTROL
and 0:EOT:ABBREVIATION. The first line’s ‹scalarDeltaHex› is 1 because
its previous line’s scalar is 0003, and 0004 − 0003 = 1. The second
line’s ‹scalarDeltaHex› is 0 because both it and its previous line have
the scalar U+0003, and 0003 − 0003 = 0.
The line for the named character sequence U+0023 U+FE0F U+20E3 Keycap
Number Sign is 1:KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3, because its
previous line’s scalar is 0022, and 0023 − 0022 = 1.

Towards concise data structures

Unfortunately, neither using a sequence of text lines is not very efficient in
time. Name→value lookup and value→names access must linearly scan the
database string, checking each of its line for a match. In the worst cases
(VARIATION SELECTOR-256 and nonexistent Unicode names), our code may have to
scan the entire database before determining the answer. We want to be able to
perform random access on the file. Brotli does not solve this problem; it
only can linearly decompress the file, and is therefore useful here mostly
for transferring the database across networks.
We therefore will adopt techniques from succinct data structures that allow
random access directly into compressed data. We will eventually encode custom
binary file format that allows for such random access.
As of this writing, the best such data structure for our purposes
(bidirectional string dictionaries) seems to be IBiS^RP+DAC-L-TT,
which is described in Brisaboa et al. (2019). IBiS^RP+DAC-L-TT
combines binary search, front coding, Re-Pair compression, and several
other compression techniques. (In actuality, we will be modifying
IBiS^RP+DAC-L-TT to accomodate the fact that we need to map string
keys to scalars, rather than to integer positions in the dictionary.)
We want to eventually get the database’s size close to the 97 kB of
version 1.0’s Brotli-compressed database – or at least close to GNU Uniname’s
350 kB – while speeding up name→value lookup by name and (to a lesser
extent) speeding up name→value access.
Database 5.0: Text table (TT)

Our first step toward compressing the database into a
memory-mappable, randomly accessible,
concise file is to convert our data structure into
an indexed text table (TT), sorted lexicographically by fuzzily folded
name. The text table will allow us to randomly look up characters without
having to linearly read the database.
For clarity, we will continue to use UTF-8/ASCII plain text to format the
database (until later, when we will switch it to a binary format). Now the
database is a 1-line plain-text file weighing 1,198.2 kB (232.5 kB when
compressed with Brotli v1.0.9). This size is significantly worse than the
previous version’s 962.9 kB but, as we will see, we have improved much in time.


Block
Type
Range
Total


Directory
JSON obj.
0.0 B
0.3 kB


TT texts (values)
ASCII str.
25.7 B
862.1 kB


TT texts (separations)
Hex array
5.0 B
167.9 kB


TT scalar array
Hex array
5.0 B
167.9 kB


Total

35.7 B
1,198.2 kB


The directory is an object that maps string keys to integer values,
stringified with JSON, then ending with the control character U+0003 Start of
Text (which is forbidden in the JSON format itself).
Most of the integer values are pointers: each indicates the position of each
database block. The pointers are relative to the end of the directory, after
the Start of Text control character.
The directory looks like the following. (The actual directory text has no extra
whitespace and ends with the control character U+0003 Start of Text. Here,
for illustration purposes, extra whitespace is shown, and the Start of Text
control character is displayed as ␂.)
{
  "nameTableDirectory": {
    "numOfEntries": 33578,
    "textSequencePointer": 0,
    "textSequenceDirectory": {
      "valuesPointer": 0,
      "separationArrayPointer": 862095,
      "separationArrayDirectory": {
        "numOfHexesPerEntry": 5
      },
      "length": 1029987
    },
    "scalarArrayPointer": 1029987,
    "scalarArrayDirectory": {
      "numOfHexesPerEntry": 5
    }
  }
}
␂

The text table has 33,578 text entries (32,647 of which are for strict
character names and 931 of which are for other Unicode names). There is one
text entry for each explicitly declared Unicode name, which all correspond to
characters made of scalars (as opposed to UTF-16 surrogates or to
noncharacters). This means that some scalars will have multiple text entries
somewhere in the text table, such as two entries for U+0000’s NULL and its
alias NUL.
The text entries are lexicographically ordered by fuzzily folded name, from the
entry for ABACUS (at index 0) to the entry for ZWSP (at index 32,727). At
the middle of the table is entry #16,789. (Because text entries are sorted by
fuzzily folded name, the entry for NORTHEAST-POINTING AIRPLANE
lexicographically precedes NORTH WEST ARROW, because the former name’s
fuzzily folded version is NORTHEASTPOINTINGAIRPLANE, which precedes
NORTHWESTARROW.)
The text table is made of two column data blocks: a text sequence and a
scalar array. The TT text sequence and the TT scalar array each represent
a “column” of the text table, each in the same shared name-lexicographic order.
A array is an array of fixed-length values; the length of each value is
declared in a directory object. We call them “arrays” instead of “arrays” to
avoid confusion with JavaScript arrays.
A sequence in turn is made of a values variable-length string (made of the
concatenated variable-length string values) and a separations array (made
of fixed-length unsigned integers that denote the positions of the separations
between the concatenated value strings).
The text sequence is a block of variable-length texts. The text values are
concatenated together in a values block, without any separator (separations
between texts are separately encoded in a following block). There is one text
for each Unicode name, and the texts are lexicographically ordered by their
Unicode names. Each text looks like: ‹name›‹nameInfo›.

The ‹name› is a Unicode name, in all caps, and using spaces and -.
The ‹nameInfo› is one of the following.

For strict Name property values: an empty string.
For name aliases: :CORRECTION for correction, :CONTROL for
control, :ALTERNATE for alternate, :FIGMENT for figment, or
:ABBREVIATION for abbreviation.
For named character sequences: :SEQUENCE followed by a sequence of
hexes for the remaining scalar hexes that follow the first scalar.
Each hex is also preceded by : and is stripped of leading 0s.


For example:

The text for U+0021 Exclamation Mark is EXCLAMATION MARK.
The two texts for U+0000’s two aliases NULL (a control alias) and NUL
(an abbreviation alias) are NULL:CONTROL and NUL:ABBREVIATION.
The texts for the named character sequence U+0023 U+FE0F U+20E3 Keycap
Number Sign is KEYCAP NUMBER SIGN:SEQUENCE:FE0F:20E3.

The text value sequence (with • inserted between each text for illustration
purposes; line breaks are also insignificant) starts with:
ABACUS•ACCORDION•ACCOUNT OF•AC CURRENT•ACK:ABBREVIATION•ACKNOWLEDGE:CONTROL•
ACTIVATE ARABIC FORM SHAPING•ACTIVATE SYMMETRIC SWAPPING•ACUTE ACCENT•
ACUTE ANGLE•ADDRESSED TO THE SUBJECT•ADHESIVE BANDAGE•ADI SHAKTI•
ADLAM ALIF LENGTHENER•ADLAM CAPITAL LETTER ALIF•ADLAM CAPITAL LETTER BA•
ADLAM CAPITAL LETTER BHE•ADLAM CAPITAL LETTER CHI•ADLAM CAPITAL LETTER DAALI•
ADLAM CAPITAL LETTER DHA•ADLAM CAPITAL LETTER E•ADLAM CAPITAL LETTER FA•
ADLAM CAPITAL LETTER GA•ADLAM CAPITAL LETTER GBE•ADLAM CAPITAL LETTER HA•
ADLAM CAPITAL LETTER I•ADLAM CAPITAL LETTER JIIM•ADLAM CAPITAL LETTER KAF•
ADLAM CAPITAL LETTER KHA•ADLAM CAPITAL LETTER KPO•ADLAM CAPITAL LETTER LAAM•
ADLAM CAPITAL LETTER MIIM

…and ends with:
ZANABAZAR SQUARE VOWEL SIGN REVERSED I•ZANABAZAR SQUARE VOWEL SIGN U•
ZANABAZAR SQUARE VOWEL SIGN UE•ZEBRA FACE•ZERO WIDTH JOINER•
ZERO WIDTH NO-BREAK SPACE•ZERO WIDTH NON-JOINER•ZERO WIDTH SPACE•ZEUS•
ZIPPER-MOUTH FACE•Z NOTATION BAG MEMBERSHIP•Z NOTATION DOMAIN ANTIRESTRICTION•
Z NOTATION LEFT BINDING BRACKET•Z NOTATION LEFT IMAGE BRACKET•
Z NOTATION RANGE ANTIRESTRICTION•Z NOTATION RELATIONAL COMPOSITION•
Z NOTATION RIGHT BINDING BRACKET•Z NOTATION RIGHT IMAGE BRACKET•
Z NOTATION SCHEMA COMPOSITION•Z NOTATION SCHEMA PIPING•
Z NOTATION SCHEMA PROJECTION•Z NOTATION SPOT•Z NOTATION TYPE COLON•
ZOMBIE•ZWJ:ABBREVIATION•ZWNBSP:ABBREVIATION•ZWNJ:ABBREVIATION•
ZWSP:ABBREVIATION

The TT separation array is a series of fixed-length hexes, which encodes a
array of integers. There is one integer for each Unicode name, and the texts
are lexicographically ordered by their Unicode names. The integers are
pointers: each indicates the position of each text within the text-sequence
values block. The pointers are relative to the beginning of the values block.
Each integer is five hex digits wide: hence
"separationArrayDirectory":{"numOfHexesPerEntry":5} in the directory.
The text separations (with • inserted between each text for illustration
purposes – line breaks are also insignificant) start with:
00006•0000F•00019•00023•00033•00046•00062•0007D•00089•00094•000AC•000BC•
000C6•000DB•000F4•0010B•00123•0013B•00155•0016D•00183•0019A•001B1•001C9

…and end with:
D05A5•D05BE•D05D3•D05E3•D05E7•D05F8•D0611•D0632•D0651•D066E•D068E•D06AF•
D06CF•D06ED•D070A•D0722•D073E•D074D•D0762•D0768•D0778•D078B•D079C•D07AD

Lastly the TT scalar array is also a series of fixed-length hexes, which
also encodes a array of scalars. There is one scalar for each character name,
and the texts are lexicographically ordered by their Unicode names. (For named
character sequences, the scalars are the sequences’ first scalars.)
Each integer is five hex digits wide: hence
"scalarArrayDirectory":{"numOfHexesPerEntry":5} in the directory.
The scalars (with • inserted between each text for illustration purposes –
line breaks are also insignificant) start with:
1F9EE•1FA97•02100•023E6•00006•00006•0206D•0206B•000B4•0299F•02101•1FA79•
0262C•1E944•1E900•1E904•1E907•1E915•1E901•1E90D•1E909•1E90A•1E918•1E91E

…and end with:
0200D•0FEFF•0200C•0200B•02BE2•1F910•022FF•02A64•02989•02987•02A65•02A3E•
0298A•02988•02A1F•02A20•02A21•02981•02982•1F9DF•0200D•0FEFF•0200C•0200B

The procedure for name→value lookup with an input name is:

Perform a binary search in the TT text sequence for a text with
the input name. This results in a text entry index (the index of the
matching text) and an array of tail scalars (which are embedded in the
text). (If a matching text is not present in the text sequence, then
there is no character with the given name.)
Given the text entry index, get its value in the TT scalar array.
This results in the character’s head point.
Combine the head point and tail scalars into a single character. This
this the final result.

The procedure for value→names access with an input value is:

Get the head point and the tail scalars of the input character.
Perform a linear search for the head point in the TT scalar array.
This results in a set of text entry indexes.
For each text-entry text, remove the text from the set if the text’s
tail scalars do not match the input character’s tail scalars.
Use the text-entry texts to create a sorted array of text entries. This
is the final result.

The average time of name→value lookup with valid names has improved by
1100 times compared to the previous version’s, and there is much less
variation in time. This reflects how much more efficient binary search is
compared to sequential scanning. Name→value lookup with invalid names has
similarly improved.
value→names access still requires linear search over the entire database.
However, even value→names access has greatly improved: by 86 times. This might
be due to the fact that scalars are now grouped together in a fixed-length
array. Iterating through such a fixed-length array is probably much faster
than scanning through the long, variable-length entry lines of the previous
version.
We could improve value→names access’s performance even more, but because we
are deprioritizing the time performance of value→names access (see § API and
Goals), we will leave its performance alone for now; it is acceptable to
search the entire database for names matching a character. In § ???, we will
create a separate point table that may be efficiently randomly accessed.
Database 6.0: Text front coding by binary-search parent

Next, we will adopt a similar technique to IBiS^RP+DAC-L-TT,
and we next compress the TT text sequence’s values with front coding.
The text table (because it is sorted lexicographically by fuzzily folded name)
may be considered as a binary search tree, which is defined using the same
binary-search algorithm we have been using when we look up characters from the
text table. There are ⌈log₂(33,578) + 1⌉ (or 15) levels in this
binary tree.
The text table’s root entry is the first entry in the text table that we
check during the binary search: i.e., the median of the binary tree, which is
entry number ⌊(33,578 − 1) ∕ 2⌋ (or #16,789). The two child entries of that
root entry are entries number ⌊(16,789 − 1) ∕ 2⌋ (or #8,394) and number
⌊(16,789 − 1) × 3 ∕ 2⌋ or (#25,182). This recursively goes on: each parent
entry has two child entries. For instance, entry #1 (ACCOUNT OF) is the
parent of two child entries: entry #0 (ABACUS) and entry #2 (ACCORDION).
(Note: We are indexing table entries starting from #0.)
Whenever a child entry in the text table has a text that shares a prefix with
its parent entry’s text, we could front code that shared
prefix.
For instance, entry #0’s text (ABACUS) shares an A with its parent entry’s
text (ACCOUNT OF), so it can be front coded as BACUS (along with the length
of the shared prefix: 1). The shared prefix is omitted from the child entry’s
text; it is implicit in its parent’s text (and the length of the omitted
prefix).
Likewise, entry #2’s text (ACCORDION) shares an ACC with its parent entry’s
text (ACCOUNT OF), so it can be front coded as OUNT OF (along with the
length of the shared prefix: 3).
This is similar but different to IBiS^RP+DAC-L-TT. The latter
algorithm inverts the front-coding relationship: at each step of the binary
search, the minimum entry of the binary search’s current interval is the
“parent”, and the interval’s median entry is its “child”. We are changing this
from the original to accomodate the fact that we need to map string keys to
scalars, rather than to integer positions in the dictionary: this change will
allow us to compress scalars more.)
Now the database weighs 651.3 kB (167.1 kB when compressed with Brotli
v1.0.9). This size is a dramatic improvement (being 184.0% smaller) over the
previous version’s 1,198.2 kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.3 kB
0.0 kB


TT suffix texts (val.s)
ASCII str.
25.7 B
862.1 kB
−578.1 kB


TT suffix texts (sep.s)
Hex array
5.0 B
167.9 kB
0.0 kB


TT prefix length array
Hex array
2.0 B
67.2 kB
+67.2 kB


TT scalar array
Hex array
5.0 B
167.9 kB
0.0 kB


Total

23.2 B
651.3 kB
−510.9 kB


The directory now looks like this:
{
  "nameTableDirectory": {
    "numOfEntries": 33578,
    "textSequencePointer": 0,
    "textSequenceDirectory": {
      "valuesPointer": 0,
      "separationArrayPointer": 247937,
      "separationArrayDirectory": {
        "numOfHexesPerEntry": 5
      },
      "length": 415829
    },
    "namePrefixLengthArrayPointer": 415829,
    "namePrefixLengthArrayDirectory": {
      "numOfHexesPerEntry": 2
    },
    "scalarArrayPointer": 482987,
    "scalarArrayDirectory": {
      "numOfHexesPerEntry": 5
    }
  }
}
␂

The TT text sequence (with • inserted between each text for illustration
purposes – line breaks are also insignificant) formerly started with this:
ABACUS•ACCORDION•ACCOUNT OF•AC CURRENT•ACK:ABBREVIATION•ACKNOWLEDGE:CONTROL•
ACTIVATE ARABIC FORM SHAPING•ACTIVATE SYMMETRIC SWAPPING•ACUTE ACCENT•
ACUTE ANGLE•ADDRESSED TO THE SUBJECT•ADHESIVE BANDAGE•ADI SHAKTI•
ADLAM ALIF LENGTHENER•ADLAM CAPITAL LETTER ALIF•ADLAM CAPITAL LETTER BA•
ADLAM CAPITAL LETTER BHE•ADLAM CAPITAL LETTER CHI•ADLAM CAPITAL LETTER DAALI•
ADLAM CAPITAL LETTER DHA•ADLAM CAPITAL LETTER E•ADLAM CAPITAL LETTER FA•
ADLAM CAPITAL LETTER GA•ADLAM CAPITAL LETTER GBE•ADLAM CAPITAL LETTER HA•
ADLAM CAPITAL LETTER I•ADLAM CAPITAL LETTER JIIM•ADLAM CAPITAL LETTER KAF•
ADLAM CAPITAL LETTER KHA•ADLAM CAPITAL LETTER KPO•ADLAM CAPITAL LETTER LAAM•
ADLAM CAPITAL LETTER MIIM

…but now it is a TT suffix sequence that starts with this (for
instance, BACUS corresponds to ABACUS and CORDION to ACCORDION):
BACUS•CCORDION•COUNT OF• CURRENT•K:ABBREVIATION•NOWLEDGE:CONTROL•
CTIVATE ARABIC FORM SHAPING•TIVATE SYMMETRIC SWAPPING•CUTE ACCENT•
NGLE•DDRESSED TO THE SUBJECT•HESIVE BANDAGE•I SHAKTI•
LAM ALIF LENGTHENER•ALIF•A•
BHE•CHI•DAALI•
DHA•E•FA•
GA•GBE•HA•
I•JIIM•AF•
KHA•PO•CAPITAL LETTER LAAM•
MIIM

The sequence also formerly ended with this:
ZANABAZAR SQUARE VOWEL SIGN REVERSED I•ZANABAZAR SQUARE VOWEL SIGN U•
ZANABAZAR SQUARE VOWEL SIGN UE•ZEBRA FACE•ZERO WIDTH JOINER•
ZERO WIDTH NO-BREAK SPACE•ZERO WIDTH NON-JOINER•ZERO WIDTH SPACE•ZEUS•
ZIPPER-MOUTH FACE•Z NOTATION BAG MEMBERSHIP•Z NOTATION DOMAIN ANTIRESTRICTION•
Z NOTATION LEFT BINDING BRACKET•Z NOTATION LEFT IMAGE BRACKET•
Z NOTATION RANGE ANTIRESTRICTION•Z NOTATION RELATIONAL COMPOSITION•
Z NOTATION RIGHT BINDING BRACKET•Z NOTATION RIGHT IMAGE BRACKET•
Z NOTATION SCHEMA COMPOSITION•Z NOTATION SCHEMA PIPING•
Z NOTATION SCHEMA PROJECTION•Z NOTATION SPOT•Z NOTATION TYPE COLON•
ZOMBIE•ZWJ:ABBREVIATION•ZWNBSP:ABBREVIATION•ZWNJ:ABBREVIATION•
ZWSP:ABBREVIATION

…but now ends like this:
ANABAZAR SQUARE VOWEL SIGN REVERSED I••
UE•EBRA FACE•ERO WIDTH JOINER•
-BREAK SPACE•RO WIDTH NON-JOINER•SPACE•US•
IPPER-MOUTH FACE• NOTATION BAG MEMBERSHIP•DOMAIN ANTIRESTRICTION•
 NOTATION LEFT BINDING BRACKET•LEFT IMAGE BRACKET••
ANGE ANTIRESTRICTION•ELATIONAL COMPOSITION•
RIGHT BINDING BRACKET•RIGHT IMAGE BRACKET•
SCHEMA COMPOSITION•PIPING•
SCHEMA PROJECTION•SPOT• NOTATION TYPE COLON•
OMBIE•WJ:ABBREVIATION•BSP:ABBREVIATION•NJ:ABBREVIATION•
SP:ABBREVIATION

We store the lengths of the shared prefixes in a
a new block to the text table: the TT prefix length array.
This is another array of fixed-length hexes, which encodes
a array of unsigned integers.
Each integer in the TT prefix length array is two hex digits wide:
hence parentPrefixLengthArrayDirectory":{"numOfDigitsPerEntry":2} in the
directory. The block starts with:
01•02•04•02•03•02•02•01•07•01•01•01•02•02•06•15•15•15•15•15•
15•15•15•15•15•15•15•15•16•16•15•06•16•15•16•15•15•15•15•15•
15•15•15•15•15•15•16•15•06•0C•0C•0C•0C•0C•0C•06•0D•06•0C•06

…and ends with:
18•18•19•18•19•19•19•18•1A•19•18•11•19•11•11•11•16•16•11•11•
1C•16•16•16•11•11•17•11•1C•1C•1C•01•1C•1C•1D•01•0B•0B•0D•02•
01•01•01•01•10•0B•0B•0B•11•0C•0B•0B•0C•01•0B•01•02•01•03•02

Fortunately, the time performance of name→value lookup has remained the
same, despite the dramatic 184.0% improvement in storage size.
Unfortunately, the time performance of value→names access has markedly
worsened by about 460%. This is attributable to the fact that we are still
searching the entire database for names matching the input character. Any small
difference in name extraction is multiplied by 33,578. As before, we are
deprioritizing the time performance of value→names access (see § API and
Goals), so we will leave it alone for now. In § ???, we will create a
separate point table that may be efficiently randomly accessed.
Database 7.0: Text front coding by binary-search ancestors

We can save even more space by determining, for each entry in the text table,
which of its ancestor entries (parent entry, the parent of the parent,
etc.) has a text that shares the longest prefix with the given entry’s text.
This means that, in addition to the prefix length array, we also have to store
an ancestor index array that, for each entry in the text table,
stores the position (in the binary-search path) of the ancestor entry
from whose text we are extracting the given entry’s implicit prefix.
Now the database weighs 637.7 kB (164.3 kB when compressed with Brotli
v1.0.9). This size is a mild improvement over the previous version’s 651.3 kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
+0.1 kB


TT suffix texts (val.s)
ASCII str.
5.6 B
862.1 kB
−47.3 kB


TT suffix texts (sep.s)
Hex array
5.0 B
167.9 kB
0.0 kB


TT prefix length array
Hex array
2.0 B
67.2 kB
0.0 kB


TT ancestor index array
Hex array
1.0 B
33.6 kB
+33.6 kB


TT scalar array
Hex array
5.0 B
167.9 kB
0.0 kB


Total

18.6 B
637.7 kB
−13.7 kB


The directory now looks like this:
{
  "nameTableDirectory": {
    "numOfEntries": 33578,
    "textSequencePointer": 0,
    "textSequenceDirectory": {
      "valuesPointer": 0,
      "separationArrayPointer": 200692,
      "separationArrayDirectory": {
        "numOfHexesPerEntry": 5
      },
      "length": 368584
    },
    "namePrefixLengthArrayPointer": 368584,
    "namePrefixLengthArrayDirectory": {
      "numOfHexesPerEntry": 2
    },
    "ancestorPathIndexArrayPointer": 435742,
    "ancestorPathIndexArrayDirectory": {
      "numOfHexesPerEntry": 1
    },
    "scalarArrayPointer": 469322,
    "scalarArrayDirectory": {
      "numOfHexesPerEntry": 5
    }
  }
}
␂

One-stage point table (PT)

Name→value lookup now uses a binary search for efficient random access.
However, value→names access still requires linear search over the entire
text table, which is very expensive in time.
We can spend a little more space to make value→names access more efficient.
Our next step is to add an indexed point table (PT), sorted numerically
by scalar value. The point table will allow us to efficiently perform
value→names random access. (This also will allow us to spend a little more
name→value access time to make the TT scalar array smaller.)
Now the database weighs 969.2 kB (???.? kB when compressed with Brotli
v1.0.9). This size is a mild improvement over the previous version’s 651.3 kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
0.0 kB


TT suffix texts (val.s)
ASCII str.
5.6 B
862.1 kB
0.0 kB


TT suffix texts (sep.s)
Hex array
5.0 B
167.9 kB
0.0 kB


TT prefix length array
Hex array
2.0 B
67.2 kB
0.0 kB


TT ancestor index array
Hex array
1.0 B
33.6 kB
0.0 kB


TT scalar array
Removed
0.0 B
0.0 kB
−167.0 kB


TT crosslink array
Hex array
4.0 B
134.3 kB
+134.3 kB


PT scalar array
Hex array
5.0 B
167.9 kB
+167.9 kB


PT crosslink array
Hex array
5.0 B
167.9 kB
+167.9 kB


Total

28.6 B
935.6 kB
+302.2 kB


The new point table has 33,578 head-point entries, one for each name
entry. However, the scalar entries are numerically ordered by scalar, from the
scalar entry for U+0000 (entry #0) to the scalar entry for U+E01EF (entry
#33,577).
For now, the point table has two column data blocks: a scalar array and a
crosslink array.
The PT scalar array is a series of fixed-length hexes, which encodes a
array of unsigned integers. The integers are scalar values: each indicates the
head point of a text entry in the text table. The same scalar value may occur
multiple times in this array if multiple names share that same head point.
The PT crosslink array is a series of fixed-length hexes, which encodes a
array of unsigned integers. The integers are crosslinks: text entry
indexes in the text table. Each crosslink of each scalar in the point table is
the index number of a text entry that has the head-point entry’s scalar as the
name entry’s head point.
The procedure for value→names access has changed to support efficient
random access.

Get the head point and the tail scalars of the input character.
Perform a binary search for the head point in the PT scalar array. If
there is a matching scalar value, then also check for more matching scalar
values that are contiguous to the first match. This results in a matching
head-point entry index range. (If there is no matching head point value in
the scalar array, then there is no name for the input character, and
the algorithm terminates.)
Given the head-point entry index range, get its values in the PT crosslink
sequence. This results in a set of crosslinks (i.e., text entry
indexes) to the text table.
For each crosslink to the text table, get its text entry’s text value. (In
order to get a value in the TT text sequence, first use the TT ancestor
index array to determine all ancestor text entry indexes, recursively
bottom up. Then use these text entry indexes to get an array of TT prefix
lengths. If any prefix length is 0, then discard all preceding prefix
lengths – and their text entry indexes – because those preceding data are
useless. Finally, use the remaining text entry indexes to get their
corresponding text prefixes, then use the prefixes and their prefix lengths
to reconstruct the entire text-entry text.) This results in a set of
text-entry texts.
For each text-entry text, remove the text from the set if the text’s
tail scalars do not match the input character’s tail scalars.
Use the the text-entry texts to create a sorted array of text entries.
This is the final result.

As a bonus, we can now make an improvement to the text table as well. We have
replaced the TT scalar array with a TT crosslink array: instead of
storing each text table’s head point directly in the text table (which may
theoretically range between 0 and 17,825,792), we store a scalar entry index
in the point table (which would range between 0 and only 33,578).
The procedure for name→value lookup is largely the same as before: a
binary search over the text table with name-prefix front coding. However,
when the binary search finds a matching text-entry text, instead of directly
getting its text entry’s head point, we get a crosslink to the point table.
We then get the head point value from the point table.
Two-stage point table (PT0 and PTΔ)

If we divide the point table into two sub-tables (PT0 and PTΔ), then we can
greatly improve their storage space, without compromising the time performance
of value→names access. Two-stage tables are commonly used to store Unicode
property data, and here we will combine one with delta encoding and
linear search (now limited to at most a small number).
Now the database weighs ???.? kB (164.3 kB when compressed with Brotli
v1.0.9). This size is a mild improvement over the previous version’s 651.3 kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
0.0 kB


TT suffix texts (val.s)
ASCII str.
5.6 B
862.1 kB
0.0 kB


TT suffix texts (sep.s)
Hex array
5.0 B
167.9 kB
0.0 kB


TT prefix length array
Hex array
2.0 B
67.2 kB
0.0 kB


TT ancestor index array
Hex array
1.0 B
33.6 kB
0.0 kB


TT crosslink array
Hex array
4.0 B
134.3 kB
0.0 kB


PT scalar array
Removed
0.0 B
0.0 kB
−167.9 kB


PT crosslink array
Removed
0.0 B
0.0 kB
−167.9 kB


PT0 base-point array
Hex array
?.0 B
???.? kB
+???.? kB


PT0 pointer array
Hex array
?.0 B
???.? kB
+???.? kB


PTΔ point-delta array
Hex array
?.? B
???.? kB
+???.? kB


PTΔ crosslink array
Hex array
?.? B
???.? kB
+???.? kB


Total

??.? B
???.? kB
−???.? kB


Both scalar sub-tables are still sorted numerically by scalar value. However,
they differ in their cardinality. The base-point table (PT0) has relatively
few, sparse entries – their number is partially determined by an adjustable
scalar-block size value. In contrast, the point-delta table (PTΔ)
still has 33,578 entries, one for each text entry – from an entry corresponding
to U+0000 (point-delta entry #0) – to an entry corresponding to U+E01EF
(point-delta entry #33,577).
The base-point table (PT0) starts with two column data blocks: a base-point
array and a pointer array.
The PT0 base-point array is a series of fixed-length hexes, which encodes
a array of unsigned integers. The integers are scalar values: each indicates
the head point of a text entry in the text table. There are never any duplicate
base scalars in this array. It is generated through the following algorithm:

Sort all text entries’ head points numerically.
For each head point, if the difference between this head point and the
previous head point exceeds the adjustable scalar-block size value, then
this head point is added to the base-point array.

The PT0 pointer array is a series of fixed-length hexes, which encodes a
array of unsigned integers. The integers are pointers to PTΔ entries: each
indicates the index number of a point-delta entry in the PTΔ table associated
with the PT0 entry’s base scalar.
The PTΔ point-delta array is a series of fixed-length hexes, which encodes
a array of unsigned integers. The integers are point-delta values: each
indicates a numeric difference – between the head point of its text entry in
the text table – and the previous PTΔ entry’s scalar (or, if the previous PTΔ
entry is associated with an PT0 entry, that PT0 entry’s base scalar). Because
of this, the PTΔ point-delta array is divided into scalar blocks by the
base-point table (PT0) – one block for each head-point entry.
The PTΔ crosslink array is a series of fixed-length hexes, which encodes a
array of unsigned integers. The integers are crosslinks: text entry
indexes in the text table. Each crosslink of each scalar in the point table is
the index number of a text entry that has the head-point entry’s scalar as the
name entry’s head point.
The procedure for value→names access has changed further:

Get the head point and the tail scalars of the input character.
Perform a binary search for the largest head point in the PT0
base-point array that is less than the head point. This results in a
base scalar and a PT0 entry index.
Use the PT0 entry index (and the PT0 entry index plus 1) to get start (and
end) PTΔ entry indexes from the PT0 pointer array.
Starting at the start PTΔ entry index, up until the end PTΔ entry index,
perform a linear search for the head point in the PTΔ point-delta
array, incrementing the PT0 base scalar with each scalar delta. Because
the PTΔ point-delta array is sorted by scalar value, any multiple matching
PTΔ entries will be contiguous, and the linear search results in a
matching head-point entry index range. (If the linear search reaches
a scalar value that is greater than the head point – or if the search
reaches the end PTΔ entry index – without finding a matching head point
value in the scalar array, then there is no name for the input
character, and the algorithm terminates.)
Given the head-point entry index range, get its values in the PTΔ crosslink
sequence. This results in a set of crosslinks (i.e., text entry
indexes) to the text table.
For each crosslink, get its text entry’s text value (using the algorithm
defined in the previous version). As before, this results in a set of name
entry texts.
For each text-entry text, remove the text from the set if the text’s
tail scalars do not match the input character’s tail scalars.
Use the the text-entry texts to create a sorted array of text entries.
This is the final result.

As for the text table, the TT crosslink array now stores an PT0 entry
index in PT0 (which now ranges between 0 and only ???). Each text entry’s
crosslink to PT0 indicates that its head point is somewhere within the
corresponding PTΔ range, as defined by the base-point table (PT0)’s crosslinks
to the point-delta table (PTΔ).
The procedure for name→value lookup is similar to the previous version,
except that it has become more indirect:

Perform a binary search in the TT text sequence for a text with the input
name. This results in a text entry index (the index of the matching
text) and an array of tail scalars (which are embedded in the text).
Given the text entry index, get its value in the TT crosslink array.
This results in the character’s PT0 entry index.
Use the PT0 entry index (and the PT0 entry index plus 1) to get start (and
end) PTΔ entry indexes from the PT0 pointer array.
Linearly search in the PTΔ crosslink array, within the
start and end PTΔ entry indexes, for the matching text entry index. This is
guaranteed to find a matching PTΔ entry index.
Given the matching PTΔ entry index, get all the PTΔ scalar deltas from the
start PTΔ entry index up to the matching PTΔ entry index. This results in an
array of PTΔ scalar deltas.
The base scalar from the PT0 base-point array with the PT0 entry index,
plus the summed array of PTΔ scalar deltas, is the head point.
Combine the head point and tail scalars into a single character. This
this the final result.

Binary buffer

By switching to using a binary buffer instead of a string as our database, we
will be able to encode the database’s integers as 8-bit integers instead of
hexes. This will generally cut the space of every integer array by somewhere
between one fourth and one half. JSON and text values will continue to be
encoded in ASCII/UTF-8.
After this switch, we will also eventually be able to encode bit arrays.
Now the database weighs ???.? kB (???.? kB when compressed with Brotli
v1.0.9). This size is a ??? improvement over the previous version’s ???.? kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
0.0 kB


TT suffix texts (val.s)
ASCII str.
5.6 B
862.1 kB
0.0 kB


TT suffix texts (sep.s)
Uint32 arr.
4.0 B
132.3 kB
−33.6 kB


TT prefix length array
Uint8 arr.
1.0 B
33.6 kB
−33.6 kB


TT ancestor index array
Uint8 arr.
1.0 B
33.6 kB
0.0 kB


TT crosslink array
Uint16 arr.
2.0 B
67.2 kB
−67.2 kB


PT0 base-point array
Uint32 arr.
?.0 B
???.? kB
−???.? kB


PT0 pointer array
Uint32 arr.
?.0 B
???.? kB
−???.? kB


PTΔ point-delta array
Uint8 arr.
?.? B
???.? kB
−???.? kB


PTΔ crosslink array
Uint32 arr.
?.? B
???.? kB
−???.? kB


Total

??.? B
???.? kB
−???.? kB


Separation bit arrays

Our database now has a variable-length sequence: the TT suffix sequence. It
contains a values block and a separations block; the latter stores the
locations between each value in the values blocks. The separations block has
been encoded as a fixed-length hex arrays, where each hex is a position within
the values block.
We can change the separations block to be a bit array instead. Each bit in the
array represents the presence or absence of a separation at each consecutive
position within the values block. In other words, each 1-bit in the separations
block’s sea of 0-bits indicates the start of a new value at that position in
the values block.
Now the database weighs ???.? kB (???.? kB when compressed with Brotli
v1.0.9). This size is a ??? improvement over the previous version’s ???.? kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
0.0 kB


TT suffix texts (val.s)
ASCII str.
5.6 B
862.1 kB
0.0 kB


TT suffix texts (sep.s)
Bit arr.
0.2 B
7.5 kB
−124.8 kB


TT prefix length array
Uint8 arr.
1.0 B
33.6 kB
0.0 kB


TT ancestor index array
Uint8 arr.
1.0 B
33.6 kB
0.0 kB


TT crosslink array
Uint16 arr.
2.0 B
67.2 kB
0.0 kB


PT0 base-point array
Uint32 arr.
?.0 B
???.? kB
0.0 kB


PT0 pointer array
Uint32 arr.
?.0 B
???.? kB
0.0 kB


PTΔ point-delta array
Uint8 arr.
?.? B
???.? kB
0.0 kB


PTΔ crosslink array
Uint32 arr.
?.? B
???.? kB
0.0 kB


Total

??.? B
???.? kB
−124.8 kB


In order to retrieve the value corresponding to text entry #n from
the TT suffix sequence:


When n = 0, we must retrieve the position of the 0th separation,
in the sequence’s separations block. In other words, we must find the
position of the 0th 1-bit in the sequence’s separations block. The value of
entry
#n is the data in the values block between 0 and the position of
#the
0th 1-bit.


When n > 0, we must retrieve the position of the
(n−1)th separation – and the nth separation – in the
sequence’s separations block. In other words, we must find the positions of
the (n−1)th 1-bit and the nth 1-bit in the sequence’s
separations block. The value of entry
#n is the data in the values block between the positions of the
(n−1)th and nth 1-bits.


The operation of “the nth 1-bit” in a bit array is known as the
select operation on bit arrays. We start with a naïve implementation of the
select operation, which totally recalculates every time, without using any
cached results.
When a text is empty (i.e., its parent text completely contains the given text,
and the given text is completely a prefix of its parent text), then it is
replaced by an arbitrary single-character string, in order to prevent positions
returned by select from overlapping.
Bit select caches

The TT suffix sequence now uses a bit array to store its separations,
using the select operation to locate the position of each separation.
The select operation currently uses a naïve implementation that totally
recalculates the result from scratch, starting from the very beginning of the
bit array, every time. This would become much faster if we could precalculate
and cache samples of select results.
We use a similar approach discussed in Zhang et al. (2013) and Zhang et
al. (2018). For each variable-length sequence, we create a select cache:
a fixed-length hex array. We precompute select results for sample queries at a
certain frequency. If the frequency is 3, then the 0th, 3rd, 6th, etc. select
results are stored in the select cache.
Now the database weighs ???.? kB (???.? kB when compressed with Brotli
v1.0.9). The size has slightly increased compared to the previous version’s
???.? kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
0.0 kB


TT suffix texts (val.s)
ASCII str.
5.6 B
862.1 kB
0.0 kB


TT suffix texts (sep.s)
Bit arr.
0.2 B
7.5 kB
0.0 kB


TT suffix texts (sel.s)
Uint32 arr.
?.0 B
??.? kB
+??.? kB


TT prefix length array
Uint8 arr.
1.0 B
33.6 kB
0.0 kB


TT ancestor index array
Uint8 arr.
1.0 B
33.6 kB
0.0 kB


TT crosslink array
Uint16 arr.
2.0 B
67.2 kB
0.0 kB


PT0 base-point array
Uint32 arr.
?.0 B
???.? kB
0.0 kB


PT0 pointer array
Uint32 arr.
?.0 B
???.? kB
0.0 kB


PTΔ point-delta array
Uint8 arr.
?.? B
???.? kB
0.0 kB


PTΔ crosslink array
Uint32 arr.
?.? B
???.? kB
0.0 kB


Total

??.? B
???.? kB
−72.0 kB


VByte

Two database blocks now consist of sequences of integers that we must access
only sequentially, rather than directly and randomly. These are the TT
suffix texts’ values (i.e., runs of ASCII strings) and the PTΔ point-delta
array. We can compress these integer sequences with VByte. VByte has been
described as such by Klein and Shapira (2016):

[In VByte coding], the codewords represent integers. The VByte code splits
the floor(log2(x_i)) + 1 bits needed to represent an integer x_i in its
standard binary form into blocks of b bits and prepends each block with a
flag-bit as follows. The highest bit is 0 in the extended block holding the
most significant bits of x_i, and 1 in the others. Thus, the 0-bit acts as
a comma between codewords. For example, if b = 3, and x_i = 25, the
standard binary representation of x_i, 11001, is split into two blocks, and
after adding the flags to each block, the codeword is 0011 1001. In the worst
case, the VByte code loses one bit per b bits of x_i plus b bits for an
almost empty leading block, which is worse than delta-Elias encoding.

We considered the variation Stream VByte, as described in Lemire, Kurz, and
Rupp (2017). Stream VByte is faster to decode than VByte. However, Stream
VByte uses slightly more storage than VByte proper; in addition, Stream VByte’s
time performance depends on hardware-accelerated SIMD, which is not available
in JavaScript.
Now the database weighs ???.? kB (???.? kB when compressed with Brotli
v1.0.9). This size is a large improvement over the previous version’s ???.? kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
0.0 kB


TT suffix texts (val.s)
VByte str.
?.? B
???.? kB
−??.? kB


TT suffix texts (sep.s)
Bit arr.
0.2 B
7.5 kB
0.0 kB


TT suffix texts (sel.s)
Uint32 arr.
?.0 B
??.? kB
0.0 kB


TT prefix length array
Uint8 arr.
1.0 B
33.6 kB
0.0 kB


TT ancestor index array
Uint8 arr.
1.0 B
33.6 kB
0.0 kB


TT crosslink array
Uint16 arr.
2.0 B
67.2 kB
0.0 kB


PT0 base-point array
Uint32 arr.
?.0 B
???.? kB
0.0 kB


PT0 pointer array
Uint32 arr.
?.0 B
???.? kB
0.0 kB


PTΔ point-delta array
VByte str.
?.? B
???.? kB
0.0 kB


PTΔ crosslink array
Uint32 arr.
?.? B
???.? kB
0.0 kB


Total

??.? B
???.? kB
−??.? kB


DAC integers

We will next compress our integer arrays with directly accessible codes (DACs).
DACs were first described in Brisaboa, Ladra, and Navarro (2012) as a
way to compress integer arrays, while maintaining random access to their
integer values by index numbers. The integer array is encoded using a modified
version of VByte, along with a “bit rank cache”.

In another work, Brisaboa et al. introduced directly accessible codes
(DACs) by integrating rank dictionaries into byte aligned codes.
DACs can be regarded as a reorganization of the bits of VByte, plus extra
space for the rank structures, that enables direct access to it. First, all
the least significant blocks of all codewords are concatenated, then the
second least significant blocks of all codewords having at least two blocks,
and so on. Then the rank data structure is applied on the comma bits for
attaining log2(M) / b processing time, where M is the maximum integer to
be encoded.

Now the database weighs ???.? kB (???.? kB when compressed with Brotli
v1.0.9). This size is a large improvement over the previous version’s ???.? kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
0.0 kB


TT suffix texts (val.s)
VByte str.
?.? B
???.? kB
0.0 kB


TT suffix texts (sep.s)
Bit arr.
0.2 B
7.5 kB
0.0 kB


TT suffix texts (sel.s)
Uint32 arr.
?.0 B
??.? kB
0.0 kB


TT prefix length array
DAC int.s
?.0 B
??.? kB
−??.? kB


TT ancestor index array
DAC int.s
?.0 B
??.? kB
−??.? kB


TT crosslink array
DAC int.s
?.0 B
??.? kB
−??.? kB


PT0 base-point array
DAC int.s
?.0 B
???.? kB
−??.? kB


PT0 pointer array
DAC int.s
?.0 B
???.? kB
−??.? kB


PTΔ point-delta array
VByte str.
?.? B
???.? kB
0.0 kB


PTΔ crosslink array
DAC int.s
?.? B
???.? kB
−??.? kB


Total

??.? B
???.? kB
−??.? kB


Crosslink deltas

Characters whose head points are near each other tend to also have similar
names. There are large blocks of characters that not only have contiguous head points but are also all named with identical prefixes. Because of this, more
than 80% of crosslinks between TT and PTΔ are nearly adjacent to their
neighbors’ crosslinks.
In other words, when text entries are sorted by name, their head points are
usually within only 500 away from their neighboring text entries’ head points.
Likewise, when head-point entries are sorted by head point, their crosslinks
to the text table are usually within only 500 away from their neighboring
head-point entries’ crosslinks. (In fact, more than 40% of crosslinks are
within only 3 away from their neighbors.)
This means that, if we use delta encoding on both TT crosslinks and PT
crosslinks, the large majority of crosslink data will become much smaller (in
absolute value, since roughly half of the deltas would be negative).
The worst-case range between the minimum and maximum crosslink values
(for both TT and PT) would be the number of entries in the TT and PT (65,534).
However, because DACs are biased to compress small integers in smaller space,
and because most crosslink deltas will be very small (mostly between 0 and 3 in
absolute value), much space will still be saved.
DACs can only store unsigned integers, so we store each delta’s
positive/negative sign in separate bit arrays.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
0.0 kB


TT suffix texts (val.s)
VByte str.
?.? B
???.? kB
0.0 kB


TT suffix texts (sep.s)
Bit arr.
0.2 B
7.5 kB
0.0 kB


TT suffix texts (sel.s)
Uint32 arr.
?.0 B
??.? kB
0.0 kB


TT prefix length array
DAC int.s
?.0 B
??.? kB
0.0 kB


TT ancestor index array
DAC int.s
?.0 B
??.? kB
0.0 kB


TT crosslink Δ array
DAC int.s
?.0 B
??.? kB
−??.? kB


TT crosslink Δ-sign array
Bit arr.
?.0 B
??.? kB
+??.? kB


PT0 base-point array
DAC int.s
?.0 B
???.? kB
0.0 kB


PT0 pointer array
DAC int.s
?.0 B
???.? kB
0.0 kB


PTΔ point-delta array
VByte str.
?.? B
???.? kB
0.0 kB


PTΔ crosslink Δ array
DAC int.s
?.? B
???.? kB
−??.? kB


PTΔ crosslink Δ-sign array
Bit arr.
?.? B
???.? kB
+??.? kB


Total

??.? B
???.? kB
−??.? kB


Text Re-Pair compression

We will next squeeze the texts further using Re-Pair compression.
Afterwards, the database weighs ???.? kB (???.? kB when compressed with
Brotli v1.0.9). This size is a large improvement over the previous version’s
???.? kB.


Block
Type
Per Range
Total
Change


Directory
JSON obj.
0.0 B
0.4 kB
0.0 kB


TT text grammar
Uint16 arr.
?.? B
35.6 kB
+35.6 kB


TT suffix texts (val.s)
VByte str.
?.? B
???.? kB
−???.? kB


TT suffix texts (sep.s)
Bit arr.
?.? B
?.? kB
−??.? kB


TT suffix texts (sel.s)
Uint32 arr.
?.? B
??.? kB
−??.? kB


TT prefix length array
DAC int.s
?.0 B
??.? kB
0.0 kB


TT ancestor index array
DAC int.s
?.0 B
??.? kB
0.0 kB


TT crosslink Δ array
DAC int.s
?.0 B
??.? kB
0.0 kB


TT crosslink Δ-sign array
Bit arr.
?.0 B
??.? kB
0.0 kB


PT0 base-point array
DAC int.s
?.0 B
???.? kB
0.0 kB


PT0 pointer array
DAC int.s
?.0 B
???.? kB
0.0 kB


PTΔ point-delta array
VByte str.
?.? B
???.? kB
0.0 kB


PTΔ crosslink Δ array
DAC int.s
?.? B
???.? kB
−??.? kB


PTΔ crosslink Δ-sign array
Bit arr.
?.? B
???.? kB
0.0 kB


Total

??.? B
???.? kB
0.0 kB


Re-Pair repeatedly compresses strings by (1) finding their most frequent
character pair, (2) replacing all instance of that pair with a new, previously
unused character, and (3) recording that substitution as a rule in a separate
grammar data structure. In other words:

Let there be a collection of input texts and an empty grammar (a array of
bigrams).
Also let ‹getSubstitutionCharacter(𝑘)› be a function that, given an
integer 𝑘, returns a corresponding unique character that is not already in
any of the input texts.
Repeat the following steps:

Let numOfSubstitutions be the number of prior substitutions (which
would start at 0).
Let ‹character0›‹character1› be the most frequent bigram in all of the
texts. (If there are multiple bigrams that are tied for being most
frequent, then the particular choice among these bigrams is arbitrary.
Our implementation happens to choose the lexicographically first one.)
Let ‹substitutionCharacter› be
‹getSubstitutionCharacter(numOfSubstitutions)›. This is a new
character that does not already occur in any of the texts.
Repeatedly replace instances of the bigram ‹character0›‹character1›
with the single ‹substitutionCharacter›, in each of the texts, from
their starts to finishes.
Add the bigram ‹character0›‹character1› to the grammar.
If all character pairs in the texts now each occur only once, then exit
this loop.


The end result is (1) the compressed texts after their accumulated
substitutions and (2) the grammar with its accumulated rules.

To decode the compressed texts, the substitution characters are repeatedly
replaced by their respective character pairs in the grammar, until there are no
remaining substitutions to perform.
For example, let us suppose that getSubstitutionCharacter(𝑘) for 𝑘 = 0, 1,
2, and 3 is ①, ②, ③, and ④ respectively. Then applying Re-Pair to the
following collection of texts:

XYZZZ
XYZZ
XXXZ
YYYZ
ZZZZ

…would result in these texts:

③⑤
③Z
②X
④Y
⑤

…and this grammar:

ZZ (for ①)
XX (for ②)
XY (for ③)
YY (for ④)
①Z (for ⑤)

There are 917 character pairs that occur more than once among the texts in the
previous version. These pairs occur 209,041 times in total. The top ten
character pairs are:


Bigram
Occurrences


ER
2,813


AR
2,762


AL
2,725


IN
2,725


LE
2,646


ON
2,624


TE
2,342


E 
2,264


 S
2,250


TH
2,250


Running Re-Pair takes approximately seven minutes. It compresses the texts down
to 48,561 characters, which weigh ???.? kB (at ?.? bytes per scalar) when
encoded with VByte. The grammar has 8,907 substituted bigrams, each made of two
scalars, which altogether weigh ??.? kB (at ?.? bytes per scalar).
We could have adopted either of the modified Re-Pair algorithms described by
Bille, Gørtz, and Prezza (2017) and by Sakai et al. (2018); their
versions are more memory efficient during compression. However, the input
Unicode Character Database data are static and relatively small, compared to
the large genomes and other big-data sets that those articles study. While we
care about decompression time and memory, we do not care about compression time
and memory. We therefore do not need to save memory during compression, so we
eschew these more complicated compression algorithms.
We also could have adopted the RL-MR-RePair algorithm described by Isamu
Furuya (2019), but we are trying to stay simple for now.
Signature and checksum

We have greatly optimized the database’s memory, coming near its
information-theoretic lower bound, while also improving bidirectional
random-access time. Lastly, we will add a signature block and checksum block to
the database file.
File signature

The beginning of the file identifies it as a UniRangebase.
It consists of the byte 98 (in hexadecimal) to avoid confusion with
UTF-8/ASCII text files, followed by UNINAME ? ?.?  in
UTF-8/ASCII, ending in a space. The ?.?.? indicates the version of the file
format; the middle digit indicates the major Unicode version.
File checksum

A 128-bit MD5 hash of all preceding bytes in the file.
Summary table

The “N → V” and “V → Ns” columns have median times (50%-percentile values)
after 10,000 repetitions run on Unicode 14.0 data, by Node v18.3.0, on a
MacBook Air (M1, 2020) with 16 GB of memory. “N → V” indicates looking up
characters by “good” Unicode names, and “V → Ns” indicates accessing names by
“good” named Unicode values.


Version
Brotli
Heap
N → V
V → Ns


1. JSON array
160.6 kB
≈2,100 kB
25,122 µs
11,010 µs


2. Extended name ranges
‒‒‒.‒ kB
‒,‒‒‒.‒ kB
‒‒,‒‒‒ µs
‒‒,‒‒‒ µs


3. Text lines
147.8 kB
1,075.6 kB
36,236 µs
35,818 µs


4. Scalar deltas
104.6 kB
962.9 kB
34,033 µs
22,757 µs


5. Text table
189.9 kB
1,198.2 kB
21 µs
2,969 µs


6. Parent prefixes
167.1 kB
651.3 kB
22 µs
13,529 µs


7. Ancestor prefixes
164.3 kB
637.7 kB
22 µs
13,006 µs


8. One-stage point table
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs


9. Two-stage point table
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs


10. Binary buffer
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs


11. Sep. bit arrays
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs


12. Bit select caches
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs


13. VByte integers
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs


14. DAC integers
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs


15. Crosslink deltas
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs


16. Text Re-Pair
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs


17. Signature & checksum
‒‒.‒ kB
‒‒‒.‒ kB
‒‒ µs
‒‒ µs
Operation	2%	25%	50%	75%	98%
Valid N → V	10,243 µs	20,020 µs	25,122 µs	26,467 µs	32,408 µs
Invalid N → no V	25,973 µs	28,988 µs	29,211 µs	31,090 µs	33,781 µs
Named V → N	9,849 µs	10,918 µs	11,010 µs	12,164 µs	14,123 µs
Unnamed V → no N	10,839 µs	12,194 µs	12,576 µs	13,688 µs	16,892 µs
`DIGIT ZERO … DIGIT NINE`	`DIGIT :RANGE:NUMWORD:10`
`LATIN CAPITAL LETTER A LATIN CAPITAL LETTER B … LATIN CAPITAL LETTER Z`	`LATIN CAPITAL LETTER :RANGE:LETTER:1A`
`SINGLE SHIFT TWO SINGLE SHIFT THREE`	`SINGLE SHIFT :RANGE:NUMWORD:2:2`
`TAMIL NUMBER ONE HUNDRED TAMIL NUMBER ONE THOUSAND`	`TAMIL NUMBER :RANGE:NUMWORDHUNDRED TAMIL NUMBER :RANGE:NUMWORDTHOUSAND`
`ETHIOPIC NUMBER TEN ETHIOPIC NUMBER TWENTY … ETHIOPIC NUMBER NINETY ETHIOPIC NUMBER HUNDRED`	`ETHIOPIC NUMBER :RANGE:NUMWORD:9:A:A ETHIOPIC NUMBER HUNDRED`
`CIRCLED NUMBER TWENTY ONE CIRCLED NUMBER TWENTY TWO … CIRCLED NUMBER TWENTY NINE CIRCLED NUMBER THIRTY CIRCLED NUMBER THIRTY ONE … CIRCLED NUMBER THIRTY NINE`	`CIRCLED NUMBER :RANGE:NUMWORDWITHSPACES:21:19`
`DOMINO TILE HORIZONTAL-00-00 DOMINO TILE HORIZONTAL-00-01 … DOMINO TILE HORIZONTAL-00-06 DOMINO TILE HORIZONTAL-01-00 … DOMINO TILE HORIZONTAL-01-06`	`DOMINO TILE HORIZONTAL-00-:RANGE:DIGITSPADDED2:7 DOMINO TILE HORIZONTAL-01-:RANGE:DIGITSPADDED2:7`
`LINEAR A SIGN AB001 LINEAR A SIGN AB002 … LINEAR A SIGN AB021 LINEAR A SIGN AB021F LINEAR A SIGN AB021M LINEAR A SIGN AB022`	`LINEAR A SIGN AB:RANGE:DIGITSPADDED3:16:1 LINEAR A SIGN AB021F LINEAR A SIGN AB021M LINEAR A SIGN AB022`
`IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ONE IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY TWO … IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY THIRTY-ONE`	`IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY :RANGE:NUMWORD:1F:1`
`VARIATION SELECTOR-1 … VARIATION SELECTOR-16`	`VARIATION SELECTOR-:RANGE:DIGITS:10:1`
`AEGEAN NUMBER ONE AEGEAN NUMBER TWO … AEGEAN NUMBER TEN AEGEAN NUMBER TWENTY AEGEAN NUMBER THIRTY … AEGEAN NUMBER NINETY AEGEAN NUMBER ONE HUNDRED AEGEAN NUMBER TWO HUNDRED … AEGEAN NUMBER NINE HUNDRED AEGEAN NUMBER ONE THOUSAND AEGEAN NUMBER TWO THOUSAND … AEGEAN NUMBER TEN THOUSAND AEGEAN NUMBER TWENTY THOUSAND AEGEAN NUMBER THIRTY THOUSAND … AEGEAN NUMBER NINETY THOUSAND`	`AEGEAN NUMBER :RANGE:NUMWORD:A:1 AEGEAN NUMBER :RANGE:NUMWORD:8:14:10 AEGEAN NUMBER :RANGE:NUMWORDHUNDRED:9:1 AEGEAN NUMBER :RANGE:NUMWORDTHOUSAND:A:1 AEGEAN NUMBER :RANGE:NUMWORDTHOUSAND:8:B`
`NUSHU CHARACTER-1B170 NUSHU CHARACTER-1B171 … NUSHU CHARACTER-1B2FB`	`NUSHU CHARACTER-:RANGE:HEX:1B170:18B`
`WHITE CHESS KING ROTATED NINETY DEGREES`	`WHITE CHESS KING ROTATED NINETY DEGREES`
`VARIATION SELECTOR-17 VARIATION SELECTOR-18 … VARIATION SELECTOR-256`	`VARIATION SELECTOR-:RANGE:DIGITS:EF:11`
`VS1:ABBREVIATION VS2:ABBREVIATION … VS16:ABBREVIATION VS17:ABBREVIATION VS18:ABBREVIATION … VS256:ABBREVIATION`	`VS:RANGE:DIGIT:10:::ABBREVIATION VS:RANGE:DIGIT:EF:11::ABBREVIATION`
`KEYCAP DIGIT ZERO:SEQUENCE:FE0F:20E3 KEYCAP DIGIT ONE:SEQUENCE:FE0F:20E3 … KEYCAP DIGIT NINE:SEQUENCE:FE0F:20E3`	`KEYCAP DIGIT :RANGE:NUMWORD:9:::SEQUENCE:FE0F:20E3`
Block	Type	Range	Total
Directory	JSON obj.	0.0 B	0.3 kB
TT texts (values)	ASCII str.	25.7 B	862.1 kB
TT texts (separations)	Hex array	5.0 B	167.9 kB
TT scalar array	Hex array	5.0 B	167.9 kB
Total		35.7 B	1,198.2 kB
Bigram	Occurrences
`ER`	2,813
`AR`	2,762
`AL`	2,725
`IN`	2,725
`LE`	2,646
`ON`	2,624
`TE`	2,342
`E`	2,264
`S`	2,250
`TH`	2,250
Version	Brotli	Heap	N → V	V → Ns
1. JSON array	160.6 kB	≈2,100 kB	25,122 µs	11,010 µs
2. Extended name ranges	‒‒‒.‒ kB	‒,‒‒‒.‒ kB	‒‒,‒‒‒ µs	‒‒,‒‒‒ µs
3. Text lines	147.8 kB	1,075.6 kB	36,236 µs	35,818 µs
4. Scalar deltas	104.6 kB	962.9 kB	34,033 µs	22,757 µs
5. Text table	189.9 kB	1,198.2 kB	21 µs	2,969 µs
6. Parent prefixes	167.1 kB	651.3 kB	22 µs	13,529 µs
7. Ancestor prefixes	164.3 kB	637.7 kB	22 µs	13,006 µs
8. One-stage point table	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs
9. Two-stage point table	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs
10. Binary buffer	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs
11. Sep. bit arrays	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs
12. Bit select caches	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs
13. VByte integers	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs
14. DAC integers	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs
15. Crosslink deltas	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs
16. Text Re-Pair	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs
17. Signature & checksum	‒‒.‒ kB	‒‒‒.‒ kB	‒‒ µs	‒‒ µs