@laughinghan
Last active September 13, 2022 19:01

Text and Unicode in Blossom Notebook

The Absolute Bare Minimum You Need To Know

Consider the text:

Sent in my résumé! 😮‍💨

Computers represent that as zeroes and ones:

S        e        n        t                 i        n                 m        y                 r        é                 s        u        m        é                 !        😮‍💨
01010011 01100101 01101110 01110100 00100000 01101001 01101110 00100000 01101101 01111001 00100000 01110010 11000011 10101001 01110011 01110101 01101101 11000011 10101001 00100001 00100000 11110000 10011111 10011000 10101110 11100010 10000000 10001101 11110000 10011111 10010010 10101000

(Note that the space character is 00100000.)

Each zero-or-one is called a bit, or binary digit.

Each group of 8 bits is called a byte. A byte can be written as eight binary (0 or 1) digits or as a decimal number from 0 to 255, which takes up to three decimal (0-9) digits:

S   e   n   t      i   n      m   y      r     é     s   u   m     é    !      😮‍💨
83 101 110 116 32 105 110 32 109 121 32 114 195 169 115 117 109 195 169 33 32 240 159 152 174 226 128 141 240 159 146 168

(Note that the space character is 32.)

Some characters are only one byte, such as S, !, or (space). Some characters are many bytes: é is 2 bytes (195 followed by 169), 😮‍💨 is 11 bytes.
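
As a rough illustration in Python (not Blossom Notebook code), the byte values above can be reproduced with the built-in `str.encode`. The variable names are made up for this sketch, and the é is written as the single code point U+00E9 so that the output matches the listing above.

```python
# The example text, with non-ASCII characters written as escapes so the
# exact code points are unambiguous: é is U+00E9, and the emoji is
# U+1F62E, U+200D (zero-width joiner), U+1F4A8.
text = "Sent in my r\u00e9sum\u00e9! \U0001F62E\u200D\U0001F4A8"

encoded = text.encode("utf-8")

# Each byte as a decimal number, matching the listing above.
decimal_bytes = list(encoded)
print(decimal_bytes)

# Each byte as eight binary digits.
binary_bytes = [format(b, "08b") for b in encoded]
print(" ".join(binary_bytes))

# Bytes per character for a few of the characters discussed above.
e_acute_len = len("\u00e9".encode("utf-8"))                        # 2 bytes
emoji_len = len("\U0001F62E\u200D\U0001F4A8".encode("utf-8"))      # 11 bytes
print(e_acute_len, emoji_len)
```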

Unicode is a standard that defines how to encode any text in any human language into bytes.

Unicode actually defines a few different variations of such encodings, but most systems today (including Blossom Notebook) use UTF-8, to the point where "UTF-8" is often used interchangeably with "Unicode".

There are some complexities in Unicode that are present for historical reasons that Blossom Notebook simplifies away; Blossom Notebook text actually only allows a subset of the text that is allowed in Unicode UTF-8 text. Sometimes the same character can be encoded in multiple ways; Blossom Notebook picks just one encoding. There are Unicode encodings for some things that can't really be understood as characters in text (byte order mark, control characters, etc), and some things that aren't even valid Unicode characters but are allowed by many systems anyway for historical reasons (unpaired surrogate code points); Blossom Notebook removes these non-characters.
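
To make "the same character can be encoded in multiple ways" concrete, here is a small Python sketch using the standard library's `unicodedata.normalize`. It only demonstrates plain Unicode NFC, which is one ingredient of what is described here; it does not attempt to reproduce Blossom Notebook's own, more aggressive rules, and the variable names are illustrative.

```python
import unicodedata

# Two different code point sequences that both display as "é".
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# They are different strings and encode to different UTF-8 bytes...
precomposed_bytes = list(precomposed.encode("utf-8"))   # [195, 169]
decomposed_bytes = list(decomposed.encode("utf-8"))     # [101, 204, 129]
print(precomposed == decomposed, precomposed_bytes, decomposed_bytes)

# ...but NFC normalization maps both to the same single encoding.
nfc_precomposed = unicodedata.normalize("NFC", precomposed)
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
print(nfc_precomposed == nfc_decomposed)
```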

This means that when text data is brought into Blossom Notebook from other systems (such as copying from another system and pasting into Blossom Notebook), that data may be altered by TODO to become valid Blossom Notebook text. The exact bytes of the original data are still available at TODO, and can be operated on with TODO Unicode libraries.

Due to the complexity of human language, Unicode is still being refined and updated over time. Even seemingly simple things like the definition of a character have changed over time, which means that the exact number of characters in a piece of text depends on which version of Unicode you use. (By contrast, the bytes that the text is encoded into will never change, so the byte count, .byte_length(), will never change. [TODO is this true?? What if we decide two different code point sequences are actually the same/should be normalized the same?]) Because of this, text functions are not built-ins; instead, they are imported from an explicitly versioned library. When Blossom Notebook updates how text functions work as Unicode is updated, it adds newer versions of the library, so code using existing versions won't change in behavior.

Other Useful Terms To Know

  • Characters that are only 1 byte in UTF-8 Unicode are called ASCII characters. (ASCII was an older encoding where every character was exactly 1 byte, but only English letters and punctuation were supported.) A lot of legacy code assumes that every character is exactly 1 byte. As a result, such code only works correctly with text that contains only ASCII characters, and it fails when it encounters non-ASCII text, such as non-English text or emoji.
  • Besides UTF-8, the Unicode standard also defines a legacy encoding called UTF-16 where every character is at least 2 bytes. (When Unicode started, its designers hoped every character would be exactly 2 bytes, but that turned out not to be enough.) Many major programming languages from the '90s like JavaScript, Java, and Python 2 (but not Python 3) encode text as UTF-16 and then assume that every character is exactly 2 bytes, so the single character 🥰 (which is 4 bytes in UTF-16) is counted as 2 "characters" in such languages. If you are interoperating with those languages and need to know how many characters those languages will think a piece of text has, you can use .utf16_code_units_count().
  • In order to not have to, for every character in every language, separately define how to encode the character into bytes for UTF-8, UTF-16, and the other Unicode encodings, the Unicode standard defines:
    • an abstract concept called code points, each of which is a number ranging from 0 to 1,114,111 (or in hexadecimal, 0 to 0x10ffff)
    • an encoding of every character in every language into a sequence of one or more code points
    • and various encodings of code points into bytes, like UTF-8 and UTF-16.
  • The encodings of code points into bytes are fairly simple and, most usefully, are fixed and will never change. How to encode every character in every language into a sequence of code points (and how to define what a character is in each language) takes up the bulk of the Unicode standard, and is ongoing work that continues to be refined and updated.
  • Some newer programming languages like Python 3, Julia (in some ways), and Rust (in some ways) assume that every character is exactly one code point. This works for most European and East Asian language text and early emoji, but fails for many other languages and newer emoji. (It does have the advantage over Blossom Notebook's approach that the count of code points in a piece of text is fixed and will never change.) If you are interoperating with those languages and need to know how many characters those languages will think a piece of text has, you can use .code_points_count().
  • Older programming languages are unaware of Unicode and can only count bytes; some newer programming languages like Go, Julia (in other ways), and Rust (in other ways) also encourage counting bytes and not characters.
  • The Blossom Notebook definition of a character is based on what the Unicode standard calls a grapheme cluster, which approximates what Unicode calls a "user-perceived character". (Unicode otherwise uses "character" to refer to "code point", despite noting that that may not correspond to what readers and writers perceive to be a character. Also, "grapheme clusters" are sometimes only pieces of what a linguist would call graphemes; for example, th is linguistically 1 grapheme but is considered a sequence of 2 grapheme clusters by Unicode, and 2 characters by Blossom Notebook.) While every major programming language has libraries to count Unicode grapheme clusters, Swift (a very modern programming language) is the only one that natively considers text to be a sequence of characters where characters are defined to be grapheme clusters and not code points. Unfortunately, faithfully following the Unicode standard leads to some pitfalls, which Blossom Notebook avoids with aggressive "normalization" (see below).
  • The process mentioned in the section above ("The Absolute Bare Minimum") of picking just one encoding of the multiple possible Unicode encodings of some characters and removing non-characters is called normalization. Blossom Notebook's normalization procedure is based on one of the normalization procedures defined by Unicode (specifically, NFC), but is more aggressive: it considers more pairs of characters to be the same and removes non-characters that Unicode allows.
    • For example, Blossom Notebook normalizes carriage return, form feed, and line feed all to just the line feed character, also commonly called the newline character. (In fact, the sequence of carriage return immediately followed by line feed is normalized to just one rather than two line feed characters, because every modern computer system interprets that combination as a single line break not two line breaks.)
    • An example of a problem with grapheme cluster-oriented languages like Swift is the counterintuitive scenario where appending text with 1 character to text with 1 character, rather than creating text with 2 characters, instead creates text with 1 character that is different from either of the original characters. Blossom Notebook is able to be grapheme cluster-oriented without these problems by removing characters like isolated zero-width joiners, prepend characters, and combining marks.
    • Blossom Notebook also removes some other confusing "invisible" characters like ASCII control characters and soft hyphens (SHY).
    • Some Unicode symbols have both emoji and non-emoji text forms and display differently across platforms when not marked with a variation selector. Most notably, the black right-pointing triangle (aka the Play button) displays as text in Safari on macOS (at least in Safari 15.4 on macOS 10.15.7), but for some reason it displays as emoji in Safari on iOS (at least on iOS 15.6.1), unless it comes after another character in the Geometric Shapes block (I'm very confused by this behavior). Inserting a variation selector suffix seems to result in consistent cross-platform behavior, though.
      • TODO: investigate which characters this happens to and whether there's any pattern that would let us conservatively insert variation selectors whenever a character might have this issue, without doing it for every emoji variation sequence (which would include even the ASCII digits 0-9).
    • Finally, Blossom Notebook finds all flag pairs of Regional Indicator characters and inserts a zero-width non-joiner (ZWNJ) before each pair, so that they'll be self-synchronizing.
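
Two of the points in the list above can be sketched in Python: how the same text gets a different "length" depending on the unit being counted, and the line-break normalization described under normalization. This is only an approximation for illustration. The function names are hypothetical stand-ins, not Blossom Notebook's API, and counting grapheme clusters is left out because Python's standard library does not provide it.

```python
import re

def utf8_byte_count(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of the text."""
    return len(text.encode("utf-8"))

def utf16_code_unit_count(text: str) -> int:
    """Number of 2-byte UTF-16 code units (what JavaScript or Java would count)."""
    # "utf-16-le" is used so that no byte order mark is prepended.
    return len(text.encode("utf-16-le")) // 2

def code_point_count(text: str) -> int:
    """Number of code points (this is what Python 3's len() counts)."""
    return len(text)

def normalize_line_breaks(text: str) -> str:
    """Map CR LF, lone CR, and form feed to a single line feed each."""
    # "\r\n" is listed first so the pair becomes one line feed, not two.
    return re.sub(r"\r\n|\r|\f", "\n", text)

smiling_face_with_hearts = "\U0001F970"            # 🥰, a single code point
face_exhaling = "\U0001F62E\u200D\U0001F4A8"       # 😮‍💨, three code points

for sample in (smiling_face_with_hearts, face_exhaling):
    print(
        utf8_byte_count(sample),
        utf16_code_unit_count(sample),
        code_point_count(sample),
    )

print(repr(normalize_line_breaks("a\r\nb\rc\fd\ne")))
```

Each emoji here is one character to a reader, yet none of the three counts agrees on that for 😮‍💨, which is why the unit being counted has to be chosen explicitly when interoperating with other languages.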