@SoniEx2
Created July 7, 2018 19:53

Unicode Instruction Set Architecture (UISA)

Unicode (most often encoded as UTF-8) is the dominant way of representing text on computers today. However, a few factors hinder its adoption:

  • It's big. It's actually huge! You need to include (ever-larger) character tables in your programs if you want to deal with Unicode.
  • You can't really do anything without those character tables. Unicode is a complete mess without them!

So we shall define an ISA that replaces Unicode. Meet UISA!

Features

  • Simple to use, with a fairly direct way of expressing intent.
  • Easy to decode.
  • Good ASCII interoperability.

Instruction encoding

Instructions have a variable-length encoding. Instructions 0x00 to 0x7F represent ASCII characters.

First bytes 0x80-0x8F shall be reserved for two-byte instructions, 0x90-0x9F for three-byte instructions, and so on.
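A minimal Python sketch of the length rule above. It assumes that "and so on" means each further block of 16 first-byte values adds one byte (so 0xA0-0xAF would be four bytes, up to nine bytes for 0xF0-0xFF). The function names are illustrative and not part of UISA.

```python
def instruction_length(first_byte: int) -> int:
    """Return the total length in bytes of the instruction starting with first_byte."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must fit in one byte")
    if first_byte < 0x80:
        return 1  # 0x00-0x7F: plain ASCII, one byte
    # Assumption: 0x80-0x8F -> 2, 0x90-0x9F -> 3, ... one extra byte per block of 16.
    return ((first_byte - 0x80) >> 4) + 2


def split_instructions(data: bytes) -> list[bytes]:
    """Split a UISA byte string into individual instructions."""
    out = []
    i = 0
    while i < len(data):
        n = instruction_length(data[i])
        if i + n > len(data):
            raise ValueError(f"truncated instruction at offset {i}")
        out.append(data[i:i + n])
        i += n
    return out
```

Because the length depends only on the first byte, a decoder never needs to look ahead or consult a table to find instruction boundaries.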

Instructions starting with 0x80 shall be reserved for space characters. The ASCII space shall not be included in this.

Instructions starting with 0x81 shall be reserved for non-breaking space characters. 0x81 0x00 shall represent the canonical non-breaking space (&nbsp;).

Instructions starting with 0x82 shall be reserved for zero-width space characters.

Instructions starting with 0x83 shall be reserved for non-breaking zero-width space characters.

Instructions starting with 0x84 0x00/5 shall be reserved for control characters.

Instructions starting with 0x84 0x20/5, 0x84 0x40/6 and 0x84 0x80/7 shall be reserved for combining marks, and shall apply to the character on the left.
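The `/5`, `/6` and `/7` suffixes are not defined in this document. The sketch below assumes they are prefix lengths in the style of CIDR notation, so `0x20/5` would match any byte whose top five bits equal those of 0x20 (that is, 0x20-0x27). The helper names are hypothetical.

```python
def matches_prefix(byte: int, prefix: int, bits: int) -> bool:
    """True if the top `bits` bits of `byte` equal those of `prefix`."""
    mask = (0xFF << (8 - bits)) & 0xFF
    return (byte & mask) == (prefix & mask)


def classify_0x84(second_byte: int) -> str:
    """Classify the byte that follows 0x84 (category names are illustrative)."""
    if matches_prefix(second_byte, 0x00, 5):
        return "control"
    combining = ((0x20, 5), (0x40, 6), (0x80, 7))
    if any(matches_prefix(second_byte, p, n) for p, n in combining):
        return "combining"
    return "unassigned"
```

Under this reading, `classify_0x84(0x27)` falls in the combining range and `classify_0x84(0x28)` does not.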

Instructions starting with 0x85 and 0x86 shall be reserved for matching start and end quotes, respectively. (This class will have a lot of homoglyphs, but that's actually intended, because different languages may use the same glyphs differently, and we should encode those semantics!)

Instructions starting with 0x87 shall be reserved for neutral quotes.

Instructions starting with 0x90 shall be reserved for digits representing values 0 to 9, as seen here:
0x90 0x00/4 = 0-like digits
0x90 0x10/4 = 1-like digits
0x90 0x20/4 = 2-like digits
0x90 0x30/4 = 3-like digits
0x90 0x40/4 = 4-like digits
0x90 0x50/4 = 5-like digits
0x90 0x60/4 = 6-like digits
0x90 0x70/4 = 7-like digits
0x90 0x80/4 = 8-like digits
0x90 0x90/4 = 9-like digits
(0x90 0xA0/4 and so on are available for other functions)
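A rough Python sketch of how a parser might read digit values straight from the encoding. Reading the high nibble of the second byte as the digit's value, 0 through 9, is my interpretation of the `/4` notation. The low nibble and any remaining bytes presumably select the particular glyph or script and are ignored here. `digit_value` and `parse_number` are hypothetical names, and `parse_number` assumes a positional base-10 script.

```python
from typing import Optional


def digit_value(instr: bytes) -> Optional[int]:
    """Return 0-9 for a 0x90 digit instruction, else None."""
    if len(instr) < 2 or instr[0] != 0x90:
        return None
    value = instr[1] >> 4  # high nibble selects the digit class
    return value if value <= 9 else None  # 0xA0/4 and up are other functions


def parse_number(instrs: list[bytes]) -> int:
    """Interpret a sequence of digit instructions as a base-10 number."""
    total = 0
    for instr in instrs:
        d = digit_value(instr)
        if d is None:
            raise ValueError("not a digit instruction")
        total = total * 10 + d
    return total
```

For example, `parse_number([bytes([0x90, 0x40, 0x00]), bytes([0x90, 0x25, 0x00])])` yields 42 regardless of which script's "4" and "2" were used.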

Instructions starting with 0xA0 shall be reserved for left-to-right alphabets.

Instructions starting with 0xA1 shall be reserved for right-to-left alphabets.

Other instructions shall be defined somewhere else.

Usage

If you want to check for a space character in a programming language or parser, for example, you can simply check for instructions starting with 0x80. By encoding the character properties as part of the ISA, we avoid the need for separate character tables.
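As a hedged illustration, here is what such a check might look like in Python. The function and table names are made up for this example. Treating the ASCII space as a space too follows from the note that it is encoded separately from the 0x80 class. The 0x84 and 0x90 classes are left out because they also depend on the second byte.

```python
# First-byte classes taken from the "Instruction encoding" section.
FIRST_BYTE_CLASS = {
    0x80: "space",
    0x81: "non-breaking space",
    0x82: "zero-width space",
    0x83: "non-breaking zero-width space",
    0x85: "start quote",
    0x86: "end quote",
    0x87: "neutral quote",
    0xA0: "left-to-right alphabet",
    0xA1: "right-to-left alphabet",
}


def is_space(instr: bytes) -> bool:
    """True for the ASCII space or any instruction starting with 0x80."""
    return instr == b" " or (len(instr) > 0 and instr[0] == 0x80)


def classify(instr: bytes) -> str:
    """Return a coarse class name based only on the first byte."""
    if not instr:
        raise ValueError("empty instruction")
    if instr[0] < 0x80:
        return "ascii"
    return FIRST_BYTE_CLASS.get(instr[0], "other")
```

The whole property lookup is a comparison on one byte, which is the point of putting the properties into the encoding.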

For passing characters around in code, you just pass them as a byte buffer + length. Alternatively, the maximum length for UISA opcodes is 16 bytes, which neatly fits in a 128-bit integer.
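A small sketch of the 128-bit representation in Python, whose integers are arbitrary-precision. The layout (instruction bytes left-aligned, big-endian, zero-padded on the right) is an assumption, since this document does not specify one. `pack` and `unpack` are hypothetical names.

```python
def pack(instr: bytes) -> int:
    """Pack one instruction (1 to 16 bytes) into an integer below 2**128."""
    if not 1 <= len(instr) <= 16:
        raise ValueError("instruction must be 1 to 16 bytes long")
    # Left-align so the first byte, which determines the length, is always on top.
    return int.from_bytes(instr.ljust(16, b"\x00"), "big")


def unpack(value: int, length: int) -> bytes:
    """Recover the first `length` bytes of a packed instruction."""
    if not 1 <= length <= 16:
        raise ValueError("length must be 1 to 16")
    return value.to_bytes(16, "big")[:length]
```

Here `length` is passed explicitly. Since the first byte already determines the instruction length, a caller could also derive it from the top byte of the packed value.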

Unlike Unicode, UISA doesn't define an encoding-agnostic representation. Instead, UISA opts to have only one acceptable encoding. While Unicode is like ARM and UTF-8 is like Thumb, UISA is like x86, but with more sense put into it.
