@lancejpollard
Last active January 13, 2022 04:21
Universal Language Encoding (ULE)

An alternative to the UTF encodings to handle more languages in a more compact way.

<language><characters><language><characters>...

Everything is in bytes.

A text is a sequence of runs. Each run starts with a language code, then lists all the characters in that language, then ends with a stop code. To intermingle languages, just end one run and start the next; any number of runs can sit back to back in one massive block of text.

The language "code" is a number distinguishing the language from the others in your text. The codes can be assigned per text or they can be standardized. Either works: without standardization, a text only needs to carry its own mapping from codes to languages.

So you specify the language code. In each byte, the right-most bit says whether the code continues into another byte, and the left-most 7 bits carry the number. So 00000010 says the code is number 1. If the right-most bit is 1, then 1 more byte follows, and each byte contributes 7 bits to the number, so 00000011 00000010 means (1 * 128) + 1 = 129. That gives plenty of space for a large number of languages.
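The number format above can be sketched in Python. This is only an illustration: the names `encode_number` and `decode_number` are made up here, and the byte order (first byte holds the most significant 7 bits) is an assumption, since the text does not specify it.

```python
from typing import Tuple


def encode_number(n: int) -> bytes:
    """Encode n with 7 payload bits per byte; the low bit flags 'more bytes follow'."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # Split n into 7-bit groups, least significant group first.
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # assumption: most significant group is written first
    out = bytearray()
    for i, group in enumerate(groups):
        more = 1 if i < len(groups) - 1 else 0
        out.append((group << 1) | more)
    return bytes(out)


def decode_number(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one number starting at pos; return (value, position after it)."""
    value = 0
    while True:
        byte = data[pos]  # raises IndexError if the data is truncated
        pos += 1
        value = (value << 7) | (byte >> 1)
        if not (byte & 1):  # low bit 0: this was the last byte
            return value, pos


print(" ".join(f"{b:08b}" for b in encode_number(1)))    # 00000010
print(" ".join(f"{b:08b}" for b in encode_number(129)))  # 00000011 00000010
```

With this layout, n bytes carry 7n bits of payload, so two bytes reach 16,383 and three bytes reach 2,097,151.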

Then you specify the text. The text works similarly: each character is an index into the language's character set, and the right-most bit of each byte tells if the index extends into another byte. If every index fits in 7 bits, then you can use only 1 byte per character, otherwise you use more.

Then finally, 00000000 is the stop code for characters. So really there are 127 possible characters (indices 1 to 127) in a single byte. Larger indices take two or more bytes.

So you have "language #1, with three 1-byte letters (indices 1, 2, and 3) followed by the stop code":

00000010 00000010 00000100 00000110 00000000
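As a rough check, the example bytes can be reproduced with a small Python sketch. The helper names `encode_number` and `encode_run` are hypothetical, and writing the most significant 7 bits first for multi-byte numbers is an assumption the text does not settle.

```python
from typing import List


def encode_number(n: int) -> bytes:
    """7 payload bits per byte; low bit set means another byte follows."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # assumption: most significant group first
    last = len(groups) - 1
    return bytes((g << 1) | (1 if i < last else 0) for i, g in enumerate(groups))


def encode_run(language: int, chars: List[int]) -> bytes:
    """Encode one run: language code, character indices, then the stop code."""
    out = bytearray(encode_number(language))
    for index in chars:
        if index < 1:
            # Index 0 would encode as 00000000, which is the stop code.
            raise ValueError("character indices must be >= 1")
        out += encode_number(index)
    out.append(0b00000000)  # stop code
    return bytes(out)


run = encode_run(1, [1, 2, 3])
print(" ".join(f"{b:08b}" for b in run))
# 00000010 00000010 00000100 00000110 00000000
```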

This way you can chain together languages no problem, and they take up minimal space.
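Reading a chained stream back could look like the following Python sketch. `decode_number` and `decode_stream` are illustrative names, and the byte order for multi-byte numbers (most significant 7 bits first) is an assumption rather than something the text defines.

```python
from typing import List, Tuple


def decode_number(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode one number at pos; return (value, position after it)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte >> 1)  # assumption: most significant group first
        if not (byte & 1):
            return value, pos


def decode_stream(data: bytes) -> List[Tuple[int, List[int]]]:
    """Split a byte stream into (language code, character indices) runs."""
    runs = []
    pos = 0
    while pos < len(data):
        language, pos = decode_number(data, pos)
        chars: List[int] = []
        while True:
            if pos >= len(data):
                raise ValueError("run is missing its stop code")
            if data[pos] == 0:  # 00000000 at a character boundary ends the run
                pos += 1
                break
            index, pos = decode_number(data, pos)
            chars.append(index)
        runs.append((language, chars))
    return runs


# Language 1 with characters 1, 2, 3, then language 2 with the two-byte character 129.
stream = bytes([0b00000010, 0b00000010, 0b00000100, 0b00000110, 0b00000000,
                0b00000100, 0b00000011, 0b00000010, 0b00000000])
print(decode_stream(stream))  # [(1, [1, 2, 3]), (2, [129])]
```

The stop code is only checked at the start of a character. A first byte that continues always has its right-most bit set, so it can never be mistaken for 00000000.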
