lancejpollard/readme.md

Universal Language Encoding (ULE)

An alternative to the UTF encodings to handle more languages in a more compact way.
<language><characters><language><characters>...

Everything is in bytes.
You start by specifying the language, then all the characters of the text in that language. Then the next language, then its characters, and so on. You can intermingle languages by having each run start and stop inside one massive block of text.
The language "code" is a number distinguishing the language from others in your text. The codes can be assigned in a custom way for your text, or they can be standardized. It doesn't really matter, since even without standardization it is easy to map them for arbitrary texts.
So you specify the language code. The right-most bit says whether the code continues into another byte. The remaining left-most 7 bits hold the number. So 00000010 says the code is number 1. If the right-most bit is 1, then 1 more byte follows, so 00000011 00000010 would mean (1 * 128^1) + (1 * 128^0) = 129. That gives plenty of space for adding many languages to the code.
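The continuation-bit scheme above can be sketched in Python as follows. This is only an illustration, not a reference implementation: the names `encode_number` and `decode_number` are made up here, and the text does not say which 7-bit group comes first, so the sketch assumes the most significant group is written first.

```python
def encode_number(n: int) -> bytes:
    """Encode a non-negative integer using 7 value bits per byte.

    The right-most bit of each byte is 1 when another byte follows.
    Assumption (not stated in the text): most significant group first.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # most significant 7-bit group first
    last = len(groups) - 1
    return bytes((g << 1) | (1 if i != last else 0) for i, g in enumerate(groups))


def decode_number(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_position) for the number that starts at pos."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte >> 1)  # left-most 7 bits carry the value
        if byte & 1 == 0:                   # right-most bit 0: last byte
            return value, pos
```

With these definitions, `encode_number(1)` gives the single byte 00000010, and the two bytes 00000011 00000010 decode to 129.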
Then you specify the text. The text works similarly: the right-most bit of each byte tells whether the character index extends into another byte. If the character set is <= 128 characters, then you can use only 1 byte to specify each one, otherwise you use more.
Then finally, 00000000 is the stop code for characters. So really there are 127 possible characters in the first 8 bits. Beyond that, multi-byte indexes are all fair game.
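To make the 127 figure concrete, here is a small sketch that computes how many bytes a character index would need. The function name `byte_length` is hypothetical, and the sketch assumes character indexes start at 1 because index 0 in a single byte is the stop code.

```python
def byte_length(index: int) -> int:
    """Bytes needed for a character index at 7 value bits per byte."""
    if index < 1:
        # A single 00000000 byte is the stop code, so 0 is not a character.
        raise ValueError("character indexes start at 1")
    return (index.bit_length() + 6) // 7  # ceiling of bit_length / 7


# Indexes 1..127 fit in one byte; 128 is the first to need two.
one_byte = [i for i in range(1, 300) if byte_length(i) == 1]
print(len(one_byte), byte_length(128))
```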
So you have "language #1, with 3 8-bit letters followed by the stop code":
00000010 00000010 00000100 00000110 00000000
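The example above can be reproduced with a short sketch. The helper names (`encode_number`, `encode_segment`) and the `STOP` constant are illustrative only, and the most-significant-group-first byte order is an assumption that does not affect this particular example, since every value fits in one byte.

```python
STOP = 0b00000000  # stop code that ends a run of characters


def encode_number(n: int) -> bytes:
    """7 value bits per byte; right-most bit set when another byte follows."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()  # assumed order: most significant group first
    last = len(groups) - 1
    return bytes((g << 1) | (1 if i != last else 0) for i, g in enumerate(groups))


def encode_segment(language: int, characters: list[int]) -> bytes:
    """Language code, then each character index, then the stop code."""
    if any(c < 1 for c in characters):
        raise ValueError("character index 0 is reserved for the stop code")
    out = bytearray(encode_number(language))
    for c in characters:
        out += encode_number(c)
    out.append(STOP)
    return bytes(out)


segment = encode_segment(1, [1, 2, 3])
print(" ".join(f"{b:08b}" for b in segment))
# prints: 00000010 00000010 00000100 00000110 00000000
```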

This way you can chain together languages no problem, and they take up minimal space.
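Chaining can be sketched from the reading side too. The decoder below is a rough illustration under the same assumptions as before (hypothetical names, most significant 7-bit group first). It treats a 00000000 byte as the stop code only when it appears where a new character would begin, so a multi-byte index whose final byte happens to be all zeros is not mistaken for a stop.

```python
STOP = 0b00000000


def decode_number(data: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_position); right-most bit 1 means keep reading."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte >> 1)
        if byte & 1 == 0:
            return value, pos


def decode_stream(data: bytes) -> list[tuple[int, list[int]]]:
    """Split a byte stream into (language code, character indexes) runs."""
    segments = []
    pos = 0
    while pos < len(data):
        language, pos = decode_number(data, pos)
        characters = []
        while True:
            if pos >= len(data):
                raise ValueError("stream ended before a stop code")
            if data[pos] == STOP:  # stop code at a character boundary
                pos += 1
                break
            index, pos = decode_number(data, pos)
            characters.append(index)
        segments.append((language, characters))
    return segments


# Language 1 with characters 1, 2, 3, then language 2 with character 128.
stream = bytes([0x02, 0x02, 0x04, 0x06, 0x00, 0x04, 0x03, 0x00, 0x00])
print(decode_stream(stream))
# prints: [(1, [1, 2, 3]), (2, [128])]
```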