The baseIdeo encoding maps an arbitrary stream of bytes to Unicode code points in the "CJK Unified Ideographs" block (U+4E00–U+9FFF). Each code point carries at most 14 bits.
The baseIdeo encoding is inspired by pnck's basecjk.
- U+6000–U+9FFF are used for normative encoding.
- U+4E00–U+4E0D are dedicated to padding.
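The two ranges above can be checked with a little arithmetic. The following is a minimal sketch, not part of the specification: the constant names and the `classify` helper are hypothetical, introduced only for illustration. It shows that the normative range holds exactly 2^14 code points (one per 14-bit value) and that the padding range holds 14 code points.

```python
# Hypothetical names; the ranges themselves are taken from the list above.
DATA_FIRST, DATA_LAST = 0x6000, 0x9FFF  # normative encoding range
PAD_FIRST, PAD_LAST = 0x4E00, 0x4E0D    # padding range

# Number of code points in each range (inclusive bounds).
data_count = DATA_LAST - DATA_FIRST + 1
pad_count = PAD_LAST - PAD_FIRST + 1


def classify(ch: str) -> str:
    """Return 'data', 'padding', or 'other' for a single character."""
    cp = ord(ch)
    if DATA_FIRST <= cp <= DATA_LAST:
        return "data"
    if PAD_FIRST <= cp <= PAD_LAST:
        return "padding"
    return "other"


print(data_count == 2 ** 14)  # one code point per 14-bit value
print(pad_count)              # number of padding code points
```

Code points in U+4E0E–U+5FFF fall inside the CJK Unified Ideographs block but in neither range, so this sketch reports them as `other`.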
A baseIdeo
baseIdeo's padding scheme allows for easy lossless interpretation of padding lengths. This property can be utilized to concatenate streams without re-interpretation[1], given the following modification to the definition of a stream:
Note that under this variant, the same bitstream can be encoded as different Concat-Var baseIdeo streams, depending on how it is segmented.
A baseIdeo encoder has an associated bit-stream sb, and a stream-length property l. Its handler runs the following operations:
- Let baseOffset be U+6000.
- Let padBase be U+4E00.
- Let remaining be l.
- While remaining is no less than 14:
  - Read 14 bits from sb as integer b14.
  - Decrement remaining by 14.
  - Emit the codepoint b14 + baseOffset.
- If remaining is greater than 0:
  - Read remaining bits from sb as integer b14.
  - Bitwise shift b14 left by 14 - remaining bits.
  - Emit the codepoint b14 + baseOffset.
  - Emit the codepoint remaining + padBase.
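The encoder steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the function name `encode_bits` is hypothetical, the bit-stream sb is modelled as a string of `'0'` and `'1'` characters, and bits are assumed to be read most significant bit first (the steps do not state a bit order).

```python
BASE_OFFSET = 0x6000  # baseOffset in the steps above
PAD_BASE = 0x4E00     # padBase in the steps above


def encode_bits(bits: str) -> str:
    """Encode a string of '0'/'1' characters (a stand-in for sb).

    Assumption: bits are read most significant bit first.
    """
    if any(c not in "01" for c in bits):
        raise ValueError("bits must contain only '0' and '1'")

    out = []
    pos = 0
    remaining = len(bits)  # the stream-length property l

    # Full 14-bit groups map directly onto U+6000..U+9FFF.
    while remaining >= 14:
        b14 = int(bits[pos:pos + 14], 2)
        pos += 14
        remaining -= 14
        out.append(chr(b14 + BASE_OFFSET))

    # A short final group is left-aligned within 14 bits, then followed
    # by a padding code point recording how many bits are meaningful.
    if remaining > 0:
        b14 = int(bits[pos:], 2) << (14 - remaining)
        out.append(chr(b14 + BASE_OFFSET))
        out.append(chr(remaining + PAD_BASE))

    return "".join(out)


# 17 bits: one full group of zeros, then the 3-bit tail "101".
encoded = encode_bits("0" * 14 + "101")
print([f"U+{ord(c):04X}" for c in encoded])
# → ['U+6000', 'U+8800', 'U+4E03']
```

In the usage example, the tail `101` is shifted left by 11 bits to give 0x2800, which lands on U+8800, and the trailing U+4E03 records that only 3 bits of that code point are meaningful.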
A more realistic byte-oriented encoder will be discussed later in BYTES.md.
A baseIdeo decoder has an associated code point string sc, which has a length property l. Its handler performs the following operations to restore the original bit-stream sb:
- Let i be 0.
- While l is greater than 0: