ItemBox 1.0 draft 1
ItemBox is a binary format for serializing data in a language-independent way, similar to JSON, YAML, BSON, MessagePack, and other formats. ItemBox is specifically designed for encoding and decoding speed, space efficiency, and ease of implementation.
ItemBox can represent any of these basic data types:
- Null: A false value distinct from all other values.
- Boolean: Either false or true.
- Integer: A number, signed or unsigned, without a fractional part.
- Floating-point number: An IEEE 754 double-precision floating point value.
- Unicode string: A sequence of zero or more Unicode characters.
- Bytestring: A sequence of zero or more octets.
- Array: A list of sequential values. The contained values can be of any type, including other containers.
- Mapping: An unordered set of name/value pairs. Both the keys and the values can be of any type.
- Tagged value: A combination of another value with a Unicode "tag" that describes the value's format. For example, a timestamp may be stored as an integer value with the tag :timestamp. The tag is not part of the value itself, but merely describes its interpretation.
A representation of a value is referred to as a "term." Terms are always at least one byte, but often are longer and can have variable length. A term of length zero is an error, and should be treated as such.
The first byte of a term is referred to as the "type code." In addition, there may be a "payload," which can be fixed-length data, an array of bytes, or a string. No delimiters separate or end terms.
When there are multiple valid representations for a value, an encoder may use any of them. Conversely, a decoder must be able to accept all of the possible representations of a given value.
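As an illustration of multiple valid representations, the integer 5 can be written in at least three ways using the type codes defined later in this document. This is a minimal sketch; it assumes the payloads of type codes 3 and 4 are big-endian, which this draft states only for Uint32 and Uint16.

```python
import struct

# Three candidate encodings of the integer 5.
compact = bytes([192 + 5])                     # type code 197, no payload
as_int32 = bytes([3]) + struct.pack(">i", 5)   # assumption: big-endian payload
as_int64 = bytes([4]) + struct.pack(">q", 5)   # assumption: big-endian payload

# An encoder may emit any of these; a decoder must accept all of them.
encodings = [compact, as_int32, as_int64]
```

An encoder that cares about size would normally pick the one-byte form, but nothing in the rule above requires it.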
Here are a few definitions used in the encoding, for simplicity:
- Term: A single value, consisting of at least a type code.
- Uint32: An unsigned 32-bit integer in network byte order.
- Uint16: An unsigned 16-bit integer in network byte order.
- Pair: Two terms in sequence.
- ByteArray: A sequence of bytes, without terminators or separators. Its length is given by a preceding integer, or the type code.
- TermList: A sequence of terms, one after the other. Its length is given by a preceding integer, or the type code.
- PairList: A sequence of Pairs, one after the other. Its length (in pairs, i.e. the number of terms is the length times two) is given by a preceding integer, or the type code.
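The fixed-size building blocks above might be read from a buffer roughly as follows. This is a sketch only: the helper names (read_uint32, read_uint16, read_byte_array) and the (value, new position) return convention are hypothetical choices for illustration, not part of the format.

```python
import struct

def read_uint32(buf: bytes, pos: int) -> tuple[int, int]:
    """Read an unsigned 32-bit integer in network (big-endian) byte order."""
    if pos + 4 > len(buf):
        raise ValueError("truncated Uint32")
    (value,) = struct.unpack_from(">I", buf, pos)
    return value, pos + 4

def read_uint16(buf: bytes, pos: int) -> tuple[int, int]:
    """Read an unsigned 16-bit integer in network (big-endian) byte order."""
    if pos + 2 > len(buf):
        raise ValueError("truncated Uint16")
    (value,) = struct.unpack_from(">H", buf, pos)
    return value, pos + 2

def read_byte_array(buf: bytes, pos: int, length: int) -> tuple[bytes, int]:
    """Read exactly `length` bytes; there is no terminator to look for."""
    end = pos + length
    if end > len(buf):
        raise ValueError("truncated ByteArray")
    return buf[pos:end], end
```

Returning the new position alongside the value lets a decoder chain reads without delimiters, since terms simply follow one another.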
0: Null
Terms with type code 0 simply represent a null. (Note that a "zero byte" is not the same thing as no data at all - a zero-length term is an error!)
1-2: Boolean
The type code 1 represents a Boolean true, and 2 represents a Boolean false.
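A decoder for these single-byte terms could look like the sketch below. The function name decode_single_byte_term is hypothetical, and mapping null to Python's None is an assumption about the host language binding.

```python
# Single-byte terms: the type code alone carries the value.
SINGLE_BYTE_TERMS = {0: None, 1: True, 2: False}

def decode_single_byte_term(buf: bytes):
    """Decode a null or Boolean term from the start of buf."""
    if len(buf) == 0:
        # No data at all is an error, unlike a single zero byte (null).
        raise ValueError("zero-length term")
    type_code = buf[0]
    if type_code not in SINGLE_BYTE_TERMS:
        raise ValueError(f"not a null or Boolean type code: {type_code}")
    return SINGLE_BYTE_TERMS[type_code]
```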
3: 32-bit Integer
The type code 3 indicates that an integer follows.
Payload: 32-bit integer, including a sign bit.
4: 64-bit Integer
The type code 4 also indicates an integer, in this case 64 bits.
Payload: 64-bit integer, including a sign bit.
5: Floating-Point Number
The type code 5 indicates that a floating-point number follows.
Payload: IEEE 754 floating-point number (64 bits long).
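The three fixed-width numeric terms (type codes 3, 4, and 5) can be sketched with Python's struct module. Two assumptions are made here that this draft does not state explicitly: the payloads are big-endian, and signed integers use two's complement. The function names are hypothetical.

```python
import struct

# Assumption: big-endian payloads; two's complement for the signed integers.
_FORMATS = {3: ">i", 4: ">q", 5: ">d"}

def encode_number(type_code: int, value) -> bytes:
    """Encode value as a 32-bit int (3), 64-bit int (4), or double (5)."""
    return bytes([type_code]) + struct.pack(_FORMATS[type_code], value)

def decode_number(buf: bytes):
    """Decode a term whose type code is 3, 4, or 5."""
    fmt = _FORMATS[buf[0]]
    (value,) = struct.unpack_from(fmt, buf, 1)
    return value
```

struct.pack raises struct.error when an integer does not fit the chosen width, so an encoder would fall back from type code 3 to 4 for larger values.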
6: UTF-8 String
The type code 6 indicates a Unicode string encoded in UTF-8. Improperly formatted UTF-8 should be treated as an error.
- Payload: Uint32 indicating the length of the string in bytes;
- ByteArray that many bytes long.
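A sketch of the long string form follows. The function names are hypothetical; the length prefix counts encoded bytes rather than characters, and Python's strict UTF-8 decoder is relied on to reject malformed input.

```python
import struct

def encode_string(text: str) -> bytes:
    """Encode text as type code 6: Uint32 byte length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return bytes([6]) + struct.pack(">I", len(data)) + data

def decode_string(buf: bytes) -> str:
    """Decode a type code 6 term from the start of buf."""
    if not buf or buf[0] != 6:
        raise ValueError("not a type code 6 term")
    (length,) = struct.unpack_from(">I", buf, 1)
    data = buf[5:5 + length]
    if len(data) != length:
        raise ValueError("truncated string payload")
    # Strict decoding raises UnicodeDecodeError on malformed UTF-8.
    return data.decode("utf-8")
```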
7: Bytestring
The type code 7 indicates a bytestring. This has no restrictions on what octets may be included.
- Payload: Uint32 indicating the length of the bytestring;
- ByteArray that many bytes long.
8: Array
The type code 8 indicates an array of values.
- Payload: Uint32 indicating the length of the array;
- TermList that many terms long.
9: Mapping
The type code 9 indicates a mapping.
- Payload: Uint32 indicating the number of pairs in the mapping;
- PairList that many pairs long.
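The long array and mapping forms nest naturally, which the recursive sketch below tries to show. It is deliberately partial: it handles only null, Booleans, strings, lists, and dicts, always uses the long forms (type codes 6, 8, and 9), and the function name encode is hypothetical. Python dict keys must be hashable, which is narrower than the format's "any type" rule for keys.

```python
import struct

def encode(value) -> bytes:
    """Encode a small subset of Python values using long-form type codes."""
    if value is None:
        return b"\x00"
    if value is True:
        return b"\x01"
    if value is False:
        return b"\x02"
    if isinstance(value, str):
        data = value.encode("utf-8")
        return b"\x06" + struct.pack(">I", len(data)) + data
    if isinstance(value, list):
        # Uint32 element count, then the terms back to back.
        body = b"".join(encode(item) for item in value)
        return b"\x08" + struct.pack(">I", len(value)) + body
    if isinstance(value, dict):
        # Uint32 pair count, then key term and value term for each pair.
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return b"\x09" + struct.pack(">I", len(value)) + body
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")
```

Note that the mapping's length prefix counts pairs, so the number of terms that follow is twice that value.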
10-14: Reserved
As of now, no meaning is assigned to type codes in the range 10 through 14. If a parser encounters one, it should return an error to the user. Future versions of this specification may add types, although this is unlikely.
15: Tagged Value
The type code 15 indicates a tagged value.
- Payload: Uint16 for the length of the tag in bytes;
- ByteArray consisting of the tag in UTF-8 encoding;
- Term holding the value that the tag describes.
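A tagged value can be assembled by wrapping an already encoded term, as in this sketch. The function name encode_tagged is hypothetical, and the Uint16 length is taken to count the tag's UTF-8 bytes.

```python
import struct

def encode_tagged(tag: str, inner_term: bytes) -> bytes:
    """Wrap an encoded term with type code 15 and a UTF-8 tag."""
    tag_bytes = tag.encode("utf-8")
    if len(tag_bytes) > 0xFFFF:
        raise ValueError("tag does not fit in a Uint16 length")
    if len(inner_term) == 0:
        raise ValueError("zero-length term")
    return bytes([15]) + struct.pack(">H", len(tag_bytes)) + tag_bytes + inner_term
```

For example, encode_tagged("ts", bytes([197])) tags the compact integer 5 with the two-byte tag "ts".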
16-31: Compact Mapping
Type codes in the range 16 through 31 are used to encode small mappings. The length of the mapping, in pairs, is equal to the type code minus 16. This can represent mappings with up to 15 pairs.
(tc & 240) == 16
Payload: PairList (type code minus 16) pairs long.
32-63: Compact Array
Type codes in the range 32 through 63 are used to encode short arrays. The length of the array is equal to the type code minus 32. This can represent arrays with up to 31 items.
(tc & 224) == 32
Payload: TermList (type code minus 32) terms long.
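As a worked example of the short array form, the sketch below encodes a list of small integers. It borrows the one-byte integer type codes (160 through 255) described later in this document; the function name encode_small_int_array is hypothetical.

```python
def encode_small_int_array(values: list[int]) -> bytes:
    """Encode up to 31 integers in -32..63 as a short array of one-byte terms."""
    if len(values) > 31:
        raise ValueError("too many items for the short array form")
    body = bytearray()
    for v in values:
        if 0 <= v <= 63:
            body.append(192 + v)      # positive integer type codes 192-255
        elif -32 <= v <= -1:
            body.append(159 - v)      # negative integer type codes 160-191
        else:
            raise ValueError("value needs a longer integer form")
    return bytes([32 + len(values)]) + bytes(body)
```

Here [1, 2, -1] becomes four bytes: 35 for the array header, then 193, 194, and 160.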
64-127: UTF-8 string
Type codes in the range 64 through 127 are used to encode short UTF-8 strings. The length of the string is equal to the type code minus 64. Improperly formatted UTF-8 should be treated as an error. This can represent strings up to 63 bytes long.
(tc & 192) == 64
Payload: ByteArray (type code minus 64) bytes long.
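An encoder might choose between the short and long string forms as sketched here. The function name encode_string_compact is hypothetical, and preferring the short form is a design choice, not a requirement of the format.

```python
import struct

def encode_string_compact(text: str) -> bytes:
    """Use type codes 64-127 when the UTF-8 data fits, else type code 6."""
    data = text.encode("utf-8")
    if len(data) <= 63:
        # Length lives in the type code itself; no Uint32 prefix.
        return bytes([64 + len(data)]) + data
    return b"\x06" + struct.pack(">I", len(data)) + data
```

The threshold applies to the encoded byte length, so a string of fewer than 64 characters can still need the long form if it contains multi-byte characters.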
128-159: Bytestring
Type codes in the range 128 through 159 are used to encode short bytestrings. The length of the bytestring is equal to the type code minus 128. This can represent bytestrings up to 31 bytes long.
(tc & 224) == 128
Payload: ByteArray (type code minus 128) bytes long.
160-191: Negative integer
Type codes in the range 160 through 191 are used to encode negative integers. The value is equal to zero minus (the type code minus 159), so type code 160 encodes -1 and type code 191 encodes -32. This can represent integers from -1 to -32.
(tc & 224) == 160
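The arithmetic for this range can be checked with a short sketch. The function names are hypothetical; the mapping follows from the endpoints given above, with type code 160 standing for -1 and 191 for -32.

```python
def encode_small_negative(value: int) -> bytes:
    """Encode an integer in -32..-1 as a single type code byte."""
    if not -32 <= value <= -1:
        raise ValueError("value outside -32..-1")
    return bytes([159 - value])

def decode_small_negative(type_code: int) -> int:
    """Recover the integer from a type code in 160..191."""
    if (type_code & 224) != 160:
        raise ValueError("type code outside 160..191")
    return -(type_code - 159)
```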
192-255: Positive integer
Type codes in the range 192 to 255 are used to encode low-value positive integers. The value is equal to the type code minus 192. This can represent integers from 0 to 63.
(tc & 192) == 192
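Taken together, the bit-mask tests for type codes 16 through 255 can drive a small dispatcher. The sketch below uses exactly the masks listed in this document; the function name classify_compact and the (kind, number) return shape are hypothetical.

```python
def classify_compact(tc: int):
    """Return (kind, n) for type codes 16-255, or None for codes 0-15.

    For containers and strings, n is the length; for integers, n is the value.
    """
    if not 0 <= tc <= 255:
        raise ValueError("type code must fit in one byte")
    if (tc & 240) == 16:
        return ("mapping", tc - 16)       # 16-31, length in pairs
    if (tc & 224) == 32:
        return ("array", tc - 32)         # 32-63, length in terms
    if (tc & 192) == 64:
        return ("string", tc - 64)        # 64-127, length in bytes
    if (tc & 224) == 128:
        return ("bytestring", tc - 128)   # 128-159, length in bytes
    if (tc & 224) == 160:
        return ("integer", -(tc - 159))   # 160-191, values -1 to -32
    if (tc & 192) == 192:
        return ("integer", tc - 192)      # 192-255, values 0 to 63
    return None                           # 0-15 use the explicit forms
```

Because every range is aligned to a power of two, each test is a single mask and compare, which fits the format's stated goal of decoding speed.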