@leafstorm
Created June 10, 2012 19:53
ItemBox, A serialization format I never got around to implementing

ItemBox 1.0 draft 1

ItemBox is a binary format for serializing data in a language-independent way, similar to JSON, YAML, BSON, MessagePack, and other formats. ItemBox is specifically designed for encoding and decoding speed, space efficiency, and ease of implementation.

Types Representable

ItemBox can represent any of these basic data types:

Null

A false value distinct from all other values.

Boolean

Either false or true.

Integer

A number, signed or unsigned, without a fractional part.

Floating-point

An IEEE 754 double-precision floating point value.

Unicode string

A sequence of zero or more Unicode characters.

Bytestring

A sequence of zero or more octets.

Array

A list of sequential values. The contained values can be of any type, including other containers.

Mapping

An unordered set of name/value pairs. Both the keys and the values can be of any type.

Tagged value

A combination of another value with a Unicode "tag" that describes the value's format. For example, a timestamp may be stored as an integer value with the tag :timestamp. The tag is not part of the value itself, but merely describes its interpretation.

Encoding Format

A representation of a value is referred to as a "term." Terms are always at least one byte, but often are longer and can have variable length. A term of length zero is an error, and should be treated as such.

The first byte of a term is referred to as the "type code." In addition, there may be a "payload," which can be fixed-length data, an array of bytes, or a string. No delimiters separate or end terms.

When there are multiple valid representations for a value, an encoder may use any of them. Conversely, a decoder must be able to accept all of the possible representations of a given value.

Here are a few definitions used in the encoding, for simplicity:

Term

A single value, consisting of at least a type code.

Uint32

An unsigned 32-bit integer in network byte order.

Uint16

An unsigned 16-bit integer in network byte order.

Pair

Two terms in sequence.

ByteArray

A sequence of bytes, without terminators or separators. Its length is given by a preceding integer, or the type code.

TermList

A sequence of terms, one after the other. Its length is given by a preceding integer, or the type code.

PairList

A sequence of Pairs, one after the other. Its length (in pairs - i.e. the number of terms is the length times two) is given by a preceding integer, or the type code.
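As a minimal sketch of the Uint32 and Uint16 definitions above, Python's struct module can pack and unpack integers in network (big-endian) byte order. The helper names are hypothetical and not part of the specification.

```python
import struct


def pack_uint32(n: int) -> bytes:
    """Encode n as an unsigned 32-bit integer in network byte order."""
    return struct.pack("!I", n)


def pack_uint16(n: int) -> bytes:
    """Encode n as an unsigned 16-bit integer in network byte order."""
    return struct.pack("!H", n)


def unpack_uint32(data: bytes, offset: int = 0) -> int:
    """Read an unsigned 32-bit integer starting at the given offset."""
    return struct.unpack_from("!I", data, offset)[0]


def unpack_uint16(data: bytes, offset: int = 0) -> int:
    """Read an unsigned 16-bit integer starting at the given offset."""
    return struct.unpack_from("!H", data, offset)[0]
```

struct.pack raises struct.error when a value does not fit the field, so out-of-range lengths are rejected rather than silently truncated.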

0: Null

Terms with type code 0 simply represent a null. (Note that a "zero byte" is not the same thing as no data at all - a zero-length term is an error!)

Payload: None.

1-2: Boolean

The type code 1 represents a Boolean true, and 2 represents a Boolean false.

Payload: None.
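The null and Boolean terms are single bytes with no payload. Note that true is 1 and false is 2, so the type code cannot be used directly as a truth value. A small sketch, with hypothetical function names:

```python
# Type codes taken from the sections above.
NULL, TRUE, FALSE = 0, 1, 2


def encode_simple(value) -> bytes:
    """Encode None, True, or False as a one-byte term."""
    if value is None:
        return bytes([NULL])
    if value is True:
        return bytes([TRUE])
    if value is False:
        return bytes([FALSE])
    raise TypeError("not a null or Boolean value")


def decode_simple(term: bytes):
    """Decode a one-byte null or Boolean term."""
    if len(term) == 0:
        raise ValueError("a zero-length term is an error")
    table = {NULL: None, TRUE: True, FALSE: False}
    if term[0] not in table:
        raise ValueError("not a null or Boolean type code")
    return table[term[0]]
```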

3: 32-bit Integer

The type code 3 indicates that an integer follows.

Payload: 32-bit integer, including a sign bit.

4: 64-bit Integer

The type code 4 also indicates an integer, in this case 64 bits.

Payload: 64-bit integer, including a sign bit.

5: Floating-Point Number

The type code 5 indicates that a floating-point number follows.

Payload: IEEE 754 floating-point number (64 bits long).
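The three fixed-width numeric terms (type codes 3, 4, and 5) might be encoded as sketched below. The draft does not state the byte order or the signed representation of these payloads; this sketch assumes network byte order and two's complement, matching the Uint32 and Uint16 definitions. The function names are hypothetical.

```python
import struct


def encode_int32(n: int) -> bytes:
    """Type code 3, then a signed 32-bit integer (assumed big-endian)."""
    return b"\x03" + struct.pack("!i", n)


def encode_int64(n: int) -> bytes:
    """Type code 4, then a signed 64-bit integer (assumed big-endian)."""
    return b"\x04" + struct.pack("!q", n)


def encode_float(x: float) -> bytes:
    """Type code 5, then an IEEE 754 double (assumed big-endian)."""
    return b"\x05" + struct.pack("!d", x)
```

Under these assumptions a 32-bit integer term is always 5 bytes long, and the 64-bit integer and floating-point terms are always 9 bytes.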

6: UTF-8 String

The type code 6 indicates a Unicode string encoded in UTF-8. Improperly formatted UTF-8 should be treated as an error.

Payload: Uint32 indicating the length of the string in bytes;

ByteArray that many bytes long.

7: Bytestring

The type code 7 indicates a bytestring. This has no restrictions on what octets may be included.

Payload: Uint32 indicating the length of the bytestring;

ByteArray that many bytes long.
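Both length-prefixed forms (type codes 6 and 7) share the same layout: type code, Uint32 length, then the bytes. A minimal sketch with hypothetical function names; note that the string length counts UTF-8 bytes, not characters.

```python
import struct


def encode_string(text: str) -> bytes:
    """Type code 6, Uint32 byte length, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    return b"\x06" + struct.pack("!I", len(payload)) + payload


def encode_bytestring(data: bytes) -> bytes:
    """Type code 7, Uint32 length, then the raw octets."""
    return b"\x07" + struct.pack("!I", len(data)) + data


def decode_string(term: bytes) -> str:
    """Decode a type-6 term, rejecting truncated or malformed input."""
    if len(term) < 5 or term[0] != 6:
        raise ValueError("not a type-6 string term")
    (length,) = struct.unpack_from("!I", term, 1)
    payload = term[5:5 + length]
    if len(payload) != length:
        raise ValueError("truncated string payload")
    # bytes.decode raises UnicodeDecodeError on improperly formatted UTF-8.
    return payload.decode("utf-8")
```

For example, the one-character string "é" occupies two bytes in UTF-8, so its length field is 2.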

8: Array

The type code 8 indicates an array of values.

Payload: Uint32 indicating the length of the array;

TermList that many terms long.

9: Mapping

The type code 9 indicates a mapping.

Payload: Uint32 indicating the number of pairs in the mapping;

PairList that many pairs long.
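Arrays and mappings use a count prefix rather than a byte length: the Uint32 counts terms for an array and pairs for a mapping. The sketch below assumes the contained terms have already been encoded to bytes; the function names are hypothetical.

```python
import struct
from typing import Sequence, Tuple


def encode_array(terms: Sequence[bytes]) -> bytes:
    """Type code 8, Uint32 term count, then the terms back to back."""
    return b"\x08" + struct.pack("!I", len(terms)) + b"".join(terms)


def encode_mapping(pairs: Sequence[Tuple[bytes, bytes]]) -> bytes:
    """Type code 9, Uint32 pair count, then key and value terms alternating."""
    body = b"".join(key + value for key, value in pairs)
    return b"\x09" + struct.pack("!I", len(pairs)) + body
```

Because no delimiters separate terms, a decoder has to parse each contained term in full to find where the next one starts.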

10-14: Reserved

As of now, no meaning is assigned to type codes in the range 10 through 14. If a parser encounters one, it should return an error to the user. Future versions of this specification may assign types to these codes, but this is unlikely.

15: Tagged Value

The type code 15 indicates a tagged value.

Payload: Uint16 for the length of the tag in bytes;

ByteArray consisting of the tag in UTF-8 encoding; Term for the value being tagged.
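A tagged value could be assembled as sketched below, taking the inner value as an already-encoded term. The function name is hypothetical. The draft writes its example tag as :timestamp; whether the leading colon is part of the stored tag is not stated, so the caller passes the tag text exactly as it should be stored.

```python
import struct


def encode_tagged(tag: str, value_term: bytes) -> bytes:
    """Type code 15, Uint16 tag length, UTF-8 tag bytes, then the value term."""
    tag_bytes = tag.encode("utf-8")
    if len(tag_bytes) > 0xFFFF:
        raise ValueError("tag does not fit in a Uint16 length")
    return b"\x0f" + struct.pack("!H", len(tag_bytes)) + tag_bytes + value_term


# Example: the integer 42 as a 32-bit integer term (big-endian assumed),
# tagged as a timestamp.
term = encode_tagged("timestamp", b"\x03\x00\x00\x00\x2a")
```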

16-31: Compact Mapping

Type codes in the range 16 through 31 are used to encode small mappings. The number of pairs in the mapping is equal to the type code minus 16. This can represent mappings with up to fifteen pairs.

Binary range: 0001xxxx

Bitwise test: (tc & 240) == 16

Payload: PairList (type code minus 16) pairs long.

32-63: Array

Type codes in the range 32 through 63 are used to encode short arrays. The length of the array is equal to the type code minus 32. This can represent arrays with up to 31 items.

Binary range: 001xxxxx

Bitwise test: (tc & 224) == 32

Payload: TermList (type code minus 32) terms long.

64-127: UTF-8 string

Type codes in the range 64 through 127 are used to encode short UTF-8 strings. The length of the string is equal to the type code minus 64. Improperly formatted UTF-8 should be treated as an error. This can represent strings up to 63 bytes long.

Binary range: 01xxxxxx

Bitwise test: (tc & 192) == 64

Payload: ByteArray (type code minus 64) bytes long.
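Since an encoder may use any valid representation, one reasonable policy is to choose the compact form whenever the UTF-8 payload fits in 63 bytes and fall back to type code 6 otherwise. A sketch of that policy, with a hypothetical function name:

```python
import struct


def encode_string_compact(text: str) -> bytes:
    """Use a 64-127 type code when the UTF-8 payload is at most 63 bytes."""
    payload = text.encode("utf-8")
    if len(payload) <= 63:
        # The length is folded into the type code; no length field follows.
        return bytes([64 + len(payload)]) + payload
    # Otherwise fall back to type code 6 with a Uint32 length prefix.
    return b"\x06" + struct.pack("!I", len(payload)) + payload
```

The compact form saves four bytes per string compared with the type code 6 form.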

128-159: Bytestring

Type codes in the range 128 through 159 are used to encode short bytestrings. The length of the bytestring is equal to the type code minus 128. This can represent bytestrings up to 31 bytes long.

Binary range: 100xxxxx

Bitwise test: (tc & 224) == 128

Payload: ByteArray (type code minus 128) bytes long.

160-191: Negative integer

Type codes in the range 160 through 191 are used to encode small negative integers. The value is equal to zero minus (the type code minus 159), so type code 160 is -1 and type code 191 is -32. This can represent integers from -1 to -32.

Binary range: 101xxxxx

Bitwise test: (tc & 224) == 160

Payload: None.
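The offset of 159 (rather than 160) is what makes the range start at -1 instead of 0. The arithmetic can be checked with a short sketch; the function names are hypothetical.

```python
def encode_small_negative(n: int) -> bytes:
    """Encode an integer in -32..-1 as a single type code byte."""
    if not -32 <= n <= -1:
        raise ValueError("value outside the compact negative range")
    return bytes([159 - n])  # -1 -> 160, -32 -> 191


def decode_small_negative(tc: int) -> int:
    """Recover the integer from a type code in 160..191."""
    if (tc & 224) != 160:
        raise ValueError("not a compact negative integer type code")
    return -(tc - 159)
```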

192-255: Positive integer

Type codes in the range 192 to 255 are used to encode low-value positive integers. The value is equal to the type code minus 192. This can represent integers from 0 to 63.

Binary range: 11xxxxxx

Bitwise test: (tc & 192) == 192

Payload: None.
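Putting the type codes together, a decoder can dispatch on the first byte using the bitwise tests listed above. The following is a rough sketch rather than a reference implementation. It assumes big-endian two's complement for the integer and float payloads (the draft does not say), represents tagged values with a hypothetical Tagged tuple, and decodes mappings into Python dicts, which means a mapping with an unhashable key such as an array would raise TypeError.

```python
import struct
from collections import namedtuple

Tagged = namedtuple("Tagged", ["tag", "value"])  # hypothetical container


def decode(data: bytes, pos: int = 0):
    """Decode one term starting at pos; return (value, next_pos)."""
    if pos >= len(data):
        raise ValueError("a zero-length term is an error")
    tc = data[pos]
    pos += 1

    def take(n):
        """Consume exactly n payload bytes."""
        nonlocal pos
        chunk = data[pos:pos + n]
        if len(chunk) != n:
            raise ValueError("truncated payload")
        pos += n
        return chunk

    def take_terms(n):
        """Consume n consecutive terms (a TermList)."""
        nonlocal pos
        out = []
        for _ in range(n):
            value, pos = decode(data, pos)
            out.append(value)
        return out

    def take_pairs(n):
        """Consume n pairs (a PairList) into a dict."""
        flat = take_terms(2 * n)
        return dict(zip(flat[0::2], flat[1::2]))

    def uint32():
        return struct.unpack("!I", take(4))[0]

    if tc == 0:
        value = None
    elif tc == 1:
        value = True
    elif tc == 2:
        value = False
    elif tc == 3:
        value = struct.unpack("!i", take(4))[0]
    elif tc == 4:
        value = struct.unpack("!q", take(8))[0]
    elif tc == 5:
        value = struct.unpack("!d", take(8))[0]
    elif tc == 6:
        value = take(uint32()).decode("utf-8")
    elif tc == 7:
        value = take(uint32())
    elif tc == 8:
        value = take_terms(uint32())
    elif tc == 9:
        value = take_pairs(uint32())
    elif tc == 15:
        tag_length = struct.unpack("!H", take(2))[0]
        tag = take(tag_length).decode("utf-8")
        value = Tagged(tag, take_terms(1)[0])
    elif tc < 16:
        raise ValueError("reserved type code %d" % tc)
    elif (tc & 240) == 16:
        value = take_pairs(tc - 16)
    elif (tc & 224) == 32:
        value = take_terms(tc - 32)
    elif (tc & 192) == 64:
        value = take(tc - 64).decode("utf-8")
    elif (tc & 224) == 128:
        value = take(tc - 128)
    elif (tc & 224) == 160:
        value = -(tc - 159)
    else:  # (tc & 192) == 192
        value = tc - 192
    return value, pos
```

Returning the next position alongside the value lets a caller decode several terms laid end to end, which matters because the format has no delimiters between terms.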
