Simple Offset Serialization (SOS)

[This is an expanded idea, I don't have an implementation to back it up yet]

Every piece of data in Ethereum is identified by its hash. For structured data, this means serializing it down into a flat binary blob and then running a cryptographically secure hash function on it. Since the hash depends on the data format, the serialization algorithm needs to be standardized and deterministic across all implementations.

There's a wide range of data serialization formats in existence, each usually honed to specific use cases. Ethereum has a unique requirement that makes most of the well-known serialization formats completely inappropriate: to permit smart contracts to access encoded data, we are bound by the abysmal computing capacity of the Ethereum Virtual Machine. The primary goal is to make data access fast and dumb; everything else is secondary.

A previous proposal called Simple Serialization (SSZ) achieved many of the simplicity requirements for primitive data types, but it incurs O(N) access time for complex dynamic types (e.g. list of transactions). This spec defines an alternative that guarantees O(log N) access times for arbitrary nested data.

Primitives

SOS supports three primitive types: booleans, unsigned integers and fixed-size binary blobs. The encoding of the primitive data types is simply their binary content (little endian for integers).

| Type | Constraints | Examples |
| --- | --- | --- |
| bool | | false → 0x00<br>true → 0x01 |
| uintN | N % 8 == 0<br>8 <= N <= 256 | uint16(256) → 0x0001 |
| bytesN | N < 2^32 | bytes3("sos") → 0x736f73 |
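
As an illustration, here is a minimal sketch of the primitive encodings in Go; the helper names are placeholders, not part of the spec:

```go
package sos

import "encoding/binary"

// encodeBool encodes a boolean as a single byte: 0x00 for false, 0x01 for true.
func encodeBool(v bool) []byte {
	if v {
		return []byte{0x01}
	}
	return []byte{0x00}
}

// encodeUint16 encodes a uint16 as its little endian binary content.
func encodeUint16(v uint16) []byte {
	buf := make([]byte, 2)
	binary.LittleEndian.PutUint16(buf, v)
	return buf
}

// encodeBytesN encodes a fixed-size binary blob as its raw content.
func encodeBytesN(v []byte) []byte {
	return v
}
```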

Lists

SOS supports a single list type, containing arbitrary heterogeneous items. All other list variations (homogeneous lists, lists of primitives) are subsets and thus implicitly supported.

Lists are encoded into three concatenated binary segments (some may be empty):

  • The first segment contains all the primitive items and is the concatenation of the individually encoded primitives.
  • The third segment contains all the dynamic items and is the concatenation of the individually encoded dynamics.
  • The middle segment contains an offset index to permit direct data lookups and is the concatenation of the starting offsets of the dynamic items encoded as []uint32, where the starting offset of an item is the length of the primitive segment + length of the offset segment + length of all previous dynamic items.
| Type | Example |
| --- | --- |
| []bool | {true, false, true} → 0x010001 |
| []uint8 | "sos" → 0x736f73 |
| []uint16 | {256, 255} → 0x0001ff00 |
| []bytes3 | {"foo", "bar", "baz"} → 0x666f6f62617262617a |
| [][]uint8 | {"Simple", "Offset", "Serialization"} →<br>0x0c000000 0x12000000 0x18000000 // Offsets 0+12+0, 0+12+6, 0+12+12<br>0x53696d706c65 0x4f6666736574 0x53657269616c697a6174696f6e // Content |
| struct{bool, []uint8} | {true, "sos"} →<br>0x01 // Primitive<br>0x05000000 // Offset 1+4+0<br>0x736f73 // Content "sos" |
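
To make the segment layout concrete, below is a minimal Go sketch of encoding a list that contains only dynamic items (so the primitive segment is empty); the helper name is hypothetical, not part of the spec:

```go
package sos

import "encoding/binary"

// encodeDynamicList encodes a list of dynamic items ([][]uint8): an offset
// index of little endian uint32 starting positions, followed by the
// concatenated item contents. With no primitive items, the offset index
// starts the encoding and spans 4 bytes per dynamic item.
func encodeDynamicList(items [][]byte) []byte {
	offset := uint32(4 * len(items))

	var index, content []byte
	for _, item := range items {
		buf := make([]byte, 4)
		binary.LittleEndian.PutUint32(buf, offset)
		index = append(index, buf...)

		content = append(content, item...)
		offset += uint32(len(item))
	}
	return append(index, content...)
}
```

Running this on the {"Simple", "Offset", "Serialization"} example reproduces the offsets 12 (0x0c), 18 (0x12) and 24 (0x18) from the table above.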

Note that the bytes type defined in the SSZ format is simply a fancy name for []uint8 or []bytes1 in SOS. Implementations can feature this syntactic sugar, but there's no difference in the binary format.

Retrieving primitive items is straightforward. To extract a dynamic item, retrieve the starting offset of the required item and use the next offset to calculate the desired item's length. The last dynamic item's length can be calculated from the overall size of the list being parsed, which is propagated down during parsing.
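
As a sketch of that lookup, the hypothetical Go helper below extracts the i-th dynamic item from a list containing only dynamic items, assuming the buffer spans the entire encoding so the overall size is known:

```go
package sos

import "encoding/binary"

// dynamicItem retrieves the i-th dynamic item from an encoded list that has
// an empty primitive segment. Since there are no primitives, the first offset
// equals the length of the offset index, which also yields the item count.
func dynamicItem(buf []byte, i int) []byte {
	count := int(binary.LittleEndian.Uint32(buf)) / 4

	// An item starts at its own offset and ends at the next item's offset,
	// or at the end of the encoding for the last item.
	start := binary.LittleEndian.Uint32(buf[4*i:])
	end := uint32(len(buf))
	if i < count-1 {
		end = binary.LittleEndian.Uint32(buf[4*(i+1):])
	}
	return buf[start:end]
}
```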

Caveat

To retrieve the very last dynamic item, the parser needs to know where the data stream ends (only the starting offset is calculable in the encoded data). If we load the data from a self-contained location (e.g. database entry), it's fine as we can pass the size to the parser. If we're streaming data over the network, the addition of a delimiter (e.g. length prefix) solves it.

The length prefix could be added as an envelope on all encodings, but that would be redundant data that can be calculated by other means in all practical cases, so it should not be part of the actual serialized binary blob.
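
For the streaming case, here is a minimal sketch of such an envelope, assuming a 4-byte little endian length prefix as the delimiter (the exact framing is left open by this proposal):

```go
package sos

import (
	"encoding/binary"
	"io"
)

// writeFramed sends an SOS encoding over a stream with a length prefix so the
// receiver knows where the encoding ends. The prefix is transport framing
// only and is not part of the serialized blob itself.
func writeFramed(w io.Writer, enc []byte) error {
	prefix := make([]byte, 4)
	binary.LittleEndian.PutUint32(prefix, uint32(len(enc)))
	if _, err := w.Write(prefix); err != nil {
		return err
	}
	_, err := w.Write(enc)
	return err
}
```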

Open questions

Neither SSZ nor SOS is a streaming encoder. This means implementing either encoder might take quite a lot of memory shuffling and allocations. Are we sure this won't come back to bite us on larger data structures?
