Skip to content

Instantly share code, notes, and snippets.

@mzaks
Last active December 1, 2020 14:50
Show Gist options
  • Save mzaks/97196f09de88b8ec191762297d6d694e to your computer and use it in GitHub Desktop.
Save mzaks/97196f09de88b8ec191762297d6d694e to your computer and use it in GitHub Desktop.
IndexedData white paper
1. Motivation
This format is designed to allow users pack data together for random access and with space efficiency in mind.
2. Internal structure
IndexedData (further referenced as idata) can be split up in two general regions: manifest and data.
Manifest region contains information needed to identify the number of elements in data region.
Plus it contains data neded to extract a signle element out of the data region.
Optionaly it can contain a validation key, which can be used to ensure that the data at hand is in fact idata.
The data region contains all data items linearly concatenaited to each other.
Before we discuss the strucutre of the manifest region, we should clarify the concept of progressive offsets (or progressive indexing).
2.1. Progressive indexing
The idata format stores a sequence of data elements in the data reagion without a use of separator.
Data elements can be split appart thanks to the manifest, where we store the length of the data region after an item was appended to it.
For example: when we store first data element of length 5, the index portion of the manifest will store 5.
When we add second data element of length 4, the next eleemnt in the index is 9 (being 5 + 4).
This implies that the data we stroe in the index is a steadily increasing positive number, which can be represented as an unsigned integer (uint).
Using a uint, means that we can encode values smaller then 256 (2^8) in one byte.
Values smaller then 65_536 (2^16) in two bytes.
Values smaller then 16_777_216 (2^24) in three bytes.
And so on.
As the values are steadily increasing the size of the value (number of bytes needed to represent the value) increase as well.
For example, we create a new idata and add three values to it.
First value is of size 20, second is of size 200 and third is of size 60.
The values in the index section of the manifest are 20, 220, 280.
Where the values 20 and 220 can be represented in 1 byte, but value 280 need to be represented in 2 bytes.
It is possible to store all three values in 2 bytes each (the biggest representation needed), making the index information occupy 6 bytes.
But in case of progressive indexing, we store how many items are stored in a given size of bytes.
In ase of 20, 220, 280 we say that 2 items are stored in 1 byte and 1 is stored in 2 bytes.
This technique allows us to store the values as compact as possible and still being able to read values in random access fashion.
2.2 Manifest region structure
2.2.1 First byte
The first half of the first byte (b & 0x0f) of the manifest stores the largest number of bytes a progressive index value can be.
The other half of the byte (b & 0xf0) is reserved for flags.
First flag (0b0001_0000) represents, if the manifest contains index size.
Second flag (0b0010_0000) represents, if the manifest contains validation key.
Flags 0b0100_0000 and 0b1000_0000 are reserved for future use.
For future evolutions we define that, if the first byte equals zero (0x00) further bytes contain the future deviation from the format.
2.2.2 Index counts
Next section of the manifest starts at offset 1 and reads X VLQ values.
X being the number from the first half of the first byte.
For example, if the first half of the first byte is number 3.
We have to read 3 VLQ numbers starting at offset 1.
If you are not familiar with VLQ, please visit following page (https://en.wikipedia.org/wiki/Variable-length_quantity).
The 3 VLQ numbers tell us how many entries are stored in 1, 2 and 3 bytes.
The sum of the numbers is the total number of elements stored in data region.
2.2.3 Index size
This section of the manifest is optional.
If the first flag is present (see paragraph 2.2.1) this section contains a single VLQ number.
The number represents the size of the index.
The size of the index can be computed from the given the index counts.
E.g.: index counts = [5, 0, 10], then the index size is `1 * 5 + 2 * 0 + 3 * 10` = 35 bytes.
However there are situation where we want to reserve more space for the index, in order to add elements without resizing the file / buffer.
In this case we need to specify the index size explicitly.
2.2.4 Validation key
This section of the manifest is also optional.
It contains of two bytes, which are computed from first byte, index counts, index size (explicit or implicit) and the last element of the index (which represents the size of the used data section).
The algorithm for computing the two bytes is not defined yet.
Fletcher16 (https://en.wikipedia.org/wiki/Fletcher%27s_checksum#Fletcher-16) is currently considered for implementation.
When user sets the validation key the data at hand can be identified as idata with high confidence.
This is why we encourage to set the validation key by default in your IndexData implementation.
2.2.5 Index values
This section contains the indexes of the appended data.
The values can be read only based on index counts as the value size is "progressive".
If the index size is present, this section can have an unsused part at the end, which is reserved for later value append.
2.3 Data region structure
Theelements of this region are packed together without any separators.
They can be disassembled for read by reading the index values (see 2.2.5).
The data region can contain an unsused section at the end, reserved for later value append.
2.4 Region order
As we already know, idata contains of two main regions - manifest and data.
Different use cases can benefit from different order.
For example, when the manifest region comes first, it would be possible to read data out of idata without loading the whole file.
However when the data region comes first, building up the idata can be done much more efficiently, becuase we can append values at the end of a buffer / file.
This is why we decided to allow both options (manifest-prefix and manifest-postfix).
The producer and consumer of idata need to agree on the order, if a validation key is not present.
If the validation key is present, the consumer can identify the order by checking the validation key.
When the manifest is written in postfix order, the manifest section are written in reverse.
The first byte (2.2.1) becomes the last byte.
The values of index counts (2.2.2) are also written from right to left (reversed VLQ).
Indexe size (2.2.3), if present, is also represented as reversed VLQ.
Validation key (2.2.4) stays as is.
And index values are also written as is, in LittleEndian order.
The reverse VLQ implementation are needed to be able to read variable length data backwards.
In cases where the length of the data is fix, we read and write it as is, in LE-Order.
Currently we believe that the postfix order should be considered default, because of the benefits in regard to encoding.
3. Developer ergonomic
IndexedData is designed not only to be efficient, but also ergonomic from developer perspecive.
3.1 Building up IndexData
As described in paragraph 2. idata is composed of two main regions, data and manifest.
While the idata is being filled up with values, it makes sense to keep data region as a separate buffer and divide the manifest regions into multiple values.
This way when user append new values, the data can be stored without multiple reallocations.
The resulting idata can be streamed into file / socket at any time without further big data allocations.
3.1.1 Reduce memory usage by streaming the values directly into a file
Last paragraph assumes that idata is kept in the memory through the build process.
It is however imaginable to keep usage of memory to the minimum by storing the data directly into a file.
In this case we need to allocate a file with a certain size.
The total file size is based on the size we estimate for manifest and the data regions.
The size of the manifest and data regions can be estimated, if the user provides us with two values:
1. Average size of a data item
2. Round about number of items user wants to store in the idata
For example user sets the size of the data item to 20 bytes and number of items to 100.
This means that we can reserve 2000 bytes for the data region.
This also means that the index values will be at most 2 bytes long.
We estimate 12 one byte and 88 two byte long index values.
Which means that the index counts should takeup two bytes.
Index value are estimated at: 12 + 88 * 2 = 188 bytes.
So the total size of the manifest region is:
1 byte (first byte) + 2 bytes (index counts) + 2 bytes (index size) + 2 byte (if validaiton key is on, otherwise 0 bytes) + 188 bytes (for index values)
This results in 195 bytes total for manifest region.
2000 + 195 = 2195 bytes total file, or buffer size.
3.2 Reading data
idata can be considered as a randomly accessible collection, where the nuber of elements is known upfront.
At the initialisation of an idata reader, following section of manifest region need to be read and evaluated:
- First byte, in order to identify the size of index counts and if index size and validation key are present
- Index counts, need to be read in order to compute the total count of values and size of the index values, in case the explicit index size is not present
- If the index size is present, it needs to be read
- If the validation key is present it needs to be read and the validation check performed. For the validation check, the last element of index values also need to be read.
After the initialisation, values can be accesed by index, or iterated upon with little computational cost and no additional heap allocations (depends on the langugae and implementation details).
3.3 Appending items to an existing idata
An existing idata can be easily broken into
4. Possible use cases
4.1. Time series data
IndexedData allows efficient aggregation of data point. The users can append the

Internal structure

Generall overview:

Manifest prefix: [manifest]|[dataset]

Manifest postfix: [dataset]|[mainfest]

Detailed view:

Manifest prefix: [fb]+[ic][iso]?[vk]?[us]?[iv]*[us]?|[data]*[us]?

Manifest postfix: [data]*[us]?|[iv]*[us]?[vk]?[iso]?[ic][fb]+

[fb]: first byte

Size:

1 or more bytes. Inititaly it suppose to be one byte only, but there is a posibiity to xtend it to multiple byte in the future.

Data:

First 4 bits (b & 0x0f) represent the biggest size of an index value. Other 4 bits of the bytes represent following feature flags:

  • 0b0001_0000 idata is extendable and contains the [iso] value
  • 0b0010_0000 idata can be validated and contains the [vk] value
  • 0b0100_0000 idata index value size is homogenious, means that [ic] is just a single values
  • 0b1000_0000 idata is nil aware, means that user could encode nil values which resulted in [iv] equal to 0

[ic]: index count

Size:

1 or more bytes. Encodes X uint values in VLQ encoding. For manifest prefix it is LE, postfix BE. X is equal to one if index value size is homogenious, otherwise it equals [fb] first 4 bits value.

Data:

In case of homogenious index value the data represents the total count of values in idata. Otherwise [ic]1 ... [ic]X represent number of index values stored in 1 to X bytes. For example the index values are 20, 220, 340. The [ic]1 = 2, as 20 and 220 can be stored in 1 byte and [ic]2 = 1 as 340 needs 2 bytes.

[iso]: Index size and offset

Size:

Value is optional, if present it occupies 2 or more bytes. It encodes 2 VLQ values. For manifest prefix it is LE, postfix BE.

Data:

This value is presetn if idata is extendable, meaning that the buffer / file can have [us] regions. First value represents the explicit size of [iv] plus possible following [us]. The second value represents the relative offset of [iv] from the end of [vk], encoding the size of precessing [us] and possible future values.

[vk]: Validation key

Size:

Value is optional, if present it occupies 2 Bytes.

Data:

This value is a 16bit fletcher check sum, based on bytes leading (in case of prefix manifes), or trailing (in case of postfix manifest) to the [vk]. User can can add a secret byte array in order to add a simple data integriy check.

[us]: Unused space

Size:

Can be zero ore more bytes.

Data:

Represents reserved space in case of extendable idata.

[iv]: Index value

Size:

One or more bytes. This is an LE uint value representation with trailing 0 bytes removed.

Data:

The value represents the dataset size after a data value was added to it. Given index value for n and max(0, n-1) it is easy to identify data value n from the dataset. Where data set is a continuous sequence of concatentated data values.

[data]: Data value

Size:

Zero or more bytes.

Data:

This is the binary value, which was added to idata by the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment