WIP DuckDB storage docs

The DuckDB File Format

Layout:

| Main Header | DB Header 1 | DB Header 2 | Block 1    | Block 2    | ... | Block N    |
| ----------- | ------------| ------------| -----------| ---------- | --- | ---------- |
| 4 KB        | 4 KB        | 4 KB        | 256 KB     | 256 KB     | ... | 256 KB     |
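
Assuming the blocks are laid out contiguously after the three 4 KB headers, as in the table above, the file offset of any block follows directly from its block id. A minimal sketch in Python (the 256 KB block size is the default; see the block_size field of the database header below):

    HEADER_SIZE = 4096            # the main header and each database header are 4 KB
    DEFAULT_BLOCK_SIZE = 262144   # 256 KB

    def block_offset(block_id: int, block_size: int = DEFAULT_BLOCK_SIZE) -> int:
        # Blocks start right after the main header and the two database headers.
        return 3 * HEADER_SIZE + block_id * block_size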


Main Header:

The Main header is a 4 KB block at the start of the database file that contains static metadata about the database. It contains the following fields:

  • Magic Bytes (4 bytes): The magic bytes used to identify a DuckDB database file. This is always set to the ASCII bytes "DUCK".
  • Version (8 bytes): A u64 version number of the database storage format.
  • Flags (64 bytes): Flags, currently unused.
  • Library Version (32 bytes): The version string of the DuckDB library that created the database file. Stored as a null-terminated string.
  • Source Version (32 bytes): The version string of the DuckDB source code that created the database file. Stored as a null-terminated string.

The rest of the Main header block is reserved for future use and should be set to zero.
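
As a rough illustration, the fields above can be parsed with plain offset arithmetic. This is only a sketch: it assumes the fields are packed back to back from the start of the block in the order listed and that integers are little-endian, so check the DuckDB source for the authoritative layout.

    import struct

    def read_main_header(block: bytes) -> dict:
        # Field order and sizes follow the list above; exact offsets are assumptions.
        if block[0:4] != b"DUCK":
            raise ValueError("not a DuckDB database file")
        version = struct.unpack_from("<Q", block, 4)[0]   # u64 storage format version
        # Bytes 12..75 hold the 64 bytes of (currently unused) flags.
        library_version = block[76:108].split(b"\x00", 1)[0].decode()
        source_version = block[108:140].split(b"\x00", 1)[0].decode()
        return {
            "version": version,
            "library_version": library_version,
            "source_version": source_version,
        }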

Database Header:

The database header is a 4 KB block that stores the metadata describing the current state of the database. There are always two database headers in the database file: one for the current state of the database, and one for the previous state. The headers alternate between being "current" and "previous" every time DuckDB commits, allowing for crash recovery. The currently active database header is identified by having the higher iteration number, which increases every time a commit is made.

Having two alternating database headers is crucial to enabling atomic commits, as it allows DuckDB to write the new state of the database entirely without modifying the old state, performing the "flip" of the database header, which marks the end of a successful transaction, only at the very end.

The database header contains the following fields:

  • iteration (8 bytes): A u64 storing the iteration number of the database header.
  • meta_block (8 bytes): The block number of the first metadata block.
  • free_list (8 bytes): The block number of the first free block of the free list.
  • block_count (8 bytes): The number of blocks in the database file as of this database header. If the file is larger than block_size * block_count, any blocks appearing after block_count are implicitly part of the free_list.
  • block_size (8 bytes): The size of each block in the database file. Unless DuckDB has been compiled with a different setting, this field is always zero and should be assumed to represent a default of 256 KB.
  • vector_size (8 bytes): The standard size of the vectors in the database file. Unless DuckDB has been compiled with a different setting, this field is always zero and should be assumed to represent a default of 2048.

The rest of the database header block is reserved for future use and should be set to zero.
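
Under the same assumptions as above (little-endian u64 fields packed in the order listed, with the two headers at file offsets 4 KB and 8 KB), a sketch of reading both database headers and picking the active one could look like this:

    import struct

    DB_HEADER_FIELDS = ("iteration", "meta_block", "free_list",
                        "block_count", "block_size", "vector_size")

    def read_database_header(data: bytes, offset: int) -> dict:
        # Six u64 fields in the order listed above (the exact layout is an assumption).
        header = dict(zip(DB_HEADER_FIELDS, struct.unpack_from("<6Q", data, offset)))
        # Zero means "use the default" for these two fields.
        if header["block_size"] == 0:
            header["block_size"] = 262144   # 256 KB
        if header["vector_size"] == 0:
            header["vector_size"] = 2048
        return header

    def active_database_header(data: bytes) -> dict:
        # The header with the higher iteration number describes the current state.
        h1 = read_database_header(data, 4096)
        h2 = read_database_header(data, 8192)
        return h1 if h1["iteration"] > h2["iteration"] else h2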

Metadata Block:

The metadata block(s) form a linked list that stores the actual metadata of the database contents, roughly in a hierarchical structure starting with the database, then the schemas, then the tables, and finally the columns. The fact that the metadata blocks form a linked list means that DuckDB can reuse blocks that have not been modified since the last commit, and only rewrite and update the pointers of the blocks that actually changed.

DuckDB uses a custom binary encoding, described in Binary Serialization Format below, to serialize metadata objects from C++ classes. It is important to understand how this encoding works: before you can read the column data, you have to traverse the metadata to identify the tables in the catalog and their column types, deserialize their statistics, and eventually locate the row group pointers.

Binary Serialization Format

DuckDB's binary serialization format for classes is a custom encoding, somewhat of a hybrid between protobuf and JSON. In fact, the format is very easy to map to JSON, which is why we are able to reuse the same serialization infrastructure to implement the json_serialize_sql() scalar functions present in the DuckDB json extension.

Like protobuf, we use a separate set of JSON-based "schema" files to generate the serialization and deserialization code for most of the metadata objects. These can be found in the src/include/storage/serialization/ directory in the DuckDB main repository, and the Python script we use to generate the C++ code is in the scripts/generate_serialization.py file.

Primitive Value Types

The binary encoding has 5 primitive value types:

  • Signed Variable Length Integer (SVLI): A variable length LEB128 encoded signed integer.
  • Unsigned Variable Length Integer (UVLI): A variable length LEB128 encoded unsigned integer. Also used for encoding boolean values.
  • Length-Prefixed Blobs: A UVLI containing the length of the blob, followed by length bytes of data. This is used for strings and other binary data.
  • 32-bit Floating Point Numbers
  • 64-bit Floating Point Numbers
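
As a rough sketch, the two varint types can be decoded with standard (S)LEB128 logic; each helper below returns the decoded value together with the position of the next byte:

    def read_uvli(buf: bytes, pos: int) -> tuple[int, int]:
        # Unsigned LEB128: 7 value bits per byte, high bit set on all but the last byte.
        result, shift = 0, 0
        while True:
            byte = buf[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result, pos
            shift += 7

    def read_svli(buf: bytes, pos: int) -> tuple[int, int]:
        # Signed LEB128: like the unsigned case, but sign-extended from bit 6 of the last byte.
        result, shift = 0, 0
        while True:
            byte = buf[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                if byte & 0x40:
                    result -= 1 << shift
                return result, pos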

Complex Value Types

There are two complex value types, List and Object, which are used to serialize nested structures.

Lists are used to serialize an ordered sequence of repeating values of the same type, and are encoded similarly to the length-prefixed blobs: a UVLI containing the length of the list (the number of items to follow), followed by length items of the inner type.
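
Building on the varint helpers sketched above, reading a List reduces to reading the UVLI count and then that many inner values. A small illustrative helper:

    def read_list(buf: bytes, pos: int, read_item) -> tuple[list, int]:
        # A UVLI item count, followed by `count` values of the inner type.
        count, pos = read_uvli(buf, pos)
        items = []
        for _ in range(count):
            item, pos = read_item(buf, pos)
            items.append(item)
        return items, pos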

Objects are used to serialize an ordered sequence of named fields. Conceptually, Objects are a sequence of (field_id, value) pairs, where the field_id is an unsigned 16-bit integer used to identify the field, and the value is the serialized value of the field. field_ids are only unique within the scope of the object, and their values do not have to be sequential. By convention, since we mostly serialize C++ classes, most objects start their field_ids at 100, increasing by another 100 for each level of inheritance to avoid conflicts between base and derived classes.

Unlike protobuf, field_ids are expected to appear in the same order as specified by the schema and are not allowed to be repeated. This makes it possible to deserialize the object in a single pass without having to buffer or seek backwards, and the absence of an expected field, or the presence of an unexpected field, can be quickly detected.

At the end of the object, a special MESSAGE_TERMINATOR_FIELD_ID (0xFFFF) value is written to indicate the end of the object.

Optional and Unknown Fields

In general, most fields in serialized objects have a default value specified in the schema files, which should be used in case the corresponding field_id is missing during deserialization. Similarly, during serialization, if the property of the object being serialized is equal to the default value, the field should be omitted from the serialized object. In practice, this makes any field with a default value optional, and a deserializer should treat the absence of a field as equivalent to the field being set to the given default value.

Additionally, since fields are expected to be serialized in the same order as specified in the schema, it is possible to retain not only backwards compatibility but also some level of forwards compatibility, meaning that deserializers using an older version of the schema can still read files written by a newer serializer. As long as fields added to the end of the schema have default values, and happen to be set to the default value during serialization (and are thus omitted), older deserializers will still be able to read the metadata properly. Unfortunately, due to legacy reasons, not all fields have default values, and in these cases the deserializer should raise an error if the field is missing. Going forwards, all changes to the serialization schema should specify default values for all new fields so as to ensure best-effort forwards compatibility.

In the case of unknown fields, i.e. when deserializing an object with a field_id that is not present in the schema (as can happen when reading a newer version of the file format where the field is not set to the default value, or does not have a default value to begin with), the deserializer should raise an error.

Since the binary encoding does not specify the byte length or start delimiters for the complex List or Object types, it is not possible to "skip" unknown fields.

It is also possible to mark a field as deleted in the schema, which means that the field should be ignored during deserialization and any data structures produced should be discarded. This is useful when fields are removed from the schema but still exist in older versions. Additionally, deleted fields can provide a default value that will always be written so as to allow older deserializers to still read the new version. Deleting fields in general is discouraged, and should only be done when absolutely necessary.
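
Taken together, these rules translate into a fairly simple single-pass object reader: walk the schema in order, fill in defaults for omitted fields, and raise an error on an unknown field or on a missing field without a default. The sketch below assumes, purely as a placeholder, that the 16-bit field_id is stored as a raw little-endian u16; a schema entry here is a hypothetical (field_id, name, read_value, default) tuple, where read_value is a reader such as the varint helpers above.

    MESSAGE_TERMINATOR_FIELD_ID = 0xFFFF
    MISSING = object()   # marker for fields that have no default value

    def read_field_id(buf: bytes, pos: int) -> tuple[int, int]:
        # Placeholder encoding: a raw little-endian u16 (an assumption).
        return int.from_bytes(buf[pos:pos + 2], "little"), pos + 2

    def read_object(buf: bytes, pos: int, schema: list) -> tuple[dict, int]:
        obj = {}
        remaining = iter(schema)
        field_id, pos = read_field_id(buf, pos)
        while field_id != MESSAGE_TERMINATOR_FIELD_ID:
            for expected_id, name, read_value, default in remaining:
                if expected_id == field_id:
                    obj[name], pos = read_value(buf, pos)
                    break
                if default is MISSING:
                    raise ValueError(f"required field {name} is missing")
                obj[name] = default   # omitted optional field: use its default
            else:
                # Unknown fields cannot be skipped, so this must be an error.
                raise ValueError(f"unknown field_id {field_id}")
            field_id, pos = read_field_id(buf, pos)
        for _, name, _, default in remaining:
            if default is MISSING:
                raise ValueError(f"required field {name} is missing")
            obj[name] = default       # trailing omitted fields: use their defaults
        return obj, pos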

Example:

Given the following structure (represented in JSON):

    {
        "field1": 42,
        "field2": "hello",
        "field3": {
            "nested_field1": 43,
            "nested_field2": "world"
        },
        "field4": [1, 2, 3, 4]
    }

The binary serialized equivalent:

    {
        (field_id):     1                               // "field1"
        (value):        42
        (field_id):     2                               // "field2"
        (length):       5                               // length of the string
        (data):         "hello"
        (field_id):     3                               // "field3"
        (field_id):     1                               // "nested_field1"
        (value):        43
        (field_id):     2                               // "nested_field2"
        (length):       5                               // length of the string
        (data):         "world"
        (field_id):     MESSAGE_TERMINATOR_FIELD_ID     // End of the object in "field3"
        (field_id):     4                               // "field4"
        (length):       4                               // length of the list
        (value):        1
        (value):        2
        (value):        3
        (value):        4
        (field_id):     MESSAGE_TERMINATOR_FIELD_ID     // End of the object
    }

Table Data, ColumnData and Row Groups

Once the metadata has been deserialized, you will find table catalog entries with a structure somewhat like this:

    {
        ...
        "type[99]": "TABLE_ENTRY",
        "table[100]": {                 // CreateInfo
            "type[100]": "TABLE_ENTRY",
            "catalog[101]": "string",
            "schema[102]": "string",
            ...
            "table[200]": "string",     // Table name!
            "columns[201]": "List<ColumnData>",
            "constraints[202]": "List<Constraint>",
            ...
        },
        "table_pointer[101]": {         // The pointer to the actual table data
            "block_pointer[100]": "u64",
            "offset[101]": "u32"
        },
        "total_rows[102]": "u64",
        ...
    }

The "table_pointer" fields holds the id of the start block holding the table data, as well as the byte offset within that block. Following this pointer will first lead you to the table statistics, also serialized using the binary serialization format. To read the table statistics, you need to have deserialized the rest of the table metadata first to know the column types, as the statistics are polymorphic and have different fields depending on the logical type of the column. The type is however not stored in the statistics object itself.

    // TableStatistics object
    {
        // List of ColumnStatistics, same order as in "Columns[201]" in the 
        // table catalog entry
        "column_stats[100]": [
            {
                // BaseStatistics object
                "statistics[100]": {
                    // <Common fields>
                    "has_null[100]": "bool",
                    "has_no_null[101]": "bool",
                    "distinct_count[102]": "u64",

                    // <Type dependent fields>
                    ...
                },
                "distinct[101]": "DistinctStatistics",
            },
            ...
        ],
        "table_sample[101]" : "TableSample"
    },
    // Row group count is stored as a raw u64 after the table statistics object
    // (not serialized using the binary serialization format)
    "row_group_count": "u64"

    // A sequence of RowGroupPointer's immediately follows the row group count
    ...

Following the end of the table statistics object, the RowGroup count is stored as a single raw u64, after which the row groups are stored in sequence.
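
A sketch of that step, using a read_object-style helper as sketched earlier and a schema describing the RowGroupPointer fields listed below (both are stand-ins rather than DuckDB APIs, and the byte order of the raw u64 is assumed to be little-endian):

    import struct

    def read_row_groups(buf: bytes, pos: int, read_object, row_group_pointer_schema):
        # The row group count is a raw u64, not a serialized field.
        (row_group_count,) = struct.unpack_from("<Q", buf, pos)
        pos += 8
        row_groups = []
        for _ in range(row_group_count):
            # Each RowGroupPointer is a regular serialized object (see below).
            row_group, pos = read_object(buf, pos, row_group_pointer_schema)
            row_groups.append(row_group)
        return row_groups, pos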

    // RowGroupPointer object
    {
        "row_start[100]": "u64",
        "tuple_count[101]": "u64",
        "data_pointers[102]": "List<MetaBlockPointer>",
        "delete_pointers[103]": "List<MetaBlockPointer>",
    }

The data_pointers field holds a list of metadata pointers, each pointing to the data for a specific column within the row group, in the same order as the columns are defined within the table. Following a data_pointer will lead you to a ColumnData object serialized using the binary serialization format.

    // ColumnData object
    {
        // List of DataPointer objects
        "data_pointers[100]": [ 
            {
                "row_start[100]": "u64",
                "tuple_count[101]": "u64",
                "block_pointer[102]": "BlockPointer",
                "compression_type[103]": "CompressionType",
                "statistics[104]" : "BaseStatistics",
                "segment_state[105]": "ColumnSegmentState"
            }
        ],

        // In practice all column types have a validity bitmap, stored as a 
        // nested ColumnData object
        "validity[101]": {
            "data_pointers[100]": { ... }
        }

        // <Type dependent fields>
    }

Within the ColumnData object, yet another data_pointers field holds a list of DataPointer objects, each pointing to a specific block of data for the column. The block_pointer field holds the block number of the data block, and the compression_type field holds the type of compression used for the data block. Using the compression method specified in the compression_type field, you can start scanning at the address specified by the block_pointer field to access the column segment, which is a compressed part of the data for the column. For the uncompressed case (i.e. no compression), reading the data is akin to doing a memcpy from the block pointer, but with some additional bookkeeping to calculate the correct offset within the block and segment.

ColumnData is a polymorphic, potentially recursively nested object (depending on whether the column contains a nested type) that holds part of the column data.

  • StandardColumnData holds data for primitive types, and has a nested validity ColumnData object representing the validity bitmap for this part of the column.
  • ListColumnData holds data for list types. It stores the list offsets in its own data block_pointer and has a nested validity ColumnData for the bitmap as well as a child_column ColumnData for the actual list element data.
  • StructColumnData holds data for struct types. It has a validity field as well and a nested child_columns list of ColumnData objects, one for each field in the struct, but it does not have its own data_pointer object.
  • ArrayColumnData works almost the same way as ListColumnData, but like StructColumnData it does not have its own data pointer, and is used for array types. Unlike list types, array types do not need to store the list offsets, as the entries are all the same size and are assumed to be located sequentially.
  • ValidityColumnData holds its own data pointer, but does not have a validity field, as having a validity bitmap for the validity bitmap does not make sense.

Again, to deserialize the ColumnData (and the nested statistics object) you need to know the type of the column as the fields are dependent on the logical type, and the type is not stored in ColumnData itself.

The ColumnSegmentState is a piece of auxiliary data that can be used to store information about the state of the column segment. For example, in the case of string columns, the ColumnSegmentState stores a list of block ids used as overflow blocks to store the string data. Different compression methods can optionally implement compression of the ColumnSegmentState as well, but this is not required.
