Consider this an informal guide to reading the Solana snapshot format. This guide is written for Solana Labs versions v1.14 through v1.17.
You are probably reading this because you want to read the accounts in a snapshot without going through the pain of interfacing with the Solana Labs codebase.
We assume general familiarity with the Solana ledger. Let's start by clarifying some less obvious terminology.
- Solana Runtime: The deterministic state machine that executes transactions and manages all Solana accounts. When data enters the runtime/blockchain, it is often referred to as "confirmed". Every piece of runtime data can be validated by re-executing (replaying) the blockchain from genesis.
- Implicit state: A lot of runtime data (ca. 300 MB) is not stored in any accounts and is only partially exposed via RPC. In this document, we will refer to this kind of data as "implicit state". Some implicit state is periodically copied to sysvars. In the Solana Labs client, the implicit state is managed by a structure called Bank.
- AppendVec: A file format containing multiple accounts. The term AppendVec originates from the Solana Labs code. It should probably have been called "accounts vec".
- AppendVec length: The current version of AppendVec has a major design flaw: An AppendVec file cannot be read without external information for the simple reason that the true length of an AppendVec is unknown.
- Manifest: The manifest is a large binary file containing structured data serialized via Bincode. It contains implicit state as well as all AppendVec lengths.
- Bincode: A binary serialization format.
Further terminology will be introduced along the way. We will revisit each component in detail.
First, we need to understand how to get to the data stored inside a full snapshot. (Incremental snapshots will be explained at the end of this document.)
A snapshot consists of the following conceptual layers:
+-----------------------------+
| Zstandard compressed stream |
+-----------------------------+
| TAR file stream |
+-----------------------------+
| Files |
+-----------------------------+
| Accounts & Implicit State |
+-----------------------------+
Starting from the bottom of the stack:
- First, the information is serialized into bytes.
- The serialized data is packed into files.
- Multiple files are packed together into a TAR stream (OLDGNU format).
- The TAR stream is compressed using the Zstandard (zstd) compression format.
Hence, a snapshot uses the .tar.zst file extension.
Consuming the .tar.zst stream is straightforward. Both TAR and Zstandard are widely adopted. A naive approach is to decompress and extract the entire archive to the file system. However, it is also possible to read, decompress, and process a snapshot in a single pass using tar/zstd streaming APIs.
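The single-pass approach can be sketched roughly as follows. This is a minimal sketch, not a reference implementation: the function name iter_snapshot_files is made up for illustration, and it expects an already decompressed TAR byte stream. In practice that stream would come from a zstd decoder (for example the stream reader of the third-party zstandard package), which is left out here so the sketch runs with the Python standard library alone.

```python
import io
import tarfile


def iter_snapshot_files(tar_stream):
    """Yield (name, content) for each regular file in a TAR byte stream.

    tar_stream is any object with a read() method. Mode "r|" tells tarfile
    to read strictly forward, so the stream does not have to be seekable.
    """
    with tarfile.open(fileobj=tar_stream, mode="r|") as archive:
        for member in archive:
            if not member.isfile():
                continue  # directory entries carry no data
            # In stream mode the data must be read before moving on
            # to the next member.
            yield member.name, archive.extractfile(member).read()
```

Reading each file fully into memory keeps the sketch short. AppendVec files can be large, so a real consumer would probably process the file object returned by extractfile in chunks instead.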
A typical snapshot results in a list of files like so.
version
snapshots/
snapshots/status_cache
snapshots/196493007
snapshots/196493007/196493007
accounts/
accounts/196487562.2062643
accounts/196487862.2062575
accounts/196486291.2059029
accounts/196489838.2066997
...
The file names ending with / indicate directories and can be ignored.
If you are just looking for accounts, it is tempting to try to parse the accounts/196487562.2062643 files (AppendVec file format) directly. As mentioned earlier, it is impossible to parse these files on their own. The AppendVec lengths are hidden deep inside the snapshots/196493007/196493007 manifest file, which requires a deserializer worth thousands of lines of code.
So without further ado ...
The version file contains the 5 byte text string 1.2.0. Its use is obvious.
No idea what the snapshots/status_cache file does. It doesn't seem to be important.
The snapshots/196493007/196493007 file contains the snapshot manifest.
To understand how to parse the manifest, one must be able to parse the Bincode serialization format. Quick recap of the second worst serialization format known to man:
Bincode operates on the following data types:
- Scalar types (bool, u8, u16, u32, u64, u128, i8, i16, i32, i64, i128, float, double)
- Composite types
  - Structs
  - Tuples
  - Enums (Tagged Unions)
- Collections
  - Options: Like Rust's Option<T>
  - Arrays: Like Rust's [T; count]
  - Vectors: Like Rust's Vec<T>
  - Maps: Like Rust's BTreeMap<K, V>
Encoding rules:
- The bool type is a u8 that is either 0 or 1
- A scalar type is encoded in little-endian byte order
- A struct is encoded by encoding each of the struct's fields
- A tuple is encoded by encoding each of the tuple's fields
- An enum is the concatenation of ...
  - the variant's ID encoded as a u32
  - the encoded variant's data (if applicable)
- An option is the concatenation of ...
  - whether the optional value is unset or set, as an encoded bool
  - the encoded value, if set
- An array is the concatenation of each item encoded
- A vector is the concatenation of ...
  - the number of items encoded as a u64
  - each item encoded
- A map is the concatenation of ...
  - the number of entries encoded as a u64
  - each key-value tuple encoded
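To make the rules concrete, here is a minimal reader sketch for a handful of them (scalars, bool, option, vector). The class name BincodeReader and its method names are made up for this guide. It assumes fixed-width little-endian integers as described above and leaves out structs, enums and maps.

```python
import struct


class BincodeReader:
    """Tiny cursor over a Bincode blob. Covers only a few of the rules."""

    def __init__(self, data):
        self.data = data
        self.pos = 0

    def _unpack(self, fmt):
        size = struct.calcsize(fmt)
        if self.pos + size > len(self.data):
            raise EOFError("blob ends in the middle of a value")
        (value,) = struct.unpack_from(fmt, self.data, self.pos)
        self.pos += size
        return value

    # Scalars are fixed-width and little-endian.
    def u8(self):
        return self._unpack("<B")

    def u32(self):
        return self._unpack("<I")

    def u64(self):
        return self._unpack("<Q")

    def f64(self):
        return self._unpack("<d")

    def boolean(self):
        value = self.u8()
        if value > 1:
            raise ValueError("bool must be encoded as 0 or 1")
        return value == 1

    def option(self, read_item):
        # One bool tag, followed by the value only if the tag is set.
        return read_item() if self.boolean() else None

    def vector(self, read_item):
        # A u64 item count, followed by that many items.
        return [read_item() for _ in range(self.u64())]
```

For example, reading an optional u64 looks like reader.option(reader.u64), and a vector of u32 looks like reader.vector(reader.u32). Note that the reader has to be told what to read at every step, because the blob carries no type information of its own.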
A Bincode blob can only be deserialized if the data type of that blob is known.
The top level data type of the manifest is as follows.
{
"name": "solana_manifest",
"type": "struct",
"fields": [
{ "name": "bank", "type": "deserializable_versioned_bank" },
{ "name": "accounts_db", "type": "solana_accounts_db_fields" },
{ "name": "lamports_per_signature", "type": "ulong" }
]
}
If you are only interested in accounts, you might again be tempted to parse only the part that actually contains the AppendVec lengths.
Thanks to bincode's thoughtful design, it is not possible to selectively parse fields. You have to parse all of it.
So here are the full definitions:
{
"name": "hash",
"type": "array",
"length": 32,
"element": "uchar"
}
{
"name": "pubkey",
"type": "array",
"length": 32,
"element": "uchar"
}
{
"name": "deserializable_versioned_bank",
"type": "struct",
"fields": [
{ "name": "blockhash_queue", "type": "block_hash_queue" },
{ "name": "ancestors", "type": "vector", "element": "slot_pair" },
{ "name": "hash", "type": "hash" },
{ "name": "parent_hash", "type": "hash" },
{ "name": "parent_slot", "type": "ulong" },
{ "name": "hard_forks", "type": "hard_forks" },
{ "name": "transaction_count", "type": "ulong" },
{ "name": "tick_height", "type": "ulong" },
{ "name": "signature_count", "type": "ulong" },
{ "name": "capitalization", "type": "ulong" },
{ "name": "max_tick_height", "type": "ulong" },
{ "name": "hashes_per_tick", "type": "option", "element": "ulong" },
{ "name": "ticks_per_slot", "type": "ulong" },
{ "name": "ns_per_slot", "type": "uint128" },
{ "name": "genesis_creation_time", "type": "ulong" },
{ "name": "slots_per_year", "type": "double" },
{ "name": "accounts_data_len", "type": "ulong" },
{ "name": "slot", "type": "ulong" },
{ "name": "epoch", "type": "ulong" },
{ "name": "block_height", "type": "ulong" },
{ "name": "collector_id", "type": "pubkey" },
{ "name": "collector_fees", "type": "ulong" },
{ "name": "fee_calculator", "type": "fee_calculator" },
{ "name": "fee_rate_governor", "type": "fee_rate_governor" },
{ "name": "collected_rent", "type": "ulong" },
{ "name": "rent_collector", "type": "rent_collector" },
{ "name": "epoch_schedule", "type": "epoch_schedule" },
{ "name": "inflation", "type": "inflation" },
{ "name": "stakes", "type": "stakes" },
{ "name": "unused_accounts", "type": "unused_accounts" },
{ "name": "epoch_stakes", "type": "vector", "element": "epoch_epoch_stakes_pair" },
{ "name": "is_delta", "type": "char" }
]
}
{
"name": "block_hash_queue",
"type": "struct",
"fields": [
{ "name": "last_hash_index", "type": "ulong" },
{ "name": "last_hash", "type": "option", "element": "hash" },
{ "name": "ages", "type": "vector", "element": "hash_hash_age_pair" },
{ "name": "max_age", "type": "ulong" }
]
}
{
"name": "hash_hash_age_pair",
"type": "struct",
"fields": [
{ "name": "key", "type": "hash" },
{ "name": "val", "type": "hash_age" }
]
}
{
"name": "hash_age",
"type": "struct",
"fields": [
{ "name": "fee_calculator", "type": "fee_calculator" },
{ "name": "hash_index", "type": "ulong" },
{ "name": "timestamp", "type": "ulong" }
]
}
{
"name": "fee_calculator",
"type": "struct",
"fields": [
{ "name": "lamports_per_signature", "type": "ulong" }
]
}
{
"name": "slot_pair",
"type": "struct",
"fields": [
{ "name": "slot", "type": "ulong" },
{ "name": "val", "type": "ulong" }
]
}
{
"name": "hard_forks",
"type": "struct",
"fields": [
{ "name": "hard_forks", "type": "vector", "element": "slot_pair" }
]
}
{
"name": "fee_rate_governor",
"type": "struct",
"fields": [
{ "name": "target_lamports_per_signature", "type": "ulong" },
{ "name": "target_signatures_per_slot", "type": "ulong" },
{ "name": "min_lamports_per_signature", "type": "ulong" },
{ "name": "max_lamports_per_signature", "type": "ulong" },
{ "name": "burn_percent", "type": "uchar" }
]
}
{
"name": "rent_collector",
"type": "struct",
"fields": [
{ "name": "epoch", "type": "ulong" },
{ "name": "epoch_schedule", "type": "epoch_schedule" },
{ "name": "slots_per_year", "type": "double" },
{ "name": "rent", "type": "rent" }
]
}
{
"name": "epoch_schedule",
"type": "struct",
"fields": [
{ "name": "slots_per_epoch", "type": "ulong" },
{ "name": "leader_schedule_slot_offset", "type": "ulong" },
{ "name": "warmup", "type": "uchar" },
{ "name": "first_normal_epoch", "type": "ulong" },
{ "name": "first_normal_slot", "type": "ulong" }
]
}
{
"name": "rent",
"type": "struct",
"fields": [
{ "name": "lamports_per_byte_year", "type": "ulong" },
{ "name": "exemption_threshold", "type": "double" },
{ "name": "burn_percent", "type": "uchar" }
]
}
{
"name": "inflation",
"type": "struct",
"fields": [
{ "name": "initial", "type": "double" },
{ "name": "terminal", "type": "double" },
{ "name": "taper", "type": "double" },
{ "name": "foundation", "type": "double" },
{ "name": "foundation_term", "type": "double" },
{ "name": "__unused", "type": "double" }
]
}
{
"name": "stakes",
"type": "struct",
"fields": [
{ "name": "vote_accounts", "type": "vote_accounts" },
{ "name": "stake_delegations", "type": "map", "element": "delegation_pair", "key": "account" },
{ "name": "unused", "type": "ulong" },
{ "name": "epoch", "type": "ulong" },
{ "name": "stake_history", "type": "stake_history" }
]
}
{
"name": "vote_accounts",
"type": "struct",
"fields": [
{ "name": "vote_accounts", "type": "map", "element": "vote_accounts_pair", "key": "key" }
]
}
{
"name": "vote_accounts_pair",
"type": "struct",
"fields": [
{ "name": "key", "type": "pubkey" },
{ "name": "stake", "type": "ulong" },
{ "name": "value", "type": "solana_account" }
]
}
{
"name": "solana_account",
"type": "struct",
"fields": [
{ "name": "lamports", "type": "ulong" },
{ "name": "data", "type": "vector", "element": "uchar" },
{ "name": "owner", "type": "pubkey" },
{ "name": "executable", "type": "uchar" },
{ "name": "rent_epoch", "type": "ulong" }
]
}
{
"name": "delegation_pair",
"type": "struct",
"fields": [
{ "name": "account", "type": "pubkey" },
{ "name": "delegation", "type": "delegation" }
]
}
{
"name": "delegation",
"type": "struct",
"fields": [
{ "name": "voter_pubkey", "type": "pubkey" },
{ "name": "stake", "type": "ulong" },
{ "name": "activation_epoch", "type": "ulong" },
{ "name": "deactivation_epoch", "type": "ulong" },
{ "name": "warmup_cooldown_rate", "type": "double" }
]
}
{
"name": "unused_accounts",
"type": "struct",
"fields": [
{ "name": "unused1", "type": "vector", "element": "pubkey" },
{ "name": "unused2", "type": "vector", "element": "pubkey" },
{ "name": "unused3", "type": "vector", "element": "pubkey_u64_pair" }
]
}
{
"name": "pubkey_u64_pair",
"type": "struct",
"fields": [
{ "name": "_0", "type": "pubkey" },
{ "name": "_1", "type": "ulong" }
]
}
{
"name": "epoch_epoch_stakes_pair",
"type": "struct",
"fields": [
{ "name": "key", "type": "ulong" },
{ "name": "value", "type": "epoch_stakes" }
]
}
{
"name": "epoch_stakes",
"type": "struct",
"fields": [
{ "name": "stakes", "type": "stakes" },
{ "name": "total_stake", "type": "ulong" },
{ "name": "node_id_to_vote_accounts", "type": "vector", "element": "pubkey_node_vote_accounts_pair" },
{ "name": "epoch_authorized_voters", "type": "vector", "element": "pubkey_pubkey_pair" }
]
}
{
"name": "pubkey_node_vote_accounts_pair",
"type": "struct",
"fields": [
{ "name": "key", "type": "pubkey" },
{ "name": "value", "type": "node_vote_accounts" }
]
}
{
"name": "node_vote_accounts",
"type": "struct",
"fields": [
{ "name": "vote_accounts", "type": "vector", "element": "pubkey" },
{ "name": "total_stake", "type": "ulong" }
]
}
{
"name": "pubkey_pubkey_pair",
"type": "struct",
"fields": [
{ "name": "key", "type": "pubkey" },
{ "name": "value", "type": "pubkey" }
]
}
{
"name": "solana_accounts_db_fields",
"type": "struct",
"fields": [
{ "name": "storages", "type": "vector", "element": "snapshot_slot_acc_vecs" },
{ "name": "version", "type": "ulong" },
{ "name": "slot", "type": "ulong" },
{ "name": "bank_hash_info", "type": "bank_hash_info" },
{ "name": "historical_roots", "type": "vector", "element": "ulong" },
{ "name": "historical_roots_with_hash", "type": "vector", "element": "slot_map_pair" }
]
}
{
"name": "snapshot_slot_acc_vecs",
"type": "struct",
"fields": [
{ "name": "slot", "type": "ulong" },
{ "name": "account_vecs", "type": "vector", "element": "snapshot_acc_vec" }
]
}
{
"name": "snapshot_acc_vec",
"type": "struct",
"fields": [
{ "name": "id", "type": "ulong" },
{ "name": "file_sz", "type": "ulong" }
]
}
{
"name": "bank_hash_info",
"type": "struct",
"fields": [
{ "name": "hash", "type": "hash" },
{ "name": "snapshot_hash", "type": "hash" },
{ "name": "stats", "type": "bank_hash_stats" }
]
}
{
"name": "bank_hash_stats",
"type": "struct",
"fields": [
{ "name": "num_updated_accounts", "type": "ulong" },
{ "name": "num_removed_accounts", "type": "ulong" },
{ "name": "num_lamports_stored", "type": "ulong" },
{ "name": "total_data_len", "type": "ulong" },
{ "name": "num_executable_accounts", "type": "ulong" }
]
}
{
"name": "slot_map_pair",
"type": "struct",
"fields": [
{ "name": "slot", "type": "ulong" },
{ "name": "hash", "type": "hash" }
]
}
And btw - These data structures have been extended several times. This means that you will need to build a deserializer that handles older versions of the data structure definitions if you want to unpack old snapshots.
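Since the definitions above are fairly regular, one way to avoid hand-writing a parser per type is to let the definitions drive a generic decoder. The following is a rough sketch under some assumptions: the function name decode and the PRIMITIVES table are made up for this guide, types is assumed to be a dict mapping each "name" above to its definition, "char" is read as a single signed byte, and "map" is read exactly like a length-prefixed vector of its element type.

```python
import struct

# struct format codes for the primitive type names used in the definitions
PRIMITIVES = {"uchar": "<B", "char": "<b", "ulong": "<Q", "double": "<d"}


def decode_many(types, element, count, buf, pos):
    """Decode count consecutive values of the named element type."""
    items = []
    for _ in range(count):
        item, pos = decode(types, {"type": element}, buf, pos)
        items.append(item)
    return items, pos


def decode(types, spec, buf, pos):
    """Decode the value described by spec at buf[pos]. Returns (value, new_pos)."""
    kind = spec["type"]
    if kind in PRIMITIVES:
        fmt = PRIMITIVES[kind]
        (value,) = struct.unpack_from(fmt, buf, pos)
        return value, pos + struct.calcsize(fmt)
    if kind == "uint128":
        return int.from_bytes(buf[pos:pos + 16], "little"), pos + 16
    if kind == "struct":
        out = {}
        for field in spec["fields"]:
            value, pos = decode(types, field, buf, pos)
            out[field["name"]] = value
        return out, pos
    if kind == "array":
        # Fixed length, no length prefix.
        return decode_many(types, spec["element"], spec["length"], buf, pos)
    if kind == "option":
        tag, pos = buf[pos], pos + 1
        if tag == 0:
            return None, pos
        return decode(types, {"type": spec["element"]}, buf, pos)
    if kind in ("vector", "map"):
        # u64 count, then the items (assumption: maps are read the same way).
        (count,) = struct.unpack_from("<Q", buf, pos)
        return decode_many(types, spec["element"], count, buf, pos + 8)
    # Anything else is a reference to another named definition.
    return decode(types, types[kind], buf, pos)
```

With all definitions loaded into types, decoding the whole manifest would then be a single call along the lines of decode(types, {"type": "solana_manifest"}, manifest_bytes, 0). The sketch does no bounds checking beyond what struct does and it ignores the older layouts mentioned above.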
To get to the AppendVec lengths, select manifest.accounts_db.storages[].account_vecs[].file_sz
The corresponding file names are built from
- <slot>: manifest.accounts_db.storages[].slot
- <id>: manifest.accounts_db.storages[].account_vecs[].id
resulting in the file name accounts/<slot>.<id>.
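As a small illustration, assuming the manifest has already been decoded into nested dicts and lists that mirror the definitions above, the file names and lengths could be collected like this. The function name append_vec_files is made up for this guide.

```python
def append_vec_files(manifest):
    """Yield (file_name, file_sz) for every AppendVec listed in the manifest."""
    for storage in manifest["accounts_db"]["storages"]:
        for vec in storage["account_vecs"]:
            yield f"accounts/{storage['slot']}.{vec['id']}", vec["file_sz"]
```

Turning the result into a dict keyed by file name makes it easy to look up file_sz when the corresponding entry shows up in the TAR stream.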
In snapshots, the AppendVec file format looks somewhat like this.
+------------------+
| Account Header 0 |
+------------------+
| Account Data 0 |
| |
+------------------+
| Padding |
+------------------+
| Account Header 1 |
+------------------+
| Account Data 1 |
+------------------+
| Padding |
+------------------+
| Account Header N |
+------------------+
| Account Data N |
| |
| |
+------------------+ <--- file_sz
| Random Garbage |
+------------------+
It contains repeated instances of (Account Header, Account Data, Padding). Padding contains arbitrary bytes (usually zero) used to align the next account header such that its file offset is a multiple of 8. If the file offset is already aligned to a multiple of 8 after reading the account data, the padding is omitted.
file_sz (obtained from the manifest above) indicates where the random garbage starts. The technical reasons for the random garbage involve the way account data is allocated internally within Solana Labs.
If you just tried to parse the entire file without respecting file_sz, you would eventually end up misinterpreting garbage as an account where none exists.
The definition of the Account Header is as follows (C code). It amounts to a size of 136 bytes.
#include <stdint.h>

/* Packed so that the field offsets match the on-disk layout exactly. */
struct __attribute__((packed)) solana_account_hdr {
  /* 0x00 */ uint64_t write_version;
  /* 0x08 */ uint64_t data_len;      /* length of the account data that follows */
  /* 0x10 */ uint8_t  pubkey[32];
  /* 0x30 */ uint64_t lamports;
  /* 0x38 */ uint64_t rent_epoch;
  /* 0x40 */ uint8_t  owner[32];
  /* 0x60 */ uint8_t  executable;
  /* 0x61 */ uint8_t  padding[7];
  /* 0x68 */ uint8_t  hash[32];
  /* 0x88 */
};
The Account Data appears as-is directly after the header. The length is controlled by the data_len field of solana_account_hdr.
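Putting the header layout, the 8-byte alignment rule and file_sz together, walking an AppendVec could look roughly like the sketch below. The names HEADER and iter_accounts are made up for this guide. buf is assumed to hold the complete file contents and file_sz is the value taken from the manifest.

```python
import struct

# Little-endian, no implicit padding: 8+8+32+8+8+32+1+7+32 = 136 bytes
HEADER = struct.Struct("<QQ32sQQ32sB7x32s")


def iter_accounts(buf, file_sz):
    """Yield one dict per account stored in the first file_sz bytes of buf."""
    offset = 0
    while offset + HEADER.size <= file_sz:
        (write_version, data_len, pubkey, lamports,
         rent_epoch, owner, executable, acc_hash) = HEADER.unpack_from(buf, offset)
        data_start = offset + HEADER.size
        data_end = data_start + data_len
        if data_end > file_sz:
            raise ValueError("account data runs past file_sz")
        yield {
            "write_version": write_version,
            "pubkey": pubkey,
            "lamports": lamports,
            "rent_epoch": rent_epoch,
            "owner": owner,
            "executable": executable != 0,
            "hash": acc_hash,
            "data": bytes(buf[data_start:data_end]),
        }
        # The next header starts at the next multiple of 8.
        offset = (data_end + 7) & ~7
```

Everything at or beyond file_sz is never looked at, which is the whole point of fetching the length from the manifest first.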
While walking the accounts that appear in each AppendVec, you might encounter the same pubkey twice. In this case, compare the slot numbers of the AppendVecs involved and keep the account from the AppendVec with the larger slot number. The case where an account appears twice with the same slot number is undefined.
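A minimal sketch of that rule, assuming each account has been tagged with the slot number of the AppendVec it came from. The function name latest_accounts is made up for this guide. For the undefined same-slot case the sketch simply keeps whichever copy it saw first, which is an arbitrary choice.

```python
def latest_accounts(versions):
    """Pick the newest copy of each account.

    versions is an iterable of (slot, pubkey, account) tuples.
    Returns a dict mapping pubkey to the account with the largest slot.
    """
    best = {}
    for slot, pubkey, account in versions:
        current = best.get(pubkey)
        if current is None or slot > current[0]:
            best[pubkey] = (slot, account)
    return {pubkey: account for pubkey, (_slot, account) in best.items()}
```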
Incremental snapshots use exactly the same format as specified above.
However, it is assumed that the accounts database is pre-populated with a set of accounts loaded from a prior full snapshot. The accounts in the incremental snapshot then override any existing ones.
The incremental snapshot's implicit state also replaces the full snapshot's state.
- Stop doing bincode.
- Stop adding random garbage to the end of AppendVec files so we can just skip the bincode blob.
- The snapshot manifest needs to change from bincode to Protobuf ASAP to allow parsing with an incomplete schema definition.