Solana validator v1.14.17 startup + catchup logs

This document contains an annotated listing of solana-validator logs that outline noteworthy steps during the node startup + catchup process. A few quick notes:

  • These logs are from an actual node restart with the tip of the v1.14 branch against mainnet-beta
  • Lines starting with // are inserted commentary
  • Some newlines have been inserted to allow for better viewing
  • This node was running with the default logging RUST_LOG=solana=info
  • Hardware config:
    • AMD EPYC 7402P 24-Core Processor
    • 256 GB RAM
    • 2x Micron 9300 NVMe 3.84 TB drives
      • One logical drive that stripes across both drives
  • Validator setup
    • accounts, accounts-index, snapshots, and ledger all on aforementioned logical drive
// Log emitted by the solana-validator binary when started
// Grep for `solana-validator` to immediately know:
//  - The version that is actually executing
//  - The number of process starts (e.g., 3 occurrences of this line indicate the process has started 3 times)
[2023-04-19T07:06:30.707721265Z INFO  solana_validator] solana-validator 1.14.18 (src:devbuild; feat:976579100)
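
The grep suggested above can be sketched as a small helper. This is a hypothetical example, not part of the validator; the regex is an assumption based on the sample line format shown here.

```python
# Count startup banner lines to see how many times the process has started.
# The regex is an assumption based on the sample log line above.
import re

BANNER = re.compile(r"solana_validator\] solana-validator \d+\.\d+\.\d+")

def count_process_starts(log_lines):
    """Return how many times the validator process was started."""
    return sum(1 for line in log_lines if BANNER.search(line))

log = [
    "[2023-04-19T07:06:30.707721265Z INFO  solana_validator] "
    "solana-validator 1.14.18 (src:devbuild; feat:976579100)",
]
print(count_process_starts(log))  # 1 start, i.e. no restarts yet
```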

// Log emitted immediately before unpacking the full snapshot
// This is a quick way to determine if the node has an incremental snapshot;
// if it does not, the last bit of this line will be None
[2023-04-19T07:06:38.757629237Z INFO  solana_runtime::snapshot_utils]
Loading bank from full snapshot: /home/sol/ledger/snapshot-189242108-4nKhRos6raw3jbGYUsK8uHzsxXoVoZioHCzA3PtYdopD.tar.zst,
and incremental snapshot: Some("/home/sol/ledger/incremental-snapshot-189242108-189262173-2zfiNEozUMWoo8AhLTMsHKf8jbmqf8o71iwKQ4PBCWAq.tar.zst")
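
A hypothetical helper for checking this programmatically (the line format is taken from the sample above; the shortened paths in the example are illustrative):

```python
# Extract the incremental snapshot path from the "Loading bank" log line.
# Per the note above, the line ends with `None` when there is no incremental.
def incremental_snapshot_path(line):
    """Return the incremental snapshot path, or None if the log shows None."""
    marker = 'incremental snapshot: Some("'
    if marker not in line:
        return None
    start = line.index(marker) + len(marker)
    return line[start:line.index('"', start)]

line = ('Loading bank from full snapshot: /home/sol/ledger/snap.tar.zst, '
        'and incremental snapshot: Some("/home/sol/ledger/inc.tar.zst")')
print(incremental_snapshot_path(line))  # /home/sol/ledger/inc.tar.zst
```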

// Unpack progress of full snapshot
// Unpack is multi-threaded (4 threads here), hence the duplicate-looking log lines
[2023-04-19T07:06:48.781604781Z INFO  solana_runtime::hardened_unpack] unpacked 10411 entries so far...
[2023-04-19T07:06:48.781893539Z INFO  solana_runtime::hardened_unpack] unpacked 10412 entries so far...
[2023-04-19T07:06:48.781922414Z INFO  solana_runtime::hardened_unpack] unpacked 10412 entries so far...
[2023-04-19T07:06:48.781960426Z INFO  solana_runtime::hardened_unpack] unpacked 10412 entries so far...
[2023-04-19T07:06:58.781938840Z INFO  solana_runtime::hardened_unpack] unpacked 21169 entries so far...
[2023-04-19T07:06:58.782044250Z INFO  solana_runtime::hardened_unpack] unpacked 21171 entries so far...
[2023-04-19T07:06:58.782085218Z INFO  solana_runtime::hardened_unpack] unpacked 21173 entries so far...
[2023-04-19T07:06:58.782198092Z INFO  solana_runtime::hardened_unpack] unpacked 21170 entries so far...

// Full snapshot unpack finished
// Full snapshots are typically on the order of an epoch's worth of entries (~400k)
[2023-04-19T07:08:14.997493335Z INFO  solana_runtime::hardened_unpack] unpacked 104410 entries total
[2023-04-19T07:08:14.998814443Z INFO  solana_runtime::hardened_unpack] unpacked 104411 entries total
[2023-04-19T07:08:14.999135392Z INFO  solana_runtime::hardened_unpack] unpacked 104410 entries total
[2023-04-19T07:08:14.999806105Z INFO  solana_runtime::hardened_unpack] unpacked 104410 entries total
...
[2023-04-19T07:08:15.075423926Z INFO  solana_runtime::snapshot_utils] snapshot untar took 96.3s

// Incremental unpack goes much quicker as expected
[2023-04-19T07:08:19.430401260Z INFO  solana_runtime::hardened_unpack] unpacked 4701 entries total
[2023-04-19T07:08:19.431530183Z INFO  solana_runtime::hardened_unpack] unpacked 4702 entries total
[2023-04-19T07:08:19.434131220Z INFO  solana_runtime::hardened_unpack] unpacked 4701 entries total
[2023-04-19T07:08:19.436312541Z INFO  solana_runtime::hardened_unpack] unpacked 4702 entries total
[2023-04-19T07:08:19.454496295Z INFO  solana_runtime::snapshot_utils] incremental snapshot untar took 4.4s

// Begin the creation of initial Bank / AccountsDB from deserialized snapshot
[2023-04-19T07:08:19.457996549Z INFO  solana_runtime::snapshot_utils] snapshot version: 1.2.0
[2023-04-19T07:08:19.458053497Z INFO  solana_runtime::snapshot_utils] snapshot version: 1.2.0
[2023-04-19T07:08:19.458076611Z INFO  solana_runtime::snapshot_utils] Loading bank from full snapshot /home/sol/ledger/snapshot/tmp-snapshot-archive-YY2Vb0/snapshots/189242108/189242108 and incremental snapshot Some("/home/sol/ledger/snapshot/tmp-snapshot-archive-c84wHI/snapshots/189262173/189262173")

// Accounts DB index generation
[2023-04-19T07:08:44.609864213Z INFO  solana_runtime::accounts_db] generating index: 1195/436435 slots...
[2023-04-19T07:08:46.610010992Z INFO  solana_runtime::accounts_db] generating index: 57277/436435 slots...
[2023-04-19T07:08:48.610223958Z INFO  solana_runtime::accounts_db] generating index: 126042/436435 slots...
[2023-04-19T07:08:50.616258223Z INFO  solana_runtime::accounts_db] generating index: 195919/436435 slots...
[2023-04-19T07:08:52.616436221Z INFO  solana_runtime::accounts_db] generating index: 262771/436435 slots...
[2023-04-19T07:08:54.616745580Z INFO  solana_runtime::accounts_db] generating index: 326167/436435 slots...
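
A rough indexing throughput can be computed from any two of these progress lines. This is a sketch, not validator tooling; the timestamps carry nanoseconds, which `datetime` cannot parse directly, so they are truncated to microseconds first.

```python
# Estimate slots indexed per second between two "generating index" log lines.
import re
from datetime import datetime

PAT = re.compile(r"\[(\S+?)Z .*generating index: (\d+)/\d+ slots")

def parse(line):
    ts, slots = PAT.search(line).groups()
    ts = re.sub(r"(\.\d{6})\d*$", r"\1", ts)  # nanoseconds -> microseconds
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f"), int(slots)

def slots_per_second(line_a, line_b):
    (t0, s0), (t1, s1) = parse(line_a), parse(line_b)
    return (s1 - s0) / (t1 - t0).total_seconds()

a = "[2023-04-19T07:08:46.610010992Z INFO  solana_runtime::accounts_db] generating index: 57277/436435 slots..."
b = "[2023-04-19T07:08:54.616745580Z INFO  solana_runtime::accounts_db] generating index: 326167/436435 slots..."
print(round(slots_per_second(a, b)))  # roughly 33-34k slots/s on this hardware
```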

// Bank rebuild finished
[2023-04-19T07:11:27.084764892Z INFO  solana_runtime::snapshot_utils] rebuild bank from snapshots took 187.6s
[2023-04-19T07:11:27.075604183Z INFO  solana_runtime::snapshot_utils] Loaded bank for slot: 189262173

// Snapshot bank (slot) frozen
// The snapshot contains the bank after all transactions for that slot have been
// replayed, so we do not actually replay any transactions for this slot
[2023-04-19T07:11:27.088685794Z INFO  solana_runtime::bank] bank frozen: 189262173
hash: AWGRvPvQjGYY7PyWybTuLdpKGSKkDZAmw3XQrv9HTWek accounts_delta: FSseTJatwMYnjtKubURqbx2DbAB5mfKRbLej9hWEc88a
signature_count: 1743 last_blockhash: 3jNEYiKmYwzu8Wv6Xyx72Xn3pBEM7o4oFuc1dqxPh168 capitalization: 545759810842268750

// Begin replaying slots that are already locally present
// Note that the start slot is the same as the snapshot slot
[2023-04-19T07:11:28.465525474Z INFO  solana_ledger::blockstore_processor] Processing ledger from slot 189262173...

// Local replay will not go higher than what is already present on disk
[2023-04-19T07:11:28.465852675Z INFO  solana_ledger::blockstore_processor] ledger holds data through slot 189263276

// Local replay identifying rooted blocks based on votes in those blocks
[2023-04-19T07:11:29.392059749Z INFO  solana_metrics::metrics] datapoint: bank-forks_set_root elapsed_ms=0i slot=189262174i ...

// Local ledger replay progress
// This is where v1.13 restart is slow with existing ledger, see below
[2023-04-19T07:11:30.691108705Z INFO  solana_ledger::blockstore_processor] processing ledger: slot=189262177, last root slot=189262176 slots=3 slots/s=1.5 txs/s=2833
[2023-04-19T07:11:32.996727631Z ...] processing ledger: slot=189262181, last root slot=189262180 slots=4 slots/s=2.0 txs/s=3862
...
[2023-04-19T07:13:17.323154179Z ...] processing ledger: slot=189262403, last root slot=189262382 slots=7 slots/s=3.5 txs/s=7652.5
[2023-04-19T07:13:19.611082366Z ...] processing ledger: slot=189262409, last root slot=189262382 slots=6 slots/s=3.0 txs/s=6104.5

// Local ledger replay complete
[2023-04-19T07:13:21.730374466Z INFO  solana_ledger::blockstore_processor]
ledger processed in 1 minute, 53 seconds, 264 ms, 660 µs and 464 ns.
root slot is 189262382, 33 banks: 189262382, 189262383, 189262384, 189262385, 189262386, 189262387, 189262388, 189262389, 189262390, 189262391, 189262392, 189262393, 189262394, 189262395, 189262396, 189262397, 189262398, 189262399, 189262400, 189262401, 189262402, 189262403, 189262404, 189262405, 189262406, 189262407, 189262408, 189262409, 189262410, 189262411, 189262412, 189262413, 189262414
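
The instantaneous replay rates can be pulled out of the `processing ledger` lines above with a short script; this is a sketch based on the line format in this gist, handy for spotting where v1.13-style cleanup pauses would show up as dips.

```python
# Extract (slot, slots/s) pairs from "processing ledger" progress lines.
import re

RATE = re.compile(r"processing ledger: slot=(\d+).*?slots/s=([\d.]+)")

def replay_rates(log_lines):
    """Return (slot, slots_per_second) pairs from replay progress lines."""
    return [(int(m.group(1)), float(m.group(2)))
            for line in log_lines if (m := RATE.search(line))]

log = [
    "... processing ledger: slot=189262177, last root slot=189262176 slots=3 slots/s=1.5 txs/s=2833",
    "... processing ledger: slot=189262181, last root slot=189262180 slots=4 slots/s=2.0 txs/s=3862",
]
print(replay_rates(log))  # [(189262177, 1.5), (189262181, 2.0)]
```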

// Validator::new() is done and validator is up
[2023-04-19T07:13:21.751843326Z INFO  solana_validator] Validator initialized

v1.13 note: The local ledger processing has been observed to slow down validator bootup with v1.13. This slow down stems from how cleanup of already processed banks occurs.

  • In v1.13, after a certain number of processed banks (slots) have accumulated, cleanup occurs in a way that blocks further processing. So, actual replay of blocks is periodically stopped while cleanup occurs.
  • In v1.14, processed banks (slots) are sent to a background task to be cleaned up asynchronously. This is the same as how "steady-state" replay (as opposed to startup replay) works in both v1.13 and v1.14.
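
The difference between the two strategies can be sketched with a toy model in plain Python (Solana's actual implementation is Rust; all names here are illustrative only):

```python
# Toy model of the two cleanup strategies. "Cleanup" is modeled as simply
# collecting the bank into a list.
import queue
import threading

def replay_with_inline_cleanup(banks, batch=2):
    """v1.13-style: replay stalls while each accumulated batch is cleaned."""
    cleaned, pending = [], []
    for bank in banks:
        pending.append(bank)         # "replay" the bank
        if len(pending) >= batch:
            cleaned.extend(pending)  # replay is blocked during this cleanup
            pending.clear()
    cleaned.extend(pending)          # clean any final partial batch
    return cleaned

def replay_with_background_cleanup(banks):
    """v1.14-style: finished banks go to a channel; a worker cleans them up."""
    q = queue.Queue()
    cleaned = []
    def worker():
        while (bank := q.get()) is not None:
            cleaned.append(bank)     # cleanup happens off the replay thread
    t = threading.Thread(target=worker)
    t.start()
    for bank in banks:
        q.put(bank)                  # replay never waits on cleanup
    q.put(None)                      # sentinel: tell the worker to stop
    t.join()
    return cleaned
```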

After the last log line above, the validator is up and running. However, the cluster continued producing blocks while the node was offline and catching up, so the node is now behind. Obviously, more time spent restarting means more time falling behind the cluster. The `monitor` subcommand of solana-validator allows an operator to watch the distance between the node and the tip of the cluster.

Metrics can provide a similar view into how a node is doing relative to the tip of the cluster. The graph below examines the replay-slot-stats.slot metric.

  • The purple trace is for the node whose logs are included above.
  • The cyan trace is from another node that was running and keeping up with the cluster while the node we're examining was restarting.
    • This node is included to serve as a reference for where the tip of the cluster was.

In addition, I have added numbered markers for key points in the node's catchup.

  1. This marks when the validator has finished initialization.
    • Let's say that the cluster is at slot S at this point.
  2. As illustrated by the horizontal red line, this marks when the node has replayed local state to slot S.
    • This is an inflection point in validator catchup. Prior to this point, the node was replaying slots that it obtained through repair. After this point, the node is replaying slots that it received via turbine. Because it has been online and receiving shreds via turbine while catching up to slot S, the node has many blocks locally available from slot S+1 up to where the cluster is. As such, ReplayStage can churn through these slots much faster, since it isn't waiting for shreds to arrive from other peers. This explains the often-observed change in catchup velocity.
  3. This marks when the node has finally caught up with the tip of the cluster around 07:36, about 30 minutes after the process restarted.

[Graph: replay-slot-stats.slot over time for the restarting node (purple) and the reference node (cyan), annotated with markers 1-3]
