Skip to content

Instantly share code, notes, and snippets.

@jcnelson
Last active July 30, 2020 15:20
Show Gist options
  • Save jcnelson/99835bcb76064ecfb027a29ddcc53c71 to your computer and use it in GitHub Desktop.
Save jcnelson/99835bcb76064ecfb027a29ddcc53c71 to your computer and use it in GitHub Desktop.
Stacks Argon Testnet Post-Mortem

Testnet Outage Post-Mortem

From 19 July through 24 July, the Stacks testnet had experienced repeated outages every few hours. The symptom was the repeated recurrence of errors such as the following:

WARN [1595867918.213] [src/chainstate/stacks/db/blocks.rs:2861] [ThreadId(6)] Block 36a6b25fb7dd1775761b80b567425dd886f35272a5cc6a7661df0d99662f9d1c state root mismatch: expected 2fbb24799b4ea08ca50a8a2c75749735a6aa60415758a1394525f789a109067c, got 65fbf51eea07de1a75c5c45373c6aa3c7975e88269948114c0468acbf461cf66

No matter what blocks were mined, and no matter what the block contents were, the Argon master node was rejecting all blocks with due to a state root mismatch.

What is a "state root," and how can it be mismatched?

Each Stacks block commits to an authenticated materialized view of the entire blockchain state. This includes all account state, all Clarity smart contracts, all Clarity data maps, tokens, and data variables, as well as metadata about the chain itself. Given a "state root" -- the cryptographic digest over this materialized view, a light client can validate a node-generated Merkle proof that a particular key is equal to a particular value in that corresponding materialized view. For example, you can ask the node for a proof of your balance, and use a recent state root to verify that the balance is correct using that proof. You can see this in action if you go to http://argon-master.blockstack.xyz:20443/v2/accounts/ST1H6RQAK5V5SHQQKJVK3K59ZNPX51V3W035BG9EF. This proof structure, encoded in the proof field, can be efficiently validated against a state root in a recent Stacks block, thereby allowing a light client with the Stacks block headers to query blockchain state without having to process it all. See SIP-004 for details on how this data structure is maintained.

Because the principal use-case of the state root is to implement support for light clients, it is of paramount importance that all Stacks blocks include the state root as it was at the time the block was processed. In fact, a block is not considered valid unless other nodes can process its transactions and independently calculate the block's advertised state root. Of course, if the block has a different state root, the node would log that there was a "state root mismatch" and reject the block.

Why was the state root always mismatched?

TL;DR: stale cached data wasn't getting cleared on an error path.

Per SIP-004 and SIP-005, all of the transactions in the Stacks blockchain will load and store one or more key/value pairs in the blockchain's materialized view. You can think of the materialized view as a hash table, but structured in a way that multiple conflicting views ("forks") can be represented at once, efficiently. Like a lot of hash table implementations, the key/value pairs for a block are represented internally as a sparse hash array-mapped trie (HAMT) -- in particular, we use a sparse adaptive-radix trie (ART), implemented as a HAMT. We extended our ART to be authenticated as a "Merklized" ART (MART) -- that is, each node in the trie contains the cryptographic hash of its children, and the root hash represents all of the key/value pairs inserted.

A MART only represents the key/value pairs modified by a single block. To implement a data structure that can represent all key/value pairs modified by an entire fork of blocks, we extended the MART to create a novel data structure called a Merklized Adaptive-Radix Forest (MARF). When you get a proof back from the RESTful API endpoint, what you are really getting back is a Merkle proof from the Stacks node's MARF that a particular key is set to the returned value in the fork represented by the chain tip's state root.

There are several additional operations defined on MARFs that make them different from most tries -- specifically, these new operations them amenable to producing asymptotically-small proofs (i.e. O(log^2 N) time and space for N blocks in the fork). One such operation is an extension of the root hash calculation for a particular block's trie. In addition to hashing the root node's children, the MARF will hash a geometric series of prior trie root hashes in the block's fork to produce the final root hash. So, if the node is processing block 100, it will first calculate its trie, calculate its trie's root hash, and hash that with the root hashes for the block's 1st, 2nd, 4th, 8th, 16th, 32nd, and 64th ancestors to produce the final root hash. This final root hash is the block's "state root" field, referenced in the error message. This way, the state root represents not only its own block's key/value pairs, but also the key/value pairs inserted by all prior blocks in blockchain. The state root is the tip of an authenticated skip-list over its ancestor tries, and is used to generate a proof that not only is a particular key set to a particular value in the fork, but also that the value is the latest such value to be set for that key.

In implementing the MARF, we noticed during performance benchmarks that this last step of loading up the ancestor tries' root hashes could get expensive in practice. This is because reading each ancestor trie from disk is a random-access read, which means that each read is likely to incur the cost of a disk seek (if you have a SSD, you won't notice this). The root hash calculation needs to be performed at least once when processing a block, and is in practice calculated multiple times as more and more keys are inserted. To speed this operation up, the internal trie storage system caches the currently-processing trie's ancestor hashes. Therein lies a danger: this cache must ALWAYS be cleared once the current trie is no longer under consideration. Otherwise, the wrong state root hash will be calculated, since the wrong ancestor hashes will have been used.

We accidentally forgot to clear this trie root ancestor cache on an error path that gets executed when processing a truly-invalid block. This led to the observed bug.

How did we detect this problem?

This bug went unnoticed for months, which is also surprising! It turns out that the bug will only manifest under a particular sequence of events that simply hadn't been happening in both our test suite and the testnet. Specifically, this bug only manifests if the Stacks node processes a truly invalid block, and then processes a non-sibling of that block. This will cause the trie ancestor root hash cache to be poisoned with stale data, which in turn causes all subsequent blocks to be considered invalid. That is, the bug does not show up if the Stacks node processes an invalid block, and then processes an valid sibling immediately afterward.

The reason it doesn't show up when a valid sibling is the next block to be processed is obvious in hindsight: the sibling will have the same ancestor trie hashes. So, it wouldn't be affected by using stale cached data. Moreover, on successfully processing the sibling, the Stacks node would clear the cache as expected.

The Things We Tried

We spent 5 days working on literally nothing else while trying to pin down how this problem was happening.

The first hurdle we had to overcome was that the Stacks node doesn't store invalid blocks; it discards them. So, Aaron instrumented the Stacks node to log its invalid blocks so he could try to attach them to a local copy of the chainstate. One testnet reboot (!!) later, and we had a sample testcase.

The first breakthrough in diagnosing this problem is that -- surprise, surprise! -- if you stop the node and make it re-process the supposedly-invalid block, the node decides that it's suddenly valid! In retrospect, the trie root hash ancestor cache would have been cold when this happened, so the bug wouldn't have manifested. But we didn't know that was the culprit at the time, so we were left scratching our heads trying to figure out how the production node would behave differently when run on the exact same chainstate on the exact same OS locally. This was our first clue that whatever the trigger was, it had to be due to non-persisted state.

The next logical step was to attach gdb to the production system once the bug manifested, and walk through the process memory and see if anything looked amiss. This was much harder than we anticipated. Here are the fun lessons learned when doing this:

  • We run the Argon master Stacks node in a Kubernetes pod, which "helpfully" disables ptrace(2) by default. So, of course gdb couldn't attach when run from within the pod. We had to update the pod configuration, restart the testnet (!!), and wait for the bug to manifest again.

  • We statically link the Stacks node to musl, because it is smaller than its more widely-used cousin, glibc, and unlike glibc, musl actually statically links. But for reasons we do not yet understand, even when compiling the Stacks node in --release mode (so, no debug symbols by default), it's still possible to attach gdb to the node and see some actual frame data! We saw something like this:

... snip ...

Thread 5 (LWP 15):
#0  0x0000000000fdf806 in __syscall ()
#1  0x0000000000fde7d7 in nanosleep ()
#2  0x0000000000000000 in ?? ()
                                                                                                                                                               
Thread 4 (LWP 14):                                                                                                                                             
#0  0x0000000000fd4511 in epoll_pwait ()
#1  0xffffffffffffffff in ?? ()
#2  0x0000000000000001 in ?? ()
#3  0x0000000000dc7a84 in mio::sys::unix::epoll::Selector::select ()
#4  0x0000000000dc60dc in mio::poll::Poll::poll1 () 
#5  0x0000000000dc5d53 in mio::poll::Poll::poll ()
#6  0x0000000000abbd31 in blockstack_lib::net::poll::NetworkState::poll ()
#7  0x00000000008dc729 in blockstack_lib::net::p2p::PeerNetwork::run ()
#8  0x00000000004b4e60 in std::sys_common::backtrace::__rust_begin_short_backtrace () 
#9  0x0000000000570a5b in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#10 0x0000000000fafc5a in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#11 <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#12 std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:87
#13 0x0000000000fdc995 in start ()
#14 0x0000000000203000 in ?? ()
#15 0x0000000000fdd953 in __clone ()
#16 0x0000000000000000 in ?? ()

Thread 3 (LWP 13):
#0  0x0000000000fdf806 in __syscall ()
#1  0x0000000000fe039f in __timedwait_cp ()
#2  0x0000000000000000 in ?? ()

... snip ...

Seeing that at least one thread had what appeared to be the usual stack trace, Jude assumed that debug symbols were being included, and that gdb just couldn't read them in all threads for some reason (Rust is a comparatively new language, so it's expected that the debugging tooling is a bit rough around the edges). The file program confirmed that the binary was not stripped. It wasn't until days later that we discovered that this, in fact, was not supposed to be happening at all.

  • For reasons we still have not uncovered, compiling with --debug and statically linking against musl makes no difference in what stack trace we see. But the thing is, this isn't obvious up front -- this only became obvious after a few testnet reboots (!!) with a few different build options. This problem seems to be dependent on something in the (remote) build environment -- we tested for the ability to backtrace threads on a locally-sourced on an Alpine Linux build (which uses musl), and found that the resulting binary built with the very same link flags had all the proper debug symbols!

  • We gave up on using musl for the time being and switched over to using Debian stretch as our container base image, which uses glibc. We were pleasantly surprised to see that debug output actually worked this time! But, it turns out that you need a much later version of gdb than what's on Debian stretch -- specifically, you'll want gdb 9.2 or later. Without this later version, you cannot read most Rust data types (not even the common Option<T>), which prevented us from walking through the process memory.

  • When you build a Rust binary with debug symbols, Rust eagerly complies. It complies to such a high degree of fidelity that when we could finally run t a a bt in gdb on the production master node, it OOM'ed the Kubernetes pod and forced another testnet reboot (!!). The next time the bug manifested, we just produced a core file. That core file was a whopping 800MB!

  • To make things more interesting, you can't just apt-get install a later version of gdb on Debian stretch. Doing so also pulls in a newer glibc and friends, which (unsurprisingly) have different debug symbols than the Stacks node we're trying to debug. So we got to experience first-hand the fun experience of configuring and building a gdb from source (N.B.: unlike 99.9% of all Unix-like software you'd ever compile, the usual ./configure && make does NOT work for gdb -- you have to run mkdir build && cd build && ../configure && make.
    This is, of course, not documented).

  • Several testnet reboots later, we finally had a core file for the broken binary. Let the debugging commence!

In parallel to our gdb saga, we were making two additional efforts to reproduce the bug locally. First, we had just merged a big refactor PR as part of implementing PoX, and we had just merged a new feature that made some pretty heavy modifications to the MARF code in order to represent unconfirmed microblock state. Naturally, we were initially very suspicious that one of these PRs was the likely cause -- especially the unconfirmed microblock state PR, given all the modifications it made to the MARF code. But try as we might, we could not construct a test case that lead to the problem, no matter how much we messed around with leaving stale unconfirmed state around in the indexes or reprocessing it at odd or unexpected times. We thought at one point that there may have been a network-triggered race condition that could lead to MARF corruption if the unconfirmed state view was being processed at the same time as a block was being processed, but it turned out that this was not possible in the current design.

Second, Jude had extended Aaron's initial work on replaying one block on a local chainstate by modifying it so the node could replay the entire chainstate from scratch. The thinking at the time was that while the problem was only triggered by non-persistent state, perhaps it could be triggered by replaying the exact same blocks that lead up to the first instance of a valid block being declared valid. Much to our dismay, this did not work -- the supposedly-invalid block was found to be valid, even though all the same blocks were processed leading up to the bug.

But at least now that we FINALLY had a core file in hand and a gdb that could read it, we could determine for sure what was going on. There are three pieces of state in the trie storage module that could affect the trie root hash calculation: the currently-uncommitted trie that's being worked on while the block is processing, the cache of all known trie root hashes in RAM, and finally, the ancestor trie root hash cache.

Looking at this through gdb, it was clear that the first two pieces of state looked fine. Interestingly, a piece of state within the currently-uncommitted trie looked like it was corrupt (it was a hash with a surprising number of zero bytes), but we tentatively attributed that to gdb not doing a perfect job of parsing Rust data structures in RAM. We went with this intuition because we don't use any unsafe code in our codebase, and any memory corruption bugs from unsafe code in our dependencies that could have caused this were thought to be an unlikely culprits because the node kept producing the same kinds of errors over and over again (and no other memory-corruption-related errors).

The thing that drew our attention to the ancestor trie root hash cache being the likely culprit was that (a) the cache was still filled even though no block was being processed, and (b) the single key of the cache -- the block identifier of the trie whose ancestor hashes are loaded -- was set to a placeholder value that is used internally to indicate that the node does not yet know the block hash. Finding this value by itself was unsurprising -- when building up the trie for a block's key/value pairs, the miner uses this placeholder block hash because it does not yet know the block's true hash. Similarly, the Stacks chain state uses this placeholder block hash when processing the block in order to derive the same state root that the miner derived. The fact that the cache was still set for the placeholder block hash was the ultimate clue to unraveling this mystery: the cache could only be set this way if the node had been in middle of processing a new block, but ceased before it the block's trie was committed. This could only happen if the block was invalid.

This gave Aaron an idea: could we start to build a MARF trie, roll it back (simulating an invalid block), and then try to build a different MARF trie with the same placeholder block identifier but a different set of ancestors, and manage to reach the same trie hash as would have been reached had the roll-back not happened? Turns out, this was not the case -- the hashes were different, depending on the presence of the roll-back! We found the bug!

Jude confirmed this by building a test-case that mined an invalid block, and then processed a non-sibling block. Lo and behold, the bug was finally duplicated locally!

Five grueling days later, the fix was only three lines of code -- just clear the trie root ancestor cache in the two rollback error paths, and do it more aggressively on the success path.

Lessons Learned

In summary, we will be adopting a few changes to our practices going forward:

  • Make sure gdb works and is accessible in the testnet pods before jumping to conclusions about the presence (or corruption) of debugging symbols. Had we done this, we could have avoided several testnet reboots.

  • Make it so the Stacks node keeps a copy of all invalid blocks as well as valid blocks (but only as a config option), so the full chainstate can be re-played. Had we done this, we would have caught the problem a day or two earlier.

  • Make all panic!()s produce a core dump right then and there, for good measure. This wouldn't have helped us this time, but a similar lesson of "produce a core file first, ask for a backtrace later" would have saved us at least one reboot.

  • Don't be so sure that your recent code is the cause of a newly-manifested bug.

@wileyj
Copy link

wileyj commented Jul 29, 2020

Excellent read, thanks for the in-depth analysis!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment