From 19 July through 24 July, the Stacks testnet experienced repeated outages every few hours. The symptom was the recurrence of errors like the following:
WARN [1595867918.213] [src/chainstate/stacks/db/blocks.rs:2861] [ThreadId(6)] Block 36a6b25fb7dd1775761b80b567425dd886f35272a5cc6a7661df0d99662f9d1c state root mismatch: expected 2fbb24799b4ea08ca50a8a2c75749735a6aa60415758a1394525f789a109067c, got 65fbf51eea07de1a75c5c45373c6aa3c7975e88269948114c0468acbf461cf66
No matter what blocks were mined, and no matter what the block contents were, the Argon master node was rejecting all blocks due to a state root mismatch.
Each Stacks block commits to an authenticated materialized view of the entire
blockchain state. This includes all account state, all Clarity smart
contracts, all Clarity data maps, tokens, and data variables, as well as
metadata about the chain itself. Given a "state root" -- the cryptographic
digest over this materialized view -- a light client can validate a node-generated
Merkle proof that a particular key is equal to a particular value in the
corresponding materialized view. For example, you can ask
the node for a proof of your balance, and use a recent state root to verify that
the balance is correct using that proof. You can see this in action if you
go to http://argon-master.blockstack.xyz:20443/v2/accounts/ST1H6RQAK5V5SHQQKJVK3K59ZNPX51V3W035BG9EF.
This proof structure, encoded in the `proof` field, can be efficiently validated against a state
root in a recent Stacks block, thereby allowing a light client with the Stacks block
headers to query blockchain state without having to process it all. See
SIP-004
for details on how this data structure is maintained.
Because the principal use-case of the state root is to implement support for light clients, it is of paramount importance that all Stacks blocks include the state root as it was at the time the block was processed. In fact, a block is not considered valid unless other nodes can process its transactions and independently calculate the block's advertised state root. Of course, if a node calculates a different state root than the one the block advertises, it logs a "state root mismatch" and rejects the block.
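For intuition, here is a minimal sketch of what a light client does with such a proof: fold hashes from the key/value leaf up to a root and compare the result against the state root from a trusted block header. This is not the actual MARF proof format (SIP-004 specifies that); the hash function (SHA-512/256 via the `sha2` crate), the field names, and the proof layout below are simplifying assumptions.

```rust
use sha2::{Digest, Sha512_256};

/// One level of the proof: all child hashes of the node at this level, with a
/// placeholder entry at `child_index` that the verifier replaces with the hash
/// it computed for the level below.
struct ProofLevel {
    child_hashes: Vec<[u8; 32]>,
    child_index: usize,
}

fn hash_leaf(key: &[u8], value: &[u8]) -> [u8; 32] {
    let mut h = Sha512_256::new();
    h.update(key);
    h.update(value);
    h.finalize().into()
}

/// Fold the proof from the leaf up to the root and compare the result against
/// the state root taken from a trusted block header.
fn verify(key: &[u8], value: &[u8], proof: &[ProofLevel], state_root: [u8; 32]) -> bool {
    let mut current = hash_leaf(key, value);
    for level in proof {
        let mut h = Sha512_256::new();
        for (i, child) in level.child_hashes.iter().enumerate() {
            if i == level.child_index {
                h.update(current); // our subtree's hash goes in this slot
            } else {
                h.update(child);
            }
        }
        current = h.finalize().into();
    }
    current == state_root
}
```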
TL;DR: stale cached data wasn't getting cleared on an error path.
Per SIP-004 and SIP-005, all of the transactions in the Stacks blockchain will load and store one or more key/value pairs in the blockchain's materialized view. You can think of the materialized view as a hash table, but structured in a way that multiple conflicting views ("forks") can be represented at once, efficiently. As in a lot of hash table implementations, the key/value pairs for a block are represented internally as a sparse hash array-mapped trie (HAMT) -- in particular, we use a sparse adaptive-radix trie (ART), implemented as a HAMT. We extended our ART to be authenticated as a "Merklized" ART (MART) -- that is, each node in the trie contains the cryptographic hash of its children, and the root hash represents all of the key/value pairs inserted.
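To illustrate the "Merklized" part, here is a toy sketch of a trie node whose hash commits to its children's hashes, so the root hash commits to every inserted key/value pair. The real node layout (ART node sizes, path compression, serialization) is specified in SIP-004; the types and hashing order below are illustrative assumptions, again using the `sha2` crate.

```rust
use sha2::{Digest, Sha512_256};

enum TrieNode {
    Leaf { path: Vec<u8>, value: Vec<u8> },
    Branch { path: Vec<u8>, children: Vec<TrieNode> },
}

impl TrieNode {
    /// Hash of this node: leaves hash their path and value; branches hash
    /// their path plus the hash of every child, recursively. Changing any
    /// key/value pair anywhere below changes the root hash.
    fn hash(&self) -> [u8; 32] {
        let mut h = Sha512_256::new();
        match self {
            TrieNode::Leaf { path, value } => {
                h.update(path);
                h.update(value);
            }
            TrieNode::Branch { path, children } => {
                h.update(path);
                for child in children {
                    h.update(child.hash());
                }
            }
        }
        h.finalize().into()
    }
}
```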
A MART only represents the key/value pairs modified by a single block. To implement a data structure that can represent all key/value pairs modified by an entire fork of blocks, we extended the MART to create a novel data structure called a Merklized Adaptive-Radix Forest (MARF). When you get a proof back from the RESTful API endpoint, what you are really getting back is a Merkle proof from the Stacks node's MARF that a particular key is set to the returned value in the fork represented by the chain tip's state root.
There are several additional operations defined on MARFs that make them different from most tries -- specifically, these new operations make them amenable to producing asymptotically-small proofs (i.e. O(log^2 N) time and space for N blocks in the fork). One such operation is an extension of the root hash calculation for a particular block's trie. In addition to hashing the root node's children, the MARF will hash a geometric series of prior trie root hashes in the block's fork to produce the final root hash. So, if the node is processing block 100, it will first calculate its trie, calculate its trie's root hash, and hash that with the root hashes for the block's 1st, 2nd, 4th, 8th, 16th, 32nd, and 64th ancestors to produce the final root hash. This final root hash is the block's "state root" field, referenced in the error message. This way, the state root represents not only its own block's key/value pairs, but also the key/value pairs inserted by all prior blocks in the blockchain. The state root is the tip of an authenticated skip-list over its ancestor tries, and is used to generate a proof that not only is a particular key set to a particular value in the fork, but also that the value is the latest such value to be set for that key.
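A rough sketch of that state root calculation, assuming ancestor trie root hashes can be looked up by block height; the function names and exact hashing order are assumptions for illustration, not the node's actual code.

```rust
use sha2::{Digest, Sha512_256};

/// Heights of the 1st, 2nd, 4th, 8th, ... ancestors of `height`, for as long
/// as such ancestors exist (the genesis block has none).
fn ancestor_heights(height: u64) -> Vec<u64> {
    let mut heights = Vec::new();
    let mut back = 1u64;
    while back <= height {
        heights.push(height - back);
        back *= 2;
    }
    heights
}

/// Combine this block's trie root hash with the geometric series of ancestor
/// trie root hashes to produce the block's state root.
fn state_root(
    trie_root_hash: [u8; 32],
    height: u64,
    lookup_ancestor_root: impl Fn(u64) -> [u8; 32],
) -> [u8; 32] {
    let mut h = Sha512_256::new();
    h.update(trie_root_hash);
    for ancestor_height in ancestor_heights(height) {
        h.update(lookup_ancestor_root(ancestor_height));
    }
    h.finalize().into()
}
```

Because the series is geometric, each state root commits to only O(log N) ancestor root hashes, which is what makes the authenticated skip-list (and the O(log^2 N) proofs) possible.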
In implementing the MARF, we noticed during performance benchmarks that this last step of loading up the ancestor tries' root hashes could get expensive in practice. This is because reading each ancestor trie from disk is a random-access read, which means that each read is likely to incur the cost of a disk seek (if you have an SSD, you won't notice this). The root hash calculation needs to be performed at least once when processing a block, and is in practice calculated multiple times as more and more keys are inserted. To speed this operation up, the internal trie storage system caches the currently-processing trie's ancestor hashes. Therein lies a danger: this cache must ALWAYS be cleared once the current trie is no longer under consideration. Otherwise, the wrong state root hash will be calculated, since the wrong ancestor hashes will have been used.
We forgot to clear this trie root ancestor cache on an error path that is executed when processing a truly-invalid block. This led to the observed bug.
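In outline, the invariant and the bug look roughly like this -- a simplified sketch with assumed type and method names, not the actual trie storage module:

```rust
use std::collections::HashMap;

type BlockId = [u8; 32];
type TrieHash = [u8; 32];

#[derive(Default)]
struct TrieStorage {
    /// Ancestor trie root hashes cached for the block currently being built.
    ancestor_hash_cache: HashMap<BlockId, Vec<TrieHash>>,
}

impl TrieStorage {
    /// Success path: the block was valid and its trie is persisted.
    fn commit_block(&mut self, _block: BlockId) {
        // ... write the trie to disk ...
        self.ancestor_hash_cache.clear();
    }

    /// Error path: the block was invalid and its trie is discarded.
    fn rollback_block(&mut self, _block: BlockId) {
        // ... drop the uncommitted trie ...
        // This is the call that was missing: without it, the next block to be
        // processed can pick up another fork's ancestor hashes and compute the
        // wrong state root.
        self.ancestor_hash_cache.clear();
    }
}
```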
This bug went unnoticed for months, which is also surprising! It turns out that the bug will only manifest under a particular sequence of events that simply hadn't been happening in either our test suite or the testnet. Specifically, this bug only manifests if the Stacks node processes a truly invalid block, and then processes a non-sibling of that block. This will cause the trie ancestor root hash cache to be poisoned with stale data, which in turn causes all subsequent blocks to be considered invalid. That is, the bug does not show up if the Stacks node processes an invalid block, and then processes a valid sibling immediately afterward.
The reason it doesn't show up when a valid sibling is the next block to be processed is obvious in hindsight: the sibling will have the same ancestor trie hashes. So, it wouldn't be affected by using stale cached data. Moreover, on successfully processing the sibling, the Stacks node would clear the cache as expected.
We spent 5 days working on literally nothing else while trying to pin down how this problem was happening.
The first hurdle we had to overcome was that the Stacks node doesn't store invalid blocks; it discards them. So, Aaron instrumented the Stacks node to log its invalid blocks so he could try to attach them to a local copy of the chainstate. One testnet reboot (!!) later, and we had a sample testcase.
The first breakthrough in diagnosing this problem was that -- surprise, surprise! -- if you stop the node and make it re-process the supposedly-invalid block, the node decides that it's suddenly valid! In retrospect, the trie root hash ancestor cache would have been cold when this happened, so the bug wouldn't have manifested. But we didn't know that was the culprit at the time, so we were left scratching our heads trying to figure out how the production node could behave differently when run locally on the exact same chainstate on the exact same OS. This was our first clue that whatever the trigger was, it had to be due to non-persisted state.
The next logical step was to attach gdb
to the production system once the bug
manifested, and walk through the process memory and see if anything looked amiss. This
was much harder than we anticipated. Here are the fun lessons learned when doing
this:
- We run the Argon master Stacks node in a Kubernetes pod, which "helpfully" disables `ptrace(2)` by default. So, of course `gdb` couldn't attach when run from within the pod. We had to update the pod configuration, restart the testnet (!!), and wait for the bug to manifest again.
- We statically link the Stacks node to musl, because it is smaller than its more widely-used cousin, glibc, and unlike `glibc`, `musl` actually statically links. But for reasons we do not yet understand, even when compiling the Stacks node in `--release` mode (so, no debug symbols by default), it's still possible to attach `gdb` to the node and see some actual frame data! We saw something like this:
... snip ...
Thread 5 (LWP 15):
#0 0x0000000000fdf806 in __syscall ()
#1 0x0000000000fde7d7 in nanosleep ()
#2 0x0000000000000000 in ?? ()
Thread 4 (LWP 14):
#0 0x0000000000fd4511 in epoll_pwait ()
#1 0xffffffffffffffff in ?? ()
#2 0x0000000000000001 in ?? ()
#3 0x0000000000dc7a84 in mio::sys::unix::epoll::Selector::select ()
#4 0x0000000000dc60dc in mio::poll::Poll::poll1 ()
#5 0x0000000000dc5d53 in mio::poll::Poll::poll ()
#6 0x0000000000abbd31 in blockstack_lib::net::poll::NetworkState::poll ()
#7 0x00000000008dc729 in blockstack_lib::net::p2p::PeerNetwork::run ()
#8 0x00000000004b4e60 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#9 0x0000000000570a5b in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#10 0x0000000000fafc5a in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#11 <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#12 std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:87
#13 0x0000000000fdc995 in start ()
#14 0x0000000000203000 in ?? ()
#15 0x0000000000fdd953 in __clone ()
#16 0x0000000000000000 in ?? ()
Thread 3 (LWP 13):
#0 0x0000000000fdf806 in __syscall ()
#1 0x0000000000fe039f in __timedwait_cp ()
#2 0x0000000000000000 in ?? ()
... snip ...
Seeing that at least one thread had what appeared to
be the usual stack trace, Jude assumed that debug
symbols were being included, and that gdb
just couldn't read them in all
threads for some reason (Rust is a comparatively new language, so it's expected
that the debugging tooling is a bit rough around the edges).
The `file` program confirmed that the binary was not stripped. It wasn't until days later that we discovered that this, in fact, was not supposed to be happening at all.
- For reasons we still have not uncovered, compiling with `--debug` and statically linking against `musl` makes no difference in what stack trace we see. But the thing is, this isn't obvious up front -- it only became obvious after a few testnet reboots (!!) with a few different build options. This problem seems to depend on something in the (remote) build environment -- we tested for the ability to backtrace threads on a local Alpine Linux build (which uses `musl`), and found that the resulting binary, built with the very same link flags, had all the proper debug symbols!
- We gave up on using `musl` for the time being and switched over to using Debian stretch as our container base image, which uses `glibc`. We were pleasantly surprised to see that debug output actually worked this time! But, it turns out that you need a much later version of `gdb` than what's on Debian stretch -- specifically, you'll want `gdb` 9.2 or later. Without this later version, you cannot read most Rust data types (not even the common `Option<T>`), which prevented us from walking through the process memory.
- When you build a Rust binary with debug symbols, Rust eagerly complies. It complies to such a high degree of fidelity that when we could finally run `t a a bt` (`thread apply all backtrace`) in `gdb` on the production master node, it OOM'ed the Kubernetes pod and forced another testnet reboot (!!). The next time the bug manifested, we just produced a core file. That core file was a whopping 800MB!
- To make things more interesting, you can't just `apt-get install` a later version of `gdb` on Debian stretch. Doing so also pulls in a newer `glibc` and friends, which (unsurprisingly) have different debug symbols than the Stacks node we're trying to debug. So we got to experience first-hand the fun of configuring and building `gdb` from source (N.B.: unlike 99.9% of all Unix-like software you'd ever compile, the usual `./configure && make` does NOT work for `gdb` -- you have to run `mkdir build && cd build && ../configure && make`. This is, of course, not documented).
- Several testnet reboots later, we finally had a core file for the broken binary. Let the debugging commence!
In parallel to our gdb
saga, we were making two additional efforts to reproduce the bug locally.
First, we had just merged a big refactor
PR as part of
implementing PoX, and we had just merged a new feature that made some pretty
heavy modifications to the MARF code in order to represent unconfirmed
microblock state.
Naturally, we were initially very suspicious that one of these PRs was the
likely cause -- especially the unconfirmed microblock state PR, given all the
modifications it made to the MARF code. But try as we might, we could not
construct a test case that led to the problem, no matter how much we messed
around with leaving stale unconfirmed state around in the indexes or
reprocessing it at odd or unexpected times. We thought at one point that there
may have been a network-triggered race condition that could lead to MARF
corruption if the unconfirmed state view was being processed at the same time as
a block was being processed, but it turned out that this was not possible in the
current design.
Second, Jude had extended Aaron's initial work on replaying one block on a local chainstate by modifying it so the node could replay the entire chainstate from scratch. The thinking at the time was that while the problem was only triggered by non-persistent state, perhaps it could be triggered by replaying the exact same blocks that led up to the first instance of a valid block being declared invalid. Much to our dismay, this did not work -- the supposedly-invalid block was found to be valid, even though all the same blocks were processed leading up to the bug.
But at least now that we FINALLY had a core file in hand and a gdb
that could read
it, we could determine for sure what was going on. There are three pieces of
state in the trie storage module that could affect the trie root hash
calculation: the currently-uncommitted trie that's being worked on while the
block is processing, the cache of all known trie root hashes in RAM, and
finally, the ancestor trie root hash cache.
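For orientation, those three pieces of state can be pictured roughly like this (field names are assumptions for illustration, not the actual module's):

```rust
use std::collections::HashMap;

type BlockId = [u8; 32];
type TrieHash = [u8; 32];

/// Placeholder for the in-RAM trie being built for the block under processing.
struct UncommittedTrie;

/// Roughly the trie storage state we inspected in the core file.
struct TrieStorageState {
    /// 1. The currently-uncommitted trie for the block being processed.
    uncommitted_trie: Option<UncommittedTrie>,
    /// 2. The cache of all known trie root hashes, by block identifier.
    root_hash_cache: HashMap<BlockId, TrieHash>,
    /// 3. The ancestor trie root hash cache -- the eventual culprit.
    ancestor_root_hash_cache: HashMap<BlockId, Vec<TrieHash>>,
}
```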
Looking at this through `gdb`, it was clear that the first two pieces of state
looked fine. Interestingly, a piece of state within the currently-uncommitted
trie looked like it was corrupt (it was a hash with a surprising number of zero
bytes), but we tentatively attributed that to gdb
not doing a perfect job of
parsing Rust data structures in RAM. We went with this intuition because we don't use
any unsafe
code in our codebase, and any memory corruption bugs from unsafe
code in
our dependencies that could have caused this were thought to be unlikely
culprits, because the node kept producing the same kinds of errors over and over
again (and no other memory-corruption-related errors).
The thing that drew our attention to the ancestor trie root hash cache being the likely culprit was that (a) the cache was still filled even though no block was being processed, and (b) the single key of the cache -- the block identifier of the trie whose ancestor hashes are loaded -- was set to a placeholder value that is used internally to indicate that the node does not yet know the block hash. Finding this value by itself was unsurprising -- when building up the trie for a block's key/value pairs, the miner uses this placeholder block hash because it does not yet know the block's true hash. Similarly, the Stacks chain state uses this placeholder block hash when processing the block in order to derive the same state root that the miner derived. The fact that the cache was still set for the placeholder block hash was the ultimate clue to unraveling this mystery: the cache could only be set this way if the node had been in the middle of processing a new block, but stopped before the block's trie was committed. This could only happen if the block was invalid.
This gave Aaron an idea: could we start to build a MARF trie, roll it back (simulating an invalid block), and then try to build a different MARF trie with the same placeholder block identifier but a different set of ancestors, and manage to reach the same trie hash as would have been reached had the roll-back not happened? Turns out, this was not the case -- the hashes were different, depending on the presence of the roll-back! We found the bug!
Jude confirmed this by building a test-case that mined an invalid block, and then processed a non-sibling block. Lo and behold, the bug was finally duplicated locally!
Five grueling days later, the fix was only three lines of code -- just clear the trie root ancestor cache in the two rollback error paths, and do it more aggressively on the success path.
In summary, we will be adopting a few changes to our practices going forward:
- Make sure `gdb` works and is accessible in the testnet pods before jumping to conclusions about the presence (or corruption) of debugging symbols. Had we done this, we could have avoided several testnet reboots.
- Make it so the Stacks node keeps a copy of all invalid blocks as well as valid blocks (but only as a config option), so the full chainstate can be re-played. Had we done this, we would have caught the problem a day or two earlier.
- Make all `panic!()`s produce a core dump right then and there, for good measure. This wouldn't have helped us this time, but a similar lesson of "produce a core file first, ask for a backtrace later" would have saved us at least one reboot.
- Don't be so sure that your recent code is the cause of a newly-manifested bug.