@jcnelson
Last active July 30, 2020 15:20
Stacks Argon Testnet Post-Mortem

Testnet Outage Post-Mortem

From 19 July through 24 July, the Stacks testnet experienced repeated outages every few hours. The symptom was the recurrence of errors such as the following:

WARN [1595867918.213] [src/chainstate/stacks/db/blocks.rs:2861] [ThreadId(6)] Block 36a6b25fb7dd1775761b80b567425dd886f35272a5cc6a7661df0d99662f9d1c state root mismatch: expected 2fbb24799b4ea08ca50a8a2c75749735a6aa60415758a1394525f789a109067c, got 65fbf51eea07de1a75c5c45373c6aa3c7975e88269948114c0468acbf461cf66

No matter what blocks were mined, and no matter what the block contents were, the Argon master node was rejecting all blocks due to a state root mismatch.

What is a "state root," and how can it be mismatched?

Each Stacks block commits to an authenticated materialized view of the entire blockchain state. This includes all account state, all Clarity smart contracts, all Clarity data maps, tokens, and data variables, as well as metadata about the chain itself. Given a "state root" -- the cryptographic digest over this materialized view -- a light client can validate a node-generated Merkle proof that a particular key is equal to a particular value in that materialized view. For example, you can ask the node for a proof of your balance, and use a recent state root to verify that the balance is correct using that proof. You can see this in action if you go to http://argon-master.blockstack.xyz:20443/v2/accounts/ST1H6RQAK5V5SHQQKJVK3K59ZNPX51V3W035BG9EF. This proof structure, encoded in the proof field, can be efficiently validated against a state root in a recent Stacks block, thereby allowing a light client with the Stacks block headers to query blockchain state without having to process it all. See SIP-004 for details on how this data structure is maintained.
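For concreteness, here is a minimal sketch of fetching that endpoint and pulling out the proof. It assumes the reqwest crate (with its blocking and json features) and serde_json; other than the URL, the JSON field names are assumptions about the response shape, not something taken from this post.

```rust
// Minimal sketch, not the actual Stacks client code. Assumes the `reqwest`
// crate (blocking + json features) and `serde_json`. Field names other than
// the URL are assumptions about the node's JSON response.
use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let url = "http://argon-master.blockstack.xyz:20443/v2/accounts/ST1H6RQAK5V5SHQQKJVK3K59ZNPX51V3W035BG9EF";
    let resp: Value = reqwest::blocking::get(url)?.json()?;

    // The account state itself (balance, nonce, ...).
    println!("account: {}", resp);

    // A light client would decode the proof and verify it against the state
    // root in a recent Stacks block header it already trusts.
    if let Some(proof) = resp.get("balance_proof").or_else(|| resp.get("proof")) {
        println!("proof: {}", proof);
    }
    Ok(())
}
```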

Because the principal use-case of the state root is to implement support for light clients, it is of paramount importance that all Stacks blocks include the state root as it was at the time the block was processed. In fact, a block is not considered valid unless other nodes can process its transactions and independently calculate the block's advertised state root. Of course, if the block has a different state root, the node would log that there was a "state root mismatch" and reject the block.
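In code, that check amounts to comparing the advertised state root against the locally computed one. Here is a minimal sketch of the comparison; the function names are hypothetical, not the actual chain-state code.

```rust
// Sketch of the validation rule: a block is rejected unless the state root a
// node computes locally matches the state root the block advertises.
// Hypothetical names; the real check lives in the Stacks chain-state code.
fn check_state_root(advertised: &[u8; 32], computed: &[u8; 32]) -> Result<(), String> {
    if advertised != computed {
        return Err(format!(
            "state root mismatch: expected {}, got {}",
            to_hex(advertised),
            to_hex(computed)
        ));
    }
    Ok(())
}

fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}
```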

Why was the state root always mismatched?

TL;DR: stale cached data wasn't getting cleared on an error path.

Per SIP-004 and SIP-005, all of the transactions in the Stacks blockchain will load and store one or more key/value pairs in the blockchain's materialized view. You can think of the materialized view as a hash table, but structured in a way that multiple conflicting views ("forks") can be represented at once, efficiently. As in a lot of hash table implementations, the key/value pairs for a block are represented internally as a sparse hash array-mapped trie (HAMT) -- in particular, we use a sparse adaptive-radix trie (ART), implemented as a HAMT. We extended our ART to be authenticated as a "Merklized" ART (MART) -- that is, each node in the trie contains the cryptographic hash of its children, and the root hash represents all of the key/value pairs inserted.
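To make the "Merklized" part concrete, here is a highly simplified sketch of a trie node whose hash commits to its children. It is illustrative only -- the real MARF uses adaptive-radix node types and its own hash domain -- and it assumes the sha2 crate.

```rust
// Simplified sketch of a Merklized trie node -- NOT the actual MARF node
// layout. Each node's hash covers its children's hashes, so the root hash
// commits to every key/value pair beneath it. Assumes the `sha2` crate.
use sha2::{Digest, Sha256};

struct MerklizedNode {
    path: Vec<u8>,           // key bytes consumed at this node
    value: Option<Vec<u8>>,  // leaf value, if any
    children: Vec<[u8; 32]>, // hashes of the child nodes
}

impl MerklizedNode {
    fn hash(&self) -> [u8; 32] {
        let mut hasher = Sha256::new();
        hasher.update(&self.path);
        if let Some(v) = &self.value {
            hasher.update(v);
        }
        for child in &self.children {
            hasher.update(child);
        }
        let digest = hasher.finalize();
        let mut out = [0u8; 32];
        out.copy_from_slice(&digest);
        out
    }
}
```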

A MART only represents the key/value pairs modified by a single block. To implement a data structure that can represent all key/value pairs modified by an entire fork of blocks, we extended the MART to create a novel data structure called a Merklized Adaptive-Radix Forest (MARF). When you get a proof back from the RESTful API endpoint, what you are really getting back is a Merkle proof from the Stacks node's MARF that a particular key is set to the returned value in the fork represented by the chain tip's state root.

There are several additional operations defined on MARFs that make them different from most tries -- specifically, these new operations make them amenable to producing asymptotically-small proofs (i.e. O(log^2 N) time and space for N blocks in the fork). One such operation is an extension of the root hash calculation for a particular block's trie. In addition to hashing the root node's children, the MARF will hash a geometric series of prior trie root hashes in the block's fork to produce the final root hash. So, if the node is processing block 100, it will first build its trie, calculate the trie's root hash, and hash that with the root hashes for the block's 1st, 2nd, 4th, 8th, 16th, 32nd, and 64th ancestors to produce the final root hash. This final root hash is the block's "state root" field, referenced in the error message. This way, the state root represents not only its own block's key/value pairs, but also the key/value pairs inserted by all prior blocks in the blockchain. The state root is the tip of an authenticated skip-list over its ancestor tries, and is used to generate a proof that not only is a particular key set to a particular value in the fork, but also that the value is the latest such value to be set for that key.
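Here is a sketch of that final hashing step. The function names and the use of SHA-256 are assumptions for illustration, not the actual MARF code, but the ancestor selection follows the 1st/2nd/4th/... pattern described above.

```rust
// Sketch of folding a geometric series of ancestor trie root hashes into a
// block's state root. Illustrative only -- names and hash function are
// assumptions, not the real MARF implementation. Assumes the `sha2` crate.
use sha2::{Digest, Sha256};

/// Heights of the ancestors whose trie root hashes get folded in for a block
/// at `height`: its 1st, 2nd, 4th, 8th, ... ancestors.
fn ancestor_heights(height: u64) -> Vec<u64> {
    let mut heights = Vec::new();
    let mut back = 1u64;
    while back <= height {
        heights.push(height - back);
        back <<= 1;
    }
    heights
}

/// The block's "state root": its own trie root hash, hashed together with the
/// root hashes of the selected ancestor tries.
fn state_root(own_trie_root: &[u8; 32], ancestor_roots: &[[u8; 32]]) -> [u8; 32] {
    let mut hasher = Sha256::new();
    hasher.update(own_trie_root);
    for root in ancestor_roots {
        hasher.update(root);
    }
    let digest = hasher.finalize();
    let mut out = [0u8; 32];
    out.copy_from_slice(&digest);
    out
}
```

For a block at height 100, ancestor_heights selects heights 99, 98, 96, 92, 84, 68, and 36 -- its 1st, 2nd, 4th, 8th, 16th, 32nd, and 64th ancestors.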

In implementing the MARF, we noticed during performance benchmarks that this last step of loading up the ancestor tries' root hashes could get expensive in practice. This is because reading each ancestor trie from disk is a random-access read, which means that each read is likely to incur the cost of a disk seek (if you have an SSD, you won't notice this). The root hash calculation needs to be performed at least once when processing a block, and is in practice calculated multiple times as more and more keys are inserted. To speed this operation up, the internal trie storage system caches the currently-processing trie's ancestor hashes. Therein lies a danger: this cache must ALWAYS be cleared once the current trie is no longer under consideration. Otherwise, the wrong state root hash will be calculated, since the wrong ancestor hashes will have been used.

We forgot to clear this trie root ancestor cache on an error path that gets executed when processing a truly-invalid block. This led to the observed bug.
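The failure mode is easiest to see in a stripped-down sketch of the storage layer. All of the names here are hypothetical stand-ins; the real trie storage module is considerably more involved.

```rust
// Stripped-down sketch of the caching bug, with hypothetical names; this is
// not the actual trie storage module.
use std::collections::HashMap;

type BlockId = [u8; 32];

struct TrieStorageSketch {
    /// Ancestor trie root hashes cached for the trie currently being built,
    /// keyed by that trie's block identifier.
    ancestor_root_cache: HashMap<BlockId, Vec<[u8; 32]>>,
}

impl TrieStorageSketch {
    /// Success path: the block was valid and its trie is persisted.
    fn commit_trie(&mut self, _block: &BlockId) {
        // ... write the trie to disk ...
        // The cache is cleared here, so the next block starts fresh.
        self.ancestor_root_cache.clear();
    }

    /// Error path: the block turned out to be invalid and its trie is dropped.
    fn rollback_trie(&mut self, _block: &BlockId) {
        // ... discard the uncommitted trie ...
        // THE BUG: this clear() was missing. The next block processed under
        // the same placeholder block id would then reuse the invalid block's
        // cached ancestor root hashes and compute the wrong state root.
        self.ancestor_root_cache.clear();
    }
}
```

A sibling of the invalid block happens to share the same ancestors, which (as described below) is why only a non-sibling trips over the stale entries.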

How did we detect this problem?

This bug went unnoticed for months, which is also surprising! It turns out that the bug will only manifest under a particular sequence of events that simply hadn't been happening in either our test suite or the testnet. Specifically, this bug only manifests if the Stacks node processes a truly invalid block, and then processes a non-sibling of that block. This will cause the trie ancestor root hash cache to be poisoned with stale data, which in turn causes all subsequent blocks to be considered invalid. That is, the bug does not show up if the Stacks node processes an invalid block, and then processes a valid sibling immediately afterward.

The reason it doesn't show up when a valid sibling is the next block to be processed is obvious in hindsight: the sibling will have the same ancestor trie hashes. So, it wouldn't be affected by using stale cached data. Moreover, on successfully processing the sibling, the Stacks node would clear the cache as expected.

The Things We Tried

We spent 5 days working on literally nothing else while trying to pin down how this problem was happening.

The first hurdle we had to overcome was that the Stacks node doesn't store invalid blocks; it discards them. So, Aaron instrumented the Stacks node to log its invalid blocks so he could try to attach them to a local copy of the chainstate. One testnet reboot (!!) later, and we had a sample testcase.

The first breakthrough in diagnosing this problem was that -- surprise, surprise! -- if you stop the node and make it re-process the supposedly-invalid block, the node decides that it's suddenly valid! In retrospect, the trie root hash ancestor cache would have been cold when this happened, so the bug wouldn't have manifested. But we didn't know that was the culprit at the time, so we were left scratching our heads trying to figure out how the production node would behave differently when run on the exact same chainstate on the exact same OS locally. This was our first clue that whatever the trigger was, it had to be due to non-persisted state.

The next logical step was to attach gdb to the production system once the bug manifested, and walk through the process memory and see if anything looked amiss. This was much harder than we anticipated. Here are the fun lessons learned when doing this:

  • We run the Argon master Stacks node in a Kubernetes pod, which "helpfully" disables ptrace(2) by default. So, of course gdb couldn't attach when run from within the pod. We had to update the pod configuration, restart the testnet (!!), and wait for the bug to manifest again.

  • We statically link the Stacks node to musl, because it is smaller than its more widely-used cousin, glibc, and unlike glibc, musl actually statically links. But for reasons we do not yet understand, even when compiling the Stacks node in --release mode (so, no debug symbols by default), it's still possible to attach gdb to the node and see some actual frame data! We saw something like this:

... snip ...

Thread 5 (LWP 15):
#0  0x0000000000fdf806 in __syscall ()
#1  0x0000000000fde7d7 in nanosleep ()
#2  0x0000000000000000 in ?? ()
                                                                                                                                                               
Thread 4 (LWP 14):                                                                                                                                             
#0  0x0000000000fd4511 in epoll_pwait ()
#1  0xffffffffffffffff in ?? ()
#2  0x0000000000000001 in ?? ()
#3  0x0000000000dc7a84 in mio::sys::unix::epoll::Selector::select ()
#4  0x0000000000dc60dc in mio::poll::Poll::poll1 () 
#5  0x0000000000dc5d53 in mio::poll::Poll::poll ()
#6  0x0000000000abbd31 in blockstack_lib::net::poll::NetworkState::poll ()
#7  0x00000000008dc729 in blockstack_lib::net::p2p::PeerNetwork::run ()
#8  0x00000000004b4e60 in std::sys_common::backtrace::__rust_begin_short_backtrace () 
#9  0x0000000000570a5b in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#10 0x0000000000fafc5a in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#11 <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/5c1f21c3b82297671ad3ae1e8c942d2ca92e84f2/src/liballoc/boxed.rs:1076
#12 std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:87
#13 0x0000000000fdc995 in start ()
#14 0x0000000000203000 in ?? ()
#15 0x0000000000fdd953 in __clone ()
#16 0x0000000000000000 in ?? ()

Thread 3 (LWP 13):
#0  0x0000000000fdf806 in __syscall ()
#1  0x0000000000fe039f in __timedwait_cp ()
#2  0x0000000000000000 in ?? ()

... snip ...

Seeing that at least one thread had what appeared to be the usual stack trace, Jude assumed that debug symbols were being included, and that gdb just couldn't read them in all threads for some reason (Rust is a comparatively new language, so it's expected that the debugging tooling is a bit rough around the edges). The file program confirmed that the binary was not stripped. It wasn't until days later that we discovered that this, in fact, was not supposed to be happening at all.

  • For reasons we still have not uncovered, compiling with --debug and statically linking against musl makes no difference in what stack trace we see. But this isn't obvious up front -- it only became obvious after a few testnet reboots (!!) with a few different build options. The problem seems to depend on something in the (remote) build environment -- we tested the ability to backtrace threads on a locally-built Alpine Linux binary (Alpine uses musl), and found that the resulting binary, built with the very same link flags, had all the proper debug symbols!

  • We gave up on using musl for the time being and switched over to using Debian stretch as our container base image, which uses glibc. We were pleasantly surprised to see that debug output actually worked this time! But, it turns out that you need a much later version of gdb than what's on Debian stretch -- specifically, you'll want gdb 9.2 or later. Without this later version, you cannot read most Rust data types (not even the common Option<T>), which prevented us from walking through the process memory.

  • When you build a Rust binary with debug symbols, Rust eagerly complies. It complies to such a high degree of fidelity that when we could finally run t a a bt (thread apply all bt) in gdb on the production master node, it OOM'ed the Kubernetes pod and forced another testnet reboot (!!). The next time the bug manifested, we just produced a core file. That core file was a whopping 800MB!

  • To make things more interesting, you can't just apt-get install a later version of gdb on Debian stretch. Doing so also pulls in a newer glibc and friends, which (unsurprisingly) have different debug symbols than the Stacks node we're trying to debug. So we got to experience first-hand the fun of configuring and building gdb from source (N.B.: unlike 99.9% of all Unix-like software you'd ever compile, the usual ./configure && make does NOT work for gdb -- you have to run mkdir build && cd build && ../configure && make. This is, of course, not documented).

  • Several testnet reboots later, we finally had a core file for the broken binary. Let the debugging commence!

In parallel to our gdb saga, we were making two additional efforts to reproduce the bug locally. First, we had just merged a big refactor PR as part of implementing PoX, and we had just merged a new feature that made some pretty heavy modifications to the MARF code in order to represent unconfirmed microblock state. Naturally, we were initially very suspicious that one of these PRs was the likely cause -- especially the unconfirmed microblock state PR, given all the modifications it made to the MARF code. But try as we might, we could not construct a test case that led to the problem, no matter how much we messed around with leaving stale unconfirmed state around in the indexes or reprocessing it at odd or unexpected times. We thought at one point that there may have been a network-triggered race condition that could lead to MARF corruption if the unconfirmed state view was being processed at the same time as a block was being processed, but it turned out that this was not possible in the current design.

Second, Jude had extended Aaron's initial work on replaying one block on a local chainstate by modifying it so the node could replay the entire chainstate from scratch. The thinking at the time was that while the problem was only triggered by non-persistent state, perhaps it could be triggered by replaying the exact same blocks that led up to the first instance of a valid block being declared invalid. Much to our dismay, this did not work -- the supposedly-invalid block was found to be valid, even though all the same blocks were processed leading up to the bug.

But now that we FINALLY had a core file in hand and a gdb that could read it, we could determine for sure what was going on. There are three pieces of state in the trie storage module that could affect the trie root hash calculation: the currently-uncommitted trie that's being worked on while the block is processing, the cache of all known trie root hashes in RAM, and finally, the ancestor trie root hash cache.

Looking at this through gdb, it was clear that the first two pieces of state looked fine. Interestingly, a piece of state within the currently-uncommitted trie looked like it was corrupt (it was a hash with a surprising number of zero bytes), but we tentatively attributed that to gdb not doing a perfect job of parsing Rust data structures in RAM. We went with this intuition because we don't use any unsafe code in our codebase, and any memory corruption bugs from unsafe code in our dependencies that could have caused this were thought to be unlikely culprits because the node kept producing the same kinds of errors over and over again (and no other memory-corruption-related errors).

The thing that drew our attention to the ancestor trie root hash cache being the likely culprit was that (a) the cache was still filled even though no block was being processed, and (b) the single key of the cache -- the block identifier of the trie whose ancestor hashes are loaded -- was set to a placeholder value that is used internally to indicate that the node does not yet know the block hash. Finding this value by itself was unsurprising -- when building up the trie for a block's key/value pairs, the miner uses this placeholder block hash because it does not yet know the block's true hash. Similarly, the Stacks chain state uses this placeholder block hash when processing the block in order to derive the same state root that the miner derived. The fact that the cache was still set for the placeholder block hash was the ultimate clue to unraveling this mystery: the cache could only be set this way if the node had been in the middle of processing a new block, but stopped before the block's trie was committed. This could only happen if the block was invalid.

This gave Aaron an idea: could we start to build a MARF trie, roll it back (simulating an invalid block), and then try to build a different MARF trie with the same placeholder block identifier but a different set of ancestors, and manage to reach the same trie hash as would have been reached had the roll-back not happened? Turns out, this was not the case -- the hashes were different, depending on the presence of the roll-back! We found the bug!

Jude confirmed this by building a test-case that mined an invalid block, and then processed a non-sibling block. Lo and behold, the bug was finally duplicated locally!

Five grueling days later, the fix was only three lines of code -- just clear the trie root ancestor cache in the two rollback error paths, and do it more aggressively on the success path.

Lessons Learned

In summary, we will be adopting a few changes to our practices going forward:

  • Make sure gdb works and is accessible in the testnet pods before jumping to conclusions about the presence (or corruption) of debugging symbols. Had we done this, we could have avoided several testnet reboots.

  • Make it so the Stacks node keeps a copy of all invalid blocks as well as valid blocks (but only as a config option), so the full chainstate can be re-played. Had we done this, we would have caught the problem a day or two earlier.

  • Make all panic!()s produce a core dump right then and there, for good measure. This wouldn't have helped us this time, but a similar lesson of "produce a core file first, ask for a backtrace later" would have saved us at least one reboot.

  • Don't be so sure that your recent code is the cause of a newly-manifested bug.

@wileyj commented Jul 29, 2020

Excellent read, thanks for the in-depth analysis!
