jcnelson/postmortem-2.05.0.3.0.md Secret

## postmortem-2.05.0.3.0.md

      
    Raw
  

              postmortem-2.05.0.3.0.md
            
          
    The 2.05.0.3.0 release fixes a chain-killing denial-of-service vulnerability. While this is not the first denial-of-service vulnerability we have encountered, this one was particularly bad in that an exploit in the wild could have led to a chain split.  The only way this could have been avoided is to get a majority of mining power to upgrade to 2.05.0.3.0 before the vulnerability could be exploited.
Vulnerability Details

The bug was in two places in the Clarity abstract syntax tree (AST) parser.  First, let's look at the top-level AST builder in clarity/src/vm/ast/mod.rs:
pub fn build_ast<T: CostTracker>(
    contract_identifier: &QualifiedContractIdentifier,
    source_code: &str,
    cost_track: &mut T,
) -> ParseResult<ContractAST> {
    runtime_cost(
        ClarityCostFunction::AstParse,
        cost_track,
        source_code.len() as u64,
    )?;
    let pre_expressions = parser::parse(source_code)?;
    let mut contract_ast = ContractAST::new(contract_identifier.clone(), pre_expressions);
    StackDepthChecker::run_pass(&mut contract_ast)?;
    ExpressionIdentifier::run_pre_expression_pass(&mut contract_ast)?;
    DefinitionSorter::run_pass(&mut contract_ast, cost_track)?;
    TraitsResolver::run_pass(&mut contract_ast)?;
    SugarExpander::run_pass(&mut contract_ast)?;
    ExpressionIdentifier::run_expression_pass(&mut contract_ast)?;
    Ok(contract_ast)
}
Like most interpreters, the act of parsing a program into an AST is the act of running one or more parser passes on it to build out the tree structure, which will be later walked by the interpreter to execute the program.  Pertinent to the vulnerability, the input to this function is untrusted -- it comes from a smart contract transaction, and it can contain arbitrary data.  It is of paramount importance, then, that this function behave correctly for all inputs; failure to do so can lead to a node crash.
This brings us to our first problem.  Notice that build_ast() calls parser::parse() before it calls StackDepthChecker::run_pass().  The StackDepthChecker struct examines the constructed AST and verifies that no Clarity structure contains no more than 69 levels of nesting [1].  So for example, you cannot have a (list 1 (list 1 (list 1 ...))) that contains 70 or more lists.  And therein lies the problem: this check is performed after the AST has already been parsed!  So, you could have a transaction that contained a very, very, very deep nesting of lists, optionals, responses, and/or tuples, and the parser will parse it in full before erroring out.
This created a stack overflow vulnerability in the Stacks node.  The function parser::parse() is recursive, so given a deep enough nesting, a malicious transaction author could cause the thread parsing the AST to overflow its stack, causing it to crash.
The other problem is in the stack depth checker itself, in clarity/src/vm/ast/stack_depth_checker.rs:
use crate::vm::MAX_CALL_STACK_DEPTH;

// allow  the AST to get deeper than the max call stack depth,
//    but not much deeper (things like tuples would increase the
//    AST depth, without impacting the stack depth).
pub const AST_CALL_STACK_DEPTH_BUFFER: u64 = 5;

fn check(args: &[PreSymbolicExpression], depth: u64) -> ParseResult<()> {
    if depth >= (AST_CALL_STACK_DEPTH_BUFFER + MAX_CALL_STACK_DEPTH as u64) {
        return Err(ParseErrors::ExpressionStackDepthTooDeep.into());
    }
    for expression in args.iter() {
        match expression.pre_expr {
            List(ref exprs) => check(exprs, depth + 1),
            _ => {
                // Other symbolic expressions don't have depth
                //  impacts.
                Ok(())
            }
        }?;
    }
    Ok(())
}

pub struct StackDepthChecker;

impl BuildASTPass for StackDepthChecker {
    fn run_pass(contract_ast: &mut ContractAST) -> ParseResult<()> {
        check(&contract_ast.pre_expressions, 0)
    }
}
Notice the match statement in the check() function.  Only List(..) type variants are considered in the stack depth check.  This applies to function calls, lists, optionals, and responses.  But, it does not apply to tuples!  So, an expression like (list 1 { a: { a: { a: ... } } }) for an arbitrarily deeply-nested inner tuple would pass this check, and could get mined.
What would have happened if a transaction like this got mined?  If it contained a very, very, very deep structured nesting, it would have caused all miners who attempted to process it to crash.  This would have caused the chain to stop processing blocks.  This would have been the best-case scenario if this exploit was triggered in the wild, because then at least no blocks would ever be mined that contained this super-deeply-nested transaction.
However, it could have been worse: some miners could have continued to mine blocks, while others crashed.  The exact number of levels of nesting that will trigger a crash is non-deterministic, because different hosts will have different stack sizes.  Moreover, the same host can have different stack sizes for different instances of the same node due to ASLR.  If someone crafted a deeply-nested transaction that some miners could process, then that deeply-nested transaction would have become part of the blockchain's canonical fork, and all other nodes would have to process it.  This would have likely resulted in many follower nodes crashing, since they would be need to process this transaction as well.
The Fix

At a minimum, the fix needed to (1) limit the recursion depth of parser::parse() so that it would never even parse a structure with AST_CALL_STACK_DEPTH_BUFFER + MAX_CALL_STACK_DEPTH levels of nesting, and (2) fix the stack depth checker so that it rejects too-deeply-nested tuples.  But deploying these fixes is easier said than done.
Recall that in the Stacks blockchain, the block "size" is really a five-dimensional vector that measures bytes read, bytes written, number of reads, number of writes, and "runtime" -- a catch-all for "work" that the Clarity VM does to evaluate the code.  Parsing a contract consumes runtime, and does so in an incremental fashion -- for example, each time parser::parse() recurses, a runtime cost is assessed.  The reason for this behavior is to make sure that miners are able to get compensated for attempting to process malformed smart contracts.  A transaction with a malformed smart contract can be included in a block (so the miner can get the transaction fee), but all nodes must measure the runtime it took to determine that the smart contract code was malformed as part of the block capacity.
This fact precludes the possibility of making a point-release of the node software that simply stops processing a too-deeply-nested smart contract once the maximum structure depth is reached.  If the node did this, then it would calculate a different (smaller) runtime cost for processing this transaction than the nodes running the unpatched version, leading to a chain split.  This is because patched nodes' parser::parse() functions would error out earlier than unpatched nodes' parser::parse() functions, meaning that they would consume less runtime due to having executed fewer parsing steps (in particular, fewer recursive calls).
To address this problem, a majority of miner nodes would need to agree to never include these kinds of problematic transactions in blocks, even though they are valid under the current consensus rules.  To do this, a miner would need to first inspect each smart contract transaction to see if it contained a too-deeply-nested structure using a patched AST parser, and if so, drop it from its mempool.  The difficulty here is that all miners would need to agree to enforce this rule at the same time -- i.e. the same Bitcoin block height -- thereby constituting a soft fork.
To de-risk the chance of a problematic transaction finding its way into the blockchain before the soft fork took effect, we needed to additionally patch the node's relay logic.  The relay logic for 2.05.0.3.0 would never relay a problematic transaction regardless of whether or not the soft fork was in effect.  In addition, the relay logic would never relay a block or microblock if it contained a problematic transaction after the soft fork took effect.
We timed the release of 2.05.0.3.0 such that the soft fork would take effect about a week afterwards -- at Bitcoin block 752,000.  During this week, Hiro PBC and the Stacks Foundation tried to contact as many miners as we knew of (and we don't know everyone who mines) to encourage them to upgrade.  In addition, we encouraged every exchange and wallet provider we had contacts for to upgrade.  This included emails, Twitter DMs, and in some cases, sending a STX transfer transaction with a memo field containing a human-readable message encouraging the recipient to upgrade.
In the end, a majority of miners had upgraded without a problematic transaction ever getting mined.  At the time of this writing, 100% of miners have upgraded.  You can tell by looking at the version field of a Stacks block or microblock:  if it is 0x01, then the miner that produced it was running the 2.05.0.3.0 release (and if it was 0x00, it was running a prior release).
Timeline

August 18, 2022 4:48pm ET:  The bug report and PoC are filed on the Stacks Immunefi account.  The PoC is targeted at the Clarinet devnet.
August 18, 2022 5:03pm ET:  Immunefi closes the bug report as out-of-scope, because it is against the Clarinet devnet and not the blockchain mainnet.
August 18, 2022 9:49pm ET:  The Stacks Foundation identifies the problem as in-scope because the Clarinet devnet and mainnet share the same Clarity VM implementation.  The bug report and PoC are escalated to all blockchain engineers at the Stacks Foundation, Hiro PBC, and Trust Machines.  The bug report on Immunefi is re-opened and escalated.
August 19, 2022 12:03am ET:  An initial patch is privately circulated amongst the blockchain engineers that takes a first stab at implementing the soft fork.
August 19, 2022 10:05am ET:  A private git repository is set up for blockchain engineers to iterate on the patch and test it.  Several iterations are made on the patch.
August 23, 2022 11:27pm ET:  The patch begins its final review.  Artifacts from this patch are used to upgrade infrastructure at the various Stacks entities, including the Foundation, Hiro PBC, Trust Machines, Daemon Technologies, and Xverse.
August 27, 2022 9:33am ET:  A node at the Foundation is confirmed to have been able to boot from genesis with the patch applied, and reach the same chain tip as the rest of the network.
August 29, 2022 3:20pm ET:  A signed release of the patched artifacts are uploaded to a public but undisclosed Amazon S3 bucket and shared with miners and exchange partners for which any of the aforementioned entities have contact information.
August 31, 2022 1:39pm ET:  A private release of 2.05.0.3.0 is created.
August 31, 2022 2:03pm ET:  The 2.05.0.3.0 release pull request is made public on the Stacks blockchain Github project.
August 31, 2022 2:32pm ET:  The 2.05.0.3.0 release is merged to master.
August 31, 2022 4:16pm ET:  A public, signed announcement from announce@stacks.org is made to officially announce the 2.05.0.3.0 release.
Footnotes

[1] 64 nestings per MAX_CALL_STACK_DEPTH, plus a buffer of 5 per AST_CALL_STACK_DEPTH_BUFFER.