karalabe/post-mortem.md Secret

## post-mortem.md

      
    Raw
  

              post-mortem.md
            
          
    Geth v1.9.17 Post Mortem

Yesterday - 11th November, 2020 - a consensus issue was (deliberately) triggered on the Ethereum network. Opposed to the usual way these play out however, this consensus issue was not between different clients, rather between different versions of the same client, namely Geth.
Geth v1.9.7 (released 7th November, 2019) broke the EIP-211 implementation, whereby a memory area was shallow-copied, allowing it to be overwritten out of bounds. The bug was reported by John Youngseok Yang on the 15th July, 2020 and was silently fixed and shipped 5 days later in Geth v1.9.17 (20th July, 2020). This fix brought Geth back into consensus with Besu, Nethermind and OpenEthereum (and the Ethereum specification itself); however it broke consensus with earlier Geth releases.
Unfortunately not all node operators were running recent releases and yesterday morning a transaction managed to trigger the consensus issue, splitting old Geth releases off from the rest of the network. This became a larger issue as Infura was one of the affected parties, hence taking with them their client base.
There was some backlash on Twitter, revolving around two main themes:

Why did the Geth team unilaterally make a "consensus upgrade"?
Why did the Geth team ship this fix silently instead of warning operators?

Both questions are valid, but as always the answers are more nuanced than what fits into a Twitter thread.
Q: Why did the Geth team unilaterally make a "consensus upgrade"?
The Geth team indeed changed the consensus implementation in the v1.9.17 release, however the team did not create any new rules that the Ethereum community didn't know about or agree to. The rules were defined in EIP-211, and agreed to by the community when the network forked over to Byzantium 3 years ago.
If you don't consider accidentally introducing a bug a "consensus upgrade", then you should also not consider fixing the said bug a few months later a "consensus upgrade".
Q: Why did the Geth team ship this fix silently instead of warning operators?
This is a bit of a grey area and requires a case-by-case discussion. We all agree that transparency is king and that we should strive as much as possible towards it, but it's also important to look at all the details before heads start rolling.
Ethereum's consensus code is relatively stable, so the probability of things breaking get smaller as time passes. However, users also expect us to constantly make things faster, which inherently leads to the occasional introduction of new bugs. Fixing these bugs is not hard - in this case, it was 1 line of code - but shipping the fixes raises some interesting questions.
In the classical software world, once a security fix is created, a platform operator can update all their nodes, or a software vendor can push out the update to all their clients. This minimizes the time window in which an attacker who learns about the bug can abuse it. (Eg. The OpenSSL Heartbleed bug was patched by pretty much all the internet giants in their local infrastructure before anyone even told the public about it).
In the case of Ethereum, it takes a lot of time (weeks, months) to get node operators to update even to a scheduled hard fork. Highlighting that a release contains important consensus or DoS fixes always runs the risk of someone trying to beat updaters to the punch line and taking the network down. Security via obscurity is definitely not something to aim for, but delaying a potential attack by enough to get most node operators immune may be worth the temporary "hit" to transparency.
In this particular instance, the consensus bug was dormant in the code for over 1 year. The probability after all that time for someone to accidentally trigger it is tiny. Opposed to that, the probability of someone maliciously triggering it if highlighted as a security issue is not insignificant. The Geth team made the conscious decision not to mention it, hoping that people eventually upgrade to versions that contain the fix and the issue is gradually ejected from the network.
You might object that "it obviously didn't work".
We'd argue that it actually did work: most nodes have indeed updated and were not affected. From a network health perspective, the strategy worked as intended and the Ethereum network survived without meaningful issues. Certain projects using Infura were impacted, but at the end of the day, the primary goal of the Geth team is the health of the Ethereum network as a whole, individual pieces of it are only secondary.
The decision whether or not to publish details about a serious bug boils down to what the fallout would be in both cases and picking the one where the damage is smaller. Over the past years we've fixed a number of consensus issues never published and helped fix a number of such issues in Nethermind, Besu and Parity, part of them never published. In all these cases, avoiding the limelight allowed the network to seamlessly evolve without running the risk of attacks and without keeping node operators in a constant state of emergency that they need to do immediate updates.
We definitely don't condone doing this for all bugs - and we ourselves published a number of releases where we emphasized their emergency - but at certain times, it's better to remain silent as shown by other projects too such as Monero, ZCash and Bitcoin.
This particular silent consensus fix took an unexpected turn with yesterday's network split, but retrospectively we still believe it was the right call. We understand that we cannot expect operators to always immediately update to the latest releases and appreciate the understanding that our vulnerability management and release structure has nuances unique to a blockchain ecosystem.