Yesterday - 11th November, 2020 - a consensus issue was (deliberately) triggered on the Ethereum network. Opposed to the usual way these play out however, this consensus issue was not between different clients, rather between different versions of the same client, namely Geth.
Geth v1.9.7 (released 7th November, 2019) broke the EIP-211 implementation, whereby a memory area was shallow-copied, allowing it to be overwritten out of bounds. The bug was reported by John Youngseok Yang on the 15th July, 2020 and was silently fixed and shipped 5 days later in Geth v1.9.17 (20th July, 2020). This fix brought Geth back into consensus with Besu, Nethermind and OpenEthereum (and the Ethereum specification itself); however it broke consensus with earlier Geth releases.
Unfortunately not all node operators were running recent releases and yesterday morning a transaction managed to trigger the consensus issue, splitting old Geth releases off from the rest of the network. This became a larger issue as Infura was one of the affected parties, hence taking with them their client base.
There was some backlash on Twitter, revolving around two main themes:
- Why did the Geth team unilaterally make a "consensus upgrade"?
- Why did the Geth team ship this fix silently instead of warning operators?
Both questions are valid, but as always the answers are more nuanced than what fits into a Twitter thread.
Q: Why did the Geth team unilaterally make a "consensus upgrade"?
The Geth team indeed changed the consensus implementation in the v1.9.17 release, however the team did not create any new rules that the Ethereum community didn't know about or agree to. The rules were defined in EIP-211, and agreed to by the community when the network forked over to Byzantium 3 years ago.
If you don't consider accidentally introducing a bug a "consensus upgrade", then you should also not consider fixing the said bug a few months later a "consensus upgrade".
Q: Why did the Geth team ship this fix silently instead of warning operators?
This is a bit of a grey area and requires a case-by-case discussion. We all agree that transparency is king and that we should strive as much as possible towards it, but it's also important to look at all the details before heads start rolling.
Ethereum's consensus code is relatively stable, so the probability of things breaking get smaller as time passes. However, users also expect us to constantly make things faster, which inherently leads to the occasional introduction of new bugs. Fixing these bugs is not hard - in this case, it was 1 line of code - but shipping the fixes raises some interesting questions.
In the classical software world, once a security fix is created, a platform operator can update all their nodes, or a software vendor can push out the update to all their clients. This minimizes the time window in which an attacker who learns about the bug can abuse it. (Eg. The OpenSSL Heartbleed bug was patched by pretty much all the internet giants in their local infrastructure before anyone even told the public about it).
In the case of Ethereum, it takes a lot of time (weeks, months) to get node operators to update even to a scheduled hard fork. Highlighting that a release contains important consensus or DoS fixes always runs the risk of someone trying to beat updaters to the punch line and taking the network down. Security via obscurity is definitely not something to aim for, but delaying a potential attack by enough to get most node operators immune may be worth the temporary "hit" to transparency.
In this particular instance, the consensus bug was dormant in the code for over 1 year. The probability after all that time for someone to accidentally trigger it is tiny. Opposed to that, the probability of someone maliciously triggering it if highlighted as a security issue is not insignificant. The Geth team made the conscious decision not to mention it, hoping that people eventually upgrade to versions that contain the fix and the issue is gradually ejected from the network.
You might object that "it obviously didn't work".
We'd argue that it actually did work: most nodes have indeed updated and were not affected. From a network health perspective, the strategy worked as intended and the Ethereum network survived without meaningful issues. Certain projects using Infura were impacted, but at the end of the day, the primary goal of the Geth team is the health of the Ethereum network as a whole, individual pieces of it are only secondary.
The decision whether or not to publish details about a serious bug boils down to what the fallout would be in both cases and picking the one where the damage is smaller. Over the past years we've fixed a number of consensus issues never published and helped fix a number of such issues in Nethermind, Besu and Parity, part of them never published. In all these cases, avoiding the limelight allowed the network to seamlessly evolve without running the risk of attacks and without keeping node operators in a constant state of emergency that they need to do immediate updates.
We definitely don't condone doing this for all bugs - and we ourselves published a number of releases where we emphasized their emergency - but at certain times, it's better to remain silent as shown by other projects too such as Monero, ZCash and Bitcoin.
This particular silent consensus fix took an unexpected turn with yesterday's network split, but retrospectively we still believe it was the right call. We understand that we cannot expect operators to always immediately update to the latest releases and appreciate the understanding that our vulnerability management and release structure has nuances unique to a blockchain ecosystem.
Speaking from the perspective of a Zcash dev: one reason we're less vulnerable to problems like this is that any given zcashd node version will shut down approximately 16 weeks after its release, according to block height. Everyone in the ecosystem knows this and can plan upgrades around it. (We don't yet have multiple consensus-compatible node implementions, but we will do fairly soon.)
In fact we have had a small number of accidentally introduced consensus changes in zcashd that never triggered a fork because all of the affected node versions reached "End-of-Service halt" without that happening. Importantly, we knew exactly when the affected nodes would reach End-of-Service halt, and could make decisions about how to handle a given bug on that basis.
We also rely on this 16-week period to coordinate network upgrade timing (for example, zcashd 3.1.0 and earlier nodes that do not support the upcoming Canopy upgrade will EoS-halt before that upgrade). Our network upgrades are bilateral hard forks.
Please take this just as a suggestion and an explanation of what Zcash and zcashd does. I'm aware that it would be a major policy change for Ethereum node implementations, and has disadvantages as well as advantages.