Skip to content

Instantly share code, notes, and snippets.

@karalabe
Created November 11, 2020 21:31
Show Gist options
  • Save karalabe/e1891c8a99fdc16c4e60d9713c35401f to your computer and use it in GitHub Desktop.
Save karalabe/e1891c8a99fdc16c4e60d9713c35401f to your computer and use it in GitHub Desktop.
Geth v1.9.17 Post Mortem

Geth v1.9.17 Post Mortem

Yesterday - 11th November, 2020 - a consensus issue was (deliberately) triggered on the Ethereum network. Opposed to the usual way these play out however, this consensus issue was not between different clients, rather between different versions of the same client, namely Geth.

Geth v1.9.7 (released 7th November, 2019) broke the EIP-211 implementation, whereby a memory area was shallow-copied, allowing it to be overwritten out of bounds. The bug was reported by John Youngseok Yang on the 15th July, 2020 and was silently fixed and shipped 5 days later in Geth v1.9.17 (20th July, 2020). This fix brought Geth back into consensus with Besu, Nethermind and OpenEthereum (and the Ethereum specification itself); however it broke consensus with earlier Geth releases.

Unfortunately not all node operators were running recent releases and yesterday morning a transaction managed to trigger the consensus issue, splitting old Geth releases off from the rest of the network. This became a larger issue as Infura was one of the affected parties, hence taking with them their client base.

There was some backlash on Twitter, revolving around two main themes:

  • Why did the Geth team unilaterally make a "consensus upgrade"?
  • Why did the Geth team ship this fix silently instead of warning operators?

Both questions are valid, but as always the answers are more nuanced than what fits into a Twitter thread.

Q: Why did the Geth team unilaterally make a "consensus upgrade"?

The Geth team indeed changed the consensus implementation in the v1.9.17 release, however the team did not create any new rules that the Ethereum community didn't know about or agree to. The rules were defined in EIP-211, and agreed to by the community when the network forked over to Byzantium 3 years ago.

If you don't consider accidentally introducing a bug a "consensus upgrade", then you should also not consider fixing the said bug a few months later a "consensus upgrade".

Q: Why did the Geth team ship this fix silently instead of warning operators?

This is a bit of a grey area and requires a case-by-case discussion. We all agree that transparency is king and that we should strive as much as possible towards it, but it's also important to look at all the details before heads start rolling.

Ethereum's consensus code is relatively stable, so the probability of things breaking get smaller as time passes. However, users also expect us to constantly make things faster, which inherently leads to the occasional introduction of new bugs. Fixing these bugs is not hard - in this case, it was 1 line of code - but shipping the fixes raises some interesting questions.

In the classical software world, once a security fix is created, a platform operator can update all their nodes, or a software vendor can push out the update to all their clients. This minimizes the time window in which an attacker who learns about the bug can abuse it. (Eg. The OpenSSL Heartbleed bug was patched by pretty much all the internet giants in their local infrastructure before anyone even told the public about it).

In the case of Ethereum, it takes a lot of time (weeks, months) to get node operators to update even to a scheduled hard fork. Highlighting that a release contains important consensus or DoS fixes always runs the risk of someone trying to beat updaters to the punch line and taking the network down. Security via obscurity is definitely not something to aim for, but delaying a potential attack by enough to get most node operators immune may be worth the temporary "hit" to transparency.

In this particular instance, the consensus bug was dormant in the code for over 1 year. The probability after all that time for someone to accidentally trigger it is tiny. Opposed to that, the probability of someone maliciously triggering it if highlighted as a security issue is not insignificant. The Geth team made the conscious decision not to mention it, hoping that people eventually upgrade to versions that contain the fix and the issue is gradually ejected from the network.

You might object that "it obviously didn't work".

We'd argue that it actually did work: most nodes have indeed updated and were not affected. From a network health perspective, the strategy worked as intended and the Ethereum network survived without meaningful issues. Certain projects using Infura were impacted, but at the end of the day, the primary goal of the Geth team is the health of the Ethereum network as a whole, individual pieces of it are only secondary.

The decision whether or not to publish details about a serious bug boils down to what the fallout would be in both cases and picking the one where the damage is smaller. Over the past years we've fixed a number of consensus issues never published and helped fix a number of such issues in Nethermind, Besu and Parity, part of them never published. In all these cases, avoiding the limelight allowed the network to seamlessly evolve without running the risk of attacks and without keeping node operators in a constant state of emergency that they need to do immediate updates.

We definitely don't condone doing this for all bugs - and we ourselves published a number of releases where we emphasized their emergency - but at certain times, it's better to remain silent as shown by other projects too such as Monero, ZCash and Bitcoin.

This particular silent consensus fix took an unexpected turn with yesterday's network split, but retrospectively we still believe it was the right call. We understand that we cannot expect operators to always immediately update to the latest releases and appreciate the understanding that our vulnerability management and release structure has nuances unique to a blockchain ecosystem.

@chejazi
Copy link

chejazi commented Nov 12, 2020

Over the past years we've fixed a number of consensus issues never published

It's troubling to think that there may be undisclosed, potentially consensus-breaking issues running in other versions of Geth. Are there a non-trivial number of nodes online facing similarly undisclosed issues as we speak? We don't know!

avoiding the limelight allowed the network to seamlessly evolve without running the risk of attacks

Just because it didn't happen on other occasions doesn't mean the risk doesn't exist. A consensus failure is an attack on the (health of the) network. The risk is always there, as shown by the events that just unfolded.

@rockstardev
Copy link

So you're saying that all we need to do in order to split the Ethereum network:

  1. Follow "consensus fixes force pushed to production" you guys do
  2. Give it a bit time considering you leave old vulnerable versions without any notice for 5+ months
  3. We send in transaction that exploits the problem, and then dump double spends on those that haven't upgraded

I'm in awe of your l33t engineering 👌

@alexkroeger
Copy link

alexkroeger commented Nov 12, 2020

Appreciate the transparency here. If an update fixes a consensus bug, however, it seems worth it to reach out quietly to major node providers (Infura, Alchemy) to make sure they upgrade in a timely manner.

If Infura was aware of the implications of the fix, it's hard to imagine they would wait a year to update their Geth version.

@daira
Copy link

daira commented Nov 12, 2020

Speaking from the perspective of a Zcash dev: one reason we're less vulnerable to problems like this is that any given zcashd node version will shut down approximately 16 weeks after its release, according to block height. Everyone in the ecosystem knows this and can plan upgrades around it. (We don't yet have multiple consensus-compatible node implementions, but we will do fairly soon.)

In fact we have had a small number of accidentally introduced consensus changes in zcashd that never triggered a fork because all of the affected node versions reached "End-of-Service halt" without that happening. Importantly, we knew exactly when the affected nodes would reach End-of-Service halt, and could make decisions about how to handle a given bug on that basis.

We also rely on this 16-week period to coordinate network upgrade timing (for example, zcashd 3.1.0 and earlier nodes that do not support the upcoming Canopy upgrade will EoS-halt before that upgrade). Our network upgrades are bilateral hard forks.

Please take this just as a suggestion and an explanation of what Zcash and zcashd does. I'm aware that it would be a major policy change for Ethereum node implementations, and has disadvantages as well as advantages.

@zack-bitcoin
Copy link

A soft fork doesnt trigger dos like vitalik is saying.
Only a poorly implemented soft fork has that flaw.

If a tx appears that exploits the bug, the miners should save a copy of that tx, and gossip it to other miners. So the miners have a list of accounts that have attempted to trigger the bug.

Until the hard update fix activates, miners should censor any tx from any account that exploited the bug.
So each account can only make a bad tx once, and we can punish the bad accounts as a part of the hard update to recover.

@daira
Copy link

daira commented Nov 12, 2020

It seems to me that the approach @gmaxwell is suggesting wouldn't have worked in this case, and indeed wouldn't be expected to work in most situations of accidental consensus incompatibility in a coin with multiple consensus implementations.

In particular, adding bug-for-bug compatibility code to other node implementations to account for Geth's bug would have been absolutely the wrong approach, and it would only have delayed solving the problem, given that operators of those other nodes would also have needed to upgrade.

In response to @zack-bitcoin's comment, punishing whoever triggered the fork is beside the point, since no-one is alleging malfeasance.

FWIW, I think this bug was reasonably well-handled given the constraints, although I agree with @alexkroeger that giving infrastructure providers a heads up might have helped.

@perlboy
Copy link

perlboy commented Nov 12, 2020

No one else is going to ask why pretty important node operators were running releases >3 months old? Any self respecting and "old world" accredited IT shop would classify this as a classic patch management failure and most (all?) commercial vendors would point to this lag time in their defence...

@t-vila
Copy link

t-vila commented Nov 12, 2020

No one else is going to ask why pretty important node operators were running releases >3 months old? Any self respecting and "old world" accredited IT shop would classify this as a classic patch management failure and most (all?) commercial vendors would point to this lag time in their defence...

see Infura's explanation around that https://blog.infura.io/infura-mainnet-outage-post-mortem-2020-11-11/?utm_source=social&utm_medium=twitter

@mcelrath
Copy link

So the vulnerabilities resulting from effectively undecidable transactions are still unresolved almost 5 years later?

I think it's wrong to call it a vulnerability. It's certainly a property that Ethereum has, and it certainly has its downsides (like the inability to soft fork away bugs), but it also has important upsides: making it more difficult for a 51% coalition to censor specific applications (or even a smaller coalition via feather forking) is a significant win. Complicated tradeoffs!

A 51% coalition can censor applications or transactions, full stop. There is fundamentally no way around this security property. Bringing that up is a misdirection, there is no trade-off here. Whether to call state dependence a "property" of Ethereum or calling it "degraded upgrade safety" is in the eye of the beholder, but it has nothing to do with 51% attacks.

@JavierGonzalez
Copy link

Neutralizing a minority split with empty blocks + reorg would reduce the impact on users to zero.

If the split was 30 blocks and the block reward is 1,255 USD, the total cost would be 37,650 USD? (sorry if this is not correct, I have not studied ETH in depth).

Is it too high a cost in return for protecting users and the ETH project reputation?

@holiman
Copy link

holiman commented Nov 13, 2020

A 51% coalition can censor applications or transactions, full stop. There is fundamentally no way around this security property. Bringing that up is a misdirection, there is no trade-off here

Yes, they can censor all transactions, or they can whitelist a set of senders. What is difficult is to censor an application. As in, censoring transactions that somewhere along the execution invokes a certain contract. Because then they have to execute it in order to figure it out, and the path the execution takes may depend on the state, so two txs executed out of order may not act the same way as the same two txs mined in a different order.

Right now, tx validity can be determined without executing the transaction. So they do execute them today, but they know that they can put it into a block and get paid for it. If they want to censor, they have to execute e.g. a 10M gas transaction, and possibly have to censor it, and thus not get paid for it. Which is exploitable.

So saying "making it more difficult for a 51% coalition to censor specific applications" is a very accurate description.

@mcelrath
Copy link

A 51% coalition can censor applications or transactions, full stop. There is fundamentally no way around this security property. Bringing that up is a misdirection, there is no trade-off here

Yes, they can censor all transactions, or they can whitelist a set of senders. What is difficult is to censor an application. As in, censoring transactions that somewhere along the execution invokes a certain contract. Because then they have to execute it in order to figure it out, and the path the execution takes may depend on the state, so two txs executed out of order may not act the same way as the same two txs mined in a different order.

Right now, tx validity can be determined without executing the transaction. So they do execute them today, but they know that they can put it into a block and get paid for it. If they want to censor, they have to execute e.g. a 10M gas transaction, and possibly have to censor it, and thus not get paid for it. Which is exploitable.

So saying "making it more difficult for a 51% coalition to censor specific applications" is a very accurate description.

So someone who wants to censor an application has to execute the contract, and see what it calls. No big deal. The "more difficult" you claim is a tiny amount of CPU usage that, if you're determined to censor, you'd be happy to eat the cost for. Just because that's not how it works today does not make this a valid security argument.

So again, this "51%" argument is totally irrelevant.

@kurtseifried
Copy link

This was a clearly triggerable DoS in the software, as such it clearly meets the definition for getting a CVE identifier. Is there a CVE request in the works for this issue? Thanks.

@hooji
Copy link

hooji commented Nov 15, 2020

Perhaps new client releases which include protocol changes should include a "sunset" target block number by which all prior versions are required to be upgraded. This could be implemented by contract on the blockchain itself. Legacy clients could read the value to learn of required updates and the remaining time available before the upgrade becomes "required".

@JesseHerring33
Copy link

I need to learn more about it or hire people to help me with management skills and organization with my addresses. I know there is 20 addeess that belongs to me and I need to connect the rest at etherscan, etherscan has most of them but with the community of GitHub I'm going to give large donations for all the help over the past years! You all are so awesome and God bless you! We all go through our own Demons and I promise I have my own and they seem a little bit more aggressive than others but it's not my judgment; ) I accidentally sent transactions and I have had network problems and everything else that could interfere with my life but you all are my Family and I prey for you and just know it gets so much better than now, of course we will have problems bringing Heaven to earth. There is a lot going on and I need to be in a position where I can talk about it and let people know that God is good

@daira
Copy link

daira commented Apr 5, 2021

I suggest locking this bug and deleting the spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment