@holiman
Created May 12, 2023 10:42

Something we've been discussing a bit on the geth team is a new failure mode for Ethereum. It would be interesting to first outline in theory, and later go through in practice, the case where an unacceptable consensus issue occurs.

Scenario: geth version X contains a flaw whereby the coinbase receives 1 ETH in every block after block N. This bug causes a chain split at block N (geth-x: chain A, other clients: chain B).

  • Version A: geth-x is a super-majority, and the erroneous fork reaches finality.
  • Version B: geth-x is not a super-majority, but is large enough (e.g. ~50%) that neither fork reaches finality.

In both scenarios, it is decided that the flaw is unacceptable, and that none of the blocks on chain A can be accepted. A fixed version of geth is released.

Version A

There are a few possible narratives here. Once chain A has finalized, the validators that were on that fork cannot jump to chain B unless it finalizes at an equal or higher block. And chain B cannot finalize until enough validators have been removed to make the set B validators a super-majority. In other words:

  • If a subset of set A voluntarily equivocates and gets slashed, then a new finality can be reached somewhat quickly, at block M.
  • If all validators in set A wait it out, then the inactivity leak will make the set B become a super-majority after a non-finality period of ~2.5 weeks.
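The length of the wait-it-out case depends strongly on how large set A is. A rough way to explore this is a small simulation of the inactivity leak. This is only a sketch under simplifying assumptions: it uses the Bellatrix leak constants, treats actual balance as effective balance, assumes set A never attests on chain B, and ignores ejections, exits, the few epochs before the leak starts, and all other rewards and penalties. The function name and parameters are made up for illustration.

```python
from typing import Optional

EPOCHS_PER_DAY = 225                 # 32 slots * 12 s = 384 s per epoch
INACTIVITY_SCORE_BIAS = 4            # score gained per missed epoch during a leak
INACTIVITY_PENALTY_QUOTIENT = 2**24  # Bellatrix value


def days_until_supermajority(share_a: float, max_epochs: int = 20_000) -> Optional[float]:
    """Days of leak on chain B until set B holds >= 2/3 of the remaining stake.

    share_a is set A's fraction of total stake when the split happens.
    Returns None if the threshold is not reached within max_epochs.
    """
    if not 0.0 < share_a < 1.0:
        raise ValueError("share_a must be strictly between 0 and 1")
    balance_a = share_a        # total stake normalized to 1.0
    balance_b = 1.0 - share_a  # set B keeps attesting, so it does not leak
    score = 0
    for epoch in range(max_epochs + 1):
        # B / (A + B) >= 2/3 is equivalent to B >= 2 * A
        if balance_b >= 2 * balance_a:
            return epoch / EPOCHS_PER_DAY
        score += INACTIVITY_SCORE_BIAS
        penalty = balance_a * score / (INACTIVITY_SCORE_BIAS * INACTIVITY_PENALTY_QUOTIENT)
        balance_a -= penalty
    return None


if __name__ == "__main__":
    for share in (0.67, 0.75, 0.84):
        print(f"set A at {share:.0%}: ~{days_until_supermajority(share):.1f} days")
```

Under these assumptions the model should give roughly 30 days for a 67% set A and roughly 39 days for 84%, so the duration is better treated as a function of set A's share than as a single number.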

In both these cases, set A validators stand to lose substantial amounts of money, and the community decides to implement slashing impunity for this incident.

  • It is decided that the fork-version will be incremented twice (slashings can only be done from prev fork-version to current).
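The parenthetical above follows from how the consensus spec chooses a fork version when verifying a signed message: messages from before the current fork's epoch are checked against the previous version, everything else against the current version. The sketch below illustrates why two increments would leave older signatures unverifiable. The class, function names, version bytes and epochs are hypothetical, and the real domain computation also mixes in the domain type and the genesis validators root.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fork:
    previous_version: bytes
    current_version: bytes
    epoch: int  # epoch at which current_version became active


def verification_version(fork: Fork, message_epoch: int) -> bytes:
    """Fork version the state would use to check a message from message_epoch."""
    if message_epoch < fork.epoch:
        return fork.previous_version
    return fork.current_version


def is_verifiable(fork: Fork, signed_under: bytes, message_epoch: int) -> bool:
    """True if a message signed under `signed_under` still verifies in this state."""
    return verification_version(fork, message_epoch) == signed_under


# Hypothetical versions: V0 is live during the incident, V1 and V2 are the two bumps.
V0, V1, V2 = b"\x00\x00\x00\x00", b"\x00\x00\x00\x01", b"\x00\x00\x00\x02"
INCIDENT_EPOCH = 1_000

after_one_bump = Fork(previous_version=V0, current_version=V1, epoch=2_000)
after_two_bumps = Fork(previous_version=V1, current_version=V2, epoch=2_001)

print(is_verifiable(after_one_bump, V0, INCIDENT_EPOCH))   # True: still slashable
print(is_verifiable(after_two_bumps, V0, INCIDENT_EPOCH))  # False: signature no longer checks out
```

With a single increment, attestations signed under V0 during the incident still verify against `previous_version`, so they could still be included as slashing evidence. After the second increment the state only knows V1 and V2.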

After ~1 week, the decision has been made. It is decided that:

  • As of epoch (three days from now), the new fork "plustwo" is scheduled.
  • All CLs roll out the "plustwo" fork.

There are a lot of open questions regarding this scenario, primarily how and when to act as a validator. Client teams may individually have attempted rollbacks, but do week-long rollbacks involving both layers work as intended? How is "slashing impunity" realized in practice? In this case, one week spent discussing and analyzing, plus three days for scheduling the fork, leads to 10 days of inactivity leak. Is that also discounted, or is it simply a fact that has to be factored in?

Requirements

  • A modified geth version which can be instructed to trigger the flaw-fork at a specified block number.
  • Are modified CL clients needed?
  • Chain-definitions for a new dedicated network
  • Transaction generators, since we don't want an empty network.
  • Ideally, we would want to fork off an existing large network, otherwise problems with rolling back state are not true-to-life.

This whole exercise should be carried out by a group of people with expertise across EL, CL and devops.

Version B

TBD, let's focus on Version A first.

@potuz

potuz commented May 23, 2023

cannot jump to chain B unless it finalizes at an equal or higher block

No stock client should do this. Forkchoice should be thought of as a tree whose root is the finalized block. Anything that contends with the root cannot be on the same tree, and is therefore not even considered for inclusion.
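As a rough illustration of the tree picture (a sketch only, with made-up block names and a hypothetical helper, not how any client actually structures its forkchoice store), a block is a candidate only if walking its parent links reaches the finalized root:

```python
from typing import Dict, Optional


def descends_from(block: str, finalized_root: str, parent_of: Dict[str, Optional[str]]) -> bool:
    """Return True if `block` is the finalized root or one of its descendants."""
    current: Optional[str] = block
    while current is not None:
        if current == finalized_root:
            return True
        current = parent_of.get(current)  # None once we run past the oldest known block
    return False


# Chain A and chain B split after block "N-1"; chain A has finalized "A2".
parent_of = {
    "N-1": None,
    "A1": "N-1", "A2": "A1", "A3": "A2",
    "B1": "N-1", "B2": "B1",
}
finalized_root = "A2"

print(descends_from("A3", finalized_root, parent_of))  # True: builds on the finalized block
print(descends_from("B2", finalized_root, parent_of))  # False: conflicts with the finalized block
```

In this picture no block on chain B can ever become a candidate for a node that has finalized "A2", however far chain B advances.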

but do week-long rollbacks involving both layers work as intended?

On the CL side this is a simple checkpoint sync from an unfinalized checkpoint, which should be simple to implement. So I'd say the complexity of a week-long reorg rests purely on the EL being able to handle it.

Are modified CL clients needed?

Definitely, more on this below. CLs need to ignore certain slashable events, and this affects every single layer of processing on the CL. It starts at gossip, where some DoS vectors may open up from having to deal with bogus slashing offences that are no longer slashable but carry valid signatures, and extends to processing blocks that include these slashings.

I think the key of your drill here is this one

How is "slashing impunity" realized in practice?

I think there is no safe way of doing this, regardless of the political or social aspects that others will surely point out. I think we can never solve the technical issue that the slashings will be there for anyone to include afterwards. And the naive solution of "let's change slashing conditions to avoid a certain period" does not seem to work, because it may give all validators of class A the chance to sell their stake, which may be unslashable for a period of over a couple of weeks at the transition, since they can't surround vote. The alternative is to keep a detailed description of the violations, which would grow forever on the CL side, both in storage size and in processing complexity.

@yorickdowne

Voluntary slashing would take 18 days for the secondary slashing to 32 ETH and dropping weight to 0, wouldn't it?
Inactivity leak takes 39 days in my calculation, at 84% Geth. Best case at 67% Geth is 31 days.

The loss in either case is not reasonable. The loss if this "bailout" idea is adopted within 7 days is still staggering.

Calculations at https://docs.google.com/spreadsheets/d/1N9Rjia84SQSedFzmBtnipnWj8_ND0tFS0p1C6q8lybc/ . I highly welcome a set of eyes on whether I have the numbers right, or at least directionally right.
