Something we've been discussing a bit on the geth team is the new failure mode for Ethereum. Basically, it would be interesting to first outline in theory, and later go through in practice, the case where a non-acceptable consensus issue occurs.
geth version X contains a flaw involving 1 eth in every block after block N. This bug causes a chain split (chain A, chain B) at block N.
- Version A: geth-x is super-majority, and the erroneous fork reaches finality.
- Version B: geth-x is not super-majority, but large enough (e.g. ~50%) that neither fork reaches finality.
In both scenarios, it is decided that the flaw is unacceptable, and that none of the blocks on chain A can be accepted. A fixed version of geth is released.
There are a few possible narratives here. Once chain A has finalized, the validators that were on that fork cannot jump to chain B unless it finalizes at an equal or higher block. And chain B cannot finalize until sufficient validators have been removed to make the set B validators a super-majority. In other words:
- If a subset of set A voluntarily equivocates and gets slashed, then a new finality can be reached somewhat quickly, at some later block.
- If all validators in set A wait it out, then the inactivity leak will make set B become a super-majority after a non-finality period of ~2.5 weeks.
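To make the super-majority condition above concrete, here is a rough sketch of the arithmetic. It assumes a plain 2/3 stake threshold and ignores exit-queue delays and effective-balance rounding; the function name `stake_to_remove` and the 70/30 split are made up for illustration.

```python
from fractions import Fraction

# Assumption: finality needs at least 2/3 of the active stake.
SUPERMAJORITY = Fraction(2, 3)


def stake_to_remove(set_a: Fraction, set_b: Fraction) -> Fraction:
    """Return how much set A stake must leave the active set (through
    slashing or inactivity leak) before set B holds a 2/3 super-majority
    of what remains. Inputs are fractions of the total stake."""
    if set_b <= 0:
        raise ValueError("set B must hold some stake")
    # set B is a super-majority when set_b / (set_b + a_left) >= 2/3,
    # which rearranges to a_left <= set_b / 2.
    max_a_left = set_b * (1 - SUPERMAJORITY) / SUPERMAJORITY
    return max(Fraction(0), set_a - max_a_left)


# Hypothetical split: set A holds 70% of the stake, set B holds 30%.
print(stake_to_remove(Fraction(7, 10), Fraction(3, 10)))  # 11/20
```

In this made-up 70/30 split, 55% of the total stake has to disappear from set A before chain B can finalize, which is why both narratives are expensive for set A.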
In both these cases, set A validators stand to lose substantial amounts of money, and the community decides to implement slashing impunity for this incident.
- It is decided that the fork-version will be incremented twice (slashings can only be done from prev fork-version to current).
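A toy model of why the fork-version would be bumped twice rather than once. It assumes the verifier picks the previous version for messages from before the fork epoch and the current version otherwise, which is how we understand the consensus-spec domain selection to work; the class and function names, version numbers and epochs below are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fork:
    previous_version: int
    current_version: int
    epoch: int  # epoch at which current_version became active


def upgrade(fork: Fork, new_version: int, epoch: int) -> Fork:
    """Apply a fork: the old current version becomes the previous one."""
    return Fork(fork.current_version, new_version, epoch)


def expected_version(fork: Fork, message_epoch: int) -> int:
    """Version the verifier expects for a message from message_epoch."""
    if message_epoch < fork.epoch:
        return fork.previous_version
    return fork.current_version


def can_verify(fork: Fork, signed_version: int, message_epoch: int) -> bool:
    """True if a message signed under signed_version still verifies."""
    return signed_version == expected_version(fork, message_epoch)


# Hypothetical numbers: the equivocation was signed under version 1 at epoch 500.
during_incident = Fork(previous_version=0, current_version=1, epoch=100)
bumped_once = upgrade(during_incident, new_version=2, epoch=1000)
bumped_twice = upgrade(bumped_once, new_version=3, epoch=1000)

print(can_verify(during_incident, 1, 500))  # True: slashable today
print(can_verify(bumped_once, 1, 500))      # True: still slashable after one bump
print(can_verify(bumped_twice, 1, 500))     # False: evidence no longer verifies
```

Under these assumptions a single bump leaves the old signatures verifiable as "previous version", and only the second bump puts them out of reach.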
After ~1 week, the decision has been made. It is decided that:
- As of epoch (three days from now), the new fork "plustwo" is scheduled.
- All CLs roll out the "plustwo" fork.
There are a lot of open questions regarding this scenario, primarily how and when to act as a validator. Client teams may individually have attempted rollbacks, but do week-long rollbacks involving both layers work as intended? How is "slashing impunity" realized in practice? In this case, one week spent discussing and analyzing plus three days for scheduling the fork leads to 10 days of inactivity leak. Is that also discounted, or is it simply a fact that has to be factored in?
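For a sense of scale, the 10 days can be converted into epochs. This is a quick sketch assuming mainnet timing (12-second slots, 32 slots per epoch); the helper name `epochs_in_days` is made up.

```python
# Assumed mainnet timing parameters.
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
SECONDS_PER_DAY = 24 * 60 * 60


def epochs_in_days(days: int) -> int:
    """Number of whole epochs that fit into the given number of days."""
    return days * SECONDS_PER_DAY // (SECONDS_PER_SLOT * SLOTS_PER_EPOCH)


discussion_days = 7   # time spent discussing/analyzing
scheduling_days = 3   # lead time for scheduling the fork
print(epochs_in_days(discussion_days + scheduling_days))  # 2250
```

So any "discounting" of the leak would have to cover on the order of a couple of thousand epochs of penalties per affected validator.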
- A modified geth version which can be instructed to contain the flaw-fork at a specified block number.
- Are modified CL clients needed?
- Chain-definitions for a new dedicated network
- Transaction-generators, we don't want an empty network.
- Ideally, we would want to fork off an existing large network, otherwise problems with rolling back state are not true-to-life.
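As a very rough sketch of the first item, the flaw could be gated on a block number in the same style as fork activation checks. Everything below is a toy model and not geth code: `flaw_active`, `next_state` and the "+1" state changes are hypothetical stand-ins, since the real flaw and state transition are not modeled here.

```python
import hashlib
from typing import Optional


def flaw_active(block_number: int, flaw_block: Optional[int]) -> bool:
    """True when the injected flaw applies at this block (None = never)."""
    return flaw_block is not None and block_number >= flaw_block


def next_state(state: int, block_number: int, flaw_block: Optional[int]) -> int:
    """Toy state transition; the integer stands in for the whole state."""
    state += 1  # stand-in for normal block processing
    if flaw_active(block_number, flaw_block):
        state += 1  # stand-in for the erroneous state change
    return state


def state_root(state: int) -> str:
    """Toy state root: a hash of the toy state."""
    return hashlib.sha256(str(state).encode()).hexdigest()


def first_divergence(flaw_block: int, blocks: int) -> Optional[int]:
    """First block at which a flawed and a correct client disagree."""
    good, bad = 0, 0
    for n in range(1, blocks + 1):
        good = next_state(good, n, None)
        bad = next_state(bad, n, flaw_block)
        if state_root(good) != state_root(bad):
            return n
    return None


print(first_divergence(flaw_block=5, blocks=10))  # 5
```

The point of the gate is that the test network can run normally for a while, and the split then happens at a block we choose.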
This whole exercise should be carried out by a group of people with expertise in EL, CL and devops.
TBD, let's focus on Version A first.