Something we've been discussing on the geth team is a new failure mode for Ethereum. It would be interesting to first outline in theory, and later go through in practice, the case where an unacceptable consensus issue occurs.
Scenario: geth version X contains a flaw, whereby the coinbase receives 1 eth in every block after block N. This bug causes a chain split (geth-x: chain A, other: chain B) at block N.
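To make the flaw concrete, here is a toy model of it. Everything in it is hypothetical: `coinbase_credit` and `first_divergent_block` are illustrative names, not geth code, and treating N as the first affected block is an assumption, since the scenario leaves that detail open.

```python
WEI_PER_ETH = 10**18


def coinbase_credit(block_number, fees_wei, flaw_block=None):
    """Amount credited to the coinbase for one block.

    flaw_block is a hypothetical switch: when set, every block from that
    number onwards credits one extra ether, as in the scenario.
    """
    credit = fees_wei
    if flaw_block is not None and block_number >= flaw_block:
        credit += WEI_PER_ETH  # the flaw: 1 eth appears out of nowhere
    return credit


def first_divergent_block(fees_by_block, flaw_block):
    """First block where a flawed and a correct client disagree, or None."""
    for number in sorted(fees_by_block):
        fees = fees_by_block[number]
        if coinbase_credit(number, fees, flaw_block) != coinbase_credit(number, fees):
            return number
    return None
```

Because the coinbase balance is part of the state, the first block with a differing credit is also the first block whose state root the two client groups disagree on, which is where the chains split.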
Version A: geth-x is super-majority, and the erroneous fork reaches finality.
Version B: geth-x is not super-majority, but large enough (e.g. ~50%) that neither fork reaches finality.
In both scenarios, it is decided that the flaw is unacceptable, and that none of the blocks on chain A can be accepted. A fixed version of geth is released.
A few possible narratives follow. Once chain A has finalized, the validators that were on that fork cannot jump to chain B unless it finalizes at an equal or higher block. And chain B cannot finalize until sufficient validators have been removed to make the set B validators a super-majority. In other words:
- If a subset of set A voluntarily equivocates and gets slashed, then a new finality can be reached somewhat quickly, at block M.
- If all validators in set A wait it out, then the inactivity leak will make set B become a super-majority after a non-finality period of ~2.5 weeks.
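The wait-it-out case can be explored with a rough model of the inactivity leak. This is only a sketch under stated assumptions: the default `score_bias` and `penalty_quotient` are what I believe the mainnet values to be and should be checked against the consensus spec, and the model ignores ejections at low balance, effective-balance rounding, ordinary attestation penalties and the short delay before the leak starts. The exact duration depends on the stake split and the leak parameters, so whatever figure it produces is only indicative.

```python
SECONDS_PER_EPOCH = 32 * 12  # 32 slots of 12 seconds each


def epochs_until_supermajority(offline_fraction, score_bias=4,
                               penalty_quotient=2**24, max_epochs=100_000):
    """Leak epochs needed before the online set holds at least 2/3 of stake.

    offline_fraction is the share of stake (set A, as seen from chain B)
    that is not attesting. Parameter defaults are assumptions, see above.
    Returns None if the threshold is not reached within max_epochs.
    """
    online = 1.0 - offline_fraction
    offline = offline_fraction
    score = 0
    for epoch in range(max_epochs + 1):
        # Super-majority test: online / (online + offline) >= 2/3
        if 3 * online >= 2 * (online + offline):
            return epoch
        # Each missed epoch raises the inactivity score, and the penalty
        # grows with the score, so the leak accelerates over time.
        score += score_bias
        offline -= offline * score / (score_bias * penalty_quotient)
    return None


def epochs_to_days(epochs):
    return epochs * SECONDS_PER_EPOCH / 86_400
```

Calling `epochs_to_days(epochs_until_supermajority(0.5))` gives the modelled non-finality period for an even split; larger values of `offline_fraction` take longer, since more stake has to leak away before set B reaches two thirds.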
In both these cases, set A validators stand to lose substantial amounts of money, and the community decides to implement slashing impunity for this incident.
- It is decided that the fork-version will be incremented twice (slashings can only be done from prev fork-version to current).
After ~1 week, the decision has been made. It is decided that
- As of epoch (three days from now), the new fork "plustwo" is scheduled.
- All CLs roll out the "plustwo" fork.
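Why two increments rather than one can be sketched as follows. The `Fork` record and `version_for_epoch` are modelled on my understanding of the consensus spec, where a signature for an epoch before the fork epoch is checked against the previous version and anything later against the current one. The version bytes, the epochs, and the `bump` and `verifiable_versions` helpers are hypothetical, as is applying both increments at the same epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fork:
    previous_version: bytes
    current_version: bytes
    epoch: int  # epoch at which current_version became active


def version_for_epoch(fork, epoch):
    """Fork version used when verifying a message signed for `epoch`."""
    return fork.previous_version if epoch < fork.epoch else fork.current_version


def bump(fork, new_version, epoch):
    """One fork-version increment: the current version becomes the previous."""
    return Fork(fork.current_version, new_version, epoch)


def verifiable_versions(fork):
    """Versions under which a signature can still be checked."""
    return {fork.previous_version, fork.current_version}


# Hypothetical version bytes and epochs, for illustration only.
OLDER, INCIDENT = b"\x00\x00\x00\x01", b"\x00\x00\x00\x02"
PLUS_ONE, PLUS_TWO = b"\x00\x00\x00\x03", b"\x00\x00\x00\x04"

at_incident = Fork(OLDER, INCIDENT, epoch=100)
after_one = bump(at_incident, PLUS_ONE, epoch=200)
after_two = bump(after_one, PLUS_TWO, epoch=200)
```

After a single increment, `INCIDENT` is still the previous version, so votes signed during the incident remain verifiable and therefore slashable. After the second increment it drops out of `verifiable_versions`, and a message for an incident-era epoch would be checked against `PLUS_ONE`, a version nobody ever signed under.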
There are a lot of open questions regarding this scenario, primarily how and when to act as a validator. Client teams may individually have attempted rollbacks, but do week-long rollbacks involving both layers work as intended? How is "slashing impunity" realized in practice? In this case, 1 week was spent discussing and analyzing, and three days scheduling the fork, which leads to 10 days of inactivity leak. Is that also discounted, or is it simply a fact that has to be factored in?
- A modified geth version which can be instructed to contain the flaw-fork at a specified block number.
- Are modified CL clients needed?
- Chain-definitions for a new dedicated network
- Transaction-generators, we don't want an empty network.
- Ideally, we would want to fork off an existing large network, otherwise problems with rolling back state are not true-to-life.
This whole exercise should be carried out by a group of people with expertise across EL, CL and devops.
TBD, let's focus on Version A first.
No stock client should do this. Forkchoice should be thought of as a tree whose root is the finalized block: anything that contends with the root cannot be on the same tree, and is therefore not even considered for inclusion.
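The tree view can be sketched in a few lines. This is a simplified illustration, not any client's fork-choice code: blocks are plain string ids, `parents` is a hypothetical child-to-parent map, and weights and votes are left out entirely.

```python
def descends_from(parents, block, root):
    """True if `root` is `block` itself or one of its ancestors.

    parents maps a block id to its parent id (None for genesis).
    """
    seen = set()
    while block is not None and block not in seen:
        if block == root:
            return True
        seen.add(block)  # guard against malformed, cyclic input
        block = parents.get(block)
    return False


def viable_heads(parents, heads, finalized_root):
    """Keep only the heads that sit in the tree rooted at the finalized block."""
    return [h for h in heads if descends_from(parents, h, finalized_root)]


# Chain split at block "n": a1/a2 on chain A, b1/b2 on chain B.
parents = {"g": None, "n": "g", "a1": "n", "a2": "a1", "b1": "n", "b2": "b1"}
print(viable_heads(parents, ["a2", "b2"], finalized_root="a1"))
```

With `a1` finalized, `b2` is filtered out before any weighing happens, so a node that finalized chain A never even evaluates chain B as a candidate.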
On the CL side this is a simple checkpoint sync from an unfinalized checkpoint. It should be simple to implement, so I'd say that the complexity of a week-long reorg rests purely on the EL being able to handle it.
Definitely, more on this below, but CLs need to ignore certain slashable events, and this affects every single layer of processing on the CL: from gossiping, where some DoS vectors may open up when dealing with bogus slashing offences that are no longer slashable but have valid signatures, to block processing that includes these slashings.
I think the key of your drill here is this one
I think there is no safe way of doing this, regardless of the political or social aspects that others will surely point out. I think we can never solve the technical issue that the slashings will be there for anyone to include afterwards. And the naive solution of "let's change slashing conditions to avoid a certain period" does not seem to work because that may give all validators of class A the chance to sell their stake, which may be unslashable for a period of over a couple of weeks at the transition, since they can't surround vote. The alternative is to keep a detailed description of the violations, which would grow forever on the CL side, both in storage size and in processing complexity.
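For reference, the two attester slashing conditions and the naive "exempt a period" rule can be written down as below. The double-vote and surround-vote checks follow the conditions in the consensus spec as I understand them. The `Vote` type and `is_slashable_with_amnesty` are hypothetical and only illustrate the naive approach; they are not a proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    source_epoch: int
    target_epoch: int
    root: str  # stand-in for the rest of the attestation data


def is_double_vote(a, b):
    """Two different votes for the same target epoch."""
    return a != b and a.target_epoch == b.target_epoch


def is_surround_vote(a, b):
    """Vote `a` surrounds vote `b`."""
    return a.source_epoch < b.source_epoch and b.target_epoch < a.target_epoch


def is_slashable(a, b):
    return is_double_vote(a, b) or is_surround_vote(a, b)


def is_slashable_with_amnesty(a, b, start_epoch, end_epoch):
    """Hypothetical naive rule: ignore any pair touching the amnesty window."""
    def in_window(vote):
        return start_epoch <= vote.target_epoch <= end_epoch

    if in_window(a) or in_window(b):
        return False
    return is_slashable(a, b)
```

Even in this toy form, the rule has to be consulted wherever a slashing is checked, and it exempts every vote that touches the window, not only the votes that were a consequence of the bug. That breadth is what makes the naive version risky.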