holiman/EL-CL firedrill.md Secret

## EL-CL firedrill.md

      
Something we've been discussing a bit on the geth team is the new failure mode for Ethereum.
It would be interesting to first outline in theory, and later go through in practice, the case where an unacceptable consensus issue occurs.
Scenario: geth version X contains a flaw whereby the coinbase receives 1 ETH in every block after block N. This bug causes a chain split at block N (geth-x: chain A, others: chain B).

- Version A: geth-x is a super-majority, and the erroneous fork reaches finality.
- Version B: geth-x is not a super-majority, but is large enough (e.g. ~50%) that neither fork reaches finality.

In both scenarios, it is decided that the flaw is unacceptable and that none of the blocks on chain A can be accepted. A fixed version of geth is released.
### Version A

A few possible narratives follow. Once chain A has finalized, the validators that were on that fork cannot jump to chain B unless it finalizes at an equal or higher block. And chain B cannot finalize until enough validators have been removed to make the set B validators a super-majority. In other words:

- If a subset of set A voluntarily equivocates and gets slashed, then a new finality can be reached somewhat quickly, at block M.
- If all validators in set A wait it out, then the inactivity leak will make set B a super-majority after a non-finality period of ~2.5 weeks.
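The super-majority condition behind both narratives can be sketched as plain arithmetic. This is a minimal illustration, assuming the usual two-thirds threshold of active stake for finality; the function names and the 70/30 split are hypothetical and not taken from any client.

```go
package main

import "fmt"

// canFinalize reports whether a chain backed by `stake` out of `total`
// active stake meets the 2/3 super-majority assumed for finality.
// Integer cross-multiplication avoids floating-point rounding.
func canFinalize(stake, total uint64) bool {
	return total > 0 && 3*stake >= 2*total
}

// stakeToRemove returns how much of set A's stake must leave the active set
// (slashed, ejected or leaked away) before set B alone is a super-majority.
// B/(B+A') >= 2/3 is equivalent to A' <= B/2.
func stakeToRemove(stakeA, stakeB uint64) uint64 {
	maxA := stakeB / 2
	if stakeA <= maxA {
		return 0
	}
	return stakeA - maxA
}

func main() {
	// Hypothetical split: 70 units of stake on chain A, 30 on chain B.
	a, b := uint64(70), uint64(30)
	fmt.Println(canFinalize(a, a+b)) // chain A can finalize
	fmt.Println(canFinalize(b, a+b)) // chain B cannot, yet
	fmt.Println(stakeToRemove(a, b)) // stake that must leave set A
}
```

With this split, set A has to shrink to at most half of set B's stake before chain B can finalize, regardless of whether the reduction comes from slashing or from the inactivity leak.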

In both these cases, set A validators stand to lose substantial amounts of money, and the community decides to implement slashing impunity for this incident.

It is decided that the fork-version will be incremented twice, since slashings can only be carried out from the previous fork-version to the current one.
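Why two increments would be needed can be modelled roughly as follows. This is a simplified sketch, assuming the consensus layer verifies a message from before the fork epoch against the previous fork-version and anything later against the current one. The `Fork` type, the integer versions and the epoch numbers are illustrative assumptions, not client code.

```go
package main

import "fmt"

// Fork is a simplified stand-in for the consensus-layer fork record.
// Real fork versions are 4-byte values; integers are used here for brevity.
type Fork struct {
	Previous uint32 // version in force before the latest fork
	Current  uint32 // version in force from Epoch onwards
	Epoch    uint64 // activation epoch of the latest fork
}

// versionFor returns the fork-version used to verify a message from msgEpoch.
func (f Fork) versionFor(msgEpoch uint64) uint32 {
	if msgEpoch < f.Epoch {
		return f.Previous
	}
	return f.Current
}

// slashable reports whether evidence signed under signedVersion at msgEpoch
// would still verify against the given fork record.
func slashable(f Fork, signedVersion uint32, msgEpoch uint64) bool {
	return f.versionFor(msgEpoch) == signedVersion
}

func main() {
	// Hypothetical incident: messages signed under version 5 at epoch 1000.
	oneBump := Fork{Previous: 5, Current: 6, Epoch: 2000}
	twoBumps := Fork{Previous: 6, Current: 7, Epoch: 2000}

	fmt.Println(slashable(oneBump, 5, 1000))  // evidence still verifies
	fmt.Println(slashable(twoBumps, 5, 1000)) // evidence no longer verifies
}
```

Under this model, a single increment leaves the incident-era version in the "previous" slot, so the evidence remains usable; a second increment pushes it out.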

After ~1 week, the decision has been made. It is decided that:

- As of epoch (three days from now), the new fork "plustwo" is scheduled.
- All CLs roll out the "plustwo" fork.

There are a lot of open questions regarding this scenario, primarily how and when to act as a validator. Client teams may individually have attempted rollbacks, but do week-long rollbacks involving both layers work as intended?
How is "slashing impunity" realized in practice? In this case, 1 week was spent discussing and analyzing, and three days of lead time for scheduling the fork brings the total to 10 days of inactivity leak. Is that also discounted, or is it simply a fact that has to be factored in?
### Requirements

- A modified geth version which can be instructed to contain the flaw-fork at a specified block number.
- Are modified CL clients needed?
- Chain definitions for a new dedicated network.
- Transaction generators; we don't want an empty network.
- Ideally, we would want to fork off an existing large network; otherwise, problems with rolling back state are not true-to-life.

This whole exercise should be carried out by a group of people with expertise in EL, CL and devops.
### Version B

TBD; let's focus on Version A first.