ZenGround0/Chain-Halting-Timestamp-Validation.md

## Chain-Halting-Timestamp-Validation.md

      
    Raw
  

              Chain-Halting-Timestamp-Validation.md
            
          
    Intro

@sternhenri and @jzimmerman have proposed a strict timestamp validation rule for the filecoin network.  The rule (1) is that all epochs uniquely determine a period in time fixed by the genesis timestamp.  Blocks with timestamps that don't fall within the window specified by their height are marked as invalid by the protocol. This relies on a clock synchrony assumption.
Talking with @Kubuxu I learned that the lotus team is hesitant to adopt these strict timestamp rules for at least one major reason: recovery after chain halting. This document dives into the proposed timing validation rule and its impact on recovery from chain halting.  It makes an argument for keeping these strict validation rules and explores some chain halting recovery options.
Timing model

During steady state operation (i.e. CHAIN_FOLLOW mode) a filecoin node mines off of the heaviest tipset it finds.  Nodes mine with the appropriate number of null blocks to put their candidate block in the current epoch.  If a node wins an election it transmits the winning block.
Chain Halting

Scenario

It is likely that mining progress will stop at some point in the network's lifetime due to implementation error.  The canonical example is that a clever adversary mines and propagates a particular block structure that causes all nodes to crash.  In an event like this the chain is said to halt. Operators and potentially programmers need to intervene to handle a halting event in order to keep the network running.  This typically takes some time.
Byzantine actors may take advantage of the situation.  For example to create a chain long enough to evict all other miners from the power table or at least significantly penalize other miners who will miss both election and state machine posts.  Using power table evictions it is possible for an adversary to force the heaviest chain to an unrecoverable protocol halt requiring a network upgrade for restart.
Adversaries could additionally use this opportunity to double spend by forking around a transaction near the halting point, or to simply inflate their block reward percentage and gain an outsized reward.
Recovery Behavior

What epoch do we restart at?

In light of this the choice to have a predetermined mapping between timestamp and epoch number in validation rule 1 leads to a critical question.  After miners are forced to stop during a chain halt event, what epoch do they begin mining on upon restarting?
(a) Strategy: disallow gap epochs by default closing off attack vectors for reorganizing the chain. Complications: there are a few possible ways to prevent a halt gap each with their own issues or added complexity (i) nodes restart with a network upgrade to create a "deadzone" -- epochs where blocks are disallowed by the protocol. (ii) the protocol can specify that nodes recover by filling in the gap by "backdating" timestamps.  AFAIK this is lotus's current strategy (iii) we can try to relax the first timestamp rule somehow to help avoid the problem.
(b) Strategy: start on the epoch corresponding to the current time allowing large null-block gaps Complications: while this keeps the first validation rule without need for any upgrades or special recovery behavior now there is a sequence of open epochs that can be filled by a miner deviating from the protocol after the fact which opens up attack vectors.
(iii) doesn't help

If we choose (a) we can try to relax the timestamp rule to something less strict to avoid the need for an upgrade enforcing no blocks mined during the halt.  We could relax rule 1 back to the minimal timestamp rules that (3) there must be at least a block time between timestamps and (4) that timestamps cannot come from the future.  However, as pointed out by @Kubuxu an adversary could always pack halt time / epoch time blocks into the initial recovery period due to the leniency of (3) which leaves us with the same problem.
For the sake of argument we could go down the (bad design) road of requiring clock behavior to change every time there is an epoch without a block to try to handle halting automatically (automatic backdating / always try to mine around null blocks).  But this immediatley opens up (at least) a symmetric problem where the weakened protections of (4) allow halt time / epoch time blocks to be packed between the backdated present (past) and the backdated future (present).
It seems that approach (iii) cannot help us here (though I'm still open to ideas)
(ii) doesn't require relaxing (1)

Observe that we can backdate timestamps to fit into the pre-determined epoch mapping while running the recovery protocol (ii).  So (1) actually doesn't prevent us from running approach (ii) at all.
constraints on (ii)

If we choose (a) and (ii) we DO have to modify our mining timing rules. To support (ii) we would need to allow for mining -- not just syncing -- with relaxed timing conditions, otherwise nodes would take an entire block time to mine a backdated block and would never catch up to the future.
Nodes are constrained by the time it takes to mine which will likely be a significant portion of block time due to election post bottlenecks.  To catch up to the epoch boundaries nodes must cut time waiting to receive other nodes' blocks before mining.  This inevitably leads to a greater number of forks in the recovery chain.
(ii) might not be enough

A greater number of forks during recovery in (ii) puts the honest participants running recovery at a relative disadvantage to a coordinated adversary running its own malicious recovery chain after restart (using an argument along the same lines as the GHOST paper's argument).
Furthermore a byzantine adversary may not actually halt operation during the chain halt but instead create a private chain and withhold it.  This is even likely if the halt event was triggered by a byzantine actor running a custom implementation patched before the rest of the network.  In this case the election post bottleneck may slow down honest participants enough that the adversary's chain is easily accepted as the heaviest simply because it has a headstart the size of the halt time.
All of the above can be costed out with respect to adversary size, recovery mode latency parameters and halt time.  It might not be enough to matter for expected ranges of these values but note that the honest participants are definitely at some disadvantage.
What should we do?


We should analyze (ii) and determine what threat models it works against
It is likely we'll want to launch with some version of recovery mining behind a configuration flag
We will likely want to be ready to deploy heavy handed solutions like (i) when the network does halt in order to be resilient against more significant threat models, particularly if the data causing an outage appears to exist only for the purposes of an exploit
It is important to deploy a fix to a network halt ASAP to mitigate adversary advantage if we opt against solution (i)
Operators should also consider running safety checks on incoming chains during sensitive periods like this that reject blocks that slash them.  This doesn't prevent an adversary from unfairly penalizing a large fraction of miners but does prevent an adversary from evicting the entire power table.

Finally we should not try to replace rule (1) with something else as it won't help us mitigate attacks.