Haseeb-Qureshi/thunder_mainnet.md

## thunder_mainnet.md

      
New and Simple Consensus Algorithms for ThunderCore’s MainNet

Speaker: Elaine Shi


Synchronous, with a chance of partition tolerance
State machine replication (e.g., blockchain consensus)

Requires safety (nothing bad ever happens), and liveness (eventually something good happens)


Thunderella

If you don't know Thunderella, go look it up (check out the BPASE '18 talk)


Wait, there's a flaw in Thunderella?!

A confirmed transaction can be undone, even in a somewhat benign setting
Wtf? They wrote 74 pages of proofs, so how can a provably correct protocol be flawed?

The proof is fine... the problem is the synchronous network model


Thunderella gives two properties:

With honest majority, you get consistency
With 3/4 honest and an honest leader/accelerator, you get liveness & good performance
Can also get liveness with an honest majority by...

Falling back to a slow chain (blockchain-based) whenever the fast chain halts


How does fallback happen?

Committee on fast path posts periodic heartbeats to the slow chain

Posts notarized (3/4 signed) hashes of fast path log
Serves as both heartbeat & checkpoint


If the slow chain goes a long stretch of blocks (k blocks) without a heartbeat, then all agree to fall back to the slow chain

However, not everyone agrees on the number of transactions before the fast chain failed!
Everyone posts notarized but UNCHECKPOINTED transactions onto the slow chain to achieve consensus on where the fast chain halted

Basically, everyone gets together and shares their notes on what was confirmed on the fast chain
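
The heartbeat timeout described above can be sketched roughly as follows. This is a minimal illustration, not ThunderCore's implementation: the `SlowBlock` type, the `should_fall_back` function, and the choice to model a heartbeat as a boolean flag are all assumptions made for the example; only the timeout `k` comes from the notes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlowBlock:
    """A slow-chain block (hypothetical type for this sketch)."""
    height: int
    has_heartbeat: bool  # True if it carries a notarized fast-path checkpoint


def should_fall_back(slow_chain: List[SlowBlock], k: int) -> bool:
    """Return True if the last k slow-chain blocks contain no heartbeat."""
    if len(slow_chain) < k:
        return False  # not enough blocks yet to declare the fast path dead
    return not any(block.has_heartbeat for block in slow_chain[-k:])


# Six slow-chain blocks; only the block at height 2 carries a heartbeat.
chain = [SlowBlock(h, has_heartbeat=(h == 2)) for h in range(6)]
print(should_fall_back(chain, k=3))  # heights 3, 4, 5: no heartbeat
print(should_fall_back(chain, k=4))  # heights 2..5: height 2 has one
```

Since every node reads the same slow chain, every node evaluates this check on the same data, which is why they can all agree on when to fall back.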


Now you can achieve both consistency and liveness—you get optimistic liveness!


Thunderella in two sentences: a single round of voting when things are good; a slow blockchain as fallback when things are bad


So what's the flaw??

Leader makes a proposal for a block
Everyone signs, leader collects the signatures and produces a notarized version of the block
Leader starts broadcasting the notarization, and it reaches one committee member

But then leader crashes and goes offline before broadcasting to anyone else!


That committee member considers the block finalized and, for example, pays out an end user
But that committee member then ALSO goes offline
So the only people who knew/saw the notarized block are now offline
The rest of the committee members go back to the slow chain, but because no one has those transactions...

They continue moving forward as though the transactions never happened
Thus, the transactions are REVERTED despite having been earlier confirmed.


This is bad! This should not happen!
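
The failure scenario above can be replayed in a toy simulation. This is only a sketch under simplifying assumptions: the `Node` class, the `fall_back` function, and the node names are hypothetical, signatures are not modeled, and a "block" is just a height.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    online: bool = True
    seen_notarized: Set[int] = field(default_factory=set)  # heights seen notarized
    confirmed: Set[int] = field(default_factory=set)       # heights treated as final

    def receive_notarization(self, height: int) -> None:
        self.seen_notarized.add(height)
        self.confirmed.add(height)  # original rule: confirm as soon as notarized


def fall_back(nodes: List[Node]) -> Set[int]:
    """Online nodes pool the notarized blocks they know about on the slow chain."""
    agreed: Set[int] = set()
    for node in nodes:
        if node.online:
            agreed |= node.seen_notarized
    return agreed


leader, alice, bob, carol = (Node(n) for n in ["leader", "alice", "bob", "carol"])
committee = [leader, alice, bob, carol]

# Everyone has seen block 1 notarized.
for node in committee:
    node.receive_notarization(1)

# Block 2: the notarization reaches only alice before the leader crashes.
leader.receive_notarization(2)
alice.receive_notarization(2)
leader.online = False

# alice pays out based on block 2, then goes offline too.
alice.online = False

agreed = fall_back(committee)
reverted = alice.confirmed - agreed
print("agreed after fallback:", sorted(agreed))
print("confirmed by alice but reverted:", sorted(reverted))
```

Neither the leader nor alice misbehaved here; they only went offline, yet block 2 ends up confirmed by alice and missing from what the rest agree on.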


Why is such a flaw possible in a provably correct protocol?

Classical synchronous models permit flawed protocols


What is the synchronous model?

We assume that when honest nodes send messages, they arrive in a bounded amount of time

(And the protocol is aware of this bound)


Message delay = 1 round
In the synchronous model, having a temporary outage means you're faulty

In a synchronous model, a "faulty" node gets no guarantees about correctness!


In the partially synchronous model, the message delay can be arbitrarily long

Having a temporary outage is isomorphic to having a long message delay


We say that partially synchronous models are arbitrarily partition-tolerant
But NO partially synchronous protocol can tolerate 1/3 or more corruptions
Can we achieve the best of both worlds?

We want to tolerate up to 1/2 corruptions (which we can do in the synchronous model, a la Nakamoto consensus)
We want temporary outages to be treated the same as long message delays
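
To make the gap between the two thresholds concrete, here is a small worked calculation. The function names are made up for this sketch; the bounds are the standard ones (fewer than 1/3 corrupt under partial synchrony, fewer than 1/2 corrupt for synchronous honest-majority protocols).

```python
def max_faults_partial_sync(n: int) -> int:
    """Largest f with f < n/3 (equivalently n >= 3f + 1)."""
    return (n - 1) // 3


def max_faults_sync_honest_majority(n: int) -> int:
    """Largest f with f < n/2 (equivalently n >= 2f + 1)."""
    return (n - 1) // 2


# Columns: n, tolerated faults under partial synchrony, under synchrony.
for n in (4, 10, 100):
    print(n, max_faults_partial_sync(n), max_faults_sync_honest_majority(n))
```

With 100 nodes, for instance, that is 33 tolerated corruptions versus 49.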


The answer should be no, given classical consensus protocols...


What if we are more subtle in how we model how "synchronous" a network is?

Imagine we have some nodes that are honest and online
But other nodes that are honest, with unstable connections
Then we have corrupt nodes who we don't care about
We want to achieve consensus among the first two groups: is this possible?

Turns out, you cannot do this unless the honest and online group is LARGE

greater than or equal to 1/2 of all nodes
This makes some synchrony assumptions about the network, since you assume at least half of the nodes have fast connectivity

But a minority of nodes can be unstable
"Best-possible partition tolerance"


But it's commonly assumed that these honest and online nodes are persistent...

But Bitcoin has been around for 10 years—those nodes churn over long enough timespans


More realistically: every node is offline at some point!


You just need at least 1/2 of the nodes to be honest and online in every round; which nodes those are can change from round to round
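
A quick way to see that this requirement is compatible with every node going offline at some point is to check it against a rotating schedule. This is a sketch only: `satisfies_model`, the node names, and the schedule are invented for illustration, and the threshold follows the "at least 1/2" wording in these notes.

```python
from typing import List, Set


def satisfies_model(rounds: List[Set[str]], all_nodes: Set[str]) -> bool:
    """True if, in every round, at least half of all nodes are honest and online."""
    return all(2 * len(online & all_nodes) >= len(all_nodes) for online in rounds)


nodes = {"a", "b", "c", "d"}

# rounds[r] = honest nodes that are online in round r (hypothetical schedule).
churning = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "a"}]

# Nodes that were offline in at least one round.
ever_offline = nodes - set.intersection(*churning)

print(satisfies_model(churning, nodes))          # half the nodes are online each round
print(sorted(ever_offline))                      # yet every node was offline at some point
print(satisfies_model([{"a"}, {"a", "b"}], nodes))  # first round has only 1 of 4 online
```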


How to fix it? — Making Thunderella Best-Possible Partition Tolerant

For Fast Path:

Before, you confirmed a transaction whenever you saw a notarized block

But you can't rule out that you are the only one who has seen it, and you might then go offline


So instead, you only confirm transactions in notarized blocks ONE BLOCK BACK

This ensures that the notaries who signed the newer block must have seen the previous block
And thus they should be able to write it to the slow chain if you go offline
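
The difference between the two confirmation rules can be sketched in a few lines. The function names and the list-of-labels representation are assumptions for illustration, not the protocol's actual data structures.

```python
from typing import List


def confirmed_old(notarized: List[str]) -> List[str]:
    """Original rule: every notarized block you have seen is confirmed."""
    return list(notarized)


def confirmed_new(notarized: List[str]) -> List[str]:
    """Revised rule: confirm everything except the most recent notarized block."""
    return list(notarized[:-1])


seen = ["B1", "B2", "B3"]  # notarized fast-path blocks, oldest first
print(confirmed_old(seen))  # B3 counts as final immediately
print(confirmed_new(seen))  # B3 waits until a later block is notarized on top of it
```

The cost of the revised rule is that each block is confirmed one block later than before.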


New theoretical and practical results around "best-possible partition tolerance"

Can be considered a practical network model

Guarantees liveness only during "periods of synchrony"


Some philosophical observations:

This class of protocols with best-possible partition tolerance is a subset of "Synchronous honest majority protocols"
None of the classical synchronous honest majority protocols are best-possible partition tolerant


Classical corrupt majority protocols cannot be best-possible partition tolerant (e.g., Dolev-Strong)

Classical synchronous model is a mismatch for what we need in real-world deployments


Thunder is launching in Q1 2019

Whoo