Client diversity in Ethereum is exceedingly important due to the aggressive slashing penalties: in case of a consensus error, the more validators are in the wrong, the heavier the penalties are. Even worse, if a majority of validators are in the wrong, the bad chain can get finalized, leading to gnarly governance issues of how to recover from the error, with perverse incentives for the majority validators not to. Such an event could have a chilling effect on Ethereum adoption as a whole.
The standard solution to the problem is client diversity: making sure that every client flavor in the network (consensus and execution too) has a market share less than 50%. In case of a consensus error, the faulty clients would get penalized, but it wouldn't be exorbitant, and it wouldn't have a detrimental effect on the network as a whole. This approach worked well on the consensus layer. However, the CL clients were all relatively new with similar performance profiles.
On the execution layer, things are a bit more complicated (at least for now). Although it's not clear how large of an edge Geth has over other clients market share wise (the stats are unreliable at best), it is generally accepted that Geth does dominate over other clients. This places both Geth users and the entire network into a higher risk bracket than ideal. The usual mantra of "use a minority client" helps a bit, but switching can surface hidden incompatibilities, different resource requirements, new monitoring systems, etc. Doable, but less than ideal.
A better solution is to run more than one client side by side, and cross-reference blocks between them. This is an approach "expected" of large players (staking pools, high risk providers), but it is also an expensive solution both hardware and effort wise: every client has its quirks that the operator needs to be aware of, and every client comes with its own maintenance burden. A non-issue for dedicated teams, but a definite issue from a decentralization perspective.
Pushing for diversity seems like a necessity, but the practicalities make it somewhat unrealistic, at least in the immediate short term future. We do need, however, a short term solution too.
The only solution for true resilience is verifying blocks with multiple clients. For most users, however, running multiple clients is unrealistic. The theoreticals and practicals seem to be at odds with one another, until we realise that we're not in a binary situation. Instead of looking at it as either running one client or running many clients, there is a third option: running one client but verifying with all clients (without actually running them). A wha'? 🤔
The observation is that verifying a block is actually quite cheap. 100-200ms worth of computation is enough for most clients running on most hardware. So from a computational perspective - taking into account that any remotely recent computer has ample CPU cores - verifying a block with one client or all clients is kind of the same.
The problem is not CPU or memory or even networking. The problem is state. Even though it takes 100ms to verify a block with almost any client, having the necessary state present to serve the verification makes it very expensive. Except... it is absolutely redundant to maintain the state N times: it's the same exact thing, just stored a bit differently in each client. We, of course, cannot share the state across clients, so it seems we're back to square one... seems...
Whilst it is true that we cannot share a 250GB state across clients (ignoring historical blocks now), there's also no need to do such a thing. Executing a single block only needs a couple hundred accounts and storage slots to be available, and verifying the final state roots only requires the proofs for those state items. In short, we don't need to share an entire state with other clients to verify a block, we just need to share (or rather send) a witness to them.
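To make the witness idea concrete, here is a minimal toy sketch, not any client's actual implementation. It assumes a drastically simplified "state" (a flat dictionary of balances) and a "block" that is just a list of transfers; all names (`RecordingState`, `WitnessState`, `execute_block`) are made up for illustration. It also leaves out the Merkle proofs that a real witness would need so the verifier can check the state items against the state root.

```python
from dataclasses import dataclass, field


class MissingWitnessEntry(Exception):
    """Raised when stateless execution touches state the witness lacks."""


@dataclass
class RecordingState:
    """Full-state view that records the pre-state value of every key read."""
    full_state: dict
    witness: dict = field(default_factory=dict)
    writes: dict = field(default_factory=dict)

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        value = self.full_state.get(key, 0)
        self.witness.setdefault(key, value)  # remember what was read
        return value

    def set(self, key, value):
        self.writes[key] = value


@dataclass
class WitnessState:
    """Stateless view: only the items contained in the witness exist."""
    witness: dict
    writes: dict = field(default_factory=dict)

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        if key not in self.witness:
            raise MissingWitnessEntry(key)
        return self.witness[key]

    def set(self, key, value):
        self.writes[key] = value


def execute_block(state, transfers):
    """Toy block: a list of (sender, recipient, amount) balance transfers."""
    for sender, recipient, amount in transfers:
        balance = state.get(sender)
        if balance < amount:
            continue  # toy rule: skip transfers that cannot be paid
        state.set(sender, balance - amount)
        state.set(recipient, state.get(recipient) + amount)
    return dict(state.writes)


if __name__ == "__main__":
    full_state = {f"acct{i}": 100 for i in range(1000)}
    block = [("acct1", "acct2", 30), ("acct2", "acct3", 120)]

    host = RecordingState(full_state)          # the user's stateful client
    host_result = execute_block(host, block)

    verifier = WitnessState(host.witness)      # a stateless cross-validator
    print(execute_block(verifier, block) == host_result)
    print(len(host.witness), "of", len(full_state), "state items were needed")
```

The point of the toy is only the shape of the data flow: the host touches three items out of a thousand, and those three items are all a second implementation needs in order to re-execute the same block.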
This changes everything...
Instead of asking people to run a minority client (may be inconvenient), or asking them to run multiple clients (may be expensive), we can let them use whatever client they fancy, and only ask them to cross-validate with other clients, statelessly.
From a high level perspective, block validation within the execution client consists of running a bundle of transactions and then comparing the produced results with the expected ones according to the block header.
The proposal is to extend this very last step a bit. Instead of just running the transactions and considering the results final, the user's client would at the same time also create a witness for the block as it is executing it. After execution finishes, we propose to add an extra cross validation step, sending the witness to a variety of other clients to run statelessly.
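The cross validation step could be sketched roughly as follows. This is an illustration under assumptions, not a spec: `cross_validate` and the validator callables are hypothetical names, each validator is assumed to take the block and its witness and return a boolean, and a validator that crashes is assumed to count as "no verdict" rather than as a disagreement.

```python
from concurrent.futures import ThreadPoolExecutor


def cross_validate(block, witness, validators):
    """Run every stateless validator on the same (block, witness) pair.

    validators: mapping of client name -> callable(block, witness) -> bool.
    Returns a mapping of client name -> True, False, or None (no verdict).
    """
    def run(validate):
        try:
            return bool(validate(block, witness))
        except Exception:
            return None  # assumption: a crash is "no verdict", not "invalid"

    # Validators are independent, so they can run side by side.
    with ThreadPoolExecutor(max_workers=max(1, len(validators))) as pool:
        futures = {name: pool.submit(run, fn) for name, fn in validators.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Stand-in validators; real ones would call into other client software.
    validators = {
        "client_a": lambda block, witness: True,
        "client_b": lambda block, witness: True,
        "client_c": lambda block, witness: False,
    }
    print(cross_validate({"number": 1}, {"acct1": 100}, validators))
```

Threads are enough here on the assumption that each validator mostly waits on an external process; since the validators run concurrently, the wall-clock cost of the extra step is roughly that of the slowest validator rather than the sum of all of them.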
Aggregating the result from them, if all (or most) clients agree with the host, the result can be transmitted to the consensus client. On the other hand, if there are multiple cross-validating clients disagreeing, the user's client would refuse to accept and attest the block.
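One possible shape for this decision rule is sketched below. The exact threshold is one of the open questions of the proposal, so `max_disagreements` is purely an assumed, configurable parameter; the default of 1 reflects the "multiple clients disagreeing" wording above. Treating a missing verdict (`None`) as neither agreement nor disagreement is likewise an assumption.

```python
def should_attest(host_valid, verdicts, max_disagreements=1):
    """Decide whether the block may be passed on to the consensus client.

    host_valid: the verdict of the user's own (stateful) client.
    verdicts:   mapping of cross-validator name -> True / False / None,
                where None means the validator produced no verdict.
    """
    if not host_valid:
        return False  # the host itself rejected the block
    disagreements = sum(1 for verdict in verdicts.values() if verdict is False)
    return disagreements <= max_disagreements


if __name__ == "__main__":
    # A single dissenting validator is tolerated with the default threshold.
    print(should_attest(True, {"a": True, "b": False, "c": True}))
    # Two dissenting validators make the host refuse to attest.
    print(should_attest(True, {"a": False, "b": False, "c": True}))
```

Setting `max_disagreements=0` gives the strict "all clients must agree" variant, at the cost of letting a single buggy cross-validator halt attestations.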
Q: Doesn't this still mean everyone needs to maintain multiple clients?
Not really. Everyone would need to/may have multiple clients available, but they would not be running as a live client. Rather every client would have a new mode of operation - stateless validation - where they only consume witness-enriched blocks and return validation results.
The simplest form would not even need these clients to run non-stop, rather would use a CLI command to statelessly validate a block, but that's an implementation / spec detail beyond the scope of this document.
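Purely as an illustration of the single-shot variant, the host could drive such a command like this. No client ships this interface today as far as this sketch is concerned: the contract assumed here (one JSON document on stdin, exit code 0 for valid, 1 for invalid) and the helper name `validate_with_cli` are hypothetical.

```python
import json
import subprocess
import sys


def validate_with_cli(command, block, witness, timeout=10.0):
    """Pipe a witness-enriched block to a single-shot validator process.

    Assumed (hypothetical) contract: the process reads one JSON document
    from stdin and exits with 0 if the block is valid, 1 if it is invalid.
    Any other outcome (crash, timeout, missing binary) yields None.
    """
    payload = json.dumps({"block": block, "witness": witness})
    try:
        proc = subprocess.run(
            command, input=payload, text=True,
            capture_output=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return {0: True, 1: False}.get(proc.returncode)


if __name__ == "__main__":
    # Stand-in "validator": a tiny Python one-liner instead of a real client.
    fake_validator = [
        sys.executable, "-c",
        "import json, sys; d = json.load(sys.stdin); "
        "sys.exit(0 if d['witness'] else 1)",
    ]
    print(validate_with_cli(fake_validator, {"number": 1}, {"acct1": 100}))
```

Because the payload travels over a local pipe, the witness never touches the network, and the validator process exists only for the duration of one block.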
Q: Is it reasonable to expect clients to ship stateless validation?
Most clients already support something like this for running state tests, so it's not really a new concept. Making it production ready in a nicer sugar-coating should thus not be an excessive effort. Compared to the fallout that a consensus issue can cause, this feature seems like a no-brainer.
Q: Is it reasonable to expect clients to support witness generation and cross-validation with other clients?
Attesting a bad block can lead to serious penalties and slashings. Generating a witness and running a cross validation step is a straightforward and simple process. Our expectation is that the users of all clients will demand support for a feature that is easy to ship and can protect them from financial ruin.
Q: Aren't we doing Verkle because witnesses are too large?
Whilst Merkle-Patricia witnesses are indeed quite large, their size is only relevant when sending through the network. If the cross validator clients are indeed stateless, they can be run side-by-side (or on demand) with the main client. In that case, transmitting the witness to them happens in the OS memory, so whether it's tens or even hundreds of MB is not relevant.
Q: Do you expect me to install 6 different execution clients?
Depends:
- If you are a high stake operator, you are probably already running multiple full-fledged clients, so this is moot for you.
- If you are running your own infra, adding a few cross-validating clients might not be an excessively hard task, but we could create docker and bare metal bundles of all-the-clients to cross validate with.
- If you are running a DAppNode or similar setup, we'd expect them to add support for running all clients in cross-validation mode and provide this protection to their users out of the box, by default.
This proposal aims to solve the need for diversity, without forcing diversity. It allows everyone to run their preferred client without having to fear all the bad things that can go wrong. The proposal does require a bit of collaboration from everyone to get through, but it seems like a very simple solution compared to how gnarly the original problem is.
There are, of course, some technicalities that need to be worked out (single-shot vs long running verifiers; communication protocol; witness format and content; K-out-of-N validity repercussions; hard suspend vs. empty attests; etc). This document was meant more as a teaser and high level overview of the proposal to the diversity problem.
One "problem" I see with this is that it only addresses client bugs in the VM, not in the rest of the execution environment. While these certainly are the most likely consensus bugs, one can easily imagine a client bug in storage or storage retrieval that feeds wrong data into the witness: since the witness data comes from a single client, all of the clients would execute the block the same way and agree on it, even though it is wrong.
Note: The situation here is still significantly better than the current situation, but not as good as actual client diversity.