Making EL Diversity Moot 🙃

@karalabe · karalabe/moot.md · Created November 16, 2023

Client diversity in Ethereum is exceedingly important due to the aggressive slashing penalties: in case of a consensus error, the more validators are in the wrong, the heavier the penalties are. Even worse, if a majority of validators are in the wrong, the bad chain can get finalized, leading to gnarly governance issues about how to recover from the error, with perverse incentives for the majority validators not to. Such an event could have a chilling effect on Ethereum adoption as a whole.

The standard solution to the problem is client diversity: making sure that every client flavor in the network (consensus and execution alike) has a market share of less than 50%. In case of a consensus error, the faulty clients would get penalized, but the penalty wouldn't be exorbitant, and it wouldn't have a detrimental effect on the network as a whole. This approach worked well on the consensus layer; the CL clients, however, were all relatively new, with similar performance profiles.

On the execution layer, things are a bit more complicated (at least for now). Although it's not clear how large an edge Geth has over other clients market-share wise (the stats are unreliable at best), it is generally accepted that Geth does dominate. This places both Geth users and the entire network into a higher risk bracket than ideal. The usual mantra of "use a minority client" helps a bit, but switching can surface hidden incompatibilities, different resource requirements, new monitoring systems, etc. Doable, but less than ideal.

A better solution is to run more than one client side by side and cross-reference blocks between them. This is an approach "expected" of large players (staking pools, high-stakes providers), but it is also an expensive solution both hardware and effort wise: every client has its quirks that the operator needs to be aware of, and every client has a maintenance burden that has to be carried. A non-issue for dedicated teams, but a definite issue from a decentralization perspective.

Pushing for diversity seems like a necessity, but the practicalities make it somewhat unrealistic, at least in the immediate future. We do, however, need a short term solution too.

Rethinking the problem

The only solution for true resilience is verifying blocks with multiple clients. For most users, however, running multiple clients is unrealistic. The theoretical and the practical seem to be at odds with one another, until we realize we're not in a binary situation. Instead of looking at it as either running one client or running many clients, there is a third option: running one client but verifying with all clients (without actually running them). A wha'? 🤔

The observation is that verifying a block is actually quite cheap. 100-200ms worth of computation is enough for most clients running on most hardware. So from a computational perspective - taking into account that any remotely recent computer has ample CPU cores - verifying a block with one client or all clients is kind of the same.

The problem is not CPU or memory or even networking. The problem is state. Even though it takes 100ms to verify a block with almost any client, having the necessary state present to serve the verification is what makes it very expensive. Except... it is absolutely redundant to maintain the state N times; it's the exact same thing, just stored a bit differently in each client. We, of course, cannot share the state database across clients, so it seems we're back to square one... seems...

Whilst it is true that we cannot share a 250GB state across clients (ignoring historical blocks now), there's also no need to do such a thing. Executing a single block only needs a couple hundred accounts and storage slots to be available, and verifying the final state roots only requires the proofs for those state items. In short, we don't need to share an entire state with other clients to verify a block, we just need to share (or rather send) a witness to them.
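
To make the idea more tangible, here is a minimal sketch of what such a witness might bundle together; the type and field names are purely illustrative assumptions, the actual format being one of the open technicalities noted in the epilogue:

```go
// Package witness sketches one hypothetical shape for an execution witness.
package witness

// Witness bundles everything a stateless verifier needs to re-execute a
// single block: the block itself, the parent header anchoring the pre-state,
// the touched slice of the state trie, and the invoked contract code.
type Witness struct {
	Block      []byte   // RLP-encoded block to re-execute and verify
	ParentHdr  []byte   // RLP-encoded parent header, committing to the pre-state root
	StateNodes [][]byte // Merkle-Patricia trie nodes covering every touched account and slot
	Codes      [][]byte // bytecode of every contract invoked by the block's transactions
}
```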

This changes everything...

Diversity, without diversity

Instead of asking people to run a minority client (may be inconvenient), or asking them to run multiple clients (may be expensive), we can let them use whatever client they fancy, and only ask them to cross-validate with other clients, statelessly.

From a high level perspective, block validation within the execution client consists of running a bundle of transactions and then comparing the produced results with the expected ones according to the block header.

The proposal is to extend this very last step a bit. Instead of just running the transactions and considering the results final, the user's client would at the same time also create a witness for the block as it is executing it. After execution finishes, we propose adding an extra cross-validation step: sending the witness to a variety of other clients to run statelessly.

Aggregating the results from them: if all (or most) clients agree with the host, the result can be transmitted to the consensus client. On the other hand, if multiple cross-validating clients disagree, the user's client would refuse to accept and attest the block.
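
As a rough sketch of that flow, building on the witness type above (the interface, the transport, and the strict all-must-agree policy are illustrative assumptions of this sketch; the real acceptance policy, K-out-of-N and all, is listed as an open question in the epilogue):

```go
package witness

// CrossValidator is a hypothetical handle to a satellite client running in
// stateless-validation mode; whether it lives in-process, as a subprocess or
// as a sidecar service is an implementation detail.
type CrossValidator interface {
	// ValidateStateless re-executes the block against the witness and reports
	// whether the results match the commitments in the block header.
	ValidateStateless(w *Witness) (valid bool, err error)
}

// crossValidate fans the witness out to every satellite client and applies a
// deliberately strict policy: any explicit disagreement vetoes the block,
// while a crashed or unreachable satellite simply carries no opinion.
func crossValidate(w *Witness, satellites []CrossValidator) bool {
	for _, sat := range satellites {
		valid, err := sat.ValidateStateless(w)
		if err != nil {
			continue // satellite failed to answer: no opinion either way
		}
		if !valid {
			return false // a disagreeing satellite blocks attestation
		}
	}
	return true // every reachable satellite agrees with the host client
}
```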

Details, details, details

Q: Doesn't this still mean everyone needs to maintain multiple clients?

Not really. Everyone would need to (or may) have multiple clients available, but they would not be running as live clients. Rather, every client would gain a new mode of operation - stateless validation - in which it only consumes witness-enriched blocks and returns validation results.

The simplest form would not even need these clients to run non-stop; rather, it would use a CLI command to statelessly validate a block. But that's an implementation / spec detail beyond the scope of this document.

Q: Is it reasonable to expect clients to ship stateless validation?

Most clients already support something like this for running state tests, so it's not really a new concept. Making it production ready in a nicer sugar-coating should thus not be an excessive effort. Compared to the fallout that a consensus issue can cause, this feature seems like a no-brainer.

Q: Is it reasonable to expect clients to support witness generation and cross-validation with other clients?

Attesting a bad block can lead to serious penalties and slashings. Generating a witness and running a cross validation step is a straightforward and simple process. Our expectation is that the users of all clients will demand support for a feature that is easy to ship and can protect them from financial ruin.

Q: Aren't we doing Verkle because witnesses are too large?

Whilst Merkle-Patricia witnesses are indeed quite large, their size is only relevant when sending them through the network. If the cross-validating clients are indeed stateless, they can be run side by side (or on demand) with the main client. In that case, transmitting the witness to them happens in OS memory, so whether it's tens or even hundreds of MB is not relevant.

Q: Do you expect me to install 6 different execution clients?

Depends:

  • If you are a high-stakes operator, you are probably already running multiple full-fledged clients, so this is moot for you.
  • If you are running your own infra, adding a few cross-validating clients might not be an excessively hard task, but we could also create Docker and bare-metal bundles of all-the-clients to cross-validate with.
  • If you are running a DAppNode or similar setup, we'd expect them to add support for running all clients in cross-validation mode and provide this protection to their users out of the box, by default.

Epilogue

This proposal aims to solve the need for diversity without forcing diversity. It allows everyone to run their preferred client without having to fear all the things that can go wrong. The proposal does require a bit of collaboration from everyone to get through, but it seems like a very simple solution compared to how gnarly the original problem is.

There are, of course, some technicalities that need to be worked out (single-shot vs. long-running verifiers; communication protocol; witness format and content; K-out-of-N validity repercussions; hard suspend vs. empty attests; etc.). This document was meant as more of a teaser and high level overview of the proposed solution to the diversity problem.

@MicahZoltu commented Nov 16, 2023

One "problem" I see with this is that it only addresses client bugs in the VM, not in the rest of the execution environment. While these certainly are the most likely consensus bugs, one can easily imagine a client bug in storage or storage retrieval that would cause all of the clients to execute block the same, but since the witness data is coming from a single client, if it is wrong all of the clients will agree on the block.

Note: The situation here is still significantly better than the current situation, but not as good as actual client diversity.

@hkalodner commented Nov 16, 2023

I think the full stateless approach handles that nicely by verifying all storage ops against the state trie commitment. All the witness info would be trustless. There's still a risk of liveness issues with this approach, so full client diversity is definitely the ideal, but I'm not sure I see any safety issues here, which would be a huge win.

@karalabe (author)

The general observation is probably correct: there might always be some aspects not covered by this setup. As for this specific issue of running on bad state, that can be avoided by verifying the witness against the parent block's header. Sure, if I give a satellite EVM both a bad state and a bad header, then things can still be borked, but that's getting excessively improbable.
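
A minimal sketch of that anchoring check, assuming keccak-addressed MPT nodes (the helper names and the simplified membership test are this sketch's assumptions; a real verifier would also walk the trie path of every touched account and slot):

```go
package witness

import "golang.org/x/crypto/sha3"

// verifyWitnessAnchored checks that the witness is rooted in the parent
// block's header: some node in the witness must hash to the parent's state
// root, so a satellite EVM never starts executing from an unanchored state.
func verifyWitnessAnchored(parentStateRoot [32]byte, stateNodes [][]byte) bool {
	for _, node := range stateNodes {
		if keccak256(node) == parentStateRoot {
			return true // the pre-state root node is part of the witness
		}
	}
	return false
}

// keccak256 hashes a trie node with legacy Keccak-256, the hash used by
// Ethereum's Merkle-Patricia trie.
func keccak256(data []byte) (h [32]byte) {
	hasher := sha3.NewLegacyKeccak256()
	hasher.Write(data)
	copy(h[:], hasher.Sum(nil))
	return h
}
```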

@MicahZoltu

Ah, good point about being able to verify the witness data against the block header. That does greatly limit the scope of possible consensus bugs outside of EVM execution.

@daniellehrner

I see this approach as more of a band-aid than a real solution, which is true client diversity. It leaves the network more fragile, because the network is more prone to DoS attacks, as we only have one tx pool implementation. Any other bug that is not in block validation, e.g. in block creation, is not covered at all like that.

To be honest, I think this will hurt client diversity. Staking providers running Geth already risk losing a significant part of their staked ETH, because Geth is assumed to have an 80% majority. Not even that incentivizes them to increase their client diversity. I don't see how to convince staking providers to care about the health of the network and increase client diversity if there is not even a monetary risk anymore.

@LukaszRozmej commented Nov 16, 2023

This is very doable with Verkle Tries; it is at the core of the EIP. Nethermind already has a Verkle-Trie-based stateless witness execution prototype mode in which we could run this very light client version. So can we consider this a +1 for Verkle, which has all these witnesses built into the protocol?

But also, as said before, there is more to client diversity than just transaction execution; there are things like networking and the tx pool with attack vectors that also benefit from client diversity.

@karalabe (author)

> I see this approach as more of a band-aid than a real solution, which is true client diversity. It leaves the network more fragile, because the network is more prone to DoS attacks, as we only have one tx pool implementation.

The proposal does not propose to ignore diversity altogether. The idea is that fighting for a perfectly balanced diversity is (IMO) not realistic. If there is a way to avoid catastrophic failures without forcing a specific client distribution, that seems like a good tradeoff. There is also an advantage to this proposal over running diverse clients: if my chosen client has a bug, the proposal prevents me from getting penalized, whereas the plain diversity suggestion does not. So it also protects all validators against bugs in their chosen clients.

> Any other bug that is not in block validation, e.g. in block creation, is not covered at all like that.

Block production can be validated the same way: the primary client creates the block, but runs it through the other clients before announcing it. You get the same protections.

> To be honest, I think this will hurt client diversity. Staking providers running Geth already risk losing a significant part of their staked ETH, because Geth is assumed to have an 80% majority. Not even that incentivizes them to increase their client diversity. I don't see how to convince staking providers to care about the health of the network and increase client diversity if there is not even a monetary risk anymore.

This proposal ensures that network health is protected even in case of a client imbalance. It not only protects majority client operators, it also protects minority client operators against issues. It's an all-around benefit for everyone. It will of course lower the pressure to switch clients, but I don't think it's realistic to expect people to switch clients every half a year because the balance shifts one way or another. The idea is to allow operators to switch clients when another client suits them better, not when some network metric flips.

@karalabe (author)

> This is very doable with Verkle Tries. Nethermind already has a Verkle-Trie-based stateless witness execution prototype mode in which we could run this very light client version. So can we consider this a +1 for Verkle, which has all these witnesses built into the protocol?

The point I made with MPT was that we don't have to wait two more years for Verkle to land just to have this protection. We can already have it today with MPT. There's no point being made wrt the MPT vs. Verkle switch for mainnet.

> But also, as said before, there is more to client diversity than just transaction execution; there are things like networking and the tx pool with attack vectors that also benefit from client diversity.

Of course. Diversity is good. We should just not have catastrophic network-killing repercussions with less-than-perfect diversity.

@LukaszRozmej

> > This is very doable with Verkle Tries. Nethermind already has a Verkle-Trie-based stateless witness execution prototype mode in which we could run this very light client version. So can we consider this a +1 for Verkle, which has all these witnesses built into the protocol?
>
> The point I made with MPT was that we don't have to wait two more years for Verkle to land just to have this protection. We can already have it today with MPT. There's no point being made wrt the MPT vs. Verkle switch for mainnet.

I would rather focus on Verkle and ship it late next year than have a stop-gap, out-of-protocol, throw-away solution.

@karalabe (author)

> I would rather focus on Verkle and ship it late next year than have a stop-gap, out-of-protocol, throw-away solution.

Verkle will not ship late next year, that I can promise. There are way too many open questions still not addressed. Doing this proposal, on the other hand, takes very little time. I see no good reason not to do it. We should not let a future, potentially perfect solution be the enemy of an immediate, potentially very good one.

@LukaszRozmej

> > I would rather focus on Verkle and ship it late next year than have a stop-gap, out-of-protocol, throw-away solution.
>
> Verkle will not ship late next year, that I can promise. There are way too many open questions still not addressed. Doing this proposal, on the other hand, takes very little time. I see no good reason not to do it. We should not let a future, potentially perfect solution be the enemy of an immediate, potentially very good one.

@karalabe which ones exactly, beyond potential implementation optimizations and polishing? We have tree implementations, we have snap-sync with verkle-healing, we have stateless execution, we have a multi-client testnet, we have quite good proposals for migration - that's the one thing that probably needs a bit more work, but I think we are close. It would be great if you could join some Verkle tree meetings to discuss your concerns.

@MariusVanDerWijden

In my opinion, Verkle only really works/is useful if we have stateless clients. This proposal asks all clients to implement a stateless mode. That stateless mode can very easily be changed to use a different state format (like Verkle). So since the work would need to be done anyway, it would in my opinion be great to do it as soon as possible (should be doable in a few weeks), so we can prevent worst-case scenarios (finalizing a bad block from a majority client).

@LukaszRozmej commented Nov 16, 2023

One more thing came to my mind: if we want to prove the witness, then the main node creating it will have to do all its reads through the MPT to generate the witness; it cannot use the snapshot. Similarly, every tree node in the validating client needs to be checked by proof, which is also potentially compute-intensive. Both will add a lot to block processing time and attestation latency (which we are already stretching in Cancun). Increasing latency by a few hundred ms will reduce attestation performance and discourage validators from running this.

Another question is what to do when there is a mismatch between clients. Do we run multiple of them and act upon the majority? Do we stop finalization? Do we stop the chain altogether?

All of the above makes me believe that "a few weeks" is a very big underestimation, and that this would do better as part of the Verkle upgrade.

@garyschulte commented Nov 16, 2023

AFAIK, the payload would need to include the parent header, the witness, and the new block. This would be a good primer for a stateless mode for Besu and could be done in parallel while we finish up our Verkle implementation. Post a format and example payload and I will start a branch for a stateless Besu subcommand.

To Lukasz's and Micah's points: this wouldn't use a flat account db for block processing, so that is a class of storage errors this wouldn't catch. But I don't think this would necessarily be throwaway work; rather, it's more of a proto-stateless execution (which also would have to walk the trie for execution). The trie should fit very easily in memory, so I would think block execution time shouldn't suffer much from trie-walking for state.

@jflo commented Nov 16, 2023

I think I kind of love this idea. The concept of removing the moral hazard from operators choosing a client is very appealing, because at the end of the day, the incentive to “do the right thing” is not as strong as the economic one. It pains me that somehow we got the CL to re-diversify after a single client dominated it, but that seems to be much less possible on the EL.

In the execution space, removing the moral hazard incentive only strengthens the remaining (economic) incentive, and I fear that under a regime such as this, in a year everyone is running Reth.

Faster clients make more MEV, and right now the only thing protecting us from monoculture is social pressure. Maybe… it might not even be doing that. The good news is pursuing a social convention is cheap, so keep it up I guess?

Now the premise at work here is to make diversity moot. In order to truly do that, I think the scheme to operate co-clients as light clients that validate block witnesses is an excellent start, and we need to expand on it. The majority of developer time spent reacting to a hostile network comes in the form of mitigating DoS vectors, many of which happen via peering, serialization, or transaction pool mechanics. Redundancy in the state function does not protect us from those. A world where everyone runs FastClient™ and some people choose to re-validate with co-clients doesn't look much different to me from where we are today; some people optimize for a healthy network out of moral or long-term economic incentives, but the majority do not.

If a scheme like this existed, an operator would be incentivized to disable it to recover the MEV time lost to co-clients bearing witness. I think for this to actually have an impact, there would need to be something in-protocol that made sure it was used.

There are also myriad UX, packaging and delivery problems needing to be solved, but I think they are solvable, and would all be worth it for a world where client diversity is moot.

@Perseverance

With the addition of checks against the parent header, I think this proposal makes a lot of sense and can cover the majority of the possible problems (although, as stated previously, not all). In addition to running the existing clients in stateless co-client mode, this suggestion has the added benefit of enabling new, specifically optimized implementations of the EVM to be created.

These can enable new tech stacks to be used without the need (and the time + cost) to create a full new client, increasing code diversity (new implementations - e.g. Python, Haskell, or whatever church you pray to).
They can also increase contributor diversity (new teams getting close to the core). The more minds on the protocol, the better. New teams can advance the protocol in new ways and might come in without "pre-existing baggage".

@karalabe (author)

@LukaszRozmej I disagree with you on both the simplicity of Verkle and the complexity of this proposal, but I am definitely willing to put my time where my mouth is and try to prove you wrong. I don't have working code or numbers to back my proposal up, but I'll try to implement both sides in Geth and see how messy and/or costly it becomes, and then maybe we can discuss it more explicitly instead of guessing.

@MarekM25

After careful consideration of the proposal, I believe it is an interesting idea. It can help prevent worst-case scenarios and dramatic situations even with less-than-perfect client diversity. Nonetheless, I think we should continue to push for achieving real client diversity.

Initially, I was skeptical because, if this solution gains wide adoption and is run on most machines, it could potentially make the network more fragile. Diverse client usage is not only about which client you run but also about which ones you don't: for instance, a vulnerability that allows taking control of the machine, a DDoS vulnerability, or a bug that consumes excessive resources and affects the operation of other clients on the same machine. Understanding how this solution will be implemented at the DevOps level is important. In a pessimistic scenario, we could inadvertently make the network more fragile by running more code, potentially exposing more vulnerabilities and increasing the likelihood of issues. However, I believe this can be mitigated with isolated containers, and cross-validation solutions should be able to determine when to terminate satellite EVMs. I would appreciate hearing more opinions from DevOps folks on this matter. Client implementations are just one step; to say that the solution is "shipped", we need solutions on the DevOps side and in tooling too.

Regarding performance, most blocks execute within 100-200ms on common hardware. However, for validator rewards, what matters is the execution of the blocks that are too slow. If we slow them down even a little bit, which is unavoidable, we will decrease validator performance, and as a result stakers might not want to run this solution. Of course, this is not an argument that should stop us from adopting the idea.

I would like to understand more about the technical details of how you see the process of sharing witnesses:

  • An optional field in NewPayload?
  • What would we do if 50% of clients see the block as VALID and 50% as INVALID? Follow your main client?
  • What is the maximum acceptable wait time for satellite clients in terms of execution?

Nevertheless, the idea truly shines in preventing the finalization of an incorrect chain with catastrophic consequences. Despite my concerns about making the network more fragile, which I believe can be overcome, the pros outweigh the cons of this proposal, and we should give it a try. In the meantime, we can keep pushing for real client diversity. Last year, client diversity improved a lot, so I do believe we can improve it even more.

@yorickdowne

This can also be solved in the validator client. The main concern here is high-stakes node operators (NOs) that haven't moved their infra off a supermajority client.

Vouch, in its next version, will support a "majority" attestation strategy. For example, an NO could run three Ethereum nodes (which they likely do already anyway) and use Geth, Nethermind, and Besu. With attestations, two of three agreeing wins in this setup. If any one client has a bug and the other two do not, attestations remain safe.
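
A toy sketch of such a majority tally (a generic 2-out-of-3 vote, not Vouch's actual implementation):

```go
package witness

// majorityValid reports whether strictly more than half of the collected
// verdicts consider the block valid, e.g. two out of three nodes agreeing.
func majorityValid(verdicts []bool) bool {
	agree := 0
	for _, v := range verdicts {
		if v {
			agree++
		}
	}
	return 2*agree > len(verdicts)
}
```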

Implementing a similar feature in other VCs may be a faster way to get to this checks and balances state for large NOs, without waiting for EL clients to gain cross-validation.

@LukaszRozmej

@karalabe did you have any success with prototyping this solution?

@pincente

One could envision a traditional load balancer (a la haproxy/traefik etc) armed with a custom 'liveness' check that could route EL requests accordingly.

@LukaszRozmej

> One could envision a traditional load balancer (a la haproxy/traefik etc) armed with a custom 'liveness' check that could route EL requests accordingly.

Define liveness in terms of a network split. Automatic routing to the majority client at any sign of an issue would be the same as just using it in the first place.

@pincente commented Feb 1, 2024

> Define liveness in terms of a network split. Automatic routing to the majority client at any sign of an issue would be the same as just using it in the first place.

In this context, liveness means 'cross-validate with other clients, statelessly', as @karalabe mentioned in the OP, which I imagine is along the same lines as what @yorickdowne suggested.

@LukaszRozmej

> > Define liveness in terms of a network split. Automatic routing to the majority client at any sign of an issue would be the same as just using it in the first place.
>
> In this context, liveness means 'cross-validate with other clients, statelessly', as @karalabe mentioned in the OP, which I imagine is along the same lines as what @yorickdowne suggested.

What about strategies for conflict resolution between ELs?
