With LES essentially murdered by the merge, the current rewrite should prompt some reevaluation of base invariants and, IMO, a full rewrite with the lessons learnt. The ideas below are only that - ideas - but I tried to summarise some things we might want to consider to do better and hopefully reach a deployable light infrastructure.
First up, what are the goals of the light client? IMO we grossly overshot with LES in the goals department: we wanted to do basically everything that's doable with the provable primitives, without taking any other practical constraints into consideration. With a rewrite, we should rethink these:
- Nobody wants to run a light server because it places an undue burden on top of full nodes.
  - Current light clients promise too many features and too much data availability. Full nodes by default unindex transactions older than a year because tracking them is just immensely expensive. That is acceptable for a full node, since everyone can configure their own to use whatever limit suits them, but light clients assume the data will be available. Currently we either need to force full nodes to index everything and bear the costs (nobody wants to run them); or we need to make light clients way too smart for their own good about where to get historical data from.
  - My proposed solution is to force the light protocol to only ever expose a strict subset of a full node's data. E.g. light servers should only provide access to tx indices at most T time old, where T should be the minimum possible that is still useful. One proposition is a couple of weeks; another is a very aggressive last 128 blocks (see the recency-window sketch after this list). We need to debate what exactly people should use LES for, but definitely not digging up past transaction data.
- Nobody wants to run a light server because it is too unstable.
  - LES has an insanely complex flow control mechanism. It also only works in theory; in practice it's impossible to even debug, let alone put in production. Part of the problem is that LES tries to be too smart for its own good: it attempts to do extremely precise cost measurements for requests and it tries to be both extremely fair across peers and to max out available light server capacity. These all complicate things beyond reasonable levels of functionality.
  - My proposal is - still - to replace it with a very simplified and well understood flow control mechanism: token buckets (see the token-bucket sketch after this list). LES is super complicated because so few people run servers that it had to push every last drop of performance out of them to remain functional. Rather than that, we should aim for simple and stable, so that LES can be served by every geth node. IMHO the total capacity of a node should be set low enough that a node operator does not notice that it's even running; and measurements should be something simple (e.g. tokens used == proof items), not some "actual work done because some cache was half warm and we shaved 23.6ns off trie node N". We should use our server count as a feature in serving "dumb" clients vs. trying to be optimal and mis-serving smart clients.
- LES wanted to be too much: both a client for sending txs once in a blue moon and a stateless full node.
  - The requirements for the two are different. The former needs little data, rarely. The latter needs a lot of data, constantly. Because it conflates the two, LES always had this strange notion of wanting to be P2P, except it's not, yet it still complicates everything because it wants to become P2P at some point.
  - My proposal is to restrict light clients to be just that, light clients, not stateless full nodes. We should commit to simply not caring about P2P at all at the LES layer and instead fully commit to a client-server architecture. Taking it a step further, my proposal is to get rid of devp2p at this layer entirely and switch to an HTTP API server (see the HTTP sketch after this list). This would remove an immense amount of peer shuffling complexity from LES; and it would also instantly enable a lot of elegant web2 composability when it comes to authentication, proxying, etc. The availability and address of the servers can still be announced via the DHT and ENRs and indexed by our DNS discovery.
- But what about Sybil protection, quality of service and whatnot?
  - IMO we're at the wrong level to think about these problems. Our goal should not be to design the perfect protocol that does not work, but rather one that is resilient against attacks: instead of "forbidding" bad behavior, it raises the level of friction needed to do it. There will always be malicious entities who figure out how to work around protocol limits; instead of making everything brittle to play a game of whack-a-mole, we should make it work well out of the box and make it not quite useful enough for abusive use cases.
  - My proposal here is to introduce some small friction points that are not too relevant to normal light client usage, but which can be enough to prevent very sophisticated users from relying on it. This is an open ended question, but by, say, not providing access to the latest block, rather to HEAD-1 or HEAD-3(?), it all of a sudden is not that relevant for anyone wanting extremely precise timing guarantees. As a normie, seeing a 12s old state should still be ok, but for a power user it might not be. Win! To discourage hammering nodes, we could perhaps add a slight delay in responding to requests (see the friction sketch after this list). As long as the stock client behaves reasonably and a power client cannot abuse things too usefully, we should be golden. As for the couple of use cases that might slip through, we'll just foot the bill. As long as everything is simple and robust enough to run on all nodes, we have a lot of capacity to spare.
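
To make the recency window above a bit more concrete, here is a minimal Go sketch of the server-side guard it implies. Everything in it is hypothetical (the `recencyLimit` constant, the `TxIndex` type, the `serveTxLookup` helper); the 128-block value is just the aggressive end of the proposal, not a decided number.

```go
package lightserver

import "errors"

// recencyLimit is a hypothetical policy constant: light servers only look up
// transactions included within the most recent N blocks. 128 is the
// aggressive end of the proposal; a couple of weeks' worth of blocks would be
// the relaxed end.
const recencyLimit = 128

var errTooOld = errors.New("tx lookup outside the served recency window")

// TxIndex is a stand-in for whatever the node's transaction indexer returns.
type TxIndex struct {
	BlockNumber uint64
	Index       uint64
}

// serveTxLookup sketches the guard a light server could apply before touching
// its database: if the transaction landed before the advertised window, the
// request is refused instead of promising unbounded historical availability.
func serveTxLookup(head uint64, lookup func(hash [32]byte) (TxIndex, bool), hash [32]byte) (TxIndex, error) {
	idx, ok := lookup(hash)
	if !ok {
		return TxIndex{}, errors.New("transaction not indexed")
	}
	if head > recencyLimit && idx.BlockNumber < head-recencyLimit {
		return TxIndex{}, errTooOld
	}
	return idx, nil
}
```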
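For the flow control proposal, a plain token bucket with a cost model of "tokens used == proof items" needs very little machinery. The sketch below is not geth code, just an illustration of how dumb the replacement could afford to be; capacity and refill rate would be the low, operator-invisible numbers argued for above.

```go
package lightserver

import (
	"sync"
	"time"
)

// TokenBucket is a deliberately dumb flow-control primitive: a fixed refill
// rate, a fixed burst size, and a cost model of "tokens used == proof items
// returned". No per-request benchmarking, no adaptive fairness machinery.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64   // currently available tokens
	capacity float64   // maximum burst (bucket size)
	rate     float64   // refill rate in tokens per second
	last     time.Time // last refill timestamp
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
}

// Allow deducts cost tokens (e.g. the number of trie nodes a proof request
// would return) and reports whether the client is still within its budget.
// Requests that would overdraw the bucket are simply rejected; the client is
// expected to back off and retry.
func (b *TokenBucket) Allow(cost float64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now

	if b.tokens < cost {
		return false
	}
	b.tokens -= cost
	return true
}
```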
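The client-server-over-HTTP idea is similarly easy to illustrate. The endpoint path, port and response shape below are made up for the sake of the example; the point is only that a stock `net/http` server composes out of the box with reverse proxies, authentication and rate limiting, with zero devp2p peer management involved.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// proofResponse is a placeholder for whatever proof format the light protocol
// would settle on; the shape here is illustrative only.
type proofResponse struct {
	Block uint64   `json:"block"`
	Proof [][]byte `json:"proof"`
}

func main() {
	mux := http.NewServeMux()

	// Hypothetical endpoint: serve a Merkle proof for an account at a recent
	// block. A real server would call into the node's state database; this
	// stub just returns an empty response.
	mux.HandleFunc("/les/v1/proof", func(w http.ResponseWriter, r *http.Request) {
		addr := r.URL.Query().Get("address")
		if addr == "" {
			http.Error(w, "missing address", http.StatusBadRequest)
			return
		}
		json.NewEncoder(w).Encode(proofResponse{Block: 0, Proof: nil})
	})

	// The server's address can still be advertised via ENRs / DNS discovery;
	// the serving itself is just vanilla HTTP behind whatever proxy you like.
	log.Fatal(http.ListenAndServe(":8548", mux))
}
```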
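Finally, the friction points (serving HEAD-N instead of the latest block, plus a slight response delay) boil down to a couple of configuration knobs. The concrete numbers below are placeholders for the open questions above, not recommendations.

```go
package lightserver

import "time"

// frictionConfig captures the two deliberately crude friction knobs floated
// above: serve state a few blocks behind the head, and delay responses a bit.
type frictionConfig struct {
	headOffset    uint64        // serve HEAD-N instead of the latest block
	responseDelay time.Duration // artificial latency before replying
}

// defaultFriction uses placeholder values; the exact offset and delay are the
// open questions discussed above.
var defaultFriction = frictionConfig{headOffset: 3, responseDelay: 200 * time.Millisecond}

// servedHead maps the chain head to the newest block a light client may query.
func (c frictionConfig) servedHead(head uint64) uint64 {
	if head < c.headOffset {
		return 0
	}
	return head - c.headOffset
}

// throttle is called before answering any request; a normal wallet user will
// never notice a couple hundred milliseconds, while hammering the endpoint
// for low-latency use cases becomes pointless.
func (c frictionConfig) throttle() {
	time.Sleep(c.responseDelay)
}
```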
In for a penny, in for a pound
Since my arguments seem to fall on deaf ears, let me try to make them one last time via a real-world example.
Say you want to build a wallet that embodies crypto ideals to the maximum extent technically possible. It never collects any user data at all. It allows you to use popular dapps via smart contracts directly, avoiding their frontend (& analytics).
The moment you have to run your own servers, you have failed.
We should build this stuff for the most adversarial environment imaginable.
I don’t think Joe Lubin wanted to cut off service for Tornado Cash. Nor do I think Hayden Adams wants to collect copious amounts of user data on the Uniswap front end. They have no choice, as they are subject to regulations in their jurisdictions. "Just run a full node" as an answer makes your dapp / app exponentially less censorship resistant.
There are whole classes of use cases and dapps that can be built if LES is reliable and readily available. It boggles my mind that I can’t seem to get this point across. Abundant LES dramatically increases real-world censorship resistance, allowing use cases we can’t even think of right now.
Therefore I think the goal should be to provide as much historical data as possible while only increasing the computational requirements of running a full node by a sensible amount (10%~20%).