With LES essentially murdered by the merge, the current rework should prompt some reevaluation of base invariants and - IMO - a full rewrite with the lessons learnt. The ideas below are only that - ideas - but I tried to summarise some things we might want to consider doing better to hopefully reach a deployable light infrastructure.
First up, what are the goals of the light client? IMO we grossly overshot with LES in the goals department: we wanted to do basically everything that's doable with the provable primitives, without taking any other practical constraints into consideration. With a rewrite, we should rethink these:
- Nobody wants to run a light server because it places an undue burden on top of full nodes.
  - Current light clients promise too many features and too much data availability. Full nodes by default unindex transactions older than a year because tracking them is just immensely expensive. That is acceptable for a full node, since everyone can configure their own to use whatever limit suits them, but light clients assume the data will always be available. Currently we either need to force full nodes to index everything and bear the costs (then nobody wants to run them); or we need to make light clients way too smart for their own good about where to get historical data from.
  - My proposed solution is to force the light protocol to only ever expose a strict subset of a full node's data. E.g. light servers should only provide access to tx indices at most T time old, where T should be the minimum possible that is still useful. One proposition is a couple of weeks; another is a very aggressive last 128 blocks. We need to debate what exactly people should use LES for, but definitely not digging up past transaction data (see the serving-window sketch after this list).
- Nobody wants to run a light server because it is too unstable.
  - LES has an insanely complex flow control mechanism. It also only works in theory: in practice it's impossible to even debug, let alone put in production. Part of the problem is that LES tries to be too smart for its own good. It attempts to do extremely precise cost measurements for requests, and it tries to be both extremely fair across peers and to max out the available light server capacity. All of this makes things complicated beyond any reasonable level of functionality.
  - My proposal is - still - to replace it with a very simple and well understood flow control mechanism: token buckets (see the sketch after this list). LES is super complicated because, with so few people running servers, it had to squeeze every last drop of performance out of them to remain functional. Instead, we should aim for simple and stable, so that LES can be served by every geth node. IMHO the total capacity of a node should be set low enough that a node operator does not even notice it is running; and the cost measurement should be something simple (e.g. tokens used == proof items), not some "actual work done because some cache was half warm and we shaved 23.6ns off trie node N". We should use our server count as a feature to serve "dumb" clients well, rather than trying to be optimal and mis-serving smart clients.
- LES wanted to be too much: both a client for sending txs once in a blue moon and a stateless full node.
  - The requirements for the two are different. The former needs a little data, rarely; the latter needs a lot of data, constantly. Because of conflating the two, LES has always had this strange notion of wanting to be P2P - except it isn't - while still complicating everything because it wants to become P2P at some point.
  - My proposal is to restrict light clients to being just that, light clients, not stateless full nodes. We should commit to simply not caring about P2P at all at the LES layer and instead fully commit to a client-server architecture. Taking it a step further, my proposal is to get rid of devp2p at this layer entirely and switch to an HTTP API server (a rough sketch follows after this list). This would remove an immense amount of peer shuffling complexity from LES; and it would also instantly enable a lot of elegant web2 composability when it comes to authentication, proxying, etc. The availability and address of the servers can still be announced via the DHT and ENRs and indexed by our DNS discovery.
- But what about sybil protection and quality of service and whatnot?
  - IMO we're at the wrong level to think about these problems. Our goal should not be to design a perfect protocol that does not work, but rather one that is resilient against attacks: instead of "forbidding" bad behavior, it should raise the friction of engaging in it. There will always be malicious entities who figure out how to work around protocol limits; instead of making everything brittle to play a game of whack-a-mole, we should make it work well out of the box and make it not quite useful enough for abusive use cases.
  - My proposal here is to introduce some small friction points that are not too relevant to normal light client usage, but which can be enough to prevent very sophisticated users from relying on it. This is an open ended question, but by, say, not providing access to the latest block, rather to HEAD-1 or HEAD-3(?), the protocol all of a sudden becomes uninteresting for anyone wanting extremely precise timing guarantees. As a normie, seeing a 12s old state is still OK, but for a power user it might not be. Win! To discourage hammering nodes, we could add a slight delay in responding to requests (both frictions appear in the HTTP sketch below). As long as the stock client behaves reasonably and a power client cannot abuse things too usefully, we should be golden. As for the couple of use cases that might slip through, we'll just foot the bill. As long as everything is simple and robust enough to run on all nodes, we have a lot of capacity to spare.
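To make the "recent data only" idea concrete, here is a minimal sketch of what a transaction lookup window could look like on the server side. The names (`txServeWindow`, `servableTx`) and the 128-block value are purely illustrative, not an existing geth API; the point is just that the server refuses lookups older than T instead of indexing everything.

```go
// Illustrative only: a light server refusing tx index lookups older than T.
package les

// txServeWindow is T expressed in blocks; it could equally be ~2 weeks' worth.
const txServeWindow = 128

// servableTx reports whether a transaction included in block txBlock is still
// within the window the light server is willing to answer for.
func servableTx(head, txBlock uint64) bool {
	if txBlock > head {
		return false // not yet included / unknown block
	}
	return head-txBlock <= txServeWindow
}
```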
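For the flow control replacement, a per-peer token bucket really is all it takes. Below is a minimal sketch, assuming cost is measured in proof items as suggested above; the type and field names are made up for illustration and nothing here is tied to geth internals.

```go
// Illustrative only: a per-peer token bucket where cost == proof items served.
package les

import (
	"sync"
	"time"
)

type tokenBucket struct {
	mu       sync.Mutex
	capacity float64   // maximum tokens a peer may accumulate
	tokens   float64   // currently available tokens
	refill   float64   // tokens regenerated per second
	last     time.Time // last time the bucket was topped up
}

func newTokenBucket(capacity, refillPerSec float64) *tokenBucket {
	return &tokenBucket{capacity: capacity, tokens: capacity, refill: refillPerSec, last: time.Now()}
}

// allow charges `items` tokens (e.g. the number of trie nodes in a requested
// proof) and reports whether the request should be served right now.
func (b *tokenBucket) allow(items float64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()

	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.refill
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now

	if b.tokens < items {
		return false // out of budget, the client should back off
	}
	b.tokens -= items
	return true
}
```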
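And a hand-wavy sketch of the client-server shape: a plain HTTP endpoint instead of devp2p, which only serves state lagging a few blocks behind the head and adds a small artificial response delay as friction. The `proofBackend` interface, URL layout and parameter choices are all hypothetical; the takeaway is that any stock reverse proxy, auth layer or rate limiter can sit in front of it.

```go
// Illustrative only: an HTTP light-serving endpoint with built-in friction.
package main

import (
	"encoding/json"
	"errors"
	"net/http"
	"time"
)

// proofBackend is a hypothetical hook into the local full node.
type proofBackend interface {
	HeadNumber() uint64
	AccountProof(block uint64, addr string) ([][]byte, error)
}

// proofHandler serves account proofs, but only for blocks at least `lag`
// behind the head, and only after a deliberate `delay`.
func proofHandler(b proofBackend, lag uint64, delay time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(delay) // friction against hammering the endpoint

		block := b.HeadNumber() - lag // never expose the very latest state
		proof, err := b.AccountProof(block, r.URL.Query().Get("address"))
		if err != nil {
			http.Error(w, err.Error(), http.StatusNotFound)
			return
		}
		json.NewEncoder(w).Encode(map[string]interface{}{
			"block": block,
			"proof": proof,
		})
	}
}

// stubBackend stands in for a real node so the sketch compiles on its own.
type stubBackend struct{}

func (stubBackend) HeadNumber() uint64 { return 1_000_000 }
func (stubBackend) AccountProof(uint64, string) ([][]byte, error) {
	return nil, errors.New("not wired to a real node")
}

func main() {
	// Any off-the-shelf reverse proxy, auth or load balancing layer can sit in
	// front of this; that is the web2 composability argument.
	http.Handle("/les/v1/proof", proofHandler(stubBackend{}, 3, 50*time.Millisecond))
	http.ListenAndServe(":8548", nil)
}
```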
I've spent most of my time working on dapps & crypto apps, not protocol design or development for something like geth, so my opinion here is of very limited value. Nevertheless, from a dapp & app dev perspective there are a few things that seem underappreciated.
There will never be fewer laws related to crypto than there are now. The number of laws will only increase, and so will the number of things that are banned / censored. Relying on RPC service companies introduces censorship choke points. We have already seen large RPC providers blocking entire countries and dropping support for dapps like Tornado Cash. Last time I checked, the Portal Network was still in the protocol design stage, so it could well be years away. LES is crucial for preserving censorship resistance, in particular for small teams.
Everything described by @karalabe makes perfect sense & is music to my ears. Particularly having
`--lightclient-serve`
on by default, or better still always on. However, I strongly disagree with "light servers should only provide access to tx indices at most T time old, where T should be the minimum possible that is still useful".
I think the goal should be the greatest possible T while only increasing the computational resources required to run a node by some sensible percentage (10% ~ 20%, perhaps even higher). Maybe even allow the server to increase T to some very large number. I fully realize that LES clients need to trust the nodes and can't verify the chain themselves. I would much rather discover peers on the network and trust them, perhaps sample a few, and handle all the complexity on the client side (mobile phone / in browser), than use a centralized RPC provider that is subject to increasingly draconian laws and the whims of governments.
UPDATE: I don't think I made the best case. One more thought. Increasing computational resources by 10% ~ 20% is not that much, and having reliable LES with X months of historical data would unlock a lot of use cases for dapp & app developers, as well as increase censorship resistance by a large amount.