Improving initial routing table sync

Summary

I'm pushing for the hash-based solution because it can be specified and implemented quickly and easily, and because it fixes the main issues we've seen on testnet:

  • routing sync on mobile nodes
  • "route not found" errors when you're missing routing info.

It can be deployed as an optional feature and will give us time to specify and implement proper IBLT-based filters. It can be combined with the timestamp approach: nodes would send bucket hashes + low and high watermarks.
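
As a rough illustration of how the two approaches could be combined, here is a hypothetical handshake structure; the field names below are invented for this sketch and are not part of any BOLT message:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RoutingSyncHandshake:
    """Hypothetical per-connection sync message (illustrative only)."""
    chain_hash: bytes            # which chain this routing table refers to
    first_timestamp: int         # low watermark: "I know updates from this time..."
    last_timestamp: int          # high watermark: "...up to this time"
    bucket_hashes: List[bytes]   # one hash per bucket (see "Bucket hashes" below)
```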

The problem

The current scheme (broadcast + optionally ask for a full routing table when you connect) works well for nodes which are never completely offline, but is becoming impractical for mobile/end-user nodes which are often offline and connected to only a few peers. We need a way to improve the initial routing sync and retrieve announcements that we've missed without having to download the entire routing table.

Additionally, the only way to check that routing information is consistent between different nodes is to ask each one of them to send you their entire routing table. Exchanging filters/digests/… would mitigate the issue of having to “trust” that your peers provide you with a good routing table, especially when you’re connected to very few peers.

Requirements

  • Easy to specify and implement
  • Low overhead
  • Ability to retrieve missing routing information
  • (Nice to have) Ability to query multiple peers for consistency checks

Background

Suppose you partition nodes into 3 generic roles:

  • payers: they mostly send payments, are typically small and operated by end users, and are offline quite a lot
  • relayers: they mostly relay payments, and would be online most of the time (if they're too unreliable other nodes will eventually close their channels with them)
  • payees: they mostly receive payments; how often they can be online is directly linked to their particular mode of operation (since you need to be online to receive payments)

Of course most nodes would play more or less all roles. However, mobile nodes would probably be mostly "payers", and they have specific properties:

  • if they don't relay payments they don't have to be announced. There could be millions of mobile nodes that would have no impact on the size of the routing table
  • it does not impact the network when they're offline
  • but they need an accurate routing table. This is very different from nodes who mostly relay or accept payments
  • they would be connected to a very small number of nodes
  • they would typically be online for just a few hours every day, but could be stopped/paused/restarted many times a day

End-user nodes, which fall into the "payer" category, are likely to be offline very often yet cannot work properly without an accurate routing table.

The current method for retrieving this routing table is to set the initial_routing_sync flag in the init message that you send every time you connect/reconnect to a peer, which will then send back its entire routing table (currently 6000 channels on testnet). If a node believes that a channel is available when it has in fact been closed, and uses it to build a route, it will receive an error message and try again without this specific channel. But if a node is missing a channel, and cannot route payments, there is currently no way to recover it: it has to wait until the channel parameters are updated, and will then receive a channel_announcement and the matching channel_update. This could take a very long time.
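
In pseudocode, the failure handling described above looks roughly like this (find_route, try_route and the error fields are placeholders, not actual APIs):

```python
def send_payment(routing_table, destination, amount):
    # Sketch of the current behaviour: retry around closed channels, but a
    # channel that is missing from routing_table can never be recovered this way.
    excluded = set()
    while True:
        route = find_route(routing_table, destination, amount, excluded)  # may fail: "route not found"
        error = try_route(route, amount)
        if error is None:
            return  # payment succeeded
        excluded.add(error.failing_channel)  # retry without the channel that failed
```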

Hence, the only option for mobile nodes is to request a routing table dump every time they connect/reconnect, which simply does not work.

We need a way to improve this initial table sync that is simple enough to be implemented and deployed quickly. Otherwise, these nodes will probably resort to ugly ad-hoc hacks (like using their own mechanisms for retrieving and syncing routing tables) or even rely on remote servers to compute routes for them.

Proposed solutions

Timestamps/watermarks

When they connect to another peer, nodes send a timestamp (I know the routing table up to X) or a pair of timestamps (I know the routing table from time X to time Y).
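
A minimal sketch of the watermark logic, assuming a node keeps the latest channel_update timestamp for each channel it knows about (names are illustrative):

```python
def announcements_to_send(my_updates, peer_low, peer_high):
    """Return the channel_updates a peer is missing, given its watermarks.

    my_updates: dict short_channel_id -> (timestamp, channel_update)
    peer_low, peer_high: the peer claims to know all updates whose
                         timestamp falls within [peer_low, peer_high].
    """
    missing = []
    for short_channel_id, (timestamp, update) in my_updates.items():
        if timestamp < peer_low or timestamp > peer_high:
            missing.append(update)  # outside the peer's window: (re)send it
    return missing
```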

Pros:

  • Very simple to specify (use channel_update timestamps) and implement.
  • Very little overhead
  • Solves the “download the entire routing table every time” issue

Cons:

  • Does not solve the "missing announcements" issue: if you're missing routing info you would have to wait for channel parameters to be updated etc., as described above

Bucket hashes

Routing info (i.e. channel announcements) is grouped into buckets, one bucket being a group of 144 blocks. A hash is computed for each bucket; peers exchange these hashes and send back all announcements for buckets whose hashes don't match. We propose to use one bucket per block for the last 7 days, and one bucket per 144 blocks for older announcements. This gives a total of (365 + 7*144) = 1373 hashes for a year of announcements.
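
The sketch below shows this bucket layout, assuming a node indexes its channel announcements by the block height of the funding transaction; function and helper names are illustrative, not taken from an existing implementation:

```python
import hashlib

BLOCKS_PER_DAY = 144

def bucket_id(height, current_height):
    """One bucket per block for the last 7 days, one per 144 blocks otherwise."""
    if current_height - height < 7 * BLOCKS_PER_DAY:
        return ("per-block", height)
    return ("per-day", height // BLOCKS_PER_DAY)

def bucket_hashes(announcements_by_height, current_height):
    """announcements_by_height: dict height -> list of raw channel_announcements (bytes).
    Returns a dict bucket_id -> hash of that bucket's announcements."""
    buckets = {}
    for height, announcements in announcements_by_height.items():
        buckets.setdefault(bucket_id(height, current_height), []).extend(announcements)
    hashes = {}
    for bid, anns in buckets.items():
        h = hashlib.sha256()
        for a in sorted(anns):  # sort so both peers hash identical content in the same order
            h.update(a)
        hashes[bid] = h.digest()
    return hashes
```

With this layout a year of announcements needs roughly 365 per-144-block buckets plus 7*144 = 1008 per-block buckets, i.e. the 1373 hashes mentioned above (about 44 KB with 32-byte hashes).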

Pros:

  • Simple to specify and implement
  • Little overhead (a few dozen KB)
  • If a node is missing a few elements it will immediately recover them, even if they're very old
  • Works well when routing tables are mostly synchronized, which would be the case for nodes which connect to a very small number of peers
  • Bucket hashes are the same for all peers you connect to, and can be used for consistency checks between peers

Cons:

  • Bucket hashes are not queryable filters
  • Since a single mismatch will invalidate an entire bucket, even with small differences nodes could end up having to send their entire routing table (which is exactly what they are doing today)

IBLT filters

Upon connection, nodes exchange information to estimate the number of differences between their routing tables. Using this estimate, they build and exchange IBLT filters, and use them to compute the announcements that they should send to their peer.
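
For reference, here is a rough sketch of the IBLT mechanics over 8-byte short_channel_ids; the number of cells, hash functions and encoding are arbitrary choices for illustration, not a proposed wire format:

```python
import hashlib

NUM_HASHES = 3

def _cells(item, num_cells):
    # Derive NUM_HASHES cell indexes from the item (a real implementation
    # would make sure the indexes are distinct).
    return [int.from_bytes(hashlib.sha256(bytes([i]) + item).digest()[:8], "big") % num_cells
            for i in range(NUM_HASHES)]

def _checksum(item):
    return int.from_bytes(hashlib.sha256(b"chk" + item).digest()[:8], "big")

class IBLT:
    def __init__(self, num_cells):
        self.count = [0] * num_cells
        self.key_sum = [0] * num_cells
        self.chk_sum = [0] * num_cells

    def insert(self, item):  # item: 8-byte short_channel_id
        key = int.from_bytes(item, "big")
        for idx in _cells(item, len(self.count)):
            self.count[idx] += 1
            self.key_sum[idx] ^= key
            self.chk_sum[idx] ^= _checksum(item)

    def subtract(self, other):
        # Cell-wise difference: the result encodes the symmetric difference
        # of the two sets (counts may become negative).
        diff = IBLT(len(self.count))
        for i in range(len(self.count)):
            diff.count[i] = self.count[i] - other.count[i]
            diff.key_sum[i] = self.key_sum[i] ^ other.key_sum[i]
            diff.chk_sum[i] = self.chk_sum[i] ^ other.chk_sum[i]
        return diff

    def decode(self):
        # Peel "pure" cells (count == +/-1 with a matching checksum) until
        # nothing is left; fails if the filter is too small for the difference.
        only_mine, only_theirs = set(), set()
        progress = True
        while progress:
            progress = False
            for i in range(len(self.count)):
                if self.count[i] in (1, -1):
                    item = self.key_sum[i].to_bytes(8, "big")
                    if self.chk_sum[i] == _checksum(item):
                        (only_mine if self.count[i] == 1 else only_theirs).add(item)
                        sign = self.count[i]
                        for idx in _cells(item, len(self.count)):
                            self.count[idx] -= sign
                            self.key_sum[idx] ^= int.from_bytes(item, "big")
                            self.chk_sum[idx] ^= _checksum(item)
                        progress = True
        if any(self.count) or any(self.key_sum):
            raise ValueError("too many differences for this filter size")
        return only_mine, only_theirs
```

A node would build an IBLT over its own announcements, receive its peer's filter, subtract one from the other and decode the result to learn which announcements each side is missing; the filter has to be sized for the (estimated) number of differences, which is where the estimator mentioned in the cons below comes in.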

Pros:

  • Queryable filters
  • Very efficient if the number of differences is small, even with very large routing tables

Cons:

  • More complex to specify and implement: we need a good estimator for the number of differences (send min height + max height + a sample of announcements?)
  • Filters become peer-specific (similar to the server-side vs client-side filtering for SPV clients)