Symmetric Routing Proposal

This document describes a proposal for routing and peer discovery for Ames and Fine (the remote scry protocol). It aims to follow these principles.

  • Symmetry: every response packet will retrace the same path through the network that the request packet that caused it took, with the steps reversed. If the request takes the path A->B->C->D, then the response will take D->C->B->A.
  • Transitivity: routing and peer discovery should treat every level of the hierarchy the same as every other, i.e. forwarding from galaxy to star should look the same as star to planet and planet to moon.
  • Tightening: the network should naturally "tighten" routes over time. The tightest route is a direct route; the loosest goes through every possible relay.
  • Requester Control: the requester (hereafter defined as the ship sending a scry request packet or Ames message fragment packet) should decide both routing and congestion control, not the responder (the ship that receives a scry request and emits a scry response, or that receives an Ames message fragment and emits an ack).
  • Selflessness: a ship does not need to know its own IP address and port, although if it does know it, it should be able to make use of that knowledge.
  • Deterministic Routing: the route a packet takes through the network should be a pure function of the sender ship, responder ship, PKI state, whether the packet is a request or response, and number of hops (fewer hops is tighter). Sender determines number of hops (with some caveats).
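
To make the Deterministic Routing principle concrete, here's a rough sketch in Python (all names, including the sponsor_chain argument, are hypothetical, and nothing about it is normative): the route falls out as a pure function of the sender, the responder, the responder's sponsorship chain from PKI state, the request/response direction, and the hop count.

from typing import List

def route(sender: str, responder: str,
          sponsor_chain: List[str], is_request: bool, hops: int) -> List[str]:
    # Relays are drawn from the responder's sponsorship chain (PKI state):
    # the innermost sponsors are kept as the route tightens, and the path
    # runs galaxy-outward toward the responder. Response routes are the
    # same path reversed, per the Symmetry principle.
    relays = list(reversed(sponsor_chain[1:hops]))
    path = [sender] + relays + [responder]
    return path if is_request else list(reversed(path))

print(route("~rovnys-ricfer", "~mister-ritpub-sipsyl",
            ["~mister-ritpub-sipsyl", "~ritpub-sipsyl", "~sipsyl", "~syl"],
            is_request=True, hops=4))
# ['~rovnys-ricfer', '~syl', '~sipsyl', '~ritpub-sipsyl', '~mister-ritpub-sipsyl']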

Protocol Description

Full Case: Planet to Moon

Consider a ship trying to send a scry request to a moon under some other galaxy. This is the longest possible route through the network, with a maximum of four ship-to-ship hops in each direction. Requests made to higher-rank ships max out at fewer hops; a request to a galaxy is always one hop.

This pattern needs to solve three problems:

  • forward routing (how do relays know where to send the request next?)
  • reverse routing (how do relays know where to send the response next?)
  • lane propagation (how does the sender learn the locations of the responder and relays, so it can tighten the route on later requests?)

Requester planet ~rovnys-ricfer (~rovnys for short) wants to send a scry request to ~mister-ritpub-sipsyl (~mister for short). ~rovnys doesn't know ~mister's lane (IP and port), so it sends the request down through ~mister's routing hierarchy, starting with its galaxy, ~syl, then down through its star ~sipsyl, its planet ~ritpub-sipsyl, and then finally to the moon ~mister. The response packet from ~mister follows the same path in reverse.

request:  ~rovnys -> ~syl -> ~sipsyl -> ~ritpub-sipsyl -> ~mister-ritpub-sipsyl
response: ~rovnys <- ~syl <- ~sipsyl <- ~ritpub-sipsyl <- ~mister-ritpub-sipsyl

~rovnys specifies in its outgoing request packet's hops field that this route will take 4 hops, involving a total of 5 ships. This is followed by a routing stack, which consists of a 5-bit bitmap indicating which ships' lanes are filled in, and five 6-byte lanes (IPv4+port pairs). These fields are initialized to zero in ~rovnys's outgoing request packet.

request hop ~rovnys -> ~syl
bitmap: 00000
fields: [0 0 0 0 0]
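
As a rough illustration of the layout just shown (the field widths follow the prose above; the byte alignment of the bitmap and everything else here are assumptions of mine, not part of the proposal):

import struct

EMPTY_LANE = bytes(6)   # 4-byte IPv4 address + 2-byte port, all zeros

def encode_lane(ip: str, port: int) -> bytes:
    # A lane is an IPv4 address plus port, packed into 6 bytes.
    return bytes(int(x) for x in ip.split(".")) + struct.pack(">H", port)

def initial_routing_stack(hops: int = 4) -> bytes:
    # The hops field, then the (here byte-aligned) bitmap with nothing filled
    # in, then five empty 6-byte lane fields.
    bitmap = 0b00000
    return bytes([hops, bitmap]) + EMPTY_LANE * 5

print(initial_routing_stack().hex())          # 0400 followed by 30 zero bytes
print(encode_lane("203.0.113.7", 13337).hex())  # cb0071073419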

When ~syl receives this packet from ~rovnys, it determines that it is the second ship in this 4-hop sequence, so it fills in the first field with the lane on which it heard the packet, and it fills in the third field with ~sipsyl's lane, adjusting the packet's bitmap accordingly. If ~syl knows its own lane (possibly by asking a STUN server for this information), it has the option to fill in the second field with its own lane too. Either way, it then forwards this modified packet to ~sipsyl.

request hop ~syl -> ~sipsyl
bitmap: 10100
fields: [~rovnys 0 ~sipsyl 0 0]

When ~sipsyl receives the packet from ~syl, it determines that it is the third ship in this 4-hop sequence, so it fills in the second field with the lane on which it heard the packet (~syl's lane), and it fills in the fourth field with ~ritpub-sipsyl's lane, adjusting the packet's bitmap accordingly. If ~sipsyl knows its own lane, it has the option to overwrite the third field with its own lane too. Either way, it then forwards this modified packet to ~ritpub-sipsyl.

request hop ~sipsyl -> ~ritpub-sipsyl
bitmap: 11110
fields: [~rovnys ~syl ~sipsyl ~ritpub-sipsyl 0]

When ~ritpub-sipsyl receives the packet from ~sipsyl, it determines that it is the fourth ship in this 4-hop sequence, so it fills in the fifth and final field with the moon ~mister's lane and sends the packet there. If it knows its own lane, it can also overwrite the fourth field with it.

request hop ~ritpub-sipsyl -> ~mister-ritpub-sipsyl
bitmap: 11111
fields: [~rovnys ~syl ~sipsyl ~ritpub-sipsyl ~mister-ritpub-sipsyl]
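
Putting the relay behavior in one place, here is an illustrative sketch, not an implementation, of what each relay on the request path does with the lane fields, assuming it can determine its own position in the hop sequence and look up the next ship's lane (the bitmap is implicit here: it's just which fields are non-empty):

from typing import List, Optional

def relay_request(lanes: List[Optional[bytes]], position: int,
                  heard_on: bytes, next_lane: bytes,
                  own_lane: Optional[bytes] = None) -> List[Optional[bytes]]:
    # `position` is this relay's index in the ship sequence: the requester is
    # index 0, so ~syl is 1, ~sipsyl is 2, and ~ritpub-sipsyl is 3.
    lanes = list(lanes)
    lanes[position - 1] = heard_on    # lane of the previous ship, as observed
    lanes[position + 1] = next_lane   # lane of the next ship down the hierarchy
    if own_lane is not None:          # optional, e.g. learned via STUN
        lanes[position] = own_lane
    return lanes

# ~syl (position 1) hears the request from ~rovnys and forwards it to ~sipsyl:
lanes = relay_request([None] * 5, position=1,
                      heard_on=b"ROVNYS", next_lane=b"SIPSYL")
print(lanes)   # [b'ROVNYS', None, b'SIPSYL', None, None]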

When ~mister receives the packet from ~ritpub-sipsyl, it handles the scry request and generates a response packet back to ~ritpub-sipsyl. This packet includes the same, fully filled-in lane fields. Since it's a response, not a request, the relays use the fields to forward the packet in reverse order back to the requester, ~rovnys.

When ~ritpub-sipsyl hears the response packet, it determines that it is the fourth ship in this 4-hop sequence. Since it's a response packet, the lane fields are already filled in. ~ritpub-sipsyl forwards it back to the third lane field, which contains the star ~sipsyl's lane. The fields are left intact.

~sipsyl and ~syl follow the same procedure, culminating in ~rovnys receiving the response packet with all these fields filled in. When ~rovnys hears this, it stores the lanes for all intermediate relay ships in its persistent Arvo state, allowing it to try to tighten the route for later requests.
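
On the way back, the relays fill in nothing: a relay at position p simply forwards to the lane in field p-1, and the requester persists the lanes it finds. A minimal sketch, reusing the conventions from the sketch above:

from typing import Dict, List

def forward_response(lanes: List[bytes], position: int) -> bytes:
    # A relay at `position` sends the response toward the requester, i.e. to
    # the ship one slot earlier in the route; the fields are left intact.
    return lanes[position - 1]

def learn_routes(ships: List[str], lanes: List[bytes]) -> Dict[str, bytes]:
    # The requester stores every lane it now knows in its persistent state,
    # so it can try tighter routes on later requests; zero lanes are skipped.
    return {ship: lane for ship, lane in zip(ships, lanes) if lane}

ships = ["~rovnys", "~syl", "~sipsyl", "~ritpub-sipsyl", "~mister-ritpub-sipsyl"]
print(learn_routes(ships, [b"ROVNYS", b"SYLSYL", b"SIPSYL", b"RITPUB", b"MISTER"]))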

Note that this example traced a Fine request/response interaction across the network, but the same steps are taken for an Ames poke/ack interaction. This will require adding a plaintext bit to an Ames packet indicating whether it's a message fragment or ack. This is not a significant privacy leak, since anyone listening to Ames packets can trivially figure out whether they're message fragments or acks based on which one was sent first.

Another difference is that since Ames request packets, unlike Fine requests, are writes into the responder's Arvo, the responder can store the requester's lane when it hears the request packet. This is important for sponsor pinging, so that the sponsor can persist a sponsee's lane when it hears a request from the sponsee.

Tightening

After the previous interaction, the requester should try to "tighten" this route down to as close to a direct route as possible.

Incremental vs. Eager Tightening

I can see two alternatives for how to tighten:

  • incrementally: tighten one step at a time (e.g. from 4 hops to 3, then 3 to 2, etc.)
  • eagerly: tighten as quickly as possible

I'm not sure which is better. To answer that, we'd probably need a confident understanding of what percentage of each class of ship is publicly addressable. If every ship is publicly addressable, then direct routes are always feasible, so jumping eagerly to the direct route is ideal. If no ships (other than galaxies) are publicly addressable, then incremental tightening will perform better, because it will not jettison the loose but working route in favor of a nonfunctional direct route.

The truth is somewhere in between. It's pretty easy to say now that all stars should be publicly addressable, but few moons will be. Planets are the tough question here: cloud-hosted planets will generally be publicly addressable, but planets running on laptops or Raspberry Pis at home will not. I think a solid assumption is that two years from now, most planets will be cloud-hosted.

So a good rubric might be: tighten eagerly down to a responder planet, but not all the way to a responder moon.

Tightening Procedure

The requester should probably send two requests at once when trying to tighten: one through a known but looser route, and one through a tighter but experimental route. Given the rubric in the previous section, just after the earlier example of ~rovnys sending its request through ~syl to ~mister, ~rovnys should try to tighten that route so it forwards only through ~ritpub-sipsyl down to the moon, skipping the galaxy and star.

So the next request that ~rovnys sends should go both to the known route starting with the galaxy, established by the previous request/response, and also an experimental 2-hop request starting with ~ritpub-sipsyl. If the experimental request gets a response, then ~rovnys can mark that route as valid, and its next request will go to that route and an experimental direct route to ~mister.
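
A sketch of that duplicate-send policy, with the candidate list and names purely illustrative:

from typing import List

# Candidate routes for ~rovnys -> ~mister, loosest to tightest, as hop counts:
# 4 (through ~syl), 2 (through ~ritpub-sipsyl only), 1 (direct).
CANDIDATES = [4, 2, 1]

class Tightener:
    def __init__(self) -> None:
        self.known = 0          # index of the tightest route known to work so far
        self.experimental = 1   # index of the tighter route currently being probed

    def routes_to_send(self) -> List[int]:
        # Every request goes out on both the known and the experimental route.
        return sorted({CANDIDATES[self.known], CANDIDATES[self.experimental]},
                      reverse=True)

    def on_response(self, hops: int) -> None:
        # If the experimental route answered, mark it valid and probe the next.
        if hops == CANDIDATES[self.experimental]:
            self.known = self.experimental
            self.experimental = min(self.experimental + 1, len(CANDIDATES) - 1)

t = Tightener()
print(t.routes_to_send())   # [4, 2]: known galaxy route plus experimental 2-hop route
t.on_response(2)
print(t.routes_to_send())   # [2, 1]: known 2-hop route plus experimental direct route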

Once the requester has tightened the route all the way to a direct route, I'm not sure if it should keep sending a duplicate request to the responder's sponsor, or if it should send exclusively to the responder's direct route.

Eventually, sponsors will likely want to cache scry responses on behalf of their sponsees. This is easier if the requests go through the sponsor, so I think at least for scry requests, it will be better not to tighten all the way to the responder. This means the requester will continue to try the direct route and a route through the responder's immediate sponsor.

(Probably not all sponsors will cache all scry responses, so in many cases the direct route will still be faster; trying both is likely to yield the best results.)

Recovering from Route Loss

What happens if my nice direct route to the responder stops working? Again, I can see two possible ways to deal with this: either retry all the way from the galaxy, or fall back incrementally to a route with one more hop. I suspect falling back incrementally is best, which will mean switching from trying both the (n-1)-hop and n-hop routes to the n-hop and (n+1)-hop routes.
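
Continuing the sketch from the tightening section, incremental fallback could be as simple as shifting the pair of candidate routes one notch looser whenever the current pair keeps timing out:

def on_timeout(t: Tightener) -> None:
    # The current pair of routes stopped working: the old known route becomes
    # the experimental probe, and the next looser candidate becomes known,
    # bottoming out at the route through the galaxy.
    t.experimental = t.known
    t.known = max(t.known - 1, 0)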

Initialization Case: Planet Seeking Sponsoring Star

~rovnys-ricfer just booted and needs to find its sponsor ~ricfer so that ~ricfer can forward packets to it. In order to do this, it sends a 2-hop Ames request packet through ~fer to ~ricfer. From ~fer's perspective, ~rovnys-ricfer might as well be a foreign planet under a different galaxy; ~fer doesn't track ~rovnys-ricfer's lane.

When ~ricfer receives this request, it stores ~rovnys-ricfer's lane in its state and acks the request up through ~fer. Once ~rovnys-ricfer hears the response, it knows the direct route to ~ricfer, which it will try as its experimental route for future pings in addition to its known route through ~fer.

The same logic applies to a moon booting up to find its planet, just with more hops. The request goes from moon -> galaxy -> star -> planet, and then the response traces that path in reverse, ending in planet and moon knowing each other's lanes.

Well, almost the same. We need some way to propagate key information about the moon so the galaxy can verify its signature, to prevent fake moons from being used as a DoS vector by sending lots of bogus packets to planets through galaxies and stars.
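
For the sponsor's side of these boot-time pings, here's a minimal sketch, assuming (as in the Ames case above) that hearing a request is a write into Arvo, so the lane can be persisted:

from typing import Dict

class SponsorState:
    def __init__(self) -> None:
        self.sponsee_lanes: Dict[str, bytes] = {}   # persisted in Arvo state

    def on_ames_request(self, sponsee: str, heard_on: bytes) -> None:
        # An Ames request is a write into Arvo, so the sponsor records the lane
        # it heard the sponsee on, letting it forward packets to the sponsee later.
        self.sponsee_lanes[sponsee] = heard_on

ricfer = SponsorState()
ricfer.on_ames_request("~rovnys-ricfer", b"LANE01")
print(ricfer.sponsee_lanes)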

Optimization: Skipping Dead Sponsors

TODO: finish

Summary: if the star is down, the galaxy would convert a 3-hop request to a 2-hop request, skipping the star. The planet would need to start pinging its galaxy if it notices the star went down, so the galaxy knows where it is.

Request Timeout Problems

~wicdev-wisryt wrote:

the "back through the galaxy" ideas are all assuming you're under the same galaxy, which you're not

the idea of "routing back the way it came" is something we never do right now

because you never know that you can do that

unless you're constantly pinging each other

quiz: you receive a packet on lane A with origin B. What's the minimal set of lanes you need to send the message to in order to get it back to the sender?

(a) A
(b) B
(c) A+B

trick question, probably none of them will work

the only lane that we can depend on to work is their galaxy, which is neither A nor B

the criss-cross pattern is actually important for a network with our characteristics

unless you want every ship to ping every galaxy all the time

tbc: B probably won't work if they have any kind of NAT, since us<->A has never been used. A probably won't work because B<->our galaxy is probably closed by now

unless you want to rely on responses to packets tracing through everything fast enough that the B<->our galaxy route hasn't been closed yet

which is impossible with the characteristics of ames

but for a scry vane maybe it's acceptable to say "you need to respond within 30 seconds or some ships just won't get your answer"?

There are a couple of different problems here.

  • How can a response get through if it's generated more than 30 seconds after the request is received, given that the firewall pinhole from the requester to the responder (or a ship in the responder's sponsorship hierarchy) will have timed out and the requester won't be able to receive the response?
  • Even if late responses do go through eventually, in the meantime, there could be a load problem on galaxies if requester ships are all pinging galaxies on a timer hoping to receive the response.

Firewall Problems with Late Responses

A requester ship will keep re-sending the request to the first ship in the route until it hears the response, using exponential backoff up to some maximum interval, on the order of a minute.
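
For concreteness, a retry schedule along those lines might look like the following; the one-second base and one-minute cap are assumptions, not part of the proposal:

import itertools

def retry_intervals(base: float = 1.0, factor: float = 2.0,
                    max_interval: float = 60.0):
    # Exponential backoff, capped at roughly a minute, then repeating at the
    # cap until a response arrives.
    interval = base
    while True:
        yield interval
        interval = min(interval * factor, max_interval)

print(list(itertools.islice(retry_intervals(), 8)))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]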

There are two cases: either the max retry interval is below the firewall pinhole expiry time, or not. If it is, then the pinhole will remain open and no matter how late the response is, it will trivially go through.

If the pinhole has expired by the time the response packet is sent, then the requester's firewall will drop the response packet and the requester will not see it. However, the requester will re-send the request packet once its retry timer expires. This time, the response packet is available immediately on the responder. For remote scry, the response packet should be in the scry cache. For Ames, the ack is in the Arvo state.

Note that it's not a guarantee that the scry cache will contain the response packet for a particular request packet, so there are edge cases around large scry responses -- but Ames can always generate a duplicate ack quickly, so this is only ever an issue with scry, not Ames.

So as long as the network is functioning basically at all and the responder isn't bogged down by a multiple-minute-long queue of incoming events, the responder will be able to emit the response within 30 seconds of receiving the re-sent request, and the packet will make it back to the requester before the firewall closes the pinhole.

A pathological case would be if a responder takes 30 seconds or more to field a scry request and its response is too large to be stored in cache. Firefox times out requests after 30 seconds, so it's possible to build real-world systems with this constraint, although it's arguably un-Martian.

I'm more concerned with what to do with a response that doesn't fit in cache. Scry requests that cause extreme processing time on the responder seem like an antipattern to me, and likely to be killed by an interpreter timeout because of this. But a scry for a large datum is legitimate, even an extremely large datum.

A sufficiently intelligent scry cache would want to do some sort of pagination, so that it would first cache the first 10MB of Shrek 2, then once those requests get satisfied, it would swap that out for the next 10MB, by peeking into Arvo again. This only works if the Arvo peeks are pretty quick, which basically means that above some value of space * time use for a scry response, the system won't work.

The scry cache could store up to some number of packets, like 10,000 (10MB), past the initial packet that triggered the response. The first request packet loads packets 1 through 10,000 into the cache. The request for packet 10,001 loads packets 10,001 through 20,000 into the cache. This should work for arbitrarily large responses.

Either Mars would continue to generate all the response packets and Urth would only keep the packets in the current page, or more performantly, the scry request could be modified to include the first fragment and page size, so that Mars would send fewer packets over IPC to Urth each time.
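
Here's a sketch of that paging behavior, assuming a hypothetical peek function that can produce a bounded window of response packets from Arvo; the zero-indexed windows and the names are mine:

from typing import Callable, Dict, List

PAGE_SIZE = 10_000   # packets per page, roughly 10MB at ~1KB per packet

class PagedScryCache:
    def __init__(self, peek: Callable[[str, int, int], List[bytes]]) -> None:
        self.peek = peek                        # peek(path, first, count) -> packets
        self.windows: Dict[str, range] = {}     # which page is currently cached per path
        self.packets: Dict[str, List[bytes]] = {}

    def get(self, path: str, fragment: int) -> bytes:
        window = self.windows.get(path)
        if window is None or fragment not in window:
            # The requested fragment is outside the cached page: swap the page
            # by peeking into Arvo again for the window that contains it.
            first = (fragment // PAGE_SIZE) * PAGE_SIZE
            self.windows[path] = range(first, first + PAGE_SIZE)
            self.packets[path] = self.peek(path, first, PAGE_SIZE)
        return self.packets[path][fragment - self.windows[path].start]

# Toy peek that fabricates packets; a real one would ask Mars over IPC.
cache = PagedScryCache(lambda path, first, count:
                       [b"packet %d" % i for i in range(first, first + count)])
print(cache.get("/~ritpub-sipsyl/shrek-2", 10_000))   # loads the 10,000..19,999 page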

The Galaxy Load Issue

There's a danger that lots of spurious packets will be sent around the network if the responder is down, or if subscribers are repeatedly pinging it over remote scry for the next datum in a subscription, which doesn't exist yet.

Consider the case where ~ritpub-sipsyl is publishing a chat channel at /~ritpub-sipsyl/chat-foo. Each chat message within the channel is available over remote scry, with the latest at /~ritpub-sipsyl/chat-foo/33. The 1,000 subscribers to this chat are all trying to scry the next message at /~ritpub-sipsyl/chat-foo/34, which doesn't yet exist and won't exist for, say, another hour.

During that hour, all 1,000 subscribers will be re-sending their scry requests for message 34 every so often, quickly reaching the max retry interval, which will likely be somewhere between 20 seconds and 2 minutes. Let's call it 20 seconds to get conservative estimates.

This adds up to 50 packets per second on this subscription. Another complication is that naively, these packets won't be hitting ~ritpub-sipsyl directly, even though it's online and even if subscribers could get a direct route to it -- if a ship hasn't heard back from some other ship for some amount of time, it times out the route and backs out to a looser route, since it can't tell the difference between an offline ship and a ship that hasn't yet produced the requested data.

So after a few minutes, all these subscribers will be sending their requests through ~ritpub-sipsyl's galaxy and star. This is not good scaling: if a galaxy has to handle requests for its planets, that's a subsidiarity failure that will overload a galaxy at a fairly small total network size.

It's worth thinking through just how many packets this would be. Forwarding code lives in the Urth process and only rarely touches Nock, to scry sponsee lanes. At the moment, it's much faster than everything else Urbit does, and could potentially be made even faster with parallelization. If the average ship is subscribed to 1,000 subscriptions, and nearly all requests get dropped, then assuming ships are evenly distributed under only 10 galaxies, the number of packets per second per galaxy to maintain subscriptions would be n * 1,000 / 10 / 20, or five packets per second per planet on the network.

So with just two thousand planets under ~zod, that's 10,000 packets per second to handle subscriptions, using these conservative numbers. I think that's within an order of magnitude or two of maxing out the CPU of a galaxy, so that's not good.
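
Spelling out that arithmetic:

n = 2_000                 # planets whose subscription traffic lands on one galaxy
subs_per_planet = 1_000   # assumed average subscriptions per planet
galaxies = 10             # ships assumed evenly distributed under 10 galaxies
retry_s = 20              # conservative max retry interval, in seconds

per_planet = subs_per_planet / galaxies / retry_s   # 5.0 packets/sec per planet
print(n * per_planet)                               # 10000.0 packets/sec at the galaxy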

I would note that we have a problem of similar magnitude right now, just in the opposite direction: publishers constantly ping offline subscribers, through the subscribers' galaxies once the direct lanes time out. There's a packet per subscription, just like in this problem. A difference is that the remote scry problem affects all subscribers, whereas the Ames problem only affects offline subscribers.

This suggests a deeper tradeoff, that the price for stateless publishing is more pinging from subscribers. The reason the publisher can stop pinging a subscriber in the current system is that it received an ack and used that to update its state. If the publisher doesn't know whether the subscriber got the response, it's up to the subscriber to keep pinging the publisher until it does.

Using remote scry for subscription can be thought of as trading disk write load on the publisher for a network load of more keepalive-like packets. This should be worth it. Keepalives scale a lot better than disk writes; a typical blade server can handle millions of simultaneous connections.

Just because subscribers have to keep pinging, though, doesn't necessarily mean that they all have to ping a galaxy. My main thought about how to prevent the galaxy bottleneck is to make it clear whether a ship is offline or just doesn't have a response yet to a scry request, so subscribers can know not to loosen their route to the responder and keep pinging the responder itself or its immediate sponsor, spreading the load. Two designs jump out at me:

  • Add a "blocked" response packet for future scrys
  • Send keepalive requests to the responder or its sponsor

"Blocked" Response Packets

I'd rather not add a "blocked" response packet. It complicates the protocol with a new primitive, and unlike all other packets, it's not referentially transparent, since it could be superseded by a real result later, resulting in a need for cache invalidation. I think it could be made to work, but I don't like it.

There would be advantages to this, though, especially since "blocked" packets could double as keepalives and be generated quickly by the Urth process given the %bide scry notification protocol. In almost all cases, Urth could respond with the "blocked" packet just by doing a hashtable lookup in its local transient state, so it should have performance comparable to current Ames packet forwarding.
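
Sketching what that Urth-side lookup might look like, assuming %bide leaves the runtime with a transient set of scry paths that Arvo is known to be blocked on (the names are illustrative):

from typing import Dict, Optional, Set

class UrthScryResponder:
    def __init__(self) -> None:
        self.blocked: Set[str] = set()       # transient state, fed by %bide notifications
        self.cache: Dict[str, bytes] = {}    # bound scry responses

    def handle_request(self, path: str) -> Optional[bytes]:
        if path in self.cache:
            return self.cache[path]          # normal cached response
        if path in self.blocked:
            return b"%blocked"               # cheap hashtable lookup, no Arvo event
        return None                          # fall through to a real peek into Arvo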

Keepalive Requests

The keepalive request could be a repeated scry request, such as /ax/~ritpub-sipsyl/r-u-ded/1, whose response could just be null. If the responder is online, it will be able to serve this from cache, or regenerate the response in Arvo very quickly if it fell out of cache. (Sponsors would want to know not to cache their sponsees' responses to this request, to prevent giving a false impression that their sponsee is online when it's not).

An advantage of using explicit keepalives over "blocked" response packets is that the subscriber only needs one per publisher ship, even in the common case where it's subscribed to several different paths on the publisher, in which case you'd need to send "blocked" responses on every path.

Once a request to some ship times out some number of times, the requester could start sending these keepalive requests every 20 seconds or so in addition to the repeated scry requests. This would both keep the pinhole to the responder open and inform the requester about whether the responder is actually online. If the responses stop coming, the responder is offline. Given that most ships will likely support the %bide scry notification protocol, the repeated scry requests could potentially be sent at a slower interval than the keepalive requests.
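
A sketch of that escalation, with the timeout threshold and interval as placeholders:

KEEPALIVE_AFTER_TIMEOUTS = 3     # placeholder threshold
KEEPALIVE_INTERVAL_S = 20        # placeholder interval

class SubscriberRetryState:
    def __init__(self) -> None:
        self.timeouts = 0
        self.keepalives_on = False

    def on_timeout(self) -> None:
        self.timeouts += 1
        if self.timeouts >= KEEPALIVE_AFTER_TIMEOUTS:
            # Start sending a cheap scry like /ax/~ritpub-sipsyl/r-u-ded/1 every
            # KEEPALIVE_INTERVAL_S seconds, alongside the real requests.
            self.keepalives_on = True

    def on_keepalive_response(self) -> None:
        # The responder is online, just not ready with the datum: keep the
        # tight route rather than loosening back toward the galaxy.
        self.timeouts = 0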

If the responder goes offline, the requester could temporarily stop sending requests to the responder and instead ask the responder's sponsor to be notified when the sponsee has come back online. The requester would also send keepalives to the sponsor to hold open the firewall pinhole and make sure the sponsor is online.

A complication with asking the sponsor is that the requester and responder could have different ideas of the current time. Maybe we could work around this restriction with some tricks, like pre-scheduled timeboxes with a synchronization scheme, or scry requests for "latest" including a random number in the request to prevent repeats, but we'd need to think carefully about that.

A hybrid scheme would send keepalives to the sponsor to hold the pinhole open, scry the sponsor for the "latest" revision number of the sponsee's lane, request that lane to be sure it's not the one we already have, and then request the next revision, relying on the %bide notification protocol on the sponsor to deliver the lane in the response as soon as it changes. This has lower reconnect latency -- as soon as the sponsor hears the sponsee's new lane, it will broadcast it to all the requesters, who will then resume retrying their scry requests to the sponsee.
