My previous symmetric routing proposal entailed building a whole reverse routing envelope into each packet. The requester would have to specify some information about the whole path before sending the packet. As ~master-morzod pointed out to me, such a design does not make for a good bottom-up, self-assembling network.
In contrast, in this proposal each relay node only needs to know the previous node and the next, not the entire multi-hop route. The basic idea is that when a ship receives a request packet, it figures out where the next hop in the forwarding chain for that packet should be, forwards the packet to that location, and remembers the IP and port the request came from for thirty seconds. If it later hears a response packet, it can pass that response back to the previous relay.
This means that every request must be satisfied within thirty seconds, but that's a workable constraint. Note that many HTTP systems operate under this constraint. This arrangement is also similar to NDN's "pending interests" table. An NDN "interest" packet corresponds almost exactly to a Fine scry request packet, and stateful Ames request packets can behave the same with respect to routing.
In order for this to work, each node has to have some idea of the next node to try. If the node has a direct route to the receiver ship, it should send the packet there. Otherwise, if it knows the ship's sponsor, it should try that, falling back up the sponsorship chain potentially all the way to the receiver's galaxy.
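This fallback logic can be sketched in Python (an illustrative sketch only; the lane table and sponsorship map are stand-ins I've invented, not part of the proposal):

```python
def next_hop(rcvr, lanes, sponsor_of):
    """Pick the next node to try for a packet addressed to rcvr.

    lanes:      dict of ship -> known (ip, port) lane
    sponsor_of: dict of ship -> its sponsor (a galaxy sponsors itself)
    """
    # A direct route to the receiver wins if we know one.
    if rcvr in lanes:
        return rcvr
    # Otherwise walk up the sponsorship chain, potentially
    # all the way to the receiver's galaxy.
    ship = rcvr
    while ship in sponsor_of and sponsor_of[ship] != ship:
        ship = sponsor_of[ship]
        if ship in lanes:
            return ship
    return None  # no known route, not even to the galaxy
```

For example, if a relay only knows the lane of the galaxy ~syl, a packet for ~mister-ritpub-sipsyl falls all the way back to ~syl.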
The next question is how routes tighten, which requires information about lanes for ships later in the relay chain to propagate backward to ships earlier in the relay chain. A maximally tight route is a direct route between the sender ship and the receiver ship.
Each ship in the relay chain, when forwarding a response packet backward through the route, can overwrite a "next" field in the response packet to contain the relay just past it. This enables the previous relay to try the "next" relay on the next request packet, skipping the relay in between.
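The overwrite step itself is simple; a minimal sketch, assuming the response packet carries a mutable "next" field (the dict shape and field names are mine):

```python
def relay_response(resp, heard_from):
    """When a relay forwards a response backward through the route, it
    stamps the relay it heard the response from into the packet's
    'next' field, so the previous relay can try that node directly.

    resp:       dict with a 'next' slot (representation is an assumption)
    heard_from: (ship, lane) of the relay one hop closer to the receiver
    """
    ship, lane = heard_from
    resp['next'] = {'ship': ship, 'lane': lane}
    return resp
```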
One simple rubric is that a ship should try the tightest known-good route and the next tightest route simultaneously until the tighter route has been demonstrated to be viable. This way, in case the tighter route turns out not to work (usually because it's behind a firewall and not publicly addressable), the looser route will still get the packet through.
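That rubric is a two-line policy; sketched in Python (the function shape is mine):

```python
def send_with_fallback(pkt, tight_lane, loose_lane, tight_proven, send):
    """Send along the tightest known route; until that route has been
    demonstrated viable, also send along the next-tightest route, so a
    firewalled (non-publicly-addressable) tight route doesn't drop the
    packet."""
    send(pkt, tight_lane)
    if not tight_proven:
        send(pkt, loose_lane)
```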
A ship should only overwrite the "next" field when forwarding a response back to a ship that is not one of its direct sponsors (this precludes fraternization from galaxy directly to planet or moon, but I think that's ok). The following example shows how a route tightens over multiple packet roundtrips.
request: ~rovnys -> ~syl -> ~sipsyl -> ~ritpub-sipsyl -> ~mister-ritpub-sipsyl
response: ~rovnys <- ~syl <- ~sipsyl <- ~ritpub-sipsyl <- ~mister-ritpub-sipsyl
::
:: next: ~sipsyl@123.456.789:1234
request: ~rovnys -> ~sipsyl -> ~ritpub-sipsyl -> ~mister-ritpub-sipsyl
response: ~rovnys <- ~sipsyl <- ~ritpub-sipsyl <- ~mister-ritpub-sipsyl
::
:: next: ~ritpub-sipsyl@234.567.890:2345
request: ~rovnys -> ~ritpub-sipsyl -> ~mister-ritpub-sipsyl
response: ~rovnys <- ~ritpub-sipsyl <- ~mister-ritpub-sipsyl
::
:: next: ~mister-ritpub-sipsyl@345.678.901:3456
request: ~rovnys -> ~mister-ritpub-sipsyl
response: ~rovnys <- ~mister-ritpub-sipsyl
::
:: next: ~
As an aside, note that while this example was written using full Urbit ships as relays, this routing procedure would also allow any kind of program that speaks the routing protocol and knows the lanes of ships to act as a relay node, even if it does not have an Urbit address (although the system would need a new way to learn about these non-ship relays).
In order for this procedure to work, each relay must maintain a data structure:
+$  state
  $:  pending=(map request [=lane expiry=@da])
      sponsees=(map ship lane)
  ==
+$  request
  $%  [%ames sndr=ship rcvr=ship]
      [%scry rcvr=ship =path]
  ==
The expiry field is set to thirty seconds from now whenever we hear a request. Repeated requests bump the timeout, keeping the request alive.
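The relay's behavior over this state can be sketched in Python (a deterministic simplification of the Hoon types above; the class and method names are mine):

```python
TTL = 30  # seconds a pending request stays routable

class Relay:
    """Mirrors the +$ state above: a pending-requests table plus a
    sponsee lane table. Timestamps are passed in explicitly."""
    def __init__(self):
        self.pending = {}   # request key -> (lane, expiry)
        self.sponsees = {}  # ship -> lane

    def on_request(self, req, from_lane, now):
        """Remember where the request came from; a duplicate request
        bumps the timeout, keeping the request alive."""
        self.pending[req] = (from_lane, now + TTL)

    def on_response(self, req, now):
        """Return the lane to forward a response back along, or None if
        the request is unknown or has expired."""
        entry = self.pending.get(req)
        if entry is None:
            return None
        lane, expiry = entry
        if now > expiry:
            del self.pending[req]
            return None
        return lane
```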
I'm not sure your proposed system works as intended. I've filled in the blanks a bit for parts of the system that you've only described somewhat briefly, so let me know if I misunderstood. Consider:
Subscriber sends scry request for next datum to publisher. Publisher doesn't have the datum, so it just starts sending heartbeat responses on a timer. When the subscriber gets the first heartbeat, it stops re-sending the scry request, assuming that the publisher will notify it. Then at some point, the publisher grows a value at the requested scry path and fires off a scry response packet to the subscriber, but the packet gets lost over the wire.
I guess there are a couple of cases in this situation.
In fact, if there are other subscriptions, then we'd also need some sort of initial ack to each scry request, in addition to the heartbeats. Otherwise, if one subscription is already online and there are heartbeats flowing, then when the subscriber makes a request for a path in a different subscription (and the path is for the next datum and hasn't been grown yet), there's no observable change in the publisher's behavior, so the subscriber will just keep re-sending the request packet. The publisher was already sending heartbeats, and it just keeps sending heartbeats after receiving the second request. So the subscriber will never learn that it can stop re-sending this scry request.
So I don't think this works without the publisher sending some kind of ack on the initial scry request telling the subscriber that it doesn't need to keep re-sending the request, at least while the heartbeats are going. I think this ack is a little different from what I'd been thinking of as a "scry blocked" response, since it implies that the publisher is taking on the responsibility of delivering a response, as opposed to just a courtesy of informing the subscriber that its scry request failed for now.
The publisher would only send this ack if it's running the scry notifications protocol. Otherwise, it shouldn't claim responsibility for delivering the response later. This could act as a form of protocol negotiation: a subscriber re-sends a scry request on a timer until it hears this initial ack; if the ack never comes, it keeps pinging forever. If the subscriber isn't using the scry notifications protocol, it can drop incoming acks and keep re-sending the scry request packet as if it had never heard them.
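The subscriber side of this negotiation reduces to tracking which requests have been acked; a sketch (class and field names are mine):

```python
class Subscriber:
    """Negotiation sketch: each scry request is re-sent on a timer until
    an initial ack arrives for it. A publisher that doesn't speak the
    notifications protocol never acks, so re-sending continues forever."""
    def __init__(self):
        self.unacked = set()  # paths still being re-sent on a timer
        self.acked = set()    # paths the publisher took responsibility for

    def request(self, path):
        self.unacked.add(path)

    def on_ack(self, path):
        self.unacked.discard(path)
        self.acked.add(path)

    def paths_to_resend(self):
        return set(self.unacked)
```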
The ack needs to be authenticated, and it needs to include the sender ship's address, the scry request path, and a session id number of some sort so that if the publisher Vere restarts or the subscriber ship goes offline for a while, the session id will change, informing the subscriber that it needs to re-send all pending scry requests.
If the ack is lost over the wire, the subscriber will re-send its scry request on a timer. Each time the publisher gets a duplicate request, it sends a duplicate ack, which should include the same session id as the first ack unless the session has died.
An issue with eliding heartbeat requests and only sending heartbeat responses is: how does the publisher know to stop trying to push updates to a subscriber, if the subscriber has moved or gone offline? If heartbeats only go from the publisher to the subscriber, then the publisher doesn't know whether the subscriber is still online. It effectively has to keep its pending request table alive forever if the subscriber goes down. Again, I think it makes more sense for this responsibility to lie with the subscriber. If the publisher stops hearing keepalives from the subscriber, it deletes the request packets from that subscriber and stops trying to notify it. If the subscriber comes back online, it starts over from scratch and re-sends all its pending scry requests.
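The publisher's side of that responsibility is just a timeout-driven eviction; a sketch with explicit timestamps (the sixty-second figure and all names are assumptions):

```python
TIMEOUT = 60  # seconds without a keepalive before a subscriber is presumed gone

class Publisher:
    """If keepalives from a subscriber stop, delete its pending requests
    and stop trying to notify it."""
    def __init__(self):
        self.pending = {}    # subscriber -> set of pending scry requests
        self.last_seen = {}  # subscriber -> time of last keepalive

    def on_keepalive(self, sub, now):
        self.last_seen[sub] = now

    def on_request(self, sub, req, now):
        self.pending.setdefault(sub, set()).add(req)
        self.last_seen[sub] = now

    def evict_silent(self, now):
        """Forget any subscriber whose keepalives have stopped."""
        for sub in [s for s, t in self.last_seen.items() if now - t > TIMEOUT]:
            self.pending.pop(sub, None)
            del self.last_seen[sub]
```

A subscriber that comes back online after eviction simply starts over and re-sends all its pending scry requests, repopulating this table.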
Here's a counterproposal, based on these thoughts:
The general idea is that the subscriber sends heartbeats while it has pending requests open on the publisher, which keeps those requests pinned in the publisher's pending requests table. A heartbeat can be thought of as a shorthand for a set of re-sent scry requests, one for each pending request.
The publisher acks each heartbeat request so that the subscriber can continue to tighten the route. The publisher also sends a subscription ack each time it hears a request for a scry path that doesn't yet exist. The subscriber will re-send each scry request on a timer until it hears such a subscription ack. This means that even if the first couple acks get lost over the wire, one will eventually go through and the subscriber can stop re-sending the scry request and instead switch to heartbeats -- and it might already have been sending heartbeats because of some other subscription.
A session dies either when the publisher Vere restarts or when a certain amount of time (maybe 1 minute) has gone by without the publisher hearing a heartbeat request packet. If the publisher hears a heartbeat request on an unrecognized session, it sends a heartbeat response with the new session id, creating that id if needed. If the subscriber stops hearing heartbeat responses for a certain amount of time (also a minute, let's say), it will consider the connection dead, but will continue to send heartbeats until it hears a response from the publisher. If the subscriber ever hears a new session id from the publisher, it knows it needs to re-send all pending scry requests and switch its heartbeat requests to the new session id.
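Both halves of the session rules above fit in a few lines each; a sketch (names are mine, and session ids are modeled as a simple counter):

```python
import itertools

_ids = itertools.count(1)

class PublisherSession:
    """Publisher side: after a restart there is no live session; the
    first heartbeat mints a fresh id, and every heartbeat response
    carries the current id."""
    def __init__(self):
        self.session = None  # None models 'just restarted'

    def on_heartbeat(self, sub_session):
        if self.session is None:
            self.session = next(_ids)
        return self.session  # carried in the heartbeat response

class SubscriberSession:
    """Subscriber side: hearing a new session id means all pending scry
    requests must be re-sent under that id."""
    def __init__(self):
        self.session = None

    def on_heartbeat_response(self, pub_session):
        changed = pub_session != self.session
        self.session = pub_session
        return changed  # True => re-send all pending scry requests
```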
A subscription ack packet contains:
A heartbeat request contains:
A heartbeat response contains:
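The packet bodies are left unspecified above; as a guess based only on fields mentioned earlier (the sender ship's address, the scry request path, and a session id), they might look roughly like:

```python
from dataclasses import dataclass

# All field names and contents here are assumptions: the text only says a
# subscription ack must be authenticated and include the sender ship's
# address, the scry request path, and a session id, and that heartbeats
# are keyed by session id.

@dataclass
class SubscriptionAck:
    sender: str   # publisher ship's address
    path: str     # the scry request path being acked
    session: int  # changes when the publisher Vere restarts

@dataclass
class HeartbeatRequest:
    session: int  # the session the subscriber believes is live

@dataclass
class HeartbeatResponse:
    session: int  # the publisher's current session id
```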