Ames Push-Based Reconnect

As it stands, Ames re-sends packets to an offline peer forever, every two minutes. For a ship with, say, 2400 offline subscribers to five chat channels, this adds up to a hundred packets being sent per second to ships that aren't there. A lot of those packets will also go through a relay before being hurled into the void, forcing the relay to do essentially meaningless work.
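To make the arithmetic concrete, here is the back-of-the-envelope calculation behind that figure, reading "subscribers to five chat channels" as one stalled flow per subscriber per channel:

```python
# Back-of-the-envelope retry load for one ship with many offline subscribers.
offline_subscribers = 2400      # illustrative figure from above
channels = 5                    # chat channels each subscriber is in
retry_interval_s = 120          # Ames re-sends to an offline peer every two minutes

flows = offline_subscribers * channels          # 12,000 stalled flows
packets_per_second = flows / retry_interval_s   # 100 packets/s of pure waste
print(packets_per_second)
```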

This is a fair amount of wasted CPU, bandwidth, and disk usage (every retry is a new Arvo event on the sender, which must be written to disk). To make matters worse, the retry rate trades off against reconnect latency. If my ship comes back online after a few minutes of downtime, it will take on average one minute -- half the retry interval -- before I hear retried packets from my peers. Lowering the retry interval to shorten that reconnect time would raise the rate of wasted packets, increasing systemic load on the network.

A couple of years ago, ~wicdev-wisryt proposed switching Ames to push-based reconnect behavior, which I will attempt to flesh out into a full proposal in this document. The basic idea: once a peer stops responding for long enough that it seems offline, the sending ship stops pinging it on an interval and instead asks the peer's sponsor for a notification when the peer comes back online. This arrangement will make use of Ames's existing sponsor-pinging system.

Aside from galaxies, which are root nodes that have no sponsor, every ship pings its sponsor once every thirty seconds to hold open a connection in spite of any firewall restrictions. Thirty seconds was chosen because that's a common timeout after which a firewall will close an inactive connection.

(As an aside, changing that to just under 30 seconds might be a good idea, to reduce the chance of missing the window by fractions of a second due to network latency.)
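As a rough sketch of the keepalive loop described above, in Python rather than Hoon; the sub-30-second margin from the aside is an arbitrary one-second guess:

```python
import time

FIREWALL_TIMEOUT_S = 30.0
# Ping slightly more often than the firewall's idle timeout, so a bit of
# network latency can't push a keepalive past the window; the one-second
# margin here is a guess, not a measured value.
PING_INTERVAL_S = FIREWALL_TIMEOUT_S - 1.0

def keepalive_loop(send_ping) -> None:
    # send_ping is a stand-in for "send one ping packet to my sponsor".
    while True:
        send_ping()
        time.sleep(PING_INTERVAL_S)
```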

Every time the sponsor hears a ping packet from a sponsee, it records the IP and port from which it heard that packet in its Arvo state. This is how a sponsor knows 'where' its sponsee is: it uses this stored IP and port as the destination when relaying packets from other ships to the sponsee, and also as the address to which it sends ack packets in response to the sponsee's ping packets. The sponsor also remembers the most recent date at which it heard a sponsee's ping packet.
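A minimal sketch of that bookkeeping, in Python rather than Hoon; the structure and field names are illustrative, not Ames's actual Arvo state:

```python
import time
from dataclasses import dataclass

@dataclass
class SponseeLocation:
    ip: str            # lane the last ping arrived from
    port: int
    last_heard: float  # when that ping was received

# sponsor state: ship name -> last known location, updated on every ping
locations: dict[str, SponseeLocation] = {}

def on_sponsee_ping(ship: str, ip: str, port: int) -> None:
    # Record where the sponsee is and when we last heard from it; this is the
    # address used both for relaying packets to it and for acking its pings.
    locations[ship] = SponseeLocation(ip, port, time.time())
```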

So the sponsor knows when it last knew a sponsee was at a particular IP and port. I propose letting other ships ask the sponsor to be notified of the sponsee's IP and port the next time the sponsor receives a ping from that sponsee. This will be a single-request, single-response protocol. To prevent space leaks, it should use an Ames flow that can be reused for later requests about the same peer.
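One way the sponsor's side of this single-request, single-response exchange could look, again as a Python sketch with hypothetical names (the real mechanism would be the reusable Ames flow mentioned above):

```python
# peers waiting to hear the sponsee's next location, keyed by sponsee ship
pending_requests: dict[str, set[str]] = {}

def on_notify_request(requester: str, sponsee: str) -> None:
    # Remember that `requester` wants to know when `sponsee` next pings us.
    pending_requests.setdefault(sponsee, set()).add(requester)

def on_next_ping(sponsee: str, ip: str, port: int) -> None:
    # On the sponsee's next ping, answer every outstanding request with the
    # fresh lane, then clear the queue (single request, single response).
    for requester in pending_requests.pop(sponsee, set()):
        send_response(requester, sponsee, ip, port)

def send_response(requester: str, sponsee: str, ip: str, port: int) -> None:
    # Hypothetical stand-in for sending the response over the reused flow.
    print(f"notify {requester}: {sponsee} is at {ip}:{port}")
```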

An exception needs to be made: re-sending packets to a galaxy, which has no sponsor to ask about its IP and port, should continue indefinitely until the galaxy responds. Contacting a galaxy should probably have a low max retry interval, such as thirty seconds, since galaxies should never go down.
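A sketch of how the retry policy might carve out the galaxy exception; the exponential backoff shape and the exact intervals are assumptions, not Ames's current timer logic:

```python
GALAXY_MAX_INTERVAL_S = 30.0   # galaxies: keep retrying forever at a short interval
PEER_MAX_INTERVAL_S = 30.0     # others: retry at most this often while seemingly down
GIVE_UP_AFTER_S = 5 * 60.0     # ...then switch to asking the peer's sponsor

def next_retry_interval(peer_rank: str, attempt: int, silent_for_s: float) -> float | None:
    """Seconds until the next re-send, or None to stop retrying and switch to
    the push-based sponsor request described above."""
    if peer_rank == "galaxy":
        # No sponsor to ask, so never give up; galaxies should never be down long.
        return min(2.0 ** attempt, GALAXY_MAX_INTERVAL_S)
    if silent_for_s >= GIVE_UP_AFTER_S:
        return None
    return min(2.0 ** attempt, PEER_MAX_INTERVAL_S)
```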

Arguably, pinging your own sponsor (which might be a star or planet) should similarly never back off or escalate to that sponsor's own sponsor, but I think it's cleaner to keep the exception limited to galaxies, which should be sufficient.

Right now, sponsor pinging is performed by the :ping Gall agent. This is currently clean, because the app doesn't interact directly with the Ames vane at all. A push-based reconnect system such as the one proposed here might benefit from moving pinging into Ames itself.

Here's a tricky case: a sponsee could come back online after a brief disconnection, just before one of its peers requests to be notified when it comes back online. If the sponsor naively waits for the next ping from its sponsee before responding to the request, almost thirty extra seconds will pass before the peer learns the sponsee is back online.

The way I would prefer to ameliorate this issue is to wait longer before giving up on pinging a peer and subscribing to its sponsor instead. If I wait, say, five minutes with a retry interval of thirty seconds, then the likelihood that the peer happened to come back online just before I switch to asking the sponsor should be low enough not to be a significant issue.
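On the sender's side, this mitigation amounts to a longer grace period before switching strategies; roughly, using the example numbers above:

```python
import time

RETRY_INTERVAL_S = 30.0      # keep re-sending this often in the meantime
GIVE_UP_AFTER_S = 5 * 60.0   # grace period before asking the peer's sponsor

def should_ask_sponsor(last_heard: float, now: float | None = None) -> bool:
    # After five minutes of silence, the odds that the peer reappeared moments
    # before we gave up on direct retries are small enough to live with.
    now = time.time() if now is None else now
    return (now - last_heard) >= GIVE_UP_AFTER_S
```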

Another approach, which I think would be harder to get right and more prone to instability, would be for the sponsor to respond to the peer's request immediately if the last ping it heard from its sponsee was recent enough. If that response was too optimistic -- the sponsee actually went offline after its most recent ping -- then the peer will give up again after another thirty seconds or so and send a new request to the sponsor. This trick could work well as long as the sponsor's staleness threshold (say, 5 seconds) is significantly shorter than the peers' threshold for declaring a peer disconnected (say, 30 seconds) and shorter than the sponsor ping interval (say, 30 seconds); otherwise, peers' first requests would frequently yield stale sponsee data from the sponsor, causing the peers to perform an unnecessary retry iteration instead of quietly waiting for the sponsee to reappear.
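The sponsor side of this more aggressive variant might look something like the following sketch, where the thresholds are the example figures above and all names are hypothetical:

```python
import time

STALENESS_THRESHOLD_S = 5.0   # must be well under the peers' 30s disconnection threshold

def handle_notify_request(requester: str, sponsee: str,
                          last_ping_time: float, last_ip: str, last_port: int) -> bool:
    """Respond immediately if the sponsee pinged very recently; otherwise
    queue the request to be answered on the sponsee's next ping."""
    if time.time() - last_ping_time <= STALENESS_THRESHOLD_S:
        # Optimistic answer: the sponsee is probably still at this lane.
        # If it isn't, the requester will time out and ask again.
        send_response(requester, sponsee, last_ip, last_port)
        return True
    queue_for_next_ping(requester, sponsee)
    return False

def send_response(requester: str, sponsee: str, ip: str, port: int) -> None:
    # Hypothetical stand-in for the immediate response.
    print(f"notify {requester}: {sponsee} last seen at {ip}:{port}")

def queue_for_next_ping(requester: str, sponsee: str) -> None:
    # Hypothetical stand-in for deferring until the next ping arrives.
    print(f"will notify {requester} when {sponsee} pings next")
```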

(As another aside, it might be useful to have a user-facing "ping sponsor now" button in case I just joined a wifi network, or switched wifi networks, and don't want to wait thirty seconds for my ship to ping my sponsor again.)

I think the worst-case scenario with a system like this is a ship that pings its sponsor every so often but has generally poor connectivity or is under high CPU load (which look the same from a network perspective), causing a reconnect storm every time it blips back online. The severity of this reconnect storm will be proportional to the number of outstanding Ames flows on peers trying to send it packets, which is roughly the number of chats and other social engagements the ship is in. The more popular the ship, the more severe the reconnect storm if its connection flickers.

In general, I expect push-based reconnect to produce load spikes during reconnect that are higher than what ships experience now, but significantly lower steady-state load. If reconnect storms turn out to be a serious problem, then as a courtesy a peer could delay by a random timeout before attempting to re-send stalled packets to the newly reappeared ship. That will smear the packet onslaught over time, "flattening the curve" to avoid swamping the network by everyone spamming the ship at once.
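If reconnect storms did become a problem, the courtesy delay could be as simple as a random jitter before re-sending; the spread below is an arbitrary illustration:

```python
import random

MAX_JITTER_S = 30.0   # arbitrary spread; wider means a flatter, slower storm

def resend_delay() -> float:
    # Each peer waits a random fraction of the window before re-sending its
    # stalled flows, smearing the packet onslaught instead of hitting the
    # newly reappeared ship all at once.
    return random.uniform(0.0, MAX_JITTER_S)
```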

A different approach from the request-response protocol I'm advocating in this document would be for peers to maintain long-lived subscriptions to their peers' sponsors, which would notify subscribers whenever a sponsee's IP and port changed and each time it came back online after a disconnection. I think this would lead to unnecessary load, however. Ships often don't care about a peer's location for long stretches of time, so while uninterested they would have to either ignore subscription updates or unsubscribe and resubscribe later -- which is similar to this proposal but with more state and more packets.

Another way to approach this sort of scaling problem would be to keep the current pull-based system but reduce CPU, bandwidth, and disk usage through more piecemeal changes, some of which could be applied in either the current system or the push-based system proposed here:

  • To reduce CPU on the retrying sender, Ames could maintain a cache of encrypted packets, so that packets don't have to be re-encrypted each time they're re-sent. This cache would only need to be invalidated when the sender or receiver rotates its keys, incrementing its life number (see the sketch after this list).
  • To reduce disk write load on the sender, un-acked packets could be scried out by the runtime and re-sent without triggering Arvo events, or batched into multiple re-sends per event. Maybe this would best be done only for packets that have already reached the max retry timeout, so that congestion control could still use fine-grained timers with online peers.
  • To reduce bandwidth, Ames could consolidate long-term retrying to one packet per offline peer per timeout interval, instead of one packet per flow with that peer.
  • To reduce load on relays, the relays could drop all packets intended to be relayed to known-offline sponsees.
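As a sketch of the first item above, the re-send cache could key entries on both ships' life numbers, so that a key rotation on either side naturally invalidates stale ciphertext; the structure and names here are illustrative, not Ames internals:

```python
# cache key: (receiver ship, message number, sender life, receiver life)
# A key rotation bumps a life number, so stale ciphertext simply stops
# being looked up and can be evicted.
CacheKey = tuple[str, int, int, int]
packet_cache: dict[CacheKey, bytes] = {}

def cached_packet(receiver: str, message_num: int,
                  sender_life: int, receiver_life: int,
                  encrypt) -> bytes:
    """Return the encrypted packet for this message, encrypting it only once
    per (sender life, receiver life) pair; `encrypt` is a hypothetical callback."""
    key = (receiver, message_num, sender_life, receiver_life)
    if key not in packet_cache:
        packet_cache[key] = encrypt(receiver, message_num)
    return packet_cache[key]
```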