Skip to content

Instantly share code, notes, and snippets.

@pervognsen
Last active February 9, 2021 05:44
Show Gist options
  • Star 19 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save pervognsen/91474ed5f6a81fa01566e63d928e1d27 to your computer and use it in GitHub Desktop.
Save pervognsen/91474ed5f6a81fa01566e63d928e1d27 to your computer and use it in GitHub Desktop.
The Mosh and Quake 3 Networking Models and State Synchronization Algebra
========================================================================
Mosh is a new remote shell program and protocol: https://mosh.org/
You may read technical details about its internals here:
https://www.usenix.org/system/files/login/articles/winstein.pdf
https://mosh.org/mosh-paper.pdf
This is a real break with the history of remote terminals, all the way from ARPANET's TELNET through SSH.
The basic premise of these traditional protocols is to emulate a serial terminal link over a packet-switched
network. Before the ARPANET, leased telephone lines were used to facilitate remote terminals, and indeed one
of the motivations for the ARPANET was to enable the remote use of large timesharing systems without such
dedicated point-to-point leased lines (leased lines were still used for IMP-to-IMP connectivity in the ARPANET,
but they were treated as a carrier for time-multiplexed packets rather than being circuit switched, so a
pair of hosts that wanted to communicate didn't need a dedicated leased line between them).
What Mosh does differently (among other things) is that it doesn't try to emulate an end-to-end serial link.
The virtual serial link only exists between the server-side Mosh host software and the server-side operating
system, not between the client-side endpoint and the server-side operating system. That is, Mosh runs its
own server-side terminator emulator, maintaining the terminal screen state, and so on, on behalf of the client.
It then synchronizes this screen state (and other relevant terminal state like the echo mode) to the client,
and the client renders the synchronized screen state locally.
You can imagine a simple version of this general idea where the client-side software is entirely terminal agnostic
like VNC or other remote desktop software and just sits there receiving and rendering pixels. And indeed, nothing
prevents you from running xterm over VNC, or running a local X server with a remote xterm as the X client.
But obviously Mosh is a lot smarter than that. As a first observation, a terminal is a text-mode display,
not a bitmap display, and synchronizing the display state pixel by pixel is not the right level of abstraction.
In addition, there is non-graphical state like the echo mode which is important for the client to be able to do
latency-tolerant client-side prediction (e.g. telnet's linemode option, which SSH does not support). And if the
order of business is synchronizing structured data rather than linear data streams, you can do better than TCP.
The UDP-based state synchronization protocol that was invented for Mosh is called SSP. As far as I can tell,
it's not defined as a separate entity from Mosh, and it's not currently used by any other protocols or software.
SSP's design will look uncannily familiar to game programmers. It's almost identical to the Quake 3 Arena
networking model. You may read more about it here:
http://trac.bookofhook.com/bookofhook/trac.cgi/wiki/Quake3Networking
http://fabiensanglard.net/quake3/network.php
In Mosh, state updates are tagged with a monotonic sequence number. States are transmitted as diffs
relative to states previously acknowledged by the receiver. This means the sender needs to maintain a window
of past states. Once the received has acknowledged a state, the sender throws away any older states in its window.
When the sender has seen an acknowledgement from the receiver, it will send its own kind of ack in an
upcoming packet with the so-called throwaway sequence number, which the receiver uses to shrink its window of
reference states.
Client-side input is sent to the server using SSP as well. Here the abstract state object to be synchronized
is the full session input history, which grows in linear append-only fashion. The diff between any two such
input states is just the suffix of input keystrokes that are in the newer state which were not in the older
state. While game coders might not think of client input transmission as a state synchronization problem,
they do tend to take the exact same approach to the problem as I just described, going back to Tribes 1
and probably earlier: http://www.pingz.com/wordpress/wp-content/uploads/2009/11/tribes_networking_model.pdf
One interesting thing is how Mosh/SSP uses both sender and receiver acks to truncate reference states. In
the input history synchronization example, there's obviously no point in maintaining a copy of the full
input history once the receiver has acknowledged reception up to a certain point. At that point you just
want to truncate the acknowledged prefix. While this is a simple and practical notion, it's interesting
to consider what it means at the state object level. This is not something that the Mosh paper talks about,
but I believe it is implicit in the Mosh implementation, and being a math nerd I couldn't help but notice it:
State synchronization implicitly relies on an abstract algebra of states!
States have an ordering: A <= B, where A and B are partial states. In Mosh and in most other games I can think of,
the ordering is linear via the tagging of states with update sequence numbers or timestamps. But partial orderings are
actually more natural for a lot of structured data, where you may think of it as an information ordering where A <= B
asserts that B contains as much or more information than A, so B strictly subsumes A. The ordering is partial
when neither A or B contains strictly more information than the other. For example, with a BitTorrent-like protocol,
if packet A contains fragments 1 and 2 and packet B contains fragments 2 and 3, then neither is subsumed by the other.
Along with this ordering on states, we have operations on states. If D represents a diff from A to B, we may
write A + D = B. We can also turn this around and compute D given A and B as D = B - A. Idempotency of state
updates is represented by the absorption law (A + D) + D = A + D.
There is also a 0 element. Like the 0 you know and love, this has the property that A + 0 = A for all A.
And its interpretation here is very simple: 0 is an empty diff.
The ordering and operations are tied together in a simple way: A <= A + B for all A and B.
Rather than thinking of addition as applying a diff to a state, it's better to think of it as merging information
in a symmetric way that does not distinguish the state object from the diff object.
Going back to our discussion of truncating input histories, the important observation with respect to this
algebraic framework is that truncating is just subtracting the older, acknowledged state from the newer state.
And this is just what D = B - A does. So, D can not only be regarded as a diff, i.e. a recipe for transforming
A into B, but as a partial, truncated state. And this is useful because we can use D to generate diffs relative
to A without keeping a full copy of A around. For example, if we want to compute D' = C - A, we may do as follows:
D' = C - A = (C - B) + (B - A) = (C - B) + D, where C - B is the diff from B to C. This is a sum of diffs! But
these confusing distinctions disappear when you just think of all these things, whether A, B, C or any of their
differences and sums, as objects of the same kind. Then B - A is no conceptually different than D' - D even though
the latter might seem like a diff of diffs from our original vantage point.
In the case of input histories, this is very simple since both input histories and diffs between input histories
are sequences of keystrokes (however, if we subtract sequences without a common base, we get disjoint intervals).
For entity states in a game, it's not quite as simple but it's pretty close: our objects will be partial entity
states, which may or may not contain a defined value for an entity field. Then a complete entity state is just
a partial entity state with all its fields fully defined. An entity diff is the partial entity state that specifies
the values of only those fields that are different between the entities. A diff between diffs specifies the fields
that are different between the diffs. So you can see that diffing is done uniformly whether we're diffing entities
or diffs (in fact, the distinction is erased at this level).
This looks a lot like the algebraic structure known as a lattice. In general, you can't unconditionally subtract
elements in a lattice like we're doing here. That relies on the ordering being linear rather than just partial.
I haven't worked out all the details, but at least the basic algebraic picture provides some useful insights,
and I thought it was worth sharing them even in a preliminary form.
At a practical implementation level, you can benefit from this framework by implementing these abstract things
like addition, subtraction and ordering in your code, making sure they obey the right algebraic laws. You can
see what Mosh does here with its subtract, init_diff, diff_from and apply_string operations:
https://github.com/mobile-shell/mosh/blob/master/src/statesync/completeterminal.cc
https://github.com/mobile-shell/mosh/blob/master/src/statesync/user.cc
This is not quite as clean as the mathematical model I presented, particularly because diff_from and
apply_string operate with partial states serialized as strings rather than with state objects per se, but
seeing this code is what led to the above realizations, so all credit is due to them.
@radicaljims
Copy link

Domain theory in packet form?

@pervognsen
Copy link
Author

pervognsen commented Jan 9, 2017

Yep, there's definitely a lot of domain theory lurking under the covers. And the entity structure I outlined would be a product domain. The value domain (assuming we don't differentiate by type) consists of integers, strings, etc. The partial value domain is the value domain augmented with bottom to represent an unspecified/unknown value. The domain of partial entity states with n fields is the n-fold product of the partial value domain. This picture of partial nested tuples/records also seems like a good foundation for the semantics of a language like GraphQL.

In programming language theory and denotational semantics, domains usually have a somewhat different conceptual meaning, where bottom represents an incomplete computation (either because you haven't taken the fixed-point limit yet or because you have taken the limit and the computation is divergent/non-terminating). Of course, "incomplete computation" and "unknown value" are totally compatible interpretations.

@nil-ableton
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment