wez/CMD_PROTOCOL.md

## CMD_PROTOCOL.md

      
RFC: Remote Terminal Multiplexer Protocol

This document describes a protocol for a remote terminal session, which
may have 0 or more active terminals, to exchange control information with
a local terminal renderer.  This is similar in spirit to the tmux -CC
Control Mode functionality but is spec'd out separately from tmux in
order to make it relatively tmux agnostic and thus potentially easier
for it to be integrated in alternative multiplexer implementations
and supported in more terminal emulators.
My motivation for writing this is that I want to implement both
a remote multiplexer and a local renderer.  I'd like to be able
to use either component with others where appropriate.  For example,
I'd like to be able to use an existing macOS terminal, such as iTerm,
together with my server side multiplexer once written.

Terminology / Model

The following terms and mental model are used throughout this document.


Session refers to a connection between the client and the server.
It has a limited but likely long-lived duration.


Server refers to the process running on the session host that
is hosting the terminal sessions.


Client refers to the process running on the local system and
that is rendering the terminal sessions.


Terminal refers to a logical terminal running on the server.
There may be 0 or more terminals associated with a session.
Each terminal has a unique identifier that does not change for
the duration of the session.


Topology refers to the relative shape and placement of the
terminals in the session.  The topology is logically a tree
of nodes, and defines Group, Window, Tab and Pane nodes
that are expected to be rendered with the corresponding UI,
although the precise rendition is left to the client.
Each node, except for the root, is associated with 1 or more terminal ids.
Group nodes are present in the model to accommodate having multiple
discrete groupings on the server (these correspond to screen or
tmux sessions).  The group nodes themselves are not directly associated with
a terminal id, but their children are.  A topology consists of
1 or more Group nodes which can then contain 1 or more Window nodes,
which can then contain 1 or more Tab nodes which can contain 1 or more
Pane nodes.


Protocol Encoding

The channel that connects the server to the client can be thought of
as a byte oriented channel and may be a TCP/IP connection or it may
be a pair of pipes connected to a program such as Eternal Terminal.
Authentication, authorization and accounting are assumed to have been
taken care of at a different level and are out of scope of this document.
The data that passes over the channel is encoded as JSON objects
that have specific meanings defined in this document.  JSON was selected
due to its relative ubiquity as a serialization form.
Each JSON object is sent on a line of its own, so the logic to encode and
decode is relatively straightforward: encode the data as json and write it out,
followed by a newline.  Decoding is a matter of reading a line and then
decoding the json object.
If a JSON object is not well formed, the session should be terminated.
In this document we will show examples of the JSON objects in a
pretty printed form that is easier for humans to understand, along with
comments.  When these are sent over the channel, each object must be
sent on a single line, not pretty printed, and without comments.
While some JSON parsers accept comments, they are not part of the
JSON specification, so an object that contains them is not a well
formed JSON object representation.
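The line-oriented framing described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `encode_packet`, `decode_packet` and `MalformedPacket` are hypothetical, and the caller is assumed to terminate the session when `MalformedPacket` is raised.

```python
import json


class MalformedPacket(Exception):
    """Raised when a received line is not a well-formed JSON object."""


def encode_packet(packet: dict) -> bytes:
    # Without an indent argument json.dumps never emits a raw newline,
    # so the object always fits on one line. Compact separators save bytes.
    return json.dumps(packet, separators=(",", ":")).encode("utf-8") + b"\n"


def decode_packet(line: bytes) -> dict:
    try:
        packet = json.loads(line.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise MalformedPacket(str(exc)) from exc
    if not isinstance(packet, dict):
        raise MalformedPacket("top-level JSON value must be an object")
    return packet
```

Python's `json.loads` rejects `//` comments, so a pretty-printed example copied verbatim from this document would fail to decode, which matches the rule above.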
There are some key fields defined in the json object.  The field
names are intentionally short to try to keep the size of the data
relatively small.

t - the timestamp at which the object was generated.
The timestamp is measured in seconds since the establishment
of the session (to keep the size of the number smaller than
it would be if we used the unix epoch).
This can potentially be used to detect latency issues and
adjust communication strategy.
i - the request id field.  Each request is assigned
a value by the sender and this is encoded into
the object.  If a request generates a response then the response
will have its ri field set to match the request.  Request
ids are opaque to the peer and thus should not be interpreted
by the other end of the session.  This allows various strategies
to be used to generate the value of the i field.  The i field
should be an integer.
w - the what field.  This is a string naming the type
of the object.  That in turn defines how to interpret the
remainder of the fields.  It is present only on a request
packet and is omitted from the response packet.
e - the error field.  This is only valid in a response. If
present it indicates that the corresponding request failed.
Errors are represented as a json array with two elements; the
first is a string holding a machine readable error name and
the second is a string holding a human readable error description.
ri - the response id field.  This is only valid in a response,
and must appear in a response.  The ri field is set to the value
of the i field from the corresponding request and allows the
request and response to be joined together.
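As a rough sketch of how these fields fit together, the hypothetical `PacketFactory` below fills in `t`, `i` and `ri` for requests and responses. Two choices here are assumptions, not part of this document: request ids come from a simple counter starting at 0, and `t` is truncated to whole seconds.

```python
import itertools
import time


class PacketFactory:
    """Builds request and response objects with the common fields filled in."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._session_start = clock()
        self._next_id = itertools.count(0)  # assumed id strategy: a counter

    def _timestamp(self) -> int:
        # `t` is seconds since the session was established, not the unix epoch.
        return int(self._clock() - self._session_start)

    def request(self, what: str, **fields) -> dict:
        return {"w": what, "i": next(self._next_id), "t": self._timestamp(), **fields}

    def response(self, request: dict, error=None, **fields) -> dict:
        # A response carries `ri` (copied from the request's `i`) and no `w`.
        packet = {"ri": request["i"], "t": self._timestamp(), **fields}
        if error is not None:
            name, description = error
            packet["e"] = [name, description]
        return packet
```

For example, `factory.response(req, error=("NOTIMPL", "Unknown command wibble"))` produces an error response whose `ri` matches `req["i"]`.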

Error codes

The following machine parseable error codes are defined; these correspond
to index 0 in the e field in an error response.  For example:
// Client to server
{
  "i": 52,
  "w": "wibble",
  "t": 12
}

// Server to client
{
  "ri": 52,
  "t": 12,
  "e": ["NOTIMPL", "Unknown command wibble"]
}


NOTIMPL is returned when a packet is sent with an unsupported w field.
TIMEOUT is returned when the t field in the request is more than
a locally defined number of seconds different from the time at which it
was received and ready to process.
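One way a receiver might apply these two error codes is sketched below. The `dispatch` function, the `handlers` table and the 30 second `MAX_AGE_SECONDS` limit are all illustrative assumptions; the document leaves the timeout threshold to each implementation.

```python
MAX_AGE_SECONDS = 30  # assumed local policy, not defined by this document


def dispatch(request: dict, handlers: dict, now: int) -> dict:
    """Return the response for `request`, or an error response."""
    response = {"ri": request["i"], "t": now}
    what = request.get("w")
    handler = handlers.get(what)
    if handler is None:
        # Unsupported `w` field
        response["e"] = ["NOTIMPL", f"Unknown command {what}"]
    elif abs(now - request["t"]) > MAX_AGE_SECONDS:
        # The request's `t` differs too much from the local session time
        response["e"] = ["TIMEOUT", "request is too old to process"]
    else:
        # Handlers return any extra response fields as a dict
        response.update(handler(request))
    return response
```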

Capability Exchange

When the session (the connection between the client and server) is established,
the server and the client share capability information.  This allows
features to be added over time and taken advantage of when present.
The capabilities packet lists the features that are supported by the
implementation.  The list is a set of names that corresponds either to
features named in this document, or namespaced features that are implementation
specific.  For example, tmux might define a special foo command that is
out of scope of this document but that is supported by a specific client
implementation.  The server would include tmux:foo in its caps list
to indicate that support is present and that the peer could use it if they wish.
The client and the server send each other a capabilities packet as their
first action when the session is established.  The caps list should include
a list of all of the features/commands that they support.
// Server and Client send this to each other
{
  "w": "capabilities",
  "i": 0,
  "t": 0,
  "caps": [
    "async",
    "input",
    "output",
    "topology",
    "ping",
    "tmux-cmd:rename-session"
  ]
}

The client and server must respond to their capabilities packets:
// Client or Server responding to the above
{
  "ri": 0,
  "t": 0
}

If the peer doesn't support the required capabilities then an error may
be generated, and the session should then be gracefully shut down:
// Client or Server indicating that they can't interoperate with each other
{
  "ri": 0,
  "t": 0,
  "e": ["NOTIMPL", "required capabilities not present"]
}
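The exchange above could be handled along the following lines. This is only a sketch: `capabilities_packet` and `respond_to_capabilities` are hypothetical helper names, and the notion of a locally configured `required` set is an assumption about how an implementation decides whether it can interoperate.

```python
def capabilities_packet(caps, request_id=0, t=0) -> dict:
    """Build the packet each peer sends first when the session starts."""
    return {"w": "capabilities", "i": request_id, "t": t, "caps": sorted(caps)}


def respond_to_capabilities(packet: dict, required: set, t: int = 0) -> dict:
    """Acknowledge the peer's capabilities, or report what is missing."""
    response = {"ri": packet["i"], "t": t}
    missing = required - set(packet.get("caps", []))
    if missing:
        response["e"] = [
            "NOTIMPL",
            "required capabilities not present: " + ", ".join(sorted(missing)),
        ]
    return response
```

After sending an error response the caller would be expected to shut the session down gracefully, as described above.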

Protocol Related Packets

Asynchronous Mode

By default the server operates in synchronous request-response mode.
In synchronous mode the server guarantees that it will not interleave
unilaterally server generated packets between the request and response
to a client generated request.  Likewise, the client makes the same
guarantee to the server.
If the server advertises the async capability then the client may
request that it be enabled with a packet like this:
// Client to server
{
  "w": "enable-async",
  "i": 1,
  "t": 1
}

If the server doesn't support async operation it must not advertise it
in its caps list and it will return an error response to the client like
this:
// Server to client
{
  "ri": 1,
  "t": 1,
  "e": ["NOTIMPL", "This server does not support async mode"]
}

When asynchronous mode is enabled, both peers are free to interleave
unilaterally generated packets between a request from the other end and its response.
The i and ri fields are used to maintain the relationship between
the requests and responses.
Synchronous mode is significantly easier to implement but may have lower
throughput or higher latency characteristics than if the session is set to
asynchronous mode, depending on the workload.
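In asynchronous mode a sender needs some way to join each response back to its request. A minimal sketch, assuming a simple in-memory table keyed by `i` (the class name `PendingRequests` is hypothetical, and responses whose `ri` is an array are not handled here):

```python
class PendingRequests:
    """Tracks outstanding requests so responses may arrive in any order."""

    def __init__(self):
        self._by_id = {}

    def sent(self, request: dict) -> None:
        # Remember the request under its `i` value until it is answered.
        self._by_id[request["i"]] = request

    def resolve(self, response: dict):
        """Return the request that `response` answers, or None if unknown."""
        return self._by_id.pop(response["ri"], None)

    def outstanding(self) -> int:
        return len(self._by_id)
```

In synchronous mode the table never holds more than one entry per direction, which is part of why that mode is simpler to implement.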

Liveness and Ping

The ping packet can be used to signal or detect liveness of the session channel.
Upon receipt of a ping request such as the following, the peer generates a ping response:
// Client or Server to peer
{
  "w": "ping",
  "i": 123,
  "t": 1234
}

The peer responds with:
// Server or client responding to the above
{
  "ri": 123,
  "t": 1234
}

It's possible (and probable if a channel such as Eternal Terminal is in use) that
no ping response is generated for an extended period.  When connectivity is
restored the peers will likely
receive a stream of ping requests.  Rather than generating a response for
each one, a bulk ping response may be sent that uses array syntax to specify
the list of ping requests to which it is responding.
// Responding to a series of `ping`s
// The same technique may be used to respond to multiple commands,
// provided that they either all have success status and no other
// data fields, or are all the same error condition.
{
  "ri": [123, 124, 125, 126, 127, 128],
  "t": 12345
}
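Both sides of the bulk form can be sketched briefly. The helper names `bulk_response` and `response_ids` are hypothetical, and the sketch assumes every request in the batch succeeded with no extra data fields, which is the condition stated in the comment above.

```python
def bulk_response(requests: list, t: int) -> dict:
    """Acknowledge several successful, data-free requests with one packet."""
    ids = [request["i"] for request in requests]
    # A single request keeps the scalar form of `ri`.
    return {"ri": ids[0] if len(ids) == 1 else ids, "t": t}


def response_ids(response: dict) -> list:
    """Normalize `ri` so callers can treat scalar and array forms alike."""
    ri = response["ri"]
    return list(ri) if isinstance(ri, list) else [ri]
```

A receiver would loop over `response_ids(response)` and mark each listed request as answered.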

The timing of ping packets and various responses can be used by the client
and server to determine latency.  For example, if no ping response has
been received within a reasonable time period then the server may choose
to suppress generating output for a Terminal until such a time that it
receives a request with a more recent t field.  Similarly, in such
a situation the frequency of ping requests could be reduced.
It is acceptable and encouraged for the server to emit error responses
to commands that have an old t field:
// Client to server
{
  "w": "input",
  "i": 345,
  "tid": 1,
  "text": "rm /some/thing",
  "t": 1
}

If the above isn't received until a day later then the server may respond with:
// Server responding to the client message above
{
  "ri": 345,
  "t": 86401,     // note the difference between the `t` fields
  "e": ["TIMEOUT", "took too long to receive your input, so ignored it"]
}

Topology Info

The server can unilaterally send topology information.  This typically
happens at the start of the session when connecting to an already established
session, or in response to actions that result in a change in topology.
// Server to client
{
  "w": "topology",
  "i": 123,
  "t": 0,
  "topology": {
    "groups": [
      {
        "id": 1,
        "name": "My Session",
        "windows": [
          {
            "id": 1,
            "name": "window 1",
            "tabs": [
              {
                "id": 1,
                "name": "tab 1",
                "panes": [
                  {
                    "id": 1,
                    "name": "pane 1",
                    "terminal": {
                      "id": 1,
                      "name": "terminal 1",
                      "title": "wez@localhost:~",
                      "size": [80,24],
                      "pixel_size": [640, 480]
                    }
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

// Client responding to the above
{
  "ri": 123,
  "t": 0
}

The intent is that the client will use the topological information to
produce a number of windows, tabs and panes that correspond to the entities
in the remote topology.
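A client might walk the tree as in the sketch below to find every terminal and the UI path that leads to it. The function names are hypothetical, and the code assumes the field names used in the example packet above (`groups`, `windows`, `tabs`, `panes`, `terminal`).

```python
def iter_terminals(topology: dict):
    """Yield (group, window, tab, pane, terminal) for every pane in the tree."""
    for group in topology.get("groups", []):
        for window in group.get("windows", []):
            for tab in window.get("tabs", []):
                for pane in tab.get("panes", []):
                    yield group, window, tab, pane, pane["terminal"]


def terminal_paths(topology: dict) -> dict:
    """Map each terminal id to a readable 'group / window / tab / pane' path."""
    return {
        terminal["id"]: " / ".join(
            node["name"] for node in (group, window, tab, pane)
        )
        for group, window, tab, pane, terminal in iter_terminals(topology)
    }
```

The `topology` argument is the value of the `topology` field in the packet, not the whole packet.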
TODO:

Iconified state?
Other applicable flags?
Positioning/layout of panes within their containing tab?

Topology/Window Management

TODO:

Server or Client creates, moves, resizes, closes Group, Window, Tab, Pane
Rather than sending the full topology in these cases, just send a packet
referencing the item id and the changed properties.

Input/Output

The client can send input to a given terminal using the input command.
The input can be any UTF-8 text and the intent is that it is passed
directly to the stdin of the terminal running on the server.  The input
text is typically plain UTF-8 text typed by the user, but it may also
consist of key presses such as cursor keys or function keys encoded
as per the host terminal.  For example, Cursor Up is often sent as
ESC [ A, which is written "text": "\u001b[A" in JSON (JSON has no \x escape).
// Client to server
{
  "w": "input",
  "text": "ls -al\n",
  "tid": 1, // the terminal id to which the input is sent
  "i": 42,
  "t": 33
}

// Server responding to the client message above
{
  "ri": 42,
  "t": 34
}

The server can advise the client of output being printed to a specific
terminal via the output packet.  As with input, the output text
is a UTF-8 encoded string which may have escape sequences embedded:
// Server to client
{
  "w": "output",
  "text": "\u001b[1mHello\u001b[0m\n",
  "i": 64,
  "t": 35
}

// Client responding to the server message above
{
  "ri": 64,
  "t": 35
}
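Escape sequences and newlines inside `text` need care on the wire, because strict JSON has no `\x` escape and a raw newline would break the one-object-per-line framing. The sketch below, which uses a hypothetical `input_packet` helper, shows that a standard encoder such as Python's `json.dumps` takes care of both.

```python
import json

CURSOR_UP = "\x1b[A"  # Python string literal for ESC [ A


def input_packet(tid: int, text: str, request_id: int, t: int) -> dict:
    """Build an input request for the terminal identified by `tid`."""
    return {"w": "input", "tid": tid, "text": text, "i": request_id, "t": t}


# json.dumps writes the ESC control character as the JSON escape \u001b
wire = json.dumps(input_packet(1, CURSOR_UP, 42, 33))

# A newline typed by the user is escaped too, so the packet stays on one line
wire_with_newline = json.dumps(input_packet(1, "ls -al\n", 42, 33))
```

Decoding `wire` with `json.loads` yields the original three-character string again, so the receiver can pass it to the terminal unchanged.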

Scrollback


How can the client request scrollback data?
How about searching the scrollback for matching text?

Resynchronizing or minimizing output

The server, after an extended period without contact with the peer, may choose to suppress
sending output packets to the client.  When the client becomes available again it will need to
catch up with the changes that occurred while it was away.  Rather than replaying the entire stream of output
events that occurred in the interim, the server may choose to send an alternative output
packet to refresh or redraw the visible (non-scrollback) portions of the screen.
The same technique can be used in situations where a large amount of text is being dumped
to the terminal output.
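One possible server-side policy for this is sketched below. Everything in it is an assumption made for illustration: the `OutputBacklog` class, the 64 KiB default limit, and the idea of deciding between "replay" and "refresh" purely by backlog size. The format of the refresh packet itself is not defined by this document and is not shown.

```python
class OutputBacklog:
    """Per-terminal buffer that gives up on replay once it grows too large."""

    def __init__(self, limit_bytes: int = 65536):
        self.limit_bytes = limit_bytes  # assumed threshold, tune per deployment
        self._chunks = []
        self._size = 0
        self.needs_refresh = False

    def append(self, text: str) -> None:
        if self.needs_refresh:
            return  # already over budget; a redraw will supersede this output
        self._chunks.append(text)
        self._size += len(text.encode("utf-8"))
        if self._size > self.limit_bytes:
            # Too much to replay: drop the backlog and redraw the screen later.
            self._chunks.clear()
            self._size = 0
            self.needs_refresh = True

    def drain(self):
        """Return ("replay", text) or ("refresh", None) and reset the backlog."""
        if self.needs_refresh:
            result = ("refresh", None)
        else:
            result = ("replay", "".join(self._chunks))
        self._chunks.clear()
        self._size = 0
        self.needs_refresh = False
        return result
```

When the client reconnects, the server would call `drain()` for each terminal and either send the buffered text as ordinary output or redraw the visible screen.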