@wez wez/CMD_PROTOCOL.md
Last active Jun 6, 2018

Remote Terminal Command Protocol

RFC: Remote Terminal Multiplexer Protocol

This document describes a protocol for a remote terminal session, which may have 0 or more active terminals, to exchange control information with a local terminal renderer. This is similar in spirit to the tmux -CC Control Mode functionality but is spec'd out separately from tmux in order to make it relatively tmux agnostic and thus potentially easier for it to be integrated in alternative multiplexer implementations and supported in more terminal emulators.

My motivation for writing this is that I want to implement both a remote multiplexer and a local renderer. I'd like to be able to use either component with others where appropriate. For example, I'd like to be able to use an existing macOS terminal, such as iTerm, together with my server side multiplexer once written.

Terminology / Model

The following terms and mental model are used throughout this document.

  • Session refers to a connection between the client and the server. It has a limited but likely long-lived duration.

  • Server refers to the process running on the session host that is hosting the terminal sessions.

  • Client refers to the process running on the local system and that is rendering the terminal sessions.

  • Terminal refers to a logical terminal running on the server. There may be 0 or more terminals associated with a session. Each terminal has a unique identifier that does not change for the duration of the session.

  • Topology refers to the relative shape and placement of the terminals in the session. The topology is logically a tree of nodes, and defines Group, Window, Tab and Pane nodes that are expected to be rendered with the corresponding UI, although the precise rendition is left to the client. Each node, except for the root, is associated with 1 or more terminal ids. Group nodes are present in the model to accommodate having multiple discrete groupings on the server (these correspond to screen or tmux sessions). The group nodes themselves are not directly associated with a terminal id, but their children are. A topology consists of 1 or more Group nodes which can then contain 1 or more Window nodes, which can then contain 1 or more Tab nodes which can contain 1 or more Pane nodes.

Protocol Encoding

The channel that connects the server to the client can be thought of as a byte oriented channel and may be a TCP/IP connection or it may be a pair of pipes connected to a program such as Eternal Terminal. Authentication, authorization and accounting are assumed to have been taken care of at a different level and are out of scope of this document.

The data that passes over the channel is encoded as JSON objects that have specific meanings defined in this document. JSON was selected due to its relative ubiquity as a serialization form.

Each JSON object is sent on a line of its own, so the logic to encode and decode is relatively straightforward: encode the data as JSON and write it out, followed by a newline. Decoding is a matter of reading a line and then decoding the JSON object.

If a JSON object is not well formed, the session should be terminated.
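The framing rules above can be sketched in a few lines of Python. This is only an illustration, not part of the protocol: the names `encode_packet`, `decode_line` and `MalformedPacket` are made up for this example, and the sketch assumes that the caller terminates the session when `MalformedPacket` is raised.

```python
import json


class MalformedPacket(Exception):
    """Raised when a received line is not a well formed JSON object."""


def encode_packet(packet: dict) -> bytes:
    # json.dumps escapes newlines inside strings as \n, so the encoded
    # object always fits on a single line; we then append the terminator.
    return json.dumps(packet, separators=(",", ":")).encode("utf-8") + b"\n"


def decode_line(line: bytes) -> dict:
    try:
        packet = json.loads(line.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise MalformedPacket(str(exc)) from exc
    if not isinstance(packet, dict):
        # The protocol exchanges JSON objects, not bare arrays or scalars.
        raise MalformedPacket("top-level value is not a JSON object")
    return packet
```

A reader loop would call `decode_line` on each line read from the channel and shut the session down on the first `MalformedPacket`.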

In this document we will show examples of the JSON objects in a pretty printed form that is easier for humans to understand, along with comments. When these are sent over the channel it is critical that each object be sent on a single line, without pretty printing and without comments.

While some JSON parsers accept comments, they are not part of the strict core specification, so an object that contains them is not well formed. Pretty printing is valid JSON in itself, but the embedded newlines would break the one-object-per-line framing.

There are some key fields defined in the JSON object. The field names are intentionally short to try to keep the size of the data relatively small.

  • t - the timestamp at which the object was generated. The timestamp is measured in seconds since the establishment of the session (to keep the size of the number smaller than it would be if we used the unix epoch). This can potentially be used to detect latency issues and adjust communication strategy.
  • i - the request id field. Each request is assigned a value by the sender and this is encoded into the object. If a request generates a response then the response will have its ri field set to match the request. Request ids are opaque to the peer and thus should not be interpreted by the other end of the session. This allows various strategies to be used to generate the value of the i field. The i field should be an integer.
  • w - the what field. This is a string naming the type of the object. That in turn defines how to interpret the remainder of the fields. It is present only on a request packet and is omitted from the response packet.
  • e - the error field. This is only valid in a response. If present it indicates that the corresponding request failed. Errors are represented as a JSON array with two elements; the first is a string holding a machine readable error name and the second is a string holding a human readable error description.
  • ri - the response id field. This is only valid in a response, and must appear in a response. The ri field is set to the value of the i field from the corresponding request and allows the request and response to be joined together.
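As a rough sketch of how these fields fit together, the following Python helper builds request and response packets. `PacketFactory` is a hypothetical name, and two choices here are assumptions rather than requirements: request ids come from a simple counter (the protocol only requires an integer that is opaque to the peer), and `t` is truncated to whole seconds.

```python
import itertools
import time


class PacketFactory:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()          # moment the session was established
        self._ids = itertools.count()  # assumed id strategy: 0, 1, 2, ...

    def _now(self) -> int:
        # Seconds since session establishment, not since the unix epoch.
        return int(self._clock() - self._start)

    def request(self, what: str, **fields) -> dict:
        return {"w": what, "i": next(self._ids), "t": self._now(), **fields}

    def response(self, request: dict, error=None) -> dict:
        # Responses carry `ri` and omit `w`.
        packet = {"ri": request["i"], "t": self._now()}
        if error is not None:
            name, description = error
            packet["e"] = [name, description]
        return packet
```

Passing the clock in as a parameter is just a convenience for testing; a real implementation could read the time however it likes.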

Error codes

The following machine parseable error codes are defined; these correspond to index 0 in the e field in an error response. For example:

// Client to server
{
  "i": 52,
  "w": "wibble",
  "t": 12
}
// Server to client
{
  "ri": 52,
  "t": 12,
  "e": ["NOTIMPL", "Unknown command wibble"]
}
  • NOTIMPL is returned when a packet is sent with an unsupported w field.
  • TIMEOUT is returned when the t field in the request differs by more than a locally defined number of seconds from the time at which the request was received and ready to process.
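The TIMEOUT rule could be checked along these lines. This is a sketch only: `timeout_error` and `max_skew` are hypothetical names, the threshold is whatever the local implementation chooses, and the error description text is made up.

```python
def timeout_error(request: dict, received_t: int, max_skew: int):
    """Return a TIMEOUT error response, or None if the request is fresh enough.

    `received_t` is the receiver's own session-relative time, in seconds,
    at the point where the request is ready to be processed.
    """
    if abs(received_t - request["t"]) > max_skew:
        return {
            "ri": request["i"],
            "t": received_t,
            "e": ["TIMEOUT", "request is too old to process"],
        }
    return None
```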

Capability Exchange

When the session (the connection between the client and server) is established, the server and the client share capability information. This allows features to be added over time and taken advantage of when present.

The capabilities packet lists the features that are supported by the implementation. The list is a set of names that corresponds either to features named in this document, or namespaced features that are implementation specific. For example, tmux might define a special foo command that is out of scope of this document but that is supported by a specific client implementation. The server would include tmux:foo in its caps list to indicate that support is present and that the peer could use it if they wish.

The client and the server send each other a capabilities packet as their first action when the session is established. The caps list should include a list of all of the features/commands that they support.

// Server and Client send this to each other
{
  "w": "capabilities",
  "i": 0,
  "t": 0,
  "caps": [
    "async",
    "input",
    "output",
    "topology",
    "ping",
    "tmux-cmd:rename-session"
  ]
}

The client and server must respond to their capabilities packets:

// Client or Server responding to the above
{
  "ri": 0,
  "t": 0
}

If the peer doesn't support the required capabilities then an error may be generated, and the session should then be gracefully shut down:

// Client or Server indicating that they can't interoperate with each other
{
  "ri": 0,
  "t": 0,
  "e": ["NOTIMPL", "required capabilities not present"]
}
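One way to produce the response to a peer's capabilities packet is sketched below. The function name `capabilities_response` and the idea of a locally configured `required` set are assumptions for illustration; the protocol does not say how an implementation decides which capabilities it requires. The sketch also appends the missing names to the error description, which is optional.

```python
def capabilities_response(peer_packet: dict, required: set, now_t: int = 0) -> dict:
    offered = set(peer_packet.get("caps", []))
    missing = sorted(required - offered)
    response = {"ri": peer_packet["i"], "t": now_t}
    if missing:
        # The caller is expected to shut the session down gracefully
        # after sending this error.
        response["e"] = [
            "NOTIMPL",
            "required capabilities not present: " + ", ".join(missing),
        ]
    return response
```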

Protocol Related Packets

Asynchronous Mode

By default the server operates in synchronous request-response mode. In synchronous mode the server guarantees that it will not interleave packets that it generates unilaterally between a client generated request and the response to that request. Likewise, the client makes the same guarantee to the server.

If the server advertises the async capability then the client may request that it be enabled with a packet like this:

// Client to server
{
  "w": "enable-async",
  "i": 1,
  "t": 1
}

If the server doesn't support async operation it must not advertise it in its caps list. If the client requests async mode anyway, the server will return an error response like this:

// Server to client
{
  "ri": 1,
  "t": 1,
  "e": ["NOTIMPL", "This server does not support async mode"]
}

When asynchronous mode is enabled, both peers are free to interleave unilaterally generated data between a request from the other end and its response. The i and ri fields are used to maintain the relationship between the requests and responses.
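In asynchronous mode a peer needs some way to tell incoming responses apart from incoming requests and to pair each response with the request it answers. A minimal sketch, assuming a hypothetical `AsyncSession` class and a scalar `ri` (the array form of `ri` is not handled here):

```python
class AsyncSession:
    def __init__(self):
        # Requests we have sent and not yet seen a response for, keyed by `i`.
        self.pending = {}

    def track(self, request: dict) -> None:
        self.pending[request["i"]] = request

    def classify(self, packet: dict):
        """Return ("response", original_request_or_None) or ("request", packet)."""
        if "ri" in packet:
            # A response: join it back to our request via ri == i.
            return "response", self.pending.pop(packet["ri"], None)
        # Anything else is a request generated by the peer.
        return "request", packet
```

A response whose `ri` matches nothing in `pending` yields `None`; what to do in that case is left to the implementation.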

Synchronous mode is significantly easier to implement but may have lower throughput or higher latency characteristics than if the session is set to asynchronous mode, depending on the workload.

Liveness and Ping

The ping packet can be used to signal or detect liveness of the session channel. Upon receipt of a ping the peer will generate a ping response:

// Client or Server to peer
{
  "w": "ping",
  "i": 123,
  "t": 1234
}

The peer responds with:

// Server or client responding to the above
{
  "ri": 123,
  "t": 1234
}

It's possible (and probable if a channel such as Eternal Terminal is in use) that there may be periods of time where no ping response is generated for an extended duration. When connectivity is restored the peers will likely receive a stream of ping requests. Rather than generating a response for each one, a bulk ping response may be sent that uses array syntax to specify the list of ping requests to which it is responding.

// Responding to a series of `ping`s
// The same technique may be used to respond to multiple commands,
// provided that they either all have success status and no other
// data fields, or are all the same error condition.
{
  "ri": [123, 124, 125, 126, 127, 128],
  "t": 12345
}
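Building and consuming a bulk response might look like the sketch below. Both function names are hypothetical. The sketch assumes the sender has already checked that every request being answered succeeded with no extra data fields, as the comment in the example above requires.

```python
def bulk_response(request_ids: list, now_t: int) -> dict:
    # Use the plain scalar form when there is only one request to answer.
    ri = request_ids[0] if len(request_ids) == 1 else list(request_ids)
    return {"ri": ri, "t": now_t}


def answered_ids(response: dict) -> list:
    # Normalize `ri` so the receiver can treat both forms the same way.
    ri = response["ri"]
    return list(ri) if isinstance(ri, list) else [ri]
```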

The timing of ping packets and various responses can be used by the client and server to determine latency. For example, if no ping response has been received within a reasonable time period then the server may choose to suppress generating output for a Terminal until it receives a request with a more recent t field. Similarly, in such a situation the frequency of ping requests could be reduced.

It is acceptable and encouraged for the server to emit error responses to commands that have an old t field:

// Client to server
{
  "w": "input",
  "i": 345,
  "tid": 1,
  "text": "rm /some/thing",
  "t": 1
}

If the above isn't received until a day later then the server may respond with:

// Server responding to the client message above
{
  "ri": 345,
  "t": 86401,     // note the difference between the `t` fields
  "e": ["TIMEOUT", "took too long to receive your input, so ignored it"]
}

Topology Info

The server can unilaterally send topology information. This typically happens at the start of the session when connecting to an already established session, or in response to actions that result in a change in topology.

// Server to client
{
  "w": "topology",
  "i": 123,
  "t": 0,
  "topology": {
    "groups": [
      {
        "id": 1,
        "name": "My Session",
        "windows": [
          {
            "id": 1,
            "name": "window 1",
            "tabs": [
              {
                "id": 1,
                "name": "tab 1",
                "panes": [
                  {
                    "id": 1,
                    "name": "pane 1",
                    "terminal": {
                      "id": 1,
                      "name": "terminal 1",
                      "title": "wez@localhost:~",
                      "size": [80,24],
                      "pixel_size": [640, 480]
                    }
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
// Client responding to the above
{
  "ri": 123,
  "t": 0
}

The intent is that the client will use the topological information to produce a number of windows, tabs and panes that correspond to the entities in the remote topology.
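A client could walk the topology packet along these lines to find every terminal and where it sits in the tree. This is a sketch under the assumption that the structure matches the example above (groups, windows, tabs, panes, each pane holding one `terminal`); `terminals_in` is a hypothetical helper name.

```python
def terminals_in(topology: dict):
    """Yield ((group_id, window_id, tab_id, pane_id), terminal) pairs."""
    for group in topology.get("groups", []):
        for window in group.get("windows", []):
            for tab in window.get("tabs", []):
                for pane in tab.get("panes", []):
                    terminal = pane.get("terminal")
                    if terminal is not None:
                        path = (group["id"], window["id"], tab["id"], pane["id"])
                        yield path, terminal
```

The client would pass the value of the packet's `topology` field and create one window, tab and pane per distinct path prefix.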

TODO:

  • Iconified state?
  • Other applicable flags?
  • Positioning/layout of panes within their containing tab?

Topology/Window Management

TODO:

  • Server or Client creates, moves, resizes, closes Group, Window, Tab, Pane
  • Rather than sending the full topology in these cases, just send a packet referencing the item id and the changed properties.

Input/Output

The client can send input to a given terminal using the input command. The input can be any UTF-8 text and the intent is that it is passed directly to the stdin of the terminal running on the server. The input text is typically plain UTF-8 text typed by the user, but it may also consist of key presses such as cursor keys or function keys encoded as per the host terminal. For example, Cursor Up is often sent as ESC [ A in many terminals, which is written in JSON as "text": "\u001b[A" because JSON has no \x escape.

// Client to server
{
  "w": "input",
  "text": "ls -al\n",
  "tid": 1, // the terminal id to which the input is sent
  "i": 42,
  "t": 33
}
// Server responding to the client message above
{
  "ri": 42,
  "t": 34
}
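Key presses that contain control characters need care when encoded, because JSON only allows control characters in strings via escapes such as `\u001b`. The sketch below shows a standard JSON encoder taking care of this; `input_packet` and `CURSOR_UP` are hypothetical names used only for illustration.

```python
import json

# "\x1b" here is a Python string escape for the ESC character; it is not
# what ends up on the wire.
CURSOR_UP = "\x1b[A"


def input_packet(tid: int, text: str, request_id: int, now_t: int) -> dict:
    return {"w": "input", "tid": tid, "text": text, "i": request_id, "t": now_t}


# json.dumps escapes ESC as \u001b, which any strict JSON parser accepts.
wire = json.dumps(input_packet(1, CURSOR_UP, 42, 33))
```

Decoding `wire` on the server yields the original three character sequence, ready to be written to the terminal's stdin.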

The server can advise the client of output being printed to a specific terminal via the output packet. As with input, the output text is a UTF-8 encoded string which may have escape sequences embedded:

// Server to client
{
  "w": "output",
  "text": "\u001b[1mHello\u001b[0m\n",
  "tid": 1, // the terminal id that produced the output
  "i": 64,
  "t": 35
}
// Client responding to the server message above
{
  "ri": 64,
  "t": 35
}

Scrollback

  • How can the client request scrollback data?
  • How about searching the scrollback for matching text?

Resynchronizing or minimizing output

The server, after an extended period without contact with the peer, may choose to suppress sending output packets to the client. When the client is available again it will need to catch up with the changes that happened while it was away. Rather than replaying the entire stream of output events that occurred in the interim, the server may choose to send an alternative output packet to refresh or redraw the visible (non-scrollback) portions of the screen.

The same technique can be used in situations where a large amount of text is being dumped to the terminal output.
