@legastero
Last active April 6, 2022 17:19

See a working demo of Jingle session management

I'm working on updates to XEP-0166 to add implementation guidance, but I'm going to share the CliffsNotes version here in the meantime since people are looking at creating new implementations.

I have worked on multiple Jingle implementations (and multiple iterations of each) for five years now, on various platforms and languages (Python, Lua, JavaScript, Swift). However, it has only been this past year that I've "achieved enlightenment" and finally understood what the processing model underlying Jingle is supposed to be.

The main problem with XEP-0166 is that it only describes the syntax of the core Jingle protocol. It does not explain why the protocol is shaped the way it is, and without that understanding it is quite simply impossible to achieve a correct implementation. You can certainly make things that work for certain limited use cases, but they won't be a solid and extensible foundation for new things.

So for right now, forget about what you know of XEP-0166. This is not a discussion that involves protocol or XML. It really doesn't even involve XMPP. The underlying signaling transport does not matter. The serialization of the signaling payloads does not matter. What matters is the data and how it flows internally. Yes, yes, those other things are important to us, but they are not relevant when creating the core of a Jingle engine.

1) What does Jingle do?

Ostensibly, what Jingle does are things like media streaming, file transfers, etc. But that's wrong.

Yes, you read that right. Jingle does not stream media or transfer files.

What Jingle does is negotiate how to stream media or how to transfer files. The actual work is left for other things to do.

In other words: Jingle's purpose is to synchronize boxes between two entities. These boxes, in particular:

+------------------------------------+
|                                    |
|              Session               |
|                                    |
|  +-----------------------------+   |
|  |           Content           |   |
|  |  (creator, name, senders)   |   |
|  |                             |   |
|  | +-------------------------+ |   |
|  | | Application Description | |   |
|  | +-------------------------+ |   |
|  | |  Security Description   | |   |
|  | +-------------------------+ |   |
|  | |  Transport Description  | |   |
|  | +-------------------------+ |   |
|  +-----------------------------+   |
|                                    |
|                                    |
|      ... additional contents       |
|                                    |
+------------------------------------+

Every Jingle action creates, removes, or modifies one of these boxes. Take a few minutes and look through the Jingle actions and convince yourself of this fact.

  • session-initiate
  • session-accept
  • session-terminate
  • content-add
  • content-remove
  • content-accept
  • content-reject
  • content-modify
  • description-info
  • security-info
  • transport-info
  • session-info
  • transport-replace
  • transport-accept
  • transport-reject

When implementing Jingle, your goal is to keep that diagram synchronized between both parties by using those actions.

There are some interesting concepts we can see here already:

  • The session itself is a box.
  • Content boxes can be added and removed, but also need to be approved.
  • Application and security boxes can be modified, but are not replaceable.
  • Transport boxes can be modified, but can also be entirely replaced (with approval).

Again, note how Jingle does not care what goes inside those application, security, and transport description boxes. That data is meant to be handled by the other things doing the actual work, not Jingle itself.
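
As a rough illustration, the boxes in the diagram can be modeled as plain data. This is a minimal sketch: the class names, method names, and the `kind` fields are hypothetical, and only creator, name, and senders come from the diagram. The senders default of "both" is the default XEP-0166 defines for that attribute.

```javascript
// Hypothetical class and method names. The three description boxes are
// stored as-is: the core never looks inside them.
class Content {
  constructor({ creator, name, senders = "both", application, security = null, transport }) {
    this.creator = creator; // "initiator" or "responder"
    this.name = name;
    this.senders = senders;
    this.application = application; // opaque Application Description
    this.security = security; // opaque Security Description (optional here)
    this.transport = transport; // opaque Transport Description
  }
}

class Session {
  constructor(sid) {
    this.sid = sid;
    this.contents = new Map(); // keyed by creator + name
  }

  static key(creator, name) {
    return `${creator}/${name}`;
  }

  addContent(content) {
    this.contents.set(Session.key(content.creator, content.name), content);
  }

  removeContent(creator, name) {
    return this.contents.delete(Session.key(creator, name));
  }

  // Transport boxes are the only description boxes that can be swapped out.
  replaceTransport(creator, name, transport) {
    const content = this.contents.get(Session.key(creator, name));
    if (!content) throw new Error(`unknown content: ${name}`);
    content.transport = transport;
  }
}

const session = new Session("sid-1");
session.addContent(
  new Content({ creator: "initiator", name: "voice", application: { kind: "rtp" }, transport: { kind: "ice" } })
);
session.replaceTransport("initiator", "voice", { kind: "other" });
```

Nothing in `Session` or `Content` depends on what the application, security, or transport values contain, which is the point of the box model.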

2) A Session is a Session is a Session is a Session

There is one mistake that everyone makes when first starting to implement Jingle. I do mean everyone. It is an easy mistake. It is a tempting mistake. It looks and feels right. But it is wrong.

I'm talking about this:

  • MediaSession.code
  • FileTransferSession.code
  • WhiteboardSession.code
  • ... etc

Seriously, I mean that. But it looks right, doesn't it? Of course you would have MediaSessions and FileTransferSessions, maybe even an AudioSession, right?

No.

A Jingle session is a Jingle session is a Jingle session. Look again at those boxes above. There is only a Session box. There is nothing about audio or video or files or whiteboards. The core of a Jingle engine is not aware of those things; that information is only inside the black boxes of application descriptions, and is irrelevant to our goal of syncing those boxes.

The desired outcome here is that while there is no such thing as a FileTransferSession, there is the concept of a Session which contains one or more FileTransfer applications. That also means a single session can fall into multiple categories without requiring a combinatorial explosion of classes.

So then why do we all want to create these various Session classes? I believe it is because of a breakdown in communication in XEP-0166: it doesn't clearly explain this goal of box syncing and how to do it. You load up XEP-0166, start looking at the examples, your eyes glaze over from the blobs of XML, you throw your hands up saying this is an overcomplicated piece of garbage, and finally decide to try to implement just enough to satisfy your use case of media streaming, file transfers, etc. Thus, you end up collecting various types of session classes that are baked into solving one particular use case, and in a limited form of it, at that.

Or, it starts from knowing what UI you want to build, and thinking that since you want different UIs for media streaming, file transfers, etc, it would be best to have different Session classes for them. After all, they are different things, right?

The last point there is an interesting one. There are spots in XEP-0166 that advise against mixing application types inside a single session beyond what is necessary. That is, the contents in a session should logically go together. And that is certainly a good principle, but deciding what "logically goes together" is only doable at the higher layers of your system, not inside the core of a Jingle engine. Creating multiple Session classes to enforce these separations is a decision that limits what you can do with your implementation.

3) Each action counts double

That list of Jingle actions up above, how many actions are there?

Fifteen, you say?

Nope.

There are thirty actions -- each name in that list counts double because there is both a local and remote version. I bring this up for two reasons:

  1. Your implementation is going to be about twice as large as what you were expecting, because there are thirty and not fifteen actions to handle.
  2. Local actions need to be managed by the session, just like you do remote actions. I've seen (written, even) implementations where local actions are not coordinated through the session itself, but rather blast out the associated stanza directly, which unfortunately leads to state management problems.

4) The priority queue

The heart of every Jingle session is a priority queue -- every beat processes a single local or remote action. Again, that is a priority queue. While all local and remote actions are added to this queue, it is always the local actions that run first.
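
A minimal sketch of such a queue, assuming each action carries a hypothetical `origin` field set to "local" or "remote". With only two priority levels, two FIFO lanes are enough; the class and field names are illustrative, not from XEP-0166.

```javascript
// Hypothetical two-lane priority queue: local actions always drain first,
// and each lane keeps its own arrival order.
class ActionQueue {
  constructor() {
    this.local = [];
    this.remote = [];
  }

  push(action) {
    const lane = action.origin === "local" ? this.local : this.remote;
    lane.push(action);
  }

  // Returns the action for the next beat, or undefined when idle.
  next() {
    if (this.local.length > 0) return this.local.shift();
    return this.remote.shift();
  }

  get size() {
    return this.local.length + this.remote.length;
  }
}

const queue = new ActionQueue();
queue.push({ origin: "remote", name: "transport-info" });
queue.push({ origin: "local", name: "content-add" });
queue.push({ origin: "remote", name: "content-modify" });
queue.push({ origin: "local", name: "transport-info" });

const order = [];
while (queue.size > 0) {
  const action = queue.next();
  order.push(`${action.origin}:${action.name}`);
}
console.log(order.join(", "));
// local:content-add, local:transport-info, remote:transport-info, remote:content-modify
```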

Here is what happens during each beat:

Local Action:

  1. Have the session verify that the action is valid to process.
  2. Have affected contents verify the action is valid to process.
  3. Have affected contents execute the action (NB. this needs to be treated as an async operation so that the API is always consistent).
  4. Once all contents have finished executing, update session state as necessary based on the action.
  5. Signal the results to the other side.
  6. Done, start next beat.

Remote Action:

  1. Have the session verify that the action is valid to process.
  2. Have affected contents verify the action is valid to process.
  3. Signal an ack (IQ result or error).
  4. Have affected contents execute the action (NB. this needs to be treated as an async operation so that the API is always consistent).
  5. Once all contents have finished executing, update session state as necessary based on the action.
  6. Done, start next beat.

Both sides look pretty similar, but this is still a very high level overview. Note the subtle difference: local actions signal the results of the action at the end of the beat (because there is now a difference in the boxes that needs to be synced) whereas remote actions signal an acknowledgement at the start of the beat, before executing.

Sending the ack before executing is important: the ack does not indicate whether the action was successfully applied; it indicates that the action was accepted for execution (as opposed to being a bad request or needing tie-breaking). Execution failures will generally trigger new local actions (e.g., content-reject or transport-reject) to "undo" the action that failed.

In both cases, the actual work is treated as an async operation. While not every application or transport will actually require async support, to keep the API and developer expectations consistent, async should always be used.
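
The two step lists above can be folded into one function that differs only in where the signaling happens. This is a sketch under stated assumptions: `validate`, `affectedContents`, `execute`, `updateState`, `ack`, and `send` are hypothetical hooks standing in for whatever your engine provides, and rejecting an invalid local action by returning false is a simplification.

```javascript
// One beat of the queue. All hook names are hypothetical.
async function runBeat(session, action, signaling) {
  // Steps 1 and 2: the session, then each affected content, validates.
  const contents = session.affectedContents(action);
  const valid =
    session.validate(action) && contents.every((c) => c.validate(action));

  if (action.origin === "remote") {
    // Remote only: ack before executing. The ack reports that the action
    // was accepted for execution, not that it was applied.
    signaling.ack(action, valid);
  }
  if (!valid) return false;

  // Always awaited, even if a content happens to finish synchronously,
  // so callers see one consistent async API.
  await Promise.all(contents.map((c) => c.execute(action)));
  session.updateState(action);

  if (action.origin === "local") {
    // Local only: our boxes now differ from the peer's, so signal the change.
    signaling.send(action);
  }
  return true;
}
```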

There is one more item to consider here: the interaction between your API and the priority queue. Consider this example:

let ack = await session.addContent(content);

A local content-add action is created and then processed by the queue. But addContent() does not return when the queue has finished processing that local content-add action. Instead, it continues waiting for an acknowledgement to arrive from the peer. In other words, the queue does not wait for the remote side to acknowledge signaled actions; it is your API that should wait.
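
One way to get that behavior is to have the API method return a promise that only the incoming ack settles. The sketch below is an assumption-laden illustration: `sendStanza`, `handleAck`, `awaitingAck`, and the id format are hypothetical, and the direct call to `sendStanza` stands in for the queue processing the action.

```javascript
// Hypothetical Session fragment showing only the API/ack interaction.
class Session {
  constructor(sendStanza) {
    this.sendStanza = sendStanza; // hypothetical signaling callback
    this.awaitingAck = new Map(); // action id -> { resolve, reject }
    this.lastId = 0;
  }

  addContent(content) {
    const id = `action-${++this.lastId}`;
    const acked = new Promise((resolve, reject) => {
      this.awaitingAck.set(id, { resolve, reject });
    });
    // Stand-in for the queue: its beat is over once the action is signaled.
    this.sendStanza({ id, action: "content-add", content });
    // The caller keeps waiting on the peer; the queue does not.
    return acked;
  }

  // Called by the signaling layer when the peer's ack (or error) arrives.
  handleAck(id, error = null) {
    const waiter = this.awaitingAck.get(id);
    if (!waiter) return false;
    this.awaitingAck.delete(id);
    if (error) waiter.reject(error);
    else waiter.resolve({ id });
    return true;
  }
}
```

With this shape, `await session.addContent(content)` resumes only after `handleAck` runs, while the queue is free to start its next beat in the meantime.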

This section is the secret that is not explained yet in XEP-0166. Once you understand why a priority queue is necessary and the difference in how local and remote actions are processed, all of the rest of Jingle becomes obvious.

5) Session and Content states

XEP-0166 defines a state machine diagram and some states for sessions, but they are incomplete and do not cover contents. Additionally, the states that a session goes through differ depending on whether you are the initiator or the responder.

   Initiator                            Responder

+--------------+                     +--------------+
|   STARTING   |                     |   STARTING   |
+--------------+                     +--------------+
        |
     (local)
 session-initiate
        |
        v
+--------------+                     +--------------+
|   UNACKED    |- session-initiate ->|   PENDING    |
+--------------+                     +--------------+
                                             |
                                             |
        +--------------- ack ----------------+
        |                                    |
        v
+--------------+                         (local)
|   PENDING    |                      session-accept
+--------------+
                                             |
                                             |
                                             v
+--------------+                     +--------------+
|    ACTIVE    |<-- session-accept --|    ACTIVE    |
+--------------+                     +--------------+


        +- session-terminate (at any point) -+
        |                                    |
        v                                    v
+--------------+                     +--------------+
|    ENDED     |                     |    ENDED     |
+--------------+                     +--------------+

Notice that the session is in an UNACKED state between signaling out a session-initiate action and receiving an ack. Also, notice how the session has an initial STARTING state before sending or receiving the session-initiate action. Neither state is clearly documented in XEP-0166, but they do exist and need to be tracked.
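
The session diagram translates directly into a transition table. A minimal sketch, assuming hypothetical event labels of the form "local:<action>", "remote:<action>", and "ack"; the state names are the ones from the diagram.

```javascript
// Transition table transcribed from the session diagram.
// The event label format is a hypothetical convention for this sketch.
const SESSION_TRANSITIONS = {
  initiator: {
    STARTING: { "local:session-initiate": "UNACKED" },
    UNACKED: { ack: "PENDING" },
    PENDING: { "remote:session-accept": "ACTIVE" },
  },
  responder: {
    STARTING: { "remote:session-initiate": "PENDING" },
    PENDING: { "local:session-accept": "ACTIVE" },
  },
};

function nextSessionState(role, state, event) {
  // session-terminate, local or remote, ends the session at any point.
  if (event.endsWith("session-terminate") && state !== "ENDED") return "ENDED";

  const byState = SESSION_TRANSITIONS[role] || {};
  const next = (byState[state] || {})[event];
  if (!next) {
    throw new Error(`invalid transition: ${role} ${state} on ${event}`);
  }
  return next;
}

let state = "STARTING";
for (const event of ["local:session-initiate", "ack", "remote:session-accept"]) {
  state = nextSessionState("initiator", state, event);
}
console.log(state); // ACTIVE
```

Throwing on an unknown transition is one possible policy; in an engine, the same lookup would more likely feed the "is this action valid to process" check at the start of a beat.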

The state diagram for contents is very similar. For contents, either side of the session can add a content, so the roles are labeled as "Local" and "Remote" instead of "Initiator" and "Responder":

     Local                                Remote

+--------------+                     +--------------+
|   STARTING   |                     |   STARTING   |
+--------------+                     +--------------+
        |
     (local)
   content-add
        |
        v
+--------------+                     +--------------+
|   UNACKED    |--- content-add ---->|   PENDING    |--+
+--------------+                     +--------------+  |
                                             |         |
                                             |         |
        +--------------- ack ----------------+         |
        |                                    |         |
        v                                              |
+--------------+                         (local)       |
|   PENDING    |                      content-accept   |
+--------------+                                       |
                                             |         |
                                             |         |
                                             v         |
+--------------+                     +--------------+  |
|    ACTIVE    |<-- content-accept --|    ACTIVE    |  |
+--------------+                     +--------------+  |
                                                       |
                                                    (local)
                                                content-reject
                                                       |
                                                       |
        +------------- content-reject -----------------+
        |                                              |
        v                                              |
+--------------+                     +--------------+  |
|    ENDED     |                     |    ENDED     |<-+
+--------------+                     +--------------+
        ^                                    ^
        |                                    |
        +--- content-remove (at any point) --+

There are two additional states for contents that are orthogonal to those listed in the diagram above:

  • UNACKED_SENDERS_CHANGE, entered upon a local content-modify action, exited upon receiving ack.
  • UNACKED_REPLACEMENT_TRANSPORT, entered upon a local transport-replace action, exited upon receiving ack.

6) Tie breaking

Section 7.2.16 of XEP-0166 describes how to do tie-breaking. It does an OK job describing how to tie-break the case where both sides try to start a session at the same time (the prose can be difficult to parse correctly, but it is doable). However, tie-breaking other cases feels almost like an afterthought, because the spec provides no guidance on when it applies.

Understanding tie-breaking was a difficult task for me. In fact, the process of figuring this out is what led to most of the insights in this document. Why was it so hard? Well, XEP-0166 only says that when you tie-break an action (other than session-initiate), the session initiator is the side that wins. Nothing is said about which actions need to be tie-broken (both sides can safely send transport-info actions at the same time, so tie-breaking can't apply to every action). And it is unclear even how to detect that a tie-break situation is occurring: if I'm only processing one action at a time (local or remote), how would I even know that a tie happened?

Remember we said that the goal of Jingle is to synchronize boxes, and we do that using the suite of Jingle actions. Some of those actions materially affect the boxes (by adding/removing/modifying them), and some only carry across information inside the boxes.

The chain of insights that follow from that fact:

  1. Only actions that materially affect boxes need to be tie-broken.
  2. We only need to tie-break when both sides are trying to change the same boxes at the same time.
  3. By keeping track of session and content UNACKED* states, we know when we are in the middle of changing a box locally.
  4. Receiving a Jingle action that would modify the session or a content that is UNACKED* is what triggers a tie-breaking check.

Wonderful! Tie-breaking will work as long as we can ensure that, whenever we modify the session or its contents, the appropriate UNACKED* state is set before the next remote action is processed.

But how do we ensure that things are in that UNACKED* state when we need it? The priority queue! Because local actions are processed first, during a tie situation both sides will have set UNACKED* states locally in time to process the remote actions. Thus, both sides know that a tie occurred because a modifying action was received while an ack was still expected, and that can only happen if the remote side sent its action while our outgoing action was still on the wire.
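
A sketch of the resulting check, run while validating a remote action. It is deliberately narrow: it assumes only the two orthogonal content flags from section 5 (the field names `unackedSendersChange` and `unackedReplacementTransport` are hypothetical spellings of them), pairs each with the remote action that would touch the same box, and leaves out simultaneous session-initiate, which XEP-0166 handles separately.

```javascript
// Assumption: which remote action conflicts with which local UNACKED* flag.
// A full engine would also consult the UNACKED session and content states.
const CONFLICTING_FLAG = new Map([
  ["content-modify", "unackedSendersChange"],
  ["transport-replace", "unackedReplacementTransport"],
]);

function checkTie(session, remoteAction) {
  const flag = CONFLICTING_FLAG.get(remoteAction.name);
  // Informational actions such as transport-info never need a tie-break.
  if (!flag) return "no-tie";

  const content = session.contents.get(remoteAction.content);
  // No ack outstanding for this box, so nothing is being changed locally.
  if (!content || !content[flag]) return "no-tie";

  // Both sides changed the same box at once: the session initiator wins.
  return session.role === "initiator" ? "local-wins" : "remote-wins";
}

const session = {
  role: "initiator",
  contents: new Map([["voice", { unackedSendersChange: true }]]),
};
console.log(checkTie(session, { name: "content-modify", content: "voice" })); // local-wins
console.log(checkTie(session, { name: "transport-info", content: "voice" })); // no-tie
```

What to do with the result is up to the engine; since the ack for a remote action is an IQ result or error, a "local-wins" outcome would be answered with an error rather than a result.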

7) How Applications and Transports interact

After all of the signaling is done by Jingle, you end up with Application and Transport objects, but how do they actually interact?

Quite simply, the Transport is a network socket provider, where streaming (TCP-like) or datagram (UDP-like) sockets can be requested depending on what the transport supports. Any application should be able to request such sockets from any transport, and have things work.
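
To make the shape of that contract concrete, here is a minimal sketch using an in-memory stand-in for a real transport. Everything in it is hypothetical (`LoopbackTransport`, `EchoApplication`, `openSocket`, `onData`, `send`); the only idea taken from the text is that an application asks a transport for a "stream" or "datagram" socket and knows nothing else about it.

```javascript
// Hypothetical in-memory transport: hands out sockets of the kinds it supports.
class LoopbackTransport {
  constructor(kinds) {
    this.kinds = new Set(kinds); // e.g. ["stream"], ["datagram"], or both
  }

  openSocket(kind) {
    if (!this.kinds.has(kind)) {
      throw new Error(`transport does not support ${kind} sockets`);
    }
    const listeners = [];
    return {
      kind,
      onData(listener) {
        listeners.push(listener);
      },
      // Loopback: whatever is sent is delivered straight to the listeners.
      send(bytes) {
        listeners.forEach((listener) => listener(bytes));
      },
    };
  }
}

// Hypothetical application: it only depends on the socket contract.
class EchoApplication {
  constructor() {
    this.received = [];
  }

  start(transport) {
    this.socket = transport.openSocket("stream");
    this.socket.onData((bytes) => this.received.push(bytes));
  }
}

const app = new EchoApplication();
app.start(new LoopbackTransport(["stream"]));
app.socket.send(Uint8Array.of(1, 2, 3));
console.log(app.received.length); // 1
```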

(There is a small wrinkle for JavaScript and the ICE transport because browsers don't give you direct access to the underlying socket, so APIs will need to be slightly more flexible in browsers.)

One of the common complaints is that Jingle transports do not directly provide encryption. This is a fair point based on what has been implemented and deployed in practice. As far as Jingle is concerned, support for encryption already exists in the suite of specs, but has not been implemented in clients.

The Jingle model for encryption is that transports provide pure bytestreams (that is, agnostic as to whether the bytes being transferred are encrypted or not). In the same way that TLS/DTLS can be implemented on top of any TCP/UDP connection, the idea was that the Security Description box would be used to negotiate encryption that would be used on top of the streaming or datagram sockets provided by a transport. Unfortunately, the spec for Jingle XTLS, which would have defined using TLS/DTLS, fizzled out and has not been implemented in clients. It would be possible to use other encryption mechanisms such as OMEMO for this purpose.

8) Handling session-info

Pretty much every Jingle action except for session-info will include <content/> elements. Having those elements is actually very nice -- it lets us route the information in the action to the affected applications and transports in a generic way, without having to understand anything specific to the applications and transports.

Which means that for session-info, our one Session class does not know how to route and apply the information in the action. This is one of the reasons why we've tended to create MediaSessions and FileTransferSessions: to make the session smart enough to handle these events.

While one solution would be to have all Applications listen for session-info events, there is still the case of session-info events that do not apply to a particular application instance, but to that entire aspect of the session as a whole.

The way that I've solved this problem has been to make Application and Transport controllers bolt-on aspects of the Session class. That is, a session can have many such controllers without having to understand their internals. Additionally, these controllers can act as factories for creating applications and transports.

For example, I have an RTPController class which is responsible for processing session-info events dealing with RTP, such as ringing and muting. It also acts as a factory for creating RTPApplication instances. In the same way, I have an ICEController class that acts as a factory for new ICE transports, which means it can also perform some optimizations like bundling.
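
A rough sketch of that arrangement follows. It is not the author's actual code: the method names (`use`, `handlesInfo`, `onSessionInfo`, `createApplication`) and the shape of the `info` object are hypothetical, and a plain object stands in for an RTPApplication instance. The namespace string is the one XEP-0167 uses for its session-info payloads.

```javascript
// The session routes session-info to controllers without understanding it.
class Session {
  constructor() {
    this.controllers = [];
  }

  use(controller) {
    this.controllers.push(controller);
    return this;
  }

  // Returns false if no controller recognized the payload.
  handleSessionInfo(info) {
    const interested = this.controllers.filter((c) => c.handlesInfo(info));
    for (const controller of interested) controller.onSessionInfo(this, info);
    return interested.length > 0;
  }

  // Controllers double as factories for their application type.
  createApplication(type, description) {
    const controller = this.controllers.find((c) => c.applicationType === type);
    if (!controller) throw new Error(`no controller for ${type}`);
    return controller.createApplication(description);
  }
}

class RTPController {
  constructor() {
    this.applicationType = "rtp";
    this.ringing = false;
  }

  handlesInfo(info) {
    return info.namespace === "urn:xmpp:jingle:apps:rtp:info:1";
  }

  onSessionInfo(session, info) {
    if (info.name === "ringing") this.ringing = true;
  }

  createApplication(description) {
    return { type: "rtp", description }; // stand-in for an RTPApplication
  }
}

const rtp = new RTPController();
const session = new Session().use(rtp);
session.handleSessionInfo({ namespace: "urn:xmpp:jingle:apps:rtp:info:1", name: "ringing" });
console.log(rtp.ringing); // true
```

The Session class stays generic: supporting a new application type means registering another controller, not adding another Session subclass.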
