legastero/jingle.md

## jingle.md

      
    See a working demo of Jingle session management
I'm working on updates to XEP-0166 to add implementation guidance, but I'm
going to share the CliffsNotes version here in the meantime since people
are looking at creating new implementations.
I have worked on multiple Jingle implementations (and multiple iterations
of each) for five years now, on various platforms and languages (Python,
Lua, JavaScript, Swift). However, it has only been this past year that I've
"achieved enlightenment" and finally understood what the processing model
underlying Jingle is supposed to be.
The main problem with XEP-0166 is that it only describes the syntax of the
core Jingle protocol. It does not explain why the protocol is shaped the
way it is, and without that understanding it is quite simply impossible
to achieve a correct and proper implementation. You can certainly make
things that work for certain limited use cases, but it won't be a solid and
extensible foundation for new things.
So for right now, forget about what you know of XEP-0166. This is not a
discussion that involves protocol or XML. It really doesn't even involve
XMPP. The underlying signaling transport does not matter. The serialization
of the signaling payloads does not matter. What matters is the data and how
it flows internally. Yes, yes, those other things are important to us, but
they are not relevant when creating the core of a Jingle engine.
1) What does Jingle do?

Ostensibly, Jingle does things like media streaming and file transfers.
But that's wrong.
Yes, you read that right. Jingle does not stream media or transfer files.
What Jingle does is negotiate how to stream media or how to transfer
files. The actual work is left for other things to do.
In other words: Jingle's purpose is to synchronize boxes between two
entities. These boxes, in particular:
+------------------------------------+
|                                    |
|              Session               |
|                                    |
|  +-----------------------------+   |
|  |           Content           |   |
|  |  (creator, name, senders)   |   |
|  |                             |   |
|  | +-------------------------+ |   |
|  | | Application Description | |   |
|  | +-------------------------+ |   |
|  | |  Security Description   | |   |
|  | +-------------------------+ |   |
|  | |  Transport Description  | |   |
|  | +-------------------------+ |   |
|  +-----------------------------+   |
|                                    |
|                                    |
|      ... additional contents       |
|                                    |
+------------------------------------+

Every Jingle action creates, removes, or modifies one of these boxes. Take
a few minutes and look through the Jingle actions and convince yourself of
this fact.

session-initiate
session-accept
session-terminate
content-add
content-remove
content-accept
content-reject
content-modify
description-info
security-info
transport-info
session-info
transport-replace
transport-accept
transport-reject

When implementing Jingle, your goal is to keep that diagram synchronized
between both parties by using those actions.
There are some interesting concepts we can see here already:

- The session itself is a box.
- Content boxes can be added and removed, but also need to be approved.
- Application and security boxes can be modified, but are not replaceable.
- Transport boxes can be modified, but can also be entirely replaced (with
  approval).

Again, note how Jingle does not care what goes inside those application,
security, and transport description boxes. That data is meant to be handled
by the other things doing the actual work, not Jingle itself.
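
To make the diagram concrete, here is a minimal JavaScript sketch of the boxes as plain data. The class names, field names, and the `creator/name` map key are illustrative assumptions rather than anything XEP-0166 prescribes; the only point is that the engine stores the three descriptions without looking inside them.

```javascript
// Hypothetical data model for the boxes; all names here are illustrative.
class Content {
  constructor({ creator, name, senders = "both", application, security = null, transport }) {
    this.creator = creator; // which side created the content
    this.name = name;
    this.senders = senders;
    // Opaque to the Jingle engine: stored and passed along, never inspected.
    this.application = application;
    this.security = security;
    this.transport = transport;
  }
}

class Session {
  constructor(sid) {
    this.sid = sid;
    this.contents = new Map();
  }

  // Assumption: a content is identified by its creator plus its name.
  static key(creator, name) {
    return `${creator}/${name}`;
  }

  addContent(content) {
    this.contents.set(Session.key(content.creator, content.name), content);
  }

  removeContent(creator, name) {
    return this.contents.delete(Session.key(creator, name));
  }
}

const session = new Session("sid-1");
session.addContent(
  new Content({
    creator: "initiator",
    name: "audio",
    application: { kind: "rtp" },
    transport: { kind: "ice" },
  })
);
console.log(session.contents.size);
```

Nothing in `Session` or `Content` branches on what the application or transport is, which is the property the rest of this document relies on.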
2) A Session is a Session is a Session is a Session

There is one mistake that everyone makes when first starting to implement
Jingle. I do mean everyone. It is an easy mistake. It is a tempting
mistake. It looks and feels right. But it is wrong.
I'm talking about this:

MediaSession.code
FileTransferSession.code
WhiteboardSession.code
... etc

Seriously, I mean that. But it looks right, doesn't it? Of course you would
have MediaSessions and FileTransferSessions, maybe even an AudioSession,
right?
No.
A Jingle session is a Jingle session is a Jingle session. Look again at
those boxes above. There is only a Session box. There is nothing about
audio or video or files or whiteboards. The core of a Jingle engine is not
aware of those things; that information is only inside the black boxes
of application descriptions, and is irrelevant to our goal of syncing those
boxes.
The desired outcome here is that while there is no such thing as a
FileTransferSession, there is the concept of a Session which contains
one or more FileTransfer applications. That also means a single session can
fall into multiple categories without requiring a combinatorial explosion
of classes.
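
As a small illustration of that outcome, a higher layer can derive the "category" of a session from its contents instead of from its class. This is only a sketch: the plain-object session shape and the `kind` field on application descriptions are assumptions made for the example.

```javascript
// Hypothetical shape: a session is a plain object with a list of contents,
// and each application description exposes a `kind` string.
function applicationKinds(session) {
  return new Set(session.contents.map((content) => content.application.kind));
}

const session = {
  sid: "sid-1",
  contents: [
    { name: "voice", application: { kind: "rtp" } },
    { name: "slides", application: { kind: "file-transfer" } },
  ],
};

// One session, two categories, and no MediaAndFileTransferSession class.
const kinds = applicationKinds(session);
console.log([...kinds].join(", "));
```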
So then why do we all want to create these various Session classes? I
believe it comes from a breakdown in communication in XEP-0166: it doesn't
clearly explain this goal of box syncing and how to do it. You load up
XEP-0166, start looking at the examples, your
eyes glaze over from the blobs of XML, you throw your hands up saying
this is an overcomplicated piece of garbage, and finally decide to try
to implement just enough to satisfy your use case of media streaming,
file transfers, etc. Thus, you end up collecting various types of session
classes that are baked into solving one particular use case, and in a
limited form of it, at that.
Or, it starts from knowing what UI you want to build, and thinking that
since you want different UIs for media streaming, file transfers, etc, it
would be best to have different Session classes for them. After all, they
are different things, right?
The last point there is an interesting one. There are spots in XEP-0166
that advise that it is best not to mix application types inside of a
single session beyond necessary. That is, the contents in a session should
logically go together. And that is certainly a good principle, but deciding
what "logically goes together" is only doable at the higher layers of your
system, not inside the core of a Jingle engine. Creating multiple Session
classes to enforce these separations is a decision that limits what you
can do with your implementation.
3) Each action counts double

That list of Jingle actions up above, how many actions are there?
Fifteen, you say?
Nope.
There are thirty actions -- each name in that list counts double because
there is both a local and remote version. I bring this up for two reasons:

1. Your implementation is going to be about twice as large as what you were
   expecting, because there are thirty and not fifteen actions to handle.
2. Local actions need to be managed by the session, just like remote
   actions. I've seen (written, even) implementations where local actions
   are not coordinated through the session itself, but rather blast out the
   associated stanza directly, which unfortunately leads to state management
   problems.
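
One way to keep both points in view is to give every action an explicit origin, so that local and remote actions go through the same session machinery. The sketch below is an assumption about how that could look; `makeAction` and the `origin:name` handler keys are hypothetical names.

```javascript
const ACTION_NAMES = [
  "session-initiate", "session-accept", "session-terminate",
  "content-add", "content-remove", "content-accept", "content-reject",
  "content-modify", "description-info", "security-info", "transport-info",
  "session-info", "transport-replace", "transport-accept", "transport-reject",
];

// Hypothetical helper: every action the session handles carries its origin.
function makeAction(origin, name, payload = {}) {
  if (origin !== "local" && origin !== "remote") {
    throw new TypeError(`unknown origin: ${origin}`);
  }
  if (!ACTION_NAMES.includes(name)) {
    throw new TypeError(`unknown Jingle action: ${name}`);
  }
  return { origin, name, payload };
}

// Each name counts double: one handler per (origin, name) pair.
const handlerKeys = ACTION_NAMES.flatMap((name) => [`local:${name}`, `remote:${name}`]);
console.log(handlerKeys.length); // 30
```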

4) The priority queue

The heart of every Jingle session is a priority queue -- every beat
processes a single local or remote action. Again, that is a priority
queue. While all local and remote actions are added to this queue, it is
always the local actions that run first.

Here is what happens during each beat:
Local Action:

1. Have the session verify that the action is valid to process.
2. Have affected contents verify the action is valid to process.
3. Have affected contents execute the action (NB. this needs to be treated
   as an async operation so that the API is always consistent).
4. Once all contents have finished executing, update session state as
   necessary based on the action.
5. Signal the results to the other side.
6. Done, start next beat.

Remote Action:

1. Have the session verify that the action is valid to process.
2. Have affected contents verify the action is valid to process.
3. Signal an ack (IQ result or error).
4. Have affected contents execute the action (NB. this needs to be treated
   as an async operation so that the API is always consistent).
5. Once all contents have finished executing, update session state as
   necessary based on the action.
6. Done, start next beat.

Both sides look pretty similar, but this is still a very high level
overview. Note the subtle difference: local actions signal the results
of the action at the end of the beat (because there is now a difference
in the boxes that needs to be synced) whereas remote actions signal an
acknowledgement at the start of the beat, before executing.
Sending the ack before executing is important: the ack does not indicate
whether the action was successfully applied; it indicates that the
action was accepted for execution (as opposed to being a bad request or
needing tie-breaking). Execution failures will generally trigger new local
actions (e.g., content-reject or transport-reject) to "undo" the action
that failed.
In both cases, the actual work is treated as an async operation. While not
every application or transport will actually require async support, to keep
the API and developer expectations consistent, async should always be used.
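
The beat described above can be sketched roughly as follows. This is not taken from any particular implementation: `ActionQueue` and the session and content methods it calls (`validate`, `contentsFor`, `execute`, `applyState`, `ack`, `send`) are hypothetical names, and error handling for failed execution is left out.

```javascript
// Hypothetical sketch: two FIFO lists, where local actions always run first.
class ActionQueue {
  constructor(session) {
    this.session = session;
    this.local = [];
    this.remote = [];
  }

  enqueue(action) {
    (action.origin === "local" ? this.local : this.remote).push(action);
  }

  next() {
    return this.local.length > 0 ? this.local.shift() : this.remote.shift();
  }

  // One beat per action; the next beat starts only after this one finishes.
  async drain() {
    let action;
    while ((action = this.next()) !== undefined) {
      await this.beat(action);
    }
  }

  async beat(action) {
    const { session } = this;
    const contents = session.contentsFor(action);
    const valid =
      session.validate(action) && contents.every((content) => content.validate(action));

    // Remote actions are acked up front: the ack means "accepted for
    // execution", not "applied successfully".
    if (action.origin === "remote") session.ack(action, valid);
    if (!valid) return;

    // Always awaited, even if a given application or transport is synchronous.
    await Promise.all(contents.map((content) => content.execute(action)));
    session.applyState(action);

    // Local actions are signaled at the end, once the boxes have changed.
    if (action.origin === "local") session.send(action);
  }
}

// Demo with stub session and content objects that only record what happened.
const log = [];
const demoContent = {
  validate: () => true,
  execute: async (action) => { log.push(`execute ${action.origin} ${action.name}`); },
};
const demoSession = {
  validate: () => true,
  contentsFor: () => [demoContent],
  ack: (action, ok) => log.push(`ack ${action.name} ${ok ? "result" : "error"}`),
  applyState: (action) => log.push(`state ${action.name}`),
  send: (action) => log.push(`send ${action.name}`),
};

const queue = new ActionQueue(demoSession);
queue.enqueue({ origin: "remote", name: "transport-info" });
queue.enqueue({ origin: "local", name: "content-add" });
const drained = queue.drain().then(() => console.log(log.join("\n")));
```

Even though the remote action was enqueued first, the local `content-add` is processed first, and the remote `transport-info` is acked before it is executed.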
There is one more item to consider here: the interaction between your API
and the priority queue. Consider this example:
let ack = await session.addContent(content);

A local content-add action is created and then processed by the queue.
But addContent() does not return when the queue has finished processing
that local content-add action. Instead, it continues waiting for an
acknowledgement to arrive from the peer. In other words, the queue does not
wait for the remote side to acknowledge signaled actions; it is your API
that should wait.
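
A rough sketch of that split, assuming a hypothetical `AckTracker` helper: `addContent()` hands the action to the queue and returns a promise that settles only when the peer's ack (or error) arrives. The `enqueue` callback stands in for the real priority queue.

```javascript
// Hypothetical helper that remembers which signaled actions still await an ack.
class AckTracker {
  constructor() {
    this.pending = new Map();
    this.nextId = 0;
  }

  expect() {
    const id = String(this.nextId++);
    const promise = new Promise((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
    });
    return { id, promise };
  }

  // Called when the peer's IQ result (error === null) or IQ error arrives.
  settle(id, error = null) {
    const entry = this.pending.get(id);
    if (entry === undefined) return false;
    this.pending.delete(id);
    if (error) entry.reject(error);
    else entry.resolve({ id });
    return true;
  }
}

class Session {
  constructor(enqueue) {
    this.enqueue = enqueue; // stand-in for the priority queue
    this.acks = new AckTracker();
  }

  addContent(content) {
    const { id, promise } = this.acks.expect();
    this.enqueue({ origin: "local", name: "content-add", id, content });
    return promise; // settles on the peer's ack, not when the beat finishes
  }
}

const queued = [];
const session = new Session((action) => queued.push(action));
let acked = false;
const pendingAck = session.addContent({ name: "audio" }).then((ack) => {
  acked = true;
  return ack;
});

// Later, when the peer's IQ result for that action comes in:
session.acks.settle(queued[0].id);
```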
This section is the secret that is not explained yet in XEP-0166. Once
you understand why a priority queue is necessary and the difference in how
local and remote actions are processed, all of the rest of Jingle becomes
obvious.
5) Session and Content states

There is a state machine diagram and some states defined for Sessions by
XEP-0166, but it is incomplete and does not cover contents. Additionally,
the states that a session goes through differ depending on whether you are
the initiator or the responder.
   Initiator                            Responder

+--------------+                     +--------------+
|   STARTING   |                     |   STARTING   |
+--------------+                     +--------------+
        |
     (local)
 session-initiate
        |
        v
+--------------+                     +--------------+
|   UNACKED    |- session-initiate ->|   PENDING    |
+--------------+                     +--------------+
                                             |
                                             |
        +--------------- ack ----------------+
        |                                    |
        v
+--------------+                         (local)
|   PENDING    |                      session-accept
+--------------+
                                             |
                                             |
                                             v
+--------------+                     +--------------+
|    ACTIVE    |<-- session-accept --|    ACTIVE    |
+--------------+                     +--------------+


        +- session-terminate (at any point) -+
        |                                    |
        v                                    v
+--------------+                     +--------------+
|    ENDED     |                     |    ENDED     |
+--------------+                     +--------------+

Notice that the session is in an UNACKED state between signaling out a
session-initiate action and receiving an ack. Also, notice how the
session has an initial STARTING state before sending or receiving the
session-initiate action. Neither state is clearly documented in XEP-0166,
but they do exist and need to be tracked.
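
The session diagram can be written down as a transition table. The sketch below follows the diagram as drawn; the table layout, the `local:`/`remote:` event prefixes, and the function name are assumptions for illustration.

```javascript
// Transitions per role, following the session state diagram.
const SESSION_TRANSITIONS = {
  initiator: {
    STARTING: { "local:session-initiate": "UNACKED" },
    UNACKED: { ack: "PENDING" },
    PENDING: { "remote:session-accept": "ACTIVE" },
  },
  responder: {
    STARTING: { "remote:session-initiate": "PENDING" },
    PENDING: { "local:session-accept": "ACTIVE" },
  },
};

function nextSessionState(role, state, event) {
  if (state === "ENDED") throw new Error("session has already ended");
  // session-terminate is allowed at any point, from either side.
  if (event === "local:session-terminate" || event === "remote:session-terminate") {
    return "ENDED";
  }
  const next = SESSION_TRANSITIONS[role]?.[state]?.[event];
  if (next === undefined) {
    throw new Error(`${event} is not valid for the ${role} in state ${state}`);
  }
  return next;
}

let state = "STARTING";
for (const event of ["local:session-initiate", "ack", "remote:session-accept"]) {
  state = nextSessionState("initiator", state, event);
}
console.log(state); // ACTIVE
```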
The state diagram for contents is very similar. For contents, either side
of the session can add a content, so the roles are labeled as "Local" and
"Remote" instead of "Initiator" and "Responder":
     Local                                Remote

+--------------+                     +--------------+
|   STARTING   |                     |   STARTING   |
+--------------+                     +--------------+
        |
     (local)
   content-add
        |
        v
+--------------+                     +--------------+
|   UNACKED    |--- content-add ---->|   PENDING    |--+
+--------------+                     +--------------+  |
                                             |         |
                                             |         |
        +--------------- ack ----------------+         |
        |                                    |         |
        v                                              |
+--------------+                         (local)       |
|   PENDING    |                      content-accept   |
+--------------+                                       |
                                             |         |
                                             |         |
                                             v         |
+--------------+                     +--------------+  |
|    ACTIVE    |<-- content-accept --|    ACTIVE    |  |
+--------------+                     +--------------+  |
                                                       |
                                                    (local)
                                                content-reject
                                                       |
                                                       |
        +------------- content-reject -----------------+
        |                                              |
        v                                              |
+--------------+                     +--------------+  |
|    ENDED     |                     |    ENDED     |<-+
+--------------+                     +--------------+
        ^                                    ^
        |                                    |
        +--- content-remove (at any point) --+

There are two additional states for contents that are orthogonal to those
listed in the diagram above:

- UNACKED_SENDERS_CHANGE, entered upon a local content-modify action,
  exited upon receiving ack.
- UNACKED_REPLACEMENT_TRANSPORT, entered upon a local transport-replace
  action, exited upon receiving ack.
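
Because these two states are orthogonal to the main lifecycle, one simple way to model them is as independent flags next to the lifecycle state. This is a sketch under that assumption; the class and property names are hypothetical.

```javascript
// Hypothetical sketch: lifecycle state plus two independent "unacked" flags.
class ContentState {
  constructor(state = "STARTING") {
    this.state = state;
    this.unackedSendersChange = false;        // UNACKED_SENDERS_CHANGE
    this.unackedReplacementTransport = false; // UNACKED_REPLACEMENT_TRANSPORT
  }

  localAction(name) {
    if (name === "content-add") this.state = "UNACKED";
    if (name === "content-modify") this.unackedSendersChange = true;
    if (name === "transport-replace") this.unackedReplacementTransport = true;
  }

  ack(name) {
    if (name === "content-add" && this.state === "UNACKED") this.state = "PENDING";
    if (name === "content-modify") this.unackedSendersChange = false;
    if (name === "transport-replace") this.unackedReplacementTransport = false;
  }

  // True whenever some local change to this content has not been acked yet.
  get awaitingAck() {
    return (
      this.state === "UNACKED" ||
      this.unackedSendersChange ||
      this.unackedReplacementTransport
    );
  }
}

const content = new ContentState("ACTIVE");
content.localAction("content-modify");
console.log(content.state, content.awaitingAck); // ACTIVE true
content.ack("content-modify");
console.log(content.state, content.awaitingAck); // ACTIVE false
```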

6) Tie breaking

Section 7.2.16 of XEP-0166 describes how to do tie-breaking. It does an OK
job describing how to tie-break the case where both sides try to start a
session at the same time (the prose can be difficult to parse correctly,
but it is doable). However, tie-breaking other cases feels almost like an
afterthought, because the spec provides no guidance on when it applies.
Understanding tie-breaking was a difficult task for me. In fact, the
process of figuring this out is what led to most of the insights in
this document. Why was it so hard? Well, XEP-0166 only says that when
you tie-break an action (other than session-initiate), the session
initiator is the side that wins. Nothing is said about which actions need
to be tie broken (both sides can safely send transport-info actions at
the same time, so tie-breaking therefore can't apply to every action). And
it is unclear even how to detect that a tie-break situation is occurring:
if I'm only processing one action at a time (local or remote), how would I
even know that a tie happened?
Remember we said that the goal of Jingle is to synchronize boxes, and we do that
using the suite of Jingle actions. Some of those actions materially affect the
boxes (by adding/removing/modifying them), and some only carry across information
inside the boxes.
The chain of insights that follow from that fact:

1. Only actions that materially affect boxes need to be tie-broken.
2. We only need to tie-break when both sides are trying to change the same
   boxes at the same time.
3. By keeping track of session and content UNACKED* states, we know when
   we are in the middle of changing a box locally.
4. Receiving a Jingle action that would modify the session or a content
   that is UNACKED* is what triggers a tie-breaking check.

Wonderful! As long as we can ensure that, whenever we modify them, the
session and contents have the appropriate UNACKED* states before a remote
action is processed then tie-breaking will work.
But how do we ensure that things are in that UNACKED* state when we need
it? The priority queue! Because local actions are processed first,
during a tie situation both sides will have set UNACKED* states locally
in time to process the remote actions. Thus, both sides know that a tie
occurred because a modifying action was received while an ack was still
expected, and that can only happen if the remote side sent its action while
our outgoing action was still on the wire.
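
Put together, the check that runs when a remote action arrives can be very small. The sketch below covers only actions other than session-initiate, and it takes "does this action materially affect a box" as an input rather than deciding it, since that classification is up to the implementation. The function name and the returned labels are hypothetical.

```javascript
// Hypothetical tie-break check for a remote action (not session-initiate).
function checkTieBreak({ weAreInitiator, modifiesBox, targetAwaitingAck }) {
  // Info-only actions (e.g. transport-info) never need tie-breaking.
  if (!modifiesBox) return "process";
  // No local change in flight for the affected session or content: no tie.
  if (!targetAwaitingAck) return "process";
  // A tie: the session initiator wins, so the initiator refuses the remote
  // action, while the responder gives way and processes it.
  return weAreInitiator ? "reject-as-tie-break" : "yield-and-process";
}

console.log(
  checkTieBreak({ weAreInitiator: true, modifiesBox: true, targetAwaitingAck: true })
);
```

The `targetAwaitingAck` input is only trustworthy because local actions run first in the queue, which is what guarantees the UNACKED* state is already set when the remote action is examined.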
7) How Applications and Transports interact

After all of the signaling is done by Jingle, you end up with Application
and Transport objects, but how do they actually interact?
Quite simply, the Transport is a network socket provider, where streaming
(TCP-like) or datagram (UDP-like) sockets can be requested depending on
what the transport supports. Any application should be able to request such
sockets from any transport, and have things work.
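
As a sketch of that contract, an application only needs a way to ask a transport for a socket of a given kind. Everything here is hypothetical: `InMemoryTransport` is a stand-in that records what was sent, and `requestSocket` is an assumed method name, not an API from any Jingle library.

```javascript
// Hypothetical transport that hands out sockets of the kinds it supports.
class InMemoryTransport {
  constructor(kinds) {
    this.kinds = new Set(kinds); // e.g. "stream" (TCP-like), "datagram" (UDP-like)
    this.sent = [];
  }

  async requestSocket(kind) {
    if (!this.kinds.has(kind)) {
      throw new Error(`this transport cannot provide ${kind} sockets`);
    }
    return {
      kind,
      send: (bytes) => {
        this.sent.push(bytes);
      },
    };
  }
}

// The application knows nothing about the transport beyond requestSocket().
class FileTransferApplication {
  async start(transport, bytes) {
    const socket = await transport.requestSocket("stream");
    socket.send(bytes);
  }
}

const transport = new InMemoryTransport(["stream"]);
const finished = new FileTransferApplication()
  .start(transport, new Uint8Array([1, 2, 3]))
  .then(() => console.log(transport.sent.length));
```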
(There is a small wrinkle for JavaScript and the ICE transport because
browsers don't give you direct access to the underlying socket, so APIs
will need to be slightly more flexible in browsers.)
One of the common complaints is that Jingle transports do not directly
provide encryption. This is a fair point based on what has been implemented
and deployed in practice. As far as Jingle is concerned, support for
encryption already exists in the suite of specs, but has not been
implemented in clients.
The Jingle model for encryption is that transports provide pure bytestreams
(that is, agnostic as to whether the bytes being transferred are encrypted
or not). In the same way that TLS/DTLS can be implemented on top of any
TCP/UDP connection, the idea was that the Security Description box would
be used to negotiate encryption that would be used on top of the streaming
or datagram sockets provided by a transport. Unfortunately, the spec for
Jingle XTLS, which would have defined using TLS/DTLS, fizzled out and has not
been implemented in clients. It would be possible to use other encryption
mechanisms such as OMEMO for this purpose.
8) Handling session-info

Pretty much every Jingle action except for session-info will include
<content/> elements. Having those elements is actually very nice -- it
lets us route the information in the action to the affected applications
and transports in a generic way, without having to understand anything
specific to the applications and transports.
That means that for session-info, our one Session class does not know
how to route and apply the information in the action. This is one of the
reasons why we've tended to create MediaSessions and FileTransferSessions:
to make the session smart enough to handle these events.
While one solution would be to have all Applications listen for
session-info events, there is still the case of session-info events
that do not apply to a particular application instance, but to that entire
aspect of the session as a whole.
The way that I've solved this problem has been to add Application and
Transport controllers as bolt-on aspects of the Session class. That is,
a session can have many such controllers without having to understand
their internals. Additionally, these controllers can act as factories for
creating applications and transports.
For example, I have an RTPController class which is responsible for
processing session-info events dealing with RTP, such as ringing and
muting. It also acts as a factory for creating RTPApplication instances.
In the same way, I have an ICEController class that acts as a factory for
new ICE transports, which means it can also perform some optimizations like
bundling.
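
A minimal sketch of the controller idea, assuming a very small interface: each controller says whether it handles a given session-info payload, applies it, and can create application instances. The method names (`use`, `handles`, `onSessionInfo`, `createApplication`) and the `{ type }` payload shape are illustrative assumptions, not the API of my implementations.

```javascript
// Hypothetical controller for RTP-related session-info events.
class RTPController {
  constructor() {
    this.ringing = false;
    this.muted = false;
  }

  handles(info) {
    return ["ringing", "mute", "unmute"].includes(info.type);
  }

  onSessionInfo(info) {
    if (info.type === "ringing") this.ringing = true;
    if (info.type === "mute") this.muted = true;
    if (info.type === "unmute") this.muted = false;
  }

  // Controllers double as factories for their application type.
  createApplication(description) {
    return { kind: "rtp", description };
  }
}

// The one Session class routes session-info without understanding it.
class Session {
  constructor() {
    this.controllers = [];
  }

  use(controller) {
    this.controllers.push(controller);
    return this;
  }

  handleSessionInfo(info) {
    const controller = this.controllers.find((candidate) => candidate.handles(info));
    if (controller === undefined) return false; // nobody understands this payload
    controller.onSessionInfo(info);
    return true;
  }
}

const rtp = new RTPController();
const session = new Session().use(rtp);
console.log(session.handleSessionInfo({ type: "mute" }), rtp.muted); // true true
```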