Audio and video calls in XMPP are encrypted end-to-end with DTLS-SRTP as per XEP-0320: Use of DTLS-SRTP in Jingle Sessions.
This protocol replaces the plain-text fingerprint of XEP-0320 with one that is encrypted with, and thereby verified by, OMEMO.
Disclaimer: The proper solution is to use OMEMO version 0.5+ and Stanza Content Encryption (SCE) and encrypt the entire Jingle handshake. However, we are still a long way from having OMEMO 0.5+ in general, and from having any implementation experience with SCE for IQ-based protocols in particular. The protocol proposed here is a hack that is hopefully not too dirty.
When accepting a call, the callee includes its OMEMO device ID in the proceed message.
<message from='sam@example.com/phone' to='max@example.com/phone'>
  <proceed xmlns='urn:xmpp:jingle-message:0' id='a73sjjvkla37jfea'>
    <device xmlns='http://gultsch.de/xmpp/drafts/omemo/dlts-srtp-verification' id='1234' />
  </proceed>
</message>
Each occurrence of <fingerprint xmlns='urn:xmpp:jingle:apps:dtls:0'…>…</fingerprint> is replaced with <fingerprint xmlns='http://gultsch.de/xmpp/drafts/omemo/dlts-srtp-verification' …></fingerprint>.
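For comparison, a plain XEP-0320 fingerprint of the kind being replaced might look like the following sketch. The hash value is an illustrative placeholder, not the fingerprint of a real certificate.

```xml
<!-- Plain-text DTLS fingerprint as carried in a regular XEP-0320 transport.
     The hex value below is a placeholder for a SHA-256 certificate hash. -->
<fingerprint xmlns='urn:xmpp:jingle:apps:dtls:0' hash='sha-256' setup='actpass'>
  02:1A:CC:54:27:AB:EB:9C:53:3F:3E:4B:65:2E:7D:46:3F:54:42:CD:54:F1:7A:03:A2:7D:F9:B0:7F:46:19:B2
</fingerprint>
```

Because this value travels unencrypted and unsigned inside the Jingle stanza, anyone able to modify the stanza in transit could swap it, which is the gap the OMEMO-wrapped variant aims to close.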
Instead of the fingerprint in plain text, the <fingerprint xmlns='http://gultsch.de/xmpp/drafts/omemo/dlts-srtp-verification'> element contains an OMEMO version 0.3 payload message. The attributes on the <fingerprint> element are the same as in XEP-0320.
A full example would look like this.
<fingerprint xmlns='http://gultsch.de/xmpp/drafts/omemo/dlts-srtp-verification' hash='sha-256' setup='actpass'>
  <encrypted xmlns='eu.siacs.conversations.axolotl'>
    <header sid='5678'>
      <key rid='1234'>BASE64ENCODED...</key>
      <iv>BASE64ENCODED...</iv>
    </header>
    <payload>BASE64ENCODED</payload>
  </encrypted>
</fingerprint>
The OMEMO payload message is encrypted for one device only. For session-initiate (or, more specifically, for Jingle information going from the caller to the callee), it is encrypted to the device ID the callee included in the <proceed/> message. For session-accept (and other information going from callee to caller), it is encrypted to the device ID that was the source of the first encrypted message in the session-initiate.
Both parties should ensure that they only encrypt to, and decrypt from, one device. In other words, they use the same OMEMO session to encrypt and decrypt all OMEMO payload messages in the entire call negotiation.
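To illustrate the reverse direction, the following sketch shows what the fingerprint in a session-accept might look like, reusing the device IDs from the examples above. The callee's device 1234 is now the sender (sid) and the caller's device 5678 is the only recipient (rid). The setup='active' value is an assumption based on how XEP-0320 answers commonly look, and the Base64 contents are placeholders.

```xml
<!-- Callee to caller: sender and recipient device IDs are swapped
     relative to the session-initiate example. setup='active' is assumed. -->
<fingerprint xmlns='http://gultsch.de/xmpp/drafts/omemo/dlts-srtp-verification' hash='sha-256' setup='active'>
  <encrypted xmlns='eu.siacs.conversations.axolotl'>
    <header sid='1234'>
      <!-- Exactly one key element: the payload is encrypted for one device only. -->
      <key rid='5678'>BASE64ENCODED...</key>
      <iv>BASE64ENCODED...</iv>
    </header>
    <payload>BASE64ENCODED</payload>
  </encrypted>
</fingerprint>
```

A caller receiving this would presumably reject it if the sid differs from the device ID announced in the proceed message, since that would mean a different OMEMO session is in use.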
Both caller and callee can opt out of OMEMO verification. The callee can simply omit the <device id='…'/> from the proceed message, and the caller can ignore the <device id='…'/> and fall back to regular XEP-0320 DTLS-SRTP.
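As a sketch of the callee-side opt-out, a proceed message without the device element might look like this. It reuses the addresses and session ID from the proceed example earlier in this document.

```xml
<!-- No <device/> child: the callee does not offer OMEMO verification,
     so the call is negotiated with regular XEP-0320 fingerprints. -->
<message from='sam@example.com/phone' to='max@example.com/phone'>
  <proceed xmlns='urn:xmpp:jingle-message:0' id='a73sjjvkla37jfea'/>
</message>
```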