igracia/mixing_recordings.md

## mixing_recordings.md

      
    Raw
  

              mixing_recordings.md
            
          
    Working with Twilio Room Recordings

The following guide will show you how to mix several audio and video tracks together, forming a grid. For this example,
we will use two video and two audio tracks. The video tracks will be placed side by side in a 1024x768 output file.

UPDATE - Video Recording Compositions API is out!

Yes! No ned to go through this process alone anymore. We've recently released the Twilio Recording Composition API.
This API will allow you to compose and transcode you Room Recordings. You can find the reference docs here
When mixing the tracks, we need to consider that they might (and probably have) started at different times.
If we were to merge tracks without taking this into account, we would end up with synchronization issues.
In our example, since Bob got in the room a good 20s (and that’s really a huge time for synchronization of audios),
mixing both Alice’s and Bob’s audio tracks together would end up having one speaking over the other.
To make merging easier, the start time of all tracks from the same room is the creation of the room itself.
Let’s get the start times for all the tracks from this room


Get Alice's audio start_time
$ ffprobe -show_entries format=start_time alice.mka
Input #0, matroska,webm, from 'alice.mka':
  Metadata:
    encoder         : GStreamer matroskamux version 1.8.1.1
    creation_time   : 2017-06-30T09:03:44.000000Z
  Duration: 00:13:09.36, start: 1.564000, bitrate: 48 kb/s
    Stream #0:0(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
    Metadata:
      title           : Audio
start_time=1.564000


Get Alice's video start_time
$ ffprobe -show_entries format=start_time alice.mkv
Input #0, matroska,webm, from 'alice.mkv':
  Metadata:
    encoder         : GStreamer matroskamux version 1.8.1.1
    creation_time   : 2017-06-30T09:03:44.000000Z
  Duration: 00:13:09.33, start: 1.584000, bitrate: 857 kb/s
    Stream #0:0(eng): Video: vp8, yuv420p(progressive), 640x480, SAR 1:1 DAR 4:3, 1k tbr, 1k tbn, 1k tbc (default)
    Metadata:
      title           : Video
start_time=1.584000


Get Bob's audio start_time
$ ffprobe -show_entries format=start_time bob.mka
Input #0, matroska,webm, from 'bob.mka':
  Metadata:
    encoder         : GStreamer matroskamux version 1.8.1.1
    creation_time   : 2017-06-30T09:04:03.000000Z
  Duration: 00:12:49.46, start: 20.789000, bitrate: 50 kb/s
    Stream #0:0(eng): Audio: opus, 48000 Hz, stereo, fltp (default)
    Metadata:
      title           : Audio
start_time=20.789000


Get Bob's video start_time
$ ffprobe -show_entries format=start_time bob.mkv
ffprobe version 3.3.2 Copyright (c) 2007-2017 the FFmpeg developers
  built with Apple LLVM version 8.0.0 (clang-800.0.42.1)
Input #0, matroska,webm, from 'bob.mkv':
  Metadata:
    encoder         : GStreamer matroskamux version 1.8.1.1
    creation_time   : 2017-06-30T09:04:03.000000Z
  Duration: 00:12:49.42, start: 20.814000, bitrate: 1645 kb/s
    Stream #0:0(eng): Video: vp8, yuv420p(progressive), 640x480, SAR 1:1 DAR 4:3, 1k tbr, 1k tbn, 1k tbc (default)
    Metadata:
      title           : Video
start_time=20.814000


Track
start_time (in ms)
creation_time


alice.mka
1564
2017-06-30T09:03:44.000000Z


alice.mkv
1584
2017-06-30T09:03:44.000000Z


bob.mka
20789
2017-06-30T09:04:03.000000Z


bob.mkv
20814
2017-06-30T09:04:03.000000Z


We can see that the start_time from different media types (audio and video) is not the same for the same participant,
as media arrives with a slight offset after the webrtc negotiation. File creation_time offsets translate into an equivalent
start-time offset. The ~20s that Bob had Alice waiting can be obtained directly from the start_time.
If the start_time of a track would have been relative to the creation_time, we would have had to first get the offset in the creation_time and then calculated the final offset. Since creation_time does not have ms precision, this could lead to synchronization issues.
When merging the different tracks, we’ll need to use as time reference the one that has the lowest start_time.
This is important to keep all tracks in sync. What we will do is

Take the lowest start_stime value of all tracks. In our case that’s alice.mka with a start_time of 1564ms.
Use that track as reference, by not indicating any offset when mixing tracks. We will reflect that in our track list by
offsetting it 0.
Calculate the offset for tracks, by subtracting the reference start_time value from all the others.
The following table shows the current values for our tracks, in which alice.mka is the reference value with 1564ms


Track
Track#
Current value
Reference value
Offset in ms


alice.mka
0
1564
1564
0


alice.mkv
1
1584
1564
20


bob.mka
2
20789
1564
19225


bob.mkv
3
20814
1564
19250


Anatomy of an ffmpeg command

Before we start mixing the files, we’re going to have a quick overview of the ffmpeg program command line
arguments that we are going to use. All commands will have the following structure
ffmpeg [[infile options] -i infile] {filter options} {[outfile options] outfile}

ffmpeg: program name
[[infile options] -i infile]: This tells the program what input files to use. You’ll need to add -i for each file
that you want to add to the conversion process. [infile options] will only be used for video tracks, to indicate the offset with the -itsoffset flag. The position of a file in the inputs list is important, as we’ll later use this position to reference the track. ffmpeg treats this input list as a zero-based array. References to input tracks have the form [#input], where #input
is the position the track has in the input array. For instance, in the list ffmpeg -i alice.mkv -i bob.mkv -i alice.mka -i bob.mka we would use these references

alice.mkv → [0] the first input
bob.mkv → [1] the second input
alice.mka → [2] the third input
bob.mka → [3] the fourth input


{filter options}: this is where we define what to do with the input tracks, whether it is mixing audio, video or both. A media stream (audio or video) can
be passed through a set of steps, each step modifying the stream returned by the previous step. For instance, in the case of the video inputs, we're
going to scale->pad->generate black frames for synchronization->concatenate. The output of this "pipeline" will is named as [r#c#], with r and c standing for
row and column. Since we only have two videos, we could just do with [r#], but then it will be easier for you to extrapolate to other scenarios.
{[outfile options] outfile}: defined where to store the output of the program. This is where you specify webm vs mp4, for instance.

Mixing tracks

We're going to mix all tracks in a single step. We'll explain the resulting command as clearly as possible, but be advised that
this is not for the faint of heart. The command that we are going to use

Keeps video and audio tracks in synchronization
Lets you change the output video resolution
Pads the video tracks to keep aspect ratio of the original videos

The complete command to obtain the mixed file in webm, with a 1024x768 resolution is
ffmpeg -i alice.mkv -i bob.mkv -acodec libopus -i alice.mka -acodec libopus -i bob.mka -y \
       -filter_complex "\
        [0]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs0],color=black:size=512x768:duration=0.020[b0],[b0][vs0]concat[r0c0];\
        [1]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs1],color=black:size=512x768:duration=19.25[b1],[b1][vs1]concat[r0c1];\
        [r0c0][r0c1]hstack=inputs=2[video];\
        [2]aresample=async=1[a0];\
        [3]aresample=async=1,adelay=19225.0|19225.0[a1];\
        [a0][a1]amix=inputs=2[audio]" \
       -map [video] \
       -map [audio] \
       -acodec libopus \
       -vcodec libvpx \
        output.webm

Let's dissect this command

-i alice.mkv -i bob.mkv -i alice.mka -i bob.mka: These are the input files.
-filter_complex: we’ll be performing a filter operation on all tracks passed as input.

[0]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs0],color=black:size=512x768:duration=0.020[b0],[b0][vs0]concat[r0c0]: Here we Take
Alice's video and scale it to half the width of the desired resolution (512) while maintaining the original aspect ratio. We pad the scaled video and
tag it [vs0. Then we generate color=black frames for the duration offset in seconds calculated for this track, which will delay the track
so that it's in sync. Finally, we concat the black stream [b0] with the padded stream [vs0], and tag it as [r0c0].
we concat the black frames with the padded
[1]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs1],color=black:size=512x768:duration=19.25[b1],[b1][vs1]concat[r0c1]: This part is the same as the
previous one, but the offset used for the duration corresponds to this track.
[r0c0][r0c1]hstack=inputs=2[video]
[#]aresample=async=1,adelay={delay}|{delay}[a#];: For each track with an offset value > 0, we need to indicate the audio delay
in milliseconds for that track

[#]: As explained in +Working with room recordings: Anatomy-of-an-ffmpeg-command, each track is referenced
by its position in the inputs array. Since only bob.mka is delayed, there’s only one block.
aresample=async=1: resamples the audio track, filling and trimming if needed. See more info in the resampler docs.
adelay={delay}|{delay}: will be delaying the audio for both left and right channels an amount of {delay} seconds.
[a#]: this is a label that we’ll use to reference this filtered track


[#][a1]..[an]amix=inputs={#of-inputs}: Once we have added the appropriate delays for all audio tracks, we
configure the filter that’ll perform the actual audio mixing, where n is the position of the n-th track.
In our case, there are only two tracks, that’s why we only use [2][a1].


Output definition

-map [video]: Select the stream marked a video to be used in the output
-map [audio]: Select the stream marked a audio to be used in the output
-acodec libopus: The audio codec to use. For webm we'll use OPUS
-vcodec libvpx: The video codec to use. For webm we'll use VP8
output.webm: The output file name


And for an mp4 file
ffmpeg -i alice.mkv -i bob.mkv -acodec libopus -i alice.mka -acodec libopus -i bob.mka -y \
       -filter_complex "\
        [0]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs0],color=black:size=512x768:duration=0.020[b0],[b0][vs0]concat[r0c0];\
        [1]scale=512:-2,pad=512:768:(ow-iw)/2:(oh-ih)/2[vs1],color=black:size=512x768:duration=19.25[b1],[b1][vs1]concat[r0c1];\
        [r0c0][r0c1]hstack=inputs=2[video];\
        [2]aresample=async=1[a0];\
        [3]aresample=async=1,adelay=19225.0|19225.0[a1];\
        [a0][a1]amix=inputs=2[audio]" \
       -map [video] \
       -map [audio] \
       -acodec libfdk_aac \
       -vcodec libx264 \
        output.mp4
Track	`start_time` (in ms)	`creation_time`
alice.mka	1564	2017-06-30T09:03:44.000000Z
alice.mkv	1584	2017-06-30T09:03:44.000000Z
bob.mka	20789	2017-06-30T09:04:03.000000Z
bob.mkv	20814	2017-06-30T09:04:03.000000Z
Track	start_time (in ms)	creation_time	duration (in ms)
alice1.mka	1564	2017-06-30T09:03:44.000000Z	1020
alice1.mkv	1584	2017-06-30T09:03:44.000000Z	1000
alice2.mka	4584	2017-06-30T09:03:47.000000Z	60000
alice2.mkv	4584	2017-06-30T09:03:47.000000Z	60000
bob.mka	20789	2017-06-30T09:04:03.000000Z	120000
bob.mkv	20814	2017-06-30T09:04:03.000000Z	120000