@MaxPower15
Last active August 5, 2022
Okay, so it looks like there are a few different things that need clarity. I'm breaking them into sections.

  1. On the client-side, how do I combine the example audio mix with the example clips?
  2. On the server-side, practically speaking, how can we take both video and audio as inputs to type=audio_mix?
  3. On the client-side, how do we preview the mixes?

(1) On the client-side, how do I combine the example audio mix with the example clips?

First, let me talk about a simpler case. I implemented audioOverlayOp in the edit tree a while back. It mixes just two inputs together. The main one--which could have both a/v streams or just audio--is called "input." And the audio that's mixed into it is called "overlay." The relevant thing about this implementation is not its simplicity, but that it implicitly selects which stream can have video and which is audio-only. That is, "overlay" is just audio, and "input" can be audio and/or video. That detail lets us do the other things we care about on the client and server.
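For reference, a single audioOverlayOp node has roughly this shape (a minimal sketch pulled from the bigger example further down; the placeholder URIs are just stand-ins):

{
  type: 'audio_overlay',
  // "input" can carry audio and/or video; any video passes through untouched
  input: { type: 'video', uri: '...' },
  // "overlay" is audio-only and gets mixed in at this offset
  overlay: { type: 'audio', uri: '...' },
  start: 0,
}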

Here's an example of an edit tree that combines the provided example trees in the way I'm talking about by adding a "mainInput" property to the audio_mix node.

{
  type: 'audio_mix',
  mainInput: {
    type: 'concat',
    inputs: [
      {
        type: 'clip',
        range: {
          start: 0,
          end: 67469860,
        },
        input: {
          type: 'video',
          duration: 183059000,
          uri: 'https://embed-fastly.wistia.st/deliveries/690a36d29613bbadb70d00d24a943c854284957e.bin',
          preprocessStrategy: 'disable',
        },
      },
      {
        type: 'clip',
        range: {
          start: 69966870,
          end: 183059000,
        },
        input: {
          type: 'video',
          duration: 183059000,
          uri: 'https://embed-fastly.wistia.st/deliveries/690a36d29613bbadb70d00d24a943c854284957e.bin',
          preprocessStrategy: 'disable',
        },
      },
    ],
  },
  mixInputs: [
    {
      input: {
        type: 'audio',
        uri: 'https://embed-ssl.wistia.com/deliveries/9fbcd194ca4461bea6adcc38b5e20bc9bc752912.bin',
      },
      start: 0,
    },
    {
      input: {
        type: 'audio',
        uri: 'https://embed-ssl.wistia.com/deliveries/87891f9e6ca8ab7ec2fbe452628dd6ec97fb019e.bin',
      },
      start: 2000000,
    },
    {
      input: {
        type: 'audio',
        uri: 'https://embed-ssl.wistia.com/deliveries/1c559dd7a0d0b6e50e4061711fea79ef112750b1.bin',
      },
      start: 3000000,
    },
    {
      input: {
        type: 'audio',
        uri: 'https://embed-ssl.wistia.com/deliveries/4103c109a5f756effc80dfebd7e4c2b0fd3e8f00.bin',
      },
      start: 3500000,
    },
  ],
}

(2) On the server-side, practically speaking, how can we take both video and audio as inputs to type=audio_mix?

On the server side, each "segproc" worker is given a stream type--it already knows whether it's operating on video or on audio.

So let's talk about the audio_mix type for an audio segproc worker.

This worker will look at mainInput and mixInputs, take only the audio streams from those files (segfetch already does this for it by default), and mix them together. This is the logic you pretty much already have, except we require that mainInput is specified--there must be at least one audio stream to mix. The output of this segproc worker is a mixed audio segment.
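To make that concrete, here's a hypothetical sketch of the audio worker's handler. fetchAudioSegment and mixSegments are made-up helpers standing in for whatever segfetch/mixing primitives actually exist--this is just the shape of the logic, not real segproc API:

// Hypothetical audio segproc handler for type=audio_mix.
async function handleAudioMixForAudioWorker(node) {
  if (!node.mainInput) {
    throw new Error('audio_mix requires mainInput: need at least one audio stream to mix');
  }
  // segfetch already hands us just the audio streams of each input file.
  const main = await fetchAudioSegment(node.mainInput);
  const mixes = await Promise.all(
    (node.mixInputs || []).map(async ({ input, start }) => ({
      seg: await fetchAudioSegment(input),
      start,
    }))
  );
  // Mix everything together, offsetting each mixInput by its start time.
  return mixSegments(main, mixes);
}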

Now let's think about the audio_mix type for a video segproc worker.

We already know we're only focusing on video. The only input of audio_mix that could be contributing video output is "mainInput." It contributes that output, but we're not actually doing anything to it. So it's a simple pass-through operation: for a video worker, we just get the "mainInput" artifact and return it as the output artifact.

If we had specified an "audio" type as the "mainInput", then this is just a "no-op" for a video worker.
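The video worker's version is correspondingly trivial. Again a hypothetical sketch, with fetchVideoSegment as a made-up helper:

// Hypothetical video segproc handler for type=audio_mix: pure pass-through.
async function handleAudioMixForVideoWorker(node) {
  // "mainInput" is the only input that can carry video, so return its
  // artifact untouched. If mainInput is audio-only, there's no video
  // artifact and this is effectively a no-op.
  return fetchVideoSegment(node.mainInput);
}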

(3) On the client-side, how do we preview the mixes?

On the client-side, you can refer to my original implementation of audioOverlayOp. It is basically just a simpler version of the "audio_mix" type. Instead of previewing multiple mixed audio over the video/audio, it just mixes in one audio stream. And it passes through the "input" video stream unchanged.

This is helpful on the client because the logic is simpler when there's just one audio input to synchronize. And because you can get equivalent mixing behavior by chaining several audioOverlayOps, we could implement the preview that way. Note that I'm not advocating literally creating an edit tree with a bunch of audio overlay ops. I'm suggesting that, if synchronizing all these inputs at once is a bit much, we could think about it as chaining multiple operations together.

What I mean by equivalent mixing behavior:

{
  type: 'audio_overlay',
  start: 0,
  overlay: {
    type: 'audio_overlay',
    input: {
      type: 'audio_overlay',
      input: {
        type: 'audio_overlay',
        input: {
          type: 'audio',
          uri: 'https://embed-ssl.wistia.com/deliveries/9fbcd194ca4461bea6adcc38b5e20bc9bc752912.bin',
        },
        start: 35000000,
        overlay: {
          type: 'audio',
          uri: 'https://embed-ssl.wistia.com/deliveries/4103c109a5f756effc80dfebd7e4c2b0fd3e8f00.bin',
        },
      },
      start: 30000000,
      overlay: {
        type: 'audio',
        uri: 'https://embed-ssl.wistia.com/deliveries/1c559dd7a0d0b6e50e4061711fea79ef112750b1.bin',
      },
    },
    start: 20000000,
    overlay: {
      type: 'audio',
      uri: 'https://embed-ssl.wistia.com/deliveries/87891f9e6ca8ab7ec2fbe452628dd6ec97fb019e.bin',
    }
  },
  input: {
    type: 'concat',
    inputs: [
      {
        type: 'clip',
        range: {
          start: 0,
          end: 67469860,
        },
        input: {
          type: 'video',
          duration: 183059000,
          uri: 'https://embed-fastly.wistia.st/deliveries/690a36d29613bbadb70d00d24a943c854284957e.bin',
          preprocessStrategy: 'disable',
        },
      },
      {
        type: 'clip',
        range: {
          start: 69966870,
          end: 183059000,
        },
        input: {
          type: 'video',
          duration: 183059000,
          uri: 'https://embed-fastly.wistia.st/deliveries/690a36d29613bbadb70d00d24a943c854284957e.bin',
          preprocessStrategy: 'disable',
        },
      },
    ],
  },
}

It's a bit of a mouthful in JSON, so I see the appeal of the array, lol. But maybe it's a bit easier to understand if I were to use the audioOverlay helper, e.g....

const inputVid = video({ uri: 'https://embed-fastly.wistia.st/deliveries/690a36d29613bbadb70d00d24a943c854284957e.bin' });

const outputVidBeforeAudioMix = concat([
  clip(inputVid, { start: 0, end: 67469860 }),
  clip(inputVid, { start: 69966870, end: 183059000 }),
]);

const audio1 = audio({ uri: 'https://embed-ssl.wistia.com/deliveries/9fbcd194ca4461bea6adcc38b5e20bc9bc752912.bin' });
const audio2 = audio({ uri: 'https://embed-ssl.wistia.com/deliveries/87891f9e6ca8ab7ec2fbe452628dd6ec97fb019e.bin' });
const audio3 = audio({ uri: 'https://embed-ssl.wistia.com/deliveries/1c559dd7a0d0b6e50e4061711fea79ef112750b1.bin' });
const audio4 = audio({ uri: 'https://embed-ssl.wistia.com/deliveries/4103c109a5f756effc80dfebd7e4c2b0fd3e8f00.bin' });

let audioOverlays = audio1;
audioOverlays = audioOverlay(audioOverlays, audio2, { start: 20000000 });
audioOverlays = audioOverlay(audioOverlays, audio3, { start: 30000000 });
audioOverlays = audioOverlay(audioOverlays, audio4, { start: 35000000 });

const outputVid = audioOverlay(outputVidBeforeAudioMix, audioOverlays, { start: 0 });

(I'll note here that these overlays are specified to play out the whole duration of each audio input. If you wanted each one to only be alive for a little while, you'd use "clip" on each audio input.)
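For example, sticking with the same helpers and time units as above, trimming audio2 so only its opening stretch gets mixed in would look like:

// Mix in only the first part of audio2 (clip times are in the same
// units as the rest of the examples), still starting 20000000 into
// the main timeline.
const clippedAudio2 = clip(audio2, { start: 0, end: 10000000 });
audioOverlays = audioOverlay(audioOverlays, clippedAudio2, { start: 20000000 });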
