Audio in Godot

This is a short document describing the audio systems in Godot as I understand them. If you are unfamiliar with digital audio, I'd recommend reading the first two sections of this documentation page: "Sampling audio" and "Audio data format and structure". The terminology defined there will be important to get the most out of this document.

Target audience

This document is targeted at Godot engine developers, but may be useful for users too.

Similarities and Common Themes

The audio systems in Godot 4.x (master branch) and 3.x differ in some pretty substantial ways, but there are a few common themes. This list isn't exhaustive, but it covers most stuff.

There are a few high level components:

  • Audio Drivers
  • Audio Server
  • Audio Buses
  • Audio Effects/Effect Chains
  • Audio Streams
  • Audio Stream Playback objects
  • Audio Player nodes

An Audio Driver's job is to interface with the system and get audio frames from Godot out into PulseAudio or CoreAudio or WASAPI, etc., so that they can be played on the speakers. It's the driver's job to either create an audio thread or get a callback from the system that will run on the system's audio thread (the CoreAudio driver works this way, I believe).
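To make that concrete, here is a minimal sketch of a driver that owns its own mix thread, assuming interleaved stereo int32 frames. All names here (MyAudioDriver, write_to_device) are hypothetical, not Godot's actual AudioDriver API.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical sketch of a driver-owned audio thread. Real Godot drivers
// implement the engine's AudioDriver interface; names here are illustrative.
class MyAudioDriver {
public:
    using MixCallback = void (*)(int32_t *frames, int frame_count);

    MyAudioDriver(MixCallback cb, int mix_buffer_frames)
        : callback(cb), buffer(mix_buffer_frames * 2 /* stereo */),
          running(true), thread(&MyAudioDriver::thread_func, this) {}

    ~MyAudioDriver() {
        running.store(false);
        thread.join();
    }

private:
    void thread_func() {
        while (running.load()) {
            // Ask the engine (ultimately the Audio Server's mix routine)
            // to fill the buffer with the next block of frames...
            callback(buffer.data(), (int)buffer.size() / 2);
            // ...then hand the frames to the OS audio API (PulseAudio,
            // WASAPI, CoreAudio, ...). Stubbed out here.
            write_to_device(buffer);
        }
    }

    void write_to_device(const std::vector<int32_t> &) { /* platform-specific */ }

    MixCallback callback;
    std::vector<int32_t> buffer;
    std::atomic<bool> running;
    std::thread thread;
};
```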

The Audio Server is one of the components whose role changed drastically between 3.x and 4.x, but loosely speaking it maintains the state of the audio system, and keeps track of the Audio Buses in use and the relationship between them.

The Audio Buses are an Audio Server construct. They are used to divert sound through different Effect Chains as the audio is mixed from the Audio Player nodes into the Audio Driver's output buffer. Each bus applies an amplification or attenuation (its gain) to the audio passing through it and optionally applies one or more audio effects, in sequence. These audio effects can be used by game developers to compress, distort, equalize, add reverb, or any number of other things. Once an audio bus is done applying gain and effects, it sends the sound to another audio bus, or directly to the speakers. Only the "Master" audio bus is capable of sending sound directly to the speakers, to my knowledge. The relationship between Audio Buses is effectively a tree. It could be represented as a directed acyclic graph, but I believe some optimizations made to avoid iterating over the bus list multiple times during mixing restrict it to a tree.
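A rough sketch of the per-bus step described above might look like the following. It is simplified to plain float sample buffers and a single send per bus, and the type names are made up for illustration, not Godot's.

```cpp
#include <cmath>
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative only; not Godot's actual implementation.
// An effect transforms a buffer of samples in place.
struct AudioEffect {
    virtual ~AudioEffect() = default;
    virtual void process(std::vector<float> &buffer) = 0;
};

struct AudioBus {
    float gain_db = 0.0f;                              // the bus volume slider
    std::vector<std::unique_ptr<AudioEffect>> effects; // the effect chain
    AudioBus *send = nullptr;                          // parent bus; null for "Master"
    std::vector<float> buffer;                         // this bus's mix buffer

    void process_and_forward() {
        const float gain = std::pow(10.0f, gain_db / 20.0f); // dB -> linear
        for (float &sample : buffer)
            sample *= gain;
        for (auto &effect : effects)
            effect->process(buffer); // effects run in sequence
        if (send) {
            // Accumulate into the parent bus. The Master bus would instead
            // hand its buffer to the Audio Driver's output.
            for (size_t i = 0; i < buffer.size(); i++)
                send->buffer[i] += buffer[i];
        }
    }
};
```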

As mentioned above, Audio Effects allow users to apply transformations on the audio. Some audio effects have numerical stability issues and can be problematic when combined, especially with the reverb effect.

Audio Streams are how sound assets are represented in Godot. They are just assets (data) and don't contain any playback state, so a single stream object can be used in multiple places. This allows the memory holding the data for that stream to be pooled.

Audio Stream Playback objects contain all of the playback state that is not contained in the Audio Stream: the playback position, internal data used for resampling and playback rate scaling, and so on.
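The split between the two might be sketched like this, assuming a simple decoded PCM asset; SampleStream and SamplePlayback are hypothetical names, not engine classes.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// The stream is shared, immutable data...
struct SampleStream {
    std::vector<float> frames;   // decoded PCM, shared by every playback
    float sample_rate = 44100.0f;
};

// ...while each playback owns its own position and rate-scaling state.
struct SamplePlayback {
    std::shared_ptr<const SampleStream> stream; // many playbacks, one stream
    double position = 0.0;                      // per-playback cursor, in frames

    float mix_one_frame(double playback_rate) {
        if (position >= (double)stream->frames.size())
            return 0.0f; // past the end of the asset
        float frame = stream->frames[(size_t)position];
        position += playback_rate; // rate scaling advances the cursor faster/slower
        return frame;
    }
};
```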

Audio Player nodes are the primary interface that Godot engine users have when interacting with the audio system. In 3.x they are the only built-in way to play back audio at all.

Differences (3.x perspective)

The Audio Server in 3.x can best be described as a structure that holds audio buffers and handles bus mixing and effects logic. It will provide a pointer to the internal buffers for each bus to anyone who asks. It also maintains a list of callbacks to call whenever it comes time to mix audio. Each audio player node registers a callback pointing to itself when it enters the tree. In this callback, it requests a pointer to the Audio Server bus that it would like to mix its audio into, and mixes audio directly into it. This means each audio player node has to implement volume ramps to avoid pops.
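In other words, the 3.x flow is roughly the sketch below: the server does little more than invoke registered callbacks and hand out buffer pointers, and each player node mixes itself in. All names are hypothetical stand-ins for the real classes.

```cpp
#include <functional>
#include <vector>

// Rough sketch of the 3.x pattern described above; not the real API.
struct MixServer {
    std::vector<std::function<void(int)>> mix_callbacks; // one per player node
    std::vector<float> bus_buffer;                       // one bus, for brevity

    float *get_bus_buffer() { return bus_buffer.data(); }

    void mix(int frame_count) {
        // The server just invokes every registered callback; each player
        // node mixes its own audio directly into the bus buffer it chose.
        for (auto &cb : mix_callbacks)
            cb(frame_count);
    }
};

struct PlayerNode {
    MixServer *server = nullptr;
    float volume = 1.0f;

    void enter_tree(MixServer *s) {
        server = s;
        server->mix_callbacks.push_back(
            [this](int frames) { mix_into_bus(frames); });
    }

    float next_frame() { return 0.0f; } // stand-in for actual stream playback

    void mix_into_bus(int frame_count) {
        float *bus = server->get_bus_buffer();
        for (int i = 0; i < frame_count; i++)
            bus[i] += volume * next_frame(); // node owns its own volume handling
    }
};
```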

Side-note: Audio Pops

Audio pops (or clicks, or discontinuities) occur when an audio frame differs so greatly from the previous frame that it resembles an impulse to the human ear rather than a smooth change in pressure. We hear these as clicks or pops, and they can be very unpleasant. Any time you change a parameter in the audio system that can affect volume, you should do so smoothly over the course of many frames.
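A minimal sketch of the idea: interpolate the volume linearly across one mix buffer rather than applying the new value instantly.

```cpp
#include <cstddef>
#include <vector>

// Mix src into dst while ramping from the old volume to the new one,
// so no single frame sees a large jump in gain.
void mix_with_ramp(const std::vector<float> &src, std::vector<float> &dst,
                   float volume_from, float volume_to) {
    const size_t n = src.size(); // dst is assumed to be at least this long
    for (size_t i = 0; i < n; i++) {
        float t = (float)i / (float)n;
        float volume = volume_from + (volume_to - volume_from) * t;
        dst[i] += src[i] * volume;
    }
}
```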

Back to Audio Stream Player nodes

So whenever an Audio Stream Player node's output volume changes, it needs to take care not to just naively push that change through the audio mixing process. The change must be applied smoothly. This is implemented in all three audio player nodes in 3.x. So is fading in (ramping from zero volume to the volume set in the editor) on play(), fading out (ramping from the editor volume to zero) on stop() or when the stream ends, cross-fading during seek operations, fading out when the user does something unexpected like changing the audio stream out of the blue, and so on.
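As an illustration (not the engine's actual implementation), a seek crossfade can be written as an equal-gain blend between frames at the old position and frames at the new one:

```cpp
#include <cstddef>
#include <vector>

// Fade out frames from the old position while fading in frames from the
// new one, summed together over the length of `out`.
void crossfade_seek(const std::vector<float> &stream, size_t old_pos,
                    size_t new_pos, std::vector<float> &out) {
    const size_t n = out.size(); // length of the crossfade, in frames
    for (size_t i = 0; i < n; i++) {
        float t = (float)i / (float)n;
        float from = (old_pos + i < stream.size()) ? stream[old_pos + i] : 0.0f;
        float to = (new_pos + i < stream.size()) ? stream[new_pos + i] : 0.0f;
        out[i] = from * (1.0f - t) + to * t; // equal-gain linear crossfade
    }
}
```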

One thing none of the audio player nodes were able to handle was fading out when they were queue_free()ed, or when the bus they were sending audio to changed abruptly. Fixing these two bugs would have added a lot of complexity.

4.x Perspective

In 4.x I wanted to simplify the architecture where I could, so I essentially moved all audio mixing operations into the Audio Server. Playback nodes register an Audio Stream Playback object with the Audio Server and set its volume or playback bus at their leisure, and the Audio Server handles all ramps and crossfades. I'm not going to cover the internal architecture of the Audio Server too deeply in this document, but the brief summary is that I added a lockless, thread-safe linked list class called SafeList, which handles most (but not all) of the difficult thread-safety issues. This allows the Audio Server to present a thread-safe API to the playback nodes while ensuring that the system is always in a state where the audio thread can perform a mix operation. The system makes heavy use of std::atomic types to provide these guarantees.
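The flavor of that design can be sketched as follows: the game thread publishes a target value through a std::atomic, and the audio thread ramps toward it while mixing, so the mix never sees a discontinuity. The names below are hypothetical, and the SafeList registration machinery is omitted.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// One registered playback, as the server might track it (illustrative).
struct PlaybackEntry {
    std::atomic<float> target_volume{1.0f}; // written by the game thread
    float current_volume = 1.0f;            // touched only by the audio thread

    // Game thread: safe to call at any time, never blocks the mix.
    void set_volume(float v) { target_volume.store(v, std::memory_order_relaxed); }

    // Audio thread: ramps current_volume toward the target across the
    // buffer, so parameter changes never produce a pop.
    void mix(const std::vector<float> &src, std::vector<float> &dst) {
        float target = target_volume.load(std::memory_order_relaxed);
        const size_t n = src.size();
        for (size_t i = 0; i < n; i++) {
            float t = (float)(i + 1) / (float)n;
            float v = current_volume + (target - current_volume) * t;
            dst[i] += src[i] * v;
        }
        current_volume = target; // ramp completes by the end of the buffer
    }
};
```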

The audio player nodes can use and misuse this API all they like, but because of the design of the Audio Server and the mixing algorithm they won't introduce pops other than by literally playing back an audio stream that contains a pop. All ramps are built into the mixing code.
