If I plug a line-in or USB microphone into my PC, I can record audio at 44.1 kHz. That's one sample of audio roughly every 23 microseconds. The audio wave must be sampled and played back precisely or it will sound distorted. How do non-real-time OSes manage the very time-sensitive process of audio recording and playback at such high sample rates? Is the process any different when the audio is played/recorded on motherboard audio versus USB speakers/microphone?

It comes down to buffering.

44.1 kHz does not mean that there is an analog signal that has to be timed perfectly as it travels from the sound card (USB, PCIe, on-board, it doesn't matter) to the CPU. In fact, the signal is digital. A Pulse Code Modulation (PCM) stream at 44,100 Hertz just means that, for each second of audio, there are 44,100 "data points" in the signal. The signal is quantized, meaning that it's basically a sequence of numbers. The size of those numbers (8-bit, 16-bit, etc.) is determined by the sample format.
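For a concrete picture of what those numbers look like, here is a minimal, self-contained sketch (the 440 Hz tone and the 16-bit format are just example choices) that quantizes one millisecond of a sine wave into the signed 16-bit samples that would actually travel over the bus:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define SAMPLE_RATE 44100    /* samples per second */
#define FREQ_HZ     440.0    /* tone frequency, arbitrary example */

int main(void)
{
    const double PI = 3.14159265358979323846;

    /* One millisecond of audio is roughly 44 samples at 44.1 kHz. */
    int n_samples = SAMPLE_RATE / 1000;

    for (int i = 0; i < n_samples; i++) {
        double t = (double)i / SAMPLE_RATE;            /* time of this sample */
        double analog = sin(2.0 * PI * FREQ_HZ * t);   /* ideal analog value, -1..1 */

        /* Quantize to a signed 16-bit integer: this is the "number"
         * that the PCM stream carries for each data point. */
        int16_t sample = (int16_t)lrint(analog * 32767.0);

        printf("sample %2d: %6d\n", i, sample);
    }
    return 0;
}
```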

While audio work does require fairly precise timing so that the human ear cannot perceive audible "lag", it is never the case that a digital audio source, such as a sound card or USB microphone, transmits each individual sample to the CPU, one at a time, as it is captured. The problem is that there is significant overhead in each transfer from the audio source to the CPU.

The overhead is much lower on PCI Express than on USB (in fact, a typical USB 2.0 audio device only exchanges packets about once every millisecond -- far too coarse-grained for sending 44,100 samples individually), but in either case you aren't going to be able to send 44,100 individual samples, each one by itself, in one second.
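A quick back-of-the-envelope calculation (a sketch; the 1 ms figure is the USB full-speed frame interval) shows why each transfer necessarily carries a batch of samples:

```c
#include <stdio.h>

int main(void)
{
    const int sample_rate = 44100;        /* samples per second */
    const double frame_interval_ms = 1.0; /* USB full-speed frame period */

    /* How many samples accumulate during one USB frame? */
    double samples_per_frame = sample_rate * frame_interval_ms / 1000.0;

    printf("%.1f samples accumulate per %.0f ms USB frame,\n",
           samples_per_frame, frame_interval_ms);
    printf("so each transfer must carry a batch of samples, not just one.\n");
    return 0;
}
```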

To resolve this, we use buffering. While buffering does introduce latency, the goal is to keep the buffer small enough that the user can tolerate the latency for their use case, yet large enough that the preemptive multitasking scheduler of a non-real-time OS has time to "cut off" any CPU-hogging tasks and give your audio stack a chance to process the samples that have stacked up.

The basic idea of buffering is that you have a sequence of memory locations where the bits representing a certain number of samples (usually several thousand out of those 44,100) are queued up. Once the buffer is full, or nearly full, the sound source raises an interrupt, which tells the CPU that the buffer is ready to be read. Then, when the kernel has time, it schedules a Direct Memory Access (DMA) transfer of those samples into system memory. From there, the program doing the sound capture can process the samples. The process is similar, but roughly reversed, for audio playback.
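As a rough illustration (a simplified sketch, not any real driver API; names like `dma_buffer` and `on_buffer_ready` are made up), the capture-side flow looks something like this: the device fills a buffer on its own, and only when a full period has accumulated does the CPU get involved and hand the whole batch to the application.

```c
#include <stdint.h>
#include <stdio.h>

#define PERIOD_SAMPLES 2205   /* samples per buffer period, e.g. 50 ms at 44.1 kHz */

/* Hypothetical DMA target: the device writes samples here on its own,
 * without involving the CPU for each individual sample. */
static int16_t dma_buffer[PERIOD_SAMPLES];
static int write_pos = 0;

/* Stands in for a hardware interrupt handler: fires once per period
 * and delivers the whole batch of samples to the application. */
static void on_buffer_ready(void (*deliver)(const int16_t *, int))
{
    deliver(dma_buffer, PERIOD_SAMPLES);
    write_pos = 0;                        /* buffer can be reused */
}

/* Stands in for the hardware filling the buffer sample by sample. */
static void hardware_writes_sample(int16_t sample,
                                   void (*deliver)(const int16_t *, int))
{
    dma_buffer[write_pos++] = sample;
    if (write_pos == PERIOD_SAMPLES)      /* buffer full: raise the "interrupt" */
        on_buffer_ready(deliver);
}

/* Application callback: processes 2205 samples in one go. */
static void app_process(const int16_t *samples, int count)
{
    printf("got %d samples in one interrupt (first value %d)\n",
           count, samples[0]);
}

int main(void)
{
    /* Simulate one second of capture: 44,100 samples arrive, but the
     * application is woken only 44100 / 2205 = 20 times. */
    for (int i = 0; i < 44100; i++)
        hardware_writes_sample((int16_t)(i % 100), app_process);
    return 0;
}
```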

So if you have a 50-millisecond buffer (1/20th of a second), which is not at all uncommon, you would have 44100 / 20 = 2205 samples in each buffer. Instead of receiving 44,100 interrupts per second (which would surely overload the CPU with just the overhead of receiving and processing those interrupts!), the CPU only receives 20 interrupts per second. Much better.
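The same arithmetic generalizes to other buffer lengths. Here is a small sketch (stereo, 16-bit samples are assumed for the byte counts) that tabulates the trade-off between latency and interrupt rate:

```c
#include <stdio.h>

int main(void)
{
    const int sample_rate   = 44100; /* Hz */
    const int channels      = 2;     /* stereo (assumed) */
    const int bytes_per_smp = 2;     /* 16-bit samples (assumed) */

    /* A few common buffer lengths in milliseconds. */
    const int buffer_ms[] = { 10, 20, 50, 100 };

    printf("%-10s %-14s %-12s %s\n",
           "latency", "samples/buf", "bytes/buf", "interrupts/sec");
    for (int i = 0; i < 4; i++) {
        int samples = sample_rate * buffer_ms[i] / 1000;
        int bytes   = samples * channels * bytes_per_smp;
        double irqs = 1000.0 / buffer_ms[i];
        printf("%3d ms     %-14d %-12d %.0f\n",
               buffer_ms[i], samples, bytes, irqs);
    }
    return 0;
}
```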
