@PierBover
Last active April 29, 2024 23:42
Reading waveform data in ffmpeg

When working on an audio player, I wanted to extract the audio waveform data to paint the audio waveform dynamically in the browser on a <canvas> element.

Initially I used the bbc/audiowaveform package, but this proved problematic for a number of reasons. First, I wasn't able to install the package (or build the binary) on macOS for local development. The other big issue is that I could only figure out how to install it on Ubuntu, so I couldn't use it in Alpine (for Docker images) or in other environments like cloud functions.

Initial approach

I found out from these docs that it's possible to paint a waveform with ffmpeg by extracting raw audio data:

https://trac.ffmpeg.org/wiki/Waveform#UsingGnuplot

The idea is that you can input basically anything into ffmpeg (any audio or video file) and output raw PCM audio to stdout or a file. Then you can read that raw audio data and turn it into something usable to paint your waveform.

I got it working after a bit of tinkering, but this approach requires you to downsample the audio. Otherwise you will produce a lot of raw audio data, especially for long audio files. When downsampling you can lose a lot of detail in the audio, which produces bad waveforms.

I'm documenting this approach here for reference:

Downsample and output binary data to stdout

ffmpeg -i test.wav -ac 1 -filter:a aresample=8000 -map 0:a -c:a pcm_s16le -f data -

Explanation

  • -i test.wav input file
  • -ac 1 mix all audio channels into one
  • -filter:a aresample=8000 downsample to 8000 samples per second to reduce the amount of data (CD-quality audio is typically 44,100 samples per second)
  • -map 0:a select all audio streams from input 0
  • -c:a pcm_s16le This sets the sample format to signed 16 bits, so you get values between -32,768 and 32,767 (in case the audio is 24 or 32 bits)
  • -f data - output the raw binary data to stdout
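Piping to stdout means you can decode the samples directly in the process that runs ffmpeg. Here's a minimal Python sketch; the `decode_s16le` function name is mine, and the commented usage assumes ffmpeg is on your PATH and `test.wav` exists:

```python
import struct
import subprocess

def decode_s16le(data):
    """Decode raw little-endian signed 16-bit PCM bytes into a list of ints."""
    count = len(data) // 2
    return list(struct.unpack("<%dh" % count, data[:count * 2]))

# Hypothetical usage: capture ffmpeg's stdout directly.
# raw = subprocess.run(
#     ["ffmpeg", "-i", "test.wav", "-ac", "1", "-filter:a", "aresample=8000",
#      "-map", "0:a", "-c:a", "pcm_s16le", "-f", "data", "-"],
#     capture_output=True, check=True).stdout
# samples = decode_s16le(raw)  # values between -32,768 and 32,767
```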

Downsample and save binary data into a text file

ffmpeg -i test.wav -ac 1 -filter:a aresample=8000 -map 0:a -c:a pcm_s16le -f data data.txt

Explanation

  • -i test.wav input file
  • -ac 1 mix all audio channels into one
  • -filter:a aresample=8000 downsample to 8000 samples per second
  • -map 0:a select all audio streams from input 0
  • -c:a pcm_s16le sets the sample format to 16 bits
  • -f data data.txt write the raw binary data to a file for further processing (despite the .txt extension, the contents are binary)

Reducing data even more

Depending on what you want to do, even downsampling to 8000 samples per second at 16 bits per sample is going to be way too much data. My goal was to paint a waveform for an audio player, so I really didn't need that much resolution, and I went as far as 500 samples per second and 8 bits per sample.

ffmpeg -i test.wav -ac 1 -filter:a aresample=500 -map 0:a -c:a pcm_u8 -f data data.txt

Explanation

  • -i test.wav input file
  • -ac 1 mix all audio channels into one
  • -filter:a aresample=500 downsample to 500 samples per second
  • -map 0:a select all audio streams from input 0
  • -c:a pcm_u8 8 bits per sample, unsigned (so values between 0 and 255)
  • -f data data.txt output binary data into a text file for further processing

This produced about 25 kB of raw data per minute of audio, which is easily parsed. Unfortunately, as I explained before, the generated waveform doesn't really resemble the actual audio.
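Parsing this 8-bit output is simple because each byte is one sample. A sketch (the function name is mine) that recenters the unsigned values around silence at 128 and scales them to the [-1, 1] range typically used for drawing:

```python
def normalize_u8(data):
    """Map unsigned 8-bit PCM samples (0-255, silence at 128) to floats in [-1, 1]."""
    return [(b - 128) / 128 for b in data]

# Hypothetical usage with the file produced by the command above:
# with open("data.txt", "rb") as f:
#     points = normalize_u8(f.read())
```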

Second approach

The better approach uses the astats filter to report the peak level in decibels for a series of chunks (or rather frames) of samples.

ffmpeg -i audio.wav -af "aresample=44100,asetnsamples=4000,astats=reset=1:metadata=1,ametadata=print:key='lavfi.astats.Overall.Peak_level':file=stats.log" -f null -

Explanation:

  • aresample=44100 this will resample the audio to 44100 Hz in case your source uses a higher sample rate.
  • asetnsamples=4000 here you're defining the chunk size: at 44100 Hz, each chunk of 4000 samples covers approximately 1/11th of a second (4000 / 44100 ≈ 0.09 s).
  • lavfi.astats.Overall.Peak_level this is the value that will be printed to the file. If you check the astats docs there are many more values that can be printed, like RMS, etc.
  • file=stats.log where the data will be written to.

This is the result you will get in the stats.log file which can be easily parsed.

frame:0    pts:0       pts_time:0
lavfi.astats.Overall.Peak_level=-72.246934
frame:1    pts:4000    pts_time:0.0907029
lavfi.astats.Overall.Peak_level=-72.246934
frame:2    pts:8000    pts_time:0.181406
lavfi.astats.Overall.Peak_level=-71.223883
frame:3    pts:12000   pts_time:0.272109
lavfi.astats.Overall.Peak_level=-71.223883
frame:4    pts:16000   pts_time:0.362812
lavfi.astats.Overall.Peak_level=-70.308734
frame:5    pts:20000   pts_time:0.453515
lavfi.astats.Overall.Peak_level=-69.480880
frame:6    pts:24000   pts_time:0.544218
lavfi.astats.Overall.Peak_level=-68.725109
frame:7    pts:28000   pts_time:0.634921
lavfi.astats.Overall.Peak_level=-49.640259
frame:8    pts:32000   pts_time:0.725624
lavfi.astats.Overall.Peak_level=-40.565966

So in the first chunk, the value you want is -72.246934, which is a logarithmic value in decibels (0 dB is the maximum).
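The log format above is easy to parse with a regex. A sketch in Python (function names are mine); note that astats prints `-inf` for completely silent frames, and the `db_to_linear` helper converts the logarithmic dB value back to a linear 0-1 amplitude for drawing:

```python
import re

def parse_peak_levels(log_text):
    """Pull the Peak_level values (in dB) out of the stats.log text."""
    pattern = r"lavfi\.astats\.Overall\.Peak_level=(-?(?:inf|\d+(?:\.\d+)?))"
    return [float(m) for m in re.findall(pattern, log_text)]

def db_to_linear(db):
    """Convert decibels to a linear amplitude in [0, 1] (0 dB -> 1.0)."""
    return 10 ** (db / 20)
```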
