Skip to content

Instantly share code, notes, and snippets.

@ashafq
Created June 23, 2017 17:31
Show Gist options
  • Save ashafq/0db953125a033b783c6e100acd5e64d9 to your computer and use it in GitHub Desktop.
Save ashafq/0db953125a033b783c6e100acd5e64d9 to your computer and use it in GitHub Desktop.
#include <stddef.h>
#include <xmmintrin.h>
void biquad_proc_x4(const float *coeff,
float *state,
float *io,
size_t len)
{
// Set up pointers
const __m128 *vcoeff = (__m128 *) __builtin_assume_aligned(coeff, 16);
__m128 *vstate = (__m128 *) __builtin_assume_aligned(state, 16);
__m128 *vio = (__m128 *) __builtin_assume_aligned(io, 16);
// Load coefficients
const __m128 vb0 = _mm_load_ps((float *) vcoeff++);
const __m128 vb1 = _mm_load_ps((float *) vcoeff++);
const __m128 vb2 = _mm_load_ps((float *) vcoeff++);
const __m128 va1 = _mm_load_ps((float *) vcoeff++);
const __m128 va2 = _mm_load_ps((float *) vcoeff++);
// Load states
__m128 vw1 = _mm_load_ps((float *) vstate++);
__m128 vw2 = _mm_load_ps((float *) vstate++);
// Process samples
for (size_t i = 0; i < len; i++) {
// Load input
__m128 vx = _mm_load_ps((float *) vio);
// Compute output
// y = b0 * x + w1
__m128 vb0_vx = _mm_mul_ps(vb0, vx);
__m128 vy = _mm_add_ps(vb0_vx, vw1);
// Update state: w1
// w1 = b1 * x - a1 * y + w2
__m128 vb1_vx = _mm_mul_ps(vb1, vx);
__m128 va1_vy = _mm_mul_ps(va1, vy);
vw1 = _mm_sub_ps(vb1_vx, va1_vy);
vw1 = _mm_add_ps(vw1, vw2);
// Update state: w2
// w2 = b2 * x - a2 * y
__m128 vb2_vx = _mm_mul_ps(vb2, vx);
__m128 va2_vy = _mm_mul_ps(va2, vy);
vw2 = _mm_sub_ps(vb2_vx, va2_vy);
// Store output to buffer, and update pointer
_mm_store_ps((float *) vio, vy);
++vio;
}
// Store state in state buffer
_mm_store_ps((float *) --vstate, vw2);
_mm_store_ps((float *) --vstate, vw1);
}
@ashafq
Copy link
Author

ashafq commented Jun 16, 2023

This biquad_proc_x4 function implements a biquad (second-order IIR) filter process using SSE (Streaming SIMD Extensions). This function is specifically designed to process four audio samples at once.

It takes the following arguments:

  1. coeff: an array of coefficients for the biquad filter. They should be aligned on a 16-byte boundary because the SSE operations require it. The length of the coeff array must be 20 (5 coeffs per channel times 4 channels). The memory map of the coefficient is shown in the next section.
  2. state: an array that keeps the state of the filter. This is also required to be aligned on a 16-byte boundary. The length of this array should be 8, (2 state x 4 channels).
  3. io: an array of input and output values. The function takes the input values from this array, processes them, and writes the output back into the same array.
  4. len: the length of the io array. The function will process this many audio frames (4 x samples)

The memory map of the coefficient should be:

Biquad coefficients are denoted as: b0, b1, b2, a1, a2 (a0 is assumed to be 1.0)
Channels are indexed as: c0, c1, c2, c3

b0_c0, b0_c1, b0_c2, b0_c3, b1_c0, b1_c1, b1_c2, b1_c3, ..., a2_c0, a2_c1, a2_c2, a2_c3

The function first loads the filter coefficients and state into SSE registers. It then loops over the input/output array, applying the filter to each input sample and storing the result in place. After it's done processing the samples, it stores the updated state back into the state array.

Here is a simple usage example (untested code):

#include <math.h>
#include <stddef.h>

extern void biquad_proc_x4(const float *coeff, float *state, float *io,
                           size_t len);

// Compute coefficients for a low-pass filter
void compute_lowpass_coeff_x4(float *coeff, float cutoff_freq,
                              float sample_freq, float Q) {
  float w0 = 2 * M_PI * cutoff_freq / sample_freq;
  float alpha = sin(w0) / (2 * Q);

  // Unverified math
  float b0 = (1 - cos(w0)) / 2;
  float b1 = 1 - cos(w0);
  float b2 = b0;
  float a1 = -2 * cos(w0);
  float a2 = 1 - alpha;

  for (int i = 0; i < 4; ++i) {
    coeff[i * 5 + 0] = b0;
    coeff[i * 5 + 1] = b1;
    coeff[i * 5 + 2] = b2;
    coeff[i * 5 + 3] = a1;
    coeff[i * 5 + 4] = a2;
  }
}

#define FRAMES (16)

int main() {
  float coeff[20] __attribute__((aligned(16));
  float state[8] __attribute__((aligned(16)) = {0}; // Initialize the state to zero
  float io[FRAMES * 4] __attribute__((aligned(16)); // Input/output samples

  // Initialize the io array with input samples here

  // Compute coefficients for a low-pass filter with a cutoff frequency of 1000
  // Hz, a sampling frequency of 44100 Hz, and a Q factor of 0.707 (Butterworth
  // filter)
  compute_lowpass_coeff_x4(coeff, 1000.0f, 44100.0f, 0.707f);

  // Process the samples
  biquad_proc_x4(coeff, state, io, FRAMES);

  return 0;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment