Created Nov 30, 2012
SSE/AVX matrix multiply
### Lovejane-N commented Oct 25, 2021

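``````
void matmult_ref(Mat44 &out, const Mat44 &A, const Mat44 &B)
{
    Mat44 t; // write to temp
    t.m[0][0] = A.m[0][0]*B.m[0][0] + A.m[0][1]*B.m[1][0] + A.m[0][2]*B.m[2][0] + A.m[0][3]*B.m[3][0];
    t.m[0][1] = A.m[0][0]*B.m[0][1] + A.m[0][1]*B.m[1][1] + A.m[0][2]*B.m[2][1] + A.m[0][3]*B.m[3][1];
    t.m[0][2] = A.m[0][0]*B.m[0][2] + A.m[0][1]*B.m[1][2] + A.m[0][2]*B.m[2][2] + A.m[0][3]*B.m[3][2];
    t.m[0][3] = A.m[0][0]*B.m[0][3] + A.m[0][1]*B.m[1][3] + A.m[0][2]*B.m[2][3] + A.m[0][3]*B.m[3][3];

    t.m[1][0] = A.m[1][0]*B.m[0][0] + A.m[1][1]*B.m[1][0] + A.m[1][2]*B.m[2][0] + A.m[1][3]*B.m[3][0];
    t.m[1][1] = A.m[1][0]*B.m[0][1] + A.m[1][1]*B.m[1][1] + A.m[1][2]*B.m[2][1] + A.m[1][3]*B.m[3][1];
    t.m[1][2] = A.m[1][0]*B.m[0][2] + A.m[1][1]*B.m[1][2] + A.m[1][2]*B.m[2][2] + A.m[1][3]*B.m[3][2];
    t.m[1][3] = A.m[1][0]*B.m[0][3] + A.m[1][1]*B.m[1][3] + A.m[1][2]*B.m[2][3] + A.m[1][3]*B.m[3][3];

    t.m[2][0] = A.m[2][0]*B.m[0][0] + A.m[2][1]*B.m[1][0] + A.m[2][2]*B.m[2][0] + A.m[2][3]*B.m[3][0];
    t.m[2][1] = A.m[2][0]*B.m[0][1] + A.m[2][1]*B.m[1][1] + A.m[2][2]*B.m[2][1] + A.m[2][3]*B.m[3][1];
    t.m[2][2] = A.m[2][0]*B.m[0][2] + A.m[2][1]*B.m[1][2] + A.m[2][2]*B.m[2][2] + A.m[2][3]*B.m[3][2];
    t.m[2][3] = A.m[2][0]*B.m[0][3] + A.m[2][1]*B.m[1][3] + A.m[2][2]*B.m[2][3] + A.m[2][3]*B.m[3][3];

    t.m[3][0] = A.m[3][0]*B.m[0][0] + A.m[3][1]*B.m[1][0] + A.m[3][2]*B.m[2][0] + A.m[3][3]*B.m[3][0];
    t.m[3][1] = A.m[3][0]*B.m[0][1] + A.m[3][1]*B.m[1][1] + A.m[3][2]*B.m[2][1] + A.m[3][3]*B.m[3][1];
    t.m[3][2] = A.m[3][0]*B.m[0][2] + A.m[3][1]*B.m[1][2] + A.m[3][2]*B.m[2][2] + A.m[3][3]*B.m[3][2];
    t.m[3][3] = A.m[3][0]*B.m[0][3] + A.m[3][1]*B.m[1][3] + A.m[3][2]*B.m[2][3] + A.m[3][3]*B.m[3][3];

    out = t;
}
``````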

Using for loops slows the code down significantly, which is why the optimization options (`-O1`, `-O2`, `-O3`) can make it faster.
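To make the loop-versus-unrolled comparison concrete, here is a minimal sketch that puts a loop form next to a version with the inner loop unrolled by hand and checks that they produce the same result. The `Mat44` struct below is a hypothetical stand-in (4x4 floats indexed as `m[row][col]`), and the names `matmult_loops`, `matmult_unrolled_j` and `variants_agree` are made up for this illustration; they are not taken from the gist.

```cpp
#include <cassert>

// Hypothetical stand-in for the gist's matrix type: 4x4 floats, m[row][col].
struct Mat44 { float m[4][4]; };

// Loop form: each output element is a 4-term dot product.
void matmult_loops(Mat44 &out, const Mat44 &A, const Mat44 &B)
{
    Mat44 t; // write to temp so out may alias A or B
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            t.m[i][j] = A.m[i][0]*B.m[0][j] + A.m[i][1]*B.m[1][j]
                      + A.m[i][2]*B.m[2][j] + A.m[i][3]*B.m[3][j];
    out = t;
}

// Same arithmetic with the inner (j) loop unrolled by hand.
void matmult_unrolled_j(Mat44 &out, const Mat44 &A, const Mat44 &B)
{
    Mat44 t; // write to temp so out may alias A or B
    for (int i = 0; i < 4; i++) {
        const float *a = A.m[i]; // row i of A
        t.m[i][0] = a[0]*B.m[0][0] + a[1]*B.m[1][0] + a[2]*B.m[2][0] + a[3]*B.m[3][0];
        t.m[i][1] = a[0]*B.m[0][1] + a[1]*B.m[1][1] + a[2]*B.m[2][1] + a[3]*B.m[3][1];
        t.m[i][2] = a[0]*B.m[0][2] + a[1]*B.m[1][2] + a[2]*B.m[2][2] + a[3]*B.m[3][2];
        t.m[i][3] = a[0]*B.m[0][3] + a[1]*B.m[1][3] + a[2]*B.m[2][3] + a[3]*B.m[3][3];
    }
    out = t;
}

// Fills two matrices with small integers (exactly representable in float)
// and reports whether both variants produce identical results.
bool variants_agree()
{
    Mat44 A, B, X, Y;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            A.m[i][j] = float(i*4 + j + 1);
            B.m[i][j] = float((i + 2*j) % 5 - 2);
        }
    matmult_loops(X, A, B);
    matmult_unrolled_j(Y, A, B);
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            if (X.m[i][j] != Y.m[i][j])
                return false;
    return true;
}
```

Calling `assert(variants_agree());` from a `main` is enough to exercise it. Whether the hand-unrolled form is actually faster depends on the compiler and flags, since an optimizer may well unroll the loop form on its own.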

### Catoverflow commented Jun 2, 2022 • edited

g++ (GCC) 12.1.0, Arch Linux 5.18.1, Intel i7-9750H
`g++ -mavx -O2`

``````
all ok.
ref: 185.01 cycles
SSE: 120.06 cycles
AVX_4mem: 118.66 cycles
AVX_8: 72.41 cycles
``````

`g++ -mavx -O3`

``````
all ok.
ref: 7.39 cycles
SSE: 12.07 cycles
AVX_4mem: 12.44 cycles
AVX_8: 8.17 cycles
``````

Also, `intrin.h` is not included

I checked the assembly code in Compiler Explorer: with the `-O2` option the ref function is compiled into AVX, and the inner loop (line 17) is unrolled. With `-O3` the outer loop (line 16) is also unrolled, exactly as in @Lovejane-N's comment above. Interestingly, with `-O3` both the native ref and AVX_8 have no loop, but the former runs faster.
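For anyone comparing the compiler's output for ref against a hand-vectorized variant, here is a rough sketch of one common SSE formulation: each output row is a linear combination of the rows of B, weighted by the entries of the matching row of A. This is an illustration under assumptions, not necessarily the gist's exact code: `Mat44` is a hypothetical stand-in, the names `matmult_scalar`, `matmult_sse_sketch`, `sse_matches_scalar` and the `MATMUL_HAVE_SSE` macro are made up here, and on targets without SSE the sketch simply falls back to the scalar path.

```cpp
#include <cassert>

// Use SSE intrinsics only where the target supports them (x86 with SSE).
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define MATMUL_HAVE_SSE 1
#else
#define MATMUL_HAVE_SSE 0
#endif

// Hypothetical stand-in for the gist's matrix type: 4x4 floats, m[row][col].
struct Mat44 { float m[4][4]; };

// Plain scalar reference, used for comparison and as the non-SSE fallback.
void matmult_scalar(Mat44 &out, const Mat44 &A, const Mat44 &B)
{
    Mat44 t; // write to temp so out may alias A or B
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            t.m[i][j] = A.m[i][0]*B.m[0][j] + A.m[i][1]*B.m[1][j]
                      + A.m[i][2]*B.m[2][j] + A.m[i][3]*B.m[3][j];
    out = t;
}

// Row i of the result = sum over k of A[i][k] * (row k of B).
void matmult_sse_sketch(Mat44 &out, const Mat44 &A, const Mat44 &B)
{
#if MATMUL_HAVE_SSE
    // Unaligned loads, so Mat44 needs no special alignment in this sketch.
    const __m128 b0 = _mm_loadu_ps(B.m[0]);
    const __m128 b1 = _mm_loadu_ps(B.m[1]);
    const __m128 b2 = _mm_loadu_ps(B.m[2]);
    const __m128 b3 = _mm_loadu_ps(B.m[3]);
    Mat44 t; // write to temp so out may alias A or B
    for (int i = 0; i < 4; i++) {
        // Broadcast one scalar of A's row at a time and scale a whole row of B.
        __m128 r = _mm_mul_ps(_mm_set1_ps(A.m[i][0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(A.m[i][1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(A.m[i][2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(A.m[i][3]), b3));
        _mm_storeu_ps(t.m[i], r);
    }
    out = t;
#else
    matmult_scalar(out, A, B); // no SSE on this target
#endif
}

// Compares both paths on small integer inputs, which are exact in float.
bool sse_matches_scalar()
{
    Mat44 A, B, X, Y;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            A.m[i][j] = float(i*4 + j + 1);
            B.m[i][j] = float((i + 2*j) % 5 - 2);
        }
    matmult_scalar(X, A, B);
    matmult_sse_sketch(Y, A, B);
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            if (X.m[i][j] != Y.m[i][j])
                return false;
    return true;
}
```

The sketch only checks correctness; it says nothing about cycle counts, which depend on the compiler version and flags as the timings above show.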

As for @rygorous's question: I'm doing a Computer Architecture experiment, and this code is a great help!

