# rygorous/gist:4172889

Created Nov 30, 2012
SSE/AVX matrix multiply

### Lovejane-N commented Oct 25, 2021

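``````
void matmult_ref(Mat44 &out, const Mat44 &A, const Mat44 &B)
{
    Mat44 t; // write to temp

    t.m[0][0] = A.m[0][0]*B.m[0][0] + A.m[0][1]*B.m[1][0] + A.m[0][2]*B.m[2][0] + A.m[0][3]*B.m[3][0];
    t.m[0][1] = A.m[0][0]*B.m[0][1] + A.m[0][1]*B.m[1][1] + A.m[0][2]*B.m[2][1] + A.m[0][3]*B.m[3][1];
    t.m[0][2] = A.m[0][0]*B.m[0][2] + A.m[0][1]*B.m[1][2] + A.m[0][2]*B.m[2][2] + A.m[0][3]*B.m[3][2];
    t.m[0][3] = A.m[0][0]*B.m[0][3] + A.m[0][1]*B.m[1][3] + A.m[0][2]*B.m[2][3] + A.m[0][3]*B.m[3][3];

    t.m[1][0] = A.m[1][0]*B.m[0][0] + A.m[1][1]*B.m[1][0] + A.m[1][2]*B.m[2][0] + A.m[1][3]*B.m[3][0];
    t.m[1][1] = A.m[1][0]*B.m[0][1] + A.m[1][1]*B.m[1][1] + A.m[1][2]*B.m[2][1] + A.m[1][3]*B.m[3][1];
    t.m[1][2] = A.m[1][0]*B.m[0][2] + A.m[1][1]*B.m[1][2] + A.m[1][2]*B.m[2][2] + A.m[1][3]*B.m[3][2];
    t.m[1][3] = A.m[1][0]*B.m[0][3] + A.m[1][1]*B.m[1][3] + A.m[1][2]*B.m[2][3] + A.m[1][3]*B.m[3][3];

    t.m[2][0] = A.m[2][0]*B.m[0][0] + A.m[2][1]*B.m[1][0] + A.m[2][2]*B.m[2][0] + A.m[2][3]*B.m[3][0];
    t.m[2][1] = A.m[2][0]*B.m[0][1] + A.m[2][1]*B.m[1][1] + A.m[2][2]*B.m[2][1] + A.m[2][3]*B.m[3][1];
    t.m[2][2] = A.m[2][0]*B.m[0][2] + A.m[2][1]*B.m[1][2] + A.m[2][2]*B.m[2][2] + A.m[2][3]*B.m[3][2];
    t.m[2][3] = A.m[2][0]*B.m[0][3] + A.m[2][1]*B.m[1][3] + A.m[2][2]*B.m[2][3] + A.m[2][3]*B.m[3][3];

    t.m[3][0] = A.m[3][0]*B.m[0][0] + A.m[3][1]*B.m[1][0] + A.m[3][2]*B.m[2][0] + A.m[3][3]*B.m[3][0];
    t.m[3][1] = A.m[3][0]*B.m[0][1] + A.m[3][1]*B.m[1][1] + A.m[3][2]*B.m[2][1] + A.m[3][3]*B.m[3][1];
    t.m[3][2] = A.m[3][0]*B.m[0][2] + A.m[3][1]*B.m[1][2] + A.m[3][2]*B.m[2][2] + A.m[3][3]*B.m[3][2];
    t.m[3][3] = A.m[3][0]*B.m[0][3] + A.m[3][1]*B.m[1][3] + A.m[3][2]*B.m[2][3] + A.m[3][3]*B.m[3][3];

    out = t;
}
``````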

Writing the product with for loops slows the code down significantly unless the compiler unrolls them. That's why the optimization options (`-O1`, `-O2`, `-O3`) can make the code faster.
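For comparison, here is a minimal sketch of the loop form that the unrolled listing above spells out by hand. The `Mat44` layout (a plain row-major `float m[4][4]`) and the names `matmult_loop` and `make_identity` are assumptions for illustration, not taken from the gist.

```cpp
// Assumed layout: a plain row-major 4x4 float matrix.
struct Mat44 { float m[4][4]; };

// Loop form of the reference multiply. Each t.m[i][j] is the dot product
// of row i of A with column j of B, which is what the unrolled version
// writes out as 16 separate statements.
void matmult_loop(Mat44 &out, const Mat44 &A, const Mat44 &B)
{
    Mat44 t; // write to temp so that out may alias A or B
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            t.m[i][j] = A.m[i][0]*B.m[0][j] + A.m[i][1]*B.m[1][j]
                      + A.m[i][2]*B.m[2][j] + A.m[i][3]*B.m[3][j];
    out = t;
}

// Hypothetical helper: builds the 4x4 identity matrix for quick checks.
Mat44 make_identity()
{
    Mat44 I;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            I.m[i][j] = (i == j) ? 1.0f : 0.0f;
    return I;
}
```

Multiplying any matrix by `make_identity()` should return it unchanged, which makes a quick sanity check. Whether the compiler unrolls these loops depends on the optimization level and the compiler version.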

### Catoverflow commented Jun 2, 2022 • edited

g++ (GCC) 12.1.0, Arch Linux 5.18.1, Intel i7-9750H
`g++ -mavx -O2`

``````
all ok.
ref: 185.01 cycles
SSE: 120.06 cycles
AVX_4mem: 118.66 cycles
AVX_8: 72.41 cycles
``````

`g++ -mavx -O3`

``````
all ok.
ref: 7.39 cycles
SSE: 12.07 cycles
AVX_4mem: 12.44 cycles
AVX_8: 8.17 cycles
``````

Also, `intrin.h` is not included
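Assuming the remark refers to the MSVC-specific `<intrin.h>` header not being usable with GCC, one common workaround is to select the header per compiler. This is a sketch, not the gist's code. The helper `add4` is hypothetical and only shows that the SSE intrinsics resolve after the include. It targets x86/x86-64 only.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>     // MSVC: umbrella header for compiler intrinsics
#else
#include <immintrin.h>  // GCC/Clang: header for the SSE/AVX intrinsics
#endif

// Hypothetical helper: adds four floats lane-wise using SSE.
// Unaligned loads/stores are used so plain arrays work as arguments.
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```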

I checked the assembly code in Compiler Explorer: with the `-O2` option the ref function is compiled to AVX instructions, and the inner loop (line 17) is unrolled. With `-O3` the outer loop (line 16) is unrolled as well, exactly as in @Lovejane-N's comment above. Interestingly, with `-O3` neither the native ref nor `AVX_8` contains a loop, but the former runs faster.

As for @rygorous's questions, I'm doing a computer architecture experiment, and this code is a great help!
