
# digikar99/sb-simd-m44-by-v4.md

Last active September 8, 2022 13:01

# sb-simd: Multiplying a 4x4 matrix with a 4-length vector (AVX)

```
(require 'sb-simd)
(defpackage :simd-pack-user
  (:use :sb-simd-avx :cl))
(in-package :simd-pack-user)
```

I'm assuming you are referring to the following operation:

```
| a11 a12 a13 a14 |   | x |   | a11*x + a12*y + a13*z + a14*w |
| a21 a22 a23 a24 |   | y |   | a21*x + a22*y + a23*z + a24*w |
| a31 a32 a33 a34 | X | z | = | a31*x + a32*y + a33*z + a34*w |
| a41 a42 a43 a44 |   | w |   | a41*x + a42*y + a43*z + a44*w |
```

The Intel Intrinsics Guide might be a good starting point for a SIMD reference on Intel architectures.

Let's start from the idea that SIMD stands for "single instruction, multiple data": we should therefore look for operations we can perform "at once". Assuming we can perform 4 single-float operations at once, we first need 4 element-wise multiplications. These give us the rows of the following 4x4 intermediate matrix:

```
| a11*x   a12*y   a13*z   a14*w |
| a21*x   a22*y   a23*z   a24*w |
| a31*x   a32*y   a33*z   a34*w |
| a41*x   a42*y   a43*z   a44*w |
```

Each row of this intermediate matrix is obtained by a single element-wise multiplication of the corresponding row of the original matrix with the vector.
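As a quick sanity check, here is what one such element-wise multiplication looks like at the REPL, using `make-f32.4` and `f32.4*` from the package loaded above (the values are just an illustration, taken from the first row of the test matrix we will use later):

```lisp
;; One row of the intermediate matrix: a single element-wise
;; multiplication of a matrix row with the vector.
(let ((row (make-f32.4 1.0 2.0 3.0 4.0))     ; a11 a12 a13 a14
      (v   (make-f32.4 -1.0 -2.0 -3.0 2.0))) ; x y z w
  (f32.4* row v)) ; a pack holding -1.0 -4.0 -9.0 8.0
```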

From this point, one can obtain the required result vector by taking "horizontal sums". But horizontal sums, aka the hadd instructions, are a couple of times slower than their vertical counterparts: compare the latency and throughput of hadd vs add in the Intel reference above. (From here on, I assume you have access to a machine with AVX support.)

Thus, we will rearrange the 4 SIMD packs corresponding to the 4 rows of the intermediate matrix so that each of the 4 new packs corresponds to a column rather than a row. This transpose seems to be a common operation, so let's write it as a function. The usual instructions for this task include shuffle, permute, and permute2f128, of which the last is the most expensive. Here is at least one way this can be done:

```
(defun %transpose-m44 (a b c d)
  (declare (type f32.4 a b c d))
  ;; Writing a = (a1 a2 a3 a4) and so on, the first four shuffles gather
  ;; the low and high halves: ab12 = (a1 a2 b1 b2), ab34 = (a3 a4 b3 b4), etc.
  (let* ((ab12 (f32.4-shuffle a b #b01000100))
         (ab34 (f32.4-shuffle a b #b11101110))
         (cd12 (f32.4-shuffle c d #b01000100))
         (cd34 (f32.4-shuffle c d #b11101110))
         ;; The permutes interleave the halves: ab12 becomes (a1 b1 a2 b2), etc.
         (ab12 (f32.4-permute ab12 #b11011000))
         (ab34 (f32.4-permute ab34 #b11011000))
         (cd12 (f32.4-permute cd12 #b11011000))
         (cd34 (f32.4-permute cd34 #b11011000))
         ;; The final shuffles assemble the columns: abcd1 = (a1 b1 c1 d1), etc.
         (abcd1 (f32.4-shuffle ab12 cd12 #b01000100))
         (abcd2 (f32.4-shuffle ab12 cd12 #b11101110))
         (abcd3 (f32.4-shuffle ab34 cd34 #b01000100))
         (abcd4 (f32.4-shuffle ab34 cd34 #b11101110)))
    (values abcd1 abcd2 abcd3 abcd4)))
```
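To convince ourselves that this shuffle-and-permute dance really transposes, here is a quick check with easily recognizable values; the printed pack representation varies by SBCL version, so the logical contents are shown as a comment:

```lisp
;; Transposing rows (1 2 3 4), (5 6 7 8), (9 10 11 12), (13 14 15 16)
;; should yield columns (1 5 9 13), (2 6 10 14), (3 7 11 15), (4 8 12 16).
(%transpose-m44 (make-f32.4  1  2  3  4)
                (make-f32.4  5  6  7  8)
                (make-f32.4  9 10 11 12)
                (make-f32.4 13 14 15 16))
;; => four packs holding (1 5 9 13), (2 6 10 14), (3 7 11 15), (4 8 12 16)
```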

Once we obtain the transpose, we can simply take the vertical sum.

```
(defun %multiply-m44-by-v4 (m1 m2 m3 m4 v)
  (declare (type f32.4 m1 m2 m3 m4 v))
  (let ((i1 (f32.4* m1 v))
        (i2 (f32.4* m2 v))
        (i3 (f32.4* m3 v))
        (i4 (f32.4* m4 v)))
    (multiple-value-bind (i1 i2 i3 i4)
        (%transpose-m44 i1 i2 i3 i4)
      (f32.4+ i1 i2 i3 i4))))
```

Let's now test this on the following matrix-vector multiplication.

```
| 01 02 03 04 |   | -1 |   |  -6 |
| 02 03 04 05 |   | -2 |   | -10 |
| 03 04 05 06 | X | -3 | = | -14 |
| 04 05 06 07 |   |  2 |   | -18 |
```
```
SIMD-PACK-USER> (%multiply-m44-by-v4 (make-f32.4 1 2 3 4)
                                     (make-f32.4 2 3 4 5)
                                     (make-f32.4 3 4 5 6)
                                     (make-f32.4 4 5 6 7)
                                     (make-f32.4 -1 -2 -3 2))
#<SB-EXT:SIMD-PACK -6.0000000e+0 -1.0000000e+1 -1.4000000e+1 -1.8000000e+1>
```

Now we can make things work for an actual 4x4 array and a 4-length vector.

```
(defun multiply-m44-by-v4 (m44 v4
                           &optional (dst (make-array 4 :element-type 'single-float)))
  (declare (optimize speed)
           (type (simple-array single-float (4 4)) m44)
           (type (simple-array single-float (4)) v4 dst))
  (let* ((m1 (f32.4-row-major-aref m44  0))
         (m2 (f32.4-row-major-aref m44  4))
         (m3 (f32.4-row-major-aref m44  8))
         (m4 (f32.4-row-major-aref m44 12))
         (v  (f32.4-row-major-aref v4   0))
         (result (%multiply-m44-by-v4 m1 m2 m3 m4 v)))
    (setf (f32.4-row-major-aref dst 0) result)
    dst))
```

And let's write an ANSI CL version for checking correctness and performance:

```
(defun ansi-multiply-m44-by-v4
    (m44 v4
     &optional (dst (make-array 4 :element-type 'single-float)))
  (declare (optimize speed)
           (type (simple-array single-float (4 4)) m44)
           (type (simple-array single-float (4)) v4 dst))
  (loop :for i :below 4 :do (setf (row-major-aref dst i) 0.0))
  (loop :for i :below 4
        :do (loop :for j :below 4
                  :do (incf (row-major-aref dst i)
                            (* (aref m44 i j)
                               (row-major-aref v4 j)))))
  dst)
```

Indeed, this seems to work.

```
SIMD-PACK-USER> (let ((a (make-array '(4 4) :element-type 'single-float
                                     :initial-contents '((1.0 2.0 3.0 4.0)
                                                         (2.0 3.0 4.0 5.0)
                                                         (3.0 4.0 5.0 6.0)
                                                         (4.0 5.0 6.0 7.0))))
                      (b (make-array '(4) :element-type 'single-float
                                     :initial-contents '(-1.0 -2.0 -3.0 2.0))))
                  (print (ansi-multiply-m44-by-v4 a b))
                  (print (multiply-m44-by-v4 a b))
                  nil)

#(-6.0 -10.0 -14.0 -18.0)
#(-6.0 -10.0 -14.0 -18.0)
NIL
```

Now we can go back and declaim the functions inline, so that everything compiles down to a single function.

```
(declaim (inline %transpose-m44 %multiply-m44-by-v4))
(declaim (inline ansi-multiply-m44-by-v4 multiply-m44-by-v4))
```

Assuming you recompile the definitions above after adding these declamations, we are now in a position to measure performance.

`(ql:quickload "array-operations")`

Following this:

```
SIMD-PACK-USER> (let ((a (aops:rand*  'single-float '(4 4)))
                      (b (aops:rand*  'single-float '(4)))
                      (c (aops:zeros* 'single-float '(4))))
                  (time (loop repeat 10000000 do (ansi-multiply-m44-by-v4 a b c))))
Evaluation took:
  0.335 seconds of real time
  0.334162 seconds of total run time (0.334162 user, 0.000000 system)
  99.70% CPU
  737,838,564 processor cycles
  0 bytes consed

NIL
SIMD-PACK-USER> (let ((a (aops:rand*  'single-float '(4 4)))
                      (b (aops:rand*  'single-float '(4)))
                      (c (aops:zeros* 'single-float '(4))))
                  (time (loop repeat 10000000 do (multiply-m44-by-v4 a b c))))
Evaluation took:
  0.087 seconds of real time
  0.086709 seconds of total run time (0.086709 user, 0.000000 system)
  100.00% CPU
  191,447,818 processor cycles
  0 bytes consed

NIL
```

Thus, the SIMD version is about 4 times faster, in accordance with our expectations. Note that the two versions perform the floating-point additions in a different order, so their results may differ slightly in the last bits; a proper discussion of accuracy is a topic for another day, and perhaps beyond my current grasp or time.

However, is this worth it in terms of performance? Let's say you want to rotate 100000 points stored in a 100000x4 array, and let's do it using magicl with the lapack backend. (Note: Rotation of a point (x y z) is simply multiplying the vector (x y z 1) by an appropriate 4x4 rotation matrix.)
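To make the rotation concrete before moving to magicl, here is a sketch of rotating the point (1 0 0) by 90 degrees about the z-axis using the `multiply-m44-by-v4` defined above. The row-major homogeneous matrix layout here is my own choice for illustration:

```lisp
(let* ((theta (/ (float pi 1.0) 2)) ; 90 degrees, coerced to single-float
       (c (cos theta))
       (s (sin theta))
       ;; Homogeneous 4x4 rotation matrix about the z-axis.
       (rotation (make-array '(4 4) :element-type 'single-float
                             :initial-contents
                             (list (list c (- s) 0.0 0.0)
                                   (list s c     0.0 0.0)
                                   (list 0.0 0.0 1.0 0.0)
                                   (list 0.0 0.0 0.0 1.0))))
       ;; The point (1 0 0), with a trailing 1 as the homogeneous coordinate.
       (point (make-array 4 :element-type 'single-float
                          :initial-contents '(1.0 0.0 0.0 1.0))))
  (multiply-m44-by-v4 rotation point))
;; => a vector close to #(0.0 1.0 0.0 1.0), up to rounding in cos/sin
```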

```
SIMD-PACK-USER> (ql:quickload '("magicl" "magicl/ext-lapack"))
("magicl" "magicl/ext-lapack")
SIMD-PACK-USER> (let ((points (magicl:rand '(100000 4) :type 'single-float))
                      (rotation-matrix (magicl:rand '(4 4) :type 'single-float))
                      (rotated-points (magicl:zeros '(100000 4) :type 'single-float)))
                  (magicl.backends:with-backends (:lapack)
                    (time (loop :repeat 100 :do (magicl:mult points rotation-matrix
                                                             :target rotated-points)))))
Evaluation took:
  0.140 seconds of real time
  0.138159 seconds of total run time (0.138159 user, 0.000000 system)
  98.57% CPU
  305,047,768 processor cycles
  65,280 bytes consed

NIL
```

A similar number of operations (10 million rotations overall) with AVX SIMD costs about 1.5 times less, taking about 0.09 seconds. However, the comparison is nontrivial: the larger the array, the smaller the relative overhead of generic-function dispatch, but the greater the cost of memory transfers. Even so, since the time complexity of matrix multiplication is superlinear even with all the known tricks, memory transfer might not be the bottleneck. The SIMD version also has the limitation of being restricted to Intel (and perhaps AMD) machines with AVX support - that means no M1s and no arm64 devices. I'm unsure if sb-simd has a way to "write once, run everywhere"; there do exist C libraries, such as SIMDe, that provide portability wrappers. So there's the non-portability tradeoff as well.

Measuring the performance gain (or loss?) obtained through the avoidance of hadd instructions is left as an exercise for the reader.