digikar99/sb-simd-m44-by-v4.md

## sb-simd-m44-by-v4.md

      
    Raw
  

              sb-simd-m44-by-v4.md
            
          
    sb-simd: Multiplying a 4x4 matrix with a 4-length vector (AVX)

(require 'sb-simd)
(defpackage :simd-pack-user
  (:use :sb-simd-avx :cl))
(in-package :simd-pack-user)
I'm assuming you are referring to the following operation:
| a11 a12 a13 a14 |   | x |   | a11*x + a12*y + a13*z + a14*w |
| a21 a22 a23 a24 |   | y |   | a21*x + a22*y + a23*z + a24*w |
| a31 a32 a33 a34 | X | z | = | a31*x + a32*y + a33*z + a34*w |
| a41 a42 a43 a44 |   | w |   | a41*x + a42*y + a43*z + a44*w |

This might be
a good starting point for SIMD reference on intel architectures.
Let's start with the idea that SIMD stands for single instruction multiple data,
and that therefore, we should look for instances of operations we can do "at
once". Assuming that we can do 4 operations at once, firstly, we need 4 sets of
element-wise multiplications. After that, we can obtain the elements of the
following 4x4 intermediate matrix:
| a11*x   a12*y   a13*z   a14*w |
| a21*x   a22*y   a23*z   a24*w |
| a31*x   a32*y   a33*z   a34*w |
| a41*x   a42*y   a43*z   a44*w |

This intermediate matrix can be obtained by multiplying each row of the original
matrix by the vector to obtain the rows of this new intermediate matrix using a
single element wise multiplication each.
From this point, one can obtain the required resulting vector by "horizontal
sum". But horizontal sums aka hadd instructions are a couple of times slower
than their vertical counterparts. (See the latency and throughput of the hadd
vs add instructions in the intel reference sheet above. Below, I assume you
have access to a machine with AVX support.)
Thus, we will need to rearrange the 4 simd packs corresponding to the 4 rows of
the intermediate matrix so that each of the 4 new simd packs will correspond to
a column rather than a row. This - transpose - seems to be a common operation,
so let's write this as a function then! The common instructions for this task
include shuffle, permute, and permute2f128. Of these the lattermost is the most
expensive. Here's at least one way in which this could be done:
(defun %transpose-m44 (a b c d)
  (declare (type f32.4 a b c d))
  (let* ((ab12 (f32.4-shuffle a b #b01000100))
         (ab34 (f32.4-shuffle a b #b11101110))
         (cd12 (f32.4-shuffle c d #b01000100))
         (cd34 (f32.4-shuffle c d #b11101110))
         (ab12 (f32.4-permute ab12 #b11011000))
         (ab34 (f32.4-permute ab34 #b11011000))
         (cd12 (f32.4-permute cd12 #b11011000))
         (cd34 (f32.4-permute cd34 #b11011000))
         (abcd1 (f32.4-shuffle ab12 cd12 #b01000100))
         (abcd2 (f32.4-shuffle ab12 cd12 #b11101110))
         (abcd3 (f32.4-shuffle ab34 cd34 #b01000100))
         (abcd4 (f32.4-shuffle ab34 cd34 #b11101110)))
    (values abcd1 abcd2 abcd3 abcd4)))
Once we obtain the transpose, we can simply take the vertical sum.
(defun %multiply-m44-by-v4 (m1 m2 m3 m4 v)
  (declare (type f32.4 m1 m2 m3 m4 v))
  (let ((i1 (f32.4* m1 v))
        (i2 (f32.4* m2 v))
        (i3 (f32.4* m3 v))
        (i4 (f32.4* m4 v)))
    (multiple-value-bind (i1 i2 i3 i4)
        (%transpose-m44 i1 i2 i3 i4)
      (f32.4+ i1 i2 i3 i4))))
Let's test this out now with the multiplication of the following matrix-vector
multiplication.
| 01 02 03 04 |   | -1 |   |  -6 |
| 02 03 04 05 |   | -2 |   | -10 |
| 03 04 05 06 | X | -3 | = | -14 |
| 04 05 06 07 |   |  2 |   | -18 |

SIMD-PACK-USER> (%multiply-m44-by-v4 (make-f32.4 1 2 3 4)
                                     (make-f32.4 2 3 4 5)
                                     (make-f32.4 3 4 5 6)
                                     (make-f32.4 4 5 6 7)
                                     (make-f32.4 -1 -2 -3 2))
#<SB-EXT:SIMD-PACK -6.0000000e+0 -1.0000000e+1 -1.4000000e+1 -1.8000000e+1>
Now we can make things work for an actual 4x4 array a 4-length vector.
(defun multiply-m44-by-v4 (m44 v4
                         &optional (dst (make-array 4 :element-type 'single-float)))
  (declare (optimize speed)
           (type (simple-array single-float (4 4))  m44)
           (type (simple-array single-float (4)) v4 dst))
  (let* ((m1 (f32.4-row-major-aref m44  0))
         (m2 (f32.4-row-major-aref m44  4))
         (m3 (f32.4-row-major-aref m44  8))
         (m4 (f32.4-row-major-aref m44 12))
         ( v (f32.4-row-major-aref  v4  0))
         (result (%multiply-m44-by-v4 m1 m2 m3 m4 v)))
    (setf (f32.4-row-major-aref dst 0) result)
    dst))
And let's write a ANSI CL version for checking correctness and performance:
(defun ansi-multiply-m44-by-v4
    (m44 v4
     &optional (dst (make-array 4 :element-type 'single-float)))
  (declare (optimize speed)
           (type (simple-array single-float (4 4))  m44)
           (type (simple-array single-float (4)) v4 dst))
  (loop :for i :below 4 :do (setf (row-major-aref dst i) 0.0))
  (loop :for i :below 4
        :do (loop :for j :below 4
                  :do (incf (row-major-aref dst i)
                            (* (aref m44 i j)
                               (row-major-aref v4 j)))))
  dst)
Indeed, this seems to work.
SIMD-PACK-USER> (let ((a (make-array '(4 4) :element-type 'single-float
                                     :initial-contents '((1.0 2.0 3.0 4.0)
                                                         (2.0 3.0 4.0 5.0)
                                                         (3.0 4.0 5.0 6.0)
                                                         (4.0 5.0 6.0 7.0))))
                      (b (make-array '(4) :element-type 'single-float
                                     :initial-contents '(-1.0 -2.0 -3.0 2.0))))
                  (print (ansi-multiply-m44-by-v4 a b))
                  (print (multiply-m44-by-v4 a b))
                  nil)

#(-6.0 -10.0 -14.0 -18.0)
#(-6.0 -10.0 -14.0 -18.0)
NIL
Now, we can go back and declaim inlines to get a single function.
(declaim (inline %transpose-m44 %multiply-m44-by-v4))
(declaim (inline ansi-multiply-m44-by-v4 multiply-m44-by-v4))
Assuming you recompile the above after adding this line, now we are in a
position to measure the performance.
(ql:quickload "array-operations")
Following this:
SIMD-PACK-USER> (let ((a (aops:rand*  'single-float '(4 4)))
                      (b (aops:rand*  'single-float '(4)))
                      (c (aops:zeros* 'single-float '(4))))
                  (time (loop repeat 10000000 do (ansi-multiply-m44-by-v4 a b c))))
Evaluation took:
  0.335 seconds of real time
  0.334162 seconds of total run time (0.334162 user, 0.000000 system)
  99.70% CPU
  737,838,564 processor cycles
  0 bytes consed

NIL
SIMD-PACK-USER> (let ((a (aops:rand*  'single-float '(4 4)))
                      (b (aops:rand*  'single-float '(4)))
                      (c (aops:zeros* 'single-float '(4))))
                  (time (loop repeat 10000000 do (multiply-m44-by-v4 a b c))))
Evaluation took:
  0.087 seconds of real time
  0.086709 seconds of total run time (0.086709 user, 0.000000 system)
  100.00% CPU
  191,447,818 processor cycles
  0 bytes consed

NIL
Thus, the SIMD version is about 4 times faster, in accordance with our
expectations. Of note is that the operations are approximate and may not yield
super-accurate answers; that forms a topic of discussion for another day, and
perhaps beyond my current grasp or time.
However, is this worth it in terms of performance? Let's say you want to rotate
100000 points stored in a 100000x4 array, and let's do it using magicl with the
lapack backend. (Note: Rotation of a point (x y z) is simply multiplying the
vector (x y z 1) by an appropriate 4x4 rotation matrix.)
SIMD-PACK-USER> (ql:quickload '("magicl" "magicl/ext-lapack"))
To load "magicl":
  Load 1 ASDF system:
    magicl
; Loading "magicl"

To load "magicl/ext-lapack":
  Load 1 ASDF system:
    magicl/ext-lapack
; Loading "magicl/ext-lapack"

("magicl" "magicl/ext-lapack")
SIMD-PACK-USER> (let ((points (magicl:rand '(100000 4) :type 'single-float))
                      (rotation-matrix (magicl:rand '(4 4) :type 'single-float))
                      (rotated-points (magicl:zeros '(100000 4) :type 'single-float)))
                  (magicl.backends:with-backends (:lapack)
                    (time (loop :repeat 100 :do (magicl:mult points rotation-matrix
                                                             :target rotated-points)))))
Evaluation took:
  0.140 seconds of real time
  0.138159 seconds of total run time (0.138159 user, 0.000000 system)
  98.57% CPU
  305,047,768 processor cycles
  65,280 bytes consed

NIL
The similar amount of operations (10 million rotations overall) with AVX SIMD
costs about 1.5 times less, taking about 0.09 seconds. However, this might be
nontrivial to compare, because larger the array, lesser the relative overhead of
generic function dispatch, but greater the overhead of memory transfers. But
even then, since the time complexity of matrix multiplication is superlinear
even by using all the tricks, memory transfer might not be a bottleneck. The
SIMD version also has the limitation of being limited to machines with Intel
(and perhaps AMD) machines with AVX support - that means no M1s, no arm64
devices. I'm unsure if sb-simd has a way to "write once, run everywhere"; there
do exist C libraries to provide wrappers, such as
SIMDe. So, there's the nonportability
tradeoff as well.
Measuring the performance gain (or loss?) obtained through the avoidance of hadd
instructions is left as an exercise for the reader.