Skip to content

Instantly share code, notes, and snippets.

@digikar99
Last active April 4, 2023 01:42
Show Gist options
  • Save digikar99/16066dbf24b8789c969ea58837e0fbef to your computer and use it in GitHub Desktop.
Save digikar99/16066dbf24b8789c969ea58837e0fbef to your computer and use it in GitHub Desktop.
A Survey of Simple Vector Addition Performance of the Numerical Computing Libraries available in Common Lisp

A Survey of Simple Vector Addition Performance of the Numerical Computing Libraries available in Common Lisp

Help with the following systems is required: CL-BLAPACK, MAXIMA, GSLL, XECTO

ImplementationQuicklispSpeed
ARRAY-OPERATIONSNativeMedium
AVMNative, CudaMedium
CLEMNativeSlow
CL-BLAPACKBlas/Lapack-
FEMLISP-MATLISPNativeFast
GSLLBlas/Lapack-
LISP-MATRIXSwitchableSlow
LLABlas/LapackFast
MAGICLBlas/LapackFast
MAXIMA--
NUMERICALSNativeFast
NUMCLNativeSlow
PETALISPNativeSlow
XECTONative-
CL-USER> (let* ((size 1000)
                (a (aops:zeros* 'double-float (list size)))
                (b (aops:zeros* 'double-float (list size))))
           (declare (optimize speed)
                    (type (array double-float) a b))
           (time (loop repeat 1000 do (aops:vectorize (a b) (+ a b)))))
Evaluation took:
  0.073 seconds of real time
  0.073204 seconds of total run time (0.073204 user, 0.000000 system)
  100.00% CPU
  161,620,326 processor cycles
  56,099,616 bytes consed
  
NIL
CL-USER> (let* ((size 1000)
                (a (aops:zeros* 'double-float (list size)))
                (b (aops:zeros* 'double-float (list size)))
                (c (aops:zeros* 'double-float (list size))))
           (declare (optimize speed)
                    (type (array double-float) a b c))
           (time (loop repeat 1000 do (aops:vectorize! c (a b) (+ a b)))))
Evaluation took:
  0.082 seconds of real time
  0.082054 seconds of total run time (0.082054 user, 0.000000 system)
  100.00% CPU
  181,161,538 processor cycles
  48,070,656 bytes consed
  
NIL

Should be loadable with a simple clone.

Example picked up from https://github.com/takagi/avm/blob/master/samples/vector-add.lisp

(in-package :cl-user)
(defpackage avm.samples.vector-add
  (:use :cl
        :avm)
  (:export :main))
(in-package :avm.samples.vector-add)

(defkernel vector-add (c a b)
  (setf (aref c i) (the double (+ (aref a i) (aref b i)))))

(defun random-init (array n)
  (dotimes (i n)
    (setf (array-aref array i) (random 1.0d0))))

(defun verify-result (as bs cs n)
  (dotimes (i n)
    (let ((a (array-aref as i))
          (b (array-aref bs i))
          (c (array-aref cs i)))
      (unless (= (+ a b) c)
        (error "Verification failed: i=~A, a=~A, b=~A, c=~A" i a b c)))))

(defun main (n &optional dev-id)
  (declare (optimize speed))
  (with-cuda (dev-id)
    (with-arrays ((a double n)
                  (b double n)
                  (c double n))
      (random-init a n)
      (random-init b n)
      (time
       (loop repeat 1000 do (vector-add c a b)))
      (verify-result a b c n))))
CL-USER> (avm.samples.vector-add:main 1000)
Evaluation took:
  0.043 seconds of real time
  0.043048 seconds of total run time (0.043048 user, 0.000000 system)
  100.00% CPU
  95,034,454 processor cycles
  88,944 bytes consed
  
NIL
  • Requires: should work from quicklisp
  • Author hopes it to make it efficient some day
CL-USER> (let* ((size 1000)
                (a (clem:zero-matrix 1 size))
                (b (clem:zero-matrix 1 size)))
           (declare (optimize speed)
                    (type clem:matrix a b))
           (time (loop repeat 1000 do (clem:mat-add a b))))
Evaluation took:
  1.021 seconds of real time
  1.022443 seconds of total run time (0.990447 user, 0.031996 system)
  [ Run times consist of 0.073 seconds GC time, and 0.950 seconds non-GC time. ]
  100.10% CPU
  2,255,802,064 processor cycles
  104,763,424 bytes consed
  
NIL
CL-USER> (let* ((size 1000)
                (a (clem:zero-matrix 1 size))
                (b (clem:zero-matrix 1 size)))
           (declare (optimize speed)
                    (type clem:matrix a b))
           (time (loop repeat 1000 do (clem:mat-add a b :in-place t))))
Evaluation took:
  0.751 seconds of real time
  0.003347 seconds of total run time (0.003347 user, 0.000000 system)
  0.40% CPU
  1 form interpreted
  1,657,839,660 processor cycles
  1,184,672 bytes consed
  
  before it was aborted by a non-local transfer of control.
  
; Evaluation aborted on #<SIMPLE-ERROR "not yet supported" {10057CC153}>.

Requires: foreign-numeric-vector Examples: https://github.com/blindglobe/cl-blapack/blob/master/examples.lisp Related: https://stackoverflow.com/questions/48666508/why-are-there-no-blas-routines-for-addition-and-subtraction

Issues: How do I test the addition of two vectors??? PS: I am not familiar with CFFI

CL-USER> (let* ((size 1000)
                (a (fl.matlisp:zeros size)) ; These actually are 2d matrices
                (b (fl.matlisp:zeros size)))
           (declare (optimize speed))
           (time (fl.matlisp:m+ a b))
           nil)
Evaluation took:
  0.008 seconds of real time
  0.008094 seconds of total run time (0.008090 user, 0.000004 system)
  100.00% CPU
  17,849,472 processor cycles
  8,000,016 bytes consed
  
NIL

Untested due to lack of knowledge

Requires: plenty; but also cl-blapack and foreign-numeric-vector from above that are not in quicklisp.

LISP-MATRIX> (let* ((size 1000)
                (a (lisp-matrix:make-vector size :implementation :lisp-array))
                (b (lisp-matrix:make-vector size :implementation :lisp-array)))
               (declare (optimize speed))
               (time (loop repeat 1000 do (lisp-matrix:m+ a b))))
Evaluation took:
  0.757 seconds of real time
  0.756834 seconds of total run time (0.756834 user, 0.000000 system)
  100.00% CPU
  1,671,227,204 processor cycles
  25,168,368 bytes consed
  
NIL
LISP-MATRIX> (let* ((size 1000)
                (a (lisp-matrix:make-vector size :implementation :foreign-array))
                (b (lisp-matrix:make-vector size :implementation :foreign-array)))
               (declare (optimize speed))
               (time (loop repeat 1000 do (lisp-matrix:m+ a b))))
Evaluation took:
  0.664 seconds of real time
  0.663699 seconds of total run time (0.644143 user, 0.019556 system)
  100.00% CPU
  1,465,432,156 processor cycles
  17,261,808 bytes consed
  
NIL
CL-USER> (defparameter *lla-configuration*
           '(:efficiency-warnings (:array-type :array-conversion)))
*LLA-CONFIGURATION*
CL-USER> (ql:quickload "lla")
To load "lla":
  Load 1 ASDF system:
    lla
; Loading "lla"
.........
("lla")
CL-USER> (let* ((size 1000)
                (a (make-array size :element-type 'double-float))
                (b (make-array size :element-type 'double-float)))
           (declare (optimize speed)
                    (type (array double-float) a b))
           (time (loop repeat 1000 do (lla:axpy! 1 a b))))
Evaluation took:
  0.007 seconds of real time
  0.006143 seconds of total run time (0.005861 user, 0.000282 system)
  85.71% CPU
  13,553,164 processor cycles
  163,840 bytes consed
  
NIL
CL-USER> (ql:quickload '("magicl" "magicl/ext-blas"))
To load "magicl":
  Load 1 ASDF system:
    magicl
; Loading "magicl"

To load "magicl/ext-blas":
  Load 1 ASDF system:
    magicl/ext-blas
; Loading "magicl/ext-blas"

("magicl" "magicl/ext-blas")
CL-USER> (let* ((size 1000)
                (a (magicl:zeros (list size)))
                (b (magicl:zeros (list size))))
           (declare (optimize speed)
                    (type magicl:vector/double-float a b))
           (magicl.backends:with-backends (:blas)
             (time (loop repeat 1000 do (magicl:.+ a b)))))
Evaluation took:
  0.020 seconds of real time
  0.019953 seconds of total run time (0.011981 user, 0.007972 system)
  100.00% CPU
  44,039,332 processor cycles
  8,105,248 bytes consed

NIL
CL-USER> (let* ((size 1000)
                (a (magicl:zeros (list size)))
                (b (magicl:zeros (list size)))
                (c (magicl:zeros (list size))))
           (declare (optimize speed)
                    (type magicl:vector/double-float a b c))
           (magicl.backends:with-backends (:blas)
             (time (loop repeat 1000 do (magicl:.+ a b c)))))
Evaluation took:
  0.012 seconds of real time
  0.012168 seconds of total run time (0.012144 user, 0.000024 system)
  100.00% CPU
  26,859,506 processor cycles
  65,536 bytes consed

NIL

Untested due to lack of knowledge

Requires: should work with a simple git clone and SBCL latest release

CL-USER> (let* ((size 1000)
                (numericals:*type* 'double-float)
                (a (numericals:zeros size))
                (b (numericals:zeros size)))
           (declare (optimize speed)
                    (type (array double-float) a b))
           (time (loop repeat 1000 do (numericals:+ a b))))

; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
Evaluation took:
  0.008 seconds of real time
  0.007786 seconds of total run time (0.007786 user, 0.000000 system)
  100.00% CPU
  17,174,838 processor cycles
  8,439,488 bytes consed
  
NIL
CL-USER> (let* ((size 1000)
                (numericals:*type* 'double-float)
                (a (numericals:zeros size))
                (b (numericals:zeros size))
                (c (numericals:zeros size)))
           (declare (optimize speed)
                    (type (array double-float) a b c))
           (time (loop repeat 1000 do (numericals:+ a b :out c))))
; note: Unable to determine optimizability of call to NUMERICALS:+ because type of C (ARRAY
                                                                                      DOUBLE-FLOAT) is not exact
; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
Evaluation took:
  0.002 seconds of real time
  0.002089 seconds of total run time (0.002089 user, 0.000000 system)
  100.00% CPU
  4,600,666 processor cycles
  524,000 bytes consed
  
NIL
CL-USER> (let* ((size 1000)
                (a (numcl:zeros (list size) :type 'double-float))
                (b (numcl:zeros (list size) :type 'double-float)))
           (declare (optimize speed)
                    (type (array double-float) a b))
           (time (loop repeat 1000 do (numcl:+ a b))))
; some style warnings ignored here
Evaluation took:
  0.615 seconds of real time
  0.615519 seconds of total run time (0.615415 user, 0.000104 system)
  100.16% CPU
  26 forms interpreted
  2,482 lambdas converted
  1,358,418,382 processor cycles
  84,625,632 bytes consed
  
NIL
CL-USER> (let* ((size 1000)
                (a (make-array size :element-type 'double-float))
                (b (make-array size :element-type 'double-float)))
           (declare (optimize speed)
                    (type (array double-float) a b))
           (time (loop repeat 1000 do (petalisp:compute (petalisp:alpha #'+ a b)))))
Evaluation took:
  0.252 seconds of real time
  0.552736 seconds of total run time (0.383944 user, 0.168792 system)
  219.44% CPU
  556,173,034 processor cycles
  17,688,944 bytes consed
  
NIL
@marcoheisig
Copy link

That is a very interesting survey, thanks!

That reminds me I should add a ahead-of-time mode to Petalisp, where all scheduling and preprocessing is only done once to produce a fast function. Because by default, compute performs all analysis and scheduling at run time. For large data sets or programs, this doesn't really matter. But in cases where you repeatedly perform the same action (like this benchmark) and on 'small' data sets, it might be worthwhile to have such a ahead-of-time mode.

@digikar99
Copy link
Author

I am not familiar with petalisp's concepts, but may be if there is something that allows lazy iteration, loop can be put inside the compute?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment