A Survey of Simple Vector Addition Performance of the Numerical Computing Libraries available in Common Lisp
Help with the following systems is required: CL-BLAPACK, MAXIMA, GSLL, XECTO
Implementation | Quicklisp | Speed | |
---|---|---|---|
ARRAY-OPERATIONS | Native | ✓ | Medium |
AVM | Native, Cuda | ✗ | Medium |
CLEM | Native | ✓ | Slow |
CL-BLAPACK | Blas/Lapack | ✗ | - |
FEMLISP-MATLISP | Native | ✓ | Fast |
GSLL | Blas/Lapack | ✓ | - |
LISP-MATRIX | Switchable | ✗ | Slow |
LLA | Blas/Lapack | ✓ | Fast |
MAGICL | Blas/Lapack | ✓ | Fast |
MAXIMA | - | ✗ | - |
NUMERICALS | Native | ✗ | Fast |
NUMCL | Native | ✓ | Slow |
PETALISP | Native | ✓ | Slow |
XECTO | Native | ✗ | - |
CL-USER> (let* ((size 1000)
(a (aops:zeros* 'double-float (list size)))
(b (aops:zeros* 'double-float (list size))))
(declare (optimize speed)
(type (array double-float) a b))
(time (loop repeat 1000 do (aops:vectorize (a b) (+ a b)))))
Evaluation took:
0.073 seconds of real time
0.073204 seconds of total run time (0.073204 user, 0.000000 system)
100.00% CPU
161,620,326 processor cycles
56,099,616 bytes consed
NIL
CL-USER> (let* ((size 1000)
(a (aops:zeros* 'double-float (list size)))
(b (aops:zeros* 'double-float (list size)))
(c (aops:zeros* 'double-float (list size))))
(declare (optimize speed)
(type (array double-float) a b c))
(time (loop repeat 1000 do (aops:vectorize! c (a b) (+ a b)))))
Evaluation took:
0.082 seconds of real time
0.082054 seconds of total run time (0.082054 user, 0.000000 system)
100.00% CPU
181,161,538 processor cycles
48,070,656 bytes consed
NIL
Should be loadable with a simple clone.
Example picked up from https://github.com/takagi/avm/blob/master/samples/vector-add.lisp
(in-package :cl-user)
(defpackage avm.samples.vector-add
(:use :cl
:avm)
(:export :main))
(in-package :avm.samples.vector-add)
(defkernel vector-add (c a b)
(setf (aref c i) (the double (+ (aref a i) (aref b i)))))
(defun random-init (array n)
(dotimes (i n)
(setf (array-aref array i) (random 1.0d0))))
(defun verify-result (as bs cs n)
(dotimes (i n)
(let ((a (array-aref as i))
(b (array-aref bs i))
(c (array-aref cs i)))
(unless (= (+ a b) c)
(error "Verification failed: i=~A, a=~A, b=~A, c=~A" i a b c)))))
(defun main (n &optional dev-id)
(declare (optimize speed))
(with-cuda (dev-id)
(with-arrays ((a double n)
(b double n)
(c double n))
(random-init a n)
(random-init b n)
(time
(loop repeat 1000 do (vector-add c a b)))
(verify-result a b c n))))
CL-USER> (avm.samples.vector-add:main 1000)
Evaluation took:
0.043 seconds of real time
0.043048 seconds of total run time (0.043048 user, 0.000000 system)
100.00% CPU
95,034,454 processor cycles
88,944 bytes consed
NIL
- Requires: should work from quicklisp
- Author hopes it to make it efficient some day
CL-USER> (let* ((size 1000)
(a (clem:zero-matrix 1 size))
(b (clem:zero-matrix 1 size)))
(declare (optimize speed)
(type clem:matrix a b))
(time (loop repeat 1000 do (clem:mat-add a b))))
Evaluation took:
1.021 seconds of real time
1.022443 seconds of total run time (0.990447 user, 0.031996 system)
[ Run times consist of 0.073 seconds GC time, and 0.950 seconds non-GC time. ]
100.10% CPU
2,255,802,064 processor cycles
104,763,424 bytes consed
NIL
CL-USER> (let* ((size 1000)
(a (clem:zero-matrix 1 size))
(b (clem:zero-matrix 1 size)))
(declare (optimize speed)
(type clem:matrix a b))
(time (loop repeat 1000 do (clem:mat-add a b :in-place t))))
Evaluation took:
0.751 seconds of real time
0.003347 seconds of total run time (0.003347 user, 0.000000 system)
0.40% CPU
1 form interpreted
1,657,839,660 processor cycles
1,184,672 bytes consed
before it was aborted by a non-local transfer of control.
; Evaluation aborted on #<SIMPLE-ERROR "not yet supported" {10057CC153}>.
Requires: foreign-numeric-vector Examples: https://github.com/blindglobe/cl-blapack/blob/master/examples.lisp Related: https://stackoverflow.com/questions/48666508/why-are-there-no-blas-routines-for-addition-and-subtraction
Issues: How do I test the addition of two vectors??? PS: I am not familiar with CFFI
CL-USER> (let* ((size 1000)
(a (fl.matlisp:zeros size)) ; These actually are 2d matrices
(b (fl.matlisp:zeros size)))
(declare (optimize speed))
(time (fl.matlisp:m+ a b))
nil)
Evaluation took:
0.008 seconds of real time
0.008094 seconds of total run time (0.008090 user, 0.000004 system)
100.00% CPU
17,849,472 processor cycles
8,000,016 bytes consed
NIL
Untested due to lack of knowledge
Requires: plenty; but also cl-blapack and foreign-numeric-vector from above that are not in quicklisp.
LISP-MATRIX> (let* ((size 1000)
(a (lisp-matrix:make-vector size :implementation :lisp-array))
(b (lisp-matrix:make-vector size :implementation :lisp-array)))
(declare (optimize speed))
(time (loop repeat 1000 do (lisp-matrix:m+ a b))))
Evaluation took:
0.757 seconds of real time
0.756834 seconds of total run time (0.756834 user, 0.000000 system)
100.00% CPU
1,671,227,204 processor cycles
25,168,368 bytes consed
NIL
LISP-MATRIX> (let* ((size 1000)
(a (lisp-matrix:make-vector size :implementation :foreign-array))
(b (lisp-matrix:make-vector size :implementation :foreign-array)))
(declare (optimize speed))
(time (loop repeat 1000 do (lisp-matrix:m+ a b))))
Evaluation took:
0.664 seconds of real time
0.663699 seconds of total run time (0.644143 user, 0.019556 system)
100.00% CPU
1,465,432,156 processor cycles
17,261,808 bytes consed
NIL
CL-USER> (defparameter *lla-configuration*
'(:efficiency-warnings (:array-type :array-conversion)))
*LLA-CONFIGURATION*
CL-USER> (ql:quickload "lla")
To load "lla":
Load 1 ASDF system:
lla
; Loading "lla"
.........
("lla")
CL-USER> (let* ((size 1000)
(a (make-array size :element-type 'double-float))
(b (make-array size :element-type 'double-float)))
(declare (optimize speed)
(type (array double-float) a b))
(time (loop repeat 1000 do (lla:axpy! 1 a b))))
Evaluation took:
0.007 seconds of real time
0.006143 seconds of total run time (0.005861 user, 0.000282 system)
85.71% CPU
13,553,164 processor cycles
163,840 bytes consed
NIL
CL-USER> (ql:quickload '("magicl" "magicl/ext-blas"))
To load "magicl":
Load 1 ASDF system:
magicl
; Loading "magicl"
To load "magicl/ext-blas":
Load 1 ASDF system:
magicl/ext-blas
; Loading "magicl/ext-blas"
("magicl" "magicl/ext-blas")
CL-USER> (let* ((size 1000)
(a (magicl:zeros (list size)))
(b (magicl:zeros (list size))))
(declare (optimize speed)
(type magicl:vector/double-float a b))
(magicl.backends:with-backends (:blas)
(time (loop repeat 1000 do (magicl:.+ a b)))))
Evaluation took:
0.020 seconds of real time
0.019953 seconds of total run time (0.011981 user, 0.007972 system)
100.00% CPU
44,039,332 processor cycles
8,105,248 bytes consed
NIL
CL-USER> (let* ((size 1000)
(a (magicl:zeros (list size)))
(b (magicl:zeros (list size)))
(c (magicl:zeros (list size))))
(declare (optimize speed)
(type magicl:vector/double-float a b c))
(magicl.backends:with-backends (:blas)
(time (loop repeat 1000 do (magicl:.+ a b c)))))
Evaluation took:
0.012 seconds of real time
0.012168 seconds of total run time (0.012144 user, 0.000024 system)
100.00% CPU
26,859,506 processor cycles
65,536 bytes consed
NIL
Untested due to lack of knowledge
Requires: should work with a simple git clone and SBCL latest release
CL-USER> (let* ((size 1000)
(numericals:*type* 'double-float)
(a (numericals:zeros size))
(b (numericals:zeros size)))
(declare (optimize speed)
(type (array double-float) a b))
(time (loop repeat 1000 do (numericals:+ a b))))
; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
Evaluation took:
0.008 seconds of real time
0.007786 seconds of total run time (0.007786 user, 0.000000 system)
100.00% CPU
17,174,838 processor cycles
8,439,488 bytes consed
NIL
CL-USER> (let* ((size 1000)
(numericals:*type* 'double-float)
(a (numericals:zeros size))
(b (numericals:zeros size))
(c (numericals:zeros size)))
(declare (optimize speed)
(type (array double-float) a b c))
(time (loop repeat 1000 do (numericals:+ a b :out c))))
; note: Unable to determine optimizability of call to NUMERICALS:+ because type of C (ARRAY
DOUBLE-FLOAT) is not exact
; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
; note: Unable to optimize NU:ZEROS without knowing type of SIZE at compile-time.
Evaluation took:
0.002 seconds of real time
0.002089 seconds of total run time (0.002089 user, 0.000000 system)
100.00% CPU
4,600,666 processor cycles
524,000 bytes consed
NIL
CL-USER> (let* ((size 1000)
(a (numcl:zeros (list size) :type 'double-float))
(b (numcl:zeros (list size) :type 'double-float)))
(declare (optimize speed)
(type (array double-float) a b))
(time (loop repeat 1000 do (numcl:+ a b))))
; some style warnings ignored here
Evaluation took:
0.615 seconds of real time
0.615519 seconds of total run time (0.615415 user, 0.000104 system)
100.16% CPU
26 forms interpreted
2,482 lambdas converted
1,358,418,382 processor cycles
84,625,632 bytes consed
NIL
CL-USER> (let* ((size 1000)
(a (make-array size :element-type 'double-float))
(b (make-array size :element-type 'double-float)))
(declare (optimize speed)
(type (array double-float) a b))
(time (loop repeat 1000 do (petalisp:compute (petalisp:alpha #'+ a b)))))
Evaluation took:
0.252 seconds of real time
0.552736 seconds of total run time (0.383944 user, 0.168792 system)
219.44% CPU
556,173,034 processor cycles
17,688,944 bytes consed
NIL
That is a very interesting survey, thanks!
That reminds me I should add a ahead-of-time mode to Petalisp, where all scheduling and preprocessing is only done once to produce a fast function. Because by default,
compute
performs all analysis and scheduling at run time. For large data sets or programs, this doesn't really matter. But in cases where you repeatedly perform the same action (like this benchmark) and on 'small' data sets, it might be worthwhile to have such a ahead-of-time mode.