@ogrisel
Created January 31, 2021 23:28

Running some benchmarks of BLAS level 3 and LAPACK on Apple M1
import numpy as np

try:
    import tensorflow as tf
except ImportError:
    tf = None

from time import perf_counter


def timeit(func, *args, **kwargs):
    durations = []
    for i in range(10):
        tic = perf_counter()
        func(*args, **kwargs)
        toc = perf_counter()
        durations.append(toc - tic)
    durations.sort()
    return np.mean(durations[:-5])  # trim stragglers


for dtype in (np.float32, np.float64):
    dtype_name = dtype.__name__
    rng = np.random.default_rng(42)
    A = rng.normal(size=(1024, 1024)).astype(dtype)
    B = rng.normal(size=(1024, 1024)).astype(dtype)
    C = np.empty_like(A @ B)

    d = timeit(np.dot, A, B, out=C)
    GFLOP = A.shape[0] * B.shape[1] * (2 * A.shape[1] + 2) / 1e9
    print(f"[{dtype_name}] np.dot: {d * 1e3:.3f} ms, {GFLOP / d:.1f} GFLOP/s")

    d = timeit(np.linalg.svd, A)
    print(f"[{dtype_name}] np.linalg.svd: {d * 1e3:.3f} ms")

    if tf is not None:
        A = tf.constant(A)
        B = tf.constant(B)

        d = timeit(np.matmul, A, B)
        GFLOP = A.shape[0] * B.shape[1] * (2 * A.shape[1] + 2) / 1e9
        print(f"[{dtype_name}] tf.matmul: {d * 1e3:.3f} ms, {GFLOP / d:.1f} GFLOP/s")

        d = timeit(tf.linalg.svd, A)
        print(f"[{dtype_name}] tf.linalg.svd: {d * 1e3:.3f} ms")
@ogrisel commented Jan 31, 2021

Here are the results.

numpy with OpenBLAS from conda-forge

(openblas) ogrisel@mba ~ % OMP_NUM_THREADS=4 python bench_blas_lapack.py
[float32] np.dot: 5.946 ms, 361.5 GFLOP/s
[float32] np.linalg.svd: 284.573 ms
[float64] np.dot: 11.939 ms, 180.0 GFLOP/s
[float64] np.linalg.svd: 282.713 ms
(openblas) ogrisel@mba ~ % OMP_NUM_THREADS=8 python bench_blas_lapack.py
[float32] np.dot: 9.431 ms, 227.9 GFLOP/s
[float32] np.linalg.svd: 513.617 ms
[float64] np.dot: 18.921 ms, 113.6 GFLOP/s
[float64] np.linalg.svd: 487.073 ms

Comment: with the default number of threads (8), OpenBLAS performance on the Apple M1 is not optimal. Limiting it to 4 threads gives better performance.
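The same limit can also be applied from inside Python rather than through environment variables, for example with the threadpoolctl package (a minimal sketch, not part of the benchmark script above):

import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.default_rng(0)
A = rng.normal(size=(1024, 1024)).astype(np.float32)

# Cap every BLAS library loaded in this process to 4 threads for the
# duration of the context manager, then revert to the default.
with threadpool_limits(limits=4, user_api="blas"):
    C = A @ A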

numpy with Accelerate for BLAS and netlib for LAPACK

The numpy build comes from this conda-forge hack that uses the experimental isuruf/vecLibFort.
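To verify which BLAS/LAPACK a given numpy build is linked against, np.show_config() can be printed before running the benchmark (the exact output format depends on the numpy version and build):

import numpy as np

# Prints the build-time BLAS/LAPACK configuration (openblas,
# accelerate / vecLibFort, netlib, ...); the layout of this report
# varies across numpy versions.
print(np.__version__)
np.show_config()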

(accelerate-netlib) ogrisel@mba ~ %  python bench_blas_lapack.py
[float32] np.dot: 2.258 ms, 952.1 GFLOP/s
[float32] np.linalg.svd: 1868.197 ms
[float64] np.dot: 9.132 ms, 235.4 GFLOP/s
[float64] np.linalg.svd: 1866.980 ms

Comments:

  • Accelerate's BLAS level 3 performance is impressive in float32 but only comparable to OpenBLAS (4 threads) in float64
  • netlib's SVD is very slow

numpy and tensorflow from the Apple fork of TF

Using the numpy 1.18 wheel from https://github.com/apple/tensorflow_macos (0.1alpha1 release)

(tensforflow_macos) ogrisel@mba ~ % python bench_blas_lapack.py
[float32] np.dot: 2.272 ms, 946.0 GFLOP/s
[float32] np.linalg.svd: 288.297 ms
[float64] np.dot: 9.501 ms, 226.3 GFLOP/s
[float64] np.linalg.svd: 290.827 ms

Comments:

  • similar BLAS level 3 performance to the hacked conda-forge env from @isuruf
  • np.linalg.svd is now almost as fast as the OpenBLAS-backed SVD (configured with 4 threads)
  • [float32] np.dot: 2.272 ms, 946.0 GFLOP/s on a machine without a fan!!!
  • the numpy wheel from the apple/tensorflow_macos 0.1alpha1 release is mostly broken: many failing tests and segfaults.
  • the numpy builds from conda-forge (with OpenBLAS or the experimental Accelerate + netlib combo) have only a few failing tests, no crashes, and all scikit-learn tests pass.

Edit: a prior version of these results used OPENBLAS_NUM_THREADS instead of OMP_NUM_THREADS for the OpenBLAS runs. OPENBLAS_NUM_THREADS has no effect on the OpenBLAS build shipped by conda-forge; OMP_NUM_THREADS should be used instead.
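A way to check which threading layer the loaded OpenBLAS actually uses, and hence which environment variable is honored, is to inspect the running process with threadpoolctl (a sketch, assuming threadpoolctl is installed in the same environment):

from threadpoolctl import threadpool_info

# One entry per BLAS / OpenMP runtime loaded in the process, with its
# internal API ("openblas", "openmp", ...), the current thread count and
# the shared library path. An OpenBLAS built on top of OpenMP takes its
# limit from OMP_NUM_THREADS.
for lib in threadpool_info():
    print(lib["internal_api"], lib["num_threads"], lib["filepath"])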

@ogrisel commented Feb 1, 2021

Note: there is a problem in the code above: I use np.matmul on the tf tensors instead of tf.matmul. However, if I switch to tf.matmul I sometimes get weird results depending on whether I use constant tensors (backed by numpy arrays) or GPU-allocated data, where it seems that tensorflow eager mode is not that eager anymore.

So I would just not trust the TF results.
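One way to make such timings less ambiguous would be to force the result back to the host inside the timed call, so the measurement cannot stop before the product is actually computed (a hedged sketch; it also includes the device-to-host copy in the measured time):

import tensorflow as tf

def tf_matmul_sync(a, b):
    # In eager mode tf.matmul can return before the computation has
    # finished when the operands live on an accelerator; .numpy() blocks
    # until the result is available and copies it back to the host.
    return tf.matmul(a, b).numpy()

# e.g. d = timeit(tf_matmul_sync, A, B)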
