Fixing MKL on AMD Zen CPU

As per discussion on Reddit, it seems a workaround for Intel MKL's notorious throttling of SIMD code paths on AMD Zen CPUs is as simple as setting the MKL_DEBUG_CPU_TYPE=5 environment variable.
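
For reference, the flag can be enabled per-invocation from the shell (the script name below is just a placeholder). Note that MKL reads the variable when the library loads, so it must be set before NumPy, PyTorch, or TensorFlow is imported:

$ MKL_DEBUG_CPU_TYPE=5 python benchmark.py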

Benchmarks

All three scripts are executed in the same Python 3.7 environment on a first-gen AMD Zen CPU (Threadripper 1950X).
The difference will be even bigger on newer models, as first-gen Zen executes 256-bit AVX2 instructions as two 128-bit operations.

4096x4096 Matrix Multiplication

| Library    | OpenBLAS | MKL Default | MKL With Flag |
|------------|----------|-------------|---------------|
| NumPy      | 0.58s    | 1.00s       | 0.56s         |
| PyTorch    | N/A      | 0.48s       | 0.26s         |
| TensorFlow | 0.22s    | 0.47s       | 0.20s         |

Eigendecomposition

| Library    | OpenBLAS | MKL Default | MKL With Flag |
|------------|----------|-------------|---------------|
| NumPy      | 11.82s   | 7.54s       | 6.67s         |
| PyTorch    | N/A      | 2.25s       | 2.06s         |
| TensorFlow | 8.61s    | 6.51s       | 6.73s         |

Note: TensorFlow only exposes a self-adjoint eigensolver (tf.linalg.eigh), whereas the NumPy and PyTorch scripts use a general one, hence the discrepancy.

Full results: NumPy, PyTorch, and TensorFlow.

Environment

The MKL environment is set up through Anaconda with the following commands:

$ conda create -n py37mkl python=3.7 && conda activate py37mkl
$ conda install numpy "blas=*=mkl"
$ conda install -c pytorch pytorch
$ conda install -c anaconda tensorflow-mkl

The OpenBLAS environment is set up through Anaconda with the following commands:

$ conda create -n py37nomkl python=3.7 && conda activate py37nomkl
$ conda install nomkl
$ conda install numpy "blas=*=openblas"
$ pip install tensorflow
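
To verify which BLAS backend an environment actually links against (a quick sanity check, not part of the original setup), NumPy can print its build configuration:

$ python -c "import numpy; numpy.show_config()"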

Credits

The NumPy benchmark was created by Markus Beuckelmann, adapted for PyTorch by /u/une-transaction, and adapted for TensorFlow by me.
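
The NumPy version of the benchmark: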

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
import numpy as np
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
np.random.seed(0)

size = 4096
A, B = np.random.random((size, size)), np.random.random((size, size))
C, D = np.random.random((size * 128,)), np.random.random((size * 128,))
E = np.random.random((int(size / 2), int(size / 4)))
F = np.random.random((int(size / 2), int(size / 2)))
F = np.dot(F, F.T)
G = np.random.random((int(size / 2), int(size / 2)))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    np.dot(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    np.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 3
t = time()
for i in range(N):
    np.linalg.svd(E, full_matrices=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 3
t = time()
for i in range(N):
    np.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    np.linalg.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
import torch
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
torch.manual_seed(0)

size = 4096
A, B = torch.rand(size, size), torch.rand(size, size)
C, D = torch.rand(size * 128), torch.rand(size * 128)
E = torch.rand(int(size / 2), int(size / 4))
F = torch.rand(int(size / 2), int(size / 2))
F = torch.matmul(F, F.T)
G = torch.rand(int(size / 2), int(size / 2))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    torch.matmul(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    torch.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 3
t = time()
for i in range(N):
    torch.svd(E, some=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 3
t = time()
for i in range(N):
    torch.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    torch.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
from time import time
import tensorflow as tf

# Let's take the randomness out of random numbers (for reproducibility)
tf.random.set_seed(0)

size = 4096
A, B = tf.random.uniform((size, size)), tf.random.uniform((size, size))
C, D = tf.random.uniform((size * 128,)), tf.random.uniform((size * 128,))
E = tf.random.uniform((int(size / 2), int(size / 4)))
F = tf.random.uniform((int(size / 2), int(size / 2)), dtype=tf.float64)
F = tf.matmul(F, F, transpose_b=True)
G = tf.random.uniform((int(size / 2), int(size / 2)))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    tf.matmul(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    tf.tensordot(C, D, axes=1)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 3
t = time()
for i in range(N):
    tf.linalg.svd(E, full_matrices=True)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 3
t = time()
for i in range(N):
    tf.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    tf.linalg.eigh(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
Miffyli commented Feb 13, 2020

Here are results with a newer Ryzen 9 3950X (Zen 2), using the same environments and scripts:

4096x4096 Matrix Multiplication

| Library    | OpenBLAS | MKL Default | MKL With Flag |
|------------|----------|-------------|---------------|
| NumPy      | 0.28s    | 0.54s       | 0.24s         |
| PyTorch    | N/A      | 0.32s       | 0.12s         |
| TensorFlow | 0.11s    | 0.30s       | 0.11s         |

Eigendecomposition

| Library    | OpenBLAS | MKL Default | MKL With Flag |
|------------|----------|-------------|---------------|
| NumPy      | 6.05s    | 4.24s       | 3.47s         |
| PyTorch    | N/A      | 1.31s       | 1.11s         |
| TensorFlow | 5.20s    | 2.73s       | 2.64s         |

The difference is less pronounced here, but MKL without the flag still makes plain matrix multiplication noticeably slower.

inoryy commented Feb 13, 2020

Thanks for sharing, @Miffyli!
