# Investigating MKL on AMD Zen CPU

As per discussion on Reddit, it seems a workaround for Intel MKL's notorious SIMD throttling of AMD Zen CPUs is as simple as setting the `MKL_DEBUG_CPU_TYPE=5` environment variable. Intel removed this debug mode starting with MKL 2020.1, although MKL 2020.1 and later appear to have improved default performance on AMD to some extent.
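
As a minimal sketch (assuming an MKL build older than 2020.1, where the debug mode still exists): the variable must be set before MKL is loaded, so in Python it has to happen before the first NumPy import. Alternatively, export the variable in the shell before launching Python.

```python
import os

# Must be set before the first `import numpy`, since MKL reads the
# variable only once, when the library is loaded.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np  # deliberate late import
```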

## Recommendation

On an AMD Zen CPU, it is still worth using MKL to speed up NumPy, SciPy, scikit-learn, and NumExpr, even without the debug mode. You can get an additional speed boost with `MKL_DEBUG_CPU_TYPE=5`, but that requires downgrading MKL to version 2020.0 or below.

For TensorFlow on an AMD CPU, it is better to install the stock build via `pip install tensorflow` rather than `tensorflow-mkl`.
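
To check whether the MKL build in an environment is old enough for the flag to work, one option is the `mkl-service` package; this sketch assumes `mkl-service` is installed (it usually accompanies conda's MKL):

```python
import mkl

# Prints something like "Intel(R) Math Kernel Library Version 2020.0 ...".
# The MKL_DEBUG_CPU_TYPE workaround only applies to versions before 2020.1.
print(mkl.get_version_string())
```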

## Benchmarks

All three scripts are executed with Python 3.8 on an AMD Ryzen 7 5800X CPU. Times are per run, averaged over each script's loop count; the "MKL with flag" column is MKL 2020.0 running with `MKL_DEBUG_CPU_TYPE=5`.

### Dotted two 4096x4096 matrices

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 0.55s    | 0.54s      | 0.54s      | 0.49s         |
| PyTorch    | N/A      | 0.68s      | 0.62s      | 0.60s         |
| TensorFlow | 0.18s    | 0.17s      | 0.17s      | 0.17s         |

### Dotted two vectors of length 524288

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 0.06ms   | 0.03ms     | 0.03ms     | 0.03ms        |
| PyTorch    | N/A      | 0.02ms     | 0.02ms     | 0.02ms        |
| TensorFlow | 0.41ms   | 0.91ms     | 0.91ms     | 0.90ms        |

### SVD of a 2048x1024 matrix

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 0.71s    | 0.32s      | 0.32s      | 0.25s         |
| PyTorch    | N/A      | 0.31s      | 0.30s      | 0.30s         |
| TensorFlow | 0.49s    | 0.83s      | 0.83s      | 0.83s         |

### Cholesky decomposition of a 2048x2048 matrix

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 0.08s    | 0.07s      | 0.07s      | 0.07s         |
| PyTorch    | N/A      | 0.04s      | 0.04s      | 0.04s         |
| TensorFlow | 0.12s    | 0.19s      | 0.20s      | 0.20s         |

### Eigendecomposition of a 2048x2048 matrix

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 3.29s    | 3.09s      | 3.07s      | 2.58s         |
| PyTorch    | N/A      | 1.16s      | 1.12s      | 1.14s         |
| TensorFlow | 3.69s    | 4.70s      | 4.66s      | 4.66s         |

Note: TensorFlow handles eigendecomposition differently from NumPy and PyTorch: its script calls `tf.linalg.eigh`, which assumes a Hermitian (symmetric) input, whereas `np.linalg.eig` and `torch.eig` solve the general eigenproblem. Hence the discrepancy.
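
For illustration, a minimal NumPy-only sketch of the difference between the two solver families (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.random((4, 4))   # general, non-symmetric matrix
S = G @ G.T              # symmetric matrix

w_general, _ = np.linalg.eig(G)  # general solver: complex eigenvalues possible
w_sym = np.linalg.eigvalsh(S)    # Hermitian solver: real eigenvalues only

print(w_general.dtype)  # complex128
print(w_sym.dtype)      # float64
```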

## Environment

The MKL environment is set up through Anaconda with the following commands:

```
$ conda create -n py38mkl python=3.8 && conda activate py38mkl
$ conda install numpy "blas=*=mkl"
$ conda install -c pytorch pytorch
$ conda install -c anaconda tensorflow-mkl
```

The OpenBLAS environment is set up through Anaconda with the following commands:

```
$ conda create -n py38nomkl python=3.8 && conda activate py38nomkl
$ conda install nomkl
$ conda install numpy "blas=*=openblas"
$ pip install tensorflow
```
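
To verify which BLAS backend an environment actually links against, NumPy's build information can be inspected (this is standard NumPy, not specific to this setup):

```python
import numpy as np

# Lists the BLAS/LAPACK libraries NumPy was built against;
# look for "mkl" or "openblas" in the library names.
np.show_config()
```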

## Credits

The NumPy benchmark was created by Markus Beuckelmann, adapted for PyTorch by /u/une-transaction, adapted for TensorFlow by Roman Ring, and modified by me.

NumPy benchmark:

```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import numpy as np
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
np.random.seed(0)

size = 4096
A, B = np.random.random((size, size)), np.random.random((size, size))
C, D = np.random.random((size * 128,)), np.random.random((size * 128,))
E = np.random.random((int(size / 2), int(size / 4)))
F = np.random.random((int(size / 2), int(size / 2)))
F = np.dot(F, F.T)
G = np.random.random((int(size / 2), int(size / 2)))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    np.dot(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    np.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 10
t = time()
for i in range(N):
    np.linalg.svd(E, full_matrices=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 10
t = time()
for i in range(N):
    np.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    np.linalg.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
```
PyTorch benchmark:

```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
import torch
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
torch.manual_seed(0)

size = 4096
A, B = torch.rand(size, size), torch.rand(size, size)
C, D = torch.rand(size * 128), torch.rand(size * 128)
E = torch.rand(int(size / 2), int(size / 4))
F = torch.rand(int(size / 2), int(size / 2))
F = torch.matmul(F, F.T)
G = torch.rand(int(size / 2), int(size / 2))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    torch.matmul(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    torch.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 10
t = time()
for i in range(N):
    torch.svd(E, some=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 10
t = time()
for i in range(N):
    torch.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    torch.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
```
TensorFlow benchmark:

```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
from time import time
import tensorflow as tf

# Let's take the randomness out of random numbers (for reproducibility)
tf.random.set_seed(0)

size = 4096
A, B = tf.random.uniform((size, size)), tf.random.uniform((size, size))
C, D = tf.random.uniform((size * 128,)), tf.random.uniform((size * 128,))
E = tf.random.uniform((int(size / 2), int(size / 4)))
F = tf.random.uniform((int(size / 2), int(size / 2)), dtype=tf.float64)
F = tf.matmul(F, F, transpose_b=True)
G = tf.random.uniform((int(size / 2), int(size / 2)))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    tf.matmul(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    tf.tensordot(C, D, axes=1)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 10
t = time()
for i in range(N):
    tf.linalg.svd(E, full_matrices=True)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 10
t = time()
for i in range(N):
    tf.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    tf.linalg.eigh(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
```