# Investigating MKL on AMD Zen CPU

As per discussion on Reddit, it seems a workaround for Intel MKL's notorious SIMD throttling of AMD Zen CPUs is as simple as setting the `MKL_DEBUG_CPU_TYPE=5` environment variable. Intel removed this debug mode starting with MKL 2020.1, although MKL 2020.1 and later appear to have improved default performance on AMD to some extent.
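
As a minimal sketch (assuming an MKL build older than 2020.1, where the debug mode still exists): the variable must be set before MKL is loaded, so in Python it has to happen before the first NumPy import. Alternatively, export the variable in the shell before launching Python.

```python
import os

# Must be set before the first `import numpy`, since MKL reads the
# variable only once, when the library is loaded.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np  # deliberate late import
```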

## Recommendation

On an AMD Zen CPU, it is still worth using MKL to speed up NumPy, SciPy, scikit-learn, and NumExpr, even without the debug mode. You can get an additional speed boost with `MKL_DEBUG_CPU_TYPE=5`, but that requires downgrading MKL to version 2020.0 or below.

For TensorFlow on an AMD CPU, it is better to install the stock build via `pip install tensorflow` rather than `tensorflow-mkl`.
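
To check whether the MKL build in an environment is old enough for the flag to work, one option is the `mkl-service` package; this sketch assumes `mkl-service` is installed (it usually accompanies conda's MKL):

```python
import mkl

# Prints something like "Intel(R) Math Kernel Library Version 2020.0 ...".
# The MKL_DEBUG_CPU_TYPE workaround only applies to versions before 2020.1.
print(mkl.get_version_string())
```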

## Benchmarks

All three scripts are executed with Python 3.8 on an AMD Ryzen 7 5800X CPU. Times are per run, averaged over each script's loop count; the "MKL with flag" column is MKL 2020.0 running with `MKL_DEBUG_CPU_TYPE=5`.

### Dotted two 4096x4096 matrices

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 0.55s    | 0.54s      | 0.54s      | 0.49s         |
| PyTorch    | N/A      | 0.68s      | 0.62s      | 0.60s         |
| TensorFlow | 0.18s    | 0.17s      | 0.17s      | 0.17s         |

### Dotted two vectors of length 524288

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 0.06ms   | 0.03ms     | 0.03ms     | 0.03ms        |
| PyTorch    | N/A      | 0.02ms     | 0.02ms     | 0.02ms        |
| TensorFlow | 0.41ms   | 0.91ms     | 0.91ms     | 0.90ms        |

### SVD of a 2048x1024 matrix

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 0.71s    | 0.32s      | 0.32s      | 0.25s         |
| PyTorch    | N/A      | 0.31s      | 0.30s      | 0.30s         |
| TensorFlow | 0.49s    | 0.83s      | 0.83s      | 0.83s         |

### Cholesky decomposition of a 2048x2048 matrix

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 0.08s    | 0.07s      | 0.07s      | 0.07s         |
| PyTorch    | N/A      | 0.04s      | 0.04s      | 0.04s         |
| TensorFlow | 0.12s    | 0.19s      | 0.20s      | 0.20s         |

### Eigendecomposition of a 2048x2048 matrix

| Library    | OpenBLAS | MKL 2020.2 | MKL 2020.0 | MKL with flag |
|------------|----------|------------|------------|---------------|
| NumPy      | 3.29s    | 3.09s      | 3.07s      | 2.58s         |
| PyTorch    | N/A      | 1.16s      | 1.12s      | 1.14s         |
| TensorFlow | 3.69s    | 4.70s      | 4.66s      | 4.66s         |

Note: TensorFlow handles eigendecomposition differently from NumPy and PyTorch: its script calls `tf.linalg.eigh`, which assumes a Hermitian (symmetric) input, whereas `np.linalg.eig` and `torch.eig` solve the general eigenproblem. Hence the discrepancy.
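
For illustration, a minimal NumPy-only sketch of the difference between the two solver families (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.random((4, 4))   # general, non-symmetric matrix
S = G @ G.T              # symmetric matrix

w_general, _ = np.linalg.eig(G)  # general solver: complex eigenvalues possible
w_sym = np.linalg.eigvalsh(S)    # Hermitian solver: real eigenvalues only

print(w_general.dtype)  # complex128
print(w_sym.dtype)      # float64
```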

## Environment

The MKL environment is set up through Anaconda with the following commands:

```
$ conda create -n py38mkl python=3.8 && conda activate py38mkl
$ conda install numpy "blas=*=mkl"
$ conda install -c pytorch pytorch
$ conda install -c anaconda tensorflow-mkl
```

The OpenBLAS environment is set up through Anaconda with the following commands:

```
$ conda create -n py38nomkl python=3.8 && conda activate py38nomkl
$ conda install nomkl
$ conda install numpy "blas=*=openblas"
$ pip install tensorflow
```
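
To verify which BLAS backend an environment actually links against, NumPy's build information can be inspected (this is standard NumPy, not specific to this setup):

```python
import numpy as np

# Lists the BLAS/LAPACK libraries NumPy was built against;
# look for "mkl" or "openblas" in the library names.
np.show_config()
```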

## Credits

The NumPy benchmark was created by Markus Beuckelmann, adapted for PyTorch by /u/une-transaction, adapted for TensorFlow by Roman Ring, and modified by me.

NumPy benchmark:

```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import numpy as np
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
np.random.seed(0)

size = 4096
A, B = np.random.random((size, size)), np.random.random((size, size))
C, D = np.random.random((size * 128,)), np.random.random((size * 128,))
E = np.random.random((int(size / 2), int(size / 4)))
F = np.random.random((int(size / 2), int(size / 2)))
F = np.dot(F, F.T)
G = np.random.random((int(size / 2), int(size / 2)))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    np.dot(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    np.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 10
t = time()
for i in range(N):
    np.linalg.svd(E, full_matrices=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 10
t = time()
for i in range(N):
    np.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    np.linalg.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
```
PyTorch benchmark:

```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
import torch
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
torch.manual_seed(0)

size = 4096
A, B = torch.rand(size, size), torch.rand(size, size)
C, D = torch.rand(size * 128), torch.rand(size * 128)
E = torch.rand(int(size / 2), int(size / 4))
F = torch.rand(int(size / 2), int(size / 2))
F = torch.matmul(F, F.T)
G = torch.rand(int(size / 2), int(size / 2))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    torch.matmul(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    torch.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 10
t = time()
for i in range(N):
    torch.svd(E, some=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 10
t = time()
for i in range(N):
    torch.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    torch.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
```
TensorFlow benchmark:

```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
from time import time
import tensorflow as tf

# Let's take the randomness out of random numbers (for reproducibility)
tf.random.set_seed(0)

size = 4096
A, B = tf.random.uniform((size, size)), tf.random.uniform((size, size))
C, D = tf.random.uniform((size * 128,)), tf.random.uniform((size * 128,))
E = tf.random.uniform((int(size / 2), int(size / 4)))
F = tf.random.uniform((int(size / 2), int(size / 2)), dtype=tf.float64)
F = tf.matmul(F, F, transpose_b=True)
G = tf.random.uniform((int(size / 2), int(size / 2)))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    tf.matmul(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    tf.tensordot(C, D, axes=1)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 10
t = time()
for i in range(N):
    tf.linalg.svd(E, full_matrices=True)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 10
t = time()
for i in range(N):
    tf.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    tf.linalg.eigh(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
```