Fixing MKL on AMD Zen CPU

As per discussion on Reddit, it seems a workaround for Intel MKL's notorious throttling of SIMD code paths on AMD Zen CPUs is as simple as setting the MKL_DEBUG_CPU_TYPE=5 environment variable.
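
For reference, the flag can be enabled per-invocation from the shell (the script name below is just a placeholder). Note that MKL reads the variable when the library loads, so it must be set before NumPy, PyTorch, or TensorFlow is imported:

$ MKL_DEBUG_CPU_TYPE=5 python benchmark.py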

Benchmarks

All three scripts are executed in the same Python 3.7 environment on a first-gen AMD Zen CPU (Threadripper 1950X).
The difference will be even bigger on newer models, as first-gen Zen executes 256-bit AVX2 instructions as two 128-bit operations.

4096x4096 Matrix Multiplication

| Library    | OpenBLAS | MKL Default | MKL With Flag |
|------------|----------|-------------|---------------|
| NumPy      | 0.58s    | 1.00s       | 0.56s         |
| PyTorch    | N/A      | 0.48s       | 0.26s         |
| TensorFlow | 0.22s    | 0.47s       | 0.20s         |

Eigendecomposition

| Library    | OpenBLAS | MKL Default | MKL With Flag |
|------------|----------|-------------|---------------|
| NumPy      | 11.82s   | 7.54s       | 6.67s         |
| PyTorch    | N/A      | 2.25s       | 2.06s         |
| TensorFlow | 8.61s    | 6.51s       | 6.73s         |

Note: TensorFlow only exposes a self-adjoint eigensolver (tf.linalg.eigh), whereas the NumPy and PyTorch scripts use a general one, hence the discrepancy.

Full results: NumPy, PyTorch, and TensorFlow.

Environment

The MKL environment is set up through Anaconda with the following commands:

$ conda create -n py37mkl python=3.7 && conda activate py37mkl
$ conda install numpy "blas=*=mkl"
$ conda install -c pytorch pytorch
$ conda install -c anaconda tensorflow-mkl

The OpenBLAS environment is set up through Anaconda with the following commands:

$ conda create -n py37nomkl python=3.7 && conda activate py37nomkl
$ conda install nomkl
$ conda install numpy "blas=*=openblas"
$ pip install tensorflow
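
To verify which BLAS backend an environment actually links against (a quick sanity check, not part of the original setup), NumPy can print its build configuration:

$ python -c "import numpy; numpy.show_config()"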

Credits

The NumPy benchmark was created by Markus Beuckelmann, adapted for PyTorch by /u/une-transaction, and adapted for TensorFlow by me.
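
The NumPy version of the benchmark: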

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
import numpy as np
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
np.random.seed(0)

size = 4096
A, B = np.random.random((size, size)), np.random.random((size, size))
C, D = np.random.random((size * 128,)), np.random.random((size * 128,))
E = np.random.random((int(size / 2), int(size / 4)))
F = np.random.random((int(size / 2), int(size / 2)))
F = np.dot(F, F.T)
G = np.random.random((int(size / 2), int(size / 2)))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    np.dot(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    np.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 3
t = time()
for i in range(N):
    np.linalg.svd(E, full_matrices=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 3
t = time()
for i in range(N):
    np.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    np.linalg.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
import torch
from time import time

# Let's take the randomness out of random numbers (for reproducibility)
torch.manual_seed(0)

size = 4096
A, B = torch.rand(size, size), torch.rand(size, size)
C, D = torch.rand(size * 128), torch.rand(size * 128)
E = torch.rand(int(size / 2), int(size / 4))
F = torch.rand(int(size / 2), int(size / 2))
F = torch.matmul(F, F.T)
G = torch.rand(int(size / 2), int(size / 2))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    torch.matmul(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    torch.dot(C, D)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 3
t = time()
for i in range(N):
    torch.svd(E, some=False)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 3
t = time()
for i in range(N):
    torch.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    torch.eig(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from __future__ import print_function
from time import time
import tensorflow as tf

# Let's take the randomness out of random numbers (for reproducibility)
tf.random.set_seed(0)

size = 4096
A, B = tf.random.uniform((size, size)), tf.random.uniform((size, size))
C, D = tf.random.uniform((size * 128,)), tf.random.uniform((size * 128,))
E = tf.random.uniform((int(size / 2), int(size / 4)))
F = tf.random.uniform((int(size / 2), int(size / 2)), dtype=tf.float64)
F = tf.matmul(F, F, transpose_b=True)
G = tf.random.uniform((int(size / 2), int(size / 2)))

# Matrix multiplication
N = 20
t = time()
for i in range(N):
    tf.matmul(A, B)
delta = time() - t
print('Dotted two %dx%d matrices in %0.2f s.' % (size, size, delta / N))
del A, B

# Vector multiplication
N = 5000
t = time()
for i in range(N):
    tf.tensordot(C, D, axes=1)
delta = time() - t
print('Dotted two vectors of length %d in %0.2f ms.' % (size * 128, 1e3 * delta / N))
del C, D

# Singular Value Decomposition (SVD)
N = 3
t = time()
for i in range(N):
    tf.linalg.svd(E, full_matrices=True)
delta = time() - t
print("SVD of a %dx%d matrix in %0.2f s." % (size / 2, size / 4, delta / N))
del E

# Cholesky Decomposition
N = 3
t = time()
for i in range(N):
    tf.linalg.cholesky(F)
delta = time() - t
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))

# Eigendecomposition
t = time()
for i in range(N):
    tf.linalg.eigh(G)
delta = time() - t
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (size / 2, size / 2, delta / N))
Miffyli commented Feb 13, 2020

Here are results with a newer Ryzen 9 3950X (Zen 2), using the same environments and scripts:

4096x4096 Matrix Multiplication

| Library    | OpenBLAS | MKL Default | MKL With Flag |
|------------|----------|-------------|---------------|
| NumPy      | 0.28s    | 0.54s       | 0.24s         |
| PyTorch    | N/A      | 0.32s       | 0.12s         |
| TensorFlow | 0.11s    | 0.30s       | 0.11s         |

Eigendecomposition

| Library    | OpenBLAS | MKL Default | MKL With Flag |
|------------|----------|-------------|---------------|
| NumPy      | 6.05s    | 4.24s       | 3.47s         |
| PyTorch    | N/A      | 1.31s       | 1.11s         |
| TensorFlow | 5.20s    | 2.73s       | 2.64s         |

The difference is less pronounced here, but MKL without the flag still makes plain matrix multiplication noticeably slower.

inoryy commented Feb 13, 2020

Thanks for sharing, @Miffyli!
