@bebosudo
Created April 20, 2021 00:09
Workaround for MKL to use faster codepath on AMD CPUs

See also: https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html

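Save this as fakeintel.c:

/* Shadows MKL's internal CPU vendor check so that it always reports an Intel CPU. */
int mkl_serv_intel_cpu_true(void) {
  return 1;
}

Compile it into a shared library:

gcc -shared -fPIC -o libfakeintel.so fakeintel.c

Then set LD_PRELOAD=/path/to/libfakeintel.so (directly or through a module environment) before running your MKL code.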
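If you would rather not export LD_PRELOAD for the whole shell, one option is to set it only for the child process that runs the MKL code. The following is a minimal sketch, assuming Linux with the glibc dynamic loader; the helper names with_preload and run_with_preload are hypothetical and not part of any library.

```python
import os
import subprocess


def with_preload(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic loader accepts colon- or space-separated entries; a colon is used here.
    env["LD_PRELOAD"] = lib_path if not existing else lib_path + ":" + existing
    return env


def run_with_preload(lib_path, argv):
    """Run argv as a child process with the shim preloaded (hypothetical helper)."""
    return subprocess.run(argv, env=with_preload(lib_path), check=True)
```

For example, run_with_preload("/path/to/libfakeintel.so", ["python", "bench.py"]) would run a benchmark script (bench.py is a placeholder name) with the shim loaded, leaving the parent shell's environment untouched.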

To measure the improvement, run the Python code provided below with a NumPy release built against MKL, e.g. the one conda installs from the defaults channel: conda install -c defaults --override-channels numpy

Here are the timings when the code runs normally, without the workaround:

  N = 20000           N = 30000
36.43 seconds      132.03 seconds
36.70 seconds      131.61 seconds
37.61 seconds      130.95 seconds
36.16 seconds      137.04 seconds
36.51 seconds      134.37 seconds
37.40 seconds      140.52 seconds
35.40 seconds      129.63 seconds
40.15 seconds      133.25 seconds
40.92 seconds      131.32 seconds
36.65 seconds      134.69 seconds

And these are the timings for the same code with the libfakeintel.so library above preloaded:

  N = 20000           N = 30000
12.93 seconds       39.54 seconds
13.51 seconds       37.51 seconds
12.30 seconds       38.28 seconds
13.21 seconds       40.76 seconds
14.28 seconds       36.63 seconds
12.72 seconds       36.95 seconds
12.62 seconds       40.93 seconds
13.23 seconds       42.98 seconds
12.42 seconds       40.93 seconds
12.44 seconds       39.76 seconds

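Usually the improvement is around 3x for single precision operations: in the runs above, the mean time drops by a factor of about 2.9 at N = 20000 and about 3.4 at N = 30000.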
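The speedup can be checked directly from the timings listed in the two tables. The sketch below copies those numbers and compares mean run times; using the mean rather than the median or the minimum is an assumption made here, not something the original states.

```python
from statistics import mean

# Timings in seconds, copied from the two tables above.
baseline = {
    20000: [36.43, 36.70, 37.61, 36.16, 36.51, 37.40, 35.40, 40.15, 40.92, 36.65],
    30000: [132.03, 131.61, 130.95, 137.04, 134.37, 140.52, 129.63, 133.25, 131.32, 134.69],
}
preloaded = {
    20000: [12.93, 13.51, 12.30, 13.21, 14.28, 12.72, 12.62, 13.23, 12.42, 12.44],
    30000: [39.54, 37.51, 38.28, 40.76, 36.63, 36.95, 40.93, 42.98, 40.93, 39.76],
}


def speedup(n):
    """Ratio of mean baseline time to mean preloaded time for matrix size n."""
    return mean(baseline[n]) / mean(preloaded[n])


for n in sorted(baseline):
    print("N = {}: {:.2f}x".format(n, speedup(n)))
```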

import time

import numpy as np

rng = np.random.default_rng()
N = 20000
# The performance boost is more dramatic for single precision.
m1 = rng.random((N, N), dtype='float32')
m2 = rng.random((N, N), dtype='float32')

# Time only the matrix product, not the random matrix generation.
start = time.perf_counter()
m1.dot(m2)
print("{:.2f} seconds".format(time.perf_counter() - start))
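The tables list ten timings per matrix size, while the script times a single multiplication. How the ten runs were collected is not stated; one possibility is a small repeat harness like the sketch below. The names time_runs and make_matmul are hypothetical, and NumPy is imported inside make_matmul so that the timing helper itself has no dependency on it.

```python
import time


def time_runs(fn, repeats=10):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def make_matmul(n, dtype="float32"):
    """Build two random n x n matrices once and return a function that multiplies them."""
    import numpy as np  # imported here so time_runs stays usable without NumPy

    rng = np.random.default_rng()
    m1 = rng.random((n, n), dtype=dtype)
    m2 = rng.random((n, n), dtype=dtype)
    return lambda: m1.dot(m2)
```

Calling time_runs(make_matmul(20000)) and printing each entry with "{:.2f} seconds".format(t) would produce one column of timings in the same shape as the tables; the matrices are generated once, so only the products are timed.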