@eugnsp
Last active June 2, 2019 10:50
Copy-transpose – strided reads or strided writes?

Suppose we have two square NxN matrices A and B of doubles, and we want to perform a copy with a transposition:

A = transpose(B)

Equivalently, we want to copy a row-major matrix to a column-major one or vice versa.

Below are the results of a simple benchmark for 3 ways to accomplish the task:

  1. Two nested loops such that reads are contiguous and writes are strided.
  2. Two nested loops such that reads are strided and writes are contiguous.
  3. MKL's mkl_domatcopy (sequential MKL).
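
The two loop variants differ only in which side of the assignment gets the strided index. A minimal sketch, assuming row-major storage in flat `std::vector<double>` buffers; the function names are hypothetical and are not taken from the benchmark source:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major storage is assumed: element (i, j) lives at index i * n + j.

// Method 1: contiguous reads from b, strided writes to a.
void transpose_strided_writes(std::vector<double>& a,
                              const std::vector<double>& b, std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            a[j * n + i] = b[i * n + j];  // b is read along row i, a is written down column i
}

// Method 2: strided reads from b, contiguous writes to a.
void transpose_strided_reads(std::vector<double>& a,
                             const std::vector<double>& b, std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            a[i * n + j] = b[j * n + i];  // b is read down column i, a is written along row i
}
```

Both functions produce the same result; only the memory access pattern differs. Method 3 would replace the loops with a single mkl_domatcopy call and is omitted here because it requires MKL.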

A plain copy without transposition is used as the baseline. Values of N are taken to be 2^K + 1 rather than 2^K, so that strided accesses do not all map to the same cache sets; this mitigates cache associativity effects. The source code is available here.
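
The effect of choosing N = 2^K + 1 can be illustrated by counting how many distinct cache sets a single column walk touches. The sketch below assumes a cache with 64-byte lines and 64 sets (for example, a 32K 8-way L1 data cache; the line size and associativity are assumptions, not stated above) and a matrix that starts at byte offset 0. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct cache sets touched when walking one column of a
// row-major n x n matrix of doubles (stride of n * sizeof(double) bytes).
// line_bytes and num_sets describe an ASSUMED cache geometry.
std::size_t distinct_sets_in_column(std::size_t n,
                                    std::size_t line_bytes = 64,
                                    std::size_t num_sets = 64)
{
    assert(line_bytes > 0 && num_sets > 0);
    std::set<std::size_t> sets;
    for (std::size_t i = 0; i < n; ++i)
    {
        // Byte offset of element (i, 0) from the start of the matrix.
        const std::size_t byte_offset = i * n * sizeof(double);
        // Set index: cache line number modulo the number of sets.
        sets.insert((byte_offset / line_bytes) % num_sets);
    }
    return sets.size();
}
```

Under these assumptions and with 8-byte doubles, a column of a 512x512 matrix lands entirely in one set, while a column of a 513x513 matrix is spread over all 64 sets.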

Benchmark results

Intel Core i7-4770 with GCC 8.3.0 (-O3 -m64 -march=native) and Intel MKL 2019.0.1:

Intel Core i7-4770

N        Bytes    (1)      (2)      (3)
33       9K       2.38     3.57     2.37
65       33K      2.04     1.45     1.60
129      130K     2.43     1.26     1.59
257      516K     4.33     1.26     1.69
513      2M       6.62     1.50     2.03
1'025    8M       3.72     3.19     2.62
2'049    32M      7.89     8.11     4.44
4'097    128M     11.23    10.66    5.77
8'193    512M     11.97    11.54    6.15

Cache sizes on this CPU: L1 Data – 32K, L2 – 256K, L3 – 8M.
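
To relate the Bytes column to these cache sizes, the following sketch computes the size of one NxN matrix of doubles and reports the smallest cache level that could hold it. The function names are hypothetical, the default sizes are the ones quoted above for the i7-4770, and only a single matrix is considered, whereas the benchmark touches two.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Size in bytes of one n x n matrix of doubles.
std::size_t matrix_bytes(std::size_t n)
{
    return n * n * sizeof(double);
}

// Smallest cache level whose capacity is at least `bytes`, or "RAM" if none.
// Defaults: L1 Data 32K, L2 256K, L3 8M.
std::string smallest_fitting_level(std::size_t bytes,
                                   std::size_t l1 = 32 * 1024,
                                   std::size_t l2 = 256 * 1024,
                                   std::size_t l3 = 8 * 1024 * 1024)
{
    assert(l1 <= l2 && l2 <= l3);
    if (bytes <= l1) return "L1";
    if (bytes <= l2) return "L2";
    if (bytes <= l3) return "L3";
    return "RAM";
}
```

With 8-byte doubles, N = 33 gives 8'712 bytes, which fits in L1, while N = 1'025 gives 8'405'000 bytes, which is slightly larger than the 8M L3.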


Intel Xeon E5-2650 v3 with GCC 7.3.0 (-O3 -m64 -march=native) and Intel MKL 2017.0.1:

Intel Xeon E5-2650 v3

N        Bytes    (1)      (2)      (3)
33       9K       1.39     1.78     1.35
65       33K      1.85     1.34     1.52
129      130K     2.41     1.33     1.85
257      516K     3.87     1.08     2.73
513      2M       5.95     1.21     3.76
1'025    8M       4.16     1.88     2.74
2'049    32M      4.87     1.79     3.84
4'097    128M     5.65     2.43     4.24
8'193    512M     5.82     3.06     4.48

Cache sizes on this CPU: L1 Data – 32K, L2 – 256K, L3 – 25M.
