Suppose we have two square `N`x`N` matrices `A` and `B` of `double`s, and we want to perform a copy with a transposition:
A = transpose(B)
Equivalently, we want to copy a row-major matrix to a column-major one or vice versa.
Below are the results of a simple benchmark for three ways to accomplish the task:
1. Two nested loops such that reads are contiguous and writes are strided.
2. Two nested loops such that reads are strided and writes are contiguous.
3. Sequential MKL's `mkl_domatcopy`.
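The two loop-based variants in the list above can be sketched as follows. This is a minimal illustration rather than the benchmark's actual source: it assumes row-major storage in a `std::vector<double>`, and the function names `transpose_read_contiguous` and `transpose_write_contiguous` are hypothetical. The MKL variant is left out so that the sketch compiles without MKL installed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Contiguous reads, strided writes: the inner loop walks a row of b
// and scatters it down a column of a.
void transpose_read_contiguous(std::vector<double>& a,
                               const std::vector<double>& b,
                               std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            a[j * n + i] = b[i * n + j];
}

// Strided reads, contiguous writes: the inner loop gathers a column of b
// and writes it along a row of a.
void transpose_write_contiguous(std::vector<double>& a,
                                const std::vector<double>& b,
                                std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            a[i * n + j] = b[j * n + i];
}
```

Both functions produce the same result; they differ only in which of the two matrices is traversed with a stride of `n` elements in the inner loop.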
Copy without transposition is used as a baseline. Values of `N` are taken to be `2^K + 1` to mitigate cache associativity effects. The source code is available here.
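To see why a power-of-two `N` can be problematic, the sketch below counts how many distinct cache sets are touched when walking down one column of a row-major `N`x`N` matrix of `double`s. The cache geometry is an assumption made for illustration (32K, 8-way set-associative, 64-byte lines, hence 64 sets), as is a matrix that starts on a cache-line boundary; the function name `distinct_sets_in_column` is hypothetical.

```cpp
#include <cstddef>
#include <set>

// Assumed L1 data cache geometry (illustrative, not taken from the benchmark).
constexpr std::size_t cache_bytes = 32 * 1024;
constexpr std::size_t line_bytes = 64;
constexpr std::size_t ways = 8;
constexpr std::size_t num_sets = cache_bytes / (line_bytes * ways);  // 64 sets

// Number of distinct cache sets hit by the n strided accesses of one column,
// assuming the matrix begins at byte offset 0 of a cache line.
std::size_t distinct_sets_in_column(std::size_t n)
{
    std::set<std::size_t> sets;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t byte_offset = i * n * sizeof(double);
        sets.insert((byte_offset / line_bytes) % num_sets);
    }
    return sets.size();
}
```

Under these assumptions, `distinct_sets_in_column(512)` gives 1, because the 4096-byte stride maps every element of the column to the same set, so the accesses compete for only 8 ways. With `N = 513` the stride is 4104 bytes and the column spreads over all 64 sets.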
Intel Core i7-4770 with GCC 8.3.0 (`-O3 -m64 -march=native`) and Intel MKL 2019.0.1:
N | Bytes | (1) | (2) | (3) |
---|---|---|---|---|
33 | 9K | 2.38 | 3.57 | 2.37 |
65 | 33K | 2.04 | 1.45 | 1.60 |
129 | 130K | 2.43 | 1.26 | 1.59 |
257 | 516K | 4.33 | 1.26 | 1.69 |
513 | 2M | 6.62 | 1.50 | 2.03 |
1'025 | 8M | 3.72 | 3.19 | 2.62 |
2'049 | 32M | 7.89 | 8.11 | 4.44 |
4'097 | 128M | 11.23 | 10.66 | 5.77 |
8'193 | 512M | 11.97 | 11.54 | 6.15 |
Cache sizes on this CPU: L1 Data – 32K, L2 – 256K, L3 – 8M.
Intel Xeon E5-2650 v3 with GCC 7.3.0 (`-O3 -m64 -march=native`) and Intel MKL 2017.0.1:
N | Bytes | (1) | (2) | (3) |
---|---|---|---|---|
33 | 9K | 1.39 | 1.78 | 1.35 |
65 | 33K | 1.85 | 1.34 | 1.52 |
129 | 130K | 2.41 | 1.33 | 1.85 |
257 | 516K | 3.87 | 1.08 | 2.73 |
513 | 2M | 5.95 | 1.21 | 3.76 |
1'025 | 8M | 4.16 | 1.88 | 2.74 |
2'049 | 32M | 4.87 | 1.79 | 3.84 |
4'097 | 128M | 5.65 | 2.43 | 4.24 |
8'193 | 512M | 5.82 | 3.06 | 4.48 |
Cache sizes on this CPU: L1 Data – 32K, L2 – 256K, L3 – 25M.