eugnsp/benchmark_copy_transpose.md

## benchmark_copy_transpose.md

      
### Copy-transpose – strided reads or strided writes?

Suppose we have two square NxN matrices A and B of doubles, and we want to perform a copy with a transposition:
A = transpose(B)

Equivalently, we want to copy a row-major matrix to a column-major one or vice versa.
Below are the results of a simple benchmark for 3 ways to accomplish the task:

1. Two nested loops such that reads are contiguous and writes are strided.
2. Two nested loops such that reads are strided and writes are contiguous.
3. Sequential MKL's `mkl_domatcopy`.
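
For concreteness, the two loop variants might look like the sketch below. It is not the benchmark's actual source: the function names and the choice of row-major `std::vector<double>` storage are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Assumed layout: row-major, element (i, j) of an n x n matrix is at index i * n + j.

// Variant (1): the inner loop reads b contiguously and writes a with stride n.
void transpose_strided_writes(std::vector<double>& a, const std::vector<double>& b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            a[j * n + i] = b[i * n + j];
}

// Variant (2): the inner loop reads b with stride n and writes a contiguously.
void transpose_strided_reads(std::vector<double>& a, const std::vector<double>& b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            a[i * n + j] = b[j * n + i];
}
```

Both functions produce the same result; they differ only in which of the two arrays is traversed with unit stride in the inner loop.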

Copy without transposition is used as a baseline. Values of N are taken to be 2^K + 1 to mitigate cache associativity effects. The source code is available here.
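
To see why a power-of-two N is problematic, one can count how many distinct cache sets a strided walk down one column touches. The sketch below assumes an L1 data cache with 64-byte lines and 64 sets (32K, 8-way); only the 32K size appears in this document, and the line size, associativity, and the function name are assumptions.

```cpp
#include <cstddef>
#include <set>

// Assumed L1 data cache geometry: 64-byte lines, 8-way associative,
// so 32768 / (64 * 8) = 64 sets.
constexpr std::size_t line_bytes = 64;
constexpr std::size_t num_sets = 64;

// Number of distinct cache sets touched when walking down one column of a
// row-major n x n matrix of doubles whose storage starts at address 0.
std::size_t distinct_sets_in_column(std::size_t n)
{
    std::set<std::size_t> sets;
    for (std::size_t row = 0; row < n; ++row)
    {
        const std::size_t addr = row * n * sizeof(double);
        sets.insert((addr / line_bytes) % num_sets);
    }
    return sets.size();
}
```

Under these assumptions, a column of a 512x512 matrix maps entirely to one set, while a column of a 513x513 matrix is spread over all 64 sets.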
#### Benchmark results

Intel Core i7-4770 with GCC 8.3.0 (-O3 -m64 -march=native) and Intel MKL 2019.0.1:


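| N | Bytes | (1) | (2) | (3) |
|---|---|---|---|---|
| 33 | 9K | 2.38 | 3.57 | 2.37 |
| 65 | 33K | 2.04 | 1.45 | 1.60 |
| 129 | 130K | 2.43 | 1.26 | 1.59 |
| 257 | 516K | 4.33 | 1.26 | 1.69 |
| 513 | 2M | 6.62 | 1.50 | 2.03 |
| 1'025 | 8M | 3.72 | 3.19 | 2.62 |
| 2'049 | 32M | 7.89 | 8.11 | 4.44 |
| 4'097 | 128M | 11.23 | 10.66 | 5.77 |
| 8'193 | 512M | 11.97 | 11.54 | 6.15 |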


Cache sizes on this CPU: L1 Data – 32K, L2 – 256K, L3 – 8M.
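
The "Bytes" column is consistent with the footprint of a single matrix, N * N * 8 bytes. As a rough aid for reading the table, the sketch below maps N to the smallest cache level of the i7-4770 that could hold one matrix. The helper names are hypothetical, and the sketch ignores that a copy touches two matrices at once.

```cpp
#include <cstddef>
#include <string>

// Footprint in bytes of one n x n matrix of doubles.
constexpr std::size_t matrix_bytes(std::size_t n)
{
    return n * n * sizeof(double);
}

// Smallest cache level (sizes as listed above for the i7-4770) that can hold
// one matrix; "RAM" if it exceeds the L3 size.
std::string smallest_fitting_level(std::size_t n)
{
    const std::size_t bytes = matrix_bytes(n);
    if (bytes <= 32u * 1024) return "L1";
    if (bytes <= 256u * 1024) return "L2";
    if (bytes <= 8u * 1024 * 1024) return "L3";
    return "RAM";
}
```

For example, N = 65 gives 33,800 bytes, which already exceeds the 32K L1 data cache, and N = 1'025 gives 8,405,000 bytes, slightly more than the 8M L3.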

Intel Xeon E5-2650 v3 with GCC 7.3.0 (-O3 -m64 -march=native) and Intel MKL 2017.0.1:


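| N | Bytes | (1) | (2) | (3) |
|---|---|---|---|---|
| 33 | 9K | 1.39 | 1.78 | 1.35 |
| 65 | 33K | 1.85 | 1.34 | 1.52 |
| 129 | 130K | 2.41 | 1.33 | 1.85 |
| 257 | 516K | 3.87 | 1.08 | 2.73 |
| 513 | 2M | 5.95 | 1.21 | 3.76 |
| 1'025 | 8M | 4.16 | 1.88 | 2.74 |
| 2'049 | 32M | 4.87 | 1.79 | 3.84 |
| 4'097 | 128M | 5.65 | 2.43 | 4.24 |
| 8'193 | 512M | 5.82 | 3.06 | 4.48 |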


Cache sizes on this CPU: L1 Data – 32K, L2 – 256K, L3 – 25M.
#### References


Should I prefer stride one memory access for either reading or writing? – Stack Overflow


## intel_core_i7_4770.png

      
          
## intel_xeon_e5_2650v3.png

      