@aldanor
Last active June 25, 2019 11:39
NumPy dot() vs MKL ?gemm benchmarks

NumPy np.dot() vs MKL dgemm() (+ strict MKL CNR)

  • The tests were run on 4 independent machines with the same specs
  • Each machine has multiple Xeon Platinum 8168 CPUs
  • Python: 3.6, NumPy: 1.16, Intel MKL: 2019.4
  • Either MKL_ENABLE_INSTRUCTIONS or MKL_CBWR was set for each test
  • MKL_DYNAMIC was set to FALSE for all tests
  • MKL_NUM_THREADS was set to either 1, 8 or 16
  • Each test was run for float32 (f4) and float64 (f8) inputs
  • All inputs were pre-aligned to a 512-bit (64-byte) boundary (for AVX-512)
  • Each test marked as small=0 was run 25 * num_threads times
  • Each test marked as small=1 was run 250 * num_threads times
  • For each test, the median elapsed time was recorded for each machine
  • Median times were then averaged across machines for each test
  • dot op: np.dot() was called as is, with no changes to its arguments
  • gemm op: blas.dgemm() was called, with its arguments rearranged to match np.dot()
  • gemm-strict op: same as gemm, but with MKL strict CNR (Conditional Numerical Reproducibility) mode enabled
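
The setup above can be sketched roughly as follows. This is a minimal illustration, not the harness that produced the numbers: `mkl_env`, `num_runs` and `aggregate` are hypothetical helpers, and pairing `MKL_CBWR` with the strict runs and `MKL_ENABLE_INSTRUCTIONS` with the others is an assumption (the notes only say that one of the two was set for each test).

```python
import statistics


def mkl_env(isa, num_threads, strict=False):
    """Environment variables for one benchmark run (hypothetical helper).

    They must be in the environment before MKL is loaded, for example by
    passing them to a fresh interpreter via subprocess.run(..., env=...).
    """
    env = {
        "MKL_DYNAMIC": "FALSE",               # keep the thread count fixed
        "MKL_NUM_THREADS": str(num_threads),  # 1, 8 or 16 in these tests
    }
    if strict:
        # Assumption: strict CNR runs select the code branch via MKL_CBWR.
        env["MKL_CBWR"] = f"{isa.upper()},STRICT"
    else:
        # Assumption: the remaining runs pick the instruction set this way.
        env["MKL_ENABLE_INSTRUCTIONS"] = isa.upper()
    return env


def num_runs(small, num_threads):
    """Repetitions per test: 250 * threads for small=1, else 25 * threads."""
    return (250 if small else 25) * num_threads


def aggregate(times_by_machine):
    """Median elapsed time per machine, then the mean of those medians."""
    medians = [statistics.median(times) for times in times_by_machine]
    return statistics.mean(medians)


env = mkl_env("avx2", num_threads=8, strict=True)
print(env, num_runs(small=1, num_threads=8))
```

Taking the median per machine first keeps a single slow run from skewing that machine's figure before the cross-machine average is formed.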

shortvec-mat


small=0, dtype=f8, num_threads=1

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • w: (150,) vector
isa operands dot gemm gemm-strict
avx Ac-w 278.52 292.83 -
avx Af-w 247.07 276.66 -
avx w-AcT 333.17 401.95 -
avx w-AfT 182.39 191.47 -
avx2 Ac-w 225.19 226.94 315.4252
avx2 Af-w 251.00 211.06 558.11
avx2 w-AcT 306.19 240.45 304.9262
avx2 w-AfT 260.30 174.11 359.1774
avx512 Ac-w 187.97 186.89 351.1422
avx512 Af-w 255.45 172.14 430.1841
avx512 w-AcT 199.07 204.67 275.3029
avx512 w-AfT 156.65 163.57 333.9329
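
For reference, inputs with these layouts could be built along the following lines. This is a sketch under stated assumptions: `aligned_empty` is a hypothetical helper (NumPy has no built-in aligned allocator), the "transposed" inputs are taken to be plain `.T` views of `Ac` and `Af`, and the matrix is shrunk from (1500000, 150) so the example runs quickly.

```python
import numpy as np


def aligned_empty(shape, dtype, order="C", align=64):
    """Uninitialised array whose data pointer is a multiple of `align` bytes."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    raw = np.empty(nbytes + align, dtype=np.uint8)  # over-allocate
    offset = -raw.ctypes.data % align               # bytes to the next boundary
    view = raw[offset:offset + nbytes].view(dtype)
    return view.reshape(shape, order=order)


rng = np.random.RandomState(0)
n, k = 1500, 150  # reduced size, for illustration only

Ac = aligned_empty((n, k), "f8", order="C")
Ac[...] = rng.standard_normal((n, k))
Af = aligned_empty((n, k), "f8", order="F")
Af[...] = Ac                      # same values, column-major storage
AcT, AfT = Ac.T, Af.T             # assumption: transposed views, no copy
w = aligned_empty((k,), "f8")
w[...] = rng.standard_normal(k)

# The four operand combinations that appear as rows in the table.
results = {
    "Ac-w": np.dot(Ac, w),
    "Af-w": np.dot(Af, w),
    "w-AcT": np.dot(w, AcT),
    "w-AfT": np.dot(w, AfT),
}
```

All four products are mathematically the same vector; only the memory layout handed to BLAS differs.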

small=0, dtype=f8, num_threads=8

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • w: (150,) vector
isa operands dot gemm gemm-strict
avx Ac-w 53.18 55.34 -
avx Af-w 34.53 37.49 -
avx w-AcT 62.42 55.15 -
avx w-AfT 35.37 38.60 -
avx2 Ac-w 41.14 42.85 59.9105
avx2 Af-w 31.07 49.07 67.6441
avx2 w-AcT 56.51 44.37 46.4869
avx2 w-AfT 34.69 35.70 68.4242
avx512 Ac-w 38.03 39.92 61.5696
avx512 Af-w 34.80 46.74 89.0299
avx512 w-AcT 44.63 43.11 72.8658
avx512 w-AfT 36.88 33.69 124.8525

small=0, dtype=f8, num_threads=16

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • w: (150,) vector
isa operands dot gemm gemm-strict
avx Ac-w 29.94 31.84 -
avx Af-w 21.29 26.94 -
avx w-AcT 41.95 28.63 -
avx w-AfT 26.93 24.02 -
avx2 Ac-w 25.05 25.42 37.4955
avx2 Af-w 22.75 34.92 46.9344
avx2 w-AcT 34.64 23.10 28.3142
avx2 w-AfT 23.87 24.40 41.3183
avx512 Ac-w 23.64 24.66 40.7913
avx512 Af-w 25.01 25.53 41.4362
avx512 w-AcT 35.85 30.62 82.5378
avx512 w-AfT 31.93 23.21 128.4683

small=0, dtype=f4, num_threads=1

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • w: (150,) vector
isa operands dot gemm gemm-strict
avx Ac-w 180.10 143.40 -
avx Af-w 80.08 135.02 -
avx w-AcT 199.21 196.09 -
avx w-AfT 83.04 83.06 -
avx2 Ac-w 152.79 115.79 252.4276
avx2 Af-w 134.94 84.69 233.166
avx2 w-AcT 167.70 120.72 191.1838
avx2 w-AfT 86.29 86.83 165.9119
avx512 Ac-w 126.44 130.84 206.7101
avx512 Af-w 76.48 76.74 244.271
avx512 w-AcT 182.31 130.66 128.2289
avx512 w-AfT 78.64 75.68 164.6559

small=0, dtype=f4, num_threads=8

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • w: (150,) vector
isa operands dot gemm gemm-strict
avx Ac-w 27.22 26.22 -
avx Af-w 15.59 19.21 -
avx w-AcT 28.60 28.18 -
avx w-AfT 12.96 21.34 -
avx2 Ac-w 22.89 22.39 36.284
avx2 Af-w 20.24 17.54 42.0482
avx2 w-AcT 23.32 21.47 23.674
avx2 w-AfT 15.05 17.73 33.7815
avx512 Ac-w 26.33 23.56 33.5469
avx512 Af-w 12.20 15.81 43.6333
avx512 w-AcT 24.98 23.97 47.5395
avx512 w-AfT 16.93 12.48 46.1854

small=0, dtype=f4, num_threads=16

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • w: (150,) vector
isa operands dot gemm gemm-strict
avx Ac-w 15.79 15.49 -
avx Af-w 12.97 15.59 -
avx w-AcT 18.08 14.08 -
avx w-AfT 9.31 10.15 -
avx2 Ac-w 16.62 13.06 20.6012
avx2 Af-w 11.00 13.23 26.0878
avx2 w-AcT 12.88 12.01 15.5666
avx2 w-AfT 14.57 9.91 19.25
avx512 Ac-w 17.61 15.65 18.7627
avx512 Af-w 13.45 16.86 26.3501
avx512 w-AcT 15.16 15.80 46.3575
avx512 w-AfT 8.91 15.15 43.3421

small=1, dtype=f8, num_threads=1

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • w: (15,) vector
isa operands dot gemm gemm-strict
avx Ac-w 0.8597 0.8758 -
avx Af-w 0.9394 0.9731 -
avx w-AcT 0.9364 0.8839 -
avx w-AfT 0.9435 0.9509 -
avx2 Ac-w 0.8725 0.8707 1.1811
avx2 Af-w 1.0650 1.0676 0.9959
avx2 w-AcT 0.9238 0.8923 1.5117
avx2 w-AfT 1.0314 1.0499 1.0641
avx512 Ac-w 1.0768 0.9760 1.8678
avx512 Af-w 1.0759 1.0974 1.4689
avx512 w-AcT 1.0823 1.0121 1.4306
avx512 w-AfT 1.0760 1.0874 0.9563

small=1, dtype=f8, num_threads=8

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • w: (15,) vector
isa operands dot gemm gemm-strict
avx Ac-w 0.2103 0.2854 -
avx Af-w 0.2105 0.2429 -
avx w-AcT 0.2830 0.2816 -
avx w-AfT 0.2939 0.2151 -
avx2 Ac-w 0.2108 0.2913 0.248
avx2 Af-w 0.2193 0.2544 0.2048
avx2 w-AcT 0.2828 0.2904 0.2813
avx2 w-AfT 0.3200 0.3285 0.2229
avx512 Ac-w 0.2396 0.3121 0.3481
avx512 Af-w 0.2394 0.2472 0.2849
avx512 w-AcT 0.3071 0.3164 0.4215
avx512 w-AfT 0.3076 0.2744 0.3433

small=1, dtype=f8, num_threads=16

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • w: (15,) vector
isa operands dot gemm gemm-strict
avx Ac-w 0.1583 0.2518 -
avx Af-w 0.1597 0.2491 -
avx w-AcT 0.2880 0.1633 -
avx w-AfT 0.2592 0.1659 -
avx2 Ac-w 0.1612 0.1690 0.1488
avx2 Af-w 0.1760 0.1891 0.1593
avx2 w-AcT 0.2427 0.1669 0.1705
avx2 w-AfT 0.2480 0.1802 0.1585
avx512 Ac-w 0.1881 0.2303 0.2022
avx512 Af-w 0.1957 0.2170 0.1945
avx512 w-AcT 0.2015 0.1974 0.4064
avx512 w-AfT 0.2053 0.2038 0.3159

small=1, dtype=f4, num_threads=1

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • w: (15,) vector
isa operands dot gemm gemm-strict
avx Ac-w 0.8495 0.8581 -
avx Af-w 0.4742 0.4859 -
avx w-AcT 0.8485 0.8526 -
avx w-AfT 0.4745 0.4835 -
avx2 Ac-w 0.4628 0.4592 1.2122
avx2 Af-w 0.4950 0.5033 1.0467
avx2 w-AcT 0.4542 0.4630 0.8676
avx2 w-AfT 0.4921 0.5007 0.596
avx512 Ac-w 0.6473 0.6624 1.4192
avx512 Af-w 0.4983 0.5091 1.1797
avx512 w-AcT 0.6514 0.6616 0.8051
avx512 w-AfT 0.4975 0.5084 0.4723

small=1, dtype=f4, num_threads=8

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • w: (15,) vector
isa operands dot gemm gemm-strict
avx Ac-w 0.1769 0.1729 -
avx Af-w 0.1249 0.1199 -
avx w-AcT 0.1793 0.1853 -
avx w-AfT 0.1254 0.1597 -
avx2 Ac-w 0.1309 0.1259 0.1943
avx2 Af-w 0.1235 0.1202 0.1736
avx2 w-AcT 0.1326 0.1857 0.1695
avx2 w-AfT 0.1211 0.1543 0.1338
avx512 Ac-w 0.1671 0.1764 0.2275
avx512 Af-w 0.1202 0.1287 0.2022
avx512 w-AcT 0.1528 0.1957 0.2482
avx512 w-AfT 0.1342 0.1309 0.155

small=1, dtype=f4, num_threads=16

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • w: (15,) vector
isa operands dot gemm gemm-strict
avx Ac-w 0.1258 0.1290 -
avx Af-w 0.0963 0.0968 -
avx w-AcT 0.1183 0.1619 -
avx w-AfT 0.0973 0.0966 -
avx2 Ac-w 0.0885 0.1125 0.1175
avx2 Af-w 0.0947 0.0935 0.1109
avx2 w-AcT 0.0939 0.1328 0.114
avx2 w-AfT 0.1228 0.1171 0.1278
avx512 Ac-w 0.1152 0.1273 0.1445
avx512 Af-w 0.1018 0.1073 0.1342
avx512 w-AcT 0.1137 0.2005 0.2601
avx512 w-AfT 0.1376 0.1400 0.2527

longvec-mat


small=0, dtype=f8, num_threads=1

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • y: (1500000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 375.88 286.77 -
avx AfT-y 185.65 192.24 -
avx y-Ac 302.22 269.70 -
avx y-Af 186.64 184.56 -
avx2 AcT-y 225.73 214.31 464.2119
avx2 AfT-y 245.28 187.57 257.0441
avx2 y-Ac 227.87 232.95 398.9069
avx2 y-Af 177.51 187.09 215.1909
avx512 AcT-y 190.52 184.97 430.6951
avx512 AfT-y 150.14 155.64 274.4005
avx512 y-Ac 262.89 184.00 399.9746
avx512 y-Af 151.56 151.42 204.8347

small=0, dtype=f8, num_threads=8

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • y: (1500000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 72.51 57.69 -
avx AfT-y 37.06 57.96 -
avx y-Ac 56.33 60.31 -
avx y-Af 37.53 35.46 -
avx2 AcT-y 54.35 54.04 131.7934
avx2 AfT-y 32.51 37.91 51.929
avx2 y-Ac 50.19 54.50 151.298
avx2 y-Af 37.94 34.71 40.8832
avx512 AcT-y 55.61 52.77 126.3179
avx512 AfT-y 34.28 30.66 52.8289
avx512 y-Ac 50.71 47.45 139.7909
avx512 y-Af 34.32 31.92 40.3422

small=0, dtype=f8, num_threads=16

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • y: (1500000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 41.01 38.55 -
avx AfT-y 29.84 34.41 -
avx y-Ac 32.38 38.50 -
avx y-Af 22.83 22.11 -
avx2 AcT-y 27.64 25.57 97.4087
avx2 AfT-y 31.53 30.76 39.9516
avx2 y-Ac 35.63 29.91 105.489
avx2 y-Af 32.59 22.54 25.4169
avx512 AcT-y 30.92 21.10 85.5452
avx512 AfT-y 22.67 32.84 44.1593
avx512 y-Ac 34.40 33.87 96.9224
avx512 y-Af 33.44 21.84 31.5531

small=0, dtype=f4, num_threads=1

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • y: (1500000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 129.34 126.08 -
avx AfT-y 95.86 91.45 -
avx y-Ac 131.97 129.77 -
avx y-Af 127.58 92.65 -
avx2 AcT-y 141.98 100.37 228.5094
avx2 AfT-y 91.16 92.79 243.4499
avx2 y-Ac 105.25 102.64 131.5789
avx2 y-Af 86.90 90.44 140.4917
avx512 AcT-y 101.38 101.59 238.6456
avx512 AfT-y 83.84 83.96 169.2152
avx512 y-Ac 101.85 103.45 129.4417
avx512 y-Af 121.63 78.41 187.0246

small=0, dtype=f4, num_threads=8

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • y: (1500000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 31.56 29.80 -
avx AfT-y 25.76 17.96 -
avx y-Ac 35.22 20.67 -
avx y-Af 16.46 20.86 -
avx2 AcT-y 34.10 29.16 61.767
avx2 AfT-y 23.74 18.97 33.0189
avx2 y-Ac 24.00 29.31 91.0718
avx2 y-Af 15.78 16.60 27.9925
avx512 AcT-y 33.97 25.91 91.5085
avx512 AfT-y 23.10 18.32 31.8801
avx512 y-Ac 26.59 27.16 91.2677
avx512 y-Af 16.28 20.25 26.5968

small=0, dtype=f4, num_threads=16

  • Ac: (1500000, 150) matrix, C-contiguous
  • AcT: (1500000, 150) matrix, C-contiguous, transposed
  • Af: (1500000, 150) matrix, F-contiguous
  • AfT: (1500000, 150) matrix, F-contiguous, transposed
  • y: (1500000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 24.31 16.02 -
avx AfT-y 17.60 14.27 -
avx y-Ac 20.07 21.62 -
avx y-Af 10.54 17.72 -
avx2 AcT-y 18.87 14.36 57.865
avx2 AfT-y 15.81 16.70 26.4263
avx2 y-Ac 14.64 17.15 92.8601
avx2 y-Af 15.24 13.72 18.8649
avx512 AcT-y 13.59 15.01 58.7211
avx512 AfT-y 16.94 19.49 25.3721
avx512 y-Ac 20.83 15.26 72.0012
avx512 y-Af 10.97 14.51 19.9279

small=1, dtype=f8, num_threads=1

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • y: (150000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 0.7522 0.8136 -
avx AfT-y 0.9217 0.9243 -
avx y-Ac 0.8121 0.8248 -
avx y-Af 0.9193 0.9346 -
avx2 AcT-y 0.7827 0.8042 2.0015
avx2 AfT-y 0.9606 0.9718 1.633
avx2 y-Ac 0.8590 0.9023 1.6313
avx2 y-Af 0.9764 0.9727 1.4566
avx512 AcT-y 0.8166 0.9619 2.9684
avx512 AfT-y 0.9460 0.9580 2.2258
avx512 y-Ac 0.8785 0.9055 2.0547
avx512 y-Af 0.9444 0.9565 1.6849

small=1, dtype=f8, num_threads=8

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • y: (150000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 0.7480 0.5035 -
avx AfT-y 0.1042 0.1111 -
avx y-Ac 0.7770 0.6127 -
avx y-Af 0.1109 0.1116 -
avx2 AcT-y 1.0178 0.6397 0.9854
avx2 AfT-y 0.1011 0.1087 0.7269
avx2 y-Ac 0.8044 0.7367 0.9432
avx2 y-Af 0.1032 0.1080 0.7
avx512 AcT-y 0.5493 0.8249 0.9062
avx512 AfT-y 0.1073 0.1252 0.433
avx512 y-Ac 0.5462 0.9861 0.8636
avx512 y-Af 0.1080 0.1156 0.4184

small=1, dtype=f8, num_threads=16

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • y: (150000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 0.2594 0.2844 -
avx AfT-y 0.0520 0.0601 -
avx y-Ac 0.2557 0.3841 -
avx y-Af 0.0522 0.0567 -
avx2 AcT-y 0.6394 0.7250 1.3473
avx2 AfT-y 0.0463 0.0536 0.7966
avx2 y-Ac 0.5724 0.6649 1.3349
avx2 y-Af 0.0446 0.0520 0.783
avx512 AcT-y 0.2518 0.2539 0.9742
avx512 AfT-y 0.0477 0.0544 0.4983
avx512 y-Ac 0.2556 0.2415 0.895
avx512 y-Af 0.0474 0.0527 0.4904

small=1, dtype=f4, num_threads=1

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • y: (150000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 0.7520 0.7348 -
avx AfT-y 0.4706 0.4814 -
avx y-Ac 0.7246 0.7331 -
avx y-Af 0.4714 0.4783 -
avx2 AcT-y 0.4337 0.4416 1.4658
avx2 AfT-y 0.4721 0.4809 1.3509
avx2 y-Ac 0.4325 0.4430 1.2209
avx2 y-Af 0.4702 0.4779 1.0979
avx512 AcT-y 0.3794 0.3922 1.7906
avx512 AfT-y 0.4933 0.4984 1.6916
avx512 y-Ac 0.3810 0.3910 1.4169
avx512 y-Af 0.4756 0.4998 1.3061

small=1, dtype=f4, num_threads=8

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • y: (150000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 0.6914 0.8855 -
avx AfT-y 0.0975 0.1006 -
avx y-Ac 0.7036 0.7992 -
avx y-Af 0.0410 0.0577 -
avx2 AcT-y 0.9887 1.2641 0.9136
avx2 AfT-y 0.0384 0.1013 0.7207
avx2 y-Ac 0.9842 1.0730 1.2324
avx2 y-Af 0.0937 0.0491 1.0977
avx512 AcT-y 0.5894 0.4787 0.6526
avx512 AfT-y 0.0422 0.0532 0.5004
avx512 y-Ac 0.5631 0.4685 0.6435
avx512 y-Af 0.0427 0.0541 0.4594

small=1, dtype=f4, num_threads=16

  • Ac: (150000, 15) matrix, C-contiguous
  • AcT: (150000, 15) matrix, C-contiguous, transposed
  • Af: (150000, 15) matrix, F-contiguous
  • AfT: (150000, 15) matrix, F-contiguous, transposed
  • y: (150000,) vector
isa operands dot gemm gemm-strict
avx AcT-y 0.5946 0.3349 -
avx AfT-y 0.0355 0.0465 -
avx y-Ac 0.5720 0.3836 -
avx y-Af 0.0370 0.0415 -
avx2 AcT-y 0.8885 0.5543 1.1668
avx2 AfT-y 0.0358 0.0286 1.2774
avx2 y-Ac 0.9370 0.5372 1.2355
avx2 y-Af 0.0341 0.0455 1.0991
avx512 AcT-y 0.2873 0.4690 0.7153
avx512 AfT-y 0.0408 0.0305 0.5648
avx512 y-Ac 0.4008 0.3853 0.6986
avx512 y-Af 0.0398 0.0288 0.5527

mat-mat


small=0, dtype=f8, num_threads=1

  • Bc: (3000, 3000) matrix, C-contiguous
  • Bf: (3000, 3000) matrix, F-contiguous
  • Cc: (3000, 3000) matrix, C-contiguous
  • Cf: (3000, 3000) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 2375.17 2374.86 -
avx Bc-Cf 2384.14 2447.75 -
avx Bf-Cc 2373.71 2377.03 -
avx Bf-Cf 2363.71 2367.88 -
avx2 Bc-Cc 1263.58 1255.88 1256.8004
avx2 Bc-Cf 1252.99 1301.95 1254.4515
avx2 Bf-Cc 1251.43 1254.46 1300.8235
avx2 Bf-Cf 1246.28 1246.09 1251.3151
avx512 Bc-Cc 774.56 775.08 773.2522
avx512 Bc-Cf 771.82 776.34 799.6686
avx512 Bf-Cc 779.26 782.37 778.6236
avx512 Bf-Cf 769.04 773.98 775.7101
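
One way the gemm calls could be lined up with np.dot() for the C- and F-contiguous inputs above is sketched below. It assumes `blas` refers to `scipy.linalg.blas`, which the notes do not state; `gemm_like_dot` is a hypothetical helper, and the transpose trick is a common approach rather than necessarily the rearrangement used for these numbers. The matrices are shrunk to (300, 300), and the result comes from whichever BLAS SciPy was built against.

```python
import numpy as np
from scipy.linalg import get_blas_funcs


def gemm_like_dot(a, b):
    """Compute a @ b via BLAS ?gemm without copying contiguous inputs."""
    gemm = get_blas_funcs("gemm", (a, b))  # sgemm for f4, dgemm for f8
    # BLAS expects column-major data. A C-contiguous matrix is the
    # column-major storage of its transpose, so pass the .T view and
    # let gemm transpose it back through trans_a / trans_b.
    ta = not a.flags.f_contiguous
    tb = not b.flags.f_contiguous
    return gemm(
        1.0,
        a.T if ta else a,
        b.T if tb else b,
        trans_a=int(ta),
        trans_b=int(tb),
    )


rng = np.random.RandomState(0)
Bc = rng.standard_normal((300, 300))
Cc = rng.standard_normal((300, 300))
Bf, Cf = np.asfortranarray(Bc), np.asfortranarray(Cc)

pairs = {"Bc-Cc": (Bc, Cc), "Bc-Cf": (Bc, Cf), "Bf-Cc": (Bf, Cc), "Bf-Cf": (Bf, Cf)}
for name, (x, y) in pairs.items():
    print(name, np.allclose(gemm_like_dot(x, y), np.dot(x, y)))
```

Note that the result comes back F-ordered, whereas np.dot() returns a C-ordered array for C-contiguous inputs.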

small=0, dtype=f8, num_threads=8

  • Bc: (3000, 3000) matrix, C-contiguous
  • Bf: (3000, 3000) matrix, F-contiguous
  • Cc: (3000, 3000) matrix, C-contiguous
  • Cf: (3000, 3000) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 375.62 362.89 -
avx Bc-Cf 347.77 339.80 -
avx Bf-Cc 346.43 339.93 -
avx Bf-Cf 343.72 372.99 -
avx2 Bc-Cc 214.43 212.76 217.7826
avx2 Bc-Cf 200.34 196.39 208.8227
avx2 Bf-Cc 200.54 198.09 206.0442
avx2 Bf-Cf 202.84 215.40 211.8449
avx512 Bc-Cc 162.93 158.49 151.5677
avx512 Bc-Cf 141.32 135.50 146.9132
avx512 Bf-Cc 146.14 147.11 150.8385
avx512 Bf-Cf 161.68 159.28 146.8157

small=0, dtype=f8, num_threads=16

  • Bc: (3000, 3000) matrix, C-contiguous
  • Bf: (3000, 3000) matrix, F-contiguous
  • Cc: (3000, 3000) matrix, C-contiguous
  • Cf: (3000, 3000) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 211.73 197.75 -
avx Bc-Cf 191.21 211.58 -
avx Bf-Cc 193.20 213.48 -
avx Bf-Cf 207.33 194.72 -
avx2 Bc-Cc 139.49 134.84 144.8766
avx2 Bc-Cf 132.68 136.62 145.2361
avx2 Bf-Cc 126.92 137.15 135.8298
avx2 Bf-Cf 137.72 125.34 143.8074
avx512 Bc-Cc 113.10 121.20 108.8694
avx512 Bc-Cf 105.97 121.44 110.4436
avx512 Bf-Cc 96.76 109.15 110.1264
avx512 Bf-Cf 107.85 93.54 92.4746

small=0, dtype=f4, num_threads=1

  • Bc: (3000, 3000) matrix, C-contiguous
  • Bf: (3000, 3000) matrix, F-contiguous
  • Cc: (3000, 3000) matrix, C-contiguous
  • Cf: (3000, 3000) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 1237.06 1266.51 -
avx Bc-Cf 1200.05 1197.30 -
avx Bf-Cc 1208.18 1200.05 -
avx Bf-Cf 1188.55 1186.36 -
avx2 Bc-Cc 651.27 646.33 622.0946
avx2 Bc-Cf 636.30 618.42 620.358
avx2 Bf-Cc 619.06 616.23 619.4856
avx2 Bf-Cf 618.98 614.67 615.4227
avx512 Bc-Cc 429.51 415.97 396.0257
avx512 Bc-Cf 425.59 386.70 385.7088
avx512 Bf-Cc 387.70 396.91 388.1412
avx512 Bf-Cf 389.95 386.39 385.8595

small=0, dtype=f4, num_threads=8

  • Bc: (3000, 3000) matrix, C-contiguous
  • Bf: (3000, 3000) matrix, F-contiguous
  • Cc: (3000, 3000) matrix, C-contiguous
  • Cf: (3000, 3000) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 181.08 183.31 -
avx Bc-Cf 174.86 174.90 -
avx Bf-Cc 181.26 182.11 -
avx Bf-Cf 172.69 173.52 -
avx2 Bc-Cc 119.05 112.96 114.547
avx2 Bc-Cf 102.87 101.86 106.1758
avx2 Bf-Cc 102.00 101.03 100.6712
avx2 Bf-Cf 101.77 96.69 96.692
avx512 Bc-Cc 74.11 75.73 71.1032
avx512 Bc-Cf 70.13 70.64 77.2659
avx512 Bf-Cc 76.91 72.17 72.1736
avx512 Bf-Cf 76.64 70.80 70.7051

small=0, dtype=f4, num_threads=16

  • Bc: (3000, 3000) matrix, C-contiguous
  • Bf: (3000, 3000) matrix, F-contiguous
  • Cc: (3000, 3000) matrix, C-contiguous
  • Cf: (3000, 3000) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 112.95 98.68 -
avx Bc-Cf 96.67 100.91 -
avx Bf-Cc 100.31 106.11 -
avx Bf-Cf 94.04 93.73 -
avx2 Bc-Cc 68.42 63.35 61.5305
avx2 Bc-Cf 62.37 59.23 66.0565
avx2 Bf-Cc 60.32 67.55 57.3751
avx2 Bf-Cf 58.60 58.15 53.5256
avx512 Bc-Cc 44.29 45.04 44.8097
avx512 Bc-Cf 43.71 53.66 53.3274
avx512 Bf-Cc 47.58 54.77 47.1386
avx512 Bf-Cf 45.44 41.61 40.4459

small=1, dtype=f8, num_threads=1

  • Bc: (300, 300) matrix, C-contiguous
  • Bf: (300, 300) matrix, F-contiguous
  • Cc: (300, 300) matrix, C-contiguous
  • Cf: (300, 300) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 2.4306 2.4424 -
avx Bc-Cf 2.4559 2.4641 -
avx Bf-Cc 2.4374 2.4397 -
avx Bf-Cf 2.4593 2.4652 -
avx2 Bc-Cc 1.2834 1.2929 1.2953
avx2 Bc-Cf 1.2904 1.2987 1.3664
avx2 Bf-Cc 1.2822 1.2872 1.2889
avx2 Bf-Cf 1.2877 1.2925 1.297
avx512 Bc-Cc 0.8655 0.8733 0.8759
avx512 Bc-Cf 0.8676 0.8721 0.875
avx512 Bf-Cc 0.8695 0.8740 0.874
avx512 Bf-Cf 0.8664 0.8750 0.8755

small=1, dtype=f8, num_threads=8

  • Bc: (300, 300) matrix, C-contiguous
  • Bf: (300, 300) matrix, F-contiguous
  • Cc: (300, 300) matrix, C-contiguous
  • Cf: (300, 300) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 0.4802 0.4499 -
avx Bc-Cf 0.4153 0.4215 -
avx Bf-Cc 0.4489 0.4723 -
avx Bf-Cf 0.4635 0.4230 -
avx2 Bc-Cc 0.2465 0.2542 0.2539
avx2 Bc-Cf 0.2437 0.2508 0.2897
avx2 Bf-Cc 0.2463 0.2759 0.2749
avx2 Bf-Cf 0.2553 0.2677 0.288
avx512 Bc-Cc 0.1957 0.2155 0.2053
avx512 Bc-Cf 0.2599 0.2286 0.2213
avx512 Bf-Cc 0.1965 0.2272 0.2265
avx512 Bf-Cf 0.2167 0.2275 0.2239

small=1, dtype=f8, num_threads=16

  • Bc: (300, 300) matrix, C-contiguous
  • Bf: (300, 300) matrix, F-contiguous
  • Cc: (300, 300) matrix, C-contiguous
  • Cf: (300, 300) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 0.2437 0.2511 -
avx Bc-Cf 0.2470 0.3000 -
avx Bf-Cc 0.2480 0.2551 -
avx Bf-Cf 0.2964 0.2968 -
avx2 Bc-Cc 0.1655 0.2024 0.1915
avx2 Bc-Cf 0.1698 0.1957 0.2005
avx2 Bf-Cc 0.1652 0.1838 0.1919
avx2 Bf-Cf 0.1845 0.1960 0.1907
avx512 Bc-Cc 0.1042 0.1657 0.1786
avx512 Bc-Cf 0.1268 0.1431 0.1636
avx512 Bf-Cc 0.1574 0.1633 0.1695
avx512 Bf-Cf 0.1645 0.1681 0.1669

small=1, dtype=f4, num_threads=1

  • Bc: (300, 300) matrix, C-contiguous
  • Bf: (300, 300) matrix, F-contiguous
  • Cc: (300, 300) matrix, C-contiguous
  • Cf: (300, 300) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 1.2710 1.2784 -
avx Bc-Cf 1.2810 1.2879 -
avx Bf-Cc 1.2412 1.2496 -
avx Bf-Cf 1.2515 1.2579 -
avx2 Bc-Cc 0.7203 0.7267 0.7401
avx2 Bc-Cf 0.7207 0.7269 0.7401
avx2 Bf-Cc 0.7167 0.7246 0.7374
avx2 Bf-Cf 0.7195 0.7253 0.7377
avx512 Bc-Cc 0.4485 0.4568 0.457
avx512 Bc-Cf 0.4451 0.4532 0.4537
avx512 Bf-Cc 0.4500 0.4585 0.4603
avx512 Bf-Cf 0.4478 0.4554 0.4556

small=1, dtype=f4, num_threads=8

  • Bc: (300, 300) matrix, C-contiguous
  • Bf: (300, 300) matrix, F-contiguous
  • Cc: (300, 300) matrix, C-contiguous
  • Cf: (300, 300) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 0.2458 0.2506 -
avx Bc-Cf 0.2422 0.2372 -
avx Bf-Cc 0.2346 0.2267 -
avx Bf-Cf 0.2282 0.2491 -
avx2 Bc-Cc 0.1359 0.1423 0.1435
avx2 Bc-Cf 0.1385 0.1683 0.1364
avx2 Bf-Cc 0.1414 0.1359 0.1347
avx2 Bf-Cf 0.1340 0.1402 0.1366
avx512 Bc-Cc 0.1121 0.1225 0.1175
avx512 Bc-Cf 0.1111 0.1059 0.1063
avx512 Bf-Cc 0.1070 0.1066 0.1059
avx512 Bf-Cf 0.1047 0.1128 0.1059

small=1, dtype=f4, num_threads=16

  • Bc: (300, 300) matrix, C-contiguous
  • Bf: (300, 300) matrix, F-contiguous
  • Cc: (300, 300) matrix, C-contiguous
  • Cf: (300, 300) matrix, F-contiguous
isa operands dot gemm gemm-strict
avx Bc-Cc 0.1421 0.1822 -
avx Bc-Cf 0.1740 0.1488 -
avx Bf-Cc 0.1528 0.1450 -
avx Bf-Cf 0.1378 0.1545 -
avx2 Bc-Cc 0.0904 0.1061 0.1056
avx2 Bc-Cf 0.1040 0.0989 0.0994
avx2 Bf-Cc 0.0904 0.0954 0.0966
avx2 Bf-Cf 0.0954 0.0985 0.1011
avx512 Bc-Cc 0.0904 0.0770 0.0889
avx512 Bc-Cf 0.0894 0.0718 0.0887
avx512 Bf-Cc 0.0694 0.0753 0.0894
avx512 Bf-Cf 0.0668 0.0892 0.0997

vec-vec


small=0, dtype=f8, num_threads=1

  • a2: (25000000,) vector
  • b2: (25000000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 37.94 39.49 -
avx2 a2-b2 32.57 33.14 67.5952
avx512 a2-b2 29.98 29.58 80.7738
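
As a reading aid, the cost of strict CNR can be expressed as the ratio of the gemm-strict column to the gemm column. The sketch below parses the three data rows of the table above; `strict_overhead` is a hypothetical helper, not part of the benchmark.

```python
# Data rows copied from the table above (small=0, dtype=f8, num_threads=1).
rows = """\
avx a2-b2 37.94 39.49 -
avx2 a2-b2 32.57 33.14 67.5952
avx512 a2-b2 29.98 29.58 80.7738
"""


def strict_overhead(table_text):
    """Map instruction set -> gemm-strict / gemm ratio, skipping '-' cells."""
    ratios = {}
    for line in table_text.splitlines():
        isa, _operands, _dot, gemm, strict = line.split()
        if strict != "-":
            ratios[isa] = float(strict) / float(gemm)
    return ratios


for isa, ratio in strict_overhead(rows).items():
    print(f"{isa}: gemm-strict takes {ratio:.2f}x the gemm time")
```

For this table the ratios come out at roughly 2.0x for avx2 and 2.7x for avx512; the avx row has no gemm-strict entry and is skipped.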

small=0, dtype=f8, num_threads=8

  • a2: (25000000,) vector
  • b2: (25000000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 6.81 7.17 -
avx2 a2-b2 6.39 6.85 69.2601
avx512 a2-b2 5.95 6.70 84.1317

small=0, dtype=f8, num_threads=16

  • a2: (25000000,) vector
  • b2: (25000000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 3.69 5.40 -
avx2 a2-b2 4.71 5.67 71.512
avx512 a2-b2 3.32 4.91 84.9237

small=0, dtype=f4, num_threads=1

  • a2: (25000000,) vector
  • b2: (25000000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 18.19 18.03 -
avx2 a2-b2 16.06 16.19 54.189
avx512 a2-b2 14.31 14.49 62.856

small=0, dtype=f4, num_threads=8

  • a2: (25000000,) vector
  • b2: (25000000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 3.49 2.59 -
avx2 a2-b2 3.02 3.02 54.4306
avx512 a2-b2 3.21 2.33 62.906

small=0, dtype=f4, num_threads=16

  • a2: (25000000,) vector
  • b2: (25000000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 2.04 2.21 -
avx2 a2-b2 1.59 1.70 54.3137
avx512 a2-b2 2.00 1.66 62.7139

small=1, dtype=f8, num_threads=1

  • a2: (2500000,) vector
  • b2: (2500000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 2.2779 2.5178 -
avx2 a2-b2 2.0921 2.1423 5.5889
avx512 a2-b2 1.9314 1.9053 6.4217

small=1, dtype=f8, num_threads=8

  • a2: (2500000,) vector
  • b2: (2500000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 0.2286 0.2406 -
avx2 a2-b2 0.2222 0.2358 5.6441
avx512 a2-b2 0.2286 0.2918 6.3899

small=1, dtype=f8, num_threads=16

  • a2: (2500000,) vector
  • b2: (2500000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 0.1192 0.1383 -
avx2 a2-b2 0.1154 0.1309 6.2925
avx512 a2-b2 0.1428 0.1347 6.356

small=1, dtype=f4, num_threads=1

  • a2: (2500000,) vector
  • b2: (2500000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 0.7795 0.8206 -
avx2 a2-b2 0.7999 0.8123 5.2384
avx512 a2-b2 0.7951 0.8414 5.6853

small=1, dtype=f4, num_threads=8

  • a2: (2500000,) vector
  • b2: (2500000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 0.1030 0.1173 -
avx2 a2-b2 0.1063 0.1171 5.2404
avx512 a2-b2 0.1090 0.1171 5.6746

small=1, dtype=f4, num_threads=16

  • a2: (2500000,) vector
  • b2: (2500000,) vector
isa operands dot gemm gemm-strict
avx a2-b2 0.0429 0.0534 -
avx2 a2-b2 0.0412 0.0515 5.475
avx512 a2-b2 0.0610 0.0591 6.2094