After checking the status quo of aarch64 vector code generation by recent-enough compilers, particularly when it comes to intrinsics codegen, a few points surfaced, which I'll try to reason about here. The originally participating compilers were gcc-11 through 13 and clang-11 through 13. Here we will focus on just one "lineage" of the sampled compilers -- clang, as its vector codegen is as good as or better than its peers'. In particular, the focus will be on how perceived optimisations may affect the codegen, and thus performance, in the context of a simple function -- a vec3-by-matrix3x3 transformation, i.e. multiplication.
Performance surveyed on the following uarches:
- cortex-a76 -- most of aarch64 cloud install base, raspberry pi 5, et al; clang-13
- cortex-a78 -- a fair midrange performer these days; NV Orin et al; clang-11 & 12
- apple M1 -- the uarch which set the bar of aarch64 high-end performance a few years ago; apple clang-12
The function we'll work with, in baseline scalar form:
void transform_x1(const matx3* mat, const float* vec, size_t num_vec, float* out)
{
    const float m00 = mat->m[0][0];
    const float m01 = mat->m[0][1];
    const float m02 = mat->m[0][2];
    const float m10 = mat->m[1][0];
    const float m11 = mat->m[1][1];
    const float m12 = mat->m[1][2];
    const float m20 = mat->m[2][0];
    const float m21 = mat->m[2][1];
    const float m22 = mat->m[2][2];
    for ( ; num_vec; num_vec -= 1, vec += 3, out += 3) {
        const float v0 = vec[0];
        const float v1 = vec[1];
        const float v2 = vec[2];
        out[0] = m00 * v0 + m01 * v1 + m02 * v2;
        out[1] = m10 * v0 + m11 * v1 + m12 * v2;
        out[2] = m20 * v0 + m21 * v1 + m22 * v2;
    }
}
As we're interested in the generic case, to avoid any compile-time code specialisation, we make sure the compiler does not know:
- how big our input vector array is
- what our vector array contains
- what our matrix operator contains
Still, we'll let the compiler know that:
- our arrays and matrix operator are aligned at cacheline boundaries
- no aliasing between the arrays can happen (one way to convey these two hints is sketched right below)
We let the compiler flex its muscles with: -Ofast -ffp-contract=fast -fstrict-aliasing -DNDEBUG
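A minimal sketch of how those two hints can be conveyed -- not necessarily how the original harness spells it -- via restrict-qualified pointers and the clang/gcc alignment builtin:
void transform_x1(const matx3* restrict mat, const float* restrict vec, size_t num_vec, float* restrict out)
{
    /* promise cacheline (64-byte) alignment to the compiler */
    vec = (const float*)__builtin_assume_aligned(vec, 64);
    out = (float*)__builtin_assume_aligned(out, 64);
    /* ... body as above ... */
}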
As we effectively care only about our array-processing loop, that is what we will look at.
apple clang-12
1000039e8: 3a 48 df 4c ld3.4s { v26, v27, v28 }, [x1], #48
1000039ec: 5d df 31 6e fmul.4s v29, v26, v17
1000039f0: 7d cf 32 4e fmla.4s v29, v27, v18
1000039f4: 9d cf 33 4e fmla.4s v29, v28, v19
1000039f8: 5e df 34 6e fmul.4s v30, v26, v20
1000039fc: 7e cf 35 4e fmla.4s v30, v27, v21
100003a00: 9e cf 36 4e fmla.4s v30, v28, v22
100003a04: 5f df 37 6e fmul.4s v31, v26, v23
100003a08: 7f cf 38 4e fmla.4s v31, v27, v24
100003a0c: 9f cf 39 4e fmla.4s v31, v28, v25
100003a10: 7d 48 9f 4c st3.4s { v29, v30, v31 }, [x3], #48
100003a14: 6b 11 00 f1 subs x11, x11, #4
100003a18: 81 fe ff 54 b.ne 0x1000039e8 <_transform_x1+0x78>
Things to note:
- our 3x3 matrix has been preloaded into 9 VPRs via splatting -- each VPR contains a single matx element, replicated across all lanes
- input and output arrays are accessed via ld3/st3 -- "load/store multiple 3-element structures to/from three registers", allowing AoS -> SoA -> AoS streaming access
- all ALU ops are carried out over 4-element VPRs, i.e. 4 * vec3 processed per iteration; tails are not considered here, as our array is well-sized
Overall, apple clang-12 demonstrates perfectly competent codegen from our basic function. Moving on.
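For reference, the splats from the first note can be written out explicitly with intrinsics; a minimal sketch of what the pre-loop setup amounts to (only the first matrix row shown, assuming the same matx3 layout as above):
const float32x4_t vm00 = vdupq_n_f32(mat->m[0][0]);  /* m00 replicated across all 4 lanes */
const float32x4_t vm01 = vdupq_n_f32(mat->m[0][1]);
const float32x4_t vm02 = vdupq_n_f32(mat->m[0][2]);
/* ... likewise for the remaining 6 elements, for 9 splatted VPRs total */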
upstream clang-11, 12, & 13
4009f4: 4cdf4a5a ld3 {v26.4s-v28.4s}, [x18], #48
4009f8: f1001231 subs x17, x17, #0x4
4009fc: 6e31df5d fmul v29.4s, v26.4s, v17.4s
400a00: 4e32cf7d fmla v29.4s, v27.4s, v18.4s
400a04: 4e33cf9d fmla v29.4s, v28.4s, v19.4s
400a08: 6e34df5e fmul v30.4s, v26.4s, v20.4s
400a0c: 4e35cf7e fmla v30.4s, v27.4s, v21.4s
400a10: 4e36cf9e fmla v30.4s, v28.4s, v22.4s
400a14: 6e37df5f fmul v31.4s, v26.4s, v23.4s
400a18: 4e38cf7f fmla v31.4s, v27.4s, v24.4s
400a1c: 4e39cf9f fmla v31.4s, v28.4s, v25.4s
400a20: 4c9f481d st3 {v29.4s-v31.4s}, [x0], #48
400a24: 54fffe81 b.ne 4009f4 <main+0x110>
Things to note:
- upstream clang does on-par with apple clang-12. Same notes apply.
What if our compiler was not autovectorising so well, though, and we still wanted to get the SIMD benefits? Enter VPR intrinsics. We'll introduce two hand-vectorised versions of the routine -- by-4 and by-8. We'll see why later.
void transform_x4(const matx3* mat, const float* vec, size_t num_vec, float* out)
{
    assert(0 == num_vec % 4);
    const float m00 = mat->m[0][0];
    const float m01 = mat->m[0][1];
    const float m02 = mat->m[0][2];
    const float m10 = mat->m[1][0];
    const float m11 = mat->m[1][1];
    const float m12 = mat->m[1][2];
    const float m20 = mat->m[2][0];
    const float m21 = mat->m[2][1];
    const float m22 = mat->m[2][2];
    for ( ; num_vec; num_vec -= 4, vec += 12, out += 12) {
        const float32x4x3_t t = vld3q_f32(vec);
        const float32x4_t r0 =
            m00 * t.val[0] +
            m01 * t.val[1] +
            m02 * t.val[2];
        const float32x4_t r1 =
            m10 * t.val[0] +
            m11 * t.val[1] +
            m12 * t.val[2];
        const float32x4_t r2 =
            m20 * t.val[0] +
            m21 * t.val[1] +
            m22 * t.val[2];
        const float32x4x3_t r = { .val = { r0, r1, r2 } };
        vst3q_f32(out, r);
    }
}
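A side note: the scalar-by-vector arithmetic above leans on clang's (and gcc's) vector extensions, which splat the scalar implicitly. A spelling that sticks to plain NEON intrinsics -- one possible equivalent, shown for the first result row only -- could look like:
float32x4_t r0 = vmulq_n_f32(t.val[0], m00);   /* r0  = m00 * t.val[0]         */
r0 = vfmaq_n_f32(r0, t.val[1], m01);           /* r0 += m01 * t.val[1], fused  */
r0 = vfmaq_n_f32(r0, t.val[2], m02);           /* r0 += m02 * t.val[2], fused  */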
void transform_x8(const matx3* mat, const float* vec, size_t num_vec, float* out)
{
    assert(0 == num_vec % 8);
    const float m00 = mat->m[0][0];
    const float m01 = mat->m[0][1];
    const float m02 = mat->m[0][2];
    const float m10 = mat->m[1][0];
    const float m11 = mat->m[1][1];
    const float m12 = mat->m[1][2];
    const float m20 = mat->m[2][0];
    const float m21 = mat->m[2][1];
    const float m22 = mat->m[2][2];
    for ( ; num_vec; num_vec -= 8, vec += 24, out += 24) {
        const float32x4x3_t t0 = vld3q_f32(vec + 0);
        const float32x4x3_t t1 = vld3q_f32(vec + 12);
        const float32x4_t r0 =
            m00 * t0.val[0] +
            m01 * t0.val[1] +
            m02 * t0.val[2];
        const float32x4_t r1 =
            m10 * t0.val[0] +
            m11 * t0.val[1] +
            m12 * t0.val[2];
        const float32x4_t r2 =
            m20 * t0.val[0] +
            m21 * t0.val[1] +
            m22 * t0.val[2];
        const float32x4_t r3 =
            m00 * t1.val[0] +
            m01 * t1.val[1] +
            m02 * t1.val[2];
        const float32x4_t r4 =
            m10 * t1.val[0] +
            m11 * t1.val[1] +
            m12 * t1.val[2];
        const float32x4_t r5 =
            m20 * t1.val[0] +
            m21 * t1.val[1] +
            m22 * t1.val[2];
        const float32x4x3_t s0 = { .val = { r0, r1, r2 } };
        const float32x4x3_t s1 = { .val = { r3, r4, r5 } };
        vst3q_f32(out + 0, s0);
        vst3q_f32(out + 12, s1);
    }
}
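Since transform_x8 asserts num_vec to be a multiple of 8, a caller facing arbitrary counts would have to split the work itself; a hypothetical dispatch sketch (the name and the split strategy are illustrative, not part of the measured harness):
void transform_any(const matx3* mat, const float* vec, size_t num_vec, float* out)
{
    const size_t bulk = num_vec & ~(size_t)7;  /* largest multiple of 8 */
    if (bulk)
        transform_x8(mat, vec, bulk, out);
    if (num_vec - bulk)                        /* scalar tail of 0..7 vec3s */
        transform_x1(mat, vec + bulk * 3, num_vec - bulk, out + bulk * 3);
}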
The codegen from these two routines, inlining notwithstanding, is very similar across all clang versions, so output from just one compiler is shown here:
upstream clang-13 transform_x4
400954: 4cdf49d1 ld3 {v17.4s-v19.4s}, [x14], #48
400958: f10011ad subs x13, x13, #0x4
40095c: 6e22de54 fmul v20.4s, v18.4s, v2.4s
400960: 4e21ce34 fmla v20.4s, v17.4s, v1.4s
400964: 4e23ce74 fmla v20.4s, v19.4s, v3.4s
400968: 6e25de55 fmul v21.4s, v18.4s, v5.4s
40096c: 4e24ce35 fmla v21.4s, v17.4s, v4.4s
400970: 4e26ce75 fmla v21.4s, v19.4s, v6.4s
400974: 6e30de56 fmul v22.4s, v18.4s, v16.4s
400978: 4e27ce36 fmla v22.4s, v17.4s, v7.4s
40097c: 4e20ce76 fmla v22.4s, v19.4s, v0.4s
400980: 4c9f49f4 st3 {v20.4s-v22.4s}, [x15], #48
400984: 54fffe81 b.ne 400954 <main+0x80>
Things to note:
- effectively identical codegen to the scalar version above.
upstream clang-13 transform_x8
400954: aa0e03f0 mov x16, x14
400958: 4cdf4a11 ld3 {v17.4s-v19.4s}, [x16], #48
40095c: f10021ad subs x13, x13, #0x8
400960: 910181ce add x14, x14, #0x60
400964: 6e22de57 fmul v23.4s, v18.4s, v2.4s
400968: 4e21ce37 fmla v23.4s, v17.4s, v1.4s
40096c: 4e23ce77 fmla v23.4s, v19.4s, v3.4s
400970: 6e25de58 fmul v24.4s, v18.4s, v5.4s
400974: 4c404a14 ld3 {v20.4s-v22.4s}, [x16]
400978: 4e24ce38 fmla v24.4s, v17.4s, v4.4s
40097c: 4e26ce78 fmla v24.4s, v19.4s, v6.4s
400980: 6e30de59 fmul v25.4s, v18.4s, v16.4s
400984: 4e27ce39 fmla v25.4s, v17.4s, v7.4s
400988: 4e20ce79 fmla v25.4s, v19.4s, v0.4s
40098c: 6e22deb1 fmul v17.4s, v21.4s, v2.4s
400990: 4e21ce91 fmla v17.4s, v20.4s, v1.4s
400994: 4e23ced1 fmla v17.4s, v22.4s, v3.4s
400998: 6e25deb2 fmul v18.4s, v21.4s, v5.4s
40099c: 4e24ce92 fmla v18.4s, v20.4s, v4.4s
4009a0: 4e26ced2 fmla v18.4s, v22.4s, v6.4s
4009a4: 6e30deb3 fmul v19.4s, v21.4s, v16.4s
4009a8: 4e27ce93 fmla v19.4s, v20.4s, v7.4s
4009ac: 910181f0 add x16, x15, #0x60
4009b0: 4c9f49f7 st3 {v23.4s-v25.4s}, [x15], #48
4009b4: 4e20ced3 fmla v19.4s, v22.4s, v0.4s
4009b8: 4c0049f1 st3 {v17.4s-v19.4s}, [x15]
4009bc: aa1003ef mov x15, x16
4009c0: 54fffca1 b.ne 400954 <main+0x80>
Things to note:
- little surprise here either, though the loop iterators (read: the incrementing pointers) surely could be handled better -- we'll return to that
Additionally, we have hand-written assembly versions of the above two routines, where a prima-vista attempt will be compared versus the compiler codegen.
For performance measurements we will consider two datasets:
- a "regular dataset" -- an array of 2^20 vec3 elements
- a "small dataset" -- an array of 24 vec3 elements
Each of the two datasets gets a fixed number of repetitions, chosen so that the overall test duration lands between one and two seconds, so we can gather statistically sound results.
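In other words, the benchmark driver simply repeats the routine under test over the same buffers; a minimal sketch of the idea (the repetition count and the names here are assumptions, not taken from the actual harness):
/* hypothetical: num_reps chosen per dataset so the whole run lands in the 1-2 s window */
for (size_t rep = 0; rep < num_reps; ++rep)
    transform_x8(&mat, input, num_vec, output);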
Testee filename encoding: vec3.<compiler-vs-asm><small-vs-regular-dataset><routine-version>
- e{0,1} -- compiler (e0) or assembly (e1)
- {a,b} -- "small" (a) or "regular" (b) dataset
- {n1,0,1} -- scalar version ("negative-one"; e0 only), transform_x4 (0) or transform_x8 (1)
So we end up with the following testee set:
- vec3.e0an1 -- scalar transform_x1 routine, autovectorised into AoS -> SoA -> AoS for a SIMD inner loop, "small" dataset
- vec3.e0bn1 -- scalar transform_x1 routine, autovectorised into AoS -> SoA -> AoS for a SIMD inner loop, "regular" dataset
- vec3.e0a0 -- intrinsics transform_x4 routine, "small" dataset
- vec3.e0b0 -- intrinsics transform_x4 routine, "regular" dataset
- vec3.e0a1 -- intrinsics transform_x8 routine, "small" dataset
- vec3.e0b1 -- intrinsics transform_x8 routine, "regular" dataset
- vec3.e1a0 -- assembly transform_x4, "small" dataset
- vec3.e1b0 -- assembly transform_x4, "regular" dataset
- vec3.e1a1 -- assembly transform_x8, "small" dataset
- vec3.e1b1 -- assembly transform_x8, "regular" dataset
Let's start with the raw data from runs over the regular dataset.
cortex-a76 (@ 2.4GHz, as found in raspberry pi 5):
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0bn1
Performance counter stats for './vec3.e0bn1':
1,237.45 msec task-clock:u # 1.000 CPUs utilized
2,952,634,002 cycles:u # 2.386 GHz
3,407,993,007 instructions:u # 1.15 insn per cycle
1,572,918,926 l1d_cache:u # 1.271 G/sec
197,194,613 l2d_cache:u # 159.355 M/sec
32,369 ll_cache_rd:u # 26.158 K/sec
1.237873179 seconds time elapsed
1.225787000 seconds user
0.011978000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0b0
Performance counter stats for './vec3.e0b0':
1,263.85 msec task-clock:u # 1.000 CPUs utilized
3,016,309,350 cycles:u # 2.387 GHz
3,407,978,988 instructions:u # 1.13 insn per cycle
1,572,922,185 l1d_cache:u # 1.245 G/sec
197,173,853 l2d_cache:u # 156.010 M/sec
33,254 ll_cache_rd:u # 26.312 K/sec
1.264290696 seconds time elapsed
1.256166000 seconds user
0.008001000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0b1
Performance counter stats for './vec3.e0b1':
1,232.45 msec task-clock:u # 0.999 CPUs utilized
2,940,806,784 cycles:u # 2.386 GHz
3,670,122,984 instructions:u # 1.25 insn per cycle
1,572,923,894 l1d_cache:u # 1.276 G/sec
197,303,337 l2d_cache:u # 160.091 M/sec
34,230 ll_cache_rd:u # 27.774 K/sec
1.233482958 seconds time elapsed
1.224740000 seconds user
0.008004000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1b0
Performance counter stats for './vec3.e1b0':
1,171.99 msec task-clock:u # 1.000 CPUs utilized
2,795,817,823 cycles:u # 2.386 GHz
3,407,967,015 instructions:u # 1.22 insn per cycle
1,572,918,240 l1d_cache:u # 1.342 G/sec
197,144,434 l2d_cache:u # 168.214 M/sec
31,555 ll_cache_rd:u # 26.924 K/sec
1.172311814 seconds time elapsed
1.160338000 seconds user
0.011962000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1b1
Performance counter stats for './vec3.e1b1':
1,174.30 msec task-clock:u # 1.000 CPUs utilized
2,801,470,249 cycles:u # 2.386 GHz
3,145,825,015 instructions:u # 1.12 insn per cycle
1,572,920,961 l1d_cache:u # 1.339 G/sec
197,231,327 l2d_cache:u # 167.957 M/sec
30,290 ll_cache_rd:u # 25.794 K/sec
1.174547091 seconds time elapsed
1.166583000 seconds user
0.008017000 seconds sys
cortex-a78 (@ 2.2GHz, as found in NV orin):
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0bn1
Performance counter stats for './vec3.e0bn1':
1190,24 msec task-clock:u # 0,999 CPUs utilized
2607223991 cycles:u # 2,191 GHz
3407990894 instructions:u # 1,31 insn per cycle
1310764515 l1d_cache:u # 1101,261 M/sec
589883602 l2d_cache:u # 495,601 M/sec
163423257 ll_cache_rd:u # 137,303 M/sec
1,190908605 seconds time elapsed
1,184724000 seconds user
0,003996000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0b0
Performance counter stats for './vec3.e0b0':
1189,80 msec task-clock:u # 0,999 CPUs utilized
2603000583 cycles:u # 2,188 GHz
3407976894 instructions:u # 1,31 insn per cycle
1310766566 l1d_cache:u # 1101,666 M/sec
589884874 l2d_cache:u # 495,783 M/sec
163432694 ll_cache_rd:u # 137,361 M/sec
1,190519403 seconds time elapsed
1,188339000 seconds user
0,000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0b1
Performance counter stats for './vec3.e0b1':
1171,95 msec task-clock:u # 1,000 CPUs utilized
2562875964 cycles:u # 2,187 GHz
3670120894 instructions:u # 1,43 insn per cycle
1310766771 l1d_cache:u # 1118,448 M/sec
589881753 l2d_cache:u # 503,333 M/sec
163346177 ll_cache_rd:u # 139,380 M/sec
1,172488529 seconds time elapsed
1,170347000 seconds user
0,000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1b0
Performance counter stats for './vec3.e1b0':
1181,95 msec task-clock:u # 1,000 CPUs utilized
2587425392 cycles:u # 2,189 GHz
3407964902 instructions:u # 1,32 insn per cycle
1310762790 l1d_cache:u # 1108,987 M/sec
589883701 l2d_cache:u # 499,079 M/sec
163181288 ll_cache_rd:u # 138,062 M/sec
1,182358976 seconds time elapsed
1,176337000 seconds user
0,003988000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1b1
Performance counter stats for './vec3.e1b1':
1161,94 msec task-clock:u # 0,999 CPUs utilized
2538722053 cycles:u # 2,185 GHz
3145822902 instructions:u # 1,24 insn per cycle
1310762562 l1d_cache:u # 1128,077 M/sec
589883678 l2d_cache:u # 507,670 M/sec
163265206 ll_cache_rd:u # 140,510 M/sec
1,162653505 seconds time elapsed
1,156491000 seconds user
0,003994000 seconds sys
Next, raw data from runs over the small dataset.
cortex-a76 (@ 2.4GHz):
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0an1
Performance counter stats for './vec3.e0an1':
1,537.63 msec task-clock:u # 1.000 CPUs utilized
3,687,853,899 cycles:u # 2.398 GHz
5,800,093,464 instructions:u # 1.57 insn per cycle
2,100,038,605 l1d_cache:u # 1.366 G/sec
8,476 l2d_cache:u # 5.512 K/sec
3,112 ll_cache_rd:u # 2.024 K/sec
1.537955558 seconds time elapsed
1.537915000 seconds user
0.000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0a0
Performance counter stats for './vec3.e0a0':
1,545.52 msec task-clock:u # 1.000 CPUs utilized
3,707,193,546 cycles:u # 2.399 GHz
5,100,093,487 instructions:u # 1.38 insn per cycle
2,200,039,677 l1d_cache:u # 1.423 G/sec
8,080 l2d_cache:u # 5.228 K/sec
3,126 ll_cache_rd:u # 2.023 K/sec
1.545786678 seconds time elapsed
1.541825000 seconds user
0.003994000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0a1
Performance counter stats for './vec3.e0a1':
1,468.98 msec task-clock:u # 1.000 CPUs utilized
3,523,414,771 cycles:u # 2.399 GHz
5,400,093,464 instructions:u # 1.53 insn per cycle
2,200,038,074 l1d_cache:u # 1.498 G/sec
7,961 l2d_cache:u # 5.419 K/sec
3,070 ll_cache_rd:u # 2.090 K/sec
1.469244463 seconds time elapsed
1.469262000 seconds user
0.000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1a0
Performance counter stats for './vec3.e1a0':
1,376.09 msec task-clock:u # 1.000 CPUs utilized
3,300,163,359 cycles:u # 2.398 GHz
4,500,093,471 instructions:u # 1.36 insn per cycle
2,000,038,758 l1d_cache:u # 1.453 G/sec
9,109 l2d_cache:u # 6.619 K/sec
3,184 ll_cache_rd:u # 2.314 K/sec
1.376407149 seconds time elapsed
1.376364000 seconds user
0.000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1a1
Performance counter stats for './vec3.e1a1':
1,338.49 msec task-clock:u # 1.000 CPUs utilized
3,210,260,258 cycles:u # 2.398 GHz
4,300,093,490 instructions:u # 1.34 insn per cycle
2,000,039,007 l1d_cache:u # 1.494 G/sec
9,187 l2d_cache:u # 6.864 K/sec
3,161 ll_cache_rd:u # 2.362 K/sec
1.338920898 seconds time elapsed
1.334840000 seconds user
0.003996000 seconds sys
cortex-a78 (@ 2.2GHz)
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0an1
Performance counter stats for './vec3.e0an1':
1451,88 msec task-clock:u # 1,000 CPUs utilized
3182349024 cycles:u # 2,192 GHz
5800080896 instructions:u # 1,82 insn per cycle
1800038899 l1d_cache:u # 1239,801 M/sec
9267 l2d_cache:u # 0,006 M/sec
3693 ll_cache_rd:u # 0,003 M/sec
1,452396450 seconds time elapsed
1,451673000 seconds user
0,000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0a0
Performance counter stats for './vec3.e0a0':
1445,92 msec task-clock:u # 1,000 CPUs utilized
3170729452 cycles:u # 2,193 GHz
5100080896 instructions:u # 1,61 insn per cycle
1900038812 l1d_cache:u # 1314,070 M/sec
9837 l2d_cache:u # 0,007 M/sec
3596 ll_cache_rd:u # 0,002 M/sec
1,446544663 seconds time elapsed
1,441731000 seconds user
0,003995000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0a1
Performance counter stats for './vec3.e0a1':
1432,04 msec task-clock:u # 1,000 CPUs utilized
3138849659 cycles:u # 2,192 GHz
5400080896 instructions:u # 1,72 insn per cycle
1900038439 l1d_cache:u # 1326,807 M/sec
8322 l2d_cache:u # 0,006 M/sec
3488 ll_cache_rd:u # 0,002 M/sec
1,432679364 seconds time elapsed
1,428091000 seconds user
0,003991000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1a0
Performance counter stats for './vec3.e1a0':
1333,72 msec task-clock:u # 1,000 CPUs utilized
2922652840 cycles:u # 2,191 GHz
4500080903 instructions:u # 1,54 insn per cycle
1700038019 l1d_cache:u # 1274,664 M/sec
10238 l2d_cache:u # 0,008 M/sec
3568 ll_cache_rd:u # 0,003 M/sec
1,334323436 seconds time elapsed
1,329582000 seconds user
0,003994000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1a1
Performance counter stats for './vec3.e1a1':
1300,40 msec task-clock:u # 0,999 CPUs utilized
2850149937 cycles:u # 2,192 GHz
4300080903 instructions:u # 1,51 insn per cycle
1700038446 l1d_cache:u # 1307,316 M/sec
8475 l2d_cache:u # 0,007 M/sec
3476 ll_cache_rd:u # 0,003 M/sec
1,301065851 seconds time elapsed
1,300624000 seconds user
0,000000000 seconds sys
Charted, the cpu-cycles data from all of the above look like the following, for the regular and small datasets, respectively:
Separated per uarch: