After checking the status quo of aarch64 vector code generation by recent-enough compilers, particularly when it comes to intrinsics codegen, a few points surfaced, which I'll try to reason about here. The originally participating compilers were gcc-11 through 13 and clang-11 through 13. Here we will focus on just one "lineage" of the sampled compilers -- clang, as its vector codegen is as good as or better than its peers'. In particular, the focus will be on how perceived optimisations may affect the codegen, and thus performance, in the context of a simple function -- a vec3-by-matrix3x3 transformation, i.e. multiplication.
Performance surveyed on the following uarches:
- cortex-a76 -- most of aarch64 cloud install base, raspberry pi 5, et al; clang-13
- cortex-a78 -- a fair midrange performer these days; NV Orin et al; clang-11 & 12
- apple M1 -- the uarch which set the bar of aarch64 high-end performance a few years ago; apple clang-12
The function we'll work with, in baseline scalar form:
void transform_x1(const matx3* mat, const float* vec, size_t num_vec, float* out)
{
    const float m00 = mat->m[0][0];
    const float m01 = mat->m[0][1];
    const float m02 = mat->m[0][2];
    const float m10 = mat->m[1][0];
    const float m11 = mat->m[1][1];
    const float m12 = mat->m[1][2];
    const float m20 = mat->m[2][0];
    const float m21 = mat->m[2][1];
    const float m22 = mat->m[2][2];
    for ( ; num_vec; num_vec -= 1, vec += 3, out += 3) {
        const float v0 = vec[0];
        const float v1 = vec[1];
        const float v2 = vec[2];
        out[0] = m00 * v0 + m01 * v1 + m02 * v2;
        out[1] = m10 * v0 + m11 * v1 + m12 * v2;
        out[2] = m20 * v0 + m21 * v1 + m22 * v2;
    }
}
As we're interested in the generic case, to avoid any compile-time code specialisation, we make sure the compiler does not know:
- how big our input vector array is
- what our vector array contains
- what our matrix operator contains
Still, we'll let the compiler know that:
- our arrays and matrix operator are aligned at cacheline boundaries
- no aliasing between the arrays can happen (one way to convey these two hints is sketched right below)
We let the compiler flex its muscles with: -Ofast -ffp-contract=fast -fstrict-aliasing -DNDEBUG
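A minimal sketch of how those two hints can be conveyed -- not necessarily how the original harness spells it -- via restrict-qualified pointers and the clang/gcc alignment builtin:
void transform_x1(const matx3* restrict mat, const float* restrict vec, size_t num_vec, float* restrict out)
{
    /* promise cacheline (64-byte) alignment to the compiler */
    vec = (const float*)__builtin_assume_aligned(vec, 64);
    out = (float*)__builtin_assume_aligned(out, 64);
    /* ... body as above ... */
}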
As we effectively care only about our array-processing loop, that is what we will look at.
apple clang-12
1000039e8: 3a 48 df 4c ld3.4s { v26, v27, v28 }, [x1], #48
1000039ec: 5d df 31 6e fmul.4s v29, v26, v17
1000039f0: 7d cf 32 4e fmla.4s v29, v27, v18
1000039f4: 9d cf 33 4e fmla.4s v29, v28, v19
1000039f8: 5e df 34 6e fmul.4s v30, v26, v20
1000039fc: 7e cf 35 4e fmla.4s v30, v27, v21
100003a00: 9e cf 36 4e fmla.4s v30, v28, v22
100003a04: 5f df 37 6e fmul.4s v31, v26, v23
100003a08: 7f cf 38 4e fmla.4s v31, v27, v24
100003a0c: 9f cf 39 4e fmla.4s v31, v28, v25
100003a10: 7d 48 9f 4c st3.4s { v29, v30, v31 }, [x3], #48
100003a14: 6b 11 00 f1 subs x11, x11, #4
100003a18: 81 fe ff 54 b.ne 0x1000039e8 <_transform_x1+0x78>
Things to note:
- our 3x3 matrix has been preloaded into 9 VPRs via splatting -- each VPR contains a single matx element, replicated across all lanes
- input and output arrays are accessed via ld3/st3 -- "load/store multiple 3-element structures to/from three registers", allowing AoS -> SoA -> AoS streaming access
- all ALU ops are carried out over 4-element VPRs, i.e. 4 * vec3 processed per iteration; tails are not considered here, as our array is well-sized
Overall, apple clang-12 demonstrates perfectly competent codegen from our basic function. Moving on.
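For reference, the splats from the first note can be written out explicitly with intrinsics; a minimal sketch of what the pre-loop setup amounts to (only the first matrix row shown, assuming the same matx3 layout as above):
const float32x4_t vm00 = vdupq_n_f32(mat->m[0][0]);  /* m00 replicated across all 4 lanes */
const float32x4_t vm01 = vdupq_n_f32(mat->m[0][1]);
const float32x4_t vm02 = vdupq_n_f32(mat->m[0][2]);
/* ... likewise for the remaining 6 elements, for 9 splatted VPRs total */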
upstream clang-11, 12, & 13
4009f4: 4cdf4a5a ld3 {v26.4s-v28.4s}, [x18], #48
4009f8: f1001231 subs x17, x17, #0x4
4009fc: 6e31df5d fmul v29.4s, v26.4s, v17.4s
400a00: 4e32cf7d fmla v29.4s, v27.4s, v18.4s
400a04: 4e33cf9d fmla v29.4s, v28.4s, v19.4s
400a08: 6e34df5e fmul v30.4s, v26.4s, v20.4s
400a0c: 4e35cf7e fmla v30.4s, v27.4s, v21.4s
400a10: 4e36cf9e fmla v30.4s, v28.4s, v22.4s
400a14: 6e37df5f fmul v31.4s, v26.4s, v23.4s
400a18: 4e38cf7f fmla v31.4s, v27.4s, v24.4s
400a1c: 4e39cf9f fmla v31.4s, v28.4s, v25.4s
400a20: 4c9f481d st3 {v29.4s-v31.4s}, [x0], #48
400a24: 54fffe81 b.ne 4009f4 <main+0x110>
Things to note:
- upstream clang does on-par with apple clang-12. Same notes apply.
What if our compiler was not autovectorising so well, though, and we still wanted to get the SIMD benefits? Enter VPR intrinsics. We'll introduce two hand-vectorised versions of the routine -- by-4 and by-8. We'll see why later.
void transform_x4(const matx3* mat, const float* vec, size_t num_vec, float* out)
{
    assert(0 == num_vec % 4);
    const float m00 = mat->m[0][0];
    const float m01 = mat->m[0][1];
    const float m02 = mat->m[0][2];
    const float m10 = mat->m[1][0];
    const float m11 = mat->m[1][1];
    const float m12 = mat->m[1][2];
    const float m20 = mat->m[2][0];
    const float m21 = mat->m[2][1];
    const float m22 = mat->m[2][2];
    for ( ; num_vec; num_vec -= 4, vec += 12, out += 12) {
        const float32x4x3_t t = vld3q_f32(vec);
        const float32x4_t r0 =
            m00 * t.val[0] +
            m01 * t.val[1] +
            m02 * t.val[2];
        const float32x4_t r1 =
            m10 * t.val[0] +
            m11 * t.val[1] +
            m12 * t.val[2];
        const float32x4_t r2 =
            m20 * t.val[0] +
            m21 * t.val[1] +
            m22 * t.val[2];
        const float32x4x3_t r = { .val = { r0, r1, r2 } };
        vst3q_f32(out, r);
    }
}
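A side note: the scalar-by-vector arithmetic above leans on clang's (and gcc's) vector extensions, which splat the scalar implicitly. A spelling that sticks to plain NEON intrinsics -- one possible equivalent, shown for the first result row only -- could look like:
float32x4_t r0 = vmulq_n_f32(t.val[0], m00);   /* r0  = m00 * t.val[0]         */
r0 = vfmaq_n_f32(r0, t.val[1], m01);           /* r0 += m01 * t.val[1], fused  */
r0 = vfmaq_n_f32(r0, t.val[2], m02);           /* r0 += m02 * t.val[2], fused  */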
void transform_x8(const matx3* mat, const float* vec, size_t num_vec, float* out)
{
    assert(0 == num_vec % 8);
    const float m00 = mat->m[0][0];
    const float m01 = mat->m[0][1];
    const float m02 = mat->m[0][2];
    const float m10 = mat->m[1][0];
    const float m11 = mat->m[1][1];
    const float m12 = mat->m[1][2];
    const float m20 = mat->m[2][0];
    const float m21 = mat->m[2][1];
    const float m22 = mat->m[2][2];
    for ( ; num_vec; num_vec -= 8, vec += 24, out += 24) {
        const float32x4x3_t t0 = vld3q_f32(vec + 0);
        const float32x4x3_t t1 = vld3q_f32(vec + 12);
        const float32x4_t r0 =
            m00 * t0.val[0] +
            m01 * t0.val[1] +
            m02 * t0.val[2];
        const float32x4_t r1 =
            m10 * t0.val[0] +
            m11 * t0.val[1] +
            m12 * t0.val[2];
        const float32x4_t r2 =
            m20 * t0.val[0] +
            m21 * t0.val[1] +
            m22 * t0.val[2];
        const float32x4_t r3 =
            m00 * t1.val[0] +
            m01 * t1.val[1] +
            m02 * t1.val[2];
        const float32x4_t r4 =
            m10 * t1.val[0] +
            m11 * t1.val[1] +
            m12 * t1.val[2];
        const float32x4_t r5 =
            m20 * t1.val[0] +
            m21 * t1.val[1] +
            m22 * t1.val[2];
        const float32x4x3_t s0 = { .val = { r0, r1, r2 } };
        const float32x4x3_t s1 = { .val = { r3, r4, r5 } };
        vst3q_f32(out + 0, s0);
        vst3q_f32(out + 12, s1);
    }
}
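Since transform_x8 asserts num_vec to be a multiple of 8, a caller facing arbitrary counts would have to split the work itself; a hypothetical dispatch sketch (the name and the split strategy are illustrative, not part of the measured harness):
void transform_any(const matx3* mat, const float* vec, size_t num_vec, float* out)
{
    const size_t bulk = num_vec & ~(size_t)7;  /* largest multiple of 8 */
    if (bulk)
        transform_x8(mat, vec, bulk, out);
    if (num_vec - bulk)                        /* scalar tail of 0..7 vec3s */
        transform_x1(mat, vec + bulk * 3, num_vec - bulk, out + bulk * 3);
}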
The codegen from these two routines, inlining notwithstanding, is very similar across all clang versions, so output from just one compiler is shown here:
upstream clang-13 transform_x4
400954: 4cdf49d1 ld3 {v17.4s-v19.4s}, [x14], #48
400958: f10011ad subs x13, x13, #0x4
40095c: 6e22de54 fmul v20.4s, v18.4s, v2.4s
400960: 4e21ce34 fmla v20.4s, v17.4s, v1.4s
400964: 4e23ce74 fmla v20.4s, v19.4s, v3.4s
400968: 6e25de55 fmul v21.4s, v18.4s, v5.4s
40096c: 4e24ce35 fmla v21.4s, v17.4s, v4.4s
400970: 4e26ce75 fmla v21.4s, v19.4s, v6.4s
400974: 6e30de56 fmul v22.4s, v18.4s, v16.4s
400978: 4e27ce36 fmla v22.4s, v17.4s, v7.4s
40097c: 4e20ce76 fmla v22.4s, v19.4s, v0.4s
400980: 4c9f49f4 st3 {v20.4s-v22.4s}, [x15], #48
400984: 54fffe81 b.ne 400954 <main+0x80>
Things to note:
- effectively identical codegen to the scalar version above.
upstream clang-13 transform_x8
400954: aa0e03f0 mov x16, x14
400958: 4cdf4a11 ld3 {v17.4s-v19.4s}, [x16], #48
40095c: f10021ad subs x13, x13, #0x8
400960: 910181ce add x14, x14, #0x60
400964: 6e22de57 fmul v23.4s, v18.4s, v2.4s
400968: 4e21ce37 fmla v23.4s, v17.4s, v1.4s
40096c: 4e23ce77 fmla v23.4s, v19.4s, v3.4s
400970: 6e25de58 fmul v24.4s, v18.4s, v5.4s
400974: 4c404a14 ld3 {v20.4s-v22.4s}, [x16]
400978: 4e24ce38 fmla v24.4s, v17.4s, v4.4s
40097c: 4e26ce78 fmla v24.4s, v19.4s, v6.4s
400980: 6e30de59 fmul v25.4s, v18.4s, v16.4s
400984: 4e27ce39 fmla v25.4s, v17.4s, v7.4s
400988: 4e20ce79 fmla v25.4s, v19.4s, v0.4s
40098c: 6e22deb1 fmul v17.4s, v21.4s, v2.4s
400990: 4e21ce91 fmla v17.4s, v20.4s, v1.4s
400994: 4e23ced1 fmla v17.4s, v22.4s, v3.4s
400998: 6e25deb2 fmul v18.4s, v21.4s, v5.4s
40099c: 4e24ce92 fmla v18.4s, v20.4s, v4.4s
4009a0: 4e26ced2 fmla v18.4s, v22.4s, v6.4s
4009a4: 6e30deb3 fmul v19.4s, v21.4s, v16.4s
4009a8: 4e27ce93 fmla v19.4s, v20.4s, v7.4s
4009ac: 910181f0 add x16, x15, #0x60
4009b0: 4c9f49f7 st3 {v23.4s-v25.4s}, [x15], #48
4009b4: 4e20ced3 fmla v19.4s, v22.4s, v0.4s
4009b8: 4c0049f1 st3 {v17.4s-v19.4s}, [x15]
4009bc: aa1003ef mov x15, x16
4009c0: 54fffca1 b.ne 400954 <main+0x80>
Things to note:
- little surprise here either, though the loop iterators (read: the incrementing pointers) surely could be handled better -- we'll return to that
Additionally, we have hand-written assembly versions of the above two routines, where a prima-vista attempt will be compared versus the compiler codegen.
For performance measurements we will consider two datasets:
- a "regular dataset" -- an array of 2^20 vec3 elements
- a "small dataset" -- an array of 24 vec3 elements
Each of the two datasets gets a fixed number of repetitions, chosen so that the overall test duration lands between one and two seconds, so we can gather statistically sound results.
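In other words, the benchmark driver simply repeats the routine under test over the same buffers; a minimal sketch of the idea (the repetition count and the names here are assumptions, not taken from the actual harness):
/* hypothetical: num_reps chosen per dataset so the whole run lands in the 1-2 s window */
for (size_t rep = 0; rep < num_reps; ++rep)
    transform_x8(&mat, input, num_vec, output);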
Testee filename encoding: vec3.<compiler-vs-asm><small-vs-regular-dataset><routine-version>
- e{0,1} -- compiler (e0) or assembly (e1)
- {a,b} -- "small" (a) or "regular" (b) dataset
- {n1,0,1} -- scalar version ("negative-one"; e0 only), transform_x4 (0) or transform_x8 (1)
So we end up with the following testee set:
- vec3.e0an1 -- scalar transform_x1 routine, autovectorised into AoS -> SoA -> AoS for a SIMD inner loop, "small" dataset
- vec3.e0bn1 -- scalar transform_x1 routine, autovectorised into AoS -> SoA -> AoS for a SIMD inner loop, "regular" dataset
- vec3.e0a0 -- intrinsics transform_x4 routine, "small" dataset
- vec3.e0b0 -- intrinsics transform_x4 routine, "regular" dataset
- vec3.e0a1 -- intrinsics transform_x8 routine, "small" dataset
- vec3.e0b1 -- intrinsics transform_x8 routine, "regular" dataset
- vec3.e1a0 -- assembly transform_x4, "small" dataset
- vec3.e1b0 -- assembly transform_x4, "regular" dataset
- vec3.e1a1 -- assembly transform_x8, "small" dataset
- vec3.e1b1 -- assembly transform_x8, "regular" dataset
Let's start with the raw data from runs over the regular dataset.
cortex-a76 (@ 2.4GHz, as found in raspberry pi 5):
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0bn1
Performance counter stats for './vec3.e0bn1':
1,237.45 msec task-clock:u # 1.000 CPUs utilized
2,952,634,002 cycles:u # 2.386 GHz
3,407,993,007 instructions:u # 1.15 insn per cycle
1,572,918,926 l1d_cache:u # 1.271 G/sec
197,194,613 l2d_cache:u # 159.355 M/sec
32,369 ll_cache_rd:u # 26.158 K/sec
1.237873179 seconds time elapsed
1.225787000 seconds user
0.011978000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0b0
Performance counter stats for './vec3.e0b0':
1,263.85 msec task-clock:u # 1.000 CPUs utilized
3,016,309,350 cycles:u # 2.387 GHz
3,407,978,988 instructions:u # 1.13 insn per cycle
1,572,922,185 l1d_cache:u # 1.245 G/sec
197,173,853 l2d_cache:u # 156.010 M/sec
33,254 ll_cache_rd:u # 26.312 K/sec
1.264290696 seconds time elapsed
1.256166000 seconds user
0.008001000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0b1
Performance counter stats for './vec3.e0b1':
1,232.45 msec task-clock:u # 0.999 CPUs utilized
2,940,806,784 cycles:u # 2.386 GHz
3,670,122,984 instructions:u # 1.25 insn per cycle
1,572,923,894 l1d_cache:u # 1.276 G/sec
197,303,337 l2d_cache:u # 160.091 M/sec
34,230 ll_cache_rd:u # 27.774 K/sec
1.233482958 seconds time elapsed
1.224740000 seconds user
0.008004000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1b0
Performance counter stats for './vec3.e1b0':
1,171.99 msec task-clock:u # 1.000 CPUs utilized
2,795,817,823 cycles:u # 2.386 GHz
3,407,967,015 instructions:u # 1.22 insn per cycle
1,572,918,240 l1d_cache:u # 1.342 G/sec
197,144,434 l2d_cache:u # 168.214 M/sec
31,555 ll_cache_rd:u # 26.924 K/sec
1.172311814 seconds time elapsed
1.160338000 seconds user
0.011962000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1b1
Performance counter stats for './vec3.e1b1':
1,174.30 msec task-clock:u # 1.000 CPUs utilized
2,801,470,249 cycles:u # 2.386 GHz
3,145,825,015 instructions:u # 1.12 insn per cycle
1,572,920,961 l1d_cache:u # 1.339 G/sec
197,231,327 l2d_cache:u # 167.957 M/sec
30,290 ll_cache_rd:u # 25.794 K/sec
1.174547091 seconds time elapsed
1.166583000 seconds user
0.008017000 seconds sys
cortex-a78 (@ 2.2GHz, as found in NV orin):
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0bn1
Performance counter stats for './vec3.e0bn1':
1190,24 msec task-clock:u # 0,999 CPUs utilized
2607223991 cycles:u # 2,191 GHz
3407990894 instructions:u # 1,31 insn per cycle
1310764515 l1d_cache:u # 1101,261 M/sec
589883602 l2d_cache:u # 495,601 M/sec
163423257 ll_cache_rd:u # 137,303 M/sec
1,190908605 seconds time elapsed
1,184724000 seconds user
0,003996000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0b0
Performance counter stats for './vec3.e0b0':
1189,80 msec task-clock:u # 0,999 CPUs utilized
2603000583 cycles:u # 2,188 GHz
3407976894 instructions:u # 1,31 insn per cycle
1310766566 l1d_cache:u # 1101,666 M/sec
589884874 l2d_cache:u # 495,783 M/sec
163432694 ll_cache_rd:u # 137,361 M/sec
1,190519403 seconds time elapsed
1,188339000 seconds user
0,000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0b1
Performance counter stats for './vec3.e0b1':
1171,95 msec task-clock:u # 1,000 CPUs utilized
2562875964 cycles:u # 2,187 GHz
3670120894 instructions:u # 1,43 insn per cycle
1310766771 l1d_cache:u # 1118,448 M/sec
589881753 l2d_cache:u # 503,333 M/sec
163346177 ll_cache_rd:u # 139,380 M/sec
1,172488529 seconds time elapsed
1,170347000 seconds user
0,000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1b0
Performance counter stats for './vec3.e1b0':
1181,95 msec task-clock:u # 1,000 CPUs utilized
2587425392 cycles:u # 2,189 GHz
3407964902 instructions:u # 1,32 insn per cycle
1310762790 l1d_cache:u # 1108,987 M/sec
589883701 l2d_cache:u # 499,079 M/sec
163181288 ll_cache_rd:u # 138,062 M/sec
1,182358976 seconds time elapsed
1,176337000 seconds user
0,003988000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1b1
Performance counter stats for './vec3.e1b1':
1161,94 msec task-clock:u # 0,999 CPUs utilized
2538722053 cycles:u # 2,185 GHz
3145822902 instructions:u # 1,24 insn per cycle
1310762562 l1d_cache:u # 1128,077 M/sec
589883678 l2d_cache:u # 507,670 M/sec
163265206 ll_cache_rd:u # 140,510 M/sec
1,162653505 seconds time elapsed
1,156491000 seconds user
0,003994000 seconds sys
Next, raw data from runs over the small dataset.
cortex-a76 (@ 2.4GHz):
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0an1
Performance counter stats for './vec3.e0an1':
1,537.63 msec task-clock:u # 1.000 CPUs utilized
3,687,853,899 cycles:u # 2.398 GHz
5,800,093,464 instructions:u # 1.57 insn per cycle
2,100,038,605 l1d_cache:u # 1.366 G/sec
8,476 l2d_cache:u # 5.512 K/sec
3,112 ll_cache_rd:u # 2.024 K/sec
1.537955558 seconds time elapsed
1.537915000 seconds user
0.000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0a0
Performance counter stats for './vec3.e0a0':
1,545.52 msec task-clock:u # 1.000 CPUs utilized
3,707,193,546 cycles:u # 2.399 GHz
5,100,093,487 instructions:u # 1.38 insn per cycle
2,200,039,677 l1d_cache:u # 1.423 G/sec
8,080 l2d_cache:u # 5.228 K/sec
3,126 ll_cache_rd:u # 2.023 K/sec
1.545786678 seconds time elapsed
1.541825000 seconds user
0.003994000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0a1
Performance counter stats for './vec3.e0a1':
1,468.98 msec task-clock:u # 1.000 CPUs utilized
3,523,414,771 cycles:u # 2.399 GHz
5,400,093,464 instructions:u # 1.53 insn per cycle
2,200,038,074 l1d_cache:u # 1.498 G/sec
7,961 l2d_cache:u # 5.419 K/sec
3,070 ll_cache_rd:u # 2.090 K/sec
1.469244463 seconds time elapsed
1.469262000 seconds user
0.000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1a0
Performance counter stats for './vec3.e1a0':
1,376.09 msec task-clock:u # 1.000 CPUs utilized
3,300,163,359 cycles:u # 2.398 GHz
4,500,093,471 instructions:u # 1.36 insn per cycle
2,000,038,758 l1d_cache:u # 1.453 G/sec
9,109 l2d_cache:u # 6.619 K/sec
3,184 ll_cache_rd:u # 2.314 K/sec
1.376407149 seconds time elapsed
1.376364000 seconds user
0.000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1a1
Performance counter stats for './vec3.e1a1':
1,338.49 msec task-clock:u # 1.000 CPUs utilized
3,210,260,258 cycles:u # 2.398 GHz
4,300,093,490 instructions:u # 1.34 insn per cycle
2,000,039,007 l1d_cache:u # 1.494 G/sec
9,187 l2d_cache:u # 6.864 K/sec
3,161 ll_cache_rd:u # 2.362 K/sec
1.338920898 seconds time elapsed
1.334840000 seconds user
0.003996000 seconds sys
cortex-a78 (@ 2.2GHz)
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0an1
Performance counter stats for './vec3.e0an1':
1451,88 msec task-clock:u # 1,000 CPUs utilized
3182349024 cycles:u # 2,192 GHz
5800080896 instructions:u # 1,82 insn per cycle
1800038899 l1d_cache:u # 1239,801 M/sec
9267 l2d_cache:u # 0,006 M/sec
3693 ll_cache_rd:u # 0,003 M/sec
1,452396450 seconds time elapsed
1,451673000 seconds user
0,000000000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0a0
Performance counter stats for './vec3.e0a0':
1445,92 msec task-clock:u # 1,000 CPUs utilized
3170729452 cycles:u # 2,193 GHz
5100080896 instructions:u # 1,61 insn per cycle
1900038812 l1d_cache:u # 1314,070 M/sec
9837 l2d_cache:u # 0,007 M/sec
3596 ll_cache_rd:u # 0,002 M/sec
1,446544663 seconds time elapsed
1,441731000 seconds user
0,003995000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e0a1
Performance counter stats for './vec3.e0a1':
1432,04 msec task-clock:u # 1,000 CPUs utilized
3138849659 cycles:u # 2,192 GHz
5400080896 instructions:u # 1,72 insn per cycle
1900038439 l1d_cache:u # 1326,807 M/sec
8322 l2d_cache:u # 0,006 M/sec
3488 ll_cache_rd:u # 0,002 M/sec
1,432679364 seconds time elapsed
1,428091000 seconds user
0,003991000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1a0
Performance counter stats for './vec3.e1a0':
1333,72 msec task-clock:u # 1,000 CPUs utilized
2922652840 cycles:u # 2,191 GHz
4500080903 instructions:u # 1,54 insn per cycle
1700038019 l1d_cache:u # 1274,664 M/sec
10238 l2d_cache:u # 0,008 M/sec
3568 ll_cache_rd:u # 0,003 M/sec
1,334323436 seconds time elapsed
1,329582000 seconds user
0,003994000 seconds sys
$ perf stat -e task-clock:u,cycles:u,instructions:u,l1d_cache:u,l2d_cache:u,ll_cache_rd:u ./vec3.e1a1
Performance counter stats for './vec3.e1a1':
1300,40 msec task-clock:u # 0,999 CPUs utilized
2850149937 cycles:u # 2,192 GHz
4300080903 instructions:u # 1,51 insn per cycle
1700038446 l1d_cache:u # 1307,316 M/sec
8475 l2d_cache:u # 0,007 M/sec
3476 ll_cache_rd:u # 0,003 M/sec
1,301065851 seconds time elapsed
1,300624000 seconds user
0,000000000 seconds sys
Charted, the cpu-cycles data from all of the above look like the following, for the regular and small datasets, respectively:
Separated per uarch: