@Spasi
Last active February 20, 2018 13:18
Vectorized floating point reductions
// double version:
/*
Windows 10, JVM 9.0.4+11, Ryzen 1800X:
--------------------------------------
Benchmark (size) Mode Cnt Score Error Units
SIMDBenchmark.reduceBuffered 1024 thrpt 3 518,570 ± 27,117 ops/ms
SIMDBenchmark.reduceBuffered 65536 thrpt 3 35,050 ± 0,297 ops/ms
SIMDBenchmark.reduceBuffered 131072 thrpt 3 17,822 ± 1,511 ops/ms
SIMDBenchmark.reduceSimple 1024 thrpt 3 1230,955 ± 24,972 ops/ms
SIMDBenchmark.reduceSimple 65536 thrpt 3 18,628 ± 0,725 ops/ms
SIMDBenchmark.reduceSimple 131072 thrpt 3 9,340 ± 0,217 ops/ms
SIMDBenchmark.reduceUnrolled 1024 thrpt 3 4699,957 ± 121,889 ops/ms
SIMDBenchmark.reduceUnrolled 65536 thrpt 3 73,777 ± 0,889 ops/ms
SIMDBenchmark.reduceUnrolled 131072 thrpt 3 36,505 ± 3,219 ops/ms
SIMDBenchmark.reduceVectorised 1024 thrpt 3 365,681 ± 26,555 ops/ms
SIMDBenchmark.reduceVectorised 65536 thrpt 3 27,931 ± 1,371 ops/ms
SIMDBenchmark.reduceVectorised 131072 thrpt 3 14,286 ± 0,930 ops/ms
Linux, JVM 10+43, Ryzen 1800X:
------------------------------
Benchmark (size) Mode Cnt Score Error Units
SIMDBenchmark.reduceUnrolled 1024 thrpt 5 4474.529 ± 29.028 ops/ms
SIMDBenchmark.reduceUnrolled 65536 thrpt 5 70.467 ± 1.030 ops/ms
SIMDBenchmark.reduceUnrolled 131072 thrpt 5 35.435 ± 0.293 ops/ms
SIMDBenchmark.reduceVectorised 1024 thrpt 5 381.656 ± 10.818 ops/ms
SIMDBenchmark.reduceVectorised 65536 thrpt 5 36.867 ± 1.764 ops/ms
SIMDBenchmark.reduceVectorised 131072 thrpt 5 18.645 ± 0.546 ops/ms
*/
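@Benchmark
public double reduceVectorised() {
    // Accumulates 1024-element chunks of data into buffer, then reduces the buffer.
    // Assumes data.length is a multiple of 1024; reduce(...) is a helper not shown in this gist.
    double[] buffer = new double[1024];
    double[] temp = new double[1024];
    for (int i = 0; i < data.length >>> 10; ++i) {
        System.arraycopy(data, i * 1024, temp, 0, temp.length);
        for (int j = 0; j < 1024; ++j) {
            buffer[j] += temp[j];
        }
    }
    return reduce(buffer);
}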
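@Benchmark
public double reduceUnrolled() {
    // Four independent accumulators; assumes data.length is a multiple of 4
    // (true for the benchmarked sizes), otherwise the tail is ignored.
    double a0 = 0.0;
    double a1 = 0.0;
    double a2 = 0.0;
    double a3 = 0.0;
    for (int i = 0; i < data.length >> 2; i++) {
        a0 += data[i * 4 + 0];
        a1 += data[i * 4 + 1];
        a2 += data[i * 4 + 2];
        a3 += data[i * 4 + 3];
    }
    return a0 + a1 + a2 + a3;
}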
// float version:
/*
Windows 10, JVM 9.0.4+11, Ryzen 1800X:
--------------------------------------
Benchmark (size) Mode Cnt Score Error Units
SIMDBenchmark.reduceBuffered 1024 thrpt 3 632,734 ± 338,190 ops/ms
SIMDBenchmark.reduceBuffered 65536 thrpt 3 35,742 ± 4,967 ops/ms
SIMDBenchmark.reduceBuffered 131072 thrpt 3 17,894 ± 2,126 ops/ms
SIMDBenchmark.reduceSimple 1024 thrpt 3 1227,272 ± 90,193 ops/ms
SIMDBenchmark.reduceSimple 65536 thrpt 3 18,640 ± 0,616 ops/ms
SIMDBenchmark.reduceSimple 131072 thrpt 3 9,334 ± 0,171 ops/ms
SIMDBenchmark.reduceUnrolled128 1024 thrpt 3 4695,636 ± 98,984 ops/ms
SIMDBenchmark.reduceUnrolled128 65536 thrpt 3 74,801 ± 0,211 ops/ms
SIMDBenchmark.reduceUnrolled128 131072 thrpt 3 36,835 ± 5,834 ops/ms
SIMDBenchmark.reduceUnrolled256 1024 thrpt 3 6750,436 ± 257,662 ops/ms
SIMDBenchmark.reduceUnrolled256 65536 thrpt 3 105,726 ± 0,525 ops/ms
SIMDBenchmark.reduceUnrolled256 131072 thrpt 3 50,739 ± 42,492 ops/ms
SIMDBenchmark.reduceVectorised 1024 thrpt 3 566,206 ± 21,687 ops/ms
SIMDBenchmark.reduceVectorised 65536 thrpt 3 55,265 ± 0,757 ops/ms
SIMDBenchmark.reduceVectorised 131072 thrpt 3 28,468 ± 2,976 ops/ms
Linux, JVM 10+43, Ryzen 1800X:
------------------------------
Benchmark (size) Mode Cnt Score Error Units
SIMDBenchmarkFloat.reduceUnrolled128 1024 thrpt 5 4614.667 ± 70.817 ops/ms
SIMDBenchmarkFloat.reduceUnrolled128 65536 thrpt 5 71.675 ± 1.006 ops/ms
SIMDBenchmarkFloat.reduceUnrolled128 131072 thrpt 5 36.169 ± 0.308 ops/ms
SIMDBenchmarkFloat.reduceUnrolled256 1024 thrpt 5 5010.649 ± 28.092 ops/ms
SIMDBenchmarkFloat.reduceUnrolled256 65536 thrpt 5 78.317 ± 0.749 ops/ms
SIMDBenchmarkFloat.reduceUnrolled256 131072 thrpt 5 39.264 ± 0.432 ops/ms
SIMDBenchmarkFloat.reduceVectorised 1024 thrpt 5 572.878 ± 32.272 ops/ms
SIMDBenchmarkFloat.reduceVectorised 65536 thrpt 5 70.393 ± 1.993 ops/ms
SIMDBenchmarkFloat.reduceVectorised 131072 thrpt 5 37.340 ± 1.031 ops/ms
*/
@Benchmark
public float reduceUnrolled128() {
    // Four float accumulators (4 x 32 bits = 128 bits); assumes data.length is a multiple of 4.
    float a0 = 0.0f;
    float a1 = 0.0f;
    float a2 = 0.0f;
    float a3 = 0.0f;
    for (int i = 0; i < data.length >> 2; i++) {
        a0 += data[i * 4 + 0];
        a1 += data[i * 4 + 1];
        a2 += data[i * 4 + 2];
        a3 += data[i * 4 + 3];
    }
    return a0 + a1 + a2 + a3;
}
@Benchmark
public float reduceUnrolled256() {
    // Eight float accumulators (8 x 32 bits = 256 bits); assumes data.length is a multiple of 8.
    float a0 = 0.0f;
    float a1 = 0.0f;
    float a2 = 0.0f;
    float a3 = 0.0f;
    float a4 = 0.0f;
    float a5 = 0.0f;
    float a6 = 0.0f;
    float a7 = 0.0f;
    for (int i = 0; i < data.length >> 3; i++) {
        a0 += data[i * 8 + 0];
        a1 += data[i * 8 + 1];
        a2 += data[i * 8 + 2];
        a3 += data[i * 8 + 3];
        a4 += data[i * 8 + 4];
        a5 += data[i * 8 + 5];
        a6 += data[i * 8 + 6];
        a7 += data[i * 8 + 7];
    }
    return a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7;
}
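The benchmark methods above refer to a `data` field and a `reduce(...)` helper that the gist does not show. The following is a minimal sketch of how the double variants could be run outside JMH to check that they agree. The class name `ReductionSketch`, the `createData` method, and the body of `reduce` are assumptions made for illustration, not the author's code. The input uses small integer values so that every summation order produces exactly the same double.

```java
public class ReductionSketch {

    // Assumed stand-in for the gist's undefined reduce(buffer): a plain left-to-right sum.
    static double reduce(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    // Same loop shape as reduceVectorised, with data passed as a parameter.
    // Assumes data.length is a multiple of 1024.
    static double reduceVectorised(double[] data) {
        double[] buffer = new double[1024];
        double[] temp = new double[1024];
        for (int i = 0; i < data.length >>> 10; ++i) {
            System.arraycopy(data, i * 1024, temp, 0, temp.length);
            for (int j = 0; j < 1024; ++j) {
                buffer[j] += temp[j];
            }
        }
        return reduce(buffer);
    }

    // Same loop shape as reduceUnrolled, with data passed as a parameter.
    // Assumes data.length is a multiple of 4.
    static double reduceUnrolled(double[] data) {
        double a0 = 0.0;
        double a1 = 0.0;
        double a2 = 0.0;
        double a3 = 0.0;
        for (int i = 0; i < data.length >> 2; i++) {
            a0 += data[i * 4 + 0];
            a1 += data[i * 4 + 1];
            a2 += data[i * 4 + 2];
            a3 += data[i * 4 + 3];
        }
        return a0 + a1 + a2 + a3;
    }

    // Hypothetical input generator: small integers, so all partial sums are exact in double.
    static double[] createData(int size) {
        double[] data = new double[size];
        for (int i = 0; i < size; ++i) {
            data[i] = i % 7;
        }
        return data;
    }

    public static void main(String[] args) {
        double[] data = createData(65536);
        System.out.println("simple:     " + reduce(data));
        System.out.println("vectorised: " + reduceVectorised(data));
        System.out.println("unrolled:   " + reduceUnrolled(data));
    }
}
```

With arbitrary floating-point inputs the three variants may differ in the last bits, because each one adds the elements in a different order.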
@richardstartin

Hi - thanks for posting this! I ran your code with perfasm (Skylake, Windows 10, jdk10-ea-38) and reproduced quite good performance for reduceUnrolled, but it looks like the code is scalar (e.g. vmovsd qword). Still, it's faster than reduceVectorised on my machine.

Result "com.openkappa.simd.reduction.ReduceArray.reduceUnrolled":
  37.184 ±(99.9%) 1.401 ops/ms [Average]
  (min, avg, max) = (36.133, 37.184, 38.868), stdev = 0.927
  CI (99.9%): [35.783, 38.586] (assumes normal distribution)

Secondary result "com.openkappa.simd.reduction.ReduceArray.reduceUnrolled:·asm":
PrintAssembly processed: 49635 total address lines.
Perf output processed (skipped 16.532 seconds):
 Column 1: SampledProfile (99910 events)

Hottest code regions (>10.00% "SampledProfile" events):

....[Hottest Region 1]..............................................................................
c2, com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub, version 162 (956 bytes) 

                                                         ;   {optimized virtual_call}
           0x000001e6d2d36728: mov     rdi,qword ptr [rsp+30h]
           0x000001e6d2d3672d: movzx   r10d,byte ptr [rdi+94h]  ;*getfield isDone {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@30 (line 121)
                                                         ; implicit exception: dispatches to 0x000001e6d2d36da6
           0x000001e6d2d36735: test    r10d,r10d
           0x000001e6d2d36738: jne     1e6d2d36c82h      ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@33 (line 121)
           0x000001e6d2d3673e: mov     ebx,1h            ;*aload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@13 (line 119)
  0.00%    0x000001e6d2d36743: mov     rax,rbp
           0x000001e6d2d36746: mov     ebp,dword ptr [rbp+10h]  ;*getfield data {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@16 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.01%    0x000001e6d2d36749: mov     esi,dword ptr [r12+rbp*8+0ch]
                                                         ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@19 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
                                                         ; implicit exception: dispatches to 0x000001e6d2d36d9a
  0.03%    0x000001e6d2d3674e: mov     r13d,esi
           0x000001e6d2d36751: sar     r13d,2h           ;*ishr {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@21 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36755: test    r13d,r13d
           0x000001e6d2d36758: jle     1e6d2d36cb5h      ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@22 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
           0x000001e6d2d3675e: cmp     esi,3h
           0x000001e6d2d36761: jbe     1e6d2d36bb2h
           0x000001e6d2d36767: vxorpd  xmm0,xmm0,xmm0
           0x000001e6d2d3676b: vaddsd  xmm4,xmm0,mmword ptr [r12+rbp*8+10h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@37 (line 65)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.01%    0x000001e6d2d36772: vaddsd  xmm0,xmm0,mmword ptr [r12+rbp*8+18h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@51 (line 66)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36779: vxorpd  xmm1,xmm1,xmm1
           0x000001e6d2d3677d: vaddsd  xmm3,xmm1,mmword ptr [r12+rbp*8+28h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@82 (line 68)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36784: vaddsd  xmm1,xmm1,mmword ptr [r12+rbp*8+20h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@66 (line 67)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d3678b: lea     r8,[r12+rbp*8]
  0.00%    0x000001e6d2d3678f: mov     r10d,esi
           0x000001e6d2d36792: add     r10d,0fffffff8h
  0.00%    0x000001e6d2d36796: sar     r10d,1fh
  0.00%    0x000001e6d2d3679a: shr     r10d,1eh
           0x000001e6d2d3679e: add     r10d,esi
           0x000001e6d2d367a1: add     r10d,0fffffff8h
           0x000001e6d2d367a5: sar     r10d,2h
           0x000001e6d2d367a9: mov     r11d,esi
  0.00%    0x000001e6d2d367ac: add     r11d,0fffffff5h
           0x000001e6d2d367b0: sar     r11d,1fh
           0x000001e6d2d367b4: shr     r11d,1eh
           0x000001e6d2d367b8: add     r11d,esi
           0x000001e6d2d367bb: add     r11d,0fffffff5h
  0.00%    0x000001e6d2d367bf: sar     r11d,2h
  0.00%    0x000001e6d2d367c3: cmp     r10d,r11d
           0x000001e6d2d367c6: cmovnle r10d,r11d
  0.00%    0x000001e6d2d367ca: mov     r11d,esi
           0x000001e6d2d367cd: add     r11d,0ffffffedh
  0.00%    0x000001e6d2d367d1: sar     r11d,1fh
           0x000001e6d2d367d5: shr     r11d,1eh
  0.00%    0x000001e6d2d367d9: add     r11d,esi
           0x000001e6d2d367dc: add     r11d,0ffffffedh
           0x000001e6d2d367e0: sar     r11d,2h
           0x000001e6d2d367e4: mov     edx,esi
  0.00%    0x000001e6d2d367e6: add     edx,0fffffff0h
           0x000001e6d2d367e9: sar     edx,1fh
  0.00%    0x000001e6d2d367ec: shr     edx,1eh
           0x000001e6d2d367ef: add     edx,esi
           0x000001e6d2d367f1: add     edx,0fffffff0h
           0x000001e6d2d367f4: sar     edx,2h
  0.00%    0x000001e6d2d367f7: cmp     r11d,edx
           0x000001e6d2d367fa: cmovnle r11d,edx
  0.00%    0x000001e6d2d367fe: mov     edx,esi
  0.00%    0x000001e6d2d36800: add     edx,0fffffff9h
           0x000001e6d2d36803: sar     edx,1fh
           0x000001e6d2d36806: shr     edx,1eh
  0.00%    0x000001e6d2d36809: add     edx,esi
           0x000001e6d2d3680b: add     edx,0fffffff9h
           0x000001e6d2d3680e: sar     edx,2h
           0x000001e6d2d36811: mov     r9d,esi
  0.00%    0x000001e6d2d36814: add     r9d,0fffffffch
           0x000001e6d2d36818: sar     r9d,1fh
           0x000001e6d2d3681c: shr     r9d,1eh
           0x000001e6d2d36820: add     r9d,esi
  0.00%    0x000001e6d2d36823: add     r9d,0fffffffch
           0x000001e6d2d36827: sar     r9d,2h
           0x000001e6d2d3682b: cmp     r9d,edx
           0x000001e6d2d3682e: cmovnle r9d,edx
  0.00%    0x000001e6d2d36832: mov     edx,r13d
  0.00%    0x000001e6d2d36835: dec     edx
           0x000001e6d2d36837: cmp     edx,r9d
           0x000001e6d2d3683a: cmovnle edx,r9d
  0.00%    0x000001e6d2d3683e: mov     r9d,edx
           0x000001e6d2d36841: add     r9d,0fffffffeh
           0x000001e6d2d36845: cmp     edx,r9d
           0x000001e6d2d36848: mov     edx,80000000h
           0x000001e6d2d3684d: cmovl   r9d,edx
           0x000001e6d2d36851: cmp     r9d,r10d
           0x000001e6d2d36854: cmovnle r9d,r10d
           0x000001e6d2d36858: mov     edx,r9d
  0.00%    0x000001e6d2d3685b: add     edx,0fffffffch
           0x000001e6d2d3685e: cmp     r9d,edx
  0.00%    0x000001e6d2d36861: mov     r10d,80000000h
  0.00%    0x000001e6d2d36867: cmovl   edx,r10d
  0.00%    0x000001e6d2d3686b: cmp     edx,r11d
           0x000001e6d2d3686e: cmovnle edx,r11d
           0x000001e6d2d36872: cmp     edx,1h
           0x000001e6d2d36875: jle     1e6d2d36cbeh
  0.00%    0x000001e6d2d3687b: mov     r10d,4h
           0x000001e6d2d36881: mov     r11d,1h
           0x000001e6d2d36887: mov     r9d,edx
           0x000001e6d2d3688a: sub     r9d,r11d
  0.00%    0x000001e6d2d3688d: mov     ecx,1f40h
           0x000001e6d2d36892: cmp     r9d,ecx
           0x000001e6d2d36895: cmovnle r9d,ecx
           0x000001e6d2d36899: add     r9d,r11d
  0.00%    0x000001e6d2d3689c: nop     dword ptr [rax+0h]  ;*dload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@25 (line 65)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.10%    0x000001e6d2d368a0: cmp     r10d,esi
           0x000001e6d2d368a3: jnb     1e6d2d36bc8h
  1.00%    0x000001e6d2d368a9: vaddsd  xmm2,xmm4,mmword ptr [r8+r10*8+10h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@37 (line 65)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  1.86%    0x000001e6d2d368b0: mov     ecx,r10d
  0.08%    0x000001e6d2d368b3: add     ecx,3h
  0.98%    0x000001e6d2d368b6: cmp     ecx,esi
           0x000001e6d2d368b8: jnb     1e6d2d36c0eh
  0.69%    0x000001e6d2d368be: movsxd  r10,r10d
  1.24%    0x000001e6d2d368c1: vaddsd  xmm7,xmm0,mmword ptr [r8+r10*8+18h]
  2.47%    0x000001e6d2d368c8: vaddsd  xmm6,xmm3,mmword ptr [r8+r10*8+28h]
  2.45%    0x000001e6d2d368cf: vaddsd  xmm5,xmm1,mmword ptr [r8+r10*8+20h]
  1.95%    0x000001e6d2d368d6: mov     ecx,r11d
  0.08%    0x000001e6d2d368d9: shl     ecx,2h
  1.01%    0x000001e6d2d368dc: mov     r10d,ecx
  1.85%    0x000001e6d2d368df: add     r10d,20h          ;*imul {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@33 (line 65)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.01%    0x000001e6d2d368e3: movsxd  rcx,ecx
  0.08%    0x000001e6d2d368e6: vaddsd  xmm0,xmm2,mmword ptr [r8+rcx*8+30h]
  2.82%    0x000001e6d2d368ed: vaddsd  xmm0,xmm0,mmword ptr [r8+rcx*8+50h]
  0.19%    0x000001e6d2d368f4: vaddsd  xmm0,xmm0,mmword ptr [r8+rcx*8+70h]
  3.87%    0x000001e6d2d368fb: vaddsd  xmm0,xmm0,mmword ptr [r8+rcx*8+90h]
  4.03%    0x000001e6d2d36905: vaddsd  xmm0,xmm0,mmword ptr [r8+rcx*8+0b0h]
 10.13%    0x000001e6d2d3690f: vaddsd  xmm0,xmm0,mmword ptr [r8+rcx*8+0d0h]
 10.54%    0x000001e6d2d36919: vaddsd  xmm4,xmm0,mmword ptr [r8+rcx*8+0f0h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@37 (line 65)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
 12.41%    0x000001e6d2d36923: vmovsd  xmm2,qword ptr [r8+rcx*8+108h]
  0.02%    0x000001e6d2d3692d: vmovsd  xmm3,qword ptr [r8+rcx*8+0e8h]
  0.01%    0x000001e6d2d36937: vmovsd  xmm8,qword ptr [r8+rcx*8+0c8h]
  2.94%    0x000001e6d2d36941: vmovsd  xmm9,qword ptr [r8+rcx*8+0a8h]
  0.02%    0x000001e6d2d3694b: vmovsd  xmm10,qword ptr [r8+rcx*8+88h]
  0.00%    0x000001e6d2d36955: vmovsd  xmm13,qword ptr [r8+rcx*8+68h]
  0.00%    0x000001e6d2d3695c: vmovsd  xmm0,qword ptr [r8+rcx*8+48h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@81 (line 68)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  2.92%    0x000001e6d2d36963: vmovsd  xmm11,qword ptr [r8+rcx*8+100h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@65 (line 67)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.02%    0x000001e6d2d3696d: vmovsd  xmm1,qword ptr [r8+rcx*8+38h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@50 (line 66)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
           0x000001e6d2d36974: vmovsd  xmm12,qword ptr [r8+rcx*8+0e0h]
  0.01%    0x000001e6d2d3697e: vaddsd  xmm1,xmm7,xmm1
  2.96%    0x000001e6d2d36982: vmovsd  xmm7,qword ptr [r8+rcx*8+0c0h]
  0.02%    0x000001e6d2d3698c: vaddsd  xmm0,xmm6,xmm0
           0x000001e6d2d36990: vmovsd  xmm6,qword ptr [r8+rcx*8+80h]
  0.00%    0x000001e6d2d3699a: vaddsd  xmm0,xmm0,xmm13
  3.01%    0x000001e6d2d3699f: vmovsd  xmm13,qword ptr [r8+rcx*8+60h]
  0.02%    0x000001e6d2d369a6: vaddsd  xmm10,xmm0,xmm10
  0.04%    0x000001e6d2d369ab: vmovsd  xmm0,qword ptr [r8+rcx*8+40h]
  0.00%    0x000001e6d2d369b2: vaddsd  xmm0,xmm5,xmm0
  2.88%    0x000001e6d2d369b6: vmovsd  xmm5,qword ptr [r8+rcx*8+0a0h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@65 (line 67)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.03%    0x000001e6d2d369c0: vaddsd  xmm0,xmm0,xmm13
  0.01%    0x000001e6d2d369c5: vmovsd  xmm13,qword ptr [r8+rcx*8+0f8h]
  0.01%    0x000001e6d2d369cf: vaddsd  xmm0,xmm0,xmm6
  2.95%    0x000001e6d2d369d3: vmovsd  xmm6,qword ptr [r8+rcx*8+0d8h]
  0.02%    0x000001e6d2d369dd: vaddsd  xmm0,xmm0,xmm5
  0.29%    0x000001e6d2d369e1: vmovsd  xmm5,qword ptr [r8+rcx*8+0b8h]
  0.00%    0x000001e6d2d369eb: vaddsd  xmm0,xmm0,xmm7
  3.97%    0x000001e6d2d369ef: vmovsd  xmm14,qword ptr [r8+rcx*8+98h]
  0.01%    0x000001e6d2d369f9: vaddsd  xmm12,xmm0,xmm12
  3.27%    0x000001e6d2d369fe: vmovsd  xmm0,qword ptr [r8+rcx*8+78h]
  0.01%    0x000001e6d2d36a05: vmovsd  xmm7,qword ptr [r8+rcx*8+58h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@50 (line 66)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  1.88%    0x000001e6d2d36a0c: vaddsd  xmm1,xmm1,xmm7
  0.00%    0x000001e6d2d36a10: vaddsd  xmm0,xmm1,xmm0
  1.01%    0x000001e6d2d36a14: vaddsd  xmm1,xmm10,xmm9
  0.00%    0x000001e6d2d36a19: vaddsd  xmm0,xmm0,xmm14
  1.91%    0x000001e6d2d36a1e: vaddsd  xmm1,xmm1,xmm8
  0.03%    0x000001e6d2d36a23: vaddsd  xmm0,xmm0,xmm5
  1.13%    0x000001e6d2d36a27: vaddsd  xmm1,xmm1,xmm3
  0.05%    0x000001e6d2d36a2b: vaddsd  xmm0,xmm0,xmm6
  2.29%    0x000001e6d2d36a2f: vaddsd  xmm3,xmm1,xmm2    ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@82 (line 68)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.18%    0x000001e6d2d36a33: vaddsd  xmm0,xmm0,xmm13   ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@51 (line 66)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  3.30%    0x000001e6d2d36a38: vaddsd  xmm1,xmm12,xmm11  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@66 (line 67)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.62%    0x000001e6d2d36a3d: add     r11d,8h           ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@85 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  1.26%    0x000001e6d2d36a41: cmp     r11d,r9d
           0x000001e6d2d36a44: jl      1e6d2d368a0h      ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@88 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36a4a: mov     r9,qword ptr [r15+70h]  ; ImmutableOopMap{r8=Oop rdi=Oop rax=Oop rbp=NarrowOop [56]=Oop [64]=Oop }
                                                         ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@88 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.03%    0x000001e6d2d36a4e: test    dword ptr [r9],eax  ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@88 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
                                                         ;   {poll}
  0.13%    0x000001e6d2d36a51: cmp     r11d,edx
           0x000001e6d2d36a54: jl      1e6d2d36887h
           0x000001e6d2d36a5a: cmp     r11d,r13d
           0x000001e6d2d36a5d: jnl     1e6d2d36ab0h
           0x000001e6d2d36a5f: nop                       ;*dload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@25 (line 65)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.01%    0x000001e6d2d36a60: cmp     r10d,esi
           0x000001e6d2d36a63: jnb     1e6d2d36d31h
  0.00%    0x000001e6d2d36a69: vaddsd  xmm4,xmm4,mmword ptr [r8+r10*8+10h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@37 (line 65)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.01%    0x000001e6d2d36a70: mov     edx,r10d
  0.01%    0x000001e6d2d36a73: add     edx,3h
  0.00%    0x000001e6d2d36a76: cmp     edx,esi
           0x000001e6d2d36a78: jnb     1e6d2d36d57h
  0.00%    0x000001e6d2d36a7e: movsxd  r10,r10d
  0.00%    0x000001e6d2d36a81: vaddsd  xmm3,xmm3,mmword ptr [r8+r10*8+28h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@82 (line 68)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.01%    0x000001e6d2d36a88: vaddsd  xmm1,xmm1,mmword ptr [r8+r10*8+20h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@66 (line 67)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36a8f: vaddsd  xmm0,xmm0,mmword ptr [r8+r10*8+18h]
                                                         ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@51 (line 66)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.01%    0x000001e6d2d36a96: mov     edx,r11d
  0.00%    0x000001e6d2d36a99: inc     edx               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@85 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36a9b: cmp     edx,r13d
           0x000001e6d2d36a9e: jnl     1e6d2d36ab0h      ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@22 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36aa0: shl     r11d,2h
           0x000001e6d2d36aa4: mov     r10d,r11d
  0.00%    0x000001e6d2d36aa7: add     r10d,4h           ;*imul {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@33 (line 65)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36aab: mov     r11d,edx
  0.00%    0x000001e6d2d36aae: jmp     1e6d2d36a60h
  0.00%    0x000001e6d2d36ab0: vaddsd  xmm0,xmm0,xmm4
           0x000001e6d2d36ab4: vaddsd  xmm0,xmm0,xmm1
  0.00%    0x000001e6d2d36ab8: vaddsd  xmm0,xmm3,xmm0    ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@99 (line 70)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36abc: mov     qword ptr [rsp+98h],rbx
  0.00%    0x000001e6d2d36ac4: mov     qword ptr [rsp+90h],rdi
  0.00%    0x000001e6d2d36acc: mov     rbp,rax
           0x000001e6d2d36acf: mov     rdx,qword ptr [rsp+40h]
  0.03%    0x000001e6d2d36ad4: nop
           0x000001e6d2d36ad7: call    1e6d2ca7000h      ; ImmutableOopMap{rbp=Oop [144]=Oop [56]=Oop [64]=Oop }
                                                         ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@20 (line 119)
                                                         ;   {optimized virtual_call}
           0x000001e6d2d36adc: mov     rdi,qword ptr [rsp+90h]
  0.00%    0x000001e6d2d36ae4: movzx   r11d,byte ptr [rdi+94h]  ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled@88 (line 64)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%    0x000001e6d2d36aec: mov     r8,qword ptr [r15+70h]
           0x000001e6d2d36af0: mov     rbx,qword ptr [rsp+98h]
           0x000001e6d2d36af8: add     rbx,1h            ; ImmutableOopMap{rdi=Oop rbp=Oop [56]=Oop [64]=Oop }
                                                         ;*ifeq {reexecute=1 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@33 (line 121)
           0x000001e6d2d36afc: test    dword ptr [r8],eax  ;   {poll}
  0.00%    0x000001e6d2d36aff: test    r11d,r11d
           0x000001e6d2d36b02: je      1e6d2d36743h      ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@33 (line 121)
           0x000001e6d2d36b08: jmp     1e6d2d36c87h
           0x000001e6d2d36b0d: vxorpd  xmm1,xmm1,xmm1
           0x000001e6d2d36b11: vxorpd  xmm0,xmm0,xmm0
           0x000001e6d2d36b15: vxorpd  xmm3,xmm3,xmm3
           0x000001e6d2d36b19: xor     edi,edi
           0x000001e6d2d36b1b: vxorpd  xmm4,xmm4,xmm4
           0x000001e6d2d36b1f: xor     r8d,r8d
           0x000001e6d2d36b22: vmovsd  qword ptr [rsp+20h],xmm4
....................................................................................................
 99.31%  <total for region 1>

....[Hottest Regions]...............................................................................
 99.31%              c2  com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub, version 162 (956 bytes) 
  0.56%                  <unknown> (0 bytes) 
  0.08%              c2  org.openjdk.jmh.infra.Blackhole::consume, version 160 (49 bytes) 
  0.00%    runtime stub  StubRoutines::atomic_cmpxchg_long (0 bytes) 
  0.00%         jvm.dll  0x000000007503ce7f (10 bytes) 
  0.00%  KernelBase.dll  WaitForSingleObjectEx (0 bytes) 
  0.00%       ntdll.dll  RtlQueryPerformanceCounter (0 bytes) 
  0.00%         jvm.dll  0x00000000750a0674 (28 bytes) 
  0.00%         jvm.dll  0x0000000075107787 (16 bytes) 
  0.00%         jvm.dll  0x000000007512a753 (0 bytes) 
  0.00%     interpreter  method entry point (kind = zerolocals)  (4 bytes) 
  0.00%       ntdll.dll  ZwWaitForSingleObject (20 bytes) 
  0.00%         jvm.dll  0x0000000074c20a20 (0 bytes) 
  0.00%         jvm.dll  0x0000000074ce7930 (0 bytes) 
  0.00%         jvm.dll  0x0000000074d57b7c (0 bytes) 
  0.00%         jvm.dll  0x0000000074d72534 (0 bytes) 
  0.00%         jvm.dll  0x0000000074d732fa (0 bytes) 
  0.00%         jvm.dll  0x0000000074d742c0 (0 bytes) 
  0.00%         jvm.dll  0x0000000074de6032 (0 bytes) 
  0.00%         jvm.dll  0x0000000075014f90 (0 bytes) 
  0.02%  <...other 19 warm regions...>
....................................................................................................

reduceUnrolled256 does quite badly on my setup; it seems to be dominated by conversions between float and double (the vcvtss2sd/vcvtsd2ss pairs in the listing below). Perhaps these are cheaper on Ryzen?
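
For context, the bytecode annotations in the listing (fload, f2d, daload, d2f) are what you would expect when a float accumulator is updated from a double[] array: Java defines `acc += x` as `acc = (float) ((double) acc + x)` in that case, so every addition is wrapped in a widening and a narrowing conversion. A minimal sketch of that effect, assuming this is what the benchmark does; the class and method names are hypothetical and this is not the actual ReduceArray code:

```java
// Hypothetical demo class, not part of the benchmark under discussion.
final class FloatAccumulatorDemo {

    // float accumulator over double data: each += is f2d, dadd, d2f in bytecode.
    static float sumFloatAcc(double[] data) {
        float acc = 0f;
        for (int i = 0; i < data.length; i++) {
            acc += data[i]; // implicit narrowing back to float on every iteration
        }
        return acc;
    }

    // The same loop with the conversions that += performs written out explicitly.
    static float sumFloatAccExplicit(double[] data) {
        float acc = 0f;
        for (int i = 0; i < data.length; i++) {
            acc = (float) ((double) acc + data[i]);
        }
        return acc;
    }

    // double accumulator over double data: a plain dadd, no conversions.
    static double sumDoubleAcc(double[] data) {
        double acc = 0d;
        for (int i = 0; i < data.length; i++) {
            acc += data[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        double[] data = {0.1, 0.2, 0.3};
        System.out.println(sumFloatAcc(data));
        System.out.println(sumFloatAccExplicit(data));
        System.out.println(sumDoubleAcc(data));
    }
}
```

The first two methods always return the same value; only the third avoids the per-element round trip through float.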

Result "com.openkappa.simd.reduction.ReduceArray.reduceUnrolled256":
  8.964 ±(99.9%) 0.218 ops/ms [Average]
  (min, avg, max) = (8.643, 8.964, 9.111), stdev = 0.144
  CI (99.9%): [8.746, 9.182] (assumes normal distribution)

Secondary result "com.openkappa.simd.reduction.ReduceArray.reduceUnrolled256:·asm":
PrintAssembly processed: 49765 total address lines.
Perf output processed (skipped 15.089 seconds):
 Column 1: SampledProfile (99970 events)

Hottest code regions (>10.00% "SampledProfile" events):

....[Hottest Region 1]..............................................................................
c2, com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub, version 161 (907 bytes) 

           0x00000260ca740712: cmp     r8d,edx
           0x00000260ca740715: mov     esi,80000000h
           0x00000260ca74071a: cmovl   edx,esi
           0x00000260ca74071d: cmp     edx,r11d
           0x00000260ca740720: cmovnle edx,r11d
           0x00000260ca740724: cmp     edx,1h
           0x00000260ca740727: jle     260ca740d55h
           0x00000260ca74072d: vcvtss2sd xmm5,xmm5,esi   ;*f2d {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@37 (line 85)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.00%    0x00000260ca740731: mov     r11d,8h
           0x00000260ca740737: mov     r8d,1h
           0x00000260ca74073d: jmp     260ca740aa7h
           0x00000260ca740742: nop     word ptr [rax+rax+0h]
           0x00000260ca74074c: nop
  0.00%    0x00000260ca740750: mov     r11d,r9d
           0x00000260ca740753: vmovsd  xmm5,qword ptr [rsp+0c8h]
                                                         ;*fload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@36 (line 85)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.69%    0x00000260ca74075c: cmp     r11d,r10d
           0x00000260ca74075f: jnb     260ca740ae8h
           0x00000260ca740765: vaddsd  xmm5,xmm5,mmword ptr [rdi+r11*8+10h]
           0x00000260ca74076c: mov     r9d,r11d
  0.69%    0x00000260ca74076f: add     r9d,7h
           0x00000260ca740773: vcvtsd2ss xmm6,xmm6,xmm5  ;*d2f {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@51 (line 85)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.00%    0x00000260ca740777: vcvtss2sd xmm2,xmm2,edx   ;*f2d {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@54 (line 86)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.70%    0x00000260ca74077b: cmp     r9d,r10d
           0x00000260ca74077e: jnb     260ca740b5ah
  0.00%    0x00000260ca740784: vcvtss2sd xmm6,xmm6,esi
  0.68%    0x00000260ca740788: vcvtss2sd xmm4,xmm4,esp
  0.00%    0x00000260ca74078c: vmovsd  qword ptr [rsp+90h],xmm4
           0x00000260ca740795: vcvtss2sd xmm5,xmm5,ecx
  0.75%    0x00000260ca740799: vcvtss2sd xmm10,xmm10,r8d
  0.00%    0x00000260ca74079e: vcvtss2sd xmm1,xmm1,ebx
  0.73%    0x00000260ca7407a2: vcvtss2sd xmm3,xmm3,eax
  0.00%    0x00000260ca7407a6: mov     eax,r8d
  0.75%    0x00000260ca7407a9: shl     eax,3h
  0.00%    0x00000260ca7407ac: vaddsd  xmm0,xmm6,mmword ptr [rdi+rax*8+50h]
  0.30%    0x00000260ca7407b2: vcvtsd2ss xmm0,xmm0,xmm0
  0.68%    0x00000260ca7407b6: vcvtss2sd xmm0,xmm0,eax
  0.01%    0x00000260ca7407ba: vaddsd  xmm0,xmm0,mmword ptr [rdi+rax*8+90h]
  0.92%    0x00000260ca7407c3: vcvtsd2ss xmm4,xmm4,xmm0
  0.02%    0x00000260ca7407c7: vcvtss2sd xmm0,xmm0,esp
  0.66%    0x00000260ca7407cb: vaddsd  xmm0,xmm0,mmword ptr [rdi+rax*8+0d0h]
  0.20%    0x00000260ca7407d4: vcvtsd2ss xmm4,xmm4,xmm0  ;*d2f {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@51 (line 85)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.69%    0x00000260ca7407d8: vcvtss2sd xmm6,xmm6,edi
  0.00%    0x00000260ca7407dc: vmovss  dword ptr [rsp+0c0h],xmm4
  0.71%    0x00000260ca7407e5: vcvtss2sd xmm0,xmm0,esp   ;*f2d {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@37 (line 85)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.01%    0x00000260ca7407e9: vmovsd  qword ptr [rsp+0c8h],xmm0
  0.01%    0x00000260ca7407f2: vmovsd  xmm0,qword ptr [rdi+rax*8+88h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@179 (line 92)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.70%    0x00000260ca7407fb: vmovsd  qword ptr [rsp+0d0h],xmm0
           0x00000260ca740804: vmovsd  xmm0,qword ptr [rdi+rax*8+100h]
  0.30%    0x00000260ca74080d: vmovsd  xmm11,qword ptr [rdi+rax*8+0c0h]
  0.01%    0x00000260ca740816: vmovsd  xmm13,qword ptr [rdi+rax*8+80h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@159 (line 91)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.70%    0x00000260ca74081f: vmovsd  xmm4,qword ptr [rdi+rax*8+0f8h]
           0x00000260ca740828: vmovsd  qword ptr [rsp+20h],xmm4
  0.00%    0x00000260ca74082e: vmovsd  xmm4,qword ptr [rdi+rax*8+0b8h]
  0.01%    0x00000260ca740837: vmovsd  qword ptr [rsp+28h],xmm4
  0.67%    0x00000260ca74083d: vmovsd  xmm8,qword ptr [rdi+rax*8+78h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@139 (line 90)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.00%    0x00000260ca740843: vmovsd  xmm4,qword ptr [rdi+rax*8+0f0h]
  0.00%    0x00000260ca74084c: vmovsd  qword ptr [rsp+30h],xmm4
  0.01%    0x00000260ca740852: vmovsd  xmm7,qword ptr [rdi+rax*8+0b0h]
  0.71%    0x00000260ca74085b: vmovsd  xmm9,qword ptr [rdi+rax*8+70h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@120 (line 89)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
           0x00000260ca740861: vmovsd  xmm4,qword ptr [rdi+rax*8+0e8h]
  0.00%    0x00000260ca74086a: vmovsd  qword ptr [rsp+38h],xmm4
  0.00%    0x00000260ca740870: vmovsd  xmm4,qword ptr [rdi+rax*8+0a8h]
  0.73%    0x00000260ca740879: vmovsd  qword ptr [rsp+50h],xmm4
  0.00%    0x00000260ca74087f: vmovsd  xmm4,qword ptr [rdi+rax*8+68h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@101 (line 88)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.00%    0x00000260ca740885: vmovsd  qword ptr [rsp+58h],xmm4
  0.00%    0x00000260ca74088b: vmovsd  xmm4,qword ptr [rdi+rax*8+0e0h]
  0.73%    0x00000260ca740894: vmovsd  qword ptr [rsp+60h],xmm4
  0.00%    0x00000260ca74089a: vmovsd  xmm4,qword ptr [rdi+rax*8+0a0h]
  0.00%    0x00000260ca7408a3: vmovsd  qword ptr [rsp+68h],xmm4
  0.00%    0x00000260ca7408a9: vmovsd  xmm4,qword ptr [rdi+rax*8+60h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@83 (line 87)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.71%    0x00000260ca7408af: vmovsd  qword ptr [rsp+70h],xmm4
           0x00000260ca7408b5: vmovsd  xmm4,qword ptr [rdi+rax*8+0d8h]
  0.00%    0x00000260ca7408be: vmovsd  xmm12,qword ptr [rdi+rax*8+98h]
  0.00%    0x00000260ca7408c7: vmovsd  xmm14,qword ptr [rdi+rax*8+58h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@66 (line 86)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.72%    0x00000260ca7408cd: vmovsd  xmm15,qword ptr [rdi+rax*8+0c8h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@179 (line 92)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
           0x00000260ca7408d6: vmovsd  qword ptr [rsp+78h],xmm15
  0.00%    0x00000260ca7408dc: mov     r9d,eax
  0.00%    0x00000260ca7408df: add     r9d,20h           ;*imul {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@46 (line 85)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.71%    0x00000260ca7408e3: vmovsd  xmm15,qword ptr [rdi+rax*8+108h]
                                                         ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@179 (line 92)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.00%    0x00000260ca7408ec: vmovsd  qword ptr [rsp+80h],xmm15
  0.00%    0x00000260ca7408f5: movsxd  r11,r11d
  0.00%    0x00000260ca7408f8: vaddsd  xmm2,xmm2,mmword ptr [rdi+r11*8+18h]
  0.71%    0x00000260ca7408ff: vaddsd  xmm5,xmm5,mmword ptr [rdi+r11*8+48h]
  0.01%    0x00000260ca740906: vmovsd  qword ptr [rsp+88h],xmm5
  0.71%    0x00000260ca74090f: vaddsd  xmm5,xmm6,mmword ptr [rdi+r11*8+20h]
  0.00%    0x00000260ca740916: vaddsd  xmm6,xmm3,mmword ptr [rdi+r11*8+28h]
  0.69%    0x00000260ca74091d: vaddsd  xmm1,xmm1,mmword ptr [rdi+r11*8+30h]
  0.01%    0x00000260ca740924: vaddsd  xmm10,xmm10,mmword ptr [rdi+r11*8+38h]
  0.73%    0x00000260ca74092b: vmovsd  xmm3,qword ptr [rsp+90h]
  0.00%    0x00000260ca740934: vaddsd  xmm15,xmm3,mmword ptr [rdi+r11*8+40h]
  0.01%    0x00000260ca74093b: vcvtsd2ss xmm3,xmm3,xmm2
  0.64%    0x00000260ca74093f: vcvtsd2ss xmm2,xmm2,xmm15
  0.04%    0x00000260ca740944: vcvtss2sd xmm3,xmm3,ebx
  0.74%    0x00000260ca740948: vaddsd  xmm3,xmm3,xmm14
  0.02%    0x00000260ca74094d: vcvtss2sd xmm2,xmm2,edx
  0.75%    0x00000260ca740951: vaddsd  xmm2,xmm2,xmm13
  0.06%    0x00000260ca740956: vcvtsd2ss xmm3,xmm3,xmm3
  0.01%    0x00000260ca74095a: vcvtsd2ss xmm2,xmm2,xmm2
  0.74%    0x00000260ca74095e: vcvtss2sd xmm3,xmm3,ebx
  0.01%    0x00000260ca740962: vaddsd  xmm3,xmm3,xmm12
  0.74%    0x00000260ca740967: vcvtss2sd xmm2,xmm2,edx
  0.03%    0x00000260ca74096b: vaddsd  xmm2,xmm2,xmm11
  0.07%    0x00000260ca740970: vcvtsd2ss xmm3,xmm3,xmm3
  0.70%    0x00000260ca740974: vcvtsd2ss xmm2,xmm2,xmm2
  0.06%    0x00000260ca740978: vcvtss2sd xmm3,xmm3,ebx
  0.70%    0x00000260ca74097c: vaddsd  xmm3,xmm3,xmm4
  0.03%    0x00000260ca740980: vcvtss2sd xmm2,xmm2,edx
  0.72%    0x00000260ca740984: vaddsd  xmm0,xmm2,xmm0
  0.05%    0x00000260ca740988: vcvtsd2ss xmm2,xmm2,xmm3  ;*d2f {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@68 (line 86)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.01%    0x00000260ca74098c: vcvtsd2ss xmm4,xmm4,xmm0  ;*d2f {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@161 (line 91)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.73%    0x00000260ca740990: vcvtsd2ss xmm3,xmm3,xmm10
  0.00%    0x00000260ca740995: vcvtsd2ss xmm0,xmm0,xmm1
  0.67%    0x00000260ca740999: vcvtss2sd xmm1,xmm1,ebx
  0.00%    0x00000260ca74099d: vaddsd  xmm1,xmm1,xmm8
  0.74%    0x00000260ca7409a2: vcvtss2sd xmm0,xmm0,eax
  0.02%    0x00000260ca7409a6: vaddsd  xmm0,xmm0,xmm9
  0.05%    0x00000260ca7409ab: vcvtsd2ss xmm1,xmm1,xmm1
  0.69%    0x00000260ca7409af: vcvtsd2ss xmm0,xmm0,xmm0
  0.08%    0x00000260ca7409b3: vcvtss2sd xmm1,xmm1,ecx
  0.78%    0x00000260ca7409b7: vaddsd  xmm1,xmm1,mmword ptr [rsp+28h]
  0.80%    0x00000260ca7409bd: vcvtss2sd xmm0,xmm0,eax
  0.43%    0x00000260ca7409c1: vaddsd  xmm0,xmm0,xmm7
  0.90%    0x00000260ca7409c5: vcvtsd2ss xmm1,xmm1,xmm1
  1.73%    0x00000260ca7409c9: vcvtsd2ss xmm0,xmm0,xmm0
  1.80%    0x00000260ca7409cd: vcvtss2sd xmm1,xmm1,ecx
  1.38%    0x00000260ca7409d1: vaddsd  xmm1,xmm1,mmword ptr [rsp+20h]
  2.59%    0x00000260ca7409d7: vcvtss2sd xmm0,xmm0,eax
  0.43%    0x00000260ca7409db: vaddsd  xmm0,xmm0,mmword ptr [rsp+30h]
  1.86%    0x00000260ca7409e1: vcvtsd2ss xmm8,xmm8,xmm1  ;*d2f {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@141 (line 90)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
                                                         ;   {no_reloc}
  1.60%    0x00000260ca7409e5: vcvtsd2ss xmm3,xmm3,xmm0  ;*d2f {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@122 (line 89)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  1.83%    0x00000260ca7409e9: vcvtsd2ss xmm1,xmm1,xmm6
  0.14%    0x00000260ca7409ed: vcvtsd2ss xmm5,xmm5,xmm5
  0.92%    0x00000260ca7409f1: vcvtss2sd xmm0,xmm0,ecx
  0.14%    0x00000260ca7409f5: vaddsd  xmm0,xmm0,mmword ptr [rsp+58h]
  1.30%    0x00000260ca7409fb: vcvtss2sd xmm1,xmm1,ebp
  0.00%    0x00000260ca7409ff: vaddsd  xmm1,xmm1,mmword ptr [rsp+70h]
  0.01%    0x00000260ca740a05: vcvtsd2ss xmm0,xmm0,xmm0
  3.60%    0x00000260ca740a09: vcvtsd2ss xmm5,xmm5,xmm1
  0.03%    0x00000260ca740a0d: vcvtss2sd xmm0,xmm0,eax
  3.56%    0x00000260ca740a11: vaddsd  xmm1,xmm0,mmword ptr [rsp+50h]
  2.89%    0x00000260ca740a17: vcvtss2sd xmm0,xmm0,ebp
  0.45%    0x00000260ca740a1b: vaddsd  xmm5,xmm0,mmword ptr [rsp+68h]
  2.83%    0x00000260ca740a21: vcvtsd2ss xmm0,xmm0,xmm1
  0.28%    0x00000260ca740a25: vcvtsd2ss xmm5,xmm5,xmm5
  3.49%    0x00000260ca740a29: vcvtss2sd xmm0,xmm0,eax
  0.03%    0x00000260ca740a2d: vaddsd  xmm0,xmm0,mmword ptr [rsp+38h]
  2.89%    0x00000260ca740a33: vcvtss2sd xmm1,xmm1,ebp
  0.57%    0x00000260ca740a37: vaddsd  xmm1,xmm1,mmword ptr [rsp+60h]
  2.74%    0x00000260ca740a3d: vcvtsd2ss xmm0,xmm0,xmm0  ;*d2f {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@103 (line 88)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.31%    0x00000260ca740a41: vcvtsd2ss xmm7,xmm7,xmm1  ;*d2f {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@85 (line 87)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  3.28%    0x00000260ca740a45: vcvtsd2ss xmm5,xmm5,mmword ptr [rsp+88h]
  0.00%    0x00000260ca740a4e: vcvtss2sd xmm1,xmm1,ebp
  0.70%    0x00000260ca740a52: vaddsd  xmm1,xmm1,mmword ptr [rsp+0d0h]
  1.65%    0x00000260ca740a5b: vcvtsd2ss xmm1,xmm1,xmm1
  3.85%    0x00000260ca740a5f: vcvtss2sd xmm1,xmm1,ecx
  3.22%    0x00000260ca740a63: vaddsd  xmm1,xmm1,mmword ptr [rsp+78h]
  2.86%    0x00000260ca740a69: vcvtsd2ss xmm5,xmm5,xmm1
  3.79%    0x00000260ca740a6d: vcvtss2sd xmm1,xmm1,ebp
  3.40%    0x00000260ca740a71: vaddsd  xmm1,xmm1,mmword ptr [rsp+80h]
  2.89%    0x00000260ca740a7a: vcvtsd2ss xmm1,xmm1,xmm1  ;*d2f {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@181 (line 92)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  3.55%    0x00000260ca740a7e: add     r8d,4h            ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@184 (line 84)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
  0.00%    0x00000260ca740a82: cmp     r8d,esi
           0x00000260ca740a85: jl      260ca740750h      ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@187 (line 84)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
           0x00000260ca740a8b: mov     r11,qword ptr [r15+70h]  ; ImmutableOopMap{rcx=Oop rdi=Oop rbp=Oop r14=NarrowOop [64]=Oop [72]=Oop }
                                                         ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@187 (line 84)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
           0x00000260ca740a8f: test    dword ptr [r11],eax  ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@187 (line 84)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
                                                         ;   {poll}
  0.01%    0x00000260ca740a92: cmp     r8d,edx
           0x00000260ca740a95: jnl     260ca740bd7h
           0x00000260ca740a9b: mov     r11d,r9d
           0x00000260ca740a9e: vmovsd  xmm5,qword ptr [rsp+0c8h]
                                                         ;*fload_1 {reexecute=0 rethrow=0 return_oop=0}
                                                         ; - com.openkappa.simd.reduction.ReduceArray::reduceUnrolled256@36 (line 85)
                                                         ; - com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub@17 (line 119)
           0x00000260ca740aa7: mov     esi,edx
  0.00%    0x00000260ca740aa9: sub     esi,r8d
  0.00%    0x00000260ca740aac: mov     r9d,0fa0h
           0x00000260ca740ab2: cmp     esi,r9d
           0x00000260ca740ab5: cmovnle esi,r9d
  0.00%    0x00000260ca740ab9: add     esi,r8d
  0.00%    0x00000260ca740abc: jmp     260ca74075ch
           0x00000260ca740ac1: vxorps  xmm2,xmm2,xmm2
           0x00000260ca740ac5: vxorps  xmm7,xmm7,xmm7
           0x00000260ca740ac9: vxorps  xmm0,xmm0,xmm0
           0x00000260ca740acd: vxorps  xmm3,xmm3,xmm3
           0x00000260ca740ad1: vxorps  xmm8,xmm8,xmm8
           0x00000260ca740ad6: vxorps  xmm4,xmm4,xmm4
           0x00000260ca740ada: vxorps  xmm1,xmm1,xmm1
           0x00000260ca740ade: xor     r8d,r8d
           0x00000260ca740ae1: xor     r11d,r11d
           0x00000260ca740ae4: vxorpd  xmm5,xmm5,xmm5
....................................................................................................
 99.46%  <total for region 1>

....[Hottest Regions]...............................................................................
 99.46%              c2  com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub, version 161 (907 bytes) 
  0.43%                  <unknown> (0 bytes) 
  0.03%              c2  com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub, version 161 (149 bytes) 
  0.01%              c2  com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub, version 161 (17 bytes) 
  0.01%         jvm.dll  0x000000007503ce81 (17 bytes) 
  0.01%              c2  org.openjdk.jmh.infra.Blackhole::consume, version 160 (24 bytes) 
  0.01%              c2  com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub, version 161 (57 bytes) 
  0.01%              c2  com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub, version 161 (51 bytes) 
  0.01%         jvm.dll  0x00000000750a0640 (80 bytes) 
  0.00%         jvm.dll  0x0000000074c9cb30 (13 bytes) 
  0.00%         jvm.dll  0x0000000075127a80 (17 bytes) 
  0.00%    runtime stub  StubRoutines::orderaccess_fence (6 bytes) 
  0.00%     interpreter  method entry point (kind = zerolocals)  (8 bytes) 
  0.00%              c2  com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub, version 161 (31 bytes) 
  0.00%         jvm.dll  0x0000000074b5aab5 (0 bytes) 
  0.00%         jvm.dll  0x0000000074b9bbb0 (0 bytes) 
  0.00%         jvm.dll  0x0000000074ce1f15 (0 bytes) 
  0.00%         jvm.dll  0x0000000075015277 (0 bytes) 
  0.00%         jvm.dll  0x000000007503bdb7 (0 bytes) 
  0.00%         jvm.dll  0x00000000750c840f (0 bytes) 
  0.01%  <...other 11 warm regions...>
....................................................................................................
100.00%  <totals>

....[Hottest Methods (after inlining)]..............................................................
 99.52%              c2  com.openkappa.simd.reduction.generated.ReduceArray_reduceUnrolled256_jmhTest::reduceUnrolled256_thrpt_jmhStub, version 161 
  0.43%                  <unknown> 
  0.01%         jvm.dll  0x000000007503ce81 
  0.01%              c2  org.openjdk.jmh.infra.Blackhole::consume, version 160 
  0.01%         jvm.dll  0x00000000750a0640 
  0.00%         jvm.dll  0x0000000074c9cb30 
  0.00%     interpreter  method entry point (kind = zerolocals)  
  0.00%    runtime stub  StubRoutines::orderaccess_fence 
  0.00%         jvm.dll  0x0000000075127a80 
  0.00%    kernel32.dll  GetCurrentThread 
  0.00%    kernel32.dll  TlsGetValue 
  0.00%         jvm.dll  0x00000000750c840f 
  0.00%         jvm.dll  0x000000007515a720 
  0.00%         jvm.dll  0x000000007503bdb7 
  0.00%     interpreter  ireturn  172 ireturn  
  0.00%         jvm.dll  0x0000000074ce1f15 
  0.00%         jvm.dll  0x0000000074b5aab5 
  0.00%         jvm.dll  0x0000000074b9bbb0 
  0.00%         jvm.dll  0x0000000075129f47 
  0.00%  KernelBase.dll  WaitForSingleObjectEx 
  0.00%  <...other 3 warm methods...>
....................................................................................................
100.00%  <totals>

....[Distribution by Source]........................................................................
 99.53%              c2
  0.43%                
  0.03%         jvm.dll
  0.00%     interpreter
  0.00%    runtime stub
  0.00%    kernel32.dll
  0.00%  KernelBase.dll
  0.00%       ntdll.dll
....................................................................................................
100.00%  <totals>

@Spasi
Author

Spasi commented Feb 20, 2018

I forgot to mention that the float versions use a separate benchmark with a float[] data array, so there shouldn't be any float<->double conversions.
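
As a rough illustration of what a float-only variant could look like (a sketch under assumptions, not the benchmark referred to above; the class name, method name, and unroll factor of 4 are hypothetical), keeping both the data and the accumulators as float means the loop body contains only float loads and float additions:

```java
// Hypothetical float-only reduction; no f2d/d2f conversions are needed.
final class FloatReduce {

    static float reduceUnrolled(float[] data) {
        // Four independent accumulators break the dependency chain between additions.
        float a0 = 0f, a1 = 0f, a2 = 0f, a3 = 0f;
        int limit = data.length - (data.length % 4);
        int i = 0;
        for (; i < limit; i += 4) {
            a0 += data[i];
            a1 += data[i + 1];
            a2 += data[i + 2];
            a3 += data[i + 3];
        }
        // Handle the elements left over when the length is not a multiple of 4.
        float tail = 0f;
        for (; i < data.length; i++) {
            tail += data[i];
        }
        return (a0 + a1) + (a2 + a3) + tail;
    }

    public static void main(String[] args) {
        float[] data = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f};
        System.out.println(reduceUnrolled(data));
    }
}
```

Note that splitting the sum across accumulators changes the order of the additions, so for inexact inputs the result can differ slightly from a simple left-to-right loop.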

@Spasi
Author

Spasi commented Feb 20, 2018

reduceVectorised on JDK 10:

51.62%         c2, level 4  bench.generated.SIMDBenchmark_reduceVectorised_jmhTest::reduceVectorised_thrpt_jmhStub, version 921 (626 bytes) 
37.43%        runtime stub  StubRoutines::jlong_disjoint_arraycopy (28 bytes)
               0x00007f996566c4c0: vmovdqu 0x10(%r8,%r11,8),%ymm0
  0.13%   │    0x00007f996566c4c7: vaddpd 0x10(%r13,%r11,8),%ymm0,%ymm0
  2.32%   │    0x00007f996566c4ce: vmovdqu %ymm0,0x10(%r13,%r11,8)
  0.35%   │    0x00007f996566c4d5: vmovdqu 0x30(%r8,%r11,8),%ymm0
  0.12%   │    0x00007f996566c4dc: vaddpd 0x30(%r13,%r11,8),%ymm0,%ymm0
  2.18%   │    0x00007f996566c4e3: vmovdqu %ymm0,0x30(%r13,%r11,8)
  0.48%   │    0x00007f996566c4ea: vmovdqu 0x50(%r8,%r11,8),%ymm0
  0.12%   │    0x00007f996566c4f1: vaddpd 0x50(%r13,%r11,8),%ymm0,%ymm0
  2.18%   │    0x00007f996566c4f8: vmovdqu %ymm0,0x50(%r13,%r11,8)
  0.49%   │    0x00007f996566c4ff: vmovdqu 0x70(%r8,%r11,8),%ymm0
  0.16%   │    0x00007f996566c506: vaddpd 0x70(%r13,%r11,8),%ymm0,%ymm0
  2.41%   │    0x00007f996566c50d: vmovdqu %ymm0,0x70(%r13,%r11,8)
  0.37%   │    0x00007f996566c514: vmovdqu 0x90(%r8,%r11,8),%ymm0
  0.09%   │    0x00007f996566c51e: vaddpd 0x90(%r13,%r11,8),%ymm0,%ymm0
  2.29%   │    0x00007f996566c528: vmovdqu %ymm0,0x90(%r13,%r11,8)
  0.40%   │    0x00007f996566c532: vmovdqu 0xb0(%r8,%r11,8),%ymm0
  0.13%   │    0x00007f996566c53c: vaddpd 0xb0(%r13,%r11,8),%ymm0,%ymm0
  2.69%   │    0x00007f996566c546: vmovdqu %ymm0,0xb0(%r13,%r11,8)
  0.31%   │    0x00007f996566c550: vmovdqu 0xd0(%r8,%r11,8),%ymm0
  0.19%   │    0x00007f996566c55a: vaddpd 0xd0(%r13,%r11,8),%ymm0,%ymm0
  2.32%   │    0x00007f996566c564: vmovdqu %ymm0,0xd0(%r13,%r11,8)
  0.55%   │    0x00007f996566c56e: vmovdqu 0xf0(%r8,%r11,8),%ymm0
  0.14%   │    0x00007f996566c578: vaddpd 0xf0(%r13,%r11,8),%ymm0,%ymm0
  2.30%   │    0x00007f996566c582: vmovdqu %ymm0,0xf0(%r13,%r11,8)
  0.54%   │    0x00007f996566c58c: vmovdqu 0x110(%r8,%r11,8),%ymm0
  0.10%   │    0x00007f996566c596: vaddpd 0x110(%r13,%r11,8),%ymm0,%ymm0
  2.13%   │    0x00007f996566c5a0: vmovdqu %ymm0,0x110(%r13,%r11,8)
  0.45%   │    0x00007f996566c5aa: vmovdqu 0x130(%r8,%r11,8),%ymm0
  0.12%   │    0x00007f996566c5b4: vaddpd 0x130(%r13,%r11,8),%ymm0,%ymm0
  2.37%   │    0x00007f996566c5be: vmovdqu %ymm0,0x130(%r13,%r11,8)
  0.29%   │    0x00007f996566c5c8: vmovdqu 0x150(%r8,%r11,8),%ymm0
  0.12%   │    0x00007f996566c5d2: vaddpd 0x150(%r13,%r11,8),%ymm0,%ymm0
  2.26%   │    0x00007f996566c5dc: vmovdqu %ymm0,0x150(%r13,%r11,8)
  0.31%   │    0x00007f996566c5e6: vmovdqu 0x170(%r8,%r11,8),%ymm0
  0.12%   │    0x00007f996566c5f0: vaddpd 0x170(%r13,%r11,8),%ymm0,%ymm0
  2.69%   │    0x00007f996566c5fa: vmovdqu %ymm0,0x170(%r13,%r11,8)
  0.35%   │    0x00007f996566c604: vmovdqu 0x190(%r8,%r11,8),%ymm0
  0.14%   │    0x00007f996566c60e: vaddpd 0x190(%r13,%r11,8),%ymm0,%ymm0
  2.18%   │    0x00007f996566c618: vmovdqu %ymm0,0x190(%r13,%r11,8)
  0.39%   │    0x00007f996566c622: vmovdqu 0x1b0(%r8,%r11,8),%ymm0
  0.10%   │    0x00007f996566c62c: vaddpd 0x1b0(%r13,%r11,8),%ymm0,%ymm0
  3.83%   │    0x00007f996566c636: vmovdqu %ymm0,0x1b0(%r13,%r11,8)
  0.51%   │    0x00007f996566c640: vmovdqu 0x1d0(%r8,%r11,8),%ymm0
  0.12%   │    0x00007f996566c64a: vaddpd 0x1d0(%r13,%r11,8),%ymm0,%ymm0
  2.47%   │    0x00007f996566c654: vmovdqu %ymm0,0x1d0(%r13,%r11,8)
  0.33%   │    0x00007f996566c65e: vmovdqu 0x1f0(%r8,%r11,8),%ymm0
  0.10%   │    0x00007f996566c668: vaddpd 0x1f0(%r13,%r11,8),%ymm0,%ymm0
  2.35%   │    0x00007f996566c672: vmovdqu %ymm0,0x1f0(%r13,%r11,8)
                                                  ;*dastore {reexecute=0 rethrow=0 return_oop=0}
                                                  ; - bench.SIMDBenchmark::reduceVectorised@63 (line 137)
                                                  ; - bench.generated.SIMDBenchmark_reduceVectorised_jmhTest::reduceVectorised_thrpt_jmhStub@17 (line 119)
  0.48%   │    0x00007f996566c67c: add    $0x40,%r11d        ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                  ; - bench.SIMDBenchmark::reduceVectorised@64 (line 136)
                                                  ; - bench.generated.SIMDBenchmark_reduceVectorised_jmhTest::reduceVectorised_thrpt_jmhStub@17 (line 119)
  0.01%   │    0x00007f996566c680: cmp    $0x3c1,%r11d
0x00007f996566c687: jl     0x00007f996566c4c0  ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                             ; - bench.SIMDBenchmark::reduceVectorised@67 (line 136)
                                                             ; - bench.generated.SIMDBenchmark_reduceVectorised_jmhTest::reduceVectorised_thrpt_jmhStub@17 (line 119)
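
For readers without the benchmark source at hand, here is a rough Java sketch of the loop shape that would produce a profile like the one above. It is an assumption pieced together from the listing, not the exact benchmark code: the class name, the `CHUNK` constant and the tail handling are made up for illustration. The 16 `vaddpd` instructions per iteration with `add $0x40` match an element-wise `buffer[i] += temp[i]` loop over 1024 doubles (16 x 4 = 64 doubles per pass), and the time in `jlong_disjoint_arraycopy` would be consistent with a `System.arraycopy` into `temp` for every chunk.

```java
import java.util.Arrays;

// Hypothetical sketch, not the original benchmark class.
final class BufferedReduceSketch {
    // Assumed chunk size, taken from the 1024-element arrays in the gist.
    private static final int CHUNK = 1024;

    static double reduceVectorised(double[] data) {
        double[] buffer = new double[CHUNK];
        double[] temp = new double[CHUNK];
        for (int offset = 0; offset < data.length; offset += CHUNK) {
            int len = Math.min(CHUNK, data.length - offset);
            // Copy one chunk; this is where an arraycopy stub would show up.
            System.arraycopy(data, offset, temp, 0, len);
            if (len < CHUNK) {
                // Zero the stale tail so a short last chunk adds nothing extra.
                Arrays.fill(temp, len, CHUNK, 0.0);
            }
            // No dependency between iterations, so C2 can use packed adds here.
            for (int i = 0; i < CHUNK; i++) {
                buffer[i] += temp[i];
            }
        }
        // Final scalar pass over the partial sums.
        double sum = 0.0;
        for (double v : buffer) {
            sum += v;
        }
        return sum;
    }
}
```

Note that every packed add in this shape also loads from and stores back to `buffer`, which is what the `vmovdqu` pairs around each `vaddpd` in the listing are doing.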

reduceUnrolled on JDK 10:

         ↗↗  0x00007f0801895a90: cmp    %ecx,%r10d
  0.00%  ││  0x00007f0801895a93: jae    0x00007f0801895d9c
         ││  0x00007f0801895a99: vaddsd 0x10(%rsi,%r10,8),%xmm5,%xmm0
         ││                                                ;*dadd {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@37 (line 215)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.20%  ││  0x00007f0801895aa0: mov    %r10d,%r11d
         ││  0x00007f0801895aa3: add    $0x3,%r11d
  0.82%  ││  0x00007f0801895aa7: cmp    %ecx,%r11d
  2.57%  ││  0x00007f0801895aaa: jae    0x00007f0801895dd6
         ││  0x00007f0801895ab0: movslq %r10d,%r10
  0.51%  ││  0x00007f0801895ab3: vaddsd 0x18(%rsi,%r10,8),%xmm3,%xmm7
  5.86%  ││  0x00007f0801895aba: vaddsd 0x28(%rsi,%r10,8),%xmm6,%xmm6
  0.04%  ││  0x00007f0801895ac1: vaddsd 0x20(%rsi,%r10,8),%xmm4,%xmm4
  0.80%  ││  0x00007f0801895ac8: mov    %r8d,%r9d
  0.00%  ││  0x00007f0801895acb: shl    $0x2,%r9d
         ││  0x00007f0801895acf: mov    %r9d,%r10d
  0.50%  ││  0x00007f0801895ad2: add    $0x20,%r10d        ;*imul {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@33 (line 215)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.00%  ││  0x00007f0801895ad6: movslq %r9d,%r11
         ││  0x00007f0801895ad9: vaddsd 0x30(%rsi,%r11,8),%xmm0,%xmm0
  2.86%  ││  0x00007f0801895ae0: vaddsd 0x50(%rsi,%r11,8),%xmm0,%xmm0
  1.87%  ││  0x00007f0801895ae7: vaddsd 0x70(%rsi,%r11,8),%xmm0,%xmm0
  6.65%  ││  0x00007f0801895aee: vaddsd 0x90(%rsi,%r11,8),%xmm0,%xmm0
 10.53%  ││  0x00007f0801895af8: vaddsd 0xb0(%rsi,%r11,8),%xmm0,%xmm0
 11.39%  ││  0x00007f0801895b02: vaddsd 0xd0(%rsi,%r11,8),%xmm0,%xmm0
 11.85%  ││  0x00007f0801895b0c: vaddsd 0xf0(%rsi,%r11,8),%xmm0,%xmm5
         ││                                                ;*dadd {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@37 (line 215)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
 11.80%  ││  0x00007f0801895b16: vmovsd 0x108(%rsi,%r11,8),%xmm2
  0.06%  ││  0x00007f0801895b20: vmovsd 0xe8(%rsi,%r11,8),%xmm3
  0.01%  ││  0x00007f0801895b2a: vmovsd 0xc8(%rsi,%r11,8),%xmm8
         ││  0x00007f0801895b34: vmovsd 0xa8(%rsi,%r11,8),%xmm9
  0.00%  ││  0x00007f0801895b3e: vmovsd 0x88(%rsi,%r11,8),%xmm10
  0.00%  ││  0x00007f0801895b48: vmovsd 0x68(%rsi,%r11,8),%xmm13
  0.00%  ││  0x00007f0801895b4f: vmovsd 0x48(%rsi,%r11,8),%xmm0  ;*daload {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@81 (line 218)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
         ││  0x00007f0801895b56: vmovsd 0x100(%rsi,%r11,8),%xmm11
         ││                                                ;*daload {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@65 (line 217)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  3.94%  ││  0x00007f0801895b60: vmovsd 0x38(%rsi,%r11,8),%xmm1  ;*daload {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@50 (line 216)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
         ││  0x00007f0801895b67: vmovsd 0xe0(%rsi,%r11,8),%xmm12
  0.00%  ││  0x00007f0801895b71: vaddsd %xmm1,%xmm7,%xmm1
  0.01%  ││  0x00007f0801895b75: vmovsd 0xc0(%rsi,%r11,8),%xmm7
         ││  0x00007f0801895b7f: vaddsd %xmm0,%xmm6,%xmm0
  0.03%  ││  0x00007f0801895b83: vmovsd 0x80(%rsi,%r11,8),%xmm6
  0.00%  ││  0x00007f0801895b8d: vaddsd %xmm13,%xmm0,%xmm0
  0.01%  ││  0x00007f0801895b92: vmovsd 0x60(%rsi,%r11,8),%xmm13
  3.97%  ││  0x00007f0801895b99: vaddsd %xmm10,%xmm0,%xmm10
  0.02%  ││  0x00007f0801895b9e: vmovsd 0x40(%rsi,%r11,8),%xmm0
  0.00%  ││  0x00007f0801895ba5: vaddsd %xmm0,%xmm4,%xmm0
         ││  0x00007f0801895ba9: vmovsd 0xa0(%rsi,%r11,8),%xmm4  ;*daload {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@65 (line 217)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
         ││  0x00007f0801895bb3: vaddsd %xmm13,%xmm0,%xmm0
  0.01%  ││  0x00007f0801895bb8: vmovsd 0xf8(%rsi,%r11,8),%xmm13
  0.03%  ││  0x00007f0801895bc2: vaddsd %xmm6,%xmm0,%xmm0
  0.12%  ││  0x00007f0801895bc6: vmovsd 0xd8(%rsi,%r11,8),%xmm6
  3.79%  ││  0x00007f0801895bd0: vaddsd %xmm4,%xmm0,%xmm0
  0.27%  ││  0x00007f0801895bd4: vmovsd 0xb8(%rsi,%r11,8),%xmm4
         ││  0x00007f0801895bde: vaddsd %xmm7,%xmm0,%xmm0
  0.49%  ││  0x00007f0801895be2: vmovsd 0x98(%rsi,%r11,8),%xmm14
         ││  0x00007f0801895bec: vaddsd %xmm12,%xmm0,%xmm12
  2.77%  ││  0x00007f0801895bf1: vmovsd 0x78(%rsi,%r11,8),%xmm0
         ││  0x00007f0801895bf8: vmovsd 0x58(%rsi,%r11,8),%xmm7  ;*daload {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@50 (line 216)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
         ││  0x00007f0801895bff: vaddsd %xmm7,%xmm1,%xmm1
  2.09%  ││  0x00007f0801895c03: vaddsd %xmm0,%xmm1,%xmm0
  0.02%  ││  0x00007f0801895c07: vaddsd %xmm9,%xmm10,%xmm1
         ││  0x00007f0801895c0c: vaddsd %xmm14,%xmm0,%xmm0
  0.07%  ││  0x00007f0801895c11: vaddsd %xmm8,%xmm1,%xmm1
  0.10%  ││  0x00007f0801895c16: vaddsd %xmm4,%xmm0,%xmm0
  1.78%  ││  0x00007f0801895c1a: vaddsd %xmm3,%xmm1,%xmm1
  0.20%  ││  0x00007f0801895c1e: vaddsd %xmm6,%xmm0,%xmm0
  1.41%  ││  0x00007f0801895c22: vaddsd %xmm2,%xmm1,%xmm6  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@82 (line 218)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  2.84%  ││  0x00007f0801895c26: vaddsd %xmm13,%xmm0,%xmm3  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@51 (line 216)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  5.99%  ││  0x00007f0801895c2b: vaddsd %xmm11,%xmm12,%xmm4  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@66 (line 217)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
  0.82%  ││  0x00007f0801895c30: add    $0x8,%r8d          ;*iinc {reexecute=0 rethrow=0 return_oop=0}
         ││                                                ; - bench.SIMDBenchmark::reduceUnrolled@85 (line 214)
         ││                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
         ││  0x00007f0801895c34: cmp    %edi,%r8d
         ╰│  0x00007f0801895c37: jl     0x00007f0801895a90  ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                ; - bench.SIMDBenchmark::reduceUnrolled@88 (line 214)
                                                ; - bench.generated.SIMDBenchmark_reduceUnrolled_jmhTest::reduceUnrolled_thrpt_jmhStub@17 (line 119)
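
Again as a hedged sketch rather than the actual source: the listing above contains only scalar `vaddsd` instructions, arranged in four independent dependency chains whose loads are 0x20 bytes (4 doubles) apart, with C2 unrolling the loop a further 8 times (`add $0x8,%r8d`). A manually unrolled loop with four accumulators, roughly like the one below, would compile to that shape. The class name and the remainder loop are assumptions added for illustration.

```java
// Hypothetical sketch, not the original benchmark class.
final class UnrolledReduceSketch {
    static double reduceUnrolled(double[] data) {
        // Four independent accumulators, so the adds do not wait on each other.
        double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
        int quads = data.length / 4;
        for (int i = 0; i < quads; i++) {
            a0 += data[i * 4];
            a1 += data[i * 4 + 1];
            a2 += data[i * 4 + 2];
            a3 += data[i * 4 + 3];
        }
        double sum = a0 + a1 + a2 + a3;
        // Assumed remainder handling for lengths that are not a multiple of 4.
        for (int i = quads * 4; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }
}
```

Here the accumulators stay in registers for the whole loop, so each element costs a single load and a single add, with no stores.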
