Skip to content

Instantly share code, notes, and snippets.

@twillouer
Created November 15, 2014 20:41
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save twillouer/ac13eb1dadc8a270f821 to your computer and use it in GitHub Desktop.
Save twillouer/ac13eb1dadc8a270f821 to your computer and use it in GitHub Desktop.
Benchmarking of toArray
@State(Scope.Benchmark)
public class ToArrayBench {
ArrayList<Byte> list;
@Setup
public void setup() throws Throwable
{
list = new ArrayList<>();
for (int i = 0; i < 10; i++) {
list.add((byte) i);
}
}
@Benchmark
public void zero_sized_array()
{
list.toArray(new Byte[0]);
}
@Benchmark
public void simple_toArray()
{
list.toArray();
}
@Benchmark
public void sized_array_from_list()
{
list.toArray(new Byte[list.size()]);
}
@Benchmark
public void sized_array_fixed_size()
{
list.toArray(new Byte[100]);
}
@Benchmark
public void defensive_copy()
{
new ArrayList<>(list);
}
public static void main(String[] args) throws RunnerException, IOException
{
Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
.warmupIterations(20)
.warmupTime(TimeValue.seconds(1))
.measurementIterations(20)
.timeUnit(TimeUnit.MILLISECONDS)
.forks(1)
// .addProfiler(LinuxPerfProfiler.class)
.build();
new Runner(opt).run();
}
}
@twillouer
Copy link
Author

java -version
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)

ubuntu 14.10, 64b

Run complete. Total time: 00:40:26

Benchmark Mode Samples Score Error Units
c.d.m.b.m.ToArrayBench.defensive_copy thrpt 200 73_431_578,798 ± 875202,665 ops/s
c.d.m.b.m.ToArrayBench.simple_toArray thrpt 200 76_845_735,280 ± 710117,405 ops/s
c.d.m.b.m.ToArrayBench.sized_array_fixed_size thrpt 200 16_077_751,085 ± 155382,904 ops/s
c.d.m.b.m.ToArrayBench.sized_array_from_list thrpt 200 24_524_222,665 ± 311064,966 ops/s
c.d.m.b.m.ToArrayBench.zero_sized_array thrpt 200 34_209_908,158 ± 447488,028 ops/s

@twillouer
Copy link
Author

I do not understand these results. Everyone (PMD, Intellij, stackoverflow) seems to agree that toArray(new String[0]) is less effective than toArray (new String [list.size()], but it suggests the opposite bench ..

@jerrinot
Copy link

that's interesting one. I've simplified your setup to include just the two cases I most interested in:

@State(Scope.Thread)
public class ToArrayBench {

    @Param("10")
    private int size;
    private ArrayList<Byte> list;

    @Setup
    public void setup() throws Throwable
    {
        list = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            list.add((byte) i);
        }
    }

    @Benchmark
    public void zero_sized_array(Blackhole bh)
    {
        Byte[] bytes = list.toArray(new Byte[0]);
        bh.consume(bytes);
    }

    @Benchmark
    public void sized_array_fixed_size(Blackhole bh)
    {
        Byte[] bytes = list.toArray(new Byte[size]);
        bh.consume(bytes);
    }

    public static void main(String[] args) throws Throwable {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(10)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .threads(1)
                .forks(1)
//                        .addProfiler(LinuxPerfProfiler.class)
                .build();

        new Runner(opt).run();
    }
}

and indeed the results are counter-intuitive:

Benchmark                                      (size)   Mode  Samples      Score  Score error   Units
o.o.j.s.ToArrayBench.sized_array_fixed_size        10  thrpt       20  29170.782     2361.987  ops/ms
o.o.j.s.ToArrayBench.zero_sized_array              10  thrpt       20  41963.943     2474.507  ops/ms

however when I change the sized_array_fixed_size() method to use constant there results are more aligned with expectation:

    @Benchmark
    public void sized_array_fixed_size(Blackhole bh)
    {
        Byte[] bytes = list.toArray(new Byte[10]);
        bh.consume(bytes);
    }
Benchmark                                      (size)   Mode  Samples      Score  Score error   Units
o.o.j.s.ToArrayBench.sized_array_fixed_size        10  thrpt       20  47657.534      607.251  ops/ms
o.o.j.s.ToArrayBench.zero_sized_array              10  thrpt       20  43036.401      716.582  ops/ms

I can see the version the original version has lower instructions per cycle count:
This is the original sized_array_fixed_size():

   23289.588754 task-clock (msec)         #    0.639 CPUs utilized          
            14,374 context-switches          #    0.617 K/sec                  
             3,054 cpu-migrations            #    0.131 K/sec                  
               453 page-faults               #    0.019 K/sec                  
    79,254,722,599 cycles                    #    3.403 GHz                     [31.19%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   187,817,995,702 instructions              #    2.37  insns per cycle         [38.93%]
    36,922,468,827 branches                  # 1585.364 M/sec                   [38.93%]
        51,694,474 branch-misses             #    0.14% of all branches         [38.71%]
    46,459,233,680 L1-dcache-loads           # 1994.850 M/sec                   [38.77%]
       654,546,759 L1-dcache-load-misses     #    1.41% of all L1-dcache hits   [38.58%]
        57,970,689 LLC-loads                 #    2.489 M/sec                   [30.90%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
         8,075,209 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [31.19%]
    46,181,846,166 dTLB-loads:HG             # 1982.940 M/sec                   [31.09%]
           720,667 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [31.05%]
        32,330,750 iTLB-loads:HG             #    1.388 M/sec                   [31.14%]
           637,481 iTLB-load-misses:HG       #    1.97% of all iTLB cache hits  [31.39%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [31.40%]

      36.442513922 seconds time elapsed

vs. the zero_sized_array():

  23279.164686 task-clock (msec)         #    0.639 CPUs utilized          
            15,372 context-switches          #    0.660 K/sec                  
             3,401 cpu-migrations            #    0.146 K/sec                  
               532 page-faults               #    0.023 K/sec                  
    78,414,546,516 cycles                    #    3.368 GHz                     [31.04%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   260,885,594,417 instructions              #    3.33  insns per cycle         [39.01%]
    49,221,494,283 branches                  # 2114.401 M/sec                   [39.16%]
        69,222,874 branch-misses             #    0.14% of all branches         [39.19%]
    62,745,594,182 L1-dcache-loads           # 2695.354 M/sec                   [39.11%]
     1,121,079,378 L1-dcache-load-misses     #    1.79% of all L1-dcache hits   [38.64%]
        99,591,088 LLC-loads                 #    4.278 M/sec                   [30.57%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        10,673,042 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [30.94%]
    63,298,980,413 dTLB-loads:HG             # 2719.126 M/sec                   [31.00%]
         1,173,107 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [32.64%]
        32,181,074 iTLB-loads:HG             #    1.382 M/sec                   [32.58%]
           463,867 iTLB-load-misses:HG       #    1.44% of all iTLB cache hits  [32.50%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [32.42%]

      36.426116098 seconds time elapsed

The new version of sized_array_fixed_size() with constant has IPC similar to the the zero_sized_array():

      23346.537627 task-clock (msec)         #    0.639 CPUs utilized          
            14,690 context-switches          #    0.629 K/sec                  
             3,524 cpu-migrations            #    0.151 K/sec                  
               502 page-faults               #    0.022 K/sec                  
    79,594,753,008 cycles                    #    3.409 GHz                     [30.88%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   254,160,699,602 instructions              #    3.19  insns per cycle         [38.88%]
    50,744,484,027 branches                  # 2173.534 M/sec                   [38.94%]
        70,854,819 branch-misses             #    0.14% of all branches         [39.08%]
    62,767,371,751 L1-dcache-loads           # 2688.509 M/sec                   [39.04%]
       908,061,562 L1-dcache-load-misses     #    1.45% of all L1-dcache hits   [38.96%]
        83,239,667 LLC-loads                 #    3.565 M/sec                   [30.64%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        10,329,198 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [30.71%]
    63,590,675,367 dTLB-loads:HG             # 2723.773 M/sec                   [30.74%]
           845,789 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [30.69%]
        29,044,917 iTLB-loads:HG             #    1.244 M/sec                   [31.01%]
           489,904 iTLB-load-misses:HG       #    1.69% of all iTLB cache hits  [30.74%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [30.72%]

      36.531843880 seconds time elapsed

vs. the zero_sized_array():

      23167.723860 task-clock (msec)         #    0.638 CPUs utilized          
            14,513 context-switches          #    0.626 K/sec                  
             3,424 cpu-migrations            #    0.148 K/sec                  
               494 page-faults               #    0.021 K/sec                  
    78,630,215,762 cycles                    #    3.394 GHz                     [30.94%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   265,990,291,006 instructions              #    3.38  insns per cycle         [38.66%]
    50,353,702,113 branches                  # 2173.442 M/sec                   [38.65%]
        67,371,027 branch-misses             #    0.13% of all branches         [38.84%]
    63,736,003,459 L1-dcache-loads           # 2751.069 M/sec                   [38.83%]
     1,126,015,959 L1-dcache-load-misses     #    1.77% of all L1-dcache hits   [38.63%]
       181,894,495 LLC-loads                 #    7.851 M/sec                   [30.82%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        10,028,501 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [32.29%]
    59,875,036,318 dTLB-loads:HG             # 2584.416 M/sec                   [32.92%]
         1,291,552 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [32.84%]
        36,123,503 iTLB-loads:HG             #    1.559 M/sec                   [32.87%]
           716,049 iTLB-load-misses:HG       #    1.98% of all iTLB cache hits  [32.79%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [32.54%]

      36.333910926 seconds time elapsed

@jerrinot
Copy link

this is giving me similarly confusing results:

@State(Scope.Thread)
public class ToArrayBench {

    @Param("10")
    private int size;
    private Byte[] buffer;

    @Setup
    public void setup() throws Throwable {
        buffer = new Byte[size];
        for (byte i = 0; i < size; i++) {
            buffer[i] = i;
        }
    }

    @Benchmark
    public void fast(Blackhole bh) {
        int s = buffer.length;
        Byte[] copy = Arrays.copyOf(buffer, s, Byte[].class);
        bh.consume(copy);
    }

    @Benchmark
    public void slow(Blackhole bh) {
        int s = buffer.length;
        Byte[] copy = (Byte[]) Array.newInstance(Byte[].class.getComponentType(), s);
        System.arraycopy(buffer, 0, copy, 0, s);
        bh.consume(copy);
    }

    public static void main(String[] args) throws Throwable {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(10)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .threads(1)
                .forks(1)
                .addProfiler(LinuxPerfProfiler.class)
                .build();

        new Runner(opt).run();
    }
}
Benchmark                         (size)   Mode  Samples       Score  Score error   Units
o.o.j.s.ToArrayBench.fast             10  thrpt       20  134199.249    15573.195  ops/ms
o.o.j.s.ToArrayBench.fast:@cpi        10  thrpt        1       0.367          NaN     CPI
o.o.j.s.ToArrayBench.slow             10  thrpt       20   49499.140     2661.012  ops/ms
o.o.j.s.ToArrayBench.slow:@cpi        10  thrpt        1       0.764          NaN     CPI

perf stats for fast():

      23044.197935 task-clock (msec)         #    0.632 CPUs utilized          
            14,410 context-switches          #    0.625 K/sec                  
             3,227 cpu-migrations            #    0.140 K/sec                  
               423 page-faults               #    0.018 K/sec                  
    78,964,852,601 cycles                    #    3.427 GHz                     [30.93%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   214,945,315,883 instructions              #    2.72  insns per cycle         [38.77%]
    40,570,072,182 branches                  # 1760.533 M/sec                   [39.00%]
         8,633,173 branch-misses             #    0.02% of all branches         [38.99%]
    55,630,805,896 L1-dcache-loads           # 2414.092 M/sec                   [39.18%]
     2,727,361,923 L1-dcache-load-misses     #    4.90% of all L1-dcache hits   [38.89%]
     1,403,751,595 LLC-loads                 #   60.916 M/sec                   [30.76%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        10,486,830 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [31.88%]
    54,778,270,818 dTLB-loads:HG             # 2377.096 M/sec                   [31.78%]
           584,070 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [31.70%]
        23,538,830 iTLB-loads:HG             #    1.021 M/sec                   [31.62%]
           312,318 iTLB-load-misses:HG       #    1.33% of all iTLB cache hits  [31.64%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [31.57%]

      36.459311589 seconds time elapsed

perf stats for slow():

      23121.720993 task-clock (msec)         #    0.635 CPUs utilized          
            14,615 context-switches          #    0.632 K/sec                  
             3,168 cpu-migrations            #    0.137 K/sec                  
               463 page-faults               #    0.020 K/sec                  
    79,644,122,769 cycles                    #    3.445 GHz                     [30.98%]
   <not supported> stalled-cycles-frontend 
   <not supported> stalled-cycles-backend  
   104,254,073,200 instructions              #    1.31  insns per cycle         [38.76%]
    17,489,978,097 branches                  #  756.431 M/sec                   [38.87%]
        42,822,998 branch-misses             #    0.24% of all branches         [38.74%]
    21,942,957,608 L1-dcache-loads           #  949.019 M/sec                   [38.73%]
     1,022,663,441 L1-dcache-load-misses     #    4.66% of all L1-dcache hits   [38.71%]
        88,585,236 LLC-loads                 #    3.831 M/sec                   [30.86%]
   <not supported> LLC-load-misses:HG      
   <not supported> L1-icache-loads:HG      
        11,208,770 L1-icache-load-misses:HG  #    0.00% of all L1-icache hits   [31.33%]
    21,891,428,217 dTLB-loads:HG             #  946.791 M/sec                   [31.21%]
           807,245 dTLB-load-misses:HG       #    0.00% of all dTLB cache hits  [31.04%]
        24,236,835 iTLB-loads:HG             #    1.048 M/sec                   [30.99%]
           387,664 iTLB-load-misses:HG       #    1.60% of all iTLB cache hits  [31.06%]
   <not supported> L1-dcache-prefetches:HG 
                 0 L1-dcache-prefetch-misses:HG #    0.000 K/sec                   [31.13%]

      36.428755980 seconds time elapsed

@twillouer
Copy link
Author

@jerrinot thanks for your time.
Still in trouble to understand the problem :)

@klinham
Copy link

klinham commented Jan 10, 2015

No idea either, can someone please elaborate on that ?

@twillouer
Copy link
Author

Updated:

@State(Scope.Benchmark)
public class ToArrayBench {

    private static final int SIZE = 100;

    ArrayList<Byte> list;

    @Setup
    public void setup() throws Throwable
    {
        list = new ArrayList<>();
        for (int i = 0; i < SIZE; i++) {
            list.add((byte) i);
        }
    }

    @Benchmark
    public void zero_sized_array(Blackhole bh)
    {
        bh.consume(list.toArray(new Byte[0]));
    }

    @Benchmark
    public void simple_toArray(Blackhole bh)
    {
        bh.consume(list.toArray());
    }

    @Benchmark
    public void sized_array_from_list(Blackhole bh)
    {
        bh.consume(list.toArray(new Byte[list.size()]));
    }

    @Benchmark
    public void sized_array_fixed_size(Blackhole bh)
    {
        bh.consume(list.toArray(new Byte[SIZE]));
    }

    @Benchmark
    public void defensive_copy(Blackhole bh)
    {
        bh.consume(new ArrayList<>(list));
    }

    public static void main(String[] args) throws RunnerException, IOException
    {
        Options opt = new OptionsBuilder().include(".*" + ToArrayBench.class.getSimpleName() + ".*")
                .warmupIterations(20)
                .warmupTime(TimeValue.seconds(1))
                .measurementIterations(20)
                .timeUnit(TimeUnit.MILLISECONDS)
                .forks(1)
//                .addProfiler(LinuxPerfAsmProfiler.class)
                .build();

        new Runner(opt).run();
    }
}

@twillouer
Copy link
Author

Benchmark Mode Cnt Score Error Units
ToArrayBench.defensive_copy thrpt 200 16 714 192 ± 129515,217 ops/s
ToArrayBench.simple_toArray thrpt 200 17 918 950 ± 102801,298 ops/s
ToArrayBench.sized_array_fixed_size thrpt 200 5 799 136 ± 65921,564 ops/s
ToArrayBench.sized_array_from_list thrpt 200 5 643 162 ± 85215,009 ops/s
ToArrayBench.zero_sized_array thrpt 200 6 529 068 ± 78960,062 ops/s

@shipilev
Copy link

Routinely, I will chew on people who can't use perfasm profiler, but this is not your fault it wasn't helping here. ;) Only in JMH 1.5+ (released yesterday) perfasm can decode the VM stubs, and VM stubs are the crucial piece of info to untangle this. See: http://cr.openjdk.java.net/~shade/scratch/ToArrayBench.java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment