Skip to content

Instantly share code, notes, and snippets.

@kassane
Last active May 23, 2024 06:41
Show Gist options
  • Save kassane/70792fdb1aa6eb9f4ce7307e824b6575 to your computer and use it in GitHub Desktop.
Save kassane/70792fdb1aa6eb9f4ce7307e824b6575 to your computer and use it in GitHub Desktop.
sieve-cache bench: Zig vs D impl vs C++ impl (Release mode)

System Spec

  • CPU: Ryzen 7 5700G (Zen3)
  • RAM: 16 GiB
  • DISK: SSD 470GiB (format: btrfs)
  • SO: ArchLinux (last upd installed: 20-05-2024)

Programming Languages port

  • D: dub build -b release - note using latest ldc2 (2.108.1) compiler!
  • Zig: zig build -Doptimize=ReleaseFast - (zig v0.12.0)
  • C++: cmake -B build -DCMAKE_BUILD_TYPE=Release
    • libstdc++ [shared]: g++14 (archlinux)
    • libstdc++ [static]: g++14 -static-libstdc++ (archlinux) [add: -DCMAKE_EXE_LINKER_FLAGS=-static-libstdc++]
    • libc++ [shared]: clang++17 -stdlib=libc++ (archlinux)
    • libc++ [static]: zig c++ (a.k.a: clang++17 -stdlib=libc++ -static [default])

Running Benchmark tests

poop/perf bench
$ poop \
    -d 100 \
    './sieve-cache-d/benchmark/benchmark' \
    './sieve-cache-cpp/build_libstdcpp/SieveCache_bench' \
    './sieve-cache-cpp/build_libcpp/SieveCache_bench' \
    './sieve-cache-cpp/build_libstdcpp_static/SieveCache_bench' \
    './sieve-cache-cpp/build_libcpp_static/SieveCache_bench' \
    './zig-sieve/zig-out/bin/sieve_bench'
Benchmark 1 (37 runs): ./sieve-cache-d/benchmark/benchmark
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.42ms ±  151us    1.15ms … 1.94ms          2 ( 5%)        0%
  peak_rss           3.19MB ±  481KB    2.74MB … 5.14MB          3 ( 8%)        0%
  cpu_cycles         2.02M  ±  101K     1.94M  … 2.39M           3 ( 8%)        0%
  instructions       4.50M  ± 19.0K     4.46M  … 4.53M           0 ( 0%)        0%
  cache_references   56.1K  ± 1.60K     53.9K  … 61.0K           4 (11%)        0%
  cache_misses       14.2K  ± 1.06K     13.2K  … 17.1K           4 (11%)        0%
  branch_misses      16.7K  ±  269      16.2K  … 17.7K           1 ( 3%)        0%
Benchmark 2 (64 runs): ./sieve-cache-cpp/build_libstdcpp/SieveCache_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.54ms ±  128us    1.29ms … 1.88ms          1 ( 2%)        💩+  8.2% ±  3.9%
  peak_rss           3.79MB ± 35.4KB    3.67MB … 3.80MB          8 (13%)        💩+ 18.8% ±  3.8%
  cpu_cycles         3.04M  ±  161K     2.92M  … 4.03M           4 ( 6%)        💩+ 50.2% ±  2.9%
  instructions       7.20M  ± 23.5K     7.15M  … 7.25M           0 ( 0%)        💩+ 60.2% ±  0.2%
  cache_references   63.5K  ± 6.77K     60.3K  …  101K           7 (11%)        💩+ 13.1% ±  4.0%
  cache_misses       17.1K  ±  622      16.1K  … 19.3K           4 ( 6%)        💩+ 20.2% ±  2.3%
  branch_misses      22.6K  ±  159      22.3K  … 23.1K           3 ( 5%)        💩+ 35.2% ±  0.5%
Benchmark 3 (15 runs): ./sieve-cache-cpp/build_libcpp/SieveCache_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          6.91ms ±  192us    6.64ms … 7.33ms          0 ( 0%)        💩+385.4% ±  7.1%
  peak_rss           3.22MB ± 82.6KB    3.15MB … 3.41MB          0 ( 0%)          +  1.1% ±  7.9%
  cpu_cycles         26.5M  ±  159K     26.3M  … 26.9M           0 ( 0%)        💩+1211.6% ±  3.6%
  instructions       37.5M  ±  151K     37.2M  … 37.7M           0 ( 0%)        💩+732.7% ±  1.1%
  cache_references    139K  ± 15.3K      121K  …  167K           0 ( 0%)        💩+147.9% ±  9.0%
  cache_misses       15.6K  ±  852      14.0K  … 17.1K           0 ( 0%)        💩+ 10.0% ±  4.3%
  branch_misses      22.9K  ±  445      22.3K  … 24.0K           1 ( 7%)        💩+ 37.2% ±  1.2%
Benchmark 4 (25 runs): ./sieve-cache-cpp/build_libstdcpp_static/SieveCache_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          4.06ms ±  186us    3.74ms … 4.57ms          0 ( 0%)        💩+184.8% ±  6.0%
  peak_rss           3.08MB ±  107KB    2.88MB … 3.28MB          0 ( 0%)          -  3.4% ±  6.1%
  cpu_cycles         14.2M  ±  180K     14.0M  … 14.8M           3 (12%)        💩+601.0% ±  3.5%
  instructions       25.5M  ±  136K     25.3M  … 25.9M           0 ( 0%)        💩+467.2% ±  1.0%
  cache_references   93.9K  ± 8.80K     75.2K  …  113K           0 ( 0%)        💩+ 67.2% ±  5.3%
  cache_misses       11.4K  ± 1.01K     10.2K  … 13.1K           0 ( 0%)        ⚡- 19.7% ±  3.8%
  branch_misses      15.2K  ±  340      14.7K  … 16.0K           0 ( 0%)        ⚡-  9.1% ±  0.9%
Benchmark 5 (16 runs): ./sieve-cache-cpp/build_libcpp_static/SieveCache_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          6.46ms ±  182us    6.13ms … 6.66ms          0 ( 0%)        💩+353.4% ±  6.8%
  peak_rss           2.11MB ± 93.0KB    1.96MB … 2.22MB          0 ( 0%)        ⚡- 33.9% ±  7.7%
  cpu_cycles         25.3M  ±  142K     25.0M  … 25.6M           0 ( 0%)        💩+1150.1% ±  3.4%
  instructions       35.0M  ±  153K     34.7M  … 35.3M           0 ( 0%)        💩+678.7% ±  1.1%
  cache_references    161K  ± 4.37K      152K  …  167K           0 ( 0%)        💩+187.4% ±  2.9%
  cache_misses       10.5K  ± 1.21K     8.60K  … 12.3K           0 ( 0%)        ⚡- 25.8% ±  4.7%
  branch_misses      15.5K  ±  320      14.9K  … 16.1K           0 ( 0%)        ⚡-  7.3% ±  1.0%
Benchmark 6 (195 runs): ./zig-sieve/zig-out/bin/sieve_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           484us ± 75.0us     356us …  925us          1 ( 1%)        ⚡- 66.0% ±  2.2%
  peak_rss            909KB ± 1.17KB     893KB …  909KB          1 ( 1%)        ⚡- 71.5% ±  2.1%
  cpu_cycles          512K  ± 60.8K      425K  …  728K          14 ( 7%)        ⚡- 74.7% ±  1.2%
  instructions       1.39M  ± 1.36K     1.39M  … 1.39M           5 ( 3%)        ⚡- 69.2% ±  0.1%
  cache_references   5.50K  ±  600      4.97K  … 11.8K           7 ( 4%)        ⚡- 90.2% ±  0.5%
  cache_misses        663   ± 73.6       559   … 1.22K          11 ( 6%)        ⚡- 95.3% ±  1.0%
  branch_misses      5.16K  ±  954      3.37K  … 6.28K           0 ( 0%)        ⚡- 69.1% ±  1.9%
runtime test
$ time ./sieve-cache-d/benchmark/benchmark;\
  time ./sieve-cache-cpp/build_libstdcpp/SieveCache_bench;\
  time ./sieve-cache-cpp/build_libcpp/SieveCache_bench;\
  time ./sieve-cache-cpp/build_libstdcpp_static/SieveCache_bench;\
  time ./sieve-cache-cpp/build_libcpp_static/SieveCache_bench;\
  time ./zig-sieve/zig-out/bin/sieve_bench
Sequence: 136 μs and 5 hnsecs
Composite: 123 μs and 4 hnsecs
CompositeNormal: 182 μs and 7 hnsecs

real    0m0,002s
user    0m0,002s
sys     0m0,000s
Sequence: 101us
Composite: 240us
Composite Normal: 393us

real    0m0,002s
user    0m0,000s
sys     0m0,002s
Sequence: 1.28e+03us
Composite: 3.57e+03us
Composite Normal: 5.9e+03us

real    0m0,007s
user    0m0,007s
sys     0m0,000s
Sequence: 922us
Composite: 1.82e+03us
Composite Normal: 3.06e+03us

real    0m0,004s
user    0m0,004s
sys     0m0,000s
Sequence: 1.4e+03us
Composite: 3.71e+03us
Composite Normal: 5.96e+03us

real    0m0,007s
user    0m0,007s
sys     0m0,000s
Sequence: 36.618us
Composite: 45.895us
Composite Normal: 124.349us

real    0m0,001s
user    0m0,001s
sys     0m0,000
Hyperfine bench
$ hyperfine \
    -M 100 \
    -N \
    -w 5 \
    './sieve-cache-d/benchmark/benchmark' \
    './sieve-cache-cpp/build_libstdcpp/SieveCache_bench' \
    './sieve-cache-cpp/build_libcpp/SieveCache_bench' \
    './sieve-cache-cpp/build_libstdcpp_static/SieveCache_bench' \
    './sieve-cache-cpp/build_libcpp_static/SieveCache_bench' \
    './zig-sieve/zig-out/bin/sieve_bench'
Benchmark 1: ./sieve-cache-d/benchmark/benchmark
  Time (mean ± σ):       1.3 ms ±   0.1 ms    [User: 0.9 ms, System: 0.3 ms]
  Range (min … max):     1.1 ms …   1.7 ms    100 runs
 
Benchmark 2: ./sieve-cache-cpp/build_libstdcpp/SieveCache_bench
  Time (mean ± σ):       1.4 ms ±   0.1 ms    [User: 1.1 ms, System: 0.3 ms]
  Range (min … max):     1.2 ms …   1.8 ms    100 runs
 
Benchmark 3: ./sieve-cache-cpp/build_libcpp/SieveCache_bench
  Time (mean ± σ):       6.9 ms ±   0.2 ms    [User: 6.1 ms, System: 0.7 ms]
  Range (min … max):     6.6 ms …   7.4 ms    100 runs
 
Benchmark 4: ./sieve-cache-cpp/build_libstdcpp_static/SieveCache_bench
  Time (mean ± σ):       4.0 ms ±   0.2 ms    [User: 3.5 ms, System: 0.4 ms]
  Range (min … max):     3.6 ms …   4.4 ms    100 runs
 
Benchmark 5: ./sieve-cache-cpp/build_libcpp_static/SieveCache_bench
  Time (mean ± σ):       6.4 ms ±   0.2 ms    [User: 5.7 ms, System: 0.5 ms]
  Range (min … max):     6.0 ms …   7.2 ms    100 runs
 
Benchmark 6: ./zig-sieve/zig-out/bin/sieve_bench
  Time (mean ± σ):     398.3 µs ±  52.9 µs    [User: 323.0 µs, System: 29.6 µs]
  Range (min … max):   298.9 µs … 559.2 µs    100 runs
 
Summary
  ./zig-sieve/zig-out/bin/sieve_bench ran
    3.25 ± 0.52 times faster than ./sieve-cache-d/benchmark/benchmark
    3.61 ± 0.56 times faster than ./sieve-cache-cpp/build_libstdcpp/SieveCache_bench
   10.04 ± 1.41 times faster than ./sieve-cache-cpp/build_libstdcpp_static/SieveCache_bench
   16.02 ± 2.19 times faster than ./sieve-cache-cpp/build_libcpp_static/SieveCache_bench
   17.33 ± 2.36 times faster than ./sieve-cache-cpp/build_libcpp/SieveCache_bench

References

tools:

@kassane
Copy link
Author

kassane commented May 20, 2024

Brief analysis.

And Rust?
The cargo bench command does not show me how the underground really works.

rust-sieve-cache
$ cargo bench --all
    Finished `bench` profile [optimized] target(s) in 0.06s
     Running unittests src/lib.rs (target/release/deps/sieve_cache-34107b6d331e04c7)

running 2 tests
test test ... ignored
test test_visited_flag_update ... ignored

test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/criterion.rs (target/release/deps/criterion-647b399aa6d2449e)
Gnuplot not found, using plotters backend
bench_sequence          time:   [10.883 µs 10.920 µs 10.958 µs]
                        change: [-1.2112% -0.8298% -0.4525%] (p = 0.00 < 0.05)
                        Change within noise threshold.

bench_composite         time:   [17.602 µs 17.731 µs 17.865 µs]
                        change: [-2.2555% -1.4544% -0.7070%] (p = 0.00 < 0.05)
                        Change within noise threshold.
kassane:~/rust-sieve-cache $ time cargo bench --all
    Finished `bench` profile [optimized] target(s) in 0.02s
     Running unittests src/lib.rs (target/release/deps/sieve_cache-34107b6d331e04c7)

running 2 tests
test test ... ignored
test test_visited_flag_update ... ignored

test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/criterion.rs (target/release/deps/criterion-647b399aa6d2449e)
Gnuplot not found, using plotters backend
bench_sequence          time:   [10.894 µs 10.927 µs 10.961 µs]
                        change: [-0.5472% -0.2057% +0.1287%] (p = 0.24 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bench_composite         time:   [17.756 µs 17.873 µs 17.999 µs]
                        change: [+0.2530% +0.9962% +1.7687%] (p = 0.01 < 0.05)
                        Change within noise threshold.

bench_composite_normal  time:   [24.978 µs 25.088 µs 25.198 µs]
                        change: [-0.5687% -0.0011% +0.5553%] (p = 0.99 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild


real    2m7,747s
user    2m29,145s
sys     0m0,322s

$ time ./target/release/deps/criterion-647b399aa6d2449e 
Gnuplot not found, using plotters backend
Testing bench_sequence
Success

Testing bench_composite
Success

Testing bench_composite_normal
Success


real    0m0,002s
user    0m0,000s
sys     0m0,002s

cc: @1a1a11a @kubo39

@kubo39
Copy link

kubo39 commented May 20, 2024

Interesting, I guess the pre-allocated hashmap and no lock/atomic operations are the reasons why rust and zig are faster than C++.

@1a1a11a
Copy link

1a1a11a commented May 21, 2024

Wow interesting results! It would be good to include the link of the repo.

@kassane
Copy link
Author

kassane commented May 21, 2024

Interesting, I guess the pre-allocated hashmap and no lock/atomic operations are the reasons why rust and zig are faster than C++.

Yeah! However, the idea is that this initial solution does not have any external dependencies.
I want to try std::flat_map instead of std::unordered_map, but no current modern compiler has it implemented.

It would be good to include the link of the repo.

Done!

@kassane
Copy link
Author

kassane commented May 21, 2024

My new commit (kassane/sieve-cache-cpp@d8a7634) on cpp header, remove STL containers and add std::pmr::memory_resource* mem_resource = std::pmr::get_default_resource() . Allow make & use custom_allocators (C++17)

g++14 [libstdc++ shared]:

poop/perf
$ poop \
     -d 100 \
     './sieve-cache-d/benchmark/benchmark' \
     './sieve-cache-cpp/build_pmr/SieveCache_bench' \
     './sieve-cache-cpp/build_old/SieveCache_bench' \
     './zig-sieve/zig-out/bin/sieve_bench'
Benchmark 1 (59 runs): ./sieve-cache-d/benchmark/benchmark
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.38ms ±  153us    1.16ms … 2.02ms          2 ( 3%)        0%
  peak_rss           3.12MB ±  282KB    2.88MB … 5.13MB          1 ( 2%)        0%
  cpu_cycles         2.01M  ± 80.5K     1.92M  … 2.37M           3 ( 5%)        0%
  instructions       4.49M  ± 21.6K     4.44M  … 4.54M           0 ( 0%)        0%
  cache_references   57.0K  ± 2.57K     54.7K  … 73.9K           2 ( 3%)        0%
  cache_misses       14.3K  ±  916      13.4K  … 17.6K           4 ( 7%)        0%
  branch_misses      16.8K  ±  466      15.3K  … 18.7K           2 ( 3%)        0%
Benchmark 2 (80 runs): ./sieve-cache-cpp/build_pmr/SieveCache_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.21ms ± 99.0us     995us … 1.48ms          0 ( 0%)        ⚡- 12.1% ±  3.0%
  peak_rss           3.77MB ± 67.4KB    3.41MB … 3.80MB         17 (21%)        💩+ 20.9% ±  2.1%
  cpu_cycles         1.84M  ±  131K     1.72M  … 2.62M           5 ( 6%)        ⚡-  8.4% ±  1.9%
  instructions       4.15M  ± 3.57K     4.14M  … 4.16M           0 ( 0%)        ⚡-  7.6% ±  0.1%
  cache_references   60.1K  ± 1.37K     58.1K  … 68.9K           5 ( 6%)        💩+  5.5% ±  1.2%
  cache_misses       17.2K  ±  803      16.0K  … 19.8K           3 ( 4%)        💩+ 20.8% ±  2.0%
  branch_misses      18.1K  ±  110      17.8K  … 18.4K           2 ( 3%)        💩+  7.5% ±  0.6%
Benchmark 3 (72 runs): ./sieve-cache-cpp/build_old/SieveCache_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.36ms ±  113us    1.15ms … 1.57ms          0 ( 0%)          -  1.3% ±  3.3%
  peak_rss           3.77MB ± 80.7KB    3.52MB … 3.93MB         21 (29%)        💩+ 20.8% ±  2.2%
  cpu_cycles         2.53M  ± 67.3K     2.44M  … 2.76M           2 ( 3%)        💩+ 25.6% ±  1.3%
  instructions       5.92M  ± 15.0K     5.89M  … 5.96M           1 ( 1%)        💩+ 31.8% ±  0.1%
  cache_references   64.5K  ± 4.43K     61.2K  … 90.2K           5 ( 7%)        💩+ 13.3% ±  2.2%
  cache_misses       17.8K  ± 1.10K     16.4K  … 20.7K           0 ( 0%)        💩+ 24.9% ±  2.5%
  branch_misses      20.0K  ±  172      19.7K  … 20.5K           1 ( 1%)        💩+ 19.2% ±  0.7%
Benchmark 4 (188 runs): ./zig-sieve/zig-out/bin/sieve_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           499us ± 59.0us     378us …  899us          4 ( 2%)        ⚡- 63.9% ±  1.9%
  peak_rss            913KB ±    0       913KB …  913KB          0 ( 0%)        ⚡- 70.7% ±  1.3%
  cpu_cycles          498K  ± 39.6K      435K  …  675K          21 (11%)        ⚡- 75.3% ±  0.8%
  instructions       1.39M  ± 1.52K     1.38M  … 1.39M          10 ( 5%)        ⚡- 69.2% ±  0.1%
  cache_references   5.64K  ±  307      5.15K  … 7.60K          13 ( 7%)        ⚡- 90.1% ±  0.7%
  cache_misses        709   ±  142       579   … 1.35K          28 (15%)        ⚡- 95.0% ±  0.9%
  branch_misses      5.38K  ±  702      3.71K  … 6.34K           0 ( 0%)        ⚡- 68.0% ±  1.1%
Hyperfine
$ hyperfine \
     -N \
     -M 100 \
     -w 5 \
     './sieve-cache-d/benchmark/benchmark' \
     './sieve-cache-cpp/build_pmr/SieveCache_bench' \
     './sieve-cache-cpp/build_old/SieveCache_bench' \
     './zig-sieve/zig-out/bin/sieve_bench'
Benchmark 1: ./sieve-cache-d/benchmark/benchmark
  Time (mean ± σ):       1.3 ms ±   0.1 ms    [User: 1.0 ms, System: 0.3 ms]
  Range (min … max):     1.1 ms …   1.8 ms    100 runs
 
Benchmark 2: ./sieve-cache-cpp/build_pmr/SieveCache_bench
  Time (mean ± σ):       1.2 ms ±   0.1 ms    [User: 0.9 ms, System: 0.2 ms]
  Range (min … max):     0.9 ms …   1.4 ms    100 runs
 
Benchmark 3: ./sieve-cache-cpp/build_old/SieveCache_bench
  Time (mean ± σ):       1.3 ms ±   0.1 ms    [User: 1.0 ms, System: 0.3 ms]
  Range (min … max):     1.1 ms …   1.6 ms    100 runs
 
Benchmark 4: ./zig-sieve/zig-out/bin/sieve_bench
  Time (mean ± σ):     412.0 µs ±  48.6 µs    [User: 319.6 µs, System: 41.5 µs]
  Range (min … max):   327.0 µs … 547.8 µs    100 runs
 
Summary
  ./zig-sieve/zig-out/bin/sieve_bench ran
    2.80 ± 0.40 times faster than ./sieve-cache-cpp/build_pmr/SieveCache_bench
    3.22 ± 0.46 times faster than ./sieve-cache-d/benchmark/benchmark
    3.26 ± 0.44 times faster than ./sieve-cache-cpp/build_old/SieveCache_bench
runtime test
$ time ./sieve-cache-d/benchmark/benchmark; \
      time ./sieve-cache-cpp/build_pmr/SieveCache_bench; \
      time ./sieve-cache-cpp/build_old/SieveCache_bench; \
      time ./zig-sieve/zig-out/bin/sieve_bench
Sequence: 133 μs and 6 hnsecs
Composite: 118 μs and 2 hnsecs
CompositeNormal: 162 μs and 2 hnsecs

real    0m0,002s
user    0m0,002s
sys     0m0,000s
Sequence: 20.463us
Composite: 90.442us
Composite Normal: 134.372us

real    0m0,002s
user    0m0,000s
sys     0m0,002s
Sequence: 78.151us
Composite: 186.263us
Composite Normal: 290.953us

real    0m0,002s
user    0m0,002s
sys     0m0,000s
Sequence: 38.342us
Composite: 46.653us
Composite Normal: 112.792us

real    0m0,001s
user    0m0,000s
sys     0m0,001s

@kubo39
Copy link

kubo39 commented May 21, 2024

I found that Rust/Zig's hashmap impelmentation is a slightly modified version of swisstable, and that D and C++'s std::unordered_map uses a separate chain-based one. (edited: D uses open addressing since 2015~)
I added the same benchmark program to Nim, which uses a chain-based open addressing one and is about 5x slower than almost same as D.

Also, zig-sieve is a bit buggy..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment