
@monik3r
Last active February 7, 2024 04:07
AFL++ 7950x3d

There was a question in the Awesome Fuzzing Discord today about the 7950X3D and whether the extra cache is good for fuzzing. Here is a quick test; it is far from optimized or scientific.

Started by cloning and compiling the latest AFL++ in the aflplusplus/aflplusplus container.

For reference, here is the detected NUMA layout:

[AFL++ 4b1f28ef6c43] ~/AFLplusplus/benchmark # numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 95698 MB
node 0 free: 78766 MB
node distances:
node   0 
  0:  10 

It is not obvious which cores have the V-Cache and which don't. Going off of this linux_gaming post, the assumption is that the first CCD (logical CPUs 0-7 and their SMT siblings 16-23) has the extra cache.

To double-check that this is not totally wrong, we can look at the core id field in cpuinfo:

cat /proc/cpuinfo | grep "core id"
core id		: 0
core id		: 1
core id		: 2
core id		: 3
core id		: 4
core id		: 5
core id		: 6
core id		: 7
core id		: 8
core id		: 9
core id		: 10
core id		: 11
core id		: 12
core id		: 13
core id		: 14
core id		: 15
core id		: 0
core id		: 1
core id		: 2
core id		: 3
core id		: 4
core id		: 5
core id		: 6
core id		: 7
core id		: 8
core id		: 9
core id		: 10
core id		: 11
core id		: 12
core id		: 13
core id		: 14
core id		: 15

That post seems right at first glance.
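The core-id check above can also be scripted. Here is a minimal sketch (assuming the standard /proc/cpuinfo "key : value" layout) that maps physical core ids to the logical CPUs backing them, which makes the SMT pairing explicit:

```python
# Sketch: map physical core ids to logical CPUs from /proc/cpuinfo text.
def core_map(cpuinfo_text):
    """Return {core_id: [logical_cpu, ...]} parsed from /proc/cpuinfo output."""
    mapping = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        if line.startswith("processor"):
            cpu = int(line.split(":")[1])
        elif line.startswith("core id"):
            mapping.setdefault(int(line.split(":")[1]), []).append(cpu)
    return mapping

# Two-entry excerpt for illustration (values match the output above):
sample = "processor\t: 0\ncore id\t\t: 0\nprocessor\t: 16\ncore id\t\t: 0\n"
print(core_map(sample))  # {0: [0, 16]} -> cpus 0 and 16 share physical core 0
```

On this machine the full map would show core 0 backed by logical CPUs 0 and 16, core 1 by 1 and 17, and so on, matching the two repeated blocks of core ids above.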

Running:

[AFL++ 4b1f28ef6c43] ~/AFLplusplus/benchmark # python benchmark.py 
 [*] afl-persistent-config did not run; run it to improve performance (and decrease security).
 [*] Ready, starting benchmark...
 [*] Compiling the test-instr-persist-shmem fuzzing harness for the benchmark to use.
 [*] singlecore test-instr-persist-shmem run 1 of 3, execs/s: 71611.04
 [*] singlecore test-instr-persist-shmem run 2 of 3, execs/s: 71524.59
 [*] singlecore test-instr-persist-shmem run 3 of 3, execs/s: 71561.04
 [*] Average execs/sec for this test across all runs was: 71565.56
 [*] Using 32 fuzzers for multicore fuzzing (use --fuzzers to override).
 [*] multicore test-instr-persist-shmem run 1 of 3, execs/s: 1578652.18
 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: 1562471.01
 [*] multicore test-instr-persist-shmem run 3 of 3, execs/s: 1557715.07
 [*] Average execs/sec for this test across all runs was: 1566279.42
 [*] Results have been written to the benchmark-results.jsonl file.
 [*] Results have been written to the COMPARISON.md file.
CPU                                                 | MHz   | threads | singlecore | multicore | afl-*-config |
====================================================|=======|=========|============|===========|==============|
Raspberry Pi 5                                      | 2400  | 4       | 25786      | 101114    | both         |
AMD EPYC 7282 16-Core Processor                     | 3194  | 32      | 87199      | 769001    | both         |
AMD Ryzen 5 PRO 4650G with Radeon Graphics          | 3700  | 12      | 95356      | 704840    | both         |
Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz            | 4995  | 16      | 120064     | 1168943   | both         |
12th Gen Intel(R) Core(TM) i7-1270P                 | 4761  | 16      | 149778     | 641219    | both         |
AMD Ryzen 9 5950X 16-Core Processor                 | 4792  | 32      | 161690     | 2339763   | both         |
Apple Mac Studio M2 Ultra 2023, Linux VM guest      | 3500  | 16      | 163570     | 1157465   | both         |
AMD Ryzen 9 7950X3D 16-Core Processor               | 5400  | 32      | 71566      | 1566279   | system       |

Note that only afl-system-config was run; afl-persistent-config was partially applied by hand but not installed with a reboot, so performance will likely be lower than it could be.

A first stab at pinning to the V-Cache cores did not work (verified by watching htop live): taskset -c "0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23" python benchmark.py
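A likely explanation (an assumption, not verified here): taskset only sets the wrapper process's affinity mask, while afl-fuzz binds itself to a free core with sched_setaffinity unless AFL_NO_AFFINITY is set, so the inherited mask gets overridden. A minimal Linux-only sketch of the affinity call involved:

```python
import os

# Restrict the current process to cpu 0, then read the mask back.
# afl-fuzz performs a similar sched_setaffinity call on itself, which is
# presumably why the mask inherited from taskset did not stick.
os.sched_setaffinity(0, {0})
print(os.sched_getaffinity(0))  # -> {0}
```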

Running with python benchmark.py -f 8 -m multicore did keep things on the first 8 cores:

 [*] afl-persistent-config did not run; run it to improve performance (and decrease security).
 [*] Ready, starting benchmark...
 [*] Compiling the test-instr-persist-shmem fuzzing harness for the benchmark to use.
 [*] Using 8 fuzzers for multicore fuzzing (the default is 32)
 [*] multicore test-instr-persist-shmem run 1 of 3, execs/s: 531118.56
 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: 528016.65
 [*] multicore test-instr-persist-shmem run 3 of 3, execs/s: 527505.37
 [*] Average execs/sec for this test across all runs was: 528880.19

As a lazy comparison, we can run on the first 16 cores to see what adding the non-V-Cache cores looks like:

[AFL++ 4b1f28ef6c43] ~/AFLplusplus/benchmark # python benchmark.py -f 16 -m multicore
[*] afl-persistent-config did not run; run it to improve performance (and decrease security).
[*] Ready, starting benchmark...
[*] Compiling the test-instr-persist-shmem fuzzing harness for the benchmark to use.
[*] Using 16 fuzzers for multicore fuzzing (the default is 32)
[*] multicore test-instr-persist-shmem run 1 of 3, execs/s: 1030594.95
[*] multicore test-instr-persist-shmem run 2 of 3, execs/s: 1024707.31
[*] multicore test-instr-persist-shmem run 3 of 3, execs/s: 1021166.04
[*] Average execs/sec for this test across all runs was: 1025489.43

So 66,110 execs/sec/core vs 64,093, or about a 3% difference. Not massive, but power efficiency might differ and some targets might benefit from the extra L3. Note that both per-core figures are lower than the initial single-core result (71,566), which is interesting.
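The per-core arithmetic can be checked directly from the averages reported above:

```python
# Average execs/sec from the 8- and 16-fuzzer runs above.
totals = {8: 528880.19, 16: 1025489.43}
per_core = {n: t / n for n, t in totals.items()}
print({n: round(v) for n, v in per_core.items()})  # {8: 66110, 16: 64093}
print(f"{per_core[8] / per_core[16] - 1:.1%}")     # ~3.1%
```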

Turning off mitigations

After turning off mitigations (and, per the config column, fully applying afl-persistent-config), the before and after look as follows:

AMD Ryzen 9 7950X3D 16-Core Processor               | 5400  | 32      | 71566      | 1566279   | system       |
AMD Ryzen 9 7950X3D 16-Core Processor (no-mit)      | 5478  | 32      | 161960     | 2173959   | both         |

Pretty large uplift, especially in single-core performance! Note the config column also changed from system to both, so not all of the gain is from the mitigations alone. Let's run the same tests as before: python benchmark.py -f 8 -m multicore

 [*] Ready, starting benchmark...
 [*] Compiling the test-instr-persist-shmem fuzzing harness for the benchmark to use.
 [*] Using 8 fuzzers for multicore fuzzing (the default is 32)
 [*] multicore test-instr-persist-shmem run 1 of 3, execs/s: 1223436.20
 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: 1224677.47
 [*] multicore test-instr-persist-shmem run 3 of 3, execs/s: 1221867.71
 [*] Average execs/sec for this test across all runs was: 1223327.13
 [*] Results have been written to the benchmark-results.jsonl file.
 [*] Results have not been written to the COMPARISON.md file; this CPU is already present.

python benchmark.py -f 16 -m multicore

 [*] Ready, starting benchmark...
 [*] Compiling the test-instr-persist-shmem fuzzing harness for the benchmark to use.
 [*] Using 16 fuzzers for multicore fuzzing (the default is 32)
 [*] multicore test-instr-persist-shmem run 1 of 3, execs/s: 2008935.41
 [*] multicore test-instr-persist-shmem run 2 of 3, execs/s: 1988427.24
 [*] multicore test-instr-persist-shmem run 3 of 3, execs/s: 1993374.59
 [*] Average execs/sec for this test across all runs was: 1996912.41
 [*] Results have been written to the benchmark-results.jsonl file.
 [*] Results have not been written to the COMPARISON.md file; this CPU is already present.

We see 152,915.89 execs/sec/core on the first 8 cores (all V-Cache) and 124,807.02 execs/sec/core on 16 cores (half V-Cache, half not). That is a 22% per-core performance increase! Though some of this may be turbo-boost behavior: re-running while checking clock speeds with cat /proc/cpuinfo | grep -i mhz shows the first 8 cores clocking in at 4950 MHz during the 8-fuzzer run. With 16 cores loaded we see a split between the V-Cache and non-V-Cache cores:

cpu MHz		: 4883.895
cpu MHz		: 4883.865
cpu MHz		: 4883.901
cpu MHz		: 4883.890
cpu MHz		: 4883.881
cpu MHz		: 4883.905
cpu MHz		: 4883.897
cpu MHz		: 4883.894
cpu MHz		: 5100.064
cpu MHz		: 5099.998
cpu MHz		: 5100.063
cpu MHz		: 5100.077
cpu MHz		: 5099.994
cpu MHz		: 5100.069
cpu MHz		: 5100.003
cpu MHz		: 5100.016

Note the 4883 vs 5100 MHz! That's pretty neat, and confirms that the V-Cache cores' clocks aren't significantly slower under the 16-core load (about 67 MHz, or 1.3%, below the 8-core run).
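For reference, the clock-speed delta quoted above, computed from the sampled MHz values (rounded):

```python
# MHz of the V-Cache cores: 8-fuzzer run vs. the same cores under 16 fuzzers.
mhz_8 = 4950
mhz_16 = 4883
delta = mhz_8 - mhz_16
print(delta, f"{delta / mhz_8:.2%}")  # 67 MHz, ~1.35% slower
```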

LibXML2

Commands run (for repro):

git clone https://gitlab.gnome.org/GNOME/libxml2.git
cd libxml2/ && ./autogen.sh && ./configure --enable-shared=no
make CC=afl-clang-lto CXX=afl-clang-lto++ LD=afl-clang-lto
mkdir fuzz
cp xmllint fuzz/xmllint_cov
mkdir fuzz/in
cp test/*.xml fuzz/in/
cd fuzz

To run on a single core with the same env vars as in the benchmark:

AFL_DISABLE_TRIM=1 AFL_I_DONT_CARE_ABOUT_MISSING_CRASHES=1 AFL_FAST_CAL=1 AFL_NO_UI=1 AFL_TRY_AFFINITY=1 afl-fuzz -i in/ -o out -V 10 -s 123 -D -b 1 -- ./xmllint_cov @@

To run on a non-V-Cache core:

AFL_DISABLE_TRIM=1 AFL_I_DONT_CARE_ABOUT_MISSING_CRASHES=1 AFL_FAST_CAL=1 AFL_NO_UI=1 AFL_TRY_AFFINITY=1 afl-fuzz -i in/ -o out -V 10 -s 123 -D -b 9 -- ./xmllint_cov @@

Note these could be -b 0 and -b 8, but I originally used core #1 to verify afl-fuzz pinning would work, so I just ran with that.

To get results: cat out/default/fuzzer_stats | grep execs_per_sec

Here are the results:

  • Core #1: 7,036.47
  • Core #9: 7,652.53

Huh, it seems that the faster clock speed here actually helps quite a bit at first glance.

After re-running two more times, the measurements seem to be correct: a non-V-Cache core gets ~8% more execs/sec. With only a single core loaded, cpu #1 runs at 5150 MHz while core #9 runs at 5400 MHz, about 5% faster.
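A small sketch for pulling execs_per_sec out of fuzzer_stats programmatically (the file is "key : value" per line) and computing the gap between the two pinned runs; the exact ratio works out to about 8.8%:

```python
def execs_per_sec(stats_text):
    """Parse the execs_per_sec field from AFL++ fuzzer_stats text."""
    for line in stats_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "execs_per_sec":
            return float(value)
    raise KeyError("execs_per_sec not found")

core1, core9 = 7036.47, 7652.53    # the two single-core results above
print(f"{core9 / core1 - 1:.1%}")  # -> 8.8%
```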
