@lukego
Created August 19, 2015 17:46
CPU PMU (Performance Monitoring Unit) support
Howdy!
I have geeked out on a new piece of hardware :-)
This time it is the Performance Monitoring Unit built into the CPU. This is a hardware capability to track fine-grained events inside the processor and give visibility into things like cache misses, branch mispredictions, and utilization of internal CPU resources.
It turns out that you only need two special CPU instructions to drive this - WRMSR to set up a counter, RDPMC to read it - and a simple but interesting benchmarking tool is only 500 lines of code. This was also a good opportunity to use our new ability to write Lua code that generates machine code at runtime.
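To make that concrete, here is a minimal sketch (not lib.pmu's actual code) of the value that gets written with WRMSR to program a counter. The bit layout and the example event code are the standard Intel ones; RDPMC with the matching counter index then reads the running count:

  -- Encode an IA32_PERFEVTSELx value for a general-purpose counter.
  local bit = require("bit")
  local bor, lshift = bit.bor, bit.lshift

  -- Event code goes in bits 0-7, unit mask in bits 8-15; USR (bit 16),
  -- OS (bit 17) and EN (bit 22) enable counting in user and kernel mode.
  local function evtsel (event, umask)
     return bor(event, lshift(umask, 8),
                lshift(1, 16), lshift(1, 17), lshift(1, 22))
  end

  -- BR_MISP_RETIRED.ALL_BRANCHES is event 0xC5, umask 0x00:
  print(("%#x"):format(evtsel(0xC5, 0x00)))   --> 0x4300c5
  -- WRMSR writes this value to IA32_PERFEVTSEL0 (MSR 0x186) to start
  -- counter 0; RDPMC with index 0 then reads the running count.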
I have merged the basic support onto the 'next' branch now. I reckon it will take some experimentation to see how to use it effectively. I hope that we will be able to build it into the engine and collect detailed metrics separately for each app both during benchmarking runs and in production. Have to see.
Here is a quick demo of how it works:
* Run with CPU affinity locked to one core (you get an error otherwise):
$ sudo taskset -c 0 ./snabb snsh -i
Snabb>
* Load lib.pmu, which auto-detects the available counters for your CPU:
Snabb> pmu = require("lib.pmu")
* Measure an (interpreted) empty function to get an idea of the overheads:
Snabb> pmu.profile(function() end)
EVENT                                  TOTAL
instructions                             850
cycles                                 1,282
ref-cycles                             2,568
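(Those ~850 instructions and ~1,300 cycles are the fixed overhead of the measurement itself, including the interpreted function call, so they are negligible next to the million-iteration measurements below.)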
* Measure a loop and count branches + mispredictions:
Snabb> pmu.profile(function() for i = 1, 1000000 do end end, {"retired.*all_branches$"})
EVENT                                  TOTAL
instructions                       3,015,391
cycles                             1,037,590
ref-cycles                         2,075,184
br_inst_retired.all_branches       1,002,138
br_misp_retired.all_branches             366
* Measure the loop again and include an "auxiliary" row to see ratios:
Snabb> pmu.profile(function() for i = 1, 1000000 do end end, {"retired.*all_branches$"}, {loop=1000000})
EVENT                                  TOTAL     /loop
instructions                       3,095,286     3.095
cycles                             1,281,695     1.282
ref-cycles                         2,563,392     2.563
br_inst_retired.all_branches       1,002,142     1.002
br_misp_retired.all_branches             365     0.000
loop                               1,000,000     1.000
Here we see that each iteration of the loop takes approx. one cycle and three instructions and that branches are seldom mispredicted.
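(For example: 1,281,695 cycles / 1,000,000 iterations ≈ 1.28 cycles per iteration, and 365 mispredictions out of 1,002,142 retired branches is roughly 0.04%.)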
* Compare checksum routines (generic vs SSE2 vs AVX2)
Here is the output of a script that compares the checksum routines and shows
their cache behavior in this particular benchmark (a sketch of such a script
follows the results):
profiling: generic
EVENT                                  TOTAL     /byte   /packet
instructions                     420,255,210     2.918  4202.552
cycles                           429,483,566     2.983  4294.836
ref-cycles                       429,483,552     2.983  4294.836
mem_load_uops_retired.l1_hit     271,618,029     1.886  2716.180
mem_load_uops_retired.l2_hit             173     0.000     0.002
mem_load_uops_retired.l3_hit             229     0.000     0.002
mem_load_uops_retired.l3_miss             60     0.000     0.001
byte                             144,000,000     1.000  1440.000
packet                               100,000     0.001     1.000
profiling: sse2
EVENT                                  TOTAL     /byte   /packet
instructions                     116,882,789     0.812  1168.828
cycles                            32,942,396     0.229   329.424
ref-cycles                        32,942,400     0.229   329.424
mem_load_uops_retired.l1_hit      11,508,798     0.080   115.088
mem_load_uops_retired.l2_hit              65     0.000     0.001
mem_load_uops_retired.l3_hit             157     0.000     0.002
mem_load_uops_retired.l3_miss              1     0.000     0.000
byte                             144,000,000     1.000  1440.000
packet                               100,000     0.001     1.000
profiling: avx2
EVENT                                  TOTAL     /byte   /packet
instructions                      56,773,023     0.394   567.730
cycles                            16,444,852     0.114   164.449
ref-cycles                        16,444,848     0.114   164.448
mem_load_uops_retired.l1_hit       6,910,824     0.048    69.108
mem_load_uops_retired.l2_hit             107     0.000     0.001
mem_load_uops_retired.l3_hit             100     0.000     0.001
mem_load_uops_retired.l3_miss              0     0.000     0.000
byte                             144,000,000     1.000  1440.000
packet                               100,000     0.001     1.000
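For reference, a rough sketch of what such a comparison script could look like (illustrative only: the real script calls Snabb's C checksum routines through the FFI, whereas this stand-in uses a plain-Lua ones-complement sum so the sketch stays self-contained):

  local pmu = require("lib.pmu")
  local ffi = require("ffi")
  local bit = require("bit")

  local npackets, packetsize = 100000, 1440
  local data = ffi.new("uint8_t[?]", packetsize)

  -- Stand-in checksum: ones-complement sum over 16-bit words. The real
  -- script would put the generic/SSE2/AVX2 FFI routines in 'variants'.
  local function checksum (ptr, len)
     local p, sum = ffi.cast("uint16_t*", ptr), 0
     for i = 0, len/2 - 1 do
        sum = sum + p[i]
        if sum > 0xffff then sum = sum - 0xffff end
     end
     return bit.band(bit.bnot(sum), 0xffff)
  end
  local variants = { generic = checksum, sse2 = checksum, avx2 = checksum }

  for _, name in ipairs({"generic", "sse2", "avx2"}) do
     print("profiling: "..name)
     local f = variants[name]
     pmu.profile(function () for i = 1, npackets do f(data, packetsize) end end,
                 {"mem_load_uops_retired"},
                 {byte = npackets*packetsize, packet = npackets})
  end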
So that is what we have.
I hope that we will be able to do very cool things with this. The basic unit that I would like to measure is an app. One cool idea would be to have a benchmarking environment that subjects a given app to various workloads (traffic mix, cache warm/cold) and generates a "data sheet" showing its expected performance. The other cool idea would be to measure with low enough overhead that we could track this in production and see how metrics like cycles/packet compare with our expectations.
End braindump!