Created August 19, 2015 17:46
CPU PMU (Performance Monitoring Unit) support

Howdy!

I have geeked out on a new piece of hardware :-)
This time it is the Performance Monitoring Unit built into the CPU. This is a hardware capability to track fine-grained events inside the processor and give visibility into things like cache misses, branch mispredictions, and utilization of internal CPU resources.

Turns out that you only need two special CPU instructions to drive this - WRMSR to set up a counter, RDPMC to read it - and a simple but interesting benchmarking tool is only 500 lines of code. This was also a good opportunity to use our new ability to write Lua code that generates machine code at runtime.
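For the curious, the WRMSR half mostly amounts to writing an event-select MSR whose bit fields pick the event to count. Here is a rough sketch of that encoding (field positions per the Intel SDM; this is illustrative Python, not Snabb's actual Lua code):

```python
# Sketch of the IA32_PERFEVTSELx bit layout programmed via WRMSR.
# Field positions follow the Intel SDM; illustrative only.

def perfevtsel(event, umask, usr=True, os=True, enable=True):
    """Build the 64-bit value written to an IA32_PERFEVTSELx MSR."""
    value = (event & 0xFF)            # bits 0-7:  event select code
    value |= (umask & 0xFF) << 8      # bits 8-15: unit mask
    value |= int(usr) << 16           # bit 16:    count in user mode
    value |= int(os) << 17            # bit 17:    count in kernel mode
    value |= int(enable) << 22        # bit 22:    enable the counter
    return value

# Architectural "instructions retired" event: code 0xC0, umask 0x00.
print(hex(perfevtsel(0xC0, 0x00)))   # -> 0x4300c0
```

Once a counter is programmed this way, RDPMC reads it back with its counter index in ECX.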
I have merged the basic support onto the 'next' branch now. I reckon it will take some experimentation to see how to use it effectively. I hope that we will be able to build it into the engine and collect detailed metrics separately for each app, both during benchmarking runs and in production. Have to see.

Here is a quick demo of how it works:
* Run with CPU affinity locked to one core (error otherwise):

    $ sudo taskset -c 0 ./snabb snsh -i
    Snabb>
* Load lib.pmu, which will auto-detect the available counters for your CPU:

    Snabb> pmu = require("lib.pmu")
* Measure an (interpreted) empty function to get an idea of the overheads:

    Snabb> pmu.profile(function() end)
    EVENT        TOTAL
    instructions   850
    cycles       1,282
    ref-cycles   2,568
* Measure a loop and count branches + mispredictions:

    Snabb> pmu.profile(function() for i = 1, 1000000 do end end, {"retired.*all_branches$"})
    EVENT                             TOTAL
    instructions                  3,015,391
    cycles                        1,037,590
    ref-cycles                    2,075,184
    br_inst_retired.all_branches  1,002,138
    br_misp_retired.all_branches        366
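To put those two branch counters in proportion, the misprediction rate works out to a tiny fraction of retired branches (quick arithmetic on the totals above, sketched in Python):

```python
# Misprediction rate from the branch counters in the profile above.
branches = 1_002_138    # br_inst_retired.all_branches
mispredicts = 366       # br_misp_retired.all_branches

rate = mispredicts / branches
print(f"{rate:.4%}")    # well under 0.1% of branches mispredicted
```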
* Measure the loop again and include an "auxiliary" row to see ratios:

    Snabb> pmu.profile(function() for i = 1, 1000000 do end end, {"retired.*all_branches$"}, {loop=1000000})
    EVENT                             TOTAL   /loop
    instructions                  3,095,286   3.095
    cycles                        1,281,695   1.282
    ref-cycles                    2,563,392   2.563
    br_inst_retired.all_branches  1,002,142   1.002
    br_misp_retired.all_branches        365   0.000
    loop                          1,000,000   1.000
Here we see that each iteration of the loop takes approximately one cycle and three instructions, and that branches are seldom mispredicted.
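The auxiliary column is nothing magical: each event total is simply divided by the auxiliary count. Recomputing it from the run above (in Python, just to show the arithmetic):

```python
# Reproduce the "/loop" ratio column from the profile above:
# each event total divided by the auxiliary counter (loop = 1,000,000).

totals = {
    "instructions":                 3_095_286,
    "cycles":                       1_281_695,
    "br_misp_retired.all_branches":       365,
}
loop = 1_000_000

for event, total in totals.items():
    print(f"{event:30} {total / loop:8.3f}")
# cycles comes out at ~1.282 per iteration, matching the table.
```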
* Compare checksum routines (generic vs SSE2 vs AVX2):

Here is a script that compares the checksum routines and shows their cache behavior in this particular benchmark:
    profiling: generic
    EVENT                                TOTAL   /byte  /packet
    instructions                   420,255,210   2.918  4202.552
    cycles                         429,483,566   2.983  4294.836
    ref-cycles                     429,483,552   2.983  4294.836
    mem_load_uops_retired.l1_hit   271,618,029   1.886  2716.180
    mem_load_uops_retired.l2_hit           173   0.000     0.002
    mem_load_uops_retired.l3_hit           229   0.000     0.002
    mem_load_uops_retired.l3_miss           60   0.000     0.001
    byte                           144,000,000   1.000  1440.000
    packet                             100,000   0.001     1.000
    profiling: sse2
    EVENT                                TOTAL   /byte  /packet
    instructions                   116,882,789   0.812  1168.828
    cycles                          32,942,396   0.229   329.424
    ref-cycles                      32,942,400   0.229   329.424
    mem_load_uops_retired.l1_hit    11,508,798   0.080   115.088
    mem_load_uops_retired.l2_hit            65   0.000     0.001
    mem_load_uops_retired.l3_hit           157   0.000     0.002
    mem_load_uops_retired.l3_miss            1   0.000     0.000
    byte                           144,000,000   1.000  1440.000
    packet                             100,000   0.001     1.000
    profiling: avx2
    EVENT                                TOTAL   /byte  /packet
    instructions                    56,773,023   0.394   567.730
    cycles                          16,444,852   0.114   164.449
    ref-cycles                      16,444,848   0.114   164.448
    mem_load_uops_retired.l1_hit     6,910,824   0.048    69.108
    mem_load_uops_retired.l2_hit           107   0.000     0.001
    mem_load_uops_retired.l3_hit           100   0.000     0.001
    mem_load_uops_retired.l3_miss            0   0.000     0.000
    byte                           144,000,000   1.000  1440.000
    packet                             100,000   0.001     1.000
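Reading the cycles-per-byte column across the three runs gives the relative speedups directly (simple division on the figures above, sketched in Python):

```python
# Relative speedup of the SIMD checksum variants, using the
# cycles/byte figures from the three profiles above.

cycles_per_byte = {"generic": 2.983, "sse2": 0.229, "avx2": 0.114}

for name, cpb in cycles_per_byte.items():
    print(f"{name:8} {cycles_per_byte['generic'] / cpb:5.1f}x vs generic")
# sse2 works out to ~13x and avx2 to ~26x faster than the generic loop.
```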
So that is what we have.

I hope that we will be able to do very cool things with this. The basic unit that I would like to measure is an app. One cool idea would be to have a benchmarking environment that subjects a given app to various workloads (traffic mix, cache warm/cold) and generates a "data sheet" showing its expected performance. The other cool idea would be to measure with low enough overhead that we could track this in production and see how metrics like cycles/packet compare with our expectations.

End braindump!