Created August 19, 2015 17:46
CPU PMU (Performance Monitoring Unit) support

Howdy!

I have geeked out on a new piece of hardware :-)
This time it is the Performance Monitoring Unit built into the CPU. This is a hardware capability to track fine-grained events inside the processor and give visibility into things like cache misses, branch mispredictions, and utilization of internal CPU resources.

Turns out that you only need two special CPU instructions to drive this - WRMSR to set up a counter, RDPMC to read it - and a simple but interesting benchmarking tool is only 500 lines of code. This was also a good opportunity to use our new ability to write Lua code that generates machine code at runtime.
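For the curious, the WRMSR half mostly amounts to writing an event-select MSR whose bit fields pick the event to count. Here is a rough sketch of that encoding (field positions per the Intel SDM; this is illustrative Python, not Snabb's actual Lua code):

```python
# Sketch of the IA32_PERFEVTSELx bit layout programmed via WRMSR.
# Field positions follow the Intel SDM; illustrative only.

def perfevtsel(event, umask, usr=True, os=True, enable=True):
    """Build the 64-bit value written to an IA32_PERFEVTSELx MSR."""
    value = (event & 0xFF)            # bits 0-7:  event select code
    value |= (umask & 0xFF) << 8      # bits 8-15: unit mask
    value |= int(usr) << 16           # bit 16:    count in user mode
    value |= int(os) << 17            # bit 17:    count in kernel mode
    value |= int(enable) << 22        # bit 22:    enable the counter
    return value

# Architectural "instructions retired" event: code 0xC0, umask 0x00.
print(hex(perfevtsel(0xC0, 0x00)))   # -> 0x4300c0
```

Once a counter is programmed this way, RDPMC reads it back with its counter index in ECX.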
I have merged the basic support onto the 'next' branch now. I reckon it will take some experimentation to see how to use it effectively. I hope that we will be able to build it into the engine and collect detailed metrics separately for each app, both during benchmarking runs and in production. Have to see.

Here is a quick demo of how it works:
* Run with CPU affinity locked to one core (error otherwise):

    $ sudo taskset -c 0 ./snabb snsh -i
    Snabb>
* Load lib.pmu, which will auto-detect the available counters for your CPU:

    Snabb> pmu = require("lib.pmu")
* Measure an (interpreted) empty function to get an idea of the overheads:

    Snabb> pmu.profile(function() end)
    EVENT        TOTAL
    instructions   850
    cycles       1,282
    ref-cycles   2,568
* Measure a loop and count branches + mispredictions:

    Snabb> pmu.profile(function() for i = 1, 1000000 do end end, {"retired.*all_branches$"})
    EVENT                             TOTAL
    instructions                  3,015,391
    cycles                        1,037,590
    ref-cycles                    2,075,184
    br_inst_retired.all_branches  1,002,138
    br_misp_retired.all_branches        366
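To put those two branch counters in proportion, the misprediction rate works out to a tiny fraction of retired branches (quick arithmetic on the totals above, sketched in Python):

```python
# Misprediction rate from the branch counters in the profile above.
branches = 1_002_138    # br_inst_retired.all_branches
mispredicts = 366       # br_misp_retired.all_branches

rate = mispredicts / branches
print(f"{rate:.4%}")    # well under 0.1% of branches mispredicted
```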
* Measure the loop again and include an "auxiliary" row to see ratios:

    Snabb> pmu.profile(function() for i = 1, 1000000 do end end, {"retired.*all_branches$"}, {loop=1000000})
    EVENT                             TOTAL   /loop
    instructions                  3,095,286   3.095
    cycles                        1,281,695   1.282
    ref-cycles                    2,563,392   2.563
    br_inst_retired.all_branches  1,002,142   1.002
    br_misp_retired.all_branches        365   0.000
    loop                          1,000,000   1.000
Here we see that each iteration of the loop takes approximately one cycle and three instructions, and that branches are seldom mispredicted.
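The auxiliary column is nothing magical: each event total is simply divided by the auxiliary count. Recomputing it from the run above (in Python, just to show the arithmetic):

```python
# Reproduce the "/loop" ratio column from the profile above:
# each event total divided by the auxiliary counter (loop = 1,000,000).

totals = {
    "instructions":                 3_095_286,
    "cycles":                       1_281_695,
    "br_misp_retired.all_branches":       365,
}
loop = 1_000_000

for event, total in totals.items():
    print(f"{event:30} {total / loop:8.3f}")
# cycles comes out at ~1.282 per iteration, matching the table.
```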
* Compare checksum routines (generic vs SSE2 vs AVX2):

Here is a script that compares the checksum routines and shows their cache behavior in this particular benchmark:
    profiling: generic
    EVENT                                TOTAL   /byte  /packet
    instructions                   420,255,210   2.918  4202.552
    cycles                         429,483,566   2.983  4294.836
    ref-cycles                     429,483,552   2.983  4294.836
    mem_load_uops_retired.l1_hit   271,618,029   1.886  2716.180
    mem_load_uops_retired.l2_hit           173   0.000     0.002
    mem_load_uops_retired.l3_hit           229   0.000     0.002
    mem_load_uops_retired.l3_miss           60   0.000     0.001
    byte                           144,000,000   1.000  1440.000
    packet                             100,000   0.001     1.000
    profiling: sse2
    EVENT                                TOTAL   /byte  /packet
    instructions                   116,882,789   0.812  1168.828
    cycles                          32,942,396   0.229   329.424
    ref-cycles                      32,942,400   0.229   329.424
    mem_load_uops_retired.l1_hit    11,508,798   0.080   115.088
    mem_load_uops_retired.l2_hit            65   0.000     0.001
    mem_load_uops_retired.l3_hit           157   0.000     0.002
    mem_load_uops_retired.l3_miss            1   0.000     0.000
    byte                           144,000,000   1.000  1440.000
    packet                             100,000   0.001     1.000
    profiling: avx2
    EVENT                                TOTAL   /byte  /packet
    instructions                    56,773,023   0.394   567.730
    cycles                          16,444,852   0.114   164.449
    ref-cycles                      16,444,848   0.114   164.448
    mem_load_uops_retired.l1_hit     6,910,824   0.048    69.108
    mem_load_uops_retired.l2_hit           107   0.000     0.001
    mem_load_uops_retired.l3_hit           100   0.000     0.001
    mem_load_uops_retired.l3_miss            0   0.000     0.000
    byte                           144,000,000   1.000  1440.000
    packet                             100,000   0.001     1.000
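Reading the cycles-per-byte column across the three runs gives the relative speedups directly (simple division on the figures above, sketched in Python):

```python
# Relative speedup of the SIMD checksum variants, using the
# cycles/byte figures from the three profiles above.

cycles_per_byte = {"generic": 2.983, "sse2": 0.229, "avx2": 0.114}

for name, cpb in cycles_per_byte.items():
    print(f"{name:8} {cycles_per_byte['generic'] / cpb:5.1f}x vs generic")
# sse2 works out to ~13x and avx2 to ~26x faster than the generic loop.
```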
So that is what we have.

I hope that we will be able to do very cool things with this. The basic unit that I would like to measure is an app. One cool idea would be to have a benchmarking environment that subjects a given app to various workloads (traffic mix, cache warm/cold) and generates a "data sheet" showing its expected performance. The other cool idea would be to measure with low enough overhead that we could track this in production and see how metrics like cycles/packet compare with our expectations.

End braindump!