sample as long as you want, ctrl-C when you've had enough can also --call-graph dwarf for more accurate stack traces but will take a lot more memory & storage
sudo perf record -e cycles,instructions -c 1000000 -a --call-graph fp
get the stacks
sudo perf script > stacks.out
run the following awk script on stacks.out
to split out the different hardware counter types
#!/usr/bin/awk -f
# Match lines containing hardware counter type, e.g., "cycles:" or "instructions:"
/[[:space:]]+[0-9]+[[:space:]]+[^:]+:/ {
# Extract the counter type, replace spaces with underscore, and use .txt as extension
counter_type = $NF
gsub(/:/, "", counter_type)
gsub(/[[:space:]]/, "_", counter_type)
output_file = counter_type ".txt"
}
{
# Print the current line to the respective output file
print $0 >> output_file
}
Now calculate the folded stack frames. difffolded.pl
usually is used for a differential flamegraph but here I'm just
using it because it conveniently places both counters next to the stack trace in a way that can be easily processed by
sort
and awk
FlameGraph/stackcollapse-perf.pl cycles.txt > cycles.folded
FlameGraph/stackcollapse-perf.pl instructions.txt > instructions.folded
FlameGraph/difffolded.pl cycles.folded instructions.folded > diff.folded
Finally in this case you will want to play with diff.folded
to get data you want. probably you will
want a flamegraph from cycles.folded
and then use that to guide you to a stack trace of interest. In my case I just
wanted the stack traces associated with highest cputime, so I just sorted. Then to get "instructions per cycle" I divided
one event type by the other (but this could be done with any events not just these e.g. cache miss/hit ratio)
sort -k 2 -r -n -t " " diff.folded > diff.sorted
awk '{if ($2 == 0) $4 = 0; else $4 = $3 / $2} 1' diff.sorted > diff.cpi
now you can get the IPC rate not just for the whole program but broken down into very fine-grained codepaths.