golang performance & benchmarking notes

(This is not a particularly complete nor finished doc, but I'll share it as a gist anyway.)




Measuring Performance in Golang

basic benchmarking

The golang tooling (and the docs for it!) are really good at this!

See https://golang.org/pkg/testing/#hdr-Benchmarks ...
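For reference, a minimal benchmark file looks roughly like this (the Sum function is just an illustrative stand-in for whatever you actually want to measure):

```go
package foobar

import "testing"

// Sum is an illustrative stand-in for the code under test.
func Sum(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func BenchmarkSum(b *testing.B) {
	xs := make([]int, 1024) // setup work happens before the timed loop
	b.ResetTimer()          // don't count the setup in the measurement
	for i := 0; i < b.N; i++ {
		Sum(xs)
	}
}
```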

Put those examples in a file called foobar_test.go, and run it like this:

go test -bench=. ./...

Tada! Nanoseconds! You are now ready to begin optimizing.

basic benchmarking (with alloc counting)

go test -bench=. -benchmem ./...

Looking at allocation counts can be very useful when optimizing code. The count of allocations can be very impactful to overall performance: more allocations means more time spent allocating, and more time spent in GC later on.

This flag also causes allocation size to be reported. However, this is often less directly impactful if you're worried about speed.
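(You can also opt a single benchmark into allocation reporting regardless of the flag by calling b.ReportAllocs() inside it; a minimal sketch:)

```go
func BenchmarkEncode(b *testing.B) {
	b.ReportAllocs() // reports allocs/op and B/op for this benchmark even without -benchmem
	for i := 0; i < b.N; i++ {
		// ... the code you want to measure ...
	}
}
```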

benchmarking without GC

(Advanced technique. You probably don't need to do this until you've exhausted other simpler explorations.)

This is useful to get estimates of how much wallclock time you're losing to GC.

It's also sometimes interesting to sanity check if one test is generating a lot of garbage in such a way it adds time to its neighbors. (It "shouldn't" happen, since the GC puts some effort into being "fair"... but it still happens.)

GOGC=off go test -bench=. -benchmem ./...
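(If you want the same effect from inside the test binary rather than from the environment, runtime/debug.SetGCPercent(-1) disables the collector; a minimal sketch:)

```go
package foobar

import (
	"runtime/debug"
	"testing"
)

func BenchmarkWithoutGC(b *testing.B) {
	// Disable the GC for this benchmark, and restore the previous setting afterwards.
	defer debug.SetGCPercent(debug.SetGCPercent(-1))
	for i := 0; i < b.N; i++ {
		// ... the code you want to measure ...
	}
}
```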

asking the escape analysis what it thinks (and why)

(Advanced technique. You probably don't need to do this until you've exhausted other simpler explorations.)

go test -gcflags "-m -m" -bench=. -benchmem ./...

Warning: this is loud. You will almost certainly have to learn how to grep through the outputs. You can also try to use more targeted go build commands to get closer to your point of interest.

Also tells you what's inlined. (These flags basically ask the compiler "what are you feeling", but these two topics are what it's mostly chatty about.)

In general this can be tricky to use effectively unless you have questions about a fairly specific line number. It's hard to overstate how chatty it is. It may generate more lines of output than there are lines of source code in your program. And most "escapes to heap" notices aren't... necessarily a problem: if the escape warning regards a value that, in context, is already known to be on the heap, there's no actual cost at that site.
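As a tiny illustration of what the output is talking about: given something like the following, the -m output will report that the composite literal in newPointPtr escapes to the heap (a pointer to it outlives the call), while newPointVal can normally stay on the stack. (Illustrative sketch; the exact messages vary by compiler version.)

```go
package foobar

type point struct{ x, y int }

// The returned pointer outlives the call, so the value must be heap-allocated;
// -m will say something along the lines of "&point{...} escapes to heap".
func newPointPtr(x, y int) *point {
	return &point{x, y}
}

// Returned by value; escape analysis can normally keep this on the stack.
func newPointVal(x, y int) point {
	return point{x, y}
}
```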

getting time measurements in a pprof

go test -bench=. -benchmem -benchtime=10000x -cpuprofile=cpu.pprof ./specific/package
go tool pprof -http=:9090 cpu.pprof

pprof is your friend when it's time to get detailed observations of what your program is actually doing.

There are ways to run pprof on live applications, but I also like to write benchmarks, and benchmarks also turn out to be a great thing to extract pprof data from!
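(For the live-application route, the usual entry point is the standard library's net/http/pprof package; a minimal sketch, with the port being an arbitrary choice:)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Profiles are then fetchable at e.g. http://localhost:6060/debug/pprof/profile
	// and can be handed to `go tool pprof` just like the files from benchmarks.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```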

You can use go tool pprof in many ways... but in my experience, the interactive browser mode is by far the easiest to navigate quickly, and it drills down interactively very nicely, so I'd start there. The -http flag you see in the command above is what starts the interactive browser mode.

Poke around. You can probably figure out what most of this means just by looking around.

Don't forget to check out all the menus at the top left -- there are various views you can access from there. Flamegraphs and the various other output formats all have their uses, and you'll either discover them over time with practical experience, or you can use your favored search engine to find more blogs about them; I won't cover them in detail here.

One big thing to know about the cpu graphs: if you see functions that look like garbage collection showing up as a large fraction of your time... it's probably time to start looking at allocations and memory pprof instead.

(I also often use a fixed count of benchmark iterations when gathering pprof files, because this keeps things consistently countable between runs, and makes it easier for me as a human to build up diffs in my head. The -benchtime=10000x incantation is what fixes the iteration count. It's not perfect -- the pprof system is still sampling, so there's significant randomness to results -- but it should keep the various tests in a ratio, at least. You may want to use a different constant -- but note that 1x usually won't work, because sampling might then miss your function entirely.)

getting alloc counts in a pprof

go test -bench=. -benchmem -benchtime=10000x -memprofile=mem.pprof ./specific/package
go tool pprof -http=:9090 mem.pprof

Use the 'Sampling' menu to switch to alloc counts. That's the view you'll probably end up using the most.

As a general rule, programs with a lower overall allocation count will be faster. Getting a bunch of small allocations amortized into one big one can very often be a significant win; this view might help you identify places where you might want to do that.

Be aware that allocation counts from sampling can vary in more ways than you'd like. Allocations that take longer can end up with a proportionally higher reported count. This is super unhelpful. It's a fact of life, though. (To see it: watch for unequal counts on two allocations that sit unconditionally next to each other, where one is larger, for example.) The result of this is that you cannot rely on allocation counts (even roughly!) to be a proxy for invocation counts or ratios.
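(If the sampling is getting in your way, you can make the profile record every allocation by setting runtime.MemProfileRate to 1 -- the -memprofilerate=1 test flag does the same thing. Runs get slower, but the counts become exact. A minimal sketch:)

```go
package foobar

import "runtime"

func init() {
	// Record every allocation in the memory profile instead of sampling.
	// Slower, but the reported counts are exact rather than estimates.
	runtime.MemProfileRate = 1
}
```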

getting pprof data without benchmarks

All of the data gathering and creation of pprof files can be done without benchmarks too.

Check out the pprof.WriteHeapProfile(some_file) and similar functions from the runtime/pprof package.
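A minimal sketch of doing it by hand (file names are arbitrary, and doInterestingWork is an illustrative stand-in):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// CPU profile: samples from StartCPUProfile until StopCPUProfile.
	cpuFile, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(cpuFile); err != nil {
		log.Fatal(err)
	}
	doInterestingWork() // the code you care about
	pprof.StopCPUProfile()
	cpuFile.Close()

	// Heap profile: a snapshot of heap allocation data at this point.
	memFile, err := os.Create("mem.pprof")
	if err != nil {
		log.Fatal(err)
	}
	if err := pprof.WriteHeapProfile(memFile); err != nil {
		log.Fatal(err)
	}
	memFile.Close()
}

func doInterestingWork() {
	// ...
}
```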

Writing Good Benchmarks

Decide: macro or micro

Macro benchmarks should cover whole systems and try to provide insight into which parts of the whole system account for the largest share of the time.

Micro benchmarks focus in on one small piece of code and are used to optimize that code as much as possible.

Neither is suitable for answering all kinds of questions. Do both as needed.

Misc things to watch out for

The compiler can get very clever sometimes. If you're writing a benchmark that's trending towards "microbenchmark", you have to watch out to make sure the thing you're trying to test doesn't end up elided by a clever optimization.

total elision due to effect freedom

It's especially common for the compiler to elide operations that assign memory that the compiler can prove is never used again.

The solution to this is to make a var sink interface{} definition at package scope, and assign to it in your benchmark method after the b.N loop.

It's necessary that the sink variable be at package scope, because this is where the go compiler will "give up" on trying to prove effects or the lack thereof.

It's necessary to do the sink = thingy assignment outside of the b.N loop, because if you do it inside the loop, you'll likely score a conversion-to-interface heap allocation, which you don't want to measure. (You can also avoid this by having a sink of a more specific type; or it might not show up as an effect because the relevant thingy has already escaped to heap; etc. But why waste the braincycles on it? Better safe and simple than sorry.)
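Putting that together, the pattern looks roughly like this (computeThingy is an illustrative stand-in for the code under test):

```go
package foobar

import "testing"

// sink lives at package scope so the compiler can't prove the result is unused.
var sink interface{}

// computeThingy is an illustrative stand-in for the code under test.
func computeThingy() int {
	total := 0
	for i := 0; i < 1000; i++ {
		total += i
	}
	return total
}

func BenchmarkComputeThingy(b *testing.B) {
	var result int
	for i := 0; i < b.N; i++ {
		result = computeThingy()
	}
	sink = result // assigned outside the loop, so the interface conversion isn't measured
}
```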

The bar on this has also moved through the go versions as the compiler gets cleverer! Sometimes you'll see code that does an assignment of _ = thingy outside of the b.N loop. This is an earlier generation of "fix" to the elision problem -- but it no longer works reliably. If you see this, you should probably update that benchmark to the sink technique and see if it changes the data you get.

How do you know if this affects you: well, frankly, it's sufficiently hard to reason about that, as a rule of thumb, your author pretty much just always uses a sink nowadays. (There's also no real downside to doing it.)

If you need to check, though: I'd recommend reading the disassembly for your benchmark function. Adding -gcflags -S to your go test command will cause assembly to be dumped. Find your benchmark function and see if it looks like it does roughly what you're expecting. (You can defocus your eyes and skim a bit here -- you don't need to verify that the compiler is correct, you just need to see if there's a massive hunk of stuff missing -- a much easier task.) This is also hard to automate, and hard to verify that it continues to do what you think across compiler versions -- so you see why I recommend just doing sink unconditionally.

Where is that alloc coming from?

Big picture info about where allocations and slowdowns are can be found through pprof and visualizations of that data. But what about when we want to drill down to more specific regions, and figure out why an allocation is occurring, or what's going on there that's slow?

We've got a couple options:

  • pprof source view can point us to line numbers
  • we can ask the compiler to dump facts about escape analysis
  • we can read assembler

This author's favorite is actually (surprisingly) that last one: read assembler. It's the easiest to script, which makes it the easiest to iterate on. It's also the view that gives you the most direct insight into "why" (you'll see what runtime call is actually causing the alloc). But all three work, and can provide useful insights at different scales.

pprof source view

The title honestly says it all on this one. In the pprof web view covered earlier: don't forget the menus in the top left. The source view can show you which lines have allocations attributed to them.

info from escape analysis

Use -gcflags "-m -m" to have the compiler tell you everything it knows about escape analysis.

Warning: This is loud! Very loud. It also provides other information not related to escape analysis: it's more of a general "debug" flag for the compiler overall. As a result, it's not my first choice of data... but it's sometimes useful.

If you really want to know why some statement caused an allocation to escape to the heap, this can provide a very full explanation. Just be ready to have to sift for it.

reading assembler

Use -gcflags -S to have any go command (it works on go install just as well as go test!) dump assembler at you.

This looks gnarly at a first glance -- but it's surprisingly easy to read once you get the hang of it.

  1. Grep for the string "STEXT" plus the function names that you want to look at in detail: this gets you to the assembly for that function. (There's an example command just after this list.)
  2. There's lots of stuff you'll learn to ignore: PCDATA can be pretty much discarded, for example.
  3. Conversely, there's a handful of things you can often jump to: CALL runtime., for example!
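For example, something along these lines pulls out one function's assembly (the function name here is illustrative; the 2>&1 is there so we catch the compiler output wherever it lands):

go test -gcflags -S ./specific/package 2>&1 | grep -A 100 'computeThingy STEXT'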

These runtime.* functions usually contain the final answer to any questions about "why is there an allocation" and other major mysteries about your code's performance. The exact function involved says a lot about "why" -- and the line number that the CALL instruction is attributed to tells you what source-level construct caused this. Usually this answers the question!

TODO: examples of runtime.* functions you're likely to see, and a short explainer.

You can also get this assembler from other places (you'll find a "disassemble" view in the pprof tools, for example). They don't look exactly the same, though -- they're from different parts of the pipeline, and some of this "assembly" is closer to the final machine-specific stuff than others, and some of it has left behind more semantics than others. I recommend the -S system because it's from somewhat earlier in the pipeline, and still consistently has line numbers attached to it.

tools wishlists

The following is a short list of tools I wish I had, and ideas I think might be helpful to systematize. I haven't written them, but if someone wants to, I'd be thrilled.

impactof

It would be neat to have a tool that tells me what the impact of e.g. inlining is, or of GOGC=off, etc.

This would just require running go test -bench=. a couple of times, then using benchcmp on the results.
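Done by hand, that workflow looks roughly like this (a sketch; benchstat from golang.org/x/perf is the newer equivalent of benchcmp):

go test -bench=. ./... > old.txt
go test -bench=. -gcflags '-l' ./... > new.txt
benchcmp old.txt new.txt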

Some things I'd like to have it check for me:

  • What's the impact of inlining? (Run with -gcflags '-l'.)
    • (If you're expecting inlining to be a big issue, this could give you a quick answer.)
    • (One might also want to run this twice on either side of a change, to see if the impact of inlining increases when you try to make it so.)
  • What's the impact of GOGC? (Run with GOGC=off in the env.)
    • (Frequently useful simply to see if garbage collection time is a serious part of your issue.)
  • What's the impact of disabling all optimizations? (Run with -gcflags '-N'.)
    • (Probably more of an academic interest -- not very actionable itself -- but easy enough to ask once you've got a framework set up for this.)
  • What's the impact of more inlining? (Run with -gcflags='-l -l'.)
    • (Probably more of an academic interest -- but easy enough to ask once you've got a framework set up for this.)