@garybernhardt
Last active March 7, 2016 17:37
This tool is used to compare microbenchmarks across two versions of code. It's
paranoid about nulling out timing error, so the numbers should be meaningful.
It runs the benchmarks many times, scaling the iterations up if the benchmark
is extremely short, and it nulls out its own timing overhead while doing so. It
reports results graphically with a text interface in the terminal.

You first run it with --record, which generates a JSON dotfile with runtimes
for each of your benchmarks. Then you change the code and run again with
--compare, which re-runs and generates comparison plots between your recorded
and current times. In the example output, I did a --record on the master
branch, then switched to my scoring_redesign branch and did a --compare. In my
output,
three of the benchmarks' runtimes look to be unchanged; the other three got
significantly slower. As you can see at the bottom of the output, the entire
process takes about a second from hitting enter to having full output, so I can
get very rapid performance feedback.
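As a concrete example of that workflow (the branch names are the ones mentioned
above, and the script name is taken from the timing line at the bottom of the
output):

git checkout master
ruby benchmark.rb --record     # record each benchmark's runtimes to the JSON dotfile
git checkout scoring_redesign
ruby benchmark.rb --compare    # re-run and plot recorded vs. current runtimes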
It goes to great lengths to null out both constant timing offsets and timing
jitter. Specifically:
* All benchmarks are run 16 times, with all of those data points being used in
  the output (not just the average, median, or minimum).
* They're run with and without garbage collection, with both being reported.
* The cost of the benchmark runner itself is nulled out in a way that should be
  very accurate. It times the user-supplied block, then repeats the timing with
  an empty block, subtracting the latter from the former.
* It guarantees a minimum test runtime for precise timing. First, it runs the
  block directly. If it takes less than 1ms, it's deemed unsafe and the process
  is restarted, but now the block will be run twice. If it's still under 1ms,
  that will be doubled to four times, etc., until it takes at least 1ms. This
  is what the "!"s are in the output: each of those is the benchmark being
  restarted with double the number of repetitions. The 1ms limit is checked
  after correcting for test runner overhead as described above, so the
  user-provided benchmark block itself is guaranteed to use at least 1ms of
  actual CPU time (see the sketch after this list).
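Here is a rough, hypothetical sketch of that correction-and-doubling scheme;
none of this is the tool's actual code, and all names are made up:

MIN_RUNTIME = 0.001  # the 1ms floor described above

def time_block(repetitions, &block)
  start = Time.now
  repetitions.times(&block)
  Time.now - start
end

def corrected_runtime(&block)
  repetitions = 1
  loop do
    raw       = time_block(repetitions, &block)  # user-supplied block
    overhead  = time_block(repetitions) { }      # empty block, same repetition count
    corrected = raw - overhead                   # null out the runner's own cost
    return corrected if corrected >= MIN_RUNTIME
    repetitions *= 2                             # each doubling shows up as a "!" in the output
  end
end

The real tool additionally repeats all of this 16 times and runs it both with
and without garbage collection (e.g. around GC.disable/GC.enable), which the
sketch omits.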
In the plots, "X" is the median of the 16 iterations. The horizontal lines
extend to the minimum and maximum values. This gives you a clear visualization
of timing jitter. Each plot's range goes from 0 to the maximum sampled runtime
to avoid confusing effects of tightly-zoomed plots. If the lines don't overlap,
you can have good confidence that there's a real performance difference. In the
second through fourth benchmarks below, you can see a big difference: I've made
them much slower by changing Selecta's scoring algorithm. (This was my actual
motivation for creating the tool. It's not a synthetic example.) I haven't
wanted fine detail so far, so the text plots have been sufficient. I may add
numbers eventually.
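For illustration only, a single plot row could be rendered from the 16 samples
roughly like this (a hypothetical sketch, not the tool's drawing code):

def plot_row(samples, axis_max, width = 40)
  scale  = ->(seconds) { ((seconds / axis_max) * (width - 1)).round }
  sorted = samples.sort
  row = Array.new(width, " ")
  (scale.(sorted.first)..scale.(sorted.last)).each { |i| row[i] = "-" }  # min-to-max line
  row[scale.(sorted[sorted.size / 2])] = "X"                             # median marker
  "|#{row.join}|"                                  # the axis always runs from 0 to axis_max
end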
This tool is Ruby-specific, but its design certainly isn't. This is the
benchmark tool that I've wanted for years, since before I even knew Ruby. It 1)
gives me the bare minimum statistics needed for confidence; 2) nulls out every
source of timing offset or jitter that I know of and can control; and 3)
automatically scales the repetitions up to get significant samples.

# I haven't thought much about the API yet.
# bench_group is like RSpec's "describe"; bench is like "it".
# Before and after blocks are supported for setup and teardown (not used here).
bench_group "filtering" do
  bench "non-matching" do
    Score.score("x" * 16, "y" * 16)
  end

  bench "matching exactly" do
    Score.score("x" * 16, "x" * 16)
  end

  bench "matching broken up" do
    Score.score("xy" * 20, "x" * 10)
  end

  bench "overlapping matches" do
    Score.score("x" * 40, "x" * 10)
  end

  bench "words, non-matching" do
    WORDS[0,1000].each { |choice| Score.score(choice, "x" * 16) }
  end

  bench "words, matching" do
    WORDS[0,1000].each { |choice| Score.score(choice, WORDS[500]) }
  end
end
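The bench_group/bench calls above are all the gist shows of the API. Purely as
a hypothetical illustration, such a DSL could register its blocks along these
lines; none of these internals appear in the gist:

BENCHMARKS = []

def bench_group(group_name, &body)
  @current_group = group_name
  body.call
end

def bench(name, &block)
  BENCHMARKS << ["#{@current_group} #{name}", block]  # e.g. "filtering non-matching"
end

A runner would then walk BENCHMARKS, timing each block as sketched earlier.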
filtering non-matching !!!!!!!!................!!!!!!!!................
filtering matching exactly !!................!!................
filtering matching broken up ................................
filtering overlapping matches ................................
filtering words, non-matching ................................
filtering words, matching ................................
filtering non-matching
Before (GC): | -----X-------- |
Before (No GC): | ---X--------- |
After (GC): | -----X----------- |
After (No GC): | --X----------------------------|
0 0.0105
filtering matching exactly
Before (GC): | X--- |
Before (No GC): | X-- |
After (GC): | -----X---------------------- |
After (No GC): | ----X-----------------------------|
0 0.715
filtering matching broken up
Before (GC): |X- |
Before (No GC): |X- |
After (GC): | --X--- |
After (No GC): | ---X---------------------------------|
0 2.31
filtering overlapping matches
Before (GC): |X- |
Before (No GC): |X- |
After (GC): | ---X---------------------------------|
After (No GC): | ----X----------------------- |
0 6
filtering words, non-matching
Before (GC): | ---X---------------- |
Before (No GC): | ---X---- |
After (GC): | --------X--------------------------------------|
After (No GC): | ---X-------------- |
0 20
filtering words, matching
Before (GC): | -X------------------------------------------- |
Before (No GC): | --X--------------------- |
After (GC): | ---X-----------------------------------------|
After (No GC): | -X-------- |
0 20.1
ruby benchmark.rb --compare 1.03s user 0.08s system 97% cpu 1.136 total
@booch

booch commented May 20, 2014

Awesome. Only complaint is missing units on graphs.

Would also be nice to specify A/B variants within 'bench' blocks:

bench 'matching' do
  variant('using strings') { Score.score("x", "x") }
  variant('using regexes') { Score.score_with_regex("x", "x") }
end

But admittedly, that's a very different use case. I'm just excited about the whole world of possibilities your idea opens up.

@garybernhardt
Author

Is there really a reason to support variants? You can just record the current version, implement the variant (changing the current implementation rather than duplicating it), and benchmark. There's no need to even commit if it's slower; just throw it away. I've already done this a few times this morning!

@booch

booch commented May 21, 2014

Interesting thought from @garybernhardt (via Twitter) on not needing units on the graphs: "Units don't really matter for relative comparisons." Makes a lot of sense. And seeing as how we don't show how many runs it has gone through, the units wouldn't be that helpful anyway.

But it might be useful to get a ballpark on how fast my algorithms are running, so I don't end up picking the best of 2 really slow algorithms.

Another issue with not having units is that it's kind of unclear whether these are time intervals, or number of runs per time interval.

@booch

booch commented May 21, 2014

I can think of a use case for having both variants in the code next to each other -- sharing the code with other people in an easier way than multiple branches. For example, in a gist. ;)

@garybernhardt
Author

I thought about units more after we exchanged those tweets and had the same realization: there's an ambiguity between time and IPS. I'll probably switch this tool to IPS, which seems more intuitive (higher numbers are better). Adding " IPS" to the values is trivial, of course.

But for a gist, you can just show both forms of the code followed by the output of the comparison. That seems to give all of the information present with a multi-variant syntax, but without needing an extra feature.

@booch

booch commented May 22, 2014

I see your point about the gists. But that's a bit of work for the person reading the gist to make a comparison. And even more if there are a bunch of variants. (For example, various sorting algorithms.) Before/after (record/compare) doesn't handle more than 2 choices well. But like I said, this is a different use case, and it might not be one that you care about right now.

Anyway, I'm glad I was able to get you to think more about how it works. I can't believe more people aren't excited about this enough to be commenting. I guess you weren't controversial enough in the Tweet announcing it. Recommendation for the 1.0 announcement: "Benchmarking is Dead".
