Skip to content

Instantly share code, notes, and snippets.

@vext01
Last active September 22, 2021 14:16
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save vext01/c7a95428fad7cab58201636116431055 to your computer and use it in GitHub Desktop.
Save vext01/c7a95428fad7cab58201636116431055 to your computer and use it in GitHub Desktop.
Blocking Optimsations
TODO: convert to markdown.
Consider the scenario where we want to benchmark some code:
```
start_time = time();
bench_output = run_bench(bench_inputs);
result = time() - start_time;
```
The problem is that since there is no dependency between `time()` and the data
being passed in and out of `run_bench()`, the compiler may try to re-order the
code to this:
```
bench_output = run_bench(bench_inputs);
start_time = time();
result = time() - start_time;
```
Or this:
```
start_time = time();
result = time() - start_time;
bench_output = run_bench(bench_inputs);
```
Or even this:
```
start_time = time();
result = time() - start_time;
```
(since we didn't do anything side-effecting with `bench_output`)
There are therefore two problems. We need to:
1) Stop the benchmark disappearing altogether.
2) Ensure the benchmark stays anchored between the two calls to `time()`.
There have been many historical incarnations of inline assembler code which
could be used to tackle this.
Let's look at how Google does this when benchmarking C++ code [gbench].
First, there's `ClobberMemory()` [clobber]. It's definition (when using clang) is:
```
inline BENCHMARK_ALWAYS_INLINE void ClobberMemory() {
asm volatile("" : : : "memory");
}
```
What does this do? Well first, let's remind ourself of GNU extended inline asm
syntax [asm]:
```
asm asm-qualifiers ( AssemblerTemplate
: OutputOperands
[ : InputOperands
[ : Clobbers ] ])
```
First it's `volatile` [volatile]. This means that the compiler may not
"optimise" the contents of the asm block or remove it altogether. It must
include it verbatim. In this example it prevents the obviously empty asm block
from being deleted (`AssemblerTemplate` is empty).
Second, it clobbers memory [asm-clobber]. This means that the compiler must
assume that each and every address in the virtual address space is *both* read
and mutated when the asm block is executed. Of course in reality no memory
address is read or mutated, but the compiler now doesn't know that.
How could this help? Let's sprinkle some clobbers into our benchmark harness:
```
1 start_time = time();
2 ClobberMemory();
3 bench_output = run_bench(bench_inputs);
4 ClobberMemory();
5 result = time() - start_time;
```
Assuming that `bench_inputs` is stored in memory (i.e. stack allocated, not in
a register), the compiler must now assume that the first `ClobberMemory()` may
mutate it. And since `run_bench(bench_inputs)` is expected to observe this
"mutation", this places an ordering on the computations. Line 2 must come
before line 1. The compiler may not reorder those.
Similarly, assuming `start_time` is stack allocated, since the first call to
`ClobberMemory()` may read `start_time`, this places an ordering on the
computation of `start_time` and the call to `ClobberMemory()` on line 2. The
compiler may not reorder lines 1 and 2.
So what we've learned from the last two paragraphs is that, assuming that
`start_time` and `bench_inputs` are in memory, the computations on lines 1, 2
and 3 must appear precisely in that order. In other words, the benchmark may
not be moved up beyond the call to `time()`.
We apply the same reasoning to understand why the second `ClobberMemory()`
prevents and code moving below the second call to `time()`. In sum, these two
anchorings ensure the benchmark remains inside the critical section.
XXX: The above argument relies on the idea that clobbering memory both writes
*and reads* to/from every memory address. The GNU asm docs confirms this
[clobber]. Without this condition I don't see how line 3 is anchored below line
1 in the example. There is an ongoing discussion on a stackoverflow thread
where someone asked (independently) exactly this [so-comment], but there is no
mention of clobber *reading* all memory. Not clear...
But wait. This assumes our variables are stored in memory. What if they are
promoted into registers? No amount of memory clobbering can fool the optimiser
in this case.
This leads us to `DoNotOptimise` [donotopt] which is defined like this (for
clang):
```
inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp& value) {
asm volatile("" : "+r,m"(value) : : "memory");
}
```
What's different? First it takes a value by (C++) reference. This means that a
pointer to `value` is passed. You can't make a pointer to a register, so this
forces `value` to be stack allocated. By kicking the value to the stack we now
ensure that memory clobbering *will* affect the value.
The `+r,m` output constraint is new:
- `r,m` tells the compiler that the pointer itself may live in memory or in a
register and that the pointer itself (not the memory it points to) may be
mutated.
- the `+` tells the compiler that the pointer itself may also be read.
Note that `DoNotOptimise()` still clobbers memory.
So our benchmark is better written as:
```
start_time = time();
DoNotOptimise(bench_inputs);
bench_output = run_bench(bench_inputs);
DoNotOptimise(bench_output);
result = time() - start_time;
```
XXX: In this example it's not clear to me why `run_bench()` cannot be put
before the first call to `time()`. Assume `start_time` is allocated in a
register. Then `DoNotOptimise()` clobbers `bench_inputs`, what it points to,
and all other memory, but *not* start_time... I have a feeling this relates to
my first XXX.
# When to use DoNotOptimise and when to use `ClobberMemory`?
XXX TODO
References:
[so-comment] https://stackoverflow.com/questions/37786547/enforcing-statement-order-in-c/56865717#comment71928341_38025837
[clobber] https://github.com/google/benchmark/blob/e451e50e9b8af453f076dec10bd6890847f1624e/include/benchmark/benchmark.h#L358-L368
[donotopt] https://github.com/google/benchmark/blob/e451e50e9b8af453f076dec10bd6890847f1624e/include/benchmark/benchmark.h#L349-L356
[gbench] https://github.com/google/benchmark/blob/e451e50e9b8af453f076dec10bd6890847f1624e/include/benchmark/benchmark.h#L339-L368
[asm] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
[out] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#OutputOperands
[volatile] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile
[asm-clobber] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers
Further reading:
https://www.youtube.com/watch?v=nXaxk27zwlk
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment