@regehr
Last active Apr 13, 2016

implementation sketch for faster integer overflow checking in clang/llvm

I want to speed up integer overflow checking for hardening purposes by keeping a sticky overflow flag and only trapping when necessary. I want to keep it super simple while hopefully giving the optimizers room to do their thing.

In the codegen part of clang:

  • each function gets an i1 for storing overflow information, initialized to 0
  • each integer overflow check ORs its result into the overflow flag
  • before each function call, return instruction, or other side-effecting operation, execute ud2 if the overflow flag is set
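The scheme above can be sketched in source-level C using Clang/GCC's __builtin_add_overflow and __builtin_trap (the real implementation would live in clang's IR generation, not in user code; the function name and structure here are just illustrative):

```c
#include <stdbool.h>

/* Source-level sketch of the sticky-flag scheme: one i1 per function,
   each overflow check ORed in, a single trap before the side-effecting
   return. __builtin_trap lowers to ud2 on x86. */
int sum3(int a, int b, int c) {
    bool overflowed = false;  /* the per-function i1, initialized to 0 */
    int t, r;
    overflowed |= __builtin_add_overflow(a, b, &t);  /* OR each check in */
    overflowed |= __builtin_add_overflow(t, c, &r);
    if (overflowed)           /* trap only at the side-effecting point */
        __builtin_trap();
    return r;
}
```

The hope is that deferring the trap to one point per function gives the optimizers freedom to reorder and combine the checks in between.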


There is one primary problem I see here, at least on x86: you have to put that i1 somewhere, and that will be too expensive.

The thing is, we have the overflow bit in a flag that gets set automatically. But getting that bit out of a flag register and stored somewhere else will be dramatically more expensive than branching to a trap instruction. Here is an example for Haswell using the Intel Architecture Code Analyzer:

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
|   1    |           | 0.5 |           |           |     | 0.5 |     |     |    | add rax, 0x4
|   3^   | 0.5       |     | 0.3       | 0.3       | 1.0 |     | 0.5 | 0.3 | CP | seto byte ptr [ebp+0x10]

This assumes there is some magical memory at a constant offset from %ebp (I didn't spend long enough to set up a real use of, say, the red zone on x86-64, but it seems representative). Look at the number of micro-ops: 3. At least one of the three gets micro-op fusion applied (the ^ indicates that). But still, this is crazy expensive.

We could make it less expensive by using a register operand, but now we have to occupy an entire register for this! Not good.

But look at how fast branching to some ud2 instruction elsewhere is:

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
|   1    | 0.5       |     |           |           |     |     | 0.5 |     | CP | add rax, 0x4
|   0F   |           |     |           |           |     |     |     |     |    | jo 0xb

The branch gets completely fused away. x86 processors at least are just ridiculously good at this.

What might be useful is a CPU feature providing a saturating overflow flag register. This would let you dramatically reduce the number of branches, and it seems very easy to implement in hardware. We should start lobbying Intel to get this.

@chandlerc A saturating flag register would cause a partial register dependency. I believe it would still be slower than branch-on-overflow.


regehr commented Mar 25, 2016

I agree about the sticky flag. I wrote about something like this earlier, but probably phrased it badly by asking for traps instead of the sticky flag.


regehr commented Mar 25, 2016

@nadavrot do branches not interfere with superscalar execution? I'm not good at this stuff.

@nadavrot You can easily avoid register dependencies. But you're still going to be burning registers extracting and OR-ing these things together.

@regehr Yes, branches do interfere with some things, but very minimally when they are perfectly predicted branches to ud2. The worst impacts I'm aware of are mild pressure on the branch prediction tables (but very mild) and blowing out the number of branches supported by the loop stream detector. These will hurt, but they will hurt much less than the alternatives.

@regehr In theory there are limited resources in the branch prediction tables. However, my experience with Swift and JavaScript was that it was never a problem. I don't think that non-taken-overflow-branches are inserted into the branch prediction tables.

There is also the cost of fetching, decoding, and executing the jo instruction, but as @chandlerc pointed out, the alternative is worse.

@nadavrot I really liked a recent AMD branch predictor which they actually bothered to document -- from memory each branch would perfectly predict not-taken until taken, then predict taken until not-taken, and only then start using up the normal history tables. Really nice because branches like this (never taken) were reliably free in the prediction table.

But much like your experience with JS and Swift, I've never seen the prediction table's resources actually matter in practice. The closest are collisions in the table due to density of branches, and even that seems to happen much less these days. Yay for modern branch predictors, makes many things easier.

sanjoy commented Mar 25, 2016

This scheme (saturating has_overflowed bit) will also make
optimizing overflow checks away more difficult. I.e. in

i32 a = ...;
if ((a + 2) sign overflowed) set bit;
if ((a + 1) sign overflowed) set bit;

you won't be able to optimize away the second check just by doing
control flow analysis (you'll have to rely on the semantics of set bit), but in

i32 a = ...;
if ((a + 2) sign overflowed) abort();
if ((a + 1) sign overflowed) abort();

you can.
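Sanjoy's point can be made concrete in C (hypothetical functions, using __builtin_add_overflow to stand in for the IR-level checks):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Trap-immediately variant: reaching the second check proves the
   first add did not overflow, i.e. a <= INT_MAX - 2 on the positive
   side, so a + 1 cannot overflow either -- the second check is dead
   by pure control-flow reasoning. */
int checked_trap(int a) {
    int t, r;
    if (__builtin_add_overflow(a, 2, &t)) abort();
    if (__builtin_add_overflow(a, 1, &r)) abort();  /* provably dead */
    return r;
}

/* Sticky-flag variant: both checks execute unconditionally, so
   eliminating the second one requires reasoning about the semantics
   of the ORed-in bit, not just dominance in the CFG. */
int checked_sticky(int a) {
    bool ov = false;
    int t, r;
    ov |= __builtin_add_overflow(a, 2, &t);
    ov |= __builtin_add_overflow(a, 1, &r);
    if (ov) abort();
    return r;
}
```

Both functions compute the same result; the difference is purely in how much work the optimizer must do to delete the redundant check.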


regehr commented Mar 25, 2016

Ok, I'm convinced, thanks guys, I'll start reading the passes relevant for check removal.
