Create a gist now

Instantly share code, notes, and snippets.

What would you like to do?
A bit of background on compilers exploiting signed overflow
Why do compilers even bother with exploiting undefinedness signed overflow? And what are those
mysterious cases where it helps?
A lot of people (myself included) are against transforms that aggressively exploit undefined behavior, but
I think it's useful to know what compiler writers are accomplishing by this.
TL;DR: C doesn't work very well if int!=register width, but (for backwards compat) int is 32-bit on all
major 64-bit targets, and this causes quite hairy problems for code generation and optimization in some
fairly common cases. The signed overflow UB exploitation is an attempt to work around this.
To be more precise: the #1 transform that compilers really want to get out of this is to replace int32
loop counters with int64 loop counters in 64-bit code whenever possible, because int32s are bad news
for the purposes of memory access and address generation.
To show the actual problem, let's look at a simple loop:
int x[N]; // initialized elsewhere
sum = 0;
for (int i = 0; i < count; ++i)
sum += x[i];
which can be naively compiled (on x86-64) into something like:
; ecx = count
; rsi = points to x[]
xor eax, eax ; clear sum
test ecx, ecx
jng done ; don't even enter loop if count <= 0
xor ebx, ebx ; i
lp:
movsxd rdx, ebx ; * sign-extend i to 64 bits
add eax, [rsi+rdx*4] ; sum += x[i]
inc ebx ; ++i
cmp ebx, ecx
jl lp
done:
That MOVSXD (marked *) is indicative of the issue: we're on a 64-bit machine trying to use a 32-bit index to
refer to an array, and because addresses are 64-bit (and we could absolutely have an array larger than 4GB but
with less than 2 billion entries), we need to sign-extend that index to 64-bit before we can do our addressing
calculation.
That sign-extend is extra work in the inner loop that would not exist in the 32-bit version of this snippet.
It also causes other problems. For example, the 32-bit equivalent of the code above always accesses memory at
esi + ebx*4
and ebx is what's called an "induction variable" - it changes by a constant value in every loop iteration
(incrementing by 1 in this case). A 32-bit compiler can realize this and replace it with pointer arithmetic
(incrementing esi by 4 every iter, say). "esi" and "ebx" are both 32-bit values subject to 32-bit wraparound,
but since we're working modulo 2^32 anyway, this is not a problem.
A 64-bit compiler needs to work with the less friendly address
rsi + sext32to64(ebx)*4
with a sign extend in the middle (that "sext32to64(ebx)" is what the "movsxd rbx, edx" does). And in
particular, should ebx ever overflow from 0x7fffffff to -0x80000000, the resulting address will change
dramatically, by (4 - 2^(32+2)). In 32-bit mode (all addresses modulo 2^32), this is just an increase by 4
as usual; the "wraparound" part is a multiple of 2^32 and hence invisible. In 64-bit mode, it isn't; if the
compiler wants to turn this into address arithmetic, it needs to either:
a) prove that "i" can never overflow/wrap around in this loop,
b) insert extra code that detects wrap-around of "i" and adjusts the addresses accordingly,
c) not use pointer arithmetic and recompute addresses every time (as in the example).
Option a) is relatively straightforward in this particular example, but it can get hairy, and my point is
that 32-bit compilers never need to worry about any of this to begin with (and neither do 64-bit compilers
when dealing with 64-bit integer indices).
And if the compiler *isn't* able to do these proofs, it can get pretty bad. For example, the loop above
is basically all loop overhead, and most compilers will end up unrolling (or vectorizing, but I'll
ignore that) it. Now we'd prefer to see an inner loop like: (I won't show the setup or tail loop here):
lp:
add eax, [rsi+0] ; sum += x[i+0]
add eax, [rsi+4] ; sum += x[i+1]
add eax, [rsi+8] ; sum += x[i+2]
add eax, [rsi+12] ; sum += x[i+3]
add rsi, 16 ; ptr += 4
dec ecx ; (trip count in ecx, was calculated outside loop)
jnz lp
(only showing the adressing parts here, in practice there's other transforms we'd like to see too, but
let's stay focused.)
If the compiler can't (for whatever reason) prove that the index expressions are not going to overflow
as 32-bit values, it needs to do something much worse like:
lp:
movsxd rdi, ebx ; sext32to64(i)
add eax, [rsi+rdi*4]
lea edi, [ebx+1] ; i+1
movsxd rdi, edi
add eax, [rsi+rdi*4]
lea edi, [ebx+2]
movsxd rdi, edi
add eax, [rsi+rdi*4]
lea edi, [ebx+3]
movsxd rdi, edi
add eax, [rsi+rdi*4]
add ebx, 4
cmp ebx, ecx ; (precomputed spot to stop 4x unrolled iters)
jl lp
This is a silly example, but all you need to do is look at some assembly listings for
generated x64 code for hot loops with two's complement overflow semantics and search for
"movsxd"; usually you'll be able to find actual addressing code that does stuff like this
in less than a minute.
Anyway, as said, compilers use the signed-overflow UB as a free ticket to promote int32
loop counters to int64, which fixes this type of problem (and introduces exciting new
problems, especially when it's applied to all signed integer arihtmetic and not cases
like these loop counters where there's actually a specific reason).
But really the core issue is that all of C's semantics (everything narrower auto-widened
to int etc.) work pretty well when "int" is the width of machine registers and pretty
badly when it's not. Newer languages (e.g. Go and Rust) solve this problem by just making
"int" be 64-bit on 64-bit platforms, which removes most of the motivation for this kind
of thing.
What I'd *like* to see compilers spend time on is:
a) you can do profile-guided optimizations; what about profile-guided performance warnings?
Eating the sign extends (and resulting extra work) on cold code is not *that* bad if
there's a way for the compiler to tell the programmer about the hot loops that really
should probably use a different loop counter type.
b) decreasing cost of sign extends. Note that my examples allocate a different register
for the sign-extended quantities, increasing register pressure. This is typical and
often as much (or more) of a problem than the extra instructions. On x64, it is in
fact fine and often preferable to sign-extend a 32-bit register to itself:
movsxd rbx, ebx
or similar. The top 32 bits of "rbx" will get zero-cleared the next time a 32-bit ALU
op writes to ebx, but that's fine. At least we didn't use an extra register.
c) compilers already expend a lot of effort on theorem prover-style analysis that can
e.g. show when overflows can't happen in a loop such as the example loop (in this
case: if we can prove "count" stays fixed throughout the loop, we only enter the loop
if "0 < count", and we only stay in the loop until "i >= count"; with i increasing at
a rate of 1 per iteration, this is guaranteed to happen before i overflows). This is
a powerful tool for enabling better optimizations (like transitioning to address
arithmetic in this loop). Unfortunately it's also pretty much a black box to the
programmer, and it can be really hard to understand what the compiler can or cannot
prove at any given point in time. So it's tricky. Better analysis catches more and
more cases, but when it doesn't work, it fails badly.
If in doubt, it's almost always preferable to have semantics defined in such a way
that you don't *need* a theorem-prover to generate efficient code. (Easier said
than done in some cases!)
Anyway, what you can do?
If you're a compiler writer:
- Don't copy C's backwards-compat compromise that caused this mess. Make sure that your
idiomatic loop counter is register width.
Basically, the less common these sign extends are on critical paths, the less pressure
there is to do questionable things to make them faster.
If you're someone writing C/C++ code: It depends on the platform. On x86-64, 32-bit
arithmetic ops are all zero-extending. So anything that works with uint32s is fine.
ARM AArch64 likewise has 32- and 64-bit variants of everything, *and* some fairly good
sign/zero-extend support for narrower types built in, even in addresssing calculations.
This helps reduce these costs. On 64-bit PowerPC/POWER, you only get 64-bit versions
of many arithmetic operations, and your life sucks no matter what. You really want to
be working with 64-bit integer types whenever possible on these machines.
On most of the machines you're likely to use, "size_t" for loop counters is a good idea
where signed values aren't required. It's unsigned and generally as wide as addresses,
so there's normally no extra code for zero/sign extends, and the type is standard.

Thanks for this little writeup, it was a good read!

Really small correction though, simply because you mentioned Go and Rust: Rust doesn't have an int type; it has explicit i32/u32 and i64/u64 types. It does, however, have a usize type that is "The pointer-sized unsigned integer type" which is almost always used for indexing and loop counters (and ofc also isize, its signed counterpart). This aligns with what you're saying here and is generally a Good Thing™, especially the "Make sure that your idiomatic loop counter is register width" part.

jacobsa commented May 9, 2016

I really enjoyed this, thanks. Do you have pointers to any sort of write-up about why C chose to go the "leave int at 32 bits" route? (And why wasn't the same done for the 16- to 32-bit transition?)

al45tair commented May 9, 2016

@jacobsa For the 16 to 32-bit transition, changing from 16-bit to 32-bit made sense; it only adds two bytes per variable, and quite a lot of counters in typical programs can very easily exceed 65536 in the first place (i.e. there were already lots of programs that either had arbitrary limits or actual brokenness, and those that didn't already had to use 32-bit values all over the place anyway). So the effect of the change was minimal. Furthermore, not all of the "16-bit" platforms were really 16-bit in the first place; the MC68K-based machines were really 32-bit from the get go, and so even on the 68000 and 68010-based systems, some compilers use 32-bit "int" by default anyway (the main downside, until the 68020, being that there was a performance penalty, particularly if multiplication was involved — but since programmers knew that, they could always declare things as "short int" if they knew values would fit in that range).

For the 32-bit to 64-bit transition, things were rather different. Most values in most programs will happily fit in 32 bits (i.e. they are always going to be smaller than ±2bn or +4bn respectively), so extending them all to 64-bits wastes four bytes each time. This creates extra pressure on processor caches and on system memory and has very little benefit otherwise.

As an aside, Cray went for 64-bit int, as did early Alpha-based machines (though later ones switched to 32-bit int).

al45tair commented May 9, 2016

One other issue not mentioned in the above write-up is that loop analysis is easier if you are allowed to assume that signed overflow can’t happen, which means that compilers can more aggressively optimize some loops under this assumption. This applies even on platforms (like ARMv8) that are able to cope easily with mixed-size arithmetic.

TheThief commented May 9, 2016

@jacobsa There's also an issue that there are only two integer types below "int" in standard C, if you want to support all power-of-two sized integers with an 8-bit minimum then you're forced to use a 32-bit "int" or add a non-standard type to fill in the gap:

  • signed/unsigned char: 8-bit
  • short: 16/32-bit ??
  • int: 64-bit
  • long: 128-bit? Also 64-bit?

A 32-bit int just makes more sense under that constraint.

ahh commented May 9, 2016

@jacobsa a doc about the actual decision makers thought process is here: http://www.unix.org/version2/whatsnew/lp64_wp.html

("LP64" is the name for the mode everyone uses, because long and ptr types are 64-bit (and nothing else.))

masklinn commented May 9, 2016

There's also an issue that there are only two integer types below "int" in standard C, if you want to support all power-of-two sized integers with an 8-bit minimum then you're forced to use a 32-bit "int" or add a non-standard type to fill in the gap

That's not been the case for a good 15 years, a C99-compliant implementation can provide any number of _intNt for whichever N it wants, usually N=8, 16, 32 and 64 but e.g. Clang also provides types for N=24, N=40, N=48 and N=56.

Also worth noting that if the 32b instruction results were sign-extended to 64b instead of zero-extended, this particular thing wouldn't be a problem (there'd be a different, but somewhat less frequent problem).

eternaleye commented May 9, 2016

@ahh

"LP64" is the name for the mode everyone uses

Doesn't Windows use LLP64, where long long and pointers are 64-bit, but long is not?

because long and ptr types are 64-bit (and nothing else.)

In both LP64 and LLP64, long long is 64-bit.

ahcox commented May 10, 2016

Could you convert to Markdown (basically change the extension) so this is readable on mobile?

You may want to look at Chris Lattner's "What Every C Programmer Should Know About Undefined Behavior" series (starting at http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html)
I'd say that the optimizations that become possible if we exploit the existence of UB go far beyond the case you've shown. In fact those are architecture-agnostic and don't need to know about the machine word size on the target arch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment