alexei-zaripov/C_and_CXX_performance_pitfall.md

## C_and_CXX_performance_pitfall.md

      
    Raw
  

              C_and_CXX_performance_pitfall.md
            
          
    C/C++ performance pitfall: int8_t, aliasing and the ways out

When I was working on a generic port of Google's hashmap to C, I wrote a
function that (ignoring irrelevant parts) looked like this:
typedef struct {
    uint8_t *bytes;
    size_t len;
} bytebuf;

void do_things(bytebuf *b) {
    uint8_t *i = b->bytes;
    for (; i < b->bytes + b->len; i++) {
        *i |= 0xAA;
    }
}
Just iterate over the buffer, doing a trivial bitop on each byte, how complicated can it be? I soon became curious to see how
compilers would vectorize this code for different architectures. The first target I tried was x86_64 with AVX2
for which recent Clang gave me the following assembly (annotated by me):
do_things:
        cmp     qword ptr [rdi + 8], 0    ;  
        jle     .LBB0_3                   ;  if ((uint64_t)b->len < 0)  goto endloop
        mov     rax, qword ptr [rdi]      ;  rax = b->bytes
.LBB0_2:                                  ;loop:
        or      byte ptr [rax], -86       ;  *rax |= 0xAA
        inc     rax                       ;  rax++
        mov     rcx, qword ptr [rdi]      ;  rcx = b->bytes  // ?!
        add     rcx, qword ptr [rdi + 8]  ;  rcx += b->len   // ?!
        cmp     rax, rcx                  ;  
        jb      .LBB0_2                   ;  if (rax < rcx)  goto loop
.LBB0_3:                                  ;endloop:
        ret                               ;  return
... and that got me confused for a few seconds. Why is it not vectorized? And worse, why does it fetch b->len and b->bytes
from memory on every iteration? Why would Clang have a problem doing basic optimizations here? Well...
Problem: pointers to char have special aliasing properties

The thing is, that code above modifies some uint8_t in memory, and uint8_t is just a typedef to unsigned char. char is a
special type in both C and C++: it is the only type that is allowed to read and modify memory occupied by other
types.  So compiler has to account for the bizarre case where *i |= 0xAA might change the value of b->len or b->bytes!
This is why it can't pre-compute b->bytes + b->len and that stops it from unrolling the loop. Be it uint16_t instead of
uint8_t for example, there would be nothing to stop Clang and GCC from unrolling this code however they like (at least as
long as uint16_t and size_t are not the same type under the hood). Ok, but how to keep using uint8_t and allow the
compiler to optimize the code properly?
Approach 1: keep the data in local variables so compiler can assume it is not aliased

If you rewrite example code to
void do_things(bytebuf *b) {
    uint8_t *i = b->bytes;
    uint8_t *end = b->bytes + b->len;
    for (; i < end; i++) {
        *i |= 0xAA;
    }
}
compiler will be able to unroll this loop as loop condition now depends on local variables that b->bytes can't possibly alias.
Unfortunately, converting existing code to this approach can be a bit too tedious when you have a lot of such code.
Approach 2: use restrict keyword

restrict is a type qualifier that applies to pointers, asserting that the memory to which this pointer points can only be accessed
by this pointer or pointers derived from it
(see restrict on cppreference for details). So just
by adding restrict to the declaration of function parameter like this:
void do_things(bytebuf *restrict b) { // ...
you allow the compiler to assume that b->bytes and b->len won't be modified in the loop, and thus let it unroll the loop.
Although restrict is not officially part of C++, it is available as an extension in Clang, GCC, MVC as __restrict.
Unfortunately, fixing existing code with restrict can also be a lot of boring and demanding work, as you would need to
carefully examine the code where you are adding restrict and how that code is used from other parts of the program, because
misusing restrict will bring you very nasty bugs (see CERT entry on that for more info).
Approach 3: use char8_t instead of int8_t and char (C++ only)

C++20 introduced char8_t as a new fundamental type. It does not have the special aliasing properties of char, so replacing
uint8_t with char8_t in the example code will make compilers produce properly optimized code. There is a catch however:
char8_t is unsigned, and attempts to make it signed using signed char8_t will be rejected by compilers. So extra casts to
int8_t will be needed once char8_t is retrieved from memory when you need to treat its value as signed. In C,
Clang makes char8_t available as a fundamental type like in C++ when -fchar8_t command-line option is passed, but it does
not work with GCC, and anyway, this feature is going to conflict with the upcoming C23 where char8_t will be added as a typedef to
unsigned char in the <uchar.h> header.
Approach 4: use _BitInt(8) instead of char

The next C standard, C23 will include a new feature called "bit-precise integers". It allows us to have integer types with
custom bit-widths through the syntax _BitInt(N) where N is a constant expression indicating the amount of bits this integer
type has. It is signed by default but can be made unsigned with the unsigned keyword. So _BitInt(8) would be an 8-bit integer
like char, except that since it's not a char, it is not allowed to alias other types, thus allowing for better code to be generated.
Just changing types from uint8_t to _BitInt(8) in the code from the beginning of this article makes compiler generate
vectorized code. Using this approach to fix existing codebase is simple and safe. Unfortunately, current GCC version 13 does
not support _BitInt, Clang supports this feature in C and as an extension in C++ since version 14 released in 25 March 2022. So
if you are stuck with older compiler versions, you are out of luck.
C++ considerations

unique_ptr<int8_t> and vector<int8_t> as well as their std::byte and char counterparts, also suffer from the char aliasing
problem. That is, assignment to the element of vector<int8_t> makes compilers assume that (almost) all memory has been clobbered.
In other languages

Rust is pretty strict on aliasing: at most one mutable reference to a part of memory can exist. This is backed up by static
checks in compiler that report as error any attempt to create more than one mutable reference to a value in memory (at least
within the "Safe" Rust). Of course, Rustc's optimizer takes advantage of uniqueness of mutable references.

D language does not impose even cross-type aliasing restrictions, except in case of
vector operations where involved slices must not overlap.

Zig allows all kinds of aliasing unless pointer is marked by noalias undocumented keyword. Aliasing rules in Zig are
probably subject to change as the language is still in beta.
Summary


approach
fitness
safety
protability


keeping data local
works only when you can fit this data on stack
safe
maximum


restrict
made for this, but breaks easily
unsafe
C99, in C++ as extension


char8_t
can only be unsigned, two buffers are still alowed to alias each other
safe
since C++20, C++ only


_BitInt(8)
two buffers are still alowed to alias each other
safe
C23, in C++ as extension. not yet in GCC


Closing message

Does the C or C++ codebase you have worked on have performance-sensitive code that writes through int8_t or char pointers
or vectors somewhere in the middle?
Historical notes


Thinking about it, it is weird that C/C++ code working on bytes in memory is inherently harder for compiler to optimize
compared to other integer widths due to how languages are defined. For C it was so for around 34 years from the first standard
C89 until C23.
restrict was introduced in C99 and was at least partially motivated by performance advantage Fortran had over C: Fortran
language standard forbids passing overlapping arrays to functions, which gave Fortran compilers more freedom in reordering
and vectorizing code that operated on multiple arrays. Later GCC (since version 4.3 released in March 2008) and Clang (3.3 - June 2013)
started to insert runtime checks for array overlap that would branch into vectorized code when it was safe to and to
unvectorized code otherwise.
approach	fitness	safety	protability
keeping data local	works only when you can fit this data on stack	safe	maximum
`restrict`	made for this, but breaks easily	unsafe	C99, in C++ as extension
`char8_t`	can only be unsigned, two buffers are still alowed to alias each other	safe	since C++20, C++ only
`_BitInt(8)`	two buffers are still alowed to alias each other	safe	C23, in C++ as extension. not yet in GCC