jibsen/bytes.md

## bytes.md

      
### Introduction

The C standard only specifies minimum limits for the values of character types
and standard integer types. This makes it possible to generate efficient code
on diverse architectures, but can be problematic if your code expects the
limits to match your development platform, or if you have to do low-level
work where exact sizes matter.

Before C99, the usual way to solve this was to use typedef to declare synonyms
for standard types of the right size (like u8, u16, u32 used in the Linux
kernel). These can then easily be changed to matching types on other platforms.

C99 introduced <stdint.h>, which declares integer types of specified widths,
including exact-width types of the form intN_t and uintN_t, where N is the
width. The exact-width types contain no padding, and the signed ones are two's
complement. These types are optional, but if the implementation provides a
suitable integer type of any of the widths 8, 16, 32, 64, it must define the
corresponding typedef.

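
Because the exact-width types are optional, portable code may want to detect
them. The standard requires the limit macros (such as UINT8_MAX) to be defined
exactly when the corresponding type is, so they can serve as a test. The
following is only a sketch of that idea, assuming a C11 compiler for
static_assert; the names octet and to_octet are made up for illustration:

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>

/* UINT8_MAX is defined if and only if uint8_t is. */
#ifdef UINT8_MAX
typedef uint8_t octet;        /* exactly 8 bits, no padding */
#else
typedef uint_least8_t octet;  /* always available, may be wider */
#endif

static_assert(sizeof(octet) * CHAR_BIT >= 8, "octet must hold 8 bits");

/* Reduce a value to its low 8 bits, whichever type octet is. */
octet
to_octet(unsigned int v)
{
        return (octet) (v & 0xFFu);
}
```
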
While they are quite useful, the standard is (as ever) loose enough in its
requirements that there are (at least theoretical) pitfalls that may not be
entirely obvious.

In the following, we are going to have a look at uint8_t.

### What is uint8_t?

From the C11 standard, 3.6 (byte) and 5.2.4.2.1 (Sizes of integer types), we
gather that a byte is the smallest object that is not a bit-field. Each byte
is uniquely addressable, and is composed of a contiguous sequence of CHAR_BIT
bits.

From 6.2.6.1p3 (Representations of types) and footnote 49, we further see that
unsigned char is required to match a byte.

Since uint8_t has exactly 8 bits and no padding, no object other than a
bit-field can be smaller than a byte, and CHAR_BIT is at least 8, uint8_t can
only be defined on platforms where CHAR_BIT is 8.

Thus it seems obvious to use unsigned char as the type of uint8_t, and that
is usually the case, but the standard does not specify this.
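
Whether a given implementation defines uint8_t as a character type can be
checked at compile time. The sketch below assumes a C11 compiler (for
static_assert and _Generic); the macro and function names are hypothetical.
It also lists plain char, since that would qualify on a platform where char is
unsigned:

```c
#include <assert.h>
#include <stdint.h>

/* uint8_t has 8 value bits and no padding, so it occupies exactly one byte. */
static_assert(sizeof(uint8_t) == 1, "uint8_t must be one byte");

/* 1 if uint8_t is a typedef for a character type, 0 otherwise. */
#define UINT8_IS_CHARACTER_TYPE \
        _Generic((uint8_t) 0, unsigned char: 1, char: 1, default: 0)

int
uint8_is_character_type(void)
{
        return UINT8_IS_CHARACTER_TYPE;
}
```

Code that relies on the aliasing properties of uint8_t could turn the macro
into a static_assert, so that the build fails on an implementation where the
assumption does not hold.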

### Special properties of character types

Sometimes you need to access the individual bytes of an object, or do arithmetic
on pointers. The standard describes special properties of the character types
that allow this.

When two lvalues refer to the same memory location, they are said to alias. C
has rules about which types are allowed to alias:

6.5p7 Expressions
An object shall have its stored value accessed only by an lvalue expression
that has one of the following types:

- a type compatible with the effective type of the object,
- a qualified version of a type compatible with the effective type of the
  object,
- a type that is the signed or unsigned type corresponding to the effective
  type of the object,
- a type that is the signed or unsigned type corresponding to a qualified
  version of the effective type of the object,
- an aggregate or union type that includes one of the aforementioned types
  among its members (including, recursively, a member of a subaggregate or
  contained union), or
- a character type.


As we can see, we are only allowed to access the memory of an object using
certain compatible types or a character type. This is referred to as the
strict aliasing rule, and has created some controversy because compiler vendors
have started relying on this rule to be able to perform certain optimizations.

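
As an illustration of staying within the rule, here is a sketch of reading the
object representation of a float without accessing it through an incompatible
lvalue. It assumes that float is 32 bits wide (checked at compile time) and a
C11 compiler for static_assert; the function name float_bits is made up for
this example:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/*
 * Not allowed: *(const uint32_t *) &f reads a float object through an
 * lvalue of type uint32_t, which is not on the list in 6.5p7.
 *
 * Allowed: memcpy copies the object representation byte by byte.
 */
uint32_t
float_bits(float f)
{
        uint32_t bits;

        memcpy(&bits, &f, sizeof bits);

        return bits;
}
```
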
When we access the memory of an object using a pointer to a character type, we
can address each byte of the object, and pointer arithmetic (within the bounds
of the object) works as expected:

6.3.2.3p7 Pointers
When a pointer to an object is converted to a pointer to a character type,
the result points to the lowest addressed byte of the object. Successive
increments of the result, up to the size of the object, yield pointers to the
remaining bytes of the object.
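
A minimal sketch of what this permits: walking over the bytes of an arbitrary
object through a pointer to unsigned char. The function sum_bytes is a
hypothetical example, not something from the standard:

```c
#include <assert.h>
#include <stddef.h>

/* Add up the values of the bytes that make up an object. */
unsigned long
sum_bytes(const void *obj, size_t size)
{
        const unsigned char *p = obj;
        unsigned long sum = 0;
        size_t i;

        assert(obj != NULL || size == 0);

        /* Each p[i] with i < size is a byte of the object. */
        for (i = 0; i < size; ++i) {
                sum += p[i];
        }

        return sum;
}
```

It can be called on any object, for instance as sum_bytes(&x, sizeof x).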

For pointers to non-character types, arithmetic is only defined within an
array:

6.5.6p7 Additive operators
For the purposes of these operators, a pointer to an object that is not an
element of an array behaves the same as a pointer to the first element of an
array of length one with the type of the object as its element type.


6.5.6p8 Additive operators
If both the pointer operand and the result point to elements of the same array
object, or one past the last element of the array object, the evaluation shall
not produce an overflow; otherwise, the behavior is undefined.
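
To make the two paragraphs concrete, here is a small sketch that forms a
one-past-the-end pointer and uses it only for comparison. The function
sum_ints is hypothetical, and the caller is assumed to pass an n that does not
exceed the number of elements:

```c
#include <assert.h>
#include <stddef.h>

/* Sum the n elements of the array starting at a. */
long
sum_ints(const int *a, size_t n)
{
        const int *end;
        long sum = 0;

        assert(a != NULL);

        end = a + n;    /* may point one past the last element */

        while (a != end) {
                sum += *a++;    /* end itself is never dereferenced */
        }

        return sum;
}
```

By 6.5.6p7, passing the address of a single int with n equal to 1 is covered
as well, since the object behaves like an array of length one.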

### What if uint8_t is not a character type?

Since the standard does not require uint8_t to be a character type, you could
imagine a compiler vendor deliberately making it a separate type to be able to
take advantage of the strict aliasing rule.

If you are fond of the fixed-width integer types, it is tempting to use
uint8_t in situations where you would have used a character type. Consider
this function for reading a uint32_t value in little-endian order:

```c
#include <stdint.h>

uint32_t
read_le32(const uint32_t *val)
{
        /* View the object as a sequence of 8-bit values */
        const uint8_t *p = (const uint8_t *) val;

        /* Assemble the result from the lowest addressed byte upwards */
        return (uint32_t) p[0]
            | ((uint32_t) p[1] << 8)
            | ((uint32_t) p[2] << 16)
            | ((uint32_t) p[3] << 24);
}
```

Here p aliases val, but if uint8_t is not a character type, we are
breaking the strict aliasing rule, and get undefined behavior.

If we make sure CHAR_BIT is 8, we can replace uint8_t with unsigned char
to avoid this problem.

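
A sketch of that replacement, assuming a C11 compiler so that static_assert
can be used to make sure CHAR_BIT is 8:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

uint32_t
read_le32(const uint32_t *val)
{
        /* unsigned char is a character type, so it may alias any object */
        const unsigned char *p = (const unsigned char *) val;

        return (uint32_t) p[0]
            | ((uint32_t) p[1] << 8)
            | ((uint32_t) p[2] << 16)
            | ((uint32_t) p[3] << 24);
}
```
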
Here is an example that uses a pointer to uint8_t to do pointer arithmetic, in
order to process memory in blocks:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64

extern void
process_single_block(const void *data, size_t size);

void
process_data_in_blocks(const void *data, size_t size)
{
        const uint8_t *p = data;
        size_t offs = 0;

        while (offs < size) {
                /* The last block may be shorter than BLOCK_SIZE */
                size_t num = size - offs > BLOCK_SIZE ? BLOCK_SIZE : size - offs;
                process_single_block(p + offs, num);
                offs += num;
        }
}
```

We do not access memory, so the strict aliasing rule does not apply. If p
were a pointer to a character type, we would be sure we could address the
individual bytes of whatever object type data points to in this way. But if
uint8_t is not a character type, this could be undefined behavior, unless
data points to an array of uint8_t (or a single uint8_t).

Again, if CHAR_BIT is 8, we can use unsigned char instead.
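
The unsigned char version could look like the sketch below. It assumes a C11
compiler for static_assert, and process_single_block is replaced by a
hypothetical stand-in that only keeps count, so the example is self-contained:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BLOCK_SIZE 64

static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* Hypothetical stand-in for the real block handler: it only keeps count. */
size_t blocks_seen;
size_t bytes_seen;

void
process_single_block(const void *data, size_t size)
{
        (void) data;
        blocks_seen += 1;
        bytes_seen += size;
}

void
process_data_in_blocks(const void *data, size_t size)
{
        /* A character pointer can address the bytes of any object */
        const unsigned char *p = data;
        size_t offs = 0;

        while (offs < size) {
                size_t num = size - offs > BLOCK_SIZE ? BLOCK_SIZE : size - offs;
                process_single_block(p + offs, num);
                offs += num;
        }
}
```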

This post is CC-BY-SA,
any code snippets are MIT.