@jibsen jibsen/bytes.md
Last active Dec 30, 2018

Ramblings about uint8_t and undefined behavior

Introduction

The C standard only specifies minimum limits for the values of character types and standard integer types. This makes it possible to generate efficient code on diverse architectures, but it can be problematic if your code expects the limits to match those of your development platform, or if you have to do low-level things.

Before C99, the usual way to solve this was to use typedef to declare synonyms for standard types of the right size (like u8, u16, u32 used in the Linux kernel). These can then easily be changed to matching types on other platforms.

C99 introduced <stdint.h>, which defines exact-width integer types of the form intN_t and uintN_t, where N is the width in bits. They contain no padding bits, and the signed types use two's complement representation. These types are optional, but if the implementation provides suitable integer types with widths of 8, 16, 32, or 64 bits, it must define the corresponding typedef names.
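
Because the exact-width types are optional, portable code sometimes checks for them before use. The following is a minimal sketch, assuming a C11 compiler; the names octet, OCTET_IS_EXACT and to_octet are hypothetical and only serve the illustration. It relies on the rule that <stdint.h> defines UINT8_MAX only when it also provides uint8_t.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/*
 * <stdint.h> defines UINT8_MAX only if it also provides uint8_t,
 * so the macro can be used to detect the optional type.
 */
#ifdef UINT8_MAX
typedef uint8_t octet;          /* hypothetical name: exactly 8 bits */
#define OCTET_IS_EXACT 1
#else
typedef uint_least8_t octet;    /* always provided; may be wider than 8 bits */
#define OCTET_IS_EXACT 0
#endif

/* Whichever type was selected, it can hold every value from 0 to 255. */
static_assert((octet) -1 >= 0xFF, "octet must hold at least 8 bits");

/* Keeps only the low 8 bits of v, so the result fits either typedef. */
static octet
to_octet(unsigned long v)
{
        return (octet) (v & 0xFFu);
}
```

The fallback to uint_least8_t is one possible design choice; code that truly needs an 8-bit object with no padding could instead refuse to compile with #error when UINT8_MAX is missing.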

While these types are quite useful, the standard is (as ever) loose enough in its requirements that there are pitfalls, at least in theory, that may not be entirely obvious.

In the following, we are going to have a look at uint8_t.

What is uint8_t?

From the C11 standard, 3.6 (byte) and 5.2.4.2.1 (Sizes of integer types), we gather that a byte is the smallest object that is not a bit-field. Each byte is uniquely addressable, and is composed of a contiguous sequence of CHAR_BIT bits.

From 6.2.6.1p3 (Representations of types) and footnote 49, we further see that unsigned char is required to match a byte.

Since CHAR_BIT is the number of bits of the smallest object, and it is at least 8, no object can be narrower than 8 bits. Because uint8_t must be exactly 8 bits wide with no padding, it can only be defined on platforms where CHAR_BIT is 8.

Thus it seems obvious to use unsigned char as the type of uint8_t, and that is usually the case, but the standard does not specify this (link, link, link).
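
If you want to know what a particular implementation does, C11 lets you ask the compiler. This is a sketch, assuming a C11 compiler with _Generic support; the macro IS_UNSIGNED_CHAR and the function uint8_is_unsigned_char are hypothetical names. The selection only lists unsigned char plus a default, because listing both unsigned char and uint8_t would be a constraint violation whenever they are the same type.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>

/* uint8_t can only exist when a byte has exactly 8 bits. */
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/*
 * Evaluates to 1 if the expression has type unsigned char, 0 otherwise.
 * The controlling expression of _Generic is not subject to integer
 * promotion, so a cast to a narrow type keeps that type.
 */
#define IS_UNSIGNED_CHAR(x) _Generic((x), unsigned char: 1, default: 0)

/* Returns 1 if uint8_t is a typedef of unsigned char on this platform. */
static int
uint8_is_unsigned_char(void)
{
        return IS_UNSIGNED_CHAR((uint8_t) 0);
}
```

On implementations where the function returns 1, uint8_t inherits every special property of unsigned char; where it returns 0, the concerns discussed in the rest of this text apply.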

Special properties of character types

Sometimes you need to access the individual bytes of an object, or do arithmetic on pointers. The standard describes special properties of the character types that allow this.

When two lvalues refer to the same memory location, they are said to alias. C has rules about which types are allowed to alias:

6.5p7 Expressions

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

  • a type compatible with the effective type of the object,
  • a qualified version of a type compatible with the effective type of the object,
  • a type that is the signed or unsigned type corresponding to the effective type of the object,
  • a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
  • a character type.

As we can see, we are only allowed to access the memory of an object using certain compatible types or a character type. This is referred to as the strict aliasing rule, and it has created some controversy because compilers have started to rely on this rule in order to perform certain optimizations (link, link, link, link, link, link).
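
To make the rule concrete, here is a small sketch of reading the representation of a float as an integer. It assumes a C11 compiler and a platform where float and uint32_t have the same size; float_bits is a hypothetical name. Casting the pointer and dereferencing it would access a float object through a uint32_t lvalue, which the list above does not permit, whereas memcpy copies the bytes and stays within the rules.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>   /* memcpy */

static_assert(sizeof(float) == sizeof(uint32_t),
    "this sketch assumes a 32-bit float");

/*
 * Returns the object representation of a float as a uint32_t.
 *
 * Writing *(const uint32_t *) &f instead would read a float object
 * through a uint32_t lvalue, which 6.5p7 does not allow.
 */
static uint32_t
float_bits(float f)
{
        uint32_t bits;

        memcpy(&bits, &f, sizeof bits); /* byte copy, no aliasing violation */
        return bits;
}
```

The value returned depends on the floating-point format and byte order of the platform, so it should be treated as an opaque bit pattern unless those are known.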

When we access the memory of an object using a pointer to a character type, we can address each byte of the object, and pointer arithmetic (within the bounds of the object) works as expected:

6.3.2.3p7 Pointers

When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.
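
As an illustration of this paragraph, the following sketch walks over the bytes of an arbitrary object through a pointer to unsigned char. The function name count_nonzero_bytes is hypothetical, and the code assumes C99 or later.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the nonzero bytes in the object representation of *obj. */
static size_t
count_nonzero_bytes(const void *obj, size_t size)
{
        const unsigned char *p = obj;   /* lowest addressed byte of the object */
        size_t count = 0;

        assert(obj != NULL || size == 0);

        /* Indexing up to the size of the object is defined for character types. */
        for (size_t i = 0; i < size; ++i) {
                if (p[i] != 0)
                        ++count;
        }

        return count;
}
```

A call such as count_nonzero_bytes(&value, sizeof value) works for any object type, because every object can be viewed as an array of unsigned char.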

For pointers to non-character types, arithmetic is only defined within an array (link, link):

6.5.6p7 Additive operators

For the purposes of these operators, a pointer to an object that is not an element of an array behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type.

6.5.6p8 Additive operators

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
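
The two paragraphs above can be sketched with a simple loop over an array of int. The name sum_ints is hypothetical. The end pointer is formed one past the last element, which is allowed as long as it is never dereferenced, and a single int can be passed with a count of 1 because it behaves like an array of length one.

```c
#include <assert.h>
#include <stddef.h>

/* Sums count elements starting at first, using only in-bounds pointers. */
static long
sum_ints(const int *first, size_t count)
{
        const int *end;
        long total = 0;

        assert(first != NULL);

        /* At most one past the last element: may be formed, not dereferenced. */
        end = first + count;

        for (const int *p = first; p != end; ++p)
                total += *p;

        return total;
}
```

Forming first + count + 1 would already be undefined behavior, even if the resulting pointer were never used to access memory.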

What if uint8_t is not a character type?

Since the standard does not require uint8_t to be a character type, you could imagine a compiler vendor deliberately making it a separate type to be able to take advantage of the strict aliasing rule.

If you are fond of the fixed-width integer types, it is tempting to use uint8_t in situations where you would have used a character type. Consider this function for reading a uint32_t value in little-endian order:

#include <stdint.h>

uint32_t
read_le32(const uint32_t *val)
{
        const uint8_t *p = (const uint8_t *) val;

        return (uint32_t) p[0]
            | ((uint32_t) p[1] << 8)
            | ((uint32_t) p[2] << 16)
            | ((uint32_t) p[3] << 24);
}

Here p aliases val, but if uint8_t is not a character type, we are breaking the strict aliasing rule, and get undefined behavior.

If we make sure CHAR_BIT is 8, we can replace uint8_t with unsigned char to avoid this problem.
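
One way this could look is sketched below, assuming a C11 compiler for static_assert. The name read_le32_bytes is hypothetical, and taking a const void * parameter is a choice made here so that the function also accepts byte buffers; it is not part of the original function.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>

static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* Reads four bytes starting at src as a little-endian 32-bit value. */
static uint32_t
read_le32_bytes(const void *src)
{
        /* unsigned char is a character type, so it may alias any object. */
        const unsigned char *p = src;

        return (uint32_t) p[0]
            | ((uint32_t) p[1] << 8)
            | ((uint32_t) p[2] << 16)
            | ((uint32_t) p[3] << 24);
}
```

The caller remains responsible for making sure that at least four bytes are readable at src.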

Here is an example that uses a pointer to uint8_t to do pointer arithmetic, in order to process memory in blocks:

#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64

extern void
process_single_block(const void *data, size_t size);

void
process_data_in_blocks(const void *data, size_t size)
{
        const uint8_t *p = data;
        size_t offs = 0;

        while (offs < size) {
                size_t num = size - offs > BLOCK_SIZE ? BLOCK_SIZE : size - offs;
                process_single_block(p + offs, num);
                offs += num;
        }
}

This function does not access the memory that data points to, so the strict aliasing rule does not apply. If p were a pointer to a character type, we could be sure that we can address the individual bytes of whatever object data points to in this way. But if uint8_t is not a character type, this could be undefined behavior, unless data points to an array of uint8_t (or a single uint8_t).

Again, if CHAR_BIT is 8, we can use unsigned char instead.
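
A sketch of that variant follows. To keep it self-contained, the external process_single_block is replaced by a callback parameter; the names block_fn, block_stats, count_block and for_each_block are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 64

/* Hypothetical callback type standing in for process_single_block. */
typedef void block_fn(const void *block, size_t size, void *ctx);

/* Example context: records how many blocks and bytes were seen. */
struct block_stats {
        size_t blocks;
        size_t bytes;
};

/* Example callback that only updates the statistics in ctx. */
static void
count_block(const void *block, size_t size, void *ctx)
{
        struct block_stats *stats = ctx;

        (void) block;
        stats->blocks += 1;
        stats->bytes += size;
}

/* Calls fn on consecutive blocks of at most BLOCK_SIZE bytes. */
static void
for_each_block(const void *data, size_t size, block_fn *fn, void *ctx)
{
        /* Pointer arithmetic on unsigned char is defined for any object. */
        const unsigned char *p = data;
        size_t offs = 0;

        assert(fn != NULL);

        while (offs < size) {
                size_t num = size - offs > BLOCK_SIZE ? BLOCK_SIZE : size - offs;

                fn(p + offs, num, ctx);
                offs += num;
        }
}
```

Since unsigned char is guaranteed to be a character type, p + offs is well defined for any object data points to, as long as offs stays within the size of that object.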


This post is CC-BY-SA, any code snippets are MIT.
