Interesting undefined behavior in lex.pptoken/p2

The following code has an interesting form of undefined behavior:

#define STR_START "
#define STR_END "

int puts(const char *);

int main() {
    puts(STR_START hello world STR_END);
}

This example is taken from the Stack Overflow question Why can't we use the preprocessor to create custom-delimited strings?
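
As an aside, if the goal is simply to build string literals in the preprocessor, the well-defined tool is the # stringizing operator, which produces a complete string-literal preprocessing token rather than a stray ". A minimal sketch of that approach (the STRINGIZE name is just for illustration):

#define STRINGIZE(x) #x

int puts(const char *);

int main() {
    puts(STRINGIZE(hello world)); /* expands to "hello world" */
}

Here every " in the expansion is generated by the # operator as part of a single string-literal token, so no quote character is left matching the problematic category discussed below.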

We can find this undefined behavior covered in the draft C++ standard in [lex.pptoken]/p2, which says:

A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ’ or a " character matches the last category, the behavior is undefined. ...
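
In other words, a stray character such as @ that does not lexically match any other category simply becomes its own single-character preprocessing token, while a stray ' or " leaves us with undefined behavior. A minimal sketch of the contrast (the macro names are just for illustration):

/* OK: @ does not match any other preprocessing token category, so it
   becomes a single non-white-space-character preprocessing token. */
#define AT @

/* Undefined behavior: a " that matches only that last category. */
#define QUOTE "

The second definition is exactly what STR_START and STR_END do above, which is why the program is undefined even though the expanded text happens to look well formed.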

Now, the program won't compile; we receive a few diagnostics, including warnings controlled by -Winvalid-pp-token. What is interesting is that if we examine the output of the preprocessor using clang++ -E, the output looks fine:

clang++ -E junk.c

// Deleted output including warnings

int puts(const char *);

int main() {
    puts(" hello world ");
}
2 warnings generated.

We see " hello world " which looks like a valid string literal. So what is going on here?

The preprocessor output we are seeing is not the stream of preprocessing tokens required by the standard, described in [lex.phases]/p3:

The source file is decomposed into preprocessing tokens (5.4) and sequences of white-space characters (including comments). ...

but textual output. We can find some details of this covered in the GCC documentation, in The C Preprocessor: Preprocessor Output, which says:

When the C preprocessor is used with the C, C++, or Objective-C compilers, it is integrated into the compiler and communicates a stream of binary tokens directly to the compiler’s parser. However, it can also be used in the more conventional standalone mode, where it produces textual output. ...

and in The C Preprocessor: Traditional lexical analysis, which says:

The traditional preprocessor does not decompose its input into tokens the same way a standards-conforming preprocessor does. The input is simply treated as a stream of text with minimal internal form.

So, internally, the preprocessor generates the preprocessing tokens required by the standard and hands them directly to the compiler, but we don't see those tokens when we invoke the preprocessor on the command line; we only see a textual rendering of them.
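
If we want to look at something closer to that internal token stream, one option is clang's -dump-tokens frontend action; note that -cc1 options are clang's internal interface rather than a stable, user-facing one, so treat this as an illustration:

clang -cc1 -dump-tokens junk.c

Instead of reconstructing textual output, this lexes the file and prints one line per preprocessing token, showing its kind, spelling, and source location.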
