shafik/lex_pptoken_undefined_behavior_and_preprocessor.md

## lex_pptoken_undefined_behavior_and_preprocessor.md

      
The following code has an interesting form of undefined behavior:

```c
#define STR_START "
#define STR_END "

int puts(const char *);

int main() {
    puts(STR_START hello world STR_END);
}
```

Example taken from the Stack Overflow question "Why can't we use the preprocessor to create custom-delimeted strings?".

This undefined behavior is covered in the draft C++ standard, in [lex.pptoken]/p2, which says:

> A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The
> categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals
> (including user-defined character literals), string literals (including user-defined string literals), preprocessing
> operators and punctuators, and single non-white-space characters that do not lexically match the other
> preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined.
> ...

This program won't compile: clang reports several diagnostics, including a `-Winvalid-pp-token` warning. What is interesting is that when we examine the output of the preprocessor using `clang++ -E`, the output looks OK:

```
clang++ -E junk.c

// Deleted output including warning

int puts(const char *);

int main() {
    puts(" hello world ");
}
2 warnings generated.
```

We see " hello world " which looks like a valid string literal. So what is going on here?
The preprocessor output we are seeing is not the sequence of preprocessing tokens that the standard requires in [lex.phases]/p3:

> The source file is decomposed into preprocessing tokens (5.4) and sequences of white-space characters
> (including comments). ...

Instead, it is textual output. The gcc documentation covers some of the details in The C Preprocessor: Preprocessor Output, which says:

> When the C preprocessor is used with the C, C++, or Objective-C compilers, it is integrated into the compiler and
> communicates a stream of binary tokens directly to the compiler's parser.
> However, it can also be used in the more conventional standalone mode, where it produces textual output. ...

The same manual, in The C Preprocessor: Traditional lexical analysis, says:

> The traditional preprocessor does not decompose its input into tokens the same way a standards-conforming preprocessor does.
> The input is simply treated as a stream of text with minimal internal form.

So, internally to the compiler, the preprocessor generates tokens as required by the standard, but we don't see that when we invoke it on the command line.