Created
June 28, 2016 08:31
-
-
Save rygorous/d905d42489158a5d8b17607887f82de5 to your computer and use it in GitHub Desktop.
C/C++ parse notes
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Things in C/C++ that are hard to parse | |
====================================== | |
Type names | |
---------- | |
C is "mostly" LL(1) and generally relatively straightforward to parse. The primary | |
unncessary complication is that parsing C requires keeping a symbol table during | |
parsing that knows whether a given identifier in the current scope refers to a type | |
name or not. Take this example snippet: | |
(A)*B | |
This is ambiguous: | |
1. If A is not a type name, then the intepretation is | |
multiply(A, B) | |
2. If A is a type name, it is | |
cast(target_type=A, expr=unary_deref(B)) | |
Resolving this ambiguity is what requires a C parser to keep track of which names | |
refer to types. Using a different type-cast syntax - for example, by introducing an | |
explicit "cast" keyword and writing say "cast(A, *B)" - would resolve it, and allow | |
parsing C without a symbol table. (Many other solutions are also possible.) | |
My point being that this part of C syntax really has nothing to do with its expressive | |
power in either direction. It's a historical wart more than anything else. | |
Declarations | |
------------ | |
Declarations in C are not, per se, hard to parse; from a grammar point of view, they're | |
straightforward. They are, however, hard for humans to *read*, and it turns out that | |
implementing the semantics of C declarations is somewhat annoying for essentially the | |
same reasons. The idea behind C declaration is to make the declaration of a variable | |
look like a *use* of said variable. This sounds nice and symmetric but has pretty gross | |
results. | |
For example, these two declarations: | |
int (*x)[11]; | |
int (*f)(int a); | |
Declare x as a pointer to an 11-element array of ints and f as a pointer to a function | |
taking one integer argument "a" and returning an integer. Many C/C++ programmers don't | |
know how to write that kind of declaration off the top of their head, and will prefer to | |
use one or more "stepping stone" typedefs such as | |
typedef int ElevenIntArray[11]; | |
ElevenIntArray *x; | |
The reasons the C syntax is hard to read for humans *and* annoying to implement the | |
semantics of are essentially the same: pertinent bits of information appear in multiple | |
places within the declaration; you basically end up building a skeleton of the target | |
type with some blanks first, and then have to fill some of it in later. | |
To explain what I mean: take the first original example: | |
int (*x)[11]; | |
An English-language explanation of x is "pointer to an 11-element array of ints", and a | |
tree describing the type would be something like "x = pointer(array(11, int))". When | |
parsing that declaration from left to right, we learn these facts "out of order": | |
1. (Once we see the "int") "??? = something with ints?" | |
2. (After getting to "x") "x = pointer(something with ints)" | |
3. (After seeing the "[") "x = pointer(array(???, int))" | |
4. (After seeing "11") " x = pointer(array(11, int))" | |
Also note that the superficially similar | |
int *x[11]; | |
corresponds to the quite different "x = array(11, pointer(int))" (11-element array of | |
pointers to integers). | |
Compare say Pascal, where a type definition for an array type looks like this (there's no | |
way to directly define "pointer to an array" in one step in Standard Pascal, as far as I | |
know): | |
"type ElevenIntArray: array[0..10] of integer;" | |
This has a direct correspondence to how that type is *pronounced* (and also to how we | |
want the type tree to look: "ElevenIntArray = array(lower=0, upper=10, type=integer)" but | |
not to how it's used in an expression. It is, however, a lot easier to read, especially | |
when things like pointers are involved: | |
"type EleventIntPtrArray: array[0..10] of ^integer;" | |
ElevenIntPtrArray = array(lower=0, upper=10, type=pointer(integer)); - and note that | |
the ordering in the Pascal declaration steadily proceeds from "outermost" type at the | |
left to "innermost" at the right, with no jumping around. | |
Some newer languages like e.g. Go borrow the Pascal-style declaration syntax, albeit | |
with a lot less extra tokens: | |
"type PtrToElevenInts *[11]int" | |
"type ElevenIntPtrArray [11]*int" | |
Alright, so C declaration syntax works but is awkward to read and write for humans, and | |
somewhat awkward to handle in a language impl as well. But whatever, it works. | |
C++ declarations | |
---------------- | |
This is where it gets really ugly. Consider something like this in C++: | |
"A B(C);" | |
Let's look at two examples: | |
1. A="int", B="foo", C="int", i.e. | |
"int foo(int);" | |
This is a function declaration: | |
"declare_function(ret_type=int, name=foo, args=[int])" | |
2. A="int", B="foo", C="x" (where x is not a type name), i.e. | |
"int foo(x);" | |
This declares an integer variable "foo" that is copy-constructed from "x". | |
Note that the two see an *identical* sequence of tokens until the "x", but | |
nevertheless end up in quite different parts of the grammar. This is what's | |
commonly called the "most vexing parse" in C++, and is (to my knowledge) one | |
of the syntactic constructs in C++ that require backtracking. | |
But again, this has nothing to do with the expressive power of the language; it's | |
not a language feature, it's just another weird corner case/wart, in this case | |
stemming from the similarity between declaration and initialization syntax. | |
One possibly way to resolve this is heroics in the parser (this is what C++ does). | |
That the resulting behavior is generally called the "most vexing parse" does, I | |
think, show the problems with that approach: not only does it make the parser a | |
lot more complicated, it also confuses users. | |
A different solution is to simply make sure that variable and function declarations | |
can never be confused in the first place, for example by adding an extra keyword. | |
For example, in Go, all function declarations start with "func" and all explicit | |
variable declarations start with "var". No ambiguity, no need to backtrack, and | |
no confusion for the user. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment