Skip to content

Instantly share code, notes, and snippets.

@rygorous
Created June 28, 2016 08:31
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rygorous/d905d42489158a5d8b17607887f82de5 to your computer and use it in GitHub Desktop.
Save rygorous/d905d42489158a5d8b17607887f82de5 to your computer and use it in GitHub Desktop.
C/C++ parse notes
Things in C/C++ that are hard to parse
======================================
Type names
----------
C is "mostly" LL(1) and generally relatively straightforward to parse. The primary
unncessary complication is that parsing C requires keeping a symbol table during
parsing that knows whether a given identifier in the current scope refers to a type
name or not. Take this example snippet:
(A)*B
This is ambiguous:
1. If A is not a type name, then the intepretation is
multiply(A, B)
2. If A is a type name, it is
cast(target_type=A, expr=unary_deref(B))
Resolving this ambiguity is what requires a C parser to keep track of which names
refer to types. Using a different type-cast syntax - for example, by introducing an
explicit "cast" keyword and writing say "cast(A, *B)" - would resolve it, and allow
parsing C without a symbol table. (Many other solutions are also possible.)
My point being that this part of C syntax really has nothing to do with its expressive
power in either direction. It's a historical wart more than anything else.
Declarations
------------
Declarations in C are not, per se, hard to parse; from a grammar point of view, they're
straightforward. They are, however, hard for humans to *read*, and it turns out that
implementing the semantics of C declarations is somewhat annoying for essentially the
same reasons. The idea behind C declaration is to make the declaration of a variable
look like a *use* of said variable. This sounds nice and symmetric but has pretty gross
results.
For example, these two declarations:
int (*x)[11];
int (*f)(int a);
Declare x as a pointer to an 11-element array of ints and f as a pointer to a function
taking one integer argument "a" and returning an integer. Many C/C++ programmers don't
know how to write that kind of declaration off the top of their head, and will prefer to
use one or more "stepping stone" typedefs such as
typedef int ElevenIntArray[11];
ElevenIntArray *x;
The reasons the C syntax is hard to read for humans *and* annoying to implement the
semantics of are essentially the same: pertinent bits of information appear in multiple
places within the declaration; you basically end up building a skeleton of the target
type with some blanks first, and then have to fill some of it in later.
To explain what I mean: take the first original example:
int (*x)[11];
An English-language explanation of x is "pointer to an 11-element array of ints", and a
tree describing the type would be something like "x = pointer(array(11, int))". When
parsing that declaration from left to right, we learn these facts "out of order":
1. (Once we see the "int") "??? = something with ints?"
2. (After getting to "x") "x = pointer(something with ints)"
3. (After seeing the "[") "x = pointer(array(???, int))"
4. (After seeing "11") " x = pointer(array(11, int))"
Also note that the superficially similar
int *x[11];
corresponds to the quite different "x = array(11, pointer(int))" (11-element array of
pointers to integers).
Compare say Pascal, where a type definition for an array type looks like this (there's no
way to directly define "pointer to an array" in one step in Standard Pascal, as far as I
know):
"type ElevenIntArray: array[0..10] of integer;"
This has a direct correspondence to how that type is *pronounced* (and also to how we
want the type tree to look: "ElevenIntArray = array(lower=0, upper=10, type=integer)" but
not to how it's used in an expression. It is, however, a lot easier to read, especially
when things like pointers are involved:
"type EleventIntPtrArray: array[0..10] of ^integer;"
ElevenIntPtrArray = array(lower=0, upper=10, type=pointer(integer)); - and note that
the ordering in the Pascal declaration steadily proceeds from "outermost" type at the
left to "innermost" at the right, with no jumping around.
Some newer languages like e.g. Go borrow the Pascal-style declaration syntax, albeit
with a lot less extra tokens:
"type PtrToElevenInts *[11]int"
"type ElevenIntPtrArray [11]*int"
Alright, so C declaration syntax works but is awkward to read and write for humans, and
somewhat awkward to handle in a language impl as well. But whatever, it works.
C++ declarations
----------------
This is where it gets really ugly. Consider something like this in C++:
"A B(C);"
Let's look at two examples:
1. A="int", B="foo", C="int", i.e.
"int foo(int);"
This is a function declaration:
"declare_function(ret_type=int, name=foo, args=[int])"
2. A="int", B="foo", C="x" (where x is not a type name), i.e.
"int foo(x);"
This declares an integer variable "foo" that is copy-constructed from "x".
Note that the two see an *identical* sequence of tokens until the "x", but
nevertheless end up in quite different parts of the grammar. This is what's
commonly called the "most vexing parse" in C++, and is (to my knowledge) one
of the syntactic constructs in C++ that require backtracking.
But again, this has nothing to do with the expressive power of the language; it's
not a language feature, it's just another weird corner case/wart, in this case
stemming from the similarity between declaration and initialization syntax.
One possibly way to resolve this is heroics in the parser (this is what C++ does).
That the resulting behavior is generally called the "most vexing parse" does, I
think, show the problems with that approach: not only does it make the parser a
lot more complicated, it also confuses users.
A different solution is to simply make sure that variable and function declarations
can never be confused in the first place, for example by adding an extra keyword.
For example, in Go, all function declarations start with "func" and all explicit
variable declarations start with "var". No ambiguity, no need to backtrack, and
no confusion for the user.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment