rygorous/gist:d905d42489158a5d8b17607887f82de5

## gistfile1.txt
Things in C/C++ that are hard to parse
======================================

Type names
----------

C is "mostly" LL(1) and generally relatively straightforward to parse. The primary
unncessary complication is that parsing C requires keeping a symbol table during
parsing that knows whether a given identifier in the current scope refers to a type
name or not. Take this example snippet:

  (A)*B

This is ambiguous:
1. If A is not a type name, then the intepretation is
   multiply(A, B)
2. If A is a type name, it is
   cast(target_type=A, expr=unary_deref(B))

Resolving this ambiguity is what requires a C parser to keep track of which names
refer to types. Using a different type-cast syntax - for example, by introducing an
explicit "cast" keyword and writing say "cast(A, *B)" - would resolve it, and allow
parsing C without a symbol table. (Many other solutions are also possible.)

My point being that this part of C syntax really has nothing to do with its expressive
power in either direction. It's a historical wart more than anything else.

Declarations
------------

Declarations in C are not, per se, hard to parse; from a grammar point of view, they're
straightforward. They are, however, hard for humans to *read*, and it turns out that
implementing the semantics of C declarations is somewhat annoying for essentially the
same reasons. The idea behind C declaration is to make the declaration of a variable
look like a *use* of said variable. This sounds nice and symmetric but has pretty gross
results.

For example, these two declarations:

  int (*x)[11];
  int (*f)(int a);

Declare x as a pointer to an 11-element array of ints and f as a pointer to a function
taking one integer argument "a" and returning an integer. Many C/C++ programmers don't
know how to write that kind of declaration off the top of their head, and will prefer to
use one or more "stepping stone" typedefs such as

  typedef int ElevenIntArray[11];
  ElevenIntArray *x;

The reasons the C syntax is hard to read for humans *and* annoying to implement the
semantics of are essentially the same: pertinent bits of information appear in multiple
places within the declaration; you basically end up building a skeleton of the target
type with some blanks first, and then have to fill some of it in later.

To explain what I mean: take the first original example:

  int (*x)[11];

An English-language explanation of x is "pointer to an 11-element array of ints", and a
tree describing the type would be something like "x = pointer(array(11, int))". When
parsing that declaration from left to right, we learn these facts "out of order":

  1. (Once we see the "int") "??? = something with ints?"
  2. (After getting to "x") "x = pointer(something with ints)"
  3. (After seeing the "[") "x = pointer(array(???, int))"
  4. (After seeing "11") " x = pointer(array(11, int))"

Also note that the superficially similar

  int *x[11];

corresponds to the quite different "x = array(11, pointer(int))" (11-element array of
pointers to integers).

Compare say Pascal, where a type definition for an array type looks like this (there's no
way to directly define "pointer to an array" in one step in Standard Pascal, as far as I
know):

  "type ElevenIntArray: array[0..10] of integer;"

This has a direct correspondence to how that type is *pronounced* (and also to how we
want the type tree to look: "ElevenIntArray = array(lower=0, upper=10, type=integer)" but
not to how it's used in an expression. It is, however, a lot easier to read, especially
when things like pointers are involved:

  "type EleventIntPtrArray: array[0..10] of ^integer;"

ElevenIntPtrArray = array(lower=0, upper=10, type=pointer(integer));  - and note that
the ordering in the Pascal declaration steadily proceeds from "outermost" type at the
left to "innermost" at the right, with no jumping around.

Some newer languages like e.g. Go borrow the Pascal-style declaration syntax, albeit
with a lot less extra tokens:

  "type PtrToElevenInts *[11]int"
  "type ElevenIntPtrArray [11]*int"

Alright, so C declaration syntax works but is awkward to read and write for humans, and
somewhat awkward to handle in a language impl as well. But whatever, it works.

C++ declarations
----------------

This is where it gets really ugly. Consider something like this in C++:

  "A B(C);"

Let's look at two examples:
1. A="int", B="foo", C="int", i.e.
   "int foo(int);"

   This is a function declaration:
   "declare_function(ret_type=int, name=foo, args=[int])"
2. A="int", B="foo", C="x" (where x is not a type name), i.e.
   "int foo(x);"

   This declares an integer variable "foo" that is copy-constructed from "x".

Note that the two see an *identical* sequence of tokens until the "x", but
nevertheless end up in quite different parts of the grammar. This is what's
commonly called the "most vexing parse" in C++, and is (to my knowledge) one
of the syntactic constructs in C++ that require backtracking.

But again, this has nothing to do with the expressive power of the language; it's
not a language feature, it's just another weird corner case/wart, in this case
stemming from the similarity between declaration and initialization syntax.

One possibly way to resolve this is heroics in the parser (this is what C++ does).
That the resulting behavior is generally called the "most vexing parse" does, I
think, show the problems with that approach: not only does it make the parser a
lot more complicated, it also confuses users.

A different solution is to simply make sure that variable and function declarations
can never be confused in the first place, for example by adding an extra keyword.
For example, in Go, all function declarations start with "func" and all explicit
variable declarations start with "var". No ambiguity, no need to backtrack, and
no confusion for the user.
	Things in C/C++ that are hard to parse
	======================================

	Type names
	----------

	C is "mostly" LL(1) and generally relatively straightforward to parse. The primary
	unncessary complication is that parsing C requires keeping a symbol table during
	parsing that knows whether a given identifier in the current scope refers to a type
	name or not. Take this example snippet:

	(A)*B

	This is ambiguous:
	1. If A is not a type name, then the intepretation is
	multiply(A, B)
	2. If A is a type name, it is
	cast(target_type=A, expr=unary_deref(B))

	Resolving this ambiguity is what requires a C parser to keep track of which names
	refer to types. Using a different type-cast syntax - for example, by introducing an
	explicit "cast" keyword and writing say "cast(A, *B)" - would resolve it, and allow
	parsing C without a symbol table. (Many other solutions are also possible.)

	My point being that this part of C syntax really has nothing to do with its expressive
	power in either direction. It's a historical wart more than anything else.

	Declarations
	------------

	Declarations in C are not, per se, hard to parse; from a grammar point of view, they're
	straightforward. They are, however, hard for humans to read, and it turns out that
	implementing the semantics of C declarations is somewhat annoying for essentially the
	same reasons. The idea behind C declaration is to make the declaration of a variable
	look like a use of said variable. This sounds nice and symmetric but has pretty gross
	results.

	For example, these two declarations:

	int (*x)[11];
	int (*f)(int a);

	Declare x as a pointer to an 11-element array of ints and f as a pointer to a function
	taking one integer argument "a" and returning an integer. Many C/C++ programmers don't
	know how to write that kind of declaration off the top of their head, and will prefer to
	use one or more "stepping stone" typedefs such as

	typedef int ElevenIntArray[11];
	ElevenIntArray *x;

	The reasons the C syntax is hard to read for humans and annoying to implement the
	semantics of are essentially the same: pertinent bits of information appear in multiple
	places within the declaration; you basically end up building a skeleton of the target
	type with some blanks first, and then have to fill some of it in later.

	To explain what I mean: take the first original example:

	int (*x)[11];

	An English-language explanation of x is "pointer to an 11-element array of ints", and a
	tree describing the type would be something like "x = pointer(array(11, int))". When
	parsing that declaration from left to right, we learn these facts "out of order":

	1. (Once we see the "int") "??? = something with ints?"
	2. (After getting to "x") "x = pointer(something with ints)"
	3. (After seeing the "[") "x = pointer(array(???, int))"
	4. (After seeing "11") " x = pointer(array(11, int))"

	Also note that the superficially similar

	int *x[11];

	corresponds to the quite different "x = array(11, pointer(int))" (11-element array of
	pointers to integers).

	Compare say Pascal, where a type definition for an array type looks like this (there's no
	way to directly define "pointer to an array" in one step in Standard Pascal, as far as I
	know):

	"type ElevenIntArray: array[0..10] of integer;"

	This has a direct correspondence to how that type is pronounced (and also to how we
	want the type tree to look: "ElevenIntArray = array(lower=0, upper=10, type=integer)" but
	not to how it's used in an expression. It is, however, a lot easier to read, especially
	when things like pointers are involved:

	"type EleventIntPtrArray: array[0..10] of ^integer;"

	ElevenIntPtrArray = array(lower=0, upper=10, type=pointer(integer)); - and note that
	the ordering in the Pascal declaration steadily proceeds from "outermost" type at the
	left to "innermost" at the right, with no jumping around.

	Some newer languages like e.g. Go borrow the Pascal-style declaration syntax, albeit
	with a lot less extra tokens:

	"type PtrToElevenInts *[11]int"
	"type ElevenIntPtrArray [11]*int"

	Alright, so C declaration syntax works but is awkward to read and write for humans, and
	somewhat awkward to handle in a language impl as well. But whatever, it works.

	C++ declarations
	----------------

	This is where it gets really ugly. Consider something like this in C++:

	"A B(C);"

	Let's look at two examples:
	1. A="int", B="foo", C="int", i.e.
	"int foo(int);"

	This is a function declaration:
	"declare_function(ret_type=int, name=foo, args=[int])"
	2. A="int", B="foo", C="x" (where x is not a type name), i.e.
	"int foo(x);"

	This declares an integer variable "foo" that is copy-constructed from "x".

	Note that the two see an identical sequence of tokens until the "x", but
	nevertheless end up in quite different parts of the grammar. This is what's
	commonly called the "most vexing parse" in C++, and is (to my knowledge) one
	of the syntactic constructs in C++ that require backtracking.

	But again, this has nothing to do with the expressive power of the language; it's
	not a language feature, it's just another weird corner case/wart, in this case
	stemming from the similarity between declaration and initialization syntax.

	One possibly way to resolve this is heroics in the parser (this is what C++ does).
	That the resulting behavior is generally called the "most vexing parse" does, I
	think, show the problems with that approach: not only does it make the parser a
	lot more complicated, it also confuses users.

	A different solution is to simply make sure that variable and function declarations
	can never be confused in the first place, for example by adding an extra keyword.
	For example, in Go, all function declarations start with "func" and all explicit
	variable declarations start with "var". No ambiguity, no need to backtrack, and
	no confusion for the user.