
@jorendorff
Last active October 7, 2023 23:18

Anomalies in mainstream programming language grammar: an incomplete freak show

Parser generators have existed forever. Why don't any serious compilers use them?

Well, Python uses one... written from scratch to parse Python, unsuitable for any other project.

One reason is that programming languages are such amazing trash. The more widely used a language is, the more likely it is to have wonderful, incredible pain points in the syntax. Below is an incomplete list that covers a few of the most mainstream languages.

To be clear: these are the quirks that affect the task of writing a parser for the language. Syntactic quirks that only matter to humans trying to use the language are given a miss.

  • C, C++, JavaScript, etc.: dangling else.

  • C++, Rust, etc.: Is >> one token or two? It depends. Nested generics can require this to be read as two tokens, as in vector<set<T>>; but it's also the right shift operator.

    Rust has the same issue with &&, which is either the logical AND operator or two address-of operators. And || is the logical OR operator or the start of a closure that takes no arguments. But the grammar can work around these problems in dumb-but-good ways: allowing the single token && as a prefix operator meaning "take the address of a temporary containing the address of"; and allowing the single token || to serve as closure arguments. The >> case is harder to address, I think.
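
    All three workarounds can be seen in action in a small runnable demo (my own example, not from the text above):

```rust
// The single `||` token serving as an empty closure-argument list.
fn call_empty_closure() -> i32 {
    let f = || 40 + 2; // `||` is one token here, not two pipes
    f()
}

// The single `&&` token as a prefix: a reference to a temporary reference.
fn double_ref() -> i32 {
    let x = 5;
    let rr: &&i32 = &&x; // one `&&` token, two borrows
    **rr
}

fn main() {
    // `>>` here must be split into two closing angle brackets,
    // not read as the right-shift operator.
    let nested: Vec<Vec<i32>> = vec![vec![1, 2]];
    assert_eq!(nested[0][1], 2);

    assert_eq!(call_empty_closure(), 42);
    assert_eq!(double_ref(), 5);
}
```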

  • C++: Is it a typename? Is it a template? The statement x*y; is ambiguous unless C++ knows whether or not x is a type. Likewise, to parse vector<t>v(x); you first need to know if vector is a template, and for a::b<c<d>::e to make any sense at all, you need to know that a::b is a template and c is not. (Or vice versa.)

    So a C++ parser must track identifiers as it goes and always know which ones are types, which are templates, and which are variables.

    I think this also resolves the ambiguity in new(x)(y), where if x is not a type, we're doing placement new and y had better be a type, but if x is a type, then (y) is an argument list.

  • C++: I think it's impossible to know what a leading const or static or extern means before parsing whatever comes after. A minor issue by C++ standards: the amount of lookahead the language requires is 1 more than whatever it would have been (infinity plus one, then, I suppose).

    (In fact, this may not even pose a problem to LR parsers.)

  • C++: ambiguity between function declarations and local variables (the "most vexing parse").

  • C++: ambiguity between local variable declarations and expressions.

    int(x), y, *const z;  // int x; int y; int *const z;
    

    Is a comma-separated list of declarations in which the first is redundantly parenthesised, whereas changing the final list element:

    int(x), y, new int;   // ((int(x)), (y), (new int));
    

    gives a list of expressions, the first two of which are redundant, and the third causes a memory leak.

    Other ambiguities are resolved by the language definition:

    int(x), y, z = 0;     // int x; int y; int z = 0;
    

    This could be an expression too, but isn’t. [...]

    (This is from "the Fog thesis".)

    Obviously the amount of lookahead you need to distinguish declarations from expressions is unbounded.

  • C++: general ambiguity between types and other expressions

  • Rust: OMG the semicolons. I have never fully understood this and maybe never will.

    You're allowed to drop the semicolon from a final expression in a Rust block. But what about this case?

    fn f() -> usize {
        { do_some_stuff() }   // no semicolon after `)`; is this ok?
        return 0;
    }
    

    If you answered yes, you're wrong. If you answered no, you are also wrong. The correct answer is, it depends.

    It turns out to be basically impossible to enforce the desired semicolon rules grammatically. To see why, consider this one:

    fn f(ok: bool) -> usize {
        if ok {
            g()   // no semicolon. is this ok?
        } else {
            ...
    

    You won't know if that omitted semicolon is acceptable or not until you know if this if statement is the last thing in the function. And you won't know that until you get to the end of the else block. Unbounded lookahead.

    The solution is simple: the Rust parser allows missing semicolons at the end of every block, including both cases above. It's overly generous. But don't get cocky. If you omit a semicolon in a block that's followed by another statement, the type checker (a later phase of compilation) will probably flag it as an error. In the above case, if do_some_stuff() happens to return (), you're fine.
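
    Here is a runnable version of that sketch, under the assumption (hypothetical names are mine) that do_some_stuff returns (): the parser accepts the missing semicolon, and the type checker is happy because the block's value has type ().

```rust
fn do_some_stuff() {} // returns `()`, which is what makes this legal

fn f() -> usize {
    { do_some_stuff() } // no semicolon: parsed fine, type-checks as `()`
    return 0;
}

fn main() {
    assert_eq!(f(), 0);
}
```

    Change do_some_stuff to return, say, usize, and the parse still succeeds, but the type checker rejects the program.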

  • Rust: No operator after a block. Rust syntax for blocks is { stmt* expr? }. That's ambiguous, since there are strings that match both { stmt expr } and { expr }.

    The rule in Rust is that the { stmt expr } parse is preferred: the body of the block { loop { break } - 1 } is a loop statement followed by the expression -1, not a single subtraction-expression.

    But watch out! In {x = {3} - 1}, the value assigned to x is 2. The block matches { expr }, because there is no { stmt expr } parse for this one; if you want x = {3} to match stmt and -1 to match expr, the statement is missing a semicolon.
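
    Both parses can be checked directly (a small demo of my own devising):

```rust
fn expression_position() -> i32 {
    // On the right of `=` we are in expression position, so `{3} - 1`
    // is a subtraction: x == 2.
    let x = { 3 } - 1;
    x
}

fn statement_position() -> i32 {
    // Inside the inner block, `loop { break }` is in statement position,
    // so the `{ stmt expr }` parse wins and `-1` is the tail expression:
    // no subtraction is parsed, and the block's value is -1.
    let v = { loop { break } - 1 };
    v
}

fn main() {
    assert_eq!(expression_position(), 2);
    assert_eq!(statement_position(), -1);
}
```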

  • Rust: Pattern or type? Amazingly, there's a place in the Rust syntax where we have a straight-up conflict between types and patterns.

    trait MyTrait {
        fn method(...
    

    The parameters of a trait method can be regular named parameters, pat : ty; or else just ty (if the method does not have a body). If that were the whole story, you'd need unbounded lookahead to determine which you've got, as both patterns and types can be (nested) tuples. Rust settles this by only permitting a rather ad hoc subset of the pat syntax here.

    TODO: details, example

  • Rust, JavaScript: Boolean parameterization of nonterminals.

    In Rust, we have if expr { ... }, but some expressions, like Point { x: 0, y: 0 }, contain {. The hack in Rust is that after if, the expression is simply not allowed to be a struct-expression.

    Similarly, in JavaScript we have for (expr; expr; expr) but we also have for (x in y); and in is a binary operator that can appear in an expr. Suppose the next few tokens are for (x in z ? including the question mark. If we're in a C-style for(;;) loop, then we should reduce x in z as a binary expression, right? But what if we're in a for-in loop?

    In JavaScript, the problem is resolved using boolean parameters that are propagated throughout the grammar. This is why for (x in y;;) is banned: immediately after for ( in the C-style for production, the nonterminal is Expression[~In], switching to (effectively) a copy of the expression grammar that does not contain the in operator.
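
    Back to the Rust case, the restriction is easy to observe (example of my own devising): a struct literal directly in an if condition is a parse error, while the parenthesized form is fine.

```rust
#[derive(PartialEq)]
struct Point { x: i32, y: i32 }

fn is_origin(p: Point) -> bool {
    // `if p == Point { x: 0, y: 0 } { ... }` would not parse: after `if`,
    // the expression may not be a struct literal, because the `{` would be
    // taken as the start of the `if` body. Parentheses restore it.
    if p == (Point { x: 0, y: 0 }) {
        true
    } else {
        false
    }
}

fn main() {
    assert!(is_origin(Point { x: 0, y: 0 }));
    assert!(!is_origin(Point { x: 1, y: 0 }));
}
```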

  • Only in JavaScript:

    • Feedback from syntactic context to lexical analysis (selecting the lexical goal symbol, to cope with JS regular expressions and template strings)

    • Automatic semicolon insertion. If a JS script or module appears to have a syntax error, but it can be fixed by inserting an imaginary semicolon just before a curly brace or at the end of a line (or adjacent to a comment containing a line terminator), or... a few other cases... JS imagines the semicolon for you.

      Some semicolons can't be imaginary, though.

      The above is pretty much how automatic semicolon insertion is specified in the language specification. Like so many wonderful things in JS, it's an amazing post hoc rationalization of "whatever Netscape implemented back in the day".

    • Lookahead assertions. They look like this in the specification: [lookahead ∉ {let [}].

      Lookahead assertions are a super useful hack. Most regexps support them. PEGs support them. It's not terribly surprising that JS uses them; it's rather more surprising that more languages don't need them.

      In JS, this is used mainly in the lexical grammar, where it's not so bad; for example, \0 is allowed as an EscapeSequence in a string, but only if the next character is not a DecimalDigit. Fine.

      But it is used liberally in the grammar too: for example, if a Statement is expected, the ambiguity in {x, y} is resolved in favor of Block, not an ExpressionStatement consisting of an ObjectLiteral, because the ExpressionStatement production starts with "[lookahead ∉ { {, function, async [no LineTerminator here] function, class, let [ }]. The negative lookahead assertion is there specifically to prevent ExpressionStatement from competing with other productions.

    • Cover grammars. (TODO: actually Python does a bit of this too)

  • JavaScript: Conditional keywords. (TODO: I think a bunch of languages do this.)

  • Python, Haskell: Indentation isn't context-free. Parsers for languages that use significant indentation seem to have custom lexers for this reason.

  • Haskell: The language allows declaring your own binary operators, as well as importing them from other files. In the latter case, you may be using an operator whose precedence and associativity are not specified anywhere in the current file.

    GHC copes with this by not even trying to generate correct parse trees for adjacent binary operators. It just constructs a list. A later stage of the compiler, which has enough information, converts the list into the desired tree by applying operator precedence and associativity.

    https://stackoverflow.com/questions/29185467/parsing-with-user-defined-operator-precedence
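
    The same trick can be sketched in a few lines of Rust (a toy of my own, not GHC's actual code): the parser emits a flat `operand (op operand)*` list, and a later pass folds it into a value once precedences are known. Associativity is ignored here; every operator is treated as left-associative.

```rust
fn apply(op: char, a: i64, b: i64) -> i64 {
    match op { '+' => a + b, '-' => a - b, '*' => a * b, _ => unreachable!() }
}

// The precedence table, which in GHC's setting is only known after parsing.
fn prec(op: char) -> u8 {
    match op { '*' => 2, _ => 1 }
}

// Fold a flat operand/operator list into a value, honoring precedence.
fn resolve(first: i64, rest: &[(char, i64)]) -> i64 {
    let mut vals = vec![first];
    let mut ops: Vec<char> = Vec::new();
    for &(op, v) in rest {
        // Reduce any stacked operators of equal or higher precedence.
        while let Some(&top) = ops.last() {
            if prec(top) >= prec(op) {
                ops.pop();
                let b = vals.pop().unwrap();
                let a = vals.pop().unwrap();
                vals.push(apply(top, a, b));
            } else {
                break;
            }
        }
        ops.push(op);
        vals.push(v);
    }
    // Reduce whatever is left.
    while let Some(top) = ops.pop() {
        let b = vals.pop().unwrap();
        let a = vals.pop().unwrap();
        vals.push(apply(top, a, b));
    }
    vals[0]
}

fn main() {
    // 1 + 2 * 3 - 4  ==  3
    assert_eq!(resolve(1, &[('+', 2), ('*', 3), ('-', 4)]), 3);
    // 2 * 3 + 4  ==  10
    assert_eq!(resolve(2, &[('*', 3), ('+', 4)]), 10);
}
```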

  • Haskell: Haskell is ambiguous.

    The grammar is ambiguous regarding the extent of lambda abstractions, let expressions, and conditionals. The ambiguity is resolved by the meta-rule that each of these constructs extends as far to the right as possible.

    The Haskell 98 report includes this amazing passage:

    A note about parsing. Expressions that involve the interaction of fixities with the let/lambda meta-rule may be hard to parse. For example, the expression

    let x = True in x == x == True
    

    cannot possibly mean

    let x = True in (x == x == True)
    

    because (==) is a non-associative operator; so the expression must parse thus:

    (let x = True in (x == x)) == True
    

    However, implementations may well use a post-parsing pass to deal with fixities, so they may well incorrectly deliver the former parse. Programmers are advised to avoid constructs whose parsing involves an interaction of (lack of) associativity with the let/lambda meta-rule.

    Haskell 2010 removed that explanatory note, but the grammar is still ambiguous, and the "meta-rule" is still phrased exactly the same.

    In GHC, this example is a syntax error. It seems Haskell as specified is not worth parsing.

  • Agda: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.157.7899

  • "only perl can parse Perl" and other languages with extensible syntax (reader macros in lisps)

  • (from @elwoz) CSS: <style>p{text-indent:2em</style> is not an error, because end of sheet closes all open syntactic constructs without error.

What else?

Languages that predate C are excused.

HTML is excused.

Java and C# are nowhere to be found, even though they are big languages with big grammars, just like JS and Rust. It may be partly that I don't know those languages very well. But I think the language designers have to be given some credit. C# has done some pretty wild stuff over the years; by rights its grammar should be a total horror show by now.
