
@jorendorff
Last active October 7, 2023 23:18

Anomalies in mainstream programming language grammar: an incomplete freak show

Parser generators have existed forever. Why don't any serious compilers use them?

Well, Python uses one... written from scratch to parse Python, unsuitable for any other project.

One reason is that programming languages are such amazing trash. The more widely used a language is, the more likely it is to have wonderful, incredible pain points in the syntax. Below is an incomplete list that covers a few of the most mainstream languages.

To be clear: these are the quirks that affect the task of writing a parser for the language. Syntactic quirks that only matter to humans trying to use the language are given a miss.

  • C, C++, JavaScript, etc.: dangling else.

  • C++, Rust, etc.: Is >> one token or two? It depends. Nested generics can require this to be read as two tokens, as in vector<set<T>>; but it's also the right shift operator.

    Rust has the same issue with &&, which is either the logical AND operator or two address-of operators. And || is the logical OR operator or the start of a closure that takes no arguments. But the grammar can work around these problems in dumb-but-good ways: allowing the single token && as a prefix operator meaning "take the address of a temporary containing the address of"; and allowing the single token || to serve as closure arguments. The >> case is harder to address, I think.
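
    All three workarounds can be seen in action in a small runnable demo (my own example, not from the text above):

```rust
// The single `||` token serving as an empty closure-argument list.
fn call_empty_closure() -> i32 {
    let f = || 40 + 2; // `||` is one token here, not two pipes
    f()
}

// The single `&&` token as a prefix: a reference to a temporary reference.
fn double_ref() -> i32 {
    let x = 5;
    let rr: &&i32 = &&x; // one `&&` token, two borrows
    **rr
}

fn main() {
    // `>>` here must be split into two closing angle brackets,
    // not read as the right-shift operator.
    let nested: Vec<Vec<i32>> = vec![vec![1, 2]];
    assert_eq!(nested[0][1], 2);

    assert_eq!(call_empty_closure(), 42);
    assert_eq!(double_ref(), 5);
}
```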

  • C++: Is it a typename? Is it a template? The statement x*y; is ambiguous unless C++ knows whether or not x is a type. Likewise, to parse vector<t>v(x); you first need to know if vector is a template, and for a::b<c<d>::e to make any sense at all, you need to know that a::b is a template and c is not. (Or vice versa.)

    So a C++ parser must track identifiers as it goes and always know which ones are types, which are templates, and which are variables.

    I think this also resolves the ambiguity in new(x)(y), where if x is not a type, we're doing placement new and y had better be a type, but if x is a type, then (y) is an argument list.

  • C++: I think it's impossible to know what a leading const or static or extern means before parsing whatever comes after. A minor issue by C++ standards: the amount of lookahead the language requires is 1 more than whatever it would have been (infinity plus one, then, I suppose).

    (In fact, this may not even pose a problem to LR parsers.)

  • C++: ambiguity between function declarations and local variables (the "most vexing parse").

  • C++: ambiguity between local variable declarations and expressions.

    int(x), y, *const z;  // int x; int y; int *const z;
    

    Is a comma-separated list of declarations in which the first is redundantly parenthesised, whereas changing the final list element:

    int(x), y, new int;   // ((int(x)), (y), (new int));
    

    gives a list of expressions, the first two of which are redundant, and the third causes a memory leak.

    Other ambiguities are resolved by the language definition:

    int(x), y, z = 0;     // int x; int y; int z = 0;
    

    This could be an expression too, but isn’t. [...]

    (This is from "the Fog thesis".)

    Obviously the amount of lookahead you need to distinguish declarations from expressions is unbounded.

  • C++: general ambiguity between types and other expressions

  • Rust: OMG the semicolons. I have never fully understood this and maybe never will.

    You're allowed to drop the semicolon from a final expression in a Rust block. But what about this case?

    fn f() -> usize {
        { do_some_stuff() }   // no semicolon after `)`; is this ok?
        return 0;
    }
    

    If you answered yes, you're wrong. If you answered no, you are also wrong. The correct answer is, it depends.

    It turns out to be basically impossible to enforce the desired semicolon rules grammatically. To see why, consider this one:

    fn f(ok: bool) -> usize {
        if ok {
            g()   // no semicolon. is this ok?
        } else {
            ...
    

    You won't know if that omitted semicolon is acceptable or not until you know if this if statement is the last thing in the function. And you won't know that until you get to the end of the else block. Unbounded lookahead.

    The solution is simple: the Rust parser allows missing semicolons at the end of every block, including both cases above. It's overly generous. But don't get cocky. If you omit a semicolon in a block that's followed by another statement, the type checker (a later phase of compilation) will probably flag it as an error. In the above case, if do_some_stuff() happens to return (), you're fine.
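
    Here is a runnable version of that sketch, under the assumption (hypothetical names are mine) that do_some_stuff returns (): the parser accepts the missing semicolon, and the type checker is happy because the block's value has type ().

```rust
fn do_some_stuff() {} // returns `()`, which is what makes this legal

fn f() -> usize {
    { do_some_stuff() } // no semicolon: parsed fine, type-checks as `()`
    return 0;
}

fn main() {
    assert_eq!(f(), 0);
}
```

    Change do_some_stuff to return, say, usize, and the parse still succeeds, but the type checker rejects the program.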

  • Rust: No operator after a block. Rust syntax for blocks is { stmt* expr? }. That's ambiguous, since there are strings that match both { stmt expr } and { expr }.

    The rule in Rust is that the { stmt expr } parse is preferred: the body of the block { loop { break } - 1 } is a loop statement followed by the expression -1, not a single subtraction-expression.

    But watch out! In {x = {3} - 1}, the value assigned to x is 2. The block matches { expr }, because there is no { stmt expr } parse for this one; if you want x = {3} to match stmt and -1 to match expr, the statement is missing a semicolon.
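
    Both parses can be checked directly (a small demo of my own devising):

```rust
fn expression_position() -> i32 {
    // On the right of `=` we are in expression position, so `{3} - 1`
    // is a subtraction: x == 2.
    let x = { 3 } - 1;
    x
}

fn statement_position() -> i32 {
    // Inside the inner block, `loop { break }` is in statement position,
    // so the `{ stmt expr }` parse wins and `-1` is the tail expression:
    // no subtraction is parsed, and the block's value is -1.
    let v = { loop { break } - 1 };
    v
}

fn main() {
    assert_eq!(expression_position(), 2);
    assert_eq!(statement_position(), -1);
}
```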

  • Rust: Pattern or type? Amazingly, there's a place in the Rust syntax where we have a straight-up conflict between types and patterns.

    trait MyTrait {
        fn method(...
    

    The parameters of a trait method can be regular named parameters, pat : ty; or else just ty (if the method does not have a body). If that were the whole story, you'd need unbounded lookahead to determine which you've got, as both patterns and types can be (nested) tuples. Rust settles this by only permitting a rather ad hoc subset of the pat syntax here.

    TODO: details, example

  • Rust, JavaScript: Boolean parameterization of nonterminals.

    In Rust, we have if expr { ... }, but some expressions, like Point { x: 0, y: 0 }, contain {. The hack in Rust is that after if, the expression is simply not allowed to be a struct-expression.

    Similarly, in JavaScript we have for (expr; expr; expr) but we also have for (x in y); and in is a binary operator that can appear in an expr. Suppose the next few tokens are for (x in z ? including the question mark. If we're in a C-style for(;;) loop, then we should reduce x in z as a binary expression, right? But what if we're in a for-in loop?

    In JavaScript, the problem is resolved using boolean parameters that are propagated throughout the grammar. This is why for (x in y;;) is banned: immediately after for ( in the C-style for production, the nonterminal is Expression[~In], switching to (effectively) a copy of the expression grammar that does not contain the in operator.
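
    Back to the Rust case, the restriction is easy to observe (example of my own devising): a struct literal directly in an if condition is a parse error, while the parenthesized form is fine.

```rust
#[derive(PartialEq)]
struct Point { x: i32, y: i32 }

fn is_origin(p: Point) -> bool {
    // `if p == Point { x: 0, y: 0 } { ... }` would not parse: after `if`,
    // the expression may not be a struct literal, because the `{` would be
    // taken as the start of the `if` body. Parentheses restore it.
    if p == (Point { x: 0, y: 0 }) {
        true
    } else {
        false
    }
}

fn main() {
    assert!(is_origin(Point { x: 0, y: 0 }));
    assert!(!is_origin(Point { x: 1, y: 0 }));
}
```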

  • Only in JavaScript:

    • Feedback from syntactic context to lexical analysis (selecting the lexical goal symbol, to cope with JS regular expressions and template strings)

    • Automatic semicolon insertion. If a JS script or module appears to have a syntax error, but it can be fixed by inserting an imaginary semicolon just before a curly brace or at the end of a line (or adjacent to a comment containing a line terminator), or... a few other cases... JS imagines the semicolon for you.

      Some semicolons can't be imaginary, though.

      The above is pretty much how automatic semicolon insertion is specified in the language specification. Like so many wonderful things in JS, it's an amazing post hoc rationalization of "whatever Netscape implemented back in the day".

    • Lookahead assertions. They look like this in the specification: [lookahead ∉ {let [}].

      Lookahead assertions are a super useful hack. Most regexps support them. PEGs support them. It's not terribly surprising that JS uses them; it's rather more surprising that more languages don't need them.

      In JS, this is used mainly in the lexical grammar, where it's not so bad; for example, \0 is allowed as an EscapeSequence in a string, but only if the next character is not a DecimalDigit. Fine.

      But it is used liberally in the grammar too: for example, if a Statement is expected, the ambiguity in {x, y} is resolved in favor of Block, not an ExpressionStatement consisting of an ObjectLiteral, because the ExpressionStatement production starts with "[lookahead ∉ { {, function, async [no LineTerminator here] function, class, let [ }]. The negative lookahead assertion is there specifically to prevent ExpressionStatement from competing with other productions.

    • Cover grammars. (TODO: actually Python does a bit of this too)

  • JavaScript: Conditional keywords. (TODO: I think a bunch of languages do this.)

  • Python, Haskell: Indentation isn't context-free. Parsers for languages that use significant indentation seem to have custom lexers for this reason.

  • Haskell: The language allows declaring your own binary operators, as well as importing them from other files. In the latter case, you may be using an operator whose precedence and associativity are not specified anywhere in the current file.

    GHC copes with this by not even trying to generate correct parse trees for adjacent binary operators. It just constructs a list. A later stage of the compiler, which has enough information, converts the list into the desired tree by applying operator precedence and associativity.

    https://stackoverflow.com/questions/29185467/parsing-with-user-defined-operator-precedence
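
    The same trick can be sketched in a few lines of Rust (a toy of my own, not GHC's actual code): the parser emits a flat `operand (op operand)*` list, and a later pass folds it into a value once precedences are known. Associativity is ignored here; every operator is treated as left-associative.

```rust
fn apply(op: char, a: i64, b: i64) -> i64 {
    match op { '+' => a + b, '-' => a - b, '*' => a * b, _ => unreachable!() }
}

// The precedence table, which in GHC's setting is only known after parsing.
fn prec(op: char) -> u8 {
    match op { '*' => 2, _ => 1 }
}

// Fold a flat operand/operator list into a value, honoring precedence.
fn resolve(first: i64, rest: &[(char, i64)]) -> i64 {
    let mut vals = vec![first];
    let mut ops: Vec<char> = Vec::new();
    for &(op, v) in rest {
        // Reduce any stacked operators of equal or higher precedence.
        while let Some(&top) = ops.last() {
            if prec(top) >= prec(op) {
                ops.pop();
                let b = vals.pop().unwrap();
                let a = vals.pop().unwrap();
                vals.push(apply(top, a, b));
            } else {
                break;
            }
        }
        ops.push(op);
        vals.push(v);
    }
    // Reduce whatever is left.
    while let Some(top) = ops.pop() {
        let b = vals.pop().unwrap();
        let a = vals.pop().unwrap();
        vals.push(apply(top, a, b));
    }
    vals[0]
}

fn main() {
    // 1 + 2 * 3 - 4  ==  3
    assert_eq!(resolve(1, &[('+', 2), ('*', 3), ('-', 4)]), 3);
    // 2 * 3 + 4  ==  10
    assert_eq!(resolve(2, &[('*', 3), ('+', 4)]), 10);
}
```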

  • Haskell: Haskell is ambiguous.

    The grammar is ambiguous regarding the extent of lambda abstractions, let expressions, and conditionals. The ambiguity is resolved by the meta-rule that each of these constructs extends as far to the right as possible.

    The Haskell 98 report includes this amazing passage:

    A note about parsing. Expressions that involve the interaction of fixities with the let/lambda meta-rule may be hard to parse. For example, the expression

    let x = True in x == x == True
    

    cannot possibly mean

    let x = True in (x == x == True)
    

    because (==) is a non-associative operator; so the expression must parse thus:

    (let x = True in (x == x)) == True
    

    However, implementations may well use a post-parsing pass to deal with fixities, so they may well incorrectly deliver the former parse. Programmers are advised to avoid constructs whose parsing involves an interaction of (lack of) associativity with the let/lambda meta-rule.

    Haskell 2010 removed that explanatory note, but the grammar is still ambiguous, and the "meta-rule" is still phrased exactly the same.

    In GHC, this example is a syntax error. It seems Haskell as specified is not worth parsing.

  • Agda: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.157.7899

  • "only perl can parse Perl" and other languages with extensible syntax (reader macros in lisps)

  • (from @elwoz) CSS: <style>p{text-indent:2em</style> is not an error, because end of sheet closes all open syntactic constructs without error.

What else?

Languages that predate C are excused.

HTML is excused.

Java and C# are nowhere to be found, even though they are big languages with big grammars, just like JS and Rust. It may be partly that I don't know those languages very well. But I think the language designers have to be given some credit. C# has done some pretty wild stuff over the years; by rights its grammar should be a total horror show by now.
