This gist demonstrates two ways to create dynamic operators in a grammar à la Haskell, using Alex and Happy.
The grammar is extremely simple, meant only to demonstrate how one would go about achieving this. Here's an example expression:
let infix left 2 *
let infix left 1 +
let infix right 3 ^
1 * 2 + 3 ^ 5 * 4
N.B.: this example assumes that the OverloadedStrings extension is enabled.
This is the (prettified) output of using the "Dynamic" files:
>>> LexerDynamic.runAlex "let infix left 2 *\nlet infix left 1 +\nlet infix right 3 ^\n1 * 2 + 3 ^ 5 * 4" ParserDynamic.parseTest
Right
[ Declaration (DOperator (Operator "*"))
, Declaration (DOperator (Operator "+"))
, Declaration (DOperator (Operator "^"))
, Expression
(EInfix (EInfix (EInteger 1) (Operator "*") (EInteger 2))
(Operator "+")
(EInfix (EInfix (EInteger 3) (Operator "^") (EInteger 5))
(Operator "*") (EInteger 4)))
]
And this is the (prettified) output of using the "Rotation" files:
>>> let res = rotate <$> LexerRotation.runAlex "let infix left 2 *\nlet infix left 1 +\nlet infix right 3 ^\n1 * 2 + 3 ^ 5 * 4" ParserRotation.parseTest
>>> join (sequenceA <$> res)
Right
[ Declaration (DOperator LeftAssociative 2 (Operator "*"))
, Declaration (DOperator LeftAssociative 1 (Operator "+"))
, Declaration (DOperator RightAssociative 3 (Operator "^"))
, Expression
(EInfix (EInfix (EInteger 1) (Operator "*") (EInteger 2))
(Operator "+")
(EInfix (EInfix (EInteger 3) (Operator "^") (EInteger 5))
(Operator "*") (EInteger 4)))
]
This was inspired by this Stack Overflow answer.
The basic gist is to have the lexer parse just enough to recognise the associativity and precedence in operator declarations. The lexer keeps a small symbol table of the declared operators, and whenever a declared operator is seen in an expression, it emits the token annotated with that operator's precedence and associativity.
The parser then declares a limited set of fixities (four for each associativity in this gist) and uses them to parse each operator correctly.
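For concreteness, here is a rough sketch of the bookkeeping this implies. All names and types below are hypothetical simplifications for illustration, not the actual definitions from the "Dynamic" files:

```haskell
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

data Associativity = LeftAssociative | RightAssociative | NonAssociative
  deriving (Eq, Show)

data Fixity = Fixity Associativity Int
  deriving (Eq, Show)

-- The lexer keeps this table in its user state (e.g. via Alex's
-- monadUserState wrapper) and updates it while lexing.
newtype OperatorTable = OperatorTable (Map String Fixity)

-- Record a fixity as soon as the lexer has recognised a declaration of the
-- form "let infix <assoc> <prec> <op>".
declareOperator :: String -> Fixity -> OperatorTable -> OperatorTable
declareOperator name fixity (OperatorTable table) =
  OperatorTable (Map.insert name fixity table)

-- Operator tokens carry the fixity that was recorded for them, so the
-- parser can dispatch on one of its pre-declared precedence levels.
data Token
  = TInteger Integer
  | TOperator String Fixity
  deriving (Eq, Show)

-- When an operator shows up inside an expression, look up its fixity and
-- emit the annotated token; using an operator before declaring it is an
-- error in this approach.
emitOperator :: String -> OperatorTable -> Either String Token
emitOperator name (OperatorTable table) =
  case Map.lookup name table of
    Just fixity -> Right (TOperator name fixity)
    Nothing     -> Left ("operator " ++ name ++ " has no fixity declaration")
```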
This approach has a lot of downsides and isn't really worth it:
- The lexer needs to do a bit of parsing to figure out the declarations. The lexer ideally shouldn't need to know about such syntactic information.
- The lexer needs to remember a symbol table, which is not ideal.
- Fixity declarations need to appear before they are used in expressions (the other approach allows them to appear anywhere, even after their first usage).
- Some grammar code needs to be duplicated in the parser: adding three rules for each precedence level. This makes the grammar more complex.
- The number of supported precedence levels is fixed, limited by how much code the grammar author is willing to copy and paste.
- Performance might be worse than with the other approach, since there are more parser states and extra overhead in the lexer, but I haven't benchmarked the code to confirm this.
There are only two advantages I could think of:
- The parser is the one responsible for correctly balancing the tree.
- There's no need to write extra code outside of the parsing system, since the tree is already built correctly.
The code was inspired by GHC's source code, found in ghc/Expr.hs, ghc/HsType.hs, and ghc/Fixity.hs.
This code assumes that the parse tree always starts out right-nested (an always left-nested tree would work as well with a few trivial changes).
There is nothing special in the lexer for this approach, so it's pretty simple and does only the basics.
For the parser, we use %right operator so that the tree is initially kept in its rightmost derivation, no matter the associativity or precedence of the operators.
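For illustration, this is roughly the shape of the tree the parser produces for the sample input above before any rebalancing. The type definitions are reconstructed from the printed output and may differ from the ones in the gist (e.g. Text instead of String):

```haskell
-- Hypothetical, simplified definitions matching the constructors in the
-- printed output above; the gist's own types may differ.
newtype Operator = Operator String deriving (Eq, Show)

data Expression
  = EInteger Integer
  | EInfix Expression Operator Expression
  deriving (Eq, Show)

-- With `%right operator`, "1 * 2 + 3 ^ 5 * 4" first parses fully
-- right-nested, ignoring the declared fixities:
rightNested :: Expression
rightNested =
  EInfix (EInteger 1) (Operator "*")
    (EInfix (EInteger 2) (Operator "+")
      (EInfix (EInteger 3) (Operator "^")
        (EInfix (EInteger 5) (Operator "*") (EInteger 4))))
```

Compare this with the rebalanced tree in the "Rotation" output at the top.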
The magic happens in Rotation.hs, where the rotate function balances the tree according to each operator's fixity.
The gist is to visit each node in the tree recursively and rotate nodes from the innermost (rightmost) one going outwards. This is done by visiting each Expression node in the tree and performing the rotation on it.
In the case of two nested infix expressions, check whether any rotation is needed at all and act appropriately.
For a case like 1 * 2 + 3 (assuming both operators are left-associative and * binds tighter than +), the parser will initially output the equivalent of 1 * (2 + 3). Since * has a greater precedence than +, the node needs to be rotated, swapping the parentheses and resulting in (1 * 2) + 3. Had the expression been 1 + 2 * 3 instead, we'd find the opposite situation, and the tree would need no balancing.
Operators that have the same precedence but different associativities (or where at least one of them is non-associative) can't be mixed together and will cause an error instead.
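Below is a minimal, self-contained sketch of such a rebalancing pass, with hypothetical names and simplified types. The real implementation lives in Rotation.hs and may structure its traversal differently (the text above describes working from the innermost node outwards); this sketch instead folds along the right spine, in the style of the fixity-resolution code in GHC's sources mentioned earlier, and it produces the tree shown in the "Rotation" output for the sample input:

```haskell
newtype Operator = Operator String deriving (Eq, Show)

data Associativity = LeftAssociative | RightAssociative | NonAssociative
  deriving (Eq, Show)

data Fixity = Fixity Associativity Int deriving (Eq, Show)

data Expression
  = EInteger Integer
  | EInfix Expression Operator Expression
  deriving (Eq, Show)

-- How to group `_ op1 _ op2 _`, given the fixities of op1 and op2.
data Resolution = AssociateLeft | AssociateRight | Clash

resolve :: Fixity -> Fixity -> Resolution
resolve (Fixity assoc1 prec1) (Fixity assoc2 prec2)
  | prec1 > prec2 = AssociateLeft
  | prec1 < prec2 = AssociateRight
  | otherwise = case (assoc1, assoc2) of
      (LeftAssociative,  LeftAssociative)  -> AssociateLeft
      (RightAssociative, RightAssociative) -> AssociateRight
      _                                    -> Clash  -- mixing them is an error

-- Rebalance an initially right-nested tree (as produced by %right, with
-- atomic operands) according to each operator's fixity.
rotate :: (Operator -> Fixity) -> Expression -> Either String Expression
rotate fixityOf = go
  where
    go (EInfix l op r) = spine l op r
    go e               = Right e

    -- Walk the right spine, attaching one operator/operand pair at a time
    -- to an already balanced accumulator.
    spine acc op (EInfix l op' r) = do
      acc' <- attach acc op l
      spine acc' op' r
    spine acc op e = attach acc op e

    -- Attach operand e2 under operator op2 to the balanced tree e1, pushing
    -- it further right only when the fixities allow it.
    attach e1@(EInfix e11 op1 e12) op2 e2 =
      case resolve (fixityOf op1) (fixityOf op2) of
        Clash          -> Left ("cannot mix " ++ show op1 ++ " and "
                                  ++ show op2 ++ " in the same expression")
        AssociateRight -> EInfix e11 op1 <$> attach e12 op2 e2
        AssociateLeft  -> Right (EInfix e1 op2 e2)
    attach e1 op2 e2 = Right (EInfix e1 op2 e2)
```

Applied to the right-nested tree from the earlier sketch, with the fixities declared in the sample input, rotate yields the balanced tree shown in the output at the top of this gist.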
Compared to the "Dynamic" approach, this is overall easier to maintain, supports a dynamic amount of precedence levels, and results in much simpler and smaller lexer and paser codes.
For more information, check the code in Rotation.hs; it should hopefully be documented well enough and simple to comprehend.
This gist demonstrates how to declare operators with custom precedences and associativities in a manner similar to what Haskell supports. Overall, the "Rotation" approach is simpler, smaller, and more maintainable than the "Dynamic" approach described above, so I recommend preferring it; it is also the approach GHC itself uses. Both were written as proofs of concept inspired by the Stack Overflow answer linked above, and as a way to compare the two approaches.
Hopefully, if you've found this gist while struggling and have read everything so far, it will help you with your Alex and Happy needs.