Skip to content

Instantly share code, notes, and snippets.

@heitor-lassarote
Last active May 18, 2022 02:35
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save heitor-lassarote/b20d6da0a9042d31e439befb8c236a4e to your computer and use it in GitHub Desktop.
Save heitor-lassarote/b20d6da0a9042d31e439befb8c236a4e to your computer and use it in GitHub Desktop.
Dynamic infix operators with Alex and Happy

Dynamic infix operators with Alex and Happy

This gist demonstrates two ways to create dynamic operators in a grammar à là Haskell, using Alex and Happy.

The grammar is extremely simple, meant only to demonstrate how one would go about achieving this. Here's an example expression:

let infix left 2 *
let infix left 1 +
let infix right 3 ^
1 * 2 + 3 ^ 5 * 4

n.b.: This example assumes that OverloadedStrings is set.

This is the (prettified) output of using the "Dynamic" files:

>>> LexerDynamic.runAlex "let infix left 2 *\nlet infix left 1 +\nlet infix right 3 ^\n1 * 2 + 3 ^ 5 * 4" ParserDynamic.parseTest
Right
  [ Declaration (DOperator (Operator "*"))
  , Declaration (DOperator (Operator "+"))
  , Declaration (DOperator (Operator "^"))
  , Expression
    (EInfix (EInfix (EInteger 1) (Operator "*") (EInteger 2))
    (Operator "+")
      (EInfix (EInfix (EInteger 3) (Operator "^") (EInteger 5))
      (Operator "*") (EInteger 4)))
  ]

And this is the (prettified) output of using the "Rotation" files:

>>> let res = rotate <$> LexerRotation.runAlex "let infix left 2 *\nlet infix left 1 +\nlet infix right 3 ^\n1 * 2 + 3 ^ 5 * 4" ParserRotation.parseTest
>>> join (sequenceA <$> res)
Right
  [ Declaration (DOperator LeftAssociative 2 (Operator "*"))
  , Declaration (DOperator LeftAssociative 1 (Operator "+"))
  , Declaration (DOperator RightAssociative 3 (Operator "^"))
  , Expression
    (EInfix (EInfix (EInteger 1) (Operator "*") (EInteger 2))
    (Operator "+")
      (EInfix (EInfix (EInteger 3) (Operator "^") (EInteger 5))
      (Operator "*") (EInteger 4)))
  ]

Dynamic fixity parsing (not recommended)

This was inspired by this Stack Overflow answer.

The basic gist is to have the lexer parse just enough to be able to see the associativity and precedence of operator declarations. Once this is done, it keeps a small symbol table with the declared operators. If a declared operator is seen in an expression, it will insert the precedence and associativity with it and emit the token.

The parser will then declare a limited set of fixities (4 for each associativity in this gist) and use it to correctly parse each operator.

This approach has a lot of downsides and it isn't really worth it.

  • The lexer needs to do a bit of parsing to figure out the declarations. The lexer ideally shouldn't need to know about such syntactic information.
  • The lexer needs to remember a symbol table, which is not ideal.
  • Fixity declarations need to appear before they are used in expressions (the other approach allows them to appear anywhere, even after their first usage).
  • Some grammar code needs to be duplicated in the parser: adding three rules for each precedence level. This makes the grammar more complex.
  • The quantity of supported precedences is static and limited by how much the user is willing to copy-and-paste.
  • Performance might be worse than the other approach, as there are more parser states and extra overhead in the lexer, but I haven't really benchmarked the code to confirm that this is the case.

There are only two advantages I could think of:

  • The parser is the one responsible for correctly balancing the tree.
  • There's no need to write extra code outside of the parsing system, since the tree is already built correctly.

Tree rotation (recommended)

The code was inspired by GHC's source code, found in ghc/Expr.hs, ghc/HsType.hs, and ghc/Fixity.hs.

This code makes the assumption that the parse tree is always right-balanced (but always left-balanced will work as well with a few trivial changes).

There is nothing special in the lexer for this code, so it's pretty simple and does only the basic. For the parser, we use %right operator to always keep the tree in its rightmost derivation initially, no matter the associativity or precedence of the operators.

The magic happens in Rotation.hs, where the rotate function will balance the tree accordingly to each operator. The gist is to visit each node in the tree recursively, and perform the rotation of the nodes from the innermost (rightmost) node going outwards.

This is done by visiting each Expression node in the tree and perfoming the rotation on them. In the case of two nested infix expressions, check whether any rotation is needed at all and act appropriately.

For a case like 1 * 2 + 3 (assuming both operators are left-associative, and * binds tighter than +), the parser will output the equivalent of 1 * (2 + 3). Since * has a greater precedence than +, it needs to be rotated. This causes the parenthesis to be swapped, resulting in (1 * 2) + 3. Had the expression been 1 + 2 * 3 instead, we'd find the opposite situation, and the tree would need no balancing.

Operators that have the same precedence but different associativities (or at least one of them is non-associative) can't be mixed together and will cause an error instead.

Compared to the "Dynamic" approach, this is overall easier to maintain, supports a dynamic amount of precedence levels, and results in much simpler and smaller lexer and paser codes.

For more information, check the code for Rotation.hs, it should hopefully be documented enough and the code simple to comprehend.

Conclusion

This gist demonstrates how to declare operators with custom precedences and associativities in a similar manner that is supported by Haskell. Overall, the "Rotation" approach is simpler, smaller, and more maintainable than the "Dynamic" approach described here, and so I recommend to prefer it. This is the approach used by GHC itself as well. Both were made as proof of concepts inspired by the Stack Overflow answer linked above, and a way to compare both approaches.

Hopefully, if you've found this gist in a struggle and read everything so far, it will help you with you Alex and Happy needs.

{
module LexerDynamic
( Token (..)
, Operator (..)
, Associativity (..)
, Precedence
, alexMonadScan
, Alex
, runAlex
, alexError
) where
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.List (find)
}
%wrapper "monadUserState-bytestring"
@operator = [!@\#\$\%\^&\*\-=\+\/\?\,\.]+
tokens :-
<0> $white+ ;
-- Dec
<0> let { \_ _ -> pure TLet }
<0> infix { begin assoc }
<assoc> $white+ ;
<assoc> left { pushAssoc LeftAssociative }
<assoc> right { pushAssoc RightAssociative }
<assoc> [0-9]+ { pushPrec `andBegin` op }
<op> $white+ ;
<op> @operator { mkInfix `andBegin` 0 }
-- Expr
<0> @operator { mkOp }
<0> [a-zA-Z_'][a-zA-Z_'0-9]* { \(_, _, str, _) len -> pure $ TIdentifier $ BS.take len str }
<0> [0-9]+ { \inp len -> pure $ TInteger $ readBS inp len }
{
get :: Alex AlexUserState
get = Alex $ \s' -> let s = alex_ust s' in Right (s', s)
put :: AlexUserState -> Alex ()
put s = Alex $ \s' -> Right (s'{alex_ust = s}, ())
modify :: (AlexUserState -> AlexUserState) -> Alex ()
modify f = Alex $ \s' -> Right (s'{alex_ust = f (alex_ust s')}, ())
pushAssoc :: Associativity -> AlexAction Token
pushAssoc a inp len = do
modify $ \s -> s{currentOpDec = (Just a, snd $ currentOpDec s)}
skip inp len
pushPrec :: AlexAction Token
pushPrec inp len = do
modify $ \s -> s{currentOpDec = (fst $ currentOpDec s, Just $ readBS inp len)}
alexMonadScan
mkInfix :: AlexAction Token
mkInfix (_, _, str, _) len = do
s <- get
let
(a, p) = currentOpDec s
o = Operator (maybe NonAssociative id a) (maybe 0 id p) (BS.take len str)
put s
{ operators = o : operators s
, currentOpDec = (Nothing, Nothing)
}
pure $ TOperator o
mkOp :: AlexAction Token
mkOp (_, _, str, _) len = do
ops <- operators <$> get
let opName = BS.take len str
-- Not very efficient, but enough for demonstration purposes.
case find (\(Operator _ _ opName') -> opName == opName') ops of
Nothing -> alexError "Operator not found"
Just o -> pure $ TOperator o
readBS :: AlexInput -> Int64 -> Integer
readBS (_, _, str, _) len = read $ BS.unpack $ BS.take len str
data Token
= TOperator Operator
| TIdentifier ByteString
| TInteger Integer
| TLet
| TEOF
deriving (Show)
data Operator
= Operator Associativity Precedence ByteString
deriving (Show)
data Associativity
= NonAssociative
| LeftAssociative
| RightAssociative
deriving (Show)
type Precedence = Integer
data AlexUserState = AlexUserState
{ operators :: [Operator]
, currentOpDec :: (Maybe Associativity, Maybe Precedence)
}
alexInitUserState :: AlexUserState
alexInitUserState = AlexUserState
{ operators = []
, currentOpDec = (Nothing, Nothing)
}
alexEOF :: Alex Token
alexEOF = pure TEOF
}
{
module LexerRotation
( Token (..)
, alexMonadScan
, Alex
, runAlex
, alexError
) where
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy.Char8 as BS
}
%wrapper "monad-bytestring"
@operator = [!@\#\$\%\^&\*\-=\+\/\?\,\.]+
tokens :-
<0> $white+ ;
-- Dec
<0> let { \_ _ -> pure TLet }
<0> infix { \_ _ -> pure TInfix }
<0> left { \_ _ -> pure TLeft }
<0> right { \_ _ -> pure TRight }
-- Expr
<0> @operator { \(_, _, str, _) len -> pure $ TOperator $ BS.take len str }
<0> [a-zA-Z_'][a-zA-Z_'0-9]* { \(_, _, str, _) len -> pure $ TIdentifier $ BS.take len str }
<0> [0-9]+ { \(_, _, str, _) len -> pure $ TInteger $ read $ BS.unpack $ BS.take len str }
{
data Token
= TOperator ByteString
| TIdentifier ByteString
| TInteger Integer
| TLet
| TInfix
| TLeft
| TRight
| TEOF
deriving (Show)
alexEOF :: Alex Token
alexEOF = pure TEOF
}
{
module ParserDynamic
( parseTest
, Operator (..)
, Identifier (..)
, Example (..)
, Expression (..)
, Associativity (..)
, Precedence
, Declaration (..)
) where
import Data.ByteString.Lazy (ByteString)
import LexerDynamic (Alex, Associativity (..), Precedence, Token (..), alexMonadScan)
import qualified LexerDynamic as L
}
%expect 0
%name parseTest examples
%tokentype { Token }
%error { parseError }
%monad { Alex } { >>= } { pure }
%lexer { lexer } { TEOF }
%token
identifier { TIdentifier $$ }
integer { TInteger $$ }
let { TLet }
infix0 { TOperator (L.Operator NonAssociative 0 $$) }
infixL0 { TOperator (L.Operator LeftAssociative 0 $$) }
infixR0 { TOperator (L.Operator RightAssociative 0 $$) }
infix1 { TOperator (L.Operator NonAssociative 1 $$) }
infixL1 { TOperator (L.Operator LeftAssociative 1 $$) }
infixR1 { TOperator (L.Operator RightAssociative 1 $$) }
infix2 { TOperator (L.Operator NonAssociative 2 $$) }
infixL2 { TOperator (L.Operator LeftAssociative 2 $$) }
infixR2 { TOperator (L.Operator RightAssociative 2 $$) }
infix3 { TOperator (L.Operator NonAssociative 3 $$) }
infixL3 { TOperator (L.Operator LeftAssociative 3 $$) }
infixR3 { TOperator (L.Operator RightAssociative 3 $$) }
%nonassoc infix0
%left infixL0
%right infixR0
%nonassoc infix1
%left infixL1
%right infixR1
%nonassoc infix2
%left infixL2
%right infixR2
%nonassoc infix3
%left infixL3
%right infixR3
%%
examples :: { [Example] }
: { [] }
| example examples { $1 : $2 }
example :: { Example }
: expression { Expression $1 }
| declaration { Declaration $1 }
operator :: { Operator }
: infix0 { Operator $1 }
| infixL0 { Operator $1 }
| infixR0 { Operator $1 }
| infix1 { Operator $1 }
| infixL1 { Operator $1 }
| infixR1 { Operator $1 }
| infix2 { Operator $1 }
| infixL2 { Operator $1 }
| infixR2 { Operator $1 }
| infix3 { Operator $1 }
| infixL3 { Operator $1 }
| infixR3 { Operator $1 }
-- Expr
expression :: { Expression }
: expression infix0 expression { EInfix $1 (Operator $2) $3 }
| expression infixL0 expression { EInfix $1 (Operator $2) $3 }
| expression infixR0 expression { EInfix $1 (Operator $2) $3 }
| expression infix1 expression { EInfix $1 (Operator $2) $3 }
| expression infixL1 expression { EInfix $1 (Operator $2) $3 }
| expression infixR1 expression { EInfix $1 (Operator $2) $3 }
| expression infix2 expression { EInfix $1 (Operator $2) $3 }
| expression infixL2 expression { EInfix $1 (Operator $2) $3 }
| expression infixR2 expression { EInfix $1 (Operator $2) $3 }
| expression infix3 expression { EInfix $1 (Operator $2) $3 }
| expression infixL3 expression { EInfix $1 (Operator $2) $3 }
| expression infixR3 expression { EInfix $1 (Operator $2) $3 }
| atom { $1 }
atom :: { Expression }
: identifier { EIdentifer (Identifier $1) }
| integer { EInteger $1 }
-- Dec
declaration :: { Declaration }
: let operator { DOperator $2 }
{
newtype Operator
= Operator ByteString
deriving (Show)
newtype Identifier
= Identifier ByteString
deriving (Show)
data Example
= Expression Expression
| Declaration Declaration
deriving (Show)
data Expression
= EInfix Expression Operator Expression
| EIdentifer Identifier
| EInteger Integer
deriving (Show)
newtype Declaration
= DOperator Operator
deriving (Show)
parseError :: Token -> Alex a
parseError _ = L.alexError "parse error"
lexer :: (Token -> Alex a) -> Alex a
lexer = (=<< alexMonadScan)
}
{
module ParserRotation
( parseTest
, Operator (..)
, Identifier (..)
, Example (..)
, Expression (..)
, Associativity (..)
, Precedence
, Declaration (..)
) where
import Data.ByteString.Lazy (ByteString)
import LexerRotation (Alex, Token (..), alexMonadScan)
import qualified LexerRotation as L
}
%expect 0
%name parseTest examples
%tokentype { Token }
%error { parseError }
%monad { Alex } { >>= } { pure }
%lexer { lexer } { TEOF }
%token
operator { TOperator $$ }
identifier { TIdentifier $$ }
integer { TInteger $$ }
let { TLet }
infix { TInfix }
left { TLeft }
right { TRight }
%right operator
%%
examples :: { [Example] }
: { [] }
| example examples { $1 : $2 }
example :: { Example }
: expression { Expression $1 }
| declaration { Declaration $1 }
-- Expr
expression :: { Expression }
: expression operator expression { EInfix $1 (Operator $2) $3 }
| atom { $1 }
atom :: { Expression }
: identifier { EIdentifer (Identifier $1) }
| integer { EInteger $1 }
-- Dec
associativity :: { Associativity }
: { NonAssociative }
| left { LeftAssociative }
| right { RightAssociative }
declaration :: { Declaration }
: let infix associativity integer operator { DOperator $3 $4 (Operator $5) }
{
newtype Operator
= Operator ByteString
deriving (Eq, Show)
newtype Identifier
= Identifier ByteString
deriving (Show)
data Example
= Expression Expression
| Declaration Declaration
deriving (Show)
data Expression
= EInfix Expression Operator Expression
| EIdentifer Identifier
| EInteger Integer
deriving (Show)
data Associativity
= NonAssociative
| LeftAssociative
| RightAssociative
deriving (Eq, Show)
type Precedence = Integer
data Declaration
= DOperator Associativity Precedence Operator
deriving (Show)
parseError :: Token -> Alex a
parseError _ = L.alexError "parse error"
lexer :: (Token -> Alex a) -> Alex a
lexer = (=<< alexMonadScan)
}
module Rotation
( rotate
) where
import Data.List (find)
import Data.Maybe (mapMaybe)
import ParserRotation
-- | Extracts all operator declarations from the tree.
getOperators :: [Example] -> [Declaration]
getOperators = mapMaybe go
where
go :: Example -> Maybe Declaration
go (Expression _) = Nothing
go (Declaration d) = Just d
-- | Describes which action the rotation algorithm should use.
data Rotation
= Fail -- ^ Fail due to the mixing of incompatible operators.
| Keep -- ^ Keep the tree as it is.
| Rotate -- ^ Balance the tree to the left.
-- | Suppose * has a larger precedence than +. An expression 1 * 2 + 3 will
-- create a tree equivalent to 1 * (2 + 3) because Happy will build its
-- rightmost derivation. We compare the operators and indicate how to rotate the
-- tree.
shouldRotate :: Declaration -> Declaration -> Rotation
shouldRotate (DOperator a p _) (DOperator a' p' _) = case compare p p' of
LT -> Keep
EQ -> case (a, a') of
(LeftAssociative, LeftAssociative) -> Rotate
(RightAssociative, RightAssociative) -> Keep
(_, _) -> Fail
GT -> Rotate
-- | Rebalances the tree to respect the associativity and precedence of the
-- parsed operators.
--
-- As an example, suppose the following expression was parsed and given as the
-- input to this function: 1 ^ (2 ^ (3 * 4))
-- Recurse into 2 ^ (3 * 4):
-- Recurse into 3 * 4:
-- Nothing to do here. Return 3 * 4
-- In this case, we find that we need to rotate the tree, since the left
-- operator has bigger precedence than the right one. We rebalance the tree.
-- Return (2 ^ 3) * 4
-- We get 1 ^ ((2 ^ 3) * 4) as the new tree. Note that we now compare it like
-- 1 ^ (_ * 4) where _ is something we don't care about. As previously, we
-- rebalance the tree, so we get (1 ^ _) * 4.
-- Return (1 ^ (2 ^ 3)) * 4
rotate :: [Example] -> [Either String Example]
rotate tree = map goExample tree
where
ops :: [Declaration]
ops = getOperators tree
-- Not very efficient, but enough for demonstration purposes.
findOp :: Operator -> Either String Declaration
findOp o = maybe (Left "Operator not found") Right $ find (\(DOperator _ _ o') -> o == o') ops
goExample :: Example -> Either String Example
goExample (Expression e) = Expression <$> go e
goExample d@(Declaration _) = Right d
go :: Expression -> Either String Expression
go (EInfix l op r) = do
-- Rotating the left side is unneeded since this grammar is very simple.
-- This is because trees are always right-balanced and the left side is
-- always an atom.
lRotated <- go l
rRotated <- go r
case rRotated of
EInfix l' op' r' -> do
opDec <- findOp op
opDec' <- findOp op'
case shouldRotate opDec opDec' of
Fail -> Left "Cannot mix operators with equal precedences and different (or null) associativities"
Keep -> Right $ EInfix lRotated op rRotated
Rotate -> Right $ EInfix (EInfix lRotated op l') op' r'
_ -> Right $ EInfix lRotated op rRotated
go e@(EIdentifer _) = Right e
go e@(EInteger _) = Right e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment