Skip to content

Instantly share code, notes, and snippets.

@jvranish
Created August 24, 2011 19:19
Show Gist options
  • Star 27 You must be signed in to star a gist
  • Fork 5 You must be signed in to fork a gist
  • Save jvranish/73f119c5ebda3f776507 to your computer and use it in GitHub Desktop.
Save jvranish/73f119c5ebda3f776507 to your computer and use it in GitHub Desktop.
A toy compiler written in haskell
To try it out paste the following code into a test.toy file:
def fib(x)
if x < 3 then
1
else
fib(x-1)+fib(x-2)
def start()
fib(40)
This example program was mostly taken from the LLVM tutorial.
(see http://llvm.org/docs/tutorial/index.html)
To try it out:
On Mac:
brew install haskell-platform
brew install llvm --universal
On Linux:
Use your respective package manager to install the haskell platform and the llvm libraries
(make sure to include the development versions)
On Windows:
Download the haskell platform from http://hackage.haskell.org/platform/windows.html
install LLVM - this is left as an exercise for the reader...
On all platforms:
You'll have to pull down and install my slightly modified version of the LLVM bindings.
I needed to make changes to the LLVM bindings to make then useful for generating code
from an AST.
git clone git@github.com:jvranish/llvm.git
cd llvm
cabal install
To build our compiler:
ghc -Wall --make ToyCompiler.lhs
Then you can run our toy program with a JIT compiler!:
./ToyCompiler --jit test.toy
Or you can just compile it to native machine code:
./ToyCompiler test.toy
To link the compiled bitcode:
llvm-ld test.bc -native -o test
Try it out!:
./test
If you are curious what the LLVM disassembly for our toy program looks like run this:
llvm-dis < test.bc
> module Main where
> import System
> import System.FilePath
> import Text.Parsec
> import Text.Parsec.Expr
> import Text.Parsec.String
> import qualified Text.Parsec.Token as Token
> import qualified Data.Map as Map
> import LLVM.FFI.Core
> import LLVM.FFI.Analysis
> import LLVM.Core.Instructions
> import LLVM.ExecutionEngine
> import LLVM.Core
> import qualified LLVM.Core.Util as U
> import Foreign.C.String
> import Foreign.Ptr
> import Control.Monad
Define the lexer style and the tokens we are going to use.
The Haskell Parsec library has some very convenient functions to do this.
> lexerStyle :: Token.LanguageDef ()
> lexerStyle = Token.LanguageDef
> { Token.commentStart = "{-"
> , Token.commentEnd = "-}"
> , Token.commentLine = "--"
> , Token.nestedComments = True
> , Token.identStart = letter
> , Token.identLetter = alphaNum <|> oneOf "_"
> , Token.opStart = Token.opLetter lexerStyle
> , Token.opLetter = oneOf "`~!@$%^&*-+=;:<>./?"
> , Token.reservedOpNames= []
> , Token.reservedNames = ["if", "then", "else", "def"]
> , Token.caseSensitive = True
> }
>
> lexer :: Token.TokenParser ()
> lexer = Token.makeTokenParser lexerStyle
>
> parens :: Parser a -> Parser a
> parens = Token.parens lexer
>
> natural :: Parser Integer
> natural = Token.natural lexer
>
> identifier :: Parser String
> identifier = Token.identifier lexer
>
> reservedOp :: String -> Parser ()
> reservedOp = Token.reservedOp lexer
>
> reserved :: String -> Parser ()
> reserved = Token.reserved lexer
>
> whiteSpace :: Parser ()
> whiteSpace = Token.whiteSpace lexer
>
> comma :: Parser String
> comma = Token.comma lexer
Here is the Abstract Syntax Tree for our expressions.
The Algebraic Data Types in haskell make building and breaking down the AST extremely straightforward.
> data Expr = Number Int
> | Identifier String
> | Call String [Expr]
> | Add Expr Expr
> | Sub Expr Expr
> | Less Expr Expr
> | If Expr Expr Expr
> deriving (Show)
>
> data FuncDef = Func String [String] Expr
> deriving (Show)
parseExpr take a table of operator parsers (in order of operator precidence) and builds an expression
parser that takes into account fixity (prefix, postfix, infix...) and precidence. It's super handy.
The operator parser for '+' is reservedOp "+" >> return Add, which parser the operator and then constructs an 'Add' element of the AST.
For more information about how this works look here, or here.
http://hackage.haskell.org/packages/archive/parsec/3.1.1/doc/html/Text-Parsec-Expr.html
http://legacy.cs.uu.nl/daan/download/parsec/parsec.html
> parseExpr :: Parser Expr
> parseExpr = buildExpressionParser opTable parseFactor <?> "expression"
> where
> opTable = [ [ Infix (reservedOp "+" >> return Add) AssocRight
> , Infix (reservedOp "-" >> return Sub) AssocLeft
> ]
> , [ Infix (reservedOp "<" >> return Less) AssocRight
> ]
> ]
parseFactor parses the fundamental units of our expressions
> parseFactor :: Parser Expr
> parseFactor = parens parseExpr
> <|> parseIf
> <|> liftM (Number . fromInteger) natural
> <|> try parseCall -- this needs the 'try' because the parseCall could succeed in
> -- parsing the identifier, but then fail to parse the argument
> -- list. In this case we want it to back up again and try
> -- parsing the identifier
> <|> liftM Identifier identifier
> <?> "factor"
These parsers parse the variaous elements of our toy language and construct their respective components of the AST.
> parseCall :: Parser Expr
> parseCall = do
> funcName <- identifier
> params <- parens $ sepBy1 parseExpr comma
> return $ Call funcName params
>
> parseIf :: Parser Expr
> parseIf = do
> reserved "if"
> condition <- parseExpr
> reserved "then"
> thenExpr <- parseExpr
> reserved "else"
> elseExpr <- parseExpr
> return $ If condition thenExpr elseExpr
>
> parseFuncDef :: Parser FuncDef
> parseFuncDef = do
> reserved "def"
> funcName <- identifier
> params <- parens $ sepBy identifier comma
> body <- parseExpr
> return $ Func funcName params body
>
> parseDefs :: Parser [FuncDef]
> parseDefs = many parseFuncDef <?> "function definition"
>
The parseLex function discards any leading whitespace and forces the passed in parser to
parse the rest of the input string.
> parseLex :: Parser a -> Parser a
> parseLex p = do
> whiteSpace
> x <- p
> eof
> return x
>
> parseArg :: (Monad m) => [String] -> m (Bool, String)
> parseArg [] = fail "Please pass in a file"
> parseArg ("--jit":x:_) = return (True, x)
> parseArg (x:_) = return (False, x)
>
> main :: IO ()
> main = do
> args <- getArgs
> (runJitNow, fileName) <- parseArg args
> parsedModuleOrError <- parseFromFile (parseLex $ parseDefs) fileName
> let baseName = fst $ splitExtension fileName
> case parsedModuleOrError of
> Left err -> print err
> Right defs -> do
> runJit <- genModule (baseName ++ ".bc") defs
> when runJitNow runJit -- Run it!
The following code is _way_ uglier than it should be; the LLVM haskell bindings for plain code
generation from an AST have not yet been properly haskellized.
Hopefully newer versions of haskell's LLVM bindings will provide a cleaner monadic interface.
> genModule :: String -> [FuncDef] -> IO (IO ())
> genModule filename defs = do
> modRef <- newModule
> builderPtr <- U.createBuilder
> U.withBuilder builderPtr $ \builder -> do
> mapM_ (genFunction builder modRef) defs
> start <- getExistingFunction modRef "start"
> mainFunc <- genMain builder modRef start
> _ <- verifyModule (U.fromModule modRef) 0 nullPtr -- check for consistency
> U.writeBitcodeToFile filename modRef -- write out our bitcode to a file
> initializeNativeTarget
This line runs the JIT compiler on our source and gives us a function we can run
> simpleFunction (return $ (Value mainFunc :: Function (IO ())))
This generates our main function, which calls 'start' and prints the result using 'printf'
from the C runtime.
> genMain :: BuilderRef -> U.Module -> U.Function -> IO U.Function
> genMain builder modRef startFunc = do
> displayString<- U.addGlobal modRef InternalLinkage "DisplayString" (arrayType int8Type 20)
> setGlobalConstant displayString 1
> setInitializer displayString (U.constStringNul "The result was: %d\n")
> printf <- U.addFunction modRef ExternalLinkage "printf" (U.functionType True int32Type [pointerType int8Type 0, int32Type])
> mainFunc <- U.addFunction modRef ExternalLinkage "main" (U.functionType False int32Type [])
> basicBlock <- U.appendBasicBlock mainFunc "entry"
> positionAtEnd builder basicBlock -- sets the point were we
> result <- U.makeCall startFunc builder []
> strPtr <- U.getElementPtr0 builder displayString
> _ <- U.makeCall printf builder [strPtr, result]
> _ <- buildRet builder (constInt int32Type 0 0)
> return mainFunc
This is the code generator for an Expr. This is where most of the magic happens.
> genExpr :: BuilderRef -> U.Module -> Map.Map String ValueRef -> Expr -> IO ValueRef
> genExpr builder modRef = genExpr'
> where
> genExpr' env expr = case expr of
For a number, just emit a constant 32bit integer with the specified value
> (Number a) -> return $ constInt int32Type (fromIntegral a) (fromIntegral a)
For our operators, we generate code for both of our operands first, and then emit the instruction
that performs the desired operation.
> (Add a b) -> do
> a' <- genExpr' env a
> b' <- genExpr' env b
> buildAdd' a' b'
> (Sub a b) -> do
> a' <- genExpr' env a
> b' <- genExpr' env b
> buildSub' a' b'
For our comparison operator, we are going to do something horrible and borrow a practice from C.
We are going to have our comparison return an integer value.
LLVM's comparision instructions return only a single bit result.
We're going to need to cast it to a 32bit integer to match our other values.
> (Less a b) -> do
> a' <- genExpr' env a
> b' <- genExpr' env b
> cmpR <- buildICmp' IntSLT a' b' "cmpLTtmp"
> buildZExt' cmpR int32Type "castBit"
Lookup identifier in local enviroment (which is currently just the function parameters)
and use that value.
> (Identifier name) -> case Map.lookup name env of
> Just a -> return a
> Nothing -> error ("Unknown variable name: " ++ name)
Our 'if' expression is by far the most complex expression to generate.
But it's still pretty straight forward.
> (If cond thenExpr elseExpr) -> do
> condV <- genExpr' env cond
Build instruction to test if condition is equal to zero.
> comp <- buildICmp' IntEQ condV (constInt int32Type 0 0) "ifcond"
> parentFunc <- getBasicBlockParent =<< getInsertBlock builder
Basic blocks are the boxes in a control flow graph.
The last instruction in a basic block is always a branch instruction.
> thenBlock <- U.appendBasicBlock parentFunc "then"
> elseBlock <- U.appendBasicBlock parentFunc "else"
> mergeBlock <- U.appendBasicBlock parentFunc "ifcont"
> -- branch to desired block based on comparison
> _ <- buildCondBr builder comp elseBlock thenBlock
Generate code for the 'then' block from the thenExpr and then branch to the merge block.
> positionAtEnd builder thenBlock
> -- generate code for the 'then' block from the thenExpr
> thenV <- genExpr' env thenExpr
> -- branch back to the merge block
> _ <- buildBr builder mergeBlock
Generate code for the 'else' block from the elseExpr, and then branch to the merge block.
> positionAtEnd builder elseBlock
> elseV <- genExpr' env elseExpr
> _ <- buildBr builder mergeBlock
We need a merge block because LLVM has Static Single Assignment (SSA) and uses a phi instruction
to conditionally select values. See http://en.wikipedia.org/wiki/Static_single_assignment_form
Most modern low level intermediate representations use SSA because it makes it much easier to perform
sophisticated transformations and analysis. It's also easier for humans to reason about.
The phi instruction selects a different value depending on which block we arrived here from.
If we came from the 'then' block, use the thenV value, if we came from the 'else' block, use
the elseV value.
> positionAtEnd builder mergeBlock
> buildPhi' int32Type [(thenV, thenBlock), (elseV, elseBlock)] "iftmp"
> (Call name args) -> do
> args' <- mapM (genExpr' env) args
> func <- getExistingFunction modRef name
> U.makeCall func builder args'
These are just helper functions that should really just exist (in an imporved form) in the LLVM
bindings. But here they are for now.
> buildPhi' typ valuesAndBlocks tag = do
> withCString tag $ \tmp -> do
> phiResult <- buildPhi builder typ tmp
> U.addPhiIns phiResult valuesAndBlocks
> return phiResult
> buildICmp' p l r tag = withCString tag $ \tmp -> buildICmp builder (fromIntPredicate p) l r tmp
> buildAdd' a b = withCString "addtmp" $ \tmp -> buildAdd builder a b tmp
> buildSub' a b = withCString "subtmp" $ \tmp -> buildSub builder a b tmp
> buildZExt' a typ tag = withCString tag $ \tmp -> buildZExt builder a typ tmp
This function generates the code for a .toy function.
> genFunction :: BuilderRef -> U.Module -> FuncDef -> IO ValueRef
> genFunction builder modRef (Func name params body) = do
> func <- U.addFunction modRef ExternalLinkage name (U.functionType False int32Type (fmap (const int32Type) params))
> basicBlock <- U.appendBasicBlock func "entry"
> positionAtEnd builder basicBlock -- sets the point were we
Construct a lookup table of all the parameters for this function.
We pull from this table when we want to know what value an identifier is pointing to.
> let env = Map.fromList $ zip params $ fmap (getParam func) [0..(fromIntegral (length params)) - 1]
Generate code for the function body, and then the return instruction.
> retValue <- genExpr builder modRef env body
> _ <- buildRet builder retValue
> _ <- verifyFunction func 0 -- this is an LLVM function to check for consistency
> return func
> getExistingFunction :: U.Module -> String -> IO ValueRef
> getExistingFunction modRef name =
> withCString name $ \name' -> getNamedFunction (U.fromModule modRef) name'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment