-
-
Save jvranish/73f119c5ebda3f776507 to your computer and use it in GitHub Desktop.
A toy compiler written in haskell
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
To try it out paste the following code into a test.toy file: | |
def fib(x) | |
if x < 3 then | |
1 | |
else | |
fib(x-1)+fib(x-2) | |
def start() | |
fib(40) | |
This example program was mostly taken from the LLVM tutorial. | |
(see http://llvm.org/docs/tutorial/index.html) | |
To try it out: | |
On Mac: | |
brew install haskell-platform | |
brew install llvm --universal | |
On Linux: | |
Use your respective package manager to install the haskell platform and the llvm libraries | |
(make sure to include the development versions) | |
On Windows: | |
Download the haskell platform from http://hackage.haskell.org/platform/windows.html | |
install LLVM - this is left as an exercise for the reader... | |
On all platforms: | |
You'll have to pull down and install my slightly modified version of the LLVM bindings. | |
I needed to make changes to the LLVM bindings to make then useful for generating code | |
from an AST. | |
git clone git@github.com:jvranish/llvm.git | |
cd llvm | |
cabal install | |
To build our compiler: | |
ghc -Wall --make ToyCompiler.lhs | |
Then you can run our toy program with a JIT compiler!: | |
./ToyCompiler --jit test.toy | |
Or you can just compile it to native machine code: | |
./ToyCompiler test.toy | |
To link the compiled bitcode: | |
llvm-ld test.bc -native -o test | |
Try it out!: | |
./test | |
If you are curious what the LLVM disassembly for our toy program looks like run this: | |
llvm-dis < test.bc | |
> module Main where | |
> import System | |
> import System.FilePath | |
> import Text.Parsec | |
> import Text.Parsec.Expr | |
> import Text.Parsec.String | |
> import qualified Text.Parsec.Token as Token | |
> import qualified Data.Map as Map | |
> import LLVM.FFI.Core | |
> import LLVM.FFI.Analysis | |
> import LLVM.Core.Instructions | |
> import LLVM.ExecutionEngine | |
> import LLVM.Core | |
> import qualified LLVM.Core.Util as U | |
> import Foreign.C.String | |
> import Foreign.Ptr | |
> import Control.Monad | |
Define the lexer style and the tokens we are going to use. | |
The Haskell Parsec library has some very convenient functions to do this. | |
> lexerStyle :: Token.LanguageDef () | |
> lexerStyle = Token.LanguageDef | |
> { Token.commentStart = "{-" | |
> , Token.commentEnd = "-}" | |
> , Token.commentLine = "--" | |
> , Token.nestedComments = True | |
> , Token.identStart = letter | |
> , Token.identLetter = alphaNum <|> oneOf "_" | |
> , Token.opStart = Token.opLetter lexerStyle | |
> , Token.opLetter = oneOf "`~!@$%^&*-+=;:<>./?" | |
> , Token.reservedOpNames= [] | |
> , Token.reservedNames = ["if", "then", "else", "def"] | |
> , Token.caseSensitive = True | |
> } | |
> | |
> lexer :: Token.TokenParser () | |
> lexer = Token.makeTokenParser lexerStyle | |
> | |
> parens :: Parser a -> Parser a | |
> parens = Token.parens lexer | |
> | |
> natural :: Parser Integer | |
> natural = Token.natural lexer | |
> | |
> identifier :: Parser String | |
> identifier = Token.identifier lexer | |
> | |
> reservedOp :: String -> Parser () | |
> reservedOp = Token.reservedOp lexer | |
> | |
> reserved :: String -> Parser () | |
> reserved = Token.reserved lexer | |
> | |
> whiteSpace :: Parser () | |
> whiteSpace = Token.whiteSpace lexer | |
> | |
> comma :: Parser String | |
> comma = Token.comma lexer | |
Here is the Abstract Syntax Tree for our expressions. | |
The Algebraic Data Types in haskell make building and breaking down the AST extremely straightforward. | |
> data Expr = Number Int | |
> | Identifier String | |
> | Call String [Expr] | |
> | Add Expr Expr | |
> | Sub Expr Expr | |
> | Less Expr Expr | |
> | If Expr Expr Expr | |
> deriving (Show) | |
> | |
> data FuncDef = Func String [String] Expr | |
> deriving (Show) | |
parseExpr take a table of operator parsers (in order of operator precidence) and builds an expression | |
parser that takes into account fixity (prefix, postfix, infix...) and precidence. It's super handy. | |
The operator parser for '+' is reservedOp "+" >> return Add, which parser the operator and then constructs an 'Add' element of the AST. | |
For more information about how this works look here, or here. | |
http://hackage.haskell.org/packages/archive/parsec/3.1.1/doc/html/Text-Parsec-Expr.html | |
http://legacy.cs.uu.nl/daan/download/parsec/parsec.html | |
> parseExpr :: Parser Expr | |
> parseExpr = buildExpressionParser opTable parseFactor <?> "expression" | |
> where | |
> opTable = [ [ Infix (reservedOp "+" >> return Add) AssocRight | |
> , Infix (reservedOp "-" >> return Sub) AssocLeft | |
> ] | |
> , [ Infix (reservedOp "<" >> return Less) AssocRight | |
> ] | |
> ] | |
parseFactor parses the fundamental units of our expressions | |
> parseFactor :: Parser Expr | |
> parseFactor = parens parseExpr | |
> <|> parseIf | |
> <|> liftM (Number . fromInteger) natural | |
> <|> try parseCall -- this needs the 'try' because the parseCall could succeed in | |
> -- parsing the identifier, but then fail to parse the argument | |
> -- list. In this case we want it to back up again and try | |
> -- parsing the identifier | |
> <|> liftM Identifier identifier | |
> <?> "factor" | |
These parsers parse the variaous elements of our toy language and construct their respective components of the AST. | |
> parseCall :: Parser Expr | |
> parseCall = do | |
> funcName <- identifier | |
> params <- parens $ sepBy1 parseExpr comma | |
> return $ Call funcName params | |
> | |
> parseIf :: Parser Expr | |
> parseIf = do | |
> reserved "if" | |
> condition <- parseExpr | |
> reserved "then" | |
> thenExpr <- parseExpr | |
> reserved "else" | |
> elseExpr <- parseExpr | |
> return $ If condition thenExpr elseExpr | |
> | |
> parseFuncDef :: Parser FuncDef | |
> parseFuncDef = do | |
> reserved "def" | |
> funcName <- identifier | |
> params <- parens $ sepBy identifier comma | |
> body <- parseExpr | |
> return $ Func funcName params body | |
> | |
> parseDefs :: Parser [FuncDef] | |
> parseDefs = many parseFuncDef <?> "function definition" | |
> | |
The parseLex function discards any leading whitespace and forces the passed in parser to | |
parse the rest of the input string. | |
> parseLex :: Parser a -> Parser a | |
> parseLex p = do | |
> whiteSpace | |
> x <- p | |
> eof | |
> return x | |
> | |
> parseArg :: (Monad m) => [String] -> m (Bool, String) | |
> parseArg [] = fail "Please pass in a file" | |
> parseArg ("--jit":x:_) = return (True, x) | |
> parseArg (x:_) = return (False, x) | |
> | |
> main :: IO () | |
> main = do | |
> args <- getArgs | |
> (runJitNow, fileName) <- parseArg args | |
> parsedModuleOrError <- parseFromFile (parseLex $ parseDefs) fileName | |
> let baseName = fst $ splitExtension fileName | |
> case parsedModuleOrError of | |
> Left err -> print err | |
> Right defs -> do | |
> runJit <- genModule (baseName ++ ".bc") defs | |
> when runJitNow runJit -- Run it! | |
The following code is _way_ uglier than it should be; the LLVM haskell bindings for plain code | |
generation from an AST have not yet been properly haskellized. | |
Hopefully newer versions of haskell's LLVM bindings will provide a cleaner monadic interface. | |
> genModule :: String -> [FuncDef] -> IO (IO ()) | |
> genModule filename defs = do | |
> modRef <- newModule | |
> builderPtr <- U.createBuilder | |
> U.withBuilder builderPtr $ \builder -> do | |
> mapM_ (genFunction builder modRef) defs | |
> start <- getExistingFunction modRef "start" | |
> mainFunc <- genMain builder modRef start | |
> _ <- verifyModule (U.fromModule modRef) 0 nullPtr -- check for consistency | |
> U.writeBitcodeToFile filename modRef -- write out our bitcode to a file | |
> initializeNativeTarget | |
This line runs the JIT compiler on our source and gives us a function we can run | |
> simpleFunction (return $ (Value mainFunc :: Function (IO ()))) | |
This generates our main function, which calls 'start' and prints the result using 'printf' | |
from the C runtime. | |
> genMain :: BuilderRef -> U.Module -> U.Function -> IO U.Function | |
> genMain builder modRef startFunc = do | |
> displayString<- U.addGlobal modRef InternalLinkage "DisplayString" (arrayType int8Type 20) | |
> setGlobalConstant displayString 1 | |
> setInitializer displayString (U.constStringNul "The result was: %d\n") | |
> printf <- U.addFunction modRef ExternalLinkage "printf" (U.functionType True int32Type [pointerType int8Type 0, int32Type]) | |
> mainFunc <- U.addFunction modRef ExternalLinkage "main" (U.functionType False int32Type []) | |
> basicBlock <- U.appendBasicBlock mainFunc "entry" | |
> positionAtEnd builder basicBlock -- sets the point were we | |
> result <- U.makeCall startFunc builder [] | |
> strPtr <- U.getElementPtr0 builder displayString | |
> _ <- U.makeCall printf builder [strPtr, result] | |
> _ <- buildRet builder (constInt int32Type 0 0) | |
> return mainFunc | |
This is the code generator for an Expr. This is where most of the magic happens. | |
> genExpr :: BuilderRef -> U.Module -> Map.Map String ValueRef -> Expr -> IO ValueRef | |
> genExpr builder modRef = genExpr' | |
> where | |
> genExpr' env expr = case expr of | |
For a number, just emit a constant 32bit integer with the specified value | |
> (Number a) -> return $ constInt int32Type (fromIntegral a) (fromIntegral a) | |
For our operators, we generate code for both of our operands first, and then emit the instruction | |
that performs the desired operation. | |
> (Add a b) -> do | |
> a' <- genExpr' env a | |
> b' <- genExpr' env b | |
> buildAdd' a' b' | |
> (Sub a b) -> do | |
> a' <- genExpr' env a | |
> b' <- genExpr' env b | |
> buildSub' a' b' | |
For our comparison operator, we are going to do something horrible and borrow a practice from C. | |
We are going to have our comparison return an integer value. | |
LLVM's comparision instructions return only a single bit result. | |
We're going to need to cast it to a 32bit integer to match our other values. | |
> (Less a b) -> do | |
> a' <- genExpr' env a | |
> b' <- genExpr' env b | |
> cmpR <- buildICmp' IntSLT a' b' "cmpLTtmp" | |
> buildZExt' cmpR int32Type "castBit" | |
Lookup identifier in local enviroment (which is currently just the function parameters) | |
and use that value. | |
> (Identifier name) -> case Map.lookup name env of | |
> Just a -> return a | |
> Nothing -> error ("Unknown variable name: " ++ name) | |
Our 'if' expression is by far the most complex expression to generate. | |
But it's still pretty straight forward. | |
> (If cond thenExpr elseExpr) -> do | |
> condV <- genExpr' env cond | |
Build instruction to test if condition is equal to zero. | |
> comp <- buildICmp' IntEQ condV (constInt int32Type 0 0) "ifcond" | |
> parentFunc <- getBasicBlockParent =<< getInsertBlock builder | |
Basic blocks are the boxes in a control flow graph. | |
The last instruction in a basic block is always a branch instruction. | |
> thenBlock <- U.appendBasicBlock parentFunc "then" | |
> elseBlock <- U.appendBasicBlock parentFunc "else" | |
> mergeBlock <- U.appendBasicBlock parentFunc "ifcont" | |
> -- branch to desired block based on comparison | |
> _ <- buildCondBr builder comp elseBlock thenBlock | |
Generate code for the 'then' block from the thenExpr and then branch to the merge block. | |
> positionAtEnd builder thenBlock | |
> -- generate code for the 'then' block from the thenExpr | |
> thenV <- genExpr' env thenExpr | |
> -- branch back to the merge block | |
> _ <- buildBr builder mergeBlock | |
Generate code for the 'else' block from the elseExpr, and then branch to the merge block. | |
> positionAtEnd builder elseBlock | |
> elseV <- genExpr' env elseExpr | |
> _ <- buildBr builder mergeBlock | |
We need a merge block because LLVM has Static Single Assignment (SSA) and uses a phi instruction | |
to conditionally select values. See http://en.wikipedia.org/wiki/Static_single_assignment_form | |
Most modern low level intermediate representations use SSA because it makes it much easier to perform | |
sophisticated transformations and analysis. It's also easier for humans to reason about. | |
The phi instruction selects a different value depending on which block we arrived here from. | |
If we came from the 'then' block, use the thenV value, if we came from the 'else' block, use | |
the elseV value. | |
> positionAtEnd builder mergeBlock | |
> buildPhi' int32Type [(thenV, thenBlock), (elseV, elseBlock)] "iftmp" | |
> (Call name args) -> do | |
> args' <- mapM (genExpr' env) args | |
> func <- getExistingFunction modRef name | |
> U.makeCall func builder args' | |
These are just helper functions that should really just exist (in an imporved form) in the LLVM | |
bindings. But here they are for now. | |
> buildPhi' typ valuesAndBlocks tag = do | |
> withCString tag $ \tmp -> do | |
> phiResult <- buildPhi builder typ tmp | |
> U.addPhiIns phiResult valuesAndBlocks | |
> return phiResult | |
> buildICmp' p l r tag = withCString tag $ \tmp -> buildICmp builder (fromIntPredicate p) l r tmp | |
> buildAdd' a b = withCString "addtmp" $ \tmp -> buildAdd builder a b tmp | |
> buildSub' a b = withCString "subtmp" $ \tmp -> buildSub builder a b tmp | |
> buildZExt' a typ tag = withCString tag $ \tmp -> buildZExt builder a typ tmp | |
This function generates the code for a .toy function. | |
> genFunction :: BuilderRef -> U.Module -> FuncDef -> IO ValueRef | |
> genFunction builder modRef (Func name params body) = do | |
> func <- U.addFunction modRef ExternalLinkage name (U.functionType False int32Type (fmap (const int32Type) params)) | |
> basicBlock <- U.appendBasicBlock func "entry" | |
> positionAtEnd builder basicBlock -- sets the point were we | |
Construct a lookup table of all the parameters for this function. | |
We pull from this table when we want to know what value an identifier is pointing to. | |
> let env = Map.fromList $ zip params $ fmap (getParam func) [0..(fromIntegral (length params)) - 1] | |
Generate code for the function body, and then the return instruction. | |
> retValue <- genExpr builder modRef env body | |
> _ <- buildRet builder retValue | |
> _ <- verifyFunction func 0 -- this is an LLVM function to check for consistency | |
> return func | |
> getExistingFunction :: U.Module -> String -> IO ValueRef | |
> getExistingFunction modRef name = | |
> withCString name $ \name' -> getNamedFunction (U.fromModule modRef) name' |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment