@Robbepop
Last active February 29, 2016 23:51

Planned Architecture of my Compiler/Parser

/// This struct represents a source file for compilation.
/// It owns its content and its name and may have a parent Source.
/// Sources are important for error reporting so that the compiler
/// is able to point the programmer to the correct location within
/// the real source code.
/// Can be used as an iterator over the characters of its content.
struct Source {
	parent: Option<Rc<Source>>, // maybe Weak<Source> is better?
	content: String, // the content of the source file
	name: String     // the name of the source file (e.g. "foo.rs")
}
/// This enum represents the file tree of source files and their directories.
/// It is used to trace errors back from the source root for the programmer.
/// Nodes represent directories within the file system and leaves represent source files.
enum SourceTree {
	Node(String, RefCell<Vec<SourceTree>>), // directory name and directory entries
	Leaf(Source)
}
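// A minimal sketch of how the SourceTree could be used to trace a source file
// back to the source root. The method `path_to` and the helper `sample_tree`
// are hypothetical names, not part of the planned design, and Source is
// reduced to the fields needed here (no parent).

```rust
use std::cell::RefCell;

struct Source {
    content: String,
    name: String,
}

enum SourceTree {
    Node(String, RefCell<Vec<SourceTree>>), // directory name and directory entries
    Leaf(Source),
}

impl SourceTree {
    /// Returns the path from this node down to the first source file
    /// called `name`, or None if no such file exists below this node.
    fn path_to(&self, name: &str) -> Option<String> {
        match self {
            SourceTree::Leaf(src) if src.name == name => Some(src.name.clone()),
            SourceTree::Leaf(_) => None,
            SourceTree::Node(dir, entries) => {
                let entries = entries.borrow();
                let found = entries.iter().find_map(|entry| entry.path_to(name));
                found.map(|rest| format!("{}/{}", dir, rest))
            }
        }
    }
}

/// Builds a small example tree: src/{main.rs, frontend/{lexer.rs, parser.rs}}.
fn sample_tree() -> SourceTree {
    let leaf = |name: &str| {
        SourceTree::Leaf(Source { content: String::new(), name: name.to_string() })
    };
    SourceTree::Node(
        "src".to_string(),
        RefCell::new(vec![
            leaf("main.rs"),
            SourceTree::Node(
                "frontend".to_string(),
                RefCell::new(vec![leaf("lexer.rs"), leaf("parser.rs")]),
            ),
        ]),
    )
}
```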
/// This struct is used to make error reporting clearer to the programmer
/// by telling him or her in which line and column an error occurred, since
/// programmers will not get much information from the index of the erroneous
/// character within a Source.
/// Unlike SourceRange, SourceLoc has no fixed connection
/// to a range of characters or a single character within a Source;
/// its only purpose is improved error reporting.
struct SourceLoc {
	line: usize, // the line number
	col: usize   // the column number
}
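// A minimal sketch of how a SourceLoc could be derived from a character index
// within a Source's content. The function `loc_of` is a hypothetical helper;
// it assumes byte indices, unsigned 1-based line and column numbers, and '\n'
// as the only line terminator.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SourceLoc {
    line: usize, // the line number (assumed to start at 1)
    col: usize,  // the column number (assumed to start at 1)
}

/// Computes the line and column of the character at byte `index` by
/// walking over every character that comes before it.
fn loc_of(content: &str, index: usize) -> SourceLoc {
    let mut loc = SourceLoc { line: 1, col: 1 };
    for (_, ch) in content.char_indices().take_while(|&(i, _)| i < index) {
        if ch == '\n' {
            loc.line += 1;
            loc.col = 1;
        } else {
            loc.col += 1;
        }
    }
    loc
}
```

// A Lexer would more likely update its current SourceLoc incrementally while
// consuming characters instead of rescanning the content for every token.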
/// This struct represents the range of characters within the Source
/// that are part of a Token generated by the Lexer of the compiler.
/// Every token within the same Source maps to a disjoint SourceRange.
/// A SourceRange may cover as little as a single character.
/// It should be possible to use the SourceRange as an iterator over characters.
struct SourceRange {
	source: Rc<Source>,
	begin: usize, // the index in 'source.content' of the first character within this range
	end: usize    // the index after the last character within 'source.content' of this range
	/* Maybe it would be better to implement begin and end as iterator pair that are
	   pointers pointing to the Source's content buffer directly. However, this shortened indirection
	   would certainly not appease the borrow-checker and require use of unsafe code blocks. */
}
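// A minimal sketch of the index-based variant, assuming begin and end are
// byte offsets into 'source.content' that fall on character boundaries.
// The methods `text`, `chars` and `overlaps` are hypothetical names; Source
// is reduced to the fields needed here.

```rust
use std::rc::Rc;

struct Source {
    content: String,
    name: String,
}

#[derive(Clone)]
struct SourceRange {
    source: Rc<Source>,
    begin: usize, // index of the first character within this range
    end: usize,   // index after the last character within this range
}

impl SourceRange {
    /// The slice of the Source's content covered by this range.
    fn text(&self) -> &str {
        &self.source.content[self.begin..self.end]
    }

    /// Iterates over the characters covered by this range.
    fn chars(&self) -> std::str::Chars<'_> {
        self.text().chars()
    }

    /// True if both ranges belong to the same Source and share a character.
    fn overlaps(&self, other: &SourceRange) -> bool {
        Rc::ptr_eq(&self.source, &other.source)
            && self.begin < other.end
            && other.begin < self.end
    }
}
```

// Slicing through the Rc keeps everything in safe code: the borrow checker
// ties the returned &str to the SourceRange, which keeps the Source alive.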
/// This enum represents all the different kinds of tokens a Lexer may generate
/// and a Parser may expect. These are only mere identifiers without any further
/// specialized information about certain token kinds.
enum TokenKind {
	Identifier, // e.g. foo, bar, baz
	OpenParen, CloseParen, // '(' and ')'
	Arrow, // '=>'
	BoolLiteral, // 'true' or 'false'
	IntegerLiteral, // e.g. 5, 42, 1337
	FloatLiteral, // e.g. 0.0, 42.0, 0.24, 13.37
	CharLiteral, // e.g. 'a', '\n', '\x7F', etc.
	StringLiteral, // e.g. "Hello, World!"
	// ...
	Comment, // represents a comment
	Error, // represents an error (maybe useless?)
	EndOfFile,
	// etc.
}
/// This struct represents a Token generated by the Lexer and read by the Parser.
/// Every Token has a certain kind and a disjoint range of characters within
/// a Source. Tokens do not store their kind-specific information but only point
/// to it with their SourceRange.
/// For example: a Token of kind StringLiteral refers to its content through `info`:
///     Source: foo.txt:["Hello, World"] may resolve to token:
///         Token(TokenKind::StringLiteral, SourceRange(source_foo, 0, 14))
struct Token {
	kind: TokenKind,
	loc: SourceLoc, // maybe it would be wise to also include a
	                // SourceLoc for its begin _AND_ end instead of
	                // only the beginning.
	info: SourceRange // the part of the Source this Token stands for
}
/// This struct represents a context object for the compilation of a compilation unit.
/// It owns utility types used by several different components within the compiler
/// (such as Lexer, Parser, SemanticCheckers, etc.).
/// These compiler components may reference the same CompileContext and may interact
/// and share information through it, for example via the SymbolTable.
struct CompileContext {
	error_handler: RefCell<ErrorHandler>,
	symbol_table: RefCell<SymbolTable>, // stores useful information about named-AST components
	// etc.
}
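// A minimal sketch of why the shared parts sit behind RefCell: several
// components hold only a shared reference to the same CompileContext, yet
// each of them can still record errors or symbols. The bodies of ErrorHandler
// and SymbolTable and the methods `report`, `intern` and `error_count` are
// hypothetical placeholders, not the planned design.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Default)]
struct ErrorHandler {
    messages: Vec<String>,
}

#[derive(Default)]
struct SymbolTable {
    symbols: HashMap<String, usize>,
}

#[derive(Default)]
struct CompileContext {
    error_handler: RefCell<ErrorHandler>,
    symbol_table: RefCell<SymbolTable>,
}

impl CompileContext {
    /// Records an error message; works through a shared reference.
    fn report(&self, message: &str) {
        self.error_handler.borrow_mut().messages.push(message.to_string());
    }

    /// Returns the id of `name`, registering it on first use.
    fn intern(&self, name: &str) -> usize {
        let mut table = self.symbol_table.borrow_mut();
        let next_id = table.symbols.len();
        let id = *table.symbols.entry(name.to_string()).or_insert(next_id);
        id
    }

    fn error_count(&self) -> usize {
        self.error_handler.borrow().messages.len()
    }
}

// Two components sharing one context by plain shared reference.
struct Lexer<'ctx> {
    context: &'ctx CompileContext,
}

struct Parser<'ctx> {
    context: &'ctx CompileContext,
}
```

// RefCell moves the borrow check to run time, so a component must not keep a
// borrow_mut() alive while calling into another component that borrows the
// same cell; otherwise the program panics.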
/// This struct reads in characters from its specified Source and generates Tokens.
/// Each Token stores where it is located within the Source
/// (SourceLoc) and which characters within the Source it represents (SourceRange).
/// However, Tokens do not own the characters they represent; they just point to them
/// stored within the Source itself.
/// It should be possible to use the Lexer as an iterator over Tokens.
struct Lexer<'ctx> {
	context: &'ctx CompileContext,
	source: Rc<Source>,
	loc: SourceLoc, // current SourceLoc for the next generated Token
	token_range: SourceRange // current SourceRange for the next generated Token
}
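// A minimal sketch of a Lexer used as an iterator over Tokens. To stay short
// it makes several assumptions that differ from the planned design: there is
// no context and no SourceLoc, Tokens store begin/end directly instead of a
// SourceRange, only a handful of token kinds are recognised, and the end of
// input is signalled by `None` rather than an EndOfFile token. `Lexer::new`
// and `skip_while` are hypothetical names.

```rust
use std::rc::Rc;

struct Source {
    content: String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Identifier,
    IntegerLiteral,
    OpenParen,
    CloseParen,
    Arrow,
    Error,
}

struct Token {
    kind: TokenKind,
    begin: usize, // byte index of the first character of the token
    end: usize,   // byte index after the last character of the token
}

struct Lexer {
    source: Rc<Source>,
    pos: usize, // byte index of the next unread character
}

impl Lexer {
    fn new(source: Rc<Source>) -> Self {
        Lexer { source, pos: 0 }
    }

    /// Advances `pos` for as long as `pred` holds for the next character.
    fn skip_while(&mut self, pred: impl Fn(char) -> bool) {
        while let Some(ch) = self.source.content[self.pos..].chars().next() {
            if !pred(ch) {
                break;
            }
            self.pos += ch.len_utf8();
        }
    }
}

impl Iterator for Lexer {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        self.skip_while(char::is_whitespace);
        let begin = self.pos;
        // `?` ends the iteration once the whole content has been consumed.
        let first = self.source.content[begin..].chars().next()?;
        self.pos += first.len_utf8();
        let kind = match first {
            '(' => TokenKind::OpenParen,
            ')' => TokenKind::CloseParen,
            '=' if self.source.content[self.pos..].starts_with('>') => {
                self.pos += 1;
                TokenKind::Arrow
            }
            c if c.is_ascii_digit() => {
                self.skip_while(|c| c.is_ascii_digit());
                TokenKind::IntegerLiteral
            }
            c if c.is_alphabetic() || c == '_' => {
                self.skip_while(|c| c.is_alphanumeric() || c == '_');
                TokenKind::Identifier
            }
            _ => TokenKind::Error,
        };
        Some(Token { kind, begin, end: self.pos })
    }
}
```

// Because the Lexer implements Iterator, a Parser can consume it lazily or
// wrap it in `peekable()` for one token of lookahead.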
/// This struct's 'parse' method reads in Tokens from a token iterator
/// (such as the Lexer) and outputs a representation of that Token sequence
/// as an abstract syntax tree (AST).
/// It uses the CompileContext reference in order to inform the programmer about
/// errors caused by the Token sequence.
struct Parser<'ctx> {
	context: &'ctx CompileContext
}

impl<'ctx> Parser<'ctx> {
	pub fn parse<I: Iterator<Item = Token>>(&self, input: I) -> ASTRoot {..}
}