Skip to content

Instantly share code, notes, and snippets.

@cellularmitosis
Last active October 22, 2019 23:26
Show Gist options
  • Save cellularmitosis/6a0566bd01ff058419dc6469dc6e019f to your computer and use it in GitHub Desktop.
Save cellularmitosis/6a0566bd01ff058419dc6469dc6e019f to your computer and use it in GitHub Desktop.
An alternative syntax for C, part 9: comments, includes and defines

Blog 2019/10/19

<- previous | index | next ->

An alternative syntax for C, part 9: comments, includes and defines

<- part 8 | part 10 ->

In this installment, we add comments, #include statements and #define statements.

Note: the comment support is pretty primitive. Comments must be on their own line / group of lines.

i.e. you can't do this:

i: int = 1  // index

and you can't do this:

for i: int = 1 /* index */; (i < 42); (i += 1):
    continue

Includes and defines are passed through unmodified.

// this is a line-level comment

/* this is a block-level comment */

/*
 * This is a
 * multi-line
 * block-level
 * comment.
 */

#include <foo.h>
#include <dir/foo.h>
#include "foo.h"
#include "dir/foo.h"
#include "/dir/foo.h"

#define FOOS 1
#define debug(m,e) printf("%s:%d: %s:",__FILE__,__LINE__,m); print_obj(e,1); puts("");

TL;DR: What is Cy?

Cy is an alternative syntax for C which features indentation-based scoping and intuitive type declarations.

func main(argc: int, argv: pointer<pointer<char>>) -> int:
    printf("Hello, world!")
    return 0

But why tho?

To teach myself how to write a transpiler! 🤩🤩🤩

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# cy.py: a traspiler for an alternate syntax for C.
# Note: this is an incomplete work-in-progress.
# Data structures:
# A token is a tuple.
# First element is a string 'token'.
# Second element is a string which is the token type (e.g. 'OBRACKET').
# Third element is a string which is the source text of the token (e.g. '[').
# e.g. ('token', 'OBRACKET', '[')
# e.g. ('token', 'IDENTIFIER', 'foo')
# e.g. ('token', 'S', ' ')
# An AST node is a tuple.
# First element is a string 'ast'.
# Second element is a string which is the AST node type (e.g. 'vardecl').
# Third element is an array which are the child tuples (AST nodes or tokens).
# e.g. ('ast', 'type', [('token', 'IDENTIFIER', 'int')])
# Grammar:
#
# program = statement { statement }
#
# statement = ( lstatement NL ) | bstatement | include | define | comment | blankline
# lstatement = declassign | assign | vardecl | funcall | return | BREAK | CONTINUE
# bstatement = fundecl | if | while | for | scope
# include = INCLUDE NL
# define = DEFINE NL
# comment = [S] ( LCOMMENT | BCOMMENT ) NL
# blankline = [S] NL
#
# declassign = vardecl S EQ S expr
# assign = IDENTIFIER S aoperator S expr
# aoperator = PLUSEQ | MINUSEQ | STAREQ | SLASHEQ | PERCENTEQ | EQ
# vardecl = IDENTIFIER COLON S type
# return = RETURN [ S expr ]
# funcall = IDENTIFIER OPAREN [ expr { COMMA S expr } ] CPAREN
#
# fundecl = FUNC S IDENTIFIER fundeclargs [ fundeclret ] scope
# fundeclargs = OPAREN [ vardecl { COMMA S vardecl } ] CPAREN
# fundeclret = S ARROW S type
# if = IF S expr scope { elif } [else]
# elif = ELIF S expr scope
# else = ELSE scope
# scope = COLON NL INDENT statement { statement } DEDENT
# while = WHILE S expr scope
# for = FOR S [lstatement] SEMICOLON S expr SEMICOLON S lstatement scope
# expr = funcall | binary | unary | IDENTIFIER
# | INTLIT | FLOATLIT | STRINGLIT | BOOLLIT
# binary = OPAREN expr S boperator S expr CPAREN
# boperator = PLUS | MINUS | STAR | SLASH | PERCENT
# | LT | LTEQ | GT | GTEQ | EQEQ | BANGEQ
# | AMPAMP | BARBAR | BANG
# | AMP | BAR | LTLT | GTGT | TILDE | CARAT
# unary = uoperator OPAREN expr CPAREN
# uoperator = MINUS | AMP | STAR | BANG
# type = POINTER LT type GT
# | ARRAY LT type GT
# | FUNCTION LT type GT
# | ARRAY OBRACKET INTLIT CBRACKET LT type GT
# | IDENTIFIER
# ws = (S | NL) { (S | NL) }
# The above EBNF notation is:
# - lowercase are AST nodes
# - ALLCAPS are tokens (terminals)
# - Concatenation: rule1 = rule2 rule3 rule4
# - Alternation: rule1 = rule2 | rule3 | rule4
# - Repetition: rule1 = rule2 { rule3 }
# - Grouping: rule1 = (rule2 rule3) | rule4
# Token data structure utils:
def is_token(token):
"""Returns whether the arg is a token."""
return isinstance(token, tuple) and len(token) > 0 and token[0] == 'token'
def token_type(token):
"""Returns the type of the token (e.g. 'IDENTIFIER')."""
assert is_token(token)
return token[1]
def token_text(token):
"""Returns the source text of the token (e.g. 'foo' for a 'IDENTIFIER' token)."""
assert is_token(token)
return token[2]
def is_toktype(token, toktype):
"""
Returns whether the token is of the given type.
_toktype_ is e.g. 'IDENTIFIER', 'COLON', etc.
"""
return is_token(token) and token_type(token) == toktype
# Tokenizer implementation:
def tokenize(tokendefs, keyword_map, input, indent=0):
"""Uses tokendefs to tokenize the 'input' string and return a list of tokens"""
tokens = []
offset = 0
previous = None
while offset < len(input):
for (token_name, regex) in tokendefs:
m = regex.match(input, offset)
if m is not None:
matched_text = m.group(0)
offset += len(matched_text)
token = ('token', token_name, matched_text)
# special-case: distinguish keywords from identifiers
token = keyword_specialcases(token, keyword_map)
# special-case: implement "offside-rule"
(tokens2, indent) = offside_rule(token, previous, indent)
tokens += tokens2
previous = token
break
else:
raise Exception("Couldn't tokenize starting at '%s...'" % input[offset:offset+16])
# special-case: add any needed implicit trailing DEDENT tokens
while indent > 0:
tokens.append(('token', 'DEDENT', ''))
indent -= 1
return tokens
INDENT_UNIT=4
def offside_rule(token, previous, indent_level):
"""
Implement the "offside rule" indentation-based scoping mechanism.
See https://en.wikipedia.org/wiki/Off-side_rule
Args:
- token: the current token
- previous: the previous token
- indent_level: the indent level before encounting this token
Returns a list of replacement tokens and the new indentation level.
If the previous token was a NL and the indentation level has changed,
inject INDENT or DEDENT tokens as needed.
Otherwise, return the token unmodified.
Big-picture example: the token stream
COLON NL S RETURN NL
should become
COLON NL INDENT RETURN NL DEDENT
"""
def get_indent_level(text):
"""
Determine the indentation level of the text.
TODO: this implementation is pretty naive.
"""
indentation = text.split('\n')[-1]
assert len(indentation) % INDENT_UNIT == 0
return len(indentation) / INDENT_UNIT
assert is_token(token)
if previous is None:
return ([token], indent_level)
assert is_token(previous)
if not is_toktype(previous, 'NL'):
return ([token], indent_level)
tokens = []
if is_toktype(token, 'S'):
text = token_text(token)
new_level = get_indent_level(text)
else:
new_level = 0
for _ in range(new_level - indent_level):
tokens.append(('token', 'INDENT', ' '))
for _ in range(indent_level - new_level):
tokens.append(('token', 'DEDENT', ''))
if not is_toktype(token, 'S'):
tokens.append(token)
return (tokens, new_level)
def keyword_specialcases(token, keyword_map):
"""
Add a special-case which turns IDENTIFIER tokens into keyword tokens.
e.g.:
- 'array' changes from an IDENTIFIER token to an ARRAY token.
- 'true' changes from an IDENTIFIER token to a BOOLLIT token.
"""
if is_toktype(token, 'IDENTIFIER'):
text = token_text(token)
if text in keyword_map.keys():
toktype = keyword_map[text]
return ('token', toktype, text)
return token
def load_tokendefs(fpath):
"""
Load the token definitions from the file at fpath.
Returns a list of tuples of the form (<token type>, <compiled regex>).
The format of the tokendefs file should be pairs of lines:
- a line which is the token type (e.g. 'IDENTIFIER', 'OBRACKET', etc),
- followed by a line which is the regex which recognizes that token.
Example tokendefs.txt:
IDENTIFIER
[a-zA-Z][a-zA-Z0-9]*
OPAREN
\(
CPAREN
)
"""
import re
tokendefs = []
with open(fpath) as f:
for line in f:
token_name = line.rstrip('\n')
pattern = f.next().rstrip('\n')
regex = re.compile(pattern)
tokendef = (token_name, regex)
tokendefs.append(tokendef)
return tokendefs
def load_keywords(fpath):
"""
Load the keyword definitions from the file at the path.
Returns a dict which maps keywords to toktypes.
The format of the keywords file should be pairs of lines:
- a keywork (e.g. 'true'),
- followed by a line which is the toktype of that keyword (e.g. 'BOOLLIT').
Example keywords.txt file:
true
BOOLLIT
false
BOOLLIT
array
ARRAY
return
RETURN
"""
keyword_map = {}
with open(fpath) as f:
for line in f:
keyword = line.rstrip('\n')
toktype = f.next().rstrip('\n')
keyword_map[keyword] = toktype
return keyword_map
# AST node data structure utils:
def is_ast(ast):
"""Returns whether the arg is an AST node."""
return isinstance(ast, tuple) and len(ast) > 0 and ast[0] == 'ast'
def ast_type(ast):
"""Returns the type of the AST (e.g. 'vardecl', 'statement', etc.)"""
assert is_ast(ast)
return ast[1]
def is_asttype(ast, atype):
"""
Returns whether the AST node is of the given type.
atype is e.g. 'vardecl', 'statement', etc.
"""
return is_ast(ast) and ast_type(ast) == atype
def ast_children(ast):
"""Return the child nodes of the given AST."""
assert is_ast(ast)
return ast[2]
# Parser helpers:
# See https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form#Table_of_symbols
def alternation(tokens, funcs):
"""Try each of the parsing functions until one succeeds."""
failure = (None, tokens)
for f in funcs:
(ast, tokens) = f(tokens)
if ast is not None:
return (ast, tokens)
return failure
def concatenation(tokens, funcs):
"""
All of the parsing functions must succeed as a whole.
Returns (<array of AST's>, <remaining tokens>).
"""
failure = (None, tokens)
asts = []
for f in funcs:
(ast, tokens) = f(tokens)
if ast is None:
return failure
asts.append(ast)
return (asts, tokens)
# Parsing functions:
# Note: all parse_x functions share a similar format:
# - Their first argument is the list of tokens.
# - They return a tuple of (<parsed result>, <remaining tokens>).
# - <parsed result> is either an AST node or a token (a terminal).
# - <remaining tokens> is the list of unconsumed tokens.
# - If the parse fails, (None, <tokens>) is returned, where <tokens>
# is the list of tokens which was passed in.
def parse(tokens):
"""Attempts to parse the tokens into an AST."""
(ast, tokens) = parse_program(tokens)
if ast is None:
raise Exception("Couldn't parse starting at %s" % tokens[:8])
if len(tokens) > 0:
raise Exception("Leftover tokens: %s" % tokens)
return ast
def parse_program(tokens):
"""
Attempts to parse a program.
Grammar:
program = statement { statement }
"""
failure = (None, tokens)
statement_asts = []
(ast, tokens) = parse_statement(tokens)
if ast is None:
return failure
statement_asts.append(ast)
while True:
(ast, tokens) = parse_statement(tokens)
if ast is None:
break
statement_asts.append(ast)
ast = ('ast', 'program', statement_asts)
return (ast, tokens)
def parse_statement(tokens):
"""
Attempts to parse a statement.
Grammar:
statement = ( lstatement NL ) | bstatement | include | define | comment | blankline
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_lstatement,
parse_NL
])
if asts is not None:
lstatement_ast = asts[0]
ast = ('ast', 'statement', [lstatement_ast])
return (ast, tokens)
(bstatement_ast, tokens) = parse_bstatement(tokens)
if bstatement_ast is not None:
ast = ('ast', 'statement', [bstatement_ast])
return (ast, tokens)
(include_ast, tokens) = parse_include(tokens)
if include_ast is not None:
ast = ('ast', 'statement', [include_ast])
return (ast, tokens)
(define_ast, tokens) = parse_define(tokens)
if define_ast is not None:
ast = ('ast', 'statement', [define_ast])
return (ast, tokens)
(comment_ast, tokens) = parse_comment(tokens)
if comment_ast is not None:
ast = ('ast', 'statement', [comment_ast])
return (ast, tokens)
(blank_ast, tokens) = parse_blankline(tokens)
if blank_ast is not None:
ast = ('ast', 'statement', [blank_ast])
return (ast, tokens)
return failure
def parse_lstatement(tokens):
"""
Attempts to parse a "line" statement.
Grammar:
lstatement = declassign | assign | vardecl | funcall | return | BREAK | CONTINUE
"""
failure = (None, tokens)
(ast, tokens) = alternation(tokens, [
parse_declassign,
parse_assign,
parse_vardecl,
parse_funcall,
parse_return,
parse_BREAK,
parse_CONTINUE
])
if ast is None:
return failure
statement_ast = ('ast', 'lstatement', [ast])
return (statement_ast, tokens)
def parse_bstatement(tokens):
"""
Attempts to parse a "block" statement.
Grammar:
bstatement = fundecl | if | while | for | scope
"""
failure = (None, tokens)
(ast, tokens) = alternation(tokens, [
parse_fundecl,
parse_if,
parse_while,
parse_for,
parse_scope
])
if ast is None:
return failure
statement_ast = ('ast', 'bstatement', [ast])
return (statement_ast, tokens)
def parse_include(tokens):
"""
Attempts to parse an include statement.
Grammar:
include = INCLUDE NL
"""
failure = (None, tokens)
(include_token, tokens) = parse_INCLUDE(tokens)
if include_token is None:
return failure
(nl_ast, tokens) = parse_NL(tokens)
if nl_ast is None:
return failure
ast = ('ast', 'include', [include_token])
return (ast, tokens)
def parse_define(tokens):
"""
Attempts to parse an define statement.
Grammar:
define = DEFINE NL
"""
failure = (None, tokens)
(define_token, tokens) = parse_DEFINE(tokens)
if define_token is None:
return failure
(nl_ast, tokens) = parse_NL(tokens)
if nl_ast is None:
return failure
ast = ('ast', 'define', [define_token])
return (ast, tokens)
def parse_comment(tokens):
"""
Attempts to parse a (line or block-level) comment.
Grammar:
comment = [S] ( LCOMMENT | BCOMMENT ) NL
"""
failure = (None, tokens)
(ast, tokens) = parse_S(tokens)
(token, tokens) = alternation(tokens, [
parse_LCOMMENT,
parse_BCOMMENT
])
if token is None:
return failure
ast = ('ast', 'comment', [token])
(nl_ast, tokens) = parse_NL(tokens)
if nl_ast is None:
return failure
return (ast, tokens)
def parse_blankline(tokens):
"""
Attempts to parse a blank line.
Grammar:
blankline = [S] NL
"""
failure = (None, tokens)
(_, tokens) = parse_S(tokens)
(ast, tokens) = parse_NL(tokens)
if ast is None:
return failure
ast = ('ast', 'blankline', [])
return (ast, tokens)
def parse_declassign(tokens):
"""
Attempts to parse a declaration-assignment statement.
Grammar:
declassign = vardecl S EQ S expr
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_vardecl,
parse_S,
parse_EQ,
parse_S,
parse_expr
])
if asts is None:
return failure
vardecl_ast = asts[0]
expr_ast = asts[4]
ast = ('ast', 'declassign', [vardecl_ast, expr_ast])
return (ast, tokens)
def parse_assign(tokens):
"""
Attempts to parse an assignment statement.
Grammar:
assign = IDENTIFIER S aoperator S expr
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_IDENTIFIER,
parse_S,
parse_aoperator,
parse_S,
parse_expr
])
if asts is None:
return failure
identifier_token = asts[0]
op_ast = asts[2]
expr_ast = asts[4]
ast = ('ast', 'assign', [identifier_token, op_ast, expr_ast])
return (ast, tokens)
def parse_aoperator(tokens):
"""
Attempts to parse an assignment operator.
Grammar:
aoperator = PLUSEQ | MINUSEQ | STAREQ | SLASHEQ | PERCENTEQ | EQ
"""
failure = (None, tokens)
(token, tokens) = alternation(tokens, [
parse_PLUSEQ,
parse_MINUSEQ,
parse_STAREQ,
parse_SLASHEQ,
parse_PERCENTEQ,
parse_EQ
])
if token is None:
return failure
ast = ('ast', 'aoperator', [token])
return (ast, tokens)
def parse_vardecl(tokens):
"""
Attempts to parse a vardecl AST node.
Grammar:
vardecl = IDENTIFIER COLON S type
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_IDENTIFIER,
parse_COLON,
parse_S,
parse_type
])
if asts is None:
return failure
identifier_token = asts[0]
type_ast = asts[3]
vardecl_ast = ('ast', 'vardecl', [identifier_token, type_ast])
return (vardecl_ast, tokens)
def parse_return(tokens):
"""
Attempts to parse a return statement.
Grammar:
return = RETURN [ S expr ]
"""
failure = (None, tokens)
(ast, tokens) = parse_terminal(tokens, 'RETURN')
if ast is None:
return failure
return_ast = ('ast', 'return', [])
(asts, tokens) = concatenation(tokens, [
parse_S,
parse_expr
])
if asts is not None:
expr_ast = asts[1]
return_ast = ('ast', 'return', [expr_ast])
return (return_ast, tokens)
def parse_funcall(tokens):
"""
Attempts to parse a funcion call.
Grammar:
funcall = IDENTIFIER OPAREN [ expr { COMMA S expr } ] CPAREN
"""
failure = (None, tokens)
def list_of_exprs(tokens):
"""
Grammar: expr { COMMA S expr }
Returns (<list of exprs>, <remaining tokens>)
"""
exprs = []
(expr_ast, tokens) = parse_expr(tokens)
if expr_ast is None:
return (exprs, tokens)
exprs.append(expr_ast)
while True:
(asts, tokens) = concatenation(tokens, [
parse_COMMA,
parse_S,
parse_expr
])
if asts is None:
break
expr_ast = asts[2]
exprs.append(expr_ast)
return (exprs, tokens)
(asts, tokens) = concatenation(tokens, [
parse_IDENTIFIER,
parse_OPAREN
])
if asts is None:
return failure
identifier_token = asts[0]
(exprs, tokens) = list_of_exprs(tokens)
(ast, tokens) = parse_CPAREN(tokens)
if ast is None:
return failure
funcall_ast = ('ast', 'funcall', [identifier_token] + exprs)
return (funcall_ast, tokens)
# fundecl examples:
# func f():
# func f(a: int):
# func f() -> int:
# func f(a: int) -> int:
# func f(a: int, b: int):
# func f(a: int, b: int) -> int:
def parse_fundecl(tokens):
"""
Attempts to parse a function declaration.
Grammar:
fundecl = FUNC S IDENTIFIER fundeclargs [ fundeclret ] scope
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_FUNC,
parse_S,
parse_IDENTIFIER,
parse_fundeclargs
])
if asts is None:
return failure
identifier_token = asts[2]
args_ast = asts[3]
(ret_ast, tokens) = parse_fundeclret(tokens)
(scope_ast, tokens) = parse_scope(tokens)
if scope_ast is None:
return failure
ast = ('ast', 'fundecl', [identifier_token, args_ast, ret_ast, scope_ast])
return (ast, tokens)
def parse_fundeclargs(tokens):
"""
Attempts to parse a list of function declaration arguments.
Grammar:
fundeclargs = OPAREN [ vardecl { COMMA S vardecl } ] CPAREN
"""
failure = (None, tokens)
def list_of_args(tokens):
"""
Grammar: vardecl { COMMA S vardecl }
Returns (<list of args>, <remaining tokens>)
"""
args = []
(vardecl_ast, tokens) = parse_vardecl(tokens)
if vardecl_ast is None:
return (args, tokens)
args.append(vardecl_ast)
while True:
(asts, tokens) = concatenation(tokens, [
parse_COMMA,
parse_S,
parse_vardecl
])
if asts is None:
break
vardecl_ast = asts[2]
args.append(vardecl_ast)
return (args, tokens)
(ast, tokens) = parse_OPAREN(tokens)
if ast is None:
return failure
(args, tokens) = list_of_args(tokens)
(ast, tokens) = parse_CPAREN(tokens)
if ast is None:
return failure
ast = ('ast', 'fundeclargs', args)
return (ast, tokens)
def parse_fundeclret(tokens):
"""
Attempts to parse a function declaration return value.
Grammar:
fundeclret = S ARROW S type
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_S,
parse_ARROW,
parse_S,
parse_type
])
if asts is None:
return failure
type_ast = asts[3]
ast = ('ast', 'fundeclret', [type_ast])
return (ast, tokens)
def parse_if(tokens):
"""
Attempts to parse an if statement.
Grammar:
if = IF S expr scope { elif } [else]
"""
failure = (None, tokens)
def parse_elif(tokens):
"""
Attempts to parse an elif statement.
Grammar:
elif = ELIF S expr scope
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_ELIF,
parse_S,
parse_expr,
parse_scope
])
if asts is None:
return failure
ast = ('ast', 'elif', [asts[2], asts[3]])
return (ast, tokens)
def parse_else(tokens):
"""
Attempts to parse an else statement.
Grammar:
else = ELSE scope
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_ELSE,
parse_scope
])
if asts is None:
return failure
ast = ('ast', 'else', [asts[1]])
return (ast, tokens)
(asts, tokens) = concatenation(tokens, [
parse_IF,
parse_S,
parse_expr,
parse_scope
])
if asts is None:
return failure
sub_asts = [asts[2], asts[3]]
elifs = []
while True:
(ast, tokens) = parse_elif(tokens)
if ast is None:
break
elifs.append(ast)
sub_asts += elifs
(else_ast, tokens) = parse_else(tokens)
if else_ast is not None:
sub_asts.append(else_ast)
if_ast = ('ast', 'if', sub_asts)
return (if_ast, tokens)
def parse_while(tokens):
"""
Attempts to parse a while statement.
Grammar:
while = WHILE S expr scope
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_WHILE,
parse_S,
parse_expr,
parse_scope
])
if asts is None:
return failure
expr_ast = asts[2]
scope_ast = asts[3]
ast = ('ast', 'while', [expr_ast, scope_ast])
return (ast, tokens)
def parse_for(tokens):
"""
Attempts to parse a for statement.
Grammar:
for = FOR S [lstatement] SEMICOLON S expr SEMICOLON S lstatement scope
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_FOR,
parse_S
])
if asts is None:
return failure
(lstatement1_ast, tokens) = parse_lstatement(tokens)
(asts, tokens) = concatenation(tokens, [
parse_SEMICOLON,
parse_S,
parse_expr,
parse_SEMICOLON,
parse_S,
parse_lstatement,
parse_scope
])
if asts is None:
return failure
expr2_ast = asts[2]
lstatement3_ast = asts[5]
scope_ast = asts[6]
ast = ('ast', 'for', [lstatement1_ast, expr2_ast, lstatement3_ast, scope_ast])
return (ast, tokens)
def parse_scope(tokens):
"""
Attempts to parse a scoped sequence of statements.
Grammar:
scope = COLON NL INDENT statement { statement } DEDENT
"""
failure = (None, tokens)
statements = []
(asts, tokens) = concatenation(tokens, [
parse_COLON,
parse_NL,
parse_INDENT,
parse_statement
])
if asts is None:
return failure
statements.append(asts[3])
while True:
(ast, tokens) = parse_statement(tokens)
if ast is None:
break
statements.append(ast)
(ast, tokens) = parse_DEDENT(tokens)
if ast is None:
return failure
ast = ('ast', 'scope', statements)
return (ast, tokens)
def parse_expr(tokens):
"""
Attempts to parse an expression.
Expressions result in a value.
Grammar:
expr = funcall | binary | unary | IDENTIFIER
| INTLIT | FLOATLIT | STRINGLIT | BOOLLIT
"""
failure = (None, tokens)
(ast, tokens) = alternation(tokens, [
parse_funcall,
parse_binary,
parse_unary,
parse_IDENTIFIER,
parse_INTLIT,
parse_FLOATLIT,
parse_STRINGLIT,
parse_BOOLLIT
])
if ast is None:
return failure
expr_ast = ('ast', 'expr', [ast])
return (expr_ast, tokens)
def parse_binary(tokens):
"""
Attempts to parse a binary expression.
Grammar:
binary = OPAREN expr S boperator S expr CPAREN
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_OPAREN,
parse_expr,
parse_S,
parse_boperator,
parse_S,
parse_expr,
parse_CPAREN
])
if asts is None:
return failure
expr1 = asts[1]
op = asts[3]
expr2 = asts[5]
ast = ('ast', 'binary', [expr1, op, expr2])
return (ast, tokens)
def parse_boperator(tokens):
"""
Attempts to parse a binary operator.
Grammar:
boperator = PLUS | MINUS | STAR | SLASH | PERCENT
| LT | LTEQ | GT | GTEQ | EQEQ | BANGEQ
| AMPAMP | BARBAR | BANG
| AMP | BAR | LTLT | GTGT | TILDE | CARAT
"""
failure = (None, tokens)
(token, tokens) = alternation(tokens, [
parse_PLUS,
parse_MINUS,
parse_STAR,
parse_SLASH,
parse_LT,
parse_LTEQ,
parse_GT,
parse_GTEQ,
parse_EQEQ,
parse_BANGEQ,
parse_AMPAMP,
parse_BARBAR,
parse_BANG,
parse_AMP,
parse_BAR,
parse_LTLT,
parse_GTGT,
parse_TILDE,
parse_CARAT,
])
if token is None:
return failure
ast = ('ast', 'boperator', [token])
return (ast, tokens)
def parse_unary(tokens):
"""
Attempts to parse a unary expression.
Grammar:
unary = uoperator OPAREN expr CPAREN
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_uoperator,
parse_OPAREN,
parse_expr,
parse_CPAREN
])
if asts is None:
return failure
op = asts[0]
expr = asts[2]
ast = ('ast', 'unary', [op, expr])
return (ast, tokens)
def parse_uoperator(tokens):
"""
Attempts to parse a unary operator.
Grammar:
uoperator = MINUS | AMP | STAR | BANG
"""
failure = (None, tokens)
(token, tokens) = alternation(tokens, [
parse_MINUS,
parse_AMP,
parse_STAR,
parse_BANG
])
if token is None:
return failure
ast = ('ast', 'uoperator', [token])
return (ast, tokens)
def parse_type(tokens):
"""
Attemps to parse a type declaration.
Grammar:
type = POINTER LT type GT
| ARRAY LT type GT
| FUNCTION LT type GT
| ARRAY OBRACKET INTLIT CBRACKET LT type GT
| IDENTIFIER
"""
failure = (None, tokens)
def parse_type1(tokens, toktype):
"""
Grammar fragments:
POINTER LT type GT
ARRAY LT type GT
'array<int>' becomes:
('ast', 'type', [
('token', 'ARRAY', 'array'),
('ast', 'type', [
('token', 'IDENTIFIER', 'int')
])
])
'pointer<char>' becomes:
('ast', 'type', [
('token', 'POINTER', 'pointer'),
('ast', 'type', [
('token', 'IDENTIFIER', 'char')
])
])
'array<pointer<char>>' becomes:
('ast', 'type', [
('token', 'ARRAY', 'array'),
('ast', 'type', [
('token', 'POINTER', 'pointer'),
('ast', 'type', [
('token', 'IDENTIFIER', 'char')
])
])
])
"""
failure = (None, tokens)
(identifier_token, tokens) = parse_terminal(tokens, toktype)
if identifier_token is None:
return failure
(asts, tokens) = concatenation(tokens, [
parse_LT,
parse_type,
parse_GT
])
if asts is None:
return failure
subtype_ast = asts[1]
ast = ('ast', 'type', [identifier_token, subtype_ast])
return (ast, tokens)
def parse_type2(tokens):
"""
Grammar fragment:
ARRAY OBRACKET INTLIT CBRACKET LT type GT
'array[8]<int>' becomes:
('ast', 'type', [
('token', 'ARRAY', 'array'),
('token', 'INTLIT', '8'),
('ast', 'type', [
('token', 'IDENTIFIER', 'int')
])
])
"""
failure = (None, tokens)
(asts, tokens) = concatenation(tokens, [
parse_ARRAY,
parse_OBRACKET,
parse_INTLIT,
parse_CBRACKET,
parse_LT,
parse_type,
parse_GT
])
if asts is None:
return failure
identifier_token = asts[0]
intlit_token = asts[2]
subtype_ast = asts[5]
ast = ('ast', 'type', [identifier_token, intlit_token, subtype_ast])
return (ast, tokens)
def parse_type3(tokens):
"""
Grammar fragment:
IDENTIFIER
'int' becomes:
('ast', 'type', [('token', 'IDENTIFIER', 'int')])
'char' becomes:
('ast', 'type', [('token', 'IDENTIFIER', 'char')])
"""
failure = (None, tokens)
(identifier_token, tokens) = parse_terminal(tokens, 'IDENTIFIER')
if identifier_token is None:
return failure
type_ast = ('ast', 'type', [identifier_token])
return (type_ast, tokens)
(ast, tokens) = parse_type1(tokens, 'POINTER')
if ast is not None:
return (ast, tokens)
(ast, tokens) = parse_type1(tokens, 'ARRAY')
if ast is not None:
return (ast, tokens)
(ast, tokens) = parse_type1(tokens, 'FUNCTION')
if ast is not None:
return (ast, tokens)
(ast, tokens) = parse_type2(tokens)
if ast is not None:
return (ast, tokens)
(ast, tokens) = parse_type3(tokens)
if ast is not None:
return (ast, tokens)
return failure
def parse_ws(tokens):
"""
Attempts to parse any amount of whitespace.
Grammar:
ws = (S | NL) { (S | NL) }
"""
failure = (None, tokens)
subnodes = []
(token, tokens) = alternation(tokens, [
parse_S,
parse_NL
])
if token is None:
return failure
subnodes.append(token)
while True:
(token, tokens) = alternation(tokens, [
parse_S,
parse_NL
])
if token is None:
break
subnodes.append(token)
ast = ('ast', 'ws', subnodes)
return (ast, tokens)
# Terminal parsing: parsing individual tokens.
def parse_terminal(tokens, toktype):
"""
Attempts to parse a terminal node of type _toktype_.
Note that the parsed result of a terminal is the token itself, not an AST node.
"""
failure = (None, tokens)
if len(tokens) > 0 and is_toktype(tokens[0], toktype):
return (tokens[0], tokens[1:])
else:
return failure
def parse_FUNC(tokens):
return parse_terminal(tokens, 'FUNC')
def parse_ARRAY(tokens):
return parse_terminal(tokens, 'ARRAY')
def parse_COMMA(tokens):
return parse_terminal(tokens, 'COMMA')
def parse_ARROW(tokens):
return parse_terminal(tokens, 'ARROW')
def parse_IDENTIFIER(tokens):
return parse_terminal(tokens, 'IDENTIFIER')
def parse_COLON(tokens):
return parse_terminal(tokens, 'COLON')
def parse_SEMICOLON(tokens):
return parse_terminal(tokens, 'SEMICOLON')
def parse_INTLIT(tokens):
return parse_terminal(tokens, 'INTLIT')
def parse_FLOATLIT(tokens):
return parse_terminal(tokens, 'FLOATLIT')
def parse_STRINGLIT(tokens):
return parse_terminal(tokens, 'STRINGLIT')
def parse_BOOLLIT(tokens):
return parse_terminal(tokens, 'BOOLLIT')
def parse_OPAREN(tokens):
return parse_terminal(tokens, 'OPAREN')
def parse_CPAREN(tokens):
return parse_terminal(tokens, 'CPAREN')
def parse_OBRACKET(tokens):
return parse_terminal(tokens, 'OBRACKET')
def parse_CBRACKET(tokens):
return parse_terminal(tokens, 'CBRACKET')
def parse_MINUS(tokens):
return parse_terminal(tokens, 'MINUS')
def parse_AMP(tokens):
return parse_terminal(tokens, 'AMP')
def parse_STAR(tokens):
return parse_terminal(tokens, 'STAR')
def parse_BANG(tokens):
return parse_terminal(tokens, 'BANG')
def parse_PLUS(tokens):
return parse_terminal(tokens, 'PLUS')
def parse_SLASH(tokens):
return parse_terminal(tokens, 'SLASH')
def parse_LT(tokens):
return parse_terminal(tokens, 'LT')
def parse_LTEQ(tokens):
return parse_terminal(tokens, 'LTEQ')
def parse_GT(tokens):
return parse_terminal(tokens, 'GT')
def parse_GTEQ(tokens):
return parse_terminal(tokens, 'GTEQ')
def parse_EQ(tokens):
return parse_terminal(tokens, 'EQ')
def parse_EQEQ(tokens):
return parse_terminal(tokens, 'EQEQ')
def parse_BANGEQ(tokens):
return parse_terminal(tokens, 'BANGEQ')
def parse_AMPAMP(tokens):
return parse_terminal(tokens, 'AMPAMP')
def parse_BARBAR(tokens):
return parse_terminal(tokens, 'BARBAR')
def parse_BAR(tokens):
return parse_terminal(tokens, 'BAR')
def parse_LTLT(tokens):
return parse_terminal(tokens, 'LTLT')
def parse_GTGT(tokens):
# So, I'm not super happy about this, but here's why this has to be special-cased:
# If we just treated this like all of the other tokenization and parsing functions,
# we would have a token for '>>', and this would be a one-liner:
# return parse_terminal(tokens, 'GTGT')
# However, that would break `pointer<pointer<char>>`, because the last two
# characters would be a single GTGT token, rathe than two separate GT tokens.
# So we have to implement parse_GTGT by looking for two distinct GT tokens.
failure = (None, tokens)
(gt_tokens, tokens) = concatenation(tokens, [parse_GT, parse_GT])
if gt_tokens is None:
return failure
token = ('token', 'GTGT', '>>')
def parse_TILDE(tokens):
return parse_terminal(tokens, 'TILDE')
def parse_CARAT(tokens):
return parse_terminal(tokens, 'CARAT')
def parse_PLUSEQ(tokens):
return parse_terminal(tokens, 'PLUSEQ')
def parse_MINUSEQ(tokens):
return parse_terminal(tokens, 'MINUSEQ')
def parse_STAREQ(tokens):
return parse_terminal(tokens, 'STAREQ')
def parse_SLASHEQ(tokens):
return parse_terminal(tokens, 'SLASHEQ')
def parse_PERCENTEQ(tokens):
return parse_terminal(tokens, 'PERCENTEQ')
def parse_INDENT(tokens):
return parse_terminal(tokens, 'INDENT')
def parse_DEDENT(tokens):
return parse_terminal(tokens, 'DEDENT')
def parse_IF(tokens):
return parse_terminal(tokens, 'IF')
def parse_ELIF(tokens):
return parse_terminal(tokens, 'ELIF')
def parse_ELSE(tokens):
return parse_terminal(tokens, 'ELSE')
def parse_WHILE(tokens):
return parse_terminal(tokens, 'WHILE')
def parse_FOR(tokens):
return parse_terminal(tokens, 'FOR')
def parse_BREAK(tokens):
return parse_terminal(tokens, 'BREAK')
def parse_CONTINUE(tokens):
return parse_terminal(tokens, 'CONTINUE')
def parse_NL(tokens):
return parse_terminal(tokens, 'NL')
def parse_S(tokens):
return parse_terminal(tokens, 'S')
def parse_LCOMMENT(tokens):
return parse_terminal(tokens, 'LCOMMENT')
def parse_BCOMMENT(tokens):
return parse_terminal(tokens, 'BCOMMENT')
def parse_INCLUDE(tokens):
return parse_terminal(tokens, 'INCLUDE')
def parse_DEFINE(tokens):
return parse_terminal(tokens, 'DEFINE')
# Output generator implementation:
# Note: all generate_x functions share a similar format:
# - Their first argument is as AST.
# - They return a string (of C code).
# If the passed AST is valid, then errors are not possible.
# Exceptions are thrown otherwise.
def generate(ast):
"""Generate C code from the given AST."""
return generate_program(ast)
def generate_program(ast):
"""Generate a C program."""
assert is_asttype(ast, 'program')
outputs = []
for statement_ast in ast_children(ast):
statement = generate_statement(statement_ast)
outputs.append(statement)
output = '\n'.join(outputs)
if not output.endswith('\n'):
output += '\n' # always include a trailing newline.
return output
def generate_statement(ast):
"""Generate a C statement."""
assert is_asttype(ast, 'statement')
sub_ast = ast_children(ast)[0]
assert is_ast(sub_ast)
if is_asttype(sub_ast, 'lstatement'):
return generate_lstatement(sub_ast)
elif is_asttype(sub_ast, 'bstatement'):
return generate_bstatement(sub_ast)
elif is_asttype(sub_ast, 'include'):
return generate_include(sub_ast)
elif is_asttype(sub_ast, 'define'):
return generate_define(sub_ast)
elif is_asttype(sub_ast, 'comment'):
return generate_comment(sub_ast)
elif is_asttype(sub_ast, 'blankline'):
return generate_blankline(sub_ast)
else:
raise Exception("generate_statement: don't know how to generate %s" % str(sub_ast))
def generate_bstatement(ast):
"""Generate a C "block" statement."""
assert is_asttype(ast, 'bstatement')
sub_ast = ast_children(ast)[0]
assert is_ast(sub_ast)
if is_asttype(sub_ast, 'fundecl'):
return generate_fundecl(sub_ast)
elif is_asttype(sub_ast, 'if'):
return generate_if(sub_ast)
elif is_asttype(sub_ast, 'while'):
return generate_while(sub_ast)
elif is_asttype(sub_ast, 'for'):
return generate_for(sub_ast)
elif is_asttype(sub_ast, 'scope'):
return generate_scope(sub_ast)
else:
raise Exception("generate_bstatement: don't know how to generate %s" % str(sub_ast))
def generate_lstatement(ast):
"""Generate a C "line" statement."""
assert is_asttype(ast, 'lstatement')
subnode = ast_children(ast)[0]
if is_asttype(subnode, 'declassign'):
return generate_declassign(subnode)
elif is_asttype(subnode, 'assign'):
return generate_assign(subnode)
elif is_asttype(subnode, 'vardecl'):
return generate_vardecl(subnode) + ';'
elif is_asttype(subnode, 'funcall'):
return generate_funcall(subnode) + ';'
elif is_asttype(subnode, 'return'):
return generate_return(subnode)
elif is_toktype(subnode, 'BREAK'):
return "break;"
elif is_toktype(subnode, 'CONTINUE'):
return "continue;"
else:
raise Exception("generate_lstatement: don't know how to generate %s" % str(subnode))
def generate_blankline(ast):
"""Generates a blank line."""
assert is_asttype(ast, 'blankline')
return ""
def generate_comment(ast):
"""Generate a C (line or block-level) comment."""
assert is_asttype(ast, 'comment')
comment_token = ast_children(ast)[0]
assert is_toktype(comment_token, 'LCOMMENT') or is_toktype(comment_token, 'BCOMMENT')
return token_text(comment_token)
def generate_include(ast):
"""Generate a C include statement."""
assert is_asttype(ast, 'include')
include_token = ast_children(ast)[0]
assert is_toktype(include_token, 'INCLUDE')
return token_text(include_token)
def generate_define(ast):
"""Generate a C define statement."""
assert is_asttype(ast, 'define')
define_token = ast_children(ast)[0]
assert is_toktype(define_token, 'DEFINE')
return token_text(define_token)
def generate_declassign(ast):
"""Generate a C declaration-assignment statement."""
assert is_asttype(ast, 'declassign')
children = ast_children(ast)
vardecl_ast = children[0]
assert is_asttype(vardecl_ast, 'vardecl')
vardecl = generate_vardecl(vardecl_ast)
expr_ast = children[1]
assert is_asttype(expr_ast, 'expr')
expr = generate_expr(expr_ast)
return "%s = %s;" % (vardecl, expr)
def generate_assign(ast):
"""Generate a C assignment statement."""
assert is_asttype(ast, 'assign')
children = ast_children(ast)
identifier_token = children[0]
assert is_toktype(identifier_token, 'IDENTIFIER')
identifier = token_text(identifier_token)
op_ast = children[1]
assert is_asttype(op_ast, 'aoperator')
op_token = ast_children(op_ast)[0]
assert is_token(op_token)
op = token_text(op_token)
expr_ast = children[2]
assert is_asttype(expr_ast, 'expr')
expr = generate_expr(expr_ast)
return "%s %s %s;" % (identifier, op, expr)
def generate_vardecl(ast):
"""
Generate a C variable declaration.
Note: we don't append the trailing ';' here because this is also used for function args.
"""
assert is_asttype(ast, 'vardecl')
children = ast_children(ast)
identifier_token = children[0]
assert is_toktype(identifier_token, 'IDENTIFIER')
var_name = token_text(identifier_token)
type_ast = children[1]
assert is_asttype(type_ast, 'type')
# e.g. 'int a', 'char* b', 'int c[8]', 'char** d', 'int* d[3]', etc.
return "%s" % generate_type(type_ast, var_name)
def generate_funcall(ast):
"""Generate a C function call."""
assert is_asttype(ast, 'funcall')
children = ast_children(ast)
identifier_token = children[0]
assert is_toktype(identifier_token, 'IDENTIFIER')
identifier = token_text(identifier_token)
exprs = [generate_expr(a) for a in children[1:]]
return "%s(%s)" % (identifier, ", ".join(exprs))
def generate_return(ast):
"""Generate a C return statement."""
assert is_asttype(ast, 'return')
children = ast_children(ast)
if len(children) == 0:
return "return;"
else:
expr_ast = children[0]
return "return %s;" % generate_expr(expr_ast)
def generate_fundecl(ast):
"""Generate a C function declaration."""
assert is_asttype(ast, 'fundecl')
identifier_token = ast_children(ast)[0]
assert is_toktype(identifier_token, 'IDENTIFIER')
identifier = token_text(identifier_token)
fundeclargs_ast = ast_children(ast)[1]
assert is_asttype(fundeclargs_ast, 'fundeclargs')
arg_asts = ast_children(fundeclargs_ast)
args = []
for vardecl_ast in arg_asts:
arg = generate_vardecl(vardecl_ast)
args.append(arg)
args_output = ", ".join(args)
fundeclret_ast = ast_children(ast)[2]
if fundeclret_ast is None:
ret = "void"
else:
ret_ast = ast_children(fundeclret_ast)[0]
ret = generate_type(ret_ast)
scope_ast = ast_children(ast)[3]
scope = generate_scope(scope_ast)
output = "%s %s(%s) %s" % (ret, identifier, args_output, scope)
return output
def generate_if(ast):
"""Generate a C if statement"""
assert is_asttype(ast, 'if')
children = ast_children(ast)
predicate_ast = children[0]
assert is_asttype(predicate_ast, 'expr')
predicate = generate_expr(predicate_ast)
output = "if (%s) " % (predicate)
scope_ast = children[1]
assert is_asttype(scope_ast, 'scope')
output += generate_scope(scope_ast)
for ast in children[2:]:
assert is_ast(ast)
if ast_type(ast) == 'elif':
elif_children = ast_children(ast)
predicate_ast = elif_children[0]
assert is_asttype(predicate_ast, 'expr')
predicate = generate_expr(predicate_ast)
output += " else if (%s) " % (predicate)
scope_ast = elif_children[1]
assert is_asttype(scope_ast, 'scope')
output += generate_scope(scope_ast)
elif ast_type(ast) == 'else':
else_children = ast_children(ast)
scope_ast = else_children[0]
assert is_asttype(scope_ast, 'scope')
output += " else " + generate_scope(scope_ast)
else:
raise Exception("generate_if: unexpected AST type %s" % ast_type(ast))
return output
def generate_while(ast):
"""Attempts to generate a C while statement."""
assert is_asttype(ast, 'while')
children = ast_children(ast)
expr_ast = children[0]
expr = generate_expr(expr_ast)
scope_ast = children[1]
scope = generate_scope(scope_ast)
output = "while (%s) %s" % (expr, scope)
return output
def generate_for(ast):
"""Generates a C for statement."""
assert is_asttype(ast, 'for')
children = ast_children(ast)
output = "for ("
lstatement_ast = children[0]
if lstatement_ast is not None:
assert is_asttype(lstatement_ast, 'lstatement')
output += generate_lstatement(lstatement_ast)
output += " "
expr2_ast = children[1]
assert is_asttype(expr2_ast, 'expr')
output += generate_expr(expr2_ast) + "; "
lstatement3_ast = children[2]
assert is_asttype(lstatement3_ast, 'lstatement')
output += generate_lstatement(lstatement3_ast)
if output.endswith(';'):
# the third clause in a for-loop doesn't need a semicolon.
output = output[:-1]
output += ") "
scope_ast = children[3]
assert is_asttype(scope_ast, 'scope')
output += generate_scope(scope_ast)
return output
def generate_scope(ast):
"""Generate a C scope."""
assert is_ast(ast)
assert is_asttype(ast, 'scope')
output = "{"
for statement_ast in ast_children(ast):
statement = generate_statement(statement_ast)
output += ("\n" + indent_text(statement))
line = "\n}"
output += line
return output
def generate_expr(ast):
"""Generate a C expression."""
assert is_asttype(ast, 'expr')
child = ast_children(ast)[0]
if is_ast(child):
if is_asttype(child, 'funcall'):
return generate_funcall(child)
if is_asttype(child, 'binary'):
return generate_binary(child)
elif is_asttype(child, 'unary'):
return generate_unary(child)
elif is_token(child):
token = child
if is_toktype(token, 'INTLIT') \
or is_toktype(token, 'FLOATLIT') \
or is_toktype(token, 'STRINGLIT') \
or is_toktype(token, 'BOOLLIT') \
or is_toktype(token, 'IDENTIFIER') \
:
output = token_text(token)
return output
else:
raise Exception("generate_expr: unexpcted ast %s" % str(ast))
def generate_binary(ast):
"""Generate a binary-operator C statement."""
assert is_asttype(ast, 'binary')
children = ast_children(ast)
expr1_ast = children[0]
assert is_asttype(expr1_ast, 'expr')
expr1 = generate_expr(expr1_ast)
op_ast = children[1]
assert is_asttype(op_ast, 'boperator')
op_token = ast_children(op_ast)[0]
assert is_token(op_token)
op = token_text(op_token)
expr2_ast = children[2]
assert is_asttype(expr2_ast, 'expr')
expr2 = generate_expr(expr2_ast)
return "(%s %s %s)" % (expr1, op, expr2)
def generate_unary(ast):
"""Generate a unary-operator C statement."""
assert is_asttype(ast, 'unary')
children = ast_children(ast)
op_ast = children[0]
assert is_asttype(op_ast, 'uoperator')
op_token = ast_children(op_ast)[0]
assert is_token(op_token)
op = token_text(op_token)
expr_ast = children[1]
assert is_asttype(expr_ast, 'expr')
expr = generate_expr(expr_ast)
return "%s(%s)" % (op, expr)
def generate_type(ast, identifier=""):
"""
Generate a C type declaration.
If identifier is not given, this is an abstract type declaration (e.g. a cast).
"""
def did_switch_direction(previous, current):
"""
Detect a change in 'direction' while intepreting the type stack.
C type declaration operator precedence requires that we "go right" first.
i.e. 'int *foo[]' is "array of pointer to int", not "pointer to array of int".
In order to express "pointer to array of int", we have to use parenthesis
to change overcome operator precedence, i.e. 'int (*foo)[]'.
Any time we need to "change direction", we need to wrap in parenthesis.
See http://unixwiz.net/techtips/reading-cdecl.html
"""
lefts = ['pointer']
rights = ['array', 'function']
return (previous in lefts and current in rights) \
or (previous in rights and current in lefts)
assert is_asttype(ast, 'type')
types = type_ast_as_list(ast)
output = identifier
previous = None
for t in types[:-1]:
if t == 'pointer':
if did_switch_direction(previous, t):
output = "*(%s)" % output
else:
output = "*%s" % output
elif t == 'array':
if did_switch_direction(previous, t):
output = "(%s)[]" % output
else:
output = "%s[]" % output
elif t == 'function':
if did_switch_direction(previous, t):
output = "(%s)()" % output
else:
output = "%s()" % output
elif t.startswith('array:'):
array_size = int(t.split(':')[1])
if did_switch_direction(previous, 'array'):
output = "(%s)[%i]" % (output, array_size)
else:
output = "%s[%i]" % (output, array_size)
else:
raise Exception("generate_type: unexpected type '%s'." % t)
if t.startswith('array:'):
previous = 'array'
else:
previous = t
base_type = types[-1]
if identifier == "":
output = "%s%s" % (base_type, output)
else:
output = "%s %s" % (base_type, output)
return output
def type_ast_as_list(ast):
"""
Return the type AST as a list of types.
e.g.
- 'int' results in ['int']
- 'pointer<array<char>>' results in ['pointer', 'array', 'char']
- 'array[8]<int>' results in ['array:8', 'int']
"""
assert is_asttype(ast, 'type')
children = ast_children(ast)
if len(children) == 1:
# if there is only one child, it is the "base" type (e.g. int, char, etc).
assert is_token(children[0])
token = children[0]
assert is_toktype(token, 'IDENTIFIER')
# return e.g. ['int'], ['char'], etc.
return [token_text(token)]
elif len(children) == 2:
# if there are 2 children, this is either array or pointer or function.
assert is_toktype(children[0], 'ARRAY') \
or is_toktype(children[0], 'POINTER') \
or is_toktype(children[0], 'FUNCTION')
token = children[0]
assert is_ast(children[1])
sub_ast = children[1]
assert is_asttype(ast, 'type')
# return e.g. ['array', <recursive call>]
return [token_text(token)] + type_ast_as_list(sub_ast)
elif len(children) == 3:
# if there are three children, this is a dimensioned array (e.g. array[8]).
assert is_toktype(children[0], 'ARRAY')
assert is_toktype(children[1], 'INTLIT')
intlit_token = children[1]
array_size = int(token_text(intlit_token))
assert is_ast(children[2])
sub_ast = children[2]
assert is_asttype(sub_ast, 'type')
# return e.g. ['array:8', <recursive call>]
return ["array:%s" % (array_size)] + type_ast_as_list(sub_ast)
else:
raise Exception("type_ast_as_list: type AST node with more than 3 children!")
def indent_text(text):
"""Add indentation to the given block of text."""
indent = " " * INDENT_UNIT
indented_lines = [indent + line for line in text.splitlines()]
return "\n".join(indented_lines)
if __name__ == "__main__":
import sys
import pprint
tdefs = load_tokendefs("tokendefs.txt")
keywords = load_keywords("keywords.txt")
input = None
if len(sys.argv) > 1 and not sys.argv[-1].startswith('--'):
input = open(sys.argv[-1]).read()
else:
input = sys.stdin.read()
tokens = tokenize(tdefs, keywords, input)
if '--tokens' in sys.argv:
pprint.pprint(tokens)
sys.exit(0)
ast = parse(tokens)
if '--ast' in sys.argv:
pprint.pprint(ast)
sys.exit(0)
output = generate(ast)
sys.stdout.write(output)
pointer
POINTER
array
ARRAY
function
FUNCTION
func
FUNC
return
RETURN
true
BOOLLIT
false
BOOLLIT
if
IF
elif
ELIF
else
ELSE
while
WHILE
for
FOR
break
BREAK
continue
CONTINUE
👉 test27.cy
input:
a = 1
a = -2
a: int
a: int = 3
a = (1 + (2 * (3 / 4)))
a = foo()
a = bar(1)
a = baz(1, 2, (1 + 1), foo(), bar(1))
a: float = -3.14159
a: bool = true
a: bool = false
a: pointer<char> = "hello, world!"
a = "I said \"Hello\" to the baker."
return
return 1
return a
return (1 + 1)
return foo(1, (2 + 3), bar(1))
foo()
foo(1)
foo(foo(1))
foo(1, (2 + 3), foo(), foo(1, 2))
:
return
:
return
a = 1
:
a: int = 1
a += 1
:
b: int = (1 + foo())
:
c: float = -1.1
return b
func foo():
return
func foo(a: int, b: pointer<char>) -> pointer<bool>:
c: float = -1.1
return (c + 1)
func main(argc: int, argv: pointer<pointer<char>>) -> int:
a: int = foo()
b: float = ((foo() + foo()) / foo())
if 1:
return a
else:
if (1 + 1):
z: char = "z"
elif foo():
return
else:
foo()
while 1:
break
while true:
a += 1
if (a > 42):
break
else:
a += 2
continue
for ; (i < 1); i += 1:
continue
for i = 2; (i < 42); i += 2:
sideEffect()
for i: int = 0; (i < len(things)); i += 1:
if (i == 0):
continue
else:
sideEffect()
// this is a line-level comment
/* this is a block-level comment */
/*
* This is a
* multi-line
* block-level
* comment.
*/
#include <foo.h>
#include <dir/foo.h>
#include "foo.h"
#include "dir/foo.h"
#include "/dir/foo.h"
#define FOOS 1
#define debug(m,e) printf("%s:%d: %s:",__FILE__,__LINE__,m); print_obj(e,1); puts("");
output:
a = 1;
a = -2;
int a;
int a = 3;
a = (1 + (2 * (3 / 4)));
a = foo();
a = bar(1);
a = baz(1, 2, (1 + 1), foo(), bar(1));
float a = -3.14159;
bool a = true;
bool a = false;
char *a = "hello, world!";
a = "I said \"Hello\" to the baker.";
return;
return 1;
return a;
return (1 + 1);
return foo(1, (2 + 3), bar(1));
foo();
foo(1);
foo(foo(1));
foo(1, (2 + 3), foo(), foo(1, 2));
{
return;
}
{
return;
a = 1;
}
{
int a = 1;
a += 1;
}
{
int b = (1 + foo());
{
float c = -1.1;
return b;
}
}
void foo() {
return;
}
bool* foo(int a, char *b) {
float c = -1.1;
return (c + 1);
}
int main(int argc, char **argv) {
int a = foo();
float b = ((foo() + foo()) / foo());
if (1) {
return a;
} else {
if ((1 + 1)) {
char z = "z";
} else if (foo()) {
return;
} else {
foo();
}
}
}
while (1) {
break;
}
while (true) {
a += 1;
if ((a > 42)) {
break;
} else {
a += 2;
continue;
}
}
for ( (i < 1); i += 1) {
continue;
}
for (i = 2; (i < 42); i += 2) {
sideEffect();
}
for (int i = 0; (i < len(things)); i += 1) {
if ((i == 0)) {
continue;
} else {
sideEffect();
}
}
// this is a line-level comment
/* this is a block-level comment */
/*
* This is a
* multi-line
* block-level
* comment.
*/
#include <foo.h>
#include <dir/foo.h>
#include "foo.h"
#include "dir/foo.h"
#include "/dir/foo.h"
#define FOOS 1
#define debug(m,e) printf("%s:%d: %s:",__FILE__,__LINE__,m); print_obj(e,1); puts("");
✅ test27.cy
#!/bin/bash
# run all of the tests
set -e
for f in test*.cy
do
echo "👉 $f"
echo " input:"
cat $f | sed 's/^/ /'
outfile=`mktemp`
./cy.py $f > $outfile
echo " output:"
cat $outfile | sed 's/^/ /'
cfile="`basename $f .cy`.c"
if diff -q $cfile $outfile >/dev/null
then
echo "✅ $f"
else
echo "❌ $f"
diff -urN --color=auto $cfile $outfile
fi
echo
done
a = 1;
a = -2;
int a;
int a = 3;
a = (1 + (2 * (3 / 4)));
a = foo();
a = bar(1);
a = baz(1, 2, (1 + 1), foo(), bar(1));
float a = -3.14159;
bool a = true;
bool a = false;
char *a = "hello, world!";
a = "I said \"Hello\" to the baker.";
return;
return 1;
return a;
return (1 + 1);
return foo(1, (2 + 3), bar(1));
foo();
foo(1);
foo(foo(1));
foo(1, (2 + 3), foo(), foo(1, 2));
{
return;
}
{
return;
a = 1;
}
{
int a = 1;
a += 1;
}
{
int b = (1 + foo());
{
float c = -1.1;
return b;
}
}
void foo() {
return;
}
bool* foo(int a, char *b) {
float c = -1.1;
return (c + 1);
}
int main(int argc, char **argv) {
int a = foo();
float b = ((foo() + foo()) / foo());
if (1) {
return a;
} else {
if ((1 + 1)) {
char z = "z";
} else if (foo()) {
return;
} else {
foo();
}
}
}
while (1) {
break;
}
while (true) {
a += 1;
if ((a > 42)) {
break;
} else {
a += 2;
continue;
}
}
for ( (i < 1); i += 1) {
continue;
}
for (i = 2; (i < 42); i += 2) {
sideEffect();
}
for (int i = 0; (i < len(things)); i += 1) {
if ((i == 0)) {
continue;
} else {
sideEffect();
}
}
// this is a line-level comment
/* this is a block-level comment */
/*
* This is a
* multi-line
* block-level
* comment.
*/
#include <foo.h>
#include <dir/foo.h>
#include "foo.h"
#include "dir/foo.h"
#include "/dir/foo.h"
#define FOOS 1
#define debug(m,e) printf("%s:%d: %s:",__FILE__,__LINE__,m); print_obj(e,1); puts("");
a = 1
a = -2
a: int
a: int = 3
a = (1 + (2 * (3 / 4)))
a = foo()
a = bar(1)
a = baz(1, 2, (1 + 1), foo(), bar(1))
a: float = -3.14159
a: bool = true
a: bool = false
a: pointer<char> = "hello, world!"
a = "I said \"Hello\" to the baker."
return
return 1
return a
return (1 + 1)
return foo(1, (2 + 3), bar(1))
foo()
foo(1)
foo(foo(1))
foo(1, (2 + 3), foo(), foo(1, 2))
:
return
:
return
a = 1
:
a: int = 1
a += 1
:
b: int = (1 + foo())
:
c: float = -1.1
return b
func foo():
return
func foo(a: int, b: pointer<char>) -> pointer<bool>:
c: float = -1.1
return (c + 1)
func main(argc: int, argv: pointer<pointer<char>>) -> int:
a: int = foo()
b: float = ((foo() + foo()) / foo())
if 1:
return a
else:
if (1 + 1):
z: char = "z"
elif foo():
return
else:
foo()
while 1:
break
while true:
a += 1
if (a > 42):
break
else:
a += 2
continue
for ; (i < 1); i += 1:
continue
for i = 2; (i < 42); i += 2:
sideEffect()
for i: int = 0; (i < len(things)); i += 1:
if (i == 0):
continue
else:
sideEffect()
// this is a line-level comment
/* this is a block-level comment */
/*
* This is a
* multi-line
* block-level
* comment.
*/
#include <foo.h>
#include <dir/foo.h>
#include "foo.h"
#include "dir/foo.h"
#include "/dir/foo.h"
#define FOOS 1
#define debug(m,e) printf("%s:%d: %s:",__FILE__,__LINE__,m); print_obj(e,1); puts("");
IDENTIFIER
[a-zA-Z][a-zA-Z0-9_]*
FLOATLIT
-?\d+\.\d+
INTLIT
-?\d+
STRINGLIT
"([^"\\]|\\[\s\S])*"
LCOMMENT
//.*
BCOMMENT
/\*[\s\S]*?\*/
INCLUDE
#include .*
DEFINE
#define .*
ARROW
->
LTEQ
<=
GTEQ
>=
EQEQ
==
BANGEQ
!=
PLUSEQ
\+=
MINUSEQ
-=
STAREQ
\*=
SLASHEQ
/=
PERCENTEQ
%=
AMPAMP
&&
BARBAR
\|\|
LTLT
<<
TILDE
~
CARAT
\^
COLON
:
SEMICOLON
;
COMMA
,
PLUS
\+
MINUS
-
STAR
\*
SLASH
/
PERCENT
%
LT
<
GT
>
EQ
=
AMP
&
BAR
\|
BANG
!
OBRACKET
\[
CBRACKET
]
OPAREN
\(
CPAREN
\)
S
+
NL
\n
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment