Skip to content
{{ message }}

Instantly share code, notes, and snippets.

# cellularmitosis/README.md

Last active Oct 22, 2019
An alternative syntax for C, part 4: unary and binary operators

Blog 2019/10/16

# An alternative syntax for C, part 4: unary and binary operators

This version of the transpiler implements support for the C unary and binary operators.

This Cy code:

``````func foo(a: pointer<function<char>>, z_z: array<pointer<char>>) -> float:
z: int = (1 + (2 * 3))
z += (3 - 1)
y: int = -((1 + 1))
w: int = !(0)
return (-5 / -3.14159)

func main(argc: int, argv: pointer<pointer<char>>) -> int:
return 0
``````

transpiles to this C code:

``````float foo(char (*a)(), char *(z_z[])) {
int z = (1 + (2 * 3));
z += (3 - 1);
int y = -((1 + 1));
int w = !(0);
return (-5 / -3.14159);
}

int main(int argc, char **argv) {
return 0;
}
``````

## Notes:

The `--` and `++` have been excluded (see Swift proposal SE-0004 for a good discussion).

The syntax around operators is pretty strict and opinionated:

• Unary operators must use parens just like a regular function call.
• : `-foo`
• : `-(foo)`
• Binary operators must use parens and be delimited by whitespace.
• : `1+1`
• : `(1+1)`
• : `(1 + 1)`

One advantage of this approach (other than making the grammar simpler 😜) is that no knowledge of operator precedence is ever needed, as it is impossible to express an ambiguous sequence of binary operations.

That is, the order-of-operations ambiguity of `1 + 2 / 3` isn't possible, because this must be written as `(1 + (2 / 3))`.

However, this can result in some superfluous parens when nesting binary and unary operations, e.g. `-((1 + 1))`. When considered conceptually, this seems reasonable: this is a nesting of `(_ + _)` and `-(_)`. But at first glance, I have to admit it looks a bit silly.

## TL;DR: What is Cy?

Cy is an alternative syntax for C which features indentation-based scoping and intuitive type declarations.

``````func main(argc: int, argv: pointer<pointer<char>>) -> int:
printf("Hello, world!")
return 0
``````

## But why tho?

To teach myself how to write a transpiler! 🤩🤩🤩

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 #!/usr/bin/env python # -*- coding: utf-8 -*- # cy.py: a traspiler for an alternate syntax for C. # Note: this is an incomplete work-in-progress. # Data structures: # A token is a tuple. # First element is a string 'token'. # Second element is a string which is the token type (e.g. 'OBRACKET'). # Third element is a string which is the source text of the token (e.g. '['). # e.g. ('token', 'OBRACKET', '[') # e.g. ('token', 'IDENTIFIER', 'foo') # e.g. ('token', 'WS', ' ') # An AST node is a tuple. # First element is a string 'ast'. # Second element is a string which is the AST node type (e.g. 'vardecl'). # Third element is an array which are the child tuples (AST nodes or tokens). # e.g. ('ast', 'type', [('token', 'IDENTIFIER', 'int')]) # Grammar: # # program = statement [WS] { statement [WS] } # statement = declassign | assign | fundecl | vardecl | scope | return # fundecl = FUNC WS IDENTIFIER fundeclargs [ fundeclret ] scope # fundeclargs = OPAREN [ [WS] vardecl { COMMA WS vardecl } [WS] ] CPAREN # fundeclret = WS ARROW WS type # vardecl = IDENTIFIER COLON WS type # scope = COLON INDENT statement { DENT statement } DEDENT # return = RETURN [ expr ] # declassign = vardecl WS EQ WS expr # assign = IDENTIFIER WS aoperator WS expr # aoperator = PLUSEQ | MINUSEQ | STAREQ | SLASHEQ | PERCENTEQ | EQ # unary = uoperator OPAREN expr CPAREN # uoperator = MINUS | AMP | STAR | BANG # binary = OPAREN expr WS boperator WS expr CPAREN # boperator = PLUS | MINUS | STAR | SLASH | PERCENT # | LT | LTEQ | GT | GTEQ | EQEQ | BANGEQ # | AMPAMP | BARBAR | BANG # | AMP | BAR | LTLT | GTGT | TILDE | CARAT # expr = binary | unary | INTLIT | FLOATLIT | STRINGLIT | BOOLLIT | IDENTIFIER # type = POINTER LT type GT # | ARRAY LT type GT # | FUNCTION LT type GT # | ARRAY OBRACKET INTLIT CBRACKET LT type GT # | IDENTIFIER # # Note: lower case are AST nodes. # Note: ALL CAPS are tokens (terminals). # Token data structure utils: def is_token(token): """Returns whether the arg is a token.""" return isinstance(token, tuple) and len(token) > 0 and token[0] == 'token' def token_type(token): """Returns the type of the token (e.g. 'IDENTIFIER').""" assert is_token(token) return token[1] def token_text(token): """Returns the source text of the token (e.g. 'foo' for a 'IDENTIFIER' token).""" assert is_token(token) return token[2] def is_toktype(token, toktype): """ Returns whether the token is of the given type. _toktype_ is e.g. 'IDENTIFIER', 'COLON', etc. """ assert is_token(token) return token_type(token) == toktype # Tokenizer implementation: def load_tokendefs(fpath): """ Load the token definitions from the file at fpath. Returns a list of tuples of the form (, ). The format of the tokendefs file should be pairs of lines: - a line which is the token type (e.g. 'IDENTIFIER', 'OBRACKET', etc), - followed by a line which is the regex which recognizes that token. Example tokendefs.txt: IDENTIFIER [a-zA-Z][a-zA-Z0-9]* OPAREN \( CPAREN ) """ import re tokendefs = [] with open(fpath) as f: for line in f: token_name = line.rstrip('\n') pattern = f.next().rstrip('\n') regex = re.compile(pattern) tokendef = (token_name, regex) tokendefs.append(tokendef) return tokendefs def offside_rule(token, previous, indent): """ Implement the "offside rule" indentation-based scoping mechanism. See https://en.wikipedia.org/wiki/Off-side_rule If this is a WS (containing a newline) following a COLON and indentation has increased, sub in an INDENT token. If this is a WS (containing a newline) and indentation has decreased, sub in a DEDENT token. If this is a WS (containing a newline) and indentation hasn't changed, sub in a DENT token. Otherwise, return the token unmodified. """ assert is_token(token) assert is_token(previous) or previous is None text = token_text(token) if is_toktype(token, 'WS') and '\n' in text: indentation = text.split('\n')[-1] if is_toktype(previous, 'COLON') and len(indentation) > indent: token = ('token', 'INDENT', text) indent += 1 elif len(indentation) < indent: token = ('token', 'DEDENT', text) indent -= 1 assert indent >= 0 else: if indent > 0: token = ('token', 'DENT', text) return (token, indent) def tokenize(tokendefs, input, indent=0): """Uses tokendefs to tokenize the 'input' string and return a list of tokens""" tokens = [] offset = 0 previous = None while offset < len(input): for (token_name, regex) in tokendefs: m = regex.match(input, offset) if m is not None: matched_text = m.group(0) offset += len(matched_text) token = ('token', token_name, matched_text) (token, indent) = offside_rule(token, previous, indent) if token is not None: tokens.append(token) previous = token break else: raise Exception("Couldn't tokenize starting at '%s...'" % input[offset:offset+16]) return tokens # AST node data structure utils: def is_ast(ast): """Returns whether the arg is an AST node.""" return isinstance(ast, tuple) and len(ast) > 0 and ast[0] == 'ast' def is_asttype(ast, asttype): """ Returns whether the AST node is of the given type. _asttype_ is e.g. 'vardecl', 'statement', etc. """ assert is_ast(ast) return ast[1] == asttype def ast_children(ast): """Return the child nodes of the given AST.""" assert is_ast(ast) return ast[2] # Parser helpers: # See https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form#Table_of_symbols def alternation(tokens, funcs): """Try each of the parsing functions until one succeeds.""" failure = (None, tokens) for f in funcs: (ast, tokens) = f(tokens) if ast is not None: return (ast, tokens) return failure def concatenation(tokens, funcs): """ All of the parsing functions must succeed as a whole. Returns (, ). """ failure = (None, tokens) asts = [] for f in funcs: (ast, tokens) = f(tokens) if ast is None: return failure asts.append(ast) return (asts, tokens) # Parsing functions: # Note: all parse_x functions share a similar format: # - Their first argument is the list of tokens. # - They return a tuple of (, ). # - is either an AST node or a token (a terminal). # - is the list of unconsumed tokens. # - If the parse fails, (None, ) is returned, where # is the list of tokens which was passed in. # fundecl examples: # func f(): # func f(a: int): # func f() -> int: # func f(a: int) -> int: # func f(a: int, b: int): # func f(a: int, b: int) -> int: def parse_fundecl(tokens): """ Attempts to parse a function declaration. Grammar: fundecl = FUNC WS IDENTIFIER fundeclargs [ fundeclret ] scope """ failure = (None, tokens) (asts, tokens) = concatenation(tokens, [ parse_FUNC, parse_WS, parse_IDENTIFIER, parse_fundeclargs ]) if asts is None: return failure identifier_token = asts[2] args_ast = asts[3] (ret_ast, tokens) = parse_fundeclret(tokens) (scope_ast, tokens) = parse_scope(tokens) if scope_ast is None: return failure ast = ('ast', 'fundecl', [identifier_token, args_ast, ret_ast, scope_ast]) return (ast, tokens) def parse_fundeclargs(tokens): """ Attempts to parse a list of function declaration arguments. Grammar: fundeclargs = OPAREN [ [WS] vardecl { COMMA WS vardecl } [WS] ] CPAREN """ failure = (None, tokens) (ast, tokens) = parse_OPAREN(tokens) if ast is None: return failure args = [] (_, tokens) = parse_WS(tokens) (arg, tokens) = parse_vardecl(tokens) if arg is not None: args.append(arg) while True: (ast, tokens) = parse_COMMA(tokens) if ast is None: break (ast, tokens) = parse_WS(tokens) if ast is None: break (arg, tokens) = parse_vardecl(tokens) if ast is None: break args.append(arg) (ast, tokens) = parse_CPAREN(tokens) if ast is None: return failure ast = ('ast', 'fundeclargs', args) return (ast, tokens) def parse_fundeclret(tokens): """ Attempts to parse a function declaration return value. Grammar: fundeclret = WS ARROW WS type """ failure = (None, tokens) (asts, tokens) = concatenation(tokens, [ parse_WS, parse_ARROW, parse_WS, parse_type ]) if asts is None: return failure type_ast = asts[3] ast = ('ast', 'fundeclret', [type_ast]) return (ast, tokens) def parse_scope(tokens): """ Attempts to parse a scoped sequence of statements. Grammar: scope = COLON INDENT statement { DENT statement } DEDENT """ failure = (None, tokens) (asts, tokens) = concatenation(tokens, [ parse_COLON, parse_INDENT, parse_statement ]) if asts is None: return failure statements = [] statements.append(asts[2]) while True: (ast, tokens) = parse_DENT(tokens) if ast is None: break (ast, tokens) = parse_statement(tokens) if ast is None: break statements.append(ast) (ast, tokens) = parse_DEDENT(tokens) if ast is None: return failure ast = ('ast', 'scope', statements) return (ast, tokens) def parse_vardecl(tokens): """ Attempts to parse a vardecl AST node. Grammar: vardecl = IDENTIFIER COLON WS type """ failure = (None, tokens) (asts, tokens) = concatenation(tokens, [ parse_IDENTIFIER, parse_COLON, parse_WS, parse_type ]) if asts is None: return failure identifier_token = asts[0] type_ast = asts[3] vardecl_ast = ('ast', 'vardecl', [identifier_token, type_ast]) return (vardecl_ast, tokens) def parse_type(tokens): """ Attemps to parse a type declaration. Grammar: type = POINTER LT type GT | ARRAY LT type GT | FUNCTION LT type GT | ARRAY OBRACKET INTLIT CBRACKET LT type GT | IDENTIFIER """ def parse_type1(tokens, toktype): """ Grammar fragments: POINTER LT type GT ARRAY LT type GT 'array' becomes: ('ast', 'type', [ ('token', 'ARRAY', 'array'), ('ast', 'type', [ ('token', 'IDENTIFIER', 'int') ]) ]) 'pointer' becomes: ('ast', 'type', [ ('token', 'POINTER', 'pointer'), ('ast', 'type', [ ('token', 'IDENTIFIER', 'char') ]) ]) 'array>' becomes: ('ast', 'type', [ ('token', 'ARRAY', 'array'), ('ast', 'type', [ ('token', 'POINTER', 'pointer'), ('ast', 'type', [ ('token', 'IDENTIFIER', 'char') ]) ]) ]) """ failure = (None, tokens) (identifier_token, tokens) = parse_terminal(tokens, toktype) if identifier_token is None: return failure (asts, tokens) = concatenation(tokens, [ parse_LT, parse_type, parse_GT ]) if asts is None: return failure subtype_ast = asts[1] ast = ('ast', 'type', [identifier_token, subtype_ast]) return (ast, tokens) def parse_type2(tokens): """ Grammar fragment: ARRAY OBRACKET INTLIT CBRACKET LT type GT 'array[8]' becomes: ('ast', 'type', [ ('token', 'ARRAY', 'array'), ('token', 'INTLIT', '8'), ('ast', 'type', [ ('token', 'IDENTIFIER', 'int') ]) ]) """ failure = (None, tokens) (asts, tokens) = concatenation(tokens, [ parse_ARRAY, parse_OBRACKET, parse_INTLIT, parse_CBRACKET, parse_LT, parse_type, parse_GT ]) if asts is None: return failure identifier_token = asts[0] intlit_token = asts[2] subtype_ast = asts[5] ast = ('ast', 'type', [identifier_token, intlit_token, subtype_ast]) return (ast, tokens) def parse_type3(tokens): """ Grammar fragment: IDENTIFIER 'int' becomes: ('ast', 'type', [('token', 'IDENTIFIER', 'int')]) 'char' becomes: ('ast', 'type', [('token', 'IDENTIFIER', 'char')]) """ failure = (None, tokens) (identifier_token, tokens) = parse_terminal(tokens, 'IDENTIFIER') if identifier_token is None: return failure type_ast = ('ast', 'type', [identifier_token]) return (type_ast, tokens) failure = (None, tokens) (ast, tokens) = parse_type1(tokens, 'POINTER') if ast is not None: return (ast, tokens) (ast, tokens) = parse_type1(tokens, 'ARRAY') if ast is not None: return (ast, tokens) (ast, tokens) = parse_type1(tokens, 'FUNCTION') if ast is not None: return (ast, tokens) (ast, tokens) = parse_type2(tokens) if ast is not None: return (ast, tokens) (ast, tokens) = parse_type3(tokens) if ast is not None: return (ast, tokens) return failure def parse_return(tokens): """ Attempts to parse a return statement. Grammar: return = RETURN [ WS expr ] """ failure = (None, tokens) (ast, tokens) = parse_terminal(tokens, 'RETURN') if ast is None: return failure return_ast = ('ast', 'return', []) (ast, tokens) = parse_terminal(tokens, 'WS') if ast is None: return (return_ast, tokens) (expr_ast, tokens) = parse_expr(tokens) if expr_ast is not None: return_ast = ('ast', 'return', [expr_ast]) return (return_ast, tokens) def parse_expr(tokens): """ Attempts to parse an expression. Expressions result in a value. Grammar: expr = binary | unary | INTLIT | FLOATLIT | STRINGLIT | BOOLLIT | IDENTIFIER """ failure = (None, tokens) (ast, tokens) = alternation(tokens, [ parse_binary, parse_unary, parse_INTLIT, parse_FLOATLIT, parse_STRINGLIT, parse_BOOLLIT, parse_IDENTIFIER ]) if ast is None: return failure expr_ast = ('ast', 'expr', [ast]) return (expr_ast, tokens) def parse_declassign(tokens): """ Attempts to parse a declaration-assignment statement. Grammar: declassign = vardecl WS EQ WS expr """ failure = (None, tokens) (asts, tokens) = concatenation(tokens, [ parse_vardecl, parse_WS, parse_EQ, parse_WS, parse_expr ]) if asts is None: return failure vardecl_ast = asts[0] expr_ast = asts[4] ast = ('ast', 'declassign', [vardecl_ast, expr_ast]) return (ast, tokens) def parse_assign(tokens): """ Attempts to parse an assignment statement. Grammar: assign = IDENTIFIER WS aoperator WS expr """ failure = (None, tokens) (asts, tokens) = concatenation(tokens, [ parse_IDENTIFIER, parse_WS, parse_aoperator, parse_WS, parse_expr ]) if asts is None: return failure identifier_token = asts[0] op_ast = asts[2] expr_ast = asts[4] ast = ('ast', 'assign', [identifier_token, op_ast, expr_ast]) return (ast, tokens) def parse_aoperator(tokens): """ Attempts to parse an assignment operator. Grammar: aoperator = PLUSEQ | MINUSEQ | STAREQ | SLASHEQ | PERCENTEQ | EQ """ failure = (None, tokens) (token, tokens) = alternation(tokens, [ parse_PLUSEQ, parse_MINUSEQ, parse_STAREQ, parse_SLASHEQ, parse_PERCENTEQ, parse_EQ ]) if token is None: return failure ast = ('ast', 'aoperator', [token]) return (ast, tokens) def parse_unary(tokens): """ Attempts to parse a unary expression. Grammar: unary = uoperator OPAREN expr CPAREN """ failure = (None, tokens) (asts, tokens) = concatenation(tokens, [ parse_uoperator, parse_OPAREN, parse_expr, parse_CPAREN ]) if asts is None: return failure op = asts[0] expr = asts[2] ast = ('ast', 'unary', [op, expr]) return (ast, tokens) def parse_uoperator(tokens): """ Attempts to parse a unary operator. Grammar: uoperator = MINUS | AMP | STAR | BANG """ failure = (None, tokens) (token, tokens) = alternation(tokens, [ parse_MINUS, parse_AMP, parse_STAR, parse_BANG ]) if token is None: return failure ast = ('ast', 'uoperator', [token]) return (ast, tokens) def parse_binary(tokens): """ Attempts to parse a binary expression. Grammar: binary = OPAREN expr WS boperator WS expr CPAREN """ failure = (None, tokens) (asts, tokens) = concatenation(tokens, [ parse_OPAREN, parse_expr, parse_WS, parse_boperator, parse_WS, parse_expr, parse_CPAREN ]) if asts is None: return failure expr1 = asts[1] op = asts[3] expr2 = asts[5] ast = ('ast', 'binary', [expr1, op, expr2]) return (ast, tokens) def parse_boperator(tokens): """ Attempts to parse a binary operator. Grammar: boperator = PLUS | MINUS | STAR | SLASH | PERCENT | LT | LTEQ | GT | GTEQ | EQEQ | BANGEQ | AMPAMP | BARBAR | BANG | AMP | BAR | LTLT | GTGT | TILDE | CARAT """ failure = (None, tokens) (token, tokens) = alternation(tokens, [ parse_PLUS, parse_MINUS, parse_STAR, parse_SLASH, parse_LT, parse_LTEQ, parse_GT, parse_GTEQ, parse_EQEQ, parse_BANGEQ, parse_AMPAMP, parse_BARBAR, parse_BANG, parse_AMP, parse_BAR, parse_LTLT, parse_GTGT, parse_TILDE, parse_CARAT, ]) if token is None: return failure ast = ('ast', 'boperator', [token]) return (ast, tokens) def parse_statement(tokens): """ Attempts to parse a statement. Grammar: statement = declassign | assign | fundecl | vardecl | scope | return """ failure = (None, tokens) (ast, tokens) = alternation(tokens, [ parse_declassign, parse_assign, parse_fundecl, parse_vardecl, parse_scope, parse_return ]) if ast is None: return failure statement_ast = ('ast', 'statement', [ast]) return (statement_ast, tokens) def parse_program(tokens): """ Attempts to parse a program. Grammar: statement [WS] { statement [WS] } """ failure = (None, tokens) statement_asts = [] (ast, tokens) = parse_statement(tokens) if ast is None: return failure statement_asts.append(ast) (ast, tokens) = parse_terminal(tokens, 'WS') while True: (ast, tokens) = parse_statement(tokens) if ast is None: break statement_asts.append(ast) (ast, tokens) = parse_terminal(tokens, 'WS') ast = ('ast', 'program', statement_asts) return (ast, tokens) def parse(tokens): """Attempts to parse the tokens into an AST.""" (ast, tokens) = parse_program(tokens) if ast is None: raise Exception("Couldn't parse starting at %s" % tokens[:8]) if len(tokens) > 0: raise Exception("Leftover tokens: %s" % tokens) return ast # Terminal parsing: parsing individual tokens. def parse_terminal(tokens, toktype): """ Attempts to parse a terminal node of type _toktype_. Note that the parsed result of a terminal is the token itself, not an AST node. """ failure = (None, tokens) if len(tokens) > 0 and is_toktype(tokens[0], toktype): return (tokens[0], tokens[1:]) else: return failure def parse_FUNC(tokens): return parse_terminal(tokens, 'FUNC') def parse_ARRAY(tokens): return parse_terminal(tokens, 'ARRAY') def parse_COMMA(tokens): return parse_terminal(tokens, 'COMMA') def parse_WS(tokens): return parse_terminal(tokens, 'WS') def parse_ARROW(tokens): return parse_terminal(tokens, 'ARROW') def parse_IDENTIFIER(tokens): return parse_terminal(tokens, 'IDENTIFIER') def parse_COLON(tokens): return parse_terminal(tokens, 'COLON') def parse_INTLIT(tokens): return parse_terminal(tokens, 'INTLIT') def parse_FLOATLIT(tokens): return parse_terminal(tokens, 'FLOATLIT') def parse_STRINGLIT(tokens): return parse_terminal(tokens, 'STRINGLIT') def parse_BOOLLIT(tokens): return parse_terminal(tokens, 'BOOLLIT') def parse_OPAREN(tokens): return parse_terminal(tokens, 'OPAREN') def parse_CPAREN(tokens): return parse_terminal(tokens, 'CPAREN') def parse_OBRACKET(tokens): return parse_terminal(tokens, 'OBRACKET') def parse_CBRACKET(tokens): return parse_terminal(tokens, 'CBRACKET') def parse_MINUS(tokens): return parse_terminal(tokens, 'MINUS') def parse_AMP(tokens): return parse_terminal(tokens, 'AMP') def parse_STAR(tokens): return parse_terminal(tokens, 'STAR') def parse_BANG(tokens): return parse_terminal(tokens, 'BANG') def parse_PLUS(tokens): return parse_terminal(tokens, 'PLUS') def parse_SLASH(tokens): return parse_terminal(tokens, 'SLASH') def parse_LT(tokens): return parse_terminal(tokens, 'LT') def parse_LTEQ(tokens): return parse_terminal(tokens, 'LTEQ') def parse_GT(tokens): return parse_terminal(tokens, 'GT') def parse_GTEQ(tokens): return parse_terminal(tokens, 'GTEQ') def parse_EQ(tokens): return parse_terminal(tokens, 'EQ') def parse_EQEQ(tokens): return parse_terminal(tokens, 'EQEQ') def parse_BANGEQ(tokens): return parse_terminal(tokens, 'BANGEQ') def parse_AMPAMP(tokens): return parse_terminal(tokens, 'AMPAMP') def parse_BARBAR(tokens): return parse_terminal(tokens, 'BARBAR') def parse_BAR(tokens): return parse_terminal(tokens, 'BAR') def parse_LTLT(tokens): return parse_terminal(tokens, 'LTLT') def parse_GTGT(tokens): # So, I'm not super happy about this, but here's why this has to be special-cased: # If we just treated this like all of the other tokenization and parsing functions, # we would have a token for '>>', and this would be a one-liner: # return parse_terminal(tokens, 'GTGT') # However, that would break `pointer>`, because the last two # characters would be a single GTGT token, rathe than two separate GT tokens. # So we have to implement parse_GTGT by looking for two distinct GT tokens. failure = (None, tokens) (gt_tokens, tokens) = concatenation(tokens, [parse_GT, parse_GT]) if gt_tokens is None: return failure token = ('token', 'GTGT', '>>') def parse_TILDE(tokens): return parse_terminal(tokens, 'TILDE') def parse_CARAT(tokens): return parse_terminal(tokens, 'CARAT') def parse_PLUSEQ(tokens): return parse_terminal(tokens, 'PLUSEQ') def parse_MINUSEQ(tokens): return parse_terminal(tokens, 'MINUSEQ') def parse_STAREQ(tokens): return parse_terminal(tokens, 'STAREQ') def parse_SLASHEQ(tokens): return parse_terminal(tokens, 'SLASHEQ') def parse_PERCENTEQ(tokens): return parse_terminal(tokens, 'PERCENTEQ') def parse_INDENT(tokens): return parse_terminal(tokens, 'INDENT') def parse_DENT(tokens): return parse_terminal(tokens, 'DENT') def parse_DEDENT(tokens): return parse_terminal(tokens, 'DEDENT') # Output generator implementation: # Note: all generate_x functions share a similar format: # - Their first argument is as AST. # - They return a string (of C code). # If the passed AST is valid, then errors are not possible. # Exceptions are thrown otherwise. def generate_type(ast, identifier=""): """ Generate a C type declaration. If identifier is not given, this is an abstract type declaration (e.g. a cast). """ def did_switch_direction(previous, current): """ Detect a change in 'direction' while intepreting the type stack. C type declaration operator precedence requires that we "go right" first. i.e. 'int *foo[]' is "array of pointer to int", not "pointer to array of int". In order to express "pointer to array of int", we have to use parenthesis to change overcome operator precedence, i.e. 'int (*foo)[]'. Any time we need to "change direction", we need to wrap in parenthesis. See http://unixwiz.net/techtips/reading-cdecl.html """ lefts = ['pointer'] rights = ['array', 'function'] return (previous in lefts and current in rights) \ or (previous in rights and current in lefts) assert is_asttype(ast, 'type') types = type_ast_as_list(ast) output = identifier previous = None for t in types[:-1]: if t == 'pointer': if did_switch_direction(previous, t): output = "*(%s)" % output else: output = "*%s" % output elif t == 'array': if did_switch_direction(previous, t): output = "(%s)[]" % output else: output = "%s[]" % output elif t == 'function': if did_switch_direction(previous, t): output = "(%s)()" % output else: output = "%s()" % output elif t.startswith('array:'): array_size = int(t.split(':')[1]) if did_switch_direction(previous, 'array'): output = "(%s)[%i]" % (output, array_size) else: output = "%s[%i]" % (output, array_size) else: raise Exception("generate_type: unexpected type '%s'." % t) if t.startswith('array:'): previous = 'array' else: previous = t base_type = types[-1] if identifier == "": output = "%s%s" % (base_type, output) else: output = "%s %s" % (base_type, output) return output def type_ast_as_list(ast): """ Return the type AST as a list of types. e.g. - 'int' results in ['int'] - 'pointer>' results in ['pointer', 'array', 'char'] - 'array[8]' results in ['array:8', 'int'] """ assert is_asttype(ast, 'type') children = ast_children(ast) if len(children) == 1: # if there is only one child, it is the "base" type (e.g. int, char, etc). assert is_token(children[0]) token = children[0] assert is_toktype(token, 'IDENTIFIER') # return e.g. ['int'], ['char'], etc. return [token_text(token)] elif len(children) == 2: # if there are 2 children, this is either array or pointer or function. assert is_toktype(children[0], 'ARRAY') \ or is_toktype(children[0], 'POINTER') \ or is_toktype(children[0], 'FUNCTION') token = children[0] assert is_ast(children[1]) sub_ast = children[1] assert is_asttype(ast, 'type') # return e.g. ['array', ] return [token_text(token)] + type_ast_as_list(sub_ast) elif len(children) == 3: # if there are three children, this is a dimensioned array (e.g. array[8]). assert is_toktype(children[0], 'ARRAY') assert is_toktype(children[1], 'INTLIT') intlit_token = children[1] array_size = int(token_text(intlit_token)) assert is_ast(children[2]) sub_ast = children[2] assert is_asttype(sub_ast, 'type') # return e.g. ['array:8', ] return ["array:%s" % (array_size)] + type_ast_as_list(sub_ast) else: raise Exception("type_ast_as_list: type AST node with more than 3 children!") def generate_vardecl(ast): """ Generate a C variable declaration. Note: we don't append the trailing ';' here because this is also used for function args. """ assert is_asttype(ast, 'vardecl') children = ast_children(ast) identifier_token = children[0] assert is_toktype(identifier_token, 'IDENTIFIER') var_name = token_text(identifier_token) type_ast = children[1] assert is_asttype(type_ast, 'type') # e.g. 'int a', 'char* b', 'int c[8]', 'char** d', 'int* d[3]', etc. return "%s" % generate_type(type_ast, var_name) def generate_fundecl(ast): """Generate a C function declaration.""" assert is_asttype(ast, 'fundecl') identifier_token = ast_children(ast)[0] assert is_toktype(identifier_token, 'IDENTIFIER') identifier = token_text(identifier_token) fundeclargs_ast = ast_children(ast)[1] assert is_asttype(fundeclargs_ast, 'fundeclargs') arg_asts = ast_children(fundeclargs_ast) args = [] for vardecl_ast in arg_asts: arg = generate_vardecl(vardecl_ast) args.append(arg) args_output = ", ".join(args) fundeclret_ast = ast_children(ast)[2] if fundeclret_ast is None: ret = "void" else: ret_ast = ast_children(fundeclret_ast)[0] ret = generate_type(ret_ast) scope_ast = ast_children(ast)[3] scope = generate_scope(scope_ast) output = "%s %s(%s)%s\n" % (ret, identifier, args_output, scope) return output def generate_scope(ast, indent=0): """Generate a C scope.""" assert is_ast(ast) assert is_asttype(ast, 'scope') output = " {" for statement_ast in ast_children(ast): statement = generate_statement(statement_ast) line = "\n" + (" " * 4 * (indent+1)) + statement output += line line = "\n" + (" " * 4 * (indent)) + "}" output += line return output def generate_expr(ast): """Generate a C expression.""" assert is_asttype(ast, 'expr') child = ast_children(ast)[0] if is_ast(child): if is_asttype(child, 'binary'): return generate_binary(child) elif is_asttype(child, 'unary'): return generate_unary(child) elif is_token(child): token = child if is_toktype(token, 'INTLIT') \ or is_toktype(token, 'FLOATLIT') \ or is_toktype(token, 'STRINGLIT') \ or is_toktype(token, 'BOOLLIT') \ or is_toktype(token, 'IDENTIFIER') \ : output = token_text(token) return output else: raise Exception("generate_expr: unexpcted ast %s" % (ast,)) def generate_return(ast): """Generate a C return statement.""" assert is_asttype(ast, 'return') children = ast_children(ast) if len(children) == 0: return "return;" else: expr_ast = children[0] return "return %s;" % generate_expr(expr_ast) def generate_declassign(ast): """Generate a C declaration-assignment statement.""" assert is_asttype(ast, 'declassign') children = ast_children(ast) vardecl_ast = children[0] assert is_asttype(vardecl_ast, 'vardecl') vardecl = generate_vardecl(vardecl_ast) expr_ast = children[1] assert is_asttype(expr_ast, 'expr') expr = generate_expr(expr_ast) return "%s = %s;" % (vardecl, expr) def generate_assign(ast): """Generate a C assignment statement.""" assert is_asttype(ast, 'assign') children = ast_children(ast) identifier_token = children[0] assert is_toktype(identifier_token, 'IDENTIFIER') identifier = token_text(identifier_token) op_ast = children[1] assert is_asttype(op_ast, 'aoperator') op_token = ast_children(op_ast)[0] assert is_token(op_token) op = token_text(op_token) expr_ast = children[2] assert is_asttype(expr_ast, 'expr') expr = generate_expr(expr_ast) return "%s %s %s;" % (identifier, op, expr) def generate_unary(ast): """Generate a unary-operator C statement.""" assert is_asttype(ast, 'unary') children = ast_children(ast) op_ast = children[0] assert is_asttype(op_ast, 'uoperator') op_token = ast_children(op_ast)[0] assert is_token(op_token) op = token_text(op_token) expr_ast = children[1] assert is_asttype(expr_ast, 'expr') expr = generate_expr(expr_ast) return "%s(%s)" % (op, expr) def generate_binary(ast): """Generate a binary-operator C statement.""" assert is_asttype(ast, 'binary') children = ast_children(ast) expr1_ast = children[0] assert is_asttype(expr1_ast, 'expr') expr1 = generate_expr(expr1_ast) op_ast = children[1] assert is_asttype(op_ast, 'boperator') op_token = ast_children(op_ast)[0] assert is_token(op_token) op = token_text(op_token) expr2_ast = children[2] assert is_asttype(expr2_ast, 'expr') expr2 = generate_expr(expr2_ast) return "(%s %s %s)" % (expr1, op, expr2) def generate_statement(ast): """Generate a C statement.""" assert is_asttype(ast, 'statement') sub_ast = ast_children(ast)[0] assert is_ast(sub_ast) if is_asttype(sub_ast, 'declassign'): return generate_declassign(sub_ast) elif is_asttype(sub_ast, 'assign'): return generate_assign(sub_ast) elif is_asttype(sub_ast, 'fundecl'): return generate_fundecl(sub_ast) elif is_asttype(sub_ast, 'vardecl'): return generate_vardecl(sub_ast) + ';' elif is_asttype(sub_ast, 'scope'): return generate_scope(sub_ast) elif is_asttype(sub_ast, 'return'): return generate_return(sub_ast) else: raise Exception("generate_statement: don't know how to generate %s" % sub_ast) def generate_program(ast): """Generate a C program.""" assert is_asttype(ast, 'program') outputs = [] for statement_ast in ast_children(ast): outputs.append(generate_statement(statement_ast)) return '\n'.join(outputs) + '\n' def generate(ast): """Generate C code from the given AST.""" return generate_program(ast) if __name__ == "__main__": import sys import pprint tdefs = load_tokendefs("tokendefs.txt") input = None if len(sys.argv) > 1 and not sys.argv[-1].startswith('--'): input = open(sys.argv[-1]).read() else: input = sys.stdin.read() tokens = tokenize(tdefs, input) if '--tokens' in sys.argv: pprint.pprint(tokens) sys.exit(0) ast = parse(tokens) if '--ast' in sys.argv: pprint.pprint(ast) sys.exit(0) output = generate(ast) sys.stdout.write(output)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 👉 test21.cy input: func not(a: bool) -> bool: return !(a) output: bool not(bool a) { return !(a); } ✅ test21.cy 👉 test22.cy input: func negated_sum(a: int, b: int) -> int: c: int = (a + b) return -(c) output: int negated_sum(int a, int b) { int c = (a + b); return -(c); } ✅ test22.cy 👉 test23.cy input: func foo(a: pointer>, z_z: array>) -> float: z: int = (1 + (2 * 3)) z += (3 - 1) y: int = -((1 + 1)) w: int = !(0) return (-5 / -3.14159) func main(argc: int, argv: pointer>) -> int: return 0 output: float foo(char (*a)(), char *(z_z[])) { int z = (1 + (2 * 3)); z += (3 - 1); int y = -((1 + 1)); int w = !(0); return (-5 / -3.14159); } int main(int argc, char **argv) { return 0; } ✅ test23.cy
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 #!/bin/bash # run all of the tests set -e for f in test*.cy do echo "👉 \$f" echo " input:" cat \$f | sed 's/^/ /' outfile=`mktemp` ./cy.py \$f > \$outfile echo " output:" cat \$outfile | sed 's/^/ /' cfile="`basename \$f .cy`.c" if diff -q \$cfile \$outfile >/dev/null then echo "✅ \$f" else echo "❌ \$f" diff -urN --color=auto \$cfile \$outfile fi echo done
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 float foo(char (*a)(), char *(z_z[])) { int z = (1 + (2 * 3)); z += (3 - 1); int y = -((1 + 1)); int w = !(0); return (-5 / -3.14159); } int main(int argc, char **argv) { return 0; }
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 func foo(a: pointer>, z_z: array>) -> float: z: int = (1 + (2 * 3)) z += (3 - 1) y: int = -((1 + 1)) w: int = !(0) return (-5 / -3.14159) func main(argc: int, argv: pointer>) -> int: return 0
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 POINTER pointer ARRAY array FUNCTION function FUNC func RETURN return FLOATLIT -?\d+\.\d+ INTLIT -?\d+ STRINGLIT "([^"\\]|\\[\s\S])*" BOOLLIT (true)|(false) ARROW -> LTEQ <= GTEQ >= EQEQ == BANGEQ != PLUSEQ \+= MINUSEQ -= STAREQ \*= SLASHEQ /= PERCENTEQ %= AMPAMP && BARBAR \|\| LTLT << TILDE ~ CARAT \^ COLON : COMMA , PLUS \+ MINUS - STAR \* SLASH / PERCENT % LT < GT > EQ = AMP & BAR \| BANG ! OBRACKET \[ CBRACKET ] OPAREN \( CPAREN \) WS \s+ IDENTIFIER [a-zA-Z][a-zA-Z0-9_]*
to join this conversation on GitHub. Already have an account? Sign in to comment