Skip to content

Instantly share code, notes, and snippets.

@cellularmitosis

cellularmitosis/README.md

Last active Oct 22, 2019
Embed
What would you like to do?
An alternative syntax for C, part 1: type declarations

Blog 2019/10/13

<- previous | index | next ->

An alternative syntax for C, part 1: type declarations

part 2 ->

I'm toying around with making an alternative syntax for C. The intention is to eventually create a "CoffeeScript for C".

Currently, I just have type declarations working:

foo: array<pointer<char>> transpiles to char *(foo[]);.

foo: pointer<array<char>> transpiles to char (*foo)[];.

Usage:

$ cat test9.cy
foo: array<array[8]<pointer<pointer<function<pointer<array<pointer<char>>>>>>>>
$ ./cy.py test9.cy
char *((*((**(foo[][8]))()))[]);

I'm following the general approach described in An Incremental Approach to Compiler Construction, which is to say, get one piece working end-to-end before moving onto the next piece (hat tip to Nora Sandler πŸ‘‹).

cdecl resources:

TL;DR: What is Cy?

Cy is an alternative syntax for C which features indentation-based scoping and intuitive type declarations.

func main(argc: int, argv: pointer<pointer<char>>) -> int:
    printf("Hello, world!")
    return 0

But why tho?

To teach myself how to write a transpiler! 🀩🀩🀩

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# cy.py: a traspiler for an alternate syntax for C.
# Note: currently, only type declaration statements are supported.
# Data structures:
# A token is a tuple.
# First element is a string 'token'.
# Second element is a string which is the token type (e.g. 'OBRACKET').
# Third element is a string which is the source text of the token (e.g. '[').
# e.g. ('token', 'OBRACKET', '[')
# e.g. ('token', 'IDENTIFIER', 'foo')
# e.g. ('token', 'WSPACE', ' ')
# An AST node is a tuple.
# First element is a string 'ast'.
# Second element is a string which is the AST node type (e.g. 'vardecl').
# Third element is an array which are the child tuples (AST nodes or tokens).
# e.g. ('ast', 'type', [('token', 'IDENTIFIER', 'int')])
# Grammar:
#
# statement = vardecl
# vardecl = IDENTIFIER COLON WSPACE type
# type = POINTER LTHAN type GTHAN
# | ARRAY LTHAN type GTHAN
# | FUNCTION LTHAN type GTHAN
# | ARRAY OBRACKET INTLIT CBRACKET LTHAN type GTHAN
# | IDENTIFIER
#
# Note: lower case are AST nodes.
# Note: ALL CAPS are tokens (terminals).
# Token data structure utils:
def is_token(token):
"""Returns whether the arg is a token."""
return isinstance(token, tuple) and len(token) > 0 and token[0] == 'token'
def token_type(token):
"""Returns the type of the token (e.g. 'IDENTIFIER')."""
assert is_token(token)
return token[1]
def token_text(token):
"""Returns the source text of the token (e.g. 'foo' for a 'IDENTIFIER' token)."""
assert is_token(token)
return token[2]
def is_toktype(token, toktype):
"""
Returns whether the token is of the given type.
_toktype_ is e.g. 'IDENTIFIER', 'COLON', etc.
"""
assert is_token(token)
return token_type(token) == toktype
# Tokenizer implementation:
def load_tokendefs(fpath):
"""
Load the token definitions from the file at fpath.
Returns a list of tuples of the form (<token type>, <compiled regex>).
The format of the tokendefs file should be pairs of lines:
- a line which is the token type (e.g. 'IDENTIFIER', 'OBRACKET', etc),
- followed by a line which is the regex which recognizes that token.
Example tokendefs.txt:
IDENTIFIER
[a-zA-Z][a-zA-Z0-9]*
OPAREN
\(
CPAREN
)
"""
import re
tokendefs = []
with open(fpath) as f:
for line in f:
token_name = line.rstrip('\n')
pattern = f.next().rstrip('\n')
regex = re.compile(pattern)
tokendef = (token_name, regex)
tokendefs.append(tokendef)
return tokendefs
def tokenize(tokendefs, input):
"""Uses tokendefs to tokenize the 'input' string and return a list of tokens"""
tokens = []
offset = 0
while offset < len(input):
for (token_name, regex) in tokendefs:
m = regex.match(input, offset)
if m is not None:
matched_text = m.group(0)
token = ('token', token_name, matched_text)
tokens.append(token)
offset += len(matched_text)
break
else:
raise Exception("Couldn't tokenize starting at '%s...'" % input[offset:offset+16])
return tokens
# AST node data structure utils:
def is_ast(ast):
"""Returns whether the arg is an AST node."""
return isinstance(ast, tuple) and len(ast) > 0 and ast[0] == 'ast'
def is_asttype(ast, asttype):
"""
Returns whether the AST node is of the given type.
_asttype_ is e.g. 'vardecl', 'statement', etc.
"""
assert is_ast(ast)
return ast[1] == asttype
def ast_children(ast):
"""Return the child nodes of the given AST."""
assert is_ast(ast)
return ast[2]
# Parser implementation:
# Note: all parse_x functions share a similar format:
# - Their first argument is the list of tokens.
# - They return a tuple of (<parsed result>, <remaining tokens>).
# - <parsed result> is either an AST node or a token (a terminal).
# - <remaining tokens> is the list of unconsumed tokens.
# - If the parse fails, (None, <tokens>) is returned, where <tokens>
# is the list of tokens which was passed in.
def parse_terminal(tokens, toktype):
"""
Attempts to parse a terminal node of type _toktype_.
Note that the parsed result of a terminal is the token itself, not an AST node.
"""
failure = (None, tokens)
if len(tokens) > 0 and is_toktype(tokens[0], toktype):
return (tokens[0], tokens[1:])
else:
return failure
def parse_vardecl(tokens):
"""
Attempts to parse a vardecl AST node.
Grammar:
vardecl = IDENTIFIER COLON WSPACE type
"""
failure = (None, tokens)
(identifier_token, tokens) = parse_terminal(tokens, 'IDENTIFIER')
if identifier_token is None:
return failure
(colon_token, tokens) = parse_terminal(tokens, 'COLON')
if colon_token is None:
return failure
(wspace_token, tokens) = parse_terminal(tokens, 'WSPACE')
if wspace_token is None:
return failure
(type_ast, tokens) = parse_type(tokens)
if type_ast is None:
return failure
vardecl_ast = ('ast', 'vardecl', [identifier_token, type_ast])
return (vardecl_ast, tokens)
def parse_type(tokens):
"""
Attemps to parse a type declaration.
Grammar:
type = POINTER LTHAN type GTHAN
| ARRAY LTHAN type GTHAN
| FUNCTION LTHAN type GTHAN
| ARRAY OBRACKET INTLIT CBRACKET LTHAN type GTHAN
| IDENTIFIER
"""
def parse_type1(tokens, toktype):
"""
Grammar fragments:
POINTER LTHAN type GTHAN
ARRAY LTHAN type GTHAN
'array<int>' becomes:
('ast', 'type', [
('token', 'ARRAY', 'array'),
('ast', 'type', [
('token', 'IDENTIFIER', 'int')
])
])
'pointer<char>' becomes:
('ast', 'type', [
('token', 'POINTER', 'pointer'),
('ast', 'type', [
('token', 'IDENTIFIER', 'char')
])
])
'array<pointer<char>>' becomes:
('ast', 'type', [
('token', 'ARRAY', 'array'),
('ast', 'type', [
('token', 'POINTER', 'pointer'),
('ast', 'type', [
('token', 'IDENTIFIER', 'char')
])
])
])
"""
failure = (None, tokens)
(identifier_token, tokens) = parse_terminal(tokens, toktype)
if identifier_token is None:
return failure
(lessthan, tokens) = parse_terminal(tokens, 'LTHAN')
if lessthan is None:
return failure
(subtype_ast, tokens) = parse_type(tokens)
if subtype_ast is None:
return failure
(greaterthan, tokens) = parse_terminal(tokens, 'GTHAN')
if greaterthan is None:
return failure
ast = ('ast', 'type', [identifier_token, subtype_ast])
return (ast, tokens)
def parse_type2(tokens):
"""
Grammar fragment:
ARRAY OBRACKET INTLIT CBRACKET LTHAN type GTHAN
'array[8]<int>' becomes:
('ast', 'type', [
('token', 'ARRAY', 'array'),
('token', 'INTLIT', '8'),
('ast', 'type', [
('token', 'IDENTIFIER', 'int')
])
])
"""
failure = (None, tokens)
(identifier_token, tokens) = parse_terminal(tokens, 'ARRAY')
if identifier_token is None:
return failure
(obracket, tokens) = parse_terminal(tokens, 'OBRACKET')
if obracket is None:
return failure
(intlit, tokens) = parse_terminal(tokens, 'INTLIT')
if intlit is None:
return failure
(cbracket, tokens) = parse_terminal(tokens, 'CBRACKET')
if obracket is None:
return failure
(lessthan, tokens) = parse_terminal(tokens, 'LTHAN')
if lessthan is None:
return failure
(subtype_ast, tokens) = parse_type(tokens)
if subtype_ast is None:
return failure
(greaterthan, tokens) = parse_terminal(tokens, 'GTHAN')
if greaterthan is None:
return failure
ast = ('ast', 'type', [identifier_token, intlit, subtype_ast])
return (ast, tokens)
def parse_type3(tokens):
"""
Grammar fragment:
IDENTIFIER
'int' becomes:
('ast', 'type', [('token', 'IDENTIFIER', 'int')])
'char' becomes:
('ast', 'type', [('token', 'IDENTIFIER', 'char')])
"""
failure = (None, tokens)
(identifier_token, tokens) = parse_terminal(tokens, 'IDENTIFIER')
if identifier_token is None:
return failure
type_ast = ('ast', 'type', [identifier_token])
return (type_ast, tokens)
failure = (None, tokens)
(ast, tokens) = parse_type1(tokens, 'POINTER')
if ast is not None:
return (ast, tokens)
(ast, tokens) = parse_type1(tokens, 'ARRAY')
if ast is not None:
return (ast, tokens)
(ast, tokens) = parse_type1(tokens, 'FUNCTION')
if ast is not None:
return (ast, tokens)
(ast, tokens) = parse_type2(tokens)
if ast is not None:
return (ast, tokens)
(ast, tokens) = parse_type3(tokens)
if ast is not None:
return (ast, tokens)
return failure
def parse_statement(tokens):
"""
Attempts to parse a statement.
Grammar:
statement = vardecl
"""
failure = (None, tokens)
(vardecl_ast, tokens) = parse_vardecl(tokens)
if vardecl_ast is None:
return failure
statement_ast = ('ast', 'statement', [vardecl_ast])
return (statement_ast, tokens)
def parse(tokens):
"""Attempts to parse the tokens into an AST."""
(ast, tokens) = parse_statement(tokens)
if ast is None:
raise Exception("Couldn't parse starting at %s" % tokens[:4])
return ast
# Output generator implementation:
# Note: all generate_x functions share a similar format:
# - Their first argument is as AST.
# - They return a string (of C code).
# If the passed AST is valid, then errors are not possible.
# Exceptions are thrown otherwise.
def generate_type(ast, identifier=""):
"""
Generate a C type declaration.
If identifier is not given, this is an abstract type declaration (e.g. a cast).
"""
def did_switch_direction(previous, current):
"""
Detect a change in 'direction' while intepreting the type stack.
C type declaration operator precedence requires that we "go right" first.
i.e. 'int *foo[]' is "array of pointer to int", not "pointer to array of int".
In order to express "pointer to array of int", we have to use parenthesis
to change overcome operator precedence, i.e. 'int (*foo)[]'.
Any time we need to "change direction", we need to wrap in parenthesis.
See http://unixwiz.net/techtips/reading-cdecl.html
"""
lefts = ['pointer']
rights = ['array', 'function']
return (previous in lefts and current in rights) \
or (previous in rights and current in lefts)
types = type_ast_as_list(ast)
output = identifier
previous = None
for t in types[:-1]:
if t == 'pointer':
if did_switch_direction(previous, t):
output = "*(%s)" % output
else:
output = "*%s" % output
elif t == 'array':
if did_switch_direction(previous, t):
output = "(%s)[]" % output
else:
output = "%s[]" % output
elif t == 'function':
if did_switch_direction(previous, t):
output = "(%s)()" % output
else:
output = "%s()" % output
elif t.startswith('array:'):
array_size = int(t.split(':')[1])
if did_switch_direction(previous, 'array'):
output = "(%s)[%i]" % (output, array_size)
else:
output = "%s[%i]" % (output, array_size)
else:
raise Exception("generate_type: unexpected type '%s'." % t)
if t.startswith('array:'):
previous = 'array'
else:
previous = t
base_type = types[-1]
output = "%s %s" % (base_type, output)
return output
def type_ast_as_list(ast):
"""
Return the type AST as a list of types.
e.g.
- 'int' results in ['int']
- 'pointer<array<char>>' results in ['pointer', 'array', 'char']
- 'array[8]<int>' results in ['array:8', 'int']
"""
assert is_asttype(ast, 'type')
children = ast_children(ast)
if len(children) == 1:
# if there is only one child, it is the "base" type (e.g. int, char, etc).
assert is_token(children[0])
token = children[0]
assert is_toktype(token, 'IDENTIFIER')
# return e.g. ['int'], ['char'], etc.
return [token_text(token)]
elif len(children) == 2:
# if there are 2 children, this is either array or pointer or function.
assert is_toktype(children[0], 'ARRAY') \
or is_toktype(children[0], 'POINTER') \
or is_toktype(children[0], 'FUNCTION')
token = children[0]
assert is_ast(children[1])
sub_ast = children[1]
assert is_asttype(ast, 'type')
# return e.g. ['array', <recursive call>]
return [token_text(token)] + type_ast_as_list(sub_ast)
elif len(children) == 3:
# if there are three children, this is a dimensioned array (e.g. array[8]).
assert is_toktype(children[0], 'ARRAY')
assert is_toktype(children[1], 'INTLIT')
intlit_token = children[1]
array_size = int(token_text(intlit_token))
assert is_ast(children[2])
sub_ast = children[2]
assert is_asttype(sub_ast, 'type')
# return e.g. ['array:8', <recursive call>]
return ["array:%s" % (array_size)] + type_ast_as_list(sub_ast)
else:
raise Exception("type_ast_as_list: type AST node with more than 3 children!")
def generate_vardecl(ast):
"""Generate a C variable declaration."""
assert is_asttype(ast, 'vardecl')
children = ast_children(ast)
assert is_token(children[0])
assert is_toktype(children[0], 'IDENTIFIER')
identifier_token = children[0]
var_name = token_text(identifier_token)
assert is_ast(children[1])
assert is_asttype(children[1], 'type')
type_ast = children[1]
# e.g. 'int a;', 'char* b;', 'int c[8];', 'char** d;', 'int* d[3];', etc.
return "%s;" % generate_type(type_ast, var_name)
def generate_statement(ast):
"""Generate a C statement."""
assert is_asttype(ast, 'statement')
first = ast_children(ast)[0]
return generate_vardecl(first)
def generate(ast):
"""Generate C code from the given AST."""
return generate_statement(ast)
if __name__ == "__main__":
import sys
import pprint
tdefs = load_tokendefs("tokendefs.txt")
input = None
if len(sys.argv) > 1:
input = open(sys.argv[-1]).read()
else:
input = sys.stdin.read()
tokens = tokenize(tdefs, input)
ast = parse(tokens)
# pprint.pprint(ast)
output = generate(ast)
print output
πŸ‘‰ test1.cy
input:
a: int
output:
int a;
βœ… test1.cy
πŸ‘‰ test2.cy
input:
area: float
output:
float area;
βœ… test2.cy
πŸ‘‰ test3.cy
input:
a: pointer<int>
output:
int *a;
βœ… test3.cy
πŸ‘‰ test4.cy
input:
b: pointer<pointer<char>>
output:
char **b;
βœ… test4.cy
πŸ‘‰ test5.cy
input:
a: array<int>
output:
int a[];
βœ… test5.cy
πŸ‘‰ test6.cy
input:
b: array<array<int>>
output:
int b[][];
βœ… test6.cy
πŸ‘‰ test7.cy
input:
a: <array<pointer<array<pointer<char>>>>>
output:
char *((*(a[]))[]);
βœ… test7.cy
πŸ‘‰ test8.cy
input:
foo: pointer<function<char>>
output:
char (*foo)();
βœ… test8.cy
πŸ‘‰ test9.cy
input:
foo: array<array[8]<pointer<pointer<function<pointer<array<pointer<char>>>>>>>>
output:
char *((*((**(foo[][8]))()))[]);
βœ… test9.cy
#!/bin/bash
# run all of the tests
set -e
for f in test*.cy
do
echo "πŸ‘‰ $f"
echo " input:"
cat $f | sed 's/^/ /'
outfile=`mktemp`
./cy.py $f > $outfile
echo " output:"
cat $outfile | sed 's/^/ /'
cfile="`basename $f .cy`.c"
if diff -q $cfile $outfile >/dev/null
then
echo "βœ… $f"
else
echo "❌ $f"
diff -urN --color=auto $cfile $outfile
fi
echo
done
char *((*((**(foo[][8]))()))[]);
foo: array<array[8]<pointer<pointer<function<pointer<array<pointer<char>>>>>>>>
POINTER
pointer
ARRAY
array
FUNCTION
function
INTLIT
[0-9]+
COLON
:
OBRACKET
\[
CBRACKET
]
LTHAN
<
GTHAN
>
WSPACE
\W+
IDENTIFIER
[a-zA-Z][a-zA-Z0-9_]*
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.