
Syntax Sugar For The M(ol)asses

I've been experimenting with extensions to S-Expression notation for a while now, and I've made some new additions that I'm going to test-drive in my experimental Lisp-like language "Bang".

Naked Notation

The classic restricted S-Expression notation looks like this:

(top level list
    (level 2 list)
    (another list (with more (nested) (lists)))
    (yet another list))

First, I reused the naked notation from Nonelang, which uses indentation as a cue to balance parentheses, Python-style:

top level list
    level 2 list
    another list
        with more
            nested ;comments are able to wrap single symbols
            lists ;because they are first parsed as symbols, then stripped
    (yet another list) ; classic notation is also supported

(Please forgive the arbitrarily highlighted keywords.)

Of Statements, Comments and Indices

The semicolon ; is used for line comments in Lisp, Scheme, Assembler and most recently, in LLVM, but contemporary programmers know it more as a statement separator, which I wanted to honor.

In a quick survey I did among my Twitter followers, the double-slash // used in C-like languages was the most popular comment token (48%), followed by the hash # (37%), which is traditionally used in scripting languages such as bash, Tcl and Python.

I chose to go with #, as // is very useful in Python as a floordiv operator, an operator I want to support. In naked notation, Bang looks very pythonic anyway, so the hash would make the language more familiar to Pythoneers. C-like languages know # from preprocessor directives, which are nearly comment-like themselves (they're not part of the actual AST), so I believe even C users won't be appalled by this choice.

Array Indexing

Unfortunately I already use # in Nonelang as an array index operator, like so:

print
    matrix # x # y ; and here's a comment

So for Bang I had to find a suitable replacement. I chose @, mostly because it is spelled as "at" (which is a fitting moniker), is unused in C and belongs to a more esoteric feature in Python (decorators). So with the new array indexing syntax the statement becomes

print
    matrix @ x @ y # and here's a comment

Wrapping Single Values

Alright. Now how do we fix the next problem:

do
    print

turns into (do print), not (do (print)), which is what we want.

We could just wrap print into (print) locally, but let's try to avoid extra parentheses for now, especially when they appear in such an utterly surprising and irregular way in the midst of a parens-free statement block.
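
For illustration, the locally wrapped version would look like this; note how the lone pair of parentheses intrudes on an otherwise parens-free block:

    do
        (print)    # works, but the stray parens break the naked style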

Since naked notation does not wrap single symbols on a line, I used to append an empty comment as a hack to put a second symbol on the line and thus get the parser to wrap the statement:

do
    print ; there, now it's wrapped

turns into

(do (print "; there, now it's wrapped\n"))

After stripping comments, we then get the desired (do (print)).

Alas, our good friend ; has been taken from us, and to add insult to injury, the lexer now strips # before the parser even sees it (this is so it plays nice with my editor's toggle block comment feature, which comments out blocks in a way whose weirdly nested comment symbols would completely confuse the parser).

Statement Separators

So how do we do it now? I turned ; into a new control character, the statement separator, which operates in both naked and coated notation and wraps values up to the previous ; or beginning of scope.

This way, (print a; print b; print;) turns into ((print a) (print b) (print)).

Similarly, in naked notation we can now do this:

do
    print x y # is already wrapped as (print x y)
    print x y; # superfluous semicolon has no effect
    print x; print y; # added to (do ...) as (print x) (print y)
    print; # wraps print as (print)

Not only do we have our old feature back, we also get a context-free statement separator that is used in both C-likes and Python.

There's a caveat though: if trailing values aren't topped off with ;, they're not going to be wrapped. So (print a; print b; print;) turns into ((print a) (print b) (print)), but (print a; print b; print) turns into ((print a) (print b) print)! It sounds like a drawback, but it allows us to do this:

do  
    print x; print
        a + b

which turns into (do (print x) (print (a + b))).

Stylish Brackets

Another addition I made, as a sort of "styling utility" for domain-specific language designers, was to add context-free support for square brackets [] and curly brackets {} as aliases for parentheses (), similar to how Clojure does it, but without inherent semantic meaning. Bracket expressions are all equivalent, but set a style for lists that can be queried in syntax handlers:

(do ((print x) (print y))) # coated
[do {(print x) (print y)}] # styled coated
do {                       # naked into styled coated
    print x;               # make use of new statement character
    print y;               # look, it's almost C! ;-)
}

Argument Separators

I also realized that there is zero use for comma , in Bang, so why not use that one as a delimiter as well?

Now [ptr, * ptr, const * ptr] turns into (ptr (* ptr) (const * ptr)). This example also demonstrates the two key differences from the ; delimiter:

  1. Single values are not wrapped, so (a,b,c,d) is equivalent to (a b c d).
  2. Trailing values are wrapped, so {a = 1, b = 2, c = 3} is the same as {a = 1, b = 2, c = 3,} which translates to ((a = 1) (b = 2) (c = 3)).
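
Putting the two delimiters side by side and applying the rules above to bare symbols:

    (a, b, c)     # (a b c): single values stay unwrapped
    (a; b; c;)    # ((a) (b) (c)): every statement is wrapped
    (a; b; c)     # ((a) (b) c): the trailing value misses its ;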

Furthermore, the comma separator , binds more tightly than the statement separator ;, so commas split within statements, not across them. Take this fictitious example:

# do it naked, because we can.
do                                # (do
    int x, int y; x = 5, y = 6    #     ((int x) (int y)) ((x = 5) (y = 6)))

The reason why (x = 5) (y = 6) is wrapped here despite a missing trailing statement separator is that naked notation takes care of the wrapping.
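
In coated notation, where no line wrapping helps out, the same statement should accordingly need an explicit trailing ; to yield the equivalent result:

    (int x, int y; x = 5, y = 6;)    # (((int x) (int y)) ((x = 5) (y = 6)))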

Limiting Argument Wrapping

As a way to permit prefix headers for argument lists, I turned the colon : into a special symbol that controls where argument separation begins. For example, (label: a = 1, b = 2, c = 3) now turns into (label : (a = 1) (b = 2) (c = 3)).
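
For contrast, without the colon the head symbol should simply be swallowed by the first comma group:

    (label a = 1, b = 2, c = 3)     # ((label a = 1) (b = 2) (c = 3))
    (label: a = 1, b = 2, c = 3)    # (label : (a = 1) (b = 2) (c = 3))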

Treating : as a separable token also allows us to parse Pythonic blocks like this one:

while x < size:   # (while x < size :
    x = x + 1     #     (x = x + 1))

Since : separates the bracketless expression from the body, the syntax handler can now easily tokenize the expression into (while (x < size) (x = x + 1)) without having to consider ambiguities.

Accessors

Dots . are truly special. They're pretty much useless in the traditional Lispy sense, as nobody needs such an important character simply devoted to concatenating lists. But they ended up as one of the very first infix operators that I needed in Nonelang, so that I could retire the unwieldy format of e.g. (. object field subfield) in favor of a more modern object.field.subfield.

Because my stance on special characters is that they should be spaced correctly anyway, I never made dots special, so the lexer always worked them into symbols. (a . b.c) would indeed literally parse as (a . b.c), and the syntax handler could do no better than (a . (b . c)), which is of course complete horse manure.

Therefore, in the new lexer, . is tokenized separately, so that (a . b.c) turns into the proper (a . b . c) right away. Unfortunately this broke pattern expressions like (args ...), which then became (args . . .), so I added another rule that groups successive dots. Now, for example, (a.b..c...d) parses as (a . b .. c ... d), which should also please fans of Lua syntax, who know .. as the string concatenation operator.

Of course this means I now also have to tokenize floating point numbers, which give a second legitimate reason to use dots in a symbol. I do this for doubles in Nonelang (all other notations like 0x and 0b are done in syntax handlers), and I am always worried because this makes it harder to support symmetric code transformation.
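
To sketch why this matters: without a dedicated number token, the dot-splitting rule above would mangle numeric literals:

    (x = 3.14)    # with number tokens: (x = 3.14)
                  # with naive dot splitting: (x = 3 . 14)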

I thought I could do without it in Bang, but it has become clear that it's absolutely required. Besides, there are better ways to transform code than parsing the source file, reassembling it and writing it back to disk: I store anchoring data for each token (its start/end position in the file), which can be used to patch the file directly.

Are We Done?

Are we? I'm not sure. I want to support ' as a second context-free string quote style that may be used in the Python sense (as a simple alternate) or in the C sense (as a char (array) constructor). I'm still considering adding support for Python block strings """, but since our Lispy strings are already multi-line...

# same as "\"they don't\n\ttokenize strings like\n\tthey used to.\""
"\"they don't
    tokenize strings like
    they used to.\""

...I think an ' alternative is more than enough.

These are all the changes I could think of at the lexer/parser level. I generally treat the problem of writing the parser as a service to the language designer, who wants to invent interesting syntax handlers for S-Expressions, and to the contemporary language user, who prefers to style his expressions for clarity in a syntax he and his editor understand. I try to avoid creating an explosion of supposedly powerful forms that on closer inspection obscure more than they enlighten.

Lisp purists can easily retreat to classical coated notation and never touch the fancy stuff. Tinkerers can play with the fundamental extensibility of the language. Everyone is happy!
