samebchase/raku-grammars.md

## raku-grammars.md

      
    Raw
  

              raku-grammars.md
            
          
    Parsing Clojure namespace forms using Raku grammars

One day, I started wondering if it would be possible to parse Clojure
namespace forms and generate a dependency graph of the various
namespaces used in a real-world Clojure project. While that was the
original motivation, I ended up down the Raku grammar rabbit hole, and
had an enjoyable time learning how to use them. I'm glad you're
joining me in reliving that journey.
Background

Grammars

Informally speaking, grammars can be thought of as a set of rules that
describe a language. With these rules, one can meaningfully parse (to
make sense of, or deconstruct into its grammatical components) a piece
of text. It turns out that this is a common task in computing. We need
to frequently translate programs from one language to another. This is
the job of a compiler. Before being able to translate it, the compiler
needs to know whether the original program is even valid, according to
the language's grammar.
While we have explained in theory what grammars are, Raku grammars
help us model abstract grammars as a programming construct (the
grammar keyword and its adjacent helpers) using which we can perform
parsing tasks. It is important to understand this distinction.
First class grammars are
considered one of the revolutionary features of Raku. Normally, you'd
find grammars as a library or a standalone tool, but Raku has embraced
it wholesale, and has a powerful implementation of grammars which
makes light work of most parsing tasks.
Clojure

Clojure is a modern lisp, and happens to be
the language I use in $DAYJOB. At the top of most Clojure files, there
is a namespace form importing various internal and external
namespaces. This way we can neatly organize our project into many
separate files, rather than having to put them all into one big
file. We are shortly going to design a grammar that can parse these
namespace forms.
Grammar::Tracer

Grammar::Tracer is
helpful in figuring out where your grammar fails to match, and is
invaluable in debugging. Make sure you do zef install Grammar::Tracer before running the code examples.
Let's start cookin'

A trivial example

A Clojure namespace is a Lisp form, which as expected starts with an
open paren and ends with a close paren. Let's write a grammar for one
of the simplest Clojure forms, the empty list.
()
grammar EmptyLispForm {
    token TOP { <lparen> <rparen> }

    token lparen { '(' }
    token rparen { ')' }
}
This is one of the simplest possible grammars we can write. Parsing
always starts with the TOP token, and recurses into the various
component tokens from there. We just have two such tokens lparen and
rparen which are used to denote the left and right parens. Play
around with
trivial.raku
to see how we can parse this.
Warmup: Slaying a cliched interview question using Raku grammars

Q: Write a program that checks that parentheses are balanced.
For example:
() ;; balanced
(()) ;; balanced
(()(())) ;; balanced
(()(((()))) ;; unbalanced
grammar BalancedParens {
    token TOP { <balanced-paren> }

    token balanced-paren { <lparen> <balanced-paren>* <rparen> }

    token lparen { '(' }
    token rparen { ')' }
}
It's likely I've made this more wordy than necessary, but that's still
only six rather-readable lines.
Now, under time pressure which seems more likely? Coding up a stack
and fiddling with corner cases or using grammars with the awesomeness
of Grammar::Tracer to quickly hack out a declarative solution.
It turns out, that we have just tackled one of the trickiest aspects
of writing a real-world grammar, and that is dealing with nested
structures. As programmers we know that when we see nested structures,
we know we will have to deal with recursion in some form.
You can play around with
balanced-parens.raku
program on the console, and observe how the grammar is parsed.
Note: It turns out there is a
better
way to parse nested structures, but this is fine for now.
Parsing our first namespace form

Let's try to parse this simple namespace declaration:
;; 1   3
   |   |
;; (ns my-own.fancy.namespace)
    |                        |
;;  2                        4
While this is simple, it is going to be an important building block in
tackling more complicated namespace forms. Let's break this down into
its four component
lexemes. We
can see the open and close parens, we see "ns", the namespace
my-own.fancy.namespace and finally the close paren. That's it! Let's
tackle these individual pieces using a grammar.
grammar SimpleNs {
    token TOP { <simple-ns> }

    #                  1        2            3         4
    #                  |        |            |         |
    token simple-ns { <lparen> <ns-keyword> <ns-name> <rparen> }

    token ns-keyword { 'ns' }
    token ns-name { <.ns-name-component>+ % '.' }
    token ns-name-component { ( <.alnum> | '-' )+ }

    token lparen { '(' }
    token rparen { ')' }
}
Over here we can see that we have translated this into a simple Raku
grammar. You could argue that defining simple-ns is not even
required, we could have just put it in TOP directly, but anyway.
We will need to deal a little with
regexes here. In
the various flavours of regexes, + usually means one or more. |
has a slightly different meaning from what you're expecting, but you
can check the regex
documentation for all the
details on the difference between | and ||. Loosely speaking, we
are saying that a namespace component, i.e. the thing between two
dots, is made up of one or more alphanumneric characters or
hyphens. Now, if there is a rule saying a namespace has to start with
an alphabetic character, and not a number, the grammar will become a
little more complex, but this is a pedagogic example, so we'll not be
too pedantic.
I expect eagle-eyed readers to point out a few things:

Where is <.alnum> being defined, and why does it have a dot
before it?

alnum is predefined. The reason it has a . before it is so that we
are not interested in capturing each letter; it's too low level. We
are interested in capturing a top-level token like ns-name instead,
and not each individual character. Play around with the code examples
by adding and removing a dot from various tokens and see the
difference in Grammar::Tracer's output.

What does % mean?

This is a very useful convenience to describe patterns where something
is interspersed between a bunch of other things. For example, we can
have a namespace like foo.bar.baz.
token ns-name { <.ns-name-component>+ % '.' }
This means that ns-name is made up of at least one
ns-name-component's (denoted by the +) and separated by a .
(denoted by %).
Okay, so this should work I guess? Let's see what happens when we run
the code!
No, that didn't work. As Grammar::Tracer so helpfully tells us, we
did not account for the space after ns. In traditional compiler
theory, there is a process of tokenisation, where some lightweight
regexes are used to separate the program into its component lexemes
and discard all the extraneous spaces. However, here we will not do
that, and we'll deal with it in the grammar itself. Now, it could be
argued whether that is a good decision, but that's another
discussion. So, let's add some whitespace. I got myself into a
position where I liberally used to sprinkle <.ws>* indicating zero
or more whitespace characters in places where I felt that we should
allow optional whitespace, as you would expect in a real-world
program.
token simple-ns { <lparen> <ns-keyword> <.ws> <ns-name> <rparen> }
With that tiny addition, we are now able to parse the simple namespace
form. You can play around with
simple-ns.raku.
Let's make our lives a little more difficult

Okay, now we are getting the hang of this, so let's see a namespace
form which is a little more
realistic.
(ns my-amazing.module.core
  (:require [another-library.json.module :as json]
            [yet-another.http.library :as http]))
This is a totally realistic namespace form which we shall parse by
adding support for the :require form where other libraries are
imported and given a short nickname.
Can this be done? You bet!
grammar RealisticNs {
    token TOP { <realistic-ns> }

    token realistic-ns { <lparen>
                           <ns-keyword> <.ws> <ns-name> <.ws>
                           <require-form>
                         <rparen> }

    token ns-keyword { 'ns' }

    token ns-name { <.ns-name-component>+ % '.' }
    token ns-name-component { ( <.alnum> | '-' )+ }

    token require-form { <lparen>
                           <require-keyword> <ws>? <ns-imports>
                         <rparen> }

    token require-keyword { ':require' }

    token ns-imports { <ns-import>+ % <.ws> }

    token ns-import { <lsquare>
                        <ns-name> <.ws> ':as' <.ws> <ns-nickname>
                      <rsquare> }

    token ns-nickname { <.alnum>+ }

    token lsquare { '[' }
    token rsquare { ']' }
    
    token lparen { '(' }
    token rparen { ')' }
}
Nothing too scary as yet. We can see how the grammar evolves. At the
top level, in realistic-ns we have added an extra token called
<require-form> and we flesh out the details later. We can manage
complexity in this manner, so that we have the ability to zoom in and
out of the details as necessary.
Using the parsed data

Now that we have been able to parse the data, we need to make use of
what we've parsed. This is where Actions come into the picture.
When we do RealisticNs.parse(...) that returns a Match object
corresponding to the RealisticNs grammar. While we can query that
object to get the pieces of data that we require, it is less
cumbersome to use Actions to build up the data we are actually
interested in.
Given, a namespace, we want to extract out:

Namespace name
Imported namespaces
Imported namespace nicknames

Using these three pieces of data, given a list of files (that are
simple enough to be parsed by our grammar 😜), we can extract the
information we need.
The simple principle to follow is that for the token we are interested
in, we create an Action method with the same name. When the Match
object is being built up, the token Action methods run when the token
is matched. Grammars are parsed top-down, but data is built up by the
Action methods in a bottom-up fashion.
class RealisticNsActions {
    has $!ns-name;
    has $!imported-namespaces = SetHash.new;
    has $!ns-nicknames = SetHash.new;

    method TOP($/) {
        make {
            ns-name => $!ns-name,
            ns-imports => $!imported-namespaces,
            ns-nicknames => $!ns-nicknames
        }
    }
    
    method ns-name($/) {
        $!ns-name = $/.Str;
    }

    method imported-ns-name($/) {
        $!imported-namespaces{$/.Str}++;
    }

    method ns-nickname($/) {
        $!ns-nicknames{$/.Str}++;
    }
}
Here, we have created a RealisticNsActions class and created methods
where we want to do something with the data associated with it. We
don't need to touch the grammar definition at all (which keeps it
clean). The only extra thing we need to do, is while parsing, we need
to pass the Actions object like this, which instructs Raku to run the
token Action methods when it sees those tokens.
sub MAIN() {
    my $s = RealisticNs.parse(slurp("realistic.clj"), actions => RealisticNsActions.new);
    say $s.made;
}
In an Actions class, the TOP method can be used to generate the
final payload that we can access by calling the made method. For
more information on make and made the official Raku Grammar
tutorial makes it clear. In short, arbitrary payloads created using
make can be accessed by using made.
When we run this program, this is what we see:
{ns-imports => SetHash(another-library.json.module
                       yet-another.http.library),
 ns-name => my-amazing.module.core,
 ns-nicknames => SetHash(http json)}
As expected, we can see the parsed data is created in a nice HashMap,
but wouldn't it be nice to know that yet-another.http.library's
nickname in the namespace is http? This is where we run into a
design issue in the Actions class we have just written. We are
building payloads at a lower level, than what we should.
We need more structure to be able to get the namespace -> namespace-nickname mapping that we want. A quick glance at the
grammar tells us that we can find it at the ns-import level, because
it's subtokens are imported-ns-name and ns-nickname, and those two
are the pieces of data that we want.
Let's write an Action method for ns-import!
class RealisticNsActions {
    has $!ns-name;
    has $!imported-namespaces = SetHash.new;
    has %!ns-nicknames;

    method TOP($/) {
        make {
            ns-name => $!ns-name,
            ns-imports => $!imported-namespaces,
            ns-nicknames => %!ns-nicknames
        }
    }

    method ns-name($/) {
        $!ns-name = $/.Str;
    }

    method ns-import($match) {
        #say $match;
        
        my $imported-ns-name = $match<imported-ns-name>.Str;
        my $ns-nickname = $match<ns-nickname>.Str;

        $!imported-namespaces{$imported-ns-name}++;
        %!ns-nicknames{$imported-ns-name} = $ns-nickname;
    }
}
which results in the output:
{ns-imports => SetHash(another-library.json.module 
                       yet-another.http.library),
 ns-name => my-amazing.module.core,
 ns-nicknames => {another-library.json.module => json,
                  yet-another.http.library => http}}
Now we have all the information we require. You can play around with
realistic-ns-with-actions.raku.
Match objects in Action classes

The result of a successful parse is a
Match object. This contains the
entire heirarchical parsed structure.
Whatever we did in the ns-import method, we could have done at a
higher level, but the Match object at that level would need more
querying. This is because the method will receive its "view" into the
full Match object, i.e. TOP will have the entire Match object, and
the ns-import will have a more restricted view using which we could
easily extract out imported-ns-name and ns-nickname. This might
not make sense immediately, but after dealing with Match for some
time, you'll see how it makes sense to extract out useful information
at the lowest possible level which allows for easier querying. At the
TOP level, to extract out ns-nickname we would have had to query
realistic-ns -> require-form -> ns-imports -> ns-import -> ns-nickname which is cumbersome to say the least, and because there
are multiple of these, there will be an array of ns-import.
To see this in action, in each Action method add a say $match or
say $/ as appropriate to see what the structure at that level.
Development Style

Raku does not as yet have a full-featured REPL environment that a Lisp
programmer may be used to. That is just something that needs to be
worked around.
The Raku REPL could be used to quickly test short one-line snippets of
code, mostly as the simplest of sanity checks, but it gets unwieldy
with anything more complex than that.
To get around this, I used a TDD approach of sorts, where I would take
a real-world Clojure project, and run the (rapidly evolving) grammar
on all the Clojure files. With every "correct" change the number of
parsed files increases, and with every "wrong" change, we would not
be able to get as far.
Next steps

With what we've tackled so far, it is not much of a stretch to parse
real world Clojure namespace declarations as well. For example, we
have added support for importing Clojure namespaces using the
:require form. Similarly, we could add support for the :import
form using which we import Java libraries. The same iterative approach
can be used to parse increasingly complicated code. In the zip file
attached, I have added a real-world grammar using which I have been
able to parse hundreds of Clojure files. You may notice that I have to
handle lots of syntactic variation, and optional whitespace, but I
believe we have extracted the crux of the implementation into the
easily understood grammar that we have discussed in detail in this
article.
The final Clojure NS
Grammar
for reference. Using this grammar to generate the dependency graphs is
a story left for another day.
Caveat Emptor (and a few big ones):

There is a chance that when a Grammar is improperly specified (and you
will run into this a few times during experimentation), that the
program just hangs. You can think of this as what is happening as
something with pathological backtracking. This is not something only
Raku grammars, or grammars in general, can suffer from. A badly
written regex can have the same effect too, so one must be aware of
this contingency before putting a regex/grammar in the hot path of a
highly-critical application. There are high-profile post mortems
discussing how a single regex brought down web applications to their
knees.
While dealing with regexes the common advice is to avoid using .*
(any number of any character), and instead be more restrictive in what
that matches with, and that advice holds true for grammars too. Be as
restrictive as possible, and relax certain things when it is
unavoidable. In the final code, I was able to parse hundreds of real
Clojure ns-declarations, but during the process of evolving the
grammar every once in a while, I did run into this behaviour a few
times, and that was remedied by tweaking the grammar as described. It
is done intuitively, and being able to reliably fix it, is a dark art
that most (including yours truly) do not confidently possess.
The other thing to be wary of in a production system is, that even if
a grammar is well-specified, could it be possible for a malicious user
to carefully craft an input that triggers this pathological behaviour?
Given enough motivation, anything is possible, so all bets are off. 🤞
Grammars are more powerful than regular expressions
(https://en.wikipedia.org/wiki/Chomsky_hierarchy). The traditional
warning of not using regexes to parse things it isn't powerful enough
to parse does not apply here,
assuming the grammar is well-written, but the idea of a well-written
grammar is as elusive as a bug-free program. I don't mean the grammar
in the original design of the language, but the grammar you are
writing to parse something. Without properly dealing with corner-cases, and
adequately handling errors, you are going to end up with a brittle and
frustratingly difficult to debug program.
Conclusion

Now that we have reached the end, I hope this is another article
floating out on the internet that will help fledgling Rakoons (can
Rakoons fly? 🤔) understand how Raku grammars can be put to real world
tasks and can be a powerful tool, when the time comes.
References


Grammar Tutorial was
invaluable in helping me get the hang of Raku grammars. This post
could be seen as an alternative tutorial in the same vein.
Parsing: a timeline is a very long article, but totally worth
the read.
Grammar::Tracer

End Credits


All the Raku designers and contributors over the years.
Jonathan Worthington for creating Grammar::Tracer.
All the friendly Rakoons (@JJ) who helped with reviewing this article.