One day, I started wondering if it would be possible to parse Clojure namespace forms and generate a dependency graph of the various namespaces used in a real-world Clojure project. While that was the original motivation, I ended up down the Raku grammar rabbit hole, and had an enjoyable time learning how to use them. I'm glad you're joining me in reliving that journey.
Informally speaking, grammars can be thought of as a set of rules that describe a language. With these rules, one can meaningfully parse (to make sense of, or deconstruct into its grammatical components) a piece of text. It turns out that this is a common task in computing. We need to frequently translate programs from one language to another. This is the job of a compiler. Before being able to translate it, the compiler needs to know whether the original program is even valid, according to the language's grammar.
While we have explained in theory what grammars are, Raku grammars
help us model abstract grammars as a programming construct (the
grammar
keyword and its adjacent helpers) using which we can perform
parsing tasks. It is important to understand this distinction.
First class grammars are considered one of the revolutionary features of Raku. Normally, you'd find grammars as a library or a standalone tool, but Raku has embraced it wholesale, and has a powerful implementation of grammars which makes light work of most parsing tasks.
Clojure is a modern lisp, and happens to be the language I use in $DAYJOB. At the top of most Clojure files, there is a namespace form importing various internal and external namespaces. This way we can neatly organize our project into many separate files, rather than having to put them all into one big file. We are shortly going to design a grammar that can parse these namespace forms.
Grammar::Tracer is
helpful in figuring out where your grammar fails to match, and is
invaluable in debugging. Make sure you do zef install Grammar::Tracer
before running the code examples.
A Clojure namespace is a Lisp form, which as expected starts with an open paren and ends with a close paren. Let's write a grammar for one of the simplest Clojure forms, the empty list.
()
grammar EmptyLispForm {
token TOP { <lparen> <rparen> }
token lparen { '(' }
token rparen { ')' }
}
This is one of the simplest possible grammars we can write. Parsing
always starts with the TOP
token, and recurses into the various
component tokens from there. We just have two such tokens lparen
and
rparen
which are used to denote the left and right parens. Play
around with
trivial.raku
to see how we can parse this.
Q: Write a program that checks that parentheses are balanced.
For example:
() ;; balanced
(()) ;; balanced
(()(())) ;; balanced
(()(((()))) ;; unbalanced
grammar BalancedParens {
token TOP { <balanced-paren> }
token balanced-paren { <lparen> <balanced-paren>* <rparen> }
token lparen { '(' }
token rparen { ')' }
}
It's likely I've made this more wordy than necessary, but that's still only six rather-readable lines.
Now, under time pressure which seems more likely? Coding up a stack
and fiddling with corner cases or using grammars with the awesomeness
of Grammar::Tracer
to quickly hack out a declarative solution.
It turns out, that we have just tackled one of the trickiest aspects of writing a real-world grammar, and that is dealing with nested structures. As programmers we know that when we see nested structures, we know we will have to deal with recursion in some form.
You can play around with balanced-parens.raku program on the console, and observe how the grammar is parsed.
Note: It turns out there is a better way to parse nested structures, but this is fine for now.
Let's try to parse this simple namespace declaration:
;; 1 3
| |
;; (ns my-own.fancy.namespace)
| |
;; 2 4
While this is simple, it is going to be an important building block in
tackling more complicated namespace forms. Let's break this down into
its four component
lexemes. We
can see the open and close parens, we see "ns", the namespace
my-own.fancy.namespace
and finally the close paren. That's it! Let's
tackle these individual pieces using a grammar.
grammar SimpleNs {
token TOP { <simple-ns> }
# 1 2 3 4
# | | | |
token simple-ns { <lparen> <ns-keyword> <ns-name> <rparen> }
token ns-keyword { 'ns' }
token ns-name { <.ns-name-component>+ % '.' }
token ns-name-component { ( <.alnum> | '-' )+ }
token lparen { '(' }
token rparen { ')' }
}
Over here we can see that we have translated this into a simple Raku
grammar. You could argue that defining simple-ns
is not even
required, we could have just put it in TOP
directly, but anyway.
We will need to deal a little with
regexes here. In
the various flavours of regexes, +
usually means one or more. |
has a slightly different meaning from what you're expecting, but you
can check the regex
documentation for all the
details on the difference between |
and ||
. Loosely speaking, we
are saying that a namespace component, i.e. the thing between two
dots, is made up of one or more alphanumneric characters or
hyphens. Now, if there is a rule saying a namespace has to start with
an alphabetic character, and not a number, the grammar will become a
little more complex, but this is a pedagogic example, so we'll not be
too pedantic.
I expect eagle-eyed readers to point out a few things:
- Where is
<.alnum>
being defined, and why does it have a dot before it?
alnum
is predefined. The reason it has a .
before it is so that we
are not interested in capturing each letter; it's too low level. We
are interested in capturing a top-level token like ns-name
instead,
and not each individual character. Play around with the code examples
by adding and removing a dot from various tokens and see the
difference in Grammar::Tracer
's output.
- What does
%
mean?
This is a very useful convenience to describe patterns where something
is interspersed between a bunch of other things. For example, we can
have a namespace like foo.bar.baz
.
token ns-name { <.ns-name-component>+ % '.' }
This means that ns-name
is made up of at least one
ns-name-component
's (denoted by the +
) and separated by a .
(denoted by %
).
Okay, so this should work I guess? Let's see what happens when we run the code!
No, that didn't work. As Grammar::Tracer
so helpfully tells us, we
did not account for the space after ns
. In traditional compiler
theory, there is a process of tokenisation, where some lightweight
regexes are used to separate the program into its component lexemes
and discard all the extraneous spaces. However, here we will not do
that, and we'll deal with it in the grammar itself. Now, it could be
argued whether that is a good decision, but that's another
discussion. So, let's add some whitespace. I got myself into a
position where I liberally used to sprinkle <.ws>*
indicating zero
or more whitespace characters in places where I felt that we should
allow optional whitespace, as you would expect in a real-world
program.
token simple-ns { <lparen> <ns-keyword> <.ws> <ns-name> <rparen> }
With that tiny addition, we are now able to parse the simple namespace form. You can play around with simple-ns.raku.
Okay, now we are getting the hang of this, so let's see a namespace form which is a little more realistic.
(ns my-amazing.module.core
(:require [another-library.json.module :as json]
[yet-another.http.library :as http]))
This is a totally realistic namespace form which we shall parse by
adding support for the :require
form where other libraries are
imported and given a short nickname.
Can this be done? You bet!
grammar RealisticNs {
token TOP { <realistic-ns> }
token realistic-ns { <lparen>
<ns-keyword> <.ws> <ns-name> <.ws>
<require-form>
<rparen> }
token ns-keyword { 'ns' }
token ns-name { <.ns-name-component>+ % '.' }
token ns-name-component { ( <.alnum> | '-' )+ }
token require-form { <lparen>
<require-keyword> <ws>? <ns-imports>
<rparen> }
token require-keyword { ':require' }
token ns-imports { <ns-import>+ % <.ws> }
token ns-import { <lsquare>
<ns-name> <.ws> ':as' <.ws> <ns-nickname>
<rsquare> }
token ns-nickname { <.alnum>+ }
token lsquare { '[' }
token rsquare { ']' }
token lparen { '(' }
token rparen { ')' }
}
Nothing too scary as yet. We can see how the grammar evolves. At the
top level, in realistic-ns
we have added an extra token called
<require-form>
and we flesh out the details later. We can manage
complexity in this manner, so that we have the ability to zoom in and
out of the details as necessary.
Now that we have been able to parse the data, we need to make use of what we've parsed. This is where Actions come into the picture.
When we do RealisticNs.parse(...)
that returns a Match
object
corresponding to the RealisticNs
grammar. While we can query that
object to get the pieces of data that we require, it is less
cumbersome to use Actions to build up the data we are actually
interested in.
Given, a namespace, we want to extract out:
- Namespace name
- Imported namespaces
- Imported namespace nicknames
Using these three pieces of data, given a list of files (that are simple enough to be parsed by our grammar 😜), we can extract the information we need.
The simple principle to follow is that for the token we are interested in, we create an Action method with the same name. When the Match object is being built up, the token Action methods run when the token is matched. Grammars are parsed top-down, but data is built up by the Action methods in a bottom-up fashion.
class RealisticNsActions {
has $!ns-name;
has $!imported-namespaces = SetHash.new;
has $!ns-nicknames = SetHash.new;
method TOP($/) {
make {
ns-name => $!ns-name,
ns-imports => $!imported-namespaces,
ns-nicknames => $!ns-nicknames
}
}
method ns-name($/) {
$!ns-name = $/.Str;
}
method imported-ns-name($/) {
$!imported-namespaces{$/.Str}++;
}
method ns-nickname($/) {
$!ns-nicknames{$/.Str}++;
}
}
Here, we have created a RealisticNsActions
class and created methods
where we want to do something with the data associated with it. We
don't need to touch the grammar definition at all (which keeps it
clean). The only extra thing we need to do, is while parsing, we need
to pass the Actions object like this, which instructs Raku to run the
token Action methods when it sees those tokens.
sub MAIN() {
my $s = RealisticNs.parse(slurp("realistic.clj"), actions => RealisticNsActions.new);
say $s.made;
}
In an Actions class, the TOP
method can be used to generate the
final payload that we can access by calling the made
method. For
more information on make
and made
the official Raku Grammar
tutorial makes it clear. In short, arbitrary payloads created using
make
can be accessed by using made
.
When we run this program, this is what we see:
{ns-imports => SetHash(another-library.json.module
yet-another.http.library),
ns-name => my-amazing.module.core,
ns-nicknames => SetHash(http json)}
As expected, we can see the parsed data is created in a nice HashMap,
but wouldn't it be nice to know that yet-another.http.library
's
nickname in the namespace is http
? This is where we run into a
design issue in the Actions class we have just written. We are
building payloads at a lower level, than what we should.
We need more structure to be able to get the namespace -> namespace-nickname
mapping that we want. A quick glance at the
grammar tells us that we can find it at the ns-import
level, because
it's subtokens are imported-ns-name
and ns-nickname
, and those two
are the pieces of data that we want.
Let's write an Action method for ns-import
!
class RealisticNsActions {
has $!ns-name;
has $!imported-namespaces = SetHash.new;
has %!ns-nicknames;
method TOP($/) {
make {
ns-name => $!ns-name,
ns-imports => $!imported-namespaces,
ns-nicknames => %!ns-nicknames
}
}
method ns-name($/) {
$!ns-name = $/.Str;
}
method ns-import($match) {
#say $match;
my $imported-ns-name = $match<imported-ns-name>.Str;
my $ns-nickname = $match<ns-nickname>.Str;
$!imported-namespaces{$imported-ns-name}++;
%!ns-nicknames{$imported-ns-name} = $ns-nickname;
}
}
which results in the output:
{ns-imports => SetHash(another-library.json.module
yet-another.http.library),
ns-name => my-amazing.module.core,
ns-nicknames => {another-library.json.module => json,
yet-another.http.library => http}}
Now we have all the information we require. You can play around with realistic-ns-with-actions.raku.
The result of a successful parse is a Match object. This contains the entire heirarchical parsed structure.
Whatever we did in the ns-import
method, we could have done at a
higher level, but the Match
object at that level would need more
querying. This is because the method will receive its "view" into the
full Match object, i.e. TOP
will have the entire Match
object, and
the ns-import
will have a more restricted view using which we could
easily extract out imported-ns-name
and ns-nickname
. This might
not make sense immediately, but after dealing with Match
for some
time, you'll see how it makes sense to extract out useful information
at the lowest possible level which allows for easier querying. At the
TOP
level, to extract out ns-nickname
we would have had to query
realistic-ns -> require-form -> ns-imports -> ns-import -> ns-nickname
which is cumbersome to say the least, and because there
are multiple of these, there will be an array of ns-import
.
To see this in action, in each Action method add a say $match
or
say $/
as appropriate to see what the structure at that level.
Raku does not as yet have a full-featured REPL environment that a Lisp programmer may be used to. That is just something that needs to be worked around.
The Raku REPL could be used to quickly test short one-line snippets of code, mostly as the simplest of sanity checks, but it gets unwieldy with anything more complex than that.
To get around this, I used a TDD approach of sorts, where I would take a real-world Clojure project, and run the (rapidly evolving) grammar on all the Clojure files. With every "correct" change the number of parsed files increases, and with every "wrong" change, we would not be able to get as far.
With what we've tackled so far, it is not much of a stretch to parse
real world Clojure namespace declarations as well. For example, we
have added support for importing Clojure namespaces using the
:require
form. Similarly, we could add support for the :import
form using which we import Java libraries. The same iterative approach
can be used to parse increasingly complicated code. In the zip file
attached, I have added a real-world grammar using which I have been
able to parse hundreds of Clojure files. You may notice that I have to
handle lots of syntactic variation, and optional whitespace, but I
believe we have extracted the crux of the implementation into the
easily understood grammar that we have discussed in detail in this
article.
The final Clojure NS Grammar for reference. Using this grammar to generate the dependency graphs is a story left for another day.
There is a chance that when a Grammar is improperly specified (and you will run into this a few times during experimentation), that the program just hangs. You can think of this as what is happening as something with pathological backtracking. This is not something only Raku grammars, or grammars in general, can suffer from. A badly written regex can have the same effect too, so one must be aware of this contingency before putting a regex/grammar in the hot path of a highly-critical application. There are high-profile post mortems discussing how a single regex brought down web applications to their knees.
While dealing with regexes the common advice is to avoid using .*
(any number of any character), and instead be more restrictive in what
that matches with, and that advice holds true for grammars too. Be as
restrictive as possible, and relax certain things when it is
unavoidable. In the final code, I was able to parse hundreds of real
Clojure ns-declarations, but during the process of evolving the
grammar every once in a while, I did run into this behaviour a few
times, and that was remedied by tweaking the grammar as described. It
is done intuitively, and being able to reliably fix it, is a dark art
that most (including yours truly) do not confidently possess.
The other thing to be wary of in a production system is, that even if a grammar is well-specified, could it be possible for a malicious user to carefully craft an input that triggers this pathological behaviour? Given enough motivation, anything is possible, so all bets are off. 🤞
Grammars are more powerful than regular expressions (https://en.wikipedia.org/wiki/Chomsky_hierarchy). The traditional warning of not using regexes to parse things it isn't powerful enough to parse does not apply here, assuming the grammar is well-written, but the idea of a well-written grammar is as elusive as a bug-free program. I don't mean the grammar in the original design of the language, but the grammar you are writing to parse something. Without properly dealing with corner-cases, and adequately handling errors, you are going to end up with a brittle and frustratingly difficult to debug program.
Now that we have reached the end, I hope this is another article floating out on the internet that will help fledgling Rakoons (can Rakoons fly? 🤔) understand how Raku grammars can be put to real world tasks and can be a powerful tool, when the time comes.
- Grammar Tutorial was invaluable in helping me get the hang of Raku grammars. This post could be seen as an alternative tutorial in the same vein.
- Parsing: a timeline is a very long article, but totally worth the read.
- Grammar::Tracer
- All the Raku designers and contributors over the years.
- Jonathan Worthington for creating
Grammar::Tracer
. - All the friendly Rakoons (@JJ) who helped with reviewing this article.
Please ping me when it's ready, and I'll upload it (or you can do it, whatever you prefer)