Skip to content

Instantly share code, notes, and snippets.

@BrianWill
Last active June 20, 2023 13:15
Show Gist options
  • Star 10 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save BrianWill/35d4b948c5b6149f5898 to your computer and use it in GitHub Desktop.
Save BrianWill/35d4b948c5b6149f5898 to your computer and use it in GitHub Desktop.

How Clojure Code Becomes a Running Program

Lisp, the original high-level language, introduced a long list of features common in languages today, including dynamic typing, interpretation, and garbage collection. The original Lisp language is long gone, but it had many imitators, which we call 'dialects' of Lisp. Clojure, introduced in 2007, is the first Lisp dialect to gain wide usage in three decades.

Though Lisp features have been co-opted by many other languages, what still distinguishes Lisp dialects from all other languages is how Lisp dialects translate source code into running programs. In non-Lisp languages, the code translation process has two primary steps:

  1. lexing (a.k.a. tokenization) and parsing
  2. code generation (compilation or interpretation)

The lexing and parsing steps produce an AST (Abstract Syntax Tree). In an AST, the elements of source are represented as objects rather than straight text, e.g. a statement is represented as some kind of statement object. It is an AST which is handed off to the code generator because, compared to a raw textual representation of source, an object representation of source is far easier to analyze and translate.

In Lisp dialects, the component of the compiler (or interpreter) which does the lexing and parsing is called the Reader. Unlike in non-Lisp languages, Lisp syntax is composed entirely of literals for the common data types, such as numbers, booleans, strings, lists, and hashmaps. So the data produced by the Reader is not a conventional AST but rather an ordinary object, most commonly a list. Basically, Lisp code is written as something very much like JSON data.

The component of a Lisp compiler (or interpreter) which does the code generation is called the Evaluator. Using a quite simple set of rules, the Evaluator either interprets or compiles the Reader data into a running program.

So to understand how to write code in a Lisp dialect, we first need to learn exactly how its Reader parses text, and then we need to learn the set of rules which the Evaluator uses to translate the Reader data into a running program. (Contrast this with non-Lisp languages, where code translation is generally a black-box to the programmer.)

The Clojure Reader syntax

Clojure is implemented on the JVM, and so every Clojure object is actually an instance of some Java class. Here is how the Clojure Reader parses text into objects:

nil                         ; Java null
“this is a string”          ; a java.lang.String
\H                          ; a java.lang.Character
53                          ; a java.lang.Long
6.2                         ; a java.lang.Double
foo                         ; a clojure.lang.Symbol
:foo                        ; a clojure.lang.Keyword
(1 2 3)                     ; a clojure.lang.PersistentList
[1 2 3]                     ; a clojure.lang.PersistentVector
{“foo” 3 5 7}               ; a clojure.lang.PersistentHashMap 
                            ; (or clojure.lang.PersistentArrayMap)

This really is all the syntax we need to write Clojure (though Clojure also has a handful of special syntactical conveniences we'll discuss later).

What Clojure calls a Symbol is a sequence of characters, just like a string, but Symbols are used as identifiers and conform to restrictions, such as not being allowed to have whitespace characters.

What Clojure calls a Keyword is just like a Symbol but a distinct type. It turns out to be useful to have things which are just like Symbols but distinct from Symbols.

Both PersistentLists and PersistentVectors are persistent, ordered collections. The difference is that PersistentVectors are implemented as tries of nodes suitable for random access of elements whereas PersistentLists are implemented as singly-linked lists and so are not suitable for random access.

So now you should be able to parse the syntax of the Hello, World! program written in Clojure:

(print "Hello, world!")

This is a list with two elements: a symbol print and a string reading "Hello, world!".

The Clojure Evaluator

When the Evaluator processes the Reader data, symbols and lists are given special treatment:

Symbols

To understand how symbols are handled by the Evaluator, we first have to understand namespaces. A Clojure namespace is an object that maps symbols to what Clojure calls Vars. Each Var is simply a mutable object that references some other object. When we lookup symbols in namespaces, we get back the object held in its associated Var. (We'll discuss later why namespaces map symbols to Vars instead of directly to other objects.)

  • At any one time, the Evaluator considers one namespace to be the 'current namespace'. When the Evaluator evaluates a symbol, it attempts to resolve it in the current namespace.
  • When the Clojure runtime starts, the initial current namespace is called user.
  • The standard library namespace clojure.core contains several functions for creating and modifying namespaces.
  • A Symbol containing a slash in the middle is fully-qualified, meaning the part before the slash denotes a namespace. The Evaluator attempts to look up fully-qualified symbols in their denoted namespace rather than in the current namespace.
  • If the Evaluator fails to find a symbol in the appropriate namespace, evaluation fails.

So when evaluating these symbols:

foo      ; the Evaluator expects 'foo' to resolve to a Var in the current namespace
bar/foo  ; the Evaluator expects 'foo' to resolve to a Var in the namespace 'bar'

Lists

When the Evaluator processes a list, it treats it as one of three things:

1. Function calls

When the first element of a list resolves into a function object, the list represents a call to that function with the remaining elements of the list as its arguments:

(foo 3 5 1)      ; call function foo with arguments 3, 5, and 1
2. Macro calls

When the first element of a list is a symbol resolving to a special kind of function called a macro, the arguments to the macro call are not evaluated: instead, the reader data itself is passed as argument. So, for example:

; if foo is a function, call foo with the returned value of (bar 9)
(foo (bar 9))    

; if foo is a macro, call foo with a list containing the symbol bar and the value 9
(foo (bar 9))   

A macro call returns Reader data which is inserted in place of the call and then evaluated in turn. This process is called 'macro expansion'. Most commonly, a macro call expands into some new list representing a function call, macro call, or special form. Effectively, macros provide a general purpose mechanism for syntactical compression. If there's some verbose bit of business you write over and over that can't be abstracted into a function, macros very often provide the solution.

For a simple example, it's very common in Clojure to want to write:

(def foo (fn [] something-something))

...and so the standard library includes a macro called defn that expresses this more compactly:

(defn foo [] something-something)

Be clear that macros are not a substitute for functions. For one thing, macros cannot be passed as arguments to other functions (because, as we'll see, macro calls in the body of a function are expanded when a function is created, not when it is called). Also, most programmers find it difficult to read code that uses many unfamiliar macros. Consequently, the preferred practice in Clojure is to avoid creating new macros in the ordinary course of application programming. Leave macro creation to the authors of API's.

3. Special forms

If the symbol at the start of a list is one of 16 special symbols, the list is a special form. The 16 special forms each have their own special evaluation rules:

(def symbol expr)

A def form creates (or modifies) a mapping in the current namespace:

(def x 3)     ; in the current namespace, map the symbol
              ; x to a Var containing the value 3

(def x 4)     ; modify the Var of x to now reference 4
(if condition then-expr else-expr)

An if form evaluates and returns one of two expressions depending upon whether its condition expression returns a truthy or falsey value (false and nil are falsey; all other values are truthy).

(if true 3 4)      ; true is true, so the form returns 3
(if false 3 4)     ; false is false, so the form returns 4
(if nil 3 4)       ; nil is considered false, so the form returns 4

Be clear that if evaluates only one of the two expressions:

(if (isTuesday?) (foo) (bar))  ; if Tuesday, call foo but not bar
                               ; if not Tuesday, call bar but not foo

When the else-expr is omitted, it defaults to nil:

(if false 3)       ; returns nil
(fn [parameters] body)

An fn form creates and returns a new function object. The parameters are listed as symbols in a vector. The remaining expressions comprise the body of the function. The function returns the value of the last expression of the body.

; Returns a function with two parameters, a and b.
; When called, the function invokes foo and bar.
(fn [a b]
    (foo a)
    (bar b))   ; the value of the last expression is returned from calls

Understand that the vector containing the symbols naming the parameters has nothing to do with the execution of the function. The parameter list vector exists simply as syntax to distinguish the parameters from the expressions of the body.

Also understand that when an fn form is evaluated, its body is evaluated in a special way: the macro calls in the body get expanded, but everything else just gets compiled into code to be run later because, of course, a function is only supposed to run when it's called, not when it's created.

When symbols are resolved in an fn form, the compiled function references the Var, not the value held in the Var. This means that, if we update a Var, the change is seen in any previously compiled fn form that uses the Var:

(def x 3)

; bind to foo a function returning the value of x    
(def foo      
  (fn []
    x))  

(foo)         ; returns 3
(def x 4)
(foo)         ; returns 4

(This explains why namespaces map symbols to Vars rather than directly map symbols to other objects.)

(recur args)

The recur form recursively invokes the fn form in which it is most directly contained. The recursive call is tail-recursive, meaning it reuses the existing stack frame. Therefore, recur may only be used in tail position of the function body:

; a function that repeatedly prints its two arguments forever
(fn [a b]
  (print a b)
  (recur a b))

The other special forms with bodies have a notion of tail position, e.g. the then and else expression of an if are both in tail position because they are both the last expressions possibly evaluated in the if. Tail position is transitive, so recur here is in tail position:

; as long as (foo) keeps returning true, print the two arguments
(fn [a b]
  (print a b)
  (if (foo)
    (recur a b)))
(do body)

A do form very simply evaluates the expressions in its body in order, returning the value of the last expression:

(do
  (foo)    ; invoke foo
  (bar))   ; then invoke bar, returning its return value

The do form is useful in a few contexts where the language expects just a single expression, most notably the then or else expressions of an if:

(if (isTuesday?)
  (do
    (foo)
    (bar)))
(let [bindings] body)

A let form is just like a do but also creates local symbol bindings that hold for the duration of the body. These bindings are written as pairs of symbols and expressions inside a vector before the body. (Like the parameter list vector of an fn form, the bindings vector of a let is used simply as syntax to distinguish the bindings from the expressions of the body.)

(let [x 3         ; locally bind 3 to symbol x
      y (bar)]    ; locally bind result of (bar) to symbol y
  (foo x)         ; invoke foo with the local value of x
  (ack y))        ; return value of invoking ack with the local value of y

The parameters of fn forms and the bindings of let forms are lexically scoped, just like local variables in Javascript, so the locally bound symbols of interior forms shadow locally bound symbols of the same name from enclosing forms.

(loop [bindings] body)

A loop form is just like a let, but it establishes a 'recursion point', such that recur used in tail-position of a loop will jump execution back to the start of the loop with new values for the bindings.

(loop [x 2]             ; in first iteration, x has the value 2
  (if (foo x)          ; if (foo x)...
    (recur (bar x))    ; ...do another iteration with (bar x) bound to x
    x))                ; ...otherwise, return x from the let form
(quote element)

The quote form returns its argument unevaluated:

; returns a list with three elements (a symbol bar, a symbol foo, and a number 3)
(quote (bar foo 3))      

The quote form is occasionally useful, such as when writing macros.

(throw exception)

The throw form throws its argument as an exception. The argument must be a valid Java exception type.

(try body catch-clauses)

The try form catches exceptions in its body. Each catch clause has this form:

(catch exception-type parameter body)

For the catch clause matching the exception type, the exception is bound to the parameter symbol for the duration of that caluse's body.

(var symbol)

The var form returns the var object mapped to a symbol in a namespace:

(var bar)      ; return the Var mapped to bar in the current namespace
(var foo/bar)  ; return the Var mapped to bar in the foo namespace
(new class args)

The new form instantiates a Java type by invoking its constructor:

(new java.io.File "c:/myfile")

(For a symbol containing dots, the evaluator will first try resolving the symbol in the current namespace, like normal, but failing that it will also try to resolve the symbol to a Java type in the JVM's current classpath, e.g. java.io.File.)

(. instance-or-class field)
(. instance-or-class (method-name args))

To retrieve the value of a Java instance field:

(. x bar)            ; x.bar

To invoke a method of a Java instance, we use the . (dot) form:

(. x (foo 3 5))     ; call x.foo(3, 5)

To retrieve static fields or invoke static methods, we specify a class instead of an instance.

(set! (. instance-or-class field) expr)

The set! form assigns a value to an instance or static field. Strangely, the set! form expects a dot form as part of its syntax. Be clear that this dot form inside the set! is not evaluated as normal; the dot form merely specifies the field to which we are assigning:

(set! (. x bar) 3)   ; x.bar = 3
monitor-enter and monitor-exit

These last two special forms are used internally by the language and not meant for general use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment