jorendorff/compilers-101-programming-languages-and-parsing.md

## compilers-101-programming-languages-and-parsing.md

      
    Raw
  

              compilers-101-programming-languages-and-parsing.md
            
          
    Compilers 101: Programming languages and parsing

Slides
Beginning

Good morning

Good morning!
How do we feel about code here?
Are y'all pro-code, anti-code?
OK.
I want to warn you before we begin,
there is a lot of code in this talk.
We are going to be reading code
and thinking about code.
We're going to read code about code.
The only thing you need to know going in is JavaScript.
It will help if you know regular expressions.
But this is a pretty intense 40-minute talk.
I'm thrilled to have this 11 AM slot.
We should all be just about at peak caffienation.
Before we get started, here is a URL you probably want to grab.
I'll post it on twitter too.
https://gist.github.com/jorendorff/6637221
A bit about language

I wish we could dive right into the code.
When I was writing the talk I had planned to go right into running code.
But it turns out that unless you already know
the stuff I'm going to try and teach you today,
we do have to start with a few concepts.
We have to start with the big green ball.
Language is amazing.
By the time you were two or three years old,
and you started combining words to make sentences of your own,
you knew grammar.
Deeply, subconsciously, without thinking,
you knew that
green big ball

is not right, that instead you should say
big green ball

You never learned this in school.
This is a rule of grammar that almost no one consciously knows,
but which every native English speaker subconsciously knows.
I have this book that contains a few others.
Almost two thousand pages of them.
Rules you've known for as long as you can remember.
Rules that hundreds of scientists working around the English-speaking world
had to spend decades
to painstakingly reverse-engineer
and put into words.
Something in your brain is built to acquire language.
Something subconscious, and fluent,
and adaptive, and improvisational,
and altogether faster
than the part in charge of logic and reason.
A bit about programming languages

In my opinion,
of all the hacks ever perpetrated in the history of programming,
the invention of high-level programming languages
is the greatest.
I don't know if programming languages
actually exploit the language hardware in your brain.
That would be a pretty bold claim.
But programming languages do have this thing
where at first when you learn a new one,
you're conscious of every little thing,
and then you stop noticing the syntax
and then you start thinking in the language.
How programming languages work inside
is one of the coolest things I know,
and it is one of the pieces of magic
that makes everything else you will see and do this weekend
possible.
I want to share some of that with you today.
There is a great deal more than I can show you in one hour.
Let's get started.
Middle

What is a compiler?

Well, what is a compiler?
A compiler is a program that translates code
from one language to another.
This is useful because you can translate code
from a language that is good for humans
to a language that's good for some other purpose.
Like running it.
The first compilers translated code
from a high-level programming language,
or what was considered high-level for the time,
down to machine code.
FORTRAN was the first widely used one.
That kind of compiler is still being written.
Go is an example.
Rust is an example.
There are also compilers that target bytecode, for virtual machines.
For example, javac translates Java to Java bytecode.
Erlang is very similar. The erlang compiler generates BEAM bytecode
which is then executed by the BEAM VM.
There are similar compilers in Python, Ruby, etc.
Emacs has one.
Emscripten, which translates huge 3D games from C++ into WebAssembly,
is in this category.
And that's the sort of thing we usually think of
when we hear the word compiler.
But this notion of translating programs,
translating code,
has been astonishingly fertile.
The basic concept has been with us over fifty years,
and new uses for it are being invented all the time.


LESS and Sass are compilers.
They translate LESS and Sass styleshets into CSS.


CoffeeScript is a very basic compiler.
That's really all it is.
It compiles CoffeeScript to JavaScript.


Babel is an amazing project
that lets you use future JavaScript language features
in real Web sites today.
It works by translating future-JavaScript
into old-school JavaScript that will run in today's web browsers.


And of course there are a million more.
There are compilers that compile regular expressions to machine code.
That compile Web templates to functions for faster rendering.
And so on.
Code has structure

OK. So a compiler just translates code from one form to another.
That sounds a lot like data processing,
the kind of thing we do all day.
So why do compilers require special techniques?
Parsers for programming languages are different from
just about any other kind of data processing you've ever done,
and the reason is that code has structure.
Maybe you've used Unix pipes to hack on text.
Those tools assume that your data is line-oriented.
That is, one record per line.
Lines are the unit, period.
Code isn't like that.
I mean, often you have one statement per line.
But not necessarily.
Sometimes a single statement covers multiple lines.
Sometimes a line has more than one statement on it.
Probably most of us have used SQL to do complex things with data.
But SQL assumes your data is made of tables.
Code, of course, isn't.
You've probably used key-value stores.
Again, code just is not that kind of data.
Maybe you've used regular expressions to hack on text.
You can use regular expressions to hack on code, too.
I've done it.
But not for anything really complicated.
They're just not up to the task.
Writing a regular expression to match every function call in a program
would be a nightmare.
And it would be easily confused by comments and strings.
Language is different because it has structure.
It has large chunks like functions and modules and classes.
It has middle-sized chunks like blocks and loops and if-statements.
It has small chunks: statements or expressions.
And even expressions have parts:
a function call has arguments.
An array has elements.
And almost every part of a program can nest, right?
Arrays can contain arrays.
Function calls can have function calls as arguments.
Here's a picture of a language that makes the structure of code very visible.
(video: Scratch)
If you go to scratch.mit.edu, and click Create,
this is what you'll see.
This is a programming environment for kids.
And the programming language consists of these blocks.
You snap them together and that's your program.
And they only fit together certain ways.
Scratch is a programming language without syntax errors.
It achieves this by being very in-your-face about the structure of code.
Understanding the structure of code is essential
to being able to do anything useful with it.
Compilers, or any other program that operates on code,
need to recover the structure from the plain ASCII text.
How are compilers structured?

Let's take a look inside a simple compiler.
Even the simplest kind of compiler these days has two parts.
There's a front end that understands details of the input language.
This is the part that's responsible for reading the user's program
and inferring its structure.
The front end also checks the program for syntax errors.
Depending on the language, there might be more error checking.
The front end might do type-checking,
it might check that you're not trying to use any variables that aren't declared,
and so on.
Then there's a back end that understands details of the output language.
It generates the machine code or bytecode or whatever the target language is.
So for example suppose we're a C compiler,
and we have this statement in a program.
We're calling the sleep function.
Well, the compiler output for this program
is going to look something like this,
it's assembly code, these are actual x86-64 machine instructions.
I won't make you try to read these,
but you can see the last thing that happens here
is a callq instruction,
which actually causes control to jump
to the address of the sleep function in the C standard library.
But if this looks like gibberish to you, great:
now you understand exactly what the back end does.
It knows about machine code. That's its job.
So in a typical compiler, even a very simple one,
these are two totally separate chunks of code.
What sort of data do you think the front end
hands off to the back end?
The answer is, it's some kind of data
that reflects the structure of the program.

Optional topics to mention here:

multiple back ends
"middle end" optimization passes


What syntax looks like (ECMAScript)

IfStatement :
    if ( Expression ) Statement else Statement
    if ( Expression ) Statement

This is an excerpt from the [ECMAScript language specification][1],
that is, the standard document for JavaScript.
This is how the standard defines the syntax of if statements.
Of course, you already know what an if statement looks like.
(At least, I hope you do,
because there is a lot of JavaScript code in these slides.)
What I want to show you is how syntax is specified.
This shows that there are two kinds of if statements,
one with an else clause
and one without.
The symbols shown in black actually have to appear in your program.
The words in green are like variables.
If you start with this pattern
if ( Expression ) Statement

and you replace this word "Expression" with an expression,
and you replace this word "Statement" with a statement,
if (vase.isFragile()) break;

the result is a valid JavaScript if statement.
A single syntax rule
stands in for infinite possibilities.
(three slides in succession:)
if (arr[mid] === target)
    return mid;

if (node.lowlink === node.index) {
    do {
        successor = stack.pop();
    } while (successor !== node);
}

if (fontFamily == "monospace"
      && fontWeight == "bold"
      && whiteSpace == "preserve")
    tagName = "pre";

(then show the pattern again)
And if it isn't quite clear to you
what the term "Statement" means here,
no problem.
You can go to another part of the standard
and see this:
Statement :
    Block
    VariableStatement
    EmptyStatement
    ExpressionStatement
    IfStatement
    IterationStatement
    ContinueStatement
    etc.

The term "Statement" is defined in exactly the same way.
There are all these different kinds of statement.
Each line is one possibility.
And each one of these terms is defined somewhere too.
A Block is one kind of statement.
Here's the definition.
Block :
    { StatementList_opt }

A Block has curly braces around zero or more Statements.
This explains why the IfStatement patterns we saw a minute ago
did not show a separate version with curly braces.
A Block is just a kind of Statement like any other.
Now you may have noticed something kind of circular about these definitions.
The definition of IfStatement refers to Statement,
and the definition of Statement refers to IfStatement.
Statement refers to Block,
and Block refers to Statements.
Well, it is circular...
reflecting the fact that statements can nest.
You can have an if-statement inside an if-statement.
The syntax of JavaScript is full of nesting.
If you look at the syntax of other programming languages
you'll see that this is commonplace.
OK, one last slide from the ECMAScript standard.
This is the bottom line when it comes to ECMAScript syntax.
Program :
    SourceElements_opt

SourceElement :
    Statement
    FunctionDeclaration

A program--that is, a script--is just a sequence of SourceElements.
Each SourceElement is either a Statement or a FunctionDeclaration.
Today, we're going to take a tiny subset of the JavaScript language,
just a small slice of the syntax
for doing basic arithmetic,
and we're going to write code to run little programs
written in that language.
[1] http://www.ecma-international.org/ecma-262/5.1/#sec-A.3
The syntax

Now let me show you the syntax rules
for the language we're going to implement.
Again, this is a tiny subset of the JavaScript language,
and it really only has like 7 things in it.
I took the JavaScript syntax for arithmetic,
and I just threw away 90% of it.
Here's what's left:
PrimaryExpr :
    Number
    Name
    ( Expr )

MulExpr :
    PrimaryExpr ( "*" PrimaryExpr | "/" PrimaryExpr )*

Expr :
    MulExpr ( "+" MulExpr | "-" MulExpr )*

Numbers, variables, parentheses,
multiplication and division, addition and subtraction.
Seven things.
Boring language, but the upside is
it's familiar and I will be able to show you the complete source code
of the parser for this language.
For the sake of simplicity,
let's assume any string of digits is a number
and any string of letters is a name.
Why do they have three tiers here?
Why isn't there a single category, "Expression",
and then seven lines under there,
one for each language feature.
Wouldn't that be simpler?
Well, this hierarchy is how the JavaScript standard specifies operator precedence.
Remember PEMDAS from school?
Here are parentheses (very high precedence),
multiplication and divison (medium precedence),
addition and subtraction (low precedence).
The way this is designed, a multiplication expression can't contain addition.
Not without parentheses.
Therefore if you see an expression that has both plus signs and times signs in it,
the only way it's gonna match
is if it's addition on the outside and multiplication on the inside.
If I had kept all of the JavaScript operators,
there would be even more sets of rules.
There would be extra-low-precedence rules for greater than and less than,
even lower-precedence rules for logical AND and logical OR,
and so on.
So this is it.
This is the grammar of our very own programming language.
And now what we are going to do is write code to take a string like
"3 + 4"

read it, determine that this is addition, and call
out.number("3")
out.number("4")

and then add them together:
out.add(out.number("3"),
        out.number("4"))

How would you do this?
If you don't already know the answer,
you might be thinking, well,
you could start by finding all the parentheses in the program,
and then we could find all matching pairs of parentheses,
and then we could just parse what's inside the parentheses.
That is, divide and conquer.
But it turns out there's an easier way.
We don't have to do any searching!
We can do a single pass over the source code,
reading it from left to right,
just like reading a sentence out of a book.
In fact there are many different parsing algorithms.
The particular one I'm going to show today is called recursive descent.
Recursive descent parsers are fun to write and surprisingly easy to read.
The parser

The whole parser fits on five slides.
...
(After this there is a blank black slide
and that's when we go to the demo.)
Other back ends

I want to give you an idea of how powerful it is
to control how a language is interpreted.
(demo)
End

Summary (0:48)

So where are we?
We have taken a look at the structure of programming languages.
(slide: Scratch blocks)
(slide: JavaScript syntax)
We've looked at the architecture of compilers.
(slide: lexer/parser/ast/back end)
We've looked, in particular, at how you can take syntax
(slide: the toy arithmetic grammar)
and write a parser to recognize programs that match that syntax,
(slide: add source code on the right)
turning code into an AST,
(slide: <span> output)
which is the first stage of every modern compiler.
And we've shown how strangely useful the parser is all by itself.
It can take a tiny subset of the JavaScript language
and do things with it that JavaScript itself doesn't do.
Why

But so what? Why does any of this matter?
Language is a really weird, far-out tool to have in your toolkit.
It's not a hammer you can use to beat down every problem.
But
the ability to write a parser
is the difference between living with CSS
and inventing LESS.
It's the difference between living with JavaScript
and inventing ClojureScript.
It's the difference between living with XML
and inventing JSON.
So I think it's an important tool.
When it's the right tool for the job, language is like magic.
Outro (0:34)

(this is the outro I gave at coderfaire, preserved here because I like it so much)
Well, we're out of time for today.
But I'd like to show you more.
If you don't already have a project to pursue during the hack day tomorrow,
consider joining me and inventing your own programming language.
You don't have to write it in JavaScript.
You can use whatever language you like best.
I know an hour-long talk can't do much more than whet your appetite.
I haven't tried to teach everything you need to know.
But if you decide to get your hands dirty tomorrow
you just might learn something.
The greatest hack of all time.
Thank you for sticking around.
Have a great day.