In this tutorial we're going to build a set of parser combinators.
We'll answer the above question in 2 steps.
- What is a parser?
- and, what is a parser combinator?
So first question: What is parser?
Answer: (in its simplest form) a parser is a
- a function
- that takes some input in form of a raw sequence (like a string of characters)
- and returns some meaningful data built from the raw input
- or some error if the raw input does not conform to what is expected.
Here is a very simple example that takes a string. If the string represents a valid integer it returns that integer, otherwise it returns a parse error.
function parseInteger(input) {
const match = /^\d+$/.exec(input);
if (match != null) {
return +match[0];
}
return new Error("Invalid integer");
}
$ parseInteger("12")
>> 12
$ parseInteger("hey")
>> Error: Invalid integer
Nice, but what about
$ parseInteger("12hey")
>> Error: Invalid integer
Because we used ^
& $
our regular expression checks if the entire input is a valid integer. It makes sense if this is the only thing we want to parse. However, very often we want to parse more complicated things.
Here is another example, we want to parse the following sequence
- an integer
- a '+' character
- then another integer
and return the sum of the 2 numbers obtained in (1) and (3.)
We'll keep it simple and not allow spaces between the 3 steps. So how do we approach it?
We already have our parseInteger
function. We could reuse it somehow with another function parsePlus
. But we need to rethink our previous definition.
Let's think about it: to parse the above sequence, we need to run 3 parsers (i.e. functions) one after another. But it's not as simple as composing simple functions. Passing from one step to another requires some glue code.
- first
parseInteger
will try to parse an integer from the beginning of the input - if (1) returns an error then we stop parsing and returns that error
- otherwise, we call the second parser with the rest of the string.
But to achieve (3) we must get the rest of the string from the first parser. So now our parser function needs to return
- either an error if the parser has failed
- or the result plus the rest of the input in case of success.
So, with the return value in (2) we can call the next parser in the sequence to parse the rest of the input.
Before rewriting parseInteger
let's first make some changes to our parser interface.
// We'll use our own error description
function failure(expected, actual) {
return { isFailure: true, expected, actual };
}
function success(data, rest) {
return { data, rest };
}
// And for our main parsing, we'll invoke this function
function parse(parser, input) {
const result = parser(input);
if (result.isFailure) {
throw new Error(`Parse error.
expected ${result.expected}.
instead found '${result.actual}'
`);
} else {
return result;
}
}
Now let's modify the parseInteger function to fit the new interface (from now on we'll use a more concise naming convention: e.g. ìnteger
instead of parseInteger
. It will make our code more readable as we'll be defining more complex parsers.)
function integer(input) {
// note we removed $ from the end of the regular expression
const match = /^\d+/.exec(input);
if (match != null) {
const matchedText = match[0];
return success(+matchedText, input.slice(matchedText.length));
}
return failure("an integer", input);
}
$ parse(integer, "12")
>> {data: 12, rest: ""}
$ parse(integer, "hey")
Uncaught Error: Parse error.
expected an integer.
instead found 'hey'
$ parse(integer, "12hey")
>> {data: 12, rest: "hey"}
Fine. Let's write our second parser, which parses the '+' character and is much simpler,
function plus(input) {
if (input[0] === "+") {
return success("+", input.slice(1));
}
return failure("'+'", input);
}
and 2 quick tests.
$ parse(plus, '+33')
>> {data: "+", rest: "33"}
$ parse(plus, '33+')
>> Uncaught Error: Parse error.
expected '+'.
instead found '33+'
Now we'll write our main parser which will parse the entire sequence.
function plusExpr(input) {
// step 1 : parse the first integer
const result1 = integer(input);
if (result1.isFailure) return result1;
const { data: int1, rest: input1 } = result1;
// step 2 : parse "+"
const result2 = plus(input1);
if (result2.isFailure) return result2;
const { rest: input2 } = result2;
// step 3 : parse the second integer
const result3 = integer(input2);
if (result3.isFailure) return result3;
const { data: int2, rest: input3 } = result3;
// one last check
if (input3.length > 0) {
return failure("end of input", input3);
}
// everything is alright so return the final result
return success(int1 + int2, input3);
}
$ parse(plusExpr, "12+34")
>> {data: 46, rest: ""}
$ parse(plusExpr, "12a+34")
>> Uncaught Error: Parse error.
expected '+'.
instead found 'a+34'
parse(plusExpr, "12-34")
>> Uncaught Error: Parse error.
expected '+'.
instead found '-34'
$ parse(plusExpr, "12+34rest")
>> Uncaught Error: Parse error.
expected end of input.
instead found 'rest'
So far so good. But for our parser to be practical we need to make some improvements.
- We would like to have some reusable way to parse more things and not just numbers.
- We also need some reusable way to create sequences like in
plusExpr
. Right now sequencing parsers involves some boilerplate:
- At each step we must check if the result is an error to decide whether we should continue or stop,
- we also need to take care of passing the rest of the input to the next parser.
This may not seem too much. But remember that in practice we'll be creating this kind of sequence a lot of the time. So abstracting this somehow is going to make our life easier.
So first (1), we're going to make a couple of helper functions to create parsers.
The first one will just generate a parser that parses a given string of characters
function text(match) {
return function textParser(input) {
if (input.startsWith(match)) {
return success(match, input.slice(match.length));
}
return failure(`'${match}'`, input);
};
}
// example
const plus = text("+");
$ parse(plus, "+12")
>> {data: "+", rest: "12"}
$ parse(plus, "12+")
>> Uncaught Error: Parse error.
expected '+'.
instead found '12+'
Our second helper works like the first one but matches regular expressions instead of plain text.
function regex(regex) {
const anchoredRegex = new RegExp(`^${regex.source}`);
return function regexParser(input) {
const match = anchoredRegex.exec(input);
if (match != null) {
const matchedText = match[0];
return success(matchedText, input.slice(matchedText.length));
}
return failure(regex, input);
};
}
const decimal = regex(/\d+(?:\.\d+)?/);
parse(decimal, "12.34")
>> {data: "12.34", rest: ""}
Hmm... not quite. Our aim is for an actual number 2.3 and not just its textual representation.
We cannot blame our regex helper. A regular expression can be used to parse arbitrary data types, it has no idea what kind of data we are expecting. So we need some general way of transforming the textual representation into some meaningful data.
To make it even more 'general' we'll define another helper function which transforms the result of any parser, not just regex ones; meet the map
function.
function map(func, parser) {
return function mapParser(input) {
const result = parser(input);
if (result.isFailure) return result;
return success(func(result.data), result.rest);
};
}
const decimal = map(x => +x, regex(/\d+(?:\.\d+)?/));
$ parse(decimal, "12.34")
>> {data: 12.34, rest: ""}
$ parse(decimal, "a12.34")
>> Uncaught Error: Parse error.
expected /\d+(?:\.\d+)?/.
instead found 'a12.34'
Certainly not the most helpful error message. We'll see later how to improve that.
Now that we defined our primitive parsers, let's define our sequencing combinator.
We already know that our sequencer needs to take care of error handling and state passing (i.e. passing the rest of the input) between steps. The last question is: what should be the return value?
There are multiple answers:
- We could just return the result of the last step.
- We could also return an array with the results from all steps.
- We could apply some given function to the results from all steps and return the result.
If we think about it, we can define (1) and (2) in terms of (3) (another possibility is to take (2) and use it with map
but we'll stick with (3).)
Ok. So our combinator will take 2 parameters;
- a function that will be applied to the collected results from all parsers,
- and an array of parsers to be sequenced.
function apply(func, parsers) {
return function applyParser(input) {
const accData = [];
let currentInput = input;
for (const parser of parsers) {
const result = parser(currentInput);
if (result.isFailure) return result;
accData.push(result.data);
currentInput = result.rest;
}
return success(func(...accData), currentInput);
};
}
Our plusExpr
parser can now be defined in terms of apply.
const plusExpr = apply((num1, _, num2) => num1 + num2, [
decimal,
plus,
decimal
]);
$ parse(plusExpr, "12+34")
>> {data: 46, rest: ""}
$ parse(plusExpr, "12+34rest")
>> {data: 46, rest: "rest"}
Oops! we forgot to take care of the end of input. Never mind, we'll just create a parser for that.
function eof(input) {
if (input.length === 0) return success(null, input);
return failure("end of input", input);
}
// fix plusExpr
const plusExpr = apply((num1, _, num2) => num1 + num2, [
decimal,
plus,
decimal,
eof
]);
$ parse(plusExpr, "12+34rest")
>> Uncaught Error: Parse error.
expected end of input.
instead found 'rest'
Using apply
we can define helpers for the other possible results of sequencing.
// Yeah, not the best name I guess
function sequence(...parsers) {
return apply((...results) => results[results.length - 1], parsers);
}
function collect(...parsers) {
return apply((...results) => results, parsers);
}
$ parse(
sequence(text("hello"), text(", "), text("world")),
"hello, world"
)
>> {data: "world", rest: ""}
$ parse(
collect(text("hello"), text(", "), text("world")),
"hello, world"
)
>> {data: ["hello", ", ", "world"], rest: ""}
We are going to improve our expression parser by allowing more arithmetic operations.
We need to modify plusExpr
so that in its 2nd step it can handle operations in addition to '+'.
Ah, and as usual we need our solution to be general so that we can allow alternative arbitrary parsers, and not just those for simple strings (so you guessed it, a simple regex won't do it.)
You should be used to it by now. We need another parser combinator.
function oneOf(...parsers) {
return function oneOfParser(input) {
for (const parser of parsers) {
const result = parser(input);
if (result.isFailure) continue;
return result;
}
// We'll see later a way to improve error reporting
return failure("oneOf", input);
};
}
We're equipped now to make a better expression parser (and evaluator.)
const opMap = {
"+": (left, right) => left + right,
"-": (left, right) => left - right,
"*": (left, right) => left * right,
"/": (left, right) => left / right
};
function getOp(op) {
return opMap[op];
}
const op = map(getOp, oneOf(text("+"), text("-"), text("*"), text("/")));
const decimal = map(x => +x, regex(/\d+(?:\.\d+)?/));
const expr = apply((num1, opFunc, num2) => opFunc(num1, num2), [
decimal,
op,
decimal
]);
$ parse(expr, "12-34")
>> {data: -22, rest: ""}
$ parse(expr, "12*34")
>> {data: 408, rest: ""}
Works great, but error reporting could be better.
$ parse(expr, "a12*34")
>> Uncaught Error: Parse error.
expected /\d+(?:\.\d+)?/.
instead found 'a12*34'
parse(expr, "12 + 34")
>> Uncaught Error: Parse error.
expected oneOf.
instead found ' + 34'
And we are still not supporting whitespaces.
Proper error reporting for real world parsers includes much more than just printing friendly names for regular expressions or the oneOf
parsers. We need to report the precise location (file, line & column) of the error as well as all the alternatives expected at this location (including from deeply nested parsers.)
We will may cover error reporting in more detail in another post. For now our solution will be a simple label
helper, which decorates a given parser with a user-friendly message. The implementation has some pitfalls (more precisely we need to fix lookahead) but will suffice for our current needs.
function label(parser, expected) {
return function labelParser(input) {
const result = parser(input);
if (result.isFailure) {
// replace the parser error with our custom one
return failure(expected, result.actual);
}
return result;
};
}
const decimal = map(x => +x, label(regex(/\d+(?:\.\d+)?/), "a decimal"));
const expr = apply((num1, opFunc, num2) => opFunc(num1, num2), [
decimal,
label(op, "an arithmetic operator"),
decimal
]);
$ parse(expr, "12 + 34")
>> Uncaught Error: Parse error.
expected an arithmetic operator.
instead found ' + 34'
$ parse(expr, "a12 + 34")
>> Uncaught Error: Parse error.
expected a decimal.
instead found 'a12 + 34'
Our final touch will be to make the parser a little more realistic by skipping whitespaces.
// Lexeme is a function that takes a parser for 'junk' (e.g. whitespaces, comments)
function lexeme(junk) {
// and returns another function that takes a parser for some meaningful data.
return function createTokenParser(parser) {
// The (second) function returns a parser that
// parses the meaningful data then skips the junk
return apply((data, _) => data, [parser, junk]);
};
}
const spaces = regex(/\s*/);
const token = lexeme(spaces);
// redefine our expression to skip leading and trailing spaces
const expr = apply((_, num1, opFunc, num2) => opFunc(num1, num2), [
spaces, // skips leading spaces
token(decimal),
token(label(op, "an arithmetic operator")),
token(decimal), // skips trailing spaces
eof
]);
$ parse(expr, " 12 + 34 ")
>> {data: 46, rest: ""}
Some of you may know that as the original author of redux-saga
I have a soft spot for generators (that some FP folks see as a restricted do
notation but whatever.)
Imagine we could use generators to write sequences like expr
instead of apply
. We could write something like:
const expr = go(function*() {
yield spaces;
const num1 = yield token(decimal);
const opFunc = yield token(label(op, "an arithmetic operator"));
const num2 = yield token(decimal);
yield eof;
return opFunc(num1, num2);
});
The yield statements embed all the machinery of error handling and state passing. We can write our sequences as if we were calling normal functions.
It doesn't take much more to implement go
than apply
. The only difference is that instead of stepping over an array of parsers, we step over a generator object. The generator yields successive parsers and at the end returns the value that will be returned as the final result of the main parser.
function go(genFunc) {
return function yieldParser(input) {
const gen = genFunc();
let currentInput = input;
let genResult = gen.next();
// if not done yet, genResult.value is the next parser
while (!genResult.done) {
const result = genResult.value(currentInput);
if (result.isFailure) return result;
currentInput = result.rest;
genResult = gen.next(result.data);
}
// if done, genResult.value is the return value of the parser
return success(genResult.value, currentInput);
};
}
The generator definition of expr
looks more imperative than the apply
based one (aka Applicative definition.) Some people will prefer the first style, others will prefer the second. 'Generator definitions' (aka Monadic definitions) also allow some things that are not possible with Applicative ones. For example, imagine parsing an HTML like syntax where each opening tag must have a corresponding closing tag.
const openBracket = text("<");
const closeBracket = text(">");
const identifier = regex(/[^>]*/);
const whateverContent = regex(/[^<]*/);
const element = go(function*() {
// parses opening tag
yield openBracket;
const tagName = yield identifier;
yield closeBracket;
yield whateverContent;
yield text(`</${tagName}>`);
});
In the last step, the yielded parser is created dynamically. There is no way to know what will be the closing tag before parsing the opening tag. With apply
all parsers must be statically passed (known in advance) so we can't have the above kind of definitions.
Generators can also allow some nice recursive definitions. For example, suppose we want to parse some token as many times as possible.
$ parse(many(regex(/\d/)), "123xyz")
should return >> {data: ["1", "2", "3"], rest: "xyz"}
We can define many
using generators like this.
// creates a parser that always succeeds with `value` without consuming any input
function pure(value) {
return function pureParser(input) {
return success(value, input);
};
}
function many(parser) {
const self = oneOf(
go(function*() {
const head = yield parser;
// 1. keep calling self recursively
const tail = yield self;
return [head, ...tail];
}),
// 2. until it fails in which case we return an empty array
pure([])
);
return self;
}
Using many
we can, for example, parse expressions of an arbitrary length.
const expr = go(function*() {
yield spaces;
const num1 = yield token(decimal);
const rest = yield many(collect(token(label(op, "an arithmetic operator")), decimal));
yield eof
return rest.reduce((acc, [opFunc, num]) => opFunc(acc, num), num1)
});
$ parse(expr, '1 + 2 + 3 + 4')
>> {data: 10, rest: ""}
A single post cannot cover parser combinators in detail. For those who want to go further, I made a library pcomb that packages a more comprehensive set of combinators. It's not ready for production but there are already enough features to play with more advanced parsers. Included examples of parsers that illustrate how combinators work.
Here are some things that still need to be covered (may do that in later posts.)
- Lookahead: For example, our
oneOf
definition allows for an arbitrary lookahead. It means that even if an alternative consumes an arbitrary amount of input before failing,oneOf
will always restart the next alternative from the beginning of the current input.
This is not efficient in practice and doesn't allow for proper error reporting. In practice we may better restrict the lookahead so that oneOf
will not try another alternative if the current one has failed while consuming some input. This will also allow for better error reporting since we can propagate exactly what's expected at a specific location.
-
(Proper) Error reporting, this includes reporting the exact location of the failure as well as the expected items at that location while still allowing developers to plug in their own error messages.
-
User state: Parsing complex languages involves state bookkeeping (e.g. "Are we inside a function body?"). This involves allowing a parser to read/write state information. The most simple and composable solution is to write state readers/writers themselves as parsers that can be inserted in a sequence.
-
Refactoring using modular interfaces: abstracts away error handling & state passing into separate interfaces (as done in Haskell with stacks of Monad Transformers.) This provides a more flexible interface, allowing developers to plug in their own implementations.
I hope you enjoyed this post and that you'll have some fun creating your own parsers.
This is amazing. I can't believe how closely the genFunc resembles do notation in a haskell parser. Looking forward to the next entry!