Skip to content

Instantly share code, notes, and snippets.

@yelouafi
Last active April 4, 2024 16:25
Show Gist options
  • Save yelouafi/556e5159e869952335e01f6b473c4ec1 to your computer and use it in GitHub Desktop.
Save yelouafi/556e5159e869952335e01f6b473c4ec1 to your computer and use it in GitHub Desktop.

In this tutorial we're going to build a set of parser combinators.

What is a parser combinator?

We'll answer the above question in 2 steps.

  1. What is a parser?
  2. and, what is a parser combinator?

So first question: What is parser?

Answer: (in its simplest form) a parser is a

  1. a function
  2. that takes some input in form of a raw sequence (like a string of characters)
  3. and returns some meaningful data built from the raw input
  4. or some error if the raw input does not conform to what is expected.

Here is a very simple example that takes a string. If the string represents a valid integer it returns that integer, otherwise it returns a parse error.

function parseInteger(input) {
  const match = /^\d+$/.exec(input);
  if (match != null) {
    return +match[0];
  }
  return new Error("Invalid integer");
}
$ parseInteger("12")
  >> 12

$ parseInteger("hey")
  >> Error: Invalid integer

Nice, but what about

$ parseInteger("12hey")
  >> Error: Invalid integer

Because we used ^ & $ our regular expression checks if the entire input is a valid integer. It makes sense if this is the only thing we want to parse. However, very often we want to parse more complicated things.

Sequencing parsers

Here is another example, we want to parse the following sequence

  1. an integer
  2. a '+' character
  3. then another integer

and return the sum of the 2 numbers obtained in (1) and (3.)

We'll keep it simple and not allow spaces between the 3 steps. So how do we approach it?

We already have our parseInteger function. We could reuse it somehow with another function parsePlus. But we need to rethink our previous definition.

Let's think about it: to parse the above sequence, we need to run 3 parsers (i.e. functions) one after another. But it's not as simple as composing simple functions. Passing from one step to another requires some glue code.

  1. first parseInteger will try to parse an integer from the beginning of the input
  2. if (1) returns an error then we stop parsing and returns that error
  3. otherwise, we call the second parser with the rest of the string.

But to achieve (3) we must get the rest of the string from the first parser. So now our parser function needs to return

  1. either an error if the parser has failed
  2. or the result plus the rest of the input in case of success.

So, with the return value in (2) we can call the next parser in the sequence to parse the rest of the input.

Before rewriting parseInteger let's first make some changes to our parser interface.

// We'll use our own error description
function failure(expected, actual) {
  return { isFailure: true, expected, actual };
}

function success(data, rest) {
  return { data, rest };
}

// And for our main parsing, we'll invoke this function
function parse(parser, input) {
  const result = parser(input);
  if (result.isFailure) {
    throw new Error(`Parse error.
		expected ${result.expected}.
		instead found '${result.actual}'
	`);
  } else {
    return result;
  }
}

Now let's modify the parseInteger function to fit the new interface (from now on we'll use a more concise naming convention: e.g. ìnteger instead of parseInteger. It will make our code more readable as we'll be defining more complex parsers.)

function integer(input) {
  // note we removed $ from the end of the regular expression
  const match = /^\d+/.exec(input);
  if (match != null) {
    const matchedText = match[0];
    return success(+matchedText, input.slice(matchedText.length));
  }
  return failure("an integer", input);
}
$ parse(integer, "12")
  >> {data: 12, rest: ""}

$ parse(integer, "hey")
  Uncaught Error: Parse error.
		expected an integer.
		instead found 'hey'

$ parse(integer, "12hey")
  >> {data: 12, rest: "hey"}

Fine. Let's write our second parser, which parses the '+' character and is much simpler,

function plus(input) {
  if (input[0] === "+") {
    return success("+", input.slice(1));
  }
  return failure("'+'", input);
}

and 2 quick tests.

$ parse(plus, '+33')
  >> {data: "+", rest: "33"}

$ parse(plus, '33+')
  >> Uncaught Error: Parse error.
		expected '+'.
		instead found '33+'

Now we'll write our main parser which will parse the entire sequence.

function plusExpr(input) {
  // step 1 : parse the first integer
  const result1 = integer(input);
  if (result1.isFailure) return result1;
  const { data: int1, rest: input1 } = result1;

  // step 2 : parse "+"
  const result2 = plus(input1);
  if (result2.isFailure) return result2;
  const { rest: input2 } = result2;

  // step 3 : parse the second integer
  const result3 = integer(input2);
  if (result3.isFailure) return result3;
  const { data: int2, rest: input3 } = result3;

  // one last check
  if (input3.length > 0) {
    return failure("end of input", input3);
  }
  
  // everything is alright so return the final result
  return success(int1 + int2, input3);
}
$ parse(plusExpr, "12+34")
  >> {data: 46, rest: ""}

$ parse(plusExpr, "12a+34")
  >> Uncaught Error: Parse error.
		expected '+'.
		instead found 'a+34'

parse(plusExpr, "12-34")
>> Uncaught Error: Parse error.
		expected '+'.
		instead found '-34'

$ parse(plusExpr, "12+34rest")
  >> Uncaught Error: Parse error.
		expected end of input.
		instead found 'rest'

So far so good. But for our parser to be practical we need to make some improvements.

  1. We would like to have some reusable way to parse more things and not just numbers.
  2. We also need some reusable way to create sequences like in plusExpr. Right now sequencing parsers involves some boilerplate:
  • At each step we must check if the result is an error to decide whether we should continue or stop,
  • we also need to take care of passing the rest of the input to the next parser.

This may not seem too much. But remember that in practice we'll be creating this kind of sequence a lot of the time. So abstracting this somehow is going to make our life easier.

So first (1), we're going to make a couple of helper functions to create parsers.

The first one will just generate a parser that parses a given string of characters

function text(match) {
  return function textParser(input) {
    if (input.startsWith(match)) {
      return success(match, input.slice(match.length));
    }
    return failure(`'${match}'`, input);
  };
}

// example
const plus = text("+");
$ parse(plus, "+12")
  >> {data: "+", rest: "12"}

$ parse(plus, "12+")
  >> Uncaught Error: Parse error.
		expected '+'.
		instead found '12+'

Our second helper works like the first one but matches regular expressions instead of plain text.

function regex(regex) {
  const anchoredRegex = new RegExp(`^${regex.source}`);

  return function regexParser(input) {
    const match = anchoredRegex.exec(input);
    if (match != null) {
      const matchedText = match[0];
      return success(matchedText, input.slice(matchedText.length));
    }
    return failure(regex, input);
  };
}

const decimal = regex(/\d+(?:\.\d+)?/);
parse(decimal, "12.34")
  >> {data: "12.34", rest: ""}

Hmm... not quite. Our aim is for an actual number 2.3 and not just its textual representation.

We cannot blame our regex helper. A regular expression can be used to parse arbitrary data types, it has no idea what kind of data we are expecting. So we need some general way of transforming the textual representation into some meaningful data.

To make it even more 'general' we'll define another helper function which transforms the result of any parser, not just regex ones; meet the map function.

function map(func, parser) {
  return function mapParser(input) {
    const result = parser(input);
    if (result.isFailure) return result;
    return success(func(result.data), result.rest);
  };
}

const decimal = map(x => +x, regex(/\d+(?:\.\d+)?/));
$ parse(decimal, "12.34")
  >> {data: 12.34, rest: ""}

$ parse(decimal, "a12.34")
  >> Uncaught Error: Parse error.
		expected /\d+(?:\.\d+)?/.
		instead found 'a12.34'

Certainly not the most helpful error message. We'll see later how to improve that.

Now that we defined our primitive parsers, let's define our sequencing combinator.

We already know that our sequencer needs to take care of error handling and state passing (i.e. passing the rest of the input) between steps. The last question is: what should be the return value?

There are multiple answers:

  1. We could just return the result of the last step.
  2. We could also return an array with the results from all steps.
  3. We could apply some given function to the results from all steps and return the result.

If we think about it, we can define (1) and (2) in terms of (3) (another possibility is to take (2) and use it with map but we'll stick with (3).)

Ok. So our combinator will take 2 parameters;

  1. a function that will be applied to the collected results from all parsers,
  2. and an array of parsers to be sequenced.
function apply(func, parsers) {
  return function applyParser(input) {
    const accData = [];
    let currentInput = input;

    for (const parser of parsers) {
      const result = parser(currentInput);
      if (result.isFailure) return result;
      accData.push(result.data);
      currentInput = result.rest;
    }

    return success(func(...accData), currentInput);
  };
}

Our plusExpr parser can now be defined in terms of apply.

const plusExpr = apply((num1, _, num2) => num1 + num2, [
  decimal,
  plus,
  decimal
]);
$ parse(plusExpr, "12+34")
  >> {data: 46, rest: ""}

$ parse(plusExpr, "12+34rest")
  >> {data: 46, rest: "rest"}

Oops! we forgot to take care of the end of input. Never mind, we'll just create a parser for that.

function eof(input) {
  if (input.length === 0) return success(null, input);
  return failure("end of input", input);
}

// fix plusExpr
const plusExpr = apply((num1, _, num2) => num1 + num2, [
  decimal,
  plus,
  decimal,
  eof
]);
$ parse(plusExpr, "12+34rest")
  >> Uncaught Error: Parse error.
		expected end of input.
		instead found 'rest'

Using apply we can define helpers for the other possible results of sequencing.

// Yeah, not the best name I guess
function sequence(...parsers) {
  return apply((...results) => results[results.length - 1], parsers);
}
function collect(...parsers) {
  return apply((...results) => results, parsers);
}
$ parse(
    sequence(text("hello"), text(", "), text("world")),
    "hello, world"
  )
  >> {data: "world", rest: ""}

$ parse(
    collect(text("hello"), text(", "), text("world")),
    "hello, world"
  )
  >> {data: ["hello", ", ", "world"], rest: ""}

Merging parsers

We are going to improve our expression parser by allowing more arithmetic operations.

We need to modify plusExpr so that in its 2nd step it can handle operations in addition to '+'.

Ah, and as usual we need our solution to be general so that we can allow alternative arbitrary parsers, and not just those for simple strings (so you guessed it, a simple regex won't do it.)

You should be used to it by now. We need another parser combinator.

function oneOf(...parsers) {
  return function oneOfParser(input) {
    for (const parser of parsers) {
      const result = parser(input);
      if (result.isFailure) continue;
      return result;
    }
    // We'll see later a way to improve error reporting
    return failure("oneOf", input);
  };
}

We're equipped now to make a better expression parser (and evaluator.)

const opMap = {
  "+": (left, right) => left + right,
  "-": (left, right) => left - right,
  "*": (left, right) => left * right,
  "/": (left, right) => left / right
};

function getOp(op) {
  return opMap[op];
}

const op = map(getOp, oneOf(text("+"), text("-"), text("*"), text("/")));

const decimal = map(x => +x, regex(/\d+(?:\.\d+)?/));

const expr = apply((num1, opFunc, num2) => opFunc(num1, num2), [
  decimal,
  op,
  decimal
]);
$ parse(expr, "12-34")
  >> {data: -22, rest: ""}

$ parse(expr, "12*34")
  >> {data: 408, rest: ""}

Works great, but error reporting could be better.

$ parse(expr, "a12*34")

>> Uncaught Error: Parse error.
		expected /\d+(?:\.\d+)?/.
		instead found 'a12*34'

parse(expr, "12 + 34")
  >> Uncaught Error: Parse error.
		expected oneOf.
		instead found ' + 34'

And we are still not supporting whitespaces.

Proper error reporting for real world parsers includes much more than just printing friendly names for regular expressions or the oneOf parsers. We need to report the precise location (file, line & column) of the error as well as all the alternatives expected at this location (including from deeply nested parsers.)

We will may cover error reporting in more detail in another post. For now our solution will be a simple label helper, which decorates a given parser with a user-friendly message. The implementation has some pitfalls (more precisely we need to fix lookahead) but will suffice for our current needs.

function label(parser, expected) {
  return function labelParser(input) {
    const result = parser(input);
    if (result.isFailure) {
      // replace the parser error with our custom one
      return failure(expected, result.actual);
    }
    return result;
  };
}

const decimal = map(x => +x, label(regex(/\d+(?:\.\d+)?/), "a decimal"));

const expr = apply((num1, opFunc, num2) => opFunc(num1, num2), [
  decimal,
  label(op, "an arithmetic operator"),
  decimal
]);
$ parse(expr, "12 + 34")
  >> Uncaught Error: Parse error.
		expected an arithmetic operator.
		instead found ' + 34'

$ parse(expr, "a12 + 34")
  >> Uncaught Error: Parse error.
		expected a decimal.
		instead found 'a12 + 34'

Our final touch will be to make the parser a little more realistic by skipping whitespaces.

// Lexeme is a function that takes a parser for 'junk' (e.g. whitespaces, comments)
function lexeme(junk) {
  // and returns another function that takes a parser for some meaningful data.
  return function createTokenParser(parser) {
    // The (second) function returns a parser that
    // parses the meaningful  data then skips the junk
    return apply((data, _) => data, [parser, junk]);
  };
}

const spaces = regex(/\s*/);
const token = lexeme(spaces);

// redefine our expression  to skip leading and trailing spaces
const expr = apply((_, num1, opFunc, num2) => opFunc(num1, num2), [
  spaces, // skips leading spaces
  token(decimal),
  token(label(op, "an arithmetic operator")),
  token(decimal), // skips trailing spaces
  eof
]);
$ parse(expr, " 12 + 34 ")
  >> {data: 46, rest: ""}

Yielding parsers

Some of you may know that as the original author of redux-saga I have a soft spot for generators (that some FP folks see as a restricted do notation but whatever.)

Imagine we could use generators to write sequences like expr instead of apply. We could write something like:

const expr = go(function*() {
  yield spaces;
  const num1 = yield token(decimal);
  const opFunc = yield token(label(op, "an arithmetic operator"));
  const num2 = yield token(decimal);
  yield eof;
  return opFunc(num1, num2);
});

The yield statements embed all the machinery of error handling and state passing. We can write our sequences as if we were calling normal functions.

It doesn't take much more to implement go than apply. The only difference is that instead of stepping over an array of parsers, we step over a generator object. The generator yields successive parsers and at the end returns the value that will be returned as the final result of the main parser.

function go(genFunc) {
  return function yieldParser(input) {
    const gen = genFunc();
    let currentInput = input;
    let genResult = gen.next();
    
    // if not done yet, genResult.value is the next parser
    while (!genResult.done) {
      const result = genResult.value(currentInput);
      if (result.isFailure) return result;
      currentInput = result.rest;
      genResult = gen.next(result.data);
    }
    
    // if done, genResult.value is the return value of the parser
    return success(genResult.value, currentInput);
  };
}

The generator definition of expr looks more imperative than the apply based one (aka Applicative definition.) Some people will prefer the first style, others will prefer the second. 'Generator definitions' (aka Monadic definitions) also allow some things that are not possible with Applicative ones. For example, imagine parsing an HTML like syntax where each opening tag must have a corresponding closing tag.

const openBracket = text("<");
const closeBracket = text(">");
const identifier = regex(/[^>]*/);
const whateverContent = regex(/[^<]*/);

const element = go(function*() {
  // parses opening tag
  yield openBracket;
  const tagName = yield identifier;
  yield closeBracket;
  yield whateverContent;
  yield text(`</${tagName}>`);
});

In the last step, the yielded parser is created dynamically. There is no way to know what will be the closing tag before parsing the opening tag. With apply all parsers must be statically passed (known in advance) so we can't have the above kind of definitions.

Generators can also allow some nice recursive definitions. For example, suppose we want to parse some token as many times as possible.

$ parse(many(regex(/\d/)), "123xyz")
  should return >> {data: ["1", "2", "3"], rest: "xyz"}

We can define many using generators like this.

// creates a parser that always succeeds with `value` without consuming any input
function pure(value) {
  return function pureParser(input) {
    return success(value, input);
  };
}

function many(parser) {
  const self = oneOf(
    go(function*() {
      const head = yield parser;
      // 1. keep calling self recursively
      const tail = yield self;
      return [head, ...tail];
    }),
    // 2. until it fails in which case we return an empty array
    pure([])
  );
  return self;
}

Using many we can, for example, parse expressions of an arbitrary length.

const expr = go(function*() {
  yield spaces;
  const num1 = yield token(decimal);
  const rest = yield many(collect(token(label(op, "an arithmetic operator")), decimal));
  yield eof
  return rest.reduce((acc, [opFunc, num]) => opFunc(acc, num), num1)
});
$ parse(expr, '1 + 2 + 3 + 4')
  >> {data: 10, rest: ""}

There is much more

A single post cannot cover parser combinators in detail. For those who want to go further, I made a library pcomb that packages a more comprehensive set of combinators. It's not ready for production but there are already enough features to play with more advanced parsers. Included examples of parsers that illustrate how combinators work.

Here are some things that still need to be covered (may do that in later posts.)

  • Lookahead: For example, our oneOf definition allows for an arbitrary lookahead. It means that even if an alternative consumes an arbitrary amount of input before failing, oneOf will always restart the next alternative from the beginning of the current input.

This is not efficient in practice and doesn't allow for proper error reporting. In practice we may better restrict the lookahead so that oneOf will not try another alternative if the current one has failed while consuming some input. This will also allow for better error reporting since we can propagate exactly what's expected at a specific location.

  • (Proper) Error reporting, this includes reporting the exact location of the failure as well as the expected items at that location while still allowing developers to plug in their own error messages.

  • User state: Parsing complex languages involves state bookkeeping (e.g. "Are we inside a function body?"). This involves allowing a parser to read/write state information. The most simple and composable solution is to write state readers/writers themselves as parsers that can be inserted in a sequence.

  • Refactoring using modular interfaces: abstracts away error handling & state passing into separate interfaces (as done in Haskell with stacks of Monad Transformers.) This provides a more flexible interface, allowing developers to plug in their own implementations.

I hope you enjoyed this post and that you'll have some fun creating your own parsers.

@TracyGJG
Copy link

Hi Yassine,
I first read your excellent article about three years ago but it is only recently I have had opportunity to follow the tutorial step by step. In doing so I have noticed a few typographic and technical errors. I have taken the liberty of forking the Gist myself and making the necessary changes. If you would like to regard this message as a form of pull request, I would be pleased if you decide to use it to update your master. Please let me know, in a comment in my fork, if you do choose to use my version and I will take my Gist down. I have includes the corrections made by sroucheray, thanks.
With my best regards.

@yelouafi
Copy link
Author

Hi @sroucheray and @TracyGJG thanks for the fixes!

I've already updated my gist with content from @TracyGJG

@twome
Copy link

twome commented Mar 3, 2023

like the overall idea of this but starting with a regex makes things more confusing. also using the + operator to cast/coerce to a number in the very first code example is very alarming

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment