Skip to content

Instantly share code, notes, and snippets.

@sma
Last active March 3, 2024 13:00
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save sma/2ac4c74fe990678caf56f916fcfba86c to your computer and use it in GitHub Desktop.
Save sma/2ac4c74fe990678caf56f916fcfba86c to your computer and use it in GitHub Desktop.
A toy parser combinator using Dart 3

Parser Combinator

I wrote a toy parser combinator using Dart 3.

[Warning, the current Dart compiler crashes when running this!]

To enable inline classes, add these lines to your analysis_options.yaml file:

enable-experiment:
  - inline-class

A parser is a function that takes some input and returns a tuple of some result representing the structure of the parsed input together with the remaining input. A combinator is a higher-order function that takes parsers as input and returns a new parser that somehow combines their behaviors.

Types

Parser

Here is my definition of a Parser<T>:

typedef Parser<T> = (Result<T>, Input) Function(Input);

Result

Let's represent the Result<T> as a sealed class cluster:

sealed class Result<T> {}

final class Success<T> extends Result<T> {
  Success(this.value);
  final T value;
}

final class Failure<T> extends Result<T> {
  Failure(this.error);
  final String error;
}

sealed is a new keyword for abstract classes where all subclasses are known at package level. Also, a final class is a class that must not be subclassed. This way, we can switch over all alternatives.

Input

I want to use an inline class for Input in order to distinguish it from ordinary strings and to gain some additional type safety:

inline class Input {
  Input(this.source);
  final String source;
}

Parsers

succeed

The most trivial parser is one that always succeeds:

Parser<T> succeed<T>(T value) {
  return (input) => (Success(value), input);
}

fail

Another trivial parser is one that always fails:

Parser<T> fail<T>(String error) {
  return (input) => (Failure(error), input);
}

consume

Here's a parser that matches and consumes a given string if it is at the beginning of the input:

Parser<void> consume(String s) {
  return (input) {
    if (input.startsWith(s)) {
      return (Success(null), input.consume(s.length));
    }
    return (Failure('expected $s'), input);
  };
}

To make this work, we add two methods to Input:

inline class Input {
  ...
  bool startsWith(Pattern pattern) => source.startsWith(pattern);
  Input consume(int length) => Input(source.substring(length));
}

Example

Here's a parser that accepts foo:

final fooParser = consume('foo');

To use this parser on some input, we add an extension to Parser<T>:

extension ParserExtensions<T> on Parser<T> {
  Result<T> parse(String source) {
    final (result, _) = this(Input(source));
    return result;
  }
}

This should return a Success:

fooParser.parse('foobar');

This should return a Failure('expected foo'):

fooParser.parse('baz');

To combine parsers, let's assume we have defined parsers that accept the 10 digits 0 to 9. The oneOf combinator will then create a parser that accepts any one digit. For reasons I don't understand, Dart fails to infer oneOf's type. If you first create that array of parsers as a variable and then initialize oneOf with that variable, it works as expected. Bummer.

final zeroParser = consume('0');
final oneParser = consume('1');
final digitParser = oneOf<void>([zeroParser, oneParser]);

oneOf

Here's the oneOf implementation, which is a bit more evolved than simply trying out all parsers because I want to collect all error messages and combine them into a single error message.

Parser<T> oneOf<T>(List<Parser<T>> parsers) {
  return (input) {
    final errors = <String>[];
    for (final parser in parsers) {
      final (result, next) = parser(input);
      if (result is Success<T>) {
        return (result, next);
      } else if (result is Failure<T>) {
        errors.add(result.error);
      }
    }
    return (Failure(errors.join(' or ')), input);
  };
}

BTW, either I didn't understand sealed correctly or that feature isn't implemented yet. I thought that because the compiler knows that Success and Failure are the only subclasses of Result, a simple else would have been enough. Let's use switch instead:

Parser<T> oneOf<T>(List<Parser<T>> parsers) {
  return (input) {
    final errors = <String>[];
    for (final parser in parsers) {
      switch (parser(input)) {
        case (:Result<T> result, :Input next):
          return (result, next);
        case (Failure(:var error), _):
          errors.add(error);
      }
    }
    return (Failure(errors.join(' or ')), input);
  };
}

Example, cont.

This should return a Failure('expected 0 or expected 1'):

digitParser.parse('aaa');

And this should work:

digitParser.parse('101');

many

To parse many digits at once, a many combinator is useful:

Parser<List<T>> many<T>(Parser<T> parser) {
  return (input) {
    final results = <T>[];
    while (true) {
      switch (parser(input)) {
        case (Success(:T value), :Input next):
          results.add(value);
          input = next;
        default:
          return (Success(results), input);
      }
    }
  };
}

Example, cont.

Then it's possible to create a parser for zero or more digits like so:

final digitsParser = many(digitParser);
digitsParser.parse('1001');

map & value

Because consume returns void, I do not get a value. This is by design. Let's add a map method, that transforms its receiver into another parser that will transform a successful result to something more meaningful.

extension ParserExtensions<T> on Parser<T> {
  ...
  Parser<U> map<U>(U Function(T) transform) {
    return (input) => switch (this(input)) {
          (Success<T>(:var value), :Input next) => (Success(transform(value)), next),
          (Failure<T>(:var error), :Input next) => (Failure<U>(error), next),
        };
  }

  Parser<U> value<U>(U value) => map((_) => value);
}

I've also added a trivial application of map called value which ignores the given value and returns a given constant.

Example, cont.

This is handy to provide values to all digit parsers:

final zeroParser = consume('0').value(0);
final oneParser = consume('1').value(1);
final digitParser = oneOf<int>([zeroParser, oneParser]);

The digits parser would now return a list of zeros and ones (as I omitted all other digits as before, you can add them). I can reduce that list to a single value:

final digitsParser = many(digitParser).map((values) => //
    values.fold(0, (value, digit) => value * 2 + digit));

This should print 9:

print(digitsParser.parse('1001'));

many1

Right now, zero digits are a valid result. Here's a many1 combinator that will call a given parser at least once. Instead of implementing the anonymous parser function myself, I will combine it from more primitive parsers, using a new andThen parser (which is sometimes called flatMap):

Parser<List<T>> many1<T>(Parser<T> parser) {
  return parser.andThen((result) => switch (result) {
        Success<T>(:var value) => many(parser).map((values) => [value] +    values),
        Failure<T>(:var error) => fail(error),
      });
}

Again, I struggle with the switch exhaustiveness.

andThen

Here is the implementation of andThen (aka flatMap):

extension ParserExtensions<T> on Parser<T> {
  Parser<U> andThen<U>(Parser<U> Function(Result<T> result) continuation) {
    return (input) {
      final (result, next) = this(input);
      return continuation(result)(next);
    };
  }
}

and

Most if not all combinators could be implemented using this function, BTW, but lets not go there. Instead, I will just demonstrate sequencing parsers.

Unfortunately, I cannot call this parser &, because operator declarations don't support generic types in Dart. So and it is, and it will take another parser and apply both sequentially, returning a tuple of the result values.

extension ParserExtensions<T> on Parser<T> {
  ...
  Parser<(T, U)> and<U>(Parser<U> parser) {
    return andThen((result) => switch (result) {
      Success<T>(:var value) => parser.map((value2) => (value, value2)),
      Failure<T>(:var error) => fail(error),
    });
  }
}

and1, and2, and0

Often, I don't need both values. Therefore, I create three more parsers which return only the first value, only the second value or no value at all:

extension ParserExtensions<T> on Parser<T> {
  ...
  Parser<T> and1<U>(Parser<U> parser) => and(parser).map((tuple) => tuple.$1);

  Parser<U> and2<U>(Parser<U> parser) => and(parser).map((tuple) => tuple.$2);

  Parser<void> and0<U>(Parser<U> parser) => and(parser).value(null);
}

or

Also, let's define an "or" parser, this time as an | operator, because I can:

extension ParserExtensions<T> on Parser<T> {
  ...
  Parser<T> operator |(Parser<T> parser) {
    return andThen((result) => switch (result) {
      Success<T>(:var value) => succeed(value),
      Failure<T>() => parser,
    });
  }
}

Parsing examples

I can now parse a negative number like so:

final positive = many1(digitParser);
final negative = consume('-').and2(positive).map((v) => -v);
final integer = positive | negative;

Parsing a comma separated list of integers? Here you go:

Parser<List<T>> Function(Parser<T> p) sepBy1<T>(Parser<dynamic> sep) {
  return (p) => p.and(many(sep.and2(p))).map((t) => [t.$1] + t.$2);
}

final comma = consume(',');
final listOfInt = sepBy1<int>(comma)(integer);

(Dart fails to infer the parser's type, it really needs to become smarter)

Parsing that list within parenthesis? This time I will uncurry the parser:

Parser<T> between<T>(Parser begin, Parser end, Parser<T> p) {
  return begin.and2(p.and1(end));
}

final parenthesizedList = between(consume('('), consume(')'), listOfInt);

Once the foundation is laid down, increasingly more complex parsers become very easy to construct. And Dart's new tuple and pattern matching features make writing code quite elegant.

consumeIf

Let's go one step back and create a consumeIf parser so it's easier to parse different digits without the need to combine trivial digit parsers:

Parser<void> consumeIf(bool Function(int ch) test) {
  return (input) {
    return switch (input.consumeIf(test)) {
      Input next => (Success(null), next),
      _ => (Failure('no match'), input),
    };
  };
}

We need this new method in Input:

inline class Input {
  ...
  Input? consumeIf(bool Function(int ch) test) {
    return (source.isNotEmpty && test(source.codeUnitAt(0))) ? consume(1) : null;
  }
}

Let's assume there is a one-to-one mapping between characters (grapheme clusters) and code points (which is wrong), we could create a consumeDigit-Parser like so:

final consumeDigit = consumeIf((ch) => ch >= 48 && ch < 58);

We could also create primitive parsers like consumeWhile or consumeSome or consumeRange or whatever you like as foundation building block. If you care for speed, allowing regular expressions would also be an option.

Here's a typical definition of an identifier that starts with a (ASCII) letter which is followed by (ASCII) letters or digits.

final consumeLetter = consumeIf((ch) => ch >= 65 && ch < 91 || ch >= 97 && ch < 123);
final consumeIdentifier = consumeLetter.and0(many(consumeLetter | consumeDigit));

get

To get parsed string, we use a get parser like so:

extension ParserExtensions<T> on Parser<T> {
  ...
  Parser<String> get get {
    return (input) => switch (this(input)) {
      (Success<T>(), :Input next) => (Success(input.without(next)), next),
      (Failure<T>(:var error), :Input next) => (Failure(error), next),
      _ => throw Error(), // why?!
    };
  }
}

And then we can use get to create an identifer parser:

final identifier = consumeIdentifier.get;

Here's how to parse a floating point value:

final nothing = succeed<void>(null);
final number = digits.and0(consume('.').and0(digits) | nothing).get.map(double.parse);

opt

Alternatively, define an opt parser for parsing something or nothing.

Parser<T?> opt<T>(Parser<T> p) => p.map<T?>((v) => v) | succeed(null);
final number2 = opt(consume('-')).and(digits.and(opt(consume('.').and(digits)))).get.map(double.parse);

Also notice, that double.parse might throw an exception and because no parser does so, we should implement a slightly more sophisticated map method that catches exceptions and returns them as Failure.

However, implementing this feature is left as an exercise for the reader.

@RandalSchwartz
Copy link

Seems like petitparser is gonna get a whole lot simpler. :)

@csells
Copy link

csells commented Mar 18, 2023

This is freakin' fantastic!

@csells
Copy link

csells commented Mar 18, 2023

Have you logged your feedback for the Dart team to take a look at?

Fyi @munificent

@munificent
Copy link

This is super cool!

BTW, either I didn't understand sealed correctly or that feature isn't implemented yet. I thought that because the compiler knows that Success and Failure are the only subclasses of Result, a simple else would have been enough.

Exhaustiveness checking only applies to switches, not to chains of if-elses. We discussed applying it to if-else chains too, but it gets kind of weird in some scenarios and we felt it was easier for users to understand if we tried to keep the type promotion user experience (which kicks in on if statements) somewhat separate from the exhaustiveness checking (which applies to switches).

But I had to add the is Failure test to make the typechecker happy. I could have used switch instead, but that asks for an unwanted default case.

Exhaustiveness checking isn't fully implemented in the current beta, but the nexta beta should be smarter. Once it's working as intended, you should be able to write:

Parser<T> oneOf<T>(List<Parser<T>> parsers) {
  return (input) {
    final errors = <String>[];
    for (final parser in parsers) {
      switch (parser(input)) {
        case (Success result, var next): return (result, next);
        case (Failure failure, _): errors.add(failure.error);
      }
    }
    return (Failure(errors.join(' or ')), input);
  };
}

Or, if you're feeling clever:

Parser<T> oneOf<T>(List<Parser<T>> parsers) {
  return (input) {
    final errors = <String>[];
    for (final parser in parsers) {
      switch (parser(input)) {
        case (Success, Input) result: return result;
        case (Failure(:var error), _): errors.add(error);
      }
    }
    return (Failure(errors.join(' or ')), input);
  };
}

@sma
Copy link
Author

sma commented Mar 18, 2023

Have you logged your feedback for the Dart team to take a look at?

I think, I get this error so I didn't attempt to file my own report.

@sma
Copy link
Author

sma commented Mar 18, 2023

This is super cool!

Thanks.

Exhaustiveness checking isn't fully implemented in the current beta, but the nexta beta should be smarter.

I'll wait patiently for the next beta then, because I just wrote a parser (which can't handle whitespace) for a Smalltalk-like language for fun to have a larger example which I want to try out then :-)

I had to introduce a lazy Forward parser I haven't discussed yet, because I didn't want to deal with a Y-combinator.

class Forward<T> {
  late final Parser<T> parser;
  (Result<T>, Input) call(Input input) => parser(input);
}

Parser<AST> get smatalk {
  final exp = Forward<AST>();

  final nil = consume("nil").value<AST>(Nil());
  final tru = consume("true").value(Tru());
  final fal = consume("false").value(Fal());
  final digit = consumeIf((ch) => ch >= 48 && ch < 58);
  final digits = many1(digit);
  final num = opt(consume('-'))
      .and(digits.and(opt(consume('.').and(digits))))
      .get //
      .map(double.parse)
      .map<AST>(Num.new);
  final quote = consume('"');
  final notQuote = many(consumeIf((ch) => ch != 34));
  final str = between(quote, quote, notQuote.get).map<AST>(Str.new);
  final literal = nil | tru | fal | num | str;

  final letter = consumeIf((ch) => ch >= 65 && ch < 91 || ch >= 97 && ch < 123);
  final name = letter.and(many(letter | digit)).get;
  final operator = many1(consume('+')).get; // TODO
  final keyword = name.and(consume(':')).get;

  final variable = name.get.map<AST>(Var.new);

  final param = consume(':').and2(name);
  final params = opt(many1(param).and1(consume('|'))).map((pars) => pars ?? []);
  final ret = (consume('^').and2(exp).map<AST>(Ret.new) | succeed(Ret(Var('self')))).map((ast) => [ast]);
  final body = (exp as Parser<AST>)
          .and(many(consume('.').and2(exp))) //
          .map((t) => [t.$1] + t.$2)
          .and(ret)
          .map((t) => t.$1 + t.$2) |
      ret;
  final blk = between(consume('['), consume(']'), params.and(body).map((t) => Blk(t.$1, t.$2)));
  final nested = between(consume('('), consume(')'), exp);

  final prim = literal | variable | blk | nested;

  AST u((AST, Iterable<String>) t) {
    if (t.$2.isEmpty) return t.$1;
    return u((Snd(t.$2.first, [t.$1]), t.$2.skip(1)));
  }

  AST b((AST, Iterable<(String, AST)>) t) {
    if (t.$2.isEmpty) return t.$1;
    return b((Snd(t.$2.first.$1, [t.$1, t.$2.first.$2]), t.$2.skip(1)));
  }

  final unexp = prim.and(many(name)).map(u);
  final biexp = unexp.and(many(operator.and(unexp))).map(b);
  final kwexp = biexp.and(many(keyword.and(biexp))).map((t) {
    return Snd(t.$2.map((t) => t.$1).join(), [t.$1, ...t.$2.map((t) => t.$2)]);
  });

  final assign = keyword.get.and(exp).map<AST>((t) => Asn(t.$1, t.$2));

  return exp.parser = assign | kwexp;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment