Skip to content

Instantly share code, notes, and snippets.

@cmazakas
Last active December 12, 2024 14:32
Show Gist options
  • Save cmazakas/f363187ef45a03c55661fc2324a6df36 to your computer and use it in GitHub Desktop.
Save cmazakas/f363187ef45a03c55661fc2324a6df36 to your computer and use it in GitHub Desktop.
Review of Boost.Parser

I've been a huge fan of Spirit for many years now.

One of my favorite Spirit memories was during two separate job interviews.

The first time, I was interviewing with a game company and on my resume, I noted that I had written a simple HTTP server that performed simple geospatial querying, creating a list of cities within a radius of a provided zip code. Here's the link.

I remember that I had used Spirit for parsing a CSV and the interviewer was genuinely concerned about me using such a library for such a seemingly small task. It was the main focus of the interview and I think represented a cultural divide. I did not get the job.

The second time, I used it as a litmus test for me writing C++ at the company. They saw me use Spirit and still hired me anyway which I took to be a great sign. This time, I got the job.

Boost.Parser

Cutting to the chase, I vote we ACCEPT Parser.

To test the library, I've been using a small test project here: https://github.com/cmazakas/parser-review

My criteria for acceptance was that I'd be able to write an HTTP/1 parser with feature parity to what is currently being done in the proposed Boost.Http.Proto library. I feel that I was able to achieve this goal and I stopped proceeding because I no longer felt like the library itself would be a blocker.

Now, even though I'm voting ACCEPT I'm going to spend most of the review criticizing the library. To me, the value-add of a type-safe parser combinator library is obvious so I won't spend too much time here. Basically, testable, reusable parsers are good and even more good when they elegantly compose as well.

The Feedback

The differences between Spirit.X3 and Parser need to be made much more clear by the documentation, by which I mean:

  • Spirit X3 has rules that do not compose well — the attributes produced by a rule can change depending on the context in which you use the rule.
  • Spirit X3 is missing many of the convenient interfaces to parsers that Spirit 2 had. For instance, you cannot add parameters to a parser.
  • All versions of Spirit have Unicode support, but it is quite difficult to get working.

These need code samples showing the limitations of X3 and how Parser actually solves them. I think many, myself included, aren't overwhelmingly clear on what these differences are and how Parser solves them and why this is a good thing.

Though I have attempted X3's Unicode before and I will say, Parser's unified interface is much better. For example, I wrote an entire RFC-compliant URL parser in X3, including Unicode support, for my library Foxy.


Not a criticism but Parser's directives are a dramatic improvement over X3. By these, I'm thinking of boost::parser::string_view[], boost::parser::separate[] and boost::parser::merge[]. These should be highlighted in the comparison vs X3 as I think they add a good bit of substantive value.

More praise for the library includes its parse API which is a dramatic improvement over X3 in terms of ergonomics. Personally, I can see myself main using prefix_parse, similarly to X3's original parse() but I think the value of parse() itself is obvious and doesn't warrant much explanation.


The library generates a dramatic amount of warnings when compiled with -Wall -Wextra -pedantic. I had originally forgotten to include_directorie() Parser as a SYSTEM directory and was met with 172 warnings.

I think being clean under -Wall -Wextra -Werror is reasonable criteria for acceptance.


When writing the parser for the HTTP version line, I ran into a lot of issues separating the major version from the minor version when using:

parser::digit >> "." >> parser::digit

It was only when I was lazily browsing the docs that I found the separate[] directive which I think indicates the library's documentation has poor discoverability.

Ultimately, it was cleaner for me to just do:

// HTTP-version = HTTP-name "/" DIGIT "." DIGIT
// HTTP-name = %x48.54.54.50 ; HTTP
auto const on_version = [](auto &ctx) {
  auto const &tup = _attr(ctx);
  auto const major = tup[0_c];
  auto const minor = tup[1_c];
  if (major > 9 || minor > 9) {
    _pass(ctx) = false;
    return;
  }

  response::metadata &md = _globals(ctx).md_;
  md.version_.major_ = major;
  md.version_.minor_ = minor;
};

auto const http_name = parser::lit("HTTP");
auto const http_version =
    (http_name >> "/" >> parser::uint_ >> "." >> parser::uint_)[on_version];

The documentation doesn't exactly give users any guidelines about when they should formally define rules with attributes vs semantic actions.

This is one of those things where if users don't already have the intuition built up, using the library is an exercise in frustration.

Namely, attributes don't compose well when you start nesting rules and want to have different out params at different levels of the parsers. I suppose this should be obvious because the rules themselves form a hierarchy so the attributes must mirror that.

I found that the best thing to do for my use-case was to abuse _globals(ctx) and semantic actions.

I think a good rule of thumb for users can be expressed as something like: use rule attributes for simple grammars where all you really want is just the std::vector<double>. For something complex with domain-defined validation rules, prefer _globals(ctx) and semantic actions for validating everything.

In general, Parser's semantic actions are a huge improvement over Spirit's.


I felt like I came up with working but unsatisfactory solutions to parsing characters in a specific range.

Namely, I just wrote a simple semantic action around +parser::char_[f] but this made me feel like I was writing unidiomatic code. I'd appreciate some feedback here:

// reason-phrase = 1*( HTAB / SP / VCHAR / obs-text )
// VCHAR         = %x21-7E
auto const on_reason_phrase = [](auto &ctx) {
  auto ch = _attr(ctx);
  if (ch != '\t' && ch != ' ' && (ch < '\x21' || ch > '\x7e')) {
    _pass(ctx) = false;
    return;
  }

  response::metadata &md = _globals(ctx).md_;
  std::ranges::subrange<char const *> sub = _where(ctx);
  if (!md.reason_phrase_begin_) {
    md.reason_phrase_begin_ = sub.begin();
  }
  md.reason_phrase_end_ = sub.end();
  md.status_line_end_ = sub.end();
  ;
};

auto const reason_phrase = parser::omit[+parser::char_[on_reason_phrase]];

There should be a debugging utility for discovering what the synthesized attribute is for a rule. Perhaps this could exist as a semantic action that attempts to instantiate an incomplete template.

template<class> struct diagnostic;

auto const debug=[](auto& ctx){diagnostic<decltype(_attr(ctx))> d;};

It took me a long time to realize what the attribute type for digit >> "." >> digit was (it was std::string).


Dependencies are still a nightmare in C++, so Boost.Parser can be used as a purely standalone library, independent of Boost.

This was funny for me to read because I had a working vcpkg port working for my test project in like 2 minutes.


The library seems to really emphasize Boost.Hana's tuple type but I think in practice, this winds up being a mistake unless there's substantive compile-time savings by using Hana here.

Namely, Hana's tuple doesn't support structured bindings. Because Parser is a C++17 library, I consider this a somewhat large defect of the interface. What's worse, Hana's literals aren't compatible with Parser's.

A huge boon for Hana and its tuple is that it supports indexing via tup[0_c] but this doesn't work when Parser's literal are involved as it also defines a _c so an ambiguity is introduced and compilation fails.

Between this and the lack of structured binding support, I think relying on Hana here is a strict downgrade unless there's compelling reasons I'm simply unaware of.

https://github.com/boostorg/hana/milestone/4


Stuff like _globals in the docs would 404 for me. I also think the docs are kind of just incomplete in general and because there's not much cross-referencing, discoverability is hurt so a user has to essentially read the entire docs front-to-back to understand the library which is a large ask for a parsing library.

Conclusion

Overall, I think that even with all of the problems I found in the library, it's still quite good and worthy of being brought into Boost as the successor for Spirit. I don't think anything I found was a deal-breaker and I think in a few iterations of Boost, this library can be something quite great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment