@cmazakas
Last active February 22, 2024 21:07
Review of Boost.Parser

I've been a huge fan of Spirit for many years now.

One of my favorite Spirit memories was during two separate job interviews.

The first time, I was interviewing with a game company, and on my resume I had noted that I'd written a simple HTTP server that performed geospatial queries, returning a list of cities within a given radius of a provided zip code. Here's the link.

I remember that I had used Spirit to parse a CSV, and the interviewer was genuinely concerned about my using such a library for such a seemingly small task. It became the main focus of the interview and, I think, represented a cultural divide. I did not get the job.

The second time, I used Spirit as a litmus test for what writing C++ at the company would be like. They saw me use it and hired me anyway, which I took to be a great sign. This time, I got the job.

Boost.Parser

Cutting to the chase, I vote we ACCEPT Parser.

To test the library, I've been using a small test project here: https://github.com/cmazakas/parser-review

My criterion for acceptance was that I'd be able to write an HTTP/1 parser with feature parity to what is currently being done in the proposed Boost.Http.Proto library. I feel that I achieved this goal, and I stopped there because I no longer felt that the library itself would be a blocker.

Now, even though I'm voting ACCEPT, I'm going to spend most of the review criticizing the library. To me, the value-add of a type-safe parser combinator library is obvious, so I won't spend much time on it: testable, reusable parsers are good, and even better when they compose elegantly.

The Feedback

The differences between Spirit.X3 and Parser need to be made much clearer by the documentation. By that, I mean claims like these:

  • Spirit X3 has rules that do not compose well — the attributes produced by a rule can change depending on the context in which you use the rule.
  • Spirit X3 is missing many of the convenient interfaces to parsers that Spirit 2 had. For instance, you cannot add parameters to a parser.
  • All versions of Spirit have Unicode support, but it is quite difficult to get working.

These need code samples showing the limitations of X3 and how Parser actually solves them. I think many readers, myself included, aren't entirely clear on what these differences are, how Parser addresses them, and why that is a good thing.

I have attempted X3's Unicode support before, and I will say that Parser's unified interface is much better. For example, I wrote an entire RFC-compliant URL parser in X3, including Unicode support, for my library Foxy.


This isn't a criticism, but Parser's directives are a dramatic improvement over X3's. I'm thinking of boost::parser::string_view[], boost::parser::separate[], and boost::parser::merge[]. These should be highlighted in the comparison vs X3, as I think they add a good bit of substantive value.

More praise for the library: its parse API is a dramatic improvement over X3's in terms of ergonomics. Personally, I can see myself mainly using prefix_parse, similarly to X3's original parse(), but I think the value of parse() itself is obvious and doesn't warrant much explanation.
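Roughly, the usage difference looks like this (a sketch based on my reading of the docs; the exact overload set may differ):

#include <boost/parser/parser.hpp>
#include <iostream>
#include <string>

namespace parser = boost::parser;

int main() {
  std::string input = "42 trailing input";
  auto first = input.begin();

  // prefix_parse() consumes what it can and advances `first`, much like
  // X3's parse(); parse() on a whole range fails unless everything matches.
  auto result =
      parser::prefix_parse(first, input.end(), parser::int_, parser::ws);
  if (result) {
    std::cout << "parsed: " << *result << "\n";
    std::cout << "rest:   \"" << std::string(first, input.end()) << "\"\n";
  }
}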


The library generates a dramatic number of warnings when compiled with -Wall -Wextra -pedantic. I had originally forgotten to include_directories() Parser as a SYSTEM directory and was met with 172 warnings.

I think being clean under -Wall -Wextra -Werror is a reasonable criterion for acceptance.


When writing the parser for the HTTP version line, I ran into a lot of issues separating the major version from the minor version when using:

parser::digit >> "." >> parser::digit

It was only when I was lazily browsing the docs that I found the separate[] directive, which I think indicates that the library's documentation has poor discoverability.
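For reference, my understanding of what separate[] does here (a sketch of my reading of the docs, not code from my project):

namespace parser = boost::parser;

// without separate[], the adjacent character attributes merge into one
// std::string
auto const merged = parser::digit >> "." >> parser::digit;

// with separate[], the digits stay apart, so you should get a two-element
// tuple instead of a merged string
auto const split = parser::separate[parser::digit >> "." >> parser::digit];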

Ultimately, it was cleaner for me to just do:

// HTTP-version = HTTP-name "/" DIGIT "." DIGIT
// HTTP-name = %x48.54.54.50 ; HTTP
auto const on_version = [](auto &ctx) {
  auto const &tup = _attr(ctx);
  auto const major = tup[0_c];
  auto const minor = tup[1_c];
  // uint_ happily parses multi-digit numbers, but the grammar allows only a
  // single DIGIT on each side of the ".", so reject anything greater than 9
  if (major > 9 || minor > 9) {
    _pass(ctx) = false;
    return;
  }

  response::metadata &md = _globals(ctx).md_;
  md.version_.major_ = major;
  md.version_.minor_ = minor;
};

auto const http_name = parser::lit("HTTP");
auto const http_version =
    (http_name >> "/" >> parser::uint_ >> "." >> parser::uint_)[on_version];

The documentation doesn't really give users any guidance about when they should formally define rules with attributes vs. when they should use semantic actions.

This is one of those things where if users don't already have the intuition built up, using the library is an exercise in frustration.

Namely, attributes don't compose well once you start nesting rules and want different out-params at different levels of the parser. I suppose this should be obvious: the rules themselves form a hierarchy, so the attributes must mirror it.

I found that the best thing to do for my use-case was to abuse _globals(ctx) and semantic actions.

I think a good rule of thumb for users can be expressed as something like: use rule attributes for simple grammars where all you really want is, say, a std::vector<double>; for something complex with domain-defined validation rules, prefer _globals(ctx) and semantic actions to validate everything.
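In code, the simple end of that spectrum is just letting the attribute machinery hand you the container (a minimal sketch using the docs' list-parser idiom):

#include <boost/parser/parser.hpp>
#include <optional>
#include <string>
#include <vector>

namespace parser = boost::parser;

int main() {
  // simple grammar: the synthesized attribute is the whole result
  std::string const input = "1.5,2.5,3.5";
  auto const doubles = parser::double_ % ',';
  std::optional<std::vector<double>> result =
      parser::parse(input, doubles, parser::ws);
  return result && result->size() == 3 ? 0 : 1;
}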

In general, Parser's semantic actions are a huge improvement over Spirit's.


I felt like I came up with working but unsatisfactory solutions for parsing characters in a specific range.

Namely, I just wrote a simple semantic action around +parser::char_[f], but this made me feel like I was writing unidiomatic code. I'd appreciate some feedback here:

// reason-phrase = 1*( HTAB / SP / VCHAR / obs-text )
// VCHAR         = %x21-7E
auto const on_reason_phrase = [](auto &ctx) {
  auto ch = _attr(ctx);
  if (ch != '\t' && ch != ' ' && (ch < '\x21' || ch > '\x7e')) {
    _pass(ctx) = false;
    return;
  }

  // record the bounds of the reason phrase using the match position
  // reported by _where()
  response::metadata &md = _globals(ctx).md_;
  std::ranges::subrange<char const *> sub = _where(ctx);
  if (!md.reason_phrase_begin_) {
    md.reason_phrase_begin_ = sub.begin();
  }
  md.reason_phrase_end_ = sub.end();
  md.status_line_end_ = sub.end();
};

auto const reason_phrase = parser::omit[+parser::char_[on_reason_phrase]];
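One alternative I'd consider (an untested sketch; it relies on char_ accepting a character range, per the docs) is pushing the character-class check into the parser itself, so the semantic action only has to do the bookkeeping:

namespace parser = boost::parser;

// reason-phrase characters expressed as parsers instead of an if-chain:
// HTAB / SP / VCHAR, where VCHAR = %x21-7E is a plain character range
auto const reason_char =
    parser::char_('\t') | parser::char_(' ') | parser::char_('\x21', '\x7e');

auto const reason_phrase_alt = parser::omit[+reason_char[on_reason_phrase]];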

There should be a debugging utility for discovering the synthesized attribute of a rule. Perhaps this could exist as a semantic action that attempts to instantiate an incomplete template:

template<class> struct diagnostic;

auto const debug = [](auto &ctx) { diagnostic<decltype(_attr(ctx))> d; };

It took me a long time to realize what the attribute type for digit >> "." >> digit was (it was std::string).
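Usage would just be attaching the action to the parser in question; the compile error for the incomplete diagnostic instantiation then spells out the attribute type:

// hypothetical usage: compilation fails, and the error names
// decltype(_attr(ctx)) for this parser (std::string in this case)
auto const probed = (parser::digit >> "." >> parser::digit)[debug];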


From the docs: "Dependencies are still a nightmare in C++, so Boost.Parser can be used as a purely standalone library, independent of Boost."

This was funny for me to read because I had a vcpkg port working for my test project in like two minutes.


The library seems to really emphasize Boost.Hana's tuple type, but I think in practice this winds up being a mistake unless there are substantive compile-time savings from using Hana here.

Namely, Hana's tuple doesn't support structured bindings. Because Parser is a C++17 library, I consider this a somewhat large defect of the interface. What's worse, Hana's literals aren't compatible with Parser's.

A huge boon for Hana's tuple is that it supports indexing via tup[0_c], but this doesn't work when Parser's literals are involved: Parser also defines a _c, so an ambiguity is introduced and compilation fails.

Between this and the lack of structured-binding support, I think relying on Hana here is a strict downgrade unless there are compelling reasons I'm simply unaware of.

https://github.com/boostorg/hana/milestone/4
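For illustration, here is the structured-bindings gap in self-contained form (a sketch; the _c collision additionally requires both libraries' literal namespaces to be in scope):

#include <boost/hana/tuple.hpp>
#include <tuple>

int main() {
  std::tuple<int, double> st{1, 2.0};
  auto [a, b] = st; // fine: std::tuple supports structured bindings

  boost::hana::tuple<int, double> ht{1, 2.0};
  // auto [c, d] = ht; // error: hana::tuple does not support them
  (void)a; (void)b; (void)ht;
}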


Links like _globals in the docs would 404 for me. I also think the docs are kind of just incomplete in general, and because there's not much cross-referencing, discoverability is hurt: a user essentially has to read the entire docs front-to-back to understand the library, which is a large ask for a parsing library.

Conclusion

Overall, I think that even with all of the problems I found, the library is still quite good and worthy of being brought into Boost as the successor to Spirit. I don't think anything I found was a deal-breaker, and within a few more iterations I think this library can be something quite great.
