@raiph · Last active March 24, 2023 16:13
Comparing/contrasting Raku Grammars and PEGs

Quoting the Wikipedia page Raku Rules:

Raku rules are the regular expression, string matching and general-purpose parsing facility of the Raku programming language, and are a core part of the language.

A Raku Grammar is a collection of rules that declares a parser.
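For readers new to the syntax, here is a minimal illustrative sketch (the grammar and rule names are invented for this example):

```raku
# A tiny grammar that parses comma-separated integers
grammar IntegerList {
    token TOP    { <number>+ % ',' }   # one or more numbers, comma separated
    token number { \d+ }
}

say IntegerList.parse('1,22,333');     # prints the match tree on success
```

Each `token` is a rule; `TOP` is the rule `parse` invokes by default.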

Quoting the Wikipedia page Parsing expression grammar:

In computer science, a parsing expression grammar (PEG), is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language.

This article compares/contrasts Raku Grammars with PEGs.

Analytic vs generative grammars

There are various ways to categorize grammars. Let's start with the two big, mutually exclusive categories, so it's clear that both PEGs and Raku Grammars are in one category, not the other:

PEGs and Raku grammars are analytic grammars, not generative grammars.[1]


CS folk categorize PEGs as formal analytic grammars. This is because they are just grammars; they are amenable to guarantees regarding termination, both of static analysis of a grammar and of its subsequent execution; and the nature and role of any "host" GPL is left out of the academic categorization equation.

In contrast Raku Grammars are expressed using Raku rules which are part of the Raku language which is "a braid of sub-languages" -- "a GPL, plus DSLs for strings, regexes, embedded documentation, and so on, that mutually embed each other".

So, while Raku Grammars contain subsets that are amenable to guarantees regarding termination/execution of static analysis of a grammar and its subsequent execution (thus, for example, some constructs compile to NFAs), the overall categorization is necessarily "Turing complete". So CS folk say things like "Raku is a fascinating experiment, but it's a deliberately unconstrained formalism (if it really can be described as a formalism)".

PEGs vs Raku Grammars

This section quotes the entire Introduction section of the Wikipedia page on PEGs (as it was in June 2021), interspersing each sentence about PEGs with my discussion of aspects of Raku Grammars I know to be similar and/or different:

PEGs: In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language.

Raku Grammars: Some folk describe Raku Grammars as PEGs. This helps place Raku Grammars in the right category (analytic, not generative). Plus both Raku Grammars (typically) and PEGs (always) directly express a recursive descent parser. Also, they look superficially similar.[3] However, putting aside what they share in broad terms re grammar category, parsing strategy, and appearance, PEGs and Raku Grammars are very different in their underlying theory, practice, strengths, and weaknesses. In particular, Raku Grammars directly support Turing complete power, and provide a much wider range of features than PEGs.[4]

PEGs: The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s.

Raku Grammars: The first public document I've found specifying Raku Grammars is Apocalypse 5, from 2002. The first prototype appeared in 2004. While supporting arbitrary parsing techniques, Raku Grammars are naturally related to, and make it especially easy to express, the same parsing strategy category as PEGs: top-down recursive descent.

PEGs: Syntactically, PEGs also look similar to context-free grammars (CFGs)

Raku grammars: Like all the DSLs that make up Raku, the standard grammar DSL embeds within it all the other DSLs. So, while many Raku grammars look just like CFGs, writers can choose to weave in arbitrary GPL programming techniques, either using the same EBNF looking formalism, or just inserting ordinary looking GPL code.
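As a hedged sketch of that weaving (the grammar and rule names are invented), a rule can drop into ordinary GPL code mid-parse:

```raku
grammar Config {
    token TOP   { <pair>+ % \n }
    # The { ... } block is ordinary Raku code, run as the rule matches
    token pair  { <key> '=' <value> { note "parsed key: $<key>" } }
    token key   { \w+ }
    token value { \S+ }
}

Config.parse("host=example.com\nport=8080");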

PEGs: [PEGs] have a different interpretation [from CFGs]: the choice operator selects the first match in PEG, while it is ambiguous in CFG.

Raku Grammars: Raku's grammar formalism supports the same unambiguous first match choice operator.

PEGs: [choice operator selects the first match] ... is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

Raku grammars: The unambiguous, logically sequential choice operator (first match among alternatives, spelled || in Raku) is indeed closer to how an actual parser works than a CFG's ambiguous choice. But Raku's innovative unambiguous, logically parallel choice operator (Longest Token Match (LTM) among alternatives, spelled |; e.g. . | .. | ... matches aaa, not a, given the input aaa) is an important improvement, and is more commonly used in Raku Grammars.
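The difference between the two operators can be seen directly, echoing the example in the text:

```raku
# Sequential (first match) choice vs Longest Token Match:
say 'aaa' ~~ / . || ... /;   # 「a」   — first alternative wins
say 'aaa' ~~ / . |  ... /;   # 「aaa」 — longest token wins
```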

LTM is much closer still to how humans parse, because humans instinctively resolve token ambiguities based on length ($100 is more likely interpreted as a hundred dollars than other interpretations), so languages tend to follow suit, and so parsers should too.

In PEGs, LTM-like choice has to be relatively awkwardly manually expressed, and, much more importantly, desk-checked when making changes. It's also potentially non-optimal performance-wise. Raku Grammars make LTM trivial to express, automatically maintained, and automatically mapped to appropriate optimizable NFAs/DFAs. This can significantly improve a grammar's readability, maintainability, and algorithmic (big O) performance.

PEGs: Unlike CFGs, PEGs cannot be ambiguous; if a string parses, it has exactly one valid parse tree.

Raku grammars: Raku grammars have this same property. But they also have another important property that both CFGs and PEGs lack, albeit for different reasons: they support combining two or more grammars or grammar fragments that are independently designed and maintained, with the join points between grammars being alternations. (Think "this grammar (fragment) or that".) For CFGs the problem is that a grammar that is a combination of two CFGs that have been proven unambiguous may be ambiguous.[5] For PEGs the problem is that PEG only has the first match choice operator.[6] This grammar combination capability is one of Raku's most consequential innovations.[7]
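A hedged sketch of what such combination can look like (grammar and token names invented for this example): a `proto token` declares an open alternation, and independently written code can add alternatives that automatically participate in LTM:

```raku
grammar Terms {
    token TOP { <term>+ % \s+ }
    proto token term {*}               # an open alternation (join point)
    token term:sym<num> { \d+ }
}

# An independently written extension adds a new alternative;
# it joins the <term> alternation and participates in LTM.
grammar TermsAndWords is Terms {
    token term:sym<word> { <.alpha> \w* }
}

say TermsAndWords.parse('42 foo');
```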

PEGs: It is conjectured that there exist context-free languages that cannot be recognized by a PEG, but this is not yet proven.

Raku grammars: Raku Rules are Turing complete. Thus, if a language, context-free or otherwise, can be parsed by a programming language, then a Raku grammar can be written to do so. Conversely, if a language can't be parsed by a Raku grammar then no computer program can be written to parse that language. (See the correspondence discussed in unrestricted grammars, though note that left recursion is not currently automatically resolved in Raku Grammars.)
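As a small illustration of power beyond context-free (an illustrative sketch), the classic context-sensitive language aⁿbⁿcⁿ can be recognized with an arbitrary code assertion:

```raku
# <?{ ... }> is a code assertion: the match succeeds
# only if the embedded Raku expression is true.
grammar ABC {
    token TOP { (a+) (b+) (c+) <?{ $0.chars == $1.chars == $2.chars }> }
}

say ABC.parse('aabbcc').so;   # True
say ABC.parse('aabbc').so;    # False
```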

PEGs: PEGs are well-suited to parsing computer languages (and artificial human languages such as Lojban), but not natural languages where the performance of PEG algorithms is comparable to general CFG algorithms such as the Earley algorithm.

Raku grammars: A similar point (non-suitability for parsing natural languages) applies to Raku Grammars. In theory, Raku Grammar performance primarily corresponds to recursive descent parsing for the non-terminal rules, and NFA processing for LTM tokens. In practice, it's considered fast enough by core devs. This in turn means it hasn't been optimized, despite hints in recent years that it might be (1, 2). And that in turn ironically leaves it a lot slower than it could be for now -- often much slower than existing fast parsing solutions.

Footnotes

[1] A generative grammar is a formalism that theoretically generates output compatible with the grammar, rather than going in the other direction and actually parsing input compatible with the grammar.[2]

[2] Generative grammars can be used backwards, to construct parsers. Similarly, analytic grammars can be used backwards, to generate strings that match the grammar. cf a related comment by jnthn about generating JSON examples from a Raku Grammar.

[3] Both PEGs and Raku Grammars look EBNFish. However, unlike the PEG formalism, the Raku Grammar formalism is actually arbitrary function definitions and calls, equivalent to a hand-written parser. So, despite appearances, the semantics are unrelated.

[4] The gap between PEGs and Raku Grammars isn't just about the latter's Turing complete power, though that is obviously a major difference. It's also that arbitrary programming constructs can be added to the Raku language including Raku's rules / grammar features.

[5] "whether an arbitrary [CFG] grammar is ambiguous is undecidable". And, as an essentially fatal flaw in practice for the use case of combining CFG grammars or grammar fragments, the composition of two or more unambiguous CFGs is itself of unknown ambiguity, even if they are cooperatively designed and maintained.

[6] Quoting Lukas Diekmann and Laurence Tratt from their paper Parsing composed grammars with language boxes:

PEGs are appealing for parsing composed languages because PEGs are closed under composition and inherently unambiguous. However, the ordered choice operator, which must be used when composing grammars, can cause the resulting PEG not to parse inputs valid in its constituent grammars. Assume the LHS and RHS of the ordered choice in the well known PEG S ::= a / ab are composed grammars. If the LHS matches a, the ordered choice matches and returns success to its container; the RHS is not given a chance to match anything, even if it could have matched a longer input sequence.
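In Raku, the same two alternatives under LTM pick the longer match; a quick illustrative check:

```raku
say 'ab' ~~ / a || ab /;   # 「a」  — PEG-style ordered choice stops at the first match
say 'ab' ~~ / a |  ab /;   # 「ab」 — LTM picks the longer alternative
```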

[7] Benefits that arise from Raku's focus on grammar combination:

  • Being able to compose or alter a grammar (language) "on-the-fly" from combinations of non-coordinated independently designed and maintained DSLs / grammars / grammar fragments. This requires that composing one or more new rules with one or more others A) just works and B) does so with O(n) (n being matched string length) performance. Sequential choice (PEG) fails to support either. Parallel choice supports both.

  • Being able to de-compose grammars. The corollary of being able to compose a grammar from parts is being able to de-compose a single larger monolithic grammar into smaller parts. This in turn allows relatively easy design, evolution, and maintenance of a PL as a composition of a collection of DSLs into an overall combined language. Indeed, a half dozen Raku grammars are composed to construct Raku itself. The so-called Raku "language" (singular) is actually an arbitrary braid of languages (plural) that are (literally seamlessly!) woven together automatically into an overall grammar by just specifying rules that call from one grammar to another, either unidirectionally or mutually recursively. This achieves the same as the paper quoted in footnote 6 while dispensing with the additional machinery the authors' "boxes" introduce.

  • The Raku language is designed to make it easy for "mere mortals" to mutate it to be whatever arbitrary braid of DSLs and GPLs they want. This is a powerful and general Language Oriented Programming solution for two key scenarios: A) Addressing arbitrary programming problems, and B) Addressing Raku's own progress as a "language" by letting ordinary users of Raku create and try out (r)evolutionary mutations of individual DSLs or combinations of them.
