Skip to content

Instantly share code, notes, and snippets.

@maestrow
Last active May 11, 2021 17:22
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save maestrow/5332f3d0b5f44c2ea369eb5778ebc9e2 to your computer and use it in GitHub Desktop.
Save maestrow/5332f3d0b5f44c2ea369eb5778ebc9e2 to your computer and use it in GitHub Desktop.
Compiler Theory & Syntax Analysis

Guido van Rossum

Серия статей от основателя Python - Guido van Rossum по теме PEG - Parsing Expression Grammars. В серии есть статья об устранении левой рекурсии. Примечательно, что Guido решил делать новый парсер на основе PEG (март 2020).

OMeta

https://en.wikipedia.org/wiki/OMeta

Диссертация Alessandro Warth's о языке OMeta - OMeta: An Object-Oriented Language for Pattern-Matching.

Репозитории:

Мои эксперименты:

OHM

https://github.com/harc/ohm Ohm is a successor to Ometa that aims to improve on it by (amongst other things) separating the grammar from the semantic actions.

Henri Tuhola

Статьи из личного блога Henri Tuhola:

Термины и определения

Formal grammar

Грамматика определяет правила формирования строк из алфавита языка. Грамматика не определяет значение строк, а только их форму. Грамматику можно задать набором правил.

Грамматику можно использовать как генератор языка или как распознаватель. Генератор - генерирует текст по правилам.

Распознаватель - определяет относится ли заданная строка к описываемому грамматикой языку. Для описания распознавателей используется теория автоматов.

Парсинг - процесс распознавания. В результате паринга создается Parse tree.

Описание грамматики

G = (N, E, P, S)
N - множество нетерминалов
E - множество терминалов
P - множество правил вида (EN)*N(EN)* -> (EN)*
S - принадлежит множеству N - стартовое правило.

Иерархия Хомского

Хомский классифицировал грамматики:

  • Контекстно-свободные
  • Регулярные

Языки, которые могут быть описаны такими грамматиками называются соответственно: контекстно свободными и регулярными языками.

Все регулярные языки могут быть распознаны конечным автоматом.

Для распознавания контекстно-свободных языков существуют алгоритмы построения LL парсеров и LR парсеров.

Контекстно-свободные грамматики

Контекстно-свободные грамматики - грамматики, у которых левая часть каждого правила состоит только из одного нетерминала.

Контекстно-свободные языки могут быть распознаны алгоритмом Эрли за время O(n3).

Детерминированные контекстно-свободные языки - подмножество контекстно-свободных языков, которые могут быть распознаны за линейное время с помощью Deterministic pushdown automaton.

Регулярные грамматики

Pushdown Automata - Автомат с магазинной памятью

Автомат с магазинной памятью — это конечный автомат, который использует стек для хранения состояний. По комбинации текущего состояния, входного символа и символа на вершине магазина автомат выбирает следующее состояние и, возможно, символ для записи в магазин.

PEG - Parsing expression grammar

Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

PEG = (N,E,P,S) N - множество нетерминалов E - множество терминалов P - множество правил S - стартовое правило

P = A <- e. A - нетерминал e - разбирающее выражение

Атомарное e:

  • любой терминал
  • любой нетерминал
  • пустая строка

Операторы:

  • Последовательность: e1 e2 e3
  • Упорядоченный выбор: e1|e2|e3
  • 0 или больше: e*
  • один или больше: e+
  • опционально: e?
  • "и" предикат: ?e
  • "не" предикат: !e
  • скобки ()

Предикаты "и" и "не" делают возможным to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic lookahead and disambiguation facility

Особенности PEG парсеров:

  • операторы *, +, ? всегда ведут себя жадно
  • нет ассоциативности. Например a+b+c+d преобразуется в дерево: [+ (+ +(ab) c) d], а не в желаемое + a b c d. Ассоциативность можно достич левой рекурсией. Однако ее нужно спецаиально обрабатывать
  • возможность появления левой рекурсии. Ее нужно специально обрабатывать, чтобы избежать бесконечной рекурсии.

PEG parsing is typically carried out via packrat parsing, which uses memoization[5][6] to eliminate redundant parsing steps.

A PEG is called well-formed[1] if it contains no left-recursive rules. Only the OMeta parsing algorithm[9] supports full direct and indirect left recursion without additional attendant complexity (but again, at a loss of the linear time complexity), whereas all GLR parsers support left recursion.

Bottom-up PEG parsing

A pika parser[11] uses dynamic programming to apply PEG rules bottom-up and right to left, which is the inverse of the normal recursive descent order of top-down, left to right. Parsing in reverse order solves the left recursion problem, allowing left-recursive rules to be used directly in the grammar without being rewritten into non-left-recursive form, and also conveys optimal error recovery capabilities upon the parser, which historically proved difficult to achieve for recursive descent parsers.

Парсеры

  • top-down parsing
    • Definite clause grammar parsers
    • Recursive descent parser
    • Predictive parser
    • Earley parser
  • bottom-up parsing
    • Precedence parser
      • Simple precedence parser
      • Operator-precedence parser
    • Bounded-context parser (BC)
    • LR parser (Left-to-right, Rightmost derivation in reverse)
      • Simple LR parser (SLR)
      • LALR parser (Look-Ahead)
      • Canonical LR parser (LR(1))
      • GLR parser (Generalized)[3]
    • CYK parser (Cocke–Younger–Kasami)
    • Recursive ascent parser
      • Packrat parser
    • Shift-reduce parser

Источники:

Что не вошло в классификацию:

  • Автомат с магазинной памятью

Search for: ToDo.

Тезисы из диссертации warth_dissertation.pdf по языку OMeta

OMeta is based on a variant of Parsing Expression Grammars (PEGs) [For04]—a recognition-based foundation for describing syntax—which I have extended to support pattern matching on arbitrary datatypes. I show that OMeta’s general-purpose pattern matching facilities provide a natural and convenient way for programmers to implement tokenizers, parsers, visitors, and tree transformers, all of which can be extended in interesting ways using familiar object-oriented mechanisms.

Расширение PEG

  • правила с параметрами charRange :x :y = char:c ?(x <= c && c <= y) -> c
  • semantic actions
  • bound to identifiers using the : operator

Пример:

ometa ExpParser {
  dig = ’0’ | ... | ’9’,
  num = dig+:ds -> [’num’, parseInt(ds.join(’’))],
  fac = fac:x ’*’ num:y -> [’mul’, x, y]
      | fac:x ’/’ num:y -> [’div’, x, y]
      | num,
  exp = exp:x ’+’ fac:y -> [’add’, x, y]
      | exp:x ’-’ fac:y -> [’sub’, x, y]
      | fac
}
  • semantic predicates digit = char:d ?(d >= ’0’ && d <= ’9’) -> d

  • PEG оперирует потоком символов, OMeta grammars can operate on structured as well as unstructured data, и предоставляет синтаксис для сопоставления с основными типами:

    • строки: 'строка',
    • числа: 42,
    • списки: [’hello’ 42 []].
  • To avoid ambiguities that arise from using a non-deterministic choice operator (the kind of choice found in context-free grammars), PEGs only support prioritized choice. In other words, choices are always evaluated in order.

Базовые правила

  • OMeta has a single built-in rule from which every other rule is derived. The name of this rule is anything, and it consumes exactly one object from the input stream.
  • Even the end rule, which detects the end of the input stream, is implemented in terms of anything: end = ˜anything
  • Оператор & - это ~~ (двойное отрицание, does't consume any input)
ometa Base {
  & :r = ~~apply(r)
  end = ~anything
}

Left recursion

fac = fac:x ’*’ num:y -> [’mul’, x, y]
    | fac:x ’/’ num:y -> [’div’, x, y]
    | num,

Первая альтернатива (choise) правила fac начинается с применения правила fac. Такая рекурсия никогда не завершится: применение fac завершается следующим применением этого же правила без поглощения ввода.

Alternative would be to rewrite fac to be right-recursive. However, this would result in right-associative parse trees, which is a problem because the multiplication operator is supposed to be left-associative.

Our last resort is to rewrite fac using the repetition operator (*) instead of left recursion. But in order to generate properly left-associative parse trees, we need somewhat complicated semantic actions,

mulOp = ’*’ -> ’mul’
      | ’/’ -> ’div’,
fac = num:x (mulOp:op num:y -> (x = [op, x, y]))* -> x

which makes this significantly less readable than the original left-recursive version.

(The “decidedly imperative” version of the fac rule shown above uses OMeta’s iteration operator (*) as a kind of loop construct, and a semantic action to update the x variable, which holds the parse tree, each time a new part of the expression is recognized.)

OMeta avoids these issues by extending PEGs with support for left recursion. While this does not make OMeta more powerful than PEGs, it does make it easier for programmers to express certain rules. This in turn increases OMeta’s usefulness as a rapid prototyping tool.

Parameterized Rules

Combining scannerless and “scannerful” parsing with parameterized rules:

eq  = ’=’         -> {kind: ’=’, value: ’=’},
num = digit+:ds   -> {kind: ’num’, value: parseInt(ds.join(’’))},
id  = letter+:ls  -> {kind: ’id’, value: ls.join(’’)},

scanner  = space* (eq | num | id),
token :k = scanner:t ?(t.kind == k) -> t.value,
assign   = token(’id’) token(’=’) token(’num’)

I have found this idiom to be less error-prone than scannerless parsing (the only kind supported by PEGs), and yet just as expressive since each rule may access the character stream directly if desired.

To make this idiom more convenient to use, OMeta supports a syntactic sugar for invoking a user-defined rule called token, i.e., token(’=’) can be written as "=":

condStmt = "if" "(" expr:c ")" stmt:tb "else" stmt:fb -> ...

The Parser grammar, which is part OMeta’s “standard library”, provides a default implementation of token that skips any number of spaces and then tries to match the characters that make up the string that it receives as an argument.

Pattern Matching on Rule Arguments

OMeta’s parameterized rules can also pattern-match on their arguments. In fact,

charRange :x :y = ...

is actually shorthand for

charRange anything:x anything:y = ...

which means that x and y can be any kind of object. This mechanism can be used to validate the arguments that are passed to a rule. For example, the following version of charRange ensures that both of its arguments are characters:

charRange char:x char:y = ...

Also, because any pattern (not just rule applications) can be used on the left-hand side of a rule, OMeta naturally supports the inductive style used to define functions in languages like Haskell and ML:

fact 0 = -> 1,
fact :n = fact(n - 1):m -> (n * m)

(When a rule has multiple definitions, as in the example above, each definition is tried in order until the first one succeeds.)

Higher-Order Rules

OMeta also provides a mechanism for implementing higher-order rules, i.e., rules that take other rules as arguments. This is supported by the apply rule, which takes as an argument the name of a rule, and invokes it. In other words, apply(’expr’) is equivalent to expr.

As an example, the rule

listOf :p = apply(p) ("," apply(p))*

can be used to recognize both comma-delimited lists of expressions

listOf(’expr’)

and lists of names

listOf(’name’)

OMeta’s Parameterized and higher-order rules bring the expressive power of parser combinator libraries.

A Note on Memoization

Packrat parsers are parsers for PEGs that are able to guarantee linear parse times while supporting backtracking and unlimited look-ahead “by saving all intermediate parsing results as they are computed and ensuring that no result is evaluated more than once.” [For02a] While OMeta is based on PEGs, it does not necessarily have to be implemented using packrat-style memoization. My implementations do in fact memoize the results of rules without arguments, but in order to keep their memory footprints small, I chose not to memoize the results of rules with arguments.

O is for Object-Oriented

Extensible Pattern Matching

ToDo: Разобраться с оптимизацией в разделе 2.3.2 Extensible Pattern Matching.

ometa EJSParser <: JSParser {
    isKeyword :x = ?(x == ’say’)
                | ˆisKeyword(x),
    stmt =        "say" expr:x sc -> [’call’, [’get’, ’alert’], x]
                | ˆstmt
}

Figure 2.6: Extending a JavaScript parser with the say statement (say ’hello’; is equivalent to alert(’hello’);)

Stateful Pattern Matching

Note that OMeta does not attempt to undo the side effects of semantic actions while backtracking; for some semantic actions, like printing characters to the console, this would be impossible. Programmers implementing stateful pattern matchers must therefore write their semantic actions carefully. (The case study of Chapter 4 (Worlds) of this dissertation provides a solution to this problem.)

Foreign Rule Invocation

ometa OMetaJSParser {
    topLevel = foreign(OMetaParser, ’grammar’)
             | foreign(JSParser, ’srcElem’)
}

Core-OMeta: an Operational Semantics for OMeta-Style Pattern Matching

Core-OMeta’s language of parsing expressions extends PEGs with

  • bindings and semantic actions
  • pattern matching on structured data (i.e., lists)

Basic PEG functionality

Empty

pattern e always succeeds without consuming any input.

Atomic Values (e.g., characters, numbers, booleans)

The atomic value pattern a succeeds only if the input stream is not empty and its first element is equal to the atomic value in the pattern. A successful match yields the value that was consumed.

Nonterminal

The nonterminal (or rule application) A succeeds if its associated pattern succeeds when applied to the input. Note that the "body of the rule" is applied to the input with a fresh store to ensure that bindings are (statically) scoped to the rules in which they appear.

Sequence e1 e2

Alternation e1 | e2

The alternation pattern succeeds if e1 matches the input (in which case e2 is skipped); otherwise (i.e., if e1 fails), the result of the alternation pattern is the same as the result of e2.

Note that e1’s side effects—i.e., whatever changes it may have made to the store—are not rolled back before e2 is evaluated.

Iteration e*

The iteration pattern e∗ successively applies e to the input, only stopping when that results in a match failure. It yields a (possibly empty) list containing the results of each successful application.

Negatiation

The negation pattern !e succeeds if the application of e to the input results in a match failure, in which case it consumes no input and yields the value none. It fails if the application of e succeeds.

Bindings and Semantic Actions

Binding

The binding pattern e:x succeeds only if e succeeds when applied to the input. It yields the same value as e, but also modifies the store in order to bind the variable x to that value.

Semantic Action

The semantic action -> t always succeeds without consuming any input. It yields the value obtained from evaluating the term t in the context of the current store.

List (Tree) Matching

A list pattern succeeds only if the input stream is not empty and its first element is a list all of whose elements are matched by e.

Left Recursion Support for Packrat Parsers

seed parse

growing the seed

https://arxiv.org/pdf/1509.02439.pdf Parsing Expression Grammars Made Practical

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment