maestrow/post.md

## post.md

      
    Raw
  

              post.md
            
          
What is packrat parsing?
https://www.csd.uwo.ca/~mmorenom/CS447/Lectures/Syntax.html/Syntax.html

Guido van Rossum

Серия статей от основателя Python - Guido van Rossum по теме PEG - Parsing Expression Grammars. В серии есть статья об устранении левой рекурсии. Примечательно, что Guido решил делать новый парсер на основе PEG (март 2020).
OMeta

https://en.wikipedia.org/wiki/OMeta
Диссертация Alessandro Warth's о языке OMeta - OMeta: An Object-Oriented Language for Pattern-Matching.
Репозитории:

https://github.com/alexwarth/ometa-js Оригинальная версия OMeta-JS автора
https://github.com/veged/ometa-js Доработанная версия товарищами из Яндекса.

Мои эксперименты:

https://github.com/maestrow/ometa-js3
https://github.com/maestrow/Learning-OMetaJS

OHM

https://github.com/harc/ohm Ohm is a successor to Ometa that aims to improve on it by (amongst other things) separating the grammar from the semantic actions.
Henri Tuhola

Статьи из личного блога Henri Tuhola:

How to write your own LR(1) parsing tables (and generate them)
Layout sensitive LR(1) algorithm analysis
Error correction in LR parsing


## theory.md

      
    Raw
  

              theory.md
            
          
    Термины и определения

Formal grammar

Грамматика определяет правила формирования строк из алфавита языка. Грамматика не определяет значение строк, а только их форму. Грамматику можно задать набором правил.
Грамматику можно использовать как генератор языка или как распознаватель. Генератор - генерирует текст по правилам.
Распознаватель - определяет относится ли заданная строка к описываемому грамматикой языку. Для описания распознавателей используется теория автоматов.
Парсинг - процесс распознавания. В результате паринга создается Parse tree.
Описание грамматики

G = (N, E, P, S)
N - множество нетерминалов
E - множество терминалов
P - множество правил вида (EN)*N(EN)* -> (EN)*
S - принадлежит множеству N - стартовое правило.

Иерархия Хомского

Хомский классифицировал грамматики:

Контекстно-свободные
Регулярные

Языки, которые могут быть описаны такими грамматиками называются соответственно: контекстно свободными и регулярными языками.
Все регулярные языки могут быть распознаны конечным автоматом.
Для распознавания контекстно-свободных языков существуют алгоритмы построения LL парсеров и LR парсеров.
Контекстно-свободные грамматики

Контекстно-свободные грамматики - грамматики, у которых левая часть каждого правила состоит только из одного нетерминала.
Контекстно-свободные языки могут быть распознаны алгоритмом Эрли за время O(n3).
Детерминированные контекстно-свободные языки - подмножество контекстно-свободных языков, которые могут быть распознаны за линейное время с помощью Deterministic pushdown automaton.
Регулярные грамматики

Pushdown Automata - Автомат с магазинной памятью

Автомат с магазинной памятью — это конечный автомат, который использует стек для хранения состояний. По комбинации текущего состояния, входного символа и символа на вершине магазина автомат выбирает следующее состояние и, возможно, символ для записи в магазин.
PEG - Parsing expression grammar

Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.
PEG = (N,E,P,S)
N - множество нетерминалов
E - множество терминалов
P - множество правил
S - стартовое правило
P = A <- e.
A - нетерминал
e - разбирающее выражение
Атомарное e:

любой терминал
любой нетерминал
пустая строка

Операторы:

Последовательность: e1 e2 e3
Упорядоченный выбор: e1|e2|e3
0 или больше: e*
один или больше: e+
опционально: e?
"и" предикат: ?e
"не" предикат: !e
скобки ()

Предикаты "и" и "не" делают возможным to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic lookahead and disambiguation facility
Особенности PEG парсеров:

операторы *, +, ? всегда ведут себя жадно
нет ассоциативности. Например a+b+c+d преобразуется в дерево: [+ (+ +(ab) c) d], а не в желаемое + a b c d. Ассоциативность можно достич левой рекурсией. Однако ее нужно спецаиально обрабатывать
возможность появления левой рекурсии. Ее нужно специально обрабатывать, чтобы избежать бесконечной рекурсии.

PEG parsing is typically carried out via packrat parsing, which uses memoization[5][6] to eliminate redundant parsing steps.
A PEG is called well-formed[1] if it contains no left-recursive rules. Only the OMeta parsing algorithm[9] supports full direct and indirect left recursion without additional attendant complexity (but again, at a loss of the linear time complexity), whereas all GLR parsers support left recursion.
Bottom-up PEG parsing

A pika parser[11] uses dynamic programming to apply PEG rules bottom-up and right to left, which is the inverse of the normal recursive descent order of top-down, left to right. Parsing in reverse order solves the left recursion problem, allowing left-recursive rules to be used directly in the grammar without being rewritten into non-left-recursive form, and also conveys optimal error recovery capabilities upon the parser, which historically proved difficult to achieve for recursive descent parsers.
Парсеры


top-down parsing

Definite clause grammar parsers
Recursive descent parser
Predictive parser
Earley parser


bottom-up parsing

Precedence parser

Simple precedence parser
Operator-precedence parser


Bounded-context parser (BC)
LR parser (Left-to-right, Rightmost derivation in reverse)

Simple LR parser (SLR)
LALR parser (Look-Ahead)
Canonical LR parser (LR(1))
GLR parser (Generalized)[3]


CYK parser (Cocke–Younger–Kasami)
Recursive ascent parser

Packrat parser


Shift-reduce parser


Источники:

https://en.wikipedia.org/wiki/Top-down_parsing
https://en.wikipedia.org/wiki/Bottom-up_parsing

Что не вошло в классификацию:

Автомат с магазинной памятью


## warth_disser.md

      
    Raw
  

              warth_disser.md
            
          
    Search for: ToDo.
Тезисы из диссертации warth_dissertation.pdf по языку OMeta

OMeta is based on a variant of
Parsing Expression Grammars (PEGs) [For04]—a recognition-based foundation for
describing syntax—which I have extended to support pattern matching on arbitrary
datatypes. I show that OMeta’s general-purpose pattern matching facilities provide a
natural and convenient way for programmers to implement tokenizers, parsers, visitors, and tree transformers, all of which can be extended in interesting ways using
familiar object-oriented mechanisms.
Расширение PEG


правила с параметрами charRange :x :y = char:c ?(x <= c && c <= y) -> c
semantic actions
bound to identifiers using the : operator

Пример:
ometa ExpParser {
  dig = ’0’ | ... | ’9’,
  num = dig+:ds -> [’num’, parseInt(ds.join(’’))],
  fac = fac:x ’*’ num:y -> [’mul’, x, y]
      | fac:x ’/’ num:y -> [’div’, x, y]
      | num,
  exp = exp:x ’+’ fac:y -> [’add’, x, y]
      | exp:x ’-’ fac:y -> [’sub’, x, y]
      | fac
}


semantic predicates digit = char:d ?(d >= ’0’ && d <= ’9’) -> d


PEG оперирует потоком символов, OMeta grammars can operate on structured
as well as unstructured data, и предоставляет синтаксис для сопоставления с основными типами:

строки: 'строка',
числа: 42,
списки: [’hello’ 42 []].


To avoid ambiguities that arise from using a non-deterministic choice operator (the
kind of choice found in context-free grammars), PEGs only support prioritized choice.
In other words, choices are always evaluated in order.


Базовые правила


OMeta has a single built-in rule from which every other rule is derived. The name
of this rule is anything, and it consumes exactly one object from the input stream.
Even the end rule, which detects the end of the input stream, is implemented in terms
of anything: end = ˜anything
Оператор & - это ~~ (двойное отрицание, does't consume any input)

ometa Base {
  & :r = ~~apply(r)
  end = ~anything
}

Left recursion

fac = fac:x ’*’ num:y -> [’mul’, x, y]
    | fac:x ’/’ num:y -> [’div’, x, y]
    | num,

Первая альтернатива (choise) правила fac начинается с применения правила fac. Такая рекурсия никогда не завершится: применение fac завершается следующим применением этого же правила без поглощения ввода.
Alternative would be to rewrite fac to be right-recursive. However, this would result in right-associative parse trees, which is a problem because the
multiplication operator is supposed to be left-associative.
Our last resort is to rewrite fac using the repetition operator (*) instead of left recursion. But in order to generate properly left-associative parse trees, we need somewhat complicated semantic actions,
mulOp = ’*’ -> ’mul’
      | ’/’ -> ’div’,
fac = num:x (mulOp:op num:y -> (x = [op, x, y]))* -> x

which makes this significantly less readable than the original left-recursive version.
(The “decidedly imperative” version of the fac rule shown above uses OMeta’s iteration operator (*) as a kind of loop construct, and a semantic action to update the x
variable, which holds the parse tree, each time a new part of the expression is recognized.)
OMeta avoids these issues by extending PEGs with support for left recursion.
While this does not make OMeta more powerful than PEGs, it does make it easier
for programmers to express certain rules. This in turn increases OMeta’s usefulness
as a rapid prototyping tool.
Parameterized Rules

Combining scannerless and “scannerful” parsing with parameterized rules:
eq  = ’=’         -> {kind: ’=’, value: ’=’},
num = digit+:ds   -> {kind: ’num’, value: parseInt(ds.join(’’))},
id  = letter+:ls  -> {kind: ’id’, value: ls.join(’’)},

scanner  = space* (eq | num | id),
token :k = scanner:t ?(t.kind == k) -> t.value,
assign   = token(’id’) token(’=’) token(’num’)

I have found this idiom to be less error-prone than scannerless parsing (the only kind
supported by PEGs), and yet just as expressive since each rule may access the character
stream directly if desired.
To make this idiom more convenient to use, OMeta supports a syntactic sugar
for invoking a user-defined rule called token, i.e., token(’=’) can be written as
"=":
condStmt = "if" "(" expr:c ")" stmt:tb "else" stmt:fb -> ...

The Parser grammar, which is part OMeta’s “standard library”, provides a default
implementation of token that skips any number of spaces and then tries to match the
characters that make up the string that it receives as an argument.
Pattern Matching on Rule Arguments

OMeta’s parameterized rules can also pattern-match on their arguments. In fact,
charRange :x :y = ...

is actually shorthand for
charRange anything:x anything:y = ...

which means that x and y can be any kind of object.
This mechanism can be used to validate the arguments that are passed to a rule.
For example, the following version of charRange ensures that both of its arguments
are characters:
charRange char:x char:y = ...

Also, because any pattern (not just rule applications) can be used on the left-hand
side of a rule, OMeta naturally supports the inductive style used to define functions in
languages like Haskell and ML:
fact 0 = -> 1,
fact :n = fact(n - 1):m -> (n * m)

(When a rule has multiple definitions, as in the example above, each definition is tried
in order until the first one succeeds.)
Higher-Order Rules

OMeta also provides a mechanism for implementing higher-order rules, i.e., rules that
take other rules as arguments. This is supported by the apply rule, which takes as
an argument the name of a rule, and invokes it. In other words, apply(’expr’) is
equivalent to expr.
As an example, the rule
listOf :p = apply(p) ("," apply(p))*

can be used to recognize both comma-delimited lists of expressions
listOf(’expr’)

and lists of names
listOf(’name’)

OMeta’s Parameterized and higher-order rules bring the expressive power of parser
combinator libraries.
A Note on Memoization

Packrat parsers are parsers for PEGs that are able to guarantee linear parse times while
supporting backtracking and unlimited look-ahead “by saving all intermediate parsing results as they are computed and ensuring that no result is evaluated more than
once.” [For02a] While OMeta is based on PEGs, it does not necessarily have to be
implemented using packrat-style memoization.
My implementations do in fact memoize the results of rules without arguments,
but in order to keep their memory footprints small, I chose not to memoize the results
of rules with arguments.
O is for Object-Oriented

Extensible Pattern Matching

ToDo: Разобраться с оптимизацией в разделе 2.3.2 Extensible Pattern Matching.
ometa EJSParser <: JSParser {
    isKeyword :x = ?(x == ’say’)
                | ˆisKeyword(x),
    stmt =        "say" expr:x sc -> [’call’, [’get’, ’alert’], x]
                | ˆstmt
}

Figure 2.6: Extending a JavaScript parser with the say statement (say ’hello’; is
equivalent to alert(’hello’);)
Stateful Pattern Matching

Note that OMeta does not attempt to undo the side effects of semantic actions
while backtracking; for some semantic actions, like printing characters to the console,
this would be impossible. Programmers implementing stateful pattern matchers must
therefore write their semantic actions carefully. (The case study of Chapter 4 (Worlds) of this
dissertation provides a solution to this problem.)
Foreign Rule Invocation

ometa OMetaJSParser {
    topLevel = foreign(OMetaParser, ’grammar’)
             | foreign(JSParser, ’srcElem’)
}

Core-OMeta: an Operational Semantics for OMeta-Style Pattern Matching

Core-OMeta’s language of parsing expressions extends PEGs with

bindings and semantic actions
pattern matching on structured data (i.e., lists)

Basic PEG functionality


Empty

pattern e always succeeds without consuming any input.

Atomic Values (e.g., characters, numbers, booleans)

The atomic value pattern a succeeds only if the input stream is not empty and its first
element is equal to the atomic value in the pattern. A successful match yields the value
that was consumed.

Nonterminal

The nonterminal (or rule application) A succeeds if its associated pattern succeeds
when applied to the input. Note that the "body of the rule" is applied to the input with
a fresh store to ensure that bindings are (statically) scoped to the rules in which
they appear.

Sequence e1 e2


Alternation e1 | e2

The alternation pattern succeeds if e1 matches the input (in which case e2 is
skipped); otherwise (i.e., if e1 fails), the result of the alternation pattern is the same as
the result of e2.
Note that e1’s side effects—i.e., whatever changes it may have made to the store—are
not rolled back before e2 is evaluated.

Iteration e*

The iteration pattern e∗
successively applies e to the input, only stopping when that
results in a match failure. It yields a (possibly empty) list containing the results of
each successful application.

Negatiation

The negation pattern !e succeeds if the application of e to the input results in a match
failure, in which case it consumes no input and yields the value none. It fails if the
application of e succeeds.
Bindings and Semantic Actions


Binding

The binding pattern e:x succeeds only if e succeeds when applied to the input. It yields the same value as e, but also modifies the store in order to bind the variable x to that value.

Semantic Action

The semantic action -> t always succeeds without consuming any input. It yields the value obtained from evaluating the term t in the context of the current store.
List (Tree) Matching

A list pattern succeeds only if the input stream is not empty and its first element
is a list all of whose elements are matched by e.
Left Recursion Support for Packrat Parsers

seed parse
growing the seed
https://arxiv.org/pdf/1509.02439.pdf Parsing Expression Grammars Made Practical