CrockAgile/earley_parsing_explained.md

## earley_parsing_explained.md

      
    Raw
  

              earley_parsing_explained.md
            
          
    Earley Parsing Explained


      Loading

  
      flowchart TD
0["sum"] --> 1["sum"]
1["sum"] --> 2["product"]
2["product"] --> 3["factor"]
3["factor"] --> 4["number"]
4["number"] --> 5["digit"]
5["digit"] --> 6["1"]
0["sum"] --> 7["add"]
7["add"] --> 8["+"]
0["sum"] --> 9["product"]
9["product"] --> 10["product"]
10["product"] --> 11["factor"]
11["factor"] --> 12["number"]
12["number"] --> 13["digit"]
13["digit"] --> 14["2"]
9["product"] --> 15["mult"]
15["mult"] --> 16["*"]
9["product"] --> 17["factor"]
17["factor"] --> 18["number"]
18["number"] --> 19["digit"]
19["digit"] --> 20["3"]

    
Grammars

Languages are defined by their grammar. For computer scientists, a "grammar" is a set of production rules. These production rules are written using syntax like:
<dna> ::= <base> | <base> <dna>
<base> ::= "A" | "C" | "G" | "T"


note: for this article, "terminal" symbols will be written like "A", and "non-terminal" symbols will be written like <dna>

If these production rules were paraphrased to English, they could be read as:

<base> ::= "A" | "C" | "G" | "T"
<base> can be either "A", "C", "G", or "T"
<dna> ::= <base> | <base> <dna>
<dna> can be either a <base>, or a <base> followed by more <dna>

Assuming <dna> to be the starting state of this grammar, one possible traversal of this grammar could be:

  
      Loading

  
      flowchart TD
0["dna"] --> 1["base"]
1["base"] --> 2["G"]

    
This diagram shows a traversal using the following steps:

<dna> ::= <base>
<base> ::= "G"

So, this simple DNA grammar is capable of generating "G". Or put another way, "G" is a valid sentence in the grammar.
Generating sentences from a grammar is fun, but the more common practical application with grammars is "parsing". Parsing answers the questions:

is an input sentence valid in the grammar?
if so, what production rules would be traversed?

These features were requested by users of , a Rust crate for exactly this kind of grammar exploration.
Only a few years later (😅), and with some generous collaboration from @shnewto, Earley Parsing was used to answer these questions (PR). This article includes what I learned implementing that Earley parser.
Earley Parsing

References

The wikipedia page for Earley Parsing is "classic wikipedia". It is helpful, but correctness is given priority over clarity.
A much more informative source would be another blog post, Earley Parsing Explained.
These two were the "primary sources" used for implementing Earley parsing. I hope this article will be similarly helpful to the next Earley hacker to come along.
How Does Earley Work?

Earley Parsers commonly use notation like: <dna> ::= <base> • <dna>
The • is important! The • is used as a progress cursor. Anything behind the • has already been matched. Anything ahead of the • has not yet been matched. The first symbol ahead of the • is currently being matched.
If there are no more symbols ahead of the • (<dna> ::= <base> <dna> •), then the production is called "complete".
Let's use similar syntax for the input sentence being parsed. Again, • will be the progress cursor. But it will be useful to also track where a traversal attempt began in the input. This "starting point" will use $. For example, $G•A would mean this Earley traversal started at the beginning of the input sentence, a "G" has already been matched, and "A" has yet to be matched.
An Earley Parser attempts to traverse any production in the grammar that matches the current input, starting with the initial state. This is done by keeping a queue of possible traversals.
There are three possible actions when attempting a traversal: "predict", "scan", and "complete". Each is used depending on the next symbol after the • cursor.

if a non-terminal (<dna> ::= • <base>) then predict more traversals using its production rules
if a terminal (<base> ::= • "G") then scan the input text for a match
if nothing remaining (<base> ::= "G" •) then complete other pending traversals

Let's walk through Earley attempt to parse the input string "GA"
First, initialize the traversals with productions for the starting state <dna>:


id
production
input


0
<dna> ::= • <base>
"$•GA"


1
<dna> ::= • <base> <dna>
"$•GA"


Each traversal step will get a unique identifier, in this case an incrementing integer.
There are multiple initial items because our grammar syntax supports production rules with multiple expressions (<dna> ::= <base> | <base> <dna>).
In traversal 0 (<dna> ::= • <base>), the non-terminal <base> is currently being matched. Earley "predicts" possible traversal using all production rules in the grammar for <base>. Using the production rule <base> ::= "A" | "C" | "G" | "T", 4 new traversal items are created.


id
production
input


2
<base> ::= • "A"
"$•GA"


3
<base> ::= • "C"
"$•GA"


4
<base> ::= • "G"
"$•GA"


5
<base> ::= • "T"
"$•GA"


These new traversal attempts are matching on terminals, so the input is scanned. "A", "C", and "T" all fail, but "G" successfully scans. When a match is found, the cursor • is moved forward. Advancing the cursor • gives a new traversal item:


id
production
input


6
<base> ::= "G" •
"$G•A"


This traversal is "complete", but it is only a partial success. The full input has not been matched, and <base> is not the grammar's starting state. However this partial success is still good news. Remembering all active traversals:


id
production
input


0
<dna> ::= • <base>
"$•GA"


1
<dna> ::= • <base> <dna>
"$•GA"


6
<base> ::= "G" •
"$G•A"


Completing <base> allows the other traversals to advance. But a complete traversal can only advance other traversals which are matching on the "completed" term (<base>). Also, the completed input must have started ($) where the other traversal is attempting to match (•).
Traversals 0 and 1 satisfy both these requirements. They are matching on <base>, the term which was "completed". Also, traversals 0 and 1 both have their matching cursor • before "G", which is where traversal 6 began ($G).
Completing <base> in traversals 0 and 1 adds two new traversals attempts:


id
production
input


0
<dna> ::= • <base>
"$•GA"


1
<dna> ::= • <base> <dna>
"$•GA"


6
<base> ::= "G" •
"$G•A"


7
<dna> ::= <base=6> •
"$G•A"


8
<dna> ::= <base=6> • <dna>
"$G•A"


Traversals 0 and 1 have been advanced to create 7 and 8. Only "complete" traversals are required to build the final parse tree, so partial traversals like 0 and 1 will be omitted once they are no longer relevant.
How <base> was completed was also recorded, as <base=6>. Recording that the term was completed via traversal 6 is critical for building the final parse tree.
Continuing Earley parsing, the relevant traversals are:


id
production
input


6
<base> ::= "G" •
"$G•A"


7
<dna> ::= <base=6> •
"$G•A"


8
<dna> ::= <base=6> • <dna>
"$G•A"


Traversal 7 is "complete". Altho traversal 8 is matching the completed term <dna>, its input cursor does not match (traversal 7 began before the "G", but traversal 8 has already advanced its cursor beyond that). Because there are no traversals that qualify, traversal 7 completing has no effect.
Traversal 8 is incomplete, and is matching on a non-terminal <dna>. Looking at the production rules for <dna>, two predictions are added:


id
production
input


6
<base> ::= "G" •
"$G•A"


8
<dna> ::= <base=6> • <dna>
"$G•A"


9
<dna> ::= • <base>
"G$•A"


10
<dna> ::= • <base> <dna>
"G$•A"


The new traversals 9 and 10 can also make predictions:


id
production
input


6
<base> ::= "G" •
"$G•A"


8
<dna> ::= <base=6> • <dna>
"$G•A"


9
<dna> ::= • <base>
"G$•A"


10
<dna> ::= • <base> <dna>
"G$•A"


11
<base> ::= • "A"
"G$•A"


12
<base> ::= • "C"
"G$•A"


13
<base> ::= • "G"
"G$•A"


14
<base> ::= • "T"
"G$•A"


Only the "A" scans successfully, leaving:


id
production
input


6
<base> ::= "G" •
"$G•A"


8
<dna> ::= <base=6> • <dna>
"$G•A"


9
<dna> ::= • <base>
"G$•A"


10
<dna> ::= • <base> <dna>
"G$•A"


15
<base> ::= "A" •
"G$A•"


Traversal 15 completes traversals 9 and 10:


id
production
input


6
<base> ::= "G" •
"$G•A"


8
<dna> ::= <base=6> • <dna>
"$G•A"


15
<base> ::= "A" •
"G$A•"


16
<dna> ::= <base=15> •
"G$A•"


17
<dna> ::= <base=15> • <dna>
"G$A•"


Traversal 16 completes traversal 8. Traversal 17 can be ignored because it is incomplete but there is no more input:


id
production
input


6
<base> ::= "G" •
"$G•A"


15
<base> ::= "A" •
"G$A•"


16
<dna> ::= <base=15> •
"G$A•"


18
<dna> ::= <base=6> <dna=16> •
"$GA•"


And voila ! Traversal 18 is a full parse because:

it is "complete" (all terms in production have matched)
it began from the grammar's starting term <dna>
the full input has been parsed ($ at beginning and • at end)

And not only is it a successful parse, but it also includes how the input was parsed. Tracing the matched terms (e.g. <base=6>) produces the parse tree:

  
      Loading

  
      flowchart TD
0["dna"] --> 1["base"]
1["base"] --> 2["G"]
0["dna"] --> 3["dna"]
3["dna"] --> 4["base"]
4["base"] --> 5["A"]

    
Code It In Rust

It was a lot of manual labor parsing even that basic example of DNA. From now on a computer should handle parsing.
Code snippets will opt for clarity over performance. In other words, there will be lots of clones! The goal of this section is to see Earley roughly translated to Rust. If you are hungry for more realistic code, checkout the PR that inspired this article.
The  crate already defines common language types, such as Grammar and Production. In fact, it uses the same syntax as this article, so creating a grammar looks familiar:
let grammar: Grammar = "
    <dna> ::= <base> | <base> <dna>
    <base> ::= 'A' | 'C' | 'G' | 'T'
".parse().unwrap();
Modeling

Next, the Earley traversals need to be modeled. The final traversals looked like:


id
production
input


6
<base> ::= "G" •
"$G•A"


15
<base> ::= "A" •
"G$A•"


16
<dna> ::= <base=15> •
"G$A•"


18
<dna> ::= <base=6> <dna=16> •
"$GA•"


The traversal identifier can be a usize.
struct TraversalId(usize);
The "input" column is a string, with two cursors ($ and •)
struct InputRange {
    // this should be &str but let's ignore lifetimes
    input: String,
    start_index: usize,
    matching_index: usize,
}
The "production" column includes matched and unmatched terms. Matched terms include how they were matched (e.g. <base=15>)
enum TermMatch {
    /// a match on a string literal, like after scanning some input
    /// e.g. `<base> ::= "G" •`
    Terminal(String),
    /// a match on a non-terminal, like after completing a production
    /// e.g. `<dna> ::= <base=15> •`
    NonTerminal(TraversalId),
}
The production can then be broken into three fields
// e.g. `<dna> ::= <base=6> • <dna>`
use bnf::Term;
struct TraversalProduction {
    left_hand_side: Term,
    matched_terms: Vec<TermMatch>,
    unmatched_terms: Vec<Term>,
}
With all its components modeled, the full traversal type is:
struct Traversal {
    production: TraversalProduction,
    input: InputRange,
}
There is one bit of pain that has been put off. Rust's beautiful ownership semantics make tree structures (like a parse tree!) a bit verbose for this blog post. For this reason, a naive "arena allocator" will be used. If you want to read more about arena allocators, I highly recommend Manish Goregaokar's blog post Arenas In Rust.
Because we don't need fancy features from real arena allocator crates, our "naive arena allocator" is Vec<Traversal> wrapped in an "append only" API.
struct TraversalArena {
    traversals: Vec<Traversal>,
}

impl TraversalArena {
    pub fn push(&mut self, item: Traversal) -> TraversalId {
        let idx = self.traversals.len();
        self.traversals.push(item);
        TraversalId(idx)
    }

    pub fn get(&self, key: TraversalId) -> Option<&Traversal> {
        self.traversals.get(key.0)
    }
}
Algorithm

// like before, the first step is setting up the grammar, input string, etc
let grammar: Grammar = "
    <dna> ::= <base> | <base> <dna>
    <base> ::= 'A' | 'C' | 'G' | 'T'
".parse().unwrap()

let input = "GA".to_string();

let mut arena = TraversalArena::default();

// this queue will track which traversals have not yet been evaluated
let mut queue = std::collections::VecDeque::<TraversalId>::new();

// assume the first production is the grammar's starting state
// which is `<dna> ::= <base> | <base> <dna>`
let first_prod = grammar.productions_iter().next().unwrap();
// starting state is `<dna>`
let starting_state = &first_prod.lhs;

// initial traversals are:
// * <dna> ::= <base>
// * <dna> ::= <base> <dna>
let initial_traversals = first_prod.rhs_iter().map(|expression| {
    let production = TraversalProduction {
        left_hand_side: starting_state.clone(),
        matched_terms: vec![],
        unmatched_terms: expression.terms.clone(),
    };
    let input = InputRange {
        input: input.clone(),
        matching_index: 0,
        start_index: 0,
    };
    Traversal { input, production }
});

for traversal in initial_traversals {
    let id = arena.push(traversal);
    queue.push_back(id);
}

// finally it's time for Earley parsing!
// loop while there are active traversals
while let Some(id) = queue.pop_front() {
    let traversal = arena.get(id).unwrap();

    let next_unmatched = traversal.production.unmatched_terms.get(0);

    let new_traversals = match next_unmatched {
        // if the next unmatched term is a non-terminal
        // then Earley needs to **predict**
        Some(Term::Nonterminal(_)) => traversal.predict(&grammar),
        // if the next unmatched term is a terminal
        // then Earley needs to **scan**
        Some(Term::Terminal(terminal)) => traversal.scan(&input),
        // if there is no next unmatched term
        // then Earley needs to **complete**
        None => traversal.complete(&arena.traversals),
    };

    for traversal in new_traversals {
        if traversal.is_complete(starting_state, input.len()) {
            println!("complete! {:?}", traversal);
        }

        let id = arena.push(traversal);
        queue.push_back(id);
    }
}
The loop above captures the "essence" of Earley parsing. Loop over traversals, predict, scan, complete, and return any full parses.
The remaining implementation code is more careful bookkeeping (creating new traversals, advancing cursors, etc.) but is not particularly enlightening. So feel free to skip it.

impl Traversal {
    pub fn predict(&self, grammar: &Grammar) -> Vec<Self> {
        let Self { production, input } = self;
        let nonterminal = &production.unmatched_terms[0];
        grammar
            .productions_iter()
            // find productions with the non-terminal on the left hand side
            .filter(|prod| &prod.lhs == nonterminal)
            // create new traversals for each production rule expression
            .flat_map(|prod| {
                prod.rhs_iter().map(|expr| {
                    let production = TraversalProduction {
                        left_hand_side: nonterminal.clone(),
                        matched_terms: vec![],
                        unmatched_terms: expr.terms.clone(),
                    };

                    // prediction input ranges start from where the prediction is matching
                    let input = InputRange {
                        input: input.input.clone(),
                        matching_index: input.matching_index,
                        start_index: input.matching_index,
                    };
                    Traversal { input, production }
                })
            })
            .collect()
    }
    pub fn scan(&self, sentence: &str) -> Vec<Self> {
        let Self { production, input } = self;
        let terminal = match &production.unmatched_terms[0] {
            Term::Nonterminal(..) => unreachable!(),
            Term::Terminal(term) => term,
        };

        // slice the input string to what has not yet been matched
        let matching = &input.input[input.matching_index..];

        if matching.starts_with(terminal) {
            // move the first unmatched term over to the matched
            let mut matched_terms = production.matched_terms.clone();
            let matched = TermMatch::Terminal(terminal.clone());
            matched_terms.push(matched);
            let unmatched_terms =
                Vec::from_iter(production.unmatched_terms.iter().skip(1).cloned());

            let production = TraversalProduction {
                left_hand_side: production.left_hand_side.clone(),
                matched_terms,
                unmatched_terms,
            };

            // advance input cursor by matched string length
            let input = InputRange {
                input: input.input.clone(),
                matching_index: input.matching_index + matching.len(),
                start_index: input.start_index,
            };

            let traversal = Traversal { input, production };
            vec![traversal]
        } else {
            // scan found no match, so no new traversals
            vec![]
        }
    }
    pub fn complete(&self, others: &[Self]) -> Vec<Self> {
        let Self { production, input } = self;
        others
            .iter()
            .enumerate()
            // find traversals which can be completed
            .filter(|(_, other)| {
                other.production.unmatched_terms.get(0) == Some(&production.left_hand_side)
                    && other.input.matching_index == input.start_index
            })
            .map(|(idx, other)| {
                // move the first unmatched term over to the matched
                let mut matched_terms = other.production.matched_terms.clone();
                let matched = TermMatch::NonTerminal(TraversalId(idx));
                matched_terms.push(matched);
                let unmatched_terms =
                    Vec::from_iter(other.production.unmatched_terms.iter().skip(1).cloned());

                let production = TraversalProduction {
                    left_hand_side: other.production.left_hand_side.clone(),
                    matched_terms,
                    unmatched_terms,
                };

                Traversal {
                    production,
                    input: other.input.clone(),
                }
            })
            .collect()
    }
    pub fn is_complete(&self, starting_term: &Term, full_input_len: usize) -> bool {
        &self.production.left_hand_side == starting_term
            && self.production.unmatched_terms.len() == 0
            && self.input.start_index == 0
            && self.input.matching_index == full_input_len
    }
}

So What?

Earley parsing throws a handful of new terminology and syntax at the reader, but I hope this article shows it isn't as complex as wikipedia makes it out to be. The dot cursor syntax (<dna> ::= <base> • <dna>) is a basic progress marker. predict, scan, and complete are formal terminology, but could be summarized as "try to match terms until you can't".
This isn't to say Earley is "easy". I definitely got dizzy 😵‍💫 in my first dozen attempts to understand Earley parsing. But I believe Earley parsing can be accessible, and maybe even a little fun.
id	production	input
0	`<dna> ::= • <base>`	`"$•GA"`
1	`<dna> ::= • <base> <dna>`	`"$•GA"`
id	production	input
2	`<base> ::= • "A"`	`"$•GA"`
3	`<base> ::= • "C"`	`"$•GA"`
4	`<base> ::= • "G"`	`"$•GA"`
5	`<base> ::= • "T"`	`"$•GA"`
id	production	input
6	`<base> ::= "G" •`	`"$G•A"`
7	`<dna> ::= <base=6> •`	`"$G•A"`
8	`<dna> ::= <base=6> • <dna>`	`"$G•A"`
id	production	input
6	`<base> ::= "G" •`	`"$G•A"`
15	`<base> ::= "A" •`	`"G$A•"`
16	`<dna> ::= <base=15> •`	`"G$A•"`
18	`<dna> ::= <base=6> <dna=16> •`	`"$GA•"`