Last active
January 3, 2024 17:15
-
-
Save DmitrySoshnikov/e2c3a793636dc03f2200 to your computer and use it in GitHub Desktop.
Recursive descent parser with simple backtracking
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/** | |
* = Recursive descent parser = | |
* | |
* MIT Style License | |
* By Dmitry Soshnikov <dmitry.soshnikov@gmail.com> | |
* | |
* In this short lecture we'll cover the basic (non-predictive, backtracking) | |
* recursive descent parsing algorithm. | |
* | |
* Recursive descent is an LL parser: scan from left to right, doing | |
* the left-most derivation. | |
* | |
* Top-down means it starts its analysis from the main start symbol, and goes | |
* down to parsing the sub-parts of this start symbol. And the left-most | |
* derivation means, that it tries to replace the left most non-terminal | |
* in a production. | |
* | |
* Consider the grammar (the "->" symbol reads as "can be represented as"): | |
* | |
* 1. E -> T | |
* 2. E -> T + E | |
* 3. T -> a | |
* 4. T -> a * T | |
* 5. T -> (E) | |
* | |
* For brevity, usually the same non-terminal symbols are combined using the | |
* alternative | symbol (reads as "or"), so we can represent the same grammar | |
* shorter: | |
* | |
* E -> T | |
* | T + E | |
* | |
* T -> a | |
* | a * T | |
* | (E) | |
* | |
* An example source string which this language accepts is: | |
* | |
* (a) | |
* | |
* And is derived as: | |
* | |
* E -> T -> (E) -> (T) -> (a) | |
* | |
* The recursive descent parsers are widely used on practice, since it's very | |
* easy to implement them by hands, and the implementation just directly maps | |
* the grammar to the code. | |
* | |
* The algorithm is simple: | |
* | |
* (0. Start from the main top symbol of the grammar, E in our case) | |
* | |
* 1. For each non-terminal symbol in the grammar implement corresponding | |
* parsing function (so e.g. to parse E we'll have `function E() {}`) | |
* | |
* 2. If there are several alternative productions for a non-terminal, | |
* implement each sub-production as a sub-function, and try them in order | |
* If some of the sub-productions succeeds, the main non-terminal is | |
* considered succeeded as well. If a sub-production *does not* succeed, | |
* then *restore the cursor* (do backtracking) to the beginning of that | |
* sub-production, and try parsing next sub-production. | |
* | |
* So for E we going to implement E1 for T, and E2 for T + E, and call | |
* E1 and E2 in order from main E: | |
* | |
* E = E1 || E2; | |
* | |
* E1 = T; | |
* E2 = T + E; | |
* | |
* 3. Implement function that checks for the presence of a needed token at | |
* the current cursor position in the source code. We'll call this function | |
* `term(...)`, and e.g. to check terminal "a" in the source stream at | |
* cursor position, we do term("a"). After the check (positive or negative) | |
* it also advances the cursor to the next token. | |
* | |
* Notice one important thing in the point (2): if during parsing of a | |
* sub-production we realize that it won't succeed, we need to restore | |
* (to backtrack) the cursor in order to try the next alternative. I.e. in this | |
* version of recursive descent we don't do any prediction, but do full | |
* backtracking in case of a fail. On practice a recursive descent usually | |
* implements the predictive algorithm, which we'll cover in the next lectures. | |
* | |
* OK, enough theory, let's start to code! ;) | |
*/ | |
/** | |
* This tracks position of a current token. | |
*/ | |
var cursor = 0; | |
/** | |
* This is for restoring the cursor at backtracking. We save | |
* the cursor every time before trying an alternative sub-production. | |
*/ | |
var savedCursor = cursor; | |
// So according to point 0, we should start our parsing from the main top | |
// symbol, and then to go recursively down to sub-parts. In our case it's | |
// the symbol E, and as we see in the grammar it has two alternatives | |
// (T and T + E), which we encode as separate sub-functions E1 and E2. | |
/** | |
* Main symbol E. | |
* | |
* E -> T | |
* | T + E | |
* | |
* Succeeds if any first of its alternatives succeed. And as we mentioned, | |
* we backtrack on each fail, in order to try the next alternative. For | |
* that we have to save the cursor before trying the alternative. | |
*/ | |
function E() { | |
return (saveCursor(), E1()) || (backtrack(), saveCursor(), E2()); | |
} | |
/** | |
* E1 is the first alternative of E, and corresponds to the T. | |
*/ | |
function E1() { | |
return T(); | |
} | |
/** | |
* E2 is the second alternative of E, and corresponds to the T + E. | |
* | |
* Several interesting parts here: | |
* | |
* - E2 is going to succeed only if *all* of its sub-productions succeed | |
* (notice the usage of && operator for that). | |
* | |
* - We see this `term('+')` from the rule (3) of the parsing algorithm, | |
* that checks and expects the token "+" in the source stream, and | |
* fails if doesn't find this token there. | |
* | |
* - Here we first start to see the recursion for E(), hence the name | |
* of the parser: *recursive* descent. | |
*/ | |
function E2() { | |
return T() && term('+') && E(); | |
} | |
// OK, we're done with E! Now let's encode the rules for T. | |
/** | |
* Implements T non-terminal. | |
* | |
* T -> a | |
* | a * T | |
* | (E) | |
* | |
* The same as E, it has alternatives: T1, T2, and T3. | |
*/ | |
function T() { | |
return (saveCursor(), T1()) || | |
(backtrack(), saveCursor(), T2()) || | |
(backtrack(), saveCursor(), T3()); | |
} | |
/** | |
* T1 alternative, T -> a | |
*/ | |
function T1() { | |
return term('a'); | |
} | |
/** | |
* T2 alternative, T -> a * T. | |
* | |
* We see two terminals go each after another, and again recursion for T. | |
*/ | |
function T2() { | |
return term('a') && term('*') && T(); | |
} | |
/** | |
* T3 alternative, T -> (E). | |
* | |
* Here we see a recursion even to the very first E symbol. | |
*/ | |
function T3() { | |
return term('(') && E() && term(')'); | |
} | |
/** | |
* Saves position of the current cursor for possible future backtracking. | |
*/ | |
function saveCursor() { | |
savedCursor = cursor; | |
} | |
/** | |
* Restores cursor position to the saved one. | |
*/ | |
function backtrack() { | |
cursor = savedCursor; | |
} | |
/** | |
* And here's our `term(...)` function, that checks for the | |
* presence of a needed terminal in the stream, and advances | |
* the cursor. | |
*/ | |
function term(expected) { | |
return getNextToken() === expected; | |
} | |
/** | |
* Returns the next token in the stream, and advances the cursor. | |
* | |
* For simplicity we have only one-character tokens in this language | |
* (the "a", "+", "(" and ")"), so we just return the next non-whitespace | |
* symbol. And a real tokenizer returns tokens that may consist of several | |
* characters, e.g. a token for keyword "int", or "while", etc. | |
*/ | |
function getNextToken() { | |
// Skip whitespace. | |
while (source[cursor] === ' ') cursor++; | |
var nextToken = source[cursor]; | |
cursor++; | |
return nextToken; | |
} | |
/** | |
* Stores the source. | |
*/ | |
var source; | |
/** | |
* Main parsing function, returns true if the string is accepted | |
* by the grammar, or false otherwise. | |
*/ | |
function parse(s) { | |
source = s; | |
cursor = 0; | |
// We succeed if our main E symbol succeeds *and* we parse | |
// all tokens (reached end of the source string). | |
return E() && cursor == source.length; | |
} | |
// Test time! | |
console.log('(a)', parse('(a)')); // true | |
console.log('-a', parse('-a')); // false | |
// Seems legit so far. Now let's try parsing the: | |
// --- Simple backtracking limitation -- | |
console.log('a * a', parse('a * a')); // false! | |
// Why is it false? It's clearly in the grammar of our language. | |
// Is it a bug in the parser? Not really. | |
// Here we encounter the first limitation of the recursive descent parsing. | |
// In fact, the backtracking algorithm we implemented here is kind of "naive": | |
// the parser succeeded when it found first "a" token (since it corresponds to | |
// the T1 alternative of T), and then it stopped parsing. However, we haven't | |
// yet reached the end of the string. Had we tried the T2 production first | |
// (a * T), the string would be parsed. So the conclusion which we can make | |
// here is: | |
// | |
// - This implementation of the recursive descent is capable to parse | |
// only productions with one alternative that starts with the same token | |
// (it cannot parse "a" and "a * T" if tries just "a" first). So the | |
// backtracking algorithm we have here is *not generic*. | |
// | |
// - However, it's possible to implement a generic backtracking that would | |
// consider as well the fact, that the cursor hasn't reached yet the | |
// end of the source string, and that we should probably try another | |
// alternative (even if the previous alternative succeeded!), T2 in our | |
// case, that would parse the "a * a". | |
// | |
// - Another approach to fix it, is to restructure our grammar. This | |
// technique is called *left factoring*, which we wil cover in later | |
// lectures. | |
// Additional info can be found in the "Dragon book", and in other lectures | |
// e.g from prof. Alex Aiken. | |
// Exercise: Implement generic backtracking that would parse "a + a". |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
❤️