@justinmeiners
Created October 30, 2022 01:20
From: https://old.reddit.com/r/compsci/comments/kgpjv/can_somebody_please_explain_the_difference/
Standard recursive descent parsers are LL(1) parsers, yes (the 1 means that you only look at the very next token to decide which rule to apply). Because of the whole recursive descent thing, LL parsers are top-down parsers (whereas LR parsers are bottom-up).
Their big downside for parsing programming languages is infix operator expressions. For example, suppose you want a simple grammar that matches a language in which you can add and multiply binary digits (ignoring operator precedence for now):
Expr  <- Expr "*" Digit
       | Expr "+" Digit
       | Digit
Digit <- "0"
       | "1"
That's one straightforward way to write the grammar, right? The only problem is that an LL parser can't parse it. Suppose you want to parse the string "1+1". It won't work. Why not? You start in "Expr" (the start symbol), then try its first possible expansion, which itself begins with "Expr": infinite recursion! Hence the name "left recursion". To fix it, you have to rewrite the grammar (a separate topic), and the resulting parse tree can be a bit weird.
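The usual rewrite (not spelled out in the comment above) replaces the left recursion with iteration, e.g. `Expr <- Digit (("+" | "*") Digit)*`. Here is a minimal recursive-descent sketch of that rewritten grammar in Python; the function names are illustrative, not from any particular library:

```python
def parse_expr(tokens):
    # Recursive-descent parser for the left-recursion-free rewrite:
    #   Expr  <- Digit (("+" | "*") Digit)*
    #   Digit <- "0" | "1"
    # Builds left-nested tuples to recover the left-associative shape
    # of the original left-recursive grammar.
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def digit():
        nonlocal pos
        if peek() in ("0", "1"):
            pos += 1
            return tokens[pos - 1]
        raise SyntaxError(f"expected digit, got {peek()!r}")

    def expr():
        nonlocal pos
        tree = digit()
        while peek() in ("+", "*"):   # iteration replaces the left recursion
            op = tokens[pos]
            pos += 1
            tree = (tree, op, digit())
        return tree

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree
```

For example, `parse_expr(list("1+1*0"))` yields `(("1", "+", "1"), "*", "0")`. The loop is where the "weird" parse tree comes from: the grammar no longer expresses the nesting directly, so the parser has to rebuild it by nesting on the left as it goes.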
Let's turn to LR instead. I'm not going to talk about LALR in particular; all I'll be giving you is a rough idea about how the whole class of LR parsers works. This seems to be the most difficult part of understanding LR, anyway.
Roughly, what you do here instead is a recursive ascent. That is, you consume terminals onto a stack until the sequence of symbols on the top of the stack matches a right-hand side (RHS) of the grammar, then you replace that sequence on the stack with the LHS of the same rule. You keep doing that until you have consumed everything and all that's left on the stack is the start symbol. While you do all that, you can generate a parse tree or whatever.
The tricky part is making that somewhat efficient. Without going into too much detail, LR parsers usually have an internal set of states that act as an intermediate representation: they encode what you have seen recently and in which context. Instead of putting symbols on the stack, you now have states on it (one for each symbol). Two states can represent the same symbol on the stack, but instruct you to apply different rules next.
Of course, that means that you need to have some lookup table that tells you what to do at each given state (top of stack): just add another state to the stack ("shift"); or remove a bunch of states, generate another node of the parse tree, and add a new state to the stack ("reduce"); or accept the whole string; or fail.
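For a grammar this small, the table can be built by hand. The sketch below hard-codes a plausible LR(0) ACTION/GOTO table for the grammar above (augmented with a start rule S -> Expr, and with "$" marking end of input); the state numbering is my own choice for illustration, and a real generator would produce equivalent tables automatically:

```python
# Numbered rules: rule -> (lhs, length of rhs).
RULES = {1: ("E", 3),  # E -> E * B
         2: ("E", 3),  # E -> E + B
         3: ("E", 1),  # E -> B
         4: ("B", 1),  # B -> 0
         5: ("B", 1)}  # B -> 1

# ACTION[state][lookahead]: ("s", state) = shift, ("r", rule) = reduce, ("acc",) = accept.
ACTION = {
    0: {"0": ("s", 1), "1": ("s", 2)},
    1: {t: ("r", 4) for t in "01*+$"},
    2: {t: ("r", 5) for t in "01*+$"},
    3: {"*": ("s", 5), "+": ("s", 6), "$": ("acc",)},
    4: {t: ("r", 3) for t in "01*+$"},
    5: {"0": ("s", 1), "1": ("s", 2)},
    6: {"0": ("s", 1), "1": ("s", 2)},
    7: {t: ("r", 1) for t in "01*+$"},
    8: {t: ("r", 2) for t in "01*+$"},
}

# GOTO[state][nonterminal]: state to enter after a reduce.
GOTO = {0: {"E": 3, "B": 4}, 5: {"B": 7}, 6: {"B": 8}}

def lr0_parse(text):
    tokens = list(text) + ["$"]
    stack = [0]                      # stack of states, as described above
    pos = 0
    while True:
        action = ACTION[stack[-1]].get(tokens[pos])
        if action is None:
            return False             # no table entry: parse error
        if action[0] == "acc":
            return True
        if action[0] == "s":         # shift: push the next state, consume a token
            stack.append(action[1])
            pos += 1
        else:                        # reduce: pop |rhs| states, then take the goto
            lhs, n = RULES[action[1]]
            del stack[-n:]
            stack.append(GOTO[stack[-1]][lhs])
```

Note that the main loop never looks at the grammar: every decision is a table lookup on the current state and lookahead, which is what makes the whole thing linear in the input.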
Generating this lookup table is fairly tedious to do by hand, which is why you'll almost never see LR parsers outside of parser generators... and we haven't even talked about adding lookahead (using more than one unconsumed token to figure out which action to take)! In fact, handling lookahead is even more complicated for LALR parsers than for SLR (simple LR) parsers, because you have to analyse the previously generated lookup table in order to determine the "lookahead sets" needed for it.
The reason LALR parsers are so popular anyway is that, once you have a generator for the tables and lookahead sets, the whole thing ends up much more compact than a general LR parser, and it's just as fast (linear in the size of the input).
Wikipedia can walk you through a simple LR(0) parser for the same grammar I described above: http://en.wikipedia.org/wiki/LR_parser
It took me quite a while to find a way of looking at LR parsers that makes some kind of sense. I hope this helps you figure it out much faster.