
@ingydotnet
Created August 27, 2012 15:32
19:01 <mst> http://paste.scsys.co.uk/206434
19:02 <mst> thoughts?
16:32 <ingy> my first thought is that SineSwiper is wrong in his assessment that parsing is better with lexing
16:32 <ingy> with a separate lexing step
16:33 <ingy> this is a major problem with coffeescript
16:33 <ingy> which I am currently trying to fix
16:35 <ingy> I think SS thinks that RecDescent means that you parse too deeply and waste time, but that is simply a matter of writing good grammars vs poor ones.
16:35 <ingy> My grammars are careful never to need much lookahead
16:36 <ingy> because I was thinking about that when I wrote them
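A hedged illustration of the no-lookahead point, assuming Pegex's documented pegex($grammar)->parse($input) interface; the grammar below is hypothetical, and the regex-atom spelling (WORD, DIGIT, quoted literals) follows the Pegex docs of this era:

    use strict;
    use warnings;
    use Pegex;

    # Each alternative commits or fails on a single anchored regex at the
    # current position, so choosing between them needs no token lookahead.
    my $grammar = <<'...';
    statement: assign | call
    assign: /( WORD+ ) '=' ( DIGIT+ )/
    call: /( WORD+ ) '()'/
    ...

    my $tree = pegex($grammar)->parse('x=42');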
16:36 <ingy> anyway, I think it's weird that he didn't talk to me about it
16:37 <ingy> the thing he should do is write some simple failing tests
16:37 <ingy> it is entirely possible that Pegex has flaws. almost certain in fact.
16:38 <ingy> but unless he can come up with test cases that I can't work around with good grammar writing, I am certainly not going to entertain lexing
... in #cdent
16:39 <@ingy> mst just made me aware of people griping that Pegex is not a Lex/Parse setup
16:39 <@ingy> and hi mst :)
16:40 <@ingy> Pegex does both lexing and parsing at the same time
16:41 <@ingy> so coffeescript does a full lex, then a lexical analysis, then finally a grammar parse
16:41 <@ingy> with lots of code in every stage
16:42 <@ingy> the pegex way is to do this all at once, and unless I'm sorely mistaken, Pegex parsing of CoffeeScript will be faster, and also work around all the heinous corners that coffee has painted itself into
16:43 <@ingy> func() if bool # is required to be on one line in coffee
16:44 <@ingy> this sucks when that expression grows past 80 columns
16:45 <@ingy> but the lexer would assign a newline to be a terminator token, and the parser isn't expecting that
16:45 <@ingy> which is not to say that the parser can't be smarter
16:46 <@ingy> but in my experience requiring the lexing to be completely separate from the parsing throws away too much context, and leaves you having to invent ways to deal with the resulting problems
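A plain-Perl sketch of that point (not Pegex itself, and not real CoffeeScript handling): when the matching machinery and the grammar live in one place, the rule itself decides whether a newline is whitespace or a terminator.

    use strict;
    use warnings;

    my $input = "foo()\nif cond";   # the postfix form, split across lines

    # Try an anchored regex at the current position; return its single
    # capture on success, undef on failure (/c keeps pos() put on failure).
    sub match {
        my ($re) = @_;
        return $input =~ /\G$re/gc ? $1 : undef;
    }

    sub statement {
        my $start = pos($input) // 0;
        my $call = match(qr/(\w+)\(\)/);
        if (defined $call) {
            # [ \t\n]+ deliberately treats the newline as plain whitespace:
            # the grammar writer's choice, made in the rule, with context.
            my $cond = match(qr/[ \t\n]+if[ \t\n]+(\w+)/);
            return defined $cond
                ? "postfix-if: $call when $cond"
                : "call: $call";
        }
        pos($input) = $start;   # cheap backtrack if nothing matched
        return undef;
    }

    pos($input) = 0;
    print statement(), "\n";    # prints "postfix-if: foo when cond"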
16:50 * sevvie smiles.
16:51 < sevvie> It just sounds like you lex well, to me.
16:53 < sevvie> It does bring up the question: what benefits do a balanced lexer and parser provide? (For all intents and purposes, I know nothing.)
16:54 <@ingy> mst: sevvie just made me realize that pegex could be set up to be "just a lexer"
16:54 <@ingy> I should have examples of this
16:55 <@ingy> doing it both ways
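A plain-Perl sketch of the "just a lexer" idea (illustration only, not the Pegex API): the same anchored-regex machinery, but the whole grammar is one flat, repeated token alternation, so the output degenerates from a parse tree into a token stream.

    use strict;
    use warnings;

    # Token rules tried in order at each position; NUMBER before WORD so
    # digits are not swallowed by \w+. This alternation is the whole "grammar".
    my @token_rules = (
        [ NUMBER => qr/\d+/       ],
        [ WORD   => qr/\w+/       ],
        [ OP     => qr{[()=+*/-]} ],
        [ WS     => qr/\s+/       ],
    );

    sub lex_tokens {
        my ($input) = @_;
        my @tokens;
        pos($input) = 0;
        TOKEN: while (pos($input) < length $input) {
            for my $rule (@token_rules) {
                my ($type, $re) = @$rule;
                if ($input =~ /\G($re)/gc) {
                    push @tokens, [ $type, $1 ] unless $type eq 'WS';
                    next TOKEN;
                }
            }
            die "lex error at offset " . pos($input) . "\n";
        }
        return @tokens;
    }

    # Prints: WORD(x) OP(=) WORD(foo) OP(() OP()) OP(+) NUMBER(42)
    printf "%s(%s) ", @$_ for lex_tokens("x = foo() + 42");
    print "\n";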
@SineSwiper

Yeah, sorry about not chatting about it earlier; thought you weren't online.

As a matter of context, I'm trying to take Pg's original SQL lexer/parser (flex/bison code) and convert it into Perl with its own interchange format. Because of the sheer size of the actual parsing code (the parser alone is 545 rules), this Pg parser project forces a balancing act between "C->Perl conversion" and "writing entirely new code". I'm fine with wholesale conversions of the various types; I had to convert huge batches of C code anyway. But if I start writing new code, I'm throwing away the work the Pg crew already figured out for me, so I'm trying to keep that sort of thing fairly minimized.

So far, I've successfully converted the thing into a working Eyapp module. I used that because I didn't know any better, and because it was the closest Perl analogue to Bison/yacc anyway. There are still some bugs here and there, but it's good enough to start re-converting into a more modern parser like Pegex. (And hopefully Pegex doesn't compile things into 8MB PM files, like Eyapp does.)

I'm currently in the process of converting the "Lexer" to Pegex, under the assumption that it will all be one "piece". (I think you have "#include" support somewhere; otherwise, it'll just be one large *.pgx file.) My goal here is to 'warp' the lexer so that it provides rules that turn the parser's tokens into rules. That would at least reduce the time I need to spend messing with the parser, though I still have to figure out how the AST/Receiver modules will play out.

I'm finding that I can replace huge sections of "old skool" lexer code with more modern Perl REs. (For example, the old scan.l file relies on start/end state changes for lexing that can be replaced with a single (start/content/end) RE.) However, I don't think I'll get away with many of those kinds of optimizations on the parser, besides maybe taking some of the large OR blocks and splitting them into subgroups.
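To make that concrete, here's the sort of single RE I mean, using Pg's dollar-quoted strings as a hypothetical example (scan.l lexes these with a dedicated xdolq start condition):

    use strict;
    use warnings;

    # One start/content/end RE with a backreference replaces the flex
    # start-condition dance for dollar quoting.
    my $dollar_quoted = qr{
        ( \$ (?: [A-Za-z_] \w* )? \$ )   # start tag: dollar, optional name, dollar
        ( .*? )                          # content, non-greedy, may span lines
        \1                               # end tag must equal the start tag
    }xs;

    my $sql = q{SELECT $q$anything, even 'quotes'$q$};
    if ($sql =~ $dollar_quoted) {
        print "tag=$1 content=$2\n";     # tag=$q$ content=anything, even 'quotes'
    }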

Anyway, if you could fill in the gaps on the Syntax POD (see pull request), that would be very helpful. I still can't figure out what those rule modifiers do.
