-
-
Save meijeru/1046347 to your computer and use it in GitHub Desktop.
REBOL [ | |
Title: "Red/System lexical analysis" | |
Date: 1-Jul-2011 | |
Name: "Reds lexer" | |
Type: none | |
Version: 1.0.0 | |
File: %/G/Projects/Common/RED/red-system/sources/reds-lexer/reds-lexer.r | |
Home: http://users.telenet.be/rwmeijer | |
Author: "Rudolf W. Meijer" | |
Rights: "Copyright (C) 2011 Rudolf W. Meijer. All Rights Reserved" | |
History: [ | |
0.0.0 [19-Jun-2011 {Start of project} "RM"] | |
0.5.0 [24-Jun-2011 {First working version} "RM"] | |
0.7.0 [27-Jun-2011 {Added file! and tuple! literals} "RM"] | |
0.8.0 [27-Jun-2011 {Simplified the separator} "RM"] | |
0.9.0 [29-Jun-2011 { | |
Separator reduced to stripping comments only, | |
Grammar takes care of whitespace | |
} "RM" ] | |
1.0.0 [1-Jul-2011 {Grammar takes care of comments also} "RM"] | |
] | |
] | |
;---|----1----|----2----|----3----|----4----|----5----|----6----|----7----|- | |
do %reds-lex-grammar.r | |
reds-lexer: func [ | |
inp [file! url! binary! string!] | |
][ | |
unless string? inp [ | |
unless binary? inp [inp: read/binary inp] | |
inp: to-string inp | |
] | |
either empty? inp | |
[ | |
print "nothing to analyse: empty input" | |
][ | |
print "start" | |
; diagnostic | |
["parse" dt [ | |
parse/all inp lex-grammar/program | |
]] | |
; parsed source is in lex-grammar/source | |
] | |
] | |
test-text: | |
%../../tests/source/units/exit-test.reds | |
print "call" | |
reds-lexer | |
;copy | |
test-text | |
; diagnostic | |
print mold/all head lex-grammar/source | |
ask "" | |
halt |
I discovered I will have to put in files and tuples because they occur in headers; have found a few bugs also. Next version to come soon.
This (0.7.0) is now the version with tuples and files in, and it has parsed all unit test scripts. My only worry is that it is rather slow on the two long auto-test files (6000 and 10000 lines, respectively) where it took between 20 and 50 seconds. I fear it is O(n2) or worse (exponential?). Still, it may serve as basis for a better written lexer.
Most of the above-mentioned time is spent in the separator; the parse is fast (2.5 seconds for the longest file). Unfortunately, when Red is introducing more of the datatypes (and their literals) that we appreciate in REBOL, the separation process will become even more intricate. I did it for REBOL 3, and the code is twice as long as the one presented here for Red/System.
I have now signifiantly simplified the code for the Red/System case and replaced REBOL append
and take
by their native equivalents. the saving is huge! byte-auto-test.reds
now separates in less than 5 seconds. The new versions 0.8.0 are up!
Nice progress!
I added the % operator.
I will review and test your lexer in depth, in a few days, to see if I can replace the current LOAD native with it. It would certainly open new options for better syntactic constructions of some literal datatypes (that would not be allowed by REBOL's scanner), so I am very interested in it, both for Red/System and Red (at least for the bootstrapping phase, once Red up and running, we would port the lexer to Red).
Please ignore any version prior to today 30-Jun-2011. I just simplified the "separator" into a comment-stripper and let the grammar take care of the whitespace. Previously, I did not think this could be done, that is why before today I chose the more complicated route which takes much longer unnecessarily..
I have now managed to eliminate even the comment-stripping: REBOL's parse is just more powerful than I suspected. So the grammar takes care of everything. See version 1.0.0 of 1-Jul.
Lexer and grammar files downloaded, having just a quick look now, will do all the testing tomorrow. Hope it could be used as a drop-in replacement to LOAD (or at least would not require too much work to do so). It looks very exciting anyway. :-)
I realize it is not geared to #include and #define, so that part will have to be (re-)done anyway.
A few comments from my first review session:
-
It is sometime up to 50x slower than
LOAD
(using%tests/source/units/auto-tests/byte-auto-test.reds
for example, I get 11ms with LOAD and 504ms withreds-lexer
). It seems that a lot of time is spent indecode-string
, could it be rewritten usingparse
rules instead? This is not a big issue at this point, it just shows how slow REBOL is... -
Using
%tests/source/units/auto-tests/integer-auto-test.reds
as input, I get the following error:** Math Error: Math or number overflow ** Where: store-integer ** Near: pow: pow * 10
-
The previous point makes me realize that
reds-lexer
is currently lacking error catching support, including proper reporting of the input position where the scanning failed (required to be able to generate accurate syntax error messages in Red/System). This point it really important for using it as a LOAD replacement. -
It will require significant work (maybe a day or two) to integrate
reds-lexer
in compiler. Mainly re-wiring properly, removing or replacing deferred syntax checking in compiler and extensive testing/fixing for regressions. -
Related to this work, I would be very interested in a formal Red/System syntax grammar specification (ideally using BNF format). Let me know if you are interested.
I am still interested in replacing LOAD with reds-lexer
, but I guess it is not doable before going beta (announce planned for tomorrow). I guess that reds-lexer
integration could be achieved probably later this week or next week.
Thank you for your nice work!
I do have a Red/System syntax grammar ready, as a Word file (BNF productions - albeit with ambiguity - and semantic comments). I will send it to your nr@red-lang.org address shortly.
This grammar does not consider the shift operators that were added just today.
Can it already parse existing Red/System sources, like %hello.reds, %common.reds, and all the unit test scripts?
Nice work, by the way!