|def tokenize(instring, delimiters=[',',':',';','[',']','+','-']):|
|    '''|
|    Tokenize a string of ASM code, splitting based on special characters|
|    but at the same time including delimiters (but not whitespace) in the set|
|    '''|
|    tokens = instring.split()|
|    for d in delimiters:|
|        newtokens = list()|
|        for t in tokens:|
|            raw = t.split(d)|
|            for r_idx, r_token in enumerate(raw):|
|                if r_token != '':|
|                    # element will be empty when delimiter begins or|
|                    # ends the string that was split|
|                    # so don't add empty elements|
|                    newtokens.append(r_token)|
|                if r_idx != len(raw)-1:|
|                    # put the delimiter back between the split pieces|
|                    newtokens.append(d)|
|        tokens = newtokens|
|    return tokens|
|test = "MOV [ R7 :R8],R0 ; Testing stuff"|
The unholy tokenizer (in perl to maximize unholiness):
or sanctified unholiness in python:
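A minimal sketch of what such a one-liner might look like in Python, assuming the delimiter set from `tokenize` above and `re.split` with a capturing group, which keeps the delimiters in the result. The name `DELIMS` and the sample input are only for illustration:

```python
import re

# Assumed delimiter set, copied from the tokenize() defaults above.
# The capturing group makes re.split() keep each delimiter as its own item.
DELIMS = r'([,:;\[\]+-])'

tokens = re.split(DELIMS, "[R7:R8],R0")
print(tokens)
# ['', '[', 'R7', ':', 'R8', ']', '', ',', 'R0']
```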
But if there are 2 delimiters in a row, you get a null string. If you need to strip them, then there are many ways:
or in python
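One way to drop the empty strings in Python, sketched with a list comprehension. The pattern, the helper name `split_nonempty` and the sample input are assumptions for illustration:

```python
import re

DELIMS = r'([,:;\[\]+-])'  # assumed delimiter set, as in tokenize() above

def split_nonempty(line):
    # Keep delimiters (capturing group) but discard the empty strings
    # that appear when two delimiters are adjacent or one starts the line.
    return [t for t in re.split(DELIMS, line) if t]

print(split_nonempty("[R7:R8],R0"))
# ['[', 'R7', ':', 'R8', ']', ',', 'R0']
```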
(You can also add a space to the set of delimiters)
Geez, regular expressions aren't that hard to learn. While regex is a poor approach to complex parsers, an assembler can be so simple that you can probably tokenize and parse at the same time with regexes, and maybe even map to machine code. If the assembly input doesn't match the syntax, you can use a regex test to issue nice error messages, e.g. "'MOVE' is not a valid operator" or "MOV requires an [Rn:Rm] operand". I would guess that a simple assembly parser might only be 10 lines of perl or 20 in python, plus whatever error messages.
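To make the error-message idea concrete, here is a rough sketch in Python. The badge's real instruction set isn't given in this thread, so the `OPERANDS` table, the `check_line` helper and the assumption that `;` starts a comment are all hypothetical:

```python
import re

# Hypothetical operand syntax per mnemonic; purely illustrative.
OPERANDS = {
    "MOV": (
        re.compile(r"^\[\s*R\d+\s*:\s*R\d+\s*\]\s*,\s*R\d+$"),
        "MOV requires an [Rn:Rm] operand",
    ),
}

def check_line(line):
    """Return an error message for one source line, or None if it looks valid."""
    code = line.split(";", 1)[0].strip()    # assume ';' starts a comment
    if not code:
        return None                         # blank or comment-only line
    m = re.match(r"^([A-Za-z]+)\s*(.*)$", code)
    if m is None:
        return f"cannot parse '{code}'"
    op, operands = m.group(1).upper(), m.group(2)
    if op not in OPERANDS:
        return f"'{op}' is not a valid operator"
    pattern, hint = OPERANDS[op]
    # The same regex both validates the operands and selects the message.
    return None if pattern.match(operands) else hint
```

With this table, `check_line("MOVE R1,R2")` reports the unknown operator, while `check_line("MOV R1")` falls through to the operand hint.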
BTW, you can make your regex easier to read/write if you do some cleanup/simplifications before parsing, e.g. (in perl for brevity):
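In Python, the same kind of pre-parse cleanup could be sketched as below. The specific rules (treating `;` as a comment start, upper-casing, trimming spaces around delimiters) and the helper name `clean` are assumptions for illustration, not the badge's actual syntax rules:

```python
import re

def clean(line):
    line = re.sub(r';.*', '', line)                     # assume ';' starts a comment
    line = re.sub(r'\s+', ' ', line).strip()            # collapse runs of tabs/spaces
    line = re.sub(r'\s*([,:\[\]+-])\s*', r'\1', line)   # trim spaces around delimiters
    return line.upper()                                 # make matching case-insensitive

print(clean("mov [ R7 :R8],R0 ; Testing stuff"))
# MOV[R7:R8],R0
```

After this step the parsing regex no longer needs `\s*` sprinkled between every element.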
Also, compose a complex regex on several lines with comments using variables.
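A sketch of that style in Python, using small named pieces plus `re.VERBOSE` so the pattern can carry comments. The register range (R0 to R15) and the MOV operand shape are assumptions made up for the example:

```python
import re

REG = r"R(?:1[0-5]|[0-9])"                  # hypothetical registers R0..R15
PAIR = rf"\[\s*{REG}\s*:\s*{REG}\s*\]"      # a register pair such as [R7:R8]

MOV_RE = re.compile(rf"""
    ^\s*MOV\s+          # mnemonic
    (?P<dst>{PAIR})     # destination register pair
    \s*,\s*             # operand separator
    (?P<src>{REG})      # source register
    \s*$
""", re.VERBOSE)

m = MOV_RE.match("MOV [ R7 :R8],R0")
print(m.group("dst"), m.group("src"))
# [ R7 :R8] R0
```

Each piece can be tested on its own, which helps when trimming a pattern down to find a mismatch.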
Thanks for this! I tried out your suggestions and they do work in almost all cases. As you mentioned, whitespace is not in the regex you used so, for instance, this fails with code like:
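As a hypothetical illustration of that failure mode (not necessarily the input that was tried), assume the split pattern contains only the delimiter characters. Whitespace then stays glued to the neighbouring tokens:

```python
import re

DELIMS_ONLY = r'([,:;\[\]+-])'   # assumed pattern: delimiters, no whitespace

line = "MOV [ R7 :R8],R0"
tokens = [t for t in re.split(DELIMS_ONLY, line) if t]
print(tokens)
# ['MOV ', '[', ' R7 ', ':', 'R8', ']', ',', 'R0']
```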
I have little to no experience with regex. I'll add it to my project list to try a dive into it once again (I've tried before, but TBH it's never felt exciting to me). From where I sit, the regex you have shown feels much harder to audit for errors than The holy tokenizer™, and harder to extend with new delimiters in the future.
However, I find the succinctness of your approach to be extremely tasty! This assembler is for a Hackaday conference badge to be unveiled after the great distancing has ended. Perhaps we should have an assembler optimization contest? Currently my version is 884 lines (including comments and helpful error messages) and 32k. I'd love to see how small this task can be made ;-)
Mike, you've piqued my curiosity about the use of regex in simple parsers and assemblers. Long ago I wrote one or two assemblers in high school (in assembly), then learned the now out-of-fashion LALR table-driven compilers (with error reporting and recovery) in college (we wrote our own equivalent to yacc/lex). I got into regex writing perl code to clean up messed-up data files, where it was a godsend for fixing weird abbreviations and corrupt data. Also in web scraping, custom scripts eventually evolved into a generic data converter with simple patterns read in and compiled into regex code. Basic regex OR patterns are not too hard. I rarely use lookahead or other advanced features, so I always have to look those up. But I've learned how to debug (trim down the pattern till you figure out the mismatch) and use regex wisely with some tricks, including table lookups in a substitution string. Perl is designed for regex, so it's hard not to drink the regex kool-aid using perl. :-)
I was trying to find the badge processor and didn't realize it's not released. I would be interested to see how a regex-based assembler might look. If you can post the assembly syntax and machine code (or post your 884 line prototype) I would be happy to see how some alternatives might look, and post a regex solution or whatever makes sense.
Small isn't the main goal, but less code, well commented and easy to understand (even by regex-phobes), and easy to extend is good.
With the python:
you get the string with None delimiters using |\s+ in the split:
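A sketch of that behaviour, assuming the same delimiter set as before with `|\s+` added outside the capturing group. Wherever the whitespace branch matches, the group takes no part in the match, so `re.split` puts `None` in the result:

```python
import re

# Assumed pattern: capture the delimiters, and also split on (but do not
# capture) runs of whitespace.
PATTERN = r'([,:;\[\]+-])|\s+'

raw = re.split(PATTERN, "MOV [ R7 :R8],R0 ; Testing stuff")
# raw holds None for every whitespace match and '' between adjacent matches.
tokens = [t for t in raw if t]   # drops both None and ''
print(tokens)
# ['MOV', '[', 'R7', ':', 'R8', ']', ',', 'R0', ';', 'Testing', 'stuff']
```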
Note if we first map multiple occurrences of any space character (tab/space) on a line to single ' ' and trim spaces around delimiters (
To make delimiters extensible (with space as a delimiter), use:
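One possible way to do this in Python is to build the pattern from a list at call time. The function name `tokenize_re` is made up for this sketch, and it treats whitespace as a separator that is dropped rather than kept as a token, which is an assumption:

```python
import re

def tokenize_re(instring, delimiters=(',', ':', ';', '[', ']', '+', '-')):
    # Longest first, so a multi-character delimiter wins over its prefix.
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape() protects characters like '[' and '+' that are special in a regex.
    kept = '|'.join(re.escape(d) for d in ordered)
    pattern = f'({kept})|\\s+'
    # Filter out the None and '' entries that re.split() produces.
    return [t for t in re.split(pattern, instring) if t]

print(tokenize_re("MOV [ R7 :R8],R0 ; Testing stuff"))
# ['MOV', '[', 'R7', ':', 'R8', ']', ',', 'R0', ';', 'Testing', 'stuff']
```

Adding a delimiter is then just a matter of passing a longer tuple, e.g. `tokenize_re("ADD R1#R2", delimiters=(',', '#'))`.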
Well, I'm not super excited to have it out there since it still feels a bit hacky. But it is relatively stable right now, and of course always happy to have help on passion projects like the conference badges ;-)
Here's a snapshot to play with: https://gist.github.com/szczys/b9a19714ea27d50be01d1a8479f97795