@szczys
Last active November 28, 2020 07:32
The holy tokenizer
def tokenize(instring, delimiters=[',',':',';','[',']','+','-']):
    '''
    Tokenize a string of ASM code, splitting based on special characters
    but at the same time including delimiters (but not whitespace) in the set
    '''
    tokens = instring.split()
    for d in delimiters:
        newtokens = list()
        for t in tokens:
            raw = t.split(d)
            for r_idx, r_token in enumerate(raw):
                if r_token != '':
                    # An element will be empty when the delimiter begins or
                    # ends the string that was split, so don't add empty elements
                    newtokens.append(r_token)
                if r_idx != len(raw)-1:
                    newtokens.append(d)
        tokens = newtokens
    return tokens
test = "MOV [ R7 :R8],R0 ; Testing stuff"
print(tokenize(test))
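
If I've traced the loops correctly, the test line should print:

['MOV', '[', 'R7', ':', 'R8', ']', ',', 'R0', ';', 'Testing', 'stuff']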
@carl3 commented Nov 24, 2020

The unholy tokenizer (in perl to maximize unholiness):

s/\s*;.*//; # Trimming comments is easy in perl
@tokens = split(/\s*([,:;\[\]+\-])\s*| +/,$_);

or sanctified unholiness in python:

import re
tokens = re.split(r'\s*([,:;\[\]+\-])\s*|\s+',instring)

But if there are two delimiters in a row (or a leading or trailing delimiter), you get an empty string. If you need to strip those, there are many ways:

@tokens = grep(/./,split(/\s*([,:;\[\]+\-])\s*|\s+/,$_));
@tokens = grep {$_ ne ''}  split(/\s*([,:;\[\]+\-])\s*|\s+/,$_);
@tokens = map {$_ ne '' ? $_ : ()} split(/\s*([,:;\[\]+\-])\s*|\s+/,$_);

or in python

tokens = list(filter(None,re.split(r'\s*([,:;\[\]+\-])\s*|\s+',instring)))
tokens = [x for x in re.split(r'\s*([,:;\[\]+\-])\s*|\s+',instring) if x!='']
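
For what it's worth, running the filtered split on the gist's test string should give the same tokens as the original function (assuming I've traced the regex correctly):

import re
test = "MOV [ R7 :R8],R0 ; Testing stuff"
print([x for x in re.split(r'\s*([,:;\[\]+\-])\s*|\s+', test) if x])
# ['MOV', '[', 'R7', ':', 'R8', ']', ',', 'R0', ';', 'Testing', 'stuff']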

(You can also add a space to the set of delimiters)

Geez, regular expressions aren't that hard to learn. While regex is a poor approach to complex parsers, an assembler can be simple enough that you can probably tokenize and parse at the same time with regexes, maybe even map straight to machine code. If the syntax doesn't match on assembly input, you can use a regex test to issue nice error messages, e.g. "'MOVE' is not a valid operator" or "MOV requires an [Rn:Rm] operand". I would guess that a simple assembly parser might only be 10 lines of perl or 20 of python, plus whatever error messages.
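
A minimal sketch of that idea in python, assuming a made-up "MOV [Rn:Rm],Rd" syntax (the real badge instruction set isn't shown in this thread):

import re

def parse_mov(line):
    # Hypothetical syntax "MOV [Rn:Rm],Rd": tokenize, validate, and
    # extract the fields with a single regex match
    m = re.match(r'\s*(\w+)\s*\[\s*R(\d+)\s*:\s*R(\d+)\s*\]\s*,\s*R(\d+)\s*$', line)
    if m is None:
        return 'error: MOV requires an [Rn:Rm] operand'
    if m.group(1) != 'MOV':
        return "'%s' is not a valid operator" % m.group(1)
    op, rn, rm, rd = m.groups()
    return (op, int(rn), int(rm), int(rd))

print(parse_mov('MOV [R7:R8],R0'))   # ('MOV', 7, 8, 0)
print(parse_mov('MOVE [R7:R8],R0'))  # 'MOVE' is not a valid operator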

BTW, you can make your regex easier to read/write if you do some cleanup/simplifications before parsing, e.g. (in perl for brevity):

s/\s+/ /g; # Convert all whitespace spans to a single space
s/ *([^\s\w]+) */$1/g; # Trim space around non-word delimiters

Also, compose a complex regex on several lines with comments using variables.
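
In python the same idea might use re.VERBOSE (perl's /x flag is the equivalent), with the delimiter set kept in a variable; a sketch:

import re

delimpat = re.escape(',:;[]+-')   # delimiter characters, kept in one place
pattern = re.compile(rf'''
    \s* ( [{delimpat}] ) \s*   # a delimiter, with surrounding space trimmed
    | \s+                      # or a bare run of whitespace
''', re.VERBOSE)
print([t for t in pattern.split('MOV [ R7 :R8],R0') if t])
# ['MOV', '[', 'R7', ':', 'R8', ']', ',', 'R0']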

@szczys (author) commented Nov 26, 2020

Thanks for this! I tried out your suggestions and they work in almost all cases. As you mentioned, whitespace isn't returned as a delimiter by the regex you used, so, for instance, it fails on code like:

max EQU 15
min EQU max-11

I have little to no experience with regex. I'll add it to my project list and try diving into it once again (I've tried before, but TBH it's never felt exciting to me). From where I sit, the regex you showed feels much harder to audit for errors than The Holy Tokenizer™, and harder to extend with new delimiters in the future.

However, I find the succinctness of your approach extremely tasty! This assembler is for a Hackaday conference badge to be unveiled after the great distancing has ended. Perhaps we should have an assembler optimization contest? Currently my version is 884 lines (including comments and helpful error messages) and 32k. I'd love to see how small this task can be made ;-)

@carl3 commented Nov 28, 2020

Mike, you've piqued my curiosity about the use of regex in simple parsers and assemblers. Long ago I wrote one or two assemblers in high school (in assembly), then learned the now out-of-fashion table-driven LALR compilers (with error reporting and recovery) in college, where we wrote our own equivalent of yacc/lex. I got into regex writing perl code to clean up messed-up data files, where it was a godsend for fixing weird abbreviations and corrupt data, and also in web scraping, where custom scripts eventually evolved into a generic data converter with simple patterns read in and compiled into regex code. Basic regex OR patterns are not too hard. I rarely use lookahead or other advanced features, so I always have to look those up. But I've learned how to debug (trim down the pattern till you find the mismatch) and to use regex wisely with some tricks, including table lookups in a substitution string. Perl is designed for regex, so it's hard not to drink the regex kool-aid when using perl. :-)

I was trying to find the badge processor and didn't realize it's not released. I would be interested to see how a regex-based assembler might look. If you can post the assembly syntax and machine code (or post your 884-line prototype) I would be happy to see how some alternatives might look, and post a regex solution or whatever makes sense.

Small isn't the main goal, but less code, well commented and easy to understand (even by regex-phobes), and easy to extend is good.

With the python versions:

import re
instring = 'min EQU max-11'   # the example that failed above
# Plain split on delimiters excluding space
re.split(r'\s*([,:;\[\]+\-])\s*|\s+',instring)
# Clean null delimiters after split
list(filter(None,re.split(r'\s*([,:;\[\]+\-])\s*|\s+',instring)))
# Add space as a returned delimiter
re.split(r'\s*([,:;\[\]+\- ])\s*',instring)

Using |\s+ in the split, you get the string with None delimiters:
['min', None, 'EQU', None, 'max', '-', '11']
with the None entries filtered out:
['min', 'EQU', 'max', '-', '11']
or, extending the delimiter characters to include space:
['min', ' ', 'EQU', ' ', 'max', '-', '11']

Note that if we first map every run of whitespace (tab/space) on a line to a single ' ' and trim spaces around delimiters (\s*([{some-delimiters}])\s*), then you can just write re.split(r'([,:;\[\]+\-])', instring). The \s* means 0 or more whitespace characters, the [] encloses the delimiter characters, and \ precedes the special characters.
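
In python, that pre-cleanup might look roughly like this (a sketch; the input is assumed to already have comments stripped):

import re

line = 'MOV \t [ R7 :R8],  R0'                     # assumed comment-free input
line = re.sub(r'\s+', ' ', line)                   # collapse whitespace runs to one space
line = re.sub(r' ?([,:;\[\]+\-]) ?', r'\1', line)  # trim space around each delimiter
print([t for t in re.split(r'([,:;\[\]+\-])', line) if t])
# ['MOV', '[', 'R7', ':', 'R8', ']', ',', 'R0']
# note: any spaces still left would separate plain words, hence the
# space-as-delimiter variant below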

To make delimiters extensible (with space as a delimiter), use:

delimiters = ' ,:[]+-;' # Delimiters to separate line items
delimpat = re.escape(delimiters) # Delimiters for use in a regex
tokens = list(filter(None,re.split(r'\s*(['+delimpat+r'])\s*',instring)))
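
For instance, on the line that tripped things up earlier, this should give (matching the output shown above):

print(list(filter(None, re.split(r'\s*(['+delimpat+r'])\s*', 'min EQU max-11'))))
# ['min', ' ', 'EQU', ' ', 'max', '-', '11']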

@szczys (author) commented Nov 28, 2020

Well, I'm not super excited to have it out there since it still feels a bit hacky. But it is relatively stable right now, and of course I'm always happy to have help on passion projects like the conference badges ;-)

Here's a snapshot to play with: https://gist.github.com/szczys/b9a19714ea27d50be01d1a8479f97795
