@szczys
Last active November 28, 2020 07:32
The holy tokenizer
def tokenize(instring, delimiters=[',',':',';','[',']','+','-']):
    '''
    Tokenize a string of ASM code, splitting based on special characters
    but at the same time including delimiters (but not whitespace) in the set
    '''
    tokens = instring.split()
    for d in delimiters:
        newtokens = list()
        for t in tokens:
            raw = t.split(d)
            for r_idx, r_token in enumerate(raw):
                if r_token != '':
                    # element will be empty when delimiter begins or
                    # ends the string that was split
                    # so don't add empty elements
                    newtokens.append(r_token)
                if r_idx != len(raw)-1:
                    newtokens.append(d)
        tokens = newtokens
    return tokens

test = "MOV [ R7 :R8],R0 ; Testing stuff"
print(tokenize(test))
@carl3

carl3 commented Nov 28, 2020

Mike, you've piqued my curiosity about the use of regex in simple parsers and assemblers. Long ago I wrote one or two assemblers in high school (in assembly), then learned the now out-of-fashion table-driven LALR compilers (with error reporting and recovery) in college (we wrote our own equivalent of yacc/lex). I got into regex while writing Perl code to clean up messed-up data files, where it was a godsend for fixing weird abbreviations and corrupt data. I also used it in web scraping, where custom scripts eventually evolved into a generic data converter with simple patterns read in and compiled into regex code. Basic regex OR patterns are not too hard. I rarely use lookahead or other advanced features, so I always have to look those up. But I've learned how to debug (trim down the pattern until you find the mismatch) and how to use regex wisely, with some tricks including table lookups in a substitution string. Perl is designed for regex, so it's hard not to drink the regex Kool-Aid when using Perl. :-)

I was trying to find the badge processor and didn't realize it's not released. I would be interested to see how a regex-based assembler might look. If you can post the assembly syntax and machine code (or post your 884-line prototype), I would be happy to see how some alternatives might look, and to post a regex solution or whatever makes sense.

Small isn't the main goal, but less code, well commented and easy to understand (even by regex-phobes), and easy to extend is good.

With the python:

import re
instring = 'min EQU max - 11'  # assumed sample input; it reproduces the outputs listed below
# Plain split on delimiters excluding space
re.split(r'\s*([,:;\[\]+\-])\s*|\s+', instring)
# Clean null delimiters after split
list(filter(None, re.split(r'\s*([,:;\[\]+\-])\s*|\s+', instring)))
# Add space as a returned delimiter
re.split(r'\s*([,:;\[\]+\- ])\s*', instring)

you get the tokens with None entries wherever the |\s+ alternative matched (the capture group takes no part in that match, so re.split reports it as None):
['min', None, 'EQU', None, 'max', '-', '11']
and with None filtered out:
['min', 'EQU', 'max', '-', '11']
or you can extend the delimiter characters to include space:
['min', ' ', 'EQU', ' ', 'max', '-', '11']

Note that if we first map multiple occurrences of any space character (tab/space) on a line to a single ' ' and trim the spaces around delimiters (\s*[{some-delimiters}]\s*), then you can just write re.split(r'([,:;\[\]+\-])', instring). The \s* means 0 or more whitespace characters, the [] encloses the delimiter characters, and a \ precedes each special character.
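
A minimal sketch of that normalize-first idea, in case it helps to see it end to end. The helper name normalize_then_split is made up for illustration, and adding the space to the final character class (so that plain spaces still separate tokens) is my assumption about what is wanted:

```python
import re

# Same delimiter characters as in the patterns above, escaped for use inside [].
DELIMS = r',:;\[\]+\-'

def normalize_then_split(line):
    # Hypothetical helper name, for illustration only.
    # 1) Collapse every run of spaces/tabs to one space and trim both ends.
    line = re.sub(r'\s+', ' ', line).strip()
    # 2) Remove the spaces around each delimiter, keeping the delimiter itself.
    line = re.sub(r'\s*([' + DELIMS + r'])\s*', r'\1', line)
    # 3) Split on a delimiter or a single space (assumption: the space is
    #    added to the class here); the capture group keeps each separator.
    parts = re.split(r'([' + DELIMS + r' ])', line)
    # Drop empty strings and the bare-space separators.
    return [p for p in parts if p and p != ' ']

print(normalize_then_split("min   EQU\tmax - 11"))
# ['min', 'EQU', 'max', '-', '11']
```

Once the whitespace is normalized up front, the split pattern itself no longer needs any \s* handling, which may be easier to read for regex-phobes.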

To make delimiters extensible (with space as a delimiter), use:

delimiters = ' ,:[]+-;' # Delimiters to separate line items
delimpat = re.escape(delimiters) # Delimiters for use in a regex
tokens = list(filter(None,re.split(r'\s*(['+delimpat+r'])\s*',instring)))
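
Wrapped up as a function, this could look roughly like the sketch below. The name tokenize_re and the keep_space flag are hypothetical, added only so the result can be compared with the original tokenize() on its own test string:

```python
import re

def tokenize_re(instring, delimiters=' ,:[]+-;', keep_space=False):
    # Hypothetical wrapper; the name and the keep_space flag are illustrative.
    # re.escape makes every delimiter safe to place inside a character class.
    delimpat = re.escape(delimiters)
    # The capture group makes re.split return the delimiters as tokens too.
    parts = re.split(r'\s*([' + delimpat + r'])\s*', instring)
    # Drop empty strings; drop bare-space tokens unless the caller wants them.
    return [p for p in parts if p and (keep_space or p != ' ')]

print(tokenize_re("MOV [ R7 :R8],R0 ; Testing stuff"))
# ['MOV', '[', 'R7', ':', 'R8', ']', ',', 'R0', ';', 'Testing', 'stuff']
print(tokenize_re("min EQU max-11", keep_space=True))
# ['min', ' ', 'EQU', ' ', 'max', '-', '11']
```

With keep_space left at False, the first call gives the same token list as the loop-based tokenize() in the gist for that input.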

@szczys
Author

szczys commented Nov 28, 2020

Well, I'm not super excited to have it out there since it still feels a bit hacky. But it is relatively stable right now, and of course always happy to have help on passion projects like the conference badges ;-)

Here's a snapshot to play with: https://gist.github.com/szczys/b9a19714ea27d50be01d1a8479f97795
