@shyjuzz
Created February 13, 2020 07:38
Tokenize a string with start and end index in Python
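import re


def tokenize(txt):
    """Split txt on '; ', ', ', '*' or a newline and return (token, start, end) tuples."""
    output = []
    # Raw string so that \* reaches the regex engine as an escaped asterisk.
    tokens = re.split(r'; |, |\*|\n', txt)
    offset = 0
    for token in tokens:
        # Search from the end of the previous token so repeated tokens get distinct positions.
        offset = txt.find(token, offset)
        output.append((token, offset, offset + len(token)))
        offset += len(token)
    return output


s = 'name, account balance, total equity, assets, liabilities, last update date'
for token in tokenize(s):
    print(token)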
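
A possible variation, sketched here as an alternative rather than as part of the original gist: instead of splitting first and then locating each token with `txt.find`, walk the delimiter matches with `re.finditer` and take the text between them. The offsets then come straight from the match objects, so a token that also occurs inside a delimiter (a lone space, for example) cannot be matched at the wrong position. The names `DELIMITERS` and `tokenize_with_spans` are made up for this sketch.

```python
import re

# Same delimiter set as tokenize() above; the constant name is hypothetical.
DELIMITERS = re.compile(r'; |, |\*|\n')


def tokenize_with_spans(txt):
    """Yield (token, start, end) for the text between delimiter matches."""
    start = 0
    for match in DELIMITERS.finditer(txt):
        # Everything from the previous delimiter's end up to this delimiter is one token.
        yield txt[start:match.start()], start, match.start()
        start = match.end()
    # Whatever follows the last delimiter is the final token.
    yield txt[start:], start, len(txt)


s = 'name, account balance, total equity'
spans = list(tokenize_with_spans(s))
for token, start, end in spans:
    print(token, start, end)
```

Each tuple satisfies `s[start:end] == token`, which makes the result easy to check against the input string.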