@applenob
Created February 22, 2019 06:41
A simple English tokenizer that splits punctuation marks into separate tokens.
# Refer to https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline
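import re, string

# Match any single ASCII punctuation mark or one of the extra symbols listed.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    # Pad each punctuation mark with spaces, then split on whitespace.
    return re_tok.sub(r' \1 ', s).split()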
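
A minimal usage sketch may help show what the tokenizer produces. The block repeats the definition so it runs on its own. The sample sentences are made up for illustration and are not part of the original gist.

```python
import re
import string

# Same pattern as the gist: one punctuation mark per match, captured as group 1.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    # Surround each punctuation mark with spaces, then split on whitespace.
    return re_tok.sub(r' \1 ', s).split()

# Hypothetical sample input, chosen to show how apostrophes are handled.
sample = "Don't panic, it's only a test!"
tokens = tokenize(sample)
print(tokens)
# ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'only', 'a', 'test', '!']

# Every punctuation mark is split off, so decimals are broken up too.
print(tokenize("3.5"))
# ['3', '.', '5']
```

Note that contractions and decimal numbers are split at the punctuation mark. This is likely acceptable for bag-of-words features such as those in the referenced Kaggle kernel, but it may not suit tasks that need to keep such tokens intact.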