@Smerity
Created November 19, 2017 20:55
WikiText: Python 2 post-processing used on Moses-tokenized input
# encoding=utf8
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import re

# Matches tokens made up only of digits with optional , or . separators (e.g. 1,000.50)
number_match_re = re.compile(r'^([0-9]+[,.]?)+$')
# Captures each , or . so it can be rewritten as a standalone @,@ or @.@ token
number_split_re = re.compile(r'([,.])')

for i, line in enumerate(sys.stdin):
    # Fix a silly tokenization that was never intended
    line = line.replace('< formula >', '<formula>')
    raw_tokens = [x for x in line.split() if x]
    tokens = []
    for token in raw_tokens:
        if number_match_re.match(token):
            token = number_split_re.sub(r' @\1@ ', token)
        tokens.append(token)
    # Starting each line with a blank token (i.e. a leading space) is required
    # Some systems replace \n with <eos> and assume, like in PTB, everything is space separated
    tokens = [''] + tokens + ['\n']
    line = ' '.join(tokens)
    sys.stdout.write(line)
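
To make the effect of the script easier to check, here is a minimal Python 3 sketch of the same per-line transformation written as a pure function. The function name `postprocess_line` and the upper-case constant names are hypothetical and not part of the original gist. The regular expressions and the joining logic are copied from the script above.

```python
import re

# Same patterns as the script above
NUMBER_MATCH_RE = re.compile(r'^([0-9]+[,.]?)+$')
NUMBER_SPLIT_RE = re.compile(r'([,.])')


def postprocess_line(line):
    """Hypothetical helper: apply the gist's per-line post-processing to one line."""
    # Undo the unintended tokenization of the <formula> placeholder
    line = line.replace('< formula >', '<formula>')
    tokens = []
    for token in line.split():
        # Only purely numeric tokens have their separators rewritten
        if NUMBER_MATCH_RE.match(token):
            token = NUMBER_SPLIT_RE.sub(r' @\1@ ', token)
        tokens.append(token)
    # Leading empty token gives a leading space; the newline becomes its own token
    return ' '.join([''] + tokens + ['\n'])


if __name__ == '__main__':
    print(repr(postprocess_line('It cost 1,000.50 dollars\n')))
    # prints ' It cost 1 @,@ 000 @.@ 50 dollars \n'
```

Each output line starts with a space and ends with a space before the newline, so a downstream tool that replaces `\n` with `<eos>` still sees space-separated tokens.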