@MaximumEntropy
Last active November 11, 2017 10:23
Simple Python interface to the Moses tokenizer
import subprocess
import sys

tokenizer_path = sys.argv[1]  # Path to the Moses tokenizer: mosesdecoder/scripts/tokenizer/tokenizer.perl
text = sys.argv[2]  # Text to be tokenized
lang = sys.argv[3]  # Input language, e.g. en, fr, de

# tokenizer.perl reads its input from stdin, so the text is not passed as an argument.
pipe = subprocess.Popen(
    ["perl", tokenizer_path, "-l", lang],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
# communicate() writes the input, closes stdin, and reads stdout without risking a deadlock.
tokenized_output, _ = pipe.communicate(text.encode("utf-8"))
print(tokenized_output.decode("utf-8").strip())
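The same idea can be wrapped in a function so that several lines are tokenized with a single perl process instead of one process per string. This is only a sketch: the names `build_command`, `run_line_filter`, and `tokenize` are hypothetical helpers, not part of Moses, and it assumes the tokenizer emits one output line per input line.

```python
import subprocess


def build_command(tokenizer_path, lang):
    """Build the argv list for tokenizer.perl; -l selects the language."""
    return ["perl", tokenizer_path, "-l", lang]


def run_line_filter(command, lines):
    """Pipe lines through a line-oriented filter and return its output lines."""
    # One input line per string; the filter is assumed to answer line by line.
    payload = "\n".join(lines) + "\n"
    result = subprocess.run(
        command,
        input=payload.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the filter exits non-zero
    )
    return result.stdout.decode("utf-8").splitlines()


def tokenize(lines, tokenizer_path, lang="en"):
    """Tokenize a list of strings with one perl process (hypothetical helper)."""
    return run_line_filter(build_command(tokenizer_path, lang), lines)
```

A call might look like `tokenize(["Hello, world!"], tokenizer_path, "en")`, with `tokenizer_path` pointing at your local copy of tokenizer.perl. Strings that contain embedded newlines would break the line-for-line mapping, so split them before calling.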