@menzenski
Created September 12, 2013 04:02
Tokenizes Russian text
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import nltk


def print_list(mylist):
    '''Print a list of strings with Cyrillic characters shown unescaped.'''
    print('[' + ', '.join(mylist) + ']')


# Read the whole text file as UTF-8.
with open("masterandmargarita.txt", encoding="utf8") as data:
    text = data.read()

# word_tokenize relies on NLTK's Punkt tokenizer data being installed.
tokens = nltk.word_tokenize(text)

print(len(tokens))       # total number of tokens
print(len(set(tokens)))  # number of distinct tokens (case-sensitive)
print_list(tokens[:200])
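
The distinct-token count in the script is case-sensitive and includes punctuation marks, since the tokenizer returns those as tokens too. The following is a minimal sketch of one way to count words instead, not part of the original gist. It assumes a regex-based `simple_tokenize` as a dependency-free stand-in for `nltk.word_tokenize`, and the names `simple_tokenize`, `word_counts`, and the sample sentence are all hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for nltk.word_tokenize (an assumption, not NLTK's
# algorithm): a token is a run of word characters, optionally hyphenated,
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def word_counts(tokens):
    """Count lowercased tokens, skipping tokens with no letters or digits."""
    return Counter(
        tok.lower() for tok in tokens if any(ch.isalnum() for ch in tok)
    )


# Made-up sample sentence, used only to exercise the functions.
sample = "Кот сел на стул. Кот молчал, и стул молчал."

tokens = simple_tokenize(sample)
counts = word_counts(tokens)

print(len(tokens))       # all tokens, punctuation included
print(len(set(tokens)))  # distinct tokens, case-sensitive
print(len(counts))       # distinct words after lowercasing
print(counts.most_common(3))
```

With the real script, `word_counts(tokens)` could be applied directly to the list returned by `nltk.word_tokenize`, since it only assumes a list of strings.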