Skip to content

Instantly share code, notes, and snippets.

@japerk
Created February 25, 2012 16:36
Show Gist options
  • Star 23 You must be signed in to star a gist
  • Fork 7 You must be signed in to fork a gist
  • Save japerk/1909413 to your computer and use it in GitHub Desktop.
Save japerk/1909413 to your computer and use it in GitHub Desktop.
NLTK Tokenization, Tagging, Chunking, Treebank

Sentence Tokenization

>>> from nltk import tokenize >>> para = "Hello. My name is Jacob. Today you'll be learning NLTK." >>> sents = tokenize.sent_tokenize(para) >>> sents ['Hello.', 'My name is Jacob.', "Today you'll be learning NLTK."]

Word Tokenization

>>> sent = tokenize.word_tokenize(sents[2]) >>> sent ['Today', 'you', "'ll", 'be', 'learning', 'NLTK', '.']

go to http://text-processing.com/demo/tokenize/ to see how all NLTK word tokenizers work

POS Tagging

>>> from nltk import tag >>> tagged_sent = tag.pos_tag(sent) >>> tagged_sent [('Today', 'NN'), ('you', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('learning', 'VBG'), ('NLTK', 'NNP'), ('.', '.')]

see http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html for a list of treebank POS tags

NE Chunking

>>> from nltk import chunk >>> tree = chunk.ne_chunk(tagged_sent) >>> tree Tree('S', [('Today', 'NN'), ('you', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('learning', 'VBG'), Tree('ORGANIZATION', [('NLTK', 'NNP')]), ('.', '.')]) >>> tree.draw()

Treebank Corpus

>>> from nltk.corpus import treebank_chunk >>> treebank_chunk.tagged_sents()[0] [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')] >>> treebank_chunk.chunked_sents()[0] Tree('S', [Tree('NP', [('Pierre', 'NNP'), ('Vinken', 'NNP')]), (',', ','), Tree('NP', [('61', 'CD'), ('years', 'NNS')]), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), Tree('NP', [('the', 'DT'), ('board', 'NN')]), ('as', 'IN'), Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD')]), ('.', '.')]) >>> treebank_chunk.chunked_sents()[0].draw()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment