Basic example of using NLTK for named entity extraction.
import nltk

with open('sample.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'node') and t.node:
        if t.node == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
#print entity_names

# Print unique entity names
print set(entity_names)

I'm sorry, but your code isn't working: nltk.batch_ne_chunk raises "'module' object has no attribute 'batch_ne_chunk'".
Please suggest what to do.

Hi Rsingh, the NLTK 3.0 docs say: chunk.batch_ne_chunk() → chunk.ne_chunk_sents()
I replaced that and the script works again 😄

hugo

It also seems there are more changes in NLTK 3.0.

Also change 'node' to 'label()':

if hasattr(t, 'label') and t.label:
    if t.label() == 'NE':
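
The rename from 'node' to 'label()' can also be handled in one place, so the same extraction function runs on either NLTK generation. The following is only a sketch: tree_label is a hypothetical helper name, and DemoTree is a hypothetical stand-in for an NLTK chunk tree (a list of children with a label) so that the example runs without any NLTK data installed.

```python
def tree_label(t):
    """Return the label of a chunk-tree node, or None for a (word, tag) leaf."""
    label = getattr(t, "label", None)
    if callable(label):
        return label()                 # NLTK 3.x style: label is a method
    return getattr(t, "node", None)    # NLTK 2.x style: node is an attribute


def extract_entity_names(t):
    """Collect the text of every subtree labelled 'NE'."""
    label = tree_label(t)
    if label is None:                  # plain (word, tag) tuple, nothing to collect
        return []
    if label == "NE":
        return [" ".join(child[0] for child in t)]
    names = []
    for child in t:
        names.extend(extract_entity_names(child))
    return names


class DemoTree(list):
    """Hypothetical stand-in for an NLTK tree: a list of children plus a label."""

    def __init__(self, label, children):
        super(DemoTree, self).__init__(children)
        self._label = label

    def label(self):
        return self._label


# Hand-built tree shaped like the chunker's output for one sentence.
sentence = DemoTree("S", [
    DemoTree("NE", [("New", "NNP"), ("York", "NNP")]),
    ("is", "VBZ"),
    ("big", "JJ"),
])
names = extract_entity_names(sentence)
print(names)
```

With real NLTK output, the trees produced by the chunker would be passed to extract_entity_names in place of the hand-built DemoTree.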

ririw commented Jul 3, 2015

For future readers, here's a version that works for me, using NLTK version 3.0.3

import nltk 
with open('sample.txt', 'r') as f:
    sample = f.read()


sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label:
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
#print entity_names

# Print unique entity names
print set(entity_names)

Thanks for this.

Rahulvks commented Jan 5, 2016

I am facing an error when I run the code:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8e in position 518: ordinal not in range(128)

Try
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
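
Note that reload(sys) and sys.setdefaultencoding are Python 2 only and are generally discouraged. An alternative sketch, assuming sample.txt is UTF-8 encoded, is to decode the file explicitly when reading it. read_text is a hypothetical helper name, and the temporary file below only exists to make the example self-contained. If the file is in another encoding (for example cp1252), that name would have to be passed instead.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a text file with an explicit encoding (works on Python 2 and 3).

    errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    UnicodeDecodeError; drop it if a hard failure is preferred.
    """
    with io.open(path, "r", encoding=encoding, errors="replace") as f:
        return f.read()


# Demo: write a small UTF-8 file containing non-ASCII characters, then read it back.
tmp = tempfile.NamedTemporaryFile(suffix=".txt", delete=False)
tmp.write(u"Zo\u00eb visited K\u00f6ln.".encode("utf-8"))
tmp.close()

sample = read_text(tmp.name)
os.remove(tmp.name)
print(len(sample))
```

In the script above, sample = read_text('sample.txt') would then replace the plain open(...).read() call.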

 Resource u'chunkers/maxent_ne_chunker/english_ace_binary.pickle'
  not found.  Please use the NLTK Downloader to obtain the
  resource:  >>> nltk.download()
  Searched in:
    - '/Users/admin/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''

dibosh commented Aug 23, 2016

@loretoparisi, download the necessary dependencies:

nltk.download('maxent_ne_chunker')
nltk.download('words')

@dibosh - that's really helpful.

On a fresh NLTK install, I had to add:

nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
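
To avoid re-running the downloads on every start, the four calls above can be wrapped in a check that only fetches what is missing. This is a sketch under assumptions: ensure_nltk_resources is a hypothetical helper, the package names are the ones from this thread, and the lookup paths paired with them are assumed locations inside nltk_data. Other NLTK releases may use different resource names.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each (path, package) pair only if the path is not installed.

    find and download default to nltk.data.find and nltk.download; they are
    parameters so the logic can be tried out with stand-ins.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the real functions are needed
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, package in resources:
        try:
            find(path)               # raises LookupError when the resource is missing
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched


# Package names from the comment above; the paths are assumptions.
REQUIRED = [
    ("tokenizers/punkt", "punkt"),
    ("taggers/averaged_perceptron_tagger", "averaged_perceptron_tagger"),
    ("chunkers/maxent_ne_chunker", "maxent_ne_chunker"),
    ("corpora/words", "words"),
]
```

Calling ensure_nltk_resources(REQUIRED) once at the top of the script would then download only the resources that are not yet present.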

Thanks for this. Works great.

pemagrg1 commented Aug 1, 2017

@rsingh2083 Run the code with Python 2.7 and make sure that you have downloaded the NLTK data completely.

dicksonj commented Aug 8, 2017

Run import nltk followed by nltk.download() within Python, or download a specific NLTK resource, e.g. nltk.download('stopwords').
For some reason it didn't work for me when I tried installing the whole NLTK package.

This code is giving me a syntax error.
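
One likely cause, assuming the script is being run under Python 3: the last line uses the Python 2 print statement (print set(entity_names)), which is a syntax error in Python 3. A minimal sketch of a form that runs on both versions follows; the entity_names list here is made-up sample data, not output from the script.

```python
from __future__ import print_function  # no-op on Python 3, enables print() on Python 2

# Made-up sample data standing in for the script's collected entity names.
entity_names = ["Jay Pratap Pandey", "NLTK", "NLTK"]

# Print unique entity names, sorted so the order is stable between runs.
unique_names = sorted(set(entity_names))
print(unique_names)
```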

What is the expected output? I have just started learning NLP. I executed the code and it's giving me an empty result. I copied some random English text into the sample.txt file. Waiting for a reply. :)

@mihir19297: Have a look at the following link. The code there is a small variation of the code above, but it is described with an example and the expected output:

https://stackoverflow.com/questions/36255291/extract-city-names-from-text-using-python

Jaypratap commented Dec 15, 2017

It gives the wrong output. It returns "Author workshop" from the txt file instead of "Jay Pratap Pandey". How can I get the correct output?
