Basic example of using NLTK for named entity extraction.
import nltk

with open('sample.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'node') and t.node:
        if t.node == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
#print entity_names

# Print unique entity names
print set(entity_names)

rsingh2083 commented Nov 30, 2014

I'm sorry, but your code isn't working. nltk.batch_ne_chunk fails with: 'module' object has no attribute 'batch_ne_chunk'
Please suggest what to do.

hugokoopmans commented Dec 4, 2014

Hi Rsingh, the NLTK 3.0 docs say: chunk.batch_ne_chunk() → chunk.ne_chunk_sents()
I replaced that and the script works again.

hugo

hugokoopmans commented Dec 4, 2014

It also seems there are more changes in NLTK 3.0.

Change 'node' to 'label()' as well:

if hasattr(t, 'label') and t.label():
    if t.label() == 'NE':
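
For anyone who needs the script to run against both the old and the new API, the rename described above can be hidden behind a small helper. This is only a sketch: get_label, NewStyleTree and OldStyleTree are hypothetical names introduced here, and the two classes are stand-ins for NLTK trees so the example runs without NLTK installed.

```python
def get_label(t):
    """Return the chunk label of a tree node, or None for a leaf such as (word, tag)."""
    label = getattr(t, "label", None)
    if callable(label):
        return label()              # NLTK 3.x style: label() method
    return getattr(t, "node", None)  # NLTK 2.x style: node attribute


# Hypothetical stand-ins for the two tree flavours (not part of NLTK).
class NewStyleTree(list):
    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


class OldStyleTree(list):
    def __init__(self, node, children):
        super().__init__(children)
        self.node = node


new_label = get_label(NewStyleTree("NE", [("Reykjavik", "NNP")]))
old_label = get_label(OldStyleTree("NE", [("Reykjavik", "NNP")]))
leaf_label = get_label(("Reykjavik", "NNP"))
```

In extract_entity_names, the two label checks could then be written as get_label(t) and get_label(t) == 'NE'.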

ririw commented Jul 3, 2015

For future readers, here's a version that works for me, using NLTK version 3.0.3:

import nltk 
with open('sample.txt', 'r') as f:
    sample = f.read()


sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label():
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
#print entity_names

# Print unique entity names
print set(entity_names)

matthewcornell commented Sep 16, 2015

Thanks for this.

Rahulvks commented Jan 5, 2016

I am facing an error when I run the code:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8e in position 518: ordinal not in range(128)

hongtao510 commented Feb 19, 2016

Try:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
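
An alternative that leaves the interpreter's default encoding alone is to decode the file explicitly when reading it. The sketch below assumes the input is UTF-8, which may not be true for the file that triggered the error above; read_text is a hypothetical helper name, and the temporary file exists only to make the example self-contained.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file, decoding it with an explicit codec."""
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Demonstration with a throwaway file that contains a non-ASCII character.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(u"Laxd\u00e6la saga".encode("utf-8"))

sample = read_text(demo_path)
os.remove(demo_path)
```

In the script, sample = read_text('sample.txt') would replace the open/read lines. If the file uses another encoding, pass it as the second argument.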

loretoparisi commented May 9, 2016

 Resource u'chunkers/maxent_ne_chunker/english_ace_binary.pickle'
  not found.  Please use the NLTK Downloader to obtain the
  resource:  >>> nltk.download()
  Searched in:
    - '/Users/admin/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''

dibosh commented Aug 23, 2016

@loretoparisi download the necessary dependencies:

nltk.download('maxent_ne_chunker')
nltk.download('words')

manjunath-s commented Feb 10, 2017

@dibosh - that's really helpful.

cyterdan commented Jun 27, 2017

On a fresh NLTK install, I had to add:

nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
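
To avoid re-running the downloads on every start, the four packages listed above can be fetched only when they are missing. This is a sketch under assumptions: ensure_nltk_data is a hypothetical helper, the resource paths are the conventional locations for these packages, and newer NLTK releases may use different package names. The find and download parameters exist only so the logic can be exercised without NLTK installed.

```python
# Package id -> path that nltk.data.find() is asked for (assumed conventional locations).
REQUIRED = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
    "words": "corpora/words",
}


def ensure_nltk_data(required=REQUIRED, find=None, download=None):
    """Download each required package that is not installed yet; return the ones fetched."""
    if find is None or download is None:
        import nltk  # imported lazily so this module loads without NLTK
        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for package, path in required.items():
        try:
            find(path)          # raises LookupError when the resource is absent
        except LookupError:
            download(package)
            fetched.append(package)
    return fetched
```

Calling ensure_nltk_data() once before sent_tokenize should be enough in the script above.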

farshadj commented Jul 22, 2017

Thanks for this. Works great.

pemagrg1 commented Aug 1, 2017

@rsingh2083 Run the code with Python 2.7 and make sure that you have downloaded the NLTK package completely.

dicksonj commented Aug 8, 2017

Run import nltk and then nltk.download() within Python, or name a specific NLTK package, e.g. nltk.download('stopwords').
For some reason it didn't work for me when I tried installing the whole NLTK package.

GraniteConsultingReviews commented Aug 29, 2017

This code is giving me a syntax error.

mihir19297 commented Sep 3, 2017

What is the expected output? I have just started learning NLP. I executed the code and it's giving me an empty array. I copied some random English text into the sample.txt file. Waiting for a reply. :)

erumharis commented Sep 11, 2017

@mihir19297: have a look at the following link. The code there is a small variation of the code above, described with an example and the expected output:

https://stackoverflow.com/questions/36255291/extract-city-names-from-text-using-python

Jaypratap commented Dec 15, 2017

It gives the wrong output. It returns "Author workshop" from the txt file instead of "Jay Pratap Pandey". How can I get the correct output?

disruptfwd commented May 14, 2018

How would you modify the code to exclude named entities?
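
One reading of this question is: keep the words of each sentence that fall outside the 'NE' chunks. A minimal sketch of that reading follows. tokens_outside_entities is a hypothetical function, and StubTree is a stand-in that mimics only the parts of an NLTK 3 tree used here (iteration and label()), so the example runs without NLTK data.

```python
class StubTree(list):
    """Hypothetical stand-in for an NLTK tree: a list of children plus a label() method."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def tokens_outside_entities(t):
    """Collect the words of a chunked sentence that are not inside an 'NE' subtree."""
    words = []
    for child in t:
        if hasattr(child, "label"):
            # Subtree: skip it entirely if it is a named entity, else descend.
            if child.label() != "NE":
                words.extend(tokens_outside_entities(child))
        else:
            words.append(child[0])  # leaf is a (word, tag) pair
    return words


sentence = StubTree("S", [
    StubTree("NE", [("Mark", "NNP")]),
    ("works", "VBZ"),
    ("at", "IN"),
    StubTree("NE", [("Google", "NNP")]),
    (".", "."),
])
remaining = tokens_outside_entities(sentence)
```

With real data, the function would be called on each tree yielded by nltk.ne_chunk_sents(tagged_sentences, binary=True).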

nimashiri commented May 28, 2019

@rsingh2083

You have to download the package using the following commands in a Python session:

import nltk
nltk.download('batch_ne_chunk')

jashanbhullar commented Jun 19, 2019

Hey, can you attach the sample.txt file you used with this code? I am getting an empty set

nimashiri commented Jun 20, 2019

Hey, can you attach the sample.txt file you used with this code? I am getting an empty set

Sorry, I didn't understand what you asked.

pallanesi commented Sep 27, 2019

Hi, and thanks for the code. I tried the version from ririw's comment of 3 Jul 2015. I got a syntax error on the last line, where it converts the list to unique names. When I deleted set it worked, which was actually better for me because I needed to list the names by frequency. I tried it on an Icelandic saga, Laxdæla, and it worked fine. I added a dictionary to get unique names and a line to sort them by count. Here is the adapted code:

import nltk

with open('laxd.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label():
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
names = {}
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)
    entity_names.extend(extract_entity_names(tree))

# Count how often each entity name occurs
for w in entity_names:
    names[w] = names.get(w, 0) + 1

# Print all entity names, most frequent first
print(sorted(names.items(), key=lambda x: x[1], reverse=True))
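
The counting and sorting steps can also be done with collections.Counter from the standard library. The sketch below uses a made-up entity_names list in place of the real extraction output, so the names and counts are illustrative only.

```python
from collections import Counter

# Made-up stand-in for the list built by extract_entity_names().
entity_names = ["Kjartan", "Gudrun", "Kjartan", "Bolli", "Gudrun", "Kjartan"]

# Counter tallies each name; most_common() returns (name, count) pairs,
# highest count first.
name_counts = Counter(entity_names)
ranked = name_counts.most_common()

print(ranked)
```

Passing a number, as in name_counts.most_common(10), limits the result to the most frequent names.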
