Basic example of using NLTK for named entity extraction.
import nltk

with open('sample.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.batch_ne_chunk(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'node') and t.node:
        if t.node == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
#print entity_names

# Print unique entity names
print set(entity_names)

rsingh2083 Nov 30, 2014

I'm sorry, but your code isn't working. nltk.batch_ne_chunk fails with: 'module' object has no attribute 'batch_ne_chunk'
Please suggest what to do.

hugokoopmans Dec 4, 2014

Hi Rsingh, the NLTK 3.0 docs say: chunk.batch_ne_chunk() → chunk.ne_chunk_sents()
I replaced that and the script works again.

hugo

hugokoopmans Dec 4, 2014

It also seems there are more changes in NLTK 3.0.

Also change 'node' to 'label()':

if hasattr(t, 'label') and t.label():
    if t.label() == 'NE':
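
If the script has to run under both NLTK 2.x and 3.x, one option is to hide the node/label() difference behind a small helper. This is only a minimal sketch: tree_label is a hypothetical name that is not part of NLTK, and it assumes leaves are plain (word, tag) tuples, as produced by the tagging step in the gist.

```python
def tree_label(t):
    """Return the label of a chunk-tree node, or None for a leaf token."""
    label = getattr(t, "label", None)
    if callable(label):
        return label()               # NLTK 3.x style: Tree.label()
    # NLTK 2.x style: Tree.node. Leaves have neither attribute, so None.
    return getattr(t, "node", None)
```

With a helper like this, the test inside extract_entity_names could be written as tree_label(t) == 'NE' regardless of the installed version.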

ririw Jul 3, 2015

For future readers, here's a version that works for me, using NLTK version 3.0.3

import nltk

with open('sample.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    # Only Tree nodes have a label; leaves are plain (word, tag) tuples.
    if hasattr(t, 'label') and t.label():
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print(extract_entity_names(tree))

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
# print(entity_names)

# Print unique entity names
print(set(entity_names))

matthewcornell Sep 16, 2015

Thanks for this.

RVKRM Jan 5, 2016

I am getting an error when I run the code:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8e in position 518: ordinal not in range(128)

hongtao510 Feb 19, 2016

Try this (Python 2 only):
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
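
On Python 3, reload(sys) and sys.setdefaultencoding are not available, so an alternative is to decode the file explicitly when reading it. The following is a minimal sketch, assuming sample.txt is mostly UTF-8. read_text is a hypothetical helper, and the right encoding value depends on how the file was actually saved.

```python
import io


def read_text(path, encoding="utf-8", errors="replace"):
    """Read a text file, decoding it explicitly instead of using the default codec."""
    # io.open accepts an encoding on both Python 2.7 and Python 3.
    # errors="replace" turns undecodable bytes into U+FFFD instead of
    # raising UnicodeDecodeError.
    with io.open(path, "r", encoding=encoding, errors=errors) as f:
        return f.read()
```

In the script, sample = read_text('sample.txt') would then take the place of the open(...) / f.read() lines. If the file's real encoding is known, passing it as encoding avoids the replacement characters altogether.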

loretoparisi May 9, 2016

 Resource u'chunkers/maxent_ne_chunker/english_ace_binary.pickle'
  not found.  Please use the NLTK Downloader to obtain the
  resource:  >>> nltk.download()
  Searched in:
    - '/Users/admin/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''

dibosh Aug 23, 2016

@loretoparisi download the necessary dependencies:

nltk.download('maxent_ne_chunker')
nltk.download('words')

manjunath-s Feb 10, 2017

@dibosh - that's really helpful.

cyterdan Jun 27, 2017

On a fresh NLTK install, I had to add:

nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
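
To avoid contacting the download server on every run, the four downloads can be wrapped so that each resource is fetched only when it is missing. This is a rough sketch under stated assumptions: ensure_nltk_data and REQUIRED are hypothetical names, the lookup paths are the locations these resources are conventionally stored under, and newer NLTK releases may ship some of them under different names.

```python
# Resource ids are taken from the comment above. The lookup paths are an
# assumption to verify against the installed NLTK version.
REQUIRED = {
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
    "words": "corpora/words",
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
}


def ensure_nltk_data(required=REQUIRED):
    """Download each resource only if nltk.data.find cannot locate it."""
    import nltk  # imported lazily so the mapping can be inspected without NLTK

    missing = []
    for name, path in required.items():
        try:
            nltk.data.find(path)      # raises LookupError when absent
        except LookupError:
            missing.append(name)
    for name in missing:
        nltk.download(name, quiet=True)
    return missing                    # ids that had to be downloaded
```

Calling ensure_nltk_data() once at the top of the script, before nltk.sent_tokenize, would cover the tokenizer, tagger, and chunker used in the gist.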

farshadj Jul 22, 2017

Thanks for this. Works great.

pemagrg1 Aug 1, 2017

@rsingh2083 Run the code with Python 2.7 and make sure that you have downloaded the NLTK package completely.

dicksonj Aug 8, 2017

Use import nltk; nltk.download() within Python, or specify a particular NLTK resource, e.g. nltk.download('stopwords').
It didn't work for me for some reason when I tried installing the whole NLTK package.

GraniteConsultingReviews Aug 29, 2017

This code is giving me a syntax error.

mihir19297 Sep 3, 2017

What is the expected output? I have just started learning NLP. I executed the code and it's giving me a blank array. I copied random English text into the sample.txt file. Waiting for a reply. :)

erumharis Sep 11, 2017

@mihir19297: have a look at the following link. The code given there is a small variation of the code above, but it is described with an example and expected output:

https://stackoverflow.com/questions/36255291/extract-city-names-from-text-using-python

Jaypratap Dec 15, 2017

It gives the wrong output. It returns Author workshop from the txt file instead of Jay Pratap Pandey. How can I get the correct output?

disruptfwd May 14, 2018

How would you modify the code to exclude named entities?
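
One possible approach, sketched here rather than taken from the gist, is to walk each chunk tree the same way extract_entity_names does but keep only the words that sit outside an 'NE' subtree. non_entity_tokens is a hypothetical name, the sentence is hand-built rather than real chunker output, and the fallback Tree class is a stand-in so the example runs even without NLTK installed.

```python
try:
    from nltk import Tree
except ImportError:
    # Stand-in with the same constructor shape and label() method,
    # used only when NLTK is not installed.
    class Tree(list):
        def __init__(self, label, children):
            super(Tree, self).__init__(children)
            self._label = label

        def label(self):
            return self._label


def non_entity_tokens(t):
    """Collect the words of a chunk tree that are not inside an 'NE' subtree."""
    words = []
    for child in t:
        if hasattr(child, "label"):
            if child.label() != "NE":       # skip whole NE subtrees
                words.extend(non_entity_tokens(child))
        else:
            words.append(child[0])          # leaf is a (word, tag) pair
    return words


# Hand-built tree in the shape ne_chunk_sents(..., binary=True) yields.
sentence = Tree("S", [("I", "PRP"), ("visited", "VBD"),
                      Tree("NE", [("New", "NNP"), ("York", "NNP")]),
                      ("yesterday", "NN")])
print(non_entity_tokens(sentence))  # ['I', 'visited', 'yesterday']
```

In the script, calling non_entity_tokens(tree) inside the loop over chunked_sentences would give the remaining words for each sentence.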
