
@alexbowe
Created March 21, 2011 12:59
Demonstration of extracting key phrases with NLTK in Python
import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""

# Used when tokenizing words
sentence_re = r'''(?x)          # set flag to allow verbose regexps
      ([A-Z])(\.[A-Z])+\.?      # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*                # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?          # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                    # ellipsis
    | [][.,;"'?():-_`]          # these are separate tokens
'''

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

# Taken from Su Nam Kim paper...
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)

toks = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(toks)

print postoks

tree = chunker.parse(postoks)

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter=lambda t: t.node == 'NP'):
        yield subtree.leaves()

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem_word(word)
    word = lemmatizer.lemmatize(word)
    return word

def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    accepted = bool(2 <= len(word) <= 40
                    and word.lower() not in stopwords)
    return accepted

def get_terms(tree):
    for leaf in leaves(tree):
        term = [normalise(w) for w, t in leaf if acceptable_word(w)]
        yield term

terms = get_terms(tree)

for term in terms:
    for word in term:
        print word,
    print
@aciong32

After stemming, key phrases come out as things like "digit comput".
These stemmed key phrases don't make sense in some situations.
Does it make sense to comment out the stemming step sometimes?
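
A minimal way to try that (a sketch, editing the normalise function from the gist above):

def normalise(word):
    """Normalises a word to lowercase and lemmatizes it, skipping the stemmer."""
    word = word.lower()
    # word = stemmer.stem_word(word)  # skipped, so "computer" stays "computer"
    word = lemmatizer.lemmatize(word)
    return word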

@anupamchoudhari

I'm not sure if it's just me, but the verbose regular expression used for tokenization did not work for me. This fixed it: I used parentheses for grouping the given expressions and changed all the groups to non-capturing.

sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'

@AtomsForPeace

thanks @anupamchoudhari!!

@fibelatti

Thanks @alexbowe this is very useful for my current research, and thanks @anupamchoudhari for the fix!

@petulla

petulla commented May 24, 2016

Hm. I received this error: AttributeError: 'tuple' object has no attribute 'isdigit'. It seems to be a bug in the most recent NLTK release. Installing 3.0.5 fixes it.
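
One way to pin that version when installing (a standard pip version pin):

$ pip install nltk==3.0.5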

@Rich700000000000

Yeah, I'm getting the same error as @petulla. What's wrong?

@anu003

anu003 commented Jul 22, 2016

@petulla and @Rich700000000000, looks like it works fine if you make the changes mentioned by @anupamchoudhari and @tejasshah93. Thanks guys !!

@renaud

renaud commented Sep 1, 2016

note that {<NBAR><IN><NBAR>} should come above {<NBAR>} for it to work
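
That is, with the rules reordered (a sketch of the grammar from the gist):

grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR><IN><NBAR>}  # NBARs connected with in/of/etc., tried first
        {<NBAR>}            # otherwise a bare NBAR
"""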

@pal2ie

pal2ie commented Sep 14, 2016

thanks @anupamchoudhari!!

@Phdntom

Phdntom commented Nov 9, 2016

This expression
sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'

is not valid as-is. Is nobody else getting this SyntaxError? It seems to be at the final ?, presumably because the ' closes the string early.

@avikdelta

@Phdntom: Yes I get a syntax error too. Did you find the solution?

@NeethuAnish

Getting the same error as @Phdntom. Has anyone got the solution?

@karibot

karibot commented Feb 9, 2017

@Phdntom
For your syntax error, you have to escape the single quote ' like this:
r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'
The syntax error disappears, but I get another error when compiling the regex: error: nothing to repeat.
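
Possibly markdown stripped the backslashes when the pattern was posted, which would explain "nothing to repeat": an unescaped $ followed by ? cannot be quantified. Restoring them, and escaping the quote, gives something like this (an untested sketch; the \- also avoids the accidental :-_ range):

sentence_re = r'(?:(?:[A-Z])(?:\.[A-Z])+\.?)|(?:\w+(?:-\w+)*)|(?:\$?\d+(?:\.\d+)?%?)|(?:\.\.\.)|(?:[][.,;"\'?():\-_`])'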

@Mohan-kr

Mohan-kr commented Feb 24, 2017

Hi guys,
I am also getting the same error: "sre_constants.error: nothing to repeat at position 48".
Can anyone suggest how to fix it?
Traceback (most recent call last):
File "C:/Users/mohan.choudhary/Desktop/Copied_Shared/New folder/KeyTokenizer.py", line 24, in
toks = nltk.regexp_tokenize(text, sentence_re)
File "C:\Python_3.5.0\lib\site-packages\nltk\tokenize\regexp.py", line 196, in regexp_tokenize
return tokenizer.tokenize(text)
File "C:\Python_3.5.0\lib\site-packages\nltk\tokenize\regexp.py", line 119, in tokenize
self._check_regexp()
File "C:\Python_3.5.0\lib\site-packages\nltk\tokenize\regexp.py", line 116, in _check_regexp
self._regexp = re.compile(self._pattern, self._flags)
File "C:\Python_3.5.0\lib\re.py", line 224, in compile
return _compile(pattern, flags)
File "C:\Python_3.5.0\lib\re.py", line 293, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Python_3.5.0\lib\sre_compile.py", line 536, in compile
p = sre_parse.parse(p, flags)
File "C:\Python_3.5.0\lib\sre_parse.py", line 829, in parse
p = _parse_sub(source, pattern, 0)
File "C:\Python_3.5.0\lib\sre_parse.py", line 437, in _parse_sub
itemsappend(_parse(source, state))
File "C:\Python_3.5.0\lib\sre_parse.py", line 778, in _parse
p = _parse_sub(source, state)
File "C:\Python_3.5.0\lib\sre_parse.py", line 437, in _parse_sub
itemsappend(_parse(source, state))
File "C:\Python_3.5.0\lib\sre_parse.py", line 638, in _parse
source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 50
thanks

@hash-include

@Mohan-kr did you solve that error? I am getting the same error too.

@pavelnunez

I'm currently working on a project that uses some of the natural language features present in NLTK. I know this post is 6 years old now, but as I've stumbled onto this gist, I think it might be useful if @alexbowe posted (and edited) this gist again with the requirements for this script to run.

In my experience, running it "out of the box" needs the following (and this is by no means a complete list of requirements):

If you're running Python 2.7:

  • Python 2.7+
  • nltk
  • The POS (Part of Speech) tagger with the identifier: maxent_treebank_pos_tagger
  • A Model with the identifier: averaged_perceptron_tagger
  • A Corpora with the identifier: stopwords

Using Python 2.7 (with an unmodified version of the script), it will run with some exceptions.

If you're running Python 3.5:

  • Python 3.5+ (with some minor changes to the script to replace the old print statement with the newer print() function)
  • nltk
  • The POS (Part of Speech) tagger with the identifier: maxent_treebank_pos_tagger
  • A Model with the identifier: averaged_perceptron_tagger
  • A Corpora with the identifier: stopwords

Using Python 3.5, it will NOT run, raising the exception "AttributeError: 'tuple' object has no attribute 'isdigit'". As I'm not a Python developer, I don't know what to do about it. However, if you can install both versions of Python, it is better to run it on Python 2.7.

Take into account that you might need to switch from pip to pip3 (when installing Python modules), as the latter is used on Python 3.x installations.

The dependencies for nltk are available in the Python shell (>>>) with the utility nltk.download()
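
For instance (a sketch using the identifiers listed above; wordnet added since the script's WordNetLemmatizer needs it):

>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')   # POS tagger
>>> nltk.download('averaged_perceptron_tagger')   # tagger model
>>> nltk.download('stopwords')                    # stopwords corpora
>>> nltk.download('wordnet')                      # used by the lemmatizer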

I hope these indications are useful for someone else.

@Reihan-amn

Agree with @renaud.
Any new rule has to be placed before {<NBAR>}.
I actually added a rule as well to capture all NPs that conjoin with each other.

@shreya-singh-tech

shreya-singh-tech commented Jun 14, 2017

I need to extract verb phrases along with noun phrases. I have defined the grammar correctly, but I think that where we check t.node, a simple "or" will not suffice, because the extracted words get printed twice, sometimes sentence-wise and sometimes consecutively, since my grammar has NP inside VP. I checked my tree and it seems okay. Does anyone have a solution to this?
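
One way to avoid the duplicates (a sketch, assuming NLTK 3.x and a grammar where VP contains NP): yield only the outermost matching phrases, using ParentedTree to check whether a match is nested inside another match.

from nltk.tree import ParentedTree

def phrase_leaves(tree, labels=('NP', 'VP')):
    """Yield leaves of NP/VP subtrees, skipping phrases nested inside another match."""
    ptree = ParentedTree.convert(tree)
    for subtree in ptree.subtrees(filter=lambda t: t.label() in labels):
        # walk up the ancestors; skip this subtree if any ancestor also matches
        parent = subtree.parent()
        while parent is not None and parent.label() not in labels:
            parent = parent.parent()
        if parent is None:  # no matching ancestor, so this is an outermost phrase
            yield subtree.leaves()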

@Reihan-amn

Why not use NBAR: {<NN*|JJ><NN>}? Why are those dots there?

@bashokku

@Mohan-kr @hash-include did you solve the error you were getting for this problem?

@bashokku

For the error AttributeError: 'tuple' object has no attribute 'isdigit', find the solution below.

You need to uninstall higher versions of nltk; it works with version 3.0.x.

Solution:

The default tagger was changed to the perceptron tagger in nltk 3.1, which is now the latest version. All my nltk.regexp_tokenize calls stopped functioning correctly, and all my nltk.pos_tag calls started giving the above error.

The solution that I currently have is to use a previous version, nltk 3.0.x, to make them work. I am not sure if this is a bug in the current release of nltk.

Installation instructions for the nltk 3.0.4 version on Ubuntu. From your home directory, or any other directory, do the following steps:

$ wget https://github.com/nltk/nltk/archive/3.0.4.tar.gz
$ tar -xvzf 3.0.4.tar.gz
$ cd nltk-3.0.4
$ sudo python3.4 setup.py install
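
To verify the downgrade took effect (a quick check, assuming the same python3.4):

$ python3.4 -c "import nltk; print(nltk.__version__)"   # should print 3.0.4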

@mkpisk

mkpisk commented Oct 3, 2017

You can use the following command to install nltk 3.0.4:

pip install https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz

It will automatically uninstall your latest version.

pip install https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
Collecting https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
  Downloading nltk-3.0.4.tar.gz (1.0MB)
    100% |################################| 1.0MB 562kB/s
Building wheels for collected packages: nltk
  Running setup.py bdist_wheel for nltk ... done
  Stored in directory: C:\Users\1534038\AppData\Local\pip\Cache\wheels\8a\1e\1e\9f124d9995acdfd40f645da9592cd126f6fbe19b5e54b1c4b4
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.4
    Uninstalling nltk-3.2.4:
      Successfully uninstalled nltk-3.2.4
Successfully installed nltk-3.0.4

After this, I am able to run the above code.

@KanimozhiU

Traceback (most recent call last):
File "nltk-intro.py", line 31, in
toks = nltk.regexp_tokenize(text, sentence_re)
File "/home/user/Desktop/nltk-3.0.4/nltk/tokenize/regexp.py", line 203, in regexp_tokenize
return tokenizer.tokenize(text)
File "/home/user/Desktop/nltk-3.0.4/nltk/tokenize/regexp.py", line 126, in tokenize
self._check_regexp()
File "/home/user/Desktop/nltk-3.0.4/nltk/tokenize/regexp.py", line 121, in _check_regexp
self._regexp = compile_regexp_to_noncapturing(self._pattern, self._flags)
File "/home/user/Desktop/nltk-3.0.4/nltk/internals.py", line 55, in compile_regexp_to_noncapturing
return sre_compile.compile(convert_regexp_to_noncapturing_parsed(sre_parse.parse(pattern)), flags=flags)
File "/home/user/Desktop/nltk-3.0.4/nltk/internals.py", line 51, in convert_regexp_to_noncapturing_parsed
parsed_pattern.pattern.groups = 1
AttributeError: can't set attribute

This error was encountered after following @mkpisk's suggestion above to install nltk 3.0.4 via pip.

@bearami

bearami commented Jan 6, 2018

I have made the changes suggested by @anupamchoudhari and @tejasshah93.
I am getting a syntax error in the regular expression @anupamchoudhari suggested. I am using Python 3.6.3. Any help fixing it is greatly appreciated, as I am a newbie in Python and NLTK.

sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'

@jamesballard

The following regular expression seems to work in Python 3.x:

sentence_re = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

from https://stackoverflow.com/questions/36353125/nltk-regular-expression-tokenizer

Plus other fixes:

for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
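
The stemmer call needs a similar update in newer NLTK versions (stem_word is gone):

word = stemmer.stem(word)  # replaces stemmer.stem_word(word)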

@eliksr

eliksr commented Jun 14, 2018

@jamesballard Thanks! it works for me with Python 3.x

@komal-bhalla

I am getting an error from running the code below:
postoks = nltk.tag.pos_tag(toks)

URLError:

@Rich2020

Rich2020 commented May 29, 2019

Working for Python 3.6.

  • line 44: change t.node to t.label()
  • line 50: change stemmer.stem_word(word) to stemmer.stem(word)

Full working version:

import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""

# Used when tokenizing words
sentence_re = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

#Taken from Su Nam Kim Paper...
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)

toks = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(toks)

print(postoks)

tree = chunker.parse(postoks)

from nltk.corpus import stopwords
stopwords = stopwords.words('english')


def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter = lambda t: t.label()=='NP'):
        yield subtree.leaves()

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem(word)
    word = lemmatizer.lemmatize(word)
    return word

def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    accepted = bool(2 <= len(word) <= 40
                    and word.lower() not in stopwords)
    return accepted


def get_terms(tree):
    for leaf in leaves(tree):
        term = [ normalise(w) for w,t in leaf if acceptable_word(w) ]
        yield term

terms = get_terms(tree)

for term in terms:
    for word in term:
        print(word, end=' ')  # keep the words of a term on one line, like the original print word,
    print()

@ChannaJayanath

(quoting @Rich2020's full working version for Python 3.6 above)

thank you

@anish-adm

Thank you @Rich2020, worked for me :)
