Demonstration of extracting key phrases with NLTK in Python
import nltk

text = """The Buddha, the Godhead, resides quite as comfortably in the circuits of a digital
computer or the gears of a cycle transmission as he does at the top of a mountain
or in the petals of a flower. To think otherwise is to demean the Buddha...which is
to demean oneself."""

# Used when tokenizing words
sentence_re = r'''(?x)      # set flag to allow verbose regexps
      ([A-Z])(\.[A-Z])+\.?  # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [][.,;"'?():-_`]      # these are separate tokens
'''

lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()

#Taken from Su Nam Kim Paper...
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)

toks = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(toks)

print postoks

tree = chunker.parse(postoks)

from nltk.corpus import stopwords
stopwords = stopwords.words('english')


def leaves(tree):
    """Finds NP (nounphrase) leaf nodes of a chunk tree."""
    for subtree in tree.subtrees(filter = lambda t: t.node=='NP'):
        yield subtree.leaves()

def normalise(word):
    """Normalises words to lowercase and stems and lemmatizes it."""
    word = word.lower()
    word = stemmer.stem_word(word)
    word = lemmatizer.lemmatize(word)
    return word

def acceptable_word(word):
    """Checks conditions for acceptable word: length, stopword."""
    accepted = bool(2 <= len(word) <= 40
        and word.lower() not in stopwords)
    return accepted


def get_terms(tree):
    for leaf in leaves(tree):
        term = [ normalise(w) for w,t in leaf if acceptable_word(w) ]
        yield term

terms = get_terms(tree)

for term in terms:
    for word in term:
        print word,
    print

siriusblac37 commented Nov 30, 2013

Which paper by Su Nam Kim has been used for the grammar expression?

alexbowe commented Dec 8, 2014

@adwait-group Hi, I only saw this comment now. This gist is part of a blog post (http://alexbowe.com/au-naturale/) in which the paper is cited: S. N. Kim, T. Baldwin, and M.-Y. Kan. Evaluating n-gram based evaluation metrics for automatic keyphrase extraction. Technical report, University of Melbourne, Melbourne 2010.

maziyarpanahi commented Jan 27, 2015

If anybody gets the error about "t.node" at line 44 just replace it with: t.label() and it works.
Thanks
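
The rename described here can be absorbed in one helper instead of editing every call site. This is a minimal sketch that assumes only what the thread reports: older NLTK trees expose the label as a `node` attribute and newer ones as a `label()` method. `tree_label`, `OldTree` and `NewTree` are hypothetical names for illustration, and the two stand-in classes are not NLTK classes.

```python
def tree_label(t):
    """Return the label of a chunk-tree node on old and new NLTK versions."""
    label = getattr(t, "label", None)
    if callable(label):
        return label()   # newer NLTK: the label is read through a method
    return t.node        # older NLTK: the label is stored in .node


class OldTree:
    """Stand-in for an old-style tree (illustration only, not nltk.Tree)."""
    def __init__(self, node):
        self.node = node


class NewTree:
    """Stand-in for a new-style tree (illustration only, not nltk.Tree)."""
    def __init__(self, label):
        self._label = label

    def label(self):
        return self._label


print(tree_label(OldTree("NP")), tree_label(NewTree("NP")))
```

In the script, the filter would then read `lambda t: tree_label(t) == 'NP'`.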

toomanycats commented May 27, 2015

Thanks for the great example...This worked as a great crash course in applying a grammar and good reminders of regex :)

line 44: node == "NP"
As noted in the comment above, this is now "label", and for me it's a method call:
label() == "NP"

Cheers

alvations commented Oct 1, 2015

I think it's this paper: http://www.aclweb.org/anthology/C10-1065

tejasshah93 commented Oct 14, 2015

Works the way it should. Thanks! :) and as mentioned, replacing t.node by t.label() does the job.
+1

anshmania commented Jan 28, 2016

Did anyone get this kind of error?

aciong32 commented Feb 19, 2016

After stemming, the key phrases become things like "digit comput"?
Such key phrases do not make sense in some situations.
Does it make sense to comment out the stemming step sometimes?
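
One way to act on this question is to make the stemming step optional rather than commenting it out. The sketch below is an assumption about how that could look, not the gist author's code: `normalise` takes the stemmer and lemmatizer as optional callables, and `toy_stem` is a hypothetical, deliberately crude stand-in so the example runs without NLTK.

```python
def normalise(word, stem=None, lemmatize=None):
    """Lowercase a word, then apply the optional stemmer and lemmatizer."""
    word = word.lower()
    if stem is not None:
        word = stem(word)
    if lemmatize is not None:
        word = lemmatize(word)
    return word


def toy_stem(word):
    """Crude suffix stripper used only for this illustration."""
    for suffix in ("al", "er"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


phrase = ["Digital", "computer"]
stemmed = " ".join(normalise(w, stem=toy_stem) for w in phrase)
unstemmed = " ".join(normalise(w) for w in phrase)
print(stemmed)
print(unstemmed)
```

With NLTK installed, `stem=stemmer.stem` and `lemmatize=lemmatizer.lemmatize` could be passed instead; leaving `stem` unset keeps readable phrases such as "digital computer".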

anupamchoudhari commented Feb 23, 2016

I'm not sure if it's just me, but the verbose regular expression used for tokenization did not work for me. This fixed it: I used parentheses to group the given expressions and changed all the parentheses to non-capturing.

sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'

AtomsForPeace commented Mar 10, 2016

thanks @anupamchoudhari!!

fibelatti commented May 5, 2016

Thanks @alexbowe this is very useful for my current research, and thanks @anupamchoudhari for the fix!

petulla commented May 24, 2016

Hm. I received this error: AttributeError: 'tuple' object has no attribute 'isdigit'. It seems to be a bug in the most recent nltk release. Installing 3.0.5 fixes it.
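
A plausible cause of this error, offered as an assumption rather than a confirmed diagnosis: when a pattern contains capturing groups, `re.findall` returns tuples of group contents instead of the matched strings, and any later step that expects string tokens then fails. The sketch below uses only the standard library; the two shortened patterns are illustrative, not the gist's full expression.

```python
import re

sample = "U.S.A. costs $12.40"

# Capturing groups, in the style of the gist's tokenizer pattern.
capturing = r"([A-Z])(\.[A-Z])+\.?|\$?\d+(\.\d+)?|\w+"
# The same alternatives with non-capturing groups.
non_capturing = r"(?:[A-Z]\.)+|\$?\d+(?:\.\d+)?|\w+"

with_groups = re.findall(capturing, sample)
without_groups = re.findall(non_capturing, sample)

print(with_groups)     # a list of tuples, one slot per group
print(without_groups)  # a list of plain string tokens
```

This would also explain why switching every group to `(?:...)` makes the error go away for several people in this thread.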

Rich700000000000 commented Jul 2, 2016

Yeah, I'm getting the same error as @petulla. What's wrong?

anu003 commented Jul 22, 2016

@petulla and @Rich700000000000, looks like it works fine if you make the changes mentioned by @anupamchoudhari and @tejasshah93. Thanks guys !!

renaud commented Sep 1, 2016

Note that the {<NBAR><IN><NBAR>} rule should come above {<NBAR>} in the grammar for it to take effect.
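
Why the order matters can be shown with a toy model. This is not NLTK code: `chunk` is a hypothetical function that assumes rules are tried in the order they are listed and that tokens already placed in a chunk are skipped by later rules, which matches how NLTK describes its chunk rules.

```python
def chunk(tags, rules):
    """Apply rules in order; each rule is a tuple of tags to group as NP.

    Items grouped by an earlier rule are never regrouped by a later one.
    """
    items = list(tags)  # plain strings are still unchunked
    for rule in rules:
        out, i = [], 0
        while i < len(items):
            window = items[i:i + len(rule)]
            if tuple(window) == tuple(rule):
                out.append(("NP", window))  # a tuple marks a finished chunk
                i += len(rule)
            else:
                out.append(items[i])
                i += 1
        items = out
    return items


tags = ["NBAR", "IN", "NBAR"]
short_first = chunk(tags, [("NBAR",), ("NBAR", "IN", "NBAR")])
long_first = chunk(tags, [("NBAR", "IN", "NBAR"), ("NBAR",)])
print(short_first)
print(long_first)
```

With the short rule first, each NBAR is consumed on its own and the longer rule never finds an unchunked match; with the long rule first, the whole "NBAR IN NBAR" span becomes a single NP.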

pal2ie commented Sep 14, 2016

thanks @anupamchoudhari!!

Phdntom commented Nov 9, 2016

This expression
sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'

is not valid as is. Is nobody else getting this SyntaxError? It seems to be at the final ?, presumably because the ' before it closes the string.
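
The SyntaxError is consistent with a one-line literal that contains both ' and " characters. A minimal sketch of the problem and one way around it, using only the punctuation character class from the pattern: a triple-quoted raw string, as in the original gist, needs no escaping. `broken_source` and `punct_re` are names made up for this illustration.

```python
import re

# The character class written as a one-line literal: the ' inside it
# terminates the string early, so Python rejects the source text.
broken_source = "pattern = r'[][.,;\"'?():_`-]'"
try:
    compile(broken_source, "<demo>", "exec")
    broken_is_valid = True
except SyntaxError:
    broken_is_valid = False

# A triple-quoted raw string can hold both ' and " without escaping.
punct_re = r'''[][.,;"'?():_`-]'''
found = re.findall(punct_re, "\"it's\" (ok)")

print(broken_is_valid)
print(found)
```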

avikdelta commented Jan 24, 2017

@Phdntom: Yes I get a syntax error too. Did you find the solution?

NeethuAnish commented Feb 6, 2017

Getting the same error as @Phdntom. Has anyone found a solution?

karibot commented Feb 9, 2017

@Phdntom
For your syntax error, you have to escape the single quote ' like this:
r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"\'?():-_`])'
The syntax error disappears, but I get another error when parsing the regex: error: nothing to repeat

Mohan-kr commented Feb 24, 2017

Hi guys,
I am also getting the same error "sre_constants.error: nothing to repeat at position 48"
Can anyone suggest how to fix it?
Traceback (most recent call last):
  File "C:/Users/mohan.choudhary/Desktop/Copied_Shared/New folder/KeyTokenizer.py", line 24, in <module>
    toks = nltk.regexp_tokenize(text, sentence_re)
  File "C:\Python_3.5.0\lib\site-packages\nltk\tokenize\regexp.py", line 196, in regexp_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python_3.5.0\lib\site-packages\nltk\tokenize\regexp.py", line 119, in tokenize
    self._check_regexp()
  File "C:\Python_3.5.0\lib\site-packages\nltk\tokenize\regexp.py", line 116, in _check_regexp
    self._regexp = re.compile(self._pattern, self._flags)
  File "C:\Python_3.5.0\lib\re.py", line 224, in compile
    return _compile(pattern, flags)
  File "C:\Python_3.5.0\lib\re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python_3.5.0\lib\sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Python_3.5.0\lib\sre_parse.py", line 829, in parse
    p = _parse_sub(source, pattern, 0)
  File "C:\Python_3.5.0\lib\sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "C:\Python_3.5.0\lib\sre_parse.py", line 778, in _parse
    p = _parse_sub(source, state)
  File "C:\Python_3.5.0\lib\sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "C:\Python_3.5.0\lib\sre_parse.py", line 638, in _parse
    source.tell() - here + len(this))
sre_constants.error: nothing to repeat at position 50
thanks
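
The reported position falls on the `$?` in the one-line expression quoted in this thread. An unescaped `$` is an end-of-string anchor, and Python's `re` refuses to apply `?` to an anchor; the gist's own pattern uses `\$?`. The sketch below assumes the backslashes were lost when the expression was pasted into the comment, and compares just that one alternative with and without them.

```python
import re

pasted = r"(?:$?\d+(?:.\d+)?%?)"     # as it appears in the comments
escaped = r"(?:\$?\d+(?:\.\d+)?%?)"  # with the backslashes restored

try:
    re.compile(pasted)
    pasted_error = None
except re.error as exc:
    pasted_error = str(exc)

amounts = re.findall(escaped, "paid $12.40, up 82%")

print(pasted_error)
print(amounts)
```

The same applies to the other alternatives: `.` should be `\.` wherever a literal full stop is meant, as in the verbose pattern at the top of the gist.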

hash-include commented Mar 22, 2017

@Mohan-kr did you solve that error? I am also getting the same error?

pavelnunez commented Mar 25, 2017

I'm currently working on a project that uses some of the natural language features in NLTK. I know this post is 6 years old now, but as I've stumbled onto this gist, I think it would be useful if @alexbowe posted (and edited) this gist again with the requirements for the script to run.

In my experience running it "out of the box", it needs the following to run (and this is by no means a complete list of requirements):

If you're running Python 2.7:

  • Python 2.7+
  • nltk
  • The POS (part-of-speech) tagger with the identifier: maxent_treebank_pos_tagger
  • A model with the identifier: averaged_perceptron_tagger
  • A corpus with the identifier: stopwords

Using Python 2.7 (with an unmodified version of the script) it will run with some exceptions.

If you're running Python 3.5:

  • Python 3.5+ (with some minor changes to the script to replace the old print construct with the newer print() function)
  • nltk
  • The POS (part-of-speech) tagger with the identifier: maxent_treebank_pos_tagger
  • A model with the identifier: averaged_perceptron_tagger
  • A corpus with the identifier: stopwords

Using Python 3.5 it will NOT run; it fails with the exception "AttributeError: 'tuple' object has no attribute 'isdigit'". As I'm not a Python developer, I don't know what to do about it. However, if you can install both versions of Python, it is better to run it on Python 2.7.

Take into account that you might need to switch from pip to pip3 (when installing Python modules) as the latter is used on Python 3.x installations.

The dependencies for nltk are available in the Python shell (>>>) with the utility nltk.download()

I hope these pointers are useful to someone else.
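
The resource list above can be turned into a small setup helper. This is a sketch under stated assumptions: `REQUIRED_RESOURCES` and `fetch_resources` are hypothetical names, the first three identifiers come from the list in this comment, and `wordnet` is added on the assumption that the script's `WordNetLemmatizer` needs it. The downloader is passed in as a parameter so the example runs without NLTK or a network connection.

```python
REQUIRED_RESOURCES = [
    "maxent_treebank_pos_tagger",
    "averaged_perceptron_tagger",
    "stopwords",
    "wordnet",  # assumption: needed by WordNetLemmatizer in the script
]


def fetch_resources(download, resources=REQUIRED_RESOURCES):
    """Call download(name) for each resource; return the names that failed."""
    return [name for name in resources if not download(name)]


# Dry run with a stand-in downloader that reports success for everything.
missing = fetch_resources(lambda name: True)
print(missing)
```

In real use the call would be `fetch_resources(nltk.download)` after `import nltk`, since `nltk.download` accepts a resource identifier and reports whether the download succeeded.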

Reihan-amn commented May 29, 2017

Agree with @renaud
Any new rule has to be placed before .
I actually added this as well: {} to capture all NPs that are conjoined with each other.

shreyu2403 commented Jun 14, 2017

I need to extract verb phrases along with noun phrases. I have defined the grammar correctly, but I think that where we check t.node a simple "or" will not suffice, because it leads to the extracted words being printed twice, sometimes sentence-wise and sometimes consecutively, because my grammar has NP inside VP. I checked my tree and it seems okay. Does anyone have a solution to this?
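
One possible approach to the duplication described here is to collect only the outermost matching subtrees and not descend into them, so an NP nested inside a VP is reported once, as part of the VP. The sketch below is an illustration with a hypothetical `Node` class standing in for a chunk tree (it is not `nltk.Tree`), and `outermost` is a made-up helper name.

```python
class Node:
    """Tiny stand-in for a chunk tree node (illustration only)."""
    def __init__(self, label, children):
        self.label = label
        self.children = children  # list of Node or (word, tag) tuples

    def leaves(self):
        out = []
        for child in self.children:
            out.extend(child.leaves() if isinstance(child, Node) else [child])
        return out


def outermost(node, wanted):
    """Yield the topmost subtrees whose label is in `wanted`."""
    for child in node.children:
        if not isinstance(child, Node):
            continue
        if child.label in wanted:
            yield child  # do not descend further: avoids duplicates
        else:
            yield from outermost(child, wanted)


inner_np = Node("NP", [("digital", "JJ"), ("computer", "NN")])
vp = Node("VP", [("demean", "VB"), inner_np])
root = Node("S", [vp, Node("NP", [("flower", "NN")])])

phrases = [[w for w, _ in t.leaves()] for t in outermost(root, {"NP", "VP"})]
print(phrases)
```

The same traversal idea could be written against a real chunk tree by checking whether each child is a subtree and comparing its label.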

Reihan-amn commented Jun 27, 2017

Why not use NBAR:{<NN*|JJ><NN>}? Why are those dots there?
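
The dots matter because the text inside `<...>` behaves like a regular expression over tag names. The sketch below approximates that with plain `re.fullmatch` on a few Penn Treebank noun tags; NLTK's actual translation of tag patterns is more involved, so treat this as an illustration of the regex semantics only.

```python
import re

tags = ["NN", "NNS", "NNP", "NNPS", "JJ"]

# "NN.*" means: NN followed by any characters, so every noun tag matches.
with_dot = [t for t in tags if re.fullmatch(r"NN.*", t)]
# "NN*" means: N followed by zero or more N, so only the bare NN matches.
without_dot = [t for t in tags if re.fullmatch(r"NN*", t)]

print(with_dot)
print(without_dot)
```

Without the dot, plural and proper nouns (NNS, NNP, NNPS) would be left out of the chunks.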

bashokku commented Jun 27, 2017

@Mohan-kr @hash-include did you solve the error you were getting for this problem ??

bashokku commented Jun 28, 2017

For the error "AttributeError: 'tuple' object has no attribute 'isdigit'", see the solution below.

You need to uninstall newer versions of nltk; it works with the 3.0 versions.

Solution:

The default tagger was changed to the Perceptron tagger in nltk 3.1, which is now the latest version. All my nltk.regexp_tokenize calls stopped functioning correctly and all my nltk.pos_tag calls started giving the above error.

My current solution is to use the previous version, nltk 3.0.1, to make them work. I am not sure if this is a bug in the current release of nltk.

Installation instructions for nltk 3.0.4 on Ubuntu: from your home directory or any other directory, run the following steps.

$ wget https://github.com/nltk/nltk/archive/3.0.4.tar.gz
$ tar -xvzf 3.0.4.tar.gz
$ cd nltk-3.0.4
$ sudo python3.4 setup.py install

mkpisk commented Oct 3, 2017

You can use the following command to install nltk 3.0.4:

pip install https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz

It will automatically uninstall your latest version.

/****************************************************************************/
pip install https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
Collecting https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz
Downloading nltk-3.0.4.tar.gz (1.0MB)
100% |################################| 1.0MB 562kB/s
Building wheels for collected packages: nltk
Running setup.py bdist_wheel for nltk ... done
Stored in directory: C:\Users\1534038\AppData\Local\pip\Cache\wheels\8a\1e\1e\9f124d9995acdfd40f645da9592cd126f6fbe19b5e54b1c4b4
Successfully built nltk
Installing collected packages: nltk
Found existing installation: nltk 3.2.4
Uninstalling nltk-3.2.4:
Successfully uninstalled nltk-3.2.4
Successfully installed nltk-3.0.4
/**************************************************************************************************/

After this I am able to run the above code

KanimozhiU commented Nov 27, 2017

Traceback (most recent call last):
  File "nltk-intro.py", line 31, in <module>
    toks = nltk.regexp_tokenize(text, sentence_re)
  File "/home/user/Desktop/nltk-3.0.4/nltk/tokenize/regexp.py", line 203, in regexp_tokenize
    return tokenizer.tokenize(text)
  File "/home/user/Desktop/nltk-3.0.4/nltk/tokenize/regexp.py", line 126, in tokenize
    self._check_regexp()
  File "/home/user/Desktop/nltk-3.0.4/nltk/tokenize/regexp.py", line 121, in _check_regexp
    self._regexp = compile_regexp_to_noncapturing(self._pattern, self._flags)
  File "/home/user/Desktop/nltk-3.0.4/nltk/internals.py", line 55, in compile_regexp_to_noncapturing
    return sre_compile.compile(convert_regexp_to_noncapturing_parsed(sre_parse.parse(pattern)), flags=flags)
  File "/home/user/Desktop/nltk-3.0.4/nltk/internals.py", line 51, in convert_regexp_to_noncapturing_parsed
    parsed_pattern.pattern.groups = 1
AttributeError: can't set attribute

Error encountered after following the instructions above to install nltk 3.0.4:

pip install https://pypi.python.org/packages/source/n/nltk/nltk-3.0.4.tar.gz

bearami commented Jan 6, 2018

I have made the changes suggested by @anupamchoudhari and @tejasshah93.
I am getting a syntax error in the regular expression @anupamchoudhari suggested. I am using Python 3.6.3. Any help fixing it is greatly appreciated, as I am a newbie in Python and NLTK.

sentence_re = r'(?:(?:[A-Z])(?:.[A-Z])+.?)|(?:\w+(?:-\w+)*)|(?:$?\d+(?:.\d+)?%?)|(?:...|)(?:[][.,;"'?():-_`])'

jamesballard commented Mar 30, 2018

The following regular expression seems to work in Python 3.x

sentence_re = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

from https://stackoverflow.com/questions/36353125/nltk-regular-expression-tokenizer

Plus other fixes -

for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
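
To see what this pattern produces without installing NLTK, it can be tried with the standard library's `re.findall`, which returns whole matches here because every group is non-capturing. This is a quick check on one sentence from the gist's sample text, not a replacement for `nltk.regexp_tokenize`.

```python
import re

sentence_re = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''

sample = "To demean the Buddha...which is to demean oneself."
tokens = re.findall(sentence_re, sample)
print(tokens)
```

Note that the ellipsis comes out as one token and the final full stop as another.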

eliksr commented Jun 14, 2018

@jamesballard Thanks! it works for me with Python 3.x

komal-bhalla commented Aug 13, 2018

I am getting an error from running the code below:
postoks = nltk.tag.pos_tag(toks)

URLError:
