@bdewilde
Last active April 12, 2019 15:18
basic regular expression chunker and chunk-getter
def chunk_tagged_sents(tagged_sents):
    from nltk.chunk import regexp

    # define a chunk "grammar", i.e. chunking rules
    grammar = r"""
        NP: {<DT|PP\$>?<JJ>*<NN.*>+} # noun phrase
        PP: {<IN><NP>} # prepositional phrase
        VP: {<MD>?<VB.*><NP|PP>} # verb phrase
        CLAUSE: {<NP><VP>} # full clause
    """
    chunker = regexp.RegexpParser(grammar, loop=2)
    chunked_sents = [chunker.parse(tagged_sent) for tagged_sent in tagged_sents]

    return chunked_sents

def get_chunks(chunked_sents, chunk_type='NP'):
    all_chunks = []
    # chunked sentences are in the form of nested trees
    for tree in chunked_sents:
        chunks = []
        # iterate through subtrees / leaves to get individual chunks
        # (NLTK 3 replaced the old `subtree.node` attribute with `subtree.label()`)
        raw_chunks = [subtree.leaves() for subtree in tree.subtrees()
                      if subtree.label() == chunk_type]
        for raw_chunk in raw_chunks:
            chunk = []
            for word_tag in raw_chunk:
                # drop POS tags, keep words
                chunk.append(word_tag[0])
            chunks.append(' '.join(chunk))
        all_chunks.append(chunks)

    return all_chunks

mehdiHadji commented Apr 12, 2019

Hey, I appreciate your code. In my case, I want to write a grammar like this:

<JJ><NN><anything>
<RB><JJ><not NN nor NNS>

But I am finding this difficult. Do you have any documentation that would help me?
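
A possible starting point for the two patterns above, sketched with plain Python `re` rather than NLTK's grammar syntax. It borrows the general idea of encoding a sentence as a string of `<TAG>` tokens and matching a regular expression against it. The names `encode_tags`, `find_matches`, `JJ_NN_ANY` and `RB_JJ_NOT_NOUN` are made up for illustration and are not part of NLTK or of the gist. In an NLTK grammar, `<.*>` matches any tag and a chink rule such as `}<NN|NNS>{` removes tags from a chunk; whether a lookahead like the one below is accepted inside an NLTK tag pattern has not been checked here.

```python
import re


def encode_tags(tagged_sent):
    """Encode a tagged sentence as one string of <TAG> tokens, e.g. '<RB><JJ><VBD>'."""
    return "".join("<{}>".format(tag) for _, tag in tagged_sent)


# Hypothetical patterns for the two rules in the question.
# <[^<>]+> matches any single tag; (?!NNS?>) rejects exactly NN and NNS.
JJ_NN_ANY = re.compile(r"<JJ><NN><[^<>]+>")
RB_JJ_NOT_NOUN = re.compile(r"<RB><JJ><(?!NNS?>)[^<>]+>")


def find_matches(tagged_sent, pattern):
    """Return the word sequences whose tags match `pattern`."""
    tag_string = encode_tags(tagged_sent)

    # Map the character offset where each encoded tag starts to its token index.
    starts = {}
    offset = 0
    for i, (_, tag) in enumerate(tagged_sent):
        starts[offset] = i
        offset += len(tag) + 2  # +2 for the surrounding '<' and '>'

    results = []
    for m in pattern.finditer(tag_string):
        first = starts[m.start()]
        n_tokens = m.group().count("<")  # one '<' per matched tag
        results.append([word for word, _ in tagged_sent[first:first + n_tokens]])
    return results


sent = [("very", "RB"), ("good", "JJ"), ("indeed", "RB"),
        ("nice", "JJ"), ("day", "NN"), ("today", "NN")]
print(find_matches(sent, RB_JJ_NOT_NOUN))  # [['very', 'good', 'indeed']]
print(find_matches(sent, JJ_NN_ANY))       # [['nice', 'day', 'today']]
```

This assumes `tagged_sent` is a list of `(word, tag)` tuples, the same shape that `chunk_tagged_sents` takes for each sentence, and that no tag contains `<` or `>`.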
