@macbre
Last active March 23, 2019 20:41
spaCy playground

.gitignore

.idea/
env/
*.swp

nlp-playground

Setup

virtualenv env -p python3
source env/bin/activate
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_md

The en_core_web_md model needs to fetch ~95 MB of data.
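
A quick sanity check that the environment is ready (a minimal sketch; it only assumes the two models downloaded above and spaCy 2.x, which the Matcher.add(key, callback, pattern) calls in the scripts below imply):

```python
# load both downloaded models and parse a sample sentence with each
import spacy

for model_name in ("en_core_web_sm", "en_core_web_md"):
    nlp = spacy.load(model_name)
    doc = nlp("Elmo lives on Sesame Street.")
    print(model_name, "->", [(token.text, token.pos_) for token in doc])
```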

Examples

muppet_nlp.py

Each example below shows the text's noun chunks with their start and end token offsets, its named entities with their label and start offset, and the rule-based Matcher hits. A Matcher line reads: match ID (the hash of the pattern name) -> matched span - verb lemma = the noun chunk starting right after the span (None when no chunk starts there).

Elmo is a furry red Muppet monster with an orange nose who lives on Sesame Street.


Noun chunks:
Elmo 0 1
a furry red Muppet monster 2 7
an orange nose 8 11
who 11 12
Sesame Street 14 16

Named Entity Recognition:
Elmo ORG 0
Muppet PRODUCT 5
Sesame Street FAC 14

Matcher:
11646861348542573214 -> Elmo is  -  be = a furry red Muppet monster
11646861348542573214 -> who lives on  -  live = Sesame Street
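
The two numbers after each noun chunk are its start and end token offsets into the parsed Doc (a half-open range); named entities are printed with their label and start offset. For example (assuming the same en_core_web_sm model that muppet_nlp.py loads):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elmo is a furry red Muppet monster with an orange nose who lives on Sesame Street.")

# "a furry red Muppet monster 2 7" above is the span doc[2:7]
print(doc[2:7])
```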

Oscar the Grouch is a furry, green Grouch who lives in a trash can on Sesame Street. In fact, he loves trash so much that he's rarely seen outside of his trash can. His trademark song, explaining his passion for refuse, is "I Love Trash."

Noun chunks:
Oscar 0 1
the Grouch 1 3
a furry, green Grouch 4 9
who 9 10
a trash can 12 15
Sesame Street 16 18
fact 20 21
he 22 23
trash 24 25
he 28 29
his trash can 34 37
His trademark song 38 41
his passion 43 45
refuse 46 47
"I 49 51
Trash 52 53

Named Entity Recognition:
Grouch ORG 8
Sesame Street FAC 16
I Love Trash WORK_OF_ART 50

Matcher:
11646861348542573214 -> Grouch is  -  be = a furry, green Grouch
11646861348542573214 -> who lives in  -  live = a trash can
11646861348542573214 -> he loves  -  love = trash
11646861348542573214 -> I Love  -  love = Trash

Big Bird is an 8'2" yellow bird who lives on Sesame Street. Since Sesame Street premiered in 1969, Big Bird has entertained millions of pre-school children and their parents with his wide-eyed wondering at the world. Big Bird is also a bird who makes friends easily.

Noun chunks:
Big Bird 0 2
an 8'2" yellow bird 3 8
who 8 9
Sesame Street 11 13
Sesame Street 15 17
Big Bird 21 23
millions 25 26
pre-school children 27 31
their parents 32 34
the world 41 43
Big Bird 44 46
a bird 48 50
who 50 51
friends 52 53

Named Entity Recognition:
Big Bird ORG 0
Sesame Street FAC 11
Sesame Street FAC 15
1969 DATE 19
Big Bird ORG 21
millions CARDINAL 25
Big Bird ORG 44

Matcher:
11646861348542573214 -> Bird is  -  be = an 8'2" yellow bird
11646861348542573214 -> who lives on  -  live = Sesame Street
11646861348542573214 -> Street premiered in  -  premiere = None
11646861348542573214 -> Bird has  -  have = None
11646861348542573214 -> Bird is  -  be = None
11646861348542573214 -> who makes  -  make = friends

facts_extract.py

Each text below is followed by the facts extracted from it: a JSON object mapping a property name (usually the lemma of the matched verb, e.g. "born" becomes "bear") to its value (the noun chunk or date that follows in the matched pattern).

Elmo is a furry red Muppet monster with an orange nose who lives on Sesame Street.

{"be": "a furry red Muppet monster", "live": "Sesame Street"}

Oscar the Grouch is a furry, green Grouch who lives in a trash can on Sesame Street. In fact, he loves trash so much that he's rarely seen outside of his trash can. His trademark song, explaining his passion for refuse, is "I Love Trash."

{"be": "a furry, green Grouch", "live": "a trash can", "love": "trash"}

Big Bird is an 8'2" yellow bird who lives on Sesame Street. Since Sesame Street premiered in 1969, Big Bird has entertained millions of pre-school children and their parents with his wide-eyed wondering at the world. Big Bird is also a bird who makes friends easily.

{"be": "an 8'2\" yellow bird", "live": "Sesame Street", "make": "friends"}

Matthew Abram "Matt" Groening (born February 15, 1954) is an American cartoonist, who is best known for creating the American animated television series The Simpsons and Futurama, as well as the comic strip, Life in Hell.

{"bear": "February 15, 1954"}

Jake Simmonds was a young member of the resistance group called the Preachers. They opposed John Lumic's Cybermen on a parallel Earth. Jake kidnapped Mickey Smith, who was a parallel version of Ricky Smith, the new "most wanted" person in London, thinking he was Ricky. They found Ricky already at their base and scanned Mickey, unable to determine what he was. Mickey joined the Preachers in following an International Electromatics van in their fight. Cybermen began emptying from the vans and attacked Jackie Tyler's party. (TV: Rise of the Cybermen)

{"be": "a young member", "oppose": "John Lumic's Cybermen", "kidnap": "Mickey Smith", "a parallel version of": ["Ricky Smith"], "find": "Ricky", "join": "the Preachers"}

Bartholomew JoJo "Bart" Simpson (born April 17, 1985) is a main character and the tritagonist of The Simpsons. Bart is the mischievous, rebellious, misunderstood and "potentially dangerous" eldest child of Homer and Marge Simpson, and the older brother of Lisa and Maggie. He also has been nicknamed "Cosmo", after discovering a comet in "Bart's Comet". Bart has also been on the cover on numerous comics, such as "Critical Hit", "Simpsons Treasure Trove #11", and "Winter Wingding". Bart also has a 100-issue comic series entitled the Simpson Comics Presents Bart Simpson. Bart is loosely based on Matt Groening and his older brother, Mark Groening.

{"bear": "April 17, 1985", "be": "the mischievous, rebellious, misunderstood", "child of": ["Homer", "Marge Simpson"], "the older brother of": ["Lisa", "Maggie"]}

questions_parser.py

Each question is parsed, its noun chunks are merged into single tokens, and the resulting (text, lemma, POS, dependency) tuples are printed together with the fact lookup the question translates to.

When was Homer Simpson born?

[('When', 'when', 'ADV', 'advmod'), ('was', 'be', 'VERB', 'auxpass'), ('Homer Simpson', 'NOUN_CHUNK', 'PROPN', 'nsubj'), ('born', 'bear', 'VERB', 'ROOT'), ('?', '?', 'PUNCT', 'punct')]

Take "bear" property from "Homer Simpson" node

Who is a furry red Muppet monster?

[('Who', 'NOUN_CHUNK', 'PRON', 'nsubj'), ('is', 'be', 'VERB', 'ROOT'), ('a furry red Muppet monster', 'NOUN_CHUNK', 'DET', 'attr'), ('?', '?', 'PUNCT', 'punct')]

Who has "be" property with "a furry red Muppet monster" value?

Who kidnapped Mickey Smith?

[('Who', 'NOUN_CHUNK', 'PRON', 'nsubj'), ('kidnapped', 'kidnap', 'VERB', 'ROOT'), ('Mickey Smith', 'NOUN_CHUNK', 'PROPN', 'dobj'), ('?', '?', 'PUNCT', 'punct')]

Who has "kidnap" property with "Mickey Smith" value?

Who lives on Sesame Street?

[('Who', 'NOUN_CHUNK', 'PRON', 'nsubj'), ('lives', 'live', 'VERB', 'ROOT'), ('on', 'on', 'ADP', 'prep'), ('Sesame Street', 'NOUN_CHUNK', 'PROPN', 'pobj'), ('?', '?', 'PUNCT', 'punct')]

Who has "live" property with "Sesame Street" value?

"""
Merges tokens based on noun chunks and Named Entity Recognition.
Then facts are extracted from a sentence.
"""
import json
import re
# https://nlpforhackers.io/complete-guide-to-spacy/
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Token
texts = list()
texts.append("Elmo is a furry red Muppet monster with an orange nose who lives on Sesame Street.")
texts.append("Oscar the Grouch is a furry, green Grouch who lives in a trash can on Sesame Street. "
"In fact, he loves trash so much that he's rarely seen outside of his trash can. "
"His trademark song, explaining his passion for refuse, is \"I Love Trash.\"")
texts.append("Big Bird is an 8'2\" yellow bird who lives on Sesame Street. Since Sesame Street premiered in 1969, "
"Big Bird has entertained millions of pre-school children and their parents with his wide-eyed "
"wondering at the world. Big Bird is also a bird who makes friends easily.")
# https://futurama.fandom.com/wiki/Matt_Groening
texts.append("Matthew Abram \"Matt\" Groening (born February 15, 1954) is an American cartoonist, "
"who is best known for creating the American animated television series The Simpsons and Futurama, "
"as well as the comic strip, Life in Hell.")
# https://tardis.fandom.com/wiki/Jake_Simmonds
texts.append("Jake Simmonds was a young member of the resistance group called the Preachers. "
"They opposed John Lumic's Cybermen on a parallel Earth.\nJake kidnapped Mickey Smith, "
"who was a parallel version of Ricky Smith, the new \"most wanted\" person in London, "
"thinking he was Ricky. They found Ricky already at their base and scanned Mickey, "
"unable to determine what he was. Mickey joined the Preachers in following an "
"International Electromatics van in their fight. Cybermen began emptying from the "
"vans and attacked Jackie Tyler's party. (TV: Rise of the Cybermen)")
# https://simpsons.fandom.com/wiki/Bart_Simpson
texts.append("""
Bartholomew JoJo "Bart" Simpson (born April 17, 1985)[4] is a main character and the tritagonist of The Simpsons.
Bart is the mischievous, rebellious, misunderstood and "potentially dangerous" eldest child of Homer and Marge Simpson,
and the older brother of Lisa and Maggie. He also has been nicknamed "Cosmo", after discovering a comet in
"Bart's Comet". Bart has also been on the cover on numerous comics, such as "Critical Hit",
"Simpsons Treasure Trove #11", and "Winter Wingding".
Bart also has a 100-issue comic series entitled the Simpson Comics Presents Bart Simpson.
Bart is loosely based on Matt Groening and his older brother, Mark Groening.
""".strip())
# https://simpsons.fandom.com/wiki/Homer_Simpson
texts.append("Homer Jay Simpson (born May 12, 1956)[14] is the main protagonist of the series. "
"He is the spouse of Marge Simpson and father of Bart Simpson, Lisa Simpson, and Maggie Simpson. "
"Homer is overweight (said to be ~240 lbs), lazy, and often ignorant to the world around him. "
"Although Homer has many flaws, he has shown to have great caring, love, and even bravery "
"to those he cares about and, sometimes, even others he doesn't. "
"He served as the main protagonist of both the TV series and the 2007 film adaptation.")
nlp = spacy.load("en_core_web_sm") # https://spacy.io/models/en#en_core_web_sm
# register a custom chunk attribute, used when retokenizing the text and matching tokens
Token.set_extension('chunk', default=None)
# for text in texts[-1:]: # DEBUG
for text in texts:
    # text cleanup
    # remove references [4]
    text = re.sub(r'\[\d+]', '', text)

    doc = nlp(text)
    print("\n" + text)

    # get noun chunks and merge them
    # https://spacy.io/api/doc#retokenizer.merge
    with doc.retokenize() as retokenizer:
        for ent in doc.noun_chunks:
            # print('>', ent)
            retokenizer.merge(doc[ent.start:ent.end], attrs={"_": {"chunk": "NOUN"}, "LEMMA": "NOUN_CHUNK"})

    with doc.retokenize() as retokenizer:
        # merge DATE tokens
        first_date_token = None

        for token in doc:
            if token.ent_type_ == 'DATE':
                if first_date_token is None:
                    first_date_token = token.i
            elif first_date_token is not None:
                # print('> DATE', doc[first_date_token: token.i])
                # merge the date token
                retokenizer.merge(doc[first_date_token:token.i])
                first_date_token = None

    # show current tokens (the `continue` keeps this debug print disabled)
    for token in doc:
        continue
        print(str(token), token.pos_, token.dep_, token.ent_type_, token.lemma_, token._.chunk)
    # now let's extract some facts
    matcher = Matcher(nlp.vocab)

    # https://spacy.io/usage/rule-based-matching#adding-patterns-attributes
    matcher.add(1, None, [
        {"DEP": "nsubj"},  # Elmo / who
        {"POS": "VERB"},  # is / was / lives
        {"POS": "ADP", "OP": "?"},  # on (optional pattern - ?)
        {"LEMMA": "NOUN_CHUNK"},  # see above
    ])

    matcher.add(2, None, [
        {"POS": "VERB"},  # verb - born (bear)
        {"ENT_TYPE": "DATE"},  # date
    ])
"""
the older brother DET conj NOUN_CHUNK NOUN
of ADP prep of None
Lisa PROPN pobj PERSON NOUN_CHUNK NOUN
and CCONJ cc and None
Maggie PROPN conj PERSON NOUN_CHUNK NOUN
"""
matcher.add(3, None, [
{"LEMMA": "NOUN_CHUNK"}, # brother / father
{"LEMMA": "of"}, # of
{"ENT_TYPE": "PERSON"}, # someone{"LEMMA": "of"}, # of
{"LEMMA": "and", "OP": "?"}, # AND
{"ENT_TYPE": "PERSON", "OP": "?"}, # someone
])
    matches = matcher(doc)
    facts = dict()

    for match_id, start, end in matches:
        allow_multivalues = False

        if match_id == 1:
            # subject - property (verb) = value
            # Elmo is a furry red Muppet monster
            # Elmo = be : a furry red Muppet monster
            #
            # who lives on Sesame Street
            # who = live : Sesame Street
            prop_name = doc[start + 1].lemma_  # normalized using lemma
            prop_value = doc[end-1]  # this is the last token of the match
        elif match_id == 2:
            # born February 15, 1954
            # bear : February 15, 1954
            prop_name = doc[start].lemma_  # born -> bear
            prop_value = doc[end-1]
        elif match_id == 3:
            # "potentially dangerous" eldest child, of, Homer
            # the older brother, of, Lisa
            prop_name = str(doc[start])
            prop_value = doc[end-1]

            if 'child' in prop_name:
                prop_name = 'child'

            prop_name += ' of'
            allow_multivalues = True
        else:
            print(doc[start:end])
            continue

        prop_name = str(prop_name)
        prop_value = str(prop_value)
        # print(match_id, '->', list(doc[start:end]), ' - ', prop_name, '=', prop_value)

        # do not overwrite existing facts
        # we assume that the most important things are stated earlier
        if not allow_multivalues and prop_name not in facts:
            facts[prop_name] = prop_value
        elif allow_multivalues:
            # we allow multiple values
            if prop_name in facts:
                facts[prop_name].append(prop_value)
            else:
                facts[prop_name] = [prop_value]

    print("```json\n{}\n```".format(json.dumps(facts)))

muppet_nlp.py

# https://nlpforhackers.io/complete-guide-to-spacy/
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm") # https://spacy.io/models/en#en_core_web_sm
# nlp = spacy.load("en_core_web_md") # https://spacy.io/models/en#en_core_web_md
# doc = nlp("Elmo is a furry red Muppet monster with an orange nose who lives on Sesame Street.")
# """
doc = nlp("Oscar the Grouch is a furry, green Grouch who lives in a trash can on Sesame Street. "
"In fact, he loves trash so much that he's rarely seen outside of his trash can. "
"His trademark song, explaining his passion for refuse, is \"I Love Trash.\"")
# """
"""
doc = nlp("Big Bird is an 8'2\" yellow bird who lives on Sesame Street. Since Sesame Street premiered in 1969, "
"Big Bird has entertained millions of pre-school children and their parents with his wide-eyed "
"wondering at the world. Big Bird is also a bird who makes friends easily.")
"""
"""
# https://futurama.fandom.com/wiki/Matt_Groening
doc = nlp("Matthew Abram \"Matt\" Groening (born February 15, 1954) is an American cartoonist, "
"who is best known for creating the American animated television series The Simpsons and Futurama, "
"as well as the comic strip, Life in Hell.")
"""
print(doc)
# https://spacy.io/api/token#attributes
print("\nTokens:")
for token in doc:
    print([
        # The index of the token within the parent document.
        token.i,
        # Verbatim text content.
        token.text,
        # Base form of the token, with no inflectional suffixes.
        token.lemma_,
        # Fine-grained part-of-speech.
        token.tag_,
        # Coarse-grained part-of-speech.
        token.pos_,
        # Syntactic dependency relation.
        token.dep_,
        # Named entity type.
        token.ent_type_,
        # A scalar value indicating the positivity or negativity of the token.
        # token.sentiment
    ])
# https://spacy.io/api/doc#noun_chunks
nouns = dict()
print("\nNoun chunks:")
for chunk in doc.noun_chunks:
    # https://spacy.io/api/span
    print(chunk, chunk.start, chunk.end)
    nouns[chunk.start] = str(chunk)
# https://spacy.io/api/doc#ents
print("\nNamed Entity Recognition:")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start)
# https://spacy.io/api/matcher#call
print("\nMatcher:")
matcher = Matcher(nlp.vocab)
# https://spacy.io/usage/rule-based-matching#adding-patterns-attributes
matcher.add("NounIs", None, [
# {"POS": "PROPN"}, # Elmo
# {"LEMMA": "be"}, # is / was
{"DEP": "nsubj"}, # Elmo / who
{"POS": "VERB"}, # is / was / lives
{"POS": "ADP", "OP": "?"}, # on (optional pattern - ?)
])
matches = matcher(doc)
for match_id, start, end in matches:
    # subject - property (verb) = value
    print(match_id, '->', doc[start:end], ' - ', doc[start+1].lemma_, '=', nouns.get(end))
"""
Understanding of questions
"""
import spacy
from spacy.matcher import Matcher
# see facts_extract.py for answers
questions = [
    'When was Homer Simpson born?',
    'Who is a furry red Muppet monster?',
    'Who kidnapped Mickey Smith?',
    'Who lives on Sesame Street?',
]
nlp = spacy.load("en_core_web_sm") # https://spacy.io/models/en#en_core_web_sm
for question in questions:
    doc = nlp(question)
    print('\n> ' + str(doc))

    # get noun chunks and merge them
    # https://spacy.io/api/doc#retokenizer.merge
    with doc.retokenize() as retokenizer:
        for ent in doc.noun_chunks:
            # print('>', ent)
            retokenizer.merge(doc[ent.start:ent.end], attrs={"LEMMA": "NOUN_CHUNK"})

    print('```json')
    print([(str(token), token.lemma_, token.pos_, token.dep_) for token in doc])
    print('```')

    # now let's match the question
    matcher = Matcher(nlp.vocab)

    # https://spacy.io/usage/rule-based-matching#adding-patterns-attributes

    # > When was Homer Simpson born?
    # [('When', 'when', 'ADV', 'advmod'), ('was', 'be', 'VERB', 'auxpass'), ('Homer Simpson', 'NOUN_CHUNK', 'PROPN', 'nsubj'), ('born', 'bear', 'VERB', 'ROOT'), ('?', '?', 'PUNCT', 'punct')]
    matcher.add(1, None, [
        {"DEP": "advmod"},  # When
        {"LEMMA": "be"},  # is / was / be / did
        {"LEMMA": "NOUN_CHUNK"},  # something
        {"POS": "VERB"},  # born / die
    ])

    # > Who is a furry red Muppet monster?
    # [('Who', 'NOUN_CHUNK', 'PRON', 'nsubj'), ('is', 'be', 'VERB', 'ROOT'), ('a furry red Muppet monster', 'NOUN_CHUNK', 'DET', 'attr'), ('?', '?', 'PUNCT', 'punct')]
    matcher.add(2, None, [
        {"DEP": "nsubj"},  # Who
        {"POS": "VERB"},  # is / was / lives
        {"DEP": "prep", "OP": "?"},  # on (optional)
        {"LEMMA": "NOUN_CHUNK"},  # something
    ])

    for match_id, start, end in matcher(doc):
        # print(match_id, '->', list((token, token.lemma_) for token in doc[start:end]))
        if match_id == 1:
            print('Take "{}" property from "{}" node'.format(doc[end-1].lemma_, doc[start+2]))
        elif match_id == 2:
            print('Who has "{}" property with "{}" value?'.format(doc[start+1].lemma_, doc[end-1]))