@Mageswaran1989
Created April 17, 2020 16:27
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Natural Language Processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Online materials:\n",
"- https://www.youtube.com/watch?v=jB1-NukGZm0\n",
"- https://course.spacy.io/\n",
"- https://spacy.io/usage/spacy-101\n",
"- https://github.com/explosion/spacy-notebooks\n",
"\n",
"In this notebook we will learn the basics of NLP using [spaCy](https://spacy.io).\n",
"\n",
"Batteries included\n",
"\n",
"- Index preserving tokenization (details about this later)\n",
"- Models for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing\n",
"- Supports 8 languages out of the box\n",
"- Easy and beautiful visualizations\n",
"- Pretrained word vectors\n",
"\n",
"It plays nicely with all the other already existing tools that you know and love: Scikit-Learn, TensorFlow, ...\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Topics\n",
"- Tokenization\n",
"- Part Of Speech\n",
"- Named Entity Recognition\n",
"- Sentence Detection\n",
"- Text Normalization\n",
"- Word Vectors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So why do we need an NLP library? We will explore the answer shortly, but first let's see how a basic operation like splitting a paragraph into sentences can go wrong.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/opt/envs/aie/bin/python\n"
]
}
],
"source": [
"!which python\n",
"# /opt/envs/aie/bin/python"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The waiter was very rude, \n",
"e\n",
"g\n",
" when I accidentally opened the wrong door\n",
"he screamed \"Private!\"\n",
"\n"
]
}
],
"source": [
"paragraph = '''The waiter was very rude, \n",
"e.g. when I accidentally opened the wrong door\n",
"he screamed \"Private!\".'''\n",
"\n",
"sentences = paragraph.split('.')\n",
"print('\\n'.join(sentences))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Import spacy and English models\n",
"import spacy\n",
"nlp = spacy.load('en_core_web_sm')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Language Processing Pipelines\n",
"\n",
"- When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object.\n",
"- The `Doc` is then processed in several different steps – this is also referred to as the processing pipeline.\n",
"- The pipeline used by the default models consists of a `tagger`, a `parser` and an `entity recognizer`.\n",
"\n",
"More info [here](https://spacy.io/usage/processing-pipelines)\n"
]
},
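{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pipeline described above can be inspected directly. The cell below is a minimal sketch, assuming the `en_core_web_sm` model loaded earlier is installed. It lists the component names with `nlp.pipe_names`, then tokenizes a sentence with `nlp.make_doc` and applies each component in turn, which mirrors what a call to `nlp(text)` does. The names `sample` and `manual_doc` are illustrative only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import spacy\n",
"\n",
"# Assumes the small English model used throughout this notebook is installed.\n",
"nlp = spacy.load('en_core_web_sm')\n",
"\n",
"# Names of the pipeline components, in the order they are applied.\n",
"print(nlp.pipe_names)\n",
"\n",
"sample = 'Apple is looking at buying U.K. startup for $1 billion'\n",
"\n",
"# Step 1: tokenization only, no annotations yet.\n",
"manual_doc = nlp.make_doc(sample)\n",
"\n",
"# Step 2: apply each pipeline component in turn.\n",
"for name, component in nlp.pipeline:\n",
"    manual_doc = component(manual_doc)\n",
"\n",
"# The manually processed Doc now carries tags like a Doc from nlp(sample).\n",
"print([(token.text, token.pos_) for token in manual_doc])"
]
},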
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenization\n",
"\n",
"https://spacy.io/usage/spacy-101#annotations-token\n",
"\n",
"Tokenization is the process of segmenting text into words, punctuation etc. \n",
"spaCy tokenizes the text, processes it, and stores the data in the Doc object.\n",
"The Token class exposes a lot of word-level attributes.\n",
"\n",
"![](https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg)"
]
},
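{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the index-preserving claim concrete, here is a small sketch. It uses `spacy.blank('en')`, a pipeline that contains only the tokenizer, so no trained model is needed. The names `blank_nlp`, `sample`, `sample_doc` and `rebuilt` are illustrative. Each `token.idx` is the character offset of the token in the input, and joining `token.text_with_ws` over all tokens should rebuild the original string."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import spacy\n",
"\n",
"# A blank English pipeline holds only the tokenizer (no tagger, parser or NER).\n",
"blank_nlp = spacy.blank('en')\n",
"sample = \"I can't imagine spending $3000 in N.Y.C.\"\n",
"sample_doc = blank_nlp(sample)\n",
"\n",
"# token.idx is the character offset of the token in the original string.\n",
"for token in sample_doc:\n",
"    print(token.idx, repr(token.text), repr(sample[token.idx:token.idx + len(token.text)]))\n",
"\n",
"# Joining each token with its trailing whitespace rebuilds the input.\n",
"rebuilt = ''.join(token.text_with_ws for token in sample_doc)\n",
"print(rebuilt == sample)"
]
},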
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import getpass\n",
"username = getpass.getuser()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Hi, Myself mageswarand. I am glad to learn NLP and use it wherver applicable!'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = \"Hi, Myself {}. I am glad to learn NLP and use it wherver applicable!\".format(username)\n",
"text"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"doc = nlp(text)\n",
"# Note: after processing the text, spaCy keeps all the information\n",
"# about the original text intact within the Doc object."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Hi\t0\thi\tFalse\tFalse\tXx\tINTJ\tUH\n",
",\t2\t,\tTrue\tFalse\t,\tPUNCT\t,\n",
"Myself\t4\tMyself\tFalse\tFalse\tXxxxx\tPROPN\tNNP\n",
"mageswarand\t11\tmageswarand\tFalse\tFalse\txxxx\tNOUN\tNN\n",
".\t22\t.\tTrue\tFalse\t.\tPUNCT\t.\n",
"I\t24\t-PRON-\tFalse\tFalse\tX\tPRON\tPRP\n",
"am\t26\tbe\tFalse\tFalse\txx\tAUX\tVBP\n",
"glad\t29\tglad\tFalse\tFalse\txxxx\tADJ\tJJ\n",
"to\t34\tto\tFalse\tFalse\txx\tPART\tTO\n",
"learn\t37\tlearn\tFalse\tFalse\txxxx\tVERB\tVB\n",
"NLP\t43\tNLP\tFalse\tFalse\tXXX\tPROPN\tNNP\n",
"and\t47\tand\tFalse\tFalse\txxx\tCCONJ\tCC\n",
"use\t51\tuse\tFalse\tFalse\txxx\tVERB\tVB\n",
"it\t55\t-PRON-\tFalse\tFalse\txx\tPRON\tPRP\n",
"wherver\t58\twherver\tFalse\tFalse\txxxx\tADV\tRB\n",
"applicable\t66\tapplicable\tFalse\tFalse\txxxx\tADJ\tJJ\n",
"!\t76\t!\tTrue\tFalse\t!\tPUNCT\t.\n"
]
}
],
"source": [
"for token in doc:\n",
"    print(\"{0}\\t{1}\\t{2}\\t{3}\\t{4}\\t{5}\\t{6}\\t{7}\".format(\n",
"        token.text,\n",
"        token.idx,\n",
"        token.lemma_,\n",
"        token.is_punct,\n",
"        token.is_space,\n",
"        token.shape_,\n",
"        token.pos_,\n",
"        token.tag_\n",
"    ))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[spaCy Doc](https://spacy.io/api/doc)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I\n",
"ca\n",
"n't\n",
"imagine\n",
"spending\n",
"$\n",
"3000\n",
"for\n",
"a\n",
"single\n",
"bedroom\n",
"apartment\n",
"in\n",
"N.Y.C.\n"
]
}
],
"source": [
"text = u\"I can't imagine spending $3000 for a single bedroom apartment in N.Y.C.\"\n",
"doc = nlp(text)\n",
"# Print out tokens\n",
"for token in doc:\n",
"    print(token)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sentence Detection"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def print_sentences(text):\n",
"    doc = nlp(text)\n",
"    for sent in doc.sents:\n",
"        print(sent)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_sentences(paragraph)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text = \"The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother's keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_sentences(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stop Words"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"I\n",
"ca\n",
"n't\n",
"for\n",
"a\n",
"in\n"
]
}
],
"source": [
"doc = nlp(text)\n",
"for word in doc:\n",
"    if word.is_stop:\n",
"        print(word)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Named entity recognition (NER) identifies spans of text that refer to real-world entities, such as the name of a person, place, or organization.\n",
"\n",
"`spaCy` can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.\n",
"\n",
"https://spacy.io/usage/linguistic-features#section-named-entities"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Apple 0 5 ORG\n",
"U.K. 27 31 GPE\n",
"$1 billion 44 54 MONEY\n"
]
}
],
"source": [
"doc = nlp(\"Apple is looking at buying U.K. startup for $1 billion\")\n",
"\n",
"for ent in doc.ents:\n",
"    print(ent.text, ent.start_char, ent.end_char, ent.label_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**Entities Explained** : https://spacy.io/api/annotation#named-entities\n",
"\n",
"    PERSON: People, including fictional.\n",
"    NORP: Nationalities or religious or political groups.\n",
"    FAC: Buildings, airports, highways, bridges, etc.\n",
"    ORG: Companies, agencies, institutions, etc.\n",
"    GPE: Countries, cities, states.\n",
"    LOC: Non-GPE locations, mountain ranges, bodies of water.\n",
"    PRODUCT: Objects, vehicles, foods, etc. (Not services.)\n",
"    EVENT: Named hurricanes, battles, wars, sports events, etc.\n",
"    WORK_OF_ART: Titles of books, songs, etc.\n",
"    LAW: Named documents made into laws.\n",
"    LANGUAGE: Any named language.\n",
"    DATE: Absolute or relative dates or periods.\n",
"    TIME: Times smaller than a day.\n",
"    PERCENT: Percentage, including \"%\".\n",
"    MONEY: Monetary values, including unit.\n",
"    QUANTITY: Measurements, as of weight or distance.\n",
"    ORDINAL: \"first\", \"second\", etc.\n",
"    CARDINAL: Numerals that do not fall under another type.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"def explain_text_entities(text):\n",
"    doc = nlp(text)\n",
"    for ent in doc.ents:\n",
"        print(f'Entity: {ent}, Label: {ent.label_}, {spacy.explain(ent.label_)}')"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entity: Apple, Label: ORG, Companies, agencies, institutions, etc.\n",
"Entity: U.K., Label: GPE, Countries, cities, states\n",
"Entity: $1 billion, Label: MONEY, Monetary values, including unit\n"
]
}
],
"source": [
"explain_text_entities(\"Apple is looking at buying U.K. startup for $1 billion\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from spacy import displacy"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">But \n",
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Google\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
" is starting from behind. The company made a late push into hardware, and \n",
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Apple\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
"’s \n",
"<mark class=\"entity\" style=\"background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Siri\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PRODUCT</span>\n",
"</mark>\n",
", available on iPhones, and \n",
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Amazon\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
"’s \n",
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Alexa\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
" software, which runs on its \n",
"<mark class=\"entity\" style=\"background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Echo\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PRODUCT</span>\n",
"</mark>\n",
" and \n",
"<mark class=\"entity\" style=\"background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Dot\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PRODUCT</span>\n",
"</mark>\n",
" devices, have clear leads in consumer adoption.</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"text = \"\"\"But Google is starting from behind. The company made a late push into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption.\"\"\"\n",
"doc = nlp(text)\n",
"displacy.render(doc, style='ent', jupyter=True)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">But \n",
"<mark class=\"entity\" style=\"background: linear-gradient(90deg, #aa9cfc, #fc9ce7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Google\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
" is starting from behind. The company made a late push into hardware, and \n",
"<mark class=\"entity\" style=\"background: linear-gradient(90deg, #aa9cfc, #fc9ce7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Apple\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
"’s Siri, available on iPhones, and \n",
"<mark class=\"entity\" style=\"background: linear-gradient(90deg, #aa9cfc, #fc9ce7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Amazon\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
"’s \n",
"<mark class=\"entity\" style=\"background: linear-gradient(90deg, #aa9cfc, #fc9ce7); padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Alexa\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
" software, which runs on its Echo and Dot devices, have clear leads in consumer adoption.</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"colors = {\"ORG\": \"linear-gradient(90deg, #aa9cfc, #fc9ce7)\"}\n",
"options = {\"ents\": [\"ORG\"], \"colors\": colors}\n",
"displacy.render(doc, style='ent', jupyter=True, options=options)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**\n",
"Imagine you are a journalist who wants to publish a large set of documents while still hiding the identity of your sources. Can you write a function that masks all personal names, for example by replacing them with \"[MASKED]\"?"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
"<mark class=\"entity\" style=\"background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Harry Potter\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PERSON</span>\n",
"</mark>\n",
" is a series of fantasy novels written by \n",
"<mark class=\"entity\" style=\"background: #c887fb; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" British\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">NORP</span>\n",
"</mark>\n",
" author \n",
"<mark class=\"entity\" style=\"background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" J. K. Rowling\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PERSON</span>\n",
"</mark>\n",
". The novels chronicle the lives of a young wizard, \n",
"<mark class=\"entity\" style=\"background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Harry Potter\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PERSON</span>\n",
"</mark>\n",
", and his friends \n",
"<mark class=\"entity\" style=\"background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Hermione Granger\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PERSON</span>\n",
"</mark>\n",
" and \n",
"<mark class=\"entity\" style=\"background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Ron Weasley\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PERSON</span>\n",
"</mark>\n",
", all of whom are students at \n",
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Hogwarts School of Witchcraft\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
" and \n",
"<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Wizardry\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
"</mark>\n",
". The main story \n",
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" arc\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
" concerns \n",
"<mark class=\"entity\" style=\"background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Harry\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PERSON</span>\n",
"</mark>\n",
"'s struggle against Lord \n",
"<mark class=\"entity\" style=\"background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" Voldemort\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">PRODUCT</span>\n",
"</mark>\n",
", a dark wizard who intends to become immortal, overthrow the wizard governing body known as \n",
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
" the Ministry of Magic\n",
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; text-transform: uppercase; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
"</mark>\n",
", and subjugate all wizards and Muggles (non-magical people).</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def mask_names(text):\n",
"    # your code here\n",
"    return text\n",
"\n",
"original_text = \"Harry Potter is a series of fantasy novels written by British author J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal, overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles (non-magical people).\"\n",
"masked_text = mask_names(original_text)\n",
"doc = nlp(masked_text)\n",
"displacy.render(doc, style='ent', jupyter=True)"
]
},
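{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach to the exercise above, offered as a sketch rather than a reference solution: collect the character spans of `PERSON` entities and replace them from right to left, so that earlier offsets stay valid. The helper `mask_spans` and the hand-written `example_spans` are hypothetical names used for illustration. With spaCy, the spans could be built from `ent.start_char`, `ent.end_char` and `ent.label_` for each `ent` in `doc.ents`. Keep in mind that the model can miss names (in the output above, Voldemort is tagged as PRODUCT), so real anonymization still needs manual review."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def mask_spans(text, spans, label='PERSON', mask='[MASKED]'):\n",
"    # spans is an iterable of (start_char, end_char, label) tuples.\n",
"    # Work from right to left so that replacing one span does not\n",
"    # shift the offsets of the spans that are still to be handled.\n",
"    for start, end, span_label in sorted(spans, reverse=True):\n",
"        if span_label == label:\n",
"            text = text[:start] + mask + text[end:]\n",
"    return text\n",
"\n",
"\n",
"# Hand-written spans for illustration. With spaCy they could be built as\n",
"# [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]\n",
"example_text = 'Harry Potter was written by J. K. Rowling.'\n",
"example_spans = [(0, 12, 'PERSON'), (28, 41, 'PERSON')]\n",
"print(mask_spans(example_text, example_spans))"
]
},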
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lemmatization\n",
"\n",
"Lemmatization assigns each word its base form (lemma). For example: “was” → “be” or “cats” → “cat”\n",
"\n",
"To obtain lemmas, the text must first be processed by the pipeline. Each token in the resulting `Doc` object then exposes its lemma as `token.lemma_`."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Success success nsubj\n",
"is be ROOT\n",
"not not neg\n",
"final final acomp\n",
". . punct\n"
]
}
],
"source": [
"doc = nlp(u\"Success is not final.\")\n",
"for token in doc:\n",
"    print(token.text, token.lemma_, token.dep_)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Men man nsubj\n",
"are be aux\n",
"climbing climb ROOT\n",
"up up prep\n",
"the the det\n",
"trees tree pobj\n",
". . punct\n"
]
}
],
"source": [
"doc = nlp(u\"Men are climbing up the trees.\")\n",
"for token in doc:\n",
"    print(token.text, token.lemma_, token.dep_)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Part Of Speech Tagging\n",
"\n",
"Check the list [here](https://spacy.io/api/annotation#pos-tagging)!\n",
"\n",
"Part-of-speech (POS) tagging assigns a part of speech, such as noun or verb, to each word in a sentence.\n",
"\n",
"Sometimes, we want to quickly pull out keywords, or keyphrases from a larger body of text. This helps us mentally paint a picture of what this text is about. This is particularly helpful in analysis of texts like long emails or essays.\n",
"\n",
"After tokenization, the text goes through parsing and tagging. With the use of the statistical model, spaCy can predict the most likely tag/label for a token in a given context.\n"
]
},
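{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the keyword idea above, the sketch below keeps only nouns and proper nouns that are not stop words. It assumes the `en_core_web_sm` model used earlier is installed. The names `keyword_doc` and `keywords` are illustrative, and filtering on `pos_` is just one simple heuristic, not a full keyphrase extractor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import spacy\n",
"\n",
"# Assumes the small English model used throughout this notebook is installed.\n",
"nlp = spacy.load('en_core_web_sm')\n",
"\n",
"keyword_doc = nlp('Harry Potter is a series of fantasy novels written by British author J. K. Rowling.')\n",
"\n",
"# Keep nouns and proper nouns that are not stop words as keyword candidates.\n",
"keywords = [token.text for token in keyword_doc\n",
"            if token.pos_ in {'NOUN', 'PROPN'} and not token.is_stop]\n",
"print(keywords)"
]
},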
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Harry\tHarry\tPROPN\tNNP\tcompound\tXxxxx\tTrue\tFalse\n",
"Potter\tPotter\tPROPN\tNNP\tnsubj\tXxxxx\tTrue\tFalse\n",
"is\tbe\tAUX\tVBZ\tROOT\txx\tTrue\tTrue\n",
"a\ta\tDET\tDT\tdet\tx\tTrue\tTrue\n",
"series\tseries\tNOUN\tNN\tattr\txxxx\tTrue\tFalse\n",
"of\tof\tADP\tIN\tprep\txx\tTrue\tTrue\n",
"fantasy\tfantasy\tNOUN\tNN\tcompound\txxxx\tTrue\tFalse\n",
"novels\tnovel\tNOUN\tNNS\tpobj\txxxx\tTrue\tFalse\n",
"written\twrite\tVERB\tVBN\tacl\txxxx\tTrue\tFalse\n",
"by\tby\tADP\tIN\tagent\txx\tTrue\tTrue\n",
"British\tbritish\tADJ\tJJ\tamod\tXxxxx\tTrue\tFalse\n",
"author\tauthor\tNOUN\tNN\tcompound\txxxx\tTrue\tFalse\n",
"J.\tJ.\tPROPN\tNNP\tcompound\tX.\tFalse\tFalse\n",
"K.\tK.\tPROPN\tNNP\tcompound\tX.\tFalse\tFalse\n",
"Rowling\tRowling\tPROPN\tNNP\tpobj\tXxxxx\tTrue\tFalse\n",
".\t.\tPUNCT\t.\tpunct\t.\tFalse\tFalse\n"
]
}
],
"source": [
"doc = nlp(\"Harry Potter is a series of fantasy novels written by British author J. K. Rowling.\")\n",
"# For each token, print corresponding part of speech tag\n",
"for token in doc:\n",
"    print('%s\\t%s\\t%s\\t%s\\t%s\\t%s\\t%s\\t%s' % (token.text, token.lemma_, token.pos_, token.tag_, token.dep_,\n",
"                                               token.shape_, token.is_alpha, token.is_stop))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Harry PROPN\n",
"Potter PROPN\n",
"is AUX\n",
"a DET\n",
"series NOUN\n",
"of ADP\n",
"fantasy NOUN\n",
"novels NOUN\n",
"written VERB\n",
"by ADP\n",
"British ADJ\n",
"author NOUN\n",
"J. PROPN\n",
"K. PROPN\n",
"Rowling PROPN\n",
". PUNCT\n"
]
}
],
"source": [
"for token in doc:\n",
"    print(token, token.pos_)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"def explain_pos(text):\n",
"    doc = nlp(text)\n",
"    for word in doc:\n",
"        print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Harry PROPN NNP noun, proper singular\n",
"Potter PROPN NNP noun, proper singular\n",
"is AUX VBZ verb, 3rd person singular present\n",
"a DET DT determiner\n",
"series NOUN NN noun, singular or mass\n",
"of ADP IN conjunction, subordinating or preposition\n",
"fantasy NOUN NN noun, singular or mass\n",
"novels NOUN NNS noun, plural\n",
"written VERB VBN verb, past participle\n",
"by ADP IN conjunction, subordinating or preposition\n",
"British ADJ JJ adjective\n",
"author NOUN NN noun, singular or mass\n",
"J. PROPN NNP noun, proper singular\n",
"K. PROPN NNP noun, proper singular\n",
"Rowling PROPN NNP noun, proper singular\n",
". PUNCT . punctuation mark, sentence closer\n"
]
}
],
"source": [
"explain_pos(\"Harry Potter is a series of fantasy novels written by British author J. K. Rowling.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Why is POS tagging useful?**\n",
"\n",
"POS tagging is particularly useful when a word can take more than one POS tag. For instance, the word \"google\" can be used as both a noun and a verb, depending on the context. When processing natural language, it is important to identify this difference. Fortunately, spaCy ships with statistical models that use the context (the surrounding words) to predict the most likely POS tag for each word."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Can VERB MD verb, modal auxiliary\n",
"you PRON PRP pronoun, personal\n",
"google VERB VB verb, base form\n",
"it PRON PRP pronoun, personal\n",
"? PUNCT . punctuation mark, sentence closer\n"
]
}
],
"source": [
"explain_pos('Can you google it?')\n",
"# From the output, you can see that the word \"google\" has been correctly identified as a verb."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Can VERB MD verb, modal auxiliary\n",
"you PRON PRP pronoun, personal\n",
"search VERB VB verb, base form\n",
"it PRON PRP pronoun, personal\n",
"on ADP IN conjunction, subordinating or preposition\n",
"google PROPN NNP noun, proper singular\n",
"? PUNCT . punctuation mark, sentence closer\n"
]
}
],
"source": [
"explain_pos('Can you search it on google?')\n",
"# Here the word \"google\" is used as a noun, and the output tags it as a proper noun (PROPN)."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"en\" id=\"9e9f8a21c9924a7c910c06df311c1b32-0\" class=\"displacy\" width=\"1325\" height=\"307.0\" direction=\"ltr\" style=\"max-width: none; height: 307.0px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr\">\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Harry</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"135\">Potter</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"135\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"220\">is</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"220\">AUX</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"305\">a</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"305\">DET</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"390\">series</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"390\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"475\">of</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"475\">ADP</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"560\">fantasy</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"560\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"645\">novels</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"645\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"730\">written</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"730\">VERB</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"815\">by</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"815\">ADP</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"900\">British</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"900\">ADJ</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"985\">author</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"985\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1070\">J.</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1070\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1155\">K.</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1155\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"217.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1240\">Rowling.</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1240\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-0\" stroke-width=\"2px\" d=\"M70,172.0 C70,129.5 120.0,129.5 120.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M70,174.0 L62,162.0 78,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-1\" stroke-width=\"2px\" d=\"M155,172.0 C155,129.5 205.0,129.5 205.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M155,174.0 L147,162.0 163,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-2\" stroke-width=\"2px\" d=\"M325,172.0 C325,129.5 375.0,129.5 375.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M325,174.0 L317,162.0 333,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-3\" stroke-width=\"2px\" d=\"M240,172.0 C240,87.0 380.0,87.0 380.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">attr</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M380.0,174.0 L388.0,162.0 372.0,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-4\" stroke-width=\"2px\" d=\"M410,172.0 C410,129.5 460.0,129.5 460.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M460.0,174.0 L468.0,162.0 452.0,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-5\" stroke-width=\"2px\" d=\"M580,172.0 C580,129.5 630.0,129.5 630.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M580,174.0 L572,162.0 588,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-6\" stroke-width=\"2px\" d=\"M495,172.0 C495,87.0 635.0,87.0 635.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-6\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M635.0,174.0 L643.0,162.0 627.0,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-7\" stroke-width=\"2px\" d=\"M665,172.0 C665,129.5 715.0,129.5 715.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-7\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">acl</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M715.0,174.0 L723.0,162.0 707.0,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-8\" stroke-width=\"2px\" d=\"M750,172.0 C750,129.5 800.0,129.5 800.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-8\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">agent</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M800.0,174.0 L808.0,162.0 792.0,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-9\" stroke-width=\"2px\" d=\"M920,172.0 C920,129.5 970.0,129.5 970.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-9\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M920,174.0 L912,162.0 928,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-10\" stroke-width=\"2px\" d=\"M1005,172.0 C1005,44.5 1235.0,44.5 1235.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-10\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1005,174.0 L997,162.0 1013,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-11\" stroke-width=\"2px\" d=\"M1090,172.0 C1090,87.0 1230.0,87.0 1230.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-11\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1090,174.0 L1082,162.0 1098,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-12\" stroke-width=\"2px\" d=\"M1175,172.0 C1175,129.5 1225.0,129.5 1225.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-12\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1175,174.0 L1167,162.0 1183,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-9e9f8a21c9924a7c910c06df311c1b32-0-13\" stroke-width=\"2px\" d=\"M835,172.0 C835,2.0 1240.0,2.0 1240.0,172.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-9e9f8a21c9924a7c910c06df311c1b32-0-13\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1240.0,174.0 L1248.0,162.0 1232.0,162.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"</svg>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sen = nlp(\"Harry Potter is a series of fantasy novels written by British author J. K. Rowling.\")\n",
"displacy.render(sen, style='dep', jupyter=True, options={'distance': 85})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Chunking\n",
"\n",
"https://spacy.io/usage/linguistic-features#noun-chunks\n",
"\n",
"Noun chunks are \"base noun phrases\": flat phrases that have a noun as their head, together with the words describing that noun. For example, \"the blue skies\" or \"the world’s largest conglomerate\".\n",
"\n",
"For instance, in the sentence “Tall big tree is in the vast garden”, the words “tall” and “big” describe the noun “tree”, and “vast” describes the noun “garden”.\n",
"\n",
"To get the noun chunks in a document, simply iterate over `doc.noun_chunks`:\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wall Street Journal \t NP \t Journal\n",
"an interesting piece \t NP \t piece\n",
"crypto currencies \t NP \t currencies\n"
]
}
],
"source": [
"doc = nlp(\"Wall Street Journal just published an interesting piece on crypto currencies\")\n",
"for chunk in doc.noun_chunks:\n",
" print(chunk.text, '\\t', chunk.label_, '\\t', chunk.root.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Useful attributes of each noun chunk:\n",
"\n",
"- `text`: the original noun chunk text.\n",
"- `label_`: the label of the chunk (`NP`, for noun phrase, in the output above).\n",
"- `root.text`: the original text of the word connecting the noun chunk to the rest of the parse.\n",
"- `root.dep_`: the dependency relation connecting the root to its head.\n",
"- `root.head.text`: the text of the root token’s head."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Break!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use our NLP skills to do a simple exploration!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"What does Trump talk about?\n",
"\n",
"It might be interesting to explore what Trump actually talks about. Is it always the 'Angry Dems'? Or is he a narcissist with too many mentions of The President and the USA?\n",
"\n",
"One way to explore this is to mine all the entities and noun chunks from his tweets. Let's go ahead and do that using spaCy.\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/envs/aie/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (0,1,2,3,4,6) have mixed types.Specify dtype option on import or set low_memory=False.\n",
" interactivity=interactivity, compiler=compiler, result=result)\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 360x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"import re\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns  # for visualization\n",
"\n",
"plt.style.use('seaborn')\n",
"sns.set(font_scale=2)\n",
"\n",
"tweets = pd.read_csv(\"../data/dataset/trump_tweets.csv\")\n",
"\n",
"# Join all tweets into one long string (missing values are skipped)\n",
"text = tweets['text'].str.cat(sep=' ')\n",
"# spaCy refuses texts longer than `nlp.max_length` (1,000,000 characters by default).\n",
"# Since `text` might be longer than that, we truncate it here\n",
"max_length = 1000000-1\n",
"text = text[:max_length]\n",
"\n",
"# Remove URLs and '&amp;' HTML entities using regexes.\n",
"# The URL pattern is crude: lowercase letters, then ':' or '.', then any non-space run,\n",
"# so it also strips non-URL tokens such as 'a.m.'\n",
"url_reg = r'[a-z]*[:.]+\\S+'\n",
"text = re.sub(url_reg, '', text)\n",
"noise_reg = r'&amp;?'\n",
"text = re.sub(noise_reg, '', text)\n",
"\n",
"doc = nlp(text)\n",
"\n",
"# Each noun chunk is a spaCy `Span`; keep only its text so pandas can count the strings\n",
"items_of_interest = [chunk.text for chunk in doc.noun_chunks]\n",
"\n",
"# Plot the 10 most frequent noun chunks\n",
"df_nouns = pd.DataFrame(items_of_interest, columns=[\"TrumpSays\"])\n",
"plt.figure(figsize=(5,4))\n",
"sns.countplot(y=\"TrumpSays\",\n",
"              data=df_nouns,\n",
"              order=df_nouns[\"TrumpSays\"].value_counts().iloc[:10].index)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### WordCloud\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from spacy.lang.en.stop_words import STOP_WORDS\n",
"from wordcloud import WordCloud\n",
"\n",
"# Exercise: can you plot the entities that Trump mentions the most?\n",
"# Visit https://www.kaggle.com/nirant/hitchhiker-s-guide-to-nlp-in-spacy\n",
"# One possible answer (an assumption, not necessarily the intended solution):\n",
"# join the text of every named entity found in `doc`, the parsed tweets from above.\n",
"trump_topics = ' '.join(ent.text for ent in doc.ents)\n",
"\n",
"plt.figure(figsize=(10,5))\n",
"wordcloud = WordCloud(background_color=\"white\",\n",
"                      stopwords=STOP_WORDS,\n",
"                      max_words=45,\n",
"                      max_font_size=30,\n",
"                      random_state=42\n",
"                     ).generate(trump_topics)\n",
"plt.imshow(wordcloud)\n",
"plt.axis(\"off\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dependency Parsing\n",
"\n",
"Dependency parsing assigns syntactic dependency labels that describe the relationships between individual tokens, such as subject or object.\n",
"\n",
"In other words, it analyzes the grammatical structure of a sentence. The result is a tree: every token points to exactly one head, and the token that is its own head is the root of the sentence (labelled `ROOT`, usually the main verb).\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wall/NNP<--compound-- Street/NNP\n",
"Street/NNP<--compound-- Journal/NNP\n",
"Journal/NNP<--nsubj-- published/VBD\n",
"just/RB<--advmod-- published/VBD\n",
"published/VBD<--ROOT-- published/VBD\n",
"an/DT<--det-- piece/NN\n",
"interesting/JJ<--amod-- piece/NN\n",
"piece/NN<--dobj-- published/VBD\n",
"on/IN<--prep-- piece/NN\n",
"crypto/NNP<--compound-- currencies/NNS\n",
"currencies/NNS<--pobj-- on/IN\n"
]
}
],
"source": [
"doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')\n",
" \n",
"for token in doc:\n",
" print(\"{0}/{1}<--{2}-- {3}/{4}\".format(\n",
" token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"en\" id=\"1390aa432a6d4f33b28509aa7155b496-0\" class=\"displacy\" width=\"1370\" height=\"317.0\" direction=\"ltr\" style=\"max-width: none; height: 317.0px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr\">\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Wall</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"170\">Street</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"170\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"290\">Journal</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"290\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"410\">just</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"410\">ADV</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"530\">published</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"530\">VERB</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"650\">an</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"650\">DET</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"770\">interesting</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"770\">ADJ</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"890\">piece</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"890\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1010\">on</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1010\">ADP</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1130\">crypto</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1130\">PROPN</tspan>\n",
"</text>\n",
"\n",
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"227.0\">\n",
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"1250\">currencies</tspan>\n",
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"1250\">NOUN</tspan>\n",
"</text>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-0\" stroke-width=\"2px\" d=\"M70,182.0 C70,122.0 160.0,122.0 160.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M70,184.0 L62,172.0 78,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-1\" stroke-width=\"2px\" d=\"M190,182.0 C190,122.0 280.0,122.0 280.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M190,184.0 L182,172.0 198,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-2\" stroke-width=\"2px\" d=\"M310,182.0 C310,62.0 525.0,62.0 525.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M310,184.0 L302,172.0 318,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-3\" stroke-width=\"2px\" d=\"M430,182.0 C430,122.0 520.0,122.0 520.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">advmod</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M430,184.0 L422,172.0 438,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-4\" stroke-width=\"2px\" d=\"M670,182.0 C670,62.0 885.0,62.0 885.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">det</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M670,184.0 L662,172.0 678,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-5\" stroke-width=\"2px\" d=\"M790,182.0 C790,122.0 880.0,122.0 880.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M790,184.0 L782,172.0 798,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-6\" stroke-width=\"2px\" d=\"M550,182.0 C550,2.0 890.0,2.0 890.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-6\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M890.0,184.0 L898.0,172.0 882.0,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-7\" stroke-width=\"2px\" d=\"M910,182.0 C910,122.0 1000.0,122.0 1000.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-7\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1000.0,184.0 L1008.0,172.0 992.0,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-8\" stroke-width=\"2px\" d=\"M1150,182.0 C1150,122.0 1240.0,122.0 1240.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-8\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1150,184.0 L1142,172.0 1158,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"\n",
"<g class=\"displacy-arrow\">\n",
" <path class=\"displacy-arc\" id=\"arrow-1390aa432a6d4f33b28509aa7155b496-0-9\" stroke-width=\"2px\" d=\"M1030,182.0 C1030,62.0 1245.0,62.0 1245.0,182.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
" <textPath xlink:href=\"#arrow-1390aa432a6d4f33b28509aa7155b496-0-9\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
" </text>\n",
" <path class=\"displacy-arrowhead\" d=\"M1245.0,184.0 L1253.0,172.0 1237.0,172.0\" fill=\"currentColor\"/>\n",
"</g>\n",
"</svg>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"displacy.render(doc, style='dep', jupyter=True, options={'distance': 120})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Word Vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"spaCy also ships models with pretrained word vectors. The small models do not include them, so we’ll need to download a larger model:\n",
"\n",
"`python -m spacy download en_core_web_lg`\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 2.0228e-01 -7.6618e-02 3.7032e-01 3.2845e-02 -4.1957e-01 7.2069e-02\n",
" -3.7476e-01 5.7460e-02 -1.2401e-02 5.2949e-01 -5.2380e-01 -1.9771e-01\n",
" -3.4147e-01 5.3317e-01 -2.5331e-02 1.7380e-01 1.6772e-01 8.3984e-01\n",
" 5.5107e-02 1.0547e-01 3.7872e-01 2.4275e-01 1.4745e-02 5.5951e-01\n",
" 1.2521e-01 -6.7596e-01 3.5842e-01 -4.0028e-02 9.5949e-02 -5.0690e-01\n",
" -8.5318e-02 1.7980e-01 3.3867e-01 1.3230e-01 3.1021e-01 2.1878e-01\n",
" 1.6853e-01 1.9874e-01 -5.7385e-01 -1.0649e-01 2.6669e-01 1.2838e-01\n",
" -1.2803e-01 -1.3284e-01 1.2657e-01 8.6723e-01 9.6721e-02 4.8306e-01\n",
" 2.1271e-01 -5.4990e-02 -8.2425e-02 2.2408e-01 2.3975e-01 -6.2260e-02\n",
" 6.2194e-01 -5.9900e-01 4.3201e-01 2.8143e-01 3.3842e-02 -4.8815e-01\n",
" -2.1359e-01 2.7401e-01 2.4095e-01 4.5950e-01 -1.8605e-01 -1.0497e+00\n",
" -9.7305e-02 -1.8908e-01 -7.0929e-01 4.0195e-01 -1.8768e-01 5.1687e-01\n",
" 1.2520e-01 8.4150e-01 1.2097e-01 8.8239e-02 -2.9196e-02 1.2151e-03\n",
" 5.6825e-02 -2.7421e-01 2.5564e-01 6.9793e-02 -2.2258e-01 -3.6006e-01\n",
" -2.2402e-01 -5.3699e-02 1.2022e+00 5.4535e-01 -5.7998e-01 1.0905e-01\n",
" 4.2167e-01 2.0662e-01 1.2936e-01 -4.1457e-02 -6.6777e-01 4.0467e-01\n",
" -1.5218e-02 -2.7640e-01 -1.5611e-01 -7.9198e-02 4.0037e-02 -1.2944e-01\n",
" -2.4090e-04 -2.6785e-01 -3.8115e-01 -9.7245e-01 3.1726e-01 -4.3951e-01\n",
" 4.1934e-01 1.8353e-01 -1.5260e-01 -1.0808e-01 -1.0358e+00 7.6217e-02\n",
" 1.6519e-01 2.6526e-04 1.6616e-01 -1.5281e-01 1.8123e-01 7.0274e-01\n",
" 5.7956e-03 5.1664e-02 -5.9745e-02 -2.7551e-01 -3.9049e-01 6.1132e-02\n",
" 5.5430e-01 -8.7997e-02 -4.1681e-01 3.2826e-01 -5.2549e-01 -4.4288e-01\n",
" 8.2183e-03 2.4486e-01 -2.2982e-01 -3.4981e-01 2.6894e-01 3.9166e-01\n",
" -4.1904e-01 1.6191e-01 -2.6263e+00 6.4134e-01 3.9743e-01 -1.2868e-01\n",
" -3.1946e-01 -2.5633e-01 -1.2220e-01 3.2275e-01 -7.9933e-02 -1.5348e-01\n",
" 3.1505e-01 3.0591e-01 2.6012e-01 1.8553e-01 -2.4043e-01 4.2886e-02\n",
" 4.0622e-01 -2.4256e-01 6.3870e-01 6.9983e-01 -1.4043e-01 2.5209e-01\n",
" 4.8984e-01 -6.1067e-02 -3.6766e-01 -5.5089e-01 -3.8265e-01 -2.0843e-01\n",
" 2.2832e-01 5.1218e-01 2.7868e-01 4.7652e-01 4.7951e-02 -3.4008e-01\n",
" -3.2873e-01 -4.1967e-01 -7.5499e-02 -3.8954e-01 -2.9622e-02 -3.4070e-01\n",
" 2.2170e-01 -6.2856e-02 -5.1903e-01 -3.7774e-01 -4.3477e-03 -5.8301e-01\n",
" -8.7546e-02 -2.3929e-01 -2.4711e-01 -2.5887e-01 -2.9894e-01 1.3715e-01\n",
" 2.9892e-02 3.6544e-02 -4.9665e-01 -1.8160e-01 5.2939e-01 2.1992e-01\n",
" -4.4514e-01 3.7798e-01 -5.7062e-01 -4.6946e-02 8.1806e-02 1.9279e-02\n",
" 3.3246e-01 -1.4620e-01 1.7156e-01 3.9981e-01 3.6217e-01 1.2816e-01\n",
" 3.1644e-01 3.7569e-01 -7.4690e-02 -4.8480e-02 -3.1401e-01 -1.9286e-01\n",
" -3.1294e-01 -1.7553e-02 -1.7514e-01 -2.7587e-02 -1.0000e+00 1.8387e-01\n",
" 8.1434e-01 -1.8913e-01 5.0999e-01 -9.1960e-03 -1.9295e-03 2.8189e-01\n",
" 2.7247e-02 4.3409e-01 -5.4967e-01 -9.7426e-02 -2.4540e-01 -1.7203e-01\n",
" -8.8650e-02 -3.0298e-01 -1.3591e-01 -2.7765e-01 3.1286e-03 2.0556e-01\n",
" -1.5772e-01 -5.2308e-01 -6.4701e-01 -3.7014e-01 6.9393e-02 1.1401e-01\n",
" 2.7594e-01 -1.3875e-01 -2.7268e-01 6.6891e-01 -5.6454e-02 2.4017e-01\n",
" -2.6730e-01 2.9860e-01 1.0083e-01 5.5592e-01 3.2849e-01 7.6858e-02\n",
" 1.5528e-01 2.5636e-01 -1.0772e-01 -1.2359e-01 1.1827e-01 -9.9029e-02\n",
" -3.4328e-01 1.1502e-01 -3.7808e-01 -3.9012e-02 -3.4593e-01 -1.9404e-01\n",
" -3.3580e-01 -6.2334e-02 2.8919e-01 2.8032e-01 -5.3741e-01 6.2794e-01\n",
" 5.6955e-02 6.2147e-01 -2.5282e-01 4.1670e-01 -1.0108e-02 -2.5434e-01\n",
" 4.0003e-01 4.2432e-01 2.2672e-01 1.7553e-01 2.3049e-01 2.8323e-01\n",
" 1.3882e-01 3.1218e-03 1.7057e-01 3.6685e-01 2.5247e-03 -6.4009e-01\n",
" -2.9765e-01 7.8943e-01 3.3168e-01 -1.1966e+00 -4.7156e-02 5.3175e-01]\n"
]
}
],
"source": [
"nlp = spacy.load('en_core_web_lg')\n",
"print(nlp.vocab['banana'].vector)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Computing Similarity**\n",
"\n",
"Based on the word embeddings, spaCy offers a similarity interface for all of its building blocks: Token, Span, Doc and Lexeme. Here’s how to use that similarity interface:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'dog' is 0.66185343 similar to 'animal' and 0.23552851 similar to 'fruit'\n",
"'banana' is 0.24272855 similar to 'animal' and 0.67148364 similar to 'fruit'\n"
]
}
],
"source": [
"banana = nlp.vocab['banana']\n",
"dog = nlp.vocab['dog']\n",
"fruit = nlp.vocab['fruit']\n",
"animal = nlp.vocab['animal']\n",
" \n",
"print(\"'dog' is %s similar to 'animal' and %s similar to 'fruit'\" % (dog.similarity(animal), dog.similarity(fruit)))\n",
"print(\"'banana' is %s similar to 'animal' and %s similar to 'fruit'\" % (banana.similarity(animal), banana.similarity(fruit)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s now use this technique on entire texts:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8901766262114666\n",
"0.9115828449161616\n",
"0.7822956256736615\n",
"0.7133323899064792\n",
"0.6526212010025575\n"
]
}
],
"source": [
"target = nlp(\"Cats are beautiful animals.\")\n",
" \n",
"doc1 = nlp(\"Dogs are awesome.\")\n",
"doc2 = nlp(\"Some gorgeous creatures are felines.\")\n",
"doc3 = nlp(\"Dolphins are swimming mammals.\")\n",
"doc4 = nlp(\"Snoopy is a very smart dog.\")\n",
"doc5 = nlp(\"Tomorrow it will rain a lot in Berlin.\")\n",
" \n",
"print(target.similarity(doc1))\n",
"print(target.similarity(doc2))\n",
"print(target.similarity(doc3))\n",
"print(target.similarity(doc4))\n",
"print(target.similarity(doc5))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\"king\" - \"man\" + \"woman\" = \"queen\"?\n",
"\n",
"There’s a really famous example of word embedding math: \"king\" - \"man\" + \"woman\" = \"queen\". Let’s test that out:\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"maybe_queen : [ 5.14087021e-01 -2.78459996e-01 2.42767006e-01 4.54899669e-02\n",
" -2.59425014e-01 -3.19999963e-01 3.23920012e-01 -6.71030045e-01\n",
" -9.98499990e-02 1.91499996e+00 -5.68080008e-01 -2.74451017e-01\n",
" -1.49906695e-01 8.01083148e-02 -2.34764010e-01 -1.10950008e-01\n",
" -1.02593988e-01 8.53819966e-01 -2.68564999e-01 3.85140002e-01\n",
" -1.36149988e-01 6.35029972e-01 -7.62044966e-01 -2.52770007e-01\n",
" -6.75969958e-01 3.89851004e-01 -2.89680034e-01 1.75860003e-01\n",
" -5.16229987e-01 5.21373034e-01 -1.89909995e-01 6.73759937e-01\n",
" 1.17550008e-01 -4.69896019e-01 5.88999987e-01 1.29447982e-01\n",
" -5.71900010e-01 -5.47450066e-01 -4.84210014e-01 5.85503951e-02\n",
" 4.82379973e-01 -2.86769986e-01 -2.01718003e-01 -4.74729985e-01\n",
" 3.43068987e-01 -2.28827983e-01 -1.76439017e-01 6.05450034e-01\n",
" 2.07139999e-01 -2.89762974e-01 -7.63288975e-01 4.37090009e-01\n",
" -2.06220001e-01 -4.20252979e-01 1.98040009e-01 3.18709970e-01\n",
" -9.51815993e-02 -3.23054016e-01 -6.02343976e-01 2.33427018e-01\n",
" -2.15409994e-02 -6.29774988e-01 3.72432500e-01 3.41740012e-01\n",
" 5.81782043e-01 7.02129960e-01 7.19299972e-01 3.28493983e-01\n",
" 3.36353004e-01 1.06999278e-03 -5.53239942e-01 -2.46219993e-01\n",
" -6.37116969e-01 -1.72280014e-01 8.97620022e-01 -1.38548493e-01\n",
" -5.71600199e-02 6.41870022e-01 3.89845997e-01 -3.98499995e-01\n",
" -7.28532076e-01 9.17530134e-02 -3.40600014e-01 3.46671015e-01\n",
" -2.63424516e-01 3.68355006e-01 8.78340006e-01 -1.57473043e-01\n",
" -4.29450005e-01 -4.91259992e-01 -1.23234093e-02 3.27509999e-01\n",
" 1.44889995e-01 -3.27081025e-01 9.45929945e-01 -8.07909966e-01\n",
" -2.07101002e-01 -8.87000561e-03 -5.59080057e-02 7.93069959e-01\n",
" 3.58245999e-01 6.05069995e-01 1.01848006e-01 -1.89061001e-01\n",
" 1.09030008e-02 -7.64109969e-01 -5.05369961e-01 -1.11367017e-01\n",
" 6.56607985e-01 -1.48448005e-01 1.30866021e-01 6.62039995e-01\n",
" -1.54300034e-02 -4.17466015e-01 -4.54553008e-01 -5.05975008e-01\n",
" 4.15473014e-01 4.00425017e-01 7.88707018e-01 -5.19399941e-02\n",
" -3.91889989e-01 8.31609964e-02 4.58730012e-01 1.23339996e-01\n",
" 2.39246994e-01 3.81098986e-01 1.86000004e-01 2.69684941e-02\n",
" -5.55605292e-01 2.53284007e-01 -6.67639971e-01 -5.55985987e-01\n",
" -3.71130019e-01 -6.53919995e-01 -1.09452009e-01 -6.04629993e-01\n",
" -4.62760001e-01 3.97581995e-01 -3.26649994e-01 2.60998994e-01\n",
" -2.09120011e+00 -2.76019007e-01 2.68036008e-01 -3.35714996e-01\n",
" -4.75513011e-01 -2.83890069e-02 4.40270007e-01 2.24150002e-01\n",
" -4.50639009e-01 -6.16590083e-01 1.10599995e-01 -3.00589710e-01\n",
" 1.24530017e-01 2.99279988e-01 3.03467005e-01 -3.42969984e-01\n",
" 3.93694013e-01 -5.84149957e-01 -1.88180000e-01 2.98162013e-01\n",
" -1.80879980e-01 -3.70599926e-02 4.09860015e-02 -8.07899833e-02\n",
" 3.92280012e-01 -4.94572997e-01 4.01719987e-01 8.48469973e-01\n",
" -1.94183022e-01 4.29439992e-01 -6.07819974e-01 -9.71959978e-02\n",
" 3.55786979e-01 -1.79980025e-02 -5.83269954e-01 -2.50129998e-01\n",
" 2.80330002e-01 -3.72725993e-01 -7.41009951e-01 1.03881419e-01\n",
" 8.04000199e-02 -1.64650023e-01 1.09247290e-01 -5.68639994e-01\n",
" 4.11399961e-01 5.69249988e-01 -2.14549989e-01 -1.56975001e-01\n",
" 9.64879990e-02 2.01149940e-01 -9.81989980e-01 -9.00639057e-01\n",
" 1.57496989e-01 -1.24968991e-01 9.11729932e-02 -5.17108977e-01\n",
" 6.34269863e-02 1.72169998e-01 -2.36945987e-01 -7.58899987e-01\n",
" 5.74868977e-01 6.10739946e-01 8.88329893e-02 -2.59585023e-01\n",
" -9.03399587e-02 -8.53200257e-02 1.69609979e-01 -7.29799643e-03\n",
" -2.05680996e-01 -1.93440005e-01 -4.92264986e-01 3.19920003e-01\n",
" -3.66147995e-01 5.69279015e-01 6.27799928e-02 7.91899860e-02\n",
" -3.93792808e-01 4.87831026e-01 -3.85988951e-02 7.52799988e-01\n",
" 1.74212992e-01 -6.07100964e-01 4.81240004e-01 1.49755001e-01\n",
" 4.32273030e-01 2.77104974e-01 4.56589013e-01 -3.32702011e-01\n",
" -2.80999988e-01 6.35839045e-01 1.15425006e-01 7.80760050e-02\n",
" 3.17489982e-01 -4.80073988e-01 4.07790095e-02 -8.21070611e-01\n",
" -1.63500011e-03 -3.97460014e-01 -9.85880196e-02 -5.31642020e-01\n",
" -4.52499986e-02 -4.23010021e-01 1.44284993e-01 -7.62080014e-01\n",
" 2.15179995e-01 -7.05516994e-01 6.44015014e-01 -9.44310054e-02\n",
" -5.36169946e-01 -1.31442308e+00 4.51058030e-01 1.44240022e-01\n",
" 3.84460092e-02 -1.80320218e-02 -2.95219988e-01 4.90060002e-01\n",
" 3.83020639e-02 -1.70519948e-02 -7.32708037e-01 5.04490495e-01\n",
" 1.77098006e-01 5.36670089e-02 -2.40814000e-01 -8.20799917e-02\n",
" 2.19249994e-01 -4.58490014e-01 3.68449986e-01 3.09300005e-01\n",
" -1.21967995e+00 -2.55998999e-01 -8.38758051e-01 -1.99926004e-01\n",
" -3.38140011e-01 -8.05199146e-03 1.42598450e-02 -3.56069952e-01\n",
" 8.31499994e-02 2.89311975e-01 5.29001653e-03 -1.11837029e-01\n",
" 1.28127396e+00 8.09929967e-01 5.58990002e-01 -2.18623012e-01\n",
" -1.70580015e-01 7.43115008e-01 -1.40369982e-01 2.97093987e-01\n",
" -3.28552961e-01 -3.10106993e-01 1.80748999e-01 3.05629998e-01\n",
" 2.17199922e-02 -4.68929976e-01 -1.95840016e-01 6.82327509e-01\n",
" -2.89168000e-01 -7.09619969e-02 8.64340067e-01 -3.79067004e-01]\n",
"['King', 'KING', 'king', 'KIng', 'Queen', 'QUEEN', 'queen', 'Prince', 'PRINCE', 'prince']\n"
]
}
],
"source": [
"from scipy import spatial\n",
" \n",
"cosine_similarity = lambda x, y: 1 - spatial.distance.cosine(x, y)\n",
" \n",
"man = nlp.vocab['man'].vector\n",
"woman = nlp.vocab['woman'].vector\n",
"queen = nlp.vocab['queen'].vector\n",
"king = nlp.vocab['king'].vector\n",
" \n",
"# We now need to find the closest vector in the vocabulary to the result of \"man\" - \"woman\" + \"queen\"\n",
"maybe_queen = king - man + woman\n",
"\n",
"# print(\"maybe_queen : {}\".format(maybe_queen))\n",
"\n",
"computed_similarities = []\n",
" \n",
"for word in nlp.vocab:\n",
" if word.has_vector: # Ignore words without vectors\n",
" similarity = cosine_similarity(maybe_queen, word.vector)\n",
" computed_similarities.append((word, similarity))\n",
" \n",
"computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])\n",
"\n",
"print([w[0].text for w in computed_similarities[:10]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" **Refer this link for more on similarity : https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**References**\n",
"- https://github.com/pyladies-bcn/spacy-workshop\n",
"- https://nlpforhackers.io/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**For further reading**\n",
"- https://github.com/explosion/spacy-notebooks\n",
"- Trump dataset exploration @ https://www.kaggle.com/nirant/hitchhiker-s-guide-to-nlp-in-spacy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}