@aparrish
Created March 23, 2018 22:45
rwet notes from class 2018-03-09
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# notes 2018-03-09"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import spacy"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
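},
"outputs": [],
"source": [
"# load the medium English model (install it first with: python -m spacy download en_core_web_md)\n",
"nlp = spacy.load('en_core_web_md')"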
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"doc = nlp(\"All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status. Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[All human beings are born free and equal in dignity and rights.,\n",
" They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.,\n",
" Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status.,\n",
" Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(doc.sents)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sents = list(doc.sents)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"spacy.tokens.span.Span"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(sents[0])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'All human beings are born free and equal in dignity and rights.'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sents[0].text"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All human beings are born free and equal in dignity and rights.\n",
"They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.\n",
"Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status.\n",
"Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.\n"
]
}
],
"source": [
"for item in doc.sents:\n",
"    print(item.text)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sentences = [item.text for item in doc.sents]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import random"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty.'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"random.choice(sentences)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"words = [item.text for item in doc]"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['All',\n",
" 'human',\n",
" 'beings',\n",
" 'are',\n",
" 'born',\n",
" 'free',\n",
" 'and',\n",
" 'equal',\n",
" 'in',\n",
" 'dignity',\n",
" 'and',\n",
" 'rights',\n",
" '.',\n",
" 'They',\n",
" 'are',\n",
" 'endowed',\n",
" 'with',\n",
" 'reason',\n",
" 'and',\n",
" 'conscience',\n",
" 'and',\n",
" 'should',\n",
" 'act',\n",
" 'towards',\n",
" 'one',\n",
" 'another',\n",
" 'in',\n",
" 'a',\n",
" 'spirit',\n",
" 'of',\n",
" 'brotherhood',\n",
" '.',\n",
" 'Everyone',\n",
" 'is',\n",
" 'entitled',\n",
" 'to',\n",
" 'all',\n",
" 'the',\n",
" 'rights',\n",
" 'and',\n",
" 'freedoms',\n",
" 'set',\n",
" 'forth',\n",
" 'in',\n",
" 'this',\n",
" 'Declaration',\n",
" ',',\n",
" 'without',\n",
" 'distinction',\n",
" 'of',\n",
" 'any',\n",
" 'kind',\n",
" ',',\n",
" 'such',\n",
" 'as',\n",
" 'race',\n",
" ',',\n",
" 'colour',\n",
" ',',\n",
" 'sex',\n",
" ',',\n",
" 'language',\n",
" ',',\n",
" 'religion',\n",
" ',',\n",
" 'political',\n",
" 'or',\n",
" 'other',\n",
" 'opinion',\n",
" ',',\n",
" 'national',\n",
" 'or',\n",
" 'social',\n",
" 'origin',\n",
" ',',\n",
" 'property',\n",
" ',',\n",
" 'birth',\n",
" 'or',\n",
" 'other',\n",
" 'status',\n",
" '.',\n",
" 'Furthermore',\n",
" ',',\n",
" 'no',\n",
" 'distinction',\n",
" 'shall',\n",
" 'be',\n",
" 'made',\n",
" 'on',\n",
" 'the',\n",
" 'basis',\n",
" 'of',\n",
" 'the',\n",
" 'political',\n",
" ',',\n",
" 'jurisdictional',\n",
" 'or',\n",
" 'international',\n",
" 'status',\n",
" 'of',\n",
" 'the',\n",
" 'country',\n",
" 'or',\n",
" 'territory',\n",
" 'to',\n",
" 'which',\n",
" 'a',\n",
" 'person',\n",
" 'belongs',\n",
" ',',\n",
" 'whether',\n",
" 'it',\n",
" 'be',\n",
" 'independent',\n",
" ',',\n",
" 'trust',\n",
" ',',\n",
" 'non',\n",
" '-',\n",
" 'self',\n",
" '-',\n",
" 'governing',\n",
" 'or',\n",
" 'under',\n",
" 'any',\n",
" 'other',\n",
" 'limitation',\n",
" 'of',\n",
" 'sovereignty',\n",
" '.']"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"words"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All all\n",
"human human\n",
"beings being\n",
"are be\n",
"born bear\n",
"free free\n",
"and and\n",
"equal equal\n",
"in in\n",
"dignity dignity\n",
"and and\n",
"rights right\n",
". .\n",
"They -PRON-\n",
"are be\n",
"endowed endow\n",
"with with\n",
"reason reason\n",
"and and\n",
"conscience conscience\n",
"and and\n",
"should should\n",
"act act\n",
"towards towards\n",
"one one\n",
"another another\n",
"in in\n",
"a a\n",
"spirit spirit\n",
"of of\n",
"brotherhood brotherhood\n",
". .\n",
"Everyone everyone\n",
"is be\n",
"entitled entitle\n",
"to to\n",
"all all\n",
"the the\n",
"rights right\n",
"and and\n",
"freedoms freedom\n",
"set set\n",
"forth forth\n",
"in in\n",
"this this\n",
"Declaration declaration\n",
", ,\n",
"without without\n",
"distinction distinction\n",
"of of\n",
"any any\n",
"kind kind\n",
", ,\n",
"such such\n",
"as as\n",
"race race\n",
", ,\n",
"colour colour\n",
", ,\n",
"sex sex\n",
", ,\n",
"language language\n",
", ,\n",
"religion religion\n",
", ,\n",
"political political\n",
"or or\n",
"other other\n",
"opinion opinion\n",
", ,\n",
"national national\n",
"or or\n",
"social social\n",
"origin origin\n",
", ,\n",
"property property\n",
", ,\n",
"birth birth\n",
"or or\n",
"other other\n",
"status status\n",
". .\n",
"Furthermore furthermore\n",
", ,\n",
"no no\n",
"distinction distinction\n",
"shall shall\n",
"be be\n",
"made make\n",
"on on\n",
"the the\n",
"basis basis\n",
"of of\n",
"the the\n",
"political political\n",
", ,\n",
"jurisdictional jurisdictional\n",
"or or\n",
"international international\n",
"status status\n",
"of of\n",
"the the\n",
"country country\n",
"or or\n",
"territory territory\n",
"to to\n",
"which which\n",
"a a\n",
"person person\n",
"belongs belong\n",
", ,\n",
"whether whether\n",
"it -PRON-\n",
"be be\n",
"independent independent\n",
", ,\n",
"trust trust\n",
", ,\n",
"non non\n",
"- -\n",
"self self\n",
"- -\n",
"governing governing\n",
"or or\n",
"under under\n",
"any any\n",
"other other\n",
"limitation limitation\n",
"of of\n",
"sovereignty sovereignty\n",
". .\n"
]
}
],
"source": [
"for item in doc:\n",
"    print(item.text, item.lemma_)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All DET DT\n",
"human ADJ JJ\n",
"beings NOUN NNS\n",
"are VERB VBP\n",
"born VERB VBN\n",
"free ADJ JJ\n",
"and CCONJ CC\n",
"equal ADJ JJ\n",
"in ADP IN\n",
"dignity NOUN NN\n",
"and CCONJ CC\n",
"rights NOUN NNS\n",
". PUNCT .\n",
"They PRON PRP\n",
"are VERB VBP\n",
"endowed VERB VBN\n",
"with ADP IN\n",
"reason NOUN NN\n",
"and CCONJ CC\n",
"conscience NOUN NN\n",
"and CCONJ CC\n",
"should VERB MD\n",
"act VERB VB\n",
"towards ADP IN\n",
"one NUM CD\n",
"another DET DT\n",
"in ADP IN\n",
"a DET DT\n",
"spirit NOUN NN\n",
"of ADP IN\n",
"brotherhood NOUN NN\n",
". PUNCT .\n",
"Everyone NOUN NN\n",
"is VERB VBZ\n",
"entitled VERB VBN\n",
"to ADP IN\n",
"all ADJ PDT\n",
"the DET DT\n",
"rights NOUN NNS\n",
"and CCONJ CC\n",
"freedoms NOUN NNS\n",
"set VERB VBN\n",
"forth ADV RB\n",
"in ADP IN\n",
"this DET DT\n",
"Declaration PROPN NNP\n",
", PUNCT ,\n",
"without ADP IN\n",
"distinction NOUN NN\n",
"of ADP IN\n",
"any DET DT\n",
"kind NOUN NN\n",
", PUNCT ,\n",
"such ADJ JJ\n",
"as ADP IN\n",
"race NOUN NN\n",
", PUNCT ,\n",
"colour NOUN NN\n",
", PUNCT ,\n",
"sex NOUN NN\n",
", PUNCT ,\n",
"language NOUN NN\n",
", PUNCT ,\n",
"religion NOUN NN\n",
", PUNCT ,\n",
"political ADJ JJ\n",
"or CCONJ CC\n",
"other ADJ JJ\n",
"opinion NOUN NN\n",
", PUNCT ,\n",
"national ADJ JJ\n",
"or CCONJ CC\n",
"social ADJ JJ\n",
"origin NOUN NN\n",
", PUNCT ,\n",
"property NOUN NN\n",
", PUNCT ,\n",
"birth NOUN NN\n",
"or CCONJ CC\n",
"other ADJ JJ\n",
"status NOUN NN\n",
". PUNCT .\n",
"Furthermore ADV RB\n",
", PUNCT ,\n",
"no DET DT\n",
"distinction NOUN NN\n",
"shall VERB MD\n",
"be VERB VB\n",
"made VERB VBN\n",
"on ADP IN\n",
"the DET DT\n",
"basis NOUN NN\n",
"of ADP IN\n",
"the DET DT\n",
"political ADJ JJ\n",
", PUNCT ,\n",
"jurisdictional ADJ JJ\n",
"or CCONJ CC\n",
"international ADJ JJ\n",
"status NOUN NN\n",
"of ADP IN\n",
"the DET DT\n",
"country NOUN NN\n",
"or CCONJ CC\n",
"territory NOUN NN\n",
"to PART TO\n",
"which ADJ WDT\n",
"a DET DT\n",
"person NOUN NN\n",
"belongs VERB VBZ\n",
", PUNCT ,\n",
"whether ADP IN\n",
"it PRON PRP\n",
"be VERB VB\n",
"independent ADJ JJ\n",
", PUNCT ,\n",
"trust NOUN NN\n",
", PUNCT ,\n",
"non ADJ JJ\n",
"- PUNCT HYPH\n",
"self NOUN NN\n",
"- PUNCT HYPH\n",
"governing NOUN NN\n",
"or CCONJ CC\n",
"under ADP IN\n",
"any DET DT\n",
"other ADJ JJ\n",
"limitation NOUN NN\n",
"of ADP IN\n",
"sovereignty NOUN NN\n",
". PUNCT .\n"
]
}
],
"source": [
"for item in doc:\n",
"    print(item.text, item.pos_, item.tag_)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['beings', 'rights', 'rights', 'freedoms']"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[item.text for item in doc if item.tag_ == 'NNS']"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"nouns = [item.text for item in doc if item.pos_ == 'NOUN']\n",
"adjectives = [item.text for item in doc if item.pos_ == 'ADJ']"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"jurisdictional colour\n",
"other rights\n",
"other rights\n",
"all religion\n",
"such property\n",
"all conscience\n",
"political governing\n",
"all distinction\n",
"which birth\n",
"other governing\n"
]
}
],
"source": [
"for i in range(10):\n",
"    print(random.choice(adjectives) + \" \" + random.choice(nouns))"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['All human beings',\n",
" 'dignity',\n",
" 'rights',\n",
" 'They',\n",
" 'reason',\n",
" 'conscience',\n",
" 'a spirit',\n",
" 'brotherhood',\n",
" 'Everyone',\n",
" 'all the rights',\n",
" 'freedoms',\n",
" 'this Declaration',\n",
" 'distinction',\n",
" 'any kind',\n",
" 'race',\n",
" 'colour',\n",
" 'sex',\n",
" 'language',\n",
" 'religion',\n",
" 'opinion',\n",
" 'national or social origin',\n",
" 'property',\n",
" 'birth',\n",
" 'other status',\n",
" 'no distinction',\n",
" 'the basis',\n",
" 'the political, jurisdictional or international status',\n",
" 'the country',\n",
" 'territory',\n",
" 'a person',\n",
" 'it',\n",
" 'any other limitation',\n",
" 'sovereignty']"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[item.text for item in doc.noun_chunks]"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"my_sentence = list(doc.sents)[1]"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"They PRP endowed nsubjpass [They]\n",
"are VBP endowed auxpass [are]\n",
"endowed VBN endowed ROOT [They, are, endowed, with, reason, and, conscience, and, should, act, towards, one, another, in, a, spirit, of, brotherhood, .]\n",
"with IN endowed prep [with, reason, and, conscience]\n",
"reason NN with pobj [reason, and, conscience]\n",
"and CC reason cc [and]\n",
"conscience NN reason conj [conscience]\n",
"and CC endowed cc [and]\n",
"should MD act aux [should]\n",
"act VB endowed conj [should, act, towards, one, another, in, a, spirit, of, brotherhood]\n",
"towards IN act prep [towards, one, another]\n",
"one CD towards pobj [one, another]\n",
"another DT one det [another]\n",
"in IN act prep [in, a, spirit, of, brotherhood]\n",
"a DT spirit det [a]\n",
"spirit NN in pobj [a, spirit, of, brotherhood]\n",
"of IN spirit prep [of, brotherhood]\n",
"brotherhood NN of pobj [brotherhood]\n",
". . endowed punct [.]\n"
]
}
],
"source": [
"for item in my_sentence:\n",
"    print(item.text, item.tag_, item.head.text, item.dep_,\n",
"          list(item.subtree))"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def flatten_subtree(st):\n",
"    # join each token with its trailing whitespace, then trim the ends\n",
"    return ''.join([w.text_with_ws for w in st]).strip()"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"They / They\n",
"are / are\n",
"endowed / They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.\n",
"with / with reason and conscience\n",
"reason / reason and conscience\n",
"and / and\n",
"conscience / conscience\n",
"and / and\n",
"should / should\n",
"act / should act towards one another in a spirit of brotherhood\n",
"towards / towards one another\n",
"one / one another\n",
"another / another\n",
"in / in a spirit of brotherhood\n",
"a / a\n",
"spirit / a spirit of brotherhood\n",
"of / of brotherhood\n",
"brotherhood / brotherhood\n",
". / .\n"
]
}
],
"source": [
"for item in my_sentence:\n",
" print(item.text, \"/\", flatten_subtree(item.subtree))"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"in dignity and rights\n",
"with reason and conscience\n",
"towards one another\n",
"in a spirit of brotherhood\n",
"of brotherhood\n",
"to all the rights and freedoms set forth in this Declaration\n",
"in this Declaration\n",
"without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status\n",
"of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status\n",
"such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status\n",
"on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs\n",
"of the political, jurisdictional or international status of the country or territory to which a person belongs\n",
"of the country or territory to which a person belongs\n",
"to which\n",
"of sovereignty\n"
]
}
],
"source": [
"for word in doc:\n",
"    if word.dep_ == 'prep':\n",
"        print(flatten_subtree(word.subtree))"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All human beings\n",
"They\n",
"Everyone\n",
"no distinction\n",
"a person\n",
"it\n"
]
}
],
"source": [
"for word in doc:\n",
"    if word.dep_ in ('nsubj', 'nsubjpass'):\n",
"        print(flatten_subtree(word.subtree))"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[dignity,\n",
" reason,\n",
" one,\n",
" spirit,\n",
" brotherhood,\n",
" rights,\n",
" Declaration,\n",
" distinction,\n",
" kind,\n",
" race,\n",
" basis,\n",
" status,\n",
" country,\n",
" which,\n",
" limitation,\n",
" sovereignty]"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[item for item in doc if item.dep_ == 'pobj']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### parsing from a file"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# read the whole file, closing it when done, then parse the text\n",
"with open(\"genesis.txt\") as infile:\n",
"    doc2 = nlp(infile.read())"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"earth LOC\n",
"earth LOC\n",
"God PERSON\n",
"God PERSON\n",
"Night PERSON\n",
"the evening and the morning TIME\n",
"the first day DATE\n",
"the second day DATE\n",
"one CARDINAL\n",
"Earth LOC\n",
"earth LOC\n",
"earth LOC\n",
"the third day DATE\n",
"the day DATE\n",
"seasons DATE\n",
"days DATE\n",
"years DATE\n",
"earth LOC\n",
"two CARDINAL\n",
"the day DATE\n",
"the night TIME\n",
"the day DATE\n",
"the night TIME\n",
"the evening and the morning TIME\n",
"the fourth day DATE\n",
"the fifth day DATE\n",
"Behold PERSON\n",
"earth LOC\n",
"earth LOC\n",
"earth LOC\n",
"the sixth day DATE\n"
]
}
],
"source": [
"for item in doc2.ents:\n",
"    print(item.text, item.label_)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['God', 'God', 'Night', 'Behold']"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[item.text for item in doc2.ents if item.label_ == \"PERSON\"]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}