Example code for extracting words and phrases with spaCy, writing them out to text files, and then reading them back in.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extracting and writing to files"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import spacy"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"def flatten_subtree(st):\n",
"    # Join the tokens of a subtree into one string, keeping the original spacing.\n",
"    return ''.join(w.text_with_ws for w in st).strip()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"nlp = spacy.load('en_core_web_md')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace `nature_corpus.txt` with the filename of your own text file."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"with open(\"nature_corpus.txt\") as f:\n",
"    doc = nlp(f.read())"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"words = [item.text for item in doc]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace `LOC` with the label of the [named entity](https://spacy.io/api/annotation#named-entities) type that you want."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"loc = [item.text for item in doc.ents if item.label_ == 'LOC']"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"loc"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open(\"loc.txt\", \"w\").write(\"\\n\".join(loc))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Replace `NOUN` with the [universal part-of-speech tag](https://spacy.io/api/annotation#pos-tagging) that you want."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"nouns = [item.text for item in doc if item.pos_ == 'NOUN']"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2045"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open(\"nature_nouns.txt\", \"w\").write(\"\\n\".join(nouns))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"pl_nouns = [item.text for item in doc if item.tag_ == 'NNS']"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"765"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open(\"nature_pl_nouns.txt\", \"w\").write(\"\\n\".join(pl_nouns))"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"prep_phrases = []\n",
"for word in doc:\n",
"    # A token whose dependency label is 'prep' heads a prepositional phrase.\n",
"    if word.dep_ == 'prep':\n",
"        prep_phrases.append(flatten_subtree(word.subtree))"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2218"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"open(\"prep_phrases.txt\", \"w\").write(\"\\n\".join(prep_phrases))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using files with lists of strings to make very very good poems"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"import random"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"nouns = open(\"nature_nouns.txt\").read().split(\"\\n\")\n",
"prep_phrases = open(\"prep_phrases.txt\").read().split(\"\\n\")"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"forms like the stacks of bowls in my kitchen cabinet\n",
"flowers with yellow flowers and overgrowth\n",
"mold to the RFTA\n",
"pollen like Gs or Cs or treble clefs\n",
"things with those eroded ridges those eyes of black and brown\n",
"rain in my hand\n",
"hair in units of woods and fields\n",
"tree quite like soap\n",
"sun in units of woods and fields\n",
"path in places like a shirt\n",
"chains for the path\n",
"leaf in my hand\n",
"flowers on green thrones\n",
"insects An extinguishment of hydrobased pressure\n",
"towers with a love letter\n",
"beats with white stuff\n",
"clover of their stems\n",
"things with those eroded ridges those eyes of black and brown\n",
"growth like abused feather brushes\n",
"tree like Gs or Cs or treble clefs\n",
"parchment of things\n",
"places in places like a shirt\n",
"bear of my knuckle\n",
"home around the axis of my knuckle\n",
"lilac in distant trees\n",
"leaves from bear scratch\n",
"scratchings around them\n",
"woods from a half enthusiastic rain\n",
"flowers of hydrobased pressure\n",
"end with yellow flowers and overgrowth\n",
"growth in Snowmass\n",
"truck of blood\n",
"bug in a way that they are also pink\n",
"oxidation An extinguishment of hydrobased pressure\n",
"murmur in places like a shirt\n",
"hiss in a way that they are also pink\n",
"dust on concrete's father\n",
"species with yellow flowers and overgrowth\n",
"knots from sport utility vehicles\n",
"leaves in units of woods and fields\n",
"words from the side\n",
"shade with these eyes\n",
"vehicles like molecular diagrams\n",
"level of my knuckle\n",
"green of passing cars\n",
"green of blood\n",
"swamps at each other\n",
"leaves of passing cars\n",
"bark with yellow flowers and overgrowth\n",
"gray in my kitchen cabinet\n"
]
}
],
"source": [
"for i in range(50):\n",
"    print(random.choice(nouns), random.choice(prep_phrases))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
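The notebook writes each list with `"\n".join(...)` and reads it back with `split("\n")`. That round trip only works cleanly if no item contains a newline itself, which a flattened subtree from a raw corpus easily can. Below is a minimal sketch of one way to guard against that, using only the standard library. The helper names `write_lines`, `read_lines`, and `make_lines` and the sample data are hypothetical stand-ins, not part of the notebook; in practice the lists would come from the spaCy cells.

```python
import random
import tempfile
from pathlib import Path


def write_lines(path, items):
    """Write one item per line, collapsing internal whitespace (newlines included)."""
    cleaned = [" ".join(item.split()) for item in items]
    # Drop items that are empty after cleaning so no blank lines are written.
    cleaned = [item for item in cleaned if item]
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(cleaned))
    return len(cleaned)


def read_lines(path):
    """Read a file written by write_lines back into a list of non-empty strings."""
    with open(path, encoding="utf-8") as f:
        return [line for line in f.read().splitlines() if line]


def make_lines(nouns, phrases, n, seed=None):
    """Pair a random noun with a random phrase, n times."""
    rng = random.Random(seed)
    return [f"{rng.choice(nouns)} {rng.choice(phrases)}" for _ in range(n)]


# Hypothetical stand-in data; in the notebook these lists come from spaCy.
sample_nouns = ["rain", "leaf", "tree"]
sample_phrases = ["in my hand", "of woods\nand fields", "   "]

with tempfile.TemporaryDirectory() as tmp:
    noun_path = Path(tmp) / "nouns.txt"
    phrase_path = Path(tmp) / "phrases.txt"
    write_lines(noun_path, sample_nouns)
    write_lines(phrase_path, sample_phrases)
    nouns_back = read_lines(noun_path)
    phrases_back = read_lines(phrase_path)

poem = make_lines(nouns_back, phrases_back, 5, seed=0)
print("\n".join(poem))
```

Passing a `seed` makes a run repeatable, which can help when a particular poem is worth keeping; leaving it as `None` gives a fresh shuffle each time.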