Replacements and random walks with word vectors. Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Replacements and walks with word vectors\n",
"\n",
"By [Allison Parrish](http://www.decontextualize.com/)\n",
"\n",
"This (rough) notebook contains some code and strategies for finding words that mean similar things to other words (using [spacy](https://spacy.io/)'s word vectors) and words that sound like other words (using [my phonetic similarity vectors](https://github.com/aparrish/phonetic-similarity-vectors)).\n",
"\n",
"## What you'll need\n",
"\n",
"Assuming you're starting with Anaconda, you'll need to [install spacy](https://spacy.io/usage/) and download an English model that includes word vectors. Run the following at the command line:\n",
"\n",
"    conda install -c conda-forge spacy\n",
"    python -m spacy download en_core_web_lg\n",
"\n",
"You'll also need to install [Annoy](https://pypi.python.org/pypi/annoy) and the [wordfilter](https://pypi.python.org/pypi/wordfilter) module:\n",
"\n",
"    pip install annoy\n",
"    pip install wordfilter\n",
"\n",
"For the phonetic similarity code, you'll need to download [this file](https://raw.githubusercontent.com/aparrish/phonetic-similarity-vectors/master/cmudict-0.7b-simvecs) and put it in the same directory as this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the nearest-neighbor index with semantic word vectors"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from itertools import islice\n",
"import random"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import spacy\n",
"import annoy"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nlp = spacy.load('en_core_web_lg') # or en_core_web_md"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The spacy model's vocabulary includes over a million entries, but only some of them have vectors; we're just going to grab the entries that have a vector and consist entirely of alphabetic characters."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"qualified = [item for item in nlp.vocab if item.has_vector and item.is_alpha]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"525088"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(qualified)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell loads spacy's vectors into a nearest-neighbor index, which makes it much faster to look up the nearest neighbors of a given vector. (This will take a while.)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lexmap = []\n",
"t = annoy.AnnoyIndex(300)  # spacy's vectors have 300 dimensions\n",
"# index only the 100,000 most probable words\n",
"for i, item in enumerate(islice(sorted(qualified, key=lambda x: x.prob, reverse=True), 100000)):\n",
"    t.add_item(i, item.vector)\n",
"    lexmap.append(item)\n",
"t.build(25)  # build the index with 25 trees"
]
},
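{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since building the index takes a while, it can be worth caching it to disk. Here's a minimal sketch using Annoy's `save()` and `load()` methods; the filenames are placeholders I made up for this example. The word list has to be saved separately, since Annoy only stores the vectors."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical filenames -- adjust to taste\n",
"t.save(\"spacy-index.ann\")\n",
"with open(\"spacy-index-words.txt\", \"w\") as fh:\n",
"    fh.write(\"\\n\".join(item.orth_ for item in lexmap))\n",
"\n",
"# later, instead of rebuilding from scratch:\n",
"t2 = annoy.AnnoyIndex(300)\n",
"t2.load(\"spacy-index.ann\")\n",
"words = open(\"spacy-index-words.txt\").read().split(\"\\n\")"
]
},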
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Semantic similarity\n",
"\n",
"The following generator yields up to *n* words most similar to the given word (assuming that the word has an associated vector), skipping repeats that differ only in case."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['hi',\n",
" 'hey',\n",
" 'dear',\n",
" 'bye',\n",
" 'oh',\n",
" 'goodbye',\n",
" 'sorry',\n",
" 'yay',\n",
" 'yeah',\n",
" 'yep',\n",
" 'hehe',\n",
" 'thank',\n",
" 'thanx',\n",
" 'yup',\n",
" 'kitty',\n",
" 'ok',\n",
" 'haha',\n",
" 'hmm',\n",
" 'goodnight',\n",
" 'thanks']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def similarsemantic(t, nlp, word, n):\n",
"    seen = set()\n",
"    count = 0\n",
"    # fetch 100 raw neighbors, then dedupe case variants as we go\n",
"    for i in t.get_nns_by_vector(nlp.vocab[word].vector, 100):\n",
"        this_word = lexmap[i].orth_.lower()\n",
"        if this_word not in seen and word != this_word:\n",
"            seen.add(this_word)\n",
"            count += 1\n",
"            yield this_word\n",
"        if count >= n:\n",
"            break\n",
"list(similarsemantic(t, nlp, \"hello\", 20))"
]
},
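{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the index works on raw vectors, you can also hand it the result of vector arithmetic. The sketch below tries the classic `king - man + woman` analogy with `get_nns_by_vector`; this is just a demonstration, and spacy's vectors tend to give noisier results than the textbook example suggests."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# nearest neighbors of a composed vector\n",
"vec = nlp.vocab[\"king\"].vector - nlp.vocab[\"man\"].vector + nlp.vocab[\"woman\"].vector\n",
"[lexmap[i].orth_.lower() for i in t.get_nns_by_vector(vec, 10)]"
]
},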
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Replacing words with semantically similar words"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"text = \"\"\"\\\n",
"In the beginning God created the heaven and the earth. \n",
"And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. \n",
"And God said, Let there be light: and there was light. \n",
"And God saw the light, that it was good: and God divided the light from the darkness. \n",
"And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day. \n",
"And God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In the beginning christ created the heaven and the earth. \n",
"And the earth was without form, and void; and darkness was upon the face of the deep. other the Spirit of God moved upon the face of the waters. \n",
"And God said, know though be light: and there was illuminating. \n",
"And God saw the light, that so was good: and God divided the light from the darkness. \n",
"And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day. \n",
"And God said, Let what be a firmament into the midst of the sea, and let it divide the waters from the waters.\n",
"\n"
]
}
],
"source": [
"chance = 10     # percent chance of replacing any given word\n",
"nuttiness = 10  # how many nearest neighbors to choose replacements from\n",
"doc = nlp(text)\n",
"output = []\n",
"for tok in doc:\n",
"    if tok.is_alpha and random.randrange(100) < chance:\n",
"        output.append(\n",
"            random.choice(list(similarsemantic(t, nlp, tok.text, nuttiness))) + tok.whitespace_)\n",
"    else:\n",
"        output.append(tok.text_with_ws)\n",
"print(\"\".join(output))"
]
},
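{
"cell_type": "markdown",
"metadata": {},
"source": [
"One quirk of the cell above: replacements come back lowercased, so capitalized words like \"God\" lose their case, and `random.choice()` can fail if a word yields no candidates. Here's a sketch of one way to handle both; `match_case` is a helper I made up for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def match_case(replacement, original):\n",
"    # copy the original word's capitalization onto the replacement\n",
"    if original.istitle():\n",
"        return replacement.capitalize()\n",
"    elif original.isupper():\n",
"        return replacement.upper()\n",
"    return replacement\n",
"\n",
"output = []\n",
"for tok in nlp(text):\n",
"    candidates = []\n",
"    if tok.is_alpha and random.randrange(100) < chance:\n",
"        candidates = list(similarsemantic(t, nlp, tok.text, nuttiness))\n",
"    if candidates:\n",
"        output.append(match_case(random.choice(candidates), tok.text) + tok.whitespace_)\n",
"    else:\n",
"        output.append(tok.text_with_ws)\n",
"print(\"\".join(output))"
]
},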
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Semantic random walk"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cheddar\n",
"mozzarella\n",
"parmesan\n",
"pesto\n",
"basil\n",
"parsley\n",
"chives\n",
"tarragon\n",
"thyme\n",
"rosemary\n",
"oregano\n",
"garlic\n",
"onions\n",
"onion\n",
"celery\n",
"carrots\n",
"carrot\n",
"potato\n",
"potatoes\n",
"mashed\n",
"gravy\n",
"sauce\n",
"chili\n",
"chilli\n",
"chillies\n"
]
}
],
"source": [
"def semanticwalk(current=None):\n",
"    seen = set()\n",
"    if current is None:\n",
"        # start from a random indexed word\n",
"        current = random.choice(lexmap).text\n",
"    seen.add(current)\n",
"    while True:\n",
"        # step to the nearest neighbor we haven't visited yet\n",
"        selected = [s for s in similarsemantic(t, nlp, current, 100) if s not in seen][0]\n",
"        yield selected\n",
"        seen.add(selected)\n",
"        current = selected\n",
"walker = semanticwalk(\"cheese\")\n",
"for i in range(25):\n",
"    print(next(walker))"
]
},
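{
"cell_type": "markdown",
"metadata": {},
"source": [
"A related trick, not part of the walk above: interpolate between two word vectors and snap each intermediate point to its nearest indexed word. This is just a sketch; `steps` is an arbitrary parameter I picked for the demo."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def interpolate_words(start, end, steps=10):\n",
"    # walk the straight line between two vectors, yielding the\n",
"    # nearest indexed word at each step\n",
"    a, b = nlp.vocab[start].vector, nlp.vocab[end].vector\n",
"    for i in range(steps + 1):\n",
"        vec = a + (b - a) * (i / steps)\n",
"        yield lexmap[t.get_nns_by_vector(vec, 1)[0]].orth_.lower()\n",
"\n",
"print(\" -> \".join(interpolate_words(\"light\", \"darkness\")))"
]
},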
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Phonetic similarity"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p = annoy.AnnoyIndex(50)  # the phonetic vectors have 50 dimensions\n",
"phonmap = []\n",
"phonlookup = {}\n",
"for i, line in enumerate(open(\"./cmudict-0.7b-simvecs\")):\n",
"    word, vec_raw = line.split(\"  \")  # word and vector are separated by two spaces\n",
"    word = word.lower().rstrip(\"(0123)\")  # strip alternate-pronunciation markers like (1)\n",
"    vec = [float(v) for v in vec_raw.split()]\n",
"    p.add_item(i, vec)\n",
"    phonmap.append(word)\n",
"    phonlookup[word] = vec\n",
"p.build(25)"
]
},
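{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check on the vectors we just loaded, here's a sketch that computes the cosine similarity between two words' phonetic vectors directly, using numpy (which spacy already depends on). Higher values should mean more similar sounds."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def phonetic_similarity(a, b):\n",
"    # cosine similarity between two words' phonetic vectors\n",
"    va, vb = np.array(phonlookup[a]), np.array(phonlookup[b])\n",
"    return np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))\n",
"\n",
"print(phonetic_similarity(\"cat\", \"kate\"))\n",
"print(phonetic_similarity(\"cat\", \"xylophone\"))"
]
},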
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from wordfilter import Wordfilter\n",
"wf = Wordfilter()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['halo',\n",
" 'yellow',\n",
" 'hilo',\n",
" 'hallow',\n",
" 'hemlo',\n",
" 'hollo',\n",
" 'hollow',\n",
" 'hoyle',\n",
" 'hail',\n",
" 'haile',\n",
" 'hale',\n",
" 'heyl',\n",
" 'colello',\n",
" 'lorello',\n",
" 'herrold',\n",
" 'riolo',\n",
" 'tirello',\n",
" 'pirrello',\n",
" 'heller',\n",
" 'hiller']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def similarphonetic(idx, word, n):\n",
"    count = 0\n",
"    if word not in phonlookup:\n",
"        return\n",
"    # fetch 100 raw neighbors, skipping the word itself and filtered words\n",
"    for i in idx.get_nns_by_vector(phonlookup[word], 100):\n",
"        if word != phonmap[i] and not wf.blacklisted(phonmap[i]):\n",
"            count += 1\n",
"            yield phonmap[i]\n",
"        if count >= n:\n",
"            break\n",
"list(similarphonetic(p, \"hello\", 20))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Phonetic replacement"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In the beginning God created the heaven and the earth. \n",
"And the earth was without form, and void; and darkness was upon thy face of the deep. And the Spirit eva's God moved upon z face of the waters. \n",
"inland God said, Let there be light: and there was light. \n",
"aunt gagan saw the light, that it was dig: and God divided the light schrum thy darkness. \n",
"And God cooled the lyde Day, and the darkness he called Night. And the evening and the morning were they first day. \n",
"And God said, Let there be a firmament ane the midst of the waters, and let it divide the waters from the watered.\n",
"\n"
]
}
],
"source": [
"chance = 10    # percent chance of replacing any given word\n",
"nuttiness = 5  # how many nearest neighbors to choose replacements from\n",
"doc = nlp(text)\n",
"output = []\n",
"for tok in doc:\n",
"    # look the word up in phonlookup (a dict) rather than phonmap (a list) for speed\n",
"    if tok.is_alpha and tok.text.lower() in phonlookup and random.randrange(100) < chance:\n",
"        output.append(\n",
"            random.choice(list(similarphonetic(p, tok.text.lower(), nuttiness))) + tok.whitespace_)\n",
"    else:\n",
"        output.append(tok.text_with_ws)\n",
"print(\"\".join(output))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Phonetic random walk"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def phonwalk(current=None):\n",
"    seen = set()\n",
"    if current is None:\n",
"        current = random.choice(list(phonlookup.keys()))\n",
"    # track vectors (as tuples), so homophones also count as seen\n",
"    seen.add(tuple(phonlookup[current]))\n",
"    while True:\n",
"        # step to the nearest unvisited neighbor that is 7-9 letters long\n",
"        selected = [s for s in similarphonetic(p, current, 100)\n",
"                    if tuple(phonlookup[s]) not in seen and len(s) in [7, 8, 9]][0]\n",
"        yield selected\n",
"        seen.add(tuple(phonlookup[selected]))\n",
"        current = selected\n",
"walker = phonwalk()\n",
"output = []\n",
"for i in range(200):\n",
"    item = next(walker)\n",
"    if \"'\" not in item:  # skip dictionary entries with apostrophes\n",
"        output.append(item)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"seyfarth\n",
"fail-safe\n",
"facelift\n",
"defaced\n",
"deficits\n",
"chefitz\n",
"steffek\n",
"stevick\n",
"staving\n",
"sieving\n",
"phasing\n",
"savings\n",
"shavings\n",
"shaving\n",
"ravishing\n",
"lavishing\n",
"fishing\n",
"finishing\n",
"vanishing\n",
"gnashing\n",
"lashing\n",
"dashing\n",
"bashing\n",
"banishing\n",
"meshing\n",
"enmeshing\n",
"smashing\n",
"smashes\n",
"beijing\n",
"bouygues\n",
"bogging\n",
"dogging\n",
"jogging\n",
"logging\n",
"lauding\n",
"rawding\n",
"rogaine\n",
"goading\n",
"gooding\n",
"goodkin\n",
"puddings\n",
"bookings\n",
"booking\n",
"cooking\n",
"cocaine\n",
"coating\n",
"coaching\n",
"choking\n",
"soaking\n",
"stoking\n",
"stroking\n",
"scorching\n",
"storting\n",
"torching\n",
"courting\n",
"cording\n",
"corning\n",
"kohring\n",
"pouring\n",
"goehring\n",
"doehring\n",
"dornier\n",
"doornail\n",
"dorrell\n",
"gorrell\n",
"borrell\n",
"borello\n",
"borrero\n",
"barreiro\n",
"marrero\n",
"monteiro\n",
"montello\n",
"mondello\n",
"mondallo\n",
"manzano\n",
"marzano\n",
"zamorano\n",
"zamarron\n",
"naramore\n",
"myanmar\n",
"meagher\n",
"manjarrez\n",
"marcelle\n",
"marcell\n",
"marcille\n",
"markkas\n",
"marcussen\n",
"markson\n",
"marxist\n",
"marxists\n",
"marxism\n",
"marcrum\n",
"marchman\n",
"marksman\n",
"unmarked\n",
"earmarked\n",
"earmarks\n",
"remarks\n",
"remarked\n",
"marques\n",
"monarch\n",
"monarchy\n",
"lamarche\n",
"marquee\n",
"marchita\n",
"martita\n",
"martina\n",
"martini\n",
"marchini\n",
"marteney\n",
"martinek\n",
"martinec\n",
"martech\n",
"marquette\n",
"marchetti\n",
"marchetta\n",
"ancheta\n",
"architect\n",
"equitex\n",
"equitec\n",
"arthotec\n",
"ratajczak\n",
"janacek\n",
"janecek\n",
"konicek\n",
"kessenich\n",
"kenneth\n",
"kenison\n",
"tennison\n",
"denison\n",
"jenison\n",
"renison\n",
"menacing\n",
"menning\n",
"meaning\n",
"remaining\n",
"renaming\n",
"demeaning\n",
"dimming\n",
"demming\n",
"beaming\n",
"maiming\n",
"lemming\n",
"leaming\n",
"lambing\n",
"ramming\n",
"rhyming\n",
"priming\n",
"cramming\n",
"climbing\n",
"claiming\n",
"flaming\n",
"fleming\n",
"framing\n",
"mainframe\n",
"inflaming\n",
"e-mailing\n",
"mailing\n",
"milling\n",
"mehling\n",
"mealing\n",
"kneeling\n",
"leaning\n",
"lanning\n",
"tanning\n",
"channing\n",
"janning\n",
"banning\n",
"panning\n",
"canning\n",
"kenning\n",
"penning\n",
"painting\n",
"pinning\n",
"binning\n",
"dinning\n",
"ginning\n",
"deigning\n",
"denning\n",
"jenning\n",
"venning\n",
"pfenning\n",
"feigning\n",
"finning\n",
"thinning\n",
"sinning\n"
]
}
],
"source": [
"print(\"\\n\".join(output))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine 'em both"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"adrift adrift\n",
"floundering drifted\n",
"floundered drafted\n",
"faltered grafted\n",
"wavered crafted\n",
"wavering kravetz\n",
"waver gravitt's\n",
"wavers gravett\n",
"falters gravity\n",
"falter griffee\n",
"faltering griffin\n",
"shaky griffing\n",
"wobbly riffing\n",
"wobbling leafing\n",
"wobble laughing\n",
"wobbles knifing\n",
"creaks refining\n",
"squeaks defining\n",
"squeaking affining\n",
"squeak siphoning\n",
"squeal suffice\n",
"squealing sufficed\n",
"shrieking suffices\n",
"screeching acidifies\n",
"screech acidified\n",
"shriek acidify\n"
]
}
],
"source": [
"worda = \"adrift\"\n",
"wordb = \"adrift\"\n",
"# run a phonetic walk and a semantic walk in parallel from the same word\n",
"walkera, walkerb = phonwalk(worda), semanticwalk(wordb)\n",
"print(worda, wordb)\n",
"for i in range(25):\n",
"    print(next(walkerb), next(walkera))"
]
},
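{
"cell_type": "markdown",
"metadata": {},
"source": [
"One more sketch in the same spirit, not from the original walks above: pick a replacement that is semantically close to a word but sounds as much like it as possible, by ranking the word's semantic neighbors with the `phonetic_similarity()` helper sketched earlier. `pool` is an arbitrary number of candidates to consider."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def soundalike_synonym(word, pool=20):\n",
"    # among the word's semantic neighbors, keep those with phonetic\n",
"    # vectors, then pick the one that sounds most like the word itself\n",
"    candidates = [w for w in similarsemantic(t, nlp, word, pool) if w in phonlookup]\n",
"    if word not in phonlookup or len(candidates) == 0:\n",
"        return word\n",
"    return max(candidates, key=lambda w: phonetic_similarity(word, w))\n",
"\n",
"for w in [\"light\", \"darkness\", \"waters\", \"heaven\"]:\n",
"    print(w, \"->\", soundalike_synonym(w))"
]
},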
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
} |