Replacements and random walks with word vectors. Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Replacements and walks with word vectors\n",
"\n",
"By [Allison Parrish](http://www.decontextualize.com/)\n",
"\n",
"This (rough) notebook contains some code and strategies for finding words that mean similar things to other words (using [spacy](https://spacy.io/)'s word vectors) and words that sound like other words (using [my phonetic similarity vectors](https://github.com/aparrish/phonetic-similarity-vectors)).\n",
"\n",
"## What you'll need\n",
"\n",
"Assuming you're starting with Anaconda, you'll need to [install spacy](https://spacy.io/usage/) and download a language model for English with word vectors. Run the following at the command line:\n",
"\n",
"    conda install -c conda-forge spacy\n",
"    python -m spacy download en_core_web_lg\n",
"\n",
"You'll also need to install [Annoy](https://pypi.python.org/pypi/annoy) and the [wordfilter](https://pypi.python.org/pypi/wordfilter) module:\n",
"\n",
"    pip install annoy\n",
"    pip install wordfilter\n",
" \n",
"For the phonetic similarity code, you'll need to download [this file](https://raw.githubusercontent.com/aparrish/phonetic-similarity-vectors/master/cmudict-0.7b-simvecs) and put it in the same directory as this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the nearest-neighbor index with semantic word vectors"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from itertools import islice\n",
"import random"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import spacy\n",
"import annoy"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nlp = spacy.load('en_core_web_lg') # or en_core_web_md"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The spacy model's vocabulary includes millions of entries, but we only want the ones that have a vector and consist entirely of alphabetic characters, so we grab just those."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"qualified = [item for item in nlp.vocab if item.has_vector and item.is_alpha]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"525088"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(qualified)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell loads the 100,000 most probable of these words into an Annoy nearest-neighbor index, which makes finding the words closest to a given vector much faster than a brute-force scan. (Building the index will take a while.)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lexmap = []\n",
"t = annoy.AnnoyIndex(300)  # spacy's word vectors are 300-dimensional\n",
"for i, item in enumerate(islice(sorted(qualified, key=lambda x: x.prob, reverse=True), 100000)):\n",
"    t.add_item(i, item.vector)\n",
"    lexmap.append(item)\n",
"t.build(25)"
]
},
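{
"cell_type": "markdown",
"metadata": {},
"source": [
"What Annoy approximates is a simple (but slow) computation: rank every stored vector by its angular (cosine) similarity to a query vector and take the closest ones. As a sketch of that underlying lookup, here's a brute-force version over tiny made-up 3-d vectors (the words and numbers below are invented purely for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from math import sqrt\n",
"\n",
"def cosine(a, b):\n",
"    # cosine similarity: dot product over the product of the magnitudes\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    mag = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))\n",
"    return dot / mag\n",
"\n",
"toy = {'cat': [1.0, 0.9, 0.1], 'dog': [0.9, 1.0, 0.2], 'car': [0.1, 0.2, 1.0]}\n",
"# rank every stored vector by similarity to the query, closest first\n",
"sorted(toy, key=lambda w: cosine(toy[w], toy['cat']), reverse=True)"
]
},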
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Semantic similarity\n",
"\n",
"The following generator function yields the *n* words most similar to the given word (assuming that the word has an associated vector)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['hi',\n",
" 'hey',\n",
" 'dear',\n",
" 'bye',\n",
" 'oh',\n",
" 'goodbye',\n",
" 'sorry',\n",
" 'yay',\n",
" 'yeah',\n",
" 'yep',\n",
" 'hehe',\n",
" 'thank',\n",
" 'thanx',\n",
" 'yup',\n",
" 'kitty',\n",
" 'ok',\n",
" 'haha',\n",
" 'hmm',\n",
" 'goodnight',\n",
" 'thanks']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def similarsemantic(t, nlp, word, n):\n",
"    seen = set()\n",
"    count = 0\n",
"    for i in t.get_nns_by_vector(nlp.vocab[word].vector, 100):\n",
"        this_word = lexmap[i].orth_.lower()\n",
"        if this_word not in seen and word != this_word:\n",
"            seen.add(this_word)\n",
"            count += 1\n",
"            yield this_word\n",
"        if count >= n:\n",
"            break\n",
"list(similarsemantic(t, nlp, \"hello\", 20))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Replacing words with semantically similar words"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"text = \"\"\"\\\n",
"In the beginning God created the heaven and the earth. \n",
"And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters. \n",
"And God said, Let there be light: and there was light. \n",
"And God saw the light, that it was good: and God divided the light from the darkness. \n",
"And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day. \n",
"And God said, Let there be a firmament in the midst of the waters, and let it divide the waters from the waters.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In the beginning christ created the heaven and the earth. \n",
"And the earth was without form, and void; and darkness was upon the face of the deep. other the Spirit of God moved upon the face of the waters. \n",
"And God said, know though be light: and there was illuminating. \n",
"And God saw the light, that so was good: and God divided the light from the darkness. \n",
"And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day. \n",
"And God said, Let what be a firmament into the midst of the sea, and let it divide the waters from the waters.\n",
"\n"
]
}
],
"source": [
"chance = 10\n",
"nuttiness = 10\n",
"doc = nlp(text)\n",
"output = []\n",
"for tok in doc:\n",
" if tok.is_alpha and random.randrange(100) < chance:\n",
" output.append(\n",
" random.choice(list(similarsemantic(t, nlp, tok.text, nuttiness))) + tok.whitespace_)\n",
" else:\n",
" output.append(tok.text_with_ws)\n",
"print(\"\".join(output))"
]
},
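{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two knobs control the cell above: `chance` is the percentage of tokens that get swapped out, and `nuttiness` is how far down the similarity list a replacement is allowed to come from. Here's the same mechanic in isolation, with a made-up stand-in for `similarsemantic()` so it runs without any models loaded:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"def fake_similar(word, n):\n",
"    # stand-in for similarsemantic(): just invents n identical 'neighbors'\n",
"    return [word.upper()] * n\n",
"\n",
"def replace_some(words, chance, nuttiness):\n",
"    out = []\n",
"    for w in words:\n",
"        if random.randrange(100) < chance:  # replace roughly chance% of words\n",
"            out.append(random.choice(fake_similar(w, nuttiness)))\n",
"        else:\n",
"            out.append(w)\n",
"    return out\n",
"\n",
"' '.join(replace_some('let there be light'.split(), 50, 5))"
]
},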
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Semantic random walk"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cheddar\n",
"mozzarella\n",
"parmesan\n",
"pesto\n",
"basil\n",
"parsley\n",
"chives\n",
"tarragon\n",
"thyme\n",
"rosemary\n",
"oregano\n",
"garlic\n",
"onions\n",
"onion\n",
"celery\n",
"carrots\n",
"carrot\n",
"potato\n",
"potatoes\n",
"mashed\n",
"gravy\n",
"sauce\n",
"chili\n",
"chilli\n",
"chillies\n"
]
}
],
"source": [
"def semanticwalk(current=None):\n",
"    seen = set()\n",
"    if current is None:\n",
"        current = nlp.vocab[random.choice(lexmap).text].text\n",
"    seen.add(current)\n",
"    while True:\n",
"        # move to the most similar word we haven't visited yet\n",
"        selected = [s for s in similarsemantic(t, nlp, current, 100) if s not in seen][0]\n",
"        yield selected\n",
"        seen.add(selected)\n",
"        current = selected\n",
"walker = semanticwalk(\"cheese\")\n",
"for i in range(25):\n",
"    print(next(walker))"
]
},
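{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `seen` set is what keeps the walk moving: the most similar word to any word is usually the one you just came from, so excluding already-visited words forces the walk to drift somewhere new. Here's the same pattern over a made-up five-word vocabulary whose nearest neighbors are fixed in advance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def toy_neighbors(word):\n",
"    # made-up similarity: neighbors in a fixed chain, nearest first\n",
"    chain = ['a', 'b', 'c', 'd', 'e']\n",
"    i = chain.index(word)\n",
"    return [w for w in (chain[min(i + 1, 4)], chain[max(i - 1, 0)]) if w != word]\n",
"\n",
"def toy_walk(start):\n",
"    seen = {start}\n",
"    current = start\n",
"    while True:\n",
"        candidates = [s for s in toy_neighbors(current) if s not in seen]\n",
"        if not candidates:  # nowhere new to go\n",
"            return\n",
"        selected = candidates[0]\n",
"        yield selected\n",
"        seen.add(selected)\n",
"        current = selected\n",
"\n",
"list(toy_walk('a'))  # walks down the chain without doubling back"
]
},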
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Phonetic similarity"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p = annoy.AnnoyIndex(50)  # the phonetic vectors are 50-dimensional\n",
"phonmap = []\n",
"phonlookup = {}\n",
"for i, line in enumerate(open(\"./cmudict-0.7b-simvecs\")):\n",
"    word, vec_raw = line.split(\"  \")  # entry and vector are separated by two spaces\n",
"    word = word.lower().rstrip(\"(0123)\")  # drop \"(1)\"-style variant markers\n",
"    vec = [float(v) for v in vec_raw.split()]\n",
"    p.add_item(i, vec)\n",
"    phonmap.append(word)\n",
"    phonlookup[word] = vec\n",
"p.build(25)"
]
},
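{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each line of `cmudict-0.7b-simvecs` pairs a CMU Pronouncing Dictionary entry with its 50-dimensional vector: the entry, two spaces, then space-separated floats, with alternate pronunciations marked by suffixes like `(1)`. A single made-up (and much shorter) line parses like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample = 'HELLO(1)  0.25 -1.50 3.00'   # invented line; real vectors have 50 floats\n",
"word, vec_raw = sample.split('  ')     # the double space separates entry and vector\n",
"word = word.lower().rstrip('(0123)')   # drop the variant marker\n",
"vec = [float(v) for v in vec_raw.split()]\n",
"word, vec"
]
},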
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from wordfilter import Wordfilter\n",
"wf = Wordfilter()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['halo',\n",
" 'yellow',\n",
" 'hilo',\n",
" 'hallow',\n",
" 'hemlo',\n",
" 'hollo',\n",
" 'hollow',\n",
" 'hoyle',\n",
" 'hail',\n",
" 'haile',\n",
" 'hale',\n",
" 'heyl',\n",
" 'colello',\n",
" 'lorello',\n",
" 'herrold',\n",
" 'riolo',\n",
" 'tirello',\n",
" 'pirrello',\n",
" 'heller',\n",
" 'hiller']"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def similarphonetic(t, word, n):\n",
"    count = 0\n",
"    if word not in phonlookup:\n",
"        return\n",
"    for i in t.get_nns_by_vector(phonlookup[word], 100):\n",
"        if word != phonmap[i] and not wf.blacklisted(phonmap[i]):\n",
"            count += 1\n",
"            yield phonmap[i]\n",
"        if count >= n:\n",
"            break\n",
"list(similarphonetic(p, \"hello\", 20))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Phonetic replacement"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In the beginning God created the heaven and the earth. \n",
"And the earth was without form, and void; and darkness was upon thy face of the deep. And the Spirit eva's God moved upon z face of the waters. \n",
"inland God said, Let there be light: and there was light. \n",
"aunt gagan saw the light, that it was dig: and God divided the light schrum thy darkness. \n",
"And God cooled the lyde Day, and the darkness he called Night. And the evening and the morning were they first day. \n",
"And God said, Let there be a firmament ane the midst of the waters, and let it divide the waters from the watered.\n",
"\n"
]
}
],
"source": [
"chance = 10\n",
"nuttiness = 5\n",
"doc = nlp(text)\n",
"output = []\n",
"for tok in doc:\n",
" if tok.is_alpha and tok.text.lower() in phonmap and random.randrange(100) < chance:\n",
" output.append(\n",
" random.choice(list(similarphonetic(p, tok.text.lower(), nuttiness))) + tok.whitespace_)\n",
" else:\n",
" output.append(tok.text_with_ws)\n",
"print(\"\".join(output))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Phonetic random walk"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def phonwalk(current=None):\n",
"    seen = set()\n",
"    if current is None:\n",
"        current = random.choice(list(phonlookup.keys()))\n",
"    seen.add(tuple(phonlookup[current]))\n",
"    while True:\n",
"        selected = [s for s in similarphonetic(p, current, 100)\n",
"                    if tuple(phonlookup[s]) not in seen and len(s) in (7, 8, 9)][0]\n",
"        yield selected\n",
"        seen.add(tuple(phonlookup[selected]))\n",
"        current = selected\n",
"walker = phonwalk()\n",
"output = []\n",
"for i in range(200):\n",
"    item = next(walker)\n",
"    if \"'\" not in item:\n",
"        output.append(item)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"seyfarth\n",
"fail-safe\n",
"facelift\n",
"defaced\n",
"deficits\n",
"chefitz\n",
"steffek\n",
"stevick\n",
"staving\n",
"sieving\n",
"phasing\n",
"savings\n",
"shavings\n",
"shaving\n",
"ravishing\n",
"lavishing\n",
"fishing\n",
"finishing\n",
"vanishing\n",
"gnashing\n",
"lashing\n",
"dashing\n",
"bashing\n",
"banishing\n",
"meshing\n",
"enmeshing\n",
"smashing\n",
"smashes\n",
"beijing\n",
"bouygues\n",
"bogging\n",
"dogging\n",
"jogging\n",
"logging\n",
"lauding\n",
"rawding\n",
"rogaine\n",
"goading\n",
"gooding\n",
"goodkin\n",
"puddings\n",
"bookings\n",
"booking\n",
"cooking\n",
"cocaine\n",
"coating\n",
"coaching\n",
"choking\n",
"soaking\n",
"stoking\n",
"stroking\n",
"scorching\n",
"storting\n",
"torching\n",
"courting\n",
"cording\n",
"corning\n",
"kohring\n",
"pouring\n",
"goehring\n",
"doehring\n",
"dornier\n",
"doornail\n",
"dorrell\n",
"gorrell\n",
"borrell\n",
"borello\n",
"borrero\n",
"barreiro\n",
"marrero\n",
"monteiro\n",
"montello\n",
"mondello\n",
"mondallo\n",
"manzano\n",
"marzano\n",
"zamorano\n",
"zamarron\n",
"naramore\n",
"myanmar\n",
"meagher\n",
"manjarrez\n",
"marcelle\n",
"marcell\n",
"marcille\n",
"markkas\n",
"marcussen\n",
"markson\n",
"marxist\n",
"marxists\n",
"marxism\n",
"marcrum\n",
"marchman\n",
"marksman\n",
"unmarked\n",
"earmarked\n",
"earmarks\n",
"remarks\n",
"remarked\n",
"marques\n",
"monarch\n",
"monarchy\n",
"lamarche\n",
"marquee\n",
"marchita\n",
"martita\n",
"martina\n",
"martini\n",
"marchini\n",
"marteney\n",
"martinek\n",
"martinec\n",
"martech\n",
"marquette\n",
"marchetti\n",
"marchetta\n",
"ancheta\n",
"architect\n",
"equitex\n",
"equitec\n",
"arthotec\n",
"ratajczak\n",
"janacek\n",
"janecek\n",
"konicek\n",
"kessenich\n",
"kenneth\n",
"kenison\n",
"tennison\n",
"denison\n",
"jenison\n",
"renison\n",
"menacing\n",
"menning\n",
"meaning\n",
"remaining\n",
"renaming\n",
"demeaning\n",
"dimming\n",
"demming\n",
"beaming\n",
"maiming\n",
"lemming\n",
"leaming\n",
"lambing\n",
"ramming\n",
"rhyming\n",
"priming\n",
"cramming\n",
"climbing\n",
"claiming\n",
"flaming\n",
"fleming\n",
"framing\n",
"mainframe\n",
"inflaming\n",
"e-mailing\n",
"mailing\n",
"milling\n",
"mehling\n",
"mealing\n",
"kneeling\n",
"leaning\n",
"lanning\n",
"tanning\n",
"channing\n",
"janning\n",
"banning\n",
"panning\n",
"canning\n",
"kenning\n",
"penning\n",
"painting\n",
"pinning\n",
"binning\n",
"dinning\n",
"ginning\n",
"deigning\n",
"denning\n",
"jenning\n",
"venning\n",
"pfenning\n",
"feigning\n",
"finning\n",
"thinning\n",
"sinning\n"
]
}
],
"source": [
"print(\"\\n\".join(output))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine 'em both"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"adrift adrift\n",
"floundering drifted\n",
"floundered drafted\n",
"faltered grafted\n",
"wavered crafted\n",
"wavering kravetz\n",
"waver gravitt's\n",
"wavers gravett\n",
"falters gravity\n",
"falter griffee\n",
"faltering griffin\n",
"shaky griffing\n",
"wobbly riffing\n",
"wobbling leafing\n",
"wobble laughing\n",
"wobbles knifing\n",
"creaks refining\n",
"squeaks defining\n",
"squeaking affining\n",
"squeak siphoning\n",
"squeal suffice\n",
"squealing sufficed\n",
"shrieking suffices\n",
"screeching acidifies\n",
"screech acidified\n",
"shriek acidify\n"
]
}
],
"source": [
"worda = \"adrift\"\n",
"wordb = \"adrift\"\n",
"walkera, walkerb = phonwalk(worda), semanticwalk(wordb)\n",
"print(worda, wordb)\n",
"for i in range(25):\n",
"    print(next(walkerb), next(walkera))"
]
},
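{
"cell_type": "markdown",
"metadata": {},
"source": [
"Advancing two generators in lockstep is what `zip()` does, so with `itertools.islice` you can take a fixed number of pairs from any two walkers. A sketch with trivial stand-in generators:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from itertools import count, islice\n",
"\n",
"gen_a = ('a%d' % i for i in count())  # stand-ins for phonwalk()/semanticwalk()\n",
"gen_b = ('b%d' % i for i in count())\n",
"[' '.join(pair) for pair in islice(zip(gen_a, gen_b), 3)]"
]
},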
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}