@fzliu
Created December 23, 2021 11:56
{
"cells": [
{
"cell_type": "markdown",
"id": "7195e4b7",
"metadata": {},
"source": [
"### Some prep work\n",
"Before beginning, we'll need to install the gensim library and download a pretrained Word2Vec model."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6d2ac701",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: gensim in /Users/fzliu/.pyenv/lib/python3.8/site-packages (4.1.2)\n",
"Requirement already satisfied: smart-open>=1.8.1 in /Users/fzliu/.pyenv/lib/python3.8/site-packages (from gensim) (5.2.1)\n",
"Requirement already satisfied: numpy>=1.17.0 in /Users/fzliu/.pyenv/lib/python3.8/site-packages (from gensim) (1.21.2)\n",
"Requirement already satisfied: scipy>=0.18.1 in /Users/fzliu/.pyenv/lib/python3.8/site-packages (from gensim) (1.7.1)\n",
"--2021-11-16 17:34:14-- https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz\n",
"Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.165.200\n",
"Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.165.200|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 1647046227 (1.5G) [application/x-gzip]\n",
"Saving to: ‘GoogleNews-vectors-negative300.bin.gz’\n",
"\n",
"GoogleNews-vectors- 100%[===================>] 1.53G 1.90MB/s in 15m 31s \n",
"\n",
"2021-11-16 17:49:46 (1.69 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]\n",
"\n"
]
}
],
"source": [
"!pip install gensim --disable-pip-version-check\n",
"!wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz\n",
"!gunzip GoogleNews-vectors-negative300.bin.gz"
]
},
{
"cell_type": "markdown",
"id": "c18c1cf6",
"metadata": {},
"source": [
"Now that the prep work is done, let's load the pretrained Word2Vec vectors."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "78db453e",
"metadata": {},
"outputs": [],
"source": [
"from gensim.models import KeyedVectors\n",
"model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)"
]
},
{
"cell_type": "markdown",
"id": "a838a74d",
"metadata": {},
"source": [
"### Example 0: Marlon Brando\n",
"\n",
"Let's take a look at how Word2Vec interprets the famous actor Marlon Brando."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "cd7ff482",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Brando', 0.7574540376663208), ('Humphrey_Bogart', 0.6143958568572998), ('actor_Marlon_Brando', 0.6016287207603455), ('Al_Pacino', 0.5675410032272339), ('Elia_Kazan', 0.5594002604484558), ('Steve_McQueen', 0.5539456605911255), ('Marilyn_Monroe', 0.5512186884880066), ('Jack_Nicholson', 0.5440200567245483), ('Shelley_Winters', 0.5432392954826355), ('Apocalypse_Now', 0.5306933522224426)]\n"
]
}
],
"source": [
"print(model.most_similar(positive=['Marlon_Brando']))"
]
},
{
"cell_type": "markdown",
"id": "4d5769cd",
"metadata": {},
"source": [
"Marlon Brando worked with Al Pacino in The Godfather and Elia Kazan in A Streetcar Named Desire. He also starred in Apocalypse Now."
]
},
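{
"cell_type": "markdown",
"id": "b7e2c9a1",
"metadata": {},
"source": [
"The scores returned by `most_similar` are cosine similarities. As a rough sketch of that ranking, the cell below scores a few made-up 3-dimensional vectors against a query word. The vectors, the `toy` dictionary, and the `cosine` helper are assumptions for illustration only; they are not taken from the GoogleNews model, whose vectors have 300 dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e41d0f3a",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def cosine(u, v):\n",
"    # Cosine similarity: dot product divided by the product of the vector lengths.\n",
"    dot = sum(a * b for a, b in zip(u, v))\n",
"    norm_u = math.sqrt(sum(a * a for a in u))\n",
"    norm_v = math.sqrt(sum(b * b for b in v))\n",
"    return dot / (norm_u * norm_v)\n",
"\n",
"# Hypothetical toy vectors, invented for this sketch (not from the real model).\n",
"toy = {\n",
"    'actor_a': [0.9, 0.1, 0.0],\n",
"    'actor_b': [0.8, 0.3, 0.1],\n",
"    'fruit': [0.0, 0.2, 0.9],\n",
"}\n",
"\n",
"query = 'actor_a'\n",
"# Score every other word against the query and sort from most to least similar.\n",
"ranked = sorted(\n",
"    ((word, cosine(toy[query], vec)) for word, vec in toy.items() if word != query),\n",
"    key=lambda pair: pair[1],\n",
"    reverse=True,\n",
")\n",
"print(ranked)"
]
},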
{
"cell_type": "markdown",
"id": "3abba0c9",
"metadata": {},
"source": [
"### Example 1: If all of the kings had their queens on the throne\n",
"\n",
"Vectors can be added to and subtracted from each other to demonstrate underlying semantic relationships."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "36f7d74c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('queen', 0.7118193507194519)]\n"
]
}
],
"source": [
"print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))"
]
},
{
"cell_type": "markdown",
"id": "e9d62527",
"metadata": {},
"source": [
"Who says engineers can't enjoy a bit of dance-pop now and then?"
]
},
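{
"cell_type": "markdown",
"id": "c93a5d27",
"metadata": {},
"source": [
"To make the king/queen arithmetic less of a black box, here is a minimal sketch of the same idea on made-up 2-dimensional vectors: normalize each input, add the positive words, subtract the negative ones, then pick the closest remaining word by cosine similarity. This only approximates what `most_similar` does internally, and the vectors and helper names (`unit`, `cosine`, `toy`) are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f08b6e14",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def unit(v):\n",
"    # Scale a vector to length 1.\n",
"    n = math.sqrt(sum(x * x for x in v))\n",
"    return [x / n for x in v]\n",
"\n",
"def cosine(u, v):\n",
"    # Cosine similarity is the dot product of the unit-length vectors.\n",
"    return sum(a * b for a, b in zip(unit(u), unit(v)))\n",
"\n",
"# Hypothetical toy vectors: axis 0 is roughly 'royalty', axis 1 roughly 'gender'.\n",
"toy = {\n",
"    'king': [0.9, 0.8],\n",
"    'queen': [0.9, -0.8],\n",
"    'man': [0.1, 0.8],\n",
"    'woman': [0.1, -0.8],\n",
"    'apple': [-0.7, 0.1],\n",
"}\n",
"\n",
"positive, negative = ['king', 'woman'], ['man']\n",
"\n",
"# Build the target vector: king + woman - man, using unit-length inputs.\n",
"target = [0.0, 0.0]\n",
"for w in positive:\n",
"    target = [t + x for t, x in zip(target, unit(toy[w]))]\n",
"for w in negative:\n",
"    target = [t - x for t, x in zip(target, unit(toy[w]))]\n",
"\n",
"# Exclude the input words, then take the nearest remaining word.\n",
"candidates = {w: v for w, v in toy.items() if w not in positive + negative}\n",
"best = max(candidates, key=lambda w: cosine(target, candidates[w]))\n",
"print(best)"
]
},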
{
"cell_type": "markdown",
"id": "74f392f1",
"metadata": {},
"source": [
"### Example 2: Apple, the company, the fruit, ... or both?\n",
"\n",
"The word \"apple\" can refer to both the company and the delicious red fruit. In this example, we can see that Word2Vec retains both meanings."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0a29f65e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('droid_x', 0.6324754953384399)]\n",
"[('apple', 0.6410146951675415)]\n"
]
}
],
"source": [
"print(model.most_similar(positive=['samsung', 'iphone'], negative=['apple'], topn=1))\n",
"print(model.most_similar(positive=['fruit'], topn=10)[9:])"
]
},
{
"cell_type": "markdown",
"id": "a613bada",
"metadata": {},
"source": [
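"\"droid_x\" refers to the Droid X, an Android smartphone made by Motorola rather than Samsung, so the analogy (\"Samsung\" + \"iPhone\" - \"Apple\") lands on a competing Android phone instead of a Samsung model. Meanwhile, \"apple\" is the 10th closest word to \"fruit\"."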
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}