Skip to content

Instantly share code, notes, and snippets.

@linanqiu
Created May 9, 2016 00:48
Show Gist options
  • Save linanqiu/e967559d535def1e6f5b931e9b580c58 to your computer and use it in GitHub Desktop.
Save linanqiu/e967559d535def1e6f5b931e9b580c58 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Verifying Word2Vec GPU"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Found a GPU implementation of `word2vec` on GitHub. Shout out to the guy's username. Love it.\n",
"\n",
"[https://github.com/whatupbiatch/cuda-word2vec](https://github.com/whatupbiatch/cuda-word2vec)\n",
"\n",
"Specifically, it only implements to Continuous Bag of Words (CBOW) half of word2vec (the other being the Skip-Gram). However, most CUDA implementations so far have been inaccurate / memory constrained. It'd serve us well to verify the results of this GPU implementation.\n",
"\n",
"**Spoilers first: The CUDA code gives inaccurate embeddings**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But before that, here's why we're doing this.\n",
"\n",
"On a 8 core CPU with 32G of RAM, `gensim`'s implementation of `word2vec` ran at around 20K words per second. The same dataset with the same parameters ran at 400K words per second. This speed up is very sexy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll train a CPU version of the model and a GPU version of the model using the same parameters, and do the following\n",
"\n",
"1. Check their cosine distance for every vocabulary and see that it behaves like that of 2 CPU models\n",
"2. Check word similarities to see that the model makes sense"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"%config InlineBackend.figure_format = 'svg'\n",
"\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('ggplot')\n",
"\n",
"from lib.w2v.w2v import *"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"cpu_model = model_from_saved(\"./temp/models-ec2/enron\", binary=False)\n",
"gpu_model = model_from_saved(\"./temp/models-gpu/enron\", binary=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cosine Distance\n",
"\n",
"First let's measure the cosine distance between vocabularies."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x1104e5c90>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEACAYAAAC6d6FnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFOBJREFUeJzt3W2MHdd93/HvT5IFK5ZqQnVASRQDCQ3VmoUN22rMtLFR\nurUF1igk9Y3koFWEhAgCsK2MIkhCpkDNAkVi50UbuYH0IrEjyo1VEE4jKJAqk1a9qBEkYp1INm1K\noViAqLkNV67ryHlACgr698XOmuP1cvfu3fu45/sBFjwzd2b2HO6985tz5uGmqpAkteeqaVdAkjQd\nBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMGCoAkO5J8LslLSc4k2ZfkxiQnk5xNciLJjt7yR5K8kuTl\nJHf15t+Z5HT32sPjaJAkaTCD9gAeBp6pqrcD7wReBg4DJ6vqDuC5bpoke4H7gb3AAeCRJOm28yhw\nsKr2AHuSHBhZSyRJm7JhACR5K/D+qvo0QFW9XlWvAXcDx7rFjgH3duV7gCeq6lJVnQfOAfuS3Azc\nUFWnuuUe760jSZqwQXoAtwPfTPKbSf4oya8neQuws6qWumWWgJ1d+RbgQm/9C8CuNeYvdvMlSVMw\nSABcA7wHeKSq3gP8Bd1wz4pafp6Ez5SQpDlyzQDLXAAuVNX/6KY/BxwBLia5qaoudsM7r3avLwK7\ne+vf2m1jsSv35y+u/mVJDBJJ2qSqysZLfa8NA6DbwX8jyR1VdRb4IPD17udB4BPdv092qzwFfDbJ\nv2d5iGcPcKqqKsl3kuwDTgEPAJ8cVUPmQZKjVXV02vUYF9s332zf/Br2wHmQHgDAvwR+K8m1wP8E\nfhK4Gjie5CBwHrgPoKrOJDkOnAFeBw7V5UeOHgIeA65j+aqiZ4eptCRp6wYKgKr6CvAja7z0wSss\n/0vAL60x/w+Bd2ymgpKk8fBO4MlamHYFxmxh2hUYs4VpV2DMFqZdgTFbmHYFZk1m7QthktR2PQcg\nSeMw7H7THoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlAEgSY0yACSpUQaAJDXKAJCkRhkAktQo\nA0CSGmUASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhp1zbQrIM26JNWfHubLt6VZ\nZABIA1nJgHxPIBgGmmcOAUmbVlwOBGl+2QOQOquHeqTtzh6A9D36R/ce6Wt7MwAkqVEDBUCS80m+\nmuSFJKe6eTcmOZnkbJITSXb0lj+S5JUkLye5qzf/ziSnu9ceHn1zJEmDGrQHUMD+qnp3Vb23m3cY\nOFlVdwDPddMk2QvcD+wFDgCPJFm5UuJR4GBV7QH2JDkwonZIkjZpM0NAqy93uxs41pWPAfd25XuA\nJ6rqUlWdB84B+5LcDNxQVae65R7vrSPNpSS18jPtukibtZkewBeSfDnJT3fzdlbVUldeAnZ25VuA\nC711LwC71pi/2M2XpqK/8x5+B+6JYs2vQS8D/bGq+pMkPwicTPJy/8Wq8ghIc6r/tt3aPV3eMax5\nM1AAVNWfdP9+M8nvAO8FlpLcVFUXu+GdV7vFF4HdvdVvZfnIf7Er9+cvrvX7khztTS5U1cIg9ZSm\na3RhIq0nyX5g/5a3U7X+gXuSHwCurqo/S/IW4ATwb4EPAt+qqk8kOQzsqKrD3Ungz7IcEruALwA/\n3PUSngceAk4BTwOfrKpnV/2+8shJk7B8xL56p10jKK9MX+Z7WuM07H5zkB7ATuB3ugt5rgF+q6pO\nJPkycDzJQeA8cB9AVZ1Jchw4A7wOHKrLKXMIeAy4Dnhm9c5f2l764SDNng17AJNmD0CTMv4ewOXX\nfE9rnIbdb3onsCQ1ygCQpEb5NFA1xcuVpcvsAahB3rwlgQEgSc1yCEiaAL9GUrPIHoA0EQ47afYY\nAJLUKANAkhplAEhSowwASWqUASBJjfIyUG173v0rrc0AUCNm59HM3hOgWeEQkDRx3hOg2WAASFKj\nDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLUKANAkhplAEhSo3wUhLYln/8jbcwegLYxH7kgrccA\nkKRGGQCS1CgDQJIaNVAAJLk6yQtJfrebvjHJySRnk5xIsqO37JEkryR5Ocldvfl3Jjndvfbw6Jsi\nSdqMQXsAHwXOcPmM2mHgZFXdATzXTZNkL3A/sBc4ADySZOULLx4FDlbVHmBPkgOjaYI0v5LUys+0\n66L2bBgASW4FPgz8Bpe/Tulu4FhXPgbc25XvAZ6oqktVdR44B+xLcjNwQ1Wd6pZ7vLeO1DCvVNL0\nDNID+A/AzwFv9ObtrKqlrrwE7OzKtwAXestdAHatMX+xmy+NjEfT0uaseyNYkn8MvFpVLyTZv9Yy\nVTXyD1ySo73JhapaGOX2tZ3Nznf/SuPS7Y/3b3U7G90J/PeAu5N8GHgz8NeSfAZYSnJTVV3shnde\n7ZZfBHb31r+V5SP/xa7cn794pV9aVUc31QpJakh3ULywMp3kY8NsZ90hoKr6xaraXVW3Ax8B/ltV\nPQA8BTzYLfYg8GRXfgr4SJJrk9wO7AFOVdVF4DtJ9nUnhR/orSMJh7A0eZt9FtDKG/PjwPEkB4Hz\nwH0AVXUmyXGWrxh6HThUVSvrHAIeA64DnqmqZ7dWdWm7cfhKk5XL++fZkKSqyk+ANm35yLm/E91M\neZh1xrctPwPajGH3m94JLEmNMgAkqVEGgCQ1ygCQpEb5jWCaa14yKQ3PHoC2AZ+nIw3DAJCkRhkA\nktQoA0CSGmUASFKjDABJapSXgUozqH95q88F0rjYA5Bmkpe2avwMAElqlAEgSY0yACSpUZ4E1lzx\n2T/S6BgAmkOrv3lL0jAcApKkRhkAktQoA0CSGuU5AGnGrT7x7Z3BGhUDQDPPK3886a3xcAhIc8JH\nI0ijZgBIUqMMAElqlAEgSY0yACSpUesGQJI3J3k+yYtJziT55W7+jUlOJjmb5ESSHb11jiR5JcnL\nSe7qzb8zyenutYfH1yRJ0iDWDYCq+ivgA1X1LuCdwAeSvA84DJysqjuA57ppkuwF7gf2AgeAR5Ks\nXLf2KHCwqvYAe5IcGEeDJEmD2XAIqKr+siteC1wNfBu4GzjWzT8G3NuV7wGeqKpLVXUeOAfsS3Iz\ncENVneqWe7y3jiRpCjYMgCRXJXkRWAK+WFVfB3ZW1VK3yBKwsyvfAlzorX4B2LXG/MVuvqRNSlIr\nP9Oui+bbhncCV9UbwLuSvBX4fJIPrHp95G/EJEd7kwtVtTDK7UvzbeXj5l3BrUqyH9i/1e0M/CiI\nqnotydPAncBSkpuq6mI3vPNqt9gisLu32q0sH/kvduX+/MV1ftfRQeslSa3pDooXVqaTfGyY7Wx0\nFdDbVq7wSXId8CHgBeAp4MFusQeBJ7vyU8BHklyb5HZgD3Cqqi4C30myrzsp/EBvHUnSFGzUA7gZ\nOJbkKpbD4jNV9VySF4DjSQ4C54H7AKrqTJLjwBngdeBQVa30Vw8BjwHXAc9U1bOjbowkaXC5vH+e\nDUnKx93q+88r9ce9Vz8dc63XNlue1W2tv10/K4Lh95veCawZ5hNApXEyACSpUQaAJDXKAJCkRhkA\nktQoA0CSGmUASFKjDABJatTAzwKSNHv6N8x5U5g2yx6ANNe8WU7DswegmeHz7aXJsgegGeMRrTQp\nBoAkNcoAkKRGGQCS1CgDQJIaZQBIUqMMAElqlPcBSNuEdwVrs+wBSNuG91BocwwASWqUQ0CaKh//\nIE2PPQDNAIcupGkwACSpUQaAJDXKAJCkRhkAktQoA0CSGrVhACTZneSLSb6e5GtJHurm35jkZJKz\nSU4k2dFb50iSV5K8nOSu3vw7k5zuXnt4PE2SlKRWfqZdF82uQXoAl4B/VVV/G/hR4J8neTtwGDhZ\nVXcAz3XTJNkL3A/sBQ4AjyRZuS39UeBgVe0B9iQ5MNLWSOp4aa02tmEAVNXFqnqxK/858BKwC7gb\nONYtdgy4tyvfAzxRVZeq6jxwDtiX5Gbghqo61S33eG8dSdKEbeocQJLbgHcDzwM7q2qpe2kJ2NmV\nbwEu9Fa7wHJgrJ6/2M2XJE3BwI+CSHI98NvAR6vqzy6P6kBVjXSsMcnR3uRCVS2MatuSNO+S7Af2\nb3U7AwVAkjexvPP/TFU92c1eSnJTVV3shnde7eYvArt7q9/K8pH/Ylfuz19c6/dV1dGBW6C544lJ\naWu6g+KFlekkHxtmO4NcBRTgU8CZqvrV3ktPAQ925QeBJ3vzP5Lk2iS3A3uAU1V1EfhOkn3dNh/o\nraPmeJJSmrZUrf8hTPI+4L8DX+XyJ/YIcAo4DvwQcB64r6r+tFvnF4GfAl5necjo8938O4HHgOuA\nZ6rqoTV+X/llFtvbcg9g5a0Uhi9vdf152NbWt+vnafsbdr+5YQBMmgGw/RkABoBGa9j9pncCS1Kj\nDABJapQBIEmNMgAkqVEGgCQ1yi+F19h545c0mwwATcjqSxk1Kf0A9pJQ9TkEJG173nWttRkAktQo\nA0CSGmUASFKjPAksNWT1FVmeFG6bASA1xauxdJkBoLHw2n9p9nkOQGPk5YfSLDMAJKlRBoAkNcoA\nkKRGGQCS1CgDQJIaZQBIUqMMAElqlDeCSQ3zuwLaZg9Aapo367XMHoBGxsc/SPPFHoBGzCNKaV4Y\nAJLUKIeAJAGeEG7Rhj2AJJ9OspTkdG/ejUlOJjmb5ESSHb3XjiR5JcnLSe7qzb8zyenutYdH3xRJ\nW+PwXWsGGQL6TeDAqnmHgZNVdQfwXDdNkr3A/cDebp1HkqwcSTwKHKyqPcCeJKu3qTmUpFZ+pl0X\nSZuzYQBU1ZeAb6+afTdwrCsfA+7tyvcAT1TVpao6D5wD9iW5Gbihqk51yz3eW0dzzyNHaR4NexJ4\nZ1UtdeUlYGdXvgW40FvuArBrjfmL3XxJ0pRs+SqgqvLwT5Lm0LBXAS0luamqLnbDO6928xeB3b3l\nbmX5yH+xK/fnL15p40mO9iYXqmphyHpK0raTZD+wf8vbWT6A3/CX3Qb8blW9o5v+FeBbVfWJJIeB\nHVV1uDsJ/FngvSwP8XwB+OGqqiTPAw8Bp4CngU9W1bNr/K7yErT5sXzyd+U9FDYuD7qc25pmHf0M\nzpdh95sb9gCSPAH8feBtSb4B/Bvg48DxJAeB88B9AFV1Jslx4AzwOnCoLifMIeAx4DrgmbV2/pJm\ng/cEtGGgHsAk2QOYfd9/yedsHsVuj21Nv45+HmffsPtNHwWhIXnuX5p3BoAkNcoAkKRG+TA4DcRH\nPUjbjz0AbYLj/tJ2Yg9A0rq8JHT7sgcgaQP2/LYrewCSBmZvYHsxALQmT/pqbf2bxTTvDACtY/Vd\no5K2E88BSFKjDABJapRDQJKGsvo8kSeF548BIGlIniOadw4BSVKjDABJapQBIEmN8hyAvsubv6S2\nGACNW//rHaXB+ZiI+eMQkPBhXxoN30fzxh6ApJGzNzAfDIAGOdav8XMocR44BNQsu+tS6+wBNMKj\nfkmrGQBNsVuuyfN8wOwyACSN2eUDD8NgthgA25jDPpo99kJniQGwjay9w/cDp9lkb2D6Jn4VUJID\nSV5O8kqSX5j079/+Cq/w0Xy4/D5NUv2f6darHRMNgCRXA78GHAD2Aj+e5O2TrMM0Jdk/ou3UWj+j\n2LbWszDtCozZwhR/9/ceuIzjfT2qz992MukewHuBc1V1vqouAf8ZuGfCdZim/aPbVP8o3yP+yViY\ndgXGbGHaFegZSxjs32qttptJnwPYBXyjN30B2DfhOsw0j+Sl1da+iuiKS3s+YWCT7gGMZOeWXPUf\nVw1//JNRbHdSrjSEc/nNfaWje4/01bqNPxvrfLY+NvHqzrhUTW6HkuRHgaNVdaCbPgK8UVWf6C3j\nHk6SNmmYns+kA+Aa4I+Bfwj8b+AU8ONV9dLEKiFJAiZ8DqCqXk/yL4DPA1cDn3LnL0nTMdEegCRp\ndkz1cdBJbkxyMsnZJCeS7FhjmTcneT7Ji0nOJPnladR1GAO2b3eSLyb5epKvJXloGnUdxiDt65b7\ndJKlJKcnXcdhDHKzYpJPdq9/Jcm7J13HYW3UtiR/K8nvJ/mrJD87jTpuxQDt+6fd3+yrSX4vyTun\nUc9hDdC+e7r2vZDkD5P8g3U3WFVT+wF+Bfj5rvwLwMevsNwPdP9eA/wB8L5p1nuU7QNuAt7Vla9n\n+RzJ26dd9xH//d4PvBs4Pe06D9Cmq4FzwG3Am4AXV/89gA8Dz3TlfcAfTLveI2zbDwJ/B/h3wM9O\nu85jaN/fBd7alQ/My99uE+17S6/8Dpbvu7riNqf9hTB3A8e68jHg3rUWqqq/7IrXsvyf8H/HX7WR\n2LB9VXWxql7syn8OvATcMrEabs2gf78vAd+eVKW2aJCbFb/b7qp6HtiRZOdkqzmUDdtWVd+sqi8D\nl6ZRwS0apH2/X1WvdZPPA7dOuI5bMUj7/qI3eT3wf9bb4LQDYGdVLXXlJWDND1GSq5K82C3zxao6\nM6kKbtFA7VuR5DaWj5SfH2+1RmZT7ZsTa92suGuAZeZhRzJI2+bZZtt3EHhmrDUarYHal+TeJC8B\n/xVYd0h57FcBJTnJ8jDHav+6P1FVV7zVu6reAN6V5K3A55Psr6qFkVd2CKNoX7ed64HPAR/tegIz\nYVTtmyODtmH1Ndfz0PZ5qONWDNy+JB8Afgr4sfFVZ+QGal9VPQk8meT9wGeAv3mlZcceAFX1oSu9\n1p0YvKmqLia5GXh1g229luRplscoF0Zb0+GMon1J3gT8NvCfuj/ezBjl329OLAK7e9O7WT7SWm+Z\nW7t5s26Qts2zgdrXnfj9deBAVc3L0CRs8u9XVV9Kck2Sv15V31prmWkPAT0FPNiVHwS+b+eX5G0r\nV5ckuQ74EPDCxGq4NYO0L8CngDNV9asTrNsobNi+OfRlYE+S25JcC9zPcjv7ngJ+Ar57d/uf9obC\nZtkgbVsxj8/T2bB9SX4I+C/AP6uqc1Oo41YM0r6/0e1TSPIegCvt/OlenOZZ7RuBLwBngRPAjm7+\nLcDTXfmdwB+xfMb7q8DPTfts/Ijb9z7gja59L3Q/B6Zd91G1r5t+guU7v/8fy2OYPzntum/Qrn/E\n8tVY54Aj3byfAX6mt8yvda9/BXjPtOs8qraxPNz3DeA1lk/c/y/g+mnXe4Tt+w3gW73P2qlp13nE\n7ft54Gtd274E/Mh62/NGMElq1LSHgCRJU2IASFKjDABJapQBIEmNMgAkqVEGgCQ1ygCQpEYZAJLU\nqP8Pbf/Agzp1BhwAAAAASUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x10d7d2ed0>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure at 0x1104e5c90>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from scipy import spatial\n",
"\n",
"vocabs = {}\n",
"\n",
"for word in cpu_model.vocab:\n",
" if word in cpu_model and word in gpu_model:\n",
" cosine_sim = 1 - spatial.distance.cosine(cpu_model[word], gpu_model[word])\n",
" vocabs[word] = cosine_sim\n",
"\n",
"import matplotlib.pyplot as plot\n",
"\n",
"plot.hist(vocabs.values(), bins=100)\n",
"plot.figure()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This distribution looks amazingly similar to the CPU model (found at `vector_rotation_default.ipynb` in this same directory.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Word Similarity\n",
"\n",
"Now let's make sure that the embeddings make sense"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'hi', 0.6631308794021606),\n",
" (u'hey', 0.572208046913147),\n",
" (u'dear', 0.5451931953430176),\n",
" (u'good_morning', 0.5373660326004028),\n",
" (u'happy_new_year', 0.4551734924316406),\n",
" (u'congratulations', 0.44265079498291016),\n",
" (u'hope', 0.4411052167415619),\n",
" (u'thanks', 0.43363627791404724),\n",
" (u'happy_birthday', 0.41684263944625854),\n",
" (u'doing_well', 0.4160914421081543)]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cpu_model.most_similar('hello')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'playoffs_antowain_smith', 1.0),\n",
" (u'second_guessing', 1.0),\n",
" (u'transmisison', 1.0),\n",
" (u'remains_unchanged', 1.0),\n",
" (u'seoul_korea', 1.0),\n",
" (u'clearly_defined', 1.0),\n",
" (u'correspond', 1.0),\n",
" (u'yds', 1.0),\n",
" (u'steve_leppard', 0.22951051592826843),\n",
" (u'tiverton', 0.22951051592826843)]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpu_model.most_similar('hello')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unfortunately, the embeddings don't seem to make too much sense.\n",
"\n",
"For an entire afternoon I thought I had insane GPU gains. Unfortunately, that's a little too good to be true."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment