
# emaadmanzoor/Word embeddings via PMI-matrix factorization.ipynb Created Jan 26, 2018

 { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Word Embeddings via PMI Matrix Factorization\n", "\n", "*Contact TA: emaad[at]cmu.edu, [eyeshalfclosed.com/teaching/](http://www.eyeshalfclosed.com/teaching/)*\n", "\n", " * Based on [Neural Word Embedding as Implicit Matrix Factorization](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization), by Omer Levy and Yoav Goldberg, NIPS 2014.\n", " * Dataset: https://www.kaggle.com/hacker-news/hacker-news-posts/downloads/HN_posts_year_to_Sep_26_2016.csv\n", " * Notes: http://www.eyeshalfclosed.com/teaching/95865-recitation-word2vec_as_PMI.pdf\n", " * Source material: https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/.\n", " * Source material: https://www.kaggle.com/alexklibisz/simple-word-vectors-with-co-occurrence-pmi-and-svd" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "from itertools import combinations\n", "from math import log\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from pprint import pformat\n", "from scipy.sparse import csc_matrix\n", "from scipy.sparse.linalg import svds, norm\n", "from string import punctuation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0. Load the data." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "
title
\n", " \n", " \n", " \n", " \n", " \n", "
0  You have two days to comment if you want stem ...
1  SQLAR the SQLite Archiver
2  What if we just printed a flatscreen televisio...
3  algorithmic music
4  How the Data Vault Enables the Next-Gen Data W...