{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"colab": {
"name": "updated-colab-2-svd-nmf-topic-modeling.ipynb",
"provenance": [],
"include_colab_link": true
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/butchland/ce5f172e87db6af0cb02ee38a02ae248/updated-colab-2-svd-nmf-topic-modeling.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q_zKtCJN2FP7"
},
"source": [
"# Topic Modeling with NMF and SVD"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "k94t1Ipp2FQC"
},
"source": [
"## The problem"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1p98osE62FQD"
},
"source": [
"Topic modeling is a fun way to start our study of NLP. We will use two popular **matrix decomposition techniques**. \n",
"\n",
"We start with a **term-document matrix**:\n",
"\n",
"<img src=\"https://github.com/fastai/course-nlp/blob/master/images/document_term.png?raw=1\" alt=\"term-document matrix\" style=\"width: 80%\"/>\n",
"\n",
"source: [Introduction to Information Retrieval](http://player.slideplayer.com/15/4528582/#)\n",
"\n",
"We can decompose this into one tall thin matrix times one wide short matrix (possibly with a diagonal matrix in between).\n",
"\n",
"Notice that this representation does not take into account word order or sentence structure. It's an example of a **bag of words** approach."
]
},
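{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the bag-of-words idea, the toy corpus below (made up for illustration, not part of the newsgroups data) is turned into a count matrix with scikit-learn's `CountVectorizer`. Note that scikit-learn puts documents in rows and terms in columns, the transpose of the picture above. Because word order is discarded, the first two sentences should end up with identical rows."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# Hypothetical toy corpus: the first two documents use the same words in a different order\n",
"toy_docs = ['the dog chased the cat', 'the cat chased the dog', 'rockets orbit the moon']\n",
"\n",
"toy_vectorizer = CountVectorizer()\n",
"toy_counts = toy_vectorizer.fit_transform(toy_docs).toarray()  # shape: (documents, vocab)\n",
"\n",
"print(sorted(toy_vectorizer.vocabulary_))  # vocabulary_ maps each term to its column index\n",
"print(toy_counts)\n",
"# Word order is discarded, so compare the rows of the first two documents\n",
"print((toy_counts[0] == toy_counts[1]).all())"
],
"execution_count": null,
"outputs": []
},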
{
"cell_type": "markdown",
"metadata": {
"id": "xUimjoMn2FQE"
},
"source": [
"Latent Semantic Analysis (LSA) uses Singular Value Decomposition (SVD)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "goYbJLJ32FQE"
},
"source": [
"### Motivation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wDJxqovO2FQF"
},
"source": [
"Consider the most extreme case - reconstructing the matrix using an outer product of two vectors. Clearly, in most cases we won't be able to reconstruct the matrix exactly. But if we had one vector with the relative frequency of each vocabulary word out of the total word count, and one with the average number of words per document, then that outer product would be as close as we can get.\n",
"\n",
"Now consider increasing those matrices to two columns and two rows. The optimal decomposition would now be to cluster the documents into two groups, each with a word distribution as different as possible from the other group's, but as similar as possible among the documents within the group. We will call those two groups \"topics\". And we would cluster the words into two groups, based on which words appear most frequently in each of the topics."
]
},
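{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the rank-1 case concrete, here is a small sketch on a made-up count matrix (the numbers are illustrative only). It reads the two vectors described above as the per-document word totals and the overall relative word frequencies, which is one reasonable interpretation rather than the only one, then forms their outer product and measures how far it is from the original matrix."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical document-term counts: 3 documents (rows) x 4 vocabulary words (columns)\n",
"A = np.array([[2., 1., 0., 1.],\n",
"              [4., 2., 1., 1.],\n",
"              [1., 1., 0., 0.]])\n",
"\n",
"doc_lengths = A.sum(axis=1)           # total number of words in each document\n",
"word_freqs = A.sum(axis=0) / A.sum()  # relative frequency of each word across the corpus\n",
"\n",
"# Outer product of the two vectors: a rank-1 matrix with the same shape as A\n",
"A_rank1 = np.outer(doc_lengths, word_freqs)\n",
"\n",
"print(np.linalg.matrix_rank(A_rank1))  # rank of the approximation\n",
"print(np.round(A_rank1, 2))\n",
"print(np.linalg.norm(A - A_rank1))     # Frobenius norm of the reconstruction error"
],
"execution_count": null,
"outputs": []
},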
{
"cell_type": "markdown",
"metadata": {
"id": "6-jhCTke2FQG"
},
"source": [
"## Getting started"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "90szR3s92FQH"
},
"source": [
"We'll take a dataset of documents in several different categories, and find topics (consisting of groups of words) for them. Knowing the actual categories helps us evaluate if the topics we find make sense.\n",
"\n",
"We will try this with two different matrix factorizations: **Singular Value Decomposition (SVD)** and **Non-negative Matrix Factorization (NMF)**."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Qtuygvlj2FQI"
},
"source": [
"import numpy as np\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"from sklearn import decomposition\n",
"from scipy import linalg\n",
"import matplotlib.pyplot as plt"
],
"execution_count": 1,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "TdLCJrWE2FQJ"
},
"source": [
"%matplotlib inline\n",
"np.set_printoptions(suppress=True)"
],
"execution_count": 2,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "ROqCia8H2FQK"
},
"source": [
"### Additional Resources"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aA38ERkh2FQK"
},
"source": [
"- [Data source](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html): Newsgroups are discussion groups on Usenet, which was popular in the 80s and 90s before the web really took off. This dataset includes 18,000 newsgroups posts with 20 topics.\n",
"- [Chris Manning's book chapter](https://nlp.stanford.edu/IR-book/pdf/18lsi.pdf) on matrix factorization and LSI \n",
"- Scikit learn [truncated SVD LSI details](http://scikit-learn.org/stable/modules/decomposition.html#lsa)\n",
"\n",
"### Other Tutorials\n",
"- [Scikit-Learn: Out-of-core classification of text documents](http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html): uses [Reuters-21578](https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection) dataset (Reuters articles labeled with ~100 categories), HashingVectorizer\n",
"- [Text Analysis with Topic Models for the Humanities and Social Sciences](https://de.dariah.eu/tatom/index.html): uses [British and French Literature dataset](https://de.dariah.eu/tatom/datasets.html) of Jane Austen, Charlotte Bronte, Victor Hugo, and more"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_b6qe2pk2FQL"
},
"source": [
"## Look at our data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uNHNPQXk2FQM"
},
"source": [
"Scikit Learn comes with a number of built-in datasets, as well as loading utilities to load several standard external datasets. This is a [great resource](http://scikit-learn.org/stable/datasets/), and the datasets include Boston housing prices, face images, patches of forest, diabetes, breast cancer, and more. We will be using the newsgroups dataset.\n",
"\n",
"Newsgroups are discussion groups on Usenet, which was popular in the 80s and 90s before the web really took off. This dataset includes 18,000 newsgroups posts with 20 topics. "
]
},
{
"cell_type": "code",
"metadata": {
"id": "wI3ysTYw2FQM",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "6a1245e3-b862-441e-c5bc-87a386827453"
},
"source": [
"categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",
"remove = ('headers', 'footers', 'quotes')\n",
"newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)\n",
"newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)"
],
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": [
"Downloading 20news dataset. This may take a few minutes.\n",
"Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "iH2DREat2FQN",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "9651c744-3eed-47f1-93ed-9e9c9de10f44"
},
"source": [
"newsgroups_train.filenames.shape, newsgroups_train.target.shape"
],
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"((2034,), (2034,))"
]
},
"metadata": {
"tags": []
},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o9xXEMal2FQP"
},
"source": [
"Let's look at some of the data. Can you guess which category these messages are in?"
]
},
{
"cell_type": "code",
"metadata": {
"id": "5_UJacvU2FQP",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "96a2e39c-9901-4e75-cba9-2f5b6074e862"
},
"source": [
"print(\"\\n\".join(newsgroups_train.data[:3]))"
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
"Hi,\n",
"\n",
"I've noticed that if you only save a model (with all your mapping planes\n",
"positioned carefully) to a .3DS file that when you reload it after restarting\n",
"3DS, they are given a default position and orientation. But if you save\n",
"to a .PRJ file their positions/orientation are preserved. Does anyone\n",
"know why this information is not stored in the .3DS file? Nothing is\n",
"explicitly said in the manual about saving texture rules in the .PRJ file. \n",
"I'd like to be able to read the texture rule information, does anyone have \n",
"the format for the .PRJ file?\n",
"\n",
"Is the .CEL file format available from somewhere?\n",
"\n",
"Rych\n",
"\n",
"\n",
"Seems to be, barring evidence to the contrary, that Koresh was simply\n",
"another deranged fanatic who thought it neccessary to take a whole bunch of\n",
"folks with him, children and all, to satisfy his delusional mania. Jim\n",
"Jones, circa 1993.\n",
"\n",
"\n",
"Nope - fruitcakes like Koresh have been demonstrating such evil corruption\n",
"for centuries.\n",
"\n",
" >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.com (Mark Brader) \n",
"\n",
"MB> So the\n",
"MB> 1970 figure seems unlikely to actually be anything but a perijove.\n",
"\n",
"JG>Sorry, _perijoves_...I'm not used to talking this language.\n",
"\n",
"Couldn't we just say periapsis or apoapsis?\n",
"\n",
" \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CObjPTul2FQQ"
},
"source": [
"Hint: *perijove* is the point in the orbit of a satellite of Jupiter nearest the planet's center."
]
},
{
"cell_type": "code",
"metadata": {
"id": "s9dJhofe2FQR",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "33f52f36-72cf-45c3-85b0-5f458762d7d1"
},
"source": [
"np.array(newsgroups_train.target_names)[newsgroups_train.target[:3]]"
],
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array(['comp.graphics', 'talk.religion.misc', 'sci.space'], dtype='<U18')"
]
},
"metadata": {
"tags": []
},
"execution_count": 6
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "P8lMcOuf2FQR"
},
"source": [
"The target attribute is the integer index of the category."
]
},
{
"cell_type": "code",
"metadata": {
"id": "5RrmovrX2FQS",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "579d81e0-ff3f-4a02-c67f-65f677e75481"
},
"source": [
"newsgroups_train.target[:10]"
],
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1])"
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "PgW6TTUz2FQS"
},
"source": [
"num_topics, num_top_words = 6, 8"
],
"execution_count": 8,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "jYWRaJ0V2FQT"
},
"source": [
"## Stop words, stemming, lemmatization"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "feKbc7MK2FQT"
},
"source": [
"### Stop words"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h7lVdqwN2FQT"
},
"source": [
"From [Intro to Information Retrieval](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html):\n",
"\n",
"*Some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.*\n",
"\n",
"*The general trend in IR systems over time has been from standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XTf4Vh4y2FQU"
},
"source": [
"#### Scikit-learn"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ZQvMe7xI2FQU",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "b8ce7f75-e60b-42ad-a273-5a58c4367bde"
},
"source": [
"from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS\n",
"\n",
"sorted(ENGLISH_STOP_WORDS)[:20]"
],
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['a',\n",
" 'about',\n",
" 'above',\n",
" 'across',\n",
" 'after',\n",
" 'afterwards',\n",
" 'again',\n",
" 'against',\n",
" 'all',\n",
" 'almost',\n",
" 'alone',\n",
" 'along',\n",
" 'already',\n",
" 'also',\n",
" 'although',\n",
" 'always',\n",
" 'am',\n",
" 'among',\n",
" 'amongst',\n",
" 'amoungst']"
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EwqBotkl2FQV"
},
"source": [
"There is no single universal list of stop words."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gXezyBGR2FQV"
},
"source": [
"### Stemming and Lemmatization"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "woiHCS3Q2FQW"
},
"source": [
"From the [Information Retrieval](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) textbook:\n",
"\n",
"Are the below words the same?\n",
"\n",
"*organize, organizes, and organizing*\n",
"\n",
"*democracy, democratic, and democratization*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mp8Qs4e52FQW"
},
"source": [
"Stemming and lemmatization both reduce a word to a root form.\n",
"\n",
"Lemmatization uses the rules of a language. The resulting tokens are all actual words.\n",
"\n",
"\"Stemming is the poor-man’s lemmatization.\" (Noah Smith, 2011) Stemming is a crude heuristic that chops the ends off of words. The resulting tokens may not be actual words. Stemming is faster."
]
},
{
"cell_type": "code",
"metadata": {
"id": "yIasEm7Q2FQW",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "c0ef7fcb-eaad-4177-9a48-095d8eca5509"
},
"source": [
"import nltk\n",
"nltk.download('wordnet')"
],
"execution_count": 10,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package wordnet to /root/nltk_data...\n",
"[nltk_data] Unzipping corpora/wordnet.zip.\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "7AQ3ur5p2FQX"
},
"source": [
"from nltk import stem"
],
"execution_count": 11,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "xXAQyhT82FQX"
},
"source": [
"wnl = stem.WordNetLemmatizer()\n",
"porter = stem.porter.PorterStemmer()"
],
"execution_count": 12,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "UGt9MN0o2FQY"
},
"source": [
"word_list = ['feet', 'foot', 'foots', 'footing']"
],
"execution_count": 13,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "QHB5fcUO2FQY",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "b884f793-57cc-456b-d141-6029d10ca54b"
},
"source": [
"[wnl.lemmatize(word) for word in word_list]"
],
"execution_count": 14,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['foot', 'foot', 'foot', 'footing']"
]
},
"metadata": {
"tags": []
},
"execution_count": 14
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "KNN2DgSa2FQY",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "dcba9e97-7c6f-47b6-a99f-c0f40aedfd36"
},
"source": [
"[porter.stem(word) for word in word_list]"
],
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['feet', 'foot', 'foot', 'foot']"
]
},
"metadata": {
"tags": []
},
"execution_count": 15
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KTqVGt9a2FQZ"
},
"source": [
"Your turn! Now, try lemmatizing and stemming the following collections of words:\n",
"\n",
"- fly, flies, flying\n",
"- organize, organizes, organizing\n",
"- universe, university"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Df9RCCht2FQZ"
},
"source": [
"Stemming and lemmatization are language dependent. Languages with more complex morphologies may show bigger benefits. For example, Sanskrit has a very [large number of verb forms](https://en.wikipedia.org/wiki/Sanskrit_verbs). "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5aCHJbqx2FQa"
},
"source": [
"### Spacy"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tUPsm8aZ2FQa"
},
"source": [
"Stemming and lemmatization are implementation dependent."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Pl55OeX42FQa"
},
"source": [
"Spacy is a modern, fast NLP library. Spacy is opinionated, in that it typically offers one highly optimized way to do something (whereas nltk offers a huge variety of ways, although they are usually not as optimized).\n",
"\n",
"You will need to install it.\n",
"\n",
"If you use conda:\n",
"```\n",
"conda install -c conda-forge spacy\n",
"```\n",
"If you use pip:\n",
"```\n",
"pip install -U spacy\n",
"```\n",
"\n",
"You will then need to download the English model:\n",
"```\n",
"python -m spacy download en_core_web_sm\n",
"```"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "iKVPFrIJ3xQC",
"outputId": "00256d2b-094d-404d-ed40-b58cb6f419f0"
},
"source": [
"!spacy download en_core_web_sm"
],
"execution_count": 19,
"outputs": [
{
"output_type": "stream",
"text": [
"Requirement already satisfied: en_core_web_sm==2.2.5 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz#egg=en_core_web_sm==2.2.5 in /usr/local/lib/python3.7/dist-packages (2.2.5)\n",
"Requirement already satisfied: spacy>=2.2.2 in /usr/local/lib/python3.7/dist-packages (from en_core_web_sm==2.2.5) (2.2.4)\n",
"Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.19.5)\n",
"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.5)\n",
"Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.23.0)\n",
"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.5)\n",
"Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.1.3)\n",
"Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.4.1)\n",
"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (2.0.5)\n",
"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (4.41.1)\n",
"Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.0)\n",
"Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (57.0.0)\n",
"Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (1.0.5)\n",
"Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (0.8.2)\n",
"Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy>=2.2.2->en_core_web_sm==2.2.5) (7.4.0)\n",
"Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (1.24.3)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2021.5.30)\n",
"Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (2.10)\n",
"Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.0.4)\n",
"Requirement already satisfied: importlib-metadata>=0.20; python_version < \"3.8\" in /usr/local/lib/python3.7/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (4.5.0)\n",
"Requirement already satisfied: typing-extensions>=3.6.4; python_version < \"3.8\" in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20; python_version < \"3.8\"->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.7.4.3)\n",
"Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20; python_version < \"3.8\"->catalogue<1.1.0,>=0.0.7->spacy>=2.2.2->en_core_web_sm==2.2.5) (3.4.1)\n",
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
"You can now load the model via spacy.load('en_core_web_sm')\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "viEGvOYG2FQb"
},
"source": [
"import spacy"
],
"execution_count": 16,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "YEX7T_BO4Ss8"
},
"source": [
"model = spacy.load('en_core_web_sm')"
],
"execution_count": 25,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
},
"id": "TMcadpXi4cvu",
"outputId": "86b0d1a0-7205-47d9-9fe7-b2acc0bf29f2"
},
"source": [
"spacy.__version__"
],
"execution_count": 21,
"outputs": [
{
"output_type": "execute_result",
"data": {
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
},
"text/plain": [
"'2.2.4'"
]
},
"metadata": {
"tags": []
},
"execution_count": 21
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "8_zflbKN7wsM"
},
"source": [
"import spacy.lookups"
],
"execution_count": 27,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "0zX4RSuE71St"
},
"source": [
"lookups = spacy.lookups.Lookups()"
],
"execution_count": 28,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "vE8X8Ym26GTt"
},
"source": [
"??Lemmatizer"
],
"execution_count": 23,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "8zz30Zyn2FQb"
},
"source": [
"from spacy.lemmatizer import Lemmatizer\n",
"lemmatizer = Lemmatizer(lookups)"
],
"execution_count": 29,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "uIbcDj_H2FQc",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "b56d7e3f-a432-4edf-9264-ab46c15bbdea"
},
"source": [
"[lemmatizer.lookup(word) for word in word_list]"
],
"execution_count": 30,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['feet', 'foot', 'foots', 'footing']"
]
},
"metadata": {
"tags": []
},
"execution_count": 30
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TSnfWIks2FQd"
},
"source": [
"Spacy doesn't offer a stemmer, since lemmatization is considered better. This is an example of being opinionated!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-R6p9O4j2FQd"
},
"source": [
"Stop words vary from library to library."
]
},
{
"cell_type": "code",
"metadata": {
"id": "3zwsOIfA2FQe"
},
"source": [
"nlp = spacy.load(\"en_core_web_sm\")"
],
"execution_count": 31,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "4nLUkh6u2FQe",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "70bd464b-dd37-4394-f9d0-a8265aaac9ca"
},
"source": [
"sorted(list(nlp.Defaults.stop_words))[:20]"
],
"execution_count": 32,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[\"'d\",\n",
" \"'ll\",\n",
" \"'m\",\n",
" \"'re\",\n",
" \"'s\",\n",
" \"'ve\",\n",
" 'a',\n",
" 'about',\n",
" 'above',\n",
" 'across',\n",
" 'after',\n",
" 'afterwards',\n",
" 'again',\n",
" 'against',\n",
" 'all',\n",
" 'almost',\n",
" 'alone',\n",
" 'along',\n",
" 'already',\n",
" 'also']"
]
},
"metadata": {
"tags": []
},
"execution_count": 32
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rXTXWkna2FQf"
},
"source": [
"#### Exercise: What stop words appear in spacy but not in sklearn?"
]
},
{
"cell_type": "code",
"metadata": {
"id": "VcZhi_ek2FQf"
},
"source": [
"#Exercise:\n"
],
"execution_count": 33,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"id": "Dp-aHmkt2FQg"
},
"source": [
"#### Exercise: And what stop words are in sklearn but not spacy?"
]
},
{
"cell_type": "code",
"metadata": {
"hidden": true,
"id": "mdDI8W6N2FQg"
},
"source": [
"#Exercise:\n"
],
"execution_count": 34,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"id": "GV4B4Bvd2FQh"
},
"source": [
"### When to use these?"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "TFc25oFq2FQh"
},
"source": [
"<img src=\"https://github.com/fastai/course-nlp/blob/master/images/skomoroch.png?raw=1\" alt=\"\" style=\"width: 65%\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "6uf855HR2FQh"
},
"source": [
"These were long considered standard techniques, but they can often **hurt** your performance **if using deep learning**. Stemming, lemmatization, and removing stop words all involve throwing away information.\n",
"\n",
"However, they can still be useful when working with simpler models."
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"id": "_nsxs-pR2FQi"
},
"source": [
"### Another approach: sub-word units"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "ryZh-Epw2FQi"
},
"source": [
"[SentencePiece](https://github.com/google/sentencepiece) library from Google"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AvPqZlGb2FQi"
},
"source": [
"## Data Processing"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "03sM6Ytj2FQi"
},
"source": [
"Next, scikit learn has a method that will extract all the word counts for us. In the next lesson, we'll learn how to write our own version of CountVectorizer, to see what's happening underneath the hood."
]
},
{
"cell_type": "code",
"metadata": {
"id": "GMIcTatu2FQj"
},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer"
],
"execution_count": 35,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "MESTcQ_l2FQj"
},
"source": [
"import nltk\n",
"# nltk.download('punkt')"
],
"execution_count": 36,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "E6aero4J2FQj"
},
"source": [
"# from nltk import word_tokenize\n",
"\n",
"# class LemmaTokenizer(object):\n",
"#     def __init__(self):\n",
"#         self.wnl = stem.WordNetLemmatizer()\n",
"#     def __call__(self, doc):\n",
"#         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]"
],
"execution_count": 37,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "hHRlKOGP2FQk"
},
"source": [
"vectorizer = CountVectorizer(stop_words='english') #, tokenizer=LemmaTokenizer())"
],
"execution_count": 38,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "0YxJoYnl2FQk",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "61f58a70-c1b5-429c-c547-6d505222aff5"
},
"source": [
"vectors = vectorizer.fit_transform(newsgroups_train.data).todense() # (documents, vocab)\n",
"vectors.shape #, vectors.nnz / vectors.shape[0], row_means.shape"
],
"execution_count": 39,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(2034, 26576)"
]
},
"metadata": {
"tags": []
},
"execution_count": 39
}
]
},
{
"cell_type": "code",
"metadata": {
"scrolled": true,
"id": "_ZOcep3u2FQk",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "3913fc07-ec5c-4b96-b009-57f0e1faa544"
},
"source": [
"print(len(newsgroups_train.data), vectors.shape)"
],
"execution_count": 40,
"outputs": [
{
"output_type": "stream",
"text": [
"2034 (2034, 26576)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "ObDdZCg02FQl"
},
"source": [
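"# get_feature_names() was removed in scikit-learn 1.2; on scikit-learn >= 1.0 use get_feature_names_out() instead\n",
"vocab = np.array(vectorizer.get_feature_names())"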
],
"execution_count": 41,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "cQXI4Pbo2FQl",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "792473e3-e57e-4963-87a0-01b5f0be28e9"
},
"source": [
"vocab.shape"
],
"execution_count": 42,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(26576,)"
]
},
"metadata": {
"tags": []
},
"execution_count": 42
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "MmbY8IxX2FQm",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "54de8b80-8ade-47b1-9777-05d563f427c9"
},
"source": [
"vocab[7000:7020]"
],
"execution_count": 43,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array(['cosmonauts', 'cosmos', 'cosponsored', 'cost', 'costa', 'costar',\n",
" 'costing', 'costly', 'costruction', 'costs', 'cosy', 'cote',\n",
" 'couched', 'couldn', 'council', 'councils', 'counsel',\n",
" 'counselees', 'counselor', 'count'], dtype='<U80')"
]
},
"metadata": {
"tags": []
},
"execution_count": 43
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JwoRlyex2FQm"
},
"source": [
"## Singular Value Decomposition (SVD)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AUvXRfwB2FQm"
},
"source": [
"\"SVD is not nearly as famous as it should be.\" - Gilbert Strang"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zafiugbL2FQn"
},
"source": [
"We would clearly expect that the words that appear most frequently in one topic would appear less frequently in the other - otherwise that word wouldn't make a good choice to separate out the two topics. Therefore, we expect the topics to be **orthogonal**.\n",
"\n",
"The SVD algorithm factorizes a matrix into one matrix with **orthogonal columns** and one with **orthogonal rows** (along with a diagonal matrix, which contains the **relative importance** of each factor).\n",
"\n",
"<img src=\"https://github.com/fastai/course-nlp/blob/master/images/svd_fb.png?raw=1\" alt=\"\" style=\"width: 80%\"/>\n",
"(source: [Facebook Research: Fast Randomized SVD](https://research.fb.com/fast-randomized-svd/))\n",
"\n",
"SVD is an **exact decomposition**, since the matrices it creates are big enough to fully cover the original matrix. SVD is extremely widely used in linear algebra, and specifically in data science, including:\n",
"\n",
"- semantic analysis\n",
"- collaborative filtering/recommendations ([winning entry for Netflix Prize](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf))\n",
"- calculating the Moore-Penrose pseudoinverse\n",
"- data compression\n",
"- principal component analysis"
]
},
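{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of what \"relative importance\" means in practice, on a made-up random count matrix (an assumption for illustration, not the newsgroups data): keeping only the `k` largest singular values gives a rank-`k` approximation, and the reconstruction error should shrink as `k` grows."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"from scipy import linalg\n",
"\n",
"rng = np.random.default_rng(0)\n",
"# Hypothetical small count matrix: 6 documents x 10 words\n",
"A = rng.poisson(lam=1.0, size=(6, 10)).astype(float)\n",
"\n",
"U, s, Vh = linalg.svd(A, full_matrices=False)\n",
"print(U.shape, s.shape, Vh.shape)\n",
"\n",
"# Keep only the k largest singular values and compare the rank-k approximation to A\n",
"for k in (1, 2, 4):\n",
"    A_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k]\n",
"    print(k, np.linalg.norm(A - A_k))  # Frobenius norm of the error for this k"
],
"execution_count": null,
"outputs": []
},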
{
"cell_type": "markdown",
"metadata": {
"id": "PYuetwcK2FQn"
},
"source": [
"Latent Semantic Analysis (LSA) uses SVD. You will sometimes hear topic modelling referred to as LSA."
]
},
{
"cell_type": "code",
"metadata": {
"id": "MyD-LcTM2FQn",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "4a7a5905-2491-45c2-f9a7-35ab05db40d0"
},
"source": [
"%time U, s, Vh = linalg.svd(vectors, full_matrices=False)"
],
"execution_count": 44,
"outputs": [
{
"output_type": "stream",
"text": [
"CPU times: user 1min 28s, sys: 4.99 s, total: 1min 33s\n",
"Wall time: 48.4 s\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "tIIwX8Ns2FQo",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "10d6c6ca-2cab-4411-a4e8-e57adc56c690"
},
"source": [
"print(U.shape, s.shape, Vh.shape)"
],
"execution_count": 45,
"outputs": [
{
"output_type": "stream",
"text": [
"(2034, 2034) (2034,) (2034, 26576)\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aTa_eD7c2FQo"
},
"source": [
"Confirm this is a decomposition of the input."
]
},
{
"cell_type": "code",
"metadata": {
"id": "mu6giVsK2FQp",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "f2a15677-f8d5-44a9-baea-bf14c2a81fcf"
},
"source": [
"s[:4]"
],
"execution_count": 46,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([433.92698542, 291.51012741, 240.71137677, 220.00048043])"
]
},
"metadata": {
"tags": []
},
"execution_count": 46
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "GZAkAelt2FQp",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "a408454e-f97f-413e-8737-822cac8fc85f"
},
"source": [
"np.diag(np.diag(s[:4]))"
],
"execution_count": 47,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([433.92698542, 291.51012741, 240.71137677, 220.00048043])"
]
},
"metadata": {
"tags": []
},
"execution_count": 47
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RXm9DOys2FQp"
},
"source": [
"#### Answer"
]
},
{
"cell_type": "code",
"metadata": {
"id": "3OGxCK9Z2FQq"
},
"source": [
"#Exercise: confirm that U, s, Vh is a decomposition of `vectors`\n"
],
"execution_count": 48,
"outputs": []
},
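{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to check this, sketched on a small random matrix so that the cell runs on its own. The name `A` is a hypothetical stand-in, not something from this notebook; with the real data you would use a dense version of `vectors` in its place. The idea is to rebuild the matrix from its three factors and compare it with the original."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"from scipy import linalg\n",
"\n",
"# Hypothetical stand-in for the notebook's dense `vectors` matrix\n",
"rng = np.random.default_rng(0)\n",
"A = rng.random((6, 10))\n",
"\n",
"U, s, Vh = linalg.svd(A, full_matrices=False)\n",
"\n",
"# s is a 1-D array of singular values; np.diag(s) turns it into a diagonal matrix\n",
"reconstructed = U @ np.diag(s) @ Vh\n",
"\n",
"print(np.linalg.norm(reconstructed - A))  # should be close to zero\n",
"print(np.allclose(reconstructed, A))"
],
"execution_count": null,
"outputs": []
},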
{
"cell_type": "markdown",
"metadata": {
"id": "aYRiyanI2FQq"
},
"source": [
"Confirm that the columns of U and the rows of Vh are orthonormal."
]
},
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"id": "ZmGNZRan2FQq"
},
"source": [
"#### Answer"
]
},
{
"cell_type": "code",
"metadata": {
"hidden": true,
"id": "bDyxqZjG2FQr"
},
"source": [
"#Exercise: Confirm that U, Vh are orthonormal\n"
],
"execution_count": 49,
"outputs": []
},
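{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one way to check orthonormality, again on a hypothetical small matrix `A` rather than the notebook's `vectors`. A set of vectors is orthonormal when the matrix of their pairwise dot products is the identity. Note that with `full_matrices=False` and a wide input, it is the *rows* of `Vh` that are orthonormal, so the product has to be taken in the right order."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"from scipy import linalg\n",
"\n",
"rng = np.random.default_rng(0)\n",
"A = rng.random((6, 10))  # hypothetical stand-in for `vectors`\n",
"U, s, Vh = linalg.svd(A, full_matrices=False)\n",
"\n",
"# Columns of U are orthonormal: U^T U is the identity\n",
"print(np.allclose(U.T @ U, np.eye(U.shape[1])))\n",
"# Rows of Vh are orthonormal: Vh Vh^T is the identity\n",
"print(np.allclose(Vh @ Vh.T, np.eye(Vh.shape[0])))\n",
"# Vh^T Vh is not the identity here, because Vh has more columns than rows\n",
"print(np.allclose(Vh.T @ Vh, np.eye(Vh.shape[1])))"
],
"execution_count": null,
"outputs": []
},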
{
"cell_type": "markdown",
"metadata": {
"heading_collapsed": true,
"id": "jNtz0vYp2FQr"
},
"source": [
"#### Topics"
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "FWNJSxKk2FQs"
},
"source": [
"What can we say about the singular values s?"
]
},
{
"cell_type": "code",
"metadata": {
"hidden": true,
"id": "zrszVrLs2FQs",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 265
},
"outputId": "a5b6919d-a0dc-4a33-8581-fab79bf02035"
},
"source": [
"plt.plot(s);"
],
"execution_count": 50,
"outputs": [
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "code",
"metadata": {
"hidden": true,
"id": "0gKwq5ix2FQt",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
},
"outputId": "eca14915-0936-445a-e946-c9d2fa3bf8fb"
},
"source": [
"plt.plot(s[:10])"
],
"execution_count": 51,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[<matplotlib.lines.Line2D at 0x7f832c5c75d0>]"
]
},
"metadata": {
"tags": []
},
"execution_count": 51
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "code",
"metadata": {
"hidden": true,
"id": "dpB5QWVB2FQt"
},
"source": [
"num_top_words=8\n",
"\n",
"def show_topics(a):\n",
" top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]\n",
" topic_words = ([top_words(t) for t in a])\n",
" return [' '.join(t) for t in topic_words]"
],
"execution_count": 52,
"outputs": []
},
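{
"cell_type": "markdown",
"metadata": {},
"source": [
"The slice `[:-num_top_words-1:-1]` in `show_topics` can be hard to read at first. Here is a small illustration of what it does, using a made-up vocabulary and made-up weights (`vocab_demo`, `weights`, and `k` are hypothetical names used only for this example)."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"\n",
"vocab_demo = np.array(['apple', 'banana', 'cherry', 'date', 'elderberry'])  # hypothetical vocabulary\n",
"weights = np.array([0.1, 0.7, 0.3, 0.9, 0.2])  # hypothetical topic weights, one per word\n",
"k = 2\n",
"\n",
"# argsort gives indices ordered from smallest to largest weight;\n",
"# the slice [:-k-1:-1] walks backwards from the end and keeps the k largest\n",
"top_idx = np.argsort(weights)[:-k-1:-1]\n",
"print(top_idx)                           # indices of 'date' and 'banana'\n",
"print([vocab_demo[i] for i in top_idx])"
],
"execution_count": null,
"outputs": []
},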
{
"cell_type": "code",
"metadata": {
"hidden": true,
"id": "LTPGZP752FQt",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "0220bcab-1b98-447a-92f0-448f9dd933f2"
},
"source": [
"show_topics(Vh[:10])"
],
"execution_count": 53,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['ditto critus propagandist surname galacticentric kindergarten surreal imaginative',\n",
" 'jpeg gif file color quality image jfif format',\n",
" 'graphics edu pub mail 128 3d ray ftp',\n",
" 'jesus god matthew people atheists atheism does graphics',\n",
" 'image data processing analysis software available tools display',\n",
" 'god atheists atheism religious believe religion argument true',\n",
" 'space nasa lunar mars probe moon missions probes',\n",
" 'image probe surface lunar mars probes moon orbit',\n",
" 'argument fallacy conclusion example true ad argumentum premises',\n",
" 'space larson image theory universe physical nasa material']"
]
},
"metadata": {
"tags": []
},
"execution_count": 53
}
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "70Lw1Swz2FQu"
},
"source": [
"We get topics that match the kinds of clusters we would expect! This is despite the fact that this is an **unsupervised algorithm** - which is to say, we never actually told the algorithm how our documents are grouped."
]
},
{
"cell_type": "markdown",
"metadata": {
"hidden": true,
"id": "-p1V9Doc2FQu"
},
"source": [
"We will return to SVD in **much more detail** later. For now, the important takeaway is that we have a tool that allows us to exactly factor a matrix into orthogonal columns and orthogonal rows."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Gb6GnmbI2FQv"
},
"source": [
"## Non-negative Matrix Factorization (NMF)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uquwpmgp2FQv"
},
"source": [
"#### Motivation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EkGma0ZB2FQv"
},
"source": [
"<img src=\"https://github.com/fastai/course-nlp/blob/master/images/face_pca.png?raw=1\" alt=\"PCA on faces\" style=\"width: 80%\"/>\n",
"\n",
"(source: [NMF Tutorial](http://perso.telecom-paristech.fr/~essid/teach/NMF_tutorial_ICME-2014.pdf))\n",
"\n",
"A more interpretable approach:\n",
"\n",
"<img src=\"https://github.com/fastai/course-nlp/blob/master/images/face_outputs.png?raw=1\" alt=\"NMF on Faces\" style=\"width: 80%\"/>\n",
"\n",
"(source: [NMF Tutorial](http://perso.telecom-paristech.fr/~essid/teach/NMF_tutorial_ICME-2014.pdf))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vP_bLpGp2FQw"
},
"source": [
"#### Idea"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pL9KzFwg2FQw"
},
"source": [
"Rather than constraining our factors to be *orthogonal*, another idea would be to constrain them to be *non-negative*. NMF is a factorization of a non-negative data set $V$: $$ V = W H$$ into non-negative matrices $W,\\; H$. Often positive factors will be **more easily interpretable** (and this is the reason behind NMF's popularity).\n",
"\n",
"<img src=\"https://github.com/fastai/course-nlp/blob/master/images/face_nmf.png?raw=1\" alt=\"NMF on faces\" style=\"width: 80%\"/>\n",
"\n",
"(source: [NMF Tutorial](http://perso.telecom-paristech.fr/~essid/teach/NMF_tutorial_ICME-2014.pdf))\n",
"\n",
"Non-negative matrix factorization (NMF) is a non-exact factorization (so in practice $V \\approx WH$) into one tall, skinny non-negative matrix and one short, wide non-negative matrix. Finding the exact optimum is NP-hard in general, and the factorization is not unique. There are a number of variations on it, created by adding different constraints."
]
},
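{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see why NMF is non-unique, here is a small sketch with hand-picked matrices (all values are hypothetical). Rescaling the columns of $W$ by a positive diagonal matrix $D$ and the rows of $H$ by $D^{-1}$ leaves the product unchanged and keeps everything non-negative, so the same $V$ has many valid factorizations."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical small non-negative factors\n",
"W = np.array([[1.0, 0.0],\n",
"              [0.5, 2.0],\n",
"              [0.0, 1.0]])\n",
"H = np.array([[2.0, 0.0, 1.0],\n",
"              [0.0, 3.0, 1.0]])\n",
"V = W @ H\n",
"\n",
"# Rescale with a positive diagonal matrix D: (W D)(D^-1 H) = W H\n",
"D = np.diag([2.0, 0.5])\n",
"W2 = W @ D\n",
"H2 = np.linalg.inv(D) @ H\n",
"\n",
"print(np.allclose(W2 @ H2, V))           # same product\n",
"print((W2 >= 0).all(), (H2 >= 0).all())  # still non-negative\n",
"print(np.allclose(W, W2))                # but the factors differ"
],
"execution_count": null,
"outputs": []
},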
{
"cell_type": "markdown",
"metadata": {
"id": "T0908mtB2FQx"
},
"source": [
"#### Applications of NMF"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H_KnwZ2F2FQx"
},
"source": [
"- [Face Decompositions](http://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html#sphx-glr-auto-examples-decomposition-plot-faces-decomposition-py)\n",
"- [Collaborative Filtering, eg movie recommendations](http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/)\n",
"- [Audio source separation](https://pdfs.semanticscholar.org/cc88/0b24791349df39c5d9b8c352911a0417df34.pdf)\n",
"- [Chemistry](http://ieeexplore.ieee.org/document/1532909/)\n",
"- [Bioinformatics](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0485-4) and [Gene Expression](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2623306/)\n",
"- Topic Modeling (our problem!)\n",
"\n",
"<img src=\"https://github.com/fastai/course-nlp/blob/master/images/nmf_doc.png?raw=1\" alt=\"NMF on documents\" style=\"width: 80%\"/>\n",
"\n",
"(source: [NMF Tutorial](http://perso.telecom-paristech.fr/~essid/teach/NMF_tutorial_ICME-2014.pdf))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ih_516E62FQy"
},
"source": [
"**More Reading**:\n",
"\n",
"- [The Why and How of Nonnegative Matrix Factorization](https://arxiv.org/pdf/1401.5226.pdf)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CTlmStNm2FQy"
},
"source": [
"### NMF from sklearn"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1dDhiqBB2FQy"
},
"source": [
"We will use [scikit-learn's implementation of NMF](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html):"
]
},
{
"cell_type": "code",
"metadata": {
"id": "vvl80Ywr2FQz"
},
"source": [
"m,n=vectors.shape\n",
"d=5 # num topics"
],
"execution_count": 54,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "rhKhHlhu2FQz"
},
"source": [
"clf = decomposition.NMF(n_components=d, random_state=1)\n",
"\n",
"W1 = clf.fit_transform(vectors)\n",
"H1 = clf.components_"
],
"execution_count": 55,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "QBnw6NWg2FQ0",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2537448f-418f-4f7b-a92d-c4362028b8b9"
},
"source": [
"show_topics(H1)"
],
"execution_count": 56,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['jpeg image gif file color images format quality',\n",
" 'edu graphics pub mail 128 ray ftp send',\n",
" 'space launch satellite nasa commercial satellites year market',\n",
" 'jesus god people matthew atheists does atheism said',\n",
" 'image data available software processing ftp edu analysis']"
]
},
"metadata": {
"tags": []
},
"execution_count": 56
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LdcejIhE2FQ0"
},
"source": [
"### TF-IDF"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4RnZHXKv2FQ0"
},
"source": [
"[Term Frequency-Inverse Document Frequency](http://www.tfidf.com/) (TF-IDF) is a way to normalize term counts by taking into account how often they appear in a document, how long the document is, and how common/rare the term is.\n",
"\n",
"TF = (# occurrences of term t in document) / (# of words in document)\n",
"\n",
"IDF = log(# of documents / # of documents with term t in them)"
]
},
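{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two formulas above can be worked through by hand on a toy corpus. This is a plain-Python sketch of those textbook formulas only; the corpus and the helper names `tf` and `idf` are made up for illustration. As far as I know, scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will not match these exactly."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import math\n",
"\n",
"docs = [\n",
"    'the cat sat on the mat',\n",
"    'the dog sat',\n",
"    'the cat chased the dog',\n",
"]  # hypothetical toy corpus\n",
"tokenized = [d.split() for d in docs]\n",
"\n",
"def tf(term, doc_tokens):\n",
"    # occurrences of the term divided by the number of words in the document\n",
"    return doc_tokens.count(term) / len(doc_tokens)\n",
"\n",
"def idf(term, all_docs):\n",
"    # log of (number of documents / number of documents containing the term);\n",
"    # assumes the term appears in at least one document\n",
"    n_containing = sum(term in doc for doc in all_docs)\n",
"    return math.log(len(all_docs) / n_containing)\n",
"\n",
"# Score three terms for the first document\n",
"for term in ['the', 'cat', 'mat']:\n",
"    score = tf(term, tokenized[0]) * idf(term, tokenized)\n",
"    print(term, round(score, 4))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this toy corpus, `the` appears in every document, so its IDF is log(1) = 0 and its score vanishes, while the rarer `mat` scores higher than `cat`."
]
},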
{
"cell_type": "code",
"metadata": {
"id": "NKVitVyo2FQ1"
},
"source": [
"vectorizer_tfidf = TfidfVectorizer(stop_words='english')\n",
"vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data) # (documents, vocab)"
],
"execution_count": 57,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "P4MGqRZw2FQ1",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "dbeb1ab6-ad6c-4b13-d825-d521e5a13f0c"
},
"source": [
"newsgroups_train.data[10:20]"
],
"execution_count": 58,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[\"a\\n\\nWhat about positional uncertainties in S-L 1993e? I assume we know where\\nand what Galileo is doing within a few meters. But without the\\nHGA, don't we have to have some pretty good ideas, of where to look\\nbefore imaging? If the HGA was working, they could slew around\\nin near real time (Less speed of light delay). But when they were\\nimaging toutatis???? didn't someone have to get lucky on a guess to\\nfind the first images? \\n\\nAlso, I imagine S-L 1993e will be mostly a visual image. so how will\\nthat affect the other imaging missions. with the LGA, there is a real\\ntight allocation of bandwidth. It may be premature to hope for answers,\\nbut I thought i'd throw it on the floor.\",\n",
" \"I would like to program Tseng ET4000 to nonstandard 1024x768 mode by\\nswitching to standard 1024x768 mode using BIOS and than changing some\\ntiming details (0x3D4 registers 0x00-0x1F) but I don't know how to\\nselect 36 MHz pixel clock I need. The BIOS function selects 40 MHz.\\n\\nIs there anybody who knows where to obtain technical info about this.\\nI am also interested in any other technical information about Tseng ET4000\\nand Trident 8900 and 9000 chipsets.\\n\\n\\t\\t\\tthanks very much\",\n",
" 'In-Reply-To: <20APR199312262902@rigel.tamu.edu> lmp8913@rigel.tamu.edu (PRESTON, LISA M)',\n",
" \"\\n\\n\\n\\nI'm not sure, but it almost sounds like they can't figure out where the \\n_nucleus_ is within the coma. If they're off by a couple hundred\\nmiles, well, you can imagine the rest...\\n\",\n",
" \"Hello,\\n I am looking to add voice input capability to a user interface I am\\ndeveloping on an HP730 (UNIX) workstation. I would greatly appreciate \\ninformation anyone would care to offer about voice input systems that are \\neasily accessible from the UNIX environment. \\n\\n The names or adresses of applicable vendors, as well as any \\nexperiences you have had with specific systems, would be very helpful.\\n\\n Please respond via email; I will post a summary if there is \\nsufficient interest.\\n\\n\\nThanks,\\nKen\\n\\n\\nP.S. I have found several impressive systems for IBM PC's, but I would \\nlike to avoid the hassle of purchasing and maintaining a separate PC if \\nat all possible.\\n\\n-------------------------------------------------------------------------------\\nKen Hinckley (kph2q@virginia.edu)\\nUniversity of Virginia \\nNeurosurgical Visualization Laboratory\",\n",
" '\\nIt was a test of the first reusable tool.\\n\\n\\nPointy so they can find them or so they will stick into their pants better, and\\nbe closer to their brains?',\n",
" '\\nSize of armies, duration, numbers of casualties both absolute and as a\\npercentage of those involved, geographical area and numbers of countries\\ntoo, are all measures of size. In this case I\\'d say the relevant\\nstatistic would be the number of combatants (total troops) compared to\\ntotal casualties from among the total civilian population in the\\naffected geographical area.\\n\\n\\nVietnam and Korea might make good comparisons.\\n\\n\\nWestern news in general, but in particular the American \"mass media\":\\nCBS, NBC, ABC, etc. The general tone of the news during the whole\\nwar was one of \"those poor, poor Iraqis\" along with \"look how precisely\\nthis cruise missile blew this building to bits\".\\n\\n\\nI agree.\\n\\n\\nPerhaps so. And maybe the atomic bomb was a mistake too. But that\\'s easy\\nto say from our \"enlightened\" viewpoint here in the 90\\'s, right? Back\\nthen, it was *all-out* war, and Germany and Japan had to be squashed.\\nAfter all, a million or more British had already died, hundreds of \\nthousands of French, a couple hundread thousand or so Americans, and \\nmillions of Russians, not to mention a few million Jews, Poles, and \\nother people of slavic descent in German concentration camps. All \\nthings considered, the fire-bombings and the atomic bomb were\\nessential (and therefore justified) in bringing the war to a quick\\nend to avoid even greater allied losses.\\n\\nI, for one, don\\'t regret it.\\n\\n\\nSure. And it\\'s the people who suffer because of them. All the more\\nreason to depose these \"entrenched political rulers operating in their\\nown selfish interests\"! Or do you mean that this applies to the allies\\nas well??\\n\\n\\nI make no claim or effort to justify the misguided foreign policy of the\\nWest before the war. It is evident that the West, especially America,\\nmisjudged Hussein drastically. 
But once Hussein invaded Kuwait and \\nthreatened to militarily corner a significant portion of the world\\'s\\noil supply, he had to be stopped. Sure the war could have been\\nprevented by judicious and concerted effort on the part of the West\\nbefore Hussein invaded Kuwait, but it is still *Hussein* who is\\nresponsible for his decision to invade. And once he did so, a\\nstrong response from the West was required.\\n\\n\\nWell, it\\'s not very \"loving\" to allow a Hussein or a Hitler to gobble up\\nnearby countries and keep them. Or to allow them to continue with mass\\nslaughter of certain peoples under their dominion. So, I\\'d have to\\nsay yes, stopping Hussein was the most \"loving\" thing to do for the\\nmost people involved once he set his mind on military conquest.\\n\\nI mentioned it.\\n\\nIf we hadn\\'t intervened, allowing Hussein to keep Kuwait, then it would\\nhave been appeasement. It is precisely the lessons the world learned\\nin WW2 that motivated the Western alliance to war. Letting Hitler take\\nAustria and Czechoslavkia did not stop WW2 from happening, and letting\\nHussein keep Kuwait would not have stopped an eventual Gulf War to\\nprotect Saudi Arabia.\\n\\n\\nSure. What was truly unfortunate was that they followed Hitler in\\nhis grandiose quest for a \"Thousand Year Reich\". The consequences\\nstemmed from that.\\n\\nWhat should I say about them? Anything in particular?\\n\\n\\n\\nSo? It was the *policemen* on trial not Rodney King!! And under American\\nlaw they deserved a jury of *their* peers! If there had been black\\nofficers involved, I\\'m sure their would have been black jurors too.\\nThis point (of allegedly racial motivations) is really shallow.\\n\\n\\nSo? It\\'s \"hard to imagine\"? So when has Argument from Incredulity\\ngained acceptance from the revered author of \"Constructing a Logical\\nArgument\"? Can we expect another revision soon?? :) (Just kidding.)\\n\\n\\nI have to admit that I wonder this too. 
But *neither* the prosecution\\nnor the defense is talking. So one cannot conclude either way due to\\nthe silence of the principals. \\n\\n\\nOK. It certainly seemed to me that there was excessive force involved.\\nAnd frankly, the original \"not guilty\" verdict baffled me too. But then\\nI learned that the prosecution in the first case did not try to convict\\non a charge of excessive force or simple assault which they probably\\nwould have won, they tried to get a conviction on a charge of aggravated\\nassault with intent to inflict serious bodily harm. A charge, which\\nnews commentators said, was akin to attempted murder under California\\nlaw. Based on what the prosecution was asking for, it\\'s evident that \\nthe first jury decided that the officers were \"not guilty\". Note, \\nnot \"not guilty\" of doing wrong, but \"not guilty\" of aggravated assault \\nwith the *intent* of inflicting serious bodily harm. The seeds of the \\nprosecutions defeat were in their own overconfidence in obtaining a \\nverdict such that they went for the most extreme charge they could.\\n\\nIf the facts as the news commentators presented them are true, then\\nI feel the \"not guilty\" verdict was a reasonable one.\\n\\n\\nThanks mathew, I like the quote. Pretty funny actually. (I\\'m a \\nMonty Python fan, you know. Kind of seems in that vein.)\\n\\nOf course, oversimplifying any moral argument can make it seem\\ncontradictory. But then, you know that already. \\n\\nRegards,',\n",
" \"<stuff deleted>\\n\\nYou mean like: seconds, minutes, hours, days, months, years. . . :-)\\n\\nRemember, the Fahrenheit temperature scale is also a centigrade scale. Some\\nrevisionists tell the history something like this: The coldest point in a\\nparticular Russian winter was marked on the thermometer as was the body\\ntemperature of a volunteer (turns out he was sick, but you can't win 'em all).\\nThen the space in between the marks on the thermometer was then divided into\\nhundredths.\\n\\t\\t\\t\\t\\t\\t\\t\\t:-)\\n\\nFWIW,\\n\\nDoug Page\\n\",\n",
" \"\\nIt wasn't especially prominent, as I recall. However, quite possibly it's\\nno longer on display; NASM, like most museums, has much more stuff than it\\ncan display at once, and does rotate the displays occasionally.\",\n",
" \"DM> Fact or rumor....? Madalyn Murray O'Hare an atheist who eliminated the\\nDM> use of the bible reading and prayer in public schools 15 years ago is now\\nDM> going to appear before the FCC with a petition to stop the reading of the\\nDM> Gospel on the airways of America. And she is also campaigning to remove\\nDM> Christmas programs, songs, etc from the public schools. If it is true\\nDM> then mail to Federal Communications Commission 1919 H Street Washington DC\\nDM> 20054 expressing your opposition to her request. Reference Petition number\\n\\nDM> 2493.\\n\\nFalse. This story has been going around for years. There's not a drop of\\ntruth. Note that I don't care for O'Hare (O'Hair?) myself, but this\\nis one thing she's not guilty of.\\n\"]"
]
},
"metadata": {
"tags": []
},
"execution_count": 58
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "9rrzsEAB2FQ1"
},
"source": [
"W1 = clf.fit_transform(vectors_tfidf)\n",
"H1 = clf.components_"
],
"execution_count": 59,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "-Z8un9LG2FQ2",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "801186e5-5678-430c-c10f-716d0e67b09e"
},
"source": [
"show_topics(H1)"
],
"execution_count": 60,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['people don think just like objective say morality',\n",
" 'graphics thanks files image file program windows know',\n",
" 'space nasa launch shuttle orbit moon lunar earth',\n",
" 'ico bobbe tek beauchaine bronx manhattan sank queens',\n",
" 'god jesus bible believe christian atheism does belief']"
]
},
"metadata": {
"tags": []
},
"execution_count": 60
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "46IDAsRR2FQ2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
},
"outputId": "db438b7a-8d0a-41e7-bf44-326df6d3b51b"
},
"source": [
"plt.plot(clf.components_[0])"
],
"execution_count": 61,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[<matplotlib.lines.Line2D at 0x7f8321c12150>]"
]
},
"metadata": {
"tags": []
},
"execution_count": 61
},
{
"output_type": "display_data",
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"tags": [],
"needs_background": "light"
}
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "9fAxAtXH2FQ3",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "137a4cd8-176d-4111-a57f-2590b260cb69"
},
"source": [
"clf.reconstruction_err_"
],
"execution_count": 62,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"43.712926057952785"
]
},
"metadata": {
"tags": []
},
"execution_count": 62
}
]
},
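{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for what `reconstruction_err_` measures, here is a small sketch on a hypothetical random matrix `X`. Assuming the default Frobenius loss, the attribute should agree with the Frobenius norm of the residual $X - WH$ computed by hand. This is my reading of the scikit-learn documentation rather than something stated in this notebook."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"from sklearn.decomposition import NMF\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X = rng.random((20, 12))  # hypothetical small non-negative matrix\n",
"\n",
"model = NMF(n_components=3, random_state=1, max_iter=500)\n",
"W = model.fit_transform(X)\n",
"H = model.components_\n",
"\n",
"# Frobenius norm of the residual X - W H\n",
"manual_err = np.linalg.norm(X - W @ H)\n",
"print(manual_err, model.reconstruction_err_)"
],
"execution_count": null,
"outputs": []
},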
{
"cell_type": "markdown",
"metadata": {
"id": "v6Vc-bdu2FQ3"
},
"source": [
"### NMF in summary"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "weIEBQea2FQ3"
},
"source": [
"Benefits: Fast and easy to use!\n",
"\n",
"Downsides: took years of research and expertise to create"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DJ3j2soy2FQ4"
},
"source": [
"Notes:\n",
"- For NMF, matrix needs to be at least as tall as it is wide, or we get an error with fit_transform\n",
"- Can use `min_df` in CountVectorizer to only look at words that appear in at least k of the split texts"
]
},
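{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the vocabulary-pruning idea from the notes above, on a made-up three-document corpus. In scikit-learn the parameter is spelled `min_df`; with `min_df=2`, only words that occur in at least two documents are kept."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"docs = [\n",
"    'space probe launch',\n",
"    'space shuttle launch',\n",
"    'graphics image file',\n",
"]  # hypothetical toy corpus\n",
"\n",
"# Keep only words that appear in at least 2 documents\n",
"vectorizer = CountVectorizer(min_df=2)\n",
"counts = vectorizer.fit_transform(docs)\n",
"\n",
"print(sorted(vectorizer.vocabulary_))  # surviving words: 'launch' and 'space'\n",
"print(counts.shape)                    # (documents, surviving words)"
],
"execution_count": null,
"outputs": []
},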
{
"cell_type": "markdown",
"metadata": {
"id": "Bqa35kUB2FQ4"
},
"source": [
"## Truncated SVD"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_pkiI3Sl2FQ4"
},
"source": [
"We saved a lot of time when we calculated NMF by only calculating the subset of columns we were interested in. Is there a way to get this benefit with SVD? Yes there is! It's called truncated SVD. We are just interested in the vectors corresponding to the **largest** singular values."
]
},
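{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea concrete, here is a sketch on a hypothetical small matrix `A`. It computes the full SVD and then throws most of it away, which is wasteful and is only meant to show what a truncated SVD *is*; practical truncated methods avoid computing the full decomposition in the first place. Keeping the top `k` singular values gives a rank-`k` approximation whose Frobenius error is the norm of the discarded singular values."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"A = rng.random((8, 12))  # hypothetical stand-in for a document-term matrix\n",
"k = 3\n",
"\n",
"U, s, Vh = np.linalg.svd(A, full_matrices=False)\n",
"\n",
"# Keep only the k largest singular values and their vectors\n",
"A_k = U[:, :k] @ np.diag(s[:k]) @ Vh[:k]\n",
"\n",
"# The Frobenius error equals the norm of the discarded singular values\n",
"err = np.linalg.norm(A - A_k)\n",
"print(err, np.sqrt(np.sum(s[k:] ** 2)))"
],
"execution_count": null,
"outputs": []
},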
{
"cell_type": "markdown",
"metadata": {
"id": "Vsb_sVYK2FQ5"
},
"source": [
"<img src=\"https://github.com/fastai/course-nlp/blob/master/images/svd_fb.png?raw=1\" alt=\"\" style=\"width: 80%\"/>\n",
"\n",
"(source: [Facebook Research: Fast Randomized SVD](https://research.fb.com/fast-randomized-svd/))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lU7eKxKO2FQ5"
},
"source": [
"#### Shortcomings of classical algorithms for decomposition:"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2JpzRtsR2FQ5"
},
"source": [
"- Matrices are \"stupendously big\"\n",
"- Data are often **missing or inaccurate**. Why spend extra computational resources when imprecision of input limits precision of the output?\n",
"- **Data transfer** now plays a major role in time of algorithms. Techniques that require fewer passes over the data may be substantially faster, even if they require more flops (flops = floating point operations).\n",
"- Important to take advantage of **GPUs**.\n",
"\n",
"(source: [Halko](https://arxiv.org/abs/0909.4061))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "W9pFpaxc2FQ6"
},
"source": [
"#### Advantages of randomized algorithms:"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dTXNmRb02FQ6"
},
"source": [
"- inherently stable\n",
"- performance guarantees do not depend on subtle spectral properties\n",
"- needed matrix-vector products can be done in parallel\n",
"\n",
"(source: [Halko](https://arxiv.org/abs/0909.4061))"
]
},
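{
"cell_type": "markdown",
"metadata": {},
"source": [
"A simplified sketch of the randomized approach, loosely following the range-finder idea described in the Halko paper. The function name `randomized_svd_sketch` and its parameters are hypothetical, and this is not the implementation used by scikit-learn or fbpca; in particular, it leaves out the power iterations that production code typically adds for accuracy. The expensive steps are matrix products with a thin random matrix, which is why the work parallelizes well."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"\n",
"def randomized_svd_sketch(A, k, n_oversamples=10, seed=0):\n",
"    # Simplified randomized range finder followed by a small exact SVD\n",
"    rng = np.random.default_rng(seed)\n",
"    # 1. Sample the column space of A with a random Gaussian test matrix\n",
"    Omega = rng.standard_normal((A.shape[1], k + n_oversamples))\n",
"    Y = A @ Omega\n",
"    # 2. Orthonormal basis Q for the sampled range\n",
"    Q, _ = np.linalg.qr(Y)\n",
"    # 3. Project A onto that basis and decompose the much smaller matrix\n",
"    B = Q.T @ A\n",
"    U_small, s, Vh = np.linalg.svd(B, full_matrices=False)\n",
"    U = Q @ U_small\n",
"    return U[:, :k], s[:k], Vh[:k]\n",
"\n",
"rng = np.random.default_rng(1)\n",
"# Hypothetical test matrix of exact rank 5\n",
"A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))\n",
"\n",
"U, s, Vh = randomized_svd_sketch(A, k=5)\n",
"s_exact = np.linalg.svd(A, compute_uv=False)[:5]\n",
"print(np.allclose(s, s_exact))  # top singular values match the exact ones"
],
"execution_count": null,
"outputs": []
},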
{
"cell_type": "markdown",
"metadata": {
"id": "rUHhhoum2FQ6"
},
"source": [
"### Timing comparison"
]
},
{
"cell_type": "code",
"metadata": {
"id": "COwTUOry2FQ7",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "cdefb9f8-6f7c-409a-c2ac-5fe7bf2166d3"
},
"source": [
"%time u, s, v = np.linalg.svd(vectors, full_matrices=False)"
],
"execution_count": 63,
"outputs": [
{
"output_type": "stream",
"text": [
"CPU times: user 1min 47s, sys: 7.48 s, total: 1min 54s\n",
"Wall time: 59.4 s\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "q794z0GR86t6",
"outputId": "ae3dd086-f7b2-481d-8447-51d383aae042"
},
"source": [
"!pip install -qqq fbpca"
],
"execution_count": 65,
"outputs": [
{
"output_type": "stream",
"text": [
" Building wheel for fbpca (setup.py) ... \u001b[?25l\u001b[?25hdone\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "_4WLeqgB2FQ7"
},
"source": [
"from sklearn import decomposition\n",
"import fbpca"
],
"execution_count": 66,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "eOnVt3r_2FQ7",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "79f96b03-a979-496c-fc7e-aa22e3797430"
},
"source": [
"%time u, s, v = decomposition.randomized_svd(vectors, 10)"
],
"execution_count": 67,
"outputs": [
{
"output_type": "stream",
"text": [
"CPU times: user 14.6 s, sys: 1.86 s, total: 16.5 s\n",
"Wall time: 11.3 s\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EciCddx62FQ8"
},
"source": [
"Randomized SVD from Facebook's library fbpca:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Zk78_kuX2FQ8",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "d215ce8d-680a-4a8a-ad60-f4c62d3b7199"
},
"source": [
"%time u, s, v = fbpca.pca(vectors, 10)"
],
"execution_count": 68,
"outputs": [
{
"output_type": "stream",
"text": [
"CPU times: user 3.28 s, sys: 688 ms, total: 3.97 s\n",
"Wall time: 2.21 s\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w2s-zurd2FQ9"
},
"source": [
"For more on randomized SVD, check out my [PyBay 2017 talk](https://www.youtube.com/watch?v=7i6kBz1kZ-A&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=7).\n",
"\n",
"For significantly more on randomized SVD, check out the [Computational Linear Algebra course](https://github.com/fastai/numerical-linear-algebra)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TWiMvAfE2FQ9"
},
"source": [
"## End"
]
}
]
}