{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Very simple document clustering with scikit-learn\n",
"\n",
"This is a very brief and simple introduction to document clustering with scikit-learn. This is going to skim over a lot of the fundamentals to get to the bit where it actually works, so be prepared to not understand what it's doing sometimes. We're going to get some text documents, transform them into a numeric representation that's useful to work with, and then \"cluster\" them, finding out which of them are most similar to each other."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some text\n",
"I'm going to use the \"inaugural address\" dataset from NLTK, and import a few other libraries at the same time."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package inaugural to /Users/Simon/nltk_data...\n",
"[nltk_data] Package inaugural is already up-to-date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import nltk\n",
"from nltk.corpus import inaugural\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.cluster import KMeans\n",
"nltk.download('inaugural')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset is the inaugural address of the US President from each inauguration from 1789 to 2009.\n",
"\n",
"Here's a look at what it looks like:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(u'1789-Washington.txt', u'Fellow-Citizens of the Senate and of the House of Representatives', '...')\n",
"(u'1793-Washington.txt', u'Fellow citizens, I am again called upon by the voice of my countr', '...')\n",
"(u'1797-Adams.txt', u'When it was first perceived, in early times, that no middle cours', '...')\n",
"(u'1801-Jefferson.txt', u'Friends and Fellow Citizens:\\n\\nCalled upon to undertake the duties', '...')\n",
"(u'1805-Jefferson.txt', u'Proceeding, fellow citizens, to that qualification which the Cons', '...')\n",
"(u'1809-Madison.txt', u'Unwilling to depart from examples of the most revered authority, ', '...')\n",
"(u'1813-Madison.txt', u'About to add the solemnity of an oath to the obligations imposed ', '...')\n",
"(u'1817-Monroe.txt', u'I should be destitute of feeling if I was not deeply affected by ', '...')\n",
"(u'1821-Monroe.txt', u'Fellow citizens, I shall not attempt to describe the grateful emo', '...')\n",
"(u'1825-Adams.txt', u'In compliance with an usage coeval with the existence of our Fede', '...')\n",
"(u'1829-Jackson.txt', u'Fellow citizens, about to undertake the arduous duties that I hav', '...')\n",
"(u'1833-Jackson.txt', u'Fellow citizens, the will of the American people, expressed throu', '...')\n",
"(u'1837-VanBuren.txt', u'Fellow citizens: The practice of all my predecessors imposes on m', '...')\n",
"(u'1841-Harrison.txt', u'Called from a retirement which I had supposed was to continue for', '...')\n",
"(u'1845-Polk.txt', u'Fellow citizens, without solicitation on my part, I have been cho', '...')\n",
"(u'1849-Taylor.txt', u'Elected by the American people to the highest office known to our', '...')\n",
"(u'1853-Pierce.txt', u'My Countrymen, It a relief to feel that no heart but my own can k', '...')\n",
"(u'1857-Buchanan.txt', u'Fellow citizens, I appear before you this day to take the solemn ', '...')\n",
"(u'1861-Lincoln.txt', u'Fellow-Citizens of the United States: In compliance with a custom', '...')\n",
"(u'1865-Lincoln.txt', u'Fellow-Countrymen:\\n\\nAt this second appearing to take the oath of ', '...')\n",
"(u'1869-Grant.txt', u'Citizens of the United States:\\n\\nYour suffrages having elected me ', '...')\n",
"(u'1873-Grant.txt', u'Fellow-Citizens:\\n\\nUnder Providence I have been called a second ti', '...')\n",
"(u'1877-Hayes.txt', u'Fellow citizens, we have assembled to repeat the public ceremonia', '...')\n",
"(u'1881-Garfield.txt', u'Fellow-Citizens:\\n\\nWe stand to-day upon an eminence which overlook', '...')\n",
"(u'1885-Cleveland.txt', u'Fellow citizens, in the presence of this vast assemblage of my co', '...')\n",
"(u'1889-Harrison.txt', u'Fellow-Citizens, there is no constitutional or legal requirement ', '...')\n",
"(u'1893-Cleveland.txt', u'My Fellow citizens, in obedience of the mandate of my countrymen ', '...')\n",
"(u'1897-McKinley.txt', u'Fellow citizens, In obedience to the will of the people, and in t', '...')\n",
"(u'1901-McKinley.txt', u'My fellow-citizens, when we assembled here on the 4th of March, 1', '...')\n",
"(u'1905-Roosevelt.txt', u'My fellow citizens, no people on earth have more cause to be than', '...')\n",
"(u'1909-Taft.txt', u'My fellow citizens: Anyone who has taken the oath I have just tak', '...')\n",
"(u'1913-Wilson.txt', u'There has been a change of government. It began two years ago, wh', '...')\n",
"(u'1917-Wilson.txt', u'My Fellow citizens: The four years which have elapsed since last ', '...')\n",
"(u'1921-Harding.txt', u'My Countrymen:\\n\\nWhen one surveys the world about him after the gr', '...')\n",
"(u'1925-Coolidge.txt', u'My countrymen,\\n\\nno one can contemplate current conditions without', '...')\n",
"(u'1929-Hoover.txt', u'My Countrymen: This occasion is not alone the administration of t', '...')\n",
"(u'1933-Roosevelt.txt', u'I am certain that my fellow Americans expect that on my induction', '...')\n",
"(u'1937-Roosevelt.txt', u'When four years ago we met to inaugurate a President, the Republi', '...')\n",
"(u'1941-Roosevelt.txt', u'On each national day of inauguration since 1789, the people have ', '...')\n",
"(u'1945-Roosevelt.txt', u'Chief Justice, Mr. Vice President, my friends, you will understan', '...')\n",
"(u'1949-Truman.txt', u'Mr. Vice President, Mr. Chief Justice, and fellow citizens, I acc', '...')\n",
"(u'1953-Eisenhower.txt', u'My friends, before I begin the expression of those thoughts that ', '...')\n",
"(u'1957-Eisenhower.txt', u'The Price of Peace\\nMr. Chairman, Mr. Vice President, Mr. Chief Ju', '...')\n",
"(u'1961-Kennedy.txt', u'Vice President Johnson, Mr. Speaker, Mr. Chief Justice, President', '...')\n",
"(u'1965-Johnson.txt', u'My fellow countrymen, on this occasion, the oath I have taken bef', '...')\n",
"(u'1969-Nixon.txt', u'Senator Dirksen, Mr. Chief Justice, Mr. Vice President, President', '...')\n",
"(u'1973-Nixon.txt', u'Mr. Vice President, Mr. Speaker, Mr. Chief Justice, Senator Cook,', '...')\n",
"(u'1977-Carter.txt', u'For myself and for our Nation, I want to thank my predecessor for', '...')\n",
"(u'1981-Reagan.txt', u'Senator Hatfield, Mr. Chief Justice, Mr. President, Vice Presiden', '...')\n",
"(u'1985-Reagan.txt', u'Senator Mathias, Chief Justice Burger, Vice President Bush, Speak', '...')\n",
"(u'1989-Bush.txt', u'Mr. Chief Justice, Mr. President, Vice President Quayle, Senator ', '...')\n",
"(u'1993-Clinton.txt', u'My fellow citizens, today we celebrate the mystery of American re', '...')\n",
"(u'1997-Clinton.txt', u'My fellow citizens: At this last presidential inauguration of the', '...')\n",
"(u'2001-Bush.txt', u'President Clinton, distinguished guests and my fellow citizens, t', '...')\n",
"(u'2005-Bush.txt', u'Vice President Cheney, Mr. Chief Justice, President Carter, Presi', '...')\n",
"(u'2009-Obama.txt', u'My fellow citizens:\\n\\nI stand here today humbled by the task befor', '...')\n"
]
}
],
"source": [
"for fileid in inaugural.fileids():\n",
" print(fileid, inaugural.raw(fileid)[:65], '...')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK gives you a few fancy things you can do with this text, but we're not going to use those, we're just going to pull it all down into a list. (This might take a few seconds)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"raw_text = [inaugural.raw(fileid) for fileid in inaugural.fileids()]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TF-IDF vectorising\n",
"\n",
"This is doing a lot of different things in one step, so it's going to be a bit confusing! We're using a thing called a \"TFIDF Vectorizer\" which does two things:\n",
" \n",
" 1. \"Bag of Words\" encodes a list of documents. That means it takes your list of documents and makes it into a table where each column represents a word, and each row represents one of the documents. The value in each cell is the count of how many times that word appears in that document.\n",
" 2. \"TF-IDF\" transformation. That means that it looks at how often the word appears in the document, and compares that to how often the word appears in all the documents. It calculates a new value that represents how much more often the word occurs in this document, compared to documents overall.\n",
" \n",
"It also returns the data in a \"sparse\" format, which is a bit tricky to work with (but saves on memory when you're dealing with really big datasets. This is a small dataset, so we'll turn it back into a dataframe.)"
]
},
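{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running it on the speeches, here's a quick sketch on a tiny made-up corpus (the three \"documents\" and the names toy_docs and toy_vectoriser are invented purely for illustration), just to show the kind of table the vectoriser produces: words that appear in only one document get a high weight in that document, and words shared by every document get a lower one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# A tiny invented corpus, just to show the shape of the vectoriser's output.\n",
"toy_docs = ['the cat sat on the mat',\n",
"            'the dog sat on the log',\n",
"            'the cat chased the dog']\n",
"toy_vectoriser = TfidfVectorizer()\n",
"toy_sparse = toy_vectoriser.fit_transform(toy_docs)\n",
"pd.DataFrame(toy_sparse.todense(), columns=toy_vectoriser.get_feature_names())"
]
},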
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"vectoriser = TfidfVectorizer(stop_words='english')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sparse = vectoriser.fit_transform(raw_text)\n",
"\n",
"df = pd.DataFrame(sparse.todense(), columns=vectoriser.get_feature_names(), index=inaugural.fileids())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a look at the data now. You can see each row is one of the documents, from 1789 to 2009, and each column is a word, from \"000\" to \"zealous\"."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>000</th>\n",
" <th>100</th>\n",
" <th>120</th>\n",
" <th>125</th>\n",
" <th>13</th>\n",
" <th>14th</th>\n",
" <th>15th</th>\n",
" <th>16</th>\n",
" <th>1774</th>\n",
" <th>1776</th>\n",
" <th>...</th>\n",
" <th>yorktown</th>\n",
" <th>young</th>\n",
" <th>younger</th>\n",
" <th>youngest</th>\n",
" <th>youth</th>\n",
" <th>youthful</th>\n",
" <th>zeal</th>\n",
" <th>zealous</th>\n",
" <th>zealously</th>\n",
" <th>zone</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1789-Washington.txt</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.057533</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1793-Washington.txt</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1797-Adams.txt</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.026509</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1801-Jefferson.txt</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.034306</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1805-Jefferson.txt</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.084242</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 8747 columns</p>\n",
"</div>"
],
"text/plain": [
" 000 100 120 125 13 14th 15th 16 1774 1776 \\\n",
"1789-Washington.txt 0.0 0.0 0.0 0.0 0.0 0.057533 0.0 0.0 0.0 0.0 \n",
"1793-Washington.txt 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 \n",
"1797-Adams.txt 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 \n",
"1801-Jefferson.txt 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 \n",
"1805-Jefferson.txt 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 \n",
"\n",
" ... yorktown young younger youngest youth \\\n",
"1789-Washington.txt ... 0.0 0.0 0.0 0.0 0.0 \n",
"1793-Washington.txt ... 0.0 0.0 0.0 0.0 0.0 \n",
"1797-Adams.txt ... 0.0 0.0 0.0 0.0 0.0 \n",
"1801-Jefferson.txt ... 0.0 0.0 0.0 0.0 0.0 \n",
"1805-Jefferson.txt ... 0.0 0.0 0.0 0.0 0.0 \n",
"\n",
" youthful zeal zealous zealously zone \n",
"1789-Washington.txt 0.0 0.000000 0.0 0.0 0.0 \n",
"1793-Washington.txt 0.0 0.000000 0.0 0.0 0.0 \n",
"1797-Adams.txt 0.0 0.026509 0.0 0.0 0.0 \n",
"1801-Jefferson.txt 0.0 0.034306 0.0 0.0 0.0 \n",
"1805-Jefferson.txt 0.0 0.084242 0.0 0.0 0.0 \n",
"\n",
"[5 rows x 8747 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By looking at the highest-value words for one of the documents, we can see which words were used in that speech more often than usual."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"america 0.150625\n",
"nation 0.120196\n",
"new 0.118257\n",
"generation 0.101366\n",
"jobs 0.100642\n",
"today 0.096174\n",
"let 0.091969\n",
"hard 0.088218\n",
"women 0.088218\n",
"crisis 0.088218\n",
"Name: 2009-Obama.txt, dtype: float64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc['2009-Obama.txt'].sort_values(ascending=False)[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## K-means clustering\n",
"\n",
"Again, we're doing a heck of a lot of stuff very quickly. We're going to implement \"k-means\" clustering, which finds a given number of clusters within a dataset. If you ask for three clusters, it'll give you three. If you ask for 20, it'll give you 20. It's a non-deterministic method, which means that it uses some random elements and won't give exactly the same results every time you run it."
]
},
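{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's that toy sketch (the six 2-D points and the names toy_points and toy_kmeans are made up and have nothing to do with the speeches): k-means puts each point into one of the clusters you asked for, and passing random_state pins the random initialisation so repeated runs give the same labels."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Six made-up 2-D points: three near the origin, three near (10, 10).\n",
"toy_points = np.array([[0, 0], [0, 1], [1, 0],\n",
"                       [10, 10], [10, 11], [11, 10]])\n",
"# Ask for two clusters; random_state makes the labels repeatable across runs.\n",
"toy_kmeans = KMeans(n_clusters=2, random_state=0)\n",
"print(toy_kmeans.fit_predict(toy_points))"
]
},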
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A cool thing about the sk-learn kmeans class is that it can work with a sparse dataframe. That means we can give it the sparse object we got from the tfidf vectoriser earlier on.\n",
"\n",
"We're going to cluster the documents based on the words they've used. My hunch is that it will find that addresses from similar time periods belong together. The list of addresses is in order of inaguration date, so if I'm right, it'll come back with the clusters all grouped together."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fuster_clucker = KMeans(n_clusters=3)\n",
"clusters = fuster_clucker.fit_predict(sparse)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the cluster (a number between 0 and 2) for each of the documents:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(0, u'1789-Washington.txt'),\n",
" (0, u'1793-Washington.txt'),\n",
" (0, u'1797-Adams.txt'),\n",
" (0, u'1801-Jefferson.txt'),\n",
" (0, u'1805-Jefferson.txt'),\n",
" (0, u'1809-Madison.txt'),\n",
" (1, u'1813-Madison.txt'),\n",
" (0, u'1817-Monroe.txt'),\n",
" (0, u'1821-Monroe.txt'),\n",
" (0, u'1825-Adams.txt'),\n",
" (0, u'1829-Jackson.txt'),\n",
" (0, u'1833-Jackson.txt'),\n",
" (0, u'1837-VanBuren.txt'),\n",
" (0, u'1841-Harrison.txt'),\n",
" (0, u'1845-Polk.txt'),\n",
" (0, u'1849-Taylor.txt'),\n",
" (0, u'1853-Pierce.txt'),\n",
" (0, u'1857-Buchanan.txt'),\n",
" (0, u'1861-Lincoln.txt'),\n",
" (1, u'1865-Lincoln.txt'),\n",
" (0, u'1869-Grant.txt'),\n",
" (0, u'1873-Grant.txt'),\n",
" (0, u'1877-Hayes.txt'),\n",
" (0, u'1881-Garfield.txt'),\n",
" (0, u'1885-Cleveland.txt'),\n",
" (1, u'1889-Harrison.txt'),\n",
" (1, u'1893-Cleveland.txt'),\n",
" (1, u'1897-McKinley.txt'),\n",
" (1, u'1901-McKinley.txt'),\n",
" (1, u'1905-Roosevelt.txt'),\n",
" (1, u'1909-Taft.txt'),\n",
" (1, u'1913-Wilson.txt'),\n",
" (2, u'1917-Wilson.txt'),\n",
" (1, u'1921-Harding.txt'),\n",
" (1, u'1925-Coolidge.txt'),\n",
" (1, u'1929-Hoover.txt'),\n",
" (1, u'1933-Roosevelt.txt'),\n",
" (1, u'1937-Roosevelt.txt'),\n",
" (2, u'1941-Roosevelt.txt'),\n",
" (2, u'1945-Roosevelt.txt'),\n",
" (2, u'1949-Truman.txt'),\n",
" (2, u'1953-Eisenhower.txt'),\n",
" (2, u'1957-Eisenhower.txt'),\n",
" (2, u'1961-Kennedy.txt'),\n",
" (2, u'1965-Johnson.txt'),\n",
" (2, u'1969-Nixon.txt'),\n",
" (2, u'1973-Nixon.txt'),\n",
" (2, u'1977-Carter.txt'),\n",
" (2, u'1981-Reagan.txt'),\n",
" (2, u'1985-Reagan.txt'),\n",
" (2, u'1989-Bush.txt'),\n",
" (2, u'1993-Clinton.txt'),\n",
" (2, u'1997-Clinton.txt'),\n",
" (2, u'2001-Bush.txt'),\n",
" (2, u'2005-Bush.txt'),\n",
" (2, u'2009-Obama.txt')]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[i for i in zip(clusters, inaugural.fileids())]"
]
},
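{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for what separates the clusters, one option (a quick sketch, not something the clustering itself needs) is to look at the highest-weighted words in each cluster centre, using the fitted model's cluster_centers_ attribute. Because the labels are assigned non-deterministically, which cluster gets which number can change between runs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# For each cluster, show the ten words with the highest weight in its centre.\n",
"for cluster_number, centre in enumerate(fuster_clucker.cluster_centers_):\n",
"    top_words = pd.Series(centre, index=vectoriser.get_feature_names())\n",
"    print(cluster_number)\n",
"    print(top_words.sort_values(ascending=False)[:10])"
]
},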
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hey cool! It works! I hope this is useful!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}