rebeccabilbro/nist_clustering.ipynb

## nist_clustering.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clustering NIST headlines and descriptions\n",
    "\n",
    "_adapted from https://github.com/StarCYing/open_data_day_dc_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction:\n",
    "In this workshop we show you an example of a workflow in data science from initial data ingestion, cleaning, modeling, and ultimately clustering. In this example we scrape the news feed of the National Institute of Standards and Technology ([NIST](www.nist.gov)). For those not in the know, NIST is comprised of multiple research centers which include: \n",
    "* Center for Nanoscale Science and Technology (CNST)\n",
    "* Engineering Laboratory (EL)\n",
    "* Information Technology Laboratory (ITL)\n",
    "* NIST Center for Neutron Research (NCNR)\n",
    "* Material Measurement Laboratory (MML)\n",
    "* Physical Measurement Laboratory (PML)\n",
    "\n",
    "This makes it an easy target for topic modeling, a way of identifying patterns in a corpus that uses __clustering__."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from lxml import html\n",
    "import requests\n",
    "from __future__ import print_function\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.cluster import KMeans, MiniBatchKMeans\n",
    "from time import time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Building the list of headlines and descriptions\n",
    "\n",
    "We request NIST news based on the following URL, 'http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014'. For this workshop, we look at only 2014 news articles posted on the NIST website. \n",
    "\n",
    "We then pass that retrieved content to our HTML parser and search for a specific div class, \"select_portal_module_wrapper\" which is assigned to every headline and every description (headlines receive a strong tag and descriptions receive a p tag).\n",
    "\n",
    "We then merge both the headline and description into one entry in the list because we don't need to differentiate between title and description."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Retrieving data from NIST...\n",
      "Last item in list retrieved: A New NIST Online Database: The NIST Polycyclic Aromatic Hydrocarbon Structure IndexÂ Recently, a new website containing a wealth of information on polycyclic aromatic hydrocarbons (PAHs) was made publicly available by NIST. PAHs are compounds that are produced during the â€¦ \n"
     ]
    }
   ],
   "source": [
    "print(\"Retrieving data from NIST...\")\n",
    "\n",
    "# Retrieve the data from the web page.\n",
    "page = requests.get('http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014') \n",
    "\n",
    "# Use html module to parse it out and store in tree.\n",
    "tree = html.fromstring(page.content)\n",
    "\n",
    "# Create list of news headlines and descriptions. \n",
    "# This required obtaining the xpath of the elements by examining the web page.\n",
    "list_of_headlines = tree.xpath('//div[@class=\"select_portal_module_wrapper\"]/a/strong/text()')\n",
    "list_of_descriptions = tree.xpath('//div[@class=\"select_portal_module_wrapper\"]/p/text()')\n",
    "\n",
    "#Combine each headline and description into one value in a list\n",
    "news=[]\n",
    "for each_headline in list_of_headlines:\n",
    "    for each_description in list_of_descriptions:\n",
    "        news.append(each_headline+each_description)\n",
    "\n",
    "print(\"Last item in list retrieved: %s\" % news[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Term Frequency-Inverse Document Frequency\n",
    "\n",
    "- The weight of a term that occurs in a document is proportional to the term frequency.    \n",
    "- Term frequency is the number of times a term occurs in a document.        \n",
    "- Inverse document frequency diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.\n",
    " \n",
    "![TFIDF](figures/tfidf.png) \n",
    "\n",
    "\n",
    "### Convert collection of documents to TF-IDF matrix\n",
    "We now call a TF-IDF vectorizer to create a sparse matrix with term frequency-inverse document frequency weights:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Extracting features from the training dataset using a sparse vectorizer\n",
      "done in 4.089050s\n",
      "n_samples: 110224, n_features: 12197\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"Extracting features from the training dataset using a sparse vectorizer\")\n",
    "t0 = time()\n",
    "\n",
    "# Create a sparse word occurrence frequency matrix of the most frequent words\n",
    "# with the following parameters:\n",
    "# Maximum document frequency = half the total documents\n",
    "# Minimum document frequency = two documents\n",
    "# Toss out common English stop words.\n",
    "vectorizer = TfidfVectorizer(input=news, max_df=0.5, min_df=2, stop_words='english')\n",
    "\n",
    "# This calculates the counts \n",
    "X = vectorizer.fit_transform(news) \n",
    "\n",
    "print(\"done in %fs\" % (time() - t0))\n",
    "print(\"n_samples: %d, n_features: %d\" % X.shape)\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's do some clustering\n",
    "\n",
    "![Kmeans clustering](figures/voronoi.png)    \n",
    "\n",
    "I happen to know there are 15 [subject areas](http://www.nist.gov/subject_areas.cfm) at NIST:    \n",
    "    - Bioscience & Health\n",
    "    - Building and Fire Research\n",
    "    - Chemistry\n",
    "    - Electronics & Telecommunications\n",
    "    - Energy\n",
    "    - Environment/Climate\n",
    "    - Information Technology\n",
    "    - Manufacturing\n",
    "    - Materials Science\n",
    "    - Math\n",
    "    - Nanotechnology\n",
    "    - Physics\n",
    "    - Public Safety & Security\n",
    "    - Quality\n",
    "    - Transportation\n",
    "\n",
    "So, why don't we cheat and set the number of clusters to 15?     \n",
    "\n",
    "Then we call the KMeans clustering model from sklearn and set an upper bound to the number of iterations for fitting the data to the model.    \n",
    "\n",
    "Finally we list out each centroid and the top 10 terms associated with each centroid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=15, n_init=10,\n",
      "    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,\n",
      "    verbose=0)\n",
      "done in 119.277s\n",
      "\n",
      "Top terms per cluster:\n",
      "Cluster 1: attack terrorist information work prepare cyber phone hazard united states\n",
      "Cluster 2: baldrige excellence performance program award malcolm quality organizations penny pritzker\n",
      "Cluster 3: based technology university institute standards national researchers demonstrated new maryland\n",
      "Cluster 4: forensic science standards national technology institute new research committees osac\n",
      "Cluster 5: committee technology visiting agency primary sector national vcat advanced group\n",
      "Cluster 6: induced silicon pressure resolution force atomic chip photothermal ptir lateral\n",
      "Cluster 7: cnst make center edition nanoscale science spring recent xps news\n",
      "Cluster 8: test image left john rpki 15 showing standard click materials\n",
      "Cluster 9: optical fiber long dark ago laser nistar idea years recent\n",
      "Cluster 10: draft public comment review standards federal institute technology national issued\n",
      "Cluster 11: extension hollings partnership manufacturing mep new standards institute technology national\n",
      "Cluster 12: dimensional metrology pml semiconductor division electron scanning researchers ultra energy\n",
      "Cluster 13: technology standards institute national released scientists new research 2014 computer\n",
      "Cluster 14: new used medical time national director 2014 technology center data\n",
      "Cluster 15: 00 dna sequencer et participate analysts wednesday webinar invited free\n"
     ]
    }
   ],
   "source": [
    "# Set the number of clusters to 15\n",
    "k = 15\n",
    "\n",
    "# Initialize the kMeans cluster model.\n",
    "km = KMeans(n_clusters=k, init='k-means++', max_iter=100)\n",
    "\n",
    "print(\"Clustering sparse data with %s\" % km)\n",
    "t0 = time()\n",
    "\n",
    "# Pass the model our sparse matrix with the TF-IDF counts.\n",
    "km.fit(X)\n",
    "print(\"done in %0.3fs\" % (time() - t0))\n",
    "print()\n",
    "\n",
    "print(\"Top terms per cluster:\")\n",
    "order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n",
    "\n",
    "terms = vectorizer.get_feature_names()\n",
    "for i in range(k):\n",
    "    print(\"Cluster %d:\" % (i+1), end='')\n",
    "    for ind in order_centroids[i, :10]:\n",
    "        print(' %s' % terms[ind], end='')\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "How do the results compare to NIST's listed [subject areas](http://www.nist.gov/subject_areas.cfm)?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Clustering NIST headlines and descriptions\n",
	"\n",
	"_adapted from https://github.com/StarCYing/open_data_day_dc_"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Introduction:\n",
	"In this workshop we show you an example of a workflow in data science from initial data ingestion, cleaning, modeling, and ultimately clustering. In this example we scrape the news feed of the National Institute of Standards and Technology ([NIST](www.nist.gov)). For those not in the know, NIST is comprised of multiple research centers which include: \n",
	"* Center for Nanoscale Science and Technology (CNST)\n",
	"* Engineering Laboratory (EL)\n",
	"* Information Technology Laboratory (ITL)\n",
	"* NIST Center for Neutron Research (NCNR)\n",
	"* Material Measurement Laboratory (MML)\n",
	"* Physical Measurement Laboratory (PML)\n",
	"\n",
	"This makes it an easy target for topic modeling, a way of identifying patterns in a corpus that uses __clustering__."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {
	"collapsed": false
	},
	"outputs": [],
	"source": [
	"from lxml import html\n",
	"import requests\n",
	"from __future__ import print_function\n",
	"from sklearn.feature_extraction.text import TfidfVectorizer\n",
	"from sklearn.cluster import KMeans, MiniBatchKMeans\n",
	"from time import time"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Building the list of headlines and descriptions\n",
	"\n",
	"We request NIST news based on the following URL, 'http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014'. For this workshop, we look at only 2014 news articles posted on the NIST website. \n",
	"\n",
	"We then pass that retrieved content to our HTML parser and search for a specific div class, \"select_portal_module_wrapper\" which is assigned to every headline and every description (headlines receive a strong tag and descriptions receive a p tag).\n",
	"\n",
	"We then merge both the headline and description into one entry in the list because we don't need to differentiate between title and description."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Retrieving data from NIST...\n",
	"Last item in list retrieved: A New NIST Online Database: The NIST Polycyclic Aromatic Hydrocarbon Structure IndexÂ Recently, a new website containing a wealth of information on polycyclic aromatic hydrocarbons (PAHs) was made publicly available by NIST. PAHs are compounds that are produced during the â€¦ \n"
	]
	}
	],
	"source": [
	"print(\"Retrieving data from NIST...\")\n",
	"\n",
	"# Retrieve the data from the web page.\n",
	"page = requests.get('http://www.nist.gov/allnews.cfm?s=01-01-2014&e=12-31-2014') \n",
	"\n",
	"# Use html module to parse it out and store in tree.\n",
	"tree = html.fromstring(page.content)\n",
	"\n",
	"# Create list of news headlines and descriptions. \n",
	"# This required obtaining the xpath of the elements by examining the web page.\n",
	"list_of_headlines = tree.xpath('//div[@class=\"select_portal_module_wrapper\"]/a/strong/text()')\n",
	"list_of_descriptions = tree.xpath('//div[@class=\"select_portal_module_wrapper\"]/p/text()')\n",
	"\n",
	"#Combine each headline and description into one value in a list\n",
	"news=[]\n",
	"for each_headline in list_of_headlines:\n",
	" for each_description in list_of_descriptions:\n",
	" news.append(each_headline+each_description)\n",
	"\n",
	"print(\"Last item in list retrieved: %s\" % news[-1])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Term Frequency-Inverse Document Frequency\n",
	"\n",
	"- The weight of a term that occurs in a document is proportional to the term frequency. \n",
	"- Term frequency is the number of times a term occurs in a document. \n",
	"- Inverse document frequency diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.\n",
	" \n",
	"![TFIDF](figures/tfidf.png) \n",
	"\n",
	"\n",
	"### Convert collection of documents to TF-IDF matrix\n",
	"We now call a TF-IDF vectorizer to create a sparse matrix with term frequency-inverse document frequency weights:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {
	"collapsed": false,
	"scrolled": true
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Extracting features from the training dataset using a sparse vectorizer\n",
	"done in 4.089050s\n",
	"n_samples: 110224, n_features: 12197\n",
	"\n"
	]
	}
	],
	"source": [
	"print(\"Extracting features from the training dataset using a sparse vectorizer\")\n",
	"t0 = time()\n",
	"\n",
	"# Create a sparse word occurrence frequency matrix of the most frequent words\n",
	"# with the following parameters:\n",
	"# Maximum document frequency = half the total documents\n",
	"# Minimum document frequency = two documents\n",
	"# Toss out common English stop words.\n",
	"vectorizer = TfidfVectorizer(input=news, max_df=0.5, min_df=2, stop_words='english')\n",
	"\n",
	"# This calculates the counts \n",
	"X = vectorizer.fit_transform(news) \n",
	"\n",
	"print(\"done in %fs\" % (time() - t0))\n",
	"print(\"n_samples: %d, n_features: %d\" % X.shape)\n",
	"print()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Let's do some clustering\n",
	"\n",
	"![Kmeans clustering](figures/voronoi.png) \n",
	"\n",
	"I happen to know there are 15 [subject areas](http://www.nist.gov/subject_areas.cfm) at NIST: \n",
	" - Bioscience & Health\n",
	" - Building and Fire Research\n",
	" - Chemistry\n",
	" - Electronics & Telecommunications\n",
	" - Energy\n",
	" - Environment/Climate\n",
	" - Information Technology\n",
	" - Manufacturing\n",
	" - Materials Science\n",
	" - Math\n",
	" - Nanotechnology\n",
	" - Physics\n",
	" - Public Safety & Security\n",
	" - Quality\n",
	" - Transportation\n",
	"\n",
	"So, why don't we cheat and set the number of clusters to 15? \n",
	"\n",
	"Then we call the KMeans clustering model from sklearn and set an upper bound to the number of iterations for fitting the data to the model. \n",
	"\n",
	"Finally we list out each centroid and the top 10 terms associated with each centroid."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {
	"collapsed": false
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"Clustering sparse data with KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=15, n_init=10,\n",
	" n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,\n",
	" verbose=0)\n",
	"done in 119.277s\n",
	"\n",
	"Top terms per cluster:\n",
	"Cluster 1: attack terrorist information work prepare cyber phone hazard united states\n",
	"Cluster 2: baldrige excellence performance program award malcolm quality organizations penny pritzker\n",
	"Cluster 3: based technology university institute standards national researchers demonstrated new maryland\n",
	"Cluster 4: forensic science standards national technology institute new research committees osac\n",
	"Cluster 5: committee technology visiting agency primary sector national vcat advanced group\n",
	"Cluster 6: induced silicon pressure resolution force atomic chip photothermal ptir lateral\n",
	"Cluster 7: cnst make center edition nanoscale science spring recent xps news\n",
	"Cluster 8: test image left john rpki 15 showing standard click materials\n",
	"Cluster 9: optical fiber long dark ago laser nistar idea years recent\n",
	"Cluster 10: draft public comment review standards federal institute technology national issued\n",
	"Cluster 11: extension hollings partnership manufacturing mep new standards institute technology national\n",
	"Cluster 12: dimensional metrology pml semiconductor division electron scanning researchers ultra energy\n",
	"Cluster 13: technology standards institute national released scientists new research 2014 computer\n",
	"Cluster 14: new used medical time national director 2014 technology center data\n",
	"Cluster 15: 00 dna sequencer et participate analysts wednesday webinar invited free\n"
	]
	}
	],
	"source": [
	"# Set the number of clusters to 15\n",
	"k = 15\n",
	"\n",
	"# Initialize the kMeans cluster model.\n",
	"km = KMeans(n_clusters=k, init='k-means++', max_iter=100)\n",
	"\n",
	"print(\"Clustering sparse data with %s\" % km)\n",
	"t0 = time()\n",
	"\n",
	"# Pass the model our sparse matrix with the TF-IDF counts.\n",
	"km.fit(X)\n",
	"print(\"done in %0.3fs\" % (time() - t0))\n",
	"print()\n",
	"\n",
	"print(\"Top terms per cluster:\")\n",
	"order_centroids = km.cluster_centers_.argsort()[:, ::-1]\n",
	"\n",
	"terms = vectorizer.get_feature_names()\n",
	"for i in range(k):\n",
	" print(\"Cluster %d:\" % (i+1), end='')\n",
	" for ind in order_centroids[i, :10]:\n",
	" print(' %s' % terms[ind], end='')\n",
	" print()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"collapsed": true
	},
	"source": [
	"How do the results compare to NIST's listed [subject areas](http://www.nist.gov/subject_areas.cfm)?"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 2",
	"language": "python",
	"name": "python2"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 2
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.11"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}