@cmgerber
Created May 6, 2014 22:27
{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "heading",
"source": "PLOS Cloud Explorer: The Process",
"level": 1
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This notebook is about our process for figuring out what PLOS Cloud Explorer was going to be. It includes early code, prototypes, and dead ends.\n\nFor the full story including the happy ending, read [this document](https://github.com/cmgerber/PLOS_Cloud_Explorer/blob/master/README.md) and follow the other notebook links to see the code we actually used."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "First things first. All imports for this notebook:"
},
{
"metadata": {},
"cell_type": "code",
"input": "from __future__ import unicode_literals\n\n# You need an API Key for PLOS\nimport settings\n\n# Data analysis\nimport numpy as np\nimport pandas as pd\nfrom numpy import nan\nfrom pandas import Series, DataFrame\n\n# Interacting with API\nimport requests\nimport urllib\nimport time\nfrom retrying import retry\nimport os\nimport random\nimport json\n\n# Natural language processing\nimport nltk\nfrom nltk.collocations import BigramCollocationFinder\nfrom nltk.metrics import BigramAssocMeasures\nfrom nltk.corpus import stopwords\nimport string\n\n# For the IPython widgets:\nfrom IPython.display import display, Image, HTML, clear_output\nfrom IPython.html import widgets\nfrom jinja2 import Template",
"prompt_number": 1,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Data Collection",
"level": 1
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We began with a really simple way of getting article data from the PLOS Search API:"
},
{
"metadata": {},
"cell_type": "code",
"input": "r = requests.get('http://api.plos.org/search?q=subject:\"biotechnology\"&start=0&rows=500&api_key={%s}&wt=json' % settings.PLOS_KEY).json()\nlen(r['response']['docs'])",
"prompt_number": 2,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 2,
"metadata": {},
"text": "500"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "# Write out a file.\nwith open('biotech500.json', 'wb') as fp:\n json.dump(r, fp)",
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We later developed a much more sophisticated way to get huge amounts of data from the API. To see how we collected data sets, see the \n[batch data collection notebook](http://nbviewer.ipython.org/github/cmgerber/PLOS_Cloud_Explorer/blob/master/ipython_notebooks/Batch_data_collection_full.ipynb)."
},
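{
"metadata": {},
"cell_type": "markdown",
"source": "A minimal sketch of the batching idea (the real implementation is in the linked notebook; the page size and query handling here are illustrative assumptions, not the exact values we used): page through the results by incrementing `start`, and let `retry` absorb transient API failures."
},
{
"metadata": {},
"cell_type": "code",
"input": "@retry(stop_max_attempt_number=3, wait_fixed=2000)\ndef fetch_page(query, start, rows=500):\n    # Fetch one page of results from the PLOS Search API.\n    url = ('http://api.plos.org/search?q=%s&start=%d&rows=%d&api_key=%s&wt=json'\n           % (urllib.quote(query), start, rows, settings.PLOS_KEY))\n    return requests.get(url).json()\n\n\ndef fetch_all(query, rows=500):\n    # Keep paging until a short page signals the end of the result set.\n    docs = []\n    start = 0\n    while True:\n        page = fetch_page(query, start, rows)['response']['docs']\n        docs.extend(page)\n        if len(page) < rows:\n            return docs\n        start += rows\n        time.sleep(1)  # Be polite to the API.",
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},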
{
"metadata": {},
"cell_type": "heading",
"source": "Exploring Output",
"level": 2
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Here we show what the output looks like, from a previously run API query. Through the magic of Python, we can pickle the resulting DataFrame and access it again now without making any API calls."
},
{
"metadata": {},
"cell_type": "code",
"input": "abstract_df = pd.read_pickle('../data/abstract_df.pkl')",
"prompt_number": 8,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "len(list(abstract_df.author))",
"prompt_number": 9,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 9,
"metadata": {},
"text": "1120"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "print list(abstract_df.subject)[0]",
"prompt_number": 10,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "[u'/Computer and information sciences/Information technology/Data processing', u'/Computer and information sciences/Information technology/Data reduction', u'/Physical sciences/Mathematics/Statistics (mathematics)/Statistical methods', u'/Research and analysis methods/Mathematical and statistical techniques/Statistical methods', u'/Computer and information sciences/Information technology/Databases', u'/Physical sciences/Mathematics/Statistics (mathematics)/Statistical data', u'/Computer and information sciences/Computer architecture/User interfaces', u'/Medicine and health sciences/Infectious diseases/Infectious disease control', u'/Computer and information sciences/Data management']\n"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "abstract_df.tail()",
"prompt_number": 11,
"outputs": [
{
"output_type": "pyout",
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>abstract</th>\n <th>author</th>\n <th>id</th>\n <th>journal</th>\n <th>publication_date</th>\n <th>score</th>\n <th>subject</th>\n <th>title_display</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>15</th>\n <td> [\\nPopulation structure can confound the ident...</td>\n <td> [Jonathan Carlson, Carl Kadie, Simon Mallal, D...</td>\n <td> 10.1371/journal.pone.0000591</td>\n <td> PLoS ONE</td>\n <td> 2007-07-04T00:00:00Z</td>\n <td> 0.443733</td>\n <td> [/Biology and life sciences/Genetics/Phenotype...</td>\n <td> Leveraging Hierarchical Population Structure i...</td>\n </tr>\n <tr>\n <th>16</th>\n <td> [\\n The discrimination of thatcherized ...</td>\n <td> [Nick Donnelly, Nicole R Zürcher, Katherine Co...</td>\n <td> 10.1371/journal.pone.0023340</td>\n <td> PLoS ONE</td>\n <td> 2011-08-31T00:00:00Z</td>\n <td> 0.443733</td>\n <td> [/Medicine and health sciences/Diagnostic medi...</td>\n <td> Discriminating Grotesque from Typical Faces: E...</td>\n </tr>\n <tr>\n <th>17</th>\n <td> [\\nInfluenza viruses have been responsible for...</td>\n <td> [Zhipeng Cai, Tong Zhang, Xiu-Feng Wan]</td>\n <td> 10.1371/journal.pcbi.1000949</td>\n <td> PLoS Computational Biology</td>\n <td> 2010-10-07T00:00:00Z</td>\n <td> 0.443733</td>\n <td> [/Biology and life sciences/Organisms/Viruses/...</td>\n <td> A Computational Framework for Influenza Antige...</td>\n </tr>\n <tr>\n <th>18</th>\n <td> [\\n Based on previous evidence for indi...</td>\n <td> [Luis F H Basile, João R Sato, Milkes Y Alvare...</td>\n <td> 10.1371/journal.pone.0059595</td>\n <td> PLoS ONE</td>\n <td> 2013-03-27T00:00:00Z</td>\n <td> 0.443733</td>\n <td> [/Medicine and health sciences/Diagnostic medi...</td>\n <td> Lack of Systematic Topographic Difference betw...</td>\n </tr>\n <tr>\n <th>19</th>\n <td> [Objective: Herpes simplex virus type 
2 (HSV-2...</td>\n <td> [Alison C Roxby, Alison L Drake, Francisca Ong...</td>\n <td> 10.1371/journal.pone.0038622</td>\n <td> PLoS ONE</td>\n <td> 2012-06-12T00:00:00Z</td>\n <td> 0.443733</td>\n <td> [/Medicine and health sciences/Women's health/...</td>\n <td> Effects of Valacyclovir on Markers of Disease ...</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 8 columns</p>\n</div>",
"metadata": {},
"prompt_number": 11,
"text": " abstract \\\n15 [\\nPopulation structure can confound the ident... \n16 [\\n The discrimination of thatcherized ... \n17 [\\nInfluenza viruses have been responsible for... \n18 [\\n Based on previous evidence for indi... \n19 [Objective: Herpes simplex virus type 2 (HSV-2... \n\n author \\\n15 [Jonathan Carlson, Carl Kadie, Simon Mallal, D... \n16 [Nick Donnelly, Nicole R Zürcher, Katherine Co... \n17 [Zhipeng Cai, Tong Zhang, Xiu-Feng Wan] \n18 [Luis F H Basile, João R Sato, Milkes Y Alvare... \n19 [Alison C Roxby, Alison L Drake, Francisca Ong... \n\n id journal \\\n15 10.1371/journal.pone.0000591 PLoS ONE \n16 10.1371/journal.pone.0023340 PLoS ONE \n17 10.1371/journal.pcbi.1000949 PLoS Computational Biology \n18 10.1371/journal.pone.0059595 PLoS ONE \n19 10.1371/journal.pone.0038622 PLoS ONE \n\n publication_date score \\\n15 2007-07-04T00:00:00Z 0.443733 \n16 2011-08-31T00:00:00Z 0.443733 \n17 2010-10-07T00:00:00Z 0.443733 \n18 2013-03-27T00:00:00Z 0.443733 \n19 2012-06-12T00:00:00Z 0.443733 \n\n subject \\\n15 [/Biology and life sciences/Genetics/Phenotype... \n16 [/Medicine and health sciences/Diagnostic medi... \n17 [/Biology and life sciences/Organisms/Viruses/... \n18 [/Medicine and health sciences/Diagnostic medi... \n19 [/Medicine and health sciences/Women's health/... \n\n title_display \n15 Leveraging Hierarchical Population Structure i... \n16 Discriminating Grotesque from Typical Faces: E... \n17 A Computational Framework for Influenza Antige... \n18 Lack of Systematic Topographic Difference betw... \n19 Effects of Valacyclovir on Markers of Disease ... \n\n[5 rows x 8 columns]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Initial attempts to make word clouds using abstracts",
"level": 1
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We wanted to use basic natural language processing (NLP) to make word clouds out of aggregated abstract text, and see how they change over time.\n\nNB: These examples use a previously collected dataset that's different and smaller than the one we generated above."
},
{
"metadata": {},
"cell_type": "code",
"input": "# Globally define a set of stopwords.\nstops = set(stopwords.words('english'))\n# We can add science-y stuff to it as well. Just an example:\nstops.add('conclusions')\n\n\ndef wordify(abs_list, min_word_len=2):\n '''\n Convert the abstract field from PLoS API data to a filtered list of words.\n '''\n\n # The abstract field is a list. Make it a string.\n text = ' '.join(abs_list).strip(' \\n\\t')\n\n if text == '':\n return nan\n\n else:\n # Remove punctuation & replace with space,\n # because we want 'metal-contaminated' => 'metal contaminated'\n # ...not 'metalcontaminated', and so on.\n for c in string.punctuation:\n text = text.replace(c, ' ')\n\n # Now make it a Series of words, and do some cleaning.\n words = Series(text.split(' '))\n words = words.str.lower()\n # Filter out words less than minimum word length.\n words = words[words.str.len() >= min_word_len]\n words = words[~words.str.contains(r'[^#@a-z]')] # What exactly does this do?\n\n # Filter out globally-defined stopwords\n ignore = stops & set(words.unique())\n words_out = [w for w in words.tolist() if w not in ignore]\n\n return words_out\n",
"prompt_number": 12,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
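{
"metadata": {},
"cell_type": "markdown",
"source": "A quick sanity check of `wordify` on a made-up abstract (not from the dataset): punctuation becomes spaces, everything is lowercased, and stopwords, sub-length tokens, and tokens containing digits are dropped."
},
{
"metadata": {},
"cell_type": "code",
"input": "wordify(['Metal-contaminated soils: a 2-year study of remediation.'])",
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},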
{
"metadata": {},
"cell_type": "markdown",
"source": "Load up some data."
},
{
"metadata": {},
"cell_type": "code",
"input": "with open('biotech500.json', 'rb') as fp:\n    data = json.load(fp)\n\narticles_list = data['response']['docs']\narticles = DataFrame(articles_list)\narticles = articles[articles['abstract'].notnull()]\narticles.head()",
"prompt_number": 13,
"outputs": [
{
"output_type": "pyout",
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>abstract</th>\n <th>article_type</th>\n <th>author_display</th>\n <th>eissn</th>\n <th>id</th>\n <th>journal</th>\n <th>publication_date</th>\n <th>score</th>\n <th>title_display</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>7 </th>\n <td> [\\nThe objective of this paper is to assess th...</td>\n <td> Research Article</td>\n <td> [Latifah Amin, Md. Abul Kalam Azad, Mohd Hanaf...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0086174</td>\n <td> PLoS ONE</td>\n <td> 2014-01-29T00:00:00Z</td>\n <td> 1.211935</td>\n <td> Determinants of Public Attitudes to Geneticall...</td>\n </tr>\n <tr>\n <th>16</th>\n <td> [\\n Atrazine (ATZ) and S-metolachlor (S...</td>\n <td> Research Article</td>\n <td> [Cristina A. Viegas, Catarina Costa, Sandra An...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0037140</td>\n <td> PLoS ONE</td>\n <td> 2012-05-15T00:00:00Z</td>\n <td> 1.119538</td>\n <td> Does &lt;i&gt;S&lt;/i&gt;-Metolachlor Affect the Performan...</td>\n </tr>\n <tr>\n <th>17</th>\n <td> [\\nDue to environmental persistence and biotox...</td>\n <td> Research Article</td>\n <td> [Yonggang Yang, Meiying Xu, Zhili He, Jun Guo,...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0070686</td>\n <td> PLoS ONE</td>\n <td> 2013-08-05T00:00:00Z</td>\n <td> 1.119538</td>\n <td> Microbial Electricity Generation Enhances Deca...</td>\n </tr>\n <tr>\n <th>34</th>\n <td> [\\n Intensive use of chlorpyrifos has r...</td>\n <td> Research Article</td>\n <td> [Shaohua Chen, Chenglan Liu, Chuyan Peng, Hong...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0047205</td>\n <td> NaN</td>\n <td> 2012-10-08T00:00:00Z</td>\n <td> 1.119538</td>\n <td> Biodegradation of Chlorpyrifos and Its Hydroly...</td>\n </tr>\n <tr>\n <th>35</th>\n <td> [Background: The complex characteristics and 
u...</td>\n <td> Research Article</td>\n <td> [Zhongbo Zhou, Fangang Meng, So-Ryong Chae, Gu...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0042270</td>\n <td> NaN</td>\n <td> 2012-08-09T00:00:00Z</td>\n <td> 0.989541</td>\n <td> Microbial Transformation of Biomacromolecules ...</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 9 columns</p>\n</div>",
"metadata": {},
"prompt_number": 13,
"text": " abstract article_type \\\n7 [\\nThe objective of this paper is to assess th... Research Article \n16 [\\n Atrazine (ATZ) and S-metolachlor (S... Research Article \n17 [\\nDue to environmental persistence and biotox... Research Article \n34 [\\n Intensive use of chlorpyrifos has r... Research Article \n35 [Background: The complex characteristics and u... Research Article \n\n author_display eissn \\\n7 [Latifah Amin, Md. Abul Kalam Azad, Mohd Hanaf... 1932-6203 \n16 [Cristina A. Viegas, Catarina Costa, Sandra An... 1932-6203 \n17 [Yonggang Yang, Meiying Xu, Zhili He, Jun Guo,... 1932-6203 \n34 [Shaohua Chen, Chenglan Liu, Chuyan Peng, Hong... 1932-6203 \n35 [Zhongbo Zhou, Fangang Meng, So-Ryong Chae, Gu... 1932-6203 \n\n id journal publication_date score \\\n7 10.1371/journal.pone.0086174 PLoS ONE 2014-01-29T00:00:00Z 1.211935 \n16 10.1371/journal.pone.0037140 PLoS ONE 2012-05-15T00:00:00Z 1.119538 \n17 10.1371/journal.pone.0070686 PLoS ONE 2013-08-05T00:00:00Z 1.119538 \n34 10.1371/journal.pone.0047205 NaN 2012-10-08T00:00:00Z 1.119538 \n35 10.1371/journal.pone.0042270 NaN 2012-08-09T00:00:00Z 0.989541 \n\n title_display \n7 Determinants of Public Attitudes to Geneticall... \n16 Does <i>S</i>-Metolachlor Affect the Performan... \n17 Microbial Electricity Generation Enhances Deca... \n34 Biodegradation of Chlorpyrifos and Its Hydroly... \n35 Microbial Transformation of Biomacromolecules ... \n\n[5 rows x 9 columns]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Applying this to the whole DataFrame of articles"
},
{
"metadata": {},
"cell_type": "code",
"input": "articles['words'] = articles.apply(lambda s: wordify(s['abstract'] + [s['title_display']]), axis=1)\narticles.drop(['article_type', 'score', 'title_display', 'abstract'], axis=1, inplace=True)\narticles.head()",
"prompt_number": 14,
"outputs": [
{
"output_type": "pyout",
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>author_display</th>\n <th>eissn</th>\n <th>id</th>\n <th>journal</th>\n <th>publication_date</th>\n <th>words</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>7 </th>\n <td> [Latifah Amin, Md. Abul Kalam Azad, Mohd Hanaf...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0086174</td>\n <td> PLoS ONE</td>\n <td> 2014-01-29T00:00:00Z</td>\n <td> [objective, paper, assess, attitude, malaysian...</td>\n </tr>\n <tr>\n <th>16</th>\n <td> [Cristina A. Viegas, Catarina Costa, Sandra An...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0037140</td>\n <td> PLoS ONE</td>\n <td> 2012-05-15T00:00:00Z</td>\n <td> [atrazine, atz, metolachlor, met, two, herbici...</td>\n </tr>\n <tr>\n <th>17</th>\n <td> [Yonggang Yang, Meiying Xu, Zhili He, Jun Guo,...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0070686</td>\n <td> PLoS ONE</td>\n <td> 2013-08-05T00:00:00Z</td>\n <td> [due, environmental, persistence, biotoxicity,...</td>\n </tr>\n <tr>\n <th>34</th>\n <td> [Shaohua Chen, Chenglan Liu, Chuyan Peng, Hong...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0047205</td>\n <td> NaN</td>\n <td> 2012-10-08T00:00:00Z</td>\n <td> [intensive, use, chlorpyrifos, resulted, ubiqu...</td>\n </tr>\n <tr>\n <th>35</th>\n <td> [Zhongbo Zhou, Fangang Meng, So-Ryong Chae, Gu...</td>\n <td> 1932-6203</td>\n <td> 10.1371/journal.pone.0042270</td>\n <td> NaN</td>\n <td> 2012-08-09T00:00:00Z</td>\n <td> [background, complex, characteristics, unclear...</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 6 columns</p>\n</div>",
"metadata": {},
"prompt_number": 14,
"text": " author_display eissn \\\n7 [Latifah Amin, Md. Abul Kalam Azad, Mohd Hanaf... 1932-6203 \n16 [Cristina A. Viegas, Catarina Costa, Sandra An... 1932-6203 \n17 [Yonggang Yang, Meiying Xu, Zhili He, Jun Guo,... 1932-6203 \n34 [Shaohua Chen, Chenglan Liu, Chuyan Peng, Hong... 1932-6203 \n35 [Zhongbo Zhou, Fangang Meng, So-Ryong Chae, Gu... 1932-6203 \n\n id journal publication_date \\\n7 10.1371/journal.pone.0086174 PLoS ONE 2014-01-29T00:00:00Z \n16 10.1371/journal.pone.0037140 PLoS ONE 2012-05-15T00:00:00Z \n17 10.1371/journal.pone.0070686 PLoS ONE 2013-08-05T00:00:00Z \n34 10.1371/journal.pone.0047205 NaN 2012-10-08T00:00:00Z \n35 10.1371/journal.pone.0042270 NaN 2012-08-09T00:00:00Z \n\n words \n7 [objective, paper, assess, attitude, malaysian... \n16 [atrazine, atz, metolachlor, met, two, herbici... \n17 [due, environmental, persistence, biotoxicity,... \n34 [intensive, use, chlorpyrifos, resulted, ubiqu... \n35 [background, complex, characteristics, unclear... \n\n[5 rows x 6 columns]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Doing some natural language processing",
"level": 3
},
{
"metadata": {},
"cell_type": "code",
"input": "abs_df = DataFrame(articles['words'].apply(lambda x: ' '.join(x)).tolist(), columns=['text'])\nabs_df.head()",
"prompt_number": 15,
"outputs": [
{
"output_type": "pyout",
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>text</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td> objective paper assess attitude malaysian stak...</td>\n </tr>\n <tr>\n <th>1</th>\n <td> atrazine atz metolachlor met two herbicides wi...</td>\n </tr>\n <tr>\n <th>2</th>\n <td> due environmental persistence biotoxicity poly...</td>\n </tr>\n <tr>\n <th>3</th>\n <td> intensive use chlorpyrifos resulted ubiquitous...</td>\n </tr>\n <tr>\n <th>4</th>\n <td> background complex characteristics unclear bio...</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 1 columns</p>\n</div>",
"metadata": {},
"prompt_number": 15,
"text": " text\n0 objective paper assess attitude malaysian stak...\n1 atrazine atz metolachlor met two herbicides wi...\n2 due environmental persistence biotoxicity poly...\n3 intensive use chlorpyrifos resulted ubiquitous...\n4 background complex characteristics unclear bio...\n\n[5 rows x 1 columns]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Common word pairs",
"level": 3
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This section uses all words from abstracts to find the common word pairs."
},
{
"metadata": {},
"cell_type": "code",
"input": "#include all words from abstracts for getting common word pairs\nwords_all = pd.Series(' '.join(abs_df['text']).split(' '))\nwords_all.value_counts()",
"prompt_number": 16,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 16,
"metadata": {},
"text": "study 56\nusing 33\ntwo 32\npatients 31\nbiodegradation 30\nnon 29\ndata 28\nthree 28\nanalysis 27\ncompared 27\nsoil 27\nnew 27\nresults 26\nspecies 25\ncell 25\n...\nengage 1\nthermal 1\ngeochip 1\ndominant 1\nsuggests 1\nthird 1\nusually 1\nlocomotion 1\nrpos 1\nscales 1\nprefer 1\nquite 1\nprotocatechuate 1\nroutine 1\nagr 1\nLength: 3028, dtype: int64"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "relevant_words_pairs = words_all.copy()\nrelevant_words_pairs.value_counts()",
"prompt_number": 17,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 17,
"metadata": {},
"text": "study 56\nusing 33\ntwo 32\npatients 31\nbiodegradation 30\nnon 29\ndata 28\nthree 28\nanalysis 27\ncompared 27\nsoil 27\nnew 27\nresults 26\nspecies 25\ncell 25\n...\nengage 1\nthermal 1\ngeochip 1\ndominant 1\nsuggests 1\nthird 1\nusually 1\nlocomotion 1\nrpos 1\nscales 1\nprefer 1\nquite 1\nprotocatechuate 1\nroutine 1\nagr 1\nLength: 3028, dtype: int64"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "bcf = BigramCollocationFinder.from_words(relevant_words_pairs)\nfor pair in bcf.nbest(BigramAssocMeasures.likelihood_ratio, 30):\n print ' '.join(pair)",
"prompt_number": 18,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "synthetic biology\nspider silk\nes cell\nadjacent segment\nmedical imaging\ndp dtmax\nsecurity privacy\nindustry backgrounds\nremoval initiation\nuv irradiated\ngm salmon\npersistent crsab\nantimicrobial therapy\nlimb amputation\ncellular phone\nwireless powered\nminimally invasive\nphone technology\nheavy metals\nbattery powered\ncomposite mesh\nfrequency currents\ngenetically modified\ntissue engineering\ncatheter removal\nacting reversible\nbrassica napus\nbrown streak\nquasi stiffness\ndata code\n"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "bcf.nbest(BigramAssocMeasures.likelihood_ratio, 20)",
"prompt_number": 19,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 19,
"metadata": {},
"text": "[(u'synthetic', u'biology'),\n (u'spider', u'silk'),\n (u'es', u'cell'),\n (u'adjacent', u'segment'),\n (u'medical', u'imaging'),\n (u'dp', u'dtmax'),\n (u'security', u'privacy'),\n (u'industry', u'backgrounds'),\n (u'removal', u'initiation'),\n (u'uv', u'irradiated'),\n (u'gm', u'salmon'),\n (u'persistent', u'crsab'),\n (u'antimicrobial', u'therapy'),\n (u'limb', u'amputation'),\n (u'cellular', u'phone'),\n (u'wireless', u'powered'),\n (u'minimally', u'invasive'),\n (u'phone', u'technology'),\n (u'heavy', u'metals'),\n (u'battery', u'powered')]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Making word clouds: select the top words",
"level": 2
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Here, we takes only unique words from each abstract."
},
{
"metadata": {},
"cell_type": "code",
"input": "abs_set_df = DataFrame(articles['words'].apply(lambda x: ' '.join(set(x))).tolist(), columns=['text'])\nabs_set_df.head()",
"prompt_number": 20,
"outputs": [
{
"output_type": "pyout",
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>text</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td> among developed attitude paper identify accept...</td>\n </tr>\n <tr>\n <th>1</th>\n <td> aquatic mineralization dose experiments still ...</td>\n </tr>\n <tr>\n <th>2</th>\n <td> mfc hypothesized distinctly results nitrogen s...</td>\n </tr>\n <tr>\n <th>3</th>\n <td> fungal contaminant tcp accumulative gc morphol...</td>\n </tr>\n <tr>\n <th>4</th>\n <td> origin humic mineralization show mainly result...</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 1 columns</p>\n</div>",
"metadata": {},
"prompt_number": 20,
"text": " text\n0 among developed attitude paper identify accept...\n1 aquatic mineralization dose experiments still ...\n2 mfc hypothesized distinctly results nitrogen s...\n3 fungal contaminant tcp accumulative gc morphol...\n4 origin humic mineralization show mainly result...\n\n[5 rows x 1 columns]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "words = pd.Series(' '.join(abs_set_df['text']).split(' '))\nwords.value_counts()",
"prompt_number": 21,
"outputs": [
{
"output_type": "pyout",
"prompt_number": 21,
"metadata": {},
"text": "study 38\ntwo 23\nusing 21\nresults 20\nthree 20\nanalysis 20\ncompared 17\nused 16\nhigher 16\nmay 16\nnon 15\nbased 15\nsignificantly 14\nalso 14\nhowever 14\n...\nseptal 1\nrecommendations 1\ngenomes 1\npoking 1\ngck 1\noptimised 1\nvaried 1\ncounting 1\nmonitoring 1\nmalware 1\ntmc 1\nrape 1\noccur 1\nconversely 1\ncda 1\nLength: 3028, dtype: int64"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "top_words = words.value_counts().reset_index()\ntop_words.columns = ['word', 'count']\ntop_words.head(15)",
"prompt_number": 22,
"outputs": [
{
"output_type": "pyout",
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>word</th>\n <th>count</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0 </th>\n <td> study</td>\n <td> 38</td>\n </tr>\n <tr>\n <th>1 </th>\n <td> two</td>\n <td> 23</td>\n </tr>\n <tr>\n <th>2 </th>\n <td> using</td>\n <td> 21</td>\n </tr>\n <tr>\n <th>3 </th>\n <td> results</td>\n <td> 20</td>\n </tr>\n <tr>\n <th>4 </th>\n <td> three</td>\n <td> 20</td>\n </tr>\n <tr>\n <th>5 </th>\n <td> analysis</td>\n <td> 20</td>\n </tr>\n <tr>\n <th>6 </th>\n <td> compared</td>\n <td> 17</td>\n </tr>\n <tr>\n <th>7 </th>\n <td> used</td>\n <td> 16</td>\n </tr>\n <tr>\n <th>8 </th>\n <td> higher</td>\n <td> 16</td>\n </tr>\n <tr>\n <th>9 </th>\n <td> may</td>\n <td> 16</td>\n </tr>\n <tr>\n <th>10</th>\n <td> non</td>\n <td> 15</td>\n </tr>\n <tr>\n <th>11</th>\n <td> based</td>\n <td> 15</td>\n </tr>\n <tr>\n <th>12</th>\n <td> significantly</td>\n <td> 14</td>\n </tr>\n <tr>\n <th>13</th>\n <td> also</td>\n <td> 14</td>\n </tr>\n <tr>\n <th>14</th>\n <td> however</td>\n <td> 14</td>\n </tr>\n </tbody>\n</table>\n<p>15 rows × 2 columns</p>\n</div>",
"metadata": {},
"prompt_number": 22,
"text": " word count\n0 study 38\n1 two 23\n2 using 21\n3 results 20\n4 three 20\n5 analysis 20\n6 compared 17\n7 used 16\n8 higher 16\n9 may 16\n10 non 15\n11 based 15\n12 significantly 14\n13 also 14\n14 however 14\n\n[15 rows x 2 columns]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Exporting word count data as CSV for D3 word-cloudification",
"level": 3
},
{
"metadata": {},
"cell_type": "code",
"input": "# top_words.to_csv('../wordcloud2.csv', index=False)",
"prompt_number": 23,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Initial word cloud results",
"level": 2
},
{
"metadata": {},
"cell_type": "markdown",
"source": "When we created the word clouds, we noticed something about the most common words in these article abstracts... \n\n![cloud](../wordcloud_example_old.jpg)"
},
{
"metadata": {},
"cell_type": "heading",
"source": "Change over time: working with article abstracts as time series data",
"level": 2
},
{
"metadata": {},
"cell_type": "code",
"input": "articles_list = data['response']['docs']\narticles = DataFrame(articles_list)\narticles = articles[articles['abstract'].notnull()].ix[:,['abstract', 'publication_date']]\narticles.abstract = articles.abstract.apply(wordify, 3)\narticles = articles[articles['abstract'].notnull()]\narticles.publication_date = pd.to_datetime(articles.publication_date)\narticles.head()",
"prompt_number": 24,
"outputs": [
{
"output_type": "pyout",
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>abstract</th>\n <th>publication_date</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>7 </th>\n <td> [objective, paper, assess, attitude, malaysian...</td>\n <td>2014-01-29</td>\n </tr>\n <tr>\n <th>16</th>\n <td> [atrazine, atz, metolachlor, met, two, herbici...</td>\n <td>2012-05-15</td>\n </tr>\n <tr>\n <th>17</th>\n <td> [due, environmental, persistence, biotoxicity,...</td>\n <td>2013-08-05</td>\n </tr>\n <tr>\n <th>34</th>\n <td> [intensive, use, chlorpyrifos, resulted, ubiqu...</td>\n <td>2012-10-08</td>\n </tr>\n <tr>\n <th>35</th>\n <td> [background, complex, characteristics, unclear...</td>\n <td>2012-08-09</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 2 columns</p>\n</div>",
"metadata": {},
"prompt_number": 24,
"text": " abstract publication_date\n7 [objective, paper, assess, attitude, malaysian... 2014-01-29\n16 [atrazine, atz, metolachlor, met, two, herbici... 2012-05-15\n17 [due, environmental, persistence, biotoxicity,... 2013-08-05\n34 [intensive, use, chlorpyrifos, resulted, ubiqu... 2012-10-08\n35 [background, complex, characteristics, unclear... 2012-08-09\n\n[5 rows x 2 columns]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "print articles.publication_date.min(), articles.publication_date.max()\nprint len(articles)",
"prompt_number": 25,
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": "2008-04-30 00:00:00 2014-04-11 00:00:00\n57\n"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "The time series spans ~9 years with 57 data points. **We need to resample!**\n\nThere are probably many ways to do this..."
},
{
"metadata": {},
"cell_type": "code",
"input": "articles_timed = articles.set_index('publication_date')\narticles_timed.head()",
"prompt_number": 26,
"outputs": [
{
"output_type": "pyout",
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>abstract</th>\n </tr>\n <tr>\n <th>publication_date</th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>2014-01-29</th>\n <td> [objective, paper, assess, attitude, malaysian...</td>\n </tr>\n <tr>\n <th>2012-05-15</th>\n <td> [atrazine, atz, metolachlor, met, two, herbici...</td>\n </tr>\n <tr>\n <th>2013-08-05</th>\n <td> [due, environmental, persistence, biotoxicity,...</td>\n </tr>\n <tr>\n <th>2012-10-08</th>\n <td> [intensive, use, chlorpyrifos, resulted, ubiqu...</td>\n </tr>\n <tr>\n <th>2012-08-09</th>\n <td> [background, complex, characteristics, unclear...</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 1 columns</p>\n</div>",
"metadata": {},
"prompt_number": 26,
"text": " abstract\npublication_date \n2014-01-29 [objective, paper, assess, attitude, malaysian...\n2012-05-15 [atrazine, atz, metolachlor, met, two, herbici...\n2013-08-05 [due, environmental, persistence, biotoxicity,...\n2012-10-08 [intensive, use, chlorpyrifos, resulted, ubiqu...\n2012-08-09 [background, complex, characteristics, unclear...\n\n[5 rows x 1 columns]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Using pandas time series resampling functions",
"level": 3
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Using the `sum` aggregation method works because all the values were lists. The three abstracts published in 2013-05 were concatenated together (see below)."
},
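{
"metadata": {},
"cell_type": "markdown",
"source": "(A quick illustrative aside, not part of the original analysis: Python's built-in `sum` concatenates lists when given a list as the start value, which is what the `how='sum'` aggregation in the next cell relies on.)"
},
{
"metadata": {},
"cell_type": "code",
"input": "# sum() with a list start value concatenates the lists in order\nsum([['a', 'b'], ['c'], ['d', 'e']], [])  # -> ['a', 'b', 'c', 'd', 'e']",
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},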
{
"metadata": {},
"cell_type": "code",
"input": "articles_monthly = articles_timed.resample('M', how='sum', fill_method='ffill', kind='period')\narticles_monthly.abstract = articles_monthly.abstract.apply(lambda x: np.nan if x == 0 else x)\narticles_monthly.fillna(method='ffill', inplace=True)\narticles_monthly.head()",
"prompt_number": 27,
"outputs": [
{
"output_type": "pyout",
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>abstract</th>\n </tr>\n <tr>\n <th>publication_date</th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>2008-04</th>\n <td> [according, world, health, organization, repor...</td>\n </tr>\n <tr>\n <th>2008-05</th>\n <td> [according, world, health, organization, repor...</td>\n </tr>\n <tr>\n <th>2008-06</th>\n <td> [according, world, health, organization, repor...</td>\n </tr>\n <tr>\n <th>2008-07</th>\n <td> [according, world, health, organization, repor...</td>\n </tr>\n <tr>\n <th>2008-08</th>\n <td> [according, world, health, organization, repor...</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 1 columns</p>\n</div>",
"metadata": {},
"prompt_number": 27,
"text": " abstract\npublication_date \n2008-04 [according, world, health, organization, repor...\n2008-05 [according, world, health, organization, repor...\n2008-06 [according, world, health, organization, repor...\n2008-07 [according, world, health, organization, repor...\n2008-08 [according, world, health, organization, repor...\n\n[5 rows x 1 columns]"
}
],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "heading",
"source": "Making a time slider for abstract text",
"level": 3
},
{
"metadata": {},
"cell_type": "code",
"input": "widgetmax = len(articles_monthly) - 1\n\ndef textbarf(t): \n html_template = \"\"\"\n <style>\n #textbarf {\n display: block;\n width: 666px;\n padding: 23px;\n background-color: #ddeeff;\n }\n </style>\n <div id=\"textbarf\"> {{blargh}} </div>\"\"\"\n\n blob = ' '.join(articles_monthly.ix[t]['abstract'])\n html_src = Template(html_template).render(blargh=blob)\n display(HTML(html_src))\n",
"prompt_number": 28,
"outputs": [],
"language": "python",
"trusted": false,
"collapsed": false
},
{
"metadata": {},
"cell_type": "code",
"input": "widgets.interact(textbarf,\n t=widgets.IntSliderWidget(min=0,max=widgetmax,step=1,value=42),\n )",
"prompt_number": 29,
"outputs": [
{
"output_type": "display_data",
"html": "\n <style>\n #textbarf {\n display: block;\n width: 666px;\n padding: 23px;\n background-color: #ddeeff;\n }\n </style>\n <div id=\"textbarf\"> concerns regarding commercial release genetically engineered ge crops include naturalization introgression sexually compatible relatives transfer beneficial traits native weedy species hybridization date documented reports escape leading researchers question environmental risks biotech products study conducted systematic roadside survey canola brassica napus populations growing outside cultivation north dakota usa dominant canola growing region document presence two escaped transgenic genotypes well non ge canola provide evidence novel combinations transgenic forms wild results demonstrate feral populations large widespread moreover flowering times escaped populations well fertile condition majority collections suggest populations established persistent outside cultivation </div>",
"metadata": {},
"text": "<IPython.core.display.HTML at 0x1099400d0>"
},
{
"output_type": "pyout",
"prompt_number": 29,
"metadata": {},
"text": "<function __main__.textbarf>"
}
],
"language": "python",
"trusted": false,
"collapsed": false
}
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:6d14f92bdff26eefbe3be93f0262e35304958ddcab6570dd624168a2d5567e61"
},
"nbformat": 3
}