@lawlesst
Last active July 7, 2020 16:06
tdm-pilot.org gists
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />\n",
"**For questions/comments/improvements, email nathan.kelber@ithaka.org.**<br />\n",
"![CC BY License Logo](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png)\n",
"___\n",
"\n",
"# Latent Dirichlet Allocation (LDA) Topic Modeling\n",
"\n",
"**Description of methods in this notebook:**\n",
"This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) demonstrates how to do topic modeling on a [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) using [Python](https://docs.tdm-pilot.org/key-terms/#python). The following processes are described:\n",
"\n",
"* Importing your [dataset](https://docs.tdm-pilot.org/key-terms/#dataset)\n",
"* Importing libraries including `gensim`, `nltk`, and `pyLDAvis`\n",
"* Writing a helper function to help clean up a single [token](https://docs.tdm-pilot.org/key-terms/#token)\n",
"* Building a gensim dictionary and training the model\n",
"* Computing a topic list\n",
"* Visualizing the topic list\n",
"\n",
"**Difficulty:** Intermediate\n",
"\n",
"**Purpose:** Learning (Optimized for explanation over code)\n",
"\n",
"**Knowledge Required:** \n",
"* [Python Basics I](./0-python-basics-1.ipynb)\n",
"* [Python Basics II](./0-python-basics-2.ipynb)\n",
"* [Python Basics III](./0-python-basics-3.ipynb)\n",
"\n",
"**Knowledge Recommended:**\n",
"* [Exploring Metadata](./1-metadata.ipynb)\n",
"* A familiarity with [gensim](https://docs.tdm-pilot.org/key-terms/#gensim) is helpful but not required.\n",
"\n",
"**Completion time:** 90 minutes\n",
"\n",
"**Data Format:** [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor)/[Portico](https://docs.tdm-pilot.org/key-terms/#portico) [JSON Lines (.jsonl)](https://docs.tdm-pilot.org/key-terms/#jsonl)\n",
"\n",
"**Libraries Used:**\n",
"* **[json](https://docs.tdm-pilot.org/key-terms/#json-python-library)** to convert our dataset from json lines format to a Python list\n",
"* **[gensim](https://docs.tdm-pilot.org/key-terms/#gensim)** to accomplish the topic modeling\n",
"* **[NLTK](https://docs.tdm-pilot.org/key-terms/#nltk)** to help [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up our dataset\n",
"* **pyldavis** to visualize our topic model\n",
"____"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is Topic Modeling?\n",
"\n",
"**Topic modeling** is a **machine learning** technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.\n",
"\n",
"**Topic modeling** is an unsupervised, clustering technique for text. We give the machine a series of texts that it then attempts to cluster the texts into a given number of topics. There is also a *supervised*, clustering technique called **Topic Classification**, where we supply the machine with examples of pre-labeled topics and then see if the machine can identify them given the examples.\n",
"\n",
"**Topic modeling** is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. **Topic Classification**, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it."
]
},
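{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before working with a real dataset, the cell below sketches the idea on a tiny, invented corpus of five one-sentence \"documents\" (the sentences are made up purely for illustration). Given only the word counts, the model tries to group words that tend to occur together into the requested number of topics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A toy illustration of topic modeling on an invented mini-corpus.\n",
"# The five \"documents\" below are made-up examples; the real analysis uses your dataset.\n",
"import gensim\n",
"\n",
"toy_docs = [\n",
"    \"the cast rehearsed the play before opening night\".split(),\n",
"    \"the actor forgot his lines during the play\".split(),\n",
"    \"the court heard the case and issued a ruling\".split(),\n",
"    \"the judge asked the lawyer to approach the bench\".split(),\n",
"    \"the theater review praised the director and the cast\".split()\n",
"]\n",
"\n",
"toy_dictionary = gensim.corpora.Dictionary(toy_docs)\n",
"toy_corpus = [toy_dictionary.doc2bow(doc) for doc in toy_docs]\n",
"\n",
"# Ask for two topics; with such a tiny corpus the groupings are only rough.\n",
"toy_model = gensim.models.LdaModel(corpus=toy_corpus, id2word=toy_dictionary, num_topics=2, passes=10)\n",
"\n",
"# Show the five most heavily weighted words in each topic.\n",
"for topic_num, terms in toy_model.print_topics(num_words=5):\n",
"    print(topic_num, terms)"
]
},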
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import your dataset\n",
"\n",
"You'll use the tdm_client library to automatically upload your dataset. We import the `Dataset` module from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset). To analyze your dataset, use the [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) provided when you created your [dataset](https://docs.tdm-pilot.org/key-terms//#dataset). A copy of your [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) was sent to your email when you created your [corpus](https://docs.tdm-pilot.org/key-terms/#corpus). It should look like a long series of characters surrounded by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a [list by discipline](https://docs.tdm-pilot.org/sample-datasets/). Advanced users can also [upload a dataset from their local machine](https://docs.tdm-pilot.org/uploading-a-dataset/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Importing your dataset with a dataset ID\n",
"import tdm_client\n",
"#Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.\n",
"tdm_client.get_dataset(\"7e41317e-740f-e86a-4729-20dab492e925\", \"sampleJournalAnalysis\") #Insert your dataset ID on this line\n",
"# Load the sample dataset, the full run of Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016).\n",
"#tdm_client.get_dataset(\"b4668c50-a970-c4d7-eb2c-bb6d04313542\", \"sampleJournalAnalysis\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function:\n",
"\n",
"* lowercases all tokens\n",
"* discards all tokens less than 4 characters\n",
"* discards non alphabetical tokens - e.g. --9\n",
"* removes stopwords using NLTK's stopword list\n",
"* Lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.stem.wordnet import WordNetLemmatizer\n",
"from nltk.corpus import stopwords\n",
"stop_words = set(stopwords.words('english'))\n",
"\n",
"def process_token(token):\n",
" token = token.lower()\n",
" if len(token) < 4:\n",
" return\n",
" if not(token.isalpha()):\n",
" return\n",
" if token in stop_words:\n",
" return\n",
" return WordNetLemmatizer().lemmatize(token)"
]
},
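{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, optional check, the cell below runs the helper on a few sample tokens (the example words are arbitrary). Tokens that are too short, non-alphabetic, or stopwords come back as `None` and will be skipped when the documents are built."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: run the cleaning function on a few sample tokens.\n",
"for sample in ['Whales', 'the', '--9', 'running', 'Directors']:\n",
"    print(repr(sample), '->', repr(process_token(sample)))"
]
},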
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loop through the documents in the dataset and build a list of doucments where each document is a list of tokens."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"documents = []\n",
"doc_count = 0\n",
"# Limit the number of documents, set to None to not limit.\n",
"limit_to = 25\n",
"\n",
"with open(\"./datasets/sampleJournalAnalysis.jsonl\") as input_file:\n",
" for line in input_file:\n",
" doc = json.loads(line)\n",
" unigram_count = doc[\"unigramCount\"]\n",
" document_tokens = []\n",
" for token, count in unigram_count.items():\n",
" clean_token = process_token(token)\n",
" if clean_token is None:\n",
" continue\n",
" document_tokens += [clean_token] * count\n",
" documents.append(document_tokens)\n",
" doc_count += 1 \n",
" if (limit_to is not None) and (doc_count >= limit_to):\n",
" break\n",
" "
]
},
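{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, check how many documents were read and roughly how many tokens survived cleaning before moving on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: report the number of documents read and the total tokens kept after cleaning.\n",
"print(len(documents), 'documents')\n",
"print(sum(len(tokens) for tokens in documents), 'tokens after cleaning')"
]
},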
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build a gensim dictionary corpus and then train the model. More information about parameters can be found at the [Gensim LDA Model page](https://radimrehurek.com/gensim/models/ldamodel.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gensim\n",
"\n",
"num_topics = 7 # Change the number of topics\n",
"\n",
"dictionary = gensim.corpora.Dictionary(documents)\n",
"\n",
"# Remove terms that appear in less than 10% of documents and more than 75% of documents.\n",
"dictionary.filter_extremes(no_below=doc_count * .10, no_above=0.75)\n",
"\n",
"bow_corpus = [dictionary.doc2bow(doc) for doc in documents]\n",
"\n",
"# Train the LDA model.\n",
"model = gensim.models.LdaModel(\n",
" corpus=bow_corpus,\n",
" id2word=dictionary,\n",
" num_topics=num_topics,\n",
" passes=20 # Change the number of passes or iterations\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the most significant terms, as determined by the model, for each topic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for topic_num in range(0, num_topics):\n",
" word_ids = model.get_topic_terms(topic_num)\n",
" words = []\n",
" for wid, weight in word_ids:\n",
" word = dictionary.id2token[wid]\n",
" words.append(word)\n",
" print(\"Topic {}\".format(str(topic_num).ljust(5)), \" \".join(words))"
]
},
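{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each document also has its own mixture of topics. As an optional follow-up, you can ask the model for the topic distribution of a single document, here the first one in the corpus."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the topic mixture of a single document (here, the first in the corpus).\n",
"for topic_num, proportion in model.get_document_topics(bow_corpus[0]):\n",
"    print('Topic', topic_num, round(float(proportion), 3))"
]
},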
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). This visualization takes several minutes to an hour to generate depending on the size of your dataset. To run, remove the `#` symbol on the line below and run the cell. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pyLDAvis.gensim\n",
"pyLDAvis.enable_notebook()\n",
"pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)"
]
},
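{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would like to keep or share the visualization outside the notebook, `pyLDAvis` can also write it to a standalone HTML file with `pyLDAvis.save_html`; the file name below is just an example. (In newer releases of pyLDAvis, the gensim helper module is named `pyLDAvis.gensim_models` rather than `pyLDAvis.gensim`.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: save the interactive visualization to a standalone HTML file.\n",
"# Note that this re-runs the (slow) prepare step; the output file name is just an example.\n",
"visualization = pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)\n",
"pyLDAvis.save_html(visualization, 'topic_model_visualization.html')"
]
},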
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}
#!/bin/bash
# Download a TDM dataset into ./datasets/ as a gzipped JSON Lines file.
# Arguments: $1 = dataset ID, $2 = optional output file name (defaults to the dataset ID).
set -e
#service=http://localhost:5000/dl
service=https://www.jstor.org/api/tdm/v1
fname=$2
if [ -z "${fname}" ]; then
    fname=$1
fi
mkdir -p datasets
# Ask the service for the dataset's metadata and extract the time-limited download URL.
dl=`curl -s $service/nb/dataset/$1/info |\
    grep -o 'https://ithaka-labs.*Expires\=[0-9]*'`
dset="./datasets/$fname.jsonl.gz"
# Fetch the dataset file.
wget -q -L --show-progress \
    -O $dset \
    --user-agent "tdm notebooks" \
    $dl
export DATASET_FILE=$dset
echo "Your dataset $1 is stored in: $dset"
jupyter-notebookparams
jupyter_contrib_nbextensions
pandas
matplotlib
seaborn
gensim
wordfreq
#!/bin/bash
/opt/conda/bin/python3
version=0.1
python -m nltk.downloader stopwords wordnet
jupyter contrib nbextension install --user
jupyter nbextension install jupyter_contrib_nbextensions/nbextensions/toc2 --user
jupyter nbextension enable toc2/main
jupyter nbextension enable --py jupyter_notebookparams
exec "$@"