@lawlesst
Last active July 7, 2020 16:06
tdm-pilot.org gists
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />\n",
"**For questions/comments/improvements, email nathan.kelber@ithaka.org.**<br />\n",
"![CC BY License Logo](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png)\n",
"___\n",
"\n",
"# Latent Dirichlet Allocation (LDA) Topic Modeling\n",
"\n",
"**Description of methods in this notebook:**\n",
"This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) demonstrates how to do topic modeling on a [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor) and/or [Portico](https://docs.tdm-pilot.org/key-terms/#portico) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) using [Python](https://docs.tdm-pilot.org/key-terms/#python). The following processes are described:\n",
"\n",
"* Importing your [dataset](https://docs.tdm-pilot.org/key-terms/#dataset)\n",
"* Importing libraries including `gensim`, `nltk`, and `pyLDAvis`\n",
"* Writing a helper function to help clean up a single [token](https://docs.tdm-pilot.org/key-terms/#token)\n",
"* Building a gensim dictionary and training the model\n",
"* Computing a topic list\n",
"* Visualizing the topic list\n",
"\n",
"**Difficulty:** Intermediate\n",
"\n",
"**Purpose:** Learning (Optimized for explanation over code)\n",
"\n",
"**Knowledge Required:** \n",
"* [Python Basics I](./0-python-basics-1.ipynb)\n",
"* [Python Basics II](./0-python-basics-2.ipynb)\n",
"* [Python Basics III](./0-python-basics-3.ipynb)\n",
"\n",
"**Knowledge Recommended:**\n",
"* [Exploring Metadata](./1-metadata.ipynb)\n",
"* A familiarity with [gensim](https://docs.tdm-pilot.org/key-terms/#gensim) is helpful but not required.\n",
"\n",
"**Completion time:** 90 minutes\n",
"\n",
"**Data Format:** [JSTOR](https://docs.tdm-pilot.org/key-terms/#jstor)/[Portico](https://docs.tdm-pilot.org/key-terms/#portico) [JSON Lines (.jsonl)](https://docs.tdm-pilot.org/key-terms/#jsonl)\n",
"\n",
"**Libraries Used:**\n",
"* **[json](https://docs.tdm-pilot.org/key-terms/#json-python-library)** to convert our dataset from json lines format to a Python list\n",
"* **[gensim](https://docs.tdm-pilot.org/key-terms/#gensim)** to accomplish the topic modeling\n",
"* **[NLTK](https://docs.tdm-pilot.org/key-terms/#nltk)** to help [clean](https://docs.tdm-pilot.org/key-terms/#clean-data) up our dataset\n",
"* **pyldavis** to visualize our topic model\n",
"____"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is Topic Modeling?\n",
"\n",
"**Topic modeling** is a **machine learning** technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.\n",
"\n",
"**Topic modeling** is an unsupervised, clustering technique for text. We give the machine a series of texts that it then attempts to cluster the texts into a given number of topics. There is also a *supervised*, clustering technique called **Topic Classification**, where we supply the machine with examples of pre-labeled topics and then see if the machine can identify them given the examples.\n",
"\n",
"**Topic modeling** is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. **Topic Classification**, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it."
]
},
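{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before working with a real dataset, the cell below sketches the idea on a tiny, invented corpus of five one-sentence \"documents\" (the sentences are made up purely for illustration). Given only the word counts, the model tries to group words that tend to occur together into the requested number of topics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A toy illustration of topic modeling on an invented mini-corpus.\n",
"# The five \"documents\" below are made-up examples; the real analysis uses your dataset.\n",
"import gensim\n",
"\n",
"toy_docs = [\n",
"    \"the cast rehearsed the play before opening night\".split(),\n",
"    \"the actor forgot his lines during the play\".split(),\n",
"    \"the court heard the case and issued a ruling\".split(),\n",
"    \"the judge asked the lawyer to approach the bench\".split(),\n",
"    \"the theater review praised the director and the cast\".split()\n",
"]\n",
"\n",
"toy_dictionary = gensim.corpora.Dictionary(toy_docs)\n",
"toy_corpus = [toy_dictionary.doc2bow(doc) for doc in toy_docs]\n",
"\n",
"# Ask for two topics; with such a tiny corpus the groupings are only rough.\n",
"toy_model = gensim.models.LdaModel(corpus=toy_corpus, id2word=toy_dictionary, num_topics=2, passes=10)\n",
"\n",
"# Show the five most heavily weighted words in each topic.\n",
"for topic_num, terms in toy_model.print_topics(num_words=5):\n",
"    print(topic_num, terms)"
]
},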
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import your dataset\n",
"\n",
"You'll use the tdm_client library to automatically upload your dataset. We import the `Dataset` module from the `tdm_client` library. The tdm_client library contains functions for connecting to the JSTOR server containing our [corpus](https://docs.tdm-pilot.org/key-terms/#corpus) [dataset](https://docs.tdm-pilot.org/key-terms/#dataset). To analyze your dataset, use the [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) provided when you created your [dataset](https://docs.tdm-pilot.org/key-terms//#dataset). A copy of your [dataset ID](https://docs.tdm-pilot.org/key-terms//#dataset-ID) was sent to your email when you created your [corpus](https://docs.tdm-pilot.org/key-terms/#corpus). It should look like a long series of characters surrounded by dashes. If you haven't created a dataset, feel free to use a sample dataset. Here's a [list by discipline](https://docs.tdm-pilot.org/sample-datasets/). Advanced users can also [upload a dataset from their local machine](https://docs.tdm-pilot.org/uploading-a-dataset/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Importing your dataset with a dataset ID\n",
"import tdm_client\n",
"#Load the sample dataset, the full run of Shakespeare Quarterly from 1950-2013.\n",
"tdm_client.get_dataset(\"7e41317e-740f-e86a-4729-20dab492e925\", \"sampleJournalAnalysis\") #Insert your dataset ID on this line\n",
"# Load the sample dataset, the full run of Negro American Literature Forum (1967-1976) + Black American Literature Forum (1976-1991) + African American Review (1992-2016).\n",
"#tdm_client.get_dataset(\"b4668c50-a970-c4d7-eb2c-bb6d04313542\", \"sampleJournalAnalysis\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define a function for processing tokens from the extracted features for volumes in the curated dataset. This function:\n",
"\n",
"* lowercases all tokens\n",
"* discards all tokens less than 4 characters\n",
"* discards non alphabetical tokens - e.g. --9\n",
"* removes stopwords using NLTK's stopword list\n",
"* Lemmatizes the token using NLTK's [WordNetLemmatizer](https://www.nltk.org/_modules/nltk/stem/wordnet.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.stem.wordnet import WordNetLemmatizer\n",
"from nltk.corpus import stopwords\n",
"stop_words = set(stopwords.words('english'))\n",
"\n",
"def process_token(token):\n",
" token = token.lower()\n",
" if len(token) < 4:\n",
" return\n",
" if not(token.isalpha()):\n",
" return\n",
" if token in stop_words:\n",
" return\n",
" return WordNetLemmatizer().lemmatize(token)"
]
},
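{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, optional check, the cell below runs the helper on a few sample tokens (the example words are arbitrary). Tokens that are too short, non-alphabetic, or stopwords come back as `None` and will be skipped when the documents are built."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: run the cleaning function on a few sample tokens.\n",
"for sample in ['Whales', 'the', '--9', 'running', 'Directors']:\n",
"    print(repr(sample), '->', repr(process_token(sample)))"
]
},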
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loop through the documents in the dataset and build a list of doucments where each document is a list of tokens."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"documents = []\n",
"doc_count = 0\n",
"# Limit the number of documents, set to None to not limit.\n",
"limit_to = 25\n",
"\n",
"with open(\"./datasets/sampleJournalAnalysis.jsonl\") as input_file:\n",
" for line in input_file:\n",
" doc = json.loads(line)\n",
" unigram_count = doc[\"unigramCount\"]\n",
" document_tokens = []\n",
" for token, count in unigram_count.items():\n",
" clean_token = process_token(token)\n",
" if clean_token is None:\n",
" continue\n",
" document_tokens += [clean_token] * count\n",
" documents.append(document_tokens)\n",
" doc_count += 1 \n",
" if (limit_to is not None) and (doc_count >= limit_to):\n",
" break\n",
" "
]
},
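{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, check how many documents were read and roughly how many tokens survived cleaning before moving on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: report the number of documents read and the total tokens kept after cleaning.\n",
"print(len(documents), 'documents')\n",
"print(sum(len(tokens) for tokens in documents), 'tokens after cleaning')"
]
},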
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Build a gensim dictionary corpus and then train the model. More information about parameters can be found at the [Gensim LDA Model page](https://radimrehurek.com/gensim/models/ldamodel.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gensim\n",
"\n",
"num_topics = 7 # Change the number of topics\n",
"\n",
"dictionary = gensim.corpora.Dictionary(documents)\n",
"\n",
"# Remove terms that appear in less than 10% of documents and more than 75% of documents.\n",
"dictionary.filter_extremes(no_below=doc_count * .10, no_above=0.75)\n",
"\n",
"bow_corpus = [dictionary.doc2bow(doc) for doc in documents]\n",
"\n",
"# Train the LDA model.\n",
"model = gensim.models.LdaModel(\n",
" corpus=bow_corpus,\n",
" id2word=dictionary,\n",
" num_topics=num_topics,\n",
" passes=20 # Change the number of passes or iterations\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the most significant terms, as determined by the model, for each topic."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for topic_num in range(0, num_topics):\n",
" word_ids = model.get_topic_terms(topic_num)\n",
" words = []\n",
" for wid, weight in word_ids:\n",
" word = dictionary.id2token[wid]\n",
" words.append(word)\n",
" print(\"Topic {}\".format(str(topic_num).ljust(5)), \" \".join(words))"
]
},
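{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each document also has its own mixture of topics. As an optional follow-up, you can ask the model for the topic distribution of a single document, here the first one in the corpus."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect the topic mixture of a single document (here, the first in the corpus).\n",
"for topic_num, proportion in model.get_document_topics(bow_corpus[0]):\n",
"    print('Topic', topic_num, round(float(proportion), 3))"
]
},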
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). This visualization takes several minutes to an hour to generate depending on the size of your dataset. To run, remove the `#` symbol on the line below and run the cell. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pyLDAvis.gensim\n",
"pyLDAvis.enable_notebook()\n",
"pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)"
]
},
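{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would like to keep or share the visualization outside the notebook, `pyLDAvis` can also write it to a standalone HTML file with `pyLDAvis.save_html`; the file name below is just an example. (In newer releases of pyLDAvis, the gensim helper module is named `pyLDAvis.gensim_models` rather than `pyLDAvis.gensim`.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: save the interactive visualization to a standalone HTML file.\n",
"# Note that this re-runs the (slow) prepare step; the output file name is just an example.\n",
"visualization = pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)\n",
"pyLDAvis.save_html(visualization, 'topic_model_visualization.html')"
]
},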
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": true,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}
#!/bin/bash
# Download a TDM dataset into ./datasets/ as a gzipped JSON Lines file.
# Arguments: $1 = dataset ID, $2 = optional output file name (defaults to the dataset ID).
set -e
#service=http://localhost:5000/dl
service=https://www.jstor.org/api/tdm/v1
fname=$2
if [ -z "${fname}" ]; then
    fname=$1
fi
mkdir -p datasets
# Ask the service for the dataset's metadata and extract the time-limited download URL.
dl=`curl -s $service/nb/dataset/$1/info |\
    grep -o 'https://ithaka-labs.*Expires\=[0-9]*'`
dset="./datasets/$fname.jsonl.gz"
# Fetch the dataset file.
wget -q -L --show-progress \
    -O $dset \
    --user-agent "tdm notebooks" \
    $dl
export DATASET_FILE=$dset
echo "Your dataset $1 is stored in: $dset"
jupyter-notebookparams
jupyter_contrib_nbextensions
pandas
matplotlib
seaborn
gensim
wordfreq
#!/bin/bash
/opt/conda/bin/python3
version=0.1
python -m nltk.downloader stopwords wordnet
jupyter contrib nbextension install --user
jupyter nbextension install jupyter_contrib_nbextensions/nbextensions/toc2 --user
jupyter nbextension enable toc2/main
jupyter nbextension enable --py jupyter_notebookparams
exec "$@"