@JaimieMurdock
Created January 28, 2019 20:10
Query Sampling with InPhO Topic Explorer
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Query Sampling Demo\n",
"This notebook shows how to compute a topic distribution over a new document given a previously trained model. It then calculates the distance from the new document to the aggregate topic distribution of the existing documents, and the divergence from the new document to each existing document. (The distance measures are summarized in the next cell.)\n",
"\n",
"Using the SEP model, the steps are:\n",
"1. Download the SEP Spring 2018 edition from S3: https://s3.us-east-2.amazonaws.com/hypershelf/sep.spr2018.tez\n",
"2. `topicexplorer import sep.spr2018.tez`\n",
"3. `topicexplorer notebook data_spr2018`\n",
"4. Place this notebook file into the `notebooks` directory. (The easiest way is to use the \"Upload\" button on the notebook \"Home\" page.)\n",
"5. Place the `origin-1e.txt` file into the `notebooks` directory.\n",
"6. Run all cells."
]
},
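{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Background: the distance measures\n",
"\n",
"The code below compares topic distributions with `KL_div` and `JS_dist` from `vsm.spatial`. Assuming these follow the standard definitions (a reasonable reading, not checked against the library source), for topic distributions $P$ and $Q$:\n",
"\n",
"$$D_{\\mathrm{KL}}(P \\,\\|\\, Q) = \\sum_i p_i \\log \\frac{p_i}{q_i}$$\n",
"\n",
"$$\\mathrm{JSD}(P, Q) = \\tfrac{1}{2} D_{\\mathrm{KL}}(P \\,\\|\\, M) + \\tfrac{1}{2} D_{\\mathrm{KL}}(Q \\,\\|\\, M), \\qquad M = \\tfrac{1}{2}(P + Q)$$\n",
"\n",
"The Jensen-Shannon distance is $\\sqrt{\\mathrm{JSD}(P, Q)}$; it is symmetric, while KL divergence is not, which is why the per-document ranking in the code is an asymmetric comparison."
]
},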
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pylab inline\n",
"from __future__ import print_function\n",
"from codecs import open\n",
"\n",
"# corpus.py is generated in the notebooks directory by `topicexplorer notebook`;\n",
"# its star import is assumed to provide lda_m, lda_v, all_ids, and the\n",
"# query-sampling helpers (align_corpora, LdaCgsQuerySampler) used below\n",
"from corpus import *\n",
"from vsm.extensions.corpusbuilders import toy_corpus, corpus_from_strings\n",
"from vsm.spatial import *\n",
"\n",
"# global settings; k (the number of topics) is the value you will most often change\n",
"k = 20\n",
"m = lda_m[k]  # trained LDA model for k topics\n",
"v = lda_v[k]  # viewer for the same model\n",
"\n",
"def build_sample(newdoc_file):\n",
"    print(\"opening corpus\")\n",
"    with open(newdoc_file, encoding='utf8') as newdoc:\n",
"        text = [newdoc.read()]\n",
"    print(\"building corpus\")\n",
"    newdoc = corpus_from_strings(text, nltk_stop=True, stop_freq=0)\n",
"    newdoc.context_types = ['article']\n",
"\n",
"    print(\"aligning corpus\")\n",
"    c = align_corpora(v.corpus, newdoc)\n",
"    # query-sample: infer topic assignments for the new document against the trained model\n",
"    q = LdaCgsQuerySampler(v.model, old_corpus=v.corpus, new_corpus=c, context_type='article', align_corpora=False)\n",
"    q.train(n_iterations=200)\n",
"    return q\n",
"\n",
"def get_topics(query_sample):\n",
"    # normalize the sampled topic counts into a probability distribution\n",
"    return np.squeeze(query_sample.top_doc / sum(query_sample.top_doc))\n",
"\n",
"# create a topic distribution for a new document\n",
"newdoc_tops = get_topics(build_sample('origin-1e.txt'))\n",
"\n",
"# get the aggregate topic distribution for all existing documents\n",
"tops = v.doc_topic_matrix(all_ids)\n",
"\n",
"# distance between the new document and the average of all other documents;\n",
"# printed explicitly, since only the last expression in a cell is echoed\n",
"print(JS_dist(tops, newdoc_tops))\n",
"\n",
"# calculate the distance from the new document to all previous documents, asymmetrically using KL Divergence\n",
"closest_docs = sorted(all_ids, key=lambda doc_id: KL_div(newdoc_tops, v.doc_topic_matrix(doc_id)))\n",
"for doc_id in closest_docs:\n",
"    print(KL_div(newdoc_tops, v.doc_topic_matrix(doc_id)), doc_id)"
]
},
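{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional: inspecting the results\n",
"\n",
"A small follow-up sketch, assuming the cell above has been run so that `k`, `newdoc_tops`, `closest_docs`, `v`, and `KL_div` are defined: plot the inferred topic distribution for the new document, then list only the ten existing documents closest to it by KL divergence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# bar chart of the new document's topic proportions (matplotlib names come from %pylab inline)\n",
"bar(range(k), newdoc_tops)\n",
"xlabel('topic')\n",
"ylabel('proportion')\n",
"title('Topic distribution for the new document')\n",
"\n",
"# the ten existing documents closest to the new document by KL divergence\n",
"for doc_id in closest_docs[:10]:\n",
"    print(KL_div(newdoc_tops, v.doc_topic_matrix(doc_id)), doc_id)"
]
}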
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}