@JaimieMurdock
Created January 28, 2019 20:10
Query Sampling with InPhO Topic Explorer
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Query Sampling Demo\n",
"This notebook shows how to compute a topic distribution over a new document given a previously trained model. It then calculates the distance from the new document to the aggregate topic distribution of the existing documents, and the divergence from the new document to each existing document. (The distance measures are summarized in the next cell.)\n",
"\n",
"Using the SEP model, the steps are:\n",
"1. Download the SEP Spring 2018 edition from S3: https://s3.us-east-2.amazonaws.com/hypershelf/sep.spr2018.tez\n",
"2. `topicexplorer import sep.spr2018.tez`\n",
"3. `topicexplorer notebook data_spr2018`\n",
"4. Place this notebook file into the `notebooks` directory. (The easiest way is to use the \"Upload\" button on the notebook \"Home\" page.)\n",
"5. Place the `origin-1e.txt` file into the `notebooks` directory.\n",
"6. Run all cells."
]
},
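{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Background: the distance measures\n",
"\n",
"The code below compares topic distributions with `KL_div` and `JS_dist` from `vsm.spatial`. Assuming these follow the standard definitions (a reasonable reading, not checked against the library source), for topic distributions $P$ and $Q$:\n",
"\n",
"$$D_{\\mathrm{KL}}(P \\,\\|\\, Q) = \\sum_i p_i \\log \\frac{p_i}{q_i}$$\n",
"\n",
"$$\\mathrm{JSD}(P, Q) = \\tfrac{1}{2} D_{\\mathrm{KL}}(P \\,\\|\\, M) + \\tfrac{1}{2} D_{\\mathrm{KL}}(Q \\,\\|\\, M), \\qquad M = \\tfrac{1}{2}(P + Q)$$\n",
"\n",
"The Jensen-Shannon distance is $\\sqrt{\\mathrm{JSD}(P, Q)}$; it is symmetric, while KL divergence is not, which is why the per-document ranking in the code is an asymmetric comparison."
]
},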
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pylab inline\n",
"from __future__ import print_function\n",
"from codecs import open\n",
"\n",
"# corpus.py is generated in the notebooks directory by `topicexplorer notebook`;\n",
"# its star import is assumed to provide lda_m, lda_v, all_ids, and the\n",
"# query-sampling helpers (align_corpora, LdaCgsQuerySampler) used below\n",
"from corpus import *\n",
"from vsm.extensions.corpusbuilders import toy_corpus, corpus_from_strings\n",
"from vsm.spatial import *\n",
"\n",
"# global settings; k (the number of topics) is the value you will most often change\n",
"k = 20\n",
"m = lda_m[k]  # trained LDA model for k topics\n",
"v = lda_v[k]  # viewer for the same model\n",
"\n",
"def build_sample(newdoc_file):\n",
"    print(\"opening corpus\")\n",
"    with open(newdoc_file, encoding='utf8') as newdoc:\n",
"        text = [newdoc.read()]\n",
"    print(\"building corpus\")\n",
"    newdoc = corpus_from_strings(text, nltk_stop=True, stop_freq=0)\n",
"    newdoc.context_types = ['article']\n",
"\n",
"    print(\"aligning corpus\")\n",
"    c = align_corpora(v.corpus, newdoc)\n",
"    # query-sample: infer topic assignments for the new document against the trained model\n",
"    q = LdaCgsQuerySampler(v.model, old_corpus=v.corpus, new_corpus=c, context_type='article', align_corpora=False)\n",
"    q.train(n_iterations=200)\n",
"    return q\n",
"\n",
"def get_topics(query_sample):\n",
"    # normalize the sampled topic counts into a probability distribution\n",
"    return np.squeeze(query_sample.top_doc / sum(query_sample.top_doc))\n",
"\n",
"# create a topic distribution for a new document\n",
"newdoc_tops = get_topics(build_sample('origin-1e.txt'))\n",
"\n",
"# get the aggregate topic distribution for all existing documents\n",
"tops = v.doc_topic_matrix(all_ids)\n",
"\n",
"# distance between the new document and the average of all other documents;\n",
"# printed explicitly, since only the last expression in a cell is echoed\n",
"print(JS_dist(tops, newdoc_tops))\n",
"\n",
"# calculate the distance from the new document to all previous documents, asymmetrically using KL Divergence\n",
"closest_docs = sorted(all_ids, key=lambda doc_id: KL_div(newdoc_tops, v.doc_topic_matrix(doc_id)))\n",
"for doc_id in closest_docs:\n",
"    print(KL_div(newdoc_tops, v.doc_topic_matrix(doc_id)), doc_id)"
]
},
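{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional: inspecting the results\n",
"\n",
"A small follow-up sketch, assuming the cell above has been run so that `k`, `newdoc_tops`, `closest_docs`, `v`, and `KL_div` are defined: plot the inferred topic distribution for the new document, then list only the ten existing documents closest to it by KL divergence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# bar chart of the new document's topic proportions (matplotlib names come from %pylab inline)\n",
"bar(range(k), newdoc_tops)\n",
"xlabel('topic')\n",
"ylabel('proportion')\n",
"title('Topic distribution for the new document')\n",
"\n",
"# the ten existing documents closest to the new document by KL divergence\n",
"for doc_id in closest_docs[:10]:\n",
"    print(KL_div(newdoc_tops, v.doc_topic_matrix(doc_id)), doc_id)"
]
}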
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}