
@valentina-s
Created July 25, 2016 23:36
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimating the number of topics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on ['Model Selection for Topic Models via Spectral Decomposition'](https://arxiv.org/abs/1410.6466) by Cheng et al."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$D$ - number of documents "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$L(d)$ - number of words in a document $d$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$V$ - size of the vocabulary (here, the number of distinct words across all documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$x_{dl}$ is the $l$'th word in the $d$'th document, represented as a $V$-dimensional vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$ x_{dl} = \\left[\\begin{array}{c}\n",
"\\vdots \\\\\n",
" \\text{1 if the $l$'th word in the $d$'th document is equal to the $v$'th word in the vocabulary, zero otherwise}\\\\\n",
"\\vdots \\\\ \\end{array} \\right]$"
]
},
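{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (the toy corpus and the helper `one_hot` are hypothetical, not from the paper), the indicator vectors $x_{dl}$ can be built like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical toy corpus: each document is a list of word strings.\n",
"docs = [[\"apple\", \"banana\", \"apple\"], [\"banana\", \"cherry\"]]\n",
"\n",
"# Vocabulary: the distinct words across all documents.\n",
"vocab = sorted({w for d in docs for w in d})\n",
"V = len(vocab)\n",
"index = {w: i for i, w in enumerate(vocab)}\n",
"\n",
"def one_hot(word):\n",
"    # x_{dl}: V-dimensional indicator vector for a single word occurrence.\n",
"    x = np.zeros(V)\n",
"    x[index[word]] = 1.0\n",
"    return x\n",
"\n",
"x_00 = one_hot(docs[0][0])  # indicator for the 1st word of the 1st document"
]
},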
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to calculate the empirical moments for the distribution based on the observations. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### First Product:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\\hat M_1= \\frac{\\sum_d \\sum_l x_{dl}}{DL} $$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that if the documents have different lengths, $L$ varies from document to document, so the above formula needs to be modified. I believe the denominator should be the total number of words across all documents."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\\hat M_1= \\frac{\\sum_d \\sum_l x_{dl}}{\\sum_d L(d)} $$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So $\\hat M_1$ is a vector of size $V$ whose $v$'th entry is the number of times the $v$'th vocabulary word appears across the collection of all documents, divided by the total number of words. $\\hat M_1$ is the 'occurrence' vector."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$\\hat M_1 = \\left[\\begin{array}{c}\n",
"\\vdots \\\\\n",
" \\text{total number of occurrences of the $v$'th vocabulary word/(total number of words)}\\\\\n",
"\\vdots \\\\ \\end{array} \\right]$"
]
},
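{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of computing $\\hat M_1$ on a hypothetical toy corpus, dividing by the total number of words (the corpus and variable names are illustrative assumptions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical toy corpus.\n",
"docs = [[\"apple\", \"banana\", \"apple\"], [\"banana\", \"cherry\"]]\n",
"vocab = sorted({w for d in docs for w in d})\n",
"V = len(vocab)\n",
"index = {w: i for i, w in enumerate(vocab)}\n",
"\n",
"# Count occurrences of each vocabulary word over the whole collection.\n",
"counts = np.zeros(V)\n",
"for d in docs:\n",
"    for w in d:\n",
"        counts[index[w]] += 1.0\n",
"\n",
"# M1: counts normalized by the total number of words, so its entries sum to 1.\n",
"M1 = counts / sum(len(d) for d in docs)"
]
},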
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Second Product:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$ \\hat M_2 = \\frac{\\sum_d \\sum_{l\\ne l'} x_{dl}{x_{dl'}}^T}{DL(L-1)} - \\frac{\\alpha_0}{\\alpha_0+1} \\hat M_1 \\hat M_1^T$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In one of the R packages they set $\\alpha_0$ to $50$; in this case $\\alpha_0/(\\alpha_0+1)\\approx 0.98$, which is close enough to $1$ that we can ignore it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should look more carefully at which values are used in the packages we are planning to run."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the other hand, in the tensor decomposition literature they suggest that a small $\\alpha_0$ is of interest, as it corresponds to few topics per document (which should indeed be the case for us)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So for now let's assume it is $1$; we can modify the constant later if needed."
]
},
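{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how much this choice matters, here is the coefficient $\\alpha_0/(\\alpha_0+1)$ for the two values discussed above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Coefficient in front of the M1 M1^T correction term.\n",
"for alpha0 in (1.0, 50.0):\n",
"    print(alpha0, alpha0 / (alpha0 + 1.0))  # 0.5 for alpha0=1, ~0.98 for alpha0=50"
]
},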
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$\\hat M_2$ is of dimension $V\\times V$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So in plain English, the second moment is the co-occurrence matrix minus a scaled outer product of the occurrence vector with itself."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Since $x_{dl}$ is an indicator for the $l$'th word in each document, $x_{dl}x_{dl'}^T$ is a $V\\times V$ matrix whose $vv'$'th entry is 1 if the $l$'th word is the $v$'th vocabulary word and the $l'$'th word is the $v'$'th vocabulary word, and zero otherwise."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Summing over different $l$'s and $l'$'s we get"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" $\\sum_{l\\ne l'}x_{dl}x_{dl'}^T = \\left[\\begin{array}{cc}\n",
"\\vdots&\\vdots & \\vdots\\\\\n",
" \\vdots & \\text{number of times the $v$'th and $v'$'th vocabulary words appear together in document $d$}&\\vdots\\\\\n",
"\\vdots&\\vdots &\\vdots\\\\ \\end{array} \\right]$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Summing over all documents, we get that the $vv'$'th entry of the first term in $\\hat M_2$ is a multiple of the number of times the $v$'th and $v'$'th words co-occur in a document. The second term is the outer product of the individual occurrence frequencies."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"The denominator $DL(L-1)$ makes sense when the number of words in each document is equal, otherwise we should have $\\sum_d L(d)(L(d)-1)$ instead."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have all the terms to calculate $\\hat M_2$"
]
},
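{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of computing $\\hat M_2$ on a hypothetical toy corpus, using the per-document denominator $\\sum_d L(d)(L(d)-1)$ and the working assumption $\\alpha_0 = 1$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical toy corpus.\n",
"docs = [[\"apple\", \"banana\", \"apple\"], [\"banana\", \"cherry\"]]\n",
"vocab = sorted({w for d in docs for w in d})\n",
"V = len(vocab)\n",
"index = {w: i for i, w in enumerate(vocab)}\n",
"\n",
"counts = np.zeros(V)\n",
"for d in docs:\n",
"    for w in d:\n",
"        counts[index[w]] += 1.0\n",
"M1 = counts / sum(len(d) for d in docs)\n",
"\n",
"alpha0 = 1.0  # working assumption from the discussion above\n",
"\n",
"# First term: sum over l != l' of x_{dl} x_{dl'}^T, accumulated per document.\n",
"# With c the count vector of document d, this equals outer(c, c) - diag(c).\n",
"cross = np.zeros((V, V))\n",
"norm = 0\n",
"for d in docs:\n",
"    c = np.zeros(V)\n",
"    for w in d:\n",
"        c[index[w]] += 1.0\n",
"    cross += np.outer(c, c) - np.diag(c)\n",
"    norm += len(d) * (len(d) - 1)\n",
"\n",
"M2 = cross / norm - (alpha0 / (alpha0 + 1.0)) * np.outer(M1, M1)"
]
},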
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}