Created July 25, 2016 23:41
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimating the number of topics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on ['Model Selection for Topic Models via Spectral Decomposition'](https://arxiv.org/abs/1410.6466) by Cheng et al."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$D$ - number of documents"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$L(d)$ - number of words in document $d$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$V$ - size of the vocabulary (here, the number of distinct words across all documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$x_{dl}$ is the $l$'th word in the $d$'th document, represented as a $V$-dimensional vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$ x_{dl} = \\left[\\begin{array}{c}\n",
"\\vdots \\\\\n",
" \\text{1 if the $l$'th word in the $d$'th document is equal to the $v$'th word in the vocabulary, zero otherwise}\\\\\n",
"\\vdots \\\\ \\end{array} \\right]$"
]
},
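For concreteness, the indicator representation can be sketched in a few lines of Python (the vocabulary and documents below are made up for illustration):

```python
import numpy as np

# Hypothetical toy corpus: two documents over a 4-word vocabulary.
vocab = ["topic", "model", "word", "document"]
docs = [["topic", "model", "topic"],
        ["word", "document", "model", "word"]]

V = len(vocab)
index = {w: v for v, w in enumerate(vocab)}

def one_hot(word):
    """V-dimensional indicator vector x_dl for a single word occurrence."""
    x = np.zeros(V)
    x[index[word]] = 1.0
    return x

# x_dl for the first word of the first document:
print(one_hot(docs[0][0]))  # [1. 0. 0. 0.]
```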
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to calculate the empirical moments of the distribution based on the observations."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### First Moment:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\\hat M_1= \\frac{\\sum_d \\sum_l x_{dl}}{DL} $$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that if the documents have different lengths, the value of $L$ changes from document to document, so the formula above needs to be modified: the denominator should be the total number of words."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\\hat M_1= \\frac{\\sum_d \\sum_l x_{dl}}{\\sum_d L(d)} $$"
]
},
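With the total number of words $\sum_d L(d)$ as the denominator, the first moment can be computed directly; a self-contained sketch (the toy corpus is made up for illustration):

```python
import numpy as np

vocab = ["topic", "model", "word", "document"]
docs = [["topic", "model", "topic"],
        ["word", "document", "model", "word"]]
V = len(vocab)
index = {w: v for v, w in enumerate(vocab)}

# sum_d L(d): total number of words in the corpus
total_words = sum(len(d) for d in docs)

M1 = np.zeros(V)
for d in docs:
    for w in d:
        M1[index[w]] += 1.0
M1 /= total_words

print(M1)        # relative frequency of each vocabulary word
print(M1.sum())  # approximately 1.0 -- the entries sum to one by construction
```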
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So $\\hat M_1$ is a vector of size $V$ whose $v$'th entry is the fraction of all word occurrences in the collection of documents that are the $v$'th vocabulary word. $\\hat M_1$ is the 'occurrence' vector."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$\\hat M_1 = \\left[\\begin{array}{c}\n",
"\\vdots \\\\\n",
" \\text{(number of occurrences of the $v$'th vocabulary word)/(total number of words)}\\\\\n",
"\\vdots \\\\ \\end{array} \\right]$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Second Moment:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$ \\hat M_2 = \\frac{\\sum_d \\sum_{l\\ne l'} x_{dl}{x_{dl'}}^T}{DL(L-1)} - \\frac{\\alpha_0}{\\alpha_0+1} \\hat M_1 \\hat M_1^T$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In one of the R packages they set $\\alpha_0$ to $50$; in this case $\\alpha_0/(\\alpha_0+1)$ is approximately $1$ and we can ignore it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should investigate which values are used in the packages we are planning to run."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the other hand, in the tensor decomposition work they suggest that a small $\\alpha_0$ is of interest, as it corresponds to few topics per document (which in our case should indeed be true)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So for now let's assume it is $1$; we can modify the constant later if needed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$\\hat M_2$ is of dimension $V\\times V$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So in plain English, the second moment is equal to the co-occurrences minus the product of the occurrences."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Since $x_{dl}$ is an indicator for the $l$'th word in each document, $x_{dl}x_{dl'}^T$ is a $V\\times V$ matrix whose $vv'$'th entry is 1 if the $l$'th word is the $v$'th vocabulary word and the $l'$'th word is the $v'$'th vocabulary word, and zero otherwise."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Summing over different $l$'s and $l'$'s we get"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" $\\sum_{l\\ne l'}x_{dl}x_{dl'}^T = \\left[\\begin{array}{ccc}\n",
"\\vdots&\\vdots & \\vdots\\\\\n",
" \\vdots & \\text{number of times the $v$'th and $v'$'th words appear together in a document}&\\vdots\\\\\n",
"\\vdots&\\vdots &\\vdots\\\\ \\end{array} \\right]$"
]
},
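This pairwise term can be checked numerically: for one-hot vectors, $\sum_{l\ne l'} x_{dl}x_{dl'}^T$ equals the outer product of the document's word-count vector with itself, minus its diagonal (which removes the $l = l'$ terms). A sketch with a made-up document:

```python
import numpy as np

vocab = ["topic", "model", "word", "document"]
doc = ["topic", "model", "topic"]  # one hypothetical document
V = len(vocab)
index = {w: v for v, w in enumerate(vocab)}

# Indicator vectors x_dl, stacked as the rows of a matrix
X = np.zeros((len(doc), V))
for l, w in enumerate(doc):
    X[l, index[w]] = 1.0

# Direct sum over ordered pairs l != l'
C_brute = sum(np.outer(X[l], X[m])
              for l in range(len(doc))
              for m in range(len(doc)) if l != m)

# Closed form: outer product of the count vector minus its diagonal
s = X.sum(axis=0)
C = np.outer(s, s) - np.diag(s)

print(np.allclose(C, C_brute))  # True
print(C[0, 0])  # 2.0: 'topic' occupies positions 0 and 2, giving two ordered pairs
```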
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Summing over all documents, we get that the $vv'$'th entry of the first term in $\\hat M_2$ is a multiple of the number of times the $v$'th and $v'$'th words co-occur in a document. The second term is the product of the individual occurrences."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"The denominator $DL(L-1)$ makes sense when the number of words in each document is equal; otherwise we should have $\\sum_d L(d)(L(d)-1)$ instead."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have all the terms to calculate $\\hat M_2$."
]
},
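Putting the pieces together, a minimal sketch of $\hat M_2$ using the per-document denominator $\sum_d L(d)(L(d)-1)$ discussed above; the toy corpus and the choice $\alpha_0 = 1$ are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

vocab = ["topic", "model", "word", "document"]
docs = [["topic", "model", "topic"],
        ["word", "document", "model", "word"]]
V = len(vocab)
index = {w: v for v, w in enumerate(vocab)}

def word_counts(doc):
    s = np.zeros(V)
    for w in doc:
        s[index[w]] += 1.0
    return s

counts = [word_counts(d) for d in docs]

# First moment with denominator sum_d L(d)
total_words = sum(len(d) for d in docs)
M1 = sum(counts) / total_words

# Co-occurrence term: ordered pairs l != l' within each document,
# divided by sum_d L(d)(L(d) - 1)
pairs = sum(len(d) * (len(d) - 1) for d in docs)
cooc = sum(np.outer(s, s) - np.diag(s) for s in counts) / pairs

alpha0 = 1.0  # assumed value; see the discussion of alpha_0 above
M2 = cooc - (alpha0 / (alpha0 + 1.0)) * np.outer(M1, M1)

print(M2.shape)  # (4, 4), i.e. V x V
```

For a real corpus $V$ is large, so the count vectors and $\hat M_2$ would be better held in a sparse representation.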
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}