
@valentina-s
Created July 25, 2016 23:36
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Estimating the number of topics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on ['Model Selection for Topic Models via Spectral Decomposition'](https://arxiv.org/abs/1410.6466) by Cheng et al."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$D$ - number of documents "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$L(d)$ - number of words in a document $d$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$V$ - size of the vocabulary (here, the number of distinct words across all documents)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$x_{dl}$ is the $l$'th word in the $d$'th document, represented as a $V$-dimensional vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$ x_{dl} = \\left[\\begin{array}{c}\n",
"\\vdots \\\\\n",
" \\text{1 if the $l$'th word in the $d$'th document is equal to the $v$'th word in the vocabulary, zero otherwise}\\\\\n",
"\\vdots \\\\ \\end{array} \\right]$"
]
},
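{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (the toy corpus and the helper `one_hot` are hypothetical, not from the paper), the indicator vectors $x_{dl}$ can be built like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical toy corpus: each document is a list of word strings.\n",
"docs = [[\"apple\", \"banana\", \"apple\"], [\"banana\", \"cherry\"]]\n",
"\n",
"# Vocabulary: the distinct words across all documents.\n",
"vocab = sorted({w for d in docs for w in d})\n",
"V = len(vocab)\n",
"index = {w: i for i, w in enumerate(vocab)}\n",
"\n",
"def one_hot(word):\n",
"    # x_{dl}: V-dimensional indicator vector for a single word occurrence.\n",
"    x = np.zeros(V)\n",
"    x[index[word]] = 1.0\n",
"    return x\n",
"\n",
"x_00 = one_hot(docs[0][0])  # indicator for the 1st word of the 1st document"
]
},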
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to calculate the empirical moments for the distribution based on the observations. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### First Product:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\\hat M_1= \\frac{\\sum_d \\sum_l x_{dl}}{DL} $$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that if the documents have different lengths, $L$ varies from document to document, so the above formula needs to be modified. I believe the denominator should be the total number of words across all documents."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\\hat M_1= \\frac{\\sum_d \\sum_l x_{dl}}{\\sum_d L(d)} $$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So $\\hat M_1$ is a vector of size $V$ whose $v$'th entry is the number of times the $v$'th vocabulary word appears across the collection of all documents, divided by the total number of words. $\\hat M_1$ is the 'occurrence' vector."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$\\hat M_1 = \\left[\\begin{array}{c}\n",
"\\vdots \\\\\n",
" \\text{total number of occurrences of the $v$'th vocabulary word/(total number of words)}\\\\\n",
"\\vdots \\\\ \\end{array} \\right]$"
]
},
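{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of computing $\\hat M_1$ on a hypothetical toy corpus, dividing by the total number of words (the corpus and variable names are illustrative assumptions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical toy corpus.\n",
"docs = [[\"apple\", \"banana\", \"apple\"], [\"banana\", \"cherry\"]]\n",
"vocab = sorted({w for d in docs for w in d})\n",
"V = len(vocab)\n",
"index = {w: i for i, w in enumerate(vocab)}\n",
"\n",
"# Count occurrences of each vocabulary word over the whole collection.\n",
"counts = np.zeros(V)\n",
"for d in docs:\n",
"    for w in d:\n",
"        counts[index[w]] += 1.0\n",
"\n",
"# M1: counts normalized by the total number of words, so its entries sum to 1.\n",
"M1 = counts / sum(len(d) for d in docs)"
]
},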
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Second Product:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$ \\hat M_2 = \\frac{\\sum_d \\sum_{l\\ne l'} x_{dl}{x_{dl'}}^T}{DL(L-1)} - \\frac{\\alpha_0}{\\alpha_0+1} \\hat M_1 \\hat M_1^T$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In one of the R packages they set $\\alpha_0$ to $50$; in this case $\\alpha_0/(\\alpha_0+1)\\approx 0.98$, which is close enough to $1$ that we can ignore it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should look more carefully at which values are used in the packages we are planning to run."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the other hand, in the tensor decomposition literature they suggest that a small $\\alpha_0$ is of interest, as it corresponds to few topics per document (which should indeed be the case for us)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So for now let's assume it is $1$; we can modify the constant later if needed."
]
},
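{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how much this choice matters, here is the coefficient $\\alpha_0/(\\alpha_0+1)$ for the two values discussed above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Coefficient in front of the M1 M1^T correction term.\n",
"for alpha0 in (1.0, 50.0):\n",
"    print(alpha0, alpha0 / (alpha0 + 1.0))  # 0.5 for alpha0=1, ~0.98 for alpha0=50"
]
},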
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$\\hat M_2$ is of dimension $V\\times V$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So in plain English, the second moment is the co-occurrence matrix minus a scaled outer product of the occurrence vector with itself."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Since $x_{dl}$ is an indicator for the $l$'th word in each document, $x_{dl}x_{dl'}^T$ is a $V\\times V$ matrix whose $vv'$'th entry is 1 if the $l$'th word is the $v$'th vocabulary word and the $l'$'th word is the $v'$'th vocabulary word, and zero otherwise."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Summing over different $l$'s and $l'$'s we get"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" $\\sum_{l\\ne l'}x_{dl}x_{dl'}^T = \\left[\\begin{array}{cc}\n",
"\\vdots&\\vdots & \\vdots\\\\\n",
" \\vdots & \\text{number of times the $v$'th and $v'$'th vocabulary words appear together in document $d$}&\\vdots\\\\\n",
"\\vdots&\\vdots &\\vdots\\\\ \\end{array} \\right]$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Summing over all documents, we get that the $vv'$'th entry of the first term in $\\hat M_2$ is a multiple of the number of times the $v$'th and $v'$'th words co-occur in a document. The second term is the outer product of the individual occurrence frequencies."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"The denominator $DL(L-1)$ makes sense when the number of words in each document is equal, otherwise we should have $\\sum_d L(d)(L(d)-1)$ instead."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have all the terms to calculate $\\hat M_2$"
]
},
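{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of computing $\\hat M_2$ on a hypothetical toy corpus, using the per-document denominator $\\sum_d L(d)(L(d)-1)$ and the working assumption $\\alpha_0 = 1$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical toy corpus.\n",
"docs = [[\"apple\", \"banana\", \"apple\"], [\"banana\", \"cherry\"]]\n",
"vocab = sorted({w for d in docs for w in d})\n",
"V = len(vocab)\n",
"index = {w: i for i, w in enumerate(vocab)}\n",
"\n",
"counts = np.zeros(V)\n",
"for d in docs:\n",
"    for w in d:\n",
"        counts[index[w]] += 1.0\n",
"M1 = counts / sum(len(d) for d in docs)\n",
"\n",
"alpha0 = 1.0  # working assumption from the discussion above\n",
"\n",
"# First term: sum over l != l' of x_{dl} x_{dl'}^T, accumulated per document.\n",
"# With c the count vector of document d, this equals outer(c, c) - diag(c).\n",
"cross = np.zeros((V, V))\n",
"norm = 0\n",
"for d in docs:\n",
"    c = np.zeros(V)\n",
"    for w in d:\n",
"        c[index[w]] += 1.0\n",
"    cross += np.outer(c, c) - np.diag(c)\n",
"    norm += len(d) * (len(d) - 1)\n",
"\n",
"M2 = cross / norm - (alpha0 / (alpha0 + 1.0)) * np.outer(M1, M1)"
]
},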
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}