@dlwh
Created September 30, 2009 21:35
>>I am getting stuck in interpreting joint and conditional probabilities while trying to understand some
>>language modeling stuff. Given w (the words in a corpus of documents), z (the topics), and T (the number
>>of topics), how would you interpret the following: (1) P(w|T), (2) P(w|z, T), and (3) P(z|w, T)?
>>Thanks!
First, Bayes' rule:
p(z|w,T) = p(w|z,T) p(z|T) / p(w|T)
"The posterior probability is proportional to the prior probability (p(z|T)) times the conditional likelihood (p(w|z,T))." The remaining term, p(w|T), just makes it normalize.
p(z|w,T) is the posterior: given these words and the number of topics, how likely am I to see this assignment of topics?
p(w|z,T) is the likelihood: given these topic assignments, how likely am I to see these words?
p(z|T) is the prior: without any other information (in particular, what the actual words are), how likely am I to see these topic assignments? In LDA this is uniform (no topic is a priori any more probable than any other topic).
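(Just to make those quantities concrete, here's a tiny numeric sketch in Python for a single observed token and T = 2 topics; the likelihood values 0.02 and 0.01 are made up purely for illustration.)

likelihood = {1: 0.02, 2: 0.01}  # p(w|z,T) for the two candidate topic assignments (made-up numbers)
prior = {1: 0.5, 2: 0.5}         # p(z|T), uniform: neither topic is a priori more probable
marginal = sum(likelihood[z] * prior[z] for z in (1, 2))              # p(w|T)
posterior = {z: likelihood[z] * prior[z] / marginal for z in (1, 2)}  # p(z|w,T)
print(posterior)  # {1: 0.666..., 2: 0.333...}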
p(w|T) is the "marginal probability" of the words, given the number of topics:
p(w|T) = \sum_{all assignments z'} p(w|z',T) p(z'|T)
Written out like that, it should be clearer why it makes sense as the denominator in Bayes' rule: it sums the numerator over every possible assignment z', so the posterior sums to 1.
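(Here's a brute-force sketch of that sum for a toy corpus with W = 3 tokens and T = 2 topics, so all 2^3 = 8 assignments can actually be enumerated. The topic-to-word distribution phi is invented just for the example, not estimated from any data.)

import itertools

T = 2
phi = {
    0: {"cat": 0.45, "dog": 0.45, "tax": 0.10},  # hypothetical topic 0: pets
    1: {"cat": 0.05, "dog": 0.05, "tax": 0.90},  # hypothetical topic 1: finance
}
w = ["cat", "dog", "tax"]  # the observed word tokens

def likelihood(z):  # p(w|z,T): each token is generated by its assigned topic
    p = 1.0
    for token, topic in zip(w, z):
        p *= phi[topic][token]
    return p

prior = (1.0 / T) ** len(w)  # p(z'|T): uniform over all T^W assignments

# p(w|T) = sum over all assignments z' of p(w|z',T) p(z'|T)
marginal = sum(likelihood(z) * prior
               for z in itertools.product(range(T), repeat=len(w)))

# and the posterior for any particular assignment normalizes against it
z = (0, 0, 1)
print(likelihood(z) * prior / marginal)  # p(z|w,T) for that assignment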
In practice, the number of "all assignments z'" is T^W, where W is the number of observed word tokens. For example, with T = 50 topics and W = 1,000 tokens that's 50^1000 terms, i.e. it's really, really, really big. However, it's a useful number for estimating how good a fit a given number of topics gives you. It's important to note that it should (basically) increase as we increase the number of topics T; in the limit, we can have one topic for each word token. That said, it should probably "level off" and give diminishing returns as you increase T. The question is: where does it level off?
So, Griffiths & Steyvers (G&S) propose to use the "harmonic mean estimator" to approximate the marginal probability: draw samples of z' (from the posterior) and take the harmonic mean of their likelihoods p(w|z',T). See http://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/ for an explanation of that, and of why it's a horrible idea. (It's pretty dense, and I have trouble following it, but the punch line is: "The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe.")
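(For what it's worth, here's roughly what that estimator looks like in Python, computed in log space for numerical stability. It assumes you already have posterior samples z_1..z_S, e.g. from a Gibbs sampler over p(z|w,T), and that you can compute log p(w|z_s,T) for each one; the sample values below are made up.)

import math

def harmonic_mean_log_marginal(log_liks):
    # log p(w|T) ~= -log( (1/S) * sum_s 1/p(w|z_s,T) ), done in log space
    S = len(log_liks)
    m = max(-ll for ll in log_liks)  # max-shift is the usual log-sum-exp trick to avoid overflow
    log_sum_inverse = m + math.log(sum(math.exp(-ll - m) for ll in log_liks))
    return math.log(S) - log_sum_inverse

# e.g. with hypothetical log-likelihoods from four posterior samples:
print(harmonic_mean_log_marginal([-1052.3, -1050.1, -1049.8, -1051.0]))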