@jesterhazy
jesterhazy / gensim hdp results (multi-pass)
Created March 4, 2012 22:33
gensim hdp results (multi-pass)
1. nyt corpus, gensim hdp, max_time = 14400
PROGRESS: finished document 2440960 of 300000
saving topics to 0304/final.topics
topic 0: 0.026*game + 0.020*plai + 0.017*team + 0.016*season + 0.011*run + 0.010*point + 0.010*player + 0.009*win + 0.008*hit + 0.008*start
topic 1: 0.025*compani + 0.016*percent + 0.011*market + 0.011*stock + 0.009*million + 0.008*busi + 0.007*price + 0.007*share + 0.006*billion + 0.006*rate
topic 2: 0.013*vote + 0.012*campaign + 0.011*elect + 0.009*democrat + 0.009*polit + 0.008*presid + 0.007*candid + 0.006*support + 0.006*voter + 0.006*republican
topic 3: 0.012*offici + 0.009*attack + 0.008*militari + 0.008*forc + 0.007*govern + 0.006*war + 0.006*countri + 0.006*terrorist + 0.006*palestinian + 0.006*leader
topic 4: 0.009*palestinian + 0.007*offici + 0.007*kill + 0.006*isra + 0.006*attack + 0.005*offic + 0.005*govern + 0.005*case + 0.005*polic + 0.004*group
topic 5: 0.012*tax + 0.008*percent + 0.008*compani + 0.007*plan + 0.007*million + 0.007*billion + 0.007*govern + 0.006*cut + 0.
jesterhazy / gist:1869822
Created February 20, 2012 15:56
gensim results - uci nyt corpus, trimmed to 2000 features
nyt wang hdp (3 hours):
topic 0: 0.0168*patient + 0.0144*cancer + 0.0135*studi + 0.0114*diseas + 0.0111*women + 0.0101*doctor + 0.0087*test + 0.0078*percent + 0.0077*heart + 0.0075*cell +
topic 1: 0.0387*cup + 0.0231*minut + 0.0218*serv + 0.0193*tablespoon + 0.0188*add + 0.0169*teaspoon + 0.0131*pepper + 0.0128*sugar + 0.0125*oil + 0.0118*butter +
topic 2: 0.0346*percent + 0.0240*rate + 0.0190*economi + 0.0184*market + 0.0177*stock + 0.0110*price + 0.0107*point + 0.0098*cut + 0.0090*economist + 0.0089*quarter +
topic 3: 0.0369*run + 0.0347*in + 0.0309*game + 0.0293*hit + 0.0229*pitch + 0.0144*singl + 0.0134*score + 0.0123*start + 0.0118*season + 0.0116*lead +
topic 4: 0.0085*com + 0.0071*palm + 0.0068*beach + 0.0065*book + 0.0060*look + 0.0050*daili + 0.0049*american + 0.0049*question + 0.0041*statesman + 0.0040*home +
topic 5: 0.0192*cook + 0.0182*cup + 0.0169*minut + 0.0158*serv + 0.0138*add + 0.0135*tablespoon + 0.0125*oil + 0.0113*pepper + 0.0104*sauc + 0.0103*teaspoon +
topic 6: 0.0347*drug + 0.017
jesterhazy / additional-hdp-results.txt
Created February 18, 2012 17:14
additional hdp results
additional wang hdp results - maxiter 1000, maxtime 3600
topic 0: 0.0406*cell + 0.0152*patient + 0.0109*express + 0.0099*human + 0.0098*tumor + 0.0097*gene + 0.0080*activ + 0.0071*studi + 0.0070*protein + 0.0068*normal
topic 1: 0.0238*patient + 0.0103*increas + 0.0095*group + 0.0095*effect + 0.0091*studi + 0.0079*level + 0.0074*blood + 0.0074*rat + 0.0072*control + 0.0071*hypertens
topic 2: 0.0186*patient + 0.0107*increas + 0.0104*group + 0.0100*effect + 0.0088*studi + 0.0072*blood + 0.0070*control + 0.0068*rat + 0.0065*pressur + 0.0064*arteri
topic 3: 0.0351*patient + 0.0111*ventricular + 0.0101*arteri + 0.0096*group + 0.0084*studi + 0.0081*left + 0.0074*coronari + 0.0070*pressur + 0.0065*increas + 0.0062*diseas
topic 4: 0.0204*patient + 0.0166*cell + 0.0081*increas + 0.0080*studi + 0.0077*activ + 0.0072*effect + 0.0066*level + 0.0060*diseas + 0.0059*respons + 0.0055*rat
topic 5: 0.0437*patient + 0.0109*group + 0.0107*arteri + 0.0099*coronari + 0.0077*studi + 0.0076*year + 0.0069*diseas + 0.0063*infarct + 0.0
jesterhazy / gensim-results2.txt
Created February 18, 2012 15:04
more gensim hdp results
Preprocessing notes
- converted ohsumed text to gensim MmCorpus and Dictionary files (56984 docs x 202967 terms)
- used gensim's parsing.preprocessing filters (all steps) to reduce the dictionary to 40554 terms
- calculated corpus-level tfidf scores, then reduced the dictionary to the top 2000 terms by tfidf score
- ran gensim hdp, lda, lsi, and wang online hdp
following CW's suggestion, chose the vocabulary by TFIDF as described here: http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf ("Choosing the Vocabulary"),
with TFIDF understood as described in: https://lists.cs.princeton.edu/pipermail/topic-models/2009-April/000531.html
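The vocabulary-trimming step in the notes above can be sketched in plain Python. This is a rough illustration, not the code that produced these results: the scoring formula (total term count times log(N / document frequency)) is an assumption about what "corpus-level tfidf" means here, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def top_terms_by_tfidf(docs, keep):
    """Return the `keep` terms with the highest corpus-level score.

    docs is a list of token lists. The score used here is an assumption:
    total count of the term in the corpus * log(n_docs / doc_frequency).
    """
    n_docs = len(docs)
    total_count = Counter()  # occurrences of each term across the corpus
    doc_freq = Counter()     # number of documents containing each term
    for tokens in docs:
        total_count.update(tokens)
        doc_freq.update(set(tokens))
    scores = {
        term: total_count[term] * math.log(n_docs / doc_freq[term])
        for term in total_count
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:keep]


def restrict(docs, vocab):
    """Drop every token that is not in the chosen vocabulary."""
    allowed = set(vocab)
    return [[t for t in tokens if t in allowed] for tokens in docs]


# Toy stand-in for the preprocessed corpus (hypothetical data).
docs = [
    ["patient", "cell", "tumor", "cell"],
    ["patient", "arteri", "coronari"],
    ["patient", "cell", "gene"],
]
vocab = top_terms_by_tfidf(docs, keep=2)
trimmed = restrict(docs, vocab)
```

A term that appears in every document (here "patient") scores zero under this formula, so it is among the first to be dropped when the vocabulary is cut down.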
jesterhazy / gensim-hdp-results.txt
Created February 9, 2012 17:01
jesterhazy's gensim hdp results
results with ohsumed corpus, 56984 documents, 17411 features
gensim lda results:
2012-02-06 01:00:10,617 : INFO : topic #0: 0.186*per + 0.061*cancer + 0.048*cent + 0.029*radiation + 0.023*obese + 0.015*actuarial + 0.013*orbital + 0.012*rest + 0.011*rate + 0.011*survival
2012-02-06 01:00:10,629 : INFO : topic #1: 0.069*myocardial + 0.035*coronary + 0.034*infarction + 0.032*angioplasty + 0.030*acute + 0.022*cardiac + 0.016*ischemic + 0.015*tachycardia + 0.014*thrombolytic + 0.014*episodes
2012-02-06 01:00:10,641 : INFO : topic #2: 0.101*dna + 0.042*reaction + 0.040*chain + 0.027*analysis + 0.025*tissue + 0.023*polymerase + 0.022*sequence + 0.017*autologous + 0.016*detection + 0.015*detected
2012-02-06 01:00:10,653 : INFO : topic #3: 0.039*synthesis + 0.032*albumin + 0.024*parathyroid + 0.015*laryngeal + 0.015*vasopressin + 0.013*tubular + 0.013*water + 0.013*paralysis + 0.011*thallium + 0.011*hyperparathyroidism
2012-02-06 01:00:10,666 : INFO : topic #4: 0.051*injection + 0.035*subcutaneous + 0.030*enzyme + 0
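The topic lines in these listings all follow a `weight*term + weight*term` pattern, either after a bare `topic N:` prefix or after a gensim log prefix ending in `topic #N:`. To compare runs programmatically, a small parser along these lines might help. It is a sketch under that format assumption only; the function name is hypothetical, and fragments cut off mid-number (such as a trailing `0.`) are simply skipped.

```python
import re

# Matches both "topic 3:" and "... INFO : topic #3:".
TOPIC_ID = re.compile(r"topic #?(\d+):")
# Matches complete "0.026*game" pairs; a cut-off "0." does not match.
PAIR = re.compile(r"(\d+\.\d+)\*(\w+)")


def parse_topic_line(line):
    """Return (topic_id, [(term, weight), ...]) or None if no topic prefix."""
    match = TOPIC_ID.search(line)
    if match is None:
        return None
    rest = line[match.end():]  # ignore timestamps before the prefix
    pairs = [(term, float(weight)) for weight, term in PAIR.findall(rest)]
    return int(match.group(1)), pairs


plain = parse_topic_line("topic 5: 0.012*tax + 0.008*percent + 0.")
logged = parse_topic_line(
    "2012-02-06 01:00:10,617 : INFO : topic #0: 0.186*per + 0.061*cancer"
)
```

Lines without a topic prefix, such as the `PROGRESS:` and `saving topics` lines, return `None` and can be filtered out.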