1. nyt corpus, gensim hdp, max_time = 14400
PROGRESS: finished document 2440960 of 300000
saving topics to 0304/final.topics
topic 0: 0.026*game + 0.020*plai + 0.017*team + 0.016*season + 0.011*run + 0.010*point + 0.010*player + 0.009*win + 0.008*hit + 0.008*start
topic 1: 0.025*compani + 0.016*percent + 0.011*market + 0.011*stock + 0.009*million + 0.008*busi + 0.007*price + 0.007*share + 0.006*billion + 0.006*rate
topic 2: 0.013*vote + 0.012*campaign + 0.011*elect + 0.009*democrat + 0.009*polit + 0.008*presid + 0.007*candid + 0.006*support + 0.006*voter + 0.006*republican
topic 3: 0.012*offici + 0.009*attack + 0.008*militari + 0.008*forc + 0.007*govern + 0.006*war + 0.006*countri + 0.006*terrorist + 0.006*palestinian + 0.006*leader
topic 4: 0.009*palestinian + 0.007*offici + 0.007*kill + 0.006*isra + 0.006*attack + 0.005*offic + 0.005*govern + 0.005*case + 0.005*polic + 0.004*group
topic 5: 0.012*tax + 0.008*percent + 0.008*compani + 0.007*plan + 0.007*million + 0.007*billion + 0.007*govern + 0.006*cut + 0.
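The topic lines in these dumps all share the same `weight*term + weight*term` layout, so they can be turned into (weight, term) pairs for side-by-side comparison between runs. The following is a minimal sketch only; `parse_topic` is a hypothetical helper name, not part of gensim or of Wang's HDP code, and it silently skips a truncated trailing fragment such as `0.`.

```python
import re

# A weight such as 0.026 or 0.0168, followed by '*' and a (stemmed) term.
PAIR = re.compile(r"(\d*\.\d+)\*(\S+)")

def parse_topic(line):
    """Return the (weight, term) pairs found in one topic line.

    Prefixes like 'topic 0:' or a log timestamp are ignored because
    they never contain the 'number*term' pattern.
    """
    return [(float(weight), term) for weight, term in PAIR.findall(line)]

pairs = parse_topic("topic 0: 0.026*game + 0.020*plai + 0.017*team")
for weight, term in pairs:
    print(term, weight)
```

Keeping the pairs as plain tuples makes it easy to compare, for example, the top terms of two topics with ordinary set operations.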
nyt wang hdp (3 hours):
topic 0: 0.0168*patient + 0.0144*cancer + 0.0135*studi + 0.0114*diseas + 0.0111*women + 0.0101*doctor + 0.0087*test + 0.0078*percent + 0.0077*heart + 0.0075*cell +
topic 1: 0.0387*cup + 0.0231*minut + 0.0218*serv + 0.0193*tablespoon + 0.0188*add + 0.0169*teaspoon + 0.0131*pepper + 0.0128*sugar + 0.0125*oil + 0.0118*butter +
topic 2: 0.0346*percent + 0.0240*rate + 0.0190*economi + 0.0184*market + 0.0177*stock + 0.0110*price + 0.0107*point + 0.0098*cut + 0.0090*economist + 0.0089*quarter +
topic 3: 0.0369*run + 0.0347*in + 0.0309*game + 0.0293*hit + 0.0229*pitch + 0.0144*singl + 0.0134*score + 0.0123*start + 0.0118*season + 0.0116*lead +
topic 4: 0.0085*com + 0.0071*palm + 0.0068*beach + 0.0065*book + 0.0060*look + 0.0050*daili + 0.0049*american + 0.0049*question + 0.0041*statesman + 0.0040*home +
topic 5: 0.0192*cook + 0.0182*cup + 0.0169*minut + 0.0158*serv + 0.0138*add + 0.0135*tablespoon + 0.0125*oil + 0.0113*pepper + 0.0104*sauc + 0.0103*teaspoon +
topic 6: 0.0347*drug + 0.017
addition wang hdp results - maxiter 1000, maxtime 3600
topic 0: 0.0406*cell + 0.0152*patient + 0.0109*express + 0.0099*human + 0.0098*tumor + 0.0097*gene + 0.0080*activ + 0.0071*studi + 0.0070*protein + 0.0068*normal
topic 1: 0.0238*patient + 0.0103*increas + 0.0095*group + 0.0095*effect + 0.0091*studi + 0.0079*level + 0.0074*blood + 0.0074*rat + 0.0072*control + 0.0071*hypertens
topic 2: 0.0186*patient + 0.0107*increas + 0.0104*group + 0.0100*effect + 0.0088*studi + 0.0072*blood + 0.0070*control + 0.0068*rat + 0.0065*pressur + 0.0064*arteri
topic 3: 0.0351*patient + 0.0111*ventricular + 0.0101*arteri + 0.0096*group + 0.0084*studi + 0.0081*left + 0.0074*coronari + 0.0070*pressur + 0.0065*increas + 0.0062*diseas
topic 4: 0.0204*patient + 0.0166*cell + 0.0081*increas + 0.0080*studi + 0.0077*activ + 0.0072*effect + 0.0066*level + 0.0060*diseas + 0.0059*respons + 0.0055*rat
topic 5: 0.0437*patient + 0.0109*group + 0.0107*arteri + 0.0099*coronari + 0.0077*studi + 0.0076*year + 0.0069*diseas + 0.0063*infarct + 0.0
Preprocessing notes
- converted ohsumed text to gensim MmCorpus and Dictionary files. (56984 docs x 202967 terms)
- used gensim's parsing.preprocessing (all steps) to reduce dictionary to 40554 terms
- calculated corpus-level tfidf scores, reduced dictionary to top 2000 terms by tfidf score
- ran gensim hdp, lda, lsi, wang online hdp
following CW's suggestion, chose the vocabulary by TFIDF as described here: http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf ("Choosing the Vocabulary")
with TFIDF understood as described in: https://lists.cs.princeton.edu/pipermail/topic-models/2009-April/000531.html
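The vocabulary-selection step in the notes above can be sketched in plain Python. This is an illustration under stated assumptions, not the script that was actually run: it scores each term as (total corpus count) x log(number of docs / document frequency), which is one common corpus-level reading of TFIDF and may differ in detail from the definition in the linked mailing-list post. The names `select_vocabulary` and `filter_docs` are hypothetical.

```python
import math
from collections import Counter

def select_vocabulary(docs, top_n):
    """Rank terms by a corpus-level TF-IDF score and keep the top_n.

    docs is a list of token lists. The score used here is an assumption:
    total term count * log(n_docs / document frequency).
    """
    n_docs = len(docs)
    term_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of documents containing each term
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    scores = {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
    # Sort by descending score; break ties alphabetically for reproducibility.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]

def filter_docs(docs, vocab):
    """Drop every token that is not in the selected vocabulary."""
    keep = set(vocab)
    return [[t for t in tokens if t in keep] for tokens in docs]

# Toy corpus of already stemmed tokens (made up for illustration).
docs = [["patient", "cell", "cell"], ["patient", "tumor"], ["patient", "cell", "gene"]]
vocab = select_vocabulary(docs, top_n=2)
reduced = filter_docs(docs, vocab)
```

A term that occurs in every document scores zero under this formula, so very common words fall out of the vocabulary without a separate stop list.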
results with ohsumed corpus, 56984 documents, 17411 features
gensim lda results:
2012-02-06 01:00:10,617 : INFO : topic #0: 0.186*per + 0.061*cancer + 0.048*cent + 0.029*radiation + 0.023*obese + 0.015*actuarial + 0.013*orbital + 0.012*rest + 0.011*rate + 0.011*survival
2012-02-06 01:00:10,629 : INFO : topic #1: 0.069*myocardial + 0.035*coronary + 0.034*infarction + 0.032*angioplasty + 0.030*acute + 0.022*cardiac + 0.016*ischemic + 0.015*tachycardia + 0.014*thrombolytic + 0.014*episodes
2012-02-06 01:00:10,641 : INFO : topic #2: 0.101*dna + 0.042*reaction + 0.040*chain + 0.027*analysis + 0.025*tissue + 0.023*polymerase + 0.022*sequence + 0.017*autologous + 0.016*detection + 0.015*detected
2012-02-06 01:00:10,653 : INFO : topic #3: 0.039*synthesis + 0.032*albumin + 0.024*parathyroid + 0.015*laryngeal + 0.015*vasopressin + 0.013*tubular + 0.013*water + 0.013*paralysis + 0.011*thallium + 0.011*hyperparathyroidism
2012-02-06 01:00:10,666 : INFO : topic #4: 0.051*injection + 0.035*subcutaneous + 0.030*enzyme + 0