@varnit
Created November 9, 2011 18:29
mahout lda
$ wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
$ mkdir -p reuters && tar xzf reuters21578.tar.gz -C reuters
$ mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"
$ hadoop dfs -put reuters-extracted/* reuters/
$ bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles
$ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors
$ bin/mahout lda -i reuters-vectors/tf-vectors -o reuters-lda-sparse -k 10 -v 70000 -x 20
$ bin/mahout org.apache.mahout.clustering.lda.LDAPrintTopics -i reuters-lda-sparse/state-20/ -d reuters-vectors/dictionary.file-* -dt sequencefile -w 5
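
The `state-20` path in the last command matches the `-x 20` iteration limit passed to `lda`. If the run stops earlier, the final state directory will presumably carry a lower number. A minimal sketch of a helper that picks the highest-numbered `state-N` entry from a directory listing follows; `latest_state` is a hypothetical name, not part of Mahout, and it assumes the listing contains paths ending in `state-N`.

```shell
# Hypothetical helper (not part of Mahout): read a directory listing on stdin
# and print the highest-numbered state-N entry found in it.
latest_state() {
  grep -o 'state-[0-9][0-9]*' | sort -t- -k2 -n | tail -n 1
}

# Assumed usage, with the listing taken from HDFS:
#   hadoop dfs -ls reuters-lda-sparse | latest_state
```

The printed name can then be substituted for `state-20` in the `LDAPrintTopics` command.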