@jduckles
Created July 28, 2011 00:10
Commands from Mahout workshop Pt 1 at OSCON2011
# See http://www.oscon.com/oscon2011/public/schedule/detail/18836 for instructions on setting up Mahout
# Get Reuters Data
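# -O names the download explicitly; otherwise wget may save it under the short-URL name and the mv below would fail
wget -O reuters21578.tar.gz http://goo.gl/qv6Ad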
mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..
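# Optional sanity check (a sketch, not part of the original workshop commands):
# count the extracted SGML files before handing the directory to Mahout.
# count_sgm is a hypothetical helper name; the archive is assumed to unpack
# its *.sgm files directly into reuters-out.
count_sgm() {
  # Print how many *.sgm files sit directly inside directory $1.
  find "$1" -maxdepth 1 -type f -name '*.sgm' | wc -l | tr -d ' '
}
if [ -d reuters-out ]; then
  echo "Found $(count_sgm reuters-out) .sgm files in reuters-out"
fi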
# Mahout steps
# Split out text from SGM files
bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text
# Create sequence files
bin/mahout seqdirectory -i reuters-text -o reuters-seqfiles -c UTF-8 -chunk 5
# Look at sequence files
bin/mahout seqdumper -s reuters-seqfiles/chunk-0 |less
# seq2sparse converts the sequence files into sparse vectors, weighted here by TF-IDF
bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -wt tfidf
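# Optional check (a sketch; require_paths is a hypothetical helper): the later
# steps read reuters-vectors/tfidf-vectors and reuters-vectors/dictionary.file-0,
# so it can help to confirm both exist first. This assumes Mahout is running in
# local mode, so outputs land on the local filesystem rather than HDFS.
require_paths() {
  # Return 0 if every argument exists; otherwise report missing paths and return 1.
  missing=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "missing: $p" >&2
      missing=1
    fi
  done
  return "$missing"
}
# Example: require_paths reuters-vectors/tfidf-vectors reuters-vectors/dictionary.file-0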
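# Perform k-means: -c is the initial-centers path, -cd the convergence delta,
# -k the number of clusters, -x the maximum iterations, -ow overwrites existing output
bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -o mahout-clusters -c mahout-initial-centers -cd 0.1 -k 20 -x 10 -ow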
# look at output
bin/mahout clusterdump -s mahout-clusters/clusters-10/part-r-00000 -d reuters-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20
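# The clusterdump path above assumes k-means ran all 10 iterations. If it
# converges earlier, the final directory carries a lower number. A sketch
# (latest_clusters_dir is a hypothetical helper; the clusters-N naming is taken
# from the command above, and local-mode output is assumed) that picks the
# highest-numbered clusters directory:
latest_clusters_dir() {
  best=""
  best_n=-1
  for d in "$1"/clusters-*; do
    [ -d "$d" ] || continue
    n=${d##*clusters-}    # text after the last "clusters-", e.g. 10
    n=${n%%[!0-9]*}       # keep only the leading digits
    [ -n "$n" ] || continue
    if [ "$n" -gt "$best_n" ]; then
      best=$d
      best_n=$n
    fi
  done
  if [ -n "$best" ]; then
    echo "$best"
  fi
}
# Example: bin/mahout clusterdump -s "$(latest_clusters_dir mahout-clusters)/part-r-00000" -d reuters-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20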