Last active December 20, 2015 07:59
  • Save VanessaD/ec6dbb8b9f7ba7f47299 to your computer and use it in GitHub Desktop.
Run Mahout K-means on the Reuters Example
# For the Reuters example
# Get the data first. I place it in the examples folder under the Mahout home directory: mahout-0.5-cdh3u5/examples/reuters
mkdir reuters
cd reuters
mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..
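# A quick sanity check after unpacking, as a minimal sketch: it counts the .sgm files in reuters-out and warns when none are found. The count_sgm helper is a hypothetical name introduced here, and the sketch assumes the archive unpacks its .sgm files directly into reuters-out.

```shell
#!/usr/bin/env bash
# Count the SGML files found directly inside a folder.
# The folder defaults to reuters-out, as created in the steps above.
count_sgm() {
  local dir="${1:-reuters-out}"
  # -maxdepth 1: assumes the archive unpacks flat into the target folder
  find "$dir" -maxdepth 1 -type f -name '*.sgm' | wc -l | tr -d ' '
}

# Usage, from the reuters folder: report how many files were extracted.
if [ -d reuters-out ]; then
  n="$(count_sgm reuters-out)"
  if [ "$n" -eq 0 ]; then
    echo "no .sgm files found in reuters-out" >&2
  else
    echo "found $n .sgm files"
  fi
fi
```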
# Mahout steps
# (1) The downloaded Reuters files are in SGML format, which is similar to XML. First parse (preprocess) those files into document-id and document-text pairs. Then convert the text files into SequenceFiles with 'seqdirectory'; in a SequenceFile the key is the document id and the value is the document content. Finally use 'seq2sparse' to convert the id-text data into TF-IDF vectors (Vector Space Model: VSM).
# For the first preprocessing job, a much quicker way is to reuse the Reuters parser
# given in the Lucene benchmark JAR file. Because it's bundled along with Mahout,
# all you need to do is change to the examples/ directory under the Mahout source tree
# and run the org.apache.lucene.benchmark.utils.ExtractReuters class.
# Extract plain-text files from the SGM files; note that the generated files reside on the local filesystem
${MAHOUT_HOME}/bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text
hadoop fs -copyFromLocal ./reuters-text/ /your-hdfs-path-to/reuters-text
# Then generate sequence-file
mahout-0.5-cdh3u5:$ ./bin/mahout seqdirectory -i /your-hdfs-path-to/reuters-text -o /your-hdfs-path-to/reuters-seqfiles -c UTF-8 -chunk 5
# Check the generated sequence-file
mahout-0.5-cdh3u5:$ ./bin/mahout seqdumper -s /your-hdfs-path-to/reuters-seqfiles/chunk-0 |less
# From sequence-file generate vector file
mahout-0.5-cdh3u5:$ ./bin/mahout seq2sparse -i /your-hdfs-path-to/reuters-seqfiles/ -o /your-hdfs-path-to/reuters-vectors
# Take a look at the output; it should contain 7 items:
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors
# check the vector: reuters-vectors/tf-vectors/part-r-00000
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors/tf-vectors
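# The 7-item check above can be scripted, as a rough sketch. The entry names in EXPECTED are an assumption based on what seq2sparse typically writes (they are not listed in these notes), and missing_vector_outputs is a hypothetical helper. It reads an `hadoop fs -ls` listing on stdin, so it can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Entries assumed to be produced by seq2sparse (assumption, see note above).
EXPECTED="df-count dictionary.file-0 frequency.file-0 tf-vectors tfidf-vectors tokenized-documents wordcount"

# Reads an `hadoop fs -ls` listing on stdin and prints each expected entry
# that does not appear in it; prints nothing when all entries are present.
missing_vector_outputs() {
  local listing name
  listing="$(cat)"
  for name in $EXPECTED; do
    # match the name as the last path component of a listing line
    printf '%s\n' "$listing" | grep -q "/${name}\$" || echo "$name"
  done
}

# Usage (requires a configured hadoop client):
#   hadoop fs -ls reuters-vectors | missing_vector_outputs
```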
# Run kmeans
mahout-0.5-cdh3u5:$ ./bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -o mahout-clusters -c mahout-initial-centers -cd 0.1 -k 20 -x 10 -ow
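# To make the k-means flags easier to read, here is a small sketch that builds the command from named settings and only prints it. As I understand the kmeans driver, -i is the input vectors, -o the output folder, -c the path for the initial centers, -k the number of clusters, -x the maximum number of iterations, -cd the convergence threshold and -ow overwrites existing output. The kmeans_cmd helper and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Build the k-means command line from named settings and print it.
# Nothing is executed here; this is a dry run for inspection.
# Arguments: input output seeds [k] [max_iter] [delta]
kmeans_cmd() {
  local input="$1" output="$2" seeds="$3"
  local k="${4:-20}" max_iter="${5:-10}" delta="${6:-0.1}"
  printf './bin/mahout kmeans -i %s -o %s -c %s -cd %s -k %s -x %s -ow\n' \
    "$input" "$output" "$seeds" "$delta" "$k" "$max_iter"
}

# Same paths and defaults as the run above (k=20, 10 iterations, delta 0.1).
kmeans_cmd reuters-vectors/tfidf-vectors/ mahout-clusters mahout-initial-centers
```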
# Check the cluster output
mahout clusterdump -s mahout-clusters/clusters-* -d reuters-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20 -o ./cluster-output.txt
# Some other tips
You can set any other Hadoop parameters by doing:
mahout <options> -D<hadoop_property>=<value>
Replace <hadoop_property> with the property name you want to define and <value> with the value you want.
In cases where you hit an "OutOfMemoryError" or "GC overhead limit exceeded", it may help to add the following parameters to your Mahout job:
(required in order to change the memory heap allocations to either the map or reduce phase)
(recommended extension of memory; up to 4g is acceptable)
(recommended extension of memory; up to 4g is acceptable)
(recommended extension of memory; up to 4g is acceptable; using this will overwrite the map and reduce memory allocations with the one specified here)
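# A sketch of the -D mechanism described above. The parameter names in the list are missing from this copy and the sketch does not try to restore them; mapred.child.java.opts is used purely as an assumed example of a Hadoop 1.x heap property. The with_hadoop_props helper is hypothetical, only prints the command, and places the -D flags right after the job name, which is where Hadoop's generic option parsing usually expects them (also an assumption).

```shell
#!/usr/bin/env bash
# Turn key=value pairs into -D flags for a Mahout job (dry run: prints only).
# Usage: with_hadoop_props <job> <key=value>... -- <job options>...
# Values containing spaces are not quoted in the printed command.
with_hadoop_props() {
  local job="$1"
  shift
  local cmd="mahout $job"
  # everything before "--" is treated as a Hadoop property
  while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
    cmd="$cmd -D$1"
    shift
  done
  # drop the "--" separator, then append the job's own options
  if [ "$#" -gt 0 ]; then shift; fi
  if [ "$#" -gt 0 ]; then cmd="$cmd $*"; fi
  echo "$cmd"
}

# Example: request a 4g child JVM heap (property name is an assumption).
with_hadoop_props kmeans mapred.child.java.opts=-Xmx4g -- \
  -i reuters-vectors/tfidf-vectors/ -o mahout-clusters
```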