Run Mahout K-means on Reuter Example
# For the Reuters example
#~~~~~~~~~~~~~~~~~~~~~~~
# Get the data first. I place it within the examples folder under the Mahout home directory: mahout-0.5-cdh3u5/examples/reuters
mkdir reuters
cd reuters
wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..
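# A minimal sketch for sanity-checking the extraction above before moving on. The helper name count_sgm is hypothetical (it is not part of the original steps); it only counts the .sgm files in a directory. The Reuters-21578 archive is expected to unpack to 22 .sgm files (reut2-000.sgm through reut2-021.sgm).

```shell
# Hypothetical helper: print the number of .sgm files directly under a directory.
count_sgm() {
  find "$1" -maxdepth 1 -type f -name '*.sgm' | wc -l | tr -d ' '
}

# Example: run from the reuters/ directory after the tar step.
if [ -d reuters-out ]; then
  echo "SGML files found: $(count_sgm reuters-out)"
fi
```

# If the count is lower than expected, the download or the tar step probably did not finish.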
# Mahout steps
# (1) For the Reuters example, the downloaded files are in SGML format, which is similar to XML. So we first need to parse (preprocess) those files into document-id and document-text pairs. After that we can convert the text into SequenceFiles, where the key is the document id and the value is the document content; this step is done using 'seqdirectory'. Then 'seq2sparse' applies tf-idf weighting to convert the id-text data into vectors (Vector Space Model: VSM).
# For the first preprocessing job, a much quicker way is to reuse the Reuters parser given in the Lucene benchmark JAR file. Because it's bundled along with Mahout, all you need to do is change to the examples/ directory under the Mahout source tree and run the org.apache.lucene.benchmark.utils.ExtractReuters class. <http://manning.com/owen/MiA_SampleCh08.pdf>
# Extract plain-text files from the SGML (.sgm) files; note that the generated files reside on the local filesystem
${MAHOUT_HOME}/bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text
hadoop fs -copyFromLocal ./reuters-text/ /your-hdfs-path-to/reuters-text
# Then generate sequence-file
mahout-0.5-cdh3u5:$ ./bin/mahout seqdirectory -i /your-hdfs-path-to/reuters-text -o /your-hdfs-path-to/reuters-seqfiles -c UTF-8 -chunk 5
# Check the generated sequence-file
mahout-0.5-cdh3u5:$ ./bin/mahout seqdumper -s /your-hdfs-path-to/reuters-seqfiles/chunk-0 |less
# From sequence-file generate vector file
mahout-0.5-cdh3u5:$ ./bin/mahout seq2sparse -i /your-hdfs-path-to/reuters-seqfiles/ -o /your-hdfs-path-to/reuters-vectors -Dmapred.job.queue.name=your-queue-name
# Take a look at the output; it should contain 7 items:
#reuters-vectors/df-count
#reuters-vectors/dictionary.file-0
#reuters-vectors/frequency.file-0
#reuters-vectors/tf-vectors
#reuters-vectors/tfidf-vectors
#reuters-vectors/tokenized-documents
#reuters-vectors/wordcount
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors
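# A rough sketch for checking the seven items automatically instead of by eye. The helper name missing_vector_outputs is hypothetical; it reads a listing on stdin and prints every expected item that does not appear as the last path component of some line. It assumes each listing line ends with the item's path, as `hadoop fs -ls` output does.

```shell
# Hypothetical helper: read an `hadoop fs -ls` listing on stdin and print
# any of the seven expected seq2sparse outputs that are missing from it.
missing_vector_outputs() {
  listing=$(cat)
  for name in df-count dictionary.file-0 frequency.file-0 tf-vectors \
              tfidf-vectors tokenized-documents wordcount; do
    # Match the item only as the final path component of a listing line.
    printf '%s\n' "$listing" | grep -q "/$name\$" || echo "$name"
  done
}

# Example usage (needs the hadoop CLI, so it is left commented out here):
# hadoop fs -ls reuters-vectors | missing_vector_outputs
```

# No output means all seven items are present.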
# check the vector: reuters-vectors/tf-vectors/part-r-00000
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors/tf-vectors
# Run kmeans
mahout-0.5-cdh3u5:$ ./bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -o mahout-clusters -c mahout-initial-centers -cd 0.1 -k 20 -x 10 -ow
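# A sketch that spells out the kmeans options as named variables and prints the resulting command as a dry run. The variable names are hypothetical. It assumes the 0.1 value is the convergence delta (-cd), that -c is the path for the initial centers, -k the number of clusters, -x the maximum number of iterations, and -ow overwrites an existing output directory.

```shell
# Hypothetical variable names; the values mirror the kmeans step above.
INPUT=reuters-vectors/tfidf-vectors/
OUTPUT=mahout-clusters
INITIAL_CENTERS=mahout-initial-centers   # -c
CONVERGENCE_DELTA=0.1                    # -cd (assumed meaning of 0.1)
NUM_CLUSTERS=20                          # -k
MAX_ITERATIONS=10                        # -x

KMEANS_CMD="./bin/mahout kmeans -i $INPUT -o $OUTPUT -c $INITIAL_CENTERS -cd $CONVERGENCE_DELTA -k $NUM_CLUSTERS -x $MAX_ITERATIONS -ow"

# Dry run: print the command so it can be reviewed before it is executed.
echo "$KMEANS_CMD"
```

# Printing first makes it easy to spot a repeated or mistyped flag before submitting the job.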
# Check the cluster output
# http://stackoverflow.com/questions/5805225/interpreting-output-from-mahout-clusterdumper
mahout clusterdump -s mahout-clusters/clusters-* -d reuters-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20 -o ./cluster-output.txt
# Some other tips
You can set any other Hadoop parameter like this:
mahout <options> -D<hadoop_property>=<value>
Replace <hadoop_property> with the name of the property you want to define and <value> with its value.
If a job fails with "OutOfMemoryError" or "GC overhead limit exceeded", it may help to add the following parameters to your Mahout job:
-Dmapred.child.ulimit=4718592
(required in order to change the heap memory allocation for either the map or the reduce phase)
-Dmapred.map.child.java.opts=-Xmx3g
(recommended memory increase; up to 4g is acceptable)
-Dmapred.reduce.child.java.opts=-Xmx3g
(recommended memory increase; up to 4g is acceptable)
-Dmapred.child.java.opts=-Xmx3g
(recommended memory increase; up to 4g is acceptable; using this will override the map and reduce memory allocations with the one specified here)
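The ulimit and heap values above are related: mapred.child.ulimit is given in kilobytes, and 4718592 KB is 4.5 GB, which is 1.5 times the 3g heap. The sketch below reproduces that arithmetic for an arbitrary heap size. The helper name mahout_mem_opts and the 1.5x factor as a general rule are assumptions made for illustration; only the 3g / 4718592 pairing comes from the tips above.

```shell
# Hypothetical helper: print the -D options for a heap size given in whole GB.
# Assumption: the ulimit (in KB) is set to 1.5x the heap, which matches the
# 3g / 4718592 pairing listed above.
mahout_mem_opts() {
  heap_gb=$1
  ulimit_kb=$((heap_gb * 1024 * 1024 * 3 / 2))
  echo "-Dmapred.child.ulimit=$ulimit_kb -Dmapred.map.child.java.opts=-Xmx${heap_gb}g -Dmapred.reduce.child.java.opts=-Xmx${heap_gb}g"
}

mahout_mem_opts 3
```

For a 3 GB heap this prints the same three values listed above; the ulimit has to stay larger than the heap, otherwise the child JVM cannot start.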