
@VanessaD
Last active December 20, 2015 07:59
Run Mahout K-means on the Reuters Example
# Reuters example
#~~~~~~~~~~~~~~~~~~~~
# Get the data first. I place it within the examples folder under the Mahout home directory: mahout-0.5-cdh3u5/examples/reuters
mkdir reuters
cd reuters
wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..
# Mahout steps
# (1) The original Reuters download is in SGML format, which is similar to XML, so we first need to
#     parse (preprocess) those files into document-id / document-text pairs. After that we can convert
#     the files into SequenceFiles, where the key is the document id and the value is the document
#     content. This step is done with 'seqdirectory'. Then 'seq2sparse' applies tf-idf to convert
#     the id-text data into vectors (Vector Space Model: VSM).
# For the preprocessing job, a much quicker way is to reuse the Reuters parser bundled in the
# Lucene benchmark JAR. Because it's shipped along with Mahout, all you need to do is change to the
# examples/ directory under the Mahout source tree and run the
# org.apache.lucene.benchmark.utils.ExtractReuters class. <http://manning.com/owen/MiA_SampleCh08.pdf>
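# To make the tf-idf step concrete, here is a minimal Python sketch of what seq2sparse
# computes conceptually. The real job runs as distributed MapReduce; the toy corpus,
# whitespace tokenizer, and unsmoothed idf formula below are simplifying assumptions,
# not Mahout's exact implementation.

```python
import math
from collections import Counter

# Toy corpus standing in for the parsed Reuters documents (doc-id -> text).
docs = {
    "reut-001": "oil prices rise as opec cuts output",
    "reut-002": "opec meeting ends with no output deal",
    "reut-003": "stocks rise on strong earnings",
}

# Term frequency per document (analogous to the 'tf-vectors' stage).
tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Document frequency per term (analogous to the 'df-count' stage).
df = Counter(term for counts in tf.values() for term in counts)

# tf-idf weighting (analogous to the 'tfidf-vectors' stage), simplified form.
n_docs = len(docs)
tfidf = {
    doc_id: {term: count * math.log(n_docs / df[term])
             for term, count in counts.items()}
    for doc_id, counts in tf.items()
}

# Rare terms ("oil": 1 doc) outweigh common ones ("rise": 2 docs).
print(tfidf["reut-001"]["oil"], tfidf["reut-001"]["rise"])
```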
# Extract plain-text files from the SGM files; note that the generated files reside on the local filesystem
${MAHOUT_HOME}/bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text
hadoop fs -copyFromLocal ./reuters-text/ /your-hdfs-path-to/reuters-text
# Then generate sequence-file
mahout-0.5-cdh3u5:$ ./bin/mahout seqdirectory -i /your-hdfs-path-to/reuters-text -o /your-hdfs-path-to/reuters-seqfiles -c UTF-8 -chunk 5
# Check the generated sequence-file
mahout-0.5-cdh3u5:$ ./bin/mahout seqdumper -s /your-hdfs-path-to/reuters-seqfiles/chunk-0 |less
# From sequence-file generate vector file
mahout-0.5-cdh3u5:$ ./bin/mahout seq2sparse -i /your-hdfs-path-to/reuters-seqfiles/ -o /your-hdfs-path-to/reuters-vectors -Dmapred.job.queue.name=your-queue-name
# Take a look at the output directory; it should have 7 items:
#reuters-vectors/df-count
#reuters-vectors/dictionary.file-0
#reuters-vectors/frequency.file-0
#reuters-vectors/tf-vectors
#reuters-vectors/tfidf-vectors
#reuters-vectors/tokenized-documents
#reuters-vectors/wordcount
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors
# check the vector: reuters-vectors/tf-vectors/part-r-00000
mahout-0.5-cdh3u5:$ hadoop fs -ls reuters-vectors/tf-vectors
# Run kmeans
mahout-0.5-cdh3u5:$ ./bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -o mahout-clusters -c mahout-initial-centers -cd 0.1 -k 20 -x 10 -ow
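# Under the hood, k-means iterates Lloyd's algorithm: assign each vector to its nearest
# center, then recompute each center as the mean of its assigned points, stopping after
# the -x maximum iterations or once centers move less than the -cd convergence delta.
# A minimal single-machine Python sketch of that loop (toy 2-D data and first-k-points
# initialization are my assumptions; Mahout's job is distributed and reads its initial
# centers from the -c directory):

```python
def kmeans(points, k, max_iter=10, delta=0.1):
    """Plain Lloyd's algorithm on 2-D points."""
    centers = [p for p in points[:k]]  # deterministic init for illustration
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centers[i][0]) ** 2
                                      + (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers, moved = [], 0.0
        for i, cluster in enumerate(clusters):
            if not cluster:                  # keep empty clusters in place
                new_centers.append(centers[i])
                continue
            cx = sum(p[0] for p in cluster) / len(cluster)
            cy = sum(p[1] for p in cluster) / len(cluster)
            moved = max(moved, abs(cx - centers[i][0]) + abs(cy - centers[i][1]))
            new_centers.append((cx, cy))
        centers = new_centers
        if moved < delta:                    # centers barely moved: converged
            break
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centers = kmeans(pts, k=2)
print(sorted(centers))
```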
# Check the cluster output
# http://stackoverflow.com/questions/5805225/interpreting-output-from-mahout-clusterdumper
mahout clusterdump -s mahout-clusters/clusters-* -d reuters-vectors/dictionary.file-0 -dt sequencefile -b 100 -n 20 -o ./cluster-output.txt
# Some other tips
You can set any other Hadoop parameters by doing:
mahout <options> -D<hadoop_property>=<value>
Replace <hadoop_property> with the property name you want to define and <value> with the value you want.
In cases where you hit an "OutOfMemoryError" or "GC overhead limit exceeded", it may help to add the following parameters to your Mahout job:
-Dmapred.child.ulimit=4718592
(required in order to change the heap allocations for the map or reduce phase)
-Dmapred.map.child.java.opts=-Xmx3g
(recommended heap increase; up to 4g is acceptable)
-Dmapred.reduce.child.java.opts=-Xmx3g
(recommended heap increase; up to 4g is acceptable)
-Dmapred.child.java.opts=-Xmx3g
(recommended heap increase; up to 4g is acceptable; this overrides the map and reduce allocations with the value specified here)