@burlresearch
Last active December 19, 2015 23:18
preliminary mahout vectors and clustering
#!/bin/bash
export MAHOUT_LOCAL=1 # skip hadoop for now; exported so the mahout child process sees it
K=3 # number of k-means clusters, passed to kmeans as -k
set -ex
# convert the directory of text documents into SequenceFiles
mahout seqdirectory -i issblog -o issblog-seqfiles -ow
#-filter org.apache.lucene.analysis.en.EnglishMinimalStemFilter
#-filter org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter
#-filter org.apache.lucene.analysis.en.PorterStemFilter
#mahout seq2sparse -i issblog-seqfiles -o issblog-vectors -ow
mahout seq2sparse -i issblog-seqfiles -o issblog-vectors -ow \
-a org.apache.lucene.analysis.en.EnglishAnalyzer
#-a org.apache.lucene.analysis.en.PorterStemFilter
mahout vectordump -i issblog-vectors/tf-vectors \
-d issblog-vectors/dictionary.file-0 -dt sequencefile \
-sort 1 --vectorSize 12 --printKey 1 \
-o vectors-12
# mahout vectordump -i issblog-vectors/tfidf-vectors -d issblog-vectors/dictionary.file-0 -dt sequencefile -sort 1 --vectorSize 12
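# cluster the tf-idf vectors: -k clusters seeded into the -c path, at most 20 iterations (-x),
# convergence delta 1.0 (-cd), then assign points to the final clusters (-cl)
mahout kmeans -i issblog-vectors/tfidf-vectors -o issblog-kmeans-clusters -ow \
-c issblog-init-clusters \
-dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
-cd 1.0 -k "$K" -x 20 -cl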
mahout clusterdump \
-d issblog-vectors/dictionary.file-0 \
-dt sequencefile \
-i issblog-kmeans-clusters \
-b 10 \
-n 10
# -o clusters.txt \
exit 0
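
A possible follow-up, offered as a sketch rather than as part of the original script: the kmeans step above writes to issblog-kmeans-clusters, and as far as I know Mahout's k-means driver creates one clusters-N directory per iteration there and marks the last one with a "-final" suffix. The helper below looks that directory up so it could be passed to clusterdump as -i if dumping the top-level output directory does not work. find_final_clusters is a hypothetical name, and the plain directory test assumes local mode (MAHOUT_LOCAL), not HDFS.

```shell
#!/bin/bash
# Sketch only: find_final_clusters is a hypothetical helper, not a Mahout command.
# Assumption: k-means writes per-iteration directories named clusters-N under its
# output path and gives the last one a "-final" suffix.
find_final_clusters() {
  local out_dir="$1"
  local candidate
  for candidate in "$out_dir"/clusters-*-final; do
    # An unmatched glob stays a literal string, so confirm it is a real directory.
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Report the directory if it exists; otherwise say so without aborting.
if final_dir=$(find_final_clusters issblog-kmeans-clusters); then
  echo "final clusters directory: $final_dir"
else
  echo "no clusters-*-final directory under issblog-kmeans-clusters" >&2
fi
```

If the lookup succeeds, "$final_dir" can be tried in place of issblog-kmeans-clusters in the clusterdump call. On a Hadoop-backed run the same lookup would need a filesystem listing instead of a local directory test.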