@bitsnaps
Forked from rahulsom/Ele.groovy
Last active November 29, 2016 14:19
Mahout with Groovy - the faster way

Mahout with Groovy (original work of @rahulsom)

I started looking at ML libraries and read somewhere that Apache Mahout is pretty good. Then I started looking for a hello world, and ran into this page.

It sucks that the tutorial is a YouTube video. That's right: you need to watch this guy do a bunch of stuff in a video to learn how to use Mahout. Worse, he manages the libs in his project by hand.

So I decided to implement his whole video with Groovy, using @Grab to pull in Mahout instead of managing jars by hand. As a bonus, I print movie names instead of IDs.

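You will have to download the data file from here (the URL is also in a comment in the script), unzip it, and set the location in the variable mlDir. By default the script looks for an ml-100k directory next to itself.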
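For orientation, here is a small sketch of the two file layouts the script relies on: u.data is tab-separated (user ID, item ID, rating, timestamp) and u.item is pipe-separated with the movie ID and title in the first two fields. The sample lines are typed in by hand for illustration (the u.item line is shortened), and the variable names are my own; the parsing expressions are the same ones the script uses.

```groovy
// Illustrative u.data line: user ID, item ID, rating, timestamp, separated by tabs.
def ratingLine = '196\t242\t3\t881250949'

// Keep the first three columns and join them with commas, as the script does
// before handing each line to Mahout.
def csvLine = ratingLine.split('\t').take(3).join(',')
println csvLine

// Illustrative (shortened) u.item line: movie ID, title, then further fields.
def itemLine = '1|Toy Story (1995)|01-Jan-1995||0|0|0'

// Build an ID -> title map from the first two pipe-separated fields.
def items = [itemLine].collectEntries { it.split('\\|').take(2).toList() }
println items
```

Note that the map keys are strings, which is why the script calls toString() on the numeric item IDs before looking up a title.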

@Grab(group = 'org.apache.mahout', module = 'mahout-core', version = '0.9')
import org.apache.mahout.cf.taste.impl.common.FastByIDMap
import org.apache.mahout.cf.taste.impl.common.FastIDSet
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity

// You can get this data from here: http://files.grouplens.org/datasets/movielens/ml-100k.zip
// By default, look for an 'ml-100k' directory next to this script.
def mlDir = new File(getClass().protectionDomain.codeSource.location.path).parent + '/ml-100k'
def f = new File("$mlDir/u.data")
assert f.exists()

// u.data is tab-separated, so each line is rewritten as comma-separated text
// (keeping only the first three columns) before Mahout parses it.
def m = new FileDataModel(f, ',') {
    void processLine(
            String line, FastByIDMap<?> data, FastByIDMap<FastByIDMap<Long>> timestamps, boolean fromPriorData
    ) {
        def newLine = line.split('\t').take(3).join(',')
        super.processLine(newLine, data, timestamps, fromPriorData)
    }

    void processLineWithoutID(
            String line, FastByIDMap<FastIDSet> data, FastByIDMap<FastByIDMap<Long>> timestamps) {
        def newLine = line.split('\t').take(3).join(',')
        try {
            // Skip empty lines and lines that do not start with a digit.
            if (newLine && (newLine[0].isNumber()))
                super.processLineWithoutID(newLine, data, timestamps)
        } catch (NoSuchElementException ignore) {
            // if you run into this line, you're probably pushing String to processLineWithoutID, so check newLine's value!
        }
    }
}

def similarity = new TanimotoCoefficientSimilarity(m)
def recommender = new GenericItemBasedRecommender(m, similarity)

// u.item is pipe-separated; map each movie ID (as a String) to its title.
def items = new File("$mlDir/u.item").readLines().collectEntries { it.split('\\|').take(2).toList() }

// For every movie, print the five most similar movies with the similarity as a percentage.
m.itemIDs.each { itemId ->
    def recommendedItems = recommender.mostSimilarItems(itemId, 5)
    println "People who liked '${items[itemId.toString()]}' also liked"
    recommendedItems.each { recommendedItem ->
        println "  (${(recommendedItem.value * 100).intValue()}) ${items[recommendedItem.itemID.toString()]}"
    }
}
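The number in parentheses is the Tanimoto coefficient scaled to a percentage. As a rough sketch of what that measures (assuming the usual definition: users who rated both items, divided by users who rated either one), here is a plain Groovy version. The tanimoto helper and the user IDs are hypothetical, and this is not Mahout's code, just the same ratio on toy data.

```groovy
// Hypothetical helper, not part of Mahout: Tanimoto (Jaccard) coefficient
// of two sets of user IDs.
double tanimoto(Set<Long> usersA, Set<Long> usersB) {
    int common = usersA.intersect(usersB).size()
    int unionSize = usersA.size() + usersB.size() - common
    // Two empty sets have nothing in common; treat that as zero similarity.
    return unionSize == 0 ? 0.0d : common / (double) unionSize
}

// Made-up data: users who rated movie A and users who rated movie B.
def usersOfMovieA = [1L, 2L, 3L] as Set
def usersOfMovieB = [2L, 3L, 4L] as Set

// 2 shared users out of 4 distinct users gives 0.5.
def score = tanimoto(usersOfMovieA, usersOfMovieB)
println "(${(score * 100).intValue()}) similarity"   // prints "(50) similarity"
```

Because the ratio only looks at who rated what, the rating values themselves play no part in this sketch.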