@migue
Created November 17, 2014 15:47
Featurize message boards
package com.liferay.message.boards.classifier

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.HashingTF

object Utils {
  // Dimensionality of the hashed feature space.
  val numFeatures = 1000
  val tf = new HashingTF(numFeatures)

  /**
   * Create a feature vector by splitting a message into character bigrams
   * (a character-level n-gram model with n = 2) and hashing them into a
   * term-frequency vector of length numFeatures that can be passed to MLlib.
   * Feature hashing is a common way to bound the number of features in a model,
   * usually at a small cost in accuracy from hash collisions (otherwise every
   * pair of Unicode characters would potentially be its own feature).
   */
  def featurize(s: String): Vector = {
    tf.transform(s.sliding(2).toSeq)
  }
}
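
To make the hashing step concrete, here is a minimal sketch of the same idea without the Spark dependency. It is an illustration only: `BigramHashingSketch` and `nonNegativeMod` are hypothetical names, and it uses Scala's `String.hashCode` as the hash, which is an assumption and may not match the hash function MLlib's `HashingTF` uses, so the bucket indices can differ from the ones `Utils.featurize` produces.

```scala
object BigramHashingSketch {
  // Hypothetical stand-in for Utils.numFeatures in the gist.
  val numFeatures = 1000

  // Map a hash code to a bucket in [0, mod), even when the hash is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Count how many character bigrams of `s` fall into each hash bucket.
  // Assumption: String.hashCode stands in for the hash used by HashingTF.
  def featurize(s: String): Array[Double] = {
    val counts = new Array[Double](numFeatures)
    s.sliding(2).foreach { bigram =>
      counts(nonNegativeMod(bigram.hashCode, numFeatures)) += 1.0
    }
    counts
  }
}
```

The result always has `numFeatures` entries regardless of message length, and the entries sum to the number of bigrams in the message. Distinct bigrams that land in the same bucket are simply added together, which is the collision cost the hashing approach accepts.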