@migue
Created November 17, 2014 15:11
K-means train process on message boards
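// Written against the Spark 1.1/1.2-era API, where the SQL types live in org.apache.spark.sql.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql._

// Parse each ';'-separated record into a Row of (title, body), taken from columns 4 and 5.
// applySchema expects an RDD[Row], so rows are built directly here.
val messagesTable = sc.textFile(messagesInput).map(_.split(";")).map(m => Row(m(4), m(5)))
val schema =
  StructType(
    StructField("title", StringType, false) ::
    StructField("body", StringType, true) :: Nil)
val messagesSchemaRDD = sqlContext.applySchema(messagesTable, schema)
messagesSchemaRDD.registerTempTable("messagesTable")
// Extract the message bodies as plain strings.
val bodies = sqlContext.sql("SELECT body FROM messagesTable").map(_.getString(0))
// Turn each body into a feature vector; cached because k-means makes repeated passes over the data.
val vectors = bodies.map(Utils.featurize).cache()
val model = KMeans.train(vectors, numClusters.toInt, numIterations.toInt)
// Persist only the cluster centers so the model can be rebuilt later.
sc.makeRDD(model.clusterCenters, numClusters.toInt).saveAsObjectFile(outputModelDir)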
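
The snippet above calls `Utils.featurize`, which is not defined in this gist. The block below is only a minimal sketch of what such a helper might look like, assuming the bodies are vectorized with MLlib's `HashingTF` over character bigrams. The `Utils` object shown here, the `numFeatures` value of 1000, and the lowercasing step are all assumptions for illustration, not the author's actual implementation.

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

// Hypothetical stand-in for the Utils object referenced in the gist.
object Utils {
  // Assumed size of the hashed feature space; not taken from the gist.
  val numFeatures = 1000
  private val tf = new HashingTF(numFeatures)

  // Hash the overlapping character bigrams of the text into a fixed-length
  // term-frequency vector that KMeans.train can consume.
  def featurize(text: String): Vector =
    tf.transform(text.toLowerCase.sliding(2).toSeq)
}
```

With an object like this in scope, `bodies.map(Utils.featurize)` yields an `RDD[Vector]` in which every vector has the same length, which is what `KMeans.train` requires. A larger `numFeatures` reduces hash collisions at the cost of wider vectors.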