@jkbradley
jkbradley / benchm-ml-spark
Created September 8, 2015 23:58
Running benchm-ml benchmark for random forest on Spark, using soft predictions to get better AUC
Here are 2 code snippets:
(1) Compute one-hot encoded data for Spark, using the data generated by https://github.com/szilard/benchm-ml/blob/master/0-init/2-gendata.txt
(2) Run MLlib, computing soft predictions by hand.
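As a rough illustration of what step (1) involves, here is a minimal plain-Scala sketch of one-hot encoding a single categorical column. It is not the gist's actual snippet: the helper names `buildIndex` and `oneHot` and the sample airport codes are hypothetical, chosen only for this example.

```scala
// Hypothetical helper: assign each distinct category a stable column index.
def buildIndex(values: Seq[String]): Map[String, Int] =
  values.distinct.sorted.zipWithIndex.toMap

// Hypothetical helper: encode one value as a dense 0/1 vector.
// A category missing from the index encodes as all zeros.
def oneHot(value: String, index: Map[String, Int]): Array[Double] = {
  val vec = Array.fill(index.size)(0.0)
  index.get(value).foreach(i => vec(i) = 1.0)
  vec
}

// Illustrative values only, not taken from the benchmark data.
val originIndex = buildIndex(Seq("SFO", "JFK", "SFO", "ORD"))
val encoded = oneHot("SFO", originIndex)
```

For columns with many categories, a sparse representation such as MLlib's `Vectors.sparse` would likely be a better fit than the dense array used here.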
I ran both snippets with Spark 1.4, and they should work with 1.5 as well.
Note: There's no real need to switch to DataFrames yet for benchmarking. Both the RDD and DataFrame APIs use the same underlying implementation. (I hope to improve on that in Spark 1.6 if there is time.)
The benchmark ran on an EC2 cluster with 4 workers (9.6 GB of memory each), using 8 partitions for the training RDD.
For the 1M dataset, training the forest took 2080.814977193 sec and achieved AUC 0.7129779357732448 on the test set.
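The idea behind "soft predictions by hand" can be sketched as follows: instead of taking the forest's majority-vote class, use the fraction of trees that vote for the positive class as a score, then compute AUC from those scores. This is a minimal sketch assuming binary labels encoded as 0.0/1.0 and the RDD-based MLlib API; the function names `voteFraction`, `softPredict`, and `softAUC` are hypothetical and not taken from the gist.

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Hypothetical helper: mean of the per-tree votes.
// With 0.0/1.0 votes this is the fraction of trees voting for class 1.0.
def voteFraction(votes: Array[Double]): Double =
  votes.sum / votes.length

// Soft score for one example: ask every tree in the forest separately.
def softPredict(model: RandomForestModel, features: Vector): Double =
  voteFraction(model.trees.map(_.predict(features)))

// AUC on a labeled test set, using the soft scores instead of hard 0/1 labels.
def softAUC(model: RandomForestModel, test: RDD[LabeledPoint]): Double = {
  val scoreAndLabels = test.map(lp => (softPredict(model, lp.features), lp.label))
  new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
}
```

Hard 0/1 predictions give the ROC curve only a single operating point, whereas a graded score lets the metric rank examples, which is why soft predictions tend to yield a better AUC.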
jkbradley / LDA_SparkDocs
Created March 24, 2015 23:56
LDA Example: Modeling topics in the Spark documentation
/*
This example uses Scala. Please see the MLlib documentation for a Java example.
Try running this code in the Spark shell. It may produce different topics each time (since LDA includes some randomization), but it should give topics similar to those listed above.
This example is paired with a blog post on LDA in Spark: http://databricks.com/blog
Spark: http://spark.apache.org/
*/
import scala.collection.mutable