Audiology dataset - Spark FPGrowth experiment
// spark-shell --master yarn --deploy-mode client
// Dataset available from:
// http://repository.seasr.org/Datasets/UCI/csv/audiology.csv
import org.apache.spark.mllib.fpm.FPGrowth
// Simple timing helper: evaluates the given block, prints the wall-clock time in
// seconds, and returns the block's result.
def elapsed[R](block: => R): R = {
  val ts = System.nanoTime()
  val r = block
  val te = System.nanoTime()
  println("Elapsed time: " + (te - ts) / 1000000000.0 + "s")
  r
}
// First line of the CSV holds the attribute names.
val header = sc.textFile("./audiology.csv").first().split(",")

// Build one transaction per record: drop the header row, then turn each record
// into an array of (attribute, value) items, keeping only non-empty values.
// Trailing dots keep the chained expression paste-friendly in spark-shell.
val transactions = sc.textFile("./audiology.csv").
  mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }.
  map(_.split(",").
    zipWithIndex.
    map { case (value, i) => (header(i), value) }.
    filter(_._2 != "")).
  cache()

transactions.count()
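// Optional sanity check (not part of the original gist): print one transaction to
// confirm the (attribute, value) item encoding before mining; the exact items
// shown depend on the downloaded CSV.
transactions.take(1).foreach(t => println(t.mkString(", ")))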
// Run FPGrowth for a range of minimum-support thresholds and report how many
// frequent itemsets each threshold produces, together with the running time.
for (minSupport <- List(0.99, 0.98, 0.97, 0.96, 0.95, 0.94)) {
  val fpgrowth = new FPGrowth().setMinSupport(minSupport).setNumPartitions(10)
  elapsed {
    val model = fpgrowth.run(transactions)
    println(s"$minSupport - " + model.freqItemsets.count())
  }
}
// Results (minSupport - frequent itemset count, elapsed time):
// 0.99 - 289         (0.161282138 s)
// 0.98 - 14564       (0.277948854 s)
// 0.97 - 308552      (1.024475363 s)
// 0.96 - 10939116    (22.424327037 s)
// 0.95 - 73162705    (155.5946339 s)
// 0.94 - 366880771   (835.873894232 s)
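// Optional follow-up sketch (an assumption, not part of the timed experiment above):
// fit a single model at the 0.99 threshold and print a few of its frequent itemsets
// with their absolute frequencies, reusing the same `transactions` RDD.
val sampleModel = new FPGrowth().setMinSupport(0.99).setNumPartitions(10).run(transactions)
sampleModel.freqItemsets.take(5).foreach { itemset =>
  println(itemset.items.mkString("[", ", ", "]") + " -> freq: " + itemset.freq)
}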