@ianmilligan1
Created May 20, 2016 16:08
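// Run inside spark-shell (which provides `sc`); use :paste mode so each
// multi-line chained expression is read as a single statement.
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RecordLoader, ExtractClusters}

// Load the GeoCities WARCs and keep only records whose URL matches the
// EnchantedForest pattern.
val recs = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/", sc)
  .keepUrlPatterns(Set("http://geocities.com/EnchantedForest/.*".r))

// Run the clustering steps on the filtered records: top-N words, LDA, then sample documents.
val clusters = ExtractClusters(recs, sc)
  .topNWords("GEO_ENCHANTED_FOREST_TOP_N", sc)
  .computeLDA("GEO_ENCHANTED_FOREST_LDA", sc)
  .saveSampleDocs("GEO_ENCHANTED_FOREST_LDA", sc)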
spark-shell --jars ~/git/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar --num-executors 75 --executor-cores 5 --executor-memory 20G --driver-memory 10G