@ianmilligan1
Created May 16, 2016 19:42
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
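// Extract the plain text of GeoCities pages that mention selected concentration camps.
// saveAsTextFile returns Unit, so the result is not assigned to a val.
RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/", sc)
  .keepValidPages()
  // Scala regexes are case-sensitive by default; prefix a pattern with (?i) to ignore case.
  .keepContent(Set("auschwitz".r, "auschwitz-birkenau".r, "dachau".r, "neuengamme".r, "sachsenhausen".r))
  // Emit (crawl date, domain, URL, text with HTML markup removed) for each matching page.
  .map(page => (page.getCrawlDate, page.getDomain, page.getUrl, RemoveHTML(page.getContentString)))
  .saveAsTextFile("holocaust-text-geocities/")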