@ruebot
Last active February 10, 2020 22:46
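// Assumes a spark-shell session with the Archives Unleashed Toolkit loaded; `sc` is the shell's SparkContext.
import io.archivesunleashed._
import io.archivesunleashed.df._

// Case-insensitive ((?i)) pattern for URLs under geocities.com/EnchantedForest/.
val urlPattern = Set("(?i)http://geocities.com/EnchantedForest/.*".r)

// Load the GeoCities WARCs, build the image graph, filter by URL pattern, and write the result as Parquet.
RecordLoader.loadArchives("/store/scratch/nruest/web_archives/geocities/warcs", sc)
  .imagegraph()
  .keepUrlPatternsDF(urlPattern)
  .write.parquet("/store/scratch/nruest/web_archives/geocities/derivatives/parquet/uras-2020-enchanted-forest")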
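// Assumes a spark-shell session with the Archives Unleashed Toolkit loaded; `sc` is the shell's SparkContext.
import io.archivesunleashed._
import io.archivesunleashed.df._

// Case-insensitive ((?i)) pattern for URLs under geocities.com/EnchantedForest/.
val urlPattern = Set("(?i)http://geocities.com/EnchantedForest/.*".r)

// Load the GeoCities WARCs, extract image records, filter by URL pattern, and write the result as Parquet.
// Note: this is the same output path as the imagegraph snippet; Spark's default save mode fails if the path already exists.
RecordLoader.loadArchives("/store/scratch/nruest/web_archives/geocities/warcs", sc)
  .images()
  .keepUrlPatternsDF(urlPattern)
  .write.parquet("/store/scratch/nruest/web_archives/geocities/derivatives/parquet/uras-2020-enchanted-forest")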