@ianmilligan1
Created April 16, 2020 20:24
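import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Intended for spark-shell with the Archives Unleashed Toolkit loaded; `sc` is the shell's SparkContext.
// Loads every WARC under /political_actors_data/, keeps pages detected as German ("de"),
// strips HTTP headers, extracts boilerplate-free text, and writes crawl date, URL, and text as CSV.
RecordLoader.loadArchives("/political_actors_data/*.warc.gz", sc)
  .webpages()
  .keepLanguagesDF(Set("de"))
  .select($"crawl_date", $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))
  .write.csv("/political_actors_data/plain-text-noboilerplate-df/")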