@ianmilligan1
Created April 16, 2020 20:23
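import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Intended for spark-shell with the Archives Unleashed Toolkit on the classpath;
// the shell supplies `sc` and the `$"column"` syntax.
RecordLoader.loadArchives("/political_actors_data/*.warc.gz", sc)
  .webpages()                                              // extract web page records as a DataFrame
  .keepLanguagesDF(Set("de"))                              // keep only pages detected as German
  .select($"crawl_date", $"url", RemoveHTMLDF($"content")) // strip HTML markup from the page content
  .write.csv("/political_actors_data/plain-text-df/")      // write crawl date, URL, and plain text as CSV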