@ruebot
Last active October 8, 2019 21:55
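import io.archivesunleashed._
import io.archivesunleashed.df._
// spark-shell already has desc in scope; the explicit import covers other contexts.
import org.apache.spark.sql.functions.desc

// Load the web archive collection and extract image details as a DataFrame.
val images = RecordLoader
  .loadArchives("/path/to/web/archive/collection", sc)
  .extractImageDetailsDF()

// Select the image metadata columns, sort by MD5 hash (descending), and write as CSV.
images.select($"url", $"filename", $"extension", $"mime_type_web_server",
    $"mime_type_tika", $"width", $"height", $"md5")
  .orderBy(desc("md5"))
  .write
  .csv("/path/to/images/dataframe")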