@ruebot
ruebot / Extractors
Last active June 17, 2020 17:21
AUT Spark 3.0.0 Testing
/home/nruest/bin/spark-3.0.0-bin-hadoop2.7/bin/spark-submit --master local\[2\] --driver-memory 4g --conf spark.driver.maxResultSize=0 --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.80.1-SNAPSHOT-fatjar.jar --extractor AudioInformationExtractor --input /home/nruest/Projects/au/sample-data/geocities --output /home/nruest/Projects/au/sample-data/3.0.0-testing/audio/csv
/home/nruest/bin/spark-3.0.0-bin-hadoop2.7/bin/spark-submit --master local\[2\] --driver-memory 4g --conf spark.driver.maxResultSize=0 --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.80.1-SNAPSHOT-fatjar.jar --extractor DomainFrequencyExtractor --input /home/nruest/Projects/au/sample-data/geocities --output /home/nruest/Projects/au/sample-data/3.0.0-testing/domains/csv
/home/nruest/bin/spark-3.0.0-bin-hadoop2.7/bin/spark-submit --master local\[2\] --driver-memory 4g --conf spark.driver.maxResultSize=0 --class io.archivesunleashed.app.CommandLineAppR
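The two complete commands above differ only in the extractor name and the output directory, so they can be generated from a short list. The following is a minimal sketch under that assumption, not the author's script: the names JOBS, DRY_RUN, and run_extractors are made up for illustration, and only the two extractors shown in full above are listed. By default it prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: build one spark-submit command per extractor.
# Paths and flags are copied from the commands above; JOBS, DRY_RUN and
# run_extractors are hypothetical names used only for this example.
set -euo pipefail

SPARK_SUBMIT=/home/nruest/bin/spark-3.0.0-bin-hadoop2.7/bin/spark-submit
AUT_JAR=/home/nruest/Projects/au/aut/target/aut-0.80.1-SNAPSHOT-fatjar.jar
INPUT=/home/nruest/Projects/au/sample-data/geocities
OUTPUT_ROOT=/home/nruest/Projects/au/sample-data/3.0.0-testing

# "extractor:output-subdirectory" pairs.
JOBS=(
  "AudioInformationExtractor:audio"
  "DomainFrequencyExtractor:domains"
)

run_extractors() {
  local job extractor subdir
  for job in "${JOBS[@]}"; do
    extractor=${job%%:*}   # text before the colon
    subdir=${job##*:}      # text after the colon
    local cmd=(
      "$SPARK_SUBMIT" --master 'local[2]' --driver-memory 4g
      --conf spark.driver.maxResultSize=0
      --class io.archivesunleashed.app.CommandLineAppRunner
      "$AUT_JAR"
      --extractor "$extractor"
      --input "$INPUT"
      --output "$OUTPUT_ROOT/$subdir/csv"
    )
    if [[ ${DRY_RUN:-1} == 1 ]]; then
      # %q re-quotes each word so the printed line can be pasted into a shell.
      printf '%q ' "${cmd[@]}"
      printf '\n'
    else
      "${cmd[@]}"
    fi
  done
}

run_extractors
```

Setting DRY_RUN=0 in the environment would execute the commands rather than print them.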
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview
      /_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
Type in expressions to have them evaluated.
Type :help for more information.
| URL | MD5 | COUNT MD5 | COUNT FILENAME |
| --- | --- | --- | --- |
| http://www.geocities.com/clipart/pbi/c.gif | c4746081d66bc2abc269f22ca27ebb46 | 2,705 | 373,198 |
| http://pic.geocities.com/images/pixel.gif | b4682377ddfbe4e7dabfddb2e543e842 | 3,336 | 18,685 |
| http://www.google.com/images/cleardot.gif | fc94fb0c3ed8a8f909dbc7630a0987ff | 69,625 | 747 |
| http://www.google.com/clear.gif | 55fade2068e7503eae8d7ddf5eb6bd09 | 2,551 | 13,852 |
| https://killersites.com/killerSites/resources/dot_clear.gif | b4682377ddfbe4e7dabfddb2e543e842 | 3,336 | 1,780 |
| https://mail.google.com/mail/images/cleardot.gif | fc94fb0c3ed8a8f909dbc7630a0987ff | 69,625 | 747 |
| http://visit.geocities.yahoo.com/visit.gif | 4f59788bde58d15d541a9c116d0e850d | 2,729,121 | 2,731,243 |
| http://blingee.com/images/spaceball.gif | 325472601571f31e1bf00674c368d335 | 18,537,796 | 39 |
| http://www-cdr.stanford.edu/~petrie/blank.gif | accba0b69f352b4c9440f05891b015c5 | 1,341 | 26,292 |
http://it.geocities.com/grannoce/camere/thumb/camera_blu_001.jpg,camera_blu_001.jpg,jpg,image/jpeg,image/jpeg,112,150,fffffef31a159782b97876b7a17eab92
http://ar.geocities.com/angeles_uno/PLAYMATES/1999/JUNIO/KIMBERLY_SPICER/06_small.jpg,06_small.jpg,jpg,image/jpeg,image/jpeg,100,143,fffffd5fe6d986c04f028854bbd4a20a
http://in.geocities.com/nileshtx/images/DSC01219.jpg,DSC01219.jpg,jpg,image/jpeg,image/jpeg,510,768,fffffc7244d39657dd286547fda3fd0d
http://kr.geocities.com/magicianclow/img/favor.gif,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e
http://login.space2000.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://91-143-80-250.blue.kundencontroller.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://cf.geocities.com/rouquins/images/merlin0.jpg,merlin0.jpg,jpg,image/jpeg,image/jpeg,129,140,fffff077e30e213fa08cecc389a60bdb
http://ar.geocities.com/aliaga_fernandoo/ediciones/ed7/imagenes/menu/MENU7_r11_c21.
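Two of the complete rows above (both logo.gif) carry the same MD5 digest, which suggests the same image served from two hosts. A minimal sketch for listing such repeated digests in a CSV like this one follows. It assumes the digest is the last field and that no field contains a quoted comma; duplicate_md5s is a hypothetical helper name, and the sample file holds four complete rows copied from above.

```shell
#!/usr/bin/env bash
# Sketch: list MD5 digests that occur more than once in an image-details CSV.
# Assumes the digest is the last comma-separated field and that no field
# contains a quoted comma. duplicate_md5s is a hypothetical helper name.
set -euo pipefail

duplicate_md5s() {
  # $1: path to the CSV file. Prints "count,md5", highest count first.
  awk -F, '
    { count[$NF]++ }
    END { for (m in count) if (count[m] > 1) print count[m] "," m }
  ' "$1" | sort -t, -k1,1nr -k2,2
}

# Four complete rows copied from the listing above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
http://kr.geocities.com/magicianclow/img/favor.gif,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e
http://login.space2000.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://91-143-80-250.blue.kundencontroller.de/logo.gif,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
http://cf.geocities.com/rouquins/images/merlin0.jpg,merlin0.jpg,jpg,image/jpeg,image/jpeg,129,140,fffff077e30e213fa08cecc389a60bdb
EOF

duplicate_md5s "$sample"   # prints: 2,fffff72ef7571cf00d0717ac96bfad07
rm -f "$sample"
```

For CSV files with quoted fields, a real CSV parser would be a safer choice than splitting on commas.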
import io.archivesunleashed._
import io.archivesunleashed.df._

val images = RecordLoader
  .loadArchives("/path/to/web/archive/collection", sc)
  .extractImageDetailsDF()

images.select($"url", $"filename", $"extension", $"mime_type_web_server",
    $"mime_type_tika", $"width", $"height", $"md5")
  .orderBy(desc("md5"))
ruebot / dc-datathon-about-vms.md
Last active March 21, 2019 16:20
Archives Unleashed Washington, DC Datathon VMs

About the VMs

Each VM has:

  • Apache Spark 2.4.0
    • Spark shell: /home/ubuntu/spark/bin/spark-shell
  • Python 3.7.1 (Anaconda)
  • Java 8
  • Ruby 2.5.1
  • jq
├── albany
│   ├── environmental-advocates
│   │   ├── derivatives
│   │   └── warcs
│   ├── gillibrand
│   │   ├── derivatives
│   │   └── warcs
│   ├── ny-civil-liberties
│   │   ├── derivatives
$ ./spark-shell --master local\[10\] --driver-memory 30G --conf spark.network.timeout='10000000' --conf spark.executor.heartbeatInterval='600s' --conf spark.driver.maxResultSize='4G' --jars ~/git/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar
2018-11-30 09:08:03 WARN Utils:66 - Your hostname, wombat resolves to a loopback address: 127.0.1.1; using 10.0.1.44 instead (on interface enp0s31f6)
2018-11-30 09:08:03 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-11-30 09:08:04 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.0.1.44:4040
Spark context available as 'sc' (master = local[10], app id = local-1543586887449).
Spark session available as 'spark'.
Welcome to