Ian Milligan ianmilligan1

docker run --rm -it -v "/my/data:/data" aut:0.50.0 /spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.50.0" --driver-memory 7G
.select($"crawl_date", $"url", RemoveHTMLDF(ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content"))))
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/political_actors_data/*.warc.gz", sc)
  .webpages()
  .keepLanguagesDF(Set("de"))
  .select($"crawl_date", $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))
  .write.csv("/political_actors_data/plain-text-noboilerplate-df/")
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/political_actors_data/*.warc.gz", sc)
  .webpages()
  .keepLanguagesDF(Set("de"))
  .select($"crawl_date", $"url", RemoveHTMLDF($"content"))
  .write.csv("/political_actors_data/plain-text-df/")
seed,status,all_count,new_count,all_size,new_size
https://blog.aarp.org/aarp-celebrates-stonewall-50th-anniversary-during-lgbt-pride-month,Crawled,10356,9664,5826862764,5804623908
https://www.vogue.com/article/stonewall-inn-50th-anniversary-interview/,Redirected,25227,16941,5471678002,5385428188
http://nyfos.org/stonewall-at-50/,Crawled,69082,68923,5371521735,5369687319
https://en.wikipedia.org/wiki/Stonewall_50_%E2%80%93_WorldPride_NYC_2019,Crawled,40728,39016,5416393613,5369366999
https://www.atlanta.net/events/detail/stonewall-50-exhibit-at-atlanta-city-hall/122642/,Crawled,66378,62970,5423253752,5369149241
https://www.thedailybeast.com/stonewall-50-dont-forget-the-black-and-brown-lgbtq-struggle/,Crawled,61033,56887,5448153720,5369118822
https://kywnewsradio.radio.com/categories/stonewall-50/,Crawled,34508,34174,5414148507,5369005035
http://www.roosevelthouse.hunter.cuny.edu/events/fifty-years-stonewall-now-go/,Crawled,86378,86261,5370103576,5368828744
https://soundcloud.com/workingclasshistory/stonewall-r
ianmilligan1 / gist:0d82dc6584464da126a95473463bdc49
Created February 10, 2020 20:10
to-copy-into-your-notebook.py
from google.colab import files
uploaded = files.upload()
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--page.tpl.php-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<title>Consumers need protection from genetically modified foods | NDP</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
<link type="text/css" rel="stylesheet" media="all" href="/sites/all/modules/nice_menus/nice_menus.css?f" />
ianmilligan1@Ians-MacBook-Pro-3:~/dropbox/git/aut$ python ~/dropbox/git/aut/src/main/python/tf/detect.py --web_archive "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz" --aut_jar /Users/ianmilligan1/dropbox/git/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar --spark /Users/ianmilligan1/dropbox/spark-2.4.3-bin-hadoop2.7/bin --master spark://Ians-MacBook-Pro-3.local:7077 --img_model ssd --filter_size 50 50 --output_path /Users/ianmilligan1/desktop/aut-image-tf-testing
19/07/10 15:44:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
height >= 50 and width >= 50
[Stage 0:> (0 + 1) / 1
scala> :paste
// Entering paste mode (ctrl-D to finish)
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.archive.org"))
  .keepLanguages(Set("fr"))
2018-11-30 09:36:20,149 [main-ScalaTest-running-CommandLineAppTest] ERROR CommandLineApp - _AUTCmdTestOutputDir already exists
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.191 sec <<< FAILURE!
command line app tests(io.archivesunleashed.CommandLineAppTest) Time elapsed: 0.108 sec <<< ERROR!
java.lang.IllegalArgumentException
at io.archivesunleashed.app.CommandLineApp.verifyArgumentsOrExit(CommandLineApp.scala:219)
at io.archivesunleashed.app.CommandLineAppRunner$.test(CommandLineApp.scala:344)
at io.archivesunleashed.CommandLineAppTest$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(CommandLineAppTest.scala:76)
at io.archivesunleashed.CommandLineAppTest$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(CommandLineAppTest.scala:75)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)