docker run --rm -it -v "/my/data:/data" aut:0.50.0 /spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.50.0" --driver-memory 7G
.select($"crawl_date", $"url", RemoveHTMLDF(ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content"))))
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/political_actors_data/*.warc.gz", sc)
  .webpages()
  .keepLanguagesDF(Set("de"))
  .select($"crawl_date", $"url", ExtractBoilerpipeTextDF(RemoveHTTPHeaderDF($"content")))
  .write.csv("/political_actors_data/plain-text-noboilerplate-df/")
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("/political_actors_data/*.warc.gz", sc)
  .webpages()
  .keepLanguagesDF(Set("de"))
  .select($"crawl_date", $"url", RemoveHTMLDF($"content"))
  .write.csv("/political_actors_data/plain-text-df/")
seed,status,all_count,new_count,all_size,new_size
https://blog.aarp.org/aarp-celebrates-stonewall-50th-anniversary-during-lgbt-pride-month,Crawled,10356,9664,5826862764,5804623908
https://www.vogue.com/article/stonewall-inn-50th-anniversary-interview/,Redirected,25227,16941,5471678002,5385428188
http://nyfos.org/stonewall-at-50/,Crawled,69082,68923,5371521735,5369687319
https://en.wikipedia.org/wiki/Stonewall_50_%E2%80%93_WorldPride_NYC_2019,Crawled,40728,39016,5416393613,5369366999
https://www.atlanta.net/events/detail/stonewall-50-exhibit-at-atlanta-city-hall/122642/,Crawled,66378,62970,5423253752,5369149241
https://www.thedailybeast.com/stonewall-50-dont-forget-the-black-and-brown-lgbtq-struggle/,Crawled,61033,56887,5448153720,5369118822
https://kywnewsradio.radio.com/categories/stonewall-50/,Crawled,34508,34174,5414148507,5369005035
http://www.roosevelthouse.hunter.cuny.edu/events/fifty-years-stonewall-now-go/,Crawled,86378,86261,5370103576,5368828744
https://soundcloud.com/workingclasshistory/stonewall-r
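The last row of the seed report above is cut off, so it has fewer fields than the six-column header. A minimal sketch, assuming the report is available as text, of reading it with Python's standard `csv` module and setting incomplete rows aside rather than guessing their values. The function name `read_seed_report` and the `REPORT` sample are illustrative, not part of the original gist; the sample rows are copied from the report.

```python
import csv
import io

# Sample rows copied from the seed report above; the last line is the
# truncated row exactly as it appears in the source.
REPORT = """\
seed,status,all_count,new_count,all_size,new_size
http://nyfos.org/stonewall-at-50/,Crawled,69082,68923,5371521735,5369687319
https://www.vogue.com/article/stonewall-inn-50th-anniversary-interview/,Redirected,25227,16941,5471678002,5385428188
https://soundcloud.com/workingclasshistory/stonewall-r
"""

# Columns that hold integer counts or byte sizes.
INT_FIELDS = ("all_count", "new_count", "all_size", "new_size")


def read_seed_report(text):
    """Return (complete_rows, incomplete_rows) parsed from CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    complete, incomplete = [], []
    for row in reader:
        if len(row) != len(header):
            # Keep rows with the wrong field count aside, untouched.
            incomplete.append(row)
            continue
        record = dict(zip(header, row))
        for field in INT_FIELDS:
            record[field] = int(record[field])
        complete.append(record)
    return complete, incomplete


complete, incomplete = read_seed_report(REPORT)
print(len(complete), "complete rows;", len(incomplete), "incomplete")
```

Keeping the incomplete rows in a separate list makes it easy to inspect what was dropped before doing any aggregation on the counts.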
from google.colab import files
uploaded = files.upload()
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--page.tpl.php-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<title>Consumers need protection from genetically modified foods | NDP</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
<link type="text/css" rel="stylesheet" media="all" href="/sites/all/modules/nice_menus/nice_menus.css?f" />
ianmilligan1@Ians-MacBook-Pro-3:~/dropbox/git/aut$ python ~/dropbox/git/aut/src/main/python/tf/detect.py --web_archive "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz" --aut_jar /Users/ianmilligan1/dropbox/git/aut/target/aut-0.17.1-SNAPSHOT-fatjar.jar --spark /Users/ianmilligan1/dropbox/spark-2.4.3-bin-hadoop2.7/bin --master spark://Ians-MacBook-Pro-3.local:7077 --img_model ssd --filter_size 50 50 --output_path /Users/ianmilligan1/desktop/aut-image-tf-testing
19/07/10 15:44:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
height >= 50 and width >= 50
[Stage 0:> (0 + 1) / 1
scala> :paste
// Entering paste mode (ctrl-D to finish)
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("www.archive.org"))
  .keepLanguages(Set("fr"))
2018-11-30 09:36:20,149 [main-ScalaTest-running-CommandLineAppTest] ERROR CommandLineApp - _AUTCmdTestOutputDir already exists
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.191 sec <<< FAILURE!
command line app tests(io.archivesunleashed.CommandLineAppTest) Time elapsed: 0.108 sec <<< ERROR!
java.lang.IllegalArgumentException
    at io.archivesunleashed.app.CommandLineApp.verifyArgumentsOrExit(CommandLineApp.scala:219)
    at io.archivesunleashed.app.CommandLineAppRunner$.test(CommandLineApp.scala:344)
    at io.archivesunleashed.CommandLineAppTest$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(CommandLineAppTest.scala:76)
    at io.archivesunleashed.CommandLineAppTest$$anonfun$2$$anonfun$apply$mcV$sp$1.apply(CommandLineAppTest.scala:75)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)