@ruebot
Last active March 21, 2019 16:20
Archives Unleashed Washington, DC Datathon VMs

About the VMs

Each VM has:

  • Apache Spark 2.4.0
    • Spark shell: /home/ubuntu/spark/bin/spark-shell
  • Python 3.7.1 (Anaconda)
  • Java 8
  • Ruby 2.5.1
  • jq
  • GNU Parallel
  • Jupyter Notebook
  • Datathon notebook

Each machine has (6 machines total):

  • 16 virtual cores
  • 30G RAM (one machine has 60G)
  • data: /mnt/data (179G of free space)

Archive-It Collections

ID Title
4719 International Brotherhood of Teamsters
7082 Environmental Advocates of New York
7081 New York Civil Liberties Union
6917 Senator Kirsten Gillibrand
6918 Senator Chuck Schumer
9655 New York State Political Third Parties
3642 Avian Influenza A (H7N9) Virus web archive
4254 Disorders of the Developing and Aging Brain: Autism and Alzheimer’s on the Web
6850 Bioethics web archive
7219 Environmental Health web archive
8370 Domestic Violence Awareness and Prevention
10568 District of Columbia Elections 2018
10427 DC Punk (Web) Archive
10985 DC 1968
7485 Nova Scotia Municipal Governments
10188 Ecology Action Centre websites
11360 Websites of the Former Soviet Union & Eastern Europe
10866 #metoo collection
5828 StopBullying.gov

Web Archives for Historical Research Collections

  • #climatemarch
  • #elxn42
  • #marchforscience
  • #panamanpapers
  • #womensmarch

About the data:

  • Each collection's directory will have a warcs directory containing all of its ARCs/WARCs
  • Each collection's directory will have a derivatives directory containing the scholarly derivatives created on cloud.archivesunleashed.org
    • More info about those files can be found here.
  • See the directory tree further down for a look at the directory structure.
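
As a quick way to see what a collection's warcs directory holds, here is a minimal sketch. The function name list_warcs is made up for illustration, and it assumes the collection directories sit directly under the data root (for example /mnt/data) and that the files use the usual .arc, .warc, .arc.gz or .warc.gz extensions. Adjust both if your machine differs.

```shell
# Hypothetical helper: list the ARC/WARC files of one collection.
# Assumptions: layout is <data_root>/<institution>/<collection>/warcs,
# and files end in .arc, .warc, .arc.gz or .warc.gz.
list_warcs() {
  data_root="$1"    # e.g. /mnt/data
  collection="$2"   # e.g. albany/schumer
  find "$data_root/$collection/warcs" -type f \
    \( -name '*.warc.gz' -o -name '*.arc.gz' \
       -o -name '*.warc' -o -name '*.arc' \) | sort
}
```

For example, list_warcs /mnt/data albany/schumer | wc -l would count the files for the Schumer collection, assuming that is where the collection lives on your machine.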

Apache Spark + aut

  • To run the Apache Spark shell with aut (the Archives Unleashed Toolkit) on each machine, run the following command: ~/spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.17.0"

  • In the course of your project, you might need to use additional flags. These should work well on each machine: ~/spark/bin/spark-shell --packages "io.archivesunleashed:aut:0.17.0" --master local[*] --driver-memory 12G --conf spark.network.timeout=100000000 --conf spark.executor.heartbeatInterval=6000s
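
If you find yourself retyping the longer command, a small wrapper function can help. This is only a sketch: the name aut_shell and the SPARK_SHELL override variable are made up for illustration, and the flags are exactly the ones listed above with the driver memory turned into a parameter. Quoting 'local[*]' keeps the shell from treating the brackets as a glob pattern, which matters in shells such as zsh that reject unmatched patterns by default.

```shell
# Hypothetical wrapper around the spark-shell command given above.
# SPARK_SHELL is an illustrative override; by default it uses the
# path from the VM description.
aut_shell() {
  driver_memory="${1:-12G}"   # default matches the flags above
  "${SPARK_SHELL:-$HOME/spark/bin/spark-shell}" \
    --packages "io.archivesunleashed:aut:0.17.0" \
    --master 'local[*]' \
    --driver-memory "$driver_memory" \
    --conf spark.network.timeout=100000000 \
    --conf spark.executor.heartbeatInterval=6000s
}
```

Running aut_shell starts the shell with 12G of driver memory. Passing a value such as aut_shell 24G changes it; whether a larger value is appropriate on the 60G machine is a judgment call for your team.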

Accessing the machines:

  • The permissions on the key should be 600. You can set this with the following command on your own laptop before shelling in: chmod 600 /path/to/archives-hackathon.key

  • You can shell into the machines with the following command: ssh -i /path/to/archives-hackathon.key ubuntu@206.167.183.xx

  • I will provide the key and the IP address of each machine after teams are formed on the first day of the datathon.
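
The two steps above can be combined into one function for your laptop's shell profile. This is a minimal sketch: the name datathon_connect and the SSH override variable are made up for illustration, and the host argument is whatever IP address you are given for your team's machine.

```shell
# Hypothetical helper: fix the key permissions, then connect.
# SSH is an illustrative override; it defaults to the ssh client.
datathon_connect() {
  key="$1"    # path to archives-hackathon.key
  host="$2"   # IP address of your team's machine
  # ssh refuses private keys that other users can read.
  chmod 600 "$key" || return 1
  "${SSH:-ssh}" -i "$key" "ubuntu@$host"
}
```

Usage would look like datathon_connect /path/to/archives-hackathon.key followed by the machine's IP address.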

├── albany
│   ├── environmental-advocates
│   │   ├── derivatives
│   │   └── warcs
│   ├── gillibrand
│   │   ├── derivatives
│   │   └── warcs
│   ├── ny-civil-liberties
│   │   ├── derivatives
│   │   └── warcs
│   ├── ny-third-parties
│   │   ├── derivatives
│   │   └── warcs
│   └── schumer
│       ├── derivatives
│       └── warcs
├── dalhousie
│   ├── eco-action-centre
│   │   ├── derivatives
│   │   └── warcs
│   └── nova-scoita-municipal-govs
│       ├── derivatives
│       └── warcs
├── dcpl
│   ├── dc-1968
│   │   ├── derivatives
│   │   └── warcs
│   ├── dc-2018-elections
│   │   ├── derivatives
│   │   └── warcs
│   └── dc-punk
│       ├── derivatives
│       └── warcs
├── fdlp
│   └── stopbullying
│       ├── derivatives
│       └── warcs
├── gwu
│   └── teamsters
│       ├── derivatives
│       └── warcs
├── harvard
│   └── metoo
│       ├── derivatives
│       └── warcs
├── ivy
│   └── former-soviet-union
│       ├── derivatives
│       └── warcs
├── nlm
│   ├── H7N9
│   │   ├── derivatives
│   │   └── warcs
│   ├── bioethics
│   │   ├── derivatives
│   │   └── warcs
│   ├── brain-disorders
│   │   ├── derivatives
│   │   └── warcs
│   ├── domestic-violence
│   │   ├── derivatives
│   │   └── warcs
│   └── enviro-health
│       ├── derivatives
│       └── warcs
└── wahr
    ├── climatemarch
    │   └── warcs
    ├── elxn42
    │   └── warcs
    ├── marchforscience
    │   └── warcs
    ├── panamanpapers
    │   └── warcs
    └── womensmarch
        └── warcs
##-------------------
## AU DC Datathon
##-------------------
## c16-30gb-880gb machines:
alias datathon1="ssh -i ~/.ssh/archives-hackathon.key ubuntu@206.167.180.194"
alias datathon2="ssh -i ~/.ssh/archives-hackathon.key ubuntu@206.167.181.55"
alias datathon3="ssh -i ~/.ssh/archives-hackathon.key ubuntu@206.167.181.60"
alias datathon4="ssh -i ~/.ssh/archives-hackathon.key ubuntu@206.167.181.52"
alias datathon5="ssh -i ~/.ssh/archives-hackathon.key ubuntu@206.167.180.242"
## c16-60gb-880gb machine:
alias datathon6="ssh -i ~/.ssh/archives-hackathon.key ubuntu@206.167.181.64"
##-------------------