Brooke Schreier Ganz Asparagirl

## gist:6202872

      
Asparagirl / gist:6202872 (1 file, 1 fork, 0 comments, 20 stars; last active March 28, 2022 20:28)

Want to help Archive Team do a "panic grab" of a website, so that you can later upload it to the Internet Archive for inclusion in its Wayback Machine? Here's the code!

Want to grab a copy of your favorite website, using wget on the command line, and save it in WARC format? Then this is the gist for you. Read on!

First, copy the following lines into a text file, and edit them as needed. Then paste them into your command line and hit enter:
export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
export DOMAIN_NAME_TO_SAVE="www.example.com"
export SPECIFIC_HOSTNAMES_TO_INCLUDE="example1.com,example2.com,images.example2.com"
export FILES_AND_PATHS_TO_EXCLUDE="/path/to/ignore"
export WARC_NAME="example.com-20130810-panicgrab"
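
The wget command that consumes these variables is not shown here. As a minimal sketch, and not necessarily the author's exact command, the variables could be passed to GNU wget as below. The choice of flags is an assumption, and `DRY_RUN` and `WGET_PREVIEW` are hypothetical helper names, added so the command is only printed unless you opt in.

```shell
# Sketch only: the exact flags of the original command are an assumption.
# Fall back to the example values if the variables were not exported.
: "${USER_AGENT:=Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27}"
: "${DOMAIN_NAME_TO_SAVE:=www.example.com}"
: "${SPECIFIC_HOSTNAMES_TO_INCLUDE:=example1.com,example2.com,images.example2.com}"
: "${FILES_AND_PATHS_TO_EXCLUDE:=/path/to/ignore}"
: "${WARC_NAME:=example.com-20130810-panicgrab}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print the command

# Build the wget argument list in the positional parameters.
# robots=off ignores robots.txt; drop that line if you want to respect it.
set -- \
  --execute robots=off \
  --mirror \
  --page-requisites \
  --span-hosts \
  --domains="$DOMAIN_NAME_TO_SAVE,$SPECIFIC_HOSTNAMES_TO_INCLUDE" \
  --exclude-directories="$FILES_AND_PATHS_TO_EXCLUDE" \
  --user-agent="$USER_AGENT" \
  --wait=1 --random-wait \
  --warc-file="$WARC_NAME" \
  --warc-cdx \
  "http://$DOMAIN_NAME_TO_SAVE/"

WGET_PREVIEW="wget $*"

if [ "$DRY_RUN" = "1" ]; then
  # Show what would run, without touching the network.
  printf '%s\n' "$WGET_PREVIEW"
else
  wget "$@"
fi
```

Run it once as-is to check the printed command, then rerun with `DRY_RUN=0` to start the grab. With `--warc-file`, wget writes the WARC alongside the mirrored files, using `$WARC_NAME` as the filename prefix.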

  
## gist:6206247

      
Asparagirl / gist:6206247 (1 file, 0 forks, 4 comments, 26 stars; last active February 14, 2024 19:56)

Have a WARC that you would like to upload to the Internet Archive so that it can eventually be included in their Wayback Machine? Here's how to upload it from the command line.

Do you have a WARC file of a website all downloaded and ready to be added to the Internet Archive? Great! You can do that with the Internet Archive's web-based uploader, but it's not ideal and it can't handle really big uploads. Here's how you can upload your WARC files to the IA from the command line, without worrying about a size restriction.

First, you need to get your Access Key and Secret Key from the Internet Archive for the S3-like API. Here's where you can get them for your IA account: http://archive.org/account/s3.php Don't share those with other people!

Here's their documentation file about how to use it, if you need some extra help: http://archive.org/help/abouts3.txt

Next, you should copy the following lines to a text file and edit them as needed:
export IA_S3_ACCESS_KEY="YOUR-ACCESS-KEY-FROM-THE-IA-GOES-HERE"
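
The remaining variables and the upload command are not shown here. A minimal sketch of an upload through the S3-like API with curl follows, modelled on the header style in the IA documentation linked above. `IA_S3_SECRET_KEY`, `IA_ITEM_IDENTIFIER`, `WARC_FILE`, `UPLOAD_URL`, and `DRY_RUN` are hypothetical names, and the `web` mediatype is an assumption; adjust the metadata headers to suit your item.

```shell
# Sketch only: every variable except IA_S3_ACCESS_KEY is a hypothetical name.
: "${IA_S3_ACCESS_KEY:=YOUR-ACCESS-KEY-FROM-THE-IA-GOES-HERE}"
: "${IA_S3_SECRET_KEY:=YOUR-SECRET-KEY-FROM-THE-IA-GOES-HERE}"
: "${IA_ITEM_IDENTIFIER:=example.com-20130810-panicgrab}"
: "${WARC_FILE:=example.com-20130810-panicgrab.warc.gz}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print what would be uploaded

# The item identifier acts as the bucket name; the file name is the key.
UPLOAD_URL="https://s3.us.archive.org/$IA_ITEM_IDENTIFIER/$(basename "$WARC_FILE")"

if [ "$DRY_RUN" = "1" ]; then
  # Never echo the secret key, even in a dry run.
  printf 'Would upload %s to %s\n' "$WARC_FILE" "$UPLOAD_URL"
else
  curl --location \
    --header 'x-amz-auto-make-bucket:1' \
    --header 'x-archive-meta-mediatype:web' \
    --header "authorization: LOW $IA_S3_ACCESS_KEY:$IA_S3_SECRET_KEY" \
    --upload-file "$WARC_FILE" \
    "$UPLOAD_URL"
fi
```

Because curl streams the file with `--upload-file`, this route avoids the web uploader's size limits mentioned above.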

  
## gist:8f6b52e2cedc055ec1fb

      
Asparagirl / gist:8f6b52e2cedc055ec1fb (1 file, 0 forks, 0 comments, 0 stars; last active August 29, 2015 14:07)

How to download a streaming Livestream.com video to a server

Install

First, install the program Livestreamer, whose docs are online here:
http://livestreamer.readthedocs.org/en/latest/index.html
On Debian or Ubuntu, that would be:
sudo apt-get install livestreamer
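
Once Livestreamer is installed, a capture could look roughly like the sketch below. The steps after installation are not shown here, so this is an assumption about typical usage: `STREAM_URL` is a placeholder you must replace, and `STREAM_QUALITY`, `OUTPUT_FILE`, `LIVESTREAMER_STATUS`, and `DRY_RUN` are hypothetical names.

```shell
# Sketch only: all variable names and values here are placeholders.
STREAM_URL="${STREAM_URL:-PASTE-THE-LIVESTREAM-EVENT-URL-HERE}"
STREAM_QUALITY="${STREAM_QUALITY:-best}"          # "best" picks the highest quality offered
OUTPUT_FILE="${OUTPUT_FILE:-livestream-capture.mp4}"
DRY_RUN="${DRY_RUN:-1}"                           # 1 = only print the command

# Check that the program is on the PATH before trying to use it.
if command -v livestreamer >/dev/null 2>&1; then
  LIVESTREAMER_STATUS="installed"
else
  LIVESTREAMER_STATUS="missing"
fi

if [ "$DRY_RUN" = "1" ] || [ "$LIVESTREAMER_STATUS" = "missing" ]; then
  printf 'livestreamer is %s; would run: livestreamer --output %s %s %s\n' \
    "$LIVESTREAMER_STATUS" "$OUTPUT_FILE" "$STREAM_URL" "$STREAM_QUALITY"
else
  # Write the stream to a file instead of opening a media player.
  livestreamer --output "$OUTPUT_FILE" "$STREAM_URL" "$STREAM_QUALITY"
fi
```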

  
## gist:1f8c0b2c9edc2d8565a6

      
Asparagirl / gist:1f8c0b2c9edc2d8565a6 (1 file, 0 forks, 0 comments, 0 stars; last active January 12, 2017 04:34)

Set up a server from scratch with wpull and PhantomJS and youtube-dl

Basic server set up
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install fail2ban
Bring on the packages!

  
## gist:c2f710724232f76187b3

      
Asparagirl / gist:c2f710724232f76187b3 (1 file, 0 forks, 0 comments, 2 stars; last active November 25, 2018 21:24)

Grab a website with wpull and PhantomJS

export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
export DOMAIN_NAME_TO_SAVE="http://www.example.com/"
export DOMAINS_TO_INCLUDE="example.com,images.example.com,relatedwebsite.com"
# this one can be regex, or you can leave it out, whatever
export THINGS_TO_IGNORE="ignore-this,other-thing-to-ignore"
export WARC_NAME="Example.com_-_2014-10-15"
# these two are needed in case wpull quits or chokes and we need to restart where we left off
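
The two restart-related variables and the wpull command itself are not shown here. As a rough sketch under stated assumptions: `DATABASE_NAME` and `LOG_NAME` are hypothetical names for those two settings, `REJECT_REGEX`, `WPULL_PREVIEW`, and `DRY_RUN` are hypothetical helpers, and the flag selection is a guess at a reasonable invocation rather than the author's command. Converting the comma-separated `THINGS_TO_IGNORE` into a regex alternation is also an assumption.

```shell
# Sketch only: DATABASE_NAME and LOG_NAME are hypothetical names.
: "${USER_AGENT:=Mozilla/5.0}"
: "${DOMAIN_NAME_TO_SAVE:=http://www.example.com/}"
: "${DOMAINS_TO_INCLUDE:=example.com,images.example.com,relatedwebsite.com}"
: "${THINGS_TO_IGNORE:=ignore-this,other-thing-to-ignore}"
: "${WARC_NAME:=Example.com_-_2014-10-15}"
DATABASE_NAME="${DATABASE_NAME:-$WARC_NAME.db}"   # on-disk URL table, so a restart can resume
LOG_NAME="${LOG_NAME:-$WARC_NAME.log}"            # log file that later runs append to
DRY_RUN="${DRY_RUN:-1}"                           # 1 = only print the command

# Assumption: treat the comma-separated list as alternatives in one regex.
REJECT_REGEX=$(printf '%s' "$THINGS_TO_IGNORE" | tr ',' '|')

set -- \
  "$DOMAIN_NAME_TO_SAVE" \
  --user-agent "$USER_AGENT" \
  --recursive \
  --page-requisites \
  --span-hosts \
  --domains "$DOMAINS_TO_INCLUDE" \
  --reject-regex "$REJECT_REGEX" \
  --phantomjs \
  --warc-file "$WARC_NAME" \
  --warc-append \
  --database "$DATABASE_NAME" \
  --append-output "$LOG_NAME"

WPULL_PREVIEW="wpull $*"

if [ "$DRY_RUN" = "1" ]; then
  printf '%s\n' "$WPULL_PREVIEW"
else
  wpull "$@"
fi
```

Keeping the database on disk and appending to the existing WARC and log is what lets a rerun of the same command pick up where a crashed run stopped.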

  
## gist:b5b9645c59eb8c684368

      
Asparagirl / gist:b5b9645c59eb8c684368 (1 file, 0 forks, 0 comments, 1 star; last active July 27, 2017 19:33)

Download YouTube videos with youtube-dl

Use youtube-dl to download all videos from a YouTube user, rename them with a consistent naming scheme, and add/embed thumbnails and metadata

youtube-dl https://www.youtube.com/user/ohbutyes/videos --format mp4/flv/3gp --output '%(title)s___YouTube_video_id_%(id)s___uploaded_by_%(uploader)s___uploaded_%(upload_date)s___%(resolution)s.%(ext)s' --restrict-filenames --write-sub --write-description --write-info-json --write-thumbnail --print-traffic --verbose --embed-subs --embed-thumbnail --add-metadata --xattrs


## keybase.md

      
Asparagirl / keybase.md (1 file, 0 forks, 0 comments, 0 stars; created January 14, 2017 18:45)

so leet

Keybase proof

I hereby claim:

I am Asparagirl on github.
I am asparagirl (https://keybase.io/asparagirl) on keybase.
I have a public key whose fingerprint is BC82 31E4 A69E 0BF4 42AD  E962 235F 967E 56AD C7F0

To claim this, I am signing this object:

  
## gist:7ae5cb95aa1e016dfaad9b5762ffd7a9
root@archiveteam-to-the-rescue:~# lsof
COMMAND    PID  TID             USER   FD      TYPE             DEVICE      SIZE/OFF       NODE NAME
systemd      1                  root  cwd       DIR              253,1          4096          2 /
systemd      1                  root  rtd       DIR              253,1          4096          2 /
systemd      1                  root  txt       REG              253,1       1577232       8229 /lib/systemd/systemd
systemd      1                  root  mem       REG              253,1         18976       2411 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
systemd      1                  root  mem       REG              253,1        262408       2048 /lib/x86_64-linux-gnu/libblkid.so.1.1.0
systemd      1                  root  mem       REG              253,1         14608      12859 /lib/x86_64-linux-gnu/libdl-2.23.so
systemd      1                  root  mem       REG              253,1        456632       2141 /lib/x86_64-linux-gnu/libpcre.so.3.13.2
systemd      1

## get-tweets.md

      
Asparagirl / get-tweets.md (1 file, 0 forks, 0 comments, 2 stars; last active November 11, 2022 06:09)

How to generate a Twitter user's unique tweet URLs, and then feed them into ArchiveBot to be saved

Set up Tweep


Download the file tweep.py from GitHub - https://github.com/haccer/tweep


Put it in a brand new folder. Let's call the folder "Tweep". So the full path here would be, as an example, /Users/asparagirl/Desktop/Tweep


Add a folder inside of that one called tmp. So the full path here would be, as an example, /Users/asparagirl/Desktop/Tweep/tmp


Edit tweep.py slightly to add logging and stop it from getting images from tweets. The top of the file should be edited to look like this:


## INSTRUCTIONS.md

      
Asparagirl / INSTRUCTIONS.md (1 file, 0 forks, 0 comments, 0 stars; last active August 6, 2018 22:50)

What to do when an ArchiveBot job crashes or is aborted on your pipeline and you need to manually upload the job's associated log file to FOS

When you have to manually kill an ArchiveBot web scraping job on one of your pipeline servers, or if the job crashes on its own, the incomplete WARC files do usually move over to FOS, but the log.gz file does not. You have to manually find the proper file, rename it in just the right way, and then rsync it yourself.


Make a note somewhere of the job id of the stuck job, such as aqz8ac6ar202mulnvn8xpzv3f. Also make note of the way the WARCs and JSONs are named, such as www.gog.com-inf-20180603-063227-aqz8a.json. Note that the first five characters of the job id are the last five characters of the filename before its extension. (The log files do not follow the same naming convention.)


Kill the stuck job with kill -9.


Watch the ArchiveBot dashboard to make sure the incomplete WARC and JSON files do indeed upload to FOS and the job is done.


Go into the ~/ArchiveBot/pipeline/ directory. Look at the various blahblahblah.log.gz files in there. It is probably impossible to tell just by looking which of these log files correspo
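
The naming rule from the first step can be checked mechanically. The sketch below uses only the example job id and filename quoted above; the variable names (`JOB_ID`, `JOB_PREFIX`, `EXAMPLE_FILE`, `MATCHES`) are hypothetical.

```shell
# Example values taken from the steps above.
JOB_ID="aqz8ac6ar202mulnvn8xpzv3f"
EXAMPLE_FILE="www.gog.com-inf-20180603-063227-aqz8a.json"

# The first five characters of the job id identify its WARC and JSON files.
JOB_PREFIX=$(printf '%s' "$JOB_ID" | cut -c1-5)

# Strip the extension, then check that the name ends in "-<prefix>".
case "${EXAMPLE_FILE%.json}" in
  *"-$JOB_PREFIX") MATCHES="yes" ;;
  *)               MATCHES="no" ;;
esac

printf 'job prefix: %s, filename matches: %s\n' "$JOB_PREFIX" "$MATCHES"
```

As noted above, the log files do not follow this naming convention, so the prefix only helps with the WARC and JSON files.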