First, install the program Livestreamer, whose docs are online here: http://livestreamer.readthedocs.org/en/latest/index.html
On Debian or Ubuntu, that would be:
sudo apt-get install livestreamer
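Once installed, Livestreamer takes a stream URL and a quality name. A minimal sketch (the channel name is just a placeholder, not from the original notes):

```shell
# Open a live stream in your default video player (VLC by default).
# "best" selects the highest available quality; the channel is a placeholder.
livestreamer twitch.tv/some_channel best
```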
root@archiveteam-to-the-rescue:~# lsof
COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root cwd DIR 253,1 4096 2 /
systemd 1 root rtd DIR 253,1 4096 2 /
systemd 1 root txt REG 253,1 1577232 8229 /lib/systemd/systemd
systemd 1 root mem REG 253,1 18976 2411 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
systemd 1 root mem REG 253,1 262408 2048 /lib/x86_64-linux-gnu/libblkid.so.1.1.0
systemd 1 root mem REG 253,1 14608 12859 /lib/x86_64-linux-gnu/libdl-2.23.so
systemd 1 root mem REG 253,1 456632 2141 /lib/x86_64-linux-gnu/libpcre.so.3.13.2
systemd 1
youtube-dl https://www.youtube.com/user/ohbutyes/videos --format mp4/flv/3gp --output '%(title)s___YouTube_video_id_%(id)s___uploaded_by_%(uploader)s___uploaded_%(upload_date)s___%(resolution)s.%(ext)s' --restrict-filenames --write-sub --write-description --write-info-json --write-thumbnail --print-traffic --verbose --embed-subs --embed-thumbnail --add-metadata --xattrs
When you have to manually kill an ArchiveBot web scraping job on one of your pipeline servers, or if the job crashes on its own, the incomplete WARC files do usually move over to FOS, but the log.gz file does not. You have to manually find the proper file, rename it in just the right way, and then rsync it yourself.
Make a note somewhere of the job ID of the stuck job, such as aqz8ac6ar202mulnvn8xpzv3f. Also make a note of how the WARCs and JSONs are named, such as www.gog.com-inf-20180603-063227-aqz8a.json. Note that the first five letters of the job ID are the last five letters of the filename. (The log files do not follow the same naming convention.)
Run kill -9 on the stuck job.
Watch the ArchiveBot dashboard to make sure the incomplete WARC and JSON files do indeed upload to FOS and the job is done.
Go into the ~/ArchiveBot/pipeline/ directory and look at the various blahblahblah.log.gz files in there. It is probably impossible to tell just by looking which of these log files corresponds to the job you just killed.
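The job-ID/filename relationship described above can be sanity-checked in the shell, using the example IDs from this note:

```shell
JOB_ID="aqz8ac6ar202mulnvn8xpzv3f"
JSON="www.gog.com-inf-20180603-063227-aqz8a.json"
stem="${JSON%.json}"      # drop the .json extension
echo "${JOB_ID:0:5}"      # first five letters of the job ID
echo "${stem: -5}"        # last five letters of the filename stem
```

Both lines print aqz8a, which is how you match a renamed log file back to its job.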
export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
export DOMAIN_NAME_TO_SAVE="http://www.example.com/"
export DOMAINS_TO_INCLUDE="example.com,images.example.com,relatedwebsite.com"
# this one can be regex, or you can leave it out, whatever
export THINGS_TO_IGNORE="ignore-this,other-thing-to-ignore"
export WARC_NAME="Example.com_-_2014-10-15"
# these two are needed in case wpull quits or chokes and we need to restart where we left off
Want to know what an older Etsy product sold for, but Etsy won't display the data on the old product page? Copy and paste this snippet as a new bookmark in your web browser bar, and then click it when you're looking at a product page:
Want to grab a copy of your favorite website, using wget in the command line, and saving it in WARC format? Then this is the gist for you. Read on!
First, copy the following lines into a text file and edit them as needed. Then paste them into your command line and hit enter:
export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
export DOMAIN_NAME_TO_SAVE="www.example.com"
export SPECIFIC_HOSTNAMES_TO_INCLUDE="example1.com,example2.com,images.example2.com"
export FILES_AND_PATHS_TO_EXCLUDE="/path/to/ignore"
export WARC_NAME="example.com-20130810-panicgrab"
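With those variables set, a wget invocation along the following lines will do the grab. This is a sketch, not the gist's exact command (which isn't reproduced here); the flags are standard wget options, and --warc-file needs wget 1.14 or newer:

```shell
wget \
  --mirror --page-requisites \
  --user-agent="$USER_AGENT" \
  --span-hosts --domains="$SPECIFIC_HOSTNAMES_TO_INCLUDE" \
  --exclude-directories="$FILES_AND_PATHS_TO_EXCLUDE" \
  --warc-file="$WARC_NAME" \
  "http://$DOMAIN_NAME_TO_SAVE/"
```

wget writes the crawl out as $WARC_NAME.warc.gz alongside the usual mirrored directory tree.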
Download the file tweep.py from GitHub: https://github.com/haccer/tweep
Put it in a brand new folder. Let's call the folder "Tweep". So the full path here would be, as an example, /Users/asparagirl/Desktop/Tweep
Add a folder inside of that one called tmp. So the full path here would be, as an example, /Users/asparagirl/Desktop/Tweep/tmp
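The setup steps above can be done in one go from the command line. The raw.githubusercontent.com path is an assumption about the repo's default branch layout, so double-check it against the GitHub page:

```shell
mkdir -p ~/Desktop/Tweep/tmp    # project folder plus the tmp subfolder
cd ~/Desktop/Tweep
# Assumed raw-file URL for haccer/tweep -- verify on GitHub before running.
curl -LO https://raw.githubusercontent.com/haccer/tweep/master/tweep.py
```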
Edit tweep.py slightly to add logging and stop it from getting images from tweets. The top of the file should be edited to look like this: