Last active August 6, 2018 22:50
What to do when an ArchiveBot job crashes or is aborted on your pipeline and you need to manually upload the job's associated log file to FOS

When you have to manually kill an ArchiveBot web scraping job on one of your pipeline servers, or when a job crashes on its own, the incomplete WARC files usually still move over to FOS, but the log.gz file does not. You have to find the right file yourself, rename it in just the right way, and then rsync it manually.

  1. Make a note somewhere of the job id of the stuck job, such as aqz8ac6ar202mulnvn8xpzv3f. Also make a note of how the job's WARC and JSON files are named. The first five letters of the job id are the last five letters of the filename. (The log files do not follow the same naming convention.)

  2. Run kill -9 on the stuck job.

  3. Watch the ArchiveBot dashboard to make sure the incomplete WARC and JSON files do indeed upload to FOS and the job is done.

  4. Go into the ~/ArchiveBot/pipeline/ directory. Look at the various blahblahblah.log.gz files in there. It is probably impossible to tell just by looking which of these log files corresponds to the just-flushed job.

  5. One by one, run zcat SOMETHING.log.gz | head -2 on each of the log.gz files. For example, zcat tmp-wpull-warc-2o0loq4o.log.gz | head -2. Look at the output; the second line should contain the job id. Manually check it against the job id you noted to see if this is the right log file. NOTE: checking only the job id might not be enough, especially for jobs that were aborted and then requeued shortly afterwards.

  6. If it's the right log file, rename it to the same pattern as the WARC and JSON files. For example, mv tmp-wpull-warc-2o0loq4o.log.gz

  7. Then use rsync to upload this log file to FOS. You cannot just move it into the ~/warcs4fos/ directory because the uploader running in there doesn't know what to do with log files yet. Instead, run rsync -tv --timeout=300 --contimeout=300 --progress --ignore-existing YOUR-LOGFILE-HERE rsync:// where YOUR-LOGFILE-HERE is replaced by the name of this log file, such as rsync -tv --timeout=300 --contimeout=300 --progress --ignore-existing rsync://
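
The one-by-one check in step 5 can be scripted. The following is only a sketch under a few assumptions: the function name find_job_log is made up for illustration, the default directory is the ~/ArchiveBot/pipeline/ path from step 4, and the job id is assumed to appear somewhere in the first two decompressed lines, as step 5 describes. It uses gzip -dc, which does the same thing as zcat.

```shell
#!/bin/sh
# Hypothetical helper (not part of ArchiveBot): print every *.log.gz file in a
# directory whose first two decompressed lines mention the given job id.
# Usage: find_job_log JOB_ID [DIRECTORY]
find_job_log() {
    job_id=$1
    # Assumed default: the pipeline directory named in step 4.
    dir=${2:-"$HOME/ArchiveBot/pipeline"}
    for f in "$dir"/*.log.gz; do
        # Skip the unexpanded pattern when no log files exist.
        [ -e "$f" ] || continue
        # Same check as "zcat FILE | head -2", matched as a fixed string.
        if gzip -dc "$f" 2>/dev/null | head -n 2 | grep -qF -- "$job_id"; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, find_job_log aqz8ac6ar202mulnvn8xpzv3f lists the candidate log files for that job. If more than one file is printed, the job id alone is ambiguous (the requeued-job case from the note in step 5), and the candidates still need to be inspected by hand before renaming anything.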
