The following applies to the scripts located here.
Once DCMR has finished the QC/QA of all the files, they will inform MSD (or Robin, who will then inform MSD?) that their work is complete. At this point it is time to process the files according to the [HathiTrust Cloud Validation and ingest requirements](https://docs.google.com/document/d/1OQ0SKAiOH8Xi0HVVxg4TryBrPUPtdv4qA70d8ghRltU/edit).
##### Step 1: Remove all unnecessary files
The only files that should exist in each directory as this work begins are .tif and .txt files (the scanned images and the OCR text). To remove all of the .jpg files, navigate in the terminal to the directory that contains ALL of the subdirectories (that is, the directory that contains a folder for each barcode) and run:
```bash
find . -name "*.jpg" -type f
```
This will locate all .jpg files and list them for you. It will not delete them. This is a good, safe step to take to ensure that you are really only grabbing the files you want. Once satisfied, run:
```bash
find . -name "*.jpg" -type f -delete
```
This will delete them. Repeat the same process to remove the PDF file(s) from the directories:
```bash
find . -name "*.pdf" -type f
find . -name "*.pdf" -type f -delete
```
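If you prefer, the two cleanup passes can be combined into a single `find` invocation that handles both extensions at once. This is just a sketch of the same preview-then-delete pattern described above:

```shell
# Preview every .jpg and .pdf under the current tree, then delete them.
# Run the -print form first and check the list before using -delete.
find . -type f \( -name '*.jpg' -o -name '*.pdf' \) -print
find . -type f \( -name '*.jpg' -o -name '*.pdf' \) -delete
```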
##### Step 2: Strip form feed characters from the txt files
For some reason (vendor software, maybe?) the OCR .txt files contain form feed characters, which are invalid per HathiTrust. I have put together a small script that goes through each .txt file in each directory, locates these characters, and strips them.
```bash
#!/bin/bash
# ^L below is a literal form feed character (0x0C), not a caret followed by L
for file in $(find . -name '*.txt'); do
  sed -i "" 's/^L//g' "$file"
done
```
It is important (at least it was for me) to compose this script directly in the terminal using vi, as the form feed character (which appears as ^L in the script) is actually entered by pressing Ctrl-V and then Ctrl-L in succession. Once the script is created, save it in that top directory as a .sh file and run it:
```bash
bash form_strip.sh
```
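If typing a literal control character into the script feels awkward, an equivalent approach (a sketch, not the script above) is to let `tr` delete the form feed by its escape sequence instead:

```shell
# \f is the form feed character (0x0C); tr -d removes every occurrence.
# Same effect as the sed script, without embedding a raw control character.
find . -name '*.txt' -print0 | while IFS= read -r -d '' f; do
  tr -d '\f' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```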
##### Step 3: Creating the meta.yml files
Now that all changes have been made to the directories and files, we can move on to creating the required meta.yml files. Creating these files involves a bash script invoking a python script as it cycles through each directory:
```bash
#!/bin/bash
for file in $(find . -name '00000001.tif'); do
  parentdir=$(dirname "$file")
  exiftool -j -FileModifyDate -XResolution "$file" | python json2yaml.py > "$parentdir"/meta.yml
done
```
The bash script searches each subdirectory for a particular file (we base the info for each meta.yml on a single page in each subdirectory and treat it as representative of that subdirectory) and extracts some embedded metadata, which it passes to the python script. The python script processes the metadata (more on that below) and writes the result back into that subdirectory as meta.yml. The processing converts the JSON retrieved from exiftool into YAML. This involves some date reformatting (exiftool's date format is not ISO compliant, so it must be fixed), and the keys in the key-value YAML pairings must be renamed.
```python
#!/usr/bin/env python3
import json
import sys
import yaml

data = json.load(sys.stdin)

# Split the date apart to replace the non-ISO ':' separators with '-',
# then join the date and time back together with a 'T'
def datesplit(exif_date):
    splitdate = exif_date.split(" ")
    splitdate[0] = splitdate[0].replace(":", "-")
    return "T".join(splitdate)

# Rename the keys according to HT specs, while also providing some of the constant values
for item in data:
    item.pop("SourceFile")
    # item["capture_date"] = datesplit(item.pop("CreateDate"))
    # CreateDate does not consistently appear in metadata - can't use
    item["capture_date"] = datesplit(item.pop("FileModifyDate"))
    item["scanner_make"] = "NIKON CORPORATION"
    item["scanner_model"] = "NIKON D810"
    item["scanner_user"] = "Creekside Digital"
    item["contone_resolution_dpi"] = item.pop("XResolution")
    item["scanning_order"] = "right-to-left"
    item["reading_order"] = "right-to-left"

# Print the data in yaml format
yml = yaml.safe_dump(data[0], default_flow_style=False, stream=None)
print(yml)
```
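For reference, the resulting meta.yml ends up looking roughly like this (keys are alphabetized by `safe_dump`; the date and resolution values here are illustrative, not from a real file):

```yaml
capture_date: '2019-05-03T10:22:33-04:00'
contone_resolution_dpi: 400
reading_order: right-to-left
scanner_make: NIKON CORPORATION
scanner_model: NIKON D810
scanner_user: Creekside Digital
scanning_order: right-to-left
```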
Once these two files are set and saved in the parent directory, run them with:
```bash
bash json2yml.sh
```
##### Step 4: Generating checksums
HathiTrust cloud ingest also requires the presence of a checksum.md5 file listing checksums and associated filenames for every file in each directory. Again, this is done with a bash script:
```bash
#!/bin/bash
# Note: `md5 -r` is the macOS command, which prints "hash filename"
for dir in $(find . -type d); do
  md5 -r "$dir"/* | sed "s:$dir/::" > "$dir/checksum.md5"
done
```
This particular script takes an extremely long time to run. Save it as md5.sh and run it with:

```bash
bash md5.sh
```
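As written, the script relies on macOS's `md5 -r`. If you ever need to run this step on a Linux server instead, a sketch of an equivalent using `md5sum` (and keeping checksum.md5 out of its own listing) might look like this:

```shell
#!/bin/bash
# md5sum prints "hash  filename"; running inside each directory
# avoids having to strip path prefixes with sed.
find . -type d | while IFS= read -r dir; do
  ( cd "$dir" || exit
    for f in *; do
      # skip subdirectories and the checksum file itself
      [ -f "$f" ] && [ "$f" != "checksum.md5" ] && md5sum "$f"
    done > checksum.md5 )
done
```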
##### Step 5: Zipping up the files
Speaking of things that take a long time to run...
Once the files are all set with their meta.yml and checksum.md5 files, it's time to zip them up so they can be dropped into the cloud. Again, I'm using a bash script to accomplish this:
```bash
#!/bin/bash
for dir in $(find . -type d); do
  zip -r -X "$dir".zip "$dir"
done
Again, run it with:

```bash
bash zip.sh
```
This script does take an extremely long time to run, and it's best to let it go overnight, or better yet, over a weekend.
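Since this step can be interrupted mid-run, one way to make it resumable (a sketch; the `.part` suffix is my own convention, not part of the original script) is to skip directories that already have a finished zip, and only rename each archive to its final name after `zip` exits successfully:

```shell
#!/bin/bash
# Re-runnable: finished zips are skipped, and a partial archive is only
# promoted to its final .zip name after zip succeeds.
find . -mindepth 1 -type d | while IFS= read -r dir; do
  [ -f "$dir.zip" ] && continue          # already done on a previous run
  zip -q -r -X "$dir.zip.part" "$dir" && mv "$dir.zip.part" "$dir.zip"
done
```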
Thanks Josh!
Re: Step 5 - yes, these are on a server. I did my testing of workflows on dcrprojects - that's about the only spot large enough for them. (Eric does still have the hard drives with all the original files, too, thankfully).
What does it take to make something resumable? I guess I can wait and find out Monday!