The following applies to the scripts located here.
Once DCMR has finished the QC/QA of all the files, they will inform MSD (or Robin, who will then inform MSD?) that their work is complete. At this point it is time to process the files according to the [HathiTrust Cloud Validation and ingest requirements](https://docs.google.com/document/d/1OQ0SKAiOH8Xi0HVVxg4TryBrPUPtdv4qA70d8ghRltU/edit).
##### Step 1: Remove all unnecessary files
The only files that should exist in each directory as this work begins are .tif and .txt files (the scanned images and the OCR text). To remove all of the .jpg files, navigate in the terminal to the directory that contains ALL of the subdirectories (that is, the directory that contains a folder for each barcode) and run:
```shell
find . -name "*.jpg" -type f
```
This will locate all jpg files and list them for you. It will not delete them. This is a good, safe step to take to ensure that you are really only grabbing the files you want. Once satisfied, run:
```shell
find . -name "*.jpg" -type f -delete
```
This will delete them. Repeat the same process to remove the PDF file(s) from the directories:

```shell
find . -name "*.pdf" -type f
find . -name "*.pdf" -type f -delete
```
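As an optional sanity check (not part of the required workflow), you can count the matches before deleting and confirm nothing is left afterwards. A minimal sketch, run here against a throwaway sandbox with hypothetical folder names so it is safe to try anywhere:

```shell
#!/bin/bash
# Demo sandbox with hypothetical barcode folders (safe to run anywhere)
cd "$(mktemp -d)"
mkdir barcode1 barcode2
touch barcode1/00000001.jpg barcode2/00000001.jpg barcode1/00000001.txt

# Count matches before deleting, delete, then confirm none remain
before=$(find . -name "*.jpg" -type f | wc -l)
echo "deleting $before .jpg files"
find . -name "*.jpg" -type f -delete
after=$(find . -name "*.jpg" -type f | wc -l)
echo "remaining: $after"
```

The same pattern works for the .pdf pass by swapping the glob.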
##### Step 2: Strip form feed characters from the txt files
For some reason (vendor software, maybe?) the OCR .txt files contain form feed characters, which are invalid per HathiTrust. I have put together a small script that goes through each .txt file in each directory and strips these characters.
```shell
#!/bin/bash
for file in $(find . -name '*.txt')
do
    # ^L below is a literal form feed character, not caret plus L;
    # the g flag removes every form feed on a line, not just the first
    sed -i "" 's/^L//g' "$file"
done
```
It is important (at least it was for me) to compose this script directly in the terminal using vi, as the form feed character (which appears as ^L in the script) is actually entered by pressing Ctrl-V and then Ctrl-L in succession. Once the script is created, save it in that top directory as a .sh file and run it:
```shell
bash form_strip.sh
```
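If typing a literal control character in vi is awkward, an alternative sketch uses `tr` with a temp file: `'\f'` is the escape for the form feed character, so nothing has to be typed with Ctrl-V, and this sidesteps the `sed -i` portability quirks as well. Demonstrated here in a throwaway sandbox:

```shell
#!/bin/bash
# Demo sandbox: a text file containing form feeds
cd "$(mktemp -d)"
printf 'page one\fpage two\f\n' > 00000001.txt

# Strip form feeds with tr instead of sed: '\f' is the form feed escape,
# so no literal control character has to be typed in an editor
find . -name '*.txt' | while read -r file; do
    tmp=$(mktemp)
    tr -d '\f' < "$file" > "$tmp" && mv "$tmp" "$file"
done
cat 00000001.txt
```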
##### Step 3: Creating the meta.yml files
Now that all changes have been made to the directories and files, we can move on to creating the required meta.yml files. Creating these files involves a bash script invoking a python script as it cycles through each directory.
```shell
#!/bin/bash
for file in $(find . -name '00000001.tif');
do
    parentdir=$(dirname "$file")
    exiftool -j -FileModifyDate -XResolution "$file" | python json2yaml.py > "$parentdir"/meta.yml
done
```
The bash script searches each subdirectory for a particular file (we base the info for each meta.yml on a single page in each subdirectory and consider it representative of that subdirectory) and extracts some embedded metadata, which it pipes to the python script; the python script processes it (more on that below) and the result is written back into that subdirectory as meta.yml. The python processing (see below) converts the JSON retrieved from exiftool into YAML. This involves some date reformatting (exiftool's date format is not ISO 8601 compliant, so it must be fixed), and the keys in the key-value YAML pairings must be renamed.
```python
#!/usr/bin/env python3
import json
import sys

import yaml

data = json.load(sys.stdin)

# split the date apart to replace the non-ISO ':' with a '-' then put it back together again
def datesplit(CreateDate):
    splitdate = CreateDate.split(" ")
    splitdate[0] = splitdate[0].replace(":", "-")
    capture_date = "T".join(splitdate)
    return capture_date

# rename the keys according to HT specs, while also providing some of the constant values
for item in data:
    item.pop("SourceFile")
    # item["capture_date"] = datesplit(item.pop("CreateDate"))  # CreateDate does not consistently appear in metadata - can't use
    item["capture_date"] = datesplit(item.pop("FileModifyDate"))
    item["scanner_make"] = "NIKON CORPORATION"
    item["scanner_model"] = "NIKON D810"
    item["scanner_user"] = "Creekside Digital"
    item["contone_resolution_dpi"] = item.pop("XResolution")
    item["scanning_order"] = "right-to-left"
    item["reading_order"] = "right-to-left"

# print data in yaml format
yml = yaml.safe_dump(data[0], default_flow_style=False, stream=None)
print(yml)
```
Once these two files are set and saved in the parent directory, run them with:

```shell
bash json2yml.sh
```
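To see what the date reformatting does, here is a quick shell sketch of the same transform the python `datesplit()` performs, using a hypothetical `FileModifyDate` value in exiftool's format:

```shell
#!/bin/bash
# exiftool emits "YYYY:MM:DD HH:MM:SS±TZ"; HathiTrust wants ISO 8601
raw="2019:05:01 12:34:56-04:00"   # hypothetical sample value
date_part=${raw%% *}              # -> 2019:05:01
time_part=${raw#* }               # -> 12:34:56-04:00
iso="${date_part//:/-}T${time_part}"
echo "$iso"
# -> 2019-05-01T12:34:56-04:00
```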
##### Step 4: Generating checksums
HathiTrust cloud ingest also requires the presence of a checksum.md5 file listing checksums and associated filenames for each file in the directory. Again, this is done with a bash script:
```shell
#!/bin/bash
# -mindepth 1 skips '.' itself, so no stray checksum.md5 is written
# in the top-level directory alongside the scripts
for dir in $(find . -mindepth 1 -type d); do
    md5 -r "$dir"/* | sed "s:$dir/::" > "$dir/checksum.md5"
done
```
Save it as md5.sh and run it:

```shell
bash md5.sh
```

This particular script takes an extremely long time to run.
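Note that `md5 -r` is BSD/macOS-specific. On Linux, a rough equivalent sketch uses `md5sum` and works from inside each directory, which keeps the filenames relative and makes the sed prefix-stripping unnecessary. Demonstrated here against a throwaway sandbox with a hypothetical barcode directory:

```shell
#!/bin/bash
# Demo sandbox: one hypothetical barcode directory with one file
cd "$(mktemp -d)"
mkdir 39002012345678
printf 'sample ocr text' > 39002012345678/00000001.txt

# Linux variant of the checksum step: md5sum instead of macOS's md5 -r.
# Working inside each directory keeps filenames relative.
find . -mindepth 1 -type d | while read -r dir; do
    (
        cd "$dir" || exit
        for f in *; do
            [ "$f" = "checksum.md5" ] && continue   # skip the output file
            md5sum "$f"
        done > checksum.md5
    )
done
cat 39002012345678/checksum.md5
```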
##### Step 5: Zipping up the files
Speaking of things that take a long time to run...
Once the files are all set with their meta.yml and checksum.md5 files, it's time to zip them up so they can be dropped into the cloud. Again, I'm using a bash script to accomplish this:
```shell
#!/bin/bash
# -mindepth 1 -maxdepth 1 restricts this to the top-level barcode
# directories; without it, find also matches '.' and every nested
# directory, and would try to zip the entire tree
for dir in $(find . -mindepth 1 -maxdepth 1 -type d); do
    zip -r -X "$dir".zip "$dir"
done
```
Again, save it as zip.sh and run:

```shell
bash zip.sh
```
This script does take an extremely long time to run, and it's best to let it go overnight, or better yet, over a weekend.
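After a long zipping run, a small hypothetical audit script can confirm that every top-level directory actually got a matching zip before anything is uploaded. Sketched here in a throwaway sandbox:

```shell
#!/bin/bash
# Demo sandbox: two hypothetical barcode directories, only one zipped
cd "$(mktemp -d)"
mkdir barcode1 barcode2
touch barcode1.zip

# Report any top-level directory without a matching .zip beside it
for dir in */; do
    d=${dir%/}
    [ -f "$d.zip" ] || echo "missing: $d.zip"
done
```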
---

This looks excellent. A few thoughts/suggestions, none of which are mandatory, just ideas for future improvement:
In step 2, I've seen it argued that one should not do in-place transformations with sed, including with the `-i ""` workaround. The alternative that I've used is auto-created temp files with the `mktemp` command. This should be seen as a possible future improvement, though, as the present version should work fine. Do note, however, that if for some reason the sed form feed remover got applied to a binary file, it could potentially corrupt that file. I don't think that's likely given your use of `-name '*.txt'`.
In step 3 above, you might want to quote your variable in case of spaces in a directory path:
```shell
... python json2yaml.py > "$parentdir"/meta.yml
```
In step 4, a future improvement might be to make the MD5 creator resumable. This might be useful since it does take time to run. The idea would be to stream the checksums into a file and have the script check whether files have been checksummed already. Something we could talk about at coding workshop.
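A minimal sketch of that resumable idea (assuming GNU `md5sum`; substitute `md5 -r` on macOS): append each hash as it is computed and skip files already listed, so an interrupted run can pick up where it left off. Demonstrated in a sandbox where one file is already checksummed, as if a previous run was killed partway through:

```shell
#!/bin/bash
# Demo sandbox: one file already checksummed from an "interrupted" run
cd "$(mktemp -d)"
mkdir barcode1
printf 'x' > barcode1/a.txt
printf 'y' > barcode1/b.txt
( cd barcode1 && md5sum a.txt > checksum.md5 )

# Resumable pass: append as we go, skipping files already recorded
# (naive filename match; fine for plain numeric page names)
find . -mindepth 1 -type d | while read -r dir; do
    (
        cd "$dir" || exit
        touch checksum.md5
        for f in *; do
            [ "$f" = "checksum.md5" ] && continue
            grep -q " $f\$" checksum.md5 && continue   # already hashed
            md5sum "$f" >> checksum.md5
        done
    )
done
cat barcode1/checksum.md5
```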
In step 5, are these files on a server? Part of why it takes so long is that the zipping (like the checksumming) involves reading each entire file over the network. As for running overnight or over the weekend, one issue is that the sysadmins reboot all the production servers on Friday nights. This has bitten me a few times when I wanted to let things run over the weekend.