# Processing Files for HathiTrust Ingest at UMD

*@brialparker · Last active October 18, 2019*

The following applies to the scripts located here.

Once DCMR has finished the QC/QA of all the files, they will inform MSD (or Robin, who will then inform MSD?) that their work is complete. At this point it is time to process the files according to the [HathiTrust Cloud Validation and ingest requirements](https://docs.google.com/document/d/1OQ0SKAiOH8Xi0HVVxg4TryBrPUPtdv4qA70d8ghRltU/edit).

##### Step 1: Remove all unnecessary files

The only files that should exist in each directory as this work begins are .tif and .txt files (the scanned images and the OCR text). To remove all of the .jpg files, navigate in the terminal to the directory that contains ALL of the subdirectories (that is, the directory that contains a folder for each barcode) and run:

find . -name "*.jpg" -type f

Note that this only locates and lists the .jpg files; it does not delete them. This is a good, safe step to take to ensure that you are really grabbing only the files you want. Once satisfied, run:

find . -name "*.jpg" -type f -delete

This will delete them. Repeat the same process to remove the PDF file(s) from the directories:

find . -name "*.pdf" -type f
find . -name "*.pdf" -type f -delete

##### Step 2: Strip form feed characters from the txt files

For some reason (vendor software, maybe?) the OCR .txt files contain form feed characters, which are invalid per HathiTrust. I have put together a small script that goes through each .txt file in each directory and strips these characters.

```bash
#!/bin/bash

# ^L below is a literal form feed character (0x0C),
# entered in vi by typing Ctrl-V followed by Ctrl-L
for file in $(find . -name '*.txt')
do
        sed -i "" 's/^L//g' "$file"
done
```

It is important (at least it was for me) to compose this script directly in the terminal using vi, as the form feed character (which appears as ^L in the script) is actually created by hitting Ctrl-V and then Ctrl-L in succession. Once the script is created, save it in that top directory as a .sh file and run it:

```
bash form_strip.sh
```
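To confirm the strip worked, one option (my addition, not part of the original workflow) is to grep for any remaining form feeds; bash's `$'\f'` quoting produces the character without the Ctrl-V trick:

```bash
# List any .txt files that still contain a form feed (0x0C);
# no output means the files are clean
find . -name '*.txt' -exec grep -l $'\f' {} +
```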

##### Step 3: Creating the meta.yml files

Now that all changes have been made to the directories and files, we can move on to creating the required meta.yml files. Creating these files involves a bash script invoking a Python script as it cycles through each directory.

```bash
#!/bin/bash

# Base each directory's meta.yml on its first page image
for file in $(find . -name '00000001.tif');
do
        parentdir=$(dirname "$file")
        exiftool -j -FileModifyDate -XResolution "$file" | python json2yaml.py > "$parentdir"/meta.yml
done
```

The bash script searches each subdirectory for a particular file (we base the info for each meta.yml on a single page in each subdirectory and consider it representative of that subdirectory) and extracts some embedded metadata, which it then passes to the Python script. The Python script processes the metadata (more on that below) and writes the result back to that subdirectory as meta.yml. The Python processing (see below) converts the JSON retrieved from exiftool into YAML. This involves some date reformatting (exiftool's date format is not ISO compliant, so it must be fixed), and the keys in the key-value YAML pairings must be renamed.
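To make the date fix concrete, here is the same transformation as a shell one-liner (the input value is a hypothetical FileModifyDate of the sort exiftool emits):

```bash
# '2015:10:18 17:05:32-04:00' -> '2015-10-18T17:05:32-04:00'
echo "2015:10:18 17:05:32-04:00" \
    | awk '{ gsub(":", "-", $1); print $1 "T" $2 }'
```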

```python
#!/usr/bin/env python3

import json, yaml, sys

data = json.load(sys.stdin)

# split the date apart to replace the non-ISO ':' with a '-',
# then put it back together again with a 'T' separator

def datesplit(CreateDate):
        splitdate = CreateDate.split(" ")
        splitdate[0] = splitdate[0].replace(":", "-")
        capture_date = "T".join(splitdate)
        return capture_date

# rename the keys according to HT specs, while also providing some of the constant values

for item in data:
        item.pop("SourceFile")
#       item["capture_date"] = datesplit(item.pop("CreateDate")) CreateDate does not consistently appear in metadata - can't use
        item["capture_date"] = datesplit(item.pop("FileModifyDate"))
        item["scanner_make"] = "NIKON CORPORATION"
        item["scanner_model"] = "NIKON D810"
        item["scanner_user"] = "Creekside Digital"
        item["contone_resolution_dpi"] = item.pop("XResolution")
        item["scanning_order"] = "right-to-left"
        item["reading_order"] = "right-to-left"

# print data into yaml format

yml = yaml.safe_dump(data[0], default_flow_style=False, stream=None)

print(yml)
```

Once these two scripts are saved in the parent directory, run them with:

```
bash json2yml.sh
```
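Each resulting meta.yml should look something like this (the capture_date and resolution values here are hypothetical; yaml.safe_dump sorts the keys alphabetically):

```yaml
capture_date: '2015-10-18T17:05:32-04:00'
contone_resolution_dpi: 400
reading_order: right-to-left
scanner_make: NIKON CORPORATION
scanner_model: NIKON D810
scanner_user: Creekside Digital
scanning_order: right-to-left
```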

##### Step 4: Generating checksums

HathiTrust cloud ingest also requires the presence of a checksum.md5 file listing checksums and associated filenames for each file in the directory. Again, this is done with a bash script:

```bash
#!/bin/bash

# md5 -r prints "<hash> <filename>"; the sed strips the directory
# prefix so that checksum.md5 lists bare filenames
for dir in $(find . -type d); do
    md5 -r "$dir"/* | sed "s:$dir/::" > "$dir/checksum.md5"
done
```

As with the other scripts, save it in the top directory and run it:

```
bash md5.sh
```

Be warned: this particular script takes an extremely long time to run.
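Each checksum.md5 ends up as a plain listing of one hash per file, along these lines (hash values invented for illustration):

```
d41d8cd98f00b204e9800998ecf8427e 00000001.tif
5eb63bbbe01eeed093cb22bb8f5acdc3 00000001.txt
0a52b5a66e3b6bd8f6a0bb7427b82bb3 meta.yml
```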

##### Step 5: Zipping up the files

Speaking of things that take a long time to run...

Once the files are all set with their meta.yml and checksum.md5 files, it's time to zip them up so they can be dropped into the cloud. Again, I'm using a bash script to accomplish this:

```bash
#!/bin/bash

# -r recurses into each directory; -X leaves out extra file
# attributes (uid/gid and the like) for cleaner archives
for dir in $(find . -type d); do
    zip -r -X "$dir".zip "$dir"
done
```

Again, run it with:

```
bash zip.sh
```

This script does take an extremely long time to run, and it's best to let it go overnight, or better yet, over a weekend.
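Before dropping the zips into the cloud, a quick integrity check may be worthwhile; this is my own suggestion rather than part of the workflow above, and it assumes the zips land in the top directory next to the barcode folders:

```bash
# Test each archive and report only failures
for z in ./*.zip; do
    unzip -t "$z" > /dev/null || echo "BAD: $z"
done
```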

@jwestgard

This looks excellent. A few thoughts/suggestions, none of which are mandatory, just ideas for future improvement:

In step 2, I've seen it argued that one should not do in-place transformations with sed, including not with the -i "" workaround. The alternative that I've used is auto-created temp files with the mktemp command. This should be seen as a possible future improvement, though, as the present version should work fine. Do note, however, that if for some reason the sed form feed remover got applied to a binary file, it could potentially corrupt that file. I don't think that's likely given your use of "-name *.txt".
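A minimal sketch of that mktemp approach (untested; it assumes the same form feed stripping as in step 2, written here with bash's `$'\f'` quoting instead of the literal ^L):

```bash
# Write sed's output to a temp file, then replace the original
# only if sed succeeded
tmp=$(mktemp)
sed $'s/\f//g' "$file" > "$tmp" && mv "$tmp" "$file"
```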

In step 3 above, you might want to quote your variable in case of spaces in a directory path:

```
... python json2yaml.py > "$parentdir"/meta.yml
```

In step 4, a future improvement might be to make the MD5 creator resumable. This might be useful since it does take time to run. The idea would be to stream the checksums into a file and have the script check whether files have been checksummed already. Something we could talk about at coding workshop.
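A rough sketch of one way resumability could work (my own reading of the idea, untested): append checksums one file at a time and skip any filename already recorded in checksum.md5.

```bash
#!/bin/bash

for dir in $(find . -type d); do
    sums="$dir/checksum.md5"
    touch "$sums"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        [ "$name" = "checksum.md5" ] && continue
        # skip files already checksummed on a previous run
        grep -q " $name\$" "$sums" && continue
        md5 -r "$f" | sed "s:$dir/::" >> "$sums"
    done
done
```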

In step 5, are these files on a server? Part of why it takes so long is that the checksumming involves downloading the entire file. As for running over night or over the weekend, one issue is that the sysadmins reboot all the production servers on Friday nights. This has bitten me a few times when I wanted to let things run over the weekend.

@brialparker (Author)

Thanks Josh!

Re: Step 5 - yes, these are on a server. I did my testing of workflows on dcrprojects - that's about the only spot large enough for them. (Eric does still have the hard drives with all the original files, too, thankfully).

What does it take to make something resumable? I guess I can wait and find out Monday!
