
@brialparker
Last active January 14, 2016 12:47
Metadata clean-up for HathiTrust

HathiTrust requires metadata to be in MARC-XML format, UTF-8 encoding. Here's how we get there.

#####Step 1: request the MARC records from Aleph.

Because the records for Hebraica are suppressed, we cannot use Z39.50 against our Aleph catalog to retrieve them (that's something to think about, though, for future projects). Instead, we have to submit an Aleph RX request for them. The parameters CLAS needs to complete the request are:

Sublibrary:CPMCK

Collection:CAT

IPS:DB

The MARC records need to include brief item information in a 955 field (to fit with HathiTrust specs for ingest).

barcode --> 955 $b

description --> 955 $v

Additionally, please export these as UTF-8.

CLAS will extract the desired MARC records and put them in a Dropbox folder (and will give you instructions or add you to the folder if you do not already have access). Download the file from Dropbox and save it somewhere in your local workspace (or somewhere like Box) to begin work on the records.

***I currently do not have a programmatic way to separate out JUST the records needed for a given shipment. Ask the Hebraica cataloger (or whichever cataloger is working on the current project) to provide a list of barcodes for the shipment, and the large MARC file can be manually whittled down to include only the records for the upcoming shipment.
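One possible way to automate the whittling, as a rough sketch rather than a tested workflow: it works on the MarcEdit mnemonic (.mrk) text produced in Step 2 instead of the binary file, and it assumes records are separated by blank lines and that barcodes sit in 955 $b as requested above. The function names are made up for illustration.

```python
import re


def split_records(mrk_text):
    """Split .mrk text into records (assumes blank-line separators)."""
    blocks = re.split(r"\n\s*\n", mrk_text.replace("\r\n", "\n"))
    return [b.strip("\n") for b in blocks if b.strip()]


def barcodes_in(record):
    """Return every 955 $b value found in one .mrk record."""
    found = []
    for line in record.split("\n"):
        if line.startswith("=955"):
            # Capture the text after $b up to the next subfield marker.
            found.extend(re.findall(r"\$b([^$]*)", line))
    return [b.strip() for b in found]


def filter_by_barcode(mrk_text, wanted):
    """Keep only records that carry at least one wanted barcode."""
    wanted = set(wanted)
    kept = [r for r in split_records(mrk_text)
            if wanted.intersection(barcodes_in(r))]
    return "\n\n".join(kept) + "\n"
```

Feed it the contents of the .mrk file and the cataloger's barcode list, then save the result as a new .mrk file. It is worth comparing the number of records kept against the length of the barcode list before moving on.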

#####Step 2: reformat to UTF-8

I know. We asked for them in UTF-8. And they are! But there are some UTF-8 artifacts that are not stripped. Using MarcEdit and its MarcBreaker utility, convert the files from .mrc to .mrk in UTF-8 (there's a little checkbox that will do that for you).

Now, open up the file in the MarcEdit editor. First, delete all of the 066 fields (look for the Add/Delete Field utility in the Tools dropdown). Next, perform a find and replace:

Find

/(2/r

and replace with

/r

This will remove two items that the UTF-8 conversion did not/cannot change.
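If you would rather script these two edits than click through the editor, here is a minimal sketch of the same clean-up on the .mrk text. It assumes the mnemonic layout MarcEdit writes (one field per line, each line starting with `=` and the tag); `clean_mrk` is a made-up name, not a MarcEdit function.

```python
def clean_mrk(mrk_text):
    """Drop 066 fields and replace '/(2/r' with '/r' in .mrk text."""
    # Keep every line except the 066 fields.
    kept = [line for line in mrk_text.splitlines()
            if not line.startswith("=066")]
    # Same literal find and replace as in the MarcEdit editor.
    return "\n".join(kept).replace("/(2/r", "/r") + "\n"
```

Read the .mrk file as UTF-8, pass its contents through `clean_mrk`, and write the result back out as UTF-8 so the encoding does not change along the way.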

#####Step 3: Format for HathiTrust

HathiTrust requires that there be one record per barcode, meaning that for serials, or monographs with multiple volumes, each volume requires a separate MARC record. I have not found a quick, automated way to do this, so I search (all records) for the =955 field and visually scan the search results for records with multiple 955 fields. I then duplicate those records and make sure each copy has a unique 955 with barcode and volume. It takes a few minutes, but is not too cumbersome.

The next step is to compile the file back into MARC (make sure it's still UTF-8!). Once it's back in .mrc, it can be converted to MARCXML using the Marc Utilities in MarcEdit.
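The duplication step could in principle be scripted. The sketch below is only an illustration of the idea, not something used in this workflow: it assumes .mrk text with blank lines between records, and it copies a record once per 955 line, keeping every other field unchanged. The function names are hypothetical.

```python
import re


def split_records(mrk_text):
    """Split .mrk text into records (assumes blank-line separators)."""
    blocks = re.split(r"\n\s*\n", mrk_text.replace("\r\n", "\n"))
    return [b.strip("\n") for b in blocks if b.strip()]


def one_record_per_955(mrk_text):
    """Duplicate multi-item records so each copy has exactly one 955."""
    out = []
    for record in split_records(mrk_text):
        lines = record.split("\n")
        item_idx = [i for i, l in enumerate(lines) if l.startswith("=955")]
        if len(item_idx) <= 1:
            out.append(record)  # nothing to split
            continue
        for keep in item_idx:
            # Keep all non-955 lines plus the single 955 for this copy.
            copy = [l for i, l in enumerate(lines)
                    if i == keep or i not in item_idx]
            out.append("\n".join(copy))
    return "\n\n".join(out) + "\n"
```

The copies share everything except the 955, so a quick visual check of the barcode and volume in each one is still worthwhile.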
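For the "still UTF-8" check, a small sanity test on the compiled .mrc file may help. This is a sketch under stated assumptions: it relies on the MARC 21 leader, where positions 00-04 hold the record length and position 09 is `a` for Unicode records, and it only flags problems without fixing them. `check_utf8_marc` is a hypothetical helper name.

```python
def check_utf8_marc(data):
    """Return (record_number, problem) pairs for binary MARC bytes."""
    problems = []
    pos, number = 0, 0
    while pos < len(data):
        number += 1
        # Leader positions 00-04: total record length in bytes.
        length = int(data[pos:pos + 5].decode("ascii"))
        if length <= 0:
            raise ValueError("bad record length in record %d" % number)
        record = data[pos:pos + length]
        # Leader position 09: 'a' marks a Unicode (UTF-8) record.
        if record[9:10] != b"a":
            problems.append((number, "leader/09 is not 'a'"))
        try:
            record.decode("utf-8")
        except UnicodeDecodeError:
            problems.append((number, "bytes are not valid UTF-8"))
        pos += length
    return problems
```

Open the .mrc file in binary mode and pass its bytes to the function. An empty list means every record is flagged as Unicode and decodes cleanly.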

#####Step 4: Upload to Zepheira

Instructions are here.
