@brialparker
Last active February 9, 2017 15:50
A description of the process to transform our EAD for ArchivesSpace import

Getting UMD Finding Aids into ArchivesSpace was an iterative process:

Once we had laid out all the changes we thought needed to be made, our Systems Librarian developed a transformation script/application in Python to handle many of them: inserting handle URIs as an eadid attribute, cleaning up our parent/child container situation, stripping unnecessary sections and empty element tags, and otherwise tidying up and removing some of our local practices that are not necessary in ArchivesSpace. We then had a much cleaner set of EAD to work with.
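
To give a sense of the kind of cleanup involved, here is a minimal sketch (not the actual transformation application), assuming lxml is available; the attribute name, handle URI, and file paths are placeholders.

```python
# Sketch of EAD cleanup: record a handle URI on <eadid> and strip empty elements.
# Simplified for illustration; the real script handled many more cases.
from lxml import etree

NSMAP = {"ead": "urn:isbn:1-931666-22-9"}  # EAD 2002 namespace

def clean_ead(in_path, handle_uri, out_path):
    tree = etree.parse(in_path)

    # Record the handle URI on <eadid> (here as a url attribute).
    eadid = tree.find(".//ead:eadid", namespaces=NSMAP)
    if eadid is not None:
        eadid.set("url", handle_uri)

    # Strip empty elements: no children, no text, no attributes.
    for el in list(tree.getroot().iter()):
        if len(el) == 0 and not (el.text and el.text.strip()) and not el.attrib:
            parent = el.getparent()
            if parent is not None:
                parent.remove(el)

    tree.write(out_path, encoding="utf-8", xml_declaration=True)
```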

I (Metadata Librarian) then used Dallas Pillen's (Bentley) fabulous date cleanup scripts to extract dates and add normal attributes with normalized values, following the workflow (and OpenRefine!) outlined on the Bentley's blog. This resulted in the normalization of 200,000+ dates.
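
For illustration only (this is not the Bentley code), the end result of that workflow looks something like the following sketch: a normalized value written back into each unitdate's normal attribute. The regex here only covers the simple "YYYY-YYYY" case; the real cleanup handled far messier expressions.

```python
# Write a normalized ISO-style value into <unitdate normal="start/end">.
import re
from lxml import etree

NSMAP = {"ead": "urn:isbn:1-931666-22-9"}

def normalize_unitdates(tree):
    for unitdate in tree.findall(".//ead:unitdate", namespaces=NSMAP):
        expression = (unitdate.text or "").strip()
        match = re.match(r"^(\d{4})\s*-\s*(\d{4})$", expression)
        if match:
            # EAD expresses a range in the normal attribute as "start/end".
            unitdate.set("normal", f"{match.group(1)}/{match.group(2)}")
```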

I was then able to modify those scripts to extract and replace extent data and reuse the workflows so that extents could be normalized too, particularly so that extent types could be properly formatted to match the enumeration values. I also reused the scripts and workflow to replace our horrible collection unitid numbering.
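
The extent normalization boils down to mapping free-text extent types onto controlled values. The mapping below is a hypothetical example, not our actual list; the real one was built from the values found in our EAD and the ArchivesSpace extent_extent_type enumeration.

```python
# Map raw extent type strings to ArchivesSpace enumeration values.
EXTENT_TYPE_MAP = {
    "lin. ft.": "linear_feet",
    "linear ft.": "linear_feet",
    "cubic ft.": "cubic_feet",
    "item": "items",
}

def normalize_extent_type(raw_type):
    """Return the enumeration value for a raw extent type, or the input unchanged."""
    return EXTENT_TYPE_MAP.get(raw_type.strip().lower(), raw_type)
```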

At this point, we were testing import and still finding problems we would have to address before ArchivesSpace would accept our EAD. I developed an XSLT that could handle all of the "stoppers" and some of the "would-be-nice" fixes, ran it over all of the transformed files, fixed wonky date problems (like, why did you process the collection in such a way that the date range goes backwards? True story), and successfully imported 1,145 resources.
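
As one example of the wonky-date checks (shown here in Python rather than the XSLT, purely for illustration), flagging backwards ranges is just a matter of comparing the start and end years in the normal attribute.

```python
# Flag normal attributes whose range runs backwards (end year before start year).
from lxml import etree

NSMAP = {"ead": "urn:isbn:1-931666-22-9"}

def backwards_ranges(path):
    tree = etree.parse(path)
    problems = []
    for unitdate in tree.findall(".//ead:unitdate[@normal]", namespaces=NSMAP):
        normal = unitdate.get("normal")
        if "/" in normal:
            start, end = normal.split("/", 1)
            if start[:4].isdigit() and end[:4].isdigit() and int(start[:4]) > int(end[:4]):
                problems.append((path, normal))
    return problems
```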

At this point, there were still a couple of things we wanted fixed: we wanted our IDs parsed (the importer put everything into id_0, so nothing we could do prior to import would matter) and all of our finding aids unpublished. With the help of Noah Huffman of Duke, I put together a little Python script to parse the IDs. I then used a handy script from Yale to unpublish our resources.
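
A rough sketch of that ID-parsing step (not the actual script) follows, assuming a local ArchivesSpace backend on port 8089, repository 2, and admin credentials; the split-on-period rule is an assumption about how the unparsed IDs were structured.

```python
# Split an unparsed id_0 into id_0..id_3 via the ArchivesSpace API.
import requests

API = "http://localhost:8089"
session = requests.post(f"{API}/users/admin/login",
                        params={"password": "admin"}).json()["session"]
headers = {"X-ArchivesSpace-Session": session}

resource_ids = requests.get(f"{API}/repositories/2/resources",
                            params={"all_ids": True}, headers=headers).json()

for rid in resource_ids:
    url = f"{API}/repositories/2/resources/{rid}"
    resource = requests.get(url, headers=headers).json()
    parts = (resource.get("id_0") or "").split(".")
    if len(parts) > 1:
        for i, part in enumerate(parts[:4]):
            resource[f"id_{i}"] = part
        requests.post(url, headers=headers, json=resource)  # save the parsed IDs
```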

For fun, and because it was requested, I put this Python script together to look at the resource/EAD titles and, based on the title, select a resource type.
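
The gist of it is a keyword lookup on the title. The mapping below is an illustrative sketch rather than the script itself; "papers", "records", and "collection" are values from the ArchivesSpace resource_resource_type enumeration.

```python
# Guess an ArchivesSpace resource type from a resource/EAD title.
def guess_resource_type(title):
    lowered = title.lower()
    if "papers" in lowered:
        return "papers"
    if "records" in lowered:
        return "records"
    if "collection" in lowered:
        return "collection"
    return None  # leave resource_type unset when no keyword matches

print(guess_resource_type("Example family papers"))  # -> "papers"
```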

There is further work to be done to fix issues with access restrictions not being read/parsed the way we had hoped, but I have put together and tested a workflow for using the API to extract and parse archival objects, looking for restriction notes and pulling the resource URIs so that those resources can be updated to reflect that restrictions apply. At this point it's very much JSON based, but I'm hoping to put the process into a Python script now that I'm getting the hang of it.
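
A sketch of what that script might look like, under the same assumptions as above (local backend on port 8089, repository 2): walk the archival objects, look for accessrestrict notes, and collect the parent resource URIs.

```python
# Collect resource URIs for archival objects that carry accessrestrict notes.
import requests

API = "http://localhost:8089"
session = requests.post(f"{API}/users/admin/login",
                        params={"password": "admin"}).json()["session"]
headers = {"X-ArchivesSpace-Session": session}

ao_ids = requests.get(f"{API}/repositories/2/archival_objects",
                      params={"all_ids": True}, headers=headers).json()

restricted_resources = set()
for ao_id in ao_ids:
    ao = requests.get(f"{API}/repositories/2/archival_objects/{ao_id}",
                      headers=headers).json()
    has_restriction = any(note.get("type") == "accessrestrict"
                          for note in ao.get("notes", []))
    resource_ref = ao.get("resource", {}).get("ref")
    if has_restriction and resource_ref:
        restricted_resources.add(resource_ref)

for uri in sorted(restricted_resources):
    print(uri)  # e.g. /repositories/2/resources/123
```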
