@kspurgin
Created March 31, 2017 20:34
MARC upsides
"MARC is concise as a physical format (something that is less important today)"
^--- I know it used to be a lot MORE important, but it doesn't feel unimportant when I'm extracting 6.5 million records from one system and transferring them to another server! (And I'm aware there are more concise ways to express data than the XML serializations we usually see.)
-=-=-=-=-=-=-=-=-=-
Ease of transforming/exporting/analyzing
-=-=-=-=-=-=-=-=-=-
I work with big batches of MARC (in MARC-binary and MARC-XML) and other metadata formats (DC, EAD, DDI, MODS, Oracle Endeca format for indexed data) expressed in XML.
I whip up XSLT to do stuff when I have to. But Terry Reese gave us MARCEdit, which makes it easy to do fairly complex transformations across a set of MARC records, or just get an overview of which fields are in a record set and with what frequency.
Perhaps tools to do this for XML data exist, but I have not had luck finding them. What MARC records don't have a 26X $c? That is a snap to find with free tools ready to hand. What MODS records are lacking "mods:dateIssued"? Not so fast and easy...
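To make the "snap to find" claim concrete, here's a rough sketch of that 26X $c check in plain Ruby. The hash-of-fields representation is a hypothetical stand-in for illustration; a real script would read binary MARC with the ruby-marc gem (or just use MARCEdit's reports).

```ruby
# Minimal stand-in for MARC records: id => { tag => [[code, value], ...] }.
# A real version would iterate MARC::Reader from the ruby-marc gem.
records = {
  'rec1' => { '260' => [['a', 'New York :'], ['c', '1999.']] },
  'rec2' => { '264' => [['a', 'Chapel Hill :']] },           # 26X, but no $c
  'rec3' => { '245' => [['a', 'A title']] }                  # no 26X at all
}

# Which records lack a date of publication (26X $c)?
missing_26x_c = records.select do |_id, fields|
  fields.none? do |tag, subfields|
    tag =~ /26./ && subfields.any? { |code, _value| code == 'c' }
  end
end.keys

puts missing_26x_c.inspect
```

Because every field is just a tag plus subfield codes, the whole question is one regex match and one subfield lookup; the equivalent question against nested MODS means XPath plus namespace handling.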
Of course, tools develop to meet the needs at hand.
It's super easy to export MARC fields/subfields to a spreadsheet/delimited text format. This is partly because of MARC's flatness.
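As a sketch of how trivial that flat export is, again with a hypothetical minimal record structure standing in for real MARC parsing (in practice you'd use MARCEdit's tab-delimited export or ruby-marc):

```ruby
require 'csv'

# Each record as tag => [[code, value], ...]; a real export would read
# these fields from binary MARC via the ruby-marc gem.
records = [
  { '245' => [['a', 'First title']],  '100' => [['a', 'Author, One']] },
  { '245' => [['a', 'Second title']], '100' => [['a', 'Author, Two']] }
]

# Because MARC is flat, "title and author to a spreadsheet" is one
# subfield lookup per column -- no tree-walking required.
csv = CSV.generate(col_sep: "\t") do |out|
  out << ['245$a', '100$a']
  records.each do |rec|
    out << ['245', '100'].map do |tag|
      sf = (rec[tag] || []).find { |code, _value| code == 'a' }
      sf ? sf[1] : ''
    end
  end
end

puts csv
```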
All the nesting in our XML metadata makes it a nightmare to deal with. That may say more about how our metadata evolved than about the formats themselves, but just trying to separate the ETDs described as composite objects (with data and other attachments) from the ones that are just the thesis basically requires a programming project.
-=-=-=-=-=-=-=-=-=-
Field structure/semantics = shorthands for processing
-=-=-=-=-=-=-=-=-=-
I write a lot of code to process MARC. It is a beautiful thing to be able to grab all LCSH headings, regardless of subject, name, geog, chron type, with a quick:
tag =~ /6../ && (i2 == '0' || (i2 == '7' && field_string.include?('$2 lcsh')))
Or to know that certain subfields of 111, 711, and 811 can be processed with the same method, because they are all defined the same way, as meeting names.
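Spelled out as runnable Ruby, that LCSH filter might look like the sketch below. The `Field` struct is a hypothetical stand-in for ruby-marc's `MARC::DataField`, which carries the same information (tag, second indicator, subfields):

```ruby
# Hypothetical minimal field object; subfields is a {code => value} hash.
Field = Struct.new(:tag, :i2, :subfields)

# An LCSH heading is any 6XX field with second indicator 0, or second
# indicator 7 with $2 lcsh -- regardless of whether it's a topical,
# name, geographic, or chronological heading.
def lcsh?(field)
  field.tag =~ /\A6..\z/ &&
    (field.i2 == '0' ||
     (field.i2 == '7' && field.subfields['2'] == 'lcsh'))
end

fields = [
  Field.new('650', '0', { 'a' => 'Cats' }),                     # topical
  Field.new('651', '7', { 'a' => 'Durham (N.C.)', '2' => 'lcsh' }),
  Field.new('650', '7', { 'a' => 'Chats', '2' => 'rvm' }),      # not LCSH
  Field.new('600', '0', { 'a' => 'Austen, Jane' })              # name as subject
]

headings = fields.select { |f| lcsh?(f) }.map { |f| f.subfields['a'] }
puts headings.inspect
```

One three-line predicate covers every subject vocabulary question above, precisely because the tag/indicator/subfield semantics are uniform across record sets.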
I've heard a lot of complaining from developers about how terrible MARC is, but honestly, if you understand it, it gives you a lot of shortcuts for working with the data in code.