Mark: DPLA harvests data in DC, MODS, MARCXML, and site-specific formats. This session is about that harvesting, mapping, and enhancement. What should an aggregation ingest stack look like?
What are the tools, what is the current state of the art?
Bess: Stanford is harvesting GIS data and transforming it into MODS (their stack is built around MODS)
Tom Cramer: Stanford also has problems with MARC and MODS mapping to the same Solr schema, and with handling 'format families' by item type.
- place names
- ORCID IDs
- VIAF
Is cleanup for consistency an example of metadata augmentation? Yes.
Can we define, e.g., what a properly formatted MODS record is within local practice? In DPLA's case, the mapping may require different transforms for local manifestations of standardized metadata like MODS.
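A minimal sketch of what a 'local practice' check for a MODS record could look like, using only the stdlib XML parser. The two required elements here are hypothetical local rules, not actual DPLA or Stanford requirements:

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

def check_local_mods(xml_text):
    """Return a list of local-practice problems (empty list = passes).
    The rules below are illustrative, not a real institutional profile."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.find("mods:titleInfo/mods:title", MODS_NS) is None:
        problems.append("no titleInfo/title")
    if root.find("mods:identifier", MODS_NS) is None:
        problems.append("no identifier")
    return problems

record = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Portland panorama</title></titleInfo>
</mods>"""
print(check_local_mods(record))  # -> ['no identifier']
```

A harvester could run checks like this per provider, with each provider's "local practice" expressed as its own rule set.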
DPLA is using an RDF-based model. On DPLA's side: there's an interest in modeling a core 'Item', where requirements can be validated.
Would be a positive thing to have more people with solid metadata expertise. Need more 'catalogers' working intensively with metadata enhancement processes. Could we have instructional paths for programming metadata enhancement/harvesting/augmentation in the same way we do for building web apps?
Need to establish base patterns. Patterns for validation, and patterns for controlled vocabularies.
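One possible shape for those base patterns, sketched in Python. The field names and the toy vocabulary are assumptions for illustration, not DPLA's actual schema:

```python
# Validation pattern: required fields + a controlled vocabulary check.
REQUIRED_FIELDS = {"title", "identifier", "rights"}
# Toy vocabulary; a real one would come from DCMI Type, AAT, etc.
TYPE_VOCAB = {"text", "image", "sound", "physical object"}

def validate(record):
    """Return a list of human-readable validation errors (empty = valid)."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    rtype = record.get("type")
    if rtype is not None and rtype.lower() not in TYPE_VOCAB:
        errors.append(f"'{rtype}' not in controlled vocabulary for type")
    return errors

record = {"title": "Portland panorama", "identifier": "oid:123", "type": "Image"}
print(validate(record))  # -> ['missing required field: rights']
```

The point is the pattern, not the rules: each community could plug in its own required fields and vocabularies.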
What is DPLA's storage platform? Currently CouchDB (with JSON-LD). Interest in Fedora 4.
What are the patterns for doing augmentation?
Dump/Transform/Load: first dump, then run processes, then reload as a new version with some kind of commit message.
Harvest/Reingest: DPLA's current pattern. Mark doesn't find this desirable, but it's the current cycle.
Digital Curation bots: automated process leaving a provenance breadcrumb behind.
Can't necessarily get around big ETL-style processes. Some tasks are more specialized or one-off, and not well suited to a regular bot.
Should tools for enhancement be a community commitment?
Solr workshops at Code4Lib have involved bringing your own data to index. Could do the same with a metadata cleanup workshop--bring 1000 of your own records. There are plans for something a bit like this in Portland as a DPLA hack day this summer.
DPLA needs to work on enhancement through the Content Hubs.
Opportunity to approach this as small, repurposable 'pipeline' tools which can be run in any order.
Pipeline examples:
- Whitespace normalization
- Date normalization
- Namespace declaration
- Named entity recognition
Think of this as a test suite. Is it normalized (yes/no). Build up correction approaches for various kinds of weirdness. Compare to OpenRefine--have common tasks, create custom ones, apply.
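A sketch of the small, repurposable pipeline idea above, with the yes/no "is it normalized" check framed as a fixed-point test. The specific date format handled is just one illustrative case:

```python
import re

def normalize_whitespace(value):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()

def normalize_date(value):
    """Rewrite dates like '3/14/2015' as ISO 8601; pass anything else through."""
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", value.strip())
    if m:
        month, day, year = m.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return value

def is_normalized(value, transform):
    """Test-suite-style yes/no check: a normalized value is a fixed point."""
    return transform(value) == value

PIPELINE = [normalize_whitespace, normalize_date]

def run(value):
    for step in PIPELINE:  # these steps are independent, so order doesn't matter
        value = step(value)
    return value

print(run("  3/14/2015 "))  # -> 2015-03-14
```

Each step is a plain function on a value, so steps can be added, removed, or reordered per collection--much like stacking custom operations in OpenRefine.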
Not desiring to do this all in one pass, or try to build a final cleanup tool. Rather, it's an ongoing process.
What if we built tools and did community-focused cleanup? A Dive-into-Hydra-style workshop on using metadata enhancement tools. Would like to see this in library schools.
Where will this data live, and where do you want to run this process for small historical societies? Tools built for RDF would likely not be broadly useful to this cohort.
There's a role here for nationally focused organizations like DPLA, Europeana, Trove, etc. Also good to touch base with Linked Data 4 Libraries (LD4L).
What are safe transformations (per the Hillmann, Phipps, and Dushay paper)? Removing whitespace and non-semantic punctuation. Some processes like this are (relatively) safe, but others are risky to some extent.
"Safe transformations" is questionable. Certainly the things that qualify would be domain-specific.
What are the provenance options? Version control approach with detailed commit messages; Named graphs in RDF with graph metadata.
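The version-control option could be as simple as appending a 'commit' record after each transform; every field name and agent here is illustrative:

```python
import hashlib
import json
from datetime import datetime, timezone

def content_hash(record):
    """Hash a canonical JSON serialization of the record."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def commit(log, record, message, agent):
    """Append a provenance breadcrumb: who did what, when, to which version."""
    log.append({
        "message": message,    # the detailed commit message
        "agent": agent,        # a person or a curation bot
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "hash": content_hash(record),
    })

log = []
record = {"title": " Portland  panorama "}
commit(log, record, "harvested from provider", "harvester-bot")
record["title"] = "Portland panorama"
commit(log, record, "whitespace normalization", "cleanup-bot")
print([c["message"] for c in log])
```

The same breadcrumbs could instead be stored as metadata on RDF named graphs; the commit-log shape just makes the version-control analogy concrete.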
One participant is looking at quality assurance of the output, with some kind of test suite at the end of the cleanup pipeline.
Keep the original.
Harvesting from multiple sources leads to duplicates. It's hard to clean them up and there's a desire to know about the relationship between them. Human beings can do this.
skos:closeMatch vs. owl:sameAs
What is identity in your system? The identifier? The checksum of the item? Checksum of item + metadata? Some matching metadata constraints? This is a really hard problem.
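Three of the identity options raised above can be made concrete; all names and data are illustrative. The item+metadata variant shows why two harvests of the same item may only be a close match rather than the same resource:

```python
import hashlib
import json

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def identity_by_identifier(record):
    """Identity is just the assigned identifier."""
    return record["identifier"]

def identity_by_item(item_bytes):
    """Identity is the checksum of the item itself."""
    return sha256(item_bytes)

def identity_by_item_and_metadata(item_bytes, record):
    """Identity is a checksum over the item plus its canonical metadata."""
    canonical = json.dumps(record, sort_keys=True).encode()
    return sha256(item_bytes + canonical)

item = b"...scanned image bytes..."
rec_a = {"identifier": "oai:a:1", "title": "Portland panorama"}
rec_b = {"identifier": "oai:b:9", "title": "Portland panorama"}

# Same item harvested from two sources: item checksums agree,
# but item+metadata checksums differ.
print(identity_by_item(item) == identity_by_item(item))                  # -> True
print(identity_by_item_and_metadata(item, rec_a)
      == identity_by_item_and_metadata(item, rec_b))                     # -> False
```

None of these alone solves deduplication; they just make explicit which definition of identity a matching rule is using.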
- "MODS Bridge" in PDX?
- Compare notes between DPLA and LD4L Roadmap. (Mark, Tom C., Simeon)