LDCX^5: Metadata Aggregation/Enhancement Notes

Metadata Session

Framing

Mark: DPLA harvests data in DC, MODS, MARCXML, and site-specific formats. This session is about that harvesting, mapping, and enhancement. What should an aggregation ingest stack look like?

What are the tools, what is the current state of the art?

Bess: Stanford is harvesting GIS data and transforming it into MODS (their stack is built around MODS).

Tom Cramer: Stanford also has problems with MARC and MODS mapping to the same Solr schema. They handle 'format families' by item type.

Augmentation

Tying names to authority records

  • place names
  • ORCID IDs
  • VIAF

Is cleanup for consistency an example of metadata augmentation? Yes.

Can we define, for example, what a properly formatted MODS record is within local practice? In DPLA's case, harvesting may require different transforms for different local manifestations of standardized metadata like MODS.

DPLA is using an RDF-based model. On DPLA's side: there's an interest in modeling a core 'Item', where requirements can be validated.

Knowledge Gap on Metadata

Would be a positive thing to have more people with solid metadata expertise. Need more 'catalogers' working intensively with metadata enhancement processes. Could we have instructional paths for programming metadata enhancement/harvesting/augmentation in the same way we do for building web apps?

Need to establish base patterns: patterns for validation, and patterns for controlled vocabularies.
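
As a sketch of what a base validation pattern could look like, here is a minimal, hedged Python example; the required fields and the type vocabulary are hypothetical illustrations, not an agreed-upon profile:

```python
# Minimal sketch of a validation pattern: check a record against required
# fields and a small controlled vocabulary. The field names and vocabulary
# are illustrative assumptions, not a DPLA profile.
REQUIRED_FIELDS = {"title", "identifier", "rights"}
TYPE_VOCAB = {"text", "image", "sound", "moving image", "physical object"}

def validate(record):
    """Return a list of human-readable problems; an empty list means 'pass'."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append("missing required fields: %s" % sorted(missing))
    record_type = record.get("type")
    if record_type is not None and record_type.lower() not in TYPE_VOCAB:
        problems.append("type '%s' not in controlled vocabulary" % record_type)
    return problems

# validate({"title": "Map of Portland", "identifier": "oai:x:1",
#           "rights": "Public domain", "type": "Map"})
# -> ["type 'Map' not in controlled vocabulary"]
```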

Cleanup Workflows

What is DPLA's storage platform? Currently CouchDB (with JSON-LD). Interest in Fedora 4.

What are the patterns for doing augmentation?

Dump/Transform/Load: first dump, then run processes, then reload as a new version with some kind of commit message.

Harvest/Reingest: DPLA's current pattern. Mark doesn't find this desirable, but it's the current cycle.

Digital curation bots: automated processes that leave a provenance breadcrumb behind.

Can't necessarily get around big ETL-style processes. Some tasks are more specialized or one-off, and not well suited to a regular bot.
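
A hedged sketch of the curation-bot idea in Python: apply one small transform and leave a provenance breadcrumb on the record. The `_provenance` field and the agent name are illustrative conventions, not anything DPLA actually uses:

```python
import datetime

def strip_whitespace(record):
    """One small, repeatable transform: collapse stray whitespace in string fields."""
    return {k: " ".join(v.split()) if isinstance(v, str) else v
            for k, v in record.items()}

def run_bot(record, transform, agent="whitespace-bot/0.1"):
    """Apply a transform and append a provenance breadcrumb (an illustrative convention)."""
    new_record = transform(record)
    breadcrumb = {
        "agent": agent,
        "transform": transform.__name__,
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    }
    provenance = list(record.get("_provenance", []))
    provenance.append(breadcrumb)
    new_record["_provenance"] = provenance
    return new_record

# run_bot({"title": "  Map of   Portland "}, strip_whitespace)
```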

Should tools for enhancement be a community commitment?

Solr workshops at Code4Lib have involved bringing your own data to index. Could do the same with a metadata cleanup workshop: bring 1,000 of your own records. There are plans for something a bit like this in Portland as a DPLA hack day this summer.

DPLA needs to work on enhancement through the Content Hubs.

Opportunity to approach this as small, repurposable 'pipeline' tools which can be run in any order.

Pipeline examples:

  • Whitespace normalization
  • Date normalization
  • Namespace declaration
  • Named entity recognition

Think of this as a test suite: is it normalized (yes/no)? Build up correction approaches for various kinds of weirdness. Compare to OpenRefine: have common tasks, create custom ones, and apply them.
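
A sketch of that idea in Python: each step is a small function that can be composed in any order, and each has a matching yes/no check so the same pieces double as a test suite. The date handling here is deliberately naive and only an illustration:

```python
import re

# Normalizers: small, repurposable tools that can run in any order.
def normalize_whitespace(value):
    return " ".join(value.split())

def normalize_date(value):
    """Rewrite 'MM/DD/YYYY' as ISO 'YYYY-MM-DD'; pass anything else through."""
    match = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", value.strip())
    if match:
        return "%s-%s-%s" % (match.group(3), match.group(1), match.group(2))
    return value.strip()

def run_pipeline(value, steps):
    for step in steps:
        value = step(value)
    return value

# Checks: the same questions framed as yes/no tests.
def is_whitespace_normalized(value):
    return value == normalize_whitespace(value)

def is_iso_date(value):
    return re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", value) is not None

# run_pipeline("  04/24/2014 ", [normalize_whitespace, normalize_date]) -> "2014-04-24"
```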

The aim is not to do this all in one pass, or to build a final cleanup tool. Rather, it's an ongoing process.

What if we built tools and did community-focused cleanup? A Dive-into-Hydra-style workshop on using metadata enhancement tools. Would like to see this in library schools.

Where will this data live, and where do you want to run this process for small historical societies? Tools built for RDF would likely not be broadly useful to this cohort.

Modeling/Interoperability Issues

There's a role here for nationally focused organizations like DPLA, Europeana, Trove, etc. It would also be good to touch base with Linked Data 4 Libraries (LD4L).

Safe Transformations & Provenance

What are safe transformations (per the Hillmann, Phipps, and Dushay paper)? Removing whitespace and non-semantic punctuation. Some processes like this are (relatively) safe, but others are risky to some extent.

"Safe transformations" is questionable. Certainly the things that qualify would be domain-specific.

What are the provenance options? A version-control approach with detailed commit messages; named graphs in RDF with graph metadata.
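
A minimal sketch of the named-graph option, assuming rdflib: the triples from one cleanup run go into their own graph, and that graph is then described (agent, timestamp) in the default graph. The URIs and the use of PROV-O terms are assumptions for illustration:

```python
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
DC = Namespace("http://purl.org/dc/elements/1.1/")

ds = Dataset()

# Statements produced by one cleanup run live in their own named graph.
run = URIRef("http://example.org/graph/cleanup-run-42")
g = ds.graph(run)
g.add((URIRef("http://example.org/item/1"), DC.title, Literal("Map of Portland")))

# The default graph carries metadata about that graph: who/what produced it, and when.
ds.add((run, PROV.wasAttributedTo, URIRef("http://example.org/agent/whitespace-bot")))
ds.add((run, PROV.generatedAtTime,
        Literal("2014-04-24T17:00:00Z", datatype=XSD.dateTime)))

print(ds.serialize(format="trig"))
```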

Someone should look at quality assurance of the output, with some kind of test suite at the end of the cleanup pipeline.

Keep the original.

Deduplication

Harvesting from multiple sources leads to duplicates. They're hard to clean up, and there's a desire to know about the relationships between them. Human beings can do this.

CloseMatch vs. SameAs

What is identity in your system? The identifier? The checksum of the item? Checksum of item + metadata? Some matching metadata constraints? This is a really hard problem.
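
One way to make that question concrete: compute candidate identity keys at different strictness levels and compare them across sources. The field choices and hashing below are arbitrary examples; colliding keys suggest a closeMatch candidate, while asserting sameAs should take stronger evidence:

```python
import hashlib

def _digest(*parts):
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()

def identity_keys(record, item_checksum=None):
    """Candidate identity keys, from strictest to loosest (all illustrative)."""
    title = " ".join(record.get("title", "").lower().split())
    creator = " ".join(record.get("creator", "").lower().split())
    keys = {
        "identifier": record.get("identifier"),   # exact identifier match
        "metadata": _digest(title, creator),      # normalized-metadata match
    }
    if item_checksum:
        keys["item"] = item_checksum              # checksum of the item itself
        keys["item+metadata"] = _digest(item_checksum, title, creator)
    return keys
```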

Next Steps

  • "MODS Bridge" in PDX?
  • Compare notes between DPLA and LD4L Roadmap. (Mark, Tom C., Simeon)