LDCX^5: Metadata Aggregation/Enhancement Notes

Metadata Session

Framing

Mark: DPLA harvests data in DC, MODS, MARCXML, and site-specific formats. This session is about that harvesting, mapping, and enhancement. What should an aggregation ingest stack look like?

What are the tools, what is the current state of the art?

Bess: Stanford is harvesting GIS data and transforming it into MODS (their stack is built around MODS).

Tom Cramer: Stanford also has problems with MARC and MODS mapping to the same Solr schema. They handle 'format families' by item type.

Augmentation

Tying names to authority records

  • place names
  • ORCID IDs
  • VIAF

Is cleanup for consistency an example of metadata augmentation? Yes.

Can we define, for example, what a properly formatted MODS record is within local practice? In DPLA's case, harvesting may require different transforms for different local manifestations of standardized metadata like MODS.

DPLA is using an RDF-based model. On DPLA's side: there's an interest in modeling a core 'Item', where requirements can be validated.

Knowledge Gap on Metadata

Would be a positive thing to have more people with solid metadata expertise. Need more 'catalogers' working intensively with metadata enhancement processes. Could we have instructional paths for programming metadata enhancement/harvesting/augmentation in the same way we do for building web apps?

Need to establish base patterns: patterns for validation, and patterns for controlled vocabularies.
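
As a sketch of what a base validation pattern could look like, here is a minimal, hedged Python example; the required fields and the type vocabulary are hypothetical illustrations, not an agreed-upon profile:

```python
# Minimal sketch of a validation pattern: check a record against required
# fields and a small controlled vocabulary. The field names and vocabulary
# are illustrative assumptions, not a DPLA profile.
REQUIRED_FIELDS = {"title", "identifier", "rights"}
TYPE_VOCAB = {"text", "image", "sound", "moving image", "physical object"}

def validate(record):
    """Return a list of human-readable problems; an empty list means 'pass'."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append("missing required fields: %s" % sorted(missing))
    record_type = record.get("type")
    if record_type is not None and record_type.lower() not in TYPE_VOCAB:
        problems.append("type '%s' not in controlled vocabulary" % record_type)
    return problems

# validate({"title": "Map of Portland", "identifier": "oai:x:1",
#           "rights": "Public domain", "type": "Map"})
# -> ["type 'Map' not in controlled vocabulary"]
```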

Cleanup Workflows

What is DPLA's storage platform? Currently CouchDB (with JSON-LD). Interest in Fedora 4.

What are the patterns for doing augmentation?

Dump/Transform/Load: first dump, then run processes, then reload as a new version with some kind of commit message.

Harvest/Reingest: DPLA's current pattern. Mark doesn't find this desirable, but it's the current cycle.

Digital curation bots: automated processes that leave a provenance breadcrumb behind.

Can't necessarily get around big ETL-style processes. Some tasks are more specialized or one-off, and not well suited to a regular bot.
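
A hedged sketch of the curation-bot idea in Python: apply one small transform and leave a provenance breadcrumb on the record. The `_provenance` field and the agent name are illustrative conventions, not anything DPLA actually uses:

```python
import datetime

def strip_whitespace(record):
    """One small, repeatable transform: collapse stray whitespace in string fields."""
    return {k: " ".join(v.split()) if isinstance(v, str) else v
            for k, v in record.items()}

def run_bot(record, transform, agent="whitespace-bot/0.1"):
    """Apply a transform and append a provenance breadcrumb (an illustrative convention)."""
    new_record = transform(record)
    breadcrumb = {
        "agent": agent,
        "transform": transform.__name__,
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    }
    provenance = list(record.get("_provenance", []))
    provenance.append(breadcrumb)
    new_record["_provenance"] = provenance
    return new_record

# run_bot({"title": "  Map of   Portland "}, strip_whitespace)
```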

Should tools for enhancement be a community commitment?

Solr workshops at Code4Lib have involved bringing your own data to index. Could do the same with a metadata cleanup workshop: bring 1,000 of your own records. There are plans for something a bit like this in Portland as a DPLA hack day this summer.

DPLA needs to work on enhancement through the Content Hubs.

Opportunity to approach this as small, repurposable 'pipeline' tools which can be run in any order.

Pipeline examples:

  • Whitespace normalization
  • Date normalization
  • Namespace declaration
  • Named entity recognition

Think of this as a test suite: is it normalized (yes/no)? Build up correction approaches for various kinds of weirdness. Compare to OpenRefine: have common tasks, create custom ones, and apply them.
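
A sketch of that idea in Python: each step is a small function that can be composed in any order, and each has a matching yes/no check so the same pieces double as a test suite. The date handling here is deliberately naive and only an illustration:

```python
import re

# Normalizers: small, repurposable tools that can run in any order.
def normalize_whitespace(value):
    return " ".join(value.split())

def normalize_date(value):
    """Rewrite 'MM/DD/YYYY' as ISO 'YYYY-MM-DD'; pass anything else through."""
    match = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", value.strip())
    if match:
        return "%s-%s-%s" % (match.group(3), match.group(1), match.group(2))
    return value.strip()

def run_pipeline(value, steps):
    for step in steps:
        value = step(value)
    return value

# Checks: the same questions framed as yes/no tests.
def is_whitespace_normalized(value):
    return value == normalize_whitespace(value)

def is_iso_date(value):
    return re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", value) is not None

# run_pipeline("  04/24/2014 ", [normalize_whitespace, normalize_date]) -> "2014-04-24"
```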

The aim is not to do this all in one pass, or to build a final cleanup tool. Rather, it's an ongoing process.

What if we built tools and did community-focused cleanup? A Dive-into-Hydra-style workshop on using metadata enhancement tools. Would like to see this in library schools.

Where will this data live, and where do you want to run this process for small historical societies? Tools built for RDF would likely not be broadly useful to this cohort.

Modeling/Interoperability Issues

There's a role here for nationally focused organizations like DPLA, Europeana, Trove, etc. It would also be good to touch base with Linked Data 4 Libraries (LD4L).

Safe Transformations & Provenance

What are safe transformations (per the Hillmann, Phipps, and Dushay paper)? Removing whitespace and non-semantic punctuation. Some processes like this are (relatively) safe, but others are risky to some extent.

"Safe transformations" is questionable. Certainly the things that qualify would be domain-specific.

What are the provenance options? A version-control approach with detailed commit messages; named graphs in RDF with graph metadata.
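
A minimal sketch of the named-graph option, assuming rdflib: the triples from one cleanup run go into their own graph, and that graph is then described (agent, timestamp) in the default graph. The URIs and the use of PROV-O terms are assumptions for illustration:

```python
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
DC = Namespace("http://purl.org/dc/elements/1.1/")

ds = Dataset()

# Statements produced by one cleanup run live in their own named graph.
run = URIRef("http://example.org/graph/cleanup-run-42")
g = ds.graph(run)
g.add((URIRef("http://example.org/item/1"), DC.title, Literal("Map of Portland")))

# The default graph carries metadata about that graph: who/what produced it, and when.
ds.add((run, PROV.wasAttributedTo, URIRef("http://example.org/agent/whitespace-bot")))
ds.add((run, PROV.generatedAtTime,
        Literal("2014-04-24T17:00:00Z", datatype=XSD.dateTime)))

print(ds.serialize(format="trig"))
```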

Someone should look at quality assurance of the output, with some kind of test suite at the end of the cleanup pipeline.

Keep the original.

Deduplication

Harvesting from multiple sources leads to duplicates. They're hard to clean up, and there's a desire to know about the relationships between them. Human beings can do this.

CloseMatch vs. SameAs

What is identity in your system? The identifier? The checksum of the item? Checksum of item + metadata? Some matching metadata constraints? This is a really hard problem.
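
One way to make that question concrete: compute candidate identity keys at different strictness levels and compare them across sources. The field choices and hashing below are arbitrary examples; colliding keys suggest a closeMatch candidate, while asserting sameAs should take stronger evidence:

```python
import hashlib

def _digest(*parts):
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()

def identity_keys(record, item_checksum=None):
    """Candidate identity keys, from strictest to loosest (all illustrative)."""
    title = " ".join(record.get("title", "").lower().split())
    creator = " ".join(record.get("creator", "").lower().split())
    keys = {
        "identifier": record.get("identifier"),   # exact identifier match
        "metadata": _digest(title, creator),      # normalized-metadata match
    }
    if item_checksum:
        keys["item"] = item_checksum              # checksum of the item itself
        keys["item+metadata"] = _digest(item_checksum, title, creator)
    return keys
```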

Next Steps

  • "MODS Bridge" in PDX?
  • Compare notes between DPLA and LD4L Roadmap. (Mark, Tom C., Simeon)