@no-reply
Last active March 23, 2017 20:09
Records, Documents, & Graphs

Accounting for record scope & mutability in metadata management.

Smoothies cannot be edited @anarchivist -- 6:52 PM PDT - 23 Apr 2015

Questions

The key question I'm setting out to answer is: How can we account for routine change and updates in our metadata records? An initial attempt to derive a model for change from current practice has led to some corollary questions about the relationship between Records, Documents, Description Sets, Application Profiles, Resources, and RDF Sources [lit-review]:

  • What is a Record?
    • Possible definition: Records are Documents instantiating a Description Set
    • Are Records mutable? What about Description Sets?
    • What other views of metadata records need to be taken into account?
  • Is a Description Set in DCAM equivalent to a (or a kind of) Graph in RDF? [property value]
  • What is the relationship between a Description Set (and by extension, a Record) and a Resource?
    • If we speak of a "record for Moby Dick", how do we distinguish that from a "record for Melville" that happens to contain some statements about Moby Dick? Is this a valid distinction under DCAM?
  • Is a Description Set an example of an RDF Source?

Records and Documents

The Dublin Core Abstract Model (DCAM) defines a metadata record as a document that instantiates a Description Set [record]. Description Sets, in turn, are defined as sets of (one or more) Descriptions, with each Description defined as a set of one or more Statements "about one, and only one, resource".

Functional Requirements for Bibliographic Records (FRBR) treats "Records" as an aggregation of "descriptive elements" and "filing devices" (IFLA, 1997; see especially Sec. 2.2). It's not clear from the definition given (or a loose reading of the remainder of the document) whether the IFLA Study Group's view is of a record as an abstract entity that can be updated, as a static representation of data, or as a literal physical document. While some combination appears to be at play, there seems to be an emphasis on the last.

Record mutability under both understandings raises the issues documented in Documents Cannot be Edited (Renear & Wickett, 2009). There is no model for revision in place, and Description Sets lack identifiers of their own to pin revisions on. Records often lack them as well. Even taking a casual view of Records as physical documents, there seems to be little option but to view "revisions" as new documents which will be filed in roughly equivalent places to their predecessors in a card catalog or similar.

In the case of DCAM, the problem is compounded, since Description Sets are defined as sets (sets of sets of statements). This keeps the model close to that of RDF, but leaves the idea of a persistent, changeable Record out of the picture.

Mutability as a Requirement for Actionable Records

The view of Records implied by the above leaves us with significant problems for even basic metadata and asset management workflows. Our practice when describing a resource is to assume that new (and deleted) assertions update an old description. Our systems manage this with internal representations of state, controlled with database rows, or object representations, or otherwise; but usually without an articulated formal model. This won't do when we introduce Linked Data (or any large scale interoperability scheme). A shared model for mutability is needed.

[I would like to further document/articulate the nature of this requirement! What would we be lacking if we always saw records as static?]

Reviewing the RDF Model

  • Resources
  • Statements
  • Graphs
  • Datasets
  • RDF Source

Graphs are Immutable

Graphs are sets of statements.
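
A minimal sketch of what this immutability means in practice, assuming a triple is modelled as a plain (subject, predicate, object) tuple of strings and a Graph as a Python frozenset. The example.org IRIs are invented for illustration; this is not how any particular RDF library represents graphs.

```python
# A triple is modelled as a (subject, predicate, object) tuple of strings.
# The example.org IRIs are made up for illustration.
MOBY_DICK = "http://example.org/moby-dick"
MELVILLE = "http://example.org/melville"

graph_v1 = frozenset({
    (MOBY_DICK, "http://purl.org/dc/terms/title", "Moby Dick"),
    (MOBY_DICK, "http://purl.org/dc/terms/creator", MELVILLE),
})

# "Editing" never changes graph_v1; adding a statement yields a different set.
graph_v2 = graph_v1 | {(MOBY_DICK, "http://purl.org/dc/terms/date", "1851")}

print(len(graph_v1), len(graph_v2))
print(graph_v1 == graph_v2)
```

Under this reading, an "updated" graph is simply another graph; nothing in the set itself says that the second is a revision of the first.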

RDF Source

We informally use the term RDF source to refer to a persistent yet mutable source or container of RDF graphs. An RDF source is a resource that may be said to have a state that can change over time. A snapshot of the state can be expressed as an RDF graph. For example, any web document that has an RDF-bearing representation may be considered an RDF source. Like all resources, RDF sources may be named with IRIs and therefore described in other RDF graphs.

As Resources, Sources can be denoted by an IRI or existentially quantified as a blank node. Further, a Source may be said to relate a time sequence of zero or more RDF graphs, with each graph representing a state of the mutable Resource at a given time.
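
One way to picture this is a sketch in which a Source holds a time-ordered sequence of immutable graphs. The class name `RDFSource`, the methods `record_state` and `state_at`, and the use of integers for time are all assumptions made for illustration; none of them come from the RDF specifications.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


@dataclass
class RDFSource:
    """Hypothetical model: a Resource relating a time sequence of Graphs."""
    iri: Optional[str] = None  # None stands in for a blank node
    snapshots: List[Tuple[int, Graph]] = field(default_factory=list)

    def record_state(self, time: int, graph) -> None:
        # Store an immutable copy so later changes cannot alter the snapshot.
        self.snapshots.append((time, frozenset(graph)))
        self.snapshots.sort(key=lambda pair: pair[0])

    def state_at(self, time: int) -> Optional[Graph]:
        # Return the most recent snapshot at or before `time`, if any.
        latest = None
        for snapshot_time, graph in self.snapshots:
            if snapshot_time <= time:
                latest = graph
        return latest


source = RDFSource("http://example.org/records/1")
title = ("http://example.org/moby-dick", "http://purl.org/dc/terms/title", "Moby Dick")
date = ("http://example.org/moby-dick", "http://purl.org/dc/terms/date", "1851")
source.record_state(1, {title})
source.record_state(2, {title, date})
```

Here the Source keeps its identity while its state moves from one graph to the next; `source.state_at(1)` and `source.state_at(2)` return different, individually immutable graphs.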

Revisiting DCAM

A description is a set of statements that follow the one-to-one principle over the set. In explicit RDF terms, that is, a Graph whose triples share a single Resource as their subject node. On its face, this is very similar to the kind of "resource view" common on Linked Data publishing platforms that expose the triples "about" a given Resource. In practice, a description adds notions of constraint and completeness either through Description Templates (and Statement Templates) in a Description Set Profile or through less formal guidelines for vocabulary usage commonly included in Application Profiles.

The larger Description Set and its associated Record instantiations are, similarly, Graphs without the subject restriction. Any Graph can arguably be interpreted as a Description Set containing Descriptions for each of the Resources that appear as subjects in its triples; though there may be value in the view that a Graph is only a Description Set when viewed in the context of some set of constraints, or as a candidate expression of a "Profile" or "Shape" [infinite].
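
The reading of a Graph as a Description Set can be sketched as a partition of its triples by subject. The function name `descriptions` and the example.org IRIs are assumptions for illustration; the sketch deliberately ignores the constraint and completeness notions discussed above.

```python
from collections import defaultdict

MOBY_DICK = "http://example.org/moby-dick"
MELVILLE = "http://example.org/melville"


def descriptions(graph):
    """Group triples by subject: one Description (a frozenset) per Resource."""
    by_subject = defaultdict(set)
    for subject, predicate, obj in graph:
        by_subject[subject].add((subject, predicate, obj))
    return {subject: frozenset(triples) for subject, triples in by_subject.items()}


graph = frozenset({
    (MOBY_DICK, "http://purl.org/dc/terms/title", "Moby Dick"),
    (MOBY_DICK, "http://purl.org/dc/terms/creator", MELVILLE),
    (MELVILLE, "http://xmlns.com/foaf/0.1/name", "Herman Melville"),
})

description_set = descriptions(graph)
print(len(description_set))             # number of Descriptions
print(len(description_set[MOBY_DICK]))  # statements about Moby Dick
```

Note that nothing in the result marks this as a "record for Moby Dick" rather than a "record for Melville"; both Descriptions sit in the set on equal footing.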

Some Gaps

  • While a Record is said to instantiate a single Description Set, DCAM provides no mechanism for determining which Description Set is instantiated.
    • This points to an interpretation of Description Set as equivalent to Graph---both are defined as sets of statements, without the trappings that come with being a representation of a given Resource.
    • If this is the case then a Record instantiates a given Description Set merely by faithfully encoding the statements that make it up. This leaves no support for notions like "each metadata record is to represent exactly one book" as found in Sec. 6 of Guidelines for Dublin Core Application Profiles (Coyle & Baker, 2008).
  • Constraints and completeness are similarly problematic, since a single Record may be valid and complete for one profile, but not another.
  • ...
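
To make the constraint problem concrete, here is a toy sketch in which the same set of statements passes one profile and fails another. The `conforms` function and its two parameters are invented for illustration and are far cruder than a real Description Set Profile; the IRIs are made up.

```python
TITLE = "http://purl.org/dc/terms/title"
CREATOR = "http://purl.org/dc/terms/creator"
NAME = "http://xmlns.com/foaf/0.1/name"
MOBY_DICK = "http://example.org/moby-dick"
MELVILLE = "http://example.org/melville"


def conforms(graph, required_properties, max_subjects=None):
    """Toy check: every described resource carries the required properties,
    and the graph describes no more than `max_subjects` resources."""
    subjects = {s for s, _, _ in graph}
    if max_subjects is not None and len(subjects) > max_subjects:
        return False
    for subject in subjects:
        properties = {p for s, p, _ in graph if s == subject}
        if not required_properties <= properties:
            return False
    return True


record = frozenset({
    (MOBY_DICK, TITLE, "Moby Dick"),
    (MOBY_DICK, CREATOR, MELVILLE),
    (MELVILLE, NAME, "Herman Melville"),
})

# Hypothetical "exactly one book, with a title" profile versus a permissive one.
valid_as_one_book = conforms(record, {TITLE}, max_subjects=1)
valid_as_anything = conforms(record, set())
print(valid_as_one_book, valid_as_anything)
```

Validity here is a property of the pairing of statements and profile, which is why the statements alone cannot say which Description Set, in the profile-bound sense, a Record instantiates.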

RDF Source

The RDF Source concept offers a potential solution for each of these problems.

Towards a Formalized Model for RDF Sources

While a common pattern (alluded to in RDF and Change Over Time) is to dereference the Source's IRI to get the current state of the Resource, it's not explicitly required that the representation express the current state. Nor is it necessary to retain each graph in the sequence, or that continuity be maintained.

Linked Data Platform codifies more specific patterns of dereferenceability, including a requirement of fullness of the representation, and methods for updating "current persistent state". I've done some work to formalize similar handling of locally managed state-bearing Graphs in ActiveTriples in a comment on the GitHub issue "Resource-centric vs graph-centric in persistence/querying".

Removing the implementation-specific language and restrictions:

  • An RDF Source is "a resource that may be said to have a state that can change over time". Therefore, it:
    • is a Resource
    • may be the referent of a URI.
  • An RDF Source has a Graph container.
    • A container is a mechanism for retrieving specific Graphs; a container may be, e.g.
      • a dereferenceable URI (web address); or
      • a named graph; or
      • a language construct (an Object, or a Variable); or
      • a document; or
      • a memory block; etc...
    • The Graph in the container represents the Source's current state.
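
The outline above might be sketched as follows, treating a container as any zero-argument callable that retrieves a Graph. The class name `Source`, the `current_state` property, and the dict standing in for a named-graph store are all assumptions for illustration; this is not the ActiveTriples or LDP API.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


class Source:
    """Hypothetical RDF Source: a Resource with a Graph container."""

    def __init__(self, container: Callable[[], Iterable[Triple]],
                 uri: Optional[str] = None):
        self.uri = uri               # a Source may be the referent of a URI
        self._container = container  # the mechanism for retrieving a Graph

    @property
    def current_state(self) -> Graph:
        # Each call returns an immutable snapshot of whatever the container holds.
        return frozenset(self._container())


# A dict stands in for a named-graph store; the names are invented.
GRAPH_NAME = "http://example.org/graphs/moby-dick"
store = {GRAPH_NAME: {
    ("http://example.org/moby-dick", "http://purl.org/dc/terms/title", "Moby Dick"),
}}

source = Source(lambda: store[GRAPH_NAME], uri="http://example.org/records/1")

before = source.current_state
store[GRAPH_NAME].add(
    ("http://example.org/moby-dick", "http://purl.org/dc/terms/date", "1851"))
after = source.current_state
print(len(before), len(after))
```

The design choice worth noticing is that mutability lives entirely in the container; the snapshots `before` and `after` are ordinary immutable graphs, and the Source is what ties them together.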

Problems for Provenance


Notes

[lit-review]: Literature review is still ongoing, but I believe I've pulled in the relevant concepts. Some fashion of definition of each concept listed is attempted somewhere in the main text.

[implementations]: While in LDP and ActiveTriples, the current state is represented by a specific Graph, in principle it's only necessary that some snapshots of state may be represented by Graphs.

[property value]: While working through this question, it has occurred to me that JSON-LD represents another example of this issue. Its graphs are expressed in documents as property value pairs in a model very similar to DCAM.

[record]: Specifically, it says a record is "An instantiation of a description set, created according to one of the DCMI encoding guidelines (for example, XHTML meta tags, XML and RDF/XML)." The tie to an encoding is significant, since it ensures that a record expresses at most one Graph.

[infinite]: Consider, for example, the Graph of the web. It's not clear what use there is in viewing this as a Set with a functionally infinite number of Descriptions or why anyone would want to instantiate such a thing as a Record.


Bibliography

@aisaac

aisaac commented May 5, 2015

A colleague of mine has pointed me to this discussion. I have not much to add, no time to think about the theory. But if you need use cases and requirements, the problem of representation/versions had to be tackled for the ongoing EuropeanaCloud project, where they (I was not much involved) created a model for records and datasets. It's not RDF, and there's not much documentation besides an old deliverable http://pro.europeana.eu/files/Europeana_Professional/Projects/Project_list/Europeana_Cloud/Deliverables/D2.2%20Europeana%20Cloud%20Architectural%20Design.pdf and the current API: http://sonar.eanadev.org/job/oncommit-eCloud/MCS_REST_API/index.html

@no-reply
Author

no-reply commented May 5, 2015

@aisaac: Thanks. I'm reading this.

@no-reply
Author

no-reply commented May 5, 2015

I added a loose distillation of the model I've introduced to ActiveTriples, removing implementation-specific restrictions. I expect to make more substantial additions in the next couple of days, as I regain the energy to work on this.

@kcoyle I think the bullets I added outlining an RDF Source may help clarify where I'm going with this and how it relates to your conception of a "document". Oddly, I think the Shapes group is (probably) right to be focusing on the validation of individual graphs. What RDF Source gets us is a target for concepts like:

  • this set of constraints applies to this Resource (or Class, or Source with these qualities).
  • this Resource was valid at time x and invalid at time y.

as well as "this description set is about a book" and similar.

@kcoyle

kcoyle commented May 11, 2015

You've asked elsewhere why I focus on DSP rather than DCAM -- it's mainly because "abstract" just doesn't interest me. But here's what I have to say about DCAM:

  1. terminology: yikes!
    a) if you just change "surrogate" to "type" or "representation" or something similar it suddenly becomes much more readable
    b) syntax encoding scheme is a data type. Call it a data type.

  2. the vocabulary encoding scheme is, as far as I can tell, a validation issue, not an abstract model issue, and should be part of the DSP, not the DCAM. You basically need to have either a single URI as a value, a set of URIs as a value, or the ability to validate against a URI pattern. I have included in the requirements for Shapes the ability to indicate a URI pattern against which values can be compared. So you could have "http://id.loc.gov/" or "http://id.loc.gov/names/" etc. And the arrows from "non-literal value -> member of -> vocabulary encoding scheme" unnecessarily complicate the diagram and I don't see why you would need to know that a non-literal value is a member of a vocabulary encoding scheme as part of your abstract model.

  3. Another argument against the vocabulary encoding scheme here, and for putting it in the DSP, is a lack of a parallel for literal values. You may want to have a set of string values that you validate against ("red, blue, green"). You could even want to apply some regex-type validation to those (e.g. word stemming). Oddly, multiple value strings are not allowed. Even more oddly, multiple language strings are allowed for each plain value string, which, AFAIK, is an error.

If you remove these "oddities" you basically get the elements of RDF, minus the concept of classes. (Interestingly missing from the DCAM.) DCAM adds the record structure, but that could be considered the intro to the DSP. So my preference is to assume RDF/S, and apply that to the DSP to define a record. I just don't see a whole lot of value in the DCAM as it is today, other than as an introduction to the DSP.

@no-reply
Author

Thanks Karen.

I agree totally about encoding schemes and the like. These are overly prescriptive as part of an abstract model, not well adopted (most systems apply these constraints on the property-level, not the value), and arguably not good practice for many uses.

My hope for DCAM in general, and part of my reason for undertaking a close reading of the literature about it, is that it can provide the basis for reading common non-RDF metadata as RDF equivalent. The language and the class diagrams, I agree, are a mess; and a lot of key concepts are left to implicature or are just too vague to be useful. (For instance: though there's much talk about the "described resource" and "property-value pairs", I can't see it stated anywhere that properties are to be understood as properties of the "described resource".)

Your update improves things. I think I would add:

  • A "description set" is equivalent to an RDF "Graph".
  • A "description" is equivalent to an RDF "Graph", with the constraint that all of its Statements must have the same subject (the "described resource").
    • In an RDF context, we can (but don't have to) dispense with this altogether. Still, somehow it feels to me like the main value in DCAM.
  • A DCAM "statement" is equivalent to an RDF "Statement", with an implied subject of the "described resource".
    • I may be misunderstanding you, but I think this fixes the problem of multiple value strings.
    • I'm not entirely clear on whether a 'description' without a "resource URI" maps to a blank node. Formally, bnodes assert a resource while DCAM 'descriptions' seem to want to assert this resource. But this problem extends far beyond DCAM.

I think this still leaves pretty much all of my questions above unanswered, but at least the mapping between the two models is clarified as far as it goes. :)

@mjsuhonos

Hi all -- I've added a gist with some of my own (parallel/related/semi-formed) thoughts related to this document:

https://gist.github.com/mjsuhonos/9d4922cf85627ed909e2

I really do think the terminology is important, and I try to take a stab at aligning some of it (i.e., what Karen calls a "document" I basically equate to an "object", plus its direct neighbours, which can be considered an RDF graph).

The main issue I have with treating RDF graphs as immutable, atomic units is when they contain entirely unrelated objects or even indirected (neighbour-of-a-neighbour-of-a-neighbour, etc) objects. Sure, this is valid RDF, but it's really hard to model in an object-document sense, and seems inherently fragile. I don't think we would be likely to see a traditional "record" contain this degree of indirection.

Anyway, I'm going to re-read through this thread a few more times and will try to add any (hopefully marginally useful) thoughts if/as they materialize.
