@joeltg
Last active October 27, 2020 15:16

Schemas and Collections

Overview

A collection is a general and portable format for managing and publishing datasets. A collection contains a schema and a set of assertions. Assertions are RDF datasets, and the schema describes the shape of the data in the graphs of the assertions.

Schemas

Relationship to RDF

The schema language is designed around the RDF data model, and instances of schemas are represented as RDF graphs. However, it is almost certainly not possible to write a schema for an arbitrary pre-existing RDF graph. This is due to the many specific representation choices that we are forced to make when modelling data with RDF (which will become more clear when we cover the instantiated RDF format in detail).

The schema language is very specific and very rigid. As a result, we prefer to say that a graph instantiates a particular schema (emphasizing the schema as the fixed and most significant object), as opposed to saying that a graph validates a schema (emphasizing the graph as something that the schema describes).

More generally, we use RDF as a serialization format for schema instances, but not as a means of interoperating with the broader semantic web.

Data model

interface Schema {
  import: { url: string; version: string }[]
  namespace: string
  classes: { [label: string]: ClassDefinition }
}

type ClassDefinition = { [key: string]: Datatype | PropertyDefinition }

type Datatype =
  | "string"
  | "integer"
  | "double"
  | "boolean"
  | "dateTime"
  | "date"

type Cardinality = "required" | "optional" | "any"

type PropertyDefinition =
  | { kind: "uri"; cardinality?: Cardinality }
  | { kind: "literal"; datatype: Datatype; cardinality?: Cardinality }
  | { kind: "reference"; label: string; cardinality?: Cardinality }

A schema defines a set of classes. Each class has a unique URI label, and a set of zero or more properties. Each property 1) has a unique URI key, 2) has one of three possible cardinalities, and 3) is one of three possible property kinds.

(We deliberately avoid using the overloaded word "type". Instead, we use "class" for the elements of a schema, "kind" for the different abstract types of properties, and "datatype" for the different types of primitive values.)

The possible cardinalities are required, optional, and any. Intuitively, these mean that instances of the class must have exactly one value for the property, zero or one values for the property, or zero or more values for the property, respectively.
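
As a rough sketch of that intuition (the function name `satisfiesCardinality` is made up for illustration and is not part of any schema tooling), the three cardinalities can be read as constraints on the number of values a class instance has for a property:

```typescript
type Cardinality = "required" | "optional" | "any"

// Returns true when having `count` values for a property is allowed
// under the given cardinality. `count` is assumed to be a
// non-negative integer.
function satisfiesCardinality(cardinality: Cardinality, count: number): boolean {
  switch (cardinality) {
    case "required":
      return count === 1 // exactly one value
    case "optional":
      return count === 0 || count === 1 // zero or one values
    case "any":
      return count >= 0 // zero or more values
  }
}
```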

The kinds of properties are reference properties, URI properties, and literal properties.

  • Reference properties are configured with a URI label pointing to another class. A class can have a reference property pointing to itself, or a different class defined in the same schema. Intuitively, reference properties are analogous to foreign keys or typed pointers: a value for a reference property is an instance of the referenced class.
  • URI properties do not take any additional configuration. A value for a URI property is an RDF IRI.
  • Literal properties are configured with a datatype from the XML Schema built-in datatypes. The value of a literal property configured with datatype x is an RDF literal with datatype x. This essentially just gives us formal specs for most common datatypes like string, boolean, integer, double, dateTime, etc.

And that's it! There are classes, which have properties, which are either references, URIs, or literals, and each property has a cardinality. That's the abstract schema language.

URIs vs literals

Both URI and literal properties look like "primitive types", so it may appear confusing to have them as different property kinds.

The difference comes from the RDF data model, which distinguishes IRIs (ie URIs) as a different type of term than "literals". This means that values of URI properties are represented as IRIs in RDF, while the values of literal properties are represented as RDF literals.

More prescriptively, we think that the difference can be a useful indicator of scope and intent. URI properties are not intended for values that are arbitrary URLs, like website links. Instead, they should be thought of as identifiers with global scope. URI properties are appropriate for any kind of value that is primarily used to join, link, or co-identify values across datasets.

For example, tags or enum values are more idiomatically modelled with URI properties (e.g. using URNs with a custom namespace) than with string literals. URI properties are also a good fit for external unique identifiers like ISBN numbers or DOIs.

Following these idioms makes for more expressive modelling and allows tools to better optimize for joins, etc.

Management

Schemas are versioned, managed, and published entirely separately from collections.

Schemas are written in a human-friendly TOML format, which includes a way to import other TOML schemas by URL. Whenever a version of a schema is published, its imports are recursively resolved and reduced to a compiled schema, represented as an RDF graph instantiating a master "schema schema". This compiled schema is what gets included in collections - effectively vendoring all of the schema's dependencies.

TOML syntax

TOML re-uses the JSON data model (with the exception of null), so every TOML document can be parsed into a JSON object (a TOML document is always an object at the root level). We use TOML to write schemas because it supports comments and is generally more human-friendly than working with JSON directly.

It'd be good to read the TOML spec before going any further.

Format version

A schema starts with a top-level string property called format. The format string is a fixed URL that specifies the version of our TOML schema format that is used.

# The actual `format` version URL is not decided yet,
# but it will look something like this
format = "http://underlay.org/schema/v1.0"

Namespace

After the format version there is another top-level string property called namespace. The namespace string is a URI that a) has a path component ending in a trailing slash, and b) has no query or fragment component.

namespace = "http://example.com/"

The namespace string is a prefix that makes defining class labels and property keys (which are both full absolute URIs) more concise. It's not a proper value associated with the schema itself - it just enables some syntactic sugar in class definitions that we'll see later.
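
A minimal sketch of checking the two namespace constraints, assuming the standard WHATWG `URL` parser is an acceptable stand-in for "parses as a URI". The function name `isValidNamespace` is hypothetical. Note that the `URL` parser normalizes its input (for example, an empty path becomes `/`), so this is looser than a strict syntactic check:

```typescript
// Checks the two constraints on a namespace string:
// (a) the path component ends in a trailing slash
// (b) there is no query or fragment component
function isValidNamespace(namespace: string): boolean {
  let url: URL
  try {
    url = new URL(namespace)
  } catch {
    return false // not an absolute URI that the URL parser accepts
  }
  const endsInSlash = url.pathname.slice(-1) === "/"
  return endsInSlash && url.search === "" && url.hash === ""
}
```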

Imports

After the format version and namespace prefix, a schema has an array of imports:

[[import]]
url = "http://r1.underlay.org/schemas/baylor/snap"
version = "4.2.0"

[[import]]
url = "http://r1.underlay.org/schemas/baylor/crackle"
version = "0.8.1"

[[import]]
url = "http://r1.underlay.org/schemas/emerson/pop"
version = "45.0.0"

(the double-bracket syntax defines elements of an array in TOML)

Each import has a string url and an exact semver string version.

All that importing does is let you point to the imported classes in reference properties. You can’t "extend" classes or anything.

If a class in a schema is defined with the same label as a class in an imported schema, the imported one is just ignored. Similarly, the imports themselves overwrite each other in order if there are conflicts (it’s important that .import is an array). But labels are absolute URIs, so collisions should be relatively rare as long as people label their classes responsibly.
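
Here is one way to picture that precedence, as a sketch only. It assumes "overwrite each other in order" means that later imports win over earlier ones, which the text above does not spell out. `mergeClasses` is a hypothetical helper, and classes are reduced to a plain label-to-definition map:

```typescript
// Merges class maps so that later imports overwrite earlier imports,
// and locally defined classes overwrite anything imported.
function mergeClasses<T>(
  imports: { [label: string]: T }[],
  local: { [label: string]: T }
): { [label: string]: T } {
  const merged: { [label: string]: T } = {}
  for (const imported of imports) {
    for (const label of Object.keys(imported)) {
      merged[label] = imported[label]
    }
  }
  // Local definitions are applied last, so they always win.
  for (const label of Object.keys(local)) {
    merged[label] = local[label]
  }
  return merged
}
```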

Class definitions

After the import array there is a table (aka object) called classes. The keys of the table are label URIs, and the values are class definitions. Class definitions are objects whose keys are property URIs, and values are property definitions.

[classes]

[classes.Skyscraper]

# Classes themselves are objects with zero or more property keys.
# This one has zero.

Each key of this table - Skyscraper in this example - is the URI label of the class. If the key is a TOML bare key (ie the key validates /^[A-Za-z0-9_\-\/]+$/), then it is appended to the namespace URI to get the label URI. Otherwise, it is parsed as an absolute URI directly. Keys must either validate the bare key pattern or parse as valid absolute URIs.

# Given this namespace...
namespace = "http://example.com/"

[classes]

# ... these two class declarations are equivalent:
[classes.Skyscraper]
[classes."http://example.com/Skyscraper"]

# both define an empty class with the label "http://example.com/Skyscraper"

An equivalent (but discouraged) way of writing this would be to use TOML inline tables (object literals):

namespace = "http://example.com/"
classes = { Skyscraper = { } }
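
The class label resolution rule can be sketched in a few lines. This assumes the bare key pattern is meant as a character class, and uses the WHATWG `URL` parser as a stand-in for "parses as a valid absolute URI". `resolveLabel` is a hypothetical name:

```typescript
// Keys matching this pattern are resolved relative to the namespace.
const bareKey = /^[A-Za-z0-9_\-\/]+$/

// Turns a key from the `classes` table into an absolute label URI.
function resolveLabel(namespace: string, key: string): string {
  if (bareKey.test(key)) {
    return namespace + key
  }
  // Otherwise the key has to be an absolute URI already.
  // `new URL` throws a TypeError if it is not.
  new URL(key)
  return key
}
```

With the namespace `http://example.com/`, both `Skyscraper` and `http://example.com/Skyscraper` resolve to the same label, and a key that is neither a bare key nor an absolute URI is rejected with an error.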

Property definitions

Class definitions are tables whose keys are property URIs and values are property definitions. Let's make a new class, this time with some properties:

format = "..."
namespace = "http://example.com/"

# (the import array is optional)

[classes]

[classes.Person]

[classes.Person.name]
kind = "literal"
datatype = "string"
cardinality = "required"

[classes.Person.email]
kind = "uri"
cardinality = "optional"

[classes.Person.knows]
kind = "reference"
label = "Person"
cardinality = "any"

The keys of each class definition - in this case, name, email, and knows - are handled in a similar way to the class keys. If a key is a TOML bare key, then the key is appended as a path segment to the class label URI. Otherwise, it is parsed as an absolute URI directly. For example, the property definition classes.Person.name translates into a property key of http://example.com/Person/name. We could have defined the same property more explicitly like this:

[classes.Person."http://example.com/Person/name"]
# ...

or even like this:

[classes."http://example.com/Person"."http://example.com/Person/name"]
# ...

... but these are so verbose that it's best to work within a single namespace whenever possible.
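
For comparison with class labels, here is a sketch of property key resolution. The only difference is that bare keys are joined to the class label URI (not the namespace) with a `/`. The name `resolvePropertyKey` is hypothetical, and the pattern assumes the bare key rule is a character class:

```typescript
const barePropertyKey = /^[A-Za-z0-9_\-\/]+$/

// Turns a key from a class definition table into an absolute property key URI.
// `classLabel` is the already-resolved absolute label URI of the class.
function resolvePropertyKey(classLabel: string, key: string): string {
  if (barePropertyKey.test(key)) {
    // Bare keys become a new path segment under the class label.
    return classLabel + "/" + key
  }
  // Anything else is taken to be an absolute URI; `new URL` throws if not.
  new URL(key)
  return key
}
```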

Every property definition has a string kind that is one of "reference", "uri", or "literal". Reference kinds require an additional string label, and literal kinds require an additional string datatype.

The label of a reference property is interpreted in the same way as class definition labels: if the value matches /^[A-Za-z0-9_\-\/]+$/, then it is appended to the schema namespace to derive the absolute URI; otherwise it is parsed as a URI directly. In either case, there must be a class defined with the label URI in the schema, or in one of the recursively imported schemas.

The datatype of a literal property has to be the name of one of the XML Schema built-in datatypes, like "string", "integer", "dayTimeDuration", etc. Another way of thinking of this is that all literal datatypes are parsed relative to the implicit namespace http://www.w3.org/2001/XMLSchema#.
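
In other words, expanding a datatype name is plain string concatenation. A tiny sketch (`datatypeURI` is a hypothetical name, and it does not check that the name is actually one of the built-in datatypes):

```typescript
const XSD_NAMESPACE = "http://www.w3.org/2001/XMLSchema#"

// Expands a datatype name like "dateTime" into its full datatype URI.
function datatypeURI(datatype: string): string {
  return XSD_NAMESPACE + datatype
}
```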

Shorthand property definitions

There is a shorthand syntax for required literal properties, where the entire property definition object can be replaced by the datatype string.

For example, this property

[classes.Person]
[classes.Person.name]
kind = "literal"
datatype = "string"
cardinality = "required"

can be written just as

[classes.Person]
name = "string"

Again, this only applies to literal properties with required cardinality.
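
A parser could normalize the shorthand away before doing anything else. The sketch below assumes the parsed TOML value is either a string or an object shaped like `PropertyDefinition` from the data model section (with `datatype` loosened to `string` here). `expandProperty` is a hypothetical name:

```typescript
type Cardinality = "required" | "optional" | "any"

type PropertyDefinition =
  | { kind: "uri"; cardinality?: Cardinality }
  | { kind: "literal"; datatype: string; cardinality?: Cardinality }
  | { kind: "reference"; label: string; cardinality?: Cardinality }

// Expands the shorthand form (a bare datatype string) into a full
// required literal property. Full definitions pass through unchanged.
function expandProperty(value: string | PropertyDefinition): PropertyDefinition {
  if (typeof value === "string") {
    return { kind: "literal", datatype: value, cardinality: "required" }
  }
  return value
}
```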

Instances

Instances of schemas are represented as RDF graphs. A schema instance has zero or more class instances of each class defined in the schema.

Class instances

A class instance is represented by a blank node in the graph. All class instances are tagged with their label URI using the rdf:type predicate:

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Skyscraper> .

This single-triple graph is a valid instance of the first example schema, since the Skyscraper class had no properties.

Property values are represented in different ways depending on their declared cardinality.

required property instances

An instance of a class with required properties must have exactly one triple for each required property, with the class instance blank node as the subject and the property key URI as the predicate.

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b0 <http://example.com/Person/name> "John Doe" .

optional property instances

The blank node _:b0 in that last graph is declared to be an instance of class http://example.com/Person, and it has a triple for the property http://example.com/Person/name with an appropriate object (an RDF literal with datatype xsd:string in this case). However, it is not a valid instance of the Person class as defined in the example schema:

[classes.Person]

[classes.Person.name]
kind = "literal"
datatype = "string"
cardinality = "required"

[classes.Person.email]
kind = "uri"
cardinality = "optional"

[classes.Person.knows]
kind = "reference"
label = "Person"
cardinality = "any"

Even though the email property has optional cardinality, that "option" must be explicitly instantiated in the RDF graph.

For optional properties, regardless of whether the instance has a value for the property or not, the class instance must have exactly one triple with the instance blank node as the subject, the property key as the predicate, and a new blank node as the object. This new blank node is the subject of exactly one additional triple: either with the predicate http://underlay.org/ns/none and another new blank node as the object, or with the predicate http://underlay.org/ns/some and the property value as the object.

So if our example instance did have an email value mailto:john-doe@example.com, we would represent that as:

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b0 <http://example.com/Person/name> "John Doe" .
_:b0 <http://example.com/Person/email> _:b1 .
_:b1 <http://underlay.org/ns/some> <mailto:john-doe@example.com> .

Alternatively, if the instance had no value for the email property:

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b0 <http://example.com/Person/name> "John Doe" .
_:b0 <http://example.com/Person/email> _:b1 .
_:b1 <http://underlay.org/ns/none> _:b2 .
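
The some/none encoding is mechanical enough to sketch as a function. Everything here is illustrative: triples are plain string tuples in N-Triples term syntax, and `blankNodeFactory` and `optionalTriples` are hypothetical helpers, not part of any existing library:

```typescript
type Triple = [string, string, string]

const SOME = "<http://underlay.org/ns/some>"
const NONE = "<http://underlay.org/ns/none>"

// Hands out fresh blank node labels: _:b<start>, _:b<start + 1>, ...
function blankNodeFactory(start: number): () => string {
  let next = start
  return () => "_:b" + next++
}

// Produces the two triples that represent an optional property.
// `value` is an N-Triples term, or null if the property has no value.
function optionalTriples(
  subject: string,
  propertyKey: string,
  value: string | null,
  fresh: () => string
): Triple[] {
  const option = fresh() // the intermediate "option" blank node
  const inner: Triple =
    value === null ? [option, NONE, fresh()] : [option, SOME, value]
  return [[subject, "<" + propertyKey + ">", option], inner]
}
```

Calling `optionalTriples("_:b0", "http://example.com/Person/email", null, blankNodeFactory(1))` yields the last two triples of the "no value" graph above.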

any property instances

Properties with cardinality any are represented in a different way altogether.

Here, each value for the property is itself instantiated with its own blank node, with rdf:type of the property key URI. This "any property instance" has one triple with predicate http://underlay.org/ns/source and the class instance blank node as the object, and another triple with predicate http://underlay.org/ns/target and the property value as the object.

Here's a larger example where both John and Jane are http://example.com/Person instances, and there is a cardinality-any reference property knows linking John to Jane.

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b0 <http://example.com/Person/name> "John Doe" .
_:b0 <http://example.com/Person/email> _:b1 .
_:b1 <http://underlay.org/ns/none> _:b2 .

_:b3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b3 <http://example.com/Person/name> "Jane Doe" .
_:b3 <http://example.com/Person/email> _:b4 .
_:b4 <http://underlay.org/ns/none> _:b5 .

_:b6 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person/knows> .
_:b6 <http://underlay.org/ns/source> _:b0 .
_:b6 <http://underlay.org/ns/target> _:b3 .
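
As a sketch, each value of a cardinality-any property expands to exactly three triples. The helper names below (`anyPropertyTriples`, `toNTriples`) are hypothetical, and terms are plain strings in N-Triples syntax:

```typescript
type Triple = [string, string, string]

const RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
const SOURCE = "<http://underlay.org/ns/source>"
const TARGET = "<http://underlay.org/ns/target>"

// Produces the three triples for one value of a cardinality-any property.
// `node` is the fresh blank node for the property instance, `source` is the
// class instance blank node, and `target` is the property value.
function anyPropertyTriples(
  node: string,
  propertyKey: string,
  source: string,
  target: string
): Triple[] {
  return [
    [node, RDF_TYPE, "<" + propertyKey + ">"],
    [node, SOURCE, source],
    [node, TARGET, target],
  ]
}

// Serializes triples as N-Triples lines.
function toNTriples(triples: Triple[]): string {
  return triples.map((triple) => triple.join(" ") + " .").join("\n")
}
```

Passing `"_:b6"`, `"http://example.com/Person/knows"`, `"_:b0"`, and `"_:b3"` reproduces the last three lines of the example graph.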

Assertions and provenance

A "schema instance" is an RDF graph, ie a set of RDF triples. An assertion is a slightly more complicated structure, since it uses RDF named graphs to annotate data with provenance.

An assertion actually instantiates two different schemas at once - a "data schema" and a "provenance schema". Each collection specifies both, along with a provenance key, which is one of the label URIs defined in the provenance schema.

interface SchemaReference {
	url: string
	version: string
}

interface Collection {
	schema: SchemaReference
	provenanceSchema: SchemaReference
	provenanceKey: string
	// ...
}

An assertion in a collection is an RDF dataset that satisfies the following constraints:

  • every named graph in the dataset is an instance of the collection schema
  • every named graph has a blank graph name
  • the default graph of the dataset is an instance of the collection provenance schema
  • the graph name of every named graph appears in the default graph as an instance of the class indicated by the collection provenance key

What does this look like?

The simplest way to adapt our example schema instance into an assertion is to use a trivial provenance schema with just one empty class:

namespace = "http://example.com/"
[classes.Graph]

and then a collection with a provenanceKey of "http://example.com/Graph" might have an assertion like this:

_:g0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Graph> .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> _:g0 .
_:b0 <http://example.com/Person/name> "John Doe" _:g0 .
_:b0 <http://example.com/Person/email> _:b1 _:g0 .
_:b1 <http://underlay.org/ns/none> _:b2 _:g0 .

Here, all that's happening is that we put our previous schema instance into a named graph _:g0, and made a default graph where blank node _:g0 is an instance of the (empty) class Graph.
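
The two constraints about graph names are easy to check on their own. The sketch below covers only those two. It does not check that each graph instantiates its schema. It assumes a simplified `Quad` shape where terms are plain strings (blank nodes written as `_:label`, IRIs written without angle brackets) and `graph: null` marks the default graph. `checkGraphNames` is a hypothetical name:

```typescript
interface Quad {
  subject: string
  predicate: string
  object: string
  graph: string | null // null means the default graph
}

const RDF_TYPE_IRI = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

// Returns a list of problems: named graphs whose name is not a blank node,
// or whose name is not typed with the provenance key in the default graph.
function checkGraphNames(quads: Quad[], provenanceKey: string): string[] {
  const errors: string[] = []
  const seen: { [graphName: string]: boolean } = {}
  for (const quad of quads) {
    const graphName = quad.graph
    if (graphName === null || seen[graphName]) continue
    seen[graphName] = true
    if (graphName.slice(0, 2) !== "_:") {
      errors.push("graph name is not a blank node: " + graphName)
      continue
    }
    const typed = quads.some(
      (q) =>
        q.graph === null &&
        q.subject === graphName &&
        q.predicate === RDF_TYPE_IRI &&
        q.object === provenanceKey
    )
    if (!typed) {
      errors.push("graph name is not typed in the default graph: " + graphName)
    }
  }
  return errors
}
```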

A slightly more interesting example might use a provenance schema like this:

[classes."http://www.w3.org/ns/prov#Entity"]
"http://www.w3.org/ns/prov#generatedAtTime" = "dateTime"

(this schema defines one class with label http://www.w3.org/ns/prov#Entity with one required literal property with key http://www.w3.org/ns/prov#generatedAtTime and datatype xsd:dateTime)

Then, in a collection with provenance key http://www.w3.org/ns/prov#Entity, a more complex assertion might look like this:

_:g0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> .
_:g0 <http://www.w3.org/ns/prov#generatedAtTime> "2020-10-20T18:22:36.537Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
_:g1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> .
_:g1 <http://www.w3.org/ns/prov#generatedAtTime> "2020-09-06T10:12:53.011Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> _:g0 .
_:b0 <http://example.com/Person/name> "John Doe" _:g0 .
_:b0 <http://example.com/Person/email> _:b1 _:g0 .
_:b1 <http://underlay.org/ns/none> _:b2 _:g0 .

_:b6 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person/knows> _:g0 .
_:b6 <http://underlay.org/ns/source> _:b0 _:g0 .
_:b6 <http://underlay.org/ns/target> _:b3 _:g0 .

_:b3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> _:g1 .
_:b3 <http://example.com/Person/name> "Jane Doe" _:g1 .
_:b3 <http://example.com/Person/email> _:b4 _:g1 .
_:b4 <http://underlay.org/ns/none> _:b5 _:g1 .

Here, we have two named graphs _:g0 and _:g1. _:g0 has one instance of the Person class (John), and _:g1 has another (Jane). Both graphs appear in the default graph as instances of http://www.w3.org/ns/prov#Entity, with different values for the generatedAtTime property.

Note that even though they're in separate graphs, the knows property instance in _:g0 is able to reference _:b3, the blank node for the Jane class instance in graph _:g1. This is (crucially) allowed because blank nodes are shared across named graphs in RDF datasets - there's no actual sense in which one blank node is "in" one graph and not another.

Collections

Coming soon

Compiled schemas

Coming soon

@joeltg
Author

joeltg commented Oct 21, 2020

Thanks for the feedback! I added a typescript interface for schemas at the start of the data model section.

Ought they be unique?

No! The kind of "namespaces" I've been imagining are like "http://schema.org/": vocabulary prefixes that will be common to lots and lots of schemas.

If so, is it useful to have them be human readable?

I think so.

If not, any reason (on the R1 side) to not fix namespaces as r1.underlay.org/ns/?

I could tell that you hadn't understood what I was saying about namespaces because I knew you would ask exactly this once you did :-)

The only way to actually answer these questions is to address "schemas vs ontologies" more broadly, which I'll try to do in another section in the gist itself.


only allowing a single Class to exist within provenance schemas

We definitely don't want to do this. For example, we want to be able to say "this graph was attributed to this person", and that requires two classes at least (graph entity and person), maybe more (like reifying "attribution" if it has properties of its own).

why does a collection need to specify only a single provenance key?

Right - I thought a lot about exactly what you're proposing. Do we actually need to declare up-front a single class that all the named graphs will be?

I think we do need this. Suppose I have a standard W3C PROV provenance schema with prov:Entity, prov:Activity, and prov:Agent classes. Clearly my intent is to tell a provenance story involving entities, activities, and agents, where the named graphs are (some of) the entities. If we let named graphs instantiate any class in the provenance schema, someone could make a valid assertion where a named graph is a prov:Activity in the default graph. What would that even mean? How would anyone even interpret that? Given the semantics of the provenance schema this clearly shouldn't be possible, and I think that's going to be the case in every (?) provenance schema. Whenever you use a provenance schema, you're going to know which class you want to associate with the named graphs.

This feels a lot clearer in my mind than I'm able to get across. I'm strongly of the opinion that all named graphs are the same singular kind of thing - they're "digital objects" or "containers" or "entities" or "assertions" (in a colloquial sense). People can choose to model these named graphs in different ways, depending on what kind of data they have about them, but they're always talking about the same underlying kind of thing. A named graph can't "be" an activity or an agent because a named graph is a named graph; it doesn't represent anything other than itself-as-a-digital-object. I hope this makes sense 🙃


If assertions instantiate a schema, won't they similarly need to instantiate a specific provenance schema?

Yes - in fact I try to avoid saying that "an assertion instantiates a schema". Graphs instantiate schemas. Maybe we could say that assertions instantiate collections (?) but you're right: assertions sort of instantiate both a specific data schema and a specific provenance schema, both specified by their collection.

What's the point of data schema reuse if the assertions would be invalid with another prov schema? If I'm understanding that correctly, it feels like it may be more ergonomic to have a single schema and a provKey that points at a specific label in that schema for provenance.

So far there have been three proposals for "what a collection links to" schema-wise:

  1. one schema
  2. two schemas, one called "provenance" and one called "data"
  3. an array of schemas that gets compiled into one schema

These are actually equivalent w/r/t data reuse. In case 1, an assertion from collection A will "fit" in collection B if they have the exact same schema. In case 2, it'll fit if they have the exact same provenance schema and the exact same data schema. And same for case 3 - it'll fit if the arrays are identical. The issue of assertions-matching-or-being-invalidated-in-other-collections isn't relevant to our choice of the three options. So what are the effective differences?

I was originally motivated to propose 3) because I anticipated people "mixing and matching" provenance and data schemas a lot. I still think this will be the case. Some group will manage a schema for a certain domain, and lots of people will want to make collections of it, but the situation/process by which they populate that schema will vary from user to user - meaning that people will absolutely want to "shop separately", and it'd be better to avoid a combinatorial explosion of registered schemas if we can help it.

My thought process in switching to 2) was mostly driven by "trying to model users' intents". We could have one schema, or an array of schemas, but I think the way that most people would end up thinking anyway is "okay I want a prov schema and a data schema". We could choose to force them to create their own schema for each combination, but it'd be a better fit to embrace the specific two-ness of it.

The other significant difference is that in 2), the data in the default graph is restricted to only the classes of the provenance schema, while in 1) or 3) any class from the named graphs could also show up. I think this is clearly not intended for most collections, and again we should embrace the separation.

This is definitely all a little hypothetical and of course our understanding of it all will change as it gets used.

There would certainly be schemas that only contain labels intended for provenance use

Yes, but I think those schemas will also be the data schemas of lots of collections - someone will make a proper collection just about the provenance relationships of different collections, etc. Thinking about W3C PROV in particular here.

@metasj

metasj commented Oct 21, 2020

Travis writes:

I found the typescript representation of these sentences from your fanfic super clarifying.

Seconded. Prov could similarly use a parallel representation in some existing format.

Schemas:

we use RDF as a serialization format for schema instances, but not as a means of interoperating with the broader semantic web.

In which case: worth mentioning here the means of interoperating with the broader semantic web and RDF-S?

The schema language is very specific and very rigid. As a result, we prefer to say that a graph instantiates a particular schema (emphasizing the schema as the fixed and most significant object), as opposed to saying that a graph validates a schema (emphasizing the graph as something that the schema describes).

Is the schema always most significant? Both assertions + schemas may need revision and cleaning up. If a source didn't come with a schema of this form already (at first, this will be all sources), any schema definition feels aspirational - descriptive rather than normative.

More generally, sources are often messy, internally inconsistent, not entirely living up to their own ideal form. How is that captured with rigid structures? What happens to assertions or provenance statements that don't have expected required elements? What do you say when a collection's graph does not perfectly instantiate its schema?

@joeltg
Author

joeltg commented Oct 21, 2020

In which case: worth mentioning here the means of interoperating with the broader semantic web and RDF-S?

There is none.

What happens to assertions or provenance statements that don't have expected required elements? What do you say when a collection's graph does not perfectly instantiate its schema?

I think we're well beyond these questions - we've decided that the Underlay is going to be strongly typed. Schemas are strict. There is no leeway. If a value is required, it's required. If a graph doesn't instantiate the right schema, then it doesn't parse and all of our tools will reject it outright. You can't parse an assertion apart from its exact schema.

@metasj

metasj commented Oct 21, 2020

Provenance:
(edit conflict w/ the previous post)
This may be clearer w/ an example that highlights where three kinds of provenance end up:

  1. prov of a collection as a whole
  2. prov of an individual named graph within it
  3. prov included as part of a data schema (compare choosing among quote (string + its aliases), attributed quote (string, author), and sourced quote (string, author, date, context, {cite}))

@joeltg
Author

joeltg commented Oct 21, 2020

The only kind of provenance that I've ever tried to represent is the second one that you list. Whenever I've said "provenance", I've meant exactly that one. I could see us potentially approaching the first one in the future. I don't understand what you mean by the third one.

@metasj

metasj commented Oct 27, 2020

By the third one I mean, schemas commonly include sources + times + context for data in the schema, in the properties of the schema. If you chose instead a TimestampedPerson schema that has a required generatedAtTime property, do you still want a similar property in your provenance?

@joeltg
Author

joeltg commented Oct 27, 2020

I think that should be a strong anti-pattern. People are definitely used to thinking this way - "I have this timestamp, so where should I put it? Inside the Person object I guess" - but it's our job to nudge people toward thinking more ontologically: "Is this actually a property of the person, or is it a property of the data about the person?"
