@joeltg
Last active October 27, 2020 15:16

Schemas and Collections

Overview

A collection is a general and portable format for managing and publishing datasets. A collection contains a schema and a set of assertions. Assertions are RDF datasets, and the schema describes the shape of the data in the graphs of the assertions.

Schemas

Relationship to RDF

The schema language is designed around the RDF data model, and instances of schemas are represented as RDF graphs. However, it is almost certainly not possible to write a schema for an arbitrary pre-existing RDF graph. This is due to the many specific representation choices that we are forced to make when modelling data with RDF (which will become more clear when we cover the instantiated RDF format in detail).

The schema language is very specific and very rigid. As a result, we prefer to say that a graph instantiates a particular schema (emphasizing the schema as the fixed and most significant object), as opposed to saying that a graph validates a schema (emphasizing the graph as something that the schema describes).

More generally, we use RDF as a serialization format for schema instances, but not as a means of interoperating with the broader semantic web.

Data model

interface Schema {
  import: { url: string; version: string }[]
  namespace: string
  classes: { [label: string]: ClassDefinition }
}

type ClassDefinition = { [key: string]: Datatype | PropertyDefinition }

type Datatype =
  | "string"
  | "integer"
  | "double"
  | "boolean"
  | "dateTime"
  | "date"

type Cardinality = "required" | "optional" | "any"

type PropertyDefinition =
  | { kind: "uri"; cardinality?: Cardinality }
  | { kind: "literal"; datatype: Datatype; cardinality?: Cardinality }
  | { kind: "reference"; label: string; cardinality?: Cardinality }

A schema defines a set of classes. Each class has a unique URI label, and a set of zero or more properties. Each property 1) has a unique URI key, 2) has one of three possible cardinalities, and 3) is one of three possible property kinds.

(We deliberately avoid using the overloaded word "type". Instead, we use "class" for the elements of a schema, "kind" for the different abstract types of properties, and "datatype" for the different types of primitive values.)

The possible cardinalities are required, optional, and any. Intuitively, these mean that instances of the class must have exactly one value for the property, zero or one values for the property, or zero or more values for the property, respectively.
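
As a minimal sketch of what the three cardinalities mean in terms of value counts (the function name allowsCount is ours and not part of the schema language):

```typescript
type Cardinality = "required" | "optional" | "any"

// Returns true if a property with the given cardinality
// may have exactly `count` values on a single class instance.
function allowsCount(cardinality: Cardinality, count: number): boolean {
  // Counts must be non-negative whole numbers.
  if (count < 0 || Math.floor(count) !== count) return false
  if (cardinality === "required") return count === 1
  if (cardinality === "optional") return count <= 1
  return true // "any": zero or more values
}
```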

The kinds of properties are reference properties, URI properties, and literal properties.

  • Reference properties are configured with a URI label pointing to another class. A class can have a reference property pointing to itself, or a different class defined in the same schema. Intuitively, reference properties are analogous to foreign keys or typed pointers: a value for a reference property is an instance of the referenced class.
  • URI properties do not take any additional configuration. A value for a URI property is an RDF IRI.
  • Literal properties are configured with a datatype from the XML Schema built-in datatypes. The value of a literal property configured with datatype x is an RDF literal with datatype x. This essentially just gives us formal specs for most common datatypes like string, boolean, integer, double, dateTime, etc.

And that's it! There are classes, which have properties, which are either references, URIs, or literals, and each property has a cardinality. That's the abstract schema language.

URIs vs literals

Both URI and literal properties look like "primitive types", so it may appear confusing to have them as different property kinds.

The difference comes from the RDF data model, which distinguishes IRIs (ie URIs) as a different type of term than "literals". This means that values of URI properties are represented as IRIs in RDF, while the values of literal properties are represented as RDF literals.

More prescriptively, we think that the difference can be a useful indicator of scope and intent. URI properties are not intended for values that are arbitrary URLs, like website links. Instead, they should be thought of as identifiers with global scope. URI properties are appropriate for any kind of value that is primarily used to join, link, or co-identify values across datasets.

For example, tags or enum values are more idiomatically modelled with URI properties (e.g. using URNs with a custom namespace) than with string literals. URI properties are also a good fit for external unique identifiers like ISBN numbers or DOIs.

Following these idioms makes for more expressive modelling and allows tools to better optimize for joins, etc.

Management

Schemas are versioned, managed, and published entirely separately from collections.

Schemas are written in a human-friendly TOML format, which includes a way to import other TOML schemas by URL. Whenever a version of a schema is published, its imports are recursively resolved and reduced to a compiled schema, represented as an RDF graph instantiating a master "schema schema". This compiled schema is what gets included in collections - effectively vendoring all of the schema's dependencies.

TOML syntax

TOML re-uses the JSON data model (with the exception of null), so every TOML document can be parsed into a JSON object (a TOML document is always an object at the root level). We use TOML to write schemas because it supports comments and is generally more human-friendly than working with JSON directly.

It'd be good to read the TOML spec before going any further.

Format version

A schema starts with a top-level string property called format. The format string is a fixed URL that specifies the version of our TOML schema format that is used.

# The actual `format` version URL is not decided yet,
# but it will look something like this
format = "http://underlay.org/schema/v1.0"

Namespace

After the format version there is another top-level string property called namespace. The namespace string is a URI that a) has a path component ending in a trailing slash, and b) has no query or fragment component.

namespace = "http://example.com/"

The namespace string is a prefix that makes defining class labels and property keys (which are both full absolute URIs) more concise. It's not a proper value associated with the schema itself - it just enables some syntactic sugar in class definitions that we'll see later.
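
The two namespace constraints can be checked with plain string operations. This is only a sketch: the function name isValidNamespace is hypothetical, and the regular expression checks for a URI scheme and an optional authority rather than validating full URI syntax.

```typescript
// Returns true if `namespace` looks like a URI whose path ends in "/"
// and that has no query or fragment component.
function isValidNamespace(namespace: string): boolean {
  // Reject any query ("?") or fragment ("#") component.
  if (namespace.indexOf("?") !== -1 || namespace.indexOf("#") !== -1) return false
  // Split into scheme, optional "//authority", and path.
  const match = /^[A-Za-z][A-Za-z0-9+.\-]*:(\/\/[^\/]*)?(.*)$/.exec(namespace)
  if (match === null) return false
  const path = match[2] || ""
  // The path itself (not just the string) must end in a slash.
  return path.charAt(path.length - 1) === "/"
}
```

With this check, "http://example.com/" is accepted, while "http://example.com" (empty path) and "http://example.com/ns#" (fragment) are rejected.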

Imports

After the format version and namespace prefix, a schema has an array of imports:

[[import]]
url = "http://r1.underlay.org/schemas/baylor/snap"
version = "4.2.0"

[[import]]
url = "http://r1.underlay.org/schemas/baylor/crackle"
version = "0.8.1"

[[import]]
url = "http://r1.underlay.org/schemas/emerson/pop"
version = "45.0.0"

(the double-bracket syntax defines elements of an array in TOML)

Each import has a string url and an exact semver string version.

All that importing does is let you point to the imported classes in reference properties. You can’t "extend" classes or anything.

If a class in a schema is defined with the same label as a class in an imported schema, the imported one is just ignored. Similarly, the imports themselves overwrite each other in order if there are conflicts (it’s important that .import is an array). But labels are absolute URIs, so collisions should be relatively rare as long as people label their classes responsibly.
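
The precedence rules can be sketched as a simple merge. The function name mergeClasses is hypothetical, and we are assuming that "overwrite each other in order" means that later imports take precedence over earlier ones.

```typescript
// Merges class tables keyed by absolute label URI.
// Imports are applied in array order, so a later import overwrites
// an earlier one; the schema's own classes are applied last and
// therefore win over anything imported.
function mergeClasses<T>(
  imported: Record<string, T>[],
  local: Record<string, T>
): Record<string, T> {
  const merged: Record<string, T> = {}
  for (const classes of imported) {
    Object.assign(merged, classes)
  }
  Object.assign(merged, local)
  return merged
}
```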

Class definitions

After the import array there is a table (aka object) called classes. The keys of the table are label URIs, and the values are class definitions. Class definitions are objects whose keys are property URIs, and values are property definitions.

[classes]

[classes.Skyscraper]

# Classes themselves are objects with zero or more property keys.
# This one has zero.

Each key of this table - Skyscraper in this example - is the URI label of the class. If the key is a TOML bare key (ie the key matches /^[A-Za-z0-9_\-\/]+$/), then it is appended to the namespace URI to get the label URI. Otherwise, it is parsed as an absolute URI directly. Keys must either match the bare key pattern or parse as valid absolute URIs.

# Given this namespace...
namespace = "http://example.com/"

[classes]

# ... these two class declarations are equivalent:
[classes.Skyscraper]
[classes."http://example.com/Skyscraper"]

# both define an empty class with the label "http://example.com/Skyscraper"

An equivalent (but discouraged) way of writing this would be to use TOML inline tables:

namespace = "http://example.com/"
classes = { Skyscraper = { } }

Property definitions

Class definitions are tables whose keys are property URIs and values are property definitions. Let's make a new class, this time with some properties:

format = "..."
namespace = "http://example.com/"

# (the import array is optional)

[classes]

[classes.Person]

[classes.Person.name]
kind = "literal"
datatype = "string"
cardinality = "required"

[classes.Person.email]
kind = "uri"
cardinality = "optional"

[classes.Person.knows]
kind = "reference"
label = "Person"
cardinality = "any"

The keys of each class definition - in this case, name, email, and knows - are handled in a similar way to the class keys. If a key is a TOML bare key, then the key is appended as a path segment to the class label URI. Otherwise, it is parsed as an absolute URI directly. For example, the property definition classes.Person.name translates into a property key of http://example.com/Person/name. We could have defined the same property more explicitly like this:

[classes.Person."http://example.com/Person/name"]
# ...

or even like this:

[classes."http://example.com/Person"."http://example.com/Person/name"]
# ...

... but these are so verbose that it's best to work within a single namespace whenever possible.
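
The resolution rules for class keys and property keys can be sketched as two small functions. The names resolveLabel and resolvePropertyKey are hypothetical. We read the bare key pattern as a character class, and we assume the same pattern applies to property keys. The absolute-URI check is deliberately minimal: it only looks for a URI scheme.

```typescript
// Keys matching this pattern are resolved relative to a base URI.
const RELATIVE_KEY = /^[A-Za-z0-9_\-\/]+$/
// Minimal absolute-URI check: the string starts with a URI scheme.
const HAS_SCHEME = /^[A-Za-z][A-Za-z0-9+.\-]*:/

// Class keys are appended directly to the schema namespace.
function resolveLabel(namespace: string, key: string): string {
  if (RELATIVE_KEY.test(key)) return namespace + key
  if (HAS_SCHEME.test(key)) return key
  throw new Error("invalid class key: " + key)
}

// Property keys are appended to the class label as a new path segment.
function resolvePropertyKey(classLabel: string, key: string): string {
  if (RELATIVE_KEY.test(key)) return classLabel + "/" + key
  if (HAS_SCHEME.test(key)) return key
  throw new Error("invalid property key: " + key)
}

const personLabel = resolveLabel("http://example.com/", "Person")
const nameKey = resolvePropertyKey(personLabel, "name")
// personLabel is "http://example.com/Person"
// nameKey is "http://example.com/Person/name"
```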

Every property definition has a string kind that is one of "reference", "uri", or "literal". Reference kinds require an additional string label, and literal kinds require an additional string datatype.

The label of a reference property is interpreted in the same way as class definition labels: if the value matches /^[A-Za-z0-9_\-\/]+$/, then it is appended to the schema namespace to derive the absolute URI; otherwise it is parsed as a URI directly. In either case, there must be a class defined with the label URI in the schema, or in one of the recursively imported schemas.

The datatype of a literal property has to be the name of one of the XML Schema built-in datatypes, like "string", "integer", "dayTimeDuration", etc. Another way of thinking of this is that all literal datatypes are parsed relative to the implicit namespace http://www.w3.org/2001/XMLSchema#.
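
Both rules can be sketched together. The names datatypeIri and resolveReference are hypothetical, and knownLabels stands in for the set of class labels collected from the schema and its resolved imports, which is an assumption about how a compiler might track them.

```typescript
const XSD = "http://www.w3.org/2001/XMLSchema#"

// Literal datatype names are resolved against the implicit XSD namespace.
// This sketch does not check that the name is an actual built-in datatype.
function datatypeIri(datatype: string): string {
  return XSD + datatype
}

// Resolves a reference label to an absolute URI and checks that
// a class with that label exists, locally or via an import.
function resolveReference(
  namespace: string,
  label: string,
  knownLabels: Set<string>
): string {
  const uri = /^[A-Za-z0-9_\-\/]+$/.test(label) ? namespace + label : label
  if (!knownLabels.has(uri)) throw new Error("no class with label " + uri)
  return uri
}
```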

Shorthand property definitions

There is a shorthand syntax for required literal properties, where the entire property definition object can be replaced by the datatype string.

For example, this property

[classes.Person]
[classes.Person.name]
kind = "literal"
datatype = "string"
cardinality = "required"

can be written just as

[classes.Person]
name = "string"

Again, this only applies to literal properties with required cardinality.
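
A parser could normalize the shorthand with a small helper like the one below. This is a sketch: the function name expandProperty is hypothetical, and datatype is typed as a plain string here rather than the narrower Datatype union from the data model.

```typescript
type Cardinality = "required" | "optional" | "any"

type PropertyDefinition =
  | { kind: "uri"; cardinality?: Cardinality }
  | { kind: "literal"; datatype: string; cardinality?: Cardinality }
  | { kind: "reference"; label: string; cardinality?: Cardinality }

// Expands the datatype-string shorthand into a full property definition.
// Full definitions are returned unchanged.
function expandProperty(value: string | PropertyDefinition): PropertyDefinition {
  if (typeof value === "string") {
    return { kind: "literal", datatype: value, cardinality: "required" }
  }
  return value
}
```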

Instances

Instances of schemas are represented as RDF graphs. A schema instance has zero or more class instances of each class defined in the schema.

Class instances

A class instance is represented by a blank node in the graph. All class instances are tagged with their label URI using the rdf:type predicate:

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Skyscraper> .

This single-triple graph is a valid instance of the first example schema, since the Skyscraper class had no properties.

Property values are represented in different ways depending on their declared cardinality.

required property instances

An instance of a class with required properties must have exactly one triple for each required property, with the class instance blank node as the subject and the property key URI as the predicate.

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b0 <http://example.com/Person/name> "John Doe" .

optional property instances

The blank node _:b0 in that last graph is declared to be an instance of class http://example.com/Person, and it has a triple for the property http://example.com/Person/name with an appropriate object (an RDF literal with datatype xsd:string in this case). However, it is not a valid instance of the Person class as defined in the example schema:

[classes.Person]

[classes.Person.name]
kind = "literal"
datatype = "string"
cardinality = "required"

[classes.Person.email]
kind = "uri"
cardinality = "optional"

[classes.Person.knows]
kind = "reference"
label = "Person"
cardinality = "any"

Even though the email property has optional cardinality, that "option" must be explicitly instantiated in the RDF graph.

For optional properties, regardless of whether the instance has a value for the property or not, the class instance must have exactly one triple with the instance blank node as the subject, the property key as the predicate, and a new blank node as the object. This new blank node is the subject of exactly one additional triple: either with the predicate http://underlay.org/ns/none and another new blank node as the object, or with the predicate http://underlay.org/ns/some and the property value as the object.

So if our example instance did have an email value mailto:john-doe@example.com, we would represent that as:

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b0 <http://example.com/Person/name> "John Doe" .
_:b0 <http://example.com/Person/email> _:b1 .
_:b1 <http://underlay.org/ns/some> <mailto:john-doe@example.com> .

Alternatively, if the instance had no value for the email property:

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b0 <http://example.com/Person/name> "John Doe" .
_:b0 <http://example.com/Person/email> _:b1 .
_:b1 <http://underlay.org/ns/none> _:b2 .
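
The some/none encoding can be sketched as a function that emits the two triples as N-Triples lines. The helper names optionalTriples and blankNodes are hypothetical, and terms are passed around as already-serialized strings to keep the sketch short.

```typescript
const UNDERLAY_NS = "http://underlay.org/ns/"

// Returns a generator of fresh blank node labels: _:bN, _:bN+1, ...
function blankNodes(start: number): () => string {
  let next = start
  return () => `_:b${next++}`
}

// Emits the two triples for one optional property of `subject`.
// `value` is a serialized N-Triples term, or null if there is no value.
function optionalTriples(
  subject: string,
  key: string,
  value: string | null,
  fresh: () => string
): string[] {
  const option = fresh()
  const first = `${subject} <${key}> ${option} .`
  const second =
    value === null
      ? `${option} <${UNDERLAY_NS}none> ${fresh()} .`
      : `${option} <${UNDERLAY_NS}some> ${value} .`
  return [first, second]
}

const emailKey = "http://example.com/Person/email"
const withValue = optionalTriples("_:b0", emailKey, "<mailto:john-doe@example.com>", blankNodes(1))
const withoutValue = optionalTriples("_:b0", emailKey, null, blankNodes(1))
```

Here withValue and withoutValue hold the last two lines of the two example graphs above, respectively.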

any property instances

Properties with cardinality any are represented in a different way altogether.

Here, each value for the property is itself instantiated with its own blank node, with rdf:type of the property key URI. This "any property instance" has one triple with predicate http://underlay.org/ns/source and the class instance blank node as the object, and another triple with predicate http://underlay.org/ns/target and the property value as the object.

Here's a larger example where both John and Jane are http://example.com/Person instances, and there is a cardinality-any reference property knows linking John to Jane.

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b0 <http://example.com/Person/name> "John Doe" .
_:b0 <http://example.com/Person/email> _:b1 .
_:b1 <http://underlay.org/ns/none> _:b2 .

_:b3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> .
_:b3 <http://example.com/Person/name> "Jane Doe" .
_:b3 <http://example.com/Person/email> _:b4 .
_:b4 <http://underlay.org/ns/none> _:b5 .

_:b6 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person/knows> .
_:b6 <http://underlay.org/ns/source> _:b0 .
_:b6 <http://underlay.org/ns/target> _:b3 .
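
The source/target encoding can be sketched in the same style. The function name anyTriples is hypothetical, and terms are again passed as serialized strings. A property with no values emits no triples at all, so the function is simply called once per value.

```typescript
const RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
const UNDERLAY = "http://underlay.org/ns/"

// Emits the three triples for one value of a cardinality-any property.
// `node` is a fresh blank node label for the property instance itself;
// `source` is the class instance and `target` is the property value.
function anyTriples(
  node: string,
  key: string,
  source: string,
  target: string
): string[] {
  return [
    `${node} <${RDF_TYPE}> <${key}> .`,
    `${node} <${UNDERLAY}source> ${source} .`,
    `${node} <${UNDERLAY}target> ${target} .`,
  ]
}

const knowsTriples = anyTriples("_:b6", "http://example.com/Person/knows", "_:b0", "_:b3")
```

knowsTriples reproduces the last three lines of the example graph above.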

Assertions and provenance

A "schema instance" is an RDF graph, ie a set of RDF triples. An assertion is a slightly more complicated structure, since it uses RDF named graphs to annotate data with provenance.

An assertion actually instantiates two different schemas at once - a "data schema" and a "provenance schema". Each collection specifies both, along with a provenance key, which is one of the label URIs defined in the provenance schema.

interface SchemaReference {
	url: string
	version: string
}

interface Collection {
	schema: SchemaReference
	provenanceSchema: SchemaReference
	provenanceKey: string
	// ...
}

An assertion in a collection is an RDF dataset that satisfies the following constraints:

  • every named graph in the dataset is an instance of the collection schema
  • every named graph has a blank graph name
  • the default graph of the dataset is an instance of the collection provenance schema
  • the graph name of every named graph appears in the default graph as an instance of the class indicated by the collection provenance key
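
The two constraints on graph names (the second and fourth above) can be checked without knowing anything else about the schemas. The sketch below assumes a hypothetical Quad shape with terms stored as N-Quads strings. It does not check that each graph actually instantiates its schema, which is a separate and much larger problem.

```typescript
const RDF_TYPE_IRI = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

// Terms are N-Quads strings; `graph` is null for the default graph.
interface Quad {
  subject: string
  predicate: string
  object: string
  graph: string | null
}

// Returns a list of violations of the graph-name constraints.
function checkGraphNames(quads: Quad[], provenanceKey: string): string[] {
  // Subjects typed with the provenance key in the default graph.
  const typed = new Set<string>()
  // Names of all named graphs in the dataset.
  const names = new Set<string>()
  for (const q of quads) {
    if (q.graph !== null) {
      names.add(q.graph)
    } else if (q.predicate === `<${RDF_TYPE_IRI}>` && q.object === `<${provenanceKey}>`) {
      typed.add(q.subject)
    }
  }
  const errors: string[] = []
  names.forEach((name) => {
    if (name.indexOf("_:") !== 0) {
      errors.push(`graph name ${name} is not a blank node`)
    } else if (!typed.has(name)) {
      errors.push(`graph ${name} is not typed in the default graph`)
    }
  })
  return errors
}
```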

What does this look like?

The simplest way to adapt our example schema instance into an assertion is to use a trivial provenance schema with just one empty class:

namespace = "http://example.com/"
[classes.Graph]

and then a collection with a provenanceKey of "http://example.com/Graph" might have an assertion like this:

_:g0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Graph> .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> _:g0 .
_:b0 <http://example.com/Person/name> "John Doe" _:g0 .
_:b0 <http://example.com/Person/email> _:b1 _:g0 .
_:b1 <http://underlay.org/ns/none> _:b2 _:g0 .

Here, all that's happening is that we put our previous schema instance into a named graph _:g0, and made a default graph where blank node _:g0 is an instance of the (empty) class Graph.

A slightly more interesting example might use a provenance schema like this:

[classes."http://www.w3.org/ns/prov#Entity"]
"http://www.w3.org/ns/prov#generatedAtTime" = "dateTime"

(this schema defines one class with label http://www.w3.org/ns/prov#Entity with one required literal property with key http://www.w3.org/ns/prov#generatedAtTime and datatype xsd:dateTime)

Then, in a collection with provenance key http://www.w3.org/ns/prov#Entity, a more complex assertion might look like this:

_:g0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> .
_:g0 <http://www.w3.org/ns/prov#generatedAtTime> "2020-10-20T18:22:36.537Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
_:g1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/ns/prov#Entity> .
_:g1 <http://www.w3.org/ns/prov#generatedAtTime> "2020-09-06T10:12:53.011Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> _:g0 .
_:b0 <http://example.com/Person/name> "John Doe" _:g0 .
_:b0 <http://example.com/Person/email> _:b1 _:g0 .
_:b1 <http://underlay.org/ns/none> _:b2 _:g0 .

_:b6 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person/knows> _:g0 .
_:b6 <http://underlay.org/ns/source> _:b0 _:g0 .
_:b6 <http://underlay.org/ns/target> _:b3 _:g0 .

_:b3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.com/Person> _:g1 .
_:b3 <http://example.com/Person/name> "Jane Doe" _:g1 .
_:b3 <http://example.com/Person/email> _:b4 _:g1 .
_:b4 <http://underlay.org/ns/none> _:b5 _:g1 .

Here, we have two named graphs _:g0 and _:g1. _:g0 has one instance of the Person class (John), and _:g1 has another (Jane). Both graphs appear in the default graph as instances of http://www.w3.org/ns/prov#Entity, with different values for the generatedAtTime property.

Note that even though they're in separate graphs, the knows property instance in _:g0 is able to reference _:b3, the blank node for the Jane class instance in graph _:g1. This is (crucially) allowed because blank nodes are shared across named graphs in RDF datasets - there's no actual sense in which one blank node is "in" one graph and not another.

Collections

Coming soon

Compiled schemas

Coming soon

joeltg commented Oct 21, 2020

The only kind of provenance that I've ever tried to represent is the second one that you list. Whenever I've ever said "provenance", I mean exactly that one. I could see us potentially approaching the first one in the future. I don't understand what you mean by the third one.

metasj commented Oct 27, 2020

By the third one I mean, schemas commonly include sources + times + context for data in the schema, in the properties of the schema. If you chose instead a TimestampedPerson schema that has a required generatedAtTime property, do you still want a similar property in your provenance?

joeltg commented Oct 27, 2020

I think that should be a strong anti-pattern. People are definitely used to thinking this way - "I have this timestamp, so where should I put it? Inside the Person object I guess" - but it's our job to nudge people toward thinking more ontologically: "Is this actually a property of the person, or is it a property of the data about the person?"
