@niklasl
Created October 5, 2023 11:56
Prerequisites and Requirements for Quotation in RDF

My recent proposal for using blank graphs for quotation and annotation is an attempt to cater for such use cases in a backwards-compatible way using graphs named by blank nodes.

Considering the history of RDF and recent use case requirements, some questions recur, which I'll explore further in this document:

  • Are quotations types or tokens? Where and why does it matter?
  • Do blank graphs offer a pragmatic method for representing quotations?
  • What is denoted by graph names used for quotation? (The quotation occurrence or its object? What is the object of a quotation?)
  • What is needed in practise for managing blankly named graphs in datasets?
  • Can entailment application overcome the type-token distinction? What about opacity/transparency?

The Type-Token Distinction

In this document, the terms "type" and "token", as defined by the Type-token distinction, are used in the same (informal) way that "tokens" are used in RDF Concepts: to clarify that reified statements are tokens of triples. Occasionally, "occurrence" is used instead of "token". While that is not the same thing linguistically (there can be occurrences of both tokens and types), it will be used here to mean "an occurrence of a triple" (a "claim"). The important difference is that between a fact (the triple itself, a mathematical object, as a "type") and an utterance of a fact (the "claim", as a "token").

(This distinction is reasonably different from the one between lexical representations and values, in that the former is about meaning and occurrences thereof, and the latter about representations of such things (and both representations and values may be occurrences of either tokens or types, so the delineation is orthogonal here). Cf. "equality and identity".)

The type-token distinction is on the table since, as soon as we start talking about quotation, hard questions come into play: what is quoted (is it opaque or transparent, "the sign, the sense or the reference"), and what kind of thing is the quote (a token or a type, a claim as an utterance of facts or the facts themselves (as a state of affairs))?

These are difficult distinctions, with debatable resolutions, perhaps even of questionable value for the main bulk of use cases. One of the core motivations of this document is to avoid forcing users of RDF to consider them, if possible. They are an integral aspect of the use cases and examples, but the attempt is to stay on the side of usage and needs, not theory. For as someone perhaps wise perhaps once said:

In theory, there is no difference between theory and practise.

In practise, there is.

Conventions

All TriG and SPARQL examples in this document assume the following declarations:

BASE <http://example.org/>
PREFIX : <http://example.org/ns#>

Datasets of Graphs are Occurrences of Triples

Adhering to the semantics of RDF, this is one triple:

<s> :p <o> .
<s> :p <o> .

This SPARQL query counts the occurrences of that triple in the graphs below (expressed in TriG):

SELECT (COUNT(?g) AS ?count) {
    GRAPH ?g { <s> :p <o> }
}

One named graph:

<g> { <s> :p <o> }
<g> { <s> :p <o> }

Two named graphs:

<g1> { <s> :p <o> }
<g2> { <s> :p <o> }

Also two named graphs (or 2 * N graphs, where N is the number of RDF sources):

_:g1 { <s> :p <o> }
_:g2 { <s> :p <o> }

Also 2 * N named graphs:

[] { <s> :p <o> }
[] { <s> :p <o> }

The above is, as far as I can see, the normative behaviour in SPARQL 1.1, under all standard entailments. There is no "inverse functional" identity in the value of the graph expression, so none of <g1>, <g2>, _:g1 and _:g2 is considered to be the same entity as any other.

This is thus to be expected: the occurrences of graphs are distinct, and they are not the same, even if their graph patterns are equal (or even identical).

In other words, name-graph pairs are effectively occurrences of triple sets. There is no entailment defined for concluding that the above pairs are aliases of each other. In fact, defining that would reasonably represent a breaking change, as graph "records" are expected to be distinct even when they "contain" the same triples.
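The counting behaviour above can be mimicked with a minimal Python sketch. It is not an RDF library: the dataset is modelled as a plain set of (graph name, triple) pairs, and the helper `count_graphs_containing` is a hypothetical stand-in for the SPARQL query. It covers a single RDF source only, so the "2 * N" blank node case is not modelled.

```python
# Illustrative model only: a dataset is a set of (graph_name, triple) quads.
TRIPLE = ("<s>", ":p", "<o>")


def count_graphs_containing(quads, triple):
    """Count the distinct graph names whose graph contains the triple."""
    return len({name for name, t in quads if t == triple})


# Repeating the same named graph adds nothing: sets ignore duplicates.
one_graph = {("<g>", TRIPLE), ("<g>", TRIPLE)}

# Different names give distinct name-graph pairs, even with equal contents.
two_iri_graphs = {("<g1>", TRIPLE), ("<g2>", TRIPLE)}
two_blank_graphs = {("_:g1", TRIPLE), ("_:g2", TRIPLE)}

print(count_graphs_containing(one_graph, TRIPLE))         # 1
print(count_graphs_containing(two_iri_graphs, TRIPLE))    # 2
print(count_graphs_containing(two_blank_graphs, TRIPLE))  # 2
```

Nothing in this model ever compares graph contents, which is the point: identity sits with the name-graph pair, not with the triple set.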

Note that this does not preclude more restrictive semantics being defined for typed graphs. The last part of this document will consider that option.

RDF-star Begs the Question

The RDF-star proposal directly addresses the type-token question. It defines that this:

<< <s> :p <o> >> :date "2023" .

Is a "type". Thus, the following asserts the same one triple, no matter how many times it's repeated:

<< <s> :p <o> >> :date "2023" .
<< <s> :p <o> >> :date "2023" .

And that is "intuitively" obvious, both since it's the same statement, and since the syntax looks like a "compound" IRI, and we know that IRIs denote their referents.

All the while, the shorthand annotation syntax is less clear. While as defined, this:

<s> :p <o> {| :date "2023-09-24" |} .

Is short for:

<s> :p <o> .
<< <s> :p <o> >> :date "2023-09-24" .

This annotation syntax doesn't necessarily signal that we are speaking about the triple itself rather than about this utterance in the graph.

As seen, even the seminal example got this intuition wrong (and without the annotation syntax, we might add). With this level of cognitive burden upon authors, is usage bound to get this wrong? We may be able to teach it, but is that level of discipline acceptable for this kind of notation?

In comparison, with IRIs, it is much more intuitive and obvious that they are identifiers (if not crystal clear, as they are often confused with Proper Names, rather than correctly used as References).

Also, what are the consequences of talking about triples-as-types as if they were token occurrences? Does that merely express a claim for a more detailed truth? Is that OK if such "muddied" claims occur(!) within named graphs?

Orthogonal to Graphs?

Technically, if the above syntax introduces a new kind of RDF term, this is orthogonal to graphs, yes. But it still begs the question of type vs. token, and also provokes the question: "why two kinds of quotation" (which begs the other hard question of opacity!).

Thus, Pandora's box has been opened. Maybe "just a little", plucking one triple at a time out from the Platonic Sets inside. But in so doing, the hard questions about graphs escaped too: Types or tokens? Opaque or transparent? Both questions remain unanswered, since we lack semantics for named graphs in RDF 1.1.

Unanswered Questions

Hitherto, named graphs in datasets have plodded along just fine in practise, basically avoiding the questions. There are no formal semantics for them. We do not know if they are opaque, and since there is an undefined indirection between the name and the "graph itself", we don't know what is denoted. It is not normatively answerable.

That may be unsatisfactory, perhaps more of a pragmatic "engineering cheat", making computer systems tick on all the while containing contradictions and nonsense. For better or worse.

It is, if you will, an implicit "extra-modal logic" of datasets, leaving logicians and mathematicians wanting. That is controversial. It would thus be valuable to formally define the state of things as they are in RDF 1.1 Datasets, adding as little as possible and mainly clarifying and formalizing what is currently undefined:

  • What is a "name, graph pair" (can it be up to its description and/or usage)?
  • How are such pairs "accepted"?
  • Are blank names for graphs special?

It may be possible to continue avoiding these questions. But in assessing if and how the use cases for quotation can be addressed by using blank graphs, we may come back to them, seeing them more clearly.

What do Use Cases Call For?

Given existing RDF 1.1 and JSON-LD 1.1 deployments, utilizing named graphs, together with the current working state being effectively occurrence-based, there is no apparent need for graphs-as-types.

Even with RDF Dataset Canonicalization, the values of graphs can be identical, but are not defined as one identity. On the contrary, it explicitly calculates a predictable blank node id for each occurrence of the same graph pattern, by basing it on its usage. This is consistently done for all blank node IDs.

Consider the use cases submitted to the WG, as well as examples in the CG issue tracker, the aforementioned seminal example, and various examples in the wild (e.g. Ontotext GraphDB). Many of them seem to founder on the "platonic" triple, slipping into talking about particulars. In fact, even the CG blog post warning about this starts out with a model which has to be changed once a new occurrence appears.

As for types or tokens, neither Verifiable Credentials nor any of the submitted use cases so far explicitly call for types. There is no direct wording or conceptualization of the triples as mathematical objects. That doesn't preclude it from being useful of course, but for it to be required we need clearer scenarios. The matching of a quotation can be done on its constituent parts ("where are these subjects, predicates and/or objects stated?"), so their identity as triples is, if at all, in the background.

(It is conceivable that optimizations can be made on its identity of course; but that can be done without formally defining the quotation as a type. That would be more like string interning; done under the hood in all but the most low-level use cases. This could be done for triples for a variety of reasons, e.g. when occurring(!) repeatedly in multiple graphs.)

With regard to opacity vs. transparency: in the current assessment, transparent semantics dominate. A few cases are phrased as fully opaque (as in "no entailment on the quotation"), but that can be handled either by using literals, or by conditional acceptance of graphs. This is addressed at the end of this document.

Summary

There seem to be two main categories in play:

  1. Management of triples (provenance, when/where/how utterances were made).
  2. Adding details when the model isn't granular enough (modal aspects of the observed event/measurement/phenomenon, such as time, place, situation, condition of the subject or object, particulars more granular than the predicate conveys).

Neither of these is just about the triple itself, but about an occurrence of a fact which the triple captures (and possibly in a lossy way, since these additional "marginalia" add what the triple itself does not convey).

Labelled Property Graphs

The case of labelled property graphs (LPGs) falls in the latter category. In LPGs the arc itself can have properties.

That is not directly catered for by either quoted triples or blank graphs, but quoted triples resemble it more closely, with one big exception: LPGs do model triples as occurrences. This is in Cypher notation (popular for Neo4j):

CREATE (d {name:"Liz"})-[:LIKES]->(a {name: "Richard"})
CREATE (d)-[:LIKES]->(a)
CREATE (d)-[:LIKES]->(a)

Which when queried:

MATCH (d {name:"Liz"})-[l:LIKES]->(a {name: "Richard"})
RETURN COUNT(l)

yields:

------------
| COUNT(l) |
------------
| 3        |
------------

Whereas, as we know, in RDF, this:

<liz> :married <richard> .
<liz> :married <richard> .
<liz> :married <richard> .

when queried:

SELECT (COUNT(?p) AS ?count) {
  <liz> ?p <richard> .
  FILTER(?p = :married)
}

yields:

----------
| ?count |
----------
| 1      |
----------

This is a fundamental restriction in the RDF model: graphs are not composed of triple occurrences, but are sets of triples. Thus, in either solution, talking about triple occurrences has to either be explicit "occurrence nodes", or multiple small graph occurrences with the same contained triples.
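The set-versus-occurrence difference can be shown with a minimal Python sketch. It assumes nothing beyond the standard library: a `set` stands in for an RDF graph and a `Counter` (a multiset) stands in for LPG edges. Neither is how a real store is implemented.

```python
from collections import Counter

# The same statement, uttered three times.
statements = [("<liz>", ":married", "<richard>")] * 3

rdf_graph = set(statements)      # RDF: a graph is a set of triples
lpg_edges = Counter(statements)  # LPG-like: every edge is its own occurrence

rdf_count = len(rdf_graph)
lpg_count = lpg_edges[("<liz>", ":married", "<richard>")]

print(rdf_count, lpg_count)  # 1 3
```

To get the count of three on the RDF side, each utterance needs its own identity, which is exactly what occurrence nodes or small per-occurrence graphs supply.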

And since the RDF-star CG report models them as mathematical objects, denotable in their own right, multiple occurrences must be explicit. Making the mathematical triple itself the subject is very precise, but is it usable?

Modelling Woes

We need to carefully design, document and promote the mechanism of quotation and annotation so that it does not radically reinvent what RDF is successfully used for. Any "creative" use of quotation should be clearly distinguishable from "regular" RDF, rather than a palette of new options with no guidance for interoperable forms. We already have blank nodes, reification, structured values using rdf:value, the option of singleton properties and named graphs. Adding detailed provenance and marginalia is in many ways possible with those tools. If that still leaves us wanting, building upon what we have, adding clarification and more precise semantics could suffice.

It can be argued that the precise approach of quoted triples discourages "sloppy" use, but given the already prolific examples of usage, from the seminal example and onward, it can be equally argued that quotes and annotations will almost invariably be "occurrence-oriented", as usage, and not logical truth, is the more common framing when "speaking outside" of the regular triples. Because, and this is fundamental, quotation, including "mixed quotation", is an advanced and reasonably marginal practise. If it was to be front and center, a richer modal logic would be called for.

And in such logic, having arcs with properties would arguably be occurrences anyway (a kind of triple multisets or singleton properties by default, like in LPGs). So RDF-star may be too pure for what it tries to support. Blank graphs, on the other hand, are not. They are a "scruffy indirection" from the platonic, mathematical world, realized as graph diagrams (as Pat Hayes put it in slide 20 of Blogic, ISWC 2009).

Exploring Syntax

As seen, with the RDF-star proposal we are forced to consider the question about types or tokens, as it is decidedly about triples as types. With blank graphs, can we (and do we want to) stay with the current undefined but de-facto state of occurrences with undefined references to "whatever graphs really are" (one indirection away from their value space of "states of affairs")? Or do we need to define a more concrete foundation for multiple graphs? Can blank graphs "nest" in named graphs?

Let's look at what we have.

Reification

The RDF 1.1 Semantics section on Reification contains the following:

The subject of a reification is intended to refer to a concrete realization of an RDF triple, such as a document in a surface syntax, rather than a triple considered as an abstract object. This supports use cases where properties such as dates of composition or provenance information are applied to the reified triple, which are meaningful only when thought of as referring to a particular instance or token of a triple.

In the previous RDF 1.0 Semantics, the Reification section was a little more elaborate:

The semantic extension described here requires the reified triple that the reification describes - I(_:xxx) in the above example - to be a particular token or instance of a triple in a (real or notional) RDF document, rather than an 'abstract' triple considered as a grammatical form. There could be several such entities which have the same subject, predicate and object properties. Although a graph is defined as a set of triples, several such tokens with the same triple structure might occur in different documents. Thus, it would be meaningful to claim that the blank node in the second graph above does not refer to the triple in the first graph, but to some other triple with the same structure. This particular interpretation of reification was chosen on the basis of use cases where properties such as dates of composition or provenance information have been applied to the reified triple, which are meaningful only when thought of as referring to a particular instance or token of a triple.

An example usage in RDF/XML reminiscent of RDF-star annotations would be:

<rdf:Description rdf:about="book">
  <contributor rdf:resource="mary" rdf:ID="q1"/>
  <contributor rdf:resource="mary" rdf:ID="q2"/>
</rdf:Description>
<rdf:Description rdf:ID="q1">
  <date>1853</date>
  <source rdf:resource="biography"/>
</rdf:Description>
<rdf:Description rdf:ID="q2">
  <date>2023-09-25</date>
  <source rdf:resource="wikipedia"/>
</rdf:Description>

Unsurprisingly, reification is explicitly about triple tokens.

Notation 3 Graph Terms

Graph terms in Notation 3 are defined as:

Essentially, a graph term represents an occurrence of an RDF graph — i.e., a quoting or citing of the graph. Importantly, a graph term does not assert the contents of the RDF graph as being true (e.g., :cervantes dc:wrote :moby_dick). In fact, the graph term is interpreted as a resource on its own.

It appears clear: "an occurrence of an RDF graph" implies that each one is its own, unique resource, regardless of their "contents". Unless "resource on its own" is the same thing as "literals denote themselves"? Is this an ambiguous definition?

Blank Graph Subjects are Occurrences

These are two blank graphs, with the same claim:

_:g1 { <s> :p <o> }
_:g1 :date "2023" .

_:g2 { <s> :p <o> }
_:g2 :date "2023" .

That is, these are clearly occurrences (two tokens of one logical triple).

With extended TriG syntax:

[7g]	labelOrSubject	::=	iri | BlankNode | blankNodePropertyList

that could be condensed into:

[ :date "2023" ] { <s> :p <o> }
[ :date "2023" ] { <s> :p <o> }

Graph Objects Perhaps Beg the Question

If we similarly extend TriG to allow for blank graphs as objects:

[12]	object	::=	iri | blank | blankNodePropertyList | literal | wrappedGraph

we need to be clear that this:

<g1> :contains { <s> :p <o> } .
<g1> :contains { <s> :p <o> } .

is short for:

<g1> :contains _:g1 .
_:g1 { <s> :p <o> }

<g1> :contains _:g2 .
_:g2 { <s> :p <o> }

I.e. two occurrences. Given what was shown initially above, this is the case for RDF and SPARQL 1.1. (And appears to be the case in Notation 3?)

Given that, it is reasonable to define these graph objects as shorthand for blank graph occurrences.
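A minimal Python sketch of that expansion follows. The function `expand_graph_objects` and its data shapes are hypothetical, chosen for illustration: a wrapped graph is given as a frozenset of triples, and every use of one is minted a fresh blank node label.

```python
from itertools import count


def expand_graph_objects(statements):
    """Expand (subject, predicate, wrapped_graph) statements.

    Each wrapped graph becomes a new blank graph with a fresh label,
    plus a triple in the default graph pointing at that label.
    """
    labels = (f"_:g{n}" for n in count(1))
    default_graph = set()
    named_graphs = {}
    for s, p, wrapped in statements:
        name = next(labels)  # fresh per use, never derived from the contents
        default_graph.add((s, p, name))
        named_graphs[name] = set(wrapped)
    return default_graph, named_graphs


quoted = frozenset({("<s>", ":p", "<o>")})
default_graph, named_graphs = expand_graph_objects([
    ("<g1>", ":contains", quoted),
    ("<g1>", ":contains", quoted),
])

print(sorted(named_graphs))  # ['_:g1', '_:g2']
print(len(default_graph))    # 2
```

Because the label is minted per use, two syntactically identical graph objects end up as two occurrences with equal contents.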

Graphs as Mathematical Objects

... but if the above was desired to mean the same facts twice; that is, expanding to something where <s-p-o> is "a unique name based on the canonicalized contents":

<g1> :contains <s-p-o> .
<s-p-o> { <s> :p <o> }

<g1> :contains <s-p-o> .
<s-p-o> { <s> :p <o> }

i.e. meaning only:

<g1> :contains <s-p-o> .
<s-p-o> { <s> :p <o> }

Then that is a type.

The proposed definition of blank graphs as graph occurrences does not exclude the notion of graphs themselves as mathematical objects. It just doesn't provide a means for describing them directly. Just as literals are hard to use as subjects. Of course, with reasoning, inferring types from tokens is possible.
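For contrast, here is a minimal Python sketch of the "type" reading, where the name is derived from the contents. The helper `content_name` and the `urn:example:graph:` prefix are made up for illustration. Sorting serialized lines only works as canonicalization because the triples here contain no blank nodes; anything else would need a real algorithm such as RDF Dataset Canonicalization.

```python
import hashlib


def content_name(triples):
    """Derive a graph name from ground triples (no blank nodes).

    Assumption: sorted N-Triples-like lines stand in for canonicalization.
    """
    lines = sorted(f"{s} {p} {o} ." for s, p, o in triples)
    digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
    return f"urn:example:graph:{digest}"


T = ("<s>", ":p", "<o>")

# Stating the same graph twice yields one name, hence one name-graph pair.
dataset = {}
for triples in [{T}, {T}]:
    dataset.setdefault(content_name(triples), set()).update(triples)

print(len(dataset))  # 1
```

Here equal contents collapse into a single entry, which is what separates this reading from the blank graph occurrences above.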

Quotation Occurrences as Blank Graphs

Typed Annotations

Currently, an RDF-star annotation is just a quoted triple and an assertion, the annotation being statements about the quoted triple itself. Thus, what is described is the fact as a type, and not its token occurrence in use (the utterance, coded in RDF). Again this is different from "plain old" reification, since that is explicitly about occurrences.

Conflating The Type and the Token

Many examples using RDF-star annotations without an explicit "occurrence node" tend to skew from this platonic arc and go into the domain of occurrence, talking about events or the observations of such (which are not "universal facts about facts").

This example evidently fails to stay within the lines:

<book> :contributor <jane> {| a :Statement ;
      :comment "Major inspiration."@en ;  # This is vague, intended subject is ambiguous.
      :source <biography> ;  # No, the triple itself had the contribution act as its "source"?
      :date "1831"  # The date of publication? Or of a date within when the contribution happened?
    |} .

Since the triple itself represents a fact, whatever is paired with it ("names" it) should reasonably entail it. With a specific relationship for that, relating the triple to the multiple occurrences that entail it, the "token or type" nature of each occurrence can be decided using rdf:type:

<book> :contributor <mary> {|
    :entailedBy [ a :Contribution ;
        :role <inspirer>, <editor> ;
        :comment "Major inspiration and editing."@en ;
        :date "1831"
      ] , [ a :Claim ;
        :date "1853" ;
        :source <biography>
      ], [ a :Claim ;
        :date "2023-09-25" ;
        :source <wikipedia>
      ]
    |} .

That looks clear enough. But which relationships on the triple itself are meaningful here? An occurrence, observation, claim or event; i.e. one relationship per main occurrence type? What other properties are meaningful to place on the triple itself? As said, most examples tend to stray from that abstract domain and go into less mathematically pure descriptions.

Annotation as Assertions with Typed Blank Graphs

If annotations simply talk about occurrences of the triple, we don't need that :entailedBy spelled out. Given a variant syntax, allowing embedding of node terms in an optional {...} block following a triple:

annotation ::= '{' (iri | blank | blankNodePropertyList)+ '}'

the above could be expressed as a shorthand for blank graphs:

<book> :contributor <mary> {[ a :Contribution ;
      :role <inspirer>, <editor> ;
      :comment "Major inspiration and editing."@en ;
      :date "1831"
    ] [ a :Claim ;
      :date "1853" ;
      :source <biography>
    ] [ a :Claim ;
      :date "2023-09-25" ;
      :source <wikipedia>
    ]} .

Does the above represent reasonable types for "occurrences of graphs" (as sets of claims entailing the asserted claim)? It is shorthand for:

<book> :contributor <mary> .

[ a :Contribution ;
      :role <inspirer>, <editor> ;
      :comment "Major inspiration and editing."@en ;
      :date "1831"
] { <book> :contributor <mary> }

[ a :Claim ;
  :date "1853" ;
  :source <biography>
] { <book> :contributor <mary> }

[ a :Claim ;
  :date "2023-09-25" ;
  :source <wikipedia>
] { <book> :contributor <mary> }

Notably, the quoted graph is distinct for each. Thus each could be subject to different entailment regimes (depending on its rdf:type), or none in the case of opaque occurrences.

Quotation is Not Simple

If quotation is defined upon the notion of graphs, it will be considered a rather advanced mechanism. In fact, there would be no support for it in Turtle, as it has to be encoded using quads (TriG, N-Quads, JSON-LD).

Before objections to this are raised, consider again: what is quotation for? What are the use cases? Do they encode knowledge in the world, or is it about provenance and "things left out" (not fitting in the chosen model, having to be "written in the margins")?

An RDF Source with a default graph and one or more blank graphs could be considered a special case, a "simple source" plus "margins" or "appendices". Perhaps extending Turtle for that alone is useful? It would still require users needing to quote more than one triple at a time (including RDF lists) to switch to a graph-enabled syntax.

Blank Graphs as Appendices

There is one big elephant in the blank graphs camp. In RDF, graphs are not nested. You cannot put quotations in the asserted graphs, only relate to a blank "sibling" graph, and for such relationships there are currently no explicit semantics or implementation requirements.

So if blank graphs are to be used for quotation in practise, at least this is required:

  1. They MUST NOT be in the active union graph by default (if at all).
  2. For operations such as insertion and deletion of a named graph, they MUST be treated as being part of the named graph where they are described.
  3. If blank graph identifiers are skolemized, they MUST still be treated as per the second requirement.

Blank graphs must thus be treated as part of one (named or default) graph. In that sense, a blank graph is a sub-graph of a proper graph, or a kind of "appendix".

But which graph is the "owning" graph? What if the bnode is used in more than one graph? Do we still need a new kind of term? A "tagged", graph-local bnode? Maybe a cleanup routine would suffice, similar to reference counting garbage collection, where after any graph update or deletion, all blank graphs no longer described by any triple in the dataset are simply dropped. If that is too brittle, perhaps quotation and annotation shorthands must also add an rdfs:seeAlso relationship from the "owning" graph to the blank graph, which is then used for operations. In any event, it should be possible to keep this in the implementation details, and not force users to do any manual management.
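The cleanup routine could look something like the following Python sketch. It rests on assumptions: the dataset is a dict from graph name to a set of triples (`None` for the default graph), a blank graph counts as "described" when its name appears as subject or object in some other graph, and `drop_orphaned_blank_graphs` is a hypothetical helper, not part of any store's API.

```python
def is_referenced(name, dataset):
    """True if any triple in another graph mentions the name."""
    return any(
        name in (s, o)
        for graph_name, triples in dataset.items()
        if graph_name != name
        for s, p, o in triples
    )


def drop_orphaned_blank_graphs(dataset):
    """Remove blank graphs that no remaining triple refers to.

    Repeats until stable, since dropping one blank graph may orphan another.
    """
    while True:
        orphans = [
            name for name in dataset
            if name is not None
            and name.startswith("_:")
            and not is_referenced(name, dataset)
        ]
        if not orphans:
            return dataset
        for name in orphans:
            del dataset[name]


dataset = {
    None: set(),
    "<g1>": {("_:q1", ":date", '"2023"')},
    "_:q1": {("<s>", ":p", "<o>"), ("_:q2", ":source", "<biography>")},
    "_:q2": {("<s>", ":p", "<o2>")},
}

# While <g1> describes _:q1 (and _:q1 describes _:q2), nothing is dropped.
kept = sorted(n for n in drop_orphaned_blank_graphs(dict(dataset)) if n)

# Deleting <g1> orphans _:q1, and dropping _:q1 in turn orphans _:q2.
del dataset["<g1>"]
remaining = drop_orphaned_blank_graphs(dataset)

print(kept)             # ['<g1>', '_:q1', '_:q2']
print(list(remaining))  # [None]
```

The sketch also exposes the brittleness mentioned above: a blank graph referenced from two graphs survives until both references are gone, which may or may not be what the "owning" graph intended.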

Beyond RDF-star: Datasets and Entailments

If a minimal basic semantics for named graphs can be defined (similar to that for RDF datatype entailment), ontologies can be defined upon that to achieve the advanced use cases, only when called for, instead of required up front.

Can Occurrences Become Types?

We'd need some way to relate the graph occurrence and "that which is quoted" (the graph pattern itself, as a type).

Alternatives:

  1. The graph pattern is the canonical rdf:value of the name of the graph.
  2. The graph pattern is the canonical :entails of the name of the graph.

From that baseline it could be possible to determine a specific relationship from the type of the graph.

For instance, each graph occurrence with the same value is the same occurrence (as if the graph value property was an owl:InverseFunctionalProperty):

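PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>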

:StateOfAffairs a owl:Class ;
  rdfs:subClassOf rdfg:Graph ;
  owl:hasKey (rdf:value) .

or a "Platonic type", denoting itself:

:StateOfAffairs a owl:Class ; 
  rdfs:subClassOf [ a owl:Restriction ;
        owl:onProperty rdf:value ;
        owl:hasSelf true
      ] .

And data like:

_:g1 a :StateOfAffairs .
_:g1 { <s> :p <o> . }

_:g2 a :StateOfAffairs .
_:g2 { <s> :p <o> . }

We conclude that:

_:g1 owl:sameAs _:g2 .

Effectively making the named graph a type.
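The effect of such a key can be approximated in a few lines of Python. This is an analogy, not an OWL reasoner: `same_as_groups` is a hypothetical helper that groups graph names by their triple set, but only for graphs typed with the class that carries the key.

```python
from collections import defaultdict


def same_as_groups(dataset, types, key_class=":StateOfAffairs"):
    """Group names of graphs typed as key_class by their set of triples."""
    by_value = defaultdict(set)
    for name, triples in dataset.items():
        if key_class in types.get(name, ()):
            by_value[frozenset(triples)].add(name)
    # Only groups with more than one member yield owl:sameAs conclusions.
    return [names for names in by_value.values() if len(names) > 1]


T = ("<s>", ":p", "<o>")
dataset = {"_:g1": {T}, "_:g2": {T}, "_:g3": {T}}
types = {
    "_:g1": {":StateOfAffairs"},
    "_:g2": {":StateOfAffairs"},
    "_:g3": {":Claim"},  # same contents, but not keyed: stays an occurrence
}

groups = same_as_groups(dataset, types)
print(groups == [{"_:g1", "_:g2"}])  # True
```

The third graph stays distinct despite identical contents, so types and tokens can coexist in one dataset, decided per graph by its rdf:type.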

Explicit Graph Types

Other possible classes to narrow down the nature and purpose of graph occurrences:

# singleton graph, a Proposition of one triple:
[ a rdf:Statement ] { <s> :p <o> }

# a Statement which is (said to be) asserted:
[ a :Assertion ] { <s> :p <o> }

# Token (named Occurrence):
[ a :Proposition ] {
  <s1> <p1> <o1>, <o2> ;
    <p2> "l1" .
  <s2> <p1> <o2> .
}

[ a :Event ] {
  <tolkien> :creator <lotr> .
}

# No entailment; the default mode of all blank graphs:
[ a :Quotation ] {
  owl:Thing owl:equivalentClass owl:Nothing ;
    rdfs:comment "This kind of nonsense is isolated from the asserted world."@en .
}

# No entailment:
:OpaqueStatement rdfs:subClassOf :NeutralGraph, rdf:Statement .

This also works with the "Typed Annotations" above, to add further ontological structure to them.

Explicit Dataset Composition

Descriptions of datasets; to be put within the default graph or by selecting it as an explicit active graph.

Here describing a usage combining accepted graphs, quoted (neutral or opaque) graphs, along with which entailment to use:

PREFIX ds: <https://example.org/ns/rdf-dataset-ontology/>

<ds1> a ds:Dataset ;
  ds:entailmentRegime ds:OWLFull ; 
  ds:acceptsGraph <g1>, <g2> ;
  ds:acceptsGraphType :Event ;
  ds:opaqueGraphType :Quote, :Citation .

Two questions:

  1. Do ds:acceptsGraph, rdfs:seeAlso and owl:imports have similar semantics?
  2. How does this declaration relate to the RDF notion of interpretation?

Opacity as Conditional Entailment

There are other, arguably better means for opacity (using literals to represent literal quotes, for instance). However, the few use cases calling for opacity call for full opacity, in the sense of not applying any entailment to the quoted triple(s). For those, the above dataset description mechanisms provide a fully user-configurable means of doing so, combining data publisher signals (typed graphs) with consumer decisions about what to interpret of that, and how.
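As a rough Python sketch of the consumer side, assuming the dataset description has already been read into a plain dict: the keys mirror the ds: properties used above, and `accepted_triples` is a hypothetical helper. Letting an opaque type win over acceptance is a policy chosen for this sketch, not something defined anywhere, and the entailment regime itself is not modelled.

```python
def accepted_triples(dataset, types, config):
    """Union of the triples from graphs the consumer chooses to accept."""
    accepted = set()
    for name, triples in dataset.items():
        graph_types = types.get(name, set())
        if graph_types & config["opaque_graph_types"]:
            continue  # quoted only: never merged, so no entailment applies
        if (name in config["accepts_graphs"]
                or graph_types & config["accepts_graph_types"]):
            accepted |= triples
    return accepted


config = {
    "accepts_graphs": {"<g1>", "<g2>"},
    "accepts_graph_types": {":Event"},
    "opaque_graph_types": {":Quote", ":Citation"},
}
dataset = {
    "<g1>": {("<a>", ":p", "<b>")},
    "_:e1": {("<tolkien>", ":creator", "<lotr>")},
    "_:q1": {("owl:Thing", "owl:equivalentClass", "owl:Nothing")},
    "_:x1": {("<c>", ":p", "<d>")},
}
types = {"_:e1": {":Event"}, "_:q1": {":Quote"}}

result = accepted_triples(dataset, types, config)
print(len(result))  # 2
```

Only the triples of <g1> and the :Event graph reach the set that reasoning would run over. The quoted nonsense in _:q1 and the untyped _:x1 stay in the dataset, queryable by name but not asserted.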
