Last Update: 2018-10-27
Author: Rob Sanderson (for the JSON-LD Working Group)
In order to ensure persistent references, HTML and other specifications have the notion of a base URI (e.g. http://example.com/) with which relative URIs (e.g. page.html) are resolved into absolute URIs (http://example.com/page.html). This notion is especially important in Linked Data, as the connectedness of URIs via relationships is what makes it a web of data, not just another standalone data format. Knowledge graphs of all descriptions would fail to coalesce if resolution were not deterministic and the method were left up to individual implementers to decide.
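As a small illustration of the resolution step described above, the following sketch uses Python's standard `urllib.parse.urljoin`, which performs RFC 3986 style reference resolution. The variable names are illustrative only and are not part of any specification.

```python
from urllib.parse import urljoin

base_uri = "http://example.com/"   # the base URI from the paragraph above
relative_uri = "page.html"         # the relative URI to resolve

# Merge the relative reference with the base to obtain an absolute URI.
absolute_uri = urljoin(base_uri, relative_uri)
print(absolute_uri)
```

Any two processors that agree on the base URI will arrive at the same absolute URI; the open question in this document is which base URI they should agree on.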
Some data formats such as RDFa or Microdata can be used to assert relationships in HTML using element and attribute patterns. These formats use information in the DOM to determine the base URI for resolving any relative URIs encountered (see Microdata's itemid resolution and RDFa's evaluation context for reference).
JSON-LD can be embedded in HTML using the script element as a data-block. This is currently only discussed in a non-normative section of the JSON-LD 1.0 specification; however, it is the majority usage of JSON-LD, driven by schema.org recommendations and the obvious SEO advantage with search engines. In JSON-LD 1.1, the Working Group needs to ensure that this is normatively defined and consistently applied.
However, the question that has troubled us, and that relates to the wider web ecosystem, is whether or not data in script elements should recognize the DOM-based methodology of determining the base URI for resolving relative URIs.
Resolving this issue has the goal of ...
Ensuring that consumers of JSON-LD embedded in HTML can consistently resolve relative URIs
The issue affects consuming applications of JSON-LD embedded in HTML, including but not limited to search engines, browsers, knowledge graphs and artificial intelligence, web of things devices, semantic claim verification ecosystems, and digital publishing. There are very few application domains that this issue does not have the potential to affect.
We do not aim to solve the issue for embedded formats beyond JSON-LD (e.g. YAML). However, we feel that this would act as a precedent for such formats, and thus that it is important to have early TAG review of the issue.
The document retrieved from https://example.org/cms/1/index.html has the contents:
<html>
<head>
<base href="https://example.org/content/"/>
</head>
<body>
<script type="application/ld+json">
{
"@context": {"@vocab": "http://schema.org/"},
"id": "this-resource",
"name": "Name of this Resource"
}
</script>
</body>
</html>
In order to resolve the relative URI in the id property, there are two possibilities:
- The <base> tag is respected, resulting in https://example.org/content/this-resource
- Only the document's retrieval location is respected, resulting in https://example.org/cms/1/this-resource
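The two outcomes can be reproduced with a short sketch. It assumes Python's `urllib.parse.urljoin` as the resolver; the variable names are illustrative and do not come from the JSON-LD or HTML specifications.

```python
from urllib.parse import urljoin

document_url = "https://example.org/cms/1/index.html"  # where the document was retrieved
base_href = "https://example.org/content/"             # href of the <base> element
relative_id = "this-resource"                          # value of the id property

# Possibility 1: the <base> element is respected.
with_base_element = urljoin(base_href, relative_id)

# Possibility 2: only the retrieval location is respected.
with_retrieval_url = urljoin(document_url, relative_id)

print(with_base_element)
print(with_retrieval_url)
```

The same relative URI therefore identifies two different resources, depending solely on which rule the consumer implements.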
The issue becomes more complex in the following scenarios:
The document, if it is XHTML, might also have the xml:base attribute on the encapsulating script tag.
Example:
<html>
<head>
<base href="https://example.org/content/"/>
</head>
<body>
<script type="application/ld+json"
xml:base="http://not.example.com/elsewhere/">
{
"@context": {"@vocab": "http://schema.org/"},
"id": "this-resource",
"name": "Name of this Resource"
}
</script>
</body>
</html>
The document might use the link element with the rel attribute value of canonical to set the canonical URI for the document. This could also be interpreted as the base URI. Future rel values might also have implications on the base URI that should be chosen.
Example:
<html>
<head>
<link href="https://example.org/content/"
rel="canonical"/>
</head>
<body>
<script type="application/ld+json"
base="http://not.example.com/elsewhere/">
{
"@context": {"@vocab": "http://schema.org/"},
"id": "this-resource",
"name": "Name of this Resource"
}
</script>
</body>
</html>
Instead of directly on the script or in a global position such as base or link, the xml:base attribute might be on any ancestor element. This setting would be in effect when processing the script tag and could thus affect the processing of relative URIs.
Example:
<html>
<head>
<base href="https://example.org/content/"/>
</head>
<body>
<div xml:base="http://not.example.com/ancestor/">
<script type="application/ld+json">
{
"@context": {"@vocab": "http://schema.org/"},
"id": "this-resource",
"name": "Name of this Resource"
}
</script>
</div>
</body>
</html>
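To make the ancestor case concrete, here is a hedged sketch of what a processor would have to do if it chose to honour xml:base. It assumes well-formed XHTML parsed with Python's `xml.etree.ElementTree`, and it assumes that the starting base passed in has already been determined by some other rule (here, the href of the base element). The function name `data_blocks_with_base` is hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

XHTML = """<html>
<body>
<div xml:base="http://not.example.com/ancestor/">
<script type="application/ld+json">
{"@context": {"@vocab": "http://schema.org/"}, "id": "this-resource"}
</script>
</div>
</body>
</html>"""

def data_blocks_with_base(element, inherited_base):
    """Yield (base_uri, parsed_json) for each JSON-LD script, honouring xml:base."""
    # An xml:base value may itself be relative, so resolve it against the
    # inherited base; an absent attribute leaves the inherited base unchanged.
    base = urljoin(inherited_base, element.get(XML_BASE, ""))
    local_name = element.tag.rsplit("}", 1)[-1]  # tolerate namespaced tags
    if local_name == "script" and element.get("type") == "application/ld+json":
        yield base, json.loads(element.text)
    for child in element:
        yield from data_blocks_with_base(child, base)

root = ET.fromstring(XHTML)
base, data = next(data_blocks_with_base(root, "https://example.org/content/"))
resolved = urljoin(base, data["id"])
print(resolved)
```

Note that the result depends on where the script element sits in the tree: moving it outside the div would silently change the resolved URI.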
In an environment that executes JavaScript, the attributes on any of the above elements could also be mutated by scripts. JavaScript is widely used to inject script tags containing JSON-LD in publishing workflows, and the injected data is extracted by Google for Web search features. Different environments would therefore resolve relative URIs differently, depending on whether the code is executed and, if so, on its results.
Example:
<html>
<head>
<base href="https://example.org/content/"/>
</head>
<body>
<script type="application/ld+json">
{
"@context": {"@vocab": "http://schema.org/"},
"id": "this-resource",
"name": "Name of this Resource"
}
</script>
<script>
document.querySelector("base").href = "https://www.example.org/" + Math.random();
</script>
</body>
</html>
The script element could be within an iframe. This is described in utmost clarity* in the processing steps for the fallback base URL. Advice on how to interpret the iframe fallback step would be appreciated.
Example:
<html>
<head>
<base id="base" href="https://example.org/content/"/>
</head>
<body>
<iframe src="...">
<!-- transcluded content from ... -->
<script type="application/ld+json">
{
"@context": {"@vocab": "http://schema.org/"},
"id": "this-resource",
"name": "Name of this Resource"
}
</script>
</iframe>
</body>
</html>
* For the sarcasmly impaired, I reproduce the step for your specification parsing pleasure:
If document is an iframe srcdoc document, then return the document base URL of the Document’s browsing context’s browsing context container’s node document.
Related to base URI resolution, the language of a string could equivalently be affected by the setting of the lang attribute in the DOM. It would be potentially surprising if base was processed but lang was not. RDFa does require processing lang as well as base.
Example:
<html>
<head>
<base href="https://example.org/content/"/>
</head>
<body lang="fr">
<div>Cette page est écrite en français.</div>
<script type="application/ld+json">
{
"@context": {"@vocab": "http://schema.org/"},
"id": "this-resource",
"name": "Name of this Resource (in English)"
}
</script>
</body>
</html>
The general, but not unanimous, consensus of the JSON-LD 1.1 Working Group is that features of the DOM MUST NOT be taken into account when determining the base URI for resolving relative URIs included in data-block values, such as JSON-LD. This is from the following observations:
- The barrier to entry is higher as a DOM processor is needed
- If JavaScript execution is required for processing, the barrier to entry is significantly higher
- The unintended effects of moving an expected-to-be-standalone script element around in the DOM would be a negative developer experience
- The very common use of templating content management systems would affect data blocks and be difficult to ensure consistency in any reasonably sized organization
- The definition of data-block in HTML implies that the surrounding DOM SHOULD NOT affect the processing of the contents, but is not entirely clear.
- Relative URIs are valuable in many scenarios, and are called out as a recommendation in the LDP Best Practices document, wherein JSON-LD is a required serialization (4.3.2.3).
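The DOM-independent behaviour favoured by the majority can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `html.parser` as a tokenizer (no DOM is built), resolves the id property from the examples above against the retrieval URL, and ignores the base element entirely. The names `DataBlockExtractor` and `resolve_ids` are hypothetical.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

class DataBlockExtractor(HTMLParser):
    """Collect every <script type="application/ld+json"> body as parsed JSON."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # None while outside a JSON-LD script element

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._buffer = None

def resolve_ids(html_text, document_url):
    """Resolve each block's id against the retrieval URL, ignoring the DOM."""
    extractor = DataBlockExtractor()
    extractor.feed(html_text)
    extractor.close()
    return [urljoin(document_url, b["id"]) for b in extractor.blocks if "id" in b]

HTML = """<html><head><base href="https://example.org/content/"/></head>
<body><script type="application/ld+json">
{"@context": {"@vocab": "http://schema.org/"}, "id": "this-resource"}
</script></body></html>"""

resolved = resolve_ids(HTML, "https://example.org/cms/1/index.html")
print(resolved)
```

Because nothing outside the script element is consulted, the result is the same wherever the data block is placed in the document and whether or not scripts are executed.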
A significant counter argument is that an HTML document that contains both RDFa or Microdata and embedded JSON-LD would be required to use different processing rules for extracting the knowledge graph from the different formats, likely resulting in the disconnected system that we aim to avoid in the first place. Google's extraction processor does take the DOM features into account, including JavaScript execution.
References are included inline in the description above.
The primary participants in this discussion to date are:
- Rob Sanderson (@azaroth42) (chair)
- Benjamin Young (@bigbluehat) (chair)
- Gregg Kellogg (@gkellogg) (editor)
- Ivan Herman (@iherman) (staff contact)
- Dan Brickley (@danbri)
- Adam Soroka
- Jeff Mixter
- David Lehn
- Harold Solbrig
- Pierre-Antoine Champin
- Hadley Beeman
My view is that “XML document” includes application/xhtml+xml, and this behavior was defined for HTML5 at one time. If they removed it, I’d expect some change indication someplace.