
@azaroth42
Last active October 31, 2018 15:22

The Base URI Issue Explained

Last Update: 2018-10-27

Author: Rob Sanderson (for the JSON-LD Working Group)

What’s this Base URI issue then?

In order to ensure persistent references, HTML and other specifications have the notion of a base URI (e.g. http://example.com/) against which relative URIs (e.g. page.html) are resolved into absolute URIs (http://example.com/page.html). This notion is especially important in Linked Data, as the connectedness of URIs via relationships is what makes it a web of data, not just another standalone data format. Knowledge graphs of every kind would fail to coalesce if resolution were not deterministic and each implementer were left to choose its own method.
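As a minimal sketch of the resolution described above, the WHATWG `URL` constructor (available in browsers and Node.js) resolves a relative reference against a base. The helper name `resolveAgainst` and the second base URI are illustrative assumptions, not taken from any specification.

```javascript
// Resolve a relative URI reference against a base URI.
// `resolveAgainst` is an illustrative helper name, not a standard API.
function resolveAgainst(relative, base) {
  return new URL(relative, base).href;
}

console.log(resolveAgainst("page.html", "http://example.com/"));
// http://example.com/page.html

// The same reference yields a different absolute URI under a different base
// (a hypothetical one here), which is why consumers must agree on the base.
console.log(resolveAgainst("page.html", "http://example.com/archive/"));
// http://example.com/archive/page.html
```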

Some data formats such as RDFa or Microdata can be used to assert relationships in HTML using element and attribute patterns. These formats use information in the DOM to determine the base URI for resolving any relative URIs encountered (see Microdata's itemid resolution and RDFa's evaluation context for reference).

JSON-LD can be embedded in HTML using the script element as a data block. This is currently discussed only in a non-normative section of the JSON-LD 1.0 specification, yet it accounts for the majority of JSON-LD usage, driven by schema.org recommendations and the resulting SEO benefit with search engines. In JSON-LD 1.1, the Working Group needs to ensure that this embedding is normatively defined and consistently applied.

However, the question that has troubled us, and that relates to the wider web ecosystem, is whether data in script elements should recognize the DOM-based methodology of determining the base URI for resolving relative URIs.

Goals

Resolving this issue has the goal of ...

Ensuring that consumers of JSON-LD embedded in HTML can consistently resolve relative URIs

Target Audience

Consuming applications of JSON-LD embedded in HTML, including but not limited to search engines, browsers, knowledge graphs and artificial intelligence, the web of things devices, semantic claim verification ecosystems, digital publishing, and so forth. There are very few application domains that this issue does not have the potential to affect.

Mostly-Non-Goals

We do not aim to solve the issue for embedded formats beyond JSON-LD (e.g. YAML). However, we feel that the outcome would act as a precedent for such formats, and thus that early TAG review of the issue is important.

Illustrative Example

The document retrieved from https://example.org/cms/1/index.html has the contents:

<html>
  <head>
    <base href="https://example.org/content/"/>
  </head>
  <body>
    <script type="application/ld+json">
{
  "@context": {"@vocab": "http://schema.org/"},
  "id": "this-resource",
  "name": "Name of this Resource"
}
    </script>
  </body>
 </html>

In order to resolve the relative URI in the id property, there are two possibilities:

  1. The <base> tag is respected, resulting in https://example.org/content/this-resource
  2. Only the document's retrieval location is respected, resulting in https://example.org/cms/1/this-resource
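The two outcomes can be reproduced with a short sketch. The function name `resolveId` and the policy names are hypothetical; they stand in for whichever rule a specification might eventually mandate.

```javascript
// Sketch: resolve the example's relative "id" under each candidate policy.
// `resolveId` and the policy names are hypothetical, for illustration only.
function resolveId(id, { documentUrl, baseHref }, policy) {
  // "respect-base": use the <base href> value if present (itself resolved
  // against the document URL); otherwise fall back to the document URL.
  // "document-only": ignore <base> entirely.
  const base =
    policy === "respect-base" && baseHref
      ? new URL(baseHref, documentUrl).href
      : documentUrl;
  return new URL(id, base).href;
}

const page = {
  documentUrl: "https://example.org/cms/1/index.html",
  baseHref: "https://example.org/content/",
};

console.log(resolveId("this-resource", page, "respect-base"));
// https://example.org/content/this-resource
console.log(resolveId("this-resource", page, "document-only"));
// https://example.org/cms/1/this-resource
```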

Key scenarios

The issue becomes more complex in the following scenarios.

Scenario 1: xml:base attribute on script tag

The document, if it is XHTML, might also have the xml:base attribute on the encapsulating script tag.

Example:

<html>
  <head>
    <base href="https://example.org/content/"/>
  </head>
  <body>
    <script type="application/ld+json" 
            xml:base="http://not.example.com/elsewhere/">
{
  "@context": {"@vocab": "http://schema.org/"},
  "id": "this-resource",
  "name": "Name of this Resource"
}
    </script>
  </body>
 </html>

Scenario 2: base given in link[@rel=canonical] element

The document might use the link element with the rel attribute value of canonical to set the canonical URI for the document. This could also be interpreted as the base URI. Future rel values might also have implications on the base URI that should be chosen.

Example:

<html>
  <head>
    <link href="https://example.org/content/"
          rel="canonical"/>
  </head>
  <body>
    <script type="application/ld+json" 
            base="http://not.example.com/elsewhere/">
{
  "@context": {"@vocab": "http://schema.org/"},
  "id": "this-resource",
  "name": "Name of this Resource"
}
    </script>
  </body>
 </html>

Scenario 3: base attribute in ancestor tags

Instead of directly on the script or in a global position such as base or link, the xml:base attribute might be on any ancestor element. This setting would be in effect when processing the script tag and could thus affect the processing of relative URIs.

Example:

<html>
  <head>
    <base href="https://example.org/content/"/>
  </head>
  <body>
    <div xml:base="http://not.example.com/ancestor/">
      <script type="application/ld+json">
{
  "@context": {"@vocab": "http://schema.org/"},
  "id": "this-resource",
  "name": "Name of this Resource"
}
      </script>
    </div>
  </body>
 </html>
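A processor that honored ancestor attributes would have to walk up the tree from the script element. The following is a rough sketch of that walk, assuming elements are modeled as plain objects with `attrs` and `parent` fields (not a real DOM API) and assuming the `<base href>` value acts as the outermost base; both are illustrative assumptions.

```javascript
// Sketch: compute the base in effect for an element by applying xml:base
// values from the outermost ancestor inward. Elements are plain objects
// ({ attrs, parent }); `effectiveBase` is a hypothetical helper.
function effectiveBase(element, outerBase) {
  // Collect the chain from the root down to the element.
  const chain = [];
  for (let node = element; node; node = node.parent) {
    chain.unshift(node);
  }
  // Each xml:base may itself be relative to the base established above it.
  let base = outerBase;
  for (const node of chain) {
    const xmlBase = node.attrs["xml:base"];
    if (xmlBase !== undefined) {
      base = new URL(xmlBase, base).href;
    }
  }
  return base;
}

const html = { attrs: {}, parent: null };
const body = { attrs: {}, parent: html };
const div = { attrs: { "xml:base": "http://not.example.com/ancestor/" }, parent: body };
const script = { attrs: { type: "application/ld+json" }, parent: div };

// The outer base is assumed to be the <base href> from the example's <head>.
const base = effectiveBase(script, "https://example.org/content/");
console.log(new URL("this-resource", base).href);
// http://not.example.com/ancestor/this-resource
```

Moving the script element out of the div would silently change the result, which is the kind of side effect the scenario is concerned with.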

Scenario 4: attributes mutated by JavaScript

In an environment that executes JavaScript, the attributes on any of the above elements could also be mutated by scripts. Injecting the script tag and its JSON-LD via JavaScript is a widely used mechanism in publishing workflows, and the injected data is extracted by Google for Web search features. Different environments would therefore resolve relative URIs differently, depending on whether the code is executed and, if so, on its results.

Example:

<html>
  <head>
    <base href="https://example.org/content/"/>
  </head>
  <body>
    <script type="application/ld+json">
{
  "@context": {"@vocab": "http://schema.org/"},
  "id": "this-resource",
  "name": "Name of this Resource"
}
    </script>
    <script>
document.querySelector("base").href = "https://www.example.org/" + Math.random();
    </script>
  </body>
 </html>

Scenario 5: iframes

The script element could be within an iframe. This is described in utmost clarity* in the processing steps for the fallback base URL. Advice on how to interpret the iframe fallback step would be appreciated.

Example:

<html>
  <head>
    <base id="base" href="https://example.org/content/"/>
  </head>
  <body>
    <iframe src="...">
<!-- transcluded content from ... -->
      <script type="application/ld+json">
{
  "@context": {"@vocab": "http://schema.org/"},
  "id": "this-resource",
  "name": "Name of this Resource"
}
      </script>
    </iframe>
  </body>
 </html>

* For the sarcasmly impaired, I reproduce the step for your specification parsing pleasure:

If document is an iframe srcdoc document, then return the document base URL of the Document’s browsing context’s browsing context container’s node document.
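One possible reading of that step, as a simplified sketch: documents are modeled as plain objects, and the field names (`url`, `isSrcdoc`, `container`, `baseHref`) are hypothetical stand-ins rather than DOM properties. Other steps of the real algorithm are omitted.

```javascript
// Simplified model of the fallback base URL and document base URL concepts.
// Field names on the document objects are hypothetical, not DOM properties.
function fallbackBaseUrl(doc) {
  // The quoted step: an iframe srcdoc document defers to the document
  // that contains the iframe.
  if (doc.isSrcdoc && doc.container) {
    return documentBaseUrl(doc.container);
  }
  // Remaining steps of the real algorithm (e.g. about:blank) are omitted.
  return doc.url;
}

function documentBaseUrl(doc) {
  // In this sketch, a <base href> wins and is resolved against the fallback.
  if (doc.baseHref) {
    return new URL(doc.baseHref, fallbackBaseUrl(doc)).href;
  }
  return fallbackBaseUrl(doc);
}

const outer = {
  url: "https://example.org/cms/1/index.html",
  baseHref: "https://example.org/content/",
};
const inner = { url: "about:srcdoc", isSrcdoc: true, container: outer };

console.log(documentBaseUrl(inner));
// https://example.org/content/
```

Under this reading, JSON-LD inside a srcdoc iframe would inherit the outer document's `<base>`, so the relative id would resolve exactly as it does in the outer document.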

Scenario 6: language (and other) attributes

Related to base URI resolution, the language of a string could likewise be affected by a lang attribute set in the DOM. It would be potentially surprising if base were processed but lang were not. RDFa does require processing lang as well as base.

Example:

<html>
  <head>
    <base href="https://example.org/content/"/>
  </head>
  <body lang="fr">
    <div>Cette page est écrite en français.</div>
    <script type="application/ld+json">
{
  "@context": {"@vocab": "http://schema.org/"},
  "id": "this-resource",
  "name": "Name of this Resource (in English)"
}
    </script>
  </body>
 </html>
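To make the concern concrete, here is a hedged sketch of what a DOM-aware processor might do with an inherited lang value. Elements are again plain objects with `attrs` and `parent` fields, and both helper names are hypothetical; no specification currently defines this behavior for data blocks.

```javascript
// Sketch: find the nearest lang attribute on the element or its ancestors.
// Elements are plain objects ({ attrs, parent }), not real DOM nodes.
function inheritedLang(element) {
  for (let node = element; node; node = node.parent) {
    if (node.attrs.lang) return node.attrs.lang;
  }
  return undefined;
}

// Tag a plain string with a language using JSON-LD's value-object form.
function tagString(value, lang) {
  return lang ? { "@value": value, "@language": lang } : value;
}

const body = { attrs: { lang: "fr" }, parent: null };
const script = { attrs: { type: "application/ld+json" }, parent: body };

const tagged = tagString("Name of this Resource (in English)", inheritedLang(script));
console.log(JSON.stringify(tagged));
// {"@value":"Name of this Resource (in English)","@language":"fr"}
```

In this sketch the English name ends up tagged as French, which suggests that inheriting lang from the DOM could be as surprising as ignoring it.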

Detailed Design Discussion

The general, but not unanimous, consensus of the JSON-LD 1.1 Working Group is that features of the DOM MUST NOT be taken into account when determining the base URI for resolving relative URIs included in data-block values, such as JSON-LD. This position rests on the following observations:

  • The barrier to entry is higher, as a DOM processor is needed
  • If JavaScript execution is required, the barrier to entry is significantly higher
  • The unintended effects of moving an expected-to-be-standalone script element around in the DOM would be a negative developer experience
  • Templating content management systems are very common and would affect data blocks, making consistency difficult to ensure in any reasonably sized organization
  • The definition of data-block in HTML implies that the surrounding DOM SHOULD NOT affect the processing of the contents, but is not entirely clear
  • Relative URIs are valuable in many scenarios, and are called out as a recommendation in the LDP Best Practices document, wherein JSON-LD is a required serialization (4.3.2.3).
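The low barrier to entry behind the consensus position can be sketched as follows: data blocks are pulled out without building a DOM, and the relative id is resolved against the retrieval URL only. The function name `extractJsonLd` is hypothetical, the regular expression is a deliberate simplification rather than a robust HTML parser, and treating the examples' `id` key as the node identifier is an assumption carried over from the examples above.

```javascript
// Sketch: extract JSON-LD data blocks without a DOM and resolve the node
// identifier against the retrieval URL only. Not a robust HTML parser.
function extractJsonLd(html, documentUrl) {
  const pattern =
    /<script\b[^>]*\btype\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results = [];
  for (const match of html.matchAll(pattern)) {
    const data = JSON.parse(match[1]);
    // The examples use "id" as the node identifier; "@id" is also accepted.
    const key = "@id" in data ? "@id" : "id";
    if (typeof data[key] === "string") {
      data[key] = new URL(data[key], documentUrl).href;
    }
    results.push(data);
  }
  return results;
}

const html = `
<html>
  <head><base href="https://example.org/content/"/></head>
  <body>
    <script type="application/ld+json">
    {"@context": {"@vocab": "http://schema.org/"}, "id": "this-resource"}
    </script>
  </body>
</html>`;

const [node] = extractJsonLd(html, "https://example.org/cms/1/index.html");
console.log(node.id);
// https://example.org/cms/1/this-resource
```

Because the base element is never consulted, the result does not change if the script element is moved or the surrounding template is altered.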

A significant counterargument is that an HTML document that contains both RDFa or Microdata and embedded JSON-LD would be required to use different processing rules for extracting the knowledge graph from the different formats, likely resulting in the disconnected system that we aim to avoid in the first place. Google's extraction processor does take these DOM features into account, including JavaScript execution.

References and Acknowledgements

References are included inline in the description above.

The primary participants in this discussion to date are:

  • Rob Sanderson (@azaroth42) (chair)
  • Benjamin Young (@bigbluehat) (chair)
  • Gregg Kellogg (@gkellogg) (editor)
  • Ivan Herman (@iherman) (staff contact)
  • Dan Brickley (@danbri)
  • Adam Soroka
  • Jeff Mixter
  • David Lehn
  • Harold Solbrig
  • Pierre-Antoine Champin
  • Hadley Beeman
@iherman

iherman commented Oct 29, 2018

(a) I wonder whether we should bring the xml:base attribute into the mix. The HTML5 standard says:

The xml:base attribute may be used on HTML elements of XML documents. Authors must not use the xml:base attribute on HTML elements in HTML documents.

(Emphasis is mine.) We define embedded JSON-LD in HTML which, per the chain of recommendations, means HTML5 these days, and not arbitrary XML. My reading is that in HTML5 the usage of xml:base is not permitted. (It does not even appear in the HTML5 living standard at all, by the way.) I propose to remove this, i.e., examples 1 and 3.

(b) I think it would be better to refer to the baseURI attribute of the DOM element, as it appears on the script element. That would take care of scenarios 1, 4, and 5, I believe (if anybody understands 5 :-) and would make it unnecessary to treat them separately. Note, however, that there is no equivalent (alas!) for the language tag.

@gkellogg

My view is that “XML document” includes application/xhtml+xml, and this behavior was defined for HTML5 at one time. If they removed it, I’d expect some change indication someplace.

@BigBlueHat

@gkellogg it's now limited to XML-only processing of HTML documents per https://www.w3.org/TR/html/dom.html#the-xmlbase-attribute-xml-only

The scope did creep here quite a bit...so I'm writing up some scenario reviews in the original JSON-LD Syntax issue #23.

@BigBlueHat

Lengthy (sorry!) feedback left on w3c/json-ld-syntax#23 (comment)
