RFC 3986 defines the syntax of URIs. When URIs are used with linked data, only the scheme and hier-part are significant; the query and fragment are ignored.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
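The top-level URI rule above can be illustrated with Python's standard `urllib.parse`. This is a minimal sketch, not a full RFC 3986 parser: the helper name `uri_components` and the sample URI are assumptions made for illustration, and the hier-part is rebuilt from authority and path only.

```python
from urllib.parse import urlsplit

def uri_components(uri: str) -> dict:
    """Split a URI into the four components of the RFC 3986 top-level rule."""
    parts = urlsplit(uri)
    # hier-part is "//" authority path when an authority is present,
    # otherwise just the path (simplified for illustration).
    hier_part = f"//{parts.netloc}{parts.path}" if parts.netloc else parts.path
    return {
        "scheme": parts.scheme,
        "hier-part": hier_part,
        "query": parts.query,
        "fragment": parts.fragment,
    }

# Hypothetical URI with a query and fragment added for demonstration.
components = uri_components("http://example.org/book/book1?lang=en#title")
print(components)
```

Under the reading given at the top of this section, only the `scheme` and `hier-part` entries would take part in identifying the resource.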
SPARQL defines two forms of IRIs: IRI references and prefixed names. The mapping between the two is:
A prefixed name is a prefix label and a local part, separated by a colon ":".
A prefixed name is mapped to an IRI by concatenating the IRI associated with the prefix
and the local part. The prefix label or the local part may be empty.
[136] iri ::= IRIREF | PrefixedName
[137] PrefixedName ::= PNAME_LN | PNAME_NS
[141] PNAME_LN ::= PNAME_NS PN_LOCAL
The following fragments are some of the different ways to write the same IRI:
<http://example.org/book/book1>
BASE <http://example.org/book/>
<book1>
PREFIX book: <http://example.org/book/>
book:book1
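The equivalence of the three forms can be checked with a short sketch. `urljoin` from the standard library stands in for BASE resolution, and `expand_prefixed_name` is a hypothetical helper that performs the concatenation described above; neither is a SPARQL implementation.

```python
from urllib.parse import urljoin

def expand_prefixed_name(pname: str, prefixes: dict[str, str]) -> str:
    """Map a prefixed name to an IRI: prefix IRI + local part."""
    # Split on the first colon; either side may be empty.
    label, _, local = pname.partition(":")
    return prefixes[label] + local

full = "http://example.org/book/book1"

# BASE <http://example.org/book/> followed by the relative reference <book1>
via_base = urljoin("http://example.org/book/", "book1")

# PREFIX book: <http://example.org/book/> followed by book:book1
via_prefix = expand_prefixed_name("book:book1", {"book": "http://example.org/book/"})

print(via_base == full, via_prefix == full)
```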
SPARQL's well-defined mapping between IRI and PrefixedName is the key to understanding how to model and serialize IRIs used as Element IDs. The relevant modeling concept is the derived attribute:
Derived attributes are attributes that do not exist in the physical data, but their values are derived from other attributes present in the data. For example, age can be derived from date_of_birth.
--- Entity-Relationship Modeling
In a class model representing Elements identified by IRIs, the following types can be used to model the relationship between physical data in SPDX documents and derived attribute values:
- ElementId: a globally-unique IRI property of an Element
- ElementRef: an IRI reference to an Element, e.g., Document/rootElement or Relationship/to
- Namespace: (PNAME_NS) the BASE IRI or the IRI corresponding to a PREFIX string.
- Lui: (Local Unique Identifier = PN_LOCAL) the local part of PrefixedName
- Hint: information that is not part of ElementId but may be included in ElementRef to aid readability
For collections of multiple Elements such as Document and ContextualCollection, Lui uniquely identifies the Element within the collection. For singleton Elements, Lui MAY be empty/null, in which case the ElementId IRI equals its Namespace.
The productions and functions used to support derived attributes are:
elementId ::= a production of namespace and lui that yields an iri
elementRef ::= a production of namespace, lui, and hint that yields an iri
elementId = element_ref_format(namespace, lui)
elementRef = element_ref_format(namespace, lui, hint)
namespace, lui, hint = element_ref_parse(iri)
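A minimal sketch of these two functions follows. The function names come from the list above; everything else is an assumption: the hint is taken to carry its own leading "?" or "#", and the namespace is taken to end at the last "/" of the path, which is only one possible splitting rule. The example.org IRI is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def element_ref_format(namespace: str, lui: str = "", hint: str = "") -> str:
    """Build an IRI; with no hint the result is the ElementId."""
    return f"{namespace}{lui}{hint}"

def element_ref_parse(iri: str) -> tuple[str, str, str]:
    """Recover (namespace, lui, hint) from an IRI.

    Assumption: the namespace ends at the last "/" of the path, and
    any query and fragment together form the hint.
    """
    parts = urlsplit(iri)
    hint = ""
    if parts.query:
        hint += "?" + parts.query
    if parts.fragment:
        hint += "#" + parts.fragment
    head, sep, lui = parts.path.rpartition("/")
    namespace = urlunsplit((parts.scheme, parts.netloc, head + sep, "", ""))
    return namespace, lui, hint

ref = "http://example.org/docs/doc1/4#LicenseRef"
namespace, lui, hint = element_ref_parse(ref)
element_id = element_ref_format(namespace, lui)
print(namespace, lui, hint, element_id)
```

Formatting the parsed values with the hint reproduces the original reference, and a singleton Element with an empty Lui yields an ElementId equal to its Namespace.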
The SPDX definition of PN_LOCAL MUST be consistent with that of SPARQL, but MAY impose additional restrictions (such as prohibiting escaped characters) in the interest of efficiency and readability.
The following IRIs are aliases for the same resource:
- [1] https://www.amazon.com/dp/0127999574
- [2] https://www.amazon.com/RDF-Database-Systems-Triples-Processing/dp/0127999574
- [3] https://www.amazon.com/dp/0127999574#RDF-Database-Systems-Triples-Processing
Compared as strings, these would be considered three different resources. Treating them as IRIs derived from the same Namespace and Lui exposes the aliasing and allows the single resource to be recognized:
- [1] Namespace = https://www.amazon.com/dp/, Lui = 0127999574
- [3] Namespace = https://www.amazon.com/dp/, Lui = 0127999574, Hint = RDF-Database-Systems-Triples-Processing
Amazon uses server-side processing to alias pages with and without embedded hints. Unless a consumer interprets the Amazon-specific Lui delimiter "/dp/", IRI [2] is not recognized as an alias for the same resource as the other two:
- [2] Namespace = https://www.amazon.com/RDF-Database-Systems-Triples-Processing/dp/, Lui = 0127999574
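The effect of interpreting the "/dp/" delimiter can be sketched as follows. `site_key` is a hypothetical helper, and the rule it applies (discard any path text before "/dp/" and ignore the fragment) is an assumption inferred from the three aliases above, not documented Amazon behavior.

```python
from urllib.parse import urlsplit

def site_key(iri: str, delimiter: str = "/dp/") -> tuple[str, str]:
    """Return (namespace, lui) using a site-specific Lui delimiter."""
    parts = urlsplit(iri)  # the fragment, if any, is split off here
    _slug, sep, lui = parts.path.partition(delimiter)
    if not sep:
        raise ValueError(f"no {delimiter!r} delimiter in {iri!r}")
    # Any path text before the delimiter is treated as a hint and dropped.
    namespace = f"{parts.scheme}://{parts.netloc}{delimiter}"
    return namespace, lui

aliases = [
    "https://www.amazon.com/dp/0127999574",
    "https://www.amazon.com/RDF-Database-Systems-Triples-Processing/dp/0127999574",
    "https://www.amazon.com/dp/0127999574#RDF-Database-Systems-Triples-Processing",
]
keys = {site_key(a) for a in aliases}
print(len(set(aliases)), len(keys))
```

The three strings are distinct, but all of them reduce to the same (Namespace, Lui) pair once the delimiter is interpreted.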
Note that Lui may sometimes be a "component id", since ISBNs, grocery barcode UPCs, etc. identify a specific component across more than one vendor, application, or namespace. But Lui is always local to the namespace under which it appears, since the same value 0127999574 under a different namespace may be entirely unrelated to its use as an ISBN.
Identifiers.org is a SPARQL-based service to enable on-the-fly integration of life science data. Identifiers.org registers and assigns short names to major producers of life sciences data, for example "csd" for the Cambridge Crystallographic Data Centre. The producers in turn identify their products by id.
- Resource URI: https://identifiers.org/csd:PELNAW
- Namespace = https://identifiers.org/, Lui = csd:PELNAW
SPDX does not need to understand multi-level namespace hierarchies, but treating the top level in accordance with linked data standards ensures that the SPDX standard does not preclude more complex use cases.
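As a rough illustration of the two levels, the sketch below splits the resource URI first at the top-level namespace and then, optionally, at the colon inside the Lui. The helper name `split_levels` and the choice of the first colon as the second-level separator are assumptions made for this example.

```python
def split_levels(iri: str, top: str = "https://identifiers.org/") -> tuple[str, str, tuple[str, str]]:
    """Return (namespace, lui, (producer, product_id)) for a two-level identifier."""
    if not iri.startswith(top):
        raise ValueError(f"{iri!r} is not under {top!r}")
    lui = iri[len(top):]
    # Second level: a consumer that only needs the top level can ignore this.
    producer, _, product_id = lui.partition(":")
    return top, lui, (producer, product_id)

namespace, lui, (producer, product_id) = split_levels("https://identifiers.org/csd:PELNAW")
print(namespace, lui, producer, product_id)
```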
SPDX v2 recognizes IRI structure by defining documentNamespace (e.g., http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330) and SPDXID (e.g., SPDXRef-File, SPDXRef-JenaLib, DocumentRef-spdx-tool-1.2:SPDXRef-ToolsElement, LicenseRef-4). If SPDXID were an unstructured string type, the namespace would be repeated throughout the document every time an id was defined or referenced:
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/SPDXRef-File
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/SPDXRef-JenaLib
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/DocumentRef-spdx-tool-1.2:SPDXRef-ToolsElement
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/LicenseRef-4
No XML document, whether representing linked data or not, treats a URI as a plain string, just as none treats a number as a string in which "1.0" has a different value than "1.00". The equivalence between IRI and PrefixedName, and the ability to derive one from the other as defined in SPARQL, must be respected in SPDX v3.
SPDX v2 includes both unique identifier and hint information in SPDXID values. But SPARQL, by explicitly excluding query and fragment when comparing IRIs, provides a standard syntax for hints. SPDX v3 does not need to define anything further: producing tools may include whatever query and/or fragment values they deem useful in ElementRefs, knowing that consuming tools ignore them.
The following ElementRefs identify the same Element:
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/4
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/4#LicenseRef
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/4?type=License&name=Apache-2.0
The properties used to derive the IRI lexical value are:
namespace = http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/
lui = 4
hint = "", "#LicenseRef", or "?type=License&name=Apache-2.0", respectively
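A consuming tool could apply this rule by discarding the query and fragment before comparing references. The sketch below uses the three ElementRefs above; `strip_hint` is a hypothetical helper name, and it assumes that both query and fragment are hints, as proposed here.

```python
from urllib.parse import urlsplit

def strip_hint(element_ref: str) -> str:
    """Drop query and fragment, leaving the part treated as the ElementId."""
    return urlsplit(element_ref)._replace(query="", fragment="").geturl()

NS = "http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/"
refs = [
    NS + "4",
    NS + "4#LicenseRef",
    NS + "4?type=License&name=Apache-2.0",
]
element_ids = {strip_hint(r) for r in refs}
print(element_ids)
```

All three references collapse to a single ElementId, the namespace followed by the Lui "4".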
@davaya Nicely organized and researched set of design considerations!
Agree with most of the proposal with the exception of using fragments for the hints.
According to section 1.4 of the SPARQL Query spec, fragments are allowed in URIs. SPDX 2.2 uses fragments explicitly when converting between tag/value Luis and RDF URIs.
I would propose using the query for all hints and allowing fragments to be used as part of the URI. This would preserve compatibility with SPDX 2.2 and permit the common practice of using fragments to designate unique components within a single web page (e.g., a web page holding a single SPDX document, where the page URL is the namespace URI and the Luis are fragments within that page).
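A rough sketch of what this alternative might look like: only the query is treated as a hint, and the fragment becomes the Lui. The function name `parse_fragment_lui` and the example page URL are hypothetical, and ending the namespace in "#" is an assumption about how the page URL and fragment would be joined.

```python
from urllib.parse import urlsplit

def parse_fragment_lui(iri: str) -> tuple[str, str, str]:
    """Return (namespace, lui, hint) with the fragment as Lui and the query as hint."""
    parts = urlsplit(iri)
    hint = "?" + parts.query if parts.query else ""
    # The page URL without query or fragment, plus "#", is the namespace.
    namespace = parts._replace(query="", fragment="").geturl() + "#"
    return namespace, parts.fragment, hint

ref = "http://example.org/sbom.html?name=Apache-2.0#SPDXRef-File"
namespace, lui, hint = parse_fragment_lui(ref)
print(namespace, lui, hint)
```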