@davaya
Last active August 17, 2021 19:52
Design considerations for the SPDX v3 class model

SPDX Element IDs

RFC 3986 defines the syntax of URIs. When used with linked data, only scheme and hier-part are significant; query and fragment are ignored.

    URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]  
    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"  
    reserved    = gen-delims / sub-delims  
    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"  
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"  
                      / "*" / "+" / "," / ";" / "="  

SPARQL defines two forms of IRIs: IRI references and prefixed names. The mapping between the two is:

A prefixed name is a prefix label and a local part, separated by a colon ":".
A prefixed name is mapped to an IRI by concatenating the IRI associated with the prefix
and the local part. The prefix label or the local part may be empty.

[136] iri           ::= IRIREF | PrefixedName
[137] PrefixedName  ::= PNAME_LN | PNAME_NS
[141] PNAME_LN      ::= PNAME_NS PN_LOCAL

The following fragments are some of the different ways to write the same IRI:

<http://example.org/book/book1>

BASE <http://example.org/book/>
<book1>

PREFIX book: <http://example.org/book/>
book:book1
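
The concatenation rule quoted above can be sketched in a few lines of Python. This is only an illustration: the function name `expand_prefixed_name` and the `prefixes` table are assumptions introduced here, and the sketch ignores the escape sequences that SPARQL permits in a local part.

```python
def expand_prefixed_name(pname: str, prefixes: dict[str, str]) -> str:
    """Map a prefixed name such as 'book:book1' to a full IRI by concatenation."""
    # The prefix label cannot contain ':', so the first colon is the separator.
    label, sep, local = pname.partition(":")
    if not sep:
        raise ValueError(f"not a prefixed name: {pname!r}")
    if label not in prefixes:
        raise KeyError(f"undeclared prefix: {label!r}")
    # Either the label or the local part may be empty.
    return prefixes[label] + local


# Hypothetical prefix declarations, equivalent to SPARQL PREFIX lines.
prefixes = {"book": "http://example.org/book/", "": "http://example.org/"}
iri = expand_prefixed_name("book:book1", prefixes)
```

Here `iri` holds the same IRI as the first form shown above, `http://example.org/book/book1`.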

The mapping that SPARQL defines between IRI and PrefixedName is the key to modeling and serializing IRIs used as Element IDs, because it lets the full IRI be treated as a derived attribute:

Derived attributes are attributes that do not exist in the physical data, but their values are derived from other attributes present in the data. For example, age can be derived from date_of_birth.
--- Entity-Relationship Modeling

In a class model representing Elements identified by IRIs, the following types can be used to model the relationship between physical data in SPDX documents and derived attribute values:

  • ElementId: a globally-unique IRI property of an Element
  • ElementRef: an IRI reference to an Element, e.g. Document/rootElement, Relationship/to.
  • Namespace: (PNAME_NS) the BASE IRI or the IRI corresponding to a PREFIX string.
  • Lui: (Local Unique Identifier = PN_LOCAL) the local part of PrefixedName
  • Hint: information that is not part of ElementId but may be included in ElementRef to aid readability

For collections of multiple Elements such as Document and ContextualCollection, Lui uniquely identifies the Element within the collection. For singleton Elements, Lui MAY be empty/null, in which case the ElementId IRI equals its Namespace. The productions and functions used to support derived attributes are:

elementId    ::= a production of namespace and lui that yields an iri
elementRef   ::= a production of namespace, lui, and hint that yields an iri

elementId = element_ref_format(namespace, lui)
elementRef = element_ref_format(namespace, lui, hint)
namespace, lui, hint = element_ref_parse(iri)
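
One possible reading of these functions, as a Python sketch rather than a definitive implementation. It assumes hierarchical (http-style) IRIs, assumes the namespace ends at the last "/" of the path, and treats any query and fragment as the hint. None of these splitting rules is fixed by the text above, and the example IRI is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def element_ref_format(namespace: str, lui: str = "", hint: str = "") -> str:
    """Derive an IRI from its parts; with an empty hint this is the ElementId."""
    return namespace + lui + hint


def element_ref_parse(iri: str) -> tuple[str, str, str]:
    """Split an IRI into (namespace, lui, hint) under the assumptions above."""
    parts = urlsplit(iri)
    hint = ""
    if parts.query:
        hint += "?" + parts.query
    if parts.fragment:
        hint += "#" + parts.fragment
    # Assumption: the Lui is the last path segment; everything before it,
    # including the trailing "/", is the namespace.
    head, _, lui = parts.path.rpartition("/")
    namespace = urlunsplit((parts.scheme, parts.netloc, head + "/", "", ""))
    return namespace, lui, hint


namespace, lui, hint = element_ref_parse("http://example.org/doc1/4?type=License#x")
element_id = element_ref_format(namespace, lui)
```

A singleton Element with an empty Lui round-trips to an ElementId equal to its Namespace, matching the rule stated above.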

Examples

The SPDX definition of PN_LOCAL MUST be consistent with that of SPARQL, but MAY impose additional restrictions (such as prohibiting escaped characters) in the interest of efficiency and readability.

Amazon product page

The following IRIs are aliases for the same resource:

When treated as strings they would be considered three different resources. But treating them as IRIs derived from the same properties identifies the aliasing and allows the single resource to be recognized.

Amazon uses server-side processing to alias pages with and without embedded hints. Without interpreting an Amazon-specific Lui delimiter "/dp/", the IRI is not recognized as an alias for the same resource as the other two:

Note that Lui may sometimes be a "component id", since ISBNs, grocery barcode UPCs, etc. identify a specific component across more than one vendor, application, or namespace. But Lui is always local to the namespace under which it appears, since the same value 0127999574 under a different namespace may be entirely unrelated to its use as an ISBN.

Identifier Conversion Service

Identifiers.org is a SPARQL-based service to enable on-the-fly integration of life science data. Identifiers.org registers and assigns short names to major producers of life sciences data, for example "csd" for the Cambridge Crystallographic Data Centre. The producers in turn identify their products by id.

SPDX does not need to understand multi-level namespace hierarchies, but treating the top level in accordance with linked data standards ensures that the SPDX standard does not preclude more complex use cases.

SPDX v2

SPDX v2 recognizes IRI structure by defining documentNamespace (e.g., http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330) and SPDXID (e.g., SPDXRef-File, SPDXRef-JenaLib, DocumentRef-spdx-tool-1.2:SPDXRef-ToolsElement, LicenseRef-4). If SPDXID were an unstructured string type, the namespace would be repeated throughout the document every time an id was defined or referenced:

http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/SPDXRef-File  
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/SPDXRef-JenaLib  
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/DocumentRef-spdx-tool-1.2:SPDXRef-ToolsElement  
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/LicenseRef-4  

No XML document, whether representing linked data or not, treats a URI as a plain string, just as none treats a number as a string in which "1.0" has a different value than "1.00". The equivalence between IRI and PrefixedName, and the ability to derive one from the other as defined in SPARQL, must be respected in SPDX v3.

Hints

SPDX v2 includes both unique identifier and hint information in SPDXID values. But SPARQL, by explicitly excluding query and fragment when comparing IRIs, provides a standard syntax for hints. SPDX v3 does not need to define anything else in order for producing tools to include whatever query and/or fragment values in ElementRefs are deemed useful, knowing that they are ignored by consuming tools.

The following ElementRefs identify the same Element:

http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/4  
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/4#LicenseRef  
http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/4?type=License&name=Apache-2.0  

The properties used to derive the IRI lexical value are:

namespace = http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/  
lui = 4  
hint = "", "#LicenseRef" or "?type=License&name=Apache-2.0" respectively  
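
A short Python sketch of how a consuming tool might ignore hints under this proposal. The helper name `element_id` is an assumption made for illustration; it simply drops the query and fragment before comparing.

```python
from urllib.parse import urlsplit, urlunsplit

NS = "http://spdx.org/spdxdocs/spdx-example-444504E0-4F89-41D3-9A0C-0305E82C330/"


def element_id(element_ref: str) -> str:
    """Drop query and fragment (the hint) and keep scheme and hier-part."""
    s = urlsplit(element_ref)
    return urlunsplit((s.scheme, s.netloc, s.path, "", ""))


refs = [
    NS + "4",
    NS + "4#LicenseRef",
    NS + "4?type=License&name=Apache-2.0",
]
# All three references collapse to one ElementId.
ids = {element_id(r) for r in refs}
```

Compared as plain strings the three references differ; compared by derived ElementId they name a single Element.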
@goneall

goneall commented Aug 13, 2021

@davaya Nicely organized and researched set of design considerations!

Agree with most of the proposal with the exception of using fragments for the hints.

According to SPARQL Query spec section 1.4, fragments are allowed in URIs. SPDX 2.2 uses fragments explicitly when converting between tag/value Luis and RDF URIs.

I would propose using the query for all hints and allowing fragments to be used as part of the URI. This would enable compatibility with SPDX 2.2 and allow for the common practice of using fragments to designate unique components within the same web page (e.g., a web page holding a single SPDX document, with that page's URL as the namespace URI and the Luis as fragments within the page).

@sbarnum

sbarnum commented Aug 17, 2021

Thank you David for the writeup.

I agree with the top portions about structure of URIs and the functionality in SPARQL to enable prefixing for simplification of query syntax (though the SPARQL processor actually fully expands it back to an IRI string before acting on it).

I would very strongly disagree with the following assertion though:
"For collections of multiple Elements such as Document and ContextualCollection, Lui uniquely identifies the Element within the collection."
I think this is the heart of the issue under discussion.
I will assert that it is absolutely imperative that Elements (and thus their IDs) are independent of any other construct or element. They must be independently definable and independently referenceable. They MUST NOT be forced to be secondary resources to some container (Document or ContextualCollection). That does not preclude usage scenarios that desire to always define and convey their own content in such containers. Independent Elements support both modes of operation.
This issue was discussed extensively within the 3T-SBOM efforts and is the reason that the current model is structured to support Element independence.

I suspect that if we can get past this issue and recognize the requirement for Element independence, many of the other details we are getting wrapped up in will tend to resolve themselves.

@davaya
Author

davaya commented Aug 17, 2021

I think we are using different words to describe the same idea. I agree that an Element must be referenceable, from a collection or not, as an atomic item regardless of any other collections that may also reference the same Element. The key to harmonizing the ideas is recognizing the difference between "define" and "reference".

  • An element is defined with a Namespace and Lui.
  • An element is referenced from within a Document (or context) using:
    1. Lui (for Elements defined within the same context), or
    2. Namespace:Lui for Elements defined anywhere other than in the referencing context.
  • An element is referenced from outside a context using Namespace:Lui
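
The three cases can be sketched as a small resolver in Python. The function name, the prefix table, and the convention that an external reference is written as `prefix:lui` are assumptions for illustration only; the sketch also assumes a bare Lui contains no colon.

```python
def resolve_element_ref(ref: str, context_namespace: str, prefixes: dict[str, str]) -> str:
    """Resolve a reference written inside a context to a full IRI."""
    label, sep, lui = ref.partition(":")
    if not sep:
        # Bare Lui: the Element is defined in the referencing context.
        return context_namespace + ref
    # Namespace:Lui form: the Element is defined somewhere else.
    return prefixes[label] + lui


# Hypothetical namespaces for two documents.
prefixes = {"b": "http://example.org/docB/"}
local_iri = resolve_element_ref("4", "http://example.org/docA/", prefixes)
external_iri = resolve_element_ref("b:7", "http://example.org/docA/", prefixes)
```

Either way the result is a complete Namespace plus Lui IRI, so an Element stays independently referenceable regardless of which context mentions it.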

For SPDX, this does not require an External Reference to reference an external Element. But External Reference MAY be used to supply additional context (like a hash to validate the referenced Element).

On a more fundamental level, Element does not have to contain its own IRI, just as a book does not need to contain its own ISBN or a file contain its own filename. An Element is defined with an IRI, but putting that IRI inside the Element is just a convenience for indexing and retrieval services. A webserver given a URL will return content based on a filename, not based on the content of the file.
Defining an Element assigns an IRI to the Element's properties.
Referencing an Element doesn't need an IRI property in the content; the IRI is already known since it is the reference.
This is easily visualized cryptographically: if you hash an Element, does the hash value depend on the IRI used to reference it? If yes, the IRI is logically part of the Element. If no, the IRI is an index to the Element.
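
The hash test can be tried out with a small Python sketch. The canonical form (sorted-key JSON), the property names, and the IRIs below are all assumptions chosen for illustration, not part of any SPDX serialization.

```python
import hashlib
import json


def content_hash(properties: dict) -> str:
    """Hash an Element's properties using a deterministic JSON encoding."""
    canonical = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


element = {"type": "File", "name": "example.txt"}  # hypothetical properties
iri_a = "http://example.org/docA/4"
iri_b = "http://example.org/mirror/4"

# IRI as index: the same content is reachable under two IRIs, with one hash.
store = {iri_a: element, iri_b: element}
same = content_hash(store[iri_a]) == content_hash(store[iri_b])

# IRI as property: embedding the IRI makes the hash depend on it.
differs = content_hash({**element, "id": iri_a}) != content_hash({**element, "id": iri_b})
```

Both `same` and `differs` come out true, which is the distinction drawn above between an IRI that indexes an Element and one that is logically part of it.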
