@pfrazee
Created June 8, 2022
An initial draft proposal of record schemas for Bluesky's ADX project.

Schemas Design Doc (draft)

Please note: The following document is an initial draft proposal. All decisions are subject to change. Our present goal is to collect feedback and iterate upon this document. Please feel free to share your suggestions and concerns.

Overview

ADX is a federated network for distributing data. It leverages cryptographic signatures and hashes to distribute authenticity proofs along with the data, enabling each node to transact upon the data independently of the data's origin. ADX might therefore be described as an Internet-native database in which records are replicated across nodes.

As a consequence of relying on authenticity proofs, ADX must exchange "canonical" records. That is, ADX records must be transmitted in their original encoding and structure in order to validate the signatures and hashes which comprise the proofs. This stands in contrast to the RESTful model of the Web in which "representations" of records are exchanged and therefore may be constructed at the time of exchange. While ADX records may be stored and queried in a variety of forms, they must be transmitted in their canonical form.

The canonical form implies an encoding, layout, and underlying data model which is shared by all ADX nodes. Again, this stands in contrast to Internet applications which interoperate through messaging. ADX is a record-oriented network and must provide a sufficiently-general data model for a wide variety of applications. Developers will likely view ADX as a database within their software stack; while records can be copied into other databases and systems, any records to be transmitted must be written to the ADX systems in canonical form.

In addition to the data model of the canonical form, ADX applications must also agree upon the semantics of the exchanged records. This is generally referred to as the "schemas." In ADX, schemas inform the data model and vice versa; therefore this document encompasses encoding, data model, and semantics.

Since the start of Bluesky, schemas have been highlighted by engineers both inside and outside the team as a linchpin of the project's success. This interest reflects many factors: the impact of schemas on developer experience, their relevance to the evolvability of the network, and the strong opinions held by subject-matter experts in the space. Schemas are one of the most hotly debated topics in the community, and a good solution will consider as many of the known solutions as possible.

Three dominant philosophies have emerged in decentralized networks: global term definitions via RDF, convention-oriented freeform objects, and networked programs such as Ethereum's smart contracts. It's worth giving each a brief overview and discussing their strengths and weaknesses:

Global terms (RDF)

RDF is a highly-general model for creating unambiguous semantics. It uses a directed graph to organize all information into "triples" of facts. Many developers are only aware of RDF via JSON-LD, a format which provides an object-document abstraction over RDF while preserving the graph model.

RDF's strengths are its rigour, its standards-driven governance, its wide adoption in the Fediverse, and its flexibility. Its weaknesses are its complexity, poor DX, and unfamiliarity outside of certain developer niches.

JSON-LD has demonstrated that well-designed tooling can overcome the weaknesses of RDF when consuming schemas, but authoring new schemas (vocabularies) remains a daunting task.

Freeform objects

Convention-driven systems have a rich history in the indie-hacker culture and are often proposed as a solution to decentralized networks. These models typically use an object-document model and leave developers free to populate the document however they see fit, often with a few baked-in conventions such as indexing upon a "type" attribute.

The strengths of freeform objects include evolvability, flexibility, and ease of understanding. The weaknesses include the slow development of conventions, lack of coordination between separate teams/orgs, and frequent incompatibilities between applications.

Freeform objects often rely upon application libraries and can enable bazaar-style innovation. However, the innovation process can often be frustrating for end-users and developers as incompatibilities surface frequently and can be slow to resolve.

Networked programs

Blockchain-based systems like Ethereum have recently advanced the use of a shared runtime which abstracts the network. Programs on the runtime (smart contracts) encapsulate state with a set of APIs which enforce schemas and business logic.

Networked programs benefit from their intuitive nature: developers can think of them like regular programs, or perhaps like Web APIs. The bytecode is publicly available and can be connected to the source code to clearly explain a contract's behavior. However, the current models suffer from a great deal of runtime overhead, the gas-fee incentive to pre-optimize, and subtle complexities which lead to bugs.

Bluesky is not using a blockchain; however, there are interesting lessons to be learned from networked programs. Declarative, machine-readable definitions could be distributed over the network to instruct general-purpose nodes to enforce useful behaviors.

This proposal builds upon RDF's global terms with tooling inspired by the freeform-objects philosophy. The intent is to provide optional-but-recommended mechanics which assist developers without overly constraining them.

A slightly richer set of "value types" is used in the encoding of ADX records than is common for formats such as JSON. This enables ADX records to self-describe with some higher-level semantics, facilitating schema-free operation.

Schemas provide additional semantics, descriptions, constraints, and properties for ADX's value types. They assist in the interpretation and consistent usage of ADX records. Schemas may be published to the ADX network in a machine-readable form, enabling convenient distribution and access by software. However, most usages of schemas are optional, ensuring that the network remains flexible. Any software which depends on network-accessible schemas must additionally provide a fallback behavior for when a schema is not available.

The semantics of schemas are based on RDF. This serves multiple goals: to benefit from the rigor of RDF, to leverage existing RDF software and techniques, and to ensure interoperability with systems outside of ADX. Many of the systems in this document can be described as a DSL over RDF.

Objectives

  1. Schema evolvability. The schema system must enable developers to extend, evolve, and repurpose the network and its data.
    • Ideally this should occur with minimal upfront social consensus – developers should not need to convince a "spec owner" to modify their schema in order to make changes.
  2. Developer convenience. Schemas and their tooling should be obvious, easy to use, and empowering.
    • Tools should not overburden developers with strictness or busy-work. When there are guard-rails, those rails should inform the developer, not control them.
    • We should recognize that the deployment of new schemas is a common task and cannot depend on slow-moving standards bodies.
  3. Reduced incompatibilities. Incompatibilities are a coordination challenge which emerge during independent development. Often they result in an inconsistent user experience: markup showing up in text, features missing from content (eg absent embedded media), or unexpected behaviors between applications. These issues affect users and place a burden upon developers to introduce features without creating compatibility issues. Tooling which helps with coordination between teams can improve evolvability as developers can more easily predict the effects of their software.
  4. Avoid NIH. There is a large body of prior work available for schemas. When possible, this system should leverage existing technology in order to benefit from its software, corpus of specifications, and expertise.
    • When developing a novel solution, the reasoning for divergence should be clear and justified.
  5. Hash-friendly encodings. In order to validate authenticity proofs, ADX records must reliably serialize to a canonical form.
    • While JSON has wide adoption, it fails to provide a canonical encoding without additional rules such as sorted keys, discarded duplicate keys, and string-encoded decimal numbers. This requirement demands either a modified JSON encoding or some other encoding format.
  6. Unambiguous terms. Applications should agree upon which values are being shared.
    • Ambiguous terms – eg keynames in documents which are not well-defined – risk creating collisions which are difficult to resolve (and often difficult to detect and debug).
  7. User-friendly descriptions. ADX software must provide UIs such as permission prompts which describe the data being affected.
  8. Secure trust model. Any schema information must have a clear and secure trust model.
    • This is trivial for most applications as they choose which schemas to integrate and then act upon the records according to their own validation. However, in some situations the schemas are chosen by third-parties such as when providing permission prompts, enabling an application to misrepresent actions to the user or to the system. All usages of schemas must consider the effects of malicious actors.

Core concepts

Data encoding (CBOR)

ADX records are encoded using CBOR.

Value types

"Value types" establish the kinds of values in the data model.

The data model supports a subset of CBOR's available value types:

  • null: A CBOR simple value (major type 7, subtype 24) with a simple value of 22 (null).
  • boolean: A CBOR simple value (major type 7, subtype 24) with a simple value of 21 (true) or 20 (false).
  • integer: A CBOR integer (major type 0 or 1), choosing the shortest byte representation.
  • float: A CBOR floating-point number (major type 7). All floating-point values MUST be encoded as 64-bit (additional type value 27), even for integral values.
  • string: A CBOR string (major type 3).
  • list: A CBOR array (major type 4), where each element of the list is added, in order, as a value of the array according to its type.
  • map: A CBOR map (major type 5), where each entry is represented as a member of the CBOR map, with the entry key expressed as a CBOR string (major type 3).
  • datetime: A CBOR datetime (major type 6, tag 0), an ISO-8601-formatted date-time string.
  • uri: A CBOR uri (major type 6, tag 32), an RFC 3986-formatted URI string.

TODO: do we need uint53, int54?

TODO: do we need bignums?

TODO: do we need binary? I'm inclined to say no: we face problems when people stuff binary into records rather than using blobs
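To make the canonical-form requirement concrete, here is a minimal sketch of an encoder for the untagged value types above, written against CBOR's major-type rules. This is illustrative only: it is not ADX's implementation, and it omits the datetime/uri tags and any map key-ordering rules a canonical profile would also need.

```python
import struct

def _head(major, n):
    # Shortest-form CBOR head: major type in the high 3 bits,
    # then the value/length in the fewest bytes that fit.
    if n < 24:
        return bytes([(major << 5) | n])
    for ai, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        try:
            return bytes([(major << 5) | ai]) + struct.pack(fmt, n)
        except struct.error:
            continue
    raise ValueError("value too large")

def encode(value):
    # Booleans first: in Python, bool is a subclass of int.
    if value is None:
        return b"\xf6"                     # major 7, simple value 22 (null)
    if value is True:
        return b"\xf5"                     # simple value 21
    if value is False:
        return b"\xf4"                     # simple value 20
    if isinstance(value, int):             # major 0 (unsigned) or 1 (negative)
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, float):
        return b"\xfb" + struct.pack(">d", value)  # always 64-bit (ai 27)
    if isinstance(value, str):
        u = value.encode("utf-8")
        return _head(3, len(u)) + u        # major 3, text string
    if isinstance(value, list):
        return _head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        out = _head(5, len(value))         # major 5, map
        for k, v in value.items():
            out += encode(str(k)) + encode(v)
        return out
    raise TypeError(f"unsupported value type: {type(value)}")
```

For example, `encode(24)` yields the two-byte shortest form `18 18`, and every float is nine bytes regardless of value, which keeps the hash of a record stable across implementations.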

Data types

"Data types" establish the kinds of properties in the data model. They are used in schemas to define how properties should be interpreted.

Every data type defines an interpretation for each value type. In cases where no useful interpretation exists, the interpretation is mapped to null. For example, a datetime property can validly be set to string and null values; for all other values, the interpretation resorts to null.

All data types (builtin and user-defined) follow the RDF model. Consequently, the type-identifiers for the builtin simple types and user-defined record types expand to URIs. They are mapped to shortened terms for use in records.

Builtin data types

The data model supports a set of simple data types:

| Data type | Primary value type | RDF term     |
|-----------|--------------------|--------------|
| any       |                    |              |
| boolean   | boolean            | xsd:boolean  |
| integer   | integer            | xsd:integer  |
| float     | float              | xsd:double   |
| string    | string             | xsd:string   |
| datetime  | datetime           | xsd:dateTime |
| date      | datetime           | xsd:date     |
| time      | datetime           | xsd:time     |
| uri       | uri                | xsd:anyURI   |

The data model also supports a set of complex types:

| Data type | Primary value type | RDF term      | Description           |
|-----------|--------------------|---------------|-----------------------|
| record    | map                | rdfs:Resource | A key/value document. |
| list      | list               | rdf:Seq       | An ordered array.     |

The complex types can contain all other simple or complex types. As explained in "Data layout," all information is published in records, and records can contain records.

User-defined record types

Records may be assigned custom types. New record types are created by publishing a schema.

A record type is any valid URI. Tools may attempt to download a machine-readable schema from the URI, but this is not required.

Standard record fields

Records contain the following standard fields:

| Field | Type   | Description                                                      |
|-------|--------|------------------------------------------------------------------|
| type  | string | Declares the type of a record. Must be a valid Schema ID or URI. |

Data layout

The data layout establishes the units of network-transmissible data. It includes the following three groupings:

  • Repository. The dataset of a single actor; contains a set of collections.
  • Collection. An ordered list of records.
  • Record. A key/value document.

These groupings establish addressability as well as the available network queries. For instance, a Repository is addressed by its DID and can be fetched in its entirety, while a Collection is addressed by a DID + its ID and can be fetched partially with range queries. It is not possible to transmit smaller units of data than these three groupings; for instance, a subset of a record cannot be requested over the network.

Additional properties and behaviors for each grouping are defined below.

Repository

Repositories are the dataset of a single "actor" (ie user) in the ADX network. Every user has a single repository which is identified by a DID.

Collection

A collection is an ordered list of records. Every collection has a type and is identified by the Schema ID of its type. Collections may contain records of any type and cannot enforce any constraints on them.

Record

A record is a key/value document. It is the smallest unit of data which can be transmitted over the network. Every record has a type and is identified by a key which is chosen by the writing software.

Builtin collections

The builtin "Definitions collection," identified by adxs.org:Definitions, is used to store schema definitions.

Schemas

Schemas are documents which declare new types. They define:

  • Semantic meanings,
  • Descriptive metadata,
  • Shape-constraints, and
  • Behavior hints.

The primary purpose of schemas is to help developers reach consensus on how they interact on the system. Their secondary purpose is to provide tooling which reduces bugs and incompatibilities; however, most tooling is chosen by applications and is therefore optional.

Schema IDs

All schemas are published as records in the builtin adxs.org:Definitions collection. This makes it possible to reference schemas using only the repository name and schema keyname. We call this the "Schema ID".

schema-id   = repo-name ":" schema-name
repo-name   = [ reg-name "@" ] reg-name
schema-name = reg-name

reg-name is defined in https://www.rfc-editor.org/rfc/rfc3986#section-3.2.2.

For example, the schema with the ID example.com:Song can be found in the example.com repository, in the adxs.org:Definitions collection, under the Song key.
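The grammar above can be exercised with a small parser. This is a hypothetical sketch: the regular expression simplifies RFC 3986's reg-name (requiring at least one character) and the function name is not part of the proposal.

```python
import re

# Simplified reg-name per RFC 3986 §3.2.2: unreserved / pct-encoded / sub-delims,
# one or more characters (the RFC technically permits an empty reg-name).
REG_NAME = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})+"

# schema-id = repo-name ":" schema-name, where repo-name = [ reg-name "@" ] reg-name
SCHEMA_ID = re.compile(
    rf"^(?P<repo>(?:{REG_NAME}@)?{REG_NAME}):(?P<name>{REG_NAME})$"
)

def parse_schema_id(s):
    """Split a Schema ID into (repo-name, schema-name)."""
    m = SCHEMA_ID.match(s)
    if not m:
        raise ValueError(f"invalid schema ID: {s}")
    return m.group("repo"), m.group("name")
```

Usage: `parse_schema_id("bob@work.com:Thing")` returns `("bob@work.com", "Thing")`, separating the optional user part of the repo name from the schema keyname.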

"adx" URL scheme

The adx URL scheme is used to address records in the ADX network.

adx-url   = "adx://" authority path [ "?" query ] [ "#" fragment ]
authority = repo-name / did
repo-name = [ reg-name "@" ] reg-name
path      = [ "/" schema-id [ "/" record-id ] ]
record-id = 1*pchar

did is defined in https://w3c.github.io/did-core/#did-syntax.

reg-name is defined in https://www.rfc-editor.org/rfc/rfc3986#section-3.2.2.

pchar is defined in https://www.rfc-editor.org/rfc/rfc3986#section-3.3.

query is defined in https://www.rfc-editor.org/rfc/rfc3986#section-3.4.

fragment is defined in https://www.rfc-editor.org/rfc/rfc3986#section-3.5.

schema-id is defined in "Schema IDs."

The fragment segment only has meaning if the URL references a record. Its value maps to a subrecord with the matching "id" value.
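The fragment-to-subrecord mapping can be sketched as a recursive search. This helper is hypothetical (the proposal does not specify a traversal order; depth-first over insertion order is an assumption here):

```python
def resolve_fragment(record, fragment):
    """Find the nested map whose "id" value matches the URL fragment (sketch)."""
    if isinstance(record, dict):
        if record.get("id") == fragment:
            return record
        values = record.values()
    elif isinstance(record, list):
        values = record
    else:
        return None                      # scalars cannot carry an "id"
    for v in values:
        found = resolve_fragment(v, fragment)
        if found is not None:
            return found
    return None
```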

Some example adx URLs:

| Kind       | URL                                                          |
|------------|--------------------------------------------------------------|
| Repository | adx://bob.com                                                |
| Repository | adx://bob@work.com                                           |
| Repository | adx://did:ion:EiAnKD8-jfdd0MDcZUjAbRgaThBrMxPTFOxcnfJhI7Ukaw |
| Collection | adx://bob.com/example.com:songs                              |
| Record     | adx://bob.com/example.com:songs/3yI5-c1z-cc2p-1a             |
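The example URLs above decompose into authority, collection (schema ID), and record segments. The following hypothetical helper sketches that split using Python's generic URL parsing; it does not validate the grammar's reg-name/pchar rules.

```python
from urllib.parse import urlsplit

def parse_adx_url(url):
    """Split an adx:// URL into (authority, schema_id, record_id) - a sketch."""
    parts = urlsplit(url)
    if parts.scheme != "adx":
        raise ValueError(f"not an adx URL: {url}")
    # The path is [ "/" schema-id [ "/" record-id ] ]; both segments optional.
    segments = [s for s in parts.path.split("/") if s]
    schema_id = segments[0] if len(segments) > 0 else None
    record_id = segments[1] if len(segments) > 1 else None
    return parts.netloc, schema_id, record_id
```

For instance, the Record example above splits into the repo name `bob.com`, the collection `example.com:songs`, and the record key.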

ADX-Schema (ADXS)

ADX-Schema (or "ADXS") is a DSL for schemas in ADX.

It can be helpful to start from an example:

{
  "name": "Post",
  "extends": "record",
  "comment": "A little chirp.",
  "props": {
    "text": {
      "type": "string",
      "required": true,
      "maxLength": 255
    },
    "extendedText": "string",
    "postedFrom": "gis.org:Location",
    "mentions": "adx.net:User[]"
  }
}

The schema above defines a record-type named "Post." When published, its ID will combine the repo name with the schema name, eg example.com:Post.

The post record defines a set of properties, each with a type. Let's look at each in detail:

  • text Uses the builtin string type with a length constraint. It also declares that the field is required.
  • extendedText Uses the builtin string type with no extra constraints.
  • postedFrom Uses a record with a custom type which is imported from another schema.
  • mentions Uses a list of records with a custom type which is also imported from another schema.

Inheritance

Schemas are polymorphic, meaning they can extend existing schemas to add or redefine constraints. TODO: what are the rules for "overwriting" parent schema definitions?

Properties take advantage of polymorphism, meaning that properties will accept values of the given type or their child types.

Not all types are extensible. The types which may be extended are:

  • record A key/value document.
  • collection An ordered list of records.
  • view A network endpoint which provides views of the network data. To be described in a future document.
  • procedure A network endpoint which provides effectful operations. To be described in a future document.

Consequently, all schemas extend from these base types or a subtype of them.

ADXS Structure

The structure of an ADXS document depends on the base type. The attributes and their interpretation are described in the following sub-sections.

Schema attributes

Schema objects may contain the following fields:

| Field   | Type           | Description | Applies to |
|---------|----------------|-------------|------------|
| name    | string         | The name of the schema. | any |
| extends | string         | The base type of the schema. May be "record", "collection", "view", "procedure", or the Schema ID of an existing schema. | any |
| comment | string         | A description of the schema. | any |
| props   | Properties map | A map of properties which can be included in the record or view, and their definitions. | record, view |

Properties map

The properties map enumerates a list of properties and their definitions. It is used in record and view schemas.

| Entry  | Type             | Description |
|--------|------------------|-------------|
| Keys   | string           | The "path" of the property. |
| Values | string \| object | A type string or a property definition object. See "Property attributes" for a description of property definition objects. |

Type string format

Type strings use the following format:

type    = ( type-id [ "[]" ] / URI )
type-id = "any" / "boolean" / "integer" / "float" / "string" / "datetime" / "date" / "time" / "uri" / "record" / schema-id / "null"

reg-name is defined in https://www.rfc-editor.org/rfc/rfc3986#section-3.2.2.

URI is defined in https://www.rfc-editor.org/rfc/rfc3986#section-3.

schema-id is defined in "Schema ID".

Type strings are interpreted with the following rules:

  • The type-id segment maps to a builtin datatype or a user-defined datatype.
  • The [] postfix indicates that the type is a list.
  • The URI indicates an RDF vocabulary definition.
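These rules can be sketched as a tiny classifier. The helper below is hypothetical, and its URI detection (looking for "://") is a simplification of the grammar, which admits any RFC 3986 URI:

```python
BUILTIN_TYPES = {
    "any", "boolean", "integer", "float", "string", "datetime",
    "date", "time", "uri", "record", "null",
}

def parse_type_string(s):
    """Split the optional "[]" list postfix and classify the base type (sketch)."""
    is_list = s.endswith("[]")
    base = s[:-2] if is_list else s
    if base in BUILTIN_TYPES:
        kind = "builtin"
    elif "://" in base:
        kind = "uri"          # simplification: assumes a hierarchical URI
    else:
        kind = "schema-id"    # e.g. "adx.net:User"
    return base, is_list, kind
```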

Property attributes

Property objects may contain the following fields:

| Field | Type | Description | Applies to |
|-------|------|-------------|------------|
| type | string \| string[] | The type or types of the property. | any |
| contains | string | The type of the contained property. Follows the rules of `type`. | list |
| required | boolean | Do records need to specify a value for this property? | any |
| minCount | number | The minimum number of values in the list. | list |
| maxCount | number | The maximum number of values in the list. | list |
| minLength | number | The minimum length of the string. | string |
| maxLength | number | The maximum length of the string. | string |
| mimeType | string \| string[] | The supported MIME types of the value. | string |
| pattern | string | A regex defining valid values of the string. | string |
| oneOf | any[] | A list of valid values. | integer, string, integer[], string[] |
| minInclusive | number \| string | The minimum value, inclusive. | integer, float, date, time, datetime, duration |
| minExclusive | number \| string | The minimum value, exclusive. | integer, float, date, time, datetime, duration |
| maxInclusive | number \| string | The maximum value, inclusive. | integer, float, date, time, datetime, duration |
| maxExclusive | number \| string | The maximum value, exclusive. | integer, float, date, time, datetime, duration |
| defaultValue | any | A default value to assign the property if none is provided. | any |
| comment | string | A description of the property. | any |

TODO: need a way to express per-type constraints when multiple types are supported

TODO: hints about behaviors such as value indexing?

Processes

Schema publishing

Schemas are published in ADX repos in the builtin adxs.org:Definitions collection.

ADXS records

ADX-Schemas are published in a record of the adxs.org:Schema type. This requires some structure modification; for instance, the "Properties map" must be transformed into an array form as ADX records do not support "map" constructs.
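The map-to-array transformation can be sketched in a few lines. The helper name and the exact output shape are assumptions (the full transformation ruleset is not specified in this draft); each map key becomes a "path" field on a SchemaProp-style object:

```python
def props_map_to_array(props):
    """Transform an ADXS Properties map into an array of SchemaProp objects (sketch)."""
    out = []
    for path, definition in props.items():
        if isinstance(definition, str):         # shorthand: a bare type string
            definition = {"type": definition}
        out.append({"path": path, **definition})
    return out
```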

Schema-ID assignment

The Schema-ID of a schema can be constructed once published according to the rules defined in "Schema IDs". For instance, a schema record published at adx://example.com/adxs.org:Definitions/Thing will have the Schema-ID of example.com:Thing.
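The assignment rule can be sketched as a one-liner over the publication URL. The function is hypothetical; it assumes the simple three-segment layout shown in the example above:

```python
def schema_id_from_url(url):
    """Derive a Schema-ID from a schema record's adx URL (sketch)."""
    prefix = "adx://"
    if not url.startswith(prefix):
        raise ValueError(f"not an adx URL: {url}")
    # Expect exactly: <repo> / adxs.org:Definitions / <Name>
    repo, collection, name = url[len(prefix):].split("/")
    if collection != "adxs.org:Definitions":
        raise ValueError("not a record in the Definitions collection")
    return f"{repo}:{name}"
```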

Hosting

Schemas are expected to be kept available by their authors. If hosting is not maintained, developers and systems will be unable to access the schema definition and will fall back to schema-less behaviors. This will degrade the user experience; schema authors should therefore be conscious of their obligation to continue hosting.

To improve the reliability of schema hosting, it's recommended to operate "schema management" services to which authors can submit their schemas. This will improve availability and can enable some additional validation to be enforced.

Schema consumption

Schemas are referenced by a Schema ID in ADX applications.

The builtin behaviors around schema consumption are kept minimal to ensure ADX is flexible and tolerant of unavailable schema definitions (see "Operation without schemas").

Applications can download schemas using tools similar to software package managers. These schemas can be stored in the app's software repository and leveraged by libraries to provide additional behaviors. Suggested behaviors include:

  • Write validation which errors when a record does not conform to the schema.
  • Read validation with configurable behaviors for non-conforming records (skip, warn, error, ignore).
  • Read coercion which interprets value types into the schema's asserted data types.
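As a sketch of the first suggested behavior (write validation), the hypothetical helper below checks a record against a subset of the example "Post" schema from the ADX-Schema section. The error format and the helper itself are illustrative, not part of the proposal:

```python
# Constraints taken from the example schema; postedFrom/mentions omitted
# because they reference external schemas.
POST_SCHEMA = {
    "text": {"type": "string", "required": True, "maxLength": 255},
    "extendedText": {"type": "string"},
}

def validate(record, props):
    """Return a list of constraint violations; empty means the record conforms."""
    errors = []
    for name, rule in props.items():
        if name not in record:
            if rule.get("required"):
                errors.append(f"missing required property: {name}")
            continue
        value = record[name]
        if rule.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected string")
        elif "maxLength" in rule and len(value) > rule["maxLength"]:
            errors.append(f"{name}: exceeds maxLength {rule['maxLength']}")
    return errors
```

A write path would refuse to persist the record when the list is non-empty, while a read path could downgrade the same result to a warning or a skip, per the configurable behaviors above.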

Permission interfaces and resource-descriptions

The ADX ecosystem includes a capabilities model of permissioning built on UCANs. UCANs require a string format for identifying resources in an ADX repo, which includes the following scopes:

  • repository
  • collection
  • record

These can be constructed using the adx URL (or the equivalent semantics).

Permissioning screens must give users a clear description of the resources being requested. This information cannot be provided by the application as this would represent an attack vector; therefore the descriptions of the resources must be fetched from a trusted source.

While the repository can be identified by the repo's asserted name (eg "bob.com" or "bob@example.com"), the collections and records must be identified by outside information. In these cases, the collection's schema must be fetched and used to provide such a description.

Additional notes

Schema consistency

A consistent view of schema definitions across the network is important for ensuring compatibility. This has two effects on the design of ADX's schema system:

  1. Schemas must use identifiers which are global in scope, and
  2. Schema definitions must remain backwards compatible.

While global identifiers are provided by Schema IDs (which rely on DNS), there are presently no mechanisms to ensure backwards compatibility. It is incumbent on authors and consumers of schemas to ensure that schemas are properly maintained. Tools for publishing schemas are encouraged to validate schema changes against the previous version to reduce errors.

Schema versioning

No formal mechanism for schema versions is defined. If a breaking change to a schema is required, authors are encouraged to publish the schema under a new name.

Note that records are published, addressed, and queried using their containing collection's type. This makes it trivial to interact with records of multiple types.

Operation without schemas

Schemas provide tooling to assist with correctness and compatibility. However, it is possible that the definitions will not remain available on the network, meaning that systems may have to operate on records without access to the schemas. Developers may likewise need to override schemas locally.

To counteract this, the record encoding model defines a rich set of value types which provide some core semantics for the information. This ensures that records are easy to transact with in the absence of schema definitions.

Trust model

In the majority of cases, schemas are asserted by applications. This enables the application developers to download and verify the schemas before using them.

However, there are two known cases where schemas must be fetched from an authenticated network source:

  • Permission screens
  • General-purpose indexers

While fetching schemas from the ADX network does ensure their authenticity, it does not protect against malicious or erroneous actions by the schema publishers. For instance, the author of a collection schema could change the user-facing descriptions to confuse users; or, the author of a record schema could change the definition to cause indexers to struggle with parsing. At this time, no mitigations for these issues have been defined.

Compatibility with RDF

While not emphasized throughout the document, all semantics and behaviors in this proposal are derived from RDF. All information may be decomposed to RDF graph triples, and all terms are either directly equivalent to existing RDF vocabularies or easily translated to them.

The primary motivation of this choice is to enable ADX data and semantics to be expressed at the boundaries of ADX. External systems will frequently need to interact with the ADX networks, and the RDF model will enable ADX data to be encoded using JSON-LD, Turtle, and other RDF formats.

The secondary motivation is to enable graph-model databases to easily encode ADX data. Graph-triples provide a flexible and fine-grained view of information. These properties are especially useful for general-purpose indexers which ADX relies upon to provide aggregated views of the network.

Much of this proposal can be viewed as a DSL atop RDF. The goal of this DSL is not to support all possible RDF constructions; as a consequence, it is not always possible to encode existing RDF vocabularies in ADX. This was seen as an important tradeoff to achieve usability: by removing some features, we enable developers to learn a small set of concepts and techniques and quickly become productive.

Embedded RDF semantics and vocabulary

The underlying terms of ADX and ADXS are as follows.

An ADX-Schema is a JSON document which encodes an RDF graph. Transformation rulesets enable ADXS documents to be converted to RDF triples. While the rulesets will require a full specification, the core principles are simple.

The example schema in the "ADX-Schema" section would look like this after transformation to Turtle:

@prefix :       <adx://example.com/adxs.org:Definitions/Post#> .
@prefix schema: <adx://adxs.org/adxs.org:Definitions/Schema#> .
@prefix prop:   <adx://adxs.org/adxs.org:Definitions/SchemaProp#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

: a schema: ;
  schema:comment "A little chirp." ;
  schema:props [
    a prop: ;
    prop:path :text ;
    prop:type xsd:string ;
    prop:required true ;
    prop:maxLength 255 ;
  ] ;
  schema:props [
    a prop: ;
    prop:path :extendedText ;
    prop:type xsd:string ;
  ] ;
  schema:props [
    a prop: ;
    prop:path :postedFrom ;
    prop:type <adx://gis.org/adxs.org:Definitions/Location> ;
    sh:maxCount 1 ;
  ] ;
  schema:props [
    a prop: ;
    prop:path :mentions ;
    prop:type <adx://adx.net/adxs.org:Definitions/User> ;
  ] .

Because the ADXS vocabulary maintains an equivalence to the XSD, RDFS, and SHACL vocabularies, it is also possible to translate the documents to those more common terms:

@prefix :     <adx://example.com/adxs.org:Definitions/Post#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

: a rdfs:Class, sh:NodeShape ;
  rdfs:comment "A little chirp." ;
  sh:property [
    sh:path :text ;
    sh:datatype xsd:string ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
    sh:maxLength 255 ;
  ] ;
  sh:property [
    sh:path :extendedText ;
    sh:datatype xsd:string ;
    sh:maxCount 1 ;
  ] ;
  sh:property [
    sh:path :postedFrom ;
    sh:class <adx://gis.org/adxs.org:Definitions/Location> ;
    sh:maxCount 1 ;
  ] ;
  sh:property [
    sh:path :mentions ;
    sh:class <adx://adx.net/adxs.org:Definitions/User> ;
  ] .

Future work

Blobs

TODO

Views and procedures

TODO

Appendix A. Datatype value interpretations

Data types are asserted by schemas while value types are asserted by records through the CBOR encoding. In the event of a mismatch, the value may be coerced using the following rules.

| Data type | null | boolean | integer | float | string | list | map | datetime | uri |
|-----------|------|---------|---------|-------|--------|------|-----|----------|-----|
| any | null | boolean | integer | float | string | list | map | datetime | uri |
| boolean | null | boolean | 0 → false; 1 → true | null | null | null | null | null | null |
| integer | null | false → 0; true → 1 | integer | null | null | null | null | null | null |
| float | null | false → 0.0; true → 1.0 | float | float | null | null | null | null | null |
| string | null | null | null | null | string | null | null | string | string |
| duration | null | null | duration (milliseconds) | null | null | null | null | null | null |
| datetime | null | null | datetime (Unix epoch) | null | datetime (ISO-8601) | null | null | datetime | null |
| time | null | null | time (Unix epoch) | null | time (ISO-8601) | null | null | time | null |
| date | null | null | date (Unix epoch) | null | date (ISO-8601) | null | null | date | null |
| uri | null | null | null | null | uri (RFC 3986) | null | null | null | uri |
| map | null | null | null | null | null | null | map | null | null |
| list | null | null | null | null | null | list | null | null | null |

TODO: should strings support map<string> for language maps?
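A few rows of the interpretation matrix can be sketched as a coercion function. This is an illustrative subset covering only the any, boolean, integer, and datetime rows; the function name and its behavior for unsketched rows are assumptions:

```python
from datetime import datetime, timezone

def coerce(data_type, value):
    """Interpret a decoded value under an asserted data type (partial sketch)."""
    if data_type == "any":
        return value                          # every value type passes through
    if data_type == "boolean":
        if isinstance(value, bool):
            return value
        if isinstance(value, int) and value in (0, 1):
            return bool(value)                # 0 → false; 1 → true
        return None
    if data_type == "integer":
        if isinstance(value, bool):
            return int(value)                 # false → 0; true → 1
        if isinstance(value, int):
            return value
        return None
    if data_type == "datetime":
        if isinstance(value, bool):
            return None                       # booleans never coerce to datetime
        if isinstance(value, int):            # Unix epoch seconds
            return datetime.fromtimestamp(value, tz=timezone.utc)
        if isinstance(value, str):            # ISO-8601 string
            try:
                return datetime.fromisoformat(value)
            except ValueError:
                return None
        return None
    raise NotImplementedError(f"row not sketched: {data_type}")
```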

Appendix B. Builtin definitions

Note: the record shorthand maps to rdfs:Resource. All definitions inherit from this.

adxs.org:Definitions

{
  "name": "Definitions",
  "extends": "collection",
  "comment": "System definitions.",
}

adxs.org:Collection

Mapped to the collection shorthand in ADXS.

{
  "name": "Collection",
  "extends": "record",
  "comment": "A collection of records.",
}

adxs.org:Schema

{
  "name": "Schema",
  "extends": "record",
  "comment": "A type definition.",
  "props": {
    "name": {
      "type": "string",
      "required": true
    },
    "extends": "string",
    "comment": "string",
    "props": "adxs.org:SchemaProp[]"
  }
}
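For illustration, a hypothetical user-defined schema conforming to `adxs.org:Schema` might look like the following. The `Location` name, its props, and the bounds are invented for this example and are not part of the builtins.

```json
{
  "name": "Location",
  "extends": "record",
  "comment": "A geographic point.",
  "props": {
    "latitude": {
      "type": "float",
      "required": true,
      "minInclusive": -90,
      "maxInclusive": 90
    },
    "longitude": {
      "type": "float",
      "required": true,
      "minInclusive": -180,
      "maxInclusive": 180
    },
    "label": "string"
  }
}
```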

adxs.org:SchemaProp

{
  "name": "SchemaProp",
  "extends": "record",
  "comment": "A schema property definition.",
  "props": {
    "path": {
      "type": "string",
      "required": true
    },
    "type": {
      "type": "string",
      "required": true
    },
    "contains": "string",
    "required": "boolean",
    "minCount": "number",
    "maxCount": "number",
    "minLength": "number",
    "mimeType": "string|string[]",
    "pattern": "string",
    "oneOf": "any[]",
    "minInclusive": "number|string",
    "minExclusive": "number|string",
    "maxInclusive": "number|string",
    "maxExclusive": "number|string",
    "defaultValue": "any",
    "comment": "string"
  }
}
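As a hedged sketch of how these constraint fields might be applied, the following shows a validator checking a single decoded value against a `SchemaProp`-style dictionary. `check_prop` is a hypothetical helper, not part of any published ADX API, and the range checks assume the value and bound share a comparable type.

```python
import re

def check_prop(prop: dict, value) -> list:
    """Return a list of constraint violations for `value` under `prop`."""
    errors = []
    if value is None:
        # Only `required` applies to an absent value.
        if prop.get("required"):
            errors.append("missing required value")
        return errors
    if isinstance(value, str):
        if "pattern" in prop and not re.fullmatch(prop["pattern"], value):
            errors.append("pattern mismatch")
        if "minLength" in prop and len(value) < prop["minLength"]:
            errors.append("too short")
    if isinstance(value, list):
        # minCount/maxCount constrain the number of contained items.
        if "minCount" in prop and len(value) < prop["minCount"]:
            errors.append("too few items")
        if "maxCount" in prop and len(value) > prop["maxCount"]:
            errors.append("too many items")
    if "oneOf" in prop and value not in prop["oneOf"]:
        errors.append("not an allowed value")
    if isinstance(value, (int, float, str)):
        if "minInclusive" in prop and value < prop["minInclusive"]:
            errors.append("below minimum")
        if "maxInclusive" in prop and value > prop["maxInclusive"]:
            errors.append("above maximum")
    return errors
```

A full validator would also resolve the `type` field against the coercion rules in Appendix A before applying these checks.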
@bserdar

bserdar commented Jun 9, 2022

I saw some messages on the Trust over IP Foundation Slack about this draft, so I read it and wanted to share some of the experiences we had while developing the layered schemas project (https://layeredschemas.org) and applying it to semantic interoperability problems for health data. Here, your aim appears to be mostly structural interoperability. We are working on semantic interoperability, where <height>160cm</height> and {"h":1.6, "unit":"m"} would be considered equal. My comments are mainly on the schema language itself.

Schema evolvability and extensions: Our approach to this is using schema overlays. Such overlays can add/remove fields, modify semantics of the underlying schema, and, in general, adjust the schema to fit a particular use case. This is similar to what you call "inheritance"; however, we call it "schema composition", and the result is a "schema variant".

Developer convenience: Working with JSON-LD and RDF is not easy. Conversely, JSON schemas are ubiquitous. So instead of limiting the schema language to a JSON-LD-based syntax, we decided to support JSON schemas for schemas and overlays. This allows already-available, standardized schemas to be incorporated into the ecosystem. An existing JSON schema can be extended using overlays to add annotations, semantics, or different data types for fields. Something similar can be done for your schemas. Early in the process, we also switched to using labeled property graphs instead of RDF.

Hash-friendly encodings: Layered schemas include an "attributeIndex" annotation for each field. When we ingest data, we create a graph including these annotations and then reconstruct it consistently. Something similar can be adopted.

Value types: There are multiple conventions around how data elements are represented. If one system operates with Unix-epoch-style timestamps and another uses RFC 3339, you cannot really interoperate. Because of this, we decided not to require data types in schema specifications. We perform type coercions/translations based on a known set of data types, including the xsd: namespace and JSON types/formats, when we need to translate data between different variants. The type system remains extensible, though; for instance, we have a type for "Measurement"s, composed of a value and a unit.

The required data types in our case are "structural" types: "Value", "Object", "Array", "Polymorphic". A "Value" contains an array of bytes. An "Object" contains an unordered set of elements, etc.

In our case, a "Polymorphic" data type refers to truly polymorphic data, that is, a data element that can be one of several types. That appears to be lacking in your case.

We also support "Reference" data types. These are simply fields that are already defined by an existing external schema.

@mikestaub

This is a fantastic start!

Using RDF is a great choice, you should reach out to the origintrail.io team to see if they had any issues using them as they scaled out.

I think this is a big mistake:
No formal mechanism for schema versions is defined. If a breaking change to a schema is required, authors are encouraged to publish the schema under a new name.

I would follow npm's convention and make the version a required component of the schema ID, e.g. gis.org:Location@2.0.0

@gobengo

gobengo commented Jun 14, 2022

Are the goals of ADX meaningfully different from ToIP Trust Registries? https://wiki.trustoverip.org/display/HOME/Trust+Registry+Task+Force

@gobengo

gobengo commented Jun 14, 2022

@pfrazee Will you publish a preferred way of receiving annotations?

Examples:

@gobengo

gobengo commented Jun 14, 2022

It would be good to include a note about https://digitalbazaar.github.io/cbor-ld-spec/ as a 'solution considered'

@gobengo

gobengo commented Jun 14, 2022

Do you really mean to include adxs.org in some of these terms? What is the relationship between ADX and https://adxs.org? (I can't read the German.) I tried Google Translate on that URL, and it appears to be about ADHD. Definitely something I personally should look into more, but it's not obvious that it's related to Bluesky ADX?

@gobengo

gobengo commented Jun 14, 2022

Generally I am a big fan of this. I concur that it's a good idea to make use of rdf, cbor, shacl, did

I think this is similar enough to https://www.w3.org/TR/rdf-schema/ that it would be worth comparing to it explicitly.

With that said, assuming there is a this new adxs.org:Schema schema, I would prefer "props" to be "properties" (i.e. jsonschema, but also to use a dictionary word instead of a colloquialism)

@pfrazee
Author

pfrazee commented Jun 14, 2022

Thanks for the comments everybody -- not going to reply specifically yet, I'm just gathering your comments here and the other commentary I receive from folks and letting that feed into next drafts.

Will you publish a preferred way of receiving annotations?

@gobengo I was hoping that just comments on the gist would be enough. I know it's not ideal but nothing really is -- hackmd might be the most accessible option there.

Do you really mean to include adxs.org in some of these terms?

That was just a placeholder / example domain. Have a couple of strings in here which I just had to cook up ad-hoc until we get our domain names nailed down.

@gvelez17

Some line comments - if there's a PR sort of thing I could make these inline and make it easier to follow -

"must be written in the canonical form to the ADX systems" -
must be transmitted in canonical form
what is "written" if not transmitted?

RDF vs Freeform vs Networked

DSL over RDF - I like this! maybe expand on it earlier from this perspective

important point:

While JSON has wide adoption, it fails to provide a canonical encoding without additional rules such as sorted keys, discarded duplicate keys, and string-encoded decimal numbers. This requirement demands either a modified JSON encoding or some other encoding format.

probably will need bignums
(an index of all reactions in the world, for instance? or all views)

mention - an actor is a thing with a DID - was there consideration for this or out of scope of this doc?

Repository. The dataset of a single actor; contains a set of collections.
what is an actor?

Every user has a single repository which is identified by a DID. - this seems like a big decision?
is the user a DID?
if I want two repositories I can register two DIDs?

Schema ID - what does it point to - what kind of things are schemas? Are they also records?

maybe put this earlier 'An ADX-Schema is a JSON document which encodes an RDF graph.'

" it is also to translate the" => "it is also possible to translate"

@gvelez17

Agree with @mikestaub that explicit versioning in the schema should be required or at least very strongly encouraged

@gvelez17

also agree with @bserdar on a Polymorphic or "any" type - while this can be abused/overused, it can be a real saver when extending things

@gvelez17

Reference is also very powerful. maybe a string can be a URI but nice to know that it is one

@gobengo

gobengo commented Jun 15, 2022

@pfrazee @gvelez17

While JSON has wide adoption, it fails to provide a canonical encoding without additional rules such as sorted keys, discarded duplicate keys, and string-encoded decimal numbers. This requirement demands either a modified JSON encoding or some other encoding format.

We need 'rdf dataset canonicalization and hashing', which is a new WG booting up at W3C

w3c/rch-wg-charter#38 (comment)

@gobengo

gobengo commented Jun 15, 2022

relevant to schema evolution and data migration https://www.inkandswitch.com/cambria/

@bserdar

bserdar commented Jun 15, 2022

Reference is also very powerful. maybe a string can be a URI but nice to know that it is one

@gvelez17, references are powerful but difficult to do right. The main problem is the case of cyclic dependencies. Any nontrivial schema system will likely have at least one cycle. References also add to the problem created by contexts: the schema is no longer self-contained. To deal with these (and also to deal with a schema referencing multiple versions/variants of another schema), we specified a compile operation that collects all external pieces of a schema and builds a cyclic schema graph.

@gvelez17

@bserdar we are planning a forum on schema evolution, would you like/be willing to be on it? If you could reach out directly to ailin@whatscookin.us - we are trying to plan it for 7/22 or 7/29. It's on this and related subjects - it will be an hour-long live convo with various folks, based on the format we've used here https://dsocialcommons.org/voices.html

(couldn't find another way to reach you! let me know if there is another channel better)
