
@robons
Created January 19, 2022 12:29

Linked Data Repository - Versioning

The Problem with Simple Versioning

Versioning Qubes

Say I publish a data qube as a CSV-W and upload it to the linked data repository, which automatically assigns version number 1. I could end up with the following URIs:

  • http://my-linked-data-repository/my-dataset/1 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
  • http://my-linked-data-repository/my-dataset/latest -> Temporary Redirects to http://my-linked-data-repository/my-dataset/1

Thus if my qube contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dataset/1#dimension/period.
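
To make the resolution concrete, here's a minimal Turtle sketch of that behaviour (the qb prefix is standard; the dimension itself is purely illustrative): declaring the version-1 document URI as the base means the relative fragment expands to the versioned absolute URI.

    @base <http://my-linked-data-repository/my-dataset/1> .
    @prefix qb: <http://purl.org/linked-data/cube#> .

    # <#dimension/period> resolves against the base to
    # <http://my-linked-data-repository/my-dataset/1#dimension/period>
    <#dimension/period> a qb:DimensionProperty .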

Publishing a second version:

Now if I publish a second edition of the dataset to the linked data repository, which automatically assigns version number 2, I end up with the following URIs:

  • http://my-linked-data-repository/my-dataset/2 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
  • http://my-linked-data-repository/my-dataset/latest -> Temporary Redirects to http://my-linked-data-repository/my-dataset/2 now instead of 1.

Thus if my qube contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dataset/2#dimension/period.

But don't http://my-linked-data-repository/my-dataset/1#dimension/period and http://my-linked-data-repository/my-dataset/2#dimension/period mean the same thing? Shouldn't they use the same identifier? This is almost certainly the case if the dataset is being published in a sequential fashion, e.g. version 1 contains data from January 2020, version 2 contains data from February 2020, and the cube's Data Structure Definition is equivalent. Can we find some way of ensuring that we don't define new URIs where we don't need to?

But what happens if the user wants to add a column, remove a column or worse change the meaning of one of the columns from one version to the next? How do we stop creating duplicate URIs where it is unnecessary, but continue to define new URIs where something has changed?

Versioning Family-Level Component Definitions

Let's say I've got a CSV-W containing definitions of components (e.g. dimensions) that are re-usable.

If I upload the CSV-W to the linked data repository, it automatically assigns a version 1 and results in the following URIs:

  • http://my-linked-data-repository/my-dimensions/1 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
  • http://my-linked-data-repository/my-dimensions/latest -> Temporary Redirects to http://my-linked-data-repository/my-dimensions/1

Thus if my CSV-W contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dimensions/1#dimension/period.

Publishing a second version:

Let's say I add a row to my CSV-W because I want to define a new re-usable dimension.

If I upload the CSV-W to the linked data repository, it automatically assigns a version 2 and results in the following URIs:

  • http://my-linked-data-repository/my-dimensions/2 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
  • http://my-linked-data-repository/my-dimensions/latest -> Temporary Redirects to http://my-linked-data-repository/my-dimensions/2 instead of 1.

Thus if my CSV-W contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dimensions/2#dimension/period.

But aren't http://my-linked-data-repository/my-dimensions/1#dimension/period and http://my-linked-data-repository/my-dimensions/2#dimension/period exactly the same dimension? I set out to create re-usable components, but I've done something terrible: I've created different URIs which mean exactly the same thing. Which one should the user choose?

Solution

  • Linked Data Store to support (major) versioning for each document uploaded
    • The version name can be configured by the user on upload to the platform, e.g. they can select 2018 for the 2018 edition of a publication.
    • Each major version is considered an independent publication accessible via a URI like http://my-linked-data-repository/my-dataset/data/2018; all relative URIs in the document are coined relative to this URI. There is no requirement for immutability between versions since each version contains independent URIs, generally describing different points in time.
  • Each Data CSV-W gets a URI like http://my-linked-data-repository/my-dataset/data/2018
    • If this is the first data CSV-W uploaded for the given version (2018), then it is automatically accessible via http://my-linked-data-repository/my-dataset/data/2018 as well as via an automatically-generated revision URI http://my-linked-data-repository/my-dataset/data/2018/1 (URIs should never be defined relative to this URI - this is only to support tracking historical revisions).
    • If a document already exists for version 2018 and the checksums differ, then this new document replaces it at http://my-linked-data-repository/my-dataset/data/2018 and is also accessible at an automatically-generated revision URI http://my-linked-data-repository/my-dataset/data/2018/2 (URIs should never be defined relative to this URI - this is only to support tracking historical revisions).
    • This is an example of the replacement revisions approach where breaking alterations/changes can be made to the version (2018) to support the need to make retrospective corrections - as are frequently made to statistical publications.
    • If a user neglects to define a version then a default value of initial will be set. This combined with the natural wipe-the-slate-clean approach of replacement uploads should stand as a good starting point for users who don't want to think too much about versioning. We will still keep track of all revisions uploaded to the service.
  • We should define the Data Structure Definition as a separate JSON-LD document and upload it as a separately (automatically) versioned document, e.g. http://my-linked-data-repository/my-dataset/structure/1
    • The behaviour for uploading this file is similar to the data CSV-W, except that a new revision can only replace an existing version if it passes constraints which ensure the changes are non-conflicting:
      • in general we assert that no triples previously defined have been altered or removed.
      • for a data structure definition, we ensure that no new compulsory components (e.g. dimensions, measures, required attributes) are added - we don't want to break qb:DataSets already using this structure.
      • but where we're defining global dimension/attribute/measure properties in a single CSV-W, we should still allow new definitions to be added to a given version.
      • This is an example of non-conflicting (managed) revisions which are designed to support defining reusable resources which are to be used by many documents.
    • If any conflicting changes are discovered, the major version number increases, e.g. we switch from http://my-linked-data-repository/my-dataset/structure/1 being the latest version to http://my-linked-data-repository/my-dataset/structure/2 being the latest version. Note that this may result in duplicate definitions (e.g. http://my-linked-data-repository/my-dataset/structure/1#dimension/period and http://my-linked-data-repository/my-dataset/structure/2#dimension/period), but it minimises the circumstances in which such URI duplication is necessary. We could investigate figuring out which resources have not had breaking changes made to them and add owl:sameAs triples to provide a mapping between the two versions (see the sketch after this list).
  • Upon upload, URIs should be rewritten to be relative to the domain root, i.e. ./structure.json#dimension/period inside the my-dataset upload becomes /my-dataset/structure/1#dimension/period.
    • Would remove any ambiguity introduced by allowing the same document to be accessed by multiple URIs.
    • Would allow users to be ultra-cautious and use a specific revision of a code-list (e.g. http://my-linked-data-repository/my-dataset/a-code-list/1/2) and still get valid URIs defined (e.g. http://my-linked-data-repository/my-dataset/a-code-list/1#some-concept).
    • We could still change the (sub-)domain name without any significant issues.
  • Explicitly define dependencies between documents using VOID datasets bound to URI namespaces/prefixes.
    • TODO: Explain why this is necessary
    • Allows us to be specific about a particular revision of a data structure definition/code-list whilst using easily dereferenceable identifiers, e.g.
          # An example of how we might be able to use the VOID vocabulary to specify dependencies between CSV-W files.
          # The following ttl triples could be generated from a CSV-W.
          @prefix void: <http://rdfs.org/ns/void#> .
          @prefix qb:   <http://purl.org/linked-data/cube#> .
      
          </my-dataset/data/2018#my-code-list-dataset> a void:Dataset;    # Define the code-list's void:Dataset
              void:dataDump </my-dataset/my-code-list/1/2>;               # Specifically reference version 1 revision 2 of `my-code-list`
              void:uriSpace "/my-dataset/my-code-list/1".                 # Assert that all URIs starting with this prefix are defined by this dataset. 
      
          </my-dataset/data/2018#my-structure-dataset> a void:Dataset;    # Define the data structure definition's void:Dataset
              void:dataDump </my-dataset/structure/1>;                    # Reference the latest revision available for version 1 of the DSD.
              void:uriSpace "/my-dataset/structure/1".                    # Assert that all URIs starting with this prefix are defined by this dataset. 
      
      
          </my-dataset/data/2018#dataset> a qb:DataSet;
              qb:structure </my-dataset/structure/1#structure>.
          
          </my-dataset/data/2018#obs/some-concept> a qb:Observation;
              qb:dataSet </my-dataset/data/2018#dataset>;
              </my-dataset/structure/1#dimension/some-dimension> </my-dataset/my-code-list/1#some-concept>. 
              
              # N.B. Although we earlier specified to use version 1 revision 2 of `my-code-list`, the identifiers contained therein do not mention the revision number. 
              # The latest revision within the version *will* contain a definition for the concepts in previous revisions.
      
  • The linked data platform needs to support uploading multiple inter-dependent files to the linked data store
    • We need to replace URIs referring to other documents defined in the upload with URIs referencing the versioned copies of those documents.
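
As a rough illustration of the owl:sameAs suggestion above, here is a hedged sketch of the triples the platform could generate when structure version 2 is created and the period dimension is judged unchanged (URIs follow the illustrative patterns used throughout; this is not a committed design):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    # The platform could assert that the old and new dimension URIs identify the same thing.
    </my-dataset/structure/2#dimension/period>
        owl:sameAs </my-dataset/structure/1#dimension/period> .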

Which versioning approach should we use

  • Code-lists are uploaded using the non-conflicting revisions approach to versioning since they should be reusable between different versions of the same dataset.
  • Data Structure Definitions are uploaded using the non-conflicting revisions approach to versioning since they should be reusable between different versions of the same dataset.
  • Data CSV-Ws are uploaded using the replacement revisions approach to versioning since they often require destructive corrections.

Questions

  • Should we be moving the cube definition outside of the data CSV-W to support longitudinal splitting of data? Or is that something we can retrospectively generate at some later point by grouping together the data CSV-Ws which share the same DSD version? This feels like it could be a specialised version of the replacement revisions approach.
@rossbowen

I had a go at writing some turtle a while back which tries to capture the essence of the recommendations and the web architecture doc:

  • DWBP #7: Provide a version indicator.
  • DWBP #8: Provide version history.
  • DWBP #11: Assign URIs to dataset versions and series.
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix adms:    <http://www.w3.org/ns/adms#> .
@prefix prov:    <http://www.w3.org/ns/prov#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Generic URI 
<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> a dcat:Dataset ;
    dcat:hasCurrentVersion <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.1> ;
    dcat:hasVersion <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0>, 
        <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.1> ;
    dcat:version "1.1" ;
    adms:versionNotes "Dataset was corrected following an error being recognised."@en ;
    .

# Most recent version
<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.1> a dcat:Dataset ;
    dcterms:identifier "dataset-2018-v1.1" ;
    dcterms:issued "2018-03-01T00:00:00Z"^^xsd:dateTime ;
    dcat:isVersionOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> ;
    dcat:previousVersion <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0>;
    prov:wasRevisionOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0>;
    prov:specializationOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018>;
    .

# A previous version
<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0> a dcat:Dataset ;
    dcterms:identifier "dataset-2018-v1.0" ;
    dcterms:issued "2018-01-01T00:00:00Z"^^xsd:dateTime ;
    dcat:isVersionOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> ;
    prov:specializationOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018>;
    prov:invalidatedAtTime "2018-02-28T23:59:59Z"^^xsd:dateTime ;
    .

So you've got:

  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018 as a generic URI which represents the latest.
  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.1, which is the latest version and would forward to the generic URI.
  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0 which is an older version.

My current feeling is that I'd want to keep this detail outside of CSV-Ws and instead have it all handled by a content (data) management system, which I think is what you're saying - a linked data repository would handle this stuff on behalf of the user. If so, awesome, we can reuse some of that code.

My current feeling is also that there isn't a requirement to keep all previous versions of data, so while we could go about versioning DSDs, measures/attributes/dimensions etc., I've never found myself wishing for this and I don't think there's a requirement to do so. The most recent data is what you get and the user is welcome to explore that - there were previous versions at other points in time, but those are no longer available.

@robons

robons commented Jan 19, 2022

while we could go about versioning DSDs, measures/attributes/dimensions etc. I've never found myself wishing for this

The point is that we know not to change the meaning of measures, attributes, dimensions, etc., but external users are less likely to be aware of that. What happens if they decide to change the meaning of one of the dimensions, changing the labels/description and publishing the data again?

Given that we've got historic versions of the data all referencing the same Data Structure Definition (because we want linked data), we have to keep that old version alive so those historic versions of data continue to be interpretable. We then need to publish a new version of the data structure definition to keep track of the current state of affairs.

The world that data management currently inhabits is one in which we don't really consider version history because we just drop and replace the data all the time - meaning we don't have to keep track of what the schema looked like at any previous point in time since we didn't need to keep the 'old' data.

The world I'm imagining here is one in which we need to keep hosting the 2017, 2018 and 2019 versions of a dataset when we publish the 2020 version which has schema changes in it.
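
To make that concrete, here's a minimal Turtle sketch of the situation (reusing the illustrative URI patterns from the solution above): the older editions keep referencing structure version 1, so that document has to stay dereferenceable even after a newer edition moves to version 2.

@prefix qb: <http://purl.org/linked-data/cube#> .

# The 2018 and 2019 editions were published against structure version 1.
</my-dataset/data/2018#dataset> a qb:DataSet ;
    qb:structure </my-dataset/structure/1#structure> .

</my-dataset/data/2019#dataset> a qb:DataSet ;
    qb:structure </my-dataset/structure/1#structure> .

# The 2020 edition changes the meaning of a dimension, so it references a new
# structure version; version 1 must remain published for the older editions to
# stay interpretable.
</my-dataset/data/2020#dataset> a qb:DataSet ;
    qb:structure </my-dataset/structure/2#structure> .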

@rossbowen

rossbowen commented Jan 19, 2022

For the 2018, 2019, 2020 thing you mention above - I've started referring to those as editions because I think they're a little bit different from what I meant by versions.

For those cases, I'd really like to look at the dcat:DatasetSeries. We might get something a bit like this:

<http://data.gov.uk/series/name-of-my-statistical-series> a dcat:DatasetSeries ;
    dcterms:title "Dataset Series"@en ;
    dcat:first <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016> ;
    dcat:last <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> ;
    .

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> a dcat:Dataset ;
    dcat:inSeries <http://data.gov.uk/series/name-of-my-statistical-series> ;
    dcat:prev <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017> ;
    .

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017> a dcat:Dataset ;
    dcat:inSeries <http://data.gov.uk/series/name-of-my-statistical-series> ;
    dcat:prev <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016> ;
    dcat:next <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> ;
    .

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016> a dcat:Dataset ;
    dcat:inSeries <http://data.gov.uk/series/name-of-my-statistical-series> ;
    dcat:next <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017> ;
    .

So I figure each one of these datasets has its own qb:DataSet as a dcat:distribution, and its own DSD. The 2016 cube might have an equivalent DSD to the 2017 cube, but we'd end up coining separate URIs for each of those DSDs.

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> a dcat:Dataset ;
    dcat:distribution <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/rdf>, 
        <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018.csv>, 
        <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018.json> .

 <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/rdf> a qb:DataSet, dcat:Distribution ;
        qb:structure <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/rdf/structure> .

# etc. etc. for the other years.

Up to there things make sense to me - you have different editions of the dataset, and the structure might change one year to the next. You can always inspect the DSD for a given dataset and discover the structure of the cube, and all that data remains available for people to see.

I think where you're talking about versioning components is where I begin to differ.

So I get that, by default, if a publisher doesn't define proper URIs for their components, we'll end up with differing URIs for what might be the same thing, e.g.

  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016/rdf/dimension/country-of-origin
  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017/rdf/dimension/country-of-origin

So maybe we can encourage them to do better, and have them coin some URIs which they'll reuse, e.g.

  • http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin
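
For instance, a hedged sketch of what that reuse could look like (the /rdf/structure URIs follow the pattern above and are purely illustrative): both editions' DSDs point at the same dimension URI.

@prefix qb: <http://purl.org/linked-data/cube#> .

# Two editions' Data Structure Definitions reusing a single dimension URI.
<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016/rdf/structure> a qb:DataStructureDefinition ;
    qb:component [ qb:dimension <http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin> ] .

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017/rdf/structure> a qb:DataStructureDefinition ;
    qb:component [ qb:dimension <http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin> ] .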

Now, things change. Maybe the codelist changes, maybe the methodology changes. I suppose I'm interested in how much things could change while we still keep the URI the same. Atm we craft loads of different URIs for period because different datasets use different codelists for time... but this does make the usability of the datasets much worse. Maybe codelists could be scoped to a graph. Maybe methodologies are themselves entities which have start and end dates.

I think major methodological changes or codelist changes could warrant URI changes, e.g.

<http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin/2016> a qb:DimensionProperty ;
    qb:codeList <http://data.gov.uk/codelist/country-of-origin/2016> ;
    .

<http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin/2021> a qb:DimensionProperty ;
    qb:codeList <http://data.gov.uk/codelist/country-of-origin/2021> ;
    .

... but I don't know yet. I'd feel really unhappy about the proliferation of versioned dimensions though, because they're a pain for dealing with time periods!

A bit of a non-serious suggestion, but if only we could stamp these things with date ranges...

<http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin> a qb:DimensionProperty ;
    xqb:codeListList 
        [ qb:codeList <http://data.gov.uk/codelist/country-of-origin/2021> ;
          xqb:validFrom "2021-01-01"^^xsd:date ; ], 
        [ qb:codeList <http://data.gov.uk/codelist/country-of-origin/2016> ;
          xqb:validFrom "2016-01-01"^^xsd:date ;
          xqb:validTo "2020-12-31"^^xsd:date ; ] ;
    .
