
@robons
Created January 19, 2022 12:29

Linked Data Repository - Versioning

The Problem with Simple Versioning

Versioning Qubes

Say I publish a data qube as a CSV-W and upload it to the linked data repository, which automatically assigns version number 1. I could end up with the following URIs:

  • http://my-linked-data-repository/my-dataset/1 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
  • http://my-linked-data-repository/my-dataset/latest -> Temporary Redirects to http://my-linked-data-repository/my-dataset/1

Thus if my qube contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dataset/1#dimension/period.
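
To make the resolution concrete, here's a minimal Turtle sketch of that behaviour (the qb prefix is standard; the dimension itself is purely illustrative): declaring the version-1 document URI as the base means the relative fragment expands to the versioned absolute URI.

    @base <http://my-linked-data-repository/my-dataset/1> .
    @prefix qb: <http://purl.org/linked-data/cube#> .

    # <#dimension/period> resolves against the base to
    # <http://my-linked-data-repository/my-dataset/1#dimension/period>
    <#dimension/period> a qb:DimensionProperty .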

Publishing a second version:

Now if I publish a second edition of the dataset to the linked data repository, which automatically assigns version number 2, I end up with the following URIs:

  • http://my-linked-data-repository/my-dataset/2 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
  • http://my-linked-data-repository/my-dataset/latest -> Temporary Redirects to http://my-linked-data-repository/my-dataset/2 now instead of 1.

Thus if my qube contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dataset/2#dimension/period.

But don't http://my-linked-data-repository/my-dataset/1#dimension/period and http://my-linked-data-repository/my-dataset/2#dimension/period mean the same thing? Shouldn't they use the same identifier? This is almost certainly the case if the dataset is being published in a sequential fashion, e.g. version 1 contains data from January 2020, version 2 contains data from February 2020, and the cube's Data Structure Definition is equivalent. Can we find some way of ensuring that we don't define new URIs where we don't need to?

But what happens if the user wants to add a column, remove a column or worse change the meaning of one of the columns from one version to the next? How do we stop creating duplicate URIs where it is unnecessary, but continue to define new URIs where something has changed?

Versioning Family-Level Component Definitions

Let's say I've got a CSV-W containing definitions of components (e.g. dimensions) that are re-usable.

If I upload the CSV-W to the linked data repository, it automatically assigns a version 1 and results in the following URIs:

  • http://my-linked-data-repository/my-dimensions/1 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
  • http://my-linked-data-repository/my-dimensions/latest -> Temporary Redirects to http://my-linked-data-repository/my-dimensions/1

Thus if my CSV-W contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dimensions/1#dimension/period.

Publishing a second version:

Let's say I add a row to my CSV-W because I want to define a new re-usable dimension.

If I upload the CSV-W to the linked data repository, it automatically assigns a version 2 and results in the following URIs:

  • http://my-linked-data-repository/my-dimensions/2 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
  • http://my-linked-data-repository/my-dimensions/latest -> Temporary Redirects to http://my-linked-data-repository/my-dimensions/2 instead of 1.

Thus if my CSV-W contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dimensions/2#dimension/period.

But aren't http://my-linked-data-repository/my-dimensions/1#dimension/period and http://my-linked-data-repository/my-dimensions/2#dimension/period exactly the same dimension? I set out to create re-usable components, but I've done something terrible: I've created different URIs which mean exactly the same thing. Which one should the user choose?

Solution

  • Linked Data Store to support (major) versioning for each document uploaded
    • The version name can be configured by the user on upload to the platform, e.g. they can select 2018 for the 2018 edition of a publication.
    • Each major version is considered an independent publication accessible via a URI like http://my-linked-data-repository/my-dataset/data/2018; all relative URIs in the document are coined relative to this URI. There is no requirement for immutability between versions since each version contains independent URIs, generally describing different points in time.
  • Each Data CSV-W gets a URI like http://my-linked-data-repository/my-dataset/data/2018
    • If this is the first data CSV-W uploaded for the given version (2018), then it is automatically accessible via http://my-linked-data-repository/my-dataset/data/2018 as well as via an automatically-generated revision URI http://my-linked-data-repository/my-dataset/data/2018/1 (URIs should never be defined relative to this URI - this is only to support tracking historical revisions).
    • If a document already exists for version 2018 and the checksums differ, then this new document replaces it at http://my-linked-data-repository/my-dataset/data/2018 and is also accessible at an automatically-generated revision URI http://my-linked-data-repository/my-dataset/data/2018/2 (URIs should never be defined relative to this URI - this is only to support tracking historical revisions).
    • This is an example of the replacement revisions approach where breaking alterations/changes can be made to the version (2018) to support the need to make retrospective corrections - as are frequently made to statistical publications.
    • If a user neglects to define a version then a default value of initial will be set. This combined with the natural wipe-the-slate-clean approach of replacement uploads should stand as a good starting point for users who don't want to think too much about versioning. We will still keep track of all revisions uploaded to the service.
  • We should define the Data Structure Definition as a separate JSON-LD document and upload it as a separately (automatically) versioned document, e.g. http://my-linked-data-repository/my-dataset/structure/1
    • The behaviour for uploading this file is similar to the data CSV-W, except that a new revision can only replace an existing version if it passes constraints which ensure the changes are non-conflicting:
      • in general we assert that no triples previously defined have been altered or removed.
      • for a data structure definition, we ensure that no new compulsory components (e.g. dimensions, measures, required attributes) are added - we don't want to break qb:DataSets already using this structure.
      • but where we're defining global dimension/attribute/measure properties in a single CSV-W, we should still allow new definitions to be added to a given version.
      • This is an example of non-conflicting (managed) revisions which are designed to support defining reusable resources which are to be used by many documents.
    • If any conflicting changes are discovered, the major version number increases, e.g. we switch from http://my-linked-data-repository/my-dataset/structure/1 being the latest version to http://my-linked-data-repository/my-dataset/structure/2 being the latest version. Note that this may result in duplicate definitions (e.g. http://my-linked-data-repository/my-dataset/structure/1#dimension/period and http://my-linked-data-repository/my-dataset/structure/2#dimension/period), but it minimises the circumstances in which such URI duplication is necessary. We could investigate figuring out which resources have not had breaking changes made to them and add owl:sameAs triples to provide a mapping between the two versions (see the sketch after this list).
  • Upon upload, URIs should be rewritten to be relative to the domain root, i.e. ./structure.json#dimension/period inside the my-dataset upload becomes /my-dataset/structure/1#dimension/period.
    • Would remove any ambiguity introduced by allowing the same document to be accessed by multiple URIs.
    • Would allow users to be ultra-cautious and use a specific revision of a code-list (e.g. http://my-linked-data-repository/my-dataset/a-code-list/1/2) and still get valid URIs defined (e.g. http://my-linked-data-repository/my-dataset/a-code-list/1#some-concept).
    • We could still change the (sub-)domain name without any significant issues.
  • Explicitly define dependencies between documents using VOID datasets bound to URI namespaces/prefixes.
    • TODO: Explain why this is necessary
    • Allows us to be specific about a particular revision of a data structure definition/code-list whilst using easily dereferenceable identifiers, e.g.
          # An example of how we might be able to use the VOID vocabulary to specify dependencies between CSV-W files.
          # The following ttl triples could be generated from a CSV-W.
          @prefix void: <http://rdfs.org/ns/void#> .
          @prefix qb:   <http://purl.org/linked-data/cube#> .
      
          </my-dataset/data/2018#my-code-list-dataset> a void:Dataset;    # Define the code-list's void:Dataset
              void:dataDump </my-dataset/my-code-list/1/2>;               # Specifically reference version 1 revision 2 of `my-code-list`
              void:uriSpace "/my-dataset/my-code-list/1".                 # Assert that all URIs starting with this prefix are defined by this dataset. 
      
          </my-dataset/data/2018#my-structure-dataset> a void:Dataset;    # Define the data structure definition's void:Dataset
              void:dataDump </my-dataset/structure/1>;                    # Reference the latest revision available for version 1 of the DSD.
              void:uriSpace "/my-dataset/structure/1".                    # Assert that all URIs starting with this prefix are defined by this dataset. 
      
      
          </my-dataset/data/2018#dataset> a qb:DataSet;
              qb:structure </my-dataset/structure/1#structure>.
          
          </my-dataset/data/2018#obs/some-concept> a qb:Observation;
              qb:dataSet </my-dataset/data/2018#dataset>;
              </my-dataset/structure/1#dimension/some-dimension> </my-dataset/my-code-list/1#some-concept>. 
              
              # N.B. Although we earlier specified to use version 1 revision 2 of `my-code-list`, the identifiers contained therein do not mention the revision number. 
              # The latest revision within the version *will* contain a definition for the concepts in previous revisions.
      
  • The linked data platform needs to support uploading multiple inter-dependent files to the linked data store
    • We need to replace URIs referring to other documents defined in the upload with URIs referencing the versioned copies of those documents.
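
As a rough illustration of the owl:sameAs suggestion above, here is a hedged sketch of the triples the platform could generate when structure version 2 is created and the period dimension is judged unchanged (URIs follow the illustrative patterns used throughout; this is not a committed design):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    # The platform could assert that the old and new dimension URIs identify the same thing.
    </my-dataset/structure/2#dimension/period>
        owl:sameAs </my-dataset/structure/1#dimension/period> .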

Which versioning approach should we use

  • Code-lists are uploaded using the non-conflicting revisions approach to versioning since they should be reusable between different versions of the same dataset.
  • Data Structure Definitions are uploaded using the non-conflicting revisions approach to versioning since they should be reusable between different versions of the same dataset.
  • Data CSV-Ws are uploaded using the replacement revisions approach to versioning since they often require destructive corrections.

Questions

  • Should we be moving the cube definition outside of the data CSV-W to support longitudinal splitting of data? Or is that something we can retrospectively generate at some later point by grouping together the data CSV-Ws which share the same DSD version? This feels like it could be a specialised version of the replacement revisions approach.
@rossbowen

I had a go at writing some turtle a while back which tries to capture the essence of the recommendations and the web architecture doc:

  • DWBP #7: Provide a version indicator.
  • DWBP #8: Provide version history.
  • DWBP #11: Assign URIs to dataset versions and series.
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix adms:    <http://www.w3.org/ns/adms#> .
@prefix prov:    <http://www.w3.org/ns/prov#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Generic URI 
<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> a dcat:Dataset ;
    dcat:hasCurrentVersion <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.1> ;
    dcat:hasVersion <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0>, 
        <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.1> ;
    dcat:version "1.1" ;
    adms:versionNotes "Dataset was corrected following an error being recognised."@en ;
    .

# Most recent version
<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.1> a dcat:Dataset ;
    dcterms:identifier "dataset-2018-v1.1" ;
    dcterms:issued "2018-03-01T00:00:00Z"^^xsd:dateTime ;
    dcat:isVersionOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> ;
    dcat:previousVersion <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0>;
    prov:wasRevisionOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0>;
    prov:specializationOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018>;
    .

# A previous version
<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0> a dcat:Dataset ;
    dcterms:identifier "dataset-2018-v1.0" ;
    dcterms:issued "2018-01-01T00:00:00Z"^^xsd:dateTime ;
    dcat:isVersionOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> ;
    prov:specializationOf <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018>;
    prov:invalidatedAtTime "2018-02-28T23:59:59Z"^^xsd:dateTime ;
    .

So you've got:

  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018 as a generic URI which represents the latest.
  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.1, which is the latest version and would forward to the generic URI.
  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0 which is an older version.

My current feeling is that I'd want to keep this detail outside of CSV-Ws and instead have it all handled by a content (data) management system, which I think is what you're saying - a linked data repository would handle this stuff on behalf of the user. If so, awesome, we can reuse some of that code.

My current feeling is also that there isn't a requirement to keep all previous versions of data, so while we could go about versioning DSDs, measures/attributes/dimensions etc., I've never found myself wishing for this and I don't think there's a requirement to do so. The most recent data is what you get and the user is welcome to explore that - there were previous versions at other points in time, but those are no longer available.

@robons

robons commented Jan 19, 2022

while we could go about versioning DSDs, measures/attributes/dimensions etc. I've never found myself wishing for this

The point is that we know not to change the meaning of measures, attributes, dimensions, etc., but external users are less likely to be aware of that. What happens if they decide to change the meaning of one of the dimensions, changing the labels/description and publishing the data again?

Given that we've got historic versions of the data all referencing the same Data Structure Definition (because we want linked data), we have to keep that old version alive so those historic versions of data continue to be interpretable. We then need to publish a new version of the data structure definition to keep track of the current state of affairs.

The world that data management currently inhabits is one in which we don't really consider version history because we just drop and replace the data all the time - meaning we don't have to keep track of what the schema looked like at any previous point in time since we didn't need to keep the 'old' data.

The world I'm imagining here is one in which we need to keep hosting the 2017, 2018 and 2019 versions of a dataset when we publish the 2020 version which has schema changes in it.
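
To make that concrete, here's a minimal Turtle sketch of the situation (reusing the illustrative URI patterns from the solution above): the older editions keep referencing structure version 1, so that document has to stay dereferenceable even after a newer edition moves to version 2.

@prefix qb: <http://purl.org/linked-data/cube#> .

# The 2018 and 2019 editions were published against structure version 1.
</my-dataset/data/2018#dataset> a qb:DataSet ;
    qb:structure </my-dataset/structure/1#structure> .

</my-dataset/data/2019#dataset> a qb:DataSet ;
    qb:structure </my-dataset/structure/1#structure> .

# The 2020 edition changes the meaning of a dimension, so it references a new
# structure version; version 1 must remain published for the older editions to
# stay interpretable.
</my-dataset/data/2020#dataset> a qb:DataSet ;
    qb:structure </my-dataset/structure/2#structure> .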

@rossbowen

rossbowen commented Jan 19, 2022

For the 2018, 2019, 2020 thing you mention above - I've started referring to those as editions because I think they're a little bit different from what I meant by versions.

For those cases, I'd really like to look at the dcat:DatasetSeries. We might get something a bit like this:

<http://data.gov.uk/series/name-of-my-statistical-series> a dcat:DatasetSeries ;
    dcterms:title "Dataset Series"@en ;
    dcat:first <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016> ;
    dcat:last <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> ;
    .

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> a dcat:Dataset ;
    dcat:inSeries <http://data.gov.uk/series/name-of-my-statistical-series> ;
    dcat:prev <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017> ;
    .

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017> a dcat:Dataset ;
    dcat:inSeries <http://data.gov.uk/series/name-of-my-statistical-series> ;
    dcat:prev <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016> ;
    dcat:next <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> ;
    .

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016> a dcat:Dataset ;
    dcat:inSeries <http://data.gov.uk/series/name-of-my-statistical-series> ;
    dcat:next <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017> ;
    .

So I figure each one of these datasets has its own qb:DataSet as a dcat:distribution, and its own DSD. The 2016 cube might have an equivalent DSD to the 2017 cube, but we'd end up coining separate URIs for each of those DSDs.

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018> a dcat:Dataset ;
    dcat:distribution <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/rdf>, 
        <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018.csv>, 
        <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018.json> .

 <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/rdf> a qb:DataSet, dcat:Distribution ;
        qb:structure <http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/rdf/structure> .

# etc. etc. for the other years.

Up to there things make sense to me - you have different editions of the dataset, and the structure might change one year to the next. You can always inspect the DSD for a given dataset and discover the structure of the cube, and all that data remains available for people to see.

I think where you're talking about versioning components is where I begin to differ.

So I get that, by default, if a publisher doesn't define proper URIs for their components, we'll end up with differing URIs for what might be the same thing, e.g.

  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016/rdf/dimension/country-of-origin
  • http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017/rdf/dimension/country-of-origin

So maybe we can encourage them to do better, and have them coin some URIs which they'll reuse, e.g.

  • http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin
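
For instance, a hedged sketch of what that reuse could look like (the /rdf/structure URIs follow the pattern above and are purely illustrative): both editions' DSDs point at the same dimension URI.

@prefix qb: <http://purl.org/linked-data/cube#> .

# Two editions' Data Structure Definitions reusing a single dimension URI.
<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2016/rdf/structure> a qb:DataStructureDefinition ;
    qb:component [ qb:dimension <http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin> ] .

<http://data.gov.uk/series/name-of-my-statistical-series/dataset/2017/rdf/structure> a qb:DataStructureDefinition ;
    qb:component [ qb:dimension <http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin> ] .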

Now, things change. Maybe the codelist changes, maybe the methodology changes. I suppose I'm interested in how much things could change while we still keep the URI the same. Atm we craft loads of different URIs for period because different datasets use different codelists for time... but this does make the usability of the datasets much worse. Maybe codelists could be scoped to a graph. Maybe methodologies are themselves entities which have start and end dates.

I think major methodological changes or codelist changes could warrant URI changes, e.g.

<http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin/2016> a qb:DimensionProperty ;
    qb:codeList <http://data.gov.uk/codelist/country-of-origin/2016> ;
    .

<http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin/2021> a qb:DimensionProperty ;
    qb:codeList <http://data.gov.uk/codelist/country-of-origin/2021> ;
    .

... but I don't know yet. I'd feel really unhappy about the proliferation of versioned dimensions though, because they're a pain for dealing with time periods!

A bit of a non-serious suggestion, but if only we could stamp these things with date ranges...

<http://data.gov.uk/series/name-of-my-statistical-series/dimension/country-of-origin> a qb:DimensionProperty ;
    xqb:codeListList 
        [ qb:codeList <http://data.gov.uk/codelist/country-of-origin/2021> ;
          xqb:validFrom "2021-01-01"^^xsd:date ; ], 
        [ qb:codeList <http://data.gov.uk/codelist/country-of-origin/2016> ;
          xqb:validFrom "2016-01-01"^^xsd:date ;
          xqb:validTo "2020-12-31"^^xsd:date ; ] ;
    .
