Say I publish a data Qube as a CSV-W and upload it to the linked data repository, which automatically assigns it version number 1. I could end up with the following URIs:

- http://my-linked-data-repository/my-dataset/1 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
- http://my-linked-data-repository/my-dataset/latest -> temporarily redirects to http://my-linked-data-repository/my-dataset/1

Thus if my qube contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dataset/1#dimension/period.
Publishing a second version:

Now if I publish a second edition of the dataset to the linked data repository, which automatically assigns it version number 2, I end up with the following URIs:

- http://my-linked-data-repository/my-dataset/2 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
- http://my-linked-data-repository/my-dataset/latest -> temporarily redirects to http://my-linked-data-repository/my-dataset/2 now instead of 1

Thus if my qube contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dataset/2#dimension/period.
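Since fragment-only references are resolved against the document's URI per RFC 3986, the same relative URI yields a distinct absolute URI under each version's base. A quick illustration using Python's `urllib`, with the two version base URIs from above:

```python
from urllib.parse import urljoin

# Base URIs assigned by the repository for versions 1 and 2 of the dataset.
v1_base = "http://my-linked-data-repository/my-dataset/1"
v2_base = "http://my-linked-data-repository/my-dataset/2"

# The same relative URI in the CSV-W resolves to a different absolute URI
# under each version's base.
relative = "#dimension/period"
print(urljoin(v1_base, relative))  # http://my-linked-data-repository/my-dataset/1#dimension/period
print(urljoin(v2_base, relative))  # http://my-linked-data-repository/my-dataset/2#dimension/period
```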
But don't http://my-linked-data-repository/my-dataset/1#dimension/period and http://my-linked-data-repository/my-dataset/2#dimension/period mean the same thing? Shouldn't they use the same identifier? This is almost certainly the case if the dataset is being published in a sequential fashion, e.g. version 1 contains data from January 2020, version 2 contains data from February 2020, and the cube's Data Structure Definition is equivalent. Can we find some way of ensuring that we don't define new URIs where we don't need to?

But what happens if the user wants to add a column, remove a column or, worse, change the meaning of one of the columns from one version to the next? How do we stop creating duplicate URIs where it is unnecessary, but continue to define new URIs where something has changed?
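One way to frame the "don't mint new URIs unnecessarily" question is to diff component definitions between versions and only coin new URIs for things that actually changed. A rough sketch, using hypothetical content hashes of each component's definition:

```python
# Hypothetical map of component fragment -> hash of its definition, per version.
v1 = {"#dimension/period": "a1f3", "#dimension/area": "9c2e", "#measure/count": "77b0"}
v2 = {"#dimension/period": "a1f3", "#dimension/area": "5d10", "#measure/count": "77b0"}

unchanged = {frag for frag in v1 if v2.get(frag) == v1[frag]}           # may safely share a URI
changed = {frag for frag in v1 if frag in v2 and v2[frag] != v1[frag]}  # needs a new URI
added = set(v2) - set(v1)                                               # brand-new, needs a new URI anyway

print(sorted(unchanged))  # ['#dimension/period', '#measure/count']
print(sorted(changed))    # ['#dimension/area']
```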
Let's say I've got a CSV-W containing definitions of dimensions that are re-usable. If I upload the CSV-W to the linked data repository, it automatically assigns it version 1 and results in the following URIs:

- http://my-linked-data-repository/my-dimensions/1 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
- http://my-linked-data-repository/my-dimensions/latest -> temporarily redirects to http://my-linked-data-repository/my-dimensions/1

Thus if my qube contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dimensions/1#dimension/period.
Publishing a second version:

Let's say I add a row to my CSV-W because I want to define a new re-usable dimension. If I upload the CSV-W to the linked data repository, it automatically assigns it version 2 and results in the following URIs:

- http://my-linked-data-repository/my-dimensions/2 - automatically serves up the CSV-W's Metadata JSON file. All URIs are relative to here.
- http://my-linked-data-repository/my-dimensions/latest -> temporarily redirects to http://my-linked-data-repository/my-dimensions/2 instead of 1

Thus if my qube contains relative URIs (e.g. #dimension/period) then the absolute URIs will end up in the form http://my-linked-data-repository/my-dimensions/2#dimension/period.
But aren't http://my-linked-data-repository/my-dimensions/1#dimension/period and http://my-linked-data-repository/my-dimensions/2#dimension/period exactly the same dimension? In trying to create re-usable components, I've done something terrible: I've created different URIs which mean exactly the same thing. Which one does the user choose?
- Linked Data Store to support (major) versioning for each document uploaded.
  - The version name can be configured by the user on upload to the platform, e.g. they can select 2018 for the 2018 edition of a publication.
  - Each major version is considered an independent publication accessible via a URI like http://my-linked-data-repository/my-dataset/data/2018; all relative document URIs are to be coined relative to this URI. There is no requirement for immutability between versions since each version contains independent URIs, generally describing different points in time.
- Each Data CSV-W gets a URI like http://my-linked-data-repository/my-dataset/data/2018.
  - If uploading the first data CSV for the given version (2018), then it is automatically accessible via http://my-linked-data-repository/my-dataset/data/2018 as well as via an automatically-generated revision URI http://my-linked-data-repository/my-dataset/data/2018/1 (URIs should never be defined relative to this URI - it exists only to support tracking historical revisions).
  - If there already exists a document with version 2018 and the checksums differ, then this new document replaces it at http://my-linked-data-repository/my-dataset/data/2018 and is also accessible at an automatically-generated revision URI http://my-linked-data-repository/my-dataset/data/2018/2 (URIs should never be defined relative to this URI - it exists only to support tracking historical revisions).
  - This is an example of the replacement revisions approach, where breaking alterations/changes can be made to the version (2018) to support the need to make retrospective corrections - as are frequently made to statistical publications.
  - If a user neglects to define a version then a default value of initial will be set. This, combined with the natural wipe-the-slate-clean approach of replacement uploads, should stand as a good starting point for users who don't want to think too much about versioning. We will still keep track of all revisions uploaded to the service.
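The replacement-revisions behaviour above can be sketched as a small in-memory store. The class name and URI template are hypothetical stand-ins for what would really sit behind the HTTP service:

```python
import hashlib

class ReplacementStore:
    """Sketch of the replacement-revisions approach: the version URI always
    serves the newest upload; each distinct upload also gets a revision URI."""

    def __init__(self):
        self.revisions = {}  # version name -> list of (checksum, content)

    def upload(self, content: bytes, version: str = "initial") -> str:
        """Store an upload; return its automatically-generated revision URI."""
        checksum = hashlib.sha256(content).hexdigest()
        revs = self.revisions.setdefault(version, [])
        if not revs or revs[-1][0] != checksum:
            # Checksums differ (or first upload): record a new revision.
            revs.append((checksum, content))
        return f"/my-dataset/data/{version}/{len(revs)}"

    def latest(self, version: str) -> bytes:
        # The version URI (/my-dataset/data/<version>) serves the newest revision.
        return self.revisions[version][-1][1]
```

For example, uploading a corrected 2018 file replaces what `/my-dataset/data/2018` serves, while the earlier revision stays reachable at `/my-dataset/data/2018/1`.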
- We should define the Data Structure Definition as a separate JSON-LD document and upload it as a separately (automatically) versioned document, e.g. http://my-linked-data-repository/my-dataset/structure/1.
  - The behaviour for uploading this file is similar to the data CSV-W, except a new revision can only replace an existing version iff sensible constraints pass which ensure the changes are non-conflicting:
    - In general, we assert that no triples previously defined have been altered or removed.
    - For a data structure definition, we ensure that no new compulsory components (e.g. dimensions, measures, required attributes) are added - we don't want to break qb:DataSets already using this structure.
    - But where we're defining global dimension/attribute/measure properties in a single CSV-W, we should still allow new definitions to be added to a given version.
  - This is an example of non-conflicting (managed) revisions, which are designed to support defining reusable resources used by many documents.
  - If any conflicting changes are discovered, the major version number increases, e.g. http://my-linked-data-repository/my-dataset/structure/1 stops being the latest version and http://my-linked-data-repository/my-dataset/structure/2 becomes the latest version. Note that this may result in duplicate definitions (e.g. http://my-linked-data-repository/my-dataset/structure/1#dimension/period and http://my-linked-data-repository/my-dataset/structure/2#dimension/period), but it minimises the circumstances in which this URI duplication is necessary. We could investigate figuring out which resources have not had breaking changes made to them and add owl:sameAs triples to provide a mapping between the two versions.
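The non-conflicting constraint can be sketched as a pure check over the old and new triple sets. Triples are modelled here as plain 3-tuples; this is a simplification - a fuller check would also consult qb:componentRequired so that attaching new optional attributes could be permitted:

```python
def is_non_conflicting(old_triples: set, new_triples: set) -> bool:
    """Sketch of the managed-revision constraint for a DSD upload."""
    # 1. No previously defined triple may be altered or removed.
    if not old_triples <= new_triples:
        return False
    # 2. No new components may be attached to an existing structure; new
    #    standalone property definitions elsewhere in the document remain fine.
    old_components = {t for t in old_triples if t[1] == "qb:component"}
    new_components = {t for t in new_triples if t[1] == "qb:component"}
    return new_components == old_components
```

Under this check, adding a brand-new global dimension definition passes, while attaching an extra component to the existing structure (or deleting a triple) forces a new major version.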
- Upon upload, URIs should become relative to the domain root, i.e. ./structure.json#dimension/period inside the my-dataset upload -> /my-dataset/structure/1#dimension/period.
  - Would remove any ambiguity introduced by allowing the same document to be accessed by multiple URIs.
  - Would allow users to be ultra-cautious and use a specific revision of a code-list (e.g. http://my-linked-data-repository/my-dataset/a-code-list/1/2) and still get valid URIs defined (e.g. http://my-linked-data-repository/my-dataset/a-code-list/1#some-concept).
  - We could still change the (sub-)domain name without any significant issues.
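The rebasing step could look something like this on upload. The mapping from upload-relative paths to assigned versioned URIs is hypothetical - it would come from the repository's version-assignment step:

```python
# Map each document in the upload to the versioned, domain-root-relative URI
# the repository assigns it (hypothetical assignment).
assigned = {"./structure.json": "/my-dataset/structure/1"}

def rewrite(reference: str) -> str:
    """Rebase a reference from an upload-relative path onto its versioned URI."""
    doc, _, fragment = reference.partition("#")
    rebased = assigned.get(doc, doc)
    return f"{rebased}#{fragment}" if fragment else rebased

print(rewrite("./structure.json#dimension/period"))  # /my-dataset/structure/1#dimension/period
```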
- Explicitly define dependencies between documents using VOID datasets bound to URI namespaces/prefixes.
  - TODO: Explain why this is necessary
  - Allows us to be specific about a particular revision of a data structure definition/code-list whilst using easily dereferenceable identifiers, e.g.

```ttl
# An example of how we might be able to use the VOID vocabulary to specify
# dependencies between CSV-W files. The following ttl triples could be
# generated from a CSV-W.

# Define the code-list's void:Dataset.
</my-dataset/data/2018#my-code-list-dataset> a void:Dataset;
    void:dataDump </my-dataset/my-code-list/1/2>; # Specifically reference version 1 revision 2 of `my-code-list`.
    void:uriSpace "/my-dataset/my-code-list/1".   # Assert that all URIs starting with this prefix are defined by this dataset.

# Define the data structure definition's void:Dataset.
</my-dataset/data/2018#my-structure-dataset> a void:Dataset;
    void:dataDump </my-dataset/structure/1>;      # Reference the latest revision available for version 1 of the DSD.
    void:uriSpace "/my-dataset/structure/1".      # Assert that all URIs starting with this prefix are defined by this dataset.

</my-dataset/data/2018#dataset> a qb:DataSet;
    qb:structure </my-dataset/structure/1#structure>.

</my-dataset/data/2018#obs/some-concept> a qb:Observation;
    qb:dataSet </my-dataset/data/2018#dataset>;
    </my-dataset/structure/1#dimension/some-dimension> </my-dataset/my-code-list/1#some-concept>.

# N.B. Although we earlier specified to use version 1 revision 2 of `my-code-list`,
# the identifiers contained therein do not mention the revision number. The latest
# revision within the version *will* contain a definition for the concepts in
# previous revisions.
```
- The linked data platform needs to support uploading multiple inter-dependent files to the linked data store.
  - We need to replace URIs referring to other documents defined in the upload with URIs referencing versioned copies of those documents.
- Code-lists are uploaded using the non-conflicting revisions approach to versioning since they should be reusable between different versions of the same dataset.
- Data Structure Definitions are uploaded using the non-conflicting revisions approach to versioning since they should be reusable between different versions of the same dataset.
- Data CSV-Ws are uploaded using the replacement revisions approach to versioning since they often require destructive corrections.
- Should we be moving the cube definition outside of the data CSV-W to support longitudinal splitting of data? Or is that something we can retrospectively generate at some later point by grouping together the data CSV-Ws which share the same DSD version? This feels like it could be a specialised version of the replacement revisions approach.
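The retrospective grouping floated in the last bullet could be as simple as indexing data documents by the DSD version they reference. The URIs below are illustrative:

```python
from collections import defaultdict

# (data document URI, DSD version URI it references) - illustrative values.
uploads = [
    ("/my-dataset/data/2018", "/my-dataset/structure/1"),
    ("/my-dataset/data/2019", "/my-dataset/structure/1"),
    ("/my-dataset/data/2020", "/my-dataset/structure/2"),
]

cube_slices = defaultdict(list)
for data_uri, dsd_uri in uploads:
    cube_slices[dsd_uri].append(data_uri)

# Documents sharing a DSD version can then be presented as one longitudinal cube.
print(dict(cube_slices))
```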
I had a go writing some turtle a while back which tries to capture the essence of the recommendations and the web architecture doc. So you've got:

- http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018 as a generic URI which represents the latest version,
- http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.1 which is the latest version and would forward to the generic URI,
- http://data.gov.uk/series/name-of-my-statistical-series/dataset/2018/version/1.0 which is an older version.

My current feeling is that I'd want to keep this detail outside of CSV-Ws and instead have it all handled by a content (data) management system - which I think is what you're saying: a linked data repository would handle this stuff on behalf of the user. If so, awesome, we can reuse some of that code.
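Deciding which versioned URI counts as "the latest" (and so should forward to the generic URI) needs a version ordering; a small sketch, assuming dotted-numeric version names like those above:

```python
# Determine the latest version so the repository knows which versioned URI
# (e.g. .../version/1.1) should forward to the generic dataset URI.
# Assumes dotted-numeric version names; a real naming scheme might differ.
def latest_version(versions):
    return max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))

print(latest_version(["1.0", "1.1"]))   # 1.1
print(latest_version(["1.9", "1.10"]))  # 1.10 (numeric, not lexicographic, ordering)
```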
My current feeling is also that there isn't a requirement to keep all previous versions of data. So while we could go about versioning DSDs, measures/attributes/dimensions etc., I've never found myself wishing for this and I don't think there's a requirement to do so. The most recent data is what you get and the user is welcome to explore that - there were previous versions at other points in time, but those are no longer available.