Currently Data Packages must have a `name` attribute but do not have an `id` attribute.
There has been debate about both the semantics (e.g. uniqueness) of the `name` field and its usability for certain cases (e.g. importing datasets into a new catalog) - see #220 for extensive discussions.
Two identifier fields:

`name`
: SHOULD be present (and is certainly required for installation etc). The name is meaningful to humans and is designed to support both resolution (protocol to be determined) and easy use by humans, e.g. in data dependencies. (? Should this be a MUST?)

`id`
: MAY be present. If present, it MUST be globally unique. Propose it is a UUID (128 bits, 36 characters in its canonical string form) or similar.
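As a rough illustration of what such an `id` could look like, here is a minimal sketch using Python's standard `uuid` module. The helper name `new_package_id` is hypothetical, and the choice of a random (version 4) UUID is an assumption; the proposal only says "uuid or similar".

```python
import uuid


def new_package_id() -> str:
    """Return a random (version 4) UUID in its canonical string form.

    Assumption: version 4 is used here purely for illustration; the
    proposal does not say which UUID version an `id` should be.
    """
    return str(uuid.uuid4())


package_id = new_package_id()
print(package_id)       # a different value on every run
print(len(package_id))  # 32 hex digits plus 4 hyphens
```

The canonical string form is 36 characters long, which encodes a 128-bit value.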
`name` has the following structure:

`[namespace/]path`

- The namespace is optional.
- `path` is made up of subparts, each of which MUST NOT contain `/`.
- `name` may only contain lower-case alphanumeric characters plus `_`, `-` and `.`, with `/` as a separator (?? should we allow other URL-compatible values, e.g. `:`?)
    # single-part - for resolution one could anticipate these become `core/{name}` or interpreted in context
    finance-vix
    # 2 part: `namespace/subpart`
    # Propose that namespace MUST
    # either come from a designated central data package registry if / when we have one e.g. `core/gdp`
    # OR be a valid domain name e.g. `data.gov.uk/my-name` (so we can piggy back on domain name issuance)
    core/abc
    data.gov.uk/xyz
    doi/{doi}
    # multipart: (? still not sure we want this) `namespace/name-part-1/name-part-2`
    doi/{doi} # if {doi} has /
    github.com/rgrp/court-decisions-gb
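The name rules and examples above could be checked mechanically. The following is a minimal sketch, not a reference implementation: the function names `is_valid_name` and `split_name` are hypothetical, and it assumes that the first subpart of a multi-part name is the namespace and that `.` is one of the permitted characters (as the `data.gov.uk` example suggests). It does not check that a namespace is a registered name or a valid domain.

```python
import re
from typing import Optional, Tuple

# One subpart: lower-case alphanumerics plus "_", "-" and "."; never "/".
SUBPART = r"[a-z0-9_.-]+"
# A name is one or more subparts separated by single "/" characters.
NAME_RE = re.compile(rf"{SUBPART}(?:/{SUBPART})*")


def is_valid_name(name: str) -> bool:
    """Return True if `name` matches the proposed [namespace/]path syntax."""
    return NAME_RE.fullmatch(name) is not None


def split_name(name: str) -> Tuple[Optional[str], str]:
    """Split a name into (namespace, path).

    Assumption: for multi-part names the first subpart is the namespace;
    single-part names have no namespace (None).
    """
    if not is_valid_name(name):
        raise ValueError(f"invalid data package name: {name!r}")
    namespace, sep, path = name.partition("/")
    if not sep:
        return None, name
    return namespace, path


print(is_valid_name("finance-vix"))   # True
print(is_valid_name("Core/ABC"))      # False: upper case is not allowed
print(split_name("github.com/rgrp/court-decisions-gb"))
```

Under these assumptions the last call returns `("github.com", "rgrp/court-decisions-gb")`, so the multipart case falls out of the same rule as the two-part case.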
Aside: I did think about having an initial "scheme" value, e.g. `dp/core/abc` or `www/data.gov.uk/xyz`, but felt we were starting to reinvent the URL wheel a bit too much ...
Aside 2: one option I thought about was keeping `name` single-valued and having `id` support the multipart option.
Why does having an identifier matter? What is it used for?
- To refer to a data package elsewhere, e.g. in the dependencies `dataDependencies` field
- To use in tooling, e.g. `dpm install {data-package-name}`
- Discovery: using an identifier we can locate a data package [in a registry]
  - Question: which registry? Or, more generally, what is the "resolution" protocol? See http://dataprotocols.org/data-package-identifier/ for more on this
- To support storage and management in a catalog or registry
  - Online, e.g. CKAN
  - Or locally, e.g. `.datapackages` or similar (a local store or cache)
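To make the local-store use case more concrete, here is a minimal sketch of how a `/`-separated name could map onto a directory inside a `.datapackages` folder. The function `local_package_dir` and the directory layout are hypothetical, not something the proposal specifies. Rejecting empty, `.` and `..` subparts is an extra assumption added here so that a name can never point outside the store.

```python
from pathlib import Path


def local_package_dir(name: str, root: str = ".datapackages") -> Path:
    """Map a data package name to a directory under a local store.

    Assumptions: one directory level per subpart, and subparts that are
    empty, "." or ".." are rejected so the result stays inside `root`.
    """
    subparts = name.split("/")
    if any(part in ("", ".", "..") for part in subparts):
        raise ValueError(f"unsafe data package name: {name!r}")
    return Path(root).joinpath(*subparts)


print(local_package_dir("core/gdp"))
print(local_package_dir("github.com/rgrp/court-decisions-gb"))
```

One consequence of this layout is that the namespace becomes the top-level folder, so packages from the same namespace sit together in the store.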
Note also @amercader's comment: "As a Catalogue / Registry / Command Line Utility I Want Data Packages to have a global unique id So That I can sanely decide if a Data Package is the same as another one." -- though my question is: why do you want to decide if it is the same?
- Check out Zooko's Triangle. For names it is hard to have more than two of:
  - meaningful (for humans)
  - decentralized
  - secure / non-colliding
Aims for `name`:
- be human-usable and usable in dependencies
- make non-collision possible and likely, but not guaranteed
- be partially distributed