@rufuspollock
Created December 31, 2015 18:36
Data Package identification and naming

Currently Data Packages must have a name attribute but do not have an id attribute.

There has been debate about both the semantics (e.g. uniqueness) of the name field and its usability for certain cases (e.g. importing datasets into a new catalog) - see #220 for extensive discussions.

Proposal

Two identifier fields:

  • name: SHOULD be present (and is certainly required for installation etc.). The name is meaningful to humans and is designed to support both resolution (protocol to be determined) and easy use by humans, e.g. in data dependencies
    • (?) Have this as a MUST?
  • id: MAY be present. If present it MUST be globally unique. Propose it is a UUID (128 bits, 36 characters in its canonical string form) or similar.
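To make the two fields concrete, here is a minimal Python sketch of a descriptor that carries both. It assumes the id is a random (version 4) UUID, which is only one way to satisfy "UUID or similar"; the helper names make_descriptor and has_valid_id are hypothetical and not part of any existing tooling.

```python
import json
import uuid


def make_descriptor(name, with_id=True):
    """Build a minimal descriptor dict carrying the two identifier fields."""
    descriptor = {"name": name}
    if with_id:
        # Assumption: a random version 4 UUID. str() gives the
        # 36-character canonical form (32 hex digits plus 4 hyphens).
        descriptor["id"] = str(uuid.uuid4())
    return descriptor


def has_valid_id(descriptor):
    """True if id is absent (it is optional) or parses as a UUID."""
    if "id" not in descriptor:
        return True
    try:
        uuid.UUID(descriptor["id"])
    except (ValueError, AttributeError, TypeError):
        return False
    return True


print(json.dumps(make_descriptor("finance-vix"), indent=2))
```

Note that a check like this only confirms the id is well formed. Global uniqueness cannot be verified locally; it rests on how the id was generated.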

What is the structure of name?

Name has the following structure:

[namespace/]path

  • Namespace is optional
  • path is made up of subparts each of which MUST NOT contain /.

A name may only contain lower-case alphanumeric characters plus the characters _ - and . with / as the separator (?? should we allow other URL-compatible characters, e.g. :?)

# single-part - for resolution one could anticipate these become `core/{name}` or interpreted in context
finance-vix

# 2 part: `namespace/subpart`
# Propose that namespace MUST
# either come from a designated central data package registry if / when we have one e.g. `core/gdp`
# OR be a valid domain name e.g. `data.gov.uk/my-name` (so we can piggy back on domain name issuance)
core/abc
data.gov.uk/xyz
doi/{doi}

# multipart: (? still not sure we want this) `namespace/name-part-1/name-part-2`
doi/{doi}   # if {doi} has /
github.com/rgrp/court-decisions-gb
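The structure above could be checked with something like the following Python sketch. It assumes the character set as currently proposed (lower-case alphanumerics plus _ - and .), treats the first part of any multi-part name as the namespace, and does not attempt the namespace rule itself (registry entry or valid domain name). The function name parse_name is hypothetical.

```python
import re

# One part of a name: lower-case alphanumerics plus "_", "-" and ".".
PART = re.compile(r"[a-z0-9_.-]+")


def parse_name(name):
    """Split a name into (namespace, path_parts); raise ValueError if invalid."""
    parts = name.split("/")
    for part in parts:
        # fullmatch rejects empty parts (e.g. "core//abc") and any
        # character outside the allowed set, including upper case.
        if not PART.fullmatch(part):
            raise ValueError(f"invalid name part {part!r} in {name!r}")
    if len(parts) == 1:
        # Single-part name: no namespace, to be interpreted in context.
        return None, parts
    return parts[0], parts[1:]


for example in ["finance-vix", "core/abc", "github.com/rgrp/court-decisions-gb"]:
    print(example, "->", parse_name(example))
```

One consequence worth noting: real DOIs can contain upper-case letters and other characters, so under this character set some doi/{doi} names would be rejected unless the DOI is normalised or the set is widened.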

Aside: I did think about having an initial "scheme" value e.g. dp/core/abc or www/data.gov.uk/xyz but felt we were starting to reinvent the url wheel a bit too much ...

Aside 2: one option I thought about was keeping name single-valued and having id support the multipart option.

Use Cases

Why does having an identifier matter? What is it used for?

  • To refer to a data package from elsewhere, e.g. in the dataDependencies field
  • To use in tooling e.g. dpm install {data-package-name}
  • Discovery: using an identifier we can locate a data package [in a registry]
  • To support storing and management in a catalog or registry
    • Online e.g. CKAN
    • Or locally e.g. .datapackages or similar - a local store or cache
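As an illustration of the local store use case, here is a rough Python sketch that maps a name onto a directory inside a local cache. The layout (one directory level per name part under .datapackages) is purely an assumption for illustration, and local_path is a hypothetical helper, not an existing dpm feature.

```python
from pathlib import Path


def local_path(name, store=".datapackages"):
    """Map a package name to a directory inside a local store.

    Hypothetical layout: each "/"-separated part of the name becomes
    one directory level under the store root.
    """
    parts = name.split("/")
    # Refuse empty parts and path-traversal parts so a name can never
    # resolve to somewhere outside the store.
    if any(part in ("", ".", "..") for part in parts):
        raise ValueError(f"unsafe name: {name!r}")
    return Path(store).joinpath(*parts)


print(local_path("core/gdp"))
print(local_path("github.com/rgrp/court-decisions-gb"))
```

Because . is allowed in names, a store like this needs the explicit check for "." and ".." parts; the name rules alone do not rule them out.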

Note also @amercader comment: "As a Catalogue / Registry / Command Line Utility I Want Data Packages to have a global unique id So That I can sanely decide if a Data Package is the same as another one." -- though my question is why do you want to decide if it is the same?

Context

  • Check out Zooko's Triangle. For names it is hard to have more than two of:
    • meaningful (for humans)
    • decentralized
    • secure / non-colliding

Aims for name:

  • be human-usable and usable in dependencies
  • make non-collision possible and likely, but not guaranteed
  • be partially distributed