dwinston/shouder_to_type-code-plus-shoulder.md

## shouder_to_type-code-plus-shoulder.md

      
My intended identifier design was nmdc:<shoulder><generated_id>, where shoulder is an opaque namespace granted to a minting (sub)organization. The shoulder avoids ID clashes when an organization brings existing (not generated) IDs into the mix, and it helps keep IDs short for a minting org that doesn't mint a lot.
Unfortunately, it seems folks really, really want to infer semantics from an ID rather than practice the discipline of resolving it and fetching metadata. So we've had a mushy compromise where the shoulders are kinda memorable: "mga0", for example, is for metagenome annotations.
I propose, given the insistence on by-eye ID typing, that we reset the scheme as nmdc:<type_code><shoulder><generated_id>. Here type_code is an alphabetical code that is part of a recognized and controlled set, like how GOLD uses "Gb" for "GOLD biosample" (e.g., https://identifiers.org/gold:Gb0110680) and "Gs" for "GOLD study" (e.g., https://identifiers.org/gold:Gs0103573) in its IDs. Shoulders stay opaque and stand for the minting authority (e.g., you can have a shoulder for minting biosample IDs, someone else can have one as well, etc.). In this scheme shoulders would be of the form [0-9][a-z]*[0-9], i.e., two digits with zero or more lowercase letters in between. The id (preassigned by the shoulder's org or generated by the service) comes after that, as before.
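As a minimal sketch, the shoulder form [0-9][a-z]*[0-9] can be checked with a regular expression. The function name is illustrative and not part of any existing NMDC tooling.

```python
import re

# Shoulder form from the proposal: a digit, zero or more lowercase letters, a digit.
SHOULDER_RE = re.compile(r"[0-9][a-z]*[0-9]")


def is_valid_shoulder(candidate: str) -> bool:
    """Return True if the whole string has the proposed shoulder form."""
    return SHOULDER_RE.fullmatch(candidate) is not None
```

Under this check, "00" and "0a1" are accepted, while the currently deployed "mga0" is rejected because it starts with a letter.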
A shoulder would be like a "minting agent username", but opaque: it would be generated to be as short as possible and assigned to an org, like “EMSL” or “JGI”, not to a person.
The current shoulder "mga0" was chosen to signal a "MetaGenomics Analysis (MGA)" activity; because it’s a shoulder, it ends with a digit.
Here is what IDs should look like using my proposal of taking meaning out of shoulders and prefixing a shoulder with a type-code that can bear meaning:
nmdc:samp00q2z368
nmdc:stdy00q2z368
nmdc:mga00q2z368
nmdc:ompro00q2z368
nmdc:metap00q2z368
nmdc:nom00q2z368

I expect type codes to be all lowercase alphabetical characters and compact, no more than six characters. That is, matching a regex of [a-z]{1,6}. Above I offer “samp” for sample, “stdy” for study, “mga” for “metagenomics analysis”, “ompro” for Omics Processing, etc., but this would not be up to me. The team would set the type codes and their correspondences, and we can add type codes over time.
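One way to picture the controlled set is a small registry that enforces the [a-z]{1,6} rule and refuses duplicates. This is only a sketch: the dictionary, the function name, and the idea of keeping the set in code are assumptions, and the entries are just the codes floated above, which the team may well change.

```python
import re

TYPE_CODE_RE = re.compile(r"[a-z]{1,6}")

# Hypothetical starting set, taken from the codes offered in this note.
TYPE_CODES = {
    "samp": "sample",
    "stdy": "study",
    "mga": "metagenomics analysis",
    "ompro": "omics processing",
}


def register_type_code(code: str, meaning: str) -> None:
    """Add a new type code, rejecting malformed or already-used codes."""
    if TYPE_CODE_RE.fullmatch(code) is None:
        raise ValueError(f"type code must match [a-z]{{1,6}}: {code!r}")
    if code in TYPE_CODES:
        raise ValueError(f"type code already registered: {code!r}")
    TYPE_CODES[code] = meaning
```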
There is only one shoulder above, 00 (double zero). The blade (“key = shoulder + blade”), that is, the unique part under a given type-code-plus-shoulder namespace, is the same for each ID above. As in the current scheme, it is a base32 encoding of a 64-bit integer value plus a checksum, which helps defend against many communication errors.
It is critical that an ID analyzer can decompose any structure without lookup. In the proposed scheme, an analyzer can always get the type code by grabbing letters after “nmdc:” until it hits a digit. Then it can always get the shoulder by starting with that digit, collecting zero or more letters, and finishing when it hits a second digit. The rest is the blade. In the previous (currently deployed) scheme, a shoulder doesn’t start with a digit because there’s no preceding type code that needs to be distinguished when parsing an ID.
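The lookup-free decomposition described above can be sketched with a single regular expression. The names here are illustrative. The blade character class [0-9a-z]+ is an assumption, since this note only says the blade is base32 plus a checksum, and the sketch does not verify the checksum.

```python
import re
from typing import NamedTuple

# nmdc:<type_code><shoulder><blade>, following the proposed scheme.
# The blade alphabet is assumed to be lowercase letters and digits.
ID_RE = re.compile(
    r"nmdc:"
    r"(?P<type_code>[a-z]{1,6})"       # letters up to the first digit
    r"(?P<shoulder>[0-9][a-z]*[0-9])"  # digit, optional letters, digit
    r"(?P<blade>[0-9a-z]+)"            # everything that remains
)


class ParsedId(NamedTuple):
    type_code: str
    shoulder: str
    blade: str


def parse_id(identifier: str) -> ParsedId:
    """Split an ID into type code, shoulder, and blade without any lookup."""
    match = ID_RE.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a type-code-plus-shoulder ID: {identifier!r}")
    return ParsedId(**match.groupdict())
```

For example, parse_id("nmdc:samp00q2z368") yields type code "samp", shoulder "00", and blade "q2z368". The split is unambiguous because a type code contains no digits and a shoulder ends at its second digit.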
We could technically go far with just one shoulder if everyone agrees to use the one central minting service. But shoulders (1) are essential for distributed minting with conflict-free merging (if EMSL wants to install an offline minter with a local database that ensures locally unique blades, NMDC can just assign the minting steward a shoulder), and (2) ensure unlimited IDs when storing blades as 64-bit integers for efficiency. Running out is unlikely to be a practical problem, as a 12-character base32-encoded string can represent 2^60 ~ 1e18 integers. Still, shoulders provide convenient, compact, and opaque (no semantics) namespacing within the broader NMDC ID namespace for whatever reason we may find that convenient: using legacy IDs as blades, keeping total ID lengths compact even if large volumes are minted (perhaps not all minted IDs will ultimately be shared broadly), etc. The motivating use case, again, is to accommodate distributed minting by known organizations/sites in case a central minting server is not continuously available (a standard code+database container can be maintained for would-be minting organizations to use).
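A quick check of the arithmetic, plus a toy illustration of conflict-free merging. The shoulder "11" and the second blade are made-up placeholders (no real checksum), and the helper names are hypothetical.

```python
def blade_capacity(num_chars: int) -> int:
    """Number of distinct values a base32 string of this length can hold."""
    return 32 ** num_chars  # each base32 character carries 5 bits


# 12 characters carry 60 bits, which still fits in a 64-bit integer.
assert blade_capacity(12) == 2 ** 60
# A 13th character would need 65 bits and no longer fit.
assert blade_capacity(13) > 2 ** 64


def make_id(type_code: str, shoulder: str, blade: str) -> str:
    """Assemble an ID in the proposed nmdc:<type_code><shoulder><blade> form."""
    return f"nmdc:{type_code}{shoulder}{blade}"


# Two minters may reuse the same blades locally; distinct shoulders keep
# their IDs disjoint, so the sets can be merged without coordination.
blades = ("q2z368", "q2z369")
central = {make_id("samp", "00", b) for b in blades}
offline = {make_id("samp", "11", b) for b in blades}
merged = central | offline
```

Here merged holds four distinct IDs even though both minters used the same two blades.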
This actually reminds me a bit of the DOI system. DOIs all begin with “10.”, and then organizations get assigned a full prefix, like “10.1038”. You could think of “10.” as analogous to “nmdc:” and “1038/” as analogous to a shoulder. And so you have a DOI like “10.1038/s41597-019-0184-5”.