@nine9ths
Created February 2, 2017 01:50

Several specifications define how identifiers may be formed in contexts that are relevant to XML producers.

HTML 4 is the most restrictive

https://www.w3.org/TR/html4/types.html#type-id

ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").

ID = IDStartChar IDChar*
IDChar = IDStartChar | [0-9] | "-" | "_" | ":" | "."
IDStartChar = [A-Z] | [a-z]
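
As a rough illustration, the HTML 4 production above can be written as a regular expression. This is a minimal sketch; the names `HTML4_ID` and `is_html4_id` are made up for this example and do not come from any specification.

```python
import re

# HTML 4 ID: one ASCII letter, then any number of letters, digits,
# "-", "_", ":" or "." (the hyphen is placed first so it is literal).
HTML4_ID = re.compile(r"[A-Za-z][-_:.A-Za-z0-9]*")


def is_html4_id(value: str) -> bool:
    """Return True if the whole string matches the HTML 4 ID production."""
    return HTML4_ID.fullmatch(value) is not None
```

`fullmatch` is used rather than `^...$` so that a trailing newline is not silently accepted.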

xsd:ID (the type used by XML Schema and RELAX NG) is more permissive, but still imposes a restriction

https://www.w3.org/TR/xmlschema-2/#ID

The ·value space· of ID is the set of all strings that ·match· the NCName production in [Namespaces in XML]

NCName = NCNameStartChar NCNameChar*
NCNameChar = NCNameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
NCNameStartChar = [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

Note: This permits everything that HTML 4 ID does except for ':'.
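
The NCName production can be sketched the same way by copying its character ranges into a character class. This is only an illustration under the assumption that Python's `re` module is acceptable for the check; `is_ncname` and the helper names are hypothetical.

```python
import re

# Ranges copied from the NCNameStartChar production above.
_START = (
    "A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# Extra characters allowed after the first one (NCNameChar); the literal
# hyphen is appended last, just before the closing bracket of the class.
_EXTRA = ".0-9\u00B7\u0300-\u036F\u203F-\u2040"

NCNAME = re.compile(f"[{_START}][{_START}{_EXTRA}-]*")


def is_ncname(value: str) -> bool:
    """Return True if the whole string matches the NCName production."""
    return NCNAME.fullmatch(value) is not None
```

With this sketch, `is_ncname("_fig-1")` is true while `is_ncname("sec:1")` is false, which is the colon difference described in the note.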

XML is of note (because that's what the JATS DTD allows)

https://www.w3.org/TR/REC-xml/#id

Values of type ID must match the Name production. A name must not appear more than once in an XML document as a value of this type; i.e., ID values must uniquely identify the elements which bear them.

Name = NameStartChar (NameChar)*
NameChar = NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
NameStartChar = ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

Note: This is more permissive than either HTML 4 or xsd:ID, because it is the NCName production plus ':'.

HTML 5 is also of note in that it restricts almost nothing

https://www.w3.org/TR/html5/dom.html#the-id-attribute

The value must be unique amongst all the IDs in the element's home subtree and must contain at least one character. The value must not contain any space characters. There are no other restrictions on what form an ID can take; in particular, IDs can consist of just digits, start with a digit, start with an underscore, consist of just punctuation, etc.
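
The two per-value HTML 5 rules quoted above (at least one character, no space characters) are simple enough to check directly. The sketch below assumes the HTML 5 definition of "space characters" as space, tab, line feed, form feed and carriage return; the function name is hypothetical, and the uniqueness rule is not covered because it depends on the whole document.

```python
# Assumption: the five "space characters" as HTML 5 defines them
# (space, tab, LF, FF, CR).
HTML5_SPACE_CHARS = " \t\n\f\r"


def is_html5_id(value: str) -> bool:
    """Return True if value is non-empty and contains no space characters."""
    return len(value) > 0 and not any(ch in HTML5_SPACE_CHARS for ch in value)
```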

The JATS4R recommendation

For maximum compatibility we recommend that identifiers in JATS documents follow the production:

JATSID = JATSIDStartChar JATSIDChar*
JATSIDChar = JATSIDStartChar | [0-9] | "-" | "_" | "."
JATSIDStartChar = [A-Z] | [a-z]

Or in regex form: [A-Za-z][-_.A-Za-z0-9]*

This is the HTML 4 production with ':' removed, so that every value is also a valid XML NCName.
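
A minimal sketch of how the recommended regex might be used to screen a list of `id` values. The function names and the sample identifiers are invented for illustration and are not part of the recommendation.

```python
import re

# The recommended production in regex form, as given above.
JATS_ID = re.compile(r"[A-Za-z][-_.A-Za-z0-9]*")


def is_jats_id(value: str) -> bool:
    """Return True if the whole string matches the recommended production."""
    return JATS_ID.fullmatch(value) is not None


def invalid_ids(ids: list[str]) -> list[str]:
    """Return the identifiers that do not follow the recommendation."""
    return [i for i in ids if not is_jats_id(i)]


if __name__ == "__main__":
    # "sec:2" has a colon, "2col" starts with a digit, "_x" with an underscore.
    print(invalid_ids(["fig-1", "sec:2", "2col", "T1.a", "_x"]))
```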
