@nichtich
Last active April 28, 2021 15:27
Overlapping Markup Data Language (OMDL)

This is a quick draft of a data format specification to encode annotated character data to support overlapping markup, also known as standoff markup.

Everything is subject to discussion.

Related work

  • By now this is essentially a fork of atjson
  • Ted Nelson's xanadoc EDL format
  • OCR formats ALTO, PAGE, hOCR
  • Mac OSX Core Text
  • Google Docs and other real-time editors
  • ...

Summary

An OMDL document is a JSON document consisting of a JSON object with exactly two keys (called "names" in RFC 8259):

  • content mapped to a (possibly empty) string, referred to as the content string
  • annotations mapped to a (possibly empty) array of annotations

A content string is a JSON string and therefore a Unicode string.
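As an illustration of the top-level structure, here is a minimal sketch in Python that builds a document and round-trips it through JSON. The content string and the annotation types `bold` and `italic` are hypothetical examples and not defined by this draft; the annotation keys used are the ones described in the following paragraphs.

```python
import json

# Hypothetical example document; the types "bold" and "italic" are
# illustrative only and not defined by this draft.
document = {
    "content": "Hello, world!",
    "annotations": [
        {"start": 0, "end": 5, "type": "bold"},
        # This annotation overlaps the first one, which is allowed.
        {"start": 3, "end": 12, "type": "italic"},
    ],
}

# Serialize to JSON and parse it back again.
serialized = json.dumps(document, ensure_ascii=False)
parsed = json.loads(serialized)
```

The two annotations overlap without being nested in each other, which is the case that tree-shaped markup cannot express directly.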

An annotation is a JSON object with

  • mandatory key start mapped to a start position
  • mandatory key end mapped to an end position
  • optional key type mapped to an annotation type
  • optional key attributes mapped to an arbitrary JSON object

and the following integrity constraints:

  • start position and end position MUST be non-negative integer values
  • start position MUST NOT be larger than end position
  • start position and end position MUST NOT be larger than the length of the content string, counted in Unicode code points
  • a document must not contain fully identical annotations (same positions, same type, same attributes)
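The constraints above could be checked roughly as follows. This is a sketch rather than a reference implementation: the function name `validate_omdl` is made up for this example, and it assumes that two annotations count as identical when their start, end, type and attributes are equal (a missing `attributes` key is treated as different from an empty object, which the draft does not settle).

```python
import json


def validate_omdl(doc):
    """Return a list of violations; an empty list means none were found."""
    if not isinstance(doc, dict) or set(doc) != {"content", "annotations"}:
        return ["document must be an object with exactly the keys "
                "content and annotations"]
    content, annotations = doc["content"], doc["annotations"]
    if not isinstance(content, str) or not isinstance(annotations, list):
        return ["content must be a string and annotations must be an array"]

    errors = []
    length = len(content)  # Python counts code points, not UTF-16 code units
    seen = set()
    for i, ann in enumerate(annotations):
        if not isinstance(ann, dict):
            errors.append(f"annotation {i}: not an object")
            continue
        start, end = ann.get("start"), ann.get("end")
        # bool is a subclass of int in Python, so it is excluded explicitly.
        positions_ok = all(
            isinstance(p, int) and not isinstance(p, bool) and p >= 0
            for p in (start, end)
        )
        if not positions_ok:
            errors.append(f"annotation {i}: start and end must be "
                          "non-negative integers")
            continue
        if start > end:
            errors.append(f"annotation {i}: start is larger than end")
        if max(start, end) > length:
            errors.append(f"annotation {i}: position exceeds content "
                          f"length {length}")
        # Assumption: identity is compared on (start, end, type, attributes).
        key = json.dumps(
            [start, end, ann.get("type"), ann.get("attributes")],
            sort_keys=True,
        )
        if key in seen:
            errors.append(f"annotation {i}: duplicate annotation")
        seen.add(key)
    return errors
```

Counting in code points matters for characters outside the Basic Multilingual Plane: a content string consisting of a single emoji has length 1 here, whereas a UTF-16 based string length (as in JavaScript) would report 2.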

Positions denote the gaps between characters of the content string. For instance, the first character is referenced by start position 0 and end position 1.
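Because positions sit between characters, they behave like half-open slice bounds. A small sketch in Python, whose strings are indexed by code point; the helper name `annotated_text` is hypothetical:

```python
def annotated_text(content, annotation):
    """Return the part of the content string covered by an annotation."""
    return content[annotation["start"]:annotation["end"]]


first = annotated_text("Hello", {"start": 0, "end": 1})   # "H"
empty = annotated_text("Hello", {"start": 2, "end": 2})   # "" (zero width)
emoji = annotated_text("a\U0001F600b", {"start": 1, "end": 2})
```

An annotation with equal start and end position covers no characters and marks a point between two characters, for example an insertion point.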

An annotation type is a non-empty JSON string with

  • either a URI
  • or a well-known annotation type name, that is a non-empty string consisting of lowercase letters a to z and/or the minus sign -.
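The two kinds of annotation type can be told apart mechanically, since a well-known name never contains a colon while a URI always has a scheme followed by one. The sketch below uses a hypothetical helper `classify_type`; its URI pattern is a deliberately rough assumption (a scheme, a colon, and no whitespace) and not a full RFC 3986 validation.

```python
import re

# Well-known names: one or more lowercase letters a to z and/or "-".
WELL_KNOWN_NAME = re.compile(r"[a-z-]+")
# Rough URI check (assumption): scheme, colon, then no whitespace.
ROUGH_URI = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:\S*")


def classify_type(value):
    """Return "well-known", "uri", or None for an invalid annotation type."""
    if not isinstance(value, str) or value == "":
        return None
    if WELL_KNOWN_NAME.fullmatch(value):
        return "well-known"
    if ROUGH_URI.fullmatch(value):
        return "uri"
    return None
```

With this sketch, `classify_type("line-break")` yields `"well-known"`, `classify_type("http://example.org/type")` yields `"uri"`, and `classify_type("Bold")` yields `None` because of the uppercase letter.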

Each annotation type should imply an attribute schema, that is a specification of the attributes (e.g. by a JSON Schema).

Normalization (optional)

A normalized OMDL document

  • is a canonical JSON document
  • must not contain empty object attributes
  • must sort annotations
    • first by start position
    • then by end position
    • then by annotation type
    • then by attributes (given as canonical JSON)
  • should normalize content string to Unicode normalization form NFC (note that the process of Unicode normalization may change annotation positions)
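The sorting rules can be sketched as a single sort key. Several points here are assumptions, because the draft leaves them open: "canonical JSON" is approximated by sorted keys and compact separators, an empty `attributes` object is dropped entirely, and a missing type sorts as the empty string. The function names are hypothetical.

```python
import json
import unicodedata


def canonical(value):
    # Assumption: sorted keys and compact separators stand in for
    # "canonical JSON"; the draft does not name a specific scheme.
    return json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


def normalize_annotations(annotations):
    """Drop empty attributes and sort by start, end, type, attributes."""
    cleaned = []
    for ann in annotations:
        ann = dict(ann)  # shallow copy, the input is left untouched
        if ann.get("attributes") == {}:
            del ann["attributes"]
        cleaned.append(ann)
    return sorted(
        cleaned,
        key=lambda a: (
            a["start"],
            a["end"],
            a.get("type", ""),
            canonical(a.get("attributes", {})),
        ),
    )


# NFC can shorten the content string, so positions may need adjusting.
decomposed = "Cafe\u0301"                             # "e" + combining acute
composed = unicodedata.normalize("NFC", decomposed)   # precomposed "é"
lengths = (len(decomposed), len(composed))            # (5, 4)
```

In the last lines, an annotation ending at position 5 is valid for the decomposed string but exceeds the length of the NFC form, so positions have to be remapped whenever the content string is normalized.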

References

Normative References

Informative References
