This is a quick draft of a data format specification to encode annotated character data to support overlapping markup, also known as standoff markup.
Everything is subject to discussion. Related work:
- By now this is kind of a fork of atjson
- Ted Nelson's xanadoc EDL format
- OCR formats ALTO, PAGE, hOCR
- Mac OSX Core Text
- Google Docs and other real-time editors
- ...
An OMDL document is a JSON document consisting of a JSON object with exactly two keys (called "names" in RFC 8259):
- `content` mapped to a (possibly empty) string, referred to as the content string
- `annotations` mapped to a (possibly empty) array of annotations
A content string is a JSON string and thus a Unicode string.
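As an illustration, the following sketch parses a small example document and checks only its top-level shape. The document itself and the annotation type names `bold` and `italic` are made up for this example, and `has_omdl_shape` is a hypothetical helper, not part of the format.

```python
import json

# Hypothetical example document; the types "bold" and "italic" are made up.
# The two annotations overlap on the characters "wo".
raw = """
{
  "content": "Hello, world!",
  "annotations": [
    {"start": 0, "end": 9, "type": "bold"},
    {"start": 7, "end": 12, "type": "italic"}
  ]
}
"""


def has_omdl_shape(doc):
    """Check the top-level structure only: exactly two keys with the expected JSON types."""
    return (
        isinstance(doc, dict)
        and set(doc) == {"content", "annotations"}
        and isinstance(doc["content"], str)
        and isinstance(doc["annotations"], list)
    )


doc = json.loads(raw)
print(has_omdl_shape(doc))
```

The check deliberately ignores the annotations themselves; their keys and constraints are a separate concern.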
An annotation is a JSON object with
- mandatory key `start` mapped to a start position
- mandatory key `end` mapped to an end position
- optional key `type` mapped to an annotation type
- optional key `attributes` mapped to an arbitrary JSON object
and the following integrity constraints:
- start position and end position MUST be non-negative integer values
- start position MUST NOT be larger than end position
- start position and end position MUST NOT be larger than the length of the content string in number of Unicode code points
- a document must not contain fully identical annotations (same positions, same type, same attributes)
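The constraints above could be checked roughly as follows. This is a minimal sketch, not a reference validator: `annotation_errors` is a hypothetical helper, it accepts only positions that were parsed as integers (so `1.0` is rejected), and it assumes that a missing `attributes` key and an empty `attributes` object count as different when looking for identical annotations.

```python
import json


def annotation_errors(doc):
    """Return a list of constraint violations (empty if the annotations are valid)."""
    errors = []
    length = len(doc["content"])  # Python str length counts Unicode code points
    seen = set()
    for i, ann in enumerate(doc["annotations"]):
        start, end = ann.get("start"), ann.get("end")
        # bool is a subclass of int in Python, so it is excluded explicitly
        positions_ok = all(
            isinstance(p, int) and not isinstance(p, bool) and p >= 0
            for p in (start, end)
        )
        if not positions_ok:
            errors.append(f"annotation {i}: positions must be non-negative integers")
            continue
        if start > end:
            errors.append(f"annotation {i}: start is larger than end")
        if max(start, end) > length:
            errors.append(f"annotation {i}: position beyond end of content string")
        # Assumption: identity is decided on positions, type and attributes,
        # serialized with sorted keys so that key order does not matter.
        fingerprint = json.dumps(
            [start, end, ann.get("type"), ann.get("attributes")], sort_keys=True
        )
        if fingerprint in seen:
            errors.append(f"annotation {i}: identical to an earlier annotation")
        seen.add(fingerprint)
    return errors
```

Returning a list of messages rather than raising on the first problem makes it easier to report all violations of a document at once.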
Positions represent the spaces in between the characters of the content string. For instance, the first character is referenced by start position `0` and end position `1`.
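Because positions count code points and point in between characters, they behave like slice indices in Python, whose strings are indexed by code point. A small sketch, where `annotated_text` is a hypothetical helper and the sample string is made up:

```python
def annotated_text(content, annotation):
    """Return the part of the content string that an annotation covers."""
    # Python strings are indexed by code point, so slicing matches the position model.
    return content[annotation["start"]:annotation["end"]]


# "naïve" followed by a space and an emoji outside the Basic Multilingual Plane
content = "na\u00efve \U0001F600"

print(annotated_text(content, {"start": 0, "end": 1}))        # the first character
print(annotated_text(content, {"start": 6, "end": 7}))        # the emoji, one code point
print(repr(annotated_text(content, {"start": 3, "end": 3})))  # empty span before "v"
```

In languages whose strings are indexed by UTF-16 code units, such as JavaScript or Java, positions have to be converted before slicing, since the emoji above occupies two code units there.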
An annotation type is a non-empty JSON string that is
- either a URI
- or a well-known annotation type name, that is a non-empty string consisting of lowercase letters `a` to `z` and/or the minus sign `-`.
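The two kinds of annotation type can be told apart mechanically, because a well-known name never contains a colon. The sketch below makes one loud assumption: a string is treated as a URI if it merely starts with a scheme and a colon, which is much weaker than full URI syntax validation. `classify_annotation_type` is a hypothetical helper.

```python
import re

# Lowercase letters a to z and/or the minus sign, at least one character
WELL_KNOWN_NAME = re.compile(r"[a-z-]+")
# Assumption: anything starting with "<scheme>:" is accepted as a URI.
URI_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def classify_annotation_type(value):
    """Return "well-known", "uri", or None if the value is not a valid annotation type."""
    if not isinstance(value, str) or not value:
        return None
    if WELL_KNOWN_NAME.fullmatch(value):
        return "well-known"
    if URI_SCHEME.match(value):
        return "uri"
    return None


for candidate in ["bold", "line-break", "http://example.org/types/bold", "Bold", ""]:
    print(repr(candidate), classify_annotation_type(candidate))
```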
Each annotation type should imply an attribute schema, that is a specification of the attributes (e.g. by a JSON Schema).
A normalized OMDL document
- is a canonical JSON document
- must not contain empty object attributes
- must sort annotations
  - first by start position
  - then by end position
  - then by annotation type
  - then by attributes (given as canonical JSON)
- should normalize the content string to Unicode normalization form NFC (note that Unicode normalization may change annotation positions)
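The annotation-related normalization rules might be sketched as follows. Several points are assumptions of this sketch rather than part of the draft: `json.dumps` with sorted keys and compact separators stands in for canonical JSON, "empty object attributes" is read as an `attributes` key mapped to `{}`, and a missing `type` sorts as the empty string. The helper names are hypothetical. Adjusting positions after NFC normalization is not attempted; the last lines only show why it would be needed.

```python
import json
import unicodedata


def canonical(value):
    # Stand-in for canonical JSON (assumption): sorted keys, no extra whitespace.
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def sort_key(ann):
    # Assumption: a missing type sorts as "" and missing attributes as {}.
    return (
        ann["start"],
        ann["end"],
        ann.get("type", ""),
        canonical(ann.get("attributes", {})),
    )


def normalize_annotations(annotations):
    """Drop empty attributes objects and return the annotations in normalized order."""
    cleaned = []
    for ann in annotations:
        ann = dict(ann)  # shallow copy, the input is left untouched
        if ann.get("attributes") == {}:
            del ann["attributes"]
        cleaned.append(ann)
    return sorted(cleaned, key=sort_key)


# NFC can shorten the content string, which shifts all later positions:
decomposed = "e\u0301"  # "e" followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
```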
- RFC 8259: The JavaScript Object Notation (JSON) Data Interchange Format
- RFC : Uniform Resource Identifier (URI): Generic Syntax
- Canonical JSON (where is the current spec?)
- Ted Nelson (1997): Embedded Markup Considered Harmful