In storing our annotations, we will need some kind of document outline.
- Minimally, we need a URI.
- Annotation sets are tied to particular corpora, so we need a corpus identifier. This could be stored elsewhere, though.
- If we differentiate between "word" level and "unit" level annotations, we need to note that somewhere (again, maybe elsewhere).
- We need the actual annotation set.
{
"uri": "landmark/trajector",
"corpus": "NA1904",
"type": "word_level",
"annotations": [...]
}
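As a rough sketch of the outline above, the envelope could be typed and checked like this in Python. The names `AnnotationSet`, `ALLOWED_TYPES`, and `validate_annotation_set` are hypothetical, and the allowed `type` values are an assumption based on the word/unit distinction mentioned in the list.

```python
from typing import Any, TypedDict

# Assumption: the two levels mentioned in the outline above.
ALLOWED_TYPES = {"word_level", "unit_level"}


class AnnotationSet(TypedDict):
    uri: str
    corpus: str
    type: str
    annotations: list[Any]


def validate_annotation_set(doc: dict[str, Any]) -> AnnotationSet:
    """Check the envelope fields and return the document typed as an AnnotationSet."""
    for key in ("uri", "corpus", "type"):
        if not isinstance(doc.get(key), str) or not doc[key]:
            raise ValueError(f"missing or empty field: {key}")
    if doc["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown annotation type: {doc['type']}")
    if not isinstance(doc.get("annotations"), list):
        raise ValueError("annotations must be a list")
    return doc  # type: ignore[return-value]
```

This only checks the envelope; the shape of each entry in `annotations` is left open, since that is the unresolved part.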
I don't have a solution for annotations. What follows is my attempt to think through the issues at play and provide food for thought.
Morphological tagging is a set of key-value pairs attached to a given token. For example:
{
"head_token": "Gen 1:1!2",
"morphology": {
"part_of_speech": "verb",
"person": "3",
"number": "singular",
"tense": "perfect",
}
}
Levinsohn features mark particular words with a tag. In essence, this is a simplified key-value pair:
{
"head_token": "Gen 1:1!2",
"levinsohn_feature": "subject"
}
With multiple tags, this might be represented as:
{
"head_token": "Gen 1:1!2",
"levinsohn_features": ["subject", "focus"]
}
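Both the morphology and the Levinsohn features could be combined under a single tags key, in the same shape as the morphology example above: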
{
"head_token": "Gen 1:1!2",
"tags": {
"part_of_speech": "verb",
"person": "3",
"number": "singular",
"tense": "perfect",
"levinsohn_feature": "focus",
// or, with multiple tags
"levinsohn_features": ["subject", "focus"]
},
}
It may be useful to annotate referents and antecedents on words like pronouns. These annotations are relational but could easily fit into the above system:
{
"head_token": "John 1:2!",
"tags": {
"part_of_speech": "demonstrative pronoun",
"gloss": "this one, he, it",
...
"antecedent": "John 1:1!5", // i.e., "Logos" in v. 1
},
}
Landmark/Trajector annotations relate tokens (or token sets?) to one another. For example:
{
"preposition": "Gen 1:1!2",
"landmark": "Gen 1:1!3",
"trajector": "Gen 1:1!4",
"predicate": "Gen 1:1!5",
}
Treating the preposition as a head_token, we could use the above representation. Note that this will also work fine with token sets:
{
"head_token": "Gen 1:1!2",
"tags": {
"landmark": ["Gen 1:1!3", "Gen 1:1!4"],
"trajector": ["Gen 1:1!5", "Gen 1:1!6"],
"predicate": ["Gen 1:1!7", "Gen 1:1!8"],
}
}
At this point head_token may need to become a list as well, which would complicate O(1) lookups on the head token for morphological data.
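One way to keep lookups cheap even when an annotation has several heads is to index each annotation under every one of its head tokens. This is a minimal sketch, assuming a hypothetical `head_tokens` list field in place of `head_token`; `build_head_index` is an illustrative name, not an existing function.

```python
from collections import defaultdict


def build_head_index(annotations):
    """Map each head token to the list of annotations that name it as a head."""
    index = defaultdict(list)
    for annotation in annotations:
        for token in annotation["head_tokens"]:
            index[token].append(annotation)
    return index


annotations = [
    {"head_tokens": ["Gen 1:1!2"], "tags": {"part_of_speech": "verb"}},
    {"head_tokens": ["Gen 1:1!2", "Gen 1:1!3"], "tags": {"landmark": ["Gen 1:1!4"]}},
]
index = build_head_index(annotations)
# A single dict lookup now reaches every annotation headed by this token.
for annotation in index["Gen 1:1!2"]:
    print(annotation["tags"])
```

The lookup itself stays O(1), but it now returns a list of annotations rather than a single record, so callers have to filter for the kind of annotation they want.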
It is worth noting that this may also be represented in a flat list with a reference to the parent annotation:
{
"id": "uri/gen-1",
"type": "head",
"parent": null,
"tokens": [],
"children": [
"uri/gen-1.preposition",
"uri/gen-1.landmark",
"uri/gen-1.trajector",
"uri/gen-1.predicate",
],
}, {
"id": "uri/gen-1.preposition",
"type": "preposition",
"parent": "uri/gen-1",
"tokens": ["Gen 1:1!2"],
"children": [],
}, {
"id": "uri/gen-1.landmark",
"type": "landmark",
"parent": "uri/gen-1",
"tokens": ["Gen 1:1!3", "Gen 1:1!4"],
"children": [],
}, {
"id": "uri/gen-1.trajector",
"type": "trajector",
"parent": "uri/gen-1",
"tokens": ["Gen 1:1!5", "Gen 1:1!6"],
"children": [],
}, {
"id": "uri/gen-1.predicate",
"type": "predicate",
"parent": "uri/gen-1",
"tokens": ["Gen 1:1!7", "Gen 1:1!8"],
"children": [],
}
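To show that the flat form carries the same information as the compact one, here is a small sketch that regroups flat records by role using only their parent references. `collect_roles` is a hypothetical helper, and the last record is invented to show that unrelated records are ignored.

```python
def collect_roles(records, head_id):
    """Gather the tokens of every record whose parent is head_id, keyed by type."""
    roles = {}
    for record in records:
        if record["parent"] == head_id:
            roles[record["type"]] = record["tokens"]
    return roles


records = [
    {"id": "uri/gen-1.preposition", "type": "preposition",
     "parent": "uri/gen-1", "tokens": ["Gen 1:1!2"], "children": []},
    {"id": "uri/gen-1.landmark", "type": "landmark",
     "parent": "uri/gen-1", "tokens": ["Gen 1:1!3", "Gen 1:1!4"], "children": []},
    {"id": "uri/gen-1.trajector", "type": "trajector",
     "parent": "uri/gen-1", "tokens": ["Gen 1:1!5", "Gen 1:1!6"], "children": []},
    # Invented record belonging to a different parent; it should be skipped.
    {"id": "uri/gen-2.landmark", "type": "landmark",
     "parent": "uri/gen-2", "tokens": ["Gen 1:2!1"], "children": []},
]
roles = collect_roles(records, "uri/gen-1")
print(roles["landmark"])
```

This scans the whole list for each head, so a real store would probably want an index on `parent`.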
Runge-style discourse annotations relate units of discourse in a nested tree structure. This may be represented in a tree:
{
"tags": {
"label": "Creation",
"kind": "Basis",
},
"tokens": [],
"children": [{
"tags": {
"label": "",
"kind": "Basis",
},
"tokens": ["Gen 1:1!2", "Gen 1:1!3", "Gen 1:1!4", "Gen 1:1!5"],
"children": []
}, {
"tags": {
"label": "",
"kind": "Mainline",
},
"tokens": ["Gen 1:1!6", "Gen 1:1!7", "Gen 1:1!8"],
"children": []
}]
}
It may also be represented as a flat list, with references to parents and children:
{
"id": "uri/gen-1",
"tags": {
"label": "Creation",
"kind": "Basis",
},
"tokens": [],
"parent": null,
"children": ["uri/gen-1.1", "uri/gen-1.2"]
}, {
"id": "uri/gen-1.1",
"tags": {
"label": "",
"kind": "Basis",
},
"tokens": ["Gen 1:1!2", "Gen 1:1!3", "Gen 1:1!4", "Gen 1:1!5"],
"parent": "uri/gen-1",
"children": []
}, {
"id": "uri/gen-1.2",
"tags": {
"label": "",
"kind": "Mainline",
},
"tokens": ["Gen 1:1!6", "Gen 1:1!7", "Gen 1:1!8"],
"parent": "uri/gen-1",
"children": []
}
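The two representations are interchangeable. As a minimal sketch (the function name `build_tree` is hypothetical), the flat list can be nested back into the tree form by following each record's `children` references:

```python
def build_tree(records):
    """Nest flat records into trees, ordering children as listed in each record."""
    nodes = {
        record["id"]: {"tags": record["tags"], "tokens": record["tokens"], "children": []}
        for record in records
    }
    for record in records:
        # Raises KeyError if a record lists a child id that does not exist.
        nodes[record["id"]]["children"] = [nodes[child_id] for child_id in record["children"]]
    # Roots are the records without a parent.
    return [nodes[record["id"]] for record in records if record["parent"] is None]


records = [
    {"id": "uri/gen-1", "tags": {"label": "Creation", "kind": "Basis"},
     "tokens": [], "parent": None, "children": ["uri/gen-1.1", "uri/gen-1.2"]},
    {"id": "uri/gen-1.1", "tags": {"label": "", "kind": "Basis"},
     "tokens": ["Gen 1:1!2", "Gen 1:1!3", "Gen 1:1!4", "Gen 1:1!5"],
     "parent": "uri/gen-1", "children": []},
    {"id": "uri/gen-1.2", "tags": {"label": "", "kind": "Mainline"},
     "tokens": ["Gen 1:1!6", "Gen 1:1!7", "Gen 1:1!8"],
     "parent": "uri/gen-1", "children": []},
]
tree = build_tree(records)
```

Storing both `parent` and `children` is redundant, so the two can drift apart; this sketch trusts `children` for ordering and uses `parent` only to find the roots.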
I have been thinking about a distinction between word-level and unit-level annotations. Another possibility is that the distinction could be made based on the number of tokens needed in head_token(s). If the concept of head_token does not make sense, then perhaps the annotation is unit-level.
The more important distinction might be between relational and non-relational annotations. The possibility of representing Runge-style annotations in the same way as Landmark/Trajector annotations is suggestive in this regard.
However, there are also simple relationships, such as marking pronoun referents. The fact that these could all be represented in the same way does not mean they should be: followed to its conclusion, this logic would produce an EAV-style (entity-attribute-value) database for morphology features.
It remains unclear to me how to distinguish these kinds of annotations. The two complicating factors are (1) the need to denote relationships and (2) the need to represent multiple tokens as "heads".
An interesting problem is Runge's decision words. These could be represented as:
- Like landmark/trajector: A child of the annotation (alongside mainline and basis children). This has the obvious problem that it is not a part of the tree in the same way that the other children are.
- A tag on the annotation. This suggests that tags (like label, kind, and [I guess] part_of_speech) must support strings and relationships.
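If tags do need to hold both strings and relationships, one option is to make the difference explicit in the value type. The sketch below is only an illustration: `TokenRef`, `TagValue`, `is_relational`, and the `decision_word` key are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class TokenRef:
    """A relationship: the tag points at one or more tokens."""
    tokens: tuple[str, ...]


# A tag value is either a plain string or a reference to tokens.
TagValue = Union[str, TokenRef]


def is_relational(tags: dict[str, TagValue]) -> bool:
    """An annotation counts as relational if any tag value is a TokenRef."""
    return any(isinstance(value, TokenRef) for value in tags.values())


unit_tags = {
    "label": "Creation",
    "kind": "Basis",
    # Hypothetical key for a decision word attached to this unit.
    "decision_word": TokenRef(("Gen 1:1!2",)),
}
morphology_tags = {"part_of_speech": "verb", "person": "3"}
print(is_relational(unit_tags), is_relational(morphology_tags))
```

Wrapping references in their own type also gives one possible handle on the relational versus non-relational distinction, without having to guess from the string format whether a value is a token address.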
When mapping over objects on the frontend, it is really helpful to have predictable key names (for type safety). I have not done this in the examples above. Instead of this:
{
"label": "Creation",
"kind": "Basis",
}
I would prefer the more verbose:
[{
"key": "label",
"value": "Creation",
},{
"key": "kind",
"value": "Basis",
}]
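The two shapes convert mechanically, so the choice does not have to be locked in at storage time. A minimal sketch, with the hypothetical helper names `to_pairs` and `from_pairs`:

```python
def to_pairs(tags):
    """Turn {"label": "Creation"} into [{"key": "label", "value": "Creation"}]."""
    return [{"key": key, "value": value} for key, value in tags.items()]


def from_pairs(pairs):
    """Inverse of to_pairs; a later duplicate key overwrites an earlier one."""
    return {pair["key"]: pair["value"] for pair in pairs}


tags = {"label": "Creation", "kind": "Basis"}
pairs = to_pairs(tags)
print(pairs)
```

Note that the list form can hold the same key twice while the object form cannot, so a round trip is only lossless when keys are unique.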
I'm trying to wrestle with some higher-order concerns within ATLAS around EAV. As I think I've mentioned before, I do anticipate routinely hitting a point where we have a more "standardized" way of modeling data and then using that to encode something akin to: