@jcuenod
Last active June 30, 2023 17:56

Document

In storing our annotations, we will need some kind of document outline.

  1. Minimally, we need a URI.
  2. Annotation sets are tied to particular corpora, so we need to record the corpus. This could be stored elsewhere, though.
  3. If we differentiate between "word" level and "unit" level annotations, we need to note that somewhere (again, maybe elsewhere).
  4. We need the actual annotation set.
{
    "uri": "landmark/trajector",
    "corpus": "NA1904",
    "type": "word_level",
    "annotations": [...]
}
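
As a rough sketch of how a consumer might check this outline, the Python below parses a document and verifies the four fields. The names `AnnotationDocument` and `parse_document` are hypothetical, and the `unit_level` value is an assumption: only `word_level` appears in the example above.

```python
import json
from dataclasses import dataclass, field

# Assumed set of values for "type"; only "word_level" appears in the example.
ALLOWED_TYPES = {"word_level", "unit_level"}


@dataclass
class AnnotationDocument:
    uri: str
    corpus: str
    type: str
    annotations: list = field(default_factory=list)


def parse_document(raw: str) -> AnnotationDocument:
    """Parse a JSON string into an AnnotationDocument, checking required fields."""
    data = json.loads(raw)
    missing = [k for k in ("uri", "corpus", "type", "annotations") if k not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    if data["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown annotation type: {data['type']!r}")
    if not isinstance(data["annotations"], list):
        raise ValueError("annotations must be a list")
    return AnnotationDocument(
        data["uri"], data["corpus"], data["type"], data["annotations"]
    )


doc = parse_document(
    '{"uri": "landmark/trajector", "corpus": "NA1904", '
    '"type": "word_level", "annotations": []}'
)
```

If the corpus or type ends up being stored elsewhere, the corresponding checks would simply move with them.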

Annotations

I don't have a solution for annotations. What follows is my attempt to think through the issues at play and provide food for thought.

Annotation Types

Morphological tagging

Morphological tagging is a set of key-value pairs attached to a given token. For example:

{
    "head_token": "Gen 1:1!2",
    "morphology": {
        "part_of_speech": "verb",
        "person": "3",
        "number": "singular",
        "tense": "perfect",
    }
}

Levinsohn Features

Levinsohn features mark particular words with a tag. In essence, each one is a simplified key-value pair.

{
    "head_token": "Gen 1:1!2",
    "levinsohn_feature": "subject"
}

With multiple tags, this might be represented as:

{
    "head_token": "Gen 1:1!2",
    "levinsohn_features": ["subject", "focus"]
}
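
Having both a singular and a plural key means consumers must handle two shapes. A minimal sketch of a normalizing helper, assuming annotations arrive as plain dictionaries (the function name `levinsohn_features` is hypothetical):

```python
def levinsohn_features(annotation: dict) -> list:
    """Return the Levinsohn features as a list, whichever key was used."""
    if "levinsohn_features" in annotation:
        return list(annotation["levinsohn_features"])
    if "levinsohn_feature" in annotation:
        return [annotation["levinsohn_feature"]]
    return []


single = {"head_token": "Gen 1:1!2", "levinsohn_feature": "subject"}
multiple = {"head_token": "Gen 1:1!2", "levinsohn_features": ["subject", "focus"]}

single_features = levinsohn_features(single)
multiple_features = levinsohn_features(multiple)
```

Always storing a list, even for a single tag, would make a helper like this unnecessary.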

Alternatively, the features could be folded into the same structure as the morphology above, under a generic tags key:

{
    "head_token": "Gen 1:1!2",
    "tags": {
        "part_of_speech": "verb",
        "person": "3",
        "number": "singular",
        "tense": "perfect",
        "levinsohn_feature": "focus",
        // or, with multiple tags
        "levinsohn_features": ["subject", "focus"]
    },
}

Simple Relational Morphology

It may be useful to annotate referents and antecedents on words like pronouns. These annotations are relational but could easily fit into the above system:

{
    "head_token": "John 1:2!",
    "tags": {
        "part_of_speech": "demonstrative pronoun",
        "gloss": "this one, he, it",
        ...
        "antecedent": "John 1:1!5", // i.e., "Logos" in v. 1
    },
}

Landmark/Trajector

Landmark/Trajector annotations relate tokens (or token sets?) to one another. For example:

{
    "preposition": "Gen 1:1!2",
    "landmark": "Gen 1:1!3",
    "trajector": "Gen 1:1!4",
    "predicate": "Gen 1:1!5",
}

Treating the preposition as a head_token, we could use the above representation. Note that this will also work fine with token sets:

{
    "head_token": "Gen 1:1!2",
    "tags": {
        "landmark": ["Gen 1:1!3", "Gen 1:1!4"],
        "trajector": ["Gen 1:1!5", "Gen 1:1!6"],
        "predicate": ["Gen 1:1!7", "Gen 1:1!8"],
    }
}

At this point head_token may need to become a list as well, which would complicate O(1) lookups on the head token for morphological data.
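
One way to keep lookups close to constant time even when there are several heads is to index each annotation under every one of its head tokens. This is only a sketch: it assumes `head_token` holds either a single string or a list of strings, and `index_by_head_token` is a hypothetical name.

```python
from collections import defaultdict


def index_by_head_token(annotations: list) -> dict:
    """Map each head token to the list of annotations that name it as a head."""
    index = defaultdict(list)
    for annotation in annotations:
        heads = annotation["head_token"]
        if isinstance(heads, str):
            heads = [heads]  # treat a single head as a one-element list
        for token in heads:
            index[token].append(annotation)
    return dict(index)


annotations = [
    {"head_token": "Gen 1:1!2", "tags": {"part_of_speech": "verb"}},
    {"head_token": ["Gen 1:1!2", "Gen 1:1!3"], "tags": {"landmark": ["Gen 1:1!4"]}},
]
index = index_by_head_token(annotations)
```

The cost is that a lookup now returns a list of annotations, so the caller has to pick out the morphological one.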

It is worth noting that this may also be represented in a flat list with a reference to the parent annotation:

{
    "id": "uri/gen-1",
    "type": "head",
    "parent": null,
    "tokens": [],
    "children": [
        "uri/gen-1.preposition",
        "uri/gen-1.landmark",
        "uri/gen-1.trajector",
        "uri/gen-1.predicate",
    ],
}, {
    "id": "uri/gen-1.preposition",
    "type": "preposition",
    "parent": "uri/gen-1",
    "tokens": ["Gen 1:1!2"],
    "children": [],
}, {
    "id": "uri/gen-1.landmark",
    "type": "landmark",
    "parent": "uri/gen-1",
    "tokens": ["Gen 1:1!3", "Gen 1:1!4"],
    "children": [],
}, {
    "id": "uri/gen-1.trajector",
    "type": "trajector",
    "parent": "uri/gen-1",
    "tokens": ["Gen 1:1!5", "Gen 1:1!6"],
    "children": [],
}, {
    "id": "uri/gen-1.predicate",
    "type": "predicate",
    "parent": "uri/gen-1",
    "tokens": ["Gen 1:1!7", "Gen 1:1!8"],
    "children": [],
}
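
A flat list like this can be turned back into a nested structure from the parent references alone. The sketch below assumes every item has a unique `id` and that each `parent` is either `null` or the id of another item in the list; `build_tree` is a hypothetical name.

```python
def build_tree(flat: list) -> list:
    """Nest flat annotations under their parents and return the root nodes."""
    # Copy each item and reset "children" so it can hold nested nodes, not ids.
    nodes = {item["id"]: {**item, "children": []} for item in flat}
    roots = []
    for item in flat:
        node = nodes[item["id"]]
        parent_id = item.get("parent")
        if parent_id is None:
            roots.append(node)
        else:
            # Raises KeyError if the parent id is not present in the list.
            nodes[parent_id]["children"].append(node)
    return roots


flat = [
    {"id": "uri/gen-1", "type": "head", "parent": None, "tokens": []},
    {"id": "uri/gen-1.preposition", "type": "preposition",
     "parent": "uri/gen-1", "tokens": ["Gen 1:1!2"]},
    {"id": "uri/gen-1.landmark", "type": "landmark",
     "parent": "uri/gen-1", "tokens": ["Gen 1:1!3", "Gen 1:1!4"]},
]
roots = build_tree(flat)
```

Because only `parent` is read, the stored `children` lists are redundant for this purpose, which may be an argument for keeping just one of the two references.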

Runge-style Discourse Annotations

Runge-style discourse annotations relate units of discourse in a nested tree structure. This may be represented in a tree:

{
    "tags": {
        "label": "Creation",
        "kind": "Basis",
    },
    "tokens": [],
    "children": [{
        "tags": {
            "label": "",
            "kind": "Basis",
        },
        "tokens": ["Gen 1:1!2", "Gen 1:1!3", "Gen 1:1!4", "Gen 1:1!5"],
        "children": []
    }, {
        "tags": {
            "label": "",
            "kind": "Mainline",
        },
        "tokens": ["Gen 1:1!6", "Gen 1:1!7", "Gen 1:1!8"],
        "children": []
    }]
}

It may also be represented as a flat list, with references to parents and children:

{
    "id": "uri/gen-1",
    "tags": {
        "label": "Creation",
        "kind": "Basis",
    },
    "tokens": [],
    "parent": null,
    "children": ["uri/gen-1.1", "uri/gen-1.2"]
}, {
    "id": "uri/gen-1.1",
    "tags": {
        "label": "",
        "kind": "Basis",
    },
    "tokens": ["Gen 1:1!2", "Gen 1:1!3", "Gen 1:1!4", "Gen 1:1!5"],
    "parent": "uri/gen-1",
    "children": []
}, {
    "id": "uri/gen-1.2",
    "tags": {
        "label": "",
        "kind": "Mainline",
    },
    "tokens": ["Gen 1:1!6", "Gen 1:1!7", "Gen 1:1!8"],
    "parent": "uri/gen-1",
    "children": []
}
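
Going from the nested form to the flat form is a short recursive walk. In this sketch the id scheme (`<parent id>.<n>`) mirrors the ids in the example but is otherwise an assumption, as is the function name `flatten`.

```python
def flatten(node: dict, node_id: str, parent_id=None) -> list:
    """Flatten a nested annotation into a list with parent and child id references."""
    child_ids = [f"{node_id}.{n}" for n in range(1, len(node["children"]) + 1)]
    flat = [{
        "id": node_id,
        "tags": node["tags"],
        "tokens": node["tokens"],
        "parent": parent_id,
        "children": child_ids,
    }]
    for child_id, child in zip(child_ids, node["children"]):
        flat.extend(flatten(child, child_id, node_id))
    return flat


tree = {
    "tags": {"label": "Creation", "kind": "Basis"},
    "tokens": [],
    "children": [
        {"tags": {"label": "", "kind": "Basis"},
         "tokens": ["Gen 1:1!2", "Gen 1:1!3"], "children": []},
        {"tags": {"label": "", "kind": "Mainline"},
         "tokens": ["Gen 1:1!6", "Gen 1:1!7"], "children": []},
    ],
}
flat_nodes = flatten(tree, "uri/gen-1")
```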

Notes

Word-v-Unit / Head-Tokens / Relational-v-Non-Relational

I have been thinking about a distinction between word-level and unit-level annotations. Another possibility is to base the distinction on the number of tokens that head_token(s) would need to hold. If the concept of head_token does not make sense for an annotation, then perhaps the annotation is unit-level.

The more important distinction might be between relational and non-relational annotations. The possibility of representing Runge-style annotations in the same way as Landmark/Trajector annotations is suggestive in this regard.

However, there are also simple relationships, such as marking pronoun referents. The fact that these could be represented in the same way does not mean they should be, though. This logic would create an EAV-style database for morphology features.

It remains unclear to me how to distinguish these kinds of annotations. The two complicating factors are (1) the need to denote relationships and (2) the need to represent multiple tokens as "heads".

Decision Words

An interesting problem is Runge's decision words. These could be represented as:

  1. Like landmark/trajector: A child of the annotation (alongside mainline and basis children). This has the obvious problem that it is not a part of the tree in the same way that the other children are.
  2. A tag on the annotation. This suggests that tags (like label, kind, and [I guess] part_of_speech) must support strings and relationships.
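
For the second option, tag values would need to be either plain strings or references to tokens. A minimal sketch of such a two-shaped value follows; the class names and the `decision_word` key are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class TextValue:
    """A plain string tag value, e.g. a label or a kind."""
    text: str


@dataclass(frozen=True)
class TokenRefs:
    """A relational tag value pointing at one or more tokens."""
    tokens: tuple


TagValue = Union[TextValue, TokenRefs]


def describe(value: TagValue) -> str:
    """Render a tag value, showing which of the two shapes it has."""
    if isinstance(value, TextValue):
        return f"string: {value.text}"
    return "relationship: " + ", ".join(value.tokens)


tags = {
    "label": TextValue("Creation"),
    "kind": TextValue("Basis"),
    "decision_word": TokenRefs(("Gen 1:1!2",)),  # hypothetical key
}
descriptions = {key: describe(value) for key, value in tags.items()}
```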

Type Safety

When mapping over objects on the frontend, it is really helpful to have predictable key names (for type safety). I have not done this with the above examples. Instead of this:

{
    "label": "Creation",
    "kind": "Basis",
}

I would prefer the more verbose:

[{
    "key": "label",
    "value": "Creation",
},{
    "key": "kind",
    "value": "Basis",
}]
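
Converting between the two shapes is mechanical, so the verbose form could be produced at the API boundary while the compact form is kept elsewhere. A sketch, with hypothetical helper names `to_pairs` and `from_pairs`:

```python
def to_pairs(tags: dict) -> list:
    """Turn {"label": "Creation"} into [{"key": "label", "value": "Creation"}]."""
    return [{"key": key, "value": value} for key, value in tags.items()]


def from_pairs(pairs: list) -> dict:
    """Inverse of to_pairs; if a key repeats, the last pair wins."""
    return {pair["key"]: pair["value"] for pair in pairs}


compact = {"label": "Creation", "kind": "Basis"}
pairs = to_pairs(compact)
round_tripped = from_pairs(pairs)
```

With the pair form, every element has the same two keys, so frontend code can iterate without knowing the tag names in advance.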
@jacobwegner

This logic would create an EAV-style database for morphology features

I'm trying to wrestle with some higher-order concerns within ATLAS around EAV. As I think I've mentioned before, I do anticipate routinely hitting a point where we have a more "standardized" way of modeling data and then using that to encode something akin to:

TokenMorphology
  morphology_collection: https://github.com/jcuenod/sweet-morphology-o-mine
  token: Gen 1:1!2 (WLC) 
  part_of_speech: verb
  person: '3'
  number: singular
  tense: perfect

@jacobwegner

When mapping over objects on the frontend, it is really helpful to have predictable key names (for type safety)

I think we've talked before about the purpose of the "label" field; the label on an annotation would, I think, always be distinct from a user-provided "key", e.g. kind, etc.

@jacobwegner

(I still have "homework" to give my thinking on each of the examples you've enumerated; didn't get there today, but will pick it back up next week!)
