This is my first crack at spec'ing out the fields in the wordObjects contained in a "universal transcript" as produced by a machine, a human, or both, from human speech.
The word field holds the word that the transcriber thinks was spoken.
The speakerID field is a string or integer identifier of who is speaking. It can also be undefined.
The start field is when in a recording the word is uttered, expressed as a decimal/float. Two decimal places should be sufficient for most applications. For individual words, only a machine will produce reliable values for this field.
The end field is when in a recording the word ends, expressed the same way as start. Only a machine will produce a reliable value for this.
The confidence field is how confident, from 0 to 1 (two decimal places), the transcriber is in the word value. In most cases this should be set to 1 for a human transcriber, though a human could use brackets or a similar convention to indicate a word or phrase that they're less than 100% confident in, which could be parsed to something less than 1 here.
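As a rough illustration of the bracket idea, here is a minimal Python sketch. It assumes square brackets as the convention and an arbitrary value of 0.5 for bracketed words; neither is fixed by this spec, and the function name parse_human_text is made up for the example. Punctuation is left attached to the word to keep the sketch short.

```python
import re

# Hypothetical value for bracketed (uncertain) words; the spec does not fix one.
UNCERTAIN_CONFIDENCE = 0.5

def parse_human_text(text, uncertain=UNCERTAIN_CONFIDENCE):
    """Split a human-typed line into partial wordObjects.

    Words inside square brackets get `uncertain`; all others get 1.
    """
    results = []
    # Match either a bracketed phrase or a run of non-space characters.
    for match in re.finditer(r"\[([^\]]*)\]|(\S+)", text):
        bracketed, plain = match.group(1), match.group(2)
        if bracketed is not None:
            # A bracketed phrase may hold several words; each one is uncertain.
            for word in bracketed.split():
                results.append({"word": word, "confidence": uncertain})
        else:
            results.append({"word": plain, "confidence": 1})
    return results

print(parse_human_text("Hi [there] friend"))
```

A real parser would also need to decide how brackets interact with punctuation and whether different markers map to different confidence levels.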
The alwaysCapitalized helper field is for keeping capitalization accurate when a UI-rendered transcript is edited and, for example, a proper noun -- or perhaps the word "I" -- changes from being the first word in a sentence to being the second word. The rendering code could then keep that proper noun capitalized even though it no longer begins the sentence, by taking into consideration an alwaysCapitalized value of true.
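One way the rendering rule could look, sketched in Python. The function name display_word and the starts_sentence argument are assumptions for the example; how the renderer decides that a word starts a sentence is left to the caller.

```python
def display_word(word_obj, starts_sentence):
    """Return the word's text with its first letter cased for display."""
    text = word_obj["word"]
    if starts_sentence or word_obj.get("alwaysCapitalized", False):
        # Capitalize only the first character; leave the rest untouched.
        return text[:1].upper() + text[1:]
    # Mid-sentence and not flagged: make sure the first character is lowercase.
    return text[:1].lower() + text[1:]

# "Paris" stays capitalized after moving to mid-sentence; "Hi" does not.
print(display_word({"word": "Paris", "alwaysCapitalized": True}, False))  # Paris
print(display_word({"word": "Hi", "alwaysCapitalized": False}, False))    # hi
```

Only the first character is changed, so stored spellings such as "McDonald" keep their internal capitals.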
The puncAfter field is an array containing any punctuation that should be appended to a given word. An array should be used here because there are scenarios where more than one punctuation character occurs after a word, e.g. "'Hi.'" Using a field such as this instead of putting punctuation into its own wordObject makes sense if you consider that punctuation would generally not have a start, end, or confidence value, though changing the rendering of surrounding words might be a little more complex this way -- for example, if a period were changed to a comma, necessitating a change to the rendering of the next word in a transcript (to un-capitalize it if appropriate).
The puncBefore field is an array containing any punctuation that should be prepended to a given word. See puncAfter for applications and tradeoffs.
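To make the tradeoff concrete, here is a small Python sketch of how a renderer might attach the two punctuation arrays and detect a sentence boundary. The helper names and the set of sentence-ending marks are assumptions for the example, not part of the spec.

```python
# Assumption: these marks end a sentence; a real renderer may need more cases.
SENTENCE_ENDERS = {".", "!", "?"}

def with_punctuation(word_obj):
    """Join puncBefore, the word, and puncAfter into one display string."""
    before = "".join(word_obj.get("puncBefore", []))
    after = "".join(word_obj.get("puncAfter", []))
    return before + word_obj["word"] + after

def ends_sentence(word_obj):
    """True if any trailing punctuation mark closes the sentence."""
    return any(mark in SENTENCE_ENDERS for mark in word_obj.get("puncAfter", []))

word = {"word": "Hi", "puncBefore": ["'"], "puncAfter": [".", "'"]}
print(with_punctuation(word))  # 'Hi.'
print(ends_sentence(word))     # True

# Changing the period to a comma means the next word no longer starts a sentence.
word["puncAfter"] = [","]
print(ends_sentence(word))     # False
```

The last two lines show the extra work mentioned above: an edit to one word's puncAfter changes how the following word should be capitalized.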
{"transcript": [
  {
    "word": "Hi",
    "speakerID": 1,
    "start": 1.23,
    "end": 1.56,
    "confidence": 0.98,
    "alwaysCapitalized": false,
    "puncAfter": [],
    "puncBefore": []
  },
  {
    "word": "there",
    "speakerID": 1,
    "start": 1.56,
    "end": 1.85,
    "confidence": 1,
    "alwaysCapitalized": false,
    "puncAfter": ["."],
    "puncBefore": []
  },
  ...
]}
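For anyone consuming this format from Python, the fields above could be written down as a type plus a few sanity checks. This is only a sketch: the WordObject type, the check_word helper, and the specific checks are assumptions layered on the descriptions above, not a settled schema.

```python
from typing import List, Optional, TypedDict, Union

class WordObject(TypedDict):
    word: str
    speakerID: Optional[Union[str, int]]  # may be undefined
    start: float
    end: float
    confidence: float
    alwaysCapitalized: bool
    puncAfter: List[str]
    puncBefore: List[str]

def check_word(obj):
    """Return a list of problems found in one wordObject (empty if none)."""
    problems = []
    if not isinstance(obj.get("word"), str):
        problems.append("word must be a string")
    conf = obj.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        problems.append("confidence must be between 0 and 1")
    start, end = obj.get("start"), obj.get("end")
    if isinstance(start, (int, float)) and isinstance(end, (int, float)) and end < start:
        problems.append("end must not precede start")
    return problems

sample: WordObject = {
    "word": "Hi", "speakerID": 1, "start": 1.23, "end": 1.56,
    "confidence": 0.98, "alwaysCapitalized": False,
    "puncAfter": [], "puncBefore": [],
}
print(check_word(sample))  # []
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a transcript in a single pass.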