Indexes have been created for recordings in the Aviary SpokenWeb collection. These indexes were originally created outside of the application (in Google Sheets) in a custom format. See Metadata Preparation below for more details.
A script was developed to transform timestamps from a field- and format-normalized spreadsheet into WebVTT index format for ingest into Aviary. An index contains labels (segments) with a start and end time in timecode.
A spreadsheet template with field validation was created for future indexes. See Timestamp Template and Sample Metadata below.
The original timestamp spreadsheets created by SpokenWeb students included the required information from the start. To prepare the data, I normalized SpokenWeb identifiers, consolidated headers, and normalized data formats (time as hh:mm:ss.mil) to consistent strings across spreadsheets. This prep work may not be needed in future if timestamps are created with the template linked below.
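The time normalization described above could be sketched roughly as follows. This is not the code used for the original prep work, which was done in the spreadsheets; `normalize_timestamp` is a hypothetical helper, and it assumes that a two-part value such as 02:03 means mm:ss and that a comma may stand in for the decimal point.

```python
import re

# Optional hours, then minutes and seconds, then an optional fraction.
_TIME_PATTERN = re.compile(r"(?:(\d+):)?(\d{1,2}):(\d{1,2})(?:[.,](\d{1,3}))?")


def normalize_timestamp(value: str) -> str:
    """Return a time value as an hh:mm:ss.mil string (hypothetical helper)."""
    match = _TIME_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    hours, minutes, seconds, fraction = match.groups()
    # Assumption: a missing hours part means the value is mm:ss.
    hours = int(hours or 0)
    minutes = int(minutes)
    seconds = int(seconds)
    if minutes > 59 or seconds > 59:
        raise ValueError(f"minutes or seconds out of range: {value!r}")
    # Pad the fraction on the right so that ".5" becomes 500 milliseconds.
    millis = int((fraction or "").ljust(3, "0"))
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
```

For example, `normalize_timestamp("1:02:03")` gives `01:02:03.000`, and `normalize_timestamp("00:01:15.5")` gives `00:01:15.500`.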
Timestamp metadata was created in separate worksheets for each Aviary resource. Worksheets are named after each resource's SpokenWeb identifier.
Exporting spreadsheet worksheets to tsv was done using an adapted LibreOffice macro (included in the present gist as Export2Tsv.bas). Possible future work: process spreadsheets to character-separated files using a python script or google-script.
A small python script was put together to process the metadata and export it as WebVTT indexes.
The script tests for the presence of Start, Stop, and Action values, and currently assists in formatting valid hh:mm:ss.mil timestamps. Future work: the formatting step will be removed, and time data will be validated before transformation with the help of the spreadsheet template to be used by index creators (see the Timestamp Template section below).
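As a rough illustration of the presence test and the transformation to WebVTT, the sketch below reads tab-separated text with Start, Stop, and Action columns and builds one cue per complete row. It is not the script from this gist: `rows_to_webvtt` is a hypothetical name, the times are assumed to be valid hh:mm:ss.mil strings already, and skipping incomplete rows (rather than raising an error) is an assumption about the desired behaviour.

```python
import csv
import io

REQUIRED_FIELDS = ("Start", "Stop", "Action")


def rows_to_webvtt(tsv_text: str):
    """Return (webvtt_text, skipped_rows) for tab-separated index metadata.

    Hypothetical sketch: rows missing any required value are skipped and
    reported by their position among the non-empty data rows.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    cues = []
    skipped = []
    for row_number, row in enumerate(reader, start=1):
        values = {name: (row.get(name) or "").strip() for name in REQUIRED_FIELDS}
        if not all(values.values()):
            skipped.append(row_number)
            continue
        # A WebVTT cue: a timing line followed by the cue text.
        cues.append(f"{values['Start']} --> {values['Stop']}\n{values['Action']}")
    # A WebVTT file starts with the WEBVTT header; blank lines separate cues.
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n", skipped
```

Returning the skipped row numbers alongside the text lets a caller decide whether an incomplete worksheet should block the export or only be logged.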
Each index is exported as a txt file named after SpokenWeb identifiers. These identifiers are later used to map files to existing Aviary resources for ingest.
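The naming step might look something like the sketch below. It follows the pattern of the sample files in this gist, where the txt index keeps the exported tsv file name with .txt appended. Both function names are hypothetical, and reading the identifier as the text after the last " - " in the file name is an assumption based on the sample export name only.

```python
from pathlib import Path


def identifier_from_export(tsv_path: Path) -> str:
    """Guess the SpokenWeb identifier from an exported tsv file name.

    Assumption: exports are named "<spreadsheet> - <worksheet>.tsv" and the
    worksheet name is the identifier.
    """
    name = tsv_path.name
    if name.endswith(".tsv"):
        name = name[: -len(".tsv")]
    # rpartition returns the whole name when the separator is absent.
    return name.rpartition(" - ")[2]


def write_index(tsv_path: Path, vtt_text: str) -> Path:
    """Write the WebVTT text next to the tsv, appending .txt to its name."""
    out_path = tsv_path.with_name(tsv_path.name + ".txt")
    out_path.write_text(vtt_text, encoding="utf-8")
    return out_path
```

With the sample export, `identifier_from_export(Path("WebVTT Timestamps - sample-123.tsv"))` gives `sample-123`, which is the kind of value that would be matched against an Aviary resource at ingest.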
A template for timestamp creation is available here. Contact metadata@ualberta.ca for access.
Sample tsv exported from the template: WebVTT Timestamps - sample-123.tsv
Sample result WebVTT index from sample tsv: WebVTT Timestamps - sample-123.tsv.txt