@anayram
Last active April 9, 2024 19:11
WebVTT index conversion

WebVTT Index Conversion Project

Indexes have been created for recordings in the Aviary SpokenWeb collection. These indexes were originally created outside of the application (in Google Sheets) in a custom format. See Metadata Preparation below for more details.

A script was developed to transform timestamps from a field- and format-normalized spreadsheet into WebVTT index format for ingest into Aviary. An index contains labels (segments), each with a start and end time expressed as timecode.

A spreadsheet template with field validation was created for future indexes. See Timestamp Template and Sample Metadata below.

Metadata Preparation

The original timestamp spreadsheets created by SpokenWeb students included the required information from the start. To prepare the data I normalized SpokenWeb identifiers, consolidated headers, and normalized data formats (time as hh:mm:ss.mil) to consistent strings across spreadsheets. This prep work may not be needed in the future if timestamps are created with the template linked below.

Timestamp metadata was created in separate worksheets for each Aviary resource. Worksheets are named after each resource's SpokenWeb identifier.

Spreadsheet worksheets were exported to TSV using an adapted LibreOffice macro (included in this gist as Export2Tsv.bas). Possible future work: convert spreadsheets to character-separated files using a Python script or Google Apps Script.
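
As a rough illustration of the Python option, the sketch below writes one TSV file per worksheet, named after the worksheet, using only the standard library. The function name `export_sheets_to_tsv` and the shape of its `sheets` argument (worksheet name mapped to a list of rows, header first) are assumptions made for this example; reading the rows out of the spreadsheet itself is left to whichever spreadsheet reader is available.

```python
import csv
from pathlib import Path


def export_sheets_to_tsv(sheets, out_dir):
    """Write one tab-separated file per worksheet, named after the worksheet.

    `sheets` (hypothetical structure) maps a worksheet name, such as a
    SpokenWeb identifier, to its rows. Each row is a list of cell values
    and the first row is the header.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, rows in sheets.items():
        path = out_dir / f"{name}.tsv"
        # newline="" lets the csv module control line endings itself
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerows(rows)
        written.append(path)
    return written
```

Writing into a directory such as `indexdata` would leave the files where the conversion script in this gist expects to find them.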

Script

A small Python script was written to process the metadata and export it as WebVTT indexes.

The script tests for the presence of Start, Stop, and Action values, and currently helps format valid hh:mm:ss.mil timestamps. Future work: the formatting step will be removed, and time data will be validated before transformation with the help of the spreadsheet template used by index creators (see the Timestamp Template and Sample Metadata section below).

Each index is exported as a txt file named after its SpokenWeb identifier. These identifiers are later used to map files to existing Aviary resources for ingest.
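
One possible shape for that validation step is sketched below. It is not part of the script in this gist: the helper names (`normalize_timestamp`, `to_seconds`, `cue_timing`) are hypothetical, and the rule that a span's stop time must come after its start time is an assumption about what the template checks.

```python
import re

# hh:mm:ss with an optional .mil part; hours may have two or more digits.
_TIME_RE = re.compile(r"(\d{2,}):([0-5]\d):([0-5]\d)(?:\.(\d{1,3}))?")


def normalize_timestamp(value):
    """Return `value` as hh:mm:ss.mil, or raise ValueError if it is malformed."""
    match = _TIME_RE.fullmatch(str(value).strip())
    if match is None:
        raise ValueError(f"not a valid hh:mm:ss[.mil] timestamp: {value!r}")
    hours, minutes, seconds, millis = match.groups()
    # Pad the fractional part to three digits; a missing part becomes "000".
    millis = (millis or "").ljust(3, "0")
    return f"{hours}:{minutes}:{seconds}.{millis}"


def to_seconds(timestamp):
    """Convert an hh:mm:ss[.mil] timestamp to a number of seconds."""
    hours, minutes, seconds = normalize_timestamp(timestamp).split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def cue_timing(start, stop):
    """Build a WebVTT cue timing line, rejecting spans that do not move forward."""
    start, stop = normalize_timestamp(start), normalize_timestamp(stop)
    if to_seconds(stop) <= to_seconds(start):
        raise ValueError(f"stop {stop} is not after start {start}")
    return f"{start} --> {stop}"
```

With helpers like these, a row whose Start is `00:00:00` and Stop is `00:00:03` would yield the timing line `00:00:00.000 --> 00:00:03.000`, while a malformed value such as `0:75:00` would raise an error before any file is written.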

Timestamp Template and Sample Metadata

A template for timestamp creation is available here. Contact metadata@ualberta.ca for access.

Sample tsv exported from the template: WebVTT Timestamps - sample-123.tsv

Sample result WebVTT index from sample tsv: WebVTT Timestamps - sample-123.tsv.txt

#!/usr/bin/env python3
import glob
import os

import pandas as pd


def isNaN(num):
    # NaN is the only value that is not equal to itself
    return num != num


def formattime(row):
    # Build the cue timing line; assumes Start and Stop are hh:mm:ss without milliseconds
    return str(row["Start"]).strip() + ".000 --> " + str(row["Stop"]).strip() + ".000"


def transform(dataframe, filename):
    # Write one WebVTT index file named after the source file, e.g. sample-123.tsv.txt
    with open(filename + ".txt", "w") as f:
        f.write("WEBVTT" + "\n\n")
        for index, row in dataframe.iterrows():
            print(index)
            # Write a cue only when Start, Action, and Stop are all present
            if not isNaN(row["Start"]) and not isNaN(row["Action"]) and not isNaN(row["Stop"]):
                print(row["Action"])
                f.write(str(row["Action"]).strip() + "\n")
                f.write(formattime(row) + "\n")
                if not isNaN(row["Synopsis"]):
                    f.write(str(row["Synopsis"]).strip() + "\n")
                # A blank line ends the cue
                f.write("\n")
                # add speaker right before Action
                # subject, keyword, partial transcript, coordinates with hyperlink text and zoom level(?), hyperlink


# this program expects tsv files to be in subdirectory indexdata
filepath = './indexdata'
os.chdir(filepath)
files = glob.glob("*.{}".format("tsv"))
for file in files:
    print(file)
    df = pd.read_table(file)
    df = df.reset_index()
    transform(df, file)
ID Start Stop Action Valid Span Valid Jump Synopsis Notes Subjects Custom Field
00:00:00 00:00:03 Some Action Title 1 valid valid
00:00:04 00:00:10 Some Action Title 2 valid valid
00:00:11 00:00:20 Some Action Title 3 valid valid
00:00:21 00:00:35 Some Action Title 4 valid valid
00:00:36 00:00:37 Some Action Title 5 valid valid
00:00:37 00:00:38 Some Action Title 6 valid valid
00:00:48 06:09:10 Some Action Title 7 valid valid
06:09:10 06:09:40 Some Action Title 8 valid
WEBVTT

Some Action Title 1
00:00:00.000 --> 00:00:03.000

Some Action Title 2
00:00:04.000 --> 00:00:10.000

Some Action Title 3
00:00:11.000 --> 00:00:20.000

Some Action Title 4
00:00:21.000 --> 00:00:35.000

Some Action Title 5
00:00:36.000 --> 00:00:37.000

Some Action Title 6
00:00:37.000 --> 00:00:38.000

Some Action Title 7
00:00:48.000 --> 06:09:10.000

Some Action Title 8
06:09:10.000 --> 06:09:40.000

REM ***** BASIC *****
Sub ExportToTsv
    document = ThisComponent
    ' Use the global string tools library to generate a path to save each TSV
    GlobalScope.BasicLibraries.loadLibrary("Tools")
    FileDirectory = Tools.Strings.DirectoryNameoutofPath(document.getURL(), "/")
    ' Work out number of sheets for looping over them later.
    Sheets = document.Sheets
    NumSheets = Sheets.Count - 1
    ' Set up a propval object to store the filter properties
    Dim Propval(1) as New com.sun.star.beans.PropertyValue
    Propval(0).Name = "FilterName"
    Propval(0).Value = "Text - txt - csv (StarCalc)"
    Propval(1).Name = "FilterOptions"
    Propval(1).Value = "9,34,0,1,1" ' ASCII 9 = tab (field separator), 34 = " (text delimiter)
    For I = 0 to NumSheets
        ' For each sheet, assemble a filename and save using the filter
        document.getCurrentController.setActiveSheet(Sheets(I))
        Filename = FileDirectory + "/" + Sheets(I).Name + ".tsv"
        FileURL = convertToURL(Filename)
        document.StoreToURL(FileURL, Propval())
    Next I
End Sub