anayram/README.md

## README.md

      
WebVTT Index Conversion Project

Indexes have been created for recordings in the Aviary SpokenWeb collection. These indexes were originally created outside the application (in Google Sheets) in a custom format. See Metadata Preparation below for more details.
A script was developed to transform timestamps from a field- and format-normalized spreadsheet into the WebVTT index format for ingest into Aviary. An index contains labels (segments), each with a start and end time in timecode.
A spreadsheet template with field validation was created for future indexes. See Timestamp Template and Sample Metadata below.
Metadata Preparation

The original timestamp spreadsheets created by SpokenWeb students included the required information from the start. To prepare the data, I normalized SpokenWeb identifiers, consolidated headers, and normalized data formats (time as hh:mm:ss.mil) to consistent strings across spreadsheets. This prep work may not be needed in future if timestamps are created with the template linked below.
Timestamp metadata was created in separate worksheets for each Aviary resource. Worksheets are named after each resource's SpokenWeb identifier.
Spreadsheet worksheets were exported to tsv using an adapted LibreOffice macro (included in the present gist as Z_Export2Tsv.bas). Possible future work: convert spreadsheets to character-separated files using a Python script or Google Apps Script.
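
The time normalization mentioned above (bringing times to hh:mm:ss.mil) could be sketched in Python roughly as follows. This is an illustration only, not the process that was actually applied to the SpokenWeb spreadsheets. The function name `normalize_timestamp` and the accepted input shapes (mm:ss, h:mm:ss or hh:mm:ss, with optional fractional seconds) are assumptions.

```python
import re

# Assumed input shapes: [h]h:mm:ss or mm:ss, optionally followed by .fraction
_TIME_RE = re.compile(r"^(?:(\d{1,2}):)?(\d{1,2}):(\d{1,2})(?:\.(\d{1,3}))?$")


def normalize_timestamp(value):
    """Return value as hh:mm:ss.mmm, or raise ValueError if it cannot be parsed."""
    match = _TIME_RE.match(str(value).strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {value!r}")
    hours, minutes, seconds, fraction = match.groups()
    hours = int(hours or 0)  # mm:ss input has no hours part
    minutes = int(minutes)
    seconds = int(seconds)
    if minutes > 59 or seconds > 59:
        raise ValueError(f"minutes or seconds out of range: {value!r}")
    # Right-pad the fraction so that ".5" means 500 ms rather than 5 ms
    millis = int((fraction or "0").ljust(3, "0"))
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
```

For example, `normalize_timestamp("00:00:03")` yields `00:00:03.000`. Values the pattern does not recognize raise `ValueError`, so they can be corrected in the spreadsheet rather than silently passed through.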
Script

A small Python script (tsv2webvtt.py) processes the metadata and exports it as WebVTT indexes.
The script tests for the presence of Start, Stop, and Action values and currently helps format valid hh:mm:ss.mil timestamps by appending .000 to each time. Future work: the formatting step will be removed, and time data will be validated before transformation with the help of the spreadsheet template used by index creators (see Timestamp Template and Sample Metadata below).
Each index is exported as a txt file named after its SpokenWeb identifier. These identifiers are later used to map files to existing Aviary resources for ingest.
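
The validation step planned as future work could look something like the sketch below. It is not part of the current script. The names `validate_row` and `to_milliseconds` are hypothetical, and the sketch assumes each row is a plain dict of strings keyed by the template's column names (for example, as produced by `csv.DictReader`) with times written as hh:mm:ss or hh:mm:ss.mil.

```python
import re

# hh:mm:ss with optional .mmm; minutes and seconds limited to 00-59
TIMESTAMP_RE = re.compile(r"^(\d{2,}):([0-5]\d):([0-5]\d)(?:\.(\d{3}))?$")


def to_milliseconds(timestamp):
    """Convert hh:mm:ss[.mmm] to a millisecond count, or raise ValueError."""
    match = TIMESTAMP_RE.match(timestamp)
    if match is None:
        raise ValueError(f"invalid timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = match.groups()
    total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis or 0)


def validate_row(row):
    """Return a list of problems found in one index row (empty if the row is valid)."""
    problems = []
    for field in ("Start", "Stop", "Action"):
        # Treat missing keys, None, and whitespace-only cells as empty
        if not str(row.get(field) or "").strip():
            problems.append(f"missing {field}")
    if problems:
        return problems
    try:
        start = to_milliseconds(str(row["Start"]).strip())
        stop = to_milliseconds(str(row["Stop"]).strip())
    except ValueError as error:
        return [str(error)]
    # WebVTT requires a cue's end time to be greater than its start time
    if stop <= start:
        problems.append("Stop must be later than Start")
    return problems
```

Returning a list of problems instead of raising on the first one makes it possible to report every issue in a worksheet at once before any WebVTT file is written.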
Timestamp Template and Sample Metadata

A template for timestamp creation is available here. Contact metadata@ualberta.ca for access.
Sample tsv exported from the template: WebVTT Timestamps - sample-123.tsv
Sample result WebVTT index from sample tsv: WebVTT Timestamps - sample-123.tsv.txt

  
## tsv2webvtt.py
#!/usr/bin/env python3

import pandas as pd
import os
import glob


def isNaN(num):
    # NaN is the only value not equal to itself; pandas uses NaN for empty cells
    return num != num

def formattime(row):
    # Append milliseconds to hh:mm:ss values to build the WebVTT timing line
    return str(row["Start"]).strip() + ".000 --> " + str(row["Stop"]).strip() + ".000"

def transform(dataframe, filename):
    # Write one WebVTT index per input file, named <filename>.txt
    with open(filename + ".txt", "w") as f:
        f.write("WEBVTT" + "\n\n")
        for index, row in dataframe.iterrows():
            print(index)
            # Only write rows that have all of Start, Stop, and Action
            if not isNaN(row["Start"]) and not isNaN(row["Action"]) and not isNaN(row["Stop"]):
                print(row["Action"])
                f.write(str(row["Action"]).strip() + "\n")
                f.write(formattime(row) + "\n")
                if not isNaN(row["Synopsis"]):
                    f.write(str(row["Synopsis"]).strip() + "\n")
                f.write("\n")
                # TODO: add speaker right before Action
                # TODO: subject, keyword, partial transcript, coordinates with hyperlink text and zoom level(?), hyperlink

# this program expects tsv files to be in subdirectory indexdata
filepath = './indexdata'
os.chdir(filepath)
files = glob.glob("*.tsv")

for file in files:
    print(file)
    df = pd.read_table(file)
    df = df.reset_index()
    transform(df, file)

## WebVTT Timestamps - sample-123.tsv

          
ID	Start	Stop	Action	Valid Span	Valid Jump	Synopsis	Notes	Subjects	Custom Field
	00:00:00	00:00:03	Some Action Title 1	valid	valid
	00:00:04	00:00:10	Some Action Title 2	valid	valid
	00:00:11	00:00:20	Some Action Title 3	valid	valid
	00:00:21	00:00:35	Some Action Title 4	valid	valid
	00:00:36	00:00:37	Some Action Title 5	valid	valid
	00:00:37	00:00:38	Some Action Title 6	valid	valid
	00:00:48	06:09:10	Some Action Title 7	valid	valid
	06:09:10	06:09:40	Some Action Title 8	valid

## WebVTT Timestamps - sample-123.tsv.txt
WEBVTT

Some Action Title 1
00:00:00.000 --> 00:00:03.000

Some Action Title 2
00:00:04.000 --> 00:00:10.000

Some Action Title 3
00:00:11.000 --> 00:00:20.000

Some Action Title 4
00:00:21.000 --> 00:00:35.000

Some Action Title 5
00:00:36.000 --> 00:00:37.000

Some Action Title 6
00:00:37.000 --> 00:00:38.000

Some Action Title 7
00:00:48.000 --> 06:09:10.000

Some Action Title 8
06:09:10.000 --> 06:09:40.000

## Z_Export2Tsv.bas
REM  *****  BASIC  *****

Sub ExportToTsv
    document = ThisComponent

    ' Use the global string tools library to generate a path to save each TSV
    GlobalScope.BasicLibraries.loadLibrary("Tools")
    FileDirectory = Tools.Strings.DirectoryNameoutofPath(document.getURL(), "/")

    ' Work out number of sheets for looping over them later.
    Sheets = document.Sheets
    NumSheets = Sheets.Count - 1

    ' Set up a propval object to store the filter properties
    Dim Propval(1) as New com.sun.star.beans.PropertyValue
    Propval(0).Name = "FilterName"
    Propval(0).Value = "Text - txt - csv (StarCalc)"
    Propval(1).Name = "FilterOptions"
    Propval(1).Value = "9,34,0,1,1"   ' ASCII 9 = tab (field separator), 34 = " (text delimiter)

    For I = 0 to NumSheets
        ' For each sheet, assemble a filename and save using the filter
        document.getCurrentController.setActiveSheet(Sheets(I))
        Filename = FileDirectory + "/" + Sheets(I).Name + ".tsv"
        FileURL = convertToURL(Filename)
        document.StoreToURL(FileURL, Propval())
    Next I

End Sub