Last active
November 5, 2023 21:05
Edited whisper output of my 30% seminar
1 | |
00:00:01,387 --> 00:00:03,387 | |
[mwahahahaha] | |
2 | |
00:00:04,202 --> 00:00:11,662 | |
So... Phoneme recognition for fan and prophet: uses of more direct represations- representations of speech. | |
3 | |
00:00:13,620 --> 00:00:15,840 | |
Or, you know, in English orthography. | |
4 | |
00:00:17,260 --> 00:00:20,120 | |
Uh... the suffering that Jens goes through. | |
5 | |
00:00:20,160 --> 00:00:21,860 | |
But when I showed him that first slide, and... | |
6 | |
00:00:21,940 --> 00:00:23,400 | |
"Of course you did." | |
7 | |
00:00:26,600 --> 00:00:28,960 | |
Ehm, yeah, so, eh... | |
8 | |
00:00:29,400 --> 00:00:33,320 | |
The matter was raised that because I can pronounce an /əʊ/ and an /oʊ/ | |
9 | |
00:00:33,780 --> 00:00:38,180 | |
that it's not necessarily the case that mine is a monophthong, but, eh
10 | |
00:00:38,860 --> 00:00:42,800 | |
I based that on a recording, and for demonstration purposes, kind of, | |
11 | |
00:00:43,560 --> 00:00:49,460 | |
that is the out-, eh, was originally the output of a phoneme recognition system for English that someone else trained | |
12 | |
00:00:49,980 --> 00:00:53,500 | |
and I just corrected it a bit to match my Irish pronunciation. | |
13 | |
00:00:53,920 --> 00:01:00,260 | |
But, eh, anyone who can read a spectrogram can see that this is very definitely a monophthong, not a diphthong. | |
14 | |
00:01:02,340 --> 00:01:04,660 | |
Okay, so, things that I've done. | |
15 | |
00:01:05,180 --> 00:01:17,480 | |
Eh... TA duty, thesis supervision, and this paper that we did for the PSST challenge, which actually was a course project for the speech and speaker recognition course. | |
16 | |
00:01:18,340 --> 00:01:35,160 | |
And that eh... directly led to this eh... work that I've been doing most recently on trying to make more use of phoneme recognition, both as a way of bridging the gaps between ASR and an official transcript that isn't quite trustworthy. | |
17 | |
00:01:35,580 --> 00:01:46,500 | |
And also, umm... while we're at it, trying to get eh... acoustically validated eh... pronunciation lex- lexica for text to speech purposes. | |
18 | |
00:01:48,140 --> 00:01:53,740 | |
And so eh... TA duty: I TA'd on both these courses this year and last year. | |
19 | |
00:01:56,020 --> 00:02:00,760 | |
And I supervised this thesis, which eh... also involved phonemes. | |
20 | |
00:02:04,600 --> 00:02:12,240 | |
Eh... the thesis isn't available but... a... a paper derived from it is and if you follow that QR code it will take you to the arXiv page. | |
21 | |
00:02:15,800 --> 00:02:19,460 | |
So... first presentation. | |
22 | |
00:02:20,800 --> 00:02:22,040 | |
Ah, dammit. | |
23 | |
00:02:31,800 --> 00:02:40,280 | |
So, eh, this paper was presented at the RaPID4 workshop, eh, which was co-located with LREC last year in Marseille.
24 | |
00:02:41,420 --> 00:02:52,080 | |
The aim of the challenge was to get phoneme recognition for aphasic speech, and to improve it over a... an established baseline that the organizers provided. | |
25 | |
00:02:52,480 --> 00:02:59,620 | |
Along the way the, the organisers said: "Oh, oops, turns out we did too good a job." ehm... | |
26 | |
00:02:59,800 --> 00:03:03,280 | |
"A bunch of people emailed us saying: 'we can't beat the baseline.'" | |
27 | |
00:03:03,660 --> 00:03:06,020 | |
So, y'know, we did. Happy days. | |
28 | |
00:03:07,580 --> 00:03:11,360 | |
So we focused on data augmentation. | |
29 | |
00:03:11,800 --> 00:03:22,040 | |
So we tried augmenting the original, and we tried matching out of domain speech, eh, using data augmentation so that it sounded more like the in domain data. | |
30 | |
00:03:22,680 --> 00:03:28,500 | |
We tried using phonetic language models, we tried using voice conversion and we tried using synthetic voices. | |
31 | |
00:03:31,280 --> 00:03:39,300 | |
I think... this has been mentioned in, eh, several, eh, 30% seminars, but in passing. | |
32 | |
00:03:39,800 --> 00:03:41,320 | |
But this is the full presentation. | |
33 | |
00:03:41,800 --> 00:03:48,980 | |
So, eh, the augmentations we used were Gaussian noise, pitch shift, time stretch and voice conversion. | |
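Two of the augmentations named here (Gaussian noise and time stretch) can be sketched on a raw waveform with plain NumPy. This is a minimal illustration, not the paper's actual pipeline; the SNR and stretch values are made up for the example.

```python
import numpy as np

def add_gaussian_noise(wav: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Add Gaussian noise at a given signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=wav.shape)
    return wav + noise

def time_stretch(wav: np.ndarray, rate: float = 0.9) -> np.ndarray:
    """Naive time stretch by resampling. Note this also shifts pitch;
    real implementations use a phase vocoder to stretch time only."""
    old_idx = np.arange(len(wav))
    new_len = int(len(wav) / rate)
    new_idx = np.linspace(0, len(wav) - 1, new_len)
    return np.interp(new_idx, old_idx, wav)

wav = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
noisy = add_gaussian_noise(wav, snr_db=20.0)
slow = time_stretch(wav, rate=0.8)  # 25% longer
```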
34 | |
00:03:49,500 --> 00:03:54,040 | |
Basically, we listened to the data and tried to figure out what augmentations will make it sound like that. | |
35 | |
00:03:55,800 --> 00:04:00,240 | |
The choice of Gaussian noise instead of, eh, white noise or pink noise or brown noise... | |
36 | |
00:04:00,720 --> 00:04:04,200 | |
Eh... Shivam said "do that!" and we said: "yeah, okay". | |
37 | |
00:04:07,180 --> 00:04:17,220 | |
So, we had 2000- 2300 segments from PT- eh... PSST, eh, 1400 from TIMIT. | |
38 | |
00:04:18,860 --> 00:04:29,080 | |
Which, eh... So, when we're using TIMIT, eh, we tried to map it down to the same phoneset that was used in the PSST challenge.
39 | |
00:04:29,800 --> 00:04:35,800 | |
Eh... TIMIT has a more... a slightly more expressive phoneset and there were things that didn't quite match. | |
40 | |
00:04:36,120 --> 00:04:45,440 | |
So, rather than try and fudge it, we just omitted anything that didn't match up and that drastically reduced the amount of data. | |
41 | |
00:04:45,800 --> 00:04:49,800 | |
Eh, the training set by default is about three hours, we were left with about one. | |
42 | |
00:04:50,060 --> 00:04:57,880 | |
Eh, but, eh it occurred to us afterwards that, uh, we didn't need to keep the test or validation set, we could have used those and that would have been an extra two hours. | |
43 | |
00:05:00,120 --> 00:05:05,800 | |
And, eh, we tried to do things with Common Voice, that was mostly Birger. | |
44 | |
00:05:08,740 --> 00:05:14,800 | |
So, we tried using base and large wav2vec 2 models and surprise, surprise, the bigger model performed better. | |
45 | |
00:05:15,480 --> 00:05:19,260 | |
But using the base model was a hell of a lot faster and... | |
46 | |
00:05:20,180 --> 00:05:22,000 | |
eh... you know... | |
47 | |
00:05:22,920 --> 00:05:31,460 | |
there was a moment when Birger asked Jens if he could... "oh, can we use the Språkbanken machines if Jim does it" and Jens said something to him. | |
48 | |
00:05:32,120 --> 00:05:33,380 | |
But it seemed more like a... | |
49 | |
00:05:34,400 --> 00:05:37,840 | |
you know, convince me type of thing, but, I don't know... | |
50 | |
00:05:38,200 --> 00:05:45,800 | |
someone else came along and said, oh, yeah, okay, that... that... that's that, and we just dropped it and used things like Kaggle and Google Colab. | |
51 | |
00:05:46,380 --> 00:05:52,580 | |
So, ehm... I think this person had this uh... false idea that Jens is unwavering in his opinions. | |
52 | |
00:05:52,700 --> 00:05:54,200 | |
He certainly has strong opinions. | |
53 | |
00:05:54,880 --> 00:06:02,240 | |
But, eh, no, I- I've seen many times since then how excited he gets when he gets a new perspective on something. | |
54 | |
00:06:04,920 --> 00:06:10,980 | |
So, we used voice conversion, but we had very, very little data per speaker. | |
55 | |
00:06:11,940 --> 00:06:15,540 | |
Eh... if we'd had more data, it could have worked out better. | |
56 | |
00:06:15,700 --> 00:06:18,020 | |
But this was particularly noisy data. | |
57 | |
00:06:18,160 --> 00:06:22,420 | |
It was recorded on something like a cassette recorder in a quite reverberant room. | |
58 | |
00:06:22,680 --> 00:06:26,960 | |
And the cassette recorder was on a table between the interviewer and the interviewee. | |
59 | |
00:06:27,340 --> 00:06:30,520 | |
So, it really did not work out. | |
60 | |
00:06:32,920 --> 00:06:34,700 | |
Eh... we used pitch shift. | |
61 | |
00:06:35,100 --> 00:06:39,460 | |
Eh... that worked quite well with augmenting the PSST data. | |
62 | |
00:06:42,980 --> 00:06:52,280 | |
Eh... Gaussian noise worked quite well uh... in augmenting mostly the TIMIT data, the clean data that we were adding. | |
63 | |
00:06:53,920 --> 00:07:00,420 | |
Time stretch... eh... w- we added this because people with aphasia tend to draw out the things they say. | |
64 | |
00:07:02,800 --> 00:07:07,940 | |
And room impulse s- response. Well, from every recording, you could hear the echo. | |
65 | |
00:07:07,980 --> 00:07:14,140 | |
Uh, so, we used this, and it turned out to be the most successful augmentation individually of the TIMIT data. | |
66 | |
00:07:16,260 --> 00:07:19,620 | |
And so, we combined the augmentations in as... well... | |
67 | |
00:07:21,040 --> 00:07:24,120 | |
Yeah, it- th- this ended up being a little bit of a... | |
68 | |
00:07:25,140 --> 00:07:28,480 | |
So, eh, Birger is a very big picture kind of person. | |
69 | |
00:07:28,760 --> 00:07:31,140 | |
I'm a very detail-oriented kind of person. | |
70 | |
00:07:31,320 --> 00:07:34,920 | |
There're a- a couple of axes where we're kind of at opposite ends. | |
71 | |
00:07:35,580 --> 00:07:45,580 | |
So, one of the things, the augmenting the PSST data, I thought, that doesn't seem like the way to go, but I'll show you how. | |
73 | |
00:07:45,600 --> 00:07:48,180 | |
And it- it ended up working out quite well. | |
74 | |
00:07:48,880 --> 00:07:49,840 | |
To my surprise. | |
75 | |
00:07:50,380 --> 00:07:58,240 | |
But em... he wasn't convinced that we need to break it down and show uh... which combinations work... eh... but... eh... | |
76 | |
00:07:59,200 --> 00:08:04,200 | |
yeah, when the organisers asked for a table comparing all of these, it was like, isn't it good that I have that? | |
77 | |
00:08:08,880 --> 00:08:15,300 | |
We tried using language models. Eh... there was little or no improvement over this, over the baseline using this. | |
78 | |
00:08:16,600 --> 00:08:19,620 | |
And we got these nice-looking results. | |
79 | |
00:08:19,820 --> 00:08:26,500 | |
Ah... so, eh, PER is phoneme error rate, and FER is feature error rate. | |
80 | |
00:08:26,560 --> 00:08:30,360 | |
So, for each eh... phoneme, you get this phonetic feature vector. | |
81 | |
00:08:30,640 --> 00:08:35,220 | |
So, like, if it's a vowel, it could be front, it could be rounded, it could be nasal. | |
82 | |
00:08:35,680 --> 00:08:40,340 | |
Eh... if it's a consonant, this will be eh... place of articulation, et cetera. | |
83 | |
00:08:40,860 --> 00:08:44,800 | |
And so, you can get the phoneme wrong, but get most of the features. | |
84 | |
00:08:44,800 --> 00:08:52,300 | |
So, like, if it gives a... a /d/ instead of a /t/, that's one eh... phoneme error, but it's only one feature error. | |
85 | |
00:08:52,640 --> 00:08:54,800 | |
And so, it counts a lot less. | |
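The phoneme-error versus feature-error distinction can be sketched with a toy feature table; the features here are illustrative, not the challenge's actual feature set.

```python
# Toy phonetic feature table; real feature sets are larger.
FEATURES = {
    "t": {"place": "alveolar", "manner": "stop", "voiced": False},
    "d": {"place": "alveolar", "manner": "stop", "voiced": True},
    "z": {"place": "alveolar", "manner": "fricative", "voiced": True},
}

def feature_errors(ref: str, hyp: str) -> int:
    """Count how many phonetic features differ between two phonemes."""
    return sum(FEATURES[ref][k] != FEATURES[hyp][k] for k in FEATURES[ref])

# /d/ for /t/ is one whole phoneme error, but only voicing is wrong:
print(feature_errors("t", "d"))  # 1
print(feature_errors("t", "z"))  # 2 (manner and voicing both differ)
```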
86 | |
00:08:57,260 --> 00:09:04,060 | |
And so, augmenting the eh... data worked out, increasing the size of the model worked out. | |
87 | |
00:09:07,560 --> 00:09:09,380 | |
Oh, wow, I haven't looked at this slide since last year. | |
88 | |
00:09:09,760 --> 00:09:16,200 | |
Ehm... yeah, the best-performing model combined aphasic and non-aphasic data, so... the domain-adapted data.
89 | |
00:09:16,400 --> 00:09:22,620 | |
We got a 21% phoneme error rate, which is acceptable. So anything | |
90 | |
00:09:23,600 --> 00:09:28,540 | |
if you think of it analogously to word error rate, anything below 30% eh... is... | |
91 | |
00:09:29,500 --> 00:09:36,800 | |
worth fixing for a human. Anything above that, just scrap it. You're... you may as well just listen and type it out by hand. | |
92 | |
00:09:39,160 --> 00:09:42,040 | |
And so, this was... was the original former slide. | |
93 | |
00:09:43,640 --> 00:09:45,800 | |
But there was a... it was a challenge. | |
94 | |
00:09:46,020 --> 00:09:48,340 | |
We came second! Yay! | |
95 | |
00:09:49,240 --> 00:09:50,140 | |
Of two teams. | |
96 | |
00:09:50,140 --> 00:09:51,800 | |
[laughter] | |
97 | |
00:09:52,440 --> 00:09:54,240 | |
But the other team was Baidu Research. | |
98 | |
00:09:55,100 --> 00:10:08,540 | |
Eh... If... if you're not aware, Baidu Research are the people who created the model that, ah, led to the paradigm shift towards end-to-end speech recognition. So, they kind of know something about this.
99 | |
00:10:08,800 --> 00:10:14,020 | |
Also, they've got a lot more money, and they were able to train on massive models. | |
100 | |
00:10:14,800 --> 00:10:20,800 | |
So, they were using the, what is it called, VoxPopuli models.
101 | |
00:10:21,200 --> 00:10:23,800 | |
I couldn't use those because those are not open source. | |
102 | |
00:10:23,800 --> 00:10:26,800 | |
Meta likes to say open source a lot, but they don't like to do it. | |
103 | |
00:10:27,060 --> 00:10:35,480 | |
Eh... they have recently been asked by the Open Source Initiative to not do that. I think maybe trademark lawyers may have been involved somewhere. | |
104 | |
00:10:35,800 --> 00:10:40,520 | |
But, uh, yeah, these are not open source, so I didn't feel like I could use them. | |
105 | |
00:10:40,580 --> 00:10:43,800 | |
Ah... so, we were stuck with much smaller models. | |
106 | |
00:10:43,800 --> 00:10:45,780 | |
So, at least we beat them on two. | |
107 | |
00:10:46,800 --> 00:10:47,980 | |
It's not nothing. | |
108 | |
00:10:50,800 --> 00:10:54,800 | |
And so, th- this is the citation for the previous table. | |
109 | |
00:10:55,960 --> 00:11:01,140 | |
And so, oh, wow, I took 10 minutes longer doing that last time. | |
110 | |
00:11:08,400 --> 00:11:14,800 | |
Okay, so eh... this most recent one uh... was given recently at the University of Gothenburg.
111 | |
00:11:15,360 --> 00:11:17,800 | |
Um... apologies to people who've seen this before. | |
112 | |
00:11:19,100 --> 00:11:26,200 | |
So, uh... one of the projects, well, uh, we're using eh... Riksdag data for multiple projects. | |
113 | |
00:11:29,800 --> 00:11:37,400 | |
So, eh, I mentioned earlier that we're using phoneme recognition to try and bridge the gap between uh... transcripts that are kind of untrustworthy. | |
114 | |
00:11:37,800 --> 00:11:49,320 | |
Eh... what we think happens, eh, what happens generally is that they want to represent the intent rather than the literal word-for-word transcript. | |
115 | |
00:11:49,860 --> 00:11:54,360 | |
So, the speeches are filed in advance, and then those are lightly edited. | |
116 | |
00:11:54,420 --> 00:11:58,600 | |
But, you know, they may or may not match what was actually spoken. | |
117 | |
00:12:01,480 --> 00:12:05,360 | |
So, we want to find out things about how people speak. | |
118 | |
00:12:06,300 --> 00:12:12,080 | |
We'd like to have better training materials, and we want to develop iterative training processes. | |
119 | |
00:12:12,800 --> 00:12:19,520 | |
So, eh, this is part of the SweTerror project, which is a multidisciplinary project. | |
120 | |
00:12:20,140 --> 00:12:28,380 | |
At the other end, we have people who are looking at the psychological aspects of how terror is being discussed in the Riksdag.
121 | |
00:12:32,020 --> 00:12:35,900 | |
And so, we have data for about 10 years. | |
122 | |
00:12:36,220 --> 00:12:39,820 | |
Well, we have data for about 20 years. We have tw-... eh... | |
123 | |
00:12:40,280 --> 00:12:43,880 | |
this eh... almost 6,000 hours of processed data. | |
124 | |
00:12:47,880 --> 00:12:56,240 | |
Eh so, these are the places, well, the counts for the amount of times that terror or compound words that include the word terror occur. | |
125 | |
00:12:57,300 --> 00:13:08,420 | |
So, in 826 transcript files, but over a thousand eh... files that are the result of ASR output. | |
126 | |
00:13:09,420 --> 00:13:16,160 | |
And 6,700 words in the transcript, 7,200 in the ASR output. | |
127 | |
00:13:16,420 --> 00:13:19,200 | |
So, something is going missing somewhere. | |
128 | |
00:13:21,920 --> 00:13:26,340 | |
And we have these counts for... so this is the transcripts. | |
129 | |
00:13:27,260 --> 00:13:33,080 | |
And it's quite similar for the ASR, but slightly larger in most counts. | |
130 | |
00:13:33,420 --> 00:13:38,360 | |
And these ones in italics have a different order. | |
131 | |
00:13:38,840 --> 00:13:40,500 | |
So, see it again. | |
132 | |
00:13:43,620 --> 00:13:46,800 | |
So, there are mundane issues whenever you use ASR. | |
133 | |
00:13:51,800 --> 00:13:53,800 | |
[Swedish: av förslag till] | |
134 | |
00:13:54,260 --> 00:13:55,800 | |
Okay, that wasn't loud. | |
135 | |
00:13:57,460 --> 00:13:58,840 | |
[Swedish: av förslag till] | |
136 | |
00:14:00,120 --> 00:14:07,120 | |
So, the point here is that the video starts at this point and it's missing that word, but the transcript includes it. | |
137 | |
00:14:07,440 --> 00:14:09,800 | |
So, we can't trust things like that. | |
138 | |
00:14:14,200 --> 00:14:16,800 | |
Then there are things like normalization. | |
139 | |
00:14:23,620 --> 00:14:24,580 | |
This one was fun. | |
140 | |
00:14:28,880 --> 00:14:30,400 | |
So, you can see what happened there. | |
141 | |
00:14:31,060 --> 00:14:34,900 | |
She said "i sex", but this gave "IX". | |
142 | |
00:14:45,040 --> 00:14:47,980 | |
So, mundane recognition error. | |
143 | |
00:14:54,620 --> 00:14:57,800 | |
So, we've mostly solved the normalization issue. | |
144 | |
00:14:58,800 --> 00:15:04,860 | |
Eh... this NeMo text processing module includes full support for Swedish | |
145 | |
00:15:06,220 --> 00:15:14,320 | |
which took way too long... but eh... we're using it for what's called acoustically validated normalization. | |
146 | |
00:15:15,000 --> 00:15:17,500 | |
So... eh... now... | |
147 | |
00:15:18,800 --> 00:15:26,560 | |
they described this much more complicated process where they use like a... a BERT style language model to rate multiple outputs. | |
148 | |
00:15:26,800 --> 00:15:39,460 | |
What we're using is the composition of a... a basic n- n-gram model with the potential outputs of the finite state transducer that's created by this. | |
149 | |
00:15:39,800 --> 00:15:44,800 | |
And that creates an acceptor and it will accept any of the possibilities. | |
150 | |
00:15:48,260 --> 00:15:50,360 | |
So, can we trust transcripts? | |
151 | |
00:15:51,160 --> 00:15:59,380 | |
Uh... for this, we're using an ASR model that comes from KBlab, the large uh... VoxRex Swedish. | |
152 | |
00:16:00,880 --> 00:16:04,540 | |
So, in the transcripts, we get some things that were never actually said. | |
153 | |
00:16:04,660 --> 00:16:08,420 | |
So, this is in the transcript, but not... it was never spoken. | |
154 | |
00:16:09,380 --> 00:16:12,800 | |
Doesn't change the meaning so much here, but eh... other places it could. | |
155 | |
00:16:14,900 --> 00:16:16,800 | |
Then, you know, phrases can be moved. | |
156 | |
00:16:17,800 --> 00:16:26,380 | |
This one was up here at the start of the phrase in the transcript, but was moved to the end of the sentence in... when spoken. | |
157 | |
00:16:28,800 --> 00:16:30,800 | |
Some things are just added in the moment. | |
158 | |
00:16:30,800 --> 00:16:33,740 | |
Uh... they just added in these couple of words. | |
159 | |
00:16:36,260 --> 00:16:38,380 | |
And so, what can we find automatically? | |
160 | |
00:16:38,980 --> 00:16:41,200 | |
Well, we can find false starts. | |
161 | |
00:16:41,800 --> 00:16:53,160 | |
So, when we have, eh, on the right side, a valid word, and then on the left side, a fragment that the valid word starts with.
162 | |
00:16:53,740 --> 00:16:57,140 | |
That's a pretty clear indication that this person had a false start. | |
163 | |
00:16:57,980 --> 00:16:59,040 | |
This is quite easy. | |
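One plausible reading of that heuristic as code: flag any token that is a proper prefix of the word that follows it. A sketch, not the project's actual implementation; the Swedish example tokens are made up.

```python
def false_starts(tokens: list[str]) -> list[tuple[str, str]]:
    """Flag pairs where the left token is a proper prefix of the next
    word: a strong sign of a false start (e.g. 'transkri- transkript')."""
    hits = []
    for left, right in zip(tokens, tokens[1:]):
        frag = left.rstrip("-")
        if frag and frag != right and right.startswith(frag):
            hits.append((left, right))
    return hits

print(false_starts(["vi", "transkri", "transkript", "det"]))
# [('transkri', 'transkript')]
```

Exact repetitions ("hej hej") are deliberately excluded; those are a different kind of disfluency.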
164 | |
00:17:00,800 --> 00:17:06,420 | |
Then, you get things that are very indicatio- indicative of alternative pronunciations. | |
165 | |
00:17:06,520 --> 00:17:13,500 | |
So, you know, clearly, someone had a different dialect, or they said the word really quickly, and... | |
166 | |
00:17:14,160 --> 00:17:20,720 | |
you know, a- a wav2vec-style model is trained with CTC, and it basically predicts every frame. | |
167 | |
00:17:21,220 --> 00:17:25,060 | |
And if something isn't quite audible, then it's omitted. | |
168 | |
00:17:27,180 --> 00:17:31,000 | |
Then, we get filled fog- pauses, so, like an "eh" or a "mm". | |
169 | |
00:17:34,220 --> 00:17:37,560 | |
But wait, didn't OpenAI Whisper solve ASR? | |
170 | |
00:17:38,740 --> 00:17:46,900 | |
Well, we can't really trust it, because eh... th- the data that we're using was probably part of the training data. | |
171 | |
00:17:51,420 --> 00:18:05,060 | |
And we get this thing where the word "tack" disappears when it comes before "talman", "herr talman", or "fru talman", because this is never included in the transcripts. | |
172 | |
00:18:05,200 --> 00:18:10,220 | |
The transcip- transcripts only ever start with one of these three phrases. | |
173 | |
00:18:10,500 --> 00:18:17,220 | |
And so, anywhere that this occurs in anything spoken, the word "tack" disappears. | |
174 | |
00:18:19,800 --> 00:18:21,800 | |
Then, you get these strange insertions. | |
175 | |
00:18:22,440 --> 00:18:25,840 | |
So, I mean, this is about five minutes, and, well... | |
176 | |
00:18:26,380 --> 00:18:30,560 | |
I think in total it was about five minutes, but uh... I didn't think I needed to put all that much. | |
177 | |
00:18:30,800 --> 00:18:35,180 | |
But, eh, yeah, eh, yeah, this was silence. Nobody said these words. | |
178 | |
00:18:36,720 --> 00:18:37,800 | |
Well, you can see what happened. | |
179 | |
00:18:38,280 --> 00:18:40,800 | |
This is stuff that probably came from YouTube. | |
180 | |
00:18:41,580 --> 00:18:50,540 | |
And, you know, this is eh, one of the things that was mentioned by an anonymous ASR... or TTS researcher recently, but, eh, yeah | |
181 | |
00:18:50,640 --> 00:18:54,800 | |
this kind of thing happens all the time, 'cause people use subtitles in really strange ways. | |
182 | |
00:18:56,800 --> 00:19:00,800 | |
So, our alternative is to use phonemic recognition. | |
183 | |
00:19:01,060 --> 00:19:09,280 | |
I'm using the word phoneme in a very technology-oriented sense, and apologies to anyone who knows phonetics or phonology. | |
184 | |
00:19:10,340 --> 00:19:14,600 | |
And for our data, we're using the Waxholm data set.
185 | |
00:19:14,660 --> 00:19:17,020 | |
So, Vaxholm is an island. I... | |
186 | |
00:19:18,340 --> 00:19:23,500 | |
Feels really... It felt more appropriate in Gothenburg to have this slide, and I forgot to check. | |
187 | |
00:19:24,900 --> 00:19:26,480 | |
It's a really nice place. | |
188 | |
00:19:27,320 --> 00:19:28,740 | |
They have a nice Christmas market. | |
189 | |
00:19:30,280 --> 00:19:37,180 | |
Very picturesque... eh, one of the people we were there with said, oh, it's like a setting for a Hallmark Christmas movie.
190 | |
00:19:41,200 --> 00:19:43,040 | |
Then eh... Waxholm with a W. | |
191 | |
00:19:44,460 --> 00:19:52,520 | |
Well, it's a dialogue system that gave information about shipping, mostly to and from Vaxholm, as far as I can tell. But, you know, there was more than that. | |
192 | |
00:19:52,760 --> 00:19:58,120 | |
So, it incorporated text-to-speech, ASR, face synthesis, and dialogue management. | |
193 | |
00:19:58,200 --> 00:20:03,380 | |
But in the earliest versions, there was no ASR system. So, there was a Wizard of Oz setup used. | |
194 | |
00:20:03,900 --> 00:20:10,340 | |
The data from these sessions was transcribed at the word and phoneme level, including non-speech acoustic events. | |
195 | |
00:20:12,540 --> 00:20:18,920 | |
And this is a diagram of the system. There was quite a lot happening there. | |
196 | |
00:20:19,700 --> 00:20:24,240 | |
Eh... I think just about everything that happened there would be replaced by a neural network these days, but
197 | |
00:20:24,780 --> 00:20:27,820 | |
a lot of this is rule-based or HMM-based, or | |
198 | |
00:20:31,120 --> 00:20:36,500 | |
Yeah, I... I saw this video in the... in Jens' course, and... | |
199 | |
00:20:37,340 --> 00:20:44,160 | |
Where am I going to find this? I thought, hmm, Joakim has to have a YouTube channel. And sure enough, yes, he does. | |
200 | |
00:20:50,080 --> 00:20:52,060 | |
This is what the original data looked like. | |
201 | |
00:20:53,900 --> 00:20:55,640 | |
That was fun. Ehm... | |
202 | |
00:20:56,460 --> 00:21:02,800 | |
Some... eh... eh... there seems to have been a couple of different conventions, like, things seem to have been done at different times. | |
203 | |
00:21:03,140 --> 00:21:05,140 | |
Ah, so, eh... | |
204 | |
00:21:06,700 --> 00:21:07,600 | |
It's been fun. | |
205 | |
00:21:10,920 --> 00:21:17,800 | |
There are some frames that are inconsistently labeled, there are empty frames used to mark unrealised segments. | |
206 | |
00:21:19,400 --> 00:21:22,640 | |
I mean, I see why they're there, but they're kind of annoying. Ehm... | |
207 | |
00:21:23,180 --> 00:21:34,920 | |
couple of schools of thought about how to... about the ge- the generated phoneme sequences, whether to include the... the non-speech events or not.
208 | |
00:21:35,760 --> 00:21:43,200 | |
And there was a lot of copying and pasting of files, so the metadata is often, like, really wrong. | |
209 | |
00:21:46,800 --> 00:21:52,300 | |
Here's some videos with... output from the phonetic... or phonemic-ish... | |
210 | |
00:21:55,540 --> 00:22:00,120 | |
So eh... this, by chance, was one of the first things I picked out after the... | |
211 | |
00:22:01,120 --> 00:22:06,920 | |
eh, I think it was three weeks of running Whisper across 6,000 hours. | |
212 | |
00:22:07,320 --> 00:22:11,380 | |
And... eh... and also wav2vec 2. | |
213 | |
00:22:12,200 --> 00:22:19,380 | |
This little "ha" in the middle is exactly the sort of thing that our downstream users want to look at. | |
214 | |
00:22:19,800 --> 00:22:22,640 | |
So, happy days. I found it straight away. | |
215 | |
00:22:23,360 --> 00:22:24,400 | |
Happy accident. | |
216 | |
00:22:35,480 --> 00:22:40,100 | |
Eh... He has a little "heh", like, a derisive laugh. | |
217 | |
00:22:50,140 --> 00:22:51,060 | |
So this... | |
218 | |
00:23:12,000 --> 00:23:13,340 | |
It's already at max. | |
219 | |
00:23:15,700 --> 00:23:22,540 | |
Right, so I'll... I mean, this is not only a false start, it's a false start with a phrase inserted, so this is harder to catch. | |
220 | |
00:23:28,760 --> 00:23:30,360 | |
Ah, this one from earlier. | |
221 | |
00:23:30,800 --> 00:23:32,860 | |
Phoneme recognizer: got it. | |
222 | |
00:23:41,380 --> 00:23:43,560 | |
And it kinda got this. | |
223 | |
00:23:46,784 --> 00:23:48,324 | |
It missed out on... | |
224 | |
00:23:48,700 --> 00:23:51,060 | |
Well, what sounds like a rhoticism. Eh... | |
225 | |
00:23:52,340 --> 00:23:58,660 | |
It missed out on that because, eh, those are very short, ehm, I think that's mentioned on a future slide. | |
226 | |
00:24:02,320 --> 00:24:06,260 | |
Eh, it's not perfect, there... a few mistakes here. | |
227 | |
00:24:06,800 --> 00:24:11,120 | |
[Swedish: Mr Speaker! EU cooperation makes Sweden stronger and safer.] | |
228 | |
00:24:11,140 --> 00:24:18,920 | |
[Swedish: Threats like the climate crisis, pandemics, terrorism and organised crime cannot be solved by a single country.] | |
229 | |
00:24:29,798 --> 00:24:34,038 | |
Ah, this... it did quite a good job of catching the pauses and hesitations. | |
230 | |
00:25:08,724 --> 00:25:14,284 | |
And with this, eh, it picked up that there was a different pronunciation than standard. | |
231 | |
00:25:26,200 --> 00:25:30,820 | |
I swear I picked this out before the current... crisis. I... I... | |
232 | |
00:25:31,800 --> 00:25:40,420 | |
But, eh, yeah, so instead of any of the usual pronunciations, she says /o:/, and I've verified this.
233 | |
00:25:42,800 --> 00:25:44,800 | |
So, uh, other things that we're doing. | |
234 | |
00:25:44,800 --> 00:25:55,120 | |
Eh, so we're going to be going back to HMM-based forced alignment because it tends not to skip over as much. | |
235 | |
00:25:55,980 --> 00:26:01,520 | |
So there's a shorter stride, it's 10 men- milliseconds instead of 20, and... | |
236 | |
00:26:01,980 --> 00:26:07,360 | |
with that, you get to catch these really short, eh, R and L sounds. | |
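The stride arithmetic behind that claim is simple; this tiny sketch assumes an illustrative 15 ms segment, not a measured one.

```python
def n_frames(duration_ms: float, stride_ms: float) -> int:
    """How many full output frames a speech segment spans."""
    return int(duration_ms // stride_ms)

# A 15 ms tap /r/ spans no full frame at a 20 ms stride, but one at 10 ms,
# so the finer stride has a chance of emitting a label for it.
print(n_frames(15, 20), n_frames(15, 10))  # 0 1
```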
237 | |
00:26:07,800 --> 00:26:13,020 | |
And it's dictionary-based, we'll be generating the dictionary, but then, y'know, it either aligns or it doesn't. | |
238 | |
00:26:13,280 --> 00:26:15,920 | |
So if something doesn't align, we need to go back and check. | |
239 | |
00:26:17,100 --> 00:26:22,880 | |
And we're looking at making an acoustically validated pronunciation dictionary. | |
240 | |
00:26:23,100 --> 00:26:30,460 | |
For the vast majority of these speakers, eh, they... they're public figures: we know who they are, we know where they grew up, we know... know where they were born. | |
241 | |
00:26:30,560 --> 00:26:35,280 | |
So we can give very good guesses as to what their dialect is going to be. | |
242 | |
00:26:35,800 --> 00:26:43,600 | |
Ehm... and so to begin with, we're doing an intersection of the dictionary-derived pronunciations and the phonemic transcriptions, | |
243 | |
00:26:44,060 --> 00:26:51,120 | |
which is great if it's a Stockholm speaker because pronunciation dictionaries tend to favour the Stockholm dialect. | |
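The intersection step itself is just set arithmetic per word; the pronunciations below are hypothetical examples, not entries from the real dictionary.

```python
from collections import defaultdict

# Hypothetical data: dictionary pronunciations per word, and
# pronunciations actually observed in the phoneme recognizer's output.
dictionary = {"och": {"ɔk", "ɔ"}, "jag": {"jɑːɡ", "jɑː"}}
observed = {"och": {"ɔ", "o"}, "jag": {"jɑː"}}

# Keep only pronunciations that are both licensed by the dictionary
# and attested in the audio.
validated = defaultdict(set)
for word, prons in dictionary.items():
    validated[word] = prons & observed.get(word, set())

print(dict(validated))  # {'och': {'ɔ'}, 'jag': {'jɑː'}}
```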
244 | |
00:26:52,100 --> 00:26:57,300 | |
We're using rules to add various alternatives and context-based replacements. | |
245 | |
00:26:57,740 --> 00:27:04,060 | |
And in the end, we want to get dialogue-specific lexica, eh, to begin with, and then | |
246 | |
00:27:04,440 --> 00:27:14,660 | |
other things based on speaking rate and other factors, eh, that will be useful mostly for text-to-speech research, eh, | |
247 | |
00:27:14,800 --> 00:27:22,800 | |
Hopefully, I get to just pick this uh package up and go, "here you go, let me know if there are any problems", but, ehm, we'll see. | |
248 | |
00:27:25,800 --> 00:27:31,800 | |
And so based just on Wiktionary rather than a much larger dictionary that was harder to work with, | |
249 | |
00:27:33,140 --> 00:27:43,800 | |
these are the top 10 validated pairs, so the written form taken from the intersection of the transcript and the ASR output, | |
250 | |
00:27:44,140 --> 00:27:49,140 | |
and then the pronunciation that occurred at the same time step. | |
251 | |
00:27:50,004 --> 00:28:03,280 | |
Eh, there was a question last time because, eh, the first word is pronounced /at/ and not /o/, which surprised people, eh, I've looked it up since: this is an error in Wiktionary. | |
252 | |
00:28:03,640 --> 00:28:12,400 | |
They've given an alternative pronunciation of /ɔ/, which happens maybe 40 times in 6,000 hours. | |
253 | |
00:28:12,615 --> 00:28:15,615 | |
But /o/ happens quite a lot. |
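Ranking the validated pairs like the table above is a straightforward count over (written form, time-aligned pronunciation) tuples; the pairs below are toy data, not the real counts.

```python
from collections import Counter

# Toy (written form, pronunciation-at-same-time-step) pairs; in the real
# pipeline the words come from the intersection of transcript and ASR
# output, paired with the phoneme string recognized at that time step.
pairs = [("och", "ɔ"), ("och", "ɔ"), ("att", "at"), ("och", "ɔk"), ("att", "at")]
top = Counter(pairs).most_common(3)
print(top)  # [(('och', 'ɔ'), 2), (('att', 'at'), 2), (('och', 'ɔk'), 1)]
```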