@jimregan
Last active November 5, 2023
Edited whisper output of my 30% seminar
1
00:00:01,387 --> 00:00:03,387
[mwahahahaha]
2
00:00:04,202 --> 00:00:11,662
So... Phoneme recognition for fan and prophet: uses of more direct represations- representations of speech.
3
00:00:13,620 --> 00:00:15,840
Or, you know, in English orthography.
4
00:00:17,260 --> 00:00:20,120
Uh... the suffering that Jens goes through.
5
00:00:20,160 --> 00:00:21,860
But when I showed him that first slide, and...
6
00:00:21,940 --> 00:00:23,400
"Of course you did."
7
00:00:26,600 --> 00:00:28,960
Ehm, yeah, so, eh...
8
00:00:29,400 --> 00:00:33,320
The matter was raised that because I can pronounce an /əʊ/ and an /oʊ/
9
00:00:33,780 --> 00:00:38,180
that it's not necessarily the case that mine is a monophthong, but, eh
10
00:00:38,860 --> 00:00:42,800
I based that on a recording, and for demonstration purposes, kind of,
11
00:00:43,560 --> 00:00:49,460
that is the out-, eh, was originally the output of a phoneme recognition system for English that someone else trained
12
00:00:49,980 --> 00:00:53,500
and I just corrected it a bit to match my Irish pronunciation.
13
00:00:53,920 --> 00:01:00,260
But, eh, anyone who can read a spectrogram can see that this is very definitely a monophthong, not a diphthong.
14
00:01:02,340 --> 00:01:04,660
Okay, so, things that I've done.
15
00:01:05,180 --> 00:01:17,480
Eh... TA duty, thesis supervision, and this paper that we did for the PSST challenge, which actually was a course project for the speech and speaker recognition course.
16
00:01:18,340 --> 00:01:35,160
And that eh... directly led to this eh... work that I've been doing most recently on trying to make more use of phoneme recognition, both as a way of bridging the gaps between ASR and an official transcript that isn't quite trustworthy.
17
00:01:35,580 --> 00:01:46,500
And also, umm... while we're at it, trying to get eh... acoustically validated eh... pronunciation lex- lexica for text to speech purposes.
18
00:01:48,140 --> 00:01:53,740
And so eh... TA duty: I TA'd on both these courses this year and last year.
19
00:01:56,020 --> 00:02:00,760
And I supervised this thesis, which eh... also involved phonemes.
20
00:02:04,600 --> 00:02:12,240
Eh... the thesis isn't available but... a... a paper derived from it is and if you follow that QR code it will take you to the arXiv page.
21
00:02:15,800 --> 00:02:19,460
So... first presentation.
22
00:02:20,800 --> 00:02:22,040
Ah, dammit.
23
00:02:31,800 --> 00:02:40,280
So, eh, this paper was presented at the RaPID4 workshop, eh, which was co-located with LREC last year in Marseille.
24
00:02:41,420 --> 00:02:52,080
The aim of the challenge was to get phoneme recognition for aphasic speech, and to improve it over a... an established baseline that the organizers provided.
25
00:02:52,480 --> 00:02:59,620
Along the way the, the organisers said: "Oh, oops, turns out we did too good a job." ehm...
26
00:02:59,800 --> 00:03:03,280
"A bunch of people emailed us saying: 'we can't beat the baseline.'"
27
00:03:03,660 --> 00:03:06,020
So, y'know, we did. Happy days.
28
00:03:07,580 --> 00:03:11,360
So we focused on data augmentation.
29
00:03:11,800 --> 00:03:22,040
So we tried augmenting the original, and we tried matching out-of-domain speech, eh, using data augmentation so that it sounded more like the in-domain data.
30
00:03:22,680 --> 00:03:28,500
We tried using phonetic language models, we tried using voice conversion and we tried using synthetic voices.
31
00:03:31,280 --> 00:03:39,300
I think... this has been mentioned in, eh, several, eh, 30% seminars, but in passing.
32
00:03:39,800 --> 00:03:41,320
But this is the full presentation.
33
00:03:41,800 --> 00:03:48,980
So, eh, the augmentations we used were Gaussian noise, pitch shift, time stretch and voice conversion.
34
00:03:49,500 --> 00:03:54,040
Basically, we listened to the data and tried to figure out what augmentations would make it sound like that.
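For illustration, a minimal sketch of this kind of augmentation chain, assuming the audiomentations library (the talk doesn't name the exact tooling, and the parameter ranges here are placeholders, not the values from the paper):

```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Chain the augmentations named above; ranges are illustrative only.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.2, p=0.5),
])

samples = np.random.randn(16000).astype(np.float32)  # stand-in for real audio
augmented = augment(samples=samples, sample_rate=16000)
```

The room impulse response augmentation mentioned later would slot into the same chain.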
35
00:03:55,800 --> 00:04:00,240
The choice of Gaussian noise instead of, eh, white noise or pink noise or brown noise...
36
00:04:00,720 --> 00:04:04,200
Eh... Shivam said "do that!" and we said: "yeah, okay".
37
00:04:07,180 --> 00:04:17,220
So, we had 2000- 2300 segments from PT- eh... PSST, eh, 1400 from TIMIT.
38
00:04:18,860 --> 00:04:29,080
Which, eh... So, when we're using TIMIT, eh, we tried to map it down to the same phoneset that was used in the RaPID Challenge.
39
00:04:29,800 --> 00:04:35,800
Eh... TIMIT has a more... a slightly more expressive phoneset and there were things that didn't quite match.
40
00:04:36,120 --> 00:04:45,440
So, rather than try and fudge it, we just omitted anything that didn't match up and that drastically reduced the amount of data.
41
00:04:45,800 --> 00:04:49,800
Eh, the training set by default is about three hours, we were left with about one.
42
00:04:50,060 --> 00:04:57,880
Eh, but, eh it occurred to us afterwards that, uh, we didn't need to keep the test or validation set, we could have used those and that would have been an extra two hours.
43
00:05:00,120 --> 00:05:05,800
And, eh, we tried to do things with Common Voice, that was mostly Birger.
44
00:05:08,740 --> 00:05:14,800
So, we tried using base and large wav2vec 2 models and surprise, surprise, the bigger model performed better.
45
00:05:15,480 --> 00:05:19,260
But using the base model was a hell of a lot faster and...
46
00:05:20,180 --> 00:05:22,000
eh... you know...
47
00:05:22,920 --> 00:05:31,460
there was a moment when Birger asked Jens if he could... "oh, can we use the Språkbanken machines if Jim does it" and Jens said something to him.
48
00:05:32,120 --> 00:05:33,380
But it seemed more like a...
49
00:05:34,400 --> 00:05:37,840
you know, convince me type of thing, but, I don't know...
50
00:05:38,200 --> 00:05:45,800
someone else came along and said, oh, yeah, okay, that... that... that's that, and we just dropped it and used things like Kaggle and Google Colab.
51
00:05:46,380 --> 00:05:52,580
So, ehm... I think this person had this uh... false idea that Jens is unwavering in his opinions.
52
00:05:52,700 --> 00:05:54,200
He certainly has strong opinions.
53
00:05:54,880 --> 00:06:02,240
But, eh, no, I- I've seen many times since then how excited he gets when he gets a new perspective on something.
54
00:06:04,920 --> 00:06:10,980
So, we used voice conversion, but we had very, very little data per speaker.
55
00:06:11,940 --> 00:06:15,540
Eh... if we'd had more data, it could have worked out better.
56
00:06:15,700 --> 00:06:18,020
But this was particularly noisy data.
57
00:06:18,160 --> 00:06:22,420
It was recorded on something like a cassette recorder in a quite reverberant room.
58
00:06:22,680 --> 00:06:26,960
And the cassette recorder was on a table between the interviewer and the interviewee.
59
00:06:27,340 --> 00:06:30,520
So, it really did not work out.
60
00:06:32,920 --> 00:06:34,700
Eh... we used pitch shift.
61
00:06:35,100 --> 00:06:39,460
Eh... that worked quite well with augmenting the PSST data.
62
00:06:42,980 --> 00:06:52,280
Eh... Gaussian noise worked quite well uh... in augmenting mostly the TIMIT data, the clean data that we were adding.
63
00:06:53,920 --> 00:07:00,420
Time stretch... eh... w- we added this because people with aphasia tend to draw out the things they say.
64
00:07:02,800 --> 00:07:07,940
And room impulse s- response. Well, from every recording, you could hear the echo.
65
00:07:07,980 --> 00:07:14,140
Uh, so, we used this, and it turned out to be the most successful augmentation individually of the TIMIT data.
66
00:07:16,260 --> 00:07:19,620
And so, we combined the augmentations in as... well...
67
00:07:21,040 --> 00:07:24,120
Yeah, it- th- this ended up being a little bit of a...
68
00:07:25,140 --> 00:07:28,480
So, eh, Birger is a very big picture kind of person.
69
00:07:28,760 --> 00:07:31,140
I'm a very detail-oriented kind of person.
70
00:07:31,320 --> 00:07:34,920
There're a- a couple of axes where we're kind of at opposite ends.
71
00:07:35,580 --> 00:07:45,580
So, one of the things, the augmenting the PSST data, I thought, that doesn't seem like the way to go, but I'll show you how.
73
00:07:45,600 --> 00:07:48,180
And it- it ended up working out quite well.
74
00:07:48,880 --> 00:07:49,840
To my surprise.
75
00:07:50,380 --> 00:07:58,240
But em... he wasn't convinced that we need to break it down and show uh... which combinations work... eh... but... eh...
76
00:07:59,200 --> 00:08:04,200
yeah, when the organisers asked for a table comparing all of these, it was like, isn't it good that I have that?
77
00:08:08,880 --> 00:08:15,300
We tried using language models. Eh... there was little or no improvement over this, over the baseline using this.
78
00:08:16,600 --> 00:08:19,620
And we got these nice-looking results.
79
00:08:19,820 --> 00:08:26,500
Ah... so, eh, PER is phoneme error rate, and FER is feature error rate.
80
00:08:26,560 --> 00:08:30,360
So, for each eh... phoneme, you get this phonetic feature vector.
81
00:08:30,640 --> 00:08:35,220
So, like, if it's a vowel, it could be front, it could be rounded, it could be nasal.
82
00:08:35,680 --> 00:08:40,340
Eh... if it's a consonant, this will be eh... place of articulation, et cetera.
83
00:08:40,860 --> 00:08:44,800
And so, you can get the phoneme wrong, but get most of the features.
84
00:08:44,800 --> 00:08:52,300
So, like, if it gives a... a /d/ instead of a /t/, that's one eh... phoneme error, but it's only one feature error.
85
00:08:52,640 --> 00:08:54,800
And so, it counts a lot less.
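A toy illustration of that difference, with a hypothetical three-feature inventory (real FER implementations use a much richer phonological feature set):

```python
# Hypothetical feature vectors; a real system would use many more features.
FEATURES = {
    "t": {"place": "alveolar", "manner": "plosive", "voiced": False},
    "d": {"place": "alveolar", "manner": "plosive", "voiced": True},
    "k": {"place": "velar", "manner": "plosive", "voiced": False},
}

def feature_errors(ref: str, hyp: str) -> int:
    """Count the features that differ between two phonemes."""
    return sum(FEATURES[ref][f] != FEATURES[hyp][f] for f in FEATURES[ref])

print(feature_errors("t", "d"))  # 1 phoneme error, but only 1 feature error
print(feature_errors("d", "k"))  # still 1 phoneme error, but 2 feature errors
```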
86
00:08:57,260 --> 00:09:04,060
And so, augmenting the eh... data worked out, increasing the size of the model worked out.
87
00:09:07,560 --> 00:09:09,380
Oh, wow, I haven't looked at this slide since last year.
88
00:09:09,760 --> 00:09:16,200
Ehm... yeah, the best-performing model combined aphasic and non-aphasic data, so... the domain-adapted data.
89
00:09:16,400 --> 00:09:22,620
We got a 21% phoneme error rate, which is acceptable. So anything
90
00:09:23,600 --> 00:09:28,540
if you think of it analogously to word error rate, anything below 30% eh... is...
91
00:09:29,500 --> 00:09:36,800
worth fixing for a human. Anything above that, just scrap it. You're... you may as well just listen and type it out by hand.
92
00:09:39,160 --> 00:09:42,040
And so, this was... was the original former slide.
93
00:09:43,640 --> 00:09:45,800
But there was a... it was a challenge.
94
00:09:46,020 --> 00:09:48,340
We came second! Yay!
95
00:09:49,240 --> 00:09:50,140
Of two teams.
96
00:09:50,140 --> 00:09:51,800
[laughter]
97
00:09:52,440 --> 00:09:54,240
But the other team was Baidu Research.
98
00:09:55,100 --> 00:10:08,540
Eh... If... if you're not aware, Baidu Research are the people who created the model that, ah, led to the paradigm shift towards end-to-end speech recognition. So, they kind of know something about this.
99
00:10:08,800 --> 00:10:14,020
Also, they've got a lot more money, and they were able to train on massive models.
100
00:10:14,800 --> 00:10:20,800
So, they were using the, what is it called, VoxPopuli models.
101
00:10:21,200 --> 00:10:23,800
I couldn't use those because those are not open source.
102
00:10:23,800 --> 00:10:26,800
Meta likes to say open source a lot, but they don't like to do it.
103
00:10:27,060 --> 00:10:35,480
Eh... they have recently been asked by the Open Source Initiative to not do that. I think maybe trademark lawyers may have been involved somewhere.
104
00:10:35,800 --> 00:10:40,520
But, uh, yeah, these are not open source, so I didn't feel like I could use them.
105
00:10:40,580 --> 00:10:43,800
Ah... so, we were stuck with much smaller models.
106
00:10:43,800 --> 00:10:45,780
So, at least we beat them on two.
107
00:10:46,800 --> 00:10:47,980
It's not nothing.
108
00:10:50,800 --> 00:10:54,800
And so, th- this is the citation for the previous table.
109
00:10:55,960 --> 00:11:01,140
And so, oh, wow, I took 10 minutes longer doing that last time.
110
00:11:08,400 --> 00:11:14,800
Okay, so eh... this most recent one uh... was given recently at the University of Gothenburg.
111
00:11:15,360 --> 00:11:17,800
Um... apologies to people who've seen this before.
112
00:11:19,100 --> 00:11:26,200
So, uh... one of the projects, well, uh, we're using eh... Riksdag data for multiple projects.
113
00:11:29,800 --> 00:11:37,400
So, eh, I mentioned earlier that we're using phoneme recognition to try and bridge the gap between uh... transcripts that are kind of untrustworthy.
114
00:11:37,800 --> 00:11:49,320
Eh... what we think happens, eh, what happens generally is that they want to represent the intent rather than the literal word-for-word transcript.
115
00:11:49,860 --> 00:11:54,360
So, the speeches are filed in advance, and then those are lightly edited.
116
00:11:54,420 --> 00:11:58,600
But, you know, they may or may not match what was actually spoken.
117
00:12:01,480 --> 00:12:05,360
So, we want to find out things about how people speak.
118
00:12:06,300 --> 00:12:12,080
We'd like to have better training materials, and we want to develop iterative training processes.
119
00:12:12,800 --> 00:12:19,520
So, eh, this is part of the SweTerror project, which is a multidisciplinary project.
120
00:12:20,140 --> 00:12:28,380
At the other end, we have people who are looking at the psychological aspects of how terror is being discussed in the Riksdag.
121
00:12:32,020 --> 00:12:35,900
And so, we have data for about 10 years.
122
00:12:36,220 --> 00:12:39,820
Well, we have data for about 20 years. We have tw-... eh...
123
00:12:40,280 --> 00:12:43,880
this eh... almost 6,000 hours of processed data.
124
00:12:47,880 --> 00:12:56,240
Eh so, these are the places, well, the counts for the amount of times that terror or compound words that include the word terror occur.
125
00:12:57,300 --> 00:13:08,420
So, in 826 transcript files, but over a thousand eh... files that are the result of ASR output.
126
00:13:09,420 --> 00:13:16,160
And 6,700 words in the transcript, 7,200 in the ASR output.
127
00:13:16,420 --> 00:13:19,200
So, something is going missing somewhere.
128
00:13:21,920 --> 00:13:26,340
And we have these counts for... so this is the transcripts.
129
00:13:27,260 --> 00:13:33,080
And it's quite similar for the ASR, but slightly larger in most counts.
130
00:13:33,420 --> 00:13:38,360
And these ones in italics have a different order.
131
00:13:38,840 --> 00:13:40,500
So, see it again.
132
00:13:43,620 --> 00:13:46,800
So, there are mundane issues whenever you use ASR.
133
00:13:51,800 --> 00:13:53,800
[Swedish: av förslag till ("of proposals for")]
134
00:13:54,260 --> 00:13:55,800
Okay, that wasn't loud.
135
00:13:57,460 --> 00:13:58,840
[Swedish: av förslag till ("of proposals for")]
136
00:14:00,120 --> 00:14:07,120
So, the point here is that the video starts at this point and it's missing that word, but the transcript includes it.
137
00:14:07,440 --> 00:14:09,800
So, we can't trust things like that.
138
00:14:14,200 --> 00:14:16,800
Then there are things like normalization.
139
00:14:23,620 --> 00:14:24,580
This one was fun.
140
00:14:28,880 --> 00:14:30,400
So, you can see what happened there.
141
00:14:31,060 --> 00:14:34,900
She said "i sex", but this gave "IX".
142
00:14:45,040 --> 00:14:47,980
So, mundane recognition error.
143
00:14:54,620 --> 00:14:57,800
So, we've mostly solved the normalization issue.
144
00:14:58,800 --> 00:15:04,860
Eh... this NeMo text processing module includes full support for Swedish
145
00:15:06,220 --> 00:15:14,320
which took way too long... but eh... we're using it for what's called acoustically validated normalization.
146
00:15:15,000 --> 00:15:17,500
So... eh... now...
147
00:15:18,800 --> 00:15:26,560
they described this much more complicated process where they use like a... a BERT-style language model to rate multiple outputs.
148
00:15:26,800 --> 00:15:39,460
What we're using is the composition of a... a basic n- n-gram model with the potential outputs of the finite state transducer that's created by this.
149
00:15:39,800 --> 00:15:44,800
And that creates an acceptor and it will accept any of the possibilities.
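A minimal sketch of pulling multiple candidate normalizations out of NeMo text processing, assuming the NormalizerWithAudio entry point exposes the Swedish grammars mentioned above (argument names may vary between NeMo versions):

```python
from nemo_text_processing.text_normalization.normalize_with_audio import (
    NormalizerWithAudio,
)

# The FST produces every plausible expansion; a scoring step (here, the
# n-gram composition described above) then picks the winner.
normalizer = NormalizerWithAudio(input_case="cased", lang="sv")
candidates = normalizer.normalize("i 6 månader", n_tagged=10)
for candidate in candidates:
    print(candidate)  # e.g. "i sex månader", among other expansions
```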
150
00:15:48,260 --> 00:15:50,360
So, can we trust transcripts?
151
00:15:51,160 --> 00:15:59,380
Uh... for this, we're using an ASR model that comes from KBlab, the large uh... VoxRex Swedish.
152
00:16:00,880 --> 00:16:04,540
So, in the transcripts, we get some things that were never actually said.
153
00:16:04,660 --> 00:16:08,420
So, this is in the transcript, but not... it was never spoken.
154
00:16:09,380 --> 00:16:12,800
Doesn't change the meaning so much here, but eh... other places it could.
155
00:16:14,900 --> 00:16:16,800
Then, you know, phrases can be moved.
156
00:16:17,800 --> 00:16:26,380
This one was up here at the start of the phrase in the transcript, but was moved to the end of the sentence in... when spoken.
157
00:16:28,800 --> 00:16:30,800
Some things are just added in the moment.
158
00:16:30,800 --> 00:16:33,740
Uh... they just added in these couple of words.
159
00:16:36,260 --> 00:16:38,380
And so, what can we find automatically?
160
00:16:38,980 --> 00:16:41,200
Well, we can find false starts.
161
00:16:41,800 --> 00:16:53,160
So, when we have, eh, on the right side, a valid word, and then on the left side, a word that ends with that, but starts with a substring of that.
162
00:16:53,740 --> 00:16:57,140
That's a pretty clear indication that this person had a false start.
163
00:16:57,980 --> 00:16:59,040
This is quite easy.
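A minimal sketch of that heuristic, assuming the false start gets merged into the recognized word (the example words are hypothetical):

```python
def is_false_start(recognized: str, valid: str) -> bool:
    """True if `recognized` ends with `valid` and the leftover opening
    fragment is itself a substring of `valid`."""
    if recognized == valid or not recognized.endswith(valid):
        return False
    fragment = recognized[: -len(valid)]
    return fragment in valid

print(is_false_start("föförslag", "förslag"))  # True: "fö" + "förslag"
print(is_false_start("förslag", "förslag"))    # False: exact match
```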
164
00:17:00,800 --> 00:17:06,420
Then, you get things that are very indicatio- indicative of alternative pronunciations.
165
00:17:06,520 --> 00:17:13,500
So, you know, clearly, someone had a different dialect, or they said the word really quickly, and...
166
00:17:14,160 --> 00:17:20,720
you know, a- a wav2vec-style model is trained with CTC, and it basically predicts every frame.
167
00:17:21,220 --> 00:17:25,060
And if something isn't quite audible, then it's omitted.
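A minimal sketch of greedy CTC decoding shows why that happens: repeated frame labels are collapsed and blanks dropped, so anything that never wins a single frame simply never surfaces:

```python
import itertools

def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse runs of identical labels, then remove the blank symbol."""
    collapsed = (label for label, _ in itertools.groupby(frame_labels))
    return [label for label in collapsed if label != blank]

# Hypothetical per-frame argmax labels: a very short sound that is never
# the most probable label in any frame is omitted from the output.
frames = ["_", "f", "f", "_", "ɛ", "ɛ", "_", "m", "m", "_"]
print(ctc_greedy_decode(frames))  # ['f', 'ɛ', 'm']
```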
168
00:17:27,180 --> 00:17:31,000
Then, we get filled fog- pauses, so, like an "eh" or a "mm".
169
00:17:34,220 --> 00:17:37,560
But wait, didn't OpenAI Whisper solve ASR?
170
00:17:38,740 --> 00:17:46,900
Well, we can't really trust it, because eh... th- the data that we're using was probably part of the training data.
171
00:17:51,420 --> 00:18:05,060
And we get this thing where the word "tack" disappears when it comes before "talman", "herr talman", or "fru talman", because this is never included in the transcripts.
172
00:18:05,200 --> 00:18:10,220
The transcip- transcripts only ever start with one of these three phrases.
173
00:18:10,500 --> 00:18:17,220
And so, anywhere that this occurs in anything spoken, the word "tack" disappears.
174
00:18:19,800 --> 00:18:21,800
Then, you get these strange insertions.
175
00:18:22,440 --> 00:18:25,840
So, I mean, this is about five minutes, and, well...
176
00:18:26,380 --> 00:18:30,560
I think in total it was about five minutes, but uh... I didn't think I needed to put all that much.
177
00:18:30,800 --> 00:18:35,180
But, eh, yeah, eh, yeah, this was silence. Nobody said these words.
178
00:18:36,720 --> 00:18:37,800
Well, you can see what happened.
179
00:18:38,280 --> 00:18:40,800
This is stuff that probably came from YouTube.
180
00:18:41,580 --> 00:18:50,540
And, you know, this is eh, one of the things that was mentioned by an anonymous ASR... or TTS researcher recently, but, eh, yeah
181
00:18:50,640 --> 00:18:54,800
this kind of thing happens all the time, 'cause people use subtitles in really strange ways.
182
00:18:56,800 --> 00:19:00,800
So, our alternative is to use phonemic recognition.
183
00:19:01,060 --> 00:19:09,280
I'm using the word phoneme in a very technology-oriented sense, and apologies to anyone who knows phonetics or phonology.
184
00:19:10,340 --> 00:19:14,600
And for our data, we're using the Waxholm data set.
185
00:19:14,660 --> 00:19:17,020
So, Vaxholm is an island. I...
186
00:19:18,340 --> 00:19:23,500
Feels really... It felt more appropriate in Gothenburg to have this slide, and I forgot to check.
187
00:19:24,900 --> 00:19:26,480
It's a really nice place.
188
00:19:27,320 --> 00:19:28,740
They have a nice Christmas market.
189
00:19:30,280 --> 00:19:37,180
Very picturesque... eh, one of the people we were there with said, oh, it's like a setting for a Hallmark Christmas movie.
190
00:19:41,200 --> 00:19:43,040
Then eh... Waxholm with a W.
191
00:19:44,460 --> 00:19:52,520
Well, it's a dialogue system that gave information about shipping, mostly to and from Vaxholm, as far as I can tell. But, you know, there was more than that.
192
00:19:52,760 --> 00:19:58,120
So, it incorporated text-to-speech, ASR, face synthesis, and dialogue management.
193
00:19:58,200 --> 00:20:03,380
But in the earliest versions, there was no ASR system. So, there was a Wizard of Oz setup used.
194
00:20:03,900 --> 00:20:10,340
The data from these sessions was transcribed at the word and phoneme level, including non-speech acoustic events.
195
00:20:12,540 --> 00:20:18,920
And this is a diagram of the system. There was quite a lot happening there.
196
00:20:19,700 --> 00:20:24,240
Eh... I think just about everything that happened there would be replaced by a neural network these days, but
197
00:20:24,780 --> 00:20:27,820
a lot of this is rule-based or HMM-based, or
198
00:20:31,120 --> 00:20:36,500
Yeah, I... I saw this video in the... in Jens' course, and...
199
00:20:37,340 --> 00:20:44,160
Where am I going to find this? I thought, hmm, Joakim has to have a YouTube channel. And sure enough, yes, he does.
200
00:20:50,080 --> 00:20:52,060
This is what the original data looked like.
201
00:20:53,900 --> 00:20:55,640
That was fun. Ehm...
202
00:20:56,460 --> 00:21:02,800
Some... eh... eh... there seems to have been a couple of different conventions, like, things seem to have been done at different times.
203
00:21:03,140 --> 00:21:05,140
Ah, so, eh...
204
00:21:06,700 --> 00:21:07,600
It's been fun.
205
00:21:10,920 --> 00:21:17,800
There are some frames that are inconsistently labeled, there are empty frames used to mark unrealised segments.
206
00:21:19,400 --> 00:21:22,640
I mean, I see why they're there, but they're kind of annoying. Ehm...
207
00:21:23,180 --> 00:21:34,920
couple of schools of thought about how to... about the ge- the generated phoneme sequences, whether to include the... the non-speech events or not.
208
00:21:35,760 --> 00:21:43,200
And there was a lot of copying and pasting of files, so the metadata is often, like, really wrong.
209
00:21:46,800 --> 00:21:52,300
Here's some videos with... output from the phonetic... or phonemic-ish...
210
00:21:55,540 --> 00:22:00,120
So eh... this, by chance, was one of the first things I picked out after the...
211
00:22:01,120 --> 00:22:06,920
eh, I think it was three weeks of running Whisper across 6,000 hours.
212
00:22:07,320 --> 00:22:11,380
And... eh... and also wav2vec 2.
213
00:22:12,200 --> 00:22:19,380
This little "ha" in the middle is exactly the sort of thing that our downstream users want to look at.
214
00:22:19,800 --> 00:22:22,640
So, happy days. I found it straight away.
215
00:22:23,360 --> 00:22:24,400
Happy accident.
216
00:22:35,480 --> 00:22:40,100
Eh... He has a little "heh", like, a derisive laugh.
217
00:22:50,140 --> 00:22:51,060
So this...
218
00:23:12,000 --> 00:23:13,340
It's already at max.
219
00:23:15,700 --> 00:23:22,540
Right, so I'll... I mean, this is not only a false start, it's a false start with a phrase inserted, so this is harder to catch.
220
00:23:28,760 --> 00:23:30,360
Ah, this one from earlier.
221
00:23:30,800 --> 00:23:32,860
Phoneme recognizer: got it.
222
00:23:41,380 --> 00:23:43,560
And it kinda got this.
223
00:23:46,784 --> 00:23:48,324
It missed out on...
224
00:23:48,700 --> 00:23:51,060
Well, what sounds like a rhoticism. Eh...
225
00:23:52,340 --> 00:23:58,660
It missed out on that because, eh, those are very short, ehm, I think that's mentioned on a future slide.
226
00:24:02,320 --> 00:24:06,260
Eh, it's not perfect, there are a few mistakes here.
227
00:24:06,800 --> 00:24:11,120
[Swedish: Mr Speaker! EU cooperation makes Sweden stronger and safer.]
228
00:24:11,140 --> 00:24:18,920
[Swedish: Threats like the climate crisis, pandemics, terrorism and organised crime cannot be solved by a single country.]
229
00:24:29,798 --> 00:24:34,038
Ah, this... it did quite a good job of catching the pauses and hesitations.
230
00:25:08,724 --> 00:25:14,284
And with this, eh, it picked up that there was a different pronunciation than standard.
231
00:25:26,200 --> 00:25:30,820
I swear I picked this out before the current... crisis. I... I...
232
00:25:31,800 --> 00:25:40,420
But, eh, yeah, so instead of any of the usual pronunciations of how she says /o:/ and I've verified this.
233
00:25:42,800 --> 00:25:44,800
So, uh, other things that we're doing.
234
00:25:44,800 --> 00:25:55,120
Eh, so we're going to be going back to HMM-based forced alignment because it tends not to skip over as much.
235
00:25:55,980 --> 00:26:01,520
So there's a shorter stride, it's 10 men- milliseconds instead of 20, and...
236
00:26:01,980 --> 00:26:07,360
with that, you get to catch these really short, eh, R and L sounds.
237
00:26:07,800 --> 00:26:13,020
And it's dictionary-based, we'll be generating the dictionary, but then, y'know, it either aligns or it doesn't.
238
00:26:13,280 --> 00:26:15,920
So if something doesn't align, we need to go back and check.
239
00:26:17,100 --> 00:26:22,880
And we're looking at making an acoustically validated pronunciation dictionary.
240
00:26:23,100 --> 00:26:30,460
For the vast majority of these speakers, eh, they... they're public figures: we know who they are, we know where they grew up, we know... know where they were born.
241
00:26:30,560 --> 00:26:35,280
So we can give very good guesses as to what their dialect is going to be.
242
00:26:35,800 --> 00:26:43,600
Ehm... and so to begin with, we're doing an intersection of the dictionary-derived pronunciations and the phonemic transcriptions,
243
00:26:44,060 --> 00:26:51,120
which is great if it's a Stockholm speaker because pronunciation dictionaries tend to favour the Stockholm dialect.
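A minimal sketch of that intersection step (all words and pronunciation strings here are hypothetical placeholders):

```python
# Dictionary-derived pronunciations vs. pronunciations observed by the
# phoneme recognizer at the aligned time steps; keep only the overlap.
dictionary = {"talman": {"tɑːlman", "talman"}}
recognized = {"talman": {"tɑːlman", "tɑlmn"}}

validated = {
    word: dictionary[word] & recognized[word]
    for word in dictionary.keys() & recognized.keys()
    if dictionary[word] & recognized[word]
}
print(validated)  # {'talman': {'tɑːlman'}}
```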
244
00:26:52,100 --> 00:26:57,300
We're using rules to add various alternatives and context-based replacements.
245
00:26:57,740 --> 00:27:04,060
And in the end, we want to get dialogue-specific lexica, eh, to begin with, and then
246
00:27:04,440 --> 00:27:14,660
other things based on speaking rate and other factors, eh, that will be useful mostly for text-to-speech research, eh,
247
00:27:14,800 --> 00:27:22,800
hopefully I get to just pick this uh... package up and go, "here you go, let me know if there are any problems", but, ehm, we'll see.
248
00:27:25,800 --> 00:27:31,800
And so based just on Wiktionary rather than a much larger dictionary that was harder to work with,
249
00:27:33,140 --> 00:27:43,800
these are the top 10 validated pairs, so the written form taken from the intersection of the transcript and the ASR output,
250
00:27:44,140 --> 00:27:49,140
and then the pronunciation that occurred at the same time step.
251
00:27:50,004 --> 00:28:03,280
Eh, there was a question last time because, eh, the first word is pronounced /at/ and not /o/, which surprised people, eh, I've looked it up since: this is an error in Wiktionary.
252
00:28:03,640 --> 00:28:12,400
They've given an alternative pronunciation of /ɔ/, which happens maybe 40 times in 6,000 hours.
253
00:28:12,615 --> 00:28:15,615
But /o/ happens quite a lot.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment