Faster, Better Speech Recognition with Wav2Letter's Auto Segmentation Criterion
In 2016, Facebook AI Research (FAIR) broke new ground with Wav2Letter, a fully convolutional speech recognition system.
In this article, we'll focus on an understudied module at the core of Wav2Letter: the Auto Segmentation (ASG) Criterion.
In the Wav2Letter architecture shown above, we'll find ASG to the right of the acoustic model.
Using a convolutional approach with ASG, FAIR reported significant improvements in Letter Error Rate (LER) when applied to the TIMIT dataset.
... as well as speed increases for short and long sequences, even though Wav2Letter used a CPU-only version of the model for benchmarking.
Short sequence timing in ms
Long sequence timing in ms
Fundamentally, the ASG Criterion is a special type of loss function.
ASG builds on techniques of older algorithms like Connectionist Temporal Classification (CTC), which have long been a mainstay of speech recognition models.
To understand ASG, we'll first need to understand the specific problem solved by algorithms like CTC. Then we'll take a brief look at CTC so we can finally understand how ASG differs and improves upon it.
From Sound to Letter
The heart of Wav2Letter is an acoustic model that, as you may have already guessed, predicts letters from sound waves.
Specifically, Wav2Letter processes audio into slices, passes them through various convolutional layers, and outputs a set of probabilities for each audio slice. Each probability set holds estimates for each letter in the model's dictionary of letters.
This means that for a slice of audio, we'll have an estimate that the letter spoken at that moment is an 'e' or 't' or 's', or any other possible letter.
The acoustic model spits out a chain of these probabilities, where each link in the chain represents an estimate that a particular letter appears at that moment.
This chain contains the hypotheses of our acoustic model. To reach our final prediction, we need to transform this chain of probabilities into the most likely series of letters that occur across audio slices.
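To make the "chain of probabilities" concrete, here is a minimal sketch in plain Python. The four-letter dictionary, the numbers, and the helper name `best_letter_per_slice` are all invented for illustration; they are not taken from Wav2Letter. The sketch only picks the top letter per slice, which is a naive stand-in for the real decoding step.

```python
# Hypothetical toy dictionary of letters (an assumption for this sketch).
LETTERS = ["h", "e", "l", "o"]

# Each row is one audio slice; each column is the estimate for one letter.
# The values are made up for illustration.
chain = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.2, 0.6],
]


def best_letter_per_slice(chain, letters):
    """Return the single most probable letter for every audio slice."""
    return [letters[max(range(len(row)), key=row.__getitem__)] for row in chain]


print(best_letter_per_slice(chain, LETTERS))  # ['h', 'e', 'l', 'o']
```

Picking the top letter in each slice independently ignores how long each letter lasts and where it starts, which is exactly the gap the next section describes.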
The Alignment Problem
Remember that Wav2Letter's acoustic model is basically a sound wave-to-letter classifier. The model 'sees' a bit of the sound wave input and says, "Ok, that looks like an 'H' or maybe an 'S'."
But unlike a static image, sound waves flow through time. If our phrase is "THE CAT", how do we know when the speaker has stopped saying 'T' and moved on to 'H'?
To learn "This is what a 'T' looks like", Wav2Letter needs to understand how spoken utterances transition between letters over time.
Only by understanding these transitions can the model begin to map its representations of a sound to the correct letter label.
But training data for speech recognition typically only comes with audio and a written transcript – no data for alignment between the two. We might input a three-second .wav file of someone saying "The Cat" alongside a .txt file of the letters: "The Cat".
We know that 'T' comes before 'H' in 'The', but the transcript doesn't tell us when.
Example letter alignments from CTC (top) and ASG (bottom) over an audio segment.
Aligning every letter by hand to the matching moment in audio would be time-consuming and nearly impossible at scale. We also can't rely on superficial general rules like 'one letter lasts 500 milliseconds' because people speak at different speeds.
What do we do? Enter CTC.
How CTC Solves the Alignment Problem
Traditionally, practitioners solved this lack of alignment data with the Connectionist Temporal Classification (CTC) Algorithm (don't you love these names?).
In this section, we'll only touch on the high points of CTC so we can see how ASG differs. You can read a deeper explanation of CTC in this great article on Distill.
For each slice of audio, CTC expects a set of probabilities for all possible letters plus, crucially, a special 'blank' token.
Wav2Letter's acoustic model feeds its output chain of probability sets into CTC, which gets to work finding the highest probability output. And it does it without ever having any timing data.
Let's take a simplified example of a single audio file where the speaker says 'hello'.
As we've seen, our input is audio transformed by our acoustic model into a set of probabilities for each letter at each audio slice.
In this example, we're imagining that our dictionary of letters only contains 'h', 'e', 'l', 'o', and the special blank token mentioned before, which I'll call by its formal scientific name: squiggly e.
Every time slice has estimates for each letter. Darker cells represent a higher probability for that letter. Believe it or not, this is all we need to infer likely alignments between sound and letters.
Imagine this grid of letters as a graph. CTC snakes its way through every possible combination, column by column.
A graph of every valid CTC alignment for the word 'cat'
This graph gives us every possible alignment of letters for the audio. (In practice, CTC uses dynamic programming techniques that make this process much more efficient than it sounds).
CTC sums the probability for each possible alignment. Once finished, CTC surfaces the most probable alignments for a segment.
Put more formally, CTC aims to maximize the overall score of a path through this graph of possible alignments.
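The dynamic programming shortcut can be sketched in a few lines. The function below is a plain-Python version of the standard CTC forward recursion. It adds up the probability of every valid alignment without listing them one by one. The function name, the argument layout, and the choice of index 0 for the blank are assumptions made for this sketch, not part of Wav2Letter.

```python
def ctc_total_probability(probs, target, blank=0):
    """Sum the probability of every valid CTC alignment of `target`.

    probs[t][k] is the probability of symbol k at frame t.
    target is a list of symbol indices without blanks.
    """
    # Interleave blanks around the target: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    num_states = len(ext)

    # alpha[s] = total probability of all partial alignments ending in state s
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for frame in probs[1:]:
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skipping a blank is only allowed between two different letters
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * frame[ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last letter or on the trailing blank
    return alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)


# Two frames, two symbols (0 = blank, 1 = 'a'), target "a".
# Valid alignments: "aa", "_a", "a_" -> 0.42 + 0.28 + 0.18
print(ctc_total_probability([[0.4, 0.6], [0.3, 0.7]], [1]))
```

The work grows with the number of frames times the length of the transcript, rather than with the number of possible alignments.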
Here are two highly probable alignments for our 'hello' example:
Raw alignments will run from the reasonable to the ridiculous. Notice that our second alignment, while technically valid, doesn't even include an 'H'! Other alignments might be "HHHHHELLOO" or "HEELLLLLOO" and, on the less-likely side, "OOOOOOOOOO" and "LLLLLOOOOO".
To generate its final output CTC removes repeated letters...
... and removes the special blank token, the squiggly e.
With repeats and squiggly e's removed we end up with: "hello" as our highest-probability output.
This output can be compared with the written transcript for our audio. We can calculate our model's loss against the ground-truth of our transcript. Nice!
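The two cleanup steps are small enough to sketch directly. In this minimal example the underscore stands in for the squiggly e, and the function name `ctc_collapse` is made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # stands in for the squiggly e in this sketch


def ctc_collapse(alignment, blank=BLANK):
    """Collapse a raw CTC alignment: merge repeats first, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return "".join(symbol for symbol in merged if symbol != blank)


print(ctc_collapse("hh_e_ll_lloo"))  # hello
print(ctc_collapse("hhheelllooo"))   # helo
```

The order matters. Because repeats are merged before blanks are removed, the blank between the two runs of 'l' is what keeps the double letter in "hello". Without it, the second example collapses to "helo".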
CTC accomplishes two important things:
First, by snaking its way through possible alignments and linking highly probable individual letter guesses, CTC produces a valid prediction of a transcript for the audio without any alignment data.
This also allows CTC to handle variations in audio, such as when a speaker dwells on the letter 'h', because CTC can include 'h' multiple times in the alignment. When we remove duplicates, we still end up with "hello".
Second, our special squiggly e does double-duty as the separator for junk-frames (such as silence or breathing that might occur between letters) and as the separator for repeated letters.
This lets the model cope with noisy frames where it's not confident about any letter. Plus, it lets the model generate words like 'hello' even though 'l' is a repeated letter and CTC removes repeats.
Ok, so WTF is ASG?
The Auto Segmentation Criterion (ASG) is different from CTC in two ways:
- There is no special blank label (squiggly e).
- ASG avoids a certain type of normalization.
Let's look at each of these.
No Blank Token Makes Things Simpler and Faster
In the Wav2Letter paper, FAIR reports that, in practice, there was "no advantage" to using a special blank token to handle junk frames of audio between letters.
So ASG removes this token. To handle repeated letters, ASG uses a '2' label in place of the second letter instead of separating the pair with a blank token. In our example, 'hello' would become 'hel2o'.
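Here is a rough sketch of that repetition label in Python. The helper names `asg_encode` and `asg_decode` and the `max_run` cap are assumptions for illustration, and Wav2Letter's own preprocessing may differ in detail.

```python
from itertools import groupby


def asg_encode(word, max_run=3):
    """Replace runs of a repeated letter with the letter plus a count label."""
    out = []
    for letter, group in groupby(word):
        run = len(list(group))
        while run > 0:
            chunk = min(run, max_run)  # hypothetical cap on the largest label
            out.append(letter if chunk == 1 else f"{letter}{chunk}")
            run -= chunk
    return "".join(out)


def asg_decode(labels):
    """Expand count labels back into repeated letters."""
    out = []
    for ch in labels:
        if ch.isdigit():
            out.append(out[-1] * (int(ch) - 1))  # repeat the previous letter
        else:
            out.append(ch)
    return "".join(out)


print(asg_encode("hello"))  # hel2o
print(asg_decode("hel2o"))  # hello
```

Because the encoded transcript never contains the same label twice in a row, merging repeated frames can no longer swallow a genuine double letter.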
A CTC graph of the acceptable sequences of letters for 'CAT'
An ASG graph of the acceptable sequences of letters for 'CAT'. Notice that there's no special token for junk frames.
By removing the special token, ASG significantly simplifies the graph that the algorithm must search when generating alignments. This likely leads to some of the performance gains reported.
ASG Allows the Acoustic Model to Learn Relationships Between Letters
CTC expects its input to be normalized at the frame level. For each probability set in the chain created by our acoustic model, the probability of each letter is normalized with the probability of the other letters in that frame.
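One common way to do this kind of per-frame normalization is a softmax over each frame's raw scores. The sketch below assumes that choice; the function names and numbers are invented for illustration.

```python
import math


def softmax(scores):
    """Turn one frame's raw letter scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_frames(frame_scores):
    """Normalize every frame independently of all the other frames."""
    return [softmax(frame) for frame in frame_scores]


frames = normalize_frames([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
print([round(sum(frame), 6) for frame in frames])  # [1.0, 1.0]
```

Each frame is rescaled using only its own scores, so nothing in this step carries information from one frame to the next.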
For CTC, each frame is its own little world. What matters is to find the highest sum of letter-to-letter predictions across the frames.
A CTC graph for all valid alignments for 'CAT' across five frames. Nodes connect to each other, but the lines don't indicate a greater or lesser probability of a given connection.
For various technical reasons, ASG doesn't do frame normalization. The details of the normalization are less important than what it implies:
ASG gives powers to Wav2Letter's acoustic model that are usually reserved for language models: the ability to learn the likelihood of transitions between letters.
In real language, certain combinations of letters are much more likely than others. Modeling the likelihood of these letter-to-letter combinations, called 'transitions', could improve model accuracy.
Some transitions are obviously more likely than others. For example, in English the series 'TH' is much more likely than 'TS' (as in obscure words like 'tsar' or 'tsetse fly').
ASG contains its own weight matrix that models possible transitions between each letter. Like any other standard weight matrix, these weights are trained through backpropagation.
Using this matrix, ASG allows the acoustic model to learn transition scores – the likelihood that a letter follows another letter – and bake them right into the edges of the graph we use to generate the most likely alignment for our letter-to-letter prediction.
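To see how transition scores change the outcome, here is a small sketch that scores alignments as per-frame letter scores plus transition scores, then finds the best path with a Viterbi search over the fully connected letter graph. The three-letter alphabet, all of the numbers, and the function names are invented for illustration. The sketch covers only path scoring, not the full ASG loss.

```python
def path_score(emissions, transitions, path):
    """Score one alignment: per-frame letter scores plus transition scores.

    transitions[i][j] is the score for moving from letter i to letter j.
    """
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def best_path(emissions, transitions):
    """Viterbi search for the highest-scoring path through all frames."""
    n = len(emissions[0])
    scores = list(emissions[0])
    back = []
    for frame in emissions[1:]:
        pointers, new_scores = [], []
        for j in range(n):
            i = max(range(n), key=lambda k: scores[k] + transitions[k][j])
            pointers.append(i)
            new_scores.append(scores[i] + transitions[i][j] + frame[j])
        back.append(pointers)
        scores = new_scores
    last = max(range(n), key=scores.__getitem__)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], scores[last]


# Letters: 0 = 'T', 1 = 'H', 2 = 'S' (toy values, not learned weights)
emissions = [[2.0, 0.0, 0.0],   # frame 1 clearly sounds like 'T'
             [0.0, 1.0, 1.1]]   # frame 2 slightly prefers 'S' over 'H'
transitions = [[0.0, 1.0, -1.0],  # 'T' -> 'H' is rewarded, 'T' -> 'S' penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

path, score = best_path(emissions, transitions)
print(path)  # [0, 1], i.e. 'T' then 'H'
```

On sound alone, the second frame would be labeled 'S'. The transition score from 'T' to 'H' tips the decision the other way.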
An ASG graph for 'CAT' unfolded over five frames. Edges (lines) between graph nodes contain learned scores for transitions between letters.
FAIR's results suggest that this enhancement of the acoustic model improves the accuracy of the CNN.
Since the acoustic model contains useful understanding of the letter sequences, Wav2Letter's decoder actually uses the transition data from the acoustic model as well as output from its real language model when scoring its final transcript.
Speech recognition systems like Wav2Letter face an annoying problem: there's rarely data about how sound and transcriptions are aligned in time.
But to generate an accurate letter-by-letter prediction, we need to know when one letter starts and another letter ends as our acoustic model learns to associate sound waves with certain letters.
Traditionally, deep learning practitioners solved this problem with an algorithm called CTC. Though CTC works well in many cases, it includes an extra token that increases complexity and decreases speed. It also includes a form of normalization that limits how much the acoustic model can learn.
ASG is a special type of loss function that refines CTC by removing CTC's extra token and allowing the acoustic model to use its own weight matrix to learn transitions between letters.
If you're curious to learn more about Wav2Letter or ASG, see the references below.
- Sequence Modeling with CTC
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
- Wav2Letter: an End-to-End ConvNet-based Speech Recognition System
- Fully Convolutional Speech Recognition
- Letter-Based Speech Recognition With Gated Convnets
- Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data