@chauhan-utk
Created August 21, 2021 03:33
Logseq Base64 render from Zotero notes issue

tags:: [[Computer Science - Artificial Intelligence]], [[Computer Science - Computation and Language]], [[Computer Science - Machine Learning]], [[Electrical Engineering and Systems Science - Audio and Speech Processing]]
date:: [[Jun 14th, 2021]]
extra:: arXiv: 2106.07447
title:: HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
item-type:: [[journalArticle]]
access-date:: 2021-07-27T06:40:41Z
original-title:: HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
url:: http://arxiv.org/abs/2106.07447
short-title:: HuBERT
publication-title:: "arXiv:2106.07447 [cs, eess]"
authors:: [[Wei-Ning Hsu]], [[Benjamin Bolte]], [[Yao-Hung Hubert Tsai]], [[Kushal Lakhotia]], [[Ruslan Salakhutdinov]], [[Abdelrahman Mohamed]]
library-catalog:: arXiv.org
links:: Local library, Web library

  • [[Abstract]]
    • Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
  • [[Notes]]
    • Important Results

    • Extracted Annotations (8/20/2021, 8:43:00 AM)

      "self-supervised representations offer two unique advantages" (Hsu et al 2021:1)

      "self-supervised pretext tasks force the model to represent the entire input signal by compressing much more bits of information into the learned latent representation" (Hsu et al 2021:1)

      "Speech signals differ from text and images in that they are continuous-valued sequences." (Hsu et al 2021:2)

      "The predictive loss is only applied over the masked regions, forcing the model to learn good high-level representations of unmasked inputs to infer the targets of masked ones correctly." (Hsu et al 2021:2)

      "One crucial insight motivating this work is the importance of consistency of the targets, not just their correctness, which enables the model to focus on modeling the sequential structure of input data." (Hsu et al 2021:2)

      "In the extreme case when  = 0, the loss is computed over the unmasked timesteps, which is similar to acoustic modeling in hybrid speech recognition systems" (Hsu et al 2021:3)

      "In the other extreme with = 1, the loss is only computed over the masked timesteps where the model has to predict the targets corresponding to the unseen frames from context, analogous to language modeling. It forces the model to learn both the acoustic representation of unmasked segments and the long-range temporal structure of the speech data. We hypothesize that the setup with = 1 is more resilient to the quality of cluster targets, which is demonstrated in our experiments" (Hsu et al 2021:3)

      "an ensemble of k-means models with different codebook sizes can create targets of different granularity, from manner classes (vowel/consonant) to sub-phone states (senones)" (Hsu et al 2021:3)

      "This is analogous to multi-task learning, but with tasks created by unsupervised clustering." (Hsu et al 2021:3)
