ASR Notes

10/2/2018

Goal:

Use Kaldi on WSJ data to train and decode with a traditional HMM-GMM monophone model, then a triphone HMM-GMM model, and then a simple HMM-DNN model.

Background

Contents

  • Getting set up
  • Important files
  • Data Preparation (This is somewhat WSJ-specific)
  • Training a GMM-HMM monophone model
  • Decoding with your newly-trained monophone model
  • Training a GMM-HMM triphone model
  • Decoding with your triphone model
  • Training a DNN-HMM model
  • Decoding with your DNN-HMM model
  • Common problems

Getting set up

You're going to want to pull the Kaldi trunk and figure out how to compile Kaldi first.
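If you haven't done this before, here's a minimal sketch of that first step (the GitHub URL is Kaldi's official repository; the destination path is just an example):

# Clone the Kaldi trunk; ~/git/kaldi is an arbitrary location.
git clone https://github.com/kaldi-asr/kaldi.git ~/git/kaldi
cd ~/git/kaldi/tools
# Reports any missing build dependencies before you start compiling.
extras/check_dependencies.sh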

Note that the build setup stuff will tell you to run the configure script before running make. configure sets up variables and other configurations that you should know how to control. Around line 93 there is the following comment:

     Following environment variables can be used to override the default toolchain.
        CXX         C++ compiler [default=g++]
        AR          Archive maintenance utility [default=ar]
        AS          Assembler [default=as]
        RANLIB      Archive indexing utility [default=ranlib]

You can set the $CXX environment variable before running configure to choose which C++ compiler to use. This will come in handy when we try to set up CUDA. If you have a GPU and want to use it to speed up neural network computation, you should make sure you have the correct drivers installed, nvidia-smi is working, and the CUDA toolkit is installed (see https://developer.nvidia.com/cuda-toolkit-archive). Please read all the way through this to figure out which version of the toolkit to download and install. If you end up installing a version that is too new and need to downgrade without really knowing what you're doing, it's a massive headache (I basically ended up reinstalling Ubuntu after going through this).

If you want to compile Kaldi without CUDA, you can skip the next paragraph and run ./configure --use-cuda=no instead of ./configure.

To figure out which version of CUDA to install, open the configure script again and find the line that says function configure_cuda {. Note that the version of CUDA you install must be compatible with the g++ version on your machine (and this is exactly why the $CXX environment variable mentioned above comes in handy; you can use it to get to a configuration that works for you).
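For example, a sketch of pinning both the compiler and the CUDA toolkit location when configuring (g++-6 and /usr/local/cuda are placeholders; substitute whatever combination is compatible on your machine):

# Example only: the g++ version and CUDA path below are assumptions.
cd ~/git/kaldi/src
CXX=g++-6 ./configure --use-cuda=yes --cudatk-dir=/usr/local/cuda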

At this point you should be able to:

saurav@mrp-dev:~/git/kaldi/tools$ make -j
...
saurav@mrp-dev:~/git/kaldi/src$ ./configure
...
saurav@mrp-dev:~/git/kaldi/src$ make clean -j; make depend -j; make -j;

Hopefully at this point Kaldi has compiled, and you should see a bunch of executable binaries under src/. Most (all?) of these executables are located in the src/*bin folders. None of these binaries are on your PATH yet, but we'll get to that later. You can call them with no arguments to figure out what they do:

saurav@mrp-dev:~/git/kaldi/src/featbin$ ./compute-mfcc-feats
./compute-mfcc-feats

Create MFCC feature files.
Usage:  compute-mfcc-feats [options...] <wav-rspecifier> <feats-wspecifier>

Options:
  --allow-downsample          : If true, allow the input waveform to have a higher frequency than the specified --sample-frequency (and we'll downsample). (bool, default = false)
  --blackman-coeff            : Constant coefficient for generalized Blackman window. (float, default = 0.42)
  --cepstral-lifter           : Constant that controls scaling of MFCCs (float, default = 22)

...etc...

One thing to note is that most (all?) of these binaries print out the command they were called with as the first thing in their log output, which can be really useful for debugging.

Now let's go into the kaldi/egs/wsj/s5 folder. egs stands for "examples" and I think s5 is just the 5th revision of this example. This folder contains scripts to help us build and train models using the WSJ data. One thing to note is that there are two data sets in "WSJ": "WSJ0" and "WSJ1," both distributed by the LDC.

In a nutshell, the corpora consist of a bunch of people reading articles out of the Wall Street Journal. If you don't have access to this data, it should be fine to read through the data preparation stage and try to create/acquire data of your own to follow the remaining sections with.

After you download these, make sure they are in an easily accessible location (not inside the Kaldi trunk). To tell the WSJ recipe where your data is, go into kaldi/egs/wsj/s5/run.sh, look for the lines that say wsj0=xxx and wsj1=xxx, and replace these with the correct paths for your data. At this point, you could probably just run ./run.sh and it would train and decode a monophone GMM-HMM model and a few different triphone models on its own. The next few sections go through how this works.
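For example (the corpus paths below are placeholders for wherever you put the data):

# In kaldi/egs/wsj/s5/run.sh -- example paths only; point these at your copies.
wsj0=/data/corpora/WSJ0
wsj1=/data/corpora/WSJ1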

Important Files

This section contains a list of some important Kaldi files that are referenced all over the place.

  • wsj/s5/path.sh: This file sets up the PATH variable in your terminal so you can call the Kaldi executables. You can see that many of the shell scripts source it as one of the first things they do. If you want to call the Kaldi executables from your terminal, you should cd to where it is located and source it, as shown below.
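    Concretely, something like this (assuming your checkout is at ~/git/kaldi):

    cd ~/git/kaldi/egs/wsj/s5
    . ./path.sh                # puts the Kaldi binaries on your PATH
    compute-mfcc-feats         # now resolves without a full path; prints usage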

Data Preparation

When you open up run.sh, you'll notice that the file is organized in terms of stages. Stage 0 is data preparation. If you look inside the if [ $stage -le 0 ]; then clause, you see that it's calling a bunch of scripts that are located in either local/* or utils/*. The way to think about this is that the stuff in the local folder is specific to the WSJ dataset and generally not portable to other datasets. The stuff in the utils/* folder, on the other hand, is more or less universal and useful regardless of the dataset you are working on.

There are 4 scripts that the WSJ run script calls before doing anything interesting:

  • local/wsj_data_prep.sh $wsj0/??-{?,??}.? $wsj1/??-{?,??}.?: This creates the following files:
    • data/local/data/train_si84.flist and data/local/data/train_si284.flist: These are "file list" files, which contain lists of files used for training. The si84 set contains speech from 84 speakers, while the si284 set contains data from 284 speakers
    • data/local/data/test_evalxxxx.flist, data/local/data/test_devxxxx.flist: These are meant for testing and evaluation
    • For each of the above sets, the following:
      • data/local/data/x_wav.scp: Locations of the wav files, indexed by utterance ID. Note that since the data that comes from the LDC is in sphere format, these entries are actually commands that produce the wav data on the fly (thanks to sph2pipe); see the sketch after this list
      • data/local/data/x.txt: The transcripts, one line per utterance (indexed by utterance ID)
      • data/local/data/x.utt2spk: Maps each utterance ID to its speaker ID
      • data/local/data/x.spk2utt: Maps each speaker ID to the list of that speaker's utterances
    • A bunch of files for the LM (TODO)
  • local/wsj_prepare_dict.sh --dict-suffix "_nosp"
    • Downloads cmudict and uses it to create data/local/dict_nosp/lexicon.txt. This file maps words to their pronunciations. Note that if you open up data/local/dict_nosp/nonsilence_phones.txt, you can see all the phones that are being used. The numbers after phones (e.g. IY1) correspond to stress levels.
  • utils/prepare_lang.sh data/local/dict_nosp "<SPOKEN_NOISE>" data/local/lang_tmp_nosp data/lang_nosp
    • This creates files inside data/lang_nosp and data/local/lang_tmp_nosp. The latter is just a temp directory. Inside the former, you'll find a few files:
      • L.fst and L_disambig.fst each encode the entire lexicon. (In this case, a "lexicon" is a mapping from sequences of phones to words).
      • phones.txt contains the inputs to the FST. Each phone comes in different "flavors": _B stands for beginning, _I stands for inside, _E stands for end, and _S stands for singular (when a phone is on its own). Looking at this file, we can see that we're going to model each phone differently depending on its stress and where it appears in the word.
      • words.txt contains the outputs to the FST
      • topo contains the Bakis HMM structures for each phone. Note that you'll find a section for phones 1-15 and a section for phones 16+. The former models different types of silence phones and the latter models actual speech phones.
  • local/wsj_format_data.sh --lang-suffix "_nosp"
    • Just take a look inside this file; it's less scary. It creates a bunch of different kinds of language models and puts the data in some convenient directories
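To make these outputs concrete, here is roughly what one line of each of the main files looks like (the IDs, paths, and transcript below are made up; the formats are standard Kaldi):

# x_wav.scp: utterance ID -> command producing the wav stream
011c0201 sph2pipe -f wav /path/to/wsj0/011c0201.wv1 |
# x.txt: utterance ID -> transcript
011c0201 THIS TRANSCRIPT IS A MADE UP EXAMPLE
# x.utt2spk: utterance ID -> speaker ID
011c0201 011
# x.spk2utt: speaker ID -> all utterances from that speaker
011 011c0201 011c0202 011c0203
# lexicon.txt: word -> phone sequence
ABOUT AH0 B AW1 T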

Feature Extraction

The next thing that run.sh calls is this loop:

for x in test_eval92 test_eval93 test_dev93 train_si284; do
  # Changed from 20 -> 36
  steps/make_mfcc.sh --cmd "$train_cmd" --nj 36 data/$x || exit 1;
  steps/compute_cmvn_stats.sh data/$x || exit 1;
done
  • steps/make_mfcc.sh: Computes MFCC features for each of the data directories. If you look inside this script, you'll find a call to compute-mfcc-feats, which is the Kaldi binary in charge of doing this. Assuming you have sourced path.sh, you can just run compute-mfcc-feats at your terminal, and it will print out a list of options you can give it, as well as how to call it. Since there are a ton of options here, we generally find it easier to create a file called mfcc.conf and put all of our options inside there. This file would look something like this:
    --sample-frequency=8000 
    --frame-length=25 
    --low-freq=20 
    --high-freq=3700 
    --num-ceps=20 
    --snip-edges=false
    
    In fact the call inside make_mfcc.sh looks like compute-mfcc-feats ... --config=$mfcc_config .... One thing to note is that compute-mfcc-feats does not add delta or delta-delta features; this will be done later by add-deltas. A sketch of the underlying calls follows this list.
  • steps/compute_cmvn_stats.sh: Computes cepstral mean and variance normalization (CMVN) statistics, creating data/$x/data/cmvn_$x.{ark,scp}.
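Under the hood, these two scripts boil down to calls roughly like the following (the archive names and paths here are illustrative, not the exact ones the scripts generate):

# Compute raw MFCCs for one split, driven by the config file above.
compute-mfcc-feats --config=conf/mfcc.conf \
    scp:data/train_si284/wav.scp \
    ark,scp:mfcc/raw_mfcc_train_si284.ark,mfcc/raw_mfcc_train_si284.scp

# Accumulate per-speaker CMVN statistics over those features.
compute-cmvn-stats --spk2utt=ark:data/train_si284/spk2utt \
    scp:data/train_si284/feats.scp \
    ark,scp:mfcc/cmvn_train_si284.ark,mfcc/cmvn_train_si284.scp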