10/2/2018
Use Kaldi on WSJ data to train and decode with traditional HMM-GMM monophone model, then a triphone HMM-GMM model, and then a simple HMM-DNN model.
- GMM & E-M Algorithm: Gaussian Mixture Models and the EM Algorithm: https://people.csail.mit.edu/rameshvs/content/gmm-em.pdf
- Jurafsky & Martin (Chapters 6, 7, and 9): http://stp.lingfil.uu.se/~santinim/ml/2014/JurafskyMartinSpeechAndLanguageProcessing2ed_draft%202007.pdf
- HMM-GMM Acoustic Models for Speech Recognition: http://www1.icsi.berkeley.edu/~arlo/publications/faria_cs281a_proj.pdf
- WFSTs: Speech Recognition with Weighted Finite-State Transducers: https://cs.nyu.edu/~mohri/pub/hbka.pdf
- The Application of Hidden Markov Models in Speech Recognition: https://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf (This is pretty dense, but I think it's the most useful one on this list.)
- General CART Trees: Classification and Regression Trees: https://www.stat.wisc.edu/~loh/treeprogs/guide/wires11.pdf
- Decision Trees: Decision Tree-Based State Tying for Acoustic Modeling: https://www.cse.iitb.ac.in/~pjyothi/cs753/TiedstateHMMs.pdf
- Getting set up
- Important files
- Data Preparation (This is somewhat WSJ-specific)
- Training a GMM-HMM monophone model
- Decoding with your newly-trained monophone model
- Training a GMM-HMM triphone model
- Decoding with your triphone model
- Training a DNN-HMM model
- Decoding with your DNN-HMM model.
- Common problems
You're going to want to pull the Kaldi trunk and compile Kaldi first. Note that the build instructions will tell you to run the configure script before running make. configure sets up variables and other configuration options that you should know how to control. Around line 93 there is the following comment:
Following environment variables can be used to override the default toolchain.
CXX C++ compiler [default=g++]
AR Archive maintenance utility [default=ar]
AS Assembler [default=as]
RANLIB Archive indexing utility [default=ranlib]
You can change the $CXX environment variable before running configure to choose which C++ compiler to use. This will come in handy when we try to set up CUDA. If you have a GPU and want to use it to speed up neural network computation, you should make sure you have the correct drivers installed, nvidia-smi is working, and you have the CUDA toolkit installed (see https://developer.nvidia.com/cuda-toolkit-archive). Please read all the way through this section to figure out which version of the toolkit to download and install. If you end up installing a version that is too new and you need to downgrade and you don't really know what you're doing, it's a massive headache (I basically ended up reinstalling Ubuntu after going through this).
If you want to compile Kaldi without CUDA, you can skip the next paragraph and run ./configure --use-cuda=no instead of ./configure.
To figure out which version of CUDA to install, open the configure script again and find the line that says function configure_cuda {. Note that the version of CUDA you install must be compatible with the g++ version on your machine (and this is exactly why the $CXX environment variable I mentioned above comes in handy). You can use that to get to a configuration that works for you.
At this point you should be able to:
saurav@mrp-dev:~/git/kaldi/tools$ make -j
...
saurav@mrp-dev:~/git/kaldi/src$ ./configure
...
saurav@mrp-dev:~/git/kaldi/src$ make clean -j; make depend -j; make -j;
Hopefully at this point, Kaldi should have compiled, and you should see a bunch of executable binaries in src/. Most (all?) of these executables are located in the src/*bin folders. None of these binaries are in your path yet, but we'll get to that later. You can call them with no arguments to figure out what they do:
saurav@mrp-dev:~/git/kaldi/src/featbin$ ./compute-mfcc-feats
./compute-mfcc-feats
Create MFCC feature files.
Usage: compute-mfcc-feats [options...] <wav-rspecifier> <feats-wspecifier>
Options:
--allow-downsample : If true, allow the input waveform to have a higher frequency than the specified --sample-frequency (and we'll downsample). (bool, default = false)
--blackman-coeff : Constant coefficient for generalized Blackman window. (float, default = 0.42)
--cepstral-lifter : Constant that controls scaling of MFCCs (float, default = 22)
...etc...
One thing to note is that most (all?) of these binaries print out the command they were called with as the first thing in their output, which can be really useful for debugging.
Now let's go into the kaldi/egs/wsj/s5 folder. egs stands for "examples" and I think s5 is just the 5th revision of this example. This folder contains scripts to help us build and train models using the WSJ data. One thing to note is that there are two data sets in "WSJ": "WSJ0" and "WSJ1." You can read about them here:
In a nutshell, the corpora consist of a bunch of people reading articles out of the Wall Street Journal. If you don't have access to this data, it should be fine to read through the data preparation stage and try to create/acquire data of your own to follow the remaining sections with.
After you download these, make sure they are in an easily accessible location (not inside the Kaldi trunk). To tell the WSJ recipe where your data is, go into kaldi/egs/wsj/s5/run.sh, look for the lines that say wsj0=xxx and wsj1=xxx, and replace these with the correct paths for your data. At this point, you can probably run the ./run.sh script and it will actually train and decode a monophone GMM-HMM model and a few different triphone models on its own. The next few sections go through how this works.
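For instance, those lines might end up looking like this (the paths below are placeholders for wherever your LDC copies live; LDC93S6B and LDC94S13B are the catalog numbers for WSJ0 and WSJ1):

```shell
# Inside kaldi/egs/wsj/s5/run.sh -- point these at your local copies:
wsj0=/mnt/corpora/LDC93S6B
wsj1=/mnt/corpora/LDC94S13B
```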
This section contains a list of some important Kaldi files that are referenced all over the place.
wsj/s5/path.sh: This file sets up the PATH variable in your terminal so you can call the Kaldi executables. You can see that many of the shell scripts source this as one of the first things they do. If you want to call the Kaldi executables from your terminal, you should cd to where it is located and source it.
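For reference, a minimal sketch of what path.sh does (the real file lists many more bin directories, and KALDI_ROOT here is just an assumed checkout location):

```shell
# Prepend the Kaldi binary directories to PATH so the executables resolve.
export KALDI_ROOT=$HOME/git/kaldi
export PATH=$KALDI_ROOT/src/bin:$KALDI_ROOT/src/featbin:$KALDI_ROOT/src/gmmbin:$KALDI_ROOT/tools/openfst/bin:$PATH
# Kaldi scripts assume the C locale so that sort order is plain byte order.
export LC_ALL=C
```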
When you open up run.sh, you'll notice that the file is organized in terms of stages. Stage 0 is data preparation. If you look inside the if [ $stage -le 0 ]; then clause, you see that it calls a bunch of scripts that are located in either local/* or utils/*. The way to think about this is that the stuff in the local folder is specific to the WSJ dataset and generally not portable to other datasets. The stuff in the utils/* folder, on the other hand, is more or less universally applicable and useful regardless of the dataset you are working on.
There are 4 scripts that the WSJ run script calls before doing anything interesting:
local/wsj_data_prep.sh $wsj0/??-{?,??}.? $wsj1/??-{?,??}.?: This creates the following files:
- data/local/data/train_si84.flist and data/local/data/train_si284.flist: These are "file list" files, which contain lists of files used for training. The si84 set contains speech from 84 speakers, while the si284 set contains data from 284 speakers.
- data/local/data/test_evalxxxx.flist, data/local/data/test_devxxxx.flist: These are meant for testing and evaluation.
- One of each of the following for each of the above sets:
  - data/local/data/x_wav.scp: Locations of the wav files, indexed by utterance. Note that since the data that comes from the LDC is in sphere format, we have commands to call to get the wav files (thanks to sph2pipe).
  - data/local/data/x.txt: The transcripts, one line per utterance (indexed by utterance ID).
  - data/local/data/x.utt2spk: Maps each utterance ID to its speaker ID.
  - data/local/data/x.spk2utt: The inverse: maps each speaker ID to its utterance IDs.
- A bunch of files for the LM (TODO)
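To make the utt2spk and spk2utt mappings concrete, here is a tiny made-up example (the IDs are invented; Kaldi's convention is that an utterance ID starts with its speaker ID, so sorted order groups speakers together). Kaldi ships utils/utt2spk_to_spk2utt.pl to do the inversion; the awk one-liner below is just for illustration:

```shell
# Hypothetical utt2spk: <utterance-id> <speaker-id>, one pair per line.
cat > utt2spk <<'EOF'
011c0201 011
011c0202 011
012c0301 012
EOF

# spk2utt is the inverse: <speaker-id> <utt-id> <utt-id> ...
awk '{utts[$2] = utts[$2] " " $1} END {for (s in utts) print s utts[s]}' utt2spk | sort
```

This prints one line per speaker: `011 011c0201 011c0202` and `012 012c0301`.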
local/wsj_prepare_dict.sh --dict-suffix "_nosp": Downloads cmudict and uses it to create data/local/dict_nosp/lexicon.txt. This file maps words to their pronunciations. Note that if you open up data/local/dict_nosp/nonsilence_phones.txt, you can see all the phones that are being used. The numbers after phones (e.g. IY1) correspond to stress levels.
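A few illustrative lexicon entries, written out here as a toy file (the pronunciations are real cmudict ones, but check your generated lexicon.txt for the exact contents, which also include special entries like <SPOKEN_NOISE>):

```shell
# Write a tiny illustrative lexicon: <word> <phone> <phone> ...
cat > lexicon_example.txt <<'EOF'
ABANDON  AH0 B AE1 N D AH0 N
JOURNAL  JH ER1 N AH0 L
STREET   S T R IY1 T
EOF
cat lexicon_example.txt
```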
utils/prepare_lang.sh data/local/dict_nosp "<SPOKEN_NOISE>" data/local/lang_tmp_nosp data/lang_nosp: This creates files inside data/lang_nosp and data/local/lang_tmp_nosp. The latter is just a temp directory. Inside the former, you'll find a few files:
- L.fst and L_disambig.fst each encode the entire lexicon. (In this case, a "lexicon" is a mapping from sequences of phones to words.)
- phones.txt contains the inputs to the FST. Each phone comes in different "flavors": _B stands for beginning, _I stands for inside, _E stands for end, and _S stands for singular (when a phone is on its own). Looking at this file, we can see that we're going to model each phone differently depending on its stress and where it appears in the word.
- words.txt contains the outputs to the FST.
- topo contains the Bakis HMM structure for each phone. Note that you'll find a section for phones 1-15 and a section for phones 16+. The former models different types of silence phones and the latter models actual speech phones.
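Putting stress and word position together, phones.txt entries look something like the toy file below: one modeled unit per (phone, stress, position) combination, each mapped to an integer symbol ID (the IDs here are made up; check your generated file for the real ones):

```shell
# Illustrative phones.txt-style entries: "<phone> <integer-id>".
cat > phones_example.txt <<'EOF'
AA0_B 10
AA0_I 11
AA0_E 12
AA0_S 13
AA1_B 14
EOF
cat phones_example.txt
```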
local/wsj_format_data.sh --lang-suffix "_nosp": Just take a look inside this file; it's less scary. It creates a bunch of different kinds of language models and puts the data in some convenient directories.
The next thing that run.sh calls is this loop:
for x in test_eval92 test_eval93 test_dev93 train_si284; do
# Changed from 20 -> 36
steps/make_mfcc.sh --cmd "$train_cmd" --nj 36 data/$x || exit 1;
steps/compute_cmvn_stats.sh data/$x || exit 1;
done
steps/make_mfcc.sh: Computes MFCC features for each of the data directories. If you look inside this script, you'll find a call to compute-mfcc-feats, which is the Kaldi binary in charge of doing this. Assuming you have sourced path.sh, you can just run compute-mfcc-feats at your terminal, and it will print out a list of options you can give it, as well as how to call it. Since there are a ton of options here, we generally find it easier to create a file called mfcc.conf and put all of our options in there. This file would look something like this:
--sample-frequency=8000
--frame-length=25
--low-freq=20
--high-freq=3700
--num-ceps=20
--snip-edges=false
In fact, the call inside make_mfcc.sh looks like compute-mfcc-feats ... --config=$mfcc_config .... One thing to note is that compute-mfcc-feats does not add delta or delta-delta features. This will be done later by add-deltas.
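As a toy illustration of what add-deltas computes: the standard delta formula d_t = sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2) with a window of N=2 and edge frames replicated. The sketch below runs it on a single cepstral dimension as plain text (real Kaldi features are binary archives, and Kaldi also stacks delta-deltas on top by applying the same operation again):

```shell
# Standard delta computation (window N=2, edge frames replicated) on a
# single cepstral dimension given as one value per line.
printf '1\n2\n3\n4\n5\n' | awk '
{ c[NR] = $1 }
END {
  norm = 2 * (1*1 + 2*2)   # 2 * sum(n^2) = 10
  for (t = 1; t <= NR; t++) {
    d = 0
    for (n = 1; n <= 2; n++) {
      hi = t + n; if (hi > NR) hi = NR
      lo = t - n; if (lo < 1)  lo = 1
      d += n * (c[hi] - c[lo])
    }
    printf "%.2f\n", d / norm
  }
}'
```

On this ramp the deltas come out symmetric (0.50, 0.80, 1.00, 0.80, 0.50) because of the edge replication.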
steps/compute_cmvn_stats.sh: Creates data/$x/data/cmvn_$x.{ark,scp}.
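A toy sketch of the mean-subtraction half of CMVN (compute_cmvn_stats.sh accumulates the statistics; apply-cmvn applies them later): compute the per-dimension mean over all frames, then subtract it from every frame. The input below is a 3-frame, 2-dimensional "feature matrix" as plain text, just to show the arithmetic; real Kaldi features are binary archives:

```shell
# Per-dimension mean subtraction over a tiny text feature matrix.
printf '1.0 2.0\n3.0 4.0\n5.0 6.0\n' | awk '
{ for (i = 1; i <= NF; i++) { sum[i] += $i; row[NR, i] = $i } }
END {
  for (r = 1; r <= NR; r++) {
    for (i = 1; i <= NF; i++)
      printf "%s%.1f", (i > 1 ? " " : ""), row[r, i] - sum[i] / NR
    print ""
  }
}'
```

The column means (3.0 and 4.0) are removed, so the output rows are -2.0 -2.0, 0.0 0.0, and 2.0 2.0.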