10/2/2018
Use Kaldi on WSJ data to train and decode with traditional HMM-GMM monophone model, then a triphone HMM-GMM model, and then a simple HMM-DNN model.
- GMM & E-M Algorithm: Gaussian Mixture Models and the EM Algorithm: https://people.csail.mit.edu/rameshvs/content/gmm-em.pdf
- Jurafsky & Martin (Chapters 6, 7, and 9): http://stp.lingfil.uu.se/~santinim/ml/2014/JurafskyMartinSpeechAndLanguageProcessing2ed_draft%202007.pdf
- HMM-GMM Acoustic Models for Speech Recognition: http://www1.icsi.berkeley.edu/~arlo/publications/faria_cs281a_proj.pdf
- WFSTs: Speech Recognition with Weighted Finite-State Transducers: https://cs.nyu.edu/~mohri/pub/hbka.pdf
- The Application of Hidden Markov Models in Speech Recognition: https://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf (This is pretty dense, but I think it's the most useful one on this list.)
- General CART Trees: Classification and Regression Trees: https://www.stat.wisc.edu/~loh/treeprogs/guide/wires11.pdf
- Decision Trees: Decision Tree-Based State Tying for Acoustic Modeling: https://www.cse.iitb.ac.in/~pjyothi/cs753/TiedstateHMMs.pdf
- Getting set up
- Important files
- Data Preparation (This is somewhat WSJ-specific)
- Training a GMM-HMM monophone model
- Decoding with your newly-trained monophone model
- Training a GMM-HMM triphone model
- Decoding with your triphone model
- Training a DNN-HMM model
- Decoding with your DNN-HMM model.
- Common problems
You're going to want to pull the Kaldi trunk and compile Kaldi first. Note that the build instructions will tell you to run the configure script before running make. configure sets up variables and other configuration options that you should know how to control. Around line 93 there is the following comment:
Following environment variables can be used to override the default toolchain.
CXX C++ compiler [default=g++]
AR Archive maintenance utility [default=ar]
AS Assembler [default=as]
RANLIB Archive indexing utility [default=ranlib]
You can change the $CXX environment variable before running configure to choose which C++ compiler to use. This will come in handy when we try to set up CUDA. If you have a GPU and want to use it to speed up neural network computation, you should make sure you have the correct drivers installed, nvidia-smi is working, and you have the CUDA toolkit installed (see https://developer.nvidia.com/cuda-toolkit-archive). Please read all the way through this section to figure out which version of the toolkit to download and install. If you end up installing a version that is too new and you need to downgrade and you don't really know what you're doing, it's a massive headache (I basically ended up reinstalling Ubuntu after going through this).
If you want to compile Kaldi without CUDA, you can skip the next paragraph and run ./configure --use-cuda=no instead of ./configure.
To figure out which version of CUDA to install, open the configure script again and find the line that says function configure_cuda {. Note that the version of CUDA you install must be compatible with the g++ version on your machine (and this is exactly why the $CXX environment variable I mentioned above comes in handy). You can use that to get to a configuration that works for you.
At this point you should be able to:
saurav@mrp-dev:~/git/kaldi/tools$ make -j
...
saurav@mrp-dev:~/git/kaldi/src$ ./configure
...
saurav@mrp-dev:~/git/kaldi/src$ make clean -j; make depend -j; make -j;
Hopefully at this point, Kaldi should have compiled, and you should see a bunch of executable binaries in src/. Most (all?) of these executables are located in the src/*bin folders. None of these binaries are in your path yet, but we'll get to that later. You can call them with no arguments to figure out what they do:
saurav@mrp-dev:~/git/kaldi/src/featbin$ ./compute-mfcc-feats
./compute-mfcc-feats
Create MFCC feature files.
Usage: compute-mfcc-feats [options...] <wav-rspecifier> <feats-wspecifier>
Options:
--allow-downsample : If true, allow the input waveform to have a higher frequency than the specified --sample-frequency (and we'll downsample). (bool, default = false)
--blackman-coeff : Constant coefficient for generalized Blackman window. (float, default = 0.42)
--cepstral-lifter : Constant that controls scaling of MFCCs (float, default = 22)
...etc...
One thing to note is that most (all?) of these binaries print out the command they were called with as the first thing in their output, which can be really useful for debugging.
Now let's go into the kaldi/egs/wsj/s5 folder. egs stands for "examples" and I think s5 is just the 5th revision of this example. This folder contains scripts to help us build and train models using the WSJ data. One thing to note is that there are two data sets in "WSJ": "WSJ0" and "WSJ1." You can read about them here:
In a nutshell, the corpora consist of a bunch of people reading articles out of the Wall Street Journal. If you don't have access to this data, it should be fine to read through the data preparation stage and try to create/acquire data of your own to follow the remaining sections with.
After you download these, make sure they are in an easily accessible location (not inside the Kaldi trunk). To tell the WSJ recipe where your data is, go into kaldi/egs/wsj/s5/run.sh, look for the lines that say wsj0=xxx and wsj1=xxx, and replace these with the correct paths for your data. At this point, you can probably run the ./run.sh script and it will actually train and decode a monophone GMM-HMM model and a few different triphone models on its own. The next few sections go through how this works.
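For instance, those lines might end up looking like this (the paths below are placeholders for wherever your LDC copies live; LDC93S6B and LDC94S13B are the catalog numbers for WSJ0 and WSJ1):

```shell
# Inside kaldi/egs/wsj/s5/run.sh -- point these at your local copies:
wsj0=/mnt/corpora/LDC93S6B
wsj1=/mnt/corpora/LDC94S13B
```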
This section contains a list of some important Kaldi files that are referenced all over the place.
wsj/s5/path.sh: This file sets up the PATH variable in your terminal so you can call the Kaldi executables. You can see that many of the shell scripts source this as one of the first things they do. If you want to call the Kaldi executables from your terminal, you should cd to where it is located and source it.
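For reference, a minimal sketch of what path.sh does (the real file lists many more bin directories, and KALDI_ROOT here is just an assumed checkout location):

```shell
# Prepend the Kaldi binary directories to PATH so the executables resolve.
export KALDI_ROOT=$HOME/git/kaldi
export PATH=$KALDI_ROOT/src/bin:$KALDI_ROOT/src/featbin:$KALDI_ROOT/src/gmmbin:$KALDI_ROOT/tools/openfst/bin:$PATH
# Kaldi scripts assume the C locale so that sort order is plain byte order.
export LC_ALL=C
```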
When you open up run.sh, you'll notice that the file is organized in terms of stages. Stage 0 is data preparation. If you look inside the if [ $stage -le 0 ]; then clause, you see that it calls a bunch of scripts that are located in either local/* or utils/*. The way to think about this is that the stuff in the local folder is specific to the WSJ dataset and generally not portable to other datasets. The stuff in the utils/* folder, on the other hand, is more or less universally applicable and useful regardless of the dataset you are working on.
There are 4 scripts that the WSJ run script calls before doing anything interesting:
local/wsj_data_prep.sh $wsj0/??-{?,??}.? $wsj1/??-{?,??}.?: This creates the following files:
- data/local/data/train_si84.flist and data/local/data/train_si284.flist: These are "file list" files, which contain lists of files used for training. The si84 set contains speech from 84 speakers, while the si284 set contains data from 284 speakers.
- data/local/data/test_evalxxxx.flist, data/local/data/test_devxxxx.flist: These are meant for testing and evaluation.
- One of each of the following for each of the above sets:
  - data/local/data/x_wav.scp: Locations of the wav files, indexed by utterance. Note that since the data that comes from the LDC is in sphere format, we have commands to call to get the wav files (thanks to sph2pipe).
  - data/local/data/x.txt: The transcripts, one line per utterance (indexed by utterance ID).
  - data/local/data/x.utt2spk: Maps each utterance ID to its speaker ID.
  - data/local/data/x.spk2utt: The inverse: maps each speaker ID to its utterance IDs.
- A bunch of files for the LM (TODO)
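To make the utt2spk and spk2utt mappings concrete, here is a tiny made-up example (the IDs are invented; Kaldi's convention is that an utterance ID starts with its speaker ID, so sorted order groups speakers together). Kaldi ships utils/utt2spk_to_spk2utt.pl to do the inversion; the awk one-liner below is just for illustration:

```shell
# Hypothetical utt2spk: <utterance-id> <speaker-id>, one pair per line.
cat > utt2spk <<'EOF'
011c0201 011
011c0202 011
012c0301 012
EOF

# spk2utt is the inverse: <speaker-id> <utt-id> <utt-id> ...
awk '{utts[$2] = utts[$2] " " $1} END {for (s in utts) print s utts[s]}' utt2spk | sort
```

This prints one line per speaker: `011 011c0201 011c0202` and `012 012c0301`.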
local/wsj_prepare_dict.sh --dict-suffix "_nosp": Downloads cmudict and uses it to create data/local/dict_nosp/lexicon.txt. This file maps words to their pronunciations. Note that if you open up data/local/dict_nosp/nonsilence_phones.txt, you can see all the phones that are being used. The numbers after phones (e.g. IY1) correspond to stress levels.
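A few illustrative lexicon entries, written out here as a toy file (the pronunciations are real cmudict ones, but check your generated lexicon.txt for the exact contents, which also include special entries like <SPOKEN_NOISE>):

```shell
# Write a tiny illustrative lexicon: <word> <phone> <phone> ...
cat > lexicon_example.txt <<'EOF'
ABANDON  AH0 B AE1 N D AH0 N
JOURNAL  JH ER1 N AH0 L
STREET   S T R IY1 T
EOF
cat lexicon_example.txt
```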
utils/prepare_lang.sh data/local/dict_nosp "<SPOKEN_NOISE>" data/local/lang_tmp_nosp data/lang_nosp: This creates files inside data/lang_nosp and data/local/lang_tmp_nosp. The latter is just a temp directory. Inside the former, you'll find a few files:
- L.fst and L_disambig.fst each encode the entire lexicon. (In this case, a "lexicon" is a mapping from sequences of phones to words.)
- phones.txt contains the inputs to the FST. Each phone comes in different "flavors": _B stands for beginning, _I stands for inside, _E stands for end, and _S stands for singular (when a phone is on its own). Looking at this file, we can see that we're going to model each phone differently depending on its stress and where it appears in the word.
- words.txt contains the outputs to the FST.
- topo contains the Bakis HMM structure for each phone. Note that you'll find a section for phones 1-15 and a section for phones 16+. The former models different types of silence phones and the latter models actual speech phones.
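Putting stress and word position together, phones.txt entries look something like the toy file below: one modeled unit per (phone, stress, position) combination, each mapped to an integer symbol ID (the IDs here are made up; check your generated file for the real ones):

```shell
# Illustrative phones.txt-style entries: "<phone> <integer-id>".
cat > phones_example.txt <<'EOF'
AA0_B 10
AA0_I 11
AA0_E 12
AA0_S 13
AA1_B 14
EOF
cat phones_example.txt
```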
local/wsj_format_data.sh --lang-suffix "_nosp": Just take a look inside this file; it's less scary. It creates a bunch of different kinds of language models and puts the data in some convenient directories.
The next thing that run.sh calls is this loop:
for x in test_eval92 test_eval93 test_dev93 train_si284; do
# Changed from 20 -> 36
steps/make_mfcc.sh --cmd "$train_cmd" --nj 36 data/$x || exit 1;
steps/compute_cmvn_stats.sh data/$x || exit 1;
done
steps/make_mfcc.sh: Computes MFCC features for each of the data directories. If you look inside this script, you'll find a call to compute-mfcc-feats, which is the Kaldi binary in charge of doing this. Assuming you have sourced path.sh, you can just run compute-mfcc-feats at your terminal, and it will print out a list of options you can give it, as well as how to call it. Since there are a ton of options here, we generally find it easier to create a file called mfcc.conf and put all of our options in there. This file would look something like this:
--sample-frequency=8000
--frame-length=25
--low-freq=20
--high-freq=3700
--num-ceps=20
--snip-edges=false
In fact, the call inside make_mfcc.sh looks like compute-mfcc-feats ... --config=$mfcc_config .... One thing to note is that compute-mfcc-feats does not add delta or delta-delta features. This will be done later by add-deltas.
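As a toy illustration of what add-deltas computes: the standard delta formula d_t = sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2) with a window of N=2 and edge frames replicated. The sketch below runs it on a single cepstral dimension as plain text (real Kaldi features are binary archives, and Kaldi also stacks delta-deltas on top by applying the same operation again):

```shell
# Standard delta computation (window N=2, edge frames replicated) on a
# single cepstral dimension given as one value per line.
printf '1\n2\n3\n4\n5\n' | awk '
{ c[NR] = $1 }
END {
  norm = 2 * (1*1 + 2*2)   # 2 * sum(n^2) = 10
  for (t = 1; t <= NR; t++) {
    d = 0
    for (n = 1; n <= 2; n++) {
      hi = t + n; if (hi > NR) hi = NR
      lo = t - n; if (lo < 1)  lo = 1
      d += n * (c[hi] - c[lo])
    }
    printf "%.2f\n", d / norm
  }
}'
```

On this ramp the deltas come out symmetric (0.50, 0.80, 1.00, 0.80, 0.50) because of the edge replication.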
steps/compute_cmvn_stats.sh: Creates data/$x/data/cmvn_$x.{ark,scp}.
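A toy sketch of the mean-subtraction half of CMVN (compute_cmvn_stats.sh accumulates the statistics; apply-cmvn applies them later): compute the per-dimension mean over all frames, then subtract it from every frame. The input below is a 3-frame, 2-dimensional "feature matrix" as plain text, just to show the arithmetic; real Kaldi features are binary archives:

```shell
# Per-dimension mean subtraction over a tiny text feature matrix.
printf '1.0 2.0\n3.0 4.0\n5.0 6.0\n' | awk '
{ for (i = 1; i <= NF; i++) { sum[i] += $i; row[NR, i] = $i } }
END {
  for (r = 1; r <= NR; r++) {
    for (i = 1; i <= NF; i++)
      printf "%s%.1f", (i > 1 ? " " : ""), row[r, i] - sum[i] / NR
    print ""
  }
}'
```

The column means (3.0 and 4.0) are removed, so the output rows are -2.0 -2.0, 0.0 0.0, and 2.0 2.0.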