@AdolfVonKleist
Last active August 29, 2015 14:25
#!/bin/bash
# Obtain the latest version of the CMU dict from GitHub
wget https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict -O cmudict.dict
# Strip the alternative-pronunciation markers such as "(2)" and any trailing
# "# comment" annotations (if present), then write one "word<TAB>phonemes"
# entry per line
perl -e 'while (<>) {
    chomp;
    s/\s+#.*$//;
    @_ = split (/\s+/);
    $w = shift (@_);
    $w =~ s/\([0-9]+\)//;
    print $w."\t".join (" ", @_)."\n";
}' cmudict.dict \
    > cmudict.formatted.txt
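# Optional sanity check before alignment. This is not part of the original
# recipe, only a minimal sketch. It assumes every line of the formatted file
# should be "word<TAB>phonemes", which is the layout the perl step writes.
# check_dict_format is a hypothetical helper name introduced for illustration.
check_dict_format() {
    # Print the number of lines that do not consist of exactly two
    # non-empty, tab-separated fields.
    awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { bad++ } END { print bad + 0 }' "$1"
}
if [ -f cmudict.formatted.txt ]; then
    bad_lines=$(check_dict_format cmudict.formatted.txt)
    echo "malformed lines in cmudict.formatted.txt: ${bad_lines}"
fi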
# Align the dictionary
phonetisaurus-align --input=cmudict.formatted.txt \
    --ofile=cmudict.formatted.corpus \
    --seq1_del=false
# Train an 8-gram model with mitlm (or whatever LM training toolkit you like)
estimate-ngram -o 8 -t cmudict.formatted.corpus -wl cmudict.formatted.o8.arpa
# Transform the ARPA model into an FST
phonetisaurus-arpa2wfst-omega --lm=cmudict.formatted.o8.arpa \
    --ofile=cmudict.formatted.o8.fst
# Test the model on a sample word, here 'testing'
phonetisaurus-rex.py --model cmudict.formatted.o8.fst \
    --word testing