@candlewill
Last active November 2, 2022 08:34
Analysis of the Merlin source code

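Acoustic feature extraction

This article describes how to extract the acoustic features used for Merlin training. In speech synthesis, this is the job of the vocoder.

Merlin can use two vocoders: STRAIGHT and WORLD. WORLD extracts 60-dim MGC, variable-dim BAP (BAP dim: 1 for 16 kHz, 5 for 48 kHz), and 1-dim LF0; STRAIGHT extracts 60-dim MGC, 25-dim BAP, and 1-dim LF0.

The newer WORLD_v2 is still under development; it aims to extract 60-dim MGC, 5-dim BAP, and 1-dim LF0 (the MGC and BAP dimensions can be tuned).

Because STRAIGHT is subject to strict licensing restrictions, this article focuses on WORLD.

Code

Code path: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/vocoder/world/extract_features_for_merlin.py

This script mainly calls the analysis tool from WORLD and the x2x tool from SPTK.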

Input

The script takes four arguments, as shown below:

python extract_features_for_merlin.py <path_to_merlin_dir> <path_to_wav_dir> <path_to_feat_dir> <sampling frequency>

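<path_to_merlin_dir>    Merlin installation path, used to locate the WORLD and SPTK tools
<path_to_wav_dir>       Directory containing the original audio
<path_to_feat_dir>      Directory where the extracted features are saved
<sampling frequency>    Sampling rate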

Output

Three directories are created under <path_to_feat_dir>:

|-- bap
|-- lf0
`-- mgc

Each directory stores one type of feature.
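As a rough illustration of how such feature files might be inspected afterwards, the sketch below reads one file into frames. It assumes the files are headerless little-endian 32-bit floats with a fixed number of values per frame (for example 60 for mgc and 1 for lf0 in the WORLD setup described above); that layout is an assumption, and read_feature_frames is a hypothetical helper, not part of Merlin.

```python
import array
import sys


def read_feature_frames(path, dim):
    """Read a raw float32 feature file and split it into frames of `dim` values.

    Assumption: the file has no header and stores little-endian float32 values.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    if len(data) % (4 * dim) != 0:
        raise ValueError("file size is not a whole number of %d-dim frames" % dim)
    values = array.array("f")
    values.frombytes(data)
    if sys.byteorder == "big":
        values.byteswap()  # the stored data is assumed to be little-endian
    return [list(values[i:i + dim]) for i in range(0, len(values), dim)]
```

Calling read_feature_frames(path, 60) on an mgc file would then return one 60-value list per frame, which makes it easy to check that the mgc, bap, and lf0 files of an utterance have the same number of frames.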

forced alignment

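The text labels were extracted with festival: after festival -b <scm_file>, dumpfeats, normalization, and related steps, we obtained normalized full context labels. This part describes how to use the HTK tools to align the full context labels with the wav files.

Note: Merlin provides alignment at two levels, state and phone. Because state-level alignment performs better, only state-level alignment is covered here.

A first look

The alignment scripts are located at: https://github.com/CSTR-Edinburgh/merlin/tree/master/misc/scripts/alignment/state_align

The directory structure is: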

├── binary_io.py
├── forced_alignment.py
├── htk_io.py
├── htkmfc.py
├── mean_variance_norm.py
├── prepare_labels_from_txt.sh
├── README.md
├── run_aligner.sh
└── setup.sh

Run it with:

python $aligner/forced_alignment.py

It takes no arguments. To change its settings, edit the script with sed, for example:

sed -i s#'HTKDIR =.*'#'HTKDIR = "'$HTKDIR'"'# $aligner/forced_alignment.py                       # HTK directory
sed -i s#'work_dir =.*'#'work_dir = "'$WorkDir/$lab_dir'"'# $aligner/forced_alignment.py         # working directory; contains a subdirectory "label_no_align" with the unaligned labels
sed -i s#'wav_dir =.*'#'wav_dir = "'$wav_dir'"'# $aligner/forced_alignment.py                    # directory containing the audio
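For readers who prefer not to patch the script with sed, the same substitution can be sketched in Python. This is only an illustration of what the sed expressions do (replace everything from `NAME =` to the end of the line with a quoted value); set_assignment is a hypothetical helper and the sample values are made up.

```python
import re


def set_assignment(source, name, value):
    """Replace the right-hand side of every `name = ...` line, like the sed calls."""
    pattern = re.compile(r"%s =.*" % re.escape(name))
    replacement = '%s = "%s"' % (name, value)
    # A function replacement keeps backslashes in `value` from being interpreted.
    return pattern.sub(lambda match: replacement, source)


# Example on an in-memory copy of a script (contents are illustrative only).
script = 'HTKDIR = path/to/tools/htk\nwav_dir = "."\n'
script = set_assignment(script, "HTKDIR", "/opt/htk/bin")
script = set_assignment(script, "wav_dir", "/data/wav")
```

Like the sed version, this matches the pattern anywhere in a line, so it relies on each setting name appearing in only one assignment.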

The unaligned labels have the following format and contain no time information (time steps):

$ cat label_no_align/arctic_b0001.lab

x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1
x^sil-g+ae=d@1_3/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1
sil^g-ae+d=d@2_2/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1
g^ae-d+d=uw@3_1/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1
.................. (remainder omitted) ..................

Output:

The aligned labels are written to the <work_dir>/<lab_align_dir> directory.

$ cat label_state_align/arctic_b0001.lab

0 50000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[2]
50000 100000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[3]
100000 300000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[4]
300000 1450000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[5]
1450000 1750000 x^x-sil+g=ae@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:1+1+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_1/G:0_0/H:x=x@1=1|0/I:7=5/J:7+5-1[6]
1750000 1800000 x^sil-g+ae=d@1_3/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1[2]
1800000 1850000 x^sil-g+ae=d@1_3/A:0_0_0/B:1-1-3@1-1&1-7#1-5$1-3!0-1;0-1|ae/C:1+1+2/D:0_0/E:content+1@1+5&1+4#0+1/F:content_1/G:0_0/H:7=5@1=1|L-L%/I:0=0/J:7+5-1[3]
.................. (remainder omitted) ..................

The full contents of both files are available: unaligned, aligned. The corresponding English text is: Gad, do I remember it.
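To make the state-aligned format concrete, here is a small sketch that groups the state lines of each phone and converts their durations to frames. It assumes times are in HTK units of 100 ns, a 5 ms frame shift, and that each phone starts at state [2] as in the sample above; state_durations is a hypothetical helper, not Merlin code.

```python
import re

FRAME_SHIFT = 50000  # 5 ms in HTK's 100 ns units (assumed frame shift)
STATE_LINE = re.compile(r"^(\d+)\s+(\d+)\s+(\S+)\[(\d+)\]$")


def state_durations(lines):
    """Group state-level lines by phone and return per-state durations in frames."""
    phones = []
    for line in lines:
        match = STATE_LINE.match(line.strip())
        if match is None:
            continue  # skip blank or non-matching lines
        start, end, label, state = match.groups()
        frames = (int(end) - int(start)) // FRAME_SHIFT
        # The first emitting state is numbered 2, so a new phone starts there.
        if state == "2" or not phones:
            phones.append((label, []))
        phones[-1][1].append(frames)
    return phones
```

Applied to the first five lines of the aligned sample, this yields the per-state frame counts 1, 1, 4, 23, and 6 for the initial silence.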

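How it works

The alignment above uses tools provided by HTK: HCompV, HCopy, HERest, HHEd, and HVite.

They are used in the order HCopy -> HCompV -> HERest -> HHEd -> HVite. Each tool is briefly described below.

Tool      Description
HCopy     Parameterizes the data, i.e. extracts features: converts wav speech files into feature files containing sequences of feature vectors
HCompV    Initializes the model parameters
HERest    Model training (parameter estimation)
HHEd      Model definition editor
HVite     Decoding (Viterbi algorithm)

The aligned labels are later used to train the duration model and the acoustic model, as described in the model training part.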

Merlin Source Code Analysis

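This part briefly walks through some of Merlin's source code, as a basis for studying Merlin in more depth.

genScmFile.py

Code path: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/frontend/utils/genScmFile.py

It converts text files into a standard format made up of three kinds of files: utt files, a scheme file, and an scp file. The utt output is initially an empty folder reserved for later steps; the scheme file records the mapping between each text and the utt file generated for it later; the scp file is a list of file ids (without extensions).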

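Input

 <in_txt_dir/in_txt_file>    Directory containing the original texts (each file ending in .txt), or a single text file
 <out_utt_dir>               Directory where the utt files will later be generated
 <out_scm_file>              The generated scm file
 <out_file_id_list>          The generated scp file

Output

  • Creates the empty folder <out_utt_dir>
  • Generates the scm file, with contents such as: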
(utt.save (utt.synth (Utterance Text "Hello world." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_001.utt")
(utt.save (utt.synth (Utterance Text "Hi, this is a demo voice from Merlin." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_002.utt")
(utt.save (utt.synth (Utterance Text "Hope you guys enjoy free open-source voices from Merlin." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_003.utt")
(utt.save (utt.synth (Utterance Text "I love you China." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_004.utt")
(utt.save (utt.synth (Utterance Text "Are you OK?" )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_005.utt")
(utt.save (utt.synth (Utterance Text "I am comming from China." )) "D:\Python_Programs\Merlin_Toolkit\egs_database\utt\test_006.utt")
  • Generates the scp file list, as shown below:
test_001
test_002
test_003
test_004
test_005
test_006

<in_txt_dir/in_txt_file> can be a directory of text files or a single file in the following format:

( arctic_a0001 "Author of the danger trail, Philip Steels, etc." )
( arctic_a0002 "Not at this particular case, Tom, apologized Whittemore." )
( arctic_a0003 "For the twentieth time that evening the two men shook hands." )
( arctic_a0004 "Lord, but I'm glad to see you again, Phil." )
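The conversion from this single-file format to the scm commands and the file id list can be sketched as follows. This is a simplified illustration based only on the input and output formats shown above, not the actual genScmFile.py; build_scm is a hypothetical name, and texts containing double quotes are not escaped here.

```python
import re

UTT_LINE = re.compile(r'^\(\s*(\S+)\s+"(.*)"\s*\)$')


def build_scm(lines, out_utt_dir):
    """Turn `( file_id "text" )` lines into Festival commands and a file id list."""
    commands, file_ids = [], []
    for line in lines:
        match = UTT_LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that are not in the expected format
        file_id, text = match.groups()
        utt_path = "%s/%s.utt" % (out_utt_dir.rstrip("/"), file_id)
        commands.append(
            '(utt.save (utt.synth (Utterance Text "%s" )) "%s")' % (text, utt_path)
        )
        file_ids.append(file_id)
    return commands, file_ids
```

Writing the returned commands to <out_scm_file> and the ids to <out_file_id_list>, one per line, gives files of the same shape as the samples above.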

festival

festival -b <scheme_file>

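Purpose: run festival in batch mode (no interaction) over the texts. <scheme_file> is the file produced in the previous step.

Result: utt files are generated at the paths specified in <scheme_file>.

As a front-end tool, festival analyzes the text. For example, processing the text Hello world. yields: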

EST_File utterance
DataType ascii
version 2
EST_Header_End
Features max_id 44 ; type Text ; iform "\"Hello world.\"" ; 
Stream_Items
1 id _1 ; name Hello ; whitespace "" ; prepunctuation "" ; 
2 id _2 ; name world ; punc . ; whitespace " " ; prepunctuation "" ; 
............ (n lines omitted) ............
End_of_Relation
Relation US_map ; ()
1 43 0 0 0 0
End_of_Relation
Relation Wave ; ()
1 44 0 0 0 0
End_of_Relation
End_of_Relations
End_of_Utterance

make_labels

Code path: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/frontend/festival_utt_to_lab/make_labels

Purpose

Extracts monophone labels and full context labels from the utt files.

Usage

make_labels <labels_dir> <utts_dir> <dumpfeats> <scripts>

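<labels_dir>      ## Directory for the newly generated labels
<utts_dir>        ## Directory containing the utt files
<dumpfeats>       ## Path to Festival's dumpfeats script; after installing festival it is usually at {FESTDIR}/examples/dumpfeats
<scripts>         ## Directory containing these scripts: extra_feats.scm label.feats label-full.awk  label-mono.awk 

Execution flow

  • Create two subdirectories, mono and full, in <labels_dir>
  • For each utt file in <utts_dir>:
    1. Obtain the basename with basename $utt .utt
    2. Call dumpfeats to extract the features: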
      dumpfeats	-eval		$scripts/extra_feats.scm \
      		-relation 	Segment \
      		-feats    	$scripts/label.feats \
      		-output   	$labels/tmp \
      		$utt
      
    3. Write the full and mono labels into their respective folders:
      gawk -f $scripts/label-full.awk $labels/tmp > $labels/full/$base.lab; \
      gawk -f $scripts/label-mono.awk $labels/tmp > $labels/mono/$base.lab; \
      
  • Remove the temporary file: rm -f tmp
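The flow above can be summarized in a short sketch that only builds the command lines for each utterance, without running anything. It mirrors the dumpfeats and gawk invocations listed in the steps; make_label_commands is a hypothetical helper, and the real make_labels is a shell script.

```python
import os


def make_label_commands(labels_dir, utt_files, dumpfeats, scripts_dir):
    """Return, per utterance, the command lines of one make_labels iteration.

    Nothing is executed; each entry is (dumpfeats argv, full-label shell
    command, mono-label shell command).
    """
    tmp = labels_dir + "/tmp"
    plan = []
    for utt in utt_files:
        base = os.path.basename(utt)
        if base.endswith(".utt"):
            base = base[:-len(".utt")]  # same effect as `basename $utt .utt`
        dump = [dumpfeats, "-eval", scripts_dir + "/extra_feats.scm",
                "-relation", "Segment",
                "-feats", scripts_dir + "/label.feats",
                "-output", tmp, utt]
        full = "gawk -f %s/label-full.awk %s > %s/full/%s.lab" % (
            scripts_dir, tmp, labels_dir, base)
        mono = "gawk -f %s/label-mono.awk %s > %s/mono/%s.lab" % (
            scripts_dir, tmp, labels_dir, base)
        plan.append((dump, full, mono))
    return plan
```

Because the temporary file is shared, the three commands for one utterance have to run in order before moving on to the next utterance.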

Notes

  • dumpfeats is a tool provided by festival for extracting features from utt files. Its usage is:
Usage: dumpfeats [options] <utt_file_0> <utt_file_1> ...
  
  Dump features from a set of utterances
  
  Options
  -relation  <string>
             Relation from which the features have to be dumped from
  -output    <string>
             If output parameter contains a %s its treated as a skeleton
             e.g feats/%s.feats and multiple files will be created one
             each utterance.  If output doesn't contain %s the output
             is treated as a single file and all features and dumped in it.
  -feats     <string>
             If argument starts with a "(" it is treated as a list of
             features to dump, otherwise it is treated as a filename whose
             contents contain a set of features (without parenetheses).
  -eval      <ifile>
             A scheme file to be loaded before dumping.  This may contain
             dump specific features etc.  If filename starts with a left
             parenthis it it evaluated as lisp.
  -from_file <ifile>
             A file with a list of utterance files names (used when there
             are a very large number of files.
  • gawk is a text-processing tool more powerful than sed. The -f option specifies the program file:
-f file		 Specifies a filename to read the program from 

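The programs themselves are in $scripts/label-full.awk and $scripts/label-mono.awk.

Example

Taking the utt generated earlier from the text Hello world. as an example, here is what make_labels produces.

Current directory: /root/workspace/merlin_projects/step_by_step, which contains the following files: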

root@de-3879-ng-2-123705-3173223045-0f7q9:~/workspace/merlin_projects/step_by_step# tree ./
|-- scm.scm
|-- test_001.utt
|-- test_002.utt
|-- test_003.utt
|-- test_004.utt
|-- test_005.utt
`-- test_006.utt

dumpfeats

dumpfeats=/root/workspace/Python_Programs/merlin/tools/festival/examples/dumpfeats
scripts=/root/workspace/Python_Programs/merlin/misc/scripts/frontend/festival_utt_to_lab
labels=.
utt=test_001.utt
$dumpfeats -eval $scripts/extra_feats.scm \
-relation Segment \
-feats $scripts/label.feats \
-output $labels/tmp \
$utt

After it runs, a new file named tmp is produced with the following contents:

0 pau hh 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 2 1 0 0 0 0 0 3 0 content 0 2 0 3 0 2 0 ax 0 0.22 
pau hh ax 0 0 0 1 0 0 1 0 3 1 0 0 2 0 2 0 2 0 1 0 1 ax 0 content content 0 2 1 0 2 0 1 0 1 0 3 0 0 2 0 0 L-L% 3 2 1 0 0 0 0 0 3 0 content 0 2 0 3 0 2 0 l 0.22 0.27795401 
hh ax l 1 0 0 1 0 0 1 0 3 1 0 0 2 0 2 0 2 0 1 0 1 ax 0 content content 0 2 1 0 2 0 1 0 1 0 3 0 0 2 0 0 L-L% 3 2 1 0 0 0 0 3 3 content content 2 2 3 3 2 2 pau ow 0.27795401 0.32017601 
ax l ow 2 0 0 1 0 0 1 0 3 1 0 0 2 0 2 0 2 0 1 0 1 ax 0 content content 0 2 1 0 2 0 1 0 1 0 3 0 0 2 0 0 L-L% 3 2 1 0 1 0 1 3 1 content content 2 2 3 3 2 2 hh w 0.32017601 0.39965901 
l ow w 0 0 1 1 0 1 1 3 1 4 1 1 1 0 1 0 1 0 1 0 1 ow 0 content content 0 2 1 0 2 0 1 0 1 0 3 0 0 2 0 0 L-L% 3 2 1 0 1 0 1 3 4 content content 2 1 3 3 2 2 ax er 0.39965901 0.55004603 
ow w er 0 1 1 0 1 1 0 1 4 0 0 2 0 1 0 1 0 1 0 1 0 er content content 0 2 1 0 1 1 1 0 1 0 0 3 0 0 2 0 0 L-L% 3 2 1 1 1 1 1 1 4 content content 2 1 3 3 2 2 l l 0.55004603 0.62555099 
w er l 1 1 1 0 1 1 0 1 4 0 0 2 0 1 0 1 0 1 0 1 0 er content content 0 2 1 0 1 1 1 0 1 0 0 3 0 0 2 0 0 L-L% 3 2 1 1 1 1 1 4 4 content content 1 1 3 3 2 2 ow d 0.62555099 0.72588098 
er l d 2 1 1 0 1 1 0 1 4 0 0 2 0 1 0 1 0 1 0 1 0 er content content 0 2 1 0 1 1 1 0 1 0 0 3 0 0 2 0 0 L-L% 3 2 1 1 1 1 1 4 4 content content 1 1 3 3 2 2 w pau 0.72588098 0.81338102 
l d pau 3 1 1 0 1 1 0 1 4 0 0 2 0 1 0 1 0 1 0 1 0 er content content 0 2 1 0 1 1 1 0 1 0 0 3 0 0 2 0 0 L-L% 3 2 1 1 0 1 0 4 0 content 0 1 0 3 0 2 0 er 0 0.81338102 0.88916397 
d pau 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 2 1 1 0 1 0 4 0 content 0 1 0 3 0 2 0 l 0 0.88916397 1.33796 

gawk

mkdir full
mkdir mono
base=test_001
gawk -f $scripts/label-full.awk $labels/tmp > $labels/full/$base.lab; \
gawk -f $scripts/label-mono.awk $labels/tmp > $labels/mono/$base.lab;

Directory structure after running:

|-- full
|   `-- test_001.lab
|-- mono
|   `-- test_001.lab
|-- scm.scm
|-- test_001.utt
|-- test_002.utt
|-- test_003.utt
|-- test_004.utt
|-- test_005.utt
|-- test_006.utt
`-- tmp

Contents of full/test_001.lab:

         0    2200000 x^x-pau+hh=ax@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_2/G:0_0/H:x=x@1=1|0/I:3=2/J:3+2-1
   2200000    2779540 x^pau-hh+ax=l@1_3/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   2779540    3201760 pau^hh-ax+l=ow@2_2/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   3201760    3996590 hh^ax-l+ow=w@3_1/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   3996590    5500460 ax^l-ow+w=er@1_1/A:0_0_3/B:1-1-1@2-1&2-2#1-2$1-2!0-1;0-1|ow/C:1+1+4/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   5500460    6255510 l^ow-w+er=l@1_4/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   6255510    7258810 ow^w-er+l=d@2_3/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   7258810    8133810 w^er-l+d=pau@3_2/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   8133810    8891640 er^l-d+pau=x@4_1/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
   8891640   13379600 l^d-pau+x=x@x_x/A:1_1_4/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:content_1/E:x+x@x+x&x+x#x+x/F:0_0/G:3_2/H:x=x@1=1|0/I:0=0/J:3+2-1

Contents of mono/test_001.lab:

         0    2200000 pau
   2200000    2779540 hh
   2779540    3201760 ax
   3201760    3996590 l
   3996590    5500460 ow
   5500460    6255510 w
   6255510    7258810 er
   7258810    8133810 l
   8133810    8891640 d
   8891640   13379600 pau
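The relationship between the three files can be checked with a small sketch. The last two columns of tmp are times in seconds, while the label files use HTK units of 100 ns, and the mono label is the centre phone of the full context label (the part between - and +). The helpers below are hypothetical and only mimic what the awk scripts appear to produce; the awk programs remain the authoritative implementation.

```python
import re

CENTRE_PHONE = re.compile(r"^[^\^]+\^[^-]+-([^+]+)\+")


def seconds_to_htk(seconds):
    """Convert seconds to HTK time units of 100 ns, rounding to the nearest unit."""
    return int(round(seconds * 1e7))


def full_to_mono(line):
    """Reduce one `start end full_context_label` line to `start end phone`."""
    start, end, label = line.split()
    match = CENTRE_PHONE.match(label)
    if match is None:
        raise ValueError("unexpected label format: %r" % label)
    return "%s %s %s" % (start, end, match.group(1))
```

For instance, the end time 0.27795401 in the second line of tmp corresponds to 2779540 in both label files, and the full label starting with x^pau-hh+ax=l reduces to hh.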

normalize_lab_for_merlin.py

Path: https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/frontend/utils/normalize_lab_for_merlin.py

Purpose

Normalizes the mono and full labels produced in the previous steps for use by Merlin.

According to CSTR-Edinburgh/merlin#156, the script does three main things:

  1. Normalize duration to nearest divisible number by 5. Say 1.413 -> 1.415
  2. Merge consecutive silence phones or pause phones to one.
  3. Get rid of timestamps if required -- input format for HTK alignment

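In other words:

  1. Round the time stamps to the nearest multiple of 5 ms
  2. Merge consecutive silences and pauses into one
  3. Drop the time stamps if required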
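A minimal sketch of these three steps is given below. It is a simplification, not the actual normalize_lab_for_merlin.py: it works on (start, end, phone) tuples instead of full context labels, assumes HTK time units of 100 ns so that 5 ms is 50000, and rounds ties upward, which may differ from the real script. It also does not rename pau to sil, which the sample output in the result section shows.

```python
SILENCE = {"sil", "pau"}
STEP = 50000  # 5 ms in HTK's 100 ns units


def round_time(t):
    """Round a time stamp to the nearest multiple of STEP."""
    return (t + STEP // 2) // STEP * STEP


def normalize(entries, write_time_stamps=True):
    """Normalize a list of (start, end, phone) tuples given in HTK time units."""
    merged = []
    for start, end, phone in entries:
        if merged and phone in SILENCE and merged[-1][2] in SILENCE:
            # Extend the previous silence instead of emitting a second one.
            merged[-1] = (merged[-1][0], end, merged[-1][2])
        else:
            merged.append((start, end, phone))
    lines = []
    for start, end, phone in merged:
        if write_time_stamps:
            lines.append("%d %d %s" % (round_time(start), round_time(end), phone))
        else:
            lines.append(phone)
    return lines
```

With this rounding, 2779540 becomes 2800000 and 13379600 becomes 13400000, matching the time stamps in the normalized output.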

Arguments

Usage: python normalize_lab_for_merlin.py <input_lab_dir> <output_lab_dir> <label_style> <file_id_list_scp> <optional: write_time_stamps (1/0)>

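<input_lab_dir>                          Directory containing the full labels
<output_lab_dir>                         Directory where the normalized labels are saved
<label_style>                            Alignment style; phone_align and state_align are supported
<file_id_list_scp>                       Path to the list of label files
<optional: write_time_stamps (1/0)>      Whether to write time stamps (optional, defaults to 1)

Note: the mono labels are not used in this step.
Note: for training data, set <label_style>=phone_align and <write_time_stamps>=0. Test data has no such requirement (recommended: <label_style>=state_align, <write_time_stamps>=1).

Result

The normalized labels are saved in <output_lab_dir> under the same file names as the originals. The contents look like this: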

0 2200000 x^x-sil+hh=ax@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+3/D:0_0/E:x+x@x+x&x+x#x+x/F:content_2/G:0_0/H:x=x@1=1|0/I:3=2/J:3+2-1
2200000 2800000 x^sil-hh+ax=l@1_3/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
2800000 3200000 sil^hh-ax+l=ow@2_2/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
3200000 4000000 hh^ax-l+ow=w@3_1/A:0_0_0/B:0-0-3@1-2&1-3#1-3$1-3!0-1;0-1|ax/C:1+1+1/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
4000000 5500000 ax^l-ow+w=er@1_1/A:0_0_3/B:1-1-1@2-1&2-2#1-2$1-2!0-1;0-1|ow/C:1+1+4/D:0_0/E:content+2@1+2&1+1#0+1/F:content_1/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
5500000 6250000 l^ow-w+er=l@1_4/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
6250000 7250000 ow^w-er+l=d@2_3/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
7250000 8150000 w^er-l+d=sil@3_2/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
8150000 8900000 er^l-d+sil=x@4_1/A:1_1_1/B:1-1-4@1-1&3-1#2-1$2-1!1-0;1-0|er/C:0+0+0/D:content_2/E:content+1@2+1&2+0#1+0/F:0_0/G:0_0/H:3=2@1=1|L-L%/I:0=0/J:3+2-1
8900000 13400000 l^d-sil+x=x@x_x/A:1_1_4/B:x-x-x@x-x&x-x#x-x$x-x!x-x;x-x|x/C:0+0+0/D:content_1/E:x+x@x+x&x+x#x+x/F:0_0/G:3_2/H:x=x@1=1|0/I:0=0/J:3+2-1

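The normalized labels are then fed to forced_alignment.py for alignment. How the alignment itself works is covered in the forced alignment part.

Model training

This part describes how to train the duration model and the acoustic model.

Configuration file

Training the duration model requires a configuration file (as does the acoustic model later). Usually it is enough to modify a sample configuration file; for example, the sample configuration for training a DNN model is duration_demo.conf.

Merlin calls these sample configuration files recipes. All recipes are available at: https://github.com/CSTR-Edinburgh/merlin/tree/master/misc/recipes

A configuration file mainly contains path information, the alignment style, the question set name, the model architecture, the data split, and the processes to run.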

run_merlin.py

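This is the program entry point. Path: https://github.com/CSTR-Edinburgh/merlin/blob/master/src/run_merlin.py

Execution

Depending on which sub-processes are enabled in the configuration file, execution proceeds differently.

No.  Code            Config key      Default  Description
1    GenTestList     GenTestList     False    Generate the test list
2    AcousticModel   AcousticModel   False    Acoustic model
3    NORMLAB         NORMLAB         False    Normalize the labels
4    MAKEDUR         MAKEDUR         False    Generate the output duration data
5    MAKECMP         MAKECMP         False    Generate the output acoustic data
6    NORMCMP         NORMCMP         False    Normalize the output acoustic data
7    TRAINDNN        TRAINDNN        False    Whether to train the model
8    GENBNFEA        GENBNFEA        False    Generate bottleneck features
9    DNNGEN          DNNGEN          False    Prediction
10   GENWAV          GENWAV          False    Generate wav audio
11   DurationModel   DurationModel   False    Duration model
12   CALMCD          CALMCD          False    Model evaluation

All of the flags above default to False, so a configuration file only needs to list the ones that are set to True.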
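The default-to-False behaviour can be illustrated with Python's standard configparser. This is a sketch of the idea, not Merlin's own configuration code; read_process_flags is a hypothetical helper, and the [Processes] section name used for the fragment is an assumption.

```python
import configparser

PROCESS_FLAGS = [
    "GenTestList", "AcousticModel", "NORMLAB", "MAKEDUR", "MAKECMP", "NORMCMP",
    "TRAINDNN", "GENBNFEA", "DNNGEN", "GENWAV", "DurationModel", "CALMCD",
]


def read_process_flags(config_text, section="Processes"):
    """Return {flag: bool} for every known flag, defaulting to False."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(config_text)
    flags = {}
    for name in PROCESS_FLAGS:
        # Flags that are absent from the file fall back to False.
        flags[name] = parser.getboolean(section, name, fallback=False)
    return flags


# Example fragment (section name assumed for illustration).
example = "[Processes]\nNORMLAB  : True\nDNNGEN   : True\n"
flags = read_process_flags(example)
```

Here flags reports True for NORMLAB and DNNGEN and False for every other process, which is why the configuration fragments only list the enabled steps.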

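The process settings in the configuration files for training the duration model, training the acoustic model, testing the duration model, and testing the acoustic model are shown below.

Training the duration model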

NORMLAB  : True
MAKEDUR  : True
MAKECMP  : True
NORMCMP  : True

TRAINDNN : True
DNNGEN   : True

CALMCD   : True

Training the acoustic model

NORMLAB  : True
MAKECMP  : True
NORMCMP  : True

TRAINDNN : True
DNNGEN   : True

GENWAV   : True
CALMCD   : True

Testing the duration model

NORMLAB: True

DNNGEN: True

Testing the acoustic model

NORMLAB  : True
DNNGEN   : True

GENWAV   : True

shartoo commented Sep 7, 2018

Very detailed write-up. Thank you very much for putting this together.

shartoo commented Sep 7, 2018

merlin-misc

shartoo commented Sep 7, 2018

Would you be interested in rewriting Merlin? I feel Merlin is written in an overly complicated and redundant way, and TensorFlow is not fully supported either. Theano is no longer being updated.
