This is a brief explanation of how to build a new model for ilive tts.
The input to the model generation process is a set of directories containing the files required for model generation. These directories are:
- `text`: dir contains a set of diacritized Arabic text sentences, each in a separate file.
- `wav`: dir contains a set of wav files, each representing the Arabic pronunciation of the corresponding sentence in the `text` dir.
- `lab`: dir contains the Arabic pronunciation timings for each speech segment, for each corresponding text and wav file.
- `language`: dir we create to hold the pre-processing outputs.
- `phonemizer`: dir contains the set of pronunciation rules for Arabic phonemes.
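Before starting, it can help to confirm that the input directories listed above are in place. The following is a minimal sketch, not part of the original procedure: the function name `check_inputs` is hypothetical, and it assumes it is run from the folder that holds the input directories. It creates `language` only when the other four directories exist.

```shell
# Hypothetical pre-flight check; run it from the folder that holds the inputs.
check_inputs() {
    missing=0
    # These four directories are expected to be delivered as input.
    for d in text wav lab phonemizer; do
        if [ ! -d "$d" ]; then
            echo "missing directory: $d" >&2
            missing=1
        fi
    done
    # language is the directory we create ourselves for pre-processing outputs.
    if [ "$missing" -eq 0 ]; then
        mkdir -p language
    fi
    return "$missing"
}
```

Calling `check_inputs && echo "inputs look complete"` prints the message only when nothing is missing.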
The input acquired from the linguistics team needs some pre-processing to generate the intermediate files needed while generating the model.
- Gather all unique words from all sentences into one file and put it in the `language` folder.
# enter text directory
cd text
# put all sentences in one file in language directory
awk 'FNR==1{print ""}1' *.txt > ../language/text_ar.txt
# enter language directory
cd ../language
# get unique words from all sentences and put them in one file
tr -s '[:space:]' '\n' < text_ar.txt | sort | uniq > unique_words.txt
# then manually remove space` ` and `,` if they exist in the unique words file
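The manual cleanup in the last comment could also be scripted. The sketch below is one possible reading of that step, under the assumption that "space" means blank or whitespace-only lines and that `,` means the ASCII comma wherever it appears. The function name `clean_unique_words` is hypothetical.

```shell
# Sketch: strip ASCII commas, drop blank lines, then re-deduplicate.
clean_unique_words() {
    # $1: path to the unique words file (rewritten via a temporary file)
    tmp_file=$(mktemp)
    tr -d ',' < "$1" | sed '/^[[:space:]]*$/d' | sort -u > "$tmp_file"
    mv "$tmp_file" "$1"
}
```

The final `sort -u` is there because removing a comma can turn two distinct entries into the same word. Usage: `clean_unique_words unique_words.txt`.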
- In the `lab` directory, in all files, replace all `SIL` and `SSIL` with `_`.
# enter lab directory
cd lab
# replace all SSIL with _ in all files
sed -i -- 's/SSIL/_/g' *
# replace all SIL with _ in all files
sed -i -- 's/SIL/_/g' *
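The order of the two replacements matters: `SSIL` contains `SIL`, so replacing `SIL` first would leave a stray `S_`. The sketch below shows both substitutions on a made-up sample line (the real lab file layout is not shown here, so the sample is only an assumption). The function name `normalize_silence` is hypothetical.

```shell
# Reads label text on stdin and writes the normalized text to stdout.
normalize_silence() {
    # SSIL is handled first so that SIL inside SSIL is not matched on its own.
    sed -e 's/SSIL/_/g' -e 's/SIL/_/g'
}

# Made-up sample line, for illustration only.
printf 'SIL a SSIL b SIL\n' | normalize_silence
# prints: _ a _ b _
```

After running the real commands, `grep -l 'SIL' *` in the `lab` directory lists any file that still contains a marker.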
- In project `phonemizer-continuous` resources [`src/main/resources/com/univox`], replace [`allophones.ar_SA.xml`, `ArabicPhonemesMap`, `ArabicScript`] with the files from the `phonemizer` directory, and also in both the jar file and the rar file inside it.
- In project `marytts`: in `marytts-runtime/src/main/resources` and `marytts-runtime/src/main/java/marytts/com/univox`, replace [`ArabicPhonemesMap`, `ArabicScript`, `allophones.ar_SA.xml`] with the files from the previous step.
- Put the unique words file in the base folder of project `phonemizer-continuous`.
- In project `phonemizer-continuous`: in `src/main/java/com/univox/PhonemizerMain.java`, make sure that the file name used is the same as your unique words file name in the base folder.
String filename="unique_words.txt";
- Run project `phonemizer-continuous` as a Java application. [this will take a few minutes]
- Move the output of the `phonemizer-continuous` project, named `unique_words.ph`, to the `language` directory and name it `ar.txt`.
- In the `language` folder, in the `ar.txt` file, replace all `__` with ` functional` (yes: with a leading space), then remove all the remaining `_` from the file.
# enter language folder
cd language
# replace "__" with " functional"
sed -i -- 's/__/ functional/g' ar.txt
# remove remaining '_' in the file
sed -i -- 's/_//g' ar.txt
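To see what the two substitutions above do to a single entry, here is a small sketch that applies them to a made-up line. The real layout of `unique_words.ph` is not shown in this document, so the sample `k_a_t__` is purely illustrative, and the function name `to_lexicon_entry` is hypothetical.

```shell
# Applies both substitutions to stdin, in the same order as the steps above.
to_lexicon_entry() {
    # "__" becomes " functional" first, then every remaining "_" is removed.
    sed -e 's/__/ functional/g' -e 's/_//g'
}

# Made-up sample entry, for illustration only.
printf 'k_a_t__\n' | to_lexicon_entry
# prints: kat functional
```

Running the removal before the `__` replacement would delete every underscore and leave nothing to mark as ` functional`, which is why the order is fixed.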
- In the `marytts` project, delete the `target` folder, then rebuild `marytts`.
# enter marytts directory
cd marytts
mvn -Dmaven.test.skip=true install
- In the `language` folder, run `transcription.sh`, which is produced by the `marytts` build in the previous step.
cd language
transcription.sh
This opens a GUI tool that requires a few steps:
- It asks for the `allophones.ar_SA.xml` file; select it from its location.
- Then from the `file` menu select `open`, and then select the `ar.txt` file from the `language` directory.
- Check all words; none should be in red. A word in red indicates an error.
- Click the `train and predict` button.
- From the `file` menu, select `save`. This saves a few files in the `language` directory.
- Close the GUI tool.
- In the `marytts` project, delete the `target` folder.
- In the `marytts` project, in directory `marytts-languages/marytts-lang-ar`, delete the `target` folder.
- In the `marytts` project, in directory `marytts-languages/marytts-lang-ar/src/main/resources/marytts/language/ar/lexicon`, replace the files in it with the output files from the `transcription.sh` tool.
- In the directory from the previous step, rename `allophones.ar_SA.xml` to `allophones.ar.xml`, and also remove the `_SA` from the `lang` tag inside the file.
- In project `marytts`: in dir `marytts-languages/marytts-lang-ar/lib/modules/ar/lexicon`, replace the two files [`allophones.ar.xml`, `ar`] with the modified `allophones.ar.xml` file from the previous step and the `ar.txt` file from the `language` directory after renaming it to `ar` only.
- In project `marytts`: in dir `marytts-languages/marytts-lang-ar`, test that everything is okay.
mvn test
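The earlier step that renames `allophones.ar_SA.xml` to `allophones.ar.xml` and removes `_SA` from the `lang` tag could be scripted roughly as below. This is a sketch under one assumption: the locale appears in the file only as the literal text `ar_SA`, so replacing that text with `ar` is enough. The function name `localize_allophones` is hypothetical.

```shell
# Writes allophones.ar.xml next to the original, then removes the original.
localize_allophones() {
    # $1: lexicon directory that contains allophones.ar_SA.xml
    # Assumption: every occurrence of ar_SA in the file is the locale.
    sed 's/ar_SA/ar/g' "$1/allophones.ar_SA.xml" > "$1/allophones.ar.xml" &&
        rm "$1/allophones.ar_SA.xml"
}
```

It is worth opening the resulting file once to confirm that nothing other than the `lang` value was changed.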
- In project `marytts`: in dir `user-dictionaries`, replace `userdict-ar.txt` with the output of the `phonemizer-continuous` project, which is `unique_words.ph`, after renaming it to `userdict-ar.txt`.
- Edit the file moved to `user-dictionaries`: remove `__` and replace `_` with `| ` (a pipe followed by a space).
sed -i -- 's/__//g' userdict-ar.txt
sed -i -- 's/_/| /g' userdict-ar.txt
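The copy, rename, and edit of the user dictionary can be combined into one step that leaves `unique_words.ph` untouched. The following is a minimal sketch; the function name `make_userdict` and its two arguments are hypothetical, and the substitutions are the same two shown above, in the same order.

```shell
# Builds userdict-ar.txt from the phonemizer output without editing the source.
make_userdict() {
    # $1: path to unique_words.ph
    # $2: the user-dictionaries directory
    # "__" is removed first, then each remaining "_" becomes "| ".
    sed -e 's/__//g' -e 's/_/| /g' "$1" > "$2/userdict-ar.txt"
}
```

Usage, assuming the phonemizer output is in the current directory: `make_userdict unique_words.ph ~/git/marytts/user-dictionaries`.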
- In the `marytts` project, delete the `target` folder, then build `marytts`.
# enter marytts directory
cd marytts
mvn -Dmaven.test.skip=true install
- Open the `marytts` project (univox workspace) in Eclipse and add build configurations for the server and client as follows.
- Run server: run configuration with `~/git/marytts` as mary base.
- Run DatabaseImportMain: run configuration with the db directory as the database base folder.