ICASSP 2023 LIMMITS CHALLENGE SUBMISSION FOR VANI: VERY-LIGHTWEIGHT ACCENT-CONTROLLABLE TTS FOR NATIVE AND NON-NATIVE SPEAKERS WITH IDENTITY PRESERVATION
This gist includes:
Filelists for all tracks available here: https://drive.google.com/drive/folders/1u4Yri7KQmk8EiVNMIVJdZdvAVVNoiyb8?usp=sharing
- For track 1: please refer to the parallel folder
- For track 2: please refer to the nonparallel folder
- For track 3: please refer to the parallel folder
HuggingFace ASR model checkpoints used for data preprocessing:
- Hindi - Harveenchadha/vakyansh-wav2vec2-hindi-him-4200
- Marathi - tanmaylaud/wav2vec2-large-xlsr-hindi-marathi
- Telugu - Harveenchadha/vakyansh-wav2vec2-telugu-tem-100
speaker | original files | original hours | CER threshold | nonparallel train files | nonparallel train hours | nonparallel val files | nonparallel val hours | parallel train files | parallel train hours | parallel val files | parallel val hours |
Hindi Female | 16512 | 40.28 | 0.1159 | 6149 | 12.93 | 83 | 0.17 | 2222 | 4.95 | 100 | 0.17 |
Hindi Male | 17798 | 40.48 | 0.0893 | 6824 | 14.43 | 76 | 0.17 | 2226 | 4.97 | 100 | 0.18 |
Marathi Female | 17874 | 42.83 | 0.1733 | 3568 | 7.41 | 36 | 0.08 | 2078 | 4.51 | 90 | 0.19 |
Marathi Male | 16747 | 41.2 | 0.1846 | 6098 | 13.12 | 70 | 0.17 | 2110 | 4.51 | 88 | 0.18 |
Telugu Female | 15933 | 41.08 | 0.1655 | 5654 | 11.2 | 60 | 0.14 | 1533 | 3.6 | 53 | 0.12 |
Telugu Male | 16939 | 42.03 | 0.187 | 5983 | 12.8 | 76 | 0.17 | 2068 | 4.43 | 82 | 0.17 |
Total | 101803 | 247.9 | 0.9156 | 34276 | 71.89 | 401 | 0.9 | 12237 | 26.97 | 513 | 1.01 |
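The CER threshold column suggests that files were filtered by comparing ASR transcripts against the reference text. The gist does not describe the exact procedure, so the following is only a minimal sketch under stated assumptions: CER is taken as the character-level Levenshtein distance divided by the reference length, and a file is kept when its CER is at or below the speaker's threshold. The helper names (`edit_distance`, `cer`, `filter_by_cer`) and the sample tuples are hypothetical; in practice the hypothesis strings would come from the ASR checkpoints listed above.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalised by reference length.

    Note: Python strings are compared per Unicode code point, so combining
    marks in Indic scripts count as separate characters (an assumption).
    """
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def filter_by_cer(samples, threshold):
    """Keep paths whose CER is at or below the threshold (assumed rule).

    samples: iterable of (path, reference_text, asr_hypothesis) tuples.
    """
    return [path for path, ref, hyp in samples if cer(ref, hyp) <= threshold]


HINDI_FEMALE_THRESHOLD = 0.1159  # value from the table above

# Hypothetical samples: the second hypothesis drops a whole word.
samples = [
    ("a.wav", "नमस्ते दुनिया", "नमस्ते दुनिया"),
    ("b.wav", "नमस्ते दुनिया", "नमस्ते"),
]
kept = filter_by_cer(samples, HINDI_FEMALE_THRESHOLD)
print(kept)
```

Under these assumptions only the exactly matching transcript is kept. A stricter (lower) threshold keeps fewer files.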
Parameter count of different components of VANI (for Track 2 and 3):
accent_embedding weight | 12 |
attention key_proj | 447,800 |
attention query_proj | 57,920 |
context_lstm | 439,560 |
decoder flows | 1,391,424 |
duration predictor | 289,377 |
embedding weight | 106,240 |
encoder | 1,424,020 |
energy predictor | 169,313 |
f0 predictor | 572,001 |
speaker_embedding weight | 24 |
unvoiced_bias_module 0 | 261 |
voiced prediction | 71,025 |
Total | 4,968,977 |
Parameter count of different components of RADMMM (for Track 1):
accent_embedding weight | 24 |
attention | 1,764,640 |
context_lstm | 6,716,160 |
duration predictor | 217,537 |
embedding weight | 212,480 |
encoder | 5,687,240 |
energy predictor | 217,537 |
f0 predictor | 397,761 |
flows | 212,450,992 |
speaker_embedding weight | 144 |
voiced predictor | 120,810 |
Total | 227,785,325 |
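As a quick sanity check on the two parameter tables, the sketch below re-adds the listed component counts and compares the two models. The dictionary names are illustrative only; the numbers are copied from the tables above.

```python
# Component parameter counts copied from the VANI table (Tracks 2 and 3).
VANI = {
    "accent_embedding weight": 12,
    "attention key_proj": 447_800,
    "attention query_proj": 57_920,
    "context_lstm": 439_560,
    "decoder flows": 1_391_424,
    "duration predictor": 289_377,
    "embedding weight": 106_240,
    "encoder": 1_424_020,
    "energy predictor": 169_313,
    "f0 predictor": 572_001,
    "speaker_embedding weight": 24,
    "unvoiced_bias_module 0": 261,
    "voiced prediction": 71_025,
}

# Component parameter counts copied from the RADMMM table (Track 1).
RADMMM = {
    "accent_embedding weight": 24,
    "attention": 1_764_640,
    "context_lstm": 6_716_160,
    "duration predictor": 217_537,
    "embedding weight": 212_480,
    "encoder": 5_687_240,
    "energy predictor": 217_537,
    "f0 predictor": 397_761,
    "flows": 212_450_992,
    "speaker_embedding weight": 144,
    "voiced predictor": 120_810,
}

vani_total = sum(VANI.values())
radmmm_total = sum(RADMMM.values())
ratio = radmmm_total / vani_total            # how much larger RADMMM is
flow_share = RADMMM["flows"] / radmmm_total  # fraction of RADMMM in the flows

print(f"VANI total:   {vani_total:,}")
print(f"RADMMM total: {radmmm_total:,}")
print(f"RADMMM / VANI: {ratio:.1f}x, flows share of RADMMM: {flow_share:.1%}")
```

Both sums match the Total rows in the tables. RADMMM is roughly 46 times larger than VANI, and most of that difference sits in the flows.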
Example curl commands (one per track):
curl \
-F "spk=hi_f" \
-F "lang=mr" \
-F "text=गुलाबी ओठांसाठी ताज्या लाल गुलाबाच्या पाकळ्या वाटून मध आणि लोण्यात मिसळून लावावे." \
http://52.14.182.205:12400/track_1 -o "track_1_hi_f_mr.wav"
curl \
-F "spk=hi_f" \
-F "lang=mr" \
-F "text=गुलाबी ओठांसाठी ताज्या लाल गुलाबाच्या पाकळ्या वाटून मध आणि लोण्यात मिसळून लावावे." \
http://52.14.182.205:12400/track_2 -o "track_2_hi_f_mr.wav"
curl \
-F "spk=hi_f" \
-F "lang=mr" \
-F "text=गुलाबी ओठांसाठी ताज्या लाल गुलाबाच्या पाकळ्या वाटून मध आणि लोण्यात मिसळून लावावे." \
http://52.14.182.205:12400/track_3 -o "track_3_hi_f_mr.wav"
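For scripted use, the same requests can be issued from Python. The following is a rough sketch using only the standard library, not an official client. It assumes the endpoint accepts the same multipart form fields (`spk`, `lang`, `text`) as the curl examples, returns WAV bytes in the response body, and is still reachable at that address. The helper names `encode_multipart` and `synthesize` and the `timeout` parameter are hypothetical.

```python
import urllib.request
import uuid

BASE_URL = "http://52.14.182.205:12400"  # host and port from the curl examples


def encode_multipart(fields):
    """Encode plain text fields as multipart/form-data, like `curl -F`."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def synthesize(track, spk, lang, text, timeout=120):
    """POST one request to /track_<track> and save the response to a WAV file."""
    body, content_type = encode_multipart({"spk": spk, "lang": lang, "text": text})
    request = urllib.request.Request(
        f"{BASE_URL}/track_{track}",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    out_path = f"track_{track}_{spk}_{lang}.wav"  # distinct file per track
    with urllib.request.urlopen(request, timeout=timeout) as response:
        audio = response.read()  # assumed to be WAV bytes
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path


# Example call (needs network access; not run here):
# for track in (1, 2, 3):
#     synthesize(track, "hi_f", "mr", "गुलाबी ओठांसाठी ताज्या लाल गुलाबाच्या पाकळ्या ...")
```

Writing a separate output file per track keeps the three results from overwriting each other.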