@aroraakshit
Last active April 29, 2024 09:37
Team vani

ICASSP 2023 LIMMITS CHALLENGE SUBMISSION FOR VANI: VERY-LIGHTWEIGHT ACCENT-CONTROLLABLE TTS FOR NATIVE AND NON-NATIVE SPEAKERS WITH IDENTITY PRESERVATION

This gist includes:

Dataset summary

Filelists for all tracks available here: https://drive.google.com/drive/folders/1u4Yri7KQmk8EiVNMIVJdZdvAVVNoiyb8?usp=sharing

  • For track 1: please refer to the parallel folder
  • For track 2: please refer to the nonparallel folder
  • For track 3: please refer to the parallel folder

HuggingFace ASR model checkpoints used for data preprocessing:

  • Hindi - Harveenchadha/vakyansh-wav2vec2-hindi-him-4200
  • Marathi - tanmaylaud/wav2vec2-large-xlsr-hindi-marathi
  • Telugu - Harveenchadha/vakyansh-wav2vec2-telugu-tem-100
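
| Speaker | Original: files | Original: hours | CER threshold | Nonparallel train: files | Nonparallel train: hours | Nonparallel val: files | Nonparallel val: hours | Parallel train: files | Parallel train: hours | Parallel val: files | Parallel val: hours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Hindi Female | 16512 | 40.28 | 0.1159 | 6149 | 12.93 | 83 | 0.17 | 2222 | 4.95 | 100 | 0.17 |
| Hindi Male | 17798 | 40.48 | 0.0893 | 6824 | 14.43 | 76 | 0.17 | 2226 | 4.97 | 100 | 0.18 |
| Marathi Female | 17874 | 42.83 | 0.1733 | 3568 | 7.41 | 36 | 0.08 | 2078 | 4.51 | 90 | 0.19 |
| Marathi Male | 16747 | 41.2 | 0.1846 | 6098 | 13.12 | 70 | 0.17 | 2110 | 4.51 | 88 | 0.18 |
| Telugu Female | 15933 | 41.08 | 0.1655 | 5654 | 11.2 | 60 | 0.14 | 1533 | 3.6 | 53 | 0.12 |
| Telugu Male | 16939 | 42.03 | 0.187 | 5983 | 12.8 | 76 | 0.17 | 2068 | 4.43 | 82 | 0.17 |
| Total | 101803 | 247.9 | 0.9156 | 34276 | 71.89 | 401 | 0.9 | 12237 | 26.97 | 513 | 1.01 |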
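
The CER threshold column suggests a per-speaker filtering step in which each utterance is transcribed by the ASR checkpoint for its language and the transcript is compared against the reference text. The gist does not spell out the exact rule, so the sketch below assumes an utterance is kept when its character error rate (CER) is at or below the speaker's threshold. The function names and the toy records are hypothetical; in practice the hypotheses would come from the wav2vec2 checkpoints listed above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def filter_by_cer(records, threshold):
    """Keep utterance ids whose CER is at or below the threshold.

    `records` is an iterable of (utt_id, reference_text, asr_hypothesis).
    The "keep if CER <= threshold" rule is an assumption, not taken from the gist.
    """
    return [utt_id for utt_id, ref, hyp in records if cer(ref, hyp) <= threshold]


# Toy records (hypothetical); 0.1159 is the Hindi Female threshold from the table.
records = [
    ("utt1", "namaste duniya", "namaste duniya"),
    ("utt2", "namaste duniya", "namaste dunia"),
    ("utt3", "namaste duniya", "namste dunya"),
]
kept = filter_by_cer(records, threshold=0.1159)
print(kept)
```

Because Python strings iterate over code points, combining vowel signs in Devanagari or Telugu text count as separate characters here; a different tokenization would shift the CER values.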

Model Parameter summary

Parameter counts for the components of VANI (used for Tracks 2 and 3):

| Component | Parameters |
|---|---|
| accent_embedding weight | 12 |
| attention key_proj | 447,800 |
| attention query_proj | 57,920 |
| context_lstm | 439,560 |
| decoder flows | 1,391,424 |
| duration predictor | 289,377 |
| embedding weight | 106,240 |
| encoder | 1,424,020 |
| energy predictor | 169,313 |
| f0 predictor | 572,001 |
| speaker_embedding weight | 24 |
| unvoiced_bias_module 0 | 261 |
| voiced prediction | 71,025 |
| Total | 4,968,977 |

Parameter counts for the components of RADMMM (used for Track 1):

| Component | Parameters |
|---|---|
| accent_embedding weight | 24 |
| attention | 1,764,640 |
| context_lstm | 6,716,160 |
| duration predictor | 217,537 |
| embedding weight | 212,480 |
| encoder | 5,687,240 |
| energy predictor | 217,537 |
| f0 predictor | 397,761 |
| flows | 212,450,992 |
| speaker_embedding weight | 144 |
| voiced predictor | 120,810 |
| Total | 227,785,325 |
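
As a sanity check on the two parameter summaries, the snippet below re-adds the per-component counts and compares the two model sizes. The dictionary keys and values are copied from the lists above; the variable names are only illustrative.

```python
# Per-component parameter counts, copied from the summaries above.
VANI = {
    "accent_embedding weight": 12,
    "attention key_proj": 447_800,
    "attention query_proj": 57_920,
    "context_lstm": 439_560,
    "decoder flows": 1_391_424,
    "duration predictor": 289_377,
    "embedding weight": 106_240,
    "encoder": 1_424_020,
    "energy predictor": 169_313,
    "f0 predictor": 572_001,
    "speaker_embedding weight": 24,
    "unvoiced_bias_module 0": 261,
    "voiced prediction": 71_025,
}
RADMMM = {
    "accent_embedding weight": 24,
    "attention": 1_764_640,
    "context_lstm": 6_716_160,
    "duration predictor": 217_537,
    "embedding weight": 212_480,
    "encoder": 5_687_240,
    "energy predictor": 217_537,
    "f0 predictor": 397_761,
    "flows": 212_450_992,
    "speaker_embedding weight": 144,
    "voiced predictor": 120_810,
}

vani_total = sum(VANI.values())
radmmm_total = sum(RADMMM.values())
size_ratio = radmmm_total / vani_total          # how many times larger RADMMM is
flows_share = RADMMM["flows"] / radmmm_total    # fraction of RADMMM held by the flows

print(f"VANI total:   {vani_total:,}")
print(f"RADMMM total: {radmmm_total:,}")
print(f"RADMMM / VANI: {size_ratio:.1f}x, flows share of RADMMM: {flows_share:.1%}")
```

Both sums match the reported totals. On these numbers RADMMM is roughly 46 times larger than VANI, and its flows account for about 93% of its parameters.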

API usage

Example curl commands, one per track:

curl \
  -F "spk=hi_f" \
  -F "lang=mr" \
  -F "text=गुलाबी ओठांसाठी ताज्या लाल गुलाबाच्या पाकळ्या वाटून मध आणि लोण्यात मिसळून लावावे." \
  http://52.14.182.205:12400/track_1 -o "track_1_hi_f_mr.wav"
curl \
  -F "spk=hi_f" \
  -F "lang=mr" \
  -F "text=गुलाबी ओठांसाठी ताज्या लाल गुलाबाच्या पाकळ्या वाटून मध आणि लोण्यात मिसळून लावावे." \
  http://52.14.182.205:12400/track_2 -o "track_2_hi_f_mr.wav"
curl \
  -F "spk=hi_f" \
  -F "lang=mr" \
  -F "text=गुलाबी ओठांसाठी ताज्या लाल गुलाबाच्या पाकळ्या वाटून मध आणि लोण्यात मिसळून लावावे." \
  http://52.14.182.205:12400/track_3 -o "track_3_hi_f_mr.wav"
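
For scripted use, the same requests can be generated from Python. The sketch below assumes only what the curl examples show: a `/track_N` endpoint that accepts the form fields `spk`, `lang`, and `text` and returns audio. The helper names `build_curl_command` and `synthesize` are hypothetical, the server address is copied from the examples and may no longer be reachable, and each track is given its own output file so that one call does not overwrite another.

```python
import subprocess

# Address taken from the curl examples; availability is not guaranteed.
BASE_URL = "http://52.14.182.205:12400"


def build_curl_command(track, spk, lang, text, base_url=BASE_URL):
    """Return the curl argv for one synthesis request (hypothetical helper)."""
    if track not in (1, 2, 3):
        raise ValueError("track must be 1, 2, or 3")
    out_path = f"track_{track}_{spk}_{lang}.wav"
    return [
        "curl",
        "-F", f"spk={spk}",
        "-F", f"lang={lang}",
        # --form-string sends the value literally, so characters such as
        # '@', '<', or ';' in the text are not interpreted by curl.
        "--form-string", f"text={text}",
        f"{base_url}/track_{track}",
        "-o", out_path,
    ]


def synthesize(track, spk, lang, text):
    """Run the request with curl and return the output file path."""
    cmd = build_curl_command(track, spk, lang, text)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if curl fails
    return cmd[-1]


if __name__ == "__main__":
    sentence = "गुलाबी ओठांसाठी ताज्या लाल गुलाबाच्या पाकळ्या वाटून मध आणि लोण्यात मिसळून लावावे."
    for track in (1, 2, 3):
        # Print the commands instead of running them; call synthesize() to send them.
        print(" ".join(build_curl_command(track, "hi_f", "mr", sentence)))
```

Passing the arguments as a list avoids shell quoting issues with the Devanagari text. Only the `hi_f` speaker and `mr` language codes appear in the gist; any other codes would need to be confirmed against the service.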