{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "espnet2_tts_demo",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/kan-bayashi/0a4911719ca04e73f378b45c335c2dc4/espnet2_tts_demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SMSw_r1uRm4a"
},
"source": [
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MuhqhYSToxl7"
},
"source": [
"# ESPnet2-TTS realtime demonstration\n",
"\n",
"This notebook provides a demonstration of the realtime E2E-TTS using ESPnet2-TTS and ParallelWaveGAN (+ MelGAN).\n",
"\n",
"- ESPnet2-TTS: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1\n",
"- ParallelWaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN\n",
"\n",
"Author: Tomoki Hayashi ([@kan-bayashi](https://github.com/kan-bayashi))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9e_i_gdgAFNJ"
},
"source": [
"## Installation"
]
},
{
"cell_type": "code",
"metadata": {
"id": "fjJ5zkyaoy29",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "e8d432f6-96a8-431e-c808-b39162e266ab"
},
"source": [
"# NOTE: pip shows imcompatible errors due to preinstalled libraries but you do not need to care\n",
"!pip install -q espnet==0.9.7 parallel_wavegan==0.4.8\n",
"!pip install -q espnet_model_zoo"
],
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": [
"\u001b[K |████████████████████████████████| 727kB 8.3MB/s \n",
"\u001b[K |████████████████████████████████| 51kB 8.4MB/s \n",
"\u001b[K |████████████████████████████████| 13.1MB 241kB/s \n",
"\u001b[K |████████████████████████████████| 1.0MB 54.8MB/s \n",
"\u001b[K |████████████████████████████████| 174kB 49.2MB/s \n",
"\u001b[K |████████████████████████████████| 317kB 55.1MB/s \n",
"\u001b[K |████████████████████████████████| 225kB 56.5MB/s \n",
"\u001b[K |████████████████████████████████| 61kB 9.6MB/s \n",
"\u001b[K |████████████████████████████████| 51kB 8.1MB/s \n",
"\u001b[K |████████████████████████████████| 92kB 12.8MB/s \n",
"\u001b[K |████████████████████████████████| 1.4MB 53.1MB/s \n",
"\u001b[K |████████████████████████████████| 645kB 39.4MB/s \n",
"\u001b[K |████████████████████████████████| 2.0MB 50.3MB/s \n",
"\u001b[K |████████████████████████████████| 1.3MB 39.9MB/s \n",
"\u001b[K |████████████████████████████████| 245kB 58.7MB/s \n",
"\u001b[K |████████████████████████████████| 3.1MB 51.9MB/s \n",
"\u001b[K |████████████████████████████████| 133kB 59.5MB/s \n",
"\u001b[K |████████████████████████████████| 102kB 13.7MB/s \n",
"\u001b[K |████████████████████████████████| 163kB 56.4MB/s \n",
"\u001b[K |████████████████████████████████| 184kB 58.8MB/s \n",
"\u001b[K |████████████████████████████████| 71kB 12.0MB/s \n",
"\u001b[?25h Building wheel for parallel-wavegan (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for kaldiio (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for pyworld (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for ctc-segmentation (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for configargparse (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for nltk (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for jaconv (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for pathtools (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for subprocess32 (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for distance (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"\u001b[31mERROR: plotnine 0.6.0 has requirement matplotlib>=3.1.1, but you'll have matplotlib 3.1.0 which is incompatible.\u001b[0m\n",
"\u001b[31mERROR: mizani 0.6.0 has requirement matplotlib>=3.1.1, but you'll have matplotlib 3.1.0 which is incompatible.\u001b[0m\n",
"\u001b[31mERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.\u001b[0m\n"
],
"name": "stdout"
}
]
},
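{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (not part of the original demo): the cell below simply imports the installed packages and prints their versions, so you can confirm the installation before moving on."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sanity check (a minimal sketch, assuming the pip install above succeeded):\n",
"# import the installed packages and print their versions.\n",
"import espnet\n",
"import parallel_wavegan\n",
"print(\"espnet version:\", espnet.__version__)\n",
"print(\"parallel_wavegan version:\", parallel_wavegan.__version__)"
],
"execution_count": null,
"outputs": []
},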
{
"cell_type": "markdown",
"metadata": {
"id": "zhDsW_dYnp2N"
},
"source": [
"### (Optional)\n",
"\n",
"If you want to try Japanese TTS, please run the following cell to install pyopenjtalk."
]
},
{
"cell_type": "code",
"metadata": {
"id": "sDAWw-Upnbpn"
},
"source": [
"!mkdir tools && cd tools && git clone https://github.com/r9y9/hts_engine_API.git\n",
"!mkdir -p tools/hts_engine_API/src/build && cd tools/hts_engine_API/src/build && \\\n",
" cmake -DCMAKE_INSTALL_PREFIX=../.. .. && make -j && make install\n",
"!cd tools && git clone https://github.com/r9y9/open_jtalk.git\n",
"!mkdir -p tools/open_jtalk/src/build && cd tools/open_jtalk/src/build && \\\n",
" cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON \\\n",
" -DHTS_ENGINE_LIB=../../../hts_engine_API/lib \\\n",
" -DHTS_ENGINE_INCLUDE_DIR=../../../hts_engine_API/include .. && \\\n",
" make install\n",
"!cp tools/open_jtalk/src/build/*.so* /usr/lib64-nvidia\n",
"!cd tools && git clone https://github.com/r9y9/pyopenjtalk.git\n",
"!cd tools/pyopenjtalk && pip install ."
],
"execution_count": null,
"outputs": []
},
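{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sanity check for the optional Japanese setup (not part of the original demo): it assumes the build in the previous cell finished without errors and uses `pyopenjtalk.g2p` to convert a short Japanese sentence to phonemes."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Sanity check (a sketch, assuming the pyopenjtalk build above succeeded):\n",
"# convert a short Japanese sentence into phonemes.\n",
"import pyopenjtalk\n",
"print(pyopenjtalk.g2p(\"こんにちは\"))"
],
"execution_count": null,
"outputs": []
},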
{
"cell_type": "markdown",
"metadata": {
"id": "BYLn3bL-qQjN"
},
"source": [
"## Single speaker model demo"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "as4iFXid0m4f"
},
"source": [
"### Model Selection\n",
"\n",
"Please select models by comment out.\n",
"\n",
"English, Japanese, and Mandarin are supported.\n",
"\n",
"You can try Tacotron2, FastSpeech, and FastSpeech2 as the text2mel model. \n",
"And you can use Parallel WaveGAN and Multi-band MelGAN as the vocoder model."
]
},
{
"cell_type": "code",
"metadata": {
"id": "GQ4ra5DcwwGI"
},
"source": [
"###################################\n",
"# ENGLISH MODELS #\n",
"###################################\n",
"fs, lang = 22050, \"English\"\n",
"tag = \"kan-bayashi/ljspeech_tacotron2\"\n",
"# tag = \"kan-bayashi/ljspeech_fastspeech\"\n",
"# tag = \"kan-bayashi/ljspeech_fastspeech2\"\n",
"# tag = \"kan-bayashi/ljspeech_conformer_fastspeech2\"\n",
"vocoder_tag = \"ljspeech_parallel_wavegan.v1\"\n",
"# vocoder_tag = \"ljspeech_full_band_melgan.v2\"\n",
"# vocoder_tag = \"ljspeech_multi_band_melgan.v2\""
],
"execution_count": 2,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "g9S-SFPe0z0w"
},
"source": [
"### Model Setup"
]
},
{
"cell_type": "code",
"metadata": {
"id": "z64fD2UgjJ6Q",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "e73e8e71-5afa-4dce-d7bf-68e52450f063"
},
"source": [
"import time\n",
"import torch\n",
"from espnet_model_zoo.downloader import ModelDownloader\n",
"from espnet2.bin.tts_inference import Text2Speech\n",
"from parallel_wavegan.utils import download_pretrained_model\n",
"from parallel_wavegan.utils import load_model\n",
"d = ModelDownloader()\n",
"text2speech = Text2Speech(\n",
" **d.download_and_unpack(tag),\n",
" device=\"cuda\",\n",
" # Only for Tacotron 2\n",
" threshold=0.5,\n",
" minlenratio=0.0,\n",
" maxlenratio=10.0,\n",
" use_att_constraint=False,\n",
" backward_window=1,\n",
" forward_window=3,\n",
" # Only for FastSpeech & FastSpeech2\n",
" speed_control_alpha=1.0,\n",
")\n",
"text2speech.spc2wav = None # Disable griffin-lim\n",
"# NOTE: Sometimes download is failed due to \"Permission denied\". That is \n",
"# the limitation of google drive. Please retry after serveral hours.\n",
"vocoder = load_model(download_pretrained_model(vocoder_tag)).to(\"cuda\").eval()\n",
"vocoder.remove_weight_norm()"
],
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package averaged_perceptron_tagger to\n",
"[nltk_data] /root/nltk_data...\n",
"[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.\n",
"[nltk_data] Downloading package cmudict to /root/nltk_data...\n",
"[nltk_data] Unzipping corpora/cmudict.zip.\n",
"https://zenodo.org/record/3989498/files/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.best.zip?download=1: 100%|██████████| 102M/102M [00:10<00:00, 9.73MB/s]\n",
"Downloading...\n",
"From: https://drive.google.com/uc?id=1PdZv37JhAQH6AwNh31QlqruqrvjTBq7U\n",
"To: /root/.cache/parallel_wavegan/ljspeech_parallel_wavegan.v1.tar.gz\n",
"15.9MB [00:00, 164MB/s]\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FMaT0Zev021a"
},
"source": [
"### Synthesis"
]
},
{
"cell_type": "code",
"metadata": {
"id": "vrRM57hhgtHy",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "7ef66eab-c8f4-4c11-a2f3-aaf499109148"
},
"source": [
"import numpy as np\n",
"from argparse import Namespace\n",
"from espnet.utils.deterministic_utils import set_deterministic_pytorch\n",
"\n",
"# decide the input sentence by yourself\n",
"print(f\"Input your favorite sentence in {lang}.\")\n",
"x = input()\n",
"\n",
"set_deterministic_pytorch(Namespace(seed=1, debugmode=1))\n",
"\n",
"# synthesis\n",
"with torch.no_grad():\n",
" start = time.time()\n",
" _, c, *_ = text2speech(x)\n",
" wav = vocoder.inference(c)\n",
"rtf = (time.time() - start) / (len(wav) / fs)\n",
"print(f\"RTF = {rtf:5f}\")\n",
"\n",
"set_deterministic_pytorch(Namespace(seed=1, debugmode=1))\n",
"\n",
"# synthesis\n",
"with torch.no_grad():\n",
" start = time.time()\n",
" _, c, *_ = text2speech(x)\n",
" wav2 = vocoder.inference(c)\n",
"rtf = (time.time() - start) / (len(wav) / fs)\n",
"print(f\"RTF = {rtf:5f}\")\n",
"\n",
"np.testing.assert_array_equal(wav.cpu().numpy(), wav2.cpu().numpy())"
],
"execution_count": 13,
"outputs": [
{
"output_type": "stream",
"text": [
"Input your favorite sentence in English.\n",
"This is a test.\n",
"RTF = 0.238290\n",
"RTF = 0.220165\n"
],
"name": "stdout"
}
]
}
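,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Listen to the generated speech (optional)\n",
"\n",
"A minimal sketch (not part of the original demo) for playing back the synthesized waveform in the notebook. It assumes the `wav` tensor and the sampling rate `fs` defined in the cells above."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Playback sketch: assumes `wav` (a torch tensor on the GPU) and `fs` are\n",
"# defined by the synthesis cell above.\n",
"from IPython.display import Audio, display\n",
"\n",
"display(Audio(wav.view(-1).cpu().numpy(), rate=fs))"
],
"execution_count": null,
"outputs": []
}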
]
}