@nickovchinnikov
Last active June 10, 2024 08:24
FastPitch
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let $x = (x_1, \\dots, x_n)$ be the sequence of input lexical units, and $y = (y_1, \\dots, y_t)$ be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $h = \\text{FFTr}(x)$. The hidden representation $h$ is used to predict the duration and average pitch of every character with a 1-D CNN:\n",
"\n",
"$$\\hat{d} = \\text{DurationPredictor}(h), \\quad \\hat{p} = \\text{PitchPredictor}(h)$$\n",
"\n",
"where $\\hat{d} \\in \\mathbb{N}^n$ and $\\hat{p} \\in \\mathbb{R}^n$. Next, the pitch is projected to match the dimensionality of the hidden representation $h \\in \\mathbb{R}^{n \\times d}$ and added to $h$. The resulting sum $g$ is discretely up-sampled and passed to the output FFTr, which produces the output mel-spectrogram sequence:\n",
"\n",
"$$g = h + \\text{PitchEmbedding}(p)$$\n",
"\n",
"$$\\hat{y} = \\text{FFTr}([\\underbrace{g_1, \\dots, g_1}_{d_1}, \\dots, \\underbrace{g_n, \\dots, g_n}_{d_n}])$$\n",
"\n",
"Ground truth $p$ and $d$ are used during training, and predicted $\\hat{p}$ and $\\hat{d}$ are used during inference. The model optimizes the mean-squared error (MSE) between the predicted and ground-truth modalities:\n",
"\n",
"$$\\mathcal{L} = \\lVert \\hat{y} - y \\rVert^2_2 + \\alpha \\lVert \\hat{p} - p \\rVert^2_2 + \\gamma \\lVert \\hat{d} - d \\rVert^2_2$$"
]
}
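,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The discrete up-sampling of $g$ by the durations $d$, and the combined MSE objective, can be sketched in plain Python. This is a minimal illustration of the two operations, not the FastPitch implementation: the names `length_regulate` and `fastpitch_loss` and the default weights `alpha`/`gamma` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def length_regulate(g, d):\n",
"    # Repeat each hidden vector g[i] exactly d[i] times along the time axis,\n",
"    # matching [g_1, ..., g_1, ..., g_n, ..., g_n] in the equation above.\n",
"    return [frame for frame, dur in zip(g, d) for _ in range(dur)]\n",
"\n",
"def fastpitch_loss(y_hat, y, p_hat, p, d_hat, d, alpha=0.1, gamma=0.1):\n",
"    # Weighted sum of the three squared-error terms from the loss above;\n",
"    # alpha and gamma are placeholder weights, not values from the paper.\n",
"    sq = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))\n",
"    return sq(y_hat, y) + alpha * sq(p_hat, p) + gamma * sq(d_hat, d)\n",
"\n",
"# Three lexical units with durations 2, 1, 3 expand to 6 output frames.\n",
"g = [\"g1\", \"g2\", \"g3\"]\n",
"d = [2, 1, 3]\n",
"print(length_regulate(g, d))  # ['g1', 'g1', 'g2', 'g3', 'g3', 'g3']\n"
]
}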
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}