@nickovchinnikov
Last active June 10, 2024 08:24
FastPitch
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let $x = (x_1, \\dots, x_n)$ be the sequence of input lexical units, and $y = (y_1, \\dots, y_t)$ be the sequence of target mel-scale spectrogram frames. The first FFTr stack produces the hidden representation $h = \\text{FFTr}(x)$. The hidden representation $h$ is used to predict the duration and average pitch of every character with a 1-D CNN:\n",
"\n",
"$$\\hat{d} = \\text{DurationPredictor}(h), \\quad \\hat{p} = \\text{PitchPredictor}(h)$$\n",
"\n",
"where $\\hat{d} \\in \\mathbb{N}^n$ and $\\hat{p} \\in \\mathbb{R}^n$. Next, the pitch is projected to match the dimensionality of the hidden representation $h \\in \\mathbb{R}^{n \\times d}$ and added to $h$. The resulting sum $g$ is discretely up-sampled and passed to the output FFTr, which produces the output mel-spectrogram sequence:\n",
"\n",
"$$g = h + \\text{PitchEmbedding}(p)$$\n",
"\n",
"$$\\hat{y} = \\text{FFTr}([\\underbrace{g_1, \\dots, g_1}_{d_1}, \\dots, \\underbrace{g_n, \\dots, g_n}_{d_n}])$$\n",
"\n",
"Ground truth $p$ and $d$ are used during training, and predicted $\\hat{p}$ and $\\hat{d}$ are used during inference. The model optimizes the mean-squared error (MSE) between the predicted and ground-truth modalities:\n",
"\n",
"$$\\mathcal{L} = \\lVert \\hat{y} - y \\rVert^2_2 + \\alpha \\lVert \\hat{p} - p \\rVert^2_2 + \\gamma \\lVert \\hat{d} - d \\rVert^2_2$$"
]
}
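,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The discrete up-sampling of $g$ by the durations $d$, and the combined MSE objective, can be sketched in plain Python. This is a minimal illustration of the two operations, not the FastPitch implementation: the names `length_regulate` and `fastpitch_loss` and the default weights `alpha`/`gamma` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def length_regulate(g, d):\n",
"    # Repeat each hidden vector g[i] exactly d[i] times along the time axis,\n",
"    # matching [g_1, ..., g_1, ..., g_n, ..., g_n] in the equation above.\n",
"    return [frame for frame, dur in zip(g, d) for _ in range(dur)]\n",
"\n",
"def fastpitch_loss(y_hat, y, p_hat, p, d_hat, d, alpha=0.1, gamma=0.1):\n",
"    # Weighted sum of the three squared-error terms from the loss above;\n",
"    # alpha and gamma are placeholder weights, not values from the paper.\n",
"    sq = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))\n",
"    return sq(y_hat, y) + alpha * sq(p_hat, p) + gamma * sq(d_hat, d)\n",
"\n",
"# Three lexical units with durations 2, 1, 3 expand to 6 output frames.\n",
"g = [\"g1\", \"g2\", \"g3\"]\n",
"d = [2, 1, 3]\n",
"print(length_regulate(g, d))  # ['g1', 'g1', 'g2', 'g3', 'g3', 'g3']\n"
]
}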
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}