{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Scaling in Neural Network Dropout Layers (with Pytorch code example)"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "For several times I have confused myself over how and why a dropout layer scales its input. I'm writing down some notes before I forget again. "
},
{
"metadata": {},
"cell_type": "markdown",
"source": "In [Pytorch doc](https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout) it says:\n\n> Furthermore, the outputs are scaled by a factor of `1/(1-p)` during training. This means that during evaluation the module simply computes an identity function."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "So how is this done and why? Let's look at some code in Pytorch."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "import torch\nimport numpy as np",
"execution_count": 277,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Create a dropout layer `m` with a dropout rate `p=0.4`: "
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "p = 0.4\nm = torch.nn.Dropout(p)",
"execution_count": 278,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As explained in [Pytorch doc](https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout):\n\n> During training, randomly zeroes some of the elements of the input tensor with probability `p` using samples from a Bernoulli distribution. The elements to zero are randomized on every forward call."
},
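{
"metadata": {},
"cell_type": "markdown",
"source": "As a quick illustration of that last point, we can pass the same input through the layer twice: since the Bernoulli mask is re-sampled on every forward call, the positions of the zeros will (almost surely) differ between the two outputs. The small all-ones tensor below is chosen only for illustration."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Sketch: the mask is re-sampled on every forward call,\n# so the zeroed positions (almost surely) differ between the two outputs.\nx = torch.ones(3, 4)\nprint(m(x))\nprint(m(x))",
"execution_count": null,
"outputs": []
},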
{
"metadata": {},
"cell_type": "markdown",
"source": "Put a random input through the dropout layer and confirm that ~40% (`p=0.4`) of the elements have become 0:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "nbig = 5000000\ninp = torch.rand(nbig, 10)\noutp = m(inp)\n\nprint(f'percent of zero elements in the output: {(outp==0).numpy().mean():.5f}, is close to p={p}')",
"execution_count": 279,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "percent of zero elements in the output: 0.40007, is close to p=0.4\n"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "----"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We now look at the scaling part:"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "> Furthermore, the outputs are scaled by a factor of `1/(1-p)` during training."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Create a smaller random input and put it through the dropout layer. Compare input and output:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "np.random.seed(42)\n\ninp = torch.rand(5, 4)\ninp",
"execution_count": 283,
"outputs": [
{
"data": {
"text/plain": "tensor([[0.6485, 0.3114, 0.1626, 0.1022],\n [0.7352, 0.4634, 0.8206, 0.4228],\n [0.0322, 0.9399, 0.9163, 0.4169],\n [0.2574, 0.0467, 0.2213, 0.6171],\n [0.4146, 0.2288, 0.0388, 0.7752]])"
},
"execution_count": 283,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "outp = m(inp)",
"execution_count": 284,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We can see below that he outputs are scaled by a factor of `1/(1-p)` during training, by comparing the non-zero elements in the two tensors below:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "inp/(1-p)",
"execution_count": 285,
"outputs": [
{
"data": {
"text/plain": "tensor([[1.0808, 0.5191, 0.2710, 0.1703],\n [1.2254, 0.7723, 1.3676, 0.7046],\n [0.0537, 1.5665, 1.5272, 0.6948],\n [0.4290, 0.0778, 0.3689, 1.0284],\n [0.6909, 0.3813, 0.0646, 1.2920]])"
},
"execution_count": 285,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "outp",
"execution_count": 286,
"outputs": [
{
"data": {
"text/plain": "tensor([[1.0808, 0.5191, 0.2710, 0.0000],\n [0.0000, 0.7723, 0.0000, 0.0000],\n [0.0000, 1.5665, 1.5272, 0.6948],\n [0.4290, 0.0778, 0.3689, 1.0284],\n [0.6909, 0.0000, 0.0646, 0.0000]])"
},
"execution_count": 286,
"metadata": {},
"output_type": "execute_result"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We can assert that observation in code:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "idx_nonzero = outp!=0\nassert np.allclose(outp[idx_nonzero].numpy(), (inp/(1-p))[idx_nonzero].numpy())",
"execution_count": 287,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "----"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "So why does it does this? In the doc:\n\n> This means that during evaluation the module simply computes an identity function."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "This basically says during evaluation/test/inference time, the the dropout layer becomes an identity function and makes no change to its input. \n\nBecause dropout is active only during training time but not inference time, without the scaling, the expected output would be larger during inference time because the elements are not being randomly chosen to be dropped (set to 0). But we want the expected output with and without going through the dropout layer to be the same. Therefore, during training, we compensate by making the output of the dropout layer larger by the scaling factor of `1/(1-p)`. A larger `p` means more aggressive dropout, which means the more compensation we need, i.e. the larger scaling factor `1/(1-p)`. \n\nThe code below demonstrates how scaling factor put output back to the same scale as the input."
},
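{
"metadata": {},
"cell_type": "markdown",
"source": "To spell out the expectation argument for a single input element $x$ with drop probability $p$: during training the element is kept (and scaled) with probability $1-p$ and zeroed with probability $p$, so\n\n$$E[\\text{output}] = (1-p)\\cdot\\frac{x}{1-p} + p\\cdot 0 = x,$$\n\nwhich matches the unchanged value $x$ that the identity function returns at inference time."
},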
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "inp = torch.rand(nbig, 10)\noutp = m(inp)\n\nprint(f'Average output ({outp.mean():.4f}) of the dropout layer is close to averge input ({inp.mean():.4f}).')",
"execution_count": 293,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": "Average output (0.5000) of the dropout layer is close to averge input (0.5000).\n"
}
]
},
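{
"metadata": {},
"cell_type": "markdown",
"source": "To see the identity-function behaviour directly, we can switch the layer into evaluation mode with `m.eval()` (and back with `m.train()`); in eval mode the layer should return its input unchanged. A minimal check, using a fresh input chosen just for illustration:"
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# In evaluation mode the dropout layer is a no-op: the output equals the input exactly.\nm.eval()\nx = torch.rand(5, 4)\nassert torch.equal(m(x), x)\nm.train()  # switch back to training mode",
"execution_count": null,
"outputs": []
},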
{
"metadata": {},
"cell_type": "markdown",
"source": "----"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Instead of making the output of the dropout layer larger during training, one could equivalently make the output of the identity function during inference smaller. However the former is easier to implement. There is a [discussion](https://stats.stackexchange.com/questions/205932/dropout-scaling-the-activation-versus-inverting-the-dropout) on stackoverflow that provides some details. But be careful, the `p` in that discussion (from slides of Standford CS231n: Convolutional Neural Networks for Visual Recognition) is the ratio for keeping instead of for dropping. "
},
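{
"metadata": {},
"cell_type": "markdown",
"source": "For reference, here is a rough, hand-written sketch of the two equivalent formulations using an explicit Bernoulli mask. This is only an illustrative sketch (the function names are made up here), not how `torch.nn.Dropout` is implemented internally."
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "# Inverted dropout (what nn.Dropout does): scale by 1/(1-p) during training, identity at inference.\ndef inverted_dropout(x, p, training=True):\n    if not training:\n        return x\n    mask = (torch.rand_like(x) > p).float()  # keep each element with probability 1-p\n    return x * mask / (1 - p)\n\n# 'Traditional' dropout: no scaling during training, scale by (1-p) at inference instead.\ndef traditional_dropout(x, p, training=True):\n    if not training:\n        return x * (1 - p)\n    mask = (torch.rand_like(x) > p).float()\n    return x * mask",
"execution_count": null,
"outputs": []
},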
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"_draft": {
"nbviewer_url": "https://gist.github.com/2b6ee68ae07188f2ce0e58d9e7f123db"
},
"gist": {
"id": "2b6ee68ae07188f2ce0e58d9e7f123db",
"data": {
"description": "git/yang-zhang.github.io/ds_math/dropout-scaling.ipynb",
"public": true
}
},
"kernelspec": {
"name": "pytorch",
"display_name": "pytorch",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.1",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"toc": {
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"base_numbering": 1,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}