
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import math\n", "import torch as th\n", "from torch.autograd import Variable\n", "import numpy as np\n", "import holoviews as hv\n", "%load_ext holoviews.ipython" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "th.manual_seed(42)\n", "th.cuda.manual_seed(42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Shattered Gradients on a Toy Problem\n", "\n", "In the toy problem we have the data is just grid of points in the range (-2, 2):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X = np.linspace(-2., 2., 256, dtype=np.float32).reshape(-1,1)\n", "X = Variable(th.from_numpy(X), requires_grad=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"...each hidden layer contains $N=200$ rectifier neurons...\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "H = 200\n", "v, w = th.nn.Linear(1,H), th.nn.Linear(H,1)\n", "v.weight.data.fill_(1.0)\n", "v.bias.data.normal_(std=1.)\n", "w.weight.data.normal_(std=1./math.sqrt(H))\n", "w.bias.data.zero_()\n", "one_layer = th.nn.Sequential(v, th.nn.ReLU(), w)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y = one_layer(X)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "one_layer.zero_grad()\n", "#_=X.grad.data.zero_()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y.backward(th.ones((256,1)))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ ":Curve [x] (y)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "hv.Curve(zip(X.data.numpy().squeeze(),\n", " X.grad.data.numpy().squeeze()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Appears to have the same structure as shown in the paper. Plotting the covariance matrix:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "g = X.grad.data.numpy()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "g_mu = np.mean(g)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "g_centered = g-g_mu\n", "g_cov = np.abs(np.dot(g_centered, g_centered.T))/np.var(g)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ ":Image [x,y] (z)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hv.Image(g_cov)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For deeper networks and resnets:" ] }, { "cell_type": "code", "execution_count": 137, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# deep network definition\n", "layers = [th.nn.Linear(1,H), th.nn.ReLU()]\n", "for i in range(23):\n", " layers += [th.nn.Linear(H,H), th.nn.ReLU()]\n", "layers += [th.nn.Linear(H,1)]\n", "deep = th.nn.Sequential(*layers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing the input through both:" ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y_deep = deep(X)\n", "deep.zero_grad()\n", "X.grad.data.zero_()\n", "y_deep.backward(th.ones(256,1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Weirdly, we don't get the same results as the paper here. 
The gradients still have Brownian-noise structure, but they're tiny:

```python
g = X.grad.data.numpy()
c = hv.Curve(zip(X.data.numpy().squeeze(),
                 g.squeeze()))
g_cov = np.abs(np.dot((g - g.mean()), (g - g.mean()).T)) / np.var(g)
c + hv.Image(g_cov).hist()
```

They do mention in the paper that the initialisation they use is just a standard normal for the weights, while we're using the traditional `stdv = 1. / math.sqrt(self.weight.size(1))` initialisation. Changing just that was not enough: to replicate their results I also had to standardise the activations at every layer. Even then the output is tiny, but it does at least look correct.
] }, { "cell_type": "code", "execution_count": 129, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from torch.nn import functional as F\n", "class Linear(th.nn.Linear):\n", " \"\"\"Linear with simple normal initialisation.\"\"\"\n", " def reset_parameters(self):\n", " stdv = 1.\n", " self.weight.data.normal_(std=stdv)\n", " self.bias.data.normal_(std=stdv)\n", " def forward(self, input):\n", " activations = F.linear(input, self.weight, self.bias)\n", " alpha = th.sqrt(activations.var(0))\n", " alpha = alpha.expand_as(activations)\n", " mu = activations.mean(0).expand_as(activations)\n", " return (activations - mu)/alpha" ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# deep network definition\n", "layers = [Linear(1,H), th.nn.ReLU()]\n", "for i in range(23):\n", " layers += [Linear(H,H), th.nn.ReLU()]\n", "layers += [Linear(H,1)]\n", "#layers += [w]\n", "deep = th.nn.Sequential(*layers)" ] }, { "cell_type": "code", "execution_count": 131, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [], "source": [ "y_deep = deep(X)\n", "deep.zero_grad()\n", "X.grad.data.zero_()\n", "y_deep.backward(th.ones(256,1))" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ ":Layout\n", " .Curve.I :Curve [x] (y)\n", " .AdjointLayout.I :AdjointLayout\n", " :Histogram [z] (Frequency)\n", " :Image [x,y] (z)" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g = X.grad.data.numpy()\n", "c = hv.Curve(zip(X.data.numpy().squeeze(),\n", " g.squeeze()))\n", "g_cov = np.abs(np.dot((g-g.mean()),(g-g.mean()).T))/np.var(g)\n", "c + hv.Image(g_cov).hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Matches apart from the scale. Looks like I'm getting a bit of a problem with vanishing gradients. Appears to be an issue with the large weights in the final layer. 
Switching in the layer used in the previous experiment (with a `1./stdv` initialisation) increases the scale massively. Maybe I shouldn't be standard-scaling the output of this final layer?

```python
# deep network definition, reusing the earlier output layer w
layers = [Linear(1, H), th.nn.ReLU()]
for i in range(23):
    layers += [Linear(H, H), th.nn.ReLU()]
layers += [w]
deep = th.nn.Sequential(*layers)
y_deep = deep(X)
deep.zero_grad()
X.grad.data.zero_()
y_deep.backward(th.ones(256, 1))
g = X.grad.data.numpy()
c = hv.Curve(zip(X.data.numpy().squeeze(),
                 g.squeeze()))
g_cov = np.abs(np.dot((g - g.mean()), (g - g.mean()).T)) / np.var(g)
c + hv.Image(g_cov).hist()
```

The resnet used in the experiments in this paper isn't a traditional resnet, but it's supposed to have the same properties. It's described in equation 2 as:

$$
\mathbf{x}_l = \alpha \left( \mathbf{x}_{l-1} + \beta \mathbf{W}^l \rho (BN(\mathbf{x}_{l-1})) \right)
$$

where $\alpha$ and $\beta$ are chosen scaling factors and $BN$ is batch normalisation.
] }, { "cell_type": "code", "execution_count": 199, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# resnet network definition\n", "BN = th.nn.BatchNorm1d\n", "class ResBlock(th.nn.Linear):\n", " \"\"\"Resnet fully connected block\"\"\"\n", " def reset_parameters(self):\n", " stdv = 1.\n", " self.weight.data.normal_(std=stdv)\n", " self.bias = None\n", " # scaling factors\n", " self.alpha = 1.0\n", " self.beta = 0.1\n", " # batch norm\n", " self.bn = BN(self.in_features)\n", " #self.bn = lambda x: x\n", " self.rho = th.nn.ReLU()\n", " \n", " def forward(self, input):\n", " # batch norm-like alpha?\n", " alpha, beta = self.alpha, self.beta\n", " #return F.linear(self.rho(self.bn(input)), self.weight, self.bias)\n", " return alpha*(input+beta*F.linear(self.rho(self.bn(input)), self.weight, self.bias))\n", " \n", "reslayers = [v]\n", "for i in range(49):\n", " reslayers += [ResBlock(H,H)]\n", "reslayers += [w]\n", "resnet = th.nn.Sequential(*reslayers)" ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ ":Layout\n", " .Curve.I :Curve [x] (y)\n", " .AdjointLayout.I :AdjointLayout\n", " :Histogram [z] (Frequency)\n", " :Image [x,y] (z)" ] }, "execution_count": 202, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_resnet = resnet(X)\n", "resnet.zero_grad()\n", "X.grad.data.zero_()\n", "y_resnet.backward(th.ones(256,1))\n", "g = X.grad.data.numpy()\n", "c = hv.Curve(zip(X.data.numpy().squeeze(),\n", " g.squeeze()))\n", "g_cov = np.abs(np.dot((g-g.mean()),(g-g.mean()).T))/np.var(g)\n", "c + hv.Image(g_cov).hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think that shows less shattering, but it's not as nice as the picture in the paper.\n", "\n", "# Autocorrelation Function\n", "\n", "Luckily numpy already has correlate, so this shouldn't be too difficult to calculate." 
] }, { "cell_type": "code", "execution_count": 272, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def autocorr(x, t=1):\n", " return np.corrcoef(np.array([x[0:len(x)-t], x[t:len(x)]]))\n", "\n", "def autocorrelation(depth):\n", " # deep network definition\n", " layers = [Linear(1,H), th.nn.ReLU()]\n", " for i in range(depth-1):\n", " layers += [Linear(H,H), th.nn.ReLU()]\n", " layers += [Linear(H,1)]\n", " deep = th.nn.Sequential(*layers)\n", " y_deep = deep(X)\n", " deep.zero_grad()\n", " X.grad.data.zero_()\n", " y_deep.backward(th.ones(256,1))\n", " g = X.grad.data.numpy().squeeze()\n", " results = []\n", " for lag in range(15):\n", " results.append(autocorr(g, t=lag))\n", " return np.array(results)[:,0,1]" ] }, { "cell_type": "code", "execution_count": 294, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def mean_autocorr(autocorrelation, depth, repeats=20):\n", " results = []\n", " for repeat in range(repeats):\n", " results.append(autocorrelation(depth))\n", " results = np.array(results).mean(axis=0)\n", " return results" ] }, { "cell_type": "code", "execution_count": 297, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%output size=200" ] }, { "cell_type": "code", "execution_count": 305, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ ":Overlay\n", " .Curve.Depth_2 :Curve [Lag] (Autocorrelation)\n", " .Curve.Depth_4 :Curve [Lag] (Autocorrelation)\n", " .Curve.Depth_10 :Curve [Lag] (Autocorrelation)\n", " .Curve.Depth_24 :Curve [Lag] (Autocorrelation)\n", " .Curve.Depth_50 :Curve [Lag] (Autocorrelation)" ] }, "execution_count": 305, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feedforward_results = [hv.Curve(mean_autocorr(autocorrelation, depth), label='Depth %i'%depth,\n", " kdims=['Lag'], vdims=['Autocorrelation'])\n", " for depth in [2, 4, 10, 24, 50]]\n", "hv.Overlay(feedforward_results).relabel('Feedforward')" ] }, { "cell_type": "code", "execution_count": 
308, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def resnet_autocorr(depth):\n", " reslayers = [v]\n", " for i in range(depth-1):\n", " reslayers += [ResBlock(H,H)]\n", " reslayers += [w]\n", " resnet = th.nn.Sequential(*reslayers)\n", " y_resnet = resnet(X)\n", " resnet.zero_grad()\n", " X.grad.data.zero_()\n", " y_resnet.backward(th.ones(256,1))\n", " g = X.grad.data.numpy().squeeze()\n", " results = []\n", " for lag in range(15):\n", " results.append(autocorr(g, t=lag))\n", " return np.array(results)[:,0,1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results are for $\\beta=0.1$, but they look more like the results in the paper for $\\beta=1.0$. I don't know why." ] }, { "cell_type": "code", "execution_count": 309, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ ":Overlay\n", " .Curve.Depth_2 :Curve [Lag] (Autocorrelation)\n", " .Curve.Depth_4 :Curve [Lag] (Autocorrelation)\n", " .Curve.Depth_10 :Curve [Lag] (Autocorrelation)\n", " .Curve.Depth_24 :Curve [Lag] (Autocorrelation)\n", " .Curve.Depth_50 :Curve [Lag] (Autocorrelation)" ] }, "execution_count": 309, "metadata": {}, "output_type": "execute_result" } ], "source": [ "resnet_results = [hv.Curve(mean_autocorr(resnet_autocorr, depth), label='Depth %i'%depth,\n", " kdims=['Lag'], vdims=['Autocorrelation'])\n", " for depth in [2, 4, 10, 24, 50]]\n", "hv.Overlay(resnet_results).relabel('Resnet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Effect of \"looks linear\" initialisation\n", "\n", "In the paper they don't actually provide a graph to show the effect of their \"looks linear\" (LL) initialisation on the autocorrelation function. It would be nice to have that, so I'm going to try and produce it here. The initialisation appears to depend on [concatenated rectifiers][concat]. 
They're a strange type of nonlinearity because you get two outputs for every input:

$$
CReLU(x) = \begin{pmatrix} relu(x) \\ relu(-x) \end{pmatrix}
$$

I'm going to concatenate on axis 1:

[concat]: https://arxiv.org/abs/1603.05201

```python
class CReLU(th.nn.ReLU):
    def forward(self, input):
        relu = lambda x: F.threshold(x, self.threshold, self.value, self.inplace)
        return th.cat([relu(input), relu(-input)], 1)
```

### Confusion notes

I don't entirely understand what they mean by a mirrored block structure. Is a mirrored matrix a [symmetric matrix][sym]? And aren't there a very large number of possible matrices with mirrored block structures? How should one be generated? All they put in the paper is the following equation:

$$
\begin{pmatrix} \mathbf{W} & - \mathbf{W} \end{pmatrix}
\begin{pmatrix} \rho (\mathbf{x}) \\ \rho (-\mathbf{x}) \end{pmatrix} =
\mathbf{W}\rho(\mathbf{x}) - \mathbf{W} \rho (-\mathbf{x}) = \mathbf{W}\mathbf{x}
$$

Does this equation mean I should initialise a matrix that satisfies it, or should I implement a layer that performs two different matrix multiplies on the two parts of a CReLU? And what does the matrix multiply look like after we typically apply a CReLU?

### Solution

It turns out that if we know we've concatenated in the way the CReLU does, then the block matrix the vector above describes could be:

$$
\begin{pmatrix} \mathbf{W} & \mathbf{0} \\ \mathbf{0} & -\mathbf{W} \end{pmatrix}
$$

Or it could be:

$$
\begin{pmatrix} \mathbf{W} & -\mathbf{W} \\ \mathbf{W} & -\mathbf{W} \end{pmatrix}
$$

In that case the left half is all free parameters. It took me far too long to realise exactly what this equation meant, but now I understand what "mirrored matrix" means, and the above equation works out fine. It's not too difficult to implement a Linear layer that initialises with this structure (using Glorot initialisation):

[sym]: https://en.wikipedia.org/wiki/Symmetric_matrix

```python
class LooksLinear(th.nn.Linear):
    """Linear with mirrored "looks linear" initialisation."""
    def reset_parameters(self):
        in_dim = self.weight.size(1)
        out_dim = self.weight.size(0)
        stdv = 1. / math.sqrt(in_dim)
        block = (th.rand(out_dim, in_dim // 2) * stdv * 2) - stdv
        blocks = th.cat([block, -block], 1)
        self.weight.data.zero_()
        self.weight.data.set_(blocks)
        if self.bias is not None:
            self.bias.data.zero_()
```

What happens if we initialise a feedforward network with this structure? It should look linear under the above tests. However, it's not really a one-to-one comparison, because now we have to map down in dimensionality after each concatenated ReLU; otherwise we'd get a dimensionality explosion. Unfortunately, I do still see the gradients tend to extremely low values with the Glorot initialisation:
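As a quick sanity check of the mirrored identity, here is a sketch in plain numpy (the sizes of `W` and `x` are arbitrary choices for illustration): a mirrored block $(\mathbf{W}\; -\mathbf{W})$ applied to a CReLU-style concatenation recovers $\mathbf{W}\mathbf{x}$ exactly, because $relu(x) - relu(-x) = x$.

```python
import numpy as np

rng = np.random.RandomState(0)
relu = lambda a: np.maximum(a, 0.)

W = rng.randn(3, 4)   # an arbitrary small weight block
x = rng.randn(4, 5)   # a batch of arbitrary inputs (features x batch)

# CReLU-style concatenation: stack relu(x) on top of relu(-x)
crelu_x = np.vstack([relu(x), relu(-x)])

# mirrored block [W, -W] applied to the concatenation
mirrored = np.hstack([W, -W])

# W @ relu(x) - W @ relu(-x) == W @ (relu(x) - relu(-x)) == W @ x
assert np.allclose(mirrored @ crelu_x, W @ x)
```

So at initialisation the network computes exactly the linear map `W @ x`, which is the "looks linear" property the paper relies on.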