ragulpr/zero-padding-batchnorm-keras.ipynb

## zero-padding-batchnorm-keras.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Does Keras BatchNorm respect Mask?\n",
    "\n",
    "I've been under the assumption that it *does*, and thought I tested it thoroughly before. I also said so in a [reddit comment](https://www.reddit.com/r/MachineLearning/comments/7yco19/d_does_zero_padding_affect_normalization_output/duff8ls/)\n",
    "\n",
    "Looking deeper into it, I still *wish it were so*, but I can find no proof that it is. This is a (poor) attempt to answer the question.\n",
    "\n",
    "### Background (as of 2019-02-27)\n",
    "* Very few tests cover the effect of `mask`. In particular:\n",
    "* https://github.com/keras-team/keras/blob/master/tests/keras/layers/normalization_test.py does not test `mask`\n",
    "* https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py does not mention `mask`\n",
    "* https://github.com/keras-team/keras/blob/d48e97079914d897e82ddcb1a45261ce4415b8ea/keras/backend/tensorflow_backend.py#L1913 does not...\n",
    "\n",
    "## Expected effect of BN not respecting *mask*\n",
    "A very crude intuition[1] for batchnorm is that it centers and scales via `y = (x-x.mean())/x.std()`, so if `x` holds nonsense values at the positions that `mask` marks as padding, that nonsense should propagate into the centering/scaling coefficients.\n",
    "\n",
    "\n",
    "[1] Note, it's definitely not implemented like this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "keras.__version__ 2.2.0\n",
      "theano.__version__ 1.0.1+unknown\n",
      "tf.__version__ 1.0.1+unknown\n"
     ]
    }
   ],
   "source": [
    "import keras.backend as K\n",
    "import keras.layers as L\n",
    "from keras.layers.normalization import BatchNormalization\n",
    "\n",
    "from keras.models import Sequential\n",
    "from keras.optimizers import adam\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "import keras\n",
    "print('keras.__version__',keras.__version__)\n",
    "try:\n",
    "    import theano\n",
    "    print('theano.__version__',theano.__version__)\n",
    "except ImportError:\n",
    "    pass\n",
    "import tensorflow as tf\n",
    "print('tf.__version__',tf.__version__)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "do_mask = True\n",
    "\n",
    "def bn_runner(add_nonsense,use_mask, lr = 0.1):\n",
    "    np.random.seed(1)\n",
    "\n",
    "    model = Sequential()\n",
    "    model.add(L.InputLayer(input_shape=(1,)))\n",
    "    mask_value = 3000.1337 \n",
    "\n",
    "    if use_mask:\n",
    "        model.add(L.Masking(mask_value=mask_value))\n",
    "\n",
    "    model.add(BatchNormalization(axis=-1, momentum=0.95, epsilon=.1))\n",
    "\n",
    "    model.compile(loss='mse', optimizer=adam(lr=lr))\n",
    "\n",
    "    bn_coefs_before = model.layers[-1].get_weights()\n",
    "    # Keras order: [gamma, beta, moving_mean, moving_variance] (the last entry is a variance, not a std)\n",
    "    \n",
    "    n = 200\n",
    "\n",
    "    x = np.zeros((n,1))\n",
    "    y = np.zeros((n,1))#np.random.normal(0,1,(n,1)) \n",
    "    \n",
    "    if add_nonsense:\n",
    "        # If `use_mask`, we assume these values won't be seen or affect anything\n",
    "        x[-20:] = mask_value\n",
    "    \n",
    "    model.fit(x,y,epochs=200,batch_size = 25, verbose=0)\n",
    "\n",
    "    bn_coefs_after = model.layers[-1].get_weights()\n",
    "    \n",
    "    predicted_unique_vals = np.unique(model.predict(np.zeros_like(x)))\n",
    "    return '[gamma, beta, mean, std] init',bn_coefs_before,'after',bn_coefs_after,'pred output',predicted_unique_vals\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From /usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "Colocations handled automatically by placer.\n",
      "WARNING:tensorflow:From /usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "Use tf.cast instead.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "('[gamma, beta, mean, std] init',\n",
       " [array([1.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([1.], dtype=float32)],\n",
       " 'after',\n",
       " [array([-1.01612344e-10], dtype=float32),\n",
       "  array([0.00447437], dtype=float32),\n",
       "  array([297.11792], dtype=float32),\n",
       "  array([817455.56], dtype=float32)],\n",
       " 'pred output',\n",
       " array([0.00447437], dtype=float32))"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn_runner(add_nonsense=True,use_mask=False,lr = 0.1) # assume bn will learn very large vals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('[gamma, beta, mean, std] init',\n",
       " [array([1.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([1.], dtype=float32)],\n",
       " 'after',\n",
       " [array([1.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([0.], dtype=float32)],\n",
       " 'pred output',\n",
       " array([0.], dtype=float32))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn_runner(add_nonsense=True,use_mask=True,lr = 0.1) # assume bn centers at 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('[gamma, beta, mean, std] init',\n",
       " [array([1.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([1.], dtype=float32)],\n",
       " 'after',\n",
       " [array([1.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([0.], dtype=float32)],\n",
       " 'pred output',\n",
       " array([0.], dtype=float32))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn_runner(add_nonsense=False,use_mask=False,lr = 0.1) # Best if learned same as above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('[gamma, beta, mean, std] init',\n",
       " [array([1.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([1.], dtype=float32)],\n",
       " 'after',\n",
       " [array([1.], dtype=float32),\n",
       "  array([0.], dtype=float32),\n",
       "  array([297.11792], dtype=float32),\n",
       "  array([817455.56], dtype=float32)],\n",
       " 'pred output',\n",
       " array([-0.32862207], dtype=float32))"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bn_runner(add_nonsense=True,use_mask=False,lr = 0.) # Best if learned sameish as first"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Results: Inconclusive\n",
    "The mask seems to be respected, but I could not find anything in the Keras repo that explains why.\n",
    "A confounding factor could be how `mask` is handled by the loss and other parts of the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
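
The crude intuition from the notebook's "Expected effect" cell can be illustrated with plain NumPy. This is only a sketch of the idea `y = (x-x.mean())/x.std()` on the same toy data as `bn_runner`; it is not how Keras implements batch normalization, and the names `keep`, `naive_*` and `masked_*` are illustrative assumptions rather than anything from the Keras API.

```python
import numpy as np

mask_value = 3000.1337  # same sentinel value as in the notebook

# 200 samples of zeros, the last 20 overwritten with the sentinel
x = np.zeros((200, 1))
x[-20:] = mask_value

# True where the entry is real data, False where it is padding
keep = (x != mask_value)

# Statistics that ignore the mask: the padding leaks into mean and std
naive_mean = x.mean()
naive_std = x.std()

# Statistics that respect the mask: only the real entries are used
masked_mean = x[keep].mean()
masked_std = x[keep].std()

print('naive  mean/std:', naive_mean, naive_std)
print('masked mean/std:', masked_mean, masked_std)
```

If the normalization ignored the mask, one would expect statistics of the first kind (far from zero); if it respected the mask, statistics of the second kind (all zeros for this toy input).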