Created
February 27, 2019 00:22
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Does Keras BatchNorm respect Mask?\n", | |
"\n", | |
"I've been under the assumption that it *does*, and thought I had tested it thoroughly before. I also said so in a [reddit comment](https://www.reddit.com/r/MachineLearning/comments/7yco19/d_does_zero_padding_affect_normalization_output/duff8ls/).\n", | |
"\n", | |
"Looking deeper into it, I still *wish it were so*, but really, I can find no proof that it is. This is a (poor) attempt to answer the question.\n", | |
"\n", | |
"### Background (as of 2019-02-27)\n", | |
"* Very few tests check the effect of `mask`. In particular:\n", | |
"* https://github.com/keras-team/keras/blob/master/tests/keras/layers/normalization_test.py does not test `mask`\n", | |
"* https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py does not mention `mask`\n", | |
"* https://github.com/keras-team/keras/blob/d48e97079914d897e82ddcb1a45261ce4415b8ea/keras/backend/tensorflow_backend.py#L1913 does not...\n", | |
"\n", | |
"## Expected effect of BN not respecting *mask*\n", | |
"A very crude intuitive idea[1] of batchnorm is that one centers/scales via `y = (x-x.mean())/x.std()`, so if `x` has nonsense values where `mask` would tell us it does, this nonsense should propagate into the centering/scaling coefficients.\n", | |
"\n", | |
"\n", | |
"[1] Note, it's definitely not implemented like this." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"Using TensorFlow backend.\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"keras.__version__ 2.2.0\n", | |
"theano.__version__ 1.0.1+unknown\n", | |
"tf.__version__ 1.0.1+unknown\n" | |
] | |
} | |
], | |
"source": [ | |
"import keras.backend as K\n", | |
"import keras.layers as L\n", | |
"from keras.layers.normalization import BatchNormalization\n", | |
"\n", | |
"from keras.models import Sequential\n", | |
"from keras.optimizers import adam\n", | |
"\n", | |
"import numpy as np\n", | |
"\n", | |
"import keras\n", | |
"print('keras.__version__', keras.__version__)\n", | |
"try:\n", | |
"    import theano\n", | |
"    print('theano.__version__', theano.__version__)\n", | |
"except ImportError:\n", | |
"    pass\n", | |
"import tensorflow as tf\n", | |
"print('tf.__version__', tf.__version__)\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"do_mask = True\n", | |
"\n", | |
"def bn_runner(add_nonsense, use_mask, lr=0.1):\n", | |
"    np.random.seed(1)\n", | |
"\n", | |
"    model = Sequential()\n", | |
"    model.add(L.InputLayer(input_shape=(1,)))\n", | |
"    mask_value = 3000.1337\n", | |
"\n", | |
"    if use_mask:\n", | |
"        model.add(L.Masking(mask_value=mask_value))\n", | |
"\n", | |
"    model.add(BatchNormalization(axis=-1, momentum=0.95, epsilon=.1))\n", | |
"\n", | |
"    model.compile(loss='mse', optimizer=adam(lr=lr))\n", | |
"\n", | |
"    bn_coefs_before = model.layers[-1].get_weights()\n", | |
"    # Keras returns [gamma, beta, moving_mean, moving_variance];\n", | |
"    # the label below says 'std', but the last entry is a variance.\n", | |
"\n", | |
"    n = 200\n", | |
"\n", | |
"    x = np.zeros((n, 1))\n", | |
"    y = np.zeros((n, 1))  # alternative target: np.random.normal(0, 1, (n, 1))\n", | |
"\n", | |
"    if add_nonsense:\n", | |
"        # If `use_mask`, we assume these rows won't be seen or affect anything\n", | |
"        x[-20:] = mask_value\n", | |
"\n", | |
"    model.fit(x, y, epochs=200, batch_size=25, verbose=0)\n", | |
"\n", | |
"    bn_coefs_after = model.layers[-1].get_weights()\n", | |
"\n", | |
"    predicted_unique_vals = np.unique(model.predict(np.zeros_like(x)))\n", | |
"    return '[gamma, beta, mean, std] init', bn_coefs_before, 'after', bn_coefs_after, 'pred output', predicted_unique_vals\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"WARNING:tensorflow:From /usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.\n", | |
"Instructions for updating:\n", | |
"Colocations handled automatically by placer.\n", | |
"WARNING:tensorflow:From /usr/local/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n", | |
"Instructions for updating:\n", | |
"Use tf.cast instead.\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"('[gamma, beta, mean, std] init',\n", | |
" [array([1.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([1.], dtype=float32)],\n", | |
" 'after',\n", | |
" [array([-1.01612344e-10], dtype=float32),\n", | |
" array([0.00447437], dtype=float32),\n", | |
" array([297.11792], dtype=float32),\n", | |
" array([817455.56], dtype=float32)],\n", | |
" 'pred output',\n", | |
" array([0.00447437], dtype=float32))" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"bn_runner(add_nonsense=True,use_mask=False,lr = 0.1) # assume bn will learn very large vals" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"('[gamma, beta, mean, std] init',\n", | |
" [array([1.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([1.], dtype=float32)],\n", | |
" 'after',\n", | |
" [array([1.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([0.], dtype=float32)],\n", | |
" 'pred output',\n", | |
" array([0.], dtype=float32))" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"bn_runner(add_nonsense=True,use_mask=True,lr = 0.1) # assume bn centers at 0" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"('[gamma, beta, mean, std] init',\n", | |
" [array([1.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([1.], dtype=float32)],\n", | |
" 'after',\n", | |
" [array([1.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([0.], dtype=float32)],\n", | |
" 'pred output',\n", | |
" array([0.], dtype=float32))" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"bn_runner(add_nonsense=False,use_mask=False,lr = 0.1) # Best if learned same as above" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"('[gamma, beta, mean, std] init',\n", | |
" [array([1.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([1.], dtype=float32)],\n", | |
" 'after',\n", | |
" [array([1.], dtype=float32),\n", | |
" array([0.], dtype=float32),\n", | |
" array([297.11792], dtype=float32),\n", | |
" array([817455.56], dtype=float32)],\n", | |
" 'pred output',\n", | |
" array([-0.32862207], dtype=float32))" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"bn_runner(add_nonsense=True,use_mask=False,lr = 0.) # Best if learned sameish as first" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Results: Inconclusive\n", | |
"It seems like the mask is respected, but I could not find anything in the Keras repo that explains why this would be the case.\n", | |
"A confounding factor could be how `mask` is respected by the loss, etc." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.7" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
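The crude intuition from the notebook (`y = (x-x.mean())/x.std()`) can be checked by hand with NumPy on the same data the experiment uses: 180 zeros and 20 entries set to `mask_value`. This is only a sketch of the arithmetic, not Keras code. The helper name `batch_stats` is hypothetical, and the "zeroed" case rests on an assumption worth verifying against the installed Keras version: that a `Masking` layer replaces masked entries with zeros before they reach BatchNorm.

```python
import numpy as np

mask_value = 3000.1337
n = 200

x = np.zeros((n, 1))
x[-20:] = mask_value          # same "nonsense" rows as in bn_runner
valid = (x != mask_value)     # boolean mask: True where the value is real data


def batch_stats(values):
    """Population mean and variance, the kind of statistics BatchNorm tracks."""
    return float(values.mean()), float(values.var())


# 1) Nonsense included: the statistics are dominated by mask_value.
mean_all, var_all = batch_stats(x)

# 2) Masked entries excluded: what a mask-aware BatchNorm would compute.
mean_valid, var_valid = batch_stats(x[valid])

# 3) ASSUMPTION: masked entries are zeroed upstream rather than excluded.
mean_zeroed, var_zeroed = batch_stats(np.where(valid, x, 0.0))

print("all    :", mean_all, var_all)
print("valid  :", mean_valid, var_valid)
print("zeroed :", mean_zeroed, var_zeroed)
```

Case 1 gives a mean of about 300 and a variance of about 8.1e5, the same order as the values recorded for `use_mask=False` (297.11792 and 817455.56; those are moving averages over shuffled mini-batches, so an exact match is not expected). Cases 2 and 3 coincide here only because the valid data is all zeros, so with this data the experiment cannot tell "masked out" apart from "zeroed", which fits the inconclusive result.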
@AnirudhDagar, unless things have changed in the Torch (and in particular the CUDA) community since I checked last year, BatchNorm does not support masking. Unfortunately, there are many numerical gotchas and BatchNorm has highly optimized low-level implementations, so it doesn't seem very feasible to just write it using the basic PyTorch Python API (I've tried; it was slow).
Thanks for the explanation :)
This is old but I found this post from reddit, and this is just to say that Keras BatchNormalization does support masking now:
https://keras.io/api/layers/normalization_layers/batch_normalization/
Hi, how can I do something similar in PyTorch? I have input to my `BatchNorm1d` layer, for example of shape `8,1630,50,171`, where `batch size=8` and `1630` is the dimension along which I have padding. So for example, along that dim I have data like [0,1,2,...1112,0,0,0,0,...0], so after 1112 it is padded with zeros to make it of length 1630, and similarly for everything in the batch. I also have a corresponding mask.
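One framework-agnostic way to think about this question is to compute the normalization statistics over the unpadded positions only. The sketch below does that in NumPy rather than PyTorch, so it shows the arithmetic and nothing more: it is not `BatchNorm1d`, it has no learnable scale/shift or running statistics, and it makes no claim about speed. The function name `masked_batchnorm`, the `(batch, time, channels)` layout, and the `eps` default are assumptions made for illustration; a 4-D input like the one in the question would need the reductions extended over its extra axis.

```python
import numpy as np


def masked_batchnorm(x, mask, eps=1e-5):
    """Normalize x per channel using only positions where mask is True.

    x:    float array of shape (batch, time, channels)
    mask: bool array of shape (batch, time); False marks padding
    """
    m = mask[..., None].astype(x.dtype)           # (batch, time, 1)
    count = max(m.sum(), 1.0)                     # number of valid positions
    mean = (x * m).sum(axis=(0, 1)) / count       # (channels,)
    var = (((x - mean) ** 2) * m).sum(axis=(0, 1)) / count
    y = (x - mean) / np.sqrt(var + eps)
    return y * m                                  # keep padded positions at zero


# Hypothetical toy batch: 2 sequences padded to length 4, with 3 channels.
rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=(2, 4, 3))
lengths = np.array([4, 2])                        # true lengths before padding
mask = np.arange(4)[None, :] < lengths[:, None]   # (2, 4)
x = x * mask[..., None]                           # zero-pad, as in the question

y = masked_batchnorm(x, mask)
valid = y[mask]                                   # (6, 3): valid positions only
print(valid.mean(axis=0), valid.var(axis=0))
```

Over the valid positions the output has per-channel mean close to 0 and variance close to 1, while the padded positions stay at zero and do not contribute to either statistic.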