@gvtulder
Created November 12, 2016 12:19
Theano string-dependent optimization bug

In this commit, https://github.com/Theano/Theano/pull/5190/commits/772fbcb00ba03b8ef9c6c198f04bf519a91589b1, I 'fixed' a bug simply by renaming the property 'inplace_running_mean' to 'inplace_running_xxx'.

The problem seems to arise during graph optimization:

  1. There is a GpuDnnBatchNorm Op that takes a tensor 'x' as its first input, normalizes it and returns the result as its output. The Op has three inplace parameters: inplace_running_mean, inplace_running_var and inplace_output. Setting inplace_output=True makes the Op overwrite the original input tensor 'x'.

  2. The gradient Op GpuDnnBatchNormGrad takes as inputs the original tensor 'x' and some of the outputs of the GpuDnnBatchNorm Op, which means that the grad Op has to run after the normalization Op.

  3. Obviously, if the normalization Op and the grad Op both read the same input 'x', the normalization Op should not run with inplace_output=True, because that would overwrite 'x' before the grad Op gets to read it. (A minimal sketch of how an Op normally advertises this constraint follows below.)
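
As a side note: the way a Theano Op advertises "I overwrite input i to produce output o" is its destroy_map, which the scheduler and the inplace optimizers are supposed to respect. Below is a minimal sketch of that mechanism using a made-up toy CPU Op (NormalizeMaybeInplace is not part of Theano and is not the real GpuDnnBatchNorm implementation); it only illustrates the constraint described in point 3.

import theano.tensor as T
from theano.gof import Apply, Op


class NormalizeMaybeInplace(Op):
    # Toy Op: subtracts the mean of its (float) input, optionally in place.
    __props__ = ('inplace',)

    def __init__(self, inplace=False):
        self.inplace = inplace
        if inplace:
            # Output 0 reuses (destroys) the buffer of input 0. This tells
            # the scheduler that no other consumer of that input may run
            # after this node.
            self.destroy_map = {0: [0]}

    def make_node(self, x):
        x = T.as_tensor_variable(x)
        return Apply(self, [x], [x.type()])

    def perform(self, node, inputs, output_storage):
        (x,) = inputs
        out = x if self.inplace else x.copy()
        out -= out.mean()  # assumes a float dtype
        output_storage[0][0] = out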

In https://github.com/gvtulder/Theano/commit/fde4542b9c591c23b4ffcd955096451e779d7956 I introduced the inplace_output option. With this commit the test fails because the gradient computed by GpuDnnBatchNormGrad is not what is expected. From the debugprint and the toposort order, it looks as if inplace_output has been set to True, so that the grad Op runs only after the normalization has already modified the original input.

I've narrowed this problem down to the name of the inplace_running_mean property. I renamed it to inplace_running_xxx in this commit: https://github.com/Theano/Theano/pull/5190/commits/772fbcb00ba03b8ef9c6c198f04bf519a91589b1.

  1. With inplace_running_mean, inplace_output is set to True and the test fails.
  2. With inplace_running_xxx, inplace_output is set to False and the test passes.
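
I do not know yet where exactly the name enters the picture, but one place where property names are visible to the rest of Theano is the __props__ machinery: as far as I understand, when an Op defines __props__, Theano derives __eq__ and __hash__ from the property values and, if the class does not define its own, a default __str__ from the property names and values. That is exactly the "GpuDnnBatchNorm{..., inplace_running_mean=False, ...}" form visible in the debugprint output below, so anything that sorts or keys on these strings (or on attribute names) sees a different order after the rename. A small illustrative sketch with a made-up DemoOp, not Theano's optimizer code:

from theano.gof import Op


class DemoOp(Op):
    # The auto-generated __str__ embeds the property *names*, so renaming a
    # property changes every node label built from str(op), even though the
    # property values (and hence __eq__/__hash__) stay the same.
    __props__ = ('inplace_running_mean', 'inplace_output')

    def __init__(self, inplace_running_mean=False, inplace_output=False):
        self.inplace_running_mean = inplace_running_mean
        self.inplace_output = inplace_output

    # make_node/perform omitted: only the generated __str__ matters here


print(DemoOp())
# -> something like: DemoOp{inplace_running_mean=False, inplace_output=False}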

There is a demo script below. Check out one of the two commits linked above, run the script and inspect the results; the output for both versions is included below.

GpuDnnBatchNorm has properties: ['inplace_running_mean', 'inplace_running_var', 'inplace_output']
####################
debugprint before compilation: out and grads[0]
GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=False, inplace_running_var=False, inplace_output=False}.0 [id A] ''
|GpuContiguous [id B] ''
| |GpuFromHost<None> [id C] ''
| |x [id D]
|GpuContiguous [id E] ''
| |GpuFromHost<None> [id F] ''
| |scale [id G]
|GpuContiguous [id H] ''
| |GpuFromHost<None> [id I] ''
| |bias [id J]
|Constant{0.005} [id K]
|Constant{0.3} [id L]
|GpuContiguous [id M] ''
| |GpuFromHost<None> [id N] ''
| |running_mean [id O]
|GpuContiguous [id P] ''
|GpuFromHost<None> [id Q] ''
|running_var [id R]
HostFromGpu(gpuarray) [id S] ''
|GpuDnnBatchNormGrad{mode='per-activation'}.0 [id T] ''
|GpuContiguous [id B] ''
|GpuFromHost<None> [id U] ''
| |dy [id V]
|GpuContiguous [id E] ''
|GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=False, inplace_running_var=False, inplace_output=False}.1 [id W] ''
| |GpuContiguous [id B] ''
| |GpuContiguous [id E] ''
| |GpuContiguous [id H] ''
| |Constant{0.005} [id K]
| |Constant{0.3} [id L]
| |GpuContiguous [id M] ''
| |GpuContiguous [id P] ''
|GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=False, inplace_running_var=False, inplace_output=False}.2 [id W] ''
|Constant{0.005} [id K]
####################
compiling
####################
debugprint after compilation
Elemwise{Add}[(0, 0)] [id A] '' 17
|InplaceDimShuffle{x,0,1,2,3,4} [id B] '' 14
| |HostFromGpu(gpuarray) [id C] '' 12
| |GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=True, inplace_running_var=True, inplace_output=True}.0 [id D] '' 11
| |GpuContiguous [id E] '' 10
| | |GpuFromHost<None> [id F] '' 4
| | |x [id G]
| |GpuContiguous [id H] '' 9
| | |GpuFromHost<None> [id I] '' 3
| | |scale [id J]
| |GpuContiguous [id K] '' 8
| | |GpuFromHost<None> [id L] '' 2
| | |bias [id M]
| |Constant{0.005} [id N]
| |Constant{0.3} [id O]
| |GpuContiguous [id P] '' 7
| | |GpuFromHost<None> [id Q] '' 1
| | |running_mean [id R]
| |GpuContiguous [id S] '' 6
| |GpuFromHost<None> [id T] '' 0
| |running_var [id U]
|InplaceDimShuffle{x,0,1,2,3,4} [id V] '' 16
|HostFromGpu(gpuarray) [id W] '' 15
|GpuDnnBatchNormGrad{mode='per-activation'}.0 [id X] '' 13
|GpuContiguous [id E] '' 10
|GpuFromHost<None> [id Y] '' 5
| |dy [id Z]
|GpuContiguous [id H] '' 9
|GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=True, inplace_running_var=True, inplace_output=True}.1 [id D] '' 11
|GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=True, inplace_running_var=True, inplace_output=True}.2 [id D] '' 11
|Constant{0.005} [id N]
####################
toposort order after compilation
0: GpuFromHost<None>(running_var)
1: GpuFromHost<None>(running_mean)
2: GpuFromHost<None>(bias)
3: GpuFromHost<None>(scale)
4: GpuFromHost<None>(x)
5: GpuFromHost<None>(dy)
6: GpuContiguous(GpuFromHost<None>.0)
7: GpuContiguous(GpuFromHost<None>.0)
8: GpuContiguous(GpuFromHost<None>.0)
9: GpuContiguous(GpuFromHost<None>.0)
10: GpuContiguous(GpuFromHost<None>.0)
11: GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=True, inplace_running_var=True, inplace_output=True}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, Constant{0.005}, Constant{0.3}, GpuContiguous.0, GpuContiguous.0)
12: HostFromGpu(gpuarray)(GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=True, inplace_running_var=True, inplace_output=True}.0)
13: GpuDnnBatchNormGrad{mode='per-activation'}(GpuContiguous.0, GpuFromHost<None>.0, GpuContiguous.0, GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=True, inplace_running_var=True, inplace_output=True}.1, GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_mean=True, inplace_running_var=True, inplace_output=True}.2, Constant{0.005})
14: InplaceDimShuffle{x,0,1,2,3,4}(HostFromGpu(gpuarray).0)
15: HostFromGpu(gpuarray)(GpuDnnBatchNormGrad{mode='per-activation'}.0)
16: InplaceDimShuffle{x,0,1,2,3,4}(HostFromGpu(gpuarray).0)
17: Elemwise{Add}[(0, 0)](InplaceDimShuffle{x,0,1,2,3,4}.0, InplaceDimShuffle{x,0,1,2,3,4}.0)
####################
some analysis
GpuDnnBatchNorm has properties: ['inplace_running_mean', 'inplace_running_var', 'inplace_output']
- There is one GpuDnnBatchNorm Op and one GpuDnnBatchNormGrad Op.
- Both nodes share the same input[0].
- The GpuDnnBatchNorm Op works inplace on this input[0] (inplace_output=True).
- The GpuDnnBatchNorm Op runs before the GpuDnnBatchNormGrad Op.
Problem? Yes. This is not supposed to happen.
GpuDnnBatchNorm has properties: ['inplace_running_xxx', 'inplace_running_var', 'inplace_output']
####################
debugprint before compilation: out and grads[0]
GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=False, inplace_running_var=False, inplace_output=False}.0 [id A] ''
|GpuContiguous [id B] ''
| |GpuFromHost<None> [id C] ''
| |x [id D]
|GpuContiguous [id E] ''
| |GpuFromHost<None> [id F] ''
| |scale [id G]
|GpuContiguous [id H] ''
| |GpuFromHost<None> [id I] ''
| |bias [id J]
|Constant{0.005} [id K]
|Constant{0.3} [id L]
|GpuContiguous [id M] ''
| |GpuFromHost<None> [id N] ''
| |running_mean [id O]
|GpuContiguous [id P] ''
|GpuFromHost<None> [id Q] ''
|running_var [id R]
HostFromGpu(gpuarray) [id S] ''
|GpuDnnBatchNormGrad{mode='per-activation'}.0 [id T] ''
|GpuContiguous [id B] ''
|GpuFromHost<None> [id U] ''
| |dy [id V]
|GpuContiguous [id E] ''
|GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=False, inplace_running_var=False, inplace_output=False}.1 [id W] ''
| |GpuContiguous [id B] ''
| |GpuContiguous [id E] ''
| |GpuContiguous [id H] ''
| |Constant{0.005} [id K]
| |Constant{0.3} [id L]
| |GpuContiguous [id M] ''
| |GpuContiguous [id P] ''
|GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=False, inplace_running_var=False, inplace_output=False}.2 [id W] ''
|Constant{0.005} [id K]
####################
compiling
####################
debugprint after compilation
Elemwise{Add}[(0, 0)] [id A] '' 17
|InplaceDimShuffle{x,0,1,2,3,4} [id B] '' 14
| |HostFromGpu(gpuarray) [id C] '' 12
| |GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=True, inplace_running_var=True, inplace_output=False}.0 [id D] '' 11
| |GpuContiguous [id E] '' 10
| | |GpuFromHost<None> [id F] '' 4
| | |x [id G]
| |GpuContiguous [id H] '' 9
| | |GpuFromHost<None> [id I] '' 3
| | |scale [id J]
| |GpuContiguous [id K] '' 8
| | |GpuFromHost<None> [id L] '' 2
| | |bias [id M]
| |Constant{0.005} [id N]
| |Constant{0.3} [id O]
| |GpuContiguous [id P] '' 7
| | |GpuFromHost<None> [id Q] '' 1
| | |running_mean [id R]
| |GpuContiguous [id S] '' 6
| |GpuFromHost<None> [id T] '' 0
| |running_var [id U]
|InplaceDimShuffle{x,0,1,2,3,4} [id V] '' 16
|HostFromGpu(gpuarray) [id W] '' 15
|GpuDnnBatchNormGrad{mode='per-activation'}.0 [id X] '' 13
|GpuContiguous [id E] '' 10
|GpuFromHost<None> [id Y] '' 5
| |dy [id Z]
|GpuContiguous [id H] '' 9
|GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=True, inplace_running_var=True, inplace_output=False}.1 [id D] '' 11
|GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=True, inplace_running_var=True, inplace_output=False}.2 [id D] '' 11
|Constant{0.005} [id N]
####################
toposort order after compilation
0: GpuFromHost<None>(running_var)
1: GpuFromHost<None>(running_mean)
2: GpuFromHost<None>(bias)
3: GpuFromHost<None>(scale)
4: GpuFromHost<None>(x)
5: GpuFromHost<None>(dy)
6: GpuContiguous(GpuFromHost<None>.0)
7: GpuContiguous(GpuFromHost<None>.0)
8: GpuContiguous(GpuFromHost<None>.0)
9: GpuContiguous(GpuFromHost<None>.0)
10: GpuContiguous(GpuFromHost<None>.0)
11: GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=True, inplace_running_var=True, inplace_output=False}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, Constant{0.005}, Constant{0.3}, GpuContiguous.0, GpuContiguous.0)
12: HostFromGpu(gpuarray)(GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=True, inplace_running_var=True, inplace_output=False}.0)
13: GpuDnnBatchNormGrad{mode='per-activation'}(GpuContiguous.0, GpuFromHost<None>.0, GpuContiguous.0, GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=True, inplace_running_var=True, inplace_output=False}.1, GpuDnnBatchNorm{mode='per-activation', running_averages=True, inplace_running_xxx=True, inplace_running_var=True, inplace_output=False}.2, Constant{0.005})
14: InplaceDimShuffle{x,0,1,2,3,4}(HostFromGpu(gpuarray).0)
15: HostFromGpu(gpuarray)(GpuDnnBatchNormGrad{mode='per-activation'}.0)
16: InplaceDimShuffle{x,0,1,2,3,4}(HostFromGpu(gpuarray).0)
17: Elemwise{Add}[(0, 0)](InplaceDimShuffle{x,0,1,2,3,4}.0, InplaceDimShuffle{x,0,1,2,3,4}.0)
####################
some analysis
GpuDnnBatchNorm has properties: ['inplace_running_xxx', 'inplace_running_var', 'inplace_output']
- There is one GpuDnnBatchNorm Op and one GpuDnnBatchNormGrad Op.
- Both nodes share the same input[0].
- The GpuDnnBatchNorm Op does not work inplace on this input[0] (inplace_output=False).
- The GpuDnnBatchNorm Op runs before the GpuDnnBatchNormGrad Op.
Problem? No, this might work.
import theano
import theano.tensor as T
import theano.gpuarray.dnn as dnn

inplace_running_props = [prop for prop in dnn.GpuDnnBatchNorm.__props__
                         if 'inplace_' in prop]
print('GpuDnnBatchNorm has properties: %s' % str(inplace_running_props))
print('')

mode = 'per-activation'
vartype = T.tensor5
x, scale, bias, running_mean, running_var = (vartype(n)
                                              for n in ('x', 'scale', 'bias',
                                                        'running_mean',
                                                        'running_var'))

# forward pass, direct interface
out_gpu, x_mean_gpu, x_invstd_gpu, \
    out_running_mean_gpu, out_running_var_gpu = \
    dnn.dnn_batch_normalization_train(x, scale, bias, mode, 5e-3, 0.3,
                                      running_mean, running_var)

# backward pass
dy = vartype('dy')
grad_gpu = T.grad(None, wrt=x, known_grads={out_gpu: dy})

print('#' * 20)
print('debugprint before compilation: out and grads[0]')
theano.printing.debugprint([out_gpu, grad_gpu])
print('')

# compile
# (note: grad_gpu is a single variable, so '[out_gpu] + grad_gpu' seems to end
# up as a broadcasted add rather than a list of two outputs; this is where the
# Elemwise{Add} in the compiled graph comes from)
print('#' * 20)
print('compiling')
f_gpu = theano.function([x, scale, bias, running_mean, running_var, dy],
                        [out_gpu] + grad_gpu)
print('')

print('#' * 20)
print('debugprint after compilation')
theano.printing.debugprint(f_gpu)
print('')

print('#' * 20)
print('toposort order after compilation')
for i, n in enumerate(f_gpu.maker.fgraph.toposort()):
    print('%2d: %s' % (i, str(n)))
print('')

print('#' * 20)
print('some analysis')
nodes = f_gpu.maker.fgraph.toposort()
batch_norm_nodes = [n for n in nodes if isinstance(n.op, dnn.GpuDnnBatchNorm)]
batch_norm_grad_nodes = [n for n in nodes if isinstance(n.op, dnn.GpuDnnBatchNormGrad)]
inplace_running_props = [prop for prop in dnn.GpuDnnBatchNorm.__props__
                         if 'inplace_' in prop]
print('GpuDnnBatchNorm has properties: %s' % str(inplace_running_props))
if len(batch_norm_nodes) == 1 and len(batch_norm_grad_nodes) == 1:
    print('- There is one GpuDnnBatchNorm Op and one GpuDnnBatchNormGrad Op.')
    if batch_norm_nodes[0].inputs[0] == batch_norm_grad_nodes[0].inputs[0]:
        print('- Both nodes share the same input[0].')
        if batch_norm_nodes[0].op.inplace_output:
            print('- The GpuDnnBatchNorm Op works inplace on this input[0] (inplace_output=True).')
            inplace = True
        else:
            print('- The GpuDnnBatchNorm Op does not work inplace on this input[0] (inplace_output=False).')
            inplace = False
        if nodes.index(batch_norm_nodes[0]) < nodes.index(batch_norm_grad_nodes[0]):
            norm_before_grad = True
            print('- The GpuDnnBatchNorm Op runs before the GpuDnnBatchNormGrad Op.')
        else:
            norm_before_grad = False
        if inplace and norm_before_grad:
            print('Problem? Yes. This is not supposed to happen.')
        else:
            print('Problem? No, this might work.')
else:
    print('Hard to say if this will work or not.')
print('')
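
For reference, the check at the end of the script can be condensed into a single assertion that could go into a test. This reuses f_gpu and dnn from the script above and is only a sketch of such a check:

# The normalization node must not claim inplace_output while a grad node
# still reads the same (unmodified) input.
nodes = f_gpu.maker.fgraph.toposort()
bn_nodes = [n for n in nodes if isinstance(n.op, dnn.GpuDnnBatchNorm)]
grad_nodes = [n for n in nodes if isinstance(n.op, dnn.GpuDnnBatchNormGrad)]
if bn_nodes and grad_nodes and bn_nodes[0].inputs[0] is grad_nodes[0].inputs[0]:
    assert not bn_nodes[0].op.inplace_output, (
        'GpuDnnBatchNorm overwrites an input that '
        'GpuDnnBatchNormGrad still needs')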