@skaae
Created January 16, 2015 20:11
Function profiling
==================
Message: experiment.py:196
Time in 12 calls to Function.__call__: 8.222084e+01s
Time in Function.fn.__call__: 8.221798e+01s (99.997%)
Time in thunks: 8.196062e+01s (99.684%)
Total compile time: 2.509097e+02s
Number of Apply nodes: 1214
Theano Optimizer time: 1.675486e+02s
Theano validate time: 1.319681e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 8.298181e+01s
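The summary figures above can be cross-checked with a little arithmetic. The sketch below only re-derives the per-call time and the percentages from the numbers printed in this log; the variable names are illustrative and not part of Theano.

```python
# Values copied from the "Function profiling" summary in this log.
n_calls = 12
time_in_call = 8.222084e+01    # Time in 12 calls to Function.__call__
time_in_fn = 8.221798e+01      # Time in Function.fn.__call__
time_in_thunks = 8.196062e+01  # Time in thunks


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


seconds_per_call = time_in_call / n_calls
fn_share = percent(time_in_fn, time_in_call)
thunk_share = percent(time_in_thunks, time_in_call)

print("seconds per call: %.3f" % seconds_per_call)
print("Function.fn.__call__ share: %.3f%%" % fn_share)
print("thunk share: %.3f%%" % thunk_share)
```

Both percentages come out relative to the time in Function.__call__, which matches the values in parentheses in the log.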
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
88.9% 88.9% 72.863s 1.01e+00s Py 72 6 theano.scan_module.scan_op.Scan
5.0% 93.9% 4.096s 1.90e-02s C 216 18 theano.sandbox.cuda.blas.GpuDot22
1.5% 95.4% 1.260s 1.31e-02s C 96 8 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.3% 96.7% 1.050s 2.30e-03s Py 456 38 theano.sandbox.cuda.basic_ops.GpuReshape
1.0% 97.7% 0.813s 7.06e-04s C 1152 96 theano.sandbox.cuda.basic_ops.GpuElemwise
0.9% 98.6% 0.766s 1.82e-03s C 420 35 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 99.1% 0.345s 4.80e-03s C 72 6 theano.sandbox.cuda.basic_ops.HostFromGpu
0.3% 99.3% 0.227s 3.71e-04s C 612 51 theano.sandbox.cuda.basic_ops.GpuAlloc
0.2% 99.6% 0.186s 5.18e-03s C 36 3 theano.sandbox.cuda.basic_ops.GpuJoin
0.2% 99.8% 0.151s 4.20e-03s Py 36 3 theano.tensor.basic.Split
0.1% 99.9% 0.116s 8.82e-04s C 132 11 theano.sandbox.cuda.basic_ops.GpuFromHost
0.0% 99.9% 0.021s 1.79e-03s C 12 1 theano.sandbox.cuda.nnet.GpuSoftmaxWithBias
0.0% 99.9% 0.016s 1.37e-03s C 12 1 theano.tensor.elemwise.Sum
0.0% 100.0% 0.016s 1.34e-03s C 12 1 theano.tensor.nnet.nnet.SoftmaxGrad
0.0% 100.0% 0.015s 1.22e-03s C 12 1 theano.sandbox.cuda.blas.GpuGemm
0.0% 100.0% 0.008s 1.36e-06s C 6084 507 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.002s 1.78e-06s C 1356 113 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.002s 2.30e-06s C 984 82 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.002s 1.94e-06s C 1068 89 theano.compile.ops.Shape_i
0.0% 100.0% 0.001s 9.74e-06s Py 144 12 theano.compile.ops.Rebroadcast
... (remaining 3 Classes account for 0.00%(0.00s) of the runtime)
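For further analysis, rows of the Class table can be split into their eight columns. This is a minimal sketch, assuming whitespace-separated fields exactly as printed here; `ClassRow` and `parse_class_row` are hypothetical helpers, not part of Theano.

```python
from typing import NamedTuple

# First three rows of the Class table, copied from this log.
SAMPLE = """\
88.9% 88.9% 72.863s 1.01e+00s Py 72 6 theano.scan_module.scan_op.Scan
5.0% 93.9% 4.096s 1.90e-02s C 216 18 theano.sandbox.cuda.blas.GpuDot22
1.5% 95.4% 1.260s 1.31e-02s C 96 8 theano.sandbox.cuda.basic_ops.GpuCAReduce
"""


class ClassRow(NamedTuple):
    pct_time: float
    cum_pct: float
    apply_time_s: float
    time_per_call_s: float
    impl: str      # "Py" or "C"
    n_call: int
    n_apply: int
    name: str


def parse_class_row(line):
    """Split one table row into typed fields (hypothetical helper)."""
    pct, cum, total, per_call, impl, n_call, n_apply, name = line.split(None, 7)
    return ClassRow(
        float(pct.rstrip("%")),
        float(cum.rstrip("%")),
        float(total.rstrip("s")),
        float(per_call.rstrip("s")),
        impl,
        int(n_call),
        int(n_apply),
        name.strip(),
    )


rows = [parse_class_row(line) for line in SAMPLE.splitlines() if line.strip()]
for row in rows:
    # Calls divided by apply nodes gives the calls per node.
    print(row.name, row.apply_time_s, row.n_call // row.n_apply)
```

For these rows the number of calls per apply node is 12, the same as the number of calls to Function.__call__ reported in the summary.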
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
38.1% 38.1% 31.267s 2.61e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn}
23.8% 62.0% 19.532s 1.63e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn}
17.3% 79.2% 14.149s 1.18e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn}
5.0% 84.2% 4.096s 1.90e-02s C 216 18 GpuDot22
3.9% 88.1% 3.184s 2.65e-01s Py 12 1 forall_inplace,gpu,scan_fn}
3.3% 91.4% 2.702s 2.25e-01s Py 12 1 forall_inplace,gpu,scan_fn}
2.5% 93.9% 2.028s 1.69e-01s Py 12 1 forall_inplace,gpu,scan_fn}
1.5% 95.4% 1.242s 1.72e-02s C 72 6 GpuCAReduce{add}{1,1,0}
1.3% 96.7% 1.044s 3.35e-03s Py 312 26 GpuReshape{2}
0.5% 97.2% 0.398s 5.53e-03s C 72 6 GpuIncSubtensor{Inc;:int64:}
0.4% 97.6% 0.345s 4.80e-03s C 72 6 HostFromGpu
0.3% 97.9% 0.278s 1.93e-03s C 144 12 GpuIncSubtensor{InplaceInc;int64::}
0.3% 98.2% 0.236s 2.46e-03s C 96 8 GpuElemwise{add,no_inplace}
0.3% 98.5% 0.227s 3.71e-04s C 612 51 GpuAlloc{memset_0=True}
0.2% 98.7% 0.186s 5.18e-03s C 36 3 GpuJoin
0.2% 98.9% 0.160s 1.49e-03s C 108 9 GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}
0.2% 99.1% 0.151s 4.20e-03s Py 36 3 Split{2}
0.1% 99.2% 0.116s 8.82e-04s C 132 11 GpuFromHost
0.1% 99.4% 0.113s 3.92e-04s C 288 24 GpuElemwise{Add}[(0, 0)]
0.1% 99.5% 0.087s 4.81e-04s C 180 15 GpuElemwise{sub,no_inplace}
... (remaining 126 Ops account for 0.51%(0.42s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
38.1% 38.1% 31.267s 2.61e+00s 12 1119 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},
23.8% 62.0% 19.532s 1.63e+00s 12 1175 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},
17.3% 79.2% 14.149s 1.18e+00s 12 1063 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},
3.9% 83.1% 3.184s 2.65e-01s 12 405 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.
3.3% 86.4% 2.702s 2.25e-01s 12 680 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.
2.5% 88.9% 2.028s 1.69e-01s 12 972 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.
0.5% 89.4% 0.379s 3.16e-02s 12 1149 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
0.5% 89.8% 0.379s 3.16e-02s 12 1146 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
0.4% 90.3% 0.363s 3.03e-02s 12 1147 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
0.4% 90.7% 0.361s 3.01e-02s 12 1150 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
0.4% 91.1% 0.350s 2.92e-02s 12 490 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
0.4% 91.6% 0.346s 2.89e-02s 12 501 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
0.4% 92.0% 0.340s 2.83e-02s 12 1139 GpuCAReduce{add}{1,1,0}(GpuIncSubtensor{InplaceInc;int64::}.0)
0.4% 92.4% 0.329s 2.75e-02s 12 1141 GpuCAReduce{add}{1,1,0}(GpuIncSubtensor{InplaceInc;int64::}.0)
0.3% 92.7% 0.280s 2.33e-02s 12 1090 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
0.3% 93.1% 0.279s 2.33e-02s 12 1093 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
0.3% 93.3% 0.228s 1.90e-02s 12 778 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
0.3% 93.6% 0.226s 1.88e-02s 12 768 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
0.3% 93.9% 0.220s 1.83e-02s 12 1091 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
0.3% 94.1% 0.220s 1.83e-02s 12 1094 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
... (remaining 1194 Apply instances account for 5.85%(4.80s) of the runtime)
Scan Op profiling ( scan_fn )
==================
Message: None
Time in 13 calls of the op (for a total of 2808 steps) 3.428722e+00s
Total time spent in calling the VM 2.907996e+00s (84.813%)
Total overhead (computing slices..) 5.207253e-01s (15.187%)
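The Scan summary above can be reduced to a per-step cost. The sketch below uses only the figures printed for this scan_fn op; the variable names are illustrative.

```python
# Values copied from the first "Scan Op profiling ( scan_fn )" summary.
total_s = 3.428722e+00     # time in all calls of the op
n_steps = 2808             # total number of scan steps
vm_s = 2.907996e+00        # time spent in calling the VM
overhead_s = 5.207253e-01  # overhead (computing slices..)

ms_per_step = 1000.0 * total_s / n_steps
vm_share = 100.0 * vm_s / total_s
overhead_share = 100.0 * overhead_s / total_s

print("ms per step: %.3f" % ms_per_step)
print("VM share: %.3f%%" % vm_share)
print("overhead share: %.3f%%" % overhead_share)
```

The VM time and the overhead add up to the total within rounding, so the two percentages in the log partition the op's time.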
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
82.7% 82.7% 2.353s 4.19e-04s C 5616 2 theano.sandbox.cuda.blas.GpuGemm
17.0% 99.7% 0.485s 4.32e-05s C 11232 4 theano.sandbox.cuda.basic_ops.GpuElemwise
0.3% 100.0% 0.009s 7.69e-07s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
82.7% 82.7% 2.353s 4.19e-04s C 5616 2 GpuGemm{no_inplace}
4.9% 87.5% 0.139s 4.95e-05s C 2808 1 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace}
4.6% 92.1% 0.131s 4.65e-05s C 2808 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}
3.8% 95.9% 0.108s 3.84e-05s C 2808 1 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}
3.8% 99.7% 0.108s 3.84e-05s C 2808 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}
0.3% 100.0% 0.009s 7.69e-07s C 11232 4 GpuSubtensor{::, int64:int64:}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
41.5% 41.5% 1.182s 4.21e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0})
41.1% 82.7% 1.171s 4.17e-04s 2808 1 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0})
4.9% 87.5% 0.139s 4.95e-05s 2808 7 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace}(cell_init_bck[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, col)>, <CudaNdarrayType(float32, col)>)
4.6% 92.1% 0.131s 4.65e-05s 2808 6 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0)
3.8% 95.9% 0.108s 3.84e-05s 2808 8 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}.0)
3.8% 99.7% 0.108s 3.84e-05s 2808 9 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}(GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}.0, <CudaNdarrayType(float32, col)>, hid_init_bck[t-1][cuda], <CudaNdarrayType(float32, col)>)
0.1% 99.8% 0.003s 1.13e-06s 2808 3 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{156})
0.1% 99.9% 0.002s 8.73e-07s 2808 5 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{156})
0.1% 99.9% 0.002s 5.42e-07s 2808 2 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{312}, Constant{468})
0.1% 100.0% 0.001s 5.33e-07s 2808 4 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{312}, Constant{468})
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Scan Op profiling ( scan_fn )
==================
Message: None
Time in 13 calls of the op (for a total of 2808 steps) 2.909503e+00s
Total time spent in calling the VM 2.319428e+00s (79.719%)
Total overhead (computing slices..) 5.900743e-01s (20.281%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
75.5% 75.5% 1.685s 3.00e-04s C 5616 2 theano.sandbox.cuda.blas.GpuGemm
24.1% 99.6% 0.538s 4.79e-05s C 11232 4 theano.sandbox.cuda.basic_ops.GpuElemwise
0.4% 100.0% 0.009s 7.85e-07s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
75.5% 75.5% 1.685s 3.00e-04s C 5616 2 GpuGemm{no_inplace}
6.8% 82.2% 0.151s 5.38e-05s C 2808 1 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace}
6.3% 88.6% 0.141s 5.02e-05s C 2808 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}
5.6% 94.1% 0.124s 4.42e-05s C 2808 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}
5.5% 99.6% 0.123s 4.36e-05s C 2808 1 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}
0.4% 100.0% 0.009s 7.85e-07s C 11232 4 GpuSubtensor{::, int64:int64:}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
42.0% 42.0% 0.938s 3.34e-04s 2808 1 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0})
33.5% 75.5% 0.747s 2.66e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0})
6.8% 82.2% 0.151s 5.38e-05s 2808 7 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace}(cell_init_bck[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, col)>, <CudaNdarrayType(float32, col)>)
6.3% 88.6% 0.141s 5.02e-05s 2808 6 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0)
5.6% 94.1% 0.124s 4.42e-05s 2808 9 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}(GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}.0, <CudaNdarrayType(float32, col)>, hid_init_bck[t-1][cuda], <CudaNdarrayType(float32, col)>)
5.5% 99.6% 0.123s 4.36e-05s 2808 8 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}.0)
0.1% 99.7% 0.003s 1.13e-06s 2808 3 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{300})
0.1% 99.9% 0.003s 1.06e-06s 2808 5 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{300})
0.1% 99.9% 0.001s 4.77e-07s 2808 2 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{600}, Constant{900})
0.1% 100.0% 0.001s 4.66e-07s 2808 4 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{600}, Constant{900})
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Scan Op profiling ( scan_fn )
==================
Message: None
Time in 13 calls of the op (for a total of 2808 steps) 2.186090e+00s
Total time spent in calling the VM 1.657587e+00s (75.824%)
Total overhead (computing slices..) 5.285032e-01s (24.176%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
73.6% 73.6% 1.174s 2.09e-04s C 5616 2 theano.sandbox.cuda.blas.GpuGemm
25.9% 99.5% 0.414s 3.68e-05s C 11232 4 theano.sandbox.cuda.basic_ops.GpuElemwise
0.5% 100.0% 0.008s 7.31e-07s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
73.6% 73.6% 1.174s 2.09e-04s C 5616 2 GpuGemm{no_inplace}
7.1% 80.6% 0.113s 4.01e-05s C 2808 1 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace}
6.8% 87.5% 0.109s 3.89e-05s C 2808 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}
6.2% 93.6% 0.098s 3.50e-05s C 2808 1 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}
5.9% 99.5% 0.093s 3.33e-05s C 2808 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}
0.5% 100.0% 0.008s 7.31e-07s C 11232 4 GpuSubtensor{::, int64:int64:}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
37.1% 37.1% 0.592s 2.11e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0})
36.5% 73.6% 0.583s 2.07e-04s 2808 1 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0})
7.1% 80.6% 0.113s 4.01e-05s 2808 7 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace}(cell_init_bck[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, col)>, <CudaNdarrayType(float32, col)>)
6.8% 87.5% 0.109s 3.89e-05s 2808 6 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0)
6.2% 93.6% 0.098s 3.50e-05s 2808 8 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}.0)
5.9% 99.5% 0.093s 3.33e-05s 2808 9 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}(GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}.0, <CudaNdarrayType(float32, col)>, hid_init_bck[t-1][cuda], <CudaNdarrayType(float32, col)>)
0.2% 99.7% 0.003s 1.13e-06s 2808 3 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{102})
0.2% 99.8% 0.002s 8.53e-07s 2808 5 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{102})
0.1% 99.9% 0.001s 4.75e-07s 2808 2 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{204}, Constant{306})
0.1% 100.0% 0.001s 4.69e-07s 2808 4 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{204}, Constant{306})
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
Scan Op profiling ( grad_of_scan_fn )
==================
Message: None
Time in 13 calls of the op (for a total of 2808 steps) 1.525270e+01s
Total time spent in calling the VM 1.357668e+01s (89.012%)
Total overhead (computing slices..) 1.676021e+00s (10.988%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
40.0% 40.0% 4.951s 2.20e-04s C 22464 8 theano.sandbox.cuda.blas.GpuGemm
32.5% 72.5% 4.020s 2.86e-05s C 140400 50 theano.sandbox.cuda.basic_ops.GpuElemwise
8.9% 81.4% 1.105s 2.19e-05s C 50544 18 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
8.6% 90.0% 1.066s 1.90e-04s C 5616 2 theano.sandbox.cuda.blas.GpuDot22
8.0% 98.0% 0.995s 3.54e-05s C 28080 10 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.8% 99.8% 0.221s 3.94e-05s C 5616 2 theano.sandbox.cuda.basic_ops.GpuAlloc
0.1% 99.9% 0.013s 1.12e-06s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 100.0% 0.009s 7.81e-07s C 11232 4 theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
27.7% 27.7% 3.434s 2.45e-04s C 14040 5 GpuGemm{no_inplace}
12.3% 40.0% 1.517s 1.80e-04s C 8424 3 GpuGemm{inplace}
8.6% 48.6% 1.066s 1.90e-04s C 5616 2 GpuDot22
8.0% 56.6% 0.995s 3.54e-05s C 28080 10 GpuCAReduce{add}{1,0}
6.9% 63.5% 0.852s 3.37e-05s C 25272 9 GpuElemwise{mul,no_inplace}
4.8% 68.3% 0.596s 3.54e-05s C 16848 6 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}
3.7% 72.0% 0.454s 1.62e-05s C 28080 10 GpuElemwise{Mul}[(0, 0)]
3.6% 75.6% 0.441s 1.57e-05s C 28080 10 GpuIncSubtensor{InplaceInc;int64:int64:}
2.8% 78.3% 0.342s 2.03e-05s C 16848 6 GpuIncSubtensor{InplaceInc;::, int64:int64:}
2.6% 80.9% 0.322s 5.74e-05s C 5616 2 GpuIncSubtensor{Inc;::, int64:int64:}
2.2% 83.2% 0.276s 3.27e-05s C 8424 3 GpuElemwise{Composite{[mul(mul(i0, i1), i2)]},no_inplace}
2.0% 85.1% 0.242s 4.30e-05s C 5616 2 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}
1.8% 86.9% 0.221s 3.94e-05s C 5616 2 GpuAlloc{memset_0=True}
1.2% 88.1% 0.151s 2.68e-05s C 5616 2 GpuElemwise{Add}[(0, 0)]
1.1% 89.3% 0.141s 2.51e-05s C 5616 2 GpuElemwise{Tanh}[(0, 0)]
1.0% 90.3% 0.126s 4.50e-05s C 2808 1 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace}
1.0% 91.3% 0.123s 4.37e-05s C 2808 1 GpuElemwise{Composite{[add(add(add(add(mul(i0, i1), mul(i2, i3)), mul(i4, i5)), mul(i6, i7)), i8)]},no_inplace}
0.9% 92.2% 0.116s 4.15e-05s C 2808 1 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace}
0.9% 93.1% 0.111s 3.95e-05s C 2808 1 GpuElemwise{Composite{[add(i0, add(*2 -> add(i1, *1 -> add(i1, i1)), add(*1, *2)))]},no_inplace}
0.8% 93.9% 0.105s 3.74e-05s C 2808 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}
... (remaining 13 Ops account for 6.05%(0.75s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
6.7% 6.7% 0.826s 2.94e-04s 2808 73 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0})
6.1% 12.8% 0.754s 2.69e-04s 2808 57 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0})
5.1% 17.8% 0.628s 2.24e-04s 2808 4 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0})
5.0% 22.8% 0.615s 2.19e-04s 2808 74 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0})
4.9% 27.7% 0.610s 2.17e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0})
4.6% 32.3% 0.565s 2.01e-04s 2808 67 GpuDot22(GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda])
4.2% 36.5% 0.523s 1.86e-04s 2808 58 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_bck_copy.T_replace[cuda], TensorConstant{1.0})
4.2% 40.7% 0.520s 1.85e-04s 2808 76 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0})
4.0% 44.8% 0.501s 1.79e-04s 2808 68 GpuDot22(hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0)
3.8% 48.6% 0.474s 1.69e-04s 2808 77 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0})
1.4% 50.0% 0.174s 6.20e-05s 2808 48 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{102})
1.2% 51.2% 0.148s 5.28e-05s 2808 46 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{102})
1.1% 52.3% 0.134s 4.76e-05s 2808 13 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>)
1.0% 53.3% 0.126s 4.50e-05s 2808 11 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace}(<CudaNdarrayType(float32, vector)>, <CudaNdarrayType(float32, vector)>)
1.0% 54.3% 0.123s 4.38e-05s 2808 28 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}(GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, matrix)>)
1.0% 55.3% 0.123s 4.37e-05s 2808 40 GpuElemwise{Composite{[add(add(add(add(mul(i0, i1), mul(i2, i3)), mul(i4, i5)), mul(i6, i7)), i8)]},no_inplace}(GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, <CudaNdarrayType(float32, row)>, GpuElemwise{mul,no_inplace}.0, <CudaNdarrayType(float32, matrix)>, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace}.0, <CudaNdarrayType(float32, row)>, <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float3
1.0% 56.3% 0.119s 4.23e-05s 2808 27 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}(GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, matrix)>)
0.9% 57.2% 0.117s 4.16e-05s 2808 51 GpuCAReduce{add}{1,0}(GpuElemwise{Mul}[(0, 0)].0)
0.9% 58.1% 0.116s 4.15e-05s 2808 36 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}.0, CudaNdarrayConstant{[[ 1.]]})
0.9% 59.1% 0.115s 4.09e-05s 2808 80 GpuCAReduce{add}{1,0}(GpuElemwise{Mul}[(0, 0)].0)
... (remaining 78 Apply instances account for 40.94%(5.07s) of the runtime)
Scan Op profiling ( grad_of_scan_fn )
==================
Message: None
Time in 13 calls of the op (for a total of 2808 steps) 3.372918e+01s
Total time spent in calling the VM 3.095004e+01s (91.760%)
Total overhead (computing slices..) 2.779134e+00s (8.240%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
57.3% 57.3% 17.038s 7.58e-04s C 22464 8 theano.sandbox.cuda.blas.GpuGemm
17.1% 74.3% 5.077s 3.62e-05s C 140400 50 theano.sandbox.cuda.basic_ops.GpuElemwise
16.1% 90.4% 4.778s 8.51e-04s C 5616 2 theano.sandbox.cuda.blas.GpuDot22
5.3% 95.7% 1.583s 3.13e-05s C 50544 18 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
3.3% 99.0% 0.987s 3.51e-05s C 28080 10 theano.sandbox.cuda.basic_ops.GpuCAReduce
0.9% 99.9% 0.263s 4.68e-05s C 5616 2 theano.sandbox.cuda.basic_ops.GpuAlloc
0.0% 100.0% 0.013s 1.20e-06s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.010s 8.76e-07s C 11232 4 theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
36.7% 36.7% 10.926s 7.78e-04s C 14040 5 GpuGemm{no_inplace}
20.5% 57.3% 6.111s 7.25e-04s C 8424 3 GpuGemm{inplace}
16.1% 73.3% 4.778s 8.51e-04s C 5616 2 GpuDot22
3.3% 76.6% 0.987s 3.51e-05s C 28080 10 GpuCAReduce{add}{1,0}
3.0% 79.6% 0.891s 3.53e-05s C 25272 9 GpuElemwise{mul,no_inplace}
2.8% 82.5% 0.837s 4.97e-05s C 16848 6 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}
2.2% 84.6% 0.642s 1.14e-04s C 5616 2 GpuIncSubtensor{Inc;::, int64:int64:}
1.8% 86.4% 0.543s 1.93e-05s C 28080 10 GpuElemwise{Mul}[(0, 0)]
1.6% 88.1% 0.482s 2.86e-05s C 16848 6 GpuIncSubtensor{InplaceInc;::, int64:int64:}
1.6% 89.7% 0.475s 8.45e-05s C 5616 2 GpuElemwise{Add}[(0, 0)]
1.5% 91.2% 0.459s 1.63e-05s C 28080 10 GpuIncSubtensor{InplaceInc;int64:int64:}
1.0% 92.2% 0.288s 3.42e-05s C 8424 3 GpuElemwise{Composite{[mul(mul(i0, i1), i2)]},no_inplace}
0.9% 93.1% 0.263s 4.68e-05s C 5616 2 GpuAlloc{memset_0=True}
0.8% 93.9% 0.246s 4.38e-05s C 5616 2 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}
0.7% 94.6% 0.208s 7.39e-05s C 2808 1 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace}
0.6% 95.2% 0.180s 6.41e-05s C 2808 1 GpuElemwise{Composite{[add(add(i0, i1), i2)]}}[(0, 0)]
0.5% 95.7% 0.146s 2.61e-05s C 5616 2 GpuElemwise{Tanh}[(0, 0)]
0.5% 96.1% 0.138s 2.46e-05s C 5616 2 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]}}[(0, 0)]
0.5% 96.6% 0.134s 4.77e-05s C 2808 1 GpuElemwise{Composite{[add(add(add(add(mul(i0, i1), mul(i2, i3)), mul(i4, i5)), mul(i6, i7)), i8)]},no_inplace}
0.4% 97.0% 0.122s 4.35e-05s C 2808 1 GpuElemwise{Composite{[add(add(add(add(add(i0, i1), mul(i2, i3)), add(add(i4, i5), mul(i6, i3))), add(add(add(i7, i8), i9), mul(i10, i3))), i11)]}}[(0, 0)]
... (remaining 13 Ops account for 3.00%(0.89s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
13.2% 13.2% 3.918s 1.40e-03s 2808 73 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0})
12.6% 25.7% 3.737s 1.33e-03s 2808 57 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0})
11.2% 37.0% 3.340s 1.19e-03s 2808 68 GpuDot22(hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0)
10.9% 47.9% 3.251s 1.16e-03s 2808 77 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0})
5.1% 53.0% 1.527s 5.44e-04s 2808 74 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0})
4.8% 57.9% 1.438s 5.12e-04s 2808 67 GpuDot22(GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda])
4.8% 62.7% 1.433s 5.10e-04s 2808 58 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_bck_copy.T_replace[cuda], TensorConstant{1.0})
4.8% 67.5% 1.428s 5.08e-04s 2808 76 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0})
3.3% 70.8% 0.979s 3.49e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0})
2.6% 73.3% 0.765s 2.73e-04s 2808 4 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0})
1.3% 74.7% 0.401s 1.43e-04s 2808 86 GpuElemwise{Add}[(0, 0)](GpuGemm{inplace}.0, GpuGemm{no_inplace}.0)
1.1% 75.8% 0.329s 1.17e-04s 2808 48 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{300})
1.1% 76.8% 0.313s 1.12e-04s 2808 46 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{300})
0.7% 77.5% 0.208s 7.39e-05s 2808 11 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace}(<CudaNdarrayType(float32, vector)>, <CudaNdarrayType(float32, vector)>)
0.7% 78.2% 0.194s 6.92e-05s 2808 8 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>)
0.6% 78.8% 0.193s 6.88e-05s 2808 38 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{Tanh}[(0, 0)].0, GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}.0, GpuElemwise{sub,no_inplace}.0)
0.6% 79.4% 0.180s 6.41e-05s 2808 72 GpuElemwise{Composite{[add(add(i0, i1), i2)]}}[(0, 0)](GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0)
0.6% 80.0% 0.168s 5.99e-05s 2808 13 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>)
0.5% 80.5% 0.144s 5.13e-05s 2808 21 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
0.5% 80.9% 0.134s 4.78e-05s 2808 24 GpuElemwise{mul,no_inplace}(GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, <CudaNdarrayType(float32, row)>)
... (remaining 78 Apply instances account for 19.06%(5.67s) of the runtime)
Scan Op profiling ( grad_of_scan_fn )
==================
Message: None
Time in 13 calls of the op (for a total of 2808 steps) 2.108672e+01s
Total time spent in calling the VM 1.903834e+01s (90.286%)
Total overhead (computing slices..) 2.048384e+00s (9.714%)
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
48.9% 48.9% 8.076s 3.89e-04s C 20736 8 theano.sandbox.cuda.blas.GpuGemm
24.8% 73.7% 4.087s 3.15e-05s C 129600 50 theano.sandbox.cuda.basic_ops.GpuElemwise
12.4% 86.1% 2.054s 3.96e-04s C 5184 2 theano.sandbox.cuda.blas.GpuDot22
6.5% 92.6% 1.076s 2.31e-05s C 46656 18 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
5.6% 98.2% 0.920s 3.55e-05s C 25920 10 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.7% 99.9% 0.276s 5.32e-05s C 5184 2 theano.sandbox.cuda.basic_ops.GpuAlloc
0.1% 99.9% 0.012s 1.19e-06s C 10368 4 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.1% 100.0% 0.009s 8.27e-07s C 10368 4 theano.compile.ops.Shape_i
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
32.0% 32.0% 5.278s 4.07e-04s C 12960 5 GpuGemm{no_inplace}
16.9% 48.9% 2.798s 3.60e-04s C 7776 3 GpuGemm{inplace}
12.4% 61.4% 2.054s 3.96e-04s C 5184 2 GpuDot22
5.6% 66.9% 0.920s 3.55e-05s C 25920 10 GpuCAReduce{add}{1,0}
4.6% 71.5% 0.760s 3.26e-05s C 23328 9 GpuElemwise{mul,no_inplace}
3.5% 75.0% 0.574s 3.69e-05s C 15552 6 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}
2.6% 77.6% 0.421s 1.62e-05s C 25920 10 GpuElemwise{Mul}[(0, 0)]
2.4% 80.0% 0.400s 1.54e-05s C 25920 10 GpuIncSubtensor{InplaceInc;int64:int64:}
2.1% 82.1% 0.344s 6.63e-05s C 5184 2 GpuIncSubtensor{Inc;::, int64:int64:}
2.0% 84.1% 0.333s 2.14e-05s C 15552 6 GpuIncSubtensor{InplaceInc;::, int64:int64:}
1.7% 85.7% 0.276s 5.32e-05s C 5184 2 GpuAlloc{memset_0=True}
1.7% 87.4% 0.275s 1.06e-04s C 2592 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}
1.5% 88.9% 0.251s 3.23e-05s C 7776 3 GpuElemwise{Composite{[mul(mul(i0, i1), i2)]},no_inplace}
1.5% 90.4% 0.244s 9.43e-05s C 2592 1 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace}
1.4% 91.8% 0.234s 4.52e-05s C 5184 2 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}
1.1% 92.9% 0.181s 3.50e-05s C 5184 2 GpuElemwise{Add}[(0, 0)]
0.8% 93.7% 0.126s 2.43e-05s C 5184 2 GpuElemwise{Tanh}[(0, 0)]
0.6% 94.3% 0.105s 4.06e-05s C 2592 1 GpuElemwise{Composite{[add(add(add(add(mul(i0, i1), mul(i2, i3)), mul(i4, i5)), mul(i6, i7)), i8)]},no_inplace}
0.6% 95.0% 0.104s 2.01e-05s C 5184 2 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]}}[(0, 0)]
0.6% 95.6% 0.101s 3.90e-05s C 2592 1 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace}
... (remaining 13 Ops account for 4.42%(0.73s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
7.0% 7.0% 1.153s 4.45e-04s 2592 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0})
6.5% 13.4% 1.067s 4.11e-04s 2592 4 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0})
6.3% 19.8% 1.047s 4.04e-04s 2592 74 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0})
6.3% 26.0% 1.033s 3.98e-04s 2592 73 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0})
6.2% 32.3% 1.031s 3.98e-04s 2592 67 GpuDot22(GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda])
6.2% 38.5% 1.022s 3.94e-04s 2592 68 GpuDot22(hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0)
5.9% 44.4% 0.979s 3.78e-04s 2592 57 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0})
5.9% 50.3% 0.975s 3.76e-04s 2592 58 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_bck_copy.T_replace[cuda], TensorConstant{1.0})
5.8% 56.2% 0.965s 3.72e-04s 2592 76 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0})
5.2% 61.4% 0.858s 3.31e-04s 2592 77 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0})
1.7% 63.0% 0.275s 1.06e-04s 2592 64 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}(GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, <CudaNdarrayType(float32, row)>, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]}}[(0, 0)].0, <CudaNdarrayType(float32, row)>)
1.5% 64.5% 0.244s 9.43e-05s 2592 36 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}.0, CudaNdarrayConstant{[[ 1.]]})
1.2% 65.7% 0.195s 7.54e-05s 2592 48 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{156})
1.0% 66.7% 0.166s 6.42e-05s 2592 17 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
0.9% 67.6% 0.148s 5.72e-05s 2592 46 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{156})
0.8% 68.4% 0.130s 5.03e-05s 2592 13 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>)
0.8% 69.1% 0.125s 4.81e-05s 2592 86 GpuElemwise{Add}[(0, 0)](GpuGemm{inplace}.0, GpuGemm{no_inplace}.0)
0.8% 69.9% 0.124s 4.80e-05s 2592 27 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}(GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, matrix)>)
0.7% 70.6% 0.110s 4.24e-05s 2592 28 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}(GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, matrix)>)
0.7% 71.2% 0.109s 4.22e-05s 2592 21 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0)
... (remaining 78 Apply instances account for 28.79%(4.75s) of the runtime)
Function profiling
==================
Message: experiment.py:197
Time in 0 calls to Function.__call__: 0.000000e+00s
Total compile time: 1.256202e+01s
Number of Apply nodes: 0
Theano Optimizer time: 1.010402e+01s
Theano validate time: 4.241929e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 2.288918e+00s
Function profiling
==================
Message: experiment.py:198
Time in 0 calls to Function.__call__: 0.000000e+00s
Total compile time: 1.454442e+01s
Number of Apply nodes: 0
Theano Optimizer time: 1.051052e+01s
Theano validate time: 2.100599e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 3.858204e+00s
Function profiling
==================
Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
Time in 12 calls to Function.__call__: 8.222084e+01s
Time in Function.fn.__call__: 8.221798e+01s (99.997%)
Time in thunks: 8.196062e+01s (99.684%)
Total compile time: 2.780161e+02s
Number of Apply nodes: 1214
Theano Optimizer time: 1.881631e+02s
Theano validate time: 3.844473e+00s
Theano Linker time (includes C, CUDA code generation/compiling): 8.912894e+01s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
88.9% 88.9% 72.863s 1.01e+00s Py 72 6 theano.scan_module.scan_op.Scan
5.0% 93.9% 4.096s 1.90e-02s C 216 18 theano.sandbox.cuda.blas.GpuDot22
1.5% 95.4% 1.260s 1.31e-02s C 96 8 theano.sandbox.cuda.basic_ops.GpuCAReduce
1.3% 96.7% 1.050s 2.30e-03s Py 456 38 theano.sandbox.cuda.basic_ops.GpuReshape
1.0% 97.7% 0.813s 7.06e-04s C 1152 96 theano.sandbox.cuda.basic_ops.GpuElemwise
0.9% 98.6% 0.766s 1.82e-03s C 420 35 theano.sandbox.cuda.basic_ops.GpuIncSubtensor
0.4% 99.1% 0.345s 4.80e-03s C 72 6 theano.sandbox.cuda.basic_ops.HostFromGpu
0.3% 99.3% 0.227s 3.71e-04s C 612 51 theano.sandbox.cuda.basic_ops.GpuAlloc
0.2% 99.6% 0.186s 5.18e-03s C 36 3 theano.sandbox.cuda.basic_ops.GpuJoin
0.2% 99.8% 0.151s 4.20e-03s Py 36 3 theano.tensor.basic.Split
0.1% 99.9% 0.116s 8.82e-04s C 132 11 theano.sandbox.cuda.basic_ops.GpuFromHost
0.0% 99.9% 0.021s 1.79e-03s C 12 1 theano.sandbox.cuda.nnet.GpuSoftmaxWithBias
0.0% 99.9% 0.016s 1.37e-03s C 12 1 theano.tensor.elemwise.Sum
0.0% 100.0% 0.016s 1.34e-03s C 12 1 theano.tensor.nnet.nnet.SoftmaxGrad
0.0% 100.0% 0.015s 1.22e-03s C 12 1 theano.sandbox.cuda.blas.GpuGemm
0.0% 100.0% 0.008s 1.36e-06s C 6084 507 theano.tensor.elemwise.Elemwise
0.0% 100.0% 0.002s 1.78e-06s C 1356 113 theano.sandbox.cuda.basic_ops.GpuDimShuffle
0.0% 100.0% 0.002s 2.30e-06s C 984 82 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.0% 100.0% 0.002s 1.94e-06s C 1068 89 theano.compile.ops.Shape_i
0.0% 100.0% 0.001s 9.74e-06s Py 144 12 theano.compile.ops.Rebroadcast
... (remaining 3 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
38.1% 38.1% 31.267s 2.61e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn}
23.8% 62.0% 19.532s 1.63e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn}
17.3% 79.2% 14.149s 1.18e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn}
5.0% 84.2% 4.096s 1.90e-02s C 216 18 GpuDot22
3.9% 88.1% 3.184s 2.65e-01s Py 12 1 forall_inplace,gpu,scan_fn}
3.3% 91.4% 2.702s 2.25e-01s Py 12 1 forall_inplace,gpu,scan_fn}
2.5% 93.9% 2.028s 1.69e-01s Py 12 1 forall_inplace,gpu,scan_fn}
1.5% 95.4% 1.242s 1.72e-02s C 72 6 GpuCAReduce{add}{1,1,0}
1.3% 96.7% 1.044s 3.35e-03s Py 312 26 GpuReshape{2}
0.5% 97.2% 0.398s 5.53e-03s C 72 6 GpuIncSubtensor{Inc;:int64:}
0.4% 97.6% 0.345s 4.80e-03s C 72 6 HostFromGpu
0.3% 97.9% 0.278s 1.93e-03s C 144 12 GpuIncSubtensor{InplaceInc;int64::}
0.3% 98.2% 0.236s 2.46e-03s C 96 8 GpuElemwise{add,no_inplace}
0.3% 98.5% 0.227s 3.71e-04s C 612 51 GpuAlloc{memset_0=True}
0.2% 98.7% 0.186s 5.18e-03s C 36 3 GpuJoin
0.2% 98.9% 0.160s 1.49e-03s C 108 9 GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}
0.2% 99.1% 0.151s 4.20e-03s Py 36 3 Split{2}
0.1% 99.2% 0.116s 8.82e-04s C 132 11 GpuFromHost
0.1% 99.4% 0.113s 3.92e-04s C 288 24 GpuElemwise{Add}[(0, 0)]
0.1% 99.5% 0.087s 4.81e-04s C 180 15 GpuElemwise{sub,no_inplace}
... (remaining 126 Ops account for 0.51%(0.42s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
38.1% 38.1% 31.267s 2.61e+00s 12 1119 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},
23.8% 62.0% 19.532s 1.63e+00s 12 1175 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},
17.3% 79.2% 14.149s 1.18e+00s 12 1063 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},
3.9% 83.1% 3.184s 2.65e-01s 12 405 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.
3.3% 86.4% 2.702s 2.25e-01s 12 680 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.
2.5% 88.9% 2.028s 1.69e-01s 12 972 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.
0.5% 89.4% 0.379s 3.16e-02s 12 1149 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
0.5% 89.8% 0.379s 3.16e-02s 12 1146 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
0.4% 90.3% 0.363s 3.03e-02s 12 1147 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
0.4% 90.7% 0.361s 3.01e-02s 12 1150 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
0.4% 91.1% 0.350s 2.92e-02s 12 490 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
0.4% 91.6% 0.346s 2.89e-02s 12 501 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
0.4% 92.0% 0.340s 2.83e-02s 12 1139 GpuCAReduce{add}{1,1,0}(GpuIncSubtensor{InplaceInc;int64::}.0)
0.4% 92.4% 0.329s 2.75e-02s 12 1141 GpuCAReduce{add}{1,1,0}(GpuIncSubtensor{InplaceInc;int64::}.0)
0.3% 92.7% 0.280s 2.33e-02s 12 1090 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
0.3% 93.1% 0.279s 2.33e-02s 12 1093 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0)
0.3% 93.3% 0.228s 1.90e-02s 12 778 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
0.3% 93.6% 0.226s 1.88e-02s 12 768 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
0.3% 93.9% 0.220s 1.83e-02s 12 1091 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
0.3% 94.1% 0.220s 1.83e-02s 12 1094 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0)
... (remaining 1194 Apply instances account for 5.85%(4.80s) of the runtime)