Created
January 16, 2015 20:11
Function profiling | |
================== | |
Message: experiment.py:196 | |
Time in 12 calls to Function.__call__: 8.222084e+01s | |
Time in Function.fn.__call__: 8.221798e+01s (99.997%) | |
Time in thunks: 8.196062e+01s (99.684%) | |
Total compile time: 2.509097e+02s | |
Number of Apply nodes: 1214 | |
Theano Optimizer time: 1.675486e+02s | |
Theano validate time: 1.319681e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 8.298181e+01s | |
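The percentages in the header above are each time divided by the total time in Function.__call__. A minimal Python sketch of that arithmetic follows, using only the numbers printed above. The variable names are illustrative and not part of Theano, and the compile-time share at the end is my own derived figure rather than something the profiler reports.

```python
# Values copied from the "Function profiling" header above.
time_call = 8.222084e+01    # Time in 12 calls to Function.__call__
time_fn = 8.221798e+01      # Time in Function.fn.__call__
time_thunks = 8.196062e+01  # Time in thunks
n_calls = 12

# Each percentage is relative to the total time in Function.__call__.
pct_fn = 100.0 * time_fn / time_call
pct_thunks = 100.0 * time_thunks / time_call

# Average wall time of one call to the compiled function.
per_call = time_call / n_calls

# Compile-time figures from the same header.
total_compile = 2.509097e+02
optimizer = 1.675486e+02
linker = 8.298181e+01

# Share of the total compile time covered by optimizer + linker (derived here).
compile_share = 100.0 * (optimizer + linker) / total_compile

print("fn: %.3f%%  thunks: %.3f%%" % (pct_fn, pct_thunks))
print("per call: %.2fs  optimizer+linker share of compile: %.1f%%"
      % (per_call, compile_share))
```

Read this way, the run spends about as long compiling (roughly 251 s) as it would in some 36 calls of the compiled function.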
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
88.9% 88.9% 72.863s 1.01e+00s Py 72 6 theano.scan_module.scan_op.Scan | |
5.0% 93.9% 4.096s 1.90e-02s C 216 18 theano.sandbox.cuda.blas.GpuDot22 | |
1.5% 95.4% 1.260s 1.31e-02s C 96 8 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
1.3% 96.7% 1.050s 2.30e-03s Py 456 38 theano.sandbox.cuda.basic_ops.GpuReshape | |
1.0% 97.7% 0.813s 7.06e-04s C 1152 96 theano.sandbox.cuda.basic_ops.GpuElemwise | |
0.9% 98.6% 0.766s 1.82e-03s C 420 35 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.4% 99.1% 0.345s 4.80e-03s C 72 6 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.3% 99.3% 0.227s 3.71e-04s C 612 51 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.2% 99.6% 0.186s 5.18e-03s C 36 3 theano.sandbox.cuda.basic_ops.GpuJoin | |
0.2% 99.8% 0.151s 4.20e-03s Py 36 3 theano.tensor.basic.Split | |
0.1% 99.9% 0.116s 8.82e-04s C 132 11 theano.sandbox.cuda.basic_ops.GpuFromHost | |
0.0% 99.9% 0.021s 1.79e-03s C 12 1 theano.sandbox.cuda.nnet.GpuSoftmaxWithBias | |
0.0% 99.9% 0.016s 1.37e-03s C 12 1 theano.tensor.elemwise.Sum | |
0.0% 100.0% 0.016s 1.34e-03s C 12 1 theano.tensor.nnet.nnet.SoftmaxGrad | |
0.0% 100.0% 0.015s 1.22e-03s C 12 1 theano.sandbox.cuda.blas.GpuGemm | |
0.0% 100.0% 0.008s 1.36e-06s C 6084 507 theano.tensor.elemwise.Elemwise | |
0.0% 100.0% 0.002s 1.78e-06s C 1356 113 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.0% 100.0% 0.002s 2.30e-06s C 984 82 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.0% 100.0% 0.002s 1.94e-06s C 1068 89 theano.compile.ops.Shape_i | |
0.0% 100.0% 0.001s 9.74e-06s Py 144 12 theano.compile.ops.Rebroadcast | |
... (remaining 3 Classes account for 0.00%(0.00s) of the runtime) | |
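The columns of the Class table relate to each other in a simple way: <time per call> is <apply time> divided by <#call>, and in this run every <#call> equals <#apply> times the 12 function calls. The sketch below parses two rows copied from the table and checks both relations. `Row` and `parse_class_row` are hypothetical helpers written for this illustration, not part of Theano.

```python
from collections import namedtuple

# Field names are illustrative; they mirror the column headers of the Class table.
Row = namedtuple(
    "Row", "pct cum_pct apply_time per_call impl n_call n_apply name")


def parse_class_row(line):
    """Parse one row of the Class table; tokens after the class name are ignored."""
    f = line.split()
    return Row(
        pct=float(f[0].rstrip("%")),
        cum_pct=float(f[1].rstrip("%")),
        apply_time=float(f[2].rstrip("s")),
        per_call=float(f[3].rstrip("s")),
        impl=f[4],
        n_call=int(f[5]),
        n_apply=int(f[6]),
        name=f[7],
    )


# Two rows copied verbatim from the Class table above.
rows = [parse_class_row(line) for line in [
    "88.9% 88.9% 72.863s 1.01e+00s Py 72 6 theano.scan_module.scan_op.Scan",
    "5.0% 93.9% 4.096s 1.90e-02s C 216 18 theano.sandbox.cuda.blas.GpuDot22",
]]

for r in rows:
    # Recompute <time per call> and the number of calls per Apply node.
    print(r.name, "%.2e" % (r.apply_time / r.n_call), r.n_call // r.n_apply)
```

The same split-on-whitespace approach should work for the Ops tables as long as the op name contains no spaces, which does not hold for the Composite ops further down.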
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
38.1% 38.1% 31.267s 2.61e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn} | |
23.8% 62.0% 19.532s 1.63e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn} | |
17.3% 79.2% 14.149s 1.18e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn} | |
5.0% 84.2% 4.096s 1.90e-02s C 216 18 GpuDot22 | |
3.9% 88.1% 3.184s 2.65e-01s Py 12 1 forall_inplace,gpu,scan_fn} | |
3.3% 91.4% 2.702s 2.25e-01s Py 12 1 forall_inplace,gpu,scan_fn} | |
2.5% 93.9% 2.028s 1.69e-01s Py 12 1 forall_inplace,gpu,scan_fn} | |
1.5% 95.4% 1.242s 1.72e-02s C 72 6 GpuCAReduce{add}{1,1,0} | |
1.3% 96.7% 1.044s 3.35e-03s Py 312 26 GpuReshape{2} | |
0.5% 97.2% 0.398s 5.53e-03s C 72 6 GpuIncSubtensor{Inc;:int64:} | |
0.4% 97.6% 0.345s 4.80e-03s C 72 6 HostFromGpu | |
0.3% 97.9% 0.278s 1.93e-03s C 144 12 GpuIncSubtensor{InplaceInc;int64::} | |
0.3% 98.2% 0.236s 2.46e-03s C 96 8 GpuElemwise{add,no_inplace} | |
0.3% 98.5% 0.227s 3.71e-04s C 612 51 GpuAlloc{memset_0=True} | |
0.2% 98.7% 0.186s 5.18e-03s C 36 3 GpuJoin | |
0.2% 98.9% 0.160s 1.49e-03s C 108 9 GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace} | |
0.2% 99.1% 0.151s 4.20e-03s Py 36 3 Split{2} | |
0.1% 99.2% 0.116s 8.82e-04s C 132 11 GpuFromHost | |
0.1% 99.4% 0.113s 3.92e-04s C 288 24 GpuElemwise{Add}[(0, 0)] | |
0.1% 99.5% 0.087s 4.81e-04s C 180 15 GpuElemwise{sub,no_inplace} | |
... (remaining 126 Ops account for 0.51%(0.42s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> | |
38.1% 38.1% 31.267s 2.61e+00s 12 1119 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]}, | |
23.8% 62.0% 19.532s 1.63e+00s 12 1175 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]}, | |
17.3% 79.2% 14.149s 1.18e+00s 12 1063 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]}, | |
3.9% 83.1% 3.184s 2.65e-01s 12 405 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}. | |
3.3% 86.4% 2.702s 2.25e-01s 12 680 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}. | |
2.5% 88.9% 2.028s 1.69e-01s 12 972 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}. | |
0.5% 89.4% 0.379s 3.16e-02s 12 1149 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0) | |
0.5% 89.8% 0.379s 3.16e-02s 12 1146 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0) | |
0.4% 90.3% 0.363s 3.03e-02s 12 1147 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
0.4% 90.7% 0.361s 3.01e-02s 12 1150 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
0.4% 91.1% 0.350s 2.92e-02s 12 490 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
0.4% 91.6% 0.346s 2.89e-02s 12 501 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
0.4% 92.0% 0.340s 2.83e-02s 12 1139 GpuCAReduce{add}{1,1,0}(GpuIncSubtensor{InplaceInc;int64::}.0) | |
0.4% 92.4% 0.329s 2.75e-02s 12 1141 GpuCAReduce{add}{1,1,0}(GpuIncSubtensor{InplaceInc;int64::}.0) | |
0.3% 92.7% 0.280s 2.33e-02s 12 1090 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0) | |
0.3% 93.1% 0.279s 2.33e-02s 12 1093 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0) | |
0.3% 93.3% 0.228s 1.90e-02s 12 778 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
0.3% 93.6% 0.226s 1.88e-02s 12 768 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
0.3% 93.9% 0.220s 1.83e-02s 12 1091 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
0.3% 94.1% 0.220s 1.83e-02s 12 1094 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
... (remaining 1194 Apply instances account for 5.85%(4.80s) of the runtime) | |
Scan Op profiling ( scan_fn ) | |
================== | |
Message: None | |
Time in 13 calls of the op (for a total of 2808 steps) 3.428722e+00s | |
Total time spent in calling the VM 2.907996e+00s (84.813%) | |
Total overhead (computing slices..) 5.207253e-01s (15.187%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
82.7% 82.7% 2.353s 4.19e-04s C 5616 2 theano.sandbox.cuda.blas.GpuGemm | |
17.0% 99.7% 0.485s 4.32e-05s C 11232 4 theano.sandbox.cuda.basic_ops.GpuElemwise | |
0.3% 100.0% 0.009s 7.69e-07s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
82.7% 82.7% 2.353s 4.19e-04s C 5616 2 GpuGemm{no_inplace} | |
4.9% 87.5% 0.139s 4.95e-05s C 2808 1 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace} | |
4.6% 92.1% 0.131s 4.65e-05s C 2808 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace} | |
3.8% 95.9% 0.108s 3.84e-05s C 2808 1 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace} | |
3.8% 99.7% 0.108s 3.84e-05s C 2808 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace} | |
0.3% 100.0% 0.009s 7.69e-07s C 11232 4 GpuSubtensor{::, int64:int64:} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> | |
41.5% 41.5% 1.182s 4.21e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0}) | |
41.1% 82.7% 1.171s 4.17e-04s 2808 1 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0}) | |
4.9% 87.5% 0.139s 4.95e-05s 2808 7 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace}(cell_init_bck[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, col)>, <CudaNdarrayType(float32, col)>) | |
4.6% 92.1% 0.131s 4.65e-05s 2808 6 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0) | |
3.8% 95.9% 0.108s 3.84e-05s 2808 8 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}.0) | |
3.8% 99.7% 0.108s 3.84e-05s 2808 9 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}(GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}.0, <CudaNdarrayType(float32, col)>, hid_init_bck[t-1][cuda], <CudaNdarrayType(float32, col)>) | |
0.1% 99.8% 0.003s 1.13e-06s 2808 3 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{156}) | |
0.1% 99.9% 0.002s 8.73e-07s 2808 5 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{156}) | |
0.1% 99.9% 0.002s 5.42e-07s 2808 2 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{312}, Constant{468}) | |
0.1% 100.0% 0.001s 5.33e-07s 2808 4 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{312}, Constant{468}) | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Scan Op profiling ( scan_fn ) | |
================== | |
Message: None | |
Time in 13 calls of the op (for a total of 2808 steps) 2.909503e+00s | |
Total time spent in calling the VM 2.319428e+00s (79.719%) | |
Total overhead (computing slices..) 5.900743e-01s (20.281%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
75.5% 75.5% 1.685s 3.00e-04s C 5616 2 theano.sandbox.cuda.blas.GpuGemm | |
24.1% 99.6% 0.538s 4.79e-05s C 11232 4 theano.sandbox.cuda.basic_ops.GpuElemwise | |
0.4% 100.0% 0.009s 7.85e-07s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
75.5% 75.5% 1.685s 3.00e-04s C 5616 2 GpuGemm{no_inplace} | |
6.8% 82.2% 0.151s 5.38e-05s C 2808 1 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace} | |
6.3% 88.6% 0.141s 5.02e-05s C 2808 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace} | |
5.6% 94.1% 0.124s 4.42e-05s C 2808 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace} | |
5.5% 99.6% 0.123s 4.36e-05s C 2808 1 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace} | |
0.4% 100.0% 0.009s 7.85e-07s C 11232 4 GpuSubtensor{::, int64:int64:} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> | |
42.0% 42.0% 0.938s 3.34e-04s 2808 1 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0}) | |
33.5% 75.5% 0.747s 2.66e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0}) | |
6.8% 82.2% 0.151s 5.38e-05s 2808 7 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace}(cell_init_bck[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, col)>, <CudaNdarrayType(float32, col)>) | |
6.3% 88.6% 0.141s 5.02e-05s 2808 6 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0) | |
5.6% 94.1% 0.124s 4.42e-05s 2808 9 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}(GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}.0, <CudaNdarrayType(float32, col)>, hid_init_bck[t-1][cuda], <CudaNdarrayType(float32, col)>) | |
5.5% 99.6% 0.123s 4.36e-05s 2808 8 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}.0) | |
0.1% 99.7% 0.003s 1.13e-06s 2808 3 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{300}) | |
0.1% 99.9% 0.003s 1.06e-06s 2808 5 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{300}) | |
0.1% 99.9% 0.001s 4.77e-07s 2808 2 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{600}, Constant{900}) | |
0.1% 100.0% 0.001s 4.66e-07s 2808 4 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{600}, Constant{900}) | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Scan Op profiling ( scan_fn ) | |
================== | |
Message: None | |
Time in 13 calls of the op (for a total of 2808 steps) 2.186090e+00s | |
Total time spent in calling the VM 1.657587e+00s (75.824%) | |
Total overhead (computing slices..) 5.285032e-01s (24.176%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
73.6% 73.6% 1.174s 2.09e-04s C 5616 2 theano.sandbox.cuda.blas.GpuGemm | |
25.9% 99.5% 0.414s 3.68e-05s C 11232 4 theano.sandbox.cuda.basic_ops.GpuElemwise | |
0.5% 100.0% 0.008s 7.31e-07s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
73.6% 73.6% 1.174s 2.09e-04s C 5616 2 GpuGemm{no_inplace} | |
7.1% 80.6% 0.113s 4.01e-05s C 2808 1 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace} | |
6.8% 87.5% 0.109s 3.89e-05s C 2808 1 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace} | |
6.2% 93.6% 0.098s 3.50e-05s C 2808 1 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace} | |
5.9% 99.5% 0.093s 3.33e-05s C 2808 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace} | |
0.5% 100.0% 0.008s 7.31e-07s C 11232 4 GpuSubtensor{::, int64:int64:} | |
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> | |
37.1% 37.1% 0.592s 2.11e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0}) | |
36.5% 73.6% 0.583s 2.07e-04s 2808 1 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0}) | |
7.1% 80.6% 0.113s 4.01e-05s 2808 7 GpuElemwise{Composite{[add(mul(add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4))), i5), mul(i0, i6))]},no_inplace}(cell_init_bck[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, col)>, <CudaNdarrayType(float32, col)>) | |
6.8% 87.5% 0.109s 3.89e-05s 2808 6 GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, row)>, GpuSubtensor{::, int64:int64:}.0) | |
6.2% 93.6% 0.098s 3.50e-05s 2808 8 GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}(cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, row)>, GpuElemwise{Composite{[add(mul(scalar_sigmoid(mul(i0, i1)), i0), mul(scalar_sigmoid(add(i2, mul(i0, i3))), tanh(i4)))]},no_inplace}.0) | |
5.9% 99.5% 0.093s 3.33e-05s 2808 9 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}(GpuElemwise{Composite{[mul(scalar_sigmoid(mul(i0, i1)), tanh(i2))]},no_inplace}.0, <CudaNdarrayType(float32, col)>, hid_init_bck[t-1][cuda], <CudaNdarrayType(float32, col)>) | |
0.2% 99.7% 0.003s 1.13e-06s 2808 3 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{102}) | |
0.2% 99.8% 0.002s 8.53e-07s 2808 5 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{0}, Constant{102}) | |
0.1% 99.9% 0.001s 4.75e-07s 2808 2 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{204}, Constant{306}) | |
0.1% 100.0% 0.001s 4.69e-07s 2808 4 GpuSubtensor{::, int64:int64:}(GpuGemm{no_inplace}.0, Constant{204}, Constant{306}) | |
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime) | |
Scan Op profiling ( grad_of_scan_fn ) | |
================== | |
Message: None | |
Time in 13 calls of the op (for a total of 2808 steps) 1.525270e+01s | |
Total time spent in calling the VM 1.357668e+01s (89.012%) | |
Total overhead (computing slices..) 1.676021e+00s (10.988%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
40.0% 40.0% 4.951s 2.20e-04s C 22464 8 theano.sandbox.cuda.blas.GpuGemm | |
32.5% 72.5% 4.020s 2.86e-05s C 140400 50 theano.sandbox.cuda.basic_ops.GpuElemwise | |
8.9% 81.4% 1.105s 2.19e-05s C 50544 18 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
8.6% 90.0% 1.066s 1.90e-04s C 5616 2 theano.sandbox.cuda.blas.GpuDot22 | |
8.0% 98.0% 0.995s 3.54e-05s C 28080 10 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
1.8% 99.8% 0.221s 3.94e-05s C 5616 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.1% 99.9% 0.013s 1.12e-06s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.1% 100.0% 0.009s 7.81e-07s C 11232 4 theano.compile.ops.Shape_i | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
27.7% 27.7% 3.434s 2.45e-04s C 14040 5 GpuGemm{no_inplace} | |
12.3% 40.0% 1.517s 1.80e-04s C 8424 3 GpuGemm{inplace} | |
8.6% 48.6% 1.066s 1.90e-04s C 5616 2 GpuDot22 | |
8.0% 56.6% 0.995s 3.54e-05s C 28080 10 GpuCAReduce{add}{1,0} | |
6.9% 63.5% 0.852s 3.37e-05s C 25272 9 GpuElemwise{mul,no_inplace} | |
4.8% 68.3% 0.596s 3.54e-05s C 16848 6 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace} | |
3.7% 72.0% 0.454s 1.62e-05s C 28080 10 GpuElemwise{Mul}[(0, 0)] | |
3.6% 75.6% 0.441s 1.57e-05s C 28080 10 GpuIncSubtensor{InplaceInc;int64:int64:} | |
2.8% 78.3% 0.342s 2.03e-05s C 16848 6 GpuIncSubtensor{InplaceInc;::, int64:int64:} | |
2.6% 80.9% 0.322s 5.74e-05s C 5616 2 GpuIncSubtensor{Inc;::, int64:int64:} | |
2.2% 83.2% 0.276s 3.27e-05s C 8424 3 GpuElemwise{Composite{[mul(mul(i0, i1), i2)]},no_inplace} | |
2.0% 85.1% 0.242s 4.30e-05s C 5616 2 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace} | |
1.8% 86.9% 0.221s 3.94e-05s C 5616 2 GpuAlloc{memset_0=True} | |
1.2% 88.1% 0.151s 2.68e-05s C 5616 2 GpuElemwise{Add}[(0, 0)] | |
1.1% 89.3% 0.141s 2.51e-05s C 5616 2 GpuElemwise{Tanh}[(0, 0)] | |
1.0% 90.3% 0.126s 4.50e-05s C 2808 1 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace} | |
1.0% 91.3% 0.123s 4.37e-05s C 2808 1 GpuElemwise{Composite{[add(add(add(add(mul(i0, i1), mul(i2, i3)), mul(i4, i5)), mul(i6, i7)), i8)]},no_inplace} | |
0.9% 92.2% 0.116s 4.15e-05s C 2808 1 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace} | |
0.9% 93.1% 0.111s 3.95e-05s C 2808 1 GpuElemwise{Composite{[add(i0, add(*2 -> add(i1, *1 -> add(i1, i1)), add(*1, *2)))]},no_inplace} | |
0.8% 93.9% 0.105s 3.74e-05s C 2808 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace} | |
... (remaining 13 Ops account for 6.05%(0.75s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> | |
6.7% 6.7% 0.826s 2.94e-04s 2808 73 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0}) | |
6.1% 12.8% 0.754s 2.69e-04s 2808 57 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0}) | |
5.1% 17.8% 0.628s 2.24e-04s 2808 4 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0}) | |
5.0% 22.8% 0.615s 2.19e-04s 2808 74 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0}) | |
4.9% 27.7% 0.610s 2.17e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0}) | |
4.6% 32.3% 0.565s 2.01e-04s 2808 67 GpuDot22(GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda]) | |
4.2% 36.5% 0.523s 1.86e-04s 2808 58 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_bck_copy.T_replace[cuda], TensorConstant{1.0}) | |
4.2% 40.7% 0.520s 1.85e-04s 2808 76 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0}) | |
4.0% 44.8% 0.501s 1.79e-04s 2808 68 GpuDot22(hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0) | |
3.8% 48.6% 0.474s 1.69e-04s 2808 77 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0}) | |
1.4% 50.0% 0.174s 6.20e-05s 2808 48 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{102}) | |
1.2% 51.2% 0.148s 5.28e-05s 2808 46 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{102}) | |
1.1% 52.3% 0.134s 4.76e-05s 2808 13 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>) | |
1.0% 53.3% 0.126s 4.50e-05s 2808 11 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace}(<CudaNdarrayType(float32, vector)>, <CudaNdarrayType(float32, vector)>) | |
1.0% 54.3% 0.123s 4.38e-05s 2808 28 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}(GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, matrix)>) | |
1.0% 55.3% 0.123s 4.37e-05s 2808 40 GpuElemwise{Composite{[add(add(add(add(mul(i0, i1), mul(i2, i3)), mul(i4, i5)), mul(i6, i7)), i8)]},no_inplace}(GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, <CudaNdarrayType(float32, row)>, GpuElemwise{mul,no_inplace}.0, <CudaNdarrayType(float32, matrix)>, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace}.0, <CudaNdarrayType(float32, row)>, <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float3 | |
1.0% 56.3% 0.119s 4.23e-05s 2808 27 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}(GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, matrix)>) | |
0.9% 57.2% 0.117s 4.16e-05s 2808 51 GpuCAReduce{add}{1,0}(GpuElemwise{Mul}[(0, 0)].0) | |
0.9% 58.1% 0.116s 4.15e-05s 2808 36 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}) | |
0.9% 59.1% 0.115s 4.09e-05s 2808 80 GpuCAReduce{add}{1,0}(GpuElemwise{Mul}[(0, 0)].0) | |
... (remaining 78 Apply instances account for 40.94%(5.07s) of the runtime) | |
Scan Op profiling ( grad_of_scan_fn ) | |
================== | |
Message: None | |
Time in 13 calls of the op (for a total of 2808 steps) 3.372918e+01s | |
Total time spent in calling the VM 3.095004e+01s (91.760%) | |
Total overhead (computing slices..) 2.779134e+00s (8.240%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
57.3% 57.3% 17.038s 7.58e-04s C 22464 8 theano.sandbox.cuda.blas.GpuGemm | |
17.1% 74.3% 5.077s 3.62e-05s C 140400 50 theano.sandbox.cuda.basic_ops.GpuElemwise | |
16.1% 90.4% 4.778s 8.51e-04s C 5616 2 theano.sandbox.cuda.blas.GpuDot22 | |
5.3% 95.7% 1.583s 3.13e-05s C 50544 18 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
3.3% 99.0% 0.987s 3.51e-05s C 28080 10 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
0.9% 99.9% 0.263s 4.68e-05s C 5616 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.0% 100.0% 0.013s 1.20e-06s C 11232 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.0% 100.0% 0.010s 8.76e-07s C 11232 4 theano.compile.ops.Shape_i | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
36.7% 36.7% 10.926s 7.78e-04s C 14040 5 GpuGemm{no_inplace} | |
20.5% 57.3% 6.111s 7.25e-04s C 8424 3 GpuGemm{inplace} | |
16.1% 73.3% 4.778s 8.51e-04s C 5616 2 GpuDot22 | |
3.3% 76.6% 0.987s 3.51e-05s C 28080 10 GpuCAReduce{add}{1,0} | |
3.0% 79.6% 0.891s 3.53e-05s C 25272 9 GpuElemwise{mul,no_inplace} | |
2.8% 82.5% 0.837s 4.97e-05s C 16848 6 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace} | |
2.2% 84.6% 0.642s 1.14e-04s C 5616 2 GpuIncSubtensor{Inc;::, int64:int64:} | |
1.8% 86.4% 0.543s 1.93e-05s C 28080 10 GpuElemwise{Mul}[(0, 0)] | |
1.6% 88.1% 0.482s 2.86e-05s C 16848 6 GpuIncSubtensor{InplaceInc;::, int64:int64:} | |
1.6% 89.7% 0.475s 8.45e-05s C 5616 2 GpuElemwise{Add}[(0, 0)] | |
1.5% 91.2% 0.459s 1.63e-05s C 28080 10 GpuIncSubtensor{InplaceInc;int64:int64:} | |
1.0% 92.2% 0.288s 3.42e-05s C 8424 3 GpuElemwise{Composite{[mul(mul(i0, i1), i2)]},no_inplace} | |
0.9% 93.1% 0.263s 4.68e-05s C 5616 2 GpuAlloc{memset_0=True} | |
0.8% 93.9% 0.246s 4.38e-05s C 5616 2 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace} | |
0.7% 94.6% 0.208s 7.39e-05s C 2808 1 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace} | |
0.6% 95.2% 0.180s 6.41e-05s C 2808 1 GpuElemwise{Composite{[add(add(i0, i1), i2)]}}[(0, 0)] | |
0.5% 95.7% 0.146s 2.61e-05s C 5616 2 GpuElemwise{Tanh}[(0, 0)] | |
0.5% 96.1% 0.138s 2.46e-05s C 5616 2 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]}}[(0, 0)] | |
0.5% 96.6% 0.134s 4.77e-05s C 2808 1 GpuElemwise{Composite{[add(add(add(add(mul(i0, i1), mul(i2, i3)), mul(i4, i5)), mul(i6, i7)), i8)]},no_inplace} | |
0.4% 97.0% 0.122s 4.35e-05s C 2808 1 GpuElemwise{Composite{[add(add(add(add(add(i0, i1), mul(i2, i3)), add(add(i4, i5), mul(i6, i3))), add(add(add(i7, i8), i9), mul(i10, i3))), i11)]}}[(0, 0)] | |
... (remaining 13 Ops account for 3.00%(0.89s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> | |
13.2% 13.2% 3.918s 1.40e-03s 2808 73 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0}) | |
12.6% 25.7% 3.737s 1.33e-03s 2808 57 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0}) | |
11.2% 37.0% 3.340s 1.19e-03s 2808 68 GpuDot22(hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0) | |
10.9% 47.9% 3.251s 1.16e-03s 2808 77 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0}) | |
5.1% 53.0% 1.527s 5.44e-04s 2808 74 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0}) | |
4.8% 57.9% 1.438s 5.12e-04s 2808 67 GpuDot22(GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda]) | |
4.8% 62.7% 1.433s 5.10e-04s 2808 58 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_bck_copy.T_replace[cuda], TensorConstant{1.0}) | |
4.8% 67.5% 1.428s 5.08e-04s 2808 76 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0}) | |
3.3% 70.8% 0.979s 3.49e-04s 2808 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0}) | |
2.6% 73.3% 0.765s 2.73e-04s 2808 4 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0}) | |
1.3% 74.7% 0.401s 1.43e-04s 2808 86 GpuElemwise{Add}[(0, 0)](GpuGemm{inplace}.0, GpuGemm{no_inplace}.0) | |
1.1% 75.8% 0.329s 1.17e-04s 2808 48 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{300}) | |
1.1% 76.8% 0.313s 1.12e-04s 2808 46 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{300}) | |
0.7% 77.5% 0.208s 7.39e-05s 2808 11 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace}(<CudaNdarrayType(float32, vector)>, <CudaNdarrayType(float32, vector)>) | |
0.7% 78.2% 0.194s 6.92e-05s 2808 8 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>) | |
0.6% 78.8% 0.193s 6.88e-05s 2808 38 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, GpuElemwise{Tanh}[(0, 0)].0, GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}.0, GpuElemwise{sub,no_inplace}.0) | |
0.6% 79.4% 0.180s 6.41e-05s 2808 72 GpuElemwise{Composite{[add(add(i0, i1), i2)]}}[(0, 0)](GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0) | |
0.6% 80.0% 0.168s 5.99e-05s 2808 13 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>) | |
0.5% 80.5% 0.144s 5.13e-05s 2808 21 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
0.5% 80.9% 0.134s 4.78e-05s 2808 24 GpuElemwise{mul,no_inplace}(GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, <CudaNdarrayType(float32, row)>) | |
... (remaining 78 Apply instances account for 19.06%(5.67s) of the runtime) | |
Scan Op profiling ( grad_of_scan_fn ) | |
================== | |
Message: None | |
Time in 13 calls of the op (for a total of 2808 steps) 2.108672e+01s | |
Total time spent in calling the VM 1.903834e+01s (90.286%) | |
Total overhead (computing slices..) 2.048384e+00s (9.714%) | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
48.9% 48.9% 8.076s 3.89e-04s C 20736 8 theano.sandbox.cuda.blas.GpuGemm | |
24.8% 73.7% 4.087s 3.15e-05s C 129600 50 theano.sandbox.cuda.basic_ops.GpuElemwise | |
12.4% 86.1% 2.054s 3.96e-04s C 5184 2 theano.sandbox.cuda.blas.GpuDot22 | |
6.5% 92.6% 1.076s 2.31e-05s C 46656 18 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
5.6% 98.2% 0.920s 3.55e-05s C 25920 10 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
1.7% 99.9% 0.276s 5.32e-05s C 5184 2 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.1% 99.9% 0.012s 1.19e-06s C 10368 4 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.1% 100.0% 0.009s 8.27e-07s C 10368 4 theano.compile.ops.Shape_i | |
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
32.0% 32.0% 5.278s 4.07e-04s C 12960 5 GpuGemm{no_inplace} | |
16.9% 48.9% 2.798s 3.60e-04s C 7776 3 GpuGemm{inplace} | |
12.4% 61.4% 2.054s 3.96e-04s C 5184 2 GpuDot22 | |
5.6% 66.9% 0.920s 3.55e-05s C 25920 10 GpuCAReduce{add}{1,0} | |
4.6% 71.5% 0.760s 3.26e-05s C 23328 9 GpuElemwise{mul,no_inplace} | |
3.5% 75.0% 0.574s 3.69e-05s C 15552 6 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace} | |
2.6% 77.6% 0.421s 1.62e-05s C 25920 10 GpuElemwise{Mul}[(0, 0)] | |
2.4% 80.0% 0.400s 1.54e-05s C 25920 10 GpuIncSubtensor{InplaceInc;int64:int64:} | |
2.1% 82.1% 0.344s 6.63e-05s C 5184 2 GpuIncSubtensor{Inc;::, int64:int64:} | |
2.0% 84.1% 0.333s 2.14e-05s C 15552 6 GpuIncSubtensor{InplaceInc;::, int64:int64:} | |
1.7% 85.7% 0.276s 5.32e-05s C 5184 2 GpuAlloc{memset_0=True} | |
1.7% 87.4% 0.275s 1.06e-04s C 2592 1 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace} | |
1.5% 88.9% 0.251s 3.23e-05s C 7776 3 GpuElemwise{Composite{[mul(mul(i0, i1), i2)]},no_inplace} | |
1.5% 90.4% 0.244s 9.43e-05s C 2592 1 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace} | |
1.4% 91.8% 0.234s 4.52e-05s C 5184 2 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace} | |
1.1% 92.9% 0.181s 3.50e-05s C 5184 2 GpuElemwise{Add}[(0, 0)] | |
0.8% 93.7% 0.126s 2.43e-05s C 5184 2 GpuElemwise{Tanh}[(0, 0)] | |
0.6% 94.3% 0.105s 4.06e-05s C 2592 1 GpuElemwise{Composite{[add(add(add(add(mul(i0, i1), mul(i2, i3)), mul(i4, i5)), mul(i6, i7)), i8)]},no_inplace} | |
0.6% 95.0% 0.104s 2.01e-05s C 5184 2 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]}}[(0, 0)] | |
0.6% 95.6% 0.101s 3.90e-05s C 2592 1 GpuElemwise{Composite{[add(i0, add(i1, i1))]},no_inplace} | |
... (remaining 13 Ops account for 4.42%(0.73s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> | |
7.0% 7.0% 1.153s 4.45e-04s 2592 0 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1][cuda], W_hid_to_gates_bck_copy[cuda], TensorConstant{1.0}) | |
6.5% 13.4% 1.067s 4.11e-04s 2592 4 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1][cuda], W_hid_to_gates_fwd_copy[cuda], TensorConstant{1.0}) | |
6.3% 19.8% 1.047s 4.04e-04s 2592 74 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0}) | |
6.3% 26.0% 1.033s 3.98e-04s 2592 73 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0}) | |
6.2% 32.3% 1.031s 3.98e-04s 2592 67 GpuDot22(GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda]) | |
6.2% 38.5% 1.022s 3.94e-04s 2592 68 GpuDot22(hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0) | |
5.9% 44.4% 0.979s 3.78e-04s 2592 57 GpuGemm{no_inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{1.0}, hid_init_bck[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0}) | |
5.9% 50.3% 0.975s 3.76e-04s 2592 58 GpuGemm{inplace}(GpuElemwise{mul,no_inplace}.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_bck_copy.T_replace[cuda], TensorConstant{1.0}) | |
5.8% 56.2% 0.965s 3.72e-04s 2592 76 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, W_hid_to_gates_fwd_copy.T_replace[cuda], TensorConstant{1.0}) | |
5.2% 61.4% 0.858s 3.31e-04s 2592 77 GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0}, hid_init_fwd[t-1].T_replace[cuda], GpuIncSubtensor{InplaceInc;::, int64:int64:}.0, TensorConstant{1.0}) | |
1.7% 63.0% 0.275s 1.06e-04s 2592 64 GpuElemwise{Composite{[add(mul(i0, i1), mul(i2, i3))]},no_inplace}(GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, <CudaNdarrayType(float32, row)>, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]}}[(0, 0)].0, <CudaNdarrayType(float32, row)>) | |
1.5% 64.5% 0.244s 9.43e-05s 2592 36 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), sub(i3, i2))]},no_inplace}(GpuElemwise{mul,no_inplace}.0, GpuElemwise{Tanh}[(0, 0)].0, GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}.0, CudaNdarrayConstant{[[ 1.]]}) | |
1.2% 65.7% 0.195s 7.54e-05s 2592 48 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{156}) | |
1.0% 66.7% 0.166s 6.42e-05s 2592 17 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
0.9% 67.6% 0.148s 5.72e-05s 2592 46 GpuIncSubtensor{Inc;::, int64:int64:}(GpuAlloc{memset_0=True}.0, GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}.0, Constant{0}, Constant{156}) | |
0.8% 68.4% 0.130s 5.03e-05s 2592 13 GpuElemwise{Composite{[mul(mul(mul(i0, i1), i2), i3)]},no_inplace}(<CudaNdarrayType(float32, matrix)>, cell_init_fwd[t-1][cuda], <CudaNdarrayType(float32, matrix)>, <CudaNdarrayType(float32, matrix)>) | |
0.8% 69.1% 0.125s 4.81e-05s 2592 86 GpuElemwise{Add}[(0, 0)](GpuGemm{inplace}.0, GpuGemm{no_inplace}.0) | |
0.8% 69.9% 0.124s 4.80e-05s 2592 27 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}(GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, matrix)>) | |
0.7% 70.6% 0.110s 4.24e-05s 2592 28 GpuElemwise{Composite{[scalar_sigmoid(add(i0, i1))]},no_inplace}(GpuSubtensor{::, int64:int64:}.0, <CudaNdarrayType(float32, matrix)>) | |
0.7% 71.2% 0.109s 4.22e-05s 2592 21 GpuAlloc{memset_0=True}(CudaNdarrayConstant{[[ 0.]]}, Shape_i{0}.0, Shape_i{1}.0) | |
... (remaining 78 Apply instances account for 28.79%(4.75s) of the runtime) | |
Function profiling | |
================== | |
Message: experiment.py:197 | |
Time in 0 calls to Function.__call__: 0.000000e+00s | |
Total compile time: 1.256202e+01s | |
Number of Apply nodes: 0 | |
Theano Optimizer time: 1.010402e+01s | |
Theano validate time: 4.241929e-01s | |
Theano Linker time (includes C, CUDA code generation/compiling): 2.288918e+00s | |
Function profiling | |
================== | |
Message: experiment.py:198 | |
Time in 0 calls to Function.__call__: 0.000000e+00s | |
Total compile time: 1.454442e+01s | |
Number of Apply nodes: 0 | |
Theano Optimizer time: 1.051052e+01s | |
Theano validate time: 2.100599e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 3.858204e+00s | |
Function profiling | |
================== | |
Message: Sum of all(3) printed profiles at exit excluding Scan op profile. | |
Time in 12 calls to Function.__call__: 8.222084e+01s | |
Time in Function.fn.__call__: 8.221798e+01s (99.997%) | |
Time in thunks: 8.196062e+01s (99.684%) | |
Total compile time: 2.780161e+02s | |
Number of Apply nodes: 1214 | |
Theano Optimizer time: 1.881631e+02s | |
Theano validate time: 3.844473e+00s | |
Theano Linker time (includes C, CUDA code generation/compiling): 8.912894e+01s | |
Class | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name> | |
88.9% 88.9% 72.863s 1.01e+00s Py 72 6 theano.scan_module.scan_op.Scan | |
5.0% 93.9% 4.096s 1.90e-02s C 216 18 theano.sandbox.cuda.blas.GpuDot22 | |
1.5% 95.4% 1.260s 1.31e-02s C 96 8 theano.sandbox.cuda.basic_ops.GpuCAReduce | |
1.3% 96.7% 1.050s 2.30e-03s Py 456 38 theano.sandbox.cuda.basic_ops.GpuReshape | |
1.0% 97.7% 0.813s 7.06e-04s C 1152 96 theano.sandbox.cuda.basic_ops.GpuElemwise | |
0.9% 98.6% 0.766s 1.82e-03s C 420 35 theano.sandbox.cuda.basic_ops.GpuIncSubtensor | |
0.4% 99.1% 0.345s 4.80e-03s C 72 6 theano.sandbox.cuda.basic_ops.HostFromGpu | |
0.3% 99.3% 0.227s 3.71e-04s C 612 51 theano.sandbox.cuda.basic_ops.GpuAlloc | |
0.2% 99.6% 0.186s 5.18e-03s C 36 3 theano.sandbox.cuda.basic_ops.GpuJoin | |
0.2% 99.8% 0.151s 4.20e-03s Py 36 3 theano.tensor.basic.Split | |
0.1% 99.9% 0.116s 8.82e-04s C 132 11 theano.sandbox.cuda.basic_ops.GpuFromHost | |
0.0% 99.9% 0.021s 1.79e-03s C 12 1 theano.sandbox.cuda.nnet.GpuSoftmaxWithBias | |
0.0% 99.9% 0.016s 1.37e-03s C 12 1 theano.tensor.elemwise.Sum | |
0.0% 100.0% 0.016s 1.34e-03s C 12 1 theano.tensor.nnet.nnet.SoftmaxGrad | |
0.0% 100.0% 0.015s 1.22e-03s C 12 1 theano.sandbox.cuda.blas.GpuGemm | |
0.0% 100.0% 0.008s 1.36e-06s C 6084 507 theano.tensor.elemwise.Elemwise | |
0.0% 100.0% 0.002s 1.78e-06s C 1356 113 theano.sandbox.cuda.basic_ops.GpuDimShuffle | |
0.0% 100.0% 0.002s 2.30e-06s C 984 82 theano.sandbox.cuda.basic_ops.GpuSubtensor | |
0.0% 100.0% 0.002s 1.94e-06s C 1068 89 theano.compile.ops.Shape_i | |
0.0% 100.0% 0.001s 9.74e-06s Py 144 12 theano.compile.ops.Rebroadcast | |
... (remaining 3 Classes account for 0.00%(0.00s) of the runtime) | |
Ops | |
--- | |
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name> | |
38.1% 38.1% 31.267s 2.61e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn} | |
23.8% 62.0% 19.532s 1.63e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn} | |
17.3% 79.2% 14.149s 1.18e+00s Py 12 1 forall_inplace,gpu,grad_of_scan_fn} | |
5.0% 84.2% 4.096s 1.90e-02s C 216 18 GpuDot22 | |
3.9% 88.1% 3.184s 2.65e-01s Py 12 1 forall_inplace,gpu,scan_fn} | |
3.3% 91.4% 2.702s 2.25e-01s Py 12 1 forall_inplace,gpu,scan_fn} | |
2.5% 93.9% 2.028s 1.69e-01s Py 12 1 forall_inplace,gpu,scan_fn} | |
1.5% 95.4% 1.242s 1.72e-02s C 72 6 GpuCAReduce{add}{1,1,0} | |
1.3% 96.7% 1.044s 3.35e-03s Py 312 26 GpuReshape{2} | |
0.5% 97.2% 0.398s 5.53e-03s C 72 6 GpuIncSubtensor{Inc;:int64:} | |
0.4% 97.6% 0.345s 4.80e-03s C 72 6 HostFromGpu | |
0.3% 97.9% 0.278s 1.93e-03s C 144 12 GpuIncSubtensor{InplaceInc;int64::} | |
0.3% 98.2% 0.236s 2.46e-03s C 96 8 GpuElemwise{add,no_inplace} | |
0.3% 98.5% 0.227s 3.71e-04s C 612 51 GpuAlloc{memset_0=True} | |
0.2% 98.7% 0.186s 5.18e-03s C 36 3 GpuJoin | |
0.2% 98.9% 0.160s 1.49e-03s C 108 9 GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace} | |
0.2% 99.1% 0.151s 4.20e-03s Py 36 3 Split{2} | |
0.1% 99.2% 0.116s 8.82e-04s C 132 11 GpuFromHost | |
0.1% 99.4% 0.113s 3.92e-04s C 288 24 GpuElemwise{Add}[(0, 0)] | |
0.1% 99.5% 0.087s 4.81e-04s C 180 15 GpuElemwise{sub,no_inplace} | |
... (remaining 126 Ops account for 0.51%(0.42s) of the runtime) | |
Apply | |
------ | |
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name> | |
38.1% 38.1% 31.267s 2.61e+00s 12 1119 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]}, | |
23.8% 62.0% 19.532s 1.63e+00s 12 1175 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]}, | |
17.3% 79.2% 14.149s 1.18e+00s 12 1063 forall_inplace,gpu,grad_of_scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]},no_inplace}.0, GpuElemwise{Composite{[scalar_sigmoid(mul(i0, i1))]}, | |
3.9% 83.1% 3.184s 2.65e-01s 12 405 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}. | |
3.3% 86.4% 2.702s 2.25e-01s 12 680 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}. | |
2.5% 88.9% 2.028s 1.69e-01s 12 972 forall_inplace,gpu,scan_fn}(Elemwise{Composite{[minimum(minimum(i0, i1), i2)]}}.0, GpuElemwise{sub,no_inplace}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, W_hid_to_gates_fwd, W_hid_to_gates_bck, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}. | |
0.5% 89.4% 0.379s 3.16e-02s 12 1149 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0) | |
0.5% 89.8% 0.379s 3.16e-02s 12 1146 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0) | |
0.4% 90.3% 0.363s 3.03e-02s 12 1147 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
0.4% 90.7% 0.361s 3.01e-02s 12 1150 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
0.4% 91.1% 0.350s 2.92e-02s 12 490 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
0.4% 91.6% 0.346s 2.89e-02s 12 501 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
0.4% 92.0% 0.340s 2.83e-02s 12 1139 GpuCAReduce{add}{1,1,0}(GpuIncSubtensor{InplaceInc;int64::}.0) | |
0.4% 92.4% 0.329s 2.75e-02s 12 1141 GpuCAReduce{add}{1,1,0}(GpuIncSubtensor{InplaceInc;int64::}.0) | |
0.3% 92.7% 0.280s 2.33e-02s 12 1090 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0) | |
0.3% 93.1% 0.279s 2.33e-02s 12 1093 GpuDot22(GpuReshape{2}.0, GpuDimShuffle{1,0}.0) | |
0.3% 93.3% 0.228s 1.90e-02s 12 778 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
0.3% 93.6% 0.226s 1.88e-02s 12 768 GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0) | |
0.3% 93.9% 0.220s 1.83e-02s 12 1091 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
0.3% 94.1% 0.220s 1.83e-02s 12 1094 GpuDot22(GpuDimShuffle{1,0}.0, GpuReshape{2}.0) | |
... (remaining 1194 Apply instances account for 5.85%(4.80s) of the runtime) |