@khaotik
Last active November 30, 2016 12:01
Theano scan investigation 1

The snippet used for the scan test:

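import numpy as np
import theano
import theano.tensor as T

x = T.matrix()  # float32 when floatX=float32, as in the graphs below
# Each scan step receives one row of x and returns its sum of squares.
y, _ = theano.scan(fn=lambda row: T.sum(row * row), sequences=x)
fn_sum_scan = theano.function([x], y)
xval = np.random.randn(5000, 10000).astype(np.float32)
%timeit fn_sum_scan(xval)  # IPython magic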
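For reference, the scan in the snippet walks over the first axis of x and produces one value per row: the sum of squares of that row. A plain-Python sketch of that behaviour, where the helper names `scan_rows` and `sum_of_squares` are illustrative and not part of Theano:

```python
def scan_rows(fn, rows):
    """Apply fn to each row in order and collect the results.

    This mimics a scan over `sequences` with no recurrent state:
    each step sees one row and nothing from the previous steps.
    """
    return [fn(row) for row in rows]


def sum_of_squares(row):
    """Sum of the squared entries of one row."""
    return sum(v * v for v in row)


# Tiny stand-in for the 5000 x 10000 input used in the benchmark.
x = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.0]]
y = scan_rows(sum_of_squares, x)
print(y)  # [5.0, 25.0, 0.25]
```

With NumPy the same result is `np.sum(xval * xval, axis=1)`, which is a convenient check for the output of fn_sum_scan.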

Debugprint of fn_sum_scan, with per-node profile timings:

CPU:

for{cpu,scan_fn} [id A] ''   6 --> 6.99e-01s 63.4% 1.10e+00s 99.9%
 |Shape_i{0} [id B] ''   0 --> 8.11e-06s  0.0% 8.11e-06s  0.0%
 | |<TensorType(float32, matrix)> [id C]
 |Elemwise{sqr,no_inplace} [id D] ''   5 --> 3.47e-01s 31.5% 4.03e-01s 36.5%
 | |Subtensor{int64:int64:int8} [id E] ''   4 --> 5.56e-02s  5.0% 5.57e-02s  5.1%
 |   |<TensorType(float32, matrix)> [id C]
 |   |ScalarFromTensor [id F] ''   3 --> 3.48e-05s  0.0% 7.18e-05s  0.0%
 |   | |Elemwise{Composite{Switch(LE(i0, i1), i1, i2)}} [id G] ''   2 --> 2.88e-05s  0.0% 3.70e-05s  0.0%
 |   |   |Shape_i{0} [id B] ''   0 --> 8.11e-06s  0.0% 8.11e-06s  0.0%
 |   |   |TensorConstant{0} [id H]
 |   |   |TensorConstant{0} [id I]
 |   |ScalarFromTensor [id J] ''   1 --> 5.01e-06s  0.0% 1.31e-05s  0.0%
 |   | |Shape_i{0} [id B] ''   0 --> 8.11e-06s  0.0% 8.11e-06s  0.0%
 |   |Constant{1} [id K]
 |Shape_i{0} [id B] ''   0 --> 8.11e-06s  0.0% 8.11e-06s  0.0%

Inner graphs of the scan ops:

for{cpu,scan_fn} [id A] ''   
 >Sum{acc_dtype=float64} [id L] ''   
 > |<TensorType(float32, vector)> [id M] -> [id D]

GPU:

HostFromGpu(gpuarray) [id A] ''   8 --> 1.64e-01s  7.9% 2.06e+00s 100.0%
 |for{cpu,scan_fn} [id B] ''   7 --> 1.60e+00s 77.3% 1.90e+00s 92.0%
   |Shape_i{0} [id C] ''   0 --> 1.29e-05s  0.0% 1.29e-05s  0.0%
   | |<TensorType(float32, matrix)> [id D]
   |GpuElemwise{Sqr}[(0, 0)]<gpuarray> [id E] ''   6 --> 3.28e-04s  0.0% 3.03e-01s 14.7%
   | |GpuSubtensor{int64:int64:int8} [id F] ''   5 --> 1.04e-04s  0.0% 3.03e-01s 14.7%
   |   |GpuFromHost<None> [id G] ''   1 --> 3.03e-01s 14.7% 3.03e-01s 14.7%
   |   | |<TensorType(float32, matrix)> [id D]
   |   |ScalarFromTensor [id H] ''   4 --> 5.17e-05s  0.0% 1.10e-04s  0.0%
   |   | |Elemwise{Composite{Switch(LE(i0, i1), i1, i2)}} [id I] ''   3 --> 4.53e-05s  0.0% 5.82e-05s  0.0%
   |   |   |Shape_i{0} [id C] ''   0 --> 1.29e-05s  0.0% 1.29e-05s  0.0%
   |   |   |TensorConstant{0} [id J]
   |   |   |TensorConstant{0} [id K]
   |   |ScalarFromTensor [id L] ''   2 --> 5.96e-06s  0.0% 1.88e-05s  0.0%
   |   | |Shape_i{0} [id C] ''   0 --> 1.29e-05s  0.0% 1.29e-05s  0.0%
   |   |Constant{1} [id M]
   |Shape_i{0} [id C] ''   0 --> 1.29e-05s  0.0% 1.29e-05s  0.0%

Inner graphs of the scan ops:

for{cpu,scan_fn} [id B] ''   
 >GpuCAReduceCuda{add} [id N] ''   
 > |<GpuArrayType<None>(float32, (False,))> [id O] -> [id E]
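To rank the nodes in the graphs above, the timing columns can be extracted with a small parser. This is a sketch that assumes the four figures after `-->` are the node's own time, its share, the cumulative time including its inputs, and that share. The regular expression and the name `node_times` are illustrative, and the sample is the first few lines of the GPU graph copied verbatim:

```python
import re

# Matches a profiled debugprint line such as
#   " |for{cpu,scan_fn} [id B] ''   7 --> 1.60e+00s 77.3% 1.90e+00s 92.0%"
# Lines without a timing suffix (inputs, constants) do not match.
LINE_RE = re.compile(
    r"^[\s|]*(?P<op>.+?) \[id (?P<id>\w+)\] ''\s+\d+ --> "
    r"(?P<self_s>\S+)s\s+(?P<self_pct>\S+)%\s+"
    r"(?P<total_s>\S+)s\s+(?P<total_pct>\S+)%\s*$"
)


def node_times(debugprint_text):
    """Return {node id: (op name, own seconds, cumulative seconds)}.

    A node printed several times keeps a single entry, keyed by its id.
    """
    nodes = {}
    for line in debugprint_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            nodes[m.group("id")] = (
                m.group("op"),
                float(m.group("self_s")),
                float(m.group("total_s")),
            )
    return nodes


SAMPLE = """\
HostFromGpu(gpuarray) [id A] ''   8 --> 1.64e-01s  7.9% 2.06e+00s 100.0%
 |for{cpu,scan_fn} [id B] ''   7 --> 1.60e+00s 77.3% 1.90e+00s 92.0%
   |Shape_i{0} [id C] ''   0 --> 1.29e-05s  0.0% 1.29e-05s  0.0%
   | |<TensorType(float32, matrix)> [id D]
   |GpuElemwise{Sqr}[(0, 0)]<gpuarray> [id E] ''   6 --> 3.28e-04s  0.0% 3.03e-01s 14.7%
   | |GpuSubtensor{int64:int64:int8} [id F] ''   5 --> 1.04e-04s  0.0% 3.03e-01s 14.7%
   |   |GpuFromHost<None> [id G] ''   1 --> 3.03e-01s 14.7% 3.03e-01s 14.7%
"""

nodes = node_times(SAMPLE)
# Sort by the node's own time, largest first.
ranked = sorted(nodes.values(), key=lambda n: n[1], reverse=True)
for op, self_s, total_s in ranked[:3]:
    print(f"{op:25s} self={self_s:.3f}s total={total_s:.3f}s")
```

Sorting on the node's own time rather than the cumulative figure avoids counting shared inputs more than once.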

Scan internal profiling result dump:

GPU:
single scan timing: pre | pre_loop | fn | post_loop | post
1.7404556274414062e-05 0.011431694030761719 0.07190227508544922 0.0724020004272461 3.0994415283203125e-06
9.298324584960938e-06 0.01133584976196289 0.07033467292785645 0.07410383224487305 3.0994415283203125e-06
9.059906005859375e-06 0.011420726776123047 0.07079935073852539 0.07366394996643066 3.337860107421875e-06
1.0251998901367188e-05 0.013664960861206055 0.05923724174499512 0.0854940414428711 4.5299530029296875e-06
1.1682510375976562e-05 0.012000083923339844 0.05719399452209473 0.08690905570983887 3.337860107421875e-06
9.5367431640625e-06 0.011611223220825195 0.07001042366027832 0.07445144653320312 2.86102294921875e-06
1.0013580322265625e-05 0.01223444938659668 0.06577706336975098 0.0782625675201416 3.337860107421875e-06
9.775161743164062e-06 0.012152671813964844 0.06923460960388184 0.07717061042785645 3.337860107421875e-06
9.775161743164062e-06 0.012459039688110352 0.06276059150695801 0.08088088035583496 3.337860107421875e-06
9.298324584960938e-06 0.012417078018188477 0.056359052658081055 0.08724284172058105 2.86102294921875e-06

CPU:
single scan timing: pre | pre_loop | fn | post_loop | post
1.33514404296875e-05 0.005300283432006836 0.05896925926208496 0.003942012786865234 3.0994415283203125e-06
1.0967254638671875e-05 0.005280971527099609 0.05911135673522949 0.00392913818359375 3.337860107421875e-06
1.0967254638671875e-05 0.005237102508544922 0.05887770652770996 0.003923654556274414 3.337860107421875e-06
1.0251998901367188e-05 0.005327939987182617 0.05917644500732422 0.003917694091796875 3.5762786865234375e-06
1.0251998901367188e-05 0.005271196365356445 0.0590672492980957 0.003888368606567383 3.0994415283203125e-06
9.775161743164062e-06 0.005278110504150391 0.0592494010925293 0.003932476043701172 3.814697265625e-06
9.775161743164062e-06 0.005196332931518555 0.059012651443481445 0.0039615631103515625 3.337860107421875e-06
1.0013580322265625e-05 0.0054929256439208984 0.060483694076538086 0.004021883010864258 3.814697265625e-06
1.0251998901367188e-05 0.005238771438598633 0.05917620658874512 0.003945589065551758 3.337860107421875e-06
1.0013580322265625e-05 0.005337238311767578 0.05956625938415527 0.003913164138793945 3.5762786865234375e-06
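The per-phase columns are easier to compare as averages. The sketch below averages each column and prints the GPU/CPU ratio per phase. It assumes the values are wall-clock seconds, which the dump does not state. The names `phase_means`, `GPU_DUMP` and `CPU_DUMP` are illustrative, and only the first three rows of each dump are included, copied verbatim:

```python
from statistics import mean

# Column order taken from the dump header.
PHASES = ("pre", "pre_loop", "fn", "post_loop", "post")


def phase_means(dump):
    """Average each column of a whitespace-separated timing dump."""
    rows = [
        [float(v) for v in line.split()]
        for line in dump.strip().splitlines()
    ]
    # zip(*rows) transposes the rows into one tuple per column.
    return {name: mean(col) for name, col in zip(PHASES, zip(*rows))}


GPU_DUMP = """
1.7404556274414062e-05 0.011431694030761719 0.07190227508544922 0.0724020004272461 3.0994415283203125e-06
9.298324584960938e-06 0.01133584976196289 0.07033467292785645 0.07410383224487305 3.0994415283203125e-06
9.059906005859375e-06 0.011420726776123047 0.07079935073852539 0.07366394996643066 3.337860107421875e-06
"""

CPU_DUMP = """
1.33514404296875e-05 0.005300283432006836 0.05896925926208496 0.003942012786865234 3.0994415283203125e-06
1.0967254638671875e-05 0.005280971527099609 0.05911135673522949 0.00392913818359375 3.337860107421875e-06
1.0967254638671875e-05 0.005237102508544922 0.05887770652770996 0.003923654556274414 3.337860107421875e-06
"""

gpu = phase_means(GPU_DUMP)
cpu = phase_means(CPU_DUMP)
for name in PHASES:
    ratio = gpu[name] / cpu[name]
    print(f"{name:10s} gpu={gpu[name]:.6f} cpu={cpu[name]:.6f} ratio={ratio:.1f}x")
```

Pasting all ten rows of each dump into the two strings gives the full averages without any other change.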
