@AeroXi
Created August 27, 2019 08:27
Error log when pretraining on VCR
2019-08-26 21:17:42.954423: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 2 Chunks of size 29364224 totalling 56.01MiB
2019-08-26 21:17:42.954434: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 29425664 totalling 28.06MiB
2019-08-26 21:17:42.954446: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 32751616 totalling 31.23MiB
2019-08-26 21:17:42.954458: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 6 Chunks of size 125018112 totalling 715.36MiB
2019-08-26 21:17:42.954469: I tensorflow/core/common_runtime/bfc_allocator.cc:658] Sum Total of in-use chunks: 10.14GiB
2019-08-26 21:17:42.954485: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Stats:
Limit: 10895235482
InUse: 10891294208
MaxInUse: 10891294208
NumAllocs: 4208
MaxAllocSize: 125018112
2019-08-26 21:17:42.954655: W tensorflow/core/common_runtime/bfc_allocator.cc:275] ****************************************************************************************************
2019-08-26 21:17:42.954734: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[1024,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
INFO:tensorflow:Error recorded from training_loop: OOM when allocating tensor with shape[1024,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node bert/encoder/layer_15/intermediate/dense/truediv}} = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/encoder/layer_15/intermediate/dense/BiasAdd, ConstantFolding/gradients/bert/encoder/layer_0/intermediate/dense/truediv_grad/RealDiv_recip)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node add_1/_9593}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6797_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'bert/encoder/layer_15/intermediate/dense/truediv', defined at:
File "pretrain_on_vcr.py", line 467, in <module>
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
saving_listeners=saving_listeners
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1211, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2186, in _call_model_fn
features, labels, mode, config)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2470, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1250, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1524, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "pretrain_on_vcr.py", line 148, in model_fn
use_one_hot_embeddings=use_one_hot_embeddings)
File "/data1/cx/r2c/data/get_bert_embeddings/modeling.py", line 216, in __init__
do_return_all_layers=True)
File "/data1/cx/r2c/data/get_bert_embeddings/modeling.py", line 879, in transformer_model
kernel_initializer=create_initializer(initializer_range))
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/layers/core.py", line 184, in dense
return layer.apply(inputs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 828, in apply
return self.__call__(inputs, *args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 364, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 769, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/keras/layers/core.py", line 951, in call
return self.activation(outputs) # pylint: disable=not-callable
File "/data1/cx/r2c/data/get_bert_embeddings/modeling.py", line 276, in gelu
cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0)))
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 862, in binary_op_wrapper
return func(x, y, name=name)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 970, in _truediv_python3
return gen_math_ops.real_div(x, y, name=name)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5989, in real_div
"RealDiv", x=x, y=y, name=name)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
op_def=op_def)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1024,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node bert/encoder/layer_15/intermediate/dense/truediv}} = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/encoder/layer_15/intermediate/dense/BiasAdd, ConstantFolding/gradients/bert/encoder/layer_0/intermediate/dense/truediv_grad/RealDiv_recip)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node add_1/_9593}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6797_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1024,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node bert/encoder/layer_15/intermediate/dense/truediv}} = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/encoder/layer_15/intermediate/dense/BiasAdd, ConstantFolding/gradients/bert/encoder/layer_0/intermediate/dense/truediv_grad/RealDiv_recip)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node add_1/_9593}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6797_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pretrain_on_vcr.py", line 467, in <module>
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2400, in train
rendezvous.raise_errors()
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
saving_listeners=saving_listeners
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
saving_listeners)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
run_metadata=run_metadata)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
raise six.reraise(*original_exc_info)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
return self._sess.run(*args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
run_metadata=run_metadata)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1024,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node bert/encoder/layer_15/intermediate/dense/truediv}} = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/encoder/layer_15/intermediate/dense/BiasAdd, ConstantFolding/gradients/bert/encoder/layer_0/intermediate/dense/truediv_grad/RealDiv_recip)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node add_1/_9593}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6797_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'bert/encoder/layer_15/intermediate/dense/truediv', defined at:
File "pretrain_on_vcr.py", line 467, in <module>
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
saving_listeners=saving_listeners
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1211, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2186, in _call_model_fn
features, labels, mode, config)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2470, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1250, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1524, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "pretrain_on_vcr.py", line 148, in model_fn
use_one_hot_embeddings=use_one_hot_embeddings)
File "/data1/cx/r2c/data/get_bert_embeddings/modeling.py", line 216, in __init__
do_return_all_layers=True)
File "/data1/cx/r2c/data/get_bert_embeddings/modeling.py", line 879, in transformer_model
kernel_initializer=create_initializer(initializer_range))
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/layers/core.py", line 184, in dense
return layer.apply(inputs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 828, in apply
return self.__call__(inputs, *args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 364, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 769, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/keras/layers/core.py", line 951, in call
return self.activation(outputs) # pylint: disable=not-callable
File "/data1/cx/r2c/data/get_bert_embeddings/modeling.py", line 276, in gelu
cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0)))
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 862, in binary_op_wrapper
return func(x, y, name=name)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 970, in _truediv_python3
return gen_math_ops.real_div(x, y, name=name)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 5989, in real_div
"RealDiv", x=x, y=y, name=name)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
op_def=op_def)
File "/home/yuweijiang/anaconda3/envs/vcr/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1024,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node bert/encoder/layer_15/intermediate/dense/truediv}} = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bert/encoder/layer_15/intermediate/dense/BiasAdd, ConstantFolding/gradients/bert/encoder/layer_0/intermediate/dense/truediv_grad/RealDiv_recip)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node add_1/_9593}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6797_add_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
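The numbers in the log above can be checked with a little arithmetic. The sketch below is not part of the original run; it only reuses the Limit and InUse values from the allocator stats and the shape[1024,4096] float tensor named in the error, and assumes "float" means 4-byte float32. The helper name tensor_bytes is made up for illustration.

```python
# Rough check of the OOM figures reported in the log.
# Assumption: "type float" is float32, i.e. 4 bytes per element.

def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

LIMIT = 10895235482    # "Limit" from the bfc_allocator stats
IN_USE = 10891294208   # "InUse" from the bfc_allocator stats

requested = tensor_bytes((1024, 4096))  # tensor named in the OOM message
free = LIMIT - IN_USE                   # bytes not yet handed out by the allocator

MIB = 1024 * 1024
print("requested: %d bytes (%.2f MiB)" % (requested, requested / MIB))
print("free:      %d bytes (%.2f MiB)" % (free, free / MIB))
print("fits:      %s" % (requested <= free))
```

Under that assumption the request is 16 MiB while less than 4 MiB remains under the allocator limit, which is consistent with the ResourceExhaustedError at layer_15. The free figure is an upper bound, since the remaining bytes need not be contiguous.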