@morganmcg1
Last active August 9, 2021 13:34
gpt-j generation error stacktrace
2021-08-09 13:33:24.972717: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1981] Execution of replica 2 failed: Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. That was not possible. There are 9.62G free, 0B reserved, and 2.65G reservable.
2021-08-09 13:33:24.972816: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1981] Execution of replica 5 failed: Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. That was not possible. There are 9.62G free, 0B reserved, and 2.65G reservable.
2021-08-09 13:33:24.972875: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1981] Execution of replica 3 failed: Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. That was not possible. There are 9.62G free, 0B reserved, and 2.65G reservable.
2021-08-09 13:33:24.972954: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1981] Execution of replica 1 failed: Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. That was not possible. There are 9.62G free, 0B reserved, and 2.65G reservable.
2021-08-09 13:33:24.972999: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1981] Execution of replica 7 failed: Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. That was not possible. There are 9.62G free, 0B reserved, and 2.65G reservable.
2021-08-09 13:33:24.973251: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1981] Execution of replica 6 failed: Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. That was not possible. There are 9.62G free, 0B reserved, and 2.65G reservable.
2021-08-09 13:33:24.974006: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1981] Execution of replica 4 failed: Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. That was not possible. There are 9.62G free, 0B reserved, and 2.65G reservable.
2021-08-09 13:33:25.109534: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:1981] Execution of replica 0 failed: Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. That was not possible. There are 9.61G free, 0B reserved, and 2.65G reservable.
Traceback (most recent call last):
  File "device_train.py", line 384, in <module>
    loss, last_loss, grad_norm, grad_norm_micro = train_step(
  File "device_train.py", line 113, in train_step
    loss, last_loss, grad_norm, grad_norm_micro = network.train(inputs)
  File "/home/morganmcguire/mesh-transformer-jax/mesh_transformer/transformer_shard.py", line 301, in train
    loss, last_loss, grad_norm, grad_norm_micro, self.state = self.train_xmap(self.state, obs, target)
  File "/usr/local/lib/python3.8/dist-packages/jax/experimental/maps.py", line 516, in fun_mapped
    out_flat = xmap_p.bind(
  File "/usr/local/lib/python3.8/dist-packages/jax/experimental/maps.py", line 652, in bind
    return core.call_bind(self, fun, *args, **params) # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/jax/core.py", line 1393, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "/usr/local/lib/python3.8/dist-packages/jax/experimental/maps.py", line 655, in process
    return trace.process_xmap(self, fun, tracers, params)
  File "/usr/local/lib/python3.8/dist-packages/jax/core.py", line 600, in process_call
    return primitive.impl(f, *tracers, **params)
  File "/usr/local/lib/python3.8/dist-packages/jax/experimental/maps.py", line 539, in xmap_impl
    return make_xmap_callable(fun, name, in_axes, out_axes_thunk, donated_invars, global_axis_sizes,
  File "/usr/local/lib/python3.8/dist-packages/jax/interpreters/pxla.py", line 1130, in execute_replicated
    out_bufs = compiled.execute_sharded_on_local_devices(input_bufs)
RuntimeError: Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. That was not possible. There are 9.61G free, 0B reserved, and 2.65G reservable.: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
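
The figures in the error message above are easier to compare once parsed. The sketch below is not part of the original run: it pulls the four sizes out of the replica 0 message and reports how far the request exceeds the reservable amount. The helper names (`to_bytes`, `parse_resource_exhausted`) are hypothetical, and reading the `G` suffix as GiB is an assumption. On one reading, the failure is not a lack of free memory (9.61G free against 4.44G requested) but that only 2.65G could be reserved at the position the runtime wanted.

```python
import re

# Error text copied verbatim from the replica 0 message in the log above.
MESSAGE = (
    "Resource exhausted: Attempting to reserve 4.44G at the bottom of memory. "
    "That was not possible. There are 9.61G free, 0B reserved, and 2.65G reservable."
)

# Assumption: the suffixes are binary units (G = GiB). The log does not say.
_UNITS = {"B": 1, "K": 2**10, "M": 2**20, "G": 2**30}
_SIZE = r"[\d.]+[BKMG]"


def to_bytes(text):
    """Convert a size such as '4.44G' or '0B' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)([BKMG])", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit]


def parse_resource_exhausted(message):
    """Return requested/free/reserved/reservable sizes in bytes (hypothetical helper)."""
    pattern = (
        rf"reserve (?P<requested>{_SIZE}) .*?"
        rf"There are (?P<free>{_SIZE}) free, "
        rf"(?P<reserved>{_SIZE}) reserved, and "
        rf"(?P<reservable>{_SIZE}) reservable"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not look like a 'Resource exhausted' error")
    return {name: to_bytes(value) for name, value in match.groupdict().items()}


sizes = parse_resource_exhausted(MESSAGE)
shortfall_gib = (sizes["requested"] - sizes["reservable"]) / 2**30
print(f"requested exceeds reservable by {shortfall_gib:.2f}G")
print("free memory covers the request:", sizes["free"] >= sizes["requested"])
```

The same function can be applied to each of the per-replica lines in the log, since they share the message format.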