@davidtweedle
Created November 8, 2024 18:21
Log file for the CIFAR workload with torch.compile and use_orig_params=True in FSDP.
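For context, the setup the title describes looks roughly like the following. This is a hedged sketch, not code from the workload: make_resnet is a hypothetical stand-in for the CIFAR ResNet constructor, and the auto-wrap policy that produced the nested FSDP units visible further down in the model printout is omitted.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend='nccl')  # torchrun provides rank/world-size env vars
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = make_resnet().cuda()               # hypothetical constructor
model = FSDP(model, use_orig_params=True)  # keep the original (unflattened) parameter views exposed
model = torch.compile(model)               # Dynamo then traces through the FSDP wrapper, as the traceback below shows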
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING]
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING] *****************************************
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING] *****************************************
I1108 17:53:21.528697 134097630226240 logger_utils.py:81] Creating experiment directory at /kaggle/working/experiments/trial1/cifar_pytorch.
I1108 17:53:21.528696 139809002714944 logger_utils.py:81] Creating experiment directory at /kaggle/working/experiments/trial1/cifar_pytorch.
I1108 17:53:22.236347 134097630226240 submission_runner.py:564] Using RNG seed 3191040022
I1108 17:53:22.237189 139809002714944 logger_utils.py:97] Saving hparams to /kaggle/working/experiments/trial1/cifar_pytorch/trial_1/hparams.json.
I1108 17:53:22.237842 134097630226240 submission_runner.py:573] --- Tuning run 1/1 ---
I1108 17:53:22.238002 134097630226240 submission_runner.py:578] Creating tuning directory at /kaggle/working/experiments/trial1/cifar_pytorch/trial_1.
I1108 17:53:22.238237 134097630226240 logger_utils.py:97] Saving hparams to /kaggle/working/experiments/trial1/cifar_pytorch/trial_1/hparams.json.
I1108 17:53:22.467518 134097630226240 submission_runner.py:215] Initializing dataset.
I1108 17:53:23.170488 134097630226240 submission_runner.py:226] Initializing model.
I1108 17:53:23.401008 134097630226240 submission_runner.py:264] Performing `torch.compile`.
I1108 17:53:24.578627 134097630226240 submission_runner.py:268] Initializing optimizer.
I1108 17:53:25.178119 134097630226240 submission_runner.py:275] Initializing metrics bundle.
I1108 17:53:25.178369 134097630226240 submission_runner.py:293] Initializing checkpoint and logger.
I1108 17:53:25.179337 134097630226240 submission_runner.py:313] Saving meta data to /kaggle/working/experiments/trial1/cifar_pytorch/trial_1/meta_data_0.json.
I1108 17:53:25.256645 134097630226240 submission_runner.py:317] Saving flags to /kaggle/working/experiments/trial1/cifar_pytorch/trial_1/flags_0.json.
I1108 17:53:25.347355 134097630226240 submission_runner.py:329] Starting training loop.
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
torch.has_cuda,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
torch.has_cudnn,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
torch.has_mps,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
torch.has_mkldnn,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
torch.has_cuda,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
torch.has_cudnn,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
torch.has_mps,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
torch.has_mkldnn,
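The UserWarnings above (four per rank, all from torch/overrides.py) are triggered when that module is imported by the torch.compile machinery; they are deprecation notices only and are unrelated to the failure below. The replacements the warnings themselves name are:

import torch

torch.backends.cuda.is_built()        # instead of torch.has_cuda
torch.backends.cudnn.is_available()   # instead of torch.has_cudnn
torch.backends.mps.is_built()         # instead of torch.has_mps
torch.backends.mkldnn.is_available()  # instead of torch.has_mkldnn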
I1108 17:54:38.432303 134088132654848 logging_writer.py:48] [0] global_step=0, loss=2.304934
I1108 17:54:38.464384 134097630226240 submission.py:134] 0) loss = 2.305
I1108 17:54:38.989048 134097630226240 spec.py:321] Evaluating on the training split.
E1108 17:54:40.106069 134097630226240 submission_runner.py:463] Eval step 1 error.
Traceback (most recent call last):
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 388, in train_once
latest_eval_result = workload.eval_model(global_eval_batch_size,
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/spec.py", line 322, in eval_model
train_metrics = self._eval_model_on_split(
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/workload.py", line 177, in _eval_model_on_split
synced_metrics = self._eval_model(params,
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 220, in _eval_model
logits, _ = self.model_fn(
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 178, in model_fn
logits_batch = model(augmented_and_preprocessed_input_batch['inputs'])
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
return _compile(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 586, in _compile
raise InternalTorchDynamoError(str(e)).with_traceback(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
out_code = transform_code_object(code, transform)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
tracer.run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
super().run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1115, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/nn_module.py", line 716, in call_function
).call_function(tx, [self] + list(args), kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 261, in call_function
return super().call_function(tx, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2179, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2286, in inline_call_
tracer.run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1155, in CALL_FUNCTION_EX
self.call_function(fn, argsvars.items, kwargsvars.items)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 307, in call_function
return super().call_function(tx, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 261, in call_function
return super().call_function(tx, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2179, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2286, in inline_call_
tracer.run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1191, in LOAD_ATTR
result = BuiltinVariable(getattr).call_function(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 618, in call_function
result = handler(tx, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 1116, in call_getattr
obj.var_getattr(tx, name).clone(source=source).add_options(options)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/user_defined.py", line 491, in var_getattr
return VariableBuilder(tx, source)(subobj).add_options(options)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 223, in __call__
vt = self._wrap(value).clone(**self.options())
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 368, in _wrap
return type_dispatch(self, value)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 879, in wrap_tensor
return self.tx.output.register_attr_or_module(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 728, in register_attr_or_module
return wrap_name(name)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 634, in wrap_name
return wrap_fx_proxy(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1187, in wrap_fx_proxy
return wrap_fx_proxy_cls(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1302, in wrap_fx_proxy_cls
example_value = wrap_to_fake_tensor_and_record(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1583, in wrap_to_fake_tensor_and_record
fake_e = wrap_fake_exception(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 916, in wrap_fake_exception
return fn()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1584, in <lambda>
lambda: tx.fake_mode.from_tensor(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 1721, in from_tensor
return self.fake_tensor_converter(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 371, in __call__
return self.from_real_tensor(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 324, in from_real_tensor
out = self.meta_converter(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py", line 591, in __call__
r = self.meta_tensor(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py", line 307, in meta_tensor
base = self.meta_tensor(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py", line 478, in meta_tensor
r.grad = self.meta_tensor(
torch._dynamo.exc.InternalTorchDynamoError: attempting to assign a gradient of size '[867877]' to a tensor of size '[1735754]'. Please ensure that the gradient and the tensor are the same size
from user code:
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/models.py", line 129, in forward
x = self.conv1(x)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
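Two details worth noting here. First, 867877 is exactly half of 1735754, which matches this two-rank run: Dynamo's fake-tensor converter appears to meet a parameter whose .grad is still sharded to 1/world_size of its full size while the parameter itself has been unsharded for the forward, a rough edge of tracing through FSDP with use_orig_params=True. Second, the fallback the message suggests would be set before the compiled model is first called, e.g. next to the torch.compile call (placement is an assumption; the flag itself is quoted from the message above):

import torch._dynamo

torch._dynamo.config.suppress_errors = True  # fall back to eager when Dynamo raises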
E1108 17:54:40.128622 139809002714944 submission_runner.py:463] Eval step 1 error.
Traceback (most recent call last):
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 388, in train_once
latest_eval_result = workload.eval_model(global_eval_batch_size,
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/spec.py", line 322, in eval_model
train_metrics = self._eval_model_on_split(
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/workload.py", line 177, in _eval_model_on_split
synced_metrics = self._eval_model(params,
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 220, in _eval_model
logits, _ = self.model_fn(
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 178, in model_fn
logits_batch = model(augmented_and_preprocessed_input_batch['inputs'])
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
return _compile(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 586, in _compile
raise InternalTorchDynamoError(str(e)).with_traceback(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
out_code = transform_code_object(code, transform)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
tracer.run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
super().run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1115, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/nn_module.py", line 716, in call_function
).call_function(tx, [self] + list(args), kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 261, in call_function
return super().call_function(tx, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2179, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2286, in inline_call_
tracer.run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
return inner_fn(self, inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1155, in CALL_FUNCTION_EX
self.call_function(fn, argsvars.items, kwargsvars.items)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 307, in call_function
return super().call_function(tx, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 261, in call_function
return super().call_function(tx, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
return tx.inline_user_function_return(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2179, in inline_call
return cls.inline_call_(parent, func, args, kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2286, in inline_call_
tracer.run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1191, in LOAD_ATTR
result = BuiltinVariable(getattr).call_function(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 618, in call_function
result = handler(tx, *args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 1116, in call_getattr
obj.var_getattr(tx, name).clone(source=source).add_options(options)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/user_defined.py", line 491, in var_getattr
return VariableBuilder(tx, source)(subobj).add_options(options)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 223, in __call__
vt = self._wrap(value).clone(**self.options())
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 368, in _wrap
return type_dispatch(self, value)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 879, in wrap_tensor
return self.tx.output.register_attr_or_module(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 728, in register_attr_or_module
return wrap_name(name)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 634, in wrap_name
return wrap_fx_proxy(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1187, in wrap_fx_proxy
return wrap_fx_proxy_cls(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1302, in wrap_fx_proxy_cls
example_value = wrap_to_fake_tensor_and_record(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1583, in wrap_to_fake_tensor_and_record
fake_e = wrap_fake_exception(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 916, in wrap_fake_exception
return fn()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1584, in <lambda>
lambda: tx.fake_mode.from_tensor(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 1721, in from_tensor
return self.fake_tensor_converter(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 371, in __call__
return self.from_real_tensor(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 324, in from_real_tensor
out = self.meta_converter(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py", line 591, in __call__
r = self.meta_tensor(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py", line 307, in meta_tensor
base = self.meta_tensor(
File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py", line 478, in meta_tensor
r.grad = self.meta_tensor(
torch._dynamo.exc.InternalTorchDynamoError: attempting to assign a gradient of size '[867877]' to a tensor of size '[1735754]'. Please ensure that the gradient and the tensor are the same size
from user code:
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/models.py", line 129, in forward
x = self.conv1(x)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
Traceback (most recent call last):
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 715, in <module>
app.run(main)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 683, in main
score = score_submission_on_workload(
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 588, in score_submission_on_workload
timing, metrics = train_once(workload, workload_name,
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 351, in train_once
optimizer_state, model_params, model_state = update_params(
File "/kaggle/working/algorithmic-efficiency/reference_algorithms/paper_baselines/momentum/pytorch/submission.py", line 91, in update_params
logits_batch, new_model_state = workload.model_fn(
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 169, in model_fn
model.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 897, in apply
module.apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 579, in apply
self._assert_state(TrainingState.IDLE)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1013, in _assert_state
raise ValueError(msg)
ValueError: expected to be in states [<TrainingState.IDLE: 1>] but current state is TrainingState.FORWARD_BACKWARD
Asserting FSDP instance is: FullyShardedDataParallel(
(_fsdp_wrapped_module): ResNet(
(conv1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act_fnc): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act_fnc): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act_fnc): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(conv): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act_fnc): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act_fnc): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(conv): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): FullyShardedDataParallel(
(_fsdp_wrapped_module): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act_fnc): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(layer4): Sequential(
(0): BasicBlock(
(conv1): FullyShardedDataParallel(
(_fsdp_wrapped_module): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act_fnc): ReLU(inplace=True)
(conv2): FullyShardedDataParallel(
(_fsdp_wrapped_module): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(conv): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): FullyShardedDataParallel(
(_fsdp_wrapped_module): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(act_fnc): ReLU(inplace=True)
(conv2): FullyShardedDataParallel(
(_fsdp_wrapped_module): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(fc): Linear(in_features=512, out_features=10, bias=True)
)
)
ERROR: expected to be in states [<TrainingState.IDLE: 1>] but current state is TrainingState.FORWARD_BACKWARD
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 715, in <module>
app.run(main)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 683, in main
score = score_submission_on_workload(
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 588, in score_submission_on_workload
timing, metrics = train_once(workload, workload_name,
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 351, in train_once
optimizer_state, model_params, model_state = update_params(
File "/kaggle/working/algorithmic-efficiency/reference_algorithms/paper_baselines/momentum/pytorch/submission.py", line 91, in update_params
logits_batch, new_model_state = workload.model_fn(
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 169, in model_fn
model.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 897, in apply
module.apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 579, in apply
self._assert_state(TrainingState.IDLE)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1012, in _assert_state
traceback.print_stack()
Traceback (most recent call last):
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 715, in <module>
app.run(main)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 683, in main
score = score_submission_on_workload(
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 588, in score_submission_on_workload
timing, metrics = train_once(workload, workload_name,
File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 351, in train_once
optimizer_state, model_params, model_state = update_params(
File "/kaggle/working/algorithmic-efficiency/reference_algorithms/paper_baselines/momentum/pytorch/submission.py", line 91, in update_params
logits_batch, new_model_state = workload.model_fn(
File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 169, in model_fn
model.apply(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 897, in apply
module.apply(fn)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 579, in apply
self._assert_state(TrainingState.IDLE)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1013, in _assert_state
raise ValueError(msg)
ValueError: expected to be in states [<TrainingState.IDLE: 1>] but current state is TrainingState.FORWARD_BACKWARD
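This second failure looks like a cascade rather than an independent bug: the Dynamo error above aborted the eval forward mid-flight, apparently leaving the root FSDP instance in TrainingState.FORWARD_BACKWARD, so the next training step's model.apply(...) inside workload.model_fn trips FSDP's IDLE assertion. FSDP overrides Module.apply because it must gather the sharded parameters around the callback, hence the state check. A hypothetical sketch of the failing pattern (the actual callback passed by model_fn is not shown in this log, and fsdp_model stands in for the wrapped model):

import torch.nn as nn

def set_batch_norm_eval(module: nn.Module) -> None:
    # illustrative callback only
    if isinstance(module, nn.BatchNorm2d):
        module.eval()

fsdp_model.apply(set_batch_norm_eval)  # raises ValueError unless the FSDP state is IDLE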
[2024-11-08 17:54:43,347] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 270) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
submission_runner.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-11-08_17:54:43
host : 88ba16d426dc
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 271)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-08_17:54:43
host : 88ba16d426dc
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 270)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
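The elastic summary above reports only exit codes because neither rank wrote an error file. Per the linked page, decorating the entrypoint with the record decorator makes each rank's traceback get written to an error file that the launcher can then report; a minimal sketch:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # training entrypoint (submission_runner invokes its main via absl's app.run)

if __name__ == '__main__':
    main()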