Created
November 8, 2024 18:21
Log file for the CIFAR workload with torch.compile and use_orig_params=True in FSDP
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING]
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING] *****************************************
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-11-08 17:53:13,072] torch.distributed.run: [WARNING] *****************************************
I1108 17:53:21.528697 134097630226240 logger_utils.py:81] Creating experiment directory at /kaggle/working/experiments/trial1/cifar_pytorch.
I1108 17:53:21.528696 139809002714944 logger_utils.py:81] Creating experiment directory at /kaggle/working/experiments/trial1/cifar_pytorch.
I1108 17:53:22.236347 134097630226240 submission_runner.py:564] Using RNG seed 3191040022
I1108 17:53:22.237189 139809002714944 logger_utils.py:97] Saving hparams to /kaggle/working/experiments/trial1/cifar_pytorch/trial_1/hparams.json.
I1108 17:53:22.237842 134097630226240 submission_runner.py:573] --- Tuning run 1/1 ---
I1108 17:53:22.238002 134097630226240 submission_runner.py:578] Creating tuning directory at /kaggle/working/experiments/trial1/cifar_pytorch/trial_1.
I1108 17:53:22.238237 134097630226240 logger_utils.py:97] Saving hparams to /kaggle/working/experiments/trial1/cifar_pytorch/trial_1/hparams.json.
I1108 17:53:22.467518 134097630226240 submission_runner.py:215] Initializing dataset.
I1108 17:53:23.170488 134097630226240 submission_runner.py:226] Initializing model.
I1108 17:53:23.401008 134097630226240 submission_runner.py:264] Performing `torch.compile`.
I1108 17:53:24.578627 134097630226240 submission_runner.py:268] Initializing optimizer.
I1108 17:53:25.178119 134097630226240 submission_runner.py:275] Initializing metrics bundle.
I1108 17:53:25.178369 134097630226240 submission_runner.py:293] Initializing checkpoint and logger.
I1108 17:53:25.179337 134097630226240 submission_runner.py:313] Saving meta data to /kaggle/working/experiments/trial1/cifar_pytorch/trial_1/meta_data_0.json.
I1108 17:53:25.256645 134097630226240 submission_runner.py:317] Saving flags to /kaggle/working/experiments/trial1/cifar_pytorch/trial_1/flags_0.json.
I1108 17:53:25.347355 134097630226240 submission_runner.py:329] Starting training loop.
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  torch.has_cuda,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
  torch.has_cudnn,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  torch.has_mps,
/opt/conda/lib/python3.10/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
  torch.has_mkldnn,
I1108 17:54:38.432303 134088132654848 logging_writer.py:48] [0] global_step=0, loss=2.304934
I1108 17:54:38.464384 134097630226240 submission.py:134] 0) loss = 2.305
I1108 17:54:38.989048 134097630226240 spec.py:321] Evaluating on the training split.
E1108 17:54:40.106069 134097630226240 submission_runner.py:463] Eval step 1 error.
Traceback (most recent call last):
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 388, in train_once
    latest_eval_result = workload.eval_model(global_eval_batch_size,
  File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/spec.py", line 322, in eval_model
    train_metrics = self._eval_model_on_split(
  File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/workload.py", line 177, in _eval_model_on_split
    synced_metrics = self._eval_model(params,
  File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 220, in _eval_model
    logits, _ = self.model_fn(
  File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 178, in model_fn
    logits_batch = model(augmented_and_preprocessed_input_batch['inputs'])
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
    return _compile(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 586, in _compile
    raise InternalTorchDynamoError(str(e)).with_traceback(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
    r = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
    tracer.run()
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
    super().run()
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1115, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/nn_module.py", line 716, in call_function
    ).call_function(tx, [self] + list(args), kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 261, in call_function
    return super().call_function(tx, args, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2179, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2286, in inline_call_
    tracer.run()
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1155, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars.items)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 562, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 307, in call_function
    return super().call_function(tx, args, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 261, in call_function
    return super().call_function(tx, args, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2179, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2286, in inline_call_
    tracer.run()
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1191, in LOAD_ATTR
    result = BuiltinVariable(getattr).call_function(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 618, in call_function
    result = handler(tx, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builtin.py", line 1116, in call_getattr
    obj.var_getattr(tx, name).clone(source=source).add_options(options)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/user_defined.py", line 491, in var_getattr
    return VariableBuilder(tx, source)(subobj).add_options(options)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 223, in __call__
    vt = self._wrap(value).clone(**self.options())
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 368, in _wrap
    return type_dispatch(self, value)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 879, in wrap_tensor
    return self.tx.output.register_attr_or_module(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 728, in register_attr_or_module
    return wrap_name(name)
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 634, in wrap_name
    return wrap_fx_proxy(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1187, in wrap_fx_proxy
    return wrap_fx_proxy_cls(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1302, in wrap_fx_proxy_cls
    example_value = wrap_to_fake_tensor_and_record(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1583, in wrap_to_fake_tensor_and_record
    fake_e = wrap_fake_exception(
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 916, in wrap_fake_exception
    return fn()
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/variables/builder.py", line 1584, in <lambda>
    lambda: tx.fake_mode.from_tensor(
  File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 1721, in from_tensor
    return self.fake_tensor_converter(
  File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 371, in __call__
    return self.from_real_tensor(
  File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 324, in from_real_tensor
    out = self.meta_converter(
  File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py", line 591, in __call__
    r = self.meta_tensor(
  File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py", line 307, in meta_tensor
    base = self.meta_tensor(
  File "/opt/conda/lib/python3.10/site-packages/torch/_subclasses/meta_utils.py", line 478, in meta_tensor
    r.grad = self.meta_tensor(
torch._dynamo.exc.InternalTorchDynamoError: attempting to assign a gradient of size '[867877]' to a tensor of size '[1735754]'. Please ensure that the gradient and the tensor are the same size

from user code:
  File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/models.py", line 129, in forward
    x = self.conv1(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
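An aside on the sizes in the error above: 867877 × 2 = 1735754, so the rejected gradient is exactly half the size of the tensor receiving it. That is consistent with a per-rank shard of an FSDP flat parameter being assigned as the gradient of the full, unsharded tensor on a two-rank run (two worker PIDs appear in this log); the world size of 2 is an inference from the log, not something it states directly. A quick check of the arithmetic:

```python
# Sizes taken verbatim from the InternalTorchDynamoError message.
grad_numel = 867_877       # size of the gradient being assigned
tensor_numel = 1_735_754   # size of the tensor receiving it
world_size = 2             # assumption: two ranks, matching the two PIDs in this log

# The mismatch is exactly a factor of the (assumed) world size,
# i.e. one rank's shard vs. the full flat parameter.
assert grad_numel * world_size == tensor_numel
print(tensor_numel - grad_numel)  # -> 867877
```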
E1108 17:54:40.128622 139809002714944 submission_runner.py:463] Eval step 1 error.
torch._dynamo.exc.InternalTorchDynamoError: attempting to assign a gradient of size '[867877]' to a tensor of size '[1735754]'. Please ensure that the gradient and the tensor are the same size
Traceback (most recent call last):
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 715, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 683, in main
    score = score_submission_on_workload(
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 588, in score_submission_on_workload
    timing, metrics = train_once(workload, workload_name,
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 351, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/kaggle/working/algorithmic-efficiency/reference_algorithms/paper_baselines/momentum/pytorch/submission.py", line 91, in update_params
    logits_batch, new_model_state = workload.model_fn(
  File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 169, in model_fn
    model.apply(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 897, in apply
    module.apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 579, in apply
    self._assert_state(TrainingState.IDLE)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1013, in _assert_state
    raise ValueError(msg)
ValueError: expected to be in states [<TrainingState.IDLE: 1>] but current state is TrainingState.FORWARD_BACKWARD
Asserting FSDP instance is: FullyShardedDataParallel(
  (_fsdp_wrapped_module): ResNet(
    (conv1): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (layer1): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act_fnc): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act_fnc): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (layer2): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act_fnc): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (conv): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (bn): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): BasicBlock(
        (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act_fnc): ReLU(inplace=True)
        (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (layer3): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act_fnc): ReLU(inplace=True)
        (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (conv): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): FullyShardedDataParallel(
        (_fsdp_wrapped_module): BasicBlock(
          (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (act_fnc): ReLU(inplace=True)
          (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
    )
    (layer4): Sequential(
      (0): BasicBlock(
        (conv1): FullyShardedDataParallel(
          (_fsdp_wrapped_module): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        )
        (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (act_fnc): ReLU(inplace=True)
        (conv2): FullyShardedDataParallel(
          (_fsdp_wrapped_module): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        )
        (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (downsample): Sequential(
          (conv): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
          (bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
) | |
) | |
(1): BasicBlock( | |
(conv1): FullyShardedDataParallel( | |
(_fsdp_wrapped_module): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) | |
) | |
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) | |
(act_fnc): ReLU(inplace=True) | |
(conv2): FullyShardedDataParallel( | |
(_fsdp_wrapped_module): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) | |
) | |
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) | |
) | |
) | |
(fc): Linear(in_features=512, out_features=10, bias=True) | |
) | |
) | |
ERROR: expected to be in states [<TrainingState.IDLE: 1>] but current state is TrainingState.FORWARD_BACKWARD
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 715, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 683, in main
    score = score_submission_on_workload(
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 588, in score_submission_on_workload
    timing, metrics = train_once(workload, workload_name,
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 351, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/kaggle/working/algorithmic-efficiency/reference_algorithms/paper_baselines/momentum/pytorch/submission.py", line 91, in update_params
    logits_batch, new_model_state = workload.model_fn(
  File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 169, in model_fn
    model.apply(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 897, in apply
    module.apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 579, in apply
    self._assert_state(TrainingState.IDLE)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1012, in _assert_state
    traceback.print_stack()
Traceback (most recent call last):
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 715, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 683, in main
    score = score_submission_on_workload(
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 588, in score_submission_on_workload
    timing, metrics = train_once(workload, workload_name,
  File "/kaggle/working/algorithmic-efficiency/submission_runner.py", line 351, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/kaggle/working/algorithmic-efficiency/reference_algorithms/paper_baselines/momentum/pytorch/submission.py", line 91, in update_params
    logits_batch, new_model_state = workload.model_fn(
  File "/kaggle/working/algorithmic-efficiency/algorithmic_efficiency/workloads/cifar/cifar_pytorch/workload.py", line 169, in model_fn
    model.apply(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 897, in apply
    module.apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 579, in apply
    self._assert_state(TrainingState.IDLE)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 1013, in _assert_state
    raise ValueError(msg)
ValueError: expected to be in states [<TrainingState.IDLE: 1>] but current state is TrainingState.FORWARD_BACKWARD
[2024-11-08 17:54:43,347] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 270) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
submission_runner.py FAILED
------------------------------------------------------------
Failures:
  [1]:
    time       : 2024-11-08_17:54:43
    host       : 88ba16d426dc
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 271)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
  [0]:
    time       : 2024-11-08_17:54:43
    host       : 88ba16d426dc
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 270)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
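Note on the failure above: the run dies because `model.apply(...)` in `model_fn` reaches an FSDP-wrapped submodule while FSDP's internal training state is still `FORWARD_BACKWARD`; `FullyShardedDataParallel.apply` asserts the module is `IDLE` before touching parameters. The sketch below is a toy state machine (not real FSDP code; the class and names are invented for illustration) that reproduces the shape of this check, to make clear why the call succeeds between steps but fails mid forward/backward:

```python
from enum import Enum


class TrainingState(Enum):
    """Mimics FSDP's notion of a per-module training state."""
    IDLE = 1
    FORWARD_BACKWARD = 2


class StateCheckedModule:
    """Toy stand-in for an FSDP wrapper: apply() is only legal when IDLE."""

    def __init__(self):
        self.training_state = TrainingState.IDLE

    def _assert_state(self, expected):
        # Same message shape as the ValueError in the log above.
        if self.training_state != expected:
            raise ValueError(
                f"expected to be in states [{expected}] "
                f"but current state is {self.training_state.name}")

    def apply(self, fn):
        # The real wrapper guards apply() because fn may mutate parameters
        # that are sharded/unsharded mid-step.
        self._assert_state(TrainingState.IDLE)
        fn(self)


m = StateCheckedModule()
m.apply(lambda mod: None)  # OK: module is IDLE between steps

m.training_state = TrainingState.FORWARD_BACKWARD
try:
    m.apply(lambda mod: None)  # raises, matching the log's ValueError
except ValueError as e:
    print(e)
```

In this run the state was presumably left as `FORWARD_BACKWARD` by the interaction of `torch.compile` with the FSDP wrapper, so the usual remedy directions are to run such `apply` calls only when no forward/backward is in flight, or to avoid `module.apply` on FSDP-wrapped submodules inside the training step; the right fix depends on the workload code.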