@vkuzo
Created May 29, 2024 20:52
W0529 13:52:17.543000 140709939508224 torch/distributed/run.py:778]
W0529 13:52:17.543000 140709939508224 torch/distributed/run.py:778] *****************************************
W0529 13:52:17.543000 140709939508224 torch/distributed/run.py:778] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0529 13:52:17.543000 140709939508224 torch/distributed/run.py:778] *****************************************
Running tests: 0%| | 0/6 [00:00<?, ?it/s]
Running tests: 0%| | 0/6 [00:00<?, ?it/s]
Running tests: 17%|█▋ | 1/6 [00:01<00:08, 1.61s/it]
Running tests: 17%|█▋ | 1/6 [00:02<00:14, 2.84s/it]
NCCL version 2.20.5+cuda12.2
Running tests: 33%|███▎ | 2/6 [00:04<00:07, 1.93s/it]
Running tests: 33%|███▎ | 2/6 [00:02<00:05, 1.45s/it]
Running tests: 50%|█████ | 3/6 [00:03<00:02, 1.12it/s]
Running tests: 50%|█████ | 3/6 [00:04<00:03, 1.15s/it]
Running tests: 67%|██████▋ | 4/6 [00:03<00:01, 1.56it/s]
Running tests: 67%|██████▋ | 4/6 [00:04<00:01, 1.25it/s]
Test test_fp8_mlp_tensor_parallelism_base failed with error: Tensor-likes are not close!
Mismatched elements: 507 / 512 (99.0%)
Greatest absolute difference: 0.010187506675720215 at index (24, 13) (up to 1e-05 allowed)
Greatest relative difference: 55.54027557373047 at index (12, 8) (up to 1.3e-06 allowed)
Running tests: 67%|██████▋ | 4/6 [00:03<00:01, 1.15it/s]
[rank1]: Traceback (most recent call last):
[rank1]: File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 248, in <module>
[rank1]: raise e
[rank1]: File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 245, in <module>
[rank1]: test(device_mesh)
[rank1]: File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 215, in test_fp8_mlp_tensor_parallelism_base
[rank1]: torch.testing.assert_close(tp_out, global_out)
[rank1]: File "/data/users/vasiliy/pytorch/torch/testing/_comparison.py", line 1523, in assert_close
[rank1]: raise error_metas[0].to_error(msg)
[rank1]: AssertionError: Tensor-likes are not close!
[rank1]: Mismatched elements: 507 / 512 (99.0%)
[rank1]: Greatest absolute difference: 0.010187506675720215 at index (24, 13) (up to 1e-05 allowed)
[rank1]: Greatest relative difference: 55.54027557373047 at index (12, 8) (up to 1.3e-06 allowed)
Test test_fp8_mlp_tensor_parallelism_base failed with error: Tensor-likes are not close!
Mismatched elements: 507 / 512 (99.0%)
Greatest absolute difference: 0.010187506675720215 at index (24, 13) (up to 1e-05 allowed)
Greatest relative difference: 55.54027557373047 at index (12, 8) (up to 1.3e-06 allowed)
Running tests: 67%|██████▋ | 4/6 [00:04<00:02, 1.17s/it]
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 248, in <module>
[rank0]: raise e
[rank0]: File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 245, in <module>
[rank0]: test(device_mesh)
[rank0]: File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 215, in test_fp8_mlp_tensor_parallelism_base
[rank0]: torch.testing.assert_close(tp_out, global_out)
[rank0]: File "/data/users/vasiliy/pytorch/torch/testing/_comparison.py", line 1523, in assert_close
[rank0]: raise error_metas[0].to_error(msg)
[rank0]: AssertionError: Tensor-likes are not close!
[rank0]: Mismatched elements: 507 / 512 (99.0%)
[rank0]: Greatest absolute difference: 0.010187506675720215 at index (24, 13) (up to 1e-05 allowed)
[rank0]: Greatest relative difference: 55.54027557373047 at index (12, 8) (up to 1.3e-06 allowed)
[rank0]:[W529 13:52:24.112238076 ProcessGroupNCCL.cpp:1125] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0529 13:52:26.261000 140709939508224 torch/distributed/elastic/multiprocessing/api.py:857] Sending process 3826678 closing signal SIGTERM
E0529 13:52:27.026000 140709939508224 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 0 (pid: 3826676) of binary: /home/vasiliy/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
File "/home/vasiliy/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/vasiliy/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/data/users/vasiliy/pytorch/torch/distributed/run.py", line 900, in main
run(args)
File "/data/users/vasiliy/pytorch/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/data/users/vasiliy/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/vasiliy/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test/test_dtensor.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-29_13:52:26
host : devgpu003.cco3.facebook.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3826676)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
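
Note on the failure above: torch.testing.assert_close treats an element as close when |actual - expected| <= atol + rtol * |expected|. The "up to 1e-05 allowed" and "up to 1.3e-06 allowed" figures in the log are the atol and rtol in effect for this comparison. The sketch below restates that check in plain Python and applies it to the greatest absolute difference reported in the log. The function name `is_close` and the `expected` value are hypothetical, chosen only for illustration; the actual tensor values are not in the log. Whether loosening the tolerances is the right fix for this test is not something the log establishes.

```python
# Tolerances reported in the log ("up to ... allowed").
RTOL = 1.3e-6
ATOL = 1e-5


def is_close(actual: float, expected: float,
             rtol: float = RTOL, atol: float = ATOL) -> bool:
    """Scalar version of the closeness criterion:
    |actual - expected| <= atol + rtol * |expected|.
    """
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Greatest absolute difference copied from the log.
greatest_abs_diff = 0.010187506675720215

# Hypothetical reference value (an assumption; the log does not show it),
# paired with an "actual" value that is off by the logged difference.
expected = 0.5
actual = expected + greatest_abs_diff

# The check fails: the difference is far larger than the allowed budget.
print("close under reported tolerances:", is_close(actual, expected))

# How many times the absolute tolerance the logged difference is.
print("abs diff / atol:", greatest_abs_diff / ATOL)
```

A difference roughly three orders of magnitude above atol, on 507 of 512 elements, means the two outputs disagree well beyond what these tolerances permit.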