Created May 29, 2024 20:52
W0529 13:52:17.543000 140709939508224 torch/distributed/run.py:778]
W0529 13:52:17.543000 140709939508224 torch/distributed/run.py:778] *****************************************
W0529 13:52:17.543000 140709939508224 torch/distributed/run.py:778] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0529 13:52:17.543000 140709939508224 torch/distributed/run.py:778] *****************************************
Running tests: 0%| | 0/6 [00:00<?, ?it/s]
Running tests: 0%| | 0/6 [00:00<?, ?it/s]
Running tests: 17%|█▋ | 1/6 [00:01<00:08, 1.61s/it]
Running tests: 17%|█▋ | 1/6 [00:02<00:14, 2.84s/it]
NCCL version 2.20.5+cuda12.2
Running tests: 33%|███▎ | 2/6 [00:04<00:07, 1.93s/it]
Running tests: 33%|███▎ | 2/6 [00:02<00:05, 1.45s/it]
Running tests: 50%|█████ | 3/6 [00:03<00:02, 1.12it/s]
Running tests: 50%|█████ | 3/6 [00:04<00:03, 1.15s/it]
Running tests: 67%|██████▋ | 4/6 [00:03<00:01, 1.56it/s]
Running tests: 67%|██████▋ | 4/6 [00:04<00:01, 1.25it/s]
Test test_fp8_mlp_tensor_parallelism_base failed with error: Tensor-likes are not close!
Mismatched elements: 507 / 512 (99.0%)
Greatest absolute difference: 0.010187506675720215 at index (24, 13) (up to 1e-05 allowed)
Greatest relative difference: 55.54027557373047 at index (12, 8) (up to 1.3e-06 allowed)
Running tests: 67%|██████▋ | 4/6 [00:03<00:01, 1.15it/s]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 248, in <module>
[rank1]:     raise e
[rank1]:   File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 245, in <module>
[rank1]:     test(device_mesh)
[rank1]:   File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 215, in test_fp8_mlp_tensor_parallelism_base
[rank1]:     torch.testing.assert_close(tp_out, global_out)
[rank1]:   File "/data/users/vasiliy/pytorch/torch/testing/_comparison.py", line 1523, in assert_close
[rank1]:     raise error_metas[0].to_error(msg)
[rank1]: AssertionError: Tensor-likes are not close!
[rank1]: Mismatched elements: 507 / 512 (99.0%)
[rank1]: Greatest absolute difference: 0.010187506675720215 at index (24, 13) (up to 1e-05 allowed)
[rank1]: Greatest relative difference: 55.54027557373047 at index (12, 8) (up to 1.3e-06 allowed)
Test test_fp8_mlp_tensor_parallelism_base failed with error: Tensor-likes are not close!
Mismatched elements: 507 / 512 (99.0%)
Greatest absolute difference: 0.010187506675720215 at index (24, 13) (up to 1e-05 allowed)
Greatest relative difference: 55.54027557373047 at index (12, 8) (up to 1.3e-06 allowed)
Running tests: 67%|██████▋ | 4/6 [00:04<00:02, 1.17s/it]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 248, in <module>
[rank0]:     raise e
[rank0]:   File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 245, in <module>
[rank0]:     test(device_mesh)
[rank0]:   File "/data/users/vasiliy/float8_experimental/test/test_dtensor.py", line 215, in test_fp8_mlp_tensor_parallelism_base
[rank0]:     torch.testing.assert_close(tp_out, global_out)
[rank0]:   File "/data/users/vasiliy/pytorch/torch/testing/_comparison.py", line 1523, in assert_close
[rank0]:     raise error_metas[0].to_error(msg)
[rank0]: AssertionError: Tensor-likes are not close!
[rank0]: Mismatched elements: 507 / 512 (99.0%)
[rank0]: Greatest absolute difference: 0.010187506675720215 at index (24, 13) (up to 1e-05 allowed)
[rank0]: Greatest relative difference: 55.54027557373047 at index (12, 8) (up to 1.3e-06 allowed)
[rank0]:[W529 13:52:24.112238076 ProcessGroupNCCL.cpp:1125] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0529 13:52:26.261000 140709939508224 torch/distributed/elastic/multiprocessing/api.py:857] Sending process 3826678 closing signal SIGTERM
E0529 13:52:27.026000 140709939508224 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 0 (pid: 3826676) of binary: /home/vasiliy/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/home/vasiliy/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/vasiliy/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/data/users/vasiliy/pytorch/torch/distributed/run.py", line 900, in main
    run(args)
  File "/data/users/vasiliy/pytorch/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/data/users/vasiliy/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/vasiliy/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test/test_dtensor.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-29_13:52:26
  host      : devgpu003.cco3.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3826676)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
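
For reading the `AssertionError` in the log: `torch.testing.assert_close` treats an element as close when `|actual - expected| <= atol + rtol * |expected|`, and the log shows the limits in effect were `atol=1e-05` and `rtol=1.3e-06`. The snippet below is a minimal pure-Python sketch of that per-element check, not the PyTorch implementation. The function name `within_tolerance` and the reference value `1.0` are assumptions made for illustration; the log does not report the actual tensor values, only the differences.

```python
# Sketch of the per-element closeness criterion documented for
# torch.testing.assert_close. Pure Python, illustrative only.

RTOL = 1.3e-6  # relative tolerance reported in the log
ATOL = 1e-5    # absolute tolerance reported in the log


def within_tolerance(actual, expected, rtol=RTOL, atol=ATOL):
    """Return True if |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Greatest absolute difference reported in the log.
reported_abs_diff = 0.010187506675720215

# Hypothetical reference value; the real tensor entries are not in the log.
expected = 1.0
actual = expected + reported_abs_diff

print(within_tolerance(actual, expected))
```

With a reference value of magnitude 1, the allowed difference is about `1.13e-05`, so a difference of roughly `1e-02` falls well outside the limit.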