# CMD: CUDA_VISIBLE_DEVICES=0 base_cmd stories_small --num-workers 2 --use-plasma-view #segfault

2021-03-04 11:49:24 | INFO | fairseq_cli.train | Started plasma server pid 495419
2021-03-04 11:49:24 | INFO | fairseq.tasks.language_modeling | dictionary: 50264 types
2021-03-04 11:49:24 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/valid
/arrow/cpp/src/plasma/store.cc:1274: Allowing the Plasma store to use up to 107.374GB of memory.
/arrow/cpp/src/plasma/store.cc:1297: Starting object store with directory /dev/shm and huge page support disabled
2021-03-04 11:49:27 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-03-04 11:49:27 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-03-04 11:49:27 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-03-04 11:49:27 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-03-04 11:49:27 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 1
2021-03-04 11:49:27 | INFO | fairseq.trainer | Preparing to load checkpoint x.pt
2021-03-04 11:49:27 | INFO | fairseq.trainer | No existing checkpoint found x.pt
2021-03-04 11:49:27 | INFO | fairseq.trainer | loading train data for epoch 1
2021-03-04 11:49:27 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/train
2021-03-04 11:49:27 | INFO | fairseq.trainer | begin training epoch 1
2021-03-04 11:49:27 | INFO | fairseq_cli.train | Start iterating over samples
2021-03-04 11:49:27 | INFO | fairseq_cli.train | begin validation on "valid" subset
2021-03-04 11:49:28 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 15.995 | ppl 65324.1 | wps 16554.1 | wpb 457.4 | bsz 1 | num_updates 4
2021-03-04 11:49:28 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2021-03-04 11:49:28 | INFO | train | epoch 001 | loss 16.211 | ppl 75844.5 | wps 4747.4 | ups 10.3 | wpb 473 | bsz 1 | num_updates 4 | lr 0.0005 | gnorm 1 | loss_scale 128 | train_wall 0 | gb_free 15.5 | wall 1
2021-03-04 11:49:28 | INFO | fairseq_cli.train | done training in 0.8 seconds
2021-03-04 11:49:28 | INFO | wandb.sdk.internal.internal | Internal process exited
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 7
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 6
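
For context, `--use-plasma-view` keeps dataset index arrays in an Arrow Plasma object store so that dataloader workers share one copy instead of each holding their own. A minimal sketch of the put/get round trip involved, assuming a pyarrow build that still ships the `pyarrow.plasma` module (an illustration, not fairseq's actual `plasma_utils` code):

```python
import numpy as np
import pyarrow.plasma as plasma

# start_plasma_store launches a store subprocess, matching the
# "Started plasma server pid ..." lines above.
with plasma.start_plasma_store(int(1e9)) as (socket_path, _proc):
    client = plasma.connect(socket_path)  # one client per process
    oid = client.put(np.arange(10))       # write the array to shared memory once
    print(client.get(oid))                # any connected client reads it back by id
```

The single-GPU run above exits cleanly, with the store logging orderly client disconnects; the two-GPU run below does not.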

(ray) ➜ fairseq-public-fork git:(plasma-latest) ✗ CUDA_VISIBLE_DEVICES=0,1 base_cmd stories_small --num-workers 2 --use-plasma-view #segfault

/arrow/cpp/src/plasma/io.cc:177: Connection to IPC socket failed for pathname /tmp/plasma, retrying 200 more times
/arrow/cpp/src/plasma/io.cc:177: Connection to IPC socket failed for pathname /tmp/plasma, retrying 199 more times
/arrow/cpp/src/plasma/io.cc:177: Connection to IPC socket failed for pathname /tmp/plasma, retrying 198 more times
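
These retry lines are the plasma client polling the IPC socket while the freshly launched store is still starting up; once the store binds `/tmp/plasma`, the connection succeeds. A minimal sketch of the client-side knob involved (`num_retries` is a real `pyarrow.plasma.connect` parameter; whether fairseq sets it explicitly is not visible in this log):

```python
import pyarrow.plasma as plasma

# connect() polls the socket until the store is up, emitting the
# "Connection to IPC socket failed ... retrying N more times" messages.
client = plasma.connect("/tmp/plasma", num_retries=200)
```
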
/arrow/cpp/src/plasma/store.cc:1274: Allowing the Plasma store to use up to 107.374GB of memory.
/arrow/cpp/src/plasma/store.cc:1297: Starting object store with directory /dev/shm and huge page support disabled
2021-03-04 11:49:47 | INFO | fairseq_cli.train | Started plasma server pid 495624
2021-03-04 11:49:50 | INFO | fairseq.distributed.utils | distributed init (rank 0): tcp://localhost:15262
2021-03-04 11:49:50 | INFO | fairseq.distributed.utils | distributed init (rank 1): tcp://localhost:15262
2021-03-04 11:49:52 | INFO | fairseq.distributed.utils | initialized host learnfair0552 as rank 1
2021-03-04 11:49:52 | INFO | fairseq.distributed.utils | initialized host learnfair0552 as rank 0
2021-03-04 11:49:52 | INFO | fairseq.tasks.language_modeling | dictionary: 50264 types
2021-03-04 11:49:52 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/valid
2021-03-04 11:49:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2021-03-04 11:49:53 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-03-04 11:49:53 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-03-04 11:49:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2021-03-04 11:49:53 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2021-03-04 11:49:53 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 1
2021-03-04 11:49:53 | INFO | fairseq.trainer | Preparing to load checkpoint x.pt
2021-03-04 11:49:53 | INFO | fairseq.trainer | No existing checkpoint found x.pt
2021-03-04 11:49:53 | INFO | fairseq.trainer | loading train data for epoch 1
2021-03-04 11:49:53 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/train
2021-03-04 11:49:53 | INFO | fairseq.trainer | begin training epoch 1
2021-03-04 11:49:53 | INFO | fairseq_cli.train | Start iterating over samples
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 14
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 15
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 18
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 13
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 16
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 17
2021-03-04 11:50:00 | INFO | root | Reducer buckets have been rebuilt in this iteration.
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 14
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 13
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 18
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 17
2021-03-04 11:50:00 | INFO | fairseq_cli.train | begin validation on "valid" subset
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 7
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 11
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 12
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 8
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 9
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 10
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 16
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 15
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 13
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 19
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 12
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 11
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 10
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 9
2021-03-04 11:50:07 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 16.102 | ppl 70348.6 | wps 3821.8 | wpb 762.3 | bsz 1.7 | num_updates 2
2021-03-04 11:50:07 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2021-03-04 11:50:07 | INFO | train | epoch 001 | loss 16.255 | ppl 78210.7 | wps 114.4 | ups 0.13 | wpb 946 | bsz 2 | num_updates 2 | lr 0.0005 | gnorm 0.889 | loss_scale 128 | train_wall 1 | gb_free 15.4 | wall 15
2021-03-04 11:50:07 | INFO | wandb.sdk.internal.internal | Internal process exited
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 8
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 7
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 9
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 13

Traceback (most recent call last):
  File "/private/home/sshleifer/.conda/envs/ray/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/private/home/sshleifer/fairseq-public-fork/fairseq_cli/train.py", line 490, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/distributed/utils.py", line 349, in call_main
    join=True,
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/distributed/utils.py", line 326, in distributed_main
    main(cfg, **kwargs)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq_cli/train.py", line 168, in main
    disable_iterator_cache=task.has_sharded_data("train"),
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/trainer.py", line 464, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/iterators.py", line 343, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/iterators.py", line 343, in <listcomp>
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/monolingual_dataset.py", line 98, in __getitem__
    source, future_target, past_target = self.dataset[index]
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/token_block_dataset.py", line 162, in __getitem__
    start_ds_idx, start_offset, end_ds_idx = self.block_to_dataset_index[index]
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/token_block_dataset.py", line 155, in block_to_dataset_index
    return self._block_to_dataset_index.array
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/plasma_utils.py", line 142, in array
    ret = self.client.get(self.object_id)
  File "pyarrow/_plasma.pyx", line 595, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/_plasma.pyx", line 583, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/_plasma.pyx", line 431, in pyarrow._plasma.PlasmaClient.get_buffers
  File "pyarrow/_plasma.pyx", line 325, in pyarrow._plasma.PlasmaClient._get_object_buffers
  File "pyarrow/_plasma.pyx", line 289, in pyarrow._plasma.plasma_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Bad file descriptor

2021-03-04 11:50:08 | INFO | wandb.sdk.internal.internal | Internal process exited
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 6
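
The traceback bottoms out in `PlasmaClient.get` raising `OSError: Bad file descriptor` inside a process created by `torch.multiprocessing.spawn`. A hedged sketch of that failure mode, not a confirmed diagnosis of this log: a `PlasmaClient` connected before the spawn holds a socket file descriptor that is not valid in the child, so each spawned worker needs to open its own connection (the `/tmp/plasma` path and the reconnect-per-worker pattern here are assumptions for illustration):

```python
import pyarrow.plasma as plasma
import torch.multiprocessing as mp

SOCKET = "/tmp/plasma"  # assumed store path, as in the retry messages above

def worker(rank, oid_bytes):
    # Reconnect inside the spawned process; reusing a client object created
    # in the parent is the kind of thing that surfaces "Bad file descriptor".
    client = plasma.connect(SOCKET)
    oid = plasma.ObjectID(oid_bytes)  # ObjectIDs travel as their 20 raw bytes
    print(rank, client.get(oid))

if __name__ == "__main__":
    parent = plasma.connect(SOCKET)
    oid = parent.put([1, 2, 3])
    # Pass the object id, not the client: clients do not survive the spawn boundary.
    mp.spawn(worker, args=(oid.binary(),), nprocs=2, join=True)
```

Running this sketch requires a plasma store already listening on `/tmp/plasma` (e.g. `plasma_store -m 1000000000 -s /tmp/plasma`).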