@sshleifer
Last active March 4, 2021 18:50

Setup

git clone git@github.com:sshleifer/fairseq.git
cd fairseq
git fetch
git checkout plasma-latest
pip install -e .
tar -xzvf small_ds.tgz
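
The commands below also assume that pyarrow with its plasma extension is installed in this environment (the logs further down reference libarrow.so.300, i.e. pyarrow 3.0.x); a quick sanity check:

python -c "import pyarrow, pyarrow.plasma; print(pyarrow.__version__)"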

We use the following base training command throughout; copy this shell function into your terminal.

base_cmd () {
    # $1 is the data directory; remaining arguments are forwarded to fairseq-train.
    dd=$1
    shift
    fairseq-train --fp16 "$dd" --task language_modeling \
        --arch transformer_lm_gpt2_tiny --sample-break-mode complete \
        --tokens-per-sample 512 --optimizer adam --clip-norm 0.0 --lr 0.0005 \
        --batch-size 1 --max-update 200 --max-epoch 1 --log-format simple \
        --log-interval 100 --restore-file x.pt --no-save \
        --skip-invalid-size-inputs-valid-test "$@"
}
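
For example, the reproductions below call it with the data directory (stories_small) as the first argument and pass any extra flags straight through to fairseq-train:

base_cmd stories_small --num-workers 2 --use-plasma-view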

Segfault

Command for traceback.py:

pkill -f plasma_store
CUDA_VISIBLE_DEVICES=0,1 base_cmd stories_small --num-workers 2 --use-plasma-view # works
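
Between runs it helps to confirm that no plasma_store process or stale socket is left behind; a quick check (the /tmp/plasma path is the default socket that shows up in the logs below):

pgrep -af plasma_store || echo "no plasma_store running"
ls -l /tmp/plasma 2>/dev/null || echo "no /tmp/plasma socket"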

Command for traceback2.py:

pkill -f plasma_store
NO_LOCK=1 CUDA_VISIBLE_DEVICES=0 base_cmd stories_small --num-workers 2 --use-plasma-view --max-epoch 2
# CMD: CUDA_VISIBLE_DEVICES=0 base_cmd stories_small --num-workers 2 --use-plasma-view #segfault
2021-03-04 11:49:24 | INFO | fairseq_cli.train | Started plasma server pid 495419
2021-03-04 11:49:24 | INFO | fairseq.tasks.language_modeling | dictionary: 50264 types
2021-03-04 11:49:24 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/valid
/arrow/cpp/src/plasma/store.cc:1274: Allowing the Plasma store to use up to 107.374GB of memory.
/arrow/cpp/src/plasma/store.cc:1297: Starting object store with directory /dev/shm and huge page support disabled
2021-03-04 11:49:27 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-03-04 11:49:27 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-03-04 11:49:27 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-03-04 11:49:27 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-03-04 11:49:27 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 1
2021-03-04 11:49:27 | INFO | fairseq.trainer | Preparing to load checkpoint x.pt
2021-03-04 11:49:27 | INFO | fairseq.trainer | No existing checkpoint found x.pt
2021-03-04 11:49:27 | INFO | fairseq.trainer | loading train data for epoch 1
2021-03-04 11:49:27 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/train
2021-03-04 11:49:27 | INFO | fairseq.trainer | begin training epoch 1
2021-03-04 11:49:27 | INFO | fairseq_cli.train | Start iterating over samples
2021-03-04 11:49:27 | INFO | fairseq_cli.train | begin validation on "valid" subset
2021-03-04 11:49:28 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 15.995 | ppl 65324.1 | wps 16554.1 | wpb 457.4 | bsz 1 | num_updates 4
2021-03-04 11:49:28 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2021-03-04 11:49:28 | INFO | train | epoch 001 | loss 16.211 | ppl 75844.5 | wps 4747.4 | ups 10.3 | wpb 473 | bsz 1 | num_updates 4 | lr 0.0005 | gnorm 1 | loss_scale 128 | train_wall 0 | gb_free 15.5 | wall 1
2021-03-04 11:49:28 | INFO | fairseq_cli.train | done training in 0.8 seconds
2021-03-04 11:49:28 | INFO | wandb.sdk.internal.internal | Internal process exited
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 7
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 6
(ray) ➜ fairseq-public-fork git:(plasma-latest) ✗ CUDA_VISIBLE_DEVICES=0,1 base_cmd stories_small --num-workers 2 --use-plasma-view #segfault
/arrow/cpp/src/plasma/io.cc:177: Connection to IPC socket failed for pathname /tmp/plasma, retrying 200 more times
/arrow/cpp/src/plasma/io.cc:177: Connection to IPC socket failed for pathname /tmp/plasma, retrying 199 more times
/arrow/cpp/src/plasma/io.cc:177: Connection to IPC socket failed for pathname /tmp/plasma, retrying 198 more times
/arrow/cpp/src/plasma/store.cc:1274: Allowing the Plasma store to use up to 107.374GB of memory.
/arrow/cpp/src/plasma/store.cc:1297: Starting object store with directory /dev/shm and huge page support disabled
2021-03-04 11:49:47 | INFO | fairseq_cli.train | Started plasma server pid 495624
2021-03-04 11:49:50 | INFO | fairseq.distributed.utils | distributed init (rank 0): tcp://localhost:15262
2021-03-04 11:49:50 | INFO | fairseq.distributed.utils | distributed init (rank 1): tcp://localhost:15262
2021-03-04 11:49:52 | INFO | fairseq.distributed.utils | initialized host learnfair0552 as rank 1
2021-03-04 11:49:52 | INFO | fairseq.distributed.utils | initialized host learnfair0552 as rank 0
2021-03-04 11:49:52 | INFO | fairseq.tasks.language_modeling | dictionary: 50264 types
2021-03-04 11:49:52 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/valid
2021-03-04 11:49:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2021-03-04 11:49:53 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-03-04 11:49:53 | INFO | fairseq.utils | rank 1: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-03-04 11:49:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 2 workers***********************
2021-03-04 11:49:53 | INFO | fairseq_cli.train | training on 2 devices (GPUs/TPUs)
2021-03-04 11:49:53 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 1
2021-03-04 11:49:53 | INFO | fairseq.trainer | Preparing to load checkpoint x.pt
2021-03-04 11:49:53 | INFO | fairseq.trainer | No existing checkpoint found x.pt
2021-03-04 11:49:53 | INFO | fairseq.trainer | loading train data for epoch 1
2021-03-04 11:49:53 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/train
2021-03-04 11:49:53 | INFO | fairseq.trainer | begin training epoch 1
2021-03-04 11:49:53 | INFO | fairseq_cli.train | Start iterating over samples
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 14
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 15
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 18
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 13
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 16
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 17
2021-03-04 11:50:00 | INFO | root | Reducer buckets have been rebuilt in this iteration.
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 14
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 13
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 18
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 17
2021-03-04 11:50:00 | INFO | fairseq_cli.train | begin validation on "valid" subset
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 7
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 11
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 12
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 8
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 9
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 10
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 16
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 15
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 13
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 19
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 12
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 11
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 10
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 9
2021-03-04 11:50:07 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 16.102 | ppl 70348.6 | wps 3821.8 | wpb 762.3 | bsz 1.7 | num_updates 2
2021-03-04 11:50:07 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2021-03-04 11:50:07 | INFO | train | epoch 001 | loss 16.255 | ppl 78210.7 | wps 114.4 | ups 0.13 | wpb 946 | bsz 2 | num_updates 2 | lr 0.0005 | gnorm 0.889 | loss_scale 128 | train_wall 1 | gb_free 15.4 | wall 15
2021-03-04 11:50:07 | INFO | wandb.sdk.internal.internal | Internal process exited
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 8
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 7
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 9
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 13
Traceback (most recent call last):
  File "/private/home/sshleifer/.conda/envs/ray/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/private/home/sshleifer/fairseq-public-fork/fairseq_cli/train.py", line 490, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/distributed/utils.py", line 349, in call_main
    join=True,
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/distributed/utils.py", line 326, in distributed_main
    main(cfg, **kwargs)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq_cli/train.py", line 168, in main
    disable_iterator_cache=task.has_sharded_data("train"),
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/trainer.py", line 464, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/iterators.py", line 343, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/iterators.py", line 343, in <listcomp>
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/monolingual_dataset.py", line 98, in __getitem__
    source, future_target, past_target = self.dataset[index]
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/token_block_dataset.py", line 162, in __getitem__
    start_ds_idx, start_offset, end_ds_idx = self.block_to_dataset_index[index]
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/token_block_dataset.py", line 155, in block_to_dataset_index
    return self._block_to_dataset_index.array
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/plasma_utils.py", line 142, in array
    ret = self.client.get(self.object_id)
  File "pyarrow/_plasma.pyx", line 595, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/_plasma.pyx", line 583, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/_plasma.pyx", line 431, in pyarrow._plasma.PlasmaClient.get_buffers
  File "pyarrow/_plasma.pyx", line 325, in pyarrow._plasma.PlasmaClient._get_object_buffers
  File "pyarrow/_plasma.pyx", line 289, in pyarrow._plasma.plasma_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Bad file descriptor
2021-03-04 11:50:08 | INFO | wandb.sdk.internal.internal | Internal process exited
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 6
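
The OSError above is raised inside the pyarrow plasma client's get() call (the pyarrow/_plasma.pyx frames), which is what fairseq/data/plasma_utils.py invokes to read an array back from the store. A minimal sketch of that call path using the stock pyarrow.plasma API, shown only as an illustration, not fairseq's wrapper code:

python - <<'PY'
import numpy as np
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")  # default socket path, as in the logs above
oid = client.put(np.arange(8))          # store an array, keep its ObjectID
print(client.get(oid))                  # fine while the plasma_store process is alive
# If plasma_store dies before the next read (e.g. after `pkill -f plasma_store`),
# client.get(oid) raises OSError, similar to the tracebacks in this gist.
PY
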
NO_LOCK=1 CUDA_VISIBLE_DEVICES=0 base_cmd stories_small --num-workers 2 --use-plasma-view --max-epoch 2
2021-03-04 11:59:07 | INFO | fairseq_cli.train | Started plasma server pid 500605
2021-03-04 11:59:07 | INFO | fairseq.tasks.language_modeling | dictionary: 50264 types
2021-03-04 11:59:07 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/valid
/arrow/cpp/src/plasma/store.cc:1274: Allowing the Plasma store to use up to 107.374GB of memory.
/arrow/cpp/src/plasma/store.cc:1297: Starting object store with directory /dev/shm and huge page support disabled
2021-03-04 11:59:10 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-03-04 11:59:10 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-03-04 11:59:10 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-03-04 11:59:10 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-03-04 11:59:10 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 1
2021-03-04 11:59:10 | INFO | fairseq.trainer | Preparing to load checkpoint x.pt
2021-03-04 11:59:10 | INFO | fairseq.trainer | No existing checkpoint found x.pt
2021-03-04 11:59:10 | INFO | fairseq.trainer | loading train data for epoch 1
2021-03-04 11:59:10 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/train
2021-03-04 11:59:10 | INFO | fairseq.trainer | begin training epoch 1
2021-03-04 11:59:10 | INFO | fairseq_cli.train | Start iterating over samples
2021-03-04 11:59:10 | INFO | fairseq_cli.train | begin validation on "valid" subset
2021-03-04 11:59:11 | INFO | valid | epoch 001 | valid on 'valid' subset | loss 15.995 | ppl 65324.1 | wps 20399.2 | wpb 457.4 | bsz 1 | num_updates 4
2021-03-04 11:59:11 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2021-03-04 11:59:11 | INFO | train | epoch 001 | loss 16.211 | ppl 75844.5 | wps 5367.4 | ups 11.64 | wpb 473 | bsz 1 | num_updates 4 | lr 0.0005 | gnorm 1 | loss_scale 128 | train_wall 0 | gb_free 15.5 | wall 1
2021-03-04 11:59:11 | INFO | fairseq.trainer | begin training epoch 2
2021-03-04 11:59:11 | INFO | fairseq_cli.train | Start iterating over samples
2021-03-04 11:59:11 | INFO | fairseq_cli.train | begin validation on "valid" subset
2021-03-04 11:59:11 | INFO | valid | epoch 002 | valid on 'valid' subset | loss 15.762 | ppl 55580.3 | wps 24032 | wpb 457.4 | bsz 1 | num_updates 8 | best_loss 15.762
2021-03-04 11:59:11 | INFO | fairseq_cli.train | end of epoch 2 (average epoch stats below)
2021-03-04 11:59:11 | INFO | train | epoch 002 | loss 15.855 | ppl 59287 | wps 5052.2 | ups 10.68 | wpb 473 | bsz 1 | num_updates 8 | lr 0.0005 | gnorm 0.992 | loss_scale 128 | train_wall 0 | gb_free 15.4 | wall 1
2021-03-04 11:59:11 | INFO | fairseq_cli.train | done training in 1.1 seconds
2021-03-04 11:59:11 | INFO | wandb.sdk.internal.internal | Internal process exited
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 7
/arrow/cpp/src/plasma/store.cc:754: Disconnecting client on fd 6
(ray) ➜ fairseq-public-fork git:(plasma-latest) ✗ NO_LOCK=1 CUDA_VISIBLE_DEVICES=0 base_cmd stories_small --num-workers 2 --use-plasma-view --max-epoch 2
/arrow/cpp/src/plasma/io.cc:177: Connection to IPC socket failed for pathname /tmp/plasma, retrying 200 more times
/arrow/cpp/src/plasma/io.cc:177: Connection to IPC socket failed for pathname /tmp/plasma, retrying 199 more times
/arrow/cpp/src/plasma/io.cc:177: Connection to IPC socket failed for pathname /tmp/plasma, retrying 198 more times
/arrow/cpp/src/plasma/store.cc:1274: Allowing the Plasma store to use up to 107.374GB of memory.
/arrow/cpp/src/plasma/store.cc:1297: Starting object store with directory /dev/shm and huge page support disabled
2021-03-04 11:59:34 | INFO | fairseq_cli.train | Started plasma server pid 500809
2021-03-04 11:59:34 | INFO | fairseq.tasks.language_modeling | dictionary: 50264 types
2021-03-04 11:59:34 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/valid
2021-03-04 11:59:36 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-03-04 11:59:36 | INFO | fairseq.utils | rank 0: capabilities = 7.0 ; total memory = 15.782 GB ; name = Tesla V100-SXM2-16GB
2021-03-04 11:59:36 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-03-04 11:59:36 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-03-04 11:59:37 | INFO | fairseq_cli.train | max tokens per GPU = None and batch size per GPU = 1
2021-03-04 11:59:37 | INFO | fairseq.trainer | Preparing to load checkpoint x.pt
2021-03-04 11:59:37 | INFO | fairseq.trainer | No existing checkpoint found x.pt
2021-03-04 11:59:37 | INFO | fairseq.trainer | loading train data for epoch 1
2021-03-04 11:59:37 | INFO | fairseq.data.data_utils | loaded 100 examples from: stories_small/train
2021-03-04 11:59:37 | INFO | fairseq.trainer | begin training epoch 1
2021-03-04 11:59:37 | INFO | fairseq_cli.train | Start iterating over samples
/arrow/cpp/src/plasma/store.cc:586: Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1
/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/pyarrow/libarrow.so.300(+0x720cd8)[0x7f9c79bc3cd8]
/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/pyarrow/libarrow.so.300(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7f9c79bc414d]
/private/home/sshleifer/.conda/envs/ray/bin/plasma_store[0x4158ec]
/private/home/sshleifer/.conda/envs/ray/bin/plasma_store[0x418069]
/private/home/sshleifer/.conda/envs/ray/bin/plasma_store[0x41965e]
/private/home/sshleifer/.conda/envs/ray/bin/plasma_store[0x40c56a]
/private/home/sshleifer/.conda/envs/ray/bin/plasma_store[0x422668]
/private/home/sshleifer/.conda/envs/ray/bin/plasma_store[0x4213e5]
/private/home/sshleifer/.conda/envs/ray/bin/plasma_store[0x41335d]
/private/home/sshleifer/.conda/envs/ray/bin/plasma_store[0x40b409]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f9c78fa40b3]
/private/home/sshleifer/.conda/envs/ray/bin/plasma_store[0x40c350]
Traceback (most recent call last):
  File "/private/home/sshleifer/.conda/envs/ray/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/private/home/sshleifer/fairseq-public-fork/fairseq_cli/train.py", line 490, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/distributed/utils.py", line 364, in call_main
    main(cfg, **kwargs)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq_cli/train.py", line 156, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/private/home/sshleifer/fairseq-public-fork/fairseq_cli/train.py", line 273, in train
    for i, samples in enumerate(progress):
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/logging/progress_bar.py", line 256, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/iterators.py", line 59, in __iter__
    for x in self.iterable:
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/iterators.py", line 528, in _chunk_iterator
    for x in itr:
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/iterators.py", line 59, in __iter__
    for x in self.iterable:
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/iterators.py", line 650, in __next__
    raise item
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/iterators.py", line 581, in run
    for item in self._source:
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/private/home/sshleifer/.conda/envs/ray/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/monolingual_dataset.py", line 98, in __getitem__
    source, future_target, past_target = self.dataset[index]
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/token_block_dataset.py", line 167, in __getitem__
    slice_s, slice_e = self.slice_indices[index]
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/token_block_dataset.py", line 147, in slice_indices
    return self._slice_indices.array
  File "/private/home/sshleifer/fairseq-public-fork/fairseq/data/plasma_utils.py", line 145, in array
    ret = self.client.get(self.object_id)
  File "pyarrow/_plasma.pyx", line 595, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/_plasma.pyx", line 583, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/_plasma.pyx", line 431, in pyarrow._plasma.PlasmaClient.get_buffers
  File "pyarrow/_plasma.pyx", line 325, in pyarrow._plasma.PlasmaClient._get_object_buffers
  File "pyarrow/_plasma.pyx", line 289, in pyarrow._plasma.plasma_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Encountered unexpected EOF