
@jramapuram
Created April 29, 2020 09:13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
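The "Gradient overflow. Skipping step" messages above come from dynamic loss scaling in mixed-precision training: when any scaled gradient overflows to inf/nan, the step is skipped and the loss scale is halved; after a run of clean steps the scale is grown again. A minimal sketch of that logic (class and method names are illustrative, not Apex's actual API):

```python
import math

class DynamicLossScaler:
    """Sketch of dynamic loss scaling: halve the scale and skip the step
    on gradient overflow, periodically grow it after clean steps."""

    def __init__(self, init_scale=131072.0, growth_interval=2000):
        self.scale = init_scale          # 131072.0 = 2**17, as in the log
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads):
        # Overflow check: any inf/nan gradient invalidates this step.
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale /= 2.0            # reduce loss scale, e.g. 131072 -> 65536
            self._good_steps = 0
            return False                 # caller should skip optimizer.step()
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0            # probe a larger scale again
            self._good_steps = 0
        return True                      # safe to apply the step
```

Under this scheme, occasional overflow messages early in training are expected and harmless; the scale simply oscillates around the largest value that does not overflow, which matches the 16384–131072 range seen in this log.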
train-0[Epoch 1][1280768 samples][849.67 sec]: Loss: 7.0388 Top-1: 0.1027 Top-5: 0.4965
test-0[Epoch 1][50176 samples][17.05 sec]: Loss: 6.9965 Top-1: 0.1016 Top-5: 0.4604
/home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:114: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
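The warning above refers to the PyTorch >= 1.1 requirement that `optimizer.step()` run before `lr_scheduler.step()`; it can also fire spuriously when something wraps `optimizer.step` after the scheduler is constructed (as AMP tooling does). A minimal sketch of the correct call order, with toy model and hyperparameters chosen only for illustration:

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

for epoch in range(3):
    loss = model(torch.randn(8, 4)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # update the weights first...
    scheduler.step()   # ...then advance the learning-rate schedule
```

Calling the two in the opposite order silently skips the first scheduled learning-rate value, which is what the warning is guarding against.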
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
train-0[Epoch 2][1281280 samples][851.96 sec]: Loss: 5.2698 Top-1: 8.3982 Top-5: 20.8343
test-0[Epoch 2][50176 samples][16.72 sec]: Loss: 4.0580 Top-1: 18.9772 Top-5: 41.2129
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
train-0[Epoch 3][1281280 samples][848.86 sec]: Loss: 3.9013 Top-1: 22.7465 Top-5: 44.8709
test-0[Epoch 3][50176 samples][17.22 sec]: Loss: 3.6010 Top-1: 26.4190 Top-5: 50.2671
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
train-0[Epoch 4][1281280 samples][852.70 sec]: Loss: 3.3167 Top-1: 31.4567 Top-5: 55.7103
test-0[Epoch 4][50176 samples][17.07 sec]: Loss: 2.9855 Top-1: 35.9196 Top-5: 61.6071
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 65536.0
train-0[Epoch 5][1281280 samples][850.95 sec]: Loss: 2.9109 Top-1: 38.2023 Top-5: 62.8001
test-0[Epoch 5][50176 samples][17.12 sec]: Loss: 2.4874 Top-1: 44.2821 Top-5: 70.0155
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
train-0[Epoch 6][1281280 samples][852.87 sec]: Loss: 2.6764 Top-1: 42.3411 Top-5: 66.7361
test-0[Epoch 6][50176 samples][17.10 sec]: Loss: 2.6723 Top-1: 41.9703 Top-5: 67.0819
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
train-0[Epoch 7][1281280 samples][853.50 sec]: Loss: 2.5180 Top-1: 45.1213 Top-5: 69.3008
test-0[Epoch 7][50176 samples][16.95 sec]: Loss: 2.2402 Top-1: 49.1291 Top-5: 74.2427
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 131072.0
Traceback (most recent call last):
  File "supervised_main.py", line 636, in <module>
    run(rank=0, num_replicas=args.num_replicas)
  File "supervised_main.py", line 602, in run
    train(epoch, model, optimizer, loader.train_loader, grapher)
  File "supervised_main.py", line 529, in train
    return execute_graph(epoch, model, train_loader, grapher, optimizer, prefix='train')
  File "supervised_main.py", line 448, in execute_graph
    for minibatch, labels in loader:
  File "/home/jramapuram/sshfs/ml_base/datasets/dali_imagefolder.py", line 230, in __next__
    sample = super(DALIClassificationIteratorLikePytorch, self).__next__()
  File "/home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/nvidia/dali/plugin/pytorch.py", line 177, in __next__
    outputs.append(p.share_outputs())
  File "/home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/nvidia/dali/pipeline.py", line 410, in share_outputs
    return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline: [/opt/dali/dali/util/local_file.cc:105] File mapping failed: /datasets/imagenet/ILSVRC/Data/CLS-LOC/train/n01601694/n01601694_13136.JPG
Stacktrace (9 entries):
[frame 0]: /home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x6ab7e) [0x7f6b188acb7e]
[frame 1]: /home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/nvidia/dali/libdali.so(+0x1772b4) [0x7f6b189b92b4]
[frame 2]: /home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/nvidia/dali/libdali.so(dali::FileStream::Open(std::string const&, bool)+0xfb) [0x7f6b189ac0eb]
[frame 3]: /home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x12effea) [0x7f6af5599fea]
[frame 4]: /home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x133454a) [0x7f6af55de54a]
[frame 5]: /home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x1335d25) [0x7f6af55dfd25]
[frame 6]: /home/jramapuram/.venv3/envs/pytorch1.5-py37/lib/python3.7/site-packages/nvidia/dali/libdali_operators.so(+0x18d0bb0) [0x7f6af5b7abb0]
[frame 7]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f6b87cf8609]
[frame 8]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f6b87c1f103]
Current pipeline object is no longer valid.
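The `File mapping failed` error comes from DALI memory-mapping the image file (`local_file.cc`), which typically fails when the file is unreadable, zero-length, or sitting on a flaky mount. A quick stdlib-only check over the offending path reproduces what DALI attempts; the helper name is illustrative:

```python
import mmap
import os

def can_mmap(path):
    """Return True if the file can be memory-mapped read-only,
    roughly mirroring what DALI's FileStream::Open does."""
    try:
        with open(path, "rb") as f:
            if os.fstat(f.fileno()).st_size == 0:
                return False  # mmap of an empty file raises ValueError
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ):
                return True
    except (OSError, ValueError):
        return False
```

Running a check like this over the file named in the error (or the whole dataset) helps distinguish a truncated/corrupt JPEG from a transient filesystem problem, e.g. an sshfs or network mount dropping mid-epoch.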