These are my public, unorganized notes on stumbling blocks I've run across setting up DeepMask / SharpMask for training: https://github.com/facebookresearch/deepmask
An alternative (non-Facebook, Python instead of Torch) open-source implementation of DeepMask was previously available here: https://github.com/abbypa/NNProject_DeepMask
Before the Facebook code was released, I started some work on Dockerizing this implementation which may also help others use it: https://github.com/ryanfb/NNProject_DeepMask/blob/docker_experimental/Dockerfile
N.B. this implementation should also run on OS X, provided you can get all the dependencies installed. It uses a lot of memory (over 4 GB) just for the network, so it typically needs to be trained on the CPU instead of the GPU, which is relatively slow.
You need to unzip the huge MSCOCO zip files the README tells you to download. If you try to use the default `unzip` command, you'll get an error like this:

```
Archive: val2014.zip
warning [val2014.zip]: 2345540230 extra bytes at beginning or within zipfile
(attempting to process anyway)
error [val2014.zip]: start of central directory not found;
zipfile corrupt.
(please check that you have transferred or created the zipfile in the
appropriate BINARY mode and that you have compiled UnZip properly)
```
You need to use 7zip (`brew install p7zip`) to unzip these instead. Inside your `$DEEPMASK/data` directory, use the commands `7z x train2014.zip`, `7z x val2014.zip`, and `7z x instances_train-val2014.zip`. Your `data/` directory should then have the subdirectories `annotations/`, `train2014/`, and `val2014/`.
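Put together, the extraction steps look something like this (a sketch, assuming `$DEEPMASK` points at your deepmask checkout and the three zips have already been downloaded into `$DEEPMASK/data`):

```shell
# Extract the MSCOCO archives with p7zip instead of unzip
# (assumes $DEEPMASK points at your deepmask checkout)
cd "$DEEPMASK/data"
7z x train2014.zip
7z x val2014.zip
7z x instances_train-val2014.zip

# Sanity-check the expected layout afterwards
for d in annotations train2014 val2014; do
  [ -d "$d" ] || echo "missing directory: $d"
done
```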
Install the Lua `coco` module by downloading and following the instructions here: https://github.com/pdollar/coco
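For reference, the install boils down to something like the following (a sketch; the rockspec path is taken from the pdollar/coco README at the time of writing, so check that repo if it has moved):

```shell
# Clone the COCO API repo and install its Lua API via luarocks
git clone https://github.com/pdollar/coco.git
cd coco
luarocks make LuaAPI/rocks/coco-scm-1.rockspec
```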
If you see errors like `module 'cutorch' not found`, `module 'nn' not found`, or `error: use of undeclared identifier 'TH_INDEX_BASE'`: upgrade your torch installation by running `./update.sh` in your `torch` directory, then run `luarocks install cutorch scm-1`. See: torch/cutorch#480. You don't need to run `luarocks install nn`.
By default I got the following error:

```
-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
| running in directory /Users/ryan/mess/2016/34/deepmask/exps/deepmask/exp
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
/Users/ryan/source/torch/install/bin/luajit: ...n/source/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] not enough memory
stack traceback:
	[C]: in function 'error'
	...n/source/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
	...n/source/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
	...n/source/torch/install/share/lua/5.1/threads/threads.lua:142: in function 'specific'
	...n/source/torch/install/share/lua/5.1/threads/threads.lua:125: in function 'Threads'
	/Users/ryan/mess/2016/34/deepmask/DataLoader.lua:40: in function '__init'
	...s/ryan/source/torch/install/share/lua/5.1/torch/init.lua:91: in function <...s/ryan/source/torch/install/share/lua/5.1/torch/init.lua:87>
	[C]: in function 'DataLoader'
	/Users/ryan/mess/2016/34/deepmask/DataLoader.lua:21: in function 'create'
	train.lua:101: in main chunk
	[C]: in function 'dofile'
	...urce/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x0107f24cf0
```
I tried using `th train.lua -nthreads 1` to reduce memory usage, and still got this error:

```
-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
nthreads 1 2
| running in directory /Users/ryan/mess/2016/34/deepmask/exps/deepmask/exp,nthreads=1
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
FATAL THREAD PANIC: (write) not enough memory
```
SOLUTION: bring up an interactive torch shell with `th` and load the annotations to do the `.t7` conversion outside the training process:

```
th> coco = require 'coco'
[0.0000s]
th> coco.CocoApi("data/annotations/instances_train2014.json")
convert: data/annotations/instances_train2014.json --> .t7 [please be patient]
converting: annotations
converting: categories
converting: images
convert: building indices
convert: complete [57.22 s]
CocoApi
[58.0662s]
th> coco.CocoApi("data/annotations/instances_val2014.json")
convert: data/annotations/instances_val2014.json --> .t7 [please be patient]
converting: annotations
converting: categories
converting: images
convert: building indices
convert: complete [26.07 s]
CocoApi
[26.3127s]
th>
```
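If you'd rather not do this interactively, the same conversion can likely be scripted from the command line (a sketch, assuming your `th` supports the `-e` flag for executing a string):

```shell
# Pre-convert both annotation files to .t7 outside the training process
cd "$DEEPMASK"
th -e "coco = require 'coco'; coco.CocoApi('data/annotations/instances_train2014.json')"
th -e "coco = require 'coco'; coco.CocoApi('data/annotations/instances_val2014.json')"
```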
After fixing the annotations loading, I still got a CUDA out-of-memory error when training started:
```
-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
nthreads 1 2
| running in directory /Users/ryan/mess/2016/34/deepmask/exps/deepmask/exp,nthreads=1
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
| start training
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-7759/cutorch/lib/THC/generic/THCStorage.cu line=40 error=2 : out of memory
/Users/ryan/source/torch/install/bin/luajit: ...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 7 module of nn.Sequential:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
...an/source/torch/install/share/lua/5.1/nn/ConcatTable.lua:68: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7759/cutorch/lib/THC/generic/THCStorage.cu:40
stack traceback:
	[C]: in function 'resizeAs'
	...an/source/torch/install/share/lua/5.1/nn/ConcatTable.lua:68: in function <...an/source/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
	[C]: in function 'xpcall'
	...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function <...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:78>
	[C]: in function 'xpcall'
	...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function <...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:78>
	[C]: in function 'xpcall'
	...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function <...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:78>
	[C]: in function 'xpcall'
	...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
	/Users/ryan/mess/2016/34/deepmask/TrainerDeepMask.lua:88: in function 'train'
	train.lua:117: in main chunk
	[C]: in function 'dofile'
	...urce/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x010f24dcf0
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
	[C]: in function 'error'
	...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
	...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
	/Users/ryan/mess/2016/34/deepmask/TrainerDeepMask.lua:88: in function 'train'
	train.lua:117: in main chunk
	[C]: in function 'dofile'
	...urce/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x010f24dcf0
```
The options appear to be:

- run on a GPU with more memory (mine has 4 GB of RAM, so EC2 g2 instances won't fix this; I could try an Azure instance)
- patch the code to run on the CPU (i.e. switch all cutorch calls back to torch, which may be incredibly slow)
- somehow reduce the GPU memory usage of the existing CUDA/GPU implementation
SOLUTION: reducing the GPU memory usage with `-batch 8` allowed training to run for me.
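For reference, the resulting training invocation is something like this (a sketch; the `-batch` and `-maxepoch` values here are inferred from the experiment directory name that shows up later in these notes, and a workable batch size will depend on your GPU's memory):

```shell
# Train with a reduced batch size to fit a 4 GB GPU
cd "$DEEPMASK"
th train.lua -batch 8 -maxepoch 10
```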
Running `th computeProposals.lua` against my own trained model files resulted in the following error:

```
| loading model file... /Users/ryan/mess/current/deepmask/exps/deepmask/exp,batch=8,maxepoch=10
| start
/Users/ryan/source/torch/install/bin/luajit: bad argument #2 to '?' (out of range at /Users/ryan/source/torch/pkg/torch/generic/Tensor.c:890)
stack traceback:
	[C]: at 0x0b5e7600
	[C]: in function '__index'
	/Users/ryan/mess/2016/34/deepmask/InferDeepMask.lua:167: in function 'getTopScores'
	/Users/ryan/mess/2016/34/deepmask/InferDeepMask.lua:224: in function 'getTopProps'
	computeProposals.lua:84: in main chunk
	[C]: in function 'dofile'
	...urce/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x010b2a3cf0
```

See: facebookresearch/deepmask#9
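For context, the invocation pattern is roughly the following (a sketch based on the DeepMask README: the first argument is the trained model directory, and `-img` selects the input image; check `computeProposals.lua` itself for the current options):

```shell
# Compute proposals with a trained model on a single image
cd "$DEEPMASK"
th computeProposals.lua exps/deepmask/exp,batch=8,maxepoch=10 -img /path/to/image.jpg
```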
If you're seeing lines with `nan` loss in your training logs, like this:

```
[train] | epoch 00001 | s/batch 1.74 | loss: nan
```

You need to update DeepMask (`git pull` inside the source directory), install `inn` with `luarocks install inn`, and re-run your training.
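In shell terms, the fix is roughly:

```shell
# Update the DeepMask checkout and install the inn module, then re-run training
cd "$DEEPMASK"
git pull
luarocks install inn
```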