DeepMask / SharpMask training notes

About

These are my public, unorganized notes on stumbling blocks I've run across setting up DeepMask / SharpMask for training: https://github.com/facebookresearch/deepmask

NNProject_DeepMask

An alternative (non-Facebook, Python instead of Torch) open-source implementation of DeepMask was previously available here: https://github.com/abbypa/NNProject_DeepMask

Before the Facebook code was released, I started some work on Dockerizing this implementation, which may also help others use it: https://github.com/ryanfb/NNProject_DeepMask/blob/docker_experimental/Dockerfile
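
As a rough sketch, building an image from that branch looks something like this (the image tag nnproject-deepmask is my own choice, not something defined by the repository):

git clone -b docker_experimental https://github.com/ryanfb/NNProject_DeepMask.git
cd NNProject_DeepMask
docker build -t nnproject-deepmask .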

N.B. this implementation should also run on OS X, provided you can get all the dependencies installed. It uses a lot of memory (>4 GB) just for the network, so it typically needs to be trained on the CPU instead of the GPU, which is relatively slow.

data/ directory setup

You need to unzip the huge MSCOCO zip files the README tells you to download. If you try to use the default unzip command, you'll get an error like this:


Archive:  val2014.zip
warning [val2014.zip]:  2345540230 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [val2014.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)

You need to use 7zip (brew install p7zip) to unzip these instead. Inside your $DEEPMASK/data directory, extract the archives as shown below; your data/ directory should then contain the subdirectories annotations/, train2014/, and val2014/.
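
The full sequence (brew install p7zip is the macOS route; use your own package manager elsewhere):

brew install p7zip
cd $DEEPMASK/data
7z x train2014.zip
7z x val2014.zip
7z x instances_train-val2014.zip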

module 'coco' not found

Install the Lua coco module by downloading and following the instructions here: https://github.com/pdollar/coco
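
A minimal sketch of that install, assuming luarocks is available and the rockspec still lives where the coco README describes (check the repository if the layout has changed):

git clone https://github.com/pdollar/coco.git
cd coco
luarocks make LuaAPI/rocks/coco-scm-1.rockspec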

module 'cutorch' not found, module 'nn' not found, error: use of undeclared identifier 'TH_INDEX_BASE'

Upgrade your torch installation by running ./update.sh in your torch directory, then run luarocks install cutorch scm-1. See: torch/cutorch#480. You don't need to run luarocks install nn.
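
Spelled out (adjust the path to wherever your torch checkout lives):

cd ~/torch
./update.sh
luarocks install cutorch scm-1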

Out of Memory loading annotations during th train.lua

By default I got the following error:

-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
| running in directory /Users/ryan/mess/2016/34/deepmask/exps/deepmask/exp
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
/Users/ryan/source/torch/install/bin/luajit: ...n/source/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] not enough memory
stack traceback:
        [C]: in function 'error'
        ...n/source/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
        ...n/source/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
        ...n/source/torch/install/share/lua/5.1/threads/threads.lua:142: in function 'specific'
        ...n/source/torch/install/share/lua/5.1/threads/threads.lua:125: in function 'Threads'
        /Users/ryan/mess/2016/34/deepmask/DataLoader.lua:40: in function '__init'
        ...s/ryan/source/torch/install/share/lua/5.1/torch/init.lua:91: in function <...s/ryan/source/torch/install/share/lua/5.1/torch/init.lua:87>
        [C]: in function 'DataLoader'
        /Users/ryan/mess/2016/34/deepmask/DataLoader.lua:21: in function 'create'
        train.lua:101: in main chunk
        [C]: in function 'dofile'
        ...urce/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x0107f24cf0

I tried th train.lua -nthreads 1 to reduce memory usage, and still got this error:

-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
nthreads        1       2
| running in directory /Users/ryan/mess/2016/34/deepmask/exps/deepmask/exp,nthreads=1
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
convert: data//annotations/instances_train2014.json --> .t7 [please be patient]
FATAL THREAD PANIC: (write) not enough memory

SOLUTION: bring up an interactive torch shell with th and load the annotations to do the .t7 conversion outside the training process:

th> coco = require 'coco'
                                                                      [0.0000s]
th> coco.CocoApi("data/annotations/instances_train2014.json")
convert: data/annotations/instances_train2014.json --> .t7 [please be patient]
converting: annotations
converting: categories
converting: images
convert: building indices
convert: complete [57.22 s]
CocoApi
                                                                      [58.0662s]
th> coco.CocoApi("data/annotations/instances_val2014.json")
convert: data/annotations/instances_val2014.json --> .t7 [please be patient]
converting: annotations
converting: categories
converting: images
convert: building indices
convert: complete [26.07 s]
CocoApi
                                                                      [26.3127s]
th>
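
If you prefer not to use the interactive shell, the same conversion can probably be scripted in one shot, assuming your th supports trepl's -e flag for evaluating a string:

th -e "coco = require 'coco'; coco.CocoApi('data/annotations/instances_train2014.json'); coco.CocoApi('data/annotations/instances_val2014.json')"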

cuda runtime error (2) : out of memory

After fixing the annotation loading, I still got an out-of-memory error from CUDA when training started:

-- ignore option rundir
-- ignore option dm
-- ignore option reload
-- ignore option gpu
-- ignore option datadir
nthreads        1       2
| running in directory /Users/ryan/mess/2016/34/deepmask/exps/deepmask/exp,nthreads=1
| number of paramaters trunk: 15198016
| number of paramaters mask branch: 1608768
| number of paramaters score branch: 526337
| number of paramaters total: 17333121
| start training
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-7759/cutorch/lib/THC/generic/THCStorage.cu line=40 error=2 : out of memory
/Users/ryan/source/torch/install/bin/luajit: ...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 7 module of nn.Sequential:
In 3 module of nn.Sequential:
In 1 module of nn.Sequential:
...an/source/torch/install/share/lua/5.1/nn/ConcatTable.lua:68: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-7759/cutorch/lib/THC/generic/THCStorage.cu:40
stack traceback:
        [C]: in function 'resizeAs'
        ...an/source/torch/install/share/lua/5.1/nn/ConcatTable.lua:68: in function <...an/source/torch/install/share/lua/5.1/nn/ConcatTable.lua:30>
        [C]: in function 'xpcall'
        ...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        ...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function <...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:78>
        [C]: in function 'xpcall'
        ...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        ...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function <...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:78>
        [C]: in function 'xpcall'
        ...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        ...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function <...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:78>
        [C]: in function 'xpcall'
        ...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        ...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
        /Users/ryan/mess/2016/34/deepmask/TrainerDeepMask.lua:88: in function 'train'
        train.lua:117: in main chunk
        [C]: in function 'dofile'
        ...urce/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x010f24dcf0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        ...ryan/source/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        ...yan/source/torch/install/share/lua/5.1/nn/Sequential.lua:88: in function 'backward'
        /Users/ryan/mess/2016/34/deepmask/TrainerDeepMask.lua:88: in function 'train'
        train.lua:117: in main chunk
        [C]: in function 'dofile'
        ...urce/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x010f24dcf0

The options appear to be:

  • run on a GPU with more memory (mine has 4 GB, so EC2 g2 instances won't help; an Azure instance might)
  • patch the code to run on the CPU (i.e. switch all cutorch calls back to torch), which would likely be incredibly slow
  • somehow reduce the GPU memory usage of the existing CUDA/GPU implementation

SOLUTION: reducing GPU memory usage with -batch 8 allowed training to run for me.
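
Concretely, both flags can be passed on the command line (they show up in the experiment directory name, e.g. exp,batch=8,maxepoch=10):

th train.lua -batch 8               # smaller batches, lower GPU memory use
th train.lua -batch 8 -nthreads 1   # also reduce CPU-side memory if needed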

bad argument #2 to '?' when running against my own trained model

Running th computeProposals.lua against my own trained model resulted in the following error:

| loading model file... /Users/ryan/mess/current/deepmask/exps/deepmask/exp,batch=8,maxepoch=10
| start
/Users/ryan/source/torch/install/bin/luajit: bad argument #2 to '?' (out of range at /Users/ryan/source/torch/pkg/torch/generic/Tensor.c:890)
stack traceback:
        [C]: at 0x0b5e7600
        [C]: in function '__index'
        /Users/ryan/mess/2016/34/deepmask/InferDeepMask.lua:167: in function 'getTopScores'
        /Users/ryan/mess/2016/34/deepmask/InferDeepMask.lua:224: in function 'getTopProps'
        computeProposals.lua:84: in main chunk
        [C]: in function 'dofile'
        ...urce/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x010b2a3cf0

See: facebookresearch/deepmask#9

If you're seeing lines like this with nan loss in your training logs:

[train] | epoch 00001 | s/batch 1.74 | loss:     nan 

You need to update DeepMask (git pull inside the source directory), install inn with luarocks install inn, and re-run your training.
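
In shell form, roughly:

cd $DEEPMASK            # your deepmask checkout
git pull
luarocks install inn
th train.lua -batch 8   # re-run training with whatever options you were using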

@kyleingraham

Hi. Thank you for your notes. Above you wrote the following:
SOLUTION: reducing the GPU memory usage with -batch 8 allowed training to run for me

What do you mean by '-batch 8', and how would I go about setting it? I get an out-of-memory error when running computeProposals.lua for SharpMask.

@sebaschaal

The -batch 8 flag only helps for the training procedure (train.lua).
If you run out of memory while using computeProposals.lua, your image is too large.
Try reducing the image size (for a 2 GB GPU, a maximum side length of 400 did the job).
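
A minimal sketch of that resize, assuming ImageMagick is installed; the -img flag and pretrained-model path follow the DeepMask/SharpMask README's quick-start example, so adjust them to your setup:

convert input.jpg -resize "400x400>" input_small.jpg   # shrink the longest side to <=400px, never enlarge
th computeProposals.lua $DEEPMASK/pretrained/sharpmask -img input_small.jpg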
