Kaggle Plankton Classification winner's approach notes
FROM: http://benanne.github.io/2015/03/17/plankton.html
Meta-Tricks:
- Use 10% of the data for validation with STRATIFIED SAMPLING (my mistake)
- Cyclic Pooling
- Leaky ReLU = max(x, a*x) with the slope a learned (see the sketch after this list)
- reduces overfitting with a ~= 1/3
- Orthogonal initialization http://arxiv.org/pdf/1312.6120v3.pdf
- Use larger weight decay for larger models since otherwise some layers might diverge
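A leaky ReLU with a learned slope is essentially a parametric ReLU. A minimal PyTorch sketch (the original solution used Theano/Lasagne; the module and parameter names here are illustrative, and the slope is initialized around 1/3 as the notes suggest):

```python
import torch
import torch.nn as nn

class LearnedLeakyReLU(nn.Module):
    """Sketch of f(x) = max(x, a*x) with a trainable per-channel slope a."""
    def __init__(self, num_channels, init_slope=1.0 / 3.0):
        super().__init__()
        # One slope per channel, initialized near 1/3 as in the notes.
        self.slope = nn.Parameter(torch.full((num_channels,), init_slope))

    def forward(self, x):
        # Broadcast the per-channel slope over (N, C, H, W) activations.
        a = self.slope.view(1, -1, 1, 1)
        return torch.maximum(x, a * x)

# Usage: act = LearnedLeakyReLU(32); y = act(torch.randn(8, 32, 95, 95))
```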
Trick #1:
- Train a network 10 times for different inits.
- Ensemble with uniform weights
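A minimal NumPy sketch of the uniform ensemble, assuming `predictions` holds the softmax outputs of the 10 re-trained networks (name and shape are assumptions for illustration):

```python
import numpy as np

def uniform_ensemble(predictions):
    """predictions: (n_models, n_samples, n_classes) softmax outputs of the same
    architecture trained from different random inits. Uniform weights = plain mean."""
    return np.mean(predictions, axis=0)
```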
Trick #2:
- Extract hand-crafted features from the images
- Train a small 2-layer NN (80 units per layer) per feature set
- Ensemble the feature nets with the main ConvNet model at the softmax layer (sketch below)
- Used features:
  - Image size in pixels
  - Size and shape estimates based on image moments
  - Hu moments
  - Haralick texture features
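A hedged sketch of one feature set (Hu moments via OpenCV) and the small per-feature network; "ensemble at the softmax layer" is read here as averaging class probabilities. All names (`hu_moment_features`, `SmallFeatureNet`) are illustrative, not the winners' code:

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def hu_moment_features(gray_image):
    """Hu moments of a single-channel image, one of the hand-crafted feature sets."""
    hu = cv2.HuMoments(cv2.moments(gray_image)).flatten()
    # Log-scale: raw Hu moments span many orders of magnitude.
    return np.sign(hu) * np.log1p(np.abs(hu))

class SmallFeatureNet(nn.Module):
    """Two hidden layers of 80 units, one such net per feature set."""
    def __init__(self, n_features, n_classes=121):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 80), nn.ReLU(),
            nn.Linear(80, 80), nn.ReLU(),
            nn.Linear(80, n_classes),
        )

    def forward(self, x):
        return self.net(x)

# Blending at the softmax level (probabilities assumed precomputed):
# blended_probs = (convnet_probs + feature_net_probs) / 2
```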
Trick #3:
- Use Conv Auto Encoders for pretraining on the test set http://link.springer.com/chapter/10.1007/978-3-642-21735-7_7
- Use max-unpooling instead of the denoising principle
- Use full-stack training, but sometimes init by layer-wise training
- Use tied weights since it is faster
- In the supervised phase, halve the learning rates to keep the pretrained basis of the filters
- OR train only the FC layers for some initial iterations
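A rough PyTorch sketch of a tied-weight convolutional autoencoder with max-unpooling, in the spirit of Trick #3 (the original was built in Theano; the layer sizes here are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedConvAutoencoder(nn.Module):
    """One encoder stage plus its mirrored decoder, with tied weights."""
    def __init__(self, in_channels=1, hidden_channels=32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, hidden_channels, 3, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        h = F.relu(self.conv(x))
        h, indices = self.pool(h)
        # Decoder: max-unpooling with the stored indices, not a denoising setup.
        h = F.max_unpool2d(h, indices, kernel_size=2, stride=2)
        # Tied weights: reuse the encoder kernel for the transposed convolution.
        return F.conv_transpose2d(h, self.conv.weight, padding=1)
```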
Trick #4:
- Pseudo-labeling to use test data in training (Hinton http://arxiv.org/abs/1503.02531)
- Use previous models, or an ensemble of them, to predict the test data and feed these predictions as labels to a new NN (sketch below).
- It regularizes the network
- Using soft outputs is better than hard outputs
- 33% from test data, 67% from train data in a batch
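A sketch of a pseudo-label loss, assuming batches are already composed of roughly 67% labelled train and 33% test examples whose soft targets come from an earlier model or ensemble; function and argument names are made up for illustration:

```python
import torch.nn.functional as F

def pseudo_label_loss(logits_train, labels_train, logits_test, soft_targets_test):
    """Hard-label loss on train examples plus a soft-target loss on test examples
    whose 'labels' are the previous ensemble's predicted distributions."""
    hard = F.cross_entropy(logits_train, labels_train)
    # Soft outputs beat hard ones: match the full predicted distribution.
    soft = F.kl_div(F.log_softmax(logits_test, dim=1), soft_targets_test,
                    reduction="batchmean")
    return hard + soft
```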
Trick #5:
- Augment the test data and average the predictions
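A minimal sketch of test-time augmentation averaging; `model_predict` and `augmentations` are assumed callables, not part of the original code:

```python
import numpy as np

def tta_predict(model_predict, augmentations, image):
    """Run the model on several augmented copies of a test image and
    average the resulting class probabilities."""
    probs = [model_predict(aug(image)) for aug in augmentations]
    return np.mean(probs, axis=0)
```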
Trick #6:
- Try to learn an affine transformation as a layer of the NN
- It didn't work, but the IDEA is GREAT !!!
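For illustration only (the notes say it did not help here): a small PyTorch layer that predicts a 2x3 affine matrix and warps its own input, the idea later formalized as spatial transformer networks. All sizes and names are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineLayer(nn.Module):
    """A localization head predicts an affine transform applied to its own input."""
    def __init__(self, in_channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(in_channels * 64, 6),
        )
        # Start from the identity transform.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```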
Trick #7:
- Ensemble 300 models
- Find their blend weights on the validation set
- Use the weights to infer the importance of each model
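One way to fit the blend weights on the validation set is to minimize validation log-loss over the weight simplex; a NumPy/SciPy sketch (Nelder-Mead is shown only for brevity and would be slow for 300 models; names and shapes are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_ensemble_weights(val_predictions, val_labels):
    """val_predictions: (n_models, n_samples, n_classes) softmax outputs on the
    validation set; val_labels: integer class ids of shape (n_samples,)."""
    n_models, n_samples, _ = val_predictions.shape

    def log_loss(w):
        w = np.abs(w) / np.abs(w).sum()              # keep weights on the simplex
        blended = np.tensordot(w, val_predictions, axes=1)
        p = blended[np.arange(n_samples), val_labels]
        return -np.mean(np.log(np.clip(p, 1e-15, 1.0)))

    res = minimize(log_loss, np.full(n_models, 1.0 / n_models), method="Nelder-Mead")
    w = np.abs(res.x) / np.abs(res.x).sum()
    return w   # larger weight ~ more important model
```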
Other Tricks:
- Untied biases for each spatial location
- winner takes all nonlinearity (WTA, also known as channel-out) in the fully connected layers instead of ReLUs / maxout.
- Batch Normalization did not work well
Best Tools against Overfitting:
- dropout
- aggressive data augmentation
- model architecture
- weight decay
- unsupervised pre-training (time consuming)
- cyclic pooling (especially with root-mean-squared pooling)
- leaky ReLUs
- pseudo-labeling
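For the "aggressive data augmentation" point, a torchvision sketch in the same spirit (random rotation, shift, zoom, shear and flips); the ranges are assumptions, not the winners' actual settings (they used real-time affine augmentation in Theano):

```python
from torchvision import transforms

# Illustrative augmentation pipeline; parameter ranges are placeholders.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=360, translate=(0.1, 0.1),
                            scale=(0.8, 1.2), shear=20),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])
```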
Final Structure:

Layer type            Size                       Output shape
cyclic slice                                     (128, 1, 95, 95)
convolution           32 3x3 filters             (128, 32, 95, 95)
convolution           16 3x3 filters             (128, 16, 95, 95)
max pooling           3x3, stride 2              (128, 16, 47, 47)
cyclic roll                                      (128, 64, 47, 47)
convolution           64 3x3 filters             (128, 64, 47, 47)
convolution           32 3x3 filters             (128, 32, 47, 47)
max pooling           3x3, stride 2              (128, 32, 23, 23)
cyclic roll                                      (128, 128, 23, 23)
convolution           128 3x3 filters            (128, 128, 23, 23)
convolution           128 3x3 filters            (128, 128, 23, 23)
convolution           64 3x3 filters             (128, 64, 23, 23)
max pooling           3x3, stride 2              (128, 64, 11, 11)
cyclic roll                                      (128, 256, 11, 11)
convolution           256 3x3 filters            (128, 256, 11, 11)
convolution           256 3x3 filters            (128, 256, 11, 11)
convolution           128 3x3 filters            (128, 128, 11, 11)
max pooling           3x3, stride 2              (128, 128, 5, 5)
cyclic roll                                      (128, 256, 5, 5)
fully connected       512 2-piece maxout units   (128, 512)
cyclic pooling (rms)                             (32, 512)
fully connected       512 2-piece maxout units   (32, 512)
fully connected       121-way softmax            (32, 121)
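The cyclic slice / roll / pool rows refer to the team's rotation-equivariance trick (note the batch size going 32 -> 128 at the slice and back to 32 at the pooling). A simplified PyTorch sketch of cyclic slicing and RMS cyclic pooling; cyclic rolling, which additionally shares re-oriented feature maps across the four copies, is omitted here:

```python
import torch

def cyclic_slice(x):
    """Stack the 4 right-angle rotations of each input along the batch axis,
    so one set of weights sees all orientations. (N, C, H, W) -> (4N, C, H, W)."""
    rots = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    return torch.cat(rots, dim=0)

def cyclic_pool_rms(x, eps=1e-12):
    """Merge the 4 rotated copies back into one prediction per original example
    by root-mean-square pooling over the orientation axis. (4N, F) -> (N, F)."""
    n = x.shape[0] // 4
    x = x.view(4, n, -1)
    return torch.sqrt(x.pow(2).mean(dim=0) + eps)
```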