Kaggle Plankton Classification winner's approach notes
FROM: http://benanne.github.io/2015/03/17/plankton.html
Meta-Tricks:
- Use 10% of the data for validation with STRATIFIED SAMPLING (my mistake)
- Cyclic Pooling
- Leaky ReLU = max(x, a*x) with the slope a learned (see the sketch after this list)
- reduces overfitting with a ~= 1/3
- Orthogonal initialization http://arxiv.org/pdf/1312.6120v3.pdf
- Use larger weight decay for larger models since otherwise some layers might diverge
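A leaky ReLU with a learned slope is essentially a parametric ReLU. A minimal PyTorch sketch (the original solution used Theano/Lasagne; the module and parameter names here are illustrative, and the slope is initialized around 1/3 as the notes suggest):

```python
import torch
import torch.nn as nn

class LearnedLeakyReLU(nn.Module):
    """Sketch of f(x) = max(x, a*x) with a trainable per-channel slope a."""
    def __init__(self, num_channels, init_slope=1.0 / 3.0):
        super().__init__()
        # One slope per channel, initialized near 1/3 as in the notes.
        self.slope = nn.Parameter(torch.full((num_channels,), init_slope))

    def forward(self, x):
        # Broadcast the per-channel slope over (N, C, H, W) activations.
        a = self.slope.view(1, -1, 1, 1)
        return torch.maximum(x, a * x)

# Usage: act = LearnedLeakyReLU(32); y = act(torch.randn(8, 32, 95, 95))
```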
Trick #1:
- Train a network 10 times for different inits.
- Ensemble with uniform weights
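A minimal NumPy sketch of the uniform ensemble, assuming `predictions` holds the softmax outputs of the 10 re-trained networks (name and shape are assumptions for illustration):

```python
import numpy as np

def uniform_ensemble(predictions):
    """predictions: (n_models, n_samples, n_classes) softmax outputs of the same
    architecture trained from different random inits. Uniform weights = plain mean."""
    return np.mean(predictions, axis=0)
```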
Trick #2:
- Extract hand-crafted features from the images
- Train a small 2-layer NN (80 units per layer) per feature set
- Ensemble the feature nets with the main ConvNet model at the softmax layer (sketch below)
- Used features:
  - Image size in pixels
  - Size and shape estimates based on image moments
  - Hu moments
  - Haralick texture features
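A hedged sketch of one feature set (Hu moments via OpenCV) and the small per-feature network; "ensemble at the softmax layer" is read here as averaging class probabilities. All names (`hu_moment_features`, `SmallFeatureNet`) are illustrative, not the winners' code:

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def hu_moment_features(gray_image):
    """Hu moments of a single-channel image, one of the hand-crafted feature sets."""
    hu = cv2.HuMoments(cv2.moments(gray_image)).flatten()
    # Log-scale: raw Hu moments span many orders of magnitude.
    return np.sign(hu) * np.log1p(np.abs(hu))

class SmallFeatureNet(nn.Module):
    """Two hidden layers of 80 units, one such net per feature set."""
    def __init__(self, n_features, n_classes=121):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 80), nn.ReLU(),
            nn.Linear(80, 80), nn.ReLU(),
            nn.Linear(80, n_classes),
        )

    def forward(self, x):
        return self.net(x)

# Blending at the softmax level (probabilities assumed precomputed):
# blended_probs = (convnet_probs + feature_net_probs) / 2
```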
Trick #3:
- Use Conv Auto Encoders for pretraining on the test set http://link.springer.com/chapter/10.1007/978-3-642-21735-7_7
- Use max-unpooling instead of the denoising principle
- Use full-stack training, but sometimes init by layer-wise training
- Use tied weights since it is faster
- In the supervised phase, halve the learning rates to keep the pretrained basis of the filters
- OR train only the FC layers for some initial iterations
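A rough PyTorch sketch of a tied-weight convolutional autoencoder with max-unpooling, in the spirit of Trick #3 (the original was built in Theano; the layer sizes here are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedConvAutoencoder(nn.Module):
    """One encoder stage plus its mirrored decoder, with tied weights."""
    def __init__(self, in_channels=1, hidden_channels=32):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, hidden_channels, 3, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)

    def forward(self, x):
        h = F.relu(self.conv(x))
        h, indices = self.pool(h)
        # Decoder: max-unpooling with the stored indices, not a denoising setup.
        h = F.max_unpool2d(h, indices, kernel_size=2, stride=2)
        # Tied weights: reuse the encoder kernel for the transposed convolution.
        return F.conv_transpose2d(h, self.conv.weight, padding=1)
```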
Trick #4:
- Pseudo-labeling to use test data in training (Hinton http://arxiv.org/abs/1503.02531)
- Use previous models, or an ensemble of them, to predict the test data and feed these predictions as labels to a new NN (sketch below).
- It regularizes the network
- Using soft outputs is better than hard outputs
- 33% from test data, 67% from train data in a batch
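A sketch of a pseudo-label loss, assuming batches are already composed of roughly 67% labelled train and 33% test examples whose soft targets come from an earlier model or ensemble; function and argument names are made up for illustration:

```python
import torch.nn.functional as F

def pseudo_label_loss(logits_train, labels_train, logits_test, soft_targets_test):
    """Hard-label loss on train examples plus a soft-target loss on test examples
    whose 'labels' are the previous ensemble's predicted distributions."""
    hard = F.cross_entropy(logits_train, labels_train)
    # Soft outputs beat hard ones: match the full predicted distribution.
    soft = F.kl_div(F.log_softmax(logits_test, dim=1), soft_targets_test,
                    reduction="batchmean")
    return hard + soft
```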
Trick #5:
- Augment the test data and average the predictions
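A minimal sketch of test-time augmentation averaging; `model_predict` and `augmentations` are assumed callables, not part of the original code:

```python
import numpy as np

def tta_predict(model_predict, augmentations, image):
    """Run the model on several augmented copies of a test image and
    average the resulting class probabilities."""
    probs = [model_predict(aug(image)) for aug in augmentations]
    return np.mean(probs, axis=0)
```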
Trick #6:
- Try to learn an affine transformation as a layer of the NN
- It didn't work, but the IDEA is GREAT !!!
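For illustration only (the notes say it did not help here): a small PyTorch layer that predicts a 2x3 affine matrix and warps its own input, the idea later formalized as spatial transformer networks. All sizes and names are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineLayer(nn.Module):
    """A localization head predicts an affine transform applied to its own input."""
    def __init__(self, in_channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(in_channels * 64, 6),
        )
        # Start from the identity transform.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```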
Trick #7:
- Ensemble 300 models
- Find their blend weights on the validation set
- Use the weights to infer the importance of each model
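One way to fit the blend weights on the validation set is to minimize validation log-loss over the weight simplex; a NumPy/SciPy sketch (Nelder-Mead is shown only for brevity and would be slow for 300 models; names and shapes are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_ensemble_weights(val_predictions, val_labels):
    """val_predictions: (n_models, n_samples, n_classes) softmax outputs on the
    validation set; val_labels: integer class ids of shape (n_samples,)."""
    n_models, n_samples, _ = val_predictions.shape

    def log_loss(w):
        w = np.abs(w) / np.abs(w).sum()              # keep weights on the simplex
        blended = np.tensordot(w, val_predictions, axes=1)
        p = blended[np.arange(n_samples), val_labels]
        return -np.mean(np.log(np.clip(p, 1e-15, 1.0)))

    res = minimize(log_loss, np.full(n_models, 1.0 / n_models), method="Nelder-Mead")
    w = np.abs(res.x) / np.abs(res.x).sum()
    return w   # larger weight ~ more important model
```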
Other Tricks:
- Untied biases for each spatial location
- winner takes all nonlinearity (WTA, also known as channel-out) in the fully connected layers instead of ReLUs / maxout.
- Batch Normalization did not work well
Best Tools against Overfitting:
- dropout
- aggressive data augmentation
- model architecture
- weight decay
- unsupervised pre-training (time consuming)
- cyclic pooling (especially with root-mean-squared pooling)
- leaky ReLUs
- pseudo-labeling
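For the "aggressive data augmentation" point, a torchvision sketch in the same spirit (random rotation, shift, zoom, shear and flips); the ranges are assumptions, not the winners' actual settings (they used real-time affine augmentation in Theano):

```python
from torchvision import transforms

# Illustrative augmentation pipeline; parameter ranges are placeholders.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=360, translate=(0.1, 0.1),
                            scale=(0.8, 1.2), shear=20),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])
```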
Final Structure:

Layer type            Size                       Output shape
cyclic slice                                     (128, 1, 95, 95)
convolution           32 3x3 filters             (128, 32, 95, 95)
convolution           16 3x3 filters             (128, 16, 95, 95)
max pooling           3x3, stride 2              (128, 16, 47, 47)
cyclic roll                                      (128, 64, 47, 47)
convolution           64 3x3 filters             (128, 64, 47, 47)
convolution           32 3x3 filters             (128, 32, 47, 47)
max pooling           3x3, stride 2              (128, 32, 23, 23)
cyclic roll                                      (128, 128, 23, 23)
convolution           128 3x3 filters            (128, 128, 23, 23)
convolution           128 3x3 filters            (128, 128, 23, 23)
convolution           64 3x3 filters             (128, 64, 23, 23)
max pooling           3x3, stride 2              (128, 64, 11, 11)
cyclic roll                                      (128, 256, 11, 11)
convolution           256 3x3 filters            (128, 256, 11, 11)
convolution           256 3x3 filters            (128, 256, 11, 11)
convolution           128 3x3 filters            (128, 128, 11, 11)
max pooling           3x3, stride 2              (128, 128, 5, 5)
cyclic roll                                      (128, 256, 5, 5)
fully connected       512 2-piece maxout units   (128, 512)
cyclic pooling (rms)                             (32, 512)
fully connected       512 2-piece maxout units   (32, 512)
fully connected       121-way softmax            (32, 121)
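The cyclic slice / roll / pool rows refer to the team's rotation-equivariance trick (note the batch size going 32 -> 128 at the slice and back to 32 at the pooling). A simplified PyTorch sketch of cyclic slicing and RMS cyclic pooling; cyclic rolling, which additionally shares re-oriented feature maps across the four copies, is omitted here:

```python
import torch

def cyclic_slice(x):
    """Stack the 4 right-angle rotations of each input along the batch axis,
    so one set of weights sees all orientations. (N, C, H, W) -> (4N, C, H, W)."""
    rots = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    return torch.cat(rots, dim=0)

def cyclic_pool_rms(x, eps=1e-12):
    """Merge the 4 rotated copies back into one prediction per original example
    by root-mean-square pooling over the orientation axis. (4N, F) -> (N, F)."""
    n = x.shape[0] // 4
    x = x.view(4, n, -1)
    return torch.sqrt(x.pow(2).mean(dim=0) + eps)
```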