bhavikngala/fast_ai_mooc_important_points.md

## fast_ai_mooc_important_points.md

      
            
          
This gist contains a list of points I found very useful while watching the fast.ai "Practical Deep Learning for Coders" and "Cutting Edge Deep Learning for Coders" MOOCs by Jeremy Howard and team. The list may not be complete, as I watched the videos at 1.5x speed in a marathon, but I wrote down as many things as I could that I found useful for getting a model to work. Fair warning: the points are in no particular order, so the topics are jumbled up.
Before beginning, I want to thank Jeremy Howard, Rachel Thomas, and the entire fast.ai team for making this awesome, practically oriented MOOC.


Progressive image resolution training: Train the network on lower-resolution images first and then increase the resolution to get better performance. This can be thought of as transfer learning from the same dataset at a different resolution. A paper by NVIDIA used a similar approach to train GANs.


Cyclical learning rates: Gradually increasing the learning rate at the start helps the optimizer avoid getting stuck at saddle points and explore more of the loss landscape. [https://arxiv.org/abs/1506.01186]
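A minimal sketch of the triangular schedule described in the linked paper, in plain Python. The function name `triangular_lr` and the default values for `base_lr`, `max_lr`, and `step_size` are illustrative assumptions, not values from the course.

```python
import math

def triangular_lr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=100):
    """Ramp linearly from base_lr to max_lr over step_size iterations, then back down."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# One full cycle spans 2 * step_size iterations: low, rising, peak, falling, low.
schedule = [triangular_lr(i) for i in range(0, 201, 50)]
```

In a training loop, the returned value would be written into the optimizer's learning rate before each batch.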


To reduce memory usage, you can use lower-precision floating point numbers, i.e. float16 instead of float32.
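A small sketch of the memory saving, assuming PyTorch is the framework in use. The tensor size is arbitrary.

```python
import torch

x32 = torch.zeros(1024, 1024, dtype=torch.float32)
x16 = x32.half()  # same values stored as float16

# element_size() is the number of bytes per element: 4 for float32, 2 for float16.
bytes32 = x32.element_size() * x32.nelement()
bytes16 = x16.element_size() * x16.nelement()
```

float16 has a much narrower range than float32, so training entirely in half precision may need extra care; this sketch only shows the storage difference.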


Self-supervised learning: the labels are built into the data.


For NLP tasks other than language modeling, you can use a language model for transfer learning, i.e. first train the model as a language model and then add the actual task on top.


When using transfer learning for NLP, the language model can and should be trained on the entire dataset, i.e. the text of both the train and test sets, since language modeling does not need the task labels.


Discriminative learning rates: use different learning rates for different layer groups in your network.
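One way to sketch this in plain PyTorch is with optimizer parameter groups. The tiny model and the learning rate values below are placeholders chosen for illustration, not recommendations from the course.

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(10, 20),  # stands in for early, pretrained layers
    nn.ReLU(),
    nn.Linear(20, 2),   # stands in for the newly added head
)

optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-4},  # small steps for early layers
        {"params": model[2].parameters(), "lr": 1e-2},  # larger steps for the head
    ],
    lr=1e-3,  # default for any group that does not set its own lr
    momentum=0.9,
)
group_lrs = [g["lr"] for g in optimizer.param_groups]
```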


Random forests can be used to find optimal hyperparameters.


Use embeddings for categorical variables.


For missing values: replace them with the median of the variable and add a new boolean column recording whether the value was missing (True/False).
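A minimal pandas sketch of this recipe. The column name `age` and the `_missing` suffix are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0, np.nan]})

# Record which rows were missing before filling them in.
df["age_missing"] = df["age"].isna()
# median() skips NaN, so it is computed over the observed values only.
df["age"] = df["age"].fillna(df["age"].median())
```

When there is a separate validation or test set, it is usually safer to compute the median on the training set and reuse it there.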


Use transfer learning wherever possible; it almost always improves performance.


You can scale the sigmoid in the last layer to the range of the target values; this can improve model performance.


out = sigmoid(x) * (max_range - min_range) + min_range
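The formula can be tried out in plain Python as follows. The function name `scaled_sigmoid` and the rating range 0.5 to 5.5 are assumptions made for the example.

```python
import math

def scaled_sigmoid(x, min_range, max_range):
    """Squash x into the open interval (min_range, max_range)."""
    sigmoid = 1.0 / (1.0 + math.exp(-x))
    return sigmoid * (max_range - min_range) + min_range

# Hypothetical use: predicting a rating that must lie between 0.5 and 5.5.
mid = scaled_sigmoid(0.0, 0.5, 5.5)    # sigmoid(0) = 0.5, so this is the midpoint
low = scaled_sigmoid(-10.0, 0.5, 5.5)  # approaches min_range
high = scaled_sigmoid(10.0, 0.5, 5.5)  # approaches max_range
```

Because the output can only approach the bounds, it may help to make the range slightly wider than the true minimum and maximum of the targets.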


Complexity is not measured by the number of parameters.


You can use the date/time columns in a dataset to extract useful features such as the day of the week, the day of the month, the day of the year, the year, month, and week, whether it is a holiday, etc. This helps the model pick up patterns such as an event becoming more frequent on a payday or a holiday.
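A sketch of this kind of date expansion using only the standard library. The function name `date_features`, the feature names, and the holiday set are illustrative assumptions; this is not the helper used in the course.

```python
from datetime import date

def date_features(d, holidays=frozenset()):
    """Expand one date into several columns a model can use."""
    return {
        "year": d.year,
        "month": d.month,
        "week": d.isocalendar()[1],            # ISO week number
        "day_of_month": d.day,
        "day_of_week": d.weekday(),            # Monday is 0
        "day_of_year": d.timetuple().tm_yday,
        "is_holiday": d in holidays,
    }

# The holiday set would come from a calendar relevant to the dataset.
features = date_features(date(2018, 3, 30), holidays={date(2018, 3, 30)})
```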


More data is always useful.


Too much dropout reduces the capacity of the network, experiment with multiple values.


You can apply dropout to the output of the embedding layer too.


Batch normalization helps to smoothen the loss landscape thus allowing higher learning rates.


Reflection padding often works better than zero padding.


A larger kernel in the first layer of a CNN works better, since the input has only 3 channels (or very few) at that point.


t[None] can add a new dimension to the tensor, i.e. convert 3D tensor to 4D tensor.
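A quick check of the indexing trick, assuming PyTorch tensors. The image size is arbitrary.

```python
import torch

img = torch.zeros(3, 224, 224)  # a single image: channels x height x width
batch = img[None]               # adds a leading dimension of size 1
same = img.unsqueeze(0)         # equivalent, more explicit spelling
```

This is handy for passing a single image through a model that expects a batch.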


Use forward hooks to grab outputs of intermediate layers, simplifies implementations of pyramid style networks a lot.


Ethics in AI: the privileged are processed by people and the poor are processed by algorithms - Cathy O'Neil


When using transfer learning, you should normalize your dataset with the statistics of the dataset on which the model was originally trained.


Paper read: Visualizing loss landscape of neural networks. [https://arxiv.org/abs/1712.09913]


DenseNet works very well on smaller datasets and on segmentation tasks. ResNet works very well on segmentation tasks too.


You can apply modern methods on old papers and get SOTA results.


A new UNET style network: resnet34 + subpixel convolutional upsampling.


Subpixel convolutions for upsampling: a lot of improvement in removing checkerboard artifacts.
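A minimal sketch of a subpixel upsampling block in PyTorch, built from `nn.Conv2d` followed by `nn.PixelShuffle`. The channel counts and scale factor are arbitrary, and this shows only the shape mechanics, not the exact block used in the course.

```python
import torch
from torch import nn

scale = 2
upsample = nn.Sequential(
    # The convolution produces scale**2 times as many channels as the output needs...
    nn.Conv2d(64, 32 * scale ** 2, kernel_size=3, padding=1),
    # ...and PixelShuffle rearranges those channels into a grid that is scale times larger.
    nn.PixelShuffle(scale),
)

x = torch.randn(1, 64, 16, 16)
y = upsample(x)  # 32 channels at twice the spatial resolution
```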


Pretrained discriminator and generator in GAN.


Spectral normalization in GAN.


Don't use momentum in GANs, they don't like it.


The loss values for the generator and the discriminator should converge, but the only way to confirm that GAN training is working is visual inspection of the generated samples.


Perceptual losses for style transfer and super-resolution.


Say there is a network with complex loss function or a loss function which requires intermediate layer outputs, then do this:


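import torch.nn as nn

class SomeLoss(nn.Module):
    # `layers` (assumed argument): the modules of `network` whose outputs the loss needs
    def __init__(self, network, layers):
        super().__init__()  # must run before submodules are assigned
        self.net = network
        self.feats = []  # intermediate outputs collected by the hooks
        # apply forward hooks to the network to get intermediate layer outputs
        self.hooks = [l.register_forward_hook(lambda m, i, o: self.feats.append(o)) for l in layers]
        # additional statements

    def forward(self, x, target):
        self.feats.clear()  # drop outputs left over from the previous call
        y_hat = self.net(x)
        # intermediate outputs are now in self.feats
        loss = ...  # compute the loss from y_hat, target, and self.feats
        return loss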


Gradual unfreezing: Take trained model -> replace last layers -> fine tune last layer -> fine tune earlier layers.
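The two fine-tuning stages can be sketched in plain PyTorch by toggling `requires_grad`. The helper `set_trainable` and the toy `body`/`head` modules are hypothetical stand-ins for a real pretrained network.

```python
from torch import nn

def set_trainable(module, trainable):
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

body = nn.Sequential(nn.Linear(10, 20), nn.ReLU())  # stands in for a pretrained body
head = nn.Linear(20, 3)                             # freshly initialised replacement head
model = nn.Sequential(body, head)

# Stage 1: freeze the body and train only the new head.
set_trainable(body, False)
stage1 = [n for n, p in model.named_parameters() if p.requires_grad]

# Stage 2: unfreeze the body and fine-tune everything.
set_trainable(body, True)
stage2 = [n for n, p in model.named_parameters() if p.requires_grad]
```

A training loop would run between the two stages; it is omitted here to keep the sketch short.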


Five steps to avoid overfitting: More data -> data augmentation -> generalizable architecture -> regularization -> reduce architecture complexity.


Use lambda functions to reduce lines of code wherever possible.


Functions should be 5 lines or less wherever possible.


Python debugger: pdb. Useful commands: s (step), n (next), l (list), c (continue), u (up), h (help), p (print).


In case of multiple losses, find a multiplier to make all the losses approximately equal.


Batch norm after ReLU makes more sense: BN normalizes the activations, whereas a ReLU placed after BN shifts their mean and variance again.


BN should not be used right after the dropout layer.


Receptive field.


Use the chunksize argument when reading data with pandas to get an iterator over large datasets.
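A small sketch with `pandas.read_csv`, using an in-memory CSV as a stand-in for a file that is too large to load at once. The column name `x` and the chunk size are arbitrary.

```python
import io
import pandas as pd

# A tiny in-memory CSV stands in for a very large file on disk.
csv = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)))

total = 0
n_chunks = 0
for chunk in pd.read_csv(csv, chunksize=4):  # each chunk is a DataFrame of up to 4 rows
    total += chunk["x"].sum()
    n_chunks += 1
```

Only one chunk is held in memory at a time, so aggregates like the running sum above can be computed over data that does not fit in RAM.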


NLP tokenization: add a beginning-of-sentence token and field tokens; when converting upper case to lower case, add a token denoting upper case before the word.


Limit vocabulary to ~60000 words, remove tokens that do not appear more than 2 times.


For NLP tasks, the model can be trained on a subset of Wikipedia articles. Model pretraining.


wget -r for downloading recursively.


Command line tools can be run in jupyter notebook by placing ! before them.


Since sequences cannot be randomly shuffled, we can vary the length of the sequence to add randomness.


perplexity = exp(cross_entropy)
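A worked example of this relation in plain Python, assuming the cross entropy is the mean negative natural log of the probability assigned to each correct token. The function name `perplexity` is chosen for the example.

```python
import math

def perplexity(target_probs):
    """Perplexity from the probabilities the model assigned to the correct tokens."""
    cross_entropy = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(cross_entropy)

# A model that spreads its belief uniformly over 4 tokens has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25])
# A model that is always certain and correct has perplexity 1.
perfect = perplexity([1.0, 1.0])
```

Read this way, perplexity is roughly the number of equally likely tokens the model is choosing between at each step.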


Accuracy can be used in NLP as a metric.


Don't implement a paper mindlessly. You can have ideas that the authors didn't have.


Paper read: A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. [https://arxiv.org/abs/1803.09820]


Google's Fire library.


VNC port forwarding to access jupyter notebook on servers.


!pip install git+URL for installing lib from git.


To free CNN from the input image size use adaptive average pooling.
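A minimal PyTorch sketch: the same small network accepts two different input sizes because `nn.AdaptiveAvgPool2d` always produces a fixed-size output. The layer sizes are arbitrary.

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # always outputs 8 x 1 x 1, whatever the input size
    nn.Flatten(),
    nn.Linear(8, 10),
)

small = net(torch.randn(2, 3, 32, 32))
large = net(torch.randn(2, 3, 97, 61))
```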


Using a bidirectional LSTM in a seq2seq model improves performance. So do teacher forcing and attention.


In high-dimensional spaces, everything is on the edge, so distance matters less than angle. A cosine similarity loss therefore tends to work much better than an L1/L2 loss.
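A plain-Python sketch of why cosine similarity looks only at the angle: scaling a vector changes its L2 distance to the original but not its cosine similarity. The vectors are made up for the example.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]  # same direction, but far away in L2 distance
orthogonal = [3.0, 0.0, -1.0]   # dot product with v is zero

same_direction = cosine_similarity(v, scaled)
right_angle = cosine_similarity(v, orthogonal)
```

A loss can then be formed as one minus the cosine similarity, which is 0 for vectors pointing the same way.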


python nmslib for the nearest neighbor query.


Get word vector of imagenet classes from wordnet -> train imagenet to predict word vectors -> now you have a search engine for images -> input word -> get word vector -> get images with similar word vectors. I apologize, I do not have the link for this paper.


In practice, LeakyReLU is useful for smaller datasets.


In neural networks, replace operations in the forward function with their in-place _ versions where autograd allows it, for example replace + with add_, to save GPU memory.


Paper read: Wide residual networks. [https://arxiv.org/abs/1605.07146]. The fast.ai team took first place on the DAWN benchmark.


Topic read: Stochastic weight average.


Fastai train phase API.


Paper read: LARS. [https://arxiv.org/abs/1708.03888]


You can train the model with different optimizers during different training phases.


You can break a 7x7 filter into two filters, a 1x7 and a 7x1: these are linearly separable filters. This reduces computation.
TODO: insert image.
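A sketch of the saving in PyTorch, comparing one 7x7 convolution with a 1x7 followed by a 7x1. The channel count of 64 is arbitrary and biases are disabled to keep the parameter counts easy to read.

```python
import torch
from torch import nn

full = nn.Conv2d(64, 64, kernel_size=7, padding=3, bias=False)
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0), bias=False),
)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

x = torch.randn(1, 64, 32, 32)
shapes_match = full(x).shape == separable(x).shape
params_full, params_separable = n_params(full), n_params(separable)
```

Both variants cover the same 7x7 receptive field and give the same output shape, but the pair uses 2 x 7 weights per channel pair instead of 49. The pair cannot represent every possible 7x7 filter, so this is a trade-off rather than an exact replacement.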


The very first stage of the backbone network, where the number of channels is increased from, say, 3 to a higher number such as 64, is called the stem of the backbone. The Inception stem is better than the stems of other networks. One can try the Inception stem on a ResNet backbone.


Paper read: Progressive growing of GANs. [https://arxiv.org/abs/1710.10196]


The most interesting layers to grab output from are the ones before the max pooling layer because they represent the data best before the grid size changes.


I may have missed some points and there may be some mistakes.
I haven't included any paper citations but all of the above points are from the MOOC and the papers presented in the MOOC.