Table of contents

  • GPU
  • CPU
  • Storage
  • General Training
  • Loss
  • Dataset/Dataloader
  • Misc

GPU

CPU

  • If you are running multiple experiments but have a limited number of cores, use taskset --cpu-list <starting_thread>-<ending_thread> python <your_code>.py. This makes sure that run uses only the allotted threads, from <starting_thread> to <ending_thread>, and prevents the constant reallocation of CPU threads that happens when runs fight over them (a concrete invocation is shown after this list). Note that this helps only if everyone on the server respects the core allotment.
  • A higher num_workers doesn't automatically lead to a faster data loader. In fact, in most cases a higher num_workers will make the data loader slower. As far as I know there is no rule of thumb, but there is a sweet spot, and it is mostly found through trial and error (a timing sweep like the sketch after this list works well).
  • Useful utilities/commands:
    • htop
    • glances
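
As a concrete instance of the taskset tip above (the core range 0-7 and the script name train.py are placeholders, not values from the original note):

    taskset --cpu-list 0-7 python train.py

And a minimal sketch of the num_workers trial-and-error sweep, assuming a PyTorch DataLoader and a small dummy in-memory dataset standing in for your real one:

    import time
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def time_one_pass(num_workers: int) -> float:
        # Dummy dataset standing in for your real one; shapes are arbitrary.
        dataset = TensorDataset(torch.randn(2_000, 3, 64, 64),
                                torch.randint(0, 10, (2_000,)))
        loader = DataLoader(dataset, batch_size=64,
                            num_workers=num_workers, pin_memory=True)
        start = time.time()
        for _ in loader:  # one full pass, discarding the batches
            pass
        return time.time() - start

    if __name__ == "__main__":
        # Try a few worker counts and keep the fastest one for your real runs.
        for n in (0, 2, 4, 8, 16):
            print(f"num_workers={n}: {time_one_pass(n):.2f}s")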

Storage

  • Make sure you are running everything (reading data, writing logs, training) on an SSD. An HDD causes I/O bottlenecks which are hard to get over even if you sell your soul to Satan.
    • Check with lsblk -o NAME,MOUNTPOINT,MODEL,ROTA,SIZE. ROTA == 0 means the drive is non-rotational, i.e. an SSD.
  • Instead of loading your data from an SSD or HDD, you can stage it directly in RAM. /dev/shm/ is a tmpfs directory backed by RAM. First check that you have enough free RAM, then copy the entire dataset to /dev/shm/ and make your dataloader load from there (see the sketch after this list).
  • Useful utilities/commands:
    • ncdu
    • df -h
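
A minimal sketch of the /dev/shm tip above; the paths and the ImageFolder layout are assumptions, not part of the original note:

    import shutil
    import torchvision

    # Stage the dataset in the RAM-backed tmpfs once (check free RAM first, e.g. with free -h).
    shutil.copytree("/data/my_dataset", "/dev/shm/my_dataset", dirs_exist_ok=True)

    # Point the dataset/dataloader at the RAM copy instead of the copy on disk.
    dataset = torchvision.datasets.ImageFolder(
        "/dev/shm/my_dataset/train",
        transform=torchvision.transforms.ToTensor(),
    )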

General Training

Loss

Dataset/Dataloader

Misc
