Tricks to Speed Up Data Loading with PyTorch

In most deep learning projects, the training script starts with lines that load the data, which can easily take several minutes. Only after the data is ready can I start testing my buggy code. Frustratingly often, I wait ten minutes just to find that I made a stupid typo, and then I have to restart and wait another ten minutes, hoping no other typos were made.

To make my life easier, I have put a lot of effort into reducing the overhead of I/O loading. Here I list some useful tricks I found, and I hope they save you some time as well.

  1. Use NumPy memmap to load arrays and say goodbye to HDF5.

    I used to rely on HDF5 to read/write data, especially when loading only a subset of the data. Yet that was before I realized how fast and convenient NumPy memmap files are. In short, a memmap file does not load the whole array when it is opened; it only "lazily" loads the parts that are actually needed for real operations.

    Sometimes I want to copy the full array into memory at once, as it makes later operations faster. Even then, going through a memmap file is still much faster than HDF5: just do array = numpy.array(memmap_file). It reduces the several minutes needed with HDF5 to several seconds. Pretty impressive, isn't it!

    A useful tool to check out is sharearray. It hides the verbose details of creating a memmap file for you.

    If you want to create a memmap array that is too large to fit in memory, use numpy.memmap().
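    As a rough illustration (the file name and shape below are made-up placeholders), writing and lazily reading a memmap array looks like this:

import numpy as np

# create a disk-backed array that never has to fit in RAM
# ("features.dat" and the shape are placeholder values)
big = np.memmap("features.dat", dtype=np.float32, mode="w+", shape=(100000, 512))
big[:1000] = np.random.rand(1000, 512)  # only the touched pages are held in memory
big.flush()                             # write pending changes to disk

# later: open it lazily and slice just the rows you need
arr = np.memmap("features.dat", dtype=np.float32, mode="r", shape=(100000, 512))
batch = np.array(arr[2000:2032])        # copies only this slice into memory

# np.save / np.load(mmap_mode="r") behave similarly and also store the shape
# and dtype in the file header, so you do not have to remember them yourself
np.save("features.npy", big)
lazy = np.load("features.npy", mmap_mode="r")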

  2. torch.from_numpy() to avoid an extra copy.

    While torch.Tensor() makes a copy of the NumPy array passed in, torch.from_numpy() shares the same underlying storage as the NumPy array.
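    A quick way to see the difference (the values are purely illustrative): modifying the NumPy array changes the tensor created with torch.from_numpy() but not the one created with torch.tensor().

import numpy as np
import torch

a = np.zeros(5, dtype=np.float32)

copied = torch.tensor(a)      # allocates new memory and copies the data
shared = torch.from_numpy(a)  # reuses the NumPy buffer, no copy

a[0] = 42.0
print(copied[0].item())  # 0.0  -> the copy does not see the change
print(shared[0].item())  # 42.0 -> the shared tensor does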

  3. torch.utils.data.DataLoader for multi-process loading.

    I think most people are aware of this one. With DataLoader, an optional argument num_workers can be passed in to set how many worker processes to spawn for loading data.
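    A typical setup looks roughly like the following (the toy dataset, batch size, and worker count are just illustrative):

import torch
from torch.utils.data import DataLoader, TensorDataset

# a toy dataset just to make the example self-contained
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,    # 4 worker processes prepare batches in the background
    pin_memory=True,  # also helps the asynchronous GPU copy in trick 4
)

# on Windows/macOS, DataLoader workers require the usual
# `if __name__ == "__main__":` guard around this loop
for x, y in loader:
    pass  # training step goes here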

  4. A simple trick to overlap data-copy time and GPU time.

    Copying data to the GPU can be relatively slow, so you want to overlap the I/O and GPU time to hide the latency. Unfortunately, PyTorch does not provide a handy tool to do it. Here is a simple snippet to hack around it with DataLoader, pin_memory and .cuda(non_blocking=True).

from torch.utils.data import DataLoader

# some code

loader = DataLoader(your_dataset, ..., pin_memory=True)
data_iter = iter(loader)

next_batch = next(data_iter)  # start loading the first batch
next_batch = [_.cuda(non_blocking=True) for _ in next_batch]  # with pin_memory=True and non_blocking=True, this will copy data to the GPU without blocking

for i in range(len(loader)):
    batch = next_batch
    if i + 1 != len(loader):
        # start copying data of the next batch
        next_batch = next(data_iter)
        next_batch = [_.cuda(non_blocking=True) for _ in next_batch]

    # training code

@Subangkar

Subangkar commented Jul 22, 2020

There is a typo nex_batch in Line: 9 next_batch = [ _.cuda(async=True) for _ in nex_batch ] # with pin_memory=True and async=True, this will copy data to GPU non blockingly
And async keyword arg of cuda is deprecated and changed to non_blocking=True

@bfeeny

bfeeny commented Jul 28, 2020

With regard to:

2. torch.from_numpy() to avoid extra copy.

While torch.Tensor make a copy of the passing-in numpy array. torch.from_numpy() use the same storage as the numpy array.

This wouldn't hold true if you were creating the Tensor in GPU right? Typically a numpy array is instantiated from a CPU instance, and then moved to a GPU torch tensor. So I would think it would not save you anything in that workflow. Obviously if you are moving from CPU memory Numpy to CPU memory torch it could. Thoughts?

@ZijiaLewisLu
Author

@Subangkar Thank you! I have updated the code.

@ZijiaLewisLu
Author

@bfeeny You are right, it is for CPU-to-CPU conversion. In my project, I often have to load batch data from disk in NumPy format at each iteration and then convert it to a PyTorch Tensor, so I found it helpful.

@bfeeny

bfeeny commented Aug 10, 2020

This should be if i + 1 != len(loader): not if i + 2 != len(loader):

Example as you have it:

loader = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 
train_iter = iter(loader)  
next_batch = next(train_iter)

for i in range(len(loader)):
    batch = next_batch
    print(i, batch)
    if i + 2 != len(loader):
        next_batch = next(train_iter)

0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 8

Corrected:

loader = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 
train_iter = iter(loader)  
next_batch = next(train_iter)

for i in range(len(loader)):
    batch = next_batch
    print(i, batch)
    if i + 1 != len(loader):
        next_batch = next(train_iter)

0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9

@hxtruong6

hxtruong6 commented Aug 14, 2020

I got the error return next(self._sampler_iter) # may raise StopIteration at the line next_batch = data_iter.next()
I can't use data_iter.next(); instead I have to use data_iter.__next__() ?!!
How do I solve this?
Thanks

@FeryET

FeryET commented Dec 31, 2021

I think your opinion on HDF being much slower than numpy is misguided.

HDF needs to be carefully parameterized using rdcc_w0 and rdcc_nslots, but if you give them good values it's not only as fast as memmap arrays, it's also easier to maintain. You can keep both your training data and training labels in a single HDF file and version it via dvc, etc.
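For example, something along these lines (the file name, dataset names, and cache values are only illustrative, not recommended settings):

import h5py

# illustrative chunk-cache settings: 1 GiB cache for a mostly-read workload
f = h5py.File(
    "train.h5", "r",
    rdcc_nbytes=1024**3,    # total chunk cache size in bytes
    rdcc_nslots=1_000_003,  # number of hash slots; a large prime reduces collisions
    rdcc_w0=0.75,           # eviction policy: closer to 1 favors fully-read chunks
)

images = f["images"]  # hypothetical dataset names
labels = f["labels"]

batch = images[2000:2032]  # reads only the chunks that cover this slice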
