In most deep learning projects, the training script starts by loading data, which can easily take several minutes. Only after the data is ready can I start testing my buggy code. It is frustrating how often I wait ten minutes just to find I made a stupid typo, then have to restart and wait another ten minutes, hoping there are no other typos.
To make my life easier, I have put a lot of effort into reducing the overhead of I/O loading. Here are some useful tricks I found; I hope they save you some time too.
Use NumPy memmap to load arrays and say goodbye to HDF5.
I used to rely on HDF5 to read and write data, especially when loading only a subset of the data. That was before I realized how fast and charming NumPy's memmap files are. In short, a memmap file does not load the whole array when it is opened; it "lazily" loads only the parts that are needed for actual operations.
Sometimes I want to copy the full array into memory at once, since that makes later operations faster. Even then, a memmap file is still much faster than HDF5. Just do array = numpy.array(memmap_file). It cuts the load time from several minutes with HDF5 to several seconds. Pretty impressive, isn't it?
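As a rough illustration of both access patterns, here is a minimal sketch. The file name, shape, and dtype are made up for the example, and it uses np.load with mmap_mode="r" as one way to get a memmap; your own files may be created differently.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name and shape, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
data = np.arange(1000 * 64, dtype=np.float32).reshape(1000, 64)
np.save(path, data)

# mmap_mode="r" maps the file instead of reading it, so opening is cheap.
mm = np.load(path, mmap_mode="r")

# Slicing touches only the requested rows, not the whole file.
subset = mm[10:20]
subset_mean = float(subset.mean())

# When the whole array is needed, copy it into memory in one go.
full = np.array(mm)
```

Here mm stays backed by the file on disk, while full is an ordinary in-memory array that no longer depends on the file.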
A useful tool to check out is sharearray. It hides the verbose details of creating a memmap file for you.
If you want to create a memmap array that is too large to fit in your memory, use
Use torch.from_numpy() to avoid an extra copy.
torch.Tensor makes a copy of the passed-in numpy array.
torch.from_numpy() uses the same storage as the numpy array.
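The difference is easy to check by modifying the NumPy array after the conversion. This is a small sketch, not a benchmark; it uses torch.tensor() as the copying constructor for the comparison, which is my choice for the example.

```python
import numpy as np
import torch

arr = np.zeros(4, dtype=np.float32)

shared = torch.from_numpy(arr)  # reuses the NumPy array's memory
copied = torch.tensor(arr)      # allocates new storage and copies the values

# Modify the NumPy array in place after both tensors were created.
arr[0] = 1.0

shared_first = shared[0].item()  # reflects the in-place change
copied_first = copied[0].item()  # still holds the original value
```

Because the storage is shared, in-place changes go both ways: writing into shared also changes arr.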
Use torch.utils.data.DataLoader for multi-process loading.
I think most people are aware of it. DataLoader accepts an optional argument num_workers, which sets how many worker processes are created for loading data.
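For reference, a minimal sketch of passing num_workers. The toy dataset and the helper names build_loader and count_samples are invented for this example; a real dataset would do the expensive reading inside its __getitem__.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(num_workers):
    # Toy dataset (hypothetical): 100 samples with 8 features each, plus labels.
    features = torch.arange(800, dtype=torch.float32).reshape(100, 8)
    labels = torch.arange(100)
    dataset = TensorDataset(features, labels)
    # num_workers=0 loads in the main process; N > 0 starts N worker processes.
    return DataLoader(dataset, batch_size=16, shuffle=False, num_workers=num_workers)


def count_samples(loader):
    # Iterate over the loader once and count how many samples it yields.
    return sum(x.shape[0] for x, _ in loader)


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this file.
    print(count_samples(build_loader(num_workers=2)))
```

How many workers actually help depends on the machine and on how heavy the per-sample loading is, so it is worth trying a few values.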
A simple trick to overlap data-copy time and GPU time.
Copying data to the GPU can be relatively slow, so you want to overlap I/O and GPU time to hide the latency. Unfortunately, PyTorch does not provide a handy tool for this. Here is a simple snippet to hack around it:
    from torch.utils.data import DataLoader

    # some code

    loader = DataLoader(your_dataset, ..., pin_memory=True)
    data_iter = iter(loader)

    next_batch = next(data_iter)  # start loading the first batch
    # with pin_memory=True and non_blocking=True, this copies data to the GPU without blocking
    next_batch = [_.cuda(non_blocking=True) for _ in next_batch]

    for i in range(len(loader)):
        batch = next_batch
        if i + 1 != len(loader):
            # start copying data of the next batch (skipped after the last one)
            next_batch = next(data_iter)
            next_batch = [_.cuda(non_blocking=True) for _ in next_batch]

        # training code
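The same idea can be wrapped in a generator so the training loop stays clean. This is only a sketch under a few assumptions: prefetch_to_device is a hypothetical helper name, each batch is a list or tuple of tensors, and the copy only overlaps with compute when the loader uses pinned memory and a GPU is present. On a machine without CUDA it falls back to the CPU and simply yields the batches in order.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def prefetch_to_device(loader, device):
    """Yield batches on `device`, queuing the copy of the next batch first."""
    data_iter = iter(loader)
    try:
        next_batch = [t.to(device, non_blocking=True) for t in next(data_iter)]
    except StopIteration:
        return  # empty loader, nothing to yield
    for upcoming in data_iter:
        batch = next_batch
        # Queue the copy of the following batch before handing out this one.
        next_batch = [t.to(device, non_blocking=True) for t in upcoming]
        yield batch
    yield next_batch  # the last batch has nothing left to prefetch


use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy data (hypothetical): 10 samples with 4 features each, labels 0..9.
dataset = TensorDataset(torch.arange(40, dtype=torch.float32).reshape(10, 4),
                        torch.arange(10))
loader = DataLoader(dataset, batch_size=4, pin_memory=use_cuda)

seen_labels = []
for x, y in prefetch_to_device(loader, device):
    # training code would go here
    seen_labels.append(y.cpu())
```

Each batch is yielded exactly once and in the loader's order, so the wrapper can replace the plain loop over the loader directly.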