David Ketchum dgketchum

  • State of Montana, University of Montana
  • Missoula, MT
@ZijiaLewisLu
ZijiaLewisLu / Tricks to Speed Up Data Loading with PyTorch.md
Last active May 3, 2024 11:44
Tricks to Speed Up Data Loading with PyTorch

In most deep learning projects, the training script starts by loading data, which can easily take a few minutes. Only once the data is ready can I start testing my buggy code. All too often I wait ten minutes just to find that I made a stupid typo, and then I have to restart and wait another ten minutes, hoping there are no other typos.

To make my life easier, I have put a lot of effort into reducing I/O overhead. Here I list some useful tricks I found, and I hope they save you some time too.

  1. Use NumPy memmap to load arrays and say goodbye to HDF5.

    I used to rely on HDF5 to read and write data, especially when loading only a subset of the data. That was before I realized how fast and charming NumPy's memory-mapped files are. In short, a memmap does not load the whole array when the file is opened; it "lazily" loads only the parts that later operations actually need.
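As a minimal sketch of the lazy loading described above, the snippet below saves an array once and then reopens it with `np.load(..., mmap_mode="r")`. The file name `features.npy`, the array shape, and the slice bounds are assumptions made up for illustration; they do not come from the original write-up.

```python
import os
import tempfile

import numpy as np

# Hypothetical file and shape, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, np.arange(1_000_000, dtype=np.float32).reshape(10_000, 100))

# mmap_mode="r" maps the file read-only instead of reading it all into memory.
data = np.load(path, mmap_mode="r")

# Slicing touches only the part of the file that backs these rows;
# np.asarray copies that slice into an ordinary in-memory array.
batch = np.asarray(data[128:160])
print(type(data).__name__, batch.shape)
```

Opening the file this way should return almost immediately regardless of array size, since the cost of reading is deferred until a slice is actually accessed.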

Sometimes I may want to copy the full array to memory at once, as it makes later operations

@LeegleechN
LeegleechN / verify_tfrecords.py
Last active September 14, 2020 15:56
Check TFRecords
"""Checks if a set of TFRecords appear to be valid.
Specifically, this checks whether the provided record sizes are consistent and
that the file does not end in the middle of a record. It does not verify the
CRCs.
"""
import struct
import tensorflow as tf
from tensorflow import app
@aidanheerdegen
aidanheerdegen / kial_average.py
Last active August 8, 2022 15:47
Using xarray to calculate daily averages over multiple years
import numpy as np
from netCDF4 import Dataset
import sys
import xarray
from xarray.ufuncs import *
from glob import glob
import os
import string