@tylerneylon
Last active April 30, 2022 06:48
A function to load numpy arrays from the MNIST data files.
""" A function that can read MNIST's idx file format into numpy arrays.
The MNIST data files can be downloaded from here:
http://yann.lecun.com/exdb/mnist/
This relies on the fact that the MNIST dataset consistently uses
unsigned char types in its data segments.
"""
import struct
import numpy as np
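def read_idx(filename):
    with open(filename, 'rb') as f:
        # Header: two zero bytes, a data-type code, and the number of dimensions.
        zero, data_type, dims = struct.unpack('>HBB', f.read(4))
        # Each dimension size is a 4-byte big-endian unsigned integer.
        shape = tuple(struct.unpack('>I', f.read(4))[0] for d in range(dims))
        # np.frombuffer replaces the deprecated np.fromstring; the result is read-only.
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)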
@sushantpaygude

Please help me with the following error:

line 10, in read_idx
return np.fromstring(f.read(), dtype=np.uint8).reshape(shape)
ValueError: cannot reshape array of size 1648841 into shape (2913389620,226353,812330345,1835100005,1932421476,2016619893,1652126821,15506697)

@carl-allen

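The downloaded files are still gzip-compressed, so the header is being read from compressed bytes. Run "gunzip [filename]" on each file first.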

@ssong26

ssong26 commented May 13, 2018

The original file "train-images-idx3-ubyte.gz" is 9912422 bytes, which is much smaller than 282860000. You need to decompress the file before you can use it; tools such as 7-Zip or WinRAR can do this. Otherwise you will hit the error "cannot reshape array of size 9912386 into shape (2055376946,226418,1634299437,1768776039,1702047337,1685599021,1969387892,1694559388)". (Written for students who are as puzzled as I was while doing their homework.)
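
A quick way to catch this before the reshape fails is to look at the first bytes of the file. Gzip files start with the magic bytes 0x1f 0x8b, while an idx file starts with two zero bytes. The eight-dimensional shapes in the errors above suggest these .gz files begin with 1f 8b 08 08, which the gist's header parsing reads as data_type=8 and dims=8. The sketch below assumes that; `looks_gzipped` is a hypothetical helper name, not part of the gist.

```python
import struct


def looks_gzipped(header: bytes) -> bool:
    """Return True if the bytes begin with the gzip magic number 0x1f 0x8b."""
    return header[:2] == b'\x1f\x8b'


# What the gist's header parsing makes of the start of a gzip file
# (assumed here to be 1f 8b 08 08): the last field is taken as `dims`.
fields = struct.unpack('>HBB', b'\x1f\x8b\x08\x08')
```

To use it on a real file, read the first two bytes with `open(filename, 'rb').read(2)` and pass them to the helper before parsing.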

@JonnoFTW

Python can automatically handle gzip files, just add:

import gzip

Then change

with open(filename, 'rb') as f:

to:

with gzip.open(filename) as f:
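
Putting that together, here is a rough sketch of a reader that accepts either form of the file. It assumes compressed files can be recognized by a '.gz' suffix, and the name `read_idx_any` is made up for this example. The round trip at the end uses a tiny synthetic 2x3 file, not real MNIST data.

```python
import gzip
import os
import struct
import tempfile

import numpy as np


def read_idx_any(filename):
    """Read an idx file, gzip-compressed or not, into a uint8 numpy array."""
    # Assumption: compressed files are recognizable by a '.gz' suffix.
    opener = gzip.open if filename.endswith('.gz') else open
    with opener(filename, 'rb') as f:
        zero, data_type, dims = struct.unpack('>HBB', f.read(4))
        shape = tuple(struct.unpack('>I', f.read(4))[0] for _ in range(dims))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)


# Build a tiny idx payload: header, two dimension sizes, then six data bytes.
payload = struct.pack('>HBB', 0, 8, 2) + struct.pack('>II', 2, 3) + bytes(range(6))

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, 'tiny-idx2-ubyte')
    gz_path = plain_path + '.gz'
    with open(plain_path, 'wb') as f:
        f.write(payload)
    with gzip.open(gz_path, 'wb') as f:
        f.write(payload)
    plain = read_idx_any(plain_path)   # read the uncompressed copy
    zipped = read_idx_any(gz_path)     # read the gzip-compressed copy
```

Both calls should give the same 2x3 array, so the downloaded .gz files can be read without unpacking them first.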

@alanhyue

Thanks a lot, dude! It is interesting that the default encoding is big-endian, the format used by non-Intel processors, since in my mind most people are using Intel processors. Anyway, thanks for the gist!
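
For anyone wondering why the byte order matters here: the '>' in the gist's struct format strings asks for big-endian (most significant byte first) decoding. A small illustration, using 60000 as an example dimension size; the variable names are only for this sketch.

```python
import struct

# 60000 (0x0000EA60) stored most significant byte first, as in an idx header.
raw = bytes([0x00, 0x00, 0xEA, 0x60])

big = struct.unpack('>I', raw)[0]     # big-endian read, as the gist does
little = struct.unpack('<I', raw)[0]  # little-endian read of the same bytes
```

Here `big` comes out as 60000, while `little` reverses the byte order and gives 0x60EA0000, which is why the explicit '>' is needed on Intel machines.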
