""" A function that can read MNIST's idx file format into numpy arrays.

    The MNIST data files can be downloaded from here:
    http://yann.lecun.com/exdb/mnist/

    This relies on the fact that the MNIST dataset consistently uses
    unsigned char types with their data segments.
"""

import struct

import numpy as np

def read_idx(filename):
    with open(filename, 'rb') as f:
        # Header: two zero bytes, a data-type code, and the dimension count.
        zero, data_type, dims = struct.unpack('>HBB', f.read(4))
        # Each dimension size is a 4-byte big-endian unsigned integer.
        shape = tuple(struct.unpack('>I', f.read(4))[0] for d in range(dims))
        # np.frombuffer replaces the deprecated binary mode of np.fromstring;
        # the returned array is read-only, so call .copy() if you need to modify it.
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)
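The docstring notes that the function relies on MNIST only using unsigned bytes. As a rough sketch of how that assumption could be relaxed, the type code in the header can be mapped to a numpy dtype. The name `parse_idx` and the `IDX_DTYPES` table are hypothetical and not part of the gist; the type codes are the ones listed in the idx format description on the MNIST page.

```python
import struct

import numpy as np

# Type codes from the idx format description; multi-byte values are big-endian.
IDX_DTYPES = {
    0x08: np.dtype('>u1'),  # unsigned byte
    0x09: np.dtype('>i1'),  # signed byte
    0x0B: np.dtype('>i2'),  # short
    0x0C: np.dtype('>i4'),  # int
    0x0D: np.dtype('>f4'),  # float
    0x0E: np.dtype('>f8'),  # double
}

def parse_idx(buf):
    """Parse the bytes of an (uncompressed) idx file into a numpy array."""
    zero, type_code, dims = struct.unpack('>HBB', buf[:4])
    if zero != 0 or type_code not in IDX_DTYPES:
        raise ValueError('not an idx file: bad magic number')
    header_size = 4 + 4 * dims
    # One 4-byte big-endian unsigned integer per dimension.
    shape = struct.unpack('>' + 'I' * dims, buf[4:header_size])
    data = np.frombuffer(buf, dtype=IDX_DTYPES[type_code], offset=header_size)
    return data.reshape(shape)
```

For the MNIST files themselves the type code is always 0x08, so this behaves like `read_idx` applied to the file's contents.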
Use "gunzip [filename]" on each file before reading it.
The original file "train-images-idx3-ubyte.gz" is 9912422 bytes, which is much smaller than the 28 x 28 x 60000 = 47040000 bytes of pixel data it should contain, because it is still compressed. You need to unzip the file before you can use it; tools such as 7-Zip or WinRAR can do this. Otherwise you will hit the error "cannot reshape array of size 9912386 into shape (2055376946,226418,1634299437,1768776039,1702047337,1685599021,1969387892,1694559388)". (Written for students who are as puzzled as I was while doing their homework.)
Python can read gzip files directly. Just add:
import gzip
Then change
with open(filename, 'rb') as f:
to:
with gzip.open(filename) as f:
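Putting those two changes together, a gzip-aware variant might look like the sketch below. The name `read_idx_gz` is made up for illustration, and `np.frombuffer` is used instead of `np.fromstring` because the binary mode of `np.fromstring` is deprecated.

```python
import gzip
import struct

import numpy as np

def read_idx_gz(filename):
    """Read a gzip-compressed idx file (e.g. train-images-idx3-ubyte.gz)."""
    # gzip.open decompresses on the fly; 'rb' is also its default mode.
    with gzip.open(filename, 'rb') as f:
        zero, data_type, dims = struct.unpack('>HBB', f.read(4))
        shape = tuple(struct.unpack('>I', f.read(4))[0] for _ in range(dims))
        # Assumes unsigned-byte data, as in the MNIST files.
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)
```

With this version the downloaded .gz files can be passed in as they are, without running gunzip first.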
Thanks a lot dude! It is interesting that the format stores its integers big-endian (MSB first), the byte order of most non-Intel processors, since in my mind most people ARE using Intel processors. Anyway, thanks for the gist!
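To see why the byte order matters, here is a small illustration using the first four bytes of train-images-idx3-ubyte (00 00 08 03). The `>` prefix in the struct format string asks for big-endian regardless of the machine's native order, which is why the gist works on Intel hardware too.

```python
import struct

# First four bytes of an uncompressed MNIST image file.
magic = bytes([0x00, 0x00, 0x08, 0x03])

big = struct.unpack('>I', magic)[0]     # big-endian, as the idx format specifies
little = struct.unpack('<I', magic)[0]  # little-endian, Intel's native order

print(big)     # 2051
print(little)  # 50855936
```

Reading with the wrong byte order gives a meaningless magic number, and the same would happen to every dimension size.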
Please help me with the following error:
line 10, in read_idx
return np.fromstring(f.read(), dtype=np.uint8).reshape(shape)
ValueError: cannot reshape array of size 1648841 into shape (2913389620,226353,812330345,1835100005,1932421476,2016619893,1652126821,15506697)
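This looks like the same problem as above: the file is probably still gzipped. A gzip file starts with the bytes 1f 8b, usually followed by 08 08 when the original filename is stored, so the gist's header parsing sees a dimension count of 8 and reads 8 nonsense dimension sizes. A quick check along these lines can confirm it before parsing; `looks_gzipped` is a hypothetical helper, not part of the gist.

```python
import struct

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip file

def looks_gzipped(filename):
    """Return True if the file starts with the gzip magic bytes."""
    with open(filename, 'rb') as f:
        return f.read(2) == GZIP_MAGIC

# How the gist's header parsing interprets a typical gzip header:
zero, data_type, dims = struct.unpack('>HBB', b'\x1f\x8b\x08\x08')
print(zero, data_type, dims)  # 8075 8 8
```

If `looks_gzipped` returns True for your file, either gunzip it first or open it with `gzip.open` instead of `open`.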