from keras.models import Sequential
from keras.layers import Dense
from keras.utils.io_utils import HDF5Matrix
import numpy as np

def create_dataset():
    import h5py
    X = np.random.randn(200, 10).astype('float32')
    y = np.random.randint(0, 2, size=(200, 1))
    f = h5py.File('test.h5', 'w')
    # Creating a dataset to store the features
    X_dset = f.create_dataset('my_data', (200, 10), dtype='f')
    X_dset[:] = X
    # Creating a dataset to store the labels
    y_dset = f.create_dataset('my_labels', (200, 1), dtype='i')
    y_dset[:] = y
    f.close()

create_dataset()

# Instantiating HDF5Matrix for the training set, which is a slice of the first 150 elements
X_train = HDF5Matrix('test.h5', 'my_data', start=0, end=150)
y_train = HDF5Matrix('test.h5', 'my_labels', start=0, end=150)

# Likewise for the test set
X_test = HDF5Matrix('test.h5', 'my_data', start=150, end=200)
y_test = HDF5Matrix('test.h5', 'my_labels', start=150, end=200)

# HDF5Matrix behaves more or less like a Numpy array with regard to indexing
print(y_train[10])
# But it does not support negative indices, so don't try print(X_train[-1])

model = Sequential()
model.add(Dense(64, input_shape=(10,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd')

# Note: you have to use shuffle='batch' or shuffle=False with HDF5Matrix
model.fit(X_train, y_train, batch_size=32, shuffle='batch')

model.evaluate(X_test, y_test, batch_size=32)
I'm not really understanding how to do this with images, especially because HDF5Matrix of course only works on matrices, i.e. 2 dimensions. For instance, I have an HDF5 file with 2 datasets, X and y. X is of shape (92072960, 1) and y is of shape (92072960, 112). X has been flattened to a long list of pixels with their respective values so that it can be stored as a matrix. Thus, to feed an image into the CNN, I need to unflatten it. Since each 224 * 224 image has 50176 pixels, I could do something like:...
Do you see what I'm asking -- where should I reshape the arrays loaded from the HDF5 matrix so that they can be fed into a CNN, and when should I call model.fit?
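One possible place to do it, sketched here as a generator rather than taken from the gist: read a contiguous block of flattened pixel rows per batch, reshape it to (batch_size, 224, 224, 1) just before yielding, and call model.fit_generator once on the generator. The file name 'images.h5' and dataset names 'X'/'y' mirror the question above; the handling of y (one 112-dim row per pixel) is an assumption, so adjust it to the real layout.

import h5py

PIXELS_PER_IMAGE = 224 * 224  # 50176 pixels per flattened image

def image_batch_generator(path, batch_size=32):
    f = h5py.File(path, 'r')  # keep the handle open as long as the generator lives
    X, y = f['X'], f['y']
    n_images = X.shape[0] // PIXELS_PER_IMAGE
    while True:  # Keras generators are expected to loop forever
        for start in range(0, n_images - batch_size + 1, batch_size):
            lo = start * PIXELS_PER_IMAGE
            hi = (start + batch_size) * PIXELS_PER_IMAGE
            # This is where the unflattening happens: turn the flat pixel rows
            # back into (224, 224, 1) image tensors right before the batch
            # is handed to the CNN.
            X_batch = X[lo:hi].reshape(batch_size, 224, 224, 1)
            # Assuming y stores one 112-dim row per pixel row, keep one per image
            y_batch = y[lo:hi:PIXELS_PER_IMAGE]
            yield X_batch, y_batch

# fit is then replaced by a single call to fit_generator, e.g.:
# model.fit_generator(image_batch_generator('images.h5'), steps_per_epoch=200, epochs=10)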
Thank you, this is the kind of example I was looking for.
I'm still wondering if I could use HDF5Matrix for a multiple-input/output model in Keras...
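For what it's worth, a hedged sketch (not taken from the gist): in the Keras versions this gist targets, model.fit accepts a list of array-likes for a multi-input model, so one HDF5Matrix per input dataset should slot in the same way, with the same shuffle='batch' or shuffle=False caveat. The file 'multi.h5' and its dataset names below are hypothetical.

from keras.models import Model
from keras.layers import Input, Dense, concatenate
from keras.utils.io_utils import HDF5Matrix

X_a = HDF5Matrix('multi.h5', 'input_a')    # e.g. shape (N, 10)
X_b = HDF5Matrix('multi.h5', 'input_b')    # e.g. shape (N, 5)
y_main = HDF5Matrix('multi.h5', 'labels')  # e.g. shape (N, 1)

in_a = Input(shape=(10,))
in_b = Input(shape=(5,))
h = concatenate([in_a, in_b])
out = Dense(1, activation='sigmoid')(h)

model = Model(inputs=[in_a, in_b], outputs=out)
model.compile(loss='binary_crossentropy', optimizer='sgd')

# One HDF5Matrix per input, passed as a list, just like Numpy arrays would be
model.fit([X_a, X_b], y_main, batch_size=32, shuffle='batch')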
I am using Keras with the Theano backend.
It gives me Traceback (most recent call last): Please help
Hi, I just ran your code (example_hdf5matrix.py) and it does not work. I get the following error trace:
Hi, I am using HDF5Matrix to load a dataset and train my model with it. Compared to a numpy array with the same contents, training a Keras model with the HDF5Matrix results in much slower learning: in the first epoch I get 10% accuracy when using the HDF5Matrix, but 40% accuracy when using the numpy array. I have posted in the Keras forum for help as well; see that post for more details. Thank you
@lamenramen I got the same error. Did you ever figure it out?
Works well, until you create the HDF5 file using Pandas.
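If it helps, a rough workaround sketch (the cause is an assumption, not verified here): DataFrame.to_hdf writes a PyTables group layout rather than a single plain dataset, so there is no dataset name for HDF5Matrix to point at; re-exporting the values with h5py gives it one. The file and key names below are made up.

import h5py
import pandas as pd
from keras.utils.io_utils import HDF5Matrix

# Read back whatever pandas wrote, then re-export it as one plain 2-D dataset
df = pd.read_hdf('pandas_made.h5', key='df')
with h5py.File('plain.h5', 'w') as f:
    f.create_dataset('my_data', data=df.values.astype('float32'))

X = HDF5Matrix('plain.h5', 'my_data')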
HDF5Matrix is much slower when I read the data batch by batch, or in a for loop. Here is a quick modification:
It uses a generator: it splits the large dataset, which couldn't fit into memory as a whole, into 100 segments and generates batches from each segment in turn.
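The snippet itself isn't reproduced here, but the idea as described reads as something like the following sketch; the segment count, batch size, and the reuse of 'test.h5'/'my_data'/'my_labels' from the example at the top are assumptions.

import math
import h5py

def hdf5_segment_generator(path, x_name, y_name, batch_size=32, n_segments=100):
    """Read the HDF5 file in ~100 large contiguous segments, keep one segment
    in memory at a time, and yield mini-batches from it."""
    f = h5py.File(path, 'r')
    X, y = f[x_name], f[y_name]
    n = X.shape[0]
    seg_size = int(math.ceil(n / float(n_segments)))
    while True:
        for seg_start in range(0, n, seg_size):
            seg_end = min(seg_start + seg_size, n)
            # One big contiguous read per segment is much faster than many
            # tiny reads scattered over the file.
            X_seg = X[seg_start:seg_end]
            y_seg = y[seg_start:seg_end]
            for i in range(0, len(X_seg), batch_size):
                yield X_seg[i:i + batch_size], y_seg[i:i + batch_size]

# model.fit_generator(hdf5_segment_generator('test.h5', 'my_data', 'my_labels'),
#                     steps_per_epoch=200 // 32, epochs=10)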
@Shawn-Shan, thanks a lot!
@Shawn-Shan, can we use it with multiple workers?
I think it should not be used with multiple workers, @Shawn-Shan. And for my use case (I use the Sequence interface), I need to set shuffle=False explicitly.
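For reference, a minimal sketch of that kind of Sequence, assuming the same 'test.h5' layout as the example at the top; it is not the code from this thread.

import math
import h5py
from keras.utils import Sequence

class HDF5BatchSequence(Sequence):
    """Serve contiguous batches from an HDF5 file; each __getitem__ is one
    contiguous read, so batches stay fast even when the dataset doesn't fit
    in memory."""
    def __init__(self, path, x_name, y_name, batch_size=32):
        self.f = h5py.File(path, 'r')
        self.X, self.y = self.f[x_name], self.f[y_name]
        self.batch_size = batch_size

    def __len__(self):
        return int(math.ceil(self.X.shape[0] / float(self.batch_size)))

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.X[lo:hi], self.y[lo:hi]

# seq = HDF5BatchSequence('test.h5', 'my_data', 'my_labels')
# model.fit_generator(seq, shuffle=False)  # keep reads sequential, as noted above;
# workers is left at its default, since sharing one open HDF5 handle across
# several workers is best avoided.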
Thanks for the generator tip @Shawn-Shan. That meant I could actually fit my 200 GB data! Note that I had to change
Thanks.👍