The goal of this document is to explain the basic mechanism of datasets in Chainer and how to prepare a customized dataset that works with the Trainer.
See the official documentation for the full details of the specification.
In order for a dataset to work with the Trainer, it must have two methods:

- `__getitem__`, which is used for indexing like `dataset[i]` or slicing like `dataset[i:j]`
- `__len__`, which enables us to feed it to `len()`

In other words, the Trainer can handle any object as a dataset as long as it has `__getitem__` and `__len__`.
For example, suppose we want to use a NumPy ndarray as a dataset:

dataset = numpy.random.uniform(-1, 1, (100, 50)).astype(numpy.float32)

Then this dataset consists of 100 (= `len(dataset)`) samples, and its i-th example is `dataset[i]`, a 1-dimensional array of length 50.
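As a quick check, the two required methods behave as expected on such an ndarray (the variable names below are just for illustration):

```python
import numpy

# Any object with __getitem__ and __len__ can serve as a dataset.
dataset = numpy.random.uniform(-1, 1, (100, 50)).astype(numpy.float32)

print(len(dataset))          # number of samples: 100
print(dataset[0].shape)      # one sample is a 1-d array of length 50: (50,)
print(dataset[10:20].shape)  # slicing also works: (10, 50)
```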
If we want to attach ground-truth labels, `chainer.datasets.TupleDataset` is suitable for this job. In the example above, each sample consists of a pair: a feature vector of length 50 and an integer label:
feature_vector = numpy.random.uniform(-1, 1, (100, 50)).astype(numpy.float32)
label = numpy.random.randint(0, 10, (100,)).astype(numpy.int32)
dataset = chainer.datasets.TupleDataset(feature_vector, label)
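To illustrate what `TupleDataset` does without requiring Chainer, here is a minimal pure-NumPy sketch of its indexing behavior (`MiniTupleDataset` is a hypothetical name, not part of Chainer):

```python
import numpy

class MiniTupleDataset:
    """A minimal sketch of what chainer.datasets.TupleDataset does."""

    def __init__(self, *arrays):
        # All underlying arrays must have the same length.
        assert all(len(a) == len(arrays[0]) for a in arrays)
        self._arrays = arrays

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns a list of per-sample tuples.
            return [tuple(a[i] for a in self._arrays)
                    for i in range(*index.indices(len(self)))]
        # Integer indexing returns one (feature, label) tuple.
        return tuple(a[index] for a in self._arrays)

    def __len__(self):
        return len(self._arrays[0])

feature_vector = numpy.random.uniform(-1, 1, (100, 50)).astype(numpy.float32)
label = numpy.random.randint(0, 10, (100,)).astype(numpy.int32)
dataset = MiniTupleDataset(feature_vector, label)
x, y = dataset[3]  # one sample: a feature vector and its label
```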
Chainer has a utility function `get_mnist`, which is used in the official MNIST example. If `get_mnist` is called with `withlabel=True`, the returned value is a tuple of two `TupleDataset`s, one for training and the other for testing. Each sample in these datasets is a pair of a grayscale image and a label (a single integer).
If your dataset is so large that it cannot fit in memory, one way to create a dataset is to make an object (or Python class) that takes a list of paths to samples (e.g. in the `__init__` method) and fetches samples lazily from disk in the `__getitem__` method. `chainer.datasets.ImageDataset`, which Chainer offers officially, takes this strategy.
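A lazy dataset along these lines might look as follows. This is only a sketch: `NpyFileDataset` and the choice of `.npy` files are illustrative assumptions (the real `ImageDataset` reads image files):

```python
import numpy

class NpyFileDataset:
    """Hypothetical lazy dataset: keeps only file paths in memory
    and loads each sample from disk on demand."""

    def __init__(self, paths):
        self._paths = list(paths)  # cheap: only strings are stored

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Support slicing by loading each requested sample.
            return [numpy.load(p) for p in self._paths[index]]
        # Load a single sample lazily from disk.
        return numpy.load(self._paths[index])

    def __len__(self):
        return len(self._paths)
```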
It is sometimes cumbersome to implement `__getitem__` because it must support not only indexing by integer but also slicing. In that case, users can create a dataset by inheriting from `chainer.dataset.DatasetMixin` and implementing the `get_example` method instead of implementing `__getitem__` directly. `get_example(i)` should return the i-th sample. `DatasetMixin` then lets the dataset support both indexing by integer (e.g. `dataset[i]`) and slicing (e.g. `dataset[i:j]`).
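The idea can be sketched without Chainer as follows: a small mixin (here called `MiniDatasetMixin`, a stand-in for `chainer.dataset.DatasetMixin`, not the real implementation) writes `__getitem__` once, on top of a user-defined `get_example`:

```python
class MiniDatasetMixin:
    """Sketch of the DatasetMixin idea: __getitem__ is written once here;
    subclasses only implement get_example and __len__."""

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing falls back to fetching samples one by one,
            # which is why it can be slow for large slices.
            return [self.get_example(i)
                    for i in range(*index.indices(len(self)))]
        return self.get_example(index)

    def get_example(self, i):
        raise NotImplementedError

class SquaresDataset(MiniDatasetMixin):
    """Toy dataset: the i-th example is i squared."""

    def __len__(self):
        return 10

    def get_example(self, i):
        return i * i

dataset = SquaresDataset()
```

With only `get_example` and `__len__` implemented, both `dataset[3]` and `dataset[2:5]` work.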
The implementation of `chainer.datasets.SubDataset` could be helpful as an example of `DatasetMixin`.
Note that if the object you want to treat as a dataset already supports slicing, you do not need to use `DatasetMixin`.
Note also that slicing via `DatasetMixin` can be slower, because it builds the sub-array by fetching samples one by one via `get_example`. If your application needs better performance, you might need to implement `__getitem__` directly.