This dataset is named MSBin, which stands for MultiSpectral Document Binarization. It is dedicated to the binarization of multispectral document images and was introduced in [Hollaus et al. 2019].
The dataset is on Zenodo:
The dataset contains the following folders:
- train: The training data set.
- test: The test data set.
- train_val_split: This folder contains the same images as the train folder, split into two folders: train and val. The val folder is the validation set used during training of the CNN in [Hollaus et al. 2019].
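As a minimal sketch of how the folder layout could be addressed in code, the helper below builds the expected paths for the train and test folders and their images and labels subdirectories. The function names `split_dirs` and `missing_dirs` and the root path argument are illustrative assumptions, not part of the dataset. The layout of train_val_split is deliberately left out, since only its train and val folders are described.

```python
from pathlib import Path

# Folder names as listed in this description; "images" and "labels" are the
# subdirectories that the train and test folders are described as containing.
SPLITS = ("train", "test")
SUBDIRS = ("images", "labels")


def split_dirs(root):
    """Return {split: {subdir: Path}} for the train and test folders.

    `root` is a hypothetical local path to the unpacked dataset.
    """
    root = Path(root)
    return {s: {d: root / s / d for d in SUBDIRS} for s in SPLITS}


def missing_dirs(root):
    """List the expected directories that do not exist under `root`."""
    return [
        path
        for dirs in split_dirs(root).values()
        for path in dirs.values()
        if not path.is_dir()
    ]
```

Calling `missing_dirs` on the unpacked dataset root is a quick way to check that the download is complete before starting a training run.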
The folders containing the training and test data contain subdirectories named images and labels. The multispectral images contained in the images folder are named with the following naming convention, whereby an underscore separates the different elements:
BookId describes the type of manuscript and is either EA or BT.
PageId is a digit, which can be mapped to a certain page in the corresponding book.
WavelengthId is also a number and identifies the spectral range at which the image was acquired.
The mapping is provided in the following table, together with the exposure time used for each spectral range.
|WavelengthId||Illumination / Spectral Range||Exposure time (in sec.)|
|0||White light (broadband)||0.0666|
|1||365 nm (UV fluorescence)||10|
|2||450 nm (narrowband)||0.125|
|3||465 nm (narrowband)||0.1|
|4||505 nm (narrowband)||0.05|
|5||535 nm (narrowband)||0.0666|
|6||570 nm (narrowband)||0.1666|
|7||625 nm (narrowband)||0.0333|
|8||700 nm (narrowband)||0.1666|
|9||780 nm (narrowband)||0.2|
|10||780 nm (narrowband)||0.2|
|11||870 nm (narrowband)||0.5|
|12||940 nm (narrowband)||3|
For the ground truth images contained in the labels folder, the following naming convention is used:
The dataset comprises 130 image portions: the training set contains 80 multispectral images and the test set contains 50. The portions are taken from two medieval manuscripts, named Bitola-Triodion ABAN 38 (hereafter BT) and Enina-Apostolus NBMK 1144 (hereafter EA). EA is in worse condition than BT, since it contains partially damaged folios and faded-out ink. Each multispectral image in the dataset has been taken from a different manuscript folio.
Both manuscripts contain Cyrillic text written in iron gall ink. The corresponding foreground regions are colored black or brown. This class is hereafter denoted as FG. The document background class is abbreviated as BG. Additionally, a subset of the images contains characters written in red ink, denoted as FGR. The test set contains regions that are labeled as uncertain regions (UR) in the ground truth images. UR denotes regions that could not be clearly identified as belonging to FG, FGR or BG. These regions are excluded from the evaluation: they are marked as belonging to the background, both in the ground truth images and in the resulting images. The training set does not contain uncertain regions, so that training on entire image patches is possible.
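The exclusion of uncertain regions described above can be sketched as follows. This is an illustrative helper, not the authors' evaluation code: the name `mask_uncertain` and the representation of images as nested lists of class labels are assumptions made for the example.

```python
def mask_uncertain(gt, pred, ur="UR", bg="BG"):
    """Return copies of `gt` and `pred` with every UR position set to background.

    `gt` and `pred` are equally sized grids (lists of rows) of class labels.
    Wherever the ground truth is UR, both outputs are set to BG, so that these
    pixels cannot count for or against a method.
    """
    gt_out, pred_out = [], []
    for gt_row, pred_row in zip(gt, pred):
        gt_out.append([bg if g == ur else g for g in gt_row])
        pred_out.append([bg if g == ur else p for g, p in zip(gt_row, pred_row)])
    return gt_out, pred_out
```

After masking, any pixel-wise metric can be computed on the two returned grids without special handling for UR.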
The ground truth contains a color-coded image for each multispectral image, whereby the colors encode different classes - as listed in the following table.
|Label||Description||RGB color code|
|FG||Main text||(255, 255, 255)|
|FGR||Red ink||(122, 122, 122)|
|BG||Background||(0, 0, 0)|
|UR||Uncertain region||(0, 0, 255)|
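To illustrate how the color coding in the table above can be turned back into class labels, here is a minimal sketch in plain Python. The function name `decode_labels` and the input format (a grid of RGB tuples, for example obtained from an image library of your choice) are assumptions for the example. Only the RGB codes come from the table.

```python
# RGB codes copied from the table above.
COLOR_TO_LABEL = {
    (255, 255, 255): "FG",   # main text
    (122, 122, 122): "FGR",  # red ink
    (0, 0, 0): "BG",         # background
    (0, 0, 255): "UR",       # uncertain region
}


def decode_labels(rgb_rows):
    """Map a grid of (R, G, B) triples to a grid of class labels.

    Raises ValueError if a pixel has a color that is not listed in the table.
    """
    decoded = []
    for row in rgb_rows:
        labels = []
        for pixel in row:
            key = tuple(int(c) for c in pixel)
            if key not in COLOR_TO_LABEL:
                raise ValueError(f"unexpected color {key}")
            labels.append(COLOR_TO_LABEL[key])
        decoded.append(labels)
    return decoded
```

Raising on unknown colors rather than silently mapping them to background makes it easier to notice a ground truth image that was resampled or saved with lossy compression.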
This table shows some properties of the dataset - including the number of training and test images.
|Train images containing FGR||19||11|
|Test images containing FGR||14||6|
The images were acquired in the course of the CIMA (Centre of Image and Material Analysis in Cultural Heritage) project.
The images contained in the MSBin dataset have been captured with a Phase One IQ260 achromatic camera with a resolution of 60 megapixels. A multispectral LED panel provides 11 different narrow-band spectral ranges from 365 nm to 940 nm. An ultraviolet (UV) long-pass filter has been used in combination with UV light (365 nm) to acquire UV fluorescence images. For the acquisition of the remaining 10 spectral ranges no optical filter has been used. Additionally, a broadband LED illumination has been used to acquire white light images. Therefore, each multispectral image consists of 12 channels.
For each spectral range an individual exposure time has been determined in order to maximize the spectral range. These exposure times remained unchanged during the acquisition, because changing them would have increased the spectral variability of the target classes. The images have been registered to one another with a multimodal image registration algorithm [Heinrich et al. 2012], in order to correct optical distortions.
[Hollaus et al. 2019]
F. Hollaus, S. Brenner and R. Sablatnig: "CNN based Binarization of MultiSpectral Document Images". To appear in: International Conference on Document Analysis and Recognition (ICDAR), 2019.
[Heinrich et al. 2012]
M. P. Heinrich, M. Jenkinson, M. Bhushan, T. Matin, F. V. Gleeson, S. M. Brady and J. A. Schnabel: "MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration". Medical Image Analysis, vol. 16, no. 7, pp. 1423–1435, 2012.