Cloud removal in Sentinel-2 imagery using a deep residual neural network and SAR-optical data fusion

Andrea Meraner (a,1), Patrick Ebel (a), Xiao Xiang Zhu (a,b,*), Michael Schmitt (a,*)

(a) Signal Processing in Earth Observation, Technical University of Munich, Arcisstraße 21, 80333 Munich, Germany
(b) Remote Sensing Technology Institute, German Aerospace Center (DLR), Münchener Straße 20, 82234 Weßling-Oberpfaffenhofen, Germany
(*) Corresponding authors at: Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), Oberpfaffenhofen, 82234 Wessling, Germany and Signal Processing in Earth Observation, Technical University of Munich, Arcisstraße 21, 80333 Munich, Germany.
(1) Present address: EUMETSAT, Eumetsat Allee 1, 64295 Darmstadt, Germany.

E-mail addresses: andrea.meraner@eumetsat.int (A. Meraner), patrick.ebel@tum.de (P. Ebel), xiaoxiang.zhu@dlr.de (X.X. Zhu), m.schmitt@tum.de (M. Schmitt).

ISPRS Journal of Photogrammetry and Remote Sensing 166 (2020) 333–346
https://doi.org/10.1016/j.isprsjprs.2020.05.013
Keywords: Cloud removal, Optical imagery, SAR-optical, Data fusion, Deep learning, Residual network

ABSTRACT
Optical remote sensing imagery is at the core of many Earth observation activities. The regular, consistent and
global-scale nature of the satellite data is exploited in many applications, such as cropland monitoring, climate
change assessment, land-cover and land-use classification, and disaster assessment. However, one main problem
severely affects the temporal and spatial availability of surface observations, namely cloud cover. The task of
removing clouds from optical images has been the subject of study for decades. The advent of the Big Data era in
satellite remote sensing opens new possibilities for tackling the problem using powerful data-driven deep
learning methods.
In this paper, a deep residual neural network architecture is designed to remove clouds from multispectral
Sentinel-2 imagery. SAR-optical data fusion is used to exploit the synergistic properties of the two imaging
systems to guide the image reconstruction. Additionally, a novel cloud-adaptive loss is proposed to maximize the
retainment of original information. The network is trained and tested on a globally sampled dataset comprising
real cloudy and cloud-free images. The proposed setup allows even optically thick clouds to be removed by reconstructing an optical representation of the underlying land surface structure.
1. Introduction
1.1. Motivation
While the quality and quantity of satellite observations have dramatically increased in recent years, one common problem has persisted for optical remote sensing from the first observations until today:
cloud cover. As thick clouds appear opaque in all optical frequency
bands, the presence thereof completely corrupts the reflectance signal
and obstructs the view of the surface underneath. This causes considerable data gaps in both the spatial and temporal domains. For applications where consistent time series are needed, e.g. agricultural
monitoring, or where a certain scene must be observed at a specific
time, e.g. disaster monitoring, cloud cover represents a serious hindrance.
The problem of cloud cover becomes even more apparent considering the amount of cloud coverage the Earth’s surface experiences
every day. An analysis over 12 years of observations by the Moderate
Resolution Imaging Spectroradiometer (MODIS) instrument aboard the
satellites Terra and Aqua showed that 67% of the Earth’s surface is
covered by clouds on average (King et al., 2013). Over land surfaces,
the cloud fraction averages to 55%, featuring distinctive seasonal patterns. Considering the importance of these cloud occlusion percentages,
it becomes clear how a successful cloud removal algorithm would
greatly increase the availability of useful data. The task of detecting and
removing clouds from satellite images has been tackled since the beginning of Earth observation activities, and is still today an area of
active research. In this work, we present a deep learning model capable
of removing clouds from Sentinel-2 images. The network design and the
integration of additional Sentinel-1 SAR data make it robust to extensive cloud coverage conditions. The model is trained on a large dataset containing scenes acquired globally, ensuring its general applicability to any land cover type.
1.2. Related works
The reconstruction of missing information in remote sensing data is
a long-studied problem. In Shen et al. (2015), a comprehensive review
of traditional techniques is provided. In recent decades, a multitude of
approaches have been proposed for the specific task of cloud removal in
optical imagery. Methods that follow traditional approaches can be
categorized into three major clusters, namely multispectral, multitemporal and inpainting techniques. Many methods are a hybrid combination of these categories. Multispectral approaches are applied in the
case of haze and thin cirrus clouds, where optical signals are not
completely blocked but experience partial wavelength-dependent absorption and reflection. In such cases, surface information is partly
present and can be restored, e.g. using mathematical (Xu et al., 2019;
Hu et al., 2015) or physical models (Xu et al., 2016; Lv et al., 2016).
Multispectral methods have the advantage of exploiting information
from the original scene without requiring additional data, but are
limited to filmy, semi-transparent clouds. Multitemporal approaches
restore cloudy scenes by integrating information from reference images
acquired with clear sky conditions (Lin et al., 2013; Li et al., 2015;
Ramoino et al., 2017; Ji et al., 2018). Multitemporal dictionary learning techniques can also be used for this purpose (Li et al., 2014). The multitemporal data may also come from different sensors on different satellites (Li et al., 2019). Multitemporal methods are the most popular as
they substitute corrupted pixels with real cloud-free observations.
However, problems arise when reconstructing scenes with rapidly
changing surface conditions (e.g. due to phenological events) because
of the time difference between the scene to be reconstructed and the
reference acquisition. Inpainting approaches fill corrupted regions by
exploiting surface information from clear parts of the same cloud-affected image (Meng et al., 2017). Such direct inpainting methods do not
require additional images, but achieve good results only with small
clouds. To mitigate this problem, the process of selecting the most
suitable similar pixel to be cloned is often guided by auxiliary data, e.g.
multitemporal (Cheng et al., 2014) or SAR images (Eckardt et al.,
2013). Such methods deliver good results but are more complex because they require additional multitemporal or multisensor data.
In parallel to traditional approaches for cloud removal, data-driven
methods using deep learning have been gaining attention recently.
Many of the problems arising from traditional algorithms can be potentially solved by the end-to-end learning of deep neural networks
(DNN). For example, the detection and segmentation of clouds as a
preliminary step is often not required, as it can be learned implicitly by
the networks. In the case of multisensor data fusion, the translation
between different sensor domains can also be learned. Moreover, DNNs
can be trained to cope with any type of cloud and residual atmospheric
conditions. A first paper exploiting the potential of DNNs for restoring
missing information in remote sensing imagery was published in Zhang
et al. (2018). The method uses a spatial–temporal-spectral convolutional neural network (CNN) to restore data gaps in Landsat TM data. In
the case of clouds, an additional multitemporal image of the same scene
is used to support the reconstruction. Recent papers have been focusing
on using a modern CNN architecture called conditional generative adversarial network (cGAN) (Mirza and Osindero, 2014). In Enomoto
et al. (2017), a cGAN is trained to remove simulated clouds from
Worldview-2 RGB images using NIR images as auxiliary data, while in
Grohnfeldt et al. (2018) a cGAN removes simulated clouds from Sentinel-2 imagery using SAR data as additional information. An evolution
of the cGAN, called Cycle-GAN, can be used to avoid the need of paired
cloudy-cloudfree images for training (Singh and Komodakis, 2018). A
different approach for generating cloud-free images is to perform a
direct translation from SAR to optical using cGANs (Bermudez et al.,
2018; Bermudez et al., 2019; He and Yokoya, 2018; Fuentes Reyes
et al., 2019). Besides their powerful generative capabilities, cGANs can
suffer from training and prediction instabilities when fed with bad input
data (e.g. large cloud coverage), as reported in some of the referenced
studies and in Mescheder et al. (2018). Based on these experiences, the
work presented in this paper develops a model architecture that is robust to the presence of large and optically thick clouds in the input data.

In addition to these conceptual considerations, the need for large datasets is also a prominent problem in deep learning for cloud removal. The studies cited above achieve promising results, but the datasets used are very limited and the performance is evaluated on non-independent data. An assessment of the generalization capability of the networks, i.e. their ability to remove clouds from previously unseen scenes, is therefore not directly possible. In contrast, we present and use a large dataset that is suited for a deterministic separation of images for training and testing purposes and thus provides a sound idea of how well the network will generalize to unseen Sentinel-2 data.
1.3. Paper structure
This paper is structured as follows. After this introductory section,
the characteristics of the used dataset are presented in Section 2. The
proposed methodology, including the designed neural network architecture and the custom loss, is explained in Section 3. The conducted experiments and the obtained results are then presented in Section 4 and further discussed in Section 5. Finally, a summary and conclusions are given in
Section 6.
2. Data
While the data-driven method proposed in this paper is of generic
nature and sensor-agnostic, the specific model we train and our experiments focus on satellite imagery provided by the Sentinel satellites
of the European Copernicus Earth observation program (Desnos et al.,
2014), as these data are globally and freely available in a user-friendly
manner.
2.1. Sentinel-1 and Sentinel-2 missions
The cloud removal algorithm developed in this work is applied on
optical data from the Copernicus Sentinel-2 mission (Drusch et al.,
2012). The mission provides data for risk management, land use/land
cover and environmental monitoring, as well as urban and terrestrial
mapping for humanitarian and development aid. Imagery is available
over all main land areas from −56° to 84° of latitude with a global
revisit time of 5 days at the equator. The optical payload is called Multi
Spectral Instrument (MSI) and comprises 13 spectral bands. Four 10 m
high-resolution bands are placed in the visible and NIR domain for core
mapping applications. Six 20 m resolution bands are used for environmental monitoring and high-level products. Three 60 m bands are
used for detection and correction of atmospheric effects. The swath
width is 290 km.
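For reference, the band-to-resolution grouping of the MSI instrument can be summarized in a small lookup table. These are general Sentinel-2 mission facts, added here only for convenience and not taken from the paper itself:

S2_BAND_RESOLUTION_M = {
    # four 10 m visible/NIR bands for core mapping applications
    "B2": 10, "B3": 10, "B4": 10, "B8": 10,
    # six 20 m bands for environmental monitoring and high-level products
    "B5": 20, "B6": 20, "B7": 20, "B8A": 20, "B11": 20, "B12": 20,
    # three 60 m bands for the detection and correction of atmospheric effects
    "B1": 60, "B9": 60, "B10": 60,
}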
The SAR data used in this work originates from the Copernicus
Sentinel-1 mission (Torres et al., 2012). The C-band radar instrument
(5.4 GHz center frequency) on board the two constellation satellites
can operate in various modes depending on the position of the satellite
and the scope of the observations. The main operational mode, called
Interferometric Wide Swath (IW), is used over land surfaces and features a swath of 250 km and a resolution of 5 m in range and 20 m in
azimuth direction. The combined revisit time is 6 days. The Sentinel-1 mission was designed to provide data in all weather situations for
maritime and land monitoring, emergency response, climate change
and security.
2.2. SEN12MS-CR Dataset
The dataset presented and used in this work, called SEN12MS-CR, is
an evolution of the SEN12MS dataset (Schmitt et al., 2019b). SEN12MS
is publicly available and contains triplets of cloud-free Sentinel-2 optical images, Sentinel-1 SAR images and MODIS land cover maps. It was
developed for common remote sensing applications, such as scene
classification or semantic segmentation for land cover mapping. Using
the same procedure as described in the original paper, SEN12MS-CR
was created specifically as a dataset for training deep learning models
for cloud removal.
SEN12MS-CR contains 169 non-overlapping regions of interest
(ROIs) sampled across all inhabited continents during all meteorological seasons. The scene locations are randomly drawn from two
uniform distributions, namely one over all landmasses and one over
urban areas only. This introduces a bias towards urban landscapes, which are often the focus of remote sensing studies and contain more
complex patterns. The ROIs have an average size of approx. 5200 × 4000
px, which corresponds to 52 × 40 km ground coverage due to the pixels
having 10 m ground sampling distance. Each ROI is composed of a
triplet of orthorectified, geo-referenced cloudy and cloud-free Sentinel-2 images, as well as the corresponding Sentinel-1 image. All three
images were acquired within the same meteorological season to limit
surface changes. To assess the cloud coverage of the optical images, the
cloud detector described in Schmitt et al. (2019a) was used. The cloud-free Sentinel-2 images have been selected with a threshold of 10% cloud coverage, while the cloudy images have between 20% and 70% cloud
coverage.
The Sentinel-2 data is from the Level-1C top-of-atmosphere reflectance product and has values in the range [0, 10,000]. All 13 original bands were included. The Sentinel-1 data is from the Level-1 GRD
product acquired in IW mode with two polarization channels (VV and
VH). The values are σ0 backscatter coefficients that have been transformed into the dB scale.
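The conversion from linear σ0 backscatter to the dB scale follows the standard 10·log10 rule; a minimal NumPy sketch (not the dataset's actual preparation code) is:

import numpy as np

def sigma0_to_db(sigma0_linear, eps=1e-6):
    # Standard conversion of linear sigma-nought backscatter to decibels.
    return 10.0 * np.log10(np.maximum(sigma0_linear, eps))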
To adapt the images for the ingestion into a CNN, the ROIs were cut
into small 256 × 256 px patches with a 128 px stride. The amount of
overlap between neighboring patches is therefore 50%. This has been
chosen to maximize the number of patches extractable from an image,
while still ensuring an acceptable independency. An automated and
manual check of the generated patches was performed to eliminate
mosaicking artifacts and other corrupted regions. The final quality-controlled SEN12MS-CR dataset contains 157,521 patch triplets with a total of 28 layers, amounting to around 620 GB of storage. Fig. 1 shows examples of patch triplets from the dataset.
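A minimal sketch of the patch extraction described above (256 × 256 px windows cut with a 128 px stride, i.e. 50% overlap) could look as follows; the array layout and function names are illustrative only:

import numpy as np

def extract_patches(roi, patch=256, stride=128):
    # Cut a (H, W, C) region of interest into overlapping patch x patch tiles;
    # stride = patch // 2 gives the 50 % overlap described above.
    h, w = roi.shape[:2]
    tiles = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            tiles.append(roi[top:top + patch, left:left + patch])
    return np.stack(tiles)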
Fig. 1. Example 256 × 256 px patch triplets from the SEN12MS-CR dataset. (a,d,g) are the input cloudy optical images, (b,e,h) are the input SAR channels, and (c,f,i) are the target cloud-free optical images. Throughout the paper, the shown optical images are enhanced true-color RGB composites from the Sentinel-2 10 m resolution B4-B3-B2 bands. The shown SAR images are a composite of the two polarization channels (G = VH, B = VV, R = 0).

In the deep-learning-based cloud removal algorithms cited in the related works, the networks are trained on datasets with clear limitations. For example, in Enomoto
et al. (2017), Grohnfeldt et al. (2018), Zhang et al. (2018) the networks
are trained exclusively on simulated clouds, by using simple Perlin
noise or by manually introducing gaps into the imagery. In Singh and
Komodakis (2018), a dataset of real unpaired cloudy and cloud-free
Sentinel-2 images is used, which however is limited to the RGB channels and comprises only 20 cloudy and 13 cloud-free scenes. In Hu et al.
(2015), a dataset of ten paired cloudy and cloud-free scenes acquired by
Landsat-8 is used. However, the cloud-contaminated images contain
only filmy, partly-transparent clouds. To the best of the authors’
knowledge, SEN12MS-CR is the first dataset used for training cloud
removal networks that comprises a large and representative number of
scenes sampled worldwide, with full multispectral information, containing different types of real-life clouds with their characteristic signature in all channels.
2.3. Train, validation and test datasets

To properly assess the generalization capability of a network, a split into training, validation and test datasets must be performed. For this, the 169 ROIs of SEN12MS-CR were split into 149 scenes for training, 10 for validation and 10 for testing, following a random global distribution. Fig. 2 shows the spatial distribution of the ROIs. Splitting according to the ROIs, rather than the patches, ensures that the three datasets are spatially and temporally completely disparate. All three datasets contain acquisitions from all meteorological seasons. A visual and automated analysis confirmed that all three datasets also have a similar distribution of cloud types and coverage amounts. When separating the patches according to this split, the training dataset amounts to 134,907 patch triplets, the validation dataset to 11,921 and the test dataset to 10,693.

Fig. 2. Global distribution of the 169 ROIs of the SEN12MS-CR dataset. Orange markers denote ROIs selected for training, green for validation and azure for testing. Background image credits: Google Earth/Mapmaker.

3. A ResNet architecture for cloud removal
3.1. ResNet principle
The deep learning model used as backbone for this work is based on
the popular ResNet architecture (He et al., 2016a). ResNets make use of
shortcut connections, operations that skip some layers to shuttle the
information to lower parts of the network, acting as a direct path for
information flow. In the original ResNet case, the shortcut connection
performs an additive identity mapping, i.e. the input state of a residual
block is added to the output of the bypassed layers.
To further understand the residual learning rationale, let H(x) be the mapping that the skipped layers are supposed to learn, as in a traditional plain network starting from the input x. By adding the additive skip connection, we let the layers explicitly learn a residual function F(x) instead:

\[ F(x) = H(x) - x . \tag{1} \]

This is helpful since it preconditions the task: learning a residual correction to the input has proven to be easier for current optimizers than learning the entire input–output mapping from scratch. This is especially true when the optimal mapping for a residual unit is actually close to the identity, i.e. when the network has to just reproduce the input data in the output.
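To make the additive identity mapping of Eq. (1) concrete, the following minimal Keras sketch shows a residual unit in which the skipped layers learn F(x) and the addition restores H(x) = F(x) + x. The layer sizes are illustrative only and are not taken from the paper.

from tensorflow.keras import layers

def residual_unit(x, filters=64):
    # The two convolutions learn the residual F(x); the final addition restores
    # H(x) = F(x) + x as in Eq. (1). Assumes x already has `filters` channels
    # so that the addition is well defined.
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same")(f)  # no activation: F(x) may be negative
    return layers.Add()([x, f])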
3.2. Residual learning for cloud removal
For the task of cloud removal, the residual skip connections of a
ResNet are helpful in several ways:
• Filmy clouds correction: Residual learning offers a clear advantage in the presence of filmy clouds. In this case, the network has to learn only an additive correction that compensates for the thin cloud disturbance in the overcast regions. Through the band concatenation, the network is able to access both the spectral and spatial features; the still partially present ground information acts as a good preconditioning for the restoration process.
• Cloud-free parts reproduction: Due to the large field of view and the comparably small size of clouds, satellite images are typically a mixture of cloudy and cloud-free regions. Over clear-sky regions, the residual connections offer a direct path to transfer unmodified surface information directly to the output.
• Stability of prediction: A ResNet architecture for cloud removal is robust to the presence of large and optically thick clouds in the input data. Even if an input cloudy image is mostly covered by opaque clouds, the network is at least able to adequately reproduce the cloud-free sections. cGAN-based methods (e.g. see Singh and Komodakis, 2018) tend to suffer from prediction instabilities or complete failures with bad input data.
• Optimized learning of deep models: The high representational capacity given by a large number of layers and filters in CNNs is required to reconstruct the signal under thick clouds, where complex structures need to be restored. The ResNet architecture allows large and deep models to be optimized in a comparably fast way and with good performance (He et al., 2016b).
3.3. DSen2-CR model
The proposed model, called DSen2-CR, is based on the super-resolution Deep Sentinel-2 (DSen-2) ResNet presented in Lanaras et al.
(2018), which is itself derived from the state-of-the-art single-image
super-resolution EDSR network (Lim et al., 2017). Similarly to super-resolution, cloud removal can be seen as an image reconstruction task,
where missing spatial and spectral information has to be integrated into
the image to restore the complete information content. To guide the
reconstruction process under thick, optically impenetrable clouds
where no ground information is available, DSen2-CR leverages a SAR
image as a form of prior. For this, a Sentinel-1 image of the same scene
is introduced to the network as an additional input. The image's SAR
channels are simply concatenated to the other channels of the input
optical image. The highly non-linear SAR-to-optical translation, as well
as the cloud detection and treatment, are learned and performed implicitly inside the network. The training is done in an end-to-end setup,
and a cloud-free image of the same scene is presented to the network as
a target for the loss computation. Fig. 3 shows a diagram of the DSen2-CR model and the used residual block design. In the following, further
properties and peculiarities of the network are described:
• Long skip connection: An additive shortcut shuttles the input
cloudy image to an addition layer right before the final output, as
originally proposed in Lanaras et al. (2018). This basically means
that the entire network is learning to predict a residual map that
contains corrections to each pixel of the input cloudy image. In the
case of a clear sky input or filmy clouds, the predicted corrections
will be minor or non-existent. Conversely, for thick clouds with
bright appearance, the corrections will be larger.
• Residual blocks: The main part of the network consists of several
residual units stacked in sequence. The specific number of units B in
the network is a hyperparameter that defines the depth of the network. The residual units each contain four layers and an addition
layer for the residual connection. The four skipped layers are a 2D
convolution layer with subsequent ReLU activation, a second 2D
convolution layer and a final residual scaling layer (see next point).
Only one ReLU activation is used after the first convolutional layer
but not after the second, since the network is supposed to predict
corrections that can be both positive and negative. For both convolutional layers, 3 × 3 kernels are used, following the general
community trend to use smaller kernels in deeper models (Lanaras
et al., 2018). The output feature dimension F, i.e. the number of
different filters, is fixed for all units and is a hyperparameter. A
stride of one pixel and zero padding is always used in order to
maintain the spatial dimensions of the data throughout the network.
Compromising between representational capacity and computational complexity, as well as considering own experiments and the
reported experiences in Lanaras et al. (2018), Lim et al. (2017),
residual units with F = 256 features were selected as a baseline for
the DSen2-CR architecture.
• Residual scaling: The residual scaling layer is a custom layer that multiplies its inputs with a constant scalar. First proposed in Szegedy et al. (2017), this activation scaling has the effect of
stabilizing the training without introducing additional parameters,
such as in batch normalization layers. The value of 0.1 is selected for
the scaling constant in this work.
• Additional convolutions: At the beginning of the network, a concatenation layer stacks the input optical and SAR channels to
enable the joint processing. After this, a 3 × 3 convolution layer
with ReLU activation is introduced to treat the concatenation before
the data is passed through the residual blocks. After the last residual
unit, a final 3 × 3 convolution restores the spectral dimensions to
match the number of bands of the optical image before reaching the
residuals addition layer.
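As a compact illustration of the architecture described above (channel-wise concatenation of the optical and SAR inputs, an initial 3 × 3 convolution with ReLU, B residual blocks with F filters and residual scaling by 0.1, a final 3 × 3 convolution back to the 13 optical bands, and the long additive skip of the input optical image), the following Keras sketch reproduces the topology under stated assumptions. The value B = 16 and all function names are illustrative only; this is not the authors' reference implementation.

from tensorflow.keras import layers, Model

def residual_block(x, filters=256, scale=0.1):
    # Conv + ReLU, Conv (no activation), residual scaling by a constant, additive shortcut.
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same")(f)
    f = layers.Lambda(lambda t: t * scale)(f)  # residual scaling, factor 0.1
    return layers.Add()([x, f])

def build_dsen2cr_sketch(num_blocks=16, filters=256, n_optical=13, n_sar=2):
    # B (num_blocks) and F (filters) are the depth/width hyperparameters discussed above;
    # num_blocks = 16 is an assumption for illustration only.
    opt_in = layers.Input(shape=(None, None, n_optical), name="cloudy_optical")
    sar_in = layers.Input(shape=(None, None, n_sar), name="sar")
    x = layers.Concatenate(axis=-1)([opt_in, sar_in])            # joint optical + SAR processing
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    for _ in range(num_blocks):
        x = residual_block(x, filters)
    x = layers.Conv2D(n_optical, 3, padding="same")(x)           # back to 13 optical bands
    out = layers.Add()([opt_in, x])                              # long skip: output = input + residual map
    return Model([opt_in, sar_in], out, name="dsen2cr_sketch")

model = build_dsen2cr_sketch()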
Several experiments on the network structure and residual block design confirmed the validity and quality of the original DSen-2 architecture. The modifications in DSen2-CR with respect to the original network include the adaptations required to accommodate the two SAR input layers used for guiding the reconstruction, the different number of input and output optical channels, and the network depth as described above.

Fig. 3. Left: DSen2-CR model diagram. Right: residual block design. For each part of the network, the number of layers and the two spatial dimensions are indicated inside parentheses. Since the network is fully convolutional, it can accept input images of arbitrary spatial dimensions m during training and prediction time. F indicates the selected feature dimension and B the selected number of residual blocks included in the network.

Fig. 4. Example images showing changes in surface conditions between the input cloudy acquisitions (a,c) and the target cloud-free images (b,d) taken on a different date.

Fig. 5. Flowchart of the cloud (left stream) and shadow (right stream) detectors employed for the mask creation used in the CARL loss.
Fig. 6. Example images showing the influence of the SAR input on an agricultural and an urban scene under heavy cloud coverage. (a,f) show the cloudy input images, (b,g) the input auxiliary SAR images, (c,h) the target cloud-free images. (d,i) are the model predictions without the SAR input, and (e,j) are the predictions of the full DSen2-CR model including the SAR input.
3.4. Cloud-adaptive regularized loss

As described in the dataset section, the input cloudy image and the target cloud-free optical image have been acquired on different days, but within the same meteorological season. Although the time difference is limited, changes in the surface conditions between the images can still often be observed, especially in agricultural landscapes (see Fig. 4). Since the objective of a cloud removal algorithm is to restore ground information below clouds without modifying clear parts, it is of great importance that as much information as possible from the input image is retained in the output. To minimize the influence of ground changes in the target image, a custom training loss was developed in this work.

Following the recommendation of Lanaras et al. (2018), the L1 metric (mean absolute error) was used as the basic error function, due to its robustness to large deviations and the high dynamic range of the Sentinel-2 data. Defining the predicted output image as P and the cloud-free target image as T, the classic target loss L_T, based on the simple L1 distance between prediction and target, can be formulated as

\[ \mathcal{L}_T = \frac{1}{N_{tot}} \, \lVert P - T \rVert_1 , \tag{2} \]

with N_tot being the total number of pixels in all channels of the optical images. The optimization of this plain L1 loss is simple and straightforward, but it has a drawback: the network is induced to learn, predict and apply unwanted surface changes, because it is trained on multitemporal data with changing ground conditions. To reduce these artifacts, a novel loss principle was developed. The idea is to incorporate a binary cloud and cloud-shadow mask (CSM) into the loss computation, and to use this information to steer the learning process towards a maximized retainment of input information. This custom loss, which we call Cloud-Adaptive Regularized Loss (CARL), is formulated as

\[ \mathcal{L}_{CARL} = \underbrace{\frac{1}{N_{tot}} \Big\lVert \mathrm{CSM} \odot (P - T) + (\mathbf{1} - \mathrm{CSM}) \odot (P - I) \Big\rVert_1}_{\text{cloud-adaptive part}} + \underbrace{\lambda \, \frac{1}{N_{tot}} \lVert P - T \rVert_1}_{\text{target regularization part}} , \tag{3} \]

with P, T, I denoting respectively the predicted, target and input optical images. The CSM mask has the same spatial dimensions as the images, with pixel value 1 for cloud and shadow pixels and 0 for uncorrupted pixels; \(\mathbf{1}\) denotes a matrix of ones with the same spatial dimensions as the images and the CSM. The multiplications marked with \(\odot\) between the CSM and the image differences are element-wise and applied over all channels. In the cloud-adaptive part, the mean absolute error loss is computed w.r.t. the target image for cloudy or shadowed pixels of the input image, and w.r.t. the input image itself for clear-sky pixels. With this, the network learns that it shall optimize the predictions to match the cloud-free parts of the input, and use the multitemporal information only when needed, i.e. for the cloud and shadow reconstruction. However, when training with this cloud-adaptive part only, it was observed that the network introduced artifacts in the predicted images due to a too precise learning of the mask. To avoid this effect, an additional target regularization term in the form of a classic mean absolute error loss between prediction and target (equivalent to L_T in Eq. (2)) was added to the loss function. This additional loss induces the network to produce images that still have a natural, smooth appearance similar to the target image. The regularization factor λ, which scales this target regularization term in Eq. (3), is a hyperparameter that effectively balances the retainment of input information against the prediction artifacts. After extensive tuning, the value of λ = 1 was found to provide the best trade-off.

Table 1. Quantitative results computed on the hold-out test dataset. Results are reported for the proposed DSen2-CR network in different configurations: trained on the proposed CARL loss, trained on the plain L1 target loss L_T, and trained on CARL and L_T but without the SAR input. Target refers to the error computed between the predicted image and the target cloud-free image; this is the loss optimized by L_T. Reprod denotes the reproduction error, namely the error between the predicted image and the clear parts of the input image; this is the part of the CARL loss that is explicitly optimized. Recon is the reconstruction error, namely the error between the predicted image and the target image inside the reconstructed cloud and shadow regions.

(a) Test results on pixel-wise metrics (MAE and RMSE in units of ρ_TOA, PSNR in dB)

Method                       MAE Target  MAE Reprod  MAE Recon  RMSE Target  PSNR Target
DSen2-CR on CARL             0.0290      0.0204      0.0266     0.0366       28.7
DSen2-CR on L_T              0.0270      0.0398      0.0266     0.0343       29.3
DSen2-CR on CARL w/o SAR     0.0306      0.0188      0.0282     0.0387       27.6
DSen2-CR on L_T w/o SAR      0.0284      0.0389      0.0281     0.0361       28.8
pix2pix                      0.0292      0.0210      0.0274     0.0424       28.2

(b) Test results on spectral and structural fidelity metrics (SAM in degrees, SSIM unitless)

Method                       SAM Target  SAM Reprod  SAM Recon  SSIM Target
DSen2-CR on CARL             8.15        3.94        8.04       0.875
DSen2-CR on L_T              8.07        6.33        8.13       0.878
DSen2-CR on CARL w/o SAR     8.98        3.86        8.97       0.870
DSen2-CR on L_T w/o SAR      8.97        6.17        9.05       0.873
pix2pix                      13.68       13.93       12.67      0.844

Fig. 7. Example images showing the influence of the CARL loss on two agricultural scenes. (a,c) are the input images. (b,d) are the target images. (e,g) are the predictions obtained by training the DSen2-CR model on the plain L_T, and (i,k) are the predictions obtained using CARL. (f,h) and (j,l) are the respective reproduction error maps in units of top-of-atmosphere radiance. The areas within the cloud and cloud-shadow mask (CSM) are depicted in black.
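To make Eq. (3) concrete, the following is a minimal sketch of the CARL loss written with TensorFlow operations. How the cloud and shadow mask and the cloudy input image are routed to the loss during training (e.g. packed alongside the target tensor) is an implementation choice that the text does not prescribe, so the function signature below is purely illustrative and not the authors' reference implementation.

import tensorflow as tf

def carl_loss(y_pred, target, cloudy_input, csm, lam=1.0):
    # Cloud-Adaptive Regularized Loss, cf. Eq. (3). `csm` is the binary cloud and
    # cloud-shadow mask (1 = cloud/shadow, 0 = clear), broadcastable over the channels.
    cloud_adaptive = csm * tf.abs(y_pred - target) + (1.0 - csm) * tf.abs(y_pred - cloudy_input)
    target_reg = tf.abs(y_pred - target)
    # reduce_mean implements the 1/N_tot normalization over all pixels and channels
    return tf.reduce_mean(cloud_adaptive) + lam * tf.reduce_mean(target_reg)

Since the text reports λ = 1 as the best trade-off, the default value of `lam` is set accordingly.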
The authors have found that a methodically similar context-aware
loss was proposed in Li et al. (2019) in a more generic image processing
context. The novelty of the described CARL approach still resides in
how a cloud and cloud-shadow mask is created and used in the context
of cloud removal, with the specific intent of guiding and improving the
reconstruction performance.
For the CSM mask implementation, which is needed during training,
a combination of the methods proposed in Schmitt et al. (2019a) (cloud
detection) and in Zhai et al. (2018) (cloud-shadow detection) was used.
Fig. 5 shows the flowchart of the different processing steps for the mask
creation. The threshold T_CL = 0.2 for the cloud binarization was selected after a visual evaluation. The thresholds for the shadow detection were computed using the parameters T_CSI = 3/4 and T_WBI = 5/6. The threshold
values were chosen in a conservative manner to reduce false negative
detections. We refer to the original papers for further details on the
algorithm implementations.
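The cloud and shadow detectors themselves are taken from the cited works and are not reproduced here; the following sketch only illustrates, under stated assumptions, how a cloud probability map could be binarized with T_CL = 0.2 and combined with a binary shadow mask into the CSM used during training. Function and variable names are hypothetical.

import numpy as np

def build_csm(cloud_probability, shadow_mask, t_cl=0.2):
    # Binarize a cloud probability map (assumed in [0, 1], e.g. from the detector of
    # Schmitt et al., 2019a) with the threshold T_CL and take the union with a binary
    # shadow mask (e.g. from Zhai et al., 2018).
    cloud_mask = cloud_probability > t_cl
    return np.logical_or(cloud_mask, shadow_mask.astype(bool)).astype(np.float32)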
3.5. Preprocessing and training setup

Prior to ingestion into the network, the images are value-clipped to eliminate small numbers of anomalous pixels. The clipping range for the Sentinel-2 bands is [0, 10,000]; for the Sentinel-1 VV and VH polarizations it is [−25, 0] and [−32.5, 0], respectively. For the Sentinel-2 data, a division by 2000 is further applied to all bands to ensure numerical stability (Lanaras et al., 2018). Similarly, the Sentinel-1 values are shifted into the positive domain and scaled to the range [0, 2] to approximately match the distribution of the optical data values after scaling. As a data augmentation step, random rotations and flips are applied to the images before ingestion.

The training framework has been implemented in the Keras open-source deep learning Python library with TensorFlow (Abadi et al., 2016) as backend, based on the code from Lanaras et al. (2018). The models were trained on an NVIDIA DGX-1 machine containing 8 P100 GPUs.

The weights of the network have been initialized using a uniform He distribution (He et al., 2015), and the biases were initialized to zero. Several tests with common optimizers showed that the Adam algorithm with integrated Nesterov momentum (Dozat, 2015) delivers the best performance. After a systematic search, the optimal learning rate has been found to be 7·10⁻⁵ for a batch size of 16.
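A minimal NumPy sketch of the value clipping and scaling described above is given below; the exact preprocessing code of the authors is not part of the text, so this is only an illustration of the stated ranges.

import numpy as np

def preprocess_s2(s2):
    # Clip Sentinel-2 L1C values to [0, 10000] and divide by 2000 for numerical stability.
    return np.clip(s2, 0, 10000) / 2000.0

def preprocess_s1(s1_db):
    # Clip the Sentinel-1 dB backscatter (channel 0: VV in [-25, 0], channel 1: VH in
    # [-32.5, 0]), shift into the positive domain and scale each channel to [0, 2].
    vv = (np.clip(s1_db[..., 0], -25.0, 0.0) + 25.0) * (2.0 / 25.0)
    vh = (np.clip(s1_db[..., 1], -32.5, 0.0) + 32.5) * (2.0 / 32.5)
    return np.stack([vv, vh], axis=-1)

# Training configuration stated above: Adam with Nesterov momentum, learning rate 7e-5,
# batch size 16, e.g. tf.keras.optimizers.Nadam(learning_rate=7e-5).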
4. Experiments & results

For a quantitative evaluation, we report in the following the error metrics obtained by evaluating the results on the entire hold-out test dataset for different network configurations. The metrics used are the mean absolute error (MAE) and the root-mean-square error (RMSE) in units of top-of-atmosphere reflectance (ρ_TOA), the peak signal-to-noise ratio (PSNR) in decibels, the spectral angle mapper (SAM) (Kruse et al., 1993) in degrees, and the unitless structural similarity index (SSIM) (Wang et al., 2004). The MAE, RMSE, and PSNR are popular evaluation metrics for pixel-wise reconstruction quality. The SAM gives a measure of the spectral fidelity of the reconstructed images, while the SSIM assesses spatial structure quality based on visual perception principles.
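For reference, the pixel-wise and spectral metrics named above can be computed with their standard definitions, e.g. as in the following NumPy sketch; this is not the authors' evaluation code.

import numpy as np

def mae(pred, target):
    return np.mean(np.abs(pred - target))

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

def psnr(pred, target, data_range=1.0):
    # PSNR in dB for data scaled to [0, data_range].
    mse = np.mean((pred - target) ** 2)
    return 20.0 * np.log10(data_range) - 10.0 * np.log10(mse)

def sam(pred, target, eps=1e-8):
    # Mean spectral angle mapper in degrees for (H, W, C) reflectance cubes.
    dot = np.sum(pred * target, axis=-1)
    norm = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1) + eps
    return np.degrees(np.arccos(np.clip(dot / norm, -1.0, 1.0))).mean()

# SSIM is usually taken from an existing implementation, e.g.
# skimage.metrics.structural_similarity.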
Fig. 8. Example images comparing the cloud removal results of our model with the pix2pix baseline network, both models receiving cloudy optical and SAR data as input. (a,f,k) show the cloudy input images, (b,g,l) the input auxiliary SAR images, (c,h,m) the target cloud-free images. (d,i,n) are the predictions of our DSen2-CR model, and (e,j,o) are the predictions of the pix2pix baseline. The results show that our model achieves higher-fidelity results, removes cloud shadows better and is less prone to artifacts.
4.1. Influence of SAR-optical data fusion
Several experiments were dedicated to verify the usefulness of the
SAR-optical data fusion setup used in DSen2-CR. For this, we performed
a full network training with and without including the SAR auxiliary
input. In Fig. 6, example results obtained on the hold-out test dataset
are visually compared. For better comparability, both networks were
trained using the plain L1 loss T . It can clearly be seen that the results
which make use of SAR-optical data fusion contain much more structure than the results relying on pure optical-to-optical image translation.
Especially large structures that have regular shapes and a distinctive
appearance in the SAR image, e.g. the large fields in the agricultural
example scene, are correctly included in the predicted image. Complex
objects, e.g. in cityscapes, are harder to integrate due to their more
complicated patterns. Here, the model is able to reconstruct the scene
only on a coarse scale. For example, the urban example area, with the
core town and the river entering from the south, is at least roughly
recognizable in the predicted image generated using the SAR information, whereas it is not reconstructed at all if no SAR data is used.
Considerations about the effectiveness of the SAR input can also be
made by evaluating the test results reported in Table 1. Here, results
from experimental training runs without SAR are provided alongside
the full configurations. Comparing the numbers, the network with the
SAR input scores better results for most evaluated metrics. Interestingly, however, the networks without SAR achieve lower MAE and SAM
reproduction errors. This indicates that the network partly integrates SAR information also when reproducing cloud-free regions of the input image. Since such artifacts do not have a correspondence in the original optical image, this leads to a higher reproduction error (for MAE and SAM, respectively, 2% and 3% using L_T, and 9% and 2% using CARL). However, the benefit in terms of reconstruction error (approx. 6% for MAE and 11% for SAM for both losses) outbalances this problem, making the SAR-optical data fusion concept beneficial for the overall cloud removal task.

This also becomes clear from a qualitative analysis of the produced images. In Fig. 6, exemplary detail patches under thick cloud cover are presented. By comparing the predicted images with and without the SAR prior, the gain in structural content provided by the SAR fusion is clear.
4.2. Influence of the cloud-adaptive regularized loss
One of the main contributions of this work is the design of the so-called cloud-adaptive regularized loss CARL. This custom loss is cloud- and shadow-aware and introduces an optimization w.r.t. the input image, in order to retain as much information as possible from the uncorrupted input regions. To assess the effectiveness of the proposed loss, we compare the predictions of DSen2-CR models trained on CARL to models trained only on the plain L_T. Fig. 7 shows example images from the test dataset containing two different agricultural landscapes subject to substantial surface changes between the input and the target images. By comparing the RGB composites of the results obtained using L_T and CARL with the input and the target images, it becomes clear how the network optimized on CARL is able to optimally retain input information and limit the artifact generation in the predicted images. In the left image series, for example, the blooming rapeseed fields captured in the input image are kept in a bright yellow
color by CARL, while being changed to green by L_T.

The shown error maps are the pixel-wise mean absolute error between the predicted image and the cloud-free parts of the input image. In the following, we call this measure the reproduction error, i.e. the error introduced by the network while reproducing the already cloud-free parts of the input image in the prediction. A low reproduction error indicates an optimal retainment of useful input information. Moreover, it signifies low artifact generation caused by the training on multitemporal images with differing ground conditions.

Observing the reproduction error maps shown in the figure, the influence of the adaptive loss is evident, with predictions from L_T showing much higher reproduction errors in the clear-sky pixels. An evaluation of the final test results in Table 1 shows that the model trained on the CARL loss achieves 49% less MAE reproduction error and 38% less SAM reproduction error w.r.t. the network optimized on L_T. The reconstruction errors between the two models are comparable, showing that CARL does not negatively affect the cloud reconstruction performance of the network while optimizing the information retainment capabilities. Considering these observations, we conclude that the usage of CARL in the optimization process is beneficial for the cloud removal task. This is particularly true for agricultural areas, which exhibit phenological changes even within the limited time span between the acquisition of the cloud-affected image and the acquisition of the cloud-free target image. It may be noted, however, that using L_T naturally leads to better results in target-only metrics (here RMSE, PSNR and SSIM), since the optimization and the evaluation are performed on the same objective. This, however, does not necessarily signify an improvement in the overall cloud removal performance, due to the artifact generation in cloud-free parts as discussed above.

Fig. 9. Example results from the final setup of DSen2-CR using the CARL loss. (a,d,g,j) are the input cloudy images, (b,e,h,k) the predicted images, and (c,f,i,l) the target images.

Fig. 10. Left column: channel-wise normalized root-mean-square error (nRMSE) in units of percentage for each image shown in Fig. 9. The normalization was performed using the value range of each band. Right column: pixel spectra of the central pixel in the respective input, predicted, and target images. The point markers denote the band resolution: circles for 10 m, triangles for 20 m, and squares for 60 m resolution. (a) additionally contains labels for each band following the Sentinel-2 band naming convention.
4.3. Comparison against baseline model

In order to compare our model against a standard baseline, we utilized the popular pix2pix architecture (Isola et al., 2017), which has also been adopted in previous studies on cloud removal (Grohnfeldt et al., 2018; Bermudez et al., 2018). The architecture of our baseline consists of a U-net (Ronneberger et al., 2015) generator and a PatchGAN discriminator (Karacan et al., 2016). The generator takes 13-channel multi-spectral optical and dual-polarimetric SAR patches as input, both of size 256 × 256 pixels. The discriminator takes as input a concatenation of dual-polarimetric SAR patches, the 13-channel multi-spectral cloudy patches and the real or generated cloud-free patches. SAR patches are clipped to values [−25, 0] and rescaled to the range [−1, 1]. Optical patches are clipped to values [0, 10,000] and rescaled to the range [−1, 1].

The network weights are initialized with a normal initialization and the biases are set to zero. The network is trained on the complete training set via ADAM (Karacan et al., 2016) (momentum 0.5) for a total of 10 epochs with the original GAN loss (Isola et al., 2017) and an L1 loss, weighted with λ_GAN = 1 and λ_L1 = 100 as in the original study (Goodfellow et al., 2014). Batch normalization (Ioffe and Szegedy, 2015) is applied to the generator. The initial N_iter_init = 5 epochs are trained at a learning rate of lr_init = 2·10⁻⁴, followed by N_iter_decay = 5 epochs during which the learning rate decays from lr_init by the multiplicative factor λ_decay = 1.0 − max(0, 2 + epoch − N_iter_init)/(N_iter_decay + 1), where epoch denotes the number of the current epoch. Both the quantitative results presented in Table 1 and the example images shown in Fig. 8 illustrate the superiority of our DSen2-CR approach – especially in terms of spectral and structural fidelity.
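A small sketch of the learning-rate schedule described above is given below; how epochs are indexed (0- or 1-based) is not stated in the text, so the example is only indicative.

def pix2pix_learning_rate(epoch, lr_init=2e-4, n_iter_init=5, n_iter_decay=5):
    # Constant learning rate for the first epochs, then a linear decay driven by the
    # multiplicative factor 1.0 - max(0, 2 + epoch - n_iter_init) / (n_iter_decay + 1);
    # the factor is 1 during the initial epochs and then decreases linearly towards 0.
    factor = 1.0 - max(0, 2 + epoch - n_iter_init) / (n_iter_decay + 1)
    return lr_init * factor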
Fig. 11. Average of channel-wise nRMSE over all test images.
4.4. Application of the full model on large scenes
For a qualitative evaluation of the operational performance of the
full DSen2-CR model trained on the CARL loss including the SAR input,
Fig. 9 shows a selection of large reconstructed scenes, i.e. images larger
than the 256 × 256-pixel patches the model was trained and validated on. These scenes were concatenated from patches belonging to the hold-out test dataset. To assess the reconstruction performance in all optical
channels, Fig. 10 shows the normalized root-mean-square errors
(nRMSE) averaged over each optical channel for the pictures shown in
Fig. 9. The normalized representation was chosen for better
interpretability, since the absolute RMSE spectra have been observed to
correlate with the reflectance spectra. Additionally, in Fig. 10 we also
show spectra of the central pixel of each image. To assess the overall
band-wise reconstruction quality, averages over all test images of each
band-wise normalized RMSE are shown in Fig. 11. It can be seen that
the channels, which experience the overall worst reconstruction
quality, are B10, followed by B9, and B1 – all of which observe the
atmosphere rather than the land surface (see Fig. 12).

Fig. 12. 60-m resolution channels (B1, B9, B10) for the second image in Fig. 9. Left column: input image. Central column: prediction. Right column: target image.
Therefore, the performance of the model in reconstructing ground
information even below large and thick clouds can still be appreciated
on a large scale. The central pixel of the last image (Fig. 10h) is a cloudy
pixel, which can be recognized by the high reflectance values of the
input image. Here it can be seen how the model successfully reconstructs the entire cloud-free pixel spectrum. For the third image
(Fig. 10f) the reconstructed spectrum is also very close to the target,
while for the first two images (Figs. 10d and 10b) the reconstruction lies
between input and target, either due to prediction inaccuracy or due to
the partial retainment of input information induced by the CARL loss.
5. Discussion
As the results summarized in Section 4 show, the DSen2-CR network
is generally capable of removing clouds from Sentinel-2 imagery. This is
not limited to a purely visual RGB representation of the declouded input
image, but includes the reconstruction of the whole pixel spectrum with
an average normalized RMSE between 2% and 20%, depending on the
band. It should be noted, however, that the worst reconstruction results
are achieved for the 60 m-bands, which are not meant to observe the
surface of the Earth, but rather the atmosphere: B10, which shows the
worst normalized RMSE values, is dedicated to a measurement of Cirrus
clouds with a short-wave infrared wavelength; B9 is dedicated to
measuring water vapor, and B1 is supposed to deliver information
about coastal aerosols (cf. Fig. 11). Since the SAR auxiliary image uses
a C-band signal with much longer wavelength, it is not affected by those
atmospheric parameters at all and just provides information about the
geometrical structure of the Earth surface. This, of course, distorts the
reconstruction of the atmosphere-related Sentinel-2 bands, as can be
seen in Fig. 11. However, most classical Earth observation tasks, which
benefit from a cloud-removal pre-processing step, do not employ those
bands anyway and restrict their analyses to the 10 m- and 20 m-bands,
which provide actual measurements of the Earth surface. Thus, the
inclusion of the SAR auxiliary image can definitely be deemed helpful,
which is also confirmed by the numerical results listed in Table 1 and
the qualitative examples shown in Fig. 6: The overall best result with
respect to pure numbers is achieved when the classic loss L_T and SAR-optical data fusion are used. The new cloud-adaptive loss CARL,
however, leads to a much better retainment of the original input and
introduces fewer image translation artifacts, which are usually caused by
training on images with a temporal offset. In summary, the combination
of SAR-optical data fusion and the cloud-adaptive loss CARL provides
the results that generalize best to different situations and also provide
reliable cloud-removal for both rather thick clouds and vegetated areas
which exhibit phenological changes. In the worst case, i.e. when the
scene is comprised of complex patterns and the cloud cover is optically
very thick, the network fails to provide a detailed and fully accurate
reconstruction (c.f. the urban example in Fig. 6). It has to be stressed
again, however, that the dataset used for training of the DSen2-CR
model is globally sampled, which means that the network needs to learn
a highly complex mapping from SAR to optical imagery for virtually
every land cover type existing. By restricting the dataset or fine-tuning
the model to a specific region or land cover type, it is expected that the
SAR-to-optical translation results would improve significantly.

6. Summary and conclusion

In this paper, we have presented a deep residual neural network for cloud removal in single-temporal Sentinel-2 satellite imagery. The main features of the proposed approach are threefold: First, we have incorporated a data fusion strategy into the cloud removal process in order to provide further information about the surface characteristics of the target scene based on Sentinel-1 SAR imagery. Second, we have proposed a cloud-adaptive loss to circumvent the problem that cloud-affected and cloud-free training images can never be acquired at the same time. Finally, we have trained our model on a dataset sampled across the globe and over all meteorological seasons. Based on a deterministic split of training and test data, our experiments confirm the generic applicability of the final cloud-removal model. Both qualitative and quantitative results show that both the SAR-optical data fusion component and the cloud-adaptive training loss help significantly to predict reasonable cloud-free image content. In many cases, the pixel spectra are also improved. Due to the free availability of both Sentinel-2 and Sentinel-1 satellite imagery for all regions of the Earth, it is expected that the presented cloud-removal approach will be beneficial for a more temporally seamless monitoring of our environment.
Declaration of Competing Interest

None.
Acknowledgments
This work was partially supported by the Federal Ministry for
Economic Affairs and Energy of Germany in the project “AI4Sentinels –
Deep Learning for the Enrichment of Sentinel Satellite Imagery” (FKZ
50EE1910). The work of X. Zhu is jointly supported by the European
Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. [ERC-2016StG-714087], Acronym: So2Sat), Helmholtz Artificial Intelligence
Cooperation Unit (HAICU) - Local Unit “Munich Unit @Aeronautics,
Space and Transport (MASTr)” and Helmholtz Excellent Professorship
“Data Science in Earth Observation - Big Data Fusion for Urban
Research”.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis,
A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M.,
Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R.,
Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I.,
Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden,
P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X., 2016. TensorFlow: Large-Scale
Machine Learning on Heterogeneous Distributed Systems, CoRR abs/1603.0.
Bermudez, J.D., Happ, P.N., Oliveira, D.A.B., Feitosa, R.Q., 2018. SAR to optical image
synthesis for cloud removal with generative adversarial networks. ISPRS Annals
Photogram., Remote Sens. Spatial Inform. Sci., IV-1, 2018, pp. 5–11.
Bermudez, J.D., Happ, P.N., Feitosa, R.Q., Oliveira, D.A.B., 2019. Synthesis of
Multispectral Optical Images From SAR/Optical Multitemporal Data Using
Conditional Generative Adversarial Networks. IEEE Geosci. Remote Sens. Lett. 16,
1220–1224.
Cheng, Q., Shen, H., Zhang, L., Yuan, Q., Zeng, C., 2014. Cloud removal for remotely
sensed images by similar pixel replacement guided with a spatio-temporal MRF
model. ISPRS J. Photogram. Remote Sens. 92, 54–68.
Desnos, Y., Borgeaud, M., Doherty, M., Rast, M., Liebig, V., 2014. The European Space
Agency’s Earth observation program. IEEE Geosci. Remote Sens. Magaz. 2, 37–46.
Dozat, T., 2015. Incorporating Nesterov momentum into Adam, Technical Report.
Stanford University.
Drusch, M., Del Bello, U., Carlier, S., Colin, O., Fernandez, V., Gascon, F., Hoersch, B.,
Isola, C., Laberinti, P., Martimort, P., Meygret, A., Spoto, F., Sy, O., Marchese, F.,
Bargellini, P., 2012. Sentinel-2: ESA’s Optical High-Resolution Mission for GMES
Operational Services. Remote Sens. Environ. 120, 25–36.
Eckardt, R., Berger, C., Thiel, C., Schmullius, C., 2013. Removal of optically thick clouds
from multi-spectral satellite images using multi-frequency SAR data. Remote Sens. 5,
2973–3006.
Enomoto, K., Sakurada, K., Wang, W., Fukui, H., Matsuoka, M., Nakamura, R.,
Kawaguchi, N., 2017. Filmy cloud removal on satellite imagery with multispectral
conditional Generative Adversarial Nets. In: 2017 IEEE Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW), volume 14, IEEE, 2017, pp.
1533–1541.
Fuentes Reyes, M., Auer, S., Merkle, N., Schmitt, M., 2019. SAR-to-optical image translation based on conditional generative adversarial networks – optimization, opportunities and limits. Remote Sens. 11, 2067.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in neural information
processing systems, pp. 2672–2680.
Grohnfeldt, C., Schmitt, M., Zhu, X., 2018. A conditional Generative Adversarial Network
to fuse SAR and multispectral optical data for cloud removal from Sentinel-2 Images.
In: IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing
Symposium. IEEE, pp. 1726–1729.
He, W., Yokoya, N., 2018. Multi-Temporal Sentinel-1 and -2 Data Fusion for Optical
Image Simulation. ISPRS Int. J. Geo-Inform. 7, 389.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, CoRR abs/1502.0.
He, K., Zhang, X., Ren, S., Sun, J., 2016a. Deep residual learning for image recognition. In:
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.
770–778.
He, K., Zhang, X., Ren, S., Sun, J., 2016b. Identity Mappings in Deep Residual Networks,
CoRR abs/1603.0.
Hu, G., Li, X., Liang, D., 2015. Thin cloud removal from remote sensing images using
multidirectional dual tree complex wavelet transform and transfer least square support vector regression. J. Appl. Remote Sens. 9, 095053.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by
reducing internal covariate shift, arXiv preprint arXiv:1502.03167.
Isola, P., Zhu, J.-Y., Zhou, T., Efros, A.A., 2017. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 1125–1134.
Ji, T.-Y., Yokoya, N., Zhu, X.X., Huang, T.-Z., 2018. Nonlocal tensor completion for
multitemporal remotely sensed images’ inpainting. IEEE Trans. Geosci. Remote Sens.
56, 3047–3061.
Karacan, L., Akata, Z., Erdem, A., Erdem, E., 2016. Learning to generate images of outdoor scenes from attributes and semantic layouts, arXiv preprint arXiv:1612.00215
(2016).
King, M.D., Platnick, S., Menzel, W.P., Ackerman, S.A., Hubanks, P.A., 2013. Spatial and
temporal distribution of clouds observed by MODIS onboard the Terra and Aqua
satellites. IEEE Trans. Geosci. Remote Sens. 51, 3826–3852.
Kruse, F., Lefkoff, A., Boardman, J., Heidebrecht, K., Shapiro, A., Barloon, P., Goetz, A.,
1993. The spectral image processing system (SIPS)—interactive visualization and
analysis of imaging spectrometer data. Remote Sens. Environ. 44, 145–163.
Lanaras, C., Bioucas-Dias, J., Galliani, S., Baltsavias, E., Schindler, K., 2018. Super-resolution of Sentinel-2 images: Learning a globally applicable deep neural network.
ISPRS J. Photogram. Remote Sens. 146, 305–319.
Li, Xinghua, Shen, Huanfeng, Zhang, Liangpei, Zhang, Hongyan, Yuan, Qiangqiang, Yang,
Gang, 2014. Recovering quantitative remote sensing products contaminated by thick
clouds and shadows using multitemporal dictionary learning. IEEE Trans. Geosci.
Remote Sens. 52, 7086–7098.
Li, X., Shen, H., Zhang, L., Li, H., 2015. Sparse-based reconstruction of missing information in remote sensing images from spectral/temporal complementary information. ISPRS J. Photogram. Remote Sens. 106, 1–15.
Li, X., Wang, L., Cheng, Q., Wu, P., Gan, W., Fang, L., 2019. Cloud removal in remote
sensing images using nonnegative matrix factorization and error correction. ISPRS J.
Photogram. Remote Sens. 148, 103–113.
Li, H., Li, G., Lin, L., Yu, H., Yu, Y., 2019. Context-aware semantic inpainting. IEEE Trans.
Cybernet. 49, 4398–4411.
Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M., 2017. Enhanced deep residual networks for single image super-resolution. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, pp. 1132–1140.
Lin, C.-H., Tsai, P.-H., Lai, K.-H., Chen, J.-Y., 2013. Cloud removal from multitemporal satellite images using information cloning. IEEE Trans. Geosci. Remote Sens. 51, 232–241.
Lv, H., Wang, Y., Shen, Y., 2016. An empirical and radiative transfer model based algorithm to remove thin clouds in visible bands. Remote Sens. Environ. 179, 183–195.
Meng, F., Yang, X., Zhou, C., Li, Z., 2017. A sparse dictionary learning-based adaptive patch inpainting method for thick clouds removal from high-spatial resolution remote sensing imagery. Sensors 17, 2130.
Mescheder, L., Geiger, A., Nowozin, S., 2018. Which training methods for GANs do actually converge?, CoRR abs/1801.0.
Mirza, M., Osindero, S., 2014. Conditional Generative Adversarial Nets, CoRR abs/1411.1.
Ramoino, F., Tutunaru, F., Pera, F., Arino, O., 2017. Ten-meter Sentinel-2A cloud-free composite—Southern Africa 2016. Remote Sens. 9, 652.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 234–241.
Schmitt, M., Hughes, L.H., Qiu, C., Zhu, X.X., 2019a. Aggregating cloud-free Sentinel-2 images with Google Earth Engine. In: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, volume IV-2/W7, pp. 145–152.
Schmitt, M., Hughes, L.H., Qiu, C., Zhu, X.X., 2019b. SEN12MS – a curated dataset of georeferenced multi-spectral Sentinel-1/2 imagery for deep learning and data fusion. In: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, volume IV-2/W7, pp. 153–160.
Shen, H., Li, X., Cheng, Q., Zeng, C., Yang, G., Li, H., Zhang, L., 2015. Missing information reconstruction of remote sensing data: a technical review. IEEE Geosci. Remote Sens. Magaz. 3, 61–85.
Singh, P., Komodakis, N., 2018. Cloud-Gan: cloud removal for Sentinel-2 imagery using a cyclic consistent Generative Adversarial Network. In: IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 1772–1775.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), pp. 4278–4284.
Torres, R., Snoeij, P., Geudtner, D., Bibby, D., Davidson, M., Attema, E., Potin, P., Rommen, B., Floury, N., Brown, M., Traver, I.N., Deghaye, P., Duesmann, B., Rosich, B., Miranda, N., Bruno, C., L'Abbate, M., Croci, R., Pietropaolo, A., Huchler, M., Rostan, F., 2012. GMES Sentinel-1 mission. Remote Sens. Environ. 120, 9–24.
Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E., 2004. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612.
Xu, M., Pickering, M., Plaza, A.J., Jia, X., 2016. Thin cloud removal based on signal transmission principles and spectral mixture analysis. IEEE Trans. Geosci. Remote Sens. 54, 1659–1669.
Xu, M., Jia, X., Pickering, M., Jia, S., 2019. Thin cloud removal from optical remote sensing images using the noise-adjusted principal components transform. ISPRS J. Photogram. Remote Sens. 149, 215–225.
Zhai, H., Zhang, H., Zhang, L., Li, P., 2018. Cloud/shadow detection based on spectral indices for multi/hyperspectral optical remote sensing imagery. ISPRS J. Photogram. Remote Sens. 144, 235–253.
Zhang, Q., Yuan, Q., Zeng, C., Li, X., Wei, Y., 2018. Missing data reconstruction in remote sensing image with a unified spatial–temporal–spectral deep convolutional neural network. IEEE Trans. Geosci. Remote Sens. 56, 4274–4288.