ccarouge/Merge arrays with missing data.ipynb

## Merge arrays with missing data.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's say you have 2 datasets coming from different sources but representing the same quantity. You'd like to merge those datasets into a single one via a mean, unfortunately both datasets have missing data at different times and places. \n",
    "Accordingly, we want the merged dataset to follow these rules:\n",
    " - if both original datasets have data, we take the mean of both\n",
    " - if one dataset only has data, we take this data\n",
    " - if the data is missing in both original datasets, we keep a missing data\n",
    " \n",
    "The strategy using xarray is to open each dataset in an DataArray, concatenate both arrays on a new dimension and then average along this dimension."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xarray as xr\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First define 2 arrays of same dimensions with missing data at different places"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "aa = xr.DataArray([[0,1,2],[3,4,np.nan]],dims=('x','y'))\n",
    "bb = xr.DataArray([[5,np.nan,6],[np.nan,7,np.nan]],dims=('x','y'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<xarray.DataArray (x: 2, y: 3)>\n",
       "array([[ 0.,  1.,  2.],\n",
       "       [ 3.,  4., nan]])\n",
       "Dimensions without coordinates: x, y"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "aa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<xarray.DataArray (x: 2, y: 3)>\n",
       "array([[ 5., nan,  6.],\n",
       "       [nan,  7., nan]])\n",
       "Dimensions without coordinates: x, y"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, if we simply sum the arrays together, we do not get what we want. The missing value take precedence. That is, if any of the array has a missing value, the sum is missing. So summing and dividing by the number of arrays won't work"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<xarray.DataArray (x: 2, y: 3)>\n",
       "array([[ 5., nan,  8.],\n",
       "       [nan, 11., nan]])\n",
       "Dimensions without coordinates: x, y"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "aa+bb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the opposite, if we can do a mean, it will work as then the missing value is ignored (mean(1,nan) = 1). For this, we need to \"merge\" the arrays into a single array. For this we'll use the `xarray.concat()` method.\n",
    "\n",
    "Concatenate the arrays along a new dimension we'll call z"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "cc = xr.concat((aa,bb),'z')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<xarray.DataArray (z: 2, x: 2, y: 3)>\n",
       "array([[[ 0.,  1.,  2.],\n",
       "        [ 3.,  4., nan]],\n",
       "\n",
       "       [[ 5., nan,  6.],\n",
       "        [nan,  7., nan]]])\n",
       "Dimensions without coordinates: z, x, y"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you see above the concatenation allows us to have the 2 arrays aligned together in a new array. Now we take advantage of the fact xarray handles missing data correctly. That is, a mean will not count missing data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<xarray.DataArray (x: 2, y: 3)>\n",
       "array([[2.5, 1. , 4. ],\n",
       "       [3. , 5.5, nan]])\n",
       "Dimensions without coordinates: x, y"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cc.mean(dim='z')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Usually you would find these last 2 operations combined as you don't need to store the results of the `concat` operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<xarray.DataArray (x: 2, y: 3)>\n",
       "array([[2.5, 1. , 4. ],\n",
       "       [3. , 5.5, nan]])\n",
       "Dimensions without coordinates: x, y"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "xr.concat((aa,bb),'z').mean(dim='z')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:analysis3-18.04]",
   "language": "python",
   "name": "conda-env-analysis3-18.04-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's say you have 2 datasets coming from different sources but representing the same quantity. You'd like to merge those datasets into a single one via a mean, unfortunately both datasets have missing data at different times and places. \n",
	"Accordingly, we want the merged dataset to follow these rules:\n",
	" - if both original datasets have data, we take the mean of both\n",
	" - if one dataset only has data, we take this data\n",
	" - if the data is missing in both original datasets, we keep a missing data\n",
	" \n",
	"The strategy using xarray is to open each dataset in an DataArray, concatenate both arrays on a new dimension and then average along this dimension."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [],
	"source": [
	"import xarray as xr\n",
	"import numpy as np"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"First define 2 arrays of same dimensions with missing data at different places"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [],
	"source": [
	"aa = xr.DataArray([[0,1,2],[3,4,np.nan]],dims=('x','y'))\n",
	"bb = xr.DataArray([[5,np.nan,6],[np.nan,7,np.nan]],dims=('x','y'))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<xarray.DataArray (x: 2, y: 3)>\n",
	"array([[ 0., 1., 2.],\n",
	" [ 3., 4., nan]])\n",
	"Dimensions without coordinates: x, y"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"aa"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<xarray.DataArray (x: 2, y: 3)>\n",
	"array([[ 5., nan, 6.],\n",
	" [nan, 7., nan]])\n",
	"Dimensions without coordinates: x, y"
	]
	},
	"execution_count": 5,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"bb"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Now, if we simply sum the arrays together, we do not get what we want. The missing value take precedence. That is, if any of the array has a missing value, the sum is missing. So summing and dividing by the number of arrays won't work"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<xarray.DataArray (x: 2, y: 3)>\n",
	"array([[ 5., nan, 8.],\n",
	" [nan, 11., nan]])\n",
	"Dimensions without coordinates: x, y"
	]
	},
	"execution_count": 6,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"aa+bb"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"At the opposite, if we can do a mean, it will work as then the missing value is ignored (mean(1,nan) = 1). For this, we need to \"merge\" the arrays into a single array. For this we'll use the `xarray.concat()` method.\n",
	"\n",
	"Concatenate the arrays along a new dimension we'll call z"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [],
	"source": [
	"cc = xr.concat((aa,bb),'z')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<xarray.DataArray (z: 2, x: 2, y: 3)>\n",
	"array([[[ 0., 1., 2.],\n",
	" [ 3., 4., nan]],\n",
	"\n",
	" [[ 5., nan, 6.],\n",
	" [nan, 7., nan]]])\n",
	"Dimensions without coordinates: z, x, y"
	]
	},
	"execution_count": 8,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"cc"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"As you see above the concatenation allows us to have the 2 arrays aligned together in a new array. Now we take advantage of the fact xarray handles missing data correctly. That is, a mean will not count missing data."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<xarray.DataArray (x: 2, y: 3)>\n",
	"array([[2.5, 1. , 4. ],\n",
	" [3. , 5.5, nan]])\n",
	"Dimensions without coordinates: x, y"
	]
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"cc.mean(dim='z')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Usually you would find these last 2 operations combined as you don't need to store the results of the `concat` operation."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<xarray.DataArray (x: 2, y: 3)>\n",
	"array([[2.5, 1. , 4. ],\n",
	" [3. , 5.5, nan]])\n",
	"Dimensions without coordinates: x, y"
	]
	},
	"execution_count": 10,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"xr.concat((aa,bb),'z').mean(dim='z')"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python [conda env:analysis3-18.04]",
	"language": "python",
	"name": "conda-env-analysis3-18.04-py"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.5"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}