{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Arviz Data Structures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While ArviZ supports plotting from familiar datatypes, such as dictionaries and numpy arrays, there are a couple data structures central to ArviZ that are useful to know when using the library. \n",
"\n",
"They are\n",
"* xarray\n",
"* InferenceData\n",
"* netcdf\n",
"\n",
"\n",
"## Why more than one data structure?\n",
"Bayesian Inference generates numerous sets that represent different aspects of the model. For example in a single analysis a bayesian practioner could end up with any of the following data.\n",
"* Prior Distribution for N number of variables\n",
"* Posterior Distribution for N number of variables\n",
"* Prior Predictive Distribution\n",
"* Posterior Predictive Distribution\n",
"* Trace data for each of the above\n",
"* Sample statistics for each inference run\n",
"* Whatever else\n",
"\n",
"While we made an effort to use \"common\" data types such as numpy arrays or Pandas dataframes, due to the heterogenity of the data it become cumbersome and complex to try and force these data points into one homogenous data structures. To add to the complexity ArviZ must handle the data generated from multiple Bayesian Modeling libraries, such as pymc3 and pystan.\n",
"\n",
"Although seemingly more complex at a glance we believe that the usage of *xarray*, *InferenceData*, and *netcdf* will simply the handling, referencing, and serialization of data generated by MCMC runs.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## An introduction to each\n",
"To help you get familiar with each ArviZ includes some toy datasets. To start an `az.InferenceData` sample can be loaded into Python quite easily."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Inference data with groups:\n",
"\t> posterior\n",
"\t> sample_stats\n",
"\t> posterior_predictive\n",
"\t> prior\n",
"\t> observed_data"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load the centered eight schools model\n",
"import arviz as az\n",
"data = az.load_arviz_data('centered_eight')\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see in this case the `az.InferenceData` object contains both a posterior_predictive distribution, and the observed data, among other datasets. Each group in InferenceData is both an attribute on `InferenceData` and itself an `xarray.DataSet` object. "
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.Dataset>\n",
"Dimensions: (chain: 4, draw: 500, school: 8)\n",
"Coordinates:\n",
" * chain (chain) int64 0 1 2 3\n",
" * draw (draw) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...\n",
" * school (school) object 'Choate' 'Deerfield' 'Phillips Andover' ...\n",
"Data variables:\n",
" mu (chain, draw) float64 ...\n",
" theta (chain, draw, school) float64 ...\n",
" tau (chain, draw) float64 ..."
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the posterior xarray\n",
"posterior = data.posterior\n",
"posterior"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our eight schools example we can see that this particular posterior trace consists of three variables, estimated over 4 chains. In addition this model is a hierachial models where values for the variable `theta` are associated with a particular school. \n",
"\n",
"In xarray terminology Data Variable are the actual values generated from the MCMC draws, Dimensions are the axes on which we can refer to the Data Variables, and Coordinates are pointers to specific slices or points in the `xarray.DataSet`\n",
"\n",
"We can access the observed data through the same method."
]
},
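{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, we can use the `school` coordinate to select the draws of `theta` for a single school and summarize them. This is a minimal sketch; the variable and coordinate names come from the posterior shown above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select all draws of theta for one school by its coordinate label,\n",
"# then average over the sampling dimensions to get a posterior mean\n",
"posterior['theta'].sel(school='Choate').mean(dim=['chain', 'draw'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can access the observed data group in the same way."
]
},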
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.Dataset>\n",
"Dimensions: (school: 8)\n",
"Coordinates:\n",
" * school (school) object 'Choate' 'Deerfield' 'Phillips Andover' ...\n",
"Data variables:\n",
" obs (school) float64 28.0 8.0 -3.0 7.0 -1.0 1.0 18.0 12.0"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the observed xarray\n",
"observed_data = data.observed_data\n",
"observed_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It should be noted that the observed dataset contains only 8 DataVariables and shares no dimensions or coordinates with the posterior. This difference in sizes is the motivating reason behind *InferenceData*. Rather than force multiple different sized arrays into one array, or force users to manage multiple objects corresponding to different datasets, we felt that it would be easier to hold references to each xarray in an *InferenceData* object."
]
},
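{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the shape mismatch concrete, here is a small sketch that compares the dimensions of the two groups, reusing the `posterior` and `observed_data` objects from above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The posterior carries chain and draw dimensions that the observed data lacks\n",
"print(posterior.dims)\n",
"print(observed_data.dims)"
]
},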
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NetCDF\n",
"[NetCDF](https://www.unidata.ucar.edu/software/netcdf/) is a standard for referencing array oriented files. In other words while, *xarray.Dataset*s, and by extension *InferenceData*, are convenient for accessing arrays in Python memory, *NetCDF* provides a convenient mechanism for persistence of model data on disk.\n",
"\n",
"Most users will not have to concern themselves with the *netcdf* standard but for completeness it is good to make its usage transparent.\n",
"\n",
"Earlier in this tutorial we loaded loaded *InferenceData* from a *NetCDF* file"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"data = az.load_arviz_data('centered_eight')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly we can persist *InferenceData* objects in the NetCDF format"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'eight_schools_model.nc'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.to_netcdf(\"eight_schools_model.nc\")"
]
},
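{
"cell_type": "markdown",
"metadata": {},
"source": [
"The saved file can then be read back into an *InferenceData* object, which makes it easy to share or revisit an analysis. A short sketch using `az.from_netcdf`, the counterpart of `to_netcdf`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Round trip: read the persisted file back into an InferenceData object\n",
"data_from_disk = az.from_netcdf(\"eight_schools_model.nc\")\n",
"data_from_disk"
]
},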
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Additional Reading\n",
"Additional documentation and tutorials exist for xarray and netcd4. If still curious we encourage you to visit the following pages\n",
"\n",
"## xarray\n",
"* [xarray documentation](http://xarray.pydata.org/en/stable/why-xarray.html)\n",
"* [xarray lightning talk at scipy 2015](https://www.youtube.com/watch?v=X0pAhJgySxk&t=949s)\n",
"\n",
"## netcdf\n",
"* [netcd documentation](http://unidata.github.io/netcdf4-python/)\n",
"* [netcd usage in xarray](http://xarray.pydata.org/en/stable/io.html#netcdf)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}