Skip to content

Instantly share code, notes, and snippets.

Created April 30, 2017 16:58
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save anonymous/4b3c503016192728c91bb0d8f69fce9b to your computer and use it in GitHub Desktop.
Save anonymous/4b3c503016192728c91bb0d8f69fce9b to your computer and use it in GitHub Desktop.
LaTeX file for report on data analysis
\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{graphicx}
\usepackage[lmargin=2.4cm, rmargin=2.4cm, tmargin=2.6cm, bmargin=2.4cm]{geometry}
\usepackage{subfig}
\title{Nasa SpaceApps challenge}
\author{Dr. Thomai Tsiftsi, Dr James P. Edwards, Dr Gourouni Ph.D.,\\ Det Sup. Leodaris Edwards, H Mouma Edwards}
\begin{document}
\maketitle
In this challenge we examine the Hi-SEAS solar radiation data and employ extreme value analysis to describe the probabilities of extreme values. In particular, we focus on asymptotic analysis of the distributions of particularly low total daily radiation output of the solar panels. The motivation of this analysis is to facilitate the Hi-SEAS crew planning for the possibility of extreme drops in solar panel output (due to natural variation in weather, radiation output and other factors), ensuring that sufficient power reserves are maintained to cope with extreme lows.
The analyse our data we began by estimating the total daily output by approximating the total energy generated by the cells as a discretisation of the time integral of the flux power,
\begin{equation}
x_{i} = \sum_{k = 1}^{N} \, 72 \times p_{ik} \times \frac{t_{i,k+1} - t_{i,k}}{3600} \times \mathcal{E}
\end{equation}
where $\mathcal{E} \approx 0.1$ is the estimated efficiency of the solar cells which have area $72m^{2}$. The sum is over the given data points, $p_{ik}$, representing the solar radiation flux on day $i$ at time $t_{i,k}$. In anticipation of extreme value analysis, we also blocked these data into weekly observations, $\{x_{1}, \ldots, x_{n = 7}\}$. These data are displayed below in figure (\ref{figdat})
\begin{figure}
\centering
\subfloat[][The estimated total daily energy output by Hi-SEAS polar panels for 120 days.]
{
\includegraphics[width=0.45\textwidth]{DailyEnergy.png}
}
\subfloat[][The maximum daily energy output in each week during the same period.]
{
\includegraphics[width=0.45\textwidth]{weeklyEnergy.png}
}
\caption{Daily maxima and the weekly maxima energy output.}
\label{figdat}
\end{figure}
\subsection*{Generalised extreme value theory}
Let $M_{i} = \textrm{max}\{ x_{1}, \ldots, x_{n}\}$. Then extreme value analysis states that the asymptotic distribution of $M_{n}$, if it exists, must take the form (GEV)
\begin{equation}
\mathcal{P}(M_{i} \leq z) \simeq G(z | \mu, \sigma, \xi) \equiv \exp{\left[ - \left(1 + \xi\left(\frac{z-\mu}{\sigma}\right)^{-\frac{1}{\xi}} \right) \right]},
\end{equation}
for parameters $\mu$, location, $\sigma$, scale, and shape, $\xi$. These parameters can be fitted to our data, following which $G(z)$ provides an asymptotic estimate of the tails of the distribution of the maxima of a set of data. Here we are interested in \textit{minima} so we must take our daily accumulated energies and \textit{negate} them so that minima become (negative) maxima.
Fitting our data to the model initially provided \textit{maximum likelihood\footnote{The likelihood for the data given the model is\begin{equation}-N\log(\sigma) - (1 + \frac{1}{\xi})\sum_{i=1}^{N}\left[\log{\left(1 + \xi\left(\frac{z_{i} - \mu}{\sigma}\right)\right)} - \left(1 + \xi\left(\frac{z_{i} - \mu}{\sigma}\right)\right)^{-\frac{1}{\xi}} \right]\end{equation} where $z_{i}$ are the empirical maxima. The case $\xi = 0$ is treated as the limiting form of $G(z | \mu, \sigma, \xi\rightarrow 0)$. The likelihood is maximised with respect to the parameters $\mu$, $\sigma$, $\xi$.}} estimates of the parameters
\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
& $\mu$ & $\sigma$ & $\xi$ \\
\hline
Estimate & $-2.6\times 10^{4}$ & $48.3\times 10^{3}$ & $3.9\times 10^{-2}$ \\
\hline
Std Err& $2.3\times 10^{3}$ & $2.8 \times 10^{3}$ & $4.6 \times 10^{-1}$ \\
\hline
\end{tabular}
\end{center}
However, one notes that the shape parameter has large standard errors making its estimate compatible with $0$. For this reason we re-fit the data to the subset of models with vanishing shape parameter (Gumbel distribution); the likelihood ratio test provided a $p$-value of $0.89$ on a null hypothesis of $\xi = 0$, giving sufficient evidence to accept it. This lead to a refined model with fewer parameters, whose estimates are
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
& $\mu$ & $\sigma$ \\
\hline
Estimate & $-2.6\times 10^{4}$ & $8.3 \times 10^{3}$ \\
\hline
Std Err& $2.1 \times 10^{3}$ & $1.8 \times 10^{3}$ \\
\hline
\end{tabular}
\end{center}
which are used as estimates for prediction of extreme daily energy outputs per week.
The plot in figure \ref{figGum} gives diagnostic and predictive results for the data. The probability distribution function at at $z$ gives the probability that $M_{n}$ will be within an infinitesimal interval, $dz$ of $z$. Of particular interest for Hi-SEAS is the \textbf{return level} plot (and associated confidence intervals). The $m^{th}$ return level is the value of the daily energy so \textit{low} that it is \textit{exceeded} once every $m$ observations, here translated into numbers of years. We summary various results of special interest in the table below:
\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline
& Estimate & Lower confidence interval & Upper confidence interval \\
\hline
2 years & 17.9 kJ & 23.9kJ & 22.7kJ \\
\hline
3 years & 18.1kJ & 12.2kJ & 24.0kJ \\
\hline
4 years & 15.2kJ & 8.4kJ & 22.0kJ \\
\hline
5 years & 13.1kJ& 5.6kJ & 20.6kJ \\
\hline
\end{tabular}
\end{center}
As an example, once every 4 years, we expect to see a total daily flux of $8.4$kJ or less.
\begin{figure}
\centering
\includegraphics[width=0.75\textwidth]{GEVGumbel.png}
\caption{Diagnostics for the GEV fitting procedure.}
\label{figGum}
\end{figure}
\subsection*{Threshold excesses}
Examining only the minima of the daily temperatures each week does not take full advantage of the dataset. For this reason, an alternative asymptotic analysis to extreme events is to determine the limiting distribution of $\mathcal{P}(X < u |X < u_{0})$ where $u_{0}$ is a sufficiently large parameter and $X$ represents the daily energy output. If the asymptotic limit of this distribution exists it takes the form of a generalised Pareto distribution,
\begin{equation}
H(y \equiv x - u) = 1 - \left(1 + \frac{\xi y}{\tilde{\sigma}}\right)^{-\frac{1}{\xi}},
\end{equation}
where $\tilde{\sigma} = \sigma + \xi(u - \mu)$ and we fit the parameters $\xi$ and $\tilde{\sigma}$ from the data. By empirically estimating $\zeta_{u_{0}} \equiv \mathcal{P}(X < u_{0})$ one can assign probabilities to
\begin{equation}
\mathcal{P}(X > u > u_{0}) \approx \zeta_{u_{0}}H(x - u).
\end{equation}
The threshold, $u_{0}$ must be sufficiently large that $\tilde{\sigma}$ is constant with respect to $u$. We found in our data that $u_{0} \approx 35,000$ satisfies this condition within the $95\%$ confidence interval. Our maximum likelihood analysis\footnote{If $K$ observations are above the threshold by amounts $y_{i} = x_{i} - u_{0}$ then the likelihood is easily found to be \begin{equation}-K\log(\sigma) - (1 + \frac{1}{\xi})\sum_{i = 1}^{K}\log{\left(1 + \frac{\xi y_{i}}{\tilde{\sigma}}\right)}\end{equation} with the $\xi = 0$ limit following from a limiting procedure.} yielded the following parameters
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
& $\mu$ & $\sigma$ \\
\hline
Estimate & $1.7 \times 10^{4}$ & $4.8\times 10^{-1}$ \\
\hline
Std Err& $3.2 \times 10^{3}$ & $1.7 \times 10^{-1}$ \\
\hline
\end{tabular}
\end{center}
It is an unfortunate artifact of the relatively small number of days in the dataset (120 days, approximately 17 weeks) and the rudimentary estimate of the total daily energy output that these parameters are not compatible with those of the GEV above.
Diagnostic plots provide the post-hoc justification of our analysis -- figure (\ref{figP}). In particular, the probability distribution function provides estimates of finding $X - u$ within an infinitesimal region, $dy$ of the value $y$. Return level plots give the value of the daily output that occurs once every $m$ observations, here given in years.
\begin{figure}
\centering
\includegraphics[width=0.75 \textwidth]{Paretoextremes.png}
\end{figure}
\end{document}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment