---
title: "Big Data analytics using the rnrfa package"
author: "Claudia Vitolo"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
                      message = FALSE,
                      cache = FALSE,
                      warning = FALSE)
```
In the last few years, the UK Met Office has reported "unusual warmth and lack of rainfall during March and April, particularly over England and Wales" ([Met Office, Spring 2011](http://www.metoffice.gov.uk/climate/uk/interesting/2011_spring)). Dry springs can affect water resources, because below-average river flow translates, for instance, into reduced availability of drinking water. In this application we present a big data analytics experiment in which we try to understand whether there is any evidence, in the NRFA data, that springs in the UK are becoming drier, both in terms of rainfall and river flow. The experiment consists of retrieving all the available rainfall and flow time series and finding out, for each station, whether there is an increasing or decreasing trend.
We use a cluster to download and analyse all the time series available from the NRFA stations with more than 10 years of recordings. The time series are first downloaded, then summarised as annual averages over the spring period. Seasonal averages are calculated using the function `seasonal_averages()`, which takes as input a time series and a period of interest (e.g. spring) and calculates the related annual average. Using a deliberately simplistic approach, a linear model is fitted to the annual averages and the slope coefficient is used to estimate the trend: negative slopes correspond to decreasing flow/rainfall over time, while positive slopes correspond to increasing flow/rainfall. Once the fitted slope is calculated for each station, the results are plotted using the function `plot_trend()`.
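The trend-estimation step can be sketched in a few lines of base R. This is a toy example on synthetic annual spring means (not NRFA data): it assumes only what is described above, namely that the slope and p-value of a simple linear fit summarise the trend for one station.

```r
# Toy sketch of the per-station trend test described above, on synthetic data.
# The real analysis obtains the annual means via seasonal_averages().
set.seed(42)
years <- 1980:2010
# Synthetic annual spring means with an imposed decline of 0.5 units/year
annual_means <- 100 - 0.5 * (years - 1980) + rnorm(length(years), sd = 2)

fit <- lm(annual_means ~ years)
slope  <- coef(fit)[["years"]]                   # negative => drying trend
pvalue <- summary(fit)$coefficients["years", 4]  # significance of the slope
round(c(Slope = slope, pvalue = pvalue), 4)
```

A station would then be kept for mapping only when `pvalue <= 0.05`, mirroring the filter applied in the chunk below.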
```{r echo=FALSE, message=FALSE}
system.time({
  # install.packages(c("dplyr", "lubridate", "ggplot2", "parallel", "devtools"))
  # devtools::install_github("cvitolo/r_rnrfa", subdir = "rnrfa")
  library(rnrfa)
  library(dplyr)
  library(lubridate)
  library(ggplot2)
  library(parallel)
  library(gridExtra)

  # Make cluster
  cl <- makeCluster(getOption("cl.cores", detectCores() - 1))

  # RAINFALL TRENDS ##########################################################
  # Get site metadata with region columns
  data("stationSummary")
  stationsR <- stationSummary # stations 478-855 return an error

  # Get unique station ids
  ids <- as.character(stationsR$id)

  # Get all rainfall data
  allCMR <- cmr(id = ids, metadata = FALSE, cl = cl, verbose = FALSE)
  names(allCMR) <- ids

  # Remove empties
  torm <- unlist(lapply(allCMR, is.null))
  allCMR2 <- allCMR[!torm]

  # Remove errors
  allCMR3 <- allCMR2[!sapply(allCMR2, function(x) class(x)[1] == "try-error")]

  # Extract data between 21st March and 20th June and calculate averages
  springCMR <- parLapply(cl = cl, X = allCMR3, fun = seasonal_averages,
                         season = "Spring")
  slopeIDX <- seq(1, length(unlist(springCMR)), 2)
  pvalueIDX <- seq(2, length(unlist(springCMR)), 2)
  springCMRdf <- data.frame("id" = names(springCMR),
                            "Slope" = unlist(springCMR)[slopeIDX],
                            "pvalue" = unlist(springCMR)[pvalueIDX])

  # Merge station information with calculated trends
  stationsR <- stationsR %>% left_join(springCMRdf)

  # Keep only stations for which there is a significant trend
  rows2keep <- which(stationsR$pvalue <= 0.05 &
                     !is.na(stationsR$Slope))
  st <- stationsR[rows2keep, ]

  # Plot figure 7
  trendPlots <- plot_trend(df = st, 'Region')
  ggsave("BigDataCMR.png",
         gridExtra::arrangeGrob(trendPlots$A, trendPlots$B, ncol = 2),
         height = 5, width = 10, units = 'in', dpi = 600)

  # FLOW TRENDS ##############################################################
  # Get site metadata with region columns
  stationsF <- stationSummary # stations 478-855 return an error

  # Get unique station ids
  ids <- as.character(stationsF$id)

  # Get all flow data
  allGDF <- gdf(id = ids, metadata = FALSE, cl = cl, verbose = FALSE)
  names(allGDF) <- ids

  # Remove empties
  torm <- unlist(lapply(allGDF, is.null))
  allGDF2 <- allGDF[!torm]

  # Remove errors
  allGDF3 <- allGDF2[!sapply(allGDF2, function(x) class(x)[1] == "try-error")]

  # Extract data between 21st March and 20th June and calculate averages
  springGDF <- parLapply(cl = cl, X = allGDF3, fun = seasonal_averages,
                         season = "Spring")
  slopeIDX <- seq(1, length(unlist(springGDF)), 2)
  pvalueIDX <- seq(2, length(unlist(springGDF)), 2)
  springGDFdf <- data.frame("id" = names(springGDF),
                            "Slope" = unlist(springGDF)[slopeIDX],
                            "pvalue" = unlist(springGDF)[pvalueIDX])

  # Merge station information with calculated trends
  stationsF <- stationsF %>% left_join(springGDFdf)

  # Keep only stations for which there is a significant trend
  rows2keep <- which(stationsF$pvalue <= 0.05 &
                     !is.na(stationsF$Slope))
  st <- stationsF[rows2keep, ]

  # Plot figure 8
  trendPlots <- plot_trend(df = st, 'Region')
  ggsave("BigDataGDF.png",
         gridExtra::arrangeGrob(trendPlots$A, trendPlots$B, ncol = 2),
         height = 5, width = 10, units = 'in', dpi = 600)

  stopCluster(cl)
})
```
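The index arithmetic used above to split the interleaved slope/p-value pairs can be illustrated in isolation. This toy example assumes, as in the chunk above, that each list element is a `(slope, p-value)` pair, so that after `unlist()` the odd positions hold slopes and the even positions hold p-values; the station ids and values are made up.

```r
# Hypothetical per-station (slope, p-value) pairs, as returned for each
# station by seasonal_averages() in the analysis above.
springList <- list(a = c(-0.31, 0.01), b = c(0.12, 0.40), c = c(-0.05, 0.03))
flat <- unlist(springList)         # interleaves: slope, pvalue, slope, ...
slopeIDX  <- seq(1, length(flat), 2)
pvalueIDX <- seq(2, length(flat), 2)
trends <- data.frame(id = names(springList),
                     Slope = unname(flat[slopeIDX]),
                     pvalue = unname(flat[pvalueIDX]))
# Keep only stations with a significant trend, as in the chunk above
trends[trends$pvalue <= 0.05 & !is.na(trends$Slope), ]
```

Stations `a` and `c` survive the significance filter; both have negative slopes, i.e. a drying trend.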