---
title: "Big Data analytics using the rnrfa package"
author: "Claudia Vitolo"
date: "23 December 2016"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
                      message = FALSE,
                      cache = FALSE,
                      warning = FALSE)
```
In recent years, the UK Met Office has reported "unusual warmth and lack of rainfall during March and April, particularly over England and Wales"^[http://www.metoffice.gov.uk/climate/uk/interesting/2011_spring]. Dry springs can affect water resources: below-average river flow translates, for instance, into reduced availability of drinking water. In this application we present a big data analytics experiment in which we investigate whether there is any evidence, in the NRFA data, that springs in the UK are becoming drier, both in terms of rainfall and river flow. The experiment consists of retrieving all the available rainfall and flow time series and finding out, for each station, whether there is an increasing or decreasing trend.
We use a cluster to download and analyse the time series of all NRFA stations with more than 10 years of recordings. The time series are first downloaded, then summarised as annual averages over the spring period. Seasonal averages are calculated with the function `seasonal_averages()`, which takes as input a time series and a period of interest (e.g. spring) and calculates the corresponding annual averages. Using a deliberately simplistic approach, a linear model is fitted to the annual averages and the slope coefficient is used to estimate the trend: negative slopes correspond to decreasing flow/rainfall, while positive slopes correspond to increasing flow/rainfall over time. Once a slope has been fitted for each station, the results are plotted with the function `plot_trend()`.
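The trend-estimation idea can be illustrated with a minimal sketch. The `annual_means` vector below is an invented toy example of spring-mean values (one per year); in the actual experiment this logic is encapsulated in `seasonal_averages()`, which returns the slope and its p-value for each station:

```r
# Illustration only: toy vector of spring-mean values, one per year
annual_means <- c(55.2, 48.7, 60.1, 44.3, 41.9, 39.5)
years <- seq_along(annual_means)

# Fit a simple linear model: the slope is the trend estimate
fit <- lm(annual_means ~ years)
slope  <- unname(coef(fit)["years"])                      # negative => drier springs
pvalue <- summary(fit)$coefficients["years", "Pr(>|t|)"]  # significance of the slope

c(Slope = slope, pvalue = pvalue)
```

A station is retained in the final maps only when the p-value of its fitted slope is below 0.05, which is the same filter applied in the code below.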
```{r echo=FALSE, message=FALSE}
system.time({
# install.packages(c("dplyr", "lubridate", "ggplot2", "gridExtra", "devtools"))
# devtools::install_github("cvitolo/r_rnrfa", subdir = "rnrfa")
library(rnrfa)
library(dplyr)
library(lubridate)
library(ggplot2)
library(parallel)
library(gridExtra)
# Make cluster
cl <- makeCluster(getOption("cl.cores", detectCores() - 1))
# RAINFALL TRENDS ############################################################
# Get site meta with region columns
data("stationSummary")
stationsR <- stationSummary # stations 478-855 return an error
# Get unique station ids
ids <- as.character(stationsR$id)
# Get all rainfall data
allCMR <- cmr(id = ids, metadata = FALSE, cl = cl, verbose = FALSE)
names(allCMR) <- ids
# remove empties
torm <- unlist(lapply(allCMR, is.null))
allCMR2 <- allCMR[!torm]
# remove errors
allCMR3 <- allCMR2[!sapply(allCMR2, function(x) class(x)[1] == "try-error")]
# Extract data between 21st March and 20th June and calculate averages
springCMR <- parLapply(cl = cl, X = allCMR3, fun = seasonal_averages,
                       season = "Spring")
slopeIDX <- seq(1, length(unlist(springCMR)), 2)
pvalueIDX <- seq(2, length(unlist(springCMR)), 2)
springCMRdf <- data.frame("id" = names(springCMR),
                          "Slope" = unlist(springCMR)[slopeIDX],
                          "pvalue" = unlist(springCMR)[pvalueIDX])
# merge station information with calculated trends
stationsR <- stationsR %>% left_join(springCMRdf, by = "id")
# Keep only stations for which there is a significant trend
rows2keep <- which(stationsR$pvalue <= 0.05 &
                     !is.na(stationsR$Slope))
st <- stationsR[rows2keep,]
# Plot figure 7
trendPlots <- plot_trend(df = st, 'Region')
ggsave("BigDataCMR.png",
       gridExtra::arrangeGrob(trendPlots$A, trendPlots$B, ncol = 2),
       height = 5, width = 10, units = 'in', dpi = 600)
# FLOW TRENDS ################################################################
# Get site meta with region columns
stationsF <- stationSummary # stations 478-855 return an error
# Get unique station ids
ids <- as.character(stationsF$id)
# Get all flow data
allGDF <- gdf(id = ids, metadata = FALSE, cl = cl, verbose = FALSE)
names(allGDF) <- ids
# remove empties
torm <- unlist(lapply(allGDF, is.null))
allGDF2 <- allGDF[!torm]
# remove errors
allGDF3 <- allGDF2[!sapply(allGDF2, function(x) class(x)[1] == "try-error")]
# Extract data between 21st March and 20th June and calculate averages
springGDF <- parLapply(cl = cl, X = allGDF3, fun = seasonal_averages,
                       season = "Spring")
slopeIDX <- seq(1, length(unlist(springGDF)), 2)
pvalueIDX <- seq(2, length(unlist(springGDF)), 2)
springGDFdf <- data.frame("id" = names(springGDF),
                          "Slope" = unlist(springGDF)[slopeIDX],
                          "pvalue" = unlist(springGDF)[pvalueIDX])
# merge station information with calculated trends
stationsF <- stationsF %>% left_join(springGDFdf, by = "id")
# Keep only stations for which there is a significant trend
rows2keep <- which(stationsF$pvalue <= 0.05 &
                     !is.na(stationsF$Slope))
st <- stationsF[rows2keep,]
# Plot figure 8
trendPlots <- plot_trend(df = st, 'Region')
ggsave("BigDataGDF.png",
       gridExtra::arrangeGrob(trendPlots$A, trendPlots$B, ncol = 2),
       height = 5, width = 10, units = 'in', dpi = 600)
stopCluster(cl)
})
```