---
title: "Big Data analytics using the rnrfa package"
author: "Claudia Vitolo"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
                      message = FALSE,
                      cache = FALSE,
                      warning = FALSE)
```
In the last few years, the UK Met Office has reported "unusual warmth and lack of rainfall during March and April, particularly over England and Wales" ([Met Office, Spring 2011](http://www.metoffice.gov.uk/climate/uk/interesting/2011_spring)). Dry springs can affect water resources, because below-average river flow translates, for instance, into reduced availability of drinking water. In this application we present a big data analytics experiment in which we try to understand whether there is any evidence, in the NRFA data, that springs in the UK are becoming drier, both in terms of rainfall and river flow. The experiment consists of retrieving all the available rainfall and flow time series and finding out, for each station, whether there is an increasing or decreasing trend.
We use a cluster to download and analyse all the time series available from the NRFA stations with more than 10 years of recordings. The time series are first downloaded, then summarised as annual averages over the spring period. Seasonal averages are calculated using the function `seasonal_averages()`, which takes as input a time series and a period of interest (e.g. spring) and calculates the related annual average. Using a deliberately simplistic approach, a linear model is fitted to the annual averages and the slope coefficient is used to estimate the trend: negative slopes correspond to decreasing flow/rainfall over time, while positive slopes correspond to increasing flow/rainfall. Once the fitted slope is calculated for each station, the results are plotted using the function `plot_trend()`.
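The trend-estimation step can be sketched in a few lines of base R. This is a toy example on synthetic annual spring means (not NRFA data): it assumes only what is described above, namely that the slope and p-value of a simple linear fit summarise the trend for one station.

```r
# Toy sketch of the per-station trend test described above, on synthetic data.
# The real analysis obtains the annual means via seasonal_averages().
set.seed(42)
years <- 1980:2010
# Synthetic annual spring means with an imposed decline of 0.5 units/year
annual_means <- 100 - 0.5 * (years - 1980) + rnorm(length(years), sd = 2)

fit <- lm(annual_means ~ years)
slope  <- coef(fit)[["years"]]                   # negative => drying trend
pvalue <- summary(fit)$coefficients["years", 4]  # significance of the slope
round(c(Slope = slope, pvalue = pvalue), 4)
```

A station would then be kept for mapping only when `pvalue <= 0.05`, mirroring the filter applied in the chunk below.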
```{r echo=FALSE, message=FALSE}
system.time({
  # install.packages(c("dplyr", "lubridate", "ggplot2", "parallel", "devtools"))
  # devtools::install_github("cvitolo/r_rnrfa", subdir = "rnrfa")
  library(rnrfa)
  library(dplyr)
  library(lubridate)
  library(ggplot2)
  library(parallel)
  library(gridExtra)

  # Make cluster
  cl <- makeCluster(getOption("cl.cores", detectCores() - 1))

  # RAINFALL TRENDS ##########################################################
  # Get site metadata with region columns
  data("stationSummary")
  stationsR <- stationSummary # stations 478-855 return an error

  # Get unique station ids
  ids <- as.character(stationsR$id)

  # Get all rainfall data
  allCMR <- cmr(id = ids, metadata = FALSE, cl = cl, verbose = FALSE)
  names(allCMR) <- ids

  # Remove empties
  torm <- unlist(lapply(allCMR, is.null))
  allCMR2 <- allCMR[!torm]

  # Remove errors
  allCMR3 <- allCMR2[!sapply(allCMR2, function(x) class(x)[1] == "try-error")]

  # Extract data between 21st March and 20th June and calculate averages
  springCMR <- parLapply(cl = cl, X = allCMR3, fun = seasonal_averages,
                         season = "Spring")
  slopeIDX <- seq(1, length(unlist(springCMR)), 2)
  pvalueIDX <- seq(2, length(unlist(springCMR)), 2)
  springCMRdf <- data.frame("id" = names(springCMR),
                            "Slope" = unlist(springCMR)[slopeIDX],
                            "pvalue" = unlist(springCMR)[pvalueIDX])

  # Merge station information with calculated trends
  stationsR <- stationsR %>% left_join(springCMRdf)

  # Keep only stations for which there is a significant trend
  rows2keep <- which(stationsR$pvalue <= 0.05 &
                     !is.na(stationsR$Slope))
  st <- stationsR[rows2keep, ]

  # Plot figure 7
  trendPlots <- plot_trend(df = st, 'Region')
  ggsave("BigDataCMR.png",
         gridExtra::arrangeGrob(trendPlots$A, trendPlots$B, ncol = 2),
         height = 5, width = 10, units = 'in', dpi = 600)

  # FLOW TRENDS ##############################################################
  # Get site metadata with region columns
  stationsF <- stationSummary # stations 478-855 return an error

  # Get unique station ids
  ids <- as.character(stationsF$id)

  # Get all flow data
  allGDF <- gdf(id = ids, metadata = FALSE, cl = cl, verbose = FALSE)
  names(allGDF) <- ids

  # Remove empties
  torm <- unlist(lapply(allGDF, is.null))
  allGDF2 <- allGDF[!torm]

  # Remove errors
  allGDF3 <- allGDF2[!sapply(allGDF2, function(x) class(x)[1] == "try-error")]

  # Extract data between 21st March and 20th June and calculate averages
  springGDF <- parLapply(cl = cl, X = allGDF3, fun = seasonal_averages,
                         season = "Spring")
  slopeIDX <- seq(1, length(unlist(springGDF)), 2)
  pvalueIDX <- seq(2, length(unlist(springGDF)), 2)
  springGDFdf <- data.frame("id" = names(springGDF),
                            "Slope" = unlist(springGDF)[slopeIDX],
                            "pvalue" = unlist(springGDF)[pvalueIDX])

  # Merge station information with calculated trends
  stationsF <- stationsF %>% left_join(springGDFdf)

  # Keep only stations for which there is a significant trend
  rows2keep <- which(stationsF$pvalue <= 0.05 &
                     !is.na(stationsF$Slope))
  st <- stationsF[rows2keep, ]

  # Plot figure 8
  trendPlots <- plot_trend(df = st, 'Region')
  ggsave("BigDataGDF.png",
         gridExtra::arrangeGrob(trendPlots$A, trendPlots$B, ncol = 2),
         height = 5, width = 10, units = 'in', dpi = 600)

  stopCluster(cl)
})
```
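The index arithmetic used above to split the interleaved slope/p-value pairs can be illustrated in isolation. This toy example assumes, as in the chunk above, that each list element is a `(slope, p-value)` pair, so that after `unlist()` the odd positions hold slopes and the even positions hold p-values; the station ids and values are made up.

```r
# Hypothetical per-station (slope, p-value) pairs, as returned for each
# station by seasonal_averages() in the analysis above.
springList <- list(a = c(-0.31, 0.01), b = c(0.12, 0.40), c = c(-0.05, 0.03))
flat <- unlist(springList)         # interleaves: slope, pvalue, slope, ...
slopeIDX  <- seq(1, length(flat), 2)
pvalueIDX <- seq(2, length(flat), 2)
trends <- data.frame(id = names(springList),
                     Slope = unname(flat[slopeIDX]),
                     pvalue = unname(flat[pvalueIDX]))
# Keep only stations with a significant trend, as in the chunk above
trends[trends$pvalue <= 0.05 & !is.na(trends$Slope), ]
```

Stations `a` and `c` survive the significance filter; both have negative slopes, i.e. a drying trend.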