---
output: html_document
---
# A quick check of the availability of the UW Anthropology curriculum forecast
I have heard occasional reports from faculty and students that the curriculum forecast on the anthropology department website shows no useful data. I have also observed this myself: the cells in the spreadsheet are sometimes empty or show only dashes. Given these reports and observations, I thought it would be good to study the availability of the forecast so we can make an informed decision about whether or not the current setup is meeting our needs.
## Method
I wrote a script to automatically access the forecast webpage every 10 minutes, 24 hours per day, for ten days (script is here: [https://gist.github.com/benmarwick/11023760](https://gist.github.com/benmarwick/11023760)). The script checks to see if there are useful data in the 'this year' columns on the left side of the page and the 'next year' columns on the right side of the page. The script ran from 5 May to 15 May 2014. I informed the other editors of the page that I was running the experiment and asked them not to make any unusual changes that might alter the normal performance of the webpage. Then I did a little bit of analysis to see how available the data are, and whether there are any patterns in their unavailability.
```{r check-the-forecast, echo = FALSE, eval = FALSE, message = FALSE}
# this is the chunk for data collection
require(RCurl) # if you get an error message here that says something like 'there is no package called 'RCurl''
# then you need to install the package by typing this into the console (and then hitting enter): install.packages("RCurl")
# wait for the package to download and install, then run the require(RCurl) line above again before proceeding.
options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
# in the google spreadsheet, go to File -> Publish to the web -> get the link to the published csv file
goog_sheet <- "https://docs.google.com/spreadsheet/pub?key=0As7CmPqGXTzldF9DclQwY0NaU2JZNjBFQjg4RVdSdWc&single=true&gid=15&output=csv"
# collect data every 10 mins, 24 h per day, for 10 days, and just
# check whether data are displaying:
# 24 h x 6 collections per hour = 144 collections per day, 1440 in total
days <- 10
n <- days * 24 * (60/10)
result <- data.frame(test_time = Sys.time(),
                     this_year = character(n),
                     next_year = character(n),
                     stringsAsFactors = FALSE)
for (i in 1:n) {
  # assign data from the google sheet to an R object, with error handling
  # so that if there is no connection we skip to the next iteration
  # rather than exit the loop
  try_result <- try(
    data <- read.csv(textConnection(getURL(goog_sheet)), stringsAsFactors = FALSE)
  )
  if (inherits(try_result, "try-error")) next
  # test to see if data are displaying or not:
  # fill a row of the data frame with the results,
  # plus the time of the test
  result$test_time[i] <- Sys.time()
  result$this_year[i] <- ifelse(length(grep("--", data[5:6, 3])) == 2, "dashes", data[5, 3])
  result$next_year[i] <- ifelse(length(grep("--", data[5:6, 13])) == 2, "dashes", data[5, 13])
  # show the result so we can see it working
  print(result[i, ])
  # discard the downloaded data, ready to start again
  data <- NULL
  # now wait 10 minutes
  Sys.sleep(10 * 60)
}
```
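The analysis chunks below load the collected results from an `.RData` file rather than re-running the ten-day loop. Here is a minimal sketch of the intervening save step, assuming the `result` object was simply renamed to match the object name loaded below:

```{r save-results, echo = TRUE, eval = FALSE}
# hypothetical bridging step: rename and save the collected results so the
# analysis below can load them without re-running the ten-day collection loop
result_ended_20140515 <- result
save(result_ended_20140515,
     file = "~/teamviewer/anthro/anthro_curriculum_forecast_availability.RData")
```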
## Results
```{r slice-the-data-by-day, echo = FALSE}
load("~/teamviewer/anthro/anthro_curriculum_forecast_availability.RData")
# set interval
interval <- "day"
# Create a sequence of cut points
dateCuts <- seq(from=min(result_ended_20140515$test_time), to=max(result_ended_20140515$test_time), by=interval)
# Use cut to separate dates into periods
result_ended_20140515$period <- cut(result_ended_20140515$test_time, breaks=dateCuts)
# how many times per day did we try to access the forecast?
daily_attempts <- mean(table(result_ended_20140515$period))
# Summarise data for this year (rate of successful accesses per interval)
library(plyr)
this_year <- ddply(result_ended_20140515, .(period), summarize, value=((sum(this_year == "2014"))/length(period)))
# Summarise data for next year (rate of successful accesses per interval)
next_year <- ddply(result_ended_20140515, .(period), summarize, value=((sum(next_year == "2014"))/length(period)))
# combine
both_years <- merge(this_year, next_year, by = 'period')
names(both_years) <- c('period', 'this_year', 'next_year')
library(reshape2)
both_years_m <- melt(both_years, id = 'period')
```
If we aggregate the data by day, we see that the section of the sheet that shows this year's data is very reliable and almost always shows useful data. However, the section that shows next year's data is much less reliable. We obtained useful data about next year's classes in only `r round(min(next_year$value),2)*100`% to `r round(max(next_year$value),2)*100`% of the `r round(daily_attempts,0)` attempts per day.
```{r plot-daily, echo = FALSE}
# plot
library(ggplot2)
both_years_m$period <- format(as.Date(both_years_m$period), format="%d %B %Y")
# observations after the last daily cut point have an NA period; label that final day by hand
both_years_m$period[is.na(both_years_m$period)] <- "14 May 2014"
ggplot(both_years_m, aes(period, value, colour = variable)) +
  geom_point(size = 4) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 0)) +
  ylab("Proportion of attempts that show useful data")
```
Obviously this is not ideal, since students might come to the forecast and find that it's not showing useful information. We can estimate the impact these episodes of unavailability might have by identifying (i) when they occur during the day and (ii) how long the episodes last.
If the page is predictably unavailable late at night or early in the morning, then the impact of less than perfect availability is small. Below we see that there is no obvious pattern (each dot represents one day) in the time of day that the data for next year's classes are unavailable.
```{r patterns, echo = FALSE}
# set interval
interval <- "hour"
# Create a sequence of cut points
dateCuts <- seq(from=min(result_ended_20140515$test_time), to=max(result_ended_20140515$test_time), by=interval)
# Use cut to separate dates into periods
result_ended_20140515$period <- cut(result_ended_20140515$test_time, breaks=dateCuts)
# how many times per hour did we try to access the forecast?
hourly_attempts <- mean(table(result_ended_20140515$period))
library(plyr)
# Summarise data for next year (rate of successful accesses per hour)
next_year <- ddply(result_ended_20140515, .(period), summarize, value=((sum(next_year == "2014"))/length(period)))
# strip ddmmyy from time accessed so we can group by hour of the day
next_year$period <- strptime(next_year$period, format="%Y-%m-%d %H:%M:%S")
next_year$period <- format(next_year$period, "%H:%M")
# plot the success rate for next year at each hour of the day, one point per day
ggplot(next_year, aes(period, value)) +
  geom_jitter(size = 2) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 0)) +
  ylab("Proportion of attempts that show useful data") +
  xlab("time of day")
```
We can search more systematically for cyclical patterns by using a fast Fourier transform to compute a periodogram. If the data were cyclical, we'd see a small number of isolated narrow peaks in the plot below. Instead we see a wide distribution with no obvious narrow peaks, indicating no clearly cyclical pattern in the data.
```{r periodogram, echo = FALSE, message = FALSE}
library(TSA)
periodogram(next_year$value, log = 'no', plot = TRUE, ylab = "Periodogram", xlab = "Frequency", lwd = 2)
```
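For comparison (this is not part of the original analysis), a synthetic series with a known cycle shows what the periodogram of genuinely cyclical data would look like. The sketch below assumes a made-up noisy sine wave with a 24-observation period; its periodogram has a single isolated narrow peak at a frequency of 1/24 (about 0.042), the signature absent from the plot above:

```{r periodogram-demo, echo = TRUE, eval = FALSE, message = FALSE}
# hypothetical comparison: a noisy sine wave with a 24-observation period
library(TSA)
set.seed(1)
synthetic <- sin(2 * pi * (1:240) / 24) + rnorm(240, sd = 0.2)
# this periodogram shows one narrow peak at frequency 1/24, unlike our data
periodogram(synthetic, log = 'no', plot = TRUE,
            ylab = "Periodogram", xlab = "Frequency", lwd = 2)
```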
```{r duration-of-unaval, echo = FALSE, message = FALSE}
# find runs of consecutive identical results for next year's data
rle_unavail <- rle(result_ended_20140515$next_year)
rle_unavail <- data.frame(n = rle_unavail$lengths, what = rle_unavail$values)
# keep only the runs of dashes, i.e. the episodes of unavailability
rle_unavail_dashes <- rle_unavail[rle_unavail$what == "dashes", ]
rle_unavail_dashes$duration <- rle_unavail_dashes$n * 10 # episode length in minutes
```
Another way to understand the impact of the unavailability of the data is to measure the duration of each episode of unavailability. If a student accesses the site and finds no useful data, how long do they need to wait until they will see useful data? If the duration is regular then we can mitigate the problem by putting up a note informing students to check back in _x_ minutes. In the plot below we see that the durations of episodes of unavailable data for next year have a peaked distribution. This indicates some regularity in the durations of unavailability. The average time a student will need to wait is `r round(mean(rle_unavail_dashes$duration),0)` minutes, and never longer than `r round(max(rle_unavail_dashes$duration),0)` minutes. Given this result, we can add a note to the curriculum forecast informing students to check back in an hour if they don't see any useful information about next year's courses.
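The episode durations come from `rle()` in the chunk above, which collapses consecutive identical observations into runs. A toy illustration with made-up data, showing how three consecutive dashes at 10-minute intervals become a single 30-minute episode:

```{r rle-demo, echo = TRUE, eval = FALSE}
# toy illustration with hypothetical data, not the real observations
toy <- c("2014", "dashes", "dashes", "dashes", "2014", "dashes")
runs <- rle(toy)
# each run of dashes is one episode; run length x 10 minutes = duration
data.frame(what = runs$values, minutes = runs$lengths * 10)
#     what minutes
# 1   2014      10
# 2 dashes      30
# 3   2014      10
# 4 dashes      10
```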
```{r duration-plot, echo = FALSE, message = FALSE}
# plot a histogram of episode durations
ggplot(rle_unavail_dashes, aes(duration)) +
  geom_histogram() +
  theme_minimal() +
  ylab("Frequency of episodes that obtained no useful data") +
  xlab("Duration of episode where no useful data was obtained (minutes)")
```
## Conclusion
Our system for delivering class scheduling information to students is not 100% reliable, especially for data on next year's classes, which are probably of most interest to students using this webpage. There is no clear pattern in the time of day that the data are unavailable. The duration of unavailability averages about `r round(mean(rle_unavail_dashes$duration),0)` minutes and never exceeds `r round(max(rle_unavail_dashes$duration),0)` minutes.
It is not clear to me why next year's data should be less reliably available than this year's data; both are generated by a similar method in the Google sheet.
With our current system, a student will get useful data on about 7 or 8 out of 10 attempts. When they do not, they will need to wait an average of `r round(mean(rle_unavail_dashes$duration),0)` minutes until they can see useful data for next year's classes.
One practical implication of these results is that we can add a note to the page that tells the student to check back in an hour if they don't get useful data. The question we cannot answer with these data is whether or not this level of availability is _good enough_ for our students, and whether we need to find another method to deliver this information.