Skip to content

Instantly share code, notes, and snippets.

@nacnudus
Created November 25, 2020 18:40
Show Gist options
  • Save nacnudus/16ea9bd7169c796a4d0d64f3ad14f56c to your computer and use it in GitHub Desktop.
Save nacnudus/16ea9bd7169c796a4d0d64f3ad14f56c to your computer and use it in GitHub Desktop.
Parse a UK government spreadsheet of cancer waiting times with R, tidyxl and unpivotr
library(tidyverse)
library(tidyxl)
library(unpivotr)
#' Convert relevant character values to dates
#'
#' @param cells Data frame derived from `tidyxl::xlsx_cells()` or
#' `readr::melt_csv()` or similar.
#' @param condition An expression that returns a logical value,
#' is defined in terms of the columns in `cells`. Similar to `dplyr::filter()`.
#' @param ... Passed to `readr::parse_datetime()`
character_to_date <- function(cells, condition, ...) {
index <- rlang::eval_tidy(enquo(condition), cells)
cells$date[index] <- parse_datetime(cells$character[index], ...)
cells$data_type[index] <- "date"
cells
}
cells <- xlsx_cells("~/Downloads/Cancer-Waiting-Times-Commissioner-Time-Series-Apr-2011-Sep-2020-with-Revisions-2.xlsx")
tidy_sheet <- function(cells) {
cells %>%
filter(!is_blank) %>%
filter(row >= 4) %>%
behead("up-left", "month") %>%
behead("up", "metric") %>%
behead("left", "ccg_code") %>%
behead("left", "ccg") %>%
select(ccg_code, ccg, month, metric, value = numeric)
}
tidy <-
cells %>%
filter(sheet != "Cover", !is_blank, row >= 4) %>%
# Dates in row 4 are sometimes text
character_to_date(row == 4 & is.na(date), "%B %Y") %>%
# Parse each sheet in turn
nest(data = -sheet) %>%
mutate(data = map(data, tidy_sheet)) %>%
# Flatten
unnest(cols = data)
tidy
# # A tibble: 587,254 x 6
# sheet ccg_code ccg month metric value
# <chr> <chr> <chr> <dttm> <chr> <dbl>
# 1 Two Week Wait 5E1 MID-HAMPSHIRE PCT 2011-04-01 00:00:00 "Total seen" 413
# 2 Two Week Wait 5E1 MID-HAMPSHIRE PCT 2011-04-01 00:00:00 "Within Standard " 391
# 3 Two Week Wait 5E1 MID-HAMPSHIRE PCT 2011-04-01 00:00:00 "Performance" 0.947
# 4 Two Week Wait 5A3 SOUTH GLOUCESTERSHIRE PCT 2011-04-01 00:00:00 "Total seen" 530
# 5 Two Week Wait 5A3 SOUTH GLOUCESTERSHIRE PCT 2011-04-01 00:00:00 "Within Standard " 503
# 6 Two Week Wait 5A3 SOUTH GLOUCESTERSHIRE PCT 2011-04-01 00:00:00 "Performance" 0.949
# 7 Two Week Wait 5A4 HAVERING PCT 2011-04-01 00:00:00 "Total seen" 478
# 8 Two Week Wait 5A4 HAVERING PCT 2011-04-01 00:00:00 "Within Standard " 477
# 9 Two Week Wait 5A4 HAVERING PCT 2011-04-01 00:00:00 "Performance" 0.998
# 10 Two Week Wait 5A5 KINGSTON PCT 2011-04-01 00:00:00 "Total seen" 212
# # … with 587,244 more rows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment