@thoughtfulbloke
Created July 25, 2022 09:21
```{r}
library(readr)     # read_csv for reading the files off GitHub
library(dplyr)     # joins and grouped summaries
library(lubridate) # ceiling_date for week-ending dates
```
David's usage notes for the wastewater data at: https://github.com/ESR-NZ/covid_in_wastewater
I am reading the files directly off GitHub, rather than downloading and then reading locally.
The two key files are ww_data_all.csv, which contains all of the sampling data, and sites.csv, which contains information about the testing locations. These have been aggregated into site, regional, and national weekly data with accompanying cases for the (combined) catchment areas. For the site-level aggregations, one issue is that the meshblock areas used for cases do not match the catchment population boundaries: cases from a meshblock that straddles a boundary are assigned to every catchment it crosses, which leads to an overestimate of cases at certain sites. Site-level aggregation also suppresses summaries for small sites, where low numbers of people with covid in an area would raise privacy concerns.
```{r}
ww_data_all.csv <- read_csv("https://raw.githubusercontent.com/ESR-NZ/covid_in_wastewater/main/data/ww_data_all.csv",
  col_types = cols(
    SampleLocation = col_character(),
    sars_gcl = col_double(),
    Collected = col_date(format = ""),
    Result = col_character(),
    copies_per_day_per_person = col_double()
  ))
sites.csv <- read_csv("https://raw.githubusercontent.com/ESR-NZ/covid_in_wastewater/main/data/sites.csv",
  col_types = cols(
    SampleLocation = col_character(),
    DisplayName = col_character(),
    SampleType = col_character(),
    Latitude = col_double(),
    Longitude = col_double(),
    Population = col_double(),
    Region = col_character(),
    shp_label = col_character()
  ))
```
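As a quick sanity check that the column specifications took, glimpse() (from dplyr, already loaded) shows the parsed types and a few values from each file:
```{r}
glimpse(ww_data_all.csv)
glimpse(sites.csv)
```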
These can be merged on the basis of SampleLocation:
```{r}
samples <- ww_data_all.csv %>%
  inner_join(sites.csv, by = "SampleLocation")
```
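Since inner_join silently drops sampling rows that have no matching site record, it is cheap to check whether any SampleLocation values fall out:
```{r}
# Sampling locations in ww_data_all.csv with no entry in sites.csv;
# these rows would be silently dropped by the inner_join above
ww_data_all.csv %>%
  anti_join(sites.csv, by = "SampleLocation") %>%
  distinct(SampleLocation)
```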
The sample measurement itself is stored in sars_gcl, the number of SARS-CoV-2 genome copies per litre of wastewater.
Then there is the derived estimate, copies_per_day_per_person, which normalises the measurement by the population of the catchment the site serves.
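As a minimal sketch of working with the raw measure, assuming ggplot2 is also available, the sars_gcl series for one site can be plotted over time; this just takes an arbitrary site from the merged data rather than assuming any particular site name:
```{r}
library(ggplot2)
one_site <- samples$DisplayName[1] # arbitrary example site
samples %>%
  filter(DisplayName == one_site) %>%
  ggplot(aes(x = Collected, y = sars_gcl)) +
  geom_line() +
  labs(title = paste("SARS-CoV-2 genome copies per litre:", one_site),
       x = "Collection date", y = "sars_gcl")
```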
So, for example, we can aggregate up to regional level using week-ending dates (Sunday) rather than collection dates to get catchment populations, then blend in the case data for each region. Note also that there can be multiple samples for a site within a week, so that needs some handling.
```{r}
cases_regional.csv <- read_csv("https://raw.githubusercontent.com/ESR-NZ/covid_in_wastewater/main/data/cases_regional.csv",
  col_types = cols(
    week_end_date = col_date(format = ""),
    Region = col_character(),
    case_7d_avg = col_double()
  ))
ww_regional.csv <- read_csv("https://raw.githubusercontent.com/ESR-NZ/covid_in_wastewater/main/data/ww_regional.csv",
  col_types = cols(
    week_end_date = col_date(format = ""),
    Region = col_character(),
    copies_per_day_per_person = col_double(),
    n_sites = col_double()
  ))
# Collapse to one row per site per week first (a site can be sampled more
# than once in a week), then sum the catchment populations within each region
regional_summary <- samples %>%
  mutate(week_end_date = ceiling_date(Collected, unit = "week")) %>%
  group_by(SampleLocation, Region, week_end_date) %>%
  summarise(Population = mean(Population),
            .groups = "drop") %>%
  group_by(Region, week_end_date) %>%
  summarise(Population = sum(Population),
            sites_in_data = n(),
            .groups = "drop")
regional_summary %>%
  inner_join(cases_regional.csv, by = c("Region", "week_end_date")) %>%
  inner_join(ww_regional.csv, by = c("Region", "week_end_date")) %>%
  View()
```
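One optional cross-check, on the assumption that ww_regional.csv was built from the same sampling runs as ww_data_all.csv: the locally derived sites_in_data can be compared with the published n_sites, with any mismatch flagging weeks where the regional aggregate used a different set of sites:
```{r}
# Weeks where the local site count disagrees with the published n_sites
regional_summary %>%
  inner_join(ww_regional.csv, by = c("Region", "week_end_date")) %>%
  filter(sites_in_data != n_sites) %>%
  arrange(Region, week_end_date)
```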