---
title: "R Notebook"
author: "pdparker"
date: "November 15, 2018"
output: html_notebook
---
# Overview
![Overview](fig/PISA-Workshop.png)
## Resources
- [Working with Archival Data: Studying Lives](https://www.amazon.com/Working-Archival-Data-Quantitative-Applications/dp/0803942621)
- [Complex Surveys: A Guide to Analysis Using R](https://www.amazon.com/Complex-Surveys-Guide-Analysis-Using/dp/0470284307/ref=sr_1_1?s=books&ie=UTF8&qid=1539927970&sr=1-1&keywords=Thomas+Lumley)
- [Analyze Survey Data for Free](http://asdfree.com/)
- [R for Data Scientists](http://r4ds.had.co.nz/)
- [A Tidyverse Approach to Survey Data](https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html)
# Archival Data - Advantages
- Cost effective
- Better than what you can afford to do on your own
- Representative
## The disadvantages of archival data
>archival data analysis often means initial questions are reformulated to fit the data and the data are reworked in coding and recoding to fit the question better
Elder, Clipp, and Colerick (1993, p. 2)
## The unique skills required for archival data
- Munging: What rows or columns need to be excluded, cleaned, or merged
- Harmonizing: Can this survey data be merged with other survey data
- Integrating: Can this survey data be merged with data from other sources
- Indexing: Can I combine variables to get an index of what I want
- Aligning: Can I create a correspondence between one coding scheme and another (see the recoding sketch below)
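As a small, hypothetical illustration of the aligning step (the data frames `survey_a` and `survey_b` and their codings are invented), a `dplyr::case_when()` recode brings two coding schemes onto one scale before merging:
```{r eval=FALSE}
# Hypothetical: two surveys code gender differently, so both are recoded
# to a common labelling before any merge.
library(dplyr)
survey_a <- survey_a %>%
  mutate(gender = case_when(gender == 1   ~ "boy",
                            gender == 2   ~ "girl"))
survey_b <- survey_b %>%
  mutate(gender = case_when(gender == "M" ~ "boy",
                            gender == "F" ~ "girl"))
```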
# Intro to Complex Survey Designs
# Why Complex Survey Designs
- Simple random samples are simple to analyse and require relatively few units for a given precision
- In practice they are rarely used, because stratified (and clustered) designs can provide similar precision at a lower cost
## Sampling units
- Primary Sampling Unit e.g., school
- Secondary Sampling Unit e.g., child
## Strata
- Break a population down into known groups
- Probability sample from within each group
- MUST know strata membership and characteristics ahead of time
- Stratification can reduce standard errors because more information about the population is known (see the `svydesign()` sketch below)
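PISA releases replicate weights rather than raw PSU and stratum identifiers, but for surveys that do release them a stratified two-stage design can be declared directly. A minimal sketch, assuming a hypothetical data frame `dat` with columns `school_id`, `student_id`, `state`, and a sampling weight `wt`:
```{r eval=FALSE}
library(survey)
# Two-stage design: schools (PSU) sampled within strata, students (SSU) within schools
des <- svydesign(ids     = ~school_id + student_id, # PSU then SSU
                 strata  = ~state,                  # known stratum membership
                 weights = ~wt,                     # sampling weights
                 nest    = TRUE,                    # PSU ids are nested within strata
                 data    = dat)
```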
## Oversampling
- Oversampling can be beneficial when important groups are small
- PISA oversamples Indigenous students
- Provided the population proportions are known, this can be corrected via weights (a small weighted example follows)
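A minimal, made-up sketch of the idea: a group that is 10% of the population but 50% of the sample gets proportionally smaller weights, so the weighted mean still reflects the population.
```{r}
# Made-up numbers: group B (population mean ~1) is oversampled 5-fold
set.seed(1)
samp <- data.frame(group = rep(c("A", "B"), each = 50),
                   y     = c(rnorm(50, mean = 0), rnorm(50, mean = 1)))
# weight = population share / sample share
samp$wt <- ifelse(samp$group == "A", 0.9 / 0.5, 0.1 / 0.5)
mean(samp$y)                    # naive mean is pulled towards the oversampled group
weighted.mean(samp$y, samp$wt)  # close to the population mean of ~0.1
```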
## Weights
Types of Weights:
- Precision (inverse-variance) weights
- Frequency weights
- Sampling weights
- Senate weights
- Repeated replicate weights
- Attrition weights
The sketch after this list contrasts how R treats precision versus sampling weights.
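The distinction matters in practice: base R's `lm()` treats its `weights` argument as precision (inverse-variance) weights, whereas the survey package treats weights as sampling weights, so standard errors can differ even when point estimates agree. A hedged sketch with a hypothetical data frame `dat` containing an outcome `y` and a weight column `wt`:
```{r eval=FALSE}
library(survey)
# Precision weights: lm() assumes var(y_i) is proportional to 1/wt_i
lm(y ~ 1, data = dat, weights = wt)
# Sampling weights: design-based estimation and standard errors
des <- svydesign(ids = ~1, weights = ~wt, data = dat)
svyglm(y ~ 1, design = des)
```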
## Replicate weights
- PISA is a two-stage design (with schools as the primary sampling unit)
- Standard errors will be wrong if this is not accounted for
- PISA provides repeated replication weights to correct for this
## Replicate Weights
> The replicate-weight approach to estimating standard errors computes the standard deviation of the estimated summary across many partially independent subsets of the one sample, and extrapolates from this to the standard deviation between completely independent samples.
Lumley, Thomas. Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology) (Kindle Locations 797-800). Wiley. Kindle Edition.
## Types of Replicate Weights
- Balanced Repeated Replication: estimates from half-samples
- Fay: all observations are weighted, $(2 - \rho)/\pi_i$ for one half-sample and $\rho/\pi_i$ for the other (PISA uses $\rho = 0.5$)
- Jackknife: JK1 (one observation or cluster given weight zero per replicate) and JKn (the stratified version)
- Block bootstrap (see the `svrepdesign()` sketch below)
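These map onto the `type` argument of `survey::svrepdesign()`. A hedged sketch with a hypothetical data frame `dat` whose columns are a full-sample weight `wt` and replicate weights `rw1`, `rw2`, ...:
```{r eval=FALSE}
library(survey)
# Fay's method, as used for the PISA designs later in this notebook
svrepdesign(weights = ~wt, repweights = "rw[0-9]+",
            type = "Fay", rho = 0.5, data = dat)
# Other supported types include "BRR", "JK1", "JKn", and "bootstrap"
```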
## Statistic for PISA Replicate Weights
- Point estimate: $\sum_{r=1}^{M} A_r / M$
- Standard error: $\sqrt{\frac{4}{M} \sum_{r=1}^{M} (A_r - \bar{A})^2}$, where $A_r$ is the estimate from replicate $r$ and $M = 80$ (translated to R below)
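Translated directly into R, with `a_r` standing for the vector of 80 replicate estimates (the by-hand loop later in this notebook produces exactly such a vector):
```{r eval=FALSE}
M     <- 80
point <- sum(a_r) / M                            # mean of the replicate estimates
se    <- sqrt(4 / M * sum((a_r - mean(a_r))^2))  # Fay standard error with rho = .5
```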
## But Multilevel Models?
> In PISA, the sampling variances estimated with multi-level models will always be greater than the sampling variances estimated with Fay replicate samples
[PISA 2003 Data Analysis Manual](https://goo.gl/dfLSMC)
# My Approach to R
## Base R script
Normal R code can get quite messy:
```{r}
mean(subset(iris, Species == 'virginica')$Sepal.Length)
```
## Tidyverse
The tidyverse provides the pipe (`%>%`), which cleans things up a lot by passing the object on the left to the function on the right.
```{r}
suppressPackageStartupMessages(library(tidyverse))
iris %>%
filter(Species == 'virginica') %>%
summarise(m = mean(Sepal.Length))
```
# Setting up PISA data
## The packages to use
```{r message=FALSE}
## Required ---
library(survey) # the core of everything we will do today
library(mitools) # needed for working with plausible values
## optional but makes life easier ---
library(tidyverse) # a suite of tools to make coding easier
library(readit) # figures out what sort of file you are giving it and reads in the data
library(srvyr) # tidyverse for survey data
library(svyPVpack) # convenience functions for survey data and plausible values
## For integrating data from statistical agencies ---
library(OECD) # grabs OECD data from the website
## For analysis in parallel ---
library(furrr) # needed to easily set up and run models in parallel
plan(multiprocess) # prepare to run things in parallel
## Pretty graphs ---
library(ggthemes) # emulate the style of professional data vis
library(wesanderson) # colour palettes that work well together
```
## Data Munging
```{r cache=TRUE}
d2015 <- readit("~/Dropbox/Databases/WORLD/PUF_SAS_COMBINED_CMB_STU_QQQ/cy6_ms_cmb_stu_qqq.sas7bdat") %>% # load file
  filter(CNT == 'AUS') %>% # let's just use one country
  rename(GENDER = ST004D01T) %>% # give us a better name to work with
  mutate(GENDER = ifelse(GENDER == 1, 'girl', 'boy')) # recode to meaningful labels
```
## Creating a complex survey data.frame - The easy way
```{r}
# Tidyverse
d2015_survey <- d2015 %>%
  as_survey(type = "Fay", repweights = starts_with("W_FSTURWT"),
            weights = W_FSTUWT, rho = .5)
# Standard R equivalent
# d2015_survey <- svrepdesign(weights = ~W_FSTUWT, repweights = "W_FSTURWT[0-9]+",
#                             type = "Fay", rho = .5, data = d2015)
```
# Some basic models
## Sample Size and basic descriptives
```{r}
d2015_survey %>%
group_by(GENDER) %>%
summarise(un = unweighted(n()), n = survey_total(),
m_scie = survey_mean(PV1SCIE))
```
## Some Tidyverse Magic
```{r}
d2015_survey %>%
summarise_at(vars(matches("PV[1-9]+MATH")), survey_mean) %>%
t()
```
## Fay (BRR) replicate weights by hand
```{r}
ms <- c()
for (i in 1:80){
  # weighted mean of PV2MATH under the i-th replicate weight
  tmp <- d2015 %>% select(W_FSTUWT, paste0('W_FSTURWT', i), PV2MATH)
  wt <- tmp[, paste0('W_FSTURWT', i)]
  names(wt) <- "wt"
  tmp <- bind_cols(tmp, wt)
  ms[i] <- weighted.mean(x = tmp$PV2MATH, w = tmp$wt, na.rm = TRUE)
}
```
## Results
```{r}
mean(ms); sqrt(sum(4/80*(ms - mean(ms))^2))
tmp <- svymean(~PV2MATH, design = d2015_survey, return.replicates=TRUE)
tmp
all.equal(as.numeric(ms), as.numeric(tmp$replicates)) # same replicate estimates, allowing for floating-point error
```
## Some other basic descriptives
```{r}
svytable(~GENDER, d2015_survey)
svytable(~GENDER, d2015_survey, Ntotal = 100)
```
## Graph
```{r}
p1 <- d2015_survey %>%
  group_by(GENDER) %>%
  summarise(m_teachsup = survey_mean(TEACHSUP, na.rm = TRUE)) %>%
  ggplot(aes(x = GENDER, y = m_teachsup, fill = GENDER)) +
  geom_bar(stat = "identity") +
  theme_wsj() +
  scale_fill_manual(values = wes_palette(n = 2, name = "Royal1"))
```
## Graph
```{r echo=FALSE}
p1
```
# Plausible Values
## What are Plausible Values
- Not all children in PISA receive all test items
- Thus proficiency must be estimated from incomplete information
- When we estimate scores we do so with uncertainty
- Plausible values are random draws from an IRT model's posterior distribution for each child
- They represent the range of proficiencies the child could plausibly have
- They are combined using Rubin's multiple-imputation rules (sketched below)
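A minimal sketch of Rubin's rules, assuming `q` holds the point estimates from each of the M plausible values and `u` their squared standard errors (`mitools::MIcombine()`, used below, does this for you):
```{r eval=FALSE}
rubin <- function(q, u) {
  m     <- length(q)
  qbar  <- mean(q)               # combined point estimate
  ubar  <- mean(u)               # average within-imputation variance
  b     <- var(q)                # between-imputation variance
  total <- ubar + (1 + 1/m) * b  # total variance
  c(estimate = qbar, se = sqrt(total))
}
```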
## Dealing with plausible values
- The Simple Way (covers most basic cases but not very flexible)
- The Multiple Imputation Way (annoying but possible to do almost anything with it)
- Parallel Processing (mostly not worthwhile except for very big models)
## Replicating OECD report results
```{r cache=TRUE}
# Matches 2015 PISA Report
svyPVpm(by = ~CNT, d2015_survey, pvs = paste0("PV", 1:10, "SCIE")) %>%
select(-ends_with("SE"),-Percent, -SE.Percent, -CNT, -Group1) # p 44
svyPVpm(by = ~CNT, d2015_survey, pvs = paste0("PV", 1:10, "READ")) %>%
select(-ends_with("SE"),-Percent, -SE.Percent, -CNT, -Group1) # p 44
```
## Replicating OECD Results
```{r cache=TRUE}
# Matches 2015 PISA Report
svyPVpm(by = ~CNT, d2015_survey, pvs = paste0("PV", 1:10, "MATH")) %>%
select(-ends_with("SE"),-Percent, -SE.Percent, -CNT, -Group1) # p 44
svyPVglm(PV..SCIE ~ ESCS, d2015_survey) # P46
```
## Replicating OECD Results using data lists
```{r}
pvs <- list()
mis <- list()
for (i in 1:10){
  # keep the i-th set of plausible values plus analysis variables and weights
  mis[[i]] <- d2015 %>%
    select(matches(paste0("PV", i, "[A-Z]+")), ESCS,
           GENDER, W_FSTUWT, starts_with("ST"),
           starts_with("W_FSTURWT"), CNT)
  # strip the PV prefix so every data set has identical names (MATH, SCIE, READ, ...)
  names(mis[[i]]) <- gsub("PV[0-9]+", "", names(mis[[i]]))
  pvs[[i]] <- mis[[i]] %>%
    as_survey(type = "Fay", repweights = starts_with("W_FSTURWT"),
              weights = W_FSTUWT, rho = .5)
}
```
## Replicating OECD Results using data lists
```{r}
names(pvs) <- paste0("PV", 1:10)
mis <- imputationList(mis)
mis <- svrepdesign(weights = ~W_FSTUWT, repweights = "W_FSTURWT[0-9]+",
                   type = "Fay", rho = .5, data = mis)
```
## Data list
```{r cache=TRUE}
with(mis, svymean(~SCIE, na.rm=TRUE)) %>%
MIcombine() %>%
summary
pvs %>%
future_map(~ svymean(~SCIE, design = ., na.rm=TRUE)) %>%
MIcombine() %>%
summary
```
# More Complex Models
## Complex GLM Model
```{r cache = TRUE}
sjlabelled::get_label(d2015$ST123Q01NA)
pvs %>%
future_map(~ svyolr(as.factor(ST123Q01NA)~scale(SCIE), design = .)) %>%
MIcombine() %>%
summary
```
## SEM Models
```{r eval=FALSE}
library(lavaan)
library(lavaan.survey)
M1 <- '
EMOSUPS =~ ST123Q01NA + ST123Q02NA + ST123Q03NA + ST123Q04NA
INSTSCIE =~ ST113Q01TA + ST113Q02TA + ST113Q03TA + ST113Q04TA
'
simple.fit <- cfa(M1, data=d2015)
summary(simple.fit, standardized = TRUE, fit.measures = TRUE)
survey.fit <- lavaan.survey(lavaan.fit=simple.fit, survey.design=d2015_survey)
summary(survey.fit, standardized = TRUE, fit.measures = TRUE)
```
## Missing data
- Easiest to use multiple imputation via a package like Amelia II
- Assign one plausible value to each imputed data set
- Run the analysis as per the models already shown (a hedged sketch follows)
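A hedged sketch of that workflow with the Amelia package; the choice of variables to impute here is purely illustrative:
```{r eval=FALSE}
library(Amelia)
# Illustrative only: impute a small set of analysis variables 10 times,
# keeping the weight columns out of the imputation model via idvars
to_impute <- d2015 %>%
  select(ESCS, GENDER, TEACHSUP, W_FSTUWT, starts_with("W_FSTURWT")) %>%
  as.data.frame()
imps <- amelia(to_impute, m = 10, noms = "GENDER",
               idvars = grep("^W_FST", names(to_impute), value = TRUE))
# Pair imputation i with PViMATH (or PViSCIE) and analyse each pair
# exactly as in the imputationList / MIcombine examples above
```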
## Models not covered by the Survey Package
```{r cache = TRUE, message=FALSE}
library(quantreg)
withReplicates(d2015_survey, quote(coef(rq(INSTSCIE~scale(PV1SCIE),
tau=0.2, weights=.weights))))
withReplicates(d2015_survey, quote(coef(rq(INSTSCIE~scale(PV1SCIE),
tau=0.8, weights=.weights))))
```
## Complications with PVs: Tables
```{r}
# At the top performance band
M1 <- with(mis, svytable(~I(MATH>707.93), Ntotal=100)) %>% unlist
mean(M1[c(T,F)]); mean(M1[c(F,T)]) # % below vs at/above the cut-off, averaged over PVs
```