mcfrank/blake.Rmd

## blake.Rmd
---
title: 'Class 6a: Blake et al. (2015) exercise'
author: "Mike Frank"
date: "February 9, 2016"
output:
  html_document:
    toc: true
---

# Intro

This is an exploration of Blake et al. (2015), [Ontogeny of fairness in seven societies](http://www.nature.com/nature/journal/v528/n7581/full/nature15703.html).

```{r}
rm(list=ls())
library(ggplot2)
library(tidyr)
library(dplyr)
library(bootstrap)

theme_set(theme_bw())
sem <- function(x) {sd(x, na.rm=TRUE) / sqrt(length(x))}
ci95 <- function(x) {sem(x) * 1.96}

theta <- function(x,xdata,na.rm=T) {mean(xdata[x],na.rm=na.rm)}
ci.low <- function(x,na.rm=T) {
  mean(x,na.rm=na.rm) - quantile(bootstrap(1:length(x),1000,theta,x,na.rm=na.rm)$thetastar,.025,na.rm=na.rm)}
ci.high <- function(x,na.rm=T) {
  quantile(bootstrap(1:length(x),1000,theta,x,na.rm=na.rm)$thetastar,.975,na.rm=na.rm) - mean(x,na.rm=na.rm)}

```

# Data Prep

First read in the data, as distributed by the journal.

```{r}
d <- read.csv("data/Ontogeny_fairness_seven_societies_data.csv",
                header = TRUE, na.strings = c("NA", "."))
```

Do some preprocessing, taken directly from the supplemental material.

```{r}
facVars <- c("eq.uneq", "value", "decision")
d[, facVars] <- lapply(d[, facVars], factor)
d$trial.number <- as.numeric(gsub(".(\\d+)", "\\1", d$trial))
```

Rename things so that they are easy to deal with.

```{r}
d$trial_type <- factor(d$eq.uneq,
                       levels = c("E","U"),
                       labels = c("Equal","Unequal"))
d$condition <- factor(d$condition,
                      levels = c("AI","DI"),
                      labels = c("Advantageous","Disadvantageous"))
```

# Variable exploration

Describe the dataset graphically in ways that are useful for you to get a handle on the data collection effort.

```{r}
demo <- d %>%
  mutate(age = floor(actor.age.years)) %>%
  group_by(country, age, actor.id) %>%
  summarise(n = n()) %>%
  summarise(n = n())

ggplot(demo, aes(x = age, y = n)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ country)
```

# Hypothesis-related exploration

Try just plotting the un-adjusted curves.

```{r}
ms <- d %>%
  filter(!is.na(eq.uneq)) %>%
  mutate(age = floor(actor.age.years),
         decision = decision == "reject") %>%
  group_by(country, trial_type, condition, age, actor.id) %>%
  summarise(decision = mean(decision, na.rm=TRUE)) %>%
  summarise(mean = mean(decision, na.rm=TRUE),
            ci_lower = mean(decision) - ci95(decision),
            ci_upper = mean(decision) + ci95(decision),
            n = n())

ggplot(ms, aes(x = age, y = mean, col = country)) +
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = .1)) +
  geom_point(aes(size = n)) +
  facet_grid(condition ~ trial_type) +
  geom_smooth(method="lm", se = FALSE) +
  ylab("Proportion offers rejected") +
  xlab("Age (years)") +
  ylim(c(0,1))
```

Now rebin into 3-year bins.

```{r}
ms <- d %>%
  filter(!is.na(eq.uneq)) %>%
  mutate(age = floor(actor.age.years/2)*2,
         decision = decision == "reject") %>%
  group_by(country, trial_type, condition, age, actor.id) %>%
  summarise(decision = mean(decision, na.rm=TRUE)) %>%
  summarise(mean = mean(decision, na.rm=TRUE),
            ci_lower = mean(decision) - ci95(decision),
            ci_upper = mean(decision) + ci95(decision),
            n = n())

ggplot(ms, aes(x = age, y = mean, col = country)) +
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = .1)) +
  geom_point(aes(size = n)) +
  facet_grid(condition ~ trial_type) +
  geom_smooth(method="lm", se = FALSE) +
  ylab("Proportion offers rejected") +
  xlab("Age (years)") +
  ylim(c(0,1))
```

Break this down by country. I like this plot best so far.

```{r}
ggplot(ms, aes(x = age, y = mean, col = condition)) +
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper),
                  position = position_dodge(width = .1)) +
  geom_point(aes(size = n)) +
  facet_grid(trial_type ~ country) +
  geom_smooth(method="lm", se = FALSE, span = 2) +
  ylab("Proportion offers rejected") +
  xlab("Age (years)") +
  ylim(c(0,1))
```

But maybe we can do better. Let's:

+ bootstrap CIs
+ sort the plots
+ add better smoother (quadratic)
+ use lines instead of points
+ better palette

```{r}
ms <- d %>%
  filter(!is.na(eq.uneq)) %>%
  mutate(age = floor(actor.age.years/2)*2,
         decision = decision == "reject") %>%
  group_by(country, trial_type, condition, age, actor.id) %>%
  summarise(decision = mean(decision, na.rm=TRUE)) %>%
  summarise(mean = mean(decision, na.rm=TRUE),
            ci_lower = ci.low(decision),
            ci_upper = ci.high(decision),
            n = n())

ms$country <- factor(ms$country,
                     levels = c("Uganda","US","Canada",
                                "Senegal","India","Peru","Mexico"))

ggplot(ms, aes(x = age, y = mean, col = condition)) +
  geom_linerange(aes(ymin = mean - ci_lower,
                     ymax = mean + ci_upper),
                  position = position_dodge(width = .5)) +
  geom_point(aes(size=n)) +
  facet_grid(trial_type ~ country) +
  geom_smooth(method="lm", se = FALSE, formula = y ~ I(x^2), aes(weight = 1/n)) +
  ylab("Proportion offers rejected") +
  xlab("Age (years)") +
  ylim(c(0,1)) +
  langcog::scale_colour_solarized()
```
I'm still not super happy with the curves. Can we add some model-derived fits?

I'd like to be fitting an lmer model. Unfortunately it doesn't converge on country subsets.

So, let's ignore the p values and SEs, which will be wrong, and fit regular GLM. (The coefficients should be very similar because the design is balanced).

```{r}
mods <- d %>%
  group_by(country) %>%
  do(predict(glm(decision == "reject" ~ condition * trial_type * actor.age.years,
                data = .,
                family = "binomial"),
             newdata = data.frame(expand.grid(actor.age.years = 4:14,
                                    condition = c("Advantageous","Disadvantageous"),
                                    trial_type = c("Equal","Unequal")))))
```

Now plot these. This is not the *ideal* model, but it's way better than nothing.

```{r}
nd <- expand.grid(actor.age.years = 4:14,
                                    condition = c("Advantageous","Disadvantageous"),
                                    trial_type = c("Equal","Unequal"))

preds <- d %>%
  group_by(country) %>%
  do(data.frame(nd,
                preds = predict(glm(decision == "reject" ~
                                      condition * trial_type * actor.age.years,
                                    data = .,
                                    family = "binomial"),
                                newdata = nd, type = "response")))
```

Plot these and then put legends on the bottom.

```{r}
ggplot(ms, aes(x = age, y = mean, col = condition)) +
  geom_linerange(aes(ymin = mean - ci_lower,
                     ymax = mean + ci_upper),
                  position = position_dodge(width = .5)) +
  geom_point(aes(size=n)) +
  facet_grid(trial_type ~ country) +
  geom_line(data = preds, aes(x = actor.age.years, y = preds)) +
  ylab("Proportion offers rejected") +
  xlab("Age (years)") +
  ylim(c(0,1)) +
  langcog::scale_colour_solarized() +
  theme(legend.position = "bottom")
```
	---
	title: 'Class 6a: Blake et al. (2015) exercise'
	author: "Mike Frank"
	date: "February 9, 2016"
	output:
	html_document:
	toc: true
	---

	# Intro

	This is an exploration of Blake et al. (2015), [Ontogeny of fairness in seven societies](http://www.nature.com/nature/journal/v528/n7581/full/nature15703.html).

	```{r}
	rm(list=ls())
	library(ggplot2)
	library(tidyr)
	library(dplyr)
	library(bootstrap)

	theme_set(theme_bw())
	sem <- function(x) {sd(x, na.rm=TRUE) / sqrt(length(x))}
	ci95 <- function(x) {sem(x) * 1.96}

	theta <- function(x,xdata,na.rm=T) {mean(xdata[x],na.rm=na.rm)}
	ci.low <- function(x,na.rm=T) {
	mean(x,na.rm=na.rm) - quantile(bootstrap(1:length(x),1000,theta,x,na.rm=na.rm)$thetastar,.025,na.rm=na.rm)}
	ci.high <- function(x,na.rm=T) {
	quantile(bootstrap(1:length(x),1000,theta,x,na.rm=na.rm)$thetastar,.975,na.rm=na.rm) - mean(x,na.rm=na.rm)}

	```

	# Data Prep

	First read in the data, as distributed by the journal.

	```{r}
	d <- read.csv("data/Ontogeny_fairness_seven_societies_data.csv",
	header = TRUE, na.strings = c("NA", "."))
	```

	Do some preprocessing, taken directly from the supplemental material.

	```{r}
	facVars <- c("eq.uneq", "value", "decision")
	d[, facVars] <- lapply(d[, facVars], factor)
	d$trial.number <- as.numeric(gsub(".(\\d+)", "\\1", d$trial))
	```

	Rename things so that they are easy to deal with.

	```{r}
	d$trial_type <- factor(d$eq.uneq,
	levels = c("E","U"),
	labels = c("Equal","Unequal"))
	d$condition <- factor(d$condition,
	levels = c("AI","DI"),
	labels = c("Advantageous","Disadvantageous"))
	```

	# Variable exploration

	Describe the dataset graphically in ways that are useful for you to get a handle on the data collection effort.

	```{r}
	demo <- d %>%
	mutate(age = floor(actor.age.years)) %>%
	group_by(country, age, actor.id) %>%
	summarise(n = n()) %>%
	summarise(n = n())

	ggplot(demo, aes(x = age, y = n)) +
	geom_bar(stat = "identity") +
	facet_grid(. ~ country)
	```

	# Hypothesis-related exploration

	Try just plotting the un-adjusted curves.

	```{r}
	ms <- d %>%
	filter(!is.na(eq.uneq)) %>%
	mutate(age = floor(actor.age.years),
	decision = decision == "reject") %>%
	group_by(country, trial_type, condition, age, actor.id) %>%
	summarise(decision = mean(decision, na.rm=TRUE)) %>%
	summarise(mean = mean(decision, na.rm=TRUE),
	ci_lower = mean(decision) - ci95(decision),
	ci_upper = mean(decision) + ci95(decision),
	n = n())

	ggplot(ms, aes(x = age, y = mean, col = country)) +
	geom_linerange(aes(ymin = ci_lower, ymax = ci_upper),
	position = position_dodge(width = .1)) +
	geom_point(aes(size = n)) +
	facet_grid(condition ~ trial_type) +
	geom_smooth(method="lm", se = FALSE) +
	ylab("Proportion offers rejected") +
	xlab("Age (years)") +
	ylim(c(0,1))
	```

	Now rebin into 3-year bins.

	```{r}
	ms <- d %>%
	filter(!is.na(eq.uneq)) %>%
	mutate(age = floor(actor.age.years/2)*2,
	decision = decision == "reject") %>%
	group_by(country, trial_type, condition, age, actor.id) %>%
	summarise(decision = mean(decision, na.rm=TRUE)) %>%
	summarise(mean = mean(decision, na.rm=TRUE),
	ci_lower = mean(decision) - ci95(decision),
	ci_upper = mean(decision) + ci95(decision),
	n = n())

	ggplot(ms, aes(x = age, y = mean, col = country)) +
	geom_linerange(aes(ymin = ci_lower, ymax = ci_upper),
	position = position_dodge(width = .1)) +
	geom_point(aes(size = n)) +
	facet_grid(condition ~ trial_type) +
	geom_smooth(method="lm", se = FALSE) +
	ylab("Proportion offers rejected") +
	xlab("Age (years)") +
	ylim(c(0,1))
	```

	Break this down by country. I like this plot best so far.

	```{r}
	ggplot(ms, aes(x = age, y = mean, col = condition)) +
	geom_linerange(aes(ymin = ci_lower, ymax = ci_upper),
	position = position_dodge(width = .1)) +
	geom_point(aes(size = n)) +
	facet_grid(trial_type ~ country) +
	geom_smooth(method="lm", se = FALSE, span = 2) +
	ylab("Proportion offers rejected") +
	xlab("Age (years)") +
	ylim(c(0,1))
	```

	But maybe we can do better. Let's:

	+ bootstrap CIs
	+ sort the plots
	+ add better smoother (quadratic)
	+ use lines instead of points
	+ better palette

	```{r}
	ms <- d %>%
	filter(!is.na(eq.uneq)) %>%
	mutate(age = floor(actor.age.years/2)*2,
	decision = decision == "reject") %>%
	group_by(country, trial_type, condition, age, actor.id) %>%
	summarise(decision = mean(decision, na.rm=TRUE)) %>%
	summarise(mean = mean(decision, na.rm=TRUE),
	ci_lower = ci.low(decision),
	ci_upper = ci.high(decision),
	n = n())

	ms$country <- factor(ms$country,
	levels = c("Uganda","US","Canada",
	"Senegal","India","Peru","Mexico"))

	ggplot(ms, aes(x = age, y = mean, col = condition)) +
	geom_linerange(aes(ymin = mean - ci_lower,
	ymax = mean + ci_upper),
	position = position_dodge(width = .5)) +
	geom_point(aes(size=n)) +
	facet_grid(trial_type ~ country) +
	geom_smooth(method="lm", se = FALSE, formula = y ~ I(x^2), aes(weight = 1/n)) +
	ylab("Proportion offers rejected") +
	xlab("Age (years)") +
	ylim(c(0,1)) +
	langcog::scale_colour_solarized()
	```
	I'm still not super happy with the curves. Can we add some model-derived fits?

	I'd like to be fitting an lmer model. Unfortunately it doesn't converge on country subsets.

	So, let's ignore the p values and SEs, which will be wrong, and fit regular GLM. (The coefficients should be very similar because the design is balanced).

	```{r}
	mods <- d %>%
	group_by(country) %>%
	do(predict(glm(decision == "reject" ~ condition * trial_type * actor.age.years,
	data = .,
	family = "binomial"),
	newdata = data.frame(expand.grid(actor.age.years = 4:14,
	condition = c("Advantageous","Disadvantageous"),
	trial_type = c("Equal","Unequal")))))
	```

	Now plot these. This is not the ideal model, but it's way better than nothing.

	```{r}
	nd <- expand.grid(actor.age.years = 4:14,
	condition = c("Advantageous","Disadvantageous"),
	trial_type = c("Equal","Unequal"))

	preds <- d %>%
	group_by(country) %>%
	do(data.frame(nd,
	preds = predict(glm(decision == "reject" ~
	condition * trial_type * actor.age.years,
	data = .,
	family = "binomial"),
	newdata = nd, type = "response")))
	```

	Plot these and then put legends on the bottom.

	```{r}
	ggplot(ms, aes(x = age, y = mean, col = condition)) +
	geom_linerange(aes(ymin = mean - ci_lower,
	ymax = mean + ci_upper),
	position = position_dodge(width = .5)) +
	geom_point(aes(size=n)) +
	facet_grid(trial_type ~ country) +
	geom_line(data = preds, aes(x = actor.age.years, y = preds)) +
	ylab("Proportion offers rejected") +
	xlab("Age (years)") +
	ylim(c(0,1)) +
	langcog::scale_colour_solarized() +
	theme(legend.position = "bottom")
	```