---
title: "ECO313 - PS03"
author: "Álvaro Carril"
output:
  pdf_document: default
  html_document:
    df_print: paged
---
This is my take on problem set 3, in R, for those who are curious about trying a ~~better~~ different tool.
Drop me a line if you are interested in having these going forward, and/or if you need advice on doing it yourselves.
The code for this notebook can be found in [https://gist.github.com/acarril/854763a1caa82e0f1b781a249079fb43](https://gist.github.com/acarril/854763a1caa82e0f1b781a249079fb43).
# Inspecting and preparing the data
### Reading the data
We can read a `.dta` file in R using the `haven::read_dta()` function. Let's load the data and print the first 10 rows.
```{r}
library(haven)
df <- read_dta("mnook5.dta")
df
```
### Log income
As discussed in class, it is sometimes useful to create log-transformed versions of our variables. In particular, it'll be useful to have log earnings, so we'll create those variables with the help of the `dplyr` library.
I leave this here to show how I created the variables, but I ended up not using them, as most (all?) of you did not log-transform income.
```{r message=FALSE}
library(dplyr)
df <- mutate(df, log_earndad = log(earndad + 1), log_earnmom = log(earnmom + 1))
```
### Number of kids
Next, we'll create dummy variables for the number of kids in each family. We first take a look at the distribution of this variable, using a simple tabulation and a histogram.
```{r}
library(ggplot2)
table(df$nkids)
ggplot(df, aes(nkids)) + geom_bar()
```
Note that there is only one observation (i.e. family) with four kids in the data, so it is more sensible to group families with three or four kids together in our dummy variables.
Another valid alternative is to drop that observation entirely.
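If we wanted to take that route, the drop would be a one-liner with `dplyr` (loaded above); a sketch, not run here since we keep the observation:
```{r eval=FALSE}
# Hypothetical alternative (not used below): drop the single four-kid family
df <- filter(df, nkids != 4)
```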
We create the dummy variables below.
```{r}
df <- mutate(df,
nkids1 = as.numeric(nkids == 1),
nkids2 = as.numeric(nkids == 2),
nkids3plus = as.numeric(nkids == 3 | nkids == 4))
```
### Testing homoskedasticity in baseline model
Our baseline model is one where we regress `award` against parental income (`earnmom` and `earndad`) and the number of kids (`nkids`).
As warned in the problem set, it is possible that the unexplained portion of awards has higher variance when earnings are relatively high, which violates the homoskedasticity assumption. This assumption can be visually inspected by looking at a plot of fitted values against the residuals.
Homoskedasticity is satisfied when there is no pattern in this plot; a "cone-shaped" pattern, in contrast, suggests that heteroskedasticity is an issue. Looking at the plot below, this certainly appears to be the case.
```{r}
model_base <- lm(award ~ earnmom + earndad + nkids2 + nkids3plus, data = df)
ggplot(model_base, aes(x = .fitted, y = .resid)) + geom_point()
```
We can formally test for heteroskedasticity using the Breusch-Pagan test (a White test would also work; I'm just doing one here). In essence, the idea behind the test is to check whether the conditional variance of the residual is independent of the regressors. Therefore, if our linear model is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + u,$$
then we can test whether the variance of the residuals depends on the regressors by regressing the squared residuals ($\widehat{u}^2$) on the $x$'s. That is, we estimate the auxiliary model
$$\widehat{u}^2 = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4 + v.$$
In R, this test can be carried out with several different packages.
I'll use `car::ncvTest()` here, which tests the null hypothesis of no heteroskedasticity.
While the output doesn't report a critical value, the $p$-value is all you need to decide whether to reject the null: if it is below the significance level (here $\alpha = 0.05$), you reject. Since $p < 0.05$, we reject the null hypothesis of homoskedasticity.
```{r message=FALSE}
library(car)
ncvTest(model_base)
```
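Another option is `lmtest::bptest()`, which implements the (studentized) Breusch-Pagan test and should lead to the same conclusion; I load `lmtest` here, although we'll use it again below for robust standard errors.
```{r message=FALSE}
library(lmtest)
bptest(model_base) # studentized Breusch-Pagan test of the same null
```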
If you are feeling adventurous, you can carry out this test "manually" by analyzing the auxiliary regression where the dependent variable is $\widehat{u}^2$. This expression is sometimes called a **variance function**, and in general is
$$var(y_i)=\sigma_i^2=E(u_i^2)=h(\alpha_1+\alpha_2 x_{i2}+\alpha_3 x_{i3}+ \ldots )$$
Notice in the above equation that the variance of $y_i$ changes across observations depending on the values of the $x$'s. If $\alpha_2 = \alpha_3 = \ldots = 0$, then we have constant variance and thus heteroskedasticity is not present. Therefore, we are testing the following null and alternative hypotheses:
$$
\begin{aligned}
H_0:& \alpha_2=\alpha_3= \ldots\ =\alpha_S=0 \\
H_1:& \text{At least one of the $\alpha$'s is not zero}
\end{aligned}
$$
We reject the null hypothesis in favor of the alternative if $\chi^2 \geq \chi_{(1-\alpha,S-1)}^2$.
To obtain a test statistic for our hypothesis test, we consider the linear variance function given by the expression for $\widehat{u}^2$ above.
We are interested in figuring out whether the variables $x_1, x_2, \ldots$ help explain the variation in the least squares residuals $\widehat{u}^2$, and since $R^2$ measures the proportion of the variance in $\widehat{u}^2$ explained by the $x$'s, it is a natural candidate for a test statistic. When $H_0$ is true, the sample size $N$ multiplied by $R^2$ has a $\chi^2$ distribution with $S - 1$ degrees of freedom. That is,
$$
\chi^2 = N \times R^2 \sim \chi_{S-1}^2
$$
Let's inspect the result of estimating the variance function.
```{r}
# Regress the squared residuals (not the fitted values) on the regressors
aux_reg <- lm(I(resid(model_base)^2) ~ earnmom + earndad + nkids2 + nkids3plus, data = df)
summary(aux_reg)
```
Our $R^2$ is `r summary(aux_reg)$r.squared` and $N$ is `r nobs(aux_reg)`, so our test statistic is `r summary(aux_reg)$r.squared * nobs(aux_reg)`. The corresponding critical value is
```{r}
qchisq(.95, df = 4)
```
Since our computed test statistic is greater than the critical value, we can reject the null of no heteroskedasticity.
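Equivalently, instead of comparing the statistic against the critical value, we can compute the $p$-value of the $N \times R^2$ statistic directly:
```{r}
# LM statistic N * R^2 and its p-value under the chi-squared null with 4 df
lm_stat <- nobs(aux_reg) * summary(aux_reg)$r.squared
pchisq(lm_stat, df = 4, lower.tail = FALSE)
```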
# 1. Marginal cost per kid
We want to inspect whether the marginal cost per kid is constant.
If that were the case, then we would expect the increase in the award from having two kids instead of one to be the same as the increase from having three kids instead of two.
In our model, having one kid is the baseline case, so we omit `nkids1` from the regression equation (otherwise one dummy would have to be dropped anyway, or else we would have to drop the constant).
The estimated coefficient for `nkids2` indicates the marginal change in award due to having two kids instead of one (the baseline), and the estimated coefficient for `nkids3plus` indicates the marginal change in award due to having three (or more) kids instead of one.
A simple way of testing whether the relationship is constant or not is to add a quadratic term to our model. By regressing the award amount against `nkids` and `nkids`-squared, the squared term captures the potential non-linearity in the relationship. We use `lmtest::coeftest()` and `sandwich::vcovHC()` to report heteroskedasticity-robust standard errors.
```{r message=FALSE}
library(lmtest)
library(sandwich)
model_linKids <- lm(award ~ earnmom + earndad + nkids + I(nkids^2), data = df)
coeftest(model_linKids, vcov = vcovHC(model_linKids, "HC1"))
```
Furthermore, the negative sign of the estimated coefficient for the square of the number of kids can be interpreted as there being a concave relationship between the award and the number of kids. That is, even though the award is increasing in number of kids, the marginal increase is reduced as the number of kids increases.
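To see this concretely, note that the quadratic model implies that the change in the award from going from $k$ to $k+1$ kids is $\hat\beta_{\text{nkids}} + \hat\beta_{\text{nkids}^2}(2k+1)$. A quick sketch with the estimated coefficients:
```{r}
# Implied marginal increase in the award from one additional kid, at k = 1, 2, 3
b <- coef(model_linKids)
k <- 1:3
b["nkids"] + b["I(nkids^2)"] * (2 * k + 1)
```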
In order to actually test this result, we turn to our regression with dummy variables for the number of kids.
Notice that we need to test `nkids2 = nkids3plus - nkids2`; that is, whether the change in award from one to two kids (captured by `nkids2`) is the same as the change in award from two to three or more kids (captured by `nkids3plus - nkids2`).
```{r}
linearHypothesis(model_base, "nkids2 = nkids3plus - nkids2")
```
With $\Pr(>F) \approx .06$ we can reject the null at the 10 percent level, but not at the 5 percent level. Therefore, I would say there is only weak evidence that the marginal cost per kid is not constant.
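Since we already found evidence of heteroskedasticity, it is worth double-checking this test with a robust covariance matrix; `car::linearHypothesis()` accepts one through its `vcov.` argument:
```{r}
# Same restriction, but using the HC1 robust covariance matrix
linearHypothesis(model_base, "nkids2 = nkids3plus - nkids2",
  vcov. = vcovHC(model_base, "HC1"))
```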
# 2. Effect on award by gender of parent
We now analyze whether the award responds symmetrically to the mother's and the father's earnings. The null, therefore, is that `earnmom = earndad`.
```{r}
linearHypothesis(model_base, "earnmom = earndad")
```
This result indicates that there is strong evidence to reject the null that mothers' and fathers' earnings have an equal effect on the award given; given the computed coefficients, it seems fathers indeed have a greater obligation than mothers, conditional on their earnings and number of kids.
# 3. Effect of conflict
In order to analyze the hypothesis that support awards are lower where there is more conflict over custody, we augment our basic model with a proxy for conflict, `confcust`. We also test for homoskedasticity, and again find that the assumption doesn't appear to hold.
```{r}
model_conflict <- lm(award ~ earnmom + earndad + nkids2 + nkids3plus + confcust, data = df)
ncvTest(model_conflict) # test for homoskedasticity
coeftest(model_conflict, vcov = vcovHC(model_conflict, "HC1")) # report regression results with robust standard errors
```
The estimated coefficient of interest is only significant at the 10 percent level, suggesting there is no strong relationship between the (imperfectly measured) level of conflict and the support award given.
We can refine the analysis a bit by accounting for the fact that the conflict variable is an ordinal indicator, so we can include a dummy for every level and inspect the pattern. The coefficients on these dummies do not appear to vary systematically with the level of conflict.
```{r}
model_conflictFactor <- lm(award ~ earnmom + earndad + nkids2 + nkids3plus + factor(confcust), data = df)
coeftest(model_conflictFactor, vcov = vcovHC(model_conflictFactor, "HC1"))
```
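Beyond eyeballing the individual dummies, we can jointly test that all the conflict-level coefficients are zero; `car::matchCoefs()` selects every coefficient whose name contains "confcust":
```{r}
# Joint test that all conflict-level dummies are zero, with robust vcov
linearHypothesis(model_conflictFactor,
  matchCoefs(model_conflictFactor, "confcust"),
  vcov. = vcovHC(model_conflictFactor, "HC1"))
```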
# 4. Effect of gender of the petitioner
Lastly, we can analyze whether the award given is lower when it is the mother who petitions for divorce.
We augment the base model with the `petmom` variable; alternatively, we could have included `confcust` as well, although it is not necessary (it _is_ necessary to control for number of kids and parental income).
For completeness, we also test the homoskedasticity assumption in this model.
```{r}
model_petmom <- lm(award ~ earnmom + earndad + nkids2 + nkids3plus + petmom, data = df)
ncvTest(model_petmom) # test homoskedasticity
coeftest(model_petmom, vcov = vcovHC(model_petmom, "HC1")) # report estimates with robust standard errors
```
These results suggest that if it is indeed the mother who files for divorce (a proxy for "wanting out of the marriage more"), the award is statistically significantly lower.