acarril/ECO313-PSetSolutions

## ECO313-PSetSolutions
# ECO313 - Problem Set Solutions

## ps04_solution.Rmd
---
title: "ECO313 - Problem Set 4"
author: "Álvaro Carril"
output:
  pdf_document: default
  html_notebook: default
---

The code for this solution can be found in [https://gist.github.com/acarril/a643993e5011f4f2a592b1d93750bdcb](https://gist.github.com/acarril/a643993e5011f4f2a592b1d93750bdcb).

We investigate the potential effect of the increase in the state gasoline tax in Pennsylvania, which rose from 40.7 to 50.5 cents per gallon on January 1st, 2014.
The data is a panel that contains information on prices at the zipcode level, for different PA zipcodes and for different dates before and after the PA tax increase.
We'll use the following packages for our analysis.
```{r message=FALSE}
library(haven)      # load *.dta files
library(dplyr)      # data manipulation
library(tidyr)      # tidy data
library(lubridate)  # deal with dates
library(ggplot2)    # non-ugly plots
```

**Note.** I won't comment as much about R itself as I did before. Let me know if you would prefer more programming details.

# Inspecting and Preparing the Data

The data already contains prices in USD at the zipcode level, for dates before and after the tax reform and for zipcodes inside PA. Additionally, there is information on distance to the NJ border, distance to any border, and two control groups that are described in the problem set.
For the analysis, we need information about

- **price**, the dependent variable
- **date**, which determines if the observation falls before or after the tax increase
- **border distance** (to NJ or any border), which is used to determine the treatment/control status of the observation.

### Price

Price is given in USD, and a quick histogram shows it is roughly normally distributed.
We could still log-transform it to interpret the estimated coefficients as the percentage change in price due to a 1 unit increase in the regressors; either way would be fine, so I'll leave it as is.

```{r}
df <- read_dta("https://www.princeton.edu/~econ313/data/njpa_gas02.dta")
hist(df$price)
```

### Date

Note that there are several columns that have (sometimes redundant) date data, and all of them are strings. We create a single `date` column with day-month-year information, and we define it to be of type date (as opposed to string), which will help us in visually inspecting the pre-treatment trends.
Additionally, we create the `post` indicator variable if the observation's date is post-treatment, and the `days` variable which counts the days after treatment (so it has negative values before the treatment date).
Finally, we drop observations for which `dbordernj` is missing.


```{r message=FALSE}
df <- df %>%
  unite(date, c("date", "year"), sep = ", ") %>% # join month, day, year strings
  mutate(date = mdy(date)) %>% # transform date string to date column type
  select(-c(yrm, month)) %>% # drop columns with other date info
  mutate(post = as.numeric(date >= mdy("01-01-2014")), # dummy variable for post-treatment period
         days = time_length(interval(ymd("2014-01-01"), date), "days")) %>% # days to price increase
  drop_na(dbordernj) # drop 18 observations where distance to NJ border is missing
df
```


# 1. Treatment and Control Groups

### Treated

Determining which PA zipcodes are "treated" is not straightforward, as it depends on their proximity to the NJ state border, among other unobserved factors.
As the problem set suggests, a good start is to consider as treated any zipcode that is no more than 10 miles away from NJ.
I considered two alternatives to this definition.

1. An arbitrary and very stringent treatment definition with zipcodes that are 2 miles or closer to the NJ border.
2. A data-driven definition, motivated by a plot of the raw data of distance to any border against price, and explained below.

**Note**. I left these alternative treatment definitions as illustrations of approaches to creating a valid treatment group. Using any of these two groups doesn't alter any conclusions significantly, so for the rest of the analysis I focus solely on the treatment definition suggested in the problem set.

```{r}
ggplot(filter(df, dbordernj < 50, post==0), aes(x = dborder, y = price)) +
  geom_point(color = 'grey') +
  stat_summary_bin(fun = mean,
                   fun.min = function(x) mean(x) - sd(x),
                   fun.max = function(x) mean(x) + sd(x),
                   geom = "pointrange", color = 'orange', bins = 10)
```
I plot the raw data points of distance to any border against price (in gray), and then I overlay the mean with $\pm 1$ standard deviation of 5-mile bins (in orange).
Note that I only use data from the pre-treatment period, because it's usually better to define treatment units in quasi-experimental designs by relying solely on pre-treatment data.
I also restrict the data to leave out observations in control group 1, which, as I argue below, is the better control group.
By visually inspecting the plot above, we can see that as the distance to any border increases, the price variance tends to decrease.
In particular, by visual inspection alone there seems to be a sharp decline in price variance after the 15 mile mark, which might be consistent with the intuition that past that distance firms price differently. Therefore I define a treatment group where treated zipcodes are the ones located under 15 mile of any border.

### Controls

We are careful here with the interaction of treated and control conditions, as they are mutually exclusive but not exhaustive. Control group 1 corresponds to zip codes that are over 50 miles away from *any* border, and control group 2 are those who are over 50 miles away from the *NJ* state border.
Therefore, it is possible to have observations that are not assigned to treatment nor control.
For example, an observation (i.e. zipcode) which is 25 miles away from the NJ border is not in the treatment group, but is also not in any control group.
Below, we code the treatment variables being careful with these considerations.

Control group 2 might be problematic because it includes zipcodes that are 50 miles away from the NJ state border only, and this might include zipcodes that are extremely close to other state borders. By being close to other state borders, these consumers could potentially source gasoline from other states, and thus we would be overestimating the elasticity of the control group.
This could be addressed with data of gasoline prices (and other factors) of other neighboring states, but we don't have such data.
In particular, if for instance we knew that all other neighboring states had higher gasoline prices than PA, then we could be sure that control zipcodes are not sourcing gasoline from these neighboring states, and that the only relevant border in terms of treatment intensity is the NJ border.
On the other hand, control group 2 is by definition much larger, which can improve how precise our estimates are.
In spite of this, it seems the better control group is group 1, and so for the rest of the analysis we will use that one.

```{r}
df <- df %>%
  mutate(
    treated = case_when(  # T is 10 miles from NJ; C is 50 miles from any border
      dbordernj <= 10 ~ 1, # indicator is 1 if less than 10 miles from NJ border
      control1 == 1 ~ 0, # indicator is 0 if observation is in control group 1
      TRUE ~ NA_real_ # indicator is NA for all other cases
    )
  ) %>%
  mutate_at(vars(starts_with("treated")), factor)  # change to factor type
```

# 2. Effect of Tax Increase

### Average Price Change by Group

We now look at the change in price between the pre- and post- periods in both groups, which the code below computes. From the problem set we know that the tax increase was $50.5 - 40.7 = 9.8$ cents.
We see that on average both groups increased their prices by roughly double the tax increase.


```{r message=FALSE}
grp_prices      <- df %>%
  drop_na(treated) %>%
  group_by(treated, post) %>%
  summarise(price = mean(price)) %>%
  data.matrix()
delta_control   <- grp_prices[2, 3] - grp_prices[1, 3]
delta_treatment <- grp_prices[4, 3] - grp_prices[3, 3]
writeLines(c("Change in control: ", delta_control,
             "\nChange in treatment: ", delta_treatment))
```

### The Need for DiD

Given that the average price increase for both groups after the tax reform is much higher than the tax increase itself, this might be indicative of some other unobserved factors at play. We can think of many examples for this: global oil price increase, supply chain problems, increased demand for automobile usage during that period, etc.
Using a DiD analysis would account for these unobserved factors, as long as they affect both groups to the same degree: if prices are increasing in both states because of a global increase in the price of oil, by using a DiD design we "net out" this increase, which we assume is affecting the average price increase in all states to the same extent. This is the parallel trends assumption, which we analyze below.

Notice that at this point we could very well conduct a DiD analysis with the pre- and post- group averages (can you see how?)
Using a regression framework for a DiD analysis is sometimes more convenient, but not a prerequisite for doing so.

### Parallel Trends

```{r message=FALSE}
plot_price_ts <- function(df, treatmentvar){
  df_plot <- df %>%
    drop_na(!!sym(treatmentvar)) %>%
    group_by(!!sym(treatmentvar), date) %>%
    summarise(price = mean(price))
  plot <- ggplot() +
    geom_line(data=df_plot, aes(x = date, y = price,
                                group = !!sym(treatmentvar),
                                color = !!sym(treatmentvar))) +
    scale_x_date(date_labels = "%b %y") +
    geom_vline(xintercept = ymd("2014-01-01"))
}

ts_plot <- plot_price_ts(df, "treated") %>% show()
```

We can visually inspect the parallel trends assumption by plotting average prices per group in the periods for which we have data---in particular, we care about the pre-treatment trends.
By visual inspection the parallel trends assumption seems to hold.
The dip in prices from mid-October to mid-November occurs for both groups, and the increase that begins after that period is also similar for both groups throughout the pre-treatment period.
The fact that the average price level for treatment zipcodes dips below the average for the control group is also highly suggestive that the treatment is well-defined and had a real effect on prices for some zipcodes.
However, only proper analysis using a DiD framework will give us more certainty that there are statistically significant differences.


### Difference in Difference Analysis

We are interested in the effect of treatment $D_i$ (i.e. being "close" to the NJ border, as discussed above) on an outcome $Y_i$, which in this case is the price of gasoline in zipcode $i$.
In order to conduct a DiD analysis, we estimate
$$
y = \beta_0 + \beta_1 P + \beta_2 T + \beta_3(P \cdot T) + \varepsilon,
$$
where $P$ is an indicator variable for the post-treatment period and $T$ is an indicator being in the treatment group.
The composite variable $P \cdot T$ is an indicator for treatment units in the post-treatment period.
We are interested in $\hat\beta_3$, which captures the difference between the average change over time in the outcome variable for the treatment group and the average change over time for the control group.
We implement this in R below and report the results below.

```{r}
didreg <- lm(price ~ post + treated + post*treated, data = df)
summary(didreg)
```

The estimate for $\beta_3$ can be interpreted as evidence that after the tax increase in PA, the average price in zipcodes close to the NJ border increased `r round((delta_control - delta_treatment)*100, digits = 2)` cents less than the average price of zipcodes very far from any border.
This is to be expected, because the control group's demand should be more inelastic, and therefore suppliers can increase the average price relatively more.
This is also consistent with the hypothesis that the tax incidence is greater for consumers when there are no close substitutes, and that it is relatively greater for suppliers when consumers can more easily substitute away from their good. Therefore, we (maybe unsurprisingly) conclude NJ gasoline is a relevant substitute for PA zipcodes near the NJ border.

## ps07_solution.Rmd
---
title: "ECO313 - Problem Set 7"
author: "Álvaro Carril"
output:
  pdf_document: default
  html_notebook: default
---

We explore the private/public sector wage gap using a Blinder-Oaxaca decomposition.
The code for this notebook can be found in [https://gist.github.com/acarril/a643993e5011f4f2a592b1d93750bdcb](https://gist.github.com/acarril/a643993e5011f4f2a592b1d93750bdcb).

```{r message=FALSE, warning=FALSE}
library(haven)  # read dta
library(dplyr)  # manipulate data
library(summarytools) # summarize with weights
library(ggplot2)  # plots
library(gridExtra)  # plots side by side
library(stargazer)  # report/export results
library(oaxaca)   # oaxaca-blinder decomposition

df <- read_dta("https://www.princeton.edu/~econ313/data/cps19a.dta")
```

```{r,echo=FALSE,message=FALSE,warning=FALSE}
# Set so that long lines in R will be wrapped:
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60), tidy=TRUE)
```

# a) Documenting the public/private wage gap

First I plot the weighted kernel density estimates for the log-wages of each group to get a sense of the overall distribution per group.
Private sector wages appear to be higher across the entire distribution, except for the highest wages, which tend to be equal in both sectors.

```{r, warning=FALSE}
ggplot() +
  geom_density(
    data=df,
    aes(x=lwage, group=as.factor(private), fill=as.factor(private), weight = weight),
    alpha=0.5,
    adjust=2
  ) +
  xlim(0, 4) +
  xlab("Log wage") +
  ylab("Density") +
  guides(fill=guide_legend(title="private"))
```

We can take another look at wages (and other statistics) by computing weighted summary statistics for observations corresponding to workers in the public and private sectors (reported below).
As suggested by the plot, these summary statistics confirm that wages have a higher mean in the public sector, and they also have less dispersion.
Even though the maximum wage is higher in the private sector, the plots and further quantile inspections (not reported) suggest that increased variance in the public sector is mostly driven by the left-hand side of the distribution.

```{r, results="hold", tidy=TRUE}
df %>%
  filter(private == 0) %>%
  descr(stats = c("mean", "sd", "min", "max"), transpose = TRUE, weights = df[df$private==0,]$weight)
df %>%
  filter(private == 1) %>%
  descr(stats = c("mean", "sd", "min", "max"), transpose = TRUE, weights = df[df$private==1,]$weight)
```


# b) Regression model of earnings

A very straightforward earnings function would be
$$
\ln(\text{wage}_i) = \alpha + \beta \cdot \text{private}_i  + \gamma \cdot x_i + \varepsilon_i,
$$
where the dependent variable is log-wage of individual $i$, $\text{private}_i$ is an indicator variable for whether individual $i$ works in the private sector, and $x_i$ is a vector of covariates that potentially includes unionization, education, and age.
Below I estimate a "base" model whitout any controls (ie. $x$ is empty), and a "full" model with a collection of controls.

```{r, message = FALSE, warning=FALSE}
reg_base <- lm(lwage ~ private, data = df, weights = weight)
reg_full <- lm(lwage ~ private + union + age + agesq + factor(edcat) + nonwhite + female + marr + marrfe, data = df, weights = weight)
stargazer(reg_base, reg_full, type = "text")
```

The base model confirms the results from the previous section: without any controls, working in the private sector is associated with a wage that is `r round(abs(reg_base$coeff["private"])*100, digits = 1)`% *lower* on average.
However, after controlling for different covariates the story is reversed: our model suggests that working in the private sector is associated with wage that is `r round(reg_full$coeff["private"]*100, digits = 1)`% *higher* on average.
This suggests that the observed wage gap is explained at least to a large extent by the characteristics of the workers in each sector.


# c) Blinder-Oaxaca decomposition

```{r, message = FALSE}
oax <- oaxaca(lwage ~ union + age + agesq + factor(edcat) + nonwhite + female + marr | private | union , data = df)
oax.overall <- plot.oaxaca(oax, decomposition = "twofold", group.weight = -1, unexplained.split = FALSE, type = "overall")
oax.factors <- plot.oaxaca(oax, decomposition = "twofold", group.weight = -1, unexplained.split = FALSE)
```

Now I decompose the wage gap between the two sectors using a Blinder-Oaxaca decomposition.
The mean outcome value for observations in group A (i.e. the public sector) is `r round(oax$y$y.A, digits=3)`, and it is `r round(oax$y$y.B, digits=3)` for group B (i.e. the private sector).
Thus, the overall difference between the mean log wages in both sectors (ie. A - B) is `r round(oax$y$y.diff*100, digits=1)`%.
This total difference in mean outcomes is decomposed into

- the **explained** portion, due to cross-group differences in the explanatory variables, which is `r round(oax.overall$data$value[1], digits=3)`, and
- the **unexplained** portion (i.e. the remainder), attributable to unobserved factors, which is `r round(oax.overall$data$value[2], digits=3)`.

The results of the twofold decomposition are plotted on the left hand side figure, with the figure on the right further decomposing these portions by variable.

```{r, message = FALSE}
grid.arrange(oax.overall, oax.factors, ncol = 2)
```


# d) Conclusion about earnings gap

Based on the results from the Blinder-Oaxaca decomposition, the explained portion of the difference is higher than the observed difference (which is why the unexplained portion is negative). This suggests that the wage gap between both groups is *smaller* than what we would expect given the characteristics of workers in both groups.
In other words, even though wages in the public sector are `r round(oax$y$y.diff*100, digits=1)`% higher on average, given their characteristics (e.g. more educated, older), there should be an even larger difference.

Considering the role of specific variables, note that while unionization has a statistically significant positive effect on the explained portion of the wage gap, its magnitude is negligible. Age and education (specially higher education), on the other hand, have large and statistically significant positive effects in wage, and they account for much of the explained portion of the earnings gap.
	---
	title: "ECO313 - Problem Set 4"
	author: "Álvaro Carril"
	output:
	pdf_document: default
	html_notebook: default
	---

	The code for this solution can be found in [https://gist.github.com/acarril/a643993e5011f4f2a592b1d93750bdcb](https://gist.github.com/acarril/a643993e5011f4f2a592b1d93750bdcb).

	We investigate the potential effect of the increase in the state gasoline tax in Pennsylvania, which rose from 40.7 to 50.5 cents per gallon on January 1st, 2014.
	The data is a panel that contains information on prices at the zipcode level, for different PA zipcodes and for different dates before and after the PA tax increase.
	We'll use the following packages for our analysis.
	```{r message=FALSE}
	library(haven) # load *.dta files
	library(dplyr) # data manipulation
	library(tidyr) # tidy data
	library(lubridate) # deal with dates
	library(ggplot2) # non-ugly plots
	```

	Note. I won't comment as much about R itself as I did before. Let me know if you would prefer more programming details.

	# Inspecting and Preparing the Data

	The data already contains prices in USD at the zipcode level, for dates before and after the tax reform and for zipcodes inside PA. Additionally, there is information on distance to the NJ border, distance to any border, and two control groups that are described in the problem set.
	For the analysis, we need information about

	- price, the dependent variable
	- date, which determines if the observation falls before or after the tax increase
	- border distance (to NJ or any border), which is used to determine the treatment/control status of the observation.

	### Price

	Price is given in USD, and a quick histogram shows it is roughly normally distributed.
	We could still log-transform it to interpret the estimated coefficients as the percentage change in price due to a 1 unit increase in the regressors; either way would be fine, so I'll leave it as is.

	```{r}
	df <- read_dta("https://www.princeton.edu/~econ313/data/njpa_gas02.dta")
	hist(df$price)
	```

	### Date

	Note that there are several columns that have (sometimes redundant) date data, and all of them are strings. We create a single `date` column with day-month-year information, and we define it to be of type date (as opposed to string), which will help us in visually inspecting the pre-treatment trends.
	Additionally, we create the `post` indicator variable if the observation's date is post-treatment, and the `days` variable which counts the days after treatment (so it has negative values before the treatment date).
	Finally, we drop observations for which `dbordernj` is missing.


	```{r message=FALSE}
	df <- df %>%
	unite(date, c("date", "year"), sep = ", ") %>% # join month, day, year strings
	mutate(date = mdy(date)) %>% # transform date string to date column type
	select(-c(yrm, month)) %>% # drop columns with other date info
	mutate(post = as.numeric(date >= mdy("01-01-2014")), # dummy variable for post-treatment period
	days = time_length(interval(ymd("2014-01-01"), date), "days")) %>% # days to price increase
	drop_na(dbordernj) # drop 18 observations where distance to NJ border is missing
	df
	```


	# 1. Treatment and Control Groups

	### Treated

	Determining which PA zipcodes are "treated" is not straightforward, as it depends on their proximity to the NJ state border, among other unobserved factors.
	As the problem set suggests, a good start is to consider as treated any zipcode that is no more than 10 miles away from NJ.
	I considered two alternatives to this definition.

	1. An arbitrary and very stringent treatment definition with zipcodes that are 2 miles or closer to the NJ border.
	2. A data-driven definition, motivated by a plot of the raw data of distance to any border against price, and explained below.

	Note. I left these alternative treatment definitions as illustrations of approaches to creating a valid treatment group. Using any of these two groups doesn't alter any conclusions significantly, so for the rest of the analysis I focus solely on the treatment definition suggested in the problem set.

	```{r}
	ggplot(filter(df, dbordernj < 50, post==0), aes(x = dborder, y = price)) +
	geom_point(color = 'grey') +
	stat_summary_bin(fun = mean,
	fun.min = function(x) mean(x) - sd(x),
	fun.max = function(x) mean(x) + sd(x),
	geom = "pointrange", color = 'orange', bins = 10)
	```
	I plot the raw data points of distance to any border against price (in gray), and then I overlay the mean with $\pm 1$ standard deviation of 5-mile bins (in orange).
	Note that I only use data from the pre-treatment period, because it's usually better to define treatment units in quasi-experimental designs by relying solely on pre-treatment data.
	I also restrict the data to leave out observations in control group 1, which, as I argue below, is the better control group.
	By visually inspecting the plot above, we can see that as the distance to any border increases, the price variance tends to decrease.
	In particular, by visual inspection alone there seems to be a sharp decline in price variance after the 15 mile mark, which might be consistent with the intuition that past that distance firms price differently. Therefore I define a treatment group where treated zipcodes are the ones located under 15 mile of any border.

	### Controls

	We are careful here with the interaction of treated and control conditions, as they are mutually exclusive but not exhaustive. Control group 1 corresponds to zip codes that are over 50 miles away from any border, and control group 2 are those who are over 50 miles away from the NJ state border.
	Therefore, it is possible to have observations that are not assigned to treatment nor control.
	For example, an observation (i.e. zipcode) which is 25 miles away from the NJ border is not in the treatment group, but is also not in any control group.
	Below, we code the treatment variables being careful with these considerations.

	Control group 2 might be problematic because it includes zipcodes that are 50 miles away from the NJ state border only, and this might include zipcodes that are extremely close to other state borders. By being close to other state borders, these consumers could potentially source gasoline from other states, and thus we would be overestimating the elasticity of the control group.
	This could be addressed with data of gasoline prices (and other factors) of other neighboring states, but we don't have such data.
	In particular, if for instance we knew that all other neighboring states had higher gasoline prices than PA, then we could be sure that control zipcodes are not sourcing gasoline from these neighboring states, and that the only relevant border in terms of treatment intensity is the NJ border.
	On the other hand, control group 2 is by definition much larger, which can improve how precise our estimates are.
	In spite of this, it seems the better control group is group 1, and so for the rest of the analysis we will use that one.

	```{r}
	df <- df %>%
	mutate(
	treated = case_when( # T is 10 miles from NJ; C is 50 miles from any border
	dbordernj <= 10 ~ 1, # indicator is 1 if less than 10 miles from NJ border
	control1 == 1 ~ 0, # indicator is 0 if observation is in control group 1
	TRUE ~ NA_real_ # indicator is NA for all other cases
	)
	) %>%
	mutate_at(vars(starts_with("treated")), factor) # change to factor type
	```

	# 2. Effect of Tax Increase

	### Average Price Change by Group

	We now look at the change in price between the pre- and post- periods in both groups, which the code below computes. From the problem set we know that the tax increase was $50.5 - 40.7 = 9.8$ cents.
	We see that on average both groups increased their prices by roughly double the tax increase.


	```{r message=FALSE}
	grp_prices <- df %>%
	drop_na(treated) %>%
	group_by(treated, post) %>%
	summarise(price = mean(price)) %>%
	data.matrix()
	delta_control <- grp_prices[2, 3] - grp_prices[1, 3]
	delta_treatment <- grp_prices[4, 3] - grp_prices[3, 3]
	writeLines(c("Change in control: ", delta_control,
	"\nChange in treatment: ", delta_treatment))
	```

	### The Need for DiD

	Given that the average price increase for both groups after the tax reform is much higher than the tax increase itself, this might be indicative of some other unobserved factors at play. We can think of many examples for this: global oil price increase, supply chain problems, increased demand for automobile usage during that period, etc.
	Using a DiD analysis would account for these unobserved factors, as long as they affect both groups to the same degree: if prices are increasing in both states because of a global increase in the price of oil, by using a DiD design we "net out" this increase, which we assume is affecting the average price increase in all states to the same extent. This is the parallel trends assumption, which we analyze below.

	Notice that at this point we could very well conduct a DiD analysis with the pre- and post- group averages (can you see how?)
	Using a regression framework for a DiD analysis is sometimes more convenient, but not a prerequisite for doing so.

	### Parallel Trends

	```{r message=FALSE}
	plot_price_ts <- function(df, treatmentvar){
	df_plot <- df %>%
	drop_na(!!sym(treatmentvar)) %>%
	group_by(!!sym(treatmentvar), date) %>%
	summarise(price = mean(price))
	plot <- ggplot() +
	geom_line(data=df_plot, aes(x = date, y = price,
	group = !!sym(treatmentvar),
	color = !!sym(treatmentvar))) +
	scale_x_date(date_labels = "%b %y") +
	geom_vline(xintercept = ymd("2014-01-01"))
	}

	ts_plot <- plot_price_ts(df, "treated") %>% show()
	```

	We can visually inspect the parallel trends assumption by plotting average prices per group in the periods for which we have data---in particular, we care about the pre-treatment trends.
	By visual inspection the parallel trends assumption seems to hold.
	The dip in prices from mid-October to mid-November occurs for both groups, and the increase that begins after that period is also similar for both groups throughout the pre-treatment period.
	The fact that the average price level for treatment zipcodes dips below the average for the control group is also highly suggestive that the treatment is well-defined and had a real effect on prices for some zipcodes.
	However, only proper analysis using a DiD framework will give us more certainty that there are statistically significant differences.


	### Difference in Difference Analysis

	We are interested in the effect of treatment $D_i$ (i.e. being "close" to the NJ border, as discussed above) on an outcome $Y_i$, which in this case is the price of gasoline in zipcode $i$.
	In order to conduct a DiD analysis, we estimate
	$$
	y = \beta_0 + \beta_1 P + \beta_2 T + \beta_3(P \cdot T) + \varepsilon,
	$$
	where $P$ is an indicator variable for the post-treatment period and $T$ is an indicator being in the treatment group.
	The composite variable $P \cdot T$ is an indicator for treatment units in the post-treatment period.
	We are interested in $\hat\beta_3$, which captures the difference between the average change over time in the outcome variable for the treatment group and the average change over time for the control group.
	We implement this in R below and report the results below.

	```{r}
	didreg <- lm(price ~ post + treated + post*treated, data = df)
	summary(didreg)
	```

	The estimate for $\beta_3$ can be interpreted as evidence that after the tax increase in PA, the average price in zipcodes close to the NJ border increased `r round((delta_control - delta_treatment)*100, digits = 2)` cents less than the average price of zipcodes very far from any border.
	This is to be expected, because the control group's demand should be more inelastic, and therefore suppliers can increase the average price relatively more.
	This is also consistent with the hypothesis that the tax incidence is greater for consumers when there are no close substitutes, and that it is relatively greater for suppliers when consumers can more easily substitute away from their good. Therefore, we (maybe unsurprisingly) conclude NJ gasoline is a relevant substitute for PA zipcodes near the NJ border.
	---
	title: "ECO313 - Problem Set 7"
	author: "Álvaro Carril"
	output:
	pdf_document: default
	html_notebook: default
	---

	We explore the private/public sector wage gap using a Blinder-Oaxaca decomposition.
	The code for this notebook can be found in [https://gist.github.com/acarril/a643993e5011f4f2a592b1d93750bdcb](https://gist.github.com/acarril/a643993e5011f4f2a592b1d93750bdcb).

	```{r message=FALSE, warning=FALSE}
	library(haven) # read dta
	library(dplyr) # manipulate data
	library(summarytools) # summarize with weights
	library(ggplot2) # plots
	library(gridExtra) # plots side by side
	library(stargazer) # report/export results
	library(oaxaca) # oaxaca-blinder decomposition

	df <- read_dta("https://www.princeton.edu/~econ313/data/cps19a.dta")
	```

	```{r,echo=FALSE,message=FALSE,warning=FALSE}
	# Set so that long lines in R will be wrapped:
	knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60), tidy=TRUE)
	```

	# a) Documenting the public/private wage gap

	First I plot the weighted kernel density estimates for the log-wages of each group to get a sense of the overall distribution per group.
	Private sector wages appear to be higher across the entire distribution, except for the highest wages, which tend to be equal in both sectors.

	```{r, warning=FALSE}
	ggplot() +
	geom_density(
	data=df,
	aes(x=lwage, group=as.factor(private), fill=as.factor(private), weight = weight),
	alpha=0.5,
	adjust=2
	) +
	xlim(0, 4) +
	xlab("Log wage") +
	ylab("Density") +
	guides(fill=guide_legend(title="private"))
	```

	We can take another look at wages (and other statistics) by computing weighted summary statistics for observations corresponding to workers in the public and private sectors (reported below).
	As suggested by the plot, these summary statistics confirm that wages have a higher mean in the public sector, and they also have less dispersion.
	Even though the maximum wage is higher in the private sector, the plots and further quantile inspections (not reported) suggest that increased variance in the public sector is mostly driven by the left-hand side of the distribution.

	```{r, results="hold", tidy=TRUE}
	df %>%
	filter(private == 0) %>%
	descr(stats = c("mean", "sd", "min", "max"), transpose = TRUE, weights = df[df$private==0,]$weight)
	df %>%
	filter(private == 1) %>%
	descr(stats = c("mean", "sd", "min", "max"), transpose = TRUE, weights = df[df$private==1,]$weight)
	```


	# b) Regression model of earnings

	A very straightforward earnings function would be
	$$
	\ln(\text{wage}_i) = \alpha + \beta \cdot \text{private}_i + \gamma \cdot x_i + \varepsilon_i,
	$$
	where the dependent variable is log-wage of individual $i$, $\text{private}_i$ is an indicator variable for whether individual $i$ works in the private sector, and $x_i$ is a vector of covariates that potentially includes unionization, education, and age.
	Below I estimate a "base" model whitout any controls (ie. $x$ is empty), and a "full" model with a collection of controls.

	```{r, message = FALSE, warning=FALSE}
	reg_base <- lm(lwage ~ private, data = df, weights = weight)
	reg_full <- lm(lwage ~ private + union + age + agesq + factor(edcat) + nonwhite + female + marr + marrfe, data = df, weights = weight)
	stargazer(reg_base, reg_full, type = "text")
	```

	The base model confirms the results from the previous section: without any controls, working in the private sector is associated with a wage that is `r round(abs(reg_base$coeff["private"])100, digits = 1)`% lower* on average.
	However, after controlling for different covariates the story is reversed: our model suggests that working in the private sector is associated with wage that is `r round(reg_full$coeff["private"]100, digits = 1)`% higher* on average.
	This suggests that the observed wage gap is explained at least to a large extent by the characteristics of the workers in each sector.




	# c) Blinder-Oaxaca decomposition

	```{r, message = FALSE}
	oax <- oaxaca(lwage ~ union + age + agesq + factor(edcat) + nonwhite + female + marr \| private \| union , data = df)
	oax.overall <- plot.oaxaca(oax, decomposition = "twofold", group.weight = -1, unexplained.split = FALSE, type = "overall")
	oax.factors <- plot.oaxaca(oax, decomposition = "twofold", group.weight = -1, unexplained.split = FALSE)
	```

	Now I decompose the wage gap between the two sectors using a Blinder-Oaxaca decomposition.
	The mean outcome value for observations in group A (i.e. the public sector) is `r round(oax$y$y.A, digits=3)`, and it is `r round(oax$y$y.B, digits=3)` for group B (i.e. the private sector).
	Thus, the overall difference between the mean log wages in both sectors (ie. A - B) is `r round(oax$y$y.diff*100, digits=1)`%.
	This total difference in mean outcomes is decomposed into

	- the explained portion, due to cross-group differences in the explanatory variables, which is `r round(oax.overall$data$value[1], digits=3)`, and
	- the unexplained portion (i.e. the remainder), attributable to unobserved factors, which is `r round(oax.overall$data$value[2], digits=3)`.

	The results of the twofold decomposition are plotted on the left hand side figure, with the figure on the right further decomposing these portions by variable.

	```{r, message = FALSE}
	grid.arrange(oax.overall, oax.factors, ncol = 2)
	```


	# d) Conclusion about earnings gap

	Based on the results from the Blinder-Oaxaca decomposition, the explained portion of the difference is higher than the observed difference (which is why the unexplained portion is negative). This suggests that the wage gap between both groups is smaller than what we would expect given the characteristics of workers in both groups.
	In other words, even though wages in the public sector are `r round(oax$y$y.diff*100, digits=1)`% higher on average, given their characteristics (e.g. more educated, older), there should be an even larger difference.

	Considering the role of specific variables, note that while unionization has a statistically significant positive effect on the explained portion of the wage gap, its magnitude is negligible. Age and education (specially higher education), on the other hand, have large and statistically significant positive effects in wage, and they account for much of the explained portion of the earnings gap.