diamonaj/causalinference.R

## causalinference.R
Spring 2019

*****INSTRUCTIONS*****

(1) Debugging--in the 3 cases below (a through c), identify the major coding error in each case and explain how to fix it, in 1-2
sentences. DO NOT actually copy/paste corrected code:

(a) https://gist.github.com/diamonaj/2e5d5ba5226b7b9760f5d1bf1e7bf765

(b) https://gist.github.com/diamonaj/3b6bc83d040098486634184d99fc4c55

(c) https://gist.github.com/diamonaj/a88cb40132ed8584e5182b585e1c84c8

Questions 2-4 below require the peacekeeping data set that we worked on in class, as well as this codebook:
https://www.nyu.edu/gsas/dept/politics/faculty/cohen/codebook.pdf

The class breakout instructions (including data download code) are here:
https://gist.github.com/diamonaj/3795bfc2e6349d00aa0ccfe14102858d

Define treatment as below:
Tr <- rep(0, length(foo$uncint))
Tr[which(foo$uncint != 0 & foo$uncint != 1)] <- 1
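
To see what these two lines do mechanically, here is a small sketch on a made-up vector. The name toy_uncint and its values are illustrative assumptions, not the real foo$uncint column, so check the codebook for what the actual codes mean.

```r
# Hypothetical stand-in for foo$uncint (values are made up for illustration)
toy_uncint <- c(0, 1, 2, 3, 0, 4)

# Same logic as above: start with all zeros, then flag every
# row whose code is neither 0 nor 1
Tr_toy <- rep(0, length(toy_uncint))
Tr_toy[which(toy_uncint != 0 & toy_uncint != 1)] <- 1

# A one-line version that gives the same result when there are no missing values
Tr_alt <- as.numeric(!(toy_uncint %in% c(0, 1)))

print(Tr_toy)                     # 0 0 1 1 0 1
print(identical(Tr_toy, Tr_alt))  # TRUE
```

Note that the two versions can differ if the column contains NA: which() drops missing values (leaving them at 0), while the %in% version would flag them as 1.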

(2a) What does this mean? What is "treatment"?


(2b) Replicate figure 8 in https://gking.harvard.edu/files/counterf.pdf.
The _original model_ simply includes every predictor.
The _modified model_ adds (wardur * untype4) to the original model.

(2c) Now add an ADDITIONAL interaction term (to the model above): add (wardur squared * untype4).

A few suggestions:
	a. Read the class breakout instructions above to get the data and relevant columns,
	b. If you are not clear on the model, read the relevant sections of the paper and focus on understanding Table 2;
	c. To plot the figure, you should use a strategy similar to the one we used in the statistics scavenger hunt, which was also used
  in a previous assignment (e.g., holding predictors at their means and looping through values of one variable to obtain treatment
  effects at different levels of the variable--you may want to review the answer key for that previous assignment, but please note
  that you WILL NOT have to simulate coefficients this time because there is no need to estimate uncertainty, such as intervals).
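
The "hold predictors at their means and loop" strategy in suggestion (c) can be sketched roughly as follows. This is a minimal illustration on simulated data, not the peacekeeping model: the data frame sim, the variables tr, dur, x1, and y, and all coefficient values are hypothetical.

```r
set.seed(1)

# Simulated stand-in data; every name and number here is hypothetical
n   <- 500
sim <- data.frame(
  tr  = rbinom(n, 1, 0.4),   # binary treatment
  dur = runif(n, 1, 300),    # the variable we will loop over
  x1  = rnorm(n)             # another predictor, held at its mean below
)
lin   <- -0.5 + 0.8 * sim$tr - 0.004 * sim$dur + 0.5 * sim$x1 +
         0.003 * sim$tr * sim$dur
sim$y <- rbinom(n, 1, plogis(lin))

# Logit model with a treatment-by-duration interaction
fit <- glm(y ~ tr + dur + x1 + tr:dur, data = sim, family = binomial)

# Loop through values of dur, holding x1 at its mean, and record the
# difference in predicted probability between treated and control units
dur_grid <- seq(5, 300, by = 5)
effect   <- numeric(length(dur_grid))
for (i in seq_along(dur_grid)) {
  treated <- data.frame(tr = 1, dur = dur_grid[i], x1 = mean(sim$x1))
  control <- data.frame(tr = 0, dur = dur_grid[i], x1 = mean(sim$x1))
  effect[i] <- predict(fit, newdata = treated, type = "response") -
               predict(fit, newdata = control, type = "response")
}

plot(dur_grid, effect, type = "l",
     xlab = "dur (simulated)", ylab = "Marginal effect of treatment")
```

To compare two model specifications on one figure, the same loop can be repeated with a second fitted model and the result added with lines().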


(4) Let's pretend you work for an NGO and your manager asks you to estimate the impact of the treatment identified above on lenient
peacebuilding success 2 years and 5 years after the war. You will have to search for these two outcome variables in the codebook.

(a) In no more than 1 sentence, articulate the causal question as best you can (being as clear as you can about treatment and control):

(b) In no more than 1 sentence, explain how/why SUTVA might be violated here. In no more than 1 additional sentence, explain how you
could in theory use the "restrict" argument (in Match()/GenMatch()) to help address this potential problem.

(c) Use simple logistic regression, propensity score matching, and genetic matching to try to answer these questions.

For the matching exercises, measure balance on AT LEAST the basic variables we considered in the class exercise.
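
As a rough sketch of the first two approaches, the code below fits an outcome logit and then matches on an estimated propensity score using the Matching package. The data are simulated and the names (dat, x1, x2, out_fit, ps_fit) are hypothetical; the propensity score specification shown is only an assumption for illustration, not a recommended functional form.

```r
library(Matching)
set.seed(3)

# Simulated stand-in data; all names are hypothetical
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
Tr <- rbinom(n, 1, plogis(0.6 * x1 - 0.4 * x2))
Y  <- rbinom(n, 1, plogis(-0.2 + 0.5 * Tr + 0.3 * x1))
dat <- data.frame(Y, Tr, x1, x2)

# (1) Simple logistic regression of the outcome on treatment and covariates
out_fit <- glm(Y ~ Tr + x1 + x2, data = dat, family = binomial)
coef(out_fit)["Tr"]   # log-odds coefficient on treatment

# (2) Propensity score model, then match on the fitted scores
ps_fit   <- glm(Tr ~ x1 + x2, data = dat, family = binomial)
ps_match <- Match(Y = dat$Y, Tr = dat$Tr, X = ps_fit$fitted.values,
                  estimand = "ATT", M = 1, replace = TRUE, ties = TRUE)

# Check balance on the underlying covariates after matching
ps_bal <- MatchBalance(Tr ~ x1 + x2, data = dat,
                       match.out = ps_match, print.level = 0)

ps_match$est               # matching estimate of the treatment effect
ps_bal$AMsmallest.p.value  # smallest after-matching balance p-value
```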

For the genetic matching exercise, population size should be at least 200 and you should run it for at least 25 generations
(which may require you to modify the number of non-changing generations). When performing genetic matching, take a little time to try
different approaches to producing excellent balance. You can tweak the values of "M", you can do caliper matching, you can match
on quadratic and/or interaction terms, you can add a propensity score, you can attempt exact matching, etc.

JUST ONE WORD OF ADVICE: The precise way you run GenMatch is how you have to run Match. For example, if you run GenMatch with M = 2 and
X includes interaction terms etc., then in the next line of code you have to run Match exactly the same way (passing the GenMatch output
as the Weight.matrix argument). Then in the next line you run MatchBalance, using the Match output.

Match with replacement and allow ties. Ideally, you would measure/optimize balance on the interaction terms and quadratic terms
as well (but this will make things a bit harder than simply balancing on the basic variables).
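
The GenMatch, then Match, then MatchBalance sequence described above might look roughly like this. It is a sketch on simulated data with hypothetical names (dat, x1, x2, x1x2), and the choice of covariates and settings is an assumption for illustration only. It also assumes that the smallest after-matching p-value reported by MatchBalance is the quantity to inspect for balance.

```r
library(Matching)
set.seed(2)

# Simulated stand-in data; all names are hypothetical
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
Tr <- rbinom(n, 1, plogis(0.5 * x1 - 0.5 * x2))
Y  <- rbinom(n, 1, plogis(0.3 * Tr + 0.4 * x1))
dat <- data.frame(Y, Tr, x1, x2)

# Covariates to match on, including an interaction term
X <- cbind(x1, x2, x1x2 = x1 * x2)

# Step 1: GenMatch searches for covariate weights that optimize balance
genout <- GenMatch(Tr = Tr, X = X, BalanceMatrix = X, estimand = "ATT", M = 2,
                   replace = TRUE, ties = TRUE,
                   pop.size = 200, max.generations = 25, wait.generations = 25,
                   print.level = 0)

# Step 2: Match with the SAME Tr, X, estimand, M, replace, and ties,
# passing the GenMatch output as Weight.matrix
mout <- Match(Y = Y, Tr = Tr, X = X, estimand = "ATT", M = 2,
              replace = TRUE, ties = TRUE, BiasAdjust = TRUE,
              Weight.matrix = genout)

# Step 3: MatchBalance on the Match output
mb <- MatchBalance(Tr ~ x1 + x2 + I(x1 * x2), data = dat,
                   match.out = mout, print.level = 0)

mout$est               # estimate with bias adjustment
mout$est.noadj         # estimate without bias adjustment
mb$AMsmallest.p.value  # smallest after-matching balance p-value
```

If a caliper or exact matching is added to the GenMatch call, the same arguments would need to be repeated in the Match call as well.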

Your final answer should include:

(i) a table like this one--the caption below the table should include the asterisked footnotes AS WELL AS (1) the functional forms of
the propensity score model, (2) the variables you've genetically matched on, and (3) the MatchBalance variables used for
genetic matching:

******TABLE FORMAT******* (Please give it a title)
                       tmt effect (bias adj)   tmt effect (no bias adj)   p-value (from MatchBalance)
logistic regression
len success 2 years    NA*
len success 5 years    NA*

p-score matching
len success 2 years                            **
len success 5 years                            **

gen match
len success 2 years                            **
len success 5 years                            **

*No need to provide bias-adjusted results for logistic regression--only for matching estimates.
**Only provide a treatment effect for matching results if your leximin p-value is above 0.10. Otherwise write in "NA".

(ii) Let's pretend you have to write a decision memo for policy purposes summarizing all your work (above). Your memo would begin with
a brief executive summary summarizing what you've done and your policy advice, and it would end with a brief concluding passage
restating your analysis and what you want your reader to take away from it (including the policy advice). The executive summary
and the conclusion would be very similar--to the extent the two are at all different, there is scope for the conclusion to be a bit
more technical and/or nuanced, and the conclusion could also include some recommendations for relevant future analysis.
DO NOT WRITE the ENTIRE decision memo. Instead, just provide a 3-5 sentence executive summary AND a separate
3-5 sentence conclusion. DO ADDRESS THE MEMO TO A SPECIFIC PERSON (USE YOUR IMAGINATION, BUT TAKE THE EXERCISE SERIOUSLY.)