Step 1
First we load the packages needed to create the example and plot the DAG.
#We load the package dagitty to create the IV model
library(dagitty)
#Definition of the DAG
g <- dagitty('dag {
X [pos="0,0"]
I [pos="-1,0"]
Y [pos="4,0"]
U [pos="2,2"]
W [pos="2,-2"]
I -> X -> Y
U -> X
U -> Y
W -> X
W -> Y
}')
#We load the package ggdag to plot the DAG
library(ggdag)
#Plot the DAG
ggdag(g) + theme_dag_blank()
Step 2
Based on the model previously defined we simulate the theoretical data. In this case we interested in the causal effect of X on Y, the variable I is the instrument, and variables U and W are unobserved confounders.
#We load the package lavaan to simulate the data
library(lavaan)
#Simulated model (effects of exogenous variables are set to be non-zero)
lavaan_model <- "X ~ .5*I + .4*U + .4*W
Y ~ .7*X + .1*U + .4*W"
#Data consistent with the simulated model with 1000 observations
set.seed(1234)
g_tbl <- simulateData(lavaan_model, sample.nobs=1000)
#Show the first rows of simulated data
head(g_tbl)
## X Y I U W
## 1 -1.4189076 -1.9098602 0.23531557 0.9724314 -1.32087141
## 2 -0.2432510 0.8651646 -0.01304215 0.1341461 0.09257077
## 3 1.5269161 1.2851663 1.40818583 0.8286825 -0.59320416
## 4 -2.0031355 -2.8247830 -1.68911977 -2.8507097 -1.66129610
## 5 0.4566986 1.2465048 -1.68652731 0.8188786 -0.57662826
## 6 1.6487979 -0.1609893 0.75620545 1.4406250 -1.04088989
Step 3
In order to check if the simulation created a good instrument I measure the correlation between I and X.
#Correlation check
cor.test(g_tbl$I,g_tbl$X)
##
## Pearson's product-moment correlation
##
## data: g_tbl$I and g_tbl$X
## t = 13.199, df = 998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3314420 0.4370615
## sample estimates:
## cor
## 0.3855139
Step 4
Now we estimate the causal effect using instrumental variables via two different methods available in R. First, we used the two stage least squares method step by step. Second, we used the package ivpack specific for IVs.
#Two stageleast squares
#First step by step
#Stage 1: Regress X on I
s1 <- lm(X ~ I, data = g_tbl)
#Get predicted value of X given I
predx <- predict(s1, type="response")
#Stage 2: Regress Y on predicted values of X
s2 <- lm(g_tbl$Y ~ predx)
#Summary the results
summary(s2)
##
## Call:
## lm(formula = g_tbl$Y ~ predx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1899 -0.9990 0.0060 0.9992 4.7011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.008825 0.046668 0.189 0.85
## predx 0.699069 0.099702 7.012 4.34e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.467 on 998 degrees of freedom
## Multiple R-squared: 0.04695, Adjusted R-squared: 0.04599
## F-statistic: 49.16 on 1 and 998 DF, p-value: 4.337e-12
The result shows that the estimated causal effect is approx 0.7 which is the value we simulate so the use of Iv in this case allow us to compute a causal effect pretty close to the real one simulated but without considering the unobserved confounders. Let’s see what are the results using the package ivpack
#Two least squares using ivpack
#We load the package ivpack
library(ivpack)
#we compute the causal IV model specifying the regression and the instrument
ivmodel <- ivreg(Y ~ X, ~ I, data=g_tbl)
#summary results showing standard errors
robust.se(ivmodel)
## [1] "Robust Standard Errors"
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0088247 0.0344555 0.2561 0.7979
## X 0.6990690 0.0714531 9.7836 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this result we obtained the same value for the causal effect of X on Y from the previous calculation but the package ivpack also give us the standard errors.
Step 5
To show the mistakes we can made in the case of not defining an instrument. We calculate the effect of X on Y directly and the effect of X on Y when we can measure U and W
#Effect of X on Y with without instrument I
summary(lm(Y ~ X, data = g_tbl))
##
## Call:
## lm(formula = Y ~ X, data = g_tbl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5005 -0.7308 -0.0406 0.7581 3.0766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01850 0.03345 0.553 0.58
## X 0.88400 0.02770 31.917 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.057 on 998 degrees of freedom
## Multiple R-squared: 0.5051, Adjusted R-squared: 0.5046
## F-statistic: 1019 on 1 and 998 DF, p-value: < 2.2e-16
We can see that the estimate of the causal effect of X on Y is wrong in this case because 0.88 is far from the simulated 0.7
#Effect of X on Y in case we can measure U and W
summary(lm(Y ~ X + U + W, data = g_tbl))
##
## Call:
## lm(formula = Y ~ X + U + W, data = g_tbl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0141 -0.6737 0.0037 0.6703 3.3444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01011 0.03122 0.324 0.746134
## X 0.73480 0.02906 25.283 < 2e-16 ***
## U 0.11501 0.03281 3.505 0.000476 ***
## W 0.40206 0.03337 12.047 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9858 on 996 degrees of freedom
## Multiple R-squared: 0.5702, Adjusted R-squared: 0.5689
## F-statistic: 440.5 on 3 and 996 DF, p-value: < 2.2e-16
Results
In this example we shown that when we define an instrumental variable it is possible to obtain a good estimate of the causal effect under study even if we can’t measure confounded variables.