Applied microeconometrics

Weeks 7 and 8 - Instrumental variables

Josh Merfeld

KDI School

November 4, 2024

What are we doing today?

  • Introduction to IVs

    • Requirements/assumptions
  • IVs and RCTs

  • In a world of LATE

  • Weak instruments

Instrumental variables

  • Instrumental variables (IVs) are a way to estimate causal effects when we have endogeneity
    • The endogeneity can take many forms: omitted variables, measurement error, simultaneity, etc.
  • Consider my paper: effects of pollution on agricultural productivity
    • What’s the problem with simply regressing productivity on pollution?

Endogeneity in the pollution example

Endogeneity in the pollution example

Putting structure on this

  • What we really want to estimate is this: \[\begin{gather} \label{eq:iv1} productivity_{it} = \beta_0 + \beta_1 pollution_{it} + \epsilon_{it} \end{gather}\] where \(\beta_1\) is the causal effect of pollution on productivity.

  • Endogeneity is defined as \(cov(pollution_{it}, \epsilon_{it})\neq0\)

    • That is, the error term is correlated with the endogenous variable
    • A common example is omitted variables

Putting structure on this

\[\begin{gather} \tag{1} productivity_{it} = \beta_0^* + \beta_1^* pollution_{it} + \epsilon_{it}^* \end{gather}\]

  • When we estimate this, due to the way OLS works, the residuals and pollution will be orthogonal
    • That is, \(cov(pollution_{it}, \epsilon_{it}^*)=0\)
    • This is a property of OLS
  • However, the issue is that under endogeneity, \(\beta^*_1\neq\beta_1\)
    • That is, the OLS estimate of \(\beta_1\) is biased for the true structural parameter

Putting structure on this

  • Another way to think about it is that what we want to estimate is this: \[\begin{gather} productivity_{it} = \beta_0 + \beta_1 pollution_{it} + \beta_2 X_{it} + \epsilon_{it} \end{gather}\]

  • But if we don’t properly control for everything – in this case \(X\) – we are really estimating this: \[\begin{gather} \label{eq:iv2} productivity_{it} = \tilde{\beta_0} + \tilde{\beta_1} pollution_{it} + \eta_{it}, \end{gather}\] where \(\eta_{it} = \beta_2 X_{it} + \epsilon_{it}\).

Differences in differences?

  • One solution is to use a difference-in-differences (DiD) approach

  • This requires the assumption of parallel trends

    • That is, the trends in the outcome variable would have been the same in the absence of the treatment
  • But what if changing economic growth is leading to changes in both pollution and productivity?

    • Then the parallel trends assumption is violated since areas with more pollution are also experiencing faster economic growth

Control for growth?

  • If you’re willing to make assumptions about what the omitted variables are, maybe you could control for them

  • But this is a strong assumption

    • No matter what we do, we’ll have to make assumptions, though

Enter: instruments

  • Let’s take a different approach

  • We’ll use an instrument

    • A variable that is correlated with the endogenous variable (pollution) but is not correlated with the error term

Instrument in the pollution example

Requirements of an instrument

  • I very purposefully created the example so that the instrument is correlated with pollution
    • But it’s not directly correlated with productivity
    • And it’s not correlated with the omitted variable (the error term… will show you this in a second)
  • Let’s look at these more formally

Back to our problem

\[\begin{gather} \tag{3} productivity_{it} = \tilde{\beta_0} + \tilde{\beta_1} pollution_{it} + \eta_{it} \end{gather}\]

  • Can we estimate a version of this equation – that is, without controlling for \(X_{it}\) – and still get causal effects?

  • Maybe, if we can find a valid instrument.

  • So what makes an instrument valid?

What else can instruments help with?

  • It turns out IVs can also help with measurement error

    • If we have a variable that is measured with error, we can use an instrument to correct for this
  • From Hansen, consider the model: \[\begin{gather} X = Q + u, \end{gather}\] where \(X\) is the variable we observe, \(Q\) is the variable we want to measure, and \(u\) is measurement error.

  • Assume that \(cov(u, Q)=0\), so that the measurement error is random, i.e. uncorrelated with the true value of \(Q\).

    • This is known as classical measurement error

Classical measurement error and attenuation bias

  • We want to estimate: \[\begin{gather} Y = \beta_0 + \beta_1 Q + \epsilon, \end{gather}\] but what we really estimate is: \[\begin{gather} Y = \tilde{\beta}_0 + \tilde{\beta}_1 X + \tilde{\epsilon} = \tilde{\beta}_0 + \tilde{\beta}_1 (Q + u) + \tilde{\epsilon} \end{gather}\]

Classical measurement error and attenuation bias

  • This is what we get: \[\begin{gather} \tilde{\beta}_1 = \beta_1\left(1-\frac{\mathbb{E}(u^2)}{\mathbb{E}(X^2)}\right) \end{gather}\]

  • By definition, \(\mathbb{E}(X^2)>\mathbb{E}(u^2)\), so \(|\tilde{\beta}_1|<|\beta_1|\).

    • Why is this true?
    • That is, the OLS estimate of \(\beta_1\) is biased towards zero
    • This is called attenuation bias, but it is only guaranteed when the measurement error is classical (random); see the simulation sketch below
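  • A minimal simulated sketch of this (made-up numbers, not from Hansen), showing the attenuation and an IV fix in which a second noisy measure of \(Q\) serves as the instrument:

Code
library(fixest)

set.seed(1)
n   <- 10000
dat <- data.frame(Q = rnorm(n))          # the true regressor we would like to observe
dat$X <- dat$Q + rnorm(n)                # observed regressor with classical measurement error
dat$Y <- 1 + 2*dat$Q + rnorm(n)          # true beta_1 = 2
dat$Z <- dat$Q + rnorm(n)                # a second, independently mismeasured report of Q

coef(feols(Y ~ X, data = dat))           # OLS: slope attenuated toward 1 (= 2*var(Q)/var(X))
coef(feols(Y ~ 1 | X ~ Z, data = dat))   # IV with Z as the instrument: slope back near 2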

Requirements for an instrument

\[\begin{gather} \tag{3} productivity_{it} = \tilde{\beta_0} + \tilde{\beta_1} pollution_{it} + \eta_{it} \end{gather}\]

  1. The instrument must be correlated with the endogenous variable (pollution)

  2. The instrument must not be correlated with the error term (\(\eta_{it}\))

    • Note that this implies two things:
      • The instrument must not be correlated with any omitted variable (here \(X_{it}\))
      • The instrument must not directly affect the outcome (\(productivity_{it}\))

Using an instrument

  • If we can find a valid instrument, we can use it to estimate the causal effect of pollution on productivity

  • The simplest example uses two stages:

    1. \(pollution_{it} = \pi_0 + \pi_1 instrument_{it} + \nu_{it}\)
    2. \(productivity_{it} = \phi_0 + \phi_1 pollution_{it} + \zeta_{it}\)
  • We can then estimate \(\phi_1\) using OLS

    • Note that only under certain circumstances will \(\phi_1=\beta_1\)
    • More on this later
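  • To make the two-stage logic concrete, here is a small simulated sketch (hypothetical numbers, not my actual pollution application): omitted growth drives both pollution and productivity, so OLS is biased, but the second stage run on first-stage fitted values recovers the true effect

Code
# Simulated example: omitted "growth" makes pollution endogenous; true effect is -1
set.seed(2)
n   <- 5000
dat <- data.frame(growth = rnorm(n), z = rnorm(n))              # z is the instrument
dat$pollution    <- 0.5*dat$growth + 0.8*dat$z + rnorm(n)
dat$productivity <- 2 - 1*dat$pollution + 1.5*dat$growth + rnorm(n)

coef(lm(productivity ~ pollution, data = dat))        # OLS: biased upward (growth is omitted)

stage1 <- lm(pollution ~ z, data = dat)               # stage 1: pollution on the instrument
dat$pollution_hat <- fitted(stage1)
coef(lm(productivity ~ pollution_hat, data = dat))    # stage 2: close to -1
# Doing the two stages "by hand" gives the right point estimate but the wrong
# standard errors; more on this below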

The intuition with venn diagrams

The IV only affects productivity through pollution

This doesn’t work. Direct effects on productivity!

This doesn’t work. Correlated with growth!

Back to our “two stages”, redefining names

\[\text{Stage}\;1:\;T_{it} = \pi_0 + \pi_1 Z_{it} + \nu_{it}\] \[\text{Stage}\;2:\;Y_{it} = \phi_0 + \phi_1 T_{it} + \zeta_{it}\]

  • Requirements:
    • \(cov(Z_{it}, T_{it}) \neq 0\)
    • \(cov(Z_{it}, \zeta_{it}) = 0\)
  • We first regress T on the instrument to get \(\hat{T}_{it}\)
  • Then, we use the predicted values of T to estimate the effects on Y
    • If the IV is valid, these predicted values are unrelated to the omitted variables!

Some comments

\[\text{Stage}\;1:\;T_{it} = \pi_0 + \pi_1 Z_{it} + \nu_{it}\]

\[\begin{gather}cov(Z_{it}, T_{it}) \neq 0\end{gather}\]

  • This is the first requirement

  • We can test this!

    • F-test of all excluded instruments in the first stage
    • I say all excluded instruments because you can technically have more than one

Some comments

\[\text{Stage}\;1:\;T_{it} = \pi_0 + \pi_1 Z_{it} + \nu_{it}\] \[\text{Stage}\;2:\;Y_{it} = \phi_0 + \phi_1 T_{it} + \zeta_{it}\]

\[\begin{gather}cov(Z_{it}, \zeta_{it}) = 0\end{gather}\]

  • This is the second requirement

  • We cannot explicitly test this

    • This is an identifying assumption
    • We need this to be true to attribute causality to the second stage

Some comments

\[\text{Stage}\;1:\;T_{it} = \pi_0 + \pi_1 Z_{it} + \nu_{it}\] \[\text{Stage}\;2:\;Y_{it} = \phi_0 + \phi_1 T_{it} + \zeta_{it}\]

\[\begin{gather}cov(Z_{it}, \zeta_{it}) = 0\end{gather}\]

  • Note that we will use \(Z_{it}\) to predict \(T_{it}\).
    • We cannot actually observe \(cov(Z_{it}, \zeta_{it})\)
  • So if \(cov(Z_{it}, \zeta_{it})\neq0\)
    • Then this correlation will be contained in the predicted values, \(\hat{T}_{it}\)
    • i.e. the predicted values will still be endogenous

IVs in supply and demand

  • Economists have long been interested in supply and demand
    • Obviously…
  • How does a change in supply affect prices?
    • Not a straightforward question to answer, because prices are determined jointly by supply and demand
    • We can’t determine what is changing when we observe market prices
    • One option: an instrument that moves only one side of the market
  • Small note: this is how IVs originally came about in economics

Favara and Imbs, 2015 (American Economic Review)

  • How does the availability of credit affect house prices?

  • They use a change in deregulation of banks in the US

    • This deregulation led to an increase in credit supply
    • But it did not affect credit demand, since it was a supply-side change
  • Idea: show the change in credit availability for banks affected by the change

    • And no change for banks not affected by the change

Deregulation index across states and years

Two stages: predict credit supply, then predict house prices

\[\begin{align} &\text{Stage 1: } credit_{ct} = \delta_0 + \delta_1 deregulation_{ct} + \delta_2 X_{ct} + \alpha_c + \gamma_t + \nu_{ct} \\ &\text{Stage 2: } price_{ct} = \beta_0 + \beta_1 credit_{ct} + \beta_2 X_{ct} + \phi_c + \eta_t + \zeta_{ct} \end{align}\]

  • They instrument for \(credit\) using \(deregulation\)
    • \(deregulation\) is correlated with \(credit\) but not with \(\zeta_{ct}\), according to the authors
    • (Let’s ignore whether this is true for now since it’s so contextual)
  • They control for \(X_{ct}\), which is a vector of controls
  • This is also a two-way fixed effects specification:
    • \(\alpha_c\) and \(\gamma_t\) (\(\phi_c\) and \(\eta_t\) in stage 2) are county and year fixed effects

Replication data: week7files/hmda_merged.dta

Code
# fixest and dplyr are used throughout (feols, filter), so load them up front
library(fixest)
library(dplyr)
library(haven)
df <- read_dta("week7files/hmda_merged.dta")
head(df)
# A tibble: 6 × 99
   year county state_n yryear_1994 yryear_1995 yryear_1996 yryear_1997
  <dbl>  <dbl>   <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
1  1994   1001       1           1           0           0           0
2  1995   1001       1           0           1           0           0
3  1996   1001       1           0           0           1           0
4  1997   1001       1           0           0           0           1
5  1998   1001       1           0           0           0           0
6  1999   1001       1           0           0           0           0
# ℹ 92 more variables: yryear_1998 <dbl>, yryear_1999 <dbl>, yryear_2000 <dbl>,
#   yryear_2001 <dbl>, yryear_2002 <dbl>, yryear_2003 <dbl>, yryear_2004 <dbl>,
#   yryear_2005 <dbl>, Dl_nloans_b <dbl>, LDl_nloans_b <dbl>,
#   Dl_vloans_b <dbl>, LDl_vloans_b <dbl>, Dl_nden_b <dbl>, LDl_nden_b <dbl>,
#   Dl_lir_b <dbl>, LDl_lir_b <dbl>, Dl_nsold_b <dbl>, LDl_nsold_b <dbl>,
#   Dl_nloans_pl <dbl>, LDl_nloans_pl <dbl>, Dl_vloans_pl <dbl>,
#   LDl_vloans_pl <dbl>, Dl_nden_pl <dbl>, LDl_nden_pl <dbl>, …
Code
# key controls: LDl_hpi Dl_inc LDl_inc Dl_pop LDl_pop Dl_her_v LDl_her_v
# instrument: Linter_bra
# endogenous variables: Dl_nloans_b Dl_vloans_b Dl_lir_b
# weights: w1
# restriction: border counties only (border==1)
# county and year FE
# cluster on state

Reduced form

  • It is common to also estimate the reduced form
    • This is a regression of the outcome of interest directly on the instrument
  • In this case, this equals \[\begin{gather} price_{ct} = B_0 + B_1 deregulation_{ct} + B_2 X_{ct} + \cdots \end{gather}\]

Reduced form

Code
bordercounties <- df |> filter(border==1)
summary(feols(Dl_hpi ~ Linter_bra + LDl_hpi + Dl_inc + LDl_inc + Dl_pop + LDl_pop + Dl_her_v + LDl_her_v | county + year, 
        data = bordercounties, weights = bordercounties$w1,
        cluster = "state_n"))
OLS estimation, Dep. Var.: Dl_hpi
Observations: 2,937 
Weights: bordercounties$w1 
Fixed-effects: county: 267,  year: 11
Standard-errors: Clustered (state_n) 
            Estimate Std. Error   t value   Pr(>|t|)    
Linter_bra  0.004217   0.001822  2.314494 2.6813e-02 *  
LDl_hpi     0.530888   0.041265 12.865486 1.2778e-14 ***
Dl_inc      0.144029   0.046402  3.103911 3.8332e-03 ** 
LDl_inc     0.033606   0.046377  0.724637 4.7363e-01    
Dl_pop      0.428247   0.149652  2.861615 7.1620e-03 ** 
LDl_pop     0.410567   0.172030  2.386604 2.2713e-02 *  
Dl_her_v   -0.004457   0.003411 -1.306403 2.0018e-01    
LDl_her_v  -0.003473   0.002327 -1.492225 1.4486e-01    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.006641     Adj. R2: 0.47468
                 Within R2: 0.34867

First stage

Code
bordercounties <- df |> filter(border==1)
reg1 <- feols(Dl_nloans_b ~ Linter_bra + LDl_hpi + Dl_inc + LDl_inc + Dl_pop + LDl_pop + Dl_her_v + LDl_her_v | county + year, 
              data = bordercounties, weights = bordercounties$w1,
              cluster = "state_n")
reg2 <- feols(Dl_vloans_b ~ Linter_bra + LDl_hpi + Dl_inc + LDl_inc + Dl_pop + LDl_pop + Dl_her_v + LDl_her_v | county + year, 
              data = bordercounties, weights = bordercounties$w1,
              cluster = "state_n")
reg3 <- feols(Dl_lir_b ~ Linter_bra + LDl_hpi + Dl_inc + LDl_inc + Dl_pop + LDl_pop + Dl_her_v + LDl_her_v | county + year, 
              data = bordercounties, weights = bordercounties$w1,
              cluster = "state_n")

First stage

                        Loans      Loan volume   Loan-to-inc. ratio
IV                      0.034***   0.034**       0.034***
                        (0.011)    (0.013)       (0.012)
House price (lag)       0.280      0.647**       0.653**
                        (0.261)    (0.251)       (0.248)
Inc. p.c.               1.37**     1.56***       1.01*
                        (0.555)    (0.486)       (0.518)
Inc. p.c. (lag)         0.310      0.682*        0.467
                        (0.345)    (0.370)       (0.357)
Population              5.43***    5.48***       4.99***
                        (1.34)     (1.56)        (1.65)
Population (lag)        0.115      0.996         0.918
                        (1.35)     (1.61)        (1.64)
Herf. index             -0.105***  -0.087**      -0.087**
                        (0.033)    (0.033)       (0.034)
Herf. index (lag)       -0.120**   -0.134**      -0.142**
                        (0.044)    (0.055)       (0.057)
Observations            2,914      2,914         2,914
F-test for instrument   8.986      6.917         7.803

Note: F-test differs from results in paper due to differences in how xtreg calculates standard errors.
Standard errors clustered on state in parentheses.

First stage predictions vs. actual values… what do you notice?

First stage predictions vs. actual values… what do you notice?

             min       max      SD
Actual      -1.792     3.128    0.326
Predicted   -0.346     1.020    0.158
  • Note how much less variance there is in the predicted values than the actual values
    • This is the point of using an instrument!
    • We are able to isolate the variation in the endogenous variable that is not correlated with the error term
      • This is of course only a subset of the total variation in the endogenous variable
  • This will be important later
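  • One way to produce numbers in this spirit, reusing reg1 (the Dl_nloans_b first stage estimated above); this ignores the weights, so the exact values will differ slightly:

Code
predicted <- fitted(reg1)                 # first-stage fitted values (estimation sample)
actual    <- predicted + resid(reg1)      # actual values on the same sample
c(sd_actual = sd(actual), sd_predicted = sd(predicted))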

We cannot simply use the predicted values in the second stage… standard errors will be wrong!

Code
# create a macro for the main regression controls (to avoid repetition and save space)
setFixest_fml(..controls = ~ LDl_hpi + Dl_inc + LDl_inc + Dl_pop + LDl_pop + Dl_her_v + LDl_her_v)
# Let's use feols to estimate the two stages
reg1 <- feols(Dl_hpi ~ ..controls | county + year | Dl_nloans_b ~ Linter_bra, 
              data = bordercounties, weights = bordercounties$w1,
              cluster = "state_n")
reg2 <- feols(Dl_hpi ~ ..controls | county + year | Dl_vloans_b ~ Linter_bra, 
              data = bordercounties, weights = bordercounties$w1,
              cluster = "state_n")
reg3 <- feols(Dl_hpi ~ ..controls | county + year | Dl_lir_b ~ Linter_bra, 
              data = bordercounties, weights = bordercounties$w1,
              cluster = "state_n")

fixest will give us the correct standard errors, however (first stage)

Code
# first stage:
etable(
      reg1, reg2, reg3,
      stage = 1,
      se.below = TRUE,
      depvar = FALSE,
      signif.code = c("***" = 0.01, "**" = 0.05, "*" = 0.1),
      digits = "r3",
      digits.stats = "r0",
      fitstat = c("ivwald", "n"), # make sure to use ivwald for first-stage F-test
      coefstat = "se",
      group = list(controls = "LDl_hpi"),
      keep = "Linter_bra"
    )

fixest will give us the correct standard errors, however (first stage)

                           Loans      Loan volume   Loan-to-inc. ratio
IV (deregulation index)    0.034***   0.034**       0.034***
                           (0.011)    (0.013)       (0.012)
controls                   Yes        Yes           Yes
Fixed-Effects:
county                     Yes        Yes           Yes
year                       Yes        Yes           Yes
Wald (1st stage)           8.986      6.917         7.803
Observations               2,914      2,914         2,914

Note: The Wald (similar to F-test) values do not equal the values in the paper due to differences in how xtreg calculates standard errors.
Standard errors clustered on state in parentheses.

fixest will give us the correct standard errors, however (second stage)

Code
# second stage:
etable(
      reg1, reg2, reg3,
      stage = 2,
      se.below = TRUE,
      depvar = FALSE,
      signif.code = c("***" = 0.01, "**" = 0.05, "*" = 0.1),
      digits = "r3",
      digits.stats = "r3",
      fitstat = c("ivwald", "n"), # make sure to use ivwald for first-stage F-test
      coefstat = "se",
      group = list(controls = "LDl_hpi"),
      keep = c("Dl_nloans_b", "Dl_vloans_b", "Dl_lir_b")
    )

fixest will give us the correct standard errors, however (second stage)

                      (1)        (2)        (3)
Loans                 0.123*
                      (0.066)
Loan volume                      0.123*
                                 (0.066)
Loan-to-inc. ratio                          0.121*
                                            (0.061)
controls              Yes        Yes        Yes
Fixed-Effects:
county                Yes        Yes        Yes
year                  Yes        Yes        Yes
Wald (1st stage)      8.986      6.917      7.803
Observations          2,914      2,914      2,914

Note: The Wald (similar to F-test) values do not equal the values in the paper due to differences in how xtreg calculates standard errors.
Standard errors clustered on state in parentheses.

Note the syntax for fixest

feols(y ~ x | fe1 + fe2 | endogenousvar ~ z, ...)

feols(y ~ x | fe1 + fe2 | endogenousvar1 + endogenousvar2 ~ z1 + z2, ...)

  • All controls should be in the first stage, as well as the second
    • fixest does this for us automatically
  • The package also automatically calculates correct standard errors in the second stage
    • For the “generated regressor”

Estimating it all together

  • With just a single instrument and a single endogenous variable, there is a single first stage

  • Let’s continue with our outcome \(Y\), our endogenous variable \(X\), and our exogenous variables \(Z\) (which includes the instrument)

  • It turns out that we can write \(\hat{\beta}_{IV}\) as: \[\begin{gather} \hat{\beta}_{IV}=\left((Z'Z)^{-1}(Z'X)\right)^{-1}\left((Z'Z)^{-1}(Z'Y)\right) \end{gather}\]

Estimating it all together

\[\begin{gather} \tag{14} \hat{\beta}_{IV}=\left((Z'Z)^{-1}(Z'X)\right)^{-1}\left((Z'Z)^{-1}(Z'Y)\right) \end{gather}\]

  • We can immediately see two things:

    • The requirement that \(Z\) predicts \(X\) is necessary to invert the first term

    • The IV estimate scales the reduced form by the first stage

Just a quick note that this simplifies

\[\begin{align} \tag{14} \hat{\beta}_{IV}&=\left((Z'Z)^{-1}(Z'X)\right)^{-1}\left((Z'Z)^{-1}(Z'Y)\right) \\ &=(Z'X)^{-1}(Z'Z)(Z'Z)^{-1}(Z'Y) \\ &=(Z'X)^{-1}(Z'Y) \end{align}\]
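  • A quick simulated check (hypothetical data) that \((Z'X)^{-1}Z'Y\) and the two-stage routine agree in the just-identified case:

Code
set.seed(3)
n   <- 2000
dat <- data.frame(z = rnorm(n), v = rnorm(n))
dat$x <- 1 + 0.7*dat$z + dat$v                        # first stage
dat$y <- 2 + 1.5*dat$x + 0.8*dat$v + rnorm(n)         # x endogenous through v; true beta = 1.5

Z <- cbind(1, dat$z)                                  # constant plus excluded instrument
X <- cbind(1, dat$x)                                  # constant plus endogenous variable
solve(t(Z) %*% X, t(Z) %*% dat$y)                     # (Z'X)^{-1} Z'Y

coef(feols(y ~ 1 | x ~ z, data = dat))                # matches the matrix formula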

Binary instrument and binary treatment

  • Let’s consider a binary instrument and a binary treatment
    • \(Z\) and \(D\) are binary, i.e. \(Z,D\in\{0,1\}\)
  • It turns out there is a very real case where we can find a valid instrument that is binary
    • Treatment assignment in an RCT!

RCTs and IV

  • Banerjee et al. (2015): The Miracle of Microfinance? Evidence from a Randomized Evaluation (AEJ: Applied)

  • They are interested in the effects of access to credit on outcomes

    • They randomly assign households (sort of) to microcredit access
  • Z: whether or not the household was offered microcredit

    • This is a binary instrument
  • X: whether or not the household received credit

    • This is a binary endogenous variable

Effects of the program on outcomes in endline 1

Code
df <- read_dta("week7files/banerjeeetal.dta")
# create a macro for the main regression controls (to avoid repetition and save space)
setFixest_fml(..controls = ~ area_pop_base + area_debt_total_base + area_business_total_base + area_exp_pc_mean_base + 
                              area_literate_head_base + area_literate_base)
# they control for baseline values of NEIGHBORHOOD means of these variables

  • They estimate: \[\begin{gather} y_{in} = \beta_0 + \beta_1 Z_{n} + \sum_{k=1}^K\gamma_k X_{kn} + \varepsilon_{in}, \end{gather}\] where \(Z_{n}\) is the treatment variable (whether neighborhood \(n\) was offered microcredit) and standard errors are clustered at the areaid (neighborhood) level

Reduced form

Code
reg1 <- feols(any_biz_1 ~ treatment + ..controls, 
              data = df, weights = df$w1,
              cluster = "areaid")
reg2 <- feols(bizassets_1 ~ treatment + ..controls, 
              data = df, weights = df$w1,
              cluster = "areaid")
reg3 <- feols(bizprofit_1 ~ treatment + ..controls, 
              data = df, weights = df$w1,
              cluster = "areaid")
table <- etable(reg1, reg2, reg3,
                digits = 3, fitstat = c("n"), se.below = TRUE, depvar = FALSE,
                # change significance codes to the norm
                signif.code = c("***" = 0.01, "**" = 0.05, "*" = 0.1),
                group = list(controls = "area_pop_base"), keep = "treatment")

Reduced form, clean table

                Any biz?   Biz assets   Biz profits
treatment       0.005      421.4        345.7
                (0.019)    (310.8)      (315.9)
controls        Yes        Yes          Yes
Observations    6,186      6,186        6,186

Standard errors clustered on neighborhood in parentheses.

First stage

Code
reg1 <- feols(anymfi_1 ~ treatment + ..controls, 
              data = df, weights = df$w1,
              cluster = "areaid")
reg2 <- feols(anyloan_1 ~ treatment + ..controls, 
              data = df, weights = df$w1,
              cluster = "areaid")
table <- etable(reg1, reg2,
                digits = 3, fitstat = c("n"), se.below = TRUE, depvar = FALSE,
                # change significance codes to the norm
                signif.code = c("***" = 0.01, "**" = 0.05, "*" = 0.1),
                group = list(controls = "area_pop_base"), keep = "treatment")

First stage, clean table

                Any MFI loan?   Any loan?
treatment       0.083***        -0.018
                (0.026)         (0.013)
controls        Yes             Yes
Observations    6,186           6,186

Standard errors clustered on neighborhood in parentheses.

IV results

Code
reg1 <- feols(any_biz_1 ~ ..controls | anymfi_1 ~ treatment, 
              data = df, weights = df$w1,
              cluster = "areaid")
reg2 <- feols(bizassets_1 ~ ..controls | anymfi_1 ~ treatment, 
              data = df, weights = df$w1,
              cluster = "areaid")
reg3 <- feols(bizprofit_1 ~ ..controls | anymfi_1 ~ treatment, 
              data = df, weights = df$w1,
              cluster = "areaid")

table <- etable(reg1, reg2, reg3,
                digits = 3, fitstat = c("ivwald", "n"), se.below = TRUE, depvar = FALSE,
                # change significance codes to the norm
                signif.code = c("***" = 0.01, "**" = 0.05, "*" = 0.1),
                group = list(controls = "area_pop_base"), keep = "anymfi_1")

IV results, clean table

                    Any biz?   Biz assets   Biz profits
Has MFI loan        0.062      5,092.5      4,177.2
                    (0.229)    (4,182.9)    (3,876.0)
controls            Yes        Yes          Yes
Wald (1st stage)    9.8326     9.8326       9.8326
Observations        6,186      6,186        6,186

Standard errors clustered on neighborhood in parentheses.

Putting them together

Code
# reduced form
reg1 <- feols(any_biz_1 ~ treatment + ..controls, 
              data = df, weights = df$w1,
              cluster = "areaid")
# first stage
reg2 <- feols(anymfi_1 ~ treatment + ..controls, 
              data = df, weights = df$w1,
              cluster = "areaid")
# IV result
reg3 <- feols(any_biz_1 ~ ..controls | anymfi_1 ~ treatment, 
              data = df, weights = df$w1,
              cluster = "areaid")

  • Coefficient on reduced form: 0.0051

  • Coefficient on first stage: 0.0828

  • Coefficient on IV: 0.0620

    • Can you figure out how this is related to the RF and FS?
    • This is a ratio: \(\frac{\hat{\beta}_{RF}}{\hat{\beta}_{FS}} = \hat{\beta}_{IV}\)
    • The IV result scales the reduced form by the first stage
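  • As a check, we can compute the ratio directly from the regressions above (fixest labels instrumented regressors with a fit_ prefix):

Code
rf <- coef(reg1)["treatment"]      # reduced form: ~0.0051
fs <- coef(reg2)["treatment"]      # first stage:  ~0.0828
rf / fs                            # ~0.0620
coef(reg3)["fit_anymfi_1"]         # the 2SLS coefficient: the same number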

Putting them together, the intuition

  • The IV estimate is a ratio of two coefficients
    • The reduced form coefficient and the first stage coefficient
  • In this example, treatment increases MFI loan take-up by 8.2 percentage points.
    • In other words, the treatment effect is driven by a change in MFI loan take-up among 8.2 percent of households
  • If the probability of owning a business goes up by 0.005 (0.5 p.p.), what is the change in probability of owning a business for those who take up the MFI loan?
    • 0.005/0.082! This is the IV estimate

The Wald estimator

  • This is sometimes referred to as the wald estimator (Wald 1940) \[\begin{gather} \beta = \frac{\mathbb{E}\left[Y\mid Z=1\right]-\mathbb{E}\left[Y\mid Z=0\right]}{\mathbb{E}\left[X\mid Z=1\right]-\mathbb{E}\left[X\mid Z=0\right]} \end{gather}\]

  • Note that these expectations are not observed

    • We estimate them with the reduced form and first stage
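  • A minimal sketch of the Wald estimator using the Banerjee et al. variables from above; it drops the controls and weights, so the number will only be roughly similar to the regression-based IV estimate:

Code
z1  <- df$treatment == 1
num <- mean(df$any_biz_1[z1], na.rm = TRUE) - mean(df$any_biz_1[!z1], na.rm = TRUE)
den <- mean(df$anymfi_1[z1],  na.rm = TRUE) - mean(df$anymfi_1[!z1],  na.rm = TRUE)
num / den      # difference in mean outcomes scaled by the difference in take-up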

Interpreting IV estimates

  • So this IV estimate is driven by the change in MFI loan take-up among 8.2 percent of households
    • What does this mean for the effect of MFI loans on business ownership?
  • Two worlds:
    • Homogeneous treatment effects
    • Heterogeneous treatment effects
  • Remember how I said an IV identifies just certain kinds of variation?
    • This will come into play here

Homogeneous treatment effects

  • We had a similar discussion when we talked about DiD

  • If everyone has the same treatment effect, then it doesn’t matter what variation we isolate

    • All variation will be identifying the same effect
  • In this case, the IV estimates the average treatment effect

  • But what if effects are not homogeneous?

Heterogeneous treatment effects

  • What if not everyone has the same treatment effect?
    • In other words, what if different types of variation are identifying different effects?
  • Imagine a world in which we have an endogenous variable, \(D\)
    • Imagine we also have multiple valid instruments: \(Z_1\) and \(Z_2\)
  • If \(Z_1\) and \(Z_2\) are correlated with different “parts” of \(D\), then they can be isolating different variation in \(D\)
    • This also means that the two IV estimates can differ, even though both instruments are valid!

Defining the LATE

  • We need to define four separate groups:
    • Compliers
    • Always-takers
    • Never-takers
    • Defiers
  • Let’s look at these four groups assuming a binary treatment

Compliers

Never-takers

Always-takers

Defiers

In Hansen, where X is treatment assignment

Comparing the four groups

What are we estimating?

  • Never takers never take up the treatment
    • If we have no variation in treatment for them, we can’t estimate the effect of the treatment on them
    • Same goes for always takers
  • That leaves us with two groups: compliers and defiers
    • Let’s make one more assumption: \(P(X(1)-X(0)<0)=0\) (or \(>0\))
    • i.e. there are no defiers

What are we estimating?

  • This is called the local average treatment effect (LATE)

  • This is the effect of the treatment on compliers

    • i.e. the effect of the treatment on those who are induced to take up treatment because of the instrument
  • Again, if treatment effects are homogeneous, the effect on compliers is the same as for everyone else

    • In this case, the LATE is the ATE
    • But, do we really think this is ever true?

Different instruments, different effects

  • One implication of LATE is that different instruments can identify different effects
    • In other words, the group of “compliers” can differ across instruments, even if all the instruments are valid
  • Example:
    • Interested in the effects of going to college
    • Instrument 1: whether or not you live close to a college
    • Instrument 2: whether or not you have a scholarship

This might be okay, though

  • When we think about interventions, we often think about the margins of the intervention
    • In other words, we are interested in the effect of the intervention on those who are induced to take up the intervention
  • If a government is considering a new program/policy, then the effects will always be driven by those who are induced to take up the program/policy
    • In other words, the compliers
    • So identifying a LATE might actually be policy relevant in some contexts!
  • One final note:
    • The LATE interpretation also holds for non-binary instruments
    • Interpretation of what it means to be a “complier” is a bit more complicated, though

Some notes on compliers under LATE

  • The first stage tells us the complier share of the overall population (when both the instrument and the treatment are binary)
    • A small note: the more compliers there are, the less problematic violations of the exclusion restriction are (Angrist et al., 1996)
  • We can learn a bit about characteristics of compliers, too, using a similar intuition
    • Works with discrete characteristics; a small sketch follows below
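  • A hedged sketch of that intuition, again with the Banerjee et al. data; old_biz here is a hypothetical binary baseline characteristic (any binary variable in the data would do):

Code
# Complier share = the first-stage coefficient (binary Z, binary D)
fs_all <- coef(feols(anymfi_1 ~ treatment, data = df, cluster = "areaid"))["treatment"]

# Compliers' share of a binary characteristic x:
#   P(x = 1 | complier) = P(x = 1) * (first stage among x = 1) / (first stage overall)
fs_x1 <- coef(feols(anymfi_1 ~ treatment, data = subset(df, old_biz == 1),
                    cluster = "areaid"))["treatment"]
fs_all                                            # complier share of the sample
mean(df$old_biz, na.rm = TRUE) * fs_x1 / fs_all   # share of compliers with old_biz = 1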

Weak instruments

  • Let’s return to our discussion about the first stage: \(Z\) must be correlated with \(X\)
    • If \(Z\) is not correlated with \(X\), then we cannot identify the effect of \(X\) on \(Y\)
  • We often think about this in terms of the first stage F-statistic
    • Is the F-statistic high “enough”?
    • What is high “enough” in this context?
  • We used to think about \(F>10\), but recent literature argues it should be even higher!
    • e.g. Pflueger and Wang (2013): closer to 23
    • Lee et al. (2020) argue for 100 or higher
      • Focus on t-statistic, not the coefficient
      • Lower F-statistics mean the critical value should actually be higher than 1.96
    • No “right” answer, but higher is better

Compulsory school attendance and earnings

  • Let’s look at an example: Angrist and Krueger (1991)
    • They are interested in the returns to schooling
  • Basic idea:
    • School attendance laws require students to stay in school until a certain age
    • Consider a school year that starts on August 1st
      • Someone born on August 2nd must wait for the next school year and so starts school nearly one year older than someone born on July 31st
  • Instrument for school attendance using the time of birth
    • “Individuals born in the beginning of the year start school at an older age, and can therefore drop out after completing less schooling than individuals born near the end of the year.”

Compulsory school attendance and earnings, year/quarter of birth

Compulsory school attendance and earnings, reduced form

The model

\[\begin{gather} y = \beta s + \varepsilon \\ s = \gamma Z + \eta, \end{gather}\]

  • \(y\) is earnings
  • \(s\) is years of schooling
  • \(Z\) is the instrument
    • They use interactions between year and quarter of birth

Bias in OLS

  • If \(\varepsilon\) and s are correlated, then OLS gives biased estimates

  • The bias is: \[\begin{gather} E\left[\hat\beta_{OLS}-\beta\right] = \frac{Cov(s,\varepsilon)}{Var(s)} \end{gather}\]

  • Let’s rename this ratio as \(\frac{\sigma_{\varepsilon\eta}}{\sigma_{s}^2}\)

Bias in OLS and first stage F-statistics

  • It turns out we can approximate the bias in 2SLS as: \[\begin{gather}\frac{\sigma_{\varepsilon\eta}}{\sigma_{s}^2}\frac{1}{F+1} \end{gather}\]

  • Note that if the first stage is weak, \(F\) is closer to zero and the 2SLS bias is closer to the OLS bias

    • If the first stage is strong, \(F\) is larger and the bias gets closer to zero

Bound et al. (1995), JASA

  • Bound et al. (1995) were the first to point this problem out
    • You see, Angrist and Krueger added a lot of instruments to some of their specifications
    • The addition of more instruments can be a problem: it tends to decrease the first-stage F-statistic
  • Let’s take a look at their results

Note what happens to the IV coefficient as F decreases

A weak first stage won’t necessarily lead to large standard errors

  • I used to think a weak first stage would lead to large standard errors
    • This is not necessarily true
  • Bound et al. do a simulation exercise where they create completely random instruments
    • In other words, by construction, the instruments should not predict the endogenous variable

Random instruments and standard errors

More problems with weak instruments

\[\begin{gather*} \hat{\beta}_{2SLS} = \frac{Cov(Y, Z)}{Cov(X, Z)} \end{gather*}\]

  • We’ve seen this before: the IV estimate is a ratio of covariances (or the ratio of the reduced form and the first stage)

  • With weak instruments, \(Cov(X,Z)\) is small

    • This means that small changes in \(Cov(Y,Z)\) can lead to large changes in \(\hat{\beta}_{2SLS}\)
    • Asymptotically, this isn’t a problem. But in small samples…
    • We’re back to something we’ve seen before: might need relatively large sample sizes to reliably estimate what you want to estimate!
  • This is a problem with ratios more generally. Try bootstrapping a ratio whose denominator is close to zero and see what happens (a small simulation follows).
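  • A small simulation of exactly this (made-up data; pi controls the strength of the instrument):

Code
set.seed(4)
sim_iv <- function(n = 500, pi = 0.05, beta = 1) {
  z <- rnorm(n)
  v <- rnorm(n)
  x <- pi*z + v                       # weak first stage when pi is small
  y <- beta*x + 0.8*v + rnorm(n)      # x endogenous through v
  cov(y, z) / cov(x, z)               # the IV estimate as a ratio of covariances
}
draws <- replicate(2000, sim_iv())
summary(draws)                        # huge tails: tiny denominators blow the ratio up
hist(draws, breaks = 100, main = "IV estimates with a weak first stage")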

Example from Goldsmith-Pinkham’s slides

  • Rather than create my own, I’m going to use Paul’s example
    • https://github.com/paulgp/applied-methods-phd
  • Let’s look at three things:
    • The behavior of the first stage when the instrument is weak (he calls this Pi hat)
    • The relationship between the first stage and the second stage
    • The behavior of the 2SLS estimator as a whole when the instrument is weak

Marginally significant first stage, simulations

Marginally significant first stage, simulations

Marginally significant first stage, simulations

Marginally significant first stage, simulations

  • The distribution of \(\hat\beta\) is absolutely not normal
    • Asymptotics won’t save you here!
  • Note that this problem can (mostly) disappear when the first stage is strong
    • For example, a larger sample size will lead to better behavior of the estimator
  • Again, asymptotic approximations – just like with the CLT and skewed distributions – won’t necessarily apply

Takeaways

  • Looking at the second stage won’t necessarily tell you if the first stage is weak

  • Nowadays, it is very common to report the first stage F-statistic

    • You can’t write a paper without reporting it
  • The key idea is that many instruments can increase bias, even if it isn’t obvious

    • Part of the problem is related to overfitting, which we’ll cover in a few weeks
    • In fact, Angrist and Kolesar (2023) argue that weak instruments may not be a huge problem in the just-identified (i.e. one instrument) case!
  • Chernozhukov and Hansen (2008) detail a routine to calculate confidence intervals that are valid regardless of the strength of the first stage (in the just-identified case).

    • Packages in both Stata and R

Overidentification tests

  • In the previous case, we had many instruments
    • This is called overidentification
  • With overidentification, it is possible to test the “validity” of the instruments…
    • … if we are willing to assume at least one of the instruments is valid!
  • The intuition: different instruments should give us the same result

Overidentification tests

  • Consider a single endogenous \(X\) and two instruments, \(Z_1\) and \(Z_2\): \[\begin{gather} \mathbb{E}\left[Z_1Y\right]=\mathbb{E}\left[Z_1X\right]\beta \\ \mathbb{E}\left[Z_2Y\right]=\mathbb{E}\left[Z_2X\right]\beta \end{gather}\]

  • The overidentification assumption says that \(\beta\) solves both equations simultaneously

    • In other words, \(\beta\) is the same for both instruments
  • If one instrument is valid and the other isn’t, they should give us different results

    • We can test this!
    • Sometimes referred to as an overidentification test, a Sargan test (or Sargan’s J), or a Sargan-Hansen test
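  • A simulated sketch of the mechanics, computing a Sargan-type statistic by hand rather than through a package: two instruments, one of which violates the exclusion restriction

Code
set.seed(5)
n   <- 5000
dat <- data.frame(z1 = rnorm(n), z2 = rnorm(n), v = rnorm(n))
dat$x <- 0.6*dat$z1 + 0.6*dat$z2 + dat$v
dat$y <- 1*dat$x + 0.5*dat$z2 + 0.8*dat$v + rnorm(n)     # z2 also affects y directly

iv     <- feols(y ~ 1 | x ~ z1 + z2, data = dat)
dat$u2 <- resid(iv)                                      # 2SLS residuals
aux    <- lm(u2 ~ z1 + z2, data = dat)                   # residuals on all instruments
J      <- n * summary(aux)$r.squared                     # Sargan statistic, chi-sq(1) under the null
c(J = J, p_value = pchisq(J, df = 1, lower.tail = FALSE))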

Overidentification tests

  • Consider a single endogenous \(X\) and two instruments, \(Z_1\) and \(Z_2\): \[\begin{gather} \tag{23} \mathbb{E}\left[Z_1Y\right]=\mathbb{E}\left[Z_1X\right]\beta \\ \tag{24} \mathbb{E}\left[Z_2Y\right]=\mathbb{E}\left[Z_2X\right]\beta \end{gather}\]

  • But there’s a problem…

    • And I already mentioned it. What’s the problem?
  • In a world of LATEs, the instruments can identify different effects

    • So we can’t really test the validity of the instruments!
  • TLDR: overidentification tests are not very useful (my take, anyway)

Shift-share instruments

Shift-share instruments (SSIV)

Before the theory…

  • Before getting into theory, let’s look at an example

  • Autor et al. (2013)

    • The China Syndrome: Local Labor Market Effects of Import Competition in the United States

Abstract

We analyze the effect of rising Chinese import competition between 1990 and 2007 on US local labor markets, exploiting cross-market variation in import exposure stemming from initial differences in industry specialization and instrumenting for US imports using changes in Chinese imports by other high-income countries.

  • Basic idea: use initial shares of import exposure
    • Instrument using change in Chinese imports in other high-income countries
    • This is the basic setup for a SSIV
    • They do more, so we just focus on the SSIV part

Chinese exports and local labor markets

  • Interested in wages (\(W_i\)), employment for traded goods (\(L_{Ti}\)), and employment for non-traded goods (\(L_{Ni}\))

\[\begin{align} W_i =& \;\sum_j c_{ij}\frac{L_{ij}}{L_{Ni}}\left[\theta_{ijC}E_{Cj}-\sum_k \theta_{ijk}\phi_{Cjk}A_{Cj}\right] \\ L_{Ti} =& \;\rho_i\sum_j c_{ij}\frac{L_{ij}}{L_{Ti}}\left[\theta_{ijC}E_{Cj}-\sum_k \theta_{ijk}\phi_{Cjk}A_{Cj}\right] \\ L_{Ni} =& \;\rho_i\sum_j c_{ij}\frac{L_{ij}}{L_{Ni}}\left[-\theta_{ijC}E_{Cj}+\sum_k \theta_{ijk}\phi_{Cjk}A_{Cj}\right] \end{align}\]

Chinese exports and local labor markets

\[\begin{align} W_i =& \;\sum_j c_{ij}\frac{L_{ij}}{L_{Ni}}\left[\theta_{ijC}E_{Cj}-\sum_k \theta_{ijk}\phi_{Cjk}A_{Cj}\right] \\ L_{Ti} =& \;\rho_i\sum_j c_{ij}\frac{L_{ij}}{L_{Ti}}\left[\theta_{ijC}E_{Cj}-\sum_k \theta_{ijk}\phi_{Cjk}A_{Cj}\right] \\ L_{Ni} =& \;\rho_i\sum_j c_{ij}\frac{L_{ij}}{L_{Ni}}\left[-\theta_{ijC}E_{Cj}+\sum_k \theta_{ijk}\phi_{Cjk}A_{Cj}\right] \end{align}\]

  • \(A_{Cj}\) is change in China’s “export-supply capability” in each industry
  • \(E_{Cj}\) is the change in expenditures within China in each industry
  • \(\theta_{ijC}\) is initial share of output in region \(i\) that is shipped to China
  • \(\theta_{ijk}\) is initial share of output in region \(i\) that is shipped to each market \(k\)
  • \(\phi_{Cjk}\) is initial share of imports from China in total purchases

Chinese exports and local labor markets

\[\begin{align} W_i =& \;\sum_j c_{ij}\frac{L_{ij}}{L_{Ni}}\left[\theta_{ijC}E_{Cj}-\sum_k \theta_{ijk}\phi_{Cjk}A_{Cj}\right] \\ L_{Ti} =& \;\rho_i\sum_j c_{ij}\frac{L_{ij}}{L_{Ti}}\left[\theta_{ijC}E_{Cj}-\sum_k \theta_{ijk}\phi_{Cjk}A_{Cj}\right] \\ L_{Ni} =& \;\rho_i\sum_j c_{ij}\frac{L_{ij}}{L_{Ni}}\left[-\theta_{ijC}E_{Cj}+\sum_k \theta_{ijk}\phi_{Cjk}A_{Cj}\right] \end{align}\]

  • “Positive shocks to China’s export supply decrease region \(i\)’s wage and employment in traded goods and increase its employment in non-traded goods. Similarly, positive shocks to China’s import demand increase region \(i\)’s wage and employment in traded goods and decrease its employment in non-traded goods.” (p. 2127)

What is endogenous here?

  • The initial share is certainly endogenous
  • The change for a specific region is also certainly endogenous!
  • “our main measure of local labor market exposure to import competition is the change in Chinese import exposure per worker in a region, where imports are apportioned to the region according to its share of national industry employment:” (p. 2128)

\[\begin{gather} \Delta IPW_{uit} = \sum_j\frac{L_{ijt}}{L_{ujt}}\frac{\Delta M_{ucjt}}{L_{it}} \end{gather}\]

  • \(\Delta M_{ucjt}\) is change in US imports from China in industry \(j\)

The change is endogenous, too!

p. 2128-2129

“A concern for our subsequent estimation is that realized US imports from China… may be correlated with industry import demand shocks, in which case the OLS estimate of how increased imports from China affect US manufacturing employment may understate the true impact, as both US employment and imports may be positively correlated with unobserved shocks to US product demand.”

  • The solution?
  • “[W]e instrument for growth in Chinese imports to the United States using the contemporaneous composition and growth of Chinese imports in eight other developed countries. Specifically, we instrument the measured import exposure variable \(\Delta IPW_{uit}\) with a non-US exposure variable \(\Delta IPW_{oit}\) that is constructed using data on contemporaneous industry-level growth of Chinese exports to other high-income markets:” (p. 2129)

\[\begin{gather} \Delta IPW_{oit} = \sum_j \frac{L_{ijt-1}}{L_{ujt-1}}\frac{\Delta M_{ocjt}}{L_{it-1}} \end{gather}\]

Back to Hull’s notes

  • Same paper, different syntax. Instrument is

\[\begin{gather} z_\ell = \sum_n s_{\ell n}g_n \end{gather}\]

for the model

\[\begin{gather} y_\ell = \beta x_\ell + w'_\ell\gamma + \varepsilon_\ell \end{gather}\]

  • \(x_\ell\): growth of Chinese import comp. in location \(\ell\)
  • \(y_\ell\): growth of outcome of interest
  • \(g_n\): growth of Chinese exports in industry \(n\) to non-US countries
  • \(s_{\ell n}\): initial share of employment (well, 10-year lags)
  • \(z_\ell\): instrument for \(x_\ell\) (predicted growth of Chinese import comp.)
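  • Mechanically, building the instrument is just \(z_\ell = \sum_n s_{\ell n} g_n\); a tiny sketch with made-up objects:

Code
set.seed(6)
n_loc  <- 200                                   # locations
n_ind  <- 50                                    # industries
shares <- matrix(rexp(n_loc*n_ind), nrow = n_loc)
shares <- shares / rowSums(shares)              # lagged employment shares (sum to 1 here)
g      <- rnorm(n_ind)                          # industry shocks (non-US import growth)
z      <- as.vector(shares %*% g)               # the shift-share instrument, one per location
head(z)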

What do we need?

Following Borusyak et al. (2024):

  • “Quasi-random shock assignment”: In our example, this is true when “expected growth of chinese imports \(g_n\) is the same across industries with high vs. low [shock-level unobservables] \(\bar{\varepsilon}_n\) (and [average exposure] \(s_n\))”
  • “Many uncorrelated shocks”: In our example, “imposes many uncorrelated industry growth rates and sufficiently different industry specialization across locations”
    • Hull notes that this is basically a “shock-level law of large numbers”
    • Essentially, the expected value of \(\sum_n s_n g_n \bar{\varepsilon}_n\) is zero

What do we need?

  • Important change: incomplete shares
    • Initial assumption is “constant sum-of-shares”: \(S_\ell=\sum_n s_{\ell n}=1\;\forall\;\ell\)
  • In our example, this is not true!
    • In practice, we can control for the sum-of-shares \(S_\ell\)
    • In panels, control for interaction between sum-of-shares and the year fixed effect (period effects)

Back to the paper

  • SSIV in the paper is \(z_{\ell t}=\sum_n s_{\ell nt}g_{nt}\)
    • \(n\): 397 different industries \(\times\) two periods
    • \(g_{nt}\): growth of Chinese imports in non-US economies per US worker
    • \(s_{\ell nt}\): lagged share of mfg. industry \(n\) in total employment of location \(\ell\)
  • In practice, Borusyak et al. (2024) suggest clustering by industry (since that is essentially the level of treatment)

Check “balance”

  • Can regress industry covariates on the shock. We expect null results.

  • Borusyak et al. (Table 3):

Balance variable                                           Coef.    SE
Production workers’ share of employment, 1991             -0.011   (0.012)
Ratio of capital to value-added, 1991                     -0.007   (0.019)
Log real wage (2007 USD), 1991                            -0.005   (0.022)
Computer investment as share of total, 1990                0.750   (0.465)
High-tech equipment as share of total investment, 1990     0.532   (0.296)

The table is Panel A of Table 3 in Borusyak et al. (2024).
  • Key: “Shocks do not predict industry-level observables controlling for period FE”
    • (Can also check location-level characteristics, as Borusyak et al. do)
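  • A hedged sketch of what such a shock-level balance regression could look like with fixest; the data and all variable names below are hypothetical stand-ins:

Code
# Hypothetical industry-by-period panel: shock g, a lagged covariate, exposure weights s_n
set.seed(7)
industry_df <- data.frame(
  industry       = rep(1:397, each = 2),
  period         = rep(1:2, times = 397),
  g              = rnorm(794),
  covariate_1991 = rnorm(794),
  s_n            = runif(794)
)
feols(covariate_1991 ~ g | period,
      data = industry_df, weights = industry_df$s_n, cluster = "industry")
# A coefficient near zero (as in Borusyak et al.'s Table 3) is consistent with
# quasi-random shock assignment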

What are we identifying?

  • Goldsmith-Pinkham, Sorkin, and Swift (2020)
    • See paper for more details
  • Big takeaway: they show that the SSIV estimator is equivalent to using many different IVs, one for each industry/market
    • You can derive the weights!
  • SSIV puts more weight on:
    • Share instruments with more extreme shocks \(g_n\)
    • Largest first stages

Requirement: “share exogeneity”

  • Share exogeneity means something a little different here: “all relevant unobservables are unforecastable from the shares” (Hull’s notes)

  • Key: Goldsmith-Pinkham, Sorkin, and Swift (2020) show that you can test it!

    • Check \(n\) with high weights
    • Can do balance and pre-trend tests

Recentered IV

Recentered IV

  • Borusyak and Hull (2023)

  • The idea:

    • Imagine a policy that rolls out over many years, like the building of roads
    • The location of roads might be endogenous, but maybe the exact completion date is not!
    • If the date of completion is somewhat random, we may be able to create an IV
  • Example I’ll use: roads in India

Roads in India, by wave of NSS

  • NSS has three waves of interest:
    • 2004-2005 (wave 61)
    • 2007-2008 (wave 64)
    • 2011-2012 (wave 68)

We might be interested in the following

\[\begin{gather} y_{it} = \alpha_i + \delta_t + \beta roads_{it} + X_{it}\gamma + \varepsilon_{it} \end{gather}\]

  • \(y_{it}\) is some outcome of interest
  • \(\alpha_i\) is district FE
  • \(\delta_t\) is time FE
  • \(roads_{it}\) is the length of roads in the district, \(\beta\) is the coefficient of interest
  • But there is a concern… what?
  • Perhaps roads are built in places that are trending in certain ways

Roads in India, by wave of NSS

Roads in India, by wave of NSS

The basic idea

  • The basic idea is similar to randomization inference

  • Find “expected” value based on randomized completion date

    • Instrument is actual - expected
    • This is the recentered IV
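  • A rough sketch of the recipe with made-up objects: hold the set of road projects fixed, permute completion years across projects many times, and subtract each district’s average simulated exposure from its actual exposure

Code
library(dplyr)

# Hypothetical project-level data: one row per road, with its district and completion year
set.seed(8)
roads <- data.frame(district        = sample(1:100, 2000, replace = TRUE),
                    completion_year = sample(2001:2011, 2000, replace = TRUE))

# Exposure in 2007 = number of a district's projects completed by 2007
exposure <- function(years) {
  roads |>
    mutate(year = years) |>
    group_by(district) |>
    summarise(exposed = sum(year <= 2007), .groups = "drop") |>
    pull(exposed)
}

actual   <- exposure(roads$completion_year)
expected <- rowMeans(replicate(1000, exposure(sample(roads$completion_year))))
recentered_z <- actual - expected    # recentered instrument: actual minus expected exposure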