Applied microeconometrics

Weeks 5 and 6 - Differences-in-differences and synthetic control

Josh Merfeld

KDI School

October 14, 2024

Introduction

What are we doing today?

  • Canonical differences-in-differences

    • Inference
    • Wild cluster bootstrap
  • Fixed effects vs. random effects

  • Bias in two-way fixed effects

Differences-in-differences

Differences-in-differences… in the 1800s?

Differences-in-differences… in the 1800s? (from Cunningham’s CI)

Differences-in-differences

  • More commonly referred to as “DID” or “diff-in-diff”

    • Classic reference: Card and Krueger (1994)
  • Most common method, likely because data requirements are least stringent

  • Example in Mostly Harmless: offering credit to banks during the Great Depression (Richardson and Troost, 2009)

    • Setup: Mississippi is split between two Federal Reserve districts; both the Atlanta and St. Louis Feds lent to banks in their part of the state
    • Atlanta fed favored lending to banks in trouble
    • St. Louis fed favored the exact opposite

Richardson and Troost (2009) - Mississippi dividing line

Did the policy of extra lending save banks?

  • Basic idea: compare what happened to Atlanta fed banks (southern Mississippi) with St. Louis fed banks (northern Mississippi)

  • Could compare after lending, but what’s the assumption here?

  • Assumption: same levels before intervention (very strict assumption)

In fact, pre-intervention levels are different!

Did the policy of extra lending save banks?

  • Instead, compare changes from before to after treatment

  • Assumption: parallel trends

  • If valid, the fact that the districts were different prior to the treatment isn’t a problem

Why is it “differences in differences”?

  • Difference 1: St. Louis post minus St. Louis pre

  • Difference 2: Atlanta post minus Atlanta pre

  • Difference-in-differences: Difference 2 minus difference 1

“Differences in differences” graphically

Richardson and Troost (2009) - Testing the assumption

Estimating diff-in-diff empirically

  • Can be estimated in a straightforward regression: \[ Y_{it} = \beta_0 + \beta_1 TREAT_i + \beta_2 POST_t + \beta_3 (POST_t \times TREAT_i) + \varepsilon_{it} \]
  • \(\beta_0\): pre mean for the comparison group
  • \(\beta_1\): difference in the pre mean between the treated and untreated group
  • \(\beta_2\): difference in means between the pre and post period for the comparison group
  • \(\beta_3\): difference-in-differences estimate
    • This is the difference in the change from pre to post for the treated group relative to the comparison group
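
As a quick check of these interpretations, here is a minimal sketch with simulated (hypothetical) data: the four group means reproduce each coefficient exactly.

Code
# Simulated illustration of the coefficient interpretations above
library(fixest)

set.seed(1)
n <- 1000
treat <- rep(c(0, 1), each = n/2)   # comparison vs. treated group
post  <- rep(c(0, 1), times = n/2)  # pre vs. post period
y <- 2 + 1*treat + 0.5*post + 1.5*treat*post + rnorm(n)
df_sim <- data.frame(y, treat, post)

feols(y ~ treat + post + treat:post, data = df_sim)

# the coefficients match the four group means:
means <- with(df_sim, tapply(y, list(treat, post), mean))
means["0", "0"]                                     # beta_0: comparison group, pre
means["1", "0"] - means["0", "0"]                   # beta_1: pre difference
means["0", "1"] - means["0", "0"]                   # beta_2: post minus pre, comparison group
(means["1", "1"] - means["1", "0"]) -
  (means["0", "1"] - means["0", "0"])               # beta_3: difference-in-differences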

Card and Krueger (1994) - Minimum wage and employment

Code
ckdata <- read_csv("week5files/cardkruegerlong.csv")
head(ckdata)
# A tibble: 6 × 15
  sheet chain co_owned state southj centralj northj   pa1   pa2 shore calls
  <dbl> <dbl>    <dbl> <dbl>  <dbl>    <dbl>  <dbl> <dbl> <dbl> <dbl> <chr>
1   506     2        1     0      0        0      0     1     0     0 0    
2    56     4        1     0      0        0      0     1     0     0 0    
3    61     4        1     0      0        0      0     1     0     0 0    
4    62     4        1     0      0        0      0     1     0     0 2    
5   445     1        0     0      0        0      0     0     1     0 0    
6   451     1        0     0      0        0      0     0     1     0 0    
# ℹ 4 more variables: fulltime <chr>, parttime <chr>, managers <chr>,
#   post <dbl>
Code
# note that state = 1 for NJ and 0 for PA.
# also note that post = 1 for 1993 and 0 for 1992
# NJ is treated group, so state = 1 means treat = 1

Card and Krueger (1994) - Minimum wage and employment

Code
# looks like fulltime is a character! let's try to make it numeric
ckdata$fulltime_num <- as.numeric(ckdata$fulltime)
Warning: NAs introduced by coercion
Code
ckdata$parttime_num <- as.numeric(ckdata$parttime)
Warning: NAs introduced by coercion

Card and Krueger (1994) - Minimum wage and employment

Code
# said there are NAs in the data, so let's see where they are
ckdata %>% filter(is.na(fulltime_num)) %>% dplyr::select(fulltime, fulltime_num)
# A tibble: 18 × 2
   fulltime fulltime_num
   <chr>           <dbl>
 1 .                  NA
 2 .                  NA
 3 .                  NA
 4 .                  NA
 5 .                  NA
 6 .                  NA
 7 .                  NA
 8 .                  NA
 9 .                  NA
10 .                  NA
11 .                  NA
12 .                  NA
13 .                  NA
14 .                  NA
15 .                  NA
16 .                  NA
17 .                  NA
18 .                  NA
Code
# ah, so they are .! those are missing values in Stata, so leave as missing.

Card and Krueger (1994) - Minimum wage and employment

Code
reg1 <- feols(fulltime_num ~ state + post + state:post, data = ckdata, vcov = "HC1")
summary(reg1)
OLS estimation, Dep. Var.: fulltime_num
Observations: 798 
Standard-errors: Heteroskedasticity-robust 
            Estimate Std. Error  t value   Pr(>|t|)    
(Intercept)  9.99342    1.21845  8.20172 9.5089e-16 ***
state       -2.26949    1.29598 -1.75118 8.0301e-02 .  
post        -2.27342    1.56537 -1.45232 1.4681e-01    
state:post   2.99498    1.68446  1.77801 7.5786e-02 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 8.24468   Adj. R2: 0.002734

Card and Krueger (1994) - Minimum wage and employment

Code
reg1 <- feols(fulltime_num ~ state + post + state:post, data = ckdata, vcov = "HC1")
reg2 <- feols(parttime_num ~ state + post + state:post, data = ckdata, vcov = "HC1")
table <- etable(reg1, reg2,
                # standard errors, digits, fit statistics, put SE below coefficients (the norm)
                vcov = "HC1", digits = 3, fitstat = "", se.below = TRUE, 
                # change significance codes to the norm
                signif.code = c("***" = 0.01, "**" = 0.05, "*" = 0.1),
                # rename the variables
                dict = c("Constant" = "Intercept", "state" = "Treat", "post" = "Post", "state:post" = "Treat x Post"))
table
                        reg1         reg2
Dependent Var.: fulltime_num parttime_num
                                         
Constant             9.99***      19.7***
                    (1.22)        (1.09) 
Treat               -2.27*        -1.04  
                    (1.30)        (1.23) 
Post                -2.27          0.103 
                    (1.57)        (1.60) 
Treat x Post         2.99*        -0.405 
                    (1.68)        (1.80) 
_______________ ____________ ____________
S.E. type       Hetero.-rob. Hetero.-rob.
---
Signif. codes: 0 '***' 0.01 '**' 0.05 '*' 0.1 ' ' 1
Code
# drop some rows
table <- table[-c(1:2, 11:nrow(table)),]
# rename columns
colnames(table) <- c("", "Full-time", "Part-time")

Card and Krueger (1994) - Minimum wage and employment

Code
kable(table, 
      align = "lcc", booktabs = TRUE, linesep = "", escape = FALSE, row.names = FALSE) %>%
  column_spec(1,width = "4cm") %>%
  column_spec(c(2:3),width = "3cm") %>%
  kable_styling() %>%
  footnote("* p < 0.1, ** p < 0.05, *** p < 0.01.", general_title = "",
            footnote_as_chunk = TRUE
            ) %>%
  footnote("Note: Robust standard errors in parentheses.", general_title = "",
            footnote_as_chunk = TRUE
            )
                Full-time    Part-time
Constant         9.99***     19.7***
                (1.22)       (1.09)
Treat           -2.27*       -1.04
                (1.30)       (1.23)
Post            -2.27         0.103
                (1.57)       (1.60)
Treat x Post     2.99*       -0.405
                (1.68)       (1.80)
* p < 0.1, ** p < 0.05, *** p < 0.01.
Note: Robust standard errors in parentheses.

Card and Krueger (1994) - Poisson regression!

Code
reg1 <- feglm(fulltime_num ~ state + post + state:post, data = ckdata, vcov = "HC1", family = "poisson")
reg2 <- feglm(parttime_num ~ state + post + state:post, data = ckdata, vcov = "HC1", family = "poisson")
table <- etable(reg1, reg2,
                # standard errors, digits, fit statistics, put SE below coefficients (the norm)
                vcov = "HC1", digits = 3, fitstat = "", se.below = TRUE, 
                # change significance codes to the norm
                signif.code = c("***" = 0.01, "**" = 0.05, "*" = 0.1),
                # rename the variables
                dict = c("Constant" = "Intercept", "state" = "Treat", "post" = "Post", "state:post" = "Treat x Post"))
table
                        reg1         reg2
Dependent Var.: fulltime_num parttime_num
                                         
Constant             2.30***      2.98***
                    (0.122)      (0.056) 
Treat               -0.258*      -0.054  
                    (0.135)      (0.063) 
Post                -0.258        0.005  
                    (0.176)      (0.081) 
Treat x Post         0.347*      -0.021  
                    (0.192)      (0.092) 
_______________ ____________ ____________
S.E. type       Hetero.-rob. Hetero.-rob.
---
Signif. codes: 0 '***' 0.01 '**' 0.05 '*' 0.1 ' ' 1
Code
# drop some rows
table <- table[-c(1:2, 11:nrow(table)),]
# rename columns
colnames(table) <- c("", "Full-time", "Part-time")

Card and Krueger (1994) - Minimum wage and employment

Code
kable(table, caption = "Poisson regression", # adding a caption
      align = "lcc", booktabs = TRUE, linesep = "", escape = FALSE, row.names = FALSE) %>%
  column_spec(1,width = "4cm") %>%
  column_spec(c(2:3),width = "3cm") %>%
  kable_styling() %>%
  footnote("* p < 0.1, ** p < 0.05, *** p < 0.01.", general_title = "",
            footnote_as_chunk = TRUE
            ) %>%
  footnote("Note: Robust standard errors in parentheses.", general_title = "",
            footnote_as_chunk = TRUE
            )
Poisson regression
                Full-time    Part-time
Constant         2.30***      2.98***
                (0.122)      (0.056)
Treat           -0.258*      -0.054
                (0.135)      (0.063)
Post            -0.258        0.005
                (0.176)      (0.081)
Treat x Post     0.347*      -0.021
                (0.192)      (0.092)
* p < 0.1, ** p < 0.05, *** p < 0.01.
Note: Robust standard errors in parentheses.

Estimating diff-in-diff empirically - adding controls

  • Can add control variables: \[ Y_{it} = \beta_0 + \beta_1 TREAT_i + \beta_2 POST_t + \beta_3 (POST_t \times TREAT_i) + X_{it}\gamma + \varepsilon_{it} \]

  • Adding controls can help control for differing trends (“conditional” parallel trends)

  • Note: the interpretation of \(\beta_0\) is no longer the same; others stay the same

Standard errors in differences-in-differences

  • Card and Krueger did not cluster standard errors
    • In fact, that would have been difficult because they really only had two “clusters”!
    • But their robust standard errors likely understate the true uncertainty, because outcomes are correlated within states (i.e., clustered)
  • Classic reference: Bertrand, Duflo, and Mullainathan (2004)
    • “How Much Should We Trust Differences-in-Differences Estimates?”

Standard errors in differences-in-differences

  • Bertrand, Duflo, and Mullainathan (2004) suggest three possibilities:
    1. Cluster at the group level
    2. Block bootstrap (not going to discuss)
    3. Aggregating data into one pre and one post period (event studies later)
  • Let’s go through these

Clustering

  • The most common approach: cluster standard errors

  • Cameron, Gelbach, and Miller (2008) show that this is problematic with few clusters

    • “Bootstrap-based improvements for inference with clustered errors”
  • The authors look at many possible approaches and find that the “wild cluster bootstrap” seems to perform best, on average

Wild cluster bootstrap

  • The wild cluster bootstrap is a “non-parametric” bootstrap

    • I’ll walk through the non-clustered version as an example; software makes the clustered version easy!
  • Suppose we are interested in the following regression: \[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

  • We want to test whether \(\beta_1 = 0\) and we have relatively few clusters (say between 5 and 30)

Wild cluster bootstrap

\[\begin{gather}\label{wcreg} y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \end{gather}\]

  1. Estimate above regression and obtain \(\hat{\beta}\), \(\hat{\varepsilon}\)

  2. Impose the null hypothesis (\(\beta_1 = 0\)) and estimate the restricted regression: \[\begin{gather}\label{wcregres} y_i = \tilde{\beta}_0 + \tilde{\varepsilon}_i \end{gather}\]

Wild cluster bootstrap

  3. Bootstrap replications (for \(b = 1, \dots, B\)):
  • Use the restricted estimates to generate \(y_i^{b} = \tilde{\beta}_0 + w_i^{b}\tilde{\varepsilon}_i\)
    • Rademacher weights: \(w_i^{b}\) equals \(+1\) or \(-1\) with equal probability, so the randomness comes from either adding \(\tilde{\varepsilon}_i\) or \(-\tilde{\varepsilon}_i\) (in the cluster version, the same weight applies to every observation within a cluster)
  • Estimate the unrestricted regression on the bootstrap sample:

\[\begin{gather}\label{wcregB} y_i^{b} = \tilde{\beta}_0^{b*} + \tilde{\beta}_1^{b*}x_i + \tilde{\varepsilon}_i^{b*} \end{gather}\]

  • Calculate the t-statistic for the bootstrap replication:

\[\begin{gather}\label{wctstat} t^{b*} = \frac{\tilde{\beta}_1^{b*}}{\sqrt{\tilde{V}^{b*}}} \end{gather}\]

Wild cluster bootstrap

Two-tailed test:

  • Compare the original t-statistic to the bootstrap distribution: for each replication \(b = 1, \dots, B\), check whether \(|t^{b*}| > |t^{*}|\), where \(t^{*}\) is the t-statistic from the original (unrestricted) regression.

  • P-value across \(B\) bootstrap samples is: \[\begin{gather}\frac{1}{B}\sum_{b=1}^B \mathbb{I}(|t^{b*}| > |t^{*}|),\end{gather}\]

where \(\mathbb{I}\) is the indicator function.
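
Before turning to the package, here is a minimal by-hand sketch of this procedure using simulated (hypothetical) data. It skips refinements that the package handles for us (for example, enumerating all weight combinations when there are very few clusters).

Code
# By-hand restricted wild cluster bootstrap (illustration only)
library(fixest)

set.seed(2)
G <- 10; n_g <- 30                                   # few clusters, 30 obs each
cluster <- rep(1:G, each = n_g)
x <- rnorm(G)[cluster] + rnorm(G * n_g)
y <- 1 + 0*x + rnorm(G)[cluster] + rnorm(G * n_g)    # true beta_1 = 0
dat <- data.frame(y, x, cluster)

# 1. unrestricted regression and original t-statistic
unres <- feols(y ~ x, data = dat, cluster = ~cluster)
t_orig <- coef(unres)["x"]/se(unres)["x"]

# 2. restricted regression, imposing beta_1 = 0 (intercept only)
res <- feols(y ~ 1, data = dat)

# 3. bootstrap replications with cluster-level Rademacher weights
B <- 999
t_boot <- numeric(B)
for (b in 1:B) {
  w <- sample(c(-1, 1), G, replace = TRUE)[cluster]  # one weight per cluster
  dat$y_b <- fitted(res) + w*resid(res)
  reg_b <- feols(y_b ~ x, data = dat, cluster = ~cluster)
  t_boot[b] <- coef(reg_b)["x"]/se(reg_b)["x"]
}

# 4. bootstrap p-value
mean(abs(t_boot) >= abs(t_orig))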

Implementing WCB in R

  • Thankfully there’s a package that allows us to do this!

    • fwildclusterboot (Friedrich, 2019)
  • This package works with fixest objects!

  • Let’s use the castle.dta data in the GitHub repo to test this

Implementing WCB in R

Code
library(haven) # to load .dta files
df <- read_dta("week5files/castle.dta")
head(df)
# A tibble: 6 × 185
  state   year   sid   cdl pre2_cdl caselaw anywhere assumption civil homicide_c
  <chr>  <dbl> <dbl> <dbl>    <dbl>   <dbl>    <dbl>      <dbl> <dbl>      <dbl>
1 Alaba…  2000     1     0        0       0        0          0     0        329
2 Alaba…  2001     1     0        0       0        0          0     0        379
3 Alaba…  2002     1     0        0       0        0          0     0        303
4 Alaba…  2003     1     0        0       0        0          0     0        299
5 Alaba…  2004     1     0        1       0        0          0     0        254
6 Alaba…  2005     1     0        1       0        0          0     0        374
# ℹ 175 more variables: robbery_gun_r <dbl>, jhcitizen_c <dbl>,
#   jhpolice_c <dbl>, homicide <dbl>, robbery <dbl>, assault <dbl>,
#   burglary <dbl>, larceny <dbl>, motor <dbl>, murder <dbl>,
#   hc_felonywsus <dbl>, jhcitizen <dbl>, jhpolice <dbl>, population <dbl>,
#   police <dbl>, unemployrt <dbl>, income <dbl>, blackm_15_24 <dbl>,
#   whitem_15_24 <dbl>, blackm_25_44 <dbl>, whitem_25_44 <dbl>, prisoner <dbl>,
#   lagprisoner <dbl>, poverty <dbl>, exp_subsidy <dbl>, …
Code
# key variables: state, year, cdl ("treatment"), and homicide_c (outcome)
# homicide_c to rate (per 100,000 people)
df$homicide_c <- (df$homicide_c/df$population)*100000
# and log
df$homicide_c <- log(df$homicide_c)

Implementing WCB in R

Code
# Note: this is not differences-in-differences. 
# Just an example of the wild cluster bootstrap with fwildclusterboot
reg1 <- feols(homicide_c ~ cdl, data = df, cluster = "state")
summary(reg1)
OLS estimation, Dep. Var.: homicide_c
Observations: 550 
Standard-errors: Clustered (state) 
            Estimate Std. Error  t value   Pr(>|t|)    
(Intercept) 1.360284   0.082267 16.53498  < 2.2e-16 ***
cdl         0.338002   0.092532  3.65279 0.00063065 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.578226   Adj. R2: 0.036511

Implementing WCB in R

Code
reg1 <- feols(homicide_c ~ cdl, data = df, cluster = "state")
boot_reg <- boottest(
                    reg1, 
                    clustid = c("state"), 
                    param = "cdl", 
                    B = 10000,
                    type = "rademacher" # default weighting, by the way
                    )
boot_reg
boottest.fixest(object = reg1, param = "cdl", B = 10000, clustid = c("state"), 
    type = "rademacher")
 
p value: 5e-04 
confidence interval: 0.1541 0.5249 
test statistic 3.6528 

Add controls

Code
reg1 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt, data = df, cluster = "state")
reg1
OLS estimation, Dep. Var.: homicide_c
Observations: 550 
Standard-errors: Clustered (state) 
                 Estimate Std. Error   t value   Pr(>|t|)    
(Intercept)      8.328111   5.001764  1.665035 1.0229e-01    
cdl              0.156808   0.088948  1.762923 8.4149e-02 .  
log(population)  0.304923   0.062867  4.850276 1.2902e-05 ***
log(income)     -1.061536   0.477547 -2.222894 3.0868e-02 *  
unemployrt      -0.006588   0.021765 -0.302699 7.6340e-01    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.471373   Adj. R2: 0.35618
Code
boot_reg <- boottest(
                    reg1, 
                    clustid = c("state"), 
                    param = "cdl", 
                    B = 10000,
                    type = "rademacher" # default weighting, by the way
                    )
boot_reg
boottest.fixest(object = reg1, param = "cdl", B = 10000, clustid = c("state"), 
    type = "rademacher")
 
p value: 0.1001 
confidence interval: -0.0301 0.3393 
test statistic 1.7629 

Can change null hypothesis, like cdl = 0.1

Code
reg1 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt, data = df, cluster = "state")
(boot_reg <- boottest(
                    reg1, 
                    clustid = c("state"), 
                    param = "cdl", 
                    r = 0.1, # null hypothesis is cdl = 0.1
                    B = 10000,
                    type = "rademacher" # default weighting, by the way
                    ))
boottest.fixest(object = reg1, param = "cdl", B = 10000, clustid = c("state"), 
    r = 0.1, type = "rademacher")
 
p value: 0.5441 
confidence interval: -0.0335 0.3393 
test statistic 0.6387 

Multi-way clustering, too!

Code
reg1 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt, data = df, cluster = c("state", "year"))
(boot_reg <- boottest(
                    reg1, 
                    clustid = c("state", "year"), 
                    param = "cdl", 
                    B = 10000,
                    type = "rademacher" # default weighting, by the way
                    ))
boottest.fixest(object = reg1, param = "cdl", B = 10000, clustid = c("state", 
    "year"), type = "rademacher")
 
p value: 0.2199 
confidence interval: -0.1483 0.4476 
test statistic 1.5779 

Example with a random subset of 8 clusters

Code
set.seed(13489)
randomclusters <- unique(df$state)[sample(1:length(unique(df$state)), 8, replace = F)]
reg1 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt, data = df[df$state %in% randomclusters,], cluster = c("state", "year"))
(boot_reg <- boottest(
                    reg1, 
                    clustid = c("state", "year"), 
                    param = "cdl", 
                    B = 10000,
                    type = "rademacher" # default weighting, by the way
                    ))
boottest.fixest(object = reg1, param = "cdl", B = 10000, clustid = c("state", 
    "year"), type = "rademacher")
 
p value: 0.1729 
confidence interval: -0.3462 1.4174 
test statistic 1.9382 

Finally, Webb weights (Webb, 2023) – but using Rademacher weights is the norm

Code
set.seed(13489)
randomclusters <- unique(df$state)[sample(1:length(unique(df$state)), 8, replace = F)]
reg1 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt, data = df[df$state %in% randomclusters,], cluster = c("state", "year"))
(boot_reg <- boottest(
                    reg1, 
                    clustid = c("state", "year"), 
                    param = "cdl", 
                    B = 10000,
                    type = "webb" # webb weights, six-point distribution
                    ))
boottest.fixest(object = reg1, param = "cdl", B = 10000, clustid = c("state", 
    "year"), type = "webb")
 
p value: 0.1748 
confidence interval: -0.3707 1.4389 
test statistic 1.9382 

Some thoughts on clustering

  • If you have more than ~30 clusters, you can probably just cluster at the group level
    • We see that the standard errors are very similar in the state examples above
  • Otherwise, consider using an alternative approach
    • Also important when clusters have wildly different sample sizes, or where the treated clusters are relatively few (Moulton, 1990)
  • Alternative approach with only one treated cluster: randomization inference

Randomization inference in Buchmueller, DiNardo, and Valletta (2011)

  • They are interested in the change in insurance coverage in Hawaii relative to other states: \[\begin{gather} Y_{ist}=X_{ist}\beta^t+Z_{st}\gamma^t+H_{it}\delta^t+\phi_{st}+\eta_{it} \end{gather}\]

  • They calculate the change as: \(\Delta=\delta^1-\delta^0\)

  • The idea: see where the Hawaii effect sits in the distribution of the same effect across all US states

    • “Placebo” effects
    • Note that this is not true “randomization” inference
      • I’ll show you an example with one of my papers in a minute

Randomization inference in Buchmueller, DiNardo, and Valletta (2011)

Randomization inference in Merfeld (2023)

  • I’m interested in the effects of pollution on agricultural productivity in India

  • I have villages, which are nested within districts

    • I cluster on villages
  • Alternative: randomly assign pollution to villages within the same district and compare my effects to the distribution of effects
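
Here is a minimal sketch of this placebo/permutation logic using the castle-doctrine data loaded earlier: reassign entire treatment paths across states at random, re-estimate, and see where the actual estimate falls in the placebo distribution. This is only an illustration of the idea, not the exact procedure in either paper.

Code
# Permutation ("randomization inference"-style) exercise with the castle data
library(fixest)
library(dplyr)

# actual estimate with state and year fixed effects (no controls, to keep it short)
actual <- coef(feols(homicide_c ~ cdl | state + year, data = df, cluster = ~state))["cdl"]

set.seed(3)
states <- unique(df$state)
paths <- df %>% dplyr::select(state, year, cdl)      # each state's treatment path

perm_est <- replicate(500, {
  # randomly reassign entire treatment paths across states (placebo treatment)
  donors <- data.frame(state = states, donor = sample(states))
  df_p <- df %>%
    left_join(donors, by = "state") %>%
    left_join(paths, by = c("donor" = "state", "year"), suffix = c("", "_perm"))
  coef(feols(homicide_c ~ cdl_perm | state + year, data = df_p, cluster = ~state))["cdl_perm"]
})

# share of placebo estimates at least as extreme as the actual estimate
mean(abs(perm_est) >= abs(actual))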

Randomization inference in Merfeld (2023)

Leads of pollution in Merfeld (2023)

                                           (1)          (2)
particulate matter (one-year lead)      -0.033
                                        (0.067)
particulate matter (two-year lead)                   -0.070
                                                     (0.052)
weather (expanded)                          No           No
fixed effects:
  village                                  Yes          Yes
  year                                     Yes          Yes
F                                          592          783
observations                         1,161,265    1,055,562
* p < 0.1, ** p < 0.05, *** p < 0.01.
Note: Standard errors are in parentheses and are clustered at the village level.

Convincing the reader is like writing a good story

  • When you’re writing a diff-in-diff paper, think about the possible threats to your identification strategy

  • Then, think about how you can convince the reader that your strategy is valid

    • Use placebos: is there somewhere we shouldn’t expect an effect?
    • In the case of my paper, the leads convinced some seminar participants!
  • You can likewise think of heterogeneity we would expect to see, and test for that!

Fixed and random effects

Before moving on to some of the new literature…

  • Let’s talk about fixed and random effects
    • Fixed effects will be important for the upcoming discussions
  • Some nice (but old) slides from Oscar Torres-Reyna here.

Panel data

  • Both fixed and random effects are used in panel data
    • Panel data: data with multiple observations for each unit
    • Examples: individuals, firms, countries, etc.
  • In our previous example of homicide and castle doctrine laws: unit is the state!

Panel data

Fixed effects

  • Fixed effects are a way to control for omitted variables
    • However there is a key assumption: the omitted variable is time-invariant
  • Fixed effects are also called “within” effects
    • Why? Because we are looking at the variation within each unit
  • The regression is of the form: \[\begin{gather} y_{it} = \alpha_i + \beta x_{it} + \varepsilon_{it}, \end{gather}\]

where \(\alpha_i\) is the fixed effect for unit \(i\). Note the subscript! No \(t\)!

Time-invariant assumption

  • The key assumption is that the omitted variable is time-invariant
    • For example, in the case of the homicide data, we might think that there is some time-invariant variable that affects homicide rates
  • However, this is a strong assumption
    • Moreover, do you think it is more likely to hold in short or long panels?

Fixed effects, empirically

  • Empirically, what are fixed effects doing?
    • They subtract each unit’s mean from the outcome and the explanatory variables (the “within” transformation; see the sketch after this list)
  • In a regression, we add a dummy variable for each unit
    • We have to leave out one dummy variable, though
    • Software will do this for us!
    • Note that the intercept is usually meaningless in this case
  • Cannot include time-invariant variables in the regression
    • Why? Because the fixed effect will absorb them!
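
To see the “within” interpretation directly, here is a small check with the castle data: demeaning the outcome and the regressor by state and then running OLS gives the same slope as feols with state fixed effects (standard errors differ slightly because of degrees-of-freedom corrections).

Code
# the within transformation by hand vs. state fixed effects
library(fixest)
library(dplyr)

fe <- feols(homicide_c ~ cdl | state, data = df)

df_within <- df %>%
  group_by(state) %>%
  mutate(homicide_dm = homicide_c - mean(homicide_c),
         cdl_dm      = cdl - mean(cdl)) %>%
  ungroup()

within <- feols(homicide_dm ~ cdl_dm, data = df_within)

coef(fe)["cdl"]           # identical slope...
coef(within)["cdl_dm"]    # ...to the demeaned regression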

Fixed effects with feols

  • feols makes this easy on us. Let’s return to our previous example.

Code
reg1 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt, data = df, cluster = c("state"))
reg2 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt | state, data = df, cluster = c("state"))
etable(reg1, reg2,
        # standard errors, digits, fit statistics, put SE below coefficients (the norm)
        digits = 3, fitstat = "", se.below = TRUE, 
        # change significance codes to the norm
        signif.code = c("***" = 0.01, "**" = 0.05, "*" = 0.1))
                      reg1       reg2
Dependent Var.: homicide_c homicide_c
                                     
Constant          8.33               
                 (5.00)              
cdl               0.157*     0.059   
                 (0.089)    (0.049)  
log(population)   0.305***  -0.151   
                 (0.063)    (0.341)  
log(income)      -1.06**     0.242   
                 (0.478)    (0.335)  
unemployrt       -0.007     -0.024***
                 (0.022)    (0.006)  
Fixed-Effects:  ---------- ----------
state                   No        Yes
_______________ __________ __________
S.E.: Clustered  by: state  by: state
---
Signif. codes: 0 '***' 0.01 '**' 0.05 '*' 0.1 ' ' 1

Fixed effects with feols

  • Notice how different the coefficients are!

                      reg1       reg2
Dependent Var.: homicide_c homicide_c
                                     
Constant          8.33               
                 (5.00)              
cdl               0.157*     0.059   
                 (0.089)    (0.049)  
log(population)   0.305***  -0.151   
                 (0.063)    (0.341)  
log(income)      -1.06**     0.242   
                 (0.478)    (0.335)  
unemployrt       -0.007     -0.024***
                 (0.022)    (0.006)  
Fixed-Effects:  ---------- ----------
state                   No        Yes
_______________ __________ __________
S.E.: Clustered  by: state  by: state
---
Signif. codes: 0 '***' 0.01 '**' 0.05 '*' 0.1 ' ' 1

Just a quick note: wild cluster bootstrap still works!

Code
# need to use a numeric variable for the bootstrap. sid is in our data, thankfully.
reg1 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt | sid, data = df, cluster = c("sid"))
reg1
OLS estimation, Dep. Var.: homicide_c
Observations: 550 
Fixed-effects: sid: 50
Standard-errors: Clustered (sid) 
                 Estimate Std. Error   t value   Pr(>|t|)    
cdl              0.058781   0.049370  1.190624 0.23953723    
log(population) -0.151406   0.341426 -0.443451 0.65939131    
log(income)      0.241548   0.335153  0.720711 0.47451204    
unemployrt      -0.023468   0.006455 -3.635678 0.00066447 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.179187     Adj. R2: 0.897774
                 Within R2: 0.058749
Code
boot_reg <- boottest(
                    reg1, 
                    clustid = c("sid"), # note that it requires a numeric variable!
                    param = "cdl", 
                    B = 10000
                    )
boot_reg
boottest.fixest(object = reg1, param = "cdl", B = 10000, clustid = c("sid"))
 
p value: 0.2757 
confidence interval: -0.0338 0.1599 
test statistic 1.1917 

Fixed effects are the norm with “differences-in-differences”

  • It’s not quite the same as the canonical differences-in-differences model

  • We redefine treatment for the same unit

    • Before treatment, the value is zero, and after it is one
    • Note that this is different from the canonical model, where the value is zero for the comparison group and one for the treated group at all points in time
  • In fact, the regression we just saw is like a differences-in-differences model of this form!

    • In practice, we often tend to add time fixed effects, too:

\[\begin{gather} y_{it} = \alpha_i + \delta_{t} + \beta x_{it} + \varepsilon_{it}, \end{gather}\]

  • This is colloquially called “two-way fixed effects (TWFE)”

The “effect” of castle doctrine laws, TWFE

Code
reg1 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt | state, data = df, cluster = c("state"))
reg2 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt | state + year, data = df, cluster = c("state"))
etable(reg1, reg2,
        # standard errors, digits, fit statistics, put SE below coefficients (the norm)
        digits = 3, fitstat = "", se.below = TRUE, 
        # change significance codes to the norm
        signif.code = c("***" = 0.01, "**" = 0.05, "*" = 0.1))
                      reg1       reg2
Dependent Var.: homicide_c homicide_c
                                     
cdl               0.059       0.076  
                 (0.049)     (0.056) 
log(population)  -0.151      -0.974**
                 (0.341)     (0.484) 
log(income)       0.242       0.283  
                 (0.335)     (0.308) 
unemployrt       -0.024***   -0.009  
                 (0.006)     (0.012) 
Fixed-Effects:  ---------- ----------
state                  Yes        Yes
year                    No        Yes
_______________ __________ __________
S.E.: Clustered  by: state  by: state
---
Signif. codes: 0 '***' 0.01 '**' 0.05 '*' 0.1 ' ' 1

The “effect” of castle doctrine laws, TWFE

CDL laws and homicide rates
                           (1)          (2)          (3)
Castle doctrine law      0.157*       0.059        0.076
                        (0.089)      (0.049)      (0.056)
Population (log)         0.305***    -0.151       -0.974**
                        (0.063)      (0.341)      (0.484)
Income (log)            -1.06**       0.242        0.283
                        (0.478)      (0.335)      (0.308)
Unemp. rate             -0.007       -0.024***    -0.009
                        (0.022)      (0.006)      (0.012)
Fixed-Effects:
  State                     No          Yes          Yes
  Year                      No           No          Yes
Observations               550          550          550
* p < 0.1, ** p < 0.05, *** p < 0.01.
Note: Standard errors clustered at the state level are in parentheses.

Random effects

  • Before turning to recent issues discovered with the two-way fixed effects estimator, let’s talk about random effects

  • Random effects are a way to capture the heterogeneity across units

    • The key is that this heterogeneity is random and uncorrelated with the predictors in the model
  • This is really a way to capture the variance across units

    • In practice, this absorbs some of the variance, increasing precision (but at the cost of the assumption above)

Random effects in R

Code
library(lme4)
reg1 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt, data = df)
# note iid standard errors for simplicity for random effects and compare to reg1
# (can use other packages to change the vcov calculation if we think assumptions aren't exactly true...)
reg2 <- lmer(homicide_c ~ cdl + log(population) + log(income) + unemployrt + (1 | state) + (1 | year), data = df) 
reg3 <- feols(homicide_c ~ cdl + log(population) + log(income) + unemployrt | state + year, data = df, cluster = "state")

Random effects in R

                         None         RE          FE
Castle laws             0.157      0.050       0.076
                       (0.065)    (0.030)     (0.056)
Population (log)        0.305      0.273      -0.974
                       (0.021)    (0.066)     (0.484)
Income (log)           -1.062      0.018       0.283
                       (0.140)    (0.191)     (0.308)
Unemployment rate      -0.007     -0.027      -0.009
                       (0.011)    (0.006)     (0.012)

Random effects

Some things about random effects relative to vanilla OLS (not fixed effects):

  • The random effects estimator is asymptotically more efficient than OLS if there is unit-level heterogeneity: \(\mathbb{E}(V_{RE})<\mathbb{E}(V_{OLS})\)

    • In practice with finite samples, not necessarily
  • In expectation, coefficients are the same: \(\mathbb{E}(\beta_{RE}) = \mathbb{E}(\beta_{OLS})\)

    • Random effects is estimated using (feasible) generalized least squares, which essentially reweights the observations
    • In practice with finite samples, this means the coefficients will not be exactly the same
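
To build intuition for this reweighting, here is a minimal sketch with simulated (hypothetical) balanced panel data: the one-way random effects estimator is OLS on quasi-demeaned data, \(y_{it} - \theta\bar{y}_i\) on \(x_{it} - \theta\bar{x}_i\), with \(\theta = 1 - \sqrt{\sigma^2_\varepsilon/(\sigma^2_\varepsilon + T\sigma^2_\alpha)}\) built from the estimated variance components.

Code
# Random effects as quasi-demeaning (one-way, balanced panel; simulated data)
library(lme4)

set.seed(4)
N <- 200; Tt <- 10
id <- rep(1:N, each = Tt)
alpha <- rnorm(N)                           # unit heterogeneity, uncorrelated with x
x <- rnorm(N*Tt)
y <- 1 + 0.5*x + alpha[id] + rnorm(N*Tt)
dat <- data.frame(id, x, y)

re <- lmer(y ~ x + (1 | id), data = dat)
fixef(re)["x"]

# quasi-demeaning by hand, using the estimated variance components
vc <- as.data.frame(VarCorr(re))
sig_alpha2 <- vc$vcov[vc$grp == "id"]
sig_eps2   <- vc$vcov[vc$grp == "Residual"]
theta <- 1 - sqrt(sig_eps2/(sig_eps2 + Tt*sig_alpha2))

ybar <- ave(dat$y, dat$id)
xbar <- ave(dat$x, dat$id)
coef(lm(I(y - theta*ybar) ~ I(x - theta*xbar), data = dat))[2]  # ~ same slope as lmer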

Fixed or random effects?

  • If the random effects assumptions hold, fixed and random effects estimate the same thing in expectation: \[\mathbb{E}(\beta_{RE}) = \mathbb{E}(\beta_{FE})\]

  • So, how might we test for whether we should use random or fixed effects?

  • We can test whether coefficients change “significantly” when using fixed vs. random effects!
    • This is called a Hausman test
    • Note: In practice, economists rarely do this. They tend to assume fixed effects unless they have a good reason to assume otherwise. Not necessarily true in other disciplines.

Hausman test


\[\begin{gather} Hausman = (b_{FE} - b_{RE})'(Var(b_{FE}) - Var(b_{RE}))^{-1}(b_{FE} - b_{RE}) \end{gather}\]

  • Some notes:
    • If RE is consistent, FE is, too; but RE is more efficient, asymptotically
    • \(H_0\): Random effects is consistent (i.e. “okay” to use)
    • The test statistic is asymptotically \(\chi^2\) distributed with \(k\) degrees of freedom, where \(k\) is the number of coefficients in the model
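
A minimal sketch of running the test in R, assuming the plm package is installed (it is not used elsewhere in these slides), applied to the castle data:

Code
# Hausman test with plm (one standard implementation)
library(plm)

fe_mod <- plm(homicide_c ~ cdl + log(population) + log(income) + unemployrt,
              data = df, index = c("state", "year"), model = "within")
re_mod <- plm(homicide_c ~ cdl + log(population) + log(income) + unemployrt,
              data = df, index = c("state", "year"), model = "random")

# H0: random effects is consistent; a rejection favors fixed effects
phtest(fe_mod, re_mod)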

Triple differences

  • We have discussed differences-in-differences
    • We can also have triple differences
    • For example, if we have three groups and two time periods
  • The Card and Krueger paper simply compared employment at fast-food restaurants in New Jersey and Pennsylvania before and after the minimum wage increase in New Jersey
    • This is a double difference
  • Can you think of a group that theoretically would not be affected by the minimum wage change?
  • How about high-wage workers?

Triple differences

  • Differences-in-differences can be extended into higher “differences”

    • Triple differences, quadruple differences, etc.
  • Example: Muralidharan and Prakash (2017)

  • They study a bicycle program for girls in Bihar, India

    • Reducing the gender gap by providing girls in secondary school with a bicycle

Higher differences

  • Using higher differences is quite common

  • Common problem: large standard errors

  • Triple difference in this paper:

    • Gender by age by location

The program

  • Program in Bihar
    • Third largest (by population) state in India
    • Quite poor
  • Secondary schools can be located far from households
    • Along with a large gender gap in schooling, this motivates the program itself
  • “Chief Minister’s Bicycle Program”
    • “Conditional kind transfer”

Data

  • Data was not collected specifically for the program

  • Indian District Level Health Survey (DLHS)

  • DLHS is nationally representative

    • 720,000 households across 601 districts

Identification

  • Compare girls who were eligible for the program with those who were not, based on age
    • The survey was 18 months after the start of the program
    • In practice: compare girls just young enough to be eligible with slightly older girls who just missed eligibility
  • Could just compare boys to girls based on age
    • Problem: possible non-parallel trends

Solution? Triple difference bringing in a second state, Jharkhand

Triple difference bringing in a second state, Jharkhand

Main results: enrollment/completion of grade 9

Expected heterogeneity

  • One way to convince people that your causal results are true is to make arguments about heterogeneity

  • For example, with this paper distance should be important

  • For those who live very close to a school, bicycles shouldn’t matter

    • For those who live very far, same
  • So what is the heterogeneity we EXPECT to see?

  • Treatment effect should be largest at “middle” distances

Expected heterogeneity by distance

Effects on grade 10 board exams

Bias in TWFE

Bias in TWFE

  • Recently, a number of papers have shown that the two-way fixed effects estimator can be… problematic

  • We have been discussing differences-in-differences with TWFE of the following form: \[\begin{gather} y_{it} = \alpha_i + \delta_{t} + \beta D_{it} + \gamma x_{it} + \varepsilon_{it}, \end{gather}\] where \(D_{it}\) is a dummy variable for treatment.

  • If there are only two time periods and one group receives treatment in only one period, this is not a problem!

    • The Card and Krueger setup is not an issue with TWFE

TWFE and differential treatment timing


  • The real issue is when treatment is staggered across time
    • For example, if treatment is introduced in different years for different states


  • It turns out this is the case with the castle doctrine law!

year   share of states treated
2000   0.00
2001   0.00
2002   0.00
2003   0.00
2004   0.00
2005   0.00
2006   0.02
2007   0.28
2008   0.36
2009   0.40
2010   0.42

Goodman-Bacon (2021), Journal of Econometrics

  • Goodman-Bacon lays out the problem (as does Scott in CI)

  • Suppose we have three groups and three time periods

    • Group 1 is treated before period 2 (Goodman-Bacon calls this group k)
    • Group 2 is treated before period 3 (Goodman-Bacon calls this group l)
    • Group 3 is never treated (Goodman-Bacon calls this group U)
  • If we are willing to assume treatment effect homogeneity, then we have no problems!

    • Are you willing to assume this?

Goodman-Bacon (2021)

  • Goodman-Bacon shows that the overall TWFE estimate is a weighted average of the treatment effects from every possible 2x2 comparison, each pairing a group whose treatment status changes with a group whose status does not change over that window:
    • Group 1 vs group 2
    • Group 1 vs group 3
    • Group 2 vs group 1
    • Group 2 vs group 3
  • These weights are a function of two things:
    • Group sizes
    • Variance in treatment

Goodman-Bacon (2021)

Goodman-Bacon (2021)

Goodman-Bacon (2021)

  • He shows there are three relevant comparisons (groups for which the treatment changes):

\[\begin{align} \hat{\beta}_{kU}^{2x2}&\equiv \left(\bar{y}_k^{POST(k)}-\bar{y}_k^{PRE(k)}\right) - \left(\bar{y}_U^{POST(k)}-\bar{y}_U^{PRE(k)}\right) \\ \hat{\beta}_{kl}^{2x2,k}&\equiv \left(\bar{y}_k^{MID(k,l)}-\bar{y}_k^{PRE(k)}\right) - \left(\bar{y}_l^{MID(k,l)}-\bar{y}_l^{PRE(k)}\right) \\ \hat{\beta}_{kl}^{2x2,l}&\equiv \left(\bar{y}_l^{POST(l)}-\bar{y}_l^{MID(k,l)}\right) - \left(\bar{y}_k^{POST(l)}-\bar{y}_k^{MID(k,l)}\right), \end{align}\]

where \(k\) and \(l\) are treated groups, and \(U\) is the untreated group.

  • Note the first term includes two separate comparisons (both treated groups vs. untreated group)

Goodman-Bacon (2021)

  • The DD estimator is a weighted average of all these comparisons.

  • Generalizing to \(K\) time periods: \[\begin{gather} \hat{\beta}^{DD}=\sum_{k\neq U}s_{kU}\hat{\beta}_{kU}^{2x2}+\sum_{k\neq U}\sum_{l>k}\left[s_{kl}^k\hat{\beta}_{kl}^{2x2,k}+s_{kl}^l\hat{\beta}_{kl}^{2x2,l} \right], \end{gather}\] where \(s_{ij}\) is the weight for the comparison between groups \(i\) and \(j\).

Goodman-Bacon (2021)

The weights: \[\begin{align} s_{kU}&=\frac{(n_k+n_U)^2n_{kU}(1-n_{kU})\bar{D}_k(1-\bar{D}_k)}{\hat{V}^D} \\ s_{kl}^k&=\frac{\left((n_k+n_l)(1-\bar{D}_l)\right)^2n_{kl}(1-n_{kl})\frac{\bar{D}_k-\bar{D}_l}{1-\bar{D}_l}\frac{1-\bar{D}_k}{1-\bar{D}_l}}{\hat{V}^D} \\ s_{kl}^l&=\frac{\left((n_k+n_l)\bar{D}_k\right)^2n_{kl}(1-n_{kl})\frac{\bar{D}_l}{\bar{D}_k}\frac{\bar{D}_k-\bar{D}_l}{\bar{D}_k}}{\hat{V}^D} \end{align}\]

  • Note how the variance of treatment affects the weights! “Changing the number or spacing of time periods changes the weights” (Goodman-Bacon).
    • Even if the treatment effect is constant, changing the length of the panel can change the weighted average if different groups have different treatment effects.

de Chaisemartin and D’Haultfœuille (2020), American Economic Review

  • de Chaisemartin and D’Haultfœuille (2020) more explicitly show the problem with weights.

  • The problem is an extrapolation problem

    • Essentially, “the regression predicts a treatment probability larger than the one in that cell” (de Chaisemartin and D’Haultfœuille)
  • Note that if the treatment effect is constant, then the weighted average is always the same, no matter the weights

    • But if the treatment effect is not constant, then the weighted average can be different from the true average

Back to Goodman-Bacon

  • Let’s return to Goodman-Bacon’s formulation

  • What happens if we have a “control” group in a later period that is treated in an earlier period? \[\begin{align} \begin{split} \hat{\delta}_{lk}^{2x2} = &ATT_{l,POST(l)} \\ + &\Delta Y_l^0(Post(l), MID) - \Delta Y_k^0(Post(l), MID) \\ - &(ATT_k(Post) - ATT_k(Mid)) \end{split} \end{align}\]

  • The first line is what we want

  • The second line is parallel-trends bias

  • The third line is bias due to heterogeneity in time!

    • Even with parallel trends, this third line can cause deviations from the true ATT

Back to Goodman-Bacon

  • In other words, what causes the problem?
    • The fact that already-treated groups can be used as control groups for later-treated groups
  • This means that changing “effects” of the intervention can bias this later comparison
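
A small simulated example (hypothetical data) of how this plays out: with staggered adoption and effects that grow over time, the static TWFE coefficient comes out noticeably below the average effect among treated observations.

Code
# Simulated illustration: staggered timing + effects that grow over time
library(fixest)

set.seed(5)
n_units <- 60; n_years <- 10
sim <- expand.grid(unit = 1:n_units, year = 1:n_years)
# first third treated in year 4, second third in year 7, last third never treated
sim$gyear <- c(4, 7, Inf)[(sim$unit - 1) %/% (n_units/3) + 1]
sim$treat <- as.numeric(sim$year >= sim$gyear)
# treatment effect grows with time since treatment
sim$rel <- ifelse(sim$treat == 1, sim$year - sim$gyear, 0)
sim$y <- 0.1*sim$unit + 0.2*sim$year +
         sim$treat*(1 + 0.5*sim$rel) + rnorm(nrow(sim), sd = 0.1)

# average effect among treated observations
mean((1 + 0.5*sim$rel)[sim$treat == 1])

# static TWFE estimate: noticeably smaller, because already-treated units
# serve as controls while their own effects are still growing
feols(y ~ treat | unit + year, data = sim)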

Goodman-Bacon (2021)

Enough math, what to do?

  • That’s enough math. Let’s talk about what we can actually do!

  • Let’s go back to the castle.dta dataset

  • Let’s use the information in Cunningham’s book

Cheng and Hoekstra (2013), Journal of Human Resources

  • The “castle doctrine” laws essentially make lethal force “more” legal

  • Recall the changes we made: turn homicide into a rate (per 100,000 people) and take the log:

Code
library(haven) # to load .dta files
df <- read_dta("week5files/castle.dta")
head(df)
# key variables: state, year, cdl ("treatment"), and homicide_c (outcome)
# homicide_c to rate (per 100,000 people)
df$homicide_c <- (df$homicide_c/df$population)*100000
# and log
df$homicide_c <- log(df$homicide_c)

Cheng and Hoekstra (2013), Journal of Human Resources

  • I made some changes to the data, as well
    • I’ve turned the “treatment” variable (cdl) into a dummy variable
  • We have the issue we ran into above:
    • Treatment is staggered across time
    • This means that some already-treated units will serve as controls for later-treated units!

Treatment timing

Treatment timing by individual state

Two-way fixed effects with fixest, as simple as possible

Code
# state fe, note the weights!
reg1 <- feols(homicide_c ~ cdl + log(population) + unemployrt | state, data = df, 
              cluster = c("state"), weights = df$population)
# state and year fe
reg2 <- feols(homicide_c ~ cdl + log(population) + unemployrt | state + year, data = df, 
              cluster = c("state"), weights = df$population)
# with state linear trends, note the syntax
reg3 <- feols(homicide_c ~ cdl + log(population) + unemployrt | state + year + state[year], data = df, 
              cluster = c("state"), weights = df$population)

Two-way fixed effects with fixest

                      reg1         reg2        reg3
Treat                0.057        0.089**     0.097*
                    (0.034)      (0.033)     (0.036)
Pop. (log)          -0.544**     -0.768*     -0.543
                    (0.168)      (0.370)     (0.960)
Unemp. rate         -0.024***    -0.009      -0.003
                    (0.004)      (0.010)     (0.011)
Fixed-Effects:
  state                Yes          Yes         Yes
  year                  No          Yes         Yes
Varying Slopes:
  year (state)          No           No         Yes
Observations           550          550         550
* p < 0.1, ** p < 0.05, *** p < 0.01.
Note: Standard errors clustered at the state level are in parentheses.

Event studies

  • Event studies are a way to look at the effect of treatment over time
    • We can see if the effect is immediate, or if it takes time to “kick in”
    • We can also see whether there are any pre-trends
  • Effectively, what we want to do is redefine the time period to be relative to treatment
    • For example, if treatment is introduced in 2005, we want to redefine 2005 as year 0, 2004 as year -1, etc.
    • Let’s do this now

Event studies

Code
# first, find the year of treatment by state
df <- df %>%
      # group by state ("panel" identifier)
      group_by(state) %>%
      mutate(year_treat = ifelse(cdl==1, year, NA),
              # first year with treatment
              year_treat = min(year_treat, na.rm = TRUE),
              # create new time variable called event_year
              event_year = year - year_treat) %>%
              # ungroup
              ungroup()
# note that states NEVER treated are -Inf
table(df$event_year)

-Inf  -10   -9   -8   -7   -6   -5   -4   -3   -2   -1    0    1    2    3    4 
 319    1    3    7   20   21   21   21   21   21   21   21   20   18   14    1 
Code
# now find the average homicide rate by event_year
df_event <- df %>%
            # drop the missings
            filter(event_year>-11) %>%
            group_by(event_year) %>%
            summarize(homicide_c = weighted.mean(homicide_c, weights = population, na.rm = TRUE))

Check that the new event_year variable looks okay (e.g., Alabama and Alaska)

# A tibble: 11 × 4
   state    year event_year   cdl
   <chr>   <dbl>      <dbl> <dbl>
 1 Alabama  2000         -7     0
 2 Alabama  2001         -6     0
 3 Alabama  2002         -5     0
 4 Alabama  2003         -4     0
 5 Alabama  2004         -3     0
 6 Alabama  2005         -2     0
 7 Alabama  2006         -1     0
 8 Alabama  2007          0     1
 9 Alabama  2008          1     1
10 Alabama  2009          2     1
11 Alabama  2010          3     1

# A tibble: 11 × 4
   state   year event_year   cdl
   <chr>  <dbl>      <dbl> <dbl>
 1 Alaska  2000         -7     0
 2 Alaska  2001         -6     0
 3 Alaska  2002         -5     0
 4 Alaska  2003         -4     0
 5 Alaska  2004         -3     0
 6 Alaska  2005         -2     0
 7 Alaska  2006         -1     0
 8 Alaska  2007          0     1
 9 Alaska  2008          1     1
10 Alaska  2009          2     1
11 Alaska  2010          3     1

Plot the pure means

Code
ggplot(df_event) + 
  geom_line(aes(x = event_year, y = homicide_c)) +
  theme_bw() +
  labs(x = "Years relative to treatment", y = "Homicide rate (log)")

What do we really want to see?

  • We don’t really want the means, though

  • What do we want instead?

  • We want the effect of treatment over time
    • We want to essentially plot coefficients

Calculating year-specific coefficients

Code
# note that the year BEFORE treatment, -1, IS THE OMITTED CATEGORY
# i() is a fixest-specific way to create factors/dummies
reg1 <- feols(homicide_c ~ i(event_year, ref = -1) + log(population) + unemployrt | state + year, data = df, 
              cluster = c("state"), weights = df$population)
reg1$coefficients
event_year::-10  event_year::-9  event_year::-8  event_year::-7  event_year::-6 
   -0.268714642    -0.193012613    -0.087832067     0.031459588     0.004916894 
 event_year::-5  event_year::-4  event_year::-3  event_year::-2   event_year::0 
   -0.027364799    -0.003498961    -0.013767586    -0.045036279     0.079746416 
  event_year::1   event_year::2   event_year::3   event_year::4 log(population) 
    0.069676912     0.079426337     0.062440521     0.155021385    -0.724793998 
     unemployrt 
   -0.009803301 
Code
# It's a vector. We can extract the coefficients we want by subsetting with []
# get coefficients
coef <- c(reg1$coefficients[1:9], 0, reg1$coefficients[10:14])
# confidence intervals
lower <- c(confint(reg1)[1:9,1], 0, confint(reg1)[10:14,1])
upper <- c(confint(reg1)[1:9,2], 0, confint(reg1)[10:14,2])
# event-time values for the x-axis
years <- c(-10:4)

Plot these coefficients using geom_point

Code
ggplot() +
  geom_point(aes(x = years, y = coef)) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = "red") +
  labs(x = "Years to treatment", y = "Coefficient (relative to T = -1)") +
  # change x axis to be every year
  scale_x_continuous(breaks = years) +
  theme_bw()
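
As an aside, fixest can produce this kind of plot directly from the i() coefficients with iplot(); the manual extraction above is just more flexible for ggplot-style figures. Here reg1 is the event-study regression estimated above.

Code
# fixest's built-in event-study plot (base graphics)
iplot(reg1, xlab = "Years to treatment")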

Plot these coefficients using geom_point

Let’s remove the first three years because of small sample sizes

Add the confidence intervals using geom_errorbar

Code
ggplot() +
  geom_point(aes(x = years, y = coef)) +
  geom_errorbar(aes(x = years, ymin = lower, ymax = upper), alpha = 0.2, width = 0.1) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = "red") +
  labs(x = "Years to treatment", y = "Coefficient (relative to T = -1)") +
  # change x axis to be every year
  scale_x_continuous(breaks = years) +
  theme_bw()

Add the confidence intervals using geom_errorbar

Add the confidence intervals using geom_line, if you prefer

Add the confidence intervals using geom_ribbon, if you prefer

Let’s use the “Bacon decomposition” to look at weighting

  • This won’t allow us to calculate standard errors

  • Think of this as “diagnostics”

  • We can see how different 2x2 cells have different weights, sometimes markedly so

    • We can also see that different cells have different treatment estimates

Let’s try the “Bacon decomposition” - Note the treatment variable must be binary

Code
library(bacondecomp)

# syntax:
# bacon(formula, data, id_var, time_var, quietly = F)
bacon <- bacon(homicide_c ~ cdl + log(population) + unemployrt, data = df, id_var = "state", time_var = "year", quietly = F)
                  type  weight  avg_est
1         Both Treated 0.10043 -0.00824
2 Treated vs Untreated 0.89957  0.07906
Code
bacon$two_by_twos
   treated untreated      weight     estimate                 type
1     2007     99999 0.611262032  0.059301141 Treated vs Untreated
2     2007      2008 0.030235799 -0.001376830         Both Treated
3     2007      2010 0.017953382 -0.061066749         Both Treated
4     2007      2009 0.026493259  0.053198223         Both Treated
5     2006      2007 0.007511240  0.004557765         Both Treated
6     2006     99999 0.048373691  0.149043206 Treated vs Untreated
7     2006      2008 0.004157732  0.082718843         Both Treated
8     2006      2010 0.001570967 -0.048012581         Both Treated
9     2006      2009 0.002684252  0.086501842         Both Treated
10    2008     99999 0.160751757  0.092373673 Treated vs Untreated
11    2008      2010 0.004142796 -0.219504731         Both Treated
12    2008      2009 0.004503055 -0.151035583         Both Treated
13    2010     99999 0.016782728  0.076684232 Treated vs Untreated
14    2009     99999 0.062402696  0.184717537 Treated vs Untreated
15    2009      2010 0.001174611 -0.037899937         Both Treated

Get the overall average effect by multiplying weights by estimates

Code
bacon <- bacon(homicide_c ~ cdl + log(population) + unemployrt, data = df, id_var = "state", time_var = "year", quietly = F)
                  type  weight  avg_est
1         Both Treated 0.10043 -0.00824
2 Treated vs Untreated 0.89957  0.07906
Code
weighted.mean(bacon$two_by_twos$estimate, bacon$two_by_twos$weight)
[1] 0.07029371
Code
# compare to TWFE estimate
feols(homicide_c ~ cdl + log(population) + unemployrt | state + year, data = df, 
              cluster = c("state")) # No weights since Bacon decomp doesn't allow them
OLS estimation, Dep. Var.: homicide_c
Observations: 550 
Fixed-effects: state: 50,  year: 11
Standard-errors: Clustered (state) 
                 Estimate Std. Error   t value Pr(>|t|)    
cdl              0.073200   0.055921  1.308986 0.196645    
log(population) -1.018959   0.513389 -1.984769 0.052784 .  
unemployrt      -0.013406   0.013728 -0.976562 0.333583    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.175397     Adj. R2: 0.900242
                 Within R2: 0.023689

We can plot the average effects for the different groups

Let’s estimate effects using the did2s function from Kyle Butts

  • bacondecomp is good for diagnostics, but we really want to estimate the ATT with standard errors

Code
library(did2s)

  • yname: the outcome variable
  • first_stage: formula for first stage, can include fixed effects and covariates, but do not include treatment variable(s)!
  • second_stage: This should be the treatment variable or in the case of event studies, treatment variables.
  • treatment: This has to be the 0/1 treatment variable that marks when treatment turns on for a unit. If you suspect anticipation, see note above for accounting for this.
  • cluster_var: Which variables to cluster on

How does this actually work?

  • The main problem with TWFE is the “residualization” of the treatment variable

  • did2s uses the implementation from Gardner (2021), which is similar to Borusyak et al. (2021)

    • Estimate the fixed effects SEPARATELY to avoid residualization of the treatment indicator
  • This is a “two-step” estimator

    • Implemented with a generalized method of moments (GMM) estimator to correct standard errors
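
A rough by-hand version of the two-step logic (illustration only; it ignores the population weights used below, and the naive standard errors in the second step are wrong, which is exactly what did2s’s GMM step corrects):

Code
# Two-step logic by hand:
# 1) estimate fixed effects and covariates using untreated observations only,
# 2) subtract them from the outcome, 3) regress the adjusted outcome on treatment
library(fixest)

step1 <- feols(homicide_c ~ log(population) + unemployrt | state + year,
               data = df[df$cdl == 0, ])
df$y_adj <- df$homicide_c - predict(step1, newdata = df)
feols(y_adj ~ cdl, data = df, cluster = ~state)   # point estimate; SEs not corrected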

Let’s estimate effects using the did2s function from Kyle Butts

Code
library(did2s)
# note that we can use fixest syntax, with FE and with i()!
# can also add weights
did2s <- did2s(data = df, yname = "homicide_c", first_stage = "log(population) + unemployrt | state + year", 
                second_stage = "cdl", treatment = "cdl", cluster_var = "state", weights = "population")
# let's compare it to the vanilla TWFE
twfe <- feols(homicide_c ~ cdl + log(population) + unemployrt | state + year, data = df, 
              cluster = c("state"), weights = df$population)

And the results

Code
# The "correct" results
did2s
OLS estimation, Dep. Var.: homicide_c
Observations: 550 
Weights: weights_vector 
Standard-errors: Custom 
    Estimate Std. Error t value Pr(>|t|)    
cdl 0.094871   0.037989  2.4973 0.012806 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 262.4   Adj. R2: 0.085672
Code
# and the TWFE results?
twfe
OLS estimation, Dep. Var.: homicide_c
Observations: 550 
Weights: df$population 
Fixed-effects: state: 50,  year: 11
Standard-errors: Clustered (state) 
                 Estimate Std. Error   t value  Pr(>|t|)    
cdl              0.088939   0.032653  2.723783 0.0089189 ** 
log(population) -0.767701   0.370173 -2.073895 0.0433658 *  
unemployrt      -0.009254   0.010207 -0.906587 0.3690621    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 252.9     Adj. R2: 0.927248
              Within R2: 0.048831

We can also estimate the event study this way!

Code
# Estimate
did2s <- did2s(data = df, yname = "homicide_c", first_stage = "log(population) + unemployrt | state + year", 
                second_stage = "i(event_year, ref = -1)", treatment = "cdl", cluster_var = "state", 
                weights = "population")
coefficients <- c(did2s$coefficients[2:10], 0, did2s$coefficients[11:15])
# confidence intervals
lower <- c(confint(did2s)[2:10,1], 0, confint(did2s)[11:15,1])
upper <- c(confint(did2s)[2:10,2], 0, confint(did2s)[11:15,2])
# plot estimates
ggplot() +
  geom_point(aes(x = c(-10:4), y = coefficients)) +
  geom_errorbar(aes(x = c(-10:4), ymin = lower, ymax = upper), alpha = 0.2, width = 0.1) +
  geom_vline(xintercept = -0.5, linetype = "dashed", color = "red") +
  labs(x = "Years to treatment", y = "Coefficient (relative to T = -1)") +
  # change x axis to be every year
  scale_x_continuous(breaks = years) +
  theme_bw()

Some small changes to remove first three years

Compared to TWFE?

Roth et al. (2023)

de Chaisemartin and D’Haultfoeuille (2022)

Wrapping up TWFE

  • We’ve learned how to estimate two-way fixed effects models with fixest
    • We’ve also learned how they can be biased if treatment is staggered
  • We saw how to use did2s to estimate the ATT
    • We also saw how to use bacondecomp to look at the weights
  • A lingering question: TWFE with continuous treatment variables
    • Callaway et al. (2021) and de Chaisemartin and D’Haultfœuille (2023) have some ideas
  • IVs with TWFE?
    • Note that you could in theory do this by hand! IV estimates are just a ratio of reduced form and first stage. Could bootstrap the ratio.

Synthetic control

Increasing cigarette taxes and discouraging smoking in California (in 1988)

Increasing cigarette taxes and discouraging smoking in California (in 1988)

  • What if we wanted to figure out whether the law changed smoking in California?

  • A key problem, similar to Card and Krueger:

    • There is really only one treated unit
    • Many people live in California, obviously, but they are all equally affected by the law
  • Differences-in-differences is problematic here

    • A single treated cluster!
  • So what are we to do?

Synthetic control

  • This is where synthetic control comes in
    • Abadie and Gardeazabal (2003), American Economic Review
    • Abadie, Diamond, and Hainmueller (2010), Journal of the American Statistical Association
  • The basic idea:
    • We want to create a “synthetic” California that is a weighted average of the other states
    • We want to weight the other states so that they look like California before the law
    • We can then compare the synthetic California to the real California after the law

Synthetic control

  • Synthetic control doesn’t just match on pre-treatment outcomes
    • It also matches on pre-treatment trends and covariates
    • This is what makes it different from pure matching
  • Main requirement:
    • Many pre-treatment periods (Abadie, Diamond, and Hainmueller, 2010)

A little formalization

  • Let’s say we are interested in outcome \(Y_{jt}\) for unit \(j\) at time \(t\)
    • “Treated” group is \(j=1\)
  • In the post period, we estimate the effect of the intervention as \[\begin{gather} Y_{1t} - \sum_{j=2}^Jw_j^*Y_{jt}, \end{gather}\]

where \(w_j\) are time-invariant weights that we will estimate in the pre period.

A little formalization

  • In an ideal world, we would estimate the weights such that \[\begin{gather} \sum_{j=2}^Jw_j^*Y_{j1} = Y_{11},\;\sum_{j=2}^Jw_j^*Y_{j2} = Y_{12},\;\ldots,\;\sum_{j=2}^Jw_j^*Y_{jT_0} = Y_{1T_0}, \end{gather}\] where \(T_0\) is the number of pre-treatment periods.

  • In practice, however, this will never hold exactly.

    • Instead, we will have to choose weights such that the differences are as small as possible.
  • Our job is to estimate \(W\), the vector of weights for each of the candidate control units.

A little formalization

  • Consider a set of variables (which can include pre-treatment outcomes) \(X\), where \(X_1\) is the treated unit and \(X_0\) are the control units.

  • We will minimize \[\begin{gather} \lVert X_1-X_0W\rVert = \sqrt{(X_1-X_0W)'V(X_1-X_0W)}, \end{gather}\] where \(V\) is a diagonal matrix of weights for different variables.

  • Essentially there are two types of weights:

    • Weights for the different variables (the diagonal elements of \(V\))
    • Weights for the different units (the elements of \(W\))

A little formalization

  • Estimating \(V\) is important

  • The most common approach is to minimize the mean squared prediction error (Cunningham, 2022): \[\begin{gather} \sum_{t=1}^{T_0}\left(Y_{1t}-\sum_{j=2}^Jw_j^*Y_{jt}\right)^2, \end{gather}\] where, again, \(T_0\) is the number of pre-treatment periods.

  • Thankfully, the canned R packages do all of this for us!

    • It’s nonetheless good to have an understanding of what they’re doing (see the sketch below)
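
To make that concrete, here is a minimal by-hand sketch (not exactly what the packages do) that sets \(V\) to the identity matrix and chooses the unit weights to minimize the pre-treatment MSPE of the outcome, using simulated data.

Code
# Minimal sketch of the unit-weight step with V = identity: pick donor weights
# that minimize the pre-treatment MSPE of the outcome. All data are simulated.
set.seed(1)
T0 <- 18        # pre-treatment periods
n_donors <- 10  # candidate control units
Y0 <- matrix(rnorm(T0 * n_donors, mean = 120, sd = 15), nrow = T0)  # donor outcomes
Y1 <- Y0 %*% rep(1 / n_donors, n_donors) + rnorm(T0, sd = 2)        # "treated" unit

# reparameterize so the weights are nonnegative and sum to one
to_weights <- function(theta) exp(theta) / sum(exp(theta))
pre_mspe   <- function(theta) mean((Y1 - Y0 %*% to_weights(theta))^2)

opt   <- optim(rep(0, n_donors), pre_mspe, method = "BFGS")
w_hat <- to_weights(opt$par)
round(w_hat, 3)   # estimated unit weights
sum(w_hat)        # equals one by construction

# with post-period data, the effect at time t would be Y1_post[t] - sum(w_hat * Y0_post[t, ])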

Synthetic control in R with the tidysynth package

Code
library(tidysynth)

# We are going to use the "smoking" data from Abadie, Diamond, and Hainmueller (2010)
# One issue: note how many missing values there are! We'll talk more on the next slide.
summary(df)
     state         year         cigsale         lnincome           beer      
 Min.   : 1   Min.   :1970   Min.   : 40.7   Min.   : 9.397   Min.   : 2.50  
 1st Qu.:10   1st Qu.:1977   1st Qu.:100.9   1st Qu.: 9.739   1st Qu.:20.90  
 Median :20   Median :1985   Median :116.3   Median : 9.861   Median :23.30  
 Mean   :20   Mean   :1985   Mean   :118.9   Mean   : 9.862   Mean   :23.43  
 3rd Qu.:30   3rd Qu.:1993   3rd Qu.:130.5   3rd Qu.: 9.973   3rd Qu.:25.10  
 Max.   :39   Max.   :2000   Max.   :296.2   Max.   :10.487   Max.   :40.40  
                                             NA's   :195      NA's   :663    
   age15to24         retprice      state_str           california     
 Min.   :0.1294   Min.   : 27.3   Length:1209        Min.   :0.00000  
 1st Qu.:0.1658   1st Qu.: 50.0   Class :character   1st Qu.:0.00000  
 Median :0.1781   Median : 95.5   Mode  :character   Median :0.00000  
 Mean   :0.1755   Mean   :108.3                      Mean   :0.02564  
 3rd Qu.:0.1867   3rd Qu.:158.4                      3rd Qu.:0.00000  
 Max.   :0.2037   Max.   :351.2                      Max.   :1.00000  
 NA's   :390                                                          

  • A note: using this package is not straightforward. The website has code you can simply copy-paste.

We have to create a “synthetic control object”… kind of a pain, to be honest

Code
smoking_out <- df %>%
                # initialize the synthetic control object
                synthetic_control(outcome = cigsale, # outcome
                                  unit = state, # unit index in the panel data
                                  time = year, # time index in the panel data
                                  i_unit = mean(df$state[df$california==1]), # unit where the intervention occurred
                                  i_time = 1988, # time period when the intervention occurred
                                  generate_placebos = T) %>% # generate placebo synthetic controls (for inference)
                # Generate the aggregate predictors used to fit the weights
                # average log income, retail price of cigarettes, and proportion of the
                # population between 15 and 24 years of age from 1980 - 1988
                generate_predictor(time_window = 1980:1988,
                                  ln_income = mean(lnincome, na.rm = T),
                                  ret_price = mean(retprice, na.rm = T),
                                  youth = mean(age15to24, na.rm = T)) %>%
                # average beer consumption in the donor pool from 1984 - 1988
                generate_predictor(time_window = 1984:1988,
                                  beer_sales = mean(beer, na.rm = T)) %>%
                # Lagged cigarette sales 
                generate_predictor(time_window = 1975,
                                  cigsale_1975 = cigsale) %>%
                generate_predictor(time_window = 1980,
                                  cigsale_1980 = cigsale) %>%
                generate_predictor(time_window = 1988,
                                  cigsale_1988 = cigsale) %>%
                # Generate the fitted weights for the synthetic control
                generate_weights(optimization_window = 1970:1988, # time to use in the optimization task
                                margin_ipop = .02, sigf_ipop = 7, bound_ipop = 6) %>% # optimizer options
                # Generate the synthetic control
                generate_control()

Let’s first look at the weights

Code
# If you get the first step above, the rest is easy
smoking_out %>% plot_weights()
Code
# but what are the states!?

Let’s first look at the weights

Code
# Grab the weights
weights <- smoking_out %>% grab_unit_weights()
# merge in state names
weights <- weights %>%
            mutate(unit = as.numeric(unit)) %>%
            left_join(df %>% select(unit = state, state_str) %>% distinct(), 
                      by = "unit")
# arrange by weight, in descending order
weights %>% arrange(-weight)
# note they sum to one
print(paste0("Sum of weights: ", sum(weights$weight)))

# A tibble: 38 × 3
    unit     weight state_str  
   <dbl>      <dbl> <chr>      
 1    34 0.342      Utah       
 2    21 0.238      Nevada     
 3    19 0.209      Montana    
 4     4 0.149      Colorado   
 5     5 0.0617     Connecticut
 6    23 0.000434   New Mexico 
 7     8 0.00000341 Idaho      
 8    38 0.00000254 Wisconsin  
 9    33 0.00000163 Texas      
10    20 0.00000153 Nebraska   
# ℹ 28 more rows
[1] "Sum of weights: 0.999999999995394"

Another common test: balance

Code
balance <- smoking_out %>% grab_balance_table()
colnames(balance) <- c("Variable", "California", "Synthetic California", "Placebo")
kable(balance, 
      align = "lccc", linesep = "", row.names = FALSE, digits = 3) %>%
  column_spec(1,width = "6cm") %>%
  column_spec(c(2:4),width = "4cm") %>%
  kable_styling()
Variable        California   Synthetic California   Placebo
ln_income           10.077                  9.853     9.829
ret_price           89.422                 89.391    87.266
youth                0.174                  0.174     0.173
beer_sales          24.280                 24.222    23.655
cigsale_1975       127.100                126.991   136.932
cigsale_1980       120.200                120.220   138.089
cigsale_1988        90.100                 91.389   113.824

The results

Code
gg1 <- smoking_out %>% plot_trends() # it's a ggplot object! So let's prettify it
gg1 +
  labs(x = "Year", y = "Cigarette sales (per capita)") +
  theme_bw() +
  theme(legend.position = c(0.2, 0.2))

Just the difference

Code
gg1 <- smoking_out %>% plot_differences()
gg1 +
  labs(x = "Year", y = "Difference in cigarette sales (per capita)") +
  theme_bw()

Inference in synthetic control

  • Okay… so is the result “big”?

    • I know what you’re really thinking: is the result statistically significant?
  • There are no standard errors, as such, in synthetic control

  • Instead, we use placebo tests

    • We basically do the exact same thing for every state in our data
    • If the effect is real, it should be much larger than any “effect” in other states, right?
    • tidysynth makes this easy

Placebo tests

Code
smoking_out %>% plot_placebos() +
  theme_bw() +
  theme(legend.position = c(0.2, 0.8))

Placebo tests

  • It turns out what Abadie et al. (2010) do is a little more complicated
    • They look at mean squared prediction error (MSPE) for each state before and after the intervention
    • They then calculate the ratio: \(\frac{MSPE_{after}}{MSPE_{before}}\)
  • We can then rank states by how much they change after the intervention! The treated state’s rank divided by the total number of states gives the p-value reported in the table on the next slide (see the sketch below for the ratio itself)
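
For concreteness, a minimal sketch of this ratio for a single state, using made-up actual and synthetic series rather than the tidysynth output:

Code
# Minimal sketch of the MSPE ratio for one state. `actual` and `synth` are
# made-up vectors of observed and synthetic outcomes; `pre` flags pre-treatment years.
mspe_ratio <- function(actual, synth, pre) {
  mean((actual[!pre] - synth[!pre])^2) / mean((actual[pre] - synth[pre])^2)
}

set.seed(1)
years  <- 1970:2000
pre    <- years <= 1988
synth  <- 130 - 0.5 * (years - 1970)                                          # synthetic series
actual <- synth - ifelse(pre, 0, 2 * (years - 1988)) + rnorm(length(years))   # gap opens up after 1988
mspe_ratio(actual, synth, pre)   # a large ratio means a big post-treatment divergence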

Placebo tests

Code
sig <- smoking_out %>% grab_significance()
# merge in state names
sig <- sig %>%
        mutate(unit = as.numeric(unit_name)) %>%
        left_join(df %>% select(unit = state, state_str) %>% distinct(), by = "unit")
# let's replace "unit_name" with "state_str"
sig$unit_name <- sig$state_str
sig <- sig %>% select(-c("state_str", "z_score", "unit"))
# round
sig[,c(3,4,5,7)] <- round(sig[,c(3,4,5,7)], 2)
colnames(sig) <- c("State", "Type", "Pre-MSPE", "Post-MSPE", "Ratio", "Rank", "P-value")
kable(sig[1:10,],  # just first 10
      align = "lcccccc", booktabs = TRUE, linesep = "", row.names = FALSE) %>%
  column_spec(c(1),width = "5cm") %>%
  column_spec(c(2:7),width = "3cm") %>%
  kable_styling()
State            Type      Pre-MSPE   Post-MSPE    Ratio   Rank   P-value
California       Treated       3.17      392.20   123.87      1      0.03
Georgia          Donor         3.79      178.71    47.21      2      0.05
Indiana          Donor        25.17      769.66    30.58      3      0.08
West Virginia    Donor         9.52      284.10    29.83      4      0.10
Wisconsin        Donor        11.13      267.76    24.05      5      0.13
Missouri         Donor         3.03       67.77    22.38      6      0.15
Texas            Donor        14.36      277.14    19.29      7      0.18
South Carolina   Donor        12.55      233.79    18.63      8      0.21
Virginia         Donor         9.81       96.40     9.83      9      0.23
Nebraska         Donor         6.30       52.89     8.40     10      0.26