Applied microeconometrics

Week 3 - Discrete Choice

Josh Merfeld

KDI School

September 30, 2024

Introduction

What are we doing today?

  • This week we will review methods for limited dependent variables
    • We will also discuss how to interpret the results
  • Some of this will be review, but there will be some new stuff, too
    • For example, we will discuss Poisson regression, which you might not have seen before

What are we doing today?

  • Much of the mathematical notation and theory comes from Econometrics by Bruce Hansen
    • I already discussed the older version of the textbook that is available for free online
  • A small note:
    • You can use OLS for binary outcomes! This is actually pretty common in economics.
    • I’ll discuss this more in a bit.
    • Other disciplines (e.g. public health) really like some of the new methods we will discuss today, so they are worth knowing.

Binary Choice

Binary choice

  • Let’s start with the simplest possibility: binary choice

  • You can think of this as a No/Yes or False/True question, but we will generally refer to it as 0/1 choice

    • In programming, always remember that \(FALSE=0\) and \(TRUE=1\)
  • Focusing on the binary choice case will allow us to build intuition for the more general case of discrete choice

    • We will also be able to use the same data as we move to the more general case

Dichotomous variables

  • We will be thinking about outcomes where \(Y\in\{0,1\}\)
    • In other words, \(Y\) is a dichotomous (dummy) variable that can only be equal to 0 or 1
  • We are going to discuss methods to output conditional probabilities: \[\begin{gather} \mathbb{P}\left[Y=1|X\right] \end{gather}\]

The error term

  • Consider the following model: \[\begin{gather} Y = \beta X + \epsilon, \end{gather}\] where \(X\) can have any number of columns (variables), \(k\).

  • We already know that \(\epsilon\) does not need to be normally distributed

  • In fact, if \(Y\in\{0,1\}\), then \(\epsilon\) will never be normally distributed

    • Why?

The error term

  • \(\epsilon\) has a two-point conditional distribution: \[\begin{gather} \epsilon = \Biggl\{ \begin{array}{ll} 1 - P(X), & \text{with probability } P(X) \\ -P(X), & \text{with probability } 1 - P(X) \end{array} \end{gather}\]

  • \(\epsilon\) is heteroskedastic: \[\begin{gather} \text{Var}(\epsilon|X) = P(X)(1-P(X)) \end{gather}\]

  • In fact, the variance of any dummy variable is \(P(1-P)\), where \(P\) is the probability of the dummy variable being equal to 1
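
A quick simulated check of this variance formula (the probability value here is just illustrative):

Code
# Var(D) = P(1 - P) for a dummy variable: simulate and compare
set.seed(1)
p <- 0.3
d <- rbinom(1e6, size = 1, prob = p)
var(d)       # ~0.21
p * (1 - p)  # 0.21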

Scatterplots are pretty worthless!

Code
# haven reads Stata data; ggplot2 and dplyr are used throughout.
# Install these from the console if you don't have them.
library(haven)
library(ggplot2)
library(dplyr)
# read in data for the week:
df <- read_dta("data.dta")

# scatter of in_poverty on h_age:
ggplot(data = df) +
  geom_point(aes(x = h_age, y = in_poverty)) +
  labs(x = "Head age", y = "In poverty this month") +
  theme_bw() +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

Let’s plot out the conditional probabilities (means)

Code
# means by age
povmeans <- df %>%
            group_by(h_age) %>%
            summarize(mean = mean(in_poverty)) %>%
            ungroup

# scatter of in_poverty on h_age:
ggplot(data = povmeans) +
  geom_point(aes(x = h_age, y = mean)) +
  labs(x = "Head age", y = "P(poverty)") +
  theme_bw() +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

Linear probability model (LPM)

  • We can estimate this using OLS: \[\begin{gather} Y = \beta X + \epsilon \end{gather}\]

Code
library(fixest)  # feols() and feglm() come from the fixest package
f1 <- feols(in_poverty ~ h_male + h_age, data = df)
f2 <- feols(in_poverty ~ h_male + h_age, data = df, vcov = "HC1")
etable(f1, f2)
                                f1                 f2
Dependent Var.:         in_poverty         in_poverty
                                                     
Constant        0.3523*** (0.0359) 0.3523*** (0.0372)
h_male           -0.0606* (0.0249)  -0.0606* (0.0260)
h_age            -4.96e-5 (0.0005)  -4.96e-5 (0.0006)
_______________ __________________ __________________
S.E. type                      IID Heteroskedas.-rob.
Observations                 4,609              4,609
R2                         0.00129            0.00129
Adj. R2                    0.00085            0.00085
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear probability model (LPM)

Code
summary(feols(in_poverty ~ h_male + h_age, data = df, vcov = "HC1"))
OLS estimation, Dep. Var.: in_poverty
Observations: 4,609 
Standard-errors: Heteroskedasticity-robust 
             Estimate Std. Error   t value  Pr(>|t|)    
(Intercept)  0.352347   0.037235  9.462735 < 2.2e-16 ***
h_male      -0.060625   0.025993 -2.332375  0.019724 *  
h_age       -0.000050   0.000553 -0.089808  0.928443    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.455294   Adj. R2: 8.54e-4

  • The interpretation of these coefficients is straightforward:
    • Each one is the change in the probability of being in poverty for a one-unit change in the variable of interest
    • Male-headed households are 6.1 percentage points less likely to be poor than female-headed households, controlling for age.
    • Each additional year of age decreases the probability of being in poverty by about 0.005 percentage points, controlling for gender of the head (a tiny effect that is not statistically significant).

Linear probability model (LPM)

  • Sometimes people motivate other estimation methods based on heteroskedasticity
    • But we can easily correct for this using robust standard errors (HC1 in feols)
  • There are two other problems with LPM, though:
    • First, the predicted values can be outside of the 0-1 range (see the check below)
      • Is this a problem? Maybe. Maybe not. It depends on what you’re doing.
    • Second, effects are constant throughout the probability distribution
      • Is this realistic? If someone has a 95 percent probability of being poor, would a one-unit change in a variable shift that probability by the same amount as for someone with a 50 percent probability of being poor?
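
A quick check of the first problem (this assumes the feols object f1 from the earlier slide is still in memory):

Code
# check whether any LPM fitted values fall outside [0, 1];
# with richer sets of covariates this happens more often
phat <- predict(f1)
range(phat)
mean(phat < 0 | phat > 1)  # share of out-of-range predictions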

Another option

  • We can think of this as a latent variable model: \[\begin{gather} Y^* = \beta X + \epsilon \\ \epsilon\sim G(\epsilon) \\ Y = \mathbbm{1}(Y^*>0) = \Biggl\{ \begin{array}{ll} 1, & \text{if } Y^*>0 \\ 0, & \text{otherwise} \end{array} \end{gather}\] where \(Y^*\) is the latent variable, \(G(\cdot)\) is the distribution of \(\epsilon\), and \(\mathbbm{1}(\cdot)\) is the indicator function.

  • One way to think about this is that \(Y^*\) is utility, but we only observe whether the choice increases utility (\(Y=1\)) or doesn’t (\(Y=0\)).

Let’s give this a bit more structure

  • Note that \(Y=1\) iff \(Y^*>0\), which is the same as saying \(\beta X + \epsilon > 0\)

  • The response probability is then given by the CDF of \(\epsilon\) evaluated at \(-\beta X\): \[\begin{gather} \mathbb{P}\left[Y=1|X\right] = \mathbb{P}\left[\epsilon > -\beta X\right] = 1 - G(-\beta X) = G(\beta X) \end{gather}\]

  • Note that CDFs (cumulative distribution functions) give us probabilities of being less than or equal to a given value

    • That value here is \(\beta X\)
    • The last equality holds because \(G(\cdot)\) is symmetric around 0 for the distributions we will use (logistic and normal)

CDF examples of the logistic (sigmoid) function - Wikipedia
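
We can draw similar curves ourselves; a minimal sketch using base R's built-in CDFs:

Code
# the two standard choices of G(.): logistic (logit) and standard normal (probit)
xb <- seq(-4, 4, by = 0.1)
plot(xb, plogis(xb), type = "l", xlab = "xb", ylab = "G(xb)")  # logistic CDF
lines(xb, pnorm(xb), lty = 2)                                  # standard normal CDF
# both are symmetric around 0, so G(-x) = 1 - G(x)
all.equal(plogis(-1.3), 1 - plogis(1.3))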

Likelihood

  • Likelihood: the joint probability of the observed sample, viewed as a function of the parameters
    • What?
  • Let’s start with the normal distribution, which is what probit builds on. For the normal regression model, the conditional density of \(Y\) given \(X\) is: \[\begin{gather} f(Y|X) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(Y-\beta X)^2\right) \end{gather}\]

Likelihood

\[\begin{gather} f(Y|X) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(Y-\beta X)^2\right) \end{gather}\]

  • In this case, what is the probability we observe our sample given the values of \(\beta\) and \(\sigma\)? \[\begin{align} f(y_1,\ldots,y_n | x_1,\ldots, x_n) &= \prod_{i=1}^n f(y_i | x_i) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i-\beta x_i)^2\right) \\ &= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\beta x_i)^2\right) \\ &= L_n(\beta, \sigma^2) \end{align}\]

Likelihood

\[\begin{gather} f(y_1,\ldots,y_n | x_1,\ldots, x_n) = L_n(\beta, \sigma^2) \end{gather}\]

  • This is the likelihood function

    • Note that it is a function of the parameters, \(\beta\) and \(\sigma^2\)
  • The properties of logs make this easier to work with: \[\begin{gather} \ell_n(\beta, \sigma^2) = \log(L_n(\beta, \sigma^2)) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\beta x_i)^2 \end{gather}\]

  • This is the log likelihood function

    • It is of course also a function of the parameters, \(\beta\) and \(\sigma^2\)

Maximum likelihood estimation

\[\begin{gather} \ell_n(\beta, \sigma^2) = \log(L_n(\beta, \sigma^2)) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\beta x_i)^2 \end{gather}\]

  • We have our log likelihood function, which is a function of the parameters, \(\beta\) and \(\sigma^2\)

Maximum likelihood estimation

  • What we want to do is find the values of \(\beta\) and \(\sigma^2\) that maximize this function
    • In other words, the values that make our sample the most likely to have been observed (i.e. that give the largest probability of observing our sample)
  • This is called maximum likelihood estimation

\[\begin{gather} \left(\hat{\beta}, \hat{\sigma}^2\right) = \underset{\beta\in\mathbb{R}^k,\sigma^2>0}{\text{argmax}}\;\ell_n(\beta, \sigma^2) \end{gather}\] where \(k\) is the number of variables (coefficients), including the intercept.

Maximum likelihood estimation

  • A simple example is a coin flip

  • Let’s say we flip a coin. If it’s a fair coin, what is the probability of obtaining heads?

    • 50% or 0.5 (we generally work with the proportion 0.5, and not the percent)
  • What is the probability of obtaining heads twice in a row?
    • 0.5 * 0.5 = 0.25
  • Three times in a row?
    • 0.5 * 0.5 * 0.5 = 0.125

Maximum likelihood estimation

  • Say we flip the coin a bunch of times
    • For argument’s sake, let’s say we flip it 100 times and obtain 60 heads
  • If we know nothing about whether the coin is actually fair, what is the most likely distribution that would give us 60 heads and 40 tails?
    • It’s a distribution in which the probability of heads is 0.6!
  • This is like maximum likelihood estimation: we are trying to find the parameters that make our sample the most likely to have been observed (the short sketch below confirms this numerically).
    • In this case, the parameter would be the true mean of the distribution of the coin (where heads = 1 and tails = 0).
    • We could then of course test whether this is significantly different from 0.5, which might be our null hypothesis.
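
We can confirm the coin-flip intuition numerically; a small sketch using the 60-out-of-100 example above:

Code
# Bernoulli log likelihood for 60 heads in 100 flips
loglik <- function(p) 60 * log(p) + 40 * log(1 - p)
# the maximizer is the sample proportion, 0.6
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum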

Generalizing maximum likelihood estimation

\[\begin{gather} \left(\hat{\beta}, \hat{\sigma}^2\right) = \underset{\beta\in\mathbb{R}^k,\sigma^2>0}{\text{argmax}}\;\ell_n(\beta, \sigma^2) \end{gather}\]

  • MLE is almost always carried out using numerical optimization
    • We will not discuss the details of this here
    • The basic reason is that most likelihood functions cannot be maximized analytically (i.e. they have no closed-form solution)
  • In the case of the normal regression model, however, there is a closed-form solution
    • And this is the same closed-form solution as OLS!
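
A small simulated sketch of that equivalence (the data and variable names here are illustrative, not from the week's dataset): maximizing the normal log likelihood numerically reproduces the OLS coefficients.

Code
set.seed(2)
x <- rnorm(500)
y <- 1 + 2 * x + rnorm(500)
# negative log likelihood of the normal regression model
negll <- function(par) {
  b0 <- par[1]; b1 <- par[2]; sig <- exp(par[3])  # exp() keeps the sd positive
  -sum(dnorm(y, mean = b0 + b1 * x, sd = sig, log = TRUE))
}
optim(c(0, 0, 0), negll)$par[1:2]  # intercept and slope from MLE...
coef(lm(y ~ x))                    # ...match OLS (up to optimizer tolerance)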

Logit for binary choice

  • Let’s return to our binary choice model.
    • Regardless of how you estimate it, the probability mass function for \(Y\) is:

\[\begin{gather} \pi(y) = p^y(1-p)^{1-y}, \end{gather}\] where \(p\) is the probability of \(Y=1\), or the mean. Remember that \(Y\in\{0,1\}\); i.e., it can only equal 0 or 1.

  • Let’s bring our link function back into it. The conditional probability is: \[\begin{gather} \pi(Y|X)=G(\beta X)^Y(1-G(\beta X))^{1-Y}=G(\beta X)^Y(G(-\beta X))^{1-Y}=G(\beta Z), \end{gather}\]

where \(Z = \Biggl\{ \begin{array}{ll} X, & \text{if } Y=1 \\ -X, & \text{if } Y=0 \end{array}\)

Logit for binary choice

\[\begin{gather} \pi(Y|X)=G(\beta Z), \end{gather}\]

  • Taking logs (because they’re easy to work with), we get the log likelihood function: \[\begin{gather} \ell_n(\beta) = \sum_{i=1}^n \log G(\beta z_i), \end{gather}\] where \(z_i\) is defined as above for each observation.

  • This is the same as the log likelihood function for probit, except that the link function is different.

Logit for binary choice

  • Again, we want to find the values of \(\beta\) (and \(\sigma\), which shows up in the link function; in practice it is normalized to 1 because it is not separately identified) that maximize this function. \[\begin{gather} \left(\hat{\beta}, \hat{\sigma}^2\right) = \underset{\beta\in\mathbb{R}^k,\sigma^2>0}{\text{argmax}}\;\ell_n(\beta, \sigma^2) \end{gather}\]

  • Something interesting is that in practice, we don’t numerically maximize…

    • Instead, we minimize the negative of the log likelihood function!

\[\begin{gather} \left(\hat{\beta}, \hat{\sigma}^2\right) = \underset{\beta\in\mathbb{R}^k,\sigma^2>0}{\text{argmin}}\;-\ell_n(\beta, \sigma^2) \end{gather}\]
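
To see that this is all glm() does under the hood, here is a minimal sketch that minimizes the negative logit log likelihood directly (it assumes the data frame df from the earlier slides is loaded):

Code
negll <- function(b) {
  p <- plogis(b[1] + b[2] * df$h_male)            # G(Xb) with the logistic link
  -sum(dbinom(df$in_poverty, size = 1, prob = p, log = TRUE))
}
optim(c(0, 0), negll)$par                                                      # by hand
coef(glm(in_poverty ~ h_male, data = df, family = binomial(link = "logit")))  # via glm()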

An example in R - Household variables and poverty using glm()

Code
summary(glm(in_poverty ~ h_male, data = df, family = binomial(link = "logit")))

Call:
glm(formula = in_poverty ~ h_male, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9280  -0.8263  -0.8263   1.5752   1.5752  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.6196     0.1101  -5.631  1.8e-08 ***
h_male       -0.2796     0.1151  -2.428   0.0152 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5583.2  on 4608  degrees of freedom
Residual deviance: 5577.5  on 4607  degrees of freedom
AIC: 5581.5

Number of Fisher Scoring iterations: 4

Interpreting logit output

              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.6196447  0.1100514 -5.630505 1.796828e-08
h_male      -0.2795628  0.1151389 -2.428047 1.518036e-02

  • How do we interpret these coefficients?
    • The coefficients are “log odds”
  • For male household heads, the log odds of being in poverty is 0.280 lower than that for female household heads
    • What?

Log odds

  • What are log odds?
    • Let’s start with odds:

\[\begin{gather} \text{odds} = \frac{p}{1-p}, \end{gather}\]

where \(p\) is the probability of \(y = 1\) (being poor in this case).

  • Log odds?

\[\begin{gather} \text{log odds} = \log\left(\frac{p}{1-p}\right) \end{gather}\]

Log odds and logit

  • Logit regression is basically estimating:

\[\begin{gather} \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k \end{gather}\]

The intercept is the log odds of being in poverty for female-headed households

              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.6196447  0.1100514 -5.630505 1.796828e-08
h_male      -0.2795628  0.1151389 -2.428047 1.518036e-02

\[\begin{gather} \log\left(\frac{p}{1-p}\right) = -0.6196447 \\ \left(\frac{p}{1-p}\right) = \exp(-0.6196447) \\ \left(\frac{p}{1-p}\right) \approx 0.538 \\ p = 0.538-0.538p \\ 1.538p = 0.538 \\ p\approx 0.350 \end{gather}\]

What is the actual mean for female-headed households? 0.35

The coefficient?

\[\begin{gather} \log\left(\frac{p_m}{1-p_m}\right) - \log\left(\frac{p_f}{1-p_f}\right) = -0.2795628 \\ \log\left(\frac{\frac{p_m}{1-p_m}}{\frac{p_f}{1-p_f}}\right) = -0.2795628 \\ \left(\frac{\frac{p_m}{1-p_m}}{\frac{p_f}{1-p_f}}\right) = \exp(-0.2795628) \\ \left(\frac{\frac{p_m}{1-p_m}}{\frac{p_f}{1-p_f}}\right) = 0.7561142 \end{gather}\]

  • In the last line, this is referred to as an odds ratio.
    • It’s less than one, which means male-headed households are less likely to be in poverty.
    • Their odds of being in poverty are around 24% lower.

The coefficient?

  • Mean for female-headed households: 0.35

    • odds (\(\frac{p_f}{1-p_f}\)): 0.538
  • Mean for male-headed households: 0.289

    • odds \(\left(\frac{p_m}{1-p_m}\right)\): 0.407
  • Odds ratio (\(\frac{\frac{p_m}{1-p_m}}{\frac{p_f}{1-p_f}}\)): 0.756

  • Exponentiating shows us the odds ratio!
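
In practice we rarely do this arithmetic by hand; exponentiating the fitted coefficients gives the same numbers (assuming df is loaded):

Code
logitres <- glm(in_poverty ~ h_male, data = df, family = binomial(link = "logit"))
exp(coef(logitres))
# exp(intercept) ~ 0.538: the odds of poverty for female-headed households
# exp(h_male)    ~ 0.756: the odds ratio for male- vs. female-headed households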

Marginal effects

  • We can also calculate marginal effects
    • These are the change in the probability of being in poverty for a one-unit change in the variable of interest
    • An important note is that this depends on where you are located in the distribution
  • We just calculated the means, so with only the binary independent variable, we know that the marginal effect is:
    • -0.0606
  • We will use the “mfx” package for this, so please install it in the console.

Marginal effects

Code
library(mfx)  # logitmfx() and probitmfx() come from the mfx package
logitmfx(in_poverty ~ h_male, data = df)
Call:
logitmfx(formula = in_poverty ~ h_male, data = df)

Marginal Effects:
           dF/dx Std. Err.       z   P>|z|  
h_male -0.060649  0.025981 -2.3343 0.01958 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

dF/dx is for discrete change for the following variables:

[1] "h_male"
Code
# for binary outcomes, it shows the change from 0 to 1!

Logit with more coefficients

Code
logitmfx(in_poverty ~ h_male + h_age, data = df)
Call:
logitmfx(formula = in_poverty ~ h_male + h_age, data = df)

Marginal Effects:
             dF/dx   Std. Err.       z   P>|z|  
h_male -6.0623e-02  2.5982e-02 -2.3333 0.01963 *
h_age  -4.9739e-05  5.3672e-04 -0.0927 0.92616  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

dF/dx is for discrete change for the following variables:

[1] "h_male"
Code
# for binary outcomes, it shows the change from 0 to 1!
# for continuous variables, it's the derivative (i.e. instantaneous change)!
# By default, it calculates these by holding variables AT THEIR MEANS

Probit

Code
summary(glm(in_poverty ~ h_male, data = df, family = binomial(link = "probit")))$coefficients
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.3856924 0.06759123 -5.706248 1.154935e-08
h_male      -0.1699918 0.07058912 -2.408188 1.603194e-02

  • What about probit coefficients?
    • These relate to the CDF of the standard normal distribution
  • Evaluating the standard normal CDF at the intercept gives the mean for female-headed households
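
A quick check of that claim, using the data loaded earlier:

Code
pnorm(-0.3856924)                    # standard normal CDF at the intercept, ~0.35
mean(df$in_poverty[df$h_male == 0])  # sample poverty rate for female-headed households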

Standard normal CDF

  • The intercept is -0.3856924
    • The mean poverty for female-headed households is 0.35
  • Here’s the CDF for the standard normal distribution with the intercept:

Standard normal CDF

  • The intercept is -0.3856924
    • The mean poverty for female-headed households is 0.35
  • Here’s the CDF for the standard normal distribution with BOTH:

Now let’s look at the coefficient on h_male

Code
summary(glm(in_poverty ~ h_male, data = df, family = binomial(link = "probit")))$coefficients
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.3856924 0.06759123 -5.706248 1.154935e-08
h_male      -0.1699918 0.07058912 -2.408188 1.603194e-02

Standard normal CDF

  • The intercept is -0.3856924 and the coefficient is -0.1699918
  • Here’s the CDF for the standard normal distribution with BOTH:

Standard normal CDF

  • The intercept is -0.3856924 and the coefficient is -0.1699918
  • What’s the change in PROBABILITY? \(\text{mean(male)} - \text{mean(female)}\) or -0.0606
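
The same number falls out of the probit estimates directly:

Code
# difference in predicted probabilities, male vs. female heads
pnorm(-0.3856924 - 0.1699918) - pnorm(-0.3856924)  # ~ -0.0606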

Marginal effects

  • The intercept is -0.3856924 and the coefficient is -0.1699918
  • What’s the change in PROBABILITY? \(\text{mean(male)} - \text{mean(female)}\) or -0.0606

Code
probitmfx(in_poverty ~ h_male, data = df)
Call:
probitmfx(formula = in_poverty ~ h_male, data = df)

Marginal Effects:
           dF/dx Std. Err.       z   P>|z|  
h_male -0.060649  0.025981 -2.3343 0.01958 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

dF/dx is for discrete change for the following variables:

[1] "h_male"

Probit and marginal effects

Code
# change in z-scores
summary(glm(in_poverty ~ h_male + log(h_age), data = df, family = binomial(link = "probit")))$coefficients
               Estimate Std. Error    z value   Pr(>|z|)
(Intercept) -0.15654185 0.30328127 -0.5161606 0.60574222
h_male      -0.16978359 0.07059221 -2.4051321 0.01616662
log(h_age)  -0.05896973 0.07611372 -0.7747582 0.43848254

Code
# change in probability, holding other variables at their means
probitmfx(in_poverty ~ h_male + log(h_age), data = df)
Call:
probitmfx(formula = in_poverty ~ h_male + log(h_age), data = df)

Marginal Effects:
               dF/dx Std. Err.       z   P>|z|  
h_male     -0.060570  0.025980 -2.3314 0.01973 *
log(h_age) -0.020308  0.026211 -0.7748 0.43848  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

dF/dx is for discrete change for the following variables:

[1] "h_male"

Multiple discrete choice

What if the outcome has more than two categories?

  • Many outcomes are not 0/1.

  • We can think of outcomes with discrete categories, but more than two:

    • Religion
    • Political party
    • Opinion on a Likert scale (e.g. strongly agree, agree, neutral, disagree, strongly disagree)
    • Months in poverty
  • There’s a key difference between the first two and the last two:

    • The first two are unordered
    • The last two are ordered

Months in poverty - distribution

Code
ggplot() +
    geom_histogram(data = df, aes(x = months_in_poverty), binwidth = 1) + 
    theme_bw() +
    labs(x = "Months in poverty", y = "Count") +
    scale_x_continuous(breaks = seq(0, 12, 1)) +
    geom_vline(xintercept = mean(df$months_in_poverty), linetype = "dotted", color = "blue") +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

Ordered logistic regression with the polr function

Code
library(MASS)  # polr() comes from the MASS package
# note that the outcome must be a FACTOR variable for polr
summary(polr(as_factor(months_in_poverty) ~ h_male + h_age, data = df, Hess = TRUE))
Call:
polr(formula = as_factor(months_in_poverty) ~ h_male + h_age, 
    data = df, Hess = TRUE)

Coefficients:
            Value Std. Error  t value
h_male -0.3826883    0.10035 -3.81337
h_age   0.0001876    0.00216  0.08681

Intercepts:
      Value   Std. Error t value
0|1   -0.5168  0.1457    -3.5478
1|2   -0.2434  0.1455    -1.6730
2|3   -0.0376  0.1455    -0.2585
3|4    0.1472  0.1455     1.0119
4|5    0.3078  0.1455     2.1149
5|6    0.4514  0.1456     3.1004
6|7    0.6416  0.1458     4.4014
7|8    0.8171  0.1460     5.5975
8|9    1.0249  0.1463     7.0044
9|10   1.2665  0.1469     8.6198
10|11  1.6049  0.1482    10.8280
11|12  2.2775  0.1529    14.8999

Residual Deviance: 18592.59 
AIC: 18620.59 

Ordered data: ordinal logit/probit

  • When we have an ordered discrete variable, we can use an ordered logit or probit model
    • These are also called ordinal logit/probit models
  • The idea is that we have a latent variable, \(Y^*\), that is continuous
    • We observe \(Y\) as a discrete variable, but it is ordered
    • We can think of \(Y\) as being binned into categories

\[\begin{gather} Y = \Biggl\{ \begin{array}{ll} 1, & \text{if } Y^*\in(-\infty,\theta_1] \\ 2, & \text{if } Y^*\in(\theta_1,\theta_2] \\ \vdots & \vdots \\ J, & \text{if } Y^*\in(\theta_{J-1},\infty) \end{array} \end{gather}\]
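
A sketch of how the estimated thresholds and coefficients turn into category probabilities (it assumes MASS is loaded for polr() and refits the same model as above; the age value of 40 is just illustrative):

Code
ologit <- polr(as_factor(months_in_poverty) ~ h_male + h_age, data = df, Hess = TRUE)
# polr models P(Y <= j | X) = G(theta_j - X*beta), so for a female head aged 40:
plogis(ologit$zeta["0|1"] - 40 * coef(ologit)["h_age"])  # P(0 months in poverty)
# predict() gives all category probabilities at once:
predict(ologit, newdata = data.frame(h_male = 0, h_age = 40), type = "probs")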

The interpretation?

Call:
polr(formula = as_factor(months_in_poverty) ~ h_male + h_age, 
    data = df, Hess = TRUE)

Coefficients:
            Value Std. Error  t value
h_male -0.3826883    0.10035 -3.81337
h_age   0.0001876    0.00216  0.08681

Intercepts:
      Value   Std. Error t value
0|1   -0.5168  0.1457    -3.5478
1|2   -0.2434  0.1455    -1.6730
2|3   -0.0376  0.1455    -0.2585
3|4    0.1472  0.1455     1.0119
4|5    0.3078  0.1455     2.1149
5|6    0.4514  0.1456     3.1004
6|7    0.6416  0.1458     4.4014
7|8    0.8171  0.1460     5.5975
8|9    1.0249  0.1463     7.0044
9|10   1.2665  0.1469     8.6198
10|11  1.6049  0.1482    10.8280
11|12  2.2775  0.1529    14.8999

Residual Deviance: 18592.59 
AIC: 18620.59 

  • The interpretation is similar to logit: a change in the log-odds of being in a higher level of months in poverty!

Understanding fit with MLE

  • There is no r-squared in MLE
    • Pseudo versions exist (more below), but there is no true r-squared because there is no sense of “mean” with discrete data, especially unordered data
  • We can use the log likelihood function to compare models
    • The log likelihood function is a function of the parameters, \(\beta\) and \(\sigma^2\)
    • The higher the log likelihood, the better the fit
  • We can also use the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
    • These are functions of the log likelihood function and the number of parameters
    • The lower the AIC/BIC, the better the fit
  • These are best thought of as useful for comparing across models
    • Difficult to interpret them on their own

AIC

  • AIC is defined as follows:
    • \(k\) is the number of parameters in the model
    • \(L\) is the maximized likelihood, so \(\log(L)\) is the log likelihood (what logLik() reports)
    • AIC: \(2k - 2\log(L)\)

Code
# save our model
results <- glm(in_poverty ~ h_male + h_age, data = df, family = binomial(link = "probit"))

# log likelihood
logLik(results)
'log Lik.' -2788.727 (df=3)
Code
# aic from temp
results$aic
[1] 5583.453
Code
# Calculate AIC by hand:
2*3 - 2*(-2788.727)
[1] 5583.454

BIC

  • BIC is defined as follows:
    • \(k\) is the number of parameters in the model
    • \(L\) is the maximized likelihood, so \(\log(L)\) is the log likelihood
    • \(n\) is the number of observations
    • BIC: \(k\log(n) - 2\log(L)\)

Code
# save our model
results <- glm(in_poverty ~ h_male + h_age, data = df, family = binomial(link = "probit"))

# log likelihood
logLik(results)
'log Lik.' -2788.727 (df=3)
Code
nrow(df)
[1] 4609
Code
# Calculate BIC by hand:
3*log(4609) - 2*(-2788.727)
[1] 5602.761

Pseudo r-squared

  • We can also calculate a pseudo r-squared
    • This is a measure of the change in the log likelihood function relative to the null model (no coefficients except intercepts)

Code
# null model
logLik(glm(in_poverty ~ 1, data = df, family = binomial(link = "probit")))
'log Lik.' -2791.606 (df=1)
Code
# our model
logLik(glm(in_poverty ~ h_male + h_age, data = df, family = binomial(link = "probit")))
'log Lik.' -2788.727 (df=3)
Code
# pseudo r-squared: 1 - (log likelihood of model / log likelihood of null model)
1 - (-2788.727/-2791.606)
[1] 0.001031306
Code
# Same thing: (null - model) / null
(-2791.606 - -2788.727)/-2791.606
[1] 0.001031306

Multinomial probit/logit

  • Ordered logit/probit is used when the outcome is ordered

  • But what if it’s not, like trying to predict what the species of a flower is?

    • Let’s use a built-in dataset in R called “iris” to look at this:

Code
data(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Multinomial probit/logit

Code
data(iris)
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
Code
colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     

  • Let’s use sepal/petal length/width to predict the species

Multinomial probit/logit

Code
data(iris)
library(nnet)
multinomresults <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

Code
summary(multinomresults)$coefficients
           (Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor    18.69037    -5.458424   -8.707401     14.24477   -3.097684
virginica    -23.83628    -7.923634  -15.370769     23.65978   15.135301

  • Note how setosa isn’t there… it’s the omitted category
    • We can interpret the coefficients as the log odds of being in a particular category relative to setosa (the omitted category)
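
To turn these log odds into something more digestible, we can ask for predicted probabilities (a small sketch using the fitted object from above):

Code
# predicted probability of each species for the first three flowers
predict(multinomresults, newdata = head(iris, 3), type = "probs")
# each row sums to one; the highest-probability column is the predicted species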

Extensions

Extensions

  • GLM is a very general framework
    • We can use it for other distributions, too
  • Let’s look at the Poisson distribution
    • The Poisson distribution is often used for count data
    • It has a key assumption (mean = variance), but violating it isn’t a big deal!

The Poisson distribution

  • The Poisson distribution is defined as follows: \[\begin{gather} f(y; \lambda) = \frac{\lambda^ye^{-\lambda}}{y!} \end{gather}\]

  • Note the \(e\): this is going to lead to a nice log interpretation

  • \(y\) is the number of occurrences of the variable (in our example, it will be number of months in poverty)

  • As mentioned, one implication of this distribution is: \[\begin{gather} E(y) = Var(y) = \lambda \end{gather}\] i.e. the mean of \(y\) equals its variance. But we can work around this if it’s false (which it probably is).
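
A quick simulated check of the mean = variance property, and a look at whether it holds for months in poverty (the lambda value is illustrative; the second part assumes df is loaded):

Code
set.seed(3)
y <- rpois(1e5, lambda = 4.5)
c(mean(y), var(y))  # both close to 4.5
# in our data the variance is much larger than the mean (overdispersion):
c(mean(df$months_in_poverty), var(df$months_in_poverty))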

Poisson, months in poverty (with feglm from fixest)

Code
# could save results. Not going to here... just display them
summary(feglm(months_in_poverty ~ h_male + h_age, data = df, family = "poisson"))
GLM estimation, family = poisson, Dep. Var.: months_in_poverty
Observations: 4,609 
Standard-errors: IID 
             Estimate Std. Error   t value  Pr(>|t|)    
(Intercept)  1.456020   0.039857  36.53087 < 2.2e-16 ***
h_male      -0.277845   0.025915 -10.72158 < 2.2e-16 ***
h_age        0.001255   0.000625   2.00869   0.04457 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -17,506.6   Adj. Pseudo R2: 0.00303 
           BIC:  35,038.6     Squared Cor.: 0.004978
Code
# notice the difference in standard errors if we use HC1 (which we want to here because of overdispersion)
summary(feglm(months_in_poverty ~ h_male + h_age, data = df, family = "poisson", vcov = "HC1"))
GLM estimation, family = poisson, Dep. Var.: months_in_poverty
Observations: 4,609 
Standard-errors: Heteroskedasticity-robust 
             Estimate Std. Error   t value   Pr(>|t|)    
(Intercept)  1.456020   0.092210 15.790222  < 2.2e-16 ***
h_male      -0.277845   0.057268 -4.851679 1.2242e-06 ***
h_age        0.001255   0.001487  0.844145 3.9859e-01    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -17,506.6   Adj. Pseudo R2: 0.00303 
           BIC:  35,038.6     Squared Cor.: 0.004978

Interpreting Poisson coefficients

Code
summary(feglm(months_in_poverty ~ h_male, data = df, family = "poisson", vcov = "HC1"))$coefficients
(Intercept)      h_male 
  1.5189645  -0.2772304 

  • How do we interpret Poisson coefficients?
    • They are on the log scale: the model is for the log of the expected count (the rate) of the outcome
    • i.e. the log of the expected number of months in poverty
  • The log of the expected count for male-headed households is around 0.28 lower than for female-headed households

The intercept is the log of the expected count for female-headed households

Code
summary(feglm(months_in_poverty ~ h_male, data = df, family = "poisson", vcov = "HC1"))$coefficients
(Intercept)      h_male 
  1.5189645  -0.2772304 
Code
# intercept is log count, so exponentiate for levels
exp(1.5189645)
[1] 4.567493
Code
# what is the mean for female-headed households?
mean(df$months_in_poverty[df$h_male==0])
[1] 4.567493
Code
# intercept plus coefficient for male-headed households
exp(1.5189645 - 0.2772304)
[1] 3.461611
Code
# mean for male-headed households?
mean(df$months_in_poverty[df$h_male==1])
[1] 3.461611

Quasi-Poisson

  • The Poisson distribution has a mean = variance assumption
    • This is often violated
    • We can use quasi-Poisson instead
  • Quasi-Poisson gives the same coefficients as Poisson, but the variance is estimated from the data
    • With feglm, the variance is modeled as proportional to the mean (a dispersion parameter times the mean)
  • Small note: if you use glm, you can use the “quasipoisson” family
    • This is more similar to Poisson with vcov = “HC1”!
      • If you use “HC1” in both, you’ll get identical results.
    • This only affects the standard errors, not the coefficients.
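
A minimal glm() sketch of that note, using the same specification as the fixest comparison on the next slide:

Code
# quasipoisson with glm(): same point estimates as Poisson,
# standard errors scaled by the estimated dispersion
summary(glm(months_in_poverty ~ h_male + h_age, data = df, family = quasipoisson))$coefficients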

Poisson vs quasi-Poisson (note the similar standard errors to HC1 for Poisson)

Code
# Poisson with HC1
summary(feglm(months_in_poverty ~ h_male + h_age, data = df, family = "poisson", vcov = "HC1"))
GLM estimation, family = poisson, Dep. Var.: months_in_poverty
Observations: 4,609 
Standard-errors: Heteroskedasticity-robust 
             Estimate Std. Error   t value   Pr(>|t|)    
(Intercept)  1.456020   0.092210 15.790222  < 2.2e-16 ***
h_male      -0.277845   0.057268 -4.851679 1.2242e-06 ***
h_age        0.001255   0.001487  0.844145 3.9859e-01    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: -17,506.6   Adj. Pseudo R2: 0.00303 
           BIC:  35,038.6     Squared Cor.: 0.004978
Code
# quasipoisson
summary(feglm(months_in_poverty ~ h_male + h_age, data = df, family = "quasipoisson"))
GLM estimation, family = quasipoisson, Dep. Var.: months_in_poverty
Observations: 4,609 
Standard-errors: IID 
             Estimate Std. Error   t value   Pr(>|t|)    
(Intercept)  1.456020   0.091127 15.977892  < 2.2e-16 ***
h_male      -0.277845   0.059249 -4.689409 2.8192e-06 ***
h_age        0.001255   0.001429  0.878563 3.7968e-01    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
                                           
  Squared Cor.: 0.004978                   

Doesn’t have to be integers!

  • You can use Poisson for non-integer outcomes, too!

    • It’s just a distribution
    • It’s often used for integer outcomes because it’s a count distribution
  • Jeff Wooldridge is a huge proponent of using (quasi-)Poisson

  • Two really nice things: it is robust and has an easy interpretation

Survival models

Hazard/survival models

  • These models are used for duration data
    • i.e. time until an event occurs or a state changes
    • e.g. time until death after being diagnosed with cancer, time until a person leaves poverty, time until a person gets a job, until contracting a disease, etc.
  • You need a very specific kind of data for this…
    • We want to know what kinds of variables predict the occurrence of some event (e.g. death).
  • Warning: I am not an expert on survival models. I will give you a brief overview, but you should consult a textbook for more information.

Pfizer-BioNTech COVID-19 vaccine

Survival models

  • Let’s use the package survival in R.
    • It has a dataset set up for us, called diabetic

Code
library(survival)
# dataset with "high-risk" diabetic retinopathy patients
head(diabetic)
  id laser age   eye trt risk  time status
1  5 argon  28  left   0    9 46.23      0
2  5 argon  28 right   1    9 46.23      0
3 14 xenon  12  left   1    8 42.50      0
4 14 xenon  12 right   0    6 31.30      1
5 16 xenon   9  left   1   11 42.27      0
6 16 xenon   9 right   0   11 42.27      0

  • Note the structure of this dataset
    • Each patient (id) has two rows, one per eye
    • time is the follow-up time (in months) until the event or censoring
    • Interested in loss of sight (status = 1)

Survival function

  • Let’s first look at the survival function
    • This is the probability of surviving past a certain time
    • We’re going to use the Kaplan-Meier estimator

Code
KM <- survfit(Surv(time = time, event = status) ~ 1, data = diabetic)
# note that Surv() is necessary here. It's a function that creates a survival object.
# The ~ 1 means this is for EVERYONE.

Survival curve

Code
KM <- survfit(Surv(time = time, event = status) ~ 1, data = diabetic)
plot(KM, ylab = "Survival probability", xlab = "Time (months)")

Comparing survival curves based on treatment

  • Base R has ugly plots. We can use survminer to make it work with ggplot2.

Code
library(survminer)
KM <- survfit(Surv(time = time, event = status) ~ 1, data = diabetic)
ggsurvplot(KM)$plot +
  labs(y = "Survival probability", x = "Time (months)") +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

Comparing survival curves based on treatment

Code
KM <- survfit(Surv(time = time, event = status) ~ trt, data = diabetic)
ggsurvplot(KM)$plot +
  labs(y = "Survival probability", x = "Time (months)") +
  scale_color_discrete("Treatment:", labels = c("No", "Yes")) +
  theme(legend.position = c(0.1, 0.2)) +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))

Comparing survival curves based on treatment, empirically

Code
survdiff(Surv(time = time, event = status) ~ trt, data = diabetic)
Call:
survdiff(formula = Surv(time = time, event = status) ~ trt, data = diabetic)

        N Observed Expected (O-E)^2/E (O-E)^2/V
trt=0 197      101     71.8      11.9      22.2
trt=1 197       54     83.2      10.3      22.2

 Chisq= 22.2  on 1 degrees of freedom, p= 2e-06 
Code
# changes the weighting (more weight on earlier time points); doesn't matter here! Huge differences.
survdiff(Surv(time = time, event = status) ~ trt, data = diabetic, rho = 1)
Call:
survdiff(formula = Surv(time = time, event = status) ~ trt, data = diabetic, 
    rho = 1)

        N Observed Expected (O-E)^2/E (O-E)^2/V
trt=0 197     80.3     57.6      8.95      20.7
trt=1 197     43.1     65.8      7.84      20.7

 Chisq= 20.7  on 1 degrees of freedom, p= 6e-06 

Adding more covariates

  • Plotting two survival curves with a simple treatment dummy is straightforward.
    • But what if we want to add more covariates?
    • For example, the diabetic dataset has age at diagnosis and which eye is affected. Do these matter?
  • We can use a Cox proportional hazards model to do this.
    • This is a semi-parametric model that is very popular in survival analysis.
    • It assumes proportional hazards: the hazard ratio between groups is constant over time.
    • Its semi-parametric nature means we do not estimate the baseline hazard function.
      • Instead, we compare hazards across values of the covariates

Adding more covariates

Code
coxph(Surv(time = time, event = status) ~ age + as_factor(eye) + trt, data = diabetic)
Call:
coxph(formula = Surv(time = time, event = status) ~ age + as_factor(eye) + 
    trt, data = diabetic)

                         coef exp(coef)  se(coef)      z        p
age                  0.003321  1.003326  0.005463  0.608   0.5433
as_factor(eye)right  0.350484  1.419754  0.162537  2.156   0.0311
trt                 -0.812522  0.443738  0.169412 -4.796 1.62e-06

Likelihood ratio test=27.59  on 3 df, p=4.421e-06
n= 394, number of events= 155 

  • Note that treatment is randomized, so we shouldn’t see large changes in the coefficient on treatment when we add covariates.
    • More on this next week!

Some warnings

  • Note that all these methods have assumptions that can sometimes be important

  • Given our time, I am purposefully just showing you the basics

    • If any of these specific methods interest you, I suggest doing more in-depth readings. I can provide some suggestions.

Articles for next week (check syllabus for textbook readings)

  • Deserranno, E., Stryjan, M., & Sulaiman, M. (2019). Leader selection and service delivery in community groups: Experimental evidence from Uganda. American Economic Journal: Applied Economics, 11(4), 240-267.

  • Fischer, T., Frölich, M., & Landmann, A. (2023). Adverse Selection in Low-Income Health Insurance Markets: Evidence from an RCT in Pakistan. American Economic Journal: Applied Economics, 15(3), 313-340.