This week we will review methods for limited dependent variables
We will also discuss how to interpret the results
Some of this will be review, but there will be some new stuff, too
For example, we will discuss Poisson regression, which you might not have seen before
What are we doing today?
Much of the mathematical notation and theory comes from Econometrics by Bruce Hansen
I already discussed the older version of the textbook that is available for free online
A small note:
You can use OLS for binary outcomes! This is actually pretty common in economics.
I’ll discuss this more in a bit.
Other disciplines (e.g. public health) really like some of the new methods we will discuss today, so they are worth knowing.
Binary Choice
Binary choice
Let’s start with the simplest possibility: binary choice
You can think of this as a No/Yes or False/True question, but we will generally refer to it as 0/1 choice
In programming, always remember that \(FALSE=0\) and \(TRUE=1\)
Focusing on the binary choice case will allow us to build intuition for the more general case of discrete choice
We will also be able to use the same data as we move to the more general case
Dichotomous variables
We will be thinking about this \(Y\in\{0,1\}\)
In other words, \(Y\) is a dichotomous (dummy) variable that can only be equal to 0 or 1
We are going to discuss methods to output conditional probabilities: \[\begin{gather} \mathbb{P}\left[Y=1|X\right] \end{gather}\]
The error term
Consider the following model: \[\begin{gather} Y = \beta X + \epsilon, \end{gather}\] where \(X\) can have any number of columns (variables), \(k\).
We already know that \(\epsilon\) does not need to be normally distributed
In fact, if \(Y\in\{0,1\}\), then \(\epsilon\) will never be normally distributed
Why?
The error term
\(\epsilon\) has a two-point conditional distribution: \[\begin{gather} \epsilon = \Biggl\{ \begin{array}{ll} 1 - P(X), & \text{with probability } P(X) \\ -P(X), & \text{with probability } 1 - P(X) \end{array} \end{gather}\]
\(\epsilon\) is heteroskedastic: \[\begin{gather} \text{Var}(\epsilon|X) = P(X)(1-P(X)) \end{gather}\]
In fact, the variance of any dummy variable is \(P(1-P)\), where \(P\) is the probability of the dummy variable being equal to 1
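A one-line check of that variance claim, writing \(P\) as shorthand for \(P(X)\) (since \(\mathbb{E}[\epsilon|X]=0\), the variance is just \(\mathbb{E}[\epsilon^2|X]\)): \[\begin{gather} \text{Var}(\epsilon|X) = \mathbb{E}[\epsilon^2|X] = (1-P)^2P + (-P)^2(1-P) = P(1-P)\left[(1-P)+P\right] = P(1-P) \end{gather}\]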
Scatterplots are pretty worthless!
Code
# for reading in Stata data. Install this using your console
library(haven)

# read in data for the week:
df <- read_dta("data.dta")

# scatter of in_poverty on h_age:
ggplot(data = df) +
  geom_point(aes(x = h_age, y = in_poverty)) +
  labs(x = "Head age", y = "In poverty this month") +
  theme_bw() +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))
Let’s plot out the conditional probabilities (means)
Code
# means by age
povmeans <- df %>%
  group_by(h_age) %>%
  summarize(mean = mean(in_poverty)) %>%
  ungroup()

# scatter of mean poverty on h_age:
ggplot(data = povmeans) +
  geom_point(aes(x = h_age, y = mean)) +
  labs(x = "Head age", y = "P(poverty)") +
  theme_bw() +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))
Linear probability model (LPM)
We can estimate this using OLS: \[\begin{gather} Y = \beta X + \epsilon \end{gather}\]
Code
f1 <- feols(in_poverty ~ h_male + h_age, data = df)
f2 <- feols(in_poverty ~ h_male + h_age, data = df, vcov = "HC1")
etable(f1, f2)
The interpretation of these coefficients is pretty straightforward:
It is the change in the probability of being in poverty for a one-unit change in the variable of interest
Male-headed households are 6.1 percentage points less likely to be poor than female-headed households, controlling for age.
Each additional year of age increases the probability of being in poverty by 0.005 percentage points, controlling for the gender of the head.
Linear probability model (LPM)
Sometimes people motivate other estimation methods based on heteroskedasticity
But we can easily correct for this using robust standard errors (HC1 in feols)
There are two other problems with LPM, though:
The predicted values can be outside of the 0-1 range
Is this a problem? Maybe. Maybe not. It depends on what you’re doing (there’s a quick check below).
Constant effects throughout the probability distribution
Is this realistic? If we think someone has a 95 percent probability of being poor, do we think the percentage point change would be the same for changing a variable relative to someone with a 50 percent probability of being poor?
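To see how often the first problem (out-of-range predictions) actually bites, we can look at the fitted values directly. A minimal sketch, reusing the f1 LPM fit from above (fixest fits support predict()):
Code
# how often do LPM predictions fall outside [0, 1]?
p_hat <- predict(f1)
summary(p_hat)
mean(p_hat < 0 | p_hat > 1)  # share of out-of-range predictions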
Another option
We can think of this as a latent variable model: \[\begin{gather} Y^* = \beta X + \epsilon \\
\epsilon\sim G(\epsilon) \\
Y = \mathbbm{1}(Y^*>0) = \Biggl\{ \begin{array}{ll} 1, & \text{if } Y^*>0 \\ 0, & \text{otherwise} \end{array} \end{gather}\] where \(Y^*\) is the latent variable, \(G(\cdot)\) is the distribution of \(\epsilon\), and \(\mathbbm{1}(\cdot)\) is the indicator function.
One way to think about this is that \(Y^*\) is utility, but we only observe whether the choice increases utility (\(Y=1\)) or doesn’t (\(Y=0\)).
Let’s give this a bit more structure
Note that \(Y=1\) iff \(Y^*>0\), which is the same as saying \(\beta X + \epsilon > 0\)
The response probability is then one minus the CDF of \(\epsilon\) evaluated at \(-\beta X\): \[\begin{gather} \mathbb{P}\left[Y=1|X\right] = \mathbb{P}\left[\epsilon > -\beta X\right] = 1 - G(-\beta X) = G(\beta X) \end{gather}\]
Note that CDFs (cumulative distribution functions) give us probabilities of being less than or equal to a given value
The last equality holds because \(G(\cdot)\) will always be symmetric around 0 here
That value here is \(\beta X\)
[Figure: CDF of the logistic (sigmoid) function, from Wikipedia]
The link function
The function \(G(\cdot)\) is called the link function and plays an important role here
Two common link functions are the logit and probit link functions:
Logit: \(G(\epsilon) = \Lambda(\epsilon) = \frac{\exp(\epsilon)}{1+\exp(\epsilon)}\), where \(\Lambda(\cdot)\) is the CDF of the standard logistic distribution
Probit: \(G(\epsilon) = \Phi(\epsilon)\), where \(\Phi(\cdot)\) is the CDF of the standard normal distribution
We will discuss these in more detail in a bit
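If you want to see the two side by side, both CDFs are built into base R; here is a quick sketch that overlays them:
Code
# logistic CDF (logit link) vs. standard normal CDF (probit link)
curve(plogis(x), from = -4, to = 4, xlab = "x", ylab = "G(x)")
curve(pnorm(x), from = -4, to = 4, add = TRUE, lty = 2)
legend("topleft", legend = c("logit (plogis)", "probit (pnorm)"), lty = c(1, 2))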
Likelihood
Likelihood: the joint probability of the data, evaluated at the sample values and viewed as a function of the parameters
What?
Let’s build intuition with the normal regression model (probit will also rely on the normal distribution). Here is the conditional density of \(Y\) given \(X\) under normality: \[\begin{gather} f(Y|X) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(Y-\beta X)^2\right) \end{gather}\]
In this case, what is the probability we observe our sample given the values of \(\beta\) and \(\sigma\)? \[\begin{align} f(y_1,\ldots,y_n | x_1,\ldots, x_n) &= \prod_{i=1}^n f(y_i | x_i) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i-\beta x_i)^2\right) \\
&= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\beta x_i)^2\right) \\
&= L_n(\beta, \sigma^2) \end{align}\]
Note that it is a function of the parameters, \(\beta\) and \(\sigma^2\)
The properties of logs make this easier to work with: \[\begin{gather} \ell_n(\beta, \sigma^2) = \log(L_n(\beta, \sigma^2)) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\beta x_i)^2 \end{gather}\]
This is the log likelihood function
It is of course also a function of the parameters, \(\beta\) and \(\sigma^2\)
We have our log likelihood function, which is a function of the parameters, \(\beta\) and \(\sigma^2\)
Maximum likelihood estimation
What we want to do is find the values of \(\beta\) and \(\sigma^2\) that maximize this function
In other words, the values that make our sample the most likely to have been observed, or the biggest probability of observing our sample
This is called maximum likelihood estimation
\[\begin{gather} \left(\hat{\beta}, \hat{\sigma}^2\right) = \underset{\beta\in\mathbb{R}^k,\sigma^2>0}{\text{argmax}}\;\ell_n(\beta, \sigma^2) \end{gather}\] where \(k\) is the number of variables (coefficients), including the intercept.
Maximum likelihood estimation
A simple example is a coin flip
Let’s say we flip a coin. If it’s a fair coin, what is the probability of obtaining heads?
- 50% or 0.5 (we generally work with the proportion 0.5, and not the percent)
What is the probability of obtaining heads twice in a row?
- 0.5 * 0.5 = 0.25
Three times in a row?
- 0.5 * 0.5 * 0.5 = 0.125
Maximum likelihood estimation
Say we flip the coin a bunch of times
For argument’s sake, let’s say we flip it 100 times and obtain 60 heads
If we know nothing about whether the coin is actually fair, what is the most likely distribution that would give us 60 heads and 40 tails?
It’s a distribution in which the probability of heads is 0.6!
This is like maximum likelihood estimation. We are trying to find the parameter values that make our sample the most likely to have been observed.
In this case, the parameter would be the true mean of the distribution of the coin (where heads = 1 and tails = 0).
We could then of course test whether this is significantly different from 0.5, which might be our null hypothesis.
In practice, MLE is almost always carried out using numerical optimization
We will not discuss the details of this here
The basic reason is that most likelihood functions are not easy to maximize analytically (i.e. they have no closed-form solution)
In the case of the normal regression model, however, there is a closed-form solution
And this is the same closed-form solution as OLS!
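To make the numerical-optimization idea concrete, here is a tiny sketch that recovers the coin-flip answer from before by minimizing the negative log likelihood (base R's optimize() does the one-dimensional search):
Code
# coin-flip MLE: 60 heads out of 100 flips
negloglik <- function(p) -dbinom(60, size = 100, prob = p, log = TRUE)
optimize(negloglik, interval = c(0.001, 0.999))$minimum  # should be very close to 0.6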
Logit for binary choice
Let’s return to our binary choice model.
Regardless of how you estimate it, the probability mass function for \(Y\) is:
\[\begin{gather} \pi(y) = p^y(1-p)^{1-y}, \end{gather}\] where \(p\) is the probability of \(Y=1\), or the mean. Remember that \(Y\in\{0,1\}\); i.e., it can only equal 0 or 1.
Let’s bring our link function back into it. The conditional probability is: \[\begin{gather} \pi(Y|X)=G(\beta X)^Y(1-G(\beta X))^{1-Y}=G(\beta X)^Y(G(-\beta X))^{1-Y}=G(\beta Z), \end{gather}\] where \(Z = (2Y-1)X\), i.e. \(Z = X\) when \(Y=1\) and \(Z = -X\) when \(Y=0\) (the second equality uses the symmetry of \(G(\cdot)\)).
Taking logs (because they’re easy to work with), we get the log likelihood function: \[\begin{gather} \ell_n(\beta) = \sum_{i=1}^n \log G(\beta z_i) \end{gather}\]
This is the same as the log likelihood function for probit, except that the link function is different.
Logit for binary choice
Again, we want to find the value of \(\beta\) that maximizes this function (the scale of \(\epsilon\) is normalized, since it is not separately identified in binary choice models): \[\begin{gather} \hat{\beta} = \underset{\beta\in\mathbb{R}^k}{\text{argmax}}\;\ell_n(\beta) \end{gather}\]
Something interesting is that in practice, we don’t numerically maximize…
Instead, we minimize the negative of the log likelihood function!
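Here is a minimal sketch of what the computer is doing under the hood, assuming the df data frame from above with in_poverty and h_male coded 0/1: we hand-roll the negative log likelihood for a one-regressor logit, minimize it with optim(), and compare to glm():
Code
# hand-rolled negative log likelihood for a logit of in_poverty on h_male
negloglik <- function(b) {
  eta <- b[1] + b[2] * df$h_male
  -sum(df$in_poverty * log(plogis(eta)) +
         (1 - df$in_poverty) * log(1 - plogis(eta)))
}
optim(c(0, 0), negloglik)$par                                  # numerical minimizer
coef(glm(in_poverty ~ h_male, data = df, family = binomial))   # glm's answer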
When we exponentiate the logit coefficient on h_male, the result is referred to as an odds ratio.
It’s less than one, which means male-headed households are less likely to be in poverty.
Their odds of being in poverty are around 24% lower.
The coefficient?
Mean for female-headed households: 0.35
odds (\(\frac{p_f}{1-p_f}\)): 0.538
Mean for male-headed households: 0.289
odds \(\left(\frac{p_m}{1-p_m}\right)\): 0.407
Odds ratio (\(\frac{\frac{p_m}{1-p_m}}{\frac{p_f}{1-p_f}}\)): 0.756
Exponentiating shows us the odds ratio!
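A quick sketch of that last point, fitting the logit with glm() and exponentiating (the coefficient on h_male should come out to roughly the 0.756 odds ratio computed above):
Code
# logit fit; exponentiating the slope gives the odds ratio
m <- glm(in_poverty ~ h_male, data = df, family = binomial(link = "logit"))
exp(coef(m))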
Marginal effects
We can also calculate marginal effects
These are the change in the probability of being in poverty for a one-unit change in the variable of interest
An important note is that this depends on where you are located in the distribution
We just calculated the means, so with only the binary independent variable, we know that the marginal effect is:
-0.0606
We will use the “mfx” package for this, so please install it in the console.
Marginal effects
Code
logitmfx(in_poverty ~ h_male, data = df)
Call:
logitmfx(formula = in_poverty ~ h_male, data = df)
Marginal Effects:
dF/dx Std. Err. z P>|z|
h_male -0.060649 0.025981 -2.3343 0.01958 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dF/dx is for discrete change for the following variables:
[1] "h_male"
Code
# for binary outcomes, it shows the change from 0 to 1!
logit with more coefficients
Code
logitmfx(in_poverty ~ h_male + h_age, data = df)
Call:
logitmfx(formula = in_poverty ~ h_male + h_age, data = df)
Marginal Effects:
dF/dx Std. Err. z P>|z|
h_male -6.0623e-02 2.5982e-02 -2.3333 0.01963 *
h_age -4.9739e-05 5.3672e-04 -0.0927 0.92616
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dF/dx is for discrete change for the following variables:
[1] "h_male"
Code
# for binary outcomes, it shows the change from 0 to 1!
# for continuous variables, it's the derivative (i.e. instantaneous change)!
# By default, it calculates these by holding variables AT THEIR MEANS
Probit
Code
summary(glm(in_poverty ~ h_male, data = df, family = binomial(link = "probit")))$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3856924 0.06759123 -5.706248 1.154935e-08
h_male -0.1699918 0.07058912 -2.408188 1.603194e-02
What about probit coefficients?
These relate to the CDF of the standard normal distribution
The intercept maps to the mean for female-headed households through the standard normal CDF
Standard normal CDF
The intercept is -0.3856924
The mean poverty for female-headed households is 0.35
Here’s the CDF for the standard normal distribution with the intercept:
Standard normal CDF
The intercept is -0.3856924
The mean poverty for female-headed households is 0.35
Here’s the CDF for the standard normal distribution with BOTH:
Now let’s look at the coefficient on h_male
Code
summary(glm(in_poverty ~ h_male, data = df, family = binomial(link = "probit")))$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3856924 0.06759123 -5.706248 1.154935e-08
h_male -0.1699918 0.07058912 -2.408188 1.603194e-02
Standard normal CDF
The intercept is -0.3856924 and the coefficient is -0.1699918
Here’s the CDF for the standard normal distribution with BOTH:
Standard normal CDF
The intercept is -0.3856924 and the coefficient is -0.1699918
What’s the change in PROBABILITY? \(\text{mean(male)} - \text{mean(female)}\) or -0.0606
Marginal effects
The intercept is -0.3856924 and the coefficient is -0.1699918
What’s the change in PROBABILITY? \(\text{mean(male)} - \text{mean(female)}\) or -0.0606
Code
probitmfx(in_poverty ~ h_male, data = df)
Call:
probitmfx(formula = in_poverty ~ h_male, data = df)
Marginal Effects:
dF/dx Std. Err. z P>|z|
h_male -0.060649 0.025981 -2.3343 0.01958 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dF/dx is for discrete change for the following variables:
[1] "h_male"
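We can also back this number out by hand from the probit coefficients; a quick sketch using pnorm():
Code
# by-hand check of the discrete marginal effect implied by the probit fit
b <- coef(glm(in_poverty ~ h_male, data = df, family = binomial(link = "probit")))
pnorm(b[1] + b[2]) - pnorm(b[1])  # roughly -0.0606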
Probit and marginal effects
Code
# change in z-scores
summary(glm(in_poverty ~ h_male + log(h_age), data = df,
            family = binomial(link = "probit")))$coefficients
Note how setosa isn’t there… it’s the omitted category
We can interpret the coefficients as the log odds of being in a particular category relative to setosa (the omitted category)
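This interpretation comes from a multinomial model; a minimal sketch of one on the iris data with nnet::multinom (the specification here is just for illustration, not necessarily the exact one fit above):
Code
# multinomial logit: Species as the outcome; setosa (the first factor level)
# is the omitted/reference category
library(nnet)
m <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
summary(m)  # coefficients are log odds relative to setosa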
Extensions
Extensions
GLM is a very general framework
We can use it for other distributions, too
Let’s look at the Poisson distribution
The Poisson distribution is often used for count data
It assumes the mean equals the variance, but violating that assumption isn’t a big deal (we can adjust the standard errors)
The Poisson distribution
The Poisson distribution is defined as follows: \[\begin{gather} f(y; \lambda) = \frac{\lambda^ye^{-\lambda}}{y!} \end{gather}\]
Note the \(e\): this is going to lead to a nice log interpretation
\(y\) is the number of occurrences of the variable (in our example, it will be number of months in poverty)
As mentioned, one implication of this distribution is: \[\begin{gather} E(y) = Var(y) = \lambda \end{gather}\] i.e. the mean of \(y\) equals its variance. But we can work around this if it’s false (which it probably is).
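A quick simulation check of that mean-equals-variance property (a sketch with an arbitrary \(\lambda = 3\)):
Code
# the mean and variance of a Poisson are both lambda
x <- rpois(1e5, lambda = 3)
c(mean(x), var(x))  # both close to 3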
Poisson, months in poverty (with feglm, the fixest analogue of feols)
Code
# could save results; not going to here... just display them
summary(feglm(months_in_poverty ~ h_male + h_age, data = df, family = "poisson"))

# notice the difference in standard errors if we use HC1
# (which we want here because of overdispersion)
summary(feglm(months_in_poverty ~ h_male + h_age, data = df,
              family = "poisson", vcov = "HC1"))
You can use Poisson for non-integer outcomes, too!
It’s just a distribution
It’s often used for integer outcomes because it’s a count distribution
Jeff Wooldridge is a huge proponent of using (quasi-)Poisson
Two really nice things: it is robust and has an easy interpretation
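On the interpretation point: the coefficients act on \(\log \mathbb{E}[Y|X]\), so exponentiating gives a multiplicative effect on the expected count. A sketch reusing the model from above:
Code
# exp(coef) - 1 is (approximately) the percent change in the expected
# number of months in poverty for a one-unit change in the regressor
m <- feglm(months_in_poverty ~ h_male + h_age, data = df,
           family = "poisson", vcov = "HC1")
100 * (exp(coef(m)) - 1)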
Survival models
Hazard/survival models
These models are used for duration data
i.e. time until an event occurs or a state changes
e.g. time until death after being diagnosed with cancer, time until a person leaves poverty, time until a person gets a job, until contracting a disease, etc.
You need a very specific kind of data for this…
We want to know what kinds of variables predict the occurrence of some event (e.g. death).
Warning: I am not an expert on survival models. I will give you a brief overview, but you should consult a textbook for more information.
Pfizer-BioNTech COVID-19 vaccine
Survival models
Let’s use the package survival in R.
It has a dataset set up for us, called diabetic
Code
library(survival)

# dataset with "high-risk" diabetic retinopathy patients
head(diabetic)
id laser age eye trt risk time status
1 5 argon 28 left 0 9 46.23 0
2 5 argon 28 right 1 9 46.23 0
3 14 xenon 12 left 1 8 42.50 0
4 14 xenon 12 right 0 6 31.30 1
5 16 xenon 9 left 1 11 42.27 0
6 16 xenon 9 right 0 11 42.27 0
Note: each patient (id) appears twice, once per eye
One eye of each patient received the laser treatment (trt = 1); the other did not
time is the follow-up time in months, until vision loss or censoring
We are interested in loss of sight (status = 1)
Survival function
Let’s first look at the survival function
This is the probability of surviving past a certain time
We’re going to use the Kaplan-Meier estimator
Code
KM <- survfit(Surv(time = time, event = status) ~ 1, data = diabetic)
# note that Surv() is necessary here. It's a function that creates a survival object.
# The ~ 1 means this is for EVERYONE.
Survival curve
Code
KM <- survfit(Surv(time = time, event = status) ~ 1, data = diabetic)
plot(KM, ylab = "Survival probability", xlab = "Time (months)")
Prettier survival curves with survminer
Base R has ugly plots. We can use survminer to make it work with ggplot2.
Code
library(survminer)

KM <- survfit(Surv(time = time, event = status) ~ 1, data = diabetic)
ggsurvplot(KM)$plot +
  labs(y = "Survival probability", x = "Time (months)") +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))
Comparing survival curves based on treatment
Code
KM <- survfit(Surv(time = time, event = status) ~ trt, data = diabetic)
ggsurvplot(KM)$plot +
  labs(y = "Survival probability", x = "Time (months)") +
  scale_color_discrete("Treatment:", labels = c("No", "Yes")) +
  theme(legend.position = c(0.1, 0.2)) +
  theme(plot.background = element_rect(fill = "#f0f1eb", color = "#f0f1eb"))
Comparing survival curves based on treatment, empirically
Code
survdiff(Surv(time = time, event = status) ~ trt, data = diabetic)
Call:
survdiff(formula = Surv(time = time, event = status) ~ trt, data = diabetic)
N Observed Expected (O-E)^2/E (O-E)^2/V
trt=0 197 101 71.8 11.9 22.2
trt=1 197 54 83.2 10.3 22.2
Chisq= 22.2 on 1 degrees of freedom, p= 2e-06
Code
# rho = 1 changes the weighting (more weight on earlier time points);
# doesn't matter here! Huge differences either way.
survdiff(Surv(time = time, event = status) ~ trt, data = diabetic, rho = 1)
Call:
survdiff(formula = Surv(time = time, event = status) ~ trt, data = diabetic,
rho = 1)
N Observed Expected (O-E)^2/E (O-E)^2/V
trt=0 197 80.3 57.6 8.95 20.7
trt=1 197 43.1 65.8 7.84 20.7
Chisq= 20.7 on 1 degrees of freedom, p= 6e-06
Adding more covariates
Plotting two survival curves with a simple treatment dummy is straightforward.
But what if we want to add more covariates?
For example, the diabetic dataset has age at diagnosis and which eye is affected. Do these matter?
We can use a Cox proportional hazards model to do this.
This is a semi-parametric model that is very popular in survival analysis.
It is a proportional hazards model, which means the hazard ratio between groups is assumed constant over time.
Because it is semi-parametric, we do not estimate the baseline hazard function.
Instead, we estimate how the covariates shift the hazard relative to that baseline.
Adding more covariates
Code
coxph(Surv(time = time, event = status) ~ age + as_factor(eye) + trt, data = diabetic)
Call:
coxph(formula = Surv(time = time, event = status) ~ age + as_factor(eye) +
trt, data = diabetic)
coef exp(coef) se(coef) z p
age 0.003321 1.003326 0.005463 0.608 0.5433
as_factor(eye)right 0.350484 1.419754 0.162537 2.156 0.0311
trt -0.812522 0.443738 0.169412 -4.796 1.62e-06
Likelihood ratio test=27.59 on 3 df, p=4.421e-06
n= 394, number of events= 155
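The exp(coef) column already gives the hazard ratios; a quick sketch to pull them out with confidence intervals:
Code
# hazard ratios with 95% confidence intervals
fit <- coxph(Surv(time = time, event = status) ~ age + as_factor(eye) + trt,
             data = diabetic)
exp(cbind(HR = coef(fit), confint(fit)))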
Note that treatment is randomized, so we shouldn’t see large changes in the coefficient on treatment when we add covariates.
More on this next week!
Some warnings
Note that all these methods have assumptions that can sometimes be important
Given our time, I am purposefully just showing you the basics
If any of these specific methods interest you, I suggest doing more in-depth readings. I can provide some suggestions.
Articles for next week (check syllabus for textbook readings)
Deserranno, E., Stryjan, M., & Sulaiman, M. (2019). Leader selection and service delivery in community groups: Experimental evidence from Uganda. American Economic Journal: Applied Economics, 11(4), 240-267.
Fischer, T., Frölich, M., & Landmann, A. (2023). Adverse Selection in Low-Income Health Insurance Markets: Evidence from an RCT in Pakistan. American Economic Journal: Applied Economics, 15(3), 313-340.