Inequality and consumption variability

Josh Merfeld

University of Queensland

IZA

Jonathan Morduch

NYU Wagner School of Public Service

December 9, 2025

Lots of progress on poverty

But inequality much more stubborn

Gini coefficient across countries

Where does this data come from?

How do we calculate poverty and inequality?
- In developing countries, usually household surveys

Hard to collect accurate data!
- Do you know how much money you spent on maize in the last 365 days?
- So we ask about shorter time frames, like 7 days

Where does this data come from?

In developed countries, this is often administrative data
- Usually for the entire year
- Example: ATO/IRS tax data

But we nonetheless tend to compare across developed and developing countries
- “Poverty rate is the proportion of people living on less than X dollars per person per day”
  - “This rate is lower in country Y than in country Z”
- “The gini coefficient is higher in country A than country B”

But…

Merfeld and Morduch (2024): poverty in the two contexts measures something inherently different
- In developing countries, the poverty rate is the mean proportion of the year that people live in poverty

Something similar with inequality

How do we interpret inequality?
- Differences in income/expenditures across households

This paper: inequality as measured is actually a combination of two things:
- Inequality across households (traditional inequality)
- “Inequality” within households across time
- Related to measurement, but very different issue from focus in dev. country literature (e.g. Clarke and Kopczuk, 2025)

Propose method to estimate what we actually want

Apply ideas to look at effect of road construction on inequality

Inequality, mathematically

We start with a Theil index and decompose it into both of these components. What we think we are measuring:

\[\begin{equation} T_{L} = \frac{1}{N} \sum^{N}_{i=1} \ln \left( \frac{\mu}{\overline{x}_i} \right) \end{equation}\]

\(\overline{x}_i\) is each household’s mean expenditure
\(\mu = \frac{1}{N} \sum^{N}_{i=1} \overline{x}_i\) is the overall mean
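
A minimal sketch of the index above, assuming strictly positive expenditures. The function name `theil_l` and the example values are illustrative and not from the paper.

```python
import numpy as np

def theil_l(x):
    """Theil L index: the mean of ln(mu / x_i), where mu is the mean of x."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("expenditures must be strictly positive")
    mu = x.mean()
    return float(np.mean(np.log(mu / x)))

# Hypothetical household mean expenditures
equal = theil_l([100.0, 100.0, 100.0, 100.0])   # identical households
unequal = theil_l([50.0, 150.0])                # same mean, unequal households
print(equal, unequal)
```

With identical households every log term is ln(1), so the index is zero. Any dispersion around the mean makes it positive.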

Household expenditures

We only observe one random month

Another possible sample

Inequality, mathematically

What we are actually measuring (where \(x_{it}\) is observed expenditures, which is a random draw for each household):

\[\begin{align} T_{L-HF} = \frac{1}{N} \sum^{N}_{i=1} \ln \left( \frac{\mu}{x_{it}} \right) \neq \frac{1}{N} \sum^{N}_{i=1} \ln \left( \frac{\mu}{\overline{x}_i} \right) \end{align}\]

Inequality, mathematically

What we are actually measuring (where \(x_{it}\) is observed expenditures, which is a random draw for each household):

\[\begin{align} T_{L-HF} =& \frac{1}{N} \sum^{N}_{i=1} \ln \left( \frac{\mu}{x_{it}} \right) \\ =& \frac{1}{N} \sum^{N}_{i=1} \ln \left( \frac{\mu}{\overline{x}_i}\frac{\overline{x}_i}{x_{it}} \right) \\ =& \frac{1}{N} \sum^{N}_{i=1} \ln \left( \frac{\mu}{\overline{x}_i} \right) + \frac{1}{N} \sum^{N}_{i=1} \frac{1}{T} \sum^{T}_{t=1} \ln \left( \frac{\overline{x}_i}{x_{it}} \right) \\ =& T_{L} + V_{L}. \end{align}\]

The third line averages over the \(T\) months in which a household could have been sampled, so it holds in expectation over the random draw of \(t\).

We call \(T_{L}\) between inequality and \(V_{L}\) within inequality.
Note that \(V_{L}\) is arguably of interest in its own right! It is within-household expenditure variability.
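
The decomposition can be checked numerically. The sketch below uses simulated data only: the lognormal expenditure process, the sample sizes, and all variable names are assumptions made for illustration. It holds \(\mu\) fixed at the mean of the household means, as in the formulas above, and compares the measured index averaged over all possible sampled months with the sum of the two components.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 1000, 12  # hypothetical: 1,000 households observed over 12 months

# Simulated monthly expenditures: a permanent household component times monthly noise
base = rng.lognormal(mean=4.0, sigma=0.5, size=(N, 1))
x = base * rng.lognormal(mean=0.0, sigma=0.3, size=(N, T))

xbar = x.mean(axis=1)  # each household's mean expenditure
mu = xbar.mean()       # overall mean

T_L = np.mean(np.log(mu / xbar))          # between inequality
V_L = np.mean(np.log(xbar[:, None] / x))  # within inequality, averaged over i and t
T_L_HF = np.mean(np.log(mu / x))          # measured index, averaged over sampled months

# One realised survey: a single random month per household
t = rng.integers(0, T, size=N)
one_draw = np.mean(np.log(mu / x[np.arange(N), t]))

print(T_L, V_L, T_L_HF, one_draw)
```

In this simulation `T_L_HF` equals `T_L + V_L` up to floating-point error, and `V_L` is non-negative because the log of a mean is at least the mean of the logs (Jensen's inequality). A single survey draw differs from the average only by sampling noise.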

Recovering traditional inequality

We want to recover traditional inequality, \(T_{L}\)
- This isn’t what we’re measuring!

How? We estimate each household’s mean expenditures for the year, \(\overline{x}_i\)
- We use detailed information on households and a modern machine learning method, XGBoost

Basic idea

Estimate XGBoost model to predict monthly expenditures in each month, \(x_{it}\)

Use these predictions to estimate each household’s mean expenditures, \(\overline{x}_i\)

Calculate between inequality using these estimates

Calculate within inequality as the difference between measured Theil index and between inequality

Bootstrap entire process for inference

How do we know how well it works?
- Validate the method using ICRISAT data, which record monthly expenditures for ~1,000 households in India
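
The steps above can be sketched end to end on simulated data. This is not the paper's implementation: an ordinary least squares model on log expenditures stands in for XGBoost so that the example needs only NumPy, and every function name, covariate, and parameter here is hypothetical.

```python
import numpy as np

def theil_l(x):
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.log(x.mean() / x)))

def design(z, month, n_months=12):
    """Intercept, household covariates, and month dummies (first month omitted)."""
    dummies = np.eye(n_months)[month][:, 1:]
    return np.column_stack([np.ones(len(z)), z, dummies])

def decompose(z, month, x_obs, n_months=12):
    # Step 1: model observed monthly expenditures (OLS on logs, a stand-in for XGBoost)
    beta, *_ = np.linalg.lstsq(design(z, month, n_months), np.log(x_obs), rcond=None)
    # Step 2: predict every month for every household, then average across months
    preds = np.column_stack([
        np.exp(design(z, np.full(len(z), m), n_months) @ beta)
        for m in range(n_months)
    ])
    xbar_hat = preds.mean(axis=1)
    # Steps 3-4: between from the estimated means, within as the remainder
    total = theil_l(x_obs)
    between = theil_l(xbar_hat)
    return total, between, total - between

def bootstrap(z, month, x_obs, reps=200, seed=0):
    # Step 5: resample households and repeat the entire procedure
    rng = np.random.default_rng(seed)
    n = len(x_obs)
    draws = []
    for _ in range(reps):
        idx = rng.integers(0, n, size=n)
        draws.append(decompose(z[idx], month[idx], x_obs[idx]))
    return np.percentile(np.array(draws), [2.5, 97.5], axis=0)

# Simulated survey: one observed month per household, two made-up covariates
rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=(n, 2))
month = rng.integers(0, 12, size=n)
season = 0.2 * np.sin(2 * np.pi * month / 12)
noise = rng.normal(scale=0.3, size=n)
x_obs = np.exp(4.0 + 0.5 * z[:, 0] + 0.3 * z[:, 1] + season + noise)

total, between, within = decompose(z, month, x_obs)
ci = bootstrap(z, month, x_obs, reps=100)
print(total, between, within)
print(ci)
```

In this toy setup anything the covariates do not explain ends up in the within component, so the quality of the predictive model matters a great deal. That is the reason for validating against data where monthly expenditures are actually observed.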

Data

Two sources of data in this paper:

ICRISAT VDSA:
- Monthly household panel data for five years
- Rural India only
- Use for validation of our approach

National Sample Survey (NSS):
- Nationally representative household survey
- Three waves: 2004-2005, 2007-2008, 2011-2012

Decision trees and XGBoost

1. Decision tree: makes prediction
2. Residual: difference between actual and predicted
3. New decision tree: makes prediction on residuals
4. Repeat steps 2 and 3 many times
5. Final prediction: sum of all predictions
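
The loop above can be illustrated with one-split trees (stumps) on a single feature. This is a bare-bones sketch of boosting on residuals, not XGBoost itself, which adds regularisation, deeper trees, and other refinements. The `learning_rate` shrinkage is an added assumption, and setting it to 1.0 gives the plain sum of tree predictions described above. All names and data are illustrative.

```python
import numpy as np

def fit_stump(x, y):
    """Best single-split regression tree on one feature (squared error)."""
    best = None
    for s in np.unique(x)[:-1]:  # drop the largest value so the right side is never empty
        left, right = y[x <= s], y[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, lo, hi = best
    return lambda q: np.where(q <= s, lo, hi)

def boost(x, y, n_trees=50, learning_rate=0.3):
    trees, pred = [], np.zeros_like(y)
    for _ in range(n_trees):
        residual = y - pred                    # actual minus current prediction
        tree = fit_stump(x, residual)          # new tree fitted to the residuals
        pred = pred + learning_rate * tree(x)  # update the running prediction
        trees.append(tree)
    # Final prediction: (shrunken) sum of all tree predictions
    return lambda q: learning_rate * sum(t(q) for t in trees)

# Simulated example: a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

model = boost(x, y)
mse = np.mean((y - model(x)) ** 2)
print(mse, np.var(y))
```

Each stump on its own is a crude predictor. The gain comes from fitting every new tree to what the previous trees still get wrong.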

Validation with ICRISAT

NSS: Even better! Out-of-sample, monthly expenditure per capita

Summary so far

Starting point is estimating monthly expenditures
- But we then want to aggregate to annual mean
- We only use the annual values in the rest of our estimation

ICRISAT:
- Monthly expenditures correlation: \(0.636\)
- Annual expenditures much better: \(0.826\)

NSS:
- Monthly expenditures correlation: \(\approx0.81\)
- Annual expenditures much better: \(???\)

How much are we overestimating inequality?

	2004-2005	2007-2008	2011-2012
Theil - total	0.200	0.223	0.208
	(0.184, 0.214)	(0.203, 0.248)	(0.187, 0.226)
Theil - between	0.179	0.192	0.179
	(0.165, 0.193)	(0.176, 0.207)	(0.164, 0.193)
Theil - within	0.021	0.031	0.028
	(0.016, 0.026)	(0.02, 0.048)	(0.021, 0.036)
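
As a quick arithmetic check, the within component's share of the measured Theil index can be computed from the point estimates in the table above. The variable names are illustrative.

```python
# Point estimates copied from the table above
total = {"2004-2005": 0.200, "2007-2008": 0.223, "2011-2012": 0.208}
within = {"2004-2005": 0.021, "2007-2008": 0.031, "2011-2012": 0.028}

shares = {wave: within[wave] / total[wave] for wave in total}
for wave, share in shares.items():
    print(f"{wave}: within share of measured Theil = {share:.1%}")
# 2004-2005: 10.5%, 2007-2008: 13.9%, 2011-2012: 13.5%
```

The within component is roughly 10 to 14 percent of measured inequality in each wave.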

Within-household expenditure variability

	2004-2005	2007-2008	2011-2012
Head less than primary	0.004	0.006	0.009
	(0.003, 0.006)	(0.004, 0.007)	(0.006, 0.012)
Head primary or higher	0.009	0.013	0.027
	(0.007, 0.01)	(0.011, 0.016)	(0.021, 0.031)
Head male	0.010	0.014	0.028
	(0.008, 0.011)	(0.012, 0.018)	(0.022, 0.033)
Head female	0.007	0.013	0.024
	(0.002, 0.012)	(0.01, 0.018)	(0.019, 0.029)

Richer households have higher within-household expenditure variability!

An application to PMGSY

Pradhan Mantri Gram Sadak Yojana (PMGSY)
- Rural road construction program in India
- We use NSS data to estimate the impact of PMGSY on inequality
- Shapefiles from https://geosadak-pmgsy.nic.in/opendata/

Use roll-out of roads across time and space to estimate impact
- We use a difference-in-differences approach
- Test robustness using recentered IV (Borusyak and Hull, 2023)
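
A minimal sketch of a two-way fixed effects estimate on a simulated balanced panel. The units, periods, effect size, and noise levels are all made up, and this is not the paper's specification, data, or recentered IV procedure. It shows only the mechanics: removing unit and period means from both the outcome and road length, then regressing one on the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 200, 3  # hypothetical: 200 areas, 3 survey waves
true_effect = -0.005         # made-up effect of one unit of road length

unit_fe = rng.normal(scale=0.05, size=(n_units, 1))
time_fe = rng.normal(scale=0.02, size=(1, n_periods))
# Road length accumulates over time at different rates across units
roads = np.cumsum(rng.uniform(0, 2, size=(n_units, n_periods)), axis=1)
noise = rng.normal(scale=0.002, size=(n_units, n_periods))
y = 0.2 + unit_fe + time_fe + true_effect * roads + noise

def within_transform(a):
    """Remove unit and period means (valid for a balanced panel)."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

y_dm = within_transform(y)
d_dm = within_transform(roads)
beta_hat = (d_dm * y_dm).sum() / (d_dm ** 2).sum()
print(beta_hat)
```

The same regression can be run separately with total, between, and within inequality as the outcome, which is how the three columns of an effects table would line up.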

Rollout of PMGSY roads

Effects of PMGSY on inequality

	Total	Between	Within
Panel A: Simple TWFE
Length of roads	-0.005	-0.003	-0.002
	(-0.014, 0.001)	(-0.006, 0.000)	(-0.009, 0.002)
	[-0.012, -0.000]	[-0.005, -0.000]	[-0.008, 0.001]
Panel B: Recentered IV control
Length of roads	-0.005	-0.003	-0.002
	(-0.014, 0.001)	(-0.006, 0.000)	(-0.009, 0.002)
	[-0.012, -0.000]	[-0.005, -0.000]	[-0.007, 0.001]

Wrapping up

Inequality as measured in LDCs is actually a combination of two things:
- Traditional inequality (differences across households)
- Within-household expenditure variability

We propose and implement a method to estimate traditional inequality
- We find that we are overestimating inequality by 10-15% in the NSS data

Within-household expenditure volatility is the remainder
- It is also of interest in its own right!
- Higher volatility for richer households

Wrapping up

We apply our method to PMGSY
- We find that PMGSY has a small but significant impact on inequality as measured (what we normally see)

However, important caveat:
- Its effects are approximately equal in magnitude for both types of inequality!
- True even though “within” inequality is only 10-15% of measured inequality

Next steps

MGNREGS? Other programs?

Use LSMS data to validate in other countries?
- Malawi
- Tanzania
- Uganda
- Ethiopia

Thank you!
