Improving Estimates of Mean Welfare and Uncertainty in Developing Countries

Josh Merfeld

KDI School and IZA

David Newhouse

World Bank and IZA

Hai-Anh Dang

World Bank and IZA

February 11, 2025

Poverty mapping

Getting from A to B

  • How do we get from A to B?
  • Statistical approach: small area estimation (SAE)
  • Machine learning (ML) approach: varied
  • Trade-offs:
    • SAE allows variance estimation
    • ML may be more accurate

This paper

  • In this paper, we propose a method to calculate uncertainty in ML estimates
    • XGBoost
    • Residual bootstrap
    • Purely geospatial data, to test performance in data-sparse contexts
  • Results:
    • XGBoost is more accurate and has better coverage rates
    • Better coverage comes from wider confidence intervals

Data

                   Madagascar   Malawi               Mozambique   Sri Lanka   Vietnam
Census data:
  Year             2017         2018                 2017         2012        2019
  Sample           100%         20%                  100%         100%        10%
Survey data                     IHS (2019)
Outcome            Assets       Assets and poverty   Assets       Assets      Assets
Geospatial data    Varied       Varied               Varied       Varied      Varied
Geospatial data includes MOSAIKS features (Rolf et al., 2021).

Example geospatial data

Small area estimation: EBP

  • EBP: Empirical Best Predictor (Molina and Rao, 2010)
    • Combines survey and synthetic estimates
    • Estimating a sub-area level model

Area and sub-area

  • Two-fold nested error regression model:

\[y_{as} = X'_{as}\beta_{1} + X'_{s}\beta_{2} + \eta_{a} + \varepsilon_{as}\]

  • Basic idea:
    • Predict the outcome for all subareas
    • Aggregate prediction to area level
    • Some areas have a sample, some don’t
    • Combine survey and synthetic estimates (based on variance)
    • Parametric bootstrap for inference
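The "combine survey and synthetic estimates (based on variance)" step can be illustrated with a precision-weighted composite. This is a simplified sketch, not the full EBP of Molina and Rao (2010): the function name `composite_estimate`, its arguments, and the shrinkage weight `gamma` are assumptions introduced here for illustration.

```python
def composite_estimate(direct, synthetic, var_area, var_error, n_sampled):
    """Shrink a direct survey mean toward a model-based (synthetic) prediction.

    direct     : survey mean for the area (ignored if n_sampled == 0)
    synthetic  : model prediction for the area
    var_area   : variance of the area effect (eta_a)
    var_error  : variance of the sub-area error (epsilon_as)
    n_sampled  : number of sampled sub-areas in the area
    """
    if n_sampled == 0:
        # Out-of-sample area: no survey data, so rely on the model alone.
        return synthetic
    # Weight on the direct estimate grows with sample size and with the
    # share of variance that sits at the area level.
    gamma = var_area / (var_area + var_error / n_sampled)
    return gamma * direct + (1.0 - gamma) * synthetic
```

With equal variances and a single sampled sub-area the two estimates get equal weight; as `n_sampled` grows, the composite moves toward the direct survey estimate.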

XGBoost

  • XGBoost (“eXtreme Gradient BOOSTing”)
    • Decision trees on steroids

Decision tree

XGBoost

  • Fit a decision tree to the outcome
  • Calculate residuals
  • Fit a new tree to those residuals
  • Calculate updated residuals
  • Repeat, adding each new tree to the ensemble
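The fit-then-refit-on-residuals loop above can be sketched in a few lines. This is plain gradient boosting with squared-error loss and one-split trees on a single feature, written in pure Python for illustration; it is not XGBoost itself, which adds regularization and other refinements. All function names and the defaults `n_rounds` and `learning_rate` are assumptions.

```python
def fit_stump(x, r):
    """Fit a one-split regression tree (a stump) to targets r on feature x."""
    mean_r = sum(r) / len(r)
    best = (float("inf"), x[0], mean_r, mean_r)  # fallback if x is constant
    for threshold in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= threshold]
        right = [ri for xi, ri in zip(x, r) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((ri - left_mean) ** 2 for ri in left)
               + sum((ri - right_mean) ** 2 for ri in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Fit stumps sequentially, each one to the residuals of the current fit."""
    base = sum(y) / len(y)  # start from the overall mean
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # calculate residuals
        stump = fit_stump(x, residuals)                   # fit a tree to them
        stumps.append(stump)
        pred = [pi + learning_rate * predict_stump(stump, xi)
                for pi, xi in zip(pred, x)]
    return base, learning_rate, stumps


def predict(model, xi):
    """Prediction is the base value plus the scaled sum of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)
```

Each round shrinks the training residuals a little; the learning rate below 1 keeps any single tree from dominating the ensemble.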

Variance estimation

  • Two-stage residual bootstrap
  • Estimate XGBoost at subarea
    • Calculate \(\hat{y}_{as}\) and \(\hat{\varepsilon}_{a}\)
    • Calculate \(\hat{\varepsilon}_{as} = y^{sample}_{as} - \hat{y}_{as}\) (and same at area level)
    • Randomly sample residuals at both levels 1,000 times
    • Calculate confidence intervals
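A minimal sketch of how a two-stage residual bootstrap of this kind might look, under several assumptions not stated on the slide: the model is not refit in each replicate, area means are unweighted averages of sub-area values, and the interval is a simple percentile interval. The function name `bootstrap_area_ci` and its arguments are hypothetical.

```python
import random


def bootstrap_area_ci(pred_by_area, area_resid, subarea_resid,
                      n_boot=1000, alpha=0.05, seed=0):
    """Percentile intervals for area means from resampled residuals.

    pred_by_area  : dict mapping area -> list of sub-area predictions
    area_resid    : pool of area-level residuals
    subarea_resid : pool of sub-area-level residuals
    """
    rng = random.Random(seed)
    intervals = {}
    for area, preds in pred_by_area.items():
        draws = []
        for _ in range(n_boot):
            # Stage 1: one area-level residual shared by all sub-areas.
            e_area = rng.choice(area_resid)
            # Stage 2: an independent sub-area residual for each sub-area.
            sim = [p + e_area + rng.choice(subarea_resid) for p in preds]
            draws.append(sum(sim) / len(sim))  # assumed unweighted area mean
        draws.sort()
        lo = draws[int((alpha / 2) * n_boot)]
        hi = draws[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
        intervals[area] = (lo, hi)
    return intervals
```

Drawing the area-level residual once per replicate is what lets the interval reflect error that is common to every sub-area in an area and therefore does not average out.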

Resample from censuses

  • We draw repeated samples from each census
  • Sampling design follows the actual household survey in each country (e.g. Malawi IHS)
    • Repeat 100 times
  • Use actual census as “truth”
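Because the census serves as the truth, each simulated sample can be scored by checking whether an area's interval contains its census value. A sketch of that scoring step for one sample, assuming intervals are stored as (lower, upper) pairs keyed by area; the function name `coverage_and_width` is hypothetical.

```python
def coverage_and_width(truth, intervals):
    """Return (coverage rate, mean interval width) across areas.

    truth     : dict mapping area -> census ("true") value
    intervals : dict mapping area -> (lower, upper) confidence bounds
    """
    covered = 0
    total_width = 0.0
    for area, (lo, hi) in intervals.items():
        covered += lo <= truth[area] <= hi  # True counts as 1
        total_width += hi - lo
    n = len(intervals)
    return covered / n, total_width / n
```

Averaging these two numbers over the repeated samples gives the kind of summary reported as coverage rate and CI width.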

Example: Sample 1

Overall results - Correlations

                   Pearson                  Spearman
                   XGB     EBP     Diff     XGB     EBP     Diff
Madagascar         0.894   0.883   0.011    0.826   0.804   0.022
Mozambique         0.917   0.880   0.036    0.800   0.767   0.032
Sri Lanka          0.941   0.927   0.013    0.925   0.901   0.024
Vietnam            0.913   0.905   0.007    0.913   0.912   0.001
Malawi (Assets)    0.825   0.652   0.173    0.835   0.719   0.116
Malawi (Poor)      0.863   0.494   0.369    0.828   0.527   0.300
Average            0.892   0.790   0.101    0.855   0.772   0.082

Overall results - Accuracy

                   MAE                         MSE
                   XGB      EBP      Diff      XGB      EBP      Diff
Madagascar         0.1459   0.1362    0.009    0.3294   0.3040    0.025
Mozambique         0.0714   0.1241   -0.052    0.2090   0.2838   -0.074
Sri Lanka          0.0285   0.0379   -0.009    0.1219   0.1476   -0.025
Vietnam            0.0716   0.0798   -0.008    0.2051   0.2089   -0.003
Malawi (Assets)    0.2233   0.7412   -0.517    0.2959   0.5239   -0.228
Malawi (Poor)      0.0293   0.0568   -0.027    0.1358   0.1950   -0.059
Average            0.0950   0.1960   -0.100    0.2162   0.2772   -0.061

Overall results - Coverage rates

                   Coverage rate            CI width
                   XGB     EBP     Diff     XGB     EBP     Diff
Madagascar         0.961   0.616   0.344    1.348   0.689   0.659
Mozambique         0.927   0.814   0.112    1.206   0.935   0.271
Sri Lanka          0.983   0.909   0.073    1.093   0.611   0.482
Vietnam            0.989   0.979   0.009    1.557   1.295   0.262
Malawi (Assets)    0.871   0.812   0.059    2.116   1.777   0.339
Malawi (Poor)      0.957   0.801   0.156    0.752   0.578   0.174
Average            0.948   0.822   0.125    1.346   0.981   0.364

Wrapping up

  • We propose a method to calculate uncertainty in ML estimates
    • XGBoost
    • Residual bootstrap
  • XGBoost is more accurate, both in terms of point estimates and uncertainty
  • Uses only geospatial data
    • Important for data-sparse contexts, where household surveys are few and far between

Thank you!



Website: https://joshmerfeld.github.io
GitHub repo: https://github.com/JoshMerfeld

Sample status

Correlation (pearson) by sample status

                   In sample                Out of sample
                   XGB     EBP     Diff     XGB     EBP     Diff
Madagascar         0.903   0.894   0.009    0.823   0.798   0.024
Mozambique         0.957   0.941   0.016    0.744   0.703   0.040
Sri Lanka          0.947   0.937   0.010    0.865   0.838   0.026
Vietnam            0.913   0.910   0.003    0.764   0.763   0.001
Malawi (Assets)    0.892   0.673   0.218    0.806   0.693   0.112
Malawi (Poor)      0.910   0.614   0.295    0.792   0.472   0.320
Average            0.920   0.828   0.092    0.799   0.711   0.087

Coverage rates by sample status

                   In sample                Out of sample
                   XGB     EBP     Diff     XGB     EBP     Diff
Madagascar         0.962   0.602   0.360    0.823   0.798   0.024
Mozambique         0.923   0.644   0.278    0.744   0.703   0.040
Sri Lanka          0.989   0.917   0.072    0.865   0.838   0.026
Vietnam            0.989   0.980   0.009    0.764   0.763   0.001
Malawi (Assets)    0.939   0.728   0.210    0.806   0.693   0.112
Malawi (Poor)      0.986   0.801   0.185    0.792   0.472   0.320
Average            0.965   0.779   0.186    0.799   0.711   0.087