Comparing methods for smoothing temporal demographic data
At the International Population Conference of the International Union for the Scientific Study of Population (IUSSP) I will present work on comparing different methods for smoothing demographic data. This post briefly outlines the motivation for the project and describes the R package distortr, which accompanies the project.
Motivation
An important part of demographic research is the ability to estimate and project time series of demographic and health indicators. However, it is often the case that populations that have the poorest outcomes also have poor-quality data. In these cases, the underlying trends may be unclear due to missing data or overly messy data.
In such situations, demographers often employ statistical models to help estimate and understand underlying trends. Often, these statistical models have the general form:
\[ \theta_t = f(X_t) + Z_t + \varepsilon_t \]
where
- \(\theta_t\) is the outcome of interest (mortality rate, fertility rate, etc.)
- \(f(X_t)\) is a regression framework, a function of covariates \(X_t\)
- \(Z_t\) are temporal distortions, which capture data-driven non-linear trends over time, not otherwise captured in \(f(X_t)\)
- \(\varepsilon_t\) is an error term.
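To make the decomposition concrete, here is a minimal sketch in plain Python (not code from the distortr package) that builds a series from the three components. It assumes a linear function of time for \(f(X_t)\) and an AR(1) process for \(Z_t\); the function name `simulate_series` and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random

def simulate_series(n_years=30, intercept=3.0, slope=-0.05,
                    rho=0.8, sigma_z=0.1, sigma_eps=0.05, seed=1):
    """Return (trend, distortion, theta), each a list of length n_years."""
    rng = random.Random(seed)
    # f(X_t): a linear regression on time (the only covariate in this sketch)
    trend = [intercept + slope * t for t in range(n_years)]
    # Z_t: AR(1) distortions, with the first value drawn from the
    # stationary distribution N(0, sigma_z^2 / (1 - rho^2))
    z = [rng.gauss(0.0, sigma_z / math.sqrt(1.0 - rho ** 2))]
    for _ in range(1, n_years):
        z.append(rho * z[-1] + rng.gauss(0.0, sigma_z))
    # theta_t = f(X_t) + Z_t + eps_t
    theta = [f + d + rng.gauss(0.0, sigma_eps) for f, d in zip(trend, z)]
    return trend, z, theta

trend, z, theta = simulate_series()
```

In practice only noisy observations of \(\theta_t\) are available, and the modeling task is to separate the smooth distortion \(Z_t\) from the error term.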
While the inclusion of covariates in the regression framework is often well justified, the choice of model for the distortions \(Z_t\) is often more arbitrary. However, models for \(Z_t\) are important: they allow for data-driven trends that may not be captured by simple regression models; they smooth distortions, accounting for error in data observations; they incorporate uncertainty in the underlying processes; and they provide a temporal mechanism that can be projected into the future. Different model choices can sometimes lead to vastly different estimates.
This project aims to compare three main families of temporal models:
- ARMA models
- Gaussian Process regression
- Penalized splines regression
The aim is to compare the three methods theoretically and to see how their differences translate into differences in estimates across data scenarios.
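As a rough guide, one common way to write each of the three families for \(Z_t\) is shown here. The notation (\(\eta_t\), \(\rho\), \(\gamma\), \(\ell\), \(\Sigma\), \(b_k\), \(\alpha_k\)) is introduced for illustration and is an assumption; it is not necessarily the parameterization used in the paper or the package.

\[ \text{ARMA(1,1):} \quad Z_t = \rho Z_{t-1} + \eta_t + \gamma \eta_{t-1}, \qquad \eta_t \sim N(0, \sigma^2) \]
\[ \text{Gaussian process:} \quad (Z_1, \dots, Z_T)^\top \sim N(0, \Sigma), \qquad \Sigma_{ts} = \sigma^2 \exp\left( -\frac{(t - s)^2}{2 \ell^2} \right) \]
\[ \text{P-splines:} \quad Z_t = \sum_{k=1}^{K} b_k(t) \, \alpha_k, \qquad \Delta^q \alpha_k \sim N(0, \sigma_\alpha^2), \quad q \in \{1, 2\} \]

Setting \(\gamma = 0\) in the first line gives an AR(1) process. The covariance shown for the Gaussian process is the squared exponential function; a Matern function is a common alternative. In the P-splines line, \(b_k\) are B-spline basis functions and \(\Delta^q\) is the \(q\)-th order difference operator, so the penalty acts as a first or second order random walk on the spline coefficients.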
The paper presented at IUSSP is available here, and the slides are here.
The distortr package
As part of this project, I am developing an R package to aid in comparing and fitting different models for estimation of demographic time series. The distortr package is available on GitHub.
The package consists of two main parts:
1. Simulate time series of distortions, and fit and validate models
This part of the package contains tools to investigate how different models perform in different simulation settings and how much it matters if the ‘wrong’ model is chosen.
Simulated time series of data can be created from any of the following processes:
- AR(1)
- ARMA(1,1)
- P-splines (first or second order penalization)
- Gaussian Process (squared exponential or Matern function)
The various parameters associated with each function can be specified. The user can also specify how much of the time series is missing, and the sampling error around data. The sample autocorrelation function of the time series can also be plotted.
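The simulation steps described above can be sketched in a few lines of plain Python. This is an illustration of the idea and not the distortr interface: the function names (`simulate_arma11`, `observe`, `sample_acf`) and parameter values are hypothetical, and for simplicity the process is started at zero and not at its stationary distribution.

```python
import random

def simulate_arma11(n, rho, gamma, sigma, rng):
    """Z_t = rho * Z_{t-1} + eta_t + gamma * eta_{t-1}; gamma = 0 gives AR(1)."""
    z, prev_z, prev_eta = [], 0.0, 0.0
    for _ in range(n):
        eta = rng.gauss(0.0, sigma)
        prev_z = rho * prev_z + eta + gamma * prev_eta
        prev_eta = eta
        z.append(prev_z)
    return z

def observe(z, prop_missing, sampling_se, rng):
    """Add sampling error to each point and mark a proportion as missing (None)."""
    n_missing = round(prop_missing * len(z))
    missing = set(rng.sample(range(len(z)), n_missing))
    return [None if t in missing else v + rng.gauss(0.0, sampling_se)
            for t, v in enumerate(z)]

def sample_acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag for a complete series."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

rng = random.Random(42)
z = simulate_arma11(n=200, rho=0.8, gamma=0.3, sigma=0.1, rng=rng)
y = observe(z, prop_missing=0.25, sampling_se=0.05, rng=rng)
acf = sample_acf(z, max_lag=5)
```

Here `z` is the true underlying series, `y` is what an analyst would actually see, and `acf` summarizes how strongly neighboring time points in `z` are correlated.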
Any of the above models can be fit to the simulated data. Projections of the time series can also be obtained easily, and estimates and their uncertainty can be output and plotted.
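To illustrate what a projection involves, the sketch below computes the mean and an approximate 95% interval for an AR(1) process several steps ahead. It is written in plain Python, is not the package's implementation, and treats \(\rho\) and \(\sigma\) as known; a Bayesian fit would also propagate uncertainty in those parameters. The function name `ar1_projection` and the input values are hypothetical.

```python
import math

def ar1_projection(last_z, rho, sigma, horizon, z_crit=1.96):
    """Return [(mean, lower, upper)] for 1..horizon steps ahead.

    Conditional on the last value Z_T = last_z of a stationary AR(1):
      mean_h = rho**h * last_z
      var_h  = sigma**2 * (1 - rho**(2*h)) / (1 - rho**2)
    """
    out = []
    for h in range(1, horizon + 1):
        mean = rho ** h * last_z
        sd = sigma * math.sqrt((1.0 - rho ** (2 * h)) / (1.0 - rho ** 2))
        out.append((mean, mean - z_crit * sd, mean + z_crit * sd))
    return out

proj = ar1_projection(last_z=0.5, rho=0.8, sigma=0.1, horizon=10)
```

The projected mean decays toward zero while the interval widens toward the stationary variance, which is one reason the choice of temporal model matters for projections.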
2. Fit Bayesian hierarchical models to datasets with observations from multiple areas
Because data are often sparse or unreliable, especially in developing countries, models that estimate demographic indicators for multiple areas or countries are often hierarchical, pooling information across geographies. This part of the package provides the infrastructure to fit Bayesian hierarchical models using one of the temporal smoothing methods. The user can specify whether to include a linear trend and which temporal smoother to fit to the data.
Datasets with observations from multiple countries/areas can be used, with the following columns required:
- country/area name or code (e.g. country ISO code)
- value of observation
- year of observation
- sampling error of observation
- data source (e.g. survey, administrative)
In addition, a region column may also be included (e.g. World Bank region). By default the built-in models include a region hierarchy (i.e. a country within a region within the world). However, models can also be run without the region level.
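As an illustration of this layout, the following plain-Python sketch reads a small table and checks that the required columns are present. The column names (`iso`, `value`, `year`, `se`, `source`, `region`) are hypothetical and may differ from what the package expects, and the rows are made-up placeholder values, not real observations.

```python
import csv
import io

# Hypothetical column names; the package may expect different ones.
REQUIRED = ["iso", "value", "year", "se", "source"]

def read_observations(handle):
    """Parse rows from a CSV handle, failing early if a required column is absent."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for r in reader:
        rows.append({
            "iso": r["iso"],
            "value": float(r["value"]),
            "year": int(r["year"]),
            "se": float(r["se"]),
            "source": r["source"],
            "region": r.get("region") or None,  # optional column
        })
    return rows

# Placeholder data for illustration only.
example = io.StringIO(
    "iso,year,value,se,source,region\n"
    "AAA,2005,0.42,0.03,survey,Region 1\n"
    "AAA,2010,0.51,0.02,administrative,Region 1\n"
    "BBB,2008,0.67,0.04,survey,Region 2\n"
)
rows = read_observations(example)
```

When the region column is absent, `region` is set to `None`, which corresponds to running a model without the region level.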
For an example using real data, refer to the file real_data_anc4_example.R.
Summary
One aim of this project is to provide tools that make model choice more transparent and that help demographers and policymakers understand how models differ and how sensitive estimates are to model choice. The distortr package provides some infrastructure to explore and fit different methods. It is a work in progress, and any feedback is much appreciated.