Real differences and model artefacts in mortality inequality research
A couple of weeks ago, a paper was published in JAMA detailing income inequalities in mortality across cities in the United States. The research, led by Raj Chetty at Stanford University, received wide attention. It is an important contribution to the literature, using detailed income and mortality data from the IRS and SSA to estimate adult mortality by income at the city level.
The most striking finding of the research was that, while the rich live longer everywhere, life expectancy for those in the poorer income groups varies substantially by geography. The standard deviation of the average age at death for those in the bottom income quartile was found to be 1.39 years; the standard deviation for those in the top income quartile was only 0.7 years.
Although the authors had access to a rich, longitudinal dataset, data limitations meant that some assumptions had to be made. Here I want to draw attention to one assumption in particular: that differences in mortality across race/ethnic groups are constant across the whole of the United States.
Producing race-adjusted life expectancy estimates
The main metric of mortality used in the paper is a race-adjusted measure of life expectancy. (An aside: although it is referred to as life expectancy in the paper, it is actually life expectancy at age 40 plus 40 years, i.e. \(e_{40} + 40\), which is the average age at death for those who have reached age 40.)
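To make the \(e_{40} + 40\) measure concrete, here is a minimal sketch of how it could be computed from one-year age-specific mortality rates. It assumes a constant hazard within each year of age and truncates the life table after the last rate supplied. The function name and the flat example rates are mine, and this is not the paper's exact procedure.

```python
import math

def mean_age_at_death_given_40(rates, start_age=40):
    """Return e_40 + 40 from one-year age-specific mortality rates.

    rates[i] is the mortality rate at age start_age + i. Assumes a constant
    hazard within each one-year interval and truncates the life table after
    the last rate supplied (an approximation, not the paper's procedure).
    """
    survivors = 1.0      # l_x, as a share of those alive at start_age
    person_years = 0.0   # running sum of L_x
    for m in rates:
        next_survivors = survivors * math.exp(-m)
        if m > 0:
            person_years += (survivors - next_survivors) / m
        else:
            person_years += survivors  # nobody dies in this interval
        survivors = next_survivors
    return start_age + person_years

# Hypothetical input: a flat mortality rate of 0.05 from age 40 to 119.
flat_rates = [0.05] * 80
print(mean_age_at_death_given_40(flat_rates))
```

With a flat rate \(m\) and no truncation, the result would be \(40 + 1/m\), which is a handy sanity check on the function.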
On the log scale, the average mortality rate at age \(x\) for a particular sex, area and income group is the weighted sum of the mortality rates of the white, black, Asian and Hispanic populations:
\[ \log(m_x) = p_{white} \log(m_{x,white}) + p_{black} \log(m_{x,black}) + p_{Hispanic} \log(m_{x,Hispanic}) + p_{Asian} \log(m_{x,Asian}) \]
where \(p_{group}\) is the proportion of the total population in that group. Once the race-specific mortality rates are found and life expectancies calculated, the aggregate life expectancy is reweighted based on national values of \(p_{group}\) to produce a race-adjusted life expectancy measure.
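The two weighting steps described above can be sketched as follows. The function names and all numbers are made up for illustration. The reweighting function reflects my reading of the description, applying national proportions to race-specific life expectancies.

```python
import math

def aggregate_log_rate(shares, rates):
    """Weighted sum of log mortality rates; shares should sum to 1."""
    return sum(shares[g] * math.log(rates[g]) for g in shares)

def race_adjusted(values, national_shares):
    """Reweight race-specific values (e.g. life expectancies) to national shares."""
    return sum(national_shares[g] * values[g] for g in national_shares)

# Made-up numbers for illustration only.
local_shares = {"white": 0.5, "black": 0.3, "hispanic": 0.15, "asian": 0.05}
rates_age_60 = {"white": 0.010, "black": 0.016, "hispanic": 0.008, "asian": 0.006}

log_m = aggregate_log_rate(local_shares, rates_age_60)
m = math.exp(log_m)  # a geometric, not arithmetic, weighted mean of the rates
print(log_m, m)
```

Because the weighting is done on the log scale, the implied average rate is a weighted geometric mean, which is never larger than the arithmetic weighted mean of the same rates.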
The average mortality rate, \(\log(m_x)\), comes from the main dataset. But in this dataset information on race or ethnicity is not available, so the authors use information from the National Longitudinal Mortality Study (NLMS). Mortality rates for a particular race group are assumed to follow a Gompertz model, which models log mortality as a linear function of age with two parameters, an intercept \(\alpha\) and a slope \(\beta\). For example, for the white population at age \(x\):
\[ \log(m_{x,white}) = \alpha + \beta x \]
Now mortality rates for the other race groups can be represented in terms of the difference between that group and white mortality. For example, for the black population:
\[ \log(m_{x,black}) = \log(m_{x,white}) + \delta_{black} + \Delta_{black}x \]
The \((\delta, \Delta)\) are race shifters: the \(\delta\) term is the difference in the intercept parameter and the \(\Delta\) term is the difference in the slope for a particular race group. The key assumption made is that, for each group (black, Hispanic and Asian), those race shifters are constant across all areas and income groups. For example, \(\delta_{black}\) is around 0.53, irrespective of location or income group.
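A short derivation shows how the shifters pin down white mortality in each area and income group. Substitute the shifter equations into the weighted sum, with the white shifters equal to zero by definition and the proportions summing to one:

\[ \log(m_x) = \log(m_{x,white}) + \sum_{g} p_{g} \left( \delta_{g} + \Delta_{g} x \right) \]

where the sum runs over the black, Hispanic and Asian groups. The average log-mortality rate is then also linear in age. Writing it as \(\log(m_x) = \bar{\alpha} + \bar{\beta} x\) (the bar notation is mine, not the paper's), the white Gompertz parameters are

\[ \alpha = \bar{\alpha} - \sum_{g} p_{g} \delta_{g}, \qquad \beta = \bar{\beta} - \sum_{g} p_{g} \Delta_{g} \]

So any error in the assumed shifters feeds directly into the implied white mortality curve, scaled by the local population proportions.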
This implicitly assumes that mortality differences by race are constant across geography and income. This seems strange given the main finding of the paper was that overall mortality varies greatly by geography and income.
The authors justify this assumption by showing that, in the NLMS, these race shifters do not differ significantly across broad Census regions (Northeast, Midwest, South, West). But the absence of significant differences at that aggregate level does not mean that the assumption holds when comparing one city to another, say, Detroit and Salt Lake City.
Sensitivities
How might this assumption affect the race-adjusted life expectancy measure? Look at the equation for black log-mortality rates above. Suppose that, in a particular city, the black share of the population is higher than the national average and, in addition, the true mortality gap is larger than assumed, so that black mortality is worse than the model allows. Then the \((\delta, \Delta)\) values would be too small, and so \(\log(m_{x,black})\) would be too low. But because the weighted sum of the race-specific log-mortality rates is constrained to equal the average \(\log(m_x)\), if \(\log(m_{x,black})\) is too low then, all else equal, \(\log(m_{x,white})\) is too high. Reweighting to the national proportions, which put more weight on the white population than this city does, then gives a race-adjusted mortality that is too high, so race-adjusted life expectancy will be too low.
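The direction of this effect can be checked with a small simulation. This is a minimal sketch under stated assumptions: only two groups (white and black), a constant hazard within each year of age, a life table truncated at age 120, and entirely hypothetical aggregate parameters and population shares. Only the two pairs of black shifters are taken from this post.

```python
import math

def life_expectancy(alpha, beta, start_age=40, max_age=120):
    """e_40 + 40 for Gompertz mortality log(m_x) = alpha + beta * x."""
    survivors, person_years = 1.0, 0.0
    for age in range(start_age, max_age):
        m = math.exp(alpha + beta * age)
        next_survivors = survivors * math.exp(-m)  # constant hazard within the year
        person_years += (survivors - next_survivors) / m
        survivors = next_survivors
    return start_age + person_years

def race_adjusted_le(alpha_bar, beta_bar, local, national, shifters):
    """Split an aggregate Gompertz curve by race, then reweight to national shares.

    shifters maps each non-white group to (delta, Delta); white is the reference.
    """
    # The weighted sum of log rates must reproduce the aggregate curve,
    # which pins down the white intercept and slope.
    alpha_w = alpha_bar - sum(local[g] * d for g, (d, _) in shifters.items())
    beta_w = beta_bar - sum(local[g] * s for g, (_, s) in shifters.items())
    le = {"white": life_expectancy(alpha_w, beta_w)}
    for g, (d, s) in shifters.items():
        le[g] = life_expectancy(alpha_w + d, beta_w + s)
    return sum(national[g] * le[g] for g in national)

# All inputs below are hypothetical, chosen only to show the direction of the effect.
alpha_bar, beta_bar = -10.0, 0.09          # aggregate Gompertz parameters
local = {"white": 0.2, "black": 0.8}       # city with a large black share
national = {"white": 0.8, "black": 0.2}    # stand-in for national shares

assumed = race_adjusted_le(alpha_bar, beta_bar, local, national,
                           {"black": (0.53, -0.0012)})
larger_gap = race_adjusted_le(alpha_bar, beta_bar, local, national,
                              {"black": (0.8, 0.0)})
print(f"assumed shifters: {assumed:.2f}, larger true gap: {larger_gap:.2f}")
```

If the true gap is the larger one, the estimate based on the assumed shifters comes out lower. The size of the difference depends heavily on the made-up inputs, so only the sign should be read from this sketch.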
The graph below illustrates this effect. The underlying race distribution is roughly equal to that of females in Detroit in the lowest income quartile. The mortality rates are simulated from a Gompertz model, chosen so that life expectancy is roughly equal to that published in the paper for females in Detroit in the lowest income quartile.
With \((\delta_{black}, \Delta_{black}) = (0.53, -0.0012)\), the race-adjusted life expectancy is 80.4 years. These shifters are approximately equal to those used in the paper; I say approximately, because the values weren’t actually published anywhere, so I had to eyeball figure e6/7 in the supplementary information. If you change the values to \((\delta_{black}, \Delta_{black}) = (0.8, 0)\), the race-adjusted life expectancy is 81.3 years, a difference of 0.9 years.
Summary
Demographic estimation usually involves using some sort of model process or set of assumptions, no matter how detailed or high-quality the data are. As data availability increases, we are producing estimates at more granular levels of geography, with disaggregation into smaller subpopulations. As this continues, it becomes even more important to be aware of how modeling assumptions may affect results.
The main conclusions in the Chetty et al. paper are most likely right, and it is a great paper. But without access to the raw data, it is difficult for other researchers to say with any confidence how much of the variation reflects real differences and how much is a model artefact.
Code
Code for this is here.