Abstract
In many settings, researchers may not have direct access to data on 1 or more variables needed for an analysis and instead may use regression-based estimates of those variables. Using such estimates in place of original data, however, introduces complications and can result in uninterpretable analyses. In simulations and observational data, we illustrate the issues that arise when an average treatment effect is estimated from data where the outcome of interest is predicted from an auxiliary model. We show that bias in any direction can result, under both the null and alternative hypotheses.
Keywords: imputation, measurement error, proxy variables
Abbreviations
- CDC: Centers for Disease Control and Prevention
- eGFR: estimated glomerular filtration rate
- SES: socioeconomic status
In many settings, researchers lacking direct access to data on variables needed for an analysis rely instead on regression-based estimates of those variables. Regression-based estimates are routinely used as exposures, outcomes, and effect modifiers or covariates in epidemiologic studies, including some appearing in top epidemiology journals (1–6). These include some types of small-area estimates (7, 8), models for estimating the composition of biological samples (9), models that combine biomarker data with demographic covariates (10), and models of frailty in aging research (2, 11). Our motivating example is a small-area estimation model (7, 8), which purports to be useful for “informing local health policy-makers, improving community-based public health program planning and intervention strategy development, and facilitating public health resource allocation and delivery” (8, p. 127). However, use of regression-based estimates in place of original data can result in uninterpretable analyses and is rarely valid.
In this paper, we use the terms primary analysis to refer to the analysis that answers a question of interest and primary data to refer to the data used in the primary analysis. We use auxiliary model to refer to a regression model, fitted to auxiliary data, that provides estimates of a variable, V, that is required for the primary analysis but is not available in the primary data. A list of terminology and notation used throughout is presented in Table 1. We show in simulations that the common use of predictions from auxiliary models in primary analyses can result in spurious association estimates, including estimates that are in the opposite direction from the truth.
Table 1.
Terminology and Notation Used to Describe Primary and Auxiliary Data and Models
| Term | Definition |
| --- | --- |
| Primary analysis | Analysis that answers question of interest |
| Primary data | Data used in primary analysis |
| V | Variable required for primary analysis but not available in primary data |
| Auxiliary model | Regression model used to estimate V |
| Auxiliary data | Data used for auxiliary model |
| Z | Auxiliary data covariates used as independent variables in model for V |
| g(Z) | Function that predicts V; result of the auxiliary model |
Our warnings apply to any estimate of V that can be expressed as a function of covariates (i.e., the regressors in the auxiliary model). They do not apply to the use of predicted values from deterministic mathematical models, such as differential equation models of infectious disease epidemics or models that smooth between observations (e.g., kriging to estimate air pollution levels). They also do not apply to settings amenable to multiple imputation, where V is observed for some but not all subjects in the primary data.
WHAT MAY GO WRONG WHEN USING ESTIMATES FROM AUXILIARY REGRESSION MODELS
Suppose an auxiliary model uses a set of covariates Z to estimate a variable, V, that researchers need for their primary analysis. As a toy example, suppose that V is blood pressure and that Z includes age, sex, body mass index (weight (kg)/height (m)²), race/ethnicity, and socioeconomic status (SES). The result of the auxiliary model is a function g(Z) that outputs estimated values of V; for example, if the auxiliary model is a linear regression model, then g(Z) is the fitted linear predictor. Plugging the Z values for a particular observation into g(Z) estimates that observation’s V.
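As a concrete illustration of this notation, the sketch below fits a hypothetical auxiliary linear model and uses it to predict V in primary data that lack blood pressure. Everything here is simulated and simplified (race/ethnicity is omitted and the coefficients are invented); it is not the authors' code.

```python
# Minimal sketch of fitting an auxiliary model and using its prediction function g(Z);
# all data, variable names, and coefficients are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def make_z(n):
    """Simulate the auxiliary covariates Z (race/ethnicity omitted for brevity)."""
    return pd.DataFrame({
        "age": rng.uniform(30, 80, n),
        "sex": rng.binomial(1, 0.5, n),
        "bmi": rng.normal(27, 4, n),
        "ses": rng.normal(0, 1, n),
    })

# Auxiliary data contain Z plus the variable V (here, systolic blood pressure).
aux = make_z(5000)
aux["sbp"] = (90 + 0.5 * aux.age + 3 * aux.sex + 0.8 * aux.bmi - 2 * aux.ses
              + rng.normal(0, 10, len(aux)))

# Fit the auxiliary regression; its prediction function is g(Z).
g = smf.ols("sbp ~ age + sex + bmi + ses", data=aux).fit()

# Primary data contain Z but not V; plugging Z into g(Z) gives estimated blood pressure.
primary = make_z(1000)
primary["sbp_hat"] = g.predict(primary)
```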
If V is a confounder in a primary analysis, using the estimated V values g(Z) to control for V cannot do better than optimally controlling directly for Z, because a function of Z can contain no more information than Z itself. In our toy example, controlling for estimated blood pressure can do no better than correctly specifying a model for the outcome given the exposure and age, sex, body mass index, race/ethnicity, and SES. In a paper published in Epidemiology, Cuthbertson et al. (3) recommended that researchers interested in estimating the effect of medical interventions on mortality among older adults using Medicare claims data control for confounding by frailty using a function of age, sex, race/ethnicity, and 20 claims for conditions, symptoms, and medical equipment (e.g., wheelchair, vertigo, and dementia) shown to be predictive of frailty. If the association of interest is unconfounded after controlling for these 23 variables, in addition to other measured covariates, then all 23 variables should be included in any claims-based analysis. If confounding remains after conditioning on these 23 variables, then no function of them, no matter how predictive of frailty it is, can control for confounding.
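The following sketch illustrates this point with simulated data; the data-generating process and coefficients are invented for illustration and are not taken from any of the cited studies. The true effect of exposure A on outcome Y is null, the confounder V is only partly explained by Z, and adjusting for the predicted confounder g(Z) leaves the same residual confounding as adjusting for Z itself.

```python
# Sketch: residual confounding when a predicted confounder g(Z) replaces the true V.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

def make_data(n):
    z = rng.normal(0, 1, n)              # auxiliary predictor of the confounder
    v = z + rng.normal(0, 1, n)          # true confounder: Z plus information not in Z
    a = v + rng.normal(0, 1, n)          # exposure caused by V
    y = 2 * v + rng.normal(0, 1, n)      # outcome caused by V only (null effect of A)
    return pd.DataFrame({"z": z, "v": v, "a": a, "y": y})

aux, primary = make_data(20000), make_data(20000)
primary["v_hat"] = smf.ols("v ~ z", data=aux).fit().predict(primary)   # g(Z)

print(smf.ols("y ~ a + v", data=primary).fit().params["a"])      # ~0: true V removes confounding
print(smf.ols("y ~ a + v_hat", data=primary).fit().params["a"])  # ~1: g(Z) does not
print(smf.ols("y ~ a + z", data=primary).fit().params["a"])      # ~1: no better than Z itself
```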
If V is an exposure or treatment of interest, including g(Z) instead of V in a primary analysis only tells us about the relationship between a function of Z and the outcome. One problem is that g(Z) may fail to capture features of V that drive the true effect of interest. Another is that it may introduce spurious associations. For example, suppose SES has a strong relationship with the outcome of interest. Then estimated blood pressure, g(Z), may appear to be a strong predictor of the outcome simply because SES is an important component of g(Z). In a paper published in the International Journal of Epidemiology, Al Hazzouri et al. (1) suggested estimating early- and midlife cardiovascular disease risk factors as functions of race/ethnicity, sex, and age cohort in order to use cohort studies of older adults to estimate the associations of early-life risk factors with later-life outcomes. They first estimated smoking as a function of race, sex, and age cohort; denote this function s(race, sex, cohort). Next they estimated body mass index as a function of race, sex, age cohort, and estimated smoking, that is, b[s(race, sex, cohort), race, sex, cohort], so estimated body mass index is also a function only of race, sex, and age cohort. They subsequently estimated other risk factors as functions of race/ethnicity, sex, cohort, and previously estimated risk factors. They suggested that estimated risk factors can be used to assess “the effects of early and midlife exposures on later life outcomes” (1, p. 1005), but any apparent association between these estimated risk factors and later-life outcomes could also be driven by associations with age cohort, sex, or race/ethnicity unrelated to cardiovascular disease risk factors.
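A small simulated sketch makes the collapse explicit: because each prediction is built only from race, sex, and age cohort (directly or through previously predicted risk factors), the predicted risk factor can take at most one value per covariate combination. The data and models below are hypothetical and are not those of Al Hazzouri et al. (1).

```python
# Sketch: risk factors predicted only from race, sex, and cohort collapse to one value
# per covariate combination (hypothetical data for illustration).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 10000
df = pd.DataFrame({
    "race": rng.integers(0, 4, n),    # 4 race/ethnicity groups
    "sex": rng.integers(0, 2, n),
    "cohort": rng.integers(0, 3, n),  # 3 age cohorts
})
df["smoking"] = rng.binomial(1, 0.15 + 0.05 * df.sex + 0.03 * df.cohort)
df["bmi"] = 25 + 2 * df.smoking + rng.normal(0, 4, n)

# Step 1: predict smoking from race, sex, cohort.  Step 2: predict BMI from those plus
# predicted smoking.  The composition is still a function of race, sex, and cohort only.
df["s_hat"] = smf.ols("smoking ~ C(race) + C(sex) + C(cohort)", data=df).fit().predict(df)
b_hat = smf.ols("bmi ~ s_hat + C(race) + C(sex) + C(cohort)", data=df).fit().predict(df)

print(df.bmi.nunique())                    # essentially all 10,000 observed values distinct
print(np.unique(np.round(b_hat, 6)).size)  # at most 4 x 2 x 3 = 24 distinct predicted values
```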
If V is an outcome, using g(Z) instead of V only tells us about how well the variables in our model of interest predict a function of Z. As above, g(Z) may fail to capture features of V that are truly affected by the exposure of interest, or it may introduce spurious associations. In our toy example, if SES is strongly associated with an exposure of interest (as indeed it often is), this could drive an apparent but spurious relationship with predicted blood pressure. In a paper published in this journal, Bash et al. (4) compared definitions of incident chronic kidney disease based on estimated glomerular filtration rate (eGFR), which is a function of age, sex, Black race/ethnicity, and creatinine level, with a definition based only on creatinine level, which is a biomarker for glomerular filtration rate. The creatinine definition identified male sex as a risk factor for incident chronic kidney disease, whereas the eGFR definitions identified male sex as protective; the authors suggested that “the nonlinear relation between serum creatinine and eGFR may partially explain this apparent discrepancy” (4, p. 422). However, the eGFR equation included sex as a factor, thereby defining eGFR as 1.35 times higher for males than for females. Whenever eGFR is used as an outcome in an epidemiologic study (e.g., the study by Navas-Acien et al. (5)), this same discrepancy could bias estimates of associations with any exposure that is differential by sex either toward or away from the null. The use of race in eGFR and other clinical algorithms has come under increasing criticism (12, 13); the phenomenon we have just described is one reason why using race as a predictor can be highly problematic.
When V is an exposure or an outcome, controlling for any elements of Z in the primary analysis can undermine the ability of g(Z) to capture information about V. In the extreme case, controlling for Z could be tantamount to controlling for g(Z) itself, forcing all estimated associations with g(Z) to be null. In the cardiovascular disease risk factor analysis (1), any estimated association with the predictor g(Z) that controls for sex, race/ethnicity, or age cohort would probably be attenuated, and any estimate that stratifies by all 3 covariates would be null.
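The sketch below (an invented data-generating process, not the paper's simulations) shows the extreme case: because the predicted outcome g(Z) is exactly a function of Z, once Z is in the model the estimated association between exposure and g(Z) is forced to zero, even though the exposure truly affects V.

```python
# Sketch: adjusting for the auxiliary model's own predictor Z nullifies any association
# with g(Z), even when the exposure X truly affects V (simulated data for illustration).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

def make_data(n):
    z = rng.normal(0, 1, n)                  # auxiliary predictor
    x = 0.5 * z + rng.normal(0, 1, n)        # exposure, associated with Z
    v = z + 0.7 * x + rng.normal(0, 1, n)    # true outcome, affected by X and Z
    return pd.DataFrame({"z": z, "x": x, "v": v})

aux, primary = make_data(20000), make_data(20000)
primary["v_hat"] = smf.ols("v ~ z", data=aux).fit().predict(primary)   # g(Z): function of Z only

print(smf.ols("v ~ x + z", data=primary).fit().params["x"])       # ~0.7: true adjusted effect
print(smf.ols("v_hat ~ x", data=primary).fit().params["x"])       # nonzero, but driven entirely by Z
print(smf.ols("v_hat ~ x + z", data=primary).fit().params["x"])   # ~0: Z absorbs all of g(Z)
```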
The problem of a missing variable is an extreme case of missing data, in which V is missing for every observation. It is well known from the literature on missing-data imputation that the joint relationships among all of the variables in a primary analysis must be correctly modeled in the auxiliary data in order for the estimated V to be a valid replacement for the true V (14). This is usually not an obstacle in standard missing-data settings, where all of the relevant variables are available in a single data set. In our setting, however, it generally requires that all of the variables to be included in a primary analysis also be included in the auxiliary model for V, with all of the relationships correctly specified. This is almost never possible: Researchers may not have access to the auxiliary data or oversight over how the auxiliary model is fitted, and the auxiliary data may not include all of the primary analysis variables.
In the best-case scenario of a correctly specified auxiliary model and informative predictors, the auxiliary model may be very predictive of V, with highly statistically significant coefficients, a high R², etc. Nevertheless, these criteria are not sufficient to validate the use of predicted V values in subsequent analyses; no matter how “good” the auxiliary model is for V, if it does not correctly capture the joint relationships among all of the primary analysis variables, it can introduce meaningful bias into the primary analysis. The bias introduced by replacing V with its estimate will depend on many factors, and using the predicted V may be better or worse than ignoring V altogether or using a more naive proxy for V. In simulations presented below, we show that associations estimated using predicted V can go in the opposite direction from the true association, regardless of whether V is a confounder, an outcome variable, or an exposure variable.
WHEN IS IT ACCEPTABLE TO USE ESTIMATES FROM AUXILIARY REGRESSION MODELS?
Let W be the collection of primary analysis variables that are not available in the auxiliary data. The strategy of predicting V using auxiliary data on V and Z is unproblematic when either of the following 2 conditional independences holds:
$$W \perp\!\!\!\perp V \mid Z \qquad (1)$$

or

$$W \perp\!\!\!\perp Z \mid V \qquad (2)$$
Equivalently, the joint distribution of V, Z, and W factorizes into one component that is a function of V and Z and another component that is a function of W and either Z or V, but not both.
In this case, the joint relationships among Z, V, and W can be correctly modeled in the auxiliary data even though W does not appear in the auxiliary data, because the regression of V on Z involves only the factor that is a function of V and Z, and not the other factor in the full joint distribution. When the joint distribution factorizes in this way, using g(Z) in place of V is akin to observing V with error, where the magnitude of the error is directly related to how well g(Z) predicts V.
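Assuming the joint distribution of (V, Z, W) has a density p(v, z, w), conditions (1) and (2) can be written out as explicit factorizations. This is purely a restatement of the prose above, with f and h denoting generic nonnegative functions.

```latex
% Conditions (1) and (2), restated as factorizations of the joint density p(v, z, w).
\begin{align*}
W \perp\!\!\!\perp V \mid Z
  &\iff p(v, z, w) = \underbrace{p(v \mid z)\, p(z)}_{f(v,\, z)} \,
        \underbrace{p(w \mid z)}_{h(w,\, z)}, \\
W \perp\!\!\!\perp Z \mid V
  &\iff p(v, z, w) = \underbrace{p(z \mid v)\, p(v)}_{f(v,\, z)} \,
        \underbrace{p(w \mid v)}_{h(w,\, v)}.
\end{align*}
```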
MOTIVATING EXAMPLE
The Centers for Disease Control and Prevention’s (CDC’s) 500 Cities Project (2016–2019) used small-area estimation models to estimate the prevalence of health outcomes at the census tract level based on county-level data. The auxiliary model uses county-level data to regress V onto age category, sex, race/ethnicity category, and county-level poverty rate. Census tract data on covariates, Z, coupled with the county-level fitted auxiliary model, are used to predict V at the census tract level, g(Z). (Random effects are included only at the state and county levels, meaning that within a county, V is predicted from essentially an ordinary regression model.) Although the CDC website recommends that these data be used for surveillance purposes, the original papers proposing the small-area models suggested that they could be used to design and assess interventions (7, 8), and they have frequently been used as outcomes (15–18), exposures (17), and covariates in epidemiologic research, sometimes simultaneously.
SIMULATIONS
We considered 2 different simulation settings to mimic analyses of the 500 Cities data that use auxiliary model predictions as outcome variables in the primary analysis.
We first simulated the county-level data: 4 covariates based on the distributions of the 4 predictors used in the 500 Cities small-area estimation models, an exposure with approximately the same marginal distribution as day-night average noise level, and an outcome with approximately the same distribution as poor sleep conditional on noise exposure. We fitted an auxiliary model regressing the prevalence of poor sleep on the 4 auxiliary covariates.
Next, we simulated the primary data at the census tract level, using the same data-generating models as those used for the county-level data. We simulated the true outcome, V, for each observation. To mimic settings in which researchers do not have access to the true V, we also calculated a predicted outcome g(Z) for each observation by using the auxiliary model. We compared 2 analyses, both using the primary data: one using the true V to assess the association between exposure and V and one using the predicted values g(Z) that would be available to a researcher. The results are depicted in Figure 1. In Figure 1A, the association reverses direction when V is replaced with g(Z). In Figure 1B, we illustrate the perils of controlling for an auxiliary model predictor in the primary analysis. Although both the exposure-V and exposure-g(Z) associations are positive, controlling for race/ethnicity results in a null association.
Figure 1.
Simulated associations between day-night average noise level and proportion of individuals with poor sleep under 2 different scenarios. The figure displays predicted values (lines) and corresponding 95% confidence intervals (shaded areas) from logistic regression models. The results shown in panel A are unadjusted, comparing the associations of the true (blue) and estimated (red) outcomes with exposure. In panel B, the unadjusted associations between exposure and g(Z) (green) and the association between exposure and V adjusted for race/ethnicity (blue) are positive, but the association between exposure and g(Z) adjusted for race/ethnicity (red) is null.
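The authors' full simulation code is available on GitHub (19). As a deliberately simplified sketch of the mechanism behind the reversal in Figure 1A, the example below uses a single auxiliary covariate, a linear rather than logistic model, and invented coefficients rather than the 500 Cities model structure.

```python
# Simplified sketch of the outcome-prediction pitfall: the auxiliary model predicts the
# outcome from a covariate only, and the predicted outcome's association with exposure
# reverses sign.  Structure and coefficients are illustrative, not those in the paper.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)

def simulate(n):
    noise = rng.normal(0, 1, n)                                  # exposure
    poverty = -0.4 * noise + rng.normal(0, 1, n)                 # auxiliary covariate
    sleep = 1.0 * noise + 1.0 * poverty + rng.normal(0, 1, n)    # true outcome V
    return pd.DataFrame({"noise": noise, "poverty": poverty, "sleep": sleep})

county, tract = simulate(3000), simulate(20000)                  # auxiliary and primary data

# Auxiliary model: poor sleep as a function of the covariate only (no exposure).
aux_fit = smf.ols("sleep ~ poverty", data=county).fit()
tract["sleep_hat"] = aux_fit.predict(tract)                      # g(Z)

print(smf.ols("sleep ~ noise", data=tract).fit().params["noise"])      # positive (truth)
print(smf.ols("sleep_hat ~ noise", data=tract).fit().params["noise"])  # negative (reversed)
```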
We also conducted 2 simulations to illustrate the potential consequences of using g(Z) as a covariate or exposure (Figure 2). For Figure 2A, we simulated a continuous covariate V, conditional on an auxiliary predictor Z. We then simulated exposure and outcome to have a strong negative association conditional on V. We carried out this same data-generating process in auxiliary and analysis data sets. We regressed V on Z in the auxiliary data and then predicted g(Z) in the analysis data. Conditional on g(Z), exposure and outcome had a strong positive association (in the opposite direction of the truth). For Figure 2B, in auxiliary and analysis data we simulated a continuous outcome and then simulated a continuous exposure V, conditional on an auxiliary predictor but independent of the outcome. We used the model fit obtained from regressing V on the auxiliary predictor in the auxiliary data to calculate g(Z) in the analysis data. Despite the true null relationship, g(Z) and the outcome were strongly positively associated.
Figure 2.
Simulated associations between an exposure and an outcome under 2 different scenarios. The figure displays predicted values (lines) and corresponding 95% confidence intervals (shaded areas) from linear regression models. A) Association between exposure and outcome adjusted for a true (blue) versus predicted (red) covariate; B) unadjusted association between an outcome and a predicted (purple) and true (orange) exposure.
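The sketch below reproduces the flavor of the Figure 2B scenario under one illustrative data-generating process, which is not necessarily the one used for the figure: the true exposure V is independent of the outcome, but the auxiliary predictor Z is not, so the predicted exposure g(Z) appears strongly associated with the outcome.

```python
# Sketch of the predicted-exposure pitfall: V is independent of the outcome Y, yet its
# prediction g(Z) is strongly associated with Y (illustrative data-generating process).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

def simulate(n):
    y = rng.normal(0, 1, n)              # outcome
    c = rng.normal(0, 1, n)              # shared component of V and Z, independent of Y
    v = c + rng.normal(0, 1, n)          # true exposure: independent of the outcome
    z = c + y + rng.normal(0, 1, n)      # auxiliary predictor: related to both V and Y
    return pd.DataFrame({"y": y, "v": v, "z": z})

aux, primary = simulate(20000), simulate(20000)
primary["v_hat"] = smf.ols("v ~ z", data=aux).fit().predict(primary)   # g(Z)

print(smf.ols("y ~ v", data=primary).fit().params["v"])          # ~0: true null
print(smf.ols("y ~ v_hat", data=primary).fit().params["v_hat"])  # strongly positive
```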
All of the software code used is available on GitHub (19).
OBSERVATIONAL DATA ANALYSIS
Using logistic regression, we examined associations between day-night average noise level and poor sleep using the 500 Cities data. Day-night average noise level was estimated from a geospatial sound model over a 24-hour period with a 10-decibel penalty added between 10 pm and 7 am (20). Potential census-tract–level confounding variables from the 2011–2015 US American Community Survey included population density (number of residents per km²) and proportions of females, persons aged ≥75 years, non-Hispanic Whites, non-Hispanic Blacks, homeowners, and persons living below the federal poverty threshold. We included modeled estimates of census-tract–level pollutant concentrations (nitrogen dioxide (or NO2) and particulate matter with an aerodynamic diameter ≤2.5 μm (or PM2.5)) from the Center for Air, Climate and Energy Solutions at Carnegie Mellon University (Pittsburgh, Pennsylvania) (21). For this analysis, we ignored issues related to modeling noise and air pollution levels, but we note that these models included deterministic components that differentiated them from the auxiliary regression models of primary concern.
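For readers who want to see the form of this analysis, the sketch below fits binomial (logistic-link) generalized linear models for a tract-level prevalence outcome. The data are randomly generated placeholders with hypothetical variable names; the actual analysis code is in the authors' GitHub repository (19).

```python
# Sketch of the form of the observational analysis: binomial GLMs for a tract-level
# prevalence outcome.  Placeholder data and variable names; not the authors' code.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 5000
tracts = pd.DataFrame({
    "noise_dn": rng.normal(55, 5, n),        # day-night average noise level (dB)
    "pop_density": rng.lognormal(7, 1, n),   # residents per km^2
    "no2": rng.normal(10, 3, n),             # NO2 concentration
    "pm25": rng.normal(8, 2, n),             # PM2.5 concentration
    "homeown": rng.uniform(0.2, 0.9, n),     # proportion of homeowners
})
# Placeholder outcome: predicted prevalence of poor sleep, a proportion in (0, 1).
tracts["sleep_prev"] = 1 / (1 + np.exp(-(-3 + 0.04 * tracts.noise_dn
                                         + rng.normal(0, 0.3, n))))

# Unadjusted model and a model adjusted for covariates not in the auxiliary model,
# mirroring panels A and B of Figure 3; the proportion outcome is treated as a
# binomial mean (fractional logit).
unadj = smf.glm("sleep_prev ~ noise_dn", data=tracts,
                family=sm.families.Binomial()).fit()
adj = smf.glm("sleep_prev ~ noise_dn + np.log(pop_density) + no2 + pm25 + homeown",
              data=tracts, family=sm.families.Binomial()).fit()
print(unadj.params["noise_dn"], adj.params["noise_dn"])
```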
Although our results (Figure 3) appeared to show a strong association between noise and poor sleep, these results could have been driven by an association between noise and one of the auxiliary predictors used in the CDC small-area estimation models. For example, SES has previously been shown to be strongly associated with day-night average noise level in the United States (22).
Figure 3.
Association between day-night average noise level and predicted proportion of individuals with poor sleep in a nationwide data set (the 500 Cities Project), United States, 2016–2019. The figure displays predicted values (lines) and corresponding 95% confidence intervals (shaded areas) from logistic regression models. Panel A shows results from the unadjusted model. Adjusted model 1 (panel B) included only covariates that were not in the auxiliary model: population density, pollutant concentrations (nitrogen dioxide and particulate matter with an aerodynamic diameter ≤2.5 μm), and homeownership. Adjusted model 2 (panel C) included the same covariates plus those from the auxiliary model: age category, sex, race/ethnicity category, and county-level poverty rate.
CONCLUSION
Predictions from auxiliary regression models should not generally be used to learn about associations, mechanisms, or causal effects. While the predictions themselves can be useful for descriptive purposes, such as health surveillance or predicting burden of disease, including them as outcomes, exposures, or covariates in larger models for the purpose of estimating associations or causal effects can result in bias.
The exception to this general principle is when the joint distribution of the variables in the union of the auxiliary and primary data (Z, V, and W) factorizes into 2 terms, one of which involves only Z and V and the other of which may involve W and either Z or V but not both. This is the case whenever V is a deterministic function of Z. In future work, we plan to formalize this principle, quantify departures from the factorization, and characterize the relationship between such departures and potential bias.
ACKNOWLEDGMENTS
Author affiliations: Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, United States (Elizabeth L. Ogburn); Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, New York, United States (Kara E. Rudolph, Joan A. Casey); Department of Environmental Science, Policy and Management and School of Public Health, University of California, Berkeley, Berkeley, California, United States (Rachel Morello-Frosch); Department of Environmental and Occupational Health Sciences, School of Public Health, University of Washington, Seattle, Washington, United States (Amber Khan).
This research was based upon work supported by the Urban Institute through funds provided by the Robert Wood Johnson Foundation. E.L.O. was supported by National Institutes of Health grant U24OD023382 and Office of Naval Research grant N00014-18-1-2760.
We are grateful to Drs. Ana Navas-Acien, Steve Cole, and Jay Kaufman for helpful feedback on earlier drafts of this article.
The findings and conclusions presented in this paper are those of the authors alone and do not necessarily reflect the opinions of the Urban Institute or the Robert Wood Johnson Foundation.
Conflict of interest: none declared.
REFERENCES
- 1. Al Hazzouri, Vittinghoff E, Zhang Y, et al. Use of a pooled cohort to impute cardiovascular disease risk factors across the adult life course. Int J Epidemiol. 2019;48(3):1004–1013.
- 2. Segal JB, Huang J, Roth DL, et al. External validation of the claims-based frailty index in the National Health and Aging Trends Study cohort. Am J Epidemiol. 2017;186(6):745–747.
- 3. Cuthbertson CC, Kucharska-Newton A, Faurot KR, et al. Controlling for frailty in pharmacoepidemiologic studies of older adults: validation of an existing Medicare claims-based algorithm. Epidemiology. 2018;29(4):556–561.
- 4. Bash LD, Coresh J, Köttgen A, et al. Defining incident chronic kidney disease in the research setting: the ARIC Study. Am J Epidemiol. 2009;170(4):414–424.
- 5. Navas-Acien A, Tellez-Plaza M, Guallar E, et al. Blood cadmium and lead and chronic kidney disease in US adults: a joint analysis. Am J Epidemiol. 2009;170(9):1156–1164.
- 6. Darsie B, Shlipak MG, Sarnak MJ, et al. Kidney function and cognitive health in older adults: the Cardiovascular Health Study. Am J Epidemiol. 2014;180(1):68–75.
- 7. Zhang X, Holt JB, Lu H, et al. Multilevel regression and poststratification for small-area estimation of population health outcomes: a case study of chronic obstructive pulmonary disease prevalence using the Behavioral Risk Factor Surveillance System. Am J Epidemiol. 2014;179(8):1025–1033.
- 8. Zhang X, Holt JB, Yun S, et al. Validation of multilevel regression and poststratification methodology for small area estimation of health indicators from the Behavioral Risk Factor Surveillance System. Am J Epidemiol. 2015;182(2):127–137.
- 9. Houseman EA, Accomando WP, Koestler DC, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13(1):Article 86.
- 10. Levey AS, Bosch JP, Lewis JB, et al. A more accurate method to estimate glomerular filtration rate from serum creatinine: a new prediction equation. Ann Intern Med. 1999;130(6):461–470.
- 11. Van Domelen DR, Bandeen-Roche K. A note on proposed estimation procedures for claims-based frailty indexes. Am J Epidemiol. 2020;189(5):369–371.
- 12. Eneanya ND, Yang W, Reese PP. Reconsidering the consequences of using race to estimate kidney function. JAMA. 2019;322(2):113–114.
- 13. Vyas DA, Eisenstein LG, Jones DS. Hidden in plain sight—reconsidering the use of race correction in clinical algorithms. N Engl J Med. 2020;383(9):874–882.
- 14. Meng X-L. Multiple-imputation inferences with uncongenial sources of input. Stat Sci. 1994;9(4):538–558.
- 15. Li Y, Liu SH, Niu L, et al. Unhealthy behaviors, prevention measures, and neighborhood cardiovascular health: a machine learning approach. J Public Health Manag Pract. 2019;25(1):E25–E28.
- 16. Fitzpatrick KM, Shi X, Willis D, et al. Obesity and place: chronic disease in the 500 largest US cities. Obes Res Clin Pract. 2018;12(5):421–425.
- 17. Liu SH, Liu B, Li Y. Risk factors associated with multiple correlated health outcomes in the 500 Cities Project. Prev Med. 2018;112:126–129.
- 18. Wang Y, Holt JB, Xu F, et al. Using 3 health surveys to compare multilevel models for small area estimation for chronic diseases and health behaviors. Prev Chronic Dis. 2018;15:E133.
- 19. Casey JA. auxiliary_mod_perils. https://github.com/joanacasey/auxiliary_mod_perils. Published November 1, 2019. Accessed March 12, 2021.
- 20. Mennitt D, Sherrill K, Fristrup K. A geospatial model of ambient sound pressure levels in the contiguous United States. J Acoust Soc Am. 2014;135(5):2746–2764.
- 21. Kim S-Y, Bechle M, Hankey S, et al. Concentrations of criteria pollutants in the contiguous US, 1979–2015: role of prediction model parsimony in integrated empirical geographic regression. PLoS One. 2020;15(2):e0228535.
- 22. Casey JA, Morello-Frosch R, Mennitt DJ, et al. Race/ethnicity, socioeconomic status, residential segregation, and spatial variation in noise exposure in the contiguous United States. Environ Health Perspect. 2017;125(7):077017.