Abstract
Accurate risk prediction plays a key role in disease prevention and management. The emergence of new biomarkers raises an important question: how much would prediction accuracy improve if the new markers were added to existing risk prediction tools? In large prospective cohort studies, however, the standard full-cohort design, which requires marker measurement on the entire cohort, may be infeasible because of cost and the low incidence of the clinical condition of interest. Nested case-control (NCC) studies provide a cost-effective alternative, but the complex datasets they generate pose challenges for statistical analysis. To evaluate the prognostic accuracy of a risk model, Cai and Zheng1 proposed a class of nonparametric inverse probability weighting (IPW) estimators for accuracy measures in time-dependent receiver operating characteristic curve analysis. To accommodate a three-phase NCC design in the Nurses' Health Study, we extend the double IPW estimators of Cai and Zheng1 to develop risk prediction models under time-dependent generalized linear models and to evaluate the incremental values of new biomarkers and genetic markers. Our results suggest that aggregating the information from both the genetic markers and biomarkers substantially improves the accuracy for predicting 5-year and 10-year risks of rheumatoid arthritis.
Introduction
Achieving the full promise of personalized medicine requires tools for accurate risk assessment. When the risk of developing a clinical condition varies with the values of the risk markers, one could potentially use the marker information to develop optimal disease management strategies. Through collaborative efforts, several large prospective cohorts have been assembled over the past decades for studying long-term disease risk. Examples include the Framingham Heart Study2, the Atherosclerosis Risk in Communities (ARIC) study3, the Nurses' Health Study (NHS)4, the Health Professionals Follow-up Study5, and the Women's Health Initiative (WHI)6. These studies, with long-term follow-up information on a wide range of clinical conditions, have become a rich resource for studying disease risks. In addition to phenotype information recorded over time, bio-specimens from the study participants were often collected at baseline and stored for future molecular marker studies. As knowledge accumulates over time, new markers may emerge as candidates for disease risk prediction. Once such research questions are formed, the stored specimens can be assayed and the marker measurements can be linked to the phenotypes of interest to address these questions.
Large cohort studies provide a valuable means for assessing the prognostic potential of novel markers. However, obtaining marker values for the entire cohort may be infeasible because of the cost of marker measurement, especially if the clinical condition of interest is rare. As a cost-effective alternative to the standard full-cohort design, a sub-cohort sampling design, the nested case-control (NCC) study, is often employed in practice. For example, the NCC design was employed to study biomarkers for the risk of rheumatoid arthritis (RA) in the NHS.7 Under a standard two-phase NCC design, all subjects in the original phase-I cohort are followed to observe clinical outcomes, but expensive markers are measured only for the phase-II sub-cohort, which consists of the cases and of controls selected randomly from the risk sets of the cases. This design, while cost-effective, generates complex datasets for analysis because of outcome-dependent missingness and differential follow-up time.
Statistical methods for risk prediction with NCC studies
To develop a risk model with data collected under an NCC design, the hazard ratio parameters in a proportional hazards (PH) model8 can be estimated by fitting a conditional logistic regression (CLR) model.9 To estimate the absolute risk, Langholz and Borgan10 used the CLR estimator of the hazard ratios and derived a simple weighted estimator focusing on within-stratum comparisons. However, the CLR estimators can be quite inefficient. Furthermore, the CLR-based approaches cannot be easily extended to make inference beyond the PH model. Fully efficient nonparametric maximum likelihood estimators (NPMLEs) have been obtained by Scheike and Juul11 for the Cox model and by Zeng et al.12 for a class of linear transformation models (LTMs). One major limitation of the NPMLE is that it requires estimating the conditional density of the new markers given the other clinical variables, which may be infeasible in multi-dimensional settings unless additional modeling assumptions are imposed. One useful alternative to these methods is the inverse probability weighting (IPW) approach proposed by Samuelsen13. Besides being more efficient than the CLR, the IPW estimators are less model dependent and can be easily extended to other modeling paradigms. For example, Lu and Liu14 proposed IPW estimators under the LTMs. The IPW approach also allows for easy estimation of our targeted accuracy summaries.
To evaluate the predictiveness of a risk model, a popular approach is time-dependent receiver operating characteristic (ROC) curve analysis. Statistical methods for estimating the accuracy measures in the ROC paradigm have been proposed using time-to-event data from full cohorts.15-17 To compare different risk models and quantify the incremental value (IncV) of new markers over routine clinical risk factors, several approaches have been advocated, for example, comparing the area under the ROC curve (AUC).17,18 However, those methods are only applicable to standard full-cohort studies. To accommodate NCC sampling, Cai and Zheng19 developed consistent IPW estimators for time-dependent prediction accuracy measures. However, their estimators are based on the PH assumption and therefore may be biased under model misspecification. Cai and Zheng1 proposed a class of nonparametric IPW estimators of accuracy measures. However, these estimators accommodate only standard two-phase NCC studies and consider only the evaluation of a single marker.
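To make the IPW idea concrete, the following is a minimal sketch of inclusion probabilities of the kind used in this approach, assuming a simple NCC design with no matching and no left truncation, in which `m` controls are drawn at random from the risk set at each case's failure time. Under that assumption, a non-case is ever sampled with probability one minus the product, over the case times at which the subject is still at risk, of `1 - m / (n(t) - 1)`, where `n(t)` is the size of the risk set; cases are always included. The function name and arguments are illustrative and do not come from the cited papers.

```python
import numpy as np

def ncc_inclusion_prob(time, event, m):
    """Probability that each cohort member enters the NCC sample.

    time  : observed follow-up time for each subject
    event : 1 if the subject is a case, 0 if censored
    m     : number of controls drawn per case (assumed fixed)
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    case_times = np.sort(time[event])
    # Size of the risk set at each case time.
    n_at_risk = np.array([(time >= t).sum() for t in case_times])
    # Chance that an at-risk non-case is NOT drawn at a given case time.
    not_drawn = 1.0 - np.minimum(1.0, m / np.maximum(n_at_risk - 1, 1))
    prob = np.ones(len(time))  # cases are included with probability 1
    for i in np.where(~event)[0]:
        still_at_risk = case_times <= time[i]
        prob[i] = 1.0 - np.prod(not_drawn[still_at_risk])
    return prob

# Toy cohort: cases at times 2 and 5, one control per case.
p = ncc_inclusion_prob([1, 2, 3, 4, 5], [0, 1, 0, 0, 1], m=1)
```

The IPW weight of a sampled subject is then the reciprocal of its inclusion probability, so that each sampled subject stands in for the cohort members who were not selected.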
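As an illustration of how sampling weights enter an accuracy summary, here is a simplified sketch of a weighted AUC at a fixed horizon t. It assumes that each sampled subject's status at t is known (case if the event occurred by t, control otherwise), so it ignores censoring before t; the nonparametric estimators in the cited papers handle censoring and are not reproduced here. The function name and inputs are hypothetical.

```python
import numpy as np

def weighted_auc(score, is_case, weight):
    """Weighted concordance between cases and controls at a fixed horizon.

    score   : predicted risk for each sampled subject
    is_case : True if the event occurred by the horizon t
    weight  : inverse probability of inclusion in the sample
    """
    score = np.asarray(score, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    weight = np.asarray(weight, dtype=float)
    s1, w1 = score[is_case], weight[is_case]
    s0, w0 = score[~is_case], weight[~is_case]
    # A case-control pair is concordant if the case has the higher score;
    # tied scores count as half.
    concordant = (s1[:, None] > s0[None, :]) + 0.5 * (s1[:, None] == s0[None, :])
    pair_weight = w1[:, None] * w0[None, :]
    return float((pair_weight * concordant).sum() / pair_weight.sum())

scores = [0.9, 0.8, 0.3, 0.1]
status = [True, False, True, False]
auc_equal = weighted_auc(scores, status, [1, 1, 1, 1])
auc_ipw = weighted_auc(scores, status, [1, 1, 1, 3])
```

With equal weights this reduces to the usual empirical AUC; unequal weights let under-sampled subjects count proportionally more.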
Results for analyzing the NHS data
In the NHS, two sets of biomarkers were measured sequentially at two different phases of sampling from the original cohort. To accommodate such a three-phase NCC study, we extend the double IPW estimators of Cai and Zheng1 to develop absolute risk prediction models under time-dependent generalized linear models and to evaluate the incremental values of new biomarkers. Using data from the NHS, we developed four absolute risk models for RA: (i) a clinical model with clinical markers alone; (ii) a biomarker model with clinical markers and two biomarkers; (iii) a genetic model with clinical markers and a genetic risk score (GRS); and (iv) a full model with clinical markers, biomarkers, and the GRS, for predicting a short-term, 5-year risk, and a long-term, 15-year risk. Our results suggest that RA risk increases with pack-years of smoking, whereas alcohol consumption appears to be inversely associated with RA risk. In addition, higher levels of sTNFRII and IL6 are associated with higher risk of RA; IL6 has a stronger effect on the 5-year RA risk than on the 15-year risk, whereas sTNFRII appears to have a stronger effect on the 15-year risk. The GRS appears to be highly predictive of RA risk. The clinical models appear to have low discriminatory capacity, with area under the ROC curve (AUC) between 0.53 and 0.58. Adding the biomarkers appears to be more helpful for predicting the short-term risk than the long-term risk. Aggregating information from both the genetic markers and biomarkers increases accuracy substantially compared with the clinical models, with AUC improving to 0.70 for 5-year risk and 0.62 for 10-year risk.
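The weighting idea behind a multi-phase design can be sketched as follows, under the assumption that the phase-III sample is drawn from the phase-II sample, so that a subject with complete marker data receives the reciprocal of the product of the two selection probabilities. This is a generic illustration with hypothetical names and inputs, not the estimator developed in this work.

```python
import numpy as np

def double_ipw_weights(in_phase2, p_phase2, in_phase3, p_phase3_given_2):
    """Combined weights for subjects observed through both sampling phases.

    in_phase2        : True if selected into the phase-II sample
    p_phase2         : probability of selection into phase II
    in_phase3        : True if selected into the phase-III sample
    p_phase3_given_2 : probability of phase-III selection given phase II
    """
    in2 = np.asarray(in_phase2, dtype=bool)
    in3 = np.asarray(in_phase3, dtype=bool) & in2  # phase III is nested in phase II
    p2 = np.asarray(p_phase2, dtype=float)
    p3 = np.asarray(p_phase3_given_2, dtype=float)
    weights = np.zeros(len(in2))  # subjects without full data get weight 0
    weights[in3] = 1.0 / (p2[in3] * p3[in3])
    return weights

w = double_ipw_weights(
    in_phase2=[True, True, False, True],
    p_phase2=[1.0, 0.5, 0.5, 0.25],
    in_phase3=[True, False, False, True],
    p_phase3_given_2=[1.0, 0.5, 0.5, 0.5],
)
```

Weights of this form could be passed to a weighted accuracy summary, and the incremental value of a marker set would then be the difference between the summaries of the larger and smaller models.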
Discussion
It is challenging to assess the prognostic capability of new biomarkers based on sequentially selected subsamples of the study population. The extension of the double IPW estimators provides a robust approach for evaluating the IncV of new biomarkers on top of conventional risk factors in absolute risk prediction, allowing for multi-phase NCC sampling designs. In addition, this method is straightforward to extend to other accuracy measures, including the positive and negative predictive values15,20 as well as the integrated discrimination improvement index and the net reclassification improvement index.21
Acknowledgments
The authors wish to thank the participants, investigators and study staff of the Nurses' Health Studies, headquartered at the Channing Laboratory, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA. This research is supported by P01 CA053996, R01 GM079330, R01 GM085047, U01 CA086368, and U54 LM008748.
References
- 1. Cai T, Zheng Y. Nonparametric evaluation of biomarker accuracy under nested case-control studies. Journal of the American Statistical Association. 2011;106:569–580. doi: 10.1198/jasa.2011.tm09807.
- 2. Wilson P, D'Agostino R, Levy D, Belanger A, Silbershatz H, Kannel W. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97:1837–47. doi: 10.1161/01.cir.97.18.1837.
- 3. Folsom A, Aleksic N, Catellier D, Juneja H, Wu K. C-reactive protein and incident coronary heart disease in the Atherosclerosis Risk In Communities (ARIC) study. American Heart Journal. 2002;144:233–238. doi: 10.1067/mhj.2002.124054.
- 4. Colditz G, Manson J, Hankinson S. The Nurses' Health Study: 20-year contribution to the understanding of health among women. Journal of Women's Health. 1997;6:49–62. doi: 10.1089/jwh.1997.6.49.
- 5. Rimm E, Giovannucci E, Willett W, Colditz G, Ascherio A, Rosner B, Stampfer M. Prospective study of alcohol consumption and risk of coronary disease in men. The Lancet. 1991;338:464–468. doi: 10.1016/0140-6736(91)90542-w.
- 6. Anderson G, Manson J, Wallace R, Lund B, Hall D, Davis S, Shumaker S, Wang C, Stein E, Prentice R. Implementation of the Women's Health Initiative study design. Annals of Epidemiology. 2003;13:5. doi: 10.1016/s1047-2797(03)00043-7.
- 7. Karlson E, Chibnik L, Tworoger S, Lee I, et al. Biomarkers of inflammation and development of rheumatoid arthritis in women from two prospective cohort studies. Arthritis & Rheumatism. 2009;60:641–652. doi: 10.1002/art.24350.
- 8. Cox D. Regression models and life-tables. J R Statist Soc B. 1972:187–220.
- 9. Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in the Cox regression model. The Annals of Statistics. 1992;20:1903–1928.
- 10. Langholz B, Borgan Ø. Estimation of absolute risk from nested case-control data. Biometrics. 1997;53:767–774.
- 11. Scheike T, Juul A. Maximum likelihood estimation for Cox's regression model under nested case-control sampling. Biostatistics. 2004;5:193–206. doi: 10.1093/biostatistics/5.2.193.
- 12. Zeng D, Lin D, Avery C, North K, Bray M. Efficient semiparametric estimation of haplotype-disease associations in case–cohort and nested case–control studies. Biostatistics. 2006;7:486–502. doi: 10.1093/biostatistics/kxj021.
- 13. Samuelsen S. A pseudolikelihood approach to analysis of nested case-control studies. Biometrika. 1997;84:379–394.
- 14. Lu W, Liu M. On estimation of linear transformation models with nested case–control sampling. Lifetime Data Analysis. 2011 doi: 10.1007/s10985-011-9203-3.
- 15. Heagerty P, Lumley T, Pepe M. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics. 2000;56:337–44. doi: 10.1111/j.0006-341x.2000.00337.x.
- 16. Heagerty P, Zheng Y. Survival model predictive accuracy and ROC curves. Biometrics. 2005;61:92–105. doi: 10.1111/j.0006-341X.2005.030814.x.
- 17. Uno H, Cai T, Tian L, Wei L. Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association. 2007;102:527–537.
- 18. DeLong E, DeLong D, Clarke-Pearson D. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845.
- 19. Cai T, Zheng Y. Evaluating prognostic accuracy of biomarkers under nested case-control studies. Biostatistics. 2011 doi: 10.1093/biostatistics/kxr021.
- 20. Pepe M. The statistical evaluation of medical tests for classification and prediction. Oxford University Press, USA; 2004.
- 21. Pencina M, D'Agostino R, Sr, D'Agostino R, Jr, Vasan R. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine. 2008;27:157–172. doi: 10.1002/sim.2929.
