Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2018 Nov 27.
Published in final edited form as: Stat Theory Relat Fields. 2017 Nov 27;1(2):159–170. doi: 10.1080/24754269.2017.1400418

Robust Dynamic Risk Prediction with Longitudinal Studies

Qian M Zhou 1,†,, Wei Dai 2,, Yingye Zheng 3, Tianxi Cai 4
PMCID: PMC5766050  NIHMSID: NIHMS926351  PMID: 29335682

Abstract

Providing accurate and dynamic age-specific risk prediction is a crucial step in precision medicine. In this manuscript, we introduce an approach for estimating the τ-year age-specific absolute risk directly via a flexible varying coefficient model. The approach facilitates the utilization of predictors varying over an individual's lifetime. By using a nonparametric inverse probability weighted kernel estimating equation, the age-specific effects of risk factors are estimated without requiring the specification of the functional form. The approach allows borrowing information across individuals of similar ages, and therefore provides a practical solution for situations where the longitudinal information is only measured sparsely. We evaluate the performance of the proposed estimation and inference procedures with numerical studies, and make comparisons with existing methods in the literature. We illustrate the performance of our proposed approach by developing a dynamic prediction model using data from the Framingham Study.

Keywords: Inverse probability weighting, Longitudinal markers, Nonparametric smoothing, Predictive accuracy, Risk prediction, Survival analysis

1 Introduction

Accurate and individualized risk prediction is a key component of precision medicine. For example in colorectal cancer, risk calculator can help tailoring individual's screening regimen and making decisions on specific ages for screening initialization and surveillance. Factors pertaining to specific ages, such as family history and nutrition intake are important to be incorporated in the outcome prediction. For cardiovascular disease (CVD), the Framingham risk score (FRS) (Wolf and others 1991) has been developed separately for men and women based on risk factors such as total cholesterol, high-density-lipoprotein, systolic blood pressure and smoking status. Patients with 10-year FRS below 10% are considered to be at lower risk for vascular events during the next decade, whereas patients with scores between 10% and 20% are at moderate risk and those larger than 20% are at higher risk. Various intervention strategies can be implemented based on such risk stratification (Mosca and others 2004). In these clinical settings, the analytical goal is to provide patients with an estimate of the likelihood of developing a disease within the next τ-years given the subject is disease free at age a and his/her risk profile updated by age a, i.e., the age-specific absolute τ-year residual life risk.

Currently available prediction models are often limited in predicting such age-specific absolute residual life risk. For example, the FRS model includes age as a standard risk factor with linear effects. However it is well recognized that the relationship between age and CVD risk may change in magnitude through complex interactions with other risk factors (Ridker and others 2007). Naturally, a risk equation should be a more stochastic function of age (Lloyd-Jones 2010). Such simplistic models, failing to capture the complex age varying effects, may lead to poor risk estimates and prediction models with low discriminatory power. To be clinically useful with sufficiently adequate prediction accuracy, an ideal prediction model should take into account an individual's most up to date health information and reflect the fact that risk factors may have differential effects on the τ-year residual life risk over various stages of an individual's life span.

Constructing a model that accurately captures how risks change dynamically over lifetime, while of great importance, is challenging for several reasons. Often important risk factors change over time. However, collecting such time-varying information for a large prospective cohort can be a major undertaking. Most cohort studies only collect age-specific information intermittently, sometimes irregularly. Characterizing changes over time with limited measures at discrete time points requires much deliberation. Furthermore, the importance of risk factors on disease outcome may change during an individual's lifetime. For example, body mass index (BMI) may have substantially different effects on future CVD risks depending on the age. Characterizing age-varying effects with a powerful yet flexible statistical model can be challenging. Similar challenges arise when assessing the prediction performance of an age-specific prediction model since the accuracy may also vary with age.

A popular approach in the risk prediction literature is to decompose the absolute risk into two components: the age-dependent disease risk for a baseline risk profile, which can be estimated from a prospective cohort study or external disease incidence data, and the relative risk of developing disease for a particular risk factor profile compared to the baseline (Gail and others 1989, Liu and others 2014). While such an approach can accommodate differential effects of risk factors at different ages by adding interaction terms of age and the other covariates in the regression model, it does not incorporate time-varying covariates collected over time. To incorporate repeated measurements, joint modeling of both the covariate process Z(·) and a survival outcome T have been developed in recent years (Tsiatis and others 1995, Wang and Taylor 2001, Ye and others 2008). Parameter estimation typically involves specifying the covariate process Z(·) and then linking Z(·) to T via a proportional hazard (PH) model with time-varying covariates. In addition to requiring strong modeling assumptions about Z(·) and T, these joint modeling methods have a limitation of being computationally infeasible when many time-varying covariates are under consideration. Semi-parametric methods have been proposed to directly make prediction of τ-year residual life risk given covariate information at a landmark time t0 (Zheng and Heagerty 2005, Parast and others 2012). However, no existing methods allow for incorporating age-specific effects non-parametrically with sparsely measured time-varying covariates. Furthermore, procedures for non-parametrically evaluating such age-specific prediction models with longitudinal markers and censored event times are not yet available.

In this paper, we propose to directly model the age-specific absolute risk function via a flexible varying-coefficient model and estimate the covariate effects as functions of age via inverse probability weighted (IPW) kernel estimating equations. The procedure allows the estimation of age-specific risks flexibly with the longitudinally collected risk factor information on the same patient, while borrowing strength across individuals of similar ages at different study time points. It also handles irregularly measured serial covariates and censoring easily. Our proposed model, by allowing the effects of risk factors to change over age and the target residual life span τ, is more realistic and could potentially lead to improved predictive performance. To quantify the performance of the age-specific models of τ-year residual life risk, it would be desirable to consider measures of prediction performance specific to age and prediction time since the prediction accuracy of such models is likely to vary over both dimensions. A wide range of performance measures have been considered to quantify the time-specific prediction accuracy of τ-year absolute risk models constructed with baseline markers (Zheng and others 2006, Uno and others 2007, Gerds and others 2008). In the longitudinal setting, Zheng and Heagerty (2004) considered a model-based approach for estimating the accuracy in the absence of censoring. To guard against potential model mis-specification and incorporate censored outcomes, we propose an IPW kernel estimator to calculate model performance parameters that quantify the accuracy of the proposed prediction models in predicting τ-year residual life at age a. No existing methods provide non-parametric estimates of such prediction performance with longitudinal markers and censored outcomes.

The remainder of this manuscript is organized as follows. In Section 2, we describe the proposed model and estimation framework. Then we present simulation results comparing our approach to other popular methods in Section 3. We apply the proposed method to the Framingham Heart Study in Section 4, assessing the age-specific effects of routinely used cardiovascular risk factors on the 10-year residual CVD risk and quantifying the performances of the prediction models. We conclude with a brief discussion in Section 5.

2 Methods

Let Ti be the time to event onset since a baseline time such as study entry. Due to censoring, one can only observe Xi = min{Ti, Ci} and δi = I(TiCi), where Ci is the censoring time and I(·) is the indicator function. To facilitate the calculation of age-specific risks, we also record age at the occurrence of the event and censoring. Let Ai0 be the age at which subject i enters the study and then TiA=Ti+Ai0 is the age at which the event occurs, and CiA=Ci+Ai0 is the age at which TiA might be censored. Let XiA=Xi+Ai0=min{TiA,CiA}. In addition to the event time information, risk markers are ascertained repeatedly during the follow up. For the ith subject, let Zi(a) = (Zi1(a), Zi2(a), …, Zip(a)) denote a vector of p risk factors measured at age a, let {Aik, k = 0, …, mi} be the ages at which these risk factors are collected and Zij = Zi(Aij), where mi is the total number of measurement times. We assume that Zi(a) is potentially observable among those with TiA>a, the values of Zi(Aij) are not dependent on the study measurement time AijAi0 given age Aij. In addition, Ci is assumed independent of Ti, Zi(·), entry age Ai0, and the underlying study measurement times, with support not shorter than that of AijAi0 + τ. Figure 1 provides a graphical illustration of the data structure.

Figure 1.

Figure 1

Data structure.

2.1 Modeling, Estimation and Inference

We are interested in estimating the risk of experiencing the event in the next τ-years for subjects who are at age a and event-free, based on the risk factor measured at age a, Z(a). Thus, the goal is to estimate the conditional risk function

πτ,a(z)=Pr{TAaτ|TA>a,Z(a)=z,A=a}.

To approximate πτ,a(z), we propose a flexible varying coefficient model

πτ,a{Z(a)}=g{βτ(a)U(a)}, (1)

where g(·) is a known smooth probability distribution function, such as g(x) = exp(x)/{1 + exp(x)} (which is used in Section 4), U(a) = ψ{Z(a)} represents transformed risk factors for some known function ψ and U(a) includes 1 as the first component, such as a log transformed risk factor (used in Section 4), βτ(a) is an unknown smooth function representing the covariate effects on the τ-year residual life risk at age a. Model (1) allows the effects of risk factors Z(a) to vary over both age and the residual life span τ. This flexibility is attractive when the risk factors have different effects on long term versus short term risks and when certain risk factor profiles have more detrimental effects for younger subjects than for older subjects.

To estimate βτ(a) for any given age a in the presence of censoring, we propose to obtain β̂τ(a) as the solution to the IPW kernel smoothed estimating equation, Φ̂a(β) = 0, where

Φ^a(β)=1ni=1nj=0miw^τijI(XiAAij)Kh(Aija)Uij{I(XiA<Aij+τ)g(βUij)}, (2)

where Uij = Ui(Aij), w^τij=δiI(XiAAij+τ)/G^(Xi)+I(XiA>Aij+τ)/G^(AijAi0+τ) with Ĝ(·) being the Kaplan-Meier estimator for the survival function of the censoring time G(c) = P(Ci > c) of Ci, and Kh(s) = K(s/h)/h is a symmetric standard kernel function K(·) with a finite support and with h the smoothing parameter. Note that under the independent censoring assumption, E(w^τij|TiA,Aij,Zij,j=0,,,mi)1.

Following similar arguments as given in Cai and others (2010) and Parast and others (2012), one may show that β̂τ(a) converges in probability to a deterministic vector β̄τ(a) as n → ∞ regardless of whether (1) is correctly specified or not. In addition, one may also show that for h = Op(nv) with v ∈ (1/5, 1/2), (nh)12{β^τ(a)βτ(a)} converges in distribution to a zero-mean normal random vector for any given a. However, it is difficult to directly estimate the asymptotic variance of (nh)12{β^τ(a)βτ(a)}. To construct confidence interval (CI) for βτ(a) in practice, we suggest using a perturbation resampling (sometimes referred to as wild bootstrap) method (Wu 1986, Park and Wei 2003, Tian and others 2005) to approximate the distribution of the proposed estimator. Compared to the standard bootstrap, perturbation resampling tends to be more stable especially in the survival setting since all observations take positive weights and contribute to the estimation. Let V(b)={V1(b),,Vn(b)}, b = 1, …, B, be B sets of independent positive random variables from a known distribution with mean and variance equal to one. Then one may obtain perturbed estimates of βτ(a), β^τ(b)(a), as the solution to the equation

1ni=1nj=0miw^τij(b)I(XiAAij)Kh(Aija)Uij{I(XiA<Aij+τ)g(βUij)}=0,

where w^τij(b)=Vi(b){δiI(XiAAij+τ)/G^(b)(Xi)+I(XiA>Aij+τ)/G^(b)(AijAi0+τ)} and Ĝ(b)(·) is the weighted Kaplan-Meier estimator of G(·) with each subject's contribution to the estimator weighted by Vi(b). The asymptotic variance of β̂τ(a), σ^βτ(a)2, can be estimated by the empirical variance of { β^τ(b)(a), b = 1, …, B}. The 100(1 – α)% simultaneous confidence bands for {βτ(a), al < a < au} can be obtained as {β̂τ(a) ±ζασ̂βτ(a)}, where ζα is the 100(1 – α)th percentile of { supal<a<au|β^τ(b)(a)β^τ(a)|/σ^βτ(a), b = 1, …, B}. To justify the resampling method, we may first show that

(nh)12{β^τ(a)βτ(a)}=(nh)1/2i=1nj=0miKh(Aija)U(Aij,Uij,XiA)+Op(h12)

where 𝒰 is some deterministic function and E{U(Aij,Uij,XiA)}=0. Thus for any fixed a and h = Op(nv) with v ∈ (1/5, 1/2), (nh)12{β^τ(a)βτ(a)} is asymptotically normal. Furthermore, following similar arguments as given in Tian and others (2005), we may show that with proper normalization, supa|(nh)12{β^τ(a)βτ(a)}| converges to an extreme value distribution. In addition,

(nh)12{β^τ(b)(a)β^τ(a)}=(nh)1/2i=1nj=0miKh(Aija)U(Aij,Uij,XiA)(Vi(b)1)+Op(h12),

where Op* is with respect to probability space generated by both the observed data and 𝒱(b). Then we may show that conditional on the data, (nh)1/2i=1nj=0miKh(Aija)U(Aij,Uij,XiA)(Vi(b)1) converges in distribution to the same limiting unconditional distribution of (nh)1/2i=1nj=0miKh(Aija)U(Aij,Uij,XiA).

2.2 Accuracy Measure Estimation

With βτ(a) estimated as β̂τ(a), one can then use π̂τ,a{Z(a)} = g{β̂τ(a)U(a)} to estimate the τ-year survival probability for event-free subjects with risk factor profile Z(a) at age a. To assess the accuracy of the limiting risk model π̄τ,a{Z(a)} = g{β̄τ(a)U(a)} in predicting the τ-year residual survival status for different age groups, we extend commonly used time-dependent accuracy parameters, such as true positive rate (TPR), false positive rate (FPR) and area under the receiver operating characteristic curve (AUC), to also incorporate the age dimension. Since the proposed model evaluation method is not limited to a specific prediction model, we next describe these accuracy parameters for a genetic τ-year residual life risk function, Πτ,A(Z), derived based on an age A and the risk factor Z(·) collected up to age A. For the proposed model, Πτ,A(Z) = g{βτ(A)U(A)}.

2.2.1 Time and Age Specific Prediction Accuracy

When a specific age group a is of interest, we summarize the prediction performance of risk model Πτ,A(Z) using time and age specific TPR and FPR functions, respectively defined as

TPRτ,a(c)=p{Πτ,A(Z)>c|0<TAaτ,A=a},FPRτ,a(c)=P{Πτ,A(Z)>c|TAa>τ,A=a}.

We may summarize the overall predictiveness of the model for a given a and τ using

AUCτ,a=TPRτ,a(c)dFPRτ,a(c)=P[Πτ,Ai(Zi)Πτ,Ai(Zi)|0<TiAaτ,TiAaτ,Ai=Ai=a],

where i and i′ index two independent individuals. Similar to the estimation of βτ(a), these parameters can be estimated using an IPW kernel smoothing approach. For example, TPRτ,a(c) can be estimated as

TPR^τ,a(c)=i,j:0<XiAa<τw^τijKh(Aija)I{Π^τ,Aij(Zi)c}i,j:0<XiAa<τw^τijKh(Aija). (3)

and AUCτ,a can be estimated as

AUC^τ,a=i,j,i,j:0<XiAa<τXiAaw^τijw^τijKh(Aija)Kh(Aija)I{Π^τ,Aij(Zi)Π^τ,Aij(Zi)}i,j,i,j:0<XiAa<τXiAaw^τijw^τijKh(Aija)Kh(Aija). (4)

where Π̂τ,Aij(Zi) is the estimated risk function plugging in estimated model parameters. For the proposed varying coefficient model, Π̂τ,Aij(Zi) = g{β̂τ(Aij)Uij} and using similar arguments as those for the consistency of β̂τ(a), one may show that AUC^τ,a is a consistent estimator of AUCτ,a. Furthermore, (nh)12(AUC^τ,aAUCτ,a) converges in distribution to a normal random variable. The CI for AUCτ,a can be constructed by perturbed estimates. Specifically, for b = 1, …, B, the bth perturbed estimate of AUC^τ,a can be obtained as

AUC^τ,a(b)=i,j,i,j:0<XiAa<τXiAaw^τij(b)w^τij(b)Kh(Aija)kh(Aija)I{Π^τ,Aij(b)(Zi)Π^τ,Aij(b)(Zi)}i,j,i,j:0<XiAa<τXiAaw^τij(b)w^τij(b)Kh(Aija)kh(Aija). (5)

where Π^τ,Aij(b)(Zi) is the perturbed counterpart of Π̂τ,Aij(Zi) obtained similar to β^τ(b)(a) using weights 𝒱(b).

2.2.2 Time Specific Prediction Accuracy Including Age as a Predictor

When interest lies in evaluating the accuracy of the prediction model treating age as a risk factor, an overall summary that is not conditional on age would be preferred. That is, incorporating age as a predictor, one may seek to assess the accuracy in predicting τ-year residual life of the risk estimate Πτ,A(Z), constructed using Z(·) information collected up to a random age A among those with T𝔸 > A. For such settings, one may consider

TPRτ(c)=P[Πτ,A(Z)>c|0<TAAτ]=TPRτ,adF(a)=E(TPRτ,A),FPRτ(c)=P[Πτ,A(Z)>c|TAA>τ]=FPRτ,adF(a)=E(FPRτ,A).

where ℱ is the distribution of the age at measurement. The overall performance of the risk model Πτ,a(Z) for predicting τ-year residual life can be summarized by

AUCτ=P[Πτ,Ai(Zi)>Πτ,Ai(Zj)|0TiAAi<τ,TiAAiτ]=TPRτ,a(c)dFPRτ,a(c)dF(a)dF(a)=E{TPRτ,Ai(c)dFPRτ,Ai(c)},

where i and i′ index two independent subjects. Plug-in estimates may be constructed for these parameters with an estimated ℱ(·). For example, in the simple case when only a single measurement is taken at baseline, then AUCτ may be estimated as

n2i,iTPR^τ,Ai0(c)dFPR^τ,Ai0(c)i,i:0<XiAAi0<τXiAAi0w^τi0w^τi0I[Π^τ,Ai0{Zi(Ai0)}Π^τ,Ai0{Zi(Ai0)}]i,i:0<XiAAi0<τXiAAi0w^τi0w^τi0. (6)

Given the longitudinal data structure, we may also be interested in estimating these accuracy parameters when the age at measurement A follows the marginal distribution of the observed measurement ages in the study. In which case,

TPRτ(c)=P{Πτ,Aij(Zij)>c|0<TiAAijτ},FPRτ(c)=P{Πτ,Aij(Zij)>c|0<TiAAijτ},

and AUCτ=P{Πτ,Aij(Zij)>Πτ,Aij(Zij)|0<TiAAijτ,TiAAij>τ}.

The accuracy measure AUCτ in this case can be estimated as

i,j,i,j:0<XiAAijτ<XiAAijw^τijw^τijI{Π^τ,Aij(Zi)Π^τ,Aij(Zi)}i,j,i,j:0<XiAAijτ<XiAAijw^τijw^τij.

Standard error (SE) and CIs can be constructed similarly to those given above for AUC^τ,a.

2.3 Selection of Smoothing Parameter

It is known that the choice of the smoothing parameter h is critical as in any nonparametric estimation problem. We employ a K-fold cross validation to select the smoothing parameter. Specifically, the study subjects are divided into K folds of approximately equal sizes. The optimal bandwidth hopt minimizes the weighted mean squared prediction error

k=1KiSkj=0miw^τij[I(XiA<Aij+τ)g{Uijβ^τ(k)(Aij)}]2,

where Sk is the set of subjects that are in fold k and β^τ(k)(Aij) is the estimate of βτ(Aij) using data excluding those from fold k. To obtain an estimator whose variance dominates bias, we follow the common practice to undersmooth (Neumann and Polzehl 1998, Tian and others 2005, Cai and others 2010) using the final bandwidth h=hoptn10.1, where n1 is the number of observed events by τ years.

3 Simulations

In this section, we report results from simulation studies that examine the finite-sample performance of the proposed methods and compare our methods with existing methods. Although the proposed methods are catered for survival data with longitudinally measured risk factors, it can be applied to traditional survival data with one single measurement of the risk factor at baseline to flexibly capture the age effect. For the single measurement setting, we compare the proposed procedure to the standard PH model which includes age as a covariate. In the longitudinal setting, we will compare the proposed methods to the commonly used joint modeling approach. For both settings, we considered n = 2000 and 5000, let K(·) be the Gaussian kernel, and B = 1000 for perturbations. For each configuration, results are summarized based on 1000 simulated datasets.

3.1 Simulations with a Single Measurement

For this setting, we simulate Ai0 from Uniform (15, 75) and two independent baseline covariates Zi1(Ai0) and Zi2(Ai0) from N (0, 4) and N (1, 4) respectively. The survival time from entry, Ti, is generated from log(Ti) = 2.5 + β1(Ai0)Zi1(Ai0) + β2(Ai0)Zi2(Ai0) + 0.5, where β1(a) = a2/7500, β2(a) = {(a – 45)2 – 100}/2000 and follows a standard logistic distribution. The censoring time Ci is generated from exp(i) where i ∼ N (1.6, 0.36), resulting in about 80% of censoring. For illustration, we choose τ to be 5 years. Throughout, we choose g(·) to be the logistic link function. Using the proposed bandwidth selection procedure with 5-fold cross-validation, h is about 6.4 and 5.4 for n = 2000 and 5000 respectively. We obtain the estimates of βτ(a) and AUCτ,a for ages from 20 to 70. Throughout the simulation studies, we let Uij = Zij = Zi(Aij).

In Figure 2, we present the average of the point estimates, the average of the SE estimates compared with the empirical SEs and the coverage probabilities (CovPs) of the 95% CIs for βτ(a) and AUCτ,a across a range of a. The results suggest that the proposed estimators produce negligible biases, and the estimated SEs are close to the empirical SEs. The empirical CovPs of the 95% CIs are close to their nominal level for βτ(a) coefficients. For AUC, the CovPs of the CIs are close to the nominal level but slightly below 95% for younger ages when n = 2000, possibly due to the fact the curvature of the AUC function is high in that range leading to a slight bias. The results are much improved when n increases to 5000.

Figure 2.

Figure 2

Average of estimates, average of the standard error estimates (ASE), empirical standard errors (ESE) and empirical coverage probabilities (CovP) of the 95% CI. One measurement at baseline. Each entry is based on 1000 simulated samples. The x-axis in all the plots is age.

For comparison, we obtain an alternative τ-year risk estimate, Π^τ,Ai0cox(Zi0)=gcox(logΛ^τ+γ^AAi0+γ^Zzi0), from fitting a Cox model including Ai0 and Zi0 = (Zi1(Ai0), Zi2(Ai0)) as covariates, where Λ̂τ is the estimated baseline cumulative hazard function at τ, gcox(x) = 1 – eex, and (γ^A,γ^Z) are the estimated log-hazard ratio for (Ai0, Zi0). For both of the risk estimates from our proposed method and the Cox model, we evaluate their age-specific prediction performance as well as the overall prediction performance based on AUCτ. As shown in Figure 3, the age specific AUC, AUCτ,a, of our proposed approach was generally higher than those from the Cox model. The overall AUC, AUCτ, was about 0.817 for the proposed model and 0.77 for the Cox model. The average difference between the two overall AUCτ's was 0.047 (SE = 0.007). These results highlight the improved prediction performance for using the proposed age-specific model.

Figure 3.

Figure 3

Age-specific AUCs for the proposed method (Age-specific) and Cox model (Cox). Values are averaged over 1000 repetitions with sample size n = 5000.

3.2 Simulations with Longitudinal Measurements

We also conducted simulation studies to examine the performance of the proposed procedures in longitudinal settings. To simulate the age at the occurrence of event T𝔸 and longitudinal measurements of a risk factor Z(a), we generate two random effects α0, α1 from N(0, 1). The age of event T𝔸 is obtained from

log(TA)=0.5log(log+3)/{Φ(α1)+0.5}+3

where is generated from an uniform distribution over (0, 1) and Φ(·) is the cumulative distribution function of a standard normal distribution. We simulate age of entry to the study A0 from Uniform (10, 70). Among the subjects who survive by the entry to the study, i.e., T𝔸 > A0, we randomly sample n subjects as our cohort. For the i-th subject in this cohort, the survival time since entry is Ti=TiAAi0, and Ci is generated from a Uniform (10, 40), which leads to about 25% of censoring. The risk factor is measured at the entry age Ai0 and ages Aij = Ai0 + Δij after entering the study until event or censored, whichever comes first, where Δij is generated from a N(4, 1) distribution. And the observed marker value at age Aij is Zi(Aij) = α0i + α1i log(Aij) + ei(Aij) where ei(Aij) ∼ N(0, 1.52). We choose τ to be 10 years. The selected smoothing parameter h is around 5.0 and 1.3 for n = 2000 and 5000 respectively using 5-fold cross-validation scheme described in Section 2.3. We obtain βτ(a) and AUCτ,a estimates for ages from 20 to 60.

The average of the point estimates, average of the SE estimators, the empirical SEs and the coverage probabilities of the 95% CIs for the βτ(a) and AUCτ,a at a series of ages are shown in Figure 4. The proposed procedures yield estimators with negligible biases. The estimated SEs obtained through perturbation resampling are close to the empirical SEs.

Figure 4.

Figure 4

Average of estimates, average of the standard error estimates (ASE), empirical standard errors (ESE) and empirical coverage probabilities of the 95% CI. Longitudinal measurements. Each entry is based on 1000 simulated samples. The x-axis in all the plots is age.

For comparison, we also fit the data with a joint modeling (JM) approach (Tsiatis and Davidian 2004, Rizopoulos 2010) for longitudinal and survival data. In particular, we follow the setup in Rizopoulos (2010) to specify a linear mixed effects model with random intercept and slope for the longitudinal measurements Zij and a PH model relating the hazard function to the random slope, in which the log baseline hazard is approximated using B-splines. With the parameter estimates from the joint model, for subject i with measurement at Aij, one can use a Monte Carlo approach to predict τ-year residual life risk given that the person has survived Aij and we let Π^τ,AijJM(Zi) denote the resulting estimate of the risk function. We can estimate its corresponding AUCτ,a as discussed in Section 2.2. We present the average of the estimated AUCτ,a for the two modeling approaches at n = 5000 in Figure 5. The results suggest that the proposed approach improved prediction accuracy over a wide range of ages compared to the JM approach.

Figure 5.

Figure 5

Age-specific AUC for the proposed method (Age-specific) and JM with longitudinal measurements. Values are averaged over 1000 repetitions with sample size n = 5000.

4 Application

In this section, we apply the proposed methods to the Framingham Heart Study to develop and evaluate age-specific CVD or death risk prediction models. The original goal of the study was to identify the common factors that contribute to CVD by following its development over a long period of time in a large group of participants who had not yet developed CVD. Started in 1948 with 5,209 adult subjects, the study is now on its third generation of participants. Information on a wide spectrum of risk factors and disease outcomes is collected on each of the many followup visits during participants' lifetime. The dataset consists of 3,982 subjects (2,108 females and 1,874 males) with complete information on the risk factors at least one measurement time. Several traditional Framingham risk factors were collected on these subjects on each of their visits, including age, diastolic blood pressure, cholesterol, high-density lipoprotein (HDL), diabetes and smoking. In addition, an inflammation marker, C-reactive protein (CRP), was also measured at various visits. The median number of measurement times is 3. We use both Framingham risk factors and CRP to estimate age-specific 10-year risk of CVD or death for females and males separately. Thus we only include visits where all these risk factors are measured based on the study design. The outcome of interest is time to the onset of first major CVD event or death. Such a composite outcome avoids the issues of having to account for the competing risks from other causes of death. Neverthelss we note that our calculation of the probability of experiencing CVD or death within 10-years in this dataset can be regard approximately as 10-year CVD risk since majority of the observed events are CVD events, especially at younger age. In the study, there are 54 subjects who had CVD prior to death within 10 years. The cumulative incidence rate for CVD prior to death within 10 year is estimated to be about 1.4%. The median follow-up time was 32 years and the entry ages range from 5 to 70 with a median of 35.

We fit the proposed age-specific 10-year risk model (1) with a logistic link. Predictors include original scale of diastolic blood pressure, diabetes, smoking, the log scale of cholesterol, high-density lipoprotein and CRP. We use a gaussian kernel for K(·), with the smoothing parameter h selected as 5.2 for males and 3.8 for females using the 5-fold cross-validation scheme described in Section 2.3. In Figure 6(a) and (b), we demonstrate how the effects of major risk factors may vary by age for women and men respectively. For men, blood pressure does not show any significant association with risk of outcome for all the ages, whereas for women, higher blood pressure significantly increases the risk, especially for younger ages and older ages. For men, as expected, high cholesterol increases risk, and our analysis further reveals that for men, the effects decrease with ages. In other words, having high cholesterol for a younger man expose them to higher risk compared with older men. For women, diabetes status also shows an age-varying pattern: the effect increases with age initially, reaches a peak at 50, and goes down afterward. Smoking shows significant effect at age 45 for women, but the effect diminishes as age increases. However, for men, neither diabetes status nor smoking shows significant association with the risk at almost all ages. HDL shows no significant association with outcome for women, but is inversely associated with the risk between age 40 and 60 and no significant association at other ages for men. Finally CRP also exhibits different age-varying effects between women and men. For men, the effect of CRP is monotonically decreasing as age increases while for women, higher CRP increases risk before 60 years old and peaks at around 45 years old. In Figure 6(c), we show the age-specific AUC of the proposed age-specific risk scores for men and women. For women, the AUC10,a does not vary substantially over age a with values fluctuating around 0.7. On the contrary, for men, AUC10,a is substantially higher for younger ages with value as high as 0.9 and decreases to 0.6 at older ages. Thus, the age-specific risk model is highly accurate in predicting 10-year risk of CVD or death for younger males but only moderately accurate for middle aged or older males.

Figure 6.

Figure 6

Data analysis: Framingham Heart Study. The panels (a) and (b) show the point and interval estimates of the age-specific effects of the major risk factors for male and female, respectively. The panel (c) shows the point and interval estimates of the age-specific AUC for male and female respectively.

5 Discussion

When subjects are monitored over time for a clinical condition, it is highly desirable to dynamically recalculate risk estimates according to the updated risk factor information. Age is an important risk factor for many diseases such as CVD and the effects of other predictors on the disease risk may vary over age. Current risk prediction models used in clinical practice such as the FRS often incorporate age as an additive risk factor, which may limit the model prediction performance. Our proposed method estimates the age-specific absolute risk directly via a flexible varying-coefficient model that allows the predictor effects to vary over age and allows for easy incorporation of longitudinally collected risk factor information. We also provide procedures for non-parametrically assessing the prediction performance of such age-specific models, extending existing time-specific accuracy parameters to also incorporate the additional age domain.

Unlike the Cox model with time-varying covariates, our proposed model can easily provide age-specific absolute risk estimates without having to specify the full longitudinal marker processes. Compared to the JM approach, our method has the major advantage of allowing for non-linear effects and non-trivial number of time-varying continuous or discrete risk markers. Additionally our kernel-based procedures allows borrowing information across individuals of similar ages therefore provides a practical solution for situations where the longitudinal information is only measured sparsely and irregularly.

Our method allows for internal covariates in that it aims to make prediction for the residual life among event free subjects at age a, although it does require the availability of the marker information at age a for those with T𝔸 > a. When the outcome is a non-terminal event that is subject to death as a competing risk, then one may easily modify the proposed procedures to instead make prediction of the disease risk for those who are still alive and have not yet developed the disease at age a. The age-specific accuracy parameters can also be modified to accommodate competing risks similar to those considered in Blanche and others (2013).

The proposed method employs a working model that requires the specification of g and ψ, both of which could potentially impact the model prediction performance. In general, the prediction performance is less sensitive to the choice of g due to the robustness properties such as the logistic likelihood as noted in Li and Duan (1989) and Eguchi and Copas (2002). One may choose appropriate ψ based on existing literature on the functional form of known risk factors or exploratory analyses. On the other hand, regardless of the choice of ψ or g, the fitted model may be mis-specified yet our proposed method could derive a risk model with good prediction performance. The flexible varying coefficient model is expected to perform well under mis-specification and the proposed inference procedures are always valid regardless of the potential mis-specification in the fitted model. In addition, while the proposed simple IPW method has the advantage of enabling robustness in inference under model mis-specification, it may come at a cost in efficiency loss. If there are auxiliary variables available at baseline, efficiency augmentation methods leveraging such information warrants further research.

Acknowledgments

The Framingham Heart Study and the Framingham SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University. The Framingham SHARe data used for the analyses described in this manuscript were obtained through dbGaP (access number: phs000007.v3.p2). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or the NHLBI.

The work is supported by grants from Natural Sciences and Engineering Research Council of Canada, U01-CA86368, P01-CA053996, R01-GM085047, U54-HG007963, and R01-HL089778 from the National Institutes of Health.

Contributor Information

Qian M. Zhou, Department of Mathematics and Statistics, Mississippi State University, Mississippi State, Mississippi, USA, 39762.

Wei Dai, Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA, 02115.

Yingye Zheng, Department of Biostatistics and Biomathematics, Fred Hutchinson Cancer Research Center, Seattle, WA, USA, 98109.

Tianxi Cai, Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA, 02115.

References

  1. Blanche Paul, Dartigues Jean-François, Jacqmin-Gadda Hélène. Estimating and comparing time-dependent areas under receiver operating characteristic curves for censored event times with competing risks. Statistics in medicine. 2013;32(30):5381–5397. doi: 10.1002/sim.5958. [DOI] [PubMed] [Google Scholar]
  2. Cai Tianxi, Tian Lu, Uno Hajime, Solomon Scott D, Wei LJ. Calibrating parametric subject-specific risk estimation. Biometrika. 2010;97(2):389–404. doi: 10.1093/biomet/asq012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Eguchi Shinto, Copas John. A class of logistic-type discriminant functions. Biometrika. 2002;89(1):1–22. [Google Scholar]
  4. Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, Mulvihill JJ. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute. 1989;81(24):1879. doi: 10.1093/jnci/81.24.1879. [DOI] [PubMed] [Google Scholar]
  5. Gerds Thomas A, Cai Tianxi, Schumacher Martin. The performance of risk prediction models. Biometrical journal. 2008;50(4):457–479. doi: 10.1002/bimj.200810443. [DOI] [PubMed] [Google Scholar]
  6. Li Ker-Chau, Duan Naihua. Regression analysis under link violation. The Annals of Statistics. 1989:1009–1052. [Google Scholar]
  7. Liu Dandan, Zheng Yingye, Prentice Ross L, Hsu Li. Estimating risk with time-to-event data: An application to the women's health initiative. Journal of the American Statistical Association. 2014;109(506):514–524. doi: 10.1080/01621459.2014.881739. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Lloyd-Jones Donald M. Cardiovascular risk prediction basic concepts, current status, and future directions. Circulation. 2010;121(15):1768–1777. doi: 10.1161/CIRCULATIONAHA.109.849166. [DOI] [PubMed] [Google Scholar]
  9. Mosca Lori, Appel Lawrence J, Benjamin Emelia J, Berra Kathy, Chandra-Strobos Nisha, Fabunmi Rosalind P, Grady Deborah, Haan Constance K, Hayes Sharonne N, Judelson Debra R, et al. Evidence-based guidelines for cardiovascular disease prevention in women 1. Journal of the American College of Cardiology. 2004;43(5):900–921. doi: 10.1016/j.jacc.2004.02.001. [DOI] [PubMed] [Google Scholar]
  10. Neumann Michael H, Polzehl Jörg. Simultaneous bootstrap confidence bands in non-parametric regression. Journal of Nonparametric Statistics. 1998;9(4):307–333. [Google Scholar]
  11. Parast Layla, Cheng Su-Chun, Cai Tianxi. Landmark prediction of long-term survival incorporating short-term event time information. Journal of the American Statistical Association. 2012;107(500):1492–1501. doi: 10.1080/01621459.2012.721281. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Park Yuhyun, Wei LJ. Estimating subject-specific survival functions under the accelerated failure time model. Biometrika. 2003;90(3):717–723. [Google Scholar]
  13. Ridker Paul M, Buring Julie E, Rifai Nader, Cook Nancy R. Development and validation of improved algorithms for the assessment of global cardiovascular risk in women: the reynolds risk score. Jama. 2007;297(6):611–619. doi: 10.1001/jama.297.6.611. [DOI] [PubMed] [Google Scholar]
  14. Rizopoulos Dimitris. Jm: An r package for the joint modelling of longitudinal and time-to-event data. Journal of Statistical Software. 2010;35(9):1–33. [Google Scholar]
  15. Tian Lu, Zucker David, Wei LJ. On the cox model with time-varying regression coefficients. Journal of the American statistical Association. 2005;100(469):172–183. [Google Scholar]
  16. Tsiatis A, DeGruttola Victor, Wulfsohn MS. Modeling the relationship of survival to longitudinal data measured with error. applications to survival and cd4 counts in patients with aids. Journal of the American Statistical Association. 1995;90(429):27. [Google Scholar]
  17. Tsiatis Anastasios A, Davidian Marie. Joint modeling of longitudinal and time-to-event data: an overview. Statistica Sinica. 2004;14(3):809–834. [Google Scholar]
  18. Uno Hajime, Cai Tianxi, Tian Lu, Wei LJ. Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association. 2007;102(478) [Google Scholar]
  19. Wang Yan, Taylor Jeremy MG. Jointly modeling longitudinal and event time data with application to acquired immunodeficiency syndrome. Journal of the American Statistical Association. 2001;96(455):895–905. [Google Scholar]
  20. Wolf Philip A, D'Agostino Ralph B, Belanger Albert J, Kannel William B. Probability of stroke: a risk profile from the framingham study. Stroke. 1991;22(3):312–318. doi: 10.1161/01.str.22.3.312. [DOI] [PubMed] [Google Scholar]
  21. Wu Chien-Fu Jeff. Jackknife, bootstrap and other resampling methods in regression analysis. the Annals of Statistics. 1986:1261–1295. [Google Scholar]
  22. Ye Wen, Lin Xihong, Taylor Jeremy MG. Semiparametric modeling of longitudinal measurements and time-to-event data–a two-stage regression calibration approach. Biometrics. 2008;64(4):1238–1246. doi: 10.1111/j.1541-0420.2007.00983.x. [DOI] [PubMed] [Google Scholar]
  23. Zheng Yingye, Cai Tianxi, Feng Ziding. Application of the time-dependent roc curves for prognostic accuracy with multiple biomarkers. Biometrics. 2006;62(1):279–287. doi: 10.1111/j.1541-0420.2005.00441.x. [DOI] [PubMed] [Google Scholar]
  24. Zheng Yingye, Heagerty Patrick J. Semiparametric estimation of time-dependent roc curves for longitudinal marker data. Biostatistics. 2004;5(4):615–632. doi: 10.1093/biostatistics/kxh013. [DOI] [PubMed] [Google Scholar]
  25. Zheng Yingye, Heagerty Patrick J. Partly conditional survival models for longitudinal data. Biometrics. 2005;61(2):379–391. doi: 10.1111/j.1541-0420.2005.00323.x. [DOI] [PubMed] [Google Scholar]

RESOURCES