The ROC curve for regularly measured longitudinal biomarkers

Haben Michael; Lu Tian; Musie Ghebremichael

doi:10.1093/biostatistics/kxy010

Biostatistics. 2018 Mar 28;20(3):433–451.


PMCID: PMC6587928 PMID: 29608649

Summary

The receiver operating characteristic (ROC) curve is a commonly used graphical summary of the discriminative capacity of a thresholded continuous scoring system for a binary outcome. Estimation and inference procedures for the ROC curve are well-studied in the cross-sectional setting. However, there is a paucity of research when both biomarker measurements and disease status are observed longitudinally. In a motivating example, we are interested in characterizing the value of longitudinally measured CD4 counts for predicting the presence or absence of a transient spike in HIV viral load, which is also time-dependent. The existing method neither appropriately characterizes the diagnostic value of observed CD4 counts nor efficiently uses status history in predicting the current spike status. We propose to jointly model the binary status as a Markov chain and the biomarker levels, conditional on the binary status, as an autoregressive process, yielding a dynamic scoring procedure for predicting the occurrence of a spike. Based on the resulting prediction rule, we propose several natural extensions of the ROC curve to the longitudinal setting and describe procedures for statistical inference. Lastly, extensive simulations have been conducted to examine the small-sample operating characteristics of the proposed methods.

Keywords: HIV/AIDS, Longitudinal binary outcomes, Longitudinal biomarker, Predictive value, Receiver operator characteristic (ROC) curve

1. Introduction

The receiver operating characteristic (ROC) curve is a graphical summary of the discriminative capacity of a thresholded continuous scoring system for a binary outcome. The curve consists of pairs of true positive and false positive rates as the threshold is varied (Swets and Pickett, 1982). Although many alternative measures have been proposed (Uno and others, 2007, 2013; Pencina and others, 2008; Steyerberg and others, 2010), the ROC curve remains the most commonly used in many fields. In medical research, ROC curves are often used to characterize the quality of a continuous biomarker as a diagnostic for binary statuses such as “diseased” versus “non-diseased” (Pepe, 2003; Zhou and others, 2009). Well-studied in the cross-sectional setting, the ROC curve has been generalized to settings where the outcome of interest is time-to-event (Heagerty and others, 2000; Zheng and Heagerty, 2004; Heagerty and Zheng, 2005) and where the biomarker is longitudinally measured (Foulkes and others, 2010; Liu and Albert, 2014). However, less research has considered both longitudinal biomarker measurements and binary statuses.

A motivating example is data gathered from the Yale Prospective Longitudinal Pediatric HIV Cohort. The cohort comprises 97 children born to HIV-infected mothers in the New Haven, CT, area since 1985. Various measurements were taken on the participants every 2–3 months over the 10-year period 1996–2006. Among these measurements, we focus on a continuous biomarker, CD4+ lymphocyte count, as a predictor of a binary outcome, “blip” status, the presence or absence of a transient spike in viral load (Paintsil and others, 2008).

For each patient and each visit, we observe a biomarker value and a binary status. The biomarker values may be direct assay measurements, as in the motivating example, or they may be derived or composite quantities. To assess the value of the longitudinal biomarker for predicting the binary status, Liu and Wu (2003) and Liu and others (2005) propose a simple mixed effect regression model (Breslow and Clayton, 1993)

(1.1)

where the link function is the logit function and the two coefficients are, respectively, the subject-specific random intercept and slope. Similar models are described in Foulkes and others (2010) and Albert (2012). The vector of random effects is assumed to follow a Gaussian distribution

The ROC curve summarizing the diagnostic value of the biomarker is then constructed based on pairs

where the subject-specific random effects are replaced by estimates obtained from the observed data, for example,

and the remaining quantities are maximum likelihood estimators for the corresponding population parameters.

While this approach is simple and intuitive, we mention several limitations. First, the parametric assumptions may be too restrictive for some applications. For example, as discussed below, the Yale pediatric HIV data suggest greater dependency among biomarkers and among disease statuses nearer in time. Neither is accounted for by model (1.1), which is symmetric in time. Second, the subject-specific random effect estimates, as defined above, are not available at a given visit, because they depend on biomarker levels and responses that are not yet observed. Third, the approach uses the same data both to fit the model and, by using the fitted biomarkers to construct the ROC curve, to assess the quality of the fit. One expects such an assessment to overestimate the true diagnostic quality of the biomarker (Janes and others, 2009). Efforts to set aside a group of patients for validation after estimation encounter the difficulty that subject effect estimates for the validation patients are unavailable (Foulkes and others, 2010). Lastly and more conceptually, the notion of an ROC curve stands to be refined in the context of longitudinal measurements of multiple patients. In contrast to the cross-sectional setting, several useful ROC curves suggest themselves. For example, the predictive performance of the biomarker for a given patient, as determined by that patient’s history, can be quite different from the predictive performance for the entire patient population. In the next section, we propose a general framework to address these limitations.

2. Methods

We first note two properties desirable in a framework for assessing diagnostic performance in the longitudinal, multiple subject design under consideration. First, to assess the predictive performance of a biomarker, the biomarker should depend only on data that are available when a prediction is to be made. We adopt the vantage of a practitioner who has previous biomarker and status data for a patient, is confronted with a current biomarker for the patient, and must now predict current status. In the HIV example, due to the turnaround time of the tests involved, CD4 count or percentage is normally available before blip status. The patient’s history includes previous blip statuses, and the clinician may need to determine a course of treatment based on as-yet unavailable current blip status as predicted from current CD4 count or percentage. A similar problem is described by Yang and others (2015). There, the “status” is the presence or absence of influenza-like illness in periodic reports issued by the Centers for Disease Control and Prevention (CDC), and the predictor or “biomarker” is real-time internet search data. The CDC’s reports describe outbreaks at a 1–3 week delay. When making real-time predictions with the search data, only CDC outbreak data referencing previous time points are available. In this case, an accurate early warning of an outbreak can be very important for public health.

As predictions may be made at different times, the accuracy of the prediction and the associated ROC curve will depend on time, with the corresponding prediction depending only on patient history available at that time. Second, two types of prediction performance should be differentiated: that for an individual patient and that for a patient population. For the former, we target the performance, as a predictor of the current status, of a continuous score summarizing the predictive information contained in the history of a given patient up to the current visit. For the latter, we are interested in the predictive performance of the score in the entire patient population at a given time, that is, marginalizing across patients.

In the following, we first generalize the simple mixed effect model (1.1) and discuss the two types of predictive performance under the proposed model. As discussed further below, easy extensions lead to more sophisticated models allowing for more flexible prediction rules. We assume that the longitudinal biomarker levels follow an autoregressive process conditional on the disease statuses, which are generated by a Markov chain as in, e.g., Azzalini (1994). Specifically, we assume that for the $i$th patient

(2.1)

where

independently and identically, with the parameters of these distributions serving as hyperparameters. The biomarker value preceding the first visit is set to 0 to initialize the autoregressive process, implying that the baseline biomarker level follows a Gaussian distribution conditional on the blip status. This set of parametric distributions for the random effects is chosen in part for convenience, as they permit the model parameters to be estimated using many standard statistical software packages. Specifically, the parameters of the biomarker process can be estimated by fitting the linear mixed effects model (Laird and Ware, 1982)

and the parameters of the status process can be estimated by fitting the generalized mixed effects model

In addition, the initial status probability can be estimated by the observed proportion of positive statuses across patients at the initial visit, i.e., the baseline. More importantly, under this model, we may link the current status with the observed history at a given visit via a random effects model

(2.2)

where

(2.3)

Thus model (2.1) generalizes model (1.1), insofar as the log-odds of positive disease status is modeled as linear in the subject’s most current biomarker level, although the distribution of the coefficients in this linear combination may differ between the two models. Generalizations of (1.1) that include more biomarkers, e.g., lagged biomarker values, correspond to higher order autoregressive processes in model (2.1).

2.1. Individual patient ROC curve

We would like to evaluate the predictive performance of the biomarker or score (or its history) for patient $i$ at a given visit by contrasting two survival functions of the score, one conditional on positive status and one conditional on negative status, where

and

The score uses the available history, since under model (2.1) the current status is conditionally independent of the remaining history given the terms appearing in (2.2). We may then use the ROC curve or derived statistics such as the area under the ROC curve to summarize this contrast.

As the coefficients of the score depend on the unknown subject-specific random effect, the score is unavailable in practice. An ROC curve based on it can only serve as a theoretical benchmark. We therefore estimate the random effect based on its conditional distribution given the observed history and use a plug-in estimator for the score. For example, we may estimate the random effects and the score by the posterior mean

and

(2.4)

respectively (Robinson, 1991), where the coefficient functions are obtained by replacing all the relevant subject-specific random effects in (2.3) with their estimated counterparts based on the observed history. For example,

Here, the visit subscript is used to emphasize the visit at which the prediction of the subject-specific random effect is made and the information on which it is based. The estimator depends on the subject only through the first argument, i.e., the patient history, and so the subject subscript has been dropped. An explicit expression for this choice of estimator can be found in Appendix A of the supplementary material available at Biostatistics online. Using the estimated score (with the population parameter replaced by an estimate if it is unknown) to predict the disease status at a given visit, the predictive performance of patient $i$’s biomarker at that visit can be summarized by the ROC curve

where

is the subject- and visit-specific survival function of the estimated score. This ROC curve depends on the joint distribution of the random history and the response, and thus also on the subject-specific random effect. Since we do not have a convenient analytic expression for it, we resort to a Monte-Carlo method. Specifically, for the $i$th patient:

1. Simulate the patient history and the current status according to model (2.1) using consistent estimates of the subject-specific random effect and the population parameter.
2. Compute the estimated score according to (2.4).
3. Repeat steps 1–2 a large number of times and calculate the empirical ROC curve of the resulting score–status pairs.
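The Monte-Carlo steps above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the displayed forms of (2.1) and (2.4) are not reproduced here, so the sketch assumes a 2-state Markov chain for the status with stay probabilities `p_stay`, and an autoregressive biomarker of the form X_j = mu[D_j] + rho * X_{j-1} + noise, initialized at 0. The subject-specific parameters are treated as known, and the score is the Bayes posterior probability of positive status given the current biomarker, the previous biomarker, and the previous status. All function and parameter names are hypothetical.

```python
import numpy as np


def simulate_patient(m, p_stay, pi1, mu, rho, sigma, rng):
    """Simulate statuses D and biomarkers X for one patient over m visits.

    Assumed illustrative model: D is a 2-state Markov chain with
    P(D_j = d | D_{j-1} = d) = p_stay[d] and P(D_1 = 1) = pi1;
    X_j = mu[D_j] + rho * X_{j-1} + N(0, sigma^2), with X_0 = 0.
    """
    D = np.empty(m, dtype=int)
    X = np.empty(m)
    x_prev = 0.0
    for j in range(m):
        if j == 0:
            D[j] = int(rng.random() < pi1)
        else:
            stay = rng.random() < p_stay[D[j - 1]]
            D[j] = D[j - 1] if stay else 1 - D[j - 1]
        X[j] = mu[D[j]] + rho * x_prev + sigma * rng.standard_normal()
        x_prev = X[j]
    return D, X


def posterior_score(x_now, x_prev, d_prev, p_stay, mu, rho, sigma):
    """P(D_j = 1 | X_j, X_{j-1}, D_{j-1}) by Bayes' rule under the assumed model."""
    prior1 = p_stay[1] if d_prev == 1 else 1.0 - p_stay[0]
    z1 = (x_now - mu[1] - rho * x_prev) / sigma
    z0 = (x_now - mu[0] - rho * x_prev) / sigma
    num = prior1 * np.exp(-0.5 * z1 ** 2)
    den = num + (1.0 - prior1) * np.exp(-0.5 * z0 ** 2)
    return num / den


def monte_carlo_roc(fpr_grid, m, n_rep, p_stay, pi1, mu, rho, sigma, seed=0):
    """True positive rates at the requested false positive rates for visit m."""
    rng = np.random.default_rng(seed)
    scores = np.empty(n_rep)
    labels = np.empty(n_rep, dtype=int)
    for r in range(n_rep):
        # Step 1: simulate a history and the current status.
        D, X = simulate_patient(m, p_stay, pi1, mu, rho, sigma, rng)
        # Step 2: compute the score from information available at visit m.
        scores[r] = posterior_score(X[-1], X[-2], D[-2], p_stay, mu, rho, sigma)
        labels[r] = D[-1]
    # Step 3: empirical ROC of the simulated score-status pairs.
    cases, controls = scores[labels == 1], scores[labels == 0]
    thresholds = np.quantile(controls, 1.0 - np.asarray(fpr_grid))
    return np.array([(cases > c).mean() for c in thresholds])
```

For example, `monte_carlo_roc([0.1, 0.25, 0.5], m=10, n_rep=4000, p_stay=(0.9, 0.8), pi1=0.3, mu=(0.0, 1.5), rho=0.4, sigma=1.0)` returns one estimated true positive rate per requested false positive rate. In the paper's procedure the score would instead use the estimated random effects from (2.4).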

The resulting empirical ROC curve can serve as an approximation to the subject-specific ROC curve of the $i$th patient at the given visit, provided that this patient’s subject-specific parameters are known or can be estimated up to the desired accuracy. When this assumption is unmet, e.g., when the visit number is small and few observations on the patient of interest are available, we instead propose two alternative summaries of the diagnostic performance of the biomarker at the individual level.

The first is the average individual-specific ROC curve over the patient population,

(2.5)

where the expectation is taken with respect to the random effect. In practice, we may use Monte-Carlo methods, simulating a large number of random effects from their distribution and estimating the expectation by

The resulting curve is not the ROC curve for any individual patient but the expected patient-level ROC curve for a typical patient from the given population. As before, when the population parameter is unknown, we may replace it by a consistent estimator and let

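Since the expected ROC curve is a smooth function of the population parameter, the plug-in estimator is consistent, and its normalized estimation error converges weakly to a mean zero Gaussian process indexed by the false positive rate when the normalized parameter estimator converges weakly to a mean zero Gaussian distribution.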

The second option is the limit

(2.6)

As Inline graphic and converge to , and , respectively, where

are subject-specific state probabilities of the stationary distribution of the 2-state Markov chain. Therefore, provided Inline graphic ,

(2.7)

where $\Phi$ is the cumulative distribution function of the standard normal distribution. Here, we used the conditional normality of the biomarker under model (2.1). When a patient’s diseased and non-diseased biomarker means are the same, the posterior probability of positive event status (2.2) reduces to

the posterior probability of a 2-state Markov chain. Consequently, the ROC curve summarizes the performance of a 2-state Markov chain in predicting the next state in this case. This performance serves as a limiting case: as the difference between the diseased and non-diseased biomarker means becomes small in magnitude, the biomarkers cease to provide useful discrimination, and the patient’s prior status carries all the information about current status.
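As a small numerical companion to the Markov-chain limiting case, the sketch below computes the stationary state probabilities of a 2-state chain and the single ROC operating point of the rule "predict positive if and only if the previous status was positive". It is an illustration only: the names `p00` and `p11` (probabilities of remaining in the negative and positive states) are assumptions, and the paper's expression (2.7) is not reproduced.

```python
import numpy as np


def stationary_probs(p00, p11):
    """Stationary probabilities (pi0, pi1) of a 2-state Markov chain.

    p00 and p11 are the probabilities of staying in state 0 and state 1.
    """
    if not (0.0 < p00 < 1.0 and 0.0 < p11 < 1.0):
        raise ValueError("stay probabilities must lie strictly between 0 and 1")
    leave0, leave1 = 1.0 - p00, 1.0 - p11
    pi1 = leave0 / (leave0 + leave1)
    return np.array([1.0 - pi1, pi1])


def previous_status_roc_point(p00, p11):
    """(FPR, TPR) of predicting the current status by the previous status,
    assuming the chain has reached its stationary distribution."""
    pi0, pi1 = stationary_probs(p00, p11)
    tpr = pi1 * p11 / pi1          # P(D_{j-1} = 1 | D_j = 1)
    fpr = pi1 * (1.0 - p11) / pi0  # P(D_{j-1} = 1 | D_j = 0)
    return fpr, tpr
```

Under stationarity the operating point works out to (1 - p00, p11), so the ROC curve of this binary predictor is piecewise linear through (0, 0), (1 - p00, p11), and (1, 1).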

The limiting curve (2.6) can be viewed as the ROC curve for subject $i$ after adequate follow-up and therefore reflects the ultimate personalized diagnostic value of the biomarker for the $i$th patient with the given subject random effect. It may or may not be similar to the population counterpart described in the next section. It can be estimated by replacing the subject-specific random effect and the population parameter in (2.7) with their estimates. Provided that these estimates are consistent and asymptotically Gaussian, the resulting estimator is consistent and converges to a mean zero Gaussian process. Therefore, the key assumption for estimating the limiting curve in practice is that the number of visits be sufficiently large to allow acceptable estimation of the individual-specific random effect. The resulting estimated ROC curve can then be used to characterize the diagnostic value of the biomarker for an individual patient after sufficient follow-up.

Inference for these individual-level ROC curves can be carried out with the parametric bootstrap. One simulates fresh data using the estimated population parameter from model (2.1) and obtains the estimators for the corresponding ROC curves from the simulated data. The empirical distributions of these bootstrap estimators, based on a large number of simulations, serve as approximations to the sampling distributions of the original estimators. Point-wise confidence intervals (CIs) for the ROC curves can be constructed along these lines.

Remark 2.1

One may be interested in the diagnostic value of the current biomarker measurement at a given visit given the past history. In this case, the ROC curve can be constructed based on the conditional survival function

In contrast to ROC curves based on the score or its estimator, this ROC curve reflects the predictive value of the current biomarker measurement only. It also depends on the random effect, which is unknown at the visit in question. One may also consider its expectation with respect to the random effects, or its limit as the number of visits grows, as an estimable alternative.

Remark 2.2

Both individual-level ROC curves are parametric in nature, in that their summaries of the diagnostic value of the longitudinally measured biomarker are valid only if model (2.1) is correctly specified.

2.2. ROC curve for the patient population

The predictive performance of the biomarker across the entire population may be very different from that for an individual patient. For example, the latter does not take into account biomarker variation between patients, or differences between patients in the prior probabilities of positive status events. Were the data not longitudinal, we might consider the empirical ROC curve of biomarker–status pairs. To take accumulated patient data into account, we instead consider the ROC curve of the patients’ biomarker scores (2.4) at a given visit. The scores synthesize all the predictive information in the past history under model (2.1).

Conditionally on the population parameter, the patient scores are iid, and the empirical ROC curve is a valid metric for the predictive value of the scores regardless of the validity of the model being used to derive them. If model (2.1) is a good approximation to the true relationship between the status and the observed history, one may anticipate good prediction accuracy of the resulting score. A severely misspecified model may give a prediction score with poor performance. In either case, the ROC curve and derived statistics such as the area under the ROC curve remain objective measures for the predictive value of the scoring system.
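For concreteness, an empirical ROC curve and its area can be computed from score–status pairs as in the following minimal sketch. It assumes one score and one binary status per patient at a fixed visit; the function names are illustrative and ties are counted as half-concordant in the area.

```python
import numpy as np


def empirical_roc(scores, status):
    """Return (fpr, tpr) arrays tracing the empirical ROC step function."""
    scores = np.asarray(scores, dtype=float)
    status = np.asarray(status, dtype=int)
    cases, controls = scores[status == 1], scores[status == 0]
    # Sweep thresholds from +inf down through every observed score.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    tpr = np.array([(cases >= c).mean() for c in thresholds])
    fpr = np.array([(controls >= c).mean() for c in thresholds])
    return fpr, tpr


def empirical_auc(scores, status):
    """Concordance statistic: P(case score > control score), ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    status = np.asarray(status, dtype=int)
    cases, controls = scores[status == 1], scores[status == 0]
    diff = cases[:, None] - controls[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())
```

The curve starts at (0, 0) and ends at (1, 1), and the trapezoidal area under the returned points agrees with the concordance statistic.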

Formally, assuming that the parameter estimator converges to a limit in probability, the score converges to

where the limit equals the true population parameter if the model is correctly specified. We are interested in estimating the ROC curve for the predictive value at the $j$th visit

(2.8)

where

We do so by plugging in the empirical survival function:

(2.9)

where

and $I(\cdot)$ is the event indicator function. Similarly, the area under the ROC curve, the concordance statistic, may be estimated as

In Appendix B of supplementary material, available at Biostatistics online, we show that the estimator (2.9) is consistent for the ROC curve (2.8) and that the distribution of its normalized estimation error converges to a mean zero Gaussian process under mild regularity conditions. The variance of the estimator can be approximated by an efficient resampling method. At each iteration, we first generate random weights, one per patient, from the unit exponential distribution and estimate the population parameter under model (2.1) with the $i$th patient’s observations weighted by the corresponding weight. Using the resulting perturbed estimator, let

where

Obtaining in this way a large number of realizations of the perturbed ROC estimator, their empirical variance can be used to approximate that of the original estimator. Similar resampling methods can be used to make inference on the area under the ROC curve at the $j$th visit.
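The exponential-weight resampling idea can be illustrated with the area under the ROC curve. The sketch below is a simplification and an assumption on our part: it perturbs only the empirical concordance statistic with patient-level unit exponential weights and keeps the scores fixed, whereas the procedure described above also re-estimates the model parameter under the same weights. Function names are hypothetical.

```python
import numpy as np


def weighted_auc(scores, status, weights):
    """Concordance statistic with a weight attached to each patient."""
    scores = np.asarray(scores, dtype=float)
    status = np.asarray(status, dtype=int)
    weights = np.asarray(weights, dtype=float)
    case = status == 1
    s1, s0 = scores[case], scores[~case]
    w1, w0 = weights[case], weights[~case]
    # Pairwise concordance indicator, ties counted as 1/2.
    conc = (s1[:, None] > s0[None, :]) + 0.5 * (s1[:, None] == s0[None, :])
    return float((w1[:, None] * w0[None, :] * conc).sum() / (w1.sum() * w0.sum()))


def perturbation_se(scores, status, n_resample=500, seed=0):
    """Standard error of the AUC from unit-exponential weight perturbations."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    draws = [
        weighted_auc(scores, status, rng.exponential(1.0, size=n))
        for _ in range(n_resample)
    ]
    return float(np.std(draws, ddof=1))
```

With all weights equal to one, `weighted_auc` reduces to the usual empirical concordance statistic, which gives a quick sanity check on the weighting.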

The predictive value of the biomarker in the entire population also varies with the visit. With more visits and richer data observed, the predictive ability of the updated scoring system is expected to increase. We may study the trend of predictive value over a range of visits by simultaneously estimating the ROC curves and the areas under them at those visits. It is not difficult to show that

can be approximated by a multivariate mean-zero Gaussian distribution, based on which joint inference for the predictive value at all visits of interest may be conducted.

When the predictive value of the constructed scoring system varies only moderately over a range of visits, i.e., the visit-specific ROC curves are similar, it is tempting to estimate the ROC curve by the average predictive value over these visits. To this end, one may empirically construct an ROC curve as

where

Since it averages observations from multiple visits, the pooled estimator can be substantially more stable than the visit-specific estimators. It is a consistent estimator of

where

a weighted average of the visit-specific survival functions. Statistical inference based on the pooled estimator can be made by resampling methods similar to those previously described.
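The weighted-average structure of pooling can be checked numerically. In this sketch, which assumes hypothetical arrays `scores` and `status` of shape (patients, visits), the pooled empirical survival function of the scores among observations with status `d` equals the average of the visit-specific survival functions weighted by the number of such observations at each visit. The paper's exact weights are not reproduced here.

```python
import numpy as np


def pooled_survival(scores, status, d, c):
    """Fraction of pooled observations with status d whose score exceeds c."""
    mask = status == d
    return float((scores[mask] > c).mean())


def visit_survival(scores, status, d, c):
    """Visit-specific survival values at threshold c and the number of
    observations with status d at each visit."""
    values, counts = [], []
    for j in range(scores.shape[1]):
        mask = status[:, j] == d
        counts.append(int(mask.sum()))
        values.append(float((scores[mask, j] > c).mean()) if mask.any() else 0.0)
    return np.array(values), np.array(counts)
```

Calling `np.average(values, weights=counts)` on the output of `visit_survival` reproduces `pooled_survival` at the same threshold.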

Remark 2.3

Despite some similarities, the expected individual ROC curve (2.5) and the population ROC curve (2.8) are quite different. The former is parametric and interpretable only when model (2.1) is correctly specified, while the latter is non-parametric in nature. The former, ignoring the variability in biomarkers across patients, tends to be lower than the latter.

Remark 2.4

The proposed ROC curves depend on the patient history through the estimated biomarker scores. We may consider other functions of the history given by different statistical models of the response. More generally, one may consider a working regression model

and construct the ROC curve based on

where the working model is a parametric function of the observed history, indexed by a model parameter that is replaced by an appropriate estimator.

2.3. Extension

In model (2.1), we assume that (i) the underlying disease status follows a simple Markov chain, i.e., the distribution of the current status depends only on the status at the previous visit, and (ii) the distribution of the biomarker level at a given visit depends only on the current status and the biomarker level at the previous visit; see Figure 1a, which diagrams the data-generating process described in (2.1). There are several obvious extensions:

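(1) The distribution of the biomarker level also depends on the disease status at the previous visit (Figure 1b).
(2) The distribution of the disease status also depends on the biomarker level at the previous visit (Figure 1c).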

Fig. 1. — Schematic of the data-generation process described by (2.1) and extensions. (a) Model (2.1). (b) Biomarker level also generated from the previous disease status. (c) Disease status also generated from the previous biomarker level.


Adapting model (2.1) to the first setting, where the biomarker value depends not only on the current disease status but also the status at the previous visit, gives:

(2.10)

where the additional coefficient is a subject-specific random effect, independent of the other random effects. Under this model

where

Therefore, besides the terms in (2.2), model (2.10) leads to additional interaction terms contributing to the prediction of the disease status at the $j$th visit.

For the second setting, we may assume that

where

In other words, the transition probability of the underlying disease status depends on the biomarker level at the prior visit. Under this model

where

Therefore, compared with (2.2), there is an additional interaction term contributing to the prediction of the disease status at the $j$th visit. There may be more extensive generalizations of model (2.2), such as the combination of extensions (1) and (2) or higher order Markov chains for the disease status. As mentioned in the previous sections, while the validity of the individualized ROC curve depends on correct model specification, the population-based ROC curves can be constructed for scoring systems developed under different modeling assumptions and used to compare different models in terms of their predictive ability.

3. Example

The goal of highly active antiretroviral therapy in the treatment of HIV is to keep a patient’s CD4 count high and to suppress viral load. CD4 count measures immunosuppression, the risk of opportunistic infections, and the strength of the immune system. Viral load is the amount of HIV in a sample, indicative of, among other things, transmission risk. Although viral load is regarded as a better indicator of disease status, it is also more expensive and time-consuming to measure than CD4 count. According to clinical guidelines, both are tested regularly in a typical treatment regimen and used to guide subsequent treatment.

Even when therapy is effective and viral load is clinically categorized as suppressed, transient spikes in viral load, or “blips,” are observed. The clinical significance of viral blips is not well understood. While some studies have reported that viral blips are of no clinical significance, others have reported an association between viral blips and virologic failure. The identification of the predictors of viral blips and the association between viral blips and CD4+ T-cell changes over time are subjects of ongoing research (see Paintsil and others, 2016 and references therein).

We consider the accuracy of absolute CD4+ T-lymphocyte count as a predictor of blip status among children. We analyzed longitudinal data from HIV-infected children enrolled in the Yale Prospective Longitudinal Cohort study comprising 97 children born to HIV-infected mothers in the greater New Haven, CT, area since 1985. The predictor, CD4 count, measures the number of CD4 cells/mm³ of blood, and the response, blip status, is defined as a viral load equal to or exceeding 50 copies/mL. The median number of visits per patient is 33, with 1st and 3rd quartiles of 15 and 47 visits, respectively. For all of the 3309 visits in the data set, the median time between visits is exactly 90 days, with 1st and 3rd quartiles of 57 and 112 days, respectively, giving approximately evenly spaced visits during the follow-up. The average age at enrollment is 6.7 years (standard deviation: 3.9 years). Figure S1(a) of supplementary material available at Biostatistics online summarizes the dates of visits in the lifetimes of the subjects. Further details on the cohort and definitions used here can be found in Paintsil and others (2008) and the references therein. Sixteen patients with fewer than 10 visits were removed in order to allow for estimation of the individual ROC curve, as discussed in Section 2, leaving 81 subjects.

The choice of how to group longitudinal observations is an important issue in many cohort-based longitudinal data analyses, including ours. At each visit, measurements including CD4 count and blip status are taken, and antiretroviral treatment is administered. Therefore, the visit may serve as a surrogate for the number of treatments administered since baseline. While the specific enrollment time varies, the majority of the enrolled children (average age of 6.7 years) are in the early stages of treatment at baseline, and therefore it may be sensible to align observations according to their visit numbers. When the sample size allows, the analysis can be restricted to a more homogeneous subgroup of children with similar conditions at the baseline, which makes grouping by visits still more interpretable.

A crude indication of the value of CD4 as a predictor of blip status is given in Figure S1(b) of supplementary material available at Biostatistics online, which plots the histograms of CD4, aggregated over patients and visits, conditional on positive and negative blip status. Despite the large overlap, there is a clear location shift between the two distributions. Figure S2 of supplementary material available at Biostatistics online plots the trajectories of CD4, with plotting shape encoding blip status, for a representative sample of subjects. The long sequences of like shapes, even as CD4 fluctuates wildly, suggest previous blip status as a predictor of future blip status, motivating the Markov structure in model (2.1). Finally, Figure S3 of supplementary material available at Biostatistics online is a heatmap of the empirical correlation matrix among CD4 measurements on the first 40 visits, for the 44 patients with 40 or more visits. The entries tend to decrease in magnitude moving away from the diagonal. This correlation structure accords with the weak dependency implied by the autoregressive structure in model (2.1).

We apply the ROC estimation procedure described in Section 2 to the pediatric HIV data in order to assess the value of the past CD4 counts and blip statuses as a predictor of current blip status. The MLE of the autoregressive coefficient (95% CI (0.28, 0.63)) describes the strong autoregressive dependency of CD4, as suggested by the heatmap. Similarly, the strong dependence between previous and current blip status is confirmed by the estimated population transition probabilities of remaining in the negative and positive blip status states, respectively, in successive visits. A 95% CI for the difference is (0.05, 0.12). The difference between the CD4 standardized means conditional on negative versus positive blip status (95% bootstrap CI (0.63, 0.91)) is consistent with the location shift apparent from Figure S1(b) of supplementary material available at Biostatistics online.

The resulting time-indexed ROC curves at the selected visit and their associated 95% CIs are summarized in Figure S4 of supplementary material available at Biostatistics online. The visit was chosen to be consistent with our exclusion of patients with fewer than 10 visits from the analysis. As Figure S1(a) of supplementary material available at Biostatistics online shows, ROC curves at later time points are available if one is willing to exclude patients with insufficient visits. The CIs are constructed by the bootstrap. Despite the noisy data presented in Figures S1(b) and S2 of supplementary material available at Biostatistics online, the risk score taking into account both previous CD4 values and blip status performs reasonably well as a predictor of the current disease status. We also plot the time-asymptotic individual ROC curve for selected patients. Patient no. 14 exhibits a non-smooth curve. The “elbow” arises when a patient’s previous disease status is significantly more predictive of future disease status than the patient’s biomarkers are. In such cases, the ROC curve approximates the discrete behavior of a threshold predictor. As mentioned above, the validity of the individualized ROC curves depends on the correct specification of model (2.1). If we view model (2.1) as a working device used to derive a risk score for predicting the blip status, we may use the visit-specific population ROC curve as well as the visit-averaged population ROC curve to summarize the predictive value of the scoring system over a range of visits. Due to the small sample size and infrequent occurrence of blips, we construct the visit-averaged curve and its 95% CI as shown in Figure S4 of supplementary material available at Biostatistics online. The area under the ROC curve is 0.865, with a bootstrap standard error of 0.008, also indicating good predictive value. The jagged shape of the ROC curve reflects the fact that few of the 81 patient scores lie in the overlap of the case and control distributions.

As a comparison, we also plot in Figure 2 the ROC curve based on fitted-value scores obtained by fitting the simple random effect model (1.1). As expected, the resulting ROC curve is higher than the proposed population ROC curves. However, using fitted values as scores to predict the blip status requires information not available at the visit in question and is therefore not a comparable measure of the predictiveness of the biomarker at that time.

Fig. 2. — The ROC curve when fitted values under a random effects model (1.1) are used as scores, compared with the proposed population ROC curves (pediatric HIV data).

4. Simulation

In this section, we investigate the finite-sample performance of the proposed method. To this end, we generate data sets mimicking the pediatric HIV data. Specifically, biomarker and status histories are simulated under model (2.1) with the population parameter being the maximum likelihood estimator obtained from the HIV data. We use a Monte-Carlo approximation for the underlying true expected individual and population ROC curves. We also calculate the limiting individual ROC curve based on the analytic expression given in (2.7) for selected random effects. Next, we generate 500 data sets, each with numbers of patients and visits chosen to match the HIV example. For each simulated data set, we estimate

the expected individual-specific ROC curve at selected visits and its 95% point-wise CI using the parametric bootstrap method;
the limiting individual-specific ROC curve for selected patients;
the population ROC curve at selected visits and its 95% point-wise CI using the resampling method.

The resulting ROC curve estimates and 95% CIs based on one generated data set are presented in Figure 3. To evaluate the performance of the proposed method, we estimate the empirical bias of the point estimators as well as the coverage level of the 95% CIs at selected false positive rates for both the expected individual and the population ROC curves. For each estimator of interest, we also calculate the empirical average of the estimated standard errors and the empirical standard error. The detailed simulation results for both curves are summarized in Table 1. The empirical biases are reasonably small in magnitude. The average estimated standard errors of all estimators are very close to the empirical standard errors, and the coverage levels of the 95% CIs are consistent with the nominal level, within Monte-Carlo simulation error. In general, as expected, the population ROC curve tends to be higher than the individualized counterpart at the same visit. For estimating the limiting individual ROC curve, we compare the area under the ROC curve (AUC), based on data from an increasing number of visits, with the true limiting AUC value for selected random effects. We focus in particular on whether the estimator converges to the truth as the number of visits increases under this correct model specification. Figure S5 of supplementary material available at Biostatistics online plots the number of visits against the difference between estimated AUCs and the truth for five different realizations of the random effect, showing the expected convergence. The convergence may be too slow for some purposes, requiring data from a large number of visits in order to achieve the required estimation accuracy of the individual random effect.

Fig. 3. — Expected individual ROC (solid) and population ROC (dotted) at visit with 95% bootstrap CI; limiting individual ROC for four patients. The data were generated under model (2.1) with hyperparameters estimated from the pediatric HIV data.

Table 1.

Empirical coverage of the nominal 95% CIs (CVL), bias (BS), average standard error (ASE), and empirical standard error (ESE) of and for false positive rates (FPR) 10%, 25%, 50%, and 75% at visits 5, 15, 25, and 35 (synthetic data using hyperparameters estimated from the pediatric HIV data, patients)

Visit	ROC	FPR 0.10: CVL (BS,ASE,ESE)	FPR 0.25: CVL (BS,ASE,ESE)	FPR 0.50: CVL (BS,ASE,ESE)	FPR 0.75: CVL (BS,ASE,ESE)
5		0.94 (0.01,0.05,0.05)	0.96 (0.00,0.04,0.04)	0.94 (0.01,0.03,0.03)	0.95 (0.00,0.02,0.02)
		0.94 (0.04,0.12,0.12)	0.95 (0.02,0.12,0.11)	0.97 (0.00,0.10,0.10)	0.97 (0.00,0.06,0.06)
15		0.93 (0.01,0.05,0.05)	0.93 (0.01,0.04,0.04)	0.91 (0.02,0.03,0.03)	0.97 (0.00,0.02,0.02)
		0.96 (0.03,0.13,0.12)	0.96 (0.01,0.12,0.11)	0.98 (0.00,0.08,0.08)	0.95 (0.01,0.05,0.04)
25		0.93 (0.00,0.05,0.05)	0.94 (0.01,0.04,0.04)	0.95 (0.01,0.03,0.03)	0.94 (0.01,0.02,0.02)
		0.95 (0.02,0.13,0.12)	0.97 (0.00,0.11,0.11)	0.95 (0.01,0.07,0.07)	0.92 (0.00,0.04,0.04)
35		0.94 (0.01,0.05,0.05)	0.92 (0.01,0.04,0.04)	0.96 (0.00,0.03,0.03)	0.95 (0.00,0.02,0.02)
		0.96 (0.02,0.13,0.12)	0.94 (0.01,0.11,0.11)	0.94 (0.00,0.08,0.07)	0.92 (0.00,0.04,0.04)
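
The four summaries reported in each table cell can be computed from the simulation replicates along the following lines. This is a minimal sketch, not the authors' code: `summarize_replicates` and `toy_replicates` are hypothetical names, the toy estimand is a simple mean rather than a point on an ROC curve, and Wald-type intervals (estimate plus or minus 1.96 standard errors) are assumed for the coverage check.

```python
import math
import random
import statistics


def summarize_replicates(estimates, std_errors, truth, z=1.96):
    """Summarise simulation replicates for one estimand.

    estimates[r] and std_errors[r] are the point estimate and estimated
    standard error from replicate r; truth is the known target value.
    Wald-type intervals are an assumption made for this illustration.
    """
    n_rep = len(estimates)
    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= truth <= est + z * se
    )
    return {
        "CVL": covered / n_rep,                     # empirical CI coverage
        "BS": statistics.fmean(estimates) - truth,  # empirical bias
        "ASE": statistics.fmean(std_errors),        # average estimated SE
        "ESE": statistics.stdev(estimates),         # empirical SE of estimates
    }


def toy_replicates(n_rep=500, n_obs=50, truth=0.8, sd=0.35, seed=1):
    """Toy stand-in for the simulation: each replicate estimates a mean."""
    rng = random.Random(seed)
    estimates, std_errors = [], []
    for _ in range(n_rep):
        sample = [rng.gauss(truth, sd) for _ in range(n_obs)]
        estimates.append(statistics.fmean(sample))
        std_errors.append(statistics.stdev(sample) / math.sqrt(n_obs))
    return estimates, std_errors


if __name__ == "__main__":
    est, se = toy_replicates()
    for name, value in summarize_replicates(est, se, truth=0.8).items():
        print(f"{name}: {value:.3f}")
```

When the standard error estimator is well calibrated, ASE and ESE agree closely, which is the comparison made in the text.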


In the second set of simulations, we examine the performance of the proposal under model misspecification. Specifically, we simulate data under the random effects model (1.1), with model parameters taken to be the maximum likelihood estimators from the HIV data. As discussed above, the diagnostic value represented by the ROC curve based on Inline graphic cannot be achieved in practice but may serve as a benchmark. Since model (2.1) is misspecified, we focus on the population ROC curve only. First, we plot in Figure S6 of supplementary material available at Biostatistics online the true and that based on by setting and As expected, by comparison with the benchmark Inline graphic fails to reflect the predictive value of the observed history due to model misspecification. Next, we repeat the simulation with a sample size and 500 times to examine the finite-sample biases of the point estimators and the coverage levels of the 95% CIs for estimating the true ROC curves. Table 2 confirms our expectation that the inference procedure for Inline graphic remains valid in the presence of model misspecification.

Table 2.

Misspecified model: empirical coverage of the nominal 95% CIs (CVL), bias (BS), average standard error (ASE), and empirical standard error (ESE) of for false positive rates (FPR) 10%, 25%, 50%, and 75% at visits 5, 15, 25, and 35 (random slope-intercept logistic model, patients)

Visit	FPR 0.10: CVL (BS,ASE,ESE)	FPR 0.25: CVL (BS,ASE,ESE)	FPR 0.50: CVL (BS,ASE,ESE)	FPR 0.75: CVL (BS,ASE,ESE)
5	0.97 (0.00,0.14,0.13)	0.97 (0.00,0.12,0.10)	0.96 (0.00,0.09,0.08)	0.92 (0.00,0.06,0.05)
15	0.94 (0.01,0.15,0.16)	0.95 (0.05,0.11,0.10)	0.92 (0.00,0.07,0.06)	0.92 (0.00,0.04,0.03)
25	0.94 (0.01,0.15,0.15)	0.93 (0.03,0.11,0.11)	0.96 (0.03,0.07,0.07)	0.93 (0.02,0.04,0.03)
35	0.93 (0.08,0.14,0.11)	0.97 (0.03,0.11,0.09)	0.95 (0.00,0.06,0.06)	0.95 (0.01,0.03,0.03)


5. Discussion

We have proposed a set of ROC-based metrics and statistical methods for evaluating the predictive value of a biomarker in a longitudinal, multiple-patient design. We emphasize three key principles in extending the ROC curve from the cross-sectional to the longitudinal setting: (i) the score used to construct the ROC curve should take into account all of the observed history; (ii) the score should not require unobserved history; and (iii) the predictive value of the biomarker at the individual and the population levels should be treated differently. These objectives are not met satisfactorily by the mixed effects model (1.1) available in the literature, where (i) a patient’s observations are conditionally independent given the subject effects, and in particular past observations are not taken into account when the current observation is used as a score; (ii) all time points are used to estimate the subject effects, so that the score estimate for a given time point is a function of the disease status it is intended to predict; and (iii) there is no distinction between patient and population ROC curves.

The current approach is based on a simple parametric model. While the parametric assumptions are plausible in light of our data, they were motivated by the HIV data and chosen mainly for convenience in implementation. Model checking is therefore needed before the approach is applied to other data.

In the proposed approach, we assume that the biomarker is measured at a regular time interval, which is true in the HIV example. However, in clinical practice, the measurement time is often irregular and it may not be possible to group the measurements into comparable 1st, 2nd Inline graphic visits. Furthermore, even with regular measurements, grouping measurements by visits may not be interpretable if the baseline does not represent an interpretable “origin,” such as onset of disease or start of treatment. In such cases, the predictiveness of the biomarker needs to be evaluated with respect to the measurement history, including the actual measurement times. Doing so may require complex joint modeling of measurement time, biomarker level and disease status. Further research in this direction is warranted.

Supplementary Material

kxy010_Supplementary_Data


Acknowledgments

This research was partially supported by grants NIH/NAIDS 2P30 AI060354-11 from the National Institutes of Health and R01HL089778-05 from the National Heart, Lung, and Blood Institute. We thank the two referees and associate editor for their constructive comments. The authors also thank the study participants and the principal investigator of the study, Dr. Elijah Paintsil, for sharing the data with us. Conflict of Interest: None declared.

References

Albert, P. S. (2012). A linear mixed model for predicting a binary event from longitudinal data under random effects misspecification. Statistics in Medicine 31, 143–154.
Azzalini, A. (1994). Logistic regression for autocorrelated data with application to repeated measures. Biometrika 81, 767–775.
Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.
Foulkes, A. S., Azzoni, L., Li, X., Johnson, M. A., Smith, C., Mounzer, K. and Montaner, L. J. (2010). Prediction based classification for longitudinal biomarkers. The Annals of Applied Statistics 4, 1476.
Heagerty, P. J. and Zheng, Y. (2005). Survival model predictive accuracy and ROC curves. Biometrics 61, 92–105.
Heagerty, P. J., Lumley, T. and Pepe, M. S. (2000). Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 56, 337–344.
Janes, H., Longton, G. and Pepe, M. (2009). Accommodating covariates in ROC analysis. Stata Journal 9, 17–39.
Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics 38, 963–974.
Liu, D. and Albert, P. S. (2014). Combination of longitudinal biomarkers in predicting binary events. Biostatistics 15, 706–718.
Liu, H. and Wu, T. (2003). Estimating the area under a receiver operating characteristic (ROC) curve for repeated measures design. Journal of Statistical Software 8, 1–18.
Liu, H., Li, G., Cumberland, W. and Wu, T. (2005). Testing statistical significance of the area under a receiving operating characteristics curve for repeated measures design with bootstrapping. Journal of Data Science 3, 257–278.
Paintsil, E., Ghebremichael, M., Romano, S. and Andiman, W. (2008). Absolute CD4+ T-lymphocyte count as a surrogate marker of pediatric HIV disease progression. Pediatric Infectious Disease Journal 7, 629–635.
Paintsil, E., Martin, R., Goldenthal, A., Bhandari, S., Andiman, W. and Ghebremichael, M. (2016). Frequent episodes of detectable viremia in HIV treatment-experienced children is associated with a decline in CD4+ T-cells over time. Journal of AIDS & Clinical Research 7, 565–577.
Pencina, M. J., D’Agostino, R. B. and Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine 27, 157–172.
Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press, USA.
Robinson, G. K. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science 6, 15–32.
Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J. and Kattan, M. W. (2010). Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology (Cambridge, MA) 21, 128.
Swets, J. A. and Pickett, R. M. (1982). Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press Series in Cognition and Perception. New York, NY: Elsevier Science & Technology Books.
Uno, H., Cai, T., Tian, L. and Wei, L.-J. (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102, 527–537.
Uno, H., Tian, L., Cai, T., Kohane, I. S. and Wei, L.-J. (2013). A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data. Statistics in Medicine 32, 2430–2442.
Yang, S., Santillana, M. and Kou, S. C. (2015). Accurate estimation of influenza epidemics using Google search data via ARGO. Proceedings of the National Academy of Sciences 112, 14473–14478.
Zheng, Y. and Heagerty, P. J. (2004). Semiparametric estimation of time-dependent ROC curves for longitudinal marker data. Biostatistics 5, 615–632.
Zhou, X.-H., McClish, D. K. and Obuchowski, N. A. (2009). Statistical Methods in Diagnostic Medicine, Volume 569. New York: John Wiley & Sons.
