Summary
The receiver operating characteristic (ROC) curve is a commonly used graphical summary of the discriminative capacity of a thresholded continuous scoring system for a binary outcome. Estimation and inference procedures for the ROC curve are well-studied in the cross-sectional setting. However, there is a paucity of research when both biomarker measurements and disease status are observed longitudinally. In a motivating example, we are interested in characterizing the value of longitudinally measured CD4 counts for predicting the presence or absence of a transient spike in HIV viral load, which is also time-dependent. The existing method neither appropriately characterizes the diagnostic value of observed CD4 counts nor efficiently uses status history in predicting the current spike status. We propose to jointly model the binary status as a Markov chain and the biomarker levels, conditional on the binary status, as an autoregressive process, yielding a dynamic scoring procedure for predicting the occurrence of a spike. Based on the resulting prediction rule, we propose several natural extensions of the ROC curve to the longitudinal setting and describe procedures for statistical inference. Lastly, extensive simulations have been conducted to examine the small-sample operating characteristics of the proposed methods.
Keywords: HIV/AIDS, Longitudinal binary outcomes, Longitudinal biomarker, Predictive value, Receiver operating characteristic (ROC) curve
1. Introduction
The receiver operating characteristic (ROC) curve is a graphical summary of the discriminative capacity of a thresholded continuous scoring system for a binary outcome. The curve consists of pairs of true positive and false positive rates as the threshold is varied (Swets and Pickett, 1982). Although many alternative measures have been proposed (Uno and others, 2007, 2013; Pencina and others, 2008; Steyerberg and others, 2010), the ROC curve remains the most commonly used in many fields. In medical research, ROC curves are often used to characterize the quality of a continuous biomarker as a diagnostic for binary statuses such as “diseased” versus “non-diseased” (Pepe, 2003; Zhou and others, 2009). Well-studied in the cross-sectional setting, the ROC curve has been generalized to settings where the outcome of interest is time-to-event (Heagerty and others, 2000; Zheng and Heagerty, 2004; Heagerty and Zheng, 2005) and where the biomarker is longitudinally measured (Foulkes and others, 2010; Liu and Albert, 2014). However, less research has considered both longitudinal biomarker measurements and binary statuses.
A motivating example is data gathered from the Yale Prospective Longitudinal Pediatric HIV Cohort. The cohort comprises 97 children born to HIV-infected mothers in the New Haven, CT, area since 1985. Various measurements were taken on the participants every 2–3 months over the 10-year period 1996–2006. Among these measurements, we focus on a continuous biomarker, CD4+ lymphocyte count, as a predictor of a binary outcome, “blip” status, the presence or absence of a transient spike in viral load (Paintsil and others, 2008).
Let $Y_{ij}$ and $D_{ij}$ denote the biomarker value and the binary status of patient $i$ at visit $j$, respectively, $i = 1, \ldots, n$, $j = 0, 1, \ldots, m_i$. The values $Y_{ij}$ may be direct assay measurements, as in the motivating example, or they may be derived or composite quantities. To assess the predictive value of the longitudinal biomarker $Y_{ij}$ for predicting $D_{ij}$, Liu and Wu (2003) and Liu and others (2005) propose a simple mixed effect regression model (Breslow and Clayton, 1993)

$$ g\{P(D_{ij} = 1 \mid Y_{ij}, a_i, b_i)\} = a_i + b_i Y_{ij}, \tag{1.1} $$

where $g(p) = \log\{p/(1-p)\}$ is the logit function and $a_i$ and $b_i$ are, respectively, the subject-specific random intercept and slope. Similar models are described in Foulkes and others (2010) and Albert (2012). The random vector $(a_i, b_i)^\top$ is assumed to follow a Gaussian distribution

$$ (a_i, b_i)^\top \sim N(\mu, \Sigma). $$

The ROC curve summarizing the diagnostic value of $Y_{ij}$ is then constructed based on the pairs

$$ \{(D_{ij},\ \hat{a}_i + \hat{b}_i Y_{ij}):\ i = 1, \ldots, n,\ j = 0, \ldots, m_i\}, $$

where $(\hat{a}_i, \hat{b}_i)$ are estimates of the subject-specific random effects obtained from the observed data, for example, the posterior means

$$ (\hat{a}_i, \hat{b}_i) = E\{(a_i, b_i) \mid Y_{ij}, D_{ij},\ j = 0, \ldots, m_i;\ \hat{\mu}, \hat{\Sigma}\}, $$

and $\hat{\mu}$ and $\hat{\Sigma}$ are maximum likelihood estimators for the corresponding population parameters.
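For concreteness, the empirical ROC computation that underlies this and the later constructions can be sketched in a few lines of Python. The functions below are a generic implementation operating on hypothetical arrays `scores` and `labels` (they are not the authors' software); the AUC is computed by the trapezoidal rule.

```python
import numpy as np

def empirical_roc(scores, labels):
    """Empirical ROC: (false positive rate, true positive rate) pairs obtained
    by thresholding `scores` at every observed value."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1], -np.inf]
    tpr = np.array([np.mean(scores[labels == 1] > c) for c in thresholds])
    fpr = np.array([np.mean(scores[labels == 0] > c) for c in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the empirical ROC curve via the trapezoidal rule."""
    order = np.argsort(fpr)
    f, t = np.asarray(fpr)[order], np.asarray(tpr)[order]
    return np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0)
```

For example, with hypothetical fitted effects `a_hat`, `b_hat` and data arrays `y`, `d`, the curve of the naive approach above would be `empirical_roc(a_hat + b_hat * y, d)`.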
While this approach is simple and intuitive, we mention several limitations. First, the parametric assumptions may be too restrictive for some applications. For example, as discussed below, the Yale pediatric HIV data suggest stronger dependency among biomarker measurements, and among disease statuses, observed closer together in time. Neither form of serial dependence is accounted for by model (1.1), which is symmetric in time. Second, the subject-specific random effect estimates $\hat{a}_i$ and $\hat{b}_i$, as defined above, are not available at visit $j$, because they depend on biomarker levels and responses from later visits that have not yet been observed. Third, the approach uses the same data both to fit the model and, by using the fitted biomarkers to construct the ROC curve, to assess the quality of the fit. One expects such an assessment to overestimate the true diagnostic quality of the biomarker (Janes and others, 2009). Efforts to set aside a group of patients for validation after estimation encounter the difficulty that subject effect estimates for the validation patients are unavailable (Foulkes and others, 2010). Lastly, and more conceptually, the notion of an ROC curve stands to be refined in the context of longitudinal measurements on multiple patients. In contrast to the cross-sectional setting, several useful ROC curves suggest themselves. For example, the predictive performance of the biomarker for a given patient, as determined by that patient's history, can be quite different from the predictive performance for the entire patient population. In the next section, we propose a general framework to address these limitations.
2. Methods
We first note two properties desirable in a framework for assessing diagnostic performance in the longitudinal, multiple subject design under consideration. First, to assess the predictive performance of a biomarker, the score used for prediction should depend only on data that are available when the prediction is to be made. We adopt the vantage of a practitioner who has previous biomarker and status data for a patient, is confronted with a current biomarker for the patient, and must now predict current status. In the HIV example, due to the turnaround time of the tests involved, CD4 count or percentage is normally available before blip status. The patient's history includes previous blip statuses, and the clinician may need to determine a course of treatment based on the as-yet unavailable current blip status as predicted from the current CD4 count or percentage. A similar problem is described by Yang and others (2015). Here, the "status" is the presence or absence of influenza-like illness in periodic reports issued by the Centers for Disease Control and Prevention (CDC), and the predictor or "biomarker" is real-time internet search data. The CDC's reports describe outbreaks at a 1–3 week delay. When making real-time predictions with the search data, only CDC outbreak data referencing previous time points are available. In this case, an accurate early warning of the outbreak can be very important for public health.
As predictions may be made at different times, the accuracy of the prediction and the associated ROC curve will depend on time, with the corresponding prediction depending only on the patient history available at that time. Second, two types of prediction performance should be differentiated: that for an individual patient and that for a patient population. For the former, we target the performance of a score $S_{ij}$ as a predictor of $D_{ij}$, where $S_{ij}$ is a continuous score summarizing the predictive information contained in the history of a given patient $i$ up to visit $j$, $j = 1, 2, \ldots$. For the latter, we are interested in the predictive performance of the scores $S_{1j}, \ldots, S_{nj}$ in the entire patient population at a given visit $j$, that is, marginalizing across patients.
In the following, we first generalize the simple mixed effect model (1.1) and discuss the two types of predictive performance under the proposed model. As discussed further below, easy extensions lead to more sophisticated models allowing for more flexible prediction rules. We assume that the longitudinal biomarker levels $Y_{ij}$ follow an autoregressive process conditional on the disease statuses $D_{ij}$, which are generated by a Markov chain as in, e.g., Azzalini (1994). Specifically, we assume that for the $i$th patient

$$ Y_{ij} = \mu_{i D_{ij}} + e_{ij}, \qquad e_{ij} = \rho\, e_{i,j-1} + \epsilon_{ij}, \qquad P(D_{ij} = 1 \mid D_{i,j-1}, D_{i,j-2}, \ldots) = p_{i1} D_{i,j-1} + p_{i0}(1 - D_{i,j-1}), \tag{2.1} $$

where $\epsilon_{ij} \sim N(0, \sigma^2)$ independently and identically, and the subject-specific random effects $\mathbf{b}_i = (\mu_{i0}, \mu_{i1}, p_{i0}, p_{i1})$ follow parametric distributions indexed by a vector of hyperparameters $\theta$. The autoregressive error $e_{i,-1}$ is set to 0 to initialize the autoregressive process, implying that the baseline biomarker level $Y_{i0}$ follows a Gaussian distribution conditional on the blip status. This set of parametric distributions for the random effects is chosen in part for convenience, as they permit the model parameters to be estimated using many standard statistical software packages. Specifically, the biomarker-related parameters (the status-specific means, the autocorrelation $\rho$, and the residual variance $\sigma^2$) can be estimated by fitting the linear mixed effects model (Laird and Ware, 1982) implied by the first part of (2.1), and the transition-related parameters can be estimated by fitting the generalized mixed effects model

$$ g\{P(D_{ij} = 1 \mid D_{i,j-1}, \mathbf{b}_i)\} = \alpha_{i0} + \alpha_{i1} D_{i,j-1}, $$

where $g$ is the logit link and $(\alpha_{i0}, \alpha_{i1})$ are subject-specific random effects determining $(p_{i0}, p_{i1})$. In addition, $P(D_{i0} = 1)$ can be estimated by the observed proportion across patients at the initial visit, i.e., the baseline. More importantly, under this model, we may link $D_{ij}$ with the observed history at visit $j$, $\mathcal{H}_{ij} = \{Y_{i0}, \ldots, Y_{ij}, D_{i0}, \ldots, D_{i,j-1}\}$, via a random effects model

$$ g\{P(D_{ij} = 1 \mid \mathcal{H}_{ij}, \mathbf{b}_i)\} = \beta_{i0j} + \beta_{i1} Y_{ij}, \tag{2.2} $$

where, by Bayes' rule, the subject- and visit-specific intercept $\beta_{i0j}$ and the subject-specific slope $\beta_{i1}$ are explicit functions of the random effects, $\rho$, $\sigma$, and the previous observation $(Y_{i,j-1}, D_{i,j-1})$:

$$ (\beta_{i0j}, \beta_{i1}) = \beta(Y_{i,j-1}, D_{i,j-1}; \mathbf{b}_i, \theta). \tag{2.3} $$

Thus model (2.1) generalizes model (1.1), insofar as the log-odds of positive disease status is modeled as linear in the subject's most current biomarker level, although the distribution of the coefficients in this linear combination may differ between the two models. Higher order autoregressive processes in model (2.1) correspond to generalizations of (1.1) that include additional lagged biomarkers, e.g., $Y_{i,j-1}$.
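A minimal simulation sketch may help fix ideas. The Python function below generates one patient's trajectory from a data-generating process of the kind described by (2.1): a 2-state Markov chain for the status and a Gaussian AR(1) biomarker fluctuating around a status-specific mean. The parameter names (`mu0`, `mu1`, `p01`, `p11`, `rho`, `sigma`, `pi0`) and all numerical values are illustrative assumptions, not the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_patient(m, mu0, mu1, p01, p11, rho, sigma, pi0):
    """Simulate (Y_j, D_j), j = 0,...,m, from a joint model of the kind
    described by (2.1): D is a 2-state Markov chain with
    p01 = P(D_j=1 | D_{j-1}=0), p11 = P(D_j=1 | D_{j-1}=1), pi0 = P(D_0=1);
    Y is an AR(1) process around the status-specific mean."""
    D = np.empty(m + 1, dtype=int)
    Y = np.empty(m + 1)
    D[0] = rng.random() < pi0
    Y[0] = (mu1 if D[0] else mu0) + sigma * rng.standard_normal()   # baseline: no AR term
    for j in range(1, m + 1):
        D[j] = rng.random() < (p11 if D[j - 1] else p01)
        mean_now = mu1 if D[j] else mu0
        mean_prev = mu1 if D[j - 1] else mu0
        # AR(1) on the deviation from the status-specific mean
        Y[j] = mean_now + rho * (Y[j - 1] - mean_prev) + sigma * rng.standard_normal()
    return Y, D

# one hypothetical trajectory with illustrative parameter values
Y, D = simulate_patient(m=40, mu0=900.0, mu1=700.0, p01=0.2, p11=0.6,
                        rho=0.5, sigma=250.0, pi0=0.3)
```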
2.1. Individual patient ROC curve

We would like to evaluate the predictive performance of the biomarker, or of the score summarizing its history, for patient $i$ at visit $j$ by contrasting two survival functions, $F^{(1)}_{ij}(c)$ and $F^{(0)}_{ij}(c)$, where

$$ F^{(d)}_{ij}(c) = P(S_{ij} > c \mid D_{ij} = d, \mathbf{b}_i), \qquad d = 0, 1, $$

and

$$ S_{ij} = P(D_{ij} = 1 \mid Y_{ij}, Y_{i,j-1}, D_{i,j-1}, \mathbf{b}_i) $$

uses the available history, since under model (2.1), $D_{ij}$ is conditionally independent of the remaining history given $(Y_{ij}, Y_{i,j-1}, D_{i,j-1})$. We may then use the ROC curve

$$ \mbox{ROC}_{ij}(u) = F^{(1)}_{ij}\{F^{(0)-1}_{ij}(u)\}, \qquad u \in (0, 1), $$

or derived statistics such as the area under the ROC curve, $\int_0^1 \mbox{ROC}_{ij}(u)\,\mathrm{d}u$, to summarize this contrast.
As $F^{(1)}_{ij}$ and $F^{(0)}_{ij}$ depend on the unknown subject-specific random effect $\mathbf{b}_i$, the score $S_{ij}$ is unavailable in practice. An ROC curve based on $S_{ij}$ can only serve as a theoretical benchmark. We therefore estimate the random effect $\mathbf{b}_i$ based on its conditional distribution given the observed history $\mathcal{H}_{ij} = \{Y_{i0}, \ldots, Y_{ij}, D_{i0}, \ldots, D_{i,j-1}\}$ and use a plug-in estimator for $S_{ij}$. For example, we may estimate the random effects and the score by the posterior mean

$$ \hat{\mathbf{b}}_{ij} = E(\mathbf{b}_i \mid \mathcal{H}_{ij}; \theta) $$

and

$$ \hat{S}_{ij} = \hat{S}(\mathcal{H}_{ij}; \theta), \tag{2.4} $$

respectively (Robinson, 1991), where the function $\hat{S}(\cdot)$ is obtained by replacing all the relevant subject-specific random effects in (2.3) with their estimated counterparts based on $\mathcal{H}_{ij}$. For example, $\mu_{i1}$ is replaced by $\hat{\mu}_{i1,j} = E(\mu_{i1} \mid \mathcal{H}_{ij}; \theta)$. Here, the subscript $j$ is used to emphasize that the prediction of the subject-specific random effect is made at visit $j$, using only the information available at that visit ($Y_{ij}$ is observed, but $D_{ij}$ is not). The estimator $\hat{S}(\cdot)$ depends on the subject only through its argument, i.e., the patient history, and so the subscript $i$ has been dropped. An explicit expression for this choice of $\hat{S}(\cdot)$ can be found in Appendix A of the supplementary material available at Biostatistics online. Using the estimated score $\hat{S}_{ij}$ (with $\theta$ replaced by a consistent estimator $\hat{\theta}$ if $\theta$ is unknown) to predict the disease status at visit $j$, the predictive performance of patient $i$'s biomarker at visit $j$ can be summarized by the ROC curve

$$ \widetilde{\mbox{ROC}}_{ij}(u) = \tilde{F}^{(1)}_{ij}\{\tilde{F}^{(0)-1}_{ij}(u)\}, $$

where

$$ \tilde{F}^{(d)}_{ij}(c) = P(\hat{S}_{ij} > c \mid D_{ij} = d, \mathbf{b}_i), \qquad d = 0, 1, $$

is the subject- and visit-specific survival function of the estimated score. $\widetilde{\mbox{ROC}}_{ij}$ depends on the joint distribution of the random history $\mathcal{H}_{ij}$ and the response $D_{ij}$, and thus also on the subject-specific random effect $\mathbf{b}_i$. Since we do not have a convenient analytic expression for $\widetilde{\mbox{ROC}}_{ij}$, we resort to a Monte-Carlo method. Specifically, for the $i$th patient:
1. Simulate a history $\mathcal{H}_{ij}$ and response $D_{ij}$ according to model (2.1), using consistent estimates of the subject-specific random effect $\mathbf{b}_i$ and the population parameter $\theta$.
2. Compute $\hat{S}_{ij}$ according to (2.4).
3. Repeat steps 1–2 a large number of times and calculate the empirical ROC curve, denoted $\widehat{\mbox{ROC}}{}^{MC}_{ij}$, of the resulting pairs $(D_{ij}, \hat{S}_{ij})$; a schematic implementation is sketched below.
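The following sketch organizes this Monte-Carlo approximation, reusing `simulate_patient` and `empirical_roc` from the earlier sketches. For simplicity, the score here is the model-based posterior probability evaluated at assumed (illustrative) parameter values rather than the plug-in estimator (2.4).

```python
import numpy as np
from scipy.stats import norm

def score(y_now, y_prev, d_prev, mu0, mu1, p01, p11, rho, sigma):
    """Posterior probability of D_j = 1 given the most recent history,
    computed by Bayes' rule under the illustrative model."""
    prior1 = p11 if d_prev else p01
    mean_prev = mu1 if d_prev else mu0
    f1 = norm.pdf(y_now, loc=mu1 + rho * (y_prev - mean_prev), scale=sigma)
    f0 = norm.pdf(y_now, loc=mu0 + rho * (y_prev - mean_prev), scale=sigma)
    return prior1 * f1 / (prior1 * f1 + (1 - prior1) * f0)

def monte_carlo_individual_roc(j, n_sim, **pars):
    """Steps 1-3: simulate many histories up to visit j for one (hypothetical)
    patient, score visit j, and form the empirical ROC of the (status, score) pairs."""
    d_j, s_j = np.empty(n_sim, dtype=int), np.empty(n_sim)
    for b in range(n_sim):
        Yb, Db = simulate_patient(m=j, **pars)
        d_j[b] = Db[j]
        s_j[b] = score(Yb[j], Yb[j - 1], Db[j - 1], pars["mu0"], pars["mu1"],
                       pars["p01"], pars["p11"], pars["rho"], pars["sigma"])
    return empirical_roc(s_j, d_j)

fpr, tpr = monte_carlo_individual_roc(j=10, n_sim=5000, mu0=900.0, mu1=700.0,
                                      p01=0.2, p11=0.6, rho=0.5, sigma=250.0, pi0=0.3)
```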
$\widehat{\mbox{ROC}}{}^{MC}_{ij}$ can serve as an approximation to the subject-specific ROC curve of the $i$th patient at the $j$th visit, provided that this patient's subject-specific parameters are known or can be estimated to the desired accuracy. When this assumption is unmet, e.g., when the time $j$ is small and few observations on the patient of interest are available, we instead propose two alternative summaries of the diagnostic performance of the biomarker at the individual level.
The first is the average individual-specific ROC curve over the patient population,

$$ \overline{\mbox{ROC}}_j(u) = E\{\widetilde{\mbox{ROC}}_{ij}(u)\}, \tag{2.5} $$

where the expectation is taken with respect to the random effect $\mathbf{b}_i$. In practice, we may use Monte-Carlo methods, simulating a large number $M$ of random effects $\mathbf{b}^{(1)}, \ldots, \mathbf{b}^{(M)}$ from the distribution of the random effect and estimating $\overline{\mbox{ROC}}_j(u)$ by

$$ M^{-1} \sum_{m=1}^{M} \widetilde{\mbox{ROC}}{}^{(m)}_{j}(u), $$

where $\widetilde{\mbox{ROC}}{}^{(m)}_{j}$ denotes the curve computed with random effect $\mathbf{b}^{(m)}$. The resulting $\overline{\mbox{ROC}}_j$ is not the ROC curve for any individual patient but the expected patient-level ROC curve for a typical patient from the given population. As before, when $\theta$ is unknown, we may replace it by a consistent estimator $\hat{\theta}$ and let $\widehat{\overline{\mbox{ROC}}}_j$ denote the resulting estimator. Since $\overline{\mbox{ROC}}_j(u)$ is a smooth function of $\theta$, $\widehat{\overline{\mbox{ROC}}}_j(u)$ is a consistent estimator for $\overline{\mbox{ROC}}_j(u)$, and $\widehat{\overline{\mbox{ROC}}}_j(u) - \overline{\mbox{ROC}}_j(u)$, suitably normalized, converges weakly to a mean zero Gaussian process indexed by $u$ when the normalized $\hat{\theta} - \theta$ converges weakly to a mean zero Gaussian distribution.
The second option is the limit

$$ \mbox{ROC}^{\infty}_{i}(u) = \lim_{j \to \infty} \mbox{ROC}_{ij}(u). \tag{2.6} $$

As $j \to \infty$, the conditional distributions of the previous status, the previous biomarker level, and hence the score converge to their stationary counterparts, where

$$ \pi_{i1} = \frac{p_{i0}}{1 - p_{i1} + p_{i0}}, \qquad \pi_{i0} = 1 - \pi_{i1} $$

are the subject-specific state probabilities of the stationary distribution of the 2-state Markov chain. Therefore, provided $\mu_{i1} \neq \mu_{i0}$, $\mbox{ROC}^{\infty}_{i}(u)$ can be written in closed form (2.7) in terms of $\pi_{i0}$, $\pi_{i1}$, $(\mu_{i1} - \mu_{i0})/\sigma$, and $\Phi$, the cumulative distribution function of the standard normal. Here, we used the fact that under model (2.1), $Y_{ij}$ given $(D_{ij}, Y_{i,j-1}, D_{i,j-1})$ is normally distributed with mean $\mu_{iD_{ij}} + \rho\,(Y_{i,j-1} - \mu_{iD_{i,j-1}})$ and variance $\sigma^2$. When $\mu_{i1} = \mu_{i0}$, i.e., a patient's diseased and non-diseased biomarker means are the same, the posterior probability of positive event status (2.2) reduces to

$$ P(D_{ij} = 1 \mid \mathcal{H}_{ij}, \mathbf{b}_i) = p_{i1} D_{i,j-1} + p_{i0}(1 - D_{i,j-1}), $$

the posterior probability of a 2-state Markov chain. Consequently, the ROC curve summarizes the performance of a 2-state Markov chain in predicting the next state in this case. This performance serves as a limiting case: as $\mu_{i1} - \mu_{i0}$ becomes small in magnitude, the biomarkers cease to provide useful discrimination, and the patient's prior status carries all the information about the current status. $\mbox{ROC}^{\infty}_{i}$ can be viewed as the ROC curve for subject $i$ after adequate follow-up and therefore reflects the ultimate personalized diagnostic value of the biomarker for the $i$th patient with subject random effect $\mathbf{b}_i$. It may or may not be similar to the population counterpart described in the next section.
$\mbox{ROC}^{\infty}_{i}$ can be estimated by $\widehat{\mbox{ROC}}{}^{\infty}_{i}$, which is the same as $\mbox{ROC}^{\infty}_{i}$ with $\mathbf{b}_i$ and $\theta$ replaced by $\hat{\mathbf{b}}_i$ and $\hat{\theta}$, respectively. Assuming that the number of visits $m_i \to \infty$ and the number of patients $n \to \infty$, $\widehat{\mbox{ROC}}{}^{\infty}_{i}$ is consistent, and its suitably normalized difference from $\mbox{ROC}^{\infty}_{i}$ converges to a mean zero Gaussian process. Therefore, the key assumption for estimating $\mbox{ROC}^{\infty}_{i}$ in practice is that $m_i$ be sufficiently large to allow acceptable estimation of the individual-specific random effect. The resulting estimated ROC curve can then be used to characterize the diagnostic value of the biomarker for an individual patient after sufficient follow-up.
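Under the same illustrative model, the limiting individual ROC can also be approximated crudely by scoring the visits of one long simulated trajectory after a burn-in, so that the previous state and biomarker are effectively drawn from their stationary distribution. The sketch below (reusing `simulate_patient`, `score`, and `empirical_roc` from the earlier sketches) is only a Monte-Carlo stand-in for the closed-form expression (2.7); consecutive visits are serially dependent, but the empirical curve still converges by ergodic averaging.

```python
import numpy as np

def limiting_individual_roc(n_visits, burn_in, **pars):
    """Approximate ROC_i^infinity for one set of subject-specific parameters by
    scoring the visits of a single long simulated trajectory after a burn-in."""
    Yl, Dl = simulate_patient(m=n_visits, **pars)
    s = np.array([score(Yl[j], Yl[j - 1], Dl[j - 1], pars["mu0"], pars["mu1"],
                        pars["p01"], pars["p11"], pars["rho"], pars["sigma"])
                  for j in range(burn_in, n_visits + 1)])
    d = Dl[burn_in:n_visits + 1]
    return empirical_roc(s, d)

fpr_inf, tpr_inf = limiting_individual_roc(n_visits=20000, burn_in=100,
                                           mu0=900.0, mu1=700.0, p01=0.2,
                                           p11=0.6, rho=0.5, sigma=250.0, pi0=0.3)
```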
Inference for $\overline{\mbox{ROC}}_j$ and $\mbox{ROC}^{\infty}_{i}$ can be carried out with the parametric bootstrap. One simulates fresh data using the estimated population parameter $\hat{\theta}$ from model (2.1) and obtains $\widehat{\overline{\mbox{ROC}}}{}^{*}_j$ and $\widehat{\mbox{ROC}}{}^{\infty*}_{i}$, the estimators for the corresponding ROC curves, from the simulated data. The empirical distributions of $\widehat{\overline{\mbox{ROC}}}{}^{*}_j$ and $\widehat{\mbox{ROC}}{}^{\infty*}_{i}$ based on a large number of such simulations serve as approximations to the distributions of $\widehat{\overline{\mbox{ROC}}}_j$ and $\widehat{\mbox{ROC}}{}^{\infty}_{i}$, respectively. Point-wise confidence intervals (CIs) of $\overline{\mbox{ROC}}_j(u)$ and $\mbox{ROC}^{\infty}_{i}(u)$ can be constructed along these lines.
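The parametric bootstrap can be organized as in the following sketch, in which `simulate_dataset` and `estimate_roc_curve` are hypothetical placeholders for the user's data-generation and ROC-estimation routines.

```python
import numpy as np

def parametric_bootstrap_ci(theta_hat, n_patients, m_visits, simulate_dataset,
                            estimate_roc_curve, fpr_grid, n_boot=500, level=0.95):
    """Point-wise parametric-bootstrap CI for an estimated ROC curve: regenerate
    data from the fitted model, re-estimate the curve on a grid of false positive
    rates, and take empirical quantiles of the bootstrap curves."""
    curves = np.empty((n_boot, len(fpr_grid)))
    for b in range(n_boot):
        data_b = simulate_dataset(theta_hat, n_patients, m_visits)  # fresh data from theta_hat
        curves[b] = estimate_roc_curve(data_b, fpr_grid)            # re-estimated ROC on fpr_grid
    alpha = 1.0 - level
    lower = np.quantile(curves, alpha / 2, axis=0)
    upper = np.quantile(curves, 1 - alpha / 2, axis=0)
    return lower, upper
```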
Remark 2.1

One may be interested in the diagnostic value of $Y_{ij}$ alone at the $j$th visit, given the past history $\mathcal{H}_{i,j-1}$. In this case, the ROC curve can be constructed based on the conditional survival functions $P(Y_{ij} > c \mid D_{ij} = d, \mathcal{H}_{i,j-1}, \mathbf{b}_i)$, $d = 0, 1$. In contrast to ROC curves based on $S_{ij}$ or its estimator, this ROC curve reflects the predictive value of $Y_{ij}$ only. It also depends on the random effect $\mathbf{b}_i$, unknown at visit $j$. One may also consider its expectation with respect to the random effects, or its limit as $j \to \infty$, as an estimable alternative.
Remark 2.2

Both $\overline{\mbox{ROC}}_j$ and $\mbox{ROC}^{\infty}_{i}$ are parametric in nature, in that their summarization of the diagnostic value of the longitudinally measured biomarker is valid only if model (2.1) is correctly specified.
2.2. ROC curve for the patient population

The predictive performance of the biomarker across the entire population may be very different from that for an individual patient. For example, the latter does not take into account biomarker variation between patients, or differences between patients in the prior probabilities of positive status events. Were the data not longitudinal, we might consider the empirical ROC curve of the biomarker–status pairs $\{(Y_{ij}, D_{ij}): i = 1, \ldots, n\}$. To take accumulated patient data into account, we instead consider the ROC curve of $\hat{S}_{1j}, \ldots, \hat{S}_{nj}$, the patients' scores (2.4) at a given visit $j$. The scores synthesize all the predictive information in the past history under model (2.1).

Conditionally on the estimated population parameter $\hat{\theta}$, the patient scores are independent and identically distributed, and the empirical ROC curve is a valid metric for the predictive value of the scores regardless of the validity of the model being used to derive them. If model (2.1) is a good approximation to the true relationship between $Y_{ij}$ and $D_{ij}$, one may anticipate good prediction accuracy of the resulting score. A severely misspecified model may give a prediction score with poor performance. In either case, the ROC curve and derived statistics such as the area under the ROC curve remain objective measures of the predictive value of the scoring system.
Formally, assuming that $\hat{\theta} \to \theta^{*}$ in probability, the score $\hat{S}_{ij}$ converges to

$$ S^{*}_{ij} = \hat{S}(\mathcal{H}_{ij}; \theta^{*}), $$

where $\theta^{*} = \theta$ if the model is correctly specified. We are interested in estimating the ROC curve for the predictive value at the $j$th visit,

$$ \mbox{ROC}^{p}_{j}(u) = F^{(1)}_{j}\{F^{(0)-1}_{j}(u)\}, \tag{2.8} $$

where

$$ F^{(d)}_{j}(c) = P(S^{*}_{ij} > c \mid D_{ij} = d), \qquad d = 0, 1. $$

We do so by plugging in the empirical survival function:

$$ \widehat{\mbox{ROC}}{}^{p}_{j}(u) = \hat{F}^{(1)}_{j}\{\hat{F}^{(0)-1}_{j}(u)\}, \tag{2.9} $$

where

$$ \hat{F}^{(d)}_{j}(c) = \frac{\sum_{i=1}^{n} I(\hat{S}_{ij} > c,\ D_{ij} = d)}{\sum_{i=1}^{n} I(D_{ij} = d)}, \qquad d = 0, 1, $$

and $I(\cdot)$ is the event indicator function. Similarly, the area under the ROC curve, the concordance statistic, may be estimated as

$$ \widehat{\mbox{AUC}}{}^{p}_{j} = \frac{\sum_{i \neq i'} I(\hat{S}_{ij} > \hat{S}_{i'j})\, I(D_{ij} = 1, D_{i'j} = 0)}{\sum_{i \neq i'} I(D_{ij} = 1, D_{i'j} = 0)}. $$
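Assuming arrays `s_j` of estimated scores and `d_j` of observed statuses at visit $j$, the plug-in ROC and the concordance-type AUC just described can be computed directly, as in this sketch.

```python
import numpy as np

def roc_p_hat(s_j, d_j, u_grid):
    """Plug-in population ROC at one visit: for each false positive rate u,
    threshold at the (1-u) quantile of the controls' scores and report the
    fraction of cases exceeding that threshold."""
    s_j, d_j = np.asarray(s_j, float), np.asarray(d_j, int)
    cases, controls = s_j[d_j == 1], s_j[d_j == 0]
    thresholds = np.quantile(controls, 1.0 - np.asarray(u_grid))  # empirical inverse survival
    return np.array([np.mean(cases > c) for c in thresholds])

def concordance_auc(s_j, d_j):
    """Concordance (AUC) estimate: proportion of case-control pairs in which
    the case has the larger score (ties counted as 1/2)."""
    cases = np.asarray(s_j)[np.asarray(d_j) == 1]
    controls = np.asarray(s_j)[np.asarray(d_j) == 0]
    diff = cases[:, None] - controls[None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))
```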
In Appendix B of the supplementary material, available at Biostatistics online, we show that $\widehat{\mbox{ROC}}{}^{p}_{j}(u)$ is a consistent estimator for $\mbox{ROC}^{p}_{j}(u)$ and that the distribution of its suitably normalized difference from $\mbox{ROC}^{p}_{j}(u)$ converges to a mean zero Gaussian process under mild regularity conditions. The variance of $\widehat{\mbox{ROC}}{}^{p}_{j}(u)$ can be approximated by an efficient resampling method. At each iteration, we first generate random weights $w_1, \ldots, w_n$ from the unit exponential distribution and estimate $\theta$ under model (2.1) with the $i$th patient's contribution weighted by $w_i$. Denote the resulting estimator by $\hat{\theta}^{w}$ and let

$$ \widehat{\mbox{ROC}}{}^{p,w}_{j}(u) = \hat{F}^{(1)w}_{j}\{\hat{F}^{(0)w-1}_{j}(u)\}, $$

where

$$ \hat{F}^{(d)w}_{j}(c) = \frac{\sum_{i=1}^{n} w_i\, I(\hat{S}^{w}_{ij} > c,\ D_{ij} = d)}{\sum_{i=1}^{n} w_i\, I(D_{ij} = d)}, \qquad d = 0, 1, $$

and $\hat{S}^{w}_{ij}$ is the score computed with $\hat{\theta}^{w}$. Obtaining in this way a large number of realizations of $\widehat{\mbox{ROC}}{}^{p,w}_{j}(u)$, their empirical variance can be used to approximate that of $\widehat{\mbox{ROC}}{}^{p}_{j}(u)$. Similar resampling methods can be used to make inference on the area under the ROC curve at the $j$th visit.
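The perturbation-resampling variance approximation can be organized as below; `fit_weighted` and `compute_scores` are hypothetical placeholders for the weighted re-estimation of $\theta$ and the re-computation of the scores, and `weighted_roc` is the weighted analogue of the plug-in estimator.

```python
import numpy as np

def weighted_roc(s_j, d_j, w, u_grid):
    """Weighted analogue of the plug-in ROC, with one weight per patient."""
    s_j, d_j, w = map(np.asarray, (s_j, d_j, w))
    cases, wc = s_j[d_j == 1], w[d_j == 1]
    controls, w0 = s_j[d_j == 0], w[d_j == 0]
    order = np.argsort(controls)
    cdf = np.cumsum(w0[order]) / w0.sum()
    out = np.empty(len(u_grid))
    for k, u in enumerate(u_grid):
        idx = min(np.searchsorted(cdf, 1.0 - u), len(cdf) - 1)  # weighted (1-u) quantile
        c = controls[order][idx]
        out[k] = np.sum(wc * (cases > c)) / wc.sum()
    return out

def perturbation_variance(data, u_grid, fit_weighted, compute_scores, visit, n_rep=500):
    """Refit with iid unit-exponential patient weights, recompute scores, rebuild a
    weighted ROC; the empirical variance of the replicates approximates the
    sampling variance of the ROC estimate."""
    rng = np.random.default_rng(1)
    n = len(data)                                # data: list of per-patient records (hypothetical)
    curves = np.empty((n_rep, len(u_grid)))
    for r in range(n_rep):
        w = rng.exponential(1.0, size=n)         # unit exponential weights
        theta_w = fit_weighted(data, w)          # weighted re-estimation of theta
        s_j, d_j = compute_scores(data, theta_w, visit)
        curves[r] = weighted_roc(s_j, d_j, w, u_grid)
    return curves.var(axis=0)
```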
The predictive value of the biomarker in the entire population also varies with the visit $j$. With more visits and richer data observed, the predictive ability of the updated scoring system is expected to increase. We may study the trend of the predictive value from visit $j_1$ to $j_2$ by simultaneously estimating $\mbox{ROC}^{p}_{j_1}, \ldots, \mbox{ROC}^{p}_{j_2}$. It is not difficult to show that the joint distribution of

$$ \left\{\widehat{\mbox{ROC}}{}^{p}_{j}(u) - \mbox{ROC}^{p}_{j}(u):\ j = j_1, \ldots, j_2\right\}, $$

suitably normalized, can be approximated by a multivariate mean-zero Gaussian distribution, based on which joint inference for the predictive value at all visits of interest may be conducted.

When the predictive value of the constructed scoring system varies only moderately from visit $j_1$ to $j_2$, i.e., $\mbox{ROC}^{p}_{j_1}, \ldots, \mbox{ROC}^{p}_{j_2}$ are similar, it is tempting to estimate the ROC curve by the average predictive value between these two visits. To this end, one may empirically construct a ROC curve as

$$ \widehat{\mbox{ROC}}{}^{p}_{j_1, j_2}(u) = \hat{F}^{(1)}_{j_1, j_2}\{\hat{F}^{(0)-1}_{j_1, j_2}(u)\}, $$

where

$$ \hat{F}^{(d)}_{j_1, j_2}(c) = \frac{\sum_{i=1}^{n} \sum_{j=j_1}^{j_2} I(\hat{S}_{ij} > c,\ D_{ij} = d)}{\sum_{i=1}^{n} \sum_{j=j_1}^{j_2} I(D_{ij} = d)}, \qquad d = 0, 1. $$

Since it averages observations from multiple visits, $\widehat{\mbox{ROC}}{}^{p}_{j_1, j_2}$ can be substantially more stable than $\widehat{\mbox{ROC}}{}^{p}_{j}$. It is a consistent estimator of

$$ \mbox{ROC}^{p}_{j_1, j_2}(u) = F^{(1)}_{j_1, j_2}\{F^{(0)-1}_{j_1, j_2}(u)\}, $$

where $F^{(d)}_{j_1, j_2}(c)$ is a weighted average of $F^{(d)}_{j_1}(c), \ldots, F^{(d)}_{j_2}(c)$. Statistical inference based on $\widehat{\mbox{ROC}}{}^{p}_{j_1, j_2}$ can be made by resampling methods similar to those previously described.
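Computationally, the pooled estimator simply stacks the per-visit (score, status) pairs before applying the single-visit plug-in estimator, as in the brief sketch below (reusing `roc_p_hat` from the earlier sketch; the per-visit input arrays are assumed).

```python
import numpy as np

def pooled_roc(scores_by_visit, status_by_visit, u_grid):
    """Pooled ROC over visits j1,...,j2: concatenate the (score, status) pairs
    from the chosen visits and reuse the single-visit plug-in estimator.
    `scores_by_visit` / `status_by_visit` are lists of per-visit arrays (assumed)."""
    s = np.concatenate(scores_by_visit)
    d = np.concatenate(status_by_visit)
    return roc_p_hat(s, d, u_grid)
```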
Remark 2.3

Despite some similarities, $\overline{\mbox{ROC}}_j$ and $\mbox{ROC}^{p}_{j}$ are quite different. The former is parametric and interpretable only when model (2.1) is correctly specified, while the latter is non-parametric in nature. The former, ignoring the variability in biomarkers across patients, tends to be smaller than the latter.
Remark 2.4

The proposed ROC curves depend on the patient history $\mathcal{H}_{ij}$ through the estimated scores $\hat{S}_{ij}$. We may consider other functions of $\mathcal{H}_{ij}$ given by different statistical models of the response. More generally, one may consider a working regression model $g\{P(D_{ij} = 1 \mid \mathcal{H}_{ij})\} = h(\mathcal{H}_{ij}; \gamma)$ and construct the ROC curve based on $h(\mathcal{H}_{ij}; \hat{\gamma})$, where $h(\cdot; \gamma)$ is a parametric function of the observed history and $\gamma$ and $\hat{\gamma}$ are the model parameter and its appropriate estimator, respectively.
2.3. Extension

In model (2.1), we assume that (i) the underlying disease status follows a simple Markov chain, i.e., the distribution of $D_{ij}$ only depends on $D_{i,j-1}$, and (ii) the distribution of the biomarker level at visit $j$, $Y_{ij}$, only depends on the current status $D_{ij}$ and the previous autoregressive deviation $e_{i,j-1}$; see Figure 1a, which diagrams the probability generating process described in (2.1). There are several obvious extensions, including the following two: (1) the biomarker level may depend not only on the current disease status but also on the status at the previous visit; and (2) the transition probability of the disease status may depend on the biomarker level at the previous visit.

Fig. 1.

Schematic of the data-generation process described by (2.1) and extensions. (a) Model (2.1). (b) $Y_{ij}$ generated from the current and previous statuses as well as the previous biomarker level. (c) $D_{ij}$ generated from the previous status and the previous biomarker level.
Adapting model (2.1) to the first setting, where the biomarker value depends not only on the current disease status but also on the status at the previous visit, gives

$$ Y_{ij} = \mu_{i D_{ij} D_{i,j-1}} + e_{ij}, \qquad e_{ij} = \rho\, e_{i,j-1} + \epsilon_{ij}, \tag{2.10} $$

where the status-pair-specific means $\mu_{idd'}$, $d, d' \in \{0, 1\}$, are the subject-specific random effects and $\epsilon_{ij}$ is independent $N(0, \sigma^2)$. Under this model, the posterior probability $P(D_{ij} = 1 \mid \mathcal{H}_{ij}, \mathbf{b}_i)$ retains the logistic-linear form of (2.2), with intercept and slope again determined by Bayes' rule. Therefore, besides the terms in (2.2), model (2.10) leads to additional interaction terms between $D_{i,j-1}$ and the biomarker levels, contributing to the prediction of the disease status at the $j$th visit, $D_{ij}$.
For the second setting, we may assume that

$$ g\{P(D_{ij} = 1 \mid D_{i,j-1}, Y_{i,j-1}, \mathbf{b}_i)\} = \alpha_{i0} + \alpha_{i1} D_{i,j-1} + \alpha_{i2} Y_{i,j-1}, $$

where $(\alpha_{i0}, \alpha_{i1}, \alpha_{i2})$ are subject-specific random effects. In other words, the transition probability of the underlying disease status depends on the biomarker level at the prior visit. Under this model, the posterior probability again takes the form of (2.2), but, compared with (2.2), there is an additional term involving $Y_{i,j-1}$ contributing to the prediction of the disease status at the $j$th visit, $D_{ij}$.
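As an illustration, the simulator sketched earlier can be modified to accommodate this second extension by letting the transition probability depend on the previous biomarker level through a logistic link; the coefficients `alpha0`, `alpha1`, `alpha2` and all numerical values are illustrative assumptions.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(2)

def simulate_patient_ext2(m, mu0, mu1, alpha0, alpha1, alpha2, rho, sigma, pi0):
    """Variant in which logit P(D_j=1 | D_{j-1}, Y_{j-1}) =
    alpha0 + alpha1*D_{j-1} + alpha2*Y_{j-1}, while the biomarker keeps the
    AR(1) structure around a status-specific mean."""
    D = np.empty(m + 1, dtype=int)
    Y = np.empty(m + 1)
    D[0] = rng.random() < pi0
    Y[0] = (mu1 if D[0] else mu0) + sigma * rng.standard_normal()
    for j in range(1, m + 1):
        p1 = expit(alpha0 + alpha1 * D[j - 1] + alpha2 * Y[j - 1])
        D[j] = rng.random() < p1
        mean_now, mean_prev = (mu1 if D[j] else mu0), (mu1 if D[j - 1] else mu0)
        Y[j] = mean_now + rho * (Y[j - 1] - mean_prev) + sigma * rng.standard_normal()
    return Y, D

Y2, D2 = simulate_patient_ext2(m=40, mu0=900.0, mu1=700.0, alpha0=0.5, alpha1=1.5,
                               alpha2=-0.002, rho=0.5, sigma=250.0, pi0=0.3)
```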
There may be more extensive generalizations of model (2.1), such as the combination of extensions (1) and (2), or higher order Markov chains for $D_{ij}$. As mentioned in the previous sections, while the validity of the individualized ROC curve depends on correct model specification, the population-based ROC curves $\mbox{ROC}^{p}_{j}$ and $\mbox{ROC}^{p}_{j_1, j_2}$ can be constructed for scoring systems developed under different modeling assumptions and used to compare different models in terms of their predictive ability.
3. Example
The goal of highly active antiretroviral therapy in the treatment of HIV is to keep a patient’s CD4 count high and to suppress viral load. CD4 count measures immunosuppression, the risk of opportunistic infections, and the strength of the immune system. Viral load is the amount of HIV in a sample, indicative of, among other things, transmission risk. Although viral load is regarded as a better indicator of disease status, it is also more expensive and time-consuming to measure than CD4 count. According to clinical guidelines, both are tested regularly in a typical treatment regimen and used to guide subsequent treatment.
Even when therapy is effective and viral load is clinically categorized as suppressed, transient spikes in viral load, or "blips," are observed. The clinical significance of viral blips is not well understood. While some studies have reported that viral blips are of no clinical significance, others have reported an association between viral blips and virologic failure. The identification of the predictors of viral blips and the association between viral blips and CD4+ T-cell changes over time are subjects of ongoing research (see Paintsil and others, 2016, and references therein).
We consider the accuracy of the absolute CD4+ T-lymphocyte count as a predictor of blip status among children. We analyzed longitudinal data from HIV-infected children enrolled in the Yale Prospective Longitudinal Cohort study, comprising 97 children born to HIV-infected mothers in the greater New Haven, CT, area since 1985. The predictor, CD4 count, measures the number of CD4 cells/mm³ of blood, and the response, blip status, is defined as a viral load equal to or exceeding 50 copies/mL. The median number of visits per patient is 33, with 1st and 3rd quartiles of 15 and 47 visits, respectively. Across all 3309 visits in the data set, the median time between visits is exactly 90 days, with 1st and 3rd quartiles of 57 and 112 days, respectively, giving approximately evenly spaced visits during follow-up. The average age at enrollment is 6.7 years (standard deviation: 3.9 years). Figure S1(a) of the supplementary material available at Biostatistics online summarizes the dates of visits in the lifetimes of the subjects. Further details on the cohort and the definitions used here can be found in Paintsil and others (2008) and the references therein. Sixteen patients with fewer than 10 visits were removed in order to allow for estimation of the individual ROC curve $\mbox{ROC}^{\infty}_{i}$, as discussed in Section 2, leaving 81 subjects for analysis.
The choice of how to group longitudinal observations is an important issue in many cohort-based longitudinal data analyses, including ours. At each visit, measurements including CD4 count and blip status are taken, and antiretroviral treatment is administered. Therefore, the visit number may serve as a surrogate for the number of treatments administered since baseline. While the specific enrollment time varies, the majority of the enrolled children (average age of 6.7 years) are in the early stages of treatment at baseline, and therefore it may be sensible to align observations according to their visit numbers. When the sample size allows, the analysis can be restricted to a more homogeneous subgroup of children with similar conditions at baseline, which makes grouping by visits still more interpretable.
A crude indication of the value of CD4 as a predictor of blip status is given in Figure S1(b) of the supplementary material available at Biostatistics online, which plots the histograms of CD4, aggregated over patients and visits, conditional on positive and negative blip status. Despite the large overlap, there is a clear location shift between the two distributions. Figure S2 of the supplementary material available at Biostatistics online plots the trajectories of CD4, with plotting shape encoding blip status, for a representative sample of subjects. The long sequences of like shapes, even as CD4 fluctuates wildly, suggest previous blip status as a predictor of future blip status, motivating the Markov structure in model (2.1). Finally, Figure S3 of the supplementary material available at Biostatistics online is a heatmap of the empirical correlation matrix among CD4 measurements on the first 40 visits, for the 44 patients with 40 or more visits. The entries tend to decrease in magnitude moving away from the diagonal. This correlation structure accords with the decaying serial dependence implied by the autoregressive structure in model (2.1).
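The empirical correlation matrix underlying such a heatmap can be computed directly from a patients-by-visits matrix of CD4 counts; the variable name `cd4_wide` below is hypothetical.

```python
import numpy as np

def visit_correlation_matrix(cd4_wide):
    """Empirical correlation matrix among biomarker measurements across visits,
    computed from an (n_patients x n_visits) array aligned by visit number;
    under serial dependence the entries typically decay away from the diagonal."""
    return np.corrcoef(cd4_wide, rowvar=False)
```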
We apply the ROC estimation procedure described in Section 2 to the pediatric HIV data in order to assess the value of past CD4 counts and blip statuses as predictors of current blip status. The maximum likelihood estimate of the autocorrelation parameter $\rho$ (95% CI 0.28–0.63) reflects the strong autoregressive dependency of CD4 suggested by the heatmap. Similarly, the strong dependence between previous and current blip status is confirmed by the estimated population-level transition probabilities, giving the probabilities of remaining in the negative and positive blip status states, respectively, in successive visits; a 95% CI for the difference is (0.05, 0.12). The estimated difference between the standardized CD4 means conditional on negative versus positive blip status (95% bootstrap CI 0.63–0.91) is consistent with the location shift apparent from Figure S1(b) of the supplementary material available at Biostatistics online.
The resulting time-indexed ROC curves at a fixed visit, chosen to be consistent with our exclusion of patients with fewer than 10 visits, and their associated 95% CIs are summarized in Figure S4 of the supplementary material available at Biostatistics online. As Figure S1(a) of the supplementary material shows, ROC curves at later time points are available if one is willing to exclude patients with insufficient visits. The CIs are constructed by the bootstrap with a large number of bootstrap samples. Despite the noisy data presented in Figures S1(b) and S2 of the supplementary material, the risk score taking into account both previous CD4 values and blip status performs reasonably well as a predictor of the current disease status. We also plot the time-asymptotic individual ROC curve $\widehat{\mbox{ROC}}{}^{\infty}_{i}$ for selected patients. Patient no. 14 exhibits a non-smooth curve. The "elbow" arises when a patient's previous disease status is substantially more predictive of future disease status than the patient's biomarkers. In such cases, the ROC curve approximates the discrete behavior of a threshold predictor. As mentioned above, the validity of the individualized ROC curves depends on the correct specification of model (2.1). If we instead view model (2.1) as a working device used to derive a risk score for predicting blip status, we may use $\widehat{\mbox{ROC}}{}^{p}_{j}$, as well as $\widehat{\mbox{ROC}}{}^{p}_{j_1, j_2}$, to summarize the predictive value of the scoring system between visits $j_1$ and $j_2$. Due to the small sample size and the infrequent occurrence of blips, we construct $\widehat{\mbox{ROC}}{}^{p}_{j_1, j_2}$ and its 95% CI, as shown in Figure S4 of the supplementary material. The area under this ROC curve is 0.865, with a bootstrap standard error of 0.008, also indicating good predictive value. The jagged shape of the ROC curve reflects the fact that few of the 81 patient scores lie in the overlap of the case and control distributions.
As a comparison, we also plot in Figure 2 the ROC curve based on the fitted scores obtained from the simple random effect model (1.1). As expected, the resulting ROC curve is higher than $\widehat{\mbox{ROC}}{}^{p}_{j}$ and $\widehat{\mbox{ROC}}{}^{p}_{j_1, j_2}$. However, using fitted values as scores to predict blip status requires information not available at visit $j$, and it is therefore not a comparable measure of the predictiveness of the biomarker at that time.
Fig. 2.

The ROC curve when fitted values under the random effects model (1.1) are used as scores, compared with $\widehat{\mbox{ROC}}{}^{p}_{j}$ and $\widehat{\mbox{ROC}}{}^{p}_{j_1, j_2}$ (pediatric HIV data).
4. Simulation
In this section, we investigate the finite-sample performance of the proposed methods. To this end, we generate data sets mimicking the pediatric HIV data. Specifically, the pairs $(Y_{ij}, D_{ij})$ are simulated under model (2.1), with the population parameter $\theta$ set to the maximum likelihood estimate obtained from the HIV data. We use a Monte-Carlo approximation for the underlying true ROC curves $\overline{\mbox{ROC}}_j$ and $\mbox{ROC}^{p}_{j}$. We also calculate $\mbox{ROC}^{\infty}_{i}$ based on the analytic expression (2.7) for selected random effects. Next, we generate 500 data sets, each consisting of $n$ patients with $m$ visits per patient, chosen to match the HIV example. For each simulated data set, we estimate (i) the expected individual-specific ROC curve $\overline{\mbox{ROC}}_j$ at selected visits $j$ and its 95% point-wise CI using the parametric bootstrap method; (ii) the limiting individual-specific ROC curve $\mbox{ROC}^{\infty}_{i}$ for selected patients; and (iii) the population ROC curve $\mbox{ROC}^{p}_{j}$ and its 95% point-wise CI using the resampling method, for selected visits $j$. A schematic of the resulting simulation loop is sketched below.
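In this sketch, `simulate_dataset`, `estimate_roc_curve`, and `ci_for_roc` are hypothetical placeholders for the data-generation, point-estimation, and interval-estimation steps of Section 2.

```python
import numpy as np

def coverage_study(theta_true, true_roc, fpr_grid, simulate_dataset,
                   estimate_roc_curve, ci_for_roc, n_datasets=500):
    """Skeleton of the simulation study: repeatedly generate data from the true
    model, re-estimate the ROC curve and its 95% CI, and track bias, coverage,
    and the empirical standard error at each false positive rate on fpr_grid."""
    estimates = np.empty((n_datasets, len(fpr_grid)))
    covered = np.zeros((n_datasets, len(fpr_grid)), dtype=bool)
    for b in range(n_datasets):
        data = simulate_dataset(theta_true)
        estimates[b] = estimate_roc_curve(data, fpr_grid)
        lo, hi = ci_for_roc(data, fpr_grid)          # e.g. bootstrap or resampling CI
        covered[b] = (lo <= true_roc) & (true_roc <= hi)
    bias = estimates.mean(axis=0) - true_roc
    return bias, covered.mean(axis=0), estimates.std(axis=0)
```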
The resulting ROC curve estimates and 95% CIs based on one generated data set are presented in Figure 3. To evaluate the performance of the proposed method, we estimate the empirical bias of the point estimators as well as the coverage of the 95% CIs at selected false positive rates, for both $\overline{\mbox{ROC}}_j$ and $\mbox{ROC}^{p}_{j}$, $j = 5, 15, 25,$ and $35$. For each estimator of interest, we also calculate the empirical average of the estimated standard errors and the empirical standard error. The detailed simulation results for $\overline{\mbox{ROC}}_j$ and $\mbox{ROC}^{p}_{j}$ are summarized in Table 1. The empirical biases are reasonably small in magnitude. The average estimated standard errors of all estimators are very close to the empirical standard errors, and the coverage levels of the 95% CIs are consistent with the nominal level up to Monte-Carlo simulation error. In general, as expected, the population ROC curve tends to be higher than the individualized counterpart at the same visit. For estimating $\mbox{ROC}^{\infty}_{i}$, we compare the AUC under the estimated ROC curve, based on data from an increasing number of visits, with the true limiting AUC value for selected random effects. We focus in particular on whether the estimator converges to the truth as the number of visits increases under this correct model specification. Figure S5 of the supplementary material available at Biostatistics online plots the number of visits against the difference between the estimated AUCs and the truth for five different realizations of the random effects, showing the expected convergence. The convergence may be too slow for some purposes, requiring data from a large number of visits in order to achieve the required estimation accuracy of the individual random effect.
Fig. 3.

Expected individual ROC $\widehat{\overline{\mbox{ROC}}}_j$ (solid) and population ROC $\widehat{\mbox{ROC}}{}^{p}_{j}$ (dotted) at a selected visit $j$, with 95% bootstrap CIs; limiting individual ROC $\widehat{\mbox{ROC}}{}^{\infty}_{i}$ for four patients. The data were generated under model (2.1) with hyperparameters estimated from the pediatric HIV data.
Table 1.

Nominal 95% CI coverage (CVL), bias (BS), average standard error (ASE), and empirical standard error (ESE) of the expected individual ROC estimator $\widehat{\overline{\mbox{ROC}}}_j$ and the population ROC estimator $\widehat{\mbox{ROC}}{}^{p}_{j}$ for false positive rates (FPR) 10%, 25%, 50%, and 75% at visits 5, 15, 25, and 35 (synthetic data using hyperparameters estimated from the pediatric HIV data, $n$ patients)
| Visit | ROC | FPR 0.10: CVL (BS,ASE,ESE) | FPR 0.25: CVL (BS,ASE,ESE) | FPR 0.50: CVL (BS,ASE,ESE) | FPR 0.75: CVL (BS,ASE,ESE) |
|---|---|---|---|---|---|
| 5 | individual | 0.94 (0.01,0.05,0.05) | 0.96 ( 0.00,0.04,0.04) | 0.94 ( 0.01,0.03,0.03) | 0.95 ( 0.00,0.02,0.02) |
| 5 | population | 0.94 ( 0.04,0.12,0.12) | 0.95 ( 0.02,0.12,0.11) | 0.97 ( 0.00,0.10,0.10) | 0.97 ( 0.00,0.06,0.06) |
| 15 | individual | 0.93 ( 0.01,0.05,0.05) | 0.93 ( 0.01,0.04,0.04) | 0.91 ( 0.02,0.03,0.03) | 0.97 ( 0.00,0.02,0.02) |
| 15 | population | 0.96 ( 0.03,0.13,0.12) | 0.96 ( 0.01,0.12,0.11) | 0.98 (0.00,0.08,0.08) | 0.95 (0.01,0.05,0.04) |
| 25 | individual | 0.93 ( 0.00,0.05,0.05) | 0.94 ( 0.01,0.04,0.04) | 0.95 ( 0.01,0.03,0.03) | 0.94 ( 0.01,0.02,0.02) |
| 25 | population | 0.95 ( 0.02,0.13,0.12) | 0.97 ( 0.00,0.11,0.11) | 0.95 ( 0.01,0.07,0.07) | 0.92 (0.00,0.04,0.04) |
| 35 | individual | 0.94 ( 0.01,0.05,0.05) | 0.92 ( 0.01,0.04,0.04) | 0.96 ( 0.00,0.03,0.03) | 0.95 ( 0.00,0.02,0.02) |
| 35 | population | 0.96 (0.02,0.13,0.12) | 0.94 (0.01,0.11,0.11) | 0.94 ( 0.00,0.08,0.07) | 0.92 ( 0.00,0.04,0.04) |
In the second set of simulations, we examine the performance of the proposal under model misspecification. Specifically, we simulate data under the random effect model (1.1), with model parameters taken to be the maximum likelihood estimators from the HIV data. As discussed above, the diagnostic value represented by the ROC curve based on the fitted scores from (1.1) cannot be achieved in practice but may serve as a benchmark. Since model (2.1) is misspecified, we focus on the population ROC curve only. First, we plot in Figure S6 of the supplementary material available at Biostatistics online the benchmark ROC curve and the population ROC curve $\mbox{ROC}^{p}_{j}$ based on the score derived from the working model (2.1), for selected visits. As expected, by comparison with the benchmark, $\mbox{ROC}^{p}_{j}$ fails to fully reflect the predictive value of the observed history, due to model misspecification. Next, we repeat the simulation 500 times, with sample sizes matching the HIV example, to examine the finite-sample biases of the point estimators and the coverage levels of the 95% CIs for estimating the true ROC curves. Table 2 confirms our expectation that the inference procedure for $\mbox{ROC}^{p}_{j}$ remains valid in the presence of model misspecification.
Table 2.

Misspecified model: nominal 95% CI coverage (CVL), bias (BS), average standard error (ASE), and empirical standard error (ESE) of $\widehat{\mbox{ROC}}{}^{p}_{j}$ for false positive rates (FPR) 10%, 25%, 50%, and 75% at visits 5, 15, 25, and 35 (random slope-intercept logistic model, $n$ patients)
| Visit | FPR 0.10: CVL (BS,ASE,ESE) | FPR 0.25: CVL (BS,ASE,ESE) | FPR 0.50: CVL (BS,ASE,ESE) | FPR 0.75: CVL (BS,ASE,ESE) |
|---|---|---|---|---|
| 5 | 0.97 ( 0.00,0.14,0.13) | 0.97 ( 0.00,0.12,0.10) | 0.96 (0.00,0.09,0.08) | 0.92 (0.00,0.06,0.05) |
| 15 | 0.94 (0.01,0.15,0.16) | 0.95 (0.05,0.11,0.10) | 0.92 ( 0.00,0.07,0.06) | 0.92 (0.00,0.04,0.03) |
| 25 | 0.94 ( 0.01,0.15,0.15) | 0.93 (0.03,0.11,0.11) | 0.96 (0.03,0.07,0.07) | 0.93 (0.02,0.04,0.03) |
| 35 | 0.93 ( 0.08,0.14,0.11) | 0.97 ( 0.03,0.11,0.09) | 0.95 ( 0.00,0.06,0.06) | 0.95 (0.01,0.03,0.03) |
5. Discussion
We have proposed a set of ROC-based metrics and statistical methods for evaluating the predictive value of a biomarker in a longitudinal, multiple patient design. We emphasize three key requirements in extending the ROC curve from the cross-sectional to the longitudinal setting: (i) the score used to construct the ROC curve should take into account all of the observed history; (ii) the score should not require unobserved history; and (iii) the predictive value of the biomarker at the individual and the population levels should be treated differently. These objectives are not met satisfactorily by the mixed effects model (1.1) available in the literature, where (i) a patient's observations are conditionally independent given the subject effects, and in particular past observations are not taken into account in using the observation as a score; (ii) all time points are used to estimate the subject effects, so that the score estimate for a given time point is a function of the disease status it is intended to predict; and (iii) there is no distinction between patient and population ROC curves.
The current approach is developed based on a simple parametric model. While the parametric assumptions are plausible in light of the HIV data and are convenient to implement, model checking is needed before applying the approach to other data.
In the proposed approach, we assume that the biomarker is measured at regular time intervals, which is true in the HIV example. However, in clinical practice, measurement times are often irregular, and it may not be possible to group the measurements into comparable 1st, 2nd, ..., visits. Furthermore, even with regular measurements, grouping measurements by visit may not be interpretable if the baseline does not represent a meaningful "origin," such as the onset of disease or the start of treatment. In such cases, the predictiveness of the biomarker needs to be evaluated with respect to the measurement history, including the actual measurement times. Doing so may require complex joint modeling of the measurement times, biomarker levels, and disease status. Further research in this direction is warranted.
Supplementary Material
Acknowledgments
This research was partially supported by grants NIH/NAIDS 2P30 AI060354-11 from the National Institutes of Health and R01HL089778-05 from the National Heart, Lung, and Blood Institute. We thank the two referees and associate editor for their constructive comments. The authors also thank the study participants and the principal investigator of the study, Dr. Elijah Paintsil, for sharing the data with us. Conflict of Interest: None declared.
References
- Albert, P. S. (2012). A linear mixed model for predicting a binary event from longitudinal data under random effects misspecification. Statistics in Medicine 31, 143–154.
- Azzalini, A. (1994). Logistic regression for autocorrelated data with application to repeated measures. Biometrika 81, 767–775.
- Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.
- Foulkes, A. S., Azzoni, L., Li, X., Johnson, M. A., Smith, C., Mounzer, K. and Montaner, L. J. (2010). Prediction based classification for longitudinal biomarkers. The Annals of Applied Statistics 4, 1476.
- Heagerty, P. J. and Zheng, Y. (2005). Survival model predictive accuracy and ROC curves. Biometrics 61, 92–105.
- Heagerty, P. J., Lumley, T. and Pepe, M. S. (2000). Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 56, 337–344.
- Janes, H., Longton, G. and Pepe, M. (2009). Accommodating covariates in ROC analysis. Stata Journal 9, 17–39.
- Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics 38, 963–974.
- Liu, D. and Albert, P. S. (2014). Combination of longitudinal biomarkers in predicting binary events. Biostatistics 15, 706–718.
- Liu, H. and Wu, T. (2003). Estimating the area under a receiver operating characteristic (ROC) curve for repeated measures design. Journal of Statistical Software 8, 1–18.
- Liu, H., Li, G., Cumberland, W. and Wu, T. (2005). Testing statistical significance of the area under a receiving operating characteristics curve for repeated measures design with bootstrapping. Journal of Data Science 3, 257–278.
- Paintsil, E., Ghebremichael, M., Romano, S. and Andiman, W. (2008). Absolute CD4+ T-lymphocyte count as a surrogate marker of pediatric HIV disease progression. Pediatric Infectious Disease Journal 7, 629–635.
- Paintsil, E., Martin, R., Goldenthal, A., Bhandari, S., Andiman, W. and Ghebremichael, M. (2016). Frequent episodes of detectable viremia in HIV treatment-experienced children is associated with a decline in CD4+ T-cells over time. Journal of AIDS & Clinical Research 7, 565–577.
- Pencina, M. J., D'Agostino, R. B. and Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine 27, 157–172.
- Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press.
- Robinson, G. K. (1991). That BLUP is a good thing: the estimation of random effects. Statistical Science 6, 15–32.
- Steyerberg, E. W., Vickers, A. J., Cook, N. R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M. J. and Kattan, M. W. (2010). Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology (Cambridge, MA) 21, 128.
- Swets, J. A. and Pickett, R. M. (1982). Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press Series in Cognition and Perception. New York, NY: Elsevier Science & Technology Books.
- Uno, H., Cai, T., Tian, L. and Wei, L.-J. (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association 102, 527–537.
- Uno, H., Tian, L., Cai, T., Kohane, I. S. and Wei, L.-J. (2013). A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data. Statistics in Medicine 32, 2430–2442.
- Yang, S., Santillana, M. and Kou, S. C. (2015). Accurate estimation of influenza epidemics using Google search data via ARGO. Proceedings of the National Academy of Sciences 112, 14473–14478.
- Zheng, Y. and Heagerty, P. J. (2004). Semiparametric estimation of time-dependent ROC curves for longitudinal marker data. Biostatistics 5, 615–632.
- Zhou, X.-H., McClish, D. K. and Obuchowski, N. A. (2009). Statistical Methods in Diagnostic Medicine, Volume 569. New York: John Wiley & Sons.