Author manuscript; available in PMC: 2011 May 9.
Published in final edited form as: J R Stat Soc Ser C Appl Stat. 2010;59(3):437–456. doi: 10.1111/j.1467-9876.2009.00707.x

Semiparametric methods for evaluating the covariate-specific predictiveness of continuous markers in matched case-control studies

Y Huang 1, M S Pepe 1
PMCID: PMC3090216  NIHMSID: NIHMS221299  PMID: 21562626

Summary

To assess the value of a continuous marker in predicting the risk of a disease, a graphical tool called the predictiveness curve has been proposed. It characterizes the marker’s predictiveness, or capacity to risk stratify the population by displaying the distribution of risk endowed by the marker. Methods for making inference about the curve and for comparing curves in a general population have been developed. However, knowledge about a marker’s performance in the general population only is not enough. Since a marker’s effect on the risk model and its distribution can both differ across subpopulations, its predictiveness may vary when applied to different subpopulations. Moreover, information about the predictiveness of a marker conditional on baseline covariates is valuable for individual decision making about having the marker measured or not. Therefore, to fully realize the usefulness of a risk prediction marker, it is important to study its performance conditional on covariates. In this article, we propose semiparametric methods for estimating covariate-specific predictiveness curves for a continuous marker. Unmatched and matched case-control study designs are accommodated. We illustrate application of the methodology by evaluating serum creatinine as a predictor of risk of renal artery stenosis.

1. Introduction

New technologies, such as gene expression microarrays and protein mass spectrometry, promise a multitude of biomarkers for detecting disease and predicting future events. Statistical methodology is needed to critically evaluate them. Biomarkers differ in their purposes and consequently demand different criteria for evaluation. For example, diagnostic markers are used to classify people as diseased or nondiseased. Their performance is typically evaluated with measures like sensitivity, specificity, and the ROC curve. Risk prediction markers, on the other hand, are utilized to predict the risk of having or getting a disease or other event. For risk prediction markers, measures that quantify their abilities to stratify risk are most relevant (Cook et al., 2006). Pepe et al. (2008a) and Huang et al. (2007) proposed a graphical tool, the predictiveness curve, to characterize the population distribution of predicted risk. Let D be a binary outcome, Y be the marker of interest, and let Risk(y) = P(D = 1∣Y = y) be the risk predicted for subjects with marker value Y = y. We write the predictiveness curve as R(v) vs v, where R(v) is the 100 × vth percentile of Risk(Y). Equivalently we plot p vs FR(p), where FR(p) is the cumulative distribution of risk in the population, R−1(p) = FR(p).

The distribution of Risk(Y) provides a common scale for comparing risk prediction markers or models. A better model for risk stratification will have larger variability in risk percentiles. Equivalently, it will assign more subjects into low and high risk ranges.
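As an illustration only, the definitions of R(v) and FR(p) can be sketched with simulated data. The risk model coefficients, the standard normal marker, and the function names below are assumptions made for this example, not quantities taken from the study.

```python
import math
import random

def logistic_risk(y, theta0=-2.0, theta1=1.0):
    """Hypothetical risk model Risk(y) = P(D = 1 | Y = y); coefficients are made up."""
    return 1.0 / (1.0 + math.exp(-(theta0 + theta1 * y)))

def predictiveness_curve(risks, v_grid):
    """R(v): the 100*v-th percentile of the risk values, for each v in v_grid."""
    ordered = sorted(risks)
    n = len(ordered)
    # Empirical quantile: smallest risk whose empirical CDF is at least v.
    return [ordered[min(n - 1, max(0, math.ceil(v * n) - 1))] for v in v_grid]

def risk_cdf(risks, p):
    """F_R(p) = R^{-1}(p): the proportion of subjects with risk at most p."""
    return sum(r <= p for r in risks) / len(risks)

random.seed(0)
markers = [random.gauss(0.0, 1.0) for _ in range(5000)]  # simulated cohort
risks = [logistic_risk(y) for y in markers]
v_grid = [0.1, 0.3, 0.5, 0.7, 0.9]
curve = predictiveness_curve(risks, v_grid)
```

A marker with a larger assumed slope `theta1` spreads the values in `curve` further apart, which is the sense in which a better marker has larger variability in risk percentiles.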

As pointed out by Huang et al. (2007), the predictiveness of a marker depends on the target population. If there are covariates that define subpopulations, then a marker’s covariate-specific predictiveness should be explored in order to identify subpopulations where it is useful for risk stratification and subpopulations where it is not. Another motivation for considering covariate-specific curves is for individual decision making about having a marker measured or not. A subject may decide that it is valuable to ascertain the value of a marker for him/her only if there is a reasonably large chance that his/her risk calculated on the basis of marker and covariates will differ substantially from that calculated only on the basis of covariates. Covariate-specific predictiveness curves also address this question as we will illustrate.

There are two ways in which covariates can impact a marker’s predictiveness. Covariates may be predictors themselves, possibly interacting with the marker’s effect on risk. In other words, covariates can affect the shape and height of the risk as a function of Y. On the other hand, the distribution of the marker may vary with covariates. Both effects impact on the distribution of risk, i.e. the predictiveness curve.

Let Z be the vector of covariates of interest and let Risk(Y, Z) = P(D = 1∣Y, Z). Then the covariate-specific predictiveness curve for Y given Z = z is the curve Rz(v) vs v, where Rz(v) is the 100 × vth percentile of Risk(Y, Z) in the population with Z = z. Estimation of the covariate-specific predictiveness curve with data from a cohort design has been studied in Huang et al. (2007). Since case-control studies are more frequently conducted, particularly in the early phases of biomarker development (Pepe et al., 2001), it is important to develop estimators for this type of design as well. Here we consider case-control study designs. Moreover, our methods accommodate matching, where controls are frequency matched to cases with regard to a subset of covariates.

To fix ideas consider the following example of a study to evaluate serum creatinine as a predictor of renal artery stenosis in patients with therapy resistant hypertension. Renal artery stenosis is assessed with a costly invasive procedure, renal angiography, that must only be performed on subjects at high risk of having the condition. A large cohort of subjects undergoing renal angiography constitutes the patient cohort. The marker, serum creatinine, was measured on a subset of cases with renal stenosis and on a subset of controls without renal stenosis. Details are provided in Section 4. Baseline risk of significant stenosis is calculated on the basis of several covariates including age, gender, and smoking. For illustration, we consider risk values above 0.4 as high enough to routinely recommend renal angiography and risk values below 0.1 as low enough to discourage the practice. Figure 1 shows predictiveness curves for serum creatinine in subjects with baseline risk categorized as low, medium, and high. We see that among subjects originally deemed medium risk according to baseline factors, 24.8% are reclassified as low risk and 7.1% are reclassified as high risk after including serum creatinine in the risk model. On the other hand, only a small fraction of subjects in the high or low risk categories according to baseline factors move to the medium risk designation after serum creatinine is included, and none move from the high risk to low risk category or vice versa. Therefore, on average, ascertainment of serum creatinine seems to be most useful for subjects with medium baseline risk since in this group a substantial proportion are moved across risk thresholds that affect medical decisions. Cook (2007) proposed a simplified version of this sort of approach to evaluating the incremental value of C-reactive protein for cardiovascular risk assessment.
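The reclassification percentages quoted above are proportions of a subgroup whose updated risk falls below the low threshold or above the high threshold. A minimal sketch of that bookkeeping follows; the thresholds 0.1 and 0.4 come from the text, while the function name and the four risk values are invented for illustration.

```python
def risk_category_fractions(risks, low=0.1, high=0.4):
    """Fractions of subjects whose risk is below `low`, above `high`, or in between."""
    n = len(risks)
    frac_low = sum(r < low for r in risks) / n
    frac_high = sum(r > high for r in risks) / n
    return {"low": frac_low, "medium": 1.0 - frac_low - frac_high, "high": frac_high}

# Hypothetical updated risks for four subjects in one baseline risk category.
fractions = risk_category_fractions([0.05, 0.2, 0.3, 0.5])
```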

Fig. 1. Predictiveness curves for serum creatinine specific to patients with low, high, and equivocal baseline risk.

While predictiveness curves specific to baseline risk categories provide a big picture of the marker’s incremental value in terms of affecting medical decisions in subpopulations, another kind of predictiveness curve which is specific to an individual’s baseline covariate values is more useful for individual decision making. In Figure 2 the covariate-specific curves are for individuals with specific values for the baseline covariates. For example, subject 2 has baseline risk equal to 38.1%. His predictiveness curve indicates that if his serum creatinine is measured, he has 0.5% chance of being classified as low risk. If his personal threshold for risk is low and he will opt for renal angiography unless he is deemed to have risk < 10%, there is no point in ascertaining serum creatinine for him, because almost certainly his risk calculated with serum creatinine in addition to baseline covariates will exceed 10%.

Fig. 2. Examples of covariate-specific predictiveness curves for serum creatinine.

In this paper we develop methodology to estimate and make inference about covariate-specific predictiveness curves, exemplified in Figures 1 and 2.

2. Method

We consider a case-control design where cases and controls are frequency matched within $\mathcal{S}$ different strata. The special case where there is no matching corresponds to $\mathcal{S} = 1$. Let S be the indicator for the matching stratum, taking values from 1 to $\mathcal{S}$. Assume ρs = P(D = 1∣S = s), the disease prevalence within stratum s, is known a priori or can be estimated from a cohort. For example, we might have data from an independent cohort, or we might conduct a two-phase study where the second-phase case-control sample is nested within a parent cohort (Baker et al., 2002; Pepe et al., 2008b). We are interested in estimating the predictiveness curve for marker Y given values of covariate Z. Consider the scenario where the matching stratum S can be written as a function of Z, S(Z). An implication of this functional relationship between S and Z is that the risk of disease conditional on marker Y and covariate Z is independent of S. That is, P(D = 1∣Y, Z, S) = P(D = 1∣Y, Z), which we denote by G(θ, Y, Z), where θ is the risk model parameter. This is satisfied, for example, if strata are defined by a subset of covariates. In the context of the example in Section 1, suppose cases and controls are matched on baseline risk category; then s = 1, 2, 3. The marker Y is serum creatinine. In Figure 1 the covariate Z is the baseline risk category, while in Figure 2 Z is the vector of baseline covariates.

Consider a logistic model where the risk conditional on covariate values is monotone increasing as a function of the marker

$$\operatorname{logit}\{P(D = 1 \mid Y, Z)\} = \operatorname{logit}\{G(\theta, Y, Z)\} = \theta_0 + \kappa(\theta_1, Y, Z), \tag{1}$$

where $\theta = (\theta_0, \theta_1)^T$. Here $\kappa(\theta_1, Y, Z)$ is a pre-defined function that is monotone increasing in Y given Z = z. For example, $\kappa(\theta_1, Y, Z)$ might be $\theta_{11} Y + \theta_{12}^T Z$, the linear logistic model, or more flexibly $\theta_{11} Y^{(\theta_{13})} + \theta_{12}^T Z$, where $Y^{(\theta_{13})}$ is the Box-Cox transformation (Cole and Green, 1992). Moreover, interactions between Y and Z might also be included. Note that under the monotone increasing risk assumption, the curve $R_z(v)$ vs v can be written as $P(D = 1 \mid Y, z)$ vs $F_z(Y)$, where $F_z(y) = P(Y \le y \mid Z = z)$ is the CDF of Y when Z = z. This suggests that estimation of the curve can be achieved in two steps: estimation of the risk model and estimation of the covariate-specific marker distribution. We take advantage of this formulation in our proposed methods.

2.1. Discrete Covariate

Suppose the covariate of interest is discrete with Ƶ categories. Let Z be the covariate group indicator, taking values from 1 to Ƶ. Consider the risk model

$$\operatorname{logit}\{P(D = 1 \mid Y, Z, S)\} = \operatorname{logit}\{G(\theta, Y, Z)\} = \theta_0 + \theta_1 Y + \theta_2^T Z_M + \theta_3^T Y Z_I,$$

where θ = (θ0, θ1, θ2, θ3)T. Here ZM, ZI are vectors indexing subsets of covariate group defined by Z. For example, in a study where cases and controls are frequency matched by age subgroup, covariates of interest might be the combination of age subgroup and gender; we can use ZM to index main effects for age subgroup and for gender, and use ZI to index interactions between the marker and gender only. The interaction terms vanish when ZI is of length 0. Let LRz be the likelihood ratio of Y conditional on covariate value z:

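$$LR_z(y) = \frac{P(Y = y \mid D = 1, Z = z)}{P(Y = y \mid D = 0, Z = z)} = \frac{P(D = 1 \mid Y = y, Z = z)}{P(D = 0 \mid Y = y, Z = z)} \cdot \frac{P(D = 0 \mid Z = z)}{P(D = 1 \mid Z = z)} = \exp\{\theta_0 + \eta(z) + \theta_1 y + \theta_2^T z_M + \theta_3^T y z_I\}, \tag{2}$$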

where η(z) = log {P(D = 0∣Z = z)/P(D = 1∣Z = z)}.

When the matching stratum S is the same as the covariate group, P(D = 1∣Z = z) is either known or can be estimated from another cohort sample by assumption, and therefore so is η(z). Consider now settings where the matching stratum is a proper subset of the covariate group. If information about P(D = 1∣Z = z) is available, it can be directly utilized. However, if disease prevalence within a covariate group is not available, η(z) can be estimated from the case-control sample because by Bayes’ theorem, for s = S(z),

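$$\eta(z) = \log\left\{\frac{P(D = 0 \mid Z = z, S = s)}{P(D = 1 \mid Z = z, S = s)}\right\} = \log\left\{\frac{P(D = 0 \mid S = s)}{P(D = 1 \mid S = s)}\right\} + \log\left\{\frac{P(Z = z \mid D = 0, S = s)}{P(Z = z \mid D = 1, S = s)}\right\} = \log\left(\frac{1 - \rho_s}{\rho_s}\right) + \log\left\{\frac{P(Z = z \mid D = 0, S = s)}{P(Z = z \mid D = 1, S = s)}\right\}. \tag{3}$$

The first term in (3) is known or can be estimated from a cohort by assumption, and for the second term, we can estimate P(Z = z∣D = 0, S = s) and P(Z = z∣D = 1, S = s) empirically from the case-control sample. Henceforth we use η(z) to represent the value that is known or estimated.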
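Equation (3) can be sketched in a few lines. The function name, the tuple layout of the sample, and the toy counts are assumptions for illustration; the sketch also assumes every covariate group contains at least one case and one control, since otherwise the empirical ratio is undefined.

```python
import math
from collections import Counter

def estimate_eta(sample, rho):
    """Estimate eta(z) = log{P(D=0|Z=z) / P(D=1|Z=z)} via equation (3).

    sample: list of (z, s, d) tuples from the matched case-control sample
    rho:    dict mapping stratum s to the known prevalence P(D = 1 | S = s)
    """
    cell = Counter((s, d, z) for z, s, d in sample)   # counts per (s, d, z)
    total = Counter((s, d) for _, s, d in sample)     # counts per (s, d)
    stratum_of = {z: s for z, s, _ in sample}         # S is a function of Z
    eta = {}
    for z, s in stratum_of.items():
        p_z_controls = cell[(s, 0, z)] / total[(s, 0)]
        p_z_cases = cell[(s, 1, z)] / total[(s, 1)]
        eta[z] = math.log((1 - rho[s]) / rho[s]) + math.log(p_z_controls / p_z_cases)
    return eta

# One matching stratum (s = 1) split into two covariate groups, "a" and "b".
sample = ([("a", 1, 1)] * 30 + [("b", 1, 1)] * 10 +
          [("a", 1, 0)] * 20 + [("b", 1, 0)] * 20)
eta = estimate_eta(sample, rho={1: 0.2})
```

In this toy sample the implied prevalence in group "b" is 1/(1 + exp(eta["b"])) = 1/9, which matches a direct Bayes calculation from the stratum prevalence 0.2 and the within-stratum group frequencies.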

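Suppose $n_D$ cases and $n_{\bar{D}}$ controls are sampled in the study. Let $U_s$ be the set of subjects within matching stratum s, with $U_{Ds}$ and $U_{\bar{D}s}$ being the subsets of cases and controls respectively. Let i index subject. We maximize the empirical likelihood (Owen, 1988, 1990; Qin and Lawless, 1994) of observing the marker and covariate values in the matched case-control sample:

$$L = \prod_{s=1}^{\mathcal{S}} \Big\{ \prod_{i \in U_{Ds}} P(Y_i, Z_i \mid D_i) \prod_{i \in U_{\bar{D}s}} P(Y_i, Z_i \mid D_i) \Big\} = \prod_{s=1}^{\mathcal{S}} \Big\{ \prod_{i \in U_{Ds}} P(Y_i \mid Z_i, D_i) P(Z_i \mid D_i) \prod_{i \in U_{\bar{D}s}} P(Y_i \mid Z_i, D_i) P(Z_i \mid D_i) \Big\} \propto \prod_{s=1}^{\mathcal{S}} \Big\{ \prod_{i \in U_{Ds}} P(Y_i \mid Z_i, D_i) \prod_{i \in U_{\bar{D}s}} P(Y_i \mid Z_i, D_i) \Big\}. \tag{4}$$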

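Define $F_z$, $F_{Dz}$, and $F_{\bar{D}z}$ to be the cumulative distribution functions for marker Y in covariate group z in the general, diseased, and non-diseased populations. With a slight abuse of notation, let $U_z$ be the set of subjects within covariate group z, with $U_{Dz}$ and $U_{\bar{D}z}$ being the subsets of cases and controls respectively, and let $n_z$, $n_{Dz}$, and $n_{\bar{D}z}$ be the corresponding sample sizes. Denote by $p_{iz} = dF_{\bar{D}z}(Y_i)$ the probability density of Y in the non-diseased population evaluated at the ith subject in covariate group z. Note that (4) can be rewritten as

$$L(\theta, F_{\bar{D}z}) = \prod_{z=1}^{Ƶ} \Big\{ \prod_{i \in U_{\bar{D}z}} dF_{\bar{D}z}(Y_i) \prod_{i \in U_{Dz}} dF_{Dz}(Y_i) \Big\} = \prod_{z=1}^{Ƶ} \Big\{ \prod_{i \in U_{\bar{D}z}} dF_{\bar{D}z}(Y_i) \prod_{i \in U_{Dz}} LR_z(Y_i)\, dF_{\bar{D}z}(Y_i) \Big\} = \prod_{z=1}^{Ƶ} \Big\{ \prod_{i \in U_z} dF_{\bar{D}z}(Y_i) \prod_{i \in U_{Dz}} LR_z(Y_i) \Big\} = \prod_{z=1}^{Ƶ} \Big[ \prod_{i \in U_z} p_{iz} \prod_{i \in U_{Dz}} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_{iM} + \theta_3^T Y_i Z_{iI}\} \Big]$$

subject to the restrictions $p_{iz} > 0$, $\sum_{i \in U_z} p_{iz} = 1$, and $\sum_{i \in U_z} p_{iz} LR_z(Y_i) = 1$, for z = 1, …, Ƶ. It is equivalent to maximize

$$l = \sum_{z=1}^{Ƶ} \sum_{i \in U_z} \log(p_{iz}) + \sum_{z=1}^{Ƶ} \sum_{i \in U_{Dz}} \{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_{iM} + \theta_3^T Y_i Z_{iI}\} + \sum_{z=1}^{Ƶ} \lambda_{1z} \Big(1 - \sum_{i \in U_z} p_{iz}\Big) + \sum_{z=1}^{Ƶ} \lambda_{2z} \Big[ \sum_{i \in U_z} p_{iz} - \sum_{i \in U_z} p_{iz} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_{iM} + \theta_3^T Y_i Z_{iI}\} \Big], \tag{5}$$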

where λ1z, λ2z are Lagrange multipliers.

Based on (5) it can be shown that $p_{iz} = 1 / \big(n_z + \lambda_{2z}[\exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_{iM} + \theta_3^T Y_i Z_{iI}\} - 1]\big)$ for $i \in U_z$. Methods for solving for $\lambda_{2z}$ and θ are derived in the appendix. In the special case where a separate main effect is included for each covariate group (i.e. $Z_M$ is the covariate group indicator), a closed-form solution exists. Specifically, we have $\lambda_{2z} = n_{Dz}$, and the maximum likelihood estimator of θ can be obtained by fitting a prospective logistic model $\operatorname{logit}\{P(D = 1 \mid Y, Z)\} = \theta_0 + \xi(Z) + \theta_1 Y + \theta_2^T Z_M + \theta_3^T Y Z_I$ to the case-control sample using $\xi(z) = \log(n_{Dz}/n_{\bar{D}z}) + \eta(z)$ as an offset term. In the more general case, $\lambda_{2z}$ and θ can be estimated iteratively.

We then compute the covariate-group specific marker distributions within controls and cases as

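$$\hat{F}_{\bar{D}z}(y) = \sum_{i \in U_z} \hat{p}_{iz} I(Y_i \le y), \qquad \hat{F}_{Dz}(y) = \sum_{i \in U_z} \hat{p}_{iz} I(Y_i \le y) \widehat{LR}_z(Y_i),$$

where $\widehat{LR}_z$ and $\hat{p}_{iz}$ are the maximum likelihood estimators of $LR_z$ and $p_{iz}$. We estimate $F_z$ as a weighted average of $F_{Dz}$ and $F_{\bar{D}z}$ with weight being P(D = 1∣Z = z), the disease prevalence within covariate group z. Recall that P(D = 1∣Z = z) is either known or can be estimated. Let the estimate of $F_z$ be $\hat{F}_z(y) = P(D = 1 \mid Z = z) \hat{F}_{Dz}(y) + P(D = 0 \mid Z = z) \hat{F}_{\bar{D}z}(y)$. The semiparametric maximum likelihood estimators (MLE) of the predictiveness curve and its inverse within covariate group z are:

$$\hat{R}_z(v) = G\{\hat{\theta}, \hat{F}_z^{-1}(v), z\} \text{ for } v \in (0, 1), \qquad \hat{R}_z^{-1}(p) = \hat{F}_z\{G^{-1}(\hat{\theta}, p, z)\} \text{ for } p \in \{R_z(v) : v \in (0, 1)\},$$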

where G−1(θ, p, z) = {y : G(θ, y, z) = p}.
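The closed-form special case can be sketched end to end for a single covariate group: fit the prospective logistic model with offset ξ(z), form the weights with λ2z = nDz, and read off the estimated curve. Everything below is a simplified illustration under stated assumptions: one covariate group, no interaction terms, a hand-written Newton-Raphson fit, and simulated data borrowed from the Z = 0 arm of the first simulation setting in Section 3. The function and variable names are hypothetical.

```python
import math
import random

def fit_offset_logistic(y, d, offset, n_iter=50):
    """Newton-Raphson fit of logit P(D=1 | y, sampled) = theta0 + offset + theta1*y."""
    theta0, theta1 = -offset, 0.0  # start with linear predictor equal to 0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, di in zip(y, d):
            p = 1.0 / (1.0 + math.exp(-(theta0 + offset + theta1 * yi)))
            w = p * (1.0 - p)
            g0 += di - p            # score for the intercept
            g1 += (di - p) * yi     # score for the slope
            h00 += w
            h01 += w * yi
            h11 += w * yi * yi
        det = h00 * h11 - h01 * h01
        theta0 += (h11 * g0 - h01 * g1) / det
        theta1 += (h00 * g1 - h01 * g0) / det
    return theta0, theta1

random.seed(1)
rho = 0.1                      # assumed known P(D = 1 | Z = z)
n_case = n_control = 200
y = ([random.gauss(0.5, 1.0) for _ in range(n_case)] +
     [random.gauss(0.0, 1.0) for _ in range(n_control)])
d = [1] * n_case + [0] * n_control

eta = math.log((1 - rho) / rho)
xi = math.log(n_case / n_control) + eta      # offset xi(z)
theta0, theta1 = fit_offset_logistic(y, d, xi)

lr = [math.exp(theta0 + eta + theta1 * yi) for yi in y]          # LR_z(Y_i)
n_z = n_case + n_control
p_hat = [1.0 / (n_z + n_case * (l - 1.0)) for l in lr]           # lambda_2z = n_Dz

def marker_cdf(t):
    """F_z(t): prevalence-weighted average of the case and control CDF estimates."""
    f_control = sum(p for p, yi in zip(p_hat, y) if yi <= t)
    f_case = sum(p * l for p, l, yi in zip(p_hat, lr, y) if yi <= t)
    return rho * f_case + (1 - rho) * f_control

def r_hat(v):
    """R_z(v) = G{theta, F_z^{-1}(v), z} with F_z^{-1} taken over observed markers."""
    y_v = next(t for t in sorted(y) if marker_cdf(t) >= v - 1e-12)
    return 1.0 / (1.0 + math.exp(-(theta0 + theta1 * y_v)))
```

At the fitted values the weights satisfy both constraints of the empirical likelihood, that is, `sum(p_hat)` and the LR-weighted sum both equal 1 up to rounding, which is a useful check on any implementation.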

Observe that for the special case where both $Z_M$ and $Z_I$ are the covariate group indicator, we essentially allow a different risk model within each covariate group, and our methods are equivalent to fitting separate predictiveness curves to data within each covariate category using the semiparametric two-sample MLE method (Huang, 2007). However, when the risk model is not fully saturated, the MLE method uses data from all the covariate groups to estimate each covariate-specific predictiveness curve. It borrows information across different covariate groups to achieve better efficiency.

Based on arguments similar to those in Huang (2007), asymptotic normality of $\hat{R}_z(v)$ and $\hat{R}_z^{-1}(p)$ can be derived. Proofs of the following theorems are outlined in the appendix.

Theorem 1

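As n → ∞, $\sqrt{n}\{\hat{R}_z(v) - R_z(v)\}$ converges to a normal random variable with mean zero and variance

$$\Sigma_{1z}(v) = \left\{\frac{\partial R_z(v)}{\partial v}\right\}^2 \operatorname{var}\big(\sqrt{n}[\hat{F}_z\{F_z^{-1}(v)\} - v]\big) + \left(\frac{\partial R_z(v)}{\partial \theta}\right)^T \operatorname{var}\{\sqrt{n}(\hat{\theta} - \theta)\} \left(\frac{\partial R_z(v)}{\partial \theta}\right) + 2 \left(\frac{\partial R_z(v)}{\partial \theta}\right)^T \operatorname{cov}\big(\sqrt{n}(\hat{\theta} - \theta), \sqrt{n}[\hat{F}_z\{F_z^{-1}(v)\} - v]\big) \left\{\frac{\partial R_z(v)}{\partial v}\right\}.$$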

Theorem 2

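As n → ∞, $\sqrt{n}\{\hat{R}_z^{-1}(p) - R_z^{-1}(p)\}$ converges to a normal random variable with mean zero and variance

$$\Sigma_{2z}(p) = \operatorname{var}\big(\sqrt{n}[\hat{F}_z\{G^{-1}(\theta, p, z)\} - F_z\{G^{-1}(\theta, p, z)\}]\big) + \left(\frac{\partial R_z^{-1}(p)}{\partial \theta}\right)^T \operatorname{var}\{\sqrt{n}(\hat{\theta} - \theta)\} \left(\frac{\partial R_z^{-1}(p)}{\partial \theta}\right) + 2 \left(\frac{\partial R_z^{-1}(p)}{\partial \theta}\right)^T \operatorname{cov}\big(\sqrt{n}(\hat{\theta} - \theta), \sqrt{n}[\hat{F}_z\{G^{-1}(\theta, p, z)\} - F_z\{G^{-1}(\theta, p, z)\}]\big).$$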

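Observe that when $v = R_z^{-1}(p)$,

$$\frac{\partial R_z^{-1}(p)}{\partial \theta} \frac{\partial R_z(v)}{\partial v} = \frac{\partial v}{\partial \theta} \frac{\partial R_z(v)}{\partial v} = \frac{\partial R_z(v)}{\partial \theta},$$

thus $\Sigma_{1z}(v) = \{\partial R_z(v)/\partial v\}^2 \Sigma_{2z}(p)$. That is, the variances of $\hat{R}_z(v)$ and its inverse are related by a factor equal to the square of the derivative of $R_z(v)$, which is intuitive since a perturbation in $R_z(v)$ can be approximated by $R_z'(v)$ times a perturbation in $R_z^{-1}(p)$.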

Note that calculation of asymptotic variance for the MLE estimators is quite involved, requiring nonparametric density estimation. In practice, we recommend bootstrap resampling to assess sampling variability. Bootstrap resampling was shown to have better performance than estimated asymptotic variance in the simpler setting where no covariates are involved (Huang, 2007). The bootstrap procedure is performed to mimic the study design. For example, for a frequency-matched case-control sample, cases and controls are randomly sampled within each matching stratum. Also, if matching stratum specific disease prevalence ρs is estimated from an independent cohort or from the parent cohort within which the case-control sample is nested, then this cohort is also resampled in the bootstrap procedure to take into account variability in the estimate of ρs.
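A design-mimicking bootstrap for a frequency-matched sample can be sketched as below. The record layout, the function names, and the toy statistic (the mean marker value among cases, standing in for an estimate of the curve) are assumptions for illustration. Resampling of the cohort used to estimate ρs is not shown.

```python
import math
import random
from collections import Counter, defaultdict

def resample_matched_case_control(records, rng):
    """Resample with replacement separately within each (stratum, case status) cell."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec["s"], rec["d"])].append(rec)
    boot = []
    for key in sorted(cells):
        boot.extend(rng.choices(cells[key], k=len(cells[key])))
    return boot

def percentile_interval(estimates, level=0.95):
    """Percentile bootstrap interval from a list of bootstrap estimates."""
    ordered = sorted(estimates)
    alpha = (1.0 - level) / 2.0
    last = len(ordered) - 1
    return ordered[math.floor(alpha * last)], ordered[math.ceil((1.0 - alpha) * last)]

rng = random.Random(2024)
records = [{"s": s, "d": d, "y": rng.gauss(0.5 * d, 1.0)}
           for s in (1, 2, 3) for d in (0, 1) for _ in range(40)]

boot_stats = []
for _ in range(200):
    boot = resample_matched_case_control(records, rng)
    case_markers = [r["y"] for r in boot if r["d"] == 1]
    boot_stats.append(sum(case_markers) / len(case_markers))  # placeholder statistic
ci_low, ci_high = percentile_interval(boot_stats)
```

Because resampling is done within cells, every bootstrap sample has the same number of cases and controls per stratum as the original sample.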

2.2. Continuous Covariate

In Section 2.1, we developed an estimator of the covariate-specific predictiveness curve from a matched case-control study when covariates of interest are discrete. In this section, we investigate estimation in more general settings. In particular, continuous covariates or discrete covariates with many sparse strata are accommodated with this methodology.

Our strategy is to separate estimation of the covariate-specific predictiveness curve into estimation of the risk model and estimation of the covariate-specific marker distribution. The former estimation task can be achieved by employing logistic regression model adjusted by the known or estimated disease prevalences within the matching strata. The second estimation task can be performed using a weighted version of the semiparametric location-scale model (Heagerty and Pepe, 1999) for the marker distribution.

Consider the general risk model in (1). We apply a logistic model

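$$\operatorname{logit}\{P(D = 1 \mid Y, Z, S = s)\} = \theta_0 + \xi(s) + \kappa(\theta_1, Y, Z) \tag{6}$$

to the case-control sample using

$$\xi(s) = \log\left\{\frac{P(D = 1 \mid S = s, \text{sampled})}{P(D = 0 \mid S = s, \text{sampled})} \cdot \frac{1 - \rho_s}{\rho_s}\right\}$$

as the offset term, where P(D = 1∣S = s, sampled)/P(D = 0∣S = s, sampled) is fixed by design and/or can be estimated in the case-control sample with $n_{Ds}/n_{\bar{D}s}$.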
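For concreteness, the offset ξ(s) can be computed from the sampled counts and the stratum prevalences as in the following sketch. The function name and the counts are hypothetical.

```python
import math

def stratum_offsets(n_cases, n_controls, rho):
    """xi(s) = log{(n_Ds / n_Dbar_s) * (1 - rho_s) / rho_s} for each stratum s."""
    return {s: math.log(n_cases[s] / n_controls[s] * (1.0 - rho[s]) / rho[s])
            for s in rho}

# Two strata: 1:1 sampling with prevalence 0.1, and 1:2 sampling with prevalence 0.2.
xi = stratum_offsets(n_cases={1: 100, 2: 100},
                     n_controls={1: 100, 2: 200},
                     rho={1: 0.1, 2: 0.2})
```

The resulting values would then be supplied as a fixed offset to whatever logistic regression routine is used to fit (6).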

Fitting this modified prospective logistic regression is essentially maximizing the “conditional (or pseudo) maximum likelihood” that an observation in the case-control sample within a particular matching stratum is a case or a control (Manski and McFadden, 1981; Hsieh et al., 1985; Breslow and Cain, 1988; Fears and Brown, 1986; Breslow and Zhao, 1988; Scott and Wild, 2001). Specifically, the contribution of subject i to the pseudolikelihood is P(Di∣Yi, Zi, Si, sampled). According to Bayes’ theorem, the odds of being diseased in the case-control sample is

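$$\frac{P(D = 1 \mid Y, Z, S, \text{sampled})}{P(D = 0 \mid Y, Z, S, \text{sampled})} = \frac{P(D = 1 \mid Y, Z)\, P(D = 0 \mid S)}{P(D = 0 \mid Y, Z)\, P(D = 1 \mid S)} \cdot \frac{P(D = 1 \mid S, \text{sampled})}{P(D = 0 \mid S, \text{sampled})} = \exp\{\theta_0 + \xi(S) + \kappa(\theta_1, Y, Z)\}.$$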

Note that when the covariate of interest is the matching stratum, this pseudolikelihood approach leads to the same estimates of the risk model as the MLE approach in Section 2.1.

For a two-phase design, this pseudolikelihood method has been shown to give rise to estimates that are close to the maximum likelihood estimator which can be obtained by repeated fitting of ordinary logistic regression models (Wild, 1991; Scott and Wild, 1997; Breslow and Holubkov, 1997). Both estimators are known to be much more efficient than the inverse probability weighted likelihood approach (Flanders and Greenland, 1991).

To estimate Fz, we employ a semiparametric location-scale model for the distribution of Y given Z (Heagerty and Pepe, 1999). Suppose

$$F_z(y) = F_0\left(\frac{y - \mu_z}{\sigma_z}\right),$$

where F0 is the cumulative distribution function of some unknown distribution, μZ = γTU(Z) and log(σZ) = δTW(Z), where U(Z) and W(Z) are specified functions of Z. For example, for a discrete Z, U(Z) and W(Z) could be dummy variables indicating unique values of Z, while for a continuous Z, they could be B-spline basis function of Z. Denote Ui = U(Zi) and Wi = W(Zi). For a cohort study, Heagerty and Pepe (1999) proposed estimating γ and δ by solving the estimating equations

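$$\sum_{i=1}^{n} \frac{U_i (Y_i - \gamma^T U_i)}{\sigma_{Z_i}^2} = 0, \qquad \sum_{i=1}^{n} \frac{W_i \{(Y_i - \gamma^T U_i)^2 - \sigma_{Z_i}^2\}}{\sigma_{Z_i}^2} = 0. \tag{7}$$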

For the matched case-control study, we obtain γ^ and δ^ as solutions to the inverse probability weighted (Horvitz and Thompson, 1952) version of the estimating equations in (7):

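$$\sum_{i=1}^{n} \frac{1}{\hat{q}_i} \frac{U_i (Y_i - \gamma^T U_i)}{\sigma_{Z_i}^2} = 0, \qquad \sum_{i=1}^{n} \frac{1}{\hat{q}_i} \frac{W_i \{(Y_i - \gamma^T U_i)^2 - \sigma_{Z_i}^2\}}{\sigma_{Z_i}^2} = 0, \tag{8}$$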

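where $\hat{q}_i = \hat{P}(\text{sampled} \mid Y_i, Z_i, S_i, D_i)$. Observe that

$$P(\text{sampled} \mid Y_i, Z_i, S_i, D_i) = P(\text{sampled} \mid S_i, D_i) \propto \frac{P(D_i \mid S_i, \text{sampled})\, P(S_i \mid \text{sampled})}{P(D_i \mid S_i)\, P(S_i)},$$

where P(Di∣Si, sampled) is fixed by design and can be estimated as well. So if P(Si) is known or can be estimated from a cohort in addition to P(Di∣Si), we can estimate P(Si∣sampled) from the case-control sample and plug in

$$\hat{q}_i = \frac{\hat{P}(D_i \mid S_i, \text{sampled})\, \hat{P}(S_i \mid \text{sampled})}{P(D_i \mid S_i)\, P(S_i)}.$$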
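A small sketch of these sampling weights follows. It uses the fact that the product of the two estimated factors in the numerator is simply the fraction of the case-control sample falling in the cell (Si, Di). The function name, the input layout, and the numbers are illustrative assumptions; the prevalences 0.1 and 0.2 and P(S) = 0.5 echo the first simulation setting of Section 3.

```python
from collections import Counter

def pmle_weights(sample, p_d_given_s, p_s):
    """q for each (s, d) cell: sampled cell fraction over population cell probability.

    sample:      list of (s, d) pairs in the case-control sample
    p_d_given_s: dict s -> P(D = 1 | S = s) in the population
    p_s:         dict s -> P(S = s) in the population
    """
    n = len(sample)
    q = {}
    for (s, d), count in Counter(sample).items():
        p_d = p_d_given_s[s] if d == 1 else 1.0 - p_d_given_s[s]
        q[(s, d)] = (count / n) / (p_d * p_s[s])
    return q

sample = [(s, d) for s in (0, 1) for d in (0, 1) for _ in range(100)]
q = pmle_weights(sample, p_d_given_s={0: 0.1, 1: 0.2}, p_s={0: 0.5, 1: 0.5})
```

Cases are heavily oversampled relative to controls here, so they receive the larger q and hence the smaller weight 1/q in (8).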

Furthermore, we estimate F0 by

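$$\hat{F}_0(c) = \sum_{i=1}^{n} \frac{1}{\hat{q}_i} I\left(\frac{Y_i - \hat{\gamma}^T U_i}{e^{\hat{\delta}^T W_i}} \le c\right) \Big/ \sum_{i=1}^{n} \frac{1}{\hat{q}_i}.$$

The covariate-specific marker distribution estimate is

$$\hat{F}_z(y) = \hat{F}_0\left(\frac{y - \hat{\gamma}^T u}{e^{\hat{\delta}^T w}}\right),$$

where u = U(z) and w = W(z). The corresponding vth quantile is

$$\hat{F}_z^{-1}(v) = \hat{\gamma}^T u + e^{\hat{\delta}^T w} \hat{F}_0^{-1}(v) \text{ for } v \in (0, 1).$$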
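When U(Z) and W(Z) are dummy variables for the levels of a discrete Z, the weighted equations (8) decouple by group and are solved by the weighted mean and weighted variance within each group. The sketch below covers only that special case; the function names and the toy data are assumptions, and a spline-based U(Z) or W(Z) would need a numerical solver instead.

```python
import math
from collections import defaultdict

def fit_location_scale(y, z, q):
    """Solve (8) with group-indicator U(Z), W(Z); weights are 1/q_i."""
    groups = defaultdict(list)
    for yi, zi, qi in zip(y, z, q):
        groups[zi].append((yi, 1.0 / qi))
    mu, sigma = {}, {}
    for g, obs in groups.items():
        total = sum(w for _, w in obs)
        mu[g] = sum(w * yi for yi, w in obs) / total
        sigma[g] = math.sqrt(sum(w * (yi - mu[g]) ** 2 for yi, w in obs) / total)
    # Pool standardized residuals, with their weights, to estimate F_0.
    resid = sorted(((yi - mu[zi]) / sigma[zi], 1.0 / qi)
                   for yi, zi, qi in zip(y, z, q))
    return mu, sigma, resid

def baseline_quantile(resid, v):
    """Smallest c whose weighted empirical CDF F_0(c) is at least v."""
    total = sum(w for _, w in resid)
    running = 0.0
    for c, w in resid:
        running += w
        if running >= v * total:
            return c
    return resid[-1][0]

def marker_quantile(mu, sigma, resid, group, v):
    """F_z^{-1}(v) = location + scale * F_0^{-1}(v) for covariate group `group`."""
    return mu[group] + sigma[group] * baseline_quantile(resid, v)

y = [1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
z = ["a"] * 5 + ["b"] * 5
q = [1.0] * 10                      # equal sampling probabilities in this toy case
mu, sigma, resid = fit_location_scale(y, z, q)
```

In the toy data the two groups share the same standardized shape, so pooling residuals doubles the information available for estimating F0, which is how the model borrows strength across covariate values.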

Estimators of the covariate-specific predictiveness curve for Y given Z = z which we call PMLE are

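$$\hat{R}_z(v) = G\{\hat{\theta}, \hat{F}_z^{-1}(v), z\} \text{ for } v \in (0, 1), \qquad \hat{R}_z^{-1}(p) = \hat{F}_z\{G^{-1}(\hat{\theta}, p, z)\} \text{ for } p \in \{R_z(v) : v \in (0, 1)\}.$$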

Note that compared to the MLE method in Section 2.1, the PMLE method requires additional auxiliary information about the distribution of S. If the distribution of Y conditional on Z is independent of the matching stratum S, we can modify the PMLE method such that the extra piece of information about P(S) is not necessary. This is satisfied in the current setting since S can be represented as a function of Z. The concept is to ensure unbiasedness of the estimating equations in (8) averaged across the distribution of Y within each matching stratum rather than averaged over the entire population. It is then sufficient to adjust for selection bias within each stratum. For any two subjects i, j in the same stratum, we choose qi and qj such that their relative weights are

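$$\frac{q_i}{q_j} = \frac{P(\text{sampled} \mid D_i, Y_i, Z_i, S_i)}{P(\text{sampled} \mid D_j, Y_j, Z_j, S_j)}.$$

Given Si = Sj, we need

$$\frac{P(\text{sampled} \mid D_i, Y_i, Z_i, S_i)}{P(\text{sampled} \mid D_j, Y_j, Z_j, S_j)} = \frac{P(\text{sampled} \mid D_i, S_i)}{P(\text{sampled} \mid D_j, S_j)} = \frac{P(D_i, S_i \mid \text{sampled})\, P(D_j, S_j)}{P(D_j, S_j \mid \text{sampled})\, P(D_i, S_i)} = \frac{P(D_i \mid S_i, \text{sampled})\, P(D_j \mid S_j)}{P(D_j \mid S_j, \text{sampled})\, P(D_i \mid S_i)}$$

to guarantee unbiasedness of (8) within each matching stratum. One choice of $q_i$ is $P(D_i \mid S_i, \text{sampled})\, P(1 - D_i \mid S_i)$, since when $D_i = D_j$, $P(1 - D_i \mid S_i)/P(1 - D_j \mid S_j) = P(D_j \mid S_j)/P(D_i \mid S_i) = 1$, and when $D_i = 1 - D_j$, $P(1 - D_i \mid S_i)/P(1 - D_j \mid S_j) = P(D_j \mid S_i)/P(D_i \mid S_j) = P(D_j \mid S_j)/P(D_i \mid S_i)$. We plug $\hat{q}_i = \hat{P}(D_i \mid S_i, \text{sampled})\, P(1 - D_i \mid S_i)$ into (8).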
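The within-stratum weights can be sketched as follows; only the stratum prevalences are needed, not P(S). The function name and the sample layout are hypothetical, and the prevalences again echo the first simulation setting.

```python
from collections import Counter

def pmle_m_weights(sample, rho):
    """q for each (s, d) cell: P^(D = d | S = s, sampled) * P(D = 1 - d | S = s).

    sample: list of (s, d) pairs in the case-control sample
    rho:    dict s -> P(D = 1 | S = s) in the population
    """
    per_stratum = Counter(s for s, _ in sample)
    q = {}
    for (s, d), count in Counter(sample).items():
        p_opposite = 1.0 - rho[s] if d == 1 else rho[s]   # P(D = 1 - d | S = s)
        q[(s, d)] = (count / per_stratum[s]) * p_opposite
    return q

sample = [(s, d) for s in (0, 1) for d in (0, 1) for _ in range(100)]
q_m = pmle_m_weights(sample, rho={0: 0.1, 1: 0.2})
```

Within each stratum the case-to-control ratio of these weights (9 and 4 in this example) equals the ratio obtained from the PMLE weights, which is what unbiasedness within strata requires.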

The covariate-specific predictiveness curve for marker Y can be derived as before. We call the estimator PMLE-M. Note that PMLE and PMLE-M are equivalent when there is no matching in the case-control design. We use bootstrap resampling for variance estimation for both PMLE and PMLE-M estimators. Again the bootstrap procedure is performed to mimic the study design.

3. Simulation Studies

We evaluate our methodology using two simulation settings. In the first setting, a binary covariate Z takes value 1 with probability 0.5 in the population. Suppose Z is also the covariate of interest for evaluating the marker’s predictiveness. The disease prevalences given Z = 0, 1 are 0.1 and 0.2 respectively. A two-phase study is performed. In the first phase, a cohort sample is obtained with Z and D measured for all subjects. In the second phase, a nested case-control sample is obtained from the parent cohort, where cases and controls are frequency matched according to Z. Within each matching stratum defined by the value of Z, equal numbers of cases and controls are sampled, and sample sizes are constant across strata. The marker Y conditional on D and Z is normally distributed,

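$$Y \mid D = 1, Z = 0 \sim N(0.5, 1), \quad Y \mid D = 0, Z = 0 \sim N(0, 1), \quad Y \mid D = 1, Z = 1 \sim N(1, 1), \quad Y \mid D = 0, Z = 1 \sim N(0.5, 1).$$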

We compare the three estimators of the covariate Z-specific predictiveness curves developed in Section 2: the semiparametric MLE method assuming the risk is linear logistic in Y and Z, and the PMLE and PMLE-M methods where location and log-scale parameters for the marker distribution are modeled as linear in Z. The bootstrap is performed to construct confidence intervals for those estimators. Specifically, during each bootstrap resampling step, the first phase cohort is randomly sampled with replacement to incorporate variability in estimating P(Z) and P(D = 1∣Z); and cases and controls are resampled separately within each Z stratum.

In the second setting, we have a continuous covariate Z. The overall prevalence of D is 0.15. We have Y and Z bivariate normally distributed with correlation 0.5 conditional on D,

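$$\begin{pmatrix} Y \\ Z \end{pmatrix} \Big| D = 0 \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}\right), \qquad \begin{pmatrix} Y \\ Z \end{pmatrix} \Big| D = 1 \sim N\left(\begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}\right).$$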

During the simulation cases are randomly sampled from the case population and for each case, a control with the same Z value is generated. In addition, we assume there is another independent cohort sample available with information about the joint distribution of D and Z. We again compare the three estimators of covariate Z-specific predictiveness curves. For the semiparametric MLE estimator, the risk is modeled as linear logistic in Y and discretized Z (where cutoff points are chosen to be quintiles of Z in the case population). For the PMLE and PMLE-M estimators, location and log(scale) parameters for the marker distribution are modeled as linear in Z. The bootstrap is performed to construct confidence interval for these estimators. Specifically, during each bootstrap resampling step, the independent cohort is randomly sampled as well as the case-control pairs.

For total sample size (cases plus controls) varying from 200 to 1200, we evaluate estimates of $R_z(v)$ and $R_z^{-1}(p)$ for v = 0.1, 0.3, 0.5, 0.7, 0.9 and corresponding p = $R_z(v)$. In both simulation settings, the sample size for the cohort is set to be 6 times that of the total sample size for the case-control sample. Results are based on 1000 simulated datasets in each simulation setting, and for each simulation, 500 bootstrap samples are generated.
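The true curve in the first setting follows directly from the two-step formulation: invert the prevalence-weighted mixture CDF of the marker, then apply Bayes' theorem at that quantile. The sketch below does this for Z = 0 by bisection; the function name and the bisection bounds are choices made for this example. It gives values close to the true Rz(v) row of Table 1.

```python
from statistics import NormalDist

def true_curve_setting1(v, prevalence=0.1, mu_case=0.5, mu_control=0.0):
    """True R_z(v) for one covariate group with unit-variance normal markers."""
    case = NormalDist(mu_case, 1.0)
    control = NormalDist(mu_control, 1.0)

    def marker_cdf(y):
        # F_z is the prevalence-weighted mixture of the case and control CDFs.
        return prevalence * case.cdf(y) + (1.0 - prevalence) * control.cdf(y)

    lo, hi = -10.0, 10.0
    for _ in range(100):            # bisection for F_z^{-1}(v)
        mid = (lo + hi) / 2.0
        if marker_cdf(mid) < v:
            lo = mid
        else:
            hi = mid
    y_v = (lo + hi) / 2.0
    # Posterior odds = prior odds times the likelihood ratio at y_v.
    odds = prevalence / (1.0 - prevalence) * case.pdf(y_v) / control.pdf(y_v)
    return odds / (1.0 + odds)

true_values = [true_curve_setting1(v) for v in (0.1, 0.3, 0.5, 0.7, 0.9)]
```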

For the first setting, simulation results for Z = 0 are presented in Tables 1 and 2. For the second setting, simulation results for Z equal to the median in the case population are presented in Tables 3 and 4. We find that in both settings the three estimators have reasonably good performance. Bias was minimal for $R_z(v)$ and for $R_z^{-1}(p)$. Coverage of the 95% confidence intervals constructed from the bootstrap distribution is fairly close to the nominal level. In the first setting where there are two covariate groups, the MSEs for the three estimators are similar in magnitude. The MLE estimator is slightly more efficient than the PMLE and PMLE-M estimators for certain v. As the number of (discretized) covariate groups increases to five (the second setting), the MLE estimator becomes much less efficient compared to the PMLE and PMLE-M estimators.

Table 1.

Performance of the covariate-specific predictiveness curve estimators for the first simulation setting in Section 3 with P(D∣Z) (and f(Z)) estimated. Shown are results for $\hat{R}_z(v)$ with z = 0.

| Measure | n | Estimator | v = 0.1 | v = 0.3 | v = 0.5 | v = 0.7 | v = 0.9 |
|---|---|---|---|---|---|---|---|
| True $R_z(v)$ | | | 0.05 | 0.072 | 0.091 | 0.116 | 0.161 |
| % bias in $\hat{R}_z(v)$ | 200 | MLE | 6.08 | 2.77 | 1.64 | 0.95 | 0.59 |
| | | PMLE-M | 6.62 | 2.73 | 1.48 | 1.12 | 0.84 |
| | | PMLE | 6.59 | 2.64 | 1.4 | 1.15 | 0.83 |
| | 500 | MLE | 2.42 | 0.95 | 0.26 | −0.1 | −0.56 |
| | | PMLE-M | 2.63 | 1.01 | 0.38 | −0.03 | −0.48 |
| | | PMLE | 2.64 | 1.01 | 0.33 | −0.05 | −0.48 |
| | 1200 | MLE | 1.75 | 0.73 | 0.14 | −0.38 | −0.89 |
| | | PMLE-M | 1.84 | 0.71 | 0.09 | −0.36 | −0.91 |
| | | PMLE | 1.87 | 0.69 | 0.06 | −0.37 | −0.9 |
| MSE efficiency relative to MLE | 200 | PMLE-M | 1.01 | 1 | 0.99 | 0.91 | 0.89 |
| | | PMLE | 1.01 | 1 | 0.98 | 0.92 | 0.89 |
| | 500 | PMLE-M | 0.98 | 0.99 | 0.99 | 0.97 | 0.96 |
| | | PMLE | 0.98 | 0.99 | 1 | 0.98 | 0.96 |
| | 1200 | PMLE-M | 0.99 | 0.97 | 0.98 | 0.97 | 0.96 |
| | | PMLE | 0.99 | 0.97 | 0.97 | 0.97 | 0.97 |
| 95% percentile bootstrap CI coverage (%) | 200 | MLE | 93.1 | 95 | 97.4 | 98.4 | 95.4 |
| | | PMLE-M | 93.4 | 95.7 | 98.1 | 97.7 | 94.5 |
| | | PMLE | 93.3 | 95.7 | 98 | 97.5 | 94.5 |
| | 500 | MLE | 92.7 | 94.3 | 96 | 95.2 | 93.4 |
| | | PMLE-M | 92.1 | 94.5 | 95.5 | 95.4 | 92.7 |
| | | PMLE | 92.1 | 94.8 | 95.3 | 95.5 | 92.6 |
| | 1200 | MLE | 92.8 | 94.6 | 95.6 | 94.3 | 92.6 |
| | | PMLE-M | 92.9 | 94.5 | 95.7 | 94.9 | 92 |
| | | PMLE | 92.8 | 94.8 | 95.9 | 94.6 | 92.1 |

Table 2.

Performance of the covariate-specific predictiveness curve estimators for the first simulation setting in Section 3 with P(D∣Z) (and f(Z)) estimated. Shown are results for $\hat{R}_z^{-1}(p)$ with z = 0.

| Measure | n | Estimator | p = 0.05 | p = 0.072 | p = 0.091 | p = 0.116 | p = 0.16 |
|---|---|---|---|---|---|---|---|
| True $R_z^{-1}(p)$ | | | 0.10 | 0.30 | 0.50 | 0.70 | 0.90 |
| % bias in $\hat{R}_z^{-1}(p)$ | 200 | MLE | 5.04 | −8.82 | −5.28 | −0.52 | −0.12 |
| | | PMLE-M | 4.11 | −9.01 | −4.68 | −0.35 | −0.09 |
| | | PMLE | 4.13 | −8.9 | −4.65 | −0.35 | −0.11 |
| | 500 | MLE | 1.96 | −3.27 | −1.5 | 0.54 | 0.36 |
| | | PMLE-M | 1.71 | −3.53 | −1.64 | 0.49 | 0.31 |
| | | PMLE | 1.82 | −3.5 | −1.59 | 0.5 | 0.3 |
| | 1200 | MLE | −2 | −2.27 | −0.43 | 0.62 | 0.48 |
| | | PMLE-M | −1.99 | −2.26 | −0.46 | 0.64 | 0.52 |
| | | PMLE | −2.14 | −2.19 | −0.42 | 0.64 | 0.51 |
| MSE efficiency relative to MLE | 200 | PMLE-M | 1 | 0.98 | 0.97 | 0.97 | 0.9 |
| | | PMLE | 1 | 0.98 | 0.97 | 0.98 | 0.9 |
| | 500 | PMLE-M | 1 | 0.97 | 0.99 | 0.96 | 0.95 |
| | | PMLE | 0.99 | 0.97 | 1 | 0.97 | 0.96 |
| | 1200 | PMLE-M | 0.99 | 0.98 | 0.98 | 0.97 | 0.94 |
| | | PMLE | 0.99 | 0.98 | 0.98 | 0.98 | 0.96 |
| 95% percentile bootstrap CI coverage (%) | 200 | MLE | 93.2 | 94.9 | 97.4 | 98.3 | 95.5 |
| | | PMLE-M | 93 | 95.8 | 98.1 | 97.5 | 94.3 |
| | | PMLE | 93.3 | 95.7 | 97.9 | 97.7 | 94.4 |
| | 500 | MLE | 92.9 | 94.5 | 96 | 95.3 | 93.5 |
| | | PMLE-M | 92.1 | 94.6 | 95.5 | 95.5 | 92.6 |
| | | PMLE | 92.2 | 94.7 | 95.2 | 95.4 | 92.4 |
| | 1200 | MLE | 92.9 | 94.6 | 95.8 | 94.2 | 92.4 |
| | | PMLE-M | 92.9 | 94.3 | 95.8 | 94.8 | 92 |
| | | PMLE | 92.9 | 94.8 | 96.1 | 94.8 | 92.2 |

Table 3.

Performance of the covariate-specific predictiveness curve estimators for the second simulation setting in Section 3, with P(D∣Z) (and f(Z)) estimated. Shown are results for $\hat{R}_z(v)$ with z being the median of Z in the case population.

| Measure | n | Estimator | v = 0.1 | v = 0.3 | v = 0.5 | v = 0.7 | v = 0.9 |
|---|---|---|---|---|---|---|---|
| True $R_z(v)$ | | | 0.022 | 0.054 | 0.097 | 0.173 | 0.358 |
| % bias in $\hat{R}_z(v)$ | 200 | MLE | 13.08 | 3.47 | 0.63 | −0.93 | −0.22 |
| | | PMLE-M | 3.71 | −0.37 | −1.13 | −1.06 | −0.2 |
| | | PMLE | 3.82 | −0.5 | −0.95 | −0.73 | −0.15 |
| | 500 | MLE | 9.75 | 3.22 | 1.25 | 0.03 | −0.97 |
| | | PMLE-M | 0.95 | −0.41 | −0.45 | −0.04 | 0.23 |
| | | PMLE | 1.25 | −0.47 | −0.42 | −0.08 | 0.28 |
| | 1200 | MLE | 6.69 | 3.35 | 1.99 | 0.63 | −1.04 |
| | | PMLE-M | 0.98 | −0.01 | −0.04 | 0.12 | 0.2 |
| | | PMLE | 0.9 | −0.03 | −0.05 | 0.06 | 0.16 |
| MSE efficiency relative to MLE | 200 | PMLE-M | 1.85 | 1.55 | 1.55 | 1.31 | 1.37 |
| | | PMLE | 1.85 | 1.52 | 1.38 | 1.23 | 1.13 |
| | 500 | PMLE-M | 1.94 | 1.62 | 1.44 | 1.38 | 1.25 |
| | | PMLE | 1.74 | 1.42 | 1.34 | 1.15 | 1.19 |
| | 1200 | PMLE-M | 1.92 | 1.7 | 1.62 | 1.15 | 1.21 |
| | | PMLE | 1.86 | 1.62 | 1.5 | 0.99 | 1.11 |
| 95% percentile bootstrap CI coverage (%) | 200 | MLE | 96 | 96.4 | 97.9 | 99.5 | 98.3 |
| | | PMLE-M | 95.1 | 95.1 | 95.5 | 96.1 | 96.7 |
| | | PMLE | 93.8 | 93 | 93.2 | 95.5 | 96.7 |
| | 500 | MLE | 95 | 95.1 | 96.9 | 97.9 | 96.8 |
| | | PMLE-M | 94.2 | 93.1 | 94.7 | 96.7 | 95 |
| | | PMLE | 92.7 | 92.5 | 92.9 | 96.3 | 95.3 |
| | 1200 | MLE | 94.1 | 95.3 | 96 | 97.8 | 95.9 |
| | | PMLE-M | 94.6 | 95.5 | 95.4 | 95.4 | 95.7 |
| | | PMLE | 93.8 | 93.7 | 93.9 | 94.7 | 96.4 |

Table 4.

Performance of the covariate-specific predictiveness curve estimators for the second simulation setting in Section 3, with P(D∣Z) (and f(Z)) estimated. Shown are results for $\hat{R}_z^{-1}(p)$ with z being the median of Z in the case population.

| Measure | n | Estimator | p = 0.022 | p = 0.054 | p = 0.097 | p = 0.173 | p = 0.358 |
|---|---|---|---|---|---|---|---|
| True $R_z^{-1}(p)$ | | | 0.10 | 0.30 | 0.50 | 0.70 | 0.90 |
| % bias in $\hat{R}_z^{-1}(p)$ | 200 | MLE | 0.48 | −2.62 | −0.66 | 0.29 | 0.31 |
| | | PMLE-M | 7.49 | 1.27 | 0.41 | 0.4 | 0.16 |
| | | PMLE | 7.63 | 1.49 | 0.4 | 0.31 | 0.14 |
| | 500 | MLE | −5.95 | −2.98 | −1.42 | −0.07 | 0.37 |
| | | PMLE-M | 4.16 | 0.79 | 0.06 | 0.05 | 0.07 |
| | | PMLE | 4.08 | 0.77 | 0.1 | 0 | 0.09 |
| | 1200 | MLE | −6.4 | −3.08 | −1.63 | −0.36 | 0.33 |
| | | PMLE-M | 0.98 | 0.1 | 0.01 | −0.07 | 0.004 |
| | | PMLE | 2.24 | 0.36 | −0.03 | 0.01 | 0.01 |
| MSE efficiency relative to MLE | 200 | PMLE-M | 1.57 | 1.57 | 1.54 | 1.33 | 1.29 |
| | | PMLE | 1.51 | 1.5 | 1.43 | 1.13 | 1.16 |
| | 500 | PMLE-M | 1.42 | 1.59 | 1.53 | 1.45 | 1.36 |
| | | PMLE | 1.37 | 1.49 | 1.41 | 1.28 | 1.27 |
| | 1200 | PMLE-M | 1.62 | 1.65 | 1.74 | 1.23 | 1.36 |
| | | PMLE | 1.54 | 1.59 | 1.58 | 1.06 | 1.24 |
| 95% percentile bootstrap CI coverage (%) | 200 | MLE | 96.1 | 96.5 | 97.9 | 99.6 | 98.3 |
| | | PMLE-M | 95.3 | 95.1 | 95.8 | 96.1 | 96.8 |
| | | PMLE | 93.9 | 93.2 | 93.1 | 95.5 | 96.7 |
| | 500 | MLE | 95.2 | 95.1 | 96.7 | 98.2 | 96.8 |
| | | PMLE-M | 93.9 | 93.1 | 94.8 | 96.6 | 95 |
| | | PMLE | 92.6 | 92.5 | 92.9 | 96.2 | 95.2 |
| | 1200 | MLE | 94 | 95.2 | 96.1 | 97.8 | 96.1 |
| | | PMLE-M | 94.7 | 95.4 | 95.1 | 95.4 | 95.6 |
| | | PMLE | 93.7 | 94 | 93.8 | 94.7 | 96 |

Since the MLE method requires fewer modeling assumptions, it is more robust than the PMLE (PMLE-M) methods. In practice the MLE method is preferred when the number of covariate groups is relatively small and the number of observations in each group is not too small, e.g. when the covariate is the baseline risk category in the renal stenosis example. On the other hand, since the PMLE (PMLE-M) methods borrow information across covariate groups when estimating the marker distribution, they are expected to become more efficient relative to the MLE method as the number of covariate groups increases. In particular, when there are several continuous covariates, like the baseline covariates described in the renal stenosis example, discretizing these covariates in order to apply the MLE methodology becomes very difficult and the PMLE (PMLE-M) methods are more appealing.

To study the performance of our estimators when the disease is rare, we performed simulation studies in scenarios similar to those described above but with disease prevalence equal to 0.05 (in the first setting, we set P(D = 1∣Z = 0) = 0.03 and P(D = 1∣Z = 1) = 0.07). Our estimators were found to have satisfactory performance in both simulation studies (results omitted).

4. Application

We illustrate the methodology by evaluating serum creatinine as a risk prediction marker for renal artery stenosis in patients with therapy-resistant hypertension. The original cohort consists of 426 hypertensive patients undergoing renal angiography (Janssens et al., 2005; Krijnen et al., 1998). Baseline risk is modeled with age, smoking status (ever versus never) and their interaction, gender, hypertension, body mass index (BMI), abdominal bruit, atherosclerosis disease, and hypercholesterolaemia. Consider a low risk threshold of 0.1 below which no routine angiography is recommended and a high risk threshold of 0.4 above which routine angiography is encouraged. There are 162 subjects in the cohort with low baseline risk (10 cases), 176 subjects with medium baseline risk (33 cases), and 88 subjects with high baseline risk (55 cases). A stratified case-control sample of size 217 is generated, including all 98 cases and controls frequency matched according to baseline risk group. Within the low and medium baseline risk strata, the number of controls selected is twice that of cases, while within the high risk group, all controls are selected.
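The sample size of 217 can be checked directly from the stratum counts reported above. The following is a minimal sketch of that arithmetic; the dictionary layout and the helper name `sampled_controls` are illustrative choices, not part of the original analysis.

```python
# Cohort counts by baseline risk stratum, as reported in the text.
cohort = {
    "low": {"n": 162, "cases": 10},
    "medium": {"n": 176, "cases": 33},
    "high": {"n": 88, "cases": 55},
}


def sampled_controls(stratum, info):
    """Controls drawn per stratum: two per case in low/medium, all in high."""
    available = info["n"] - info["cases"]
    if stratum == "high":
        return available
    return min(2 * info["cases"], available)


n_cases = sum(info["cases"] for info in cohort.values())
n_controls = {name: sampled_controls(name, info) for name, info in cohort.items()}
sample_size = n_cases + sum(n_controls.values())

print(n_cases, n_controls, sample_size)
```

All 98 cases plus 20, 66 and 33 controls from the low, medium and high strata give the stated total.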

Figure 1 displays predictiveness curves for serum creatinine for patients within different baseline risk strata. The covariate is baseline risk category, the same as the matching stratum. The MLE method was employed for estimation, and the bootstrap was used to construct confidence intervals, with variability in the estimated stratum-specific disease prevalence incorporated. Observe that for those patients who have low baseline risks, after measuring serum creatinine, there is a $\hat{R}_z^{-1}(0.1) = 94.8\%$ chance that they will remain classified as low risk (with 95% CI (58.6%, 100%)), a small chance that their risk will be elevated to the intermediate risk range ($\hat{R}_z^{-1}(0.4) - \hat{R}_z^{-1}(0.1) = 5.2\%$ with 95% CI (0, 41.4%)), and a negligible chance that their estimated risks will be high enough to receive the recommendation to undergo renal angiography ($1 - \hat{R}_z^{-1}(0.4) = 0$ with 95% CI (0, 0)). For those patients who are originally in the medium risk range, after measuring serum creatinine, they have a 24.8% chance of being reclassified as low risk (with 95% CI (1.1%, 45.2%)), a 7.1% chance of being reclassified as high risk (with 95% CI (0.8%, 17.0%)), and a 68.1% chance of remaining in the intermediate risk grey zone (95% CI (44.5%, 96.0%)). For those patients whose baseline risks are high, if their serum creatinine levels are measured, their estimated risks have a 90.9% chance of remaining high (with 95% CI (78.9%, 100%)), a 9.1% chance of being medium (with 95% CI (0, 20.5%)), and almost zero chance of being low enough to recommend against renal angiography (with 95% CI (0, 0)).
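The three reclassification probabilities quoted for each stratum are all read off the inverse predictiveness curve at the two thresholds. As a minimal sketch, suppose the estimated risk distribution within a covariate group is available as a set of risk values with probability masses. The function names and the numbers in the example are hypothetical and are not taken from the renal stenosis data.

```python
def risk_cdf(risks, masses, p):
    """Inverse predictiveness curve at p: mass of points with risk at most p."""
    total = sum(masses)
    return sum(m for r, m in zip(risks, masses) if r <= p) / total


def reclassification(risks, masses, low=0.1, high=0.4):
    """Chance of landing in the low, medium and high risk ranges."""
    below_low = risk_cdf(risks, masses, low)
    below_high = risk_cdf(risks, masses, high)
    return {
        "low": below_low,                  # R^{-1}(low)
        "medium": below_high - below_low,  # R^{-1}(high) - R^{-1}(low)
        "high": 1.0 - below_high,          # 1 - R^{-1}(high)
    }


# Hypothetical risk values and masses for one covariate group.
example_risks = [0.03, 0.06, 0.09, 0.15, 0.45]
example_masses = [0.2, 0.3, 0.2, 0.2, 0.1]
example_probs = reclassification(example_risks, example_masses)
print(example_probs)
```

The three probabilities sum to one by construction, which is why the intervals reported in the text for a given stratum are not independent of one another.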

As illustrated here, the baseline risk category specific predictiveness curves, together with a patient’s baseline risk classification, can be used to help him/her make decisions regarding whether or not it is worthwhile to have the biomarker measured. A subject’s choice will be affected by the marker’s potential impact on changing one’s risk group classification and consequent effects on medical recommendations. Specific risk thresholds on the curve that pertain to an individual will depend on the patient’s perceptions about costs and benefits of medical interventions recommended on the basis of his risk value. In addition, the cost of measuring the raw marker may enter into consideration. Suppose a patient is categorized as having medium baseline risk where the recommendations about having angiography or not are unclear. If the patient prefers to have a clear recommendation and is not concerned too much about having an extra blood test, he might choose to have serum creatinine measured since there is approximately a 31.9% chance that he can be reclassified as high or low risk with the additional information provided by the marker. Consider another patient who is in the low baseline risk category, and suppose he has a low tolerance for risk such that he will decide on having renal angiography as long as his risk is above the low risk threshold 0.1 and having an additional blood test does not bother him, then he might choose to have serum creatinine measurement since there is around 5.2% chance that he will move into the medium risk group. Another patient who also has low tolerance for disease risk, however, might choose not to measure serum creatinine if this patient feels that 5.2% chance of changing his decision is too small to be worth the cost or inconvenience of a blood test. Yet, another patient in the low baseline risk category might opt for renal angiography only if his risk is above the high risk threshold of 0.4. 
This patient is very likely to decide against measuring serum creatinine since it is highly unlikely that his decision will be changed with the additional test.

For individual patients, detailed information about one’s baseline covariate values may be more informative than the risk group to which one belongs. Not only do individual baseline risks vary within a baseline risk category, but the new marker’s distribution may also depend on the baseline covariates. Whenever information about baseline covariates is available, the covariate-specific predictiveness curve can be used to tailor a patient’s view about the worth of having an extra test. To illustrate this, we display covariate-specific predictiveness curves for three individuals using the PMLE method (Figure 2). Results based on the PMLE-M method are fairly similar (omitted). Subject 1 is a 57-year-old female with BMI 26.8 kg/m² who has a smoking history, hypertension and hypercholesterolaemia. Her baseline risk is 25.6%. Subject 2 is a 65-year-old male with BMI 26.0 kg/m² who has atherosclerosis disease. His baseline risk is 38.1%. Subject 3 is a 57-year-old female with BMI 22.2 kg/m² who has a smoking history and atherosclerosis disease. Her baseline risk is 42.6%. Next we look closely at the expected impact of measuring serum creatinine on each patient.

Subject 1 has her baseline risk in the middle of the equivocal range. If serum creatinine is included in the risk calculation, she has an 8.6% chance of being reclassified as low risk (with 95% CI (0%, 50.6%)), an 11.9% chance (95% CI (2.5%, 51.2%)) of being reclassified as high risk, and a 79.5% chance (95% CI (33.1%, 88.8%)) of remaining in the equivocal zone. Subject 2 is originally at the high end of the equivocal risk range. If his serum creatinine level is measured, he has only a 0.5% chance of being declared low risk (with 95% CI (0%, 10.2%)) and a 50.7% chance (95% CI (7.8%, 84.6%)) of remaining in the risk grey zone. At the same time, there is a 48.8% chance (95% CI (9.7%, 91.9%)) that he will be classified as high risk and receive the recommendation to undergo angiography. Subject 3 has her baseline risk marginally above the high risk threshold. By measuring serum creatinine, there is almost zero possibility (with 95% CI (0, 5.6%)) that she will be classified as low risk; there is a 69.4% chance (95% CI (28.4%, 96.6%)) that her risk will remain high and a 30.6% chance (95% CI (3.4%, 69.4%)) that her risk will fall in the equivocal range, where no recommendation for or against angiography can be made.

Providing a patient with his/her baseline covariate-specific predictiveness curve again can help him/her to make a decision about having the marker measured. For example, if subject 3 is willing to have routine angiography unless her risk of renal stenosis is below 10%, she might choose not to measure serum creatinine since it is highly unlikely that her estimated risk will go below 10% based on the extra test.

Ideally, baseline covariate-specific predictiveness curves provide finer and more detailed information to patients for decision making compared to baseline risk category specific predictiveness curves. Yet, the latter are much easier to implement in medical practice and can be prepared in advance of a clinic visit. The same chart with baseline risk category specific predictiveness curves or a table with reclassification probabilities based on baseline risk stratum can be provided to every patient, while baseline covariate-specific predictiveness curves need to be prepared for each patient separately according to his/her baseline covariate information.

5. Discussion

In this article we developed methodology for estimating the covariate-specific predictiveness curve for a single marker in a case-control study. Our methods accommodate matching in the design. Two types of semiparametric approaches are examined, assuming the risk is monotone increasing in the marker value. The first approach is based on the empirical likelihood of Y and is applicable to discrete covariates. The method can be easily generalized to the setting where Y is multivariate by modeling the empirical likelihood of $\mathrm{Risk}_z(Y)$, following arguments similar to those in Huang (2007). Note that another related estimator can be constructed by estimating $F_{Dz}$ and $F_{\bar{D}z}$ empirically using the case or control sample within the zth covariate group. Since the exponential tilt relationship (2) implied by the risk model is not incorporated into estimation of the marker distribution, this estimator is probably less efficient than the MLE. This has been demonstrated for the overall predictiveness curve estimated with a case-control study (Huang, 2007).

The second approach combines estimation of the risk model with estimation of the covariate-specific marker distribution assuming a semiparametric location-scale model for Y. Different weighting schemes can be applied to accommodate the biased sampling design. This approach is applicable to general settings and is analogous to the semiparametric approach proposed in Huang et al. (2007) for a cohort design. To generalize this method to multivariate Y, we can obtain $\widehat{\mathrm{Risk}}(Y_i)$, i = 1, …, n, using, say, the pseudo-maximum likelihood method, and then estimate the covariate-specific distribution of Risk(Y) by plugging $\widehat{\mathrm{Risk}}(Y_i)$ into the weighted semiparametric location-scale model for logit{Risk(Y)}.

The predictiveness curve, when applied to the whole population, has been proposed as a graphical tool for evaluating the risk prediction capacity of markers or risk models (Pepe et al., 2008a). As we demonstrate in this paper, the predictiveness curve, conditional on covariates of interest, provides valuable information for individual decision making regarding whether or not to have an additional marker measured on top of baseline covariates. This has direct and important application to clinical practice. In this paper, we illustrate application of the covariate-specific predictiveness curve using an example evaluating serum creatinine as a risk prediction marker for renal artery stenosis. The approach can also be applied to other clinical settings. For example, Cook (2007) discussed an example in the context of cardiovascular disease, where estimated 10-year risk categories of cardiovascular events (< 5%, 5–20%, > 20%) are used in treatment guidelines. She showed that by adding high-density lipoprotein cholesterol (HDL-C) to a baseline risk model including age, systolic blood pressure, smoking, and total cholesterol, the reclassification of a patient to different risk categories can be used to determine whether it may be important to measure HDL-C in an individual. This is exactly what we want to achieve using the covariate-specific predictiveness curve for discrete covariates. We developed a rigorous MLE procedure for estimation. Furthermore, the PMLE approach we developed for estimating the covariate-specific predictiveness curve given continuous covariates can provide patients with more detailed information for decision making.

We have previously developed methods for estimating the predictiveness curve in the whole population based on a cohort sample (Huang et al., 2007) or a case-control sample (Huang, 2007; Huang and Pepe, 2009). We also proposed a semiparametric method for estimating the covariate-specific predictiveness curve on the basis of a cohort sample (Huang et al., 2007). In this paper, we make an additional contribution with respect to previous work by extending estimation of the covariate-specific predictiveness curve to general study designs:

  1. Our approaches are applicable to case-control designs or two-phase study designs, and can accommodate matching during sampling. The estimator developed in Huang et al. (2007) for estimating the covariate-specific predictiveness curve is a special case of our PMLE estimator in the scenario of a cohort sample.

  2. For discrete covariates, we provide a more efficient MLE estimator compared to the PMLE estimator. For a saturated risk model, this procedure is equivalent to estimating the predictiveness curve using observations from each covariate group separately and employing the semiparametric-efficient method proposed in Huang (2007). Yet for unsaturated risk models, estimating the covariate-specific predictiveness curve using data from all the covariate groups is more efficient since it borrows information across covariate groups.

Note that from a matched case-control biomarker study, there are other types of predictiveness curves we might be interested in estimating besides the covariate-specific predictiveness curve investigated here. For example, we might be interested in the population level predictiveness curve for risk based on both the biomarker and the covariates. Alternatively, we might want to investigate the covariate-adjusted predictiveness curve for the marker, characterized by a weighted summary of the covariate-specific predictiveness curves with weights related to the covariate distribution. These different types of predictiveness curves can be used to answer different scientific questions. Estimation of these curves in a matched case-control study can be achieved by combining estimates of the covariate-specific predictiveness Rz(v) vs v with additional information. This is a topic of current research.

Acknowledgments

The authors are grateful for support provided by NIH grants GM-54438 and NCI grants CA86368.

Appendix

A1: Proof of Theorems 1 and 2

Proof of Theorem 1

By Taylor’s expansion,

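$$
\begin{aligned}
\sqrt{n}\{\hat{R}_z(v) - R_z(v)\} &= \sqrt{n}\left[G\{\hat{\theta}, \hat{F}_z^{-1}(v), z\} - G\{\theta, F_z^{-1}(v), z\}\right] \\
&= \left\{\left.\frac{\partial G(s, y, z)}{\partial y}\right|_{s=\theta,\, y=F_z^{-1}(v)}\right\}^T \sqrt{n}\{\hat{F}_z^{-1}(v) - F_z^{-1}(v)\} \\
&\quad + \left(\left.\frac{\partial G(s, y, z)}{\partial s}\right|_{s=\theta,\, y=F_z^{-1}(v)}\right)^T \sqrt{n}(\hat{\theta} - \theta) + o_p(1).
\end{aligned}
$$

The result follows according to the delta method. Asymptotic normality of $\hat{\theta}$ and $\hat{F}_z$ follows by arguments similar to those in Huang (2007).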

Proof of Theorem 2

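$$
\begin{aligned}
\sqrt{n}\{\hat{R}_z^{-1}(p) - R_z^{-1}(p)\} &= \sqrt{n}\left[\hat{F}_z\{G^{-1}(\hat{\theta}, p, z)\} - F_z\{G^{-1}(\theta, p, z)\}\right] \\
&= \sqrt{n}\left[\hat{F}_z\{G^{-1}(\theta, p, z)\} - F_z\{G^{-1}(\theta, p, z)\}\right] \\
&\quad + \sqrt{n}\left[F_z\{G^{-1}(\hat{\theta}, p, z)\} - F_z\{G^{-1}(\theta, p, z)\}\right] + R_n,
\end{aligned}
$$

where

$$
R_n = \sqrt{n}\left[\hat{F}_z\{G^{-1}(\hat{\theta}, p, z)\} - \hat{F}_z\{G^{-1}(\theta, p, z)\}\right] - \left(\sqrt{n}\left[F_z\{G^{-1}(\hat{\theta}, p, z)\} - F_z\{G^{-1}(\theta, p, z)\}\right]\right) = o_p(1)
$$

by equicontinuity of the process $\sqrt{n}(\hat{F}_z - F_z)$. The result follows according to the delta method.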

A2: Estimation for Discrete Covariates

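Setting the first derivatives of (5) with respect to $p_{iz}$, $\theta_2$ and $\theta_0$ to zero, we get

$$
\begin{aligned}
\frac{\partial l}{\partial p_{iz}} &= \frac{1}{p_{iz}} - \lambda_{1z} - \lambda_{2z}\left[\exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\} - 1\right] = 0 \\
&\Rightarrow\; p_{iz} = \frac{1}{\lambda_{1z} + \lambda_{2z}\left[\exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\} - 1\right]}, \\
\sum_{i \in U_z} \frac{\partial l}{\partial p_{iz}}\, p_{iz} &= n_z - \lambda_{1z} - \lambda_{2z} + \lambda_{2z} = 0 \;\Rightarrow\; \lambda_{1z} = n_z, \\
\frac{\partial l}{\partial \theta_0} &= n_D - \sum_{z=1}^{\mathcal{Z}} \lambda_{2z} \sum_{i \in U_z} p_{iz} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\} = 0 \;\Rightarrow\; \sum_{z=1}^{\mathcal{Z}} \lambda_{2z} = n_D.
\end{aligned} \tag{9}
$$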

A2.1: Including separate main effect for each covariate group

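Suppose $Z^M$ is a vector of length Ƶ − 1 of dummy variables indicating covariate group, with the ith element being I(Z = i + 1), i = 1, …, Ƶ − 1. We have

$$
\begin{aligned}
\frac{\partial l}{\partial \theta_2} &= \sum_{z=1}^{\mathcal{Z}} \sum_{i \in U_{Dz}} Z_i^M - \sum_{z=1}^{\mathcal{Z}} \lambda_{2z} \sum_{i \in U_z} Z_i^M p_{iz} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\} = 0 \\
&\Rightarrow\; [\lambda_{2z}]_{z=2,\ldots,\mathcal{Z}} = \sum_{z=1}^{\mathcal{Z}} \sum_{i \in U_{Dz}} Z_i^M = [n_{Dz}]_{z=2,\ldots,\mathcal{Z}}.
\end{aligned} \tag{10}
$$

Thus $\lambda_{2z} = n_{Dz}$, z = 1, …, Ƶ. We have

$$
p_{iz} = \frac{1}{n_{\bar{D}z} + n_{Dz} LR_z(Y_i)} = \frac{1}{n_{\bar{D}z} + n_{Dz} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}}.
$$

Substituting $p_{iz}$ into (5) we have

$$
l = -\sum_{z=1}^{\mathcal{Z}} \sum_{i \in U_z} \log\left[n_{\bar{D}z} + n_{Dz} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}\right] + \sum_{z=1}^{\mathcal{Z}} \sum_{i \in U_{Dz}} \{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}.
$$

Thus the maximum likelihood estimators $\hat{\theta}_0$, $\hat{\theta}_1$, $\hat{\theta}_2$ and $\hat{\theta}_3$ solve

$$
\begin{aligned}
\frac{\partial l}{\partial \theta_0} &= n_D - \sum_{z=1}^{\mathcal{Z}} n_{Dz} \sum_{i \in U_z} \frac{\exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}}{n_{\bar{D}z} + n_{Dz} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}} = 0, \\
\frac{\partial l}{\partial \theta_1} &= \sum_{z=1}^{\mathcal{Z}} \sum_{i \in U_{Dz}} Y_i - \sum_{z=1}^{\mathcal{Z}} n_{Dz} \sum_{i \in U_z} \frac{Y_i \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}}{n_{\bar{D}z} + n_{Dz} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}} = 0, \\
\frac{\partial l}{\partial \theta_2} &= \sum_{z=1}^{\mathcal{Z}} \sum_{i \in U_{Dz}} Z_i^M - \sum_{z=1}^{\mathcal{Z}} n_{Dz} \sum_{i \in U_z} \frac{Z_i^M \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}}{n_{\bar{D}z} + n_{Dz} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}} = 0, \\
\frac{\partial l}{\partial \theta_3} &= \sum_{z=1}^{\mathcal{Z}} \sum_{i \in U_{Dz}} Y_i Z_i^I - \sum_{z=1}^{\mathcal{Z}} n_{Dz} \sum_{i \in U_z} \frac{Y_i Z_i^I \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}}{n_{\bar{D}z} + n_{Dz} \exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}} = 0,
\end{aligned}
$$

which are the score equations obtained by fitting the prospective logistic model $\mathrm{logit}\{P(D = 1 \mid Y, Z)\} = \theta_0 + \xi(Z) + \theta_1 Y + \theta_2^T Z^M + \theta_3^T Y Z^I$ to the case-control sample using $\xi(z) = \log(n_{Dz}/n_{\bar{D}z})$ as an offset term.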
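To make the offset idea concrete, the following is a minimal sketch of a logistic fit with a group-specific offset, using Newton-Raphson in plain Python. It is deliberately simplified and is not the authors' implementation: it keeps only an intercept and a single marker slope (no group main effects and no marker-by-covariate interaction), takes the offset for each subject as supplied by the caller, and uses small made-up data. The function name `fit_offset_logistic` and all data values are hypothetical.

```python
import math


def fit_offset_logistic(y, d, offset, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of logit P(D=1 | Y) = offset + b0 + b1 * Y."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, di, oi in zip(y, d, offset):
            mu = 1.0 / (1.0 + math.exp(-(oi + b0 + b1 * yi)))
            w = mu * (1.0 - mu)
            g0 += di - mu            # score for the intercept
            g1 += (di - mu) * yi     # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * yi
            h11 += w * yi * yi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


# Hypothetical matched sample: marker values for cases and controls by group.
groups = {
    1: {"cases": [1.2, 2.0, 0.4], "controls": [0.1, 0.9, 1.5, -0.3, 0.6, 0.2]},
    2: {"cases": [1.8, 0.7, 2.5, 1.1], "controls": [0.5, 1.3, 0.0, 0.8]},
}

y, d, offset = [], [], []
for g in groups.values():
    xi = math.log(len(g["cases"]) / len(g["controls"]))  # log(n_Dz / n_Dbar_z)
    for value in g["cases"]:
        y.append(value); d.append(1); offset.append(xi)
    for value in g["controls"]:
        y.append(value); d.append(0); offset.append(xi)

b0_hat, b1_hat = fit_offset_logistic(y, d, offset)
print(b0_hat, b1_hat)
```

In practice one would use an established logistic regression routine that accepts an offset; the hand-written Newton step is used here only to keep the sketch self-contained.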

A2.2: Without including separate main effect for each covariate group

Starting from (9)

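$$
p_{iz} = \frac{1}{n_z + \lambda_{2z}\{LR_z(Y_i, Z_i) - 1\}} = \frac{1}{n_z + \lambda_{2z}\left[\exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\} - 1\right]}
$$

and

$$
\sum_{i \in U_z} \frac{\exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\} - 1}{n_z + \lambda_{2z}\left[\exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\} - 1\right]} = 0. \tag{11}
$$

Substituting $p_{iz}$ into (5) we have

$$
l = -\sum_{z=1}^{\mathcal{Z}} \sum_{i \in U_z} \log\left(n_z + \lambda_{2z}\left[\exp\{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\} - 1\right]\right) + \sum_{z=1}^{\mathcal{Z}} \sum_{i \in U_{Dz}} \{\theta_0 + \eta(z) + \theta_1 Y_i + \theta_2^T Z_i^M + \theta_3^T Y_i Z_i^I\}.
$$

The score equations for $\theta_0$, $\theta_1$, $\theta_2$, and $\theta_3$ are the same as those from a prospective logistic model $\mathrm{logit}\{P(D = 1 \mid Y, Z)\} = \theta_0 + \xi(Z) + \theta_1 Y + \theta_2^T Z^M + \theta_3^T Y Z^I$ fitted to the case-control sample using $\xi(z) = \log\{\lambda_{2z}/(n_z - \lambda_{2z})\} + \eta(z)$ as the offset term. Therefore, we can estimate $\lambda_{2z}$ and θ iteratively. That is, given some starting value of $\lambda_{2z}$, we fit the modified logistic regression model to obtain an estimate of θ. Given the current estimate of θ, a new $\hat{\lambda}_{2z}$ is obtained as the solution to (11). The two steps are repeated until convergence.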
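The update of the multiplier in this iteration amounts to a one-dimensional root-finding problem. The following is a minimal sketch of that single step for one covariate group, assuming the likelihood-ratio values exp{·} at the current θ have already been computed. The function names, the bisection approach and the example values are illustrative assumptions, not the authors' code. The left-hand side of (11) is strictly decreasing in the multiplier, so bisection is sufficient. The sketch does not check that the root lies between 0 and $n_z$, which the offset in the logistic step requires.

```python
def solve_lambda(lr, tol=1e-12, max_iter=200):
    """Solve sum_i (lr_i - 1) / (n + lam * (lr_i - 1)) = 0 for lam, as in (11)."""
    n = len(lr)
    dev = [value - 1.0 for value in lr]
    if min(dev) >= 0.0 or max(dev) <= 0.0:
        raise ValueError("need likelihood ratios on both sides of 1")
    # Bracket that keeps every denominator n + lam * dev_i strictly positive.
    lo = -n / max(dev)
    hi = -n / min(dev)

    def h(lam):
        return sum(di / (n + lam * di) for di in dev)

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:  # h is strictly decreasing, so the root lies above mid
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def point_masses(lr, lam):
    """Masses p_i = 1 / (n + lam * (lr_i - 1)) implied by the multiplier."""
    n = len(lr)
    return [1.0 / (n + lam * (value - 1.0)) for value in lr]


# Hypothetical likelihood-ratio values for the subjects in one covariate group.
example_lr = [0.4, 0.7, 1.1, 1.6, 2.5]
lam_hat = solve_lambda(example_lr)
p_hat = point_masses(example_lr, lam_hat)
print(lam_hat, sum(p_hat))
```

At the root, the masses sum to one and the likelihood ratio has mean one under them, which are the two constraints behind (9).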

References

  1. Baker SG, Kramer BS, Srivastava S. Markers for early detection of cancer: Statistical guidelines for nested case-control studies. BMC Medical Research Methodology. 2002;2:4. doi: 10.1186/1471-2288-2-4.
  2. Breslow NE. Statistics in epidemiology: the case-control study. JASA. 1996;91:14–28. doi: 10.1080/01621459.1996.10476660.
  3. Breslow NE, Day NE. Statistical Methods in Cancer Research: The Analysis of Case-Control Studies. IARC; 1993.
  4. Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75(1):11–20.
  5. Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J. R. Statist. Soc. B. 1997;59(2):447–461.
  6. Breslow NE, Holubkov R. Weighted likelihood, pseudo-likelihood and maximum likelihood methods for logistic regression analysis of two-stage data. Statistics in Medicine. 1997;16:103–116. doi: 10.1002/(sici)1097-0258(19970115)16:1<103::aid-sim474>3.0.co;2-p.
  7. Breslow NE, Zhao LP. Logistic regression for stratified case-control studies. Biometrics. 1988;44:891–899.
  8. Cole TJ, Green PJ. Smoothing reference centile curves: The LMS method and penalized likelihood. Statistics in Medicine. 1992;11(165):1305–1319. doi: 10.1002/sim.4780111005.
  9. Fears TR, Brown CC. Logistic regression methods for retrospective case-control studies using complex sampling procedures. Biometrics. 1986;42:955–960.
  10. Flanders WD, Greenland S. Analytic methods for two-stage case-control studies and other stratified designs. Statistics in Medicine. 1991;10:739–747. doi: 10.1002/sim.4780100509.
  11. Cook NR, Buring JE, Ridker PM. The effect of including C-reactive protein in cardiovascular risk prediction models for women. Ann Intern Med. 2006;145:21–29. doi: 10.7326/0003-4819-145-1-200607040-00128.
  12. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928–935. doi: 10.1161/CIRCULATIONAHA.106.672402.
  13. Heagerty PJ, Pepe MS. Semiparametric estimation of regression quantiles with application to standardizing weight for height and age in U.S. children. Applied Statistics. 1999;48:533–551.
  14. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. JASA. 1952;47:663–685.
  15. Hsieh DA, Manski CF, McFadden D. Estimation of response probabilities from augmented retrospective observations. JASA. 1985;80(391):651–662.
  16. Huang Y, Pepe MS, Feng Z. Evaluating the predictiveness of a continuous marker. Biometrics. 2007;63(4):1181–1188. doi: 10.1111/j.1541-0420.2007.00814.x.
  17. Huang Y. Evaluating the predictiveness of continuous biomarkers. UW thesis. 2007.
  18. Huang Y, Pepe MS. Semiparametric methods for evaluating risk prediction markers in case-control studies. Biometrika. 2009. doi: 10.1093/biomet/asp040. In press.
  19. Janssens AC, Deng Y, Borsboom GJ, Eijkemans MJ, Habbema JD, Steyerberg EW. A new logistic regression approach for the evaluation of diagnostic test results. Medical Decision Making. 2005;25:168–177. doi: 10.1177/0272989X05275154.
  20. Krijnen P, van Jaarsveld BC, Steyerberg EW, Man in’t Veld AJ, Schalekamp MA, Habbema JD. A clinical prediction rule for renal artery stenosis. Ann Intern Med. 1998;129:705–711. doi: 10.7326/0003-4819-129-9-199811010-00005.
  21. Manski CF, McFadden D. Alternative Estimators and Sample Designs for Discrete Choice Analysis. Structural Analysis of Discrete Data with Econometric Applications. 1991:2–50.
  22. Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75(2):237–249.
  23. Owen AB. Empirical Likelihood Ratio Confidence Regions. The Annals of Statistics. 1990;18(1):90–120.
  24. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2003.
  25. Pepe MS, Etzioni R, Feng Z, Potter JD, Thompson ML, Thornquist M, Winget M, Yasui Y. Phases of biomarker development for early detection of cancer. Journal of the National Cancer Institute. 2001;93(14):1054–1061. doi: 10.1093/jnci/93.14.1054.
  26. Pepe MS, Feng Z, Huang Y, Longton GM, Prentice R, Thompson IM, Zheng Y. Integrating the predictiveness of a marker with its performance as a classifier. American Journal of Epidemiology. 2008;167(3):362–368. doi: 10.1093/aje/kwm305.
  27. Pepe MS, Feng Z, Janes H, Bossuyt PM, Potter JD. Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. JNCI. 2008;100(20):1432–1438. doi: 10.1093/jnci/djn326.
  28. Prentice RL, Pyke R. Logistic Disease Incidence Models and Case-Control Studies. Biometrika. 1979;66(3):403–411.
  29. Qin J, Lawless J. Empirical likelihood and general estimating equations. The Annals of Statistics. 1994;22(1):300–325.
  30. Scott A, Wild C. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84(1):57–71.
  31. Scott A, Wild C. Case-control studies with complex sampling. Applied Statistics. 2001;50:389–401.
  32. Wild CJ. Fitting prospective regression models to case-control data. Biometrika. 1991;78(4):705–717.
