Semiparametric methods for evaluating risk prediction markers in case-control studies

Ying Huang; Margaret Sullivan Pepe

Biometrika. 2009 Oct 12;96(4):991–997. doi: 10.1093/biomet/asp040

PMCID: PMC3372083 PMID: 22822247

Abstract

The performance of a well-calibrated risk model for a binary disease outcome can be characterized by the population distribution of risk and displayed with the predictiveness curve. Better performance is characterized by a wider distribution of risk, since this corresponds to better risk stratification in the sense that more subjects are identified at low and high risk for the disease outcome. Although methods have been developed to estimate predictiveness curves from cohort studies, most studies to evaluate novel risk prediction markers employ case-control designs. Here we develop semiparametric methods that accommodate case-control data. The semiparametric methods are flexible, and naturally generalize methods previously developed for cohort data. Applications to prostate cancer risk prediction markers illustrate the methods.

Some key words: Biased sampling, Biomarker, Case-control, Predictiveness curve, Risk prediction, Semiparametric method

1. Introduction

Selecting biomarkers for medical practice is an important and challenging task. Of the thousands of markers made available by modern techniques, we want to find those that can assist medical decision making by helping to identify disease or risk of poor outcomes. Criteria for evaluating a biomarker depend on its purpose. In this paper, our goal is to evaluate risk prediction markers that are used to stratify the population into risk groups for whom different treatment recommendations are made.

Pepe et al. (2008a) suggested using the predictiveness curve (Bura & Gastwirth, 2001) to evaluate a risk prediction marker or model. They argued that the performance of a model to predict risk within a population relies not only on the effect of each predictor in the risk model, but also on the distribution of the predictors. The predictiveness curve integrates these two factors together by displaying the population distribution of risk endowed by the risk model. Let D denote a binary outcome that we term disease here, D = 1 for diseased and D = 0 for nondiseased. Let Y denote a marker of interest and let Risk(Y) = pr(D = 1 | Y) denote the risk calculated on the basis of Y. The predictiveness curve is the curve R(υ) against υ for υ ∈ (0, 1), where R(υ) is the υth quantile of Risk(Y). The inverse function R⁻¹(p) = pr{Risk(Y) ⩽ p} is the proportion of the population with risks less than or equal to p. An attractive feature of this curve is that it provides a common meaningful scale for comparing markers that may not be comparable on their original scales. A risk prediction model with larger variability in R(υ) has a better capacity to stratify risk. A particularly clinically meaningful comparison can be based on R⁻¹(p). Suppose there exists a prespecified low risk threshold p_L and/or a high risk threshold p_H such that recommendation for or against treatment is clear if the estimated risk for a patient is above p_H or below p_L. A risk model which assigns more people into the low and high risk ranges, i.e. larger R⁻¹(p_L) and larger 1 − R⁻¹(p_H), is preferred.
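As a minimal sketch of these definitions, the following computes R(υ) and R⁻¹(p) from a list of risk values. The risk values, the thresholds and the function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def predictiveness_curve(risks, v):
    """R(v): the v-th quantile of the risk distribution, inf{p : pr(Risk <= p) >= v}."""
    ordered = sorted(risks)
    n = len(ordered)
    # Smallest 1-based rank k with k / n >= v (small tolerance for rounding).
    k = max(1, math.ceil(v * n - 1e-12))
    return ordered[k - 1]

def risk_cdf(risks, p):
    """R^{-1}(p): the proportion of the population with risk at most p."""
    return sum(r <= p for r in risks) / len(risks)

# Hypothetical risks for ten subjects, treated here as the whole population.
risks = [0.02, 0.05, 0.08, 0.12, 0.20, 0.25, 0.33, 0.41, 0.55, 0.70]
p_low, p_high = 0.10, 0.30                       # hypothetical thresholds p_L and p_H

low_fraction = risk_cdf(risks, p_low)            # R^{-1}(p_L)
high_fraction = 1.0 - risk_cdf(risks, p_high)    # 1 - R^{-1}(p_H)
median_risk = predictiveness_curve(risks, 0.5)   # R(0.5)
```

A marker that yields larger values of both `low_fraction` and `high_fraction` leaves fewer subjects in the equivocal range between the two thresholds.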

Pepe et al. (2001) proposed five phases for developing a biomarker. Case-control studies are conducted in phases 1, 2 and 3, since they are smaller and more cost efficient than cohort studies. Since early phase studies dominate biomarker research, it is crucial that measures of biomarker performance accommodate case-control designs. Huang et al. (2007) developed a semiparametric estimator of the predictiveness curve for cohort studies. Here we address the more common case-control design. We focus on the scenario of a single continuous marker or a predefined marker combination, although our methods can be easily extended to a general risk model. Biomarker researchers are well aware of problems caused by developing combinations and assessing them in the same dataset and encourage the assessment of a predefined combination with independent test data (Ransohoff, 2007; Simon, 2005; Pepe et al., 2008b). The methods presented here accommodate such evaluations.

Let Y, Y_D and Y_D̄ denote the marker measurement in the general, diseased, and nondiseased populations respectively. Let F, F_D and F_D̄ be the corresponding distribution functions and let f, f_D and f_D̄ be the density functions. Let ρ = pr(D = 1) denote the disease prevalence. We assume either that ρ is known or that a prevalence estimate ρ̂ is available in addition to the case-control sample. For example, an estimate might be obtained from a cohort study reported in the literature. Alternatively, it may be calculated from a parent cohort within which the case-control study is nested (Baker et al., 2002; Pepe et al., 2008b). In these scenarios, variability in ρ̂ can be evaluated and taken into account in calculating the variance of the predictiveness curve estimator.

Furthermore, we assume the risk of disease pr(D = 1 | Y) is monotone increasing in Y. Under this assumption, R(υ) equals pr{D = 1 | Y = F⁻¹(υ)}, the risk at the υth quantile of Y. Thus, the curve R(υ) against υ is the same as the curve pr(D = 1 | Y = y) against F(y). Therefore, estimation of the predictiveness curve can be undertaken in two steps: estimation of the risk model pr(D = 1 | Y = y) and estimation of the marker distribution F(y). We develop estimators for these two entities and combine them to get a predictiveness curve estimator. We consider a case-control study with n_D cases Y_Di (i = 1, . . ., n_D), n_D̄ controls Y_D̄i (i = 1, . . ., n_D̄), and write {Y_k, k = 1, . . ., n} for {Y_D̄1, . . ., Y_{D̄n_D̄}, Y_D1, . . ., Y_{Dn_D}} where n = n_D̄ + n_D. A 2008 University of Washington working paper 333 by Huang and Pepe contains proofs.

2. Semiparametric estimators

2.1. Estimation of the risk model

Suppose the risk model of interest is pr(D = 1 | Y) = G(θ, Y), where

logit {G (θ, Y)} = θ_{0} + η (θ_{1}, Y)

(1)

and η is some monotone increasing function of Y. Examples of logit{G(θ, Y)} include θ₀ + θ₁Y with θ₁ > 0, the linear logistic model, and θ₀ + θ₁ Y^(θ₂) with θ₁ > 0, where Y^(θ₂) = (Y^θ₂ − 1)/θ₂ when θ₂ ≠ 0 and Y^(θ₂) = log Y when θ₂ = 0, the logistic model with Box–Cox transformation (Cole & Green, 1992). In case-control studies, the maximum likelihood estimator of the odds ratio from the retrospective likelihood can be obtained by applying the prospective logistic model to the sample (Anderson, 1972; Prentice & Pyke, 1979), and this achieves the semiparametric information bound (Bickel et al., 1993; Breslow et al., 2000; Gilbert, 2000).
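The Box–Cox form of the risk model can be sketched as follows. The function names and the numerical inputs are illustrative assumptions, not part of the paper.

```python
import math

def box_cox(y, theta2):
    """Box-Cox transformation Y^(theta2); reduces to log Y when theta2 = 0."""
    if y <= 0:
        raise ValueError("the Box-Cox transformation requires a positive marker value")
    if theta2 == 0:
        return math.log(y)
    return (y ** theta2 - 1.0) / theta2

def logit_risk(y, theta0, theta1, theta2):
    """logit G(theta, y) = theta0 + theta1 * Y^(theta2), with theta1 > 0 for monotonicity."""
    return theta0 + theta1 * box_cox(y, theta2)

def risk(y, theta0, theta1, theta2):
    """G(theta, y), the modelled probability of disease at marker value y."""
    return 1.0 / (1.0 + math.exp(-logit_risk(y, theta0, theta1, theta2)))
```

For θ₂ close to zero the transformation approaches log Y, so the two branches join continuously.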

Let S denote being selected into the case-control sample. We apply the standard logistic regression model logit{pr(D = 1 | Y, S)} = θ_0S + η (θ_1S, Y) to the data and correct the intercept with disease prevalence according to Bayes’ theorem: logit{pr(D = 1 | Y)} = logit{pr(D = 1 | Y, S)} − logit{pr(D = 1| S)} + logit{pr(D = 1)}. That is, let (θ̂_0S, θ̂_1S) be the maximum likelihood estimators of (θ_0S, θ_1S), then the estimator of θ is θ̂ = (θ̂₀, θ̂₁), where θ̂₀ = θ̂_0S + log[n_D̄ρ̂/{n_D (1 − ρ̂)}] and θ̂₁ = θ̂_1S.
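The intercept correction amounts to a one-line adjustment. The sketch below assumes the sample intercept has already been obtained from a standard logistic regression fit; the function name and the example numbers are hypothetical.

```python
import math

def correct_intercept(theta0_s, n_cases, n_controls, prevalence):
    """Population intercept: theta0 = theta0_S + log[n_Dbar * rho / {n_D * (1 - rho)}]."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return theta0_s + math.log(n_controls * prevalence / (n_cases * (1.0 - prevalence)))

# Hypothetical sample intercept from a balanced case-control sample.
theta0_hat = correct_intercept(theta0_s=-0.35, n_cases=250, n_controls=250, prevalence=0.219)
```

When the proportion of cases in the sample equals the prevalence, the correction term is zero and the sample intercept is returned unchanged.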

2.2. Estimation of the marker distribution and the predictiveness curve

In a case-control study, since we do not have an independent identically distributed sample from the population, the marker distribution F cannot be estimated directly. Rather, with an estimate of disease prevalence, ρ̂, we can estimate F according to ρ̂F_D + (1 − ρ̂)F_D̄, substituting estimates for F_D and F_D̄. Next, we examine two ways of estimating F_D and F_D̄.

First, in the absence of matching, since control and case samples are representative of their corresponding distributions in the population, natural estimators for F_D̄ and F_D are the empirical estimators F̃_D̄ and F̃_D. We estimate F with F̃ = ρ̂F̃_D + (1 − ρ̂)F̃_D̄. The semiparametric empirical estimators of R(υ) and R⁻¹(p) are R̃(υ) = G{θ̂, F̃⁻¹(υ)} for υ ∈ (0, 1) and R̃⁻¹(p) = F̃ {G⁻¹(θ̂, p)} for p ∈ {R(υ) : υ ∈ (0, 1)}, where G⁻¹(θ, p) = inf{y: G(θ, y) ⩾ p}.
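A minimal sketch of the prevalence-weighted empirical distribution F̃ and its quantile function is given below. The marker values and the prevalence are hypothetical, and the function names are ours.

```python
def empirical_mixture_cdf(cases, controls, prevalence):
    """Return F~(y) = rho * F~_D(y) + (1 - rho) * F~_Dbar(y) as a function of y."""
    def cdf(y):
        f_d = sum(x <= y for x in cases) / len(cases)
        f_dbar = sum(x <= y for x in controls) / len(controls)
        return prevalence * f_d + (1.0 - prevalence) * f_dbar
    return cdf

def mixture_quantile(cases, controls, prevalence, v):
    """F~^{-1}(v): the smallest observed marker value y with F~(y) >= v."""
    cdf = empirical_mixture_cdf(cases, controls, prevalence)
    support = sorted(set(cases) | set(controls))
    for y in support:
        if cdf(y) >= v - 1e-12:
            return y
    return support[-1]

# Hypothetical marker values.
cases = [2.0, 3.0, 4.0]
controls = [1.0, 2.0, 3.0]
f_tilde = empirical_mixture_cdf(cases, controls, prevalence=0.2)
median_marker = mixture_quantile(cases, controls, 0.2, 0.5)
```

Inserting `mixture_quantile(...)` into the fitted risk function G(θ̂, ·) gives R̃(υ).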

However, there is a more efficient way to obtain estimators for F_D and F_D̄. Observe that the risk model (1) implies the following relationship between marker densities in cases and controls:

f_{D}(Y) = LR(Y) f_{\bar{D}}(Y) = \exp\{α + η(β, Y)\} f_{\bar{D}}(Y),

(2)

where α = θ₀ + log{(1 − ρ)/ρ}, β = θ₁, and LR(Y) is the likelihood ratio function of Y (Green & Swets, 1966). When we estimate F_D and F_D̄ empirically, positive point masses are allocated only to marker values observed in the corresponding case or control sample. For a marker measured on a continuous scale, the supports for F̃_D̄ and F̃_D are rarely the same. Therefore, the relationship (2) is not incorporated into estimation of F_D and F_D̄ in the empirical procedure. A related issue arises in a different problem where the task is to estimate the misclassification rates of a binary classification rule constructed from binomial regression (Lloyd, 2000). Lloyd (2000) pointed out that if the accuracy of the rule is summarized by the empirical type I and type II misclassification rates, the exponential tilt relationship (2) between densities of predictors in the diseased and nondiseased populations is ignored.
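For completeness, relationship (2) follows from Bayes' theorem applied to the case and control densities; the short derivation below uses only the quantities defined above.

```latex
\begin{aligned}
f_{D}(y) &= \frac{\mathrm{pr}(D = 1 \mid Y = y)\, f(y)}{\rho}, \qquad
f_{\bar{D}}(y) = \frac{\mathrm{pr}(D = 0 \mid Y = y)\, f(y)}{1 - \rho}, \\
\frac{f_{D}(y)}{f_{\bar{D}}(y)}
  &= \frac{\mathrm{pr}(D = 1 \mid Y = y)}{\mathrm{pr}(D = 0 \mid Y = y)} \cdot \frac{1 - \rho}{\rho}
   = \exp\{\theta_{0} + \eta(\theta_{1}, y)\}\, \frac{1 - \rho}{\rho}
   = \exp\{\alpha + \eta(\beta, y)\}.
\end{aligned}
```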

Incorporation of (2) can be achieved by using the semiparametric likelihood framework (Qin & Zhang, 1997, 2003). This was originally proposed by Qin & Zhang (1997) to test the logistic regression assumption under a case-control sampling plan, and used by Qin & Zhang (2003) to estimate the receiver operating characteristic curve as an alternative to the fully parametric and nonparametric approaches. Suppose η(β, Y) = β^T r(Y), where r(Y) is a vector of functions of Y. The likelihood ratio of Y becomes LR(Y) = exp{α + β^T r(Y)}. Here, we focus on Y being a single marker, but this method applies also when Y is a vector of markers. The semiparametric likelihood for observing the case-control data is $L(α, β, F_{\bar{D}}) = \prod_{i = 1}^{n_{\bar{D}}} dF_{\bar{D}}(Y_{\bar{D} i}) \prod_{j = 1}^{n_{D}} \exp\{α + β^{T} r(Y_{D j})\}\, dF_{\bar{D}}(Y_{D j})$, subject to $\sum_{i = 1}^{n} dF_{\bar{D}}(Y_{i}) = 1$ and $\sum_{i = 1}^{n} \exp\{α + β^{T} r(Y_{i})\}\, dF_{\bar{D}}(Y_{i}) = 1$.

Solving this restricted maximum likelihood using the Lagrange multiplier method, the resulting maximum likelihood estimators for F_D̄ and F_D are

\begin{array}{l} \hat{F}_{\bar{D}}(y) = \frac{1}{n_{\bar{D}}} \sum_{i = 1}^{n} \frac{I(Y_{i} ⩽ y)}{1 + \frac{n_{D}}{n_{\bar{D}}} \exp\{\hat{α} + \hat{β}^{T} r(Y_{i})\}} = \frac{1}{n} \sum_{i = 1}^{n} \frac{I(Y_{i} ⩽ y)}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{LR}(Y_{i})}, \\ \hat{F}_{D}(y) = \frac{1}{n_{\bar{D}}} \sum_{i = 1}^{n} \frac{\exp\{\hat{α} + \hat{β}^{T} r(Y_{i})\}\, I(Y_{i} ⩽ y)}{1 + \frac{n_{D}}{n_{\bar{D}}} \exp\{\hat{α} + \hat{β}^{T} r(Y_{i})\}} = \frac{1}{n} \sum_{i = 1}^{n} \frac{\widehat{LR}(Y_{i})\, I(Y_{i} ⩽ y)}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{LR}(Y_{i})}, \end{array}

where α̂ = θ̂₀ − log{ρ̂/(1 − ρ̂)}, β̂ = θ̂₁, and $\widehat{LR}$ is the maximum likelihood estimator of LR.

We use these estimators to compute F̂ = (1 − ρ̂)F̂_D̄ + ρ̂F̂_D. Then we insert θ̂ and F̂ into G to get the semiparametric maximum likelihood estimators of R(υ) and R⁻¹(p): R̂(υ) = G{θ̂, F̂⁻¹(υ)} for υ ∈ (0, 1), R̂⁻¹(p) = F̂ {G⁻¹(θ̂, p)} for p ∈ {R(υ) : υ ∈ (0, 1)}.
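The steps of § 2.1 and § 2.2 can be combined in a short sketch. It assumes the linear logistic model with r(Y) = Y and a positive fitted slope, fits the model by a plain Newton–Raphson iteration, and uses hypothetical marker values; the function names and the returned dictionary are ours and are not part of the paper.

```python
import math

def fit_logistic(y, d, max_iter=100, tol=1e-12):
    """Newton-Raphson fit of logit pr(D = 1 | Y, S) = b0 + b1 * Y to the pooled sample."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, di in zip(y, d):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * yi)))
            w = p * (1.0 - p)
            g0 += di - p             # score for the intercept
            g1 += (di - p) * yi      # score for the slope
            h00 += w                 # entries of the information matrix
            h01 += w * yi
            h11 += w * yi * yi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

def semiparametric_mle(cases, controls, prevalence):
    """Point masses of F^_Dbar, F^_D and F^ on the pooled sample, plus R^ and R^{-1}^."""
    pooled = list(controls) + list(cases)
    labels = [0] * len(controls) + [1] * len(cases)
    n_d, n_dbar = len(cases), len(controls)
    theta0_s, theta1 = fit_logistic(pooled, labels)
    # Intercept corrected for case-control sampling, then alpha as defined after (2).
    theta0 = theta0_s + math.log(n_dbar * prevalence / (n_d * (1.0 - prevalence)))
    alpha = theta0 - math.log(prevalence / (1.0 - prevalence))
    lr = [math.exp(alpha + theta1 * y) for y in pooled]
    mass_dbar = [1.0 / (n_dbar + n_d * l) for l in lr]       # jumps of F^_Dbar
    mass_d = [l * m for l, m in zip(lr, mass_dbar)]          # jumps of F^_D
    mass_f = [(1.0 - prevalence) * a + prevalence * b for a, b in zip(mass_dbar, mass_d)]
    points = sorted(zip(pooled, mass_f))

    def risk(y):
        return 1.0 / (1.0 + math.exp(-(theta0 + theta1 * y)))

    def r_hat(v):
        """R^(v) = G{theta^, F^^{-1}(v)}."""
        cumulative = 0.0
        for y, m in points:
            cumulative += m
            if cumulative >= v - 1e-12:
                return risk(y)
        return risk(points[-1][0])

    def r_hat_inverse(p):
        """R^^{-1}(p): mass of F^ on marker values whose fitted risk is at most p."""
        return sum(m for y, m in points if risk(y) <= p)

    return {"theta": (theta0, theta1), "mass_dbar": mass_dbar, "mass_d": mass_d,
            "mass_f": mass_f, "r_hat": r_hat, "r_hat_inverse": r_hat_inverse}

# Hypothetical marker values for six cases and eight controls.
cases = [1.2, 2.5, 3.1, 0.4, 2.0, 3.8]
controls = [0.3, 1.0, 1.9, 0.8, 2.2, 0.1, 1.5, 2.8]
fit = semiparametric_mle(cases, controls, prevalence=0.219)
```

Because the logistic fit includes an intercept, its score equation forces the jumps of F̂_D̄ and of F̂_D each to sum to one, which is how the two constraints on the semiparametric likelihood show up numerically.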

An intrinsic property of the predictiveness curve is that the area under the curve is equal to ρ since $\int_{0}^{1} R (υ) d υ = pr (D = 1) = ρ$ . The analogue is not necessarily true though for an estimated predictiveness curve. However, it can be shown that the area under the semiparametric maximum likelihood estimator R̂(υ) is always equal to ρ̂. See an unpublished 2007 University of Washington dissertation by Huang for a proof. This property facilitates visual comparison between two estimated curves, for example predictiveness curves for different markers. This result does not hold for R̃(υ). An intuitive explanation is that the empirically estimated marker distribution does not take advantage of the structure imposed by the risk model.
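A brief sketch of why the area under R̂(υ) equals ρ̂ is given below. It is our own outline under the notation of § 2.2, not the proof from the dissertation, and it uses the constraint that the jumps of F̂_D sum to one.

```latex
\begin{aligned}
G(\hat{\theta}, y) &= \frac{\hat{\rho}\,\widehat{LR}(y)}{1 - \hat{\rho} + \hat{\rho}\,\widehat{LR}(y)}, \qquad
d\hat{F}(Y_{i}) = \{1 - \hat{\rho} + \hat{\rho}\,\widehat{LR}(Y_{i})\}\, d\hat{F}_{\bar{D}}(Y_{i}), \\
\int_{0}^{1} \hat{R}(\upsilon)\, d\upsilon
  &= \sum_{i = 1}^{n} G(\hat{\theta}, Y_{i})\, d\hat{F}(Y_{i})
   = \hat{\rho} \sum_{i = 1}^{n} \widehat{LR}(Y_{i})\, d\hat{F}_{\bar{D}}(Y_{i})
   = \hat{\rho}.
\end{aligned}
```

The empirical estimator F̃ places its mass without reference to the fitted likelihood ratio, so the cancellation in the second line does not occur for R̃(υ).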

2.3. Estimation in a cohort design

Our semiparametric methods were developed for case-control designs but can nevertheless be applied to a cohort study as well by plugging in the sample prevalence ρ̂ = n_D/n. Let α̂, β̂ be the maximum likelihood estimators of α, β by applying the logistic regression model logit{pr(D = 1 | Y)} = α + β^Tr(Y) + log(n_D/n_D̄) to the cohort sample. The last term is included here in order to make the notation, and the definition of α in particular, consistent with the previous subsection. For y ∈ ℛ, the semiparametric empirical estimator of F becomes $\tilde{F} (y) = n_{D} {(n_{D} n)}^{- 1} \sum_{i = 1}^{n_{D}} I (Y_{D i} ⩽ y) + n_{\bar{D}} {(n_{\bar{D}} n)}^{- 1} \sum_{i = 1}^{n_{\bar{D}}} I (Y_{\bar{D} i} ⩽ y) = n^{- 1} \sum_{i = 1}^{n} I (Y_{i} ⩽ y)$ , while the semiparametric maximum likelihood estimator F̂(y) can be easily shown to also equal $n^{- 1} \sum_{i = 1}^{n} I (Y_{i} ⩽ y)$ . That is, F̂ and F̃ calculated from a cohort sample are both equal to the empirical distribution function. This is also true for a case-control sample where the proportion of cases is equal to ρ̂. Consequently, the two semiparametric estimators of the predictiveness curve developed in § 2.2 when applied to a cohort study are the same as the semiparametric estimator developed by Huang et al. (2007). That is, our methods generalize earlier methods to case-control designs.
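The claim that F̂ reduces to the empirical distribution function can be checked directly from the jumps of F̂_D̄ and F̂_D in § 2.2. The following short calculation is our own sketch, with ρ̂ = n_D/n.

```latex
d\hat{F}(Y_{i})
  = \frac{(1 - \hat{\rho}) + \hat{\rho}\,\widehat{LR}(Y_{i})}{n_{\bar{D}} + n_{D}\,\widehat{LR}(Y_{i})}
  = \frac{1}{n} \cdot \frac{n_{\bar{D}} + n_{D}\,\widehat{LR}(Y_{i})}{n_{\bar{D}} + n_{D}\,\widehat{LR}(Y_{i})}
  = \frac{1}{n}, \qquad i = 1, \ldots, n .
```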

3. Asymptotic theory for the semiparametric estimators

We present asymptotic theory for the semiparametric estimators defined in § 2.2 as well as some consequent attractive properties. We assume the following conditions hold:

G(s, Y) is differentiable with respect to s and Y at s = θ, Y = F⁻¹(υ);
∂G⁻¹(s, p)/∂s exists at s = θ;
for 0 < a < b < 1, F has the continuous positive density f on [F⁻¹(a) − ∊, F⁻¹(b) + ∊] for some ∊ > 0;
ρ̂ is either estimated from a cohort or is set equal to a known constant ρ.

Asymptotic theory for the semiparametric maximum likelihood predictiveness curve estimator is presented in Theorems 1 and 2.

Theorem 1. As n → ∞, n^{1/2}{R̂(υ) − R(υ)} converges to a normal random variable with mean zero and variance

\begin{array}{rcl} Σ_{1M}(υ) & = & \left\{\frac{\partial R(υ)}{\partial υ}\right\}^{2} \mathrm{var}\left(n^{1/2} [\hat{F}\{F^{-1}(υ)\} - υ]\right) + \left\{\frac{\partial R(υ)}{\partial θ}\right\}^{T} \mathrm{var}\{n^{1/2}(\hat{θ} - θ)\} \left\{\frac{\partial R(υ)}{\partial θ}\right\} \\ & & + 2 \left\{\frac{\partial R(υ)}{\partial θ}\right\}^{T} \mathrm{cov}\left(n^{1/2}(\hat{θ} - θ), n^{1/2} [\hat{F}\{F^{-1}(υ)\} - υ]\right) \left\{\frac{\partial R(υ)}{\partial υ}\right\}, \end{array}

and n^{1/2}{R̂⁻¹(p) − R⁻¹(p)} converges to a normal random variable with mean zero and variance Σ_{2M}(p) = Σ_{1M}(υ)/{∂R(υ)/∂υ}² for υ = R⁻¹(p).

Theorem 2. As n → ∞, n^{1/2}{R̃(υ) − R(υ)} converges to a normal random variable with mean zero and variance

\begin{array}{rcl} Σ_{1E}(υ) & = & \left\{\frac{\partial R(υ)}{\partial υ}\right\}^{2} \mathrm{var}\left(n^{1/2} [\tilde{F}\{F^{-1}(υ)\} - υ]\right) + \left\{\frac{\partial R(υ)}{\partial θ}\right\}^{T} \mathrm{var}\{n^{1/2}(\hat{θ} - θ)\} \left\{\frac{\partial R(υ)}{\partial θ}\right\} \\ & & + 2 \left\{\frac{\partial R(υ)}{\partial θ}\right\}^{T} \mathrm{cov}\left(n^{1/2}(\hat{θ} - θ), n^{1/2} [\tilde{F}\{F^{-1}(υ)\} - υ]\right) \left\{\frac{\partial R(υ)}{\partial υ}\right\}, \end{array}

and n^{1/2}{R̃⁻¹(p) − R⁻¹(p)} converges to a normal random variable with mean zero and variance Σ_{2E}(p) = Σ_{1E}(υ)/{∂R(υ)/∂υ}² for υ = R⁻¹(p).

Theorems 1 and 2 state that the asymptotic variances of R̂(υ) and R̃(υ) and those of their inverses are related by a factor equal to the squared derivative of R(υ). Intuitively, a perturbation in R(υ) can be approximated by R′(υ) times a perturbation in R⁻¹(p).
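The relation between the two variances can be seen from a first-order expansion. The following heuristic is our own sketch and not a substitute for the proofs; it writes υ̂ = R̂⁻¹(p) and υ = R⁻¹(p), so that R̂(υ̂) = p = R(υ).

```latex
\begin{aligned}
p = \hat{R}(\hat{\upsilon}) &\approx \hat{R}(\upsilon) + R'(\upsilon)\,(\hat{\upsilon} - \upsilon), \\
\hat{R}^{-1}(p) - R^{-1}(p) = \hat{\upsilon} - \upsilon
  &\approx -\,\frac{\hat{R}(\upsilon) - R(\upsilon)}{R'(\upsilon)}, \\
\Sigma_{2M}(p) &= \frac{\Sigma_{1M}(\upsilon)}{\{R'(\upsilon)\}^{2}} .
\end{aligned}
```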

Estimating F using the maximum likelihood method in a case-control design is a special case of the biased sampling problem. Vardi (1985) developed a nonparametric maximum likelihood estimator for F in a biased sampling model with known selection weights, for which the large sample theory was provided by Gill et al. (1988). Gilbert et al. (1999) extended this method to allow the weight functions to depend on an unknown finite-dimensional parameter θ. Gilbert (2000) demonstrated that the maximum likelihood estimators for θ and F_D̄ are semiparametric efficient. The efficiency of our semiparametric maximum likelihood estimators follows.

It can be shown that the asymptotic covariance between θ̂ and the estimator of F is the same for the two semiparametric procedures. This is expected according to the convolution theorem (van der Vaart, 1998, Theorem 25.20) given the fact that θ̂ is the semiparametric efficient estimator. Thus, the difference in asymptotic variance between R̂(υ) and R̃(υ) or between R̂⁻¹(p) and R̃⁻¹(p) is completely attributed to the difference in the asymptotic variances of F̂ and F̃. The latter can be shown to be positively proportional to the asymptotic variance of n^{1/2}(F̂_D̄ − F̃_D̄). Thus, as expected, R̂(υ) and R̂⁻¹(p) are asymptotically more efficient than R̃(υ) and R̃⁻¹(p).

4. Illustration

We illustrate our methods using a simulated case-control dataset from the Prostate Cancer Prevention Trial, a randomized prospective study of men with PSA, the prostate specific antigen, less than 3.0 ng/mL, and 55 years of age and older who were followed up for seven years with annual PSA measurements. Thompson et al. (2006) identified 5519 men on the placebo arm who had undergone prostate biopsy, had a PSA and digital rectal exam during the year prior to biopsy and at least two PSA values from the three years prior to biopsy. Sample disease prevalence from the study cohort is ρ̂ = 21.9%. We randomly sampled 250 cases and 250 controls from this cohort to form a case-control sample for illustration.

We compare PSA and PSA velocity as risk prediction markers for prostate cancer utilizing the predictiveness curve technique. A logistic regression risk model with a Box–Cox transformation of the marker is employed. The two semiparametric estimators of the predictiveness curves displayed in Fig. 1(a) are fairly similar to each other for both markers. Their variance estimates are also similar. The pointwise 95% bootstrap percentile confidence intervals for R(υ) constructed from the semiparametric maximum likelihood estimators are displayed in Fig. 1(b), with variability in ρ̂ incorporated.
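One way to construct percentile intervals of this kind is sketched below. The resampling scheme is an assumption on our part: cases and controls are resampled separately, and ρ̂ is redrawn as a binomial proportion from a cohort of the stated size. The paper does not spell out its scheme, and the marker values, function names and the choice of statistic here are hypothetical.

```python
import math
import random

def bootstrap_percentile_ci(cases, controls, prevalence, cohort_size, estimator,
                            replicates=500, level=0.95, seed=1):
    """Percentile interval for estimator(cases, controls, prevalence)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(replicates):
        # Resample within each group so the case-control design is preserved.
        boot_cases = [rng.choice(cases) for _ in cases]
        boot_controls = [rng.choice(controls) for _ in controls]
        # Assumed scheme: redraw the prevalence estimate from a binomial cohort.
        diseased = sum(rng.random() < prevalence for _ in range(cohort_size))
        draws.append(estimator(boot_cases, boot_controls, diseased / cohort_size))
    draws.sort()
    tail = (1.0 - level) / 2.0
    lower = draws[int(math.floor(tail * (replicates - 1)))]
    upper = draws[int(math.ceil((1.0 - tail) * (replicates - 1)))]
    return lower, upper

def mixture_cdf_at(threshold):
    """Example statistic: the prevalence-weighted empirical distribution at a threshold."""
    def estimator(cases, controls, prevalence):
        f_d = sum(y <= threshold for y in cases) / len(cases)
        f_dbar = sum(y <= threshold for y in controls) / len(controls)
        return prevalence * f_d + (1.0 - prevalence) * f_dbar
    return estimator

# Hypothetical marker values; the cohort size and prevalence echo the illustration.
cases = [1.2, 2.5, 3.1, 0.4, 2.0, 3.8]
controls = [0.3, 1.0, 1.9, 0.8, 2.2, 0.1, 1.5, 2.8]
lower, upper = bootstrap_percentile_ci(cases, controls, 0.219, 5519,
                                       mixture_cdf_at(2.0), replicates=200)
```

Any estimator with the same three-argument signature, for example one returning R̂(υ) at a fixed υ, can be passed in place of `mixture_cdf_at(2.0)`.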

Fig. 1 — (a) The semiparametric maximum likelihood estimators (solid) and semiparametric empirical estimators (dotted) of predictiveness curves for PSA (black) and PSA velocity (grey) for predicting prostate cancer; (b) the 95% pointwise confidence intervals constructed from percentiles of the bootstrap distribution based on the semiparametric maximum likelihood estimators of predictiveness curves for PSA (black) and PSA velocity (grey); and (c) the semiparametric maximum likelihood estimators of predictiveness curves for PSA (black) and PSA velocity (grey) when ρ = 0.165 (solid) and ρ = 0.274 (dashed). The horizontal lines indicate disease prevalences plugged in.

PSA has steeper predictiveness curves, suggesting that it is a better marker for predicting risk of prostate cancer. Table 1 presents values for the risk percentiles of PSA and PSA velocity in the population, R(υ), for υ = 10% and 90%. In addition, risk stratum sizes, R⁻¹(p), for a low risk threshold of 10% and a high risk threshold of 30% are presented. P-values for comparing markers are based on the bootstrap variance estimates. Using the semiparametric methods we conclude that PSA is a significantly better risk prediction marker than PSA velocity. Specifically, it is better for predicting high risk as quantified by larger R(0.9), better for predicting low risk, i.e. smaller R(0.1), and it classifies more people into the low and high risk ranges.

Table 1.

Comparisons between PSA and PSA velocity for predicting the risk of prostate cancer using the semiparametric maximum likelihood method

	PSA		PSA velocity
Measure	Estimate	95% Confidence interval	Estimate	95% Confidence interval	P-value
R(0.1)	0.072	(0.046, 0.109)	0.122	(0.075, 0.159)	0.027
R(0.9)	0.413	(0.356, 0.476)	0.313	(0.275, 0.356)	<0.001
R⁻¹(0.1)	0.188	(0.073, 0.291)	0.060	(0.020, 0.142)	0.021
1 − R⁻¹(0.3)	0.244	(0.191, 0.296)	0.129	(0.030, 0.197)	0.009
R⁻¹(0.3) − R⁻¹(0.1)	0.568	(0.443, 0.724)	0.811	(0.668, 0.935)	0.004


In practice, there may not always be a cohort for estimating prevalence. Often an investigator plugs in a specific prevalence value and treats it as known. We illustrate application of a sensitivity analysis using our example. Consider two values ρ = 0.165 and ρ = 0.274, which correspond to a 25% change from ρ̂ = 0.219. The corresponding predictiveness curves are displayed in Fig. 1(c). The comparison of predictiveness curves with respect to steepness is not sensitive to perturbation in prevalence. PSA appears overall to be a better risk prediction marker than PSA velocity in the sense that the risk percentiles vary more. Comparisons at particular risk thresholds, on the other hand, are affected by prevalence. For example, when ρ = 0.165, based on the semiparametric maximum likelihood procedure, PSA assigns significantly more people into the low risk range than PSA velocity, with estimates of R⁻¹(0.1) being 31.3% and 13.5% respectively, p-value < 0.001. PSA is also a significantly better marker for predicting high risk than PSA velocity, with estimates of 1 − R⁻¹(0.3) being 12.5% and 2.9% respectively, p-value < 0.001. In contrast, when ρ = 0.274, estimates of R⁻¹(0.1) become 9.7% and 3.8% for PSA and PSA velocity, and estimates of 1 − R⁻¹(0.3) are 38.9% and 36.9% respectively. Neither of the comparisons is significant, with p-values being 0.192 and 0.736, respectively. The comparison with respect to the percentage classified into the equivocal risk range is significant when ρ = 0.165, p-value < 0.001, but not when ρ = 0.274, p-value = 0.375.
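The dependence of threshold-based comparisons on prevalence can be seen from relationship (2): for a fixed likelihood ratio, the risk changes with ρ because the disease odds equal LR × ρ/(1 − ρ). The sketch below uses hypothetical likelihood ratio values and a function name of our own.

```python
def risk_from_likelihood_ratio(lr, prevalence):
    """Risk implied by a likelihood ratio: odds = LR * rho / (1 - rho)."""
    odds = lr * prevalence / (1.0 - prevalence)
    return odds / (1.0 + odds)

# Prevalences from the sensitivity analysis; the likelihood ratios are hypothetical.
for rho in (0.165, 0.219, 0.274):
    risks = [round(risk_from_likelihood_ratio(lr, rho), 3) for lr in (0.25, 1.0, 4.0)]
    print(rho, risks)
```

A subject whose likelihood ratio is one has risk equal to ρ itself, so a fixed threshold such as 10% or 30% cuts the marker distribution at a different point for each plugged-in prevalence.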

5. Concluding remarks

In this paper, we have developed flexible semiparametric estimators of the predictiveness curve for case-control studies. This is particularly valuable for evaluating a risk prediction marker or model early in its development when case-control designs are most common. Both semiparametric estimators are easy to compute: risk models can be estimated utilizing standard statistical procedures, and risk distributions can be calculated easily based on analytic formulae. There are other approaches under development for estimating the predictiveness curve, including a nonparametric approach and an approach based on its relationship with the receiver operating characteristic curve (Huang & Pepe, 2009).

The validity of both semiparametric estimators relies upon assumptions about the risk model. If the risk model is misspecified, bias can be introduced into both estimators. This, however, may not be a big concern since the risk model can be made highly flexible using techniques such as regression splines. Given a well-specified model, the semiparametric maximum likelihood estimator is more efficient than its empirical counterpart. Asymptotic relative efficiency of the former versus the latter is a complicated function of the disease prevalence, the separation between cases and controls, the case-control sampling ratio and the quantile of interest. In our example, the two estimators have similar variance when the disease prevalence is medium. It is shown in the 2007 dissertation by Huang that for rare diseases, using the model-based approach may achieve considerable efficiency gains for certain quantiles.

An important use of asymptotic theory is to guide study design. To design an efficient case-control study for evaluating a risk model, the optimal case-control sampling ratio is dictated by the disease prevalence, the separation between cases and controls and the performance measure that is of primary interest. A detailed study can be found in the 2007 dissertation by Huang.

Comparing markers or models for their risk stratification capacity is of great significance in medical practice. Researchers are often interested in whether additional risk factors which may be hard to measure can lead to a significant improvement in utility compared with an existing model. More research on methods to evaluate incremental value is warranted. Methods described here can be easily extended and adapted for such purposes.

Acknowledgments

The authors are grateful for support provided by grants from the U.S. National Institutes of Health and the National Cancer Institute.

References

Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59:19–35.
Baker SG, Kramer BS, Srivastava S. Markers for early detection of cancer: statistical guidelines for nested case-control studies. BMC Med Res Methodol. 2002;2:4–11. doi: 10.1186/1471-2288-2-4.
Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore, MD: Johns Hopkins University Press; 1993.
Breslow NE, Robins JM, Wellner JA. On the semi-parametric efficiency of logistic regression under case-control sampling. Bernoulli. 2000;6:447–55.
Bura E, Gastwirth JL. The binary regression quantile plot: assessing the importance of predictors in binary regression visually. Biomet J. 2001;43:5–21.
Cole TJ, Green PJ. Smoothing reference centile curves: the LMS method and penalized likelihood. Statist Med. 1992;11:1305–19. doi: 10.1002/sim.4780111005.
Gilbert PB. Large sample theory of maximum likelihood estimates in semiparametric biased sampling models. Ann Statist. 2000;28:151–194.
Gilbert PB, Lele S, Vardi Y. Maximum likelihood estimation in semiparametric selection bias models with application to AIDS vaccine trials. Biometrika. 1999;86:27–43.
Gill RD, Vardi Y, Wellner JA. Large sample theory of empirical distributions in biased sampling models. Ann Statist. 1988;16:1069–1112.
Green DM, Swets JA. Signal Detection Theory and Psychophysics. New York: Wiley; 1966.
Huang Y, Pepe MS. A parametric ROC model based approach for evaluating the predictiveness of continuous markers in case-control studies. Biometrics. 2009. doi: 10.1111/j.1541-0420.2009.01201.x.
Huang Y, Pepe MS, Feng Z. Evaluating the predictiveness of a continuous marker. Biometrics. 2007;63:1181–88. doi: 10.1111/j.1541-0420.2007.00814.x.
Lloyd CJ. Maximum likelihood estimation of misclassification rates of a binomial regression. Biometrika. 2000;87:700–705.
Pepe MS, Etzioni R, Feng Z, Potter JD, Thompson ML, Thornquist M, Winget M, Yasui Y. Phases of biomarker development for early detection of cancer. J Nat Cancer Inst. 2001;93:1054–61. doi: 10.1093/jnci/93.14.1054.
Pepe MS, Feng Z, Huang Y, Longton GM, Prentice R, Thompson IM, Zheng Y. Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol. 2008a;167:362–68. doi: 10.1093/aje/kwm305.
Pepe MS, Feng Z, Janes H, Bossuyt PM, Potter JD. Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. J Nat Cancer Inst. 2008b;100:1432–38. doi: 10.1093/jnci/djn326.
Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–11.
Qin J, Zhang J. A goodness-of-fit test for logistic regression models based on case-control data. Biometrika. 1997;84:609–18.
Qin J, Zhang J. Using logistic regression procedures for estimating receiver operating characteristic curves. Biometrika. 2003;90:585–96.
Ransohoff DF. How to improve reliability and efficiency of research about molecular markers: roles of phases, guidelines, and study design. J Clin Epidemiol. 2007;60:1205–19. doi: 10.1016/j.jclinepi.2007.04.020.
Simon R. Roadmap for developing and validating therapeutically relevant genomic classifiers. J Clin Oncol. 2005;23:7332–41. doi: 10.1200/JCO.2005.02.8712.
Thompson IM, Pauler Ankerst D, Chi C. Assessing prostate cancer risk: results from the prostate cancer prevention trial. J Nat Cancer Inst. 2006;98:529–34. doi: 10.1093/jnci/djj131.
van der Vaart AW. Asymptotic Statistics. Cambridge, UK: Cambridge University Press; 1998.
Vardi Y. Empirical distributions in selection bias models. Ann Statist. 1985;13:178–203.
