Assessing risk prediction models in case-control studies using semiparametric and nonparametric methods

Ying Huang; Margaret Sullivan Pepe

doi:10.1002/sim.3876

Author manuscript; available in PMC: 2011 Jun 15.

Published in final edited form as: Stat Med. 2010 Jun 15;29(13):1391–1410. doi: 10.1002/sim.3876


PMCID: PMC3045657 NIHMSID: NIHMS269007 PMID: 20527013

Summary

The predictiveness curve is a graphical tool that characterizes the population distribution of Risk(Y) = P(D = 1|Y), where D denotes a binary outcome such as occurrence of an event within a specified time period and Y denotes predictors. A wider distribution of Risk(Y) indicates better performance of a risk model in the sense that making treatment recommendations is easier for more subjects. Decisions are more straightforward when a subject's risk is deemed to be high or low. Methods have been developed to estimate predictiveness curves from cohort studies. However early phase studies to evaluate novel risk prediction markers typically employ case-control designs. Here we present semiparametric and nonparametric methods for evaluating a continuous risk prediction marker that accommodate case-control data. Small sample properties are investigated through simulation studies. The semiparametric methods are substantially more efficient than their nonparametric counterparts under a correctly specified model. We generalize them to settings where multiple prediction markers are involved. Applications to prostate cancer risk prediction markers illustrate methods for comparing the risk prediction capacities of markers and for evaluating the increment in performance gained by adding a marker to a baseline risk model. We propose a modified Hosmer-Lemeshow test for case-control study data to assess calibration of the risk model that is a natural complement to this graphical tool.

Keywords: biomarker, case-control study, classification, Hosmer-Lemeshow test, predictiveness curve, risk, ROC curve

1. Introduction

Criteria for evaluating a biomarker depend on the purpose for which it will be used. The key performance measure for a diagnostic marker is its classification accuracy, i.e. the ability to provide the correct diagnosis given a subject's true disease status. Classification accuracy of a continuous marker has been commonly assessed by the receiver operating characteristic (ROC) curve [1]. Classification, however, is not always the objective. Sometimes a marker is used mainly to predict risk of disease and to stratify the population into risk groups geared towards different treatment recommendations. Because of its popularity in the field of diagnostic testing, the ROC curve has been used frequently in this setting as well. However, as pointed out by Gail and Pfeiffer [2], Cook [3], and Pencina and others [4], criteria for evaluating a classification marker might be unnecessarily stringent for evaluating a risk prediction marker. In other words, the ROC curve may not be optimal when selecting a marker for risk prediction.

The predictiveness curve [5] was proposed by Pepe and others [6] and Huang and others [7] to evaluate a risk prediction marker or model. It characterizes the performance of a risk prediction model by displaying the population distribution of risk endowed by the model. Arguments for displaying the risk distribution have also appeared recently in the clinical literature [8]. A binary outcome D is considered here such as presence of disease or occurrence of an event within some specified time period. We write D = 1 for cases, subjects with a bad outcome and D = 0 for controls, subjects with a good outcome. Let Y be a vector of predictors of interest and let Risk(Y) = P(D = 1|Y) be the risk calculated based on Y. The predictiveness curve displays the risk distribution through the population quantiles, R(v) vs v for v ∈ (0, 1), where R(v) is the v^th quantile of Risk(Y). Equivalently, the inverse function, R⁻¹(p)= P{Risk(Y) ≤ p}, is the proportion of the population with risks less than or equal to p, the cumulative distribution function. If p_H corresponds to a high risk threshold, the capacity of the risk model to identify high risk subjects is 1 − R⁻¹(p_H). If p_L is a low risk threshold, R⁻¹(p_L) quantifies the capacity of the model to identify low risk subjects. Better risk markers put more subjects into high and low risk categories and fewer people into the intermediate range where treatment decisions are more difficult. In other words, a risk prediction model with larger variability in population quantiles, i.e. steeper predictiveness curve, has a better capacity to stratify risk.

For cohort studies, Huang and others [7] developed a semiparametric estimator of the predictiveness curve. However case-control studies, being smaller and more cost efficient than cohort studies, are the design of choice in early phases of biomarker development [9, 10]. Thus one objective of the current manuscript is to extend estimation to case-control designs. We describe two semiparametric methods. Large sample theory for these estimators was developed in Huang and Pepe [11] when Y is univariate. Here we consider the practical application of these methods. We examine methods for making inference in practical sample sizes and evaluate them using simulation studies. Importantly we extend the methods to accommodate multiple predictors as this often arises in real applications. In practice, robustness to modeling assumptions is always a concern. Another objective of the current paper is to develop a nonparametric estimator. We compare its performance with the semiparametric methods in simulations and in a real dataset. Moreover, we propose a measure accompanying the estimated predictiveness curve to formally test for calibration of the risk model.

We begin with models including only a single continuous marker or a pre-defined marker combination and later examine the extension to a general risk model. The problems caused by developing combinations and assessing them in the same dataset have been well recognized and the assessment of a predefined combination with independent test data is encouraged [12, 13]. In these circumstances our methods apply to evaluations with the test data. For example, Buyse and others [14] recently reported the performance of a gene expression signature combination previously developed by van't Veer and others [15] and van de Vijver and others [16]. Other examples of well known predefined combination scores are the Framingham score for cardiovascular events [17] and the Gail score for breast cancer risk [18].

Let ρ = P(D = 1) denote the prevalence of the bad outcome. We assume either that ρ is fixed at a specified value or that an estimate $\hat{ρ}$ is available in addition to the case-control sample. For example, the prevalence is essentially known if obtained from a large population registry; alternatively, one can entertain various fixed values for ρ that might reflect prevalences in different populations, performing a “what if” exercise that allows one to surmise in which populations the biomarker would be useful and in which populations it might not. Settings where a prevalence estimate is available include estimates from an independent cohort study reported in the literature, or estimates calculated from a parent cohort within which the case-control study is nested [10, 19]. When an estimate of ρ is obtained from an independent cohort or the parent cohort, variability in $\hat{ρ}$ must be taken into account in computing the variance of the predictiveness estimator.

We make the assumption that P(D = 1|Y) is monotone increasing in Y. If the risk is decreasing in Y, the marker can be negated to satisfy this assumption. Extensions discussed in section 6 accommodate non-monotone risk functions. Under the monotone increasing risk assumption, the v^th quantile of the marker corresponds to the v^th quantile of risk, which implies that R(v) = Risk{F⁻¹(v)}. For estimation purposes we therefore need to estimate the risk function, Risk(Y) = P(D = 1|Y), as well as the marker distribution F(y) = P(Y ≤ y), and combine the two estimates to obtain the estimator of the risk quantile.

2. Estimation of the Risk Function

In this section, we consider estimation of the risk function as the first step in estimating the predictiveness curve. The risk can be estimated using either parametric or nonparametric methods. The former gives rise to semiparametric predictiveness curve estimates while the latter gives rise to fully nonparametric estimates.

2.1 Parametric Risk Functions: Logistic Regression

For case-control data, a logistic regression formulation of the risk model is convenient. We write it as

\text{logit}\, P(D = 1 ∣ Y) = \text{logit}\{G(θ, Y)\} = θ_{0} + η(θ_{1}, Y)

(2.1)

where η is monotone increasing in Y. For example, η(θ₁, Y) can take a linear form θ₁Y with θ₁ > 0. A more general and flexible model can involve the Box-Cox type transformation [20]. That is η(θ₁, Y) = θ₁₁Y^(θ₁₂) with θ₁₁ > 0, where Y^(θ₁₂) = (Y^θ₁₂ − 1)/θ₁₂ when θ₁₂ ≠ 0 and Y^(θ₁₂) = log Y when θ₁₂ = 0. In case-control studies, since the sampling rate of cases versus controls is fixed by design, the intercept term θ₀ in the risk model is not estimable. However, the odds ratio is still estimable, a fact that is routinely exploited in epidemiology [21]. The maximum likelihood estimator of the odds ratio from the retrospective likelihood can be obtained by maximizing the prospective likelihood of the case-control sample, pretending that the outcome is random and ignoring the outcome-dependent nature of the sampling [22, 23].

Let n_D and n_D̄ be the numbers of cases and controls, respectively, in the case-control sample. Fitting the logistic regression model (2.1) to the data and then adding the shift $\log\{\frac{\hat{ρ}}{1 - \hat{ρ}} \frac{n_{\overset{‒}{D}}}{n_{D}}\}$ to the intercept, we obtain $\hat{θ} = ({\hat{θ}}_{0}, {\hat{θ}}_{1})$, the maximum likelihood estimator of θ. This follows because, by Bayes' theorem, the population odds are related to the sample odds as follows:

\frac{P (D = 1 ∣ Y)}{P (D = 0 ∣ Y)} = \frac{P (D = 1 ∣ Y, S)}{P (D = 0 ∣ Y, S)} \frac{P (D = 0 ∣ S)}{P (D = 1 ∣ S)} \frac{P (D = 1)}{P (D = 0)},

where S is the indicator of being included in the case-control sample. Therefore, to calculate the population risk from the model fit to case-control data, we add the term $\log\{\frac{\hat{ρ}}{1 - \hat{ρ}} \frac{n_{\overset{‒}{D}}}{n_{D}}\}$ to the estimated intercept.
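The intercept adjustment can be illustrated with a minimal sketch, assuming a linear logistic form η(θ₁, Y) = θ₁Y. The function name `population_risk` and the fitted values used below are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def population_risk(theta0_sample, theta1, y, rho, n_cases, n_controls):
    """Population risk P(D=1 | Y=y) from a linear logistic fit to case-control data.

    theta0_sample is the intercept obtained from the case-control sample.
    The shift log{rho/(1-rho) * n_controls/n_cases} converts sample odds
    into population odds, as in Section 2.1.
    """
    shift = math.log(rho / (1.0 - rho) * n_controls / n_cases)
    linear_predictor = theta0_sample + shift + theta1 * y
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical fitted values from a balanced case-control sample,
# evaluated under two assumed prevalences (a "what if" exercise).
for rho in (0.05, 0.2):
    risk = population_risk(-0.5, 1.0, 0.5, rho, n_cases=250, n_controls=250)
    print(rho, round(risk, 4))
```

When the assumed prevalence equals the sample case fraction n_D/n, the shift is zero and the sample fit is used unchanged.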

2.2 Nonparametric Risk Functions: Isotonic Regression

A more robust approach is to estimate the risk model nonparametrically. Again the risk is assumed to be monotone increasing in Y. We compute the nonparametric maximum likelihood estimator for the risk function subject to monotonicity using isotonic regression [24]. A heuristic explanation of the algorithm in this particular circumstance was given by Lloyd [25]. Specifically, marker data {y₁, . . . , y_n} are arranged in increasing order, followed by repetitive blocking and pooling of adjacent blocks until the sample proportion of cases within each block is non-decreasing. Finally, we calculate P̂(D = 1|Y = y_j, S), the proportion of diseased subjects within the block containing y_j. Case-control sampling again requires an adjustment to estimate the population risk function. Specifically we use the relationship

\frac{\hat{P} (D = 1 ∣ Y)}{\hat{P} (D = 0 ∣ Y)} = \frac{\hat{P} (D = 1 ∣ Y, S)}{\hat{P} (D = 0 ∣ Y, S)} \frac{n_{\overset{‒}{D}}}{n_{D}} \frac{\hat{ρ}}{1 - \hat{ρ}} .
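The blocking-and-pooling step and the subsequent odds adjustment can be sketched as follows. This is an illustrative pool-adjacent-violators routine written for this example, not the authors' code; the function names and the toy data are assumptions, and tied marker values are not handled.

```python
def pava_case_proportions(y, d):
    """Isotonic (non-decreasing) fit of the case indicator d on the marker y.

    Returns the sorted marker values and the fitted sample proportions
    P(D=1 | Y, S), obtained by pooling adjacent violating blocks.
    Assumes no tied marker values.
    """
    order = sorted(range(len(y)), key=lambda i: y[i])
    blocks = []  # each block is [number of cases, number of subjects]
    for i in order:
        blocks.append([d[i], 1])
        # Pool while the previous block's case proportion exceeds the current one.
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            cases, size = blocks.pop()
            blocks[-1][0] += cases
            blocks[-1][1] += size
    fitted = []
    for cases, size in blocks:
        fitted.extend([cases / size] * size)
    return [y[i] for i in order], fitted

def to_population_risk(p_sample, rho, n_cases, n_controls):
    """Convert a sample proportion to population risk via the odds relationship."""
    if p_sample >= 1.0:
        return 1.0
    odds = p_sample / (1.0 - p_sample) * (n_controls / n_cases) * rho / (1.0 - rho)
    return odds / (1.0 + odds)

# Toy data: six subjects, three cases and three controls.
y = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7]
d = [0, 1, 0, 0, 1, 1]
y_sorted, p_sample = pava_case_proportions(y, d)
p_population = [to_population_risk(p, 0.2, 3, 3) for p in p_sample]
```

In the toy data the second through fourth subjects are pooled into one block with sample proportion 1/3, which maps to a population risk of 1/9 when the prevalence is taken to be 0.2.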

3. Estimation of the Marker Distribution and the Predictiveness Curve

In a case-control study, F cannot be estimated directly but can be estimated as a weighted average of the distributions of Y in the case and control subpopulations. Specifically, since F = ρF_D + (1 − ρ)F_D̄, we estimate ρ, F_D and F_D̄ and substitute the estimates to obtain the estimate of F. Two approaches to estimating F_D and F_D̄ are possible under the parametric and nonparametric risk modeling assumptions.

3.1 The Semiparametric Estimators of the Predictiveness Curve

3.1.1 The Semiparametric “Empirical” Estimator

A natural strategy to estimate F_D̄ and F_D is to use the corresponding empirical estimators which we denote by F̃_D̄ and F̃_D. Estimating F with $\tilde{F} = \hat{ρ} {\tilde{F}}_{D} + (1 - \hat{ρ}) {\tilde{F}}_{\overset{‒}{D}}$ , the resulting semiparametric “empirical” estimators of R(v) and R⁻¹(p) are

\begin{aligned}
\tilde{R}(v) &= G\{\hat{θ}, {\tilde{F}}^{-1}(v)\} \quad \text{for } v \in (0, 1), \\
{\tilde{R}}^{-1}(p) &= \tilde{F}\{G^{-1}(\hat{θ}, p)\} \quad \text{for } p \in \{R(v) : v \in (0, 1)\},
\end{aligned}

where G⁻¹(θ, p) = inf{y : G(θ, y) ≥ p}.
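As a hedged illustration of the semiparametric “empirical” estimator, the sketch below builds the weighted empirical distribution F̃ and evaluates a linear logistic G at its v-th quantile. The function names and the small data set are hypothetical; `theta0` is taken to be the intercept after the prevalence shift of Section 2.1.

```python
import math

def weighted_cdf_points(y_cases, y_controls, rho):
    """Support points and cumulative values of rho*F_D + (1-rho)*F_Dbar (empirical)."""
    masses = [(y, rho / len(y_cases)) for y in y_cases]
    masses += [(y, (1.0 - rho) / len(y_controls)) for y in y_controls]
    masses.sort()
    points, cum, total = [], [], 0.0
    for y, m in masses:
        total += m
        points.append(y)
        cum.append(total)
    return points, cum

def quantile(points, cum, v):
    """Generalized inverse: inf{y : F(y) >= v}, with a small float tolerance."""
    for y, c in zip(points, cum):
        if c >= v - 1e-12:
            return y
    return points[-1]

def predictiveness_empirical(v, theta0, theta1, y_cases, y_controls, rho):
    """R(v) estimate: linear logistic risk evaluated at the v-th marker quantile."""
    points, cum = weighted_cdf_points(y_cases, y_controls, rho)
    y_v = quantile(points, cum, v)
    return 1.0 / (1.0 + math.exp(-(theta0 + theta1 * y_v)))

# Hypothetical marker values and parameter values, for illustration only.
y_cases = [0.5, 1.0, 1.5, 2.0]
y_controls = [-1.0, -0.5, 0.0, 0.8]
curve = [predictiveness_empirical(v, -1.886, 1.0, y_cases, y_controls, 0.2)
         for v in (0.1, 0.3, 0.5, 0.7, 0.9)]
```

Each control carries weight (1 − ρ̂)/n_D̄ and each case weight ρ̂/n_D, so with ρ̂ = 0.2 the controls dominate the lower part of the estimated marker distribution.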

3.1.2 The Semiparametric “Maximum Likelihood” Estimator

Let f_D and f_D̄ denote density functions of the marker Y in the case and control populations respectively. Observe that the risk model (2.1) implies an exponential tilt relationship between marker densities among cases and controls

LR(Y) = f_{D}(Y) / f_{\overset{‒}{D}}(Y) = \exp\left\{θ_{0} + \log\left(\frac{1 - ρ}{ρ}\right) + η(θ_{1}, Y)\right\},

(3.1)

where $L R (Y)$ is called the likelihood ratio of Y. This relationship is not exploited when F_D and F_D̄ are estimated empirically [26, 11]. We have shown that by employing an empirical likelihood approach [27, 28, 29], the maximum likelihood estimators for F_D̄ and F_D are

\begin{aligned}
{\hat{F}}_{\overset{‒}{D}}(y) &= \frac{1}{n_{\overset{‒}{D}}} \sum_{i = 1}^{n} \frac{I(Y_{i} \leq y)}{1 + \frac{n_{D}}{n_{\overset{‒}{D}}} \exp\left\{{\hat{θ}}_{0} + \log\left(\frac{1 - \hat{ρ}}{\hat{ρ}}\right) + η({\hat{θ}}_{1}, Y_{i})\right\}} = \frac{1}{n} \sum_{i = 1}^{n} \frac{I(Y_{i} \leq y)}{\frac{n_{\overset{‒}{D}}}{n} + \frac{n_{D}}{n} \widehat{LR}(Y_{i})}, \\
{\hat{F}}_{D}(y) &= \frac{1}{n_{\overset{‒}{D}}} \sum_{i = 1}^{n} \frac{\exp\left\{{\hat{θ}}_{0} + \log\left(\frac{1 - \hat{ρ}}{\hat{ρ}}\right) + η({\hat{θ}}_{1}, Y_{i})\right\} I(Y_{i} \leq y)}{1 + \frac{n_{D}}{n_{\overset{‒}{D}}} \exp\left\{{\hat{θ}}_{0} + \log\left(\frac{1 - \hat{ρ}}{\hat{ρ}}\right) + η({\hat{θ}}_{1}, Y_{i})\right\}} = \frac{1}{n} \sum_{i = 1}^{n} \frac{\widehat{LR}(Y_{i}) I(Y_{i} \leq y)}{\frac{n_{\overset{‒}{D}}}{n} + \frac{n_{D}}{n} \widehat{LR}(Y_{i})},
\end{aligned}

where ${\hat{θ}}_{0}$ is the logistic regression intercept adjusted by $\log\{\frac{\hat{ρ}}{1 - \hat{ρ}} \frac{n_{\overset{‒}{D}}}{n_{D}}\}$, and $\widehat{LR}$ is the maximum likelihood estimator of $LR$ [11]. We use these estimators to compute $\hat{F} = (1 - \hat{ρ}) {\hat{F}}_{\overset{‒}{D}} + \hat{ρ} {\hat{F}}_{D}$, and then plug $\hat{θ}$ and F̂ into G to get the semiparametric maximum likelihood estimators of R(v) and R⁻¹(p):

\begin{aligned}
\hat{R}(v) &= G\{\hat{θ}, {\hat{F}}^{-1}(v)\} \quad \text{for } v \in (0, 1), \\
{\hat{R}}^{-1}(p) &= \hat{F}\{G^{-1}(\hat{θ}, p)\} \quad \text{for } p \in \{R(v) : v \in (0, 1)\} .
\end{aligned}

Note that the semiparametric estimators developed here for case-control studies generalize the semiparametric estimator developed for cohort studies in Huang and others [7]. That is, when plugging in $\hat{ρ} = n_{D} / n$ from a cohort study, both the semiparametric maximum likelihood estimator and the semiparametric “empirical” estimator of the predictiveness curve equal the cohort version proposed earlier [11].
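The point masses that the semiparametric maximum likelihood estimators place on the pooled sample can be sketched as below, assuming a linear logistic model. The function name `spmle_masses` and the inputs are hypothetical; `theta0` denotes the prevalence-shifted intercept.

```python
import math

def spmle_masses(y_pooled, theta0, theta1, rho, n_cases, n_controls):
    """Point masses of the control and case distribution estimates on the pooled sample.

    theta0 is the prevalence-shifted intercept; the likelihood ratio is
    LR(y) = exp{theta0 + log((1-rho)/rho) + theta1*y}.
    """
    n = n_cases + n_controls
    mass_controls, mass_cases = [], []
    for y in y_pooled:
        lr = math.exp(theta0 + math.log((1.0 - rho) / rho) + theta1 * y)
        denom = n * (n_controls / n + (n_cases / n) * lr)
        mass_controls.append(1.0 / denom)
        mass_cases.append(lr / denom)
    return mass_controls, mass_cases

# Sanity check with an uninformative marker: theta1 = 0 and LR = 1,
# so every pooled observation receives mass 1/n under both distributions.
rho = 0.2
theta0 = math.log(rho / (1.0 - rho))
m0, m1 = spmle_masses([0.3, -1.2, 0.8, 2.1], theta0, 0.0, rho, 2, 2)
```

For arbitrary parameter values the masses need not sum to one. At the logistic regression maximum likelihood fit, the intercept score equation forces the fitted sample probabilities to sum to n_D, which in turn makes each set of masses sum to one.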

Asymptotic distribution theory for the two estimators can be found in Huang and Pepe [11]. As an example, consider an ordinary logistic risk model: $\text{logit}\{G(θ, Y)\} = θ_{0} + θ_{1}^{T} r(Y)$, where r(Y) is some monotone increasing function of Y. Suppose $\hat{ρ}$ is estimated from a cohort independent of the case-control sample, or from the parent cohort within which the case-control sample is nested, with the size of the cohort λ times the size of the case-control sample. Then we have

\begin{aligned}
n\, \text{var}\{{\hat{R}}^{-1}(p)\} \simeq{} & \{F_{D}(p) - F_{\overset{‒}{D}}(p)\}^{2} ρ(1 - ρ) / λ + V_{M1} \\
& + \left\{\frac{\partial R^{-1}(p)}{\partial θ}\right\}^{T} \left\{\begin{pmatrix} \frac{1}{λ ρ (1 - ρ)} & 0 \\ 0 & 0 \end{pmatrix} + V_{M2}\right\} \left\{\frac{\partial R^{-1}(p)}{\partial θ}\right\} \\
& + 2 \left\{\frac{\partial R^{-1}(p)}{\partial θ}\right\}^{T} \left\{\frac{F_{D}(t) - F_{\overset{‒}{D}}(t)}{λ} + V_{M3}\right\} .
\end{aligned}

(3.2)

Analytic forms for V_M1, V_M2, V_M3 are provided in Appendix A. Similarly, we have

\begin{aligned}
n\, \text{var}\{{\tilde{R}}^{-1}(p)\} \simeq{} & \{F_{D}(p) - F_{\overset{‒}{D}}(p)\}^{2} ρ(1 - ρ) / λ + V_{E1} \\
& + \left\{\frac{\partial R^{-1}(p)}{\partial θ}\right\}^{T} \left\{\begin{pmatrix} \frac{1}{λ ρ (1 - ρ)} & 0 \\ 0 & 0 \end{pmatrix} + V_{E2}\right\} \left\{\frac{\partial R^{-1}(p)}{\partial θ}\right\} \\
& + 2 \left\{\frac{\partial R^{-1}(p)}{\partial θ}\right\}^{T} \left\{\frac{F_{D}(t) - F_{\overset{‒}{D}}(t)}{λ} + V_{E3}\right\},
\end{aligned}

(3.3)

where analytic forms for V_E1, V_E2, V_E3 are provided in Appendix A. Moreover, for v = R⁻¹(p), var {R̃(v)} ≃ {∂R(v)/∂v}² var {R̃⁻¹(p)} and var {R̂(v)} ≃ {∂R(v)/∂v}² var {R̂⁻¹(p)}. When ρ is fixed, essentially we have λ → ∞, thus terms involving 1/λ vanish in (3.2) and (3.3).

3.2 The Nonparametric Estimator

Similar to the semiparametric approach, we can estimate F_D and F_D̄ empirically with F̃_D and F̃_D̄, yielding $\tilde{F} = \hat{ρ} {\tilde{F}}_{D} + (1 - \hat{ρ}) {\tilde{F}}_{\overset{‒}{D}}$. Substituting the nonparametric risk estimator and F̃, we derive the nonparametric “empirical” estimators of R(v) and R⁻¹(p) as

\begin{aligned}
\tilde{R}(v) &= \hat{P}\{D = 1 ∣ Y = {\tilde{F}}^{-1}(v)\} \quad \text{for } v \in (0, 1), \\
{\tilde{R}}^{-1}(p) &= \tilde{F}[\sup\{y : \hat{P}(D = 1 ∣ Y = y) \leq p\}] \quad \text{for } p \in \{R(v) : v \in (0, 1)\} .
\end{aligned}

Alternatively, we can incorporate the estimated risk function into estimation of the marker distribution, as was done for the semiparametric procedure. Lloyd [25] showed that maximizing the joint likelihood of D and Y can be achieved by first obtaining P̂(D = 1|Y, S), and then estimating f_D̄ and f_D based on the relationship

L R (Y) = \frac{f_{D} (Y)}{f_{\overset{‒}{D}} (Y)} = \frac{P (D = 1 ∣ Y, S)}{P (D = 0 ∣ Y, S)} \frac{n_{\overset{‒}{D}}}{n_{D}} \propto \frac{P (D = 1 ∣ Y, S)}{P (D = 0 ∣ Y, S)} .

In particular, let ŵ(Y) = P̂(D = 1|Y, S)/P̂(D = 0|Y, S) and let κ denote {k : ŵ(Y_k) = ∞}. He demonstrated that by maximizing $L(F_{\overset{‒}{D}}, F_{D}) = \prod_{i = 1}^{n_{\overset{‒}{D}}} f_{\overset{‒}{D}}(Y_{\overset{‒}{D} i}) \prod_{j = 1}^{n_{D}} f_{D}(Y_{D j}) = \prod_{i = 1}^{n} f_{\overset{‒}{D}}(Y_{i}) \prod_{j = 1}^{n_{D}} \{\hat{w}(Y_{D j}) / μ\}$, with μ a normalizing factor, the estimators of f_D̄ and f_D are

{\hat{f}}_{\overset{‒}{D}}(Y_{k}) = \begin{cases} \hat{μ} / \{n_{D} \hat{w}(Y_{k}) + n_{\overset{‒}{D}} \hat{μ}\} & k \notin κ \\ 0 & k \in κ \end{cases}, \qquad {\hat{f}}_{D}(Y_{k}) = \begin{cases} \hat{w}(Y_{k}) {\hat{f}}_{\overset{‒}{D}}(Y_{k}) / \hat{μ} & k \notin κ \\ 1 / n_{D} & k \in κ \end{cases}

(3.4)

in the absence of ties. He also suggested that $\hat{μ}$ could be found by solving

\sum_{k \notin κ} \frac{μ}{n_{D} \hat{w}(Y_{k}) + n_{\overset{‒}{D}} μ} = 1,

(3.5)

where the left-hand side is monotone increasing in μ.

The following new result, proved in Appendix B1, shows that when P(D = 1|Y, S) is estimated using isotonic regression, $\hat{μ}$ can be written down explicitly as a function of n_D̄ and n_D.

Theorem 1 When P(D = 1|Y, S) is estimated using isotonic regression, $\hat{μ} = n_{D} ∕ n_{\overset{‒}{D}}$ .

Plugging $\hat{μ}$ into (3.4), we have

{\hat{f}}_{\overset{‒}{D}}(Y_{k}) = \begin{cases} \frac{1}{n_{\overset{‒}{D}} \{\hat{w}(Y_{k}) + 1\}} & k \notin κ \\ 0 & k \in κ \end{cases}, \qquad {\hat{f}}_{D}(Y_{k}) = \begin{cases} \frac{\hat{w}(Y_{k})}{n_{D} \{\hat{w}(Y_{k}) + 1\}} & k \notin κ \\ 1 / n_{D} & k \in κ \end{cases} .
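A small numerical check of these masses may help. The sketch below takes hypothetical per-block counts of cases and controls from an isotonic fit, applies the formulas with μ̂ = n_D/n_D̄, and sums the masses within each block. The function name and the counts are assumptions made for illustration.

```python
def npmle_block_masses(block_cases, block_controls):
    """Total control and case mass assigned to each block of the isotonic fit.

    block_cases[b] and block_controls[b] are the numbers of cases and
    controls pooled into block b (hypothetical input). Within a block,
    w = cases / controls is the estimated sample odds.
    """
    n_cases, n_controls = sum(block_cases), sum(block_controls)
    ctrl_mass, case_mass = [], []
    for c, m in zip(block_cases, block_controls):
        size = c + m
        if m == 0:
            # w is infinite: the block contains cases only.
            ctrl_mass.append(0.0)
            case_mass.append(size * (1.0 / n_cases))
        else:
            w = c / m
            ctrl_mass.append(size / (n_controls * (w + 1.0)))
            case_mass.append(size * w / (n_cases * (w + 1.0)))
    return ctrl_mass, case_mass

# Three blocks: controls only, mixed, cases only.
ctrl, case = npmle_block_masses([0, 1, 2], [1, 2, 0])
print(ctrl, sum(ctrl))
print(case, sum(case))
```

In this example each block's total mass equals the empirical fraction of controls (or cases) falling in that block, and both sets of masses sum to one, which is consistent with μ̂ = n_D/n_D̄ acting as the normalizing factor.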

Estimating F with $\hat{F} = \hat{ρ} {\hat{F}}_{D} + (1 - \hat{ρ}) {\hat{F}}_{\overset{‒}{D}}$ , where F̂_D and F̂_D̄ denote the corresponding cumulative distribution functions, the nonparametric maximum likelihood estimators of R(v) and R⁻¹(p) are

\begin{aligned}
\hat{R}(v) &= \hat{P}\{D = 1 ∣ Y = {\hat{F}}^{-1}(v)\} \quad \text{for } v \in (0, 1), \\
{\hat{R}}^{-1}(p) &= \hat{F}[\sup\{y : \hat{P}(D = 1 ∣ Y = y) \leq p\}] \quad \text{for } p \in \{R(v) : v \in (0, 1)\} .
\end{aligned}

Interestingly, we have found that even if the nonparametric “empirical” and maximum likelihood procedures described above lead to different estimators of the marker distribution F, the corresponding predictiveness curve estimators are the same (Theorem 2). This fact is not true for the semiparametric estimators. A proof can be found in Appendix B2.

Theorem 2 When the risk model is estimated nonparametrically with isotonic regression, R̂(v) = R̃(v) and R̂⁻¹(p) = R̃⁻¹(p).

3.3 Area under the Predictiveness Curve

The area under the true predictiveness curve, $\int_{0}^{1} R (v) d v$ , is equal to ρ [7]. This facilitates visual comparisons of predictiveness curves for two different risk models because, in a sense, it maintains them both on exactly the same scale. The steepness of curves can be compared more easily when both integrate to ρ. An analogous result holds for the nonparametric and semiparametric maximum likelihood estimators (Theorem 3) but not for the semiparametric “empirical” estimator.

Theorem 3 Let R̂(v) be the nonparametric or semiparametric maximum likelihood estimator of R(v) for v ∈ (0, 1) using the prevalence estimator $\hat{ρ}$. Then $\int_{0}^{1} \hat{R}(v)\, dv$, the area under the estimated predictiveness curve, equals $\hat{ρ}$.

Proof of Theorem 3 is presented in Appendix B3. An implication of Theorem 3 is that, for the nonparametric and semiparametric maximum likelihood estimators, the two areas sandwiched between the estimated curve and the horizontal line at $\hat{ρ}$ are equal. To see this, let $v^{*} = \inf\{v : \hat{R}(v) \geq \hat{ρ}\}$; then the area below the horizontal line at $\hat{ρ}$ and above the estimated predictiveness curve is equal to $\int_{0}^{v^{*}} \{\hat{ρ} - \hat{R}(v)\}\, dv$, while the area above the horizontal line at $\hat{ρ}$ and below the estimated predictiveness curve is equal to $\int_{v^{*}}^{1} \{\hat{R}(v) - \hat{ρ}\}\, dv$. According to Theorem 3,

\int_{0}^{1} \hat{R}(v)\, dv = \hat{ρ} \Rightarrow \int_{0}^{1} \{\hat{R}(v) - \hat{ρ}\}\, dv = 0 \Rightarrow \int_{0}^{v^{*}} \{\hat{R}(v) - \hat{ρ}\}\, dv + \int_{v^{*}}^{1} \{\hat{R}(v) - \hat{ρ}\}\, dv = 0 \Rightarrow \int_{0}^{v^{*}} \{\hat{ρ} - \hat{R}(v)\}\, dv = \int_{v^{*}}^{1} \{\hat{R}(v) - \hat{ρ}\}\, dv .

4. Simulation Studies

We conducted simulation studies in two settings to evaluate the performance of the proposed estimators. In each setting, data were generated to mimic a two-phase study. In the first phase, a random cohort sample was obtained and the disease status of every subject was determined. In the second phase, cases and controls were selected independently from the parent cohort and biomarker data were ascertained. The size of the cohort was chosen to be five times that of the nested case-control sample.

4.1 Simulation Setting 1

In the first simulation setting, a binary outcome status was generated with ρ = 0.2 from a cohort of size 5n and marker data were generated according to Y_D̄ ~ N(0, 1) and Y_D ~ N(μ_D, 1), for equal numbers of cases and controls, n_D = n_D̄ = n/2. The resulting risk function follows a linear logistic model. We explored sample sizes n ranging from 100 to 2,000. For each scenario, 5,000 Monte-Carlo simulations were conducted.
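The true predictiveness curve in this setting can be computed directly, which gives a reference for the R(v) values reported in the tables. The sketch below assumes the binormal setup just described with ρ = 0.2 and μ_D = 1; the function names are hypothetical, and the marker quantile is found by simple bisection.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def true_risk(y, rho=0.2, mu_d=1.0):
    # With N(0,1) controls and N(mu_d,1) cases, Bayes' theorem gives
    # logit Risk(y) = log{rho/(1-rho)} + mu_d*y - mu_d^2/2.
    lp = math.log(rho / (1.0 - rho)) + mu_d * y - 0.5 * mu_d ** 2
    return 1.0 / (1.0 + math.exp(-lp))

def marker_quantile(v, rho=0.2, mu_d=1.0):
    """Solve F(y) = v by bisection, where F = rho*F_D + (1-rho)*F_Dbar."""
    lo, hi = -10.0, 10.0 + mu_d
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        f = rho * norm_cdf(mid - mu_d) + (1.0 - rho) * norm_cdf(mid)
        if f < v:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def true_predictiveness(v, rho=0.2, mu_d=1.0):
    """R(v) = Risk{F^{-1}(v)} under the monotone risk assumption."""
    return true_risk(marker_quantile(v, rho, mu_d), rho, mu_d)

for v in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(v, round(true_predictiveness(v), 3))

# Midpoint-rule approximation to the area under the true curve.
n_grid = 2000
area = sum(true_predictiveness((k + 0.5) / n_grid) for k in range(n_grid)) / n_grid
```

The printed values should be close to the R(v) row of Table 1, and the numerical area should be close to ρ = 0.2, in line with the area result of Section 3.3.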

The semiparametric (based on the linear logistic model) and nonparametric estimators of the predictiveness curve were computed using $\hat{ρ}$ obtained from the cohort. Variance estimates for the semiparametric estimators were calculated using analytic formulae from the asymptotic theory, which incorporate variability in $\hat{ρ}$ (as provided in Appendix B1). Bootstrapping was also performed by separately resampling cases and controls for Y and resampling D from the parent cohort. Results pertaining to the choice μ_D = 1 are presented in Tables 1–4 for v = 0.1, 0.3, 0.5, 0.7, and 0.9 and for the corresponding values of p, p = R(v).
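The resampling scheme described above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: marker values are resampled within cases and within controls, and outcome indicators are resampled from the parent cohort to obtain a bootstrap prevalence. The function name and the toy inputs are hypothetical.

```python
import random

def bootstrap_resample(y_cases, y_controls, cohort_d, rng):
    """One bootstrap replicate for a nested case-control design.

    y_cases, y_controls: marker values in the case-control sample.
    cohort_d: 0/1 outcome indicators for the whole parent cohort.
    Returns resampled markers and the bootstrap prevalence estimate.
    """
    y_cases_b = [rng.choice(y_cases) for _ in y_cases]
    y_controls_b = [rng.choice(y_controls) for _ in y_controls]
    cohort_b = [rng.choice(cohort_d) for _ in cohort_d]
    rho_b = sum(cohort_b) / len(cohort_b)
    return y_cases_b, y_controls_b, rho_b

# Toy inputs: a cohort of 20 subjects with 4 events, 3 cases and 3 controls sampled.
rng = random.Random(2010)
cohort_d = [1] * 4 + [0] * 16
yc_b, yn_b, rho_b = bootstrap_resample([1.2, 0.7, 1.9], [-0.3, 0.4, 0.1], cohort_d, rng)
```

Each replicate would then be passed through the chosen estimator of R(v) so that variability in both the marker data and the prevalence estimate is reflected in the bootstrap distribution.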

Table 1.

Bias of the semiparametric and nonparametric estimators for the linear logistic model. Also shown are biases of variance estimators based on asymptotic theory. Here n_D = n_D̄ = n/2. Size of the phase-one cohort for estimating $\hat{ρ}$ is 5n. SPMLE denotes the semiparametric maximum likelihood estimator, SPE denotes the semiparametric “empirical” estimator, NPMLE denotes the nonparametric maximum likelihood estimator.

		v = 0.1	v = 0.3	v = 0.5	v = 0.7	v = 0.9
R(v)		0.045	0.094	0.15	0.24	0.43
% bias in R̂(v)
n = 100	SPMLE	4.47	–0.50	–1.39	–0.82	0.82
	SPE	5.13	–0.40	–1.18	–0.53	0.94
	NPMLE	–35.35	–9.42	–5.44	–3.27	2.86
n = 500	SPMLE	1.14	–0.06	–0.34	–0.13	0.16
	SPE	1.12	–0.06	–0.30	–0.07	0.15
	NPMLE	–13.15	–3.21	–1.86	–1.38	0.58
n = 2000	SPMLE	0.16	–0.13	–0.13	–0.07	0.11
	SPE	0.15	–0.13	–0.10	–0.06	0.10
	NPMLE	–4.72	–1.59	–0.83	–0.47	0.35
% bias in variance estimate of R̂(v)
n = 100	SPMLE	–4.44	7.91	7.28	1.54	–4.23
	SPE	–1.27	7.03	9.02	1.22	–4.35
n = 500	SPMLE	–1.99	1.47	3.63	2.11	–0.35
	SPE	–2.50	2.07	3.88	–1.92	2.48
n = 2000	SPMLE	–1.39	1.33	1.91	1.20	–0.49
	SPE	–1.13	1.65	2.31	1.65	–1.06
		p = 0.045	p = 0.094	p = 0.15	p = 0.24	p = 0.43
R⁻¹(p)		0.1	0.3	0.5	0.7	0.9
% bias in R̂⁻¹(p)
n = 100	SPMLE	12.68	0.39	–0.16	0.55	0.11
	SPE	12.90	0.38	–0.34	0.50	0.12
	NPMLE	80.50	15.00	5.86	2.06	–0.79
n = 500	SPMLE	2.43	0.02	0.002	0.11	0.03
	SPE	2.53	0.02	–0.04	0.06	0.04
	NPMLE	29.78	5.84	2.11	0.86	–0.23
n = 2000	SPMLE	0.94	0.15	0.05	0.04	–0.01
	SPE	0.94	0.14	0.03	0.04	–0.01
	NPMLE	12.84	2.47	0.98	0.34	–0.16
% bias in variance estimate of R̂⁻¹(p)
n = 100	SPMLE	15.14	–6.80	–14.03	–11.01	9.52
	SPE	14.88	–5.62	–13.62	–9.67	8.81
n = 500	SPMLE	7.40	–6.14	–5.92	–5.86	5.54
	SPE	7.34	–6.12	–6.05	–5.87	3.58
n = 2000	SPMLE	2.75	–2.87	–3.08	–1.84	3.66
	SPE	2.70	–2.75	–2.36	–0.56	4.00


Table 4.

Efficiency (ratio of observed variances in simulation studies) of the semiparametric “empirical” estimator and nonparametric estimator relative to the semiparametric maximum likelihood estimator of the predictiveness curve in nested case-control studies for the linear logistic model. SPE denotes the semiparametric “empirical” estimator, NPMLE denotes the nonparametric maximum likelihood estimator. Here n = ∞ denotes the asymptotic variance.

		v = 0.1	v = 0.3	v = 0.5	v = 0.7	v = 0.9
R(v)		0.045	0.094	0.15	0.24	0.43
n = 100	SPE	1.02	0.97	0.97	0.90	0.91
	NPMLE	0.45	0.41	0.27	0.18	0.29
n = 500	SPE	0.98	0.98	0.96	0.90	0.89
	NPMLE	0.25	0.25	0.16	0.10	0.18
n = 2000	SPE	0.99	0.98	0.96	0.90	0.91
	NPMLE	0.17	0.16	0.10	0.06	0.11
n = ∞	SPE	0.99	0.97	0.97	0.90	0.91
		p = 0.045	p = 0.094	p = 0.15	p = 0.24	p = 0.43
R⁻¹(p)		0.1	0.3	0.5	0.7	0.9
n = 100	SPE	0.99	0.99	0.96	0.91	0.91
	NPMLE	0.47	0.47	0.32	0.21	0.38
n = 500	SPE	0.99	0.98	0.95	0.90	0.90
	NPMLE	0.28	0.27	0.17	0.10	0.20
n = 2000	SPE	0.99	0.98	0.96	0.91	0.91
	NPMLE	0.18	0.16	0.10	0.06	0.12
n = ∞	SPE	0.99	0.97	0.97	0.90	0.91


First we consider the performance of the semiparametric estimators for R(v) and R⁻¹(p). We see that they have minimal bias for sample sizes as small as 100 (Table 1). Variance estimators that are based on analytic formulas from asymptotic theory agree well with the empirical variance from simulations when n_D + n_D̄ ≥ 500. This was also true for the bootstrap variance (results not shown). Coverage of the 95% Wald confidence intervals using asymptotic or bootstrap variance estimates is fairly close to the nominal level, except for slight undercoverage when n_D + n_D̄ ≤ 200 (Table 2). The intervals shown in Table 2 assumed that the logit transform of the estimator was normally distributed and had better coverage than symmetric intervals for the untransformed estimators.
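A logit-scale Wald interval of the kind referred to above can be sketched as follows. The sketch assumes the standard delta-method approximation var{logit(R̂)} ≈ var(R̂)/{R̂(1 − R̂)}², which is one common way to obtain such an interval and may differ in detail from the computation used in the simulations. The function name and the numerical inputs are hypothetical.

```python
import math

def logit_wald_ci(estimate, variance, z=1.959964):
    """Wald interval on the logit scale, back-transformed to (0, 1).

    Uses the delta method: se{logit(R)} ~= se(R) / {R(1-R)}.
    """
    logit = math.log(estimate / (1.0 - estimate))
    se_logit = math.sqrt(variance) / (estimate * (1.0 - estimate))
    lower, upper = logit - z * se_logit, logit + z * se_logit
    expit = lambda x: 1.0 / (1.0 + math.exp(-x))
    return expit(lower), expit(upper)

# Hypothetical estimate of R(0.5) with standard error 0.02.
lo, hi = logit_wald_ci(0.15, 0.02 ** 2)
print(round(lo, 3), round(hi, 3))
```

Unlike a symmetric interval on the original scale, the back-transformed limits always lie in (0, 1) and are asymmetric around the estimate, which is relevant for risks near zero.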

Table 2.

Coverage of 95% Wald confidence intervals based on the semiparametric and nonparametric estimators in nested case-control studies for the linear logistic model, assuming the logit transform of the estimator is normally distributed. SPMLE denotes the semiparametric maximum likelihood estimator, SPE denotes the semiparametric “empirical” estimator, and NPMLE denotes the nonparametric maximum likelihood estimator.

		v = 0.1	v = 0.3	v = 0.5	v = 0.7	v = 0.9
R(v)		0.045	0.094	0.15	0.24	0.43
		Based on asymptotic variance estimate
n = 100	SPMLE	95.82	95.64	95.94	96.22	95.50
	SPE	95.78	95.58	96.38	96.04	95.44
n = 500	SPMLE	94.62	95.14	95.88	95.40	94.80
	SPE	94.64	95.30	95.78	95.42	94.52
n = 2000	SPMLE	95.00	95.38	95.32	95.06	94.96
	SPE	94.92	95.22	95.44	95.62	95.50
		Based on bootstrap variance estimate
n = 100	SPMLE	96.02	95.14	95.68	96.68	96.26
	SPE	96.04	95.02	96.02	96.86	96.04
	NPMLE	96.49	97.56	96.64	96.36	98.10
n = 500	SPMLE	95.06	94.94	95.30	95.28	95.18
	SPE	94.96	95.02	95.34	95.60	94.86
	NPMLE	96.57	94.76	95.16	95.30	96.04
n = 2000	SPMLE	95.20	95.18	94.78	95.22	94.94
	SPE	95.02	95.22	95.04	95.36	95.58
	NPMLE	94.58	94.08	94.84	94.62	94.90
		p = 0.045	p = 0.094	p = 0.15	p = 0.24	p = 0.43
R⁻¹(p)		0.1	0.3	0.5	0.7	0.9
		Based on asymptotic variance estimate
n = 100	SPMLE	91.39	93.55	95.30	95.94	93.13
	SPE	91.22	93.97	95.50	96.26	93.63
n = 500	SPMLE	95.40	95.10	94.98	94.86	95.76
	SPE	95.32	94.86	94.94	94.82	95.76
n = 2000	SPMLE	95.40	94.82	94.80	94.94	95.28
	SPE	95.28	95.10	94.80	95.30	95.56
		Based on bootstrap variance estimate
n = 100	SPMLE	91.73	94.75	96.16	97.10	93.85
	SPE	91.62	94.99	96.28	96.98	94.06
	NPMLE	72.82	90.81	96.40	97.52	90.68
n = 500	SPMLE	95.16	95.32	95.48	95.32	94.78
	SPE	95.10	95.36	95.60	95.24	95.34
	NPMLE	83.21	93.08	96.26	96.66	93.20
n = 2000	SPMLE	94.86	94.92	95.04	95.10	94.84
	SPE	94.80	95.12	95.36	95.58	94.90
	NPMLE	89.08	94.14	95.22	95.68	94.58


Results are also shown in Table 3 for confidence intervals employing percentiles of the bootstrap distribution. We found that these confidence intervals performed best overall. Moreover, the corresponding lower and upper confidence limits are monotone increasing in v. This is a desirable property because, by definition, the predictiveness curve itself is monotone increasing. Having lower and upper pointwise confidence limit curves that are monotone increasing is consistent with monotonicity of the predictiveness curve. To see that the pointwise confidence limits are increasing in v, let R̂_b(v) be the estimate of R(v) based on the b^th bootstrap sample. We have R̂_b(v₁) ≤ R̂_b(v₂) for v₁ ≤ v₂ according to our estimation methods. As a result, the α^th percentile of R̂_b(v₁) is always smaller than or equal to the α^th percentile of R̂_b(v₂) among the same set of bootstrap samples. In our simulations, we chose α = 0.025 and α = 0.975.
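The monotonicity argument for percentile limits can be checked numerically. In the sketch below, sorted uniform draws stand in for monotone bootstrap curve estimates; this stand-in and the function name are assumptions made purely for illustration.

```python
import math
import random

def percentile(values, alpha):
    """Empirical alpha-quantile of a list, using the nearest-rank rule."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, max(0, int(math.ceil(alpha * len(ordered))) - 1))
    return ordered[k]

random.seed(1)
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
# Each stand-in bootstrap curve is non-decreasing across the grid of v values.
curves = [sorted(random.random() for _ in grid) for _ in range(500)]

# Pointwise 2.5th and 97.5th percentiles across the bootstrap curves.
lower = [percentile([c[j] for c in curves], 0.025) for j in range(len(grid))]
upper = [percentile([c[j] for c in curves], 0.975) for j in range(len(grid))]
```

Because every curve is non-decreasing, each order statistic at one grid point cannot exceed the same order statistic at a larger grid point, so both limit curves are non-decreasing as well.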

Table 3.

Coverage of 95% percentile bootstrap confidence intervals based on the semiparametric and nonparametric estimators in nested case-control studies for the linear logistic model. SPMLE denotes the semiparametric maximum likelihood estimator, SPE denotes the semiparametric “empirical” estimator, and NPMLE denotes the nonparametric maximum likelihood estimator.

		v = 0.1	v = 0.3	v = 0.5	v = 0.7	v = 0.9
R(v)		0.045	0.094	0.15	0.24	0.43
n = 100	SPMLE	94.24	94.50	95.08	95.86	94.80
	SPE	94.58	94.64	95.28	96.36	94.98
	NPMLE	79.74	94.30	96.58	97.54	98.02
n = 500	SPMLE	94.18	94.18	94.42	95.88	94.22
	SPE	94.66	94.16	94.60	96.02	94.54
	NPMLE	93.04	95.92	97.18	97.40	98.20
n = 2000	SPMLE	94.38	94.52	94.66	94.72	94.22
	SPE	94.58	94.38	95.08	95.08	94.76
	NPMLE	94.80	96.66	97.44	97.80	97.96
		p = 0.045	p = 0.094	p = 0.15	p = 0.24	p = 0.43
R⁻¹(p)		0.1	0.3	0.5	0.7	0.9
n = 100	SPMLE	94.24	94.54	95.04	95.90	94.80
	SPE	94.56	94.72	95.28	96.38	95.08
	NPMLE	79.74	94.36	96.62	97.60	98.06
n = 500	SPMLE	94.12	94.06	95.02	94.90	94.22
	SPE	94.22	94.46	95.40	95.18	94.62
	NPMLE	91.44	96.02	97.20	97.92	98.20
n = 2000	SPMLE	94.38	94.48	94.70	94.74	94.12
	SPE	94.58	94.42	95.12	95.10	94.74
	NPMLE	94.80	96.62	97.46	97.80	97.92


The nonparametric estimator of the predictiveness curve performed poorly relative to the semiparametric estimators in this simulation setting. When sample sizes are smaller than 500, biases in estimates of R(v) and R⁻¹(p) are substantial, and confidence intervals suffer from undercoverage or overcoverage in many settings (Tables 2 and 3). Its efficiency is dramatically worse than that of the semiparametric estimators (Table 4), especially in large samples.

The two semiparametric estimators have similar performances in the simulations. Of note, the semiparametric “empirical” estimator is fairly efficient relative to the semiparametric maximum likelihood estimator (Table 4). Based on these limited simulations we recommend use of either of the semiparametric estimators in practice with confidence intervals constructed from percentiles of the bootstrap distribution when resampling is feasible, or from the logit transform with corresponding analytic variance formulas when bootstrapping is not feasible.

4.2 Simulation Setting 2

We investigated another simulation setting in which the marker Y follows a standard normal distribution and the risk quantile function is piecewise linear with cutpoints at the quintiles. Specifically, R(v) takes the values 0, 0.1, 0.16, 0.2, 0.24, 0.3 at the cutpoints v = 0, 0.2, 0.4, 0.6, 0.8, 1, and is linear in between. As before, we first simulate a cohort of size 5n, and then randomly sample n_D = n/2 cases and n_D̄ = n/2 controls from the cohort. The semiparametric estimators are again obtained assuming a linear logistic model. Tables 5 and 6 present bias, efficiency (in terms of mean squared error), and coverage of the 95% percentile bootstrap confidence intervals for the semiparametric and nonparametric estimators, for case-control sample sizes varying from 500 to 2,000. The semiparametric estimators, which here assume an incorrect working model, perform worse than in the first simulation setting. For estimation of both R(v) and R⁻¹(p), they have nonignorable biases that do not dissipate as the sample size increases. As a result, coverage of their confidence intervals is often seriously below the nominal level, especially in large samples. The performance of the nonparametric estimator, in contrast, is fairly consistent with that in the first simulation setting. Its bias is much smaller than that of the semiparametric estimators and decreases as the sample size increases. Consequently, for certain quantiles, the mean squared error of the nonparametric estimator is smaller than that of the semiparametric estimators. In general, for sample sizes greater than 500, the confidence intervals constructed from the nonparametric estimator maintain coverage close to the nominal level.
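The data-generating mechanism of this setting can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the simulation code used for Tables 5 and 6: the function names are hypothetical, and disease status is drawn as a Bernoulli variable with success probability R{F(Y)}, where F is the standard normal distribution function.

```python
import numpy as np
from statistics import NormalDist


def true_risk_quantile(v):
    """Piecewise-linear R(v) with knots at the quintiles (values from the text)."""
    knots = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
    values = [0.0, 0.1, 0.16, 0.2, 0.24, 0.3]
    return np.interp(v, knots, values)


def simulate_nested_case_control(n, rng):
    """Simulate a cohort of size 5n, then sample n/2 cases and n/2 controls."""
    cohort_size = 5 * n
    y = rng.standard_normal(cohort_size)                 # marker Y ~ N(0, 1)
    std_normal = NormalDist()
    v = np.array([std_normal.cdf(val) for val in y])     # v = F(Y)
    risk = true_risk_quantile(v)                         # Risk(Y) = R{F(Y)}
    d = rng.random(cohort_size) < risk                   # disease indicator
    rho_hat = d.mean()                                   # cohort prevalence
    case_idx = rng.choice(np.flatnonzero(d), size=n // 2, replace=False)
    control_idx = rng.choice(np.flatnonzero(~d), size=n // 2, replace=False)
    return y[case_idx], y[control_idx], rho_hat


rng = np.random.default_rng(2010)
y_cases, y_controls, rho_hat = simulate_nested_case_control(500, rng)
```

Under this construction the population prevalence is the area under R(v), which equals 0.17 for the knot values given above.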

Table 5.

Performances of the semiparametric and nonparametric estimators for R(v) when the predictiveness curve is piecewise linear. Here n_D = n_D̄ = n/2. Size of the phase-one cohort for estimating $\hat{ρ}$ is 5n. SPMLE denotes the semiparametric maximum likelihood estimator, SPE denotes the semiparametric “empirical” estimator, NPMLE denotes the nonparametric maximum likelihood estimator.

		v = 0.1	v = 0.3	v = 0.5	v = 0.7	v = 0.9
R(v)		0.05	0.13	0.18	0.22	0.27
% bias in R̂(v)
n = 500	SPMLE	57.46	-9.06	-14.25	-9.24	4.72
	SPE	51.92	-10.74	-14.35	-8.02	7.70
	NPMLE	-13.25	-4.58	-2.55	-0.80	1.92
n = 1000	SPMLE	57.48	-8.87	-14.07	-9.13	4.57
	SPE	51.88	-10.53	-14.16	-7.94	7.54
	NPMLE	-8.42	-2.62	-1.38	-0.66	0.78
n = 2000	SPMLE	57.63	-8.74	-13.97	-9.11	4.46
	SPE	51.97	-10.38	-14.05	-7.90	7.36
	NPMLE	-4.54	-1.14	-0.71	-0.50	0.15
Efficiency^a relative to MLE
n = 500	SPE	1.19	0.81	0.98	1.22	0.66
	NPMLE	1.68	0.35	1.14	0.70	0.44
n = 1000	SPE	1.21	0.77	0.98	1.26	0.56
	NPMLE	2.57	0.43	1.65	0.97	0.48
n = 2000	SPE	1.22	0.75	0.99	1.29	0.50
	NPMLE	4.10	0.56	2.55	1.46	0.52
Coverage of 95% percentile bootstrap CI
n = 500	SPMLE	30.16	79.22	16.70	46.18	90.06
	SPE	37.60	74.26	17.82	58.78	85.18
	NPMLE	92.44	96.18	96.64	96.78	97.56
n = 1000	SPMLE	5.92	65.32	1.64	18.60	87.36
	SPE	10.60	56.44	2.08	33.80	75.56
	NPMLE	94.52	96.54	96.84	96.94	97.62
n = 2000	SPMLE	0.06	43.04	0.00	2.16	79.06
	SPE	0.40	29.86	0.00	8.54	58.44
	NPMLE	95.60	96.90	97.20	97.32	97.46


^a Efficiency in terms of mean squared error.

Table 6.

Performances of the semiparametric and nonparametric estimators for R^–1(p) when the predictiveness curve is piecewise linear. Here n_D = n_D̄ = n/2. Size of the phase-one cohort for estimating $\hat{ρ}$ is 5n. SPMLE denotes the semiparametric maximum likelihood estimator, SPE denotes the semiparametric “empirical” estimator, NPMLE denotes the nonparametric maximum likelihood estimator.

		p = 0.05	p = 0.13	p = 0.18	p = 0.22	p = 0.27
R^–1(p)		0.1	0.3	0.5	0.7	0.9
% bias in R̂^–1(p)
n = 500	SPMLE	-74.16	21.00	24.42	9.86	-2.01
	SPE	-71.15	23.93	23.28	8.27	-3.28
	NPMLE	19.68	7.85	5.52	1.53	-1.70
n = 1000	SPMLE	-76.78	21.13	24.07	9.65	-2.03
	SPE	-73.94	24.04	22.94	8.07	-3.32
	NPMLE	13.36	4.86	3.24	1.19	-0.99
n = 2000	SPMLE	-78.39	21.13	23.99	9.57	-2.04
	SPE	-75.62	24.02	22.83	7.97	-3.33
	NPMLE	8.28	2.99	1.59	1.05	-0.24
Efficiency^a relative to SPMLE
n = 500	SPE	1.05	1.20	1.17	1.21	1.17
	NPMLE	4.09	0.41	0.29	0.82	0.74
n = 1000	SPE	1.05	1.25	1.21	1.22	1.18
	NPMLE	6.68	0.32	0.29	1.05	0.73
n = 2000	SPE	1.06	1.29	1.23	1.23	1.19
	NPMLE	9.92	0.28	0.31	1.38	0.70
Coverage of 95% percentile bootstrap CI
n = 500	SPMLE	30.28	79.30	16.74	46.14	90.04
	SPE	37.64	74.22	17.78	58.78	85.20
	NPMLE	92.40	96.14	96.64	96.78	97.58
n = 1000	SPMLE	5.94	65.32	1.64	18.56	87.40
	SPE	10.64	56.48	2.06	33.96	75.60
	NPMLE	94.44	96.54	96.78	96.96	97.64
n = 2000	SPMLE	0.06	42.94	0.00	2.16	79.06
	SPE	0.42	29.88	0.00	8.56	58.44
	NPMLE	95.62	96.96	97.24	97.30	97.46


^a Efficiency in terms of mean squared error.

Because of its robustness, the nonparametric estimator might be preferred in large studies where bias rather than precision is the major concern. On the other hand, in practice it is important to make the semiparametric model flexible to ensure a good fit, in light of its sensitivity to the risk model assumption. Comparing the nonparametric estimator with the semiparametric estimator provides an avenue for model checking. Later in this paper we propose a goodness-of-fit test to assess calibration formally.

5. Illustration

Levels of prostate specific antigen (PSA) and recent increases in levels of PSA (PSA velocity) are markers for prostate cancer. We evaluate them as predictors of the risk that a man will be diagnosed with prostate cancer if biopsied. These markers should only be used in decisions to take a biopsy if they are sufficiently informative of this risk. Data for evaluating these markers come from the Prostate Cancer Prevention Trial, a randomized prospective study with 7 years of follow-up [30]. Subjects were at least 55 years old, had a serum PSA value less than 3.0 ng/ml at baseline and were scheduled for annual blood draws to measure serum PSA. Almost all subjects had a prostate biopsy taken at the end of the study. We analyze data for the 5519 men on the placebo arm of the trial who had a prostate biopsy, a PSA measure during the year prior to biopsy and at least 2 PSA values from the 3 years prior to biopsy to calculate PSA velocity. The prevalence of prostate cancer in the cohort is $\hat{\rho} = 21.9\%$. We selected 250 cases and 250 controls at random from the cohort to simulate a nested case-control study. Thus the data for analysis consist of the prevalence estimate from the 5519 men in the parent cohort and PSA and PSA velocity for subjects in the case-control subset.

To implement the semiparametric methods for estimating predictiveness curves, for each marker a logistic regression risk model was employed using a Box-Cox transformation of the marker. The two semiparametric estimators of the predictiveness curves are very similar to each other for both PSA and PSA velocity and so only the semiparametric maximum likelihood estimators are displayed in Figure 1. Also displayed in Figure 1 are the nonparametric predictiveness curve estimates. Observe that the semiparametric curves are much smoother than the nonparametric ones, but agree with them, suggesting a good fit for the semiparametric models. Overall, PSA has a steeper predictiveness curve, indicating that it is a better marker for predicting risk of prostate cancer than PSA velocity.

Figure 1. The predictiveness curves for PSA and PSA velocity for predicting prostate cancer.

For the semiparametric estimators, the asymptotic and bootstrap variance estimates for R̂(v) and R̂⁻¹(p) are similar in magnitude (results not shown). Moreover, the Wald confidence intervals for R(v) and R⁻¹(p) are close to those based on percentiles of the bootstrap distributions. Here we present only the latter. The pointwise 95% percentile bootstrap confidence intervals for R(v) constructed from the semiparametric maximum likelihood estimators are displayed in Figure 2(a)(b). They are much narrower in comparison to those constructed from the nonparametric estimators.

Figure 2. The 95% pointwise confidence intervals constructed from percentiles of the bootstrap distribution for the predictiveness curves of PSA and PSA velocity. SPMLE: semiparametric maximum likelihood estimator; NPMLE: nonparametric maximum likelihood estimator.

We next compare the predictive capacities of the two markers in terms of the 10^th and 90^th percentiles of their risk distributions and sizes of risk strata corresponding to a low risk threshold of 10% and a high risk threshold of 30% (Table 7 (a)). Results are presented for both the semiparametric maximum likelihood estimators and for the nonparametric estimators. P-values employ Wald tests based on differences in R(v) and R⁻¹(p) with variances estimated with the bootstrap. Using the semiparametric methods, PSA appears to have a better capacity to predict high risk of prostate cancer than does PSA velocity given that it has a larger value for R(0.9) as well as better capacity to predict low risk given that it has a smaller value for R(0.1). In addition, PSA categorizes more people into the low and high risk ranges as can be seen from semiparametric estimates of R⁻¹(0.1) and 1 − R⁻¹(0.3). In contrast, these comparisons are not significant based on the nonparametric methods due to their large sampling variabilities.
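A Wald test of this kind can be sketched in Python as follows. The sketch is illustrative and rests on assumptions not spelled out in the text: the function name is hypothetical, the bootstrap replicates for the two markers are assumed to be paired (computed on the same resampled subjects), and the standard error of the difference is taken as the standard deviation of the replicate differences.

```python
import math
import numpy as np


def wald_p_value(estimate_a, estimate_b, boot_a, boot_b):
    """Two-sided Wald p-value for a difference, with a bootstrap standard error.

    boot_a and boot_b are bootstrap replicates of the two estimates,
    assumed here to be paired across resamples.
    """
    diff = estimate_a - estimate_b
    boot_diff = np.asarray(boot_a, dtype=float) - np.asarray(boot_b, dtype=float)
    se = boot_diff.std(ddof=1)            # bootstrap standard error of the difference
    z = diff / se
    # Two-sided normal tail probability: 2 * {1 - Phi(|z|)} = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For example, `estimate_a` and `estimate_b` could be the estimates of R(0.9) for two markers, with `boot_a` and `boot_b` the corresponding bootstrap replicates.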

Table 7.

Comparisons between (a) PSA and PSA velocity and (b) between PSA and PSA plus other risk factors for predicting risk of prostate cancer. SPMLE denotes the semiparametric maximum likelihood estimator, NPMLE denotes the nonparametric maximum likelihood estimator.

(a) PSA versus PSA velocity
Measure	Method	PSA		PSA velocity		p-value
		Est	95% CI	Est	95% CI	
R(0.1)	NPMLE	0.079	(0.043, 0.110)	0.088	(0.044, 0.131)	0.730
	SPMLE	0.072	(0.046, 0.109)	0.122	(0.075, 0.159)	0.027
R(0.9)	NPMLE	0.474	(0.369, 0.577)	0.302	(0.253, 0.402)	0.005
	SPMLE	0.413	(0.356, 0.476)	0.313	(0.275, 0.356)	< 0.001
R^–1(0.1)	NPMLE	0.304	(0.050, 0.39)	0.155	(0.019, 0.305)	0.178
	SPMLE	0.188	(0.073, 0.291)	0.06	(0.020, 0.142)	0.021
1 – R^–1(0.3)	NPMLE	0.302	(0.140, 0.447)	0.168	(0.007, 0.501)	0.274
	SPMLE	0.244	(0.191, 0.296)	0.129	(0.030, 0.197)	0.009
R^–1(0.3) – R^–1(0.1)	NPMLE	0.393	(0.210, 0.664)	0.677	(0.301, 0.933)	0.100
	SPMLE	0.568	(0.443, 0.724)	0.811	(0.668, 0.935)	0.004

(b) PSA versus PSA plus other risk factors
Measure	Method	PSA		PSA + other factors		p-value
		Est	95% CI	Est	95% CI	
R(0.1)	SPMLE	0.072	(0.045, 0.109)	0.070	(0.039, 0.094)	0.798
R(0.9)	SPMLE	0.413	(0.356, 0.476)	0.429	(0.372, 0.502)	0.223
R^–1(0.1)	SPMLE	0.188	(0.073, 0.291)	0.204	(0.109, 0.310)	0.595
1 – R^–1(0.3)	SPMLE	0.244	(0.191, 0.296)	0.243	(0.203, 0.281)	0.952
R^–1(0.3) – R^–1(0.1)	SPMLE	0.568	(0.443, 0.724)	0.554	(0.436, 0.662)	0.667


6. Extending Semiparametric Estimation to Multiple Markers

The semiparametric estimators can be extended naturally to accommodate multiple predictors or to settings where the monotone increasing risk assumption is not true.

6.1 Inference

We present the generalized estimators here as well as their asymptotic distribution theory. In practice, since estimation of the asymptotic variance involves both numerical differentiation and nonparametric density estimation, we rely on resampling techniques rather than on asymptotic theory for inference.

Let Y be a vector of predictors that may include different functional forms of a single predictor (e.g. a set of spline basis functions) as well as interactions among predictors. Let F_R, F_DR, F_D̄R indicate the cumulative distribution functions for Risk(Y) in the general, case and control populations respectively. As before, we calculate $\widehat{\mathrm{Risk}}(Y_i)$ as the predicted risk for subject i based on fitting a standard logistic regression model to case-control data with offset $\log\{(1 - \hat{\rho})/\hat{\rho} \times n_D/n_{\bar{D}}\}$. To estimate F_R we write F_R = ρF_DR + (1 − ρ)F_D̄R and substitute estimates for each component. The components F_DR and F_D̄R can be estimated “empirically” using $\tilde{F}_{DR}(p) = \sum_{i=1}^{n_D} I\{\widehat{\mathrm{Risk}}(Y_{Di}) \leq p\}/n_D$ and $\tilde{F}_{\bar{D}R}(p) = \sum_{i=1}^{n_{\bar{D}}} I\{\widehat{\mathrm{Risk}}(Y_{\bar{D}i}) \leq p\}/n_{\bar{D}}$.

A more efficient approach is to use the semiparametric maximum likelihood estimates that are derived using arguments similar to those provided in Huang and Pepe [11] for the single marker setting:

\begin{matrix} \hat{F}_{DR}(p) & = \frac{1}{n} \sum_{i=1}^{n} \frac{\widehat{LR}_R\{\widehat{\mathrm{Risk}}(Y_i)\} I\{\widehat{\mathrm{Risk}}(Y_i) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_D}{n} \widehat{LR}_R\{\widehat{\mathrm{Risk}}(Y_i)\}}, \\ \hat{F}_{\bar{D}R}(p) & = \frac{1}{n} \sum_{i=1}^{n} \frac{I\{\widehat{\mathrm{Risk}}(Y_i) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_D}{n} \widehat{LR}_R\{\widehat{\mathrm{Risk}}(Y_i)\}}, \end{matrix}

where $\widehat{LR}_R\{\widehat{\mathrm{Risk}}(Y_i)\} = \widehat{\mathrm{Risk}}(Y_i)/\{1 - \widehat{\mathrm{Risk}}(Y_i)\} \times (1 - \hat{\rho})/\hat{\rho}$.

We write $\hat{R}(v) = \hat{F}_R^{-1}(v)$ and R̂⁻¹(p) = F̂_R(p) for the semiparametric maximum likelihood estimators of the predictiveness curve, and $\tilde{R}(v) = \tilde{F}_R^{-1}(v)$ and R̃⁻¹(p) = F̃_R(p) for the semiparametric “empirical” estimators. The following results are proved in Appendix B4, where the variability of $\hat{\rho}$ is taken into account in calculating the asymptotic variances of the predictiveness curve estimators.
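The two estimators of F_R(p) = R⁻¹(p) can be sketched in Python as follows. This is a minimal illustration, not the authors' R programs: the function names are hypothetical, and the inputs `risk_cases` and `risk_controls` are assumed to be predicted population risks already obtained from the offset-corrected logistic regression fit.

```python
import numpy as np


def likelihood_ratio(risk_hat, rho_hat):
    """LR_R{Risk(Y)}: odds of the predicted risk divided by the prevalence odds."""
    return risk_hat / (1.0 - risk_hat) * (1.0 - rho_hat) / rho_hat


def cdf_empirical(p, risk_cases, risk_controls, rho_hat):
    """Semiparametric 'empirical' estimate: rho * F_DR(p) + (1 - rho) * F_DbarR(p)."""
    f_d = np.mean(np.asarray(risk_cases) <= p)
    f_dbar = np.mean(np.asarray(risk_controls) <= p)
    return rho_hat * f_d + (1.0 - rho_hat) * f_dbar


def cdf_spmle(p, risk_cases, risk_controls, rho_hat):
    """Semiparametric maximum likelihood estimate, pooling cases and controls."""
    risk_all = np.concatenate([np.asarray(risk_cases), np.asarray(risk_controls)])
    n_d, n_dbar = len(risk_cases), len(risk_controls)
    n = n_d + n_dbar
    lr = likelihood_ratio(risk_all, rho_hat)
    denom = n_dbar / n + n_d / n * lr          # weight denominator for each subject
    below = risk_all <= p                      # indicator I{Risk(Y_i) <= p}
    f_d = np.sum(lr * below / denom) / n       # estimate of F_DR(p)
    f_dbar = np.sum(below / denom) / n         # estimate of F_DbarR(p)
    return rho_hat * f_d + (1.0 - rho_hat) * f_dbar
```

Evaluating either function over a grid of p and inverting numerically would give the corresponding estimate of R(v). The weights in `cdf_spmle` are designed for risks that come from the fitted logistic model; with arbitrary inputs the estimate need not reach one at p = 1.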

Theorem 4 As n → ∞,

$\sqrt{n}\{\hat{R}^{-1}(p) - R^{-1}(p)\}$ converges to a normal random variable with mean zero and variance
$\Sigma_{2M.R}(p) = \mathrm{var}[\sqrt{n}\{Q_M(p) - R^{-1}(p)\}] + \left\{\frac{\partial R^{-1}(p)}{\partial \theta}\right\}^{T} \mathrm{var}\{\sqrt{n}(\hat{\theta} - \theta)\} \left\{\frac{\partial R^{-1}(p)}{\partial \theta}\right\} + 2 \left\{\frac{\partial R^{-1}(p)}{\partial \theta}\right\}^{T} \mathrm{cov}[\sqrt{n}(\hat{\theta} - \theta), \sqrt{n}\{Q_M(p) - R^{-1}(p)\}],$
$\sqrt{n}\{\hat{R}(v) - R(v)\}$ converges to a normal random variable with mean zero and variance Σ_1M.R(v) = {∂R(v)/∂v}² Σ_2M.R{R(v)}, where
$Q_M(p) = \hat{\rho} \frac{1}{n} \sum_{i=1}^{n} \frac{\widehat{LR}_R\{\widehat{\mathrm{Risk}}(Y_i)\} I\{\mathrm{Risk}(Y_i) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_D}{n} \widehat{LR}_R\{\widehat{\mathrm{Risk}}(Y_i)\}} + (1 - \hat{\rho}) \frac{1}{n} \sum_{i=1}^{n} \frac{I\{\mathrm{Risk}(Y_i) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_D}{n} \widehat{LR}_R\{\widehat{\mathrm{Risk}}(Y_i)\}} .$
$\sqrt{n}\{\tilde{R}^{-1}(p) - R^{-1}(p)\}$ converges to a normal random variable with mean zero and variance
$\Sigma_{2E.R}(p) = \mathrm{var}[\sqrt{n}\{Q_E(p) - R^{-1}(p)\}] + \left\{\frac{\partial R^{-1}(p)}{\partial \theta}\right\}^{T} \mathrm{var}\{\sqrt{n}(\hat{\theta} - \theta)\} \left\{\frac{\partial R^{-1}(p)}{\partial \theta}\right\} + 2 \left\{\frac{\partial R^{-1}(p)}{\partial \theta}\right\}^{T} \mathrm{cov}[\sqrt{n}(\hat{\theta} - \theta), \sqrt{n}\{Q_E(p) - R^{-1}(p)\}],$
$\sqrt{n}\{\tilde{R}(v) - R(v)\}$ converges to a normal random variable with mean zero and variance Σ_1E.R(v) = {∂R(v)/∂v}² Σ_2E.R{R(v)}, where
$Q_E(p) = \hat{\rho} \frac{1}{n_D} \sum_{i=1}^{n_D} I\{\mathrm{Risk}(Y_{Di}) \leq p\} + (1 - \hat{\rho}) \frac{1}{n_{\bar{D}}} \sum_{i=1}^{n_{\bar{D}}} I\{\mathrm{Risk}(Y_{\bar{D}i}) \leq p\} .$

6.2 Illustration

We illustrate using the data described in Section 5. We compare the logistic risk model based on PSA alone with a more comprehensive risk model that combines PSA and other risk factors, namely family history of prostate cancer, digital rectal exam, and previous negative biopsies. All of these factors are highly statistically significant factors in determining risk [30].

A fundamental preliminary step in assessing the value of a risk model concerns calibration or model fit. The predictiveness curve can be helpful in this assessment [6]. We now illustrate how the assessment can be made with data from a nested case-control study by application to the model that includes other risk factors in addition to PSA. We let $\widehat{\mathrm{Risk}}(Y)$ be risk estimates from the model employing PSA and other risk factors and partition the observations into deciles of the distribution for $\widehat{\mathrm{Risk}}(Y)$. For k ∈ {1, . . . , K = 10}, we estimate P̂(D = 1|k, S) as the observed proportion of cases within the k^th group. The population risk within the k^th group, P(D = 1|k), is then estimated according to

\frac{\hat{P}(D = 1 \mid k)}{1 - \hat{P}(D = 1 \mid k)} = \frac{\hat{P}(D = 1 \mid k, S)}{1 - \hat{P}(D = 1 \mid k, S)} \frac{n_{\bar{D}}}{n_D} \frac{\hat{\rho}}{1 - \hat{\rho}} .

A visual comparison can then be made by superimposing the values of P̂(D = 1|k), plotted at the midpoint of each group, on the predictiveness curve plot. This comparison of observed risk and average risk within deciles of modeled risk provides a graphical display of calibration as suggested previously [6] but now generalized to case-control data. According to Figure 3(a), there does not appear to be serious lack of fit of the risk model.
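The conversion from the case proportion observed in a group of the case-control sample to a population risk can be sketched in Python. The function name is hypothetical, and the sketch assumes the Bayes-rule rescaling in which the sample odds are multiplied by (n_D̄/n_D) × ρ̂/(1 − ρ̂) to undo the oversampling of cases.

```python
def population_risk_from_sample(p_sample, n_cases, n_controls, rho_hat):
    """Map a case proportion in the case-control sample to a population risk.

    Assumes sample odds = population odds * (n_cases / n_controls) * (1 - rho) / rho,
    so the correction multiplies by (n_controls / n_cases) * rho / (1 - rho).
    """
    sample_odds = p_sample / (1.0 - p_sample)
    population_odds = sample_odds * (n_controls / n_cases) * rho_hat / (1.0 - rho_hat)
    return population_odds / (1.0 + population_odds)


# With 250 cases, 250 controls and prevalence 0.219 (the values in Section 5),
# a group whose sample case proportion is 0.5 maps back to the prevalence itself.
example_risk = population_risk_from_sample(0.5, 250, 250, 0.219)
```

A group in which cases and controls appear in the same proportion as in the overall sample carries no information beyond the prevalence, which is why the example returns 0.219.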

Figure 3. (a) The semiparametric maximum likelihood estimates of the predictiveness curves for PSA and PSA plus other factors for predicting risk of prostate cancer; the dots are average risks within deciles of modeled risk based on the latter model. (b) Their 95% pointwise confidence intervals using percentiles of the bootstrap distribution. The horizontal lines indicate disease prevalence.

Let $\widehat{\mathrm{Risk}}^{S}$ be the risk estimate from the logistic regression applied to the case-control sample without correcting the intercept term. We propose a Hosmer-Lemeshow measure of calibration to accompany the plot:

T = \sum_{k=1}^{K} \frac{\{\hat{P}(D = 1 \mid k) - \hat{R}_k\}^2}{\hat{R}_k^2 (1 - \hat{R}_k)^2 / \{n_k \hat{R}_k^S (1 - \hat{R}_k^S)\}}, \qquad (6.1)

where $\hat{R}_k^S$ is the average of $\widehat{\mathrm{Risk}}^{S}$ within the k^th group, and R̂_k is a population version of it corrected for the biased sampling

\hat{R}_k = \left\{\hat{R}_k^S \frac{n_{\bar{D}}}{n_D} \frac{\hat{\rho}}{1 - \hat{\rho}}\right\} \Big/ \left\{1 + \hat{R}_k^S \frac{n_{\bar{D}}}{n_D} \frac{\hat{\rho}}{1 - \hat{\rho}}\right\} .

Within each discretized risk group, the term in the numerator of (6.1) is the squared difference between the ‘observed’ and ‘expected’ disease proportion in the population, while the term in the denominator is the estimated variance of this difference. Comparing this measure T with a $\chi^2_{K-2}$ distribution, we obtain a p-value for the test of calibration. The calculation of R̂_k and justification for this test are outlined in Appendix C. Our measure is a modification of the established Hosmer-Lemeshow test for case-control studies [32], which compares observed and expected disease proportions in the case-control sample. The test without modification yields a valid test with case-control data. However, the modified test is more tightly linked to the predictiveness curve, and the measure of calibration upon which it is based does not vary with the case-control design. In our example, the modified test yields a p-value of 0.170 for the predictiveness curve employing PSA and other risk factors, suggesting that the predictiveness curve provides a reasonable representation of the risk distribution in the population.
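The statistic in (6.1) and its reference distribution can be sketched in Python. This is an illustrative sketch with hypothetical function names. The per-group quantities P̂(D = 1|k), R̂_k, R̂_k^S and n_k are taken as inputs rather than computed from data, and the tail probability uses the closed form for a chi-square distribution with an even number of degrees of freedom, which covers K = 10 (8 degrees of freedom).

```python
import math
import numpy as np


def modified_hl_statistic(p_obs, r_pop, r_sample, n_k):
    """Statistic T of (6.1) from per-group inputs.

    p_obs    : estimated population risk in each group, P(D = 1 | k)
    r_pop    : expected population risk in each group, R_k
    r_sample : average uncorrected sample risk in each group, R_k^S
    n_k      : number of sampled subjects in each group
    """
    p_obs, r_pop, r_sample, n_k = (
        np.asarray(a, dtype=float) for a in (p_obs, r_pop, r_sample, n_k)
    )
    variance = r_pop**2 * (1.0 - r_pop) ** 2 / (n_k * r_sample * (1.0 - r_sample))
    return float(np.sum((p_obs - r_pop) ** 2 / variance))


def chi2_upper_tail_even_df(x, df):
    """P(chi-square_df > x) for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(df // 2))
```

With K = 10 groups, the p-value would be obtained as `chi2_upper_tail_even_df(T, 8)`.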

Figure 3(a) displays the semiparametric maximum likelihood estimators of the predictiveness curves for PSA alone and for the model that includes other factors (the semiparametric “empirical” estimators are similar). The two risk models have very similar predictiveness curves. Confidence intervals for R(v) constructed with the semiparametric maximum likelihood estimators are presented in Figure 3(b). Sampling variability of estimates derived from the two risk models appears to be similar in magnitude.

Detailed results comparing the predictiveness curves of the two models are shown in Table 7(b). Briefly, the percentages of people classified into the low, high, or equivocal risk ranges are not significantly different between the two models, nor are the 10^th and 90^th percentiles of risk significantly different. Thus including other factors in the model in addition to PSA does not lead to a significant improvement in risk stratification even when these factors are all statistically significant in the multivariate logistic regression model. It reinforces our earlier argument that the risk model by itself is not enough to characterize the population performance of a risk prediction model.

An important issue pertaining to models with multiple predictors is over-fitting when the number of predictors gets large relative to the sample size. To account for potential over-fitting, we implemented 10-fold cross-validation to compare the predictiveness of the two models. Again, including other factors besides PSA in the risk model has a trivial influence on risk stratification.

7. Discussion

It has been argued in both the applied [8] and biostatistical literature [2, 6] that displaying the population distribution of risk is useful for evaluating the potential impact of a risk model for risk stratifying the population. The key ideas of risk stratification tables, introduced by Cook and others [33] and Cook [34], are closely related. In particular, the margins of the two-way risk stratification table show the population distribution of risk achieved by the two models, albeit using discrete risk categories. Janes and others [35] show that the key information pertinent to comparing models is contained in the margins rather than in the cells of the table. Predictiveness curves provide more complete descriptions of the marginal risk distributions since they show risk distributions over a continuum of risk thresholds that could be used to define risk categories rather than only at a few pre-specified risk thresholds.

Methods for estimating predictiveness curves from cohort studies were developed previously. However, case-control designs are often preferred in biomarker development [9] and the goal of the current paper is to develop estimation procedures for use with case-control data. Here we discussed semiparametric methods that rely on a logistic regression form for the risk and a nonparametric method that relies on isotonic regression for estimating the risk. Another approach developed by Huang and Pepe [36] is based on the relationship between the predictiveness curve and the ROC curve. Here, we found that the nonparametric method is inefficient compared with the semiparametric methods and that valid inference requires large sample sizes. We recommend the semiparametric methods for use in practice because (i) simulations indicate that inferential procedures are adequate with realistic sample sizes, (ii) they accommodate risk models with multiple predictors, and (iii) they can be made flexible by employing flexible forms for the predictors in the logistic regression model. The last point is important to ensure good model fit by the semiparametric model. The nonparametric estimator has the advantage that it is completely robust, but it is potentially very inefficient. It can therefore be useful in large studies where precision is not an issue and minimum bias is desired. It can also be compared with the semiparametric estimator in a single marker setting to assess the goodness-of-fit of the latter. For a general logistic risk model allowing for multiple markers, we proposed a modified Hosmer-Lemeshow test assessing calibration of the risk model. It extends the established Hosmer-Lemeshow test for case-control data by mapping the difference between observed and expected disease proportion at the case-control sample level to that at the population level. As a result, performance of the test under the alternative hypothesis would potentially be less sensitive to factors such as the case-control ratio. Based on limited simulation studies (results not shown), this modified test appears to have power comparable to that of the standard Hosmer-Lemeshow test and is more powerful in some settings when the proportion of cases in the case-control sample is high.

Pepe and others [6] proposed displaying the predictiveness curve and curves displaying true and false positive rates together for maximum information. Specifically, to evaluate a risk prediction marker, one will be interested in knowing not only 1 − R⁻¹(p), the proportion of the population with risk above p, but also the proportion of diseased subjects correctly classified (the true positive rate TPR(p) = P{Risk(Y) > p|D = 1}) and the proportion of non-diseased subjects incorrectly classified (the false positive rate FPR(p) = P{Risk(Y) > p|D = 0}), according to the classification rule ‘Risk(Y) > p’. Our semiparametric and nonparametric procedures developed in this manuscript yield estimators of F_DR, F_D̄R and F_R as by-products. These can be directly plugged into $TPR(p) = F_{DR}\{F_R^{-1}(p)\}$ and $FPR(p) = F_{\bar{D}R}\{F_R^{-1}(p)\}$ to estimate these quantities. Asymptotic theory for corresponding semiparametric estimators can be developed using techniques similar to those employed for estimators of the predictiveness curve.

Finally, for interested readers, fitting of the logistic regression and isotonic regression models can be performed using standard statistical software such as R. R programs for estimating the corresponding predictiveness curves and their asymptotic variances are available from the authors upon request.

Acknowledgments

The authors are grateful for support provided by NIGMS grant GM-54438 and NCI grant CA86368.

Appendix A: Analytic Forms of the Asymptotic Variances for the Semiparametric Estimators of the Predictiveness Curve (for the Example in Section 3.1)

Let α = θ₀ − log{ρ/(1 − ρ)}, β = θ₁, and let η = n_D/n_D̄, we have

\begin{matrix}
V_{M1} & = (1 - \rho)^2 (1 + \eta) \left[ F_{\bar{D}}\{G^{-1}(\theta, p)\} - F_{\bar{D}}\{G^{-1}(\theta, p)\}^2 \right] + \rho^2 \frac{1 + \eta}{\eta} \left[ F_D\{G^{-1}(\theta, p)\} - F_D\{G^{-1}(\theta, p)\}^2 \right] - \left( \frac{1 + \eta}{\eta} \right) \{\rho - (1 - \rho)\eta\}^2 \left\{ A_0\{G^{-1}(\theta, p)\} - \left[ A_0\{G^{-1}(\theta, p)\}, A_1\{G^{-1}(\theta, p)\}^T \right] A^{-1} \begin{bmatrix} A_0\{G^{-1}(\theta, p)\} \\ A_1\{G^{-1}(\theta, p)\} \end{bmatrix} \right\}, \\
V_{E1} & = (1 - \rho)^2 (1 + \eta) \left[ F_{\bar{D}}\{G^{-1}(\theta, p)\} - F_{\bar{D}}\{G^{-1}(\theta, p)\}^2 \right] + \rho^2 \frac{1 + \eta}{\eta} \left[ F_D\{G^{-1}(\theta, p)\} - F_D\{G^{-1}(\theta, p)\}^2 \right], \\
V_{M2} & = V_{E2} = \frac{1 + \eta}{\eta} \left\{ A^{-1} - \begin{pmatrix} 1 + \eta & 0 \\ 0 & 0 \end{pmatrix} \right\}, \\
V_{M3} & = V_{E3} = \frac{1 + \eta}{\eta} \left( \{\rho - \eta(1 - \rho)\} A^{-1} \begin{bmatrix} A_0\{G^{-1}(\theta, p)\} \\ A_1\{G^{-1}(\theta, p)\} \end{bmatrix} - \begin{bmatrix} \rho F_D\{G^{-1}(\theta, p)\} - \eta(1 - \rho) F_{\bar{D}}\{G^{-1}(\theta, p)\} \\ 0 \end{bmatrix} \right),
\end{matrix}

where

\begin{matrix}
A_0(t) & = \int_{-\infty}^{t} \frac{\exp\{\alpha + \beta^T r(y)\}}{1 + \eta \exp\{\alpha + \beta^T r(y)\}} \, dF_{\bar{D}}(y), \\
A_1(t) & = \int_{-\infty}^{t} \frac{r(y) \exp\{\alpha + \beta^T r(y)\}}{1 + \eta \exp\{\alpha + \beta^T r(y)\}} \, dF_{\bar{D}}(y), \\
A_2(t) & = \int_{-\infty}^{t} \frac{r(y) r(y)^T \exp\{\alpha + \beta^T r(y)\}}{1 + \eta \exp\{\alpha + \beta^T r(y)\}} \, dF_{\bar{D}}(y), \\
A & = \begin{pmatrix} A_0 & A_1^T \\ A_1 & A_2 \end{pmatrix},
\end{matrix}

with A₀ = A₀(∞), A₁ = A₁(∞), A₂ = A₂(∞).

Appendix B: Proof of Theorems

B1: Proof of Theorem 1

Suppose there are m pooled groups after isotonic regression with ŵ(Y) < ∞. In the i^th group, there are m_i observations, among which m_Di are cases. Then for subject k (k ∉ κ) belonging to the i^th group, ŵ(Y_k) = m_Di/(m_i − m_Di).

Plugging $\hat{\mu} = n_D/n_{\bar{D}}$ into (3.5) results in

\sum_{k \notin \kappa} \frac{\hat{\mu}}{n_D \hat{w}(Y_k) + n_{\bar{D}} \hat{\mu}} = \sum_{k \notin \kappa} \frac{\frac{n_D}{n_{\bar{D}}}}{n_D \hat{w}(Y_k) + n_{\bar{D}} \frac{n_D}{n_{\bar{D}}}} = \sum_{i=1}^{m} \frac{\frac{n_D}{n_{\bar{D}}} m_i}{n_D \frac{m_{Di}}{m_i - m_{Di}} + n_{\bar{D}} \frac{n_D}{n_{\bar{D}}}} = \sum_{i=1}^{m} \frac{\frac{n_D}{n_{\bar{D}}} m_i (m_i - m_{Di})}{n_D m_{Di} + n_D (m_i - m_{Di})} = \sum_{i=1}^{m} \frac{m_i - m_{Di}}{n_{\bar{D}}} = \frac{n_{\bar{D}}}{n_{\bar{D}}} = 1 .

Since the term on the left-hand side of (3.5) is monotone increasing in μ, $\hat{\mu} = n_D/n_{\bar{D}}$ is the unique solution.

B2: Proof of Theorem 2

At the end of the isotonic regression, the estimated risks are constant within each block of marker values. Suppose there are m blocks with m_i subjects and m_Di cases in the i^th block. Let y_(1), ..., y_(n) be the marker values in the case-control sample in increasing order, with y_(i1), ..., y_{(im_i)} belonging to the i^th block; then P̂(D = 1|Y) is constant for Y ∈ {y_(i1), ..., y_{(im_i)}}. Because the quantile function F⁻¹ is defined to be left continuous by convention, the nonparametric estimator R̂(v) or R̃(v) as a function of v is a step function that is about to jump (but has not yet jumped) at every v corresponding to the largest element in a block, i.e. at v = F̂{y_{(im_i)}} or v = F̃{y_{(im_i)}} for i = 1, ..., m.

Therefore, to show the equivalence of the two predictiveness curve estimators, all we need to show is that the set of values of v at which jumps are about to occur is the same for the two curves. In other words, we want to show that F̃{y_{(im_i)}} = F̂{y_{(im_i)}} for i = 1, ..., m.

Notice that

\begin{matrix}
\tilde{F}\{y_{(i m_i)}\} & = \hat{\rho} \frac{1}{n_D} \sum_{j=1}^{n_D} I\{Y_{Dj} \leq y_{(i m_i)}\} + (1 - \hat{\rho}) \frac{1}{n_{\bar{D}}} \sum_{j=1}^{n_{\bar{D}}} I\{Y_{\bar{D}j} \leq y_{(i m_i)}\} \\
& = \hat{\rho} \frac{1}{n_D} \sum_{l=1}^{i} m_{Dl} + (1 - \hat{\rho}) \frac{1}{n_{\bar{D}}} \sum_{l=1}^{i} (m_l - m_{Dl})
\end{matrix}

and that

\begin{matrix}
\hat{F}\{y_{(i m_i)}\} & = \hat{\rho} \frac{1}{n} \sum_{j \notin \kappa, j=1}^{n} \frac{\frac{m_{Dj}}{m_j - m_{Dj}} \frac{n_{\bar{D}}}{n_D} I\{Y_j \leq y_{(i m_i)}\}}{\frac{n_{\bar{D}}}{n} + \frac{n_D}{n} \frac{m_{Dj}}{m_j - m_{Dj}} \frac{n_{\bar{D}}}{n_D}} + \hat{\rho} \sum_{j \in \kappa, j=1}^{n} \frac{I\{Y_j \leq y_{(i m_i)}\}}{n_D} + (1 - \hat{\rho}) \frac{1}{n} \sum_{j=1}^{n} \frac{I\{Y_j \leq y_{(i m_i)}\}}{\frac{n_{\bar{D}}}{n} + \frac{n_D}{n} \frac{m_{Dj}}{m_j - m_{Dj}} \frac{n_{\bar{D}}}{n_D}} \\
& = \hat{\rho} \frac{1}{n_D} \sum_{l \not\subset \kappa, l=1}^{i} m_{Dl} + \hat{\rho} \frac{1}{n_D} \sum_{l \subset \kappa, l=1}^{i} m_l + (1 - \hat{\rho}) \frac{1}{n_{\bar{D}}} \sum_{l=1}^{i} (m_l - m_{Dl}) \\
& = \hat{\rho} \frac{1}{n_D} \sum_{l \not\subset \kappa, l=1}^{i} m_{Dl} + \hat{\rho} \frac{1}{n_D} \sum_{l \subset \kappa, l=1}^{i} m_{Dl} + (1 - \hat{\rho}) \frac{1}{n_{\bar{D}}} \sum_{l=1}^{i} (m_l - m_{Dl}) \\
& = \hat{\rho} \frac{1}{n_D} \sum_{l=1}^{i} m_{Dl} + (1 - \hat{\rho}) \frac{1}{n_{\bar{D}}} \sum_{l=1}^{i} (m_l - m_{Dl}) .
\end{matrix}

Consequently under the monotone increasing risk model assumption, the nonparametric “empirical” and model-based approaches lead to the same estimator of the predictiveness curve.

B3: Proof of Theorem 3

For a continuous marker Y, it has been shown that there is a one-to-one relationship between the predictiveness curve and the ROC curve [36]. That is, suppose P(D = 1|Y) is monotone increasing in Y, R(v) vs v can be represented as

\frac{\rho \, \mathrm{ROC}'(t)}{\rho \, \mathrm{ROC}'(t) + (1 - \rho)} \quad \text{vs} \quad 1 - (1 - \rho) t - \rho \, \mathrm{ROC}(t), \quad t \in (0, 1) .
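This representation can be illustrated numerically with a short Python sketch. It is a hypothetical example, not part of the proof: the function name is an assumption, the ROC curve is taken to be piecewise linear and concave through the supplied points, and the derivative is the slope of the segment to the right of each point.

```python
import numpy as np


def predictiveness_from_roc(t, roc, rho):
    """Map a piecewise-linear concave ROC curve to points (v, R(v)).

    t, roc : false and true positive rates at the ROC vertices, increasing in t
    rho    : disease prevalence
    Returns v (sorted increasingly) and the risk on the step ending at each v.
    """
    t = np.asarray(t, dtype=float)
    roc = np.asarray(roc, dtype=float)
    slope = np.diff(roc) / np.diff(t)              # right-hand derivative ROC'(t)
    t_left, roc_left = t[:-1], roc[:-1]
    risk = rho * slope / (rho * slope + (1.0 - rho))
    v = 1.0 - (1.0 - rho) * t_left - rho * roc_left
    order = np.argsort(v)
    return v[order], risk[order]


# Example with three linear segments and prevalence 0.2.
v, risk = predictiveness_from_roc([0.0, 0.2, 0.5, 1.0], [0.0, 0.6, 0.9, 1.0], 0.2)
# The area under the resulting step function equals the prevalence.
widths = np.diff(np.concatenate([[0.0], v]))
area = float(np.sum(risk * widths))
```

Because the ROC slopes decrease in t while v decreases in t, the resulting risks increase in v, as required of a predictiveness curve.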

A similar result can be proved for the semiparametric maximum likelihood estimator. Suppose the unique marker values within the case-control sample are {y₁, ..., y_n} in increasing order. For a marker value y_i,

\begin{aligned} v_{i} &= \hat{F}(y_{i}) = \hat{ρ} \hat{F}_{D}(y_{i}) + (1 - \hat{ρ}) \hat{F}_{\bar{D}}(y_{i}), \\ \hat{R}(v_{i}) &= \hat{P}(D = 1 \mid Y = y_{i}) = \frac{\hat{P}(D = 1, Y = y_{i})}{\hat{P}(Y = y_{i})} \\ &= \frac{\hat{ρ} \hat{P}(Y = y_{i} \mid D = 1)}{\hat{ρ} \hat{P}(Y = y_{i} \mid D = 1) + (1 - \hat{ρ}) \hat{P}(Y = y_{i} \mid D = 0)} = \frac{\hat{ρ}\, \widehat{\mathrm{LR}}(y_{i})}{\hat{ρ}\, \widehat{\mathrm{LR}}(y_{i}) + (1 - \hat{ρ})} . \end{aligned}

Next we generate the ROC curve, $\widehat{\mathrm{ROC}}(t)$, corresponding to the semiparametric maximum likelihood estimator of the predictiveness curve: (1) we order the support of the marker decreasingly; (2) we estimate the pair {TPF(c), FPF(c)} for c ∈ {y_n, ..., y₁, −∞} (here we define positive as Y > c instead of Y ≥ c to accommodate the convention that F̂ is right continuous); (3) we connect neighboring points by a straight line and define $\widehat{\mathrm{ROC}}'(t)$ to be the right-hand derivative of $\widehat{\mathrm{ROC}}(t)$. Suppose F̂_D̄(y_i) = 1 − t_i. Since $\widehat{\mathrm{LR}}\{\hat{F}^{-1}(v_{i})\} = \widehat{\mathrm{LR}}\{\hat{F}_{\bar{D}}^{-1}(1 - t_{i})\} = \widehat{\mathrm{ROC}}'(t_{i})$, we have that R̂(v) vs v can be represented as

\frac{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t)}{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t) + (1 - \hat{ρ})} \quad \text{vs} \quad 1 - (1 - \hat{ρ}) t - \hat{ρ}\, \widehat{\mathrm{ROC}}(t), \quad t \in (0, 1) .

For the semiparametric maximum likelihood estimator, P̂(Y = y_i|D = 0) > 0, thus we have $\widehat{\mathrm{LR}}(y_{i}) = \hat{P}(Y = y_{i} \mid D = 1) / \hat{P}(Y = y_{i} \mid D = 0) < \infty$. That is, the derivative of the corresponding ROC curve is always finite. Therefore, the semiparametric maximum likelihood estimator of the predictiveness curve corresponds to an ROC curve which is continuous and piecewise differentiable everywhere. Moreover the ROC curve is concave since P(D = 1|Y) is monotone increasing in Y. We have

\begin{aligned} \int_{0}^{1} \hat{R}(v)\, dv &= \int_{t = 1}^{t = 0} \frac{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t)}{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t) + (1 - \hat{ρ})}\, d\{1 - (1 - \hat{ρ}) t - \hat{ρ}\, \widehat{\mathrm{ROC}}(t)\} \\ &= - \int_{t = 1}^{t = 0} \frac{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t)}{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t) + (1 - \hat{ρ})} \{(1 - \hat{ρ}) + \hat{ρ}\, \widehat{\mathrm{ROC}}'(t)\}\, dt \\ &= \hat{ρ} \int_{t = 0}^{t = 1} \widehat{\mathrm{ROC}}'(t)\, dt = \hat{ρ} \{\widehat{\mathrm{ROC}}(1) - \widehat{\mathrm{ROC}}(0)\} = \hat{ρ} . \end{aligned}

For the nonparametric maximum likelihood estimator of the predictiveness curve, we can obtain the corresponding ROC curve similarly. This ROC curve is piecewise differentiable with finite derivative everywhere if P̂(D = 1|Y = y) < 1 for every y in the support of the marker. However, when we use isotonic regression to estimate the risk model, the estimated risk could be 1 if the largest marker measure comes from the case sample. This would lead to a vertical line from (0, 0) to (0, n_κ/n_D) in the corresponding ROC curve, where n_κ is the number of observations in κ. Nevertheless, the area under the estimated predictiveness curve is still equal to $\hat{ρ}$ in this scenario because

\begin{aligned} \int_{0}^{1} \hat{R}(v)\, dv &= 1 \times \sum_{k \in κ} \hat{f}(Y_{k}) + \int_{t = 1}^{t = 0^{+}} \frac{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t)}{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t) + (1 - \hat{ρ})}\, d\{1 - (1 - \hat{ρ}) t - \hat{ρ}\, \widehat{\mathrm{ROC}}(t)\} \\ &= \hat{ρ} n_{κ} \frac{1}{n_{D}} + (1 - \hat{ρ}) \times 0 - \int_{t = 1}^{t = 0^{+}} \frac{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t)}{\hat{ρ}\, \widehat{\mathrm{ROC}}'(t) + (1 - \hat{ρ})} \{(1 - \hat{ρ}) + \hat{ρ}\, \widehat{\mathrm{ROC}}'(t)\}\, dt \\ &= \hat{ρ} n_{κ} \frac{1}{n_{D}} + \hat{ρ} \int_{t = 0^{+}}^{t = 1} \widehat{\mathrm{ROC}}'(t)\, dt = \hat{ρ} n_{κ} \frac{1}{n_{D}} + \hat{ρ} \{\widehat{\mathrm{ROC}}(1) - \widehat{\mathrm{ROC}}(0^{+})\} \\ &= \hat{ρ} n_{κ} \frac{1}{n_{D}} + \hat{ρ} \left(1 - \frac{n_{κ}}{n_{D}}\right) = \hat{ρ} . \end{aligned}
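The identity can also be verified on a small data set. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it fits the sample risk by a plain pool-adjacent-violators routine, converts it to population risk by Bayes' theorem with an externally supplied prevalence estimate, and weights each case by ρ̂/n_D and each control by (1 − ρ̂)/n_D̄. The function names (`pava`, `predictiveness_area`), the toy data, and the assumption of no tied marker values are all hypothetical.

```python
def pava(y):
    """Pool-adjacent-violators: nondecreasing least-squares fit to y (equal weights)."""
    blocks = []  # each block is [sum of y, number of observations]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block mean exceeds the current block mean
        # (compared by cross-multiplication to avoid division).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def predictiveness_area(marker, disease, rho_hat):
    """Area under the nonparametric predictiveness curve from case-control data.

    Assumes distinct marker values, disease coded 0/1, and 0 < rho_hat < 1.
    """
    pairs = sorted(zip(marker, disease))
    d = [status for _, status in pairs]
    n_d = sum(d)
    n_dbar = len(d) - n_d
    sample_risk = pava(d)  # isotonic estimate of P(D = 1 | Y) in the sample
    area = 0.0
    for status, r_s in zip(d, sample_risk):
        # Bayes adjustment from sample risk to population risk.
        case_part = rho_hat * r_s / n_d
        control_part = (1 - rho_hat) * (1 - r_s) / n_dbar
        risk = case_part / (case_part + control_part)
        # Estimated population mass carried by this observation.
        mass = rho_hat / n_d if status == 1 else (1 - rho_hat) / n_dbar
        area += risk * mass
    return area


# Toy data: the largest marker values come from cases, so the fitted risk reaches 1.
toy_disease = [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
toy_marker = list(range(len(toy_disease)))
print(predictiveness_area(toy_marker, toy_disease, rho_hat=0.1))
```

Writing the Bayes adjustment as a ratio of the two weighted parts avoids dividing by zero when the fitted sample risk equals 0 or 1, which is the case treated separately in the proof.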

B4: Proof of Theorem 4

For the semiparametric maximum likelihood estimator,

\sqrt{n} \{\hat{R}^{-1}(p) - R^{-1}(p)\} = \sqrt{n} \{\hat{F}_{R}(p) - F_{R}(p)\} = \sqrt{n} \{(1 - \hat{ρ}) \hat{F}_{\bar{D} R}(p) + \hat{ρ} \hat{F}_{D R}(p) - F_{R}(p)\} = A + B,

where

\begin{aligned} A &= \sqrt{n} \left[\hat{ρ} \frac{1}{n} \sum_{i = 1}^{n} \frac{\widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}\, I\{\mathrm{Risk}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}} + (1 - \hat{ρ}) \frac{1}{n} \sum_{i = 1}^{n} \frac{I\{\mathrm{Risk}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}} - R^{-1}(p)\right] \\ &= \sqrt{n} \{Q_{M}(p) - R^{-1}(p)\}, \\ B &= \sqrt{n} \left[\hat{ρ} \frac{1}{n} \sum_{i = 1}^{n} \frac{\widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}\, I\{\widehat{\mathrm{Risk}}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}} + (1 - \hat{ρ}) \frac{1}{n} \sum_{i = 1}^{n} \frac{I\{\widehat{\mathrm{Risk}}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}}\right] \\ &\quad - \sqrt{n} \left[\hat{ρ} \frac{1}{n} \sum_{i = 1}^{n} \frac{\widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}\, I\{\mathrm{Risk}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}} + (1 - \hat{ρ}) \frac{1}{n} \sum_{i = 1}^{n} \frac{I\{\mathrm{Risk}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}}\right] \\ &\overset{*}{=} \sqrt{n} \left[ρ \frac{1}{n} \sum_{i = 1}^{n} \frac{\mathrm{LR}_{R}\{\mathrm{Risk}(Y_{i})\}\, I\{\widehat{\mathrm{Risk}}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \mathrm{LR}_{R}\{\mathrm{Risk}(Y_{i})\}} + (1 - ρ) \frac{1}{n} \sum_{i = 1}^{n} \frac{I\{\widehat{\mathrm{Risk}}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \mathrm{LR}_{R}\{\mathrm{Risk}(Y_{i})\}}\right] \\ &\quad - \sqrt{n} \left[ρ \frac{1}{n} \sum_{i = 1}^{n} \frac{\widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}\, I\{\mathrm{Risk}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}} + (1 - ρ) \frac{1}{n} \sum_{i = 1}^{n} \frac{I\{\mathrm{Risk}(Y_{i}) \leq p\}}{\frac{n_{\bar{D}}}{n} + \frac{n_{D}}{n} \widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}(Y_{i})\}}\right] \\ &= \sqrt{n} \{F_{\hat{R}}(p) - F_{R}(p)\} + o_{p}(1) = \sqrt{n} \frac{\partial F_{R}(p)}{\partial θ} (\hat{θ} - θ) + o_{p}(1), \end{aligned}

where * holds under appropriate equicontinuity conditions [37].

Suppose $\hat{ρ}$ is estimated from a cohort independent of the case-control sample, or from the parent cohort within which the case-control sample is nested. Assume the size of the cohort is λ times the size of the case-control sample. Let $Q_{M}^{⋆}$ and $\hat{θ}^{⋆}$ be the counterparts of Q_M and $\hat{θ}$ obtained by plugging in the true ρ wherever $\hat{ρ}$ is used. Note that $\widehat{\mathrm{LR}}_{R}\{\widehat{\mathrm{Risk}}\}$ is independent of $\hat{ρ}$ for a logistic regression risk model. Then, we have

\begin{aligned} \mathrm{var}[\sqrt{n} \{Q_{M}(p) - R^{-1}(p)\}] &= \{F_{D R}(p) - F_{\bar{D} R}(p)\}^{2} ρ (1 - ρ) / λ + \mathrm{var}[\sqrt{n} \{Q_{M}^{⋆}(p) - R^{-1}(p)\}], \\ \mathrm{var}\{\sqrt{n} (\hat{θ} - θ)\} &= \begin{pmatrix} \frac{1}{λ ρ (1 - ρ)} & 0 \\ 0 & 0 \end{pmatrix} + \mathrm{var}\{\sqrt{n} (\hat{θ}^{⋆} - θ)\}, \\ \mathrm{cov}[\sqrt{n} (\hat{θ} - θ), \sqrt{n} \{Q_{M}(p) - R^{-1}(p)\}] &= \frac{F_{D R}(p) - F_{\bar{D} R}(p)}{λ} + \mathrm{cov}[\sqrt{n} (\hat{θ}^{⋆} - θ), \sqrt{n} \{Q_{M}^{⋆}(p) - R^{-1}(p)\}] . \end{aligned}

To show this, taking the first result as an example, note that

\mathrm{var}[\sqrt{n} \{Q_{M}(p) - R^{-1}(p)\}] = \mathrm{var}[\sqrt{n} \{Q_{M}(p) - Q_{M}^{⋆}(p)\} + \sqrt{n} \{Q_{M}^{⋆}(p) - R^{-1}(p)\}] ≃ \mathrm{var}[\sqrt{n} \{\hat{ρ} F_{D R}(p) + (1 - \hat{ρ}) F_{\bar{D} R}(p) - F_{R}(p)\} + \sqrt{n} \{Q_{M}^{⋆}(p) - R^{-1}(p)\}] \quad (*) .

Since the two terms inside the variance in (*) are asymptotically uncorrelated, and the first equals $\sqrt{n} (\hat{ρ} - ρ) \{F_{D R}(p) - F_{\bar{D} R}(p)\}$, we have

\mathrm{var}[\sqrt{n} \{Q_{M}(p) - R^{-1}(p)\}] ≃ \{F_{D R}(p) - F_{\bar{D} R}(p)\}^{2} \mathrm{var}(\sqrt{n} \hat{ρ}) + \mathrm{var}[\sqrt{n} \{Q_{M}^{⋆}(p) - R^{-1}(p)\}] = \{F_{D R}(p) - F_{\bar{D} R}(p)\}^{2} ρ (1 - ρ) / λ + \mathrm{var}[\sqrt{n} \{Q_{M}^{⋆}(p) - R^{-1}(p)\}] .

Finally, results for semiparametric “empirical” estimators can be derived following similar arguments.

Appendix C: The modified Hosmer-Lemeshow test for case-control data

Let the observations in a case-control sample be divided into K groups according to the distribution of $\widehat{\mathrm{Risk}}^{S}$. The unmodified Hosmer-Lemeshow test for a case-control study is defined as

HL = \sum_{k = 1}^{K} \frac{\{\hat{P}(D = 1 \mid k, S) - \hat{R}_{k}^{S}\}^{2}}{\hat{R}_{k}^{S} (1 - \hat{R}_{k}^{S}) / n_{k}} .

Based on Bayes’ theorem, we have $\widehat{\mathrm{Risk}} = g(\widehat{\mathrm{Risk}}^{S})$ and $\hat{R}_{k} = g(\hat{R}_{k}^{S})$ for

g(x) = \frac{x}{1 - x} \frac{n_{\bar{D}}}{n_{D}} \frac{\hat{ρ}}{1 - \hat{ρ}} \Big/ \left(1 + \frac{x}{1 - x} \frac{n_{\bar{D}}}{n_{D}} \frac{\hat{ρ}}{1 - \hat{ρ}}\right) .

Then for k = 1, . . . , K, we have $\hat{P}(D = 1 \mid k) - \hat{R}_{k} = g\{\hat{P}(D = 1 \mid k, S)\} - g(\hat{R}_{k}^{S})$. Its variance under H₀ can be shown by the delta method to be approximately equal to $\hat{R}_{k}^{2} (1 - \hat{R}_{k})^{2} / \{n_{k} R_{k}^{S} (1 - R_{k}^{S})\}$. Therefore under H₀, $\{\hat{P}(D = 1 \mid k, S) - \hat{R}_{k}^{S}\}^{2} / \{\hat{R}_{k}^{S} (1 - \hat{R}_{k}^{S}) / n_{k}\}$ and $\{\hat{P}(D = 1 \mid k) - \hat{R}_{k}\}^{2} / [\hat{R}_{k}^{2} (1 - \hat{R}_{k})^{2} / \{n_{k} \hat{R}_{k}^{S} (1 - \hat{R}_{k}^{S})\}]$ are asymptotically equivalent, which proves the asymptotic equivalence between T and HL.
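The delta-method step rests on the derivative of g. For any transformation of the form g(x) = c·o/(1 + c·o) with o = x/(1 − x) and a constant odds multiplier c, one has g′(x) = g(x){1 − g(x)}/{x(1 − x)}, which gives the stated variance when combined with the binomial variance R_k^S(1 − R_k^S)/n_k. The sketch below checks this numerically; the function names and the value of `c` are hypothetical, and `c` simply stands for the product of the sampling-ratio and prevalence-odds factors in g.

```python
def g(x, c):
    """Map a sample risk x in (0, 1) to a population risk.

    c is the constant multiplying the sample odds (an assumed placeholder
    for the sampling-ratio and prevalence-odds factors).
    """
    odds = c * x / (1 - x)
    return odds / (1 + odds)


def g_prime_closed_form(x, c):
    """Closed-form derivative g'(x) = g(x) * (1 - g(x)) / (x * (1 - x))."""
    gx = g(x, c)
    return gx * (1 - gx) / (x * (1 - x))


def delta_method_variance(r_s, n_k, c):
    """Approximate variance of g(P_hat) when P_hat has variance r_s(1 - r_s)/n_k."""
    r = g(r_s, c)
    return r ** 2 * (1 - r) ** 2 / (n_k * r_s * (1 - r_s))


# Compare the closed form with a central finite difference at one point.
x0, c0, h = 0.35, 0.08, 1e-6
numeric = (g(x0 + h, c0) - g(x0 - h, c0)) / (2 * h)
print(numeric, g_prime_closed_form(x0, c0))
```

Because the identity for g′ holds for every positive c, the variance formula does not depend on the particular case-control sampling ratio or prevalence estimate.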

References

1. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; Oxford, United Kingdom: 2003.
2. Gail MH, Pfeiffer RM. On criteria for evaluating models of absolute risk. Biostatistics. 2005;6(2):227–239. doi: 10.1093/biostatistics/kxi005.
3. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928–935. doi: 10.1161/CIRCULATIONAHA.106.672402.
4. Pencina MJ, D'Agostino Sr RB, D'Agostino Jr RB, Vasan RS. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine. 2008;27(2):157–172. doi: 10.1002/sim.2929.
5. Bura E, Gastwirth JL. The binary regression quantile plot: assessing the importance of predictors in binary regression visually. Biometrical Journal. 2001;43(1):5–21.
6. Pepe MS, Feng Z, Huang Y, Longton GM, Prentice R, Thompson IM, Zheng Y. Integrating the predictiveness of a marker with its performance as a classifier. American Journal of Epidemiology. 2008;167(3):362–368. doi: 10.1093/aje/kwm305.
7. Huang Y, Pepe MS, Feng Z. Evaluating the predictiveness of a continuous marker. Biometrics. 2007;63(4):1181–1188. doi: 10.1111/j.1541-0420.2007.00814.x.
8. Stern RH. Evaluating new cardiovascular risk factors for risk stratification. Journal of Clinical Hypertension. 2008;10:485–488. doi: 10.1111/j.1751-7176.2008.07814.x.
9. Pepe MS, Etzioni R, Feng Z, Potter JD, Thompson ML, Thornquist M, Winget M, Yasui Y. Phases of biomarker development for early detection of cancer. Journal of the National Cancer Institute. 2001;93(14):1054–1061. doi: 10.1093/jnci/93.14.1054.
10. Baker SG, Kramer BS, Srivastava S. Markers for early detection of cancer: Statistical guidelines for nested case-control studies. BMC Medical Research Methodology. 2002;2:4. doi: 10.1186/1471-2288-2-4.
11. Huang Y, Pepe MS. Semiparametric methods for evaluating risk prediction markers in case-control studies. Biometrika. 2009. In Press. doi: 10.1093/biomet/asp040.
12. Ransohoff DF. How to improve reliability and efficiency of research about molecular markers: roles of phases, guidelines, and study design. Journal of Clinical Epidemiology. 2007;60:1205–1219. doi: 10.1016/j.jclinepi.2007.04.020.
13. Simon R. Roadmap for developing and validating therapeutically relevant genomic classifiers. Journal of Clinical Oncology. 2005;23(29):7332–7341. doi: 10.1200/JCO.2005.02.8712.
14. Buyse M, Loi S, van't Veer L, Viale G, Delorenzi M, Glas AM, Saghatchian d'Assignies M, Bergh J, Lidereau R, Ellis P, Harris A, Bogaerts J, Therasse P, Floore A, Amakrane M, Piette F, Rutgers E, Sotiriou C, Cardoso F, Piccart MJ. Validation and Clinical Utility of a 70-Gene Prognostic Signature for Women With Node-Negative Breast Cancer. Journal of the National Cancer Institute. 2006;98(17):1183–1192. doi: 10.1093/jnci/djj329.
15. van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530–536. doi: 10.1038/415530a.
16. van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine. 2002;347(25):1999–2009. doi: 10.1056/NEJMoa021967.
17. Anderson KM, Odell PM, Wilson PW, Kannel WB. Cardiovascular disease risk profiles. American Heart Journal. 1991;121:293–298. doi: 10.1016/0002-8703(91)90861-b.
18. Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, Mulvihill JJ. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute. 1989;81(24):1879–1886. doi: 10.1093/jnci/81.24.1879.
19. Pepe MS, Feng Z, Janes H, Bossuyt PM, Potter JD. Pivotal evaluation of the accuracy of a classification biomarker: standards for study design. Journal of the National Cancer Institute. 2008;100(20):1432–1438. doi: 10.1093/jnci/djn326.
20. Cole TJ, Green PJ. Smoothing reference centile curves: The LMS method and penalized likelihood. Statistics in Medicine. 1992;11(165):1305–1319. doi: 10.1002/sim.4780111005.
21. Breslow NE. Statistics in epidemiology: the case-control study. JASA. 1996;91:14–28. doi: 10.1080/01621459.1996.10476660.
22. Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59(1):19–35.
23. Prentice RL, Pyke R. Logistic Disease Incidence Models and Case-Control Studies. Biometrika. 1979;66(3):403–411.
24. Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD. Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley; London: 1972.
25. Lloyd CJ. Estimation of a Convex ROC Curve. Statistics & Probability Letters. 2002;59:99–111.
26. Lloyd CJ. Maximum likelihood estimation of misclassification rates of a binomial regression. Biometrika. 2000;87(3):700–705.
27. Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75(2):237–249.
28. Qin J, Zhang J. A goodness-of-fit test for logistic regression models based on case-control data. Biometrika. 1997;84(3):609–618.
29. Qin J, Zhang J. Using logistic regression procedures for estimating receiver operating characteristic curves. Biometrika. 2003;90(3):585–596.
30. Thompson IM, Pauler Ankerst D, Chi C. Assessing prostate cancer risk: results from the prostate cancer prevention trial. Journal of the National Cancer Institute. 2006;98:529–534. doi: 10.1093/jnci/djj131.
31. Hosmer DW, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics-Theory and Methods. 1980;9(10):1043–1069.
32. Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd Edition. John Wiley & Sons; 2000.
33. Cook NR, Buring JE, Ridker PM. The effect of including C-reactive protein in cardiovascular risk prediction models for women. Annals of Internal Medicine. 2006;145:21–29. doi: 10.7326/0003-4819-145-1-200607040-00128.
34. Cook NR. Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve. Clinical Chemistry. 2008;54:17–23. doi: 10.1373/clinchem.2007.096529.
35. Janes H, Pepe MS, Gu W. Assessing the value of risk predictions by using risk stratification tables. Annals of Internal Medicine. 2008;149:751–760. doi: 10.7326/0003-4819-149-10-200811180-00009.
36. Huang Y, Pepe MS. A parametric ROC model based approach for evaluating the predictiveness of continuous markers in case-control studies. Biometrics. 2009. In Press. doi: 10.1111/j.1541-0420.2009.01201.x.
37. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer-Verlag; New York: 1996.

