Author manuscript; available in PMC: 2014 Mar 1.
Published in final edited form as: Biometrics. 2013 Feb 14;69(1):91–100. doi: 10.1111/biom.12001

Covariate adjustment in estimating the area under ROC curve with partially missing gold standard

Danping Liu 1,*, Xiao-Hua Zhou 2,3,*
PMCID: PMC3622116  NIHMSID: NIHMS446611  PMID: 23410529

Summary

In ROC analysis, covariate adjustment is advocated when the covariates affect the magnitude or accuracy of the test under study. Meanwhile, in many large-scale screening studies the true condition status may be subject to missingness, because it is expensive and/or invasive to ascertain the disease status. A complete-case analysis may lead to biased inference, a problem known as "verification bias". To address covariate adjustment with verification bias in ROC analysis, we propose several estimators for the area under the covariate-specific and covariate-adjusted ROC curves (AUCx and AAUC). The AUCx is modelled directly in the form of a binary regression, and the estimating equations are based on U statistics. The AAUC is estimated as a weighted average of AUCx over the covariate distribution of the diseased subjects. We employ reweighting and imputation techniques to overcome the verification bias problem. Our proposed estimators are initially derived assuming that the true disease status is missing at random (MAR), and then, with some modification, the estimators are extended to the not-missing-at-random (NMAR) situation. The asymptotic distributions are derived for the proposed estimators. The finite sample performance is evaluated by a series of simulation studies. Our method is applied to a data set in Alzheimer's disease research.

Keywords: Alzheimer's disease, area under ROC curve, covariate adjustment, U statistics, verification bias, weighted estimating equations

1. Introduction

In ROC analysis, covariate adjustment is advocated when the magnitude and/or accuracy of the test result depends on patient characteristics. This is analogous to adjusting for confounders and effect modifiers in linear regression. In this paper, we focus on covariate adjustment in estimating the area under the ROC curve (AUC), which measures the probability that a diseased subject (case) receives a larger test score than a healthy subject (control). The covariate-specific AUC (AUCx) is commonly used in the literature for examining the diagnostic accuracy within a subpopulation stratified by the covariates. Janes and Pepe (2009) proposed the covariate-adjusted ROC curve and AUC (AROC and AAUC), which are weighted averages of the covariate-specific ROC curve (ROCx) and AUCx, respectively. The AAUC conveniently summarizes a test's overall accuracy in a single number while taking the covariate information into consideration.

For large-scale observational studies, the gold standard for a patient's true disease status may not be available, due to high cost or harm to the patient. The decision of disease verification may be made by physicians or by the patients themselves, and is often associated with the test results and other patient characteristics. The ROC curve and AUC estimated using only verified subjects can be biased, a phenomenon known as "verification bias" (Begg and Greenes, 1983). Since the non-verified subjects can be regarded as having missing disease status, the missing data framework is introduced to adjust for the verification bias. The verification process is said to be missing at random (MAR) if the probability of disease verification is affected only by the observed variables, and not missing at random (NMAR) if the verification is associated with the missing disease status itself conditional on all the observed variables.

Recently, much attention has been paid to correcting verification bias for continuous tests. Under the MAR assumption, Alonzo and Pepe (2005) first proposed several empirical estimators for the ROC curve. He et al. (2009) further derived an AUC estimator using the inverse probability weighting approach. Rotnitzky et al. (2006) considered an NMAR assumption and proposed a doubly robust AUC estimator, and Fluss et al. (2009) later developed an ROC curve estimator under a similar framework. However, both papers had to assume that a "nonignorable parameter" (the log odds ratio of verification for diseased versus healthy subjects) was known and performed a sensitivity analysis. Alternatively, Liu and Zhou (2010) proposed to estimate the nonignorable parameter from the data using the score equations, and then to construct several empirical ROC curve and AUC estimators. The main limitation of the above five papers is that they all performed the ROC analysis on the whole population, and none of them could adjust for the covariate effects on the classification accuracy. Page and Rotnitzky (2009) and Liu and Zhou (2011) are the only two papers that considered bias-corrected covariate-specific ROC (ROCx) curve estimators for a continuous test: the former proposed a fully parametric model and the latter used a semiparametric framework. To our knowledge, the AUCx and AAUC estimators under verification bias have not yet been studied.

The main contributions of this paper are: (1) we propose U-statistic type estimating equations for verification-bias-corrected AUCx and AAUC; (2) we prove the asymptotic theory for the new estimators. In principle, once the ROCx curve is estimated, AUCx could be computed; for example, using the ROCx estimator in Liu and Zhou (2011), one may integrate the ROC curve over [0, 1]. However, as the link and baseline functions of that ROCx estimator are both nonparametric, the resulting AUCx may not have an explicit expression. In addition, the covariate effects in both Liu and Zhou (2011) and Page and Rotnitzky (2009) are interpreted as effects on the mean test result. But in many situations, one may wish to find out whether and how the diagnostic accuracy itself is affected by the covariates, and hence it is more relevant to model AUCx directly. The idea of our approach is that the regression model assumption on AUCx is made for the full data, and then several reweighting techniques are used to correct for the verification bias. The reweighting methods are first derived under the MAR assumption and then extended to the NMAR situation. Subsequently, the AAUC estimators are derived as a weighted average of AUCx. The weights depend on the covariate distribution of the diseased subjects, and are estimated empirically. The asymptotic distributions for the AUCx and AAUC estimators are derived from U-statistic theory.

The paper is organized as follows. In Section 2, we propose the verification bias-corrected estimators of AUCx. In Section 3, we construct a weighted average of the estimated AUCx as an estimator of AAUC. Several simulation studies are presented in Section 4, followed by a real example from Alzheimer's disease research in Section 5. Finally we make the concluding remarks in Section 6.

2. Estimation for Covariate-Specific AUC (AUCx)

Let Ti, Di, Vi and Xi denote the continuous test result, binary disease status (Di = 1 if diseased and 0 if healthy), binary verification status (Vi = 1 if Di is observed and 0 if Di is missing), and patient-level characteristics for subject i. Without loss of generality, we assume that a greater value of Ti is more indicative of disease. The subscript i is sometimes omitted when there is no confusion. In this section, we first discuss the model setting and assumptions, then propose weighted estimating equations to correct for the verification bias and obtain the estimated AUCx, and finally present the asymptotic results.

2.1 Model Assumption

The covariate-specific AUC measures the diagnostic accuracy of a test in a subpopulation defined by covariate value x. Similar to the ordinary unadjusted AUC, AUCx is interpreted as the probability that a case has a greater test result than a control when they share the common covariate x, i.e.,

\nu(x) = \Pr(T_2 > T_1 \mid D_2 = 1, D_1 = 0, X_2 = X_1 = x) + \tfrac{1}{2}\Pr(T_2 = T_1 \mid D_2 = 1, D_1 = 0, X_2 = X_1 = x). \qquad (1)

We adjust for ties by adding half of the probability that the case and the control have the same test result.

As in Dodd and Pepe (2003), we assume that AUCx takes the following generalized linear form:

\nu(x) \equiv \mathrm{AUC}_x = g^{-1}(x^{\mathrm{T}}\theta^{*}), \qquad (2)

where g(·) is a known monotone link function, and θ* is the vector of parameters of interest. Convenient choices of the link function are the logit and probit functions. This model makes an assumption on the distribution of the difference of test results between a case and a control with the same covariates: when D_2 = 1, D_1 = 0 and X_2 = X_1 = x, we have h(T_2 − T_1) = x^Tθ* + ∊, where h is some unknown monotone transformation, and ∊ follows the distribution g^{−1} with mean 0.

The binormal ROC curve is a special case of (2), which can be seen from the following example. Suppose the test results for the cases and the controls both follow normal distributions after an unknown monotone transformation h, that is,

[h(T) \mid D = d, X = x] \sim N\!\left(\mu(d, x), \sigma^{2}(d, x)\right), \qquad (3)

where μ(d, x) and σ^2(d, x) are the conditional mean and variance of h(T) as functions of d and x. It is easily verified that the covariate-specific ROC curve is

\mathrm{ROC}_x(t) = \Phi\!\left[\frac{\sigma(0, x)}{\sigma(1, x)}\,\Phi^{-1}(t) + \frac{\mu(1, x) - \mu(0, x)}{\sigma(1, x)}\right],

which takes the “binormal” form with Φ being the standard normal distribution function. The AUCx could be written explicitly as

\nu(x) \equiv \mathrm{AUC}_x = \Phi\!\left(\frac{a}{\sqrt{b}}\right), \qquad (4)

where a = μ(1, x) − μ(0, x) and b = σ^2(1, x) + σ^2(0, x). Hence, if μ(d, x) is linear in x and σ^2(d, x) is constant in x, then ν(x) takes the generalized linear form (2) with link function g^{−1} = Φ. The AUC structure (2) is both parsimonious and flexible, analogous to generalized regression models for the mean. One may also consider link functions other than Φ.
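The following short numerical check, a sketch using illustrative parameter values that are not taken from the paper, verifies the binormal formula (4) by Monte Carlo: test results for cases and controls at a common covariate value are drawn from the location-scale model (3), and the empirical probability of correct ordering is compared with Φ(a/√b).

```python
# Minimal Monte Carlo check of the binormal AUC_x formula (4) under model (3).
# The parameter values below are illustrative assumptions, not the paper's.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 0.3                                              # a fixed covariate value

def mu(d, x):                                        # mu(d, x), linear in x
    return 1.0 + 0.4 * d + 0.7 * x + 0.5 * x * d

def sigma(d, x):                                     # sigma(d, x), constant in x
    return 0.8 if d == 1 else 1.2

# Closed form: AUC_x = Phi(a / sqrt(b)), with a = mu(1,x) - mu(0,x)
# and b = sigma^2(1,x) + sigma^2(0,x)
a = mu(1, x) - mu(0, x)
b = sigma(1, x) ** 2 + sigma(0, x) ** 2
auc_formula = norm.cdf(a / np.sqrt(b))

# Monte Carlo: pair a case and a control with the same x and compare test results
n = 200_000
t_case = rng.normal(mu(1, x), sigma(1, x), n)
t_ctrl = rng.normal(mu(0, x), sigma(0, x), n)
auc_mc = np.mean(t_case > t_ctrl)                    # ties have probability zero here

print(f"formula {auc_formula:.4f} vs Monte Carlo {auc_mc:.4f}")
```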

The definition of AUCx in (1) restricts the comparison of test results to a case and a control at the same covariate level. However, if some of the covariates are continuous, there may not exist any pair of a case and a control with identical covariate values, so estimation of ν(x) based on (2) alone is not feasible. This prompts us to extend the AUC structure (2) slightly and allow the comparison of cases and controls with different covariates. Let ξ(x, y) be the probability of correctly ordering a case with covariates x and a control with covariates y, i.e.,

\xi(x, y) = \Pr(T_2 > T_1 \mid D_2 = 1, D_1 = 0, X_2 = x, X_1 = y) + \tfrac{1}{2}\Pr(T_2 = T_1 \mid D_2 = 1, D_1 = 0, X_2 = x, X_1 = y).

Under the transformed location-scale model, one can verify a result similar to (4):

\xi(x, y) = \Phi\!\left(\frac{a^{*}}{\sqrt{b^{*}}}\right)

with a* = μ(1, x) − μ(0, y) and b* = σ^2(1, x) + σ^2(0, y). Hence a flexible and parsimonious structure of ξ(x, y) can also be assumed:

\xi(x, y) = g^{-1}(W^{\mathrm{T}}\theta), \qquad (5)

where a natural choice of W is W = (1, x^T, y^T)^T. Note that (2) is a special case of (5), because ξ(x, x) = ν(x).

2.2 Weighted Estimating Equations

Denote Iij ≡ I(Ti > Tj) + (1/2) I(Ti = Tj), where I(·) is the indicator function. When the disease status is observed for every subject, we can construct the full data estimating equations in the form of U statistics. Note that

E(I_{ij} \mid D_i = 1, D_j = 0, X_i = X_j) = \nu(X_i).

Under (2), when the covariates are all discrete, we could consider those pairwise comparisons Iij with Di = 1, Dj = 0 and Xi = Xj. Then the generalized estimating equations for θ* are given by

0 = \sum_{i \neq j} \left(\frac{\partial \nu(X_i)}{\partial \theta^{\mathrm{T}}}\right)^{\mathrm{T}} \Omega_{ij}^{-1}\,[I_{ij} - \nu(X_i)]\, D_i (1 - D_j)\, I(X_i = X_j), \qquad (6)

where Ωij = var(Iij | Di = 1, Dj = 0, Xi = Xj) = ν(Xi)(1 − ν(Xi)).

When the covariates are continuous, however, the above estimating equations (6) become degenerate, because I(Xi = Xj) is always equal to 0. A solution is to construct the estimating equations from the extended AUC form (5). We write ξij ≡ ξ(Xi, Xj) for short. By noting that

E(I_{ij} \mid D_i = 1, D_j = 0, X_i, X_j) = \xi(X_i, X_j),

we obtain the estimating equations for θ by considering all pairwise comparisons Iij with Di = 1 and Dj = 0, as follows:

0 = \sum_{i \neq j} U_{ij}^{(FD)} = \sum_{i \neq j} \left(\frac{\partial \xi_{ij}}{\partial \theta^{\mathrm{T}}}\right)^{\mathrm{T}} \Omega_{ij}^{-1}\,(I_{ij} - \xi_{ij})\, D_i (1 - D_j), \qquad (7)

where Ωij = var(Iij | Di = 1, Dj = 0, Xi, Xj) = ξij(1 − ξij). Note that the full data estimating equations (7) are the classic score equations for binary regression with weights Di(1 − Dj). The point estimate of θ could be obtained from the binary regression, but the variance estimator needs to be modified to account for the cross-correlation of the Uij(FD).
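As a computational illustration (our own sketch, not code from the paper's supplementary material), the routine below solves probit-link estimating equations of the form (7) by Fisher scoring. The pairwise weight enters as a matrix, so the same routine can be reused for the reweighted versions (8)–(11) introduced next; the function name and interface are assumptions for this illustration, and the U-statistic variance of Section 2.4 would still have to be computed separately.

```python
# Sketch: Fisher scoring for pairwise estimating equations of the form (7)-(11)
# with a probit link.  w[i, j] is the pairwise weight (e.g. D_i * (1 - D_j) for
# the full-data equations); only pairs with non-zero weight are assembled.
import numpy as np
from scipy.stats import norm

def solve_pairwise_auc(T, X, w, n_iter=50, tol=1e-8):
    """T: (n,) test results; X: (n, p) covariates; w: (n, n) pairwise weights.
    Uses the design W_ij = (1, X_i, X_j) of model (5); returns theta_hat."""
    n = len(T)
    rows, resp, wts = [], [], []
    for i in range(n):
        for j in range(n):
            if i == j or w[i, j] == 0:
                continue
            rows.append(np.concatenate(([1.0], X[i], X[j])))                     # W_ij
            resp.append(1.0 if T[i] > T[j] else 0.5 if T[i] == T[j] else 0.0)    # I_ij
            wts.append(w[i, j])
    W, I_ij, wt = np.asarray(rows), np.asarray(resp), np.asarray(wts)

    theta = np.zeros(W.shape[1])
    for _ in range(n_iter):
        eta = W @ theta
        xi = np.clip(norm.cdf(eta), 1e-10, 1 - 1e-10)   # xi_ij = g^{-1}(W_ij' theta)
        phi = norm.pdf(eta)                              # d xi / d eta for the probit link
        resid = wt * phi / (xi * (1 - xi)) * (I_ij - xi)
        score = W.T @ resid                              # estimating function
        info = (W * (wt * phi ** 2 / (xi * (1 - xi)))[:, None]).T @ W
        step = np.linalg.solve(info, score)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```

For the full-data version (7), the weight matrix is simply np.outer(D, 1 - D) (the diagonal is ignored by the loop).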

Let ρi = Pr(Di = 1 | Ti, Xi) be the disease probability, and πi = Pr(Vi = 1 | Ti, Xi, Di) be the verification probability. These probabilities often need to be estimated from the data; with estimates ρ^i and π^i in hand, we may either impute the missing disease status or perform a weighted analysis. For now, suppose we have the estimates ρ^i and π^i; their estimation is discussed in the next subsection. We propose four types of weighted estimating functions that correct for the verification bias. These estimators work under both the MAR and NMAR assumptions; the only difference lies in the estimation of the disease and verification models, as we will see in Section 2.3.

The first approach is full imputation (FI), which replaces every disease status Di with the estimated probability ρ^i in the estimating functions (7):

U^{(FI)} = \sum_{i \neq j} U_{ij}^{(FI)} = \sum_{i \neq j} \left(\frac{\partial \xi_{ij}}{\partial \theta^{\mathrm{T}}}\right)^{\mathrm{T}} \Omega_{ij}^{-1}\,(I_{ij} - \xi_{ij})\, \hat{\rho}_i (1 - \hat{\rho}_j). \qquad (8)

The imputation is performed on every subject regardless of the verification status. So the FI estimator is highly sensitive to the correct specification of the disease model.

Another imputation method is mean score imputation (MSI), which imputes only the missing Di with ρ0i ≡ Pr(Di = 1 | Ti, Xi, Vi = 0) and keeps the observed ones. The estimated version, ρ^0i, can be expressed as ρ^0i = (1 − π^1i)ρ^i / [(1 − π^1i)ρ^i + (1 − π^0i)(1 − ρ^i)], where π^di ≡ Pr^(Vi = 1 | Ti, Xi, Di = d) for d = 0, 1. If the MAR assumption holds, ρ0i is equal to ρi. Let DMSI,i ≡ ViDi + (1 − Vi)ρ0i and D^MSI,i ≡ ViDi + (1 − Vi)ρ^0i. The MSI estimating functions are as follows:

U^{(MSI)} = \sum_{i \neq j} U_{ij}^{(MSI)} = \sum_{i \neq j} \left(\frac{\partial \xi_{ij}}{\partial \theta^{\mathrm{T}}}\right)^{\mathrm{T}} \Omega_{ij}^{-1}\,(I_{ij} - \xi_{ij})\, \hat{D}_{MSI,i} (1 - \hat{D}_{MSI,j}). \qquad (9)

The MSI estimator still requires the disease model to be correctly specified, so that D^MSI,i mimics Di closely. With a mis-specified disease probability ρ^i, we expect the MSI estimator to work better than the FI estimator: for the MSI estimator, the incorrect imputation model affects only the subjects with missing disease status, whereas for the FI estimator the imputation is performed on every subject.

The third method is inverse probability weighting (IPW). Only the complete cases are included in the estimating functions, and they are weighted by the inverse of the sampling probabilities:

U^{(IPW)} = \sum_{i \neq j} U_{ij}^{(IPW)} = \sum_{i \neq j} \left(\frac{\partial \xi_{ij}}{\partial \theta^{\mathrm{T}}}\right)^{\mathrm{T}} \Omega_{ij}^{-1}\,(I_{ij} - \xi_{ij})\, D_i (1 - D_j)\, \frac{V_i V_j}{\hat{\pi}_i \hat{\pi}_j}. \qquad (10)

The IPW estimator requires π^i to be estimated consistently.

The fourth method is the doubly robust (DR) estimator, which makes use of both π^i and ρ^i. Let DDR,i ≡ ViDi/πi + (1 − Vi/πi)ρ0i and D^DR,i ≡ ViDi/π^i + (1 − Vi/π^i)ρ^0i. The DR estimating functions are

U^{(DR)} = \sum_{i \neq j} U_{ij}^{(DR)} = \sum_{i \neq j} \left(\frac{\partial \xi_{ij}}{\partial \theta^{\mathrm{T}}}\right)^{\mathrm{T}} \Omega_{ij}^{-1}\,(I_{ij} - \xi_{ij})\, \hat{D}_{DR,i} (1 - \hat{D}_{DR,j}). \qquad (11)

The DR estimator requires either ρ^i or π^i to be consistently estimated, but not necessarily both. It also implicitly requires that ρ^i and π^i can be estimated separately. As we will see in Section 2.3, the disease and verification probabilities are estimated separately under the MAR assumption. Under NMAR, if one is confident in specifying a nonignorable parameter (the log odds ratio of verification for diseased versus healthy subjects), Rotnitzky et al. (2006) demonstrated that ρ^i and π^i can still be estimated separately, which leads to the doubly robust property for the AUC estimation. However, if one chooses to estimate the nonignorable parameter, Liu and Zhou (2010) showed that the estimation proceeds by maximizing the joint likelihood of Di and Vi. Hence we must specify the correct selection model and disease model, and the doubly robust property is sacrificed.
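To make the four weighting schemes concrete, the sketch below assembles the pairwise weights that enter (8)–(11) from estimated ρ^i and π^i. It is written for the MAR case, where ρ0i = ρi, so a single ρ argument plays both roles; under NMAR one would pass ρ^0i for the MSI and DR weights. The helper name is our own, and the result can be fed to a pairwise solver such as the one sketched after equation (7).

```python
# Sketch: pairwise weights for the FI, MSI, IPW and DR estimating functions
# (8)-(11).  Written for the MAR case (rho_{0i} = rho_i); D is zeroed where
# unverified so that V * D is well defined.
import numpy as np

def pairwise_weights(D, V, rho, pi, method):
    """D, V, rho, pi: length-n arrays.  Returns the (n, n) matrix of weights
    multiplying (I_ij - xi_ij) in the estimating functions."""
    D = np.where(V == 1, D, 0.0)
    if method == "FI":                                  # (8)
        d = rho
    elif method == "MSI":                               # (9)
        d = V * D + (1 - V) * rho
    elif method == "IPW":                               # (10): D_i(1-D_j)V_iV_j/(pi_i pi_j)
        return np.outer(V * D / pi, V * (1 - D) / pi)
    elif method == "DR":                                # (11): may produce negative weights
        d = V * D / pi + (1 - V / pi) * rho
    else:
        raise ValueError(method)
    return np.outer(d, 1 - d)                           # weight_ij = d_i * (1 - d_j)
```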

2.3 Estimating Verification and Disease Processes

From the weighted estimating equations (8)–(11), we note that correctly estimating the disease and verification probabilities ρi and πi is crucial to the estimation of θ and hence AUCx. We discuss the estimation methods for MAR and NMAR verification processes respectively.

Under the MAR verification process, the disease and verification probabilities can be estimated separately, because

\Pr(D_i = 1 \mid T_i, X_i) = \Pr(D_i = 1 \mid T_i, X_i, V_i = 1), \qquad \Pr(V_i = 1 \mid T_i, X_i, D_i) = \Pr(V_i = 1 \mid T_i, X_i).

Therefore, the disease probability can be estimated by fitting a binary regression of Di on Ti and Xi among the verified subjects (Vi = 1), and the verification probability can be estimated by fitting another binary regression of Vi on Ti and Xi using all subjects.
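A minimal sketch of this MAR estimation step is given below, using plain logistic regressions with main effects only; richer specifications (interactions, quadratic terms) are added in exactly the same way, as in the simulation studies. The function name and design matrix are our own illustration, not the paper's code.

```python
# Sketch: MAR nuisance models of Section 2.3.
# rho_i: logistic regression of D on (T, X) among verified subjects (V = 1);
# pi_i:  logistic regression of V on (T, X) using all subjects.
import numpy as np
import statsmodels.api as sm

def estimate_nuisance_mar(T, X, D, V):
    """T: (n,); X: (n, p); D: (n,) with arbitrary values where V == 0; V: (n,)."""
    Z = sm.add_constant(np.column_stack([T, X]))                 # design (1, T, X)
    ver = V == 1
    rho_hat = sm.Logit(D[ver], Z[ver]).fit(disp=0).predict(Z)    # Pr(D = 1 | T, X)
    pi_hat = sm.Logit(V, Z).fit(disp=0).predict(Z)               # Pr(V = 1 | T, X)
    return rho_hat, pi_hat
```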

With an NMAR verification process, we can implement the likelihood-based estimators of Liu and Zhou (2010). The observed data likelihood involves both πi and ρi, which can only be estimated jointly by solving the score equations. The identifiability of the selection model comes from the parametric assumption on ρi over the whole population. The validity of the disease and verification models cannot be tested nonparametrically, so we recommend building the parametric models from a scientific point of view and determining the plausible models prior to data analysis.

2.4 Asymptotic Normality

Denote by η the vector of nuisance parameters in the disease and/or verification models. Our parameter of interest is θ, the covariate effect on AUCx. The FI, MSI, IPW and DR estimating functions are all U statistics based on pairwise comparisons of independently and identically distributed (i.i.d.) random variables. We sometimes suppress the superscript and write these estimating functions as Σ_{i≠j} Uij(θ, η); the estimating functions (8), (9), (10) and (11) are all of this form. Let θ0 and η0 be the true values of the unknown parameters. We begin by showing that Uij(θ0, η0) has zero expectation in Lemma 1; then the influence function and limit theorem of the U statistic are given in Lemma 2; finally, we complete the proof of the asymptotic normality of θ^.

Lemma 1: Under the MAR assumption,

  1. if the verification model is correctly specified, then the IPW estimating functions Uij(θ0, η0) have zero expectation;

  2. if the disease model is correctly specified, then the FI and MSI estimating functions Uij(θ0, η0) both have zero expectation;

  3. if either the verification model or the disease model is correctly specified, then the DR estimating functions Uij(θ0, η0) have zero expectation.

    Under the NMAR situation,

  4. if both the verification and the disease model are correctly specified, then the IPW, FI, MSI and DR estimating functions Uij(θ0, η0) have zero expectation.

The proof of the above lemma is given in the Web Appendix A. Let

S_{ij}(\theta, \eta) \equiv \frac{U_{ij}(\theta, \eta) + U_{ji}(\theta, \eta)}{2}.

Then Σ_{i≠j} Uij(θ, η) = Σ_{i≠j} Sij(θ, η), and Sij is in addition symmetric in its two indices. Denote the U statistic Sn(θ, η) = [n(n − 1)]^{−1} Σ_{i≠j} Sij(θ, η); the central limit theorem for Sn is stated below.

Lemma 2: If var(Sij) is finite, then

\sqrt{n}\,(S_n - e) = \frac{2}{\sqrt{n}} \sum_i \left\{ E[S_{ij} \mid O_i] - e \right\} + o_p(1) \xrightarrow{d} N(0, 4\sigma_1^2),

where e = E[S_{12}(θ, η)], σ_1^2 ≡ var(E[S_{12} | O_1]), and O_i = (V_i, V_iD_i, T_i, X_i) is the observed data for the ith subject.

This lemma is a special case of Theorem 6.1.2 of Lehmann (1999). We skip the proof here. With the influence function of Sn in Lemma 2, we are able to analyze the variability of θ^ using Taylor expansion. As shown in Section 2.3, the nuisance parameters could be estimated from either score equations, or generalized estimating equations, denoted by B(η) ≡ Σi Bi(η). Under the regularity conditions stated in the Web Appendix A, the asymptotic normality results are shown in Theorem 1.

Theorem 1: Suppose the regularity conditions C1–C6 hold. Then

\sqrt{n}\,(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, \Omega)

as n → ∞.

The explicit formula for Ω, as well as the proof of the above theorem, is given in the Web Appendix A. From the proof, we note that the variability of θ^ comes from two sources: the variability due to the U statistic, and the variability due to estimating η.

3. Estimation for Covariate-Adjusted AUC (AAUC)

3.1 Definition

The AUCx identifies the risk factors that affect the diagnostic accuracy of the test. However, policy makers may be more interested in summarizing the overall accuracy across the whole patient population. Such a summary measure was recently proposed by Janes and Pepe (2009), who introduced the covariate-adjusted ROC (AROC) curve, a weighted average of all possible covariate-specific ROC curves. In this section, we discuss the estimation of the area under the AROC curve when the verification bias problem is present.

As shown in Janes and Pepe (2009), the AROC curve can be written as a weighted average of the covariate-specific ROC curves:

\mathrm{AROC}(t) = \int_{-\infty}^{+\infty} \mathrm{ROC}_x(t)\, dF_{X \mid D=1}(x). \qquad (12)

The AROC curve at t is interpreted as the average sensitivity when the covariate-specific decision thresholds are chosen so that the specificity is 1 − t in each subpopulation. We can also consider the covariate-adjusted AUC, or AAUC:

\mathrm{AAUC} = \int_0^1 \mathrm{AROC}(t)\, dt = \int_0^1 \int_{-\infty}^{+\infty} \mathrm{ROC}_x(t)\, dF_{X \mid D=1}(x)\, dt = \int_{-\infty}^{+\infty} \mathrm{AUC}_x\, dF_{X \mid D=1}(x),

which shows that the AAUC is also a weighted average of AUCx. The AAUC measures the probability that a randomly selected case (D = 1) has a greater test result than a covariate-matched control (D = 0).

3.2 Estimation Procedures

With the AUCx estimator in the previous section, we only need to estimate the covariate distribution for the cases, FX|D=1(x). With the disease status observed for every patient, an empirical estimator is given by:

\hat{F}_{X \mid D=1}(x) = \frac{\sum_i I(X_i \le x)\, D_i}{\sum_i D_i},

which is a step function with jump of size 1/Σi Di at every data point Xi with Di = 1. Since Di is missing for some of the patients, we could estimate FX|D=1(x) using FI, MSI, IPW or DR approaches:

\hat{F}^{(est)}_{X \mid D=1}(x) = \frac{\sum_i I(X_i \le x)\, \hat{D}_i}{\sum_i \hat{D}_i},

where D^i is some version of the "estimated" disease status. Paralleling the estimation of AUCx, four different versions of D^i are available:

\hat{D}_i^{(FI)} = \hat{\rho}_i,
\hat{D}_i^{(MSI)} = V_i D_i + (1 - V_i)\hat{\rho}_i,
\hat{D}_i^{(IPW)} = \frac{V_i}{\hat{\pi}_i} D_i,
\hat{D}_i^{(DR)} = \frac{V_i}{\hat{\pi}_i} D_i + \left(1 - \frac{V_i}{\hat{\pi}_i}\right)\hat{\rho}_i.

Therefore, the AAUC is estimated by

\widehat{\mathrm{AAUC}} = \int_{-\infty}^{+\infty} \widehat{\mathrm{AUC}}_x\, d\hat{F}^{(est)}_{X \mid D=1}(x) = \frac{1}{\sum_i \hat{D}_i} \sum_i \widehat{\mathrm{AUC}}_{X_i}\, \hat{D}_i. \qquad (13)
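The sketch below turns (13) into code for a probit-link fit of model (5): it computes the fitted AUC at each subject's covariate value, forms one of the D^i versions above, and takes their weighted average. The function name and the diagonal design row (1, Xi, Xi) are our assumptions about how the fitted model is parameterized.

```python
# Sketch: the AAUC estimator (13) for a probit link, with D_hat_i chosen among
# the FI / MSI / IPW / DR versions listed above.
import numpy as np
from scipy.stats import norm

def estimate_aauc(X, theta_hat, D, V, rho, pi, method="DR"):
    """X: (n, p) covariates; theta_hat: coefficients of the fitted AUC_x model;
    D is ignored where V == 0; rho, pi: estimated disease / verification probs."""
    D = np.where(V == 1, D, 0.0)
    if method == "FI":
        d_hat = rho
    elif method == "MSI":
        d_hat = V * D + (1 - V) * rho
    elif method == "IPW":
        d_hat = V * D / pi
    elif method == "DR":
        d_hat = V * D / pi + (1 - V / pi) * rho
    else:
        raise ValueError(method)
    # Under model (5) with W = (1, x', y')', AUC_{X_i} = xi(X_i, X_i), so the
    # design row at X_i is (1, X_i, X_i); adapt this to the model actually fitted.
    W_diag = np.column_stack([np.ones(len(X)), X, X])
    auc_x = norm.cdf(W_diag @ theta_hat)
    return float(np.sum(auc_x * d_hat) / np.sum(d_hat))   # equation (13)
```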

As stated in Theorem 1 and its proof, the AUCx estimator is asymptotically linear. Since the AAUC is merely a weighted average of AUCx, the influence function of the AAUC estimator can also be derived. We write AAUC and AAUC^ as ν and ν^ for short, and it follows from (13) that

0 = \sum_i R_i(\hat{\theta}, \hat{\eta}, \hat{\nu}) \equiv \sum_i \left(\widehat{\mathrm{AUC}}_{X_i} - \hat{\nu}\right) \hat{D}_i.

Note that AUC^Xi is a function of θ^ and η^, and D^i is a function of η^.

Theorem 2: If the regularity conditions C1–C6 hold, AAUC^ is asymptotically linear:

\sqrt{n}\,\left(\widehat{\mathrm{AAUC}} - \mathrm{AAUC}\right) = \frac{1}{\sqrt{n}} \sum_i \Psi_i + o_p(1),

where the form of Ψi is given in the Web Appendix A. Consequently, the asymptotic variance for AAUC^ is given by var(Ψ1).

Theorem 2 is proved by a Taylor expansion of Ri(θ^, η^, ν^) around the true values θ0, η0 and ν. The detailed proof is given in the Web Appendix A.

4. Simulation Studies

We conducted two sets of simulation studies to examine the finite sample performance of the AUCx and AAUC estimators. The first simulation study assumed an MAR verification process, and the second examined the NMAR situation. More extensive simulation studies, examining different sample sizes, AUC values, disease prevalences and verification proportions, are reported in the Web Appendix B.

4.1 Simulation One: MAR Verification Process

We generated two covariates X1 and X2 from Bernoulli(0.5) and U(−1, 1) distributions, respectively. The true disease status D was generated from D | X1, X2 ~ Bernoulli(p), where logit(p) = −1.4 + 0.5X1 + 0.8X2. The test result T was generated from a location-scale model, T | D, X1, X2 ~ N(μ, σ^2), where μ = 1 + 0.4D + 0.2X1 + 0.7X2 + X1D + 0.5X2D and σ = 0.8D + 1.2(1 − D). The verification indicator V was generated from V | T, X1, X2 ~ Bernoulli(π), where logit(π) = −1.2 + T + 0.6X1 + 1.2X2. The disease prevalence was about 25% and the verification proportion was about 57%, similar to our real example in the next section. The true AUCx was given by Φ(θ0* + θ1*X1 + θ2*X2), where (θ0*, θ1*, θ2*) = (0.277, 0.693, 0.347).
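For concreteness, a short script (our own sketch, not the paper's supplementary R code) that reproduces this data-generating design is given below; by (4), the true coefficients are θ* = (0.4, 1, 0.5)/√(0.8² + 1.2²) ≈ (0.277, 0.693, 0.347).

```python
# Sketch of the Simulation One (MAR) data-generating design described above.
import numpy as np
from scipy.special import expit   # inverse logit

rng = np.random.default_rng(2013)
n = 1000
X1 = rng.binomial(1, 0.5, n)
X2 = rng.uniform(-1, 1, n)
D = rng.binomial(1, expit(-1.4 + 0.5 * X1 + 0.8 * X2))
mu = 1 + 0.4 * D + 0.2 * X1 + 0.7 * X2 + X1 * D + 0.5 * X2 * D
sigma = 0.8 * D + 1.2 * (1 - D)
T = rng.normal(mu, sigma)
V = rng.binomial(1, expit(-1.2 + T + 0.6 * X1 + 1.2 * X2))   # MAR: D not in the logit

# True AUC_x from (4): mu(1,x) - mu(0,x) = 0.4 + X1 + 0.5*X2, b = 0.8^2 + 1.2^2
theta_star = np.array([0.4, 1.0, 0.5]) / np.sqrt(0.8 ** 2 + 1.2 ** 2)
print(np.round(theta_star, 3))                               # [0.277 0.693 0.347]
print("prevalence", D.mean(), "verified", V.mean())
```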

Since the disease probability, ρ = Pr(D = 1 | T, X1, X2), is jointly determined by D | X1, X2 and T | D, X1, X2, we show in the Web Appendix A that logit(ρ) is indeed a quadratic form in (T, X1, X2) under our data-generating procedure. The correct disease model used here was a logistic regression with the main effects and pairwise interactions of T, X1 and X2, as well as the quadratic terms of T and X2. We also examined mis-specified disease and verification models: the mis-specified disease model ignored the quadratic term of X2 and the interaction between X1 and X2, while the mis-specified verification model ignored the effect of X2. We set the sample size to 1000 and repeated the simulation 1000 times. Table 1 displays the results for estimating θ*, AUCx and AAUC, where a total of 12 estimators were calculated:

  (1) Ideal: full data analysis, i.e., the true disease status was observed for every subject.
  (2) CC: complete-case analysis, i.e., all the non-verified subjects were removed.
  (3) IPW1: the IPW estimator with correct verification model.
  (4) IPW2: the IPW estimator with incorrect verification model.
  (5) FI1: the FI estimator with correct disease model.
  (6) FI2: the FI estimator with incorrect disease model.
  (7) MSI1: the MSI estimator with correct disease model.
  (8) MSI2: the MSI estimator with incorrect disease model.
  (9) DR1: the DR estimator with correct disease and verification models.
  (10) DR2: the DR estimator with incorrect disease model and correct verification model.
  (11) DR3: the DR estimator with correct disease model and incorrect verification model.
  (12) DR4: the DR estimator with incorrect disease and verification models.

Table 1.

The bias (in percentage of the true value), average standard error (SE), empirical standard deviation (SD), and 95% confidence interval coverage for the θ*, AUCx and AAUC estimators under the MAR verification process.

θ0* = 0.277    θ1* = 0.693    θ2* = 0.347

Estimator Bias (%) SE×100 SD×100 Coverage (%) Bias (%) SE×100 SD×100 Coverage (%) Bias (%) SE×100 SD×100 Coverage (%)
Ideal 1.1 7.35 7.06 95.8 0.0 10.36 10.09 95.2 −1.0 9.13 9.11 94.6
CC −86.1 10.84 10.97 39.7 10.3 13.46 13.53 91.4 43.9 12.63 12.48 77.4
FI1 0.6 10.88 11.24 94.8 1.1 12.63 12.69 94.1 −0.8 11.97 12.06 95.0
FI2 49.7 10.42 10.81 71.7 −25.1 12.11 12.07 68.9 −55.0 11.26 11.47 57.5
MSI1 0.6 10.93 11.31 95.0 0.8 12.78 12.87 94.0 −1.2 12.10 12.10 95.3
MSI2 26.8 10.29 10.73 87.0 −9.3 12.11 12.21 90.8 −26.7 11.16 11.17 84.8
IPW1 4.3 12.69 14.12 92.2 −0.7 16.18 17.63 92.3 −4.4 15.23 17.01 92.2
IPW2 19.2 12.89 13.94 90.1 0.4 16.02 16.89 93.9 30.3 14.96 16.03 87.8
DR1 6.9 11.92 13.23 90.6 −1.5 14.83 16.03 93.4 −7.8 13.97 15.66 89.6
DR2 7.3 12.30 13.68 90.7 −1.7 15.19 16.50 93.5 −8.1 14.54 16.16 90.2
DR3 4.2 11.38 12.30 92.5 −0.6 14.12 14.83 93.8 −3.4 12.63 13.80 91.8
DR4 16.4 10.93 11.95 88.4 −2.5 13.93 14.62 93.7 −14.2 12.20 13.35 89.9
AUC(0, 0) = 0.6092 AUC(1, 0.5) = 0.8737 AAUC = 0.7582

Estimator Bias (%) SE×100 SD×100 Coverage (%) Bias (%) SE×100 SD×100 Coverage (%) Bias (%) SE×100 SD×100 Coverage (%)
Ideal 0.1 2.81 2.70 95.8 −0.1 1.76 1.70 96.5 0.0 1.67 1.63 95.3
CC −15.4 4.29 4.35 39.7 −2.4 2.20 2.14 83.6 −2.1 2.14 2.12 91.9
FI1 −0.0 4.15 4.29 94.8 0.1 1.84 1.80 96.4 −0.1 2.56 2.60 94.4
FI2 8.4 3.80 3.93 71.7 −3.4 1.99 1.94 63.0 −0.1 2.50 2.55 94.3
MSI1 −0.0 4.17 4.31 95.0 0.0 1.88 1.82 95.8 −0.2 2.57 2.60 95.1
MSI2 4.5 3.84 4.00 87.0 −1.0 1.86 1.80 93.6 −0.1 2.55 2.60 94.7
IPW1 0.6 4.82 5.36 92.2 −0.2 2.26 2.32 95.2 −0.1 2.71 2.84 93.2
IPW2 3.1 4.84 5.22 90.1 2.3 1.95 1.94 83.4 3.4 2.50 2.55 73.8
DR1 1.0 4.52 5.02 90.6 −0.3 2.14 2.21 94.4 0.0 2.64 2.82 91.7
DR2 1.1 4.65 5.17 90.7 −0.3 2.20 2.28 94.8 0.0 2.66 2.86 91.6
DR3 0.6 4.34 4.68 92.5 −0.1 2.04 2.03 95.6 −0.0 2.59 2.67 93.6
DR4 2.7 4.12 4.50 88.4 −0.0 2.00 1.98 95.8 −0.0 2.58 2.68 93.1

The estimated coefficients θ* are shown in the top panel of Table 1. The ideal estimator used the full data with all Di available, and hence is infeasible in practice; it serves only as a reference indicating the amount of information gained by observing the values of the missing data. The CC estimator was biased, as expected, because the complete cases are no longer a random sample from the population. The IPW1, FI1, MSI1 and DR1 estimators all performed well under correct model specification. When either the disease or the verification model was wrong, the DR2 and DR3 estimators still worked as well as the DR1 estimator in terms of unbiasedness, efficiency and coverage rate. A referee noted that the DR3 estimator yielded less bias than DR1. We examined the fitted verification probabilities for the correct and incorrect models in one simulated data set. The estimated "weight" D^DR,i(1 − D^DR,j) in (11) can be large if both π^i and π^j are small. We found that the verification probabilities in DR1 were more scattered within (0, 1) than those in DR3. Therefore, DR1 generally has more large weights than DR3, and is less stable in finite samples. This is not a serious issue as the sample size increases, since both estimators are consistent; in fact, in an unreported simulation study with n = 2000, we observed less bias for the DR1 estimator. It would be interesting future research to investigate the impact of "extreme" weights in the DR and IPW estimators.

The FI2, MSI2, IPW2 and DR4 estimators of θ* were all biased, but the latter three still had reasonable CI coverage (between 87% and 94%), compared with that of the CC estimator (between 39% and 91%). This suggests that when the model mis-specification is moderate, the proposed MSI2, IPW2 and DR4 estimators may still be used in practice to correct for some of the verification bias. We also noted that the FI2 estimator was the most biased, because the disease status was imputed for every subject using the wrong disease model. In comparison, the DR4 and MSI2 estimators provided better protection against an incorrect disease model.

The performance of the AUCx and AAUC estimators is summarized in the bottom panel of Table 1. The proposed IPW1, FI1, MSI1, DR1, DR2 and DR3 estimators still yielded good results. It appeared that AUC(1, 0.5) was consistently estimated even by the MSI2 and DR4 estimators, but this is only because the negatively biased θ^1 and θ^2 cancelled out with the positively biased θ^0. It was somewhat surprising that the AAUC estimators were not very sensitive to model mis-specification, except for the IPW estimator. This is probably because the bias in estimating AUCx can be in either direction depending on the choice of x, and the biases may cancel out when computing the AAUC as a weighted average of AUCx. As for the relative efficiency of the proposed estimators, we found that imputation (FI and MSI) led to the best precision, while the IPW estimator had the least precision.

To examine more serious model mis-specification, we conducted a separate simulation study with different (wrong) working disease and verification models: the disease model included only the main effects of T, X1 and X2, and the verification model included only the main effects of X1 and X2, but not T. The results for the FI2, MSI2 and IPW2 estimators are shown in Table 5 of the Web Appendix. The DR4 estimator did not converge for 20–30% of the data generations, so its results are not reported. An intuitive explanation of the non-convergence is that some of the "weights" in equation (11), D^DR,i(1 − D^DR,j), are negative. Negative weights in a regression force the fitted regression line to be as far from the corresponding data points as possible; there is no guarantee that such estimating equations have any solution, as the quasi-likelihood function may no longer be concave. In contrast, the weights in the estimating equations (8)–(10) are all positive, and all the simulations converged for the FI2, MSI2 and IPW2 estimators. In this setting the bias for both θ* and AUCx was more substantial, and the coverage rate was much lower than 95%. However, the MSI and FI estimators still worked well for estimating the AAUC.
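Motivated by this explanation, a quick diagnostic (our own suggestion, not part of the paper's procedure) is to count the negative pairwise DR weights D^DR,i(1 − D^DR,j) before attempting the fit; a large count warns that the DR estimating equations may fail to converge.

```python
# Quick diagnostic: how many off-diagonal pairwise DR weights are negative?
import numpy as np

def count_negative_dr_weights(D, V, rho, pi):
    """Returns (number of negative pairwise weights, total off-diagonal pairs)."""
    D = np.where(V == 1, D, 0.0)
    d_dr = V * D / pi + (1 - V / pi) * rho      # D_hat^(DR) per subject
    w = np.outer(d_dr, 1 - d_dr)                # pairwise weights entering (11)
    np.fill_diagonal(w, 0.0)
    n = len(D)
    return int((w < 0).sum()), n * (n - 1)
```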

In practice, if one is confident about the disease model, then the MSI and FI estimators are recommended for better power. Otherwise, the DR estimator is a safer choice that protects against some model mis-specification.

4.2 Simulation Two: NMAR Situation

For the NMAR verification process, we generated X1, X2, D and T as in the previous subsection. The verification indicator V was generated from V | T, X1, X2, D ~ Bernoulli(π), where logit(π) = −1.5 + T + 0.6X1 + 1.2X2 + 2D. The verification proportion was about 58% in this case. We repeated the simulation 1000 times. With a sample size of 2000, the NMAR model did not converge for 4.9% of the generations, so we only report the results for the converged data sets. In Table 2, we found that the complete-case analysis was seriously biased, as expected. The bias of the proposed estimators was small for this reasonably large sample size, and the CI coverage rate was close to the 95% nominal level. But, as we show in the appendix, when n = 1000 the NMAR model could be somewhat more biased for θ*, and more generations (14.6%) did not converge. With a smaller sample size, the data do not contain enough information to estimate the nonignorable parameter effectively. Non-convergence occurs when a boundary solution is obtained; even for the converged generations, the standard error for the nonignorable parameter may be large, leading to unstable AUC estimators. This non-convergence issue has also been noted by several other authors (Zhou and Castelluccio, 2003; Kosinski and Barnhart, 2003; Liu and Zhou, 2010).

Table 2.

The bias (in percentage of the true value), average standard error (SE), empirical standard deviation (SD), and 95% confidence interval coverage for the θ*, AUCx and AAUC estimators under the NMAR verification process.

θ0* = 0.277    θ1* = 0.693    θ2* = 0.347

Estimator Bias (%) SE×100 SD×100 Coverage (%) Bias (%) SE×100 SD×100 Coverage (%) Bias (%) SE×100 SD×100 Coverage (%)
CC −144.0 6.99 7.24 0.0 17.6 9.04 9.31 72.1 72.1 8.35 8.54 15.3
FI1 −5.9 8.94 8.96 92.8 1.8 8.98 9.13 94.4 4.2 8.69 8.88 93.9
MSI1 −6.2 8.98 8.96 93.1 1.6 9.10 9.18 94.7 3.6 8.79 8.97 93.6
IPW1 −2.7 10.11 10.21 93.6 −0.2 11.61 12.01 93.9 −0.0 11.13 11.39 93.9
AUC(0,0) = 0.6092 AUC(1, 0.5) = 0.8737 AAUC = 0.7582

Estimator Bias (%) SE×100 SD×100 Coverage (%) Bias (%) SE×100 SD×100 Coverage (%) Bias (%) SE×100 SD×100 Coverage (%)
CC −25.9 2.76 2.86 0.0 −4.0 1.64 1.71 38.8 −10.0 1.59 1.65 0.5
FI1 −1.1 3.44 3.46 92.8 0.0 1.26 1.31 94.6 −1.2 2.85 2.87 91.8
MSI1 −1.2 3.46 3.56 93.1 −0.1 1.29 1.34 94.3 −1.2 2.67 2.86 91.4
IPW1 −0.6 3.88 3.93 93.6 −0.3 1.60 1.68 94.1 −1.2 2.86 2.94 92.0

The cost of estimating the NMAR model is a loss of robustness. Therefore, if an NMAR verification process is conjectured in practice, one needs to be cautious in specifying the correct disease and verification models.

5. Example

We applied the proposed AUCx and AAUC estimators to data from Alzheimer's disease research. We used a data set within the Uniform Data Set (UDS) of the National Alzheimer's Coordinating Center (NACC), which has been collected from 32 Alzheimer's Disease Centers throughout North America since 2006. The patients were referred or self-referred for evaluation of possible dementia, or were recruited specifically to participate in clinical research. Most patients underwent clinical evaluation and neuropsychological tests for cognitive impairment at enrollment. During the follow-up period, the patients received periodic re-evaluations and cognitive tests. Among these cognitive tests, the mini-mental state examination (MMSE) is a brief 30-point questionnaire that is widely used to screen for cognitive impairment. In the progression of dementia, amnestic mild cognitive impairment (aMCI) is an important transitional stage: patients with aMCI may still revert to normal, but dementia is generally believed to be irreversible. We are interested in the one-year progression from aMCI to dementia, and in how well the baseline MMSE score classifies the patients who progressed to dementia within one year and those who did not. The classification ability of MMSE may also depend on patient characteristics, so we used AUCx to describe the covariate effects, and AAUC to evaluate the overall classification accuracy.

We included patients who were aged over 65 and were diagnosed with aMCI at their first visit. If a patient made a visit about one year (within a 6–18 month window) after the baseline, his/her cognitive status was observed, with D = 1 indicating progression to dementia and 0 otherwise. The disease status is missing if the patient only made the baseline visit, or if the follow-up visits were all outside the 6–18 month window. This led to about 56.1% disease verification. Within the verified sample, the progression probability was about 24.9%. The covariates we adjusted for in the ROC analysis include age, gender, race, marital status, living situation, stroke, and history of cardiovascular diseases. We also collected other disease history variables and the clinical dementia rating (CDR) sum of boxes as predictors for the missingness mechanism and the disease model. For simplicity, subjects with missing covariates were excluded, and our final sample for analysis consisted of 2,702 subjects. The list of variables and their summary statistics are shown in the Web Appendix C.

We started with the MAR assumption. Logistic regressions were used for estimating the disease and verification probabilities. The verification model included the MMSE score (T) and its quadratic term, the covariates we adjusted for in the ROC analysis, the disease history information, the CDR sum of boxes, the interaction between T and stroke, and the interaction between T and history of cardiovascular diseases. The disease model included all the main effects in the verification model, as well as the interactions between T and the covariates in the ROC analysis. Under the MAR assumption, the estimated covariate effects are listed in the top half of Table 3.

Although the disease and verification models included many risk factors, unobserved confounders may still exist, because (i) the progression of dementia is complicated and not fully understood, and (ii) in this observational study, the missingness could be due to various reasons. We therefore recommend the DR estimator in this example, which protects against model mis-specification. The DR estimator showed that MMSE has significantly worse classification accuracy for patients with stroke (probit AUCx coefficient = −0.433, 95% CI: −0.786, −0.080), and significantly better accuracy for patients with 17 or more years of education (probit AUCx coefficient = 0.259, 95% CI: 0.024, 0.494). Age and cardiovascular history are both marginally significant in affecting the accuracy. The magnitudes of the coefficients are interpreted on the probit scale. For example, the IPW estimator showed that, with a 10-year increase in age, the probit AUCx decreased by 0.132, with a 95% CI of (0.005, 0.259). We also noted that the imputation-based estimators, especially the FI estimator, deviated somewhat from the others, suggesting that the progression from aMCI to dementia might be very complicated and not well captured by the working disease model. The AAUC was estimated to be about 0.64. This implies that, although the MMSE score has some ability to predict progression to dementia, it is a less than satisfactory marker; this motivates further research to find a combination of several tests that serves as a better marker. The complete-case analysis was not too different from the DR estimator, because the verification process was not strongly affected by the covariates. In Figure 1 of the Web Appendix, we plot AUCx as a function of age, according to the DR estimator, with the other covariates fixed as a white male, single, living alone, with a history of cardiovascular diseases, no stroke, and over 17 years of education. A decreasing trend can be observed, but it is inconclusive at the 5% significance level.

Sicker patients might be more likely to miss a follow-up visit, and hence have missing D. Even though the verification model included MMSE and the CDR sum of boxes as measures of disease severity, there may still be unobserved attributes associated with the disease verification. Therefore, we also investigated the NMAR situation in our data set as a sensitivity analysis, and the results are presented in the bottom half of Table 3. The IPW estimator was close to the estimators under the MAR assumption. The MSI and FI estimators differed from the MAR estimators, perhaps due to the unsatisfactory disease model.

6. Discussion

This paper tackled the verification bias problem in ROC analysis with covariate adjustment. We focused specifically on AUCx and AAUC, and proposed several verification bias-corrected estimators. The proposed estimation procedure uses the fact that the pairwise comparison of test results has conditional expectation exactly equal to AUCx. To obtain the overall measure, AAUC, we took the weighted average of AUCx over the covariate distribution of the diseased subjects. The estimating equations for AUCx and AAUC were constructed and reweighted in several different ways. We proved the central limit theorem for the AUCx and AAUC estimators, and derived the analytic form of the asymptotic variances. This reweighting approach works for both MAR and NMAR verification processes; the only difference is in estimating the nuisance parameters. Simulation studies confirmed the good finite-sample performance of the proposed estimators and their analytical variance formulas.

The proposed AUCx estimators extend Dodd and Pepe (2003) to allow for a missing gold standard. Compared with Liu and Zhou (2011), our focus in this paper is on the covariate effects on the diagnostic accuracy itself. We therefore choose a summary measure, AUCx, and directly make assumptions on its form as a function of the covariate x, whereas Liu and Zhou (2011) modeled the test results first and then derived the ROCx expression. Neither of the two approaches is a special case of the other; they have different focuses. When one is more interested in estimating the ROC curve for particular groups of people, the method of Liu and Zhou (2011) applies; if one wishes to determine directly how the accuracy is affected by covariates, the proposed estimators answer that question. In this sense, the parameter interpretation for AUCx is easier and more direct.

The NMAR model requires both the disease and verification models to be correct. While the model assumptions cannot be tested nonparametrically using the observed data, we recommend building plausible models from the researcher's scientific knowledge. In practice, the NMAR model can serve as a sensitivity analysis to supplement the MAR model. The traditional setting of the sensitivity analysis (Rotnitzky et al., 2006) involves fixing the nonignorable parameter at different values, but it remains unclear which value is the most likely. Here we make additional model assumptions from prior knowledge, and use this "model information" to infer the nonignorable parameter.

A possible weakness of the proposed estimators is the computational complexity. As all pairs of test results {Iij} contribute to the estimating equations, the computational load can be extremely heavy for large data sets. For a sample size of n = 10,000, the FI, MSI and DR estimators all need to fit a weighted binary regression with n^2 = 10^8 observations. The IPW estimator is faster, because only the Iij pairs with Di = 1 and Dj = 0 have non-zero weights; if the disease prevalence is 50%, the weighted regression has (n × 0.5)^2 = 2.5 × 10^7 observations. Generally speaking, the computational load of the proposed estimators is of order O(n^2). Another possible limitation of our proposed method is the parametric form of the link function. Although the probit or logit link covers a wide variety of possible AUC structures, there may be examples where the true link function is non-standard, especially when the test distribution is highly skewed. This motivates future research on single index models for the AUC.


Table 3.

Estimated covariate effects on AUCx, AAUC and their associated standard errors for the NACC UDS example.

DR (MAR) IPW (MAR) MSI (MAR) FI (MAR)
Age (per 10 years) −0.118 (0.061) −0.132 (0.065) a −0.106 (0.060) −0.075 (0.058)
Gender (Male vs. Female) −0.030 (0.097) −0.019 (0.100) −0.058 (0.091) −0.105 (0.088)
Race (White vs. Non-white) −0.027 (0.130) −0.085 (0.132) −0.054 (0.128) −0.091 (0.126)
Education 12− years (reference) - - - -
Education 13–16 years 0.085 (0.110) 0.122 (0.112) 0.066 (0.105) 0.043 (0.100)
Education 17+ years 0.259 (0.120) 0.282 (0.121) 0.232 (0.113) 0.202 (0.108)
Living situation (Alone vs. Others) 0.106 (0.146) 0.035 (0.160) 0.084 (0.139) 0.077 (0.132)
Marital status (Married vs. Others) 0.100 (0.149) 0.023 (0.164) 0.111 (0.142) 0.132 (0.134)
Stroke (Yes vs No) −0.433 (0.180) −0.431 (0.191) −0.403 (0.180) −0.338 (0.178)
Cardiovascular history (Yes vs No) 0.142 (0.094) 0.188 (0.096) 0.163 (0.087) 0.187 (0.084)

AAUC 0.6371 (0.0163) 0.6398 (0.0166) 0.6372 (0.0164) 0.6379 (0.0164)
CC IPW (NMAR) MSI (NMAR) FI (NMAR)
Age (per 10 years) −0.143 (0.064) −0.137 (0.065) −0.144 (0.057) −0.125 (0.052)
Gender (Male vs. Female) −0.012 (0.098) −0.017 (0.098) −0.025 (0.091) −0.029 (0.085)
Race (White vs. Non-white) −0.077 (0.132) −0.074 (0.131) −0.017 (0.123) 0.015 (0.123)
Education 12− years (reference) - - - -
Education 13–16 years 0.116 (0.112) 0.119 (0.111) 0.066 (0.103) 0.023 (0.098)
Education 17+ years 0.271 (0.120) 0.275 (0.120) 0.189 (0.111) 0.063 (0.104)
Living situation (Alone vs. Others) 0.003 (0.158) 0.010 (0.158) 0.027 (0.138) −0.089 (0.120)
Marital status (Married vs. Others) 0.001 (0.161) −0.001 (0.161) 0.038 (0.141) −0.103 (0.121)
Stroke (Yes vs No) −0.425 (0.188) −0.430 (0.185) −0.360 (0.181) 0.016 (0.204)
Cardiovascular history (Yes vs No) 0.190 (0.095) 0.200 (0.093) 0.159 (0.094) 0.016 (0.085)

AAUC 0.6388 (0.0166) 0.6374 (0.0161) 0.6291 (0.0159) 0.6229 (0.0151)
a Coefficients in bold are significant at p < 0.05.

Acknowledgement

The authors would like to thank National Alzheimer's Coordinating Center (NACC) for providing the data for analysis. The authors are grateful to the associate editor and two anonymous referees for their insightful comments, which greatly improved the quality of this paper. The work was supported in part by NIH/NIA grant U01AG016976. This paper does not necessarily represent the findings and conclusions of VA HSR&D. Dr. Xiao-Hua Zhou is presently a Core Investigator and Biostatistics Unit Director at HSR&D Center of Excellence, Department of Veterans Affairs Puget Sound Health Care System, Seattle, WA.

Footnotes

7. Supplementary Materials: Supplementary Web Appendices referenced in Sections 2–5 are available with this paper at the Biometrics website on Wiley Online Library.

References

  1. Alonzo TA, Pepe MS. Assessing accuracy of a continuous screening test in the presence of verification bias. Journal of the Royal Statistical Society, Series C (Applied Statistics). 2005;54:173–190.
  2. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics. 1983;39:207–215.
  3. Bennett DA, Schneider JA, Aggarwal NT, Arvanitakis Z, Shah RC, Kelly JF, Fox JH, Cochran EJ, Arends D, Treinkman A, Wilson RS. Decision rules guiding the clinical diagnosis of Alzheimer's disease in two community-based cohort studies compared to standard practice in a clinic-based cohort study. Neuroepidemiology. 2006;27:169–176. doi: 10.1159/000096129.
  4. Dodd LE, Pepe MS. Semiparametric regression for the area under the receiver operating characteristic curve. Journal of the American Statistical Association. 2003;98:409–417.
  5. Fluss R, Reiser B, Faraggi D, Rotnitzky A. Estimation of the ROC curve under verification bias. Biometrical Journal. 2009;51:475–490. doi: 10.1002/bimj.200800128.
  6. He H, Lyness ML, McDermott MP. Direct estimation of the area under the receiver operating characteristic curve in the presence of verification bias. Statistics in Medicine. 2009;28:361–376. doi: 10.1002/sim.3388.
  7. Janes H, Pepe MS. Adjusting for covariate effects on classification accuracy using the covariate-adjusted receiver operating characteristic curve. Biometrika. 2009;96:371–382. doi: 10.1093/biomet/asp002.
  8. Koepsell TD, Chi YY, Zhou XH, Lee WW, Ramos EM, Kukull WA. An alternative method for estimating efficacy of the AN1792 vaccine for Alzheimer disease. Neurology. 2007;69:1868–1872. doi: 10.1212/01.wnl.0000278226.96003.f8.
  9. Kosinski AS, Barnhart HX. Accounting for nonignorable verification bias in assessment of diagnostic tests. Biometrics. 2003;59:163–171. doi: 10.1111/1541-0420.00019.
  10. Lehmann EL. Elements of Large Sample Theory. New York: Springer; 1999.
  11. Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd edition. New York: John Wiley; 2002.
  12. Liu D, Zhou XH. A model for adjusting for nonignorable verification bias in estimation of the ROC curve and its area with likelihood-based approach. Biometrics. 2010;66:1119–1128. doi: 10.1111/j.1541-0420.2010.01397.x.
  13. Liu D, Zhou XH. Semiparametric estimation of the covariate-specific ROC curve in presence of ignorable verification bias. Biometrics. 2011;67:906–916. doi: 10.1111/j.1541-0420.2011.01562.x.
  14. Page JH, Rotnitzky A. Estimation of the disease-specific diagnostic marker distribution under verification bias. Computational Statistics and Data Analysis. 2009;53:707–717. doi: 10.1016/j.csda.2008.06.021.
  15. Rotnitzky A, Faraggi D, Schisterman E. Doubly robust estimation of the area under the receiver-operating characteristic curve in the presence of verification bias. Journal of the American Statistical Association. 2006;101:1276–1288.
  16. Zhou XH, Castelluccio P. Nonparametric analysis for the ROC areas of two diagnostic tests in the presence of nonignorable verification bias. Journal of Statistical Planning and Inference. 2003;115:193–213.
