Author manuscript; available in PMC: 2021 May 28.
Published in final edited form as: Biometrics. 2010 Sep 3;67(2):559–567. doi: 10.1111/j.1541-0420.2010.01487.x

Robust Estimation of Area Under ROC Curve Using Auxiliary Variables In the Presence of Missing Biomarker Values

Qi Long 1,*, Xiaoxi Zhang 2, Brent A Johnson 1
PMCID: PMC8162996  NIHMSID: NIHMS1704813  PMID: 20825391

Summary:

In medical research, the receiver operating characteristic (ROC) curves can be used to evaluate the performance of biomarkers for diagnosing diseases or predicting the risk of developing a disease in the future. The area under the ROC curve (AUC), as a summary measure of ROC curves, is widely utilized, especially when comparing multiple ROC curves. In observational studies, the estimation of the AUC is often complicated by the presence of missing biomarker values, which means that the existing estimators of the AUC are potentially biased. In this article, we develop robust statistical methods for estimating the ROC AUC and the proposed methods use information from auxiliary variables that are potentially predictive of the missingness of the biomarkers or the missing biomarker values. We are particularly interested in auxiliary variables that are predictive of the missing biomarker values. In the case of missing at random (MAR), i.e., missingness of biomarker values only depends on the observed data, our estimators have the attractive feature of being consistent if one correctly specifies, conditional on auxiliary variables and disease status, either the model for the probabilities of being missing or the model for the biomarker values. In the case of missing not at random (MNAR), i.e., missingness may depend on the unobserved biomarker values, we propose a sensitivity analysis to assess the impact of MNAR on the estimation of the ROC AUC. The asymptotic properties of the proposed estimators are studied and their finite sample behaviors are evaluated in simulation studies. The methods are further illustrated using data from a study of maternal depression during pregnancy.

Keywords: Area under the curve, Biomarker, Doubly robust estimators, Missing at random, Missing not at random, Receiver operating characteristic curve, Sensitivity analysis

1. Introduction

The receiver operating characteristic (ROC) curve plots the fraction of true positives (sensitivity) against the fraction of false positives (1–specificity) as the discrimination threshold (e.g., of a biomarker for a disease) is varied, and it is often used to evaluate the performance of biomarkers for diagnosing diseases or predicting the risk of developing diseases in the future. It was originally developed for the analysis of signal detection (Green and Swets, 1966) and was first used in medicine for the assessment of imaging devices (Zweig and Campbell, 1993). In medical studies, summary measures of ROC curves are often used and they are particularly powerful when comparing several ROC curves. The most widely used summary measure is the area under the ROC curve (ROC AUC) (Bamber, 1975). The ROC AUC is bounded between 0.5 and 1, and has the interpretation of the probability of a randomly selected observation from the diseased (non-diseased) population having a higher biomarker value than that from the non-diseased (diseased) population. Therefore, a large AUC value represents good separation in the biomarker values between the diseased and non-diseased populations. In particular, a perfect test would achieve an AUC of 1.0, whereas an uninformative test would have an AUC of 0.5. A wealth of literature has been developed for this type of research (Pepe (2003) and references therein).

In practice, the biomarker value may be missing for some subjects, especially in observational studies. Take for example a self-rated mental illness score collected from pregnant women in a psychiatric study, where the disease of interest is the presence (or absence) of a major depressive episode throughout pregnancy (see Section 4 for more details). Since the biomarker score is self-rated, it is possible that some subjects did not complete the self-evaluation and hence the score is missing. In such studies, additional variables including demographic and baseline variables are often available, which are referred to as auxiliary variables. While these variables are not of primary interest themselves, they are potentially predictive of the missingness of the biomarker value or the value itself, and can be incorporated in a data analysis to improve its robustness and/or efficiency. If an auxiliary variable is predictive of missingness but independent of the missing values, then using it in an analysis will not affect the results. Thus, we are interested in auxiliary variables that are predictive of the missing values, especially if they are also predictive of the missingness.

As with the general setting discussed in Little and Rubin (2002) and references therein, a naive analysis that only uses complete observations may lead to bias and loss of efficiency in the estimation of the ROC AUC. First, when the biomarker is missing completely at random (MCAR), i.e., the missingness does not depend on either observed or unobserved data, the naive analysis is valid but is not efficient. Second, when the biomarker is missing at random (MAR), i.e., the missingness is conditionally independent of the missing data given the observed data, the naive analysis is biased and other methods, e.g., inverse-weighted (IW) methods, can be extended for consistent estimation. IW methods weight each complete case by the inverse of the probability of observing the biomarker value. Despite its conceptual simplicity, IW methods have limitations. Most notably, IW methods are not efficient and are subject to bias if one misspecifies the model for the missingness. Alternatively, one can extend the methods that are doubly robust and more efficient (Robins et al., 1994; Scharfstein et al., 1999) for estimating the ROC AUC. In the case of missing not at random (MNAR), i.e., missingness depends on unobserved biomarker values even after conditioning on the observed data, it is common practice to conduct sensitivity analysis (Zhou, 1994; Rotnitzky and Robins, 1997; Scharfstein et al., 1999; Kosinski and Barnhart, 2003). In all cases, auxiliary variables can be used to potentially reduce bias and improve efficiency when associated with the probability of missing and the value of biomarkers, or simply improve efficiency when only associated with the value of biomarkers.

We confine the scope of this paper to the case where the disease status is always confirmed and a set of auxiliary variables are fully observed but the biomarker values are missing for some subjects, and we are interested in estimating the ROC AUC. Our setting is to be distinguished from the existing research on verification bias (Zhou, 1993, 1998; Rotnitzky et al., 2006; Fluss et al., 2009). In the presence of verification bias, the biomarker values are always observed whereas the true disease status is only verified for a non-random sample of the population of interest, e.g., the selection for testing may depend on the disease status or other variables. In particular, Rotnitzky et al. (2006) extended the doubly robust method developed in Rotnitzky and Robins (1997) to the estimation of the ROC AUC in the presence of verification biases. As a result of different problem setups (i.e. biomarker values missing vs. disease status unconfirmed for a subset of subjects), there are important differences between our work and theirs. In our setting, a working model on biomarker values, which can be continuous or categorical, is utilized, whereas a working model on the presence (or absence) of the disease, a binary variable, was utilized in Rotnitzky et al. (2006); consequently, our methods require modeling of the conditional distribution of biomarker values. Furthermore, we study and compare parametric and nonparametric approaches for estimating this conditional distribution and discuss two types of MAR assumptions, which have different implications on the estimation of AUC.

The remainder of the article is organized as follows. In Section 2, we describe the proposed estimators and their theoretical properties under MAR and propose a sensitivity analysis under MNAR. In Section 3, we evaluate the finite sample performance of the proposed estimators through simulations. In Section 4, we apply the proposed methods to a psychiatric study of maternal depression during pregnancy. We conclude with a discussion in Section 5.

2. Methodology

Suppose that a random sample of n subjects is selected from a population of interest to evaluate the performance of a diagnostic or predictive test using a biomarker. Each subject i, i = 1, … , n, is classified into one of two groups, diseased (Di = 1) or non-diseased (Di = 0), based on a gold standard. For each subject i, denote the biomarker value by Xi, which is used to diagnose or predict the disease status (Di). Xi is not observed in a subset of the subjects, and let δi denote the missing indicator for Xi, i.e., δi = 1 when Xi is observed and δi = 0 if Xi is missing. In addition, p auxiliary variables that may be associated with the value of Xi and/or its missingness (δi) are also collected and denoted by $Z_i = (Z_i^{(1)}, \ldots, Z_i^{(p)})^T$. Then for subject i, the complete data are (Di, Zi, δi, Xi). When δi = 1, the observed data are Oi = (Di, Zi, δi, Xi) and subject i is called a complete case; when δi = 0, the observed data are Oi = (Di, Zi, δi) and subject i is called an incomplete case. We denote by O the collection of observed data for all subjects. When δi is independent of Xi conditional on Di and Zi, it is a case of MAR; when δi is dependent on Xi conditional on Di and Zi, it is a case of MNAR.

We are interested in estimating the ROC AUC, which is equivalent to a U-statistic (Bamber, 1975), θ = Pr(Xi > Xj | Di = 1, Dj = 0), assuming that the diseased tend to have higher biomarker values. When all data are completely observed, an unbiased estimator of θ is

$$\hat\theta = \frac{1}{\sum_{i \neq j} D_i (1 - D_j)} \sum_{i \neq j} D_i (1 - D_j) I_{ij},$$

where Iij = I(Xi > Xj) + 0.5I(Xi = Xj), and I(A) equals 1 if A is true and 0 if A is false. When X is missing for some subjects, a naive extension of the above estimator using only the complete observations (i.e., δi = 1) is

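$$\hat\theta_0 = \frac{1}{\sum_{i \neq j} D_i (1 - D_j) \delta_i \delta_j} \sum_{i \neq j} D_i (1 - D_j) \delta_i \delta_j I_{ij}. \tag{1}$$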

It is straightforward to verify the following proposition:

Proposition 1: (i) When δ is independent of X given D, θ^0 is an unbiased estimator of θ; (ii) when δ is dependent on X given D, then θ^0 is subject to potential bias.

We note that (i) includes the case of MCAR and a special case of MAR where δ may depend on D and Z and is independent of X given D and (ii) includes the case of MNAR and a special case of MAR where δ is dependent on X conditional on D but is independent of X conditional on D and Z. We refer to θ^0 as the naive estimator throughout this article.
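As an illustration only, the complete-case estimator in Equation (1) can be sketched in Python. Because Di(1 − Dj) is nonzero only for a diseased i and a non-diseased j, the double sum reduces to an average of Iij over all observed diseased/non-diseased pairs. The function names `pairwise_kernel` and `auc_complete_case` are hypothetical and are not part of the original work.

```python
import numpy as np

def pairwise_kernel(x_dis, x_non):
    """I_ij = I(X_i > X_j) + 0.5 * I(X_i = X_j) for all diseased/non-diseased pairs."""
    x_dis = np.asarray(x_dis, dtype=float)
    x_non = np.asarray(x_non, dtype=float)
    diff = x_dis[:, None] - x_non[None, :]
    return (diff > 0).astype(float) + 0.5 * (diff == 0)

def auc_complete_case(x, d, delta):
    """Naive estimator of Equation (1); x may hold NaN wherever delta == 0."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d)
    observed = np.asarray(delta) == 1
    x_dis = x[observed & (d == 1)]   # observed biomarker values, diseased
    x_non = x[observed & (d == 0)]   # observed biomarker values, non-diseased
    return pairwise_kernel(x_dis, x_non).mean()

theta0 = auc_complete_case(
    x=[3.0, 1.0, 2.0, 2.0, float("nan")],
    d=[1, 0, 1, 0, 1],
    delta=[1, 1, 1, 1, 0],
)
```

With all δi = 1 the same function returns the full-data estimator θ^.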

2.1. Inverse-Weighted Estimator

In the case of MAR, we first study an inverse-weighted estimator,

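$$\hat\theta_{IW} = \frac{1}{\sum_{i \neq j} \frac{\delta_i \delta_j}{\hat\pi_i \hat\pi_j} D_i (1 - D_j)} \sum_{i \neq j} \frac{\delta_i \delta_j}{\hat\pi_i \hat\pi_j} D_i (1 - D_j) I_{ij}, \tag{2}$$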

where π^i is an estimate of the probability of observing Xi, namely, πi = Pr(δi = 1), conditional on Zi and Di under MAR. We denote by (M1) the working model for πi given Zi and Di with a set of unknown parameters, α, and denote by $A(\alpha; O) = \sum_i A_i(\alpha; O)$ the estimating equations for computing the estimate of α, namely, α^, based on the observed data. For instance, one can use a logistic regression model for (M1), i.e., logit(πi) = W(Zi, Di; α), where W(Zi, Di; α) is a function of Zi and Di and is parameterized by α; A(α; O) can be taken as the likelihood equations for the logistic regression model. (M1) is also known as the propensity score model (Rosenbaum and Rubin, 1983). It can be readily shown that if the working model (M1) is correctly specified, θ^IW is a consistent estimator of θ under MAR.
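A minimal Python sketch of the inverse-weighted estimator in Equation (2) follows. It assumes the fitted probabilities π^i have already been obtained from some working model (M1), for example a logistic regression fitted elsewhere; the function name `auc_iw` and its argument names are hypothetical.

```python
import numpy as np

def auc_iw(x, d, delta, pi_hat):
    """Inverse-weighted AUC estimator of Equation (2).

    x may contain NaN where delta == 0; those entries receive weight zero.
    pi_hat[i] is the fitted probability that X_i is observed (assumed given).
    """
    x, d, delta, pi_hat = (np.asarray(a, dtype=float) for a in (x, d, delta, pi_hat))
    w = np.where(delta == 1, 1.0 / pi_hat, 0.0)    # delta_i / pi_hat_i
    x_filled = np.where(delta == 1, x, 0.0)        # placeholder value, weight is 0
    dis, non = d == 1, d == 0
    diff = x_filled[dis][:, None] - x_filled[non][None, :]
    kernel = (diff > 0) + 0.5 * (diff == 0)        # I_ij
    pair_w = w[dis][:, None] * w[non][None, :]     # delta_i delta_j / (pi_i pi_j)
    return (pair_w * kernel).sum() / pair_w.sum()

theta_iw = auc_iw(
    x=[3.0, 0.0, 1.0, 2.0],
    d=[1, 1, 0, 0],
    delta=[1, 1, 1, 1],
    pi_hat=[0.5, 1.0, 1.0, 1.0],
)
```

When all π^i are equal, the weights cancel between numerator and denominator and the result coincides with the naive estimator θ^0.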

2.2. Doubly Robust Estimators

In the case of MAR, we propose a second estimator

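$$\hat\theta_{DR} = \frac{1}{\sum_{i \neq j} \frac{\delta_i \delta_j}{\hat\pi_i \hat\pi_j} D_i (1 - D_j)} \sum_{i \neq j} D_i (1 - D_j) \left\{ \frac{\delta_i \delta_j}{\hat\pi_i \hat\pi_j} I_{ij} - \frac{\delta_i \delta_j - \hat\pi_i \hat\pi_j}{\hat\pi_i \hat\pi_j} E(I_{ij} \mid Z_i, Z_j, D_i = 1, D_j = 0) \right\}, \tag{3}$$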

where π^i is the same as previously defined and E(Iij | Zi, Zj, Di = 1, Dj = 0) can be estimated based on the joint conditional distribution of Xi and Xj given the observed data. Specifically, we denote by (M2) the working model for characterizing the conditional distribution of X given Z and D with a set of unknown parameters, β, and denote by $B(\beta; O) = \sum_i B_i(\beta; O)$ the estimating equations for computing the estimate of β, namely, β^, based on the observed data. We note that the conditional mean of X given Z and D is only part of (M2). It can be readily shown that if either (M1) or (M2) is correctly specified, θ^DR is a consistent estimator of θ under MAR.

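We consider two options for the working model (M2). In the first option, X given Z and D is assumed to follow a known parametric distribution with unknown parameters β. One special case is the Gaussian distribution, i.e., $[X_i \mid Z_i, D_i] \sim N(V(Z_i, D_i; \beta_1), \sigma_1^2 D_i + \sigma_0^2 (1 - D_i))$, where V(Zi, Di; β1) is a function of Zi and Di parameterized by β1. Let $\beta = (\beta_1^T, \sigma_1^2, \sigma_0^2)^T$ denote all parameters of interest, and it follows that

$$X_i - X_j \mid Z_i, Z_j, D_i = 1, D_j = 0 \sim N\left(V(Z_i, D_i = 1; \beta_1) - V(Z_j, D_j = 0; \beta_1),\ \sigma_1^2 + \sigma_0^2\right),$$

and hence

$$E\{I_{ij} \mid Z_i, Z_j, D_i = 1, D_j = 0\} = \Phi\left(\frac{V(Z_i, D_i = 1; \beta_1) - V(Z_j, D_j = 0; \beta_1)}{\sqrt{\sigma_1^2 + \sigma_0^2}}\right),$$

where Φ(·) is the cumulative distribution function (c.d.f.) of a standard normal random variable. θ^DR can be rewritten as

$$\hat\theta_{DR} = \frac{1}{\sum_{i \neq j} \frac{\delta_i \delta_j}{\hat\pi_i \hat\pi_j} D_i (1 - D_j)} \sum_{i \neq j} D_i (1 - D_j) \left[ \frac{\delta_i \delta_j}{\hat\pi_i \hat\pi_j} I_{ij} - \frac{\delta_i \delta_j - \hat\pi_i \hat\pi_j}{\hat\pi_i \hat\pi_j} \Phi\{b_{ij}(\hat\beta)\} \right], \tag{4}$$

where $b_{ij}(\hat\beta) = \dfrac{V(Z_i, D_i = 1; \hat\beta_1) - V(Z_j, D_j = 0; \hat\beta_1)}{\sqrt{\hat\sigma_1^2 + \hat\sigma_0^2}}$ and β^ can be obtained through, say, linear regression for (M2) using the observed data. From here on, let θ^DR denote the doubly robust estimator in Equation (4), which assumes that the conditional distribution of X is Gaussian.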
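The doubly robust estimator with a Gaussian working model, Equation (4), might be computed along the following lines. This is a sketch under stated assumptions rather than the authors' implementation: the fitted propensities `pi_hat`, the fitted means `v1[i]` = V(Zi, D = 1; β^1) and `v0[i]` = V(Zi, D = 0; β^1), and the residual standard deviations `s1`, `s0` are taken as inputs from models fitted elsewhere, and all function and argument names are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(z) / math.sqrt(2.0)))

def auc_dr_gaussian(x, d, delta, pi_hat, v1, v0, s1, s0):
    """Doubly robust AUC estimator of Equation (4) with a Gaussian (M2)."""
    x, d, delta, pi_hat, v1, v0 = (
        np.asarray(a, dtype=float) for a in (x, d, delta, pi_hat, v1, v0)
    )
    dis, non = d == 1, d == 0
    x_filled = np.where(delta == 1, x, 0.0)             # unused when delta == 0
    diff = x_filled[dis][:, None] - x_filled[non][None, :]
    kernel = (diff > 0) + 0.5 * (diff == 0)             # I_ij
    dd = delta[dis][:, None] * delta[non][None, :]      # delta_i * delta_j
    pp = pi_hat[dis][:, None] * pi_hat[non][None, :]    # pi_i * pi_j
    b = (v1[dis][:, None] - v0[non][None, :]) / math.sqrt(s1 ** 2 + s0 ** 2)
    phi = std_normal_cdf(b)                             # model-based E(I_ij | ...)
    numer = (dd / pp * kernel - (dd - pp) / pp * phi).sum()
    denom = (dd / pp).sum()
    return numer / denom

theta_dr = auc_dr_gaussian(
    x=[3.0, float("nan"), 1.0, 2.0], d=[1, 1, 0, 0], delta=[1, 0, 1, 1],
    pi_hat=[0.5, 0.5, 1.0, 1.0], v1=[10.0] * 4, v0=[0.0] * 4, s1=1.0, s0=1.0,
)
```

In the toy call the incomplete diseased subject still contributes through the augmentation term, which is what distinguishes Equation (4) from Equation (2).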

In the second option, suppose Xi = V(Zi, Di; β1) + ε1iDi + ε0i(1 − Di), where {ε1i, i = 1, …, n1} and {ε0i, i = 1, …, n0} are independent and identically distributed (i.i.d.) random errors in the diseased and non-diseased, respectively, and their respective distributions are unknown. In this case, the conditional distribution of Xi given Zi and Di can be estimated nonparametrically. We denote the set of observed residuals by $\{\hat\varepsilon_{1k} = X_k - V(Z_k, D_k = 1; \hat\beta_1),\ k = 1, \ldots, n_1^o\}$ and $\{\hat\varepsilon_{0l} = X_l - V(Z_l, D_l = 0; \hat\beta_1),\ l = 1, \ldots, n_0^o\}$ for the diseased and non-diseased, respectively, where $n_1^o$ and $n_0^o$ are the number of subjects with observed X in the diseased and non-diseased, respectively. An empirical sample of the estimated conditional distribution of Xi given Zi and Di can be constructed as $\{\tilde X_{ik}^1 = V(Z_i, D_i = 1; \hat\beta_1) + \hat\varepsilon_{1k},\ k = 1, \ldots, n_1^o\}$ in the diseased and $\{\tilde X_{il}^0 = V(Z_i, D_i = 0; \hat\beta_1) + \hat\varepsilon_{0l},\ l = 1, \ldots, n_0^o\}$ in the non-diseased. E(Iij | Zi, Zj, Di = 1, Dj = 0) in Equation (3) can then be estimated using $\frac{1}{n_1^o n_0^o} \sum_{k=1}^{n_1^o} \sum_{l=1}^{n_0^o} \{I(\tilde X_{ik}^1 > \tilde X_{jl}^0) + 0.5 I(\tilde X_{ik}^1 = \tilde X_{jl}^0)\}$, where i and j go through all subjects including those with missing X, and we denote the resulting nonparametric estimator of θ by θ^DRN. When random errors are not i.i.d., e.g., the variance changes as the mean of X changes, the above procedure needs to be modified accordingly, e.g., performed within strata of the mean of X.
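The residual-based estimate of E(Iij | Zi, Zj, Di = 1, Dj = 0) for a single pair (i, j) can be sketched as below. The sketch assumes the fitted means and the observed residuals from (M2) are already available; the function name `expected_kernel_residual` and its arguments are hypothetical.

```python
import numpy as np

def expected_kernel_residual(v1_i, v0_j, resid1, resid0):
    """Nonparametric estimate of E(I_ij | Z_i, Z_j, D_i = 1, D_j = 0).

    v1_i   : fitted mean V(Z_i, D = 1; beta1_hat) for subject i
    v0_j   : fitted mean V(Z_j, D = 0; beta1_hat) for subject j
    resid1 : observed residuals among the diseased (length n1o)
    resid0 : observed residuals among the non-diseased (length n0o)
    """
    x1 = v1_i + np.asarray(resid1, dtype=float)   # pseudo-sample for X_i given D = 1
    x0 = v0_j + np.asarray(resid0, dtype=float)   # pseudo-sample for X_j given D = 0
    diff = x1[:, None] - x0[None, :]
    # Average the kernel over all n1o * n0o residual pairs.
    return ((diff > 0) + 0.5 * (diff == 0)).mean()

e_ij = expected_kernel_residual(1.0, 0.0, resid1=[0.0, 1.0], resid0=[0.0, 1.0])
```

Substituting this quantity for Φ{bij(β^)} in Equation (4), pair by pair, gives the nonparametric version of the doubly robust estimator; the cost grows with the number of pairs times the number of residual pairs, so a vectorised or subsampled variant may be preferable for large n.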

When computing θ^IW, θ^DR and θ^DRN, the weights ($1/\hat\pi_i$) may be large and unstable, and lead to extra noise in estimation, in particular, when computing the bootstrap SE of θ^DRN. Thus, we consider a simple modification to stabilize the weights, namely, replacing $1/\hat\pi_i$ with $\dfrac{1/\hat\pi_i}{\frac{1}{n}\sum_{k} \delta_k/\hat\pi_k}$. When (M1) is correctly specified, it can be readily shown that $\frac{1}{n}\sum_{k} \delta_k/\hat\pi_k$ converges to 1 in probability, hence $\dfrac{1/\hat\pi_i}{\frac{1}{n}\sum_{k} \delta_k/\hat\pi_k}$ is equivalent to $1/\hat\pi_i$ asymptotically.
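A short sketch of the weight stabilization, reading the modified weight as 1/π^i divided by the sample average of δk/π^k over all n subjects. The function name `stabilized_inverse_weights` is hypothetical.

```python
import numpy as np

def stabilized_inverse_weights(delta, pi_hat):
    """Return (1 / pi_hat_i) divided by the sample mean of delta_k / pi_hat_k."""
    delta = np.asarray(delta, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    norm = np.mean(delta / pi_hat)      # tends to 1 when (M1) is correctly specified
    return (1.0 / pi_hat) / norm

w = stabilized_inverse_weights(delta=[1, 1], pi_hat=[0.5, 0.5])
```

By construction the stabilized weights of the complete cases average to one over the sample, i.e., the mean of δi times the returned weight equals 1.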

2.3. Theoretical Properties

Following our previous notation, we further define the following,

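$$U_{i,j}(\theta, \alpha) \equiv \theta \frac{\delta_i \delta_j}{\pi_i \pi_j} D_i (1 - D_j) - \frac{\delta_i \delta_j}{\pi_i \pi_j} D_i (1 - D_j) I_{ij},$$

$$V_{i,j}(\theta, \alpha, \beta) \equiv \theta \frac{\delta_i \delta_j}{\pi_i \pi_j} D_i (1 - D_j) - \frac{\delta_i \delta_j}{\pi_i \pi_j} D_i (1 - D_j) I_{ij} + \frac{\delta_i \delta_j - \pi_i \pi_j}{\pi_i \pi_j} D_i (1 - D_j) E(I_{ij} \mid Z_i, Z_j, D_i, D_j),$$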

where πi depends on α and E(Iij | Zi, Zj, Di, Dj) depends on β. It follows that $U = \sum_{i \neq j} U_{i,j}(\theta, \hat\alpha)$ and $V = \sum_{i \neq j} V_{i,j}(\theta, \hat\alpha, \hat\beta)$ are the set of estimating equations for θ^IW and θ^DR, respectively. Let α0 and β0 be the probability limits of α^ and β^, respectively, which usually exist.

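Theorem 1: Under the regularity conditions (A1)–(A3) given in Web Appendix A, if either or both of (M1) and (M2) are correctly specified, then $\sqrt{n}(\hat\theta_{DR} - \theta) \to N(0, \Omega)$ in distribution, where $\Omega = \mathrm{Var}\left[\left\{E \frac{\delta_i \delta_j}{\pi_i \pi_j} D_i (1 - D_j)\right\}^{-1} Q_i(\theta, \alpha_0, \beta_0)\right]$, and

$$Q_i = E\{V_{i,j}(\theta, \alpha, \beta) + V_{j,i}(\theta, \alpha, \beta) \mid O_i\} + \left[\frac{\partial}{\partial \alpha} E\{V_{i,j}(\theta, \alpha, \beta)\}\right] \times \left[\frac{\partial}{\partial \alpha} E\{A_i(\alpha)\}\right]^{-1} A_i(\alpha) + \left[\frac{\partial}{\partial \beta} E\{V_{i,j}(\theta, \alpha, \beta)\}\right] \times \left[\frac{\partial}{\partial \beta} E\{B_i(\beta)\}\right]^{-1} B_i(\beta).$$

Ω can be consistently estimated by $\hat\Omega = \frac{1}{\gamma^2 n} \sum_{i=1}^n \hat Q_i^2$ with $\gamma = \frac{1}{n^2} \sum_{i,j} \frac{\delta_i \delta_j}{\hat\pi_i \hat\pi_j} D_i (1 - D_j)$ and

$$\hat Q_i = \frac{1}{n} \left[\sum_j \{V_{i,j}(\hat\theta_{DR}, \hat\alpha, \hat\beta) + V_{j,i}(\hat\theta_{DR}, \hat\alpha, \hat\beta)\}\right] + \frac{1}{n} \left[\sum_{i \neq j} \left.\frac{\partial V_{i,j}(\hat\theta_{DR}, \alpha, \hat\beta)}{\partial \alpha}\right|_{\alpha = \hat\alpha}\right] \left[\sum_i \left.\frac{\partial A_i(\alpha)}{\partial \alpha}\right|_{\alpha = \hat\alpha}\right]^{-1} A_i(\hat\alpha) + \frac{1}{n} \left[\sum_{i \neq j} \left.\frac{\partial V_{i,j}(\hat\theta_{DR}, \hat\alpha, \beta)}{\partial \beta}\right|_{\beta = \hat\beta}\right] \left[\sum_i \left.\frac{\partial B_i(\beta)}{\partial \beta}\right|_{\beta = \hat\beta}\right]^{-1} B_i(\hat\beta).$$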

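Theorem 2: Under the regularity conditions similar to (A1)–(A3) given in Web Appendix A, if (M1) is correctly specified, then $\sqrt{n}(\hat\theta_{IW} - \theta) \to N(0, \Omega)$ in distribution, where $\Omega = \mathrm{Var}\left[\left\{E \frac{\delta_i \delta_j}{\pi_i \pi_j} D_i (1 - D_j)\right\}^{-1} R_i(\theta, \alpha_0)\right]$, and

$$R_i = E\{U_{i,j}(\theta, \alpha) + U_{j,i}(\theta, \alpha) \mid O_i\} + \left[\frac{\partial}{\partial \alpha} E\{U_{i,j}(\theta, \alpha)\}\right] \times \left[\frac{\partial}{\partial \alpha} E\{A_i(\alpha)\}\right]^{-1} A_i(\alpha).$$

Ω can be consistently estimated by $\hat\Omega = \frac{1}{\gamma^2 n} \sum_{i=1}^n \hat R_i^2$ with $\gamma = \frac{1}{n^2} \sum_{i,j} \frac{\delta_i \delta_j}{\hat\pi_i \hat\pi_j} D_i (1 - D_j)$ and

$$\hat R_i = \frac{1}{n} \left[\sum_j \{U_{i,j}(\hat\theta_{IW}, \hat\alpha) + U_{j,i}(\hat\theta_{IW}, \hat\alpha)\}\right] + \frac{1}{n} \left[\sum_{i \neq j} \left.\frac{\partial U_{i,j}(\hat\theta_{IW}, \alpha)}{\partial \alpha}\right|_{\alpha = \hat\alpha}\right] \left[\sum_i \left.\frac{\partial A_i(\alpha)}{\partial \alpha}\right|_{\alpha = \hat\alpha}\right]^{-1} A_i(\hat\alpha).$$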

A sketch of the proof of Theorems 1 and 2 is provided in Web Appendix A; it follows along lines similar to those of Rotnitzky et al. (2006). The underlying idea is to derive the influence functions for θ^IW or θ^DR by plugging in the influence functions for α^ and β^. The consistency of θ^DRN is straightforward to show when either (M1) or (M2) holds, and its SE can be computed using a bootstrap procedure, which resamples the data with replacement within disease strata.

A few remarks are in order. First, as stated in Proposition 1, θ^0 is unbiased when δ is independent of X given D; but if δ and X are associated with Z, θ^IW, θ^DR and θ^DRN are potentially more efficient when the working models are correctly specified. Second, when δ is dependent on X given D but independent of X given D and Z, θ^IW, θ^DR and θ^DRN are still consistent under suitable conditions, while θ^0 is subject to potential bias. Thirdly, θ^DR assumes that the residuals are Gaussian in (M2) and is subject to model misspecification even if the mean model is correctly specified; θ^DRN does not impose this restriction.

2.4. MNAR: Sensitivity Analysis

We now consider a case of MNAR, where δ is dependent on X conditional on Z and D; thus, a working model (M1) that only includes Z and D is misspecified. We investigate a sensitivity analysis to assess the impact on θ^IW, θ^DR and θ^DRN, as the effect of X on δ is varied. To fix the idea, suppose that logit(πi) = S(Zi, Di; αS) + U(Xi, αX), where αS and αX are two sets of unknown parameters associated with known functions S and U, respectively. αX represents the effect of biomarker values on the probability of being missing. Since αS and αX can not be jointly estimated using the observed data, we fix αX at a set of pre-determined values and estimate αS using the following set of estimating equations,

$$\sum_{i=1}^n \left(\frac{\delta_i}{\pi_i} - 1\right) W(Z_i, D_i) = 0, \tag{5}$$

where W(Zi, Di) is an arbitrary known vector function with the same dimension as αS. For instance, if S(Zi, Di; αS) = αSW(Zi, Di), then W(Zi, Di) is the covariate vector for subject i, which may include interaction terms. Compared to the likelihood equations for the logistic regression, one advantage of the estimating equations (5) is that πi is not needed when Xi is missing. For every pre-determined value of αX, we can compute α^S using (5) and π^i for subjects with observed Xi; subsequently we can compute θ^IW, θ^DR and θ^DRN, none of which needs π^i for subjects with missing Xi. This procedure is repeated for a grid of αX values, and the resulting estimators are compared to assess the impact of αX and hence the impact of MNAR. U(Xi, αX) = 0 corresponds to the case of MAR, and U(Xi, αX) ≠ 0 corresponds to the case of MNAR. In this sensitivity analysis, we do not assume that estimation of the parameters of (M2) is unaffected by MNAR. To simplify the sensitivity analysis and, in particular, avoid performing sensitivity analysis for two working models, we exploit the doubly robust property, i.e., if (M1) is correctly specified then the proposed estimators are consistent, and focus on (M1) only.
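One way to solve the estimating equations (5) for αS at a fixed αX is a damped Newton iteration, sketched below. The sketch assumes the linear forms S(Zi, Di; αS) = αS'W(Zi, Di) and U(Xi, αX) = αX·Xi, which are illustrative choices rather than requirements of the method; `fit_alpha_s` and its arguments are hypothetical names. Only subjects with observed Xi enter through 1/πi = 1 + exp(−ηi), so missing biomarker values are never evaluated.

```python
import numpy as np

def fit_alpha_s(w, x, delta, alpha_x, n_iter=50, tol=1e-10):
    """Solve sum_i (delta_i / pi_i - 1) W_i = 0 for alpha_S with alpha_X fixed,
    assuming logit(pi_i) = alpha_S' W_i + alpha_X * X_i (illustrative form)."""
    w = np.asarray(w, dtype=float)                  # n x q matrix of W(Z_i, D_i)
    obs = np.asarray(delta) == 1
    w_obs = w[obs]
    offset = alpha_x * np.asarray(x, dtype=float)[obs]
    total = w.sum(axis=0)                           # sum_i W_i over all subjects

    def score(a):
        inv_pi = 1.0 + np.exp(-(w_obs @ a + offset))    # 1 / pi_i, observed only
        return w_obs.T @ inv_pi - total

    alpha = np.zeros(w.shape[1])
    for _ in range(n_iter):
        g = score(alpha)
        if np.linalg.norm(g) < tol:
            break
        e = np.exp(-(w_obs @ alpha + offset))
        jac = -(w_obs * e[:, None]).T @ w_obs       # derivative of score w.r.t. alpha
        step = np.linalg.solve(jac, g)
        t = 1.0                                     # halve the step until the score shrinks
        while np.linalg.norm(score(alpha - t * step)) > np.linalg.norm(g) and t > 1e-8:
            t *= 0.5
        alpha = alpha - t * step
    return alpha

w_demo = np.ones((4, 1))                            # intercept-only W
x_demo = np.array([0.2, 1.0, -0.5, np.nan])
delta_demo = np.array([1, 1, 1, 0])
alpha_mar = fit_alpha_s(w_demo, x_demo, delta_demo, alpha_x=0.0)
```

In a sensitivity analysis one would call such a routine over a grid of αX values, recompute π^i for the complete cases each time, and then recompute the estimators of θ.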

3. Simulation studies

We conducted simulations to evaluate the finite sample performance of the proposed estimators, first in the case of MAR where δ is independent of X given D and Z, then in the case of MNAR where δ is dependent on X given D and Z. In our simulations, θ^0, θ^IW, θ^DR and θ^DRN were compared. In addition, we considered another estimator, namely, $\hat\theta_{IMP} = \frac{1}{\sum_{i \neq j} D_i (1 - D_j)} \sum_{i \neq j} D_i (1 - D_j) \left[\delta_i \delta_j I_{ij} - (\delta_i \delta_j - 1) \Phi\{b_{ij}(\hat\beta)\}\right]$, which only relies on (M2) and is not doubly robust. While it is not of primary interest in this article, θ^IMP under the correctly specified (M2) can be used as an optimal benchmark for efficiency as suggested by a referee. To benchmark bias and loss of efficiency due to missing data, a so-called gold standard (GS) estimator was also computed, i.e., $\hat\theta_{GS} = \frac{1}{\sum_{i \neq j} D_i (1 - D_j)} \sum_{i=1}^n \sum_{j=1}^n D_i (1 - D_j) I_{ij}$, which uses the underlying “true” biomarker values for all subjects and is not applicable in real data analysis. In Tables 1–3, modified weights as described in Section 2.2 were used for θ^IW, θ^DR and θ^DRN; to compute the standard error for θ^DRN, we used 200 bootstrap datasets randomly sampled with replacement from the data while stratified on the disease status. In each simulation, we generated a random sample of n = 200 independent subjects with an equal number of diseased and non-diseased subjects. For each simulation setting, 500 Monte Carlo datasets were generated and the results were summarized using the following measures: the mean relative bias (RB), mean of the standard error estimates (SE), Monte Carlo standard deviation of parameter estimates (SD), square root of mean squared errors (SMSE) and coverage rate (CR) of 95% Wald’s confidence interval using a logistic transform of θ as suggested in Pepe (2003) (Ch. 5).
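The summary measures listed above can be computed from the Monte Carlo output along the following lines. This is a sketch under assumptions: RB is taken as the average of (θ^ − θ)/θ in percent, SD uses the n − 1 divisor, and the logit-scale standard error is obtained by the delta method; the helper names `logit_wald_ci` and `summarize` are hypothetical.

```python
import numpy as np

def logit_wald_ci(theta_hat, se, z=1.959964):
    """95% Wald interval on the logit scale, mapped back to the (0, 1) scale."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    logit = np.log(theta_hat / (1.0 - theta_hat))
    se_logit = se / (theta_hat * (1.0 - theta_hat))   # delta method
    lo, hi = logit - z * se_logit, logit + z * se_logit
    return 1.0 / (1.0 + np.exp(-lo)), 1.0 / (1.0 + np.exp(-hi))

def summarize(estimates, ses, theta_true):
    """RB (%), SE, SD, SMSE and CR (%) over Monte Carlo replicates."""
    est = np.asarray(estimates, dtype=float)
    ses = np.asarray(ses, dtype=float)
    lo, hi = logit_wald_ci(est, ses)
    return {
        "RB_percent": 100.0 * np.mean((est - theta_true) / theta_true),
        "SE": ses.mean(),
        "SD": est.std(ddof=1),
        "SMSE": np.sqrt(np.mean((est - theta_true) ** 2)),
        "CR_percent": 100.0 * np.mean((lo <= theta_true) & (theta_true <= hi)),
    }

summary = summarize(estimates=[0.70, 0.74], ses=[0.04, 0.04], theta_true=0.72)
```

Working on the logit scale keeps the interval endpoints inside (0, 1), which matters when θ^ is close to 1.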

Table 1.

Results of simulation study under MAR: comparison of θ^0, θ^IW, θ^DR and θ^DRN using modified weights, when Z1 and Z2 are identical. ε is Gaussian (i.e., ε ~ N(0, 1)) or non-Gaussian (i.e., ε = 20{η − E(η)} with η ~ Beta(5, 1)). True θ is 0.722 for Gaussian ε and 0.675 for non-Gaussian ε. The details of true models and misspecified working models are provided in Section 3.1.

| Estimator | Gaussian ε: RB (%) | SE | SD | SMSE | CR (%) | Non-Gaussian ε: RB (%) | SE | SD | SMSE | CR (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| θ^GS | −0.2 | 0.036 | 0.037 | 0.037 | 94.0 | 0.0 | 0.038 | 0.038 | 0.038 | 95.8 |
| θ^0 | 11.6 | 0.050 | 0.054 | 0.099 | 70.0 | 10.8 | 0.057 | 0.056 | 0.092 | 80.4 |
| Both mean models correctly specified | | | | | | | | | | |
| θ^IMP | −0.1 | 0.040 | 0.042 | 0.042 | 95.0 | −0.8 | 0.054 | 0.054 | 0.054 | 94.8 |
| θ^IW | 0.5 | 0.048 | 0.052 | 0.052 | 93.0 | 1.0 | 0.056 | 0.058 | 0.059 | 95.0 |
| θ^DR | 0.0 | 0.045 | 0.043 | 0.043 | 96.4 | 0.5 | 0.057 | 0.055 | 0.055 | 96.4 |
| θ^DRN | 0.0 | 0.042 | 0.043 | 0.043 | 94.4 | 0.5 | 0.056 | 0.056 | 0.056 | 96.0 |
| Mean model for (M1) misspecified | | | | | | | | | | |
| θ^IW | 8.4 | 0.050 | 0.054 | 0.081 | 78.6 | 8.0 | 0.056 | 0.058 | 0.079 | 84.8 |
| θ^DR | 0.0 | 0.040 | 0.043 | 0.043 | 94.0 | 0.6 | 0.052 | 0.055 | 0.055 | 94.4 |
| θ^DRN | 0.0 | 0.041 | 0.043 | 0.043 | 95.4 | 0.4 | 0.056 | 0.055 | 0.055 | 95.8 |
| Mean model for (M2) misspecified | | | | | | | | | | |
| θ^DR | 0.5 | 0.054 | 0.050 | 0.050 | 96.2 | 0.9 | 0.061 | 0.058 | 0.058 | 96.2 |
| θ^DRN | 0.5 | 0.050 | 0.050 | 0.050 | 94.0 | 0.9 | 0.059 | 0.058 | 0.058 | 95.6 |
| Both mean models misspecified | | | | | | | | | | |
| θ^DR | 8.4 | 0.050 | 0.053 | 0.081 | 78.6 | 7.9 | 0.057 | 0.058 | 0.079 | 86.0 |
| θ^DRN | 8.4 | 0.051 | 0.053 | 0.081 | 79.0 | 7.9 | 0.058 | 0.058 | 0.079 | 86.2 |

Table 3.

Results of simulation study under MNAR: comparison of θ^0, θ^IW, θ^DR, θ^DRN, θ^IWS, θ^DRS, and θ^DRNS using modified weights, when ε ~ N(0, 1) and Z1 and Z2 are identical. True θ is 0.722. The details of true models and misspecified working models are provided in Section 3.2.

| Estimator | RB (%) | SE | SD | SMSE | CR (%) |
|---|---|---|---|---|---|
| θ^GS | −0.2 | 0.036 | 0.037 | 0.037 | 94.0 |
| θ^0 | −8.1 | 0.053 | 0.055 | 0.080 | 78.2 |
| Correct subset of Z1 and D included in (M1) | | | | | |
| θ^IW | −5.8 | 0.052 | 0.055 | 0.069 | 85.4 |
| θ^IWS | −1.0 | 0.049 | 0.052 | 0.053 | 93.0 |
| Correct subset of Z and D included in both models | | | | | |
| θ^DR | −0.5 | 0.039 | 0.040 | 0.041 | 95.0 |
| θ^DRN | −0.5 | 0.039 | 0.040 | 0.040 | 94.4 |
| θ^DRS | −0.2 | 0.038 | 0.039 | 0.039 | 93.8 |
| θ^DRNS | −0.2 | 0.038 | 0.039 | 0.039 | 93.6 |
| Incorrect subset of Z1 and D included in (M1) | | | | | |
| θ^DR | −0.6 | 0.039 | 0.040 | 0.041 | 94.6 |
| θ^DRN | −0.5 | 0.039 | 0.040 | 0.040 | 93.2 |
| θ^DRS | −0.2 | 0.038 | 0.039 | 0.039 | 94.0 |
| θ^DRNS | −0.2 | 0.038 | 0.039 | 0.039 | 93.4 |
| Incorrect subset of Z2 and D included in (M2) | | | | | |
| θ^DR | −5.3 | 0.049 | 0.053 | 0.065 | 85.0 |
| θ^DRN | −5.3 | 0.050 | 0.053 | 0.065 | 86.2 |
| θ^DRS | −0.8 | 0.047 | 0.049 | 0.049 | 92.8 |
| θ^DRNS | −0.8 | 0.048 | 0.049 | 0.049 | 93.8 |

3.1. MAR: δ independent of X given D and Z

Under MAR, we considered two settings, namely, δ dependent on X given D and δ independent of X given D. Corresponding to each setting, we generated the auxiliary variables, $Z_1 = (Z_1^{(1)}, Z_1^{(2)}, Z_1^{(3)})$, which are associated with δ, and $Z_2 = (Z_2^{(1)}, Z_2^{(2)}, Z_2^{(3)})$, which are associated with X. In the first setting, Z1 = Z2 and they were generated from a multivariate Gaussian distribution with a mean μZ = (3, −2, −1) and a variance matrix ΣZ = diag(0.25, 0.25, 0.25), which implies that δ is dependent on X given D and hence θ^0 is subject to potential bias. In the second setting, Z1 and Z2 were generated from two independent multivariate Gaussian distributions with the same mean and variance as in the first setting, which implies that δ is independent of X given D and hence θ^0 is unbiased. Next, we generated X as follows, X = β0 + β1D + β2Z2 + β3DZ2 + ε with β0 = 1, β1 = 2.5, β2 = (3, 3, 3), and β3 = (.5, .5, .5), which is the true underlying model for (M2). Two different residual distributions were considered so that we could compare the performance of θ^DR and θ^DRN; specifically, ε ~ N(0, σ²) or ε = 20{η − E(η)} with η ~ Beta(5, 1). The resulting true θ is 0.722 for Gaussian ε and 0.675 for non-Gaussian ε. Subsequently, we generated the missing indicator δ from a Bernoulli distribution with mean π which satisfies logit(π) = α0 + α1D + α2Z1 + α3DZ1 with α0 = 0.3, α1 = 0.3, α2 = (0.4, 0.5, 0.3), and α3 = (−0.7, −0.7, −0.9); this is the underlying true model for (M1). The resulting average probability of missing X is 66.4% in the diseased group and 55.8% in the non-diseased group.
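For concreteness, the first setting (Z1 = Z2, Gaussian ε) can be simulated as in the sketch below, which also evaluates θ in closed form: X given D is Gaussian once Z is integrated out, so θ = Φ of the mean difference over the square root of the summed variances. The sketch assumes σ² = 1 (as in the table captions) and 100 subjects per disease group; the constant and function names are hypothetical, and the random number stream will of course differ from the one used in the original study.

```python
import math
import numpy as np

MU_Z = np.array([3.0, -2.0, -1.0])
SD_Z = 0.5                                   # square root of 0.25
BETA0, BETA1 = 1.0, 2.5
BETA2 = np.array([3.0, 3.0, 3.0])
BETA3 = np.array([0.5, 0.5, 0.5])
ALPHA0, ALPHA1 = 0.3, 0.3
ALPHA2 = np.array([0.4, 0.5, 0.3])
ALPHA3 = np.array([-0.7, -0.7, -0.9])

def simulate(n_per_group=100, seed=0):
    """One dataset from the first MAR setting (Z1 = Z2 = z, Gaussian errors)."""
    rng = np.random.default_rng(seed)
    d = np.repeat([1, 0], n_per_group)
    z = MU_Z + SD_Z * rng.standard_normal((2 * n_per_group, 3))
    eps = rng.standard_normal(2 * n_per_group)                     # sigma^2 = 1 assumed
    x = BETA0 + BETA1 * d + z @ BETA2 + d * (z @ BETA3) + eps      # true model for (M2)
    eta = ALPHA0 + ALPHA1 * d + z @ ALPHA2 + d * (z @ ALPHA3)      # true model for (M1)
    delta = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))            # Bernoulli with mean pi
    return d, z, x, delta

def true_theta_gaussian(sigma2=1.0):
    """Closed-form theta = Pr(X_i > X_j | D_i = 1, D_j = 0) under Gaussian errors."""
    mean_diff = BETA1 + BETA3 @ MU_Z
    var1 = SD_Z ** 2 * np.sum((BETA2 + BETA3) ** 2) + sigma2       # Var(X | D = 1)
    var0 = SD_Z ** 2 * np.sum(BETA2 ** 2) + sigma2                 # Var(X | D = 0)
    b = mean_diff / math.sqrt(var1 + var0)
    return 0.5 * (1.0 + math.erf(b / math.sqrt(2.0)))

d, z, x, delta = simulate()
theta_true = true_theta_gaussian()
```

The closed-form value agrees with the reported true θ of 0.722 to within rounding, which is a useful sanity check on the coefficients before running a larger simulation.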

When computing θ^IW, θ^DR and θ^DRN, we fitted the two working models for δ and X, namely, (M1) and (M2), under the following four scenarios: 1) the mean structure is correctly specified for both working models, i.e., Z1 and D are included in (M1), and Z2 and D are included in (M2); 2) the mean structure is misspecified for (M1), i.e., only $Z_1^{(1)}$ and D are used in (M1); 3) the mean structure is misspecified for (M2), i.e., only $Z_2^{(1)}$ and D are used in (M2); and 4) the mean structure is misspecified for both working models, i.e., only $Z_1^{(1)}$ and D are included in (M1) and only $Z_2^{(1)}$ and D are included in (M2). We note that θ^DR assumes that X follows Gaussian distributions. Consequently, if the residuals for X follow a Gaussian distribution, e.g., ε ~ N(0, σ²), then the correct specification of the mean structure in (M2) also indicates the correct specification of the conditional distribution for X when computing θ^DR. However, if the residual distribution is not Gaussian, e.g., ε = 20{η − E(η)} with η ~ Beta(5, 1), the conditional distribution for X is misspecified when computing θ^DR, even if the mean structure is correctly specified in (M2). Since θ^DRN is robust to the mis-specification of distributions of the residuals for X, it should remain consistent in both cases.

3.1.1. The case of δ dependent on X given D.

In this setting, we let Z1 and Z2 be identical, hence δ is dependent on X given D. Table 1 presents the results for two different residual distributions for X. We first discuss the case of Gaussian ε. θ^0 shows a large RB of 11.6% with a low coverage rate of 70.0%. θ^IW exhibits negligible bias and a CR close to the nominal level when (M1) is correctly specified; however, its bias becomes substantial and its CR degrades considerably to 78.6% when (M1) is misspecified. When at least one working model is correctly specified, θ^DR and θ^DRN show negligible bias that is comparable to θ^GS and good coverage properties. In particular, as long as (M2) is correctly specified, θ^DR and θ^DRN are more efficient than θ^IW and are almost as efficient as θ^IMP; in this case, negligible loss of efficiency is observed even if (M1) is misspecified. By contrast, when (M2) is misspecified and (M1) is correctly specified, the loss of efficiency is considerable for θ^DR and θ^DRN. These observations are consistent with what has been reported in the literature, i.e., the correct specification of (M2) for X is more important in terms of improving efficiency of θ^DR. When both working models are misspecified, the bias and MSE of θ^DR and θ^DRN are still similar to or less than those of θ^IW or θ^0.

When the residuals are not Gaussian, (M2) is always misspecified for θ^DR. Our results in Table 1 show that θ^DR is fairly robust to the misspecified distribution of ε as long as the conditional mean of X in (M2) is correctly specified. In addition, most observations for Gaussian ε are still true for non-Gaussian ε. In this case, θ^IMP serves as an approximate benchmark for efficiency, since θ^IMP is also fairly robust to a mis-specified distribution for ε and it is generally difficult to obtain an exact “imputation” estimator when ε is non-Gaussian. Similar results were observed in our additional simulations with other non-Gaussian distributions for ε, e.g., a χ² distribution.

3.1.2. The case of δ independent of X given D.

In this setting, Z1 and Z2 are two separate sets of auxiliary variables, hence δ is independent of X given D. Table 2 presents the results for both Gaussian and non-Gaussian residuals. In all cases, all estimators exhibit negligible bias and satisfactory coverage properties, which is consistent with our discussion in Section 2. Again, as long as (M2) is (approximately) correctly specified, θ^DR and θ^DRN are almost as efficient as θ^IMP; they perform no worse than θ^IW and θ^0 in other settings. As with the case of δ dependent on X given D in Section 3.1.1, the results are very similar for two different types of residual distributions for X.

Table 2.

Results of simulation study under MAR: comparison of θ^0, θ^IW, θ^DR and θ^DRN using modified weights, when Z1 and Z2 are independent. ε is Gaussian (i.e., ε ~ N(0, 1)) or non-Gaussian (i.e., ε = 20{η − E(η)} with η ~ Beta(5, 1)). True θ is 0.722 for Gaussian ε and 0.675 for non-Gaussian ε. The details of true models and misspecified working models are provided in Section 3.1.

| Estimator | Gaussian ε: RB (%) | SE | SD | SMSE | CR (%) | Non-Gaussian ε: RB (%) | SE | SD | SMSE | CR (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| θ^GS | 0.2 | 0.036 | 0.035 | 0.035 | 96.0 | 0.2 | 0.038 | 0.036 | 0.036 | 96.0 |
| θ^0 | −0.1 | 0.059 | 0.057 | 0.057 | 95.8 | 0.4 | 0.063 | 0.061 | 0.061 | 96.4 |
| Both mean models correctly specified | | | | | | | | | | |
| θ^IMP | 0.1 | 0.039 | 0.040 | 0.040 | 94.2 | −0.6 | 0.051 | 0.050 | 0.050 | 95.8 |
| θ^IW | −0.1 | 0.059 | 0.060 | 0.059 | 94.8 | 0.3 | 0.062 | 0.065 | 0.065 | 94.6 |
| θ^DR | 0.2 | 0.044 | 0.040 | 0.040 | 96.4 | 0.1 | 0.056 | 0.053 | 0.053 | 95.8 |
| θ^DRN | 0.2 | 0.041 | 0.041 | 0.041 | 95.4 | 0.1 | 0.056 | 0.053 | 0.053 | 95.2 |
| Mean model for (M1) misspecified | | | | | | | | | | |
| θ^IW | −0.2 | 0.058 | 0.059 | 0.059 | 95.0 | 0.3 | 0.062 | 0.062 | 0.062 | 95.2 |
| θ^DR | 0.2 | 0.041 | 0.040 | 0.040 | 95.0 | 0.2 | 0.053 | 0.051 | 0.051 | 95.8 |
| θ^DRN | 0.2 | 0.041 | 0.040 | 0.040 | 94.8 | 0.1 | 0.054 | 0.051 | 0.051 | 96.2 |
| Mean model for (M2) misspecified | | | | | | | | | | |
| θ^DR | −0.2 | 0.059 | 0.055 | 0.055 | 96.0 | 0.4 | 0.063 | 0.062 | 0.062 | 95.4 |
| θ^DRN | −0.2 | 0.057 | 0.055 | 0.055 | 96.4 | 0.4 | 0.063 | 0.062 | 0.062 | 95.2 |
| Both mean models misspecified | | | | | | | | | | |
| θ^DR | −0.2 | 0.054 | 0.054 | 0.054 | 95.0 | 0.4 | 0.059 | 0.059 | 0.059 | 95.2 |
| θ^DRN | −0.2 | 0.055 | 0.054 | 0.054 | 95.8 | 0.4 | 0.061 | 0.059 | 0.059 | 95.8 |

We repeated the simulations in Tables 1 and 2 using the original weights (Web Appendix B), and the results are almost the same except that the performance of the bootstrap SE for θ^DRN deteriorates somewhat.

3.2. MNAR: δ dependent on X given D and Z

We now consider the case of MNAR where δ is dependent on X conditional on D and Z, i.e., the true model for δ is $\mathrm{logit}(\pi) = \alpha_0 + \alpha_Z Z_1^{(3)} + \alpha_D D + \alpha_X X$ with (α0, αZ, αD, αX) = (−1, 0.2, 0.5, 0.3). The rest of the simulation setup is identical to that in Section 3.1. The resulting average probability of missing X is 57.4% in the diseased group and 31.5% in the non-diseased group. We focused on the case where Z1 and Z2 are identical and ε is Gaussian; in this case, the true θ remains 0.722. Our primary goal is to compare θ^IW, θ^DR and θ^DRN with their corresponding sensitivity estimators as described in Section 2.4, namely, θ^IWS, θ^DRS and θ^DRNS, for which the estimating equations (5) were used to estimate αS = (α0, αZ, αD) with αX fixed at its true value. The remaining estimating procedures are the same for all estimators. As with the case of MAR in Section 3.1, we investigated the impact of the mis-specified (M1) and/or (M2); specifically, we considered a misspecified (M1) that includes $Z_1^{(1)}$ and D and a misspecified (M2) that includes only $Z_2^{(3)}$ and D. We also note that X is included as a covariate in (M1) for θ^IWS, θ^DRS and θ^DRNS, but not for θ^IW, θ^DR or θ^DRN. Thus, when D and the correct subset of Z1 (i.e., $Z_1^{(3)}$) are included in (M1), (M1) is correctly specified for θ^IWS, θ^DRS and θ^DRNS, but is misspecified for θ^IW, θ^DR and θ^DRN.

Table 3 presents the simulation results. First, θ^0 again exhibits substantial bias under MNAR. We now compare θ^IW and θ^IWS. When (M1) does not account for the effect of X, θ^IW shows considerable bias even if (M1) includes D and the correct subset of Z1. On the other hand, θ^IWS, which accounts for the effect of X, shows negligible bias. Next, we compare θ^DR and θ^DRN with θ^DRS and θ^DRNS. When correct subsets of Z and D are included in both working models, (M1) is still misspecified for θ^DR and θ^DRN. However, since (M2) is correctly specified, θ^DR and θ^DRN exhibit negligible bias and good coverage properties as a result of their double robustness, and their efficiency is comparable to that of θ^DRS and θ^DRNS. These results still hold when (M1) includes the incorrect subset of auxiliary variables and (M2) is correctly specified. When an incorrect subset of Z2 is included in (M2), both working models are misspecified for θ^DR and θ^DRN; consequently, θ^DR and θ^DRN exhibit considerable bias. In all three settings, θ^DRS and θ^DRNS show negligible bias, but their SDs increase when (M2) is misspecified, which is consistent with the earlier findings that (M2) is more important in terms of improving efficiency.

4. Data Analysis

We illustrate our methods using an observational psychiatric study, which was concerned with the impact of maternal depression during pregnancy on infant outcomes. In this study, participants were enrolled no later than week 28 of gestation and evaluated at each trimester across pregnancy. As part of the study, the presence (or absence) of a major depressive episode (disease status, D) was determined at each visit by the Mood Module of the Structured Clinical Interview for DSM-IV Axis I Disorders (SCID) (First et al., 2002), which needs to be administered by a trained research professional and is considerably more time-consuming and difficult to obtain in practice. At the same time, some subjects also completed the self-rated Edinburgh Postnatal Depression Scale (EPDS) (Cox and Holden, 1987) at each visit.

In female mental health research, several rating scales have been developed for identifying postpartum depression (Fergerson et al., 2002; Perfetti et al., 2004), and in particular the self-rated EPDS has emerged as a widely used instrument for postpartum depression screening and detection (Austin et al., 2005; Felice et al., 2006), which can be obtained fairly easily in practice. In contrast, there are no validated tools to assess depression during pregnancy. In practice, the EPDS, developed for postpartum use, has been increasingly used to identify depression during pregnancy and to screen for those at risk for developing depression during pregnancy. While not designed for this purpose, data collected from this study have recently been used to evaluate EPDS as a biomarker for the diagnosis of maternal depression throughout pregnancy. For the purpose of illustration, we focus on the data collected from the second trimester; a subset of the study population who had data in the second trimester was used and the sample size is n = 517 in the analysis. The outcome of interest is the presence of a major depressive episode (D) and is confirmed for all subjects, whereas EPDS is the biomarker of interest and is missing in 79% of the subjects. Additional auxiliary variables were also measured in this study, including the mother’s age, race, marital status and education level, and whether or not it was her first pregnancy. In addition, a research interviewer masked to treatment status administered the Structured Interview Guide for the Hamilton Rating Scale for Depression to obtain the 17-item score (HRSD17), which is known to be highly correlated with EPDS. These variables are treated as auxiliary variables (Z) and are used to build (M1) and (M2).

We conducted a sensitivity analysis for θ^0, θ^IW, θ^DR and θ^DRN as described in Section 2.4. Specifically, we considered an (M1) similar to that discussed in Section 2.4, i.e., logit(π) = αS^T W(Z, D) + αX X, where W(Z, D) includes the intercept and interaction terms between auxiliary variables Z and D. In fitting this (M1), estimating equations (5) were used with αX fixed at −1, 0 and 1, where αX = 0 corresponds to the case of MAR, and αX = −1 or 1 corresponds to the case of MNAR. In our analysis, all continuous variables including X were standardized to have mean 0 and unit standard deviation. Consequently, αX captures the effect of a one-SD change in X. Table 4 presents the results using modified weights. The impact of different αX values is moderate on θ^IW, θ^DR, and θ^DRN, and the estimates using different methods including θ^0 are comparable. This suggests that the missingness of X (δ) is likely close to being independent of X given D. Nevertheless, θ^DR, which incorporates information from auxiliary variables, is more efficient than the other estimators. Since the proportion of missing data is very high in the data, the bootstrap SE of θ^DRN is greater than the SE of θ^DR, but it is still smaller than the SE of θ^IW. We repeated this analysis using the original weights in Web Appendix C; while the main results remain similar, a larger bootstrap SE for θ^DRN is observed as a result of large and unstable weights.

Table 4.

Sensitivity analysis using the modified weights for estimating the ROC AUC (θ) in the psychiatric study

Estimator   Estimate   SE
θ^0         0.861      0.038

            αX = −1            αX = 0             αX = 1
Estimator   Estimate   SE      Estimate   SE      Estimate   SE
θ^IW        0.864      0.037   0.851      0.040   0.849      0.042
θ^DR        0.873      0.028   0.852      0.030   0.841      0.032
θ^DRN       0.873      0.035   0.852      0.038   0.841      0.038
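The mechanics of such a sensitivity analysis can be sketched with a generic inverse-probability-weighted Mann-Whitney estimate of the AUC. This is a simplified illustration, not the estimator θ^IW as defined in Section 2 (which uses modified weights). It assumes π is the probability that X is missing, so an observed subject receives weight 1/(1 − π). The function names, the toy data and the value of `alpha_s` are hypothetical. In the actual procedure αS is re-estimated through estimating equations (5) for each fixed αX; it is held constant here only to keep the sketch short.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def weighted_auc(cases, controls):
    """Weighted Mann-Whitney AUC; each argument is a list of (x, weight) pairs."""
    num = 0.0
    den = 0.0
    for xi, wi in cases:
        for xj, wj in controls:
            w = wi * wj
            if xi > xj:
                num += w
            elif xi == xj:
                num += 0.5 * w   # ties count one half
            den += w
    return num / den


def ipw_auc(observed, alpha_s, alpha_x):
    """observed: list of (d, w_vec, x) for subjects whose X is observed.

    Assumption: logit(pi) = alpha_s . w_vec + alpha_x * x with pi = P(X missing),
    so the inverse-probability weight of an observed subject is 1 / (1 - pi).
    """
    cases, controls = [], []
    for d, w_vec, x in observed:
        lin = sum(a * w for a, w in zip(alpha_s, w_vec)) + alpha_x * x
        weight = 1.0 / (1.0 - expit(lin))
        (cases if d == 1 else controls).append((x, weight))
    return weighted_auc(cases, controls)


# Hypothetical standardized data: (D, W(Z, D) = (1, D), X).
observed = [
    (1, (1.0, 1.0), 1.4), (1, (1.0, 1.0), 0.3), (1, (1.0, 1.0), 0.9),
    (0, (1.0, 0.0), -0.8), (0, (1.0, 0.0), 0.5), (0, (1.0, 0.0), -0.1),
]
alpha_s = (0.5, 0.2)   # hypothetical; would be re-estimated for each alpha_x
for alpha_x in (-1.0, 0.0, 1.0):
    print("alpha_x =", alpha_x, "AUC =", round(ipw_auc(observed, alpha_s, alpha_x), 3))
```

When αX = 0 and the weights are constant within each disease group, the weights cancel and the estimate reduces to the unweighted Mann-Whitney statistic, which matches the interpretation of αX = 0 as the MAR case.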

With θ^DR ranging from 0.841 to 0.873, our results suggest that EPDS has very good discriminative power during the second trimester. However, in this study, only a subset of the study population had depression status confirmed during each perinatal window. As a result, in addition to missing values in the biomarker, verification bias is potentially in play as well. Furthermore, both the rating scale and the presence of a major depressive episode were repeatedly measured throughout the pregnancy. Therefore, it is of substantial interest for future studies to investigate methods that can account for both missing biomarker values and verification bias, and that can accommodate repeatedly measured biomarker values and disease status, when estimating the ROC AUC.

5. Discussion

We have proposed and contrasted several estimators of the ROC AUC when the biomarker value is missing for some subjects. Our numerical studies show that the doubly robust estimators perform as well as or better than other estimators in all cases, even when both working models are misspecified. θ^DR is also fairly robust to a misspecified residual distribution for the biomarker variable (X). Since only ranks of X are used in estimating θ, a correctly specified conditional mean is more important, and the impact of a misspecified residual distribution may be limited given a correctly specified conditional mean. The bootstrap procedure for obtaining the SE of θ^DRN is computationally more expensive and also makes it more susceptible to large and unstable weights. Thus, in practice, we recommend the use of θ^DR and stabilized weights such as ours, and emphasize the importance of identifying an (approximately) correct (M2). We also note that θ^DR can readily accommodate categorical biomarker values, e.g., a baseline logit model (Agresti, 2002) can be used to model the conditional distribution of a categorical biomarker variable.
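The remark that only ranks of X enter the estimate can be checked numerically with the ordinary complete-data Mann-Whitney estimate of the AUC. This is a minimal sketch: the biomarker values are hypothetical, and the function is the standard empirical AUC, not one of the missing-data estimators proposed in this article.

```python
import math


def auc(cases, controls):
    """Mann-Whitney estimate of P(X_case > X_control), counting ties as 1/2."""
    total = 0.0
    for xi in cases:
        for xj in controls:
            if xi > xj:
                total += 1.0
            elif xi == xj:
                total += 0.5
    return total / (len(cases) * len(controls))


cases = [1.2, 0.4, 2.5, 1.9]        # hypothetical biomarker values, diseased
controls = [0.1, 1.5, -0.3, 0.4]    # hypothetical biomarker values, non-diseased

original = auc(cases, controls)
# A strictly increasing transformation changes the shape of the distribution
# but preserves the ordering of every case-control pair.
transformed = auc([math.exp(x) for x in cases], [math.exp(x) for x in controls])
print(original, transformed)
```

Both calls return the same value because every pairwise comparison is unchanged by the transformation. This illustrates why the shape of the residual distribution may matter less than the location of the imputed or modeled values.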

More recently, Cao et al. (2009) investigated alternative doubly robust estimators for estimating a population mean; their methods achieve minimum variance under incorrectly specified (M2) and correctly specified (M1), and they do not suffer from large and unstable weights. While their enhanced model for (M1) can be readily adopted in our methods as an alternative to alleviate the problem of large and unstable weights, it is more involved to extend their approach of minimizing variance under misspecified (M2) and correctly specified (M1) to the estimation of the ROC AUC, as complications arise from the use of U-statistics in our methods. Potential future research may also include extending the sensitivity analysis to (M2) and investigating more complicated missing-data patterns, e.g., when auxiliary variables are also missing and missingness is not monotone, for which an imputation approach may be more practical.

Supplementary Material

Acknowledgements

We thank Editor Verbeke, an associate editor and two referees for their insightful suggestions which greatly improved an earlier draft of this manuscript.

Footnotes

Supplementary Materials

Web Appendices referenced in Sections 2 and 3 are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

References

  1. Agresti A (2002). Categorical Data Analysis, 2nd Edition. John Wiley & Sons.
  2. Austin M, Hadzi-Pavlovic D, Saint K, and Parker G (2005). Antenatal screening for the prediction of postnatal depression: validation of a psychosocial pregnancy risk questionnaire. Acta Psychiatr Scand. 112, 310–317.
  3. Bamber D (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology. 12, 387–415.
  4. Cao W, Tsiatis A, and Davidian M (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96, 723–734.
  5. Cox J and Holden J (1987). Detection of postnatal depression. Development of the 10-item Edinburgh Postnatal Depression Scale. Br J Psychiatry. 150, 782–786.
  6. Felice E, Saliba J, Grech V, and Cox J (2006). Validation of the Maltese version of the Edinburgh Postnatal Depression Scale. Arch Womens Ment Health. 9, 75–80.
  7. Fergerson S, Jamieson D, and Lindsay M (2002). Diagnosing postpartum depression: can we do better? Am J Obstet Gynecol. 186, 899–902.
  8. First M, Spitzer R, Gibbon M, and Williams J (2002). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Patient Edition (SCID-IP, 11/2002 Revision). Washington, DC: American Psychiatric Press.
  9. Fluss R, Reiser B, Faraggi D, and Rotnitzky A (2009). Estimation of the ROC curve under verification bias. Biometrical Journal. 51(3), 475–490.
  10. Green D and Swets J (1966). Signal detection theory and psychophysics. Wiley, New York.
  11. Kosinski A and Barnhart H (2003). Accounting for non-ignorable verification bias in assessment of diagnostic test. Biometrics 59, 163–171.
  12. Little R and Rubin D (2002). Statistical Analysis with Missing Data. 2nd Edition. Wiley & Sons.
  13. Pepe M (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
  14. Perfetti J, Clark R, and Fillmore C (2004). Postpartum depression: identification, screening, and treatment. Wis Med J. 103, 56–63.
  15. Robins J, Rotnitzky A, and Zhao L (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 89, 846–866.
  16. Rosenbaum P and Rubin D (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
  17. Rotnitzky A, Faraggi D, and Schisterman E (2006). Doubly robust estimation of the area under the receiver-operating characteristic curve in the presence of verification bias. Journal of the American Statistical Association. 101, 1276–1288.
  18. Rotnitzky A and Robins J (1997). Analysis of semiparametric regression models with non-ignorable nonresponse. Statistics in Medicine. 16, 81–102.
  19. Scharfstein D, Rotnitzky A, and Robins J (1999). Adjusting for nonignorable dropout using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association 94, 1096–1120.
  20. Zhou X (1993). Maximum likelihood estimators of sensitivity and specificity corrected for verification bias. Communications in Statistics - Theory and Methods. 22, 3177–3198.
  21. Zhou X (1994). Effect of verification bias on positive and negative predictive values. Statistics in Medicine. 13, 1737–1745.
  22. Zhou X (1998). Correcting for verification bias in studies of a diagnostic test’s accuracy. Statistical Methods in Medical Research. 7, 337–353.
  23. Zweig M and Campbell G (1993). Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry. 39, 561–577.
