Summary:
In medical research, the receiver operating characteristic (ROC) curves can be used to evaluate the performance of biomarkers for diagnosing diseases or predicting the risk of developing a disease in the future. The area under the ROC curve (AUC), as a summary measure of ROC curves, is widely utilized, especially when comparing multiple ROC curves. In observational studies, the estimation of the AUC is often complicated by the presence of missing biomarker values, which means that the existing estimators of the AUC are potentially biased. In this article, we develop robust statistical methods for estimating the ROC AUC and the proposed methods use information from auxiliary variables that are potentially predictive of the missingness of the biomarkers or the missing biomarker values. We are particularly interested in auxiliary variables that are predictive of the missing biomarker values. In the case of missing at random (MAR), i.e., missingness of biomarker values only depends on the observed data, our estimators have the attractive feature of being consistent if one correctly specifies, conditional on auxiliary variables and disease status, either the model for the probabilities of being missing or the model for the biomarker values. In the case of missing not at random (MNAR), i.e., missingness may depend on the unobserved biomarker values, we propose a sensitivity analysis to assess the impact of MNAR on the estimation of the ROC AUC. The asymptotic properties of the proposed estimators are studied and their finite sample behaviors are evaluated in simulation studies. The methods are further illustrated using data from a study of maternal depression during pregnancy.
Keywords: Area under the curve, Biomarker, Doubly robust estimators, Missing at random, Missing not at random, Receiver operating characteristic curve, Sensitivity analysis
1. Introduction
The receiver operating characteristic (ROC) curve plots the fraction of true positives (sensitivity) against the fraction of false positives (1 − specificity) as the discrimination threshold (e.g., of a biomarker for a disease) is varied, and it is often used to evaluate the performance of biomarkers for diagnosing diseases or predicting the risk of developing diseases in the future. It was originally developed for the analysis of signal detection (Green and Swets, 1966) and was first used in medicine for the assessment of imaging devices (Zweig and Campbell, 1993). In medical studies, summary measures of ROC curves are often used, and they are particularly useful when comparing several ROC curves. The most widely used summary measure is the area under the ROC curve (ROC AUC) (Bamber, 1975). For a biomarker that performs no worse than chance, the ROC AUC lies between 0.5 and 1. It can be interpreted as the probability that a randomly selected subject from the diseased population has a higher biomarker value than a randomly selected subject from the non-diseased population (or the reverse, when lower values indicate disease). Therefore, a large AUC value represents good separation in the biomarker values between the diseased and non-diseased populations. In particular, a perfect test would achieve an AUC of 1.0, whereas an uninformative test would have an AUC of 0.5. A wealth of literature has been developed for this type of research (see Pepe, 2003, and references therein).
In practice, the biomarker value may be missing for some subjects, especially in observational studies. Take for example a self-rated mental illness score collected from pregnant women in a psychiatric study, where the disease of interest is the presence (or absence) of a major depressive episode throughout pregnancy (see Section 4 for more details). Since the biomarker score is self-rated, it is possible that some subjects did not complete the self-evaluation and hence the score is missing. In such studies, additional variables including demographic and baseline variables are often available, which are referred to as auxiliary variables. While these variables are not of primary interest themselves, they are potentially predictive of the missingness of the biomarker value or the value itself, and can be incorporated in a data analysis to improve its robustness and/or efficiency. If an auxiliary variable is predictive of missingness but independent of the missing values, then using it in an analysis will not affect the results. Thus, we are interested in auxiliary variables that are predictive of the missing values, especially if they are also predictive of the missingness.
As in the general setting discussed in Little and Rubin (2002) and references therein, a naive analysis that uses only complete observations may lead to bias and loss of efficiency in the estimation of the ROC AUC. First, when the biomarker is missing completely at random (MCAR), i.e., the missingness does not depend on either observed or unobserved data, the naive analysis is valid but not efficient. Second, when the biomarker is missing at random (MAR), i.e., the missingness is conditionally independent of the missing data given the observed data, the naive analysis is generally biased, and other methods, e.g., inverse-weighted (IW) methods, can be extended for consistent estimation. IW methods weight each complete case by the inverse of the probability of observing the biomarker value. Despite their conceptual simplicity, IW methods have limitations: most notably, they are not efficient and are subject to bias if the model for the missingness is misspecified. Alternatively, one can extend methods that are doubly robust and more efficient (Robins et al., 1994; Scharfstein et al., 1999) to the estimation of the ROC AUC. Third, in the case of missing not at random (MNAR), i.e., missingness depends on unobserved biomarker values even after conditioning on the observed data, it is common practice to conduct a sensitivity analysis (Zhou, 1994; Rotnitzky and Robins, 1997; Scharfstein et al., 1999; Kosinski and Barnhart, 2003). In all cases, auxiliary variables can potentially reduce bias and improve efficiency when they are associated with both the probability of missingness and the biomarker values, or improve efficiency alone when they are associated only with the biomarker values.
We confine the scope of this paper to the case where the disease status is always confirmed and a set of auxiliary variables are fully observed but the biomarker values are missing for some subjects, and we are interested in estimating the ROC AUC. Our setting is to be distinguished from the existing research on verification bias (Zhou, 1993, 1998; Rotnitzky et al., 2006; Fluss et al., 2009). In the presence of verification bias, the biomarker values are always observed whereas the true disease status is only verified for a non-random sample of the population of interest, e.g., the selection for testing may depend on the disease status or other variables. In particular, Rotnitzky et al. (2006) extended the doubly robust method developed in Rotnitzky and Robins (1997) to the estimation of the ROC AUC in the presence of verification biases. As a result of different problem setups (i.e. biomarker values missing vs. disease status unconfirmed for a subset of subjects), there are important differences between our work and theirs. In our setting, a working model on biomarker values, which can be continuous or categorical, is utilized, whereas a working model on the presence (or absence) of the disease, a binary variable, was utilized in Rotnitzky et al. (2006); consequently, our methods require modeling of the conditional distribution of biomarker values. Furthermore, we study and compare parametric and nonparametric approaches for estimating this conditional distribution and discuss two types of MAR assumptions, which have different implications on the estimation of AUC.
The remainder of the article is organized as follows. In Section 2, we describe the proposed estimators and their theoretical properties under MAR and propose a sensitivity analysis under MNAR. In Section 3, we evaluate the finite sample performance of the proposed estimators through simulations. In Section 4, we apply the proposed methods to a psychiatric study of maternal depression during pregnancy. We conclude with a discussion in Section 5.
2. Methodology
Suppose that a random sample of n subjects is selected from a population of interest to evaluate the performance of a diagnostic or predictive test using a biomarker. Each subject i, i = 1, … , n, is classified into one of two groups, diseased (Di = 1) or non-diseased (Di = 0), based on a gold standard. For each subject i, denote the biomarker value by Xi, which is used to diagnose or predict the disease status (Di). Xi is not observed for a subset of the subjects; let δi denote the missingness indicator for Xi, i.e., δi = 1 when Xi is observed and δi = 0 when Xi is missing. In addition, p auxiliary variables that may be associated with the value of Xi and/or its missingness (δi) are also collected and denoted by Zi = (Zi1, …, Zip). Then for subject i, the complete data are (Di, Zi, δi, Xi). When δi = 1, the observed data are Oi = (Di, Zi, δi, Xi) and subject i is called a complete case; when δi = 0, the observed data are Oi = (Di, Zi, δi) and subject i is called an incomplete case. We denote by O the collection of observed data for all subjects. When δi is independent of Xi conditional on Di and Zi, it is a case of MAR; when δi is dependent on Xi conditional on Di and Zi, it is a case of MNAR.
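We are interested in estimating the ROC AUC, θ = Pr(Xi > Xj | Di = 1, Dj = 0), whose empirical estimator is a U-statistic (Bamber, 1975); here we assume that the diseased tend to have higher biomarker values. When all data are completely observed, an unbiased estimator of θ is

$$\hat{\theta} = \frac{1}{n_1 n_0} \sum_{i=1}^{n} \sum_{j=1}^{n} D_i (1 - D_j) I_{ij},$$

where $n_1 = \sum_i D_i$ and $n_0 = n - n_1$ are the numbers of diseased and non-diseased subjects, and Iij = I(Xi > Xj) + 0.5I(Xi = Xj), with I(A) equal to 1 if A is true and 0 if A is false. When X is missing for some subjects, a naive extension of the above estimator using only the complete observations (i.e., δi = 1) is

$$\hat{\theta}_{N} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \delta_j D_i (1 - D_j) I_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} \delta_i \delta_j D_i (1 - D_j)}. \tag{1}$$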
It is straightforward to verify the following proposition:
Proposition 1: (i) When δ is independent of X given D, the naive estimator $\hat{\theta}_{N}$ is an unbiased estimator of θ; (ii) when δ is dependent on X given D, $\hat{\theta}_{N}$ is subject to potential bias.
We note that (i) includes the case of MCAR and a special case of MAR where δ may depend on D and Z but is independent of X given D, and (ii) includes the case of MNAR and a special case of MAR where δ is dependent on X conditional on D but is independent of X conditional on D and Z. We refer to $\hat{\theta}_{N}$ as the naive estimator throughout this article.
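As an illustration of the complete-data and naive estimators described above, the following minimal Python sketch averages the kernel Iij over all (diseased, non-diseased) pairs. The function names `pair_kernel` and `auc_pairs` and the toy data are hypothetical and are not part of the original analysis.

```python
def pair_kernel(xi, xj):
    """I(xi > xj) + 0.5 * I(xi == xj): a tie counts as half a concordant pair."""
    if xi > xj:
        return 1.0
    return 0.5 if xi == xj else 0.0


def auc_pairs(x, d, delta=None):
    """Average the kernel over (diseased, non-diseased) pairs.

    With delta=None every subject is used (complete-data estimator);
    otherwise only pairs with both biomarkers observed are used (naive estimator).
    """
    if delta is None:
        delta = [1] * len(x)
    cases = [x[i] for i in range(len(x)) if d[i] == 1 and delta[i] == 1]
    ctrls = [x[j] for j in range(len(x)) if d[j] == 0 and delta[j] == 1]
    total = sum(pair_kernel(xi, xj) for xi in cases for xj in ctrls)
    return total / (len(cases) * len(ctrls))


# Toy data (hypothetical): the last diseased subject has a missing biomarker.
x = [3.0, 1.0, 2.0, 2.0, None]
d = [1, 0, 1, 0, 1]
delta = [1, 1, 1, 1, 0]
print(auc_pairs(x, d, delta))  # naive estimate from the four complete cases
```

Because the naive estimator simply drops incomplete cases, it coincides with the complete-data estimator applied to the subsample with δ = 1.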
2.1. Inverse-Weighted Estimator
In the case of MAR, we first study an inverse-weighted estimator,

$$\hat{\theta}_{IW} = \frac{1}{n_1 n_0} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\delta_i \delta_j D_i (1 - D_j) I_{ij}}{\hat{\pi}_i \hat{\pi}_j}, \tag{2}$$

where $n_1$ and $n_0$ are the numbers of diseased and non-diseased subjects, and $\hat{\pi}_i$ is an estimate of the probability of observing Xi, namely, πi = Pr(δi = 1 | Zi, Di), under MAR. We denote by $\mathcal{M}_{\delta}$ the working model for πi given Zi and Di with a set of unknown parameters, α, and by $\hat{\alpha}$ the estimate of α obtained by solving a set of estimating equations based on the observed data. For instance, one can use a logistic regression model for $\mathcal{M}_{\delta}$, i.e., logit(πi) = W(Zi, Di; α), where W(Zi, Di; α) is a function of Zi and Di parameterized by α; the estimating equations can then be taken as the likelihood equations for the logistic regression model. $\mathcal{M}_{\delta}$ is also known as the propensity score model (Rosenbaum and Rubin, 1983). It can be readily shown that if the working model $\mathcal{M}_{\delta}$ is correctly specified, $\hat{\theta}_{IW}$ is a consistent estimator of θ under MAR.
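A minimal sketch of an inverse-weighted AUC estimate is given below. It assumes that the weighted kernel sum is normalized by the product of the group sizes, and it uses the simplest possible working model for the observation probability, namely the observed fraction within each disease group (a saturated logistic model in D alone). Both choices, along with the function names and toy data, are illustrative assumptions rather than the authors' implementation.

```python
def pair_kernel(xi, xj):
    """I(xi > xj) + 0.5 * I(xi == xj)."""
    if xi > xj:
        return 1.0
    return 0.5 if xi == xj else 0.0


def fit_pi_by_group(d, delta):
    """Hypothetical working model: Pr(delta = 1 | D) estimated by group fractions."""
    frac = {}
    for g in (0, 1):
        obs = [dl for dg, dl in zip(d, delta) if dg == g]
        frac[g] = sum(obs) / len(obs)
    return [frac[g] for g in d]


def auc_iw(x, d, delta, pi_hat):
    """Weight each fully observed pair by 1 / (pi_i * pi_j); divide by n1 * n0."""
    n = len(d)
    n1 = sum(d)
    n0 = n - n1
    total = 0.0
    for i in range(n):
        if d[i] != 1 or delta[i] != 1:
            continue
        for j in range(n):
            if d[j] != 0 or delta[j] != 1:
                continue
            total += pair_kernel(x[i], x[j]) / (pi_hat[i] * pi_hat[j])
    return total / (n1 * n0)


# Toy data (hypothetical): one of three diseased subjects is missing X.
x = [3.0, 1.0, 2.0, 2.0, None]
d = [1, 0, 1, 0, 1]
delta = [1, 1, 1, 1, 0]
pi_hat = fit_pi_by_group(d, delta)
print(auc_iw(x, d, delta, pi_hat))
```

In practice the working model would include the auxiliary variables Z, for example through a logistic regression; the group-fraction model is used here only to keep the sketch dependency-free.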
2.2. Doubly Robust Estimators
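In the case of MAR, we propose a second estimator

$$\hat{\theta}_{DR} = \frac{1}{n_1 n_0} \sum_{i=1}^{n} \sum_{j=1}^{n} D_i (1 - D_j) \left[ \frac{\delta_i \delta_j}{\hat{\pi}_i \hat{\pi}_j} I_{ij} - \frac{\delta_i \delta_j - \hat{\pi}_i \hat{\pi}_j}{\hat{\pi}_i \hat{\pi}_j} \hat{E}(I_{ij} \mid Z_i, Z_j, D_i = 1, D_j = 0) \right], \tag{3}$$

where $\hat{\pi}_i$ is the same as previously defined and E(Iij | Zi, Zj, Di = 1, Dj = 0) can be estimated based on the joint conditional distribution of Xi and Xj given the observed data. Specifically, we denote by $\mathcal{M}_{X}$ the working model characterizing the conditional distribution of X given Z and D with a set of unknown parameters, β, and by $\hat{\beta}$ the estimate of β obtained by solving a set of estimating equations based on the observed data. We note that the conditional mean of X given Z and D is only part of $\mathcal{M}_{X}$. It can be readily shown that if either the working model for δ or $\mathcal{M}_{X}$ is correctly specified, $\hat{\theta}_{DR}$ is a consistent estimator of θ under MAR.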
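We consider two options for the working model $\mathcal{M}_{X}$. In the first option, X given Z and D is assumed to follow a known parametric distribution with unknown parameters β. One special case is the Gaussian distribution, i.e., $X_i \mid Z_i, D_i \sim N\{V(Z_i, D_i; \beta_1), \sigma^2\}$, where V(Zi, Di; β1) is a function of Zi and Di parameterized by β1. Let $\beta = (\beta_1, \sigma^2)$ denote all parameters of interest, and it follows that

$$X_i - X_j \mid Z_i, Z_j, D_i = 1, D_j = 0 \sim N\{V(Z_i, 1; \beta_1) - V(Z_j, 0; \beta_1),\ 2\sigma^2\}$$

and hence

$$E(I_{ij} \mid Z_i, Z_j, D_i = 1, D_j = 0) = \Phi\left\{ \frac{V(Z_i, 1; \beta_1) - V(Z_j, 0; \beta_1)}{\sqrt{2}\,\sigma} \right\},$$

where Φ(·) is the cumulative distribution function (c.d.f.) of a standard normal random variable. The doubly robust estimator can then be rewritten as

$$\hat{\theta}_{DR}^{G} = \frac{1}{n_1 n_0} \sum_{i=1}^{n} \sum_{j=1}^{n} D_i (1 - D_j) \left[ \frac{\delta_i \delta_j}{\hat{\pi}_i \hat{\pi}_j} I_{ij} - \frac{\delta_i \delta_j - \hat{\pi}_i \hat{\pi}_j}{\hat{\pi}_i \hat{\pi}_j} \Phi\left\{ \frac{V(Z_i, 1; \hat{\beta}_1) - V(Z_j, 0; \hat{\beta}_1)}{\sqrt{2}\,\hat{\sigma}} \right\} \right], \tag{4}$$

where $\hat{\beta}_1$ and $\hat{\sigma}$ can be obtained through, say, linear regression of X on Z and D using the observed data. From here on, let $\hat{\theta}_{DR}^{G}$ denote the doubly robust estimator in Equation (4), which assumes that the conditional distribution of X is Gaussian.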
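The Gaussian version of the doubly robust estimator can be sketched as follows. The sketch assumes the usual augmented inverse-weighted form (an inverse-weighted term plus a model-based correction) and a common residual standard deviation in both disease groups; the inputs `pi_hat`, `mu_hat` (fitted conditional means of X) and `sigma_hat` are taken as given, and all names are hypothetical.

```python
from math import erf, sqrt


def norm_cdf(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def pair_kernel(xi, xj):
    """I(xi > xj) + 0.5 * I(xi == xj)."""
    if xi > xj:
        return 1.0
    return 0.5 if xi == xj else 0.0


def auc_dr_gaussian(x, d, delta, pi_hat, mu_hat, sigma_hat):
    """Augmented inverse-weighted AUC with a Gaussian working model for X.

    mu_hat[i] is the fitted conditional mean of X for subject i;
    sigma_hat is the (assumed common) residual standard deviation.
    """
    n = len(d)
    n1 = sum(d)
    n0 = n - n1
    total = 0.0
    for i in range(n):
        if d[i] != 1:
            continue
        for j in range(n):
            if d[j] != 0:
                continue
            # Model-based expectation of the kernel for this pair.
            e_ij = norm_cdf((mu_hat[i] - mu_hat[j]) / (sqrt(2.0) * sigma_hat))
            total += e_ij
            # Inverse-weighted residual, available only for fully observed pairs.
            if delta[i] == 1 and delta[j] == 1:
                total += (pair_kernel(x[i], x[j]) - e_ij) / (pi_hat[i] * pi_hat[j])
    return total / (n1 * n0)
```

Written this way, the estimated observation probabilities are needed only for pairs in which both biomarker values are observed. When every subject is observed and all weights equal one, the model-based terms cancel and the estimate reduces to the complete-data AUC.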
In the second option, suppose Xi = V(Zi, Di; β1) + ε1iDi + ε0i(1 − Di), where {ε1i, i = 1, …, n1} and {ε0i, i = 1, …, n0} are independent and identically distributed (i.i.d.) random errors in the diseased and non-diseased, respectively, and their respective distributions are unknown. In this case, the conditional distribution of Xi given Zi and Di can be estimated nonparametrically. We denote the sets of observed residuals by $\{\hat{\varepsilon}_{1k}, k = 1, \ldots, m_1\}$ and $\{\hat{\varepsilon}_{0l}, l = 1, \ldots, m_0\}$ for the diseased and non-diseased, respectively, where $m_1$ and $m_0$ are the numbers of subjects with observed X in the diseased and non-diseased, respectively. An empirical sample of the estimated conditional distribution of Xi given Zi and Di can be constructed as $\{V(Z_i, 1; \hat{\beta}_1) + \hat{\varepsilon}_{1k}, k = 1, \ldots, m_1\}$ in the diseased and $\{V(Z_j, 0; \hat{\beta}_1) + \hat{\varepsilon}_{0l}, l = 1, \ldots, m_0\}$ in the non-diseased. E(Iij | Zi, Zj, Di = 1, Dj = 0) in Equation (3) can then be estimated by averaging the kernel over all $m_1 m_0$ pairs of values drawn from these two empirical samples, where i and j go through all subjects including those with missing X, and we denote the resulting nonparametric estimator of θ by $\hat{\theta}_{DR}^{NP}$. When the random errors are not i.i.d., e.g., the variance changes as the mean of X changes, the above procedure needs to be modified accordingly, e.g., performed within strata of the mean of X.
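The residual-based estimate of the conditional expectation of the kernel can be sketched as below: for a given (diseased, non-diseased) pair, the fitted means are shifted by every observed residual from the corresponding group and the kernel is averaged over all combinations. The function name `expected_kernel_np` and the example values are hypothetical.

```python
def pair_kernel(xi, xj):
    """I(xi > xj) + 0.5 * I(xi == xj)."""
    if xi > xj:
        return 1.0
    return 0.5 if xi == xj else 0.0


def expected_kernel_np(mu_i, mu_j, resid1, resid0):
    """Nonparametric estimate of E(I_ij | Z_i, Z_j, D_i = 1, D_j = 0).

    mu_i, mu_j: fitted conditional means for the diseased and non-diseased subject.
    resid1, resid0: observed residuals in the diseased and non-diseased groups.
    """
    total = 0.0
    for e1 in resid1:
        for e0 in resid0:
            total += pair_kernel(mu_i + e1, mu_j + e0)
    return total / (len(resid1) * len(resid0))


# Hypothetical fitted means and residuals for one pair of subjects.
print(expected_kernel_np(1.0, 0.0, [-1.0, 1.0], [0.0, 0.5]))
```

This value would replace the Gaussian c.d.f. term for each pair when forming the nonparametric doubly robust estimate; no distributional assumption on the residuals is needed beyond their being i.i.d. within each disease group.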
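When computing $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$, the weights $1/(\hat{\pi}_i \hat{\pi}_j)$ may be large and unstable and introduce extra noise in estimation, in particular when computing the bootstrap SE of $\hat{\theta}_{DR}^{NP}$. Thus, we consider a simple modification to stabilize the weights, namely, replacing $1/(\hat{\pi}_i \hat{\pi}_j)$ with $1/(\hat{c}\,\hat{\pi}_i \hat{\pi}_j)$, where $\hat{c} = (n_1 n_0)^{-1} \sum_i \sum_j \delta_i \delta_j D_i (1 - D_j)/(\hat{\pi}_i \hat{\pi}_j)$. When the working model for δ is correctly specified, it can be readily shown that $\hat{c}$ converges to 1 in probability; hence each estimator with modified weights is asymptotically equivalent to its counterpart with the original weights.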
2.3. Theoretical Properties
Following our previous notation, we further define the following,
where πi depends on α and E(Iij | Zi, Zj, Di, Dj) depends on β. It follows that and are the set of estimating equations for and , respectively. Let α0 and β0 be the probability limits of and , respectively, which usually exist.
Theorem 1: Under the regularity conditions (A1)–(A3) given in Web Appendix A, if either or both of and are correctly specified, then in distribution, where , and
Ω can be consistently estimated by with and
Theorem 2: Under the regularity conditions similar to (A1)–(A3) given in Web Appendix A, if is correctly specified, then in distribution, where , and
Ω can be consistently estimated by with and
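A sketch of the proof of Theorems 1 and 2 is provided in Web Appendix A and follows lines similar to those of Rotnitzky et al. (2006). The underlying idea is to derive the influence functions for the doubly robust estimator $\hat{\theta}_{DR}^{G}$ or the inverse-weighted estimator $\hat{\theta}_{IW}$ by plugging in the influence functions for $\hat{\alpha}$ and $\hat{\beta}$. The consistency of the nonparametric doubly robust estimator $\hat{\theta}_{DR}^{NP}$ is straightforward to show when either the working model for δ or the working model for X holds, and its SE can be computed using a bootstrap procedure that resamples the data with replacement within disease strata.
A few remarks are in order. First, as stated in Proposition 1, the naive estimator $\hat{\theta}_{N}$ is unbiased when δ is independent of X given D; but if δ and X are associated with Z, $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ are potentially more efficient when the working models are correctly specified. Second, when δ is dependent on X given D but independent of X given D and Z, $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ are still consistent under suitable conditions, while $\hat{\theta}_{N}$ is subject to potential bias. Third, $\hat{\theta}_{DR}^{G}$ assumes that the residuals are Gaussian in the working model for X and is subject to model misspecification even if the mean model is correctly specified; $\hat{\theta}_{DR}^{NP}$ does not impose this restriction.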
2.4. MNAR: Sensitivity Analysis
We now consider a case of MNAR, where δ is dependent on X conditional on Z and D; thus, a working model for δ that only includes Z and D is misspecified. We investigate a sensitivity analysis to assess the impact on $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ as the effect of X on δ is varied. To fix the idea, suppose that logit(πi) = S(Zi, Di; αS) + U(Xi, αX), where αS and αX are two sets of unknown parameters associated with known functions S and U, respectively. αX represents the effect of biomarker values on the probability of being missing. Since αS and αX cannot be jointly estimated using the observed data, we fix αX at a set of pre-determined values and estimate αS using the following set of estimating equations,

$$\sum_{i=1}^{n} W(Z_i, D_i) \left\{ \frac{\delta_i}{\pi_i(\alpha_S, \alpha_X)} - 1 \right\} = 0, \tag{5}$$

where W(Zi, Di) is an arbitrary known vector function with the same dimension as αS. For instance, if S(Zi, Di; αS) is linear in αS with covariate vector W(Zi, Di), then W(Zi, Di) is the covariate vector for subject i, which may include interaction terms. Compared to the likelihood equations for the logistic regression, one advantage of the estimating equations (5) is that πi is not needed when Xi is missing. For every pre-determined value of αX, we can compute $\hat{\alpha}_S$ using (5) and $\hat{\pi}_i$ for subjects with observed Xi; subsequently we can compute $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$, none of which needs $\hat{\pi}_i$ for subjects with missing Xi. This procedure is repeated for a grid of αX values, and the resulting estimators are compared to assess the impact of αX and hence the impact of MNAR. U(Xi, αX) = 0 corresponds to the case of MAR, and U(Xi, αX) ≠ 0 corresponds to the case of MNAR. In this sensitivity analysis, we do not assume that the estimation of the parameters of the working model for X is unaffected by MNAR. To simplify the sensitivity analysis and, in particular, to avoid performing a sensitivity analysis for two working models, we exploit the doubly robust property, i.e., if the working model for δ is correctly specified then the proposed estimators are consistent, and focus on the working model for δ only.
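The sensitivity step can be sketched for the simplest case in which S contains only an intercept and U(X, αX) = αX·X. Under these assumptions, and assuming the estimating equation takes the inverse-weighted form in which the sum of δi/πi equals n, the intercept can be found by bisection, because the left-hand side is strictly decreasing in the intercept. The function names, the bracketing interval, and the toy data are hypothetical.

```python
from math import exp


def solve_intercept(x, delta, alpha_x, lo=-20.0, hi=20.0, tol=1e-12):
    """Solve sum_i (delta_i / pi_i - 1) = 0 for the intercept a,
    where logit(pi_i) = a + alpha_x * x_i and alpha_x is held fixed.
    Only subjects with observed x contribute 1 / pi_i."""
    n = len(delta)

    def g(a):
        inv_pi = sum(1.0 + exp(-(a + alpha_x * xi))
                     for xi, dl in zip(x, delta) if dl == 1)
        return inv_pi - n

    # g is strictly decreasing in a, so plain bisection suffices.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def observed_pi(x, delta, alpha_x):
    """Estimated observation probabilities for observed subjects (None otherwise)."""
    a = solve_intercept(x, delta, alpha_x)
    return [1.0 / (1.0 + exp(-(a + alpha_x * xi))) if dl == 1 else None
            for xi, dl in zip(x, delta)]


# Toy data (hypothetical): two of five subjects have missing X.
x = [0.0, 1.0, -1.0, None, None]
delta = [1, 1, 1, 0, 0]
for alpha_x in (-1.0, 0.0, 1.0):  # grid of fixed sensitivity parameters
    print(alpha_x, observed_pi(x, delta, alpha_x))
```

With αX = 0 the solution reproduces the observed fraction of complete cases, which corresponds to the MAR analysis; nonzero values shift the weights toward subjects whose biomarker values make them less likely to be observed.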
3. Simulation studies
We conducted simulations to evaluate the finite sample performance of the proposed estimators, first in the case of MAR where δ is independent of X given D and Z, then in the case of MNAR where δ is dependent on X given D and Z. In our simulations, the naive estimator $\hat{\theta}_{N}$, the inverse-weighted estimator $\hat{\theta}_{IW}$, and the doubly robust estimators $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ were compared. In addition, we considered another estimator, an imputation-type estimator denoted here by $\hat{\theta}_{IM}$, which only relies on the working model for X and is not doubly robust. While it is not of primary interest in this article, $\hat{\theta}_{IM}$ under the correctly specified working model for X can be used as an optimal benchmark for efficiency, as suggested by a referee. To benchmark bias and loss of efficiency due to missing data, a so-called gold standard (GS) estimator, $\hat{\theta}_{GS}$, was also computed; it uses the underlying “true” biomarker values for all subjects and is not applicable in real data analysis. In Tables 1–3, modified weights as described in Section 2.2 were used for $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$; to compute the standard error for $\hat{\theta}_{DR}^{NP}$, we used 200 bootstrap datasets randomly sampled with replacement from the data while stratified on the disease status. In each simulation, we generated a random sample of n = 200 independent subjects with an equal number of diseased and non-diseased subjects. For each simulation setting, 500 Monte Carlo datasets were generated and the results were summarized using the following measures: the mean relative bias (RB), mean of the standard error estimates (SE), Monte Carlo standard deviation of parameter estimates (SD), square root of mean squared error (SMSE) and coverage rate (CR) of the 95% Wald confidence interval constructed using a logit transform of θ as suggested in Pepe (2003) (Ch. 5).
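The two inferential devices mentioned above, a bootstrap stratified on disease status and a Wald interval on the logit scale, can be sketched as follows. The record layout (dictionaries with a key `"d"`), the function names, and the use of the delta method to obtain the SE on the logit scale are assumptions made for illustration.

```python
import random
from math import exp, log
from statistics import stdev


def stratified_bootstrap_se(data, estimator, n_boot=200, seed=0):
    """Bootstrap SE, resampling with replacement within each disease group.

    data: list of per-subject records (dicts) with key "d" (1 = diseased).
    estimator: function mapping a list of records to a scalar estimate.
    """
    rng = random.Random(seed)
    cases = [r for r in data if r["d"] == 1]
    ctrls = [r for r in data if r["d"] == 0]
    estimates = []
    for _ in range(n_boot):
        # Group sizes are preserved in every bootstrap sample.
        sample = rng.choices(cases, k=len(cases)) + rng.choices(ctrls, k=len(ctrls))
        estimates.append(estimator(sample))
    return stdev(estimates)


def logit_wald_ci(theta_hat, se, z=1.96):
    """Wald interval built on the logit scale and mapped back to (0, 1)."""
    eta = log(theta_hat / (1.0 - theta_hat))
    se_eta = se / (theta_hat * (1.0 - theta_hat))  # delta method
    lower, upper = eta - z * se_eta, eta + z * se_eta
    return 1.0 / (1.0 + exp(-lower)), 1.0 / (1.0 + exp(-upper))


print(logit_wald_ci(0.85, 0.04))
```

Building the interval on the logit scale keeps both limits strictly inside (0, 1), which matters when the estimated AUC is close to 1.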
Table 1.
Gaussian ε | non-Gaussian ε | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
RB (%) | SE | SD | SMSE | CR (%) | RB (%) | SE | SD | SMSE | CR (%) | |
−0.2 | 0.036 | 0.037 | 0.037 | 94.0 | 0.0 | 0.038 | 0.038 | 0.038 | 95.8 | |
11.6 | 0.050 | 0.054 | 0.099 | 70.0 | 10.8 | 0.057 | 0.056 | 0.092 | 80.4 | |
Both mean models correctly specified | ||||||||||
−0.1 | 0.040 | 0.042 | 0.042 | 95.0 | −0.8 | 0.054 | 0.054 | 0.054 | 94.8 | |
0.5 | 0.048 | 0.052 | 0.052 | 93.0 | 1.0 | 0.056 | 0.058 | 0.059 | 95.0 | |
0.0 | 0.045 | 0.043 | 0.043 | 96.4 | 0.5 | 0.057 | 0.055 | 0.055 | 96.4 | |
0.0 | 0.042 | 0.043 | 0.043 | 94.4 | 0.5 | 0.056 | 0.056 | 0.056 | 96.0 | |
Mean model for δ misspecified | ||||||||||
8.4 | 0.050 | 0.054 | 0.081 | 78.6 | 8.0 | 0.056 | 0.058 | 0.079 | 84.8 | |
0.0 | 0.040 | 0.043 | 0.043 | 94.0 | 0.6 | 0.052 | 0.055 | 0.055 | 94.4 | |
0.0 | 0.041 | 0.043 | 0.043 | 95.4 | 0.4 | 0.056 | 0.055 | 0.055 | 95.8 | |
Mean model for X misspecified | ||||||||||
0.5 | 0.054 | 0.050 | 0.050 | 96.2 | 0.9 | 0.061 | 0.058 | 0.058 | 96.2 | |
0.5 | 0.050 | 0.050 | 0.050 | 94.0 | 0.9 | 0.059 | 0.058 | 0.058 | 95.6 | |
Both mean models misspecified | ||||||||||
8.4 | 0.050 | 0.053 | 0.081 | 78.6 | 7.9 | 0.057 | 0.058 | 0.079 | 86.0 | |
8.4 | 0.051 | 0.053 | 0.081 | 79.0 | 7.9 | 0.058 | 0.058 | 0.079 | 86.2 |
Table 3.
RB (%) | SE | SD | SMSE | CR (%) | |
---|---|---|---|---|---|
−0.2 | 0.036 | 0.037 | 0.037 | 94.0 | |
−8.1 | 0.053 | 0.055 | 0.080 | 78.2 | |
Correct subset of Z1 and D included in the working model for δ | |||||
−5.8 | 0.052 | 0.055 | 0.069 | 85.4 | |
−1.0 | 0.049 | 0.052 | 0.053 | 93.0 | |
Correct subset of Z and D included in both models | |||||
−0.5 | 0.039 | 0.040 | 0.041 | 95.0 | |
−0.5 | 0.039 | 0.040 | 0.040 | 94.4 | |
−0.2 | 0.038 | 0.039 | 0.039 | 93.8 | |
−0.2 | 0.038 | 0.039 | 0.039 | 93.6 | |
Incorrect subset of Z1 and D included in the working model for δ | |||||
−0.6 | 0.039 | 0.040 | 0.041 | 94.6 | |
−0.5 | 0.039 | 0.040 | 0.040 | 93.2 | |
−0.2 | 0.038 | 0.039 | 0.039 | 94.0 | |
−0.2 | 0.038 | 0.039 | 0.039 | 93.4 | |
Incorrect subset of Z2 and D included in the working model for X | |||||
−5.3 | 0.049 | 0.053 | 0.065 | 85.0 | |
−5.3 | 0.050 | 0.053 | 0.065 | 86.2 | |
−0.8 | 0.047 | 0.049 | 0.049 | 92.8 | |
−0.8 | 0.048 | 0.049 | 0.049 | 93.8 |
3.1. MAR: δ independent of X given D and Z
Under MAR, we considered two settings, namely, δ dependent on X given D and δ independent of X given D. Corresponding to each setting, we generated the auxiliary variables Z1 = (Z11, Z12, Z13), which are associated with δ, and Z2 = (Z21, Z22, Z23), which are associated with X. In the first setting, Z1 = Z2 and they were generated from a multivariate Gaussian distribution with mean μZ = (3, −2, −1) and variance matrix ΣZ = diag(0.25, 0.25, 0.25), which implies that δ is dependent on X given D and hence the naive estimator $\hat{\theta}_{N}$ is subject to potential bias. In the second setting, Z1 and Z2 were generated from two independent multivariate Gaussian distributions with the same mean and variance as in the first setting, which implies that δ is independent of X given D and hence $\hat{\theta}_{N}$ is unbiased. Next, we generated X as follows: X = β0 + β1D + β2Z2 + β3DZ2 + ε with β0 = 1, β1 = 2.5, β2 = (3, 3, 3), and β3 = (0.5, 0.5, 0.5), which is the true underlying mean model for X. Two different residual distributions were considered so that we could compare the performance of the doubly robust estimators $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$; specifically, ε ~ N(0, σ²) or ε = 20{η − E(η)} with η ~ Beta(5, 1). The resulting true θ is 0.722 for Gaussian ε and 0.675 for non-Gaussian ε. Subsequently, we generated the missing indicator δ from a Bernoulli distribution with mean π which satisfies logit(π) = α0 + α1D + α2Z1 + α3DZ1 with α0 = 0.3, α1 = 0.3, α2 = (0.4, 0.5, 0.3), and α3 = (−0.7, −0.7, −0.9); this is the underlying true model for δ. The resulting average probability of missing X is 66.4% in the diseased group and 55.8% in the non-diseased group.
When computing $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$, we fitted the two working models for δ and X, denoted $\mathcal{M}_{\delta}$ and $\mathcal{M}_{X}$, under the following four scenarios: 1) the mean structure is correctly specified for both working models, i.e., Z1 and D are included in $\mathcal{M}_{\delta}$, and Z2 and D are included in $\mathcal{M}_{X}$; 2) the mean structure is misspecified for $\mathcal{M}_{\delta}$, i.e., only a subset of Z1 and D are used in $\mathcal{M}_{\delta}$; 3) the mean structure is misspecified for $\mathcal{M}_{X}$, i.e., only a subset of Z2 and D are used in $\mathcal{M}_{X}$; and 4) the mean structure is misspecified for both working models, i.e., only a subset of Z1 and D are included in $\mathcal{M}_{\delta}$ and only a subset of Z2 and D are included in $\mathcal{M}_{X}$. We note that $\hat{\theta}_{DR}^{G}$ assumes that X follows a Gaussian distribution. Consequently, if the residuals for X follow a Gaussian distribution, e.g., ε ~ N(0, σ²), then the correct specification of the mean structure in $\mathcal{M}_{X}$ also implies the correct specification of the conditional distribution for X when computing $\hat{\theta}_{DR}^{G}$. However, if the residual distribution is not Gaussian, e.g., ε = 20{η − E(η)} with η ~ Beta(5, 1), the conditional distribution for X is misspecified when computing $\hat{\theta}_{DR}^{G}$, even if the mean structure is correctly specified in $\mathcal{M}_{X}$. Since $\hat{\theta}_{DR}^{NP}$ is robust to misspecification of the distribution of the residuals for X, it should remain consistent in both cases.
3.1.1. The case of δ dependent on X given D.
In this setting, we let Z1 and Z2 be identical, hence δ is dependent on X given D. Table 1 presents the results for two different residual distributions for X. We first discuss the case of Gaussian ε. The naive estimator $\hat{\theta}_{N}$ shows a large RB of 11.6% with a low coverage rate of 70.0%. The inverse-weighted estimator $\hat{\theta}_{IW}$ exhibits negligible bias and a CR close to the nominal level when the working model for δ, $\mathcal{M}_{\delta}$, is correctly specified; however, its bias becomes substantial and its CR degrades considerably to 78.6% when $\mathcal{M}_{\delta}$ is misspecified. When at least one working model is correctly specified, the doubly robust estimators $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ show negligible bias that is comparable to that of the gold standard estimator $\hat{\theta}_{GS}$, together with good coverage properties. In particular, as long as the working model for X, $\mathcal{M}_{X}$, is correctly specified, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ are more efficient than $\hat{\theta}_{IW}$, and their efficiency is close to that of the imputation-type estimator $\hat{\theta}_{IM}$; in this case, negligible loss of efficiency is observed even if $\mathcal{M}_{\delta}$ is misspecified. By contrast, when $\mathcal{M}_{X}$ is misspecified and $\mathcal{M}_{\delta}$ is correctly specified, the loss of efficiency is considerable for $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$. These observations are consistent with what has been reported in the literature, i.e., the correct specification of $\mathcal{M}_{X}$ is more important for improving the efficiency of doubly robust estimators. When both working models are misspecified, the bias and MSE of $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ are still similar to or less than those of $\hat{\theta}_{N}$ or $\hat{\theta}_{IW}$.
When the residuals are not Gaussian, the working model for X is always misspecified for $\hat{\theta}_{DR}^{G}$. Our results in Table 1 show that $\hat{\theta}_{DR}^{G}$ is fairly robust to the misspecified distribution of ε as long as the conditional mean of X in the working model is correctly specified. In addition, most observations for Gaussian ε still hold for non-Gaussian ε. In this case, $\hat{\theta}_{IM}$ serves as an approximate benchmark for efficiency, since $\hat{\theta}_{IM}$ is also fairly robust to a misspecified distribution for ε and it is generally difficult to obtain an exact “imputation” estimator when ε is non-Gaussian. Similar results were observed in our additional simulations with other non-Gaussian distributions for ε, e.g., a χ² distribution.
3.1.2. The case of δ independent of X given D.
In this setting, Z1 and Z2 are two separate sets of auxiliary variables, hence δ is independent of X given D. Table 2 presents the results for both Gaussian and non-Gaussian residuals. In all cases, all estimators exhibit negligible bias and satisfactory coverage properties, which is consistent with our discussion in Section 2. Again, as long as the working model for X is (approximately) correctly specified, the doubly robust estimators $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ are almost as efficient as the imputation-type estimator $\hat{\theta}_{IM}$; they perform no worse than the naive estimator $\hat{\theta}_{N}$ and the inverse-weighted estimator $\hat{\theta}_{IW}$ in other settings. As with the case of δ dependent on X given D in Section 3.1.1, the results are very similar for the two types of residual distributions for X.
Table 2.
Gaussian ε | non-Gaussian ε | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
RB (%) | SE | SD | SMSE | CR (%) | RB (%) | SE | SD | SMSE | CR (%) | |
0.2 | 0.036 | 0.035 | 0.035 | 96.0 | 0.2 | 0.038 | 0.036 | 0.036 | 96.0 | |
−0.1 | 0.059 | 0.057 | 0.057 | 95.8 | 0.4 | 0.063 | 0.061 | 0.061 | 96.4 | |
Both mean models correctly specified | ||||||||||
0.1 | 0.039 | 0.040 | 0.040 | 94.2 | −0.6 | 0.051 | 0.050 | 0.050 | 95.8 | |
−0.1 | 0.059 | 0.060 | 0.059 | 94.8 | 0.3 | 0.062 | 0.065 | 0.065 | 94.6 | |
0.2 | 0.044 | 0.040 | 0.040 | 96.4 | 0.1 | 0.056 | 0.053 | 0.053 | 95.8 | |
0.2 | 0.041 | 0.041 | 0.041 | 95.4 | 0.1 | 0.056 | 0.053 | 0.053 | 95.2 | |
Mean model for misspecified | ||||||||||
−0.2 | 0.058 | 0.059 | 0.059 | 95.0 | 0.3 | 0.062 | 0.062 | 0.062 | 95.2 | |
0.2 | 0.041 | 0.040 | 0.040 | 95.0 | 0.2 | 0.053 | 0.051 | 0.051 | 95.8 | |
0.2 | 0.041 | 0.040 | 0.040 | 94.8 | 0.1 | 0.054 | 0.051 | 0.051 | 96.2 | |
Mean model for misspecified | ||||||||||
−0.2 | 0.059 | 0.055 | 0.055 | 96.0 | 0.4 | 0.063 | 0.062 | 0.062 | 95.4 | |
−0.2 | 0.057 | 0.055 | 0.055 | 96.4 | 0.4 | 0.063 | 0.062 | 0.062 | 95.2 | |
Both mean models misspecified | ||||||||||
−0.2 | 0.054 | 0.054 | 0.054 | 95.0 | 0.4 | 0.059 | 0.059 | 0.059 | 95.2 | |
−0.2 | 0.055 | 0.054 | 0.054 | 95.8 | 0.4 | 0.061 | 0.059 | 0.059 | 95.8 |
We repeated the simulations in Tables 1 and 2 using the original weights (Web Appendix B), and the results are almost the same except that the performance of the bootstrap SE for the nonparametric doubly robust estimator $\hat{\theta}_{DR}^{NP}$ deteriorates somewhat.
3.2. MNAR: δ dependent on X given D and Z
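We now consider the case of MNAR where δ is dependent on X conditional on D and Z, i.e., the true model for δ is a logistic model with intercept α0 and coefficients αZ, αD and αX for the relevant subset of Z1, D and X, respectively, with (α0, αZ, αD, αX) = (−1, 0.2, 0.5, 0.3). The rest of the simulation setup is identical to that in Section 3.1. The resulting average probability of missing X is 57.4% in the diseased group and 31.5% in the non-diseased group. We focused on the case where Z1 and Z2 are identical and ε is Gaussian; in this case, the true θ remains 0.722. Our primary goal is to compare $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ with their corresponding sensitivity estimators as described in Section 2.4, denoted here by $\hat{\theta}_{IW}^{S}$, $\hat{\theta}_{DR}^{G,S}$ and $\hat{\theta}_{DR}^{NP,S}$, for which the estimating equations (5) were used to estimate αS = (α0, αZ, αD) with αX fixed at its true value. The rest of the estimating procedures remain the same for all estimators. As with the case of MAR in Section 3.1, we investigated the impact of misspecifying the working model for δ, $\mathcal{M}_{\delta}$, and/or the working model for X, $\mathcal{M}_{X}$; specifically, we considered a misspecified $\mathcal{M}_{\delta}$ that includes an incorrect subset of Z1 and D and a misspecified $\mathcal{M}_{X}$ that includes only an incorrect subset of Z2 and D. We also note that X is included as a covariate in $\mathcal{M}_{\delta}$ for the sensitivity estimators, but not for $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ or $\hat{\theta}_{DR}^{NP}$. Thus, when D and the correct subset of Z1 are included in $\mathcal{M}_{\delta}$, $\mathcal{M}_{\delta}$ is correctly specified for the sensitivity estimators, but is misspecified for $\hat{\theta}_{IW}$, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$.
Table 3 presents the simulation results. First, the naive estimator again exhibits substantial bias under MNAR. We now compare $\hat{\theta}_{IW}$ and $\hat{\theta}_{IW}^{S}$. When $\mathcal{M}_{\delta}$ does not account for the effect of X, $\hat{\theta}_{IW}$ shows considerable bias even if $\mathcal{M}_{\delta}$ includes D and the correct subset of Z1. On the other hand, $\hat{\theta}_{IW}^{S}$, which accounts for the effect of X, shows negligible bias. Next, we compare $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ with $\hat{\theta}_{DR}^{G,S}$ and $\hat{\theta}_{DR}^{NP,S}$. When correct subsets of Z and D are included in both working models, $\mathcal{M}_{\delta}$ is still misspecified for $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$. However, since $\mathcal{M}_{X}$ is correctly specified, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ exhibit negligible bias and good coverage properties as a result of their double robustness, and their efficiency is comparable to that of $\hat{\theta}_{DR}^{G,S}$ and $\hat{\theta}_{DR}^{NP,S}$. These results still hold when $\mathcal{M}_{\delta}$ includes the incorrect subset of auxiliary variables and $\mathcal{M}_{X}$ is correctly specified. When an incorrect subset of Z2 is included in $\mathcal{M}_{X}$, both working models are misspecified for $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$; consequently, $\hat{\theta}_{DR}^{G}$ and $\hat{\theta}_{DR}^{NP}$ exhibit considerable bias. In all three settings, $\hat{\theta}_{DR}^{G,S}$ and $\hat{\theta}_{DR}^{NP,S}$ show negligible bias, but their SDs increase when $\mathcal{M}_{X}$ is misspecified, which is consistent with the earlier finding that $\mathcal{M}_{X}$ is more important for improving efficiency.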
4. Data Analysis
We illustrate our methods using an observational psychiatric study, which was concerned with the impact of maternal depression during pregnancy on infant outcomes. In this study, participants were enrolled no later than week 28 of gestation and evaluated at each trimester across pregnancy. As part of the study, the presence (or absence) of a major depressive episode (disease status, D) was determined at each visit by the Mood Module of the Structured Clinical Interview for DSM-IV Axis I Disorders (SCID) (First et al., 2002), which needs to be administered by a trained research professional and is considerably more time-consuming and difficult to obtain in practice. At the same time, some subjects also completed the self-rated Edinburgh Postnatal Depression Scale (EPDS) (Cox and Holden, 1987) at each visit.
In women's mental health research, several rating scales have been developed for identifying postpartum depression (Fergerson et al., 2002; Perfetti et al., 2004), and in particular the self-rated EPDS, which can be obtained fairly easily in practice, has emerged as a widely used instrument for postpartum depression screening and detection (Austin et al., 2005; Felice et al., 2006). In contrast, there are no validated tools to assess depression during pregnancy. In practice, the EPDS, developed for postpartum use, has been increasingly used to identify depression during pregnancy and to screen for those at risk of developing depression during pregnancy. While not designed for such a purpose, data collected from this study have recently been used to evaluate EPDS as a biomarker for the diagnosis of maternal depression throughout pregnancy. For the purpose of illustration, we focus on the data collected in the second trimester; the subset of the study population with data in the second trimester was used, and the sample size in the analysis is n = 517. The outcome of interest is the presence of a major depressive episode (D), which is confirmed for all subjects, whereas EPDS is the biomarker of interest and is missing in 79% of the subjects. Additional auxiliary variables were also measured in this study, including the mother's age, race, marital status and education level, and whether or not it was the first pregnancy. In addition, a research interviewer masked to treatment status administered the Structured Interview Guide for the Hamilton Rating Scale for Depression to obtain the 17-item score (HRSD17), which is known to be highly correlated with EPDS. These variables are treated as auxiliary variables (Z) and are used to build the working models for δ and X.
We conducted a sensitivity analysis for , , and as described in Section 2.4. Specifically, we considered a that is similar to what is discussed in Section 2.4, i.e., , where W(Z, D) includes the intercept and interaction terms between the auxiliary variables Z and D. In fitting this , estimating equations (5) were used with αX fixed at −1, 0 and 1, where αX = 0 corresponds to the case of MAR, and αX = −1 or 1 corresponds to the case of MNAR. In our analysis, all continuous variables including X were standardized to have mean 0 and unit standard deviation. Consequently, αX captures the effect of a one-SD change in X. Table 4 presents the results using modified weights. The impact of different αX values is moderate on , , and , and the estimates using different methods including are comparable. This indicates that the missingness of X (δ) is likely close to being independent of X given D. Nevertheless, , which incorporates information from auxiliary variables, is more efficient than the other estimators. Because the proportion of missing data is very high, the bootstrap SE of is greater than the SE of , but it is still smaller than the SE of . We repeated this analysis using the original weights in Web Appendix C; while the main results remain similar, a larger bootstrap SE for is observed as a result of large and unstable weights.
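As a rough illustration of how a fixed αX can enter an inverse-probability-weighted AUC, the following Python sketch assumes a logistic missingness model logit P(δ = 1 | X, Z, D) = W(Z, D)ᵀγ + αX·X and solves Σᵢ (δᵢ/πᵢ − 1) Wᵢ = 0 for γ by Newton iterations. Both the model form and this estimating equation are assumptions made here for illustration and need not match equations (5) or the modified weights used in the analysis; all function names are hypothetical.

```python
import numpy as np

def _linear_predictor(W, x, delta, gamma, alpha_x):
    # X enters only where it is observed; missing entries are set to 0
    # and are never used downstream because they are multiplied by delta = 0.
    x_filled = np.where(delta == 1, x, 0.0)
    return W @ gamma + alpha_x * x_filled

def fit_gamma(W, x, delta, alpha_x, n_iter=50, tol=1e-10):
    """Solve sum_i (delta_i / pi_i - 1) W_i = 0 for gamma with alpha_x held fixed.

    Assumed model (illustrative): pi_i = expit(W_i' gamma + alpha_x * x_i).
    """
    gamma = np.zeros(W.shape[1])
    for _ in range(n_iter):
        e = np.exp(-_linear_predictor(W, x, delta, gamma, alpha_x))
        g = W.T @ (delta * (1.0 + e) - 1.0)       # estimating function, since 1/pi = 1 + e
        J = -(W * (delta * e)[:, None]).T @ W     # its Jacobian with respect to gamma
        step = np.linalg.solve(J, g)
        gamma = gamma - step                      # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return gamma

def missing_prob(W, x, delta, gamma, alpha_x):
    """Probability of being observed under the assumed logistic model."""
    eta = _linear_predictor(W, x, delta, gamma, alpha_x)
    return 1.0 / (1.0 + np.exp(-eta))

def ipw_auc(x, d, delta, pi):
    """Weighted Mann-Whitney AUC over observed pairs, normalised by the pair weights."""
    obs = delta == 1
    xo, do, w = x[obs], d[obs], 1.0 / pi[obs]
    x1, w1 = xo[do == 1], w[do == 1]              # diseased, biomarker observed
    x0, w0 = xo[do == 0], w[do == 0]              # non-diseased, biomarker observed
    kern = (x1[:, None] > x0[None, :]) + 0.5 * (x1[:, None] == x0[None, :])
    pair_w = w1[:, None] * w0[None, :]
    return float((pair_w * kern).sum() / pair_w.sum())
```

Under these assumptions, calling `fit_gamma` for each αX in (−1, 0, 1) and passing the resulting probabilities to `ipw_auc` gives one weighted estimate per sensitivity parameter; standard errors would have to be obtained separately.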
Table 4.
Estimate | SE
---|---
0.861 | 0.038

αX = −1, Estimate | αX = −1, SE | αX = 0, Estimate | αX = 0, SE | αX = 1, Estimate | αX = 1, SE
---|---|---|---|---|---
0.864 | 0.037 | 0.851 | 0.040 | 0.849 | 0.042
0.873 | 0.028 | 0.852 | 0.030 | 0.841 | 0.032
0.873 | 0.035 | 0.852 | 0.038 | 0.841 | 0.038
With the estimated AUC ranging from 0.841 to 0.873, our results suggest that the EPDS has very good discriminative power during the second trimester. However, in this study, only a subset of the study population had depression status confirmed during each perinatal window. As a result, in addition to missing values in the biomarker, verification bias is potentially in play as well. Furthermore, both the rating scale and the presence of a major depressive episode were measured repeatedly throughout the pregnancy. Therefore, it is of substantial interest for future studies to investigate methods that account for both missing biomarker values and verification bias, and that accommodate repeatedly measured biomarker values and disease status, when estimating the ROC AUC.
5. Discussion
We have proposed and contrasted several estimators of the ROC AUC when the biomarker value is missing for some subjects. Our numerical studies show that the doubly robust estimators perform as well as or better than other estimators in all cases, even when both working models are misspecified. is also fairly robust to the misspecified residual distribution for the biomarker variable (X). Since only ranks of X are used in estimating θ, a correctly specified conditional mean is more important, and the impact of a misspecified residual distribution may be limited when the conditional mean is correctly specified. The bootstrap procedure for obtaining the SE of is computationally more expensive and also makes it more susceptible to large and unstable weights. Thus, in practice, we recommend the use of and stabilized weights such as ours, and emphasize the importance of identifying (approximately) correct . We also note that can readily accommodate categorical biomarker values, e.g., a baseline logit model (Agresti, 2002) can be used to model the conditional distribution of a categorical biomarker variable.
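The computational cost of the bootstrap comes from recomputing the estimator on every resample. A minimal Python sketch of a nonparametric bootstrap SE follows, using the plain empirical AUC on complete data as a stand-in estimator. In the setting of this paper each resample would also require refitting the working models, which is not shown. The function names and the defaults (`n_boot`, `seed`) are illustrative assumptions.

```python
import numpy as np

def empirical_auc(x, d):
    """Mann-Whitney estimate of P(X_diseased > X_non-diseased), ties counted as 1/2."""
    x1, x0 = x[d == 1], x[d == 0]
    kern = (x1[:, None] > x0[None, :]) + 0.5 * (x1[:, None] == x0[None, :])
    return float(kern.mean())

def bootstrap_se(estimator, x, d, n_boot=500, seed=0):
    """Standard deviation of the estimator over resamples of whole subjects."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw n subjects with replacement
        xb, db = x[idx], d[idx]
        if db.min() == db.max():           # resample lacks one disease group; skip it
            continue
        stats.append(estimator(xb, db))
    return float(np.std(stats, ddof=1))
```

Any estimator with the same `(x, d)` signature can be passed in place of `empirical_auc`; when the estimator involves inverse-probability weights, a few resamples with extreme weights can inflate the resulting SE.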
More recently, Cao et al. (2009) investigated alternative doubly robust estimators for estimating a population mean; their methods achieve minimum variance under incorrectly specified and correctly specified , and they do not suffer from large and unstable weights. While their enhanced model for can be readily adopted in our methods as an alternative to alleviate the problem of large and unstable weights, it is more involved to extend their approach of minimizing variance under misspecified and correctly specified to the estimation of the ROC AUC, because complications arise from the use of U-statistics in our methods. Potential future research may also include extending the sensitivity analysis to and investigating more complicated missing-data patterns, e.g., auxiliary variables are also missing and missingness is not monotone, for which an imputation approach may be more practical.
Acknowledgements
We thank Editor Verbeke, an associate editor and two referees for their insightful suggestions which greatly improved an earlier draft of this manuscript.
Footnotes
Supplementary Materials
Web Appendices referenced in Sections 2 and 3 are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.
References
- Agresti A (2002). Categorical Data Analysis, 2nd Edition. John Wiley & Sons.
- Austin M, Hadzi-Pavlovic D, Saint K, and Parker G (2005). Antenatal screening for the prediction of postnatal depression: validation of a psychosocial pregnancy risk questionnaire. Acta Psychiatr Scand. 112, 310–317.
- Bamber D (1975). The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology. 12, 387–415.
- Cao W, Tsiatis A, and Davidian M (2009). Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika 96, 723–734.
- Cox J and Holden J (1987). Detection of postnatal depression: development of the 10-item Edinburgh Postnatal Depression Scale. Br J Psychiatry. 150, 782–786.
- Felice E, Saliba J, Grech V, and Cox J (2006). Validation of the Maltese version of the Edinburgh Postnatal Depression Scale. Arch Womens Ment Health. 9, 75–80.
- Fergerson S, Jamieson D, and Lindsay M (2002). Diagnosing postpartum depression: can we do better? Am J Obstet Gynecol. 186, 899–902.
- First M, Spitzer R, Gibbon M, and Williams J (2002). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Patient Edition (SCID-IP, 11/2002 Revision). Washington, DC: American Psychiatric Press.
- Fluss R, Reiser B, Faraggi D, and Rotnitzky A (2009). Estimation of the ROC curve under verification bias. Biometrical Journal. 51(3), 475–490.
- Green D and Swets J (1966). Signal Detection Theory and Psychophysics. Wiley, New York.
- Kosinski A and Barnhart H (2003). Accounting for non-ignorable verification bias in assessment of diagnostic test. Biometrics 59, 163–171.
- Little R and Rubin D (2002). Statistical Analysis with Missing Data, 2nd Edition. Wiley & Sons.
- Pepe M (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press.
- Perfetti J, Clark R, and Fillmore C (2004). Postpartum depression: identification, screening, and treatment. Wis Med J. 103, 56–63.
- Robins J, Rotnitzky A, and Zhao L (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 89, 846–866.
- Rosenbaum P and Rubin D (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
- Rotnitzky A, Faraggi D, and Schisterman E (2006). Doubly robust estimation of the area under the receiver-operating characteristic curve in the presence of verification bias. Journal of the American Statistical Association. 101, 1276–1288.
- Rotnitzky A and Robins J (1997). Analysis of semiparametric regression models with non-ignorable nonresponse. Statistics in Medicine. 16, 81–102.
- Scharfstein D, Rotnitzky A, and Robins J (1999). Adjusting for nonignorable dropout using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association 94, 1096–1120.
- Zhou X (1993). Maximum likelihood estimators of sensitivity and specificity corrected for verification bias. Communications in Statistics - Theory and Methods. 22, 3177–3198.
- Zhou X (1994). Effect of verification bias on positive and negative predictive values. Statistics in Medicine. 13, 1737–1745.
- Zhou X (1998). Correcting for verification bias in studies of a diagnostic test's accuracy. Statistical Methods in Medical Research. 7, 337–353.
- Zweig M and Campbell G (1993). Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clinical Chemistry. 39, 561–577.