SUMMARY
To compare the accuracy of multiple diagnostic tests in a single study, three designs are commonly used: (i) the multiple test comparison design; (ii) the randomized design; and (iii) the non-comparative design. Existing meta-analysis methods for diagnostic tests (MA-DT) have focused on evaluating the performance of a single test by comparing it with a reference test. The increasing number of available diagnostic instruments for a disease condition and the different study designs in use have generated the need for an efficient and flexible meta-analysis framework that combines all designs for simultaneous inference. In this article, we develop a missing data framework and a Bayesian hierarchical model for network MA-DT (NMA-DT) that offers important advantages over traditional MA-DT: (i) it combines studies using all three designs; (ii) it pools studies both with and without a gold standard; (iii) it combines studies with different sets of candidate tests; and (iv) it accounts for heterogeneity across studies and the complex correlation structure among multiple tests. We illustrate our method through a case study: a network meta-analysis of deep vein thrombosis tests.
Keywords: Diagnostic test, Hierarchical model, Missing data, Multiple test comparison, Network meta-analysis
1. Introduction
Comparative effectiveness research relies fundamentally on accurate assessment of clinical outcomes. The growing number of assessment instruments, as well as the rapid escalation in costs, has generated an increasing need for scientifically rigorous comparisons of multiple diagnostic tests in clinical practice. To compare the accuracy of multiple diagnostic tests in a single study, three designs are commonly used (Takwoingi, Leeflang and Deeks, 2013): (i) the multiple test comparison design, where all subjects are diagnosed by all candidate tests and verified by a gold standard; (ii) the randomized design, where subjects are randomly assigned to one of the candidate tests, and all subjects are verified by a gold standard; and (iii) the non-comparative design, where different sets of subjects are used to compare a candidate test to a gold standard or to another candidate test. Systematic reviews and meta-analysis methods have been developed as useful tools to improve the estimation of diagnostic test accuracy by combining information from multiple studies (Rutter and Gatsonis, 2001; Reitsma and others, 2005). Hence, a flexible meta-analysis framework that combines information from all three designs is needed to effectively rank all candidate tests.
However, in the methodology literature on meta-analysis of diagnostic tests, a great deal of attention has been devoted to developing methods that estimate the performance of one candidate test compared with a reference test. When the reference test is a gold standard, multivariate random effects models have been developed to account for the heterogeneity of test performance across studies and the correlations among test accuracy indices (such as sensitivity and specificity) (Rutter and Gatsonis, 2001; Reitsma and others, 2005; Chu and Cole, 2006; Harbord and others, 2007; Chu, Chen and Louis, 2009; Ma and others, 2016b; Ma and others, 2016a; Chen and others, 2015). When the reference test cannot perfectly distinguish diseased from non-diseased subjects (i.e., is not a gold standard), latent class random effects models (Chu, Chen and Louis, 2009; Dendukuri and others, 2012; Liu, Chen and Chu, 2015) have been proposed to estimate the diagnostic accuracy of both the candidate and reference tests.
Very few papers have discussed how to simultaneously compare multiple candidate tests in meta-analysis. A naive procedure is to conduct a separate meta-analysis of diagnostic tests (MA-DT) for each candidate test and then compare their summary estimates, which is valid only under the missing completely at random (MCAR) assumption. This procedure has several important drawbacks. First, in studies that compared multiple diagnostic tests, the accuracy estimates of the candidate tests from separate MA-DT are typically correlated, as the multiple test comparison design may be used in some studies and some subjects may be evaluated by multiple tests. Ignoring such correlations can lead to efficiency loss. Second, current methods cannot combine studies comparing a candidate test with a gold standard and studies comparing a candidate test with a non-gold-standard reference. Third, when candidate tests are evaluated one at a time, the number of studies is typically small, which can lead to model-fitting issues (Hamza and others, 2008) and difficulty in estimating between-study heterogeneity. In addition, as different studies representing heterogeneous populations are synthesized, the candidate tests are not directly comparable without certain strong assumptions, limiting the generalizability of the results. Finally, separate MA-DT does not allow for "borrowing of information," which can lead to further loss of statistical efficiency.
To address these limitations, we develop a network MA-DT (NMA-DT) framework from the perspective of missing data analysis to simultaneously compare multiple tests. The proposed framework is motivated by the literature on network meta-analysis of randomized clinical trials, which extends the scope of traditional pairwise meta-analysis by synthesizing both direct and indirect comparisons of multiple treatments across randomized controlled trials (Lu and Ades, 2004; Salanti and others, 2011; Zhang and others, 2014). Specifically, we view studies using the randomized design and the non-comparative design as if they had used the multiple test comparison design, such that all subjects in all studies were evaluated by all candidate tests and a gold-standard test. However, most studies include only a subset of the whole set of tests of interest, and the outcomes of the non-included tests are treated as missing data. By simultaneously comparing all candidate tests and a gold standard, the proposed approach makes use of all available information, allows borrowing of information across studies, and ranks diagnostic tests through full posterior inference. It effectively handles three critical challenges in traditional MA-DT by (i) combining information from studies with all three designs; (ii) pooling studies both with and without a gold standard; and (iii) allowing different sets of candidate tests in different studies, or in different subsets of subjects within a study. The model also accounts for potential heterogeneity across studies (due to differences in study population, design and laboratory technical issues), as in conventional MA-DT models, as well as the complex correlation structure among multiple diagnostic tests.
The rest of this article is organized as follows. In Section 2, we describe our motivating case study: an NMA of deep vein thrombosis (DVT) tests. We present the proposed NMA-DT model and the Bayesian inference method in Section 3 and apply the proposed method to the motivating study in Section 4. Simulation studies are conducted in Section 5, and Section 6 provides a brief discussion. A directed graphical representation of the proposed model, an additional case study, data and some additional results are provided in the supplementary material available at Biostatistics online.
2. Motivating study: NMA of DVT tests
DVT develops when blood clots form in one or more deep veins of the body. If DVT is left untreated, the blood clot can cause a pulmonary embolism and result in death (Venta and Venta, 1987). The gold standard diagnostic test for DVT, contrast venography, is an invasive procedure and can cause allergic reactions. Therefore, ultrasonography is a commonly used surrogate test because it is noninvasive and has good accuracy. Alternatively, D-dimer is a small protein fragment present in the blood when there is a blood clot, and measuring its concentration in a screening blood test can also be used to diagnose DVT.
A recent paper by Kang and others (2013) presented a meta-analysis that included 12 studies comparing the accuracy of diagnostic tests for DVT. Among the 12 studies, four compared the D-dimer test to venography, three compared ultrasonography to venography, and five compared the D-dimer test to ultrasonography (Kang and others, 2013). None of the studies compared all three tests together. A mixed-effects log-linear model was applied, with random effects incorporated to account for the heterogeneity in the test accuracies of the D-dimer test but not of ultrasonography. In addition, the log-linear model for test accuracies makes the model parameters difficult to interpret and is hard to generalize to comparisons of more diagnostic tests.
3. A unified statistical framework
We present a Bayesian hierarchical NMA-DT model to compare multiple tests simultaneously. In this article, we focus on modeling a commonly used pair of test accuracy indices, sensitivity (Se) and specificity (Sp), where sensitivity is the probability that a candidate test is positive given a diseased subject and specificity is the probability that a candidate test is negative given a non-diseased subject (Pepe, 2003). In addition, other test accuracy indices, such as the positive and negative likelihood ratios (LR+ and LR−) and the positive and negative predictive values (PPV and NPV), can be useful in practice. LR+ (LR−) is the likelihood that a positive (negative) test result would be expected in a patient with the target disease relative to the likelihood that the same result would be expected in a patient without the disease, and PPV (NPV) is the probability of being truly diseased (non-diseased) given a positive (negative) test result. However, PPV and NPV are closely related to disease prevalence, and their estimation requires information on prevalence. Furthermore, disease prevalence has been argued to be potentially correlated with Se and Sp, and meta-analysis models accounting for such correlations have been proposed (Chu and others, 2009; Leeflang and others, 2009). Therefore, the Bayesian hierarchical NMA-DT approach also models disease prevalence, both to account for these correlations and to provide inference on the other test accuracy indices. In this section, we first present the hierarchical model with random effects, then describe the prior distributions of the parameters, and finally provide the likelihood and posterior estimates.
3.1 Hierarchical model
We view the different studies as if they had all adopted a multiple test comparison design, such that all subjects in all studies would undergo the whole set of tests, containing all candidate tests and a gold standard. However, each study includes only a subset of the whole set, and the outcomes of the non-included tests are treated as missing data (Little and Rubin, 2002). We assume that the missing test outcomes are missing at random (MAR). Under MAR, the presence of a test does not depend on any unobserved characteristics, which in our case means that missingness is independent of the test's sensitivity and specificity. In Section 3.4, we provide a method for sensitivity analysis under the missing not at random (MNAR) assumption.
Let $\mathcal{T} = \{T_0, T_1, \ldots, T_K\}$ be a set of $K+1$ binary diagnostic tests, where $T_0$ denotes a gold standard and $T_1, \ldots, T_K$ stand for the candidate tests under evaluation. Suppose we have a collection of $I$ studies, each of which reports outcomes of tests in a subset of $\mathcal{T}$. In the $i$th study, for $i = 1, \ldots, I$, let $y_{ijk}$ be the outcome of test $T_k$ on subject $j$ ($y_{ijk} = 1$ if positive and $y_{ijk} = 0$ if negative), and let $m_{ijk}$ be the missing data indicator ($m_{ijk} = 1$ if $T_k$ is conducted on the $j$th subject and $m_{ijk} = 0$ if not). Let $\pi_i$ be the study-specific disease prevalence: $\pi_i = \Pr(D_{ij} = 1)$, where $D_{ij}$ denotes the true disease status of subject $j$ in study $i$. For $k = 1, \ldots, K$, let $Se_{ik}$ and $Sp_{ik}$ denote the study-specific sensitivity and specificity of the $k$th test, respectively: $Se_{ik} = \Pr(y_{ijk} = 1 \mid D_{ij} = 1)$ and $Sp_{ik} = \Pr(y_{ijk} = 0 \mid D_{ij} = 0)$. Denote by $\mathcal{S}_{ij}$ the set of candidate tests conducted on subject $j$ ($j = 1, \ldots, n_i$) in the $i$th study, and by $\mathbf{y}_{ij}$ the collection of candidate test outcomes for this subject.
Multivariate random effects are used to account for potential across-study heterogeneity in prevalence, sensitivities and specificities, and for correlations among them. Specifically, we write

$$\Phi^{-1}(\pi_i) = \mu_0 + \nu_{i0}, \qquad \Phi^{-1}(Se_{ik}) = \mu_k + \nu_{ik}, \qquad \Phi^{-1}(Sp_{ik}) = \eta_k + \delta_{ik}, \qquad k = 1, \ldots, K, \qquad (3.1)$$

where $\Phi$ is the standard normal cumulative distribution function, so that $\Phi^{-1}$ is the probit transformation. We note that other link functions can be specified as well. The parameter $\mu_0$ is the fixed effect for prevalence, and the parameters $\mu_k$ and $\eta_k$ are the fixed effects for the sensitivity and specificity of $T_k$, respectively. The random effects $\nu_{i0}$, $\nu_{ik}$ and $\delta_{ik}$ are the study-specific effects for prevalence and for the sensitivity and specificity of $T_k$, respectively. It is straightforward to incorporate meta-regression covariates in (3.1) by adding linear terms $\mathbf{z}_i^{T}\boldsymbol{\beta}_k$ and $\mathbf{z}_i^{T}\boldsymbol{\beta}_k^{*}$ to the sensitivity and specificity components for $k = 1, \ldots, K$, where $\mathbf{z}_i$ is a vector of study-level covariates, such as study population characteristics, and $\boldsymbol{\beta}_k$ and $\boldsymbol{\beta}_k^{*}$ are the corresponding coefficient vectors. In this paper, we focus on models without covariates for simplicity.
We introduce the within-study dependency structure of the multiple test parameters by assuming that the random effect vector follows a multivariate normal distribution. Furthermore, this distribution also accounts for potential correlations between prevalence and the test accuracy parameters (Leeflang and others, 2009; Chu, Chen and Louis, 2009):

$$\boldsymbol{\nu}_i = (\nu_{i0}, \nu_{i1}, \ldots, \nu_{iK}, \delta_{i1}, \ldots, \delta_{iK})^{T} \sim N(\mathbf{0}, \Sigma). \qquad (3.2)$$

The covariance matrix $\Sigma$ can be written as $\Sigma = D^{1/2} R\, D^{1/2}$, where $D$ is a $(2K+1) \times (2K+1)$ diagonal matrix with diagonal elements ($\sigma_0^2, \sigma_1^2, \ldots, \sigma_{2K}^2$) capturing the between-study heterogeneities, and $R$ is a positive definite correlation matrix whose diagonal elements are 1 and whose off-diagonal elements measure the potential correlations among disease prevalence and the test accuracy parameters. We assume the same correlation structure for all studies. Therefore, studies reporting all test outcomes of $\mathcal{T}$ contribute to estimating the full matrix $\Sigma$, while studies with missing test outcomes directly contribute to estimating a submatrix of $\Sigma$. By assuming MAR and the same covariance matrix across all studies, which is equivalent to assuming that all studies apply the multiple test comparison design, the NMA-DT model can combine studies reporting different sets of candidate tests and make inferences about the relative test performances.
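The decomposition of the covariance matrix into between-study SDs and a correlation matrix, and the sampling of study-level random effects, can be sketched as follows (a Python/NumPy sketch; the values of $K$, the SDs, the correlations, and the number of studies are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 2                       # number of candidate tests (as in the DVT example)
dim = 2 * K + 1             # prevalence + K sensitivities + K specificities

# Illustrative between-study SDs and a valid (compound-symmetric) correlation matrix R
sd = np.full(dim, 0.3)
R = np.full((dim, dim), 0.2)
np.fill_diagonal(R, 1.0)

# Sigma = D^{1/2} R D^{1/2}, with D = diag(sd^2); equivalently outer(sd, sd) * R
Sigma = np.outer(sd, sd) * R

# Study-specific random effect vectors nu_i for I studies
I = 20
nu = rng.multivariate_normal(np.zeros(dim), Sigma, size=I)  # shape (I, dim)
```

A study that omits some tests simply never uses the corresponding coordinates of its random effect vector, which is why such studies inform only a submatrix of $\Sigma$.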
3.2 Likelihood specification
To derive the likelihood for the $j$th subject in the $i$th study, we first consider a subject tested by the gold standard $T_0$ ($m_{ij0} = 1$), so that the true disease status is known. Conditional independence is assumed, i.e., the candidate test results are independent given the disease status. This assumption has been used in latent class models assessing the accuracy of diagnostic tests without a gold standard (Chu, Chen and Louis, 2009). The probability of the test outcomes for a diseased subject, given the random effects, is calculated as

$$\Pr(\mathbf{y}_{ij}, D_{ij} = 1 \mid \boldsymbol{\nu}_i) = \pi_i\, q^{(1)}_{ij}, \qquad (3.3)$$

where $q^{(1)}_{ij} = \prod_{k \in \mathcal{S}_{ij}} Se_{ik}^{\,y_{ijk}} (1 - Se_{ik})^{1 - y_{ijk}}$. Similarly, the probability for a non-diseased subject $j$ in study $i$ is given by

$$\Pr(\mathbf{y}_{ij}, D_{ij} = 0 \mid \boldsymbol{\nu}_i) = (1 - \pi_i)\, q^{(0)}_{ij}, \qquad (3.4)$$

where $q^{(0)}_{ij} = \prod_{k \in \mathcal{S}_{ij}} (1 - Sp_{ik})^{\,y_{ijk}}\, Sp_{ik}^{\,1 - y_{ijk}}$.
Now we consider the setting where subject $j$ has not been tested by the gold standard $T_0$ (i.e., $m_{ij0} = 0$). By the law of total probability, the probability of the test outcomes $\mathbf{y}_{ij}$, given the random effects, can be written as

$$\Pr(\mathbf{y}_{ij} \mid \boldsymbol{\nu}_i) = \pi_i\, q^{(1)}_{ij} + (1 - \pi_i)\, q^{(0)}_{ij}. \qquad (3.5)$$

In general, the probability of the test outcomes for subject $j$ (with or without the gold standard) can be written in the following unified form:

$$\Pr(\mathbf{y}_{ij} \mid \boldsymbol{\nu}_i) = \left[\pi_i\, q^{(1)}_{ij}\right]^{m_{ij0}\, y_{ij0}} \left[(1 - \pi_i)\, q^{(0)}_{ij}\right]^{m_{ij0}\,(1 - y_{ij0})} \left[\pi_i\, q^{(1)}_{ij} + (1 - \pi_i)\, q^{(0)}_{ij}\right]^{1 - m_{ij0}}. \qquad (3.6)$$
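A direct transcription of the unified per-subject likelihood (3.6) may help fix ideas. This Python sketch (the function name and data layout are our own, not from the paper) multiplies the conditional Bernoulli terms over the conducted candidate tests and marginalizes over the disease status when the gold standard is absent:

```python
from math import prod

def subject_likelihood(y, tested, se, sp, prev, y0=None):
    """Probability of a subject's observed candidate-test outcomes given
    study-specific parameters (unified form of equation (3.6)).

    y:      dict mapping candidate-test index -> 0/1 outcome
    tested: candidate-test indices conducted on this subject
    se, sp: study-specific sensitivity/specificity, indexed like y
    prev:   study-specific disease prevalence
    y0:     gold-standard outcome, or None if the gold standard was not done
    """
    q1 = prod(se[k] if y[k] == 1 else 1 - se[k] for k in tested)   # diseased branch
    q0 = prod(1 - sp[k] if y[k] == 1 else sp[k] for k in tested)   # non-diseased branch
    if y0 is None:                     # no gold standard: law of total probability (3.5)
        return prev * q1 + (1 - prev) * q0
    return prev * q1 if y0 == 1 else (1 - prev) * q0               # (3.3)/(3.4)
```

For example, with sensitivities $(0.8, 0.6)$, specificities $(0.9, 0.7)$, prevalence 0.4 and outcomes $(1, 0)$, the marginal probability without a gold standard is $0.4 \times 0.32 + 0.6 \times 0.07 = 0.17$.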
3.3 Prior specifications and posterior estimation
In this subsection, we describe the prior distributions of the fixed effects and of $\Sigma$. A conjugate Wishart prior can be assumed for the precision matrix: $\Sigma^{-1} \sim \text{Wishart}(V, \kappa)$. Taking the degrees of freedom $\kappa$ equal to the dimension of $\boldsymbol{\nu}_i$, $2K + 1$, gives an approximately uniform prior on the correlation coefficients. Different choices of the scale matrix $V$ give relatively informative or non-informative priors on the variance parameters; specific choices of $V$ are discussed in the case studies.
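The claim that the scale matrix controls how informative the implied prior on the SD components is can be checked by Monte Carlo. The sketch below (Python/SciPy) uses the diagonal-5/off-diagonal-0.05 scale matrix adopted later in the DVT analysis; the specific Wishart parametrization, with $E[\Sigma^{-1}] = \kappa V$, is our assumption:

```python
import numpy as np
from scipy.stats import wishart

dim = 5                                  # 2K + 1 with K = 2 candidate tests
kappa = dim                              # degrees of freedom = dimension
V = np.full((dim, dim), 0.05)            # scale matrix: diagonal 5, off-diagonal 0.05
np.fill_diagonal(V, 5.0)

# Draw precision matrices Sigma^{-1} ~ Wishart(V, kappa), invert to covariances,
# and summarize the implied prior on the first SD component sigma_0
prec = wishart.rvs(df=kappa, scale=V, size=20000, random_state=7)
sigma0 = np.sqrt(np.linalg.inv(prec)[:, 0, 0])
lo, hi = np.percentile(sigma0, [2.5, 97.5])   # roughly (0.2, 15)
```

Under this parametrization, scaling the diagonal of $V$ up by a factor of 4 halves the implied SD interval, which is consistent with the prior intervals reported in the case study.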
Vague normal priors with mean 0 and variance 10 are assumed for $\mu_k$ ($k = 0, 1, \ldots, K$) and $\eta_k$ ($k = 1, \ldots, K$), which correspond to equal-tail 95% prior credible intervals (CIs) of approximately (0, 1) for the back-transformed parameters $\Phi(\mu_k)$ and $\Phi(\eta_k)$, i.e., for the prevalence, sensitivities and specificities.
Denote the above prior distributions for the fixed effects and the precision matrix by $p(\boldsymbol{\mu})$, $p(\boldsymbol{\eta})$ and $p(\Sigma^{-1})$, respectively. With these specifications, the joint posterior distribution is

$$p\left(\boldsymbol{\mu}, \boldsymbol{\eta}, \Sigma, \{\boldsymbol{\nu}_i\} \mid \mathbf{y}\right) \propto \left[\prod_{i=1}^{I} \prod_{j=1}^{n_i} \Pr(\mathbf{y}_{ij} \mid \boldsymbol{\nu}_i)\right] \left[\prod_{i=1}^{I} f(\boldsymbol{\nu}_i \mid \Sigma)\right] p(\boldsymbol{\mu})\, p(\boldsymbol{\eta})\, p(\Sigma^{-1}). \qquad (3.7)$$

Given the likelihood specification in Section 3.2, the posterior can be written as

$$p\left(\boldsymbol{\mu}, \boldsymbol{\eta}, \Sigma, \{\boldsymbol{\nu}_i\} \mid \mathbf{y}\right) \propto \prod_{i=1}^{I} \left\{ \prod_{j=1}^{n_i} \left[\pi_i q^{(1)}_{ij}\right]^{m_{ij0} y_{ij0}} \left[(1 - \pi_i) q^{(0)}_{ij}\right]^{m_{ij0}(1 - y_{ij0})} \left[\pi_i q^{(1)}_{ij} + (1 - \pi_i) q^{(0)}_{ij}\right]^{1 - m_{ij0}} f(\boldsymbol{\nu}_i \mid \Sigma) \right\} p(\boldsymbol{\mu})\, p(\boldsymbol{\eta})\, p(\Sigma^{-1}), \qquad (3.8)$$

where $q^{(1)}_{ij}$ and $q^{(0)}_{ij}$ are defined after equations (3.3) and (3.4), respectively, and $f(\boldsymbol{\nu}_i \mid \Sigma)$ is the multivariate normal density in (3.2).
We use the JAGS software via the rjags package in R to sample from the joint posterior distribution using Markov chain Monte Carlo (MCMC) methods (Lunn and others, 2000; Plummer and others, 2003). The posterior samples are drawn by Gibbs and Metropolis–Hastings algorithms. The posterior estimates are similar to the maximum likelihood estimates when the priors are non-informative, and the Bayesian approach allows for full posterior inference, so that asymptotic approximations are not required. Convergence is assessed using trace plots, sample autocorrelations and the Gelman–Rubin statistic (Gelman and Rubin, 1992).
Posterior samples of the overall disease prevalence, sensitivity and specificity of $T_k$ can be obtained from the MCMC output: $\pi = \Phi(\mu_0)$, $Se_k = \Phi(\mu_k)$ and $Sp_k = \Phi(\eta_k)$. Posterior medians of other measures of clinical interest, such as PPV, NPV and the positive and negative likelihood ratios (LR+ and LR−) of $T_k$, can also be obtained:

$$LR^{+}_k = \frac{Se_k}{1 - Sp_k}, \qquad LR^{-}_k = \frac{1 - Se_k}{Sp_k}, \qquad PPV_k = \frac{\pi\, Se_k}{\pi\, Se_k + (1 - \pi)(1 - Sp_k)}, \qquad NPV_k = \frac{(1 - \pi)\, Sp_k}{(1 - \pi)\, Sp_k + \pi\,(1 - Se_k)}.$$
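Because LR+, LR−, PPV and NPV are deterministic functions of $(Se_k, Sp_k, \pi)$, they can be computed draw by draw from the posterior samples; a plug-in sketch in Python (our own helper, not code from the paper):

```python
def derived_measures(se, sp, prev):
    """LR+, LR-, PPV and NPV from sensitivity, specificity and prevalence.
    Apply draw-by-draw to posterior samples to obtain posterior summaries."""
    lr_pos = se / (1 - sp)
    lr_neg = (1 - se) / sp
    ppv = prev * se / (prev * se + (1 - prev) * (1 - sp))
    npv = (1 - prev) * sp / ((1 - prev) * sp + prev * (1 - se))
    return lr_pos, lr_neg, ppv, npv
```

Note that posterior medians of these ratios (as reported later in Table 1) generally differ from the plug-in values computed at the posterior medians of $Se$, $Sp$ and $\pi$.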
The surface under the cumulative ranking curve (SUCRA) can be used to rank test performance while accounting for the uncertainty in ranking (Salanti and others, 2011). The SUCRA for each candidate test can be calculated as

$$\text{SUCRA}_k = \frac{\sum_{r=1}^{K-1} c_{kr}}{K - 1},$$

where $c_{kr} = \sum_{s=1}^{r} p_{ks}$ is the cumulative probability of $T_k$ ranking as the $r$th best test, and $p_{ks}$, the probability that $T_k$ ranks $s$th, can be calculated from the posterior samples. A larger SUCRA value indicates better test performance, and SUCRA = 1 (SUCRA = 0) if the test always ranks first (last). Therefore, we can pick the best test by comparing SUCRA values. This approach is most useful when the difference in preference between successive ranks is the same across the entire ranking scale; otherwise it can be misleading.
3.4 A sensitivity analysis for missingness not at random
The NMA-DT model is built upon the MAR assumption. However, this assumption may be questionable in some applications. For example, researchers may select candidate tests believed to have better performance, in which case the missing test outcomes are related to the unknown test accuracy parameters (which is MNAR). In this subsection, we present a model of missingness that can be incorporated into the NMA-DT model to account for a known MNAR mechanism. In practice, however, the true missingness mechanism is rarely known, and the MAR assumption is not testable. Thus, different models of missingness can be used in sensitivity analyses to evaluate the impact on the parameter estimates if the MAR assumption is violated.
Let the $I \times K$ matrix $M$ denote the study-level missingness of an NMA-DT dataset containing $I$ studies and $K$ candidate tests. The entries of $M$ are $M_{ik}$, $i = 1, \ldots, I$ and $k = 1, \ldots, K$, such that $M_{ik} = 1$ if $T_k$ is missing in study $i$ and $M_{ik} = 0$ otherwise. We assume $M_{ik} \sim \text{Bernoulli}(p_{ik})$, where $p_{ik}$ is the probability that $T_k$ is missing in study $i$. We specify a model of missingness for $p_{ik}$ as

$$\text{logit}(p_{ik}) = \gamma_{0k} + \gamma_{1k}\, \text{logit}(Se_{ik}) + \gamma_{2k}\, \text{logit}(Sp_{ik}),$$

where $\gamma_{1k}$ ($\gamma_{2k}$) controls the degree of association between the missing outcomes and the study-specific sensitivity (specificity). We assume non-positive $\gamma_{1k}$ and $\gamma_{2k}$, so that $T_k$ is prone to be missing when its accuracy is low. When $\gamma_{1k} = \gamma_{2k} = 0$, the outcomes of $T_k$ are MAR with respect to its sensitivity and specificity. The model of missingness can then be incorporated into the likelihood in (3.6) under different pre-specified values of $\gamma_{1k}$ and $\gamma_{2k}$. Note that the model of missingness described here is not for general MNAR scenarios but is specific to the NMA-DT problem; we only consider scenarios in which missingness is related to the test accuracy indices.
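The mapping from the association parameters to a missingness probability under a logistic model of this kind, with the accuracy parameters entering on the logit scale (consistent with the odds-ratio interpretation used later in the sensitivity analyses), can be sketched as follows (Python; the intercept value is illustrative, not from the paper):

```python
import numpy as np
from scipy.special import expit, logit

def missing_prob(se_ik, sp_ik, gamma0, gamma1, gamma2):
    """P(T_k missing in study i) under a logistic model of missingness;
    gamma1 = gamma2 = 0 recovers MAR with respect to Se and Sp."""
    return expit(gamma0 + gamma1 * logit(se_ik) + gamma2 * logit(sp_ik))

# gamma1 = -0.5 corresponds to an odds ratio of missingness of
# exp(-0.5) ~= 0.61 per 1-unit increase in logit(Se)
```

With non-positive coefficients, a more accurate test has a lower probability of being missing, which is the mechanism the sensitivity analyses probe.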
4. Case study results and sensitivity analyses: NMA of DVT tests
We analyze the NMA of DVT tests from Section 2 using the proposed NMA-DT model. In this study, we have $K = 2$ candidate tests (D-dimer and ultrasonography) and a gold standard (venography). We adopt a moderately informative Wishart prior with $\kappa = 2K + 1 = 5$ and a scale matrix $V$ with diagonal elements equal to 5 and off-diagonal elements equal to 0.05. This Wishart prior corresponds to a 95% prior CI of (0.2, 15) for the standard deviation components $\sigma_t$. We fit the model assuming vague $N(0, 10)$ priors for $\mu_k$ and $\eta_k$. Here the assumed conditional independence between candidate tests given disease status implies that any agreement between the D-dimer and ultrasound test results for a specific subject is solely a result of the subject's disease status.
After a burn-in of 10,000 iterations, 1,000,000 posterior samples are obtained. Table 1 shows the results from the proposed NMA-DT model. Figure 1 plots the joint posterior distributions and the study-specific posterior medians and 95% CIs for the prevalence, sensitivity and specificity parameters. We report posterior medians followed by 95% CIs in brackets for the rest of this article. The NMA-DT model estimates that ultrasonography has a median Se of 0.90 (0.77, 0.96) and a median Sp of 0.80 (0.54, 0.97). The D-dimer test is estimated to have moderate ability to diagnose DVT, with a median Se of 0.83 (0.68, 0.92) and a median Sp of 0.88 (0.75, 0.97). The SUCRA value is 0.2 for ultrasonography and 0.13 for D-dimer. Overall, ultrasonography is favored for detecting the diseased, with higher sensitivity, whereas D-dimer performs better at ruling out the non-diseased, with higher specificity.
Table 1.
Meta-analysis of DVT tests: posterior median estimates and 95% CIs

| | Ultrasonography | D-dimer |
|---|---|---|
| Sensitivity | 0.90 (0.77, 0.96) | 0.83 (0.68, 0.92) |
| Specificity | 0.80 (0.54, 0.97) | 0.88 (0.75, 0.97) |
| PPV | 0.84 (0.68, 0.96) | 0.84 (0.68, 0.96) |
| NPV | 0.91 (0.80, 0.97) | 0.87 (0.77, 0.94) |
| LR+ | 4.39 (1.89, 27.90) | 7.00 (3.10, 33.49) |
| LR− | 0.13 (0.05, 0.33) | 0.20 (0.09, 0.38) |
| Prevalence | 0.43 (0.36, 0.50) | |
Fig. 1.
Meta-analysis of DVT tests: forest plots and contour plots. (a) Forest plot for prevalence; (b), (c) forest plots for the sensitivity and specificity of D-dimer, respectively; (e), (f) forest plots for the sensitivity and specificity of ultrasound, respectively. The solid (dashed) lines denote the corresponding 95% credible intervals when the test is included (not included) in the study. (d) and (g) are the quantile contours of posterior sensitivity versus specificity at quantile levels 0.25, 0.5, 0.75, 0.9 and 0.95.
4.1 Sensitivity analyses for the prior distribution of $\Sigma$
Sensitivity analyses for the prior distribution of $\Sigma$ are conducted to evaluate its effect on the posterior prevalence, sensitivity and specificity. A relatively more informative Wishart prior, with scale-matrix diagonal elements equal to 20 and off-diagonal elements equal to 0.05, is used to repeat the analysis. This Wishart prior corresponds to a 95% prior CI of (0.1, 7.5) for the standard deviation components. The posterior median disease prevalence is estimated to be 0.43 (0.37, 0.49). Ultrasonography has a posterior median sensitivity of 0.89 (0.78, 0.96) and specificity of 0.79 (0.54, 0.96). The D-dimer test has a posterior median sensitivity of 0.82 (0.67, 0.92) and specificity of 0.88 (0.75, 0.97). Thus, the more informative prior yields posterior medians and 95% CIs similar to those in Table 1.
A vague Wishart prior with scale-matrix diagonal elements equal to 1 and off-diagonal elements equal to 0.05 is also used to repeat the analysis. This prior corresponds to a 95% prior CI of (0.4, 35) for the standard deviation components. The posterior median disease prevalence is 0.43 (0.33, 0.53). Ultrasonography has a posterior median sensitivity of 0.90 (0.74, 0.97) and specificity of 0.82 (0.56, 0.98). D-dimer has a posterior median sensitivity of 0.83 (0.65, 0.93) and specificity of 0.89 (0.74, 0.98). Compared with Table 1, this prior leads to wider CIs for all parameters and a slightly higher posterior median for the specificity of ultrasonography.
Overall, different choices of the Wishart prior for $\Sigma$ have little effect on the posterior medians of prevalence, sensitivity and specificity, but slightly influence the widths of their CIs.
4.2 Sensitivity analysis to the MAR assumption
The MAR assumption is untestable, but examining the observed data can be informative about its plausibility. For example, in Figure 1(b) and (c), the Se and Sp estimates for the D-dimer test are generally higher in studies 1-4, which include both the D-dimer test and the gold standard, than in the other studies, possibly reflecting a mechanism leading to MNAR.
In this section, we conduct sensitivity analyses to explore the influence on the parameter estimates when the MAR assumption is violated. We incorporate the model of missingness from Section 3.4 under different values of the association parameters $\gamma_{1k}$ and $\gamma_{2k}$: 0, $-0.5$, $-1$ and $-2$, which correspond to MAR and to odds ratios of missingness of 0.61, 0.37 and 0.13 (with respect to a 1-unit increase in the logit scale of the accuracy parameters), respectively.
The posterior medians of prevalence, sensitivities and specificities are presented in Table 2 under different missingness assumptions: MAR; missingness related to the accuracy of ultrasonography or of the D-dimer test only; missingness related to the sensitivities of both tests or to the specificities only; and missingness related to both the sensitivities and specificities of both tests. Compared with MAR, the estimates of disease prevalence are barely affected under the different assumptions. When missingness is negatively correlated with the accuracy of one of the tests, assuming MAR overestimates that test's Se and Sp while underestimating the other test's performance. When the missingness probabilities are negatively correlated with the specificities, assuming MAR overestimates the specificities, but a similar phenomenon is observed only in some settings when missingness is related to the sensitivities. When the missingness probabilities are negatively correlated with all parameters, assuming MAR generally overestimates all test accuracies (except for the sensitivity estimates when the relation is weak). The differences between the estimates under the MAR and MNAR assumptions generally grow as the association between missingness and test accuracy becomes stronger. In general, when missingness is negatively correlated with the test accuracy parameters, ignoring the model of missingness will overestimate test performance. Note that, as shown in this example, owing to the complex dependency structure of the multiple test parameters, it is hard to tell whether the other tests will be over- or underestimated when one of the tests is MNAR.
Table 2.
Meta-analysis of DVT tests: median parameter estimates and 95% CIs under different missingness assumptions. MNAR = "None" is equivalent to MAR; MNAR = "D-dimer" ("Ultrasonography") means missingness related to the sensitivity and specificity of the D-dimer test (ultrasonography); MNAR = "Se" ("Sp") means missingness related to the sensitivities (specificities) of both tests; MNAR = "All" means missingness related to the sensitivities and specificities of both tests. $\gamma_{1,\mathrm{D}}$ and $\gamma_{1,\mathrm{U}}$ ($\gamma_{2,\mathrm{D}}$ and $\gamma_{2,\mathrm{U}}$) are the association parameters for the sensitivity (specificity) of D-dimer and ultrasonography, respectively. In the original table, bold numbers indicate parameters directly related to missingness

| MNAR | $\gamma_{1,\mathrm{D}}$ | $\gamma_{1,\mathrm{U}}$ | $\gamma_{2,\mathrm{D}}$ | $\gamma_{2,\mathrm{U}}$ | Prevalence | Se: D-dimer | Se: ultrasonography | Sp: D-dimer | Sp: ultrasonography |
|---|---|---|---|---|---|---|---|---|---|
| None | 0 | 0 | 0 | 0 | 0.43 (0.36, 0.50) | 0.83 (0.68, 0.92) | 0.90 (0.77, 0.96) | 0.88 (0.75, 0.97) | 0.80 (0.54, 0.97) |
| D-dimer | -0.5 | 0 | -0.5 | 0 | 0.44 (0.37, 0.51) | 0.81 (0.58, 0.95) | 0.94 (0.84, 1) | 0.84 (0.61, 0.96) | 0.80 (0.56, 0.98) |
| Ultrasonography | 0 | -0.5 | 0 | -0.5 | 0.43 (0.36, 0.51) | 0.89 (0.75, 0.99) | 0.89 (0.66, 0.99) | 0.91 (0.78, 1) | 0.61 (0.15, 0.91) |
| Se | -0.5 | -0.5 | 0 | 0 | 0.44 (0.37, 0.52) | 0.86 (0.72, 0.96) | 0.93 (0.83, 0.99) | 0.88 (0.68, 0.99) | 0.73 (0.38, 0.96) |
| Sp | 0 | 0 | -0.5 | -0.5 | 0.43 (0.37, 0.51) | 0.85 (0.67, 0.97) | 0.91 (0.70, 0.99) | 0.88 (0.74, 0.98) | 0.76 (0.43, 0.96) |
| All | -0.5 | -0.5 | -0.5 | -0.5 | 0.43 (0.36, 0.51) | 0.85 (0.68, 0.97) | 0.92 (0.79, 0.99) | 0.87 (0.71, 0.97) | 0.72 (0.40, 0.93) |
| D-dimer | -1 | 0 | -1 | 0 | 0.44 (0.37, 0.51) | 0.78 (0.49, 0.94) | 0.95 (0.85, 1) | 0.80 (0.45, 0.96) | 0.83 (0.60, 1) |
| Ultrasonography | 0 | -1 | 0 | -1 | 0.43 (0.36, 0.50) | 0.91 (0.78, 1) | 0.88 (0.64, 0.98) | 0.92 (0.79, 1) | 0.54 (0.11, 0.89) |
| Se | -1 | -1 | 0 | 0 | 0.43 (0.37, 0.51) | 0.84 (0.63, 0.96) | 0.89 (0.63, 0.99) | 0.90 (0.77, 0.99) | 0.79 (0.51, 0.99) |
| Sp | 0 | 0 | -1 | -1 | 0.43 (0.36, 0.51) | 0.86 (0.69, 0.98) | 0.93 (0.83, 0.99) | 0.86 (0.63, 0.98) | 0.70 (0.37, 0.92) |
| All | -1 | -1 | -1 | -1 | 0.44 (0.37, 0.51) | 0.85 (0.68, 0.96) | 0.90 (0.78, 0.98) | 0.87 (0.72, 0.98) | 0.71 (0.38, 0.91) |
| D-dimer | -2 | 0 | -2 | 0 | 0.44 (0.37, 0.52) | 0.77 (0.46, 0.93) | 0.95 (0.85, 1) | 0.81 (0.49, 0.96) | 0.85 (0.62, 1) |
| Ultrasonography | 0 | -2 | 0 | -2 | 0.43 (0.36, 0.50) | 0.91 (0.77, 1) | 0.87 (0.56, 0.99) | 0.92 (0.79, 1) | 0.53 (0.11, 0.87) |
| Se | -2 | -2 | 0 | 0 | 0.44 (0.37, 0.52) | 0.81 (0.56, 0.94) | 0.84 (0.53, 0.97) | 0.93 (0.81, 0.99) | 0.83 (0.57, 0.98) |
| Sp | 0 | 0 | -2 | -2 | 0.43 (0.36, 0.51) | 0.88 (0.74, 0.98) | 0.96 (0.86, 1) | 0.83 (0.61, 0.96) | 0.68 (0.38, 0.89) |
| All | -2 | -2 | -2 | -2 | 0.44 (0.37, 0.51) | 0.82 (0.55, 0.96) | 0.88 (0.65, 0.98) | 0.86 (0.58, 0.99) | 0.69 (0.25, 0.92) |
In summary, the estimates of the diagnostic test accuracies were fairly robust under the various MNAR models we considered, and the relative ranking of the tests was also preserved across the different MNAR models. Additional model-fitting results in which the $\gamma$ parameters are treated as random are presented in the supplementary material available at Biostatistics online, showing that the $\gamma$ parameters are only weakly identified by this model.
5. Simulation
5.1 Simulation setups
Simulation studies were conducted to assess how the NMA-DT model performs under different assumptions. As in the case study, we assume $K = 2$, i.e., the whole test set contains two candidate tests ($T_1$ and $T_2$) and a gold standard ($T_0$). The Se (Sp) of $T_1$ is 0.8 (0.9) and the Se (Sp) of $T_2$ is 0.6 (0.7). The overall true disease prevalence is 0.4. We assume the random effects all have standard deviation 0.3: $\sigma_0 = \sigma_1 = \cdots = \sigma_4 = 0.3$. The correlations between prevalence and the sensitivities are set to 0.5; the correlations between prevalence and the specificities, and between the sensitivities and the specificities, are all set to $-0.5$.
We create partially missing data under the MCAR, MAR and MNAR assumptions, respectively. Under each scenario, we simulate 1000 replicates of NMA-DT datasets. Each dataset comprises 20 studies in which 100 subjects are tested by both candidate tests and the gold standard. To generate the test outcomes in each study, study-specific prevalences, sensitivities and specificities are sampled from the multivariate normal distribution (3.2) of the probit-transformed parameters, with the mean vector and covariance matrix specified above. Under the MCAR assumption, the missing indicators for all studies are prespecified such that the first five studies have no missing test outcomes, five studies are missing $T_0$, five are missing $T_1$, and five are missing $T_2$. In addition to the moderate correlation (0.5) assumption, we also investigate the model performance under weak (0.3) and strong (0.8) correlation assumptions for the probit-transformed parameters. To generate the partially missing data under the MAR and MNAR assumptions, we only prespecify the missing indicator for the gold standard, such that $T_0$ is missing in five studies. We then apply a logit model with the observed (unobserved) data as covariates to calculate the missing probabilities of $T_1$ and $T_2$ for the MAR (MNAR) scenario, which are then used to generate random missing indicators that drop candidate tests in each study. The model for the missing indicators can be described as

$$\text{logit}(p_{ik}) = \theta_{0k} + \theta_{1k}\, x_{ik}, \qquad k = 1, 2,$$

where the covariate $x_{ik}$ is observed under MAR and unobserved under MNAR. Since each study performs at least two tests, if one test is missing, the missing-data indicators of the other two tests are automatically equal to 1 (i.e., both tests are conducted). The intercept terms $\theta_{01}$ and $\theta_{02}$ are chosen to control the average number of studies missing test $k$ so that it is approximately 5 in each scenario (as in the MCAR scenario). To demonstrate the robustness of our model under the MAR assumption, we create a case where the missing probabilities depend strongly on the observed prevalences. Under MNAR, both methods are biased; we choose one representative set of coefficient values as an example to investigate the impact of MNAR on the estimates from each method. Cross-classified cell counts can be collected for each study to present the observed data, as in the case studies.
We compare the performance of the NMA-DT model with a "naive" approach. The "naive" method applies the trivariate generalized linear mixed model (TGLMM) (Chu and others, 2009) to studies reporting both the first candidate test and the gold standard, accounting for potential correlations between disease prevalence and test accuracy parameters. Specifically, studies reporting only the second candidate test and the gold standard, and studies reporting only the two candidate tests, are excluded from the naive analysis. The naive analysis is not applied to the second candidate test because the two candidate tests are exchangeable. Outcomes of the second candidate test in studies reporting all three tests are ignored, and only the 2 x 2 tables cross-classifying outcomes of the first candidate test and the gold standard are used to fit the trivariate GLMM. In total, 10 of the 20 studies in each dataset are used to evaluate the performance of the first candidate test in the naive approach. The estimates of the fixed effects for the prevalence, sensitivity, and specificity of the first candidate test are compared with the estimates from the NMA-DT model.
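The data reduction performed by the naive approach can be made concrete with a small sketch: in a study reporting all three tests, the second candidate test's outcomes are discarded and only the 2 x 2 cross-classification of the first candidate test against the gold standard is retained. The simulated outcomes below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical complete data for one study: true status d (gold standard)
# and outcomes t1, t2 of the two candidate tests.
rng = np.random.default_rng(0)
d = rng.binomial(1, 0.4, 100)
t1 = np.where(d == 1, rng.binomial(1, 0.8, 100), rng.binomial(1, 0.1, 100))
t2 = np.where(d == 1, rng.binomial(1, 0.6, 100), rng.binomial(1, 0.3, 100))

def crossclassify(test, truth):
    """2x2 table [[TP, FN], [FP, TN]] of a candidate test vs the gold standard."""
    tp = int(np.sum((test == 1) & (truth == 1)))
    fn = int(np.sum((test == 0) & (truth == 1)))
    fp = int(np.sum((test == 1) & (truth == 0)))
    tn = int(np.sum((test == 0) & (truth == 0)))
    return [[tp, fn], [fp, tn]]

# The naive analysis ignores t2 entirely and keeps only the t1-vs-gold table.
table_t1 = crossclassify(t1, d)
```

The NMA-DT model, in contrast, would retain the full three-way cross-classification, which is what allows it to borrow strength from the second candidate test's outcomes.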
5.2 Simulation results
Table 3 summarizes the bias, mean squared error (MSE), and 95% CI coverage probability (CP) of the fixed effects estimates from the proposed NMA-DT model (column "NMA-DT"). Across the different assumptions, the NMA-DT model provides nearly unbiased estimates for all parameters with small MSE. In general, the estimates become somewhat more biased under the MAR and MNAR assumptions, and as the correlation becomes stronger. The coverage probabilities remain close to the nominal level of 0.95 in all scenarios, although they decrease under the MNAR assumption.
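Given a point estimate and standard error from each simulated replicate, the three performance measures reported in Table 3 can be computed as follows; the replicate values below are synthetic, for illustration only.

```python
import numpy as np

def summarize(estimates, se, true_value):
    """Bias, MSE, and 95% CI coverage probability across simulation replicates."""
    estimates, se = np.asarray(estimates), np.asarray(se)
    bias = estimates.mean() - true_value
    mse = np.mean((estimates - true_value) ** 2)
    lo, hi = estimates - 1.96 * se, estimates + 1.96 * se   # Wald 95% intervals
    cp = np.mean((lo <= true_value) & (true_value <= hi))
    return bias, mse, cp

# Toy check with 1000 synthetic, unbiased replicate estimates.
rng = np.random.default_rng(1)
true_mu = 0.84
est = true_mu + 0.1 * rng.standard_normal(1000)
bias, mse, cp = summarize(est, np.full(1000, 0.1), true_mu)
```

With correctly stated standard errors, the coverage probability should sit near the nominal 0.95, which is the benchmark used in the table.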
Table 3.
Simulation results: bias, mean squared error (MSE), and 95% CI coverage probabilities (CP) of the fixed-effects estimates. Estimates from the proposed NMA-DT model and the "naive" method are compared; the naive method is applied only to the first candidate test. Rows give the prevalence and the sensitivity (Se) and specificity (Sp) of each candidate test, with true values on the probit scale in parentheses.
| Parameter (true) | NMA-DT Bias | NMA-DT MSE | NMA-DT CP | Naive Bias | Naive MSE | Naive CP |
|---|---|---|---|---|---|---|
| MCAR, weak correlation (0.3) | | | | | | |
| Prevalence (-0.25) | 0.001 | 0.008 | 0.957 | 0.001 | 0.011 | 0.976 |
| Se, test 1 (0.84) | 0.005 | 0.015 | 0.965 | 0.008 | 0.017 | 0.975 |
| Sp, test 1 (1.28) | -0.001 | 0.013 | 0.966 | 0.003 | 0.015 | 0.978 |
| Se, test 2 (0.25) | 0.006 | 0.012 | 0.963 | | | |
| Sp, test 2 (0.52) | 0.003 | 0.010 | 0.966 | | | |
| MCAR, moderate correlation (0.5) | | | | | | |
| Prevalence (-0.25) | 0.001 | 0.005 | 0.967 | 0.001 | 0.011 | 0.962 |
| Se, test 1 (0.84) | 0.008 | 0.014 | 0.961 | 0.011 | 0.017 | 0.956 |
| Sp, test 1 (1.28) | 0.008 | 0.014 | 0.957 | 0.013 | 0.017 | 0.958 |
| Se, test 2 (0.25) | 0.007 | 0.010 | 0.955 | | | |
| Sp, test 2 (0.52) | 0.007 | 0.009 | 0.959 | | | |
| MCAR, strong correlation (0.8) | | | | | | |
| Prevalence (-0.25) | -0.006 | 0.007 | 0.964 | -0.005 | 0.011 | 0.972 |
| Se, test 1 (0.84) | 0.010 | 0.013 | 0.972 | 0.014 | 0.016 | 0.973 |
| Sp, test 1 (1.28) | 0.021 | 0.014 | 0.972 | 0.023 | 0.016 | 0.971 |
| Se, test 2 (0.25) | 0.007 | 0.011 | 0.971 | | | |
| Sp, test 2 (0.52) | 0.010 | 0.010 | 0.969 | | | |
| MAR, moderate correlation (0.5) | | | | | | |
| Prevalence (-0.25) | -0.009 | 0.006 | 0.962 | 0.230 | 0.059 | 0.534 |
| Se, test 1 (0.84) | 0.055 | 0.021 | 0.972 | 0.116 | 0.033 | 0.920 |
| Sp, test 1 (1.28) | -0.040 | 0.020 | 0.967 | -0.105 | 0.030 | 0.928 |
| Se, test 2 (0.25) | 0.012 | 0.007 | 0.967 | | | |
| Sp, test 2 (0.52) | 0.005 | 0.007 | 0.955 | | | |
| MNAR, moderate correlation (0.5) | | | | | | |
| Prevalence (-0.25) | -0.009 | 0.006 | 0.954 | 0.047 | 0.012 | 0.959 |
| Se, test 1 (0.84) | 0.073 | 0.018 | 0.932 | 0.104 | 0.024 | 0.926 |
| Sp, test 1 (1.28) | -0.015 | 0.013 | 0.966 | -0.036 | 0.016 | 0.968 |
| Se, test 2 (0.25) | 0.032 | 0.010 | 0.959 | | | |
| Sp, test 2 (0.52) | 0.007 | 0.009 | 0.955 | | | |
The fixed-effects estimates for the prevalence, sensitivity, and specificity of the first candidate test from the "naive" approach are also summarized in Table 3 (column "Naive"). Under the MCAR assumption, both methods perform reasonably well, although NMA-DT is slightly more efficient because it borrows strength from the indirect evidence. Under the MAR and MNAR assumptions, however, the estimates from the naive method are substantially more biased than those from the NMA-DT model, and their coverage probabilities can fall well below 0.95 when the missingness is strongly associated with the observed or unobserved data. These observations are expected because the underlying assumption of NMA-DT is MAR, whereas the naive approach assumes MCAR.
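The true fixed-effect values in Table 3 are on the probit scale (Section 3.2); mapping them through the standard normal CDF recovers the corresponding probabilities, which makes the parameter magnitudes easier to read.

```python
from math import erf, sqrt

def probit_inv(x):
    """Standard normal CDF: maps a probit-scale parameter to a probability."""
    return 0.5 * (1 + erf(x / sqrt(2)))

prevalence = probit_inv(-0.25)   # roughly 0.40
sensitivity = probit_inv(0.84)   # roughly 0.80
specificity = probit_inv(1.28)   # roughly 0.90
```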
6. Discussion
There is growing interest in simultaneously comparing the performance of multiple diagnostic tests in a network meta-analysis setting. However, due to the mixture of study designs, the variety of reported test outcomes, between-study heterogeneity, and the complex correlation structure of multiple test outcomes, methodological development for NMA-DT remains challenging. In this article, we presented a Bayesian hierarchical NMA-DT framework that unifies all three types of study designs into the multiple test comparison design using a missing data framework. In addition, it can provide ranks of diagnostic tests to guide clinical decision making. Through simulation studies, we have shown that the proposed method provides unbiased estimates for prevalence and test accuracy. It is also more efficient than a commonly used "naive" approach that performs separate meta-analyses for each candidate test.
The NMA-DT model relies on a "consistency" assumption: candidate tests are assumed to perform consistently on subjects assigned and not assigned to each test. Inconsistency could arise, however, when studies that do not include a given test enroll a population for whom that test is inappropriate, so that its performance differs systematically from studies that do include the test. In this situation, the MAR assumption is questionable and borrowing information across studies must be done with caution. The concern of inconsistency is also discussed for contrast-based NMA methods (Lu and Ades, 2006), where indirect evidence may be inconsistent with direct evidence. White and others (2012) proposed frequentist ways to estimate consistency and inconsistency models by expressing them as multivariate random-effects meta-regressions. Lu and Ades (2006) proposed using inconsistency degrees of freedom to estimate the degree of inconsistency in evidence cycles. However, this method cannot be directly applied to NMA-DT because it is restricted to relative effects (e.g., log odds ratios), while NMA-DT estimates marginal test accuracies (e.g., Se and Sp). Researchers have also been developing methods to detect inconsistency in arm-based NMA models (Hong and others, 2016a; Zhao and others, 2016), and the comparison of contrast-based and arm-based NMA methods has been discussed extensively (Dias and Ades, 2016; Hong and others, 2016b). Further research is needed to develop a formal test of inconsistency in NMA-DT.
An assumption made in the proposed model is that test results are conditionally independent given the true disease status and all study-specific diagnostic accuracy parameters. This assumption may be violated when two candidate tests are based on a similar biological mechanism (Vacek, 1985). Attempts have been made to account for such dependence through a correlation parameter (Chu, Chen and Louis, 2009), an additional latent-class random effect (Qu and others, 1996), or multivariate probit models (Xu and Craig, 2009). However, these cannot be directly applied to the NMA-DT model, because correlation parameters are only suitable for pairwise comparisons and only a small portion of the studies in an NMA-DT may be subject to conditional dependence. Specifically, for studies adopting the randomized design, each candidate test is compared with the gold standard, so the conditional independence assumption is not required. For studies adopting the multiple test comparison design, conditional dependence may become a concern, since several candidate tests are compared simultaneously. Similarly, non-comparative designs may also suffer from conditional dependence, but only when the gold standard is not involved and subjects are tested by two candidate tests. How to adjust for conditional dependence in NMA-DT is therefore a topic for future research.
A concern raised by combining studies in a systematic review is how to correctly measure between-study heterogeneity. In this article, generalized linear mixed models are used to account for heterogeneity in a Bayesian framework, where the posterior estimate of the random-effects covariance matrix measures the extent of heterogeneity. An inverse-Wishart prior is used for the covariance matrix, but it is limited in that the variance components are always positive. Another limitation is that, as the dimension of the covariance matrix grows, it imposes an unstructured covariance matrix, whereas a structured correlation assumption may be more efficient. Applications of alternative priors for the covariance matrix (Daniels and Kass, 1999) in NMA-DT deserve further research.
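The positivity property noted above can be seen directly by simulation: every draw from an inverse-Wishart distribution is positive definite, so its diagonal variance components are strictly positive. A minimal sketch, using the standard construction of an inverse-Wishart draw as the inverse of a Wishart draw; the degrees of freedom and scale matrix here are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_inv_wishart(df, psi, rng):
    """Draw from inverse-Wishart(df, psi) by inverting a Wishart(df, psi^{-1}) draw."""
    dim = psi.shape[0]
    z = rng.multivariate_normal(np.zeros(dim), np.linalg.inv(psi), size=df)
    w = z.T @ z                      # Wishart(df, psi^{-1}) draw (df >= dim)
    return np.linalg.inv(w)

psi = np.eye(5)                      # illustrative 5x5 scale matrix
draws = [sample_inv_wishart(df=10, psi=psi, rng=rng) for _ in range(200)]

# Diagonal entries (variance components) are positive in every draw,
# which is the constraint discussed in the text.
min_diag = min(d.diagonal().min() for d in draws)
```

This is why an inverse-Wishart prior cannot place mass on a zero variance component, unlike, say, half-normal priors on individual standard deviations.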
Finally, we note that although a multivariate normal distribution is assumed for the random effects in the proposed model, it is straightforward to extend the model to other multivariate distributions, including distributions generated from copulas.
Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Acknowledgments
We thank the associate editor and an anonymous reviewer for many constructive comments. Conflict of Interest: None declared.
Funding
Research reported in this publication was supported in part by NIAID R21 AI103012 (H.C., X.M.), NIDCR R03 DE024750 (H.C.), NLM R21 LM012197 (H.C.), NIDDK U01 DK106786 (H.C.), and NHLBI T32HL129956 (Q.L). The content is solely the responsibility of the authors and does not necessarily represent official views of the National Institutes of Health.
References
- Chen Y., Liu Y., Ning J., Cormier J. and Chu H. (2015). A hybrid model for combining case–control and cohort studies in systematic reviews of diagnostic tests. Journal of the Royal Statistical Society: Series C (Applied Statistics) 64, 469–489. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chu H., Chen S. and Louis T. A. (2009). Random effects models in a meta-analysis of the accuracy of two diagnostic tests without a gold standard. Journal of the American Statistical Association 104, 512–523. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chu H. and Cole S. R. (2006). Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed model approach. Journal of Clinical Epidemiology 59, 1331–1332. [DOI] [PubMed] [Google Scholar]
- Chu H., Nie L., Cole S. R. and Poole C. (2009). Meta-analysis of diagnostic accuracy studies accounting for disease prevalence: Alternative parameterizations and model selection. Statistics in Medicine 28, 2384–2399. [DOI] [PubMed] [Google Scholar]
- Daniels M. J. and Kass R. E. (1999). Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. Journal of the American Statistical Association 94, 1254–1263. [Google Scholar]
- Dendukuri N., Schiller I., Joseph L. and Pai M. (2012). Bayesian meta-analysis of the accuracy of a test for tuberculous pleuritis in the absence of a gold standard reference, Biometrics 68, 1285–1293. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dias S. and Ades A. E. (2016). Absolute or relative effects? Arm-based synthesis of trial data, Research Synthesis Methods 7, 23–28. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gelman A. and Rubin D. B. (1992). Inference from iterative simulation using multiple sequences, Statistical Science 7, 457–472. [Google Scholar]
- Hamza T. H., Reitsma J. B. and Stijnen T. (2008). Meta-analysis of diagnostic studies: a comparison of random intercept, normal-normal, and binomial-normal bivariate summary ROC approaches, Medical Decision Making 28, 639–649. [DOI] [PubMed] [Google Scholar]
- Harbord R. M., Deeks J. J., Egger M., Whiting P. and Sterne J. A. (2007). A unification of models for meta-analysis of diagnostic accuracy studies, Biostatistics 8, 239–251. [DOI] [PubMed] [Google Scholar]
- Hong H., Chu H., Zhang J. and Carlin B. P. (2016a), A Bayesian missing data framework for generalized multiple outcome mixed treatment comparisons, Research Synthesis Methods 7, 6–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hong H., Chu H., Zhang J. and Carlin B. P. (2016b). Rejoinder to the discussion of a Bayesian missing data framework for generalized multiple outcome mixed treatment comparisons, by Dias S. and Ades AE. Research Synthesis Methods 7, 29–33. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kang J., Brant R. and Ghali W. A. (2013). Statistical methods for the meta-analysis of diagnostic tests must take into account the use of surrogate standards, Journal of Clinical Epidemiology 66, 566–574. [DOI] [PubMed] [Google Scholar]
- Leeflang M. M., Bossuyt P. M. and Irwig L. (2009). Diagnostic test accuracy may vary with prevalence: implications for evidence-based diagnosis, Journal of Clinical Epidemiology 62, 5–12. [DOI] [PubMed] [Google Scholar]
- Little R. J. and Rubin D. (2002). Statistical Analysis with Missing Data, 2nd edition.Hoboken, NJ: John Wiley & Sons. [Google Scholar]
- Liu Y., Chen Y. and Chu H. (2015). A unification of models for meta-analysis of diagnostic accuracy studies without a gold standard, Biometrics 71, 538–547. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu G. and Ades A. (2004). Combination of direct and indirect evidence in mixed treatment comparisons, Statistics in Medicine 23, 3105–3124. [DOI] [PubMed] [Google Scholar]
- Lu G. and Ades A. (2006). Assessing evidence inconsistency in mixed treatment comparisons, Journal of the American Statistical Association 101, 447–459. [Google Scholar]
- Lunn D. J., Thomas A., Best N. and Spiegelhalter D. (2000). WinBUGS-a Bayesian modelling framework: concepts, structure, and extensibility, Statistics and Computing 10, 325–337. [Google Scholar]
- Ma X., Chen Y., Cole S. R. and Chu H. (2016a). A hybrid Bayesian hierarchical model combining cohort and case-control studies for meta-analysis of diagnostic tests: accounting for partial verification bias, Statistical Methods in Medical Research 25, 3015–3037. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ma X., Nie L., Cole S. R. and Chu H. (2016b). Statistical methods for multivariate meta-analysis of diagnostic tests: an overview and tutorial, Statistical Methods in Medical Research 25, 1596–1619. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pepe M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction, chapter 2. Oxford: Oxford University Press. [Google Scholar]
- Plummer M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In: Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). pp. 20–22. [Google Scholar]
- Qu Y., Tan M. and Kutner M. H. (1996). Random effects models in latent class analysis for evaluating accuracy of diagnostic tests, Biometrics 52, 797–810. [PubMed] [Google Scholar]
- Reitsma J. B., Glas A. S., Rutjes A. W., Scholten R. J., Bossuyt P. M. and Zwinderman A. H. (2005). Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews, Journal of Clinical Epidemiology 58, 982–990. [DOI] [PubMed] [Google Scholar]
- Rutter C. M. and Gatsonis C. A. (2001). A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations, Statistics in Medicine 20, 2865–2884. [DOI] [PubMed] [Google Scholar]
- Salanti G., Ades A. and Ioannidis J. (2011). Graphical methods and numerical summaries for presenting results from multiple-treatment meta-analysis: an overview and tutorial, Journal of Clinical Epidemiology 64, 163–171. [DOI] [PubMed] [Google Scholar]
- Takwoingi Y., Leeflang M. M. and Deeks J. J. (2013). Empirical evidence of the importance of comparative studies of diagnostic test accuracy, Annals of Internal Medicine 158, 544–554. [DOI] [PubMed] [Google Scholar]
- Vacek P. M. (1985). The effect of conditional dependence on the evaluation of diagnostic tests, Biometrics 41, 959–968. [PubMed] [Google Scholar]
- Venta E. R. and Venta L. A. (1987). The diagnosis of deep-vein thrombosis: an application of decision analysis, Journal of the Operational Research Society 38, 615–624. [Google Scholar]
- White I. R., Barrett J. K., Jackson D. and Higgins J. (2012). Consistency and inconsistency in network meta-analysis: model estimation using multivariate meta-regression, Research Synthesis Methods 3, 111–125. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu H. and Craig B. A. (2009). A probit latent class model with general correlation structures for evaluating accuracy of diagnostic tests, Biometrics 65, 1145–1155. [DOI] [PubMed] [Google Scholar]
- Zhang J., Carlin B. P., Neaton J. D., Soon G. G., Nie L., Kane R., Virnig B. A. and Chu H. (2014). Network meta-analysis of randomized clinical trials: reporting the proper summaries, Clinical Trials 11, 246–262. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao H., Hodges J. S., Ma H., Jiang Q. and Carlin B. P. (2016). Hierarchical Bayesian approaches for detecting inconsistency in network meta-analysis, Statistics in Medicine 35, 3524–3536. [DOI] [PubMed] [Google Scholar]