Journal of Applied Statistics. 2022 Feb 25;50(6):1418–1434. doi: 10.1080/02664763.2022.2041565

On summary ROC curve for dichotomous diagnostic studies: an application to meta-analysis of COVID-19

ShengLi Tzeng, Chun-Shu Chen, Yu-Fen Li, Jin-Hua Chen
PMCID: PMC10071901  PMID: 37025283

Abstract

In a systematic review of diagnostic performance, summarizing performance metrics is crucial. There are various summary models in the literature, and hence model selection becomes inevitable. However, most existing model selection approaches rely on large-sample theory and may not suit a meta-analysis of diagnostic studies, which typically includes only a small number of studies. Researchers need an effective way to determine the final model for further inference, which motivates this article to investigate existing methods and to suggest a more robust one. We considered models covering several widely used methods for the bivariate summary of sensitivity and specificity. Simulation studies were conducted with different numbers of studies and different population sensitivities and specificities. Final models were then selected using several existing criteria, and we compared the resulting summary receiver operating characteristic (sROC) curves with the theoretical ROC curve under the generating model. Even though parametric likelihood-based criteria are often applied in practice for their asymptotic properties, they fail to consistently choose appropriate models when the number of studies is limited. When the number of studies is as small as 10 or 5, our suggested model performs best across the scenarios considered. An example of summary ROC curves for chemiluminescence immunoassay (CLIA) used in COVID-19 diagnosis is also given.

Keywords: Summary ROC curve, sensitivity, specificity, meta analysis, systematic review, random effects

1. Introduction

Summarizing information on test performance metrics, such as sensitivity, specificity, and diagnostic odds ratios (DOR), is an important part of a systematic review of a medical test performance on clinical outcomes. Through a meta-analysis on clinical studies of diagnostic tests, we may investigate hypotheses about the test performance that cannot be answered by an individual study. Sotiriadis et al. [33] recommended a guideline for systematic review of diagnostic test accuracy studies, as a counterpart to Cochrane Handbook [21] and PRISMA [32] widely used in general systematic review.

Most diagnostic tests are used to separate patients into two groups, test positive and test negative (i.e. T+/T−). The underlying diagnostic result can be a real value (continuous output), in which case the classification boundary between the two groups must be determined by a threshold value, usually based on a receiver operating characteristic (ROC) curve [35]. Accordingly, there are four possible outcomes from a dichotomized test when a gold standard is available. If the true disease status of a subject is positive (D+), a T+ classification is called a true positive (TP), while a T− result is called a false negative (FN). Conversely, given a true negative disease status of a subject (D−), a T− classification results in a true negative (TN) and a T+ result gives a false positive (FP).

When a gold standard exists, test accuracy is estimated as the proportion of diseased individuals who are ‘test positive’ (sensitivity) and of non-diseased individuals who are ‘test negative’ (specificity) [14,23,24]. For systematic reviews of dichotomous diagnostic studies, the ‘data’ merely consist of nTP, nFN, nFP, and nTN for each study involved, i.e. the numbers of subjects classified as TP, FN, FP, or TN, respectively. The corresponding results are usually reported for a particular threshold, as used in clinical practice. It is improper to simply sum these four numbers across studies to derive summary estimates of sensitivity, specificity and DOR, because the summary statistics would then be dominated by the few studies with the largest sample sizes [17].
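To make these definitions concrete, the following minimal Python sketch computes sensitivity, specificity and the DOR from one 2×2 table, and contrasts per-study values with the naive pooled sums. The three studies and their counts are hypothetical, and the function name `accuracy_measures` is ours; the sketch assumes no cell is zero, since the DOR is otherwise undefined.

```python
def accuracy_measures(n_tp, n_fn, n_fp, n_tn):
    """Return (sensitivity, specificity, DOR) for one 2x2 table.

    Assumes all four counts are positive (no continuity correction).
    """
    sensitivity = n_tp / (n_tp + n_fn)
    specificity = n_tn / (n_tn + n_fp)
    dor = (n_tp * n_tn) / (n_fp * n_fn)
    return sensitivity, specificity, dor


# Hypothetical counts (TP, FN, FP, TN): two small studies and one large study.
studies = [(45, 5, 8, 72), (30, 10, 4, 56), (400, 100, 150, 850)]

per_study = [accuracy_measures(*s) for s in studies]

# Naive pooling: sum each cell across studies, then compute the measures once.
pooled_counts = tuple(sum(col) for col in zip(*studies))
pooled = accuracy_measures(*pooled_counts)

for s, (sens, spec, dor) in zip(studies, per_study):
    print(s, round(sens, 3), round(spec, 3), round(dor, 1))
print("pooled", pooled_counts, [round(v, 3) for v in pooled])
```

In this toy example the pooled sensitivity sits close to that of the large study, which is the dominance problem described above.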

Another naive summary is to pool sensitivity and specificity separately using standard meta-analyses for proportions. However, Walter and Jadad [38] and Moses et al. [26] showed that sensitivity and specificity are often negatively correlated, usually because studies use different thresholds to define T+ and T−. Even though this ‘separate summary’ method is sometimes recommended [8,34], ignoring the correlation between sensitivity and specificity can lead to inaccurate inference or even misleading claims.

Chappell et al. [8] and Trikalinos et al. [34] discussed three useful ways of summarizing medical test studies: ‘separate summary’, ‘summary point’, and ‘summary line’. They also gave advice on when to use each kind of summary. Various models have been proposed in the literature for a meaningful summary ‘point’ or ‘line’ across all studies [22,26,28,29]. The methods of Reitsma et al. [28] and Rutter and Gatsonis [29] have become almost the de facto standard for a ‘summary point’ or a ‘summary line’, and Harbord et al. [20] showed their equivalence when no covariates are included. If more studies are available (usually more than 30), some sophisticated extensions attempt to incorporate other sources of heterogeneity, such as disease prevalence [11], latent subgroups [30] or measurement errors [19]. With so many alternative methods, a natural and important question is how to select a suitable model. Recently, Doebler et al. [15] integrated a wide range of models into a unified parametric linear mixed model framework after transformation, using likelihood maximization to estimate model parameters. It covers the models of Chu et al. [10], Reitsma et al. [28], Rutter and Gatsonis [29], and Holling et al. [22] as special cases. Furthermore, likelihood-based criteria can be used to select a ‘best’ model, among which the Akaike information criterion (AIC), with its nice asymptotic properties, is the most often used [5]. A related criterion, the Bayesian information criterion (BIC) or Schwarz information criterion [31], is often used too. When a set of candidate models is considered, we may choose the model with the smallest AIC or BIC value and base statistical inference on it. As discussed in Section 2.2, in the cases considered here AIC and BIC always select the same model. Hence in this paper we discuss only AIC.

Nevertheless, in practice such meta-analyses are more often than not restricted to a small sample size N, where N is the final number of studies included in the systematic review. For the same medical test design, there are usually not many compatible studies, and hence N is too small for the relevant asymptotic theory to apply. Several studies proposed the conditional AIC (cAIC) in linear mixed models, which is a model selection method tailored to small N [18,25,36]. In addition, an empirical likelihood (EL) method analogous to the work of Owen [27] can also be used for small N in practice. In this article, we focus on the issue of model selection for ‘summary line’ situations. The key questions here are: (a) whether AIC gives an acceptable result, (b) which selection criterion (e.g. AIC, cAIC, or EL) performs better, and (c) whether there exists a summary method that performs satisfactorily under various situations, especially when N is small.

The rest of the article is organized as follows. Section 2 first reviews several commonly used models and then describes some existing model selection criteria. The effectiveness of these model selection methods is compared through simulations in Section 3. An application to coronavirus disease 2019 (COVID-19) diagnosis is given in Section 4. Finally, we conclude with some discussion in Section 5.

2. Model families and selection criteria

This section introduces the fundamentals of this study. We review commonly used models and describe several existing model selection criteria. All these methods are then investigated using simulations in Section 3.

2.1. Models to be considered

This section briefly describes the two model families under the $t_\alpha$ transformation as presented in [15] and their relations to other approaches. Details of the models can be found in the original literature.

A class of monotonic transformation functions controlled by $\alpha \in [0,2]$ for $x \in (0,1)$ is defined as follows [15]:

$$t_\alpha(x) = \alpha\log(x) - (2-\alpha)\log(1-x).$$
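As an illustration, the transformation $t_\alpha(x) = \alpha\log(x) - (2-\alpha)\log(1-x)$ and a numerical inverse can be sketched in a few lines of Python. The function names are ours, and the bisection inverse is only one simple option; it relies on $t_\alpha$ being increasing in $x$ for $\alpha \in [0,2]$. For $\alpha = 0$ or $\alpha = 2$ the range of $t_\alpha$ is only half of the real line, so the inverse is meaningful only for values inside that range.

```python
import math


def t_alpha(x, alpha):
    """t_alpha(x) = alpha*log(x) - (2 - alpha)*log(1 - x) for 0 < x < 1."""
    return alpha * math.log(x) - (2.0 - alpha) * math.log(1.0 - x)


def t_alpha_inverse(y, alpha, iterations=100):
    """Invert t_alpha by bisection, using that t_alpha is increasing on (0, 1)."""
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if t_alpha(mid, alpha) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


logit_09 = t_alpha(0.9, 1.0)  # alpha = 1 reproduces the logit of 0.9
roundtrip = t_alpha_inverse(t_alpha(0.3, 1.4), 1.4)  # should recover 0.3
print(logit_09, roundtrip)
```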

Let $p_i$ and $q_i$ be the unobserved true sensitivity and false positive rate (1-specificity) for the $i$th study, respectively. With a pair of transformation parameters $(\alpha_p, \alpha_q)$, the two transformed variables $(t_{\alpha_p}(p_i), t_{\alpha_q}(q_i))$ are then assumed to follow a bivariate normal distribution with mean $\mu = (\mu_p, \mu_q)$ and covariance matrix

$$\Sigma_i = \begin{pmatrix} \sigma_p^2 & \sigma \\ \sigma & \sigma_q^2 \end{pmatrix} + \begin{pmatrix} d_{ip}^2 & 0 \\ 0 & d_{iq}^2 \end{pmatrix}. \tag{1}$$

Following Doebler et al. [15], we consider two families of models by setting $d_{ip}^2$ and $d_{iq}^2$ in (1) to different values. The first family of models uses fixed $d_{ip}^2 = d_{iq}^2 = 0$, while the second family takes study heteroscedasticity into account with $d_{ip}^2 = \widehat{\mathrm{Var}}\left(\frac{n_{TP_i}}{n_{TP_i}+n_{FN_i}}\right)$ and $d_{iq}^2 = \widehat{\mathrm{Var}}\left(\frac{n_{FP_i}}{n_{TN_i}+n_{FP_i}}\right)$.

It has been pointed out that $t_\alpha$ corresponds to the logit transformation when $\alpha = 1$ and to the log transformation when $\alpha = 2$. Moreover, $t_\alpha$ is approximately proportional to the complementary logarithmic function when $\alpha$ is around 0.6. On the other hand, $t_\alpha(x)$ can be regarded as $\log(1-x)$ and complementary log$(1-x)$ if $\alpha = 0$ and $\alpha = 1.4$, respectively.
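The exact special cases can be checked by direct substitution into $t_\alpha(x) = \alpha\log(x) - (2-\alpha)\log(1-x)$. The short derivation below covers only $\alpha = 0, 1, 2$; the cases $\alpha \approx 0.6$ and $\alpha = 1.4$ are approximations and are not derived here.

```latex
\begin{align*}
t_1(x) &= \log(x) - \log(1-x) = \operatorname{logit}(x), \\
t_2(x) &= 2\log(x), \\
t_0(x) &= -2\log(1-x).
\end{align*}
```

So $\alpha = 2$ and $\alpha = 0$ give the log transformation of $x$ and of $1-x$, respectively, up to a constant factor.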

When $\alpha_p = \alpha_q = 2$, the first family of models corresponds to the Lehmann family or proportional hazard models of Holling et al. [22]. If logit($p_i$) and logit($q_i$) are assumed to follow a bivariate normal distribution, the first family of models with $\alpha_p = \alpha_q = 1$ coincides with the summary ROC method of Moses et al. [26], but under a different parameterization. Hereafter, the corresponding model is called the MSL method.

When $\alpha_p = \alpha_q = 1$, the second family of models is equivalent to the bivariate model [28] and the hierarchical summary ROC method (HSROC) of [29]. Furthermore, the second family of models approaches the complementary logarithmic models of [10] when $\alpha_p = \alpha_q = 1.4$.

Aside from $(\alpha_p, \alpha_q)$, each family involves five common parameters $\theta = (\mu_p, \mu_q, \sigma_p^2, \sigma_q^2, \sigma)$. Maximum likelihood (ML) or restricted maximum likelihood (REML) methods can be used to estimate the model parameters. For a larger $N$, $(\alpha_p, \alpha_q)$ can also be estimated as additional parameters.

2.2. Model selection criteria

The work of [15] reviewed in the previous subsection generalized several widely used models. For $N \geq 20$, they also showed that it is possible to recover $(\alpha_p, \alpha_q)$ by treating them as free parameters. Nevertheless, they acknowledged that it is hard to estimate $(\alpha_p, \alpha_q)$ for $N \leq 10$, and they suggested that these two quantities be fixed for a small $N$. In practice, neither treating $(\alpha_p, \alpha_q)$ as fixed nor treating them as free removes the difficulty: an analyst still needs a way to determine suitable values of $(\alpha_p, \alpha_q)$ for the transformation. Additionally, although they proposed two useful families of models, little is known about how to select between them, especially when the sample size $N$ is small. Therefore, a good model selection strategy is important.

Model selection can be viewed as a selection of both the model assumptions and the estimated parameters, which amounts to a choice of underlying probabilistic mechanism. Model selection (or variable selection) in linear regression and generalized linear models has been studied extensively [5,12,16]. Unfortunately, the models considered here have no variables to be selected and the key structure, the covariance matrix Σi, is heavily affected by (αp,αq). Selection of the covariance structure in a linear mixed model is still a very open research area. Yet, the two families considered here are linear mixed models only after transformation, which raises another challenge for us. Therefore, the results would be doubtful if one directly applies the existing model selection methods to select among the two families.

In what follows, a ‘model’ indicates a triplet of $\alpha_p$, $\alpha_q$, and a model family index (i.e. 1st or 2nd). We shall consider several model selection criteria and compare their performance based on simulations. The first one is AIC [1], which was also examined in Doebler et al. [15]. Let $l_M(\hat{\theta})$ be the log-likelihood for a model $M$, $\hat{\theta}$ be the corresponding estimates of the model parameters, and $k$ be the number of parameters in $M$. Then AIC is defined as $-2l_M(\hat{\theta}) + 2k$. Since each model has 5 parameters, minimizing AIC amounts to choosing the model with the largest $l_M(\hat{\theta})$, where $\hat{\theta}$ is obtained by ML or REML. BIC, defined as $-2l_M(\hat{\theta}) + k\log(N)$, will also choose the maximum-likelihood model, since $k\log(N)$ is the same for all models. Thus AIC and BIC are equivalent for the cases considered. We use the term AIC simply because Doebler et al. [15] also suggested this criterion.
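The equivalence of AIC and BIC when every candidate has the same number of parameters can be seen in a small Python sketch. The log-likelihood values below are hypothetical placeholders, not results from the paper, and the dictionary keys simply label candidate triplets.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: -2*loglik + 2*k."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian information criterion: -2*loglik + k*log(n)."""
    return -2.0 * loglik + k * math.log(n)


# Hypothetical maximized log-likelihoods for three candidate triplets.
logliks = {"(1, 1, 1)": -12.4, "(1.4, 1.4, 1)": -11.9, "(2, 2, 2)": -13.1}
k, n = 5, 10  # five parameters per model, ten studies

best_aic = min(logliks, key=lambda m: aic(logliks[m], k))
best_bic = min(logliks, key=lambda m: bic(logliks[m], k, n))
best_loglik = max(logliks, key=logliks.get)
print(best_aic, best_bic, best_loglik)
```

Because the penalty is a constant shift across candidates, all three rankings coincide.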

While AIC has been shown to have nice asymptotic properties for model selection with a large $N$ [5], the focus of this work is the small-$N$ problem. Although there is a corrected version of AIC with penalty $2k(k+1)/(N-k-1)$ instead of $2k$ for small $N$ [6], it selects the same model as AIC for the models considered here.

The second and third criteria are based on cAIC in the literature [18,25,36]. Let $y_i = \left(\frac{n_{TP_i}}{n_{TP_i}+n_{FN_i}}, \frac{n_{FP_i}}{n_{TN_i}+n_{FP_i}}\right)$; $i = 1, 2, \ldots, N$ be the vector of observed sensitivities and 1-specificities. For a specific model $M$, define $\hat{z}_i(M) = E_M(z_i \mid \hat{\theta}, Z)$, where

$$z_i = t(y_i) \equiv \left(t_{\alpha_p}\left(\frac{n_{TP_i}}{n_{TP_i}+n_{FN_i}}\right), t_{\alpha_q}\left(\frac{n_{FP_i}}{n_{TN_i}+n_{FP_i}}\right)\right),$$

and $Z = (z_1, \ldots, z_N)$. Thus $\hat{z}_i(M)$ is the empirical best linear unbiased predictor of $t(\mu_i) \equiv (t_{\alpha_p}(p_i), t_{\alpha_q}(q_i))$ based on the model $M$. Then the conditional AIC (cAIC) is defined as $\tilde{l}_M(\hat{\theta}) + \text{penalty}$, where $\tilde{l}_M(\hat{\theta}) = l_M(\hat{\theta}) - l_M(\hat{\theta} \mid \hat{Z}(M))$, with $l_M(\hat{\theta} \mid \hat{Z}(M))$ being the log-likelihood for a model $M$ evaluated at $\hat{\theta}$ with the observations $Z$ replaced by $\hat{Z}(M) \equiv (\hat{z}_1(M), \ldots, \hat{z}_N(M))$. The penalty in cAIC was discussed in Vaida and Blanchard [36], Liang et al. [25], and Greven and Kneib [18]. In particular, Vaida and Blanchard [36] assumed $\theta$ to be known, while Liang et al. [25] and Greven and Kneib [18] took the uncertainty of estimation into consideration. The major difference between Liang et al. [25] and Greven and Kneib [18] is that the former calculated the penalty approximately, while the latter provided an exact method. In the simulation studies, we compare the performance of Vaida and Blanchard [36] and Greven and Kneib [18], and refer to them as cAIC-VB and cAIC-GK, respectively. Also note that a model in the first family does not consider the random effect, hence cAIC reduces to AIC in this case.

The fourth criterion is the EL approach [27], which was originally a method for constructing confidence regions for mean parameters; Baggerly [3] pointed out its connection to goodness-of-fit measures. Denote $t^{-1}(x) \equiv (t_{\alpha_p}^{-1}(x_p), t_{\alpha_q}^{-1}(x_q))$ for an arbitrary $x = (x_p, x_q)$, so that $t^{-1}(\mu_M) \equiv (t_{\alpha_p}^{-1}(\mu_p), t_{\alpha_q}^{-1}(\mu_q))$ is the (back-transformed) mean of summarized sensitivity and 1-specificity for a model $M$. For a model with mean parameters $\mu_M = (\mu_p, \mu_q)$, its EL is given by

$$L(\mu_M) = \max \prod_{i=1}^{N} w_i(\mu_M) \tag{2}$$

under the constraints $\sum_i w_i(\mu_M)\,(y_i - t^{-1}(\mu_M)) = 0$, $w_i(\mu_M) > 0$ and $\sum_i w_i(\mu_M) = 1$. The EL for the saturated model is

$$L(\mu) = \max \prod_{i=1}^{N} w_i$$

under the constraints $w_i > 0$ and $\sum_i w_i = 1$. To assess the hypothesis that $\mu_M$ is the mean of $N$ independent data $y_1, \ldots, y_N$, we first find the weight $w_i(\mu_M)$ of each datum from (2). Then,

$$R = -2\log\left(\frac{L(\mu_M)}{L(\mu)}\right) = -2\sum_{i=1}^{N}\log(w_i(\mu_M)) - 2N\log(N)$$

can be obtained, where $L(\mu) = N^{-N}$ is a constant and $R$ has a chi-square limiting distribution with degrees of freedom equal to the rank of Var($y$) [27]. Thus, a larger value of $R$ indicates a greater deficiency of the model.
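The EL weights and the statistic $R$ can be computed with the standard Lagrange-multiplier form of empirical likelihood, in which $w_i = 1/\{N(1 + \lambda^\top z_i)\}$ with $z_i = y_i$ minus the hypothesised mean. The Python sketch below is a minimal illustration under our own assumptions and is not the authors' implementation: it handles bivariate data only, takes the hypothesised mean directly on the proportion scale (the back-transformed mean), uses a damped Newton iteration, and requires the hypothesised mean to lie inside the convex hull of the data. The function name `el_ratio` and the five data points are hypothetical.

```python
import math


def el_ratio(points, mean, iters=50):
    """Empirical likelihood ratio statistic R and weights for a 2-D mean."""
    z = [(y[0] - mean[0], y[1] - mean[1]) for y in points]
    n = len(z)
    lam = [0.0, 0.0]

    def objective(l):
        # Convex function whose minimiser is the Lagrange multiplier.
        total = 0.0
        for a, b in z:
            d = 1.0 + l[0] * a + l[1] * b
            if d <= 0.0:
                return math.inf  # outside the feasible region
            total -= math.log(d)
        return total

    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for a, b in z:
            d = 1.0 + lam[0] * a + lam[1] * b
            g0 -= a / d
            g1 -= b / d
            h00 += a * a / d ** 2
            h01 += a * b / d ** 2
            h11 += b * b / d ** 2
        det = h00 * h11 - h01 * h01
        # Newton direction -H^{-1} g for the 2x2 Hessian H.
        s0 = -(h11 * g0 - h01 * g1) / det
        s1 = -(-h01 * g0 + h00 * g1) / det
        step, f_old = 1.0, objective(lam)
        # Halve the step until the objective no longer increases.
        while (objective([lam[0] + step * s0, lam[1] + step * s1]) > f_old
               and step > 1e-12):
            step /= 2.0
        lam = [lam[0] + step * s0, lam[1] + step * s1]
        if math.hypot(step * s0, step * s1) < 1e-12:
            break

    weights = [1.0 / (n * (1.0 + lam[0] * a + lam[1] * b)) for a, b in z]
    r = -2.0 * sum(math.log(w) for w in weights) - 2.0 * n * math.log(n)
    return r, weights


# Hypothetical observed (sensitivity, 1-specificity) pairs from five studies.
pts = [(0.90, 0.10), (0.80, 0.25), (0.95, 0.05), (0.85, 0.12), (0.70, 0.20)]
r_at_sample_mean, _ = el_ratio(pts, (0.84, 0.144))  # sample mean of pts
r_shifted, w_shifted = el_ratio(pts, (0.85, 0.15))
print(r_at_sample_mean, r_shifted)
```

At the sample mean all weights equal $1/N$ and $R$ is zero; moving the hypothesised mean away from it makes $R$ positive.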

3. Comparison of selection criteria

To investigate the model selection methods above, we conduct several simulation studies. Each model selection criterion is applied to the simulated data, and we compare the models selected by different criteria.

3.1. Simulation setup

We considered meta-analyses of N = 5, 10 or 50 primary studies of a diagnostic test, and generated the corresponding $x_i = (n_{TP_i}, n_{FN_i}, n_{FP_i}, n_{TN_i})$ for each primary study according to the models described in Section 2.1. Two scenarios of sensitivity and specificity were generated. The first scenario has $\mu = (t_{\alpha_p}(0.9), t_{\alpha_q}(0.9))$ while the second has $\mu = (t_{\alpha_p}(0.95), t_{\alpha_q}(0.8))$. To generate data for the $i$th study, the numbers of non-diseased participants $m_{0i}$ and diseased participants $m_{1i}$ were drawn independently from Poisson distributions with means 76 and 49, respectively. That is, on average we had 49 diseased and 76 non-diseased participants in each primary study. These mean numbers of participants are based on a survey by Bachmann et al. [2].

Once $(t_{\alpha_p}(p_i), t_{\alpha_q}(q_i))$ was drawn from a bivariate normal distribution, $x_i = (n_{TP_i}, n_{FN_i}, n_{FP_i}, n_{TN_i})$ was obtained in a straightforward manner. Based on the data values $x_i$; $i = 1, 2, \ldots, N$, the standard estimation procedure for a model (a triplet of $\alpha_p$, $\alpha_q$ and family index) was applied, and all competing model selection criteria introduced in Section 2.2 were also used. For simplicity, candidate models were restricted to those with $(\alpha_p, \alpha_q) \in \{0, 0.6, 1, 1.4, 2\}^2$ in combination with the two model family indices. Therefore, there are 50 candidate models under this setting. For each criterion, a ‘best’ model was chosen, i.e. the model with the smallest value of the criterion among all the candidate models.
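The data-generating steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the variance components (`sd_p`, `sd_q`, `rho`) are hypothetical because their values are not given here, the counts are assumed to be binomial given the study's true rates, the mean is taken at sensitivity 0.9 and false positive rate 0.1 (our reading of a 0.9/0.9 sensitivity and specificity scenario), and all function names are ours.

```python
import math
import random


def t_alpha(x, alpha):
    return alpha * math.log(x) - (2.0 - alpha) * math.log(1.0 - x)


def t_alpha_inverse(y, alpha):
    """Bisection inverse of the increasing function t_alpha on (0, 1)."""
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if t_alpha(mid, alpha) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def poisson(rng, mean):
    """Knuth's multiplication method; adequate for moderate means."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def binomial(rng, n, p):
    return sum(rng.random() < p for _ in range(n))


def simulate_study(rng, mu, sd_p, sd_q, rho, alpha_p, alpha_q):
    """Return (nTP, nFN, nFP, nTN) for one simulated primary study."""
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    tp = mu[0] + sd_p * e1
    tq = mu[1] + sd_q * (rho * e1 + math.sqrt(1.0 - rho ** 2) * e2)
    p = t_alpha_inverse(tp, alpha_p)   # true sensitivity of this study
    q = t_alpha_inverse(tq, alpha_q)   # true false positive rate
    m1, m0 = poisson(rng, 49), poisson(rng, 76)  # diseased, non-diseased
    n_tp, n_fp = binomial(rng, m1, p), binomial(rng, m0, q)
    return n_tp, m1 - n_tp, n_fp, m0 - n_fp


rng = random.Random(2022)
alpha_p = alpha_q = 1.0
mu = (t_alpha(0.9, alpha_p), t_alpha(0.1, alpha_q))  # assumed scenario mean
# Hypothetical variance components: sd_p = sd_q = 0.5, rho = -0.3.
studies = [simulate_study(rng, mu, 0.5, 0.5, -0.3, alpha_p, alpha_q)
           for _ in range(10)]

# Candidate models: 5 x 5 transformation pairs times 2 families = 50.
grid = [0, 0.6, 1, 1.4, 2]
candidates = [(ap, aq, fam) for ap in grid for aq in grid for fam in (1, 2)]
print(len(candidates), studies[0])
```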

The ‘true’ models considered here were simply those with $\alpha_p = \alpha_q \in \{0, 0.6, 1, 1.4, 2\}$ together with an index of family. After data generation from each ‘true’ model, a final model was determined based on each selection criterion. We assessed a model selection criterion as follows.

Let $M$ be the true model, and $\hat{M}$ be the selected model. Let $C$ be the theoretical ROC curve in $(0,1)\times(0,1)$ space based on $M$, and $C_{\hat{M}}$ be the estimate of $C$ based on $\hat{M}$. Denote an indicator function by $\delta$. Then the successful recovery rate, $\frac{1}{B}\sum \delta(\hat{M} = M)$ with the sum taken over the $B$ replicates, and the mean integrated absolute deviation between $C_{\hat{M}}$ and $C$, MIAE($C_{\hat{M}}$), are used for assessment. The simulation experiment was replicated $B = 500$ times for each combination of scenario, $N$, and $\alpha_p = \alpha_q$.
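The two assessment measures can be sketched in Python as below. This is an illustration under our own assumptions: each ROC curve is treated as a function of the false positive rate on (0, 1), the integral of the absolute difference is approximated by a midpoint rule, and the example curves and selected models are hypothetical. The exact integration scheme used in the simulations is not specified here.

```python
def integrated_abs_error(curve_hat, curve_true, grid_size=1000):
    """Midpoint-rule approximation of the integral of |curve_hat - curve_true|
    over the false positive rate in (0, 1)."""
    h = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        x = h * (i + 0.5)
        total += abs(curve_hat(x) - curve_true(x))
    return total * h


def recovery_rate(selected_models, true_model):
    """Fraction of replicates in which the selected model equals the truth."""
    hits = sum(m == true_model for m in selected_models)
    return hits / len(selected_models)


def true_curve(x):
    return x ** 0.25  # hypothetical theoretical ROC curve


# Hypothetical estimated curves from three simulation replicates.
estimates = [lambda x: x ** 0.30, lambda x: x ** 0.22, lambda x: x ** 0.25]
errors = [integrated_abs_error(c, true_curve) for c in estimates]
miae = sum(errors) / len(errors)  # mean over replicates

# Hypothetical selected triplets (alpha_p, alpha_q, family) per replicate.
selected = [(2, 2, 1), (1, 1, 1), (2, 2, 1), (1.4, 1.4, 2)]
rate = recovery_rate(selected, (2, 2, 1))
print(miae, rate)
```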

3.2. Results

We shall merely report results based on REML estimators, since [15] concluded that the ML estimator of the covariance is always biased. In fact, results based on ML estimator give the same conclusion.

The successful recovery rate of each model selection criterion is given in Tables 1 and 2 for the two scenarios. Most criteria fail to find the true model, regardless of the sample size $N$. cAIC-VB consistently finds the true model with at least 60% success when $\alpha_p = \alpha_q = 0$ under the first family. Under the first family with $\alpha_p = \alpha_q$ equal to 0.6, 1, or 1.4, cAIC-GK generally outperforms the others, although its success rate is low and never larger than 25%. When $\alpha_p = \alpha_q = 2$ under the second family, EL does a better job, with a success rate ranging from 30% to 90%. AIC outperforms the others when $\alpha_p = \alpha_q = 0$ under the second family. Note that about 10% to 15% of the cAIC-GK calculations did not converge. Although a value of cAIC-GK is still available without convergence, its successful recovery rate may be underestimated. EL also failed to converge in about 3% of cases.

Table 1.

Percentages of selecting the correct model in 500 simulations for each model selection criterion in Scenario 1. (a), (c) and (e) under family = 1; (b), (d) and (f) under family = 2; (a) and (b): N = 5; (c) and (d): N = 10; (e) and (f): N = 50.

(a)        
  AIC cAIC-VB cAIC-GK EL
α=0 4.8 92.0 4.0 0.0
α=0.6 0.4 0.0 2.0 0.0
α=1 0.0 0.0 1.6 0.0
α=1.4 1.6 0.0 7.0 0.0
α=2 0.4 0.0 1.6 26.4
(b)        
  AIC cAIC-VB cAIC-GK EL
α=0 56.8 4.4 3.2 1.2
α=0.6 0.8 0.0 0.0 0.0
α=1 0.4 0.0 0.4 0.4
α=1.4 0.8 0.0 0.0 3.6
α=2 0.0 0.0 1.2 46.0
(c)        
  AIC cAIC-VB cAIC-GK EL
α=0 5.2 98.0 2.0 0.0
α=0.6 0.4 0.0 0.4 0.0
α=1 2.0 0.0 4.8 0.0
α=1.4 3.6 0.0 7.8 0.0
α=2 0.0 0.0 0.4 22.0
(d)        
  AIC cAIC-VB cAIC-GK EL
α=0 52.8 0 0.8 0.4
α=0.6 1.2 0 0.4 0.0
α=1 0.0 0 0.8 0.0
α=1.4 0.0 0 0.0 2.8
α=2 0.0 0 0.4 60.8
(e)        
  AIC cAIC-VB cAIC-GK EL
α=0 15.2 100 1.2 0.0
α=0.6 5.2 0.0 6.8 0.0
α=1 3.6 0.0 11.2 0.0
α=1.4 1.6 0.0 24.2 0.0
α=2 0.0 0.0 13.2 2.4
(f)        
  AIC cAIC-VB cAIC-GK EL
α=0 53.6 0.0 1.2 0.4
α=0.6 1.6 0.0 0.4 0.0
α=1 0.0 0.0 0.4 0.0
α=1.4 0.0 0.0 0.0 0.0
α=2 0.0 0.0 0.0 90.2

Table 2.

Percentages of selecting the correct model in 500 simulations for each model selection criterion in Scenario 2. (a), (c) and (e) under family = 1; (b), (d) and (f) under family = 2; (a) and (b): N = 5; (c) and (d): N = 10; (e) and (f): N = 50.

(a)        
  AIC cAIC-VB cAIC-GK EL
α=0 3.6 62.6 7.2 0.4
α=0.6 0.0 0.0 3.2 0.0
α=1 0.4 0.0 1.8 0.0
α=1.4 3.2 0.0 6.8 1.2
α=2 0.4 0.0 2.8 22.8
(b)        
  AIC cAIC-VB cAIC-GK EL
α=0 22.4 0.0 0.4 1.2
α=0.6 0.0 0.0 0.8 0.0
α=1 0.4 0.0 0.4 0.0
α=1.4 0.0 0.0 1.0 0.4
α=2 0.0 0.0 0.8 32.4
(c)        
  AIC cAIC-VB cAIC-GK EL
α=0 6.0 60.8 4.8 0.0
α=0.6 3.2 0.0 6.0 0.0
α=1 1.6 0.0 4.0 0.0
α=1.4 6.8 0.0 11.2 0.4
α=2 0.0 0.0 1.6 12.0
(d)        
  AIC cAIC-VB cAIC-GK EL
α=0 8.0 0.0 0.8 0.0
α=0.6 0.0 0.0 1.0 0.0
α=1 0.2 0.0 0.8 0.0
α=1.4 0.2 0.0 4.0 1.8
α=2 0.0 0.0 1.2 51.2
(e)        
  AIC cAIC-VB cAIC-GK EL
α=0 23.2 66.4 16.0 0.0
α=0.6 14.8 0.0 9.1 0.0
α=1 4.0 0.0 10.8 0.0
α=1.4 3.2 0.0 24.4 0.0
α=2 0.0 0.0 29.6 2.6
(f)        
  AIC cAIC-VB cAIC-GK EL
α=0 47.2 0.0 0.0 0.8
α=0.6 0.0 0.0 0.0 0.0
α=1 2.0 0.0 4.0 0.0
α=1.4 3.4 0.0 6.8 0.0
α=2 0.0 0.0 0.0 88.6

The MIAE of each model selection criterion is given in Figures 1 and 2 for the two scenarios. The pattern across $\alpha$ is quite clear: the MIAE varies more among models in the second family. This is probably because the second family is more flexible than the first. Among the four criteria, EL seems to have the smallest MIAE in most cases, i.e. it yields the ROC curve most similar to the true one. AIC and cAIC-VB are better than EL for some values of $\alpha$ if N = 50.

Figure 1.

MIAE of different candidate models or selected models for Scenario 1 under first (left panel) and second (right panel) families. (a) and (b): N = 5; (c) and (d): N = 10; (e) and (f): N = 50. Different colors stand for different α: 0 in cyan, 0.6 in black, 1 in red, 1.4 in green, and 2 in gray. The numbers in the parentheses on the horizontal axis represent ( αp, αq, family index).

Figure 2.

MIAE of different candidate models or selected models for Scenario 2 under first (left panel) and second (right panel) families. (a) and (b): N = 5; (c) and (d): N = 10; (e) and (f): N = 50. Different colors stand for different α: 0 in cyan, 0.6 in black, 1 in red, 1.4 in green, and 2 in gray. The numbers in the parentheses on the horizontal axis represent ( αp, αq, family index).

Closer inspection of Figures 1 and 2 reveals that, even more surprisingly, the ‘true’ models are usually not the best, and there is a consistent winner that outperforms all the other models. The model with $\alpha_p = 1.4$, $\alpha_q = 1.4$, and index = 1 almost always has the smallest MIAE, except for the $\alpha = 0$ and N = 50 cases in Scenario 2.

4. Application of sROC curve to COVID-19 diagnosis

According to official statistics [39], by the end of 2020 there had been 82,404,103 confirmed cases of COVID-19, including 1,801,382 deaths, reported to the World Health Organization (WHO). The outbreak was first identified in December 2019 and had developed into a global pandemic affecting more than 200 countries by March 2020. Symptoms of COVID-19 are highly variable and can range from very mild to severe. Commonly reported symptoms usually last around 9 to 10 days after symptom onset, but people remain infectious for up to two weeks even without showing symptoms [7]. Hence, accurate and rapid diagnostic tests are important for disease control policy.

Although detection with real-time polymerase chain reaction (RT-PCR) has quite high accuracy, the method is too expensive and time-consuming to be widely available. In contrast, immunoassays detect a few antibodies and have been used for rapid and low-cost tests. Our bodies produce antibodies, also called immunoglobulins, to resist disease caused by viruses or bacteria. The types of immunoglobulin include IgA, IgG, IgM, IgE and IgD, and different types fight different diseases. Whether a person is infected can be determined by testing whole blood or serum for these antibodies.

Intensive research is under way on rapid and highly accurate detection, and chemiluminescence immunoassay (CLIA) is one such assay. Other molecular and immunoassay-based techniques, including the long-established enzyme-linked immunosorbent assay (ELISA), are reviewed in [37]. It has been shown that CLIA is more time-saving than ELISA, with competitive sensitivity, specificity and accuracy in clinical applications (e.g. [9,13]). Bastos et al. [4] reviewed a number of studies reporting the diagnostic accuracy of serological tests for COVID-19. Here, we focus on the sROC curve of COVID-19 diagnosis using CLIA by detecting specific IgG and IgM. There were 9 studies for IgG and 10 studies for IgM, whose nTP, nFN, nFP, and nTN can be found in Tables 3 and 4 of [4].

We used the model with the smallest MIAE in the simulations, i.e. $\alpha_p = \alpha_q = 1.4$ and index = 1, to estimate the sROC curves for both antibodies, and the results are shown in Figures 3(d) and 4(d). This method is referred to as MIM in what follows. AIC selected $\alpha_p = 0$, $\alpha_q = 1.5$ and index = 2 for CLIA (IgG), while it selected $\alpha_p = 1$, $\alpha_q = 1.5$ and index = 2 for CLIA (IgM). We also compared the sROC curves based on the HSROC model of [29] ($\alpha_p = \alpha_q = 1$, index = 2) and the MSL model of [26] ($\alpha_p = \alpha_q = 1$, index = 1).

Figure 3.

Estimated sROC curves with the confidence region (red line) and prediction region (dashed line) for CLIA(IgG) using (a) the model selected by AIC, (b) MSL method, (c) HSROC model, and (d) the model with αp=αq=1.4, and index =1, where gray circles (with the area proportional to the sample size) are the sensitivity and 1-specificity observed from individual studies and the black triangle indicates the summary sensitivity and 1-specificity.

Figure 4.

Estimated sROC curves with the confidence region (red line) and prediction region (dashed line) for CLIA(IgM) using (a) the model selected by AIC, (b) MSL method, (c) HSROC model, and (d) the model with αp=αq=1.4, and index =1, where gray circles (with the area proportional to the sample size) are the sensitivity and 1-specificity observed from individual studies and the black triangle indicates the summary sensitivity and 1-specificity.

A positive IgM antibody means that the patient is in the early stage of infection, while a positive IgG antibody means that the patient is in the late stage of infection or in recovery. Therefore, a higher sensitivity of IgG is expected given the sampled subjects, which is confirmed by the overall patterns in Figures 3 and 4. In these two problems, MIM seems to fall somewhere between the other three. As can easily be seen, MIM has a narrower confidence region for the summary sensitivity and specificity than MSL and HSROC. It also has a slightly higher summary specificity and a lower summary sensitivity. This reflects that MIM puts more weight on studies around the top-left corner of the plots, while MSL and HSROC are influenced more by small-sample studies.

5. Discussion and conclusion

Based on the successful recovery rates and MIAE, we can answer the first two key questions. AIC is clearly not acceptable here. It finds the true model with a success rate of about 20% to 60% only in very restricted situations, and in most cases its ability to find the true model is even worse than random guessing (2% in expectation, i.e. one out of 50 candidate models). Surprisingly, its performance does not improve when N becomes larger. The second key question cannot be completely resolved by considering only the four model selection criteria. Although some of them are better for certain $\alpha$ in the simulations, we never know the ‘true’ $\alpha$ in practice. There is no overall best criterion among the four. Thus we could not find a better model based on these selection criteria alone.

We were also curious about the MIAE obtained when the ‘true’ $\alpha$ and family index are used to estimate the ROC curve. As expected, ‘true’ models are near the top of the list. The $\alpha = 0$ cases in the first family are exceptions, where the ‘true’ model has a very large MIAE. We can also compare, in terms of MIAE, the performance of some frequently used special cases. The MSL method with triplet (1,1,1) and HSROC with triplet (1,1,2) usually perform quite well and give quite similar MIAE if the true $\alpha \leq 1$. The complementary logarithmic model with triplet (1.4,1.4,2) and the proportional hazard model with triplet (2,2,1) are slightly better than the previous two models if the true $\alpha > 1$.

Now we turn to the third key question, regarding a summary method that performs satisfactorily under various situations, especially when N is small. According to Figures 1 and 2, the model with $\alpha_p = \alpha_q = 1.4$ and index = 1 almost always has the smallest MIAE, the exceptions being the $\alpha = 0$ and N = 50 cases in Scenario 2. Since we rarely have a diagnostic meta-analysis with a sample size as large as 50, always using this model is the best strategy.

We suggest constructing the sROC curve based on the $t_\alpha$ model with $\alpha_p = \alpha_q = 1.4$ and index = 1. It is the best in terms of MIAE, i.e. it gives the ROC curve most similar to the true one. The recommended complementary logarithmic model is closely related to that of Chu et al. [10]. However, very few meta-analyses of diagnostic studies have applied the method, even though several years have passed since it was proposed in 2010. A lack of software for practitioners is probably a main reason. In addition, unlike the popular logistic transformation in [22,26,29], the complementary logarithmic link is asymmetric with respect to 0.5, which makes it harder to picture how extremely large or small sensitivity and specificity behave after transformation. We provide our R implementation in the Appendix, in the hope of increasing the chance of the method being used.

Appendix.

We illustrate how to use our R implementation with the COVID-19 data.

First, install and load the package mada from the Comprehensive R Archive Network (CRAN), which already provides model fitting for the second class of models. We modified its code so that models of both classes can be fitted using a single function, reitsma.default. Our code is in ‘func.R’, which can be read into R using source.

Next, input the dataset. Note that its columns should be named and ordered as follows: ‘TP’, ‘FN’, ‘FP’ and ‘TN’.

The model with the smallest AIC among all the ($\alpha_p$, $\alpha_q$, family) combinations considered can then be found. The main function reitsma.default has four essential inputs: dataset, alphasens ($\alpha_p$), alphafpr ($\alpha_q$), and use1stModel (TRUE or FALSE, indicating whether to use the first class of models).

We also implemented a function srocPlot for producing plots like those in Figures 3 and 4, which takes three inputs: (1) model, (2) title, and (3) list of sensitivities and specificities. The function freq2ratio can be used to calculate the sensitivity and specificity for each study.

Funding Statement

Financial support for this study was provided in part by a grant from National Science Council, Taiwan [grant number NSC102-2118-M038-002], and Ministry of Science and Technology, Taiwan [grant numbers MOST104-2314-B039-037, MOST107-2118-M-110-004-MY3 and MOST108-2628-M-008-005-MY3]. The funding agreement ensured the authors' independence in designing the study, interpreting the results, writing, and publishing the report.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Akaike H., Information theory and an extension of the maximum likelihood principle, Proceedings of 2nd International Symposium on Information Theory, Akadémiai Kiadó, 1973, pp. 267–281.
  • 2. Bachmann L.M., Puhan M.A., Ter Riet G., and Bossuyt P.M., Sample sizes of studies on diagnostic accuracy: Literature survey, BMJ 332 (2006), pp. 1127–1129.
  • 3. Baggerly K.A., Empirical likelihood as a goodness-of-fit measure, Biometrika 85 (1998), pp. 535–547.
  • 4. Bastos M.L., Tavaziva G., Abidi S.K., Campbell J.R., Haraoui L.P., Johnston J.C., Lan Z., Law S., MacLean E., Trajman A., Menzies D., Benedetti A., and Ahmad Khan F., Diagnostic accuracy of serological tests for COVID-19: Systematic review and meta-analysis, BMJ 370 (2020), pp. m2516.
  • 5. Burnham K.P. and Anderson D.R., Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Springer Science & Business Media, New York, 2003.
  • 6. Cavanaugh J.E., Unifying the derivations for the Akaike and corrected Akaike information criteria, Stat. Probab. Lett. 33 (1997), pp. 201–208.
  • 7. Centers for Disease Control and Prevention and others, Duration of Isolation and Precautions for Adults with COVID-19, CDC, Atlanta.
  • 8. Chappell F., Raab G., and Wardlaw J., When are summary ROC curves appropriate for diagnostic meta-analyses? Stat. Med. 28 (2009), pp. 2653–2668.
  • 9. Chen D., Zhang Y., Xu Y., Shen T., Cheng G., Huang B., Ruan X., and Wang C., Comparison of chemiluminescence immunoassay, enzyme-linked immunosorbent assay and passive agglutination for diagnosis of Mycoplasma pneumoniae infection, Ther. Clin. Risk Manag. 14 (2018), pp. 1091–1097.
  • 10. Chu H., Guo H., and Zhou Y., Bivariate random effects meta-analysis of diagnostic studies using generalized linear mixed models, Med. Decis. Making 30 (2010), pp. 499–508.
  • 11. Chu H., Nie L., Cole S.R., and Poole C., Meta-analysis of diagnostic accuracy studies accounting for disease prevalence: Alternative parameterizations and model selection, Stat. Med. 28 (2009), pp. 2384–2399.
  • 12. Claeskens G. and Hjort N.L., Model Selection and Model Averaging, Cambridge University Press, Cambridge, 2008.
  • 13. de Ory F., Minguito T., Balfagón P., and Sanz J.C., Comparison of chemiluminescent immunoassay and ELISA for measles IgG and IgM, APMIS 123 (2015), pp. 648–651.
  • 14. Deeks J.J., Systematic reviews in health care: Systematic reviews of evaluations of diagnostic and screening tests, BMJ 323 (2001), pp. 157–162.
  • 15. Doebler P., Holling H., and Böhning D., A mixed model approach to meta-analysis of diagnostic studies with binary test outcome, Psychol. Methods 17 (2012), pp. 418–436.
  • 16. Fan J. and Lv J., A selective overview of variable selection in high dimensional feature space, Stat. Sin. 20 (2010), pp. 101–148.
  • 17. Gatsonis C. and Paliwal P., Meta-analysis of diagnostic and screening test accuracy evaluations: Methodologic primer, Am. J. Roentgenol. 187 (2006), pp. 271–281.
  • 18. Greven S. and Kneib T., On the behaviour of marginal and conditional AIC in linear mixed models, Biometrika 97 (2010), pp. 773–789.
  • 19. Guolo A., A double SIMEX approach for bivariate random-effects meta-analysis of diagnostic accuracy studies, BMC Med. Res. Methodol. 17 (2017), pp. 266.
  • 20. Harbord R.M., Deeks J.J., Egger M., Whiting P., and Sterne J.A., A unification of models for meta-analysis of diagnostic accuracy studies, Biostatistics 8 (2006), pp. 239–251.
  • 21. Higgins J.P. and Green S., Cochrane Handbook for Systematic Reviews of Interventions, vol. 4, John Wiley & Sons, Hoboken, 2011.
  • 22. Holling H., Böhning W., and Böhning D., Likelihood-based clustering of meta-analytic SROC curves, Psychometrika 77 (2012), pp. 106–126.
  • 23. Honest H. and Khan K.S., Reporting of measures of accuracy in systematic reviews of diagnostic literature, BMC Health Serv. Res. 2 (2002), p. 112.
  • 24. Irwig L., Macaskill P., Glasziou P., and Fahey M., Meta-analytic methods for diagnostic test accuracy, J. Clin. Epidemiol. 48 (1995), pp. 119–130.
  • 25. Liang H., Wu H., and Zou G., A note on conditional AIC for linear mixed-effects models, Biometrika 95 (2008), pp. 773–778.
  • 26. Moses L.E., Shapiro D., and Littenberg B., Combining independent studies of a diagnostic test into a summary ROC curve: Data-analytic approaches and some additional considerations, Stat. Med. 12 (1993), pp. 1293–1316.
  • 27. Owen A., Empirical likelihood ratio confidence regions, Ann. Stat. 18 (1990), pp. 90–120.
  • 28. Reitsma J.B., Glas A.S., Rutjes A.W., Scholten R.J., Bossuyt P.M., and Zwinderman A.H., Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews, J. Clin. Epidemiol. 58 (2005), pp. 982–990.
  • 29. Rutter C.M. and Gatsonis C.A., A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations, Stat. Med. 20 (2001), pp. 2865–2884.
  • 30. Schlattmann P., Verba M., Dewey M., and Walther M., Mixture models in diagnostic meta-analyses–clustering summary receiver operating characteristic curves accounted for heterogeneity and correlation, J. Clin. Epidemiol. 68 (2015), pp. 61–72.
  • 31. Schwarz G., Estimating the dimension of a model, Ann. Statist. 6 (1978), pp. 461–464.
  • 32. Shamseer L., Moher D., Clarke M., Ghersi D., Liberati A., Petticrew M., Shekelle P., and Stewart L.A., Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: Elaboration and explanation, BMJ 349 (2015), pp. g7647.
  • 33. Sotiriadis A., Papatheodorou S., and Martins W., Synthesizing evidence from diagnostic accuracy tests: The SEDATE guideline, Ultrasound Obstet. Gynecol. 47 (2016), pp. 386–395.
  • 34. Trikalinos T.A., Balion C.M., Coleman C.I., Griffith L., Santaguida P.L., Vandermeer B., and Fu R., Meta-analysis of test performance when there is a ‘gold standard’, J. Gen. Intern. Med. 27 (2012), pp. 56–66.
  • 35. Tripepi G., Jager K.J., Dekker F.W., and Zoccali C., Diagnostic methods 2: Receiver operating characteristic (ROC) curves, Kidney Int. 76 (2009), pp. 252–256.
  • 36. Vaida F. and Blanchard S., Conditional Akaike information for mixed-effects models, Biometrika 92 (2005), pp. 351–370.
  • 37. Verma N., Patel D., and Pandya A., Emerging diagnostic tools for detection of COVID-19 and perspective, Biomed. Microdevices 22 (2020), pp. 1–18.
  • 38. Walter S. and Jadad A., Meta-analysis of screening data: A survey of the literature, Stat. Med. 18 (1999), pp. 3409–3424.
  • 39. World Health Organization, WHO coronavirus disease (COVID-19) dashboard. Available at https://covid19.who.int/.

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
