Journal of Applied Statistics. 2022 Jul 26;49(12):3141–3163. doi: 10.1080/02664763.2022.2104229

How reliable are the multiple comparison methods for odds ratio?

Ayfer Ezgi Yilmaz
PMCID: PMC9415621  PMID: 36035608

ABSTRACT

The homogeneity tests of odds ratios are used in clinical trials and epidemiological investigations as a preliminary step of meta-analysis. In recent studies, the severity or mortality of COVID-19 in relation to demographic characteristics, comorbidities, and other conditions has been popularly discussed by interpreting odds ratios and using meta-analysis. According to the homogeneity test results, a common odds ratio summarizes all of the odds ratios in a series of studies. If the aim is not to find a common odds ratio, but to find which of the sub-characteristics/groups is different from the others or is under risk, then the implementation of a multiple comparison procedure is required. In this article, the focus is placed on the accuracy and reliability of the homogeneity of odds ratio tests for multiple comparisons when the odds ratios are heterogeneous at the omnibus level. Three recently proposed multiple comparison tests and four homogeneity of odds ratios tests with six adjustment methods to control the type-I error rate are considered. The reliability and accuracy of the methods are discussed in relation to COVID-19 severity data associated with diabetes on a country-by-country basis, and a simulation study to assess the powers and type-I error rates of the tests is conducted.

KEYWORDS: Homogeneity of odds ratios, multiple comparisons, type-I error, statistical power, meta-analysis, COVID-19

1. Introduction

It has recently become very popular to investigate the mortality or severity of COVID-19 in relation to demographics, clinical characteristics, or signs and symptoms to understand the impacts of COVID-19 more clearly. Not only in COVID-19 studies, but also in general meta-analysis projects, the odds ratio of death or severity for a patient with a comorbidity (diabetes, hypertension, cardiac disease, etc.) or a symptom (fever, cough, fatigue, etc.) has been investigated. Under the assumption that the odds ratios across studies are homogeneous, the results of several studies are aggregated via a meta-analysis. Before calculating a common odds ratio, the homogeneity of studies should be tested. Studies directly focusing on testing the homogeneity of the odds ratios date back to the first half of the 1950s. While Woolf's [39] test was the first application in the literature to test the homogeneity of the logarithm of odds ratios, it was found to be very conservative by Gavaghan et al. [16]. Breslow and Day [6] used the Mantel-Haenszel estimator instead of the conditional maximum likelihood estimate. Tarone [34] suggested the adjusted version of the Breslow-Day (BD) test. Because the BD statistic is based on the Cochran-Mantel-Haenszel odds ratio estimator and this estimator is not efficient, Almalik and van den Heuvel [26] suggested using the Tarone test. However, studies that found a difference between the BD and Tarone tests reported that it appeared only in the fourth decimal place [24]. Reis et al. [30] discussed the limitation of the asymptotic chi-square tests and did not recommend using them when most of the expected values were less than five. The Zelen test [45] was recommended to overcome this problem, but it was found to be biased and inconsistent [18,26]. Yusuf et al. [44] proposed a chi-square-based method, called the Peto method, that is identical to the asymptotic Zelen test [16,30]. 
The DerSimonian–Laird [11] statistic, the likelihood ratio (LR) test of a mixed logistic model [1,3], and the conditional maximum likelihood score statistic [16] have also been used to test the homogeneity of the odds ratios. The DerSimonian–Laird statistic based on the natural logarithm of the odds is equivalent to the Woolf statistic [16]. Because of its simple calculation, the use of the BD test has been recommended instead of the mixed logistic model and score tests [3,16,30]. All of these methods are calculated under the assumption of homogeneity of the odds ratios.

There are many studies that have compared the properties of the homogeneity tests of odds ratios in the literature. Jones et al. [23] compared the power of seven tests of homogeneity of the odds ratio for balanced and unbalanced designs. As a result of their simulation study, they suggested using the BD statistic for non-sparse tables. Paul and Donner [27] also compared the performance of nine tests for the homogeneity of odds ratios according to the data designs (balanced, mildly unbalanced, severely unbalanced, and within-strata unbalanced) and the number of strata. Because of its simple calculation and power performance, they recommend using the Tarone test in practice. Additionally, they recommend using the Woolf test for balanced or mildly unbalanced designs. Reis et al. [30] conducted a Monte Carlo simulation to compare the performance of six asymptotic tests for the homogeneity of odds ratios. As a result of their simulation study, the BD and Pearson chi-square tests were slightly better than the other tests for a non-small sample size. Gavaghan et al. [16] compared the performance of Peto, Woolf, DerSimonian–Laird, and BD statistics, and the test scores. They suggested using the BD statistic in the meta-analysis of pain studies. Their simulation study showed that when the Woolf statistic under-estimated the degree of heterogeneity, the DerSimonian–Laird statistic over-estimated it. Bagheri et al. [3] compared the likelihood ratio test of a mixed logistic model, DerSimonian–Laird statistic, and BD test when the sample size was equal and non-equal. They concluded that the BD test was the most powerful test among these three tests, and the studies with more strata had a higher power. Wei and Lai [37] discussed the effects of small sample size on the power of homogeneity tests. They suggested using U-statistics (U3 and WU3), which have higher power than the other tests. They concluded that the sample size had a positive effect on the power. 
When the number of strata and sample size increased, the power of the U-statistics improved.

These studies in the literature did not agree on a particular test that could directly be used in practice. The main reason for this was the design spaces of the simulation studies. None of the studies considered an extensive simulation space that could account for real-world scenarios, including an extensive combination of the number of studies, true odds ratios, different combinations of sample sizes, and the distributions of the counts across the cells of the resulting contingency tables. The aim of this study is to produce new knowledge on the power and type-I error performances of a large set of tests under an extensive simulation space. In this way, results with a higher likelihood of generalizability will be provided.

Meta-analysis is used to pool independent studies focused on the same question. In meta-analysis studies, it is required that all available studies are reported. The heterogeneity of these studies (effect sizes) is tested with the I-square and its related statistics. If heterogeneity is observed, it is important to consider a strategy for handling the sources of heterogeneity. There are different types of heterogeneity in a meta-analysis, such as clinical heterogeneity (differences in participant characteristics such as gender, age group, and race; types or timing of the outcome measurements; and intervention characteristics), methodological heterogeneity (trial design and quality), and statistical heterogeneity (treatment effects between trials) [15]. Gagnier et al. [15] discussed that clinical and methodological heterogeneity can cause significant statistical heterogeneity and affect the results.

COVID-19 meta-analysis aims to investigate the relationship between different demographic characteristics (gender, age group, race, Hispanic origin, etc.) by comparing the difference between mortality/severity and comorbidities across the strata. Such studies aim to reveal the difference between the odds ratios across the strata (demographics). For example, the risk of death of a patient with hypertension may be notably higher in one race than in the others, or the relationship between mortality and hypertension may not be statistically significant in some races. When the odds ratios of COVID-19 data are heterogeneous, it is not appropriate to calculate a common odds ratio and it is important to detect the odds ratio that causes heterogeneity among all the considered odds ratios. In this case, a multiple comparison procedure is needed. To produce reliable results in such a crucial area, it is essential to understand the power and type-I error behaviors of the multiple comparison procedures used for heterogeneous odds ratios.

Even though there are many methods to test the homogeneity of odds ratios, there is limited literature about the multiple comparison procedures to be applied when the odds ratios are heterogeneous. Yilmaz and Aktas Altunay [43] suggested using the BD-based least significant difference (LSD), chi-square-based LSD, and adjusted BD tests for multiple comparisons of odds ratios. They used these tests to compare six COVID-19 mortality data sets from China, and the study was limited to real-life data. Their numerical application showed that the Bonferroni and Dunn–Šidák adjustment methods were very conservative when comparing the odds ratios and that their proposed methods were less conservative. They also recommended the use of the BD-based and chi-square-based LSD methods for sparse tables.

In this article, the focus was placed on the use of the homogeneity of odds ratio tests for the performance of multiple comparisons when the odds ratios are heterogeneous. Specifically, we focused on COVID-19 data to get accurate and reliable inferences when the odds ratios are heterogeneous in meta-analysis studies on COVID-19. Following the simulation study results of Bagheri et al. [3], Gavaghan et al. [16], Jones et al. [23], Paul and Donner [27], and Reis et al. [30], the BD, Tarone, Woolf, and Peto homogeneity of the odds ratio test statistics were considered. Bonferroni, Dunn–Šidák, Holm, Hochberg, Hommel, and Benjamini–Hochberg adjustments were used to control the type-I error rate with multiple comparison tests. Also, we considered the BD-based LSD, chi-square-based LSD, and adjusted BD tests for multiple comparisons [43]. In total, 27 methods were taken into consideration and compared in terms of seven measures, consisting of the any-pair power, all pairs power, positive predicted value, true negative rate (TNR), per comparison error rate, family-wise error rate, and false discovery rate, including the different number of strata, sample sizes, sample size designs (equal, within-center inequality, among-centers inequality), and structure of the table (balanced, imbalanced). With such an extensive numerical study, clearer and more reliable results were obtained on the power and error characteristics of multiple comparison tests for odds ratios than in previous studies. 
The contributions of this study were that we (1) demonstrated the importance of heterogeneous odds ratios in COVID-19 studies and discussed the use of multiple comparison procedures to get reliable results, (2) examined the performance of the multiple comparison procedures for odds ratios in terms of power and error rates using an extensive simulation space that covered a wide range of realistic scenarios that can occur in practice, (3) examined the performance of the homogeneity tests of odds ratios in multiple comparison procedure and discussed the effect of adjustment methods on the tests, and (4) identified methods that can be used under different data compositions and areas of practice.

In Section 2, the methods to test the homogeneity of odds ratios and the multiple comparison procedures are presented. In Sections 3 and 4, the results of numerical studies with COVID-19 and synthetic data are presented. In Section 5, the general recommendations and conclusions are given.

2. Methods

In this section, the methods to test the homogeneity of odds ratios and the multiple comparison methods are introduced.

2.1. Test methods

Consider K different strata, in which the association between the binary variables X and Y is investigated. Let $n_{ijk}$ be the number of observations in the ith row, jth column, and kth stratum, where $i, j = 1, 2$ and $k = 1, \dots, K$. $n_{..k}$ is the total number of observations in the kth stratum. A 2×2×K study design is summarized in Table 1.

Table 1.

A 2×2×K study design.

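| Stratum | X | Y = 1 | Y = 2 | Total |
|---|---|---|---|---|
| 1 | X = 1 | $n_{111}$ | $n_{121}$ | $n_{1.1}$ |
|   | X = 2 | $n_{211}$ | $n_{221}$ | $n_{2.1}$ |
|   | Total | $n_{.11}$ | $n_{.21}$ | $n_{..1}$ |
| 2 | X = 1 | $n_{112}$ | $n_{122}$ | $n_{1.2}$ |
|   | X = 2 | $n_{212}$ | $n_{222}$ | $n_{2.2}$ |
|   | Total | $n_{.12}$ | $n_{.22}$ | $n_{..2}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| K | X = 1 | $n_{11K}$ | $n_{12K}$ | $n_{1.K}$ |
|   | X = 2 | $n_{21K}$ | $n_{22K}$ | $n_{2.K}$ |
|   | Total | $n_{.1K}$ | $n_{.2K}$ | $n_{..K}$ |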

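The odds ratio of the kth stratum is $\theta_k = \dfrac{n_{11k}\, n_{22k}}{n_{12k}\, n_{21k}}$, where $k = 1, \dots, K$. The null hypothesis for the homogeneity of odds ratios is

$$H_0: \theta_1 = \theta_2 = \cdots = \theta_K, \tag{1}$$

against $H_1: \theta_i \neq \theta_j$ for at least one pair $(i, j)$, where $i, j = 1, \dots, K$ ($i \neq j$).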

The Peto, BD, Tarone, and Woolf methods were used to test the null hypothesis of the equality of several odds ratios. The Mantel and Haenszel [25] odds ratio is

$$\hat{\theta}_{MH} = \frac{\sum_{k=1}^{K} n_{11k}\, n_{22k} / n_{..k}}{\sum_{k=1}^{K} n_{12k}\, n_{21k} / n_{..k}}. \tag{2}$$
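As a minimal sketch of the stratum-specific odds ratio and Equation (2), the following Python snippet computes both from 2×2 counts. The counts are those reported later in Table 2, with rows read as severity and columns as diabetes status; the function and variable names are illustrative and are not taken from the author's R code.

```python
# Each stratum is (n11, n12, n21, n22); counts as reported in Table 2.
tables = {
    "Greece": (6, 22, 4, 22),
    "Italy": (22, 86, 15, 113),
    "China": (6, 8, 1, 54),
    "France": (23, 62, 5, 34),
}


def odds_ratio(n11, n12, n21, n22):
    """Sample odds ratio of a single 2x2 table (all cells must be non-zero)."""
    return (n11 * n22) / (n12 * n21)


def mantel_haenszel(strata):
    """Mantel-Haenszel pooled odds ratio, Equation (2)."""
    num = 0.0
    den = 0.0
    for n11, n12, n21, n22 in strata:
        n = n11 + n12 + n21 + n22
        num += n11 * n22 / n
        den += n12 * n21 / n
    return num / den


stratum_or = {name: odds_ratio(*t) for name, t in tables.items()}
theta_mh = mantel_haenszel(tables.values())
```

The pooled estimate is only meaningful as a summary when the stratum odds ratios are homogeneous, which is what the tests below assess.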

Yusuf et al. [44] proposed an alternative method to the MH method for pooling odds ratios across the strata. This method is referred to as the Peto method. The Peto statistic is

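$$\chi^2_{Peto} = \sum_{k=1}^{K} \frac{(n_{11k} - E_k)^2}{V_k} - \frac{\left[\sum_{k=1}^{K} (n_{11k} - E_k)\right]^2}{\sum_{k=1}^{K} V_k}, \tag{3}$$

where the expected frequency ($E_k$) and its variance ($V_k$) at the kth stratum are

$$E_k = \frac{n_{1.k}\, n_{.1k}}{n_{..k}} \quad \text{and} \quad V_k = \frac{n_{1.k}\, n_{.1k}\, n_{2.k}\, n_{.2k}}{n_{..k}^2\,(n_{..k} - 1)}.$$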
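The Peto statistic in Equation (3) can be sketched in a few lines of Python. This is an illustrative transcription of the formula, not the author's implementation; the function name and the example counts are assumptions made for the sketch.

```python
def peto_statistic(strata):
    """Peto heterogeneity statistic, Equation (3).

    strata: iterable of (n11, n12, n21, n22) tuples, one per stratum.
    """
    sum_sq_over_v = 0.0  # sum of (n11k - Ek)^2 / Vk
    sum_dev = 0.0        # sum of (n11k - Ek)
    sum_v = 0.0          # sum of Vk
    for n11, n12, n21, n22 in strata:
        n = n11 + n12 + n21 + n22
        r1, r2 = n11 + n12, n21 + n22  # row totals
        c1, c2 = n11 + n21, n12 + n22  # column totals
        e = r1 * c1 / n
        v = r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))
        dev = n11 - e
        sum_sq_over_v += dev ** 2 / v
        sum_dev += dev
        sum_v += v
    return sum_sq_over_v - sum_dev ** 2 / sum_v


# Hypothetical example: two strata with clearly different odds ratios.
example = [(10, 10, 10, 10), (18, 2, 5, 15)]
peto_value = peto_statistic(example)
```

When every stratum has the same table, the two terms cancel and the statistic is zero, which is a convenient sanity check.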

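The Breslow-Day (BD) test is used to test the homogeneity of the odds ratios across K strata [5,6]. The BD test statistic is

$$\chi^2_{BD} = \sum_{k=1}^{K} \frac{\left[n_{11k} - \hat{\mu}_k(\hat{\theta}_{MH})\right]^2}{\hat{\sigma}_k^2(\hat{\theta}_{MH})}, \tag{4}$$

where

$$\hat{\sigma}_k^2(\hat{\theta}_{MH}) = \left[\frac{1}{\hat{\mu}_k} + \frac{1}{n_{1.k} - \hat{\mu}_k} + \frac{1}{n_{.1k} - \hat{\mu}_k} + \frac{1}{n_{22k} - n_{11k} + \hat{\mu}_k}\right]^{-1}. \tag{5}$$

Here, $\hat{\mu}_k(\hat{\theta})$ and $\hat{\sigma}_k^2(\hat{\theta})$ are the expected value and the variance of $n_{11k}$ under the assumption of homogeneous odds ratios, respectively. The BD formula uses the MH odds ratio to generate the expected values using the conditional maximum likelihood method.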

The Tarone adjustment is a special case of the BD test statistic [34].

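$$\chi^2_{Tarone} = \chi^2_{BD} - \frac{\left[\sum_{k=1}^{K} n_{11k} - \sum_{k=1}^{K} \hat{\mu}_k(\hat{\theta}_{MH})\right]^2}{\sum_{k=1}^{K} \hat{\sigma}_k^2(\hat{\theta}_{MH})}. \tag{6}$$

Here, $\hat{\mu}_k(\hat{\theta})$ and $\hat{\sigma}_k^2(\hat{\theta})$ are the expected value and the variance of $n_{11k}$, the same as in the BD test statistic.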
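The following Python sketch illustrates Equations (4) to (6). It assumes the usual construction in which the expected count is the value of the (1,1) cell that, with the observed margins held fixed, gives an odds ratio equal to the pooled MH estimate; this leads to a quadratic equation with one admissible root. The helper names are hypothetical, and the counts are those reported in Table 2.

```python
import math


def mh_odds_ratio(strata):
    """Mantel-Haenszel pooled odds ratio, Equation (2)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


def expected_n11(table, theta):
    """Expected (1,1) count with fixed margins and odds ratio theta."""
    n11, n12, n21, n22 = table
    r1, c1 = n11 + n12, n11 + n21
    d = n22 - n11  # equals n - r1 - c1, so the (2,2) cell is d + mu
    if abs(theta - 1.0) < 1e-12:
        return r1 * c1 / (n11 + n12 + n21 + n22)
    # Solve (1 - theta) mu^2 + (d + theta (r1 + c1)) mu - theta r1 c1 = 0.
    qa = 1.0 - theta
    qb = d + theta * (r1 + c1)
    qc = -theta * r1 * c1
    root_disc = math.sqrt(qb * qb - 4.0 * qa * qc)
    low, high = max(0.0, -d), min(r1, c1)
    for mu in ((-qb + root_disc) / (2.0 * qa), (-qb - root_disc) / (2.0 * qa)):
        if low < mu < high:
            return mu
    raise ValueError("no admissible root; check for zero margins")


def breslow_day_tarone(strata):
    """Return (BD statistic, Tarone-adjusted statistic)."""
    strata = list(strata)
    theta = mh_odds_ratio(strata)
    bd = sum_obs = sum_mu = sum_var = 0.0
    for table in strata:
        n11, n12, n21, n22 = table
        r1, c1 = n11 + n12, n11 + n21
        mu = expected_n11(table, theta)
        var = 1.0 / (1 / mu + 1 / (r1 - mu) + 1 / (c1 - mu)
                     + 1 / (n22 - n11 + mu))  # Equation (5)
        bd += (n11 - mu) ** 2 / var
        sum_obs += n11
        sum_mu += mu
        sum_var += var
    tarone = bd - (sum_obs - sum_mu) ** 2 / sum_var  # Equation (6)
    return bd, tarone


table2 = [(6, 22, 4, 22), (22, 86, 15, 113), (6, 8, 1, 54), (23, 62, 5, 34)]
bd_stat, tarone_stat = breslow_day_tarone(table2)
```

Both statistics would then be referred to a chi-square distribution; computing that tail probability is left out here to keep the sketch free of external dependencies.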

The Woolf statistic is

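$$\chi^2_{Woolf} = \sum_{k=1}^{K} w_k \left[\ln(\theta_k)\right]^2 - \frac{\left[\sum_{k=1}^{K} w_k \ln(\theta_k)\right]^2}{\sum_{k=1}^{K} w_k}, \tag{7}$$

where the weights are $w_k = \left[\dfrac{1}{n_{11k}} + \dfrac{1}{n_{12k}} + \dfrac{1}{n_{21k}} + \dfrac{1}{n_{22k}}\right]^{-1}$, where $k = 1, \dots, K$.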
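Equation (7) translates directly into code. The sketch below is an illustrative Python version, assuming that no cell count is zero (the weights are undefined otherwise); the function name is hypothetical.

```python
import math


def woolf_statistic(strata):
    """Woolf heterogeneity statistic, Equation (7).

    strata: iterable of (n11, n12, n21, n22) with all counts > 0.
    """
    sum_w = 0.0
    sum_w_log = 0.0
    sum_w_log_sq = 0.0
    for n11, n12, n21, n22 in strata:
        w = 1.0 / (1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
        log_or = math.log((n11 * n22) / (n12 * n21))
        sum_w += w
        sum_w_log += w * log_or
        sum_w_log_sq += w * log_or ** 2
    return sum_w_log_sq - sum_w_log ** 2 / sum_w


# Pairwise use: with two strata the statistic reduces to the squared
# difference of the log odds ratios divided by the sum of their variances.
pair = [(6, 22, 4, 22), (6, 8, 1, 54)]
woolf_pair = woolf_statistic(pair)
```

The two-strata case is the form used when the test is applied to each pair of strata in a multiple comparison procedure.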

All of these methods were used to determine whether there are any statistically significant differences between the independent or unrelated odds ratios calculated from 2-by-2 studies. The Peto, BD, Tarone, and Woolf test statistics follow the chi-square distribution with K−1 degrees of freedom.

2.2. The multiple comparison procedures for the odds ratio

When the methods represented in Section 2.1 indicate the presence of heterogeneity in the odds ratios, it is important to determine which of these groups are different from the others. With this purpose, the Peto, BD, Tarone, and Woolf tests were used for each pair of the studies or strata. The null hypothesis for the multiple comparisons of the odds ratios is

$$H_0: \theta_i = \theta_j, \qquad H_1: \theta_i \neq \theta_j,$$

for $i < j$, where $i, j = 1, \dots, K$. Because the multiple comparison procedure affects the error rates, different methods have been proposed to adjust the type-I error.

  • Bonferroni Adjustment: This method is the most popular but also the most conservative one [13]. The Bonferroni method controls the family-wise error rate. Suppose $m = K(K-1)/2$ is the number of simultaneously tested hypotheses. The Bonferroni-adjusted type-I error is $\alpha' = \alpha/m$.

  • Dunn–Šidák Adjustment: The Šidák [31] method is slightly more powerful than the Bonferroni method [12]. The Dunn–Šidák adjusted type-I error is $\alpha' = 1 - (1 - \alpha)^{1/m}$.

  • Holm Adjustment: The Holm [20] sequential adjustment is also based on the Bonferroni method, but it is less conservative. First, the p-values of the m tests are ranked from the smallest to the largest. Starting from the smallest p-value (i = 1), the adjusted $\alpha_i = \alpha/(m - i + 1)$ is calculated. The $p_i$-value is compared with $\alpha_i$. The comparison continues until $p_i > \alpha_i$. That hypothesis and all of the remaining hypotheses are considered non-significant.

  • Hochberg Adjustment: The Hochberg [19] sequential adjustment is very similar to the Holm adjustment. For this method, the p-values of the m tested hypotheses are ranked from the largest to the smallest and the procedure starts from the largest p-value (i = 1). The comparison continues until $p_i < \alpha_i$. That hypothesis and all of the hypotheses with smaller p-values are considered significant.

  • Hommel Adjustment: The Hommel [21] method is also less conservative than the Bonferroni method. It is slightly more powerful than the Hochberg method. First, the p-values of the m tests are ranked from the smallest to the largest. Suppose j is the number of hypotheses in the largest subset of the hypotheses and $j = \max\{i \in \{1, \dots, m\} : p_{(m-i+k)} > k\alpha/i \text{ for all } k = 1, \dots, i\}$. If there exists no j, all of the hypotheses are rejected. Otherwise, the hypotheses are rejected when $p_i \le \alpha/j$ [40].

  • Benjamini–Hochberg Adjustment: The Benjamini and Hochberg [4] method is suggested to control the false discovery rate. It is less conservative than the other methods and gives better results when a large number of hypotheses are tested [7]. First, the p-values of the m tested hypotheses are ranked from the smallest to the largest. Let $j = \max\{i : p_i \le \alpha i/m\}$. All of the hypotheses for which $p_i \le p_j$ are rejected, and no hypothesis is rejected if j does not exist.
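The adjustments above are often applied as adjusted p-values that are compared with the unadjusted α, which is how the p.adjust() function used in Section 4 reports them. The Python sketch below illustrates four of the six adjustments (Bonferroni, Dunn–Šidák, Holm, and Benjamini–Hochberg); Hochberg and Hommel are omitted for brevity. The function names and the example p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: m * p, capped at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def sidak(pvalues):
    """Dunn-Sidak adjusted p-values: 1 - (1 - p)^m."""
    m = len(pvalues)
    return [1.0 - (1.0 - p) ** m for p in pvalues]


def holm(pvalues):
    """Holm step-down adjusted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, (m - rank) * pvalues[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjusted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, idx in enumerate(order):  # k = 0 is the largest p-value
        ascending_rank = m - k
        running_min = min(running_min, pvalues[idx] * m / ascending_rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four pairwise tests.
raw = [0.01, 0.04, 0.03, 0.005]
holm_adj = holm(raw)
bh_adj = benjamini_hochberg(raw)
```

A pair is declared different when its adjusted p-value is at most α. Working with adjusted p-values rather than adjusted thresholds gives the same decisions and makes results from different adjustments directly comparable.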

In addition to the tests of the homogeneity of the odds ratios, the BD-based LSD, the chi-square-based LSD, and the adjusted BD tests can be used for multiple comparisons [43]. To avoid confusion with other adjustment methods of the BD test, the adjusted BD test will be referred to as the YA test from now on.

Zwinderman and Bossuyt [46] and Van den Ende et al. [35] reported that the odds ratios were more useful if converted to log values. Armistead [2] discussed the limitations of the measures of association and noted that taking the natural logarithm of the odds ratio makes it symmetric above and below one, with $\ln(1) = 0$. The BD-based and chi-square-based LSD test methods are based on the difference between the two log-odds ratios. Assume that $\theta_i$ is the odds ratio of the ith stratum and $\theta_j$ is the odds ratio of the jth stratum, where $i, j = 1, \dots, K$. Let $\delta$ be the difference between these two log-odds ratios, as

$$\delta = \left|\ln(\theta_i) - \ln(\theta_j)\right|, \quad i < j. \tag{8}$$

The common standard error is

$$SE(\hat{\delta}) = \frac{\sum_{k=1}^{K} \hat{\sigma}_k(\hat{\theta}_{MH})}{K}, \tag{9}$$

where $\hat{\sigma}_k^2(\hat{\theta}_{MH})$ is defined in Equation (5). The $\delta$ value is compared with ORDIF in Equation (10). The null hypothesis is rejected if the difference is greater than or equal to the critical value, $\delta \ge \mathrm{ORDIF}$.

$$\mathrm{ORDIF} = Z_{\alpha/2}\, SE(\hat{\delta}). \tag{10}$$

The chi-square-based LSD test, where the expected values are based on the chi-square approach, is used for multiple comparisons. The expected values ($E_{11k}$, $E_{12k}$, $E_{21k}$, $E_{22k}$) are calculated from the ordinary chi-square test, where $k = 1, \dots, K$. The standard error for stratum k is

$$SE_k = \left[\frac{1}{E_{11k}} + \frac{1}{E_{12k}} + \frac{1}{E_{21k}} + \frac{1}{E_{22k}}\right]^{0.5}. \tag{11}$$

Then, all of the steps used in the first method are followed by calculating the ORDIF.

The YA test is based on the BD test. In order to calculate this method, the expected values based on the overall MH statistic are used.

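$$\chi^2_{YA} = \frac{\left[n_{11i} - \hat{\mu}_i(\hat{\theta}_{MH})\right]^2}{\hat{\sigma}_i^2(\hat{\theta}_{MH})} + \frac{\left[n_{11j} - \hat{\mu}_j(\hat{\theta}_{MH})\right]^2}{\hat{\sigma}_j^2(\hat{\theta}_{MH})}. \tag{12}$$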

The calculated chi-square value is compared with a chi-square value for the appropriate α and df = 1.

3. COVID-19 studies

Even though the effects of COVID-19 on human health have been investigated, recent studies in China, Europe, and America have revealed a relationship between disease severity or mortality and the comorbidities (diabetes, hypertension, cardiovascular disease, liver injury, etc.) for COVID-19 [9,41]. These studies were followed by meta-analysis studies [9,38,42].

de Almeida-Pititto et al. [9] applied several meta-analyses for diabetes, hypertension, cardiovascular disease, and the use of ACE/ARB in COVID-19 mortality and severity cases. Their study indicated a high heterogeneity in ACE/ARB; hence, they calculated a common odds ratio based on the random-effects meta-analysis. Wong et al. [38] collected data from Asia, applied a meta-analysis of COVID-19 severity cases associated with liver injury, discussed the heterogeneity of the data, and applied a subgroup meta-analysis to minimize the heterogeneity. Yang et al. [42] also used a meta-analysis to describe the risk of hypertension, diabetes, respiratory system disease, and cardiovascular disease in COVID-19 severity cases. In these studies, the heterogeneity of the data was tested by the Q-statistic and related measures, then a random-effects meta-analysis was applied due to the presence of moderate to high heterogeneity.

The risk of comorbidities in severity or mortality may differ depending on the age group, region, Hispanic origin, etc. Furthermore, this risk of some groups may be specifically higher than the others. When the odds ratios of COVID-19 data are homogeneous, a common (Mantel–Haenszel) odds ratio is calculated. On the other hand, if the test results indicate a difference in the odds ratios, it is neither suitable nor reliable to calculate a common odds ratio, and it is important to detect the different ones. For instance, in the study of de Almeida-Pititto et al. [9], even though they concluded the necessity of further studies to detect the risk association in different age groups, they did not provide any further analysis. In that case, applying multiple comparison procedures is strongly suggested to make reliable inferences.

In previous studies [8,9,17,33,36], the risk of developing diabetes in severe patients was compared with the same risk in non-severe patients, and the relationship between the severity of COVID-19 (severe/non-severe) and diabetes was presented. Severity was described as ICU admission or the need for mechanical ventilation [9].

The aim of this study is to investigate the risk of developing diabetes for different levels of disease severity, while investigating if there is a statistically significant difference between different countries. With this purpose, data sets from Greece [17], Italy [8], China [36], and France [33] were used to determine COVID-19 severity with regard to diabetes, as presented in Table 2.

Table 2.

The number of patients who were admitted to the ICU or needed mechanical ventilation.

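| Country | Severity | Diabetes + | Diabetes − | Total | Odds ratio |
|---|---|---|---|---|---|
| Greece | + | 6 | 22 | 28 | 1.50 |
|  | − | 4 | 22 | 26 |  |
|  | Total | 10 | 44 | 54 |  |
| Italy | + | 22 | 86 | 108 | 1.93 |
|  | − | 15 | 113 | 128 |  |
|  | Total | 37 | 199 | 236 |  |
| China | + | 6 | 8 | 14 | 40.50 |
|  | − | 1 | 54 | 55 |  |
|  | Total | 7 | 62 | 69 |  |
| France | + | 23 | 62 | 85 | 2.52 |
|  | − | 5 | 34 | 39 |  |
|  | Total | 28 | 96 | 124 |  |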

The BD test results indicated that the odds ratios were not statistically homogeneous ($\chi^2 = 9.743$, df = 3, p-value = 0.021). Even if the odds ratios for Greece, France, and Italy were close, the odds ratio for China was quite a bit higher than the others. In this case, multiple comparison tests were used to detect the difference between each pair of odds ratios between China and the other countries. To assess the behavior of the tests for such an obvious difference between the estimates of the odds ratios among the countries, multiple comparison tests were applied, and their results are summarized in Table 3.

Table 3.

P-values of the multiple comparison tests of the COVID-19 data set.

| Adjustment | Test | G-I | G-C | I-C | G-F | I-F | C-F |
|---|---|---|---|---|---|---|---|
| Bonferroni | BD | 1.000 | 0.162 | 0.068 | 1.000 | 1.000 | 0.341 |
|  | Tarone | 1.000 | 0.195 | 0.071 | 1.000 | 1.000 | 0.410 |
|  | Woolf | 1.000 | 0.348 | 0.270 | 1.000 | 1.000 | 0.676 |
|  | Peto | 1.000 | 0.019 | 0.007 | 1.000 | 1.000 | 0.020 |
| Dunn–Šidák | BD | 1.000 | 0.040 | 0.017 | 0.993 | 0.999 | 0.082 |
|  | Tarone | 1.000 | 0.048 | 0.018 | 0.993 | 0.999 | 0.098 |
|  | Woolf | 1.000 | 0.084 | 0.066 | 0.993 | 0.999 | 0.157 |
|  | Peto | 1.000 | 0.005 | 0.002 | 0.997 | 1.000 | 0.005 |
| Holm | BD | 1.000 | 0.128 | 0.059 | 1.000 | 1.000 | 0.227 |
|  | Tarone | 1.000 | 0.146 | 0.059 | 1.000 | 1.000 | 0.239 |
|  | Woolf | 1.000 | 0.227 | 0.191 | 1.000 | 1.000 | 0.366 |
|  | Peto | 1.000 | 0.018 | 0.007 | 1.000 | 1.000 | 0.018 |
| Hochberg | BD | 0.789 | 0.128 | 0.059 | 0.789 | 0.789 | 0.218 |
|  | Tarone | 0.789 | 0.146 | 0.059 | 0.789 | 0.789 | 0.239 |
|  | Woolf | 0.789 | 0.218 | 0.191 | 0.789 | 0.789 | 0.366 |
|  | Peto | 0.789 | 0.018 | 0.007 | 0.789 | 0.789 | 0.018 |
| Hommel | BD | 0.789 | 0.101 | 0.054 | 0.789 | 0.789 | 0.197 |
|  | Tarone | 0.789 | 0.122 | 0.054 | 0.789 | 0.789 | 0.222 |
|  | Woolf | 0.789 | 0.197 | 0.157 | 0.789 | 0.789 | 0.360 |
|  | Peto | 0.789 | 0.017 | 0.007 | 0.789 | 0.789 | 0.018 |
| Benjamini–Hochberg | BD | 0.787 | 0.027 | 0.014 | 0.787 | 0.787 | 0.035 |
|  | Tarone | 0.787 | 0.028 | 0.014 | 0.787 | 0.787 | 0.037 |
|  | Woolf | 0.787 | 0.035 | 0.034 | 0.787 | 0.787 | 0.056 |
|  | Peto | 0.787 | 0.007 | 0.007 | 0.787 | 0.789 | 0.007 |
| YA test |  | 0.488 | 0.031 | 0.618 | 0.031 | 0.629 | 0.036 |
| Log-odds difference |  | 0.251 | 3.296 | 0.520 | 3.045 | 0.269 | 2.776 |

Abbreviation: G, Greece; I, Italy; C, China; F, France. Bold values indicate a statistically significant difference (p < 0.05). The log-odds differences are compared to LSD = 3.649 for the BD-based LSD test and LSD = 3.854 for the chi-square-based LSD test.

According to the multiple comparison results in Table 3, there were no statistically significant differences between the odds ratios of Greece and Italy, or between the odds ratios of Italy and France, as expected. The YA test was the only test that found a statistical difference between the odds ratios of Greece and France. All of the tests with the Benjamini–Hochberg adjustment, the Peto test with all of the adjustment methods, and the YA test indicated a difference between the odds ratios of China and the other countries, except for the Benjamini–Hochberg adjusted Woolf test between the odds ratios of China and France. The BD-based LSD and chi-square-based LSD tests, and also the Bonferroni, Holm, Hochberg, and Hommel adjusted BD, Tarone, and Woolf tests, did not indicate any statistically significant difference between China and Greece, Italy, or France, even though the odds ratio of China was obviously greater than the other odds ratios.

This observation from the COVID-19 data not only strongly showed the necessity of a multiple comparison procedure for the heterogeneous odds ratios in COVID-19 studies, but also demonstrated the importance of relying on the most suitable test to make inferences. The results showed that the different multiple comparison procedures indicated different results, even when the tests recommended in the previous literature were used. Thus, a detailed simulation study needs to be performed to discuss the reliability and accuracy of the methods.

4. Simulation study

A simulation study was conducted to compare the performance of the multiple comparison methods introduced in Section 2.

4.1. Simulation design

Odds ratios were simulated considering K = 3, 5, 7 strata. Three scenarios were considered for the alternative hypothesis space: in P1, only one odds ratio was different; in P2, more than one odds ratio was different; and in F, all of the odds ratios were different.

Three different sample size designs, given by Bagheri et al. [3], were used. In the first one, equal, in the second one, within-center inequality, and in the third one, among-centers inequality sample size designs were created (see Table 2 in Bagheri et al. [3]). The marginal probabilities were set to $\pi_{.1k} = \pi_{.2k} = 0.50$ and $\pi_{1.k} = \pi_{2.k} = 0.50$ for the balanced design, and $\pi_{.1k} = 0.25$ and $\pi_{1.k} = 0.25$ for the imbalanced design, where $k = 1, \dots, K$. All of the simulation scenarios are given in Table 4. In total, 72 different simulation scenarios were selected to be run. The results were based on 5000 replications. While E1, WI1, and AI1 represented the small sample size for the equal sample size design (E), within-center inequality (WI), and among-centers inequality (AI), the medium sample size was represented by E2, WI2, and AI2. The large sample size was represented by E3, WI3, and AI3.
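The text does not spell out how a 2×2 table with a given true odds ratio and given marginal probabilities is generated. One plausible approach, shown here only as a sketch under that assumption, is to solve for the cell probabilities that reproduce the target odds ratio and margins, and then draw the n observations of a stratum from the resulting multinomial distribution. All names below are hypothetical.

```python
import math
import random


def cell_probabilities(theta, row_p=0.5, col_p=0.5):
    """Cell probabilities (p11, p12, p21, p22) with odds ratio theta,
    P(X = 1) = row_p and P(Y = 1) = col_p."""
    if abs(theta - 1.0) < 1e-12:
        p11 = row_p * col_p
    else:
        # p11 solves (theta-1) p^2 - s p + theta*row_p*col_p = 0.
        s = 1.0 + (row_p + col_p) * (theta - 1.0)
        disc = s * s - 4.0 * theta * (theta - 1.0) * row_p * col_p
        p11 = (s - math.sqrt(disc)) / (2.0 * (theta - 1.0))
    return (p11, row_p - p11, col_p - p11, 1.0 - row_p - col_p + p11)


def simulate_table(theta, n, row_p=0.5, col_p=0.5, rng=None):
    """Draw one 2x2 table of size n as multinomial counts (an assumption)."""
    rng = rng or random.Random()
    probs = cell_probabilities(theta, row_p, col_p)
    counts = [0, 0, 0, 0]
    for cell in rng.choices(range(4), weights=probs, k=n):
        counts[cell] += 1
    return tuple(counts)


# Scenario C1 of Table 4: K = 3, true odds ratios (10, 10, 1.2), n = 40 each.
rng = random.Random(2022)
c1_tables = [simulate_table(theta, 40, rng=rng) for theta in (10, 10, 1.2)]
```

With small n, a simulated table can contain a zero cell, in which case statistics such as Woolf's are undefined; how such replications were handled is not stated in the text.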

Table 4.

Simulation scenarios and their abbreviations.

| Comb. | K | True odds ratio | Sample size design | n | Str. | Comb. | K | True odds ratio | Sample size design | n | Str. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 3 | P1: (10,10,1.2) | E1 | (40,40,40) | B | C37 | 5 | F: (20,10,4,1.2,0.5) | E1 | (40,40,40,40,40) | B |
| C2 |  |  | E2 | (100,100,100) |  | C38 |  |  | E2 | (100,100,100,100,100) |  |
| C3 |  |  | E3 | (200,200,200) |  | C39 |  |  | E3 | (200,200,200,200,200) |  |
| C4 |  |  | WI1 | (40,40,40) | IB | C40 |  |  | WI1 | (40,40,40,40,40) | IB |
| C5 |  |  | WI2 | (100,100,100) |  | C41 |  |  | WI2 | (100,100,100,100,100) |  |
| C6 |  |  | WI3 | (200,200,200) |  | C42 |  |  | WI3 | (200,200,200,200,200) |  |
| C7 |  |  | AI1 | (20,20,80) | B | C43 |  |  | AI1 | (20,20,20,20,160) | B |
| C8 |  |  | AI2 | (50,50,200) |  | C44 |  |  | AI2 | (50,50,50,50,300) |  |
| C9 |  |  | AI3 | (100,100,400) |  | C45 |  |  | AI3 | (100,100,100,100,600) |  |
| C10 |  | F: (10,4,1.2) | E1 | (40,40,40) | B | C46 | 7 | P1: (10,10,10,10,10,10,1.2) | E1 | (40,40,40,40,40,40,40) | B |
| C11 |  |  | E2 | (100,100,100) |  | C47 |  |  | E2 | (100,100,100,100,100,100,100) |  |
| C12 |  |  | E3 | (200,200,200) |  | C48 |  |  | E3 | (200,200,200,200,200,200,200) |  |
| C13 |  |  | WI1 | (40,40,40) | IB | C49 |  |  | WI1 | (40,40,40,40,40,40,40) | IB |
| C14 |  |  | WI2 | (100,100,100) |  | C50 |  |  | WI2 | (100,100,100,100,100,100,100) |  |
| C15 |  |  | WI3 | (200,200,200) |  | C51 |  |  | WI3 | (200,200,200,200,200,200,200) |  |
| C16 |  |  | AI1 | (20,20,80) | B | C52 |  |  | AI1 | (20,20,20,20,20,20,140) | B |
| C17 |  |  | AI2 | (50,50,200) |  | C53 |  |  | AI2 | (50,50,50,50,50,50,400) |  |
| C18 |  |  | AI3 | (100,100,400) |  | C54 |  |  | AI3 | (100,100,100,100,100,100,800) |  |
| C19 | 5 | P1: (10,10,10,10,1.2) | E1 | (40,40,40,40,40) | B | C55 |  | P2: (35,35,10,10,10,10,1.2) | E1 | (40,40,40,40,40,40,40) | B |
| C20 |  |  | E2 | (100,100,100,100,100) |  | C56 |  |  | E2 | (100,100,100,100,100,100,100) |  |
| C21 |  |  | E3 | (200,200,200,200,200) |  | C57 |  |  | E3 | (200,200,200,200,200,200,200) |  |
| C22 |  |  | WI1 | (40,40,40,40,40) | IB | C58 |  |  | WI1 | (40,40,40,40,40,40,40) | IB |
| C23 |  |  | WI2 | (100,100,100,100,100) |  | C59 |  |  | WI2 | (100,100,100,100,100,100,100) |  |
| C24 |  |  | WI3 | (200,200,200,200,200) |  | C60 |  |  | WI3 | (200,200,200,200,200,200,200) |  |
| C25 |  |  | AI1 | (20,20,20,20,160) | B | C61 |  |  | AI1 | (20,20,20,20,20,20,140) | B |
| C26 |  |  | AI2 | (50,50,50,50,300) |  | C62 |  |  | AI2 | (50,50,50,50,50,50,400) |  |
| C27 |  |  | AI3 | (100,100,100,100,600) |  | C63 |  |  | AI3 | (100,100,100,100,100,100,800) |  |
| C28 |  | P2: (20,10,10,10,1.2) | E1 | (40,40,40,40,40) | B | C64 |  | F: (35,30,25,20,10,5,1.2) | E1 | (40,40,40,40,40,40,40) | B |
| C29 |  |  | E2 | (100,100,100,100,100) |  | C65 |  |  | E2 | (100,100,100,100,100,100,100) |  |
| C30 |  |  | E3 | (200,200,200,200,200) |  | C66 |  |  | E3 | (200,200,200,200,200,200,200) |  |
| C31 |  |  | WI1 | (40,40,40,40,40) | IB | C67 |  |  | WI1 | (40,40,40,40,40,40,40) | IB |
| C32 |  |  | WI2 | (100,100,100,100,100) |  | C68 |  |  | WI2 | (100,100,100,100,100,100,100) |  |
| C33 |  |  | WI3 | (200,200,200,200,200) |  | C69 |  |  | WI3 | (200,200,200,200,200,200,200) |  |
| C34 |  |  | AI1 | (20,20,20,20,160) | B | C70 |  |  | AI1 | (20,20,20,20,20,20,140) | B |
| C35 |  |  | AI2 | (50,50,50,50,300) |  | C71 |  |  | AI2 | (50,50,50,50,50,50,400) |  |
| C36 |  |  | AI3 | (100,100,100,100,600) |  | C72 |  |  | AI3 | (100,100,100,100,100,100,800) |  |

Abbreviation: E, equal sample size design; WI, within-center inequality; AI, among-centers inequality; Str., structure of the table; B, balanced; IB, imbalanced; P1, only one of the odds ratios is different; P2, more than one of the odds ratios is different; F, all the odds ratios are different.

A total of 27 different methods, consisting of four tests with six adjustment methods and three multiple comparison methods were considered.

The simulation software was developed in R version 3.6.1 by the author. The BreslowDayTest() and WoolfTest() functions of the DescTools package [32] were used to perform the BD, Tarone, and Woolf tests, and the p.adjust() function of the stats package [28] was used to apply the adjustment methods. The significance level was set to α = 0.05.

4.2. Measures to evaluate the tests

Table 5 summarizes the possible outcomes in hypothesis testing [4]. Here, $m_0$ is the number of true null hypotheses and $m = K(K-1)/2$ is the number of pairwise hypotheses tested among the K odds ratios.

Table 5.

The possible outcomes in the hypothesis testing.

|  | Declared non-significant | Declared significant | Total |
|---|---|---|---|
| H0 is true | U | V | $m_0$ |
| HA is true | T | S | $m - m_0$ |
| Total | $m - R$ | R | m |

In Table 5, U is the number of hypotheses that were correctly declared as non-significant and S is the number of hypotheses that were correctly declared as significant. R is the number of hypotheses that were declared as significant. V is the type-I error and T is the type-II error.

The measures used to assess the tested hypotheses were divided into two classes: power measures and error measures. The former are the any-pair power (ANPP), all-pairs power (APP), positive predictive value (PPV), and true negative rate (TNR). The latter are the per-comparison error rate (PCER), family-wise error rate (FWER), and false discovery rate (FDR).

The ANPP is defined as the probability of identifying at least one true difference between the pairs, and the APP is the probability of detecting all of the truly different pairs [22,29]. The TNR is the proportion of correctly declared non-significant hypotheses (TNR = U/(m − R)). The PPV, or precision, is the proportion of correctly declared significant hypotheses (PPV = S/R). The FWER is the probability of having at least one type-I error over the comparisons and the PCER is the probability of observing a type-I error in any comparison [10]. The FDR is the proportion of falsely declared significant hypotheses (FDR = V/R) [4].
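The ratio measures can be illustrated for a single replication with a short Python sketch. It uses the definitions as stated in the text (TNR = U/(m − R), PPV = S/R, FDR = V/R). Returning None when a denominator is zero is an assumption of this sketch, and the function names and example decisions are hypothetical.

```python
def outcome_counts(truly_different, declared_significant):
    """Count U, V, T, S as in Table 5 from per-pair truth and decisions."""
    U = V = T = S = 0
    for alt, sig in zip(truly_different, declared_significant):
        if alt and sig:
            S += 1      # correctly declared significant
        elif alt:
            T += 1      # type-II error
        elif sig:
            V += 1      # type-I error
        else:
            U += 1      # correctly declared non-significant
    return U, V, T, S


def ratio_measures(U, V, T, S):
    """TNR, PPV, and FDR; None marks a measure that cannot be calculated."""
    m = U + V + T + S
    R = V + S
    return {
        "TNR": U / (m - R) if m - R > 0 else None,
        "PPV": S / R if R > 0 else None,
        "FDR": V / R if R > 0 else None,
    }


# K = 3 strata, P1-type: stratum 1 differs, so pairs (1,2) and (1,3) truly differ.
truth = [True, True, False]        # pairs (1,2), (1,3), (2,3)
decisions = [True, False, False]   # hypothetical test decisions
U, V, T, S = outcome_counts(truth, decisions)
print((U, V, T, S), ratio_measures(U, V, T, S))
```

In this example one truly different pair is missed, so U = 1, V = 0, T = 1, and S = 1, giving TNR = 0.5, PPV = 1, and FDR = 0.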

In the simulation study, the K odds ratios were compared in m pairs using the multiple comparison procedures. Table 5 was created for each replication of each scenario mentioned in Section 4.1. Then, the mean values of the measures were calculated over the replications from the quantities in Table 5. To evaluate the tests, the ANPP, APP, PPV, TNR, PCER, FWER, and FDR measures were used for the scenarios with the P1- and P2-type alternative hypotheses. Because at least one true non-significant hypothesis is required to calculate the TNR, PCER, FWER, and FDR measures, and the F-type alternative hypotheses were defined as the case in which all of the true odds ratios are different, only the ANPP, APP, and PPV were calculated for the F-type alternative hypotheses.
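The averaging over replications can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's code: ANPP, APP, and FWER are taken as means of per-replication indicators, PCER is taken as the mean of V/m (the usual definition), and the replication counts are hypothetical.

```python
def summarize(replications, m, m1):
    """Average power and error measures over replications.

    replications: list of (V, S) pairs, one per replication
    m:  number of pairwise comparisons
    m1: number of truly different pairs (m - m0)
    """
    n = len(replications)
    return {
        # At least one true difference detected.
        "ANPP": sum(1 for V, S in replications if S >= 1) / n,
        # All true differences detected.
        "APP": sum(1 for V, S in replications if S == m1) / n,
        # At least one type-I error.
        "FWER": sum(1 for V, S in replications if V >= 1) / n,
        # Expected share of comparisons that are type-I errors.
        "PCER": sum(V for V, S in replications) / (n * m),
    }


# Four hypothetical replications for K = 3 (m = 3) with two truly different pairs.
reps = [(0, 2), (0, 1), (1, 2), (0, 0)]
result = summarize(reps, m=3, m1=2)
print(result)
```

Because V/m can never exceed the indicator that V is at least 1, the PCER computed this way is always less than or equal to the FWER.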

4.3. Simulation results

The values of the power and error measures for the methods and scenarios are presented in Tables 6–9. Not all of the results are tabulated here due to limited space; some of them are given in Tables 1–28 of Supplemental Material.

Table 6.

ANPP values computed for the Bonferroni, Dunn-Šidák, Holm, and Hochberg adjusted tests for P1-type alternative hypothesis.

      Bonferroni Dunn-Šidák Holm Hochberg
K SSD Com. BD Tarone Woolf Peto BD Tarone Woolf Peto BD Tarone Woolf Peto BD Tarone Woolf Peto
3 E1 C1 0.597 0.584 0.502 0.481 0.921 0.916 0.877 0.823 0.597 0.597 0.552 0.530 0.611 0.610 0.563 0.542
  E2 C2 0.902 0.900 0.895 0.850 0.990 0.989 0.988 0.969 0.902 0.902 0.902 0.885 0.912 0.912 0.912 0.891
  E3 C3 0.998 0.998 0.997 0.995 1.000 1.000 1.000 0.999 0.998 0.998 0.998 0.997 0.998 0.998 0.998 0.997
  WI1 C4 0.492 0.478 0.270 0.582 0.915 0.908 0.754 0.929 0.513 0.511 0.367 0.592 0.527 0.526 0.375 0.597
  WI2 C5 0.781 0.777 0.749 0.841 0.969 0.969 0.962 0.984 0.793 0.793 0.790 0.845 0.813 0.813 0.804 0.853
  WI3 C6 0.975 0.975 0.974 0.984 0.998 0.998 0.997 0.998 0.977 0.977 0.977 0.984 0.982 0.982 0.982 0.985
  AI1 C7 0.369 0.367 0.088 0.303 0.789 0.785 0.517 0.695 0.369 0.369 0.159 0.341 0.394 0.394 0.166 0.347
  AI2 C8 0.825 0.824 0.781 0.797 0.964 0.964 0.953 0.950 0.825 0.825 0.816 0.820 0.844 0.844 0.832 0.837
  AI3 C9 0.994 0.994 0.994 0.992 0.999 0.999 0.999 0.999 0.994 0.994 0.994 0.994 0.996 0.996 0.996 0.996
5 E1 C19 0.560 0.542 0.390 0.409 0.869 0.855 0.763 0.721 0.560 0.550 0.404 0.418 0.561 0.550 0.404 0.418
  E2 C20 0.908 0.907 0.899 0.845 0.984 0.983 0.982 0.959 0.908 0.908 0.902 0.855 0.908 0.908 0.902 0.855
  E3 C21 0.999 0.999 0.999 0.998 1.000 1.000 1.000 0.999 0.999 0.999 0.999 0.998 0.999 0.999 0.999 0.998
  WI1 C22 0.439 0.422 0.133 0.547 0.822 0.807 0.552 0.868 0.443 0.432 0.152 0.549 0.443 0.432 0.152 0.549
  WI2 C23 0.782 0.775 0.738 0.855 0.960 0.957 0.939 0.978 0.785 0.783 0.751 0.857 0.786 0.784 0.751 0.857
  WI3 C24 0.971 0.970 0.967 0.981 0.996 0.996 0.995 0.998 0.971 0.971 0.970 0.981 0.971 0.971 0.970 0.981
  AI1 C25 0.247 0.246 0.002 0.187 0.589 0.587 0.146 0.501 0.247 0.247 0.005 0.195 0.250 0.250 0.005 0.195
  AI2 C26 0.877 0.877 0.798 0.849 0.969 0.968 0.951 0.957 0.877 0.877 0.813 0.855 0.880 0.880 0.813 0.856
  AI3 C27 0.999 0.999 0.999 0.999 1.000 1.000 1.000 1.000 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.999
7 E1 C46 0.527 0.502 0.323 0.373 0.836 0.821 0.675 0.671 0.527 0.504 0.332 0.377 0.527 0.504 0.332 0.377
  E2 C47 0.912 0.908 0.893 0.825 0.983 0.981 0.977 0.944 0.912 0.909 0.895 0.830 0.912 0.909 0.895 0.831
  E3 C48 0.998 0.998 0.998 0.995 1.000 1.000 1.000 0.999 0.998 0.998 0.998 0.995 0.998 0.998 0.998 0.995
  WI1 C49 0.416 0.397 0.079 0.537 0.777 0.758 0.411 0.841 0.418 0.402 0.088 0.538 0.419 0.403 0.088 0.538
  WI2 C50 0.784 0.774 0.710 0.855 0.947 0.943 0.925 0.971 0.786 0.780 0.721 0.856 0.786 0.780 0.721 0.856
  WI3 C51 0.979 0.978 0.975 0.987 0.998 0.998 0.997 0.999 0.979 0.979 0.977 0.987 0.979 0.979 0.977 0.987
  AI1 C52 0.171 0.169 0.000 0.127 0.472 0.470 0.035 0.385 0.171 0.171 0.000 0.130 0.172 0.172 0.000 0.130
  AI2 C53 0.906 0.906 0.792 0.878 0.982 0.982 0.962 0.976 0.906 0.906 0.802 0.880 0.907 0.907 0.802 0.880
  AI3 C54 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

Abbreviation: SSD, sample size design; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality.

Table 9.

FWER values computed for the Hommel and Benjamini–Hochberg adjusted tests, BD-based and Chi-squared-based LSD tests, and YA test for P1-type alternative hypothesis.

      Hommel Benjamini–Hochberg      
K SSD Com. BD Tarone Woolf Peto BD Tarone Woolf Peto BDLSD CSLSD YA
3 E1 C1 0.013 0.011 0.008 0.006 0.034 0.034 0.031 0.018 0.036 0.026 0.180
  E2 C2 0.022 0.021 0.021 0.016 0.042 0.042 0.042 0.021 0.000 0.000 0.558
  E3 C3 0.022 0.022 0.021 0.018 0.045 0.045 0.045 0.018 0.000 0.000 0.949
  WI1 C4 0.014 0.013 0.007 0.017 0.042 0.042 0.036 0.052 0.159 0.176 0.126
  WI2 C5 0.027 0.027 0.025 0.029 0.047 0.047 0.047 0.053 0.000 <0.001 0.325
  WI3 C6 0.037 0.037 0.037 0.038 0.046 0.046 0.046 0.051 0.000 0.000 0.739
  AI1 C7 0.003 0.003 0.001 0.001 0.017 0.017 0.011 0.011 0.079 0.072 0.416
  AI2 C8 0.012 0.011 0.010 0.006 0.035 0.035 0.033 0.017 <0.001 <0.001 0.851
  AI3 C9 0.027 0.026 0.025 0.022 0.044 0.044 0.044 0.023 0.000 0.000 0.996
5 E1 C19 0.010 0.008 0.003 0.002 0.062 0.061 0.049 0.031 0.162 0.100 0.225
  E2 C20 0.011 0.011 0.009 0.002 0.102 0.102 0.102 0.045 0.000 0.000 0.481
  E3 C21 0.013 0.012 0.011 0.003 0.104 0.104 0.104 0.045 0.000 0.000 0.859
  WI1 C22 0.009 0.008 <0.001 0.020 0.075 0.074 0.053 0.095 0.479 0.497 0.193
  WI2 C23 0.012 0.011 0.008 0.017 0.085 0.085 0.084 0.111 0.001 0.001 0.298
  WI3 C24 0.011 0.009 0.008 0.016 0.101 0.101 0.101 0.121 0.000 0.000 0.575
  AI1 C25 0.002 0.001 <0.001 <0.001 0.016 0.016 0.010 0.010 0.176 0.164 0.756
  AI2 C26 0.006 0.004 0.002 0.001 0.075 0.075 0.065 0.029 0.000 0.000 0.969
  AI3 C27 0.011 0.010 0.008 0.001 0.103 0.103 0.103 0.050 0.000 0.000 1.000
7 E1 C46 0.008 0.005 0.001 0.001 0.086 0.083 0.059 0.036 0.328 0.207 0.280
  E2 C47 0.009 0.007 0.005 0.001 0.133 0.133 0.132 0.052 0.000 0.000 0.433
  E3 C48 0.010 0.010 0.010 0.002 0.151 0.151 0.151 0.064 0.000 0.000 0.759
  WI1 C49 0.008 0.006 0.000 0.021 0.093 0.088 0.052 0.122 0.721 0.731 0.276
  WI2 C50 0.013 0.011 0.009 0.019 0.124 0.124 0.119 0.156 0.001 0.002 0.312
  WI3 C51 0.010 0.010 0.009 0.019 0.142 0.142 0.142 0.177 0.000 0.000 0.510
  AI1 C52 0.001 0.001 0.000 0.000 0.013 0.012 0.005 0.006 0.492 0.466 0.728
  AI2 C53 0.005 0.005 0.001 0.001 0.106 0.104 0.081 0.037 0.001 <0.001 0.996
  AI3 C54 0.010 0.009 0.007 0.002 0.141 0.141 0.139 0.055 0.000 0.000 1.000

Abbreviation: SSD, sample size design; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality; BDLSD, BD-based LSD; CSLSD, chi-squared-based LSD.

4.3.1. ANPP

The ANPP results for the case where only one odds ratio was different are summarized in Tables 6 and 7 (see Tables 1–4 of Supplemental Material for the results of the cases where more than one or all of the odds ratios were different). The ANPP values ranged between 0 and 1, and high values indicated a desirably high power for the test. The results are interpreted below by the number of strata, the number of different odds ratios, the sample size design, and the methods.

Table 7.

ANPP values computed for the Hommel and Benjamini-Hochberg adjusted tests, BD-based and chi-squared-based LSD tests, and YA test for P1-type alternative hypothesis.

      Hommel Benjamini–Hochberg      
K SSD Com. BD Tarone Woolf Peto BD Tarone Woolf Peto BDLSD CSLSD YA
3 E1 C1 0.684 0.674 0.619 0.580 0.855 0.855 0.850 0.844 0.706 0.596 0.708
  E2 C2 0.934 0.932 0.928 0.904 0.982 0.982 0.982 0.977 0.001 0.000 0.942
  E3 C3 0.998 0.998 0.998 0.997 1.000 1.000 1.000 0.999 0.000 0.000 0.999
  WI1 C4 0.590 0.580 0.408 0.660 0.849 0.849 0.801 0.850 0.992 0.992 0.595
  WI2 C5 0.852 0.850 0.831 0.891 0.971 0.971 0.971 0.972 0.063 0.093 0.848
  WI3 C6 0.987 0.987 0.987 0.991 0.998 0.998 0.998 0.998 0.000 0.000 0.986
  AI1 C7 0.446 0.443 0.190 0.382 0.642 0.642 0.565 0.638 0.809 0.781 0.486
  AI2 C8 0.879 0.878 0.850 0.858 0.957 0.957 0.957 0.957 0.008 0.006 0.885
  AI3 C9 0.997 0.997 0.997 0.997 0.999 0.999 0.999 0.999 0.000 0.000 0.997
5 E1 C19 0.583 0.568 0.433 0.438 0.785 0.785 0.771 0.768 0.934 0.834 0.964
  E2 C20 0.920 0.917 0.911 0.863 0.979 0.979 0.979 0.974 0.002 0.000 0.998
  E3 C21 0.999 0.999 0.999 0.998 1.000 1.000 1.000 1.000 0.000 0.000 1.000
  WI1 C22 0.463 0.448 0.166 0.566 0.769 0.769 0.675 0.773 0.999 0.999 0.912
  WI2 C23 0.804 0.799 0.764 0.871 0.961 0.961 0.961 0.961 0.116 0.155 0.989
  WI3 C24 0.976 0.975 0.973 0.984 0.997 0.997 0.997 0.997 0.000 0.000 1.000
  AI1 C25 0.260 0.259 0.006 0.201 0.462 0.462 0.261 0.460 0.739 0.706 0.778
  AI2 C26 0.891 0.890 0.829 0.867 0.962 0.962 0.961 0.962 0.020 0.015 0.992
  AI3 C27 0.999 0.999 0.999 0.999 1.000 1.000 1.000 1.000 0.000 0.000 1.000
7 E1 C46 0.539 0.514 0.341 0.386 0.748 0.748 0.722 0.723 0.986 0.940 0.985
  E2 C47 0.917 0.915 0.902 0.840 0.979 0.979 0.979 0.969 0.005 0.001 1.000
  E3 C48 0.998 0.998 0.998 0.996 1.000 1.000 1.000 1.000 0.000 0.000 1.000
  WI1 C49 0.428 0.411 0.094 0.544 0.741 0.740 0.607 0.748 0.999 0.999 0.958
  WI2 C50 0.795 0.789 0.730 0.863 0.955 0.955 0.955 0.956 0.160 0.195 0.997
  WI3 C51 0.981 0.980 0.978 0.988 0.999 0.999 0.999 0.999 0.000 0.000 1.000
  AI1 C52 0.178 0.177 0.001 0.133 0.361 0.361 0.154 0.356 0.968 0.955 0.948
  AI2 C53 0.913 0.913 0.815 0.891 0.979 0.979 0.977 0.978 0.030 0.022 1.000
  AI3 C54 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.000 0.000 1.000

Abbreviation: SSD, sample size design; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality; BDLSD, BD-based LSD; CSLSD, chi-squared-based LSD.

  • Results based on the number of strata: The probability of identifying at least one true difference between the pairs was not affected by the number of strata for the large sample sizes. As the number of strata increased from 3 to 7, when only one of the odds ratios was different and the sample size was small, the ability of the tests to identify at least one difference between the pairs decreased by 24% on average for all of the tests except the BD-based and chi-square-based LSD tests and the YA test.

  • Results based on the number of different odds ratios: We compared the ANPP values by the number of different odds ratios (P1-, P2-, and F-type alternative hypotheses). The probability of identifying at least one true difference between the pairs was not notably affected by changes in the number of different odds ratios for the large sample sizes (E3, WI3, or AI3). For the tables with 3 strata, the ability of the tests to identify at least one difference between the pairs was higher when all of the odds ratios were different than when only one of the odds ratios was different. For the tables with 5 or 7 strata, in general, the ability of the tests to identify at least one difference between the pairs increased as the number of different odds ratios increased.

  • Results based on the sample size design: For all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test, the ability of the tests to identify at least one difference between the pairs increased by 24.5% on average when the sample size increased. In contrast, the ANPP values of the BD-based and chi-square-based LSD tests were highest when the sample size was small.

  • Results based on the multiple comparison methods: For all of the scenarios, when the sample size was large, the probability of identifying at least one true difference between the pairs was over 0.930 for all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test. For the tables with 5 or 7 strata in which more than one or all of the odds ratios were different and the sample size was medium, the ANPP values of these multiple comparison methods were similar and over 0.828 (Tables 1–4 of Supplemental Material).

4.3.2. APP

The APP results are summarized in Tables 5–10 of Supplemental Material. The APP values ranged between 0 and 1, and high values were desirable. The APP values were mostly lower than those of the ANPP. For the tables with 5 and 7 strata, the APP values of all of the methods were close to 0 when more than one or all of the odds ratios were different, except for the YA test in some scenarios (Tables 5–6 of Supplemental Material). Thus, a comparison of the APP by strata was possible for only some of the scenarios.

  • Results based on the number of strata: When only one of the odds ratios was different, the probability of identifying all of the pairs that are actually different decreased by 17.6% on average as the number of strata increased, for all of the tests except the YA test.

  • Results based on the number of different odds ratios: In general, when the number of different odds ratios increased, the power of the methods decreased. When more than one or all of the odds ratios were different, the ability of the tests to identify all of the pairs that are actually different was mostly close to zero, except for the YA test.

  • Results based on the sample size design: The probability of identifying all of the pairs that are actually different increased with the sample size for all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test.

  • Results based on the multiple comparison methods: In terms of the ability to identify all of the pairs that are actually different, the YA test performed better than the other tests in most of the scenarios. The APP values of all of the tests, except for the YA test, were very low.

4.3.3. PPV

The PPV results are summarized in Tables 11–14 of Supplemental Material. All PPV values for the case where all of the odds ratios were different were equal to 1. High PPV values are desirable. The PPV values of the BD-based and chi-square-based LSD tests could not be calculated for the larger sample sizes because no significant differences were declared during the simulation runs.

  • Results based on the number of strata: The proportion of correctly declared differences between the odds ratios was not affected by the number of strata, except for some of the scenarios of the BD-based and chi-square-based LSD tests and the YA test.

  • Results based on the number of different odds ratios: In general, when the number of different odds ratios increased, the proportion of correctly declared differences between the odds ratios increased by 1.4% on average for all of the tests.

  • Results based on the sample size design: The proportion of correctly declared differences between the odds ratios was not affected by changes in the sample size design, except for some of the scenarios of the BD-based and chi-square-based LSD tests and the YA test. The lowest values of the YA test were found for the tables with among-centers inequality.

  • Results based on the multiple comparison methods: High PPV values were found in most of the scenarios and for most of the methods. In general, the PPV of the YA test was slightly lower than that of the other methods in most scenarios, although its average value was 0.870. The PPV of the YA test was lower than 0.50 for the tables with 5 or 7 strata, one different odds ratio (P1-type), and an among-centers inequality design (AI).

4.3.4. TNR

The TNR results are summarized in Tables 15–18 of Supplemental Material. As with the PPV, high TNR values are desirable.

  • Results based on the number of strata: As the number of strata increased from 3 to 7, when only one of the odds ratios was different (P1-type) and the sample size was small (E1, WI1, or AI1), the proportion of correctly declared non-significant differences between the odds ratios increased by 16.3% on average. When more than one odds ratio was different, the tendency of the tests to declare significant differences between odds ratios that were actually not different changed little with the number of strata.

  • Results based on the number of different odds ratios: In all of the scenarios, the proportion of correctly declared non-significant differences between the odds ratios was higher when only one of the odds ratios was different than when more than one of the odds ratios was different.

  • Results based on the sample size design: For all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test, the proportion of correctly declared non-significant differences between the odds ratios increased by 17.3% on average when the sample size increased. In general, the highest TNR values of all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test were observed in the large sample size design. On the other hand, the highest TNR values of the BD-based and chi-square-based LSD tests were observed in the within-center inequality design with a small sample size.

  • Results based on the multiple comparison methods: The Benjamini–Hochberg adjusted tests showed better performance for the tables with 3 strata. The chi-square-based LSD test showed better performance for the tables with 3 or 5 strata and a within-center inequality design with a small sample size. The YA test mostly showed a much higher ability to correctly declare non-significant differences between the odds ratios for all of the scenarios with 5 and 7 strata.

4.3.5. PCER

The PCER results are summarized in Tables 19–22 of Supplemental Material. The PCER values were expected to be around α=0.05.

  • Results based on the number of strata: The change in the number of strata had a slight effect (around 0.001±0.015) on the probability of observing a type-I error in any comparison. When only one of the odds ratios was different and the table had among-centers inequality with medium or large sample sizes, the YA test was the only method that diverged from α=0.05 with the increasing number of strata.

  • Results based on the number of different odds ratios: While the YA test mostly performed better for tables with more than one different odds ratio than the case with only one different odds ratio, the change in the number of different odds ratios had a slight effect on the probability of observing a type-I error in any comparison for the other methods.

  • Results based on the sample size design: The probability of observing a type-I error in any comparison of the adjusted BD, Tarone, Woolf, and Peto tests was not impacted by the sample design. The probability of observing a type-I error in any comparison of the BD-based and chi-square-based LSD tests converged to α when the table had the within-center inequality design with a small sample size. The PCER value of the YA test diverged from α when the table had an among-centers inequality design with large sample size, followed by a medium sample size.

  • Results based on the multiple comparison methods: When one or more than one odds ratio was different and the table had an among-centers inequality design with a medium or large sample size, the Benjamini–Hochberg adjusted BD, Tarone, and Woolf tests showed the best PCER performance. When only one of the odds ratios was different and the table had a within-center inequality design with a medium or large sample size, the Benjamini–Hochberg adjusted Peto test showed slightly better PCER performance. The BD-based and chi-square-based LSD tests showed the best PCER performance for the within- or among-centers inequality designs with a small sample size. The YA test showed the best PCER performance for the tables with 3 strata, only one different odds ratio, and an equal and small sample size; for the tables with 5 strata, only one different odds ratio, and an equal or within-center inequality design with a small sample size; and for the tables with 7 strata with medium or large sample sizes.

4.3.6. FWER

The FWER results for the case where only one odds ratio was different are summarized in Tables 8 and 9 (see Tables 23 and 24 of Supplemental Material for the results of the case where more than one of the odds ratios was different). The FWER values are also expected to be around α = 0.05.

Table 8.

FWER values computed for the Bonferroni, Dunn-Šidák, Holm, and Hochberg adjusted tests for P1-type alternative hypothesis.

      Bonferroni Dunn-Šidák Holm Hochberg
K SSD Com. BD Tarone Woolf Peto BD Tarone Woolf Peto BD Tarone Woolf Peto BD Tarone Woolf Peto
3 E1 C1 0.005 0.005 0.003 0.002 0.024 0.022 0.015 0.008 0.007 0.007 0.005 0.004 0.009 0.009 0.007 0.006
  E2 C2 0.004 0.004 0.003 0.001 0.020 0.019 0.018 0.006 0.010 0.010 0.010 0.007 0.020 0.020 0.020 0.016
  E3 C3 0.004 0.004 0.003 0.001 0.013 0.013 0.013 0.005 0.010 0.010 0.010 0.010 0.021 0.021 0.021 0.018
  WI1 C4 0.007 0.005 0.001 0.010 0.031 0.029 0.017 0.035 0.010 0.009 0.004 0.012 0.011 0.011 0.005 0.014
  WI2 C5 0.005 0.005 0.003 0.009 0.020 0.020 0.018 0.030 0.010 0.010 0.009 0.016 0.025 0.025 0.025 0.028
  WI3 C6 0.005 0.005 0.004 0.007 0.017 0.017 0.016 0.025 0.011 0.011 0.011 0.017 0.037 0.037 0.037 0.037
  AI1 C7 0.001 0.001 0.000 <0.001 0.010 0.009 0.004 0.005 0.002 0.002 <0.001 0.001 0.002 0.002 <0.001 0.001
  AI2 C8 0.005 0.004 0.003 0.001 0.019 0.018 0.014 0.007 0.007 0.007 0.006 0.004 0.010 0.010 0.009 0.006
  AI3 C9 0.004 0.004 0.003 0.001 0.020 0.019 0.018 0.005 0.011 0.011 0.011 0.010 0.025 0.025 0.025 0.022
5 E1 C19 0.007 0.006 0.001 0.001 0.032 0.028 0.015 0.010 0.007 0.007 0.002 0.002 0.007 0.007 0.002 0.002
  E2 C20 0.008 0.006 0.005 0.001 0.031 0.029 0.026 0.007 0.009 0.009 0.008 0.002 0.009 0.009 0.008 0.002
  E3 C21 0.006 0.006 0.005 0.002 0.027 0.026 0.026 0.005 0.009 0.009 0.009 0.003 0.010 0.010 0.010 0.003
  WI1 C22 0.007 0.006 <0.001 0.017 0.041 0.037 0.012 0.059 0.008 0.007 <0.001 0.018 0.008 0.007 <0.001 0.018
  WI2 C23 0.009 0.009 0.005 0.013 0.028 0.027 0.022 0.042 0.011 0.010 0.007 0.017 0.011 0.010 0.008 0.017
  WI3 C24 0.004 0.004 0.004 0.010 0.024 0.022 0.021 0.033 0.007 0.007 0.007 0.014 0.007 0.007 0.007 0.014
  AI1 C25 0.001 0.001 <0.001 <0.010 0.009 0.009 0.001 0.004 0.001 0.001 <0.001 <0.001 0.001 0.001 <0.001 <0.001
  AI2 C26 0.004 0.003 0.001 <0.001 0.022 0.019 0.011 0.005 0.005 0.004 0.002 0.001 0.005 0.004 0.002 0.001
  AI3 C27 0.005 0.004 0.003 0.001 0.028 0.027 0.024 0.006 0.009 0.008 0.007 0.001 0.009 0.008 0.007 0.001
7 E1 C46 0.006 0.004 <0.001 0.001 0.033 0.027 0.013 0.008 0.006 0.004 <0.001 0.001 0.006 0.004 <0.001 0.001
  E2 C47 0.006 0.006 0.004 0.001 0.027 0.025 0.022 0.004 0.008 0.007 0.005 0.001 0.008 0.007 0.005 0.001
  E3 C48 0.008 0.008 0.007 0.001 0.028 0.027 0.026 0.006 0.010 0.010 0.009 0.002 0.010 0.010 0.009 0.002
  WI1 C49 0.007 0.005 <0.001 0.020 0.044 0.036 0.006 0.073 0.007 0.005 0.000 0.020 0.007 0.005 0.000 0.020
  WI2 C50 0.011 0.010 0.006 0.017 0.035 0.032 0.027 0.052 0.012 0.010 0.008 0.018 0.012 0.010 0.008 0.018
  WI3 C51 0.007 0.006 0.006 0.015 0.028 0.027 0.025 0.045 0.010 0.009 0.009 0.017 0.010 0.009 0.009 0.017
  AI1 C52 0.001 0.001 0.000 0.000 0.006 0.005 <0.001 0.002 0.001 0.001 0.000 0.000 0.001 0.001 0.000 0.000
  AI2 C53 0.005 0.003 0.001 <0.001 0.022 0.018 0.008 0.003 0.005 0.004 0.001 0.001 0.005 0.004 0.001 0.001
  AI3 C54 0.007 0.006 0.004 0.001 0.027 0.025 0.022 0.004 0.009 0.008 0.006 0.002 0.009 0.008 0.006 0.002

Abbreviation: SSD, sample size design; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality.

  • Results based on the number of strata: In all of the scenarios, the probability of having at least one type-I error over the comparisons for all of the adjusted BD, Tarone, Woolf, and Peto tests was not notably affected (around 0.004±0.016) by changes in the number of strata. While the FWER values of the BD-based and chi-square-based LSD tests diverged only when the sample size was small, the FWER values of the YA test diverged in most of the scenarios.

  • Results based on the number of different odds ratios: In all of the scenarios, the probability of having at least one type-I error over the comparisons was not notably affected (around 0.011±0.025) by changes in the number of different odds ratios, except for the YA test.

  • Results based on the sample size design: The change in the sample size design had a small effect (around 0.004±0.087) on the probability of having at least one type-I error over the comparisons for the Bonferroni, Dunn–Šidák, Holm, Hochberg, and Hommel adjusted BD, Tarone, Woolf, and Peto tests. When the sample size increased, the FWER performance of the YA test deteriorated.

  • Results based on the multiple comparison methods: In most of the scenarios, the Benjamini–Hochberg adjusted tests controlled the type-I error rate better than the others. In all of the scenarios, the YA test diverged the most from α = 0.05.

The FWER is the probability of having at least one type-I error over all the comparisons, and the PCER is the probability of observing a type-I error in any single comparison; both are therefore based on the number of type-I errors (V). In each simulation run with no type-I error, V = 0. Because the measures were averaged over the runs, many such runs led to low PCER and FWER values. Low PCER and FWER values indicated that the tests controlled the type-I error well. Except for the BD-based and chi-squared-based LSD tests and the YA test in some scenarios, the PCER and FWER values of the tests were low. The PCER values were also less than or equal to the FWER values, in line with the literature.

4.3.7. FDR

The FDR results are given in Tables 25–28 of Supplemental Material. A low FDR value is expected from a good method. The FDR was calculated as FDR = V/R, where R is the number of hypotheses that were declared significant. For the larger sample size tables, the BD-based and chi-square-based LSD tests did not indicate a significant difference between any pair of odds ratios during the simulation runs, so R = 0 and the FDR of these tests could not be calculated. Because V + S = R, FDR = 1 − PPV whenever R > 0, and the FDR results were therefore consistent with the PPV results.

5. Discussion

In meta-analysis studies, comparisons of the odds ratios may lead researchers to detect statistical heterogeneity. In some specific studies, instead of pooling different studies, the results are pooled over a third factor (e.g. age, gender, country), and this may cause clinical heterogeneity. Even though Fletcher [14] argued that judgments about clinical heterogeneity are qualitative and do not involve any calculations, clinical heterogeneity among the participant characteristics can be detected by comparing the odds ratios. In this study, the focus was placed on meta-analysis studies in which the results are pooled over a third factor and the odds ratios are non-homogeneous. The necessity for a multiple comparison procedure when the null hypothesis of homogeneity of the odds ratios is rejected was discussed.

The results showed that some of these methods controlled the type-I error rates at the desired level, while some of them were more powerful than others. By considering the power and type-I error performance together, some promising tests were identified for the considered scenarios. The recommended tests by considering the main findings of the study are summarized in Table 10.

Table 10.

The recommended tests.

K TOR Sample size design Recommended tests
3 P1 E or WI Benjamini–Hochberg adjusted tests
    Small AI Dunn–Šidák adjusted BD test
    Medium or large AI Benjamini–Hochberg adjusted BD test
  F Small Dunn–Šidák adjusted BD test
    Medium or large YA test
5 P1 Small E or WI YA test
    Medium or large E or WI Benjamini–Hochberg adjusted tests
    Small AI Dunn–Šidák adjusted BD test
    Medium or large AI Benjamini–Hochberg adjusted BD test
  P2 or F All YA test
7 P1 Small E or WI YA test
    Medium or large E or WI Benjamini–Hochberg adjusted tests
    Small AI Dunn–Šidák adjusted BD test
    Medium or large AI Benjamini–Hochberg adjusted BD test
  P2 Small E Chi-square-based LSD
    Small WI Dunn–Šidák adjusted Peto test
    Small AI Dunn–Šidák adjusted BD or Tarone test
    Medium or large Benjamini–Hochberg adjusted BD, Tarone, or Woolf tests
  F Small BD-based LSD test
    Medium or large YA test

Abbreviation: TOR, true odds ratio; P1, only one of the odds ratios is different; P2, more than one of the odds ratios is different; F, all of the odds ratios are different; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality.

The Breslow-Day, Tarone, Woolf, and Peto tests were suggested to test the homogeneity of more than three odds ratios. The multiple comparison procedures are applied when the odds ratios are heterogeneous. We discussed the performance of these tests in the multiple comparison procedure. For this purpose, we used the adjustment methods to control the type-I and type-II errors. The BD-based and chi-square-based LSD tests and the YA test were specifically suggested as multiple comparison tests of odds ratios. Thus, the BD-based and chi-square-based LSD tests and the YA test behaved differently from the other tested methods. Unexpectedly, even though BD-based and chi-square-based LSD tests were multiple-comparison tests, their performance was below the others.

The simulation space of this numerical study covered the COVID-19 data given in Section 3. The data had 4 strata, and only the odds ratio for China was greater than the others (only one different odds ratio). While the studies from Italy and France had large sample sizes, those from Greece and China were smaller. The simulation results showed that the tests with the Benjamini–Hochberg adjustment were more powerful and controlled the error rates for tables with 4 strata, only one different odds ratio, and an among-centers inequality design. Consistent with the simulation results, the Benjamini–Hochberg adjusted tests indicated a difference between the odds ratio for China and those of the other countries. The only exception was the Woolf test with the Benjamini–Hochberg adjustment for the comparison between China and France, which was marginally non-significant (p = 0.056). The simulation results showed that the Bonferroni, Holm, Hochberg, and Hommel adjusted BD, Tarone, and Woolf tests and the BD-based and chi-square-based LSD tests were not suitable for the multiple comparisons of odds ratios for such tables; accordingly, these tests did not indicate any difference between the odds ratios. The YA test, which was less conservative than the others, indicated a statistically significant difference between the odds ratios for Greece and France.

Because it is difficult to design a simulation study that accounts for both the heterogeneity of the odds ratios and the differences between the sample sizes in each stratum, this study was limited to a maximum of 7 strata. Even though the minimum number of strata is theoretically 2, working with a larger number of studies makes a meta-analysis more powerful and reliable.

Supplementary Material

Supplemental Material

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Agresti A. and Hartzel J., Strategies for comparing treatments on a binary response with multi-centre data, Stat. Med. 19 (2000), pp. 1115–1139.
  • 2. Armistead T.W., Measures of association, in Wiley StatsRef: Statistics Reference Online, John Wiley & Sons, Ltd., 2014, pp. 1–24.
  • 3. Bagheri Z., Ayatollahi S.M.T., and Jafari P., Comparison of three tests of homogeneity of odds ratios in multicenter trials with unequal sample sizes within and among centers, BMC Med. Res. Methodol. 11 (2011), pp. 1–8.
  • 4. Benjamini Y. and Hochberg Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B (Methodol.) 57 (1995), pp. 289–300.
  • 5. Breslow N.E., Statistics in epidemiology: The case-control study, J. Am. Stat. Assoc. 91 (1996), pp. 14–28.
  • 6. Breslow N.E. and Day N.E., The Analysis of Case-control Studies. Statistical Methods in Cancer Research, Vol. 1, International Agency for Research on Cancer Scientific Publications, Lyon, France, 1980.
  • 7. Chen S.Y., Feng Z., and Yi X., A general introduction to adjustment for multiple comparisons, J. Thorac. Dis. 9 (2017), pp. 1725–1729.
  • 8. Colombi D., Bodini F.C., Petrini M., Maffi G., Morelli N., Milanese G., Silva M., Sverzellati N., and Michieletti E., Well-aerated lung on admitting chest CT to predict adverse outcome in COVID-19 pneumonia, Radiology 296 (2020), pp. E86–E96.
  • 9. de Almeida-Pititto B., Dualib P.M., Zajdenverg L., Dantas J.R., De Souza F.D., Rodacki M., and Bertoluci M.C., Severity and mortality of COVID 19 in patients with diabetes, hypertension and cardiovascular disease: A meta-analysis, Diabetol. Metab. Syndr. 12 (2020), pp. 1–12.
  • 10. Demirhan H., Dolgun N.A., and Demirhan Y.P., Performance of some multiple comparison tests under heteroscedasticity and dependency, J. Stat. Comput. Simul. 80 (2010), pp. 1083–1100.
  • 11. DerSimonian R. and Laird N., Meta-analysis in clinical trials, Control. Clin. Trials 7 (1986), pp. 177–188.
  • 12. Dinno A., Nonparametric pairwise multiple comparisons in independent groups using Dunn's test, Stata J. 15 (2015), pp. 292–300.
  • 13. Dunn O.J., Multiple comparisons among means, J. Am. Stat. Assoc. 56 (1961), pp. 52–64.
  • 14. Fletcher J., What is heterogeneity and is it important?, Br. Med. J. 334 (2007), pp. 94–96.
  • 15. Gagnier J.J., Moher D., Boon H., Beyene J., and Bombardier C., Investigating clinical heterogeneity in systematic reviews: A methodologic review of guidance in the literature, BMC Med. Res. Methodol. 12 (2012), pp. 1–15.
  • 16. Gavaghan D.J., Moore R.A., and McQuay H.J., An evaluation of homogeneity tests in meta-analyses in pain using simulations of individual patient data, Pain 85 (2000), pp. 415–424.
  • 17. Giamarellos-Bourboulis E.J., Netea M.G., Rovina N., Akinosoglou K., Antoniadou A., Antonakos N., Damoraki G., Gkavogianni T., Adami M.-E., Katsaounou P., Ntaganou M., Kyriakopoulou M., Dimopoulos G., Koutsodimitropoulos I., Velissaris D., Koufargyris P., Karageorgos A., Katrini K., Lekakis V., Lupse M., Kotsaki A., Renieris G., Theodoulou D., Panou V., Koukaki E., Koulouris N., Gogos C., and Koutsoukou A., Complex immune dysregulation in COVID-19 patients with severe respiratory failure, Cell Host Microbe 27 (2020), pp. 992–1000.e3.
  • 18. Halperin M., Ware J.H., Byar D.P., Mantel N., Brown C.C., Koziol J., Gail M., and Green S.B., Testing for interaction in an I×J×K contingency table, Biometrika 64 (1977), pp. 271–275.
  • 19. Hochberg Y., A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75 (1988), pp. 800–802.
  • 20. Holm S., A simple sequentially rejective multiple test procedure, Scand. J. Stat. 6 (1979), pp. 65–70.
  • 21. Hommel G., A stagewise rejective multiple test procedure based on a modified Bonferroni test, Biometrika 75 (1988), pp. 383–386.
  • 22. Hsiung T.H. and Olejnik S., Power of pairwise multiple comparisons in the unequal variance case, Commun. Stat. Simul. Comput. 23 (1994), pp. 691–710.
  • 23. Jones M.P., O'Gorman T.W., Lemke J.H., and Woolson R.F., A Monte Carlo investigation of homogeneity tests of the odds ratio under various sample size configurations, Biometrics 45 (1989), pp. 171–181.
  • 24. Kulinskaya E. and Dollinger M.B., An accurate test for homogeneity of odds ratios based on Cochran's Q-statistic, BMC Med. Res. Methodol. 15 (2015), pp. 1–19.
  • 25. Mantel N. and Haenszel W., Statistical aspects of the analysis of data from retrospective studies of disease, J. Natl. Cancer Inst. 22 (1959), pp. 719–748.
  • 26. Almalik O. and van den Heuvel E.R., Testing homogeneity of effect sizes in pooling 2×2 contingency tables from multiple studies: A comparison of methods, Cogent Math. Stat. 5 (2018), p. 1478698.
  • 27. Paul S.R. and Donner A., A comparison of tests of homogeneity of odds ratios in k 2×2 tables, Stat. Med. 8 (1989), pp. 1455–1468.
  • 28. R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2019. Available at https://www.R-project.org/.
  • 29. Ramsey P.H., Power differences between pairwise multiple comparisons, J. Am. Stat. Assoc. 73 (1978), pp. 479–485.
  • 30. Reis I.M., Hirji K.F., and Afifi A.A., Exact and asymptotic tests for homogeneity in several 2×2 tables, Stat. Med. 18 (1999), pp. 893–906.
  • 31. Šidák Z., Rectangular confidence regions for the means of multivariate normal distributions, J. Am. Stat. Assoc. 62 (1967), pp. 626–633.
  • 32. Signorell A., Aho K., Alfons A., Anderegg N., Aragon T., Arppe A., Baddeley A., Barton K., Bolker B., and Borchers H.W., DescTools: Tools for descriptive statistics, 2020, R package version 0.99.36. Available at https://cran.r-project.org/package=DescTools.
  • 33. Simonnet A., Chetboun M., Poissy J., Raverdy V., Noulette J., Duhamel A., Labreuche J., Mathieu D., Pattou F., and Jourdain M., High prevalence of obesity in severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) requiring invasive mechanical ventilation, Obesity 28 (2020), pp. 1195–1199.
  • 34. Tarone R.E., On heterogeneity tests based on efficient scores, Biometrika 72 (1985), pp. 91–95.
  • 35. Van den Ende J., Moreira J., Basinga P., and Bisoffi Z., The trouble with likelihood ratios, Lancet 366 (2005), p. 548.
  • 36. Wang L., He W., Yu X., Hu D., Bao M., Liu H., Zhou J., and Jiang H., Coronavirus disease 2019 in elderly patients: Characteristics and prognostic factors based on 4-week follow-up, J. Infect. 80 (2020), pp. 639–645.
  • 37. Wei Q. and Lai D., Test for homogeneity of odds ratios using U-statistics, Open J. Stat. 9 (2019), pp. 347–360.
  • 38. Wong Y.J., Tan M., Zheng Q., Li J.W., Kumar R., Fock K.M., Teo E.K., and Ang T.L., A systematic review and meta-analysis of the COVID-19 associated liver injury, Ann. Hepatol. 19 (2020), pp. 627–634.
  • 39. Woolf B., On estimating the relation between blood group and disease, Ann. Hum. Genet. 19 (1955), pp. 251–253.
  • 40. Wright S.P., Adjusted p-values for simultaneous inference, Biometrics 48 (1992), pp. 1005–1023.
  • 41. Wu Z. and McGoogan J.M., Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: Summary of a report of 72 314 cases from the Chinese center for disease control and prevention, JAMA 323 (2020), pp. 1239–1242.
  • 42. Yang J., Zheng Y.A., Gou X., Pu K., Chen Z., Guo Q., Ji R., Wang H., Wang Y., and Zhou Y., Prevalence of comorbidities and its effects in patients infected with SARS-CoV-2: A systematic review and meta-analysis, Int. J. Infect. Dis. 94 (2020), pp. 91–95.
  • 43. Yilmaz A.E. and Altunay S.A., Post-hoc comparison tests for odds ratios, Electron. J. Appl. Stat. Anal. 15 (2022), pp. 75–94.
  • 44. Yusuf S., Peto R., Lewis J., Collins R., and Sleight P., Beta blockade during and after myocardial infarction: An overview of the randomized trials, Prog. Cardiovasc. Dis. 27 (1985), pp. 335–371.
  • 45. Zelen M., The analysis of several 2×2 contingency tables, Biometrika 58 (1971), pp. 129–137.
  • 46. Zwinderman A.H. and Bossuyt P.M., We should not pool diagnostic likelihood ratios in systematic reviews, Stat. Med. 27 (2008), pp. 687–697.


Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
