How reliable are the multiple comparison methods for odds ratio?

Ayfer Ezgi Yilmaz

2022 Jul 26;49(12):3141–3163. doi: 10.1080/02664763.2022.2104229

PMCID: PMC9415621 PMID: 36035608

ABSTRACT

The homogeneity tests of odds ratios are used in clinical trials and epidemiological investigations as a preliminary step of meta-analysis. In recent studies, the severity or mortality of COVID-19 in relation to demographic characteristics, comorbidities, and other conditions has been widely discussed by interpreting odds ratios and using meta-analysis. According to the homogeneity test results, a common odds ratio summarizes all of the odds ratios in a series of studies. If the aim is not to find a common odds ratio, but to find which of the sub-characteristics/groups differs from the others or is at risk, then a multiple comparison procedure is required. In this article, the focus is placed on the accuracy and reliability of the homogeneity of odds ratio tests for multiple comparisons when the odds ratios are heterogeneous at the omnibus level. Three recently proposed multiple comparison tests and four homogeneity of odds ratios tests with six adjustment methods to control the type-I error rate are considered. The reliability and accuracy of the methods are discussed in relation to COVID-19 severity data associated with diabetes on a country-by-country basis, and a simulation study to assess the powers and type-I error rates of the tests is conducted.

KEYWORDS: Homogeneity of odds ratios, multiple comparisons, type-I error, statistical power, meta-analysis, COVID-19

1. Introduction

It has recently become very popular to investigate the mortality or severity of COVID-19 in relation to demographics, clinical characteristics, or signs and symptoms to understand the impacts of COVID-19 more clearly. Not only in COVID-19 studies, but also in general meta-analysis projects, the odds ratio of death or severity for a patient with a comorbidity (diabetes, hypertension, cardiac disease, etc.) or a symptom (fever, cough, fatigue, etc.) has been investigated. Under the assumption that the odds ratios across studies are homogeneous, the results of several studies are aggregated via a meta-analysis. Before calculating a common odds ratio, the homogeneity of studies should be tested. Studies directly focusing on testing the homogeneity of the odds ratios date back to the first half of the 1950s. While Woolf's [39] test was the first application in the literature to test the homogeneity of the logarithm of odds ratios, it was found to be very conservative by Gavaghan et al. [16]. Breslow and Day [6] used the Mantel–Haenszel estimator instead of the conditional maximum likelihood estimate. Tarone [34] suggested an adjusted version of the Breslow–Day (BD) test. Because the BD statistic is based on the Cochran–Mantel–Haenszel odds ratio estimator and this estimator is not efficient, Almalik and van den Heuvel [26] suggested using the Tarone test. However, studies that found a difference between the BD and Tarone tests reported it only in the fourth decimal place [24]. Reis et al. [30] discussed the limitations of the asymptotic chi-square tests and did not recommend using them when most of the expected values were less than five. The Zelen test [45] was recommended to overcome this problem, but it was found to be biased and inconsistent [18,26]. Yusuf et al. [44] proposed a chi-square-based method, called the Peto method, that is identical to the asymptotic Zelen test [16,30]. The DerSimonian–Laird [11] statistic, the likelihood ratio (LR) test of a mixed logistic model [1,3], and the conditional maximum likelihood score statistic [16] have also been used to test the homogeneity of the odds ratios. The DerSimonian–Laird statistic based on the natural logarithm of the odds ratio is equivalent to the Woolf statistic [16]. Because of its simple calculation, the use of the BD test has been recommended instead of the mixed logistic model and score tests [3,16,30]. All of these methods are calculated under the assumption of homogeneity of the odds ratios.

There are many studies in the literature that have compared the properties of the homogeneity tests of odds ratios. Jones et al. [23] compared the power of seven tests of homogeneity of the odds ratio for balanced and unbalanced designs. As a result of their simulation study, they suggested using the BD statistic for non-sparse tables. Paul and Donner [27] also compared the performance of nine tests for the homogeneity of odds ratios according to the data designs (balanced, mildly unbalanced, severely unbalanced, and within-strata unbalanced) and the number of strata. Because of its simple calculation and power performance, they recommended using the Tarone test in practice. Additionally, they recommended using the Woolf test for balanced or mildly unbalanced designs. Reis et al. [30] conducted a Monte Carlo simulation to compare the performance of six asymptotic tests for the homogeneity of odds ratios. As a result of their simulation study, the BD and Pearson chi-square tests were slightly better than the other tests for a non-small sample size. Gavaghan et al. [16] compared the performance of the Peto, Woolf, DerSimonian–Laird, and BD statistics, and the score tests. They suggested using the BD statistic in the meta-analysis of pain studies. Their simulation study showed that when the Woolf statistic under-estimated the degree of heterogeneity, the DerSimonian–Laird statistic over-estimated it. Bagheri et al. [3] compared the likelihood ratio test of a mixed logistic model, the DerSimonian–Laird statistic, and the BD test when the sample sizes were equal and unequal. They concluded that the BD test was the most powerful of these three tests, and that studies with more strata had a higher power. Wei and Lai [37] discussed the effects of small sample size on the power of homogeneity tests. They suggested using U-statistics (U3 and WU3), which have higher power than the other tests. They concluded that the sample size had a positive effect on the power. When the number of strata and sample size increased, the power of the U-statistics improved.

These studies in the literature did not agree on a particular test that could directly be used in practice. The main reason for this was the design spaces of the simulation studies. None of the studies considered an extensive simulation space that could account for real-world scenarios, including an extensive combination of the number of studies, true odds ratios, different combinations of sample sizes, and the distributions of the cell counts across the cells of the resulting contingency tables. The aim of this study is to produce new knowledge on the power and type-I error performances of a large set of tests under an extensive simulation space. In this way, results with a higher likelihood of generalizability will be provided.

A meta-analysis is used to pool independent studies focused on the same question. In meta-analysis studies, it is required that all available studies are reported. The heterogeneity of these studies (effect sizes) is tested with the I-square and its related statistics. If heterogeneity is observed, it is important to consider a strategy for handling the sources of heterogeneity. There are different types of heterogeneity in a meta-analysis, such as clinical heterogeneity (differences in participant characteristics (gender, age group, race, etc.), types or timing of the outcome measurements, and intervention characteristics), methodological heterogeneity (trial design and quality), and statistical heterogeneity (treatment effects between trials) [15]. Gagnier et al. [15] discussed that clinical and methodological heterogeneity can cause significant statistical heterogeneity and affect the results.

COVID-19 meta-analyses aim to investigate the relationship between mortality/severity and comorbidities across strata defined by demographic characteristics (gender, age group, race, Hispanic origin, etc.). Such studies aim to reveal the differences between the odds ratios across the strata (demographics). For example, the risk of death of a patient with hypertension may be specifically higher in one race than in others, or the relationship between mortality and hypertension may not be statistically significant in some races. When the odds ratios of COVID-19 data are heterogeneous, it is not appropriate to calculate a common odds ratio, and it is important to detect the odds ratio that causes heterogeneity among all the considered odds ratios. In this case, a multiple comparison procedure is needed. To produce reliable results in such a crucial area, it is essential to understand the power and type-I error behaviors of the multiple comparison procedures used for heterogeneous odds ratios.

Even though there are many methods to test the homogeneity of odds ratios, there is limited literature about the multiple comparison procedures to be applied when the odds ratios are heterogeneous. Yilmaz and Aktas Altunay [43] suggested using the BD-based least significant difference (LSD), chi-square-based LSD, and adjusted BD tests for multiple comparisons of odds ratios. They used these tests to compare six COVID-19 mortality data sets from China, and the study was limited to real-life data. Their numerical application showed that the Bonferroni and Dunn–Šidák adjustment methods were very conservative when comparing the odds ratios and that their proposed methods were less conservative. They also recommended the use of the BD-based and chi-square-based LSD methods for sparse tables.

In this article, the focus was placed on the use of the homogeneity of odds ratio tests for the performance of multiple comparisons when the odds ratios are heterogeneous. Specifically, we focused on COVID-19 data to get accurate and reliable inferences when the odds ratios are heterogeneous in meta-analysis studies on COVID-19. Following the simulation study results of Bagheri et al. [3], Gavaghan et al. [16], Jones et al. [23], Paul and Donner [27], and Reis et al. [30], the BD, Tarone, Woolf, and Peto homogeneity of the odds ratio test statistics were considered. Bonferroni, Dunn–Šidák, Holm, Hochberg, Hommel, and Benjamini–Hochberg adjustments were used to control the type-I error rate with multiple comparison tests. Also, we considered the BD-based LSD, chi-square-based LSD, and adjusted BD tests for multiple comparisons [43]. In total, 27 methods were taken into consideration and compared in terms of seven measures, consisting of the any-pair power, all-pairs power, positive predictive value, true negative rate (TNR), per-comparison error rate, family-wise error rate, and false discovery rate, under different numbers of strata, sample sizes, sample size designs (equal, within-center inequality, among-centers inequality), and structures of the table (balanced, imbalanced). With such an extensive numerical study, clearer and more reliable results were obtained on the power and error characteristics of multiple comparison tests for odds ratios than in previous studies. 
The contributions of this study were that we (1) demonstrated the importance of heterogeneous odds ratios in COVID-19 studies and discussed the use of multiple comparison procedures to get reliable results, (2) examined the performance of the multiple comparison procedures for odds ratios in terms of power and error rates using an extensive simulation space that covered a wide range of realistic scenarios that can occur in practice, (3) examined the performance of the homogeneity tests of odds ratios in multiple comparison procedure and discussed the effect of adjustment methods on the tests, and (4) identified methods that can be used under different data compositions and areas of practice.

In Section 2, the methods to test the homogeneity of odds ratios and the multiple comparison procedures are presented. In Sections 3 and 4, the results of numerical studies with COVID-19 and synthetic data are presented. In Section 5, the general recommendations and conclusions are given.

2. Methods

In this section, the methods to test the homogeneity of odds ratios and the multiple comparison methods are introduced.

2.1. Test methods

Consider K different strata in which the association between two binary variables, X and Y, is investigated. Let $n_{i j k}$ be the number of observations in the ith row, jth column, and kth stratum, where i, j = 1, 2 and $k = 1, \dots, K$ , and let $n_{. . k}$ be the total number of observations in the kth stratum. A $2 \times 2 \times K$ study design is summarized in Table 1.

Table 1.

A $2 \times 2 \times K$ study design.

Stratum		Y = 1	Y = 2	Total
1	X = 1	$n_{111}$	$n_{121}$	$n_{1.1}$
	X = 2	$n_{211}$	$n_{221}$	$n_{2.1}$
	Total	$n_{.11}$	$n_{.21}$	$n_{. .1}$
2	X = 1	$n_{112}$	$n_{122}$	$n_{1.2}$
	X = 2	$n_{212}$	$n_{222}$	$n_{2.2}$
	Total	$n_{.12}$	$n_{.22}$	$n_{. .2}$
⋮	⋮	$⋱$		⋮
K	X = 1	$n_{11 K}$	$n_{12 K}$	$n_{1. K}$
	X = 2	$n_{21 K}$	$n_{22 K}$	$n_{2. K}$
	Total	$n_{.1 K}$	$n_{.2 K}$	$n_{. . K}$

The odds ratio formulation is $θ_{k} = \frac{n_{11 k} \times n_{22 k}}{n_{12 k} \times n_{21 k}}$ , where $k = 1, \dots, K .$ The null hypothesis for the homogeneity of odds ratios is

H_{0} : θ_{1} = θ_{2} = \dots = θ_{K},

(1)

against $H_{1}$ : $θ_{i} \neq θ_{j}$ for at least one pair of $(i, j)$ , where $i, j = 1, \dots, K$ , ( $i \neq j)$ .

The Peto, BD, Tarone, and Woolf methods were used to test the null hypothesis of the equality of several odds ratios. The Mantel and Haenszel [25] odds ratio is

{\hat{θ}}_{MH} = \frac{\sum_{k = 1}^{K} \frac{n_{11 k} n_{22 k}}{n_{. . k}}}{\sum_{k = 1}^{K} \frac{n_{12 k} n_{21 k}}{n_{. . k}}} .

(2)
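The stratum-specific odds ratios and the Mantel–Haenszel estimator in Equation (2) can be illustrated with a short Python sketch. It is not the author's R code; the function names are hypothetical, and the cell counts are taken from the COVID-19 example in Table 2 (Section 3), with each stratum stored as $(n_{11k}, n_{12k}, n_{21k}, n_{22k})$.

```python
# Each stratum is (n11, n12, n21, n22); counts copied from Table 2.
table2 = {
    "Greece": (6, 22, 4, 22),
    "Italy": (22, 86, 15, 113),
    "China": (6, 8, 1, 54),
    "France": (23, 62, 5, 34),
}


def odds_ratio(n11, n12, n21, n22):
    """Sample odds ratio of one 2x2 table (assumes n12 and n21 are non-zero)."""
    return (n11 * n22) / (n12 * n21)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio, Equation (2)."""
    strata = list(strata)
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


stratum_or = {name: odds_ratio(*cells) for name, cells in table2.items()}
theta_mh = mantel_haenszel_or(table2.values())
```

The stratum-level values reproduce the odds ratio column of Table 2 after rounding to two decimals.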

Yusuf et al. [44] proposed an alternative method to the MH method for pooling odds ratios across the strata. This method is referred to as the Peto method. The Peto statistic is

χ_{Peto}^{2} = \sum_{k = 1}^{K} \frac{(n_{11 k} - E_{k})^{2}}{V_{k}} - \frac{{[\sum_{k = 1}^{K} (n_{11 k} - E_{k})]}^{2}}{\sum_{k = 1}^{K} V_{k}},

(3)

where the expected frequency ( $E_{k}$ ) and its variance ( $V_{k}$ ) at the kth stratum are

E_{k} = \frac{n_{1. k} n_{.1 k}}{n_{. . k}} and V_{k} = \frac{n_{1. k} n_{.1 k} n_{2. k} n_{.2 k}}{n_{. . k}^{2} (n_{. . k} - 1)} .
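As an illustration of Equation (3), the following Python sketch computes the Peto statistic from a list of $2 \times 2$ tables. The function name and the tuple layout $(n_{11k}, n_{12k}, n_{21k}, n_{22k})$ are assumptions made for this example, not part of the original software.

```python
def peto_statistic(strata):
    """Peto heterogeneity statistic, Equation (3).

    strata: iterable of (n11, n12, n21, n22) tuples, one per stratum.
    """
    sum_sq = 0.0   # sum over k of (n11k - Ek)^2 / Vk
    sum_dev = 0.0  # sum over k of (n11k - Ek)
    sum_var = 0.0  # sum over k of Vk
    for n11, n12, n21, n22 in strata:
        n = n11 + n12 + n21 + n22
        row1, row2 = n11 + n12, n21 + n22
        col1, col2 = n11 + n21, n12 + n22
        e_k = row1 * col1 / n
        v_k = row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
        sum_sq += (n11 - e_k) ** 2 / v_k
        sum_dev += n11 - e_k
        sum_var += v_k
    return sum_sq - sum_dev ** 2 / sum_var


# Usage with the four country tables of Table 2 (Section 3).
peto_table2 = peto_statistic(
    [(6, 22, 4, 22), (22, 86, 15, 113), (6, 8, 1, 54), (23, 62, 5, 34)]
)
```

When all strata are identical, the two terms cancel and the statistic is zero, which is a convenient sanity check for an implementation.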

The Breslow–Day (BD) test is used to test the homogeneity of the odds ratios across K strata [5,6]. The BD test statistic is

χ_{B D}^{2} = \sum_{k = 1}^{K} \frac{[n_{11 k} - {\hat{μ}}_{k} ({\hat{θ}}_{MH})]^{2}}{{\hat{σ}}_{k}^{2} ({\hat{θ}}_{MH})},

(4)

where

{\hat{σ}}_{k}^{2} ({\hat{θ}}_{MH}) = {[\frac{1}{{\hat{μ}}_{k}} + \frac{1}{n_{1. k} - {\hat{μ}}_{k}} + \frac{1}{n_{.1 k} - {\hat{μ}}_{k}} + \frac{1}{n_{2. k} - n_{.1 k} + {\hat{μ}}_{k}}]}^{- 1} .

(5)

Here, ${\hat{μ}}_{k} (\hat{θ})$ and ${\hat{σ}}_{k}^{2} (\hat{θ})$ are the expected value and the variance of $n_{11 k}$ under the assumption of homogeneous odds ratios, respectively. The BD formula uses the MH odds ratio to generate the expected values using the conditional maximum likelihood method.

The Tarone statistic is an adjusted version of the BD test statistic [34]:

χ_{Tarone}^{2} = χ_{BD}^{2} - \frac{{[\sum_{k = 1}^{K} n_{11 k} - \sum_{k = 1}^{K} {\hat{μ}}_{k} ({\hat{θ}}_{MH})]}^{2}}{\sum_{k = 1}^{K} {\hat{σ}}_{k}^{2} ({\hat{θ}}_{MH})} .

(6)

Here, ${\hat{μ}}_{k} (\hat{θ})$ and ${\hat{σ}}_{k}^{2} (\hat{θ})$ are the expected value and the variance of $n_{11 k}$ , the same as in the BD test statistics.
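The BD and Tarone statistics in Equations (4) and (6) can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the DescTools code used in the article: the expected count ${\hat{μ}}_{k}$ is taken as the admissible root of the quadratic that makes the fitted table's odds ratio equal to ${\hat{θ}}_{MH}$ with the margins held fixed, and the variance uses the fitted (2,2) cell $n_{2.k} - n_{.1k} + {\hat{μ}}_{k}$. All function names are hypothetical.

```python
import math


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio, Equation (2)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


def expected_n11(row1, row2, col1, theta):
    """Expected n11 for fixed margins and a common odds ratio theta."""
    if abs(theta - 1.0) < 1e-12:
        return row1 * col1 / (row1 + row2)
    # mu * (row2 - col1 + mu) = theta * (row1 - mu) * (col1 - mu)
    a = 1.0 - theta
    b = (row2 - col1) + theta * (row1 + col1)
    c = -theta * row1 * col1
    disc = math.sqrt(b * b - 4.0 * a * c)
    low, high = max(0.0, col1 - row2), min(row1, col1)
    for root in ((-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)):
        if low < root < high:
            return root
    raise ValueError("no admissible expected count for these margins")


def breslow_day_tarone(strata):
    """Return (BD statistic, Tarone statistic) for a list of 2x2 tables."""
    strata = list(strata)
    theta = mantel_haenszel_or(strata)
    bd = sum_obs = sum_mu = sum_var = 0.0
    for n11, n12, n21, n22 in strata:
        row1, row2, col1 = n11 + n12, n21 + n22, n11 + n21
        mu = expected_n11(row1, row2, col1, theta)
        var = 1.0 / (1.0 / mu + 1.0 / (row1 - mu)
                     + 1.0 / (col1 - mu) + 1.0 / (row2 - col1 + mu))
        bd += (n11 - mu) ** 2 / var
        sum_obs += n11
        sum_mu += mu
        sum_var += var
    tarone = bd - (sum_obs - sum_mu) ** 2 / sum_var
    return bd, tarone


# Usage with the four country tables of Table 2 (Section 3).
table2 = [(6, 22, 4, 22), (22, 86, 15, 113), (6, 8, 1, 54), (23, 62, 5, 34)]
bd_stat, tarone_stat = breslow_day_tarone(table2)
```

Because the Tarone correction subtracts a non-negative term, the Tarone statistic can never exceed the BD statistic; small numerical differences from a particular software package are possible.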

The Woolf statistic is

χ_{Woolf}^{2} = \sum_{k = 1}^{K} w_{k} [\ln (θ_{k})]^{2} - \frac{{[\sum_{k = 1}^{K} w_{k} \ln (θ_{k})]}^{2}}{\sum_{k = 1}^{K} w_{k}},

(7)

where the weights are $w_{k} = [\frac{1}{n_{11 k}} + \frac{1}{n_{12 k}} + \frac{1}{n_{21 k}} + \frac{1}{n_{22 k}}]^{- 1}$ for $k = 1, \dots, K$ .

All of these methods are used to determine whether there are any statistically significant differences between the independent or unrelated odds ratios calculated from 2-by-2 studies. Under the null hypothesis, the Peto, BD, Tarone, and Woolf test statistics asymptotically follow the chi-square distribution with K−1 degrees of freedom.
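The Woolf statistic in Equation (7) is a weighted sum of squares of the log-odds ratios around their weighted mean. A minimal Python sketch, with a hypothetical function name and assuming that no cell count is zero, is:

```python
import math


def woolf_statistic(strata):
    """Woolf heterogeneity statistic, Equation (7).

    strata: iterable of (n11, n12, n21, n22) tuples with non-zero cells.
    """
    sum_w = sum_wl = sum_wl2 = 0.0
    for n11, n12, n21, n22 in strata:
        log_or = math.log(n11 * n22 / (n12 * n21))
        weight = 1.0 / (1.0 / n11 + 1.0 / n12 + 1.0 / n21 + 1.0 / n22)
        sum_w += weight
        sum_wl += weight * log_or
        sum_wl2 += weight * log_or ** 2
    return sum_wl2 - sum_wl ** 2 / sum_w


# Usage with the four country tables of Table 2 (Section 3).
woolf_table2 = woolf_statistic(
    [(6, 22, 4, 22), (22, 86, 15, 113), (6, 8, 1, 54), (23, 62, 5, 34)]
)
```

The resulting value would be referred to a chi-square distribution with K−1 degrees of freedom, here 3.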

2.2. The multiple comparison procedures for the odds ratio

When the methods represented in Section 2.1 indicate the presence of heterogeneity in the odds ratios, it is important to determine which of these groups are different from the others. With this purpose, the Peto, BD, Tarone, and Woolf tests were used for each pair of the studies or strata. The null hypothesis for the multiple comparisons of the odds ratios is

\begin{aligned} H_{0} : θ_{i} = θ_{j}, \\ H_{1} : θ_{i} \neq θ_{j}, \end{aligned}

for $(i < j)$ , where $i, j = 1, \dots, K$ . Because the multiple comparison procedure affects the error rates, different methods have been proposed to adjust the type-I error.

Bonferroni Adjustment: This method is the most popular but also the most conservative one [13]. The Bonferroni method controls the family-wise error rate. Suppose $m = K (K - 1) / 2$ is the number of simultaneously tested hypotheses. The Bonferroni adjusted type-I error is $α^{'} = α / m$ .
Dunn–Šidák Adjustment: The Šidák [31] method is slightly more powerful than the Bonferroni method [12]. The Dunn–Šidák adjusted type-I error is $α^{'} = 1 - (1 - α)^{1 / m}$ .
Holm Adjustment: The Holm [20] sequential adjustment is also based on the Bonferroni method, but it is less conservative. First, the p-values of the m tests are ranked from the smallest to the largest. Starting from the smallest p-value (i = 1), the adjusted $α_{i}^{'} = α / (m - i + 1)$ is calculated, and the $p_{i}$ -value is compared with $α_{i}^{'}$ . The comparison continues until $p_{i} \geq α_{i}^{'}$ . That hypothesis and all of the remaining hypotheses are considered non-significant.
Hochberg Adjustment: The Hochberg [19] sequential adjustment is very similar to the Holm adjustment. For this method, the p-values of the m tested hypotheses are ranked from the largest to the smallest and the procedure starts from the largest p-value (i = 1), with $α_{i}^{'} = α / i$ under this ranking. The comparison continues until $p_{i} < α_{i}^{'}$ . That hypothesis and all of the hypotheses with smaller $p_{i}$ -values are considered significant.
Hommel Adjustment: The Hommel [21] method is also less conservative than the Bonferroni method. It is slightly more powerful than the Hochberg method. First, the p-values of the m tests are ranked from the smallest to the largest. Suppose j is the number of hypotheses in the largest subset of the hypotheses and $j = \max \{ i \in \{1, \dots, m\} : p_{(m - i + k)} > k α / i \text{ for all } k = 1, \dots, i \}$ . If no such j exists, all of the hypotheses are rejected. Otherwise, the hypotheses with $p_{i} \leq α / j$ are rejected [40].
Benjamini–Hochberg Adjustment: The Benjamini and Hochberg [4] method is suggested to control the false discovery rate. It is less conservative than the other methods and gives better results when a large number of hypotheses are tested [7]. First, the p-values of the m tested hypotheses are ranked from the smallest to the largest. Let $j = \max \{ i : p_{i} \leq α i / m \}$ . All of the hypotheses for which $p_{i} \leq p_{j}$ are rejected, and no hypothesis is rejected if no such j exists.
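The step-wise rules above are often applied in the equivalent form of adjusted p-values, which are then compared with the unadjusted α. The following Python sketch illustrates that form for five of the six adjustments. It is a simplified stand-in written for this explanation, not the p.adjust() implementation of R; the Hommel procedure is omitted for brevity, and the function name and method labels are hypothetical.

```python
def adjust_pvalues(pvals, method):
    """Return adjusted p-values in the original order of `pvals`."""
    m = len(pvals)
    if method == "bonferroni":
        return [min(1.0, p * m) for p in pvals]
    if method == "sidak":
        return [1.0 - (1.0 - p) ** m for p in pvals]
    order = sorted(range(m), key=lambda i: pvals[i])  # smallest p first
    adjusted = [0.0] * m
    if method == "holm":
        running = 0.0
        for rank, idx in enumerate(order):  # step down from the smallest p
            running = max(running, (m - rank) * pvals[idx])
            adjusted[idx] = min(1.0, running)
        return adjusted
    if method in ("hochberg", "bh"):
        running = 1.0
        for rank in range(m - 1, -1, -1):  # step up from the largest p
            idx = order[rank]
            factor = (m - rank) if method == "hochberg" else m / (rank + 1)
            running = min(running, factor * pvals[idx])
            adjusted[idx] = running
        return adjusted
    raise ValueError(f"unknown method: {method}")


# Usage on four made-up p-values (not data from the article).
raw = [0.01, 0.04, 0.03, 0.005]
holm = adjust_pvalues(raw, "holm")
bh = adjust_pvalues(raw, "bh")
```

A hypothesis is declared significant when its adjusted p-value does not exceed α, so the Holm and Hochberg results differ only through the direction of the running maximum or minimum.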

In addition to the tests of the homogeneity of the odds ratios, the BD-based LSD, the chi-square-based LSD, and the adjusted BD tests can be used for multiple comparisons [43]. To avoid confusion with other adjustment methods of the BD test, the adjusted BD test will be referred to as the YA test from now on.

Zwinderman and Bossuyt [46] and Van den Ende et al. [35] reported that the odds ratios were more useful if converted to log values. Armistead [2] discussed the limitations of the measures of association and noted that taking the natural logarithm of the odds ratio makes it symmetric above and below one, with $\ln (1) = 0$ . The BD-based and chi-square-based LSD test methods are based on the difference between the two log-odds ratios. Assume that $θ_{i}$ is the odds ratio of the ith stratum and $θ_{j}$ is the odds ratio of the jth stratum, where ( $i, j = 1, \dots, K$ ). Let δ be the difference between these two log-odds ratios, as

δ = | \ln (θ_{i}) - \ln (θ_{j}) |, i < j .

(8)

The common standard error is

SE (\hat{δ}) = \frac{\sum_{k = 1}^{K} {\hat{σ}}_{k} ({\hat{θ}}_{MH})}{K},

(9)

where ${\hat{σ}}_{k}^{2} ({\hat{θ}}_{MH})$ is defined in Equation (5). The δ value is compared with the critical difference ORDIF in Equation (10). The null hypothesis is rejected if the difference is greater than or equal to the critical value, $δ \geq \mathrm{ORDIF}$ .

\mathrm{ORDIF} = Z_{α / 2} SE (\hat{δ}) .

(10)

The chi-square-based LSD test, in which the expected values are based on the chi-square approach, is also used for multiple comparisons. The expected values ( $E_{11 k}$ , $E_{12 k}$ , $E_{21 k}$ , $E_{22 k}$ ) are calculated as in the ordinary chi-square test, where $(k = 1, \dots, K)$ . The standard error for stratum k is

{SE}_{k} = {[\frac{1}{E_{11 k}} + \frac{1}{E_{12 k}} + \frac{1}{E_{21 k}} + \frac{1}{E_{22 k}}]}^{- 0.5} .

(11)

Then, the steps of the first method are followed to calculate ORDIF.
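The chi-square-based LSD procedure can be sketched in Python as follows. The sketch reads Equations (8) to (11) literally: the per-stratum quantity in Equation (11) is averaged over the strata as in Equation (9), and the result is multiplied by $Z_{α/2}$. The exact implementation behind the published results is not shown in the article, so numerical agreement with its tables should not be assumed; the function names and the use of the Table 2 counts are choices made for this illustration.

```python
import math
from itertools import combinations
from statistics import NormalDist

# Each stratum is (n11, n12, n21, n22); counts copied from Table 2.
table2 = {
    "Greece": (6, 22, 4, 22),
    "Italy": (22, 86, 15, 113),
    "China": (6, 8, 1, 54),
    "France": (23, 62, 5, 34),
}


def chi_square_se(n11, n12, n21, n22):
    """Per-stratum quantity of Equation (11) from chi-square expected counts."""
    n = n11 + n12 + n21 + n22
    rows = (n11 + n12, n21 + n22)
    cols = (n11 + n21, n12 + n22)
    # Expected count of cell (i, j) is row_i * col_j / n, so 1/E = n / (row * col).
    return sum(n / (r * c) for r in rows for c in cols) ** -0.5


def lsd_pairwise(tables, alpha=0.05):
    """Return ORDIF and, per pair, (delta, delta >= ORDIF)."""
    common_se = sum(chi_square_se(*t) for t in tables.values()) / len(tables)
    ordif = NormalDist().inv_cdf(1.0 - alpha / 2.0) * common_se
    log_or = {k: math.log(a * d / (b * c)) for k, (a, b, c, d) in tables.items()}
    result = {}
    for first, second in combinations(tables, 2):
        delta = abs(log_or[first] - log_or[second])
        result[(first, second)] = (delta, delta >= ordif)
    return ordif, result


ordif, comparisons = lsd_pairwise(table2)
```

Under this reading, none of the six pairwise log-odds differences for the Table 2 counts reaches the critical difference.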

The YA test is based on the BD test. To calculate its test statistic, the expected values based on the overall MH estimate are used.

χ_{Y A}^{2} = \frac{[n_{11 i} - {\hat{μ}}_{i} ({\hat{θ}}_{MH})]^{2}}{{\hat{σ}}_{i}^{2} ({\hat{θ}}_{MH})} + \frac{[n_{11 j} - {\hat{μ}}_{j} ({\hat{θ}}_{MH})]^{2}}{{\hat{σ}}_{j}^{2} ({\hat{θ}}_{MH})} .

(12)

The calculated chi-square value is compared with a chi-square value for the appropriate α and df = 1.

3. COVID-19 studies

Even though the effects of COVID-19 on human health have been investigated, recent studies in China, Europe, and America have revealed a relationship between disease severity or mortality and the comorbidities (diabetes, hypertension, cardiovascular disease, liver injury, etc.) for COVID-19 [9,41]. These studies were followed by meta-analysis studies [9,38,42].

de Almeida-Pititto et al. [9] applied several meta-analyses for diabetes, hypertension, and cardiovascular disease, and the use of ACE/ARB in COVID-19 mortality and severity cases. Their study indicated a high heterogeneity in ACE/ARB; hence, they calculated a common odds ratio based on the random-effects meta-analysis. Wong et al. [38] collected data from Asia and applied a meta-analysis of COVID-19 severity cases associated with liver injury, discussed the heterogeneity of the data, and applied a subgroup meta-analysis to minimize the heterogeneity. Yang et al. [42] also used a meta-analysis to describe the risk of hypertension, diabetes, respiratory system disease, and cardiovascular disease in COVID-19 severity cases. In these studies, the heterogeneity of the data was tested by the Q-statistic and related measures, then a random-effects meta-analysis was applied due to the presence of moderate to high heterogeneity.

The risk of comorbidities in severity or mortality may differ depending on the age group, region, Hispanic origin, etc. Furthermore, the risk for some groups may be specifically higher than for others. When the odds ratios of COVID-19 data are homogeneous, a common (Mantel–Haenszel) odds ratio is calculated. On the other hand, if the test results indicate a difference in the odds ratios, it is neither suitable nor reliable to calculate a common odds ratio, and it is important to detect the different ones. For instance, in the study of de Almeida-Pititto et al. [9], even though they concluded the necessity of further studies to detect the risk association in different age groups, they did not provide any further analysis. In that case, applying multiple comparison procedures is strongly suggested to make reliable inferences.

In previous studies [8,9,17,33,36], the risk of developing diabetes in severe patients was compared with the same risk in non-severe patients, and the relationship between the severity of COVID-19 (severe/non-severe) and diabetes was presented. Severity was described as ICU admission or the need for mechanical ventilation [9].

The aim of this study is to investigate the risk of developing diabetes for different levels of disease severity, while investigating if there is a statistically significant difference between different countries. With this purpose, data sets from Greece [17], Italy [8], China [36], and France [33] were used to determine COVID-19 severity with regard to diabetes, as presented in Table 2.

Table 2.

The number of patients who were admitted to the ICU or needed mechanical ventilation.

		Diabetes
Country	Severity	+	−	Total	Odds ratio
Greece	+	6	22	28	1.50
	−	4	22	26
	Total	10	44	54
Italy	+	22	86	108	1.93
	−	15	113	128
	Total	37	199	236
China	+	6	8	14	40.50
	−	1	54	55
	Total	7	62	69
France	+	23	62	85	2.52
	−	5	34	39
	Total	28	96	124

The BD test results indicated that the odds ratios were not statistically homogeneous ( $χ^{2} = 9.743$ , df = 3, p-value = 0.021). Although the odds ratios for Greece, France, and Italy were close, the odds ratio for China was considerably higher than the others. In this case, multiple comparison tests are needed to detect the differences between each pair of odds ratios, in particular between China and the other countries. To assess the behavior of the tests for such an obvious difference between the estimates of the odds ratios among the countries, multiple comparison tests were applied, and their results are summarized in Table 3.

Table 3.

P-values of the multiple comparison tests of the COVID-19 data set.

Adjustment	Test	G-I	G-C	I-C	G-F	I-F	C-F
Bonferroni	BD	1.000	0.162	0.068	1.000	1.000	0.341
	Tarone	1.000	0.195	0.071	1.000	1.000	0.410
	Woolf	1.000	0.348	0.270	1.000	1.000	0.676
	Peto	1.000	0.019	0.007	1.000	1.000	0.020
Dunn–Šidák	BD	1.000	0.040	0.017	0.993	0.999	0.082
	Tarone	1.000	0.048	0.018	0.993	0.999	0.098
	Woolf	1.000	0.084	0.066	0.993	0.999	0.157
	Peto	1.000	0.005	0.002	0.997	1.000	0.005
Holm	BD	1.000	0.128	0.059	1.000	1.000	0.227
	Tarone	1.000	0.146	0.059	1.000	1.000	0.239
	Woolf	1.000	0.227	0.191	1.000	1.000	0.366
	Peto	1.000	0.018	0.007	1.000	1.000	0.018
Hochberg	BD	0.789	0.128	0.059	0.789	0.789	0.218
	Tarone	0.789	0.146	0.059	0.789	0.789	0.239
	Woolf	0.789	0.218	0.191	0.789	0.789	0.366
	Peto	0.789	0.018	0.007	0.789	0.789	0.018
Hommel	BD	0.789	0.101	0.054	0.789	0.789	0.197
	Tarone	0.789	0.122	0.054	0.789	0.789	0.222
	Woolf	0.789	0.197	0.157	0.789	0.789	0.360
	Peto	0.789	0.017	0.007	0.789	0.789	0.018
Benjamini–Hochberg	BD	0.787	0.027	0.014	0.787	0.787	0.035
	Tarone	0.787	0.028	0.014	0.787	0.787	0.037
	Woolf	0.787	0.035	0.034	0.787	0.787	0.056
	Peto	0.787	0.007	0.007	0.787	0.789	0.007
YA test		0.488	0.031	0.618	0.031	0.629	0.036
Log-Odds Difference $^{*}$		0.251	3.296	3.045	0.520	0.269	2.776

Abbreviation: G, Greece; I, Italy; C, China; F, France. Bold values indicate a statistically significant difference (p<0.05). $^{*}$ The results are compared to $LSD = 3.649$ for the BD-based LSD test and $LSD = 3.854$ for the chi-square-based LSD test.

According to the multiple comparison results in Table 3, there were no statistically significant differences between the odds ratios of Greece and Italy, or between the odds ratios of Italy and France, as expected. The YA test was the only test that found a statistical difference between the odds ratios of Greece and France. All of the tests with the Benjamini–Hochberg adjustment, the Peto test with all of the adjustment methods, and the YA test indicated a difference between the odds ratios of China and the other countries, except for the Benjamini–Hochberg adjusted Woolf test between the odds ratios of China and France. The BD-based LSD and chi-square-based LSD tests, and also the Bonferroni, Holm, Hochberg, and Hommel adjusted BD, Tarone, and Woolf tests, did not indicate any statistically significant difference between China and Greece, Italy, or France, even though the odds ratio of China was obviously greater than the other odds ratios.

This observation from the COVID-19 data not only strongly showed the necessity of a multiple comparison procedure for the heterogeneous odds ratios in COVID-19 studies, but also demonstrated the importance of relying on the most suitable test to make inferences. The results showed that the different multiple comparison procedures indicated different results, even when the tests recommended in the previous literature were used. Thus, a detailed simulation study needs to be performed to discuss the reliability and accuracy of the methods.

4. Simulation study

A simulation study was conducted to compare the performance of the multiple comparison methods introduced in Section 2.

4.1. Simulation design

Odds ratios were simulated considering K = 3, 5, 7 strata. Three scenarios were considered for the alternative hypothesis space: in P1, only one odds ratio is different; in P2, more than one odds ratio is different; and in F, all of the odds ratios are different.

Three different sample size designs, given by Bagheri et al. [3], were used: equal sample sizes, within-center inequality, and among-centers inequality (see Table 2 in Bagheri et al. [3]). The marginal probabilities were set to $(π_{.1 k} = π_{.2 k} = 0.50)$ and $(π_{1. k} = π_{2. k} = 0.50)$ for the balanced design, and $(π_{.1 k} = 0.25)$ and $(π_{1. k} = 0.25)$ for the imbalanced design, where $k = 1, \dots, K$ . All of the simulation scenarios are given in Table 4. In total, 72 different simulation scenarios were selected to be run. The results were based on 5000 replications. E1, WI1, and AI1 represent the small sample size for the equal sample size design (E), within-center inequality (WI), and among-centers inequality (AI), respectively; the medium sample size is represented by E2, WI2, and AI2, and the large sample size by E3, WI3, and AI3.
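One way to generate a single $2 \times 2$ table with a prescribed odds ratio and prescribed marginal probabilities is sketched below in Python. The article does not spell out its data-generating mechanism, so the multinomial sampling step, the function names, and the seed argument are assumptions made for illustration only. The cell probability $π_{11}$ is obtained as the admissible root of the quadratic implied by the odds ratio and the two margins.

```python
import math
import random


def cell_probabilities(theta, p_row1=0.5, p_col1=0.5):
    """Cell probabilities (p11, p12, p21, p22) with the given margins and odds ratio."""
    if abs(theta - 1.0) < 1e-12:
        p11 = p_row1 * p_col1
    else:
        # (theta - 1) x^2 - [1 + (theta - 1)(r + c)] x + theta r c = 0
        a = theta - 1.0
        b = -(1.0 + a * (p_row1 + p_col1))
        c = theta * p_row1 * p_col1
        p11 = (-b - math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return (p11, p_row1 - p11, p_col1 - p11, 1.0 - p_row1 - p_col1 + p11)


def simulate_table(n, theta, p_row1=0.5, p_col1=0.5, seed=None):
    """Draw one table (n11, n12, n21, n22) of size n by multinomial sampling."""
    rng = random.Random(seed)
    probs = cell_probabilities(theta, p_row1, p_col1)
    draws = rng.choices(range(4), weights=probs, k=n)
    return tuple(draws.count(cell) for cell in range(4))


# Usage: the three strata of a balanced K = 3 scenario with odds ratios (10, 10, 1.2).
scenario = [simulate_table(40, theta, seed=i) for i, theta in enumerate((10, 10, 1.2))]
```

With small stratum sizes such as 40, a simulated table can contain an empty cell, which a full simulation would need to handle before computing log-odds ratios.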

Table 4.

Simulation scenarios and their abbreviations.

			Sample size design						Sample size design
Comb.	K	True odds ratio		n	Str.	Comb.	K	True odds ratio		n	Str.
C1	3	P1: (10,10,1.2)	E1	(40,40,40)	B	C37	5	F: (20,10,4,1.2,0.5)	E1	(40,40,40,40,40)	B
C2			E2	(100,100,100)		C38			E2	(100,100,100,100,100)
C3			E3	(200,200,200)		C39			E3	(200,200,200,200,200)
C4			WI1	(40,40,40)	IB	C40			WI1	(40,40,40,40,40)	IB
C5			WI2	(100,100,100)		C41			WI2	(100,100,100,100,100)
C6			WI3	(200,200,200)		C42			WI3	(200,200,200,200,200)
C7			AI1	(20,20,80)	B	C43			AI1	(20,20,20,20,160)	B
C8			AI2	(50,50,200)		C44			AI2	(50,50,50,50,300)
C9			AI3	(100,100,400)		C45			AI3	(100,100,100,100,600)
C10		F: (10,4,1.2)	E1	(40,40,40)	B	C46	7	P1: (10,10,10,10,10,10,1.2)	E1	(40,40,40,40,40,40,40)	B
C11			E2	(100,100,100)		C47			E2	(100,100,100,100,100,100,100)
C12			E3	(200,200,200)		C48			E3	(200,200,200,200,200,200,200)
C13			WI1	(40,40,40)	IB	C49			WI1	(40,40,40,40,40,40,40)	IB
C14			WI2	(100,100,100)		C50			WI2	(100,100,100,100,100,100,100)
C15			WI3	(200,200,200)		C51			WI3	(200,200,200,200,200,200,200)
C16			AI1	(20,20,80)	B	C52			AI1	(20,20,20,20,20,20,140)	B
C17			AI2	(50,50,200)		C53			AI2	(50,50,50,50,50,50,400)
C18			AI3	(100,100,400)		C54			AI3	(100,100,100,100,100,100,800)
C19	5	P1: (10,10,10,10,1.2)	E1	(40,40,40,40,40)	B	C55		P2: (35,35,10,10,10,10,1.2)	E1	(40,40,40,40,40,40,40)	B
C20			E2	(100,100,100,100,100)		C56			E2	(100,100,100,100,100,100,100)
C21			E3	(200,200,200,200,200)		C57			E3	(200,200,200,200,200,200,200)
C22			WI1	(40,40,40,40,40)	IB	C58			WI1	(40,40,40,40,40,40,40)	IB
C23			WI2	(100,100,100,100,100)		C59			WI2	(100,100,100,100,100,100,100)
C24			WI3	(200,200,200,200,200)		C60			WI3	(200,200,200,200,200,200,200)
C25			AI1	(20,20,20,20,160)	B	C61			AI1	(20,20,20,20,20,20,140)	B
C26			AI2	(50,50,50,50,300)		C62			AI2	(50,50,50,50,50,50,400)
C27			AI3	(100,100,100,100,600)		C63			AI3	(100,100,100,100,100,100,800)
C28		P2: (20,10,10,10,1.2)	E1	(40,40,40,40,40)	B	C64		F: (35,30,25,20,10,5,1.2)	E1	(40,40,40,40,40,40,40)	B
C29			E2	(100,100,100,100,100)		C65			E2	(100,100,100,100,100,100,100)
C30			E3	(200,200,200,200,200)		C66			E3	(200,200,200,200,200,200,200)
C31			WI1	(40,40,40,40,40)	IB	C67			WI1	(40,40,40,40,40,40,40)	IB
C32			WI2	(100,100,100,100,100)		C68			WI2	(100,100,100,100,100,100,100)
C33			WI3	(200,200,200,200,200)		C69			WI3	(200,200,200,200,200,200,200)
C34			AI1	(20,20,20,20,160)	B	C70			AI1	(20,20,20,20,20,20,140)	B
C35			AI2	(50,50,50,50,300)		C71			AI2	(50,50,50,50,50,50,400)
C36			AI3	(100,100,100,100,600)		C72			AI3	(100,100,100,100,100,100,800)


Abbreviation: E, equal sample size design; WI, within-center inequality; AI, among-centers inequality; Str., structure of the table; B, balanced; IB, imbalanced; P1, only one of the odds ratios is different; P2, more than one of the odds ratios is different; F, all of the odds ratios are different.
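The scenario grid in the table above can be enumerated programmatically. The following is a minimal Python sketch, not the author's code (the study itself was run in R); the names `TRUE_OR`, `DESIGNS`, and `build_scenarios` are hypothetical, and the true odds ratios are transcribed from the table. It only reproduces the labels C1–C72, not the sample sizes or the data generation.

```python
# True odds-ratio patterns for each number of strata K, transcribed from the
# simulation scenarios table (P1: one differs, P2: more than one, F: all differ).
TRUE_OR = {
    3: {"P1": (10, 10, 1.2), "F": (10, 4, 1.2)},
    5: {"P1": (10, 10, 10, 10, 1.2),
        "P2": (20, 10, 10, 10, 1.2),
        "F": (20, 10, 4, 1.2, 0.5)},
    7: {"P1": (10, 10, 10, 10, 10, 10, 1.2),
        "P2": (35, 35, 10, 10, 10, 10, 1.2),
        "F": (35, 30, 25, 20, 10, 5, 1.2)},
}

# Sample size designs: equal (E), within-center inequality (WI),
# among-centers inequality (AI), each at small, medium, and large size.
DESIGNS = ("E1", "E2", "E3", "WI1", "WI2", "WI3", "AI1", "AI2", "AI3")


def build_scenarios():
    """Return a dict mapping 'C1'..'C72' to (K, pattern, odds ratios, design)."""
    scenarios = {}
    index = 1
    for k, patterns in TRUE_OR.items():
        for pattern, odds_ratios in patterns.items():
            for design in DESIGNS:
                scenarios[f"C{index}"] = {
                    "K": k,
                    "pattern": pattern,
                    "odds_ratios": odds_ratios,
                    "design": design,
                }
                index += 1
    return scenarios
```

The ordering relies on Python dictionaries preserving insertion order, so the labels follow the table: K = 3 contributes 18 scenarios and K = 5 and K = 7 contribute 27 each.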

A total of 27 different methods were considered: four tests combined with six adjustment methods (24 combinations) and three multiple comparison methods.

The simulation software was developed by the author in R version 3.6.1. The BreslowDayTest() and WoolfTest() functions of the DescTools package [32] were used to perform the BD, Tarone, and Woolf tests, and the p.adjust() function of the stats package [28] was used to apply the adjustment methods. The significance level was set at $α = 0.05$.
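To illustrate the idea of adjusted pairwise homogeneity tests, the following is a minimal Python sketch rather than the R code used in the study. It assumes the standard Woolf statistic for two strata (squared difference of the log odds ratios divided by the sum of their estimated variances, 1 degree of freedom), no zero cells, and textbook versions of the Bonferroni, Dunn–Šidák, Holm, Hochberg, and Benjamini–Hochberg adjustments; the Hommel adjustment is omitted for brevity. The function names and the 2×2 table layout (a, b, c, d) are hypothetical.

```python
import math
from itertools import combinations


def woolf_pair_pvalue(table1, table2):
    """Woolf homogeneity test for two 2x2 tables given as (a, b, c, d)."""
    def log_or_and_var(table):
        a, b, c, d = table
        return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

    l1, v1 = log_or_and_var(table1)
    l2, v2 = log_or_and_var(table2)
    stat = (l1 - l2) ** 2 / (v1 + v2)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def adjust(pvalues, method):
    """Return adjusted p-values in the original order."""
    m = len(pvalues)
    if method == "bonferroni":
        return [min(1.0, m * p) for p in pvalues]
    if method == "sidak":
        return [1.0 - (1.0 - p) ** m for p in pvalues]
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    adjusted = [0.0] * m
    if method == "holm":  # step-down: enforce monotonicity from the smallest
        running = 0.0
        for rank, i in enumerate(order):
            running = max(running, min(1.0, (m - rank) * pvalues[i]))
            adjusted[i] = running
        return adjusted
    if method in ("hochberg", "bh"):  # step-up: start from the largest
        running = 1.0
        for rank in range(m - 1, -1, -1):
            i = order[rank]
            factor = (m - rank) if method == "hochberg" else m / (rank + 1)
            running = min(running, factor * pvalues[i])
            adjusted[i] = running
        return adjusted
    raise ValueError(f"unknown method: {method}")


def pairwise_adjusted(tables, method="bh"):
    """Adjusted p-values for all K(K-1)/2 pairs of strata."""
    pairs = list(combinations(range(len(tables)), 2))
    raw = [woolf_pair_pvalue(tables[i], tables[j]) for i, j in pairs]
    return dict(zip(pairs, adjust(raw, method)))
```

A pair of strata would then be declared different when its adjusted p-value falls below the chosen significance level.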

4.2. Measures to evaluate the tests

Table 5 summarizes the possible outcomes in hypothesis testing [4]. Let $m_{0}$ be the number of true null hypotheses and m the total number of hypotheses tested; for the pairwise comparison of K odds ratios, $m = K (K - 1) / 2$.

Table 5.

The possible outcomes in the hypothesis testing.

	Declared non-significant	Declared significant	Total
$H_{0}$ is true	U	V	$m_{0}$
$H_{A}$ is true	T	S	$m - m_{0}$
Total	m−R	R	m


In Table 5, U is the number of hypotheses that were correctly declared as non-significant and S is the number of hypotheses that were correctly declared as significant. R is the number of hypotheses that were declared as significant. V is the number of type-I errors and T is the number of type-II errors.
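As a small worked example of these quantities, the following Python sketch counts m and $m_{0}$ for a vector of true odds ratios by enumerating all pairs. The function name `count_hypotheses` is hypothetical, and a pair is treated as a true null hypothesis when its two true odds ratios are equal.

```python
from itertools import combinations


def count_hypotheses(true_odds_ratios):
    """Return (m, m0): the number of pairwise hypotheses and of true nulls."""
    pairs = list(combinations(true_odds_ratios, 2))
    m = len(pairs)  # equals K * (K - 1) / 2
    m0 = sum(1 for first, second in pairs if first == second)
    return m, m0
```

For instance, the P1-type pattern (10, 10, 1.2) with K = 3 gives m = 3 and $m_{0} = 1$, whereas any F-type pattern gives $m_{0} = 0$.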

The measures used to assess the tested hypotheses were divided into two classes: power measures and error measures. The former are the any-pair power (ANPP), all-pairs power (APP), positive predictive value (PPV), and true negative rate (TNR). The latter are the per-comparison error rate (PCER), family-wise error rate (FWER), and false discovery rate (FDR).

The ANPP is defined as the probability of identifying at least one true difference between the pairs and the APP is the probability of detecting all of the significant pairs [22,29]. The TNR is the proportion of correctly declared non-significant hypotheses ( $TNR = U / (m - R)$ ). The PPV or precision is the proportion of correctly declared significant hypotheses ( $PPV = S / R$ ). The FWER is the probability of having at least one type-I error over the comparisons and the PCER is the probability of observing a type-I error in any comparison [10]. The FDR is the proportion of falsely declared significant hypotheses ( $FDR = V / R$ ) [4].

In the simulation study, the m pairs of odds ratios were compared using the multiple comparison procedures. The counts in Table 5 were obtained for each replication of each scenario mentioned in Section 4.1, and the measures were then averaged over the replications. To evaluate the tests, the ANPP, APP, PPV, TNR, PCER, FWER, and FDR measures were used for the scenarios with the P1- and P2-type alternative hypotheses. Because at least one truly non-significant hypothesis is required to calculate the TNR, PCER, FWER, and FDR measures, and the F-type alternative hypotheses were defined as the case where all of the true odds ratios are different, only the ANPP, APP, and PPV were calculated for the F-type alternative hypotheses.
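The per-replication bookkeeping described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the author's implementation: the PCER is taken as V/m, the FWER and the two power measures as 0/1 indicators that become probabilities after averaging, measures with a zero denominator are returned as `None` and skipped when averaging, and the error measures are left undefined when there is no true null hypothesis (the F-type case). The function names are hypothetical.

```python
def replication_measures(u, v, t, s):
    """Measures for one replication from the Table 5 counts U, V, T, S."""
    m = u + v + t + s   # total number of pairwise hypotheses
    r = v + s           # number of hypotheses declared significant
    has_nulls = (u + v) > 0  # m0 > 0; false for F-type alternatives
    return {
        # Indicators; they assume at least one truly different pair exists.
        "ANPP": 1.0 if s >= 1 else 0.0,
        "APP": 1.0 if t == 0 else 0.0,
        "PPV": s / r if r > 0 else None,
        "TNR": u / (m - r) if has_nulls and m - r > 0 else None,
        "PCER": v / m if has_nulls else None,
        "FWER": (1.0 if v >= 1 else 0.0) if has_nulls else None,
        "FDR": v / r if has_nulls and r > 0 else None,
    }


def average_measures(replications):
    """Average each measure over the replications where it is defined."""
    collected = {}
    for measures in replications:
        for name, value in measures.items():
            if value is not None:
                collected.setdefault(name, []).append(value)
    return {name: sum(values) / len(values) for name, values in collected.items()}
```

A measure that is undefined in every replication simply does not appear in the averaged result, which mirrors the cases reported below where a value could not be calculated.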

4.3. Simulation results

The values of the power and error measures for all of the methods and scenarios are presented for each run in Tables 6–9. Not all of the results are tabulated here because of limited space; some of them are given in Tables 1–28 of the Supplemental Material.

Table 6.

ANPP values computed for the Bonferroni, Dunn-Šidák, Holm, and Hochberg adjusted tests for P1-type alternative hypothesis.

			Bonferroni				Dunn-Šidák				Holm				Hochberg
K	SSD	Com.	BD	Tarone	Woolf	Peto	BD	Tarone	Woolf	Peto	BD	Tarone	Woolf	Peto	BD	Tarone	Woolf	Peto
3	E1	C1	0.597	0.584	0.502	0.481	0.921	0.916	0.877	0.823	0.597	0.597	0.552	0.530	0.611	0.610	0.563	0.542
	E2	C2	0.902	0.900	0.895	0.850	0.990	0.989	0.988	0.969	0.902	0.902	0.902	0.885	0.912	0.912	0.912	0.891
	E3	C3	0.998	0.998	0.997	0.995	1.000	1.000	1.000	0.999	0.998	0.998	0.998	0.997	0.998	0.998	0.998	0.997
	WI1	C4	0.492	0.478	0.270	0.582	0.915	0.908	0.754	0.929	0.513	0.511	0.367	0.592	0.527	0.526	0.375	0.597
	WI2	C5	0.781	0.777	0.749	0.841	0.969	0.969	0.962	0.984	0.793	0.793	0.790	0.845	0.813	0.813	0.804	0.853
	WI3	C6	0.975	0.975	0.974	0.984	0.998	0.998	0.997	0.998	0.977	0.977	0.977	0.984	0.982	0.982	0.982	0.985
	AI1	C7	0.369	0.367	0.088	0.303	0.789	0.785	0.517	0.695	0.369	0.369	0.159	0.341	0.394	0.394	0.166	0.347
	AI2	C8	0.825	0.824	0.781	0.797	0.964	0.964	0.953	0.950	0.825	0.825	0.816	0.820	0.844	0.844	0.832	0.837
	AI3	C9	0.994	0.994	0.994	0.992	0.999	0.999	0.999	0.999	0.994	0.994	0.994	0.994	0.996	0.996	0.996	0.996
5	E1	C19	0.560	0.542	0.390	0.409	0.869	0.855	0.763	0.721	0.560	0.550	0.404	0.418	0.561	0.550	0.404	0.418
	E2	C20	0.908	0.907	0.899	0.845	0.984	0.983	0.982	0.959	0.908	0.908	0.902	0.855	0.908	0.908	0.902	0.855
	E3	C21	0.999	0.999	0.999	0.998	1.000	1.000	1.000	0.999	0.999	0.999	0.999	0.998	0.999	0.999	0.999	0.998
	WI1	C22	0.439	0.422	0.133	0.547	0.822	0.807	0.552	0.868	0.443	0.432	0.152	0.549	0.443	0.432	0.152	0.549
	WI2	C23	0.782	0.775	0.738	0.855	0.960	0.957	0.939	0.978	0.785	0.783	0.751	0.857	0.786	0.784	0.751	0.857
	WI3	C24	0.971	0.970	0.967	0.981	0.996	0.996	0.995	0.998	0.971	0.971	0.970	0.981	0.971	0.971	0.970	0.981
	AI1	C25	0.247	0.246	0.002	0.187	0.589	0.587	0.146	0.501	0.247	0.247	0.005	0.195	0.250	0.250	0.005	0.195
	AI2	C26	0.877	0.877	0.798	0.849	0.969	0.968	0.951	0.957	0.877	0.877	0.813	0.855	0.880	0.880	0.813	0.856
	AI3	C27	0.999	0.999	0.999	0.999	1.000	1.000	1.000	1.000	0.999	0.999	0.999	0.999	0.999	0.999	0.999	0.999
7	E1	C46	0.527	0.502	0.323	0.373	0.836	0.821	0.675	0.671	0.527	0.504	0.332	0.377	0.527	0.504	0.332	0.377
	E2	C47	0.912	0.908	0.893	0.825	0.983	0.981	0.977	0.944	0.912	0.909	0.895	0.830	0.912	0.909	0.895	0.831
	E3	C48	0.998	0.998	0.998	0.995	1.000	1.000	1.000	0.999	0.998	0.998	0.998	0.995	0.998	0.998	0.998	0.995
	WI1	C49	0.416	0.397	0.079	0.537	0.777	0.758	0.411	0.841	0.418	0.402	0.088	0.538	0.419	0.403	0.088	0.538
	WI2	C50	0.784	0.774	0.710	0.855	0.947	0.943	0.925	0.971	0.786	0.780	0.721	0.856	0.786	0.780	0.721	0.856
	WI3	C51	0.979	0.978	0.975	0.987	0.998	0.998	0.997	0.999	0.979	0.979	0.977	0.987	0.979	0.979	0.977	0.987
	AI1	C52	0.171	0.169	0.000	0.127	0.472	0.470	0.035	0.385	0.171	0.171	0.000	0.130	0.172	0.172	0.000	0.130
	AI2	C53	0.906	0.906	0.792	0.878	0.982	0.982	0.962	0.976	0.906	0.906	0.802	0.880	0.907	0.907	0.802	0.880
	AI3	C54	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000


Abbreviation: SSD, sample size design; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality.

Table 9.

FWER values computed for the Hommel and Benjamini–Hochberg adjusted tests, BD-based and chi-squared-based LSD tests, and YA test for P1-type alternative hypothesis.

			Hommel				Benjamini–Hochberg
K	SSD	Com.	BD	Tarone	Woolf	Peto	BD	Tarone	Woolf	Peto	BDLSD	CSLSD	YA
3	E1	C1	0.013	0.011	0.008	0.006	0.034	0.034	0.031	0.018	0.036	0.026	0.180
	E2	C2	0.022	0.021	0.021	0.016	0.042	0.042	0.042	0.021	0.000	0.000	0.558
	E3	C3	0.022	0.022	0.021	0.018	0.045	0.045	0.045	0.018	0.000	0.000	0.949
	WI1	C4	0.014	0.013	0.007	0.017	0.042	0.042	0.036	0.052	0.159	0.176	0.126
	WI2	C5	0.027	0.027	0.025	0.029	0.047	0.047	0.047	0.053	0.000	<0.001	0.325
	WI3	C6	0.037	0.037	0.037	0.038	0.046	0.046	0.046	0.051	0.000	0.000	0.739
	AI1	C7	0.003	0.003	0.001	0.001	0.017	0.017	0.011	0.011	0.079	0.072	0.416
	AI2	C8	0.012	0.011	0.010	0.006	0.035	0.035	0.033	0.017	<0.001	<0.001	0.851
	AI3	C9	0.027	0.026	0.025	0.022	0.044	0.044	0.044	0.023	0.000	0.000	0.996
5	E1	C19	0.010	0.008	0.003	0.002	0.062	0.061	0.049	0.031	0.162	0.100	0.225
	E2	C20	0.011	0.011	0.009	0.002	0.102	0.102	0.102	0.045	0.000	0.000	0.481
	E3	C21	0.013	0.012	0.011	0.003	0.104	0.104	0.104	0.045	0.000	0.000	0.859
	WI1	C22	0.009	0.008	<0.001	0.020	0.075	0.074	0.053	0.095	0.479	0.497	0.193
	WI2	C23	0.012	0.011	0.008	0.017	0.085	0.085	0.084	0.111	0.001	0.001	0.298
	WI3	C24	0.011	0.009	0.008	0.016	0.101	0.101	0.101	0.121	0.000	0.000	0.575
	AI1	C25	0.002	0.001	<0.001	<0.001	0.016	0.016	0.010	0.010	0.176	0.164	0.756
	AI2	C26	0.006	0.004	0.002	0.001	0.075	0.075	0.065	0.029	0.000	0.000	0.969
	AI3	C27	0.011	0.010	0.008	0.001	0.103	0.103	0.103	0.050	0.000	0.000	1.000
7	E1	C46	0.008	0.005	0.001	0.001	0.086	0.083	0.059	0.036	0.328	0.207	0.280
	E2	C47	0.009	0.007	0.005	0.001	0.133	0.133	0.132	0.052	0.000	0.000	0.433
	E3	C48	0.010	0.010	0.010	0.002	0.151	0.151	0.151	0.064	0.000	0.000	0.759
	WI1	C49	0.008	0.006	0.000	0.021	0.093	0.088	0.052	0.122	0.721	0.731	0.276
	WI2	C50	0.013	0.011	0.009	0.019	0.124	0.124	0.119	0.156	0.001	0.002	0.312
	WI3	C51	0.010	0.010	0.009	0.019	0.142	0.142	0.142	0.177	0.000	0.000	0.510
	AI1	C52	0.001	0.001	0.000	0.000	0.013	0.012	0.005	0.006	0.492	0.466	0.728
	AI2	C53	0.005	0.005	0.001	0.001	0.106	0.104	0.081	0.037	0.001	<0.001	0.996
	AI3	C54	0.010	0.009	0.007	0.002	0.141	0.141	0.139	0.055	0.000	0.000	1.000


Abbreviation: SSD, sample size design; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality; BDLSD, BD-based LSD; CSLSD, chi-squared-based LSD.

4.3.1. ANPP

The ANPP results for the case where only one odds ratio was different are summarized in Tables 6 and 7 (see Tables 1–4 of the Supplemental Material for the cases where more than one or all of the odds ratios were different). The ANPP values ranged between 0 and 1, and high values indicated a desirably high power for the test. The results are interpreted below by the number of strata, the number of different odds ratios, the sample size design, and the methods.

Table 7.

ANPP values computed for the Hommel and Benjamini-Hochberg adjusted tests, BD-based and chi-squared-based LSD tests, and YA test for P1-type alternative hypothesis.

			Hommel				Benjamini–Hochberg
K	SSD	Com.	BD	Tarone	Woolf	Peto	BD	Tarone	Woolf	Peto	BDLSD	CSLSD	YA
3	E1	C1	0.684	0.674	0.619	0.580	0.855	0.855	0.850	0.844	0.706	0.596	0.708
	E2	C2	0.934	0.932	0.928	0.904	0.982	0.982	0.982	0.977	0.001	0.000	0.942
	E3	C3	0.998	0.998	0.998	0.997	1.000	1.000	1.000	0.999	0.000	0.000	0.999
	WI1	C4	0.590	0.580	0.408	0.660	0.849	0.849	0.801	0.850	0.992	0.992	0.595
	WI2	C5	0.852	0.850	0.831	0.891	0.971	0.971	0.971	0.972	0.063	0.093	0.848
	WI3	C6	0.987	0.987	0.987	0.991	0.998	0.998	0.998	0.998	0.000	0.000	0.986
	AI1	C7	0.446	0.443	0.190	0.382	0.642	0.642	0.565	0.638	0.809	0.781	0.486
	AI2	C8	0.879	0.878	0.850	0.858	0.957	0.957	0.957	0.957	0.008	0.006	0.885
	AI3	C9	0.997	0.997	0.997	0.997	0.999	0.999	0.999	0.999	0.000	0.000	0.997
5	E1	C19	0.583	0.568	0.433	0.438	0.785	0.785	0.771	0.768	0.934	0.834	0.964
	E2	C20	0.920	0.917	0.911	0.863	0.979	0.979	0.979	0.974	0.002	0.000	0.998
	E3	C21	0.999	0.999	0.999	0.998	1.000	1.000	1.000	1.000	0.000	0.000	1.000
	WI1	C22	0.463	0.448	0.166	0.566	0.769	0.769	0.675	0.773	0.999	0.999	0.912
	WI2	C23	0.804	0.799	0.764	0.871	0.961	0.961	0.961	0.961	0.116	0.155	0.989
	WI3	C24	0.976	0.975	0.973	0.984	0.997	0.997	0.997	0.997	0.000	0.000	1.000
	AI1	C25	0.260	0.259	0.006	0.201	0.462	0.462	0.261	0.460	0.739	0.706	0.778
	AI2	C26	0.891	0.890	0.829	0.867	0.962	0.962	0.961	0.962	0.020	0.015	0.992
	AI3	C27	0.999	0.999	0.999	0.999	1.000	1.000	1.000	1.000	0.000	0.000	1.000
7	E1	C46	0.539	0.514	0.341	0.386	0.748	0.748	0.722	0.723	0.986	0.940	0.985
	E2	C47	0.917	0.915	0.902	0.840	0.979	0.979	0.979	0.969	0.005	0.001	1.000
	E3	C48	0.998	0.998	0.998	0.996	1.000	1.000	1.000	1.000	0.000	0.000	1.000
	WI1	C49	0.428	0.411	0.094	0.544	0.741	0.740	0.607	0.748	0.999	0.999	0.958
	WI2	C50	0.795	0.789	0.730	0.863	0.955	0.955	0.955	0.956	0.160	0.195	0.997
	WI3	C51	0.981	0.980	0.978	0.988	0.999	0.999	0.999	0.999	0.000	0.000	1.000
	AI1	C52	0.178	0.177	0.001	0.133	0.361	0.361	0.154	0.356	0.968	0.955	0.948
	AI2	C53	0.913	0.913	0.815	0.891	0.979	0.979	0.977	0.978	0.030	0.022	1.000
	AI3	C54	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	0.000	0.000	1.000


Abbreviation: SSD, sample size design; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality; BDLSD, BD-based LSD; CSLSD, chi-squared-based LSD.

Results based on the number of strata: For the large sample sizes, the probability of identifying at least one true difference between the pairs was not affected by the number of strata. When only one of the odds ratios was different and the sample size was small, increasing the number of strata from 3 to 7 decreased the ability of the tests to identify at least one difference between the pairs by 24% on average for all of the tests except the BD-based and chi-square-based LSD tests and the YA test.
Results based on the number of different odds ratios: We compared the ANPP values by the number of different odds ratios (P1-, P2-, and F-type alternative hypotheses). For the large sample sizes (E3, WI3, or AI3), the probability of identifying at least one true difference between the pairs was not notably affected by changes in the number of different odds ratios. For the tables with 3 strata, the ability of the tests to identify at least one difference between the pairs was higher when all of the odds ratios were different than when only one of the odds ratios was different. For the tables with 5 or 7 strata, in general, the ability of the tests to identify at least one difference between the pairs increased as the number of different odds ratios increased.
Results based on the sample size design: For all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test, the ability of the tests to identify at least one difference between the pairs increased by 24.5% on average when the sample size increased. In contrast, the ANPP values of the BD-based and chi-square-based LSD tests were highest when the sample size was small.
Results based on the multiple comparison methods: For all of the scenarios, when the sample size was large, the probability of identifying at least one true difference between the pairs was over 0.930 for all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test. For the tables with 5 or 7 strata in which more than one or all of the odds ratios were different and the sample size was medium, the ANPP values of these multiple comparison methods were similar and over 0.828 (Tables 1–4 of the Supplemental Material).

4.3.2. APP

The APP results are summarized in Tables 5–10 of the Supplemental Material. The APP values ranged between 0 and 1, and high values are desirable. The APP values were mostly lower than those of the ANPP. For the tables with 5 and 7 strata, the APP values of all of the methods were close to 0 when more than one or all of the odds ratios were different, except for the YA test in some scenarios (Tables 5–6 of the Supplemental Material). Thus, a comparison of the APP by strata was possible for only some of the scenarios.

Results based on the number of strata: When only one of the odds ratios was different, the probability of identifying all of the pairs that are actually different decreased by 17.6% on average as the number of strata increased, for all of the tests except the YA test.
Results based on the number of different odds ratios: In general, when the number of different odds ratios increased, the power of the methods decreased. When more than one or all of the odds ratios were different, the ability of the tests to identify all of the pairs that are actually different was mostly close to zero, except for the YA test.
Results based on the sample size design: The probability of identifying all of the pairs that are actually different increased with the sample size for all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test.
Results based on the multiple comparison methods: In terms of the ability to identify all of the pairs that are actually different, the YA test performed better than the other tests in most of the scenarios. The APP values of all of the tests, except for the YA test, were very low.

4.3.3. PPV

The PPV results are summarized in Tables 11–14 of the Supplemental Material. All of the PPV values for the case where all of the odds ratios were different were equal to 1. High PPV values are desirable. The PPV values of the BD-based and chi-square-based LSD tests could not be calculated for the larger sample sizes because no differences were declared significant during the simulation runs.

Results based on the number of strata: The proportion of correctly declared differences between the odds ratios was not affected by the number of strata, except for some of the scenarios of the BD-based and chi-square-based LSD tests and the YA test.
Results based on the number of different odds ratios: In general, when the number of different odds ratios increased, the proportion of correctly declared differences between the odds ratios increased by 1.4% on average for all of the tests.
Results based on the sample size design: The proportion of correctly declared differences between the odds ratios was not affected by changes in the sample size design, except for some of the scenarios of the BD-based and chi-square-based LSD tests and the YA test. The lowest values of the YA test were found for the tables with among-centers inequality.
Results based on the multiple comparison methods: High PPV values were found in most of the scenarios and for most of the methods. In general, the PPV of the YA test was slightly lower than that of the other methods, but its average value was 0.870. The PPV of the YA test was lower than 0.50 for the tables with 5 or 7 strata, one different odds ratio (P1-type), and an among-centers inequality design (AI).

4.3.4. TNR

The TNR results are summarized in Tables 15–18 of the Supplemental Material. As with the PPV, high TNR values are desirable.

Results based on the number of strata: When only one of the odds ratios was different (P1-type) and the sample size was small (E1, WI1, or AI1), increasing the number of strata from 3 to 7 increased the probability of correctly declaring non-significant differences between the odds ratios by 16.3% on average. When more than one odds ratio was different, the ability of the tests to correctly declare as non-significant the pairs of odds ratios that were actually not different did not change much with the number of strata.
Results based on the number of different odds ratios: In all of the scenarios, the ability of the tests to correctly declare non-significant differences between the odds ratios was higher when only one of the odds ratios was different than when more than one of the odds ratios was different.
Results based on the sample size design: For all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test, the ability of the tests to correctly declare non-significant differences between the odds ratios increased by 17.3% on average when the sample size increased. In general, the highest TNR values of all of the adjusted BD, Tarone, Woolf, and Peto tests and the YA test were observed in the large sample size designs. On the other hand, the highest TNR values of the BD-based and chi-square-based LSD tests were observed in the within-center inequality design with a small sample size.
Results based on the multiple comparison methods: The Benjamini–Hochberg adjusted tests showed better performance for the tables with 3 strata. The chi-square-based LSD test showed better performance for the tables with 3 or 5 strata and a within-center inequality design with a small sample size. The YA test mostly showed a much higher ability to correctly declare non-significant differences between the odds ratios for all of the scenarios with 5 and 7 strata.

4.3.5. PCER

The PCER results are summarized in Tables 19–22 of the Supplemental Material. The PCER values were expected to be around $α = 0.05$.

Results based on the number of strata: The change in the number of strata had a slight effect (around $0.001 \pm 0.015$) on the probability of observing a type-I error in any comparison. When only one of the odds ratios was different and the table had among-centers inequality with medium or large sample sizes, the YA test was the only method that diverged from $α = 0.05$ as the number of strata increased.
Results based on the number of different odds ratios: The YA test mostly performed better for the tables with more than one different odds ratio than for those with only one different odds ratio, whereas for the other methods the change in the number of different odds ratios had only a slight effect on the probability of observing a type-I error in any comparison.
Results based on the sample size design: The PCER of the adjusted BD, Tarone, Woolf, and Peto tests was not affected by the sample size design. The PCER of the BD-based and chi-square-based LSD tests converged to α when the table had the within-center inequality design with a small sample size. The PCER of the YA test diverged from α when the table had an among-centers inequality design, most strongly with a large sample size, followed by a medium sample size.
Results based on the multiple comparison methods: When one or more than one odds ratio was different and the table had an among-centers inequality design with medium or large sample sizes, the Benjamini–Hochberg adjusted BD, Tarone, and Woolf tests showed the best PCER performance. When only one of the odds ratios was different and the table had the within-center inequality design with a medium or large sample size, the Benjamini–Hochberg adjusted Peto test showed slightly better PCER performance. The BD-based and chi-square-based LSD tests showed the best PCER performance for the within- or among-centers inequality designs with a small sample size. The YA test showed the best PCER performance for the tables with 3 strata, only one different odds ratio, and an equal and small sample size; for the tables with 5 strata, only one different odds ratio, and an equal or within-center inequality design with a small sample size; and for the tables with 7 strata and medium or large sample sizes.

4.3.6. FWER

The FWER results for the case where only one odds ratio was different are summarized in Tables 8 and 9 (see Tables 23 and 24 of the Supplemental Material for the case where more than one of the odds ratios was different). The FWER values are also expected to be around $α = 0.05$.

Table 8.

FWER values computed for the Bonferroni, Dunn-Šidák, Holm, and Hochberg adjusted tests for P1-type alternative hypothesis.

			Bonferroni				Dunn-Šidák				Holm				Hochberg
K	SSD	Com.	BD	Tarone	Woolf	Peto	BD	Tarone	Woolf	Peto	BD	Tarone	Woolf	Peto	BD	Tarone	Woolf	Peto
3	E1	C1	0.005	0.005	0.003	0.002	0.024	0.022	0.015	0.008	0.007	0.007	0.005	0.004	0.009	0.009	0.007	0.006
	E2	C2	0.004	0.004	0.003	0.001	0.020	0.019	0.018	0.006	0.010	0.010	0.010	0.007	0.020	0.020	0.020	0.016
	E3	C3	0.004	0.004	0.003	0.001	0.013	0.013	0.013	0.005	0.010	0.010	0.010	0.010	0.021	0.021	0.021	0.018
	WI1	C4	0.007	0.005	0.001	0.010	0.031	0.029	0.017	0.035	0.010	0.009	0.004	0.012	0.011	0.011	0.005	0.014
	WI2	C5	0.005	0.005	0.003	0.009	0.020	0.020	0.018	0.030	0.010	0.010	0.009	0.016	0.025	0.025	0.025	0.028
	WI3	C6	0.005	0.005	0.004	0.007	0.017	0.017	0.016	0.025	0.011	0.011	0.011	0.017	0.037	0.037	0.037	0.037
	AI1	C7	0.001	0.001	0.000	<0.001	0.010	0.009	0.004	0.005	0.002	0.002	<0.001	0.001	0.002	0.002	<0.001	0.001
	AI2	C8	0.005	0.004	0.003	0.001	0.019	0.018	0.014	0.007	0.007	0.007	0.006	0.004	0.010	0.010	0.009	0.006
	AI3	C9	0.004	0.004	0.003	0.001	0.020	0.019	0.018	0.005	0.011	0.011	0.011	0.010	0.025	0.025	0.025	0.022
5	E1	C19	0.007	0.006	0.001	0.001	0.032	0.028	0.015	0.010	0.007	0.007	0.002	0.002	0.007	0.007	0.002	0.002
	E2	C20	0.008	0.006	0.005	0.001	0.031	0.029	0.026	0.007	0.009	0.009	0.008	0.002	0.009	0.009	0.008	0.002
	E3	C21	0.006	0.006	0.005	0.002	0.027	0.026	0.026	0.005	0.009	0.009	0.009	0.003	0.010	0.010	0.010	0.003
	WI1	C22	0.007	0.006	<0.001	0.017	0.041	0.037	0.012	0.059	0.008	0.007	<0.001	0.018	0.008	0.007	<0.001	0.018
	WI2	C23	0.009	0.009	0.005	0.013	0.028	0.027	0.022	0.042	0.011	0.010	0.007	0.017	0.011	0.010	0.008	0.017
	WI3	C24	0.004	0.004	0.004	0.010	0.024	0.022	0.021	0.033	0.007	0.007	0.007	0.014	0.007	0.007	0.007	0.014
	AI1	C25	0.001	0.001	<0.001	<0.010	0.009	0.009	0.001	0.004	0.001	0.001	<0.001	<0.001	0.001	0.001	<0.001	<0.001
	AI2	C26	0.004	0.003	0.001	<0.001	0.022	0.019	0.011	0.005	0.005	0.004	0.002	0.001	0.005	0.004	0.002	0.001
	AI3	C27	0.005	0.004	0.003	0.001	0.028	0.027	0.024	0.006	0.009	0.008	0.007	0.001	0.009	0.008	0.007	0.001
7	E1	C46	0.006	0.004	<0.001	0.001	0.033	0.027	0.013	0.008	0.006	0.004	<0.001	0.001	0.006	0.004	<0.001	0.001
	E2	C47	0.006	0.006	0.004	0.001	0.027	0.025	0.022	0.004	0.008	0.007	0.005	0.001	0.008	0.007	0.005	0.001
	E3	C48	0.008	0.008	0.007	0.001	0.028	0.027	0.026	0.006	0.010	0.010	0.009	0.002	0.010	0.010	0.009	0.002
	WI1	C49	0.007	0.005	<0.001	0.020	0.044	0.036	0.006	0.073	0.007	0.005	0.000	0.020	0.007	0.005	0.000	0.020
	WI2	C50	0.011	0.010	0.006	0.017	0.035	0.032	0.027	0.052	0.012	0.010	0.008	0.018	0.012	0.010	0.008	0.018
	WI3	C51	0.007	0.006	0.006	0.015	0.028	0.027	0.025	0.045	0.010	0.009	0.009	0.017	0.010	0.009	0.009	0.017
	AI1	C52	0.001	0.001	0.000	0.000	0.006	0.005	<0.001	0.002	0.001	0.001	0.000	0.000	0.001	0.001	0.000	0.000
	AI2	C53	0.005	0.003	0.001	<0.001	0.022	0.018	0.008	0.003	0.005	0.004	0.001	0.001	0.005	0.004	0.001	0.001
	AI3	C54	0.007	0.006	0.004	0.001	0.027	0.025	0.022	0.004	0.009	0.008	0.006	0.002	0.009	0.008	0.006	0.002


Abbreviation: SSD, sample size design; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality.

Results based on the number of strata: In all of the scenarios, the probability of having at least one type-I error over the comparisons was not notably affected (around $0.004 \pm 0.016$) by changes in the number of strata for all of the adjusted BD, Tarone, Woolf, and Peto tests. While the FWER values of the BD-based and chi-square-based LSD tests diverged only when the sample size was small, the FWER values of the YA test diverged in most of the scenarios.
Results based on the number of different odds ratios: In all of the scenarios, the probability of having at least one type-I error over the comparisons was not notably affected (around $0.011 \pm 0.025$) by changes in the number of different odds ratios, except for the YA test.
Results based on the sample size design: The change in the sample size design had a small effect (around $0.004 \pm 0.087$) on the probability of having at least one type-I error over the comparisons for the Bonferroni, Dunn–Šidák, Holm, Hochberg, and Hommel adjusted BD, Tarone, Woolf, and Peto tests. When the sample size increased, the FWER performance of the YA test deteriorated.
Results based on the multiple comparison methods: In most of the scenarios, the Benjamini–Hochberg adjusted tests controlled the type-I error rate better than the others. In all of the scenarios, the YA test diverged the most from $α = 0.05$.

Recall that the FWER is the probability of having at least one type-I error over all of the comparisons and the PCER is the probability of observing a type-I error in any comparison; both are therefore based on the number of type-I errors (V). For each simulation run with no type-I error, V = 0. Because the measures were averaged over the runs, such runs led to low values of the PCER and FWER. Low PCER and FWER values indicated that the tests controlled the type-I error well. Except for the BD-based and chi-squared-based LSD tests and the YA test in some scenarios, low PCER and FWER values were found for the tests. The PCER values were also less than or equal to the FWER values, consistent with the literature.
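The ordering of the two error rates follows directly from their definitions. As a short derivation, assuming the usual formalizations $PCER = E(V) / m$ and $FWER = P(V \geq 1)$ (the text gives both only in words):

$$
\frac{V}{m} \leq \mathbf{1}\{V \geq 1\} \quad \text{because } 0 \leq V \leq m ,
$$

$$
PCER = \frac{E(V)}{m} \leq E\left( \mathbf{1}\{V \geq 1\} \right) = P(V \geq 1) = FWER .
$$

The same inequality holds for the simulation averages, because it holds within every single replication.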

4.3.7. FDR

The FDR results are given in Tables 25–28 of the Supplemental Material. A low FDR value is expected from a good method. The FDR was calculated as $FDR = V / R$, where R is the number of hypotheses declared significant. For the larger sample size tables, the BD-based and chi-square-based LSD tests did not indicate a significant difference between any pair of the odds ratios during the simulation runs, so R was 0 and the proportion of falsely declared significant hypotheses could not be calculated for these tests. Because $FDR = 1 - PPV$, the results of the FDR were consistent with the PPV results.
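The identity between the FDR and the PPV can be read off the "Declared significant" column of Table 5, where $R = V + S$. For any replication with $R > 0$:

$$
FDR = \frac{V}{R} = \frac{R - S}{R} = 1 - \frac{S}{R} = 1 - PPV .
$$

When $R = 0$, both ratios are undefined, which is why the two measures are unavailable for the same scenarios.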

5. Discussion

In meta-analysis studies, comparisons of the odds ratios may lead researchers to detect statistical heterogeneity. In some specific studies, instead of pooling different studies, the results are pooled over a third factor (e.g. age, gender, country) and this may cause clinical heterogeneity. Even though Fletcher [14] argued that judgments about clinical heterogeneity are qualitative and do not involve any calculations, the clinical heterogeneity among the participant characteristics can be detected by comparing the odds ratios. In this study, the focus was placed on meta-analysis studies in which the results are pooled over a third factor and the odds ratios are non-homogeneous. The necessity of a multiple comparison procedure when the null hypothesis of homogeneity of the odds ratios is rejected was discussed.

The results showed that some of these methods controlled the type-I error rates at the desired level, while some of them were more powerful than others. By considering the power and type-I error performance together, some promising tests were identified for the considered scenarios. The tests recommended on the basis of the main findings of the study are summarized in Table 10.

Table 10.

The recommended tests.

K	TOR	Sample size design	Recommended tests
3	P1	E or WI	Benjamini–Hochberg adjusted tests
		Small AI	Dunn–Šidák adjusted BD test
		Medium or large AI	Benjamini–Hochberg adjusted BD test
	F	Small	Dunn–Šidák adjusted BD test
		Medium or large	YA test
5	P1	Small E or WI	YA test
		Medium or large E or WI	Benjamini–Hochberg adjusted tests
		Small AI	Dunn–Šidák adjusted BD test
		Medium or large AI	Benjamini–Hochberg adjusted BD test
	P2 or F	All	YA test
7	P1	Small E or WI	YA test
		Medium or large E or WI	Benjamini–Hochberg adjusted tests
		Small AI	Dunn–Šidák adjusted BD test
		Medium or large AI	Benjamini–Hochberg adjusted BD test
	P2	Small E	Chi-square-based LSD test
		Small WI	Dunn–Šidák adjusted Peto test
		Small AI	Dunn–Šidák adjusted BD or Tarone test
		Medium or large	Benjamini–Hochberg adjusted BD, Tarone, or Woolf tests
	F	Small	BD-based LSD test
		Medium or large	YA test


Abbreviation: TOR, true odds ratio; P1, only one of the odds ratios is different; P2, more than one of the odds ratios is different; F, all of the odds ratios are different; E, equal sample size design; WI, within-center inequality; AI, among-centers inequality.

The Breslow-Day, Tarone, Woolf, and Peto tests were suggested to test the homogeneity of more than three odds ratios. The multiple comparison procedures are applied when the odds ratios are heterogeneous. We discussed the performance of these tests in the multiple comparison procedure. For this purpose, we used the adjustment methods to control the type-I and type-II errors. The BD-based and chi-square-based LSD tests and the YA test were specifically suggested as multiple comparison tests of odds ratios. Thus, the BD-based and chi-square-based LSD tests and the YA test behaved differently from the other tested methods. Unexpectedly, even though BD-based and chi-square-based LSD tests were multiple-comparison tests, their performance was below the others.

The simulation space of this numerical study covered the COVID-19 data given in Section 3. The data had 4 strata, and only the odds ratio for China was greater than the others (only one different odds ratio). While the studies from Italy and France had large sample sizes, those from Greece and China were smaller. The simulation results showed that the tests with Benjamini–Hochberg adjustment were more powerful and controlled the error rates for tables with 4 strata, only one different odds ratio, and an among-centers inequality design. Consistent with the simulation results, the Benjamini–Hochberg adjusted tests indicated a difference between the odds ratio for China and those for the other countries. The only exception was the Woolf test with Benjamini–Hochberg adjustment for the comparison between China and France, which was marginally non-significant (p = 0.056). The simulation results showed that the Bonferroni, Holm, Hochberg, and Hommel adjusted BD, Tarone, and Woolf tests, and the BD-based and chi-square-based LSD tests, were not suitable for the multiple comparisons of odds ratios for such tables; accordingly, these tests did not indicate any difference between the odds ratios. The YA test, which was less conservative than the others, indicated a statistically significant difference between the odds ratios for Greece and France.
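The pairwise comparison step for a table with 4 strata can be sketched as follows. This minimal Python example applies a Woolf-type test to every pair of strata, using the fact that for two strata the Woolf homogeneity statistic reduces to the squared difference of the log odds ratios divided by the sum of their variances, referred to a chi-square distribution with 1 degree of freedom. The stratum labels and cell counts are hypothetical and are not the COVID-19 data of Section 3; no continuity correction is applied, which may differ from the implementation used in the study.

```python
import math
from itertools import combinations


def log_odds_ratio(a, b, c, d):
    """Return (log odds ratio, its estimated variance) for a 2x2 table."""
    return math.log((a * d) / (b * c)), 1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d


def woolf_pairwise_pvalue(table1, table2):
    """Woolf-type test that two strata share the same odds ratio."""
    l1, v1 = log_odds_ratio(*table1)
    l2, v2 = log_odds_ratio(*table2)
    stat = (l1 - l2) ** 2 / (v1 + v2)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical 2x2 tables (a, b, c, d); stratum "D" has a larger odds ratio.
strata = {
    "A": (20, 80, 10, 90),
    "B": (30, 120, 15, 135),
    "C": (12, 48, 6, 54),
    "D": (40, 20, 10, 50),
}

# 4 strata give 4 * 3 / 2 = 6 pairwise comparisons.
pairwise = {
    (s, t): woolf_pairwise_pvalue(strata[s], strata[t])
    for s, t in combinations(strata, 2)
}
for pair, p in pairwise.items():
    print(pair, round(p, 4))
```

The resulting six raw p-values would then be passed to an adjustment method (for example, Benjamini–Hochberg) before being compared with the significance level.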

Because it is difficult to design a simulation study that accounts for both the heterogeneity of odds ratios and the differences between the sample sizes in each stratum, this study was limited to a maximum of 7 strata. Even though the minimum number of strata is theoretically 2, working with a larger number of studies makes a meta-analysis more powerful and reliable.

Supplementary Material

Supplemental Material

Click here for additional data file.^{(211KB, pdf)}

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

1. Agresti A. and Jonathan J., Strategies for comparing treatments on a binary response with multi-centre data, Stat. Med. 19 (2000), pp. 1115–1139.
2. Armistead T.W., Measures of association, in Wiley StatsRef: Statistics Reference Online, John Wiley & Sons, Ltd., 2014, pp. 1–24.
3. Bagheri Z., Ayatollahi S.M.T., and Jafari P., Comparison of three tests of homogeneity of odds ratios in multicenter trials with unequal sample sizes within and among centers, BMC Med. Res. Methodol. 11 (2011), pp. 1–8.
4. Benjamini Y. and Hochberg Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B (Methodol.) 57 (1995), pp. 289–300.
5. Breslow N.E., Statistics in epidemiology: The case-control study, J. Am. Stat. Assoc. 91 (1996), pp. 14–28.
6. Breslow N.S. and Day N.E., The Analysis of Case-control Studies. Statistical Methods in Cancer Research, Vol. 1, International Agency for Research on Cancer Scientific Publications, Lyon, France, 1980.
7. Chen S.Y., Feng Z., and Yi X., A general introduction to adjustment for multiple comparisons, J. Thorac. Dis. 9 (2017), pp. 1725–1729.
8. Colombi D., Bodini F.C., Petrini M., Maffi G., Morelli N., Milanese G., Silva M., Sverzellati N., and Michieletti E., Well-aerated lung on admitting chest CT to predict adverse outcome in COVID-19 pneumonia, Radiology 296 (2020), pp. E86–E96.
9. de Almeida-Pititto B., Dualib P.M., Zajdenverg L., Dantas J.R., De Souza F.D., Rodacki M., and Bertoluci M.C., Severity and mortality of COVID 19 in patients with diabetes, hypertension and cardiovascular disease: A meta-analysis, Diabetol. Metab. Syndr. 12 (2020), pp. 1–12.
10. Demirhan H., Dolgun N.A., and Demirhan Y.P., Performance of some multiple comparison tests under heteroscedasticity and dependency, J. Stat. Comput. Simul. 80 (2010), pp. 1083–1100.
11. Der Simonian R. and Laird N., Meta-analysis in clinical trials, Control Clin. Trials 7 (1986), pp. 177–188.
12. Dinno A., Nonparametric pairwise multiple comparisons in independent groups using Dunn's test, Stata J. 15 (2015), pp. 292–300.
13. Dunn O.J., Multiple comparisons among means, J. Am. Stat. Assoc. 56 (1961), pp. 52–64.
14. Fletcher J., What is heterogeneity and is it important?, Br. Med. J. 334 (2007), pp. 94–96.
15. Gagnier J.J., Moher D., Boon H., Beyene J., and Bombardier C., Investigating clinical heterogeneity in systematic reviews: A methodologic review of guidance in the literature, BMC Med. Res. Methodol. 12 (2012), pp. 1–15.
16. Gavaghan D.J., Moore R.A., and McQuay H.J., An evaluation of homogeneity tests in meta-analyses in pain using simulations of individual patient data, Pain 85 (2000), pp. 415–424.
17. Giamarellos-Bourboulis E.J., Netea M.G., Rovina N., Akinosoglou K., Antoniadou A., Antonakos N., Damoraki G., Gkavogianni T., Adami M.-E., Katsaounou P., Ntaganou M., Kyriakopoulou M., Dimopoulos G., Koutsodimitropoulos I., Velissaris D., Koufargyris P., Karageorgos A., Katrini K., Lekakis V., Lupse M., Kotsaki A., Renieris G., Theodoulou D., Panou V., Koukaki E., Koulouris N., Gogos C., and Koutsoukou A., Complex immune dysregulation in COVID-19 patients with severe respiratory failure, Cell Host Microbe 27 (2020), pp. 992–1000.e3.
18. Halperin M., Ware J.H., Byar D.P., Mantel N., Brown C.C., Koziol J., Gail M., and Sylvan B.G.R., Testing for interaction in an $i \times j \times k$ contingency table, Biometrika 64 (1977), pp. 271–275.
19. Hochberg Y., A sharper Bonferroni procedure for multiple tests of significance, Biometrika 75 (1988), pp. 800–802.
20. Holm S., A simple sequentially rejective multiple test procedure, Scand. J. Stat. 6 (1979), pp. 65–70.
21. Hommel G., A stagewise rejective multiple test procedure based on a modified Bonferroni test, Biometrika 75 (1988), pp. 383–386.
22. Hsiung T.H. and Olejnik S., Power of pairwise multiple comparisons in the unequal variance case, Commun. Stat. Simul. Comput. 23 (1994), pp. 691–710.
23. Jones M.P., O'Gorman T.W., Lemke J.H., and Woolson R.F., A Monte Carlo investigation of homogeneity tests of the odds ratio under various sample size configurations, Biometrics 45 (1989), pp. 171–181.
24. Kulinskaya E. and Dollinger M.B., An accurate test for homogeneity of odds ratios based on Cochran's q-statistic, BMC Med. Res. Methodol. 15 (2015), pp. 1–19.
25. Mantel N. and Haenszel W., Statistical aspects of the analysis of data from retrospective studies of disease, J. Natl. Cancer Inst. 22 (1959), pp. 719–748.
26. Almalik O. and van den Heuvel E.R., Testing homogeneity of effect sizes in pooling $2 \times 2$ contingency tables from multiple studies: A comparison of methods, Cogent Math. Stat. 5 (2018), p. 1478698.
27. Paul S.R. and Donner A., A comparison of tests of homogeneity of odds ratios in k $2 \times 2$ tables, Stat. Med. 8 (1989), pp. 1455–1468.
28. R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2019. Available at https://www.R-project.org/.
29. Ramsey P.H., Power differences between pairwise multiple comparisons, J. Am. Stat. Assoc. 73 (1989), pp. 479–485.
30. Reis I.M., Hirji K.F., and Afifi A.A., Exact and asymptotic tests for homogeneity in several $2 \times 2$ tables, Stat. Med. 18 (1999), pp. 893–906.
31. Šidák Z., Rectangular confidence regions for the means of multivariate normal distributions, J. Am. Stat. Assoc. 62 (1967), pp. 626–633.
32. Signorell A., Aho K., Alfons A., Anderegg N., Aragon T., Arppe A., Baddeley A., Barton K., Bolker B., and Borchers H.W., DescTools: Tools for descriptive statistics, 2020, R package version 0.99.36. Available at https://cran.r-project.org/package=DescTools.
33. Simonnet A., Chetboun M., Poissy J., Raverdy V., Noulette J., Duhamel A., Labreuche J., Mathieu D., Pattou F., and Jourdain M., High prevalence of obesity in severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) requiring invasive mechanical ventilation, Obesity 28 (2020), pp. 1195–1199.
34. Tarone R.E., On heterogeneity tests based on efficient scores, Biometrika 72 (1985), pp. 91–95.
35. Van den Ende J., Moreira J., Basinga P., and Bisoffi Z., The trouble with likelihood ratios, Lancet 366 (2005), p. 548.
36. Wang L., He W., Yu X., Hu D., Bao M., Liu H., Zhou J., and Jiang H., Coronavirus disease 2019 in elderly patients: Characteristics and prognostic factors based on 4-week follow-up, J. Infect. 80 (2020), pp. 639–645.
37. Wei Q. and Lai D., Test for homogeneity of odds ratios using U-statistics, Open J. Stat. 9 (2019), pp. 347–360.
38. Wong Y.J., Tan M., Zheng Q., Li J.W., Kumar R., Fock K.M., Teo E.K., and Ang T.L., A systematic review and meta-analysis of the COVID-19 associated liver injury, Ann. Hepatol. 19 (2020), pp. 627–634.
39. Woolf B., On estimating the relation between blood group and disease, Ann. Hum. Genet. 19 (1955), pp. 251–253.
40. Wright S.P., Adjusted p-values for simultaneous inference, Biometrics 48 (1992), pp. 1005–1023.
41. Wu Z. and McGoogan J.M., Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: Summary of a report of 72 314 cases from the Chinese center for disease control and prevention, JAMA 323 (2020), pp. 1239–1242.
42. Yang J., Zheng Y.A., Gou X., Pu K., Chen Z., Guo Q., Ji R., Wang H., Wang Y., and Zhou Y., Prevalence of comorbidities and its effects in patients infected with SARS-CoV-2: A systematic review and meta-analysis, Int. J. Infect. Dis. 94 (2020), pp. 91–95.
43. Yilmaz A.E. and Altunay S.A., Post-hoc comparison tests for odds ratios, Electron. J. Appl. Stat. Anal. 15 (2022), pp. 75–94.
44. Yusuf S., Peto R., Lewis J., Collins R., and Sleight P., Beta blockade during and after myocardial infarction: An overview of the randomized trials, Prog. Cardiovasc. Dis. 27 (1985), pp. 335–371.
45. Zelen M., The analysis of several $2 \times 2$ contingency tables, Biometrika 58 (1971), pp. 129–137.
46. Zwinderman A.H. and Bossuyt P.M., We should not pool diagnostic likelihood ratios in systematic reviews, Stat. Med. 27 (2008), pp. 687–697.
