Skip to main content
BMC Medical Research Methodology logoLink to BMC Medical Research Methodology
. 2025 Nov 20;25:261. doi: 10.1186/s12874-025-02719-7

Alternative tests and measures for between-study inconsistency in meta-analysis

Zhiyuan Yu 1,#, Mengli Xiao 2,#, Xing Xing 3, Lifeng Lin 4,
PMCID: PMC12632047  PMID: 41266976

Abstract

Meta-analysis is a widely used method for synthesizing results from multiple studies across diverse fields. A central challenge in meta-analysis is assessing between-study inconsistency, which can arise from differences in study populations, methodological heterogeneity, or the presence of outliers. Conventional tools such as the Inline graphic and Inline graphic statistics could be limited in power, especially when the number of studies is small or when the between-study distribution deviates from normality. To address these limitations, we propose a family of alternative Inline graphic-like statistics and a hybrid test that adaptively combines their strengths. We also introduce new measures to quantify inconsistency based on these statistics. Simulation studies demonstrate that the hybrid test performs robustly across a wide range of inconsistency patterns, including heavy-tailed, skewed, and contaminated distributions. We further illustrate the practical utility of our methods using three real-world meta-analyses. These approaches offer more flexible and powerful tools for detecting and quantifying inconsistency in meta-analytic practice.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12874-025-02719-7.

Keywords: Heterogeneity, Hybrid test, Inconsistency, Meta-analysis, Resampling, Statistical power

Background

Meta-analysis is widely used to combine results from multiple studies on the same research topic in a wide range of fields [1, 2]. A critical step for validating a meta-analysis is to assess the differences between studies. Because the multiple studies recruited participants with potentially different characteristics and were conducted by different research teams with different methods, their results are usually expected to have a certain extent of heterogeneity [3]. If the studies are considered mostly homogeneous, such that they share a common true effect size, a common-effect model is employed for the meta-analysis. On the other hand, if the studies are heterogeneous, such that their underlying true effect sizes differ, a random-effects model is typically used to account for the heterogeneity [4, 5]. As such, the assessment of the differences between studies offers valuable information for model selection and appraising the interpretability of meta-analysis conclusions.

The Inline graphic test is traditionally used to measure the between-study heterogeneity in a meta-analysis. Under the null hypothesis of homogeneity, the Inline graphic statistic follows a chi-squared distribution. This statistic takes the sum of squared standardized deviates from individual studies. Conceptually, it shares a similar idea with the classical least-squares regression that minimizes the sum of squared errors. The structure of the sum of squares may be suitable for the conventional assumption made by the random-effects meta-analysis models; that is, the underlying true effect sizes of all individual studies follow a normal distribution with a certain heterogeneity variance. The theoretical properties of the Inline graphic statistic have been mostly investigated under such between-study normality [68]. Nevertheless, this is not always the case of discrepancies between studies’ results, and the normality assumption could be questionable in many applications [912]. For example, a meta-analysis may include a few outlying studies whose results take extreme values compared with the majority of studies [13]. It may also inappropriately pool multiple subgroups in the same analysis; the subgroups could be distinguished based on certain study-level summaries of population characteristics, such as average ages [14]. The studies in different subgroups could have dramatically different effect sizes, while those in the same group may share more homogeneous effect sizes. In such cases, the underlying effect sizes from the whole studies may have a distribution with multiple modes. Moreover, publication bias or small-study effects can distort the observed distribution of studies. For example, if studies reporting effects in one direction (say, positive effects of an intervention) are more likely to be published, the between-study distribution may become skewed, often with a longer tail on the side favored by the bias. It is well documented in the meta-analysis literature that publication bias can interact with heterogeneity assessments, potentially exaggerating between-study variability or even masking true inconsistency [15]. Indeed, ignoring the coexistence of publication bias and heterogeneity may lead to misleading conclusions about both [16, 17]. In summary, the conventional Inline graphic test may not be suitable for the foregoing non-normal cases, as it may suffer from low statistical power, especially when the number of studies in a meta-analysis is relatively small [18, 19].

Considering the diversity of factors that may cause the discrepancies between studies on the same research topic, this article refers to such discrepancies as inconsistency in general instead of heterogeneity in particular. The term “inconsistency” was also used by Higgins et al. [20] when they introduced the famous Inline graphic statistic to the medical community for quantifying the discrepancies between studies. Moreover, this terminology is adopted in the GRADE (Grading of Recommendations Assessment, Development and Evaluation) framework for assessing the certainty of evidence [21]. In the literature of meta-analysis methodology, on the other hand, the term “heterogeneity” seems to be more widely used for the same purpose. It is frequently paired with the random-effects model, which assumes between-study normality; that is, the discrepancies permeate the entire meta-analysis, rather than being limited to certain subgroups or a few outlying studies. This article uses the term “inconsistency” to inclusively cover all types of discrepancies, including subgroup effects and potential outliers. It is important to note that this usage differs from evidence inconsistency in network meta-analysis, which refers to disagreement between direct and indirect comparisons of multiple interventions [22, 23].

Beyond the traditional Inline graphic test, several likelihood-based approaches, including the score test, likelihood-ratio test, and Wald test, have been proposed for assessing between-study heterogeneity [24, 25]. Similar to the Inline graphic test, these methods are generally derived under the assumption of normally distributed random effects. However, they may suffer from low power or lack robustness when the underlying random-effects distribution is skewed, heavy-tailed, or contaminated. In addition, some of these likelihood-based methods are tailored to specific effect measures, such as odds ratios [26], which can limit their general applicability across different types of meta-analyses.

Motivated by these limitations, this article presents alternative test statistics to examine the between-study inconsistency. These statistics are designed based on the sum of absolute values of standardized deviates with different mathematical powers (e.g., square, cubic, maximum, and so on). They attempt to serve as suitable candidates under various scenarios of between-study distributions. For example, consider a meta-analysis with an extremely outlying study; except for the outliers, all remaining studies are mostly homogeneous. If the conventional Inline graphic statistic is used to test for the inconsistency, the sum of squares of standardized deviates would include too much noise, given that most studies are actually homogeneous. Indeed, in such a case, the maximum of the standardized deviates would efficiently capture the inconsistency with the minimal contamination by noises from homogeneous studies, as the outlying study is expected to create the largest deviate.

By their designs, the various alternative tests for inconsistency have different statistical power under different settings that cause discrepancies between individual studies’ results, so there is no universally best test. In practice, it is infeasible to justify an optimal test that fits a specific meta-analysis dataset. As such, using the idea of adaptive testing [27, 28], we derive a hybrid test based on the various tests for inconsistency. The hybrid test statistic takes the minimum P-values from various tests for inconsistency so that it can achieve relatively high power across a wide range of settings. To properly control the type I error rate of the hybrid test, we propose a parametric resampling procedure to derive the null distribution and thus calculate the empirical P-value of the hybrid test.

This article is organized as follows. We start with presenting the setup of the inconsistency problem in a meta-analysis, reviewing the existing popular Inline graphic test for inconsistency, proposing the alternative test statistics and hybrid test, and providing the algorithm of the resampling method for deriving the tests’ P-values. Then, we present simulation studies to compare the statistical power of the various tests and use three case studies to illustrate the real-world performance of the tests. Finally, we conclude this article with discussions about the proposed methods’ limitations and potential future directions.

Methods

The conventional method for testing for inconsistency

Suppose that a meta-analysis includes a total of Inline graphic independent studies. Let Inline graphic be the observed effect size in study Inline graphic (Inline graphic = 1, 2, …, Inline graphic) and Inline graphic be its standard error. Meta-analysis models typically assume that each Inline graphic is approximately normally distributed with mean Inline graphic and variance Inline graphic. Also, the observed standard errors Inline graphic’s are conventionally treated as fixed variables, as if there is no error in their estimation. These assumptions are generally valid if the sample sizes in the studies are sufficiently large (e.g., due to the central limit theorem and law of large numbers), but extra cautions are needed for small sample sizes [10, 29].

This article focuses on testing for the potential inconsistency between the Inline graphic studies, so the distribution of the study-specific underlying true effect sizes Inline graphic plays a critical role. The null hypothesis is that all studies are homogeneous, sharing a common effect size Inline graphic; that is, Inline graphic, and their distribution is a mass at Inline graphic. In such cases, the common-effect model is used. If the homogeneity does not appear to hold, meta-analysts conventionally use the random-effects model to account for the between-study inconsistency. This model typically assumes that Inline graphic’s are random effects, following a normal distribution Inline graphic, where Inline graphic represents the overall mean effect size in the random-effects framework and Inline graphic is the between-study variance [4]. Although alternative distributions are possible for the random effects [30, 31], the between-study normality assumption has dominated the current literature of meta-analyses owing to its simplicity.

The Inline graphic test is the standard approach to examining inconsistency. Its statistic is defined as

graphic file with name d33e489.gif

Where the common-effect estimate is

graphic file with name d33e494.gif 1

The Inline graphic statistic follows a Inline graphic distribution under the null hypothesis of homogeneity.

Alternative test statistics

It is straightforward to rewrite the conventional Inline graphic statistic as Inline graphic, where Inline graphic may be interpreted as study-specific standardized deviates. As illustrated in the introduction section, this sum-of-squares structure is arguably suitable for the between-study normality assumption, but it may not work well for other between-study distributions. Consider the case that the meta-analysis contains an outlying study (denoting its index by Inline graphic) and all remaining studies are homogeneous. We would expect Inline graphic to take an extremely large value in absolute magnitude, while the other studies’ standardized deviates Inline graphic (Inline graphic) are small. The between-study inconsistency can be mostly captured by Inline graphic, and the Inline graphic’s (Inline graphic) may add nuisance information to the sum of squares in Inline graphic. Therefore, it is sensible to test for the inconsistency based solely on Inline graphic, i.e., the maximum value among the absolute values of all standardized deviates Inline graphic’s.

In addition, we may consider taking the sum of the Inline graphic’s with different mathematical powers Inline graphic to reflect different weights contributed by individual studies. For example, when Inline graphic = 1, the sum becomes Inline graphic, which could reduce the impact of potential outliers and make the assessment of inconsistency more robust [9]. When Inline graphic increases, the contributions of larger deviates to the sum become larger, while those of smaller deviates become smaller. If Inline graphic approaches infinity, then only the largest deviate would dominate the sum, so the sum effectively plays a similar role to the maximum of all deviates. Different values of Inline graphic could capture different patterns of the between-study distributions.

Formally, we propose the following alternative statistic for an integer value of the mathematical power Inline graphic:

graphic file with name d33e606.gif 2

If Inline graphic = 2, the Inline graphic becomes the conventional Inline graphic statistic. For other values of Inline graphic, the Inline graphic statistic could capture different patterns of between-study inconsistency and thus be more powerful than Inline graphic. Also, because the largest deviate in absolute magnitude would dominate the sum as Inline graphic approaches infinity, we define the statistic for Inline graphic as

graphic file with name d33e645.gif 3

This article considers the values of 1, 2, …, 8, and Inline graphic for Inline graphic. Based on our empirical experiments, these values are sufficient to capture various patterns of between-study inconsistency. We denote the P-value of the Inline graphic statistic by Inline graphic.

In practice, however, it is infeasible to justify the optimal Inline graphic for a specific meta-analysis because identifying the between-study distribution is challenging. Borrowing the idea of adaptive testing [27, 28, 32], we propose a hybrid test for the between-study inconsistency. Specifically, we first consider a set of candidate tests for inconsistency, say Inline graphic with Inline graphic. Then, the hybrid test statistic is defined as the minimum P-value of all candidate tests; that is,

graphic file with name d33e697.gif 4

As the minimum among the P-values of a pool of tests, the Inline graphic is no longer a valid P-value because it cannot control the type I error rate. Indeed, we treat Inline graphic as a test statistic rather than a P-value. The following subsection proposes a resampling method to derive the null distribution of the hybrid test statistic and thus calculate its P-value Inline graphic. Because it is also difficult to derive the theoretical null distribution of Inline graphic (except for Inline graphic = 2, which leads to the conventional Inline graphic test statistic), the resampling method is used to derive the P-values of Inline graphic’s as well.

Calculation of the P-values of alternative tests

The resampling method for calculating the P-values of the proposed tests is as follows. First, under the null hypothesis of homogeneity, the common effect size Inline graphic is estimated as in Eq. (1), the test statistic Inline graphic for each Inline graphic is obtained using Eqs. (2) or (3), and the hybrid test statistic Inline graphic is obtained using Eq. (4). Second, we generate resampled replicates of the meta-analysis for Inline graphic times (say, Inline graphic = 1,000). Ideally, Inline graphic should be as large as computational resources allow to minimize Monte Carlo error. For each resampling iteration Inline graphic (Inline graphic = 1, 2, …, Inline graphic), we draw study-specific standard errors Inline graphic with replacement from the original standard errors Inline graphic (Inline graphic = 1, 2, …, Inline graphic), and the effect size estimates are obtained as Inline graphic. For the Inline graphic th resampled meta-analysis, we calculate Inline graphic for each Inline graphic using Inline graphic and Inline graphic. These resampled test statistics form null distributions for each Inline graphic; thus, the P-value of Inline graphic can be calculated as

graphic file with name d33e866.gif 5

where Inline graphic is the indicator function. A constant 1 is added to both the denominator and the numerator to avoid the P-value being calculated as 0.

Of note, the P-values of Inline graphic based on Eq. (5) are obtained for the original meta-analyses. To obtain the P-value of the hybrid test, we need to calculate the hybrid test statistic for each resampled meta-analysis, which depends on the P-values of Inline graphic for the resampled meta-analysis. For the Inline graphic th resampled meta-analysis with the test statistic Inline graphic, the statistics Inline graphic in other resampled meta-analyses (Inline graphic = 1, 2, …, Inline graphic but Inline graphic) can serve as an empirical distribution for Inline graphic. Thus, the P-value of Inline graphic can be calculated as

graphic file with name d33e938.gif

With these P-values, for the Inline graphic th resampled meta-analysis, its hybrid test statistic is

graphic file with name d33e950.gif

Finally, based on the hybrid test statistics of the resampled meta-analyses under the null hypothesis, the P-value of Inline graphic is

graphic file with name d33e962.gif

Alternative measures for quantifying inconsistency

In addition to testing for inconsistency, it is also of great interest to quantify it [33]. Like the conventional Inline graphic statistic, the magnitudes of the proposed Inline graphic statistics depend on the number of studies Inline graphic, so Inline graphic cannot be directly used as measures of between-study inconsistency in different meta-analyses. Motivated by the popular Inline graphic statistic [20, 34], we extend the Inline graphic statistics to derive alternative inconsistency measures.

Specifically, the Inline graphic statistic can be calculated as

graphic file with name d33e1009.gif 6

where Inline graphic is the conventional Inline graphic statistic. It is interpreted as the percentage of the total variation in study estimates due to the between-study inconsistency rather than the sampling error. Thus, conceptually, the Inline graphic statistic can be considered as a form of

graphic file with name d33e1027.gif 7

where Inline graphic represents the between-study variance and Inline graphic is a summary of all sample variances from individual studies [34]. The Cochrane Handbook gives a rough guide for interpreting the Inline graphic statistic as unimportant, moderate, or substantial inconsistency [2].

In the framework of this article, however, we do not pursue new measures with a similar interpretation as in Eq. (7) The marginal variance Inline graphic may be intuitive under the between-study normality assumption, but it may not be meaningful in general settings of inconsistency, e.g., the existence of outlying studies. Instead, we are motivated to construct new inconsistency measures by taking another look at the formula of Inline graphic in Eq. (6). Because Inline graphic in the numerator is the expectation of the Inline graphic statistic under the hypothesis, Inline graphic describes the excess of the observed Inline graphic statistic compared with its null expectation, and thus Inline graphic can also be viewed as the percentage of excess inconsistency. This interpretation shares a similar idea with the concept of excess statistical significance used for assessing publication bias [35, 36].

As such, based on each alternative test statistic Inline graphic, we propose to quantify inconsistency by

graphic file with name d33e1100.gif

where Inline graphic is the expectation of Inline graphic under the null hypothesis. The Inline graphic measure is interpreted as the percentage excess inconsistency based on the Inline graphic statistic. When Inline graphic = 2, Inline graphic is identical to the Inline graphic statistic. In addition, like Inline graphic [34], the Inline graphic measure is scale-invariant because Inline graphic is unit-free.

We can derive the theoretical null expectations of Inline graphic for Inline graphic. Recall that Inline graphic, where Inline graphic. Because Inline graphic follows normal distribution, so Inline graphic follows a folded normal distribution with mean 0 if Inline graphic’s are treated as fixed values. By this property, we can derive the following formula of Inline graphic using the Inline graphic th moment of the folded normal distribution:

graphic file with name d33e1188.gif

where Inline graphic denotes the Gamma function and Inline graphic. The detailed proof is given in Additional File 1. Here, we list the equations of Inline graphic with Inline graphic = 1, 2, …, and 8 (which are used in our numerical studies):

  • Inline graphic;

  • Inline graphic;

  • Inline graphic

  • Inline graphic;

  • Inline graphic;

  • Inline graphic;

  • Inline graphic; and

  • Inline graphic.

It is infeasible to obtain the explicit equations of Inline graphic and Inline graphic. Nevertheless, the resampling method introduced in the previous subsection can also be used to calculate the null expectation based on the resampled meta-analyses; that is,

graphic file with name d33e1268.gif

This approximation can be readily used for obtaining Inline graphic with Inline graphic.

For the hybrid test statistic, it is not straightforward to derive Inline graphic in a similar way for two reasons. First, a large value of Inline graphic implies a large inconsistency between studies. On the contrary, a large value of Inline graphic implies a small inconsistency between studies. Second, Inline graphic only ranges from 0 to 1; therefore, it is difficult to compare its magnitudes across meta-analyses when Inline graphic becomes small, say, Inline graphic and Inline graphic. To solve these problems, we apply a log transformation with base 10 to Inline graphic and impose a negative sign; the transformed statistic is Inline graphic. After this transformation, we can use the foregoing resampling method for Inline graphic to approximate the Inline graphic:

graphic file with name d33e1330.gif

Thus, the measure of inconsistency based on the hybrid statistic is calculated as

graphic file with name d33e1335.gif

Similar to the interpretation of Inline graphic, Inline graphic represents the percentage of excess inconsistency, but it is derived from the Inline graphic (or equivalently, Inline graphic) statistic. For example, if Inline graphic = 0.001, corresponding to Inline graphic = 3, and the expected value under the null hypothesis is Inline graphic = 1, then Inline graphic = (3–1)/3 = 66.7%.

Simulation studies

We validated the performance of various tests for between-study inconsistency via simulation studies. We considered the conventional Inline graphic test, the proposed Inline graphic tests with Inline graphic = 1, 2, …, 8, and Inline graphic, as well as the hybrid test. The tests’ performance was assessed in terms of their type I error rates and statistical power.

To derive type I error rates, we generated meta-analyses with observed effect sizes as Inline graphic under the null hypothesis of homogeneity (Case 0). Without loss of generality, we set Inline graphic = 0. In addition, the within-study standard errors Inline graphic were sampled from a uniform distribution, Inline graphic or Inline graphic.

To derive statistical power, we considered various settings of the alternative hypothesis of between-study inconsistency. Specifically, the study-specific effect sizes Inline graphic were generated from Inline graphic, where Inline graphic was the true effect size of study Inline graphic and followed various distributions as follows to reflect different patterns of between-study inconsistency.

  • Case 1: Inline graphic;

  • Case 2: the mixture normal distribution, Inline graphic, consisting of two normal distributions with the same mean 0 and different variances;

  • Case 3: the gamma distribution with shape parameter 0.05 and rate parameter 0.1, Inline graphic;

  • Case 4: the Inline graphic distribution with degrees of freedom 3 and 8, Inline graphic.

  • Case 5: the mixture normal distribution, Inline graphic, consisting of two normal distributions with different means but the same variance;

The normality assumption in Case 1 is conventional for most meta-analysis methods. The distribution in Case 2 had heavier tails than a single normal distribution, which could generate extreme values. The distributions in Cases 3 and 4 were skewed, arguably reflecting the potential influence of publication bias or small-study effects. In Case 5, the distribution conceptually represents scenarios in which a meta-analysis includes two subgroups, each centered at a different overall effect size. The distributions in Cases 3 to 5 have non-zero mean; to make fair comparisons, we centralized these distributions at 0 so that Inline graphic = 0.

We additionally considered two cases of contamination:

  • Case 6: all Inline graphic = Inline graphic 0.2 expect that Inline graphic;

  • Case 7: all Inline graphic = 0 but Inline graphic was artificially added by a discrepancy value of Inline graphic = 3.

The cases occur when most studies in a meta-analysis are homogeneous, but one study based on a dramatically different population is inappropriately included in the meta-analysis.

Each simulated meta-analysis consisted of Inline graphic = 15 or 30 studies. We generated 1,000 replicates for each simulation setting. Because the resampling algorithm is computationally intensive, and considering that it needed to be repeated for 1,000 Monte Carlo replications in the simulations, we used Inline graphic = 500 resampling iterations instead of a much larger number. The significance level for inconsistency was set to 0.10 because inconsistency tests typically have low power in many cases, as indicated by existing simulation studies [37].

Case studies

To illustrate the practical utility of the proposed methods, we applied the Inline graphic tests and the hybrid tests to three real-world meta-analyses. For each meta-analysis, we calculated the P-values of the various tests and the corresponding inconsistency measures. The resampling method was used for calculating the P-values, and the number of resampling iterations was 10,000. Of note, we used a much larger Inline graphic in these case studies than in the simulations because only three meta-analyses were analyzed, making the larger number of resampling iterations computationally feasible and allowing for smaller Monte Carlo resampling error.

Results

Results of simulation studies

Tables 1, 2, 3, and 4 present the simulation results under a range of data-generating settings that varied the distribution of the true effects Inline graphic, the within-study standard errors Inline graphic or Inline graphic, and the number of studies Inline graphic = 15 or 30. Monte Carlo standard errors were mostly around 2%, ensuring reliable comparisons of type I error rates or statistical power across settings. The settings covered symmetric, skewed, heavy-tailed, and contaminated distributions, offering a comprehensive assessment of test performance under diverse conditions.

Table 1.

Type I error rates (Case 0) and statistical power (Cases 1–7), expressed as percentages, for various tests under the setting with within-study standard errors Inline graphic and the number of studies Inline graphic = 15. Monte Carlo standard errors are shown in parentheses. The significance level was set at 10%

Setting Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Case 0 8.6 (0.9) 8.8 (0.9) 9.1 (0.9) 8.9 (0.9) 9.0 (0.9) 9.2 (0.9) 9.3 (0.9) 9.4 (0.9) 9.3 (0.9) 9.1 (0.9)
Case 1 68.2 (1.5) 71.3 (1.4) 69.9 (1.5) 67.0 (1.5) 65.3 (1.5) 64.0 (1.5) 62.9 (1.6) 62.0 (1.6) 56.2 (1.6) 69.0 (1.5)
Case 2 58.7 (1.6) 65.4 (1.5) 66.8 (1.5) 66.6 (1.5) 66.2 (1.5) 65.9 (1.5) 65.5 (1.5) 65.6 (1.5) 65.2 (1.5) 66.4 (1.5)
Case 3 42.2 (1.6) 47.3 (1.6) 49.1 (1.6) 48.6 (1.6) 48.6 (1.6) 48.8 (1.6) 48.5 (1.6) 48.7 (1.6) 48.7 (1.6) 47.9 (1.6)
Case 4 55.5 (1.6) 61.0 (1.6) 60.1 (1.6) 58.7 (1.6) 58.0 (1.6) 57.4 (1.6) 56.7 (1.6) 56.2 (1.6) 53.8 (1.6) 59.9 (1.6)
Case 5 55.9 (1.6) 60.0 (1.5) 58.2 (1.6) 56.5 (1.6) 55.2 (1.6) 53.8 (1.6) 52.4 (1.6) 51.9 (1.6) 47.1 (1.6) 56.6 (1.6)
Case 6 24.9 (1.3) 29.7 (1.4) 33.2 (1.5) 34.2 (1.5) 34.5 (1.5) 34.6 (1.5) 34.7 (1.5) 35.0 (1.5) 34.7 (1.5) 33.0 (1.5)
Case 7 23.0 (1.3) 29.2 (1.4) 32.8 (1.5) 33.1 (1.5) 33.2 (1.5) 33.3 (1.5) 32.7 (1.5) 32.3 (1.5) 32.2 (1.5) 31.4 (1.5)

Table 2.

Type I error rates (Case 0) and statistical power (Cases 1–7), expressed as percentages, for various tests under the setting with within-study standard errors Inline graphic and the number of studies Inline graphic = 15. Monte Carlo standard errors are shown in parentheses. The significance level was set at 10%

Setting  Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Case 0 8.5 (0.9) 8.6 (0.9) 9.1 (0.9) 8.8 (0.9) 9.1 (0.9) 9.1 (0.9) 9.4 (0.9) 9.2 (0.9) 9.4 (0.9) 9.1 (0.9
Case 1 32.8 (1.5) 35.4 (1.5) 34.3 (1.5) 33.1 (1.5) 32.4 (1.5) 31.0 (1.5) 30.7 (1.5) 30.3 (1.5) 28.2 (1.4) 34.8 (1.5)
Case 2 36.2 (1.5) 43.5 (1.6) 43.8 (1.6) 43.3 (1.6) 43.5 (1.6) 43.5 (1.6) 43.4 (1.6) 43.0 (1.6) 41.4 (1.6) 43.2 (1.6)
Case 3 30.7 (1.5) 35.2 (1.5) 36.5 (1.6) 36.0 (1.5) 36.1 (1.5) 35.9 (1.5) 36.0 (1.5) 35.7 (1.5) 35.3 (1.5) 35.6 (1.5)
Case 4 30.0 (1.4) 35.9 (1.5) 35.9 (1.5) 35.8 (1.5) 34.4 (1.5) 34.1 (1.5) 33.7 (1.5) 33.7 (1.5) 31.9 (1.5) 35.9 (1.5)
Case 5 22.7 (1.3) 23.9 (1.3) 24.2 (1.4) 23.3 (1.3) 22.9 (1.3) 22.4 (1.3) 21.7 (1.3) 21.6 (1.3) 20.2 (1.3) 23.6 (1.3)
Case 6 14.4 (1.1) 17.2 (1.2) 17.8 (1.2) 17.8 (1.2) 18.1 (1.2) 17.8 (1.2) 17.4 (1.2) 17.4 (1.2) 17.2 (1.2) 16.6 (1.2)
Case 7 13.7 (1.0) 14.9 (1.1) 15.5 (1.1) 15.0 (1.1) 14.9 (1.1) 15.3 (1.1) 15.3 (1.1) 15.5 (1.1) 15.4 (1.1) 14.7 (1.1)

Table 3.

Type I error rates (Case 0) and statistical power (Cases 1–7), expressed as percentages, for various tests under the setting with within-study standard errors Inline graphic and the number of studies Inline graphic = 30. Monte Carlo standard errors are shown in parentheses. The significance level was set at 10%

Setting  Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Case 0 11.0 (1.0) 10.6 (1.0) 10.5 (1.0) 9.8 (0.9) 9.3 (0.9) 9.5 (0.9) 8.8 (0.9) 9.0 (0.9) 9.3 (0.9) 9.9 (0.9)
Case 1 86.8 (1.1) 89.4 (1.0) 88.4 (1.0) 86.6 (1.1) 84.3 (1.2) 81.4 (1.2) 79.8 (1.3) 77.7 (1.3) 68.8 (1.5) 87.4 (1.0)
Case 2 75.9 (1.3) 82.8 (1.2) 84.0 (1.2) 84.1 (1.2) 84.1 (1.2) 83.9 (1.2) 83.8 (1.2) 83.9 (1.2) 82.7 (1.2) 84.6 (1.2)
Case 3 57.5 (1.6) 65.6 (1.5) 67.6 (1.5) 67.9 (1.5) 67.8 (1.5) 67.8 (1.5) 67.7 (1.5) 67.7 (1.5) 68.0 (1.5) 67.2 (1.5)
Case 4 73.3 (1.4) 78.8 (1.3) 78.3 (1.3) 77.3 (1.3) 75.9 (1.3) 75.0 (1.3) 74.7 (1.3) 73.4 (1.4) 69.5 (1.5) 77.3 (1.3)
Case 5 75.5 (1.4) 79.4 (1.3) 77.8 (1.3) 73.8 (1.4) 70.5 (1.4) 67.3 (1.5) 65.4 (1.5) 63.3 (1.5) 54.6 (1.6) 75.5 (1.4)
Case 6 40.8 (1.6) 68.5 (1.5) 78.7 (1.3) 81.8 (1.2) 83.3 (1.2) 83.7 (1.2) 84.1 (1.1) 84.3 (1.1) 84.6 (1.1) 81.9 (1.2)
Case 7 17.9 (1.2) 22.9 (1.3) 26.6 (1.4) 27.5 (1.4) 29.2 (1.4) 28.8 (1.4) 28.8 (1.4) 28.7 (1.4) 28.4 (1.4) 27.0 (1.4)

Table 4.

Type I error rates (Case 0) and statistical power (Cases 1–7), expressed as percentages, for various tests under the setting with within-study standard errors Inline graphic and the number of studies Inline graphic = 30. Monte Carlo standard errors are shown in parentheses. The significance level was set at 10%

Setting  Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Case 0 11.1 (1.0) 10.5 (1.0) 10.3 (1.0) 9.6 (0.9) 9.0 (0.9) 9.5 (0.9) 8.9 (0.9) 9.0 (0.9) 9.1 (0.9) 9.8 (0.9)
Case 1 40.7 (1.6) 44.0 (1.6) 42.6 (1.6) 40.3 (1.6) 37.8 (1.6) 35.0 (1.5) 33.9 (1.5) 32.6 (1.5) 29.9 (1.4) 40.5 (1.6)
Case 2 44.8 (1.6) 55.6 (1.6) 57.0 (1.6) 57.5 (1.6) 56.9 (1.6) 56.2 (1.6) 55.5 (1.6) 55.2 (1.6) 52.8 (1.6) 57.0 (1.6)
Case 3 37.7 (1.6) 48.4 (1.6) 50.3 (1.6) 50.4 (1.6) 50.5 (1.6) 50.7 (1.6) 50.7 (1.6) 50.6 (1.6) 49.8 (1.6) 50.0 (1.6)
Case 4 39.5 (1.6) 44.9 (1.6) 46.1 (1.6) 45.5 (1.6) 44.9 (1.6) 43.7 (1.6) 43.1 (1.6) 42.6 (1.6) 38.6 (1.6) 44.9 (1.6)
Case 5 33.8 (1.5) 35.5 (1.5) 34.2 (1.5) 33.2 (1.5) 30.3 (1.5) 28.7 (1.4) 27.7 (1.4) 27.2 (1.4) 24.8 (1.4) 34.0 (1.5)
Case 6 22.2 (1.3) 30.4 (1.5) 36.3 (1.6) 38.7 (1.6) 39.7 (1.6) 40.0 (1.6) 39.8 (1.6) 40.2 (1.6) 39.6 (1.6) 38.1 (1.6)
Case 7 11.3 (1.0) 13.1 (1.0) 13.1 (1.0) 12.5 (1.0) 13.0 (1.0) 13.2 (1.0) 13.3 (1.0) 13.2 (1.0) 13.2 (1.0) 12.8 (1.0)

Type I error rate (Case 0)

Under the null hypothesis (Case 0: Inline graphic = 0), all tests controlled the type I error rates well at the nominal 10% level. For example, in Table 1 with Inline graphic and Inline graphic = 15, the type I error rates ranged from 8.6% to 9.4%; in Table 2 with Inline graphic and Inline graphic = 15, the rates had similar ranges (e.g., 8.5% for Inline graphic, 9.4% for Inline graphic). With Inline graphic = 30, the type I error rates remained well-controlled (Tables 3 and 4).

Normal distribution (Case 1)

Under the conventional normality assumption for between-study inconsistency, with Inline graphic, the statistical power of all tests increased substantially with the number of studies. For example, in the setting Inline graphic (Tables 1 and 3), the hybrid test Inline graphic achieved a power of 69.0% when Inline graphic = 15, which increased to 87.4% when Inline graphic = 30. Notably, the Inline graphic test demonstrated the highest power among all Inline graphic tests in this scenario, reaching 71.3% for Inline graphic = 15 and 89.4% for Inline graphic = 30. It also slightly outperformed the hybrid test. When the within-study standard errors followed a wider distribution with Inline graphic in Tables 2 and 4, the power of all tests declined due to increased within-study uncertainties. Nevertheless, Inline graphic remained one of the best-performing methods, with power values of 35.4% (Inline graphic = 15) and 44.0% (Inline graphic = 30), consistently outperforming most other tests.

Heavy-tailed distribution (Case 2)

Across all settings, the hybrid test Inline graphic demonstrated strong robustness and consistently competitive power. For Inline graphic and Inline graphic = 15, the hybrid test achieved a power of 66.4%, which was slightly lower than the best-performing Inline graphic tests, such as 66.8% for Inline graphic, 66.6% for Inline graphic, and 66.2% for Inline graphic. Nevertheless, it remained among the top-performing methods, and its power was notably higher than Inline graphic (with power of 58.7%) and comparable to Inline graphic (65.4%), as shown in Table 1. When within-study standard errors were larger, power declined for all methods. The hybrid test maintained solid performance with power of 43.2%, outperforming Inline graphic (36.2%), and having similar power to Inline graphic (43.5%) and Inline graphic (43.5%), as shown in Table 2. With more studies (Inline graphic = 30) in meta-analyses, the hybrid test showed clearer advantages. In Table 3 with Inline graphic, its power reached 84.6%, outperforming Inline graphic (75.9%) and Inline graphic (82.8%), and closely matching Inline graphic (84.1%). In Table 4 with Inline graphic, it again led with 57.0% power, exceeding the power of Inline graphic (44.8%) and Inline graphic (55.6%).

Skewed distributions (Cases 3 and 4)

For the gamma (Case 3) and Inline graphic-distributed (Case 4) inconsistency, the hybrid test continued to perform well. For instance, in Table 3’s Case 4, the hybrid test’s power was 77.3%, outperforming Inline graphic (73.3%) and Inline graphic (73.4%). In Table 4’s Case 3, the hybrid test reached the power of 50.0%, on par with Inline graphic (50.6%) and superior to Inline graphic (48.4%). These demonstrated the hybrid test’s adaptability to skewed distributions.

Bimodal distribution (Case 5)

With Inline graphic = 15, the hybrid test demonstrated stable and competitive performance. In Table 1, the hybrid test attained a power of 56.6%, closely matching Inline graphic (60.0%) and outperforming both Inline graphic (55.2%) and Inline graphic (47.1%). Under the scenario with larger within-study variances, the hybrid test still achieved 23.6% power, comparable to the best-performing test Inline graphic (24.2%) and exceeding Inline graphic (20.2%) (Table 2). These results suggested that the hybrid test maintained reasonable sensitivity in detecting inconsistency even when the underlying effect sizes may be from different subgroups.

Contamination scenarios (Cases 6 and 7)

In scenarios involving structural or data-driven contamination, test robustness is essential. Recall that Case 6 simulated effect reversal in one study, while Case 7 introduced a single extreme value. In Case 6, with Inline graphic = 15, the hybrid test achieved 33.0% power under the setting of Inline graphic in Table 1, higher than Inline graphic (29.7%) and slightly below Inline graphic (34.7%). Under the setting of Inline graphic in Table 2, all three tests performed similarly (with power in the range of 16.6–17.2%). With Inline graphic = 30, the hybrid test’s advantage became more evident. Under the setting of Inline graphic, its power reached 81.9%, notably higher than Inline graphic (68.5%) and close to Inline graphic (84.3%) and Inline graphic (84.6%), as shown in Table 3. Under the setting of Inline graphic, the hybrid test’s power achieved 38.1%, ranking between Inline graphic (30.4%) and Inline graphic (39.6%), as shown in Table 4. In Case 7, under Inline graphic = 15, the hybrid test had power of 31.4% under the setting of Inline graphic, outperforming Inline graphic (29.2%) and slightly below Inline graphic (32.2%) (Table 1). Under the setting of Inline graphic, its power achieved 14.7%, modestly below Inline graphic (17.2%) (Table 2). With Inline graphic = 30, Table 3 shows the hybrid test with power at 27.0% for Inline graphic, above Inline graphic (22.9%) and slightly below Inline graphic (28.4%). For Inline graphic, the hybrid test’s power was 12.8%, slightly behind Inline graphic (13.1%) and Inline graphic (13.2%) (Table 4).

Overall, Inline graphic exhibited excellent performance under normal inconsistency, making it a strong candidate when the true effect distribution is symmetric and well-behaved. Although the hybrid test had comparable or slightly lower power in this single scenario, it provided broader robustness across other settings involving skewness, heavy tails, or contamination. These findings suggest that the hybrid test offers greater adaptability in complex inconsistency structures.

Results of case studies

Figure 1 displays the forest plots of the three real-world meta-analyses used as our case studies. Table 5 presents their results after applying the various inconsistency tests and measures; it also gives the inconsistency measures Inline graphic. The effect measures of all three meta-analyses were the relative risk.

Fig. 1.

Fig. 1

The forest plots of three real-world meta-analyses

Table 5.

P-values of the various Inline graphic and hybrid tests for between-study inconsistency and their corresponding inconsistency measures Inline graphic for the three real-world meta-analyses

Meta-analysis Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Hughes et al. [38] P-value 0.593 0.221 0.094 0.060 0.047 0.042 0.040 0.038 0.034 0.065
Inline graphic 0 0.237 0.490 0.640 0.725 0.774 0.801 0.813 0.357 0.599
Gafter-Gvili et al. [39] P-value 0.041 0.136 0.243 0.321 0.371 0.407 0.428 0.446 0.556 0.082
Inline graphic 0.279 0.308 0.245 0.104 0 0 0 0 0 0.568
Saha et al. [40] P-value 0.139 0.109 0.144 0.184 0.216 0.240 0.256 0.269 0.293 0.187
Inline graphic 0.196 0.348 0.393 0.367 0.277 0.111 0 0 0.112 0.396

Specifically, the first meta-analysis was from Hughes et al. [38]; it assessed the effectiveness of ovulation suppression agents in the treatment of endometriosis-associated subfertility in improving pregnancy outcomes, including live birth. The number of studies in this meta-analysis was 11. The P-value of the traditional Inline graphic test was 0.221, while the P-value of Inline graphic was lower than 0.1 for Inline graphic 3. The hybrid test yielded a P-value of 0.065, suggesting statistical significance at the 10% level (Fig. 1A and Table 5). This example highlights the hybrid test’s improved sensitivity in detecting inconsistency, making it a valuable tool for enhancing the detection of inconsistency in meta-analyses.

The second meta-analysis was from Gafter-Gvili et al. [39]; it evaluated the effectiveness of antibiotic prophylaxis for preventing bacterial infections in afebrile neutropenic patients following chemotherapy. The number of studies was 14. The P-value of Inline graphic test was relatively small (0.041); the P-value of the hybrid test was also lower than 0.1, indicating statistical significance (Fig. 1B and Table 5). The Inline graphic and Inline graphic with larger Inline graphic values had P-values greater than 0.1, failing to detect significant inconsistency.

The third meta-analysis was from Saha et al. [40], and it investigated the effects of chlorpromazine with atypical or second-generation antipsychotic drugs for the treatment of people with schizophrenia. It consisted of 12 studies. As shown in Table 5, most of the P-values across all tests were above the conventional significance threshold of 0.1. Specifically, the P-values of Inline graphic and Inline graphic were 0.139 and 0.109, respectively. Overall, the results indicate weak evidence of inconsistency across studies in this meta-analysis (Fig. 1C).

As a final remark, the hybrid test evaluates whether any deviation pattern among the family of candidate Inline graphic statistics is convincingly present. Through resampling, it accounts for the correlations among the individual Inline graphic tests and thus provides a conservative overall assessment when no evidence of inconsistency is detected, while being more sensitive when a particular Inline graphic value captures a pattern that others miss. In contrast, the individual Inline graphic tests may not be interpreted collectively as formal hypothesis tests for overall inconsistency, because they do not adjust for multiple testing. A single significant Inline graphic result does not necessarily indicate significant overall inconsistency; rather, these tests are intended to help explore and understand potential patterns of between-study inconsistency.

Discussion

This article introduces several alternative Inline graphic-like test statistics, denoted as Inline graphic, for assessing between-study inconsistency in meta-analyses, along with a hybrid test that combines the strengths of these Inline graphic tests. Our simulation studies and real-world case applications demonstrate that the hybrid test consistently achieves strong performance across a wide range of between-study inconsistency scenarios, particularly when the distribution of true effects deviates from normality.

A major strength of the proposed methods is their ability to address the challenge posed by non-normal between-study distributions. Although the assumption of normality has been a longstanding foundation in meta-analysis methodology and remains widely used in practice, its validity in any given analysis is often uncertain. Even if the assumption is approximately reasonable in many cases, it is difficult to verify whether it holds in a specific meta-analysis or whether its violation materially affects the assessment of inconsistency. This uncertainty largely stems from the fact that testing for between-study normality is inherently limited in power due to the typically small number of studies in meta-analyses.

The hybrid test is particularly well-suited to this dilemma. Without requiring knowledge of the precise form of the between-study distribution, it adaptively integrates information from a family of Inline graphic statistics, each tailored to different patterns of inconsistency. As a result, it maintains relatively high power across diverse conditions, making it a robust tool for evidence synthesis.

Another strength of the hybrid test is its potential adaptability to meta-analyses affected by publication bias or small-study effects. In such cases, the between-study distribution may become skewed due to selective reporting, making the normality assumption unrealistic. Some of our simulation settings included skewed between-study distributions, which mimic scenarios involving publication bias. As discussed in the Introduction, the assessments of publication bias and between-study inconsistency are often interrelated and may influence each other. The proposed approach offers a promising tool to improve inconsistency assessment in the presence of publication bias, although further research is warranted to evaluate its performance under various bias-generating mechanisms. In practice, researchers may first use conventional methods for detecting publication bias, such as funnel plots or Egger’s regression test [41, 42]. If evidence of bias is detected, the traditional Inline graphic test may be suboptimal, and our proposed methods could serve as more robust alternatives.

Despite its strengths, the hybrid test has several limitations. First, like traditional meta-analysis models, it assumes that the within-study variances are known. This assumption may be questionable in certain contexts, such as studies with small sample sizes or rare event probabilities. More precise modeling approaches, such as generalized linear mixed models, can address this limitation [43, 44]. While our proposed test statistics and measures could be extended to these models, doing so would require replacing the sample-based within-study variances with the underlying variance parameters. Achieving this may necessitate additional methodological development, such as incorporating Bayesian hierarchical models [45], which represents a promising avenue for future research.

Second, although the methods are designed to relax the assumption of between-study normality, they still rely on the assumption of within-study normality. This can be justified by asymptotic arguments when studies have large sample sizes, but it may not hold when sample sizes are small.

Third, the choice of Inline graphic values used to construct the Inline graphic tests in our study was limited to integers from 1 to 8 and infinity. In general, Inline graphic = 1 down‑weights outliers and is sensitive to broad, mild dispersion; Inline graphic = 2 corresponds to the conventional Inline graphic statistic, which is most powerful under normality; larger Inline graphic values emphasize tail behavior and isolated anomalies; and Inline graphic = Inline graphic focuses on the maximum standardized deviate. While the set of Inline graphic = 1, 2, …, 8, and Inline graphic performed well in our empirical examples, other Inline graphic values may be more appropriate in certain applications to better capture specific inconsistency patterns. Users are encouraged to examine the results across different Inline graphic values and consider expanding the candidate set if the observed patterns suggest the need for additional sensitivity. Such adjustments may involve a trade-off between potential gains in power and computational burden, as including more Inline graphic values increases the complexity and runtime of the resampling procedure. For example, if the power of tests increases steadily from Inline graphic = 1 to 8, users may explore additional values such as 10, 15, or 20. Conversely, if the power of Inline graphic and Inline graphic is substantially lower than that of tests with smaller Inline graphic values, there may be little benefit in considering larger Inline graphic values. In such cases, users may even consider excluding Inline graphic = 8 and Inline graphic from the candidate set, as incorporating too many Inline graphic values can introduce noise into the hybrid test and potentially reduce its overall power.

In addition, our simulation studies for validating the proposed methods included only meta-analyses with 15 or 30 studies. In practice, many meta-analyses contain fewer studies [46, 47], and the statistical power of all tests, including the proposed ones, would be expected to decrease substantially in such cases. When the number of studies is very small, relying solely on statistical testing to assess heterogeneity may not be valid or reliable because of the limited power [48]. We recommend that researchers also consider the potential sources and implications of heterogeneity from clinical perspectives, such as the presence of effect modifiers or study-level characteristics that may influence effect sizes [49, 50]. Despite our simulations being restricted to 15 and 30 studies, we believe that the results adequately demonstrate how the proposed hybrid approach integrates the strengths of various component tests and achieves competitive performance across diverse scenarios.

Conclusions

We proposed a family of alternative Inline graphic-like tests and a hybrid test to assess between-study inconsistency in meta-analyses, offering flexible and powerful tools beyond the conventional Inline graphic test. Through simulations and real-world applications, we demonstrated that the hybrid test maintains robust performance across a wide range of inconsistency scenarios, particularly when the between-study distribution deviates from normality. These methods provide practical solutions for improving the detection and quantification of inconsistency in meta-analysis, especially when traditional assumptions are questionable. Future work may extend these approaches to accommodate more complex models and further refine the choice of test statistics.

Supplementary Information

12874_2025_2719_MOESM1_ESM.pdf (168.5KB, pdf)

Additional file 1: Derivation of the expectations of Inline graphic statistics under the null hypothesis.

Acknowledgements

ChatGPT 4o was used solely to improve the writing of this manuscript and was not employed for any other purposes in this study.

Authors’ contributions

ZY: methodology, formal analysis, investigation, writing—original draft, visualization; MX: validation, investigation, formal analysis, writing—review & editing, visualization; XX: validation, formal analysis, writing—original draft, visualization; LL: conceptualization, data curation, writing—original draft, writing—review & editing, supervision, funding acquisition.

Funding

This study was supported in part by the US National Institute of Mental Health grant R03 MH128727, the US National Institute on Aging grant R03 AG093555, the US National Library of Medicine grants R21 LM014533 and R01 LM012982, and the Arizona Biomedical Research Centre grant RFGA2023-008–11. The content is solely the responsibility of the authors and does not necessarily represent the official views of the US National Institutes of Health and the Arizona Department of Health Services.

Data availability

The R code for the case studies is available at https://osf.io/4t9wj/. The main R function is also included in our R package “altmeta,” which is available on CRAN.

Declarations

Ethics approval and consent to participate

Ethics approval and consent to participate were not required for this study because it used published data in the existing literature.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Zhiyuan Yu and Mengli Xiao contributed equally.

References

  • 1.Gurevitch J, Koricheva J, Nakagawa S, Stewart G. Meta-analysis and the science of research synthesis. Nature. 2018;555(7695):175–82. [DOI] [PubMed] [Google Scholar]
  • 2.Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, et al. Cochrane handbook for systematic reviews of interventions. Chichester, UK: John Wiley & Sons; 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Higgins JPT. Commentary: Heterogeneity in meta-analysis should be expected and appropriately quantified. Int J Epidemiol. 2008;37(5):1158–60. [DOI] [PubMed] [Google Scholar]
  • 4.Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. A basic introduction to fixed-effect and random-effects models for meta-analysis. Res Synth Methods. 2010;1(2):97–111. [DOI] [PubMed] [Google Scholar]
  • 5.Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects meta-analyses. BMJ. 2011;342:d549. [DOI] [PubMed] [Google Scholar]
  • 6.Biggerstaff BJ, Tweedie RL. Incorporating variability in estimates of heterogeneity in the random effects model in meta-analysis. Stat Med. 1997;16(7):753–68. [DOI] [PubMed] [Google Scholar]
  • 7.Jackson D. The power of the standard test for the presence of heterogeneity in meta-analysis. Stat Med. 2006;25(15):2688–99. [DOI] [PubMed] [Google Scholar]
  • 8.Biggerstaff BJ, Jackson D. The exact distribution of Cochran’s heterogeneity statistic in one-way random effects meta-analysis. Stat Med. 2008;27(29):6093–110. [DOI] [PubMed] [Google Scholar]
  • 9.Lin L, Chu H, Hodges JS. Alternative measures of between-study heterogeneity in meta-analysis: reducing the impact of outlying studies. Biometrics. 2017;73(1):156–66. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Jackson D, White IR. When should meta-analysis avoid making hidden normality assumptions? Biom J. 2018;60(6):1040–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Wang C-C, Lee W-C. Evaluation of the normality assumption in meta-analyses. Am J Epidemiol. 2020;189(3):235–42. [DOI] [PubMed] [Google Scholar]
  • 12.Al Amer FM, Lin L. Prediction performance of earlier studies for later studies in Cochrane reviews. J Eval Clin Pract. 2025;31(4):e70172. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Viechtbauer W, Cheung MWL. Outlier and influence diagnostics for meta-analysis. Res Synth Methods. 2010;1(2):112–25. [DOI] [PubMed] [Google Scholar]
  • 14.Sun X, Briel M, Busse JW, You JJ, Akl EA, Mejza F, et al. The influence of study characteristics on reporting of subgroup analyses in randomised controlled trials: systematic review. BMJ. 2011;342:d1569. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L, Moreno SG. Assessing publication bias in meta-analyses in the presence of between-study heterogeneity. J R Stat Soc Ser A Stat Soc. 2010;173(3):575–91. [Google Scholar]
  • 16.Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Performance of the trim and fill method in the presence of publication bias and between-study heterogeneity. Stat Med. 2007;26(25):4544–62. [DOI] [PubMed] [Google Scholar]
  • 17.Terrin N, Schmid CH, Lau J, Olkin I. Adjusting for publication bias in the presence of heterogeneity. Stat Med. 2003;22(13):2113–26. [DOI] [PubMed] [Google Scholar]
  • 18.Huedo-Medina TB, Sánchez-Meca J, Marín-Martínez F, Botella J. Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Psychol Methods. 2006;11(2):193–206. [DOI] [PubMed] [Google Scholar]
  • 19.Jia P, Lin L, Kwong JSW, Xu C. Many meta-analyses of rare events in the Cochrane database of systematic reviews were underpowered. J Clin Epidemiol. 2021;131:113–22. [DOI] [PubMed] [Google Scholar]
  • 20.Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Guyatt GH, Oxman AD, Kunz R, Woodcock J, Brozek J, Helfand M, et al. GRADE guidelines: 7. Rating the quality of evidence—inconsistency. J Clin Epidemiol. 2011;64(12):1294–302. [DOI] [PubMed] [Google Scholar]
  • 22.Higgins JPT, Jackson D, Barrett JK, Lu G, Ades AE, White IR. Consistency and inconsistency in network meta-analysis: concepts and models for multi-arm studies. Res Synth Methods. 2012;3(2):98–110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.White IR, Barrett JK, Jackson D, Higgins JPT. Consistency and inconsistency in network meta-analysis: model estimation using multivariate meta-regression. Res Synth Methods. 2012;3(2):111–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Viechtbauer W. Hypothesis tests for population heterogeneity in meta-analysis. Br J Math Stat Psychol. 2007;60(1):29–60. [DOI] [PubMed] [Google Scholar]
  • 25.Zhang C, Wang X, Chen M, Wang T. A comparison of hypothesis tests for homogeneity in meta-analysis with focus on rare binary events. Res Synth Methods. 2021;12(4):408–28. [DOI] [PubMed] [Google Scholar]
  • 26.Kulinskaya E, Dollinger MB. An accurate test for homogeneity of odds ratios based on Cochran’s Q-statistic. BMC Med Res Methodol. 2015;15(1):49. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Pan W, Kim J, Zhang Y, Shen X, Wei P. A powerful and adaptive association test for rare variants. Genetics. 2014;197(4):1081–95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Xu G, Lin L, Wei P, Pan W. An adaptive two-sample test for high-dimensional means. Biometrika. 2016;103(3):609–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Lin L. Bias caused by sampling error in meta-analysis with small sample sizes. PLoS ONE. 2018;13(9):e0204056. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Baker R, Jackson D. A new approach to outliers in meta-analysis. Health Care Manag Sci. 2008;11(2):121–31. [DOI] [PubMed] [Google Scholar]
  • 31.Lee KJ, Thompson SG. Flexible parametric models for random-effects distributions. Stat Med. 2008;27(3):418–34. [DOI] [PubMed] [Google Scholar]
  • 32.Lin L. Hybrid test for publication bias in meta-analysis. Stat Methods Med Res. 2020;29(10):2881–99. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Lin L. Comparison of four heterogeneity measures for meta-analysis. J Eval Clin Pract. 2020;26(1):376–84. [DOI] [PubMed] [Google Scholar]
  • 34.Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21(11):1539–58. [DOI] [PubMed] [Google Scholar]
  • 35.Ioannidis JPA, Trikalinos TA. An exploratory test for an excess of significant findings. Clin Trials. 2007;4(3):245–53. [DOI] [PubMed] [Google Scholar]
  • 36.Stanley TD, Doucouliagos H, Ioannidis JPA, Carter EC. Detecting publication selection bias through excess statistical significance. Res Synth Methods. 2021;12(6):776–95. [DOI] [PubMed] [Google Scholar]
  • 37.Hardy RJ, Thompson SG. Detecting and describing heterogeneity in meta-analysis. Stat Med. 1998;17(8):841–56. [DOI] [PubMed] [Google Scholar]
  • 38.Hughes E, Brown J, Collins JJ, Farquhar C, Fedorkow DM, Vanderkerchove P. Ovulation suppression for endometriosis for women with subfertility. Cochrane Database Syst Rev. 2007;3:CD000155. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Gafter-Gvili A, Fraser A, Paul M, Vidal L, Lawrie TA, van de Wetering MD, Kremer LCM, Leibovici L. Antibiotic prophylaxis for bacterial infections in afebrile neutropenic patients following chemotherapy. Cochrane Database Syst Rev. 2012;1:CD004386. [DOI] [PubMed] [Google Scholar]
  • 40.Saha KB, Bo L, Zhao S, Xia J, Sampson S, Zaman RU. Chlorpromazine versus atypical antipsychotic drugs for schizophrenia. Cochrane Database Syst Rev. 2016;4:CD010631. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Contour-enhanced meta-analysis funnel plots help distinguish publication bias from other causes of asymmetry. J Clin Epidemiol. 2008;61(10):991–6. [DOI] [PubMed] [Google Scholar]
  • 42.Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Ju K, Lin L, Chu H, Cheng L-L, Xu C. Laplace approximation, penalized quasi-likelihood, and adaptive Gauss-Hermite quadrature for generalized linear mixed models: towards meta-analysis of binary outcome with sparse data. BMC Med Res Methodol. 2020;20(1):152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Xu C, Furuya-Kanamori L, Lin L. Synthesis of evidence from zero-events studies: a comparison of one-stage framework methods. Res Synth Methods. 2022;13(2):176–89. [DOI] [PubMed] [Google Scholar]
  • 45.Shi L, Chu H, Lin L. A bayesian approach to assessing small-study effects in meta-analysis of a binary outcome with controlled false positive rate. Res Synth Methods. 2020;11(4):535–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Davey J, Turner RM, Clarke MJ, Higgins JPT. Characteristics of meta-analyses and their component studies in the Cochrane Database of Systematic Reviews: a cross-sectional, descriptive analysis. BMC Med Res Methodol. 2011;11:160. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.von Hippel PT. The heterogeneity statistic I2 can be biased in small meta-analyses. BMC Med Res Methodol. 2015;15:35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Bender R, Friede T, Koch A, Kuss O, Schlattmann P, Schwarzer G, et al. Methods for evidence synthesis in the case of very few studies. Res Synth Methods. 2018;9(3):382–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Thompson SG. Why sources of heterogeneity in meta-analysis should be investigated. BMJ. 1994;309(6965):1351–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Sedgwick P. Meta-analyses: heterogeneity and subgroup analysis. BMJ. 2013;346:f4040. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

12874_2025_2719_MOESM1_ESM.pdf (168.5KB, pdf)

Additional file 1: Derivation of the expectations of Inline graphic statistics under the null hypothesis.

Data Availability Statement

The R code for the case studies is available at https://osf.io/4t9wj/. The main R function is also included in our R package “altmeta,” which is available on CRAN.


Articles from BMC Medical Research Methodology are provided here courtesy of BMC

RESOURCES