Abstract
Rare binary events data arise frequently in medical research. Due to lack of statistical power in individual studies involving such data, meta-analysis has become an increasingly important tool for combining results from multiple independent studies. However, traditional meta-analysis methods often report severely biased estimates in such rare-event settings. Moreover, many rely on models assuming a pre-specified direction for variability between control and treatment groups for mathematical convenience, which may be violated in practice. Based on a flexible random-effects model that removes the assumption about the direction, we propose new Bayesian procedures for estimating and testing the overall treatment effect and inter-study heterogeneity. Our Markov chain Monte Carlo algorithm employs Pólya-Gamma augmentation so that all conditionals are known distributions, greatly facilitating computational efficiency. Our simulation shows that the proposed approach generally reports less biased and more stable estimates compared to existing methods. We further illustrate our approach using two real examples, one using rosiglitazone data from 56 studies and the other using stomach ulcers data from 41 studies.
Keywords: Bayesian hierarchical model, binomial normal, data augmentation, log odds ratio, model selection, Pólya-Gamma
1. INTRODUCTION
Meta-analysis is a systematic and quantitative procedure used to integrate information from a set of individual studies.1 This process is widely used in many fields of research such as medicine, education, psychology, criminology, etc.2 For example, pharmaceutical scientists typically use meta-analysis of (rare) binary events to investigate the efficacy or safety of healthcare interventions, yielding more reliable statistical inference. In this paper, we focus on meta-analysis of rare binary events.
Rare binary outcomes are common in studies involving rare diseases or drug safety. Due to low background incidence rates or small sample sizes, researchers often have difficulty in reporting a reliable and generalizable result from each study alone. Thus, meta-analysis has become a routine for analysis of such data and received a lot of attention. Fixed-effect models (FEMs) were widely used in this field, where an identical treatment effect is assumed across all individual studies. Popular FEM-based approaches include the Mantel-Haenszel method (MH),3 the inverse-variance weighted method,4 the empirical logit method,5 and Peto’s approach.6 One challenge in meta-analysis of rare binary events is dealing with studies that contain zero events in one or both arms. Typical quick fixes include using continuity correction with the MH or inverse-variance weighted methods and excluding zero-event studies as suggested by Peto’s approach, which may lead to substantial bias.7,8,9 Instead of using point estimators, Tian et al.10 derived an interval estimation procedure under the FEM framework without relying on large-sample approximation or continuity correction.
Compared to FEMs, random-effects models (REMs) assume existence of between-study variability. REMs are considered to be more plausible than FEMs since experimental conditions, protocols, and patients’ characteristics vary from study to study, and the goal of meta-analysis is perhaps not only getting the identical point estimate of the treatment effect but also generalizing the result to broader scenarios.2 When outcomes are binary, binomial-normal hierarchical models (BN) are the most popular among REMs.11,12,13,14,15 For instance, Houwelingen et al.11 derived likelihood-based estimators of model parameters by assigning a bivariate normal distribution to the logit-transformed probabilities of two groups. Bhaumik et al.12 considered a BN model and proposed method-of-moments estimators of the treatment effect (i.e., the simple average estimator; SA) and the heterogeneity parameter (i.e., the improved Paule and Mandel estimator; IPM). However, SA tends to overestimate the treatment effect and IPM tends to underestimate the heterogeneity, and their bias increases when the background incidence rate becomes lower. Simmonds and Higgins13 slightly modified the BN model in Bhaumik et al.12 by treating the baseline risk for each trial as a fixed effect. Still, both models make the same assumption of larger variability in the treatment group than in the control group. Li and Wang14 proposed a new flexible BN model without assuming a specific direction between the variances of the two groups. That is, it allows the variability in the treatment group to be smaller, larger or equal to that of the control group.
In addition to the frequentist REM approaches mentioned above, Bayesian approaches have also been used to model meta-analysis of rare binary events. In recent decades, researchers have used the Bayesian framework more frequently for these studies,16,17,9,18,19 due in large part to increased computing power and the development of Markov chain Monte Carlo (MCMC) techniques. Smith et al.16 first proposed a fully Bayesian framework and implemented their process via BUGS software.20 Günhan et al.18 adopted the same model and introduced weakly informative priors to the treatment effects. Bai et al.17 and Ren et al.9 used the model from Bhaumik et al.12 and extended it to the Bayesian paradigm by assigning different priors to hyperparameters. More recently, Hong et al.19 conducted a simulation to compare six Bayesian methods and nine frequentist approaches in estimating the log odds ratio (LOR) for the overall treatment effect in the context of meta-analysis of rare binary events, and recommended two estimators: the Heterogeneous Treatment Effect method based on a Binomial-Beta hierarchical framework (HTE-Beta) and a weighted estimator proposed by Shuster et al. (SGSwgt).21 However, they did not examine the performance in estimating inter-study heterogeneity.
The REMs considered in the literature often assume that the variability in the treatment group is larger than that in the control group12,17,9 or that the variability is equal between two arms.16,18 In practice, a model that allows unequal variability between the two groups and assumes no specific direction is less restrictive and more appropriate. Li and Wang14 were the first to propose such a flexible REM. But the authors mainly focused on theoretical aspects of the SA estimator proposed in Bhaumik et al.12 and performance evaluation of existing estimators of the LOR. In this paper, we implement a Bayesian framework based on the REM in Li and Wang14 and develop new estimators of key model parameters. Our method, called FlexB, allows data to determine the direction of variance comparison (i.e., which group has larger variability) and accounts for this uncertainty in parameter estimation, rather than making an assumption about the direction which may be inaccurate. We further propose a Gibbs sampler, in which we integrate the Pólya-Gamma data-augmentation technique into the hierarchical model structure for efficient and reliable posterior sampling.22 Unlike most previous Bayesian works implemented using Stan, JAGS, or BUGS, we implement our MCMC algorithm using Rcpp, which integrates C++ with R, to achieve much faster computation.23
Another important aspect of meta-analysis of rare binary events is hypothesis testing or model selection, which can provide straightforward conclusions to research questions related to efficacy or safety issues. For instance, Bhaumik et al.12 proposed a large-sample test using the SA estimator to test an overall treatment effect. They also derived two tests using Cochran’s Q statistics and parametric bootstrap (PB) techniques for testing between-study heterogeneity of treatment effects. Bai et al.17 employed a Bayesian model-selection approach for simultaneous hypothesis testing rather than testing the overall treatment effect or the inter-study heterogeneity separately. They showed that the deviance information criterion (DIC)24 based on the (marginalized) likelihood function of parameters of main interest is better than several competitors in choosing correct models. Under the flexible REM proposed in Li and Wang,14 we consider a similar DIC approach for simultaneous hypothesis testing of key model parameters. We also adapt the Bayesian information criterion (BIC)25 to our meta-analysis framework to select the best model.
The remainder of this paper is organized as follows. In Section 2, we present the proposed Bayesian hierarchical approach including the BN model allowing flexible group variability, prior specification, and posterior computation via a Gibbs sampler based on the Pólya-Gamma data-augmentation technique.22 In Section 3, we consider three model selection criteria for the purpose of hypothesis testing, including BIC and DIC. In Section 4, we conduct simulation studies to evaluate the performance of the proposed FlexB and compare it with existing methods in estimation and testing of the treatment effect and the inter-study heterogeneity. Section 5 illustrates the proposed method using two data examples. Section 6 ends the paper with conclusions and discussions. Our method is implemented in R package “metaFlexB” and is available at https://github.com/chriszhangm/metaFlexB.
2. FLEXB: A BAYESIAN HIERARCHICAL APPROACH
2.1. The flexible REM
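Let I be the number of independent studies involved in a meta-analysis, and let xi1 (xi2) be the number of rare events out of ni1 (ni2) cases in the control (treatment) group of the ith study. Each case in the control (treatment) group has probability pi1 (pi2) of having the event of interest. Let ϕi1 (ϕi2) denote the logit-transformed pi1 (pi2), namely $\phi_{ij} = \mathrm{logit}(p_{ij}) = \log\{p_{ij}/(1-p_{ij})\}$ for j = 1, 2; and let θi be the treatment effect for the ith study on the log-odds scale, where $\theta_i = \phi_{i2} - \phi_{i1}$. Then ϕi1 (ϕi2) can be expressed as a linear combination of random components θi and μi, each following a Gaussian distribution with all μis and θis assumed to be mutually independent. Li and Wang14 formulated the following flexible binomial-normal hierarchical random-effects model:
$$
\begin{aligned}
x_{i1} &\sim \mathrm{Bin}(n_{i1}, p_{i1}), & x_{i2} &\sim \mathrm{Bin}(n_{i2}, p_{i2}),\\
\mathrm{logit}(p_{i1}) &= \phi_{i1} = \mu_i - \omega\theta_i, & \mathrm{logit}(p_{i2}) &= \phi_{i2} = \mu_i + (1-\omega)\theta_i,\\
\mu_i &\sim N(\mu_0, \sigma^2), & \theta_i &\sim N(\theta_0, \tau^2), \quad i = 1, \dots, I,
\end{aligned}
\tag{1}
$$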
where the unknown parameter ω is a constant in the interval [0, 1], introduced to control the direction of variability in the two groups. That is, for τ2 > 0, if ω ∈ [0, 0.5), then the variance of ϕi2 is greater than that of ϕi1; if ω = 0.5, the variance of ϕi2 is equal to that of ϕi1; and if ω ∈ (0.5, 1], the variance of ϕi2 is smaller than that of ϕi1. More specifically, when ω = 0, (1) would reduce to the model used in Bhaumik et al.12 and Bai et al.17 with
$$\mathrm{logit}(p_{i1}) = \mu_i, \qquad \mathrm{logit}(p_{i2}) = \mu_i + \theta_i,$$
where μi becomes the baseline risk in the control group; when ω = 0.5, (1) yields the model in Smith et al.16 and Günhan et al.18 with
$$\mathrm{logit}(p_{i1}) = \mu_i - \theta_i/2, \qquad \mathrm{logit}(p_{i2}) = \mu_i + \theta_i/2,$$
which assumes equal variability in both groups. For τ2 = 0, the variance of ϕi2 is the same as that of ϕi1, which equals σ2, regardless of the value of ω.
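The role of ω can be checked numerically. The following is a minimal sketch, assuming the parameterization ϕi1 = μi − ωθi and ϕi2 = μi + (1 − ω)θi with independent μi and θi, which reproduces the three cases described above; the function name `logit_variances` is introduced here for illustration only.

```python
def logit_variances(sigma2, tau2, omega):
    """Variances of phi_i1 (control) and phi_i2 (treatment), assuming
    phi_i1 = mu_i - omega * theta_i and phi_i2 = mu_i + (1 - omega) * theta_i
    with independent mu_i ~ N(mu0, sigma2) and theta_i ~ N(theta0, tau2)."""
    var_control = sigma2 + omega ** 2 * tau2
    var_treatment = sigma2 + (1.0 - omega) ** 2 * tau2
    return var_control, var_treatment


# Illustrative values only (they match the simulation settings sigma2 = 0.5, tau2 = 0.8).
for omega in (0.0, 0.5, 1.0):
    v1, v2 = logit_variances(sigma2=0.5, tau2=0.8, omega=omega)
    print(f"omega={omega}: Var(phi_i1)={v1:.2f}, Var(phi_i2)={v2:.2f}")
```

Under this assumed parameterization, the two variances differ only through the ω² and (1 − ω)² multipliers of τ2, so the comparison flips at ω = 0.5 and vanishes when τ2 = 0.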
We adopt a graphical approach to visualize the model (1),26 where each node denotes either observed data or model parameters (Figure 1). Double rectangles denote fixed constants (ni1, ni2) that are determined before each study i; single rectangles represent observed event counts (xi1, xi2); circles denote unknown model parameters at various levels, including the probabilities of the rare event in the control and treatment groups pi1 and pi2, their logits ϕi1 and ϕi2, (μi, θi, ω) used to linearly model the logits, and their distributional parameters (μ0, σ2, θ0, τ2) as well as (λi1, λi2, κi) that are introduced for computational ease; dashed rectangles represent the sampling process of using a data-augmentation technique that will be described in Section 2.3.
FIGURE 1.

Graphical model for the flexible random-effects model; green color indicates those quantities introduced by the Pólya-Gamma data-augmentation technique; pink color indicates key parameters in previous REMs; yellow color indicates the newly added key parameter to reflect the variability direction in treatment and control groups.
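For the model depicted in Figure 1, we marginalize over study-specific parameters μis and θis, and consider the set of parameters Θ = {ϕ1, ϕ2, μ0, θ0, σ2, τ2, ω} in our Bayesian analysis, where ϕ1 = {ϕ11, …, ϕI1} and ϕ2 = {ϕ12, …, ϕI2}. Let $X = \{(x_{i1}, x_{i2})\}_{i=1}^{I}$ denote the data we observe. Then the joint distribution of (X, Θ) can be written as
$$
p(X, \Theta) = p(X \mid \phi_1, \phi_2)\, p(\phi_1, \phi_2 \mid \mu_0, \theta_0, \sigma^2, \tau^2, \omega)\, p(\mu_0, \theta_0, \sigma^2, \tau^2, \omega),
\tag{2}
$$
where
$$
p(X \mid \phi_1, \phi_2) = \prod_{i=1}^{I}\prod_{j=1}^{2} \binom{n_{ij}}{x_{ij}} \frac{\exp(\phi_{ij} x_{ij})}{\{1+\exp(\phi_{ij})\}^{n_{ij}}}
$$
and
$$
p(\phi_1, \phi_2 \mid \mu_0, \theta_0, \sigma^2, \tau^2, \omega) = \prod_{i=1}^{I} \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(\phi_i - m)^\top \Sigma^{-1} (\phi_i - m)\right\}.
\tag{3}
$$
Here, $\phi_i = (\phi_{i1}, \phi_{i2})^\top$, $m = (\mu_0 - \omega\theta_0,\ \mu_0 + (1-\omega)\theta_0)^\top$, and
$$
\Sigma = \begin{pmatrix} \sigma^2 + \omega^2\tau^2 & \sigma^2 - \omega(1-\omega)\tau^2 \\ \sigma^2 - \omega(1-\omega)\tau^2 & \sigma^2 + (1-\omega)^2\tau^2 \end{pmatrix}.
$$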
2.2. Prior specification
As usual, the hyper-parameters in the top layer of the graphical model in Figure 1 are assumed to be a priori independent; that is, p(μ0, θ0, σ2, τ2, ω) = p(μ0) p(θ0) p(σ2) p(τ2) p(ω). We consider non-informative uniform priors for μ0 and θ0: μ0 ~ U(Lμ, Uμ), θ0 ~ U(Lθ, Uθ), where upper and lower bounds contain virtually all plausible values of μ0 and θ0. To define these ranges, we first use the REM used in Bhaumik et al.12 to get rough estimates and for all I studies, where , . Next, we define , , , and , where we assign c = 5 as in Bai et al.17 so that the priors for μ0 and θ0 are very conservative. Next, the conditional conjugate prior IG (a, a), an inverse-gamma distribution, is considered for both σ2 and τ2, where we assign a small value such as 0.01 to a, to reflect our lack of information about the variance terms.
As to the variability direction parameter ω, we consider a discrete uniform prior distribution defined on the sample space Ω = {0, 0.5, 1}, i.e. P (ω = d) = 1∕3 where d ∈ Ω. The reason for using this discrete uniform distribution instead of a continuous uniform distribution over the interval [0, 1] is two-fold. Firstly, this discrete prior builds a strong connection to well-established models in the literature, increasing the interpretability of our analysis results. As mentioned in Section 2.1, ω = 0 corresponds to the model considered in Bhaumik et al. and Bai et al.,12,17 while ω = 0.5 corresponds to the model in Smith et al. and Günhan et al.16,18 Using the flexible model in (1), combined with this prior, we can rely on a data-driven approach to determine which (if any) of the two models fits the data well. Secondly, as will be shown in Section S5 in Supplementary Material, using the discrete prior leads to similar results to those from using the continuous prior, even when data are actually generated with ω from U(0, 1).
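Our vague and diffuse priors above, combined with the assumed a priori independence, lead to the following joint prior distribution:
$$
p(\mu_0, \theta_0, \sigma^2, \tau^2, \omega) \propto \mathbf{1}(L_\mu \le \mu_0 \le U_\mu)\, \mathbf{1}(L_\theta \le \theta_0 \le U_\theta)\, (\sigma^2)^{-(a+1)} e^{-a/\sigma^2}\, (\tau^2)^{-(a+1)} e^{-a/\tau^2}\, \mathbf{1}(\omega \in \Omega).
\tag{4}
$$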
2.3. Gibbs sampling based on Pólya-Gamma data-augmentation
Our posterior computation employs Gibbs sampling to obtain posterior samples, where, to our knowledge, the Pólya-Gamma method22 is adapted to the context of meta-analysis of (rare) binary events for the first time.
The joint posterior distribution is given by p(Θ|X) ∝ p (X, Θ), where Θ = {ϕ1, ϕ2, σ2, τ2, ω, μ0, θ0}. To simplify the notations, we use Θ∕θ to denote the parameter set that includes all the parameters except for θ. It is easy to verify that the full conditional posterior distributions derived from p(Θ|X) for the global mean and variance terms are all known distributions (i.e., truncated normal or inverse gamma), shown below:
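One way to write these conditionals is sketched below. It is derived from the priors in Section 2.2 under the assumption that model (1) is parameterized as ϕi1 = μi − ωθi and ϕi2 = μi + (1 − ω)θi, so that, given ω, each pair (ϕi1, ϕi2) determines μi = (1 − ω)ϕi1 + ωϕi2 and θi = ϕi2 − ϕi1 through a linear map with unit Jacobian. The symbols $\bar{\mu}$, $\bar{\theta}$, and $N_{[L,U]}$ (a normal distribution truncated to [L, U]) are notation introduced here for illustration.

```latex
\begin{aligned}
\mu_0 \mid X, \Theta/\mu_0 &\sim N_{[L_\mu,\,U_\mu]}\!\left(\bar{\mu},\ \sigma^2/I\right),
  & \bar{\mu} &= \frac{1}{I}\sum_{i=1}^{I}\mu_i,\\
\theta_0 \mid X, \Theta/\theta_0 &\sim N_{[L_\theta,\,U_\theta]}\!\left(\bar{\theta},\ \tau^2/I\right),
  & \bar{\theta} &= \frac{1}{I}\sum_{i=1}^{I}\theta_i,\\
\sigma^2 \mid X, \Theta/\sigma^2 &\sim \mathrm{IG}\!\left(a + \frac{I}{2},\ a + \frac{1}{2}\sum_{i=1}^{I}(\mu_i-\mu_0)^2\right),\\
\tau^2 \mid X, \Theta/\tau^2 &\sim \mathrm{IG}\!\left(a + \frac{I}{2},\ a + \frac{1}{2}\sum_{i=1}^{I}(\theta_i-\theta_0)^2\right).
\end{aligned}
```

The truncation comes from the bounded uniform priors on μ0 and θ0, and the inverse-gamma forms follow from the conditional conjugacy of the IG(a, a) priors.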
To sample from p(ϕi1, ϕi2|X, Θ∕{ϕi1, ϕi2}) for each study i, Bai et al.17 utilized the rejection sampling algorithm by specifying N(μ0, σ2) and N(θ0, τ2) as the proposal distributions for μi and θi, respectively. However, this process is not satisfactory in either computational efficiency or estimation stability, as will be shown later. Instead, we use a data-augmentation strategy based on Pólya-Gamma latent variables, proposed for logistic models by Polson et al.22 This strategy avoids the need for rejection/Metropolis–Hastings sampling, numerical integration, or analytical approximation, provided that one can easily sample from a Pólya-Gamma distribution. Polson et al.22 further developed an efficient sampler for the Pólya-Gamma distribution and showed the effectiveness of the Pólya-Gamma method in various scenarios.
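Let λi1, λi2 be the auxiliary parameters with Pólya-Gamma (PG) distributions, namely
$$\lambda_{ij} \mid \phi_{ij} \sim \mathrm{PG}(n_{ij}, \phi_{ij}), \quad j = 1, 2,$$
and let $\Lambda_i = \mathrm{diag}(\lambda_{i1}, \lambda_{i2})$ and $\kappa_i = (x_{i1} - n_{i1}/2,\ x_{i2} - n_{i2}/2)^\top$. Then we can derive the following conditional posterior for (ϕi1, ϕi2), given the introduced Λi and other parameters:
$$(\phi_{i1}, \phi_{i2})^\top \mid X, \Lambda_i, \Theta/\{\phi_{i1}, \phi_{i2}\} \sim N_2(m_i, V_i),$$
where
$$V_i = (\Lambda_i + \Sigma^{-1})^{-1}, \qquad m_i = V_i(\kappa_i + \Sigma^{-1} m),$$
with m and Σ denoting the mean vector and covariance matrix of the bivariate normal distribution in (3).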
We note that the Pólya-Gamma mixture framework introduced above requires no tuning and is easy to implement.
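To make the augmentation step concrete, the following is a minimal Python sketch of one Gibbs update of (ϕi1, ϕi2) for a single study. It is not the authors' Rcpp implementation. It assumes the standard Pólya-Gamma result for binomial likelihoods: λij | ϕij ~ PG(nij, ϕij), κij = xij − nij/2, and a bivariate normal conditional with precision diag(λi1, λi2) + Σ⁻¹. The PG draw uses a truncated version of the sum-of-gammas representation in Polson et al., which is only an approximation and stands in for their exact sampler. The names `sample_pg` and `update_phi` and the truncation level `n_terms` are hypothetical.

```python
import math
import random


def sample_pg(b, c, rng, n_terms=200):
    """Approximate draw from PG(b, c) using the first n_terms of the series
    (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)), g_k ~ Gamma(b, 1).
    Truncation is an assumption of this sketch, not an exact sampler."""
    total = 0.0
    for k in range(1, n_terms + 1):
        g = rng.gammavariate(b, 1.0)  # Gamma with shape b and scale 1
        total += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)


def update_phi(x, n, phi, m, S, rng):
    """One Gibbs update of (phi_i1, phi_i2) for one study.
    x, n : event counts and group sizes, ordered (control, treatment)
    phi  : current logits
    m, S : mean vector and symmetric 2x2 covariance of the bivariate normal prior"""
    # Step 1: auxiliary variables lambda_ij | phi_ij ~ PG(n_ij, phi_ij).
    lam = [sample_pg(n[j], phi[j], rng) for j in range(2)]
    kappa = [x[j] - n[j] / 2.0 for j in range(2)]
    # Step 2: prior precision P = S^{-1} (closed form for a 2x2 matrix).
    det_s = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    p00, p01, p11 = S[1][1] / det_s, -S[0][1] / det_s, S[0][0] / det_s
    # Posterior precision Q = diag(lam) + P and linear term h = kappa + P m.
    q00, q01, q11 = p00 + lam[0], p01, p11 + lam[1]
    h0 = kappa[0] + p00 * m[0] + p01 * m[1]
    h1 = kappa[1] + p01 * m[0] + p11 * m[1]
    # Posterior covariance V = Q^{-1} and posterior mean V h.
    det_q = q00 * q11 - q01 * q01
    v00, v01, v11 = q11 / det_q, -q01 / det_q, q00 / det_q
    mean0 = v00 * h0 + v01 * h1
    mean1 = v01 * h0 + v11 * h1
    # Step 3: draw from N2(mean, V) through the Cholesky factor of V.
    l00 = math.sqrt(v00)
    l10 = v01 / l00
    l11 = math.sqrt(v11 - l10 * l10)
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [mean0 + l00 * z0, mean1 + l10 * z0 + l11 * z1]
```

Every step in this sketch is a direct draw from a known distribution, so no proposal distribution or acceptance step has to be tuned. In practice an exact PG sampler would replace the truncated series.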
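Lastly, for every iteration, we update the variance direction parameter ω by calculating the discrete posterior probabilities:
$$
P(\omega = d \mid X, \Theta/\omega) \propto \prod_{i=1}^{I} N_2\big((\phi_{i1}, \phi_{i2})^\top;\ m(d),\ \Sigma(d)\big), \quad d \in \Omega,
\tag{5}
$$
where m(d) and Σ(d) denote the mean vector and covariance matrix in (3) evaluated at ω = d. Because the prior on ω is uniform over Ω, (5) is proportional to the density function of the bivariate normal distribution in (3). Thus, ω can be sampled based on the normalized updated probabilities in (5).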
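The ω update can be sketched as follows. This is a minimal illustration, assuming that under model (1) the pair (ϕi1, ϕi2) has mean (μ0 − ωθ0, μ0 + (1 − ω)θ0), variances σ2 + ω²τ2 and σ2 + (1 − ω)²τ2, and covariance σ2 − ω(1 − ω)τ2; the function names are hypothetical. Weights are accumulated on the log scale and normalized with a log-sum-exp step to avoid underflow when I is large.

```python
import math
import random


def prior_moments(mu0, theta0, sigma2, tau2, omega):
    """Mean and covariance (s11, s12, s22) of (phi_i1, phi_i2) for a given omega."""
    mean = (mu0 - omega * theta0, mu0 + (1.0 - omega) * theta0)
    s11 = sigma2 + omega ** 2 * tau2
    s22 = sigma2 + (1.0 - omega) ** 2 * tau2
    s12 = sigma2 - omega * (1.0 - omega) * tau2
    return mean, (s11, s12, s22)


def log_bvn(phi, mean, cov):
    """Log density of a bivariate normal at phi (requires sigma2 > 0 and tau2 > 0)."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 * s12
    d1, d2 = phi[0] - mean[0], phi[1] - mean[1]
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def omega_probabilities(phis, mu0, theta0, sigma2, tau2, support=(0.0, 0.5, 1.0)):
    """Normalized conditional probabilities of omega over its discrete support.
    The uniform prior 1/3 cancels in the normalization."""
    log_w = []
    for d in support:
        mean, cov = prior_moments(mu0, theta0, sigma2, tau2, d)
        log_w.append(sum(log_bvn(phi, mean, cov) for phi in phis))
    top = max(log_w)  # log-sum-exp shift
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def sample_omega(probs, rng, support=(0.0, 0.5, 1.0)):
    """Draw omega from the discrete distribution given by probs."""
    u, acc = rng.random(), 0.0
    for d, p in zip(support, probs):
        acc += p
        if u < acc:
            return d
    return support[-1]
```

A typical call inside a sampler would be `omega = sample_omega(omega_probabilities(phis, mu0, theta0, sigma2, tau2), rng)`, where `phis` is the current list of (ϕi1, ϕi2) pairs.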
After a burn-in period to achieve convergence, we use posterior sample means, $\hat{\theta}_0$ and $\hat{\mu}_0$, to estimate θ0 and μ0. As in Bai et al.,17 we use posterior medians, $\hat{\tau}^2$ and $\hat{\sigma}^2$, to estimate τ2 and σ2 because their (marginal) posterior distributions are heavily skewed. Finally, we use the posterior mode to estimate ω due to its discrete posterior.
3. HYPOTHESIS TESTING VIA MODEL SELECTION
We now pivot the discussion to hypothesis testing and outline how this is implemented in our FlexB approach. Under the Bayesian paradigm, hypothesis testing can be viewed as an analogue to model selection. Here, we consider three different approaches to model selection using (a) Akaike information criterion (AIC), (b) Bayesian information criterion (BIC), and (c) deviance information criterion (DIC). Based on the flexible BN model (1), we focus on simultaneous testing of the overall treatment effect θ0, the heterogeneity parameter τ2, and the variance direction parameter ω, which corresponds to selection between the following eight candidate models:
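ℳ1: θ0 = 0, τ2 = 0;
ℳ2: θ0 = 0, τ2 ≠ 0, ω = 0;
ℳ3: θ0 = 0, τ2 ≠ 0, ω = 0.5;
ℳ4: θ0 = 0, τ2 ≠ 0, ω = 1;
ℳ5: θ0 ≠ 0, τ2 = 0;
ℳ6: θ0 ≠ 0, τ2 ≠ 0, ω = 0;
ℳ7: θ0 ≠ 0, τ2 ≠ 0, ω = 0.5;
ℳ8: θ0 ≠ 0, τ2 ≠ 0, ω = 1;
where for ℳ1, ω is no longer needed as θi ≡ 0; for ℳ5 (θ0 ≠ 0, τ2 = 0), the values ω = 0, 0.5 and 1 lead to three models that cannot be distinguished from one another, so they are in fact one model under different parameterizations. This simultaneous testing approach not only allows us to examine the existence of the overall treatment effect and/or between-study heterogeneity, but also indicates whether the popular existing BN models considered in the literature12,16,18 fit the data and if so, which model.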
Each of ℳ1− ℳ8 is a special case of (1). In what follows, we detail how each information criterion is computed for each candidate model.
3.1. BIC
Our FlexB model (1) can be written in the form of the following generalized linear mixed model (GLMM):
where j = 1 for the control group and j = 2 for the treatment group. Let and . Then the marginal likelihood ℒ for the above GLMM, as a function of (μ0, θ0, σ2, τ2, ω), is given by
| (6) |
However, there is no closed-form solution to the integration in the likelihood function ℒ, which can only be evaluated numerically. For instance, Laplace approximation is recommended by Wolfinger27 and McCullagh and Nelder28 to solve this issue. Then by maximizing the approximate likelihood, we obtain the maximized value $\hat{\mathcal{L}}$, and BIC can be defined as $\mathrm{BIC} = -2\ln\hat{\mathcal{L}} + k\ln I$, where k is the number of estimated parameters and I is the number of component studies. Clearly, ℒ would be defined differently for candidate models ℳ1 − ℳ8. For example, for ℳ2: θ0 = 0, τ2 ≠ 0, ω = 0, the likelihood in (6) becomes a function of {μ0, σ2, τ2} only, and $\hat{\mathcal{L}}_2$ will be the (approximate) likelihood evaluated at the maximum likelihood estimates of {μ0, σ2, τ2}. Then, BIC for ℳ2 can be defined as $-2\ln\hat{\mathcal{L}}_2 + 3\ln I$. After getting all BIC values for ℳ1 − ℳ8, the model with the smallest value is deemed to be the best.
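The selection rule can be sketched in a few lines of Python. This assumes the standard form BIC = −2 ln ℒ̂ + k ln I with the number of studies I as the sample size; the maximized log-likelihoods must come from a separate GLMM fit, and the numbers in the example are placeholders, not results from the paper.

```python
import math


def bic(max_log_lik, k, n_studies):
    """BIC = -2 * (maximized log-likelihood) + k * ln(I)."""
    return -2.0 * max_log_lik + k * math.log(n_studies)


def best_model(fits, n_studies):
    """fits maps a model label to (maximized log-likelihood, number of parameters).
    Returns the label with the smallest BIC and the full table of BIC values."""
    table = {name: bic(ll, k, n_studies) for name, (ll, k) in fits.items()}
    return min(table, key=table.get), table


# Placeholder inputs for illustration only.
example_fits = {"M1": (-120.4, 2), "M2": (-115.0, 3), "M6": (-114.6, 4)}
choice, bic_table = best_model(example_fits, n_studies=20)
print(choice, bic_table)
```

Because the penalty grows with ln I, an extra parameter has to improve the maximized log-likelihood by more than (ln I)/2 to be preferred.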
3.2. DIC
Spiegelhalter et al.24 proposed DIC to select the best model from the perspective of prediction given the observed data. For model g, the Bayesian deviance is defined as D (β) = −2ln (p(X|β, ℳg)) + 2ln (p (X)), where β is a set of parameters of interest, X is the data, and p (X) is a standardizing term that only relates to the observed data. The posterior mean of the deviance and the effective number of parameters pD are written as $\bar{D} = E_{\beta \mid X}[D(\beta)]$ and $p_D = \bar{D} - D(\bar{\beta})$, respectively, where $\bar{\beta}$ is the posterior mean of β. Then, $\mathrm{DIC} = \bar{D} + p_D$, and the model with the smallest DIC will be selected as the best one.
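Given posterior draws, the DIC calculation reduces to two averages. The sketch below assumes the standard Spiegelhalter definitions, with pD equal to the posterior mean deviance minus the deviance at the posterior mean, and drops the standardizing term 2 ln p(X) because it is constant across models. The toy normal-mean deviance is purely illustrative and is not the FlexB likelihood; `dic` and `toy_deviance` are hypothetical names.

```python
import math
import random


def dic(deviance, samples):
    """Compute (DIC, p_D) from posterior draws.
    deviance : function mapping a parameter tuple to -2 * log-likelihood
    samples  : list of parameter tuples drawn from the posterior"""
    n = len(samples)
    d_bar = sum(deviance(s) for s in samples) / n          # posterior mean deviance
    dim = len(samples[0])
    beta_bar = tuple(sum(s[k] for s in samples) / n for k in range(dim))
    p_d = d_bar - deviance(beta_bar)                        # effective number of parameters
    return d_bar + p_d, p_d


# Toy check: x_1..x_n ~ N(beta, 1) with a flat prior, so beta | x ~ N(xbar, 1/n).
rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(30)]
xbar = sum(data) / len(data)


def toy_deviance(beta):
    return sum((x - beta[0]) ** 2 for x in data) + len(data) * math.log(2.0 * math.pi)


draws = [(rng.gauss(xbar, 1.0 / math.sqrt(len(data))),) for _ in range(5000)]
dic_value, p_d = dic(toy_deviance, draws)
print(round(dic_value, 2), round(p_d, 2))
```

In the toy model there is one free parameter, so pD should be close to 1, which gives a quick sanity check before plugging in a more complex likelihood.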
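Since we adopt a hierarchical framework, as shown in Figure 1, the likelihood function p(X|β, ℳg) can be defined differently according to the specific definition of β.24 For example, under ℳ2: θ0 = 0, τ2 ≠ 0, ω = 0, the marginal distribution of X can be written as
$$p(X \mid \mathcal{M}_2) = \int p(X \mid \phi_1, \phi_2, \mathcal{M}_2)\, p(\phi_1, \phi_2 \mid \mathcal{M}_2)\, d\phi_1\, d\phi_2,$$
or alternatively
$$p(X \mid \mathcal{M}_2) = \int p(X \mid \psi, \mathcal{M}_2)\, p(\psi \mid \mathcal{M}_2)\, d\psi,$$
where ψ = {μ0, θ0, σ2, τ2, ω}. Therefore, the likelihood p (X|β, ℳg) in DIC can be either p (X|ϕ1, ϕ2, ℳ2) or p (X|ψ, ℳ2), where β can be (ϕ1, ϕ2) or ψ accordingly, and
$$p(X \mid \psi, \mathcal{M}_2) = \int p(X \mid \phi_1, \phi_2)\, p(\phi_1, \phi_2 \mid \psi, \mathcal{M}_2)\, d\phi_1\, d\phi_2.$$
In this paper, we use p(X|ψ, ℳg) to compute DIC for model g since we focus on inference about θ0 and τ2. Our choice is also recommended by Bai et al.,17 who found better performance when using p(X|ψ, ℳg) rather than p(X|ϕ1, ϕ2, ℳg).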
4. SIMULATION
We conduct simulation studies to evaluate the performance of the proposed FlexB and compare FlexB with other popular methods for meta-analysis of rare binary events. The methods are evaluated in estimating and testing the overall treatment effect θ0 and heterogeneity parameter τ2, using four metrics: bias, mean squared error (MSE), coverage probability, and interval width. Specifically, for a parameter of interest (say β, where β is either θ0 or τ2 here), we define $\mathrm{Bias} = \frac{1}{J}\sum_{j=1}^{J}(\hat{\beta}_j - \beta)$ and $\mathrm{MSE} = \frac{1}{J}\sum_{j=1}^{J}(\hat{\beta}_j - \beta)^2$ for each method, where $\hat{\beta}_j$ is the corresponding estimate of β at the jth replication among J total replicates; and the coverage probability and interval width are computed using 95% confidence intervals (for frequentist methods) or equal-tail credible intervals (for Bayesian methods). Throughout our simulation and real data analyses later in Section 5, we apply the default continuity correction factor 0.5 to all frequentist methods unless otherwise specified. By contrast, for Bayesian methods considered, we do not apply any continuity correction or remove studies with zero events as they can handle such studies automatically via incorporation of prior information into data analysis.
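The four metrics can be computed from replicate-level output as in the sketch below. It assumes bias and MSE are plain averages of the estimation error and squared error over the J replicates, coverage is the fraction of intervals containing the true value, and width is the average interval length; the function name `summarize` and the example numbers are illustrative.

```python
def summarize(estimates, lowers, uppers, truth):
    """Bias, MSE, coverage probability and mean interval width over J replicates.
    estimates, lowers, uppers are equal-length lists of point estimates and
    interval endpoints; truth is the true parameter value."""
    J = len(estimates)
    bias = sum(e - truth for e in estimates) / J
    mse = sum((e - truth) ** 2 for e in estimates) / J
    coverage = sum(1 for lo, hi in zip(lowers, uppers) if lo <= truth <= hi) / J
    width = sum(hi - lo for lo, hi in zip(lowers, uppers)) / J
    return {"bias": bias, "mse": mse, "coverage": coverage, "width": width}


# Made-up replicate output for illustration.
print(summarize([0.1, -0.1, 0.3], [-0.4, -0.6, 0.1], [0.6, 0.4, 0.5], truth=0.0))
```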
4.1. Performance on estimation of θ0
We focus on assessing the estimates of θ0 by varying (a) the value of θ0 and (b) the value of μ0 in the BN model (1) under various simulated scenarios. Let I = 10, 20, 50, 80 represent four different sizes of meta-analysis (we defer a discussion about the choice of I to Section 6); let ω = 0, 0.5, 1 represent the scenario that the variability in the treatment group is larger than, equal to, or smaller than that in the control group, respectively. For (a), we set θ0 ∈ {−1, −0.8, …, 1}, μ0 = −5, σ2 = 0.5, and τ2 = 0.8, which correspond to (very) rare binary events in general. For (b), we let μ0 ∈ {−5, −4.5, …, 0}, θ0 = 0, σ2 = 0.5 and τ2 = 0.8 to evaluate the performance of the estimates for both rare and prevalent binary events. Using each combination of all parameters, we simulate probabilities of the event for both groups, pi1 and pi2, for study i = 1, 2, …, I. Then, for every study, we randomly draw integers from 50 to 1000 as the number of subjects in the treatment group ni2 and in the control group ni1. Lastly, for each study i, we randomly draw the number of events in the control group xi1 and that in the treatment group xi2 from Bin (ni1, pi1) and Bin (ni2, pi2), respectively. We simulate J = 300 replicates for each combination of the parameters for (a) and (b), and we report results from the proposed FlexB using the four criteria mentioned above.
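The data-generating steps described above can be sketched as follows. This is a minimal illustration, assuming model (1) has the form ϕi1 = μi − ωθi and ϕi2 = μi + (1 − ω)θi and that group sizes are drawn uniformly from the integers 50 to 1000; it is not the authors' simulation code, and `simulate_meta` is a hypothetical name. Binomial counts are generated as sums of Bernoulli draws so that only the standard library is needed.

```python
import math
import random


def simulate_meta(I, mu0, theta0, sigma2, tau2, omega, rng):
    """Simulate one meta-analysis data set of I studies.
    Returns a list of (x_i1, n_i1, x_i2, n_i2) tuples (control first)."""
    studies = []
    for _ in range(I):
        mu_i = rng.gauss(mu0, math.sqrt(sigma2))
        theta_i = rng.gauss(theta0, math.sqrt(tau2))
        phi1 = mu_i - omega * theta_i            # control-group logit (assumed form)
        phi2 = mu_i + (1.0 - omega) * theta_i    # treatment-group logit (assumed form)
        p1 = 1.0 / (1.0 + math.exp(-phi1))
        p2 = 1.0 / (1.0 + math.exp(-phi2))
        n1 = rng.randint(50, 1000)               # inclusive integer range
        n2 = rng.randint(50, 1000)
        x1 = sum(1 for _ in range(n1) if rng.random() < p1)  # Bin(n1, p1)
        x2 = sum(1 for _ in range(n2) if rng.random() < p2)  # Bin(n2, p2)
        studies.append((x1, n1, x2, n2))
    return studies


rng = random.Random(2024)
data = simulate_meta(I=20, mu0=-5.0, theta0=0.0, sigma2=0.5, tau2=0.8, omega=0.5, rng=rng)
zero_event_arms = sum((x1 == 0) + (x2 == 0) for x1, _, x2, _ in data)
print(len(data), zero_event_arms)
```

With μ0 = −5 the baseline event probability is below 1% for a typical study, so arms with zero events occur frequently, which is the setting the comparison targets.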
For point estimation of the overall treatment effect θ0, we employ ten existing methods for the purpose of comparison. The first eight are frequentist methods: Mantel & Haenszel (MH),3 DerSimonian & Laird (DSL),5 empirical logit (EL),5 GLMM29,30 based on the generalized linear mixed model in Bhaumik et al.,12 median unbiased estimator (MUE),31 simple (unweighted) average (SA) with the default continuity correction factor 0.5,12 Shuster, Guo and Skyler’s weighted estimator (SGSwgt),21 and simple (unweighted) average with continuity correction 0.25 (SA25), included due to its good performance under rare settings reported in Li and Wang.14 The last two are Bayesian methods: BAYES17 and HTE-Beta.19 For interval estimation of θ0, we compare the proposed FlexB with four other approaches including Hartung-Knapp/Sidik-Jonkman (HKSJ) method32,33,34 recommended in Weber et al.,35 Wald confidence intervals recommended in Pateras et al.36 using three different heterogeneity estimators: Hartung-Makambi (HM),37 positive Sidik-Jonkman (SJ)34 and improved Paule & Mandel (IPM).12
Figure 2 shows bias results for point estimates of the overall treatment effect θ0 from the eleven methods, as mentioned above, when θ0 varies along the horizontal axis in each subplot, ω varies across rows (the top row of four subplots for ω = 0, middle row for ω = 0.5, and bottom row for ω = 1), and I varies across columns (left column of three subplots for I = 10, and so on). In general, FlexB has the lowest average bias across all scenarios. The top panel (ω = 0) shows that GLMM, BAYES, and FlexB perform well, yielding nearly unbiased estimates. Among the other methods, MH, DSL, EL, SGSwgt and HTE-Beta always overestimate θ0, while SA25, SA and MUE only overestimate θ0 when it is negative. The middle panel (ω = 0.5) reveals that FlexB, MH, SA25, and SGSwgt achieve the best performance, all producing estimates around their true values. Other methods such as MUE, HTE-Beta, EL, SA and DSL overestimate θ0 when θ0 < 0 but underestimate θ0 when θ0 > 0. BAYES and GLMM consistently underestimate θ0, and the fluctuation of the BAYES estimates in the (I = 20, ω = 0.5) plot suggests that the algorithm is occasionally unstable. In the bottom panel (ω = 1), FlexB often surpasses other procedures, yielding almost unbiased estimates for all θ0 values. SA25, SA and MUE are slightly worse than FlexB as they give biased estimates when θ0 > 0. Among other approaches that underestimate θ0 consistently, DSL has the least bias except for θ0 values close to 1. We also observe that, for all panels, the bias of FlexB estimates tends to decrease as the number of studies I increases.
FIGURE 2.

Bias comparison of point estimates of θ0 by Mantel & Haenszel (MH), DerSimonian & Laird (DSL), empirical logit (EL), generalized linear mixed model estimator (GLMM), median unbiased estimator (MUE), simple (unweighted) average (SA) with the default continuity correction factor 0.5, Shuster, Guo and Skyler’s weighted estimator (SGSwgt), simple (unweighted) average with continuity correction 0.25 (SA25), BAYES, HTE-Beta (HTE), and proposed FlexB for different values of θ0, ω, and I. Settings: ni1,ni2 ~ U (50, 1000), μ0 = −5, θ0 = {−1, −0.9, −0.8, …, 1}, σ2 = 0.5, τ2 = 0.8.
Figure S1 of Supplementary Material shows MSE results for estimating θ0 by the different methods. In general, FlexB works well for I = 50 and 80 and also has reasonable MSE for I = 10 and 20. Among other approaches, the top panel (ω = 0) shows that SA, SA25, MUE, BAYES and GLMM perform well while EL, DSL, MH, SGSwgt and HTE report relatively large MSE. In the bottom panel (ω = 1), SA, SA25 and MUE still perform well, but GLMM and BAYES show large MSE even with large I; as I increases, FlexB reports smaller MSE and its ranking rises from the middle to the very top (e.g., for I = 50 and 80, it is the best for almost all θ0 values). All methods behave more similarly in the middle panel (ω = 0.5), where the difference in their performance becomes smaller as I gets larger.
Figure 3 shows bias results of all methods by varying μ0 while fixing θ0 at zero. We find that four methods, FlexB, SA, SA25 and MUE, form the top performing group for all combinations of (I, ω), producing estimates around the true value of θ0 = 0 consistently. When ω = 0, the top panel shows that BAYES and GLMM perform well besides the best four (yet not as well as the best four) while the remaining approaches including HTE-Beta and SGSwgt significantly overestimate θ0. When ω = 0.5, only BAYES and GLMM underestimate θ0 while the others perform well. When ω = 1, all methods except for the top group underestimate θ0. The bias of the eleven methods across all scenarios roughly follows the order FlexB≈SA25≈SA≈MUE<DSL<BAYES≈GLMM<HTE-Beta≈SGSwgt≈EL≈MH. Figure S2 of Supplementary Material shows the corresponding MSE results for estimating θ0 by varying μ0. Here, FlexB, SA, SA25 and MUE form the top group in terms of MSE; SGSwgt, MH, HTE and EL form the bottom group while the other three stand somewhere in the middle but closer to the top group.
FIGURE 3.

Bias comparison of point estimates of θ0 by Mantel & Haenszel (MH), DerSimonian & Laird (DSL), empirical logit (EL), generalized linear mixed model estimator (GLMM), median unbiased estimator (MUE), simple (unweighted) average (SA) with the default continuity correction factor 0.5, Shuster, Guo and Skyler’s weighted estimator (SGSwgt), simple (unweighted) average with continuity correction 0.25 (SA25), BAYES, HTE-Beta (HTE), and proposed FlexB for different values of μ0, ω, and I. Settings: ni1, ni2 ~ U (50, 1000), μ0 = {−5, −4.5, …, 0}, θ0 = 0, σ2 = 0.5, τ2 = 0.8.
Figure 4 shows coverage results for 95% interval estimates of θ0 from the proposed FlexB and four other methods (i.e., HKSJ, HM, SJ, and IPM) for different θ0, ω and I values. The overall performance of all five methods appears to follow the order FlexB>SJ>HKSJ>IPM>HM. Clearly, FlexB performs best, with coverage staying close to the nominal level (indicated by the red dashed line) in all cases considered. For the other four, in the top panel (ω = 0), the coverage tends to decrease as θ0 decreases while the trend is opposite in the bottom panel (ω = 1), as they provide lower coverage with larger θ0. When ω = 0.5, the four methods tend to report lower coverage with larger absolute values of θ0, and thus all have a non-monotone pattern. We also observe that all methods except for FlexB have lower coverage as the number of studies I increases. Figure S3 presents the corresponding width results for the interval estimates of θ0, showing that methods offering better coverage have larger width as well. Thus, FlexB gives wider intervals compared to the other four while providing better coverage. Note that narrower intervals are preferred only when they can provide adequate coverage. Figure 4 shows that the other four methods fail to do so, especially when θ0 moves away from zero.
FIGURE 4.

Coverage comparison of 95% interval estimates of θ0 by Hartung-Knapp/Sidik-Jonkman (HKSJ) method, three Wald confidence intervals using Hartung-Makambi estimator (HM), positive Sidik-Jonkman estimator (SJ), and improved Paule and Mandel estimator (IPM), and proposed FlexB for different values of θ0, ω, and I. Settings: ni1, ni2 ~ U (50, 1000), μ0 = −5, θ0 = {−1, −0.9, −0.8, …, 1}, σ2 = 0.5, τ2 = 0.8.
Figure 5 shows coverage results from the five methods by varying μ0 while fixing θ0 at zero. Again, the relative performance of the methods follows the same order as in Figure 4, where FlexB has relatively stable coverage and generally outperforms the other four that tend to have lower coverage as μ0 decreases and I increases for ω ≠ 0.5. Figure S4 shows the corresponding width results. We can see that the width of each method decreases as I or μ0 increases. As in Figure S3, FlexB reports larger width than the other approaches for small I or μ0 but as I and μ0 get large, the difference diminishes.
FIGURE 5.

Coverage comparison of 95% interval estimates of θ0 by Hartung-Knapp/Sidik-Jonkman (HKSJ) method, three Wald confidence intervals using Hartung-Makambi estimator (HM), positive Sidik-Jonkman estimator (SJ), and improved Paule and Mandel estimator (IPM), and proposed FlexB for different values of μ0, ω, and I. Settings: ni1, ni2 ~ U (50, 1000), μ0 = {−5, −4.5, …, 0}, θ0 = 0, σ2 = 0.5, τ2 = 0.8.
4.2. Performance on estimation of τ2
To evaluate the performance of FlexB in estimating the heterogeneity parameter τ2, we vary (a) the value of τ2 and (b) the value of μ0 in the BN model (1). We choose the same settings for the number of studies I and the variance direction parameter ω as in Section 4.1; for (a), we set τ2 ∈ {0, 0.1, 0.2, …, 1}, θ0 = 0, σ2 = 0.5, μ0 = −5 and for (b) μ0 ∈ {0, −0.5, −1, …, −5}, θ0 = 0, σ2 = 0.5, τ2 = 0.8. We compare point estimates of τ2 from the proposed FlexB and seven other methods in terms of bias and MSE, including Paule & Mandel (PM),38 DerSimonian & Laird (DSL),5 GLMM, SJ, DerSimonian & Kacker (DSK),39 improved Paule & Mandel (IPM),12 and BAYES. As for interval estimation, we consider four competitors recommended by Zhang et al.,40 who conducted comprehensive simulation studies to compare 16 different types of confidence intervals for the heterogeneity parameter. The methods include two profile likelihood confidence intervals based on maximum likelihood estimation (PLML) proposed by Hardy and Thompson41 and restricted maximum likelihood estimation (PLREML) proposed by Viechtbauer,42 Sidik-Jonkman (SJ) method, and approximate Jackson (AJ) method.43
Figure 6 presents bias results for point estimates of the heterogeneity parameter τ2 from the eight methods when τ2 varies. When τ2 > 0 (i.e., between-study heterogeneity exists), the top panel (ω = 0) shows that BAYES, GLMM and FlexB are the top performing group with much less bias than the other five methods; the middle and bottom panels (ω = 0.5, 1) show that FlexB performs best (except on a few occasions when τ2 and I are both small), while the bias of BAYES and GLMM grows quickly as τ2 gets large. Among the other five methods, the performance of IPM is better when τ2 ≤ 0.4, while SJ outperforms the other four when τ2 > 0.4. On the other hand, when there is no heterogeneity (τ2 = 0), SJ reports the worst results, and BAYES, GLMM and FlexB overestimate τ2 slightly while the other methods seem to be unbiased in this particular case.
Figure S5 in Supplementary Material shows the corresponding MSE results for estimating τ2 using the eight methods. No method consistently outperforms the others in all the settings. It seems that for τ2 ≥ 0.4, SJ performs the best, often followed by FlexB; however, when τ2 < 0.4, SJ has the largest MSE while FlexB becomes the best or close to the best, especially when I is not small.
Figure 7 shows bias results of the eight methods by varying μ0 while fixing τ2 at 0.8. We observe that FlexB and SJ have much smaller bias in estimating τ2 regardless of μ0, compared to the other methods, which usually substantially underestimate τ2. Although GLMM and BAYES perform well in the top row (ω = 0), they tend to report larger bias in the middle and bottom rows (ω = 0.5, 1). The other four methods (IPM, PM, DSK, and DSL) perform poorly when μ0 = −5 but the bias slowly decreases as μ0 increases to 0. Figure S6 shows the corresponding MSE results for estimating τ2, where FlexB performs reasonably well in most scenarios, and SJ has quite comparable results.
We note that when varying μ0 while fixing τ2 at 0.8, FlexB and SJ are always in the top performing group, regardless of (I, ω) settings, but as shown in Figure 6 (and Figure 8 below), SJ only performs well when τ2 ≥ 0.4 and may achieve the best results around τ2 = 0.8. Thus, SJ may report much less favorable results if we fix τ2 at a value smaller than 0.4.
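The two point-estimation criteria used throughout this section can be written down in a few lines. The sketch below computes the empirical bias and MSE of a set of replicate estimates against a known true value; the function name and the toy numbers are hypothetical and only illustrate the definitions.

```python
from statistics import fmean


def bias_and_mse(estimates, truth):
    """Empirical bias and mean squared error over simulation replicates."""
    bias = fmean(estimates) - truth
    mse = fmean([(e - truth) ** 2 for e in estimates])
    return bias, mse


# Toy replicate estimates of a parameter whose true value is 1.0
bias, mse = bias_and_mse([1.0, 3.0], truth=1.0)
```

Since MSE equals squared bias plus the variance of the estimator across replicates, a method can rank differently under the two criteria, as seen for SJ above.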
FIGURE 6.

Bias comparison of estimates of τ2 by Paule & Mandel (PM), DerSimonian & Laird (DSL), generalized linear mixed model (GLMM), Sidik-Jonkman (SJ), DerSimonian & Kacker (DSK), improved Paule & Mandel (IPM), BAYES, and proposed FlexB for different values of τ2, ω, and I. Settings: nic, nit ~ U (50, 1000), μ0 = −5, θ0 = 0, σ2 = 0.5, τ2 = {0, 0.1, 0.2, …, 1}.
FIGURE 7.

Bias comparison of estimates of τ2 by Paule & Mandel (PM), DerSimonian & Laird (DSL), generalized linear mixed model (GLMM), Sidik-Jonkman (SJ), DerSimonian & Kacker (DSK), improved Paule & Mandel (IPM), BAYES, and proposed FlexB for different values of μ0, ω, and I. Settings: nic, nit ~ U (50, 1000), μ0 = {−5, −4.5, …, 0}, θ0 = 0, σ2 = 0.5, τ2 = 0.8.
FIGURE 8.

Coverage comparison of 95% interval estimates of τ2 by proposed FlexB, profile likelihood confidence intervals based on maximum likelihood estimation (PLML), profile likelihood confidence intervals based on restricted maximum likelihood estimation (PLREML), Sidik-Jonkman (SJ) and approximate Jackson (AJ) methods for different values of τ2, ω, and I. Settings: nic, nit ~ U (50, 1000), μ0 = −5, θ0 = 0, σ2 = 0.5, τ2 = {0.1, 0.2, …, 1}.
Figure 8 shows coverage results for 95% interval estimates of τ2 from the proposed FlexB and four other methods (i.e., PLML, PLREML, SJ, and AJ) for different τ2, ω and I values. We find that FlexB is the only method whose coverage stays close to the nominal level in all cases, while the others show substantial undercoverage. SJ performs poorly when τ2 < 0.4 but becomes the second best as τ2 gets larger. For the other three methods, the performance appears to follow the order PLREML > PLML > AJ. Figure S7 shows the corresponding width results for interval estimates of τ2, where FlexB reports larger widths than the other approaches, consistent with its better coverage.
Figures 9 and S8 show coverage and width results, respectively, from the five methods by varying μ0 while fixing τ2 at 0.8. Again, we find that FlexB, often with larger width, provides better coverage than the others in most scenarios.
FIGURE 9.

Coverage comparison of 95% interval estimates of τ2 by proposed FlexB, profile likelihood confidence intervals based on maximum likelihood estimation (PLML), profile likelihood confidence intervals based on restricted maximum likelihood estimation (PLREML), Sidik-Jonkman (SJ) and approximate Jackson (AJ) methods for different values of μ0, ω, and I. Settings: ni1, ni2 ~ U (50, 1000), μ0 = {−5, −4.5, …, 0}, θ0 = 0, σ2 = 0.5, τ2 = 0.8.
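The coverage and width summaries reported for the interval estimates can be sketched as follows. The function name and the toy intervals are hypothetical; the sketch only assumes that each replicate yields one (lower, upper) interval and that the true parameter value is known.

```python
def coverage_and_mean_width(intervals, truth):
    """Empirical coverage and average width of interval estimates.

    `intervals` is a list of (lower, upper) pairs, one per replicate.
    """
    n = len(intervals)
    if n == 0:
        raise ValueError("no intervals supplied")
    covered = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    mean_width = sum(hi - lo for lo, hi in intervals) / n
    return covered / n, mean_width


# Toy example: four replicate intervals for a true value of 0.8
toy = [(0.0, 1.0), (0.9, 2.0), (-1.0, 0.5), (0.5, 1.5)]
coverage, width = coverage_and_mean_width(toy, truth=0.8)
```

Reading coverage and width together matters because a method can reach high coverage simply by reporting wide intervals.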
4.3. Performance on Bayesian hypothesis testing
We now examine how BIC and DIC, as detailed in Section 3, perform on hypothesis testing (or model selection) in various simulated settings of (θ0, τ2, ω). For the purpose of benchmarking, we consider the Akaike information criterion (AIC) since it tends to give better estimates for models of certain characteristics.30 We also include a sequential testing procedure based on tests proposed in Bhaumik et al.12 For testing H0 : θ0 = 0, Bhaumik et al.12 recommend a z-test procedure using the statistic T2 based on the SA estimator of θ0. For testing H0 : τ2 = 0, the authors advocate a parametric bootstrap approach based on the test statistic T4, which outperforms an alternative test based on Cochran’s Q statistic. We refer to their approach as Bhaumik’s sequential testing (BST) procedure since the two parameters need to be tested separately.
In this simulation study, we vary both θ0 and τ2 in the set {0, 0.3, 0.6, 0.9, 1.2} and ω ∈ {0, 0.5, 1} while fixing μ0 = −5, σ2 = 0.5, I = 50 and nic, nit ~ U (50, 1000). For each setting of (θ0, τ2, ω), 500 replicates were generated. We define the accuracy of a selection procedure as the proportion of replications in which the correct (θ0, τ2) setting was selected. The definition involves only θ0 and τ2, not ω, because BST can only test θ0 and τ2.
Figure 10 shows contour plots of accuracy for the four selection procedures in various simulated settings of (θ0, τ2) with ω = 0.5. DIC and AIC appear to have the best overall performance. However, when there is no inter-study heterogeneity (τ2 = 0), BIC seems to outperform the others since it uses a larger penalty and thus favors simpler models. Similar observations can be made from the results for ω = 0 and 1, which are omitted for brevity.
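The remark about BIC's larger penalty can be illustrated with the textbook forms of AIC and BIC. The sketch below is not the criterion computation of Section 3: the log-likelihood values, parameter counts, and sample size are hypothetical, and DIC is omitted because it requires posterior draws.

```python
import math


def aic(loglik, n_params):
    """Textbook AIC: -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n_obs):
    """Textbook BIC: -2 log L + k log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def select_model(scores):
    """Return the model label with the smallest criterion value."""
    return min(scores, key=scores.get)


def selection_accuracy(selected, truth):
    """Proportion of replicates in which the true model was selected."""
    return sum(1 for s in selected if s == truth) / len(selected)


# Hypothetical fits: label -> (maximized log-likelihood, number of parameters)
fits = {"simple": (-100.0, 2), "complex": (-98.5, 3)}
n_obs = 50  # assumed sample size entering the BIC penalty

aic_choice = select_model({m: aic(ll, k) for m, (ll, k) in fits.items()})
bic_choice = select_model({m: bic(ll, k, n_obs) for m, (ll, k) in fits.items()})
```

With these toy numbers the extra parameter improves the log-likelihood by 1.5, which is enough to offset AIC's penalty of 2 per parameter but not BIC's penalty of log(50) ≈ 3.9, so AIC picks the larger model and BIC the smaller one.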
FIGURE 10.

Contour plots of average accuracies for testing different combinations of the overall treatment effect θ0 and inter-study heterogeneity τ2. Top-left: AIC; Top-right: BIC; Bottom-left: DIC; Bottom-right: Bhaumik’s sequential testing (BST). Settings: nic, nit ~ U (50, 1000), I = 50, μ0 = −5, ω = 0.5, σ2 = 0.5.
4.4. Computational efficiency of FlexB
In Section 2.3, we showed how Pólya-Gamma data-augmentation can be integrated into our flexible Bayesian REM to avoid the need for rejection/Metropolis–Hastings sampling, numerical integration, or analytical approximation. However, the computational efficiency of the algorithm for FlexB has not yet been investigated; we address it here. Since the proposed FlexB, BAYES and HTE-Beta are Bayesian approaches, we are interested in comparing their computation time. We set μ0 = −5, θ0 = 1, σ2 = 0.5, τ2 = 0.8, ω = 0, and I ∈ {10, 20, …, 80}, and average the computation time over 100 replicate datasets with 10,000 MCMC iterations per dataset. Table 1 shows that both HTE-Beta and FlexB are much faster than BAYES, and that FlexB is the fastest, with a computation time less than half of HTE-Beta’s for all sizes of meta-analysis.
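To illustrate why Pólya-Gamma augmentation yields conditionals that are known distributions, here is a minimal sketch for a toy model with a single log-odds parameter ψ, a binomial likelihood, and a fixed normal prior. It is not the FlexB sampler of Section 2.3: the model, the function names, and the truncated sum-of-gammas sampler (a crude stand-in for an exact Pólya-Gamma sampler) are all assumptions made for illustration.

```python
import math
import random


def sample_pg_truncated(b, c, rng, n_terms=100):
    """Approximate draw from PG(b, c) via a truncated sum of gammas.

    Uses the infinite-sum representation of Polson et al., cut off after
    `n_terms` terms, so draws are slightly too small on average.
    """
    total = 0.0
    for k in range(1, n_terms + 1):
        g = rng.gammavariate(b, 1.0)
        total += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)


def pg_mean(b, c):
    """Exact mean of PG(b, c)."""
    if c == 0.0:
        return b / 4.0
    return b / (2.0 * c) * math.tanh(c / 2.0)


def normal_conditional(y, n, omega, prior_mean, prior_var):
    """Mean and variance of psi given omega, for y ~ Binomial(n, expit(psi))
    and prior psi ~ N(prior_mean, prior_var)."""
    kappa = y - n / 2.0
    precision = omega + 1.0 / prior_var
    mean = (kappa + prior_mean / prior_var) / precision
    return mean, 1.0 / precision


def gibbs_log_odds(y, n, prior_mean, prior_var, n_iter, rng):
    """Two-block Gibbs sampler: omega | psi is PG, psi | omega is normal."""
    psi, draws = 0.0, []
    for _ in range(n_iter):
        omega = sample_pg_truncated(n, psi, rng)
        mean, var = normal_conditional(y, n, omega, prior_mean, prior_var)
        psi = rng.gauss(mean, math.sqrt(var))
        draws.append(psi)
    return draws


rng = random.Random(0)
draws = gibbs_log_odds(y=1, n=100, prior_mean=0.0, prior_var=100.0,
                       n_iter=20, rng=rng)
```

Both updates are direct draws from standard distributions, so no accept/reject step or tuning parameter is needed; this is the property that the timing comparison evaluates.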
TABLE 1.
Comparison of the mean computation time (in seconds) by BAYES, HTE-Beta and proposed FlexB. Settings: ni1, ni2 ~ U (50, 1000), θ0 = 1, τ2 = 0.8, μ0 = −5, σ2 = 0.5, I ∈ {10, 20, …, 80}.
| I | BAYES | HTE-Beta | FlexB |
|---|---|---|---|
| 10 | 22.38 | 1.22 | 0.48 |
| 20 | 38.47 | 2.32 | 0.85 |
| 30 | 52.96 | 3.53 | 1.23 |
| 40 | 68.27 | 4.49 | 1.55 |
| 50 | 84.28 | 5.71 | 2.12 |
| 60 | 98.42 | 6.64 | 2.54 |
| 70 | 111.06 | 8.03 | 2.96 |
| 80 | 124.85 | 9.03 | 3.07 |
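The "less than half" statement can be checked directly from the entries of Table 1. The short script below copies the reported times and computes the two ratios of interest; the variable names are illustrative.

```python
# Mean computation time in seconds, copied from Table 1:
# I -> (BAYES, HTE-Beta, FlexB)
table1 = {
    10: (22.38, 1.22, 0.48),
    20: (38.47, 2.32, 0.85),
    30: (52.96, 3.53, 1.23),
    40: (68.27, 4.49, 1.55),
    50: (84.28, 5.71, 2.12),
    60: (98.42, 6.64, 2.54),
    70: (111.06, 8.03, 2.96),
    80: (124.85, 9.03, 3.07),
}

# Ratio of FlexB time to HTE-Beta time, and of BAYES time to FlexB time
flexb_vs_hte = {i: flexb / hte for i, (_, hte, flexb) in table1.items()}
bayes_vs_flexb = {i: bayes / flexb for i, (bayes, _, flexb) in table1.items()}

for i in sorted(table1):
    print(f"I={i}: FlexB/HTE-Beta = {flexb_vs_hte[i]:.2f}, "
          f"BAYES/FlexB = {bayes_vs_flexb[i]:.1f}")
```

On these numbers the FlexB to HTE-Beta ratio stays between roughly 0.34 and 0.39 for every I, and BAYES takes roughly 37 to 47 times as long as FlexB.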
5. DATA EXAMPLES
5.1. Rosiglitazone meta-analysis (56 studies)
Rosiglitazone (marketed as Avandia), approved by the US Food and Drug Administration (FDA) in 1999, is a widely-used anti-diabetes drug in the US. Shortly after the drug was marketed to patients, there was much debate about potential adverse effects of rosiglitazone on cardiovascular safety.44,45,46 Nissen and Wolski47 conducted a meta-analysis of 56 trials to investigate the adverse effects of rosiglitazone (see Table S1 for detailed data in Supplementary Material). In the 56 independent trials, 19,509 patients were assigned randomly to the treatment group (rosiglitazone), and 16,022 patients were assigned to the control group. There were 159 myocardial infarction (MI) and 103 cardiovascular death (CVD) cases in the treatment group and 136 MI and 98 CVD cases in the control group. The overall incidence rate across all 56 trials is low for both MI (0.83%) and CVD (0.57%).
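The overall incidence rates quoted above follow from the pooled counts. A short check, using only the numbers stated in the paragraph (variable names are illustrative):

```python
# Pooled counts from the 56 rosiglitazone trials, as stated in the text
treated, control = 19509, 16022
mi_events = 159 + 136    # MI cases: treatment + control
cvd_events = 103 + 98    # CVD cases: treatment + control
total = treated + control

mi_rate = 100.0 * mi_events / total
cvd_rate = 100.0 * cvd_events / total
print(f"MI: {mi_rate:.2f}%, CVD: {cvd_rate:.2f}%")  # MI: 0.83%, CVD: 0.57%
```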
Figure 11 displays the posterior densities of all global parameters from FlexB and the densities of sample log odds for treatment and control groups from component studies (the bottom right plot). Table 2(a) shows the summary statistics of posterior draws of (θ0, τ2, μ0, σ2) from FlexB, where C.I. represents a credible interval, and Table 2(b) compares our (FlexB) estimates of θ0 and τ2 with those obtained by other approaches. We see that μ0 is very low, estimated at −5.761 for MI and −6.633 for CVD, confirming that those adverse events are extremely rare. As for estimating θ0 for MI events, our FlexB reports 0.247 but the Bayesian C.I. contains 0, indicating there is no strong evidence to claim the existence of an overall treatment effect. We also observe that DSL produces the largest estimate, 0.295, followed by MH and SGSwgt (with p-values 0.006, 0.015, and 0.016, respectively); GLMM reports an estimate of 0.226 with p-value 0.061, which is on the borderline of significance. All other methods report non-significant results on the treatment effect. This is not surprising since, in Section 4.1, DSL, MH, and SGSwgt perform poorly and have the largest bias when estimating θ0. For CVD data, SA, EL and MUE report slightly negative estimates of θ0 while other approaches including FlexB produce positive estimates. HTE-Beta reports the largest estimate (0.297), followed by FlexB (0.160) and BAYES (0.137). As mentioned in Section 4.1, HTE-Beta yields the largest bias among the three Bayesian approaches and is worse than some frequentist approaches such as SA and MUE. None of the methods rejects the null hypothesis of θ0 = 0. The Bayesian C.I. of our FlexB covers 0, indicating no evidence of a treatment effect for CVD either.
FIGURE 11.

Rosiglitazone example: posterior densities of all global parameters and densities of sample log odds for treatment and control groups from the 56 component studies for (a) myocardial infarction (MI), and (b) cardiovascular death (CVD).
TABLE 2.
Rosiglitazone example: (a) a summary of FlexB estimates for myocardial infarction (MI) and cardiovascular death (CVD); (b) comparison of different methods in estimating θ0 and τ2 for MI and CVD.
| Parameter | MI: Mean | MI: SE | MI: Median | MI: 95% C.I. | CVD: Mean | CVD: SE | CVD: Median | CVD: 95% C.I. |
|---|---|---|---|---|---|---|---|---|
| θ0 | 0.247 | 0.147 | 0.241 | (−0.054, 0.579) | 0.160 | 0.232 | 0.144 | (−0.342, 0.687) |
| τ2 | 0.058 | 0.075 | 0.033 | (0.001, 0.201) | 0.117 | 0.188 | 0.052 | (0.002, 0.454) |
| μ0 | −5.761 | 0.224 | −5.754 | (−6.225, −5.336) | −6.633 | 0.307 | −6.612 | (−7.280, −6.087) |
| σ2 | 1.008 | 0.325 | 0.957 | (0.454, 1.543) | 1.535 | 0.623 | 1.429 | (0.547, 2.860) |
(a)
| Method | θ0 (MI) | p-value (MI) | θ0 (CVD) | p-value (CVD) | Method | τ2 (MI) | τ2 (CVD) |
|---|---|---|---|---|---|---|---|
| SA | 0.030 | 0.446 | −0.063 | 0.395 | DSL | 0 | 0 |
| DSL | 0.295 | 0.006 | 0.118 | 0.184 | DSK | 0 | 0 |
| EL | 0.159 | 0.081 | −0.059 | 0.325 | PM | 0 | 0 |
| MH | 0.250 | 0.015 | 0.026 | 0.422 | IPM | 0 | 0 |
| MUE | 0.071 | 0.268 | −0.032 | 0.402 | GLMM | 0 | 0 |
| GLMM | 0.226 | 0.061 | 0.007 | 0.962 | BAYES | 0.042 | 0.024 |
| SGSwgt | 0.248 | 0.016 | 0.036 | 0.391 | FlexB | 0.033 | 0.052 |
| BAYES | 0.175 | * | 0.137 | * | | | |
| HTE-Beta | 0.220 | * | 0.297 | * | | | |
| FlexB | 0.247 | * | 0.160 | * | | | |
(b)
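Summaries of the kind reported in Table 2(a) are computed from posterior draws. The sketch below shows one common way to do this; it assumes that the reported "SE" is the posterior standard deviation and that the 95% C.I. is an equal-tailed interval, neither of which is stated in this section, and the function names are hypothetical.

```python
from statistics import fmean, median, stdev


def percentile(sorted_values, p):
    """Linearly interpolated percentile of sorted data, with p in [0, 1]."""
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def summarize_draws(draws, level=0.95):
    """Posterior mean, SD, median and equal-tailed credible interval."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    return {
        "mean": fmean(ordered),
        "sd": stdev(ordered),
        "median": median(ordered),
        "interval": (percentile(ordered, tail), percentile(ordered, 1.0 - tail)),
    }


# Toy "draws" 1, 2, ..., 100 in place of real MCMC output
summary = summarize_draws([float(i) for i in range(1, 101)])
```

For a right-skewed posterior such as that of τ2, the mean exceeds the median, which matches the pattern in Table 2(a) (for example, 0.058 versus 0.033 for MI).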
For estimating the heterogeneity parameter τ2, DSL, DSK, PM, and IPM report a value of 0 for both MI and CVD data while FlexB and BAYES report values a little greater than 0. In Figure 11(a) and (b), for both MI and CVD, we can observe that the variability of sample log odds in the treatment groups is similar to that in the control groups (in theory, the variability in both groups is equal when either ω = 0.5 or τ2 = 0); also, the posterior distributions of ω show that the posterior probabilities at 0, 0.5 and 1 do not differ much. Recall that under ℳ1 and ℳ5 (both assuming τ2 = 0), the posterior distribution of ω is the same as its prior distribution, where 0, 0.5, and 1 occur with equal probability. Moreover, the estimates of τ2 using FlexB are close to 0. Given these results, it is reasonable to conclude that τ2 = 0 for both the MI and CVD data. For model selection, all procedures choose model ℳ1 (θ0 = 0 & τ2 = 0) for the CVD data. For MI data, BIC and BST select ℳ1, while AIC and DIC choose ℳ5 (θ0 ≠ 0 & τ2 = 0). Despite this, we conclude that ℳ1 is the best fit for the MI data since, from Figure 10, BIC performs better than the others when θ0 = 0 and our FlexB C.I. of θ0 covers 0. Therefore, we conclude that there is no significant treatment effect for either MI or CVD, and no heterogeneity in either dataset. Our conclusion agrees with the literature48,49,15 and with the FDA, which has eased the restrictions on rosiglitazone.50
5.2. Stomach ulcers meta-analysis (41 studies)
Efron51 conducted a meta-analysis to compare a new surgical treatment with an older one for stomach ulcers. The meta-analysis includes 41 independent studies, where the treatment group (new) contains 916 patients, while the control group (old) has 907 patients. There are 170 and 352 total events (recurrent bleeding) in the treatment and control groups, respectively. The overall incidence rate is 28.6% in this example. For detailed data, see Table S2 in Supplementary Material.
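The 28.6% figure, and the direction of the raw difference between arms, can be reproduced from the pooled counts. The crude log odds ratio below pools all studies into one 2×2 table and ignores the study structure entirely; it is a back-of-the-envelope check, not one of the methods compared in this paper, and the variable names are illustrative.

```python
import math

# Pooled counts from the 41 stomach ulcer studies, as stated in the text
events_new, n_new = 170, 916   # new treatment
events_old, n_old = 352, 907   # old treatment

overall_rate = 100.0 * (events_new + events_old) / (n_new + n_old)
rate_new = 100.0 * events_new / n_new
rate_old = 100.0 * events_old / n_old

# Crude (naively pooled) log odds ratio, new versus old
odds_new = events_new / (n_new - events_new)
odds_old = events_old / (n_old - events_old)
crude_log_or = math.log(odds_new / odds_old)

print(f"overall: {overall_rate:.1f}%, new: {rate_new:.1f}%, old: {rate_old:.1f}%")
print(f"crude log odds ratio: {crude_log_or:.3f}")
```

The pooled event rates are about 18.6% under the new treatment and 38.8% under the old one, so the crude log odds ratio is negative, in the same direction as the model-based estimates of θ0.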
Figure 12 shows the posterior densities of all global parameters from FlexB and the densities of sample log odds for treatment and control groups from component studies. Table 3 displays summary statistics of posterior draws of (θ0, τ2, μ0, σ2) from FlexB and its comparison with other methods in the estimation of θ0 and τ2. The density plots of sample log odds in Figure 12 (the bottom right plot) suggest that the variability in the control groups is larger than that in the treatment groups. This shows a real example where the assumption about the variability direction in those popular existing models such as Smith et al.16 and Bhaumik et al.12 is invalid. Clearly, the posterior distribution of ω from our FlexB model strongly supports the correct choice of ω = 1.
FIGURE 12.

Stomach ulcers example: posterior densities of all global parameters and densities of sample log odds for treatment and control groups from the 41 component studies.
TABLE 3.
Stomach ulcers example: (a) a summary of FlexB estimates; (b) comparison of different methods in estimating θ0 and τ2.
| Stomach ulcers | Mean | SE | Median | 95% C.I. |
|---|---|---|---|---|
| θ0 | −1.305 | 0.273 | −1.298 | (−1.828, −0.764) |
| τ2 | 2.273 | 0.830 | 2.122 | (0.951, 3.935) |
| μ0 | −1.509 | 0.180 | −1.517 | (−1.867, −1.164) |
| σ2 | 0.524 | 0.242 | 0.485 | (0.125, 1.001) |
(a)
| Method | θ0 | p-value | Method | τ2 |
|---|---|---|---|---|
| SA | −1.370 | <0.001 | DSL | 0.778 |
| DSL | −1.026 | <0.001 | DSK | 1.136 |
| EL | −0.855 | <0.001 | PM | 1.275 |
| MH | −1.101 | <0.001 | IPM | 2.368 |
| MUE | −1.447 | <0.001 | GLMM | 1.500 |
| GLMM | −1.386 | <0.001 | BAYES | 1.246 |
| SGSwgt | −1.083 | <0.001 | FlexB | 2.122 |
| BAYES | −1.349 | * | | |
| HTE-Beta | −1.251 | * | | |
| FlexB | −1.305 | * | | |
(b)
The results show that FlexB, SA and MUE are the top three methods for estimating θ0 with larger μ0 (i.e., not so rare events), as shown in Figure 5. Also, as previously shown in Figure 9, FlexB and IPM report nearly unbiased estimates of τ2 with larger I and μ0, while the others tend to underestimate τ2. In Table 3(b), FlexB reports an estimate of −1.305 for θ0, with SA and MUE providing similar estimates of −1.370 and −1.447. All methods, including the proposed FlexB, reach the same conclusion that an overall treatment effect exists (i.e., θ0 ≠ 0). As for estimating τ2, FlexB and IPM report similar estimates while other methods provide smaller values, which is consistent with what we have observed in our simulation. As to model selection, all approaches choose the model with θ0 ≠ 0 and τ2 ≠ 0. We therefore conclude that the risk of recurrent bleeding is lower for patients undergoing the new treatment compared with the traditional treatment. We also conclude that there is between-study heterogeneity of treatment effects.
6. DISCUSSION
Based on a flexible random-effects model, we develop a novel Bayesian procedure (FlexB) to estimate and test the overall treatment effect θ0 and inter-study heterogeneity τ2 in meta-analysis of rare binary events. FlexB removes the assumption about the direction of variability between control and treatment groups in classical REMs, which may be violated in practical situations (see the second data example in Section 5). It relies on a data-adaptive approach to determine an appropriate direction rather than fixing it beforehand. Our Markov chain Monte Carlo algorithm adapts the Pólya-Gamma data-augmentation technique22 into the proposed Bayesian hierarchical framework for meta-analysis of rare binary events so that the corresponding full conditionals are all known distributions, which, combined with implementation using Rcpp, brings ease of implementation, efficiency in computation and stability for estimation. Our simulation shows that FlexB generally reports less biased results in both point and interval estimation of θ0 and τ2, compared with other frequentist and Bayesian competitors. For simultaneous testing of θ0 and τ2, DIC and AIC are the overall best procedures. However, BIC performs best when there is no inter-study heterogeneity. We further illustrate our estimation and testing approach in rosiglitazone and stomach ulcers meta-analyses, where observations made from real data conform well with those made from simulated data. In the rosiglitazone meta-analysis, we conclude no overall treatment effect and no heterogeneity for myocardial infarction (MI) data and cardiovascular death (CVD) data. In the stomach ulcers meta-analysis, we demonstrate that none of the popular REMs in the literature fits this example and that our FlexB leads to the correct choice of the variance direction parameter. We further conclude that the risk of recurrent bleeding would decrease using the new treatment, and that there is inter-study heterogeneity.
As pointed out by one of our reviewers, Rhodes et al.52 report that 75% of meta-analyses contain five or fewer studies. However, this finding was based on 6,492 continuous-outcome meta-analyses within the Cochrane Database of Systematic Reviews, as clearly stated in their abstract and indicated in their title as well. We would like to point out that this summary may not be representative of the rare binary setting (the focus of our paper). This is mainly because for rare binary data, if the number of studies I is small, meta-analysis would not help much unless researchers have useful prior information and employ a Bayesian approach to leverage such information.53 Although it can confirm that the event probabilities are very small, meta-analysis cannot tell how close to zero they are without prior information, especially in the presence of zero or double zero tables. Thus, people usually apply meta-analysis to rare binary events when I is not small. For example, the most well-known meta-analysis for rare binary data in the literature is perhaps the one about side effects of rosiglitazone that we re-analyze in Section 5.1, with I = 48 in Nissen and Wolski’s 2007 study54 and 56 in their 2010 study.47 In other reported meta-analyses of rare binary events,55,56,57,8,58,59 the number of studies I is typically larger than 20. This is why we simulate data with I = 10, 20, 50, 80, to represent very small, small, medium, and large sizes of meta-analysis. Our simulation results show that the proposed FlexB shows superior performance for I = 50 and 80 and works reasonably well for I = 10 and 20, too. Since FlexB employs virtually non-informative or diffuse priors, we do not recommend the use of FlexB with meta-analysis of only a few studies, in which FlexB may yield larger MSE. In such situations, we refer readers to Friede et al.60 and Günhan et al.18 where (weakly) informative priors are suggested. It is worth extending such priors to FlexB and examining its performance for meta-analysis with I ≤ 5.
In this paper, we employ the flexible random-effects model proposed by Li and Wang,14 which assumes that the treatment effects θj are fully heterogeneous across component studies. Recently, Moreno et al.61 considered situations where some of the θj are equal, making full heterogeneity no longer valid. They proposed a Bayesian model selection procedure for estimating the true cluster model, and then employed Bayesian model averaging to estimate the so-called meta parameter θ. Their proposed method works with small I (in their data example, I = 6) and they mention that computational difficulties arise when the number of studies I is moderately large. In the context of rare binary data, where I is typically much larger, it would be interesting to consider a formal binomial-normal model for such situations and develop an efficient algorithm.
Supplementary Material
ACKNOWLEDGMENT
This study was supported by the National Institutes of Health (Grant No.: R15GM131390 to X. Wang).
Footnotes
CONFLICT OF INTEREST
The authors declare no potential conflicts of interest.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are included in the Supplementary Material.
REFERENCES
- 1. Glass GV. Primary, Secondary, and Meta-Analysis of Research. Educational Researcher 1976; 5(10): 3–8.
- 2. Borenstein M. Introduction to meta-analysis. Chichester, U.K: John Wiley & Sons. 2009.
- 3. Mantel N, Haenszel W. Statistical Aspects of the Analysis of Data From Retrospective Studies of Disease. JNCI: Journal of the National Cancer Institute 1959; 22(4): 719–748.
- 4. Cochran WG. The Combination of Estimates from Different Experiments. Biometrics 1954; 10(1): 101.
- 5. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials 1986; 7(3): 177–188.
- 6. Yusuf S, Peto R, Lewis J, Collins R, Sleight P. Beta blockade during and after myocardial infarction: an overview of the randomized trials. Progress in Cardiovascular Diseases 1985; 27: 335–371.
- 7. Bradburn MJ, Deeks JJ, Berlin JA, Localio AR. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Statistics in Medicine 2006; 26(1): 53–77.
- 8. Kuss O. Statistical methods for meta-analyses including information from studies without any events-add nothing to nothing and succeed nevertheless. Statistics in Medicine 2014; 34(7): 1097–1116.
- 9. Ren Y, Lin L, Lian Q, Zou H, Chu H. Real-world Performance of Meta-analysis Methods for Double-Zero-Event Studies with Dichotomous Outcomes Using the Cochrane Database of Systematic Reviews. Journal of General Internal Medicine 2019; 34(6): 960–968.
- 10. Tian L, Cai T, Pfeffer MA, Piankov N, Cremieux PY, Wei LJ. Exact and efficient inference procedure for meta-analysis and its application to the analysis of independent 2 × 2 tables with all available data but without artificial continuity correction. Biostatistics 2008; 10(2): 275–281.
- 11. Houwelingen HCV, Zwinderman KH, Stijnen T. A bivariate approach to meta-analysis. Statistics in Medicine 1993; 12(24): 2273–2284.
- 12. Bhaumik DK, Amatya A, Normand SLT, et al. Meta-Analysis of Rare Binary Adverse Event Data. Journal of the American Statistical Association 2012; 107(498): 555–567.
- 13. Simmonds MC, Higgins JP. A general framework for the use of logistic regression models in meta-analysis. Statistical Methods in Medical Research 2016; 25(6): 2858–2877.
- 14. Li L, Wang X. Meta-analysis of rare binary events in treatment groups with unequal variability. Statistical Methods in Medical Research 2019; 28(1): 263–274.
- 15. Zhang C, Wang X, Chen M, Wang T. A comparison of hypothesis tests for homogeneity in meta-analysis with focus on rare binary events. Research Synthesis Methods 2021; 12(4): 408–428.
- 16. Smith TC, Spiegelhalter DJ, Thomas A. Bayesian approaches to random-effects meta-analysis: A comparative study. Statistics in Medicine 1995; 14(24): 2685–2699.
- 17. Bai O, Chen M, Wang X. Bayesian Estimation and Testing in Random Effects Meta-Analysis of Rare Binary Adverse Events. Statistics in Biopharmaceutical Research 2016; 8(1): 49–59.
- 18. Günhan BK, Röver C, Friede T. Random-effects meta-analysis of few studies involving rare events. Research Synthesis Methods 2020; 11(1): 74–90.
- 19. Hong H, Wang C, Rosner GL. Meta-analysis of rare adverse events in randomized clinical trials: Bayesian and frequentist methods. Clinical Trials 2020; 18(1): 3–16.
- 20. Gilks WR, Thomas A, Spiegelhalter DJ. A Language and Program for Complex Bayesian Modelling. The Statistician 1994; 43(1): 169.
- 21. Shuster JJ, Guo JD, Skyler JS. Meta-analysis of safety for low event-rate binomial trials. Research Synthesis Methods 2012; 3(1): 30–50. doi: 10.1002/jrsm.1039
- 22. Polson NG, Scott JG, Windle J. Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables. Journal of the American Statistical Association 2013; 108(504): 1339–1349.
- 23. Eddelbuettel D, François R. Rcpp: Seamless R and C++ Integration. Journal of Statistical Software 2011; 40(8): 1–18.
- 24. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2002; 64(4): 583–639.
- 25. Schwarz G. Estimating the Dimension of a Model. The Annals of Statistics 1978; 6(2): 461–464.
- 26. Whittaker. Graphical Models in Applied Multi Statis. John Wiley and Sons. 1990.
- 27. Wolfinger R. Laplace’s approximation for nonlinear mixed models. Biometrika 1993; 80(4): 791–795.
- 28. McCullagh P, Nelder J. Generalized Linear Models. Routledge. 2019.
- 29. Breslow NE, Clayton DG. Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association 1993; 88(421): 9.
- 30. Agresti A. Categorical Data Analysis. Wiley. 2012.
- 31. Parzen M, Lipsitz S, Ibrahim J, Klar N. An Estimate of the Odds Ratio That Always Exists. Journal of Computational and Graphical Statistics 2002; 11(2): 420–436.
- 32. Hartung J, Knapp G. On tests of the overall treatment effect in meta-analysis with normally distributed responses. Statistics in Medicine 2001; 20(12): 1771–1782. doi: 10.1002/sim.791
- 33. Hartung J, Knapp G. A refined method for the meta-analysis of controlled clinical trials with binary outcome. Statistics in Medicine 2001; 20(24): 3875–3889. doi: 10.1002/sim.1009
- 34. Sidik K, Jonkman JN. Simple heterogeneity variance estimation for meta-analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2005; 54(2): 367–384. doi: 10.1111/j.1467-9876.2005.00489.x
- 35. Weber F, Knapp G, Glass Ä, Kundt G, Ickstadt K. Interval estimation of the overall treatment effect in random-effects meta-analyses: Recommendations from a simulation study comparing frequentist, Bayesian, and bootstrap methods. Research Synthesis Methods 2020; 12(3): 291–315. doi: 10.1002/jrsm.1471
- 36. Pateras K, Nikolakopoulos S, Mavridis D, Roes KC. Interval estimation of the overall treatment effect in a meta-analysis of a few small studies with zero events. Contemporary Clinical Trials Communications 2018; 9: 98–107. doi: 10.1016/j.conctc.2017.11.012
- 37. Hartung J, Makambi KH. Reducing the Number of Unjustified Significant Results in Meta-analysis. Communications in Statistics - Simulation and Computation 2003; 32(4): 1179–1190. doi: 10.1081/sac-120023884
- 38. Paule R, Mandel J. Consensus Values and Weighting Factors. Journal of Research of the National Bureau of Standards 1982; 87(5): 377.
- 39. DerSimonian R, Kacker R. Random-effects model for meta-analysis of clinical trials: An update. Contemporary Clinical Trials 2007; 28(2): 105–114.
- 40. Zhang C, Chen M, Wang X. Statistical Methods for Quantifying Between-study Heterogeneity in Meta-analysis with Focus on Rare Binary Events. Statistics and Its Interface 2020; 13(4): 449–464. doi: 10.4310/sii.2020.v13.n4.a3
- 41.Hardy R, Thompson S. A likelihood approach to meta-analysis with random effects. Statistics in Medicine 1996; 15(6): 619–629. doi: [DOI] [PubMed] [Google Scholar]
- 42.Viechtbauer W Confidence intervals for the amount of heterogeneity in meta-analysis. Statistics in medicine 2007; 26: 37–52. doi: 10.1002/sim.2514 [DOI] [PubMed] [Google Scholar]
- 43.Jackson D, Bowden J, Baker R. Approximate confidence intervals for moment-based estimators of the between-study variance in random effects meta-analysis. Research synthesis methods 2015; 6: 372–382. doi: 10.1002/jrsm.1162 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Singh S, Loke YK, Furberg CD. Long-term Risk of Cardiovascular Events With Rosiglitazone. JAMA 2007; 298(10): 1189. [DOI] [PubMed] [Google Scholar]
- 45.Drazen JM, Morrissey S, Curfman GD. Rosiglitazone — Continued Uncertainty about Safety. New England Journal of Medicine 2007; 357(1): 63–64. [DOI] [PubMed] [Google Scholar]
- 46.Dahabreh IJ. Meta-analysis of rare events: an update and sensitivity analysis of cardiovascular events in randomized trials of rosiglitazone. Clinical Trials 2008; 5(2): 116–120. [DOI] [PubMed] [Google Scholar]
- 47.Nissen SE, Wolski K. Rosiglitazone revisited: an updated meta-analysis of risk for myocardial infarction and cardiovascular mortality. Archives of Internal Medicine 2010; 170(14): 1191–1201. [DOI] [PubMed] [Google Scholar]
- 48.Lane PW. Meta-analysis of incidence of rare events. Statistical Methods in Medical Research 2012; 22(2): 117–132. [DOI] [PubMed] [Google Scholar]
- 49.Böhning D, Mylona K, Kimber A. Meta-analysis of clinical trials with rare events. Biometrical Journal 2015; 57(4): 633–648. doi: 10.1002/bimj.201400184 [DOI] [PubMed] [Google Scholar]
- 50.McCarthy M. US regulators relax restrictions on rosiglitazone. BMJ 2013; 347: f7144.
- 51.Efron B. Empirical Bayes Methods for Combining Likelihoods. Journal of the American Statistical Association 1996; 91(434): 538–550.
- 52.Rhodes KM, Turner RM, Higgins JP. Predictive distributions were developed for the extent of heterogeneity in meta-analyses of continuous outcome data. Journal of Clinical Epidemiology 2015; 68(1): 52–60. doi: 10.1016/j.jclinepi.2014.08.012
- 53.Wang G, Cheng Y, Chen M, Wang X. Jackknife empirical likelihood confidence intervals for assessing heterogeneity in meta-analysis of rare binary event data. Contemporary Clinical Trials 2021; 107: 106440.
- 54.Nissen SE, Wolski K. Effect of Rosiglitazone on the Risk of Myocardial Infarction and Death from Cardiovascular Causes. New England Journal of Medicine 2007; 356(24): 2457–2471.
- 55.Crowley P. Interventions for preventing or improving the outcome of delivery at or beyond term. The Cochrane Database of Systematic Reviews 1997; 2: CD000170. doi: 10.1002/14651858.cd000170
- 56.Bellamy L, Casas JP, Hingorani AD, Williams D. Type 2 diabetes mellitus after gestational diabetes: a systematic review and meta-analysis. The Lancet 2009; 373(9677): 1773–1779. doi: 10.1016/s0140-6736(09)60731-5
- 57.Feng X, Zheng BS, Shi JJ, Qian J, He W, Zhou HF. Association of glutathione S-transferase P1 gene polymorphism with the susceptibility of lung cancer. Molecular Biology Reports 2012; 39(12): 10313–10323.
- 58.Hemkens LG, Ewald H, Gloy VL, et al. Colchicine for prevention of cardiovascular events. Cochrane Database of Systematic Reviews 2016. doi: 10.1002/14651858.cd011047.pub2
- 59.Sharma T, Guski LS, Freund N, Gøtzsche PC. Suicidality and aggression during antidepressant treatment: systematic review and meta-analyses based on clinical study reports. BMJ 2016: i65. doi: 10.1136/bmj.i65
- 60.Friede T, Röver C, Wandel S, Neuenschwander B. Meta-analysis of few small studies in orphan diseases. Research Synthesis Methods 2017; 8(1): 79–91.
- 61.Moreno E, Vázquez-Polo FJ, Negrín MA. Bayesian meta-analysis: The role of the between-sample heterogeneity. Statistical Methods in Medical Research 2017; 27(12): 3643–3657. doi: 10.1177/0962280217709837
Data Availability Statement
The data that support the findings of this study are included in the Supplementary Material.
