Skip to main content
Wiley Open Access Collection logoLink to Wiley Open Access Collection
. 2023 Jun 28;14(5):671–688. doi: 10.1002/jrsm.1647

Estimation of heterogeneity variance based on a generalized Q statistic in meta‐analysis of log‐odds‐ratio

Elena Kulinskaya 1,, David C Hoaglin 2
PMCID: PMC10946484  PMID: 37381621

Abstract

For estimation of heterogeneity variance τ2 in meta‐analysis of log‐odds‐ratio, we derive new mean‐ and median‐unbiased point estimators and new interval estimators based on a generalized Q statistic, QF, in which the weights depend on only the studies' effective sample sizes. We compare them with familiar estimators based on the inverse‐variance‐weights version of Q, QIV. In an extensive simulation, we studied the bias (including median bias) of the point estimators and the coverage (including left and right coverage error) of the confidence intervals. Most estimators add 0.5 to each cell of the 2×2 table when one cell contains a zero count; we include a version that always adds 0.5. The results show that: two of the new point estimators and two of the familiar point estimators are almost unbiased when the total sample size n250 and the probability in the Control arm (piC) is 0.1, and when n100 and piC is 0.2 or 0.5; for 0.1τ21, all estimators have negative bias for small to medium sample sizes, but for larger sample sizes some of the new median‐unbiased estimators are almost median‐unbiased; choices of interval estimators depend on values of parameters, but one of the new estimators is reasonable when piC=0.1 and another, when piC=0.2 or piC=0.5; and lack of balance between left and right coverage errors for small n and/or piC implies that the available approximations for the distributions of QIV and QF are accurate only for larger sample sizes.

Keywords: effective‐sample‐size weights, heterogeneity, inverse‐variance weights, random effects


Highlights.

What is already known

  • Use of inverse‐variance weights based on estimated variances in the conventional Q statistic makes it very difficult to approximate the null distribution of Q.

  • Related moment‐based estimators of the heterogeneity variance (τ2), such as the DerSimonian‐Laird estimator, have considerable bias.

What is new

  • We study estimation of τ2 based on QF, a generalized Q statistic with fixed effective‐sample‐size weights.

  • For point estimation of τ2 we consider both mean‐unbiased and novel median‐unbiased estimators. The new estimators are based on the Farebrother approximation to the distribution of QF.

  • In an extensive simulation study, we compared four new point estimators of τ2 and two new interval estimators with traditional estimators.

  • We provide practical guidelines for choosing appropriate point and interval estimators for LOR.

  • For median bias, which was not studied previously, an important finding is that the majority of the standard estimators have negative bias for larger sample sizes. This means that their median values of τ^2 are too low. Two new estimators consistently result in almost median‐unbiased estimation for moderate to large n.

Potential impact for RSM readers outside the authors' field

  • We provide practical guidelines for choosing appropriate point and interval estimators of τ2 in meta‐analysis of log‐odds‐ratio.

  • Some popular estimators of heterogeneity variance τ2 such as the DL and REML point estimators and PL intervals have unacceptable bias or coverage. We recommend instead new point and interval estimators of τ2 that use constant effective‐sample‐size weights. Some of these estimators are already implemented in metafor.

1. INTRODUCTION

As the measure of effect, meta‐analyses of binary outcomes from randomized trials most often use the odds ratio (OR), preferably analyzed as log‐odds‐ratio (LOR). The customary random‐effects analyses assess heterogeneity by using Cochran's Q statistic 1 and estimate the between‐study variance, τ2, for use in inverse‐variance weights. The resulting weights, based on estimated variances, underlie various shortcomings of that approach. Thus, for assessing heterogeneity and estimating τ2, we have studied QF, a version of Cochran's Q statistic in which the weights involve only the studies' arm‐level sample sizes. QF belongs to a class of generalized Q statistics, introduced by DerSimonian and Kacker, 2 in which the wi are arbitrary positive constants.

Our initial motivation came from random‐effects meta‐analyses of mean difference (MD) and standardized mean difference (SMD) 3 and, later, LOR, 4 where a weighted mean whose weights involved only those effective sample sizes performed well in estimating the overall effect.

Further developments produced QF and accurate approximations to its distribution function, for testing heterogeneity of MD, 5 SMD, 5 and three binary effect measures (LOR, log‐relative‐risk, and risk difference). 6 Those studies also examined approximations for the customary Q statistic (QIV) and investigated estimates for τ2 in MD and SMD.

In the present paper, we derive from QF new point and interval estimators of τ2 for meta‐analysis of LOR. In particular, we study mean‐ and median‐unbiased estimators. Similar new estimators of τ2 are available in the procedure rma in metafor. 7 However, the performance of these point and interval estimators of τ2 had not been evaluated by simulations. Therefore, we carried out an extensive simulation study of the bias of the point estimators and the coverage of the confidence intervals. For comparison we included familiar point and interval estimators of τ2.

Section 2 briefly reviews study‐level estimation of LOR. Section 3 reviews the generic random‐effects model and describes the Q statistic. Section 4 describes random‐effects models for LOR. Section 5 discusses approximations to the distributions of QF and QIV. Section 6 introduces new point and interval estimators of τ2 for LOR. Section 7 describes the simulation design, and Section 8 summarizes the results. Section 9 examines an example of meta‐analysis using LOR. Section 10 offers a summary and discussion. The Supporting Information provides further details on the interval estimators, the simulation results, and the example.

2. STUDY‐LEVEL ESTIMATION OF LOG‐ODDS‐RATIO

Analyses of log‐odds‐ratio usually adopt binomial distributions to model the numbers of events. In Study i (i=1,,K), XiT and XiC denote the numbers of events in the niT subjects in the Treatment arm and the niC subjects in the Control arm. Thus, we treat XiT and XiC as independent binomial variables:

XiTBinniTpiTandXiCBinniCpiC. (1)

The log‐odds‐ratio for Study i is

θi=logepiT1piCpiC1piTestimatedbyθ^i=logepiT1piCpiC1piT, (2)

where pij is an estimate of pij.

The customary estimators of piT and piC are p˜ij=Xij/nij (maximum likelihood). A reasonable alternative adds 0.5 to Xij and adds 1 to nij: p^ij=Xij+0.5/nij+1 eliminates bias of order 1/nij and provides the least biased estimator of log‐odds. 8

As inputs, a two‐stage meta‐analysis uses estimates of the θi (θ^i) and estimates of their variances (v^i2). The (conditional, given pij and nij) asymptotic variance of θ^i, derived by the delta method, is

vi2=Varθ^i=1niTpiT1piT+1niCpiC1piC, (3)

estimated by substituting p˜ij or p^ij for pij. The estimator of the variance using the p^ij and nij+1 instead of nij in (3) is unbiased in large samples, but Gart et al. 8 note that it overestimates the variance for small sample sizes. As we explain in Section 4, the unconditional variance depends on the mechanism generating the piC and the piT.

3. RANDOM‐EFFECTS MODEL AND THE Q STATISTIC.

In the most widely used method for two‐stage meta‐analysis, 9 Cochran's Q statistic serves as the basis for testing heterogeneity and estimating τ2, for use in the inverse‐variance weights, w^i=1/v^i2+τ^2. Q is a weighted sum of the squared deviations of the estimated effects θ^i from their weighted mean θ¯w=wiθ^i/wi:

Q=wiθ^iθ¯w2. (4)

In calculating Q, an estimate of τ2 is not yet available, so wi is simply 1/v^i2, the reciprocal of the estimated variance of θ^i (as in Cochran 1 ). We denote the result by QIV.

The process underlying the DerSimonian‐Laird 9 estimator of τ2 (τ^DL2) derives the expected value of QIV and rearranges the resulting expression to obtain τ2 in terms of EQIV and the vi2=Varθ^i. Substituting the observed value of QIV for EQIV and the v^i2 for the vi2 then yields τ^DL2. The convenient step of plugging in the v^i2, however, lacks justification; it inderlies the documented shortcomings of the DL method.

As an alternative, we base estimates of τ2 on QF, studied by Kulinskaya et al., 5 in which wi=n˜i=niCniT/ni, the effective sample size in Study i (ni=niC+niT).

In studying properties of estimators under the random‐effects model, we start by taking as given the observed study‐level effects θ^i; that is, we condition on those values. The model includes a distribution for the true θi, conventionally θiNθτ2. Taking expectations with respect to that model yields unconditional estimators, for comparison with the conditional ones.

The random‐effects model assumes that the θ^i are unbiased estimators of the θi and that the vi2=Varθ^iθi are the corresponding variances (i.e., the true conditional variances).

To develop estimators of τ2 based on QF, we define W=wi, qi=wi/W, and Θi=θ^iθ. In this notation, and expanding θ¯w, Equation (4) can be written as

Q=Wqi1qiΘi2i=jqiqjΘiΘj. (5)

We distinguish between the conditional distribution of Q (given the θi) and the unconditional distribution, and the respective moments of Θi. For instance, the conditional second moment of Θi, denoted by M2ic, is vi2; and the unconditional second moment, denoted by M2i, is EΘi2=Varθ^i=Evi2+τ2.

Under the REM, it is straightforward to obtain the first moment of QF as

EQF=Wqi1qiEΘi2=Wqi1qiEvi2+τ2. (6)

This expression is similar to Equation (4) in DerSimonian and Kacker 2 ; they use vi2+τ2 instead of the unconditional variance Evi2+τ2.

Our simulations yield an exact calculation of conditional central moments of LOR, following the implementation of Kulinskaya and Dollinger. 10

4. RANDOM‐EFFECTS META‐ANALYSIS OF LOG‐ODDS‐RATIO

The standard REM for LOR assumes that logitpiT=logitpiC+θi for θiNθτ2. The intercept αi=logitpiC may also be random. Further, piC and piT may be correlated. Equation (3) gives the conditional variance of θ^i. The full (unconditional) variance of θ^i depends on the generation mechanism for the piC and was derived in Kulinskaya et al. 11

Conventionally, the piC are assumed to be fixed. Then

Evi2=1niTpiT1piT+1niCpiC1piC+τ21+12niTpiT1piT12, (7)

where piT=expitαi+θ. For use in QF, this unconditional variance can be estimated by substituting p^iC for piC and p¯iT=expitα^i+θ^ for piT, where α^i=logitp^iC and θ^ is the estimated LOR. We refer to the estimate p¯iT of piT as model‐based. Alternatively, a naïve estimate uses p^iT. Such a naïve estimate has the advantage that it maintains the variance inflation of Evi2 in comparison with vi2 (using vi2 from Equation (3)). Whenever p^ij is used in Equation (7), nij+1 replaces nij.

We use these results in Sections 5 and 6.

5. APPROXIMATIONS TO THE DISTRIBUTIONS OF QF AND QIV

Our new interval estimators of τ2 (Section 6.2) involve the cumulative distribution function of QF. For LOR, QF is a quadratic form in asymptotically normal variables. The Farebrother algorithm, 12 applicable for quadratic forms in normal variables, provides a satisfactory approximation to the cdf for larger sample sizes (n100), though it may not behave well for small n. 6 To apply it, we plug in estimated variances. As in Reference [13], we denote the Farebrother approximation for Q with effective‐sample‐size weights by F SSW. We further distinguish between a “model‐based” version and a “naïve” version of F SSW, according to whether we use p¯iT or p^iT in Equation (7).

The null distribution of QIV is usually approximated by the chi‐square distribution with K1 degrees of freedom. For LOR, as also for both MD and SMD, this approximation is not accurate for small sample sizes. 14 For LOR, Kulinskaya and Dollinger 10 provided an improved approximation to the null distribution of QIV based on fitting two moments of the gamma distribution; we denote this approximation by KD. For comparison, we study point and interval estimators of τ2 based on the chi‐square and KD approximations.

6. POINT AND INTERVAL ESTIMATORS OF τ2 FOR LOR

6.1. Point estimators

The unconditional variance of θ^ in the customary fixed‐intercept model, Equation (7), can be written as a sum of two terms,

M2i=EVarθ^ipijnij+τ21+12niTpiT1piT12, (8)

where EVarθ^ipijnij=niTpiT1piT1+niCpiC1piC1, piC=expitαi, and piT=expitαi+θ. Rearranging the terms in Equation (6) with EΘi2=M2i and replacing piC and piT with p^iC and p^iT (or p¯iT) give the naïve (or model‐based) moment estimator of τ2.

τ^U2=Q/Wqi1qiv^i2qi1qiCi, (9)

where v^i2=niTp^iT1p^iT1+niCp^iC1p^iC1 and Ci=1+12niTp^iT1p^iT12. DerSimonian and Kacker 2 obtain a similar result; they use the conditional estimate, v^i2, instead of the unconditional estimate, Evi2^, obtaining

τ^M2=Q/Wqi1qiv^i2qi1qi. (10)

We study both estimators with effective‐sample‐size weights. With the conditional estimated variances in Equation (10), we denote the estimator by SSC (for “Sample Sizes Conditional”); with the unconditional estimated variances, as in Equation (9), it is SSU (for “Sample Sizes Unconditional”). These estimators differ by a term of order O1/ni and will be very similar for large sample sizes.

The estimators τ^U2 and τ^M2 arose from setting the observed value of Q equal to its expected value and solving for τ2. Instead of the expected value, one could use the median of the distribution of Q given τ2. 15 , 16 , 17 If the true (or approximate) cumulative distribution function is Fτ2, a point estimator of τ2 can be found as

τ^med2=max0τ2:FQτ2=0.5. (11)

In the Farebrother approximation to the distribution of QF (Section 5), one can use either the conditional estimated variances or the unconditional estimated variances. We denote the resulting estimators by SMC and SMU (“Sample sizes Median (Un)Conditional”), respectively.

Choosing between the “model‐based” and the “naïve” estimate of piT in M2i (8) yields “model‐based” and “naïve” versions of SSU and SMU: SSU model or SSU naïve and SMU model or SMU naïve.

The SSC and SMC estimators can be obtained from the procedure rma in metafor 7 by choosing as the method “GENQ” or “GENQM,” respectively, and specifying nbar weights.

For comparison, our simulations (Section 7) include four estimators that use inverse‐variance weights: DerSimonian‐Laird 9 (DL), restricted maximum‐likelihood (REML), Mandel‐Paule 18 (MP), and an estimator (KD) based on the work of Kulinskaya and Dollinger 10 and discussed by Bakbergenuly et al. 4 KD uses an improved non‐null first moment of Q and has better performance than most other estimators of τ2. In their review of methods for estimating the between‐study variance, Veroniki et al. 19 explain that DL is (by default) the most widely used, and they conclude that both REML and MP are better.

A perennial question involves whether analysts should add 1/2 to each of XiT,XiC,niTXiT,niCXiC only when one of them is zero, or in all studies. This is equivalent to using p˜ (whenever possible) or p^ (always), respectively, when estimating θi and vi2. To obtain evidence on this issue, we included the corresponding two versions, “only” and “always,” of DL, REML, MP, SSC, and SMC (we follow the prevalent practice of omitting “double‐zero” studies, in which two of those cell counts are zero).

Table 1 gives the full list of point estimators.

TABLE 1.

Point and interval estimators of τ2 in the simulations.

Estimator Description Add 1/2
“Only” “Always”
Point estimators
DL DerSimonian‐Laird, 9 a moment estimator based on χ2 approximation to distribution of QIV x x
REML Restricted Maximum Likelihood x x
MP Mandel‐Paule, 18 a moment estimator based on χ2 approximation to distribution of QIV x x
KD Kulinskaya‐Dollinger, 10 Bakbergenuly et al. 4 a moment estimator based on improved Gamma approximation to QIV x
SSC Equation (10), effective‐sample‐size weights, conditional variance (3) of θ^ x x
SSU model Equation (9), effective‐sample‐size weights, model‐based estimate of piT in unconditional variance (7) of θ^ x
SSU naïve Equation (9), effective‐sample‐size weights, naïve estimate of piT in unconditional variance (7) of θ^ x
SMC Median‐unbiased, Equation (11), effective‐sample‐size weights, conditional variance (3) of θ^ x x
SMU model Median‐unbiased, Equation (11), effective‐sample‐size weights, model‐based estimate of piT in unconditional variance (7) of θ^ x
SMU naïve Median‐unbiased, Equation (11), effective‐sample‐size weights, naïve estimate of piT in unconditional variance (7) of θ^ x
Confidence intervals
QP Q‐profile, Viechtbauer, 20 Appendix S1.2 x x
PL Profile Likelihood, Hardy & Thompson, 21 Appendix S1.1 x x
KD Kulinskaya–Dollinger, 10 Bakbergenuly et al., 4 Appendix S1.3, profiled improved Gamma approximation to distribution of QIV x
FPC Farebrother Profile, that is, profiled Farebrother approximation to distribution of QF, effective‐sample‐size weights, conditional variance (3) of θ^ x x
FPU model Farebrother Profile, effective‐sample‐size weights, model‐based estimate of piT in unconditional variance (7) of θ^ x
FPU naïve Farebrother Profile, effective‐sample‐size weights, naïve estimate of piT in unconditional variance (7) of θ^ x

6.2. Interval estimators

Straightforward use of the cumulative distribution function Fτ2 also yields a 1001α% confidence interval for τ2:

τ20:FQτ2α/2,1α/2.

We use both the conditional estimated variances and the unconditional estimated variances in the Farebrother approximation to QF (Section 5); we refer to the resulting profile estimators as FPC and FPU (“Farebrother Profile (Un)Conditional”) intervals. Jackson 22 introduced a similar approach using conditional variances. The FPC interval can be obtained from the confint procedure in metafor 7 for “GENQ” or “GENQM” objects that used nbar weights. For the FPU intervals, we further distinguish between a “model‐based” version and a “naïve” version, according to whether we use p¯iT or p^iT in M2i (8). Kulinskaya and Hoaglin 13 give the higher unconditional moments of θ^ required for the FPU intervals.

Our simulations (Section 7) also include the profile‐likelihood interval, 21 the Q‐profile interval, 20 and the KD interval. 10 Table 1 gives the full list, and Section S1 in the Supporting Information gives further details.

7. SIMULATION DESIGN

Our simulation design followed that described in Bakbergenuly et al. 4 Briefly, we varied five parameters: the overall true effect (θ), the between‐studies variance (τ2), the number of studies K, the studies' total sample size (n or n¯, the average sample size), and the probability in the control arm (piC). We kept the proportion of observations in the control arm (f) at 1/2.

The values of θ (0, 0.1, 0.5, 1, 1.5, and 2) aim to represent the range containing most values encountered in practice. LOR is a symmetric effect measure, so the sign of θ is not relevant.

The values of τ2 (0(0.1)1) systematically cover a reasonable range.

The numbers of studies (K=5, 10, and 30) reflect the sizes of many meta‐analyses and have yielded valuable insights in previous work.

In practice, many studies' total sample sizes fall in the ranges covered by our choices (n=20, 40, 100, and 250 when all studies have the same n, and n¯=30, 60, 100, and 160 when sample sizes vary among studies). The choices of sample sizes corresponding to n¯ follow a suggestion of Sánchez‐Meca and Marín‐Martínez, 23 who constructed the studies' sample sizes to have skewness 1.464, which they regarded as typical in behavioral and health sciences. For K=5, Table 2 lists the sets of five sample sizes. The simulations for K=10 and K=30 used each set of unequal sample sizes twice and six times, respectively.

TABLE 2.

Values of parameters in the simulations.

Parameter Equal study sizes Unequal study sizes
K (number of studies) 5, 10, 30

n or n¯ (average (individual) study size—total of the two arms)

For K=10 and K=30, the same set of unequal study sizes is used twice or six times, respectively.

20, 40, 100, 250

30 (12,16,18,20,84),

60 (24,32,36,40,168),

100 (64,72,76,80,208),

160 (124,132,136,140,268)

f (proportion of observations in the control arm) 1/2
piC (probability in the control arm) 0.1, 0.2, 0.5
θ (true value of LOR) 0, 0.1, 0.5, 1, 1.5, 2
τ2 (variance of random effects) 0 (0.1)1

The values of piC, 0.1, 0.2, and 0.5, provide a typical range of small to medium risks.

The values of piC and the true effect θi defined the probabilities piT, and the counts XiC and XiT were generated from the respective binomial distributions. We used a total of 10,000 repetitions for each combination of parameters (which we also call a situation). We discarded “double‐zero” and “double‐n” studies and reduced the observed value of K accordingly. Next, we discarded repetitions with K<3 and used the observed number of repetitions for analysis.

The simulations used R statistical software. 24 We used metafor for all methods of interest that it implemented. Table 1 gives the full list of point and interval estimators of τ2. The user‐friendly R programs implementing our methods are available on OSF. 25

8. SIMULATION RESULTS

Our eprint 26 reports the full simulation results. Here we describe the most important findings. We also refer to Figures S1 through S10 in the Supporting Information.

8.1. Bias in point estimation of τ2 for LOR

The relation of each estimator's bias to τ2 is roughly linear, with variation in intercept and, especially, in slope among the situations in the simulation. The slope varies most with n and K and to a lesser extent with piC, θ, and whether sample sizes are equal or unequal. In one of the more extreme examples, when piC=.1, θ=0, and K=5, the bias of SMC “only” when n=20 is +0.29 at τ2=0 and 0.22 at τ2=1, whereas when n=250, it is +0.09 at τ2=0 and +0.35 at τ2=1. The estimators' traces combine to form various patterns (Figure 1).

FIGURE 1.

FIGURE 1

Bias of estimators of between‐study variance of LOR (the “only” versions of DL, REML, MP, and SMC; SSC “always”; KD; and the model versions of SMU and SSU) versus τ2, for equal sample sizes n=20,40,100, and 250, piC=0.1, θ=0, and f=0.5. [Colour figure can be viewed at wileyonlinelibrary.com]

For small sample sizes, all estimators of τ2 have considerable bias, positive at τ2=0 and negative for larger τ2. The traces are roughly parallel. The pattern becomes tighter as K increases. When piC=0.1, a few estimators have bias that is almost constant in τ2 when n100. When n=250, the traces form a fan‐shaped pattern, in which the bias ranges from 0.02 to 0.07 when τ2=0 and from 0.2 to +0.35 when τ2=1. The flattening and fan‐shaped pattern occur at somewhat smaller n as θ increases. The fan‐shaped pattern appears by n=100 when piC=0.2 and by n=40 when piC=.5. For larger sample sizes, some estimators have negative trends (or no trend), and other have positive trends.

The best estimators, almost unbiased when n250 and piC=0.1, are MP “only,” KD, SSU model, and SSC “always.” The same estimators are recommended for larger values of piC, where they generally become less biased earlier. When piC=0.2 or piC=0.5, various transitions occur at smaller sample sizes (Figures S1 and S2).

For comparison, our simulations included the popular point estimators DL and REML. We focus on their “only” version. The trace for the bias of DL “only” is in the middle of the traces for the collection of estimators, or slightly lower. When n=250 or n¯=160, its negative bias, increasing in size as τ2 increases, stands out (usually alone).

The trace for REML “only” lies very close to that for DL, except that when n=250 (or, less often, n¯=160), it is close to zero instead of trending negative.

To summarize, we recommend MP “only,” KD, SSU model, and SSC “always”; and we emphatically recommend against using DL and REML.

8.2. Median bias of estimators of τ2 for LOR

We define median bias as Pτ^2τ2Pτ^2τ2. For a median, the median bias is zero.

For 0.1τ21, all estimators have negative bias for small to medium sample sizes, n40 (Figure 2). Interestingly, the median bias of most estimators is almost constant across the range of nonzero τ2 values. That level, however, varies with n and K. As n increases, the bias becomes less negative (but not small), but increasing K tends to make it more negative. DL “only,” REML “only,” and MP “only” depart from the almost constant pattern. As K increases, the trace of DL “only” declines steeply (e.g., from 0.20 at τ2=0.1 to 0.68 at τ2=1 when piC=0.2, θ=0, n=100, and K=30, Figure S3), and the trace of REML “only” declines less steeply (and only when piC=0.5, Figure S4). The trace of MP “only” tends to have a positive slope when n=40 and n=100 and θ1 and piC=0.1 and a negative slope when n=20 and n=40 and piC=0.5.

FIGURE 2.

FIGURE 2

Median bias of estimators of between‐study variance of LOR (the “only” versions of DL, REML, MP, and SSC; SMC “always”; KD; and the model versions of SMU and SSU) versus τ2, for equal sample sizes n=20,40,100 and 250, piC=0.1, θ=0 and f=0.5. [Colour figure can be viewed at wileyonlinelibrary.com]

When n=20 or n¯=30 and piC=0.1, none of the estimators are satisfactory. KD comes closest, with median bias around 0.25 at best. For piC=0.2 and piC=0.5, SMC “always” and SMC “only” are better; in a few situations one or both have small positive or negative bias.

The majority of the standard estimators have negative bias for larger values of n. This means that their median values of τ^2 are too low. When n100 for piC=0.1 and n40 for piC0.2, the new median‐unbiased estimators perform well. We recommend SMC “always” and SMU model. Both consistently result in almost median‐unbiased estimation across the range of piC values.

8.3. Coverage of interval estimators of τ2 for LOR

The results of our simulations show a few general patterns, but specific choices of interval estimators depend on values of parameters or combinations of parameters. In particular, we often separate K=30 from K=5 and K=10.

At τ2=0 coverage is always too high (essentially 1.00). For small n, it is roughly flat when τ20.1 and K=5 or K=10.

When piC=0.1 and θ=0, coverage of most estimators remains too high when n100 (Figure 3). KD and FPU model are close to 0.95 when n=100. When K=30 and n=20 or n=40, all estimators except KD break down: most have coverage <0.90 at τ2=0.1, and their coverage declines steeply as τ2 increases. Some form of breakdown persists for all θ. As θ increases, and n=20 or n=40, coverage at 0.1τ21 moves toward 0.95. KD and FPC model (for n40) seem the best choices.

FIGURE 3.

FIGURE 3

Coverage of 95% confidence intervals for between‐study variance of LOR (the “only” versions of PL, QP and FPC; KD; and the model version of FPU) versus τ2, for equal sample sizes n=20,40,100 and 250, piC=0.1, θ=0 , and f=0.5. [Colour figure can be viewed at wileyonlinelibrary.com]

When piC=0.2 or piC=0.5 (Figures S5 and S6), the picture is simpler. Coverage when n=20 or n=40 becomes closer to (or equals) 0.95 as piC increases, except for the “always” versions of PL, QP, and FPC when piC=0.2. For K=30, KD seems the best single choice.

Coverage of PL is usually too high, and in some situations it seems trapped at 1.00.

8.4. Left and right coverage error

It is often informative to approach estimation of coverage by separating its complement, the coverage error or “miscoverage,” into two parts, corresponding to whether the value of the parameter is to the left of the lower confidence limit or to the right of the upper confidence limit. Efron and Tibshirani in Reference [27], section 13.5 denote these by “miss left” and “miss right.” A CI that had a small miss‐left percentage would overcover on the left; if it had a large percentage, it would undercover. The confidence intervals that we studied aim to have miscoverage equal to 2.5% on each side (Section 6.2), but an approximation for the distribution of Q may not provide the desired balance, even if the overall coverage is close to 95%. Thus, our simulations included the miss‐left and miss‐right percentages.

In general, the miss‐left percentages are typically lower than 2.5% for small sample sizes and/or control‐arm probabilities, but they improve for n100 and for larger piC (Figures 4, S7, S8). A low miss‐left percentage means that the confidence interval includes an excess of low τ2 values.

FIGURE 4.

FIGURE 4

Miss‐left probability of PL, QP, KD, FPC, and FPU 95% confidence intervals for between‐study variance of LOR versus τ2, for equal sample sizes n=20,40,100 and 250, piC=0.1, θ=0, and f=0.5. Solid lines: the “only” versions of PL, QP and FPC; KD; and the model version of FPU. Dashed lines: the “always” version of FPC and the naïve version of FPU. [Colour figure can be viewed at wileyonlinelibrary.com]

The miss‐left percentages of KD, FPU naïve, and FPC “always” are closer to 2.5% from n=100, and miss‐left percentages of QP are typically lower than nominal when piC=0.1 but improve for larger piC.

The only exceptions to typically lower than nominal miss‐left percentages are the FPC “only” and FPU model intervals, whose percentages for n40 are often higher than nominal for K=30 and sometimes when K=10. KD occasionally has high percentages for high values of θ when K=30. PL miss‐left percentages are especially low for all sample sizes.

The miss‐right percentages are often higher than nominal, but they improve for larger n and piC (Figures 5, S9, S10). K=30 is especially challenging.

FIGURE 5.

FIGURE 5

Miss‐right probability of PL, QP, KD, FPC, and FPU 95% confidence intervals for between‐study variance of LOR versus τ2, for equal sample sizes n=20,40,100 and 250, piC=0.1, θ=0, and f=0.5. Solid lines: the “only” versions of PL, QP and FPC; KD; and the model version of FPU. Dashed lines: the “always” version of FPC and the naïve version of FPU. [Colour figure can be viewed at wileyonlinelibrary.com]

For piC=0.1, the miss‐right percentages of KD, QP “always,” and FPC “always” are closest to 2.5% when n=20 and K=5, but increase for larger n. When K=5 or 10, the percentages of FPC “only” and QP “only” are reasonably close to 2.5% when n40. The percentages of KD are consistently close to 2.5% for K=10. For K=30, the range of results is larger. The percentages of KD, QP “only,” and FPC “always” are close to 2.5% when n100. The percentages for all other estimators are typically higher, with the exception of FPU model, which has lower than nominal percentages. PL produces erratic percentages, from 0 to above 10%. For larger values of piC, the miss‐right percentages of all estimators, with the exception of PL, improve earlier, so that those of KD, QP “only,” and FPC “always” are all close to 2.5% when n=40 and piC=0.5, even when K=30.

Lack of balance in the miss‐left and miss‐right percentages for small sample sizes and/or control‐arm probabilities agrees with our findings 13 that the chi‐square approximation for QIV and the Farebrother approximation for QF are accurate only for larger sample sizes.

9. EXAMPLE: SMOKING CESSATION

Stead et al. 28 conducted a systematic review of clinical trials on the use of physician advice for smoking cessation. We use the data from the subgroup of interventions in which the treatment involved only one visit (Comparison 3.1.4, p. 54). The first version of the report was published in 2001. In an update, published in 2004, 17 studies included this comparison. The 2013 update included one more study, by Unrod (2007). For each study, Table S1 gives the number of subjects in the treatment and control arms and the number who were nonsmokers at the longest follow‐up time reported (either 6 months or 12 months). The definition of “nonsmoker” varied among the studies. Some studies required sustained abstinence, and others only asked about smoking status at that time. Stead et al. 28 analyzed relative risk. Kulinskaya and Hoaglin 6 showed that both OR and RR are reasonable effect measures for these data. Here we consider estimation of τ2 in two meta‐analyses of LOR.

The studies were mostly balanced, though two studies had substantially more subjects in the treatment arm. Sample sizes varied from 182 to 3128, with an average of 836 patients per study. The mean probabilities of smoking cessation in both arms were rather low, at 0.058 in the treatment arm and 0.043 in the control arm.

The standard IV‐based meta‐analysis of LOR for the original 17 studies gives θ^=0.4774 with standard error 0.1148 and p<0.0001 for the intervention effect (τ^MP2=0.0754). The fixed‐weights estimate of θ is higher, at 0.7127. In testing for heterogeneity, QIV=24.84, and the chi‐square approximation on 16 df provides a p‐value of .079 (I2=38.18%). The p‐values for the KD and F SSW naïve methods (which Kulinskaya and Hoaglin 6 recommend) are very close, at 0.035 and 0.038, respectively.

Table 3 shows striking differences: the SSC and SSU estimates of τ2 are more than twice as large as the standard IV estimates. This agrees with our simulations, which showed large positive biases of SSC and SSU for low values of τ2. All confidence intervals have zero as the lower limit (Table 4). This result contradicts the significant p‐values from the KD and F SSW naïve tests, but it can be explained by the inflated confidence level of all six confidence intervals near zero. Adding 1/2 to the numbers of events somewhat reduces the estimated values and the upper confidence limits. This would somewhat reduce the positive biases and the inflation in coverage. Because of the large sample sizes, there are only minor differences between conditional and unconditional model‐based or naïve estimators of τ2 and confidence intervals.

TABLE 3.

Estimated values of τ2 in the meta‐analysis of LOR for the data of Stead et al. on the use of physician advice for smoking cessation.

Example Add 1/2 DL REML MP KD SSC SSU model SSU naïve SMC SMU model SMU naïve

Stead et al.

(17 studies)

Always 0.0627 0.0696 0.0701 0.0887 0.1715 0.1725 0.1680 0.2014 0.2043 0.1972
Only 0.0675 0.0769 0.0754 0.1776 0.2098

Stead et al.

(18 studies)

Always 0.0514 0.0537 0.0576 0.0740 0.1634 0.1642 0.1601 0.1913 0.1939 0.1873
Only 0.0555 0.0597 0.0623 0.1697 0.1997

TABLE 4.

95% confidence intervals for the between‐study variance in the meta‐analysis of LOR for the data of Stead et al. on the use of physician advice for smoking cessation.

Example Add 1/2 QP PL KD FPC FPU model FPU naïve

Stead et al.

(17 studies)

Always [0, 0.3745] [0, 0.3445] [0, 0.3910] [0, 0.7583] [0, 0.7554] [0, 0.7389]
Only [0, 0.3957] [0, 0.3713] [0, 0.8075]

Stead et al.

(18 studies)

Always [0, 0.3267] [0, 0.2915] [0, 0.3419] [0, 0.6992] [0, 0.6969] [0, 0.6820]
Only [0, 0.3455] [0, 0.3139] [0, 0.7458]

Addition of the 18th study somewhat increased the p‐value for the standard Q test, to 0.107, and the KD p‐value to 0.052, but hardly affected the p‐values of the SSW‐based methods. The recommended F SSW naïve test rejects homogeneity at the 5% significance level, with p=0.033. The estimated τ2 values are somewhat smaller, and all confidence intervals still start at zero.

To better understand the properties of the estimation methods for very small values of piC in this example, we performed additional simulations with 1000 repetitions in the relevant intervals of values, τ20,0.30 and θ0.4,0.8, keeping the sample sizes as in the example and using the probabilities piC=XiC/niC. Figure 6 shows the results. In agreement with our main series of simulations, KD, MP “only”, SSC “always,” and SSU model are the least biased estimators of τ2, whereas SMC “only” is the least median‐biased. Table S2 gives summary statistics for these five estimators when τ2=0.05 and 0.2, and θ=0.5 and 0.7. Additionally, Table S2 provides summary statistics when θ=0.5 and τ2=0.2, but the control arm probabilities are less extreme, at piC=XiC/niC+.1.

FIGURE 6.

FIGURE 6

Bias, median bias, and coverage at 95% nominal confidence level of point and interval estimators of τ2 for the example of meta‐analysis by Stead et al. Solid lines: the “only” versions of DL, REML, MP, SSC, and SMC; KD; the model versions of SSU and SMU point estimators; the “only” versions of PL, QP, and FPC; FPU model, and KD intervals. Dashed lines: the “always” versions of SSC and SMC point estimators. [Colour figure can be viewed at wileyonlinelibrary.com]

Both lower and upper quartiles for the first four estimators of τ2 are comparable, though the upper quartile of SMC “only” is somewhat higher. However, the three effective‐sample‐size estimators have a much longer right tail than KD and MP “only.” This shape explains their much higher mean values. Comparing the medians, SMC “only” is almost median‐unbiased in all five scenarios, whereas SSC “always” and SSU model have the lowest medians, followed by MP “only” and then KD. This pattern is especially noticeable when the control arm probabilities are low. The lower the median, the more often these estimators would underestimate true heterogeneity. The differences between distributions decrease quite considerably for larger control arm probabilities, so the KD and MP “only” densities practically coincide, as do the densities of SSC “always” and SSU model (Figure 7). Overall, KD is the best mean‐unbiased estimator, and SMC “only” is the best median‐unbiased estimator. Which one to prefer depends very much on the context of the research. All confidence intervals provide reasonable coverage, though KD is sometimes too liberal, and PL too conservative (Figure 6).

FIGURE 7.

FIGURE 7

Density plots of the point estimators of τ2 for the example of meta‐analysis by Stead et al.: the “only” versions of MP and SMC; the “always” version of SSC; the model version of SMU; and KD. [Colour figure can be viewed at wileyonlinelibrary.com]

10. DISCUSSION

Estimation of heterogeneity variance τ2 is an important part of any meta‐analysis. Apart from its importance per se, it also affects the value of an estimated pooled effect and its estimated variability. In this paper, we studied estimation of τ2 based on QF, a generalized Q statistic with fixed effective‐sample‐size weights, and compared four proposed new point estimators of τ2 (SSC, SMC, SSU, and SMU) and two new interval estimators (FPC and FPU) with traditional estimators based on the Q statistic with inverse‐variance weights, QIV.

The new point estimators based on the expected value of QF involve estimates (v^i) of the variances of the θ^i. We obtained SSC from the conditional variances and SSU from the unconditional variances of the θ^i. Similarly, our novel point estimators based on the median of the Farebrother approximation to the distribution of QF are SMC and SMU, and our profile interval estimators are FPC and FPU.

For each unconditional estimator, we investigated two approaches to estimation of piT (which is used in the calculation of the second and fourth central moments of θ^i), “naïve” estimation of piT from XiT and niT and “model‐based” estimation using the fixed‐effects meta‐analysis estimate of the overall effect to obtain p¯iT from p^iC. Additionally, we considered two versions of the traditional and the unconditional estimators of τ2: adding 1/2 to the four cell counts always, or only when one of those counts is zero. Thus, our simulation study examined a total of 15 point estimators of τ2 and 9 interval estimators (Table 1).

For mean bias, the best estimators are MP “only,” KD, SSU model, and SSC “always.” The DL and REML estimators have the worst bias. For median bias, which was not studied previously, an important finding is that the majority of the standard estimators have negative bias for larger values of n. This means that their median values of τ^2 are too low. We recommend SMC “always” and SMU model. Both consistently result in almost median‐unbiased estimation across the range of piC values for moderate to large n.

Coverage of the confidence intervals depends on piC, n, and K. There is also some dependence on θ and τ2. However, KD, FPC “always,” and FPU model are the best choices overall. PL is the worst choice. We also considered left and right coverage errors separately. Typically, the miss‐left rates are below, and the miss‐right rates are above, the nominal 2.5% level, especially so for small sample sizes and/or control arm probabilities. Arguably, this is not bad, as it would reduce the width of the confidence intervals. 29 Lack of balance in the miss‐left and miss‐right rates agrees with our findings 13 that the chi‐square approximation for QIV and the Farebrother approximation for QF are accurate only for larger sample sizes.

Alongside the novel two‐stage estimators using sample‐size‐based weights, our simulation study involved a variety of established inverse‐variance‐based two‐stage estimators of heterogeneity. With a very few exceptions, such as MP and, especially, KD, their performance is not impressive. One‐stage meta‐analysis is often suggested as a better choice. However, we previously considered quality of one‐stage estimation in GLMM‐based binomial‐normal models. 11 , 30 and found it lacking. Also, it is useful to remember that GLMMs are still asymptotic methods that use normal likelihood.

Ideally, for meta‐analyses of log‐odds‐ratio, a single point estimator of τ2 would have acceptably low bias (and, perhaps, median bias) for a substantial region of piC, θ, n, K, and τ2. Similarly, a single confidence interval would have close to nominal coverage of τ2. Our results show some progress toward those goals, by demonstrating advantages of new estimators in some situations and, importantly, by demonstrating that some popular estimators have unacceptable bias or coverage. The goals, however, require further research. In the interim, effective education can help users avoid methods that perform poorly.

AUTHOR CONTRIBUTIONS

Elena Kulinskaya: Conceptualization; funding acquisition; methodology; software; visualization; writing – original draft; writing – review and editing. David C. Hoaglin: Conceptualization; investigation; methodology; writing – original draft; writing – review and editing.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

Supporting information

Data S1: Supporting Information.

JRSM-14-671-s001.pdf (310.8KB, pdf)

ACKNOWLEDGMENTS

The work by E. Kulinskaya was supported by the Economic and Social Research Council [grant number ES/L011859/1].

Kulinskaya E, Hoaglin DC. Estimation of heterogeneity variance based on a generalized Q statistic in meta‐analysis of log‐odds‐ratio. Res Syn Meth. 2023;14(5):671‐688. doi: 10.1002/jrsm.1647

DATA AVAILABILITY STATEMENT

Our full simulation results are available as an arXiv e‐print (arXiv:2208.00707v1). The user‐friendly R program implementing all studied estimators of heterogeneity variance \$\tau^2\$ in meta‐analysis of log‐odds‐ratio with related confidence intervals is available at https://osf.io/5n3vd

REFERENCES

  • 1. Cochran WG. The combination of estimates from different experiments. Biometrics. 1954;10(1):101‐129. [Google Scholar]
  • 2. DerSimonian R, Kacker R. Random‐effects model for meta‐analysis of clinical trials: an update. Contemp Clin Trials. 2007;28(2):105‐114. [DOI] [PubMed] [Google Scholar]
  • 3. Bakbergenuly I, Hoaglin DC, Kulinskaya E. Estimation in meta‐analyses of mean difference and standardized mean difference. Stat Med. 2020;39(2):171‐191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Bakbergenuly I, Hoaglin DC, Kulinskaya E. Methods for estimating between‐study variance and overall effect in meta‐analysis of odds ratios. Res Synth Methods. 2020;11(3):426‐442. doi: 10.1002/jrsm.1404 [DOI] [PubMed] [Google Scholar]
  • 5. Kulinskaya E, Hoaglin DC, Bakbergenuly I, Newman J. A Q statistic with constant weights for assessing heterogeneity in meta‐analysis. Res Synth Methods. 2021;12:711‐730. doi: 10.1002/jrsm.1491 [DOI] [PubMed] [Google Scholar]
  • 6. Kulinskaya E, Hoaglin DC. Simulations for the Q statistic with constant and inverse variance weights for binary effect measures. 2022. arXiv:2206.08907v1 [statME].
  • 7. Viechtbauer W. Conducting meta‐analyses in R with the metafor package. J Stat Softw. 2010;36:3 https://www.metafor-project.org. [Google Scholar]
  • 8. Gart JJ, Pettigrew HM, Thomas DG. The effect of bias, variance estimation, skewness and kurtosis of the empirical logit on weighted least squares analyses. Biometrika. 1985;72(1):179‐190. [Google Scholar]
  • 9. DerSimonian R, Laird N. Meta‐analysis in clinical trials. Control Clin Trials. 1986;7(3):177‐188. [DOI] [PubMed] [Google Scholar]
  • 10. Kulinskaya E, Dollinger MB. An accurate test for homogeneity of odds ratios based on Cochran's Q‐statistic. BMC Med Res Methodol. 2015;15(1):49. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Kulinskaya E, Hoaglin DC, Bakbergenuly I. Exploring consequences of simulation design for apparent performance of methods of meta‐analysis. Stat Methods Med Res. 2021;30(7):1667‐1690. PMID: 34110941. doi: 10.1177/09622802211013065 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Farebrother RW, Algorithm AS 204: the distribution of a positive linear combination of χ 2 random variables. J R Stat Soc Ser C. 1984;33(3):332‐339. [Google Scholar]
  • 13. Kulinskaya E, Hoaglin DC. On the Q statistic with constant weights in meta‐analysis of binary outcomes. 2022. Available at Research Square. doi: 10.21203/rs.3.rs-2121915/v1 [DOI] [PMC free article] [PubMed]
  • 14. Viechtbauer W. Hypothesis tests for population heterogeneity in meta‐analysis. Br J Math Stat Psychol. 2007;60:29‐60. [DOI] [PubMed] [Google Scholar]
  • 15. Bakbergenuly I, Hoaglin DC, Kulinskaya E. On the Q statistic with constant weights for standardized mean difference. Br J Math Stat Psychol. 2021;75:444‐465. doi: 10.1111/bmsp.12263 [DOI] [PubMed] [Google Scholar]
  • 16. Brown GW. On small‐sample estimation. Ann Math Stat. 1947;18:582‐585. [Google Scholar]
  • 17. Viechtbauer W. Median‐unbiased estimators for the amount of heterogeneity in meta‐analysis. Paper presented at: 9th European Congress of Methodology. European Association of Methodology. 2021.
  • 18. Mandel J, Paule RC. Interlaboratory evaluation of a material with unequal numbers of replicates. Anal Chem. 1970;42(11):1194‐1197. [Google Scholar]
  • 19. Veroniki AA, Jackson D, Viechtbauer W, et al. Methods to estimate the between‐study variance and its uncertainty in meta‐analysis. Res Synth Methods. 2016;7:55‐79. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Viechtbauer W. Confidence intervals for the amount of heterogeneity in meta‐analysis. Stat Med. 2007;26(1):37‐52. [DOI] [PubMed] [Google Scholar]
  • 21. Hardy RJ, Thompson SG. A likelihood approach to meta‐analysis with random effects. Stat Med. 1996;15:619‐629. [DOI] [PubMed] [Google Scholar]
  • 22. Jackson D. Confidence intervals for the between‐study variance in random effects meta‐analysis using generalised Cochran heterogeneity statistics. Res Synth Methods. 2013;4(3):220‐229. doi: 10.1002/jrsm.1081 [DOI] [PubMed] [Google Scholar]
  • 23. Sánchez‐Meca J, Marín‐Martínez F. Testing the significance of a common risk difference in meta‐analysis. Comput Stat Data Anal. 2000;33(3):299‐313. [Google Scholar]
  • 24. R Core Team . R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2016. [Google Scholar]
  • 25. Kulinskaya E, Hoaglin DC. R programs for estimation of heterogeneity variance τ2 for log‐odds‐ratio using the generalised Q statistic with constant and inverse variance weights. OSF. 2022. https://osf.io/5n3vd
  • 26. Kulinskaya E, Hoaglin DC. Simulations for estimation of heterogeneity variance τ2 in constant and inverse variance weights meta‐analysis of log‐odds‐ratios. 2022. arXiv:2208.00707v1 [stat.ME]. doi: 10.48550/arxiv.2208.00707 [DOI]
  • 27. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boca Raton, Florida: Chapman & Hall/CRC. 1993. [Google Scholar]
  • 28. Stead L, Buitrago D, Preciado N, Sanchez G, Hartmann‐Boyce J, Lancaster T. Physician advice for smoking cessation. Cochrane Database Syst Rev. 2013. Art. No. CD000165. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Jackson D, Bowden J. Confidence intervals for the between‐study variance in random‐effects meta‐analysis using generalised heterogeneity statistics: should we use unequal tails? BMC Med Res Methodol. 2016;16:116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Bakbergenuly I, Kulinskaya E. Meta‐analysis of binary outcomes via generalized linear mixed models: a simulation study. BMC Med Res Methodol. 2018;18:70. doi: 10.1186/s12874-018-0531-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data S1: Supporting Information.

JRSM-14-671-s001.pdf (310.8KB, pdf)

Data Availability Statement

Our full simulation results are available as an arXiv e‐print (arXiv:2208.00707v1). The user‐friendly R program implementing all studied estimators of heterogeneity variance \$\tau^2\$ in meta‐analysis of log‐odds‐ratio with related confidence intervals is available at https://osf.io/5n3vd


Articles from Research Synthesis Methods are provided here courtesy of Wiley

RESOURCES