Abstract
When conducting systematic reviews and meta-analyses of continuous outcomes, the mean difference (MD) and standardized mean difference (SMD) are 2 commonly used effect measures. The SMD is motivated by scenarios where the studies collected in a systematic review do not report the continuous measures on the same scale. The standardization process transforms the MDs into unit-free measures that can be synthesized across studies. As such, some evidence synthesis researchers tend to prefer the SMD over the MD. However, other researchers have concerns about the interpretability of the SMD, and the standardization process could also yield additional heterogeneity between studies. In this paper, we use simulation studies to illustrate that, in a scenario where the continuous measures are on the same scale, the SMD could have considerably poorer performance compared with the MD in some cases. The simulations compare the MD and SMD in various settings, including cases where the normality assumption of continuous measures does not hold. We conclude that although the SMD remains useful for evidence synthesis of continuous measures on different scales, the SMD could have substantially greater biases, greater mean squared errors, and lower coverage probabilities of CIs than the MD. The MD is generally more robust to the violation of the normality assumption for continuous measures. In scenarios where continuous measures are inherently comparable or can be transformed to a common scale, the MD is the preferred choice for an effect measure.
Keywords: bias, effect measure, mean difference, meta-analysis, standardized mean difference
Introduction
A systematic review is a comprehensive synthesis of research literature conducted according to predefined criteria and following internationally recognized standards or methods. There are different types of systematic reviews: those focusing on qualitative studies, those focusing on quantitative studies, and mixed methods reviews, which incorporate both.
Meta-analysis is commonly used to integrate evidence from multiple studies in a systematic review.1–3 It is specifically used in systematic reviews of quantitative studies and in the quantitative synthesis part of mixed methods reviews. The findings from individual quantitative studies are typically quantified by a specific effect measure, which is a basic element of meta-analysis; the meta-analysis then statistically combines the results of multiple studies assessing similar outcomes. Various types of effect measures exist for continuous outcomes, which are commonly used in clinical, educational, and psychological studies.
For example, obesity is usually determined by body mass index (BMI), which is a continuous measure based on height and weight. The mean difference (MD) and standardized mean difference (SMD) are popular choices for analyzing the difference between the experimental group and the control group. The MD quantifies the difference in means between 2 groups, and it generally has a natural unit corresponding to the scale of the continuous measure; for instance, the unit of change in BMI is kg/m². In a meta-analysis, different studies may use different scales to measure the continuous outcomes (eg, some studies on obesity may report changes in BMI in the unit of kg/m², whereas others may report changes in weight in the unit of kg), so the MDs are not directly comparable across studies. To solve this problem, researchers often opt for the SMD as the effect measure.4–6 The SMD is calculated by dividing the MD between the experimental and control groups by a common standard deviation (SD). Consequently, the outcome measures represented by SMDs have uniform units across studies, and they can be synthesized in a meta-analysis.
The evidence-based medicine community has debated the choice between the MD and SMD for decades.7–10 In general, when all studies report on the same scale, both the MD and SMD can be used.11 Some researchers recommend the MD instead of the SMD, as the MD has a meaningful unit,12 such as kg in weight loss or gain. Because the SMD includes the SD in the calculation and is measured in statistical units, it may be difficult to interpret from a clinical perspective.13 However, when the unit is unfamiliar, the SMD is widely used to facilitate comparisons across studies because the SMD is unit-free. The magnitude of the SMD is frequently evaluated based on Cohen’s criterion, where SMDs of 0.2, 0.5, and 0.8 serve as cutoff points to classify effect sizes as negligible, small, medium, or large.1,6,14
Notably, Cohen’s thresholds are based on conventions that were originally proposed in the context of psychological research. As such, they may not be universally applicable or appropriate across all fields of study and should be used only as rules of thumb.15 Furthermore, it is critical to understand the nuances between different estimators of the SMD.16 Cohen’s $d$ is calculated using a pooled SD, Glass’s $\Delta$ uses the control group’s SD, and Hedges’ $g$ is an adjusted form of Cohen’s $d$ that corrects for small sample sizes. Each of these estimators has specific contexts where it is most appropriately applied.6 Readers seeking more in-depth information may refer to the review articles by Andrade,6 Lin and Aloe,16 and Chapter 6.5 of the Cochrane Handbook,11 which provide detailed explorations and examples of various SMD estimators.
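For readers who want to check the arithmetic, the following R sketch computes the three estimators from hypothetical two-group summary data (the numbers are illustrative only; the Hedges correction uses the common approximation $1 - 3/(4m - 1)$ with $m = n_1 + n_0 - 2$):

```r
# A minimal sketch of the three SMD estimators, using hypothetical summary data
smd_estimators <- function(m1, sd1, n1, m0, sd0, n0) {
  sp <- sqrt(((n1 - 1) * sd1^2 + (n0 - 1) * sd0^2) / (n1 + n0 - 2))  # pooled SD
  d <- (m1 - m0) / sp                              # Cohen's d: standardize by pooled SD
  delta <- (m1 - m0) / sd0                         # Glass's Delta: control-group SD
  g <- (1 - 3 / (4 * (n1 + n0 - 2) - 1)) * d       # Hedges' g: small-sample correction of d
  c(cohen_d = d, glass_delta = delta, hedges_g = g)
}
smd_estimators(m1 = -1.2, sd1 = 1.9, n1 = 25, m0 = -0.4, sd0 = 1.6, n0 = 25)
```

With small samples and unequal group SDs, the three estimators can differ noticeably, which is why reports should state which estimator was used.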
Although the calculation and properties of SMD have been extensively studied in the statistical literature, researchers frequently encounter misconceptions and may make mistakes when applying it in practice. Empirical studies have highlighted the prevalence of disagreements and errors in meta-analyses that utilize SMDs, with discrepancies often exceeding the effect size of commonly used treatments.17,18 For instance, one study found that in 37% of meta-analyses that reported a result as an SMD, the result could not be replicated within a margin of 0.1 for at least one of the selected trials.17 This discrepancy often stemmed from errors in data such as patient numbers, means, SDs, and the direction of the effect estimate. Moreover, another study on inter-observer variation in data extraction for meta-analyses using SMDs revealed a 53% agreement at the trial level and only 31% at the meta-analysis level.18 This study highlighted key reasons for disagreement, including differences in the selection of time points, scales, control groups, and calculation types. It also emphasized that observer errors could lead to different conclusions than those published in the original reviews.
Additionally, our recent work has revealed inconsistencies in the current literature, including books and software packages, regarding the methods for handling SMDs, which can potentially hinder research reproducibility.16 It is also important to note that SMDs can be calculated from various types of MDs and SDs, such as baseline measurements, changes during the study period, or end-of-study measurements. However, many reports tend to oversimplify interpretations and provide insufficient details in this regard.19 In addition, the commonly used SMD is criticized for dividing estimates by sample SDs, which are themselves subject to sampling errors.20 More importantly, the SMD exposes the findings to various types of distortion, as studies can differ significantly in their patient inclusion criteria, leading to concerns about generalizability. For instance, when investigating the effects of antidepressive drugs, studies with narrow inclusion criteria, such as “severe depression,” are expected to yield smaller SDs and consequently larger SMDs compared with studies involving patients with a broader range of depression severity, such as “severe to moderate depression.”
In this article, we focus on comparing the performance of the MD and SMD when the continuous measures are on the same scale. The performance is assessed under various distributional assumptions for the continuous measures. In most applications of evidence synthesis, the continuous measures are assumed to be normally distributed, but this assumption is rarely assessed.21 For example, due to potential outliers or extreme values, continuous outcome measures may follow a heavy-tailed distribution. Some medical studies may include only patients with severe depression scores, removing values below a cutoff on the depression scale; thus, the distributions might be skewed.22 We design simulation studies to examine how the MD and SMD perform when the continuous outcomes follow these various types of distributions.
Methods
Mean difference and standardized mean difference in a meta-analysis
Suppose a meta-analysis includes studies indexed by $i = 1, \ldots, N$. The group index is $j = 0$ and 1, representing the control and experimental groups, respectively. Let the sample size be $n_{ij}$ for group $j$ in study $i$. Suppose $y_{ijk}$ is the continuous measure of subject $k$ in group $j$ of study $i$ for $k = 1, \ldots, n_{ij}$. It is generally not reported in published articles unless individual participant data are available. For all studies, we assume $y_{ijk}$ to follow a distribution with mean $\mu_{ij}$ and variance $\sigma_i^2$. Here, $\mu_{ij}$ represents the true mean for group $j$ in study $i$, and $\sigma_i^2$ is the common variance for both groups.
The true MD in study $i$ is defined as $\delta_i = \mu_{i1} - \mu_{i0}$, and the true SMD is $\theta_i = \delta_i / \sigma_i$. Published articles commonly report sample means and sample variances for continuous outcomes, denoted by $\bar{y}_{ij}$ and $s_{ij}^2$, respectively, for group $j$ in study $i$. With these summary data, the MD can be estimated as $\hat{\delta}_i = \bar{y}_{i1} - \bar{y}_{i0}$, and its variance is estimated as $s_i^2 (1/n_{i0} + 1/n_{i1})$, where $s_i^2$ is the pooled sample variance, $s_i^2 = [(n_{i0} - 1) s_{i0}^2 + (n_{i1} - 1) s_{i1}^2] / (n_{i0} + n_{i1} - 2)$. The SMD can be estimated as $\hat{\theta}_i = \hat{\delta}_i / s_i$; its variance is approximated as $1/n_{i0} + 1/n_{i1} + \hat{\theta}_i^2 / [2(n_{i0} + n_{i1})]$. This estimator is known as Cohen’s $d$.14
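As a concrete illustration of these formulas, the R sketch below computes $\hat{\delta}_i$, $\hat{\theta}_i$, and their variance estimates from hypothetical summary data for a single study (the inputs are assumptions for illustration, not data from any real study); these study-level estimates and variances are exactly the inputs a meta-analysis needs.

```r
# MD and SMD (Cohen's d) with their variance estimates for one study
md_smd <- function(m1, s1, n1, m0, s0, n0) {
  sp2 <- ((n0 - 1) * s0^2 + (n1 - 1) * s1^2) / (n0 + n1 - 2)  # pooled variance
  md <- m1 - m0
  var_md <- sp2 * (1 / n0 + 1 / n1)
  smd <- md / sqrt(sp2)                                       # Cohen's d
  var_smd <- 1 / n0 + 1 / n1 + smd^2 / (2 * (n0 + n1))        # large-sample approximation
  data.frame(md, var_md, smd, var_smd)
}
md_smd(m1 = -2.1, s1 = 1.8, n1 = 40, m0 = -1.0, s0 = 2.0, n0 = 38)
```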
Other estimators are available for the SMD,16 such as Hedges’ $g$.23 Although Hedges’ $g$ can remove the bias caused by small sample sizes for an individual study, this property of bias correction is not warranted for the synthesized SMD estimates due to the intrinsic correlation between the SMD estimates and their standard errors.24 To elaborate, Hedges’ $g$ is unbiased within a single-study context. A common but overly simplistic argument regarding the unbiasedness of the overall effect estimate $\hat{\theta} = \sum_i w_i g_i / \sum_i w_i$ in a meta-analysis is that $E(\hat{\theta}) = \sum_i w_i E(g_i) / \sum_i w_i = \theta$, treating the weights as fixed. However, this overlooks the fact that the weights $w_i$ (inverses of variances) are not fixed values but estimates subject to sampling error. The association between Hedges’ $g$ and the corresponding variances may be strong, particularly when the sample sizes are small, complicating the calculation of the meta-estimate’s expectation. Simulations have shown that Hedges’ $g$ does not necessarily produce less biased meta-estimates compared with Cohen’s $d$.16 Given these considerations, this article opts for Cohen’s $d$ as the estimator for the SMD.
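The point about estimated weights can be illustrated with a quick simulation of our own (not from the article): we pool Hedges’ $g$ across studies with inverse-variance weights that are themselves estimated, using simple common-effect pooling for transparency. Any systematic deviation of the pooled estimate from the true $\theta$ reflects the correlation between the estimates and their weights discussed above.

```r
# Illustrative simulation: pooling Hedges' g with estimated inverse-variance weights
set.seed(1)
theta <- 0.5; n <- 10; N <- 5; R <- 5000  # per-arm size n, N studies, R replicates
pooled <- replicate(R, {
  g <- vi <- numeric(N)
  for (i in 1:N) {
    y0 <- rnorm(n); y1 <- rnorm(n, mean = theta)
    sp <- sqrt(((n - 1) * var(y0) + (n - 1) * var(y1)) / (2 * n - 2))
    d <- (mean(y1) - mean(y0)) / sp
    g[i] <- (1 - 3 / (4 * (2 * n - 2) - 1)) * d   # Hedges' g
    vi[i] <- 2 / n + g[i]^2 / (4 * n)             # estimated variance of g
  }
  sum(g / vi) / sum(1 / vi)  # common-effect pooling with estimated weights
})
mean(pooled) - theta  # deviation from theta (includes Monte Carlo error)
```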
Simulation designs
We perform simulation studies to compare the results of the MD and SMD when the latent continuous measures are from different distributions. To this end, we need to define an overall MD and SMD, denoted by $\delta$ and $\theta$, respectively. The random-effects model is often employed to account for between-study heterogeneity, and it assumes that the study-specific true effect measures, $\delta_i$ and $\theta_i$, follow normal distributions with means $\delta$ and $\theta$, respectively, across studies. Of note, such distributions differ from the within-study distributions for the individual participant data $y_{ijk}$. To enable direct comparisons between the MD and SMD within the same meta-analyses, it is necessary to assume uniform variances, $\sigma_i^2$, across studies, which we denote as $\sigma^2$. Absent this assumption, generating meta-analyses with study-specific true MDs, $\delta_i$, distributed as $N(\delta, \tau^2)$, would result in true SMDs, $\theta_i = \delta_i / \sigma_i$, following $N(\delta / \sigma_i, \tau^2 / \sigma_i^2)$, which are not identically distributed. This violates the assumptions of the random-effects model based on SMDs and could lead to unfair comparisons between MDs and SMDs.
Our simulations are particularly interested in the performance of the MD and SMD under various distributional assumptions for the latent continuous outcome measure $y_{ijk}$. We consider a normal distribution, a $t$ distribution, a skewed distribution, and a distribution with extreme outliers. The location–scale transformation is used for the continuous outcome measures: $y_{ijk} = \mu_{ij} + c \, \epsilon_{ijk}$. The constant $c$ can vary across distributions, and it is used to adjust the distributions’ variances. The continuous random variable $\epsilon_{ijk}$ is assumed to follow different distributions. To make fair comparisons, we design the various distributions to have the same mean $\mu_{ij}$ and variance $\sigma^2$; that is, $E(y_{ijk}) = \mu_{ij}$ and $\mathrm{Var}(y_{ijk}) = \sigma^2$.
The true overall MD $\delta$ was set to 0 and 1. The common SD within studies, $\sigma$, was set to 1 and 2. Consequently, the true overall SMD was $\theta = \delta / \sigma$. The number of studies in a meta-analysis, $N$, was set to 5 and 20, representing relatively small and large numbers of studies, respectively. For the total sample sizes $n_i$ within studies, we considered 2 sets of values: 10–50 (relatively small sample sizes) and 100–300 (relatively large sample sizes). For $N = 20$, these sets of values were repeated 4 times. The ratio of sample sizes in the experimental and control groups was 1:1, so each group had $n_i / 2$ samples. To control reasonable extents of heterogeneity based on the $I^2$ statistic, the between-study SD $\tau$ for the MDs was set to 0.2 for the set of relatively large sample sizes and 0.4 for the set of relatively small sample sizes. Without loss of generality, the mean of subjects’ latent continuous measures in the control group was set to 0; thus, the mean in the treatment group was $\mu_{i1} = \delta_i$.
Recall that we used the location–scale transformation for the continuous outcome measures of individual participants: $y_{ijk} = \mu_{ij} + c \, \epsilon_{ijk}$. Four simulation settings were considered for the distribution of the random variable $\epsilon_{ijk}$; the constant $c$ was specific to each setting because $\mathrm{Var}(y_{ijk}) = c^2 \, \mathrm{Var}(\epsilon_{ijk}) = \sigma^2$, so all 4 distributions share the same mean and the same variance. The 4 distributions are as follows.
Setting 1 (normal distribution): $\epsilon_{ijk} \sim N(0, 1)$, with $c = \sigma$.
Setting 2 ($t$ distribution): $\epsilon_{ijk}$ follows a $t$ distribution, with $c$ chosen so that $\mathrm{Var}(c \, \epsilon_{ijk}) = \sigma^2$.
Setting 3 (skewed distribution): $\epsilon_{ijk}$ follows a mixture distribution with a long right tail, with $c$ chosen so that $\mathrm{Var}(c \, \epsilon_{ijk}) = \sigma^2$.
Setting 4 (extreme outliers): $\epsilon_{ijk}$ follows a two-component mixture distribution in which most values are close to 0 and a small fraction are extreme outliers in both directions, with $c$ chosen so that $\mathrm{Var}(c \, \epsilon_{ijk}) = \sigma^2$.
In Setting 1, the random variable $\epsilon_{ijk}$ has a standard normal distribution. Setting 2 uses the $t$ distribution to represent heavy-tailed errors. In Setting 3, a mixture component ensures that $\epsilon_{ijk}$ has a long right tail. In Setting 4, most of the $\epsilon_{ijk}$’s were close to 0 due to one mixture component, but another component ensured that a small number of extreme outliers existed in both directions. Figure 1 visualizes these 4 distributions.
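To make the data-generating step concrete, the sketch below draws $\epsilon_{ijk}$ under each setting. The degrees of freedom and mixture parameters are our assumptions chosen purely for illustration (the article's exact values are not reproduced here), and we rescale $\epsilon$ empirically, whereas the article derives the constant $c$ analytically for each distribution.

```r
# Illustrative generator of the latent errors; the t degrees of freedom and the
# mixture weights/SDs below are our assumptions, not the article's actual values.
gen_eps <- function(n, setting) {
  switch(as.character(setting),
    "1" = rnorm(n),                                    # Setting 1: standard normal
    "2" = rt(n, df = 4),                               # Setting 2: heavy tails (assumed df)
    "3" = ifelse(runif(n) < 0.9, rnorm(n),             # Setting 3: long right tail
                 rexp(n, rate = 0.5)),                 #   (assumed mixture)
    "4" = ifelse(runif(n) < 0.95, rnorm(n, sd = 0.5),  # Setting 4: extreme outliers
                 rnorm(n, sd = 5))                     #   in both directions (assumed)
  )
}
# Latent measures with mean mu and SD sigma; empirical rescaling via scale()
# stands in for the analytic constant c used in the article.
gen_y <- function(n, mu, sigma, setting) mu + sigma * scale(gen_eps(n, setting))[, 1]
```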
Figure 1:
Probability distribution functions of continuous measures generated from the 4 distributions: Setting 1 (A), Setting 2 (B), Setting 3 (C), and Setting 4 (D).
We used a factorial design, yielding 64 simulation settings in total (2 choices of $\delta$ × 2 choices of $\sigma$ × 2 choices of $N$ × 2 choices of within-study sample sizes × 4 choices of distributions). For each setting, we generated 1000 replicates of meta-analyses. For the $N$ studies in each meta-analysis, we generated the study-specific MD $\delta_i$ from $N(\delta, \tau^2)$; thus, the study-specific SMD was $\theta_i = \delta_i / \sigma$. We generated the random variable $\epsilon_{ijk}$ from Settings 1–4 of distributions, and each individual’s continuous outcome measure was obtained from $y_{ijk} = \mu_{ij} + c \, \epsilon_{ijk}$ for group $j = 0, 1$. We calculated the sample means $\bar{y}_{ij}$ and sample variances $s_{ij}^2$. The estimated MD $\hat{\delta}_i$, SMD $\hat{\theta}_i$, and their corresponding sample variances were calculated based on $\bar{y}_{i0}$, $\bar{y}_{i1}$, $s_{i0}^2$, and $s_{i1}^2$.
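For reference, a minimal sketch of the factorial grid described above (the variable labels are ours):

```r
# The 2 x 2 x 2 x 2 x 4 = 64 factorial simulation settings
grid <- expand.grid(
  delta    = c(0, 1),                # true overall MD
  sigma    = c(1, 2),                # common within-study SD
  N        = c(5, 20),               # number of studies per meta-analysis
  size_set = c("10-50", "100-300"),  # within-study sample sizes (tau = 0.4 or 0.2)
  setting  = 1:4                     # distribution of the latent measures
)
nrow(grid)  # 64
```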
We used the random-effects model to separately synthesize the observed MDs and observed SMDs from the studies. This method was implemented with the function rma() in the metafor package for R (R Foundation for Statistical Computing, Vienna, Austria), with restricted maximum likelihood estimation for the heterogeneity variance.25 We used the Hartung-Knapp-Sidik-Jonkman method for deriving 95% CIs.26–28 This method has been shown to be generally more appropriate than the conventional DerSimonian-Laird method.28,29 We obtained the estimated overall MDs, estimated overall SMDs, and their corresponding 95% CIs.
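A minimal sketch of this synthesis step is shown below; the study-level estimates and variances are hypothetical placeholders for the $\hat{\delta}_i$ (or $\hat{\theta}_i$) and their variances defined in the Methods.

```r
library(metafor)

yi <- c(0.8, 1.2, 0.9, 1.1, 1.0)       # hypothetical study-level MD (or SMD) estimates
vi <- c(0.10, 0.08, 0.12, 0.09, 0.11)  # their estimated variances

# REML for the heterogeneity variance; test = "knha" requests the
# Hartung-Knapp (Sidik-Jonkman) adjustment used for the 95% CIs
fit <- rma(yi, vi, method = "REML", test = "knha")
c(estimate = fit$beta[1], ci_lb = fit$ci.lb, ci_ub = fit$ci.ub)
```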
The bias, root mean squared error (RMSE), and CI coverage probability were calculated to compare the performance of the MD and SMD. We obtained their Monte Carlo standard errors to quantify their uncertainties. In each simulation setting, the bias was computed as $\frac{1}{R} \sum_{r=1}^{R} (\hat{\mu}_r - \mu)$, where $R = 1000$ was the number of replicates, $\hat{\mu}_r$ was the overall estimate of the MD or SMD in the $r$th simulation replicate, and $\mu$ was the true MD or SMD. The RMSE was $\sqrt{\frac{1}{R} \sum_{r=1}^{R} (\hat{\mu}_r - \mu)^2}$, and the coverage probability was $\frac{1}{R} \sum_{r=1}^{R} I(\hat{\mu}_{r,\mathrm{L}} \le \mu \le \hat{\mu}_{r,\mathrm{U}})$, where $\hat{\mu}_{r,\mathrm{L}}$ and $\hat{\mu}_{r,\mathrm{U}}$ were the lower and upper bounds of the CI, respectively. The indicator function $I(\cdot)$ returned 1 if the CI covered the true MD or SMD and 0 if not. We also converted the bias and RMSE of the SMD to the scale of the MD by multiplying by the common SD $\sigma$. The coverage probability of the SMD on the scale of the MD was the same as the original coverage probability of the SMD.
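A sketch of these performance metrics, computed over the $R$ replicates of one simulation setting (the argument names are ours):

```r
# Bias, RMSE, and CI coverage across replicates, with Monte Carlo SEs
perf <- function(est, ci_lb, ci_ub, truth) {
  R <- length(est)
  cover <- ci_lb <= truth & truth <= ci_ub
  c(bias          = mean(est - truth),
    bias_mcse     = sd(est) / sqrt(R),
    rmse          = sqrt(mean((est - truth)^2)),
    coverage      = mean(cover),
    coverage_mcse = sqrt(mean(cover) * (1 - mean(cover)) / R))
}
```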
Results
Across all settings, the Monte Carlo standard errors of the biases in MDs and SMDs were less than 0.014, those of the RMSEs were less than 0.012, and those of the coverage probabilities were less than 1.5%. The biases, RMSEs, and coverage probabilities of SMDs were converted to the scale of the MD for fair comparisons. In the following, biases are reported in absolute magnitude.
Table 1 summarizes the results when the continuous measures were generated from the normal distribution in Setting 1. Most biases in SMDs were similar to those in MDs. Two scenarios yielded SMDs with noticeably greater biases than MDs: when the true overall MD $\delta$ was 1, the number of studies $N$ was 20, and the sample sizes were within 10–50, the bias in MDs was 0.001 and the bias in SMDs was 0.014 for the within-study SD $\sigma = 1$ and true overall SMD $\theta = 1$, and the bias in MDs was 0.003 and the bias in SMDs was 0.013 for $\sigma = 2$ and $\theta = 0.5$. The RMSEs of SMDs and MDs were ≤ 0.39. In most scenarios, SMDs had smaller RMSEs than MDs for $\delta = 0$, while MDs produced smaller RMSEs for $\delta = 1$; their differences were tiny, at most about 0.01. The CIs of MDs and SMDs also had similar coverage probabilities in most scenarios, close to 95%.
Table 1:
Bias, root mean squared errors (RMSEs), and coverage probabilities with their Monte Carlo standard errors (in parentheses) of the mean difference (MD) and standardized mean difference (SMD) in the MD scale when the continuous measures are generated from the normal distribution in Setting 1.
| $\delta$ | $\theta$ | $\sigma$ | $N$ | $n_i$ | $\tau$ | Bias, MD | Bias, SMD on MD scale | RMSE, MD | RMSE, SMD on MD scale | Coverage, MD | Coverage, SMD on MD scale |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 5 | 10–50 | 0.4 | 0.011 (0.008) | 0.010 (0.008) | 0.250 (0.005) | 0.249 (0.006) | 94.7% (0.7%) | 94.8% (0.7%) |
| 0 | 0 | 1 | 5 | 100–300 | 0.2 | 0.007 (0.003) | 0.007 (0.003) | 0.110 (0.002) | 0.109 (0.002) | 94.8% (0.7%) | 95.0% (0.7%) |
| 0 | 0 | 1 | 20 | 10–50 | 0.4 | −0.001 (0.004) | 0.000 (0.004) | 0.129 (0.003) | 0.127 (0.003) | 95.3% (0.7%) | 96.2% (0.6%) |
| 0 | 0 | 1 | 20 | 100–300 | 0.2 | −0.003 (0.002) | −0.003 (0.002) | 0.057 (0.001) | 0.057 (0.001) | 95.0% (0.7%) | 95.3% (0.7%) |
| 0 | 0 | 2 | 5 | 10–50 | 0.4 | 0.009 (0.012) | 0.008 (0.012) | 0.384 (0.009) | 0.383 (0.008) | 94.6% (0.7%) | 94.7% (0.7%) |
| 0 | 0 | 2 | 5 | 100–300 | 0.2 | 0.011 (0.005) | 0.011 (0.004) | 0.156 (0.003) | 0.156 (0.004) | 94.1% (0.7%) | 94.8% (0.7%) |
| 0 | 0 | 2 | 20 | 10–50 | 0.4 | −0.003 (0.006) | −0.002 (0.006) | 0.199 (0.004) | 0.192 (0.004) | 94.3% (0.7%) | 95.6% (0.6%) |
| 0 | 0 | 2 | 20 | 100–300 | 0.2 | −0.002 (0.003) | −0.002 (0.002) | 0.081 (0.002) | 0.080 (0.002) | 94.9% (0.7%) | 95.0% (0.7%) |
| 1 | 1 | 1 | 5 | 10–50 | 0.4 | 0.011 (0.008) | 0.009 (0.008) | 0.250 (0.005) | 0.260 (0.006) | 94.7% (0.7%) | 95.0% (0.7%) |
| 1 | 1 | 1 | 5 | 100–300 | 0.2 | 0.007 (0.003) | 0.005 (0.004) | 0.110 (0.002) | 0.113 (0.002) | 94.8% (0.7%) | 93.9% (0.8%) |
| 1 | 1 | 1 | 20 | 10–50 | 0.4 | −0.001 (0.004) | −0.014 (0.004) | 0.129 (0.003) | 0.132 (0.003) | 95.3% (0.7%) | 94.4% (0.7%) |
| 1 | 1 | 1 | 20 | 100–300 | 0.2 | −0.003 (0.002) | −0.004 (0.002) | 0.057 (0.001) | 0.059 (0.001) | 95.0% (0.7%) | 94.7% (0.7%) |
| 1 | 0.5 | 2 | 5 | 10–50 | 0.4 | 0.009 (0.012) | 0.010 (0.012) | 0.384 (0.009) | 0.390 (0.010) | 94.6% (0.7%) | 94.9% (0.7%) |
| 1 | 0.5 | 2 | 5 | 100–300 | 0.2 | 0.011 (0.005) | 0.010 (0.006) | 0.156 (0.003) | 0.159 (0.004) | 94.1% (0.7%) | 94.0% (0.8%) |
| 1 | 0.5 | 2 | 20 | 10–50 | 0.4 | −0.003 (0.006) | −0.013 (0.006) | 0.199 (0.004) | 0.196 (0.004) | 94.3% (0.7%) | 94.8% (0.7%) |
| 1 | 0.5 | 2 | 20 | 100–300 | 0.2 | −0.002 (0.003) | −0.004 (0.002) | 0.081 (0.002) | 0.082 (0.002) | 94.9% (0.7%) | 94.9% (0.7%) |

$\delta$, true overall mean difference; $\theta$, true overall standardized mean difference; $\sigma$, within-study standard deviation; $N$, number of studies; $n_i$, sample size within a study; $\tau$, between-study standard deviation.
Table 2 shows the results when the continuous measures $y_{ijk}$ were generated from the $t$ distribution in Setting 2. The SMD and MD had very similar biases when the true overall MD $\delta$ was 0; their differences were at most 0.001. However, when $\delta$ was 1, the biases in SMDs were greater than those in MDs in most scenarios. When $N = 5$ and the sample sizes were within 10–50, the bias in SMDs was 0.021, greater than that in MDs (0.009), for $\sigma = 1$, and their difference was 0.014 for $\sigma = 2$. Most RMSEs of SMDs were slightly greater than those of MDs, by up to 0.019. The RMSEs of SMDs ranged from 0.058 to 0.400, and those of MDs ranged from 0.058 to 0.388. SMDs generally had CIs with slightly higher coverage probabilities than MDs, by up to 1.5%.
Table 2:
Bias, root mean squared errors (RMSEs), and coverage probabilities with their Monte Carlo standard errors (in parentheses) of the mean difference (MD) and standardized mean difference (SMD) in the MD scale when the continuous measures are generated from the $t$ distribution in Setting 2.
| $\delta$ | $\theta$ | $\sigma$ | $N$ | $n_i$ | $\tau$ | Bias, MD | Bias, SMD on MD scale | RMSE, MD | RMSE, SMD on MD scale | Coverage, MD | Coverage, SMD on MD scale |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 5 | 10–50 | 0.4 | 0.009 (0.008) | 0.009 (0.008) | 0.254 (0.006) | 0.259 (0.006) | 94.8% (0.7%) | 95.4% (0.7%) |
| 0 | 0 | 1 | 5 | 100–300 | 0.2 | −0.005 (0.004) | −0.005 (0.004) | 0.115 (0.002) | 0.115 (0.003) | 93.1% (0.8%) | 93.3% (0.8%) |
| 0 | 0 | 1 | 20 | 10–50 | 0.4 | 0.008 (0.004) | 0.008 (0.004) | 0.125 (0.003) | 0.126 (0.003) | 93.3% (0.8%) | 94.8% (0.7%) |
| 0 | 0 | 1 | 20 | 100–300 | 0.2 | 0.002 (0.002) | 0.002 (0.002) | 0.058 (0.001) | 0.058 (0.001) | 94.1% (0.7%) | 94.1% (0.7%) |
| 0 | 0 | 2 | 5 | 10–50 | 0.4 | 0.013 (0.012) | 0.014 (0.012) | 0.388 (0.009) | 0.391 (0.008) | 95.2% (0.7%) | 96.2% (0.6%) |
| 0 | 0 | 2 | 5 | 100–300 | 0.2 | −0.005 (0.005) | −0.005 (0.006) | 0.161 (0.003) | 0.161 (0.004) | 93.6% (0.8%) | 93.9% (0.8%) |
| 0 | 0 | 2 | 20 | 10–50 | 0.4 | 0.010 (0.006) | 0.011 (0.006) | 0.192 (0.004) | 0.191 (0.004) | 93.6% (0.8%) | 94.2% (0.7%) |
| 0 | 0 | 2 | 20 | 100–300 | 0.2 | 0.002 (0.003) | 0.002 (0.002) | 0.081 (0.002) | 0.081 (0.002) | 94.4% (0.7%) | 94.5% (0.7%) |
| 1 | 1 | 1 | 5 | 10–50 | 0.4 | 0.009 (0.008) | 0.021 (0.009) | 0.254 (0.006) | 0.273 (0.006) | 94.8% (0.7%) | 95.3% (0.7%) |
| 1 | 1 | 1 | 5 | 100–300 | 0.2 | −0.005 (0.004) | −0.006 (0.004) | 0.115 (0.002) | 0.117 (0.002) | 93.1% (0.8%) | 93.5% (0.8%) |
| 1 | 1 | 1 | 20 | 10–50 | 0.4 | 0.008 (0.004) | 0.010 (0.004) | 0.125 (0.003) | 0.133 (0.003) | 93.3% (0.8%) | 94.7% (0.7%) |
| 1 | 1 | 1 | 20 | 100–300 | 0.2 | 0.002 (0.002) | 0.000 (0.002) | 0.058 (0.001) | 0.059 (0.001) | 94.1% (0.7%) | 94.7% (0.7%) |
| 1 | 0.5 | 2 | 5 | 10–50 | 0.4 | 0.013 (0.012) | 0.027 (0.012) | 0.388 (0.009) | 0.400 (0.010) | 95.2% (0.7%) | 96.2% (0.6%) |
| 1 | 0.5 | 2 | 5 | 100–300 | 0.2 | −0.005 (0.005) | −0.005 (0.006) | 0.161 (0.003) | 0.163 (0.004) | 93.6% (0.8%) | 94.3% (0.7%) |
| 1 | 0.5 | 2 | 20 | 10–50 | 0.4 | 0.010 (0.006) | 0.013 (0.006) | 0.192 (0.004) | 0.195 (0.004) | 93.6% (0.8%) | 94.9% (0.7%) |
| 1 | 0.5 | 2 | 20 | 100–300 | 0.2 | 0.002 (0.003) | 0.000 (0.002) | 0.081 (0.002) | 0.082 (0.002) | 94.4% (0.7%) | 94.6% (0.7%) |

$\delta$, true overall mean difference; $\theta$, true overall standardized mean difference; $\sigma$, within-study standard deviation; $N$, number of studies; $n_i$, sample size within a study; $\tau$, between-study standard deviation.
When the continuous measures $y_{ijk}$ were generated from the skewed distribution in Setting 3, the results for SMDs and MDs showed large differences, as shown in Table 3. The biases in MDs were generally small, ranging from 0.002 to 0.011. The SMDs had small biases, up to 0.011, when $\delta = 0$, but the biases sharply increased when $\delta = 1$, ranging from 0.084 to 0.095. The RMSEs of SMDs and MDs were up to 0.425; in most cases, SMDs performed better, with smaller RMSEs than MDs. The coverage probabilities of MDs’ CIs ranged from 93.2% to 96.6%. The CIs of SMDs had slightly higher coverage probabilities when the true overall MD was 0, but they performed considerably poorly when $\delta$ was 1. The coverage probabilities of SMDs’ CIs could drop to 65.1% when $\sigma = 1$, $N = 20$, and the sample sizes were within 100–300.
Table 3:
Bias, root mean squared errors (RMSEs), and coverage probabilities with their Monte Carlo standard errors (in parentheses) of the mean difference (MD) and standardized mean difference (SMD) in the MD scale when the continuous measures are generated from the skewed distribution in Setting 3.
| $\delta$ | $\theta$ | $\sigma$ | $N$ | $n_i$ | $\tau$ | Bias, MD | Bias, SMD on MD scale | RMSE, MD | RMSE, SMD on MD scale | Coverage, MD | Coverage, SMD on MD scale |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 5 | 10–50 | 0.4 | 0.008 (0.008) | 0.008 (0.008) | 0.266 (0.006) | 0.241 (0.006) | 94.7% (0.7%) | 95.8% (0.6%) |
| 0 | 0 | 1 | 5 | 100–300 | 0.2 | −0.006 (0.004) | −0.005 (0.004) | 0.124 (0.003) | 0.113 (0.003) | 93.4% (0.8%) | 93.6% (0.8%) |
| 0 | 0 | 1 | 20 | 10–50 | 0.4 | 0.007 (0.004) | 0.007 (0.004) | 0.140 (0.003) | 0.126 (0.003) | 93.2% (0.8%) | 93.9% (0.8%) |
| 0 | 0 | 1 | 20 | 100–300 | 0.2 | 0.002 (0.002) | 0.002 (0.002) | 0.057 (0.001) | 0.052 (0.001) | 96.6% (0.6%) | 96.6% (0.6%) |
| 0 | 0 | 2 | 5 | 10–50 | 0.4 | 0.009 (0.013) | 0.010 (0.012) | 0.425 (0.009) | 0.379 (0.008) | 95.1% (0.7%) | 96.3% (0.6%) |
| 0 | 0 | 2 | 5 | 100–300 | 0.2 | −0.006 (0.006) | −0.006 (0.006) | 0.182 (0.004) | 0.165 (0.004) | 93.7% (0.8%) | 94.1% (0.7%) |
| 0 | 0 | 2 | 20 | 10–50 | 0.4 | 0.011 (0.007) | 0.011 (0.006) | 0.220 (0.005) | 0.195 (0.004) | 93.8% (0.8%) | 94.4% (0.7%) |
| 0 | 0 | 2 | 20 | 100–300 | 0.2 | 0.004 (0.003) | 0.004 (0.002) | 0.083 (0.002) | 0.076 (0.002) | 96.1% (0.6%) | 96.5% (0.6%) |
| 1 | 1 | 1 | 5 | 10–50 | 0.4 | 0.008 (0.008) | −0.088 (0.008) | 0.266 (0.006) | 0.259 (0.006) | 94.7% (0.7%) | 94.2% (0.7%) |
| 1 | 1 | 1 | 5 | 100–300 | 0.2 | −0.006 (0.004) | −0.095 (0.004) | 0.124 (0.003) | 0.148 (0.003) | 93.4% (0.8%) | 86.4% (1.1%) |
| 1 | 1 | 1 | 20 | 10–50 | 0.4 | 0.007 (0.004) | −0.094 (0.004) | 0.140 (0.003) | 0.159 (0.003) | 93.2% (0.8%) | 86.0% (1.1%) |
| 1 | 1 | 1 | 20 | 100–300 | 0.2 | 0.002 (0.002) | −0.088 (0.002) | 0.057 (0.001) | 0.103 (0.002) | 96.6% (0.6%) | 65.1% (1.5%) |
| 1 | 0.5 | 2 | 5 | 10–50 | 0.4 | 0.009 (0.013) | −0.084 (0.012) | 0.425 (0.009) | 0.388 (0.008) | 95.1% (0.7%) | 95.5% (0.7%) |
| 1 | 0.5 | 2 | 5 | 100–300 | 0.2 | −0.006 (0.006) | −0.095 (0.006) | 0.182 (0.004) | 0.191 (0.004) | 93.7% (0.8%) | 90.6% (0.9%) |
| 1 | 0.5 | 2 | 20 | 10–50 | 0.4 | 0.011 (0.007) | −0.088 (0.006) | 0.220 (0.005) | 0.215 (0.006) | 93.8% (0.8%) | 91.9% (0.9%) |
| 1 | 0.5 | 2 | 20 | 100–300 | 0.2 | 0.004 (0.003) | −0.086 (0.002) | 0.083 (0.002) | 0.115 (0.002) | 96.1% (0.6%) | 81.5% (1.2%) |

$\delta$, true overall mean difference; $\theta$, true overall standardized mean difference; $\sigma$, within-study standard deviation; $N$, number of studies; $n_i$, sample size within a study; $\tau$, between-study standard deviation.
Table 4 summarizes the results when the generating distribution of $y_{ijk}$ had extreme outliers in Setting 4. MDs generally had small biases, ranging from 0.002 to 0.008. When $\delta = 0$, SMDs had small biases, close to those of MDs. For $\delta = 1$, the biases of SMDs were greater than those of MDs; their differences were up to 0.025 when the sample sizes were within 100–300, and SMDs had relatively large biases, ranging from 0.177 to 0.205, when the sample sizes were within 10–50. In all scenarios, the RMSEs of SMDs and MDs were ≤ 0.51, and SMDs had larger RMSEs than MDs: their differences were up to 0.022 when the sample sizes were within 100–300, but as large as 0.181 when the sample sizes were small. The coverage probabilities of MDs’ CIs ranged from 92.3% to 95.7%. Most coverage probabilities of SMDs’ CIs were slightly higher than those of MDs. For $\delta = 1$ and $N = 20$, however, SMDs had considerably lower coverage probabilities than MDs, which could drop to 86.4% when $\sigma = 1$ and the sample sizes were within 10–50.
Table 4:
Bias, root mean squared errors (RMSEs), and coverage probabilities with their Monte Carlo standard errors (in parentheses) of the mean difference (MD) and standardized mean difference (SMD) in the MD scale when the continuous measures are generated from the distribution with extreme outliers in Setting 4.
| $\delta$ | $\theta$ | $\sigma$ | $N$ | $n_i$ | $\tau$ | Bias, MD | Bias, SMD on MD scale | RMSE, MD | RMSE, SMD on MD scale | Coverage, MD | Coverage, SMD on MD scale |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 5 | 10–50 | 0.4 | 0.006 (0.008) | 0.009 (0.009) | 0.241 (0.006) | 0.288 (0.007) | 93.0% (0.8%) | 95.3% (0.7%) |
| 0 | 0 | 1 | 5 | 100–300 | 0.2 | −0.003 (0.004) | −0.003 (0.004) | 0.115 (0.003) | 0.118 (0.003) | 93.8% (0.8%) | 94.5% (0.7%) |
| 0 | 0 | 1 | 20 | 10–50 | 0.4 | 0.004 (0.004) | 0.004 (0.005) | 0.127 (0.003) | 0.151 (0.004) | 93.2% (0.8%) | 94.4% (0.7%) |
| 0 | 0 | 1 | 20 | 100–300 | 0.2 | 0.003 (0.002) | 0.003 (0.002) | 0.055 (0.001) | 0.057 (0.001) | 95.7% (0.6%) | 95.8% (0.6%) |
| 0 | 0 | 2 | 5 | 10–50 | 0.4 | 0.008 (0.011) | 0.008 (0.014) | 0.350 (0.008) | 0.414 (0.010) | 93.2% (0.8%) | 94.5% (0.7%) |
| 0 | 0 | 2 | 5 | 100–300 | 0.2 | −0.002 (0.005) | −0.002 (0.006) | 0.161 (0.004) | 0.164 (0.004) | 94.2% (0.7%) | 94.7% (0.7%) |
| 0 | 0 | 2 | 20 | 10–50 | 0.4 | 0.002 (0.006) | 0.004 (0.006) | 0.180 (0.004) | 0.212 (0.004) | 92.3% (0.8%) | 93.9% (0.8%) |
| 0 | 0 | 2 | 20 | 100–300 | 0.2 | 0.004 (0.002) | 0.004 (0.002) | 0.077 (0.002) | 0.079 (0.002) | 95.0% (0.7%) | 95.0% (0.7%) |
| 1 | 1 | 1 | 5 | 10–50 | 0.4 | 0.006 (0.008) | 0.205 (0.012) | 0.241 (0.006) | 0.422 (0.010) | 93.0% (0.8%) | 94.0% (0.8%) |
| 1 | 1 | 1 | 5 | 100–300 | 0.2 | −0.003 (0.004) | 0.025 (0.004) | 0.115 (0.003) | 0.137 (0.004) | 93.8% (0.8%) | 94.2% (0.7%) |
| 1 | 1 | 1 | 20 | 10–50 | 0.4 | 0.004 (0.004) | 0.177 (0.006) | 0.127 (0.003) | 0.257 (0.006) | 93.2% (0.8%) | 86.4% (1.1%) |
| 1 | 1 | 1 | 20 | 100–300 | 0.2 | 0.003 (0.002) | 0.028 (0.002) | 0.055 (0.001) | 0.073 (0.002) | 95.7% (0.6%) | 93.8% (0.8%) |
| 1 | 0.5 | 2 | 5 | 10–50 | 0.4 | 0.008 (0.011) | 0.203 (0.014) | 0.350 (0.008) | 0.510 (0.012) | 93.2% (0.8%) | 94.5% (0.7%) |
| 1 | 0.5 | 2 | 5 | 100–300 | 0.2 | −0.002 (0.005) | 0.026 (0.006) | 0.161 (0.004) | 0.177 (0.004) | 94.2% (0.7%) | 95.1% (0.7%) |
| 1 | 0.5 | 2 | 20 | 10–50 | 0.4 | 0.002 (0.006) | 0.177 (0.008) | 0.180 (0.004) | 0.295 (0.006) | 92.3% (0.8%) | 89.5% (1.0%) |
| 1 | 0.5 | 2 | 20 | 100–300 | 0.2 | 0.004 (0.002) | 0.029 (0.002) | 0.077 (0.002) | 0.091 (0.002) | 95.0% (0.7%) | 93.8% (0.8%) |

$\delta$, true overall mean difference; $\theta$, true overall standardized mean difference; $\sigma$, within-study standard deviation; $N$, number of studies; $n_i$, sample size within a study; $\tau$, between-study standard deviation.
Discussion
This article explores the performance of the SMD under various distributions for individual subjects’ continuous outcome measures, including the normal distribution, the $t$ distribution, a skewed distribution, and a distribution with extreme outliers. We conducted simulation studies to compare the performance of the MD and SMD. When the true overall effect size was zero, the SMD generally had slightly better or similar performance compared with the MD in terms of bias, RMSE, and CI coverage probability; most differences were small and likely attributable to Monte Carlo error. When the overall effect size was large, the SMD’s performance was mostly satisfactory in the cases of the normal and $t$ distributions, whereas the SMD could have considerably greater biases and RMSEs, and its CI could have much lower coverage probabilities than the MD, in some cases of the skewed distribution and the distribution with extreme outliers.
In real-world scenarios, different studies reporting continuous outcomes often employ various rating instruments or present indicators in different units, complicating the direct comparison of results across studies. In such instances, the SMD serves as a useful tool for harmonizing these differences, enabling the synthesis of results. This is particularly relevant in fields such as psychology or education, where diverse scales and measurements are common. However, the inherent variability and subjectivity of these instruments should be acknowledged, as they can introduce heterogeneity into the meta-analysis and challenge the rationale of synthesizing the SMD. Researchers need to carefully consider the context and nature of the instruments used in the original studies when interpreting the results of meta-analyses that utilize SMD.
Our exploratory work has some limitations. In the simulation studies, a common SD was considered for all studies in the same meta-analysis. As mentioned in our simulation designs, this assumption is needed for fair comparisons between the MD and SMD. Without this assumption, our simulations based on the MD could not yield a well-defined value of the overall SMD, and thus the bias, RMSE, and CI coverage probability could not be calculated for the SMD. In practice, this assumption is admittedly unrealistic. However, we believe our simulations were sufficient to illustrate the potentially poor performance of the SMD in some cases. For handling real-world meta-analyses with continuous outcomes, researchers should evaluate whether it is sensible to assume the exchangeability of MDs or SMDs across studies. If the exchangeability assumption holds for one effect measure, it may not be valid for another, except in some rare cases (eg, approximately common SDs across all studies). In addition, this article only assesses the impact of within-study distributional assumptions and assumes normality between studies. However, between-study normality cannot be taken for granted,21,30 and it must be carefully assessed on a case-by-case basis.
Some extensions based on continuous effect measures can be considered in future research. This article focuses on only 2 effect measures for continuous outcomes, the MD and SMD. Other methods can be implemented for evidence synthesis with continuous outcomes, such as minimally important differences or the ratio of means.4
Conclusions
While the SMD can be beneficial for synthesizing evidence from continuous measures that vary in scale, researchers must be aware of its potential limitations, particularly its inferior performance compared with the MD when dealing with measures on identical scales. The MD demonstrates superior robustness when the normality assumption for continuous measures is challenged. In scenarios where there is inherent comparability among continuous measures or the possibility of transforming them to a common scale, the MD emerges as the more suitable effect measure. Therefore, we recommend that researchers make a contextually informed choice between MD and SMD based on the scale and distribution of outcome measures in their studies. Understanding the limitations and sensitivity to distributional assumptions of each effect measure is key for accurate interpretation. This requires careful evaluation of study designs and outcome measures to determine the most appropriate effect measure.
Funding
This study was supported in part by the National Institute of Mental Health grant R03 MH128727 and the National Library of Medicine grant R01 LM012982 of the US National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the US National Institutes of Health.
Footnotes
The authors declare no conflicts of interest.
References
1. Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to Meta-Analysis. John Wiley & Sons; 2009.
2. Gurevitch J, Koricheva J, Nakagawa S, Stewart G. Meta-analysis and the science of research synthesis. Nature 2018;555(7695):175–82.
3. Murad MH, Montori VM, Ioannidis JPA, Jaeschke R, Devereaux PJ, Prasad K, et al. How to read a systematic review and meta-analysis and apply the results to patient care: users’ guides to the medical literature. JAMA 2014;312(2):171–79.
4. Murad MH, Wang Z, Chu H, Lin L. When continuous outcomes are measured using different scales: guide for meta-analysis and interpretation. BMJ 2019;364:k4817.
5. Hedges LV. Estimation of effect size from a series of independent experiments. Psychol Bull 1982;92(2):490–99.
6. Andrade C. Mean difference, standardized mean difference (SMD), and their use in meta-analysis: as simple as it gets. J Clin Psychiatry 2020;81(5):20f13681.
7. Takeshima N, Sozu T, Tajika A, Ogawa Y, Hayasaka Y, Furukawa TA. Which is more generalizable, powerful and interpretable in meta-analyses, mean difference or standardized mean difference? BMC Med Res Methodol 2014;14(1):30.
8. Cummings P. Arguments for and against standardized mean differences (effect sizes). Arch Pediatr Adolesc Med 2011;165(7):592–96.
9. Greenland S, Schlesselman JJ, Criqui MH. The fallacy of employing standardized regression coefficients and correlations as measures of effect. Am J Epidemiol 1986;123(2):203–08.
10. Greenland S, Maclure M, Schlesselman JJ, Poole C, Morgenstern H. Standardized regression coefficients: a further critique and review of some alternatives. Epidemiology 1991;2(5):387–92.
11. Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA. Cochrane Handbook for Systematic Reviews of Interventions. John Wiley & Sons; 2019.
12. Wilkinson L. Statistical methods in psychology journals: guidelines and explanations. Am Psychol 1999;54(8):594–604.
13. Egger M, Davey Smith G, Altman DG. Systematic Reviews in Health Care: Meta-Analysis in Context. 2nd ed. BMJ Publishing Group; 2001.
14. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Lawrence Erlbaum Associates; 1988.
15. Ferguson CJ. An effect size primer: a guide for clinicians and researchers. Prof Psychol Res Pract 2009;40(5):532–38.
16. Lin L, Aloe AM. Evaluation of various estimators for standardized mean difference in meta-analysis. Stat Med 2021;40(2):403–26.
17. Gøtzsche PC, Hróbjartsson A, Marić K, Tendal B. Data extraction errors in meta-analyses that use standardized mean differences. JAMA 2007;298(4):430–37.
18. Tendal B, Higgins JPT, Jüni P, Hróbjartsson A, Trelle S, Nüesch E, et al. Disagreements in meta-analyses using outcomes measured on continuous or rating scales: observer agreement study. BMJ 2009;339:b3128.
19. Luo Y, Funada S, Yoshida K, Noma H, Sahker E, Furukawa TA. Large variation existed in standardized mean difference estimates using different calculation methods in clinical trials. J Clin Epidemiol 2022;149:89–97.
20. Dias S, Sutton AJ, Ades AE, Welton NJ. Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network meta-analysis of randomized controlled trials. Med Decis Making 2013;33(5):607–17.
21. Jackson D, White IR. When should meta-analysis avoid making hidden normality assumptions? Biom J 2018;60(6):1040–58.
22. Bos EH, de Jonge P, Cox RFA. Affective variability in depression: revisiting the inertia–instability paradox. Br J Psychol 2019;110(4):814–27.
23. Hedges LV, Olkin I. Statistical Methods for Meta-Analysis. Academic Press; 1985.
24. Lin L. Bias caused by sampling error in meta-analysis with small sample sizes. PLoS One 2018;13(9):e0204056.
25. Viechtbauer W. Conducting meta-analyses in R with the metafor package. J Stat Softw 2010;36(3):1–48.
26. Hartung J, Knapp G. A refined method for the meta-analysis of controlled clinical trials with binary outcome. Stat Med 2001;20(24):3875–89.
27. Sidik K, Jonkman JN. A simple confidence interval for meta-analysis. Stat Med 2002;21(21):3153–59.
28. IntHout J, Ioannidis JPA, Borm GF. The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Med Res Methodol 2014;14(1):25.
29. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986;7(3):177–88.
30. Liu Z, Al Amer FM, Xiao M, Xu C, Furuya-Kanamori L, Hong H, et al. The normality assumption on between-study random effects was questionable in a considerable number of Cochrane meta-analyses. BMC Med 2023;21(1):112.