Meta-Analysis With Complex Research Designs: Dealing With Dependence From Multiple Measures and Multiple Group Comparisons

Nancy Scammacca; Greg Roberts; Karla K Stuebing

doi:10.3102/0034654313500826

Author manuscript; available in PMC: 2014 Oct 9.

Published in final edited form as: Rev Educ Res. 2013 Sep 13;84(3):328–364. doi: 10.3102/0034654313500826

PMCID: PMC4191743 NIHMSID: NIHMS528552 PMID: 25309002

Abstract

Previous research has shown that treating dependent effect sizes as independent inflates the variance of the mean effect size and introduces bias by giving studies with more effect sizes more weight in the meta-analysis. This article summarizes the different approaches to handling dependence that have been advocated by methodologists, some of which are more feasible to implement with education research studies than others. A case study using effect sizes from a recent meta-analysis of reading interventions is presented to compare the results obtained from different approaches to dealing with dependence. Overall, mean effect sizes and variance estimates were found to be similar, but estimates of indexes of heterogeneity varied. Meta-analysts are advised to explore the effect of the method of handling dependence on the heterogeneity estimates before conducting moderator analyses and to choose the approach to dependence that is best suited to their research question and their data set.

Keywords: meta-analysis, statistical dependence, heterogeneity analysis

The inclusion of statistically dependent effect sizes in a meta-analysis can present a serious threat to the validity of the meta-analytic results. Dependence can arise in a number of ways. One common way that dependence presents itself occurs when a study included in a meta-analysis uses more than one outcome measure, such as a reading intervention study that measures both reading fluency and reading comprehension. The resulting effect sizes are dependent because the same participants were measured more than once. Dependence also commonly occurs when a study's research design includes two treatment groups compared with the same control group. Because the same control group participants are included in each treatment/control comparison, the resulting effect sizes are statistically dependent. Failure to resolve or model dependence results in artificially reduced estimates of variance, which in turn inflates Type I error (Borenstein, Hedges, Higgins, & Rothstein, 2009a). Treating dependent effect sizes as if they were independent also gives more weight in the meta-analysis to studies that have multiple measures or more than two groups. Statistical dependence must be resolved in a way that allows each study to contribute a single independent effect size to the meta-analysis or modeled using methodological techniques designed to handle dependence to avoid these threats to the validity of the meta-analytic results.

Prevalence of the Problem of Dependence in Meta-Analyses in Education Research

Education research studies commonly yield a set of dependent effect sizes. For example, Edmonds et al. (2009) extracted 78 effect sizes from 21 studies of interventions for struggling readers, an average of nearly four per study across multiple measures and multiple dependent comparisons. Tran, Sanchez, Arellano, and Swanson (2011) calculated 107 effect sizes from multiple measures across 13 response-to-instruction studies, meaning that an average of eight outcome measures had been used in these studies. In their meta-analysis on the effectiveness of Reading Recovery, D'Agostino and Murphy (2004) calculated 1,379 effect sizes across the multiple outcomes, group comparisons, and testing occasions in the 36 studies that met their inclusion criteria, for an average of approximately 38 effect sizes per study. In a review of education meta-analyses published since 2000, Ahn, Ames, and Myers (2012) found that 37.5% of the 56 meta-analyses included in their report averaged three or more effect sizes per study. The average number of effect sizes per study across all 56 meta-analyses was 3.71. Just 7 of the 56 meta-analytic reports stated that dependence of effect sizes was not an issue in their data set.

Statistical Methods for Handling Dependence From Multiple Outcomes

Much has been written by prominent researchers about how to resolve dependence of effect sizes in a meta-analysis when faced with multiple outcomes. Some methods are more complex and challenging to implement with education research studies than others. On the less complex end of the spectrum, Card (2012) recommended choosing between two straightforward methods of resolving dependence. The first is to select a single outcome to include based on the focus of the meta-analysis. He cautioned that this approach is appropriate only when the meta-analyst can make a strong case for including one outcome over others. A second option, and one that is frequently implemented in education meta-analyses, is to aggregate all measures by computing an average effect size. Although computing an average effect across measures within a study is easy to do, the result may not be the best measure of the effect of the study. This approach effectively punishes studies for attempting to measure the impact of their treatment across a broad array of measures. For example, researchers testing a reading fluency intervention might be interested in knowing if their intervention has any effect on reading comprehension. Such a study conceivably could result in a large effect of 0.80 on a measure of reading fluency and a small effect of 0.20 on a measure of reading comprehension. If these measures are averaged for inclusion in a meta-analysis that is focused broadly on the effect of reading interventions on reading skills, the resulting effect size of 0.50 would not accurately represent the effectiveness of this study's intervention.

Reflecting on this problem, Marín-Martínez and Sánchez-Meca (1999) cautioned meta-analysts to consider whether or not effect sizes within a study are homogeneous before averaging them to resolve dependence. If effects within studies are not homogeneous, another approach to resolving dependence should be implemented. Cooper (1998) suggested a variation on simply averaging all outcomes. In his shifting-unit-of-analysis approach, effect sizes within studies are combined based on the variables of interest in the meta-analysis to provide a single estimate of the overall effect to include in the meta-analysis. Cooper stated that this approach minimizes violations of the assumption of independence of the effect sizes while preserving as much of the data as possible. However, using this approach can result in running multiple meta-analyses for each outcome type, with some analyses having a small number of studies and little power as a result.

More complex approaches to dealing with dependence from multiple outcomes involve accounting for the correlation between measures when computing a summary effect size across multiple dependent outcomes. As Borenstein et al. (2009b) pointed out, averaging effect sizes across measures makes an implicit assumption that the correlation between measures is 1.0—meaning that each outcome essentially duplicates the information provided by other outcomes. When meta-analysts ignore dependence and include effect sizes from all measures as if the effects were independent, the assumed correlation between measures is 0—meaning that each outcome contributes information that is unrelated to any other outcome. According to Borenstein et al., when making either of these assumptions about the correlation between measures, the result is an incorrect estimate of the variance of the composite effect size that the study contributes to the meta-analysis. Assuming a correlation of 1.0 results in an overestimate of the variance of the composite effect size because all the information provided by the outcomes is redundant. Assuming a correlation of 0 results in an underestimate of the variance for the composite effect size because each effect size is seen as contributing independent information. A larger estimate of the variance results in a larger confidence interval around the effect size and an increased likelihood of finding that the effect size is not significantly different from zero (a Type II error). The opposite is true when an inaccurately small estimate of the variance is calculated, resulting in an inflation of the Type I error rate.
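
To make the role of the assumed correlation concrete, the following minimal Python sketch computes the mean of a study's effect sizes and the variance of that mean under a single common correlation r between outcomes. It uses the standard formula for the variance of a mean of correlated estimates. The function name, the common-correlation simplification, and the sampling variances of 0.04 are illustrative assumptions rather than values from any study discussed here.

```python
import math


def composite_effect(effects, variances, r):
    """Return the mean of a study's effect sizes and the variance of that mean,
    assuming a single common correlation r between every pair of outcomes."""
    m = len(effects)
    mean_effect = sum(effects) / m
    # Sum of the individual variances plus all pairwise covariance terms.
    total = sum(variances)
    for i in range(m):
        for j in range(m):
            if i != j:
                total += r * math.sqrt(variances[i] * variances[j])
    return mean_effect, total / m ** 2


# Hypothetical study with effects of 0.80 and 0.20, each given an
# assumed (illustrative) sampling variance of 0.04.
effects = [0.80, 0.20]
variances = [0.04, 0.04]
for r in (0.0, 0.5, 1.0):
    mean_effect, variance = composite_effect(effects, variances, r)
    print(f"r = {r:.1f}: mean = {mean_effect:.2f}, variance = {variance:.3f}")
```

With two outcomes of equal variance, the composite variance under r = 1.0 equals the variance of a single outcome, whereas under r = 0 it is half as large. The true value lies between these bounds whenever the outcomes are positively but imperfectly correlated.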

When the correlation between outcomes is known, the dependence can be accounted for mathematically when computing a mean effect for a study. Rosenthal and Rubin (1986); Raudenbush, Becker, and Kalaian (1988); Gleser and Olkin (1994); and Borenstein et al. (2009b) provided equations for calculating an effect size for a study with multiple outcomes that include the correlations between the outcomes. More complex approaches incorporate the correlation between measures into multivariate models for conducting meta-analysis. Kalaian and Raudenbush (1996) described and illustrated the use of multivariate multilevel modeling to conduct meta-analysis in a way that models dependency in effects within studies. In their example, they meta-analyzed studies of the impact of coaching on performance on the Scholastic Aptitude Test (SAT) math and verbal subtests. Given that the correlation between these subtests has been reported by the developers of the SAT, Kalaian and Raudenbush were able to compute the covariance matrix needed for implementing their modeling technique. The structural equation modeling (SEM) approach to meta-analysis proposed by Cheung (2010) also requires that the correlations between multiple measures within a study are known.

In her discussion of multivariate meta-analysis, Becker (2000) acknowledged that in many cases the meta-analyst does not know the correlations between multiple measures used in a particular study. She suggested consulting previous studies or manuals from test publishers to impute a correlation. Theoretically, such an approach makes sense. However, it is often impractical or impossible for a meta-analyst working with education research studies to implement any of these suggestions. Researcher-designed measures are commonly used in education research, and the correlations between such measures are not routinely reported. When a study measures outcomes using standardized tests, the correlations between them might be available from test publishers or in the research literature, but the extent to which these correlations generalize beyond the normative sample to a special population (such as students with learning disabilities) is rarely documented.

When it is not possible to locate the correlation from these sources, Becker (2000) and Borenstein et al. (2009b) suggested conducting sensitivity analyses to determine a possible range of correlations between measures. Conducting sensitivity analyses can be a workable solution when a small number of measures are involved and only a few studies use multiple measures. However, when more than two or three measures are used in multiple studies to be included in the meta-analysis, conducting sensitivity analyses for every pair of outcomes quickly becomes so laborious and time-consuming that it is not feasible, especially because computer programs to conduct sensitivity analysis are not available. In these instances, averaging outcomes with an assumed correlation of 1.0 and inflating Type II error is considered the more conservative approach.

Statistical Methods for Handling Dependence From Multiple Group Comparisons

Many of the same researchers who have suggested methods for dealing with dependence when including studies with multiple outcomes also have described methods for dealing with dependence from multiple group comparisons within studies. Gleser and Olkin (1994) provided equations for a matrix of effect sizes that come from a set of studies where multiple treatments are compared with a no-treatment control group. They assumed that the corpus of studies that the meta-analyst has gathered includes a common and defined set of treatments (such as several types of diet or exercise routines), with some studies including perhaps two of these treatments compared with a no-treatment control group and others including three or four or more. In this scenario, regression models can be fit that account for the dependence in the group comparisons within studies. This approach works well in fields where treatments are standardized or come from a common set of treatments, such as medicine. Within education research, it is rare that the same treatments are present across studies, making it impossible to construct the type of matrix needed to implement Gleser and Olkin's approach.

Borenstein et al. (2009c) proposed a way of dealing with the dependence inherent in multiple group comparisons that is more easily applied to education research. First, they advised meta-analysts to consider if their interest is in comparing the effects of two specific treatments or in computing a combined overall effect of treatment compared with the control group. If one's interest is in comparing treatments, and two treatment groups are compared with a single control group in a given study, an effect size can be computed from the information provided for the two treatments that indicates the benefit of one treatment over the other. In this case, effect sizes from treatment–control comparisons are not included in the meta-analysis, eliminating the dependence from the shared control group. This approach makes sense only if the two treatments are present in a similar enough form across the corpus of studies to allow for similar contrasts across the meta-analysis.

If one's interest is in the overall effect of different types of treatment compared with a control group, calculating a combined effect size and its variance for studies in which multiple treatments are compared with the same control group is a straightforward process as long as the number of participants in each treatment group and the control group is known. The correlation between the effect size for the first treatment group versus the control group and the effect size for the second treatment group versus the control group can be calculated based on the number of participants in each group. A combined weighted mean effect size can be computed that gives more weight to an effect from a treatment with a larger sample size than to another treatment in the same study with a smaller sample size. The variance of this combined effect can be computed in a manner that takes into account the proportion of all study participants that are shared members of the control group. For example, if 50 participants are in one treatment group, 50 participants are in a second treatment group, and 50 participants are in the control group, the proportion of shared participants in the comparison of each treatment group with the control group is 0.50 because 50% of the participants in each comparison are the same. More simply, in cases where means, standard deviations, and sample sizes are available for all treatment groups and the control group, the meta-analyst can create a combined mean simply by calculating a weighted mean and standard deviation for a study with all treatment conditions combined and using this mean and standard deviation with the mean and standard deviation of the control group to calculate a standardized mean difference effect size.
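
The simpler of these two routes, pooling the treatment groups before computing a single standardized mean difference, can be sketched in Python as follows. The pooling step uses the standard formula for the mean and standard deviation of two merged groups. The function names and the group means and standard deviations are hypothetical. Only the sample sizes of 50 per group mirror the example above.

```python
import math


def combine_groups(n1, m1, s1, n2, m2, s2):
    """Merge two groups into one, returning the combined n, mean, and SD."""
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    # Within-group sums of squares plus the between-group sum of squares.
    ss = (n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2 + (n1 * n2 / n) * (m1 - m2) ** 2
    return n, mean, math.sqrt(ss / (n - 1))


def standardized_mean_difference(n_t, m_t, s_t, n_c, m_c, s_c):
    """Mean difference divided by the pooled within-group SD."""
    pooled_sd = math.sqrt(
        ((n_t - 1) * s_t ** 2 + (n_c - 1) * s_c ** 2) / (n_t + n_c - 2)
    )
    return (m_t - m_c) / pooled_sd


def shared_proportion(n_treatment, n_control):
    """Share of the participants in one treatment-control comparison
    who belong to the common control group."""
    return n_control / (n_treatment + n_control)


# Hypothetical posttest data: two treatment groups and one control group.
n_t, m_t, s_t = combine_groups(50, 105.0, 10.0, 50, 101.0, 10.0)
d = standardized_mean_difference(n_t, m_t, s_t, 50, 100.0, 10.0)
print(f"combined treatment: n = {n_t}, mean = {m_t:.1f}, SD = {s_t:.2f}")
print(f"standardized mean difference = {d:.3f}")
print(f"shared proportion per comparison = {shared_proportion(50, 50):.2f}")
```

Because the pooled standard deviation absorbs the difference between the two treatment means, the combined group is slightly more variable than either treatment group alone.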

Borenstein et al.'s (2009c) approach to computing a combined, weighted mean effect is easier to apply to the types of research methodologies typically found in education research reports than Gleser and Olkin's (1994) approach. It is a sound means of preserving the statistical independence of effect sizes in a meta-analysis. However, independence comes at the cost of losing information about the unique effect of each treatment. Averaging the effects of treatment may not represent the intent of a study's researchers when they designed a multiple treatment versus control study. Additionally, when there are vast differences in the effectiveness of the treatments, this approach handicaps the most effective treatment in a study by averaging it with less effective treatments. When there are many studies with multiple dependent comparisons in a meta-analysis, the overall mean effect will be reduced by the presence of weaker and stronger treatments homogenized into a middling studywise effect size.

New Approaches to Dealing With Dependence From Multiple Outcomes and Comparisons

Robust Variance Estimation

Hedges, Tipton, and Johnson (2010) proposed a new approach to dealing with dependence that can be applied no matter the source or sources of dependence in a data set of effect sizes. Known as robust variance estimation (RVE), it overcomes the need to include the known correlations between measures in order to include all effect sizes from all measures and all group comparisons in the meta-analysis. Instead of modeling dependence as is done in multivariate approaches to meta-analysis that require known correlations, RVE mathematically adjusts the standard errors of the effect sizes to account for the dependence (Hedges et al., 2010; Tanner-Smith & Tipton, 2013). An intraclass correlation (ρ) that represents the within-study correlation between effects must be specified when implementing RVE to estimate the effect size weights, but because RVE is not affected very much by the choice of weights, it does not matter if the correlation is precise (Hedges et al., 2010; Tanner-Smith & Tipton, 2013). Because the same ρ is applied to all dependent effect sizes within each study in the meta-analysis, sensitivity analysis with a range of values for ρ can be conducted quite easily to determine how the correlation that is chosen affects the resulting estimates of the mean effect and its variance. Dependence from multiple sources, including multiple measures and multiple group comparisons, can be accommodated simultaneously (Tanner-Smith & Tipton, 2013). RVE is reasonably easy to implement with syntax for several popular statistical software packages provided by Tanner-Smith and Tipton and available from the Peabody Research Institute (n.d.).
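
The core idea of adjusting the standard error rather than modeling the correlations can be illustrated with a deliberately simplified Python sketch for an intercept-only model (a single mean effect). It assumes weights of the form 1 / (k_j × (mean variance in study j + τ²)), so that each study's total weight does not grow with its number of effect sizes, and it computes a cluster-robust (sandwich) variance from study-level sums of weighted residuals. The between-study variance `tau_sq` is supplied by the user here, whereas the full procedure estimates it using ρ. Small-sample adjustments are omitted, and the data are hypothetical. The sketch is not the syntax distributed by Tanner-Smith and Tipton.

```python
def rve_mean_effect(effects, variances, study_ids, tau_sq=0.0):
    """Weighted mean effect with a cluster-robust variance (intercept only).

    tau_sq is an assumed, user-supplied between-study variance; estimating
    it from the data is outside the scope of this sketch.
    """
    # Group the (effect, variance) pairs by study.
    studies = {}
    for effect, variance, study in zip(effects, variances, study_ids):
        studies.setdefault(study, []).append((effect, variance))

    # One weight per study, applied to each of its k effect sizes.
    weights = {}
    for study, rows in studies.items():
        k = len(rows)
        mean_variance = sum(v for _, v in rows) / k
        weights[study] = 1.0 / (k * (mean_variance + tau_sq))

    total_weight = sum(weights[s] * len(rows) for s, rows in studies.items())
    mean_effect = sum(
        weights[s] * effect for s, rows in studies.items() for effect, _ in rows
    ) / total_weight

    # Sandwich estimator: square the summed weighted residuals within each
    # study, so within-study dependence is absorbed without knowing it.
    meat = sum(
        (weights[s] * sum(effect - mean_effect for effect, _ in rows)) ** 2
        for s, rows in studies.items()
    )
    return mean_effect, meat / total_weight ** 2


# Hypothetical data: study "a" contributes two effect sizes, "b" and "c" one each.
effects = [0.30, 0.50, 0.10, 0.60]
variances = [0.05, 0.05, 0.04, 0.06]
study_ids = ["a", "a", "b", "c"]
mean_effect, robust_variance = rve_mean_effect(effects, variances, study_ids)
print(f"mean effect = {mean_effect:.3f}, robust SE = {robust_variance ** 0.5:.3f}")
```

Because residuals are summed within a study before they are squared, the variance estimate does not require the within-study correlations to be specified correctly, which is the property described above.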

There are some important limitations to consider when implementing RVE. Because the math involved in RVE relies on the central limit theorem, simulation studies have shown that a minimum of 10 independent studies are needed to estimate a reliable main effect and a minimum of 40 independent studies are needed to estimate a meta-regression coefficient (Hedges et al., 2010; Tanner-Smith & Tipton, 2013). RVE can be used only in meta-regression. If a meta-analysis involves categorical moderators with more than two levels, the dummy-coding of variables required to analyze all pairwise comparisons can be cumbersome to implement in currently available statistical software. Additionally, because the degrees of freedom used to test the statistical significance of the meta-regression coefficients are equal to the number of independent studies minus the number of parameters estimated, meta-analyses with a small number of studies will be restricted in the number of covariates that can be included (Tanner-Smith & Tipton, 2013). Tanner-Smith and Tipton's simulation studies indicated that a minimum of 40 studies with an average of at least five effect sizes per study are needed to estimate a meta-regression coefficient. When fewer studies are included, they found that the confidence interval for the coefficient tends to be too narrow, meaning that the p value for the estimate will be inaccurate. Nevertheless, RVE is a mathematically sound method for modeling dependence that should be strongly considered by education meta-analysts when their data sets meet its requirements.

Three-Level Meta-Analysis

Konstantopoulos (2011) proposed three-level meta-analysis as an extension of the use of two-level random-effects models in meta-analysis. In two-level models, Level 2 variance represents between-study differences in effect size estimates, with the assumption that all studies are contributing an independent effect size. Three-level meta-analysis allows for clustering of dependent effect sizes within studies at Level 2; between-study effects are then estimated at Level 3. Cheung (2013) described how three-level meta-analysis can be used to pool dependent effect sizes within each study, modeling the within-study dependence at Level 2 and the between-study mean effect size and variance at Level 3. This approach to dependence can be applied when the correlations between the dependent effect sizes are not known, as is usually the case when multiple measures are used in a study. Unlike in RVE, three-level meta-analysis provides estimates of both the Level 2 (within study) and Level 3 (between study) variance so that meta-analysts can determine where the variation in effects is the greatest. Covariates can be included in the three-level model at both Level 2 and Level 3 to attempt to explain the variance present at each level.
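
Written out, an intercept-only version of such a model might take the following form. The notation is an assumption introduced here for illustration and is not copied from Konstantopoulos (2011) or Cheung (2013): y_ij is the i-th effect size in study j, v_ij is its known sampling variance, and the two τ² terms are the Level 2 and Level 3 variance components.

```latex
y_{ij} = \beta_0 + u_{(2)ij} + u_{(3)j} + e_{ij},
\qquad
\operatorname{Var}(e_{ij}) = v_{ij},\quad
\operatorname{Var}\bigl(u_{(2)ij}\bigr) = \tau^2_{(2)},\quad
\operatorname{Var}\bigl(u_{(3)j}\bigr) = \tau^2_{(3)}
```

Under this formulation, two effect sizes from the same study share the Level 3 term, which is how the dependence between them enters the model without requiring the correlations between measures.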

Cheung (2013) described how to use SEM to conduct a three-level meta-analysis. Some advantages of the SEM approach include its ability to handle missing data on covariates and to provide a means for empirical comparison of the two-level and three-level models to determine which model best fits the data. Cheung provided syntax and a package for running three-level meta-analysis in R, making it easier for other meta-analysts to implement his approach. Like RVE, three-level meta-analysis is a promising solution to the problem of dependence in meta-analysis. However, as Cheung noted, additional studies are needed to demonstrate the strengths and potential limitations of both approaches to dependence because neither technique has been used widely in published research.

How Education Researchers Handle Dependence in Meta-Analysis

Drawing from the methods described above, education researchers have implemented a variety of means of handling dependence from multiple measures and/or multiple group comparisons when conducting a meta-analysis. In their meta-analysis of the effect of writing instruction on reading, Graham and Hebert (2011) resolved the dependence from multiple measures using Cooper's (1998) shifting-unit-of-analysis approach. They separated measures by construct (e.g., reading comprehension, reading fluency) and meta-analyzed effect sizes for each construct separately. When studies included multiple measures of a single construct, they included the average of the effects in their meta-analysis. Graham and Hebert's approach yielded multiple sets of independent effects that they meta-analyzed separately. This approach also can be implemented when studies provide multiple treatment comparisons by conducting separate meta-analyses for each type of treatment.

The advantage of this approach is that it allows the meta-analyst to retain all of the information from each study while preserving statistical independence. However, to do so the meta-analyst must run multiple analyses and cannot draw conclusions about the overall effect from the corpus of studies. Additionally, dividing the corpus of studies into groups by measure type and/or treatment type can result in a significant reduction in power. Nevertheless, this approach remains popular with meta-analysts and has been implemented in a number of other recent meta-analyses in education (e.g., Flynn, Zheng, & Swanson, 2012; Gersten et al., 2009; Tran et al., 2011). In their review of 56 education meta-analyses, Ahn et al. (2012) found that 26.8% of the meta-analyses in their data set used the shifting-unit-of-analysis approach to resolve dependence.

Another common approach to handling dependence in meta-analysis is to select a single measure and/or group comparison that seems to best represent the study's primary research question. Graham and Hebert (2011) took this approach to resolving the statistical dependence in studies that had multiple group comparisons, and Chambers (2004) used it in a meta-analysis of the effects of computers in classrooms. In their meta-analysis of reading comprehension instruction for students with learning disabilities, Berkeley, Scruggs, and Mastropieri (2010) implemented a hybrid of this approach and the approach described above, selecting a single outcome measure from each study that best represented the research question while conducting separate meta-analyses for different types of measures and for measures of treatment effect, maintenance effect, and generalization effect. This approach was used in 14.3% of the 56 education meta-analyses reviewed by Ahn et al. (2012).

The main advantage of this method of resolving dependence is that it contributes the effect size that conveys the central finding of the study to the meta-analysis. When meta-analysts select a single outcome or group comparison for the meta-analysis, studies that include additional outcomes or comparisons in an attempt to measure the effects of their intervention more broadly or compare it with other types of treatment do not have the effect size of their primary outcome or comparison of interest reduced by averaging it with smaller effects from tertiary outcomes or weaker treatments. However, in large-scale or multicomponent interventions, researchers often expect to see effects of treatment on multiple types of measures or are interested in determining which of several treatments is most effective. In these cases, it can be difficult for the meta-analyst to pick a single measure or group comparison that will best represent the study in the meta-analysis, especially if the study's authors are not clear in describing the outcome or comparison they view as most central to the purpose of their study.

Ahn et al. (2012) documented the use of other approaches to dealing with dependence in the 56 education meta-analyses they reviewed. The approach most commonly used in these meta-analyses was averaging or weighted averaging of the dependent effect sizes within studies. This approach was implemented in 42.9% of the meta-analyses. They also found that a multivariate approach was used in 7.1% of the meta-analyses. A combination of approaches was used in 12.5% of the meta-analyses. In 32.2% of the meta-analyses, researchers either failed to mention whether dependence was an issue in their data set or mentioned it but did not report how they handled it.

Because Hedges et al.'s (2010) RVE approach is a relatively new technique for dealing with dependence, published examples of its use are few in number. Wilson, Tanner-Smith, Lipsey, Steinka-Fry, and Morrison (2011) used RVE to account for dependence in their meta-analysis of high school dropout prevention programs that included 504 effect sizes from 317 independent samples and 152 studies. Uttal et al. (2013) implemented RVE in a meta-analysis that included 1,038 effect sizes from 206 studies that assessed the effect of training programs on spatial skills. Outside of educational research, RVE has been implemented in meta-analyses on the effectiveness of outpatient substance abuse treatment for adolescents (Tanner-Smith, Wilson, & Lipsey, 2013), the relationship between social goals and aggressive behavior in youth (Samson, Ojanen, & Hollo, 2012), and the effect of mindfulness-based stress reduction on physical and mental health in adults (de Vibe, Bjørndal, Tipton, Hammerstrøm, & Kowalski, 2012). No published examples of the use of three-level meta-analysis to handle dependence were found in the educational research literature. Both Konstantopoulos (2011) and Cheung (2013) illustrated the use of three-level meta-analysis with extant data sets. Van den Noortgate, López-López, Marín-Martínez, and Sánchez-Meca (2013) used simulated data sets in their exploration of three-level meta-analysis as a method for handling dependence.

A Case Study in Methods of Dealing With Dependence

To better understand the impact of the choices education meta-analysts face when dealing with multiple measures and multiple group comparisons within studies, different methods of handling dependence were implemented using a set of effect sizes from a meta-analytic study by Scammacca, Roberts, Vaughn, and Stuebing (in press) of reading interventions for struggling readers in Grades 4 to 12. We chose to use an extant set of effect sizes from a recent meta-analysis rather than a simulated data set because we believed that a real-world data set can better emulate the types and nature of dependence that typically exist in studies that education researchers struggle to meta-analyze. In doing so, we acknowledge that simulation studies make an important contribution to the knowledge base and are a necessary next step to the work we present here.

The Scammacca et al. (in press) report involved separate and combined analyses of effect sizes from research published between 1980 and 2004 and between 2005 and 2011. For this report, only effect sizes from the 2005 to 2011 group of 50 studies were used. This more recent group contained many more instances of studies with more than two groups (k = 17) and with multiple measures (k = 43) than the earlier group of studies. Because of its higher proportion of these more complex research designs, the set of 50 is more representative than the older set of the kinds of studies that would cause a meta-analyst to confront the issues addressed here. See the appendix for the effect size data used in this case study.

This case study sought to answer the following research questions:

Research Question 1: How do different approaches to dealing with dependence in data from multiple outcomes within studies affect meta-analytic estimates of mean effect size, variance, and indexes of heterogeneity?
Research Question 2: How do different approaches to dealing with dependence in data from multiple group comparisons within studies affect meta-analytic estimates of mean effect size, variance, and indexes of heterogeneity?

The approaches to handling dependence in this case study include those implemented in other meta-analytic studies that involved education data and others chosen to illustrate alternative means of estimating the overall effect from a study with multiple dependent effects. Additionally, meta-analyses were attempted with all outcomes and all groups as independent for comparison purposes.

Method

Procurement of Corpus of Studies

The studies used in Scammacca et al. (in press) were located through a computer search of ERIC and PsycINFO using descriptors related to reading, learning difficulties/disabilities, and reading intervention; a search of abstracts from other published research syntheses and meta-analyses and reference lists in seminal studies; and a hand search of major journals in which previous intervention studies were published. Studies were included in the meta-analysis if (a) participants were English-speaking struggling readers in Grades 4 to 12 (age 9–21), (b) the study's research design used a multiple-group experimental or quasi-experimental treatment-comparison or multiple-treatment comparison design, (c) the intervention provided any type of reading instruction, (d) data were reported for at least one dependent measure that assessed one or more reading constructs, and (e) sufficient data for calculating effect sizes and standard errors were provided.

Studies that met criteria were coded using a code sheet that included elements specified in the What Works Clearinghouse Design and Implementation Assessment Device (Institute of Education Sciences, 2008) and used in previous research (Scammacca et al., 2007). Researchers with doctorate degrees and doctoral students with experience coding studies for other meta-analyses and research syntheses completed the code sheets. All coders had completed training on how to complete the code sheet and had reached a high level of reliability with others coding the same article independently. Every study was independently coded by two raters. When discrepancies were found between coders, they reviewed the article together and discussed the coding until consensus was reached.

Effect Size Calculation

Effect sizes were calculated using the Hedges (1981) procedure for unbiased effect sizes for Cohen's d (this statistic is also known as Hedges's g). Hedges's g was calculated using the posttest means and standard deviations for treatment and comparison (or multiple treatment) groups when such data were provided. In some cases, Cohen's d effect sizes were reported and means and standard deviations were not available. For these effects, Cohen's d for posttest mean differences between groups and the treatment and comparison group sample sizes were used to calculate Hedges's g. For each effect, estimates of Hedges's g were weighted by the inverse of the variance to account for variations in precision based on sample size in the studies. All effects were computed using the Comprehensive Meta Analysis (Version 2.2.064) software (Borenstein, Hedges, Higgins, & Rothstein, 2011). Effects were coded for all measures and pairwise group comparisons between treatment and control groups or different treatment groups when no control group was included in the study. The 36 research reports yielded 50 independent studies with a total of 366 effect sizes, an average of about 7 effect sizes per study. At this point, researchers in the original study were faced with the dilemma of how to combine multiple effect sizes from multiple measures and multiple dependent group comparisons to best estimate the mean effect of reading intervention.
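
The effect size computation described above can be illustrated with a minimal Python sketch. It is not the Comprehensive Meta Analysis implementation; it assumes the commonly used approximation to the small-sample correction factor and the usual large-sample variance of d, and the function names and input values are hypothetical.

```python
import math

def correction_factor(n_t, n_c):
    """Approximate small-sample correction J (assumed approximation)."""
    df = n_t + n_c - 2
    return 1 - 3 / (4 * df - 1)

def d_variance(d, n_t, n_c):
    """Large-sample variance of the standardized mean difference d."""
    return (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))

def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Return (g, variance of g) from posttest means and standard deviations."""
    df = n_t + n_c - 2
    # Pooled within-group standard deviation
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    d = (mean_t - mean_c) / sd_pooled
    j = correction_factor(n_t, n_c)
    return j * d, j**2 * d_variance(d, n_t, n_c)

def g_from_d(d, n_t, n_c):
    """Return (g, variance of g) when only Cohen's d and sample sizes are reported."""
    j = correction_factor(n_t, n_c)
    return j * d, j**2 * d_variance(d, n_t, n_c)

# Hypothetical posttest data for one treatment/comparison pair
g, var_g = hedges_g(105.0, 15.0, 30, 100.0, 15.0, 30)
weight = 1 / var_g  # inverse-variance weight for this effect
```

Larger samples yield smaller variances and therefore larger weights, which is how the weighting accounts for differences in precision across studies.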

Calculating Mean Effects From Studies With Multiple Measures

Nearly all studies provided data on multiple outcome measures. Scammacca et al. (in press) averaged the effect sizes from multiple measures within each pairwise group comparison using the procedure recommended by Card (2012) and included the average effect size and the average of its standard error in the meta-analysis. Five other approaches to computing a mean effect across multiple measures within a single independent group comparison were applied for the present report:

The measure that yielded the highest effect size was selected for each independent group comparison.
A measure was selected at random using a random number generator for each independent group comparison.
A measure was selected for each independent group comparison that seemed to best represent the primary focus of the study's intervention.
Measures were analyzed separately based on the type of reading skill measured (fluency, vocabulary, spelling, reading comprehension, word, and word fluency) for each independent group comparison and mean effects were calculated for each skill.
All measures were treated as independent estimates of effects for each independent group comparison.

All five approaches were used in meta-analyses to calculate a mean effect and its standard error across all studies for all measures included in the research reports and for norm-referenced, standardized measures only. Because researcher-designed measures tend to have lower reliability than standardized measures, repeating the meta-analyses with only standardized measures allows researchers to investigate the effects of different approaches to dealing with multiple measures while constraining some of the influence of measurement error.
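
Three of the strategies for reducing multiple measures to a single effect (averaging, selecting the highest effect, and selecting at random) can be sketched in Python as follows. The data values, function names, and the fixed random seed are hypothetical and serve only to illustrate the logic; selecting a measure by the study's primary focus requires coder judgment and is not shown.

```python
import random
from statistics import mean

# Hypothetical effects for one independent group comparison:
# (measure label, Hedges's g, standard error)
effects = [
    ("fluency", 0.10, 0.20),
    ("comprehension", 0.35, 0.22),
    ("vocabulary", 0.60, 0.25),
]

def average_measures(effects):
    """Average the effect sizes and average their standard errors."""
    return mean(g for _, g, _ in effects), mean(se for _, _, se in effects)

def highest_effect(effects):
    """Select the measure with the largest effect size."""
    return max(effects, key=lambda e: e[1])

def random_measure(effects, seed=0):
    """Select one measure at random; the seed is fixed only for reproducibility."""
    return random.Random(seed).choice(effects)

avg_g, avg_se = average_measures(effects)
top = highest_effect(effects)
chosen = random_measure(effects)
```

Each function returns one effect per group comparison, so the resulting data set contains no dependence from multiple measures.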

Calculating Mean Effects From Studies With Multiple Dependent Groups

Seventeen of the research reports contained more than one dependent treatment-control or multiple-treatment group comparison. In Scammacca et al. (in press), the procedure recommended by Borenstein et al. (2009c) was implemented for comparisons that involved dependent groups. This procedure involves computing a combined weighted mean effect size and its standard error in a manner that reflects the degree of dependence in the data. Four other approaches to computing a mean effect across multiple dependent group comparisons were completed for this report:

The group comparison that yielded the highest mean effect size across measures included in the study was selected and included in the meta-analysis.
A group comparison was selected at random using a random number generator and its mean effect size across measures was included in the meta-analysis.
A group comparison was selected that seemed to best represent the primary focus of the study's intervention and its mean effect size across measures was included in the meta-analysis.
All group comparisons were treated as independent and each mean effect size across measures was included in the meta-analysis.

Meta-analyses were then conducted on the resulting data using all types of measures and using standardized measures only, for the reason stated above. In each of the different analyses for the multiple dependent group comparisons, the average of the effect sizes for all measures involving the group comparison of interest was used to hold constant the effect of multiple measures while examining different approaches to the problem of multiple dependent group comparisons. In a similar way, the effect of multiple dependent group comparisons was held constant in the analyses involving multiple measures. In these analyses, the Borenstein et al. (2009c) method of combining multiple dependent comparisons was implemented. Results for the RVE approach and three-level meta-analysis are reported separately.
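
To illustrate how a combined effect can reflect the degree of dependence among comparisons, the following Python sketch computes the unweighted mean of several correlated effects and the variance of that mean. This is a simplified textbook formula and may differ in detail from the procedure applied in the original analyses; the correlation r between comparisons is an assumed input that the meta-analyst must supply, and the function name and data are hypothetical.

```python
import math
from itertools import combinations

def composite_effect(gs, variances, r):
    """Return (mean effect, variance of the mean) for correlated effects.

    r is an assumed common correlation between every pair of effects.
    """
    m = len(gs)
    mean_g = sum(gs) / m
    total = sum(variances)
    # Add twice the covariance for each distinct pair of effects
    for v_i, v_j in combinations(variances, 2):
        total += 2 * r * math.sqrt(v_i * v_j)
    return mean_g, total / m**2

# Two hypothetical treatment/control comparisons sharing one control group
mean_g, var_indep = composite_effect([0.2, 0.4], [0.04, 0.04], r=0.0)
_, var_partial = composite_effect([0.2, 0.4], [0.04, 0.04], r=0.5)
_, var_full = composite_effect([0.2, 0.4], [0.04, 0.04], r=1.0)
```

Setting r to 0 reproduces the variance obtained under an assumption of independence, whereas larger values of r yield a larger variance for the combined effect. Treating dependent comparisons as independent therefore understates the variance.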

Meta-Analytic Procedures

For all the methods of dealing with multiple measures and multiple dependent group comparisons, a random-effects model was used to analyze effect sizes. This model allows for generalizations to be made beyond the studies included in the analysis to the population of studies from which they come. Mean effect size statistics and their standard errors were computed and heterogeneity of variance was evaluated using the Q statistic, the I² statistic, and the tau-squared statistic. For all but the RVE and three-level meta-analysis approaches, the meta-analyses were conducted in Comprehensive Meta Analysis (Version 2.2.064) software (Borenstein et al., 2011). For the RVE approach, unrestricted, intercept-only meta-regression models were run in SPSS using a macro provided by Tanner-Smith and Tipton (2013) and Peabody Research Institute (n.d.). Sensitivity analysis with a range of values for ρ was conducted to determine the effect of varying intraclass correlations on estimates of the mean effect size, the Q statistic, and the tau-squared statistic. For three-level meta-analysis, Cheung's (2013) R syntax for the metaSEM package he authored was used. Finally, meta-regression was conducted using number of measures and number of groups as predictors of effect size in a mixed-effects model using unrestricted maximum likelihood estimation.
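
The random-effects mean and the three heterogeneity indexes named above can be sketched in Python as follows. The sketch assumes the method-of-moments (DerSimonian-Laird) estimator of tau-squared, which is one common choice and is an assumption here rather than a documented detail of the original analyses; the function name and data are hypothetical.

```python
import math

def random_effects(gs, variances):
    """Random-effects summary with Q, I-squared (percent), and tau-squared."""
    w = [1 / v for v in variances]                      # fixed-effect weights
    sum_w = sum(w)
    fixed_mean = sum(wi * g for wi, g in zip(w, gs)) / sum_w
    # Q: weighted sum of squared deviations from the fixed-effect mean
    q = sum(wi * (g - fixed_mean) ** 2 for wi, g in zip(w, gs))
    df = len(gs) - 1
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)                       # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Random-effects weights add tau-squared to each within-study variance
    w_star = [1 / (v + tau2) for v in variances]
    mean = sum(wi * g for wi, g in zip(w_star, gs)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {"mean": mean, "se": se, "q": q, "i2": i2, "tau2": tau2}

# Two hypothetical studies with very different effects
result = random_effects([0.0, 0.8], [0.01, 0.01])
```

Because Q, I², and tau-squared all depend on the number of effects entered and on their variances, the way dependent effects are combined or selected before this step can change the heterogeneity estimates even when the mean effect changes little.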

Results

Approaches to Handling Dependence From Multiple Measures

The meta-analyses that implemented different methods of resolving the dependence resulting from having multiple measures within a study produced some points of similarity and some differences across the methods used when considering all types of measures. See Table 1 for results for all types of measures. The mean effect size and variance when using the mean of measures method were nearly identical to the mean effect size when all measures were treated as independent and when a measure was selected based on the primary research question. Using the highest effect size produced a much larger mean effect, as would be expected, and a slightly larger variance. Random selection of an effect size also produced a somewhat larger estimate of the mean effect and a slightly larger variance. Estimates of heterogeneity varied widely depending on the method used to resolve dependence from multiple measures. Treating all measures as independent, using the highest effect size, and randomly selecting an effect size resulted in the largest values across all three indexes of heterogeneity.