Is Inquiry Science Instruction Effective for English Language Learners? A Meta-Analytic Review

Gabriel Estrella; Jacky Au; Susanne M Jaeggi; Penelope Collins

doi:10.1177/2332858418767402

Author manuscript; available in PMC: 2019 Jun 19.

Published in final edited form as: AERA Open. 2018 Apr 26;4(2):10.1177/2332858418767402. doi: 10.1177/2332858418767402

PMCID: PMC6583889 NIHMSID: NIHMS1008630 PMID: 31218240

Abstract

Despite being among the fastest growing segments of the student population, English Language Learners (ELLs) have yet to attain the same academic success as their English-proficient peers, particularly in science. In an effort to support the pedagogical needs of this group, educators have been urged to adopt inquiry approaches to science instruction. Whereas inquiry instruction has been shown to improve science outcomes for non-ELLs, systematic evidence in support of its effectiveness with ELLs has yet to be established. The current meta-analysis summarizes the effect of inquiry instruction on the science achievement of ELLs in elementary school. Although an analysis of 26 articles confirmed that inquiry instruction produced significantly greater impacts on measures of science achievement for ELLs compared to direct instruction, there was still a differential learning effect suggesting greater efficacy for non-ELLs compared to ELLs. Contextual factors that moderate these effects are identified and discussed.

Keywords: English Language Learner, science education, inquiry instruction, achievement gap, quantitative research synthesis

The Next Generation Science Standards (NGSS; Achieve, 2012) reflect the importance of introducing children to scientific and engineering practices early to prepare them for STEM careers. The NGSS framework emphasizes the use of rich content and practices that refine and deepen science inquiry in ways that go beyond the use of hands-on, constructivist approaches to science instruction (Achieve, 2012). However, implementing these standards in elementary school presents unique challenges to educators who must increasingly teach complex concepts and reasoning to English Language Learners (ELLs), that is, students who have yet to fully develop proficiency in English (Saunders & Marcelletti, 2013). Furthermore, engaging in the science and engineering practices is language intensive for all students, and for ELL students in particular (Lee, Quinn, & Valdés, 2013). Although the population of ELL students has increased substantially in recent years, their achievement in science has not (Maerten-Rivera, Myers, Lee, & Penfield, 2010; National Center for Education Statistics [NCES], 2014).

To better support the pedagogical needs of this growing population, educators have been encouraged to adopt inquiry-based approaches based on the premise that hands-on instruction makes science learning more engaging, concrete, and meaningful (Janzen, 2008; National Research Council [NRC], 2012; Roseberry & Warren, 2008). Whereas inquiry-based instruction has been shown to improve the science achievement of English-proficient (or non-ELL) students (Furtak, Seidel, Iverson, & Briggs, 2012), consensus regarding both the effectiveness and appropriateness of this approach for ELL students has yet to be established. Substantive differences in the linguistic backgrounds, academic experiences, and pedagogical needs of ELL and non-ELL students have led to disagreement regarding the benefits of inquiry-based approaches for linguistically diverse students. Thus, we seek to conduct a meta-analysis examining the effectiveness of inquiry-based instruction for ELL students. We begin with a brief overview of ELL students’ performance in science, the rationale behind teaching ELL students with inquiry-based instruction, and the promises and challenges associated with its application.

The Need for Effective Science Instruction: Underachievement of ELL Students in Science

ELL students’ educational attainment has received growing attention due to persistently low achievement in general and in STEM in particular (Bravo & Cervetti, 2014; Diamond, Maerten-Rivera, Rohrer, & Lee, 2014; Lara-Alecio et al., 2012). Despite increased resources to enhance STEM education, ELL students have yet to attain the same level of academic success as their English-proficient peers (Lee & Buxton, 2013; Maerten-Rivera et al., 2010; Tong, Irby, Lara-Alecio, & Koch, 2014). For example, ELL students consistently score lower on the science portion of the National Assessment of Educational Progress (NAEP) at all grade levels and are more likely to score below basic (NCES, 2014). These findings indicate that ELL students are in need of greater support in STEM education as compared to their non-ELL peers (Genesee, Lindholm-Leary, Saunders, & Christian, 2005; Goldenberg, 2013).

ELL students’ low achievement in science may be attributed in part to their limited proficiency in English and weak mastery of academic language (Kieffer, Lesaux, Rivera, & Francis, 2009). Scientific texts are linguistically complex, informationally dense, and highly technical (Echevarria, Richards-Tutor, Canges, & Francis, 2011; Fang, 2006). The linguistic complexity of scientific texts can impede meaningful learning for ELL students by interrupting information processing and conceptual understanding (Fang, 2006; Janzen, 2008). Thus, ELL students’ science learning may be constrained by their proficiency in English (Lee, 2005).

Potential Benefits of Learning Science With Inquiry Instruction

One view is that ELL students learn best when instruction is situated within meaningful, interactive activities that leverage the language and cultural backgrounds of students (Bravo & Cervetti, 2014; Echevarria et al., 2011). Inquiry instruction is grounded in the constructivist principle that meaningful learning occurs when students engage in authentic activities that promote active knowledge construction through self-guided exploration (Bruner, 1996; Lee, 2005). Students are encouraged to construct knowledge by posing questions about the natural world, testing theories through carefully planned investigations, and drawing conclusions based on empirical results (Bruner, 1996). Thus, teachers facilitate meaningful dialogue, experimentation, and engagement (Minner, Levy, & Century, 2010). Inquiry instruction is often contrasted with traditional approaches, such as direct instruction, which aim to build factual knowledge through explicit exposition and highly structured teacher guidance (Kirschner, Sweller, & Clark, 2006). Although direct instruction is commonly used, inquiry learning has been found to improve students’ attitudes toward science (Jiang & McComas, 2015), enhance problem-solving skills (Lazonder & Harmsen, 2016), and increase learning outcomes (Alfieri, Brooks, Aldrich, & Tenenbaum, 2011).

Although most examinations of inquiry instruction to date involve non-ELL students, its benefits are assumed to generalize to ELL students in a number of ways. First, inquiry instruction’s use of engaging, multisensory activities is assumed to increase ELL students’ access to scientific content by reducing the demands of scientific language (Janzen, 2008). Second, its multimodal nature encourages physical and cognitive engagement to support deeper levels of learning (Huerta & Jackson, 2010). Third, inquiry instruction encourages ELL students to communicate their understanding of scientific concepts and procedures, which may promote their oral and written language skills (August, Branum-Martin, Cardenas-Hagan, & Francis, 2009). Finally, the collaborative nature of inquiry instruction is thought to promote rich learning experiences for ELL students that foster both conceptual knowledge and scientific communication (Lee & Buxton, 2013).

Concerns Regarding the Effectiveness of Inquiry Instruction for ELL Students

Despite its potential benefits, there remain concerns and contradictory findings regarding the effectiveness of inquiry instruction with ELL students. First, ELL students may lack sufficient English proficiency to benefit fully from inquiry instruction (August et al., 2009; Bresser & Fargason, 2013; Huerta, Tong, Irby, & Lara-Alecio, 2016). Despite using multimodal approaches to pedagogy, inquiry instruction still has heavy linguistic demands, requiring students to generate predictions, communicate their findings, and engage in meaningful scientific discourse. However, many ELL students are still developing the very language skills critical for active participation and building understanding of the content. Thus, the provision of more hands-on, active learning opportunities may not sufficiently address the linguistic challenges faced by ELL students in the science classroom (August et al., 2009; Bravo & Cervetti, 2014; Lee, Deaktor, Enders, & Lambert, 2008).

Second, the assumption that inquiry instruction is more effective than traditional methods has also been challenged (e.g., Kirschner et al., 2006; Tobias & Duffy, 2009). The hands-on, self-guided exploration characteristic of inquiry may not provide sufficient instructional guidance and structure to facilitate meaningful learning and transfer (Mayer, 2004). Although hands-on instruction may provide students with salient, highly contextualized learning experiences, inquiry instruction may not provide enough of a framework to enable students to represent scientific principles and understanding more abstractly and generalize what they have learned to new contexts.

Finally, the benefits of inquiry instruction may be limited to students who already have sufficient prior knowledge to support exploratory learning (Kirschner et al., 2006; Klahr & Nigam, 2004). Because ELL students’ access to quality instruction is often limited by English-only instruction, tracking into remedial classes, and attending English support services at the exclusion of content-area instruction (Robinson-Cimpian, Thompson, & Umansky, 2016), they may lack the academic preparation to fully benefit from inquiry instruction. Thus, the effects of inquiry instruction for ELL students require greater examination.

Factors That May Influence the Effectiveness of Inquiry Instruction

From a developmental perspective, there are compelling reasons to expect that the effectiveness of inquiry-based instruction may differ on the basis of student grade level (Meyer, 2000). One factor is that as ELL students progress from first grade and beyond, they build their knowledge base in science, proficiency in English, and metacognitive abilities, all of which contribute to higher learning and achievement. Consequently, inquiry-based instruction may be more advantageous for older ELL students who, compared to their younger counterparts, are more likely to have the requisite skills and knowledge to meet the demands of learning science with inquiry-based instruction. On the other hand, the increasingly rigorous academic and linguistic demands associated with science inquiry in higher grade levels might overburden older ELL students and result in diminished learning (Tolbert, Stoddart, Lyon, & Solis, 2014).

Second, the effectiveness of inquiry instruction may be influenced by factors such as teacher preparation and instructional time. Many elementary school teachers report they have been inadequately prepared to teach ELL students science (Cervetti, Kulikowich, & Bravo, 2015; Zwiep & Straits, 2013). However, teachers’ instructional skills and pedagogical knowledge have been shown to have a significant impact on students’ science achievement (Heller, Daehler, Wong, Shinohara, & Miratrix, 2012). Professional development has been found to improve the delivery of inquiry instruction by raising teachers’ pedagogical knowledge and understanding of ELL students’ learning needs (Yoon, Duncan, Lee, Scarloss, & Shapley, 2007).

Third, inquiry instruction requires heavy investments in instructional time. There is considerable variation in the amount of class time devoted to inquiry instruction, which may also influence its effectiveness for ELL students (Baker, Fabrega, Galindo, & Mishook, 2004; Dorph, Shields, Tiffany-Morales, Hartry, & McCaffrey, 2011). Thus, our meta-analysis considers professional development and instructional time in a moderation analysis.

Prior Reviews of Inquiry-Based Instruction for ELL and Non-ELL Students

Several narrative reviews summarizing the prevailing state of knowledge on effective teaching approaches with ELL students provide initial support for the use and effectiveness of inquiry-based instruction with ELL students. Lee (2005) performed a systematic review of research on the science education (K–12) of ELL students and found that hands-on, inquiry-based instruction was generally associated with positive achievement outcomes among all students, including those with lower levels of English proficiency and prior science experience. More recently, Janzen’s (2008) narrative review on content-area instruction in science with ELL students found similar evidence suggesting that inquiry-based instruction led to improvements in both ELL students’ language development and science achievement. Although these reviews offer a useful summary of research on the effectiveness of inquiry-based instruction with ELL students, they use qualitative rather than quantitative methods, do not provide effect size estimates, and are no longer current.

Three more recent meta-analyses comparing the effectiveness of inquiry-based instruction with direct instruction support the advantage of inquiry-based instruction. First, Alfieri et al.’s (2011) meta-analysis contrasted the effectiveness of direct instruction to both guided and unguided forms of inquiry-based instruction. They found that inquiry-based instruction produced greater achievement outcomes in science than direct instruction (d = .11). Similarly, Furtak et al.’s (2012) meta-analysis found that inquiry-based instruction resulted in significantly greater learning outcomes (d = .50). Finally, Lazonder and Harmsen (2016) showed that guided forms of inquiry instruction produced a positive effect on students’ science content knowledge (d = .50) and ability to perform inquiry (d = .66). Although these meta-analyses provide evidence suggesting that inquiry-based instruction can be an effective method of learning for students as compared with traditional instruction, they were based on studies conducted primarily with mainstream English-proficient students, and thus, their results may not generalize to ELL students.

Present Study

Previous syntheses of research have concluded that inquiry-based instruction is a particularly effective approach for improving students’ science achievement outcomes. However, to our knowledge, no study to date has explicitly evaluated changes in ELL students’ science achievement as a result of receiving inquiry instruction in a comprehensive and quantitative synthesis. To this end, we conducted a meta-analysis to determine the extent to which inquiry instruction serves ELL students’ science achievement, addressing the following questions:

Research Question 1: Is inquiry-based science instruction an effective method of teaching for ELL students relative to direct instruction?
Research Question 2: Does inquiry science instruction provide comparable learning benefits to ELL students relative to their English proficient peers?
Research Question 3: What types of factors, if any, moderate the impact of inquiry instruction on science achievement outcomes for ELL students?

Method

Selection of Studies and Data Collection

Inclusion criteria.

We developed selection criteria that would capture empirical studies designed to evaluate the impact of inquiry instruction on science achievement for ELL students. Both published and unpublished studies were eligible to be included as long as they (a) used an experimental or quasi-experimental research design, (b) provided data for ELL students between kindergarten and sixth grade, (c) included a treatment group that received inquiry instruction and either a business-as-usual control group receiving direct instruction or a non-ELL student comparison group, (d) assessed the effects of inquiry instruction on students’ science learning outcomes and reported these effects quantitatively, (e) provided sufficient data to calculate effect sizes, and (f) were either published in or translated into English. To avoid sample bias to the best extent possible, studies that focused exclusively on students who were reclassified as fluent English proficient (i.e., former ELLs) were excluded from this meta-analysis. Furthermore, studies that combined results for ELL and non-ELL students or elementary and non-elementary school students such that effect sizes could not be extracted independently for each subsample were also excluded.

Search procedure.

A comprehensive and systematic search was conducted (between 2000 and 2016) using ERIC, PsycINFO, and Google Scholar, with the search terms science, instruction, education, teaching, K–6, methods, English as a second language, English language learner, limited English proficient, inquiry, discovery, hands-on, and projects strategies. The search was restricted to studies that were published in the years 2000 to 2016. To identify unpublished studies in ERIC and Google Scholar, we modified the search parameters to include dissertations, theses, and conference proceedings. We also submitted our selected articles to both forward and backward searches. Forward searches were carried out by searching for articles that cited other studies that met our search criteria, while backward searches were conducted by manually reviewing the reference sections of each paper for additional studies that matched our search criteria. Studies identified in literature reviews and prior syntheses were also reviewed for inclusion. Finally, we contacted authors of the included studies to solicit other published or unpublished studies that may be relevant to this meta-analysis.

Study selection.

This search procedure returned over 5,000 potentially relevant articles. Using the selection criteria established previously, we examined the title, abstract, and keywords of each article. Studies that met the most fundamental aspects of the selection criteria—that is, whether or not a study investigated the effect of (a) inquiry-based instruction on the (b) science achievement of (c) ELL students—were flagged for potential inclusion and saved for a second review. When abstracts did not provide adequate information for eligibility judgments, the full text of the article was obtained and screened for potential inclusion using the aforementioned search criteria. If multiple reports of the same study were identified (e.g., dissertation/thesis, journal article), they were grouped together and cross-referenced for complete information, and the most comprehensive study was retained. Based on this first round of the literature search, 32 articles were flagged as potentially relevant.

In the second round of reviews, we evaluated each article in greater detail. Six studies were excluded because they lacked an eligible treatment/comparison group or science achievement measure or did not provide sufficient information to calculate effect sizes. Studies with missing effect size information were excluded only if we could not obtain the data to estimate effect sizes after requesting them from the corresponding authors. Disagreements regarding whether to include a study were discussed by the research team until consensus was reached. Overall, this selection procedure yielded a total of 26 studies for inclusion in the meta-analysis. Figure 1 summarizes the study search procedure and selection criteria.

FIGURE 1. Flow diagram of study selection procedure and selection criteria.

*Note*. ELL = English language learner.

Coding of studies.

First, students were classified based on their English proficiency (Saunders & Marcelletti, 2013). Students who were non-native speakers of English with limited English proficiency were coded as ELL, and native English speakers and language-minority students proficient in English were coded as non-ELL. Instruction involving hands-on, self-guided learning tasks requiring students to construct science knowledge using questions and investigations was coded as inquiry instruction (Bruner, 1996; Furtak et al., 2012). Explicit instruction using highly structured lectures, demonstrations, textbooks, or other teacher-centered methods was coded as direct instruction (Alfieri et al., 2011; Mayer, 2004). Finally, science achievement outcomes were coded if they quantitatively assessed changes in students’ performance on measures of conceptual, factual, or procedural knowledge (Minner et al., 2010).

Implementation variables were coded for moderation analysis. We coded the length of the intervention as the number of months of intervention. Student grade level was coded for K through six. Professional development training was coded as 1 when it was provided and 0 when no training was provided. When professional development was provided, the duration in hours was coded, and the dosage was categorized as small if under 15 hours were provided or large if 15 or more hours were provided. The focus of the training regime was considered ELL focused when training addressed the needs or instruction of ELL students. It was coded as non-ELL focused if training was not specific to the needs of ELL students, such as addressing science pedagogy in general.

The methodological features of each study were coded based on its publication status (published journal article vs. unpublished dissertation/technical report), research design (randomized experiment vs. quasi-experiment), measurement design (pretest and posttest vs. posttest only), assessment format (multiple choice vs. constructed response), and assessment type (researcher-developed test vs. standardized test).

Coding was conducted by the first author using a standardized coding protocol developed in advance by the research team (available on request from the authors). To ensure reliability and accuracy, all studies were double-coded by the second author. Interrater reliability was established by calculating the percentage of codes on which the two coders agreed, which yielded a high percent agreement of 93.6%. Coding discrepancies were discussed as a group until consensus was reached.
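As an illustration of the reliability statistic described above, percent agreement can be sketched as the share of coding decisions on which two coders match. The function name and the example codes are hypothetical and are not taken from the coding protocol.

```python
def percent_agreement(codes_a, codes_b):
    """Percentage of items on which two coders assigned the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("Both coders must rate the same set of items.")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)

# Hypothetical codes for five study features (not data from the review).
coder_1 = ["RCT", "pre-post", "published", "MC", "standardized"]
coder_2 = ["RCT", "pre-post", "published", "CR", "standardized"]
agreement = percent_agreement(coder_1, coder_2)
```

Percent agreement does not correct for chance agreement, so it is usually read as an upper bound on reliability.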

Meta-Analytic Procedures

Evaluating the effects of inquiry instruction for ELL students.

We derived three separate meta-analytic effect size (ES) estimates based on standardized mean differences. First, we evaluated the effectiveness of inquiry instruction, or treatment ES, using the standardized mean difference in science achievement outcomes between ELL students who learned with inquiry instruction (treatment condition) and ELL students who learned with traditional instruction (control condition). Positive values for treatment ES indicate that ELL students in the treatment condition outperformed those in the control condition.

Examining the effects of inquiry instruction between ELL and non-ELL students.

The second analysis examined whether inquiry instruction had similar benefits for ELL and non-ELL students. We calculated the inquiry ES using the standardized mean difference in learning outcomes between ELL and non-ELL students who received inquiry instruction within studies that reported data for both groups. Positive values for inquiry ES indicate that ELL students showed greater gains in inquiry instruction than non-ELL students.

To contextualize the inquiry ES findings, we estimated the effect size for traditional science instruction (traditional ES) using the standardized mean difference in learning outcomes between ELL and non-ELL students who received traditional instruction within studies that reported data for both groups. Positive values for traditional ES indicate that ELL students showed greater gains with traditional instruction than non-ELL students. Studies that reported information to estimate one effect size but not another were included in all analyses for which sufficient information was provided.

Computation of effect sizes.

The standardized mean difference (Cohen’s d) was computed in one of several ways, depending on the data provided. First, when only posttest data were available, the standardized mean difference was calculated as:

d = \frac{\bar{X}_{T} - \bar{X}_{C}}{SD_{pooled}},

where ${\bar{X}}_{T}$ is the mean posttest score of the treatment condition, ${\bar{X}}_{C}$ is the mean posttest score of the control condition, and SD_pooled is the pooled standard deviation (Lipsey & Wilson, 2001). The pooled standard deviation was calculated as follows:

SD_{pooled} = \sqrt{\frac{(n_{T} - 1) SD_{T}^{2} + (n_{C} - 1) SD_{C}^{2}}{n_{T} + n_{C} - 2}},

where n_T and n_C are the sample sizes associated with the treatment and control group, and SD_T and SD_C reflect their respective standard deviations.
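The posttest-only computation can be sketched in a few lines. This is a minimal illustration of the two formulas above; the function names and the example values are hypothetical and do not come from any included study.

```python
import math

def pooled_sd(n_t, sd_t, n_c, sd_c):
    """Pooled standard deviation of the treatment and control groups."""
    numerator = (n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2
    return math.sqrt(numerator / (n_t + n_c - 2))

def cohens_d_posttest(mean_t, mean_c, n_t, sd_t, n_c, sd_c):
    """Standardized mean difference based on posttest scores only."""
    return (mean_t - mean_c) / pooled_sd(n_t, sd_t, n_c, sd_c)

# Hypothetical groups: a 5-point difference with SD = 10 in both groups.
d_post = cohens_d_posttest(mean_t=75.0, mean_c=70.0,
                           n_t=30, sd_t=10.0, n_c=30, sd_c=10.0)
```

With equal standard deviations the pooled value equals the common standard deviation, so the example reduces to a difference of half a standard deviation.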

When pretest and posttest data were reported for both groups, pretest-adjusted estimates of the standardized mean difference were calculated as:

d = \frac{(\bar{X}_{T,post} - \bar{X}_{T,pre}) - (\bar{X}_{C,post} - \bar{X}_{C,pre})}{SD_{pooled}},

where $\bar{X}$ is the mean test score for the treatment (T) and control condition (C) measured before (pre) and after (post) the learning phase, and SD_pooled is the pooled standard deviation of pretests (Morris, 2008). This method of estimation was preferred as it produces more conservative effect size estimates by adjusting for baseline differences in prior knowledge (Furtak et al., 2012).
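A short sketch may help show how the pretest adjustment changes the estimate. It assumes the pooled pretest standard deviation is formed in the same way as the posttest pooled standard deviation; the function name and the numbers are hypothetical.

```python
import math

def pretest_adjusted_d(t_pre, t_post, c_pre, c_post,
                       n_t, sd_t_pre, n_c, sd_c_pre):
    """Difference in pre-to-post gains, scaled by the pooled pretest SD."""
    sd_pre = math.sqrt(((n_t - 1) * sd_t_pre ** 2 + (n_c - 1) * sd_c_pre ** 2)
                       / (n_t + n_c - 2))
    return ((t_post - t_pre) - (c_post - c_pre)) / sd_pre

# Hypothetical study: the treatment group gains 12 points and the
# control group gains 6 points; both pretest SDs are 12.
d_adj = pretest_adjusted_d(t_pre=50.0, t_post=62.0, c_pre=52.0, c_post=58.0,
                           n_t=25, sd_t_pre=12.0, n_c=25, sd_c_pre=12.0)
```

In this example the groups differ by 2 points at pretest, so the estimate reflects the 6-point difference in gains rather than the 4-point difference in posttest means.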

When dichotomous data were reported (e.g., proportion of students who attained proficiency on a standardized test), we converted the log odds ratio of successes between groups into the standardized mean difference using the transformation procedure outlined by Borenstein, Hedges, Higgins, and Rothstein (2009).
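The conversion described by Borenstein et al. (2009) multiplies the log odds ratio by √3/π. The sketch below applies it to a hypothetical 2 × 2 table of proficiency counts; the function names and the counts are illustrative only.

```python
import math

def log_odds_ratio(success_t, n_t, success_c, n_c):
    """Log odds ratio of success in the treatment versus control group."""
    fail_t = n_t - success_t
    fail_c = n_c - success_c
    return math.log((success_t * fail_c) / (fail_t * success_c))

def log_or_to_d(log_or):
    """Convert a log odds ratio to a standardized mean difference."""
    return log_or * math.sqrt(3) / math.pi

# Hypothetical counts: 30 of 50 treatment students and 20 of 50 control
# students reach proficiency, giving an odds ratio of 2.25.
ln_or = log_odds_ratio(success_t=30, n_t=50, success_c=20, n_c=50)
d_from_or = log_or_to_d(ln_or)
```

The function fails on tables with a zero cell; a continuity correction would be needed in that case and is not shown here.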

If means and standard deviations were missing but regression coefficients were reported, effect sizes were approximated using the t statistic corresponding to the null hypothesis of independent group differences between the treatment and control condition (Borenstein et al., 2009; Lipsey & Wilson, 2001):

d = t \sqrt{\frac{n_{T} + n_{C}}{n_{T} n_{C}}},

where n_T and n_C are the sample sizes for the treatment and control conditions and t is the t-test value for the group comparison. However, because estimates derived from multivariate regression analyses yield partial effect sizes that may not be comparable across studies (Becker & Wu, 2007), sensitivity analyses were performed to examine whether the results were robust to their inclusion.
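As a brief illustration of this approximation, the t statistic is rescaled by a factor that depends only on the two group sizes. The function name and values are hypothetical.

```python
import math

def d_from_t(t, n_t, n_c):
    """Approximate Cohen's d from an independent-groups t statistic."""
    return t * math.sqrt((n_t + n_c) / (n_t * n_c))

# Hypothetical result: t = 2.0 with 50 students per condition.
d_t = d_from_t(t=2.0, n_t=50, n_c=50)
```

Because the scaling factor shrinks as the groups grow, the same t value corresponds to a smaller standardized difference in a larger study.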

Finally, to adjust for upward bias in Cohen’s d associated with small samples (n < 20), all effect sizes were transformed into Hedges’ g using the small-sample correction factor proposed by Hedges (1982):

g = \left[1 - \frac{3}{4 (n_{T} + n_{C}) - 9}\right] d,

where n_T and n_C are the respective sample sizes for the treatment and control conditions, and d is the original standardized mean difference effect size. All effect size computations and subsequent analyses were conducted using the software Comprehensive Meta-Analysis (CMA), version 3 (Borenstein et al., 2014), unless otherwise noted.
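The size of the correction is easiest to see with a small example. The sketch below implements the formula above; the function name and sample sizes are hypothetical.

```python
def hedges_g(d, n_t, n_c):
    """Apply the small-sample correction factor to Cohen's d."""
    correction = 1.0 - 3.0 / (4.0 * (n_t + n_c) - 9.0)
    return correction * d

# Hypothetical comparison of a small and a large study with the same d.
g_small = hedges_g(d=0.5, n_t=10, n_c=10)    # correction = 1 - 3/71
g_large = hedges_g(d=0.5, n_t=200, n_c=200)  # correction = 1 - 3/1591
```

The correction always shrinks d toward zero, and it becomes negligible once the combined sample is large.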

Dependent effect sizes.

To resolve statistical dependence among studies reporting multiple outcomes for the same group of students, we report the mean effect size for all outcomes to yield a single effect size per study. Similarly, for longitudinal studies involving the same cohort of students, effect sizes were collapsed together to yield a single average effect size per study. We report the mean effect size for multiple treatment groups when they were compared to a single control group. However, effect sizes generated from two or more different subgroups (i.e., grade levels, cohorts of students, or treatments) within a study such that each subgroup was accompanied with its own distinct comparison group were treated as independent (Borenstein et al., 2009). This made it possible for multiple effect sizes to be extracted from a single study. These procedures ensured that each effect size was estimated based on an independent set of data and that each analysis was conducted with an independent set of effect sizes.
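The aggregation rule can be sketched as a simple group-and-average step. This illustration handles point estimates only; the variance of each averaged effect, which depends on the correlation among outcomes, is not computed. The identifiers and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def collapse_by_unit(rows):
    """Average all effect sizes that share an independent unit.

    Each row is (unit_id, g). A unit is a study, or a subgroup within a
    study that has its own distinct comparison group.
    """
    grouped = defaultdict(list)
    for unit_id, g in rows:
        grouped[unit_id].append(g)
    return {unit_id: mean(values) for unit_id, values in grouped.items()}

# Hypothetical input: Study A reports two outcomes for one sample, while
# Study B reports separate grade 3 and grade 5 samples with own controls.
rows = [
    ("A", 0.20), ("A", 0.40),
    (("B", "grade 3"), 0.50),
    (("B", "grade 5"), 0.10),
]
independent_effects = collapse_by_unit(rows)
```

Keying subgroups by a (study, subgroup) pair is one way to let a single study contribute more than one independent effect size.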

Data synthesis.

To estimate the overall effect size, studies were issued weights based on their level of precision (i.e., standard error). Because the effects of inquiry instruction on science achievement outcomes were assumed to vary among studies as a function of population, intervention, and methodological differences, we used random effects models to calculate the overall weighted mean effect size ( $\bar{g}$ ) as:

\bar{g} = \frac{\sum w_{i} g_{i}}{\sum w_{i}},

where g_i is the observed effect size for study_i and w_i is the inverse variance weight assigned to study_i (Borenstein et al., 2009). This approach allows relatively greater weight to be assigned to studies with higher levels of precision.
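The weighting scheme can be sketched as follows. Under a random effects model each weight is the inverse of the within-study variance plus a between-study variance component. The text does not state which estimator of that component was used, so the sketch takes it as an input; `tau2` and the other names are hypothetical, as are the values.

```python
import math

def weighted_mean_effect(g, v, tau2=0.0):
    """Inverse-variance weighted mean effect size and its standard error.

    g: effect sizes; v: their within-study sampling variances.
    tau2: between-study variance (0.0 gives fixed-effect weights).
    """
    weights = [1.0 / (v_i + tau2) for v_i in v]
    g_bar = sum(w_i * g_i for w_i, g_i in zip(weights, g)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return g_bar, se

# Hypothetical studies: a precise study with a small effect and an
# imprecise study with a large effect.
effects = [0.2, 0.8]
variances = [0.01, 0.04]
g_fixed, se_fixed = weighted_mean_effect(effects, variances)
g_random, se_random = weighted_mean_effect(effects, variances, tau2=0.05)
```

Adding the between-study component makes the weights more equal, so the random effects mean moves toward the unweighted average and its standard error grows.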

Heterogeneity of effect sizes.

We used the Q test of heterogeneity to examine the variation in effect size estimates between studies (Lipsey & Wilson, 2001). Moreover, the I² statistic quantifies the percent of variation attributable to true heterogeneity relative to sampling error (Higgins, Thompson, Deeks, & Altman, 2003). Overall, I² values range from 0% to 100%, with increasing values reflecting greater levels of heterogeneity.
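Both statistics follow from the inverse-variance weights, as the sketch below illustrates. It computes Q around the fixed-effect mean and derives I² from Q and its degrees of freedom (k − 1). The significance test, which compares Q with a chi-square distribution, is omitted; names and values are hypothetical.

```python
def heterogeneity(g, v):
    """Return Cochran's Q, its degrees of freedom, and I-squared (%)."""
    weights = [1.0 / v_i for v_i in v]
    g_fixed = sum(w_i * g_i for w_i, g_i in zip(weights, g)) / sum(weights)
    q = sum(w_i * (g_i - g_fixed) ** 2 for w_i, g_i in zip(weights, g))
    df = len(g) - 1
    # I-squared is truncated at zero when Q falls below its expectation.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2

# Hypothetical pair of equally precise studies with different effects.
q_stat, q_df, i_squared = heterogeneity([0.1, 0.5], [0.02, 0.02])
```

In this example Q is 4 on 1 degree of freedom, so three quarters of the observed variation is attributed to true heterogeneity.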

Moderation analysis.

When there was significant heterogeneity across studies, we conducted moderation analyses to examine whether variation among effect sizes could be explained by factors that differ between studies. For categorical variables, we performed a Q test of between-group differences (Q_B) using CMA’s one-way ANOVA function. For continuous variables, we tested the relation between a moderator and magnitude of effect size using CMA’s unrestricted maximum likelihood meta-regression function. All moderation analyses were conducted using random effects models weighted by the inverse variance of effect sizes.

Sensitivity Analyses and Robustness Checks

Four sensitivity analyses were used to assess how the choice of statistical methods and data inclusion decisions affected our conclusions, and thus to examine the robustness of our findings.

Robust variance estimation.

As noted previously, we resolved statistical dependence among our sample of effect sizes using standard meta-analytic methods, namely, collapsing multiple effect sizes within each study to create a single synthetic effect. To utilize all effect sizes from each study, we reanalyzed the data set of effect sizes using robust variance estimation (RVE; Hedges, Tipton, & Johnson, 2010) with a correction for small sample size bias (Tipton, 2015). This approach permits the synthesis of multiple dependent effect sizes by adjusting the standard errors to account for an assumed correlation (ρ) between effect sizes within studies, thereby minimizing the loss of information that occurs through aggregation. One important limitation to this approach, however, is that a minimum of 40 independent studies with an average of five effect sizes per study are needed to estimate a meta-regression coefficient (Tanner-Smith & Tipton, 2014). This issue is particularly problematic in meta-analyses involving categorical variables with multiple levels. As a result, RVE methods were employed only in the synthesis of overall weighted mean effect sizes. All analyses using RVE were conducted in the R statistical environment (version 3.4.2) using the robumeta package (Fisher & Tipton, 2014; Tanner-Smith & Tipton, 2014).

Outliers.

Boxplots were used to identify potential outliers, defined as effect sizes that were more than 1.5 interquartile ranges above the 75th percentile or below the 25th percentile of the distribution. Two effect sizes were identified as outliers in the treatment ES analyses. Because these outliers could not be attributed to methodological or theoretical differences between studies, and given the relatively small number of studies in the sample, we elected not to eliminate these estimates. Rather, we adjusted the outliers downward to more conservative values using the 90% Winsorization procedure described by Lipsey and Wilson (2001). All analyses were subsequently carried out using the adjusted data set. Results for the original sample are reported in a sensitivity analysis.
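The outlier rule and adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the effect sizes are hypothetical, percentiles use simple linear interpolation, and "90% Winsorization" is read as clipping values to the 5th and 95th percentiles, which may differ in detail from the procedure in Lipsey and Wilson (2001).

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of numbers."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must be non-empty")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def boxplot_outliers(values, k=1.5):
    """Values lying more than k IQRs beyond the quartiles."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def winsorize(values, lower_q=5, upper_q=95):
    """Clip values to the given lower and upper percentiles (assumed 5/95)."""
    lo, hi = percentile(values, lower_q), percentile(values, upper_q)
    return [min(max(v, lo), hi) for v in values]


# Hypothetical effect sizes, for illustration only.
effects = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 1.90]
print("outliers:", boxplot_outliers(effects))
print("adjusted:", [round(v, 2) for v in winsorize(effects)])
```

Clipping rather than deleting keeps every study in the sample while limiting the pull of extreme estimates on the weighted mean.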

Study quality.

We assessed the methodological quality of included studies using a version of the Quality Assessment Tool for Quantitative Studies (National Collaborating Centre for Methods and Tools, 2008), which was adapted for use with educational research. This quality appraisal tool uses judgments about the extent to which bias may be present in six methodological domains to produce an overall quality rating of weak, moderate, or strong. Although methodologically rigorous studies are more likely to produce valid results (Higgins, Altman, & Sterne, 2017), we decided not to exclude studies on the basis of methodological quality. Including these studies allowed us to maintain our sample of effect sizes and provided a more complete picture of the current research landscape. However, to examine whether our findings were sensitive to differences in study quality, we conducted a sensitivity analysis that excluded studies with an overall quality rating of weak.

Publication bias.

We assessed the potential for publication bias among the sample of effect sizes included in the treatment ES analysis because studies that report null findings or relatively small effects are less likely to be published (Rosenthal, 1979; Song, Hooper, & Loke, 2013). We tried to mitigate publication bias a priori by seeking to include unpublished work (k = 6). The extent and impact of publication bias were assessed graphically using funnel plots and statistically using Egger’s linear regression test (Egger, Smith, Schneider, & Minder, 1997) and a trim and fill analysis of the corresponding funnel plots (Duval & Tweedie, 2000).
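To make the regression test concrete, the sketch below fits the usual Egger formulation by ordinary least squares: each study's standard normal deviate (effect size divided by its standard error) is regressed on its precision (the reciprocal of the standard error), and an intercept far from zero points to funnel plot asymmetry. The function name and data are hypothetical. The sketch returns only the intercept and its standard error and leaves out the t-test p-value.

```python
import math


def egger_intercept(effects, std_errors):
    """OLS intercept (and its SE) from regressing g/SE on 1/SE."""
    n = len(effects)
    if n != len(std_errors) or n < 3:
        raise ValueError("need matching inputs for at least three studies")
    x = [1.0 / se for se in std_errors]                  # precision
    y = [g / se for g, se in zip(effects, std_errors)]   # standard normal deviate
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be identical")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    se_intercept = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
    return intercept, se_intercept


# Hypothetical data in which smaller studies report larger effects.
effects = [0.80, 0.60, 0.40, 0.30, 0.25]
std_errors = [0.40, 0.30, 0.20, 0.10, 0.05]
b0, se_b0 = egger_intercept(effects, std_errors)
print(f"intercept = {b0:.2f} (SE = {se_b0:.2f})")
```

In practice the intercept is tested against zero with a t distribution on n − 2 degrees of freedom.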

Results

Contrasting the Effects of Inquiry and Traditional Instruction for ELL Students

Our first objective was to evaluate whether inquiry-based instruction is more effective than traditional instruction for ELL students. To address this question, we calculated the standardized mean difference (k = 23) in science learning outcomes between ELL students who received inquiry instruction (n = 4,204) and ELL students who received traditional instruction (n = 4,087). Figure 2 shows that, overall, ELL students receiving inquiry instruction tended to obtain science scores that were over one-quarter of a standard deviation higher than those receiving traditional instruction, Treatment ES = +0.28 (SE = 0.07, p < .001). The 95% confidence interval ranges from 0.15 to 0.41, suggesting that, overall, inquiry instruction produces a small positive impact on ELL students’ learning outcomes.

FIGURE 2. — Estimated mean treatment effect size (difference in science achievement between English language learners in treatment and control conditions) for each study with overall mean weighted effect size. Forest plot showing treatment effect sizes with 95% confidence interval and 95% prediction interval. Studies with alphabetic superscripts refer to multiple independent effect sizes generated from the same study.

Contrasting the Effects of Inquiry and Traditional Instruction Between Language Groups

Our second question examines whether inquiry instruction leads to comparable learning benefits for ELL and non-ELL students; the results are presented in Figure 3. To this end, we estimated the standardized mean difference (k = 30) in science learning outcomes between ELL (n = 5,459) and non-ELL (n = 42,700) students receiving inquiry instruction. The significant inquiry ES of −0.31 (SE = 0.08, p < .001) suggests that non-ELL students obtained science achievement scores that were about one-third of a standard deviation higher than those of ELL students.

FIGURE 3. — Estimated mean inquiry effect size (difference in science achievement between English language learners and non-English language learners in treatment condition) for each study with overall weighted effect size. Forest plot showing inquiry effect sizes with 95% confidence interval and 95% prediction interval. Studies with alphabetic superscripts refer to multiple independent effect sizes generated from the same study. The data used to calculate an overall effect size for Lee et al. 2004–2007 is based on information reported in Lee, Maerten-Rivera, Penfield, Leroy, and Secada (2008); Lee, Mahotiere, Salinas, Penfield, and Maerten-Rivera (2009); and Lee, Penfield, and Maerten-Rivera (2009).

Next, we investigated how the achievement gap between ELL and non-ELL students receiving inquiry science instruction compared with the relative performance of ELL and non-ELL students receiving traditional science instruction. To do so, we calculated the standardized mean difference (k = 15) in science learning outcomes between ELL (n = 3,085) and non-ELL (n = 9,364) students who received traditional instruction in the control condition (see Figure 4). Overall, non-ELL students obtained science scores that were almost half a standard deviation higher than those of ELL students in traditional classrooms, traditional ES = −0.46 (SE = 0.12, p < .001). The achievement gap between ELL and non-ELL students was greater in science classrooms using traditional instruction ( $\bar{g} = - 0.46$ ) than in those using inquiry instruction ( $\bar{g} = - 0.31$ ). Although the 95% confidence intervals for these two effect sizes overlap, the pattern suggests that inquiry instruction may help attenuate the science achievement gap for ELL students.

FIGURE 4. — Estimated mean traditional effect sizes (difference in science achievement between English language learners and non-English language learner students in control condition) for each study with overall weighted effect size. Forest plot showing traditional effect sizes with 95% confidence interval and 95% prediction interval.

Heterogeneity of Effect Sizes

Heterogeneity analyses were conducted to assess the presence and degree of between-study variation using the Q-test and I² statistic. First, we tested the treatment ES, or the degree to which ELL students obtained higher outcomes with inquiry instruction, for heterogeneity. We found a high degree of heterogeneity among the studies, Q = 126.84, df = 22, p < .001, with the I² statistic revealing that 83% of the total observed variance could be attributed to between-study differences rather than within-study sampling error. Next, we examined the heterogeneity of the inquiry ES, or the achievement gap between ELL and non-ELL students receiving inquiry instruction. Once again, there was a high degree of heterogeneity among the studies, Q = 377.03, df = 29, p < .001, with the I² statistic indicating that 92% of the variance could be attributed to between-study differences. Finally, the traditional ES, or the achievement gap between ELL and non-ELL students receiving traditional science instruction, also showed a high degree of heterogeneity, Q = 246.77, df = 14, p < .001, with the I² statistic revealing that 92% of the variance is attributable to true heterogeneity. Because heterogeneity was significant in each sample of effect sizes, we conducted a set of moderator analyses for each sample to identify the sources of between-study variation.
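The relation between Q and I² used above can be written as I² = (Q − df) / Q × 100, truncated at zero. The short sketch below applies that standard formula to the Q and df values reported for the treatment ES and inquiry ES. The function name is hypothetical.

```python
def i_squared(q, df):
    """Percentage of observed variance attributed to between-study differences."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Q and df as reported in the text.
print(f"treatment ES: I^2 = {i_squared(126.84, 22):.0f}%")
print(f"inquiry ES:   I^2 = {i_squared(377.03, 29):.0f}%")
```

Because Q has an expected value of df when all studies share one true effect, the excess Q − df is what the formula treats as true heterogeneity.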

Moderation Analyses

To identify moderating factors that may influence the effect of inquiry instruction on ELL students’ science achievement, we conducted two sets of analyses to examine the potential influence of categorical and continuous moderators for each effect size. Table 1 presents the results for categorical moderators obtained from the subgroup analyses for treatment ES, while the subgroup moderation results corresponding to inquiry ES and traditional ES are displayed in Table 2 and Table 3, respectively. Table 4 presents the results for continuous moderators obtained from the weighted random effects meta-regression analyses. To reduce the potential for confounding in the meta-regression analyses, each predictor is included in the regression analyses as a covariate, along with the following indicators of methodological quality: publication status, research design, and measurement design.

TABLE 1.

Overall Weighted Mean Treatment Effect Size (ES) for Subgroup Analyses of Categorical Moderators

			Treatment ES and 95% CI				Test of Difference
Moderator	n	k	$\bar{g}$	SE	Lower	Upper	Q_B	df
Publication status							8.19^**	1
Published	7,595	17	0.37^***	0.07	0.24	0.51
Unpublished	696	6	−0.04	0.11	−0.26	0.18
Research design							5.20^*	1
Randomized experiment	6,161	15	0.18^*	0.07	0.04	0.33
Quasi-experiment	2,130	8	0.46^***	0.10	0.26	0.65
Measurement design							0.01	1
Pretest and posttest	3,651	13	0.27^*	0.11	0.06	0.48
Posttest only	4,154	10	0.27^*	0.09	0.10	0.45
Assessment format							2.22	2
Multiple choice	6,248	14	0.27^***	0.09	0.10	0.44
Constructed response	460	2	0.58^*	0.23	0.13	1.04
Mixed	1,583	7	0.19	0.12	−0.05	0.43
Assessment type							5.08^*	1
Researcher-developed	2,976	13	0.39^***	0.08	0.23	0.55
Standardized	4,829	10	0.12	0.10	−0.06	0.33
Professional development							4.08	2
Small dose (≤14 hours)	1,175	5	0.19	0.13	−0.06	0.44
Large dose (15+ hours)	6,822	16	0.27^***	0.07	0.14	0.40
Not reported	294	2	0.66^***	0.19	0.28	1.04
Professional development							8.74^*	2
Focused on English language learners	7,125	15	0.32^***	0.06	0.19	0.44
Not focused on English language learners	872	6	0.06	0.11	−0.16	0.27
Not reported	294	2	0.67^***	0.17	0.30	1.03
Student grade level							6.77	5
First	420	2	0.35^†	0.21	−0.06	0.76
Second	220	1	0.38	0.27	−0.15	0.92
Fourth	601	3	0.63^***	0.16	0.32	0.94
Fifth	5,625	9	0.22^**	0.09	0.05	0.40
Sixth	1,058	5	0.24^†	0.12	−0.01	0.48
Mixed	367	3	0.11	0.17	−0.22	0.44

^† p < .10. ^* p < .05. ^** p < .01. ^*** p < .001.

TABLE 2.

Overall Weighted Mean Inquiry Effect Size (ES) for Subgroup Analyses of Categorical Moderators

			Inquiry ES and 95% CI				Test of Difference
Moderator	n	k	$\bar{g}$	SE	Lower	Upper	Q_B	df
Publication status							0.90	1
Published	24,383	21	−0.36^***	0.08	−0.52	−0.20
Unpublished	23,776	9	−0.22^†	0.12	−0.46	0.02
Research design							0.41	1
Randomized experiment	43,408	23	−0.34^***	0.08	−0.49	−0.19
Quasi-experiment	4,751	7	−0.23	0.15	−0.52	0.05
Measurement design							14.52^***	1
Pretest and posttest	31,590	20	−0.17^*	0.08	−0.31	−0.04
Posttest only	16,569	10	−0.66	0.11	−0.87	−0.45
Assessment format							2.29	3
Multiple choice	40,315	18	−0.29^***	0.08	−0.46	−0.13
Constructed response	457	3	−0.67^*	0.29	−1.23	−0.11
Mixed	6,670	8	−0.37^**	0.13	−0.61	−0.12
Other	717	1	−0.05	0.36	−0.76	0.67
Assessment type							5.03^*	1
Researcher-developed	32,923	23	−0.24^***	0.07	−0.38	−0.10
Standardized	15,236	7	−0.56^***	0.12	−0.79	−0.32
Professional development							5.79^†	2
Small dose (≤14 hours)	22,310	11	−0.12	0.11	−0.34	0.10
Large dose (15+ hours)	24,630	17	−0.46^**	0.09	−0.63	−0.28
Not reported	1,219	2	−0.16	0.26	−0.66	0.34
Professional development							3.83	2
Focused on English language learners	17,842	15	−0.45^***	0.10	−0.64	−0.26
Not focused on English language learners	29,815	14	−0.19^*	0.10	−0.40	0.00
Not reported	502	1	−0.26	0.35	−0.95	0.42
Grade level							6.68	4
Third	6,299	2	−0.26	0.25	−0.75	0.24
Fourth	11,730	7	−0.20	0.14	−0.48	0.07
Fifth	21,652	8	−0.54	0.12	−0.78	−0.30
Sixth	7,271	6	−0.38	0.15	−0.67	−0.08
Mixed	1,207	7	−0.08	0.15	−0.37	0.21

^† p < .10. ^* p < .05. ^** p < .01. ^*** p < .001.

TABLE 3.

Overall Weighted Mean Traditional Effect Size (ES) for Subgroup Analyses of Categorical Moderators

			Traditional ES and 95% CI				Test of Difference
Moderator	n	k	$\bar{g}$	SE	Lower	Upper	Q_B	df
Publication status							2.39	1
Published	11,986	10	−0.58^***	0.14	−0.85	−0.31
Unpublished	3,093	5	−0.21	0.19	−0.59	0.17
Research design							1.95	1
Randomized experiment	11,676	8	−0.60^***	0.15	−0.76	−0.14
Quasi-experiment	3,403	7	−0.29^†	0.17	−0.61	0.04
Measurement design							16.18^***	1
Pretest and posttest	7,606	9	−0.24^*	0.11	−0.44	−0.03
Posttest only	7,473	6	−0.92	0.13	−1.18	−0.66
Assessment format							0.39	2
Multiple choice	11,135	8	−0.55^***	0.15	−0.86	−0.25
Constructed response	310	2	−0.49	0.34	−1.13	0.16
Mixed	3,634	5	−0.40^*	0.19	−0.78	−0.02
Assessment type							2.92^†	1
Researcher-developed	5,320	9	−0.36^**	0.13	−0.61	−0.11
Standardized	9,759	6	−0.69^***	0.15	−0.99	−0.40
Grade level							6.83^†	3
Fourth	1,591	3	−0.42^†	0.25	−0.91	0.08
Fifth	9,761	5	−0.76^***	0.18	−1.12	−0.40
Sixth	3,168	4	−0.45^*	0.21	−0.87	−0.04
Mixed	559	3	0.05	0.25	−0.44	0.54

^† p < .10. ^* p < .05. ^** p < .01. ^*** p < .001.

TABLE 4.

Meta-Regression of Continuous Variables on Overall Weighted Mean Effect Sizes (ES)

Moderator	Treatment ES	Inquiry ES	Traditional ES
Constant	0.613^*	−0.582	0.652^†
Methodological controls
Published study	0.433^***	−0.018	−0.216
Randomized experiment	−0.204	−0.111	−0.024
Pretest and posttest design	−0.114	0.516^***	0.429^***
Continuous predictors
Student grade level	−0.049	0.001	−0.197^***
Instruction (weeks)	−0.012^*	0.005	−0.016^***
Professional development (hours)	−0.001	−0.041	—
Number of studies (k)	21	27	15
Between-study variance (τ²)	0.01	0.05	0.01
Heterogeneity (I²), %	57	92	65

Note. Random effects models were used in all meta-regression analyses. Random effects variance components were estimated using maximum likelihood. Effect sizes were computed as Hedges’ g. Reference group for controls = unpublished study, quasi-experiment, posttest-only design.

^† p < .10. ^* p < .05. ^*** p < .001.

Publication status.

Publication status moderated the findings for treatment ES (Q_B = 8.19, df = 1, p = .004). Studies published in peer-reviewed journals had average treatment ESs that were significantly larger than those in unpublished studies ( $\bar{g} = 0.37$ vs. $\bar{g} = - 0.04$ ). Although a similar pattern of results was observed for both the inquiry ES and traditional ES, such that published studies yielded larger effect sizes than unpublished studies, the between-levels difference was not significant for either of these effect sizes (p > .10). These initial analyses suggest that there may be a publication bias for the treatment ES.
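The p-values attached to the Q_B statistics follow from the chi-square distribution with the stated degrees of freedom. As a check, the sketch below computes the upper-tail probability for integer df using only the standard library, through the recurrence for the regularized incomplete gamma function. It then evaluates three Q_B values reported for the treatment ES. The function name `chi2_sf` is hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate, integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x < 0:
        raise ValueError("x must be non-negative")
    y = x / 2.0
    # Closed-form starting points: df = 1 (odd case) or df = 2 (even case).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Q_B and df as reported for the treatment ES moderators.
for label, q_b, df in [
    ("publication status", 8.19, 1),
    ("research design", 5.20, 1),
    ("professional development focus", 8.74, 2),
]:
    print(f"{label}: p = {chi2_sf(q_b, df):.3f}")
```

These probabilities agree, after rounding, with the p-values of .004, .02, and .013 given in the text for the same moderators.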

Research design.

We found significant moderation effects based on the research design for the treatment ES (Q_B = 5.20, df = 1, p = .02) but not for the inquiry ES or traditional ES. Studies using quasi-experimental designs showed significantly larger treatment ESs ( $\bar{g} = 0.46$ ) than those using randomized experimental designs ( $\bar{g} = 0.18$ ).

Measurement design.

The type of measurement design moderated findings for both the inquiry ES (Q_B = 14.52, df = 1, p < .001) and traditional ES (Q_B = 16.18, df = 1, p < .001) but not for the treatment ES. Studies that used posttest-only designs revealed science achievement gaps between ELL and non-ELL students that were on average three to four times larger than those using pretest-posttest designs for both the inquiry ES ( $\bar{g} = - 0.66$ vs. $\bar{g} = - 0.17$ ) and traditional ES ( $\bar{g} = - 0.92$ vs. $\bar{g} = - 0.24$ ). These findings suggest that ELL and non-ELL students show varying levels of prior knowledge and subsequent growth in science.

Assessment format and assessment type.

Whereas assessment format did not moderate the findings for any of the three main effect sizes, assessment type was a significant moderator for the treatment ES (Q_B = 5.08, df = 1, p = .024) and inquiry ES (Q_B = 5.03, df = 1, p = .025) and a marginally significant moderator for the traditional ES (Q_B = 2.92, df = 1, p = .087). For the treatment ES, studies using researcher-developed assessments ( $\bar{g} = 0.39$ ) revealed larger gains in science achievement than those using standardized assessments ( $\bar{g} = 0.12$ ). In contrast, studies using standardized assessments revealed greater science achievement gaps between ELL and non-ELL students than those using more proximal, researcher-developed assessments for both the inquiry ES and traditional ES ( $\bar{g} = - 0.56$ vs. $\bar{g} = - 0.24$ ; $\bar{g} = - 0.69$ vs. $\bar{g} = - 0.36$ , respectively).

Student grade level.

We treated grade level as a continuous variable. When controlling for methodological quality, the meta-regression revealed a significant negative association between average student grade level and magnitude of effect for the traditional ES (b = −0.20, SE = 0.06, p < .001). Because the traditional ES is negative, this effect suggests that the science achievement gap between ELL and non-ELL students widens in traditional instruction across higher grade levels. No other moderation effects involving grade level were significant.

Professional development.

Whereas the dosage of professional development was not a significant moderator for any of the effects of interest, the focus of professional development training moderated the findings for treatment ES (Q_B = 8.74, df = 2, p = .013). Studies in which professional development focused on supporting ELL students yielded larger treatment ESs ( $\bar{g} = 0.32$ ) than those that did not report focusing on ELL students’ academic needs ( $\bar{g} = 0.06$ ). Professional development did not moderate the inquiry ES, and there were too few studies to examine its potential moderating effect on traditional ES.

Length of treatment.

Although the length of treatment moderated the findings for the treatment ES and traditional ES, the moderation effects were very small. Specifically, we found a significant negative association between the length of treatment (in weeks) and magnitude of effect for the treatment ES (b = −0.01, SE = 0.01, p = .03) and traditional ES (b = −0.02, SE = 0.004, p < .001).