Author manuscript; available in PMC: 2025 Jan 1.
Published in final edited form as: J Res Educ Eff. 2023 Apr 13;17(1):184–210. doi: 10.1080/19345747.2023.2180464

Using a Multi-Site RCT to Predict Impacts for a Single Site: Do Better Data and Methods Yield More Accurate Predictions?

Robert B Olsen 1, Larry L Orr 2, Stephen H Bell 3, Elizabeth Petraglia 4, Elena Badillo-Goicoechea 5, Atsushi Miyaoka 6, Elizabeth A Stuart 7
PMCID: PMC10914338  NIHMSID: NIHMS1912038  PMID: 38450254

Abstract

Multi-site randomized controlled trials (RCTs) provide unbiased estimates of the average impact in the study sample. However, their ability to accurately predict the impact for individual sites outside the study sample, to inform local policy decisions, is largely unknown. To extend prior research on this question, we analyzed six multi-site RCTs and tested modern prediction methods—lasso regression and Bayesian Additive Regression Trees (BART)—using a wide range of moderator variables. The main study findings are that: (1) all of the methods yielded accurate impact predictions when the variation in impacts across sites was close to zero (as expected); (2) none of the methods yielded accurate impact predictions when the variation in impacts across sites was substantial; and (3) BART typically produced “less inaccurate” predictions than lasso regression or than the Sample Average Treatment Effect. These results raise concerns that when the impact of an intervention varies considerably across sites, statistical modelling using the data commonly collected by multi-site RCTs will be insufficient to explain the variation in impacts across sites and accurately predict impacts for individual sites.

Keywords: Randomized controlled trials, generalizability, external validity, evidence-based policy, transportability

INTRODUCTION

Motivation

The goal of evidence-based policy is to use credible evidence to inform policy decisions. A key decision faced by policymakers in education and other areas is whether to adopt an intervention, such as a new curriculum or an after-school program. Policymakers aiming to improve outcomes for a certain group of individuals should consider whether adopting various interventions would improve outcomes in their context—and improve them by enough to warrant the costs. However, the impacts of policy decisions are never known in advance. At best, policymakers have evidence that can be used to predict the impact of their policy decisions. A key question, then, is how accurately policymakers can predict the impact of their decisions, in their local context, in advance and based on the available evidence.

To produce evidence of program impact, social policy research has often turned to multi-site, randomized controlled trials (RCTs) because they provide internally valid estimates of the effects of interventions. However, while RCTs are designed to produce unbiased estimates of average impact across all participating sites—and whatever population those sites represent—they are not designed to produce unbiased predictions of impact for individual sites. This is an important limitation when policy decisions are made locally. For example, suppose a school principal is trying to decide whether to adopt an educational program. If the impact of the intervention varies across schools, evidence on the average impact in a multi-site trial may not accurately predict the impact of adopting the intervention in the principal’s school.

Predicting the impact of an educational intervention for a single school district or school using evidence from a multi-site RCT is particularly challenging when the impact of the intervention varies across sites in the population. And such variation is not uncommon. Weiss et al. (2016) provides evidence of cross-site impact variation from 16 RCTs of 16 different interventions, finding large variation for some interventions and little or no variation for others. These results suggest that for at least some interventions, variation in impacts could challenge efforts to predict the impacts in individual sites using evidence from multi-site RCTs.

In principle, researchers can address these challenges by modelling the variation in impacts across sites and using those models to predict the impact of the intervention in any site. However, in practice, it is often challenging to build a model that makes accurate predictions. Importantly, the correct model is never known: Researchers are probably unaware of some key variables that moderate the impact of the intervention—or may be aware of them but lack measures of those variables in the data—and the relationship between these moderators and impacts is never known. Facing imperfect knowledge and substantial data collection costs, researchers conducting RCTs may fail to collect some important site-level impact moderators or may omit them from their models. In addition, building a model that makes accurate out-of-sample predictions is challenging because the relatively small number of sites included in most multi-site RCTs restricts the number of site-level moderator variables that can be included in any one model. Moreover, multi-site RCTs are generally powered only to detect an overall average effect and may have limited ability to estimate moderation effects with precision. Selecting the most important moderators to include in the model may also be challenging without strong prior evidence, which researchers conducting multi-site RCTs often lack. Finally, suppose that all these challenges can be overcome, and researchers are able to build a rich model that produces accurate predictions of impact for individual sites, conditional on site-level moderators collected as part of the study. This model would still be unusable for out-of-sample prediction unless data on the same moderators are available for sites outside of the study sample. In summary, there are serious challenges to building models that are useful for predicting the impact of interventions in individual sites outside the RCT sample—and limited empirical evidence on how well this can be done in practice.

In recent work, Orr et al. (2019) found that the estimated average impact in the sample from an RCT generally produced inaccurate impact predictions for individual sites, and that basic models developed to account for variation in impacts across sites did not improve the accuracy of those predictions. Specifically, they developed and tested two-level hierarchical linear models of students nested within sites, where the impact for an individual site was specified to be a linear function of site-level moderators—in particular, moderators that were typically publicly available to ensure that they would be known to local policymakers and thus usable for predicting local impacts. These models were used to predict site-level impacts in three RCTs of interventions targeted at children and youth: (1) educational technology, (2) charter schools, and (3) Head Start. The paper demonstrated that using the estimated average impact in the sample as the predicted impact for individual sites generated large prediction errors: The root mean squared prediction errors ranged from 0.07 to 0.35 standard deviations (measured in the same effect size units as the study-reported impacts). These differences likely reflect variation in impacts across sites, with large prediction errors when the cross-site impact variance is high and small prediction errors when the cross-site impact variance is low. In addition, the paper found that models with one, two, or five moderators did not yield impact predictions for individual sites that were more accurate than simply using the estimated average impact.

However, a few limitations of Orr et al. (2019) suggest that additional research is warranted. First, the study examined only three RCTs; evidence from additional studies is needed to build a broader evidence base on this topic. Second, the simple linear models that it used did not take advantage of recent advances in predictive models, especially for contexts with limited sample sizes and large numbers of predictors. Third, Orr et al. (2019) focused exclusively on publicly available characteristics of schools when building models, based on the argument that these variables would be known to local policymakers when considering whether to implement the intervention, and excluded other types of moderators that may explain impact variation (Weiss, Bloom, & Brock, 2014). However, in some circumstances, local policymakers may be able to predict these other moderators in advance, such as the characteristics of the students likely to participate in the intervention and features of the counterfactual condition faced by students locally. If so, excluding these variables could have led to overly pessimistic conclusions about the potential for using multi-site RCTs to make accurate out-of-sample impact predictions for individual sites.

Contribution of the Paper

This study extends prior research in four ways. First, it expands the scope and likely generalizability of the evidence base by analyzing a larger number of multi-site RCTs. In addition to reanalyzing data from the evaluation of charter schools examined in Orr et al. (2019) (Gleason et al., 2010), it analyzes data from evaluations of Teach for America (Clark et al., 2013), the Teaching Fellows Program (Clark et al., 2013), the U.S. Department of Education’s Student Mentoring Program (Bernstein et al., 2009), the Gang Resistance Education and Training Program (G.R.E.A.T.; Esbensen et al., 2012), and a summer reading program (Wilkins et al., 2012).

Second, the paper improves on the analysis in Orr et al. (2019) by testing two modern statistical methods for making out-of-sample predictions: (1) lasso regression (Tibshirani, 1996) and (2) Bayesian Additive Regression Trees (BART; Chipman et al., 2010). Orr et al. (2019) used standard linear regression models with moderator variables selected based on p values. However, this approach risks producing models that are overfit to the sample data and yield poor out-of-sample predictions. In contrast, lasso and BART are designed for making out-of-sample predictions and thus may be better suited to informing local policy decisions in sites not included in the evaluation. Lasso penalizes large regression coefficients, shrinking some of them to zero, with the penalty chosen by cross-validation; this may help to avoid the overfitting that can lead to inaccurate out-of-sample predictions.1 In addition, lasso has been shown to work well in relatively small samples (as few as 20 observations), even with more candidate parameters (p) than observations (n) (e.g., Eberlin et al., 2014; Finch and Finch, 2016; Donoho and Stodden, 2006). BART’s main strength is its ability to uncover patterns in the data, including nonlinear relationships and high-order interactions, without having to assume a specific functional form for the model or to pre-specify which interactions to include. Kern et al. (2016) found that BART performed better than most other methods in a variety of scenarios when generalizing impact estimates to target populations.
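As a concrete illustration of how such a cross-validated lasso might be applied in this setting, the sketch below fits a linear model of site-level impact estimates on site-level moderators and predicts the impact for a site outside the fitting sample. It is a minimal sketch, not the authors' code: the data are simulated, the variable names are hypothetical, and scikit-learn's LassoCV is only one of several implementations that could be used.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Illustrative inputs (hypothetical): one row per site in the RCT.
# X_train: site-level moderators for the J-1 sites used for fitting.
# y_train: unbiased site-level impact estimates for those sites.
# x_new:   moderator values for the excluded ("focal") site.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(44, 10))          # e.g., 44 sites, 10 moderators
y_train = rng.normal(0.1, 0.2, size=44)      # e.g., impact estimates in SD units
x_new = rng.normal(size=(1, 10))

# Fit a linear model of site impacts on moderators, choosing the lasso
# penalty by k-fold cross-validation to limit overfitting.
lasso = LassoCV(cv=5, random_state=0).fit(X_train, y_train)

# Predicted impact for the focal site, conditional on its moderator values.
predicted_impact = lasso.predict(x_new)[0]
print(f"Predicted impact for focal site: {predicted_impact:.3f} SD")
```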

Third, this study uses a more comprehensive range of site-level moderators to predict the impact in individual sites. In particular, it examines the extent to which data from multi-site trials on four different types of effect moderators—individually or together—can improve the ability of statistical models to predict the impact of interventions in individual sites. In addition to publicly available information on schools and districts used by Orr et al. (2019) (e.g., size, student composition, resources, and location), this study uses data collected by the original study teams on three different types of moderators: (1) characteristics of students participating in the study, (2) features of the intervention or how it was implemented, and (3) features of the counterfactual condition faced by individuals assigned to the control group. We assess whether models that include all four types of moderators yield more accurate predictions than models that are restricted to just publicly available information on schools and districts.

Fourth, this paper takes a new approach to estimating the magnitude of the likely prediction errors. This approach, detailed below, uses random effects meta-analysis methods to estimate the (root) mean squared prediction error as a function of the bias, sampling error in the predicted impacts, and cross-site heterogeneity in impacts.

We focus on the magnitude of prediction errors because local policymakers are in a better position to make policy decisions that will improve student outcomes if informed by impact predictions with small errors than by impact predictions with large errors. While Orr et al. (2019) translates prediction errors into estimates of their consequences for decision-making—specifically, on whether policymakers will make the “correct” decision about whether to adopt an intervention based on the evidence—we focus here exclusively on the magnitude of the prediction errors and save the policy consequences of these errors for future research.

THEORY

The goal of this study is to test different methods for predicting the true impact of an intervention in an individual site, and to assess the expected accuracy of those predictions for a randomly selected site from the population. Let $\delta_j$ be the estimand of interest—the true impact in site j, for $j = 1, \ldots, N$. For a particular prediction method, let $\delta_j^p$ be the expected value of the site-specific impact prediction—that is, the parameter to which the predicted impact for site j converges as the number of sites in the multi-site RCT used to estimate the prediction model converges to infinity. Given these definitions, the bias of the prediction method for site j equals $b_j \equiv \delta_j^p - \delta_j$. We would in general expect the prediction models to be biased unless the variance of the true impact across sites, $\tau^2$, equals zero or the assumptions of the model explaining variation in impacts strictly hold.

However, the accuracy of the predicted impact for an individual site depends on both the bias and the variance of those predictions. Since the predictions come from a multi-site RCT, the estimated parameters of the prediction model will contain sampling error that depends on the number of sites included in the RCT. Let $\hat{\delta}_j^p$ be the predicted impact from a finite sample and $\varepsilon_j^p$ be the random error in the model’s prediction for site j. We model this predicted impact with the equation below:

$\hat{\delta}_j^p = \delta_j^p + \varepsilon_j^p = \delta_j + b_j + \varepsilon_j^p$, (1)

where $\delta_j \sim (\delta, \tau^2)$, $b_j \sim (b, \sigma_b^2)$, and $\varepsilon_j^p \sim (0, \sigma_p^2)$.

These assumptions allow us to define a measure of the accuracy of the prediction method applied to a multi-site RCT with a finite number of sites for a randomly selected site in the population. The root mean squared prediction error (RMSPE) can be expressed as follows:

$\mathrm{RMSPE} = \sqrt{b^2 + \sigma_b^2 + \sigma_p^2}$ (2)

In words, the root mean squared prediction error (RMSPE) is the square root of the sum of the squared average bias of the prediction method ($b^2$), the variance of the bias across sites ($\sigma_b^2$), and the variance of the prediction error that arises from estimating the prediction model in a finite sample ($\sigma_p^2$). This formulation will allow us to estimate the RMSPE, as described later, by separately estimating the average bias in the predictions ($b$) and the sum of the variances of the remaining two components of the prediction error ($\sigma_b^2 + \sigma_p^2$).
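To make equation (2) concrete, here is a small numerical sketch with made-up component values (not estimates from any of the studies analyzed here):

```python
import math

# Hypothetical components, in effect-size (standard deviation) units:
b = 0.02        # average bias of the prediction method
sigma_b = 0.10  # standard deviation of the bias across sites
sigma_p = 0.08  # standard deviation due to estimating the model in a finite sample

# Equation (2): the RMSPE combines all three sources of prediction error.
rmspe = math.sqrt(b**2 + sigma_b**2 + sigma_p**2)
print(round(rmspe, 3))  # 0.13
```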

DATA

We used data from six large-scale multi-site RCTs of interventions focused on improving youth outcomes (Table 1). These six RCTs can be thought of as a convenience sample because they were chosen largely based on data availability. These RCTs fall within a larger set of studies that are highly visible to policymakers through government reports and evidence clearinghouses. Each of these studies offers an opportunity to test how well we can use the RCT data to make accurate impact predictions for individual sites.

Table 1 –

RCTs Included in the Analysis

Intervention | Outcome Domains | Unit of Assignment | Number of Sites | Data Source
Gang Resistance Education and Training Program (G.R.E.A.T.) | Behavioral | Classroom (N=195) | 31 schools | ICPSR
Teach for America | Achievement | Student (N≈4,570) | 45 schools | NCES
The Teaching Fellows Program | Achievement | Student (N≈4,120) | 44 schools | NCES
The Student Mentoring Program | Behavioral, achievement, performance in school | Student (N≈1,970) | 32 program grantees | NCES
Charter Middle Schools | Achievement | Student (N≈2,150) | 36 schools | NCES
Summer Reading | Achievement | Student (N≈1,520) | 112 schools | NCES

Note: Student sample sizes are rounded to the nearest 10 for compliance with non-disclosure procedures for NCES restricted use data.

Student- and site-level data from these studies were used to build site-level analysis files that contain site-level impact estimates for student outcomes and site-level moderator variables that may predict the size and direction of the site-level impacts. Consolidating the student-level data into site-level datasets simplified the application of lasso regression and BART for making site-specific impact predictions without losing any information helpful in predicting the impact in individual sites.2

Each of the six RCTs included multiple sites in the sample and conducted random assignment within sites, allowing us to produce unbiased impact estimates for each site. From a methodological perspective, we defined a “site” as the randomization block in which students (or in one case, classrooms) were randomly assigned. For all but two of the studies, the site was defined as the public school in which the intervention was implemented. For the evaluation of charter middle schools, we defined the site as the collection of public schools from which students applied to a single charter school (or a pair of nearby charter schools with multiple students applying to both) and were either admitted to a charter school or not admitted based on a random lottery. For the evaluation of the Student Mentoring Program, we defined the site as the grantee organization—often a school or school district—that supported the program locally, and within which students were randomly assigned.3

For each RCT, we used data on one or more student outcomes for which the original study authors estimated and reported the impacts of the intervention. For measures of student achievement based on standardized tests, the student-level outcome variable ($y_{ij}$) was standardized by the original study authors to have a mean of zero and standard deviation of 1 in a defined norming population.4 For other continuous or ordinal outcome variables (e.g., the count of incidents of a particular type), we proceeded in one of two ways: if the vast majority of students in the sample had no such incidents, we created a binary outcome variable equal to 1 or “yes” (e.g., the student had at least one incident) or 0 or “no” (e.g., the student had no incidents); otherwise, we standardized the outcome by subtracting the mean and dividing by the standard deviation for all treatment and control students included in the analysis. Binary outcomes from the original study data (e.g., scored above the state threshold for proficiency in math or reading) were left in their natural scale (0 for no, 1 for yes). See Table 2 for a description of the student outcome variables from the original studies that we used in the analysis, along with summary statistics and sample sizes.5
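The sketch below illustrates this outcome construction with hypothetical column names: a continuous outcome is z-scored on the pooled analysis sample, and a rare count outcome is collapsed to a binary indicator. It is a simplified rendering of the rules described above, not the original studies' code.

```python
import pandas as pd

# Hypothetical student-level data: one row per student.
df = pd.DataFrame({
    "site_id": [1, 1, 2, 2, 3, 3],
    "treat": [1, 0, 1, 0, 1, 0],
    "test_score": [210.0, 195.0, 220.0, 205.0, 199.0, 188.0],
    "incidents": [0, 0, 2, 0, 1, 0],
})

# Continuous/ordinal outcomes: z-score using the pooled treatment and
# control sample included in the analysis.
df["test_score_std"] = (df["test_score"] - df["test_score"].mean()) / df["test_score"].std()

# Rare count outcomes: collapse to a binary indicator (any incident vs. none)
# when the vast majority of students have no incidents.
df["any_incident"] = (df["incidents"] > 0).astype(int)
```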

Table 2–

Student Outcome Variables Included in the Analysis

Outcome Variable | Description and Scale | Mean | SD | N
Charter Middle Schools
Math achievement, year 1 | Standardized score on state test | 0.38 | 1.01 | 2,130
Math achievement, year 2 | Standardized score on state test | 0.44 | 1.14 | 2,040
Reading achievement, year 1 | Standardized score on state test | 0.43 | 0.96 | 2,150
Reading achievement, year 2 | Standardized score on state test | 0.39 | 0.96 | 2,040
Gang Resistance Education and Training Program (G.R.E.A.T.)
Delinquency (frequency) | Number of incidents of delinquency | 0.00 | 1.00 | 3,564
Delinquency (variety) | Number of different types of delinquency | 0.00 | 1.00 | 3,564
Violence (any) | =1 if any violent incidents, =0 otherwise | 0.12 | 0.32 | 3,564
Student Mentoring Program
Absentee rate | Ratio of absences to days enrolled | 0.00 | 1.00 | 2,160
Truancy rate | Ratio of unexcused absences to days enrolled | 0.00 | 1.00 | 1,440
Math grades | =1 (F), 2 (D), 3 (C), 4 (B), or 5 (A) | 0.00 | 1.00 | 2,000
English grades | =1 (F), 2 (D), 3 (C), 4 (B), or 5 (A) | 0.00 | 1.00 | 2,010
Science grades | =1 (F), 2 (D), 3 (C), 4 (B), or 5 (A) | 0.00 | 1.00 | 1,950
Social studies grades | =1 (F), 2 (D), 3 (C), 4 (B), or 5 (A) | 0.00 | 1.00 | 1,890
Math proficiency | =1 if proficient based on state test, =0 otherwise | 0.46 | 0.50 | 1,990
Reading proficiency | =1 if proficient based on state test, =0 otherwise | 0.49 | 0.50 | 2,030
Misconduct (any) | =1 if at least one incident of misconduct, =0 otherwise | 0.26 | 0.44 | 1,730
Misconduct (repeated) | =1 if multiple incidents of misconduct, =0 otherwise | 0.16 | 0.37 | 1,730
Delinquency (any) | =1 if at least one incident of delinquency, =0 otherwise | 0.20 | 0.40 | 1,780
Delinquency (repeated) | =1 if multiple incidents of delinquency, =0 otherwise | 0.11 | 0.31 | 1,570
Summer Reading Program
Reading achievement | Standardized score on study-administered test (Scholastic Reading Inventory) | 0.00 | 1.00 | 1,570
Teach for America
Math achievement | Standardized score on state test or study-administered test (NWEA) | −0.58 | 0.93 | 4,570
Teaching Fellows Program
Math achievement | Standardized score on state test or study-administered test (NWEA) | −0.38 | 1.07 | 4,120

Notes: Standardized scores on state tests were standardized by the original study authors as z-scores, subtracting off the mean and dividing by the standard deviation for all tested students in the state in the same subject and grade level. Standardized scores on study-administered tests were standardized (1) relative to a national norming population for Teach for America and the Teaching Fellows Program and (2) relative to the pooled sample of treatment and control students for the summer reading program. Student sample sizes for all studies except G.R.E.A.T. are rounded to the nearest 10 for compliance with NCES non-disclosure procedures.

Finally, we used data on four types of potential impact moderators that were available in the restricted use files for the six multi-site RCTs. Some of these variables come from public sources; others were collected by the original study teams:

  1. Publicly available information on the characteristics of each site. The Common Core of Data (CCD) provides publicly available information on all public schools nationwide. For the four studies that assembled CCD data on participating schools, we used data on size, student demographic composition, location, and resources to capture the context in which educational interventions are implemented.

  2. Information collected on the subset of students who participated in the study. Most impact evaluations in education collect data on the characteristics of the students who participated in the study, especially their demographic characteristics, to conduct subgroup analyses. For all six studies, we created site-level averages of the participant characteristics that were collected (e.g., disability status, race/ethnicity, English learner status, and eligibility for free or reduced-price meals) to capture the types of students who would receive the intervention in each site if the intervention were implemented locally.

  3. Features of the intervention or how it was implemented. Many RCTs collect measures related to the intervention itself. For the four studies that collected such measures, we used study-collected measures of intervention features or how the intervention was implemented. These included characteristics of the charter schools attended by students who were admitted by lottery, the characteristics of Teach for America and Teaching Fellows Program teachers for students randomly assigned to their classes, and the characteristics of the local mentoring programs to which students were randomly assigned.

  4. Features of the “counterfactual condition.” Since impacts result from the contrast between the intervention condition and the counterfactual condition, some RCTs also collect site-level measures of the counterfactual condition experienced by students in the control group. We used study-collected measures of the counterfactual from the four studies that constructed such measures. These include measures of academic programming in the regular public schools attended by students who “lost the lottery” to attend a charter school, characteristics of teachers who taught in the same schools and grade levels as teachers from Teach for America or the Teaching Fellows Program, and participation in mentoring programs by students who were assigned to the control group in the evaluation of the U.S. Department of Education’s Student Mentoring Program.

The number of moderator variables included in the analysis for each RCT is provided in Table 3. The definition of each of these moderators is provided in the Online Appendix.

Table 3 –

Number of Moderator Variables Included for Each Intervention

Intervention | Site Characteristics | Student Characteristics (Site-Level Averages) | Intervention Features | Counterfactual Features
Charter Middle Schools | 21 | 21 | 90 | 90
G.R.E.A.T. | 0 | 67 | 0 | 0
Student Mentoring Program | 0 | 261 | 422 | 5
Summer Reading Program | 0 | 14 | 0 | 0
Teach for America | 23 | 45 | 202 | 190
Teaching Fellows Program | 23 | 45 | 202 | 190

Note: Zero variables were included of a particular type when none were available in the original study data.

METHODS

This study used three types of methods to assess the potential for multi-site RCTs to accurately predict impacts for individual sites outside the study sample: (1) methods for estimating the true impact in a single site—to serve as a benchmark for assessing different prediction methods, (2) methods for predicting the impact in that site using only data from other sites, and (3) methods for assessing the accuracy of those predictions. All three types of methods are described below.

Methods for Estimating the True Impact in a Single Site

For each of the six multi-site RCTs, we took advantage of their experimental design to estimate the following model and produce an unbiased estimate of the impact of the intervention in each site j on student outcome yij:

$y_{ij} = \alpha + \beta x_{ij} + \delta_j^w T_{ij} + e_{ij}, \quad e_{ij} \sim N(0, \sigma_j^2)$ (3)

where $y_{ij}$ is the outcome for student i in site j; $x_{ij}$ is a pre-intervention measure of the outcome for student i in site j; $T_{ij}$ is the treatment indicator that equals 1 if student i in site j was assigned to the treatment group and 0 if this student was assigned to the control group; and $e_{ij}$ is a random error term that varies across the students in site j.6 Using the full study sample allowed the original study authors to include a large set of covariates in their models. However, we estimated separate impact regression models for each site, based on a small fraction of the total sample, necessitating a more parsimonious model. In estimating impacts separately by site, we included a single covariate—a pre-intervention value of the outcome—given prior evidence that pre-test measures are more predictive of student outcomes than other potential covariates (Bloom, Richburg-Hayes, & Black, 2007).

The parameter $\delta_j^w$ reflects the intervention’s impact in site j, where the “w” superscript denotes that the impact is the impact on students within site j. We estimated this parameter using (1) ordinary least squares if the original study authors did not construct and use weights in the impact analysis or (2) weighted least squares when the study authors did construct and use weights in the impact analysis (typically to address nonresponse bias, variation in the probability of assignment to the treatment, or both).7
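A minimal sketch of this site-by-site estimation, assuming the student-level records for one site are in a pandas DataFrame with hypothetical column names ('y', 'pretest', 'treat'); it uses ordinary least squares via statsmodels and omits the study-specific weighting described above.

```python
import pandas as pd
import statsmodels.formula.api as smf

def estimate_site_impact(site_df):
    """Estimate the within-site impact (delta_j^w) and its sampling variance for one site.

    Expects columns: 'y' (outcome), 'pretest' (pre-intervention measure of the
    outcome), and 'treat' (1 = treatment, 0 = control). Missing pretests are
    handled with a dummy-variable adjustment (see footnote 6).
    """
    df = site_df.copy()
    df["pretest_missing"] = df["pretest"].isna().astype(int)
    df["pretest"] = df["pretest"].fillna(0.0)

    # OLS of the outcome on the treatment indicator and the pretest;
    # weighted least squares would be used instead when the original
    # study constructed and used analysis weights.
    fit = smf.ols("y ~ treat + pretest + pretest_missing", data=df).fit()
    return fit.params["treat"], fit.bse["treat"] ** 2  # impact estimate, sampling variance
```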

Methods for Predicting the Impact for a Single Site

In broad terms, we tested three methods for predicting the impact of an educational intervention for a site that is outside the study sample but similar to the sites in the sample: (1) the simple average of the site-level impact estimates for sites in the study sample; (2) lasso regression; and (3) BART. The first approach does not use data on site-level moderators; the second and third approaches use site-level moderator data to produce customized impact predictions for individual sites. With all three methods, we excluded a single site, estimated the prediction models in the remaining J-1 sites, and used the estimated prediction model—with the site moderator values in the excluded site, for lasso and BART—to predict the impact in the excluded site. The analysis plan was initially published on the Open Science Framework before any analysis was conducted (Olsen et al., 2021).8

For the simple average approach, we took the unweighted average of the unbiased site-level impact estimates and used this average as the predicted impact in the excluded site (referred to below as the “focal” site).9 This approach is motivated by evidence-based policymaking that identifies interventions that “work” and promotes them for local adoption.

We used lasso regression to fit a linear regression model of the intervention’s impact as a function of site-level moderators, and we used the model to predict the impact of the intervention in the excluded site. The lasso was used to penalize the model for large regression coefficients that might improve the model fit in the sample at the expense of lower out-of-sample prediction accuracy. Specifically, we used cross-validation to set the penalty term (also called the “tuning parameter”) to minimize the mean squared prediction error. The resulting linear model was used to predict the impact of the intervention in the excluded site, conditional on the site-level moderators in the model.

Our third prediction approach, BART, fits a series of regression trees of the intervention’s impact as a function of site-level moderators and then uses the set of trees to predict the impact of the intervention in the excluded site. Regression trees sequentially divide the sample into categories, based on binary and continuous variables, to explain variation in the variable being predicted—in this case, the impact of the intervention in individual sites. These trees end in multiple “leaves,” or categories defined by multiple variables (e.g., continuous variable X1 > C and binary variable X2 = 0). In our application, the predicted impact for an individual site was set equal to the average impact estimate among the collection of sites in the same category (i.e., at the same leaf). We chose to test a regression tree-based approach to prediction because, unlike standard linear regression models, tree-based methods are designed for scenarios where the variables used for classification (here, impact moderators) may have non-linear or interacting relationships with the outcome (here, the estimated impact of an education intervention).

While many approaches to regression trees exist, we used BART because it is well designed to address the overfitting problem of some random forest approaches that can lead to poor predictions. To avoid producing overly complex trees that overfit the sample data, BART constructs a series of small trees and combines them into an “ensemble,” fit using a Bayesian framework with default prior distributions on key parameters (Witten, Hastie, & Tibshirani, 2013). For more details on lasso and BART, and our approach to implementing them, see the Appendix.
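Because BART implementations vary across software, the sketch below uses scikit-learn's RandomForestRegressor purely as a stand-in tree ensemble to illustrate the workflow of fitting many trees to site-level impact estimates and predicting the impact for an excluded site. It is not BART (it lacks BART's Bayesian priors and regularization), and the data and variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_train = rng.normal(size=(44, 10))       # hypothetical: 44 sites, 10 moderators
y_train = rng.normal(0.1, 0.2, size=44)   # hypothetical site-level impact estimates
x_new = rng.normal(size=(1, 10))          # moderator values for the excluded site

# Fit an ensemble of regression trees to the site impacts; each tree's
# prediction is a leaf average, and the ensemble averages across trees.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
predicted_impact = forest.predict(x_new)[0]
print(f"Predicted impact for the excluded site: {predicted_impact:.3f} SD")
```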

For both lasso and BART, we implemented a pre-modeling variable selection step, restricting the models to moderators with a correlation of at least 0.20 with the estimated impact. This step was taken to focus the algorithms on variables likely to moderate the impact of the intervention. Our original plan was to offer lasso and BART the full set of potential moderators, regardless of their correlation with the impact and without any prior judgment by us about their relative importance—leaving it entirely to the lasso and BART algorithms to choose the variables that contribute most to producing accurate impact predictions in individual sites. However, when testing our approach in the charter school data without the 0.20 correlation screen, we discovered that the lasso did not select any predictor variables for the model.10 In contrast, setting a minimum correlation of 0.20 led the lasso to select some predictor variables and allowed us to test whether models with predictors selected by the lasso method yielded more accurate predictions than ‘models’ with no predictor variables, as reflected in the simple average of the site-level impact estimates.
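A minimal sketch of this screening step (hypothetical inputs): only moderators whose absolute correlation with the estimated site impacts is at least 0.20 are passed on to lasso or BART.

```python
import pandas as pd

def screen_moderators(moderators: pd.DataFrame, impacts: pd.Series,
                      threshold: float = 0.20) -> pd.DataFrame:
    """Keep only site-level moderators correlated with the estimated impacts."""
    corrs = moderators.corrwith(impacts)          # correlation of each moderator with impacts
    keep = corrs.abs() >= threshold
    return moderators.loc[:, keep[keep].index]    # columns passing the 0.20 screen
```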

Methods for Assessing the Accuracy of the Predicted Impact for a Single Site

To assess the accuracy with which each of the three methods described above predict the impact of each intervention in a single site, we used and adapted the analytic approach developed in Orr et al. (2019):

  • Step 1: Designate one site from the RCT as the focal (or “excluded”) site. We pretend that this site—site j—did not participate in the evaluation and is the site for which a policy decision is being made, and thus for which an impact prediction is needed.

  • Step 2: Produce an unbiased estimate of the impact in the focal site. Using only the data from the focal site and the methods described earlier, we estimated equation (3). This produced an unbiased estimate of the true impact in site j.

  • Step 3: Predict the impact in the focal site using data from the other J-1 sites ($\hat{\delta}_j^p$). We fit each of the three prediction methods described earlier using the data from the other J-1 sites and then used that model, and moderator values from site j, to predict the impact in site j.

  • Step 4: Estimate the prediction error. The true prediction error for site j equals $PE_j = \hat{\delta}_j^p - \delta_j^w$, where $\delta_j^w$ is the true impact in that site. We estimated the prediction error for site j, $\widehat{PE}_j$, by taking the difference between the predicted impact, $\hat{\delta}_j^p$, and the unbiased estimate of the impact in site j, $\hat{\delta}_j^w$: $\widehat{PE}_j = \hat{\delta}_j^p - \hat{\delta}_j^w$.

  • Step 5: Repeat steps 1-4 for each of the remaining sites. This treats each site in the sample as the focal site for a local policy decision and estimates the prediction error for that site.

  • Step 6: Estimate the Root Mean Squared Prediction Error (RMSPE), adjusted for sampling error in the focal site, across all sites in the RCT. We calculated the RMSPE instead of the mean prediction error to ensure that positive errors do not offset negative errors. To calculate the RMSPE, we conducted a random effects meta-analysis. This meta-analysis accounts for sampling error in the unbiased impact estimate for each site. Code sketches of the leave-one-site-out loop and of the meta-analysis step appear below.
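The leave-one-site-out loop in Steps 1-5 can be summarized as in the sketch below, shown for the simple-average predictor; for lasso or BART, the averaging step would be replaced by fitting the prediction model on the other J-1 sites. The inputs are hypothetical.

```python
import numpy as np

def loo_prediction_errors(impact_estimates: np.ndarray) -> np.ndarray:
    """Steps 1-5 for the simple-average predictor.

    impact_estimates: unbiased site-level impact estimates (one per site).
    Returns the estimated prediction error PE_hat_j for each focal site:
    the average impact in the other J-1 sites minus the focal site's estimate.
    """
    errors = []
    for j in range(len(impact_estimates)):
        others = np.delete(impact_estimates, j)   # exclude the focal site
        predicted = others.mean()                 # predicted impact for site j
        errors.append(predicted - impact_estimates[j])
    return np.array(errors)
```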

Figure 1 provides an illustration of the site-level estimates used to estimate the RMSPE for different combinations of study, outcome variable, prediction method, and sets of moderator variables. This figure displays, for the sites in the Teach for America study, the site-level unbiased impact estimates and the predicted impacts generated using BART with all four types of moderator variables.

Figure 1. Comparison of Predicted Impact to Unbiased Impact Estimate for Individual Sites in the Teach for America Study: BART Predictions with All Four Types of Moderators

The meta-analysis builds on the model developed earlier in the Theory section of the paper. To conduct the meta-analysis, we regressed the estimated prediction error ($\widehat{PE}_j$) on an intercept term. The random effects meta-analysis estimated the cross-site variance of the prediction error, excluding the within-site variance of the unbiased impact estimate ($\hat{\delta}_j^w$).11 To estimate the RMSPE, we took the square root of the sum of two terms: (1) the square of the estimated intercept, which can be interpreted as the average bias in the prediction method, and (2) the estimated cross-site variance from the meta-analysis, which can be interpreted as the sum of the variance of the prediction bias across sites and the variance of the predicted impacts that results from estimating prediction models in finite samples. The estimated RMSPE provides an estimate of the prediction error that can be expected for a randomly selected site from the population represented by sites that participated in the RCT.
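A simplified sketch of this step, assuming a DerSimonian-Laird random-effects estimator for the intercept-only meta-analysis (the authors may have used different meta-analysis software or estimators). The inputs are the estimated prediction errors from the leave-one-site-out loop and the sampling variances of the unbiased site-level impact estimates.

```python
import numpy as np

def rmspe_from_meta_analysis(pe_hat: np.ndarray, v_site: np.ndarray) -> float:
    """Estimate the RMSPE from site-level prediction errors.

    pe_hat: estimated prediction errors PE_hat_j, one per site.
    v_site: sampling variances of the unbiased impact estimates delta_hat_j^w.
    A DerSimonian-Laird random-effects meta-analysis of the errors: the
    intercept estimates the average bias b, and the between-site variance
    tau2 estimates sigma_b^2 + sigma_p^2 net of the focal-site sampling
    error captured by v_site.
    """
    w = 1.0 / v_site
    mean_fe = np.sum(w * pe_hat) / np.sum(w)        # fixed-effects mean
    q = np.sum(w * (pe_hat - mean_fe) ** 2)         # heterogeneity statistic
    k = len(pe_hat)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)              # between-site variance
    w_re = 1.0 / (v_site + tau2)
    b = np.sum(w_re * pe_hat) / np.sum(w_re)        # average bias
    return float(np.sqrt(b ** 2 + tau2))            # estimated RMSPE, as in equation (2)
```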

To assess the precision of the RMSPE estimates and of the differences between RMSPEs from different models, we used the bootstrap method. In particular, we selected 1,000 bootstrap samples of J sites with replacement, where J equals the number of sites included in the RCT, estimated the RMSPE for each bootstrap sample, and concluded that prediction method A yielded a lower RMSPE than prediction method B if the estimated RMSPE was lower for A than for B in 950 or more of the 1,000 bootstrap samples. Similarly, we concluded that prediction method B yielded a lower RMSPE than prediction method A if the estimated RMSPE was lower for B than for A in 950 or more of the 1,000 bootstrap samples. In effect, we used bootstrapping to test for significant differences at the 90 percent level with a two-tailed test.
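The bootstrap comparison can be sketched as below; `rmspe_for_method` is a hypothetical stand-in for the full pipeline (leave-one-site-out prediction followed by the random-effects meta-analysis) applied to a resampled set of sites.

```python
import numpy as np

def compare_methods_bootstrap(site_data, rmspe_for_method, n_boot=1000, seed=0):
    """Compare two prediction methods by resampling sites with replacement.

    site_data: list of per-site records (impact estimate, variance, moderators).
    rmspe_for_method: hypothetical callable (sites, method_label) -> estimated RMSPE.
    """
    rng = np.random.default_rng(seed)
    J = len(site_data)
    a_lower = 0
    b_lower = 0
    for _ in range(n_boot):
        idx = rng.integers(0, J, size=J)          # bootstrap sample of J sites
        sample = [site_data[i] for i in idx]
        rmspe_a = rmspe_for_method(sample, "A")
        rmspe_b = rmspe_for_method(sample, "B")
        a_lower += rmspe_a < rmspe_b
        b_lower += rmspe_b < rmspe_a
    # Declare a difference only if one method wins in at least 950 of 1,000 samples.
    if a_lower >= 0.95 * n_boot:
        return "A yields a lower RMSPE than B"
    if b_lower >= 0.95 * n_boot:
        return "B yields a lower RMSPE than A"
    return "No significant difference at the 90 percent level"
```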

Finally, we assessed the magnitude of the RMSPE estimates by comparing them to external estimates of the average impacts of educational interventions. This comparison was made to determine whether the prediction errors were large enough to make an intervention with zero impact appear to have an average impact—one of average magnitude—or an intervention with an average impact appear to have a zero impact. For the four studies that reported impacts on achievement outcomes—the evaluations of charter schools, Teach for America, the Teaching Fellows Program, and the summer reading program—we compared the RMSPE estimates to a threshold value of 0.04 standard deviations, the average impact across 1,352 estimates from randomized trials of the effects of educational interventions on broad tests of student achievement in mathematics or English language arts (Kraft, 2020).12 For the evaluation of G.R.E.A.T., we compared the RMSPE estimates to a threshold value of 0.09, the average effect size across 68 estimates of the effects of school-based programs for preventing problem behaviors in middle school (Wilson, Gottfredson & Najaka, 2001).13 For the evaluation of the student mentoring program, we compared the RMSPE estimates to a threshold value of 0.21, the average effect size across 83 estimates of the effects of mentoring programs for youth from randomized trials and non-randomized comparison group designs (DuBois et al., 2011).14 Lastly, the evaluations of G.R.E.A.T. and the student mentoring program reported impacts for some binary outcomes. Because the published evidence from meta-analyses generally focused on continuous outcomes, we lack evidence on the average effects of educational interventions on binary outcomes. For this reason, we simply report the RMSPEs on these outcomes and leave it to readers to make their own assessment of whether prediction errors implied by these estimates should be considered large or small.

FINDINGS

To provide context for the main findings, Table 4 provides estimates of the unconditional cross-site standard deviation of the impact for each study and outcome variable. These estimates indicate the challenge that prediction methods face in producing accurate impact predictions. When the unconditional cross-site standard deviation of impacts is large, the only way to produce accurate impact predictions for individual sites is through a model that explains a sizable share of the impact variation across sites.

Table 4–

Cross-Site Standard Deviation of Impact

Outcome Variable | Scale at the Student Level | Cross-Site Standard Deviation of Impacts
Charter Middle Schools
Math achievement, year 1 | Continuous (standard deviations) | 0.161
Math achievement, year 2 | Continuous (standard deviations) | 0.311
Reading achievement, year 1 | Continuous (standard deviations) | 0.184
Reading achievement, year 2 | Continuous (standard deviations) | 0.192
Gang Resistance Education and Training Program (G.R.E.A.T.)
Delinquency (frequency) | Ordinal (standard deviations) | 0.000
Delinquency (variety) | Ordinal (standard deviations) | 0.000
Violence (any) | Binary (0 or 1) | 0.045
Student Mentoring Program
Absentee rate | Continuous (0-1) | 0.001
Truancy rate | Continuous (0-1) | 0.001
Math grades | Ordinal (1, 2, 3, 4, or 5) | 0.000
English grades | Ordinal (1, 2, 3, 4, or 5) | 0.000
Science grades | Ordinal (1, 2, 3, 4, or 5) | 0.038
Social studies grades | Ordinal (1, 2, 3, 4, or 5) | 0.043
Math proficiency | Binary (0 or 1) | 0.000
Reading proficiency | Binary (0 or 1) | 0.000
Misconduct (any) | Binary (0 or 1) | 0.019
Misconduct (repeated) | Binary (0 or 1) | 0.008
Delinquency (any) | Binary (0 or 1) | 0.000
Delinquency (repeated) | Binary (0 or 1) | 0.000
Summer Reading Program
Reading achievement | Continuous (standard deviations) | 0.231
Teach for America
Math achievement | Continuous (standard deviations) | 0.174
Teaching Fellows Program
Math achievement | Continuous (standard deviations) | 0.321

Table 4 shows that the six studies cover the range of the cross-site impact variation from prior studies. Weiss et al. (2016) report estimates of the cross-site standard deviation of impacts ranging from 0 for an afterschool math program to 0.35 for math achievement in year 2 from the same charter school evaluation analyzed in this study. Similarly, the cross-site standard deviations reported in Table 4 ranged from 0 for several outcomes in the evaluations of G.R.E.A.T. and the student mentoring program to 0.32 for math achievement in the evaluation of the Teaching Fellows Program.

The first question we addressed is whether a study’s estimate of the average impact produces accurate predictions of impact for individual sites. Table 5 presents the RMSPE estimates for the prediction method that uses the simple average of the impact estimates in the other sites. For the four studies that focused on impacts on student achievement measures—the studies of charter middle schools, a summer reading program, Teach for America, and the Teaching Fellows Program—the RMSPE estimates are indicative of large prediction errors, ranging from 0.13 for the impact of Teach for America on math achievement to 0.33 standard deviations for the impact of charter schools on math achievement in year 2. These RMSPE estimates are all greater than 0.04, the average impact of educational interventions cited earlier (Kraft, 2020). This suggests that using the simple average of the impact estimates in other sites yields predictions with errors that are large enough to mislead local policymakers into reaching the wrong conclusion about whether the program would be effective locally.

Table 5–

RMSPE Estimates from the Simple Average of the Impact Estimates in the Other Sites

Outcome Variable | Scale at the Student Level | RMSPE
Charter Middle Schools
Math achievement, year 1 | Continuous (standard deviations) | 0.17
Math achievement, year 2 | Continuous (standard deviations) | 0.32
Reading achievement, year 1 | Continuous (standard deviations) | 0.19
Reading achievement, year 2 | Continuous (standard deviations) | 0.20
Gang Resistance Education and Training Program (G.R.E.A.T.)
Delinquency (frequency) | Ordinal (standard deviations) | 0.02
Delinquency (variety) | Ordinal (standard deviations) | 0.02
Violence (any) | Binary (0 or 1) | 0.05
Student Mentoring Program
Absentee rate | Continuous (0-1) | 0.03
Truancy rate | Continuous (0-1) | 0.02
Math grades | Ordinal (1, 2, 3, 4, or 5) | 0.04
English grades | Ordinal (1, 2, 3, 4, or 5) | 0.02
Science grades | Ordinal (1, 2, 3, 4, or 5) | 0.04
Social studies grades | Ordinal (1, 2, 3, 4, or 5) | 0.04
Math proficiency | Binary (0 or 1) | 0.01
Reading proficiency | Binary (0 or 1) | 0.02
Misconduct (any) | Binary (0 or 1) | 0.03
Misconduct (repeated) | Binary (0 or 1) | 0.01
Delinquency (any) | Binary (0 or 1) | 0.02
Delinquency (repeated) | Binary (0 or 1) | 0.03
Summer Reading Program
Reading achievement | Continuous (standard deviations) | 0.26
Teach for America
Math achievement | Continuous (standard deviations) | 0.12
Teaching Fellows Program
Math achievement | Continuous (standard deviations) | 0.15

In contrast, for the other two studies, the findings suggest that the impact of the intervention in a site can be predicted accurately by taking the simple average of the impact estimates in the other sites. For the evaluation of G.R.E.A.T., the RMSPE estimates in Table 5 for outcomes measured in standard deviation units were close to zero and smaller than the average effect size of 0.09 found for other middle school-based prevention programs (Wilson, Gottfredson & Najaka, 2001). For the evaluation of a federal student mentoring program, the RMSPE estimates for outcomes measured in standard deviation units were generally close to zero and uniformly much smaller than the average effect size of 0.21 found for similar programs (DuBois et al., 2011). Together, the estimates from these two studies reflect prediction errors that are too small to mislead local policymakers into reaching the wrong conclusion about the likely impacts on these outcomes locally.15

The second question we addressed is whether the advanced statistical methods we tested, lasso and BART, combined with different sets of potential moderators, produced more accurate impact predictions than the simple average of the impact estimates in the other sites, and predictions accurate enough to support local policymaking. To address this question, we analyzed data from the four studies that estimated the impact of the intervention on measures of student achievement, and for which the simple average yielded inaccurate impact predictions, as evidenced by RMSPE estimates greater than the thresholds established earlier: (1) charter middle schools, (2) Teach for America, (3) the Teaching Fellows Program, and (4) the summer reading program. However, most analyses were restricted to the three studies that collected data on all four types of moderators: (1) charter middle schools, (2) Teach for America, and (3) the Teaching Fellows Program.16 We tested the performance of lasso and BART relative to the simple average approach and relative to each other, separately for each of the four types of moderator variables described earlier as well as for all four moderator types together.

The results (Figures 2-6) provide little evidence that lasso regression produces more accurate impact predictions for individual sites than the simple average of the impact estimates in the other sites. The method did not produce consistently smaller RMSPE estimates than the simple average of the impact estimates in the other sites for either the four types of moderator variables individually (Figures 2, 3, 4, and 5) or for all four types together (Figure 6). Lasso regression increased the RMSPE for the Teaching Fellows Program for all tested combinations of moderator variables17; for the other two studies, it increased the RMSPE for some outcomes and types of moderators and reduced it for others.18

Figure 2. RMSPE for Different Prediction Methods Using Site Characteristics as Moderators.

Note: ‘*’ identifies scenarios in which no moderator variables were offered to lasso or BART because none of them had a correlation of at least 0.20 with the outcome.

Figure 6. RMSPE for Different Prediction Methods: All Four Types of Moderators

Figure 3. RMSPE for Different Prediction Methods Using Aggregate Study Participant (Student) Characteristics as Moderators.

Note: ‘*’ identifies scenarios in which no moderator variables were offered to lasso or BART because none of them had a correlation of at least 0.20 with the outcome.

Figure 4. RMSPE for Different Prediction Methods Using Intervention-Related Characteristics as Moderators

Figure 5. RMSPE for Different Prediction Methods Using Counterfactual-Related Characteristics as Moderators

The performance of BART is more mixed, generally reducing the RMSPEs for the charter middle school study but not for the other three studies. For charter middle schools, BART produced predictions with lower RMSPE than the simple average prediction method for 13 of the 15 predictions across the different outcomes and sets of moderator variables. Of these 13 differences, 12 were statistically significant. In contrast, for the other three studies—Teach for America, the Teaching Fellows Program, and the summer reading program—BART produced lower RMSPE for some predictions and higher RMSPE for others. These differences were statistically significant for only one of the nine BART predictions—roughly as we should expect if the differences were purely due to random chance instead of systematic differences in the performance of different methods.

However, even with BART, the prediction errors were large enough to potentially make effective interventions appear ineffective and vice versa. For the four studies in which the simple average prediction method produced RMSPE estimates of greater than 0.04 standard deviations—the average impact of educational interventions on broad measures of student achievement—BART was never able to reduce the RMSPE below 0.04 standard deviations. The smallest RMSPE generated by BART was 0.09 standard deviations for predictions of the impact of Teach for America on math achievement using all four types of moderators.

Figures 7 and 8 show how much the prediction accuracy of the lasso and BART models improves with access to data from all four types of moderators versus data from only a single type of moderator. For lasso, Figure 7 shows that prediction accuracy does not typically improve from having access to all four types of moderators: The RMSPE from the models with all four types (as denoted with the vertical bar) is sometimes larger and sometimes smaller than the RMSPE from models with a single type. In contrast, for BART, Figure 8 shows that prediction accuracy often but not always improves from having access to all four types of moderators. For the charter middle school study, the reductions in RMSPE from BART were greater with all four types of moderators than with just a single type. These differences were statistically significant for 8 of 11 combinations of outcome variable and type of moderator variable. This suggests that for this one study, BART predictions benefit from having access to a wider range of moderator types. This pattern generally held for the study of Teach for America: The estimated RMSPE was lower with all four moderator types than for any single type of moderator, but only one of those four differences was statistically significant. For the study of the Teaching Fellows Program, the results were mixed: The estimated RMSPE was lower with all four moderator types than the RMSPEs with two of the four moderator types but higher than the RMSPEs with the other two types of moderators. Finally, this analysis could not be completed for the summer reading study because the data contained only one type of moderator (characteristics of the students who participated in the study).

Figure 7. RMSPE with All Four Types of Moderators: Lasso.

Note: ‘*’ identifies scenarios in which no moderator variables were offered to lasso because none of them had a correlation of at least 0.20 with the outcome.

Figure 8. RMSPE with All Four Types of Moderators: BART.

Note: ‘*’ identifies scenarios in which no moderator variables were offered to BART because none of them had a correlation of at least 0.20 with the outcome.

DISCUSSION

The findings from this study show that how accurately multi-site RCTs can predict impacts in individual sites depends on the variation in impacts across sites. For the four RCTs that examined the impact of educational interventions on mathematics or reading achievement, the average impact estimate provided an inaccurate prediction of the intervention’s impact for individual sites. In contrast, for the RCTs that focused on behavioral outcomes—the G.R.E.A.T. and student mentoring studies—the average impact estimate did provide fairly accurate impact predictions for individual sites. It appears that the key factor influencing the accuracy of these predictions is the extent to which the impact varies across sites: This variation was substantial for the four studies focused on mathematics and reading achievement but close to zero for the two studies focused on behavioral outcomes. Multi-site RCTs should report evidence on the cross-site impact variation to help readers assess the likelihood that the average impact reported by the RCT provides reasonably accurate impact predictions for individual sites.

When the intervention’s impact varies substantially across sites, additional findings cast doubt on whether accurate site-level impact predictions can be produced through statistical modelling. While one of the two statistical methods tested—BART—was often able to produce more accurate site-level impact predictions than the simple average of the impact estimates in the other sites, the prediction errors were still large enough to make an effective intervention appear to be ineffective or an ineffective intervention appear to be effective. The prediction errors from BART and lasso were large even with data on four types of potential impact moderators. Moreover, these types of moderators are not always available in RCTs. And even when they are, local policymakers may not be able to predict what the values of those moderators will be locally before the intervention is implemented, making models that include them of questionable usefulness to local policymakers trying to decide whether to implement the intervention. The study’s findings suggest that even with moderators that probably could not practically be used to make local impact predictions, the lasso and BART models produced impact predictions that were too inaccurate to inform local policy decisions. In addition, this study examined settings where impacts were being predicted for sites that differed randomly, not systematically, from the sites used to make the predictions. The ability to predict site-specific impacts would likely decline even further in scenarios where the sites in a study are not representative of the population of interest for predicting site-specific impacts.

However, we believe it is too early in the development of this literature to reach firm, negative conclusions about the ability to use multi-site RCTs to make accurate predictions of impact in individual sites. A more optimistic interpretation of the findings could be justified by the fact that the highest performing approach to modelling impacts—BART with rich data on impact moderators—improved the accuracy of site-level impact predictions for some of the RCTs. Furthermore, while these models are only useful to local policymakers that know or can predict the local values of those moderators in advance, there may be circumstances where local knowledge of key moderators—including the characteristics of the students who would participate and of the counterfactual condition—would allow local policymakers to benefit from these models. Assessing the practical value of these models would require research evidence on what local policymakers know or can predict in advance when deciding whether to adopt an intervention locally.

We acknowledge that the results here may be sensitive to the moderator variables that are available for use. Further theoretical or empirical research on the factors that moderate treatment effects could guide randomized trials to collect data on stronger impact moderators, and ultimately lead to better models of impact moderation that can be used to predict impacts in individual sites. Randomized trials are generally not designed to estimate site-specific impacts, and our exploration of their use for this purpose may be pushing them beyond their original goals. However, it is crucial that randomized trials do inform practice, and methods for designing trials to help facilitate their use for predicting impacts for individual sites should be encouraged.

Finally, estimates of the magnitude of prediction errors for individual sites should not be the last word on the performance of different prediction methods. The RMSPE estimates produced by this study do not account for the effects that those errors may have on policy decisions about whether to adopt interventions, or the welfare consequences of adopting ineffective interventions—or foregoing effective ones—due to errors in predicting their effects. While Orr et al. (2019) takes an initial step in that direction, additional research is needed to identify the conditions under which prediction errors of different magnitudes have serious consequences for decision-making and social welfare.

Supplementary Material

Supp 1
Supp 2
Supp 3

Acknowledgments

The study was supported by a grant from the William T. Grant Foundation and by National Institute of Mental Health grant P50MH115842. The opinions expressed are those of the authors and do not represent the views of the funders. The authors would like to acknowledge the excellent programming support provided by Vasiliy Sergueev and William Zhu.

Footnotes

1. Cross-validation is a technique used to protect against overfitting in a predictive model, particularly when the amount of data is limited. In cross-validation, the dataset is divided into several subsets, typically referred to as “folds.” A given model is fit multiple times, each time excluding a single fold, and the fitted model is then used to generate predictions for the held-out fold. This process is repeated while varying the model hyperparameters, most often hyperparameters that increase or decrease the number of variables included in the model. The “best” model is typically identified as the one with the lowest average out-of-sample prediction error. The use of multiple folds generally reduces the variance of this error estimate, since it is less likely to be driven by outliers or by an unlucky choice of a single fold.
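To make the procedure concrete, the sketch below selects a lasso penalty by k-fold cross-validation. It is an illustration only, not the study’s implementation: the simulated data, the scikit-learn Lasso estimator, and the candidate penalty grid are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

# Illustrative data: X holds site-level moderators, y holds site-level impact estimates.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = 0.3 * X[:, 0] + rng.normal(scale=0.5, size=40)

alphas = np.logspace(-3, 1, 20)                 # candidate penalty (hyperparameter) values
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_error = []
for alpha in alphas:
    fold_errors = []
    for train_idx, test_idx in kf.split(X):
        # Fit on all folds except one, then predict for the held-out fold.
        model = Lasso(alpha=alpha).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
    cv_error.append(np.mean(fold_errors))

best_alpha = alphas[int(np.argmin(cv_error))]   # penalty with lowest average out-of-sample error
print(best_alpha)
```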

2. For some purposes, aggregating student-level data to the site level would limit the analysis (e.g., remove the ability to test for subgroup differences within site). However, for our purposes, only site-level variation is helpful in predicting site-level impacts—that is, the average impact in a site. Individual-level covariates cannot explain any additional variation across sites beyond what is explained by the site-level aggregates.
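As a minimal illustration of this aggregation step (the column names and values below are hypothetical, not taken from the study’s data files):

```python
import pandas as pd

# Hypothetical student-level file: one row per student, with a site identifier.
students = pd.DataFrame({
    "site_id": [1, 1, 2, 2, 2],
    "pretest": [0.2, -0.1, 0.5, 0.0, 0.3],
    "female":  [1, 0, 1, 1, 0],
})

# Collapse to one row per site: site means of each student-level covariate.
site_level = students.groupby("site_id").mean().add_prefix("mean_")
print(site_level)
```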

3. Ten of the grantees enrolled students into the study in two successive years. Unlike the original study authors, we treated both cohorts of students for a single grantee as belonging to the same site.

4. For state tests, the study authors standardized the scores relative to all test takers in the same state and grade level. For nationally normed tests, the study authors standardized the scores relative to all test takers in the national norming population.

5. These summary statistics include those students who contributed to this study. In constructing impact estimates for a particular outcome, we excluded sites with missing values for all students in that site.

6. Missing data in the pre-intervention measure of the outcome was addressed using the dummy variable method (e.g., Jones, 1996; Puma et al., 2019). Specifically, the model included an indicator variable that equaled 1 if the pre-intervention measure of the outcome was missing and equaled 0 otherwise. When the pre-intervention measure of the outcome was missing, it was set to 0 for the analysis.
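A minimal sketch of the dummy variable method is shown below; the variable names and the use of pandas are illustrative assumptions rather than the study’s code.

```python
import numpy as np
import pandas as pd

# Hypothetical analysis file with a partially missing pre-intervention test score.
df = pd.DataFrame({"pretest": [0.4, np.nan, -0.2, np.nan, 0.1]})

# Dummy variable method: flag missingness, then set the missing values to 0.
df["pretest_missing"] = df["pretest"].isna().astype(int)
df["pretest"] = df["pretest"].fillna(0)

print(df)
```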

7. If our goal were to obtain the most accurate estimate of the impact in each site using data from all sites, we would have used Empirical Bayes methods to estimate site-specific impacts. However, for the meta-analysis described later in the paper, unbiased impact estimates are sufficient, and OLS provides them.
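For concreteness, the sketch below computes unbiased site-specific impact estimates by fitting a separate OLS regression of the outcome on a treatment indicator in each site. It is a simplified, assumption-based example (simulated data, no covariates such as the pretest), not the authors’ estimation code.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical student-level data: outcome, treatment indicator, and site id.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "site":      np.repeat([1, 2, 3], 30),
    "treatment": np.tile([0, 1], 45),
    "outcome":   rng.normal(size=90),
})

# Unbiased site-specific impact estimates: a separate OLS regression within each site.
estimates = {}
for site, d in df.groupby("site"):
    X = sm.add_constant(d["treatment"])
    fit = sm.OLS(d["outcome"], X).fit()
    estimates[site] = (fit.params["treatment"], fit.bse["treatment"])  # (impact, std. error)

print(estimates)
```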

8. The original analysis plan was posted on June 12, 2020 (see https://osf.io/vbs36). Minor revisions were made to the analysis plan during the conduct of the analysis; a revised analysis plan was posted on November 29, 2021 (see https://osf.io/yqfzt). Finally, the paper deviated from the analysis plan to address feedback during the peer review process.

9. At earlier stages in the analysis, using the approach to estimating the RMSPE from Orr et al. (2019), we also tested precision-weighted averages of the site-specific impact estimates. In that analysis, we found little difference in RMSPE between the simple average of the site-specific impact estimates and the precision-weighted average of those estimates.
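A small sketch contrasting the two averages appears below; the impact estimates and standard errors are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical leave-one-site-out setting: impact estimates and standard errors
# for the "other" sites, used to predict the impact in a held-out site.
impacts = np.array([0.12, 0.05, 0.20, -0.03])
ses     = np.array([0.06, 0.04, 0.10, 0.05])

simple_average = impacts.mean()

# Precision-weighted average: weight each site by the inverse of its sampling variance.
weights = 1.0 / ses**2
precision_weighted = np.sum(weights * impacts) / np.sum(weights)

print(simple_average, precision_weighted)
```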

10. This could be due either to the lack of any real signal in the data for predicting site-specific impacts or to lasso’s inability to distinguish signal from noise when confronting a large number of potential moderators, many of which may be weak moderators at best.

11. The meta-analysis can decompose the error variance into the cross-site variance and the within-site variance because evidence on the within-site variance was available from the estimated standard errors of the unbiased impact estimates for each site. While the small sample sizes in each site could potentially lead to biased random-effects meta-analysis estimates through correlations between the estimated effects and their standard errors, a simulation study on this issue found the bias to be small for difference-in-means and standardized mean difference estimators (Lin, 2018).
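As one concrete way to carry out such a decomposition, the sketch below applies the DerSimonian-Laird moment estimator of the cross-site variance to hypothetical site-level estimates. The paper does not specify this particular random-effects estimator, so treat the choice, and the data, as illustrative assumptions.

```python
import numpy as np

# Hypothetical site-specific impact estimates and their standard errors.
y  = np.array([0.15, 0.02, 0.30, -0.05, 0.10])   # unbiased impact estimate in each site
se = np.array([0.08, 0.06, 0.12, 0.07, 0.09])    # within-site standard errors

# DerSimonian-Laird random-effects estimate of the cross-site (between-site) variance.
w = 1.0 / se**2
y_bar = np.sum(w * y) / np.sum(w)                 # precision-weighted mean impact
Q = np.sum(w * (y - y_bar) ** 2)                  # heterogeneity statistic
k = len(y)
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / c)                # cross-site variance in true impacts

within_variance = np.mean(se**2)                  # average within-site (estimation-error) variance
print(tau2, within_variance)
```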

12. Kraft (2020) reported the unweighted mean, the weighted mean (weighted by the inverse of the variance of the impact estimates), and the median. We used the weighted mean for consistency with the other meta-analyses, which reported only weighted means.

13. While the meta-analysis included estimates from studies that the authors acknowledged were based on weak study designs, their analysis suggests that the inclusion of these estimates did not substantially bias their estimates of average effects (Wilson, Gottfredson & Najaka, 2001, pp. 262-263).

14. While this meta-analysis reported separate estimates by outcome domain (estimates we could have matched to the different outcome domains explored in our study), the meta-analysis found little variation in impacts by outcome domain (DuBois et al., 2011, pp. 67-68).

15. Ongoing research by the authors of this paper directly estimates the probability that the site-level predictions from all six studies would lead local policymakers to the wrong conclusions about the effectiveness of the interventions.

16. The study of summer reading could only be included in analyses focused on moderators based on the characteristics of participating students; the other types of moderator variables were not collected by the original study.

17. These differences were typically statistically significant at the 90% level using two-tailed tests. For example, when lasso regression was applied to moderator variables on intervention features, lasso yielded a larger RMSPE than the simple average of the impacts in the other sites for 994 out of 1,000 bootstrap samples.
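The sketch below illustrates the general form of such a bootstrap comparison: resampling sites and counting how often one method’s RMSPE exceeds the other’s. The error values are simulated, and the actual analysis presumably re-estimated the prediction models within each bootstrap sample, so this is a simplified, assumption-based example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical site-level prediction errors from two methods for the same sites.
errors_lasso   = rng.normal(0.10, 0.15, size=30)   # prediction error per site, lasso
errors_average = rng.normal(0.05, 0.15, size=30)   # prediction error per site, simple average

def rmspe(errors):
    """Root mean squared prediction error across sites."""
    return np.sqrt(np.mean(errors ** 2))

n_boot = 1000
lasso_worse = 0
for _ in range(n_boot):
    # Resample sites with replacement and compare the two methods' RMSPEs.
    idx = rng.integers(0, len(errors_lasso), size=len(errors_lasso))
    if rmspe(errors_lasso[idx]) > rmspe(errors_average[idx]):
        lasso_worse += 1

# Share of bootstrap samples in which lasso has the larger RMSPE.
print(lasso_worse / n_boot)
```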

18. While the differences were typically not statistically significant, for reading achievement in year 1 of the charter middle school study, lasso with site characteristics, counterfactual characteristics, and all four types of moderators yielded larger RMSPEs than the simple average of the impacts in the other sites for at least 950 out of 1,000 bootstrap samples.

Contributor Information

Robert B. Olsen, George Washington Institute of Public Policy, The George Washington University, Washington, DC 20052

Larry L. Orr, Department of Health Policy and Management, Johns Hopkins Bloomberg School of Public Health, Chevy Chase, MD 20815.

Stephen H. Bell, Bell Eval LLC, Kensington, Maryland, USA

Elizabeth Petraglia, Westat, Rockville, MD 20850.

Elena Badillo-Goicoechea, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205.

Atsushi Miyaoka, Westat, Rockville, MD 20850.

Elizabeth A. Stuart, Departments of Mental Health, Biostatistics, and Health Policy and Management, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205

References

1. Bernstein L, Rappaport CD, Olsho L, Hunt D, & Levin M (2009). Impact Evaluation of the US Department of Education’s Student Mentoring Program. Final Report. NCEE 2009-4047. National Center for Education Evaluation and Regional Assistance.
2. Bloom HS, Richburg-Hayes L, & Black AR (2007). Using covariates to improve precision for studies that randomize schools to evaluate educational interventions. Educational Evaluation and Policy Analysis, 29(1), 30–59.
3. Chipman HA, George EI, & McCulloch RE (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298.
4. Clark MA, Chiang HS, Silva T, McConnell S, Sonnenfeld K, Erbe A, & Puma M (2013). The Effectiveness of Secondary Math Teachers from Teach For America and the Teaching Fellows Programs. NCEE 2013-4015. National Center for Education Evaluation and Regional Assistance.
5. Donoho D, & Stodden V (2006, July). Breakdown point of model selection when the number of variables exceeds the number of observations. In The 2006 IEEE International Joint Conference on Neural Network Proceedings (pp. 1916–1921). IEEE.
6. Dorie V, Hill J, Shalit U, Scott M, & Cervone D (2019). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1), 43–68.
7. DuBois DL, Portillo N, Rhodes JE, Silverthorn N, & Valentine JC (2011). How effective are mentoring programs for youth? A systematic assessment of the evidence. Psychological Science in the Public Interest, 12(2), 57–91.
8. Eberlin LS, Tibshirani RJ, Zhang J, Longacre TA, Berry GJ, Bingham DB, … & Poultsides GA (2014). Molecular assessment of surgical-resection margins of gastric cancer by mass-spectrometric imaging. Proceedings of the National Academy of Sciences, 111(7), 2436–2441.
9. Esbensen FA, Peterson D, Taylor TJ, & Osgood DW (2012). Results from a multi-site evaluation of the GREAT program. Justice Quarterly, 29(1), 125–151.
10. Finch WH, & Finch MEH (2016). Regularization methods for fitting linear models with small sample sizes: Fitting the Lasso estimator using R. Practical Assessment, Research, and Evaluation, 21(1), 7.
11. Friedman J, Hastie T, & Tibshirani R (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
12. Hill J (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217–240.
13. Gleason P, Clark M, Tuttle CC, & Dwoyer E (2010). The Evaluation of Charter School Impacts: Final Report. NCEE 2010-4029. Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
14. Kern HL, Stuart EA, Hill J, & Green DP (2016). Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness, 9(1), 103–127.
15. Lin L (2018). Bias caused by sampling error in meta-analysis with small sample sizes. PloS One, 13(9), e0204056.
16. Lipsey MW, Puzio K, Yun C, Hebert MA, Steinka-Fry K, Cole MW, & Busick MD (2012). Translating the Statistical Representation of the Effects of Education Interventions into More Readily Interpretable Forms. National Center for Special Education Research.
17. Olsen R, Stuart EA, Orr L, & Bell S (2020). How Much Can Evidence from National Studies Improve Local Policy Decisions that Affect Youth? Analysis Plan. https://osf.io/vbs36/
18. Orr LL, Olsen RB, Bell SH, Schmid I, Shivji A, & Stuart EA (2019). Using the results from rigorous multisite evaluations to inform local policy decisions. Journal of Policy Analysis and Management, 38(4), 978–1003.
19. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
20. Weiss MJ, Bloom HS, & Brock T (2014). A conceptual framework for studying the sources of variation in program effects. Journal of Policy Analysis and Management, 33(3), 778–808.
21. Weiss MJ, Bloom HS, Verbitsky-Savitz N, Gupta H, Vigil AE, & Cullinan DN (2017). How much do the effects of education and training programs vary across sites? Evidence from past multisite randomized trials. Journal of Research on Educational Effectiveness, 10(4), 843–876.
22. Wilkins C, Gersten R, Decker LE, Grunden L, Brasiel S, Brunnert K, & Jayanthi M (2012). Does a Summer Reading Program Based on Lexiles Affect Reading Comprehension? Final Report. NCEE 2012-4006. National Center for Education Evaluation and Regional Assistance.
23. James G, Witten D, Hastie T, & Tibshirani R (2013). An Introduction to Statistical Learning (Vol. 112, p. 18). New York: Springer.
