Author manuscript; available in PMC: 2019 Jul 1.
Published in final edited form as: Res Soc Work Pract. 2017 Jul 27;28(5):532–537. doi: 10.1177/1049731517720730

Generalizability of randomized trial results to target populations: Design and analysis possibilities

Elizabeth A Stuart 1, Benjamin Ackerman 2, Daniel Westreich 3
PMCID: PMC6049838  NIHMSID: NIHMS979470  PMID: 30034203

Abstract

Randomized trials play an important role in estimating the effect of a policy or social work program in a given population. While most trial designs benefit from strong internal validity, they often lack external validity, or generalizability, to the target population of interest. In other words, one can obtain an unbiased estimate of the study sample average treatment effect (SATE) from a randomized trial; however, this estimate may not equal the target population average treatment effect (TATE) if the study sample is not fully representative of the target population. This paper provides an overview of existing strategies to assess and improve upon the generalizability of randomized trials, both through statistical methods and study design, as well as recommendations on how to implement these ideas in social work research.

Keywords: Literature Review, Evidence-based Practice


Many questions of policy or practice interest involve estimates of the effect of some policy or program in a target population of interest. For example, a social work agency may be interested in predicting the average effects if all of their clients receive a new model of program delivery, or a state may be deciding whether to invest in a new training program for social workers across the state. A challenge in estimating these effects, however, is that common existing study designs are often not well suited to estimating such target population effects. In particular, randomized trials are often conducted in study samples that are explicitly not representative of the target populations in which the policies or programs may eventually be implemented.

Randomized trials have played a critical role in informing evidence-based social work practice, used alongside physicians’ expertise and patients’ preferences to make the best practical decisions on an individual level (Soydan, 2008). Of interest in this paper, however, is not how randomized trials can inform personalized decision-making, but rather how average effects of interventions can impact policy and community-level outcomes in well-defined populations. Consider, for example, a randomized trial in which high school students at a public school are provided with training on how to prevent intimate partner violence, and are then followed for four years. If the trial results suggest that, on average, students who received the training reported less violent behavior with their partners than the control group, then it may be in a state’s interest to implement such programming on a larger scale. The methods discussed here can help a state assess how relevant the findings in the trial are to the state as a whole, and what the average effects might be if the program were implemented statewide.

This paper provides an overview of design and analysis methods for assessing and enhancing our ability to estimate the effects of interventions in well-defined target populations. Because other work has primarily focused on analysis methods, we put somewhat more emphasis on study design options for estimating causal effects in well-defined target populations. Recently researchers have distinguished “generalizability,” which involves generalizing results from a study sample to the population from which that sample was selected (potentially randomly, but more commonly non-randomly) (Cole & Stuart, 2010), from “transportability,” which involves estimating effects in a completely external population, one that the study sample was not drawn from (Hernán & VanderWeele, 2011; Bareinboim & Pearl, 2013). In general, the methods described in this paper are relevant for both scenarios (in part because it is sometimes difficult to draw a bright line between the two), but distinctions between the two scenarios are noted when appropriate.

This paper proceeds as follows. We first present background on the problem, including some notation and a clear description of the goal of analysis and the setting. We then briefly describe analysis strategies for estimating target population treatment effects before turning to study design strategies to enhance the generalizability of trial results to well-defined target populations. We end with a broader discussion, including relevance of the ideas for social work research.

Background on the problem

The first step in examining generalizability or transportability is to identify the target population of interest. Discussing “generalizability” or “transportability” without specifying a target population is essentially meaningless, and a particular study may be generalizable to one population but not to another (in fact, that is essentially always the case). All too often this initial step is not taken, however; researchers jump to discussing “the generalizability” of a study without clarifying to what population they are interested in generalizing. For example, two states (with very different populations) may both be interested in determining whether the Nurse Family Partnership (Olds et al., 1998) might be beneficial for the new parents in their state; the residents of these two states would be two different, but both well-defined, target populations. Throughout the rest of this paper we will assume that the target population has been well specified and defined.

Clarification of estimands

We assume that a randomized trial has been conducted in a sample of size n, and there is a well-defined target population of size N to which researchers would like to generalize the results from the randomized trial (e.g., a randomized trial of the Nurse Family Partnership; Olds et al., 1998).

The randomized trial can provide an unbiased effect estimate for the study sample: $\text{SATE} = \frac{1}{n} \sum_{i=1}^{N} S_i \, (Y_i(1) - Y_i(0))$, where $n$ denotes the sample size of the trial, $Y_i(1)$ denotes the outcome for subject $i$ if they receive treatment, $Y_i(0)$ denotes the outcome for subject $i$ if they receive the control condition, and $S_i = 1$ if subject $i$ is in the trial sample, and 0 otherwise. However, ultimate interest is in the target population average treatment effect: $\text{TATE} = \frac{1}{N} \sum_{i=1}^{N} (Y_i(1) - Y_i(0))$. While the effect estimate in the trial is unbiased for the sample in the trial, it is not necessarily unbiased for the TATE.
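To make these estimands concrete, the following sketch (a toy simulation of our own; Python, with all variable names hypothetical) generates a population in which both potential outcomes are known and the treatment effect depends on a binary covariate, then computes the TATE and the SATE under non-random selection into the trial:

import numpy as np

rng = np.random.default_rng(0)
N = 10_000                              # target population size

# Simulated potential outcomes: the treatment effect is larger when z = 1.
z = rng.binomial(1, 0.3, size=N)        # binary effect moderator
y0 = rng.normal(0, 1, size=N)           # outcome under control
y1 = y0 + 0.2 + 0.6 * z                 # outcome under treatment

# Non-random selection into the trial: z = 1 units are oversampled.
p_select = np.where(z == 1, 0.20, 0.05)
s = rng.binomial(1, p_select)           # s = 1 for trial participants

tate = np.mean(y1 - y0)                 # average effect in the population
sate = np.mean((y1 - y0)[s == 1])       # average effect in the trial sample
print(f"TATE = {tate:.3f}, SATE = {sate:.3f}")

Because units with z = 1 are overrepresented in the trial, the SATE here is noticeably larger than the TATE, previewing the bias formalized below.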

When will sample and target population effects differ?

Intuitively and formally, the sample and target effects will differ if there are factors that moderate (modify) treatment effects AND the distribution of those factors differs between the sample and the target population. For example, an intervention may be more effective among young adults, and different locations may have different age distributions. That combination can lead to bias when trying to generalize the results of a trial from one location to another.
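As a hypothetical numerical illustration (numbers of our own choosing): suppose the effect is 0.50 among young adults and 0.10 among older adults, the trial sample is 80% young adults, and the target population is 30% young adults. Then

$$\text{SATE} = 0.8(0.50) + 0.2(0.10) = 0.42, \qquad \text{TATE} = 0.3(0.50) + 0.7(0.10) = 0.22,$$

so the trial would suggest an effect nearly twice what could actually be expected in the target population.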

Cole and Stuart (2010) present a formalization of this. Let α denote the TATE and β the SATE, such that the difference, β − α, represents the bias of the SATE as a measure of the TATE. Consider the simple setting where there is only one pretreatment covariate, Z, which is binary. Cole and Stuart (2010) derive the following formula for the bias of the SATE as a measure of the TATE:

$$\beta - \alpha = b_{xz} \times \left\{ \frac{P(Z=1)}{P(S=1)} \times \left[ P(S=1 \mid Z=1) - P(S=1) \right] \right\}.$$

Here, $b_{xz}$ denotes the coefficient for treatment effect heterogeneity due to Z obtained from the outcome model $E(Y_i) = b_0 + b_x X_i + b_z Z_i + b_{xz} X_i Z_i$, where X is a binary variable indicating treatment. Therefore, the bias depends on the magnitude of treatment effect heterogeneity ($b_{xz}$), the proportion of the target population sampled for the trial ($P(S=1)$), the overall prevalence of the pretreatment covariate Z ($P(Z=1)$), and the difference in the probability of participating in the trial across levels of Z ($P(S=1 \mid Z=1) - P(S=1)$). Note there will be no bias if the probability of being selected for the trial does not depend on Z ($P(S=1 \mid Z=1) = P(S=1)$), if the sample consists of the entire target population ($P(S=1) = 1$), or if there is no treatment effect heterogeneity across levels of Z ($b_{xz} = 0$).
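A small simulation (our own sketch, not from the cited paper) can confirm that this analytic bias matches the gap between the SATE and the TATE under the outcome model above:

import numpy as np

rng = np.random.default_rng(1)
N = 200_000
bx, bxz = 0.2, 0.6                      # main effect and interaction of treatment

z = rng.binomial(1, 0.3, size=N)        # binary pretreatment covariate
p_s = np.where(z == 1, 0.20, 0.05)      # selection probability depends on z
s = rng.binomial(1, p_s)

# Under the model E(Y) = b0 + bx*X + bz*Z + bxz*X*Z, the unit-level
# treatment effect is bx + bxz*z.
effect = bx + bxz * z
tate = effect.mean()
sate = effect[s == 1].mean()

# Analytic bias: bxz * { P(Z=1)/P(S=1) * [P(S=1|Z=1) - P(S=1)] }
pz, ps = z.mean(), s.mean()
ps_given_z1 = s[z == 1].mean()
analytic_bias = bxz * (pz / ps) * (ps_given_z1 - ps)
print(f"SATE - TATE = {sate - tate:.4f}, analytic bias = {analytic_bias:.4f}")

With the empirical probabilities plugged in, the two quantities agree exactly (up to floating-point error), since both reduce to $b_{xz}[P(Z=1 \mid S=1) - P(Z=1)]$.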

The equation above focuses on a continuous outcome and an effect estimate parameterized as a difference in outcome means. One key point worth noting is that when the outcome is binary, sample and target effects can be expected to differ on at least one scale (e.g., risk difference or risk ratio) whenever the baseline risks differ between the two populations (a difference in baseline risks is a sufficient condition for moderation of treatment effects on at least one scale). Thus any trial that overenrolls high-risk individuals from the target population (as is frequently done to enhance study power) will produce effect estimates that cannot be expected to generalize unconditionally on all scales.
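To see why, consider a hypothetical example of our own: suppose the risk ratio is a constant 0.5 in both the trial sample and the target population, but the baseline (control-group) risk is 0.40 in the sample and 0.10 in the population. Then

$$\text{RD}_{\text{sample}} = 0.5(0.40) - 0.40 = -0.20, \qquad \text{RD}_{\text{target}} = 0.5(0.10) - 0.10 = -0.05,$$

so the risk difference in the sample is four times that in the target population, even though the risk ratio transports perfectly.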

There is growing evidence in practice that randomized trial samples are often not representative of target populations of interest (see, e.g., Rothwell, 2005; Stirman et al., 2005). Braslow et al. (2005) documented that randomized trials of psychiatric treatment often underenrolled minorities (relative to a target population of individuals with psychiatric disorders across the United States). Wisniewski et al. (2009) compared individuals in a large-scale pragmatic effectiveness trial of depression treatment to the subset of patients who would likely have been included in a more typical efficacy trial (with standard inclusion and exclusion criteria) and found large differences in both characteristics and effects. More recent work in studies of drug abuse treatment documented that individuals in randomized trials of those treatments differ substantially from individuals seeking treatment for drug abuse in the US in general, especially in terms of employment status and education levels (Susukida et al., 2017).

In education research, Stuart et al. (2017) detailed large differences between the types of school districts that participate in large-scale “national” evaluations of educational interventions and three plausible target populations: districts nationwide, disadvantaged districts nationwide, and, for federally funded programs, the districts nationwide implementing those programs. Stuart et al. (2017) found large differences between the districts participating in evaluations and all of these populations; for example, large low- or mid-performing urban districts represent approximately 48% of the study samples but only 4% of districts nationwide and 7.5% of disadvantaged districts nationwide. Bell et al. (2016) then showed that these differences can result in bias when naively estimating the TATE using data from these trial samples, estimating that the external validity bias due to trial samples not representing the target population is on the order of 0.1 standard deviations.

In social work practice, interventions play a central role in improving conditions for clients, and randomized trials are the optimal method for evaluating the effectiveness of social work interventions. While there has been limited quantification of the differences between trial samples and target populations in social work research, several studies have discussed the limitation of not having representative samples. Zhai et al. (2010) concluded that, in order to better generalize the results from their trial examining dosage effects on the school readiness of preschool-aged children, future studies should recruit samples more demographically similar to the national population of interest. In a review of RCTs for parents of children with Autism Spectrum Disorders (ASD), Dababnah and Parish (2016) observed that, across studies, the generalizability of trial results was weakened by study samples lacking the racial, ethnic, and socioeconomic diversity of the target population of parents of children with ASD. Bronstein et al. (2015) likewise call for replication studies in more diverse communities to better address the generalizability of their results, underscoring the importance of representative study samples.

Analysis methods for estimating the target average treatment effect

Recent work has developed statistical approaches for estimating the target average treatment effect using data from a randomized trial and covariate information on the target population. Broadly, these primarily involve 1) weighting methods that weight the study sample to resemble the target population on baseline characteristics that may moderate treatment effects, 2) flexible models of the outcome fit in the study sample and then used to predict impacts in the target population, or 3) both methods combined. Kern et al. (2016) provide an overview of these approaches and simulation studies comparing their performance. Note that all of these approaches assume that there is a set of covariates that are observed consistently across the trial sample and target population datasets.

The weighting approach to generalization involves stacking the trial and population data and fitting a model of participation in the trial as a function of observed characteristics; essentially, adjusting for sample and population differences by modeling the probability of participating in the trial. Individuals in the trial are then weighted by one over their probability of participating in the trial (similar to non-response weights in survey samples or propensity score weights in non-experimental studies) in outcome analyses; these weighted outcome models provide an estimate of the TATE, adjusting for the sample and target population differences in observed covariates. Cole and Stuart (2010) present an example of this approach, generalizing the results of a randomized trial of treatment for HIV to the population of individuals newly infected with HIV in the United States in 2006. Similar approaches are described in Hartman et al. (2015), O’Muircheartaigh and Hedges (2014), and Tipton (2013). This approach can be thought of as a smoothed version of post-stratification, whereby effects might be estimated for specific subgroups in the trial (e.g., males and females) and the subgroup effects then weighted using the population distribution of that variable to obtain a population effect estimate; the weighting version allows researchers to adjust for a larger set of factors than would be possible using direct post-stratification (also known as standardization).
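As a minimal sketch of this weighting approach (our own illustration, not code from the cited papers), assume numpy arrays trial_X, trial_t, and trial_y holding trial covariates, treatment indicators, and outcomes, and pop_X holding the same covariates for the target population; the participation model here is a plain logistic regression from scikit-learn:

import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_tate(trial_X, trial_t, trial_y, pop_X):
    """Estimate the TATE by weighting trial units toward the target population."""
    # Stack trial and population records; S = 1 marks trial participation.
    X = np.vstack([trial_X, pop_X])
    S = np.concatenate([np.ones(len(trial_X)), np.zeros(len(pop_X))])

    # Model the probability of trial participation given covariates.
    participation = LogisticRegression(max_iter=1000).fit(X, S)
    p = participation.predict_proba(trial_X)[:, 1]

    # Weight each trial unit by one over its estimated participation probability.
    w = 1.0 / p

    treated, control = trial_t == 1, trial_t == 0
    return (np.average(trial_y[treated], weights=w[treated])
            - np.average(trial_y[control], weights=w[control]))

In practice the population data would typically be a representative survey or administrative dataset, and the estimated weights would be checked for extreme values before use.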

A second class of methods instead focuses on using data in the trial to model the outcome as a flexible function of treatment status and the covariates (including potential interactions), and then using that model to predict outcomes (and thus effects) in the target population, based on the covariate distribution observed in the population. This approach was examined in Kern et al. (2016) using a specific modeling approach called Bayesian Additive Regression Trees (BART), which fits a very flexible outcome model using a non-parametric approach similar to random forests. Kern et al. (2016) found that this approach worked quite well, even for somewhat complex outcome models.
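A minimal sketch of this outcome-modeling approach follows (again our own illustration; we substitute scikit-learn's gradient boosting for BART purely for convenience, since BART requires a separate package):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def outcome_model_tate(trial_X, trial_t, trial_y, pop_X):
    """Estimate the TATE by modeling outcomes in the trial, predicting in the population."""
    # Fit a flexible model of the outcome given covariates and treatment status.
    model = GradientBoostingRegressor()
    model.fit(np.column_stack([trial_X, trial_t]), trial_y)

    # Predict each population member's outcome under treatment and under control.
    n = len(pop_X)
    y1 = model.predict(np.column_stack([pop_X, np.ones(n)]))
    y0 = model.predict(np.column_stack([pop_X, np.zeros(n)]))
    return np.mean(y1 - y0)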

A third broad class of methods combines these two approaches, similar in spirit to “doubly robust” approaches in non-experimental studies (Kern et al., 2016). In particular, with these methods both selection (trial participation) and outcome models are used, with the outcome models fit using weights generated as in the first approach.
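A sketch of the combined approach (our simplified illustration of the general idea, not a full doubly robust estimator) fits the outcome model with the participation weights produced by the weighting step above:

import numpy as np
from sklearn.linear_model import LinearRegression

def combined_tate(trial_X, trial_t, trial_y, pop_X, weights):
    """Fit a participation-weighted outcome model, then predict in the population."""
    # 'weights' are the inverse-participation-probability weights from above.
    model = LinearRegression()
    model.fit(np.column_stack([trial_X, trial_t]), trial_y, sample_weight=weights)

    n = len(pop_X)
    y1 = model.predict(np.column_stack([pop_X, np.ones(n)]))
    y0 = model.predict(np.column_stack([pop_X, np.zeros(n)]))
    return np.mean(y1 - y0)

The motivation for combining the two models is that estimators of this general form can retain good performance when either the participation model or the outcome model is approximately correct.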

The primary assumption underlying all of these approaches is that of conditionally unconfounded sample selection: that we have observed the factors that moderate treatment effects and differ between sample and population. In other words, we have to be willing to assume that, once we adjust for the set of observed covariates, treatment effects are the same in the trial sample and the population. This assumption, sometimes called “ignorability of sample selection,” is formalized in Hartman et al. (2015) and Kern et al. (2016) (and differs depending on whether outcomes under the control condition are available in the population of interest). Huitfeldt et al. (2017) discuss variations on this assumption and its implications for variable selection in weighting- and outcome-model-based approaches.
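In the notation used earlier, one common version of this assumption (our paraphrase; the cited papers give the precise statements) can be written as

$$E[Y(1) - Y(0) \mid Z, S = 1] = E[Y(1) - Y(0) \mid Z],$$

that is, within levels of the observed covariates Z, the average treatment effect among trial participants equals the average treatment effect in the target population.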

The assumption of conditionally unconfounded sample selection can be a heroic one in practice, especially given the sometimes limited data on the population of interest (e.g., see Stuart & Rhodes, 2016). So what can we do instead? One key step is careful and thoughtful selection of covariates and attention to the comparability of measures across data sources. This selection can be greatly informed by theoretical models of participation in the randomized trials of interest and of the interventions themselves, and in particular of the factors that may relate to effects and participation. However, in practice we often do not observe all of the factors that we would like to adjust for. For these scenarios, sensitivity analyses have been developed to assess how much the TATE estimates would change if there were an unobserved effect moderator (Nguyen et al., 2017). Another, perhaps better, option is to use smart design choices to make these assumptions less heroic. We turn to these designs now.
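As a schematic illustration of such a sensitivity analysis (a simplified sketch in the spirit of, but not reproducing, Nguyen et al., 2017), one can trace out how the TATE estimate would shift under assumed values for an unobserved binary moderator's population prevalence and its interaction with treatment:

import numpy as np

def tate_sensitivity(sate_est, p_u_trial, p_u_pop_grid, interaction_grid):
    """Shift the trial estimate by the bias implied by an unobserved moderator U.

    sate_est:         effect estimate from the trial (after observed adjustment)
    p_u_trial:        prevalence of U among trial participants
    p_u_pop_grid:     assumed prevalences of U in the target population
    interaction_grid: assumed treatment-by-U interaction coefficients
    """
    results = {}
    for b in interaction_grid:
        for p_pop in p_u_pop_grid:
            bias = b * (p_u_trial - p_pop)   # analogous to the bias formula above
            results[(b, p_pop)] = sate_est - bias
    return results

# Hypothetical inputs: trial estimate 0.42, U prevalence 0.8 in the trial.
for (b, p), est in tate_sensitivity(0.42, 0.8, [0.2, 0.4, 0.6], [0.0, 0.25, 0.5]).items():
    print(f"interaction = {b:.2f}, population prevalence = {p:.1f}: TATE ~ {est:.3f}")

Tables or plots of such adjusted estimates show how strong an unobserved moderator would have to be to change the substantive conclusion.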

Design options for enhancing generalizability to a target population

When the target population of interest is known in advance of a randomized trial being conducted there are a number of design possibilities to better ensure that the results from the trial can be used to estimate effects in that target population. We note that these design options are not sufficient and the analysis strategies introduced above are often needed in addition, given that 1) there may be multiple target populations of interest from a given study (e.g., two US states may both be interested in estimating effects in their own state population), and 2) the target population of interest may change after the trial is conducted, including due to general temporal changes and time trends.

Perhaps the “gold standard” for estimating the TATE is a randomized trial conducted in a formally representative sample (Imai, King, & Stuart, 2008). We are aware of only a handful of studies that randomly sampled sites to participate from a well-defined target population (see Olsen et al., 2012). All evaluations in this category were of U.S. federal government programs, where program implementers (sites) could be mandated to participate in the evaluation: Upward Bound (Seftor et al., 2009), Job Corps (Burghardt et al., 1999; in fact, this study included ALL Job Corps sites across the US), and Head Start (Puma et al., 2010). The possibilities for such designs may increase in the future, however, with the growth of large-scale population administrative datasets. For example, a health system interested in studying a new warning system for potential drug interactions could evaluate it in a random sample of its providers or patients, drawn through an electronic health record system. Olsen and Orr (2016) present some of the considerations when setting up a study that aims for random selection from the target population. When there are concerns that some individuals may not agree to participate in a randomized trial, some studies conduct parallel randomized and non-randomized arms, whereby the individuals who do not consent to randomization are allowed to choose their treatment condition but their outcomes are still tracked over time.

Another design approach that has been proposed does not use random sampling from a population, but rather picks sites systematically in order to cover the target population (Shadish et al., 2002). One particular approach, formalized by Tipton et al. (2014), involves stratifying the population on factors strongly related to the outcome. It requires a sampling frame of potential study subjects, covariate information on them, and knowledge of the prognostic factors likely related to outcomes. Subjects are then selected for the study based on strata defined by those prognostic factors, with the goal of a final study sample that has representation from all strata. Tipton et al. (2014) illustrate the approach using the design of a scale-up study of mathematics and reading interventions.
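A rough sketch of such stratified site selection follows (our own illustration; Tipton et al. stratify on estimated propensity scores, for which we substitute k-means clustering of site-level covariates purely for simplicity):

import numpy as np
from sklearn.cluster import KMeans

def stratified_site_selection(site_X, n_strata=5, sites_per_stratum=4, seed=0):
    """Select sites so that every covariate stratum of the population is represented."""
    rng = np.random.default_rng(seed)
    # Form strata from site-level covariates (a stand-in for propensity strata).
    strata = KMeans(n_clusters=n_strata, n_init=10, random_state=seed).fit_predict(site_X)

    selected = []
    for k in range(n_strata):
        members = np.flatnonzero(strata == k)
        take = min(sites_per_stratum, len(members))
        selected.extend(rng.choice(members, size=take, replace=False))
    return np.array(selected)

Recruitment would then proceed within each stratum, with replacement sites drawn from the same stratum when a selected site declines.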

There may also be a place for non-experimental studies when primary interest is in a target population effect estimate. As formalized by Imai, King, and Stuart (2008), a well-done non-experimental study in a dataset representing the target population of interest may actually lead to less bias in the TATE than would a small-scale randomized trial in a very non-representative study sample, due to trade-offs between internal and external validity. Thus, a well-done non-experimental study (such as described by Rosenbaum (1999) or Rubin (2001)) that can be conducted in a sample representative of the target population of interest may be worth considering when interest is in informing decisions in that population.

Some of these design options may seem daunting, and in some contexts it may not be feasible to consider random selection of subjects for a randomized trial. However, even in those cases there are still important design lessons that can be taken from this literature. In particular, all randomized trials should collect data on variables that are likely to moderate effects and may relate to study participation. Studies should also consider their target population, and show a Table 1 documenting the characteristics of study participants and the target population. One prerequisite for doing so will be the collection of variables in a consistent way between trial sample and population datasets; e.g., with trials making an effort to use the measures that are available in common population datasets (e.g., large-scale national surveys). Najafzadeh and Schneeweiss (2017) discuss the importance of measure comparability in the context of medical trials and electronic health records to reflect target populations.
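One simple way to build such a comparison (our sketch) is to report, for each shared covariate, the trial and population means together with a standardized mean difference:

import numpy as np

def compare_samples(trial_X, pop_X, names):
    """Print a 'Table 1' of trial vs. population means with standardized differences."""
    print(f"{'covariate':<15}{'trial':>8}{'target':>8}{'std diff':>10}")
    for j, name in enumerate(names):
        m_t, m_p = trial_X[:, j].mean(), pop_X[:, j].mean()
        pooled_sd = np.sqrt((trial_X[:, j].var() + pop_X[:, j].var()) / 2)
        smd = (m_t - m_p) / pooled_sd if pooled_sd > 0 else 0.0
        print(f"{name:<15}{m_t:>8.2f}{m_p:>8.2f}{smd:>10.2f}")

Large standardized differences flag the covariates on which the trial sample least resembles the target population, and thus where weighting or modeling adjustments will matter most.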

Conclusions and recommendations for future work in Social Work

In summary, no trial is necessarily generalizable, or even generalizable in expectation, unless (i) the sample is the target population, or (ii) the sample is a simple random sample of the target population. Otherwise, the assumption of generalizability is effectively an observational-data analysis assumption. Until recently this point was underappreciated in nearly all fields, but it has important implications for the broader policy and practice relevance of research.

Thus, although generalization of results to target populations often rests on heroic assumptions, there are design and analysis choices that make it more plausible and believable. These include careful choice of measures and efforts to provide measures comparable across studies. Stuart and Rhodes (2016) found it very difficult to find data on a trial and target population in the field of early childhood education with any comparable measures; in fact, even the best example found had only seven measures in common between the trial and the population. This makes the assumption of unconfounded sample selection particularly problematic. One way to think about this is that the analysis approaches above, which adjust for observed effect moderators, can help move from an assumption of missing completely at random (MCAR) to one more like missing at random (MAR), but we can never eliminate the possibility of missing not at random (NMAR), just as in non-experimental studies we cannot guarantee that there is no unobserved confounding. Careful selection and use of observed covariates can at least move us a step in that direction.

Researchers should also consider whether the design approaches described above are feasible for their work. And as noted above, even when, e.g., random sampling from the target population is not feasible, efforts towards measure comparability with large-scale target population datasets will at least facilitate the use of analysis strategies to assess and enhance generalizability after the fact.

In this paper we have focused on situations with one randomized trial and one well-defined target population. In some contexts there might be multiple trials available (e.g., Petrosino et al., 2013), or a combination of experimental and non-experimental evidence, in which case other approaches may be more appropriate. Possibilities in that case include cross-design synthesis approaches, also known as research synthesis (Pressler & Kaizar, 2013; Prevost et al., 2000). Broadly, this class of methods might model effects as a function of study characteristics and explicitly model the internal and external validity bias, e.g., with prior distributions on the non-identified bias parameters (e.g., Turner et al., 2009).
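As a very rough sketch of the flavor of such bias modeling (our own toy example, far simpler than the cited methods), one might place a prior on each study's net bias, subtract bias draws from each estimate, and pool the bias-adjusted estimates by precision:

import numpy as np

rng = np.random.default_rng(2)

# Hypothetical study estimates (effect, standard error) and assumed priors
# on each study's net bias, e.g. wider for the non-experimental study.
studies = [
    {"est": 0.30, "se": 0.10, "bias_mean": 0.00, "bias_sd": 0.05},  # RCT
    {"est": 0.15, "se": 0.04, "bias_mean": 0.05, "bias_sd": 0.10},  # observational
]

draws = []
for s in studies:
    # Subtract a draw from the bias prior and add sampling noise.
    adjusted = (s["est"]
                - rng.normal(s["bias_mean"], s["bias_sd"], 10_000)
                + rng.normal(0, s["se"], 10_000))
    draws.append(adjusted)

# Precision-weight the bias-adjusted estimates (crude fixed-effect pooling).
var = np.array([d.var() for d in draws])
w = (1 / var) / (1 / var).sum()
pooled = sum(wi * d.mean() for wi, d in zip(w, draws))
print(f"Pooled bias-adjusted estimate: {pooled:.3f}")

Because the bias parameters are not identified from the data, conclusions from such analyses depend directly on the assumed priors, which is why they are best treated as structured sensitivity analyses.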

A number of fields are just beginning to understand the implications of these ideas, including, for example, how representative (or non-representative) their trials tend to be. Social work should begin to develop such an understanding, through documentation of the characteristics of the individuals and sites that participate in rigorous evaluations and how they compare to potential target populations. Data needs are paramount, however, in particular: 1) population data to provide background information on target populations, 2) potentially, population data to provide a sampling frame for the selection of study subjects, and 3) comparability of measures between those population datasets and randomized trials. The analysis approaches described in this paper can only go so far if the data are not available or appropriate.

In conclusion, this paper has provided a review of methods for enhancing the ability to draw target population inferences from randomized trials, attempting to bridge both internal validity and external validity and ensure that our research studies are as useful as possible for policy and practice.

Acknowledgments

This work was supported in part by the Institute of Education Sciences, U.S. Department of Education, through grant R305D150003 (PIs: Stuart and Olsen), by a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-1502-27794), and by the National Institutes of Health, through grant DP2-HD-08-4070 (PI: Westreich). The statements in this work are solely the responsibility of the authors and do not necessarily represent the views of the Institute of Education Sciences, the National Institutes of Health, or the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors, or its Methodology Committee.

Contributor Information

Elizabeth A. Stuart, Johns Hopkins Bloomberg School of Public Health.

Benjamin Ackerman, Johns Hopkins Bloomberg School of Public Health.

Daniel Westreich, University of North Carolina at Chapel Hill.

References

1. Bareinboim E, Pearl J. A general algorithm for deciding transportability of experimental results. Journal of Causal Inference. 2013;1:107–134.
2. Braslow JT, Duan N, Starks SL, Polo A, Bromley E, Wells KB. Generalizability of studies on mental health treatment and outcomes, 1981 to 1996. Psychiatric Services. 2005;56:1261–1268. doi: 10.1176/appi.ps.56.10.1261.
3. Bronstein LR, Gould P, Berkowitz SA, James GD, Marks K. Impact of a social work care coordination intervention on hospital readmission: A randomized controlled trial. Social Work. 2015;60:248–255. doi: 10.1093/sw/swv016.
4. Burghardt J, McConnell S, Meckstroth A, Schochet P, Johnson T, Homrighausen J. National Job Corps Study: Report on study implementation. 1999.
5. Cole SR, Stuart EA. Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology. 2010;172:107–115. doi: 10.1093/aje/kwq084.
6. Dababnah S, Parish SL. A comprehensive literature review of randomized controlled trials for parents of young children with autism spectrum disorder. Journal of Evidence-Informed Social Work. 2016;13:277–292. doi: 10.1080/23761407.2015.1052909.
7. Hartman E, Grieve R, Ramsahai R, Sekhon JS. From sample average treatment effect to population average treatment effect on the treated: Combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2015;178:757–778.
8. Hernán MA, VanderWeele TJ. Compound treatments and transportability of causal inference. Epidemiology. 2011;22:368–377. doi: 10.1097/EDE.0b013e3182109296.
9. Huitfeldt A, Swanson SA, Stensrud MJ, Suzuki E. Effect heterogeneity and variable selection for standardizing experimental findings. 2016. arXiv preprint arXiv:1610.00068. doi: 10.1007/s10654-019-00571-w.
10. Imai K, King G, Stuart EA. Misunderstandings between experimentalists and observationalists about causal inference. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2008;171:481–502.
11. Kern HL, Stuart EA, Hill J, Green DP. Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness. 2016;9:103–127. doi: 10.1080/19345747.2015.1060282.
12. Najafzadeh M, Schneeweiss S. From trial to target populations—Calibrating real-world data. New England Journal of Medicine. 2017;376:1203–1205. doi: 10.1056/NEJMp1614720.
13. Nguyen TQ, Ebnesajjad C, Cole SR, Stuart EA. Sensitivity analysis for an unobserved moderator in RCT-to-target-population generalization of treatment effects. The Annals of Applied Statistics. 2017;11:225–247.
14. O’Muircheartaigh C, Hedges LV. Generalizing from unrepresentative experiments: A stratified propensity score approach. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2014;63:195–210.
15. Olds D, Henderson CR Jr, Cole R, Eckenrode J, Kitzman H, Luckey D, … Powers J. Long-term effects of nurse home visitation on children’s criminal and antisocial behavior: 15-year follow-up of a randomized controlled trial. JAMA. 1998;280:1238–1244. doi: 10.1001/jama.280.14.1238.
16. Olsen RB, Orr LL. On the “where” of social experiments: Selecting more representative samples to inform policy. New Directions for Evaluation. 2016;2016(152):61–71.
17. Petrosino A, Turpin-Petrosino C, Hollis-Peel ME, Lavenberg JG. ‘Scared Straight’ and other juvenile awareness programs for preventing juvenile delinquency. The Cochrane Library. 2013. doi: 10.1002/14651858.CD002796.pub2.
18. Pressler TR, Kaizar EE. The use of propensity scores and observational data to estimate randomized controlled trial generalizability bias. Statistics in Medicine. 2013;32:3552–3568. doi: 10.1002/sim.5802.
19. Prevost TC, Abrams KR, Jones DR. Hierarchical models in generalized synthesis of evidence: An example based on studies of breast cancer screening. Statistics in Medicine. 2000;19:3359–3376. doi: 10.1002/1097-0258(20001230)19:24<3359::aid-sim710>3.0.co;2-n.
20. Puma M, Bell S, Cook R, Heid C, Shapiro G, Broene P, … Ciarico J. Head Start Impact Study: Final report. Administration for Children & Families; 2010.
21. Rosenbaum PR. Choice as an alternative to control in observational studies. Statistical Science. 1999:259–278.
22. Rothwell PM. External validity of randomised controlled trials: “To whom do the results of this trial apply?” The Lancet. 2005;365:82–93. doi: 10.1016/S0140-6736(04)17670-8.
23. Rubin DB. Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology. 2001;2:169–188.
24. Seftor NS, Mamun A, Schirm A. The impacts of regular Upward Bound on postsecondary outcomes seven to nine years after scheduled high school graduation: Final report. US Department of Education; 2009.
25. Shadish WR, Cook TD, Campbell DT. Experimental and quasi-experimental designs for generalized causal inference. Wadsworth Cengage Learning; 2002.
26. Soydan H. Applying randomized controlled trials and systematic reviews in social work research. Research on Social Work Practice. 2008;18:311–318.
27. Stirman SW, DeRubeis RJ, Crits-Christoph P, Rothman A. Can the randomized controlled trial literature generalize to nonrandomized patients? Journal of Consulting and Clinical Psychology. 2005;73:127. doi: 10.1037/0022-006X.73.1.127.
28. Stuart EA, Bell SH, Ebnesajjad C, Olsen RB, Orr LL. Characteristics of school districts that participate in rigorous national educational evaluations. Journal of Research on Educational Effectiveness. 2017;10:168–206. doi: 10.1080/19345747.2016.1205160.
29. Stuart EA, Rhodes A. Generalizing treatment effect estimates from sample to population: A case study in the difficulties of finding sufficient data. Evaluation Review. 2016. doi: 10.1177/0193841X16660663.
30. Susukida R, Crum RM, Ebnesajjad C, Stuart EA, Mojtabai R. Generalizability of findings from randomized controlled trials: Application to the National Institute of Drug Abuse Clinical Trials Network. Addiction. 2017. doi: 10.1111/add.13789.
31. Tipton E. Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics. 2013;38:239–266.
32. Tipton E, Hedges L, Vaden-Kiernan M, Borman G, Sullivan K, Caverly S. Sample selection in randomized experiments: A new method using propensity score stratified sampling. Journal of Research on Educational Effectiveness. 2014;7:114–135.
33. Turner RM, Spiegelhalter DJ, Smith G, Thompson SG. Bias modelling in evidence synthesis. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2009;172:21–47. doi: 10.1111/j.1467-985X.2008.00547.x.
34. Wisniewski SR, Rush AJ, Nierenberg AA, Gaynes BN, Warden D, Luther JF, … Trivedi MH. Can phase III trial results of antidepressant medications be generalized to clinical practice? A STAR*D report. American Journal of Psychiatry. 2009;166:599–607. doi: 10.1176/appi.ajp.2008.08071027.
35. Zhai F, Raver CC, Jones SM, Li-Grining CP, Pressler E, Gao Q. Dosage effects on school readiness: Evidence from a randomized classroom-based intervention. Social Service Review. 2010;84:615–655. doi: 10.1086/657988.
