Author manuscript; available in PMC: 2025 May 1.
Published in final edited form as: Ann Intern Med. 2024 Oct 8;177(11):1530–1538. doi: 10.7326/M23-2440

Target trial emulation for evaluating health policy

Nicholas J Seewald 1,2, Emma E McGinty 3, Elizabeth A Stuart 4
PMCID: PMC11817613  NIHMSID: NIHMS2028581  PMID: 39374529

Abstract

Target trial emulation is an approach to designing rigorous non-experimental studies by “emulating” key features of a clinical trial. Most commonly used outside policy contexts, this approach is also valuable for policy evaluation as policies typically are not randomly assigned. In this article, we discuss the application of the target trial emulation framework in a policy evaluation context. The policy trial emulation framework includes 7 components: the units and eligibility criteria, definitions of the exposure and comparison conditions, assignment mechanism, baseline (“time zero”) and follow-up, outcomes, causal estimand, and statistical analysis and assumptions. Policy evaluations that emulate a randomized trial across these dimensions can yield estimates of the causal effects of the policy on outcomes. Using the policy trial emulation framework to conduct and report on research design and methods supports transparent assessment of threats to causal inference in non-experimental studies intended to assess the effect of a health policy on clinical or population health outcomes.

INTRODUCTION

Researchers, policymakers, and stakeholders are interested in evaluating whether policies achieve their intended effects. The gold standard for estimating an intervention’s effects, the randomized controlled trial (RCT), is rarely feasible for this: policy adoption (e.g., by states, cities, organizations) is typically not randomly assigned (1,2). Well-designed non-experimental studies can, under assumptions, yield unbiased estimates of the causal effects of policies on outcomes. “Target trial emulation” is a formal approach to rigorous non-experimental design for causal inference that systematizes communication about key design elements. In this framework, researchers design a hypothetical randomized trial to address their research questions, then use it as a guide in designing a non-experimental study. Target trial emulation can enable stronger study designs outside of an RCT.

Most existing applications of the trial emulation framework have studied causal effects of interventions on individual patients (3,4), like oral antivirals for COVID-19 (5,6) or bariatric surgery to treat obesity (7). Guidance for designing and reporting trial emulations has focused on this intervention-to-individual-patient scenario (8–10). However, the target trial framework applies broadly and can be used for health policy evaluation (11,12), though limited guidance on designing and reporting in this context exists. This article aims to fill that gap by considering challenges in emulating a target trial in the policy context, even where consensus solutions do not yet exist. Policy evaluation seeks to understand effects of a policy on populations in settings where treatment assignment happens at the level of the policy-implementing unit. Data may be available only at that level, or at a lower level, e.g., individuals in a state. Therefore, the ways in which a non-experimental policy evaluation might “emulate” an RCT differ in some key components from clinical or epidemiologic trial emulation. Additional challenges include limited sample size, heterogeneity across jurisdictions, and co-occurring or confounding policies. Done well, target trial emulation mitigates these challenges.

COMPONENTS OF A POLICY TRIAL EMULATION

In general, policy evaluations ask scientific questions of the form “What is the effect of a policy on outcomes of interest over a defined period of time, relative to what would have happened in the absence of policy implementation?” Throughout this paper, we discuss approaches to operationalize and address this question with a well-designed non-experimental study.

The policy evaluations we consider are longitudinal studies, which take advantage of repeatedly measuring outcomes before and after policy implementation. This reduces confounding by time, by which one might, e.g., mistake normal temporal variation in the outcome for a policy effect, or vice versa. A policy trial emulation first designs an ideal, hypothetical longitudinal randomized trial to use as a benchmark for designing a non-experimental policy evaluation. Like other frameworks, including the in-development TARGET guidelines for epidemiologic trial emulations (13), we consider 7 aspects of this trial’s design: the units and eligibility criteria, definitions of the exposure and comparison conditions, assignment mechanism, baseline (“time zero”) and follow-up, outcomes, causal estimand, and statistical analysis and assumptions. We recommend researchers using policy trial emulation explicitly describe their target RCT and how it compares to their actual non-experimental study; we provide a template (Table 1) to guide documentation of both study designs. As an example, Supplement section 3 recreates a similar table for a published policy trial emulation that we authored, which estimated the effects of state medical cannabis laws on treatment for chronic non-cancer pain (12).

Table 1.

Example table comparing a non-experimental policy trial emulation to a hypothetical randomized trial along 7 key dimensions.

Each trial element below is described and then compared across the hypothetical target trial and its policy trial emulation analogue.

General Scientific Question
Description: What question motivates the study? Both the target trial and the non-experimental analogue should be designed to answer the same question.
Target trial and emulation analogue: In general, “What is the effect of a policy on outcomes of interest over a defined period of time, relative to what would have happened in the absence of the policy?” Operationalize this with clear, explicit definitions of the 7 components below.

1. Units and Eligibility Criteria
Description: What are the units that could implement the policy, and what factors make them eligible for inclusion in the study?
Hypothetical target trial:
  Policy-level units: Randomization-level units (e.g., states, counties, or organizations) that have not previously implemented the policy of interest (see Component 2) and meet other relevant eligibility criteria, such as lack of confounding policies at study start.
  Impact-level units: Units that the policy of interest is designed to affect and that meet eligibility criteria such as having data systems or reports that facilitate counting outcomes.
Policy trial emulation analogue:
  Policy-level units: Units that implemented either the policy of interest or the comparison condition (see Component 2) at the time of hypothetical study entry (see Component 4), defined only using information available at that time.
  Impact-level units: Units that the policy of interest is designed to affect and on which outcomes are collected, if impact-level data are available.

2. Definitions of Exposure & Comparison Conditions
Description: What is the specific policy (or versions of the policy) under study? What is the comparison condition?
Hypothetical target trial:
  Exposure condition: The policy of interest that policy-level units may be assigned to implement. There would be one policy that all units randomized to “receive treatment” would implement, or multiple versions if those variations are of scientific interest.
  Comparison condition: What implementation of the exposure condition (i.e., the policy) is compared to. It may be “business as usual,” in which units do not implement the policy of interest at study start and proceed without other limitations (e.g., perhaps implementing other policies), a specific comparison policy, or instructions to not implement the policy over the study period.
Policy trial emulation analogue:
  Exposure condition: Implementation of a policy that contains key provisions as identified by policy mapping, even if specific implementation details vary across implementing units. Under high policy heterogeneity, one could consider multiple versions of the exposure and emulate a multi-arm trial.
  Comparison condition: Commonly, lack of implementation of the policy of interest or of confounding policies at time zero (see Component 4).

3. Assignment Mechanism
Description: How is it determined whether a policy-level unit implements the policy?
Hypothetical target trial: Policy-level units would be randomly assigned to implement or not implement the policy affecting the impact-level units (i.e., cluster-level randomization). Randomization will almost certainly be unblinded: units will know whether they are supposed to implement the policy or not.
Policy trial emulation analogue: Policy adoption is non-random and potentially influenced by both known and unknown characteristics of the policy-level units.

4. Baseline / Time Zero and Follow-Up
Description: When does the post-policy period begin? Follow-up time should be sufficient to observe scientifically meaningful effects, such as when implementation takes time to ramp up and there may not be immediate effects.
Hypothetical target trial:
  Baseline / time zero: The time of randomization. All units may be randomized and implement the policy at the same time, or units may have staggered randomization times (and thus implement the policy at different times), as in a stepped-wedge or waitlist-control trial (49).
  Follow-up: The follow-up period will include sufficient post-policy measurements to capture ramp-up effects, etc.
Policy trial emulation analogue:
  Baseline / time zero: The time at which the policy of interest goes into effect such that it could have an impact on the outcomes. Each exposed policy-level unit (or group of simultaneously implementing units) may have its own unique time zero (see the comment on stepped-wedge trials above).
  Follow-up: Commonly, sufficient pre-policy measurements to model or balance pre-policy trends and avoid confounding by time, and sufficient post-policy measurements to capture ramp-up while also avoiding confounding laws.

5. Outcomes
Description: What are the outcomes of interest? How and when are they measured?
Hypothetical target trial: Outcome measures are designed and collected prospectively.
Policy trial emulation analogue: Outcome measures are designed and collected retrospectively.

6. Causal Estimand
Description: The population-level summary that describes the treatment effect or contrast of interest.
Hypothetical target trial: Intent-to-treat average treatment effect (ATE), i.e., the average effect of all units being assigned to the policy vs. none implementing the policy at baseline, regardless of future policy changes. Alternatively, a per-protocol effect (the effect of the policy as indicated in the protocol; for example, keeping the policy in place for at least 2 years).
Policy trial emulation analogue: In addition to (a) the ATE, other possible estimands are (b) the average treatment effect among the treated (ATT): “Among units that implemented the policy of interest, what was the effect of the policy on outcomes relative to what would have been observed had those units not implemented the policy?”; or (c) the average treatment effect among the controls (comparators): “Among units that did not implement the policy of interest, what would be the effect of implementing the policy relative to what actually happened under no implementation?” Investigators should articulate and justify how they handle treatment crossover and how those choices affect the estimand’s interpretation; this is related to the choice of intent-to-treat vs. per-protocol analyses.

7. Statistical Analysis and Assumptions
Description: How are the data analyzed to estimate the contrast of interest (estimand)? What (often untestable) assumptions are needed for the estimated effect to have a causal interpretation?
Hypothetical target trial: Primary analysis is typically an intent-to-treat analysis for longitudinal data (clustered, if the available data are on sub-units within the policy-implementing units). A per-protocol analysis is also possible; because such analyses do not use groups defined by randomization, care must be taken to make valid causal inference (see the emulation analogue below, the text, and Supplement section 2 for details). Assumptions:
  •  Stable unit treatment value assumption (SUTVA): Each policy-level unit’s exposure status does not affect other policy-level units’ outcomes. Untestable, but must be justified by study design, and can be strengthened through design choices, such as avoiding geographic proximity of treatment and comparison units.
  •  Consistency: For each policy-level unit, the observed outcome is equal to that unit’s potential outcome under the observed exposure status. Untestable, but commonly made.
  •  Positivity: All policy-level units in the analysis had a positive probability of being exposed or unexposed. Satisfied by design given the randomization.
Policy trial emulation analogue: Common analysis strategies use pre-policy information to model or balance pre-policy trends to reduce confounding by time and create a good proxy for the exposed units’ counterfactual outcomes under the comparison condition. Given nonrandomized assignment, analysis must account for potential confounding factors; for most methods, these are variables that evolve differently over time in the exposed and unexposed groups or that have time-varying effects on the outcome. Additional consideration must be given to potential crossover between conditions. See Supplement section 2 for an annotated bibliography of methods. Assumptions:
  •  All assumptions listed for the hypothetical target trial above, plus additional assumptions made by the specific analysis strategy chosen; discuss the plausibility of each assumption in the current study. Examples include the counterfactual parallel trends assumption and the absence of factors that evolve differently over time in the exposed and unexposed groups or that have time-varying effects on the outcome.

Note: Additional details appear in the text.

Component 1: Units and Eligibility Criteria

Policy evaluations should consider both units that could implement the policy (e.g., states, organizations; “policy-level”), and (sub-)units that would be affected by the policy (e.g., individuals; “impact-level”). Typically, the policy-level units and impact-level sub-units are different (e.g., students in schools), but they may be the same (e.g., in studies investigating state laws’ impact on a state’s public health budget). In this paper, we proceed assuming policy- and impact-level units are different, but the same principles apply otherwise. The policy-level units are of fundamental interest and would be randomized in an RCT (see Component 3), but data are sometimes collected at the (often more granular) impact level.

Investigators must consider eligibility criteria for both types of units in both the target trial and emulation. In the target trial, policy-level units would be eligible for randomization if they had not previously implemented the policy of interest and, possibly, if they had certain other pre-randomization characteristics. In the ideal non-experimental analogue, eligibility criteria are similar: we would include all policy-implementing units, plus all units that could have implemented the policy when the implementing unit(s) did but did not do so. The comparison group should ideally consist of the units that have the comparison condition (see Component 2) at the time of an exposed unit’s implementation of the policy (“not-yet-exposed comparators”). Defining exposed and comparison groups using only factors known at baseline avoids (possibly substantial) bias created by conditioning on post-exposure information. However, this means that some of the comparison units may, over time, “cross over” by implementing the policy of interest; see Component 6.

We next consider impact unit-level eligibility criteria. This is the level on which, in a target trial, data would be collected. Impact-level eligibility criteria define the population on which we wish to measure the effects of the policy, often a particular group of individuals, and could be very simple (e.g., “lives in policy-level unit”) or more complex (e.g., “lives in policy-level unit and was diagnosed with X condition in the last 2 years”).

If it is unlikely that policy implementation (or lack thereof) is related to impact-level units’ presence or absence in the data over time, we may restrict the sample to units continuously present in the data over the whole study period to avoid changes in sample composition over time. This choice limits external validity but improves internal validity: policy effect estimates will not be biased by a changing sample. If policy implementation is possibly related to presence or absence in the data, care must be taken to define the sample and outcomes in a way that can account for changes in population (1416).
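
As a minimal sketch of this restriction, assuming long-format impact-level data with one row per person and month (the data frame and the column names 'person_id' and 'month' are illustrative placeholders, not from any particular study), the balanced-panel restriction might look like:

```python
import pandas as pd

def restrict_to_continuously_present(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only impact-level units (people) observed in every month of the study period."""
    n_months = df["month"].nunique()
    months_per_person = df.groupby("person_id")["month"].nunique()
    always_present = months_per_person[months_per_person == n_months].index
    return df[df["person_id"].isin(always_present)]
```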

In a non-experimental setting, impact-level data may or may not be available or may be of varying quality. If data are available only in aggregate (i.e., there are no impact-level data), like state-month counts of opioid prescriptions, it may not be possible to restrict the sample to the target impact-level population that the policy was designed to affect, weakening the trial emulation. For example, if a policy applies only to minors but the only available data cannot be disaggregated by age, this will yield a weaker trial emulation than one that uses data only from individuals under 18. If impact-level data (e.g., patient-level data including diagnoses or prescriptions) are available (either longitudinally or in repeated cross-sections), we can apply impact-level eligibility criteria and examine only the target population, strengthening the emulation.

Component 2: Definitions of Exposure and Comparison Conditions

Policy evaluations usually require natural experiments. Researchers typically do not decide what the policy of interest does or who it affects; therefore, understanding what policies exist and can be studied is a crucial step in asking a precise scientific question. In an RCT, researchers must clearly define what each of the randomized arms receives – the exposure and comparison “conditions” – and deliver treatment consistently to ensure there is one definition of treatment per arm of the trial. In a hypothetical policy RCT, we would implement the same policy in the same way in every unit randomized to treatment; similarly for the comparison arm if the comparison condition is a specific alternative policy. In some contexts, the comparison condition may be “business as usual,” in which case the actual experiences of each comparator may differ.

In a non-experimental context, each exposed (treated) unit that implements the policy of interest may do so idiosyncratically, with specific details varying between policies. For example, most U.S. states now have medical cannabis laws but details (e.g., the set of conditions that qualify for medical cannabis use) vary across states and over time (17). Policy mapping or “legal epidemiology” techniques, which provide a systematic approach to understanding policies’ timing and their granular details, can aid researchers in understanding different versions and core components of the policy of interest, and which of those versions are comparable (18). This can be relevant for defining both exposure and comparison conditions. Under high heterogeneity, one could emulate a multi-arm trial with multiple versions of the exposure.

A carefully defined exposure will generally refer to a group (or a small number of groups) of qualitatively similar policies and determine policy-level units – both exposed and unexposed – included in the study. The exposure could also be a bundle of policies implemented (nearly) simultaneously. In such cases, it is typically impossible to estimate effects for each policy in the bundle, unless each policy affects distinct, unrelated outcomes (19,20).

Policy mapping helps identify confounding policies implemented by either exposed or comparison units that may offer an alternative explanation for changes in outcomes post-implementation. Strong policy trial emulations will precisely define the exposure and comparison conditions to isolate the effect of the policy of interest and disentangle it as much as possible from confounding policies, using sensitivity analyses as appropriate. Researchers should carefully search for contemporaneous policies that could affect the outcomes under study; failure to consider such policies and/or how the policy landscape may change over time threatens effect estimates’ validity (19,20).

Component 3: Assignment Mechanism

In an RCT, assignment of policy-level units to implement or not implement the policy would be done randomly, possibly stratified on covariates and sometimes blinded (21). Randomization would be at the policy level, clustered by policy-level units. In reality, randomization is generally infeasible, and blinding is impossible: units must know whether they are implementing the policy (9). When the policy and impact levels differ, we emulate cluster randomization: treatment is at the policy level, but data are (possibly aggregated) from impact-level sub-units.

Treatment assignment in RCTs is unconfounded. In reality, policy adoption is non-random, so there are likely known and unknown characteristics of policy-level units related to both policy implementation and outcomes that could confound effect estimation. This can threaten the validity of an estimated causal effect. Researchers should consider factors that may affect policy adoption and outcomes differently over time. These decisions often involve tradeoffs. For example, selecting near-neighbor comparators, e.g., bordering states, may yield more similar exposed and comparison units. Alternatively, selecting comparators geographically distant from exposed units can alleviate concerns about policy spillover (22). Bias due to confounding can be mitigated by carefully considering each component of a policy trial emulation and using appropriate analytic methods (Component 7).

Component 4: Baseline / Time Zero and Follow-Up

It is important to define a baseline time at which the policy is considered active. In the target trial, pre-randomization recruitment and preparatory phases would allow for the policy to become effective in the units assigned to implement it immediately after randomization: baseline (and thus which measurements are pre- and post-exposure) is defined by the randomization. Identifying baseline is complicated in a non-experimental context in which the policy of interest is often implemented in different policy-level units at different times (i.e., “staggered adoption”). In this setting, there may be several baselines at which the policy was implemented in one or more exposed units.

Defining baseline for policy-implementing units requires understanding when the policy “starts” in those units. Misspecifying baseline might introduce bias due to anticipation effects (if specified too late) or could attenuate the effect estimate (if specified too early). Baseline should refer to when the policy could start impacting outcomes. Defining baseline for the comparison units – when they could have implemented the policy but did not – is challenging. One solution is “stacking”, or serial trial emulation (11,24). For each unique baseline/implementation time among policy-implementing units, define a cohort of exposed and unexposed units eligible for inclusion at that time, and set that implementation date as time zero. Then, align (“stack”) the cohorts centered at these time zeros. This creates multiple trial emulations, one per implementation date (Figure 1, panels 1–3). Stacking has been used in policy evaluation (2527) and trial emulation for individual-level interventions (28,29) (see Supplement Section 2.1).
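
The sketch below illustrates one way stacking could be implemented, assuming a hypothetical policy-level panel with columns 'unit', 'time', 'policy_time' (the period in which a unit's policy took effect, missing if it never adopted during the study), and an outcome; the column names, window lengths, and handling of units that adopt during follow-up are illustrative choices, not prescriptions from the article:

```python
import pandas as pd

def build_stacked_cohorts(panel: pd.DataFrame, pre: int = 4, post: int = 3) -> pd.DataFrame:
    """Create one cohort per unique implementation time, aligned in event time.

    Exposed units share the cohort's implementation time; comparators are units
    that have not adopted the policy by the end of the cohort's follow-up window
    (units adopting during follow-up are simply excluded here; see Component 6
    for other ways to handle crossover).
    """
    cohorts = []
    for t0 in sorted(panel["policy_time"].dropna().unique()):
        exposed = panel.loc[panel["policy_time"] == t0, "unit"].unique()
        comparators = panel.loc[
            panel["policy_time"].isna() | (panel["policy_time"] > t0 + post), "unit"
        ].unique()
        window = panel[
            panel["unit"].isin(list(exposed) + list(comparators))
            & panel["time"].between(t0 - pre, t0 + post)
        ].copy()
        window["event_time"] = window["time"] - t0   # align cohorts at time zero
        window["exposed"] = window["unit"].isin(exposed).astype(int)
        window["cohort"] = t0                        # which stacked "trial" this row belongs to
        cohorts.append(window)
    return pd.concat(cohorts, ignore_index=True)
```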

Figure 1.

A schematic depiction of “stacking” or serial target trial emulation using synthetic data modeled after McGinty et al. (6,11,12,34). First, each policy-implementing unit’s policy implementation date is identified. Next, study periods for each policy-implementing unit are created (here, 4 years pre-implementation and 3 years post-implementation). Each unit’s implementation date is treated as its baseline or time zero, and units are “aligned” in relative time. Note that if multiple policy-implementing units implement at the same time, those units would be included together as one stacked cohort. Finally, unit-specific treatment effects are estimated (ATTs, usually), and then, if scientifically appropriate, averaged across all policy-implementing units. See Supplement section 2.1 Figure A for more details.

In an RCT, randomization enables causal effect estimation by ensuring that groups are balanced on all pre-intervention measures, however far in the past, by design. Therefore, RCTs commonly include only one (or a few) pre-randomization measurement occasions. In a non-experimental context, data from before and after policy implementation are almost always needed to estimate a causal effect well (see Component 7). Strong policy evaluations are generally longitudinal studies: having more pre-policy measurements allows pre-policy trends to be modeled or matched, generally weakening necessary causal assumptions (23). The post-exposure follow-up period should be long enough to capture meaningful effects (and possibly changes in those effects, like post-implementation ramp-up), as in an RCT. The appropriate follow-up duration is highly context specific, depending on how long a policy might take to be fully implemented and to realize its intended effects on outcomes.

Component 5: Outcomes

Both RCTs and non-experimental studies require clearly defined outcomes. RCTs are prospective studies; non-experimental policy evaluations are retrospective by nature. Outcome definitions and levels of measurement in the latter are therefore limited to measures consistently collected for both policy-implementing and comparison units over time. If data are available only at the policy level, outcomes will be proportions, means, etc. for each policy-level unit under study. In other cases, when data are available from sub-units within the policy-level units, outcome measures collected on impact-level units may be aggregated to the policy level for analysis, or a mixed model could be fit with the exposure variable at the policy level.
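
For instance, when impact-level records are available, aggregation to the policy level might look like the following sketch (the data frame and the columns 'state', 'month', and the binary outcome 'received_treatment' are hypothetical placeholders):

```python
import pandas as pd

def aggregate_to_policy_level(records: pd.DataFrame) -> pd.DataFrame:
    """Collapse person-month records to state-month outcome rates.

    Returns one row per state and month with the proportion of people who had
    the outcome ('rate') and the number of people contributing ('n').
    """
    return (
        records.groupby(["state", "month"])["received_treatment"]
        .agg(rate="mean", n="size")
        .reset_index()
    )
```

Alternatively, as noted above, the impact-level rows could be retained and analyzed with a mixed model that includes the exposure variable at the policy level.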

Component 6: Causal Estimand

Scientific questions can be translated into an estimand, a population-level quantity that statistically describes the treatment effect of interest. Here, the estimand is a causal quantity that describes the average difference between counterfactual outcomes in policy-level units under the exposure and comparison conditions to answer questions about what would have happened under different states of the world (e.g., with and without the policy exposure of interest) (30–32). Researchers may be interested in a difference in policy-level average outcomes at a specific time post-exposure or averaged over a particular set of post-exposure times (a difference in “levels”), or in a difference in rates of change over time (“trajectories”).

We discuss 3 categories of estimands. The average treatment effect (ATE) compares the expected counterfactual outcomes under treatment to those under the comparison condition on average over the entire population, addressing a question of the form, “on average, what is the effect of all units implementing the policy, compared to none implementing it?” The average treatment effect among the treated (ATT) corresponds to the question, “on average, what was the effect of the policy in the units that implemented it, compared to if they had not implemented the policy when they did?” Finally, the average treatment effect among comparators (ATC) corresponds to “on average, what would have been the effect of the policy in the units that did not implement it, if they instead had implemented it?” These are all equal on average in a randomized trial.
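
In potential-outcomes notation (a minimal sketch, with Y(1) and Y(0) denoting a policy-level unit's counterfactual outcomes with and without the policy, and A an indicator of observed policy implementation), these three estimands can be written as:

```latex
\[
\begin{aligned}
\text{ATE} &= \mathbb{E}[\,Y(1) - Y(0)\,], \\
\text{ATT} &= \mathbb{E}[\,Y(1) - Y(0) \mid A = 1\,], \\
\text{ATC} &= \mathbb{E}[\,Y(1) - Y(0) \mid A = 0\,].
\end{aligned}
\]
```

Under randomization, A is independent of the counterfactual outcomes, so the three quantities coincide in expectation, which is why they are all equal on average in a randomized trial.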

The choice of estimand depends on the scientific question and what investigators believe is feasible in their context. Policy evaluations conventionally target the ATT, likely because it asks a focused question whose answer can often be estimated more plausibly. Estimating the ATE or ATC requires imputing what would have happened to comparison units under treatment, which is often a larger conceptual jump.

The estimand’s interpretation partially depends on how investigators consider “crossovers” – comparison units that later implement the policy of interest – as this changes the comparison of interest (similar to intent-to-treat vs. per-protocol approaches (33)). When interest lies in a per-protocol effect, e.g., among units that implement the policy for some specified length of time, methods like censoring and weighting (34) or instrumental variable approaches (35) are common in some scientific areas, including epidemiology (see Supplement Sections 2.2–2.3). In estimating per-protocol effects, these approaches should consider and account for potential relevant time-varying confounders, i.e., factors that affect the outcome and are also associated with whether the unit (e.g., a state) implements the policy over time. However, censoring weights may be difficult to implement well with only a small number of units (e.g., < 50 states).

Crossover in policy evaluations has historically been dealt with by limiting the comparison group to “never-exposed” units that do not implement the policy during the study period. This removes comparators that cross over to exposure and ensures a consistent comparison group over time. Because this approach selects units based on post-exposure factors and does not account for time-varying confounders, however, it can introduce bias: non-adopters of the policy may differ from ever-adopters in unobserved ways at baseline and over time. Studies using never-exposed comparators should acknowledge the possibility of such bias and justify the deviation from target trial emulation principles, for example by articulating why bias is unlikely given the scientific context. Other alternatives are to redesign the study, by, e.g., defining eligibility criteria to exclude likely bad comparators or limiting to a shorter follow-up period with minimal crossover.

Component 7: Statistical Analysis and Assumptions

Methods to estimate an ATT typically use pre-baseline information to create a proxy for the exposed group’s unobserved counterfactual outcomes under the comparison condition, then compare the observed outcomes to that proxy. Because assignment in an RCT is unconfounded, the hypothetical target trial can use “standard” methods – typically regression-based – for analysis (21,36); see Supplement section 1. Analysis can proceed at the cluster level or can reflect the multilevel design and account for the clustering in variance estimation (21). In a non-experimental context, analytic techniques can be used with high-quality trial emulation and careful assumptions to mitigate confounding and improve confidence in the causal interpretation of estimated effects. There are many methods used for policy evaluation (see Supplement section 2); here, we focus on difference-in-differences (DiD), one common approach (Supplement section 2.4).

In its simplest form, DiD estimates an ATT as the change in outcome from pre- to post-policy in the exposed group, minus the change in the outcome from pre- to post-policy in the comparators, sometimes using a regression framework (37). The key assumption of most such approaches is “counterfactual parallel trends”, which requires that, in the absence of the policy, the outcome trajectory in policy-implementing units would have looked like the outcome trajectory of the comparators, on average. This assumption is weaker than “ignorability”, which states that, conditional on observed confounders, exposure status is independent of counterfactual outcomes (23). Because of the reliance on longitudinal data and trends, confounders in DiD are only those variables that evolve differently over time in the exposed and unexposed groups or that have time-varying effects on the outcome; therefore, fewer types of variables might confound DiD than approaches assuming ignorability (23,38). For example, a variable that is constant over time is not a “confounder” in DiD, even if its level differs between exposed and unexposed units, unless it has a time-varying relationship with the outcome. The parallel trends assumption is about counterfactuals and is thus untestable and potentially challenging to assess (39). However, carefully-applied matching and weighting approaches can help create similarity in the pre-policy period, making the counterfactual parallel trends assumption more plausible (14,40,41). Under staggered policy adoption, “stacked” DiD uses serial per-implementation-date trial emulations to estimate unit-specific effects, then averages them if scientifically appropriate (see Figure 1, panels 4–5) (11,24). See Supplement section 2.4 for more details.
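
As a concrete (and deliberately simplified) sketch of the 2-group, 2-period DiD regression described above, assuming a hypothetical policy-level panel with columns 'y', 'exposed', 'post', and 'state' (names are illustrative), and using statsmodels rather than any particular package from the cited literature:

```python
import statsmodels.formula.api as smf

def simple_did(df):
    """2x2 difference-in-differences via OLS.

    The coefficient on exposed:post is the DiD estimate of the ATT under the
    counterfactual parallel trends assumption. Standard errors are clustered
    at the policy level; with very few policy-level units (e.g., states),
    cluster-robust inference can be unreliable and alternatives may be needed.
    """
    fit = smf.ols("y ~ exposed * post", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["state"]}
    )
    return fit.params["exposed:post"], fit.bse["exposed:post"]
```

A stacked DiD analysis would apply a regression of this general form within each implementation-date cohort (as in Figure 1) and then average the cohort-specific estimates when scientifically appropriate.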

In general, researchers and readers should consider whether the analytic approach used in a policy trial emulation uses assumptions that are justified by theory and the data (37,4246). Well-reported trial emulations will discuss the plausibility of generally-untestable assumptions (such as counterfactual parallel trends) made by the analytic approach. Transparently reporting these assumptions helps readers calibrate their trust in the results (47).

DISCUSSION

Causal inference for policy evaluation is difficult because policy implementation is complex. Well-designed trial emulation can mitigate threats to causal inference. Head-to-head comparisons of a target trial and non-experimental policy evaluation allow for identification of such threats, including explicit discussion of the causal assumptions involved. The closer a trial emulation aligns with its hypothetical target trial on the 7 dimensions discussed, the greater our ability to make causal inference, so long as the underlying assumptions (Component 7) are valid. Sometimes, the scientific question and available data may be sufficiently misaligned (e.g., too few comparators) that investigators cannot estimate the causal effects of interest without severe bias or untenable assumptions; in such cases, investigators should reconsider the research question, the dataset, or the aims of the study, scaling them back to something more achievable.

Discussion of findings from a strong policy trial emulation could be about “estimated effects”, whereas weaker policy trial emulations should be discussed using associational language (47). The term “estimated” acknowledges statistical and causal uncertainty. For each estimate, researchers should carefully convey the numeric value of the estimate, statistical uncertainty around that estimate, and a description of the estimand in scientific context. Emphasizing confidence intervals around each estimate quantifies the range of plausible values of the true effect compatible with the data (48).

Trial emulation can ease assessment of the validity of causal inference in non-experimental policy evaluations and improve their interpretability. For clinical audiences familiar with RCTs, translating non-experimental studies into the language of a trial improves interpretability of and confidence in results. Providing a head-to-head comparison of a policy evaluation with a target RCT gives readers important context about each component of the trial emulation, helping researchers and readers judge a trial emulation’s quality, identify threats to causal inference, and appropriately calibrate confidence in the results. Future work that develops a consensus framework for transparent reporting guidelines in policy trial emulation, similar to the TARGET guidelines under development for epidemiologic studies, may be useful (13). Additionally, visualization techniques and diagnostics to describe and evaluate target trial emulations for policy evaluation are needed.

CONCLUSION

Health policies help shape clinical and population health outcomes; rigorous evidence on their effectiveness is critical. The trial emulation framework, in which researchers design a hypothetical, ideal randomized policy trial, then mimic it as much as possible with non-experimental data, is a promising approach to strengthening conclusions about policy effects.

Supplementary Material

Supplement

Acknowledgments

This work was funded by NIDA grant R01DA049789.

Footnotes

Disclosures: No disclosures to report.

This is the prepublication, author-produced version of a manuscript accepted for publication in Annals of Internal Medicine. This version does not include postacceptance editing and formatting. The American College of Physicians, the publisher of Annals of Internal Medicine, is not responsible for the content or presentation of the author-produced, accepted version of the manuscript or any version that a third party derives from it. Readers who wish to access the definitive published version of this manuscript and any ancillary material related to this manuscript (e.g., correspondence, corrections, editorials, linked articles) should go to Annals.org or to the issue in which the article appears. Those who cite this manuscript should cite the published version, as it is the official version of record.

REFERENCES

1. Kingdon JW. Agendas, alternatives, and public policies. 2nd ed. New York: Longman; 2003. 253 p. (Longman classics in political science).
2. Rawat P, Morris JC. Kingdon’s “Streams” Model at Thirty: Still Relevant in the 21st Century? Polit Policy. 2016;44(4):608–38.
3. Dahabreh IJ, Matthews A, Steingrimsson JA, Scharfstein DO, Stuart EA. Using Trial and Observational Data to Assess Effectiveness: Trial Emulation, Transportability, Benchmarking, and Joint Analysis. Epidemiol Rev. 2023 Feb 8;mxac011.
4. Hansford HJ, Cashin AG, Jones MD, Swanson SA, Islam N, Douglas SRG, et al. Reporting of Observational Studies Explicitly Aiming to Emulate Randomized Trials: A Systematic Review. JAMA Netw Open. 2023 Sep 27;6(9):e2336023.
5. Wan EYF, Yan VKC, Mok AHY, Wang B, Xu W, Cheng FWT, et al. Effectiveness of Molnupiravir and Nirmatrelvir–Ritonavir in Hospitalized Patients With COVID-19. Ann Intern Med. 2023 Apr 18;176(4):505–14.
6. Hulme WJ, Williamson E, Horne EMF, Green A, McDonald HI, Walker AJ, et al. Challenges in Estimating the Effectiveness of COVID-19 Vaccination Using Observational Data. Ann Intern Med. 2023 May 16;176(5):685–93.
7. Lazzati A, Epaud S, Ortala M, Katsahian S, Lanoy E. Effect of bariatric surgery on cancer risk: results from an emulated target trial using population-based data. Br J Surg. 2022 May 1;109(5):433–8.
8. Fu EL. Target Trial Emulation to Improve Causal Inference from Observational Data: What, Why, and How? J Am Soc Nephrol. 2023. doi: 10.1681/ASN.0000000000000152.
9. Hernán MA, Wang W, Leaf DE. Target Trial Emulation: A Framework for Causal Inference From Observational Data. JAMA. 2022 Dec 27;328(24):2446–7.
10. Matthews AA, Danaei G, Islam N, Kurth T. Target trial emulation: applying principles of randomised trials to observational studies. BMJ. 2022 Aug 30;378:e071108.
11. Ben-Michael E, Feller A, Stuart EA. A Trial Emulation Approach for Policy Evaluations with Group-level Longitudinal Data. Epidemiology. 2021 Jul;32(4):533–40.
12. McGinty EE, Tormohlen KN, Seewald NJ, Bicket MC, McCourt AD, Rutkow L, et al. Effects of U.S. State Medical Cannabis Laws on Treatment of Chronic Noncancer Pain. Ann Intern Med. 2023 Jul;176(7):904–12.
13. Hansford HJ, Cashin AG, Jones MD, Swanson SA, Islam N, Dahabreh IJ, et al. Development of the TrAnsparent ReportinG of observational studies Emulating a Target trial (TARGET) guideline. BMJ Open. 2023 Sep 1;13(9):e074626.
14. Stuart EA, Huskamp HA, Duckworth K, Simmons J, Song Z, Chernew ME, et al. Using propensity scores in difference-in-differences models to estimate the effects of a policy change. Health Serv Outcomes Res Methodol. 2014 Dec 1;14(4):166–82.
15. Feller A, Connors MC, Weiland C, Easton JQ, Ehrlich Loewe S, Francis J, et al. Addressing Missing Data Due to COVID-19: Two Early Childhood Case Studies. J Res Educ Eff. Forthcoming.
16. Miller D, Spybrook J, Caverly S. Missing Data in Group Design Studies: Revisions in WWC Standards Version 4.0. 2019.
17. Incze MA, Kelley AT, Singer PM. Heterogeneous State Cannabis Policies: Potential Implications for Patients and Health Care Professionals. JAMA. 2021 Dec 21;326(23):2363–4.
18. Ramanathan T, Hulkower R, Holbrook J, Penn M. Legal Epidemiology: The Science of Law. J Law Med Ethics. 2017 Mar 1;45(1_suppl):69–72.
19. Griffin BA, Schuler MS, Pane J, Patrick SW, Smart R, Stein BD, et al. Methodological considerations for estimating policy effects in the context of co-occurring policies. Health Serv Outcomes Res Methodol. 2023 Jun 1;23(2):149–65.
20. Matthay EC, Hagan E, Joshi S, Tan ML, Vlahov D, Adler N, et al. The Revolution Will Be Hard to Evaluate: How Co-Occurring Policy Changes Affect Research on the Health Effects of Social Policies. Epidemiol Rev. 2021 Dec 30;43(1):19–32.
21. Hayes RJ, Moulton LH. Cluster randomised trials. 2nd ed. Boca Raton: CRC Press; 2017. (Chapman & Hall/CRC biostatistics series).
22. Verbitsky-Savitz N, Raudenbush SW. Causal Inference Under Interference in Spatial Settings: A Case Study Evaluating Community Policing Program in Chicago. Epidemiol Methods. 2012 Aug 29;1(1):107–30.
23. Rothbard S, Etheridge JC, Murray EJ. A Tutorial on Applying the Difference-in-Differences Method to Health Data. Curr Epidemiol Rep. 2024 Jun 1;11(2):85–95.
24. Baker AC, Larcker DF, Wang CCY. How much should we trust staggered difference-in-differences estimates? J Financ Econ. 2022 May 1;144(2):370–95.
25. Cengiz D, Dube A, Lindner A, Zipperer B. The Effect of Minimum Wages on Low-Wage Jobs. Q J Econ. 2019 Aug 1;134(3):1405–54.
26. Deshpande M, Li Y. Who Is Screened Out? Application Costs and the Targeting of Disability Programs. Am Econ J Econ Policy. 2019;11(4):213–48.
27. Wing C, Yozwiak M, Hollingsworth A, Freedman S, Simon K. Designing Difference-in-Difference Studies with Staggered Treatment Adoption: Key Concepts and Practical Guidelines. Annu Rev Public Health. 2024 May 20;45:485–505.
28. Hernán MA, Robins JM, García Rodríguez LA. Discussion on “Statistical Issues Arising in the Women’s Health Initiative.” Biometrics. 2005 Dec 1;61(4):922–30.
29. Schaubel DE, Wolfe RA, Port FK. A Sequential Stratification Method for Estimating the Effect of a Time-Dependent Experimental Treatment in Observational Studies. Biometrics. 2006 Sep 1;62(3):910–7.
30. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol. 1974 Oct;66(5):688–701.
31. Musci RJ, Stuart E. Ensuring Causal, Not Casual, Inference. Prev Sci. 2019 Apr 1;20(3):452–6.
32. Hernán MA. The C-Word: Scientific Euphemisms Do Not Improve Causal Inference From Observational Data. Am J Public Health. 2018 May 1;108(5):616–9.
33. Hernán MA, Sauer BC, Hernández-Díaz S, Platt R, Shrier I. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. J Clin Epidemiol. 2016 Nov 1;79:70–5.
34. Hernán MA, Robins JM. Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available. Am J Epidemiol. 2016 Apr 15;183(8):758–64.
35. Rassen JA, Brookhart MA, Glynn RJ, Mittleman MA, Schneeweiss S. Instrumental variables I: instrumental variables exploit natural variation in nonexperimental data to estimate causal relationships. J Clin Epidemiol. 2009 Dec 1;62(12):1226–32.
36. Campbell MJ, Walters SJ. How to design, analyse and report cluster randomised trials in medicine and health related research. Chichester, West Sussex: John Wiley & Sons; 2014. (Statistics in practice).
37. Roth J, Sant’Anna PHC, Bilinski A, Poe J. What’s trending in difference-in-differences? A synthesis of the recent econometrics literature. J Econom. 2023 Aug 1;235(2):2218–44.
38. Zeldow B, Hatfield LA. Confounding and regression adjustment in difference-in-differences studies. Health Serv Res. 2021;56(5):932–41.
39. Roth J, Sant’Anna PHC. When Is Parallel Trends Sensitive to Functional Form? Econometrica. 2023;91(2):737–47.
40. Daw JR, Hatfield LA. Matching and Regression to the Mean in Difference-in-Differences Analysis. Health Serv Res. 2018 Dec;53(6):4138–56.
41. Daw JR, Hatfield LA. Matching in Difference-in-Differences: between a Rock and a Hard Place. Health Serv Res. 2018;53(6):4111–7.
42. Abadie A, Diamond A, Hainmueller J. Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program. J Am Stat Assoc. 2010 Jun;105(490):493–505.
43. Callaway B, Sant’Anna PHC. Difference-in-Differences with multiple time periods. J Econom. 2021;225(2):200–30.
44. Ben-Michael E, Feller A, Rothstein J. The Augmented Synthetic Control Method. J Am Stat Assoc. 2021 Oct 2;116(536):1789–803.
45. Bandara SN, Kennedy-Hendricks A, Stuart EA, Barry CL, Abrams MT, Daumit GL, et al. The effects of the Maryland Medicaid Health Home Waiver on Emergency Department and inpatient utilization among individuals with serious mental illness. Gen Hosp Psychiatry. 2020 May 1;64:99–104.
46. Griffin BA, Schuler MS, Stuart EA, Patrick S, McNeer E, Smart R, et al. Moving beyond the classic difference-in-differences model: a simulation study comparing statistical methods for estimating effectiveness of state-level policies. BMC Med Res Methodol. 2021 Dec;21(1):1–19.
47. Dahabreh IJ, Bibbins-Domingo K. Causal Inference About the Effects of Interventions From Observational Studies in Medical Journals. JAMA. 2024 Jun 4;331(21):1845–53.
48. Guallar E, Goodman SN, Localio AR, Stephens-Shields AJ, Laine C. Seeing the Positive in Negative Studies. Ann Intern Med. 2023 Apr 18;176(4):561–2.
49. Hemming K, Taljaard M. Reflection on modern methods: when is a stepped-wedge cluster randomized trial a good study design choice? Int J Epidemiol. 2020 Jun 1;49(3):1043–52.
