Author manuscript; available in PMC: 2025 Nov 21.
Published in final edited form as: Ann Appl Stat. 2025 Aug 28;19(3):1826–1846. doi: 10.1214/25-aoas2011

AVERAGED PREDICTION MODELS (APM): IDENTIFYING CAUSAL EFFECTS IN CONTROLLED PRE-POST SETTINGS WITH APPLICATION TO GUN POLICY

Thomas Leavitt 1,a, Laura A Hatfield 2,b
PMCID: PMC12633725  NIHMSID: NIHMS2111843  PMID: 41283000

Abstract

To investigate causal impacts, many researchers use controlled pre-post designs that compare over-time differences between a population exposed to a policy change and an unexposed comparison group. However, researchers using these designs often disagree about the “correct” specification of the causal model, perhaps most notably in analyses to identify the effects of gun policies on crime. To help settle these model specification debates, we propose a general identification framework that unifies a variety of models researchers use in practice. In this framework, which nests “brand name” designs like Difference-in-Differences as special cases, we use models to predict untreated outcomes and then correct the treated group’s predictions using the comparison group’s observed prediction errors. Our point identifying assumption is that treated and comparison groups would have equal prediction errors (in expectation) under no treatment. To choose among candidate models, we propose a data-driven procedure based on models’ robustness to violations of this point identifying assumption. Our selection procedure averages over candidate models, weighting by each model’s posterior probability of being the most robust given its differential average prediction errors in the pre-period. This approach offers a way out of debates over the “correct” model by choosing on robustness instead and has the desirable property of being feasible in the “locked box” of pre-intervention data only. We apply our methodology to the gun policy debate, focusing specifically on Missouri’s 2007 repeal of its permit-to-purchase law, and provide an R package (apm) for implementation.

Keywords: difference-in-differences, predictive models, robustness, model selection, design sensitivity, Bayesian inference

1. Introduction.

The causal effects of public policies are the subject of intense debate. For instance, the question of whether guns make society more or less safe permeates the gun control debate, with opposite sides claiming that policies expanding firearms access either reduce or increase crime. To bring evidence to bear on this debate, we can contrast the change in a relevant outcome (like gun homicides) after a policy intervention to the contemporaneous change in outcomes among an unexposed comparison population. These “controlled pre-post” designs identify policy effects if their assumptions hold. For example, difference-in-differences (DID) depends on parallel trends, sequential DID depends on parallel trends-in-trends and comparative interrupted time series (CITS) depends on assumptions about group differences in slopes and intercepts of a linear model.

To choose among controlled pre-post designs, conventional wisdom holds that we should choose the one that relies on the most plausible assumptions (Roth and Sant’Anna, 2023; Lopez Bernal, Soumerai and Gasparrini, 2018; Ryan, Burgess and Dimick, 2015; Kahn-Lang and Lang, 2020). Because reasonable people may disagree about plausibility, and because it is impossible to prove any causal assumption, researchers tend to use methods that are popular in their disciplines. For instance, CITS is popular in education policy, while DID is popular in health policy (Fry and Hatfield, 2021).

Disagreement over causal assumptions and attendant modeling choices is not merely academic; it can impede progress on the policy front. On the firearm policy question, a 2004 report by the National Research Council concluded that “it is not possible to reach any scientifically supported conclusion because of the sensitivity of the empirical results to seemingly minor changes in model specification” (National Research Council of the National Academies, 2005, p. 151). More recent syntheses have reached similar conclusions (Morral et al., 2018; Smart et al., 2020).

In this paper, we consider the impact of a particular firearm policy change: Missouri’s 2007 repeal of its permit-to-purchase law. See Fig. 1 below.

Fig 1. Average gun homicides (rate per 100,000) before and after the 2007 permit-to-purchase repeal in Missouri (treated state) and eight neighboring states without such a change (Arkansas, Illinois, Iowa, Kansas, Kentucky, Nebraska, Oklahoma, and Tennessee).

Previous authors have analyzed this policy change using models with different causal assumptions: Webster, Crifasi and Vernick (2014) fit a Poisson regression model with unit and time fixed effects, while Hasegawa, Webster and Small (2019) used a non-parametric DID estimator. How, in general, can we reconcile competing models, assumptions, and possible conclusions?

We provide a general identification framework that unifies controlled pre-post designs by characterizing them as a combination of a prediction step and a correction step. First, for an arbitrary model in a class of candidate models, we use observations in the pre-intervention period to train a model that predicts untreated outcomes in the post-intervention period. Second, we use the comparison group’s prediction errors in the post-intervention period to correct the treated group’s predictions; this step accounts for time-varying shocks that affect both groups. Our point identifying assumption is that prediction errors would be equal (in expectation) in treated and comparison groups, absent the policy change. We show that we can reproduce many familiar “brand name” designs by careful choice of prediction model. For instance, we get DID when we use the pre-intervention group mean as a (simple) prediction model.
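The prediction and correction steps above can be sketched numerically. The data and the linear-trend prediction model below are hypothetical, and the snippet is a minimal illustration under those assumptions, not the apm package's implementation.

```python
import numpy as np

# Hypothetical group-level outcomes: pre-periods t = 1..4 and post-period t = 5.
treated_pre = np.array([10.0, 11.0, 12.0, 13.0])
treated_post = 16.5
comparison_pre = np.array([8.0, 9.0, 10.0, 11.0])
comparison_post = 12.5

def linear_trend_forecast(y_pre):
    """Prediction step: fit a line to the pre-period, extrapolate one period."""
    t = np.arange(1, len(y_pre) + 1)
    slope, intercept = np.polyfit(t, y_pre, 1)
    return slope * (len(y_pre) + 1) + intercept

# Prediction step: model-based forecasts of untreated post-period outcomes.
f_treated = linear_trend_forecast(treated_pre)        # ≈ 14.0
f_comparison = linear_trend_forecast(comparison_pre)  # ≈ 12.0

# Correction step: subtract the comparison group's realized prediction error,
# which absorbs any unexpected shock common to both groups.
att_hat = (treated_post - f_treated) - (comparison_post - f_comparison)  # ≈ 2.0
```

Here the comparison group's error of 0.5 is interpreted as a common shock the model missed, and it is removed from the treated group's raw prediction error.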

To choose among a set of candidate models — unified under a single identification framework — we move away from the question of model “correctness” and focus on another criterion: robustness. To select the most robust model given differential average prediction errors in the pre-period, we use Bayesian model averaging (BMA) in which we weight each model by its posterior probability of being the most robust, hence the name Averaged Prediction Models (or APM). Using the corrected predictions from our averaged model, we estimate our causal target quantity, the average effect of treatment on the treated (the ATT). Since this model selection procedure based on robustness uses only pre-period observations, it has the benefit of inhibiting “fishing,” whereby researchers select the model that yields the most desirable conclusion about the ATT in the post-period.

Our conception of robustness builds on Manski and Pepper (2018) and Rambachan and Roth (2023) who formalize an idea implicit in pre-period parallel trends tests (Granger, 1969; Angrist and Pischke, 2008; Roth, 2022; Egami and Yamauchi, 2023): Departures from the assumed causal model in the pre-period inform us about violations in the post-period. Since the true relationship between untreated outcomes in the two periods is unknown, these sensitivity analyses take observed departures in the pre-period and assume a relationship to departures in the post-period. Robustness is a (lack of) change in the estimand under departures from the point identifying assumption.

Other authors have also developed methods for causal inference from longitudinal data and applied them to study gun/policing policies and violence/crime outcomes. With a similar focus on prediction models, Antonelli and Beck (2023) use Bayesian spatio-temporal models to produce posterior predictive distributions for unit-specific treatment effects in a staggered adoption setting. Ben-Michael et al. (2023) use multitask Gaussian process models to draw causal inferences from panel data with one treated unit and count outcomes, contributing to the literature on synthetic controls. We build on these existing approaches via a Bayesian model selection procedure that is guided by an anticipated sensitivity analysis. Our methodology, therefore, joins recent research that applies ideas from design sensitivity (Rosenbaum, 2004) (see also Heller, Rosenbaum and Small, 2009; Hsu, Small and Rosenbaum, 2013; Small et al., 2013) to settings other than those of matched observational studies (Huang, Soriano and Pimentel, 2024).

In the rest of this paper, we elaborate on our approach to controlled pre-post designs applied to gun policy evaluation. Sec. 2 details our general identification strategy and establishes that the assumptions of some popular designs can be considered special cases of our framework. In Sec. 3, we introduce a sensitivity analysis framework that motivates our model selection procedure. Sec. 4 describes our proposed estimation and inference procedures. We implement our methods to estimate the effect of Missouri’s permit-to-purchase repeal on homicide in Sec. 5. Finally, we conclude in Sec. 6 and point to open questions for future research.

2. General identification strategy.

Suppose a population-level data generating process with two groups, a treated group ($G = 1$) and a comparison group ($G = 0$), as well as $t = 1, \ldots, T$ periods, of which $T$ is the only post-treatment period. That is, between periods $T-1$ and $T$, the treated group is exposed to treatment and the comparison group is not. Let the treatment indicator in period $t$ be $Z_t := G \cdot \mathbb{1}\{t = T\}$, where $\mathbb{1}\{\cdot\}$ is the indicator function that equals 1 if its argument is true and 0 if not. For the treated group, $Z_t = 0$ for all $t < T$ and $Z_T = 1$. For the comparison group, $Z_t = 0$ for all periods.

We use potential outcomes to define our causal target. Let $Y_t(0)$ denote the untreated potential outcome in period $t = 1, \ldots, T$ and $Y_T(1)$ denote the treated potential outcome in the post-treatment period, $T$. Our causal target is the ATT,

$\mathrm{ATT} := \mathbb{E}_{\mathcal{P}}[Y_T(1) \mid G = 1] - \mathbb{E}_{\mathcal{P}}[Y_T(0) \mid G = 1]$, (1)

where $\mathbb{E}_{\mathcal{P}}[\cdot]$ denotes expectation with respect to a population-level joint cumulative distribution function.

To express the ATT in terms we can estimate from data, we need assumptions about how potential outcomes relate to observable quantities. The first such assumption is consistency between potential outcomes and the observed outcome, $Y_t$.

Assumption 1 (Consistency).

For $t = 1, \ldots, T$,

$Y_t = Z_t Y_t(1) + (1 - Z_t) Y_t(0)$. (2)

Assumption 1 ensures that the observed outcome at a given time is the potential outcome corresponding to the treatment condition at that time. This rules out treatment anticipation (i.e., the treated group manifests treated outcomes before treatment begins) and spillovers/interference (i.e., the untreated group manifests treated potential outcomes).

With Assumption 1, we can express the ATT as

$\mathrm{ATT} = \mathbb{E}_{\mathcal{P}}[Y_T - Y_T(0) \mid G = 1]$, (3)

replacing the treated potential outcome with the observed outcome since the treated potential outcome can be observed in the post-period. It remains to replace the (unobservable) untreated potential outcome with an observable quantity.

Suppose we predict the untreated potential outcome in period $t$, $Y_t(0)$, via $f(X_t)$, where $f$ is a model belonging to a class of candidate models, $\mathcal{F}$, and $X_t$ is the collection of predictors for untreated potential outcomes in period $t$. The predictors of untreated potential outcomes in period $t$ are quantities whose values are determined before (or are independent of) possible treatment onset between periods $t-1$ and $t$. When we have only one post-period, $T$, "before $T$" and "pre-period" are the same. For extensions to multiple post-treatment periods, we likewise limit a prediction model's inputs to quantities whose values are determined before the start of (or are independent of) treatment.

If the prediction function were perfect, we could identify the ATT without a comparison group. That is, the ATT could be identified as

$\mathrm{ATT} = \mathbb{E}_{\mathcal{P}}[Y_T - f(X_T) \mid G = 1]$.

This identification assumption is the basis for the single interrupted time series (ITS) design (e.g., Wagner et al., 2002; Bloom, 2003; Zhang and Penfold, 2013; McDowall, McCleary and Bartos, 2019; Shadish, Cook and Campbell, 2002).

However, untreated outcomes may be subject to shocks that f cannot predict (Britt, Kleck and Bordua, 1996). Therefore, we rely on an identification assumption that uses the comparison group to inform us about what our prediction model misses. We assume that a model’s prediction errors are equal in the treated and comparison groups (in expectation) or, expressed another way, that unexpected shocks affect both groups’ outcomes equally (in expectation).

Assumption 2 (Equal expected prediction errors).

$\mathbb{E}_{\mathcal{P}}[Y_T(0) - f(X_T) \mid G = 1] = \mathbb{E}_{\mathcal{P}}[Y_T(0) - f(X_T) \mid G = 0]$. (4)

The following theorem establishes that, with this additional assumption, we can identify the ATT.

Theorem 1 (Causal identification by equal expected prediction errors).

If Assumptions 1 and 2 hold, then the ATT in Eq. 1 is identified as

$\mathbb{E}_{\mathcal{P}}[Y_T - f(X_T) \mid G = 1] - \mathbb{E}_{\mathcal{P}}[Y_T - f(X_T) \mid G = 0]$. (5)

The proof, given in Sec. 1.1 of the online supplement, is straightforward, by linearity of expectation and substitution of observed outcomes using Assumptions 1 and 2.

2.1. Existing designs as special cases.

Under what circumstances would equal expected prediction errors hold? It turns out that several popular non-parametric identification assumptions and structural causal models imply Assumption 2. That is, we show that when these assumptions hold, Assumption 2 will also hold, for particular choices of prediction models. We consider two such situations below and detail several more in the online supplement.

2.1.1. Nonparametric identifying assumptions.

Identification for DID — a popular method for observational causal inference in a range of fields — may be shown using either nonparametric assumptions or structural models. We regard DID as “design-based” (Angrist and Pischke, 2010, p. 14). Hence, we use a non-parametric identification assumption to show that it implies our identification assumption given a careful choice of prediction function.

We consider a simple setting in which there are two groups (treated and comparison) and two periods (pre-period $T-1$ and post-period $T$). DID's crucial counterfactual assumption is that untreated potential outcomes would have evolved in parallel in the two groups:

$\mathbb{E}_{\mathcal{P}}[Y_T(0) \mid G = 1] - \mathbb{E}_{\mathcal{P}}[Y_{T-1}(0) \mid G = 1] = \mathbb{E}_{\mathcal{P}}[Y_T(0) \mid G = 0] - \mathbb{E}_{\mathcal{P}}[Y_{T-1}(0) \mid G = 0]$. (6)

In the Appendix (Sec. A.1), we show that if parallel trends holds, Assumption 2 does also, for the prediction model $f(X_T) = Y_{T-1}$.

Of course, this is only the simplest example of a DID strategy. We can extend to more complex DID settings with, e.g., conditioning on covariates (in both the assumption of parallel trends and the prediction model). In addition, as shown in the online supplement, we can use alternative assumptions such as those of sequential DID (see Sec. 2.1).
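This special case can be checked numerically: with the lagged-outcome prediction model, the prediction-correction estimate coincides with the classic DID difference of differences. The group means below are hypothetical.

```python
# Hypothetical two-period group means.
y_treated = {"pre": 5.0, "post": 9.0}
y_comparison = {"pre": 4.0, "post": 6.0}

# Prediction step with f(X_T) = Y_{T-1}: the forecast is the pre-period outcome.
pred_treated = y_treated["pre"]
pred_comparison = y_comparison["pre"]

# Correction step: subtract the comparison group's prediction error.
att_prediction_correction = (y_treated["post"] - pred_treated) - (
    y_comparison["post"] - pred_comparison)

# Classic DID: difference of the two groups' over-time differences.
att_did = (y_treated["post"] - y_treated["pre"]) - (
    y_comparison["post"] - y_comparison["pre"])

# The two formulas are the same expression rearranged, so both give 2.0 here.
```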

2.1.2. Structural models.

Suppose that we have multiple outcome measurement occasions in the pre- and post-intervention periods and multiple units in the treated and comparison groups. In this case, researchers often fit two-way fixed effects (TWFE) linear regression models, where “two-way” refers to unit and time fixed effects (de Chaisemartin and D’Haultfœuille, 2023). The models contain an interaction between an indicator of the post-intervention period and treated group, the coefficient of which is interpreted as an estimator of the ATT. This approach can be justified by the equivalence of the TWFE estimator and the DID estimator (Angrist and Pischke, 2008; Egami and Yamauchi, 2023; Imai and Kim, 2019; Kropko and Kubinec, 2020; Sobel, 2012; Wooldridge, 2005) in a particular setting, which leads to the popular impression that TWFE model identification is also by a parallel trends assumption. However, the equivalence does not extend to the more general setting. Imai and Kim (2021) show that the TWFE model’s promise of simultaneous adjustment for unobserved unit and time confounders depends crucially on linearity and additivity.

Therefore, we assume that identification of TWFE models is via the following structural model:

$Y_{u,t}(0) = \alpha_u + \gamma_t + \epsilon_{u,t}$, (7)

in which $\mathbb{E}_{\mathcal{P}}[\epsilon_{u,t} \mid \alpha_u, G_u] = 0$, where $u = 1, \ldots, U$. We show in the Appendix (see Sec. A.2) that when this structural model holds, Assumption 2 also holds if the prediction model is $f(X_T) = \arg\min_{\alpha_u} \sum_{t=1}^{T-1} (Y_{u,t} - \alpha_u)^2$. This prediction model is the population-level ordinary least squares (OLS) solution to the unit fixed effects model's objective function fit to data before period $T$, which is equivalent to the mean of a unit's outcomes in all pre-treatment periods.

Again, this is a simple instance of a TWFE structural model. In the online supplement, we also show that this idea extends to similar structural models that include unit- or group-specific time trends (see Sec. 2.2) or lagged dependent variables (see Sec. 2.3). Many researchers fit more complicated models and obtain estimators from them. We have not proved that these are also special cases of Assumption 2.
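As a sanity check on this special case, the sketch below simulates noiseless data from a TWFE structural model and applies the unit-mean prediction model; the correction step then recovers the treatment effect exactly, because the common time shock cancels between groups. All numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
U, T = 6, 5                             # units and periods; period T is post
alpha = rng.normal(size=U)              # unit fixed effects
gamma = rng.normal(size=T)              # time fixed effects (common shocks)
treated = np.array([1, 1, 1, 0, 0, 0])  # first three units are treated at T
tau = 1.5                               # true ATT (hypothetical)

# Noiseless untreated potential outcomes from the TWFE structural model.
y = alpha[:, None] + gamma[None, :]
y[treated == 1, -1] += tau              # add the effect in the post-period

# Prediction step: f(X_T) = mean of each unit's pre-period outcomes.
pred = y[:, :-1].mean(axis=1)
errors = y[:, -1] - pred                # post-period prediction errors

# Correction step: the differential prediction error recovers the ATT exactly,
# since the post-period shock gamma[T] appears in both groups' errors.
att_hat = errors[treated == 1].mean() - errors[treated == 0].mean()  # ≈ 1.5
```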

2.2. Existing designs that are not special cases.

The designs considered above all use a pre-vs-post contrast (to account for time-invariant group differences) and a treated-vs-comparison contrast (to account for common shocks). Likewise, our proposed framework uses a prediction step (leveraging predictable features of each group’s outcome trajectories) and a correction step (leveraging the comparison group to correct for unexpected shocks). By contrast, some designs lack an analog of either the prediction or correction steps. The ITS design uses only a pre-post contrast; there are no comparison units with which to perform our correction step. Synthetic control uses only a treated-comparison contrast, omitting the pre-post contrast. In the online supplement, we provide more detail on the question of synthetic control (Sec. 2.4), showing that it is not a special case of our framework.

2.3. Staggered adoption.

Thus far, we have assumed that all treated units receive intervention at the same time. We now extend to staggered adoption settings, taking the perspective of Callaway and Sant’Anna (2021). That is, we consider each treatment adoption time as its own simple design, identify the treatment effect in each, and weight these effects together in a sensible way.

Define the multiple treated groups by their time of treatment adoption, $g$, and for the never-treated group, let $g = \infty$. Define $Y_t(0)$ as the potential outcome at time $t$ under assignment to being never treated and $Y_t(g)$ as the potential outcome at time $t$ under assignment to treatment starting at time $g$. Then we can re-state consistency (Assumption 1) as

$Y_t = Y_t(0) + \sum_{g=2}^{T} [Y_t(g) - Y_t(0)] G_g$,

where $G_g = \mathbb{1}\{G = g\}$ is an indicator for membership in treatment timing group $g$. Our target estimand is the average treatment effect on the treated for each treatment time $g$,

$\mathrm{ATT}(g) := \mathbb{E}_{\mathcal{P}}[Y_g(g) - Y_g(0) \mid G_g = 1]$.

That is, the ATT is the difference in potential outcomes under the condition of being treated at time g versus being never-treated for units in treatment timing group g. To identify this, we re-state the equal expected prediction errors assumption as

$\mathbb{E}_{\mathcal{P}}[Y_t(0) - f(X_t) \mid G_g = 1] = \mathbb{E}_{\mathcal{P}}[Y_t(0) - f(X_t) \mid G_{\infty} = 1]$, for $t = g$.

Then we can use any of the several ideas in Sec. 3 of Callaway and Sant’Anna (2021) to weight the resulting ATTs together.

This simplifies the approach of Callaway and Sant'Anna (2021) in two ways. First, we exclude the possibility of using not-yet-treated units in the comparison group. Second, we assume there is a single post-treatment time, $t = g$, at which we identify the ATT for each treatment timing group. Of course, both of these restrictions could be relaxed. The point is that identifying each $\mathrm{ATT}(g)$ reduces to the simple case of a treated group versus a comparison group.
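A minimal sketch of this cohort-by-cohort reduction, assuming a hypothetical three-unit panel and a DID-style prediction model (each unit's last pre-adoption outcome); the final weighting step across cohorts is omitted.

```python
import numpy as np

# Hypothetical panel over periods t = 1..4: each unit has an adoption time g
# (g = None marks never-treated units) and a list of outcomes by period.
panel = {
    "A": {"g": 3, "y": [1.0, 1.1, 3.5, 3.6]},
    "B": {"g": 4, "y": [2.0, 2.1, 2.2, 4.8]},
    "C": {"g": None, "y": [0.5, 0.6, 0.7, 0.8]},
}

def att_g(panel, g):
    """ATT(g): prediction-correction at t = g with f = last pre-g outcome."""
    never = [u for u in panel.values() if u["g"] is None]
    cohort = [u for u in panel.values() if u["g"] == g]
    err = lambda u: u["y"][g - 1] - u["y"][g - 2]  # prediction error at t = g
    return (np.mean([err(u) for u in cohort]) -
            np.mean([err(u) for u in never]))

# One treated-vs-never-treated comparison per adoption cohort.
atts = {g: att_g(panel, g) for g in (3, 4)}
```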

3. Selecting models for robustness.

We have described an assumption that can identify the ATT in controlled pre-post settings and shown that, under several familiar nonparametric or structural identifying assumptions, our identifying assumption would also hold. Because our assumption frames the problem in terms of a prediction model, we want a principled basis on which to choose among potential prediction models. Next, we propose to assess models' robustness (the complement of sensitivity) and discuss the difference between a robust model and a "correct" model.

3.1. Design sensitivity.

The design sensitivity framework, originally developed for matched observational studies (Rosenbaum, 2004), established that violations of key assumptions lead to a range of point estimates that are consistent with sample data. Therefore, an estimator converges not to a point, but rather to an interval (Rosenbaum, 2005, 2012). For a given violation, a sensitive design has a wider limiting interval than a more robust one. Conversely, in our framework, a more robust prediction model leads to a narrower limiting interval.

The robustness of a model to violations of an identifying assumption is different from the plausibility that a model is “correct,” i.e., satisfies an identifying assumption. In the name of assessing plausibility that a model is “correct”, researchers often study whether a version of an identification assumption holds in the pre-period. For example, in DID designs, it is common to test for non-parallel trends in the pre-period, which resembles a Granger causality test (Granger, 1969) and other forms of “placebo” tests (see, e.g., Angrist and Pischke, 2008, p. 237). This practice implicitly assumes that patterns observed in the pre-period would have continued into the post-period in the absence of treatment. In other words, as Egami and Yamauchi (2023) explain, this approach replaces one unverifiable assumption about counterfactual outcomes with another. The framework of design sensitivity offers a practical way out of this bind: we study the robustness of our inference to violations of the point identifying assumption, grounded in empirical evidence about the potential magnitude of those violations.

3.2. Robustness criterion.

We build on Rambachan and Roth (2023) who, following Manski and Pepper (2018), set-identify the ATT by bounding the possible violations of parallel trends. Rambachan and Roth (2023) posit that the violation lies in a set defined by the observed pre-period differential trends, yielding sensitivity bounds on the ATT. Similarly, we suppose violations of our identifying assumption lie in a set defined by the pre-period differential prediction errors. Denote the observable population-level differential prediction errors in period t under model specification f by

$\delta_{f,t} := \mathbb{E}_{\mathcal{P}}[Y_t - f(X_t) \mid G = 1] - \mathbb{E}_{\mathcal{P}}[Y_t - f(X_t) \mid G = 0]$. (8)

The point identification of Assumption 2 under model $f$ can now be expressed as $\delta_{f,T} - \mathrm{ATT} = 0$. For set identification, we would instead suppose that $\delta_{f,T} - \mathrm{ATT}$, i.e., the population-level difference in counterfactual prediction errors, lies in a compact set for some $f$.

To define a relevant set, we follow Rambachan and Roth (2023) in supposing that the violation of equal expected prediction errors is up to $M$ times the largest absolute differential prediction error in a set of pre-treatment validation periods, $\mathcal{V} \subseteq \{2, \ldots, T-1\}$. That is, for any model $f$, we suppose that the ATT lies in the interval given by

$\left[\, \delta_{f,T} - M \max_{v \in \mathcal{V}} |\delta_{f,v}|, \;\; \delta_{f,T} + M \max_{v \in \mathcal{V}} |\delta_{f,v}| \,\right]$ (9)

with $M \geq 0$. This leads to our definition of sensitivity, which is simply the length of the interval in Eq. 9. A shorter interval implies less sensitivity (i.e., greater robustness).

We can imagine alternatives to this set restriction that entail different relationships between pre- and post-periods. For instance, we could create an asymmetric set restriction. Or, if we think more recent validation periods are more informative, we might replace $\max_{v \in \mathcal{V}} |\delta_{f,v}|$ in Eq. 9 with $|\delta_{f,V}|$, where $V := \max \mathcal{V}$; that is, we might bound the violation by $M$ times the most recent absolute difference in prediction errors. Alternatively, if we think the average pre-treatment deviation matters, we could use $(1/|\mathcal{V}|) \sum_{v \in \mathcal{V}} |\delta_{f,v}|$. We proceed with the set restriction in Eq. 9, but these alternatives are straightforward to implement.
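The Eq. 9 interval and the two alternative set restrictions can be computed with a small helper. The function name and the differential prediction errors below are hypothetical.

```python
import numpy as np

def sensitivity_interval(delta_post, delta_validation, M, rule="max"):
    """ATT bounds under Eq. 9-style set restrictions (hypothetical helper)."""
    d = np.abs(np.asarray(delta_validation))
    if rule == "max":        # Eq. 9: worst-case validation-period violation
        bound = d.max()
    elif rule == "recent":   # most recent validation period only
        bound = d[-1]
    elif rule == "mean":     # average violation across validation periods
        bound = d.mean()
    else:
        raise ValueError(f"unknown rule: {rule}")
    return (delta_post - M * bound, delta_post + M * bound)

# Hypothetical differential prediction errors for one candidate model f:
# post-period value 2.0, validation-period values 0.3, -0.5, 0.1.
lo, hi = sensitivity_interval(2.0, [0.3, -0.5, 0.1], M=1.0, rule="max")
# The "recent" rule bounds by the last validation period instead.
lo_r, hi_r = sensitivity_interval(2.0, [0.3, -0.5, 0.1], M=1.0, rule="recent")
```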

The sensitivity parameter $M$ controls how tightly we constrain the identification assumption. Point identification of Assumption 2 holds under $M = 0$ and set identification holds under $M > 0$. Proposition 1 establishes that we can use pre-period data to see which model, $f$, in a set of candidate models, $\mathcal{F}$, is most robust.

Proposition 1.

Let $f$ and $f'$ be two prediction model specifications in the set of candidate model specifications, $\mathcal{F}$. Under the sensitivity model in Eq. 9, model $f$ is more robust than $f'$ if and only if $\max_{v \in \mathcal{V}} |\delta_{f,v}| \leq \max_{v \in \mathcal{V}} |\delta_{f',v}|$.

The proof is in the online supplement (see Sec. 1.2).

Proposition 1 shows that, as long as there is some nonzero pre-treatment difference in prediction errors for all $f \in \mathcal{F}$, the most robust model for any $M > 0$ will be the one with the smallest maximum absolute difference in prediction errors. By characterizing robustness in terms of observable pre-period quantities, we can choose among candidate models using the data. If we had a different set restriction (e.g., the most recent or the mean across validation periods), the procedure for selecting models on robustness would be the same: choose the model with the narrowest sensitivity bounds.

How is choosing the most robust model different from choosing the “correct” model? Suppose that Assumption 2 holds exactly for one model that nonetheless is less robust (by our criterion) than another candidate model. Proposition 2 quantifies the consequences of this trade-off between “correctness” and robustness.

Proposition 2.

Suppose Assumption 2 holds for $f$ but not $f'$, and that $f$ is less robust than $f'$, as defined in Proposition 1. The difference between the ATT of the correct model and the population-level difference in expected prediction errors of the robust model is

$\mathbb{E}_{\mathcal{P}}[f(X_T) - f'(X_T) \mid G = 0] - \mathbb{E}_{\mathcal{P}}[f(X_T) - f'(X_T) \mid G = 1]$. (10)

The proof is in the online supplement (see Sec. 1.3).

Proposition 2 shows that when a model's differential prediction errors in the validation periods provide "misleading" information about the (unobservable) differential prediction error in the post-period, our conclusions will suffer. This is related to the idea that conclusions are more robust if point estimates are stable across competing models (Brown and Atal, 2019; O'Neill et al., 2016). In our framework, two prediction models that yield identical point estimates for $M = 0$ can have quite different robustness for $M > 0$. However, if two models yield identical point estimates, then Eq. 10 is equal to 0. Therefore, stable point estimates across prediction models do not imply our conclusions are more robust, but they do mitigate a potential trade-off in which choosing the most robust model could come at the expense of choosing the "correct" model.

4. Model selection, estimation, and inference.

Thus far, we have considered population quantities only. To extend our ideas to estimation and inference from finite samples, we cannot simply plug in sample analogs of population quantities. This is because we use the data twice: first to choose a robust prediction model and again to estimate our target parameter. We therefore develop a procedure that accounts for this, illustrating our ideas in an important and accessible class of prediction models: OLS linear regression. This class of models is sufficiently rich to capture a range of models that researchers employ in the gun policy literature. It would be straightforward to show that our conclusions apply to other models, such as logistic, Poisson, transformed-outcome and isotonic regression (see, e.g., Guo and Basse, 2023), but we leave this as a topic for future research.

First, we set up the data structure and sampling mechanism. Suppose we have a sample of units indexed by $i = 1, \ldots, n$ (rather than $u$, as in the TWFE structural model, to emphasize that we are now referring to a finite sample). Each unit's observed data up to and including period $t$ are

$D_{i,t} := (Y_{i,t}, Y_{i,<t}, X_{i,t}, X_{i,<t}, G_i)$, (11)

where $Y_{i,<t}$ and $X_{i,<t}$ are the outcomes and predictors from periods $1, \ldots, t-1$ and $Y_{i,t}$ and $X_{i,t}$ are the outcomes and predictors in period $t$. The respective collections of $Y_{i,<t}$ and $X_{i,<t}$ over all $i = 1, \ldots, n$ units are $Y_{<t} := (Y_{1,<t}, \ldots, Y_{n,<t})$ and $X_{<t} := (X_{1,<t}, \ldots, X_{n,<t})$. We can collect the data in Eq. 11 over all $i = 1, \ldots, n$ units into $D_t := (D_{1,t}, \ldots, D_{n,t})$, over times into $D_i := (D_{i,1}, \ldots, D_{i,T})$ and over all units and all times into $D = (D_1, \ldots, D_n)$.

We assume the following condition on how these data are sampled.

Assumption 3.

For all $i = 1, \ldots, n$, the sample data, $D_i$, are independent and identically distributed (i.i.d.).

Assumption 3 states that i.i.d. sampling occurs at the cluster level, where the clusters are the individual units indexed by $i = 1, \ldots, n$.

Next, we set up the prediction models in the OLS framework. Before we proceed, we need an additional assumption that places conditions on the population moments. This assumption applies to each corresponding matrix of predictors for each of the candidate models in $\mathcal{F}$.

Assumption 4 (Population moment conditions).

For groups $G = 0$ and $G = 1$, $\mathbb{E}_{\mathcal{P}}[|Y_t| \mid G = g] < \infty$ and $\mathbb{E}_{\mathcal{P}}[\|X_t\|^2 \mid G = g] < \infty$ for all $t = 1, \ldots, T$, and $\mathbb{E}_{\mathcal{P}}[X_{<t}^{\top} X_{<t} \mid G = g]$ is positive definite for all $t = 2, \ldots, T$.

The first two conditions are standard, and the third condition implies that we can generate predictions in period t based on the OLS solution to a linear regression model’s objective function in periods before t.

We write the model $f$ for group $g$ in period $t$ as a function of both predictors and parameters, $f(X_{i,t}; \beta_{f,g,t})$, where $\beta_{f,g,t} \in \mathbb{R}^K$ (in which $K$ is the dimension of $X_{i,t}$). Note that $\beta_{f,g,t}$ is simply a collection of linear projection coefficients for a particular model $f$ in group $g$ based on the population-level OLS solution to the linear regression model's objective function in periods before $t$. That is, under Assumption 4,

$\beta_{f,g,t} = \mathbb{E}_{\mathcal{P}}[X_{<t}^{\top} X_{<t} \mid G = g]^{-1} \mathbb{E}_{\mathcal{P}}[X_{<t}^{\top} Y_{<t} \mid G = g]$. (12)

The collection of estimated coefficients, $\hat{\beta}_{f,g,t}$, is the sample analog of Eq. 12. We collect the estimated coefficients over groups into $\hat{\beta}_{f,t} := \{\hat{\beta}_{f,1,t}, \hat{\beta}_{f,0,t}\}$. We denote the estimated coefficients collected over times by $\hat{\beta}_f$ and the estimated coefficients collected over models by $\hat{\beta}_t$. The collection of the estimated coefficients over all models and times is $\hat{\beta}$ and the collection over all models and pre-treatment validation times is $\hat{\beta}_{\mathcal{V}}$.

With Assumptions 3 and 4 in hand, we write a point estimator of $\delta_{f,t}$ as

$\hat{\delta}(D_t, \hat{\beta}_{f,t}) := \frac{1}{n_1} \sum_{i=1}^{n} \mathbb{1}\{G_i = 1\}\left(Y_{i,t} - X_{i,t}^{\top} \hat{\beta}_{f,1,t}\right) - \frac{1}{n_0} \sum_{i=1}^{n} \mathbb{1}\{G_i = 0\}\left(Y_{i,t} - X_{i,t}^{\top} \hat{\beta}_{f,0,t}\right)$, (13)

where $n_g := \sum_{i=1}^{n} \mathbb{1}\{G_i = g\}$. The estimator of the lower and upper bounds of the ATT in period $T$, for any $M \geq 0$ and $f$, is

$\hat{\Delta}(D, \hat{\beta}_f, M) := \hat{\delta}(D_T, \hat{\beta}_{f,T}) \pm M \max_{v \in \mathcal{V}} |\hat{\delta}(D_v, \hat{\beta}_{f,v})|$. (14)

When $M = 0$, we simply use $\hat{\delta}(D_T, \hat{\beta}_{f,T})$.

A simple approach to estimation would be to 1) estimate $\hat{\delta}(D_v, \hat{\beta}_{f,v})$ for each model and validation period, 2) choose the model with the smallest worst-case absolute difference in prediction errors over the validation periods, and 3) use that model to estimate the ATT and its bounds. However, because the chosen model depends on our particular sample, we want to incorporate this uncertainty about the model into our procedure.
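The three-step "pick the winner" approach can be sketched as follows, with hypothetical estimated differential prediction errors for two candidate models.

```python
# Hypothetical estimated differential prediction errors for two candidate
# models: values over three validation periods, plus the post-period value.
models = {
    "did":   {"validation": [0.4, -0.6, 0.2], "post": 2.1},
    "trend": {"validation": [0.1, 0.2, -0.1], "post": 1.8},
}

# Steps 1-2: worst-case absolute validation error per model; keep the smallest.
worst = {name: max(abs(d) for d in m["validation"])
         for name, m in models.items()}
best = min(worst, key=worst.get)

# Step 3: ATT estimate and sensitivity bounds under a chosen M.
M = 1.0
att_hat = models[best]["post"]
bounds = (att_hat - M * worst[best], att_hat + M * worst[best])
```

Here the "trend" model wins (worst-case error 0.2 versus 0.6), but as the text notes, this ignores sampling uncertainty in which model is selected.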

The usual approach of splitting data into testing and training subsets is not feasible. We cannot split the data "vertically" (i.e., in time) because our estimators and model selection criterion use the same data by construction: terms in Eq. 14 use data from the pre-treatment validation periods $\mathcal{V}$. Nor can we rely on splitting the data "horizontally": many applications (including the one we consider here) have only a single or a few treated units, so we cannot afford to split the units.

Therefore, we propose to use a BMA estimator, which averages the estimates across models, weighting each by the model’s posterior probability of being the most robust in the population. We write this estimator as

$\hat{\mathbb{E}}_{D}[\hat{\Delta}(D, \hat{\beta}, M)] := \sum_{f \in \mathcal{F}} \hat{\Delta}(D, \hat{\beta}_f, M) \, \hat{p}_f$, (15)

where pˆf is the posterior probability that model f is the most robust model, given the sample data. This alternative to the “pick the winner” approach outlined above has statistical advantages (Piironen and Vehtari, 2017; Madigan and Raftery, 1994; Draper, 1995; Moulton, 1991; Raftery, Madigan and Hoeting, 1997).

How do we generate these posterior probabilities? We extend what Gelman and Hill (2006, p. 140) refer to as their "informal Bayesian approach." This approach has been employed by many researchers (e.g., King, Tomz and Wittenberg, 2000; Tomz, Wittenberg and King, 2003), including in ITS designs (Miratrix, 2022). The idea is to generate samples from the "informal" posterior of all the coefficients across all prediction models and pre-treatment validation periods. For this distribution, we use a multivariate Normal with mean equal to the estimated coefficients $\hat{\beta}_{\mathcal{V}}$ (collected over all validation times and all models in $\mathcal{F}$) and their estimated (robust) variance-covariance matrix clustered at the individual level (Liang and Zeger, 1986; Arellano, 1987), $\hat{\Sigma}_{\mathcal{V}}$. This approach is equivalent to drawing from the posterior distribution of the models' parameters under a flat prior. To estimate the variance-covariance of all the parameters across all models and time periods simultaneously, we use seemingly unrelated regression tools pioneered by Zellner (1962, 1963) (see also Mize, Doan and Long, 2019), detailed in the Appendix (Sec. B).

To generate the posterior probability that a model is optimal in the population, under each draw from the posterior, we predict outcomes, calculate differential prediction errors over the validation periods and then select the best model. Doing this many times generates a distribution for the best model. That is, the number of times each model is selected by this procedure is proportional to the strength of the evidence that each model is the most robust.

To formally characterize this procedure, let $\hat{\beta}_{\mathcal{V}}^{(s)}$ for $s = 1, \ldots, S$ be draws from $\mathcal{N}\left(\hat{\beta}_{\mathcal{V}}, \hat{\Sigma}_{\mathcal{V}}\right)$. Then, for all $f$, write $\hat{p}_f$ as

$$\hat{p}_f := \frac{1}{S} \sum_{s=1}^{S} \mathbb{1}\left\{ f = \operatorname*{arg\,min}_{f' \in \mathcal{F}} \; \max_{v \in \mathcal{V}} \left| \hat{\delta}\left(D_v, \hat{\beta}^{(s)}_{f',v}\right) \right| \right\}, \tag{16}$$

which is the proportion of draws in which f is the most robust model. Below we show that, in a sufficiently large sample, this proportion will be close to one with high probability for the truly most robust model.
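A minimal sketch of this simulation follows. For simplicity, it treats the differential prediction errors themselves as having a multivariate Normal posterior with a hypothetical mean and covariance; in our actual procedure, the draws are of the regression coefficients, from which the differential prediction errors are recomputed under each draw.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior for 2 models x 3 validation periods of differential
# prediction errors (stacked into one vector).
delta_mean = np.array([0.05, -0.10, 0.08,    # model 1, v = 1..3
                       0.30,  0.25, -0.35])  # model 2, v = 1..3
Sigma = 0.01 * np.eye(6)                     # assumed posterior covariance

S = 5000
draws = rng.multivariate_normal(delta_mean, Sigma, size=S)

# Eq. 16: in each draw, the "most robust" model minimizes the worst-case |delta|
worst1 = np.abs(draws[:, :3]).max(axis=1)
worst2 = np.abs(draws[:, 3:]).max(axis=1)
p_hat = np.array([(worst1 < worst2).mean(), (worst1 >= worst2).mean()])
print(p_hat)  # model 1 receives most of the posterior mass
```

Because model 1's worst-case violations are systematically smaller than model 2's, it is selected in nearly every draw, and its posterior probability of being the most robust model is correspondingly close to 1.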

Lemma 1.

Let $f^{\ast}$ denote the most robust model in the population. Under Assumptions 1, 3 and 4,

$$\hat{p}_{f^{\ast}} \xrightarrow{\;p\;} 1.$$

The proof is given in the online supplement (see Sec. 1.4). As we show next, this lemma implies that our BMA estimator is consistent.

Proposition 3.

Under Assumptions 1, 3 and 4,

$$\hat{E}_D\left[\hat{\Delta}\left(D, \hat{\beta}, M\right)\right] \xrightarrow{\;p\;} \delta_{f^{\ast},T} \pm M \max_{v \in \mathcal{V}} \left| \delta_{f^{\ast},v} \right|.$$

The proof is also given in the online supplement (see Sec. 1.5). Proposition 3 shows that the BMA estimator converges in probability to the same limit as that of an estimator in which the optimal model in the population is known before observing data. We provide a conceptual diagram of the overall estimation process in the Appendix (Sec. C).

For inference, we build on the approach from Antonelli, Papadogeorgou and Dominici (2022). These authors establish that we can estimate the uncertainty about both the model and the data in a computationally tractable way by summing two components: variance of the model posterior (holding the sample fixed) and variance of the sample (holding the model posterior fixed). Denote the variance of S draws from the observed posterior, holding the sample fixed, by

$$\widehat{\mathrm{Var}}_D\left[\hat{\Delta}\left(D, \hat{\beta}, M\right)\right] := \sum_{f \in \mathcal{F}} \left( \hat{\Delta}\left(D, \hat{\beta}_f, M\right) - \hat{E}_D\left[\hat{\Delta}\left(D, \hat{\beta}, M\right)\right] \right)^2 \hat{p}_f. \tag{17}$$

Then let $r = 1, \ldots, R$ index resamples of the data and denote the variance of our estimator over $R$ resamples, holding fixed the observed posterior, as

$$\widehat{\mathrm{Var}}_{D^{(r)}}\left[ \hat{E}_D\left[\hat{\Delta}\left(D^{(r)}, \hat{\beta}^{(r)}, M\right)\right] \right] := \frac{1}{R} \sum_{r=1}^{R} \left( \hat{E}_D\left[\hat{\Delta}\left(D^{(r)}, \hat{\beta}^{(r)}, M\right)\right] - \frac{1}{R} \sum_{r'=1}^{R} \hat{E}_D\left[\hat{\Delta}\left(D^{(r')}, \hat{\beta}^{(r')}, M\right)\right] \right)^2. \tag{18}$$

In practice, we draw R resamples of the data via the fractional weighted bootstrap (Xu et al., 2020). The overall variance estimator, accounting for both sampling and model uncertainty, of the BMA estimator in Eq. 15 is the sum of Eqs. 17 and 18. Confidence intervals can then be constructed by drawing on a Normal approximation. We show via simulations in the online supplement (see Sec. 4) that this approach to inference yields 95% confidence intervals with coverage at least as great as nominal rates in moderately large samples. We also observe the conservatism that Antonelli, Papadogeorgou and Dominici (2022, p. 103) note.
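A sketch of how the two variance components combine follows, with hypothetical per-model estimates and posterior model probabilities. Here the bootstrap replicates of the BMA estimate are simply simulated; real use would refit the procedure under fractional bootstrap weights (Xu et al., 2020).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-model ATT estimates and posterior model probabilities
estimates = np.array([1.14, 1.30, 0.95])
p_hat = np.array([0.90, 0.07, 0.03])

# Eqs. 15 and 17: BMA estimate and model-posterior variance, sample held fixed
bma = float(estimates @ p_hat)
var_model = float(((estimates - bma) ** 2) @ p_hat)

# Eq. 18: sampling variance over R replicates, model posterior held fixed
# (replicate BMA estimates simulated here for illustration)
bma_reps = bma + rng.normal(0.0, 0.10, size=500)
var_sample = bma_reps.var()

se = float(np.sqrt(var_model + var_sample))  # overall standard error
ci = (bma - 1.96 * se, bma + 1.96 * se)      # Normal-approximation 95% CI
print(round(bma, 4), round(se, 3))
```

Summing the two components before taking the square root is what allows the interval to reflect both model uncertainty (spread of per-model estimates) and sampling uncertainty (spread across resamples).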

5. The effect of gun laws on violent crime.

We now return to our analysis of Missouri’s 2007 repeal of its permit-to-purchase law. The law, in place since 1921, had required people purchasing handguns from private sellers to obtain a license that verified the purchaser had passed a background check. Our data comprise state-year observations of the homicide rate in Missouri and each of its eight neighboring comparison states from 1994 to 2016. For simplicity, we recode the data so that there is a single post-treatment period (denoted by “2008+” in Fig. 1) in which each state’s outcome in 2008 is the average of that state’s outcomes over all post-treatment periods (2008 – 2016). To estimate the repeal’s impact on gun homicides, we form a set of candidate prediction models drawn from the gun policy literature. Researchers agree on a basic model with unit fixed effects (as in Webster, Crifasi and Vernick (2014)), but disagree on other model components. Based on our survey of the literature, we divide the relevant model components into three categories:

  1. Unit-specific time trends. Researchers often include unit-specific time trends, usually linear but sometimes more complicated forms (Black and Nagin, 1998; French and Heagerty, 2008). Others explicitly advocate against their inclusion (Aneja, Donohue III and Zhang, 2014; Wolfers, 2006). We consider models that include unit-specific linear or quadratic trends. (It is straightforward to include higher-order trends, e.g., cubic, quartic, quintic, etc.)

  2. Lagged dependent variables (LDV). Some researchers include lags of the dependent variable (Duwe, Kovandzic and Moody, 2002; Moody et al., 2014), while others advocate against their inclusion because of the possibility of bias in short time series (Nickell, 1981). Following the applied literature, we consider only models that include values of the dependent variable at one time lag; however, multiple time lags are straightforward to incorporate.

  3. Outcome transformations. Linear regression is popular, but can be problematic because many outcomes of interest (including the homicide rate that we consider) are naturally bounded (Moody, 2001; Plassmann and Tideman, 2001). We use only linear models, but do consider transformations of the outcome variable, specifically logs and first differences (Black and Nagin, 1998). However, because we want to compare across models, we back-transform our predictions to the original outcome scale to compute prediction errors.

Obviously, this framework leaves out some modeling variations. For example, some studies in the gun policy literature employ random effects (Crifasi et al., 2018) and two-stage models (Rubin and Dezhbakhsh, 2003). However, given the prominence of these three model components, as well as unit fixed effects and linear models, we believe the resulting set of candidate models is reasonably broad and relevant to the gun policy literature.

From the model components above (summarized in Table 1), we take all possible combinations to derive a set of 18 candidate models. Because of their use in virtually all prediction models we surveyed, we include unit fixed effects for all 18 prediction models.

Table 1.

Model components used to create a set of candidate prediction models.

Time trend    Lagged dependent variables    Outcome transformations
----------    --------------------------    -----------------------
None          None                          None
$t$           $Y_{t-1}$                     $\log Y_t$
$t^2$                                       $Y_t - Y_{t-1}$
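The Cartesian product of the Table 1 components can be enumerated directly; a sketch (the component labels below are our own, not identifiers from the apm package):

```python
from itertools import product

# Component choices from Table 1; every model also includes unit fixed effects.
time_trends = ["none", "linear", "quadratic"]
lagged_dvs = ["none", "lag1"]
outcomes = ["level", "log", "first_difference"]

candidates = [
    {"trend": tr, "ldv": lg, "outcome": oc}
    for tr, lg, oc in product(time_trends, lagged_dvs, outcomes)
]
print(len(candidates))  # 3 x 2 x 3 = 18
```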

To select among the 18 prediction models, we estimate the differences in average prediction errors between treated and comparison groups. For each year prior to the 2007 repeal, we train our prediction models on all previous years. For example, for 2006, we train a model on data from 1994 to 2005, predict outcomes in 2006, and compute the difference in average prediction errors between treated and comparison groups. To ensure adequate years of training data, we follow Hasegawa, Webster and Small (2019) in beginning the validation period in 1999. Thus, we have 5 or more years of training data even in the first validation year (1999, for which we train the model on data from 1994–1998).
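This rolling-origin validation scheme can be sketched as follows for the baseline-mean (unit fixed effects) model; the panel below is simulated and hypothetical, not the actual state-year data.

```python
import numpy as np

rng = np.random.default_rng(3)

years = np.arange(1994, 2008)  # pre-treatment years 1994-2007
n_states = 9                   # row 0 plays the role of the treated state
treated = np.zeros(n_states, dtype=bool)
treated[0] = True

# Hypothetical homicide-rate panel: state levels + common trend + noise
levels = rng.uniform(3.0, 7.0, size=n_states)
Y = (levels[:, None] + 0.05 * (years - 1994)[None, :]
     + rng.normal(0, 0.2, (n_states, len(years))))

# For each validation year v, "train" a baseline-mean (unit fixed effects)
# model on years before v, predict v, and take the treated-minus-comparison
# difference in average prediction errors.
diff_errors = {}
for v in range(1999, 2008):
    train = years < v
    pred = Y[:, train].mean(axis=1)        # unit-mean prediction for year v
    err = Y[:, years == v].ravel() - pred  # prediction errors in year v
    diff_errors[v] = err[treated].mean() - err[~treated].mean()

worst = max(abs(d) for d in diff_errors.values())
print(len(diff_errors), round(worst, 3))
```

The first validation year (1999) uses the five training years 1994–1998, and each subsequent validation year adds one more year of training data, mirroring the scheme described above.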

Fig. 2 shows the absolute differential average prediction errors for all 18 models over all validation years, with the maximum for each model highlighted in black. The LDV, i.e., AR(1), model with unit fixed effects fit to the log of the outcome (row 1, column 2) minimizes our sensitivity criterion on the sample data. The baseline mean model with unit fixed effects (row 1, column 4), which is arguably the closest correspondent to the model of choice in Hasegawa, Webster and Small (2019), is the fifth-best model.

Fig 2.

Fig 2.

Absolute difference in average prediction errors for all candidate models. The maximum for each model is highlighted in black. The optimal model is highlighted in gray.

From Fig. 2, we can also see which prediction models would be optimal under different sensitivity criteria. For example, the prediction model with the smallest absolute difference in average prediction errors in the last pre-period (2007) is the linear time trend model with unit fixed effects (row 2, column 4). By contrast, the prediction model with the smallest absolute difference in average prediction errors, averaged over all validation periods, is the baseline mean model with unit fixed effects fit to the log of the outcome (row 1, column 1). These different loss functions for choosing the optimal model can be justified by an appropriate sensitivity analysis model. Given the sensitivity analysis in Eq. 9, which aligns with the sensitivity analysis proposed in recent research (Rambachan and Roth, 2023), the aforementioned LDV model with unit fixed effects fit to the outcome’s log scale is optimal.

Fig. 3 shows the relationship between models' point estimates and their robustness. As the figure illustrates, the potential trade-off between models' "correctness" and robustness is not especially severe. The standard deviation of point estimates across models is 0.38, and the most and least robust models yield relatively similar point estimates of 1.14 and 1.88, respectively. Compared with other applications in the gun policy literature (National Research Council of the National Academies, 2005; Morral et al., 2018; Smart et al., 2020), this similarity of point estimates across models appears atypical.

Fig 3.

Fig 3.

Estimates under each model (y axis) and corresponding maximum absolute differential prediction errors in the pre-period (x-axis).

Although point estimates may be similar across models, the models can differ in terms of robustness. Nevertheless, much value remains in the similarity of point estimates across models: as Proposition 2 shows, if point identification ($M = 0$) happens to hold under a model that is not the most robust, then the point estimate under the most robust model will not be too misleading insofar as the estimates under both models are similar.

Turning to estimation and inference, this empirical setting requires careful attention to the sources of randomness. In the setting of gun policy research, an influential article by Manski and Pepper (2018) argues that it is often difficult to conceive the units of analysis as randomly sampled from a target population of interest: “Random sampling assumptions, however, are not natural when considering states or countries as units of observations” (Manski and Pepper, 2018, p. 235). Instead, in the setting of most gun policy research, as Manski and Pepper (2018) argue, uncertainty is driven by a fundamental ambiguity over whether counterfactual point identification assumptions hold — i.e., what Rambachan and Roth (2023, p. 2556) call “identification uncertainty.”

In a setting characterized by only identification rather than sampling uncertainty, Rambachan and Roth (2023, p. 2563) argue that a natural starting point for controlled pre-post designs is one of set identification with M=1. In this set identification framework (as opposed to point identification in which M=0), researchers can then gradually increase M in a subsequent sensitivity analysis. A crucial feature of this inferential setting is the absence of uncertainty over which model is truly optimal.

In this setting one could deterministically select the truly optimal model. Then, given the selection of this optimal model, it would be straightforward to calculate bounds on the ATT under $M = 1$ and to assess the sensitivity of these bounds under increasing values of $M$. Under this approach, the bounds of the ATT (with $M = 1$) under the most robust model are [0.56, 1.72]. The changepoint value of $M$, i.e., the smallest value of $M$ at which the estimated lower and upper bounds of the ATT bracket 0, is 1.95.
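For bounds of the form $\hat{\delta}_T \pm M \max_{v} |\hat{\delta}_v|$, the changepoint is simply the ratio of the post-period estimate to the worst pre-period violation. A sketch, where the pre-period worst case is a hypothetical value chosen so that the ratio matches the reported changepoint of 1.95:

```python
# The lower and upper bounds first bracket 0 at M* = |delta_T| / max_v |delta_v|.
delta_T = 1.14      # post-period estimate under the most robust model
worst_pre = 0.5846  # hypothetical worst-case pre-period violation

M_star = abs(delta_T) / worst_pre
print(round(M_star, 2))
```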

The analysis above supposes the setting that Manski and Pepper (2018) argue is most sensible for our application. However, if we instead suppose that states are independent and identically distributed draws from a target population, then the estimation and inferential procedure in Sec. 4 is appropriate. The BMA point estimate (under $M = 0$) of 1.14 is nearly identical to the point estimate under the optimal model in the realized sample (also 1.14, subject to rounding error). The reason for this similarity is that the optimal model in the sample (LDV with unit fixed effects fit to the outcome's log scale) receives a high posterior probability (approximately 0.97) of being the optimal model in the population. The model with the second-greatest posterior probability (approximately 0.03) is the next-best model in the sample data (the LDV model with unit fixed effects fit without the log transformation of the outcome). Fig. 4 below shows the full posterior distribution given the sample data, where the x-axis includes only the models in the support of the observed posterior.

Fig 4.

Fig 4.

Posterior plausibility that each candidate model is optimal

Applying our proposed variance estimation procedure, which we would not expect to perform at its best in small samples, yields an estimated standard error (accounting for both model and sampling uncertainty) of 0.12 and a corresponding 95% confidence interval of [0.9, 1.38]. That is, we conclude that the repeal of Missouri's permit-to-purchase law increased the state's gun homicide rate by between 0.9 and 1.38 per 100,000 population. The observed homicide rate in Missouri in 2007 (just before the repeal) was 4.5 per 100,000, so the point estimate of roughly 1.14 represents a 25% increase.

For context, Webster, Crifasi and Vernick (2014) estimate an increase of 1.09 per 100,000 (+23%); Hasegawa, Webster and Small (2019) estimate an increase of 1.2 per 100,000 (+24%) using standard DID methods and increases between 0.9 and 1.3 per 100,000 (+17% to +27%) with their bracketing approach. Using synthetic control methods, other authors estimate that Connecticut's adoption of a permit-to-purchase handgun law decreased firearm homicides by 40% (Rudolph et al., 2015). Thus, our estimate is on the higher end of estimates of the effect of Missouri's policy change specifically, and there is some evidence that the effects of implementing and repealing these kinds of laws may be asymmetric.

The changepoint value of $M$ for the BMA estimator is 1.95 (approximately the same as the changepoint value under the optimal model). The changepoint value of $M$ at which the lower-bound estimator's 95% confidence interval no longer excludes 0 is approximately 1.18. That this latter changepoint is smaller is to be expected in a study like this with a small sample size.

6. Conclusion and open questions.

In this paper, we introduce a new method for causal inference in controlled pre-post settings, Averaged Prediction Models (or APM). We began by introducing a general identification framework for a broad class of prediction models in which one predicts untreated potential outcomes and corrects these predictions using the observable prediction error in the comparison group. We have shown that several popular designs are special cases of our general identification framework. Then, to choose among the set of candidate prediction models, we propose a BMA procedure based on each model’s robustness given pre-period data.

We applied these ideas to reconcile disparate models and assumptions from gun policy evaluations. Specifically, we studied the repeal of Missouri’s permit-to-purchase law in 2007 using models drawn from the literature. Rather than make claims that any one underlying causal model is “correct”, we selected the optimal model based on robustness. We found that a lagged dependent variable model with unit fixed effects, fit to the outcome’s log scale, minimized our robustness criterion in our sample, making this model the most likely to be the truly optimal model in the population (although other models are plausible as well). Our overall point estimate, averaging over the posterior probability that each model is optimal in the population, was an increase of 1.16 homicides per 100,000 population.

Our sensitivity bounds would include 0 for $M \geq 1.95$. That is, the violation of Assumption 2 would have to be more than 1.95 times greater than a weighted combination of each model's worst violation in the 9 validation years. By contrast, in the absence of our Bayesian model selection procedure, the value of $M$ that leads the sensitivity bounds to include 0 could be much smaller under any given model, as low as 0.24, with an unweighted average (across all models) of 1.06.

Our approach has several limitations. First, like all causal inference methods, our identifying assumption is untestable because it involves counterfactual quantities. Studying the differential prediction errors of a set of models in the pre-period has similar conceptual problems to testing for differential pre-trends in DID. This is why we use a sensitivity perspective to choose a prediction model based on robustness.

Second, our method is scale-dependent because we measure prediction error as a linear difference on the scale of the outcome variable. This limits our approach, but the limitation is not specific to our particular framework: scale dependence is a well-known issue in controlled pre-post designs as a whole.

Third, our prediction models use only variables that are measured prior to (or are independent of) treatment. For some data-generating models, such as interactive fixed effects, the correction step will not de-bias the estimator because the shocks do not affect treated and comparison groups equally. However, as pointed out by a reviewer, an interesting extension of our ideas might separate the comparison units into some for the prediction step and others for the correction step. For instance, the contemporaneous outcomes of some comparison units could be allowed into the prediction function for the treated units’ post-period outcomes, while other comparison units’ post-period outcomes are used to correct for unexpected common shocks.

Fourth, by switching to a robustness criterion for model selection, we induce a possible “correctness” versus robustness trade-off (Proposition 2). Rather than claim that we can choose the “correct” model, we choose a model that maximizes our robustness criterion. A model for which our identifying assumption (Assumption 2) holds exactly need not maximize robustness. However, since there is no data-driven way to choose a model that satisfies a causal identification assumption, we believe choosing based on robustness offers an appealing alternative.

Finally, our inferential procedure, which attempts to appropriately account for uncertainty in both the model and data, may not sufficiently do so in all scenarios. Bootstrap methods perform poorly when there are few clusters, as in our analysis with only one treated unit and eight comparison units (Bertrand, Duflo and Mullainathan, 2004; MacKinnon and Webb, 2020; Conley and Taber, 2011; Rokicki et al., 2018). However, we still believe that our proposal for formally accounting for the model selection procedure is an improvement over the status quo in which model selection is usually hidden from view and outside the bounds of inference entirely. Post-selection inference is an active area of research and, as a recent review article noted, “has a long and rich history, and the literature has grown beyond what can reasonably be synthesized in our review” (Kuchibhotla, Kolassa and Kuffner, 2022, p. 506). Future research should explore the application of these simultaneous inference and conditional selective inference methods to problems like ours in which sample splitting is not feasible.

Our proposal also has several key strengths. First, our conception of robustness allows us to choose a prediction model using pre-treatment observations only. This may discourage “fishing,” i.e., picking a prediction model that yields the most desirable or statistically significant result. Contrast this with selecting a model based on “correctness,” which involves assumptions about unknowable counterfactual outcomes and therefore introduces the temptation to claim that the model with the most favorable results is the “correct” model.

Second, many researchers already interpret robustness in terms of “correctness.” In DID, for instance, researchers interpret parallel trends in the pre-period as evidence for the plausibility of the true identifying assumption of parallel trends from the pre- to post-periods. Yet pre-period parallel trends provide evidence of counterfactual parallel trends only under additional assumptions, and violations of pre-period parallel trends can still be consistent with the identifying assumption (Kahn-Lang and Lang, 2020; Roth and Sant’Anna, 2023). Therefore, our proposal offers a more transparent version of this practice, recasting the evaluation of pre-period violations in terms of a sensitivity analysis rather than as a test of untestable assumptions.

Third, we show that our identification framework unifies a wide variety of prediction models researchers employ in practice. We also show that some familiar designs are special cases of our general identifying assumption for particular choices of prediction models. Thus, to generate the set of candidate prediction models, the existing literature can provide a rich set of models that already have the imprimatur of plausibility.

Fourth, we provide an R package (apm), which implements our estimation and inference procedures. The package includes functions that create a variety of prediction models, fit these models to the pre-intervention data, extract the differential prediction errors, and compute a model-averaged estimate and standard error using this paper’s Bayesian and bootstrap procedures that account for both sampling and model uncertainty.

Fifth, we need not be limited to models already in use. Another potentially significant benefit of our proposed method is its ability to draw upon flexible and modern prediction models, e.g., machine learning methods. Recall that we need not believe the model; in fact, the inner workings of a prediction model can remain a “black box.” As long as the model generates equally good predictions in the treated and control groups, we can identify our target causal estimand. However, we note that our estimation and inferential procedure would need to be substantially updated to accommodate such models, and believe this is a fruitful line of future inquiry.

Supplementary Material

Supplemental Material

Acknowledgments.

The authors would like to thank Noah Greifer for creating the R package, which implements the methods described in this manuscript.

Funding.

This work was supported by the Agency for Healthcare Research and Quality (R01HS028985). Research reported in this publication was also supported by National Institute on Aging of the National Institutes of Health under award number P01AG032952. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the Agency for Healthcare Research and Quality.

APPENDIX A: EXISTING MODELS AS SPECIAL CASES (OR NOT)

Our proofs each follow the steps sketched out below.

  1. Use the design’s identification assumptions to re-express the treated and comparison groups’ untreated potential outcomes (in expectation) in the post-period, $E_{\mathcal{P}}[Y_T(0) \mid G = 1]$ and $E_{\mathcal{P}}[Y_T(0) \mid G = 0]$.

  2. Write the prediction errors in treated and comparison groups (in expectation):
    1. First, use Assumption 1 to substitute untreated potential outcomes for any observed outcomes in the argument $X_t$ to the prediction model, $f$.²
    2. Next, take expectations (with respect to the identification assumptions) of the prediction models in each group, $E_{\mathcal{P}}[f(X_T) \mid G = 1]$ and $E_{\mathcal{P}}[f(X_T) \mid G = 0]$.
    3. Finally, compute the differential prediction error (in expectation),
      $$E_{\mathcal{P}}[Y_T(0) - f(X_T) \mid G = 1] - E_{\mathcal{P}}[Y_T(0) - f(X_T) \mid G = 0].$$
  3. Show that this is equal to 0, thereby implying Assumption 2 and, consequently, the identified estimand in Eq. 5.

A.1. Difference-in-Differences.

If the prediction function is

$$f(X_t) = Y_{t-1}, \tag{19}$$

then Assumption 2 will be true whenever parallel trends holds.

First, use parallel trends in Eq. 6 to write the treated and comparison groups’ untreated potential outcomes (in expectation) in the post-treatment period as

$$E_{\mathcal{P}}[Y_T(0) \mid G = 1] = E_{\mathcal{P}}[Y_{T-1}(0) \mid G = 1] + E_{\mathcal{P}}[Y_T(0) \mid G = 0] - E_{\mathcal{P}}[Y_{T-1}(0) \mid G = 0]$$
$$E_{\mathcal{P}}[Y_T(0) \mid G = 0] = E_{\mathcal{P}}[Y_{T-1}(0) \mid G = 0] + E_{\mathcal{P}}[Y_T(0) \mid G = 1] - E_{\mathcal{P}}[Y_{T-1}(0) \mid G = 1].$$

Next, using Assumption 1, the expectations of the prediction model in Eq. 19 in each group are

$$E_{\mathcal{P}}[f(X_T) \mid G = 1] = E_{\mathcal{P}}[Y_{T-1}(0) \mid G = 1]$$
$$E_{\mathcal{P}}[f(X_T) \mid G = 0] = E_{\mathcal{P}}[Y_{T-1}(0) \mid G = 0].$$

Hence, the differential prediction error (in expectation) is

$$\left( E_{\mathcal{P}}[Y_T(0) \mid G = 0] - E_{\mathcal{P}}[Y_{T-1}(0) \mid G = 0] \right) - \left( E_{\mathcal{P}}[Y_T(0) \mid G = 1] - E_{\mathcal{P}}[Y_{T-1}(0) \mid G = 1] \right),$$

which is equal to 0 by parallel trends in Eq. 6. Hence, Assumption 2 also holds.
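A small numeric check of this derivation, using hypothetical group-level expectations constructed to satisfy parallel trends:

```python
# Hypothetical group-level expectations under a common (parallel) trend
common_trend = 0.7
EY_pre = {"treated": 5.0, "comparison": 3.0}                # E[Y_{T-1}(0) | G]
EY_post = {g: y + common_trend for g, y in EY_pre.items()}  # E[Y_T(0) | G]

# With f(X_t) = Y_{t-1}, each group's expected prediction error is its trend
errors = {g: EY_post[g] - EY_pre[g] for g in EY_pre}
diff = errors["treated"] - errors["comparison"]
print(diff)
```

The groups' levels differ, but because both trend by the same amount, the differential prediction error is 0, exactly as the derivation requires.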

A.2. Two-way Fixed Effects.

If the prediction function is

$$f(X_t) = \operatorname*{arg\,min}_{\alpha_u} \sum_{l=1}^{t-1} \left( Y_{u,l} - \alpha_u \right)^2, \tag{20}$$

then Assumption 2 will be true whenever the TWFE structural model in Eq. 7 holds.

First, the structural model in Eq. 7 yields the following untreated potential outcomes (in expectation) in the post-period:

$$E_{\mathcal{P}}[Y_{u,T}(0) \mid G_u = 1] = E_{\mathcal{P}}[\alpha_u \mid G_u = 1] + \gamma_T$$
$$E_{\mathcal{P}}[Y_{u,T}(0) \mid G_u = 0] = E_{\mathcal{P}}[\alpha_u \mid G_u = 0] + \gamma_T.$$

The prediction model in Eq. 20 is simply each unit’s average outcome in the pre-period,

$$f(X_T) = \operatorname*{arg\,min}_{\alpha_u} \sum_{t=1}^{T-1} \left( Y_{u,t} - \alpha_u \right)^2 = \frac{1}{T-1} \sum_{t=1}^{T-1} Y_{u,t}, \tag{21}$$

so substituting $Y_{u,t}(0)$ for the observed outcomes (by Assumption 1) and taking expectations with respect to the structural model in Eq. 7 yields

$$E_{\mathcal{P}}[f(X_T) \mid G_u = 1] = E_{\mathcal{P}}[\alpha_u \mid G_u = 1] + \frac{1}{T-1} \sum_{t=1}^{T-1} \gamma_t$$
$$E_{\mathcal{P}}[f(X_T) \mid G_u = 0] = E_{\mathcal{P}}[\alpha_u \mid G_u = 0] + \frac{1}{T-1} \sum_{t=1}^{T-1} \gamma_t.$$

By substitution, we write the differential prediction error (in expectation) in period T as

$$\begin{aligned} E_{\mathcal{P}}&[Y_{u,T}(0) - f(X_T) \mid G_u = 1] - E_{\mathcal{P}}[Y_{u,T}(0) - f(X_T) \mid G_u = 0] \\ &= \left( E_{\mathcal{P}}[\alpha_u \mid G_u = 1] + \gamma_T - E_{\mathcal{P}}[\alpha_u \mid G_u = 1] - \frac{1}{T-1} \sum_{t=1}^{T-1} \gamma_t \right) \\ &\quad - \left( E_{\mathcal{P}}[\alpha_u \mid G_u = 0] + \gamma_T - E_{\mathcal{P}}[\alpha_u \mid G_u = 0] - \frac{1}{T-1} \sum_{t=1}^{T-1} \gamma_t \right), \end{aligned}$$

which is equal to 0, thereby implying Assumption 2.

Thus, the popular TWFE structural model implies our identification condition when the prediction function is OLS with unit fixed effects. This result would still hold if one were to fit both unit and time fixed effects, but doing so is unnecessary because the latter are constant across units and, hence, eliminated by the treated-minus-control difference between groups.
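The analogous cancellation in the TWFE case can be checked numerically on a noise-free panel generated from the structural model (all numbers below are hypothetical):

```python
import numpy as np

# Noise-free TWFE panel: Y_{u,t}(0) = alpha_u + gamma_t
T = 10                                   # period T is the post-period
alpha = np.array([4.0, 2.0, 3.0, 2.5])   # unit effects; unit 0 is "treated"
gamma = np.linspace(0.0, 1.8, T)         # common time shocks
Y0 = alpha[:, None] + gamma[None, :]     # untreated potential outcomes

# Unit-fixed-effects prediction (Eq. 21): each unit's pre-period mean
pred = Y0[:, : T - 1].mean(axis=1)
err = Y0[:, T - 1] - pred                # prediction error in period T

# Differential prediction error: treated (unit 0) minus comparison average
diff = err[0] - err[1:].mean()
print(abs(diff) < 1e-9)  # True: the common shocks cancel
```

Each unit's prediction error equals the same quantity, the gap between the final common shock and the average of the earlier common shocks, so the treated-minus-comparison difference vanishes regardless of the unit effects.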

On the other hand, other structural models require more careful thought about the appropriate prediction function. For example, with a unit- or group-specific linear time trend model, use of the prediction function in Eq. 20 would not imply equal expected prediction errors, but use of the OLS analog of this model does. Other models, such as that of interactive fixed effects, typically used to justify the synthetic control method (Abadie, Diamond and Hainmueller, 2010), have no clear corresponding prediction function that implies equal expected prediction errors. This should be unsurprising since the synthetic control design, which is based on a treated-versus-control contrast, is outside our scope of controlled pre-post designs.

Embedding potential outcomes in structural models or specific parametric distributions can provide intuition about when equal expected prediction errors holds. However, our identification condition does not require such assumptions. The prediction functions, which may or may not use OLS, should be interpreted as just that — algorithms without the assumptions of corresponding structural models. This approach to prediction models is common in design-based settings wherein randomness stems from either an assignment (Rosenbaum, 2002; Sales, Hansen and Rowan, 2018) or sampling (Huang et al., 2023) mechanism.

APPENDIX B: JOINT VARIANCE-COVARIANCE MATRIX FOR BAYESIAN MODEL SELECTION

The estimated (cluster-robust) variance-covariance matrix for all coefficients across all models and all validation periods is

$$\hat{\Sigma}_{\mathcal{V}} := \begin{bmatrix} \hat{\Sigma}_{(f_1, v_1), (f_1, v_1)} & \cdots & \hat{\Sigma}_{(f_1, v_1), (f_{|\mathcal{F}|}, v_{|\mathcal{V}|})} \\ \vdots & \ddots & \vdots \\ \hat{\Sigma}_{(f_{|\mathcal{F}|}, v_{|\mathcal{V}|}), (f_1, v_1)} & \cdots & \hat{\Sigma}_{(f_{|\mathcal{F}|}, v_{|\mathcal{V}|}), (f_{|\mathcal{F}|}, v_{|\mathcal{V}|})} \end{bmatrix}, \tag{22}$$

where $\hat{\Sigma}_{(f,v),(f',v')}$ is the cluster-robust variance-covariance matrix between any two model-year pairs from $\mathcal{F} \times \mathcal{V}$, and the elements in the set of candidate models, $\mathcal{F}$, are denoted by $f_1, f_2, \ldots, f_{|\mathcal{F}|}$.

In accordance with the usual sandwich formula, $\hat{\Sigma}_{(f,v),(f',v')}$ can be decomposed into its “bread” and “meat” components. The “bread” matrix for any $(f, v) \in \mathcal{F} \times \mathcal{V}$ is

$$B_{(f,v)} := \left( X_{f,<v}^{\top} X_{f,<v} \right)^{-1} \tag{23}$$

in which $X_{f,<v}$ is the $n(v-1) \times K_f$ model matrix for $f$ in periods before $v$, where $K_f$ is model $f$’s number of coefficients. For the “meat” component, first let $e_{f,i,<v}$ denote the $(v-1) \times 1$ vector of unit $i$’s prediction errors (residuals) under model $f$ for periods before $v$. Also let $X_{f,i,<v}$ be the $(v-1) \times K_f$ model matrix under model $f$ for unit $i$ in periods before $v$. Now we can write the “meat” matrix between any two model-year pairs in $\mathcal{F} \times \mathcal{V}$ (clustered at the unit level) as

$$M_{(f,v),(f',v')} := \sum_{i=1}^{n} X_{f,i,<v}^{\top} e_{f,i,<v} e_{f',i,<v'}^{\top} X_{f',i,<v'}. \tag{24}$$

Putting together the “breads” and the “meat” for any two elements from $\mathcal{F} \times \mathcal{V}$ and then multiplying by the usual small-sample adjustment factor (originally derived in Hansen, 2007) results in

$$\hat{\Sigma}_{(f,v),(f',v')} := \frac{n}{n-1} B_{(f,v)} M_{(f,v),(f',v')} B_{(f',v')}. \tag{25}$$

This estimated (cluster-robust) variance-covariance for any two elements from $\mathcal{F} \times \mathcal{V}$ in Eq. 25 can be equivalently expressed as

$$\frac{1}{n-1} \left( \frac{1}{n} \sum_{i=1}^{n} X_{f,i,<v}^{\top} X_{f,i,<v} \right)^{-1} \left( \frac{1}{n} M_{(f,v),(f',v')} \right) \left( \frac{1}{n} \sum_{i=1}^{n} X_{f',i,<v'}^{\top} X_{f',i,<v'} \right)^{-1},$$

from which it is straightforward to see that Eq. 25 converges in probability to 0 as n increases indefinitely, as does the overall (cluster robust) variance-covariance in Eq. 22.
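A sketch of one diagonal block $\hat{\Sigma}_{(f,v),(f,v)}$ of this sandwich, clustered by unit, on a simulated design; off-diagonal blocks pairing different models or validation periods would be formed analogously from the corresponding residuals and model matrices.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated design for one (f, v) pair: n units, 4 pre-v periods, K = 2 coefficients
n, periods, K = 30, 4, 2
X = rng.normal(size=(n, periods, K))
Y = X @ np.array([1.0, -0.5]) + rng.normal(0, 0.3, size=(n, periods))

# Pooled OLS fit and per-unit residual vectors
Xs = X.reshape(n * periods, K)
beta_hat = np.linalg.lstsq(Xs, Y.ravel(), rcond=None)[0]
resid = (Y.ravel() - Xs @ beta_hat).reshape(n, periods)

bread = np.linalg.inv(Xs.T @ Xs)  # Eq. 23
meat = sum(X[i].T @ np.outer(resid[i], resid[i]) @ X[i] for i in range(n))  # Eq. 24
Sigma_hat = (n / (n - 1)) * bread @ meat @ bread  # Eq. 25, diagonal block
print(Sigma_hat.shape)
```

Clustering enters through the outer product of each unit's entire residual vector, which preserves within-unit correlation across periods rather than treating the observations as independent.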

APPENDIX C: CONCEPTUAL DIAGRAM OF ESTIMATION PROCESS

Fig. 5 provides a conceptual diagram of the overall estimation process. All of the mathematical quantities in Fig. 5 are defined in the main text. However, to reiterate, the index $s = 1, \ldots, S$ runs over the posterior draws from $\mathcal{N}\left(\hat{\beta}_{\mathcal{V}}, \hat{\Sigma}_{\mathcal{V}}\right)$. In addition, $\hat{\delta}_f^{(s)}$ denotes the largest absolute differential prediction error for model $f$ over all validation periods $\mathcal{V}$, where $V := \max \mathcal{V}$, under the $s$th draw from $\mathcal{N}\left(\hat{\beta}_{\mathcal{V}}, \hat{\Sigma}_{\mathcal{V}}\right)$. The optimal model under the $s$th draw is denoted by $f^{(s)}$. The elements in the set of candidate models, $\mathcal{F}$, are denoted by $f_1, f_2, \ldots, f_{|\mathcal{F}|}$. All other quantities, namely $\hat{\Delta}(D, \hat{\beta}, M)$, $\hat{E}_D[\hat{\Delta}(D, \hat{\beta}, M)]$ and $\hat{p}_f$, are as defined in Eqs. 14, 15 and 16 in the main text.

FIG 5.

FIG 5.

Averaged Prediction Models (APM) estimation process

Footnotes

1

Since the collection of predictors may depend on the model $f$, we should index the predictors $X_t$ by the corresponding model; however, for the time being, we leave this dependence implicit in our notation since the corresponding model should be clear from context.

2

Since the prediction model can only use pre-treatment outcomes, any outcomes in $X_t$ are untreated potential outcomes.

