Significance
Generalizing scientific findings across diverse populations is fundamental to many scientific fields and to policymakers who use such analyses to guide decision-making. Traditional methods often account for the differences in populations by reweighting observed covariates, assuming only observed variables shift across the populations. Analyzing large-scale replication studies in the social sciences, we empirically demonstrate that i) shifts in unobserved variables are common, but ii) such shifts can be predicted from shifts in observed covariates. We propose a statistical theory of distributional shifts to explain this predictive, rather than merely explanatory, role of covariates in effect generalization. Our results serve as the empirical and conceptual basis for developing new statistical methods for generalizability and external validity.
Keywords: generalizability, external validity, distribution shift, replication studies
Abstract
Many existing approaches to generalizing statistical inference amid distribution shift operate under the covariate shift assumption, which posits that the conditional distribution of unobserved variables given observable ones is invariant across populations. However, recent empirical investigations have demonstrated that adjusting for shifts in observed variables (covariate shift) is often insufficient for generalization. In other words, covariate shift does not typically “explain away” the distribution shift between populations. As such, addressing the unknown yet nonnegligible shift in the unobserved variables given observed ones (conditional shift) is crucial for generalizable inference. In this paper, we present empirical evidence from two large-scale multisite replication studies indicating that covariate shift can help predict the strength of unknown conditional shift. Analyzing 680 studies across 65 sites, we find that even though the conditional shift is nonnegligible, its strength can often be bounded by that of the observable covariate shift. This pattern only emerges when the two sources of shifts are quantified by our proposed standardized, pivotal measures. We then interpret this phenomenon by connecting it to similar patterns that can be theoretically derived from a random distribution shift model. Finally, we demonstrate that exploiting the predictive role of covariate shift leads to reliable and efficient uncertainty quantification for target estimates in generalization tasks with partially observed data. Overall, our empirical and theoretical analyses highlight an alternative perspective on the problem of distributional shift, generalizability, and external validity.
Distribution shift is a central issue in generalizing statistical evidence from an observed (source) population to a new, at most partially observed (target) population, with significant implications in many domains. For instance, in the medical and social sciences, researchers and policymakers seek to leverage existing randomized controlled trials to estimate the treatment effect on a new cohort to guide clinical decisions or policy making (1–7). However, the challenge lies in whether statistical methods can capture the changes between populations to produce credible predictions of target effects.
To address the generalizability question, many statistical methods operate under assumptions positing that observed variables capture all distributional differences between populations. These assumptions can often be described as covariate shift, that is, the distribution of covariates observed in both populations can change, while the conditional distribution of the outcomes (unobserved in the target population) given the observed covariates remains invariant. For example, the distribution of age, gender, and education can differ across populations (e.g., due to convenience sampling), but the conditional treatment effect is the same for individuals with the same covariate profiles. Under this common assumption, adjusting for shift in the observed covariates, either by reweighting based on density ratios or estimating the heterogeneous covariate–outcome relationship (8–12), is sufficient for unbiased estimation of the target parameters. This common approach highlights the role of covariate shift in explaining away the distribution shift.
Given its popularity, a series of recent papers (13–15) have empirically evaluated the performance of generalization estimators based on the covariate shift assumption by comparing them against experimental benchmark estimates. Although each paper focuses on different domains, a common yet somewhat surprising finding is that observed covariate shift often can only explain a small proportion of the distributional shift in real-world applications. This implies two pessimistic messages: 1) adjusting for observed covariate shift may be insufficient for generalization, and 2) the remaining, unobserved conditional shift (i.e., shift in the conditional distribution of the outcomes given the observed covariates) is “larger” than the observed covariate shift. As such, it remains unclear how the conditional shift may be addressed for effect generalization in practice even in well-controlled settings.
1.1. This Work: The Predictive Role of Covariate Shift.
In this paper, we introduce a different role of covariate shift in predicting the unknown shift in the conditional distribution for generalization (Fig. 1). The distribution shift between the source and target populations consists of the observed covariate shift and the unobserved conditional shift, the latter being a key challenge in a generalization task. In contrast to existing approaches that either i) assume no conditional shift or ii) establish worst-case bounds based on adversarial shifts in the conditional distribution, we argue that the strength of covariate shift can bound that of the unknown conditional shift. Exploiting this bounding relationship enables effect generalization with improved validity and efficiency.
Fig. 1.
Overview of the problem and our approach. Effect generalization from a source population to a target population needs to address the distribution shift, which consists of the observed covariate shift and the unobserved conditional shift. We argue for a predictive role of covariate shift in bounding the strength of the unknown conditional shift, which is supported by our empirical findings and leads to reliable and efficient generalization.
Our proposal is supported by empirical evidence from two well-known, large-scale multisite replication projects—the Pipeline project (16) and the Many Labs 1 project (17)—from the social sciences, analyzing a total of 680 studies across 65 sites examining 25 hypotheses.* Because we have no access to the underlying population parameters, we ensure a faithful evaluation by building prediction intervals—based on various distribution shift assumptions—for estimators in target populations (including our proposed intervals built upon the empirical findings) and using their empirical coverage to examine the plausibility of the assumptions they rest on. Fig. 2 previews our empirical results.
Fig. 2.
Preview of results. Panel (A): Insufficient explanatory role of covariate shift: Empirical coverage of prediction intervals based on the i.i.d. assumption (gray) and covariate shift assumption (green and purple), showing covariate shift cannot explain away distribution shift across sites. Panel (B): Reliable and efficient effect generalization based on the predictive role of covariate shift: Empirical coverage of prediction intervals based on the i.i.d. assumption (gray), worst-case bounds (dark blue), and our method with the belief that conditional shift is bounded by covariate shift (red) or with knowledge of their relative strength (yellow).
We begin by examining common approaches that either ignore distribution shift or assume covariate shift (Section 2.3). In the two replication projects, the explanatory role of covariate shift is limited, as is evident from the low coverage of prediction intervals, complementing existing work that examines pairs of studies (14) or mean squared errors (15, 18). As shown in Panel (A) of Fig. 2, even for controlled multisite replication studies, distribution shifts across sites are not negligible (methods that assume no distributional shift (IID) do not achieve valid coverage). Furthermore, observed covariate shift cannot explain away the total distributional shift, as methods that only adjust for observed covariate shift (CovShift) do not achieve valid coverage, either.
We then proceed to compare the strengths of the observed covariate shift and the conditional shift (Section 3.2). In stark contrast with the pessimistic conjectures in previous works, we find that the conditional shift is often smaller than the covariate shift across different applications and comparisons. However, this empirical pattern emerges only when the covariate and conditional shifts are measured with proper standardization.
We interpret our empirical findings by connecting them to similar patterns that can be theoretically derived under a recently proposed random distribution shift model (19–21) (Section 3.3). Under this model, one expects to observe smaller conditional shift than covariate shift when the probability space is randomly perturbed in a way that does not favor any direction yet some component of the observed data, which is the treatment assignment here, is kept invariant. This model describes scenarios where the difference between the source and target distributions is not adversarial but is contributed by many small and random factors. Such scenarios are common in collaborative replication studies and potentially other carefully controlled studies where replicators try their best to mimic the original study design and population, but they have to deviate due to logistical and other constraints.
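The random-reweighting mechanism behind this model can be illustrated with a small simulation. The sketch below is entirely ours, with a synthetic data-generating process of our own choosing: a source sample is tilted by i.i.d. exponential weights (many small, directionless perturbations) while the randomized treatment assignment is left untouched, and standardized conditional and covariate shift measures are then compared. It illustrates only the mechanism, not the quantitative bounds derived in Section 3.3.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000

def cmean(v, mask, weights=None):
    """(Weighted) mean of v over the units selected by mask."""
    wt = np.ones(len(v)) if weights is None else weights
    return np.average(v[mask], weights=wt[mask])

ratios = []
for _ in range(200):
    # Source sample: binary covariate, randomized treatment with pi = 1/2.
    X = rng.integers(0, 2, n)
    T = rng.integers(0, 2, n)
    Y = 0.5 * X + T * (1 + 0.5 * X) + rng.normal(0, 1, n)
    psi = 2 * (2 * T - 1) * Y            # IPW score whose mean is the ATE
    # Random distribution shift: tilt the source by i.i.d. exponential weights,
    # leaving the treatment assignment mechanism untouched.
    w = rng.exponential(1.0, n)
    th_s = np.where(X == 1, cmean(psi, X == 1), cmean(psi, X == 0))
    th_t = np.where(X == 1, cmean(psi, X == 1, w), cmean(psi, X == 0, w))
    # Standardized covariate and conditional shift measures (cf. Section 3.1).
    cov_shift = abs(np.average(th_s, weights=w) - th_s.mean()) / th_s.std()
    cond_shift = abs(np.average(th_t - th_s, weights=w)) / (psi - th_s).std()
    ratios.append(cond_shift / cov_shift)

ratios = np.array(ratios)
print(np.median(ratios))  # distribution of the ratio of conditional to covariate shift
```

The exponential tilt mimics "many small and random factors" shifting the population without favoring any direction; how the resulting ratio behaves in general is exactly what the random distribution shift theory characterizes.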
Finally, we demonstrate the effectiveness of exploiting this predictive role in effect generalization, again (for evaluation purposes) by examining the empirical coverage of prediction intervals that aim to address the unknown conditional shift (Section 4). Panel (B) of Fig. 2 previews the key takeaway messages. Prediction intervals derived from this framework maintain valid coverage while yielding substantially shorter intervals. This reveals that the predictive role is stable across contexts and permits effective empirical calibration. In contrast, existing methods assuming worst-case conditional shift (WorstCase) achieve valid coverage when the worst-case shift strength is (unrealistically) calibrated by data, but at the expense of overly wide intervals.
Overall, our empirical and theoretical analyses suggest a different way to approach the problem of distributional shift, generalizability, and external validity. Most existing methods either i) assume no shift in the unobserved conditional distribution or ii) assume the unobserved conditional shift is bounded and search for worst-case scenarios that tend to be extremely adversarial. Instead, we offer a data-adaptive middle ground—the unobserved conditional shift is nonnegligible but is predictable from the observed covariate shift. Our results serve as the empirical and conceptual basis for developing new methods and models beyond the covariate shift assumption.
1.2. Scope of the Paper.
We emphasize that the main objective of this paper is to offer empirical and theoretical evidence supporting a particular perspective on real-world distributional shifts. The random distribution shift modeling assumption offers one way to justify our empirical findings, yet we do not expect it to hold universally. In particular, we limit the interpretation of our results to contexts similar to multisite replication studies where data are collected in a “natural” manner, with distribution shifts arising from random, unintended factors while the experimenters try to maintain consistency. In other words, the two projects provide a testbed for distribution shifts that emerge from inevitable deviations despite well-controlled experimental settings (22–24). Counterexamples include studies where researchers purposively change the recruitment criteria across sites, e.g., one site deliberately focuses on university students whereas another recruits only middle-aged participants, or where researchers purposively change experimental procedures, e.g., the experiment materials in one study are intentionally modified (we discuss robustness analyses in Section 5).
We also note that our evaluation mainly focuses on uncertainty quantification, that is, whether statistical methods can produce reliable prediction intervals for the actual estimates from data in the target population. Focusing on prediction intervals is inevitable since the underlying parameter is not accessible for evaluation purposes. In addition, uncertainty quantification offers a more comprehensive assessment than evaluating the consistency or unbiasedness of point estimates (see Section 1.3 for more discussion).
1.3. Related Work.
1.3.1. Reweighting in causal inference.
Using reweighting to generalize from one population to another has a long history in causal inference. Early examples include the Horvitz–Thompson (25) and Hájek estimators. Inverse probability weights are often unstable in practice, which has spurred the development of procedures that use outcome models to reduce variance (26) and balancing weight procedures that penalize the weights (27, 28). Modern reweighting procedures have been used to generalize the results of experiments from one site to another (e.g., refs. 4, 5, 8, 11, and 29–32). See refs. 33 and 34 for recent reviews.
1.3.2. Empirical evaluation of generalization.
This work adds to several recent works empirically evaluating generalization procedures that use unit-level data to generalize from one site to another. Ref. 13 diagnoses how much of the drop in prediction performance can be attributed to covariate shift vs. concept shift. Refs. 14 and 15 investigate how much of the discrepancy between causal effect estimates in different sites is due to unit-level covariates, among other factors. In welfare-to-work experiments, ref. 15 found that less than 10% of the discrepancies between sites are explained by changes in covariate distributions. This work echoes these findings on the insufficient explanatory role of covariate shift. An important distinction is that our evaluation leverages the coverage of prediction intervals over many replication studies, which offers a more comprehensive and faithful evaluation than evaluating one pair of studies for a hypothesis (13, 14) or examining mean squared errors (15, 18). For example, while ref. 18 finds in another multisite replication dataset that covariate adjustment leads to unbiased estimators (with bias averaged over multiple sites) for target estimates, it may still underestimate the variability if the conditional shift leads to discrepancies that are mean zero when averaged over studies but have nonnegligible magnitude. More importantly, we also investigate a predictive role of covariate shift that can inform reliable generalization in practice.
1.3.3. Heterogeneity and meta-analysis in replicability.
Multisite replication projects have been used to examine the heterogeneity in effect estimates across sites (35–39). An important distinction is that these works often measure certain global notions of heterogeneity via meta-analysis (40), while we focus on generalization from one site to another. Methodologically, our generalization methods are applicable when data from only the source and target sites are available, whereas meta-analysis needs data from many sites. In addition, these works echo our message about the weak explanatory role of observed factors (35, 38) or offer complementary messages about design and estimation uncertainty (39, 41); the latter may be interpreted as “random” shifts if not documented.
1.3.4. Covariate and conditional shift in machine learning.
The term covariate shift was first introduced by ref. 42 and has become one of the standard domain adaptation models; see refs. 43 and 44. Most commonly, covariate shift is addressed via importance weighting with the density ratio, which can be estimated directly, e.g., via a classifier (45). Similarly, density ratio reweighting is a standard approach to addressing covariate shift in statistical estimation and inference. The conditional shift we study is related to the notion of concept drift in machine learning (46, 47). The techniques for addressing these shifts in prediction problems, however, serve goals distinct from our estimation and inference problems.
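As a concrete illustration of the classifier trick for density-ratio estimation mentioned above, the following sketch (our own minimal NumPy implementation, not code from ref. 45) labels source samples 0 and target samples 1, fits a logistic regression by gradient descent, and converts the classifier odds into a density ratio via P(target | x)/P(source | x), rescaled by n_S/n_T:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic source/target covariates whose distributions differ by a mean shift.
X_s = rng.normal(0.0, 1.0, size=(2000, 2))   # source
X_t = rng.normal(0.5, 1.0, size=(2000, 2))   # target

# Label source 0 / target 1 and fit logistic regression by gradient descent.
X = np.vstack([X_s, X_t])
y = np.r_[np.zeros(len(X_s)), np.ones(len(X_t))]
Z = np.c_[np.ones(len(X)), X]                # prepend an intercept column
w = np.zeros(Z.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Z @ w))
    w -= 0.1 * Z.T @ (p - y) / len(y)        # average log-loss gradient step

def density_ratio(x):
    """Estimate p_T(x)/p_S(x) as the classifier odds times n_S/n_T."""
    p = 1.0 / (1.0 + np.exp(-(np.r_[1.0, x] @ w)))
    return p / (1.0 - p) * len(X_s) / len(X_t)

# The estimated ratio should be large where the target is denser than the source.
print(density_ratio(np.array([1.0, 1.0])), density_ratio(np.array([-1.0, -1.0])))
```

For Gaussian populations as above, the true log density ratio is linear in x, so a logistic model is well specified; in practice any probabilistic classifier can play the same role.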
2. Motivating Applications and Methodological Problem
We introduce our motivating applications and illustrate the core methodological challenges in generalization.
2.1. Motivating Applications: Multisite Replication Projects.
In this paper, we use two large-scale multisite replication projects from the social sciences to empirically investigate the role of covariate shifts in generalization. The Many Labs 1 project (17) evaluates the replicability of 13 classic and contemporary experimental findings in the social sciences, ranging from gain vs. loss framing (48) to sex differences in implicit attitudes toward math (49), across 36 independent data collection sites. Similarly, in the Pipeline project (16), 25 laboratories across the world (contributing populations) independently replicate experiments for 10 scientific hypotheses concerning moral judgment, a well-studied topic in psychology. Combining the two replication projects, we analyze 680 studies across 65 sites, examining 25 research hypotheses. This scale allows us to assess the proposed role of covariate shifts across diverse empirical settings.
Several features of these multisite replication projects make them suitable for evaluating distribution shifts in generalization. First, we can mimic the real-world generalization task by generalizing an effect estimate from one source site to another target site. Unlike in a real generalization task, we have access to the effect estimate from the target site, and therefore, we can empirically evaluate the performance of common generalization estimators based on the covariate shift assumption and our proposed estimator, without simulating data from an artificial data-generating process. Second, in these replication projects, multiple laboratories follow the same experimental process as closely as they can, a practice known as direct replication. As a result, the measurement of the outcome and treatment variables is consistent across sites, and the interpretation of the covariate shift and the unobserved conditional shift becomes clearer. Finally, the two replication projects differ in how laboratories were recruited. In the Pipeline project (16), laboratories were invited by the project lead because they had “access to a subject population in which the original finding was theoretically expected to replicate using the original materials” (p. 57). Therefore, sites were selected such that distributional shifts between them are expected to be small, or at least free of intentional manipulation and adversarial changes. On the other hand, in the Many Labs 1 project (17), laboratories voluntarily participated in the project without specific eligibility criteria related to whether each site was expected to replicate the original finding. Here, sites were selected conveniently but again “naturally,” without explicit intention and manipulation. This variation in site selection enables us to empirically evaluate distributional shifts in diverse scenarios.
The datasets are processed based on the raw data and scripts published by the original authors. In both projects, the covariates include demographic variables such as political ideology, gender, age, education, and income. See SI Appendix, section 1 for details about the datasets and preprocessing.
2.2. Notation and Setup.
To formally discuss the generalization problem, we introduce some notation. While we tailor our notation to the two projects above for concrete presentation, the same general framework can be applied to any generalization setting across sites.
We first index the hypotheses by $h \in \{1, \dots, H\}$ and the sites by $s \in \{1, \dots, S\}$. Each hypothesis $h$ is tested by a randomized experiment in a subset of sites $\mathcal{S}_h$, following the same experimental protocol. Each site $s$ independently recruits $n_{h,s}$ participants and collects data $\mathcal{D}_{h,s} = \{(X_i, T_i, Y_i)\}_{i=1}^{n_{h,s}}$, where $X_i$ denotes the covariates, $T_i$ the binary treatment, and $Y_i$ the outcome(s). Then, within each site, we can define the parameter of interest $\theta_{h,s}$ and its consistent and asymptotically normal estimator $\hat{\theta}_{h,s}$, which is a function of $\mathcal{D}_{h,s}$. In our applications, most hypotheses consider the average treatment effect (ATE) as $\theta_{h,s}$ and use a test that compares the sample means of the treated and control groups as $\hat{\theta}_{h,s}$. Some hypotheses are tested with $\theta_{h,s}$ being the mean of outcomes and $\hat{\theta}_{h,s}$ being a paired test comparing two outcomes. The specific hypotheses and tests are summarized in SI Appendix, Tables S2 and S4.
We assume the observations $(X_i, T_i, Y_i)$ are drawn i.i.d. from an underlying (hypothetical) superpopulation $P_{h,s}$, and datasets are independent across sites for each hypothesis $h$. Importantly, the underlying data generating process may vary across sites since there might exist distribution shifts.
We consider the generalization of estimates from site $s$ to site $t$ for all pairs $s, t \in \mathcal{S}_h$ with $s \neq t$ in each application. In general, we call the population in site $s$ the source population $P_S$ and the population in site $t$ the target population $P_T$. As is typically the case in practice, for a generalization task, we assume all data from $P_S$ are observed while only covariates are observed from $P_T$. When we evaluate the performance of various generalization estimators, we will use the full data in the target population to empirically evaluate how well the generalization estimators approximate the benchmark estimates in $P_T$.
2.3. Challenge: Covariate Shift Cannot Explain Away Distributional Shift.
The vast majority of existing methods for generalization assume that accounting for distributional shifts in observed covariates is sufficient, known as the covariate shift assumption. For example, when researchers want to generalize causal effects in one site to another site in the Pipeline project, they may assume that adjusting for observed characteristics of respondents, such as political ideology, gender, age, and education, is sufficient for generalization (consistent estimation and valid inference for the parameter in the target site).
However, in line with recent empirical evaluations (13–15), we find that this common assumption of covariate shift is often insufficient to explain away distributional shifts in real-world applications. Fig. 3 examines existing procedures that adjust for shift in observed covariates. We consider generalizing treatment effects from one site to another, using two commonly used estimators—the doubly robust (DR) estimator (11, 26) and the entropy balancing (EB) estimator (28, 50)—to construct point estimates that are consistent for the target parameter under the covariate shift assumption. Then, we follow ref. 51 to construct prediction intervals that would cover the target estimator with probability $1 - \alpha$ under covariate shift, and evaluate their empirical coverage.† As a simple baseline, we also compute prediction intervals based on the i.i.d. assumption that assumes no distribution shift between sites. Detailed estimation procedures are deferred to SI Appendix, section 2.B; we include a brief overview here. When generalizing from site $s$ to site $t$ for hypothesis $h$, the two effect sizes are $\theta_{h,s}$ and $\theta_{h,t}$, respectively. The prediction interval by the IID method centers around $\hat{\theta}_{h,s}$ based on the asymptotics $\hat{\theta}_{h,t} - \hat{\theta}_{h,s} \approx \mathcal{N}(0, \sigma_{h,s}^2 / n_{h,s} + \sigma_{h,t}^2 / n_{h,t})$, where the variances can be consistently estimated if $P_{h,s} = P_{h,t}$. The CovShift methods compute an estimator $\hat{\theta}_{h, s \to t}$ using reweighting, and the prediction interval centers around $\hat{\theta}_{h, s \to t}$ based on its asymptotic distribution, whose variance constant can be consistently estimated under the covariate shift assumption. We then evaluate the coverage of these prediction intervals for $\hat{\theta}_{h,t}$ among all pairs $s \neq t$.
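To make the IID construction above concrete, here is a minimal simulation (our own illustration with synthetic data) in which two sites draw from an identical population, so the IID prediction interval for the target estimate should attain its nominal 95% coverage:

```python
import numpy as np

rng = np.random.default_rng(1)

def site_estimate(n):
    """Difference-in-means ATE estimate and its variance for one site."""
    T = rng.integers(0, 2, n)                 # Bernoulli(1/2) treatment
    Y = 1.0 * T + rng.normal(0, 1, n)         # true ATE = 1
    treated, control = Y[T == 1], Y[T == 0]
    est = treated.mean() - control.mean()
    var = treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control)
    return est, var

reps, cover = 2000, 0
for _ in range(reps):
    est_s, var_s = site_estimate(500)          # source site
    est_t, var_t = site_estimate(500)          # target site, SAME population
    # IID prediction interval for the target estimate, centered at the source estimate:
    # the interval width accounts for sampling noise in BOTH estimates.
    half = 1.96 * np.sqrt(var_s + var_t)
    cover += abs(est_t - est_s) <= half

print(cover / reps)   # close to 0.95 when there is truly no distribution shift
```

Under a genuine distribution shift, the difference of site estimates is no longer centered at zero, which is exactly why the empirical coverage of such intervals drops in Fig. 3.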
Fig. 3.
Insufficient explanatory role of covariate shift. Left: Undercoverage of 95% prediction intervals based on the i.i.d. assumption (gray) and covariate shift assumption adjusted via doubly robust estimator (green) and entropy balancing (purple), averaged over all pairs of sites within each hypothesis for the Pipeline project (P, A) and the ManyLabs 1 data (M, A), respectively. The red dashed line is the nominal level. Right: Estimates based on existing approaches (via doubly robust estimator (green) and entropy balancing (purple)) do not bring the source estimates (gray) closer to the target estimate (red dashed line). As illustrative examples, we show results when generalizing from all other sites to site 5 (raw ID) in hypothesis 5 in the Pipeline data (P, B) and when generalizing from all other sites to site 4 in hypothesis 4 in ManyLabs 1 data (M, B). The segments connect estimates for the same pairs of sites.
Fig. 3 highlights two key findings:
(i) Adjusting for distribution shift is necessary, as prediction intervals based on the assumption of no distribution shift (denoted as IID) do not deliver valid coverage (gray bars).

(ii) The explanatory role of covariate shift is insufficient. This is evident from the undercoverage in panel (A) for both CovShift methods. The coverage is sometimes even lower than that of IID because the uncertainty that remains after adjustment is underestimated. When comparing the estimates in the source population and the generalization estimates in panel (B), we see that adjusting for covariate shift does not necessarily bring the estimators closer to the target estimate.
3. The Predictive Role of Covariate Shift
In this paper, we highlight a different role of covariate shifts: Observed covariate shifts can be used to predict unobserved shifts in the conditional distribution of $(T, Y)$ given $X$, even though covariate shifts cannot fully explain the total distributional shift. We first propose standardized measures of distributional shifts, and then provide empirical and theoretical evidence for the predictive role of covariate shift.
3.1. Comparing the Strength of Covariate Shift and Conditional Shift.
We begin by defining our measures of the two sources of distribution shifts: i) the covariate shift in $X$ (the part commonly addressed in existing methods) and ii) the conditional shift—the shift in the conditional distribution of $(T, Y)$ given $X$ (the part assumed away under the covariate shift assumption). Our approach is based on two simple principles:
Scale invariance. We would like our measures to reflect the strength of perturbations to the probability space, hence they should be invariant to scalings of the variables.
Numerical stability. We would like our measures to be useful in guiding real generalization tasks, hence they should permit stable estimation.
Throughout the paper, we suppose the goal is to understand how causal effects change across sites, and we have two randomized experiments with treatment assignment probability $\pi = \mathbb{P}(T = 1)$ (most studies in our datasets are of this form). We can write the difference in the causal effects as
$$\theta_T - \theta_S = \mathbb{E}_T[\psi(X, T, Y)] - \mathbb{E}_S[\psi(X, T, Y)], \qquad \psi(X, T, Y) = \frac{TY}{\pi} - \frac{(1 - T)Y}{1 - \pi},$$
where $\mathbb{E}_S$ and $\mathbb{E}_T$ are expectations over the source and target distribution. While we focus our discussion on causal effects in this paper for the sake of clear presentation, our proposed approach is applicable to any parameter of interest by redefining $\psi$. For example, some studies in the Pipeline project use a one-sample test, in which case the parameter of interest is the mean of the outcome and $\psi(X, T, Y) = Y$.
We begin by conceptually decomposing the impact of the overall distribution shift on the parameter of interest ($\theta_T - \theta_S$) to measure the shifts in $X$ and in $(T, Y)$ given $X$ separately:
$$\theta_T - \theta_S = \underbrace{\mathbb{E}_T[\theta_S(X)] - \mathbb{E}_S[\theta_S(X)]}_{\text{Covariate shift}} + \underbrace{\mathbb{E}_T[\theta_T(X) - \theta_S(X)]}_{\text{Conditional shift}}, \tag{1}$$
where $\theta_S(x)$ (resp. $\theta_T(x)$) is the conditional expectation of $\psi(X, T, Y)$ given $X = x$ in the source (resp. target) distribution. When the parameter of interest is the ATE, we have $\theta_S(x) = \mathbb{E}_S[Y(1) - Y(0) \mid X = x]$, the conditional ATE. In ref. 14, the decomposition Eq. 1 is used to diagnose the roles of different distribution shifts on the discrepancy of effect estimates between a pair of studies.
The first “Covariate shift” term in the decomposition Eq. 1 captures the shift in the observed covariates $X$. Intuitively, it measures how much the estimate can be brought closer to the target by adjusting for the shift in $X$. This term becomes larger when the shift between the covariate distributions $P_S(X)$ and $P_T(X)$ is stronger. Importantly, it also depends on the heterogeneity in $\theta_S(X)$, that is, how much the parameter of interest varies with the covariates. Our proposed distribution shift measures will remove the impact of such heterogeneity (sensitivity) on our measure of the strength of distribution shift to ensure interpretability and scale invariance.
The second term in Eq. 1, $\mathbb{E}_T[\theta_T(X) - \theta_S(X)]$, captures the shift in the conditional expectation between the source and target distribution. For example, when the parameter of interest is the ATE, this part captures how much the conditional ATE changes between the source and target distribution. Similarly, it not only depends on the strength of conditional shift but also on the variability of $\psi(X, T, Y)$ around $\theta_S(X)$; again, the latter will be removed in our measures.
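The decomposition in Eq. 1 can be checked on a stylized example. Below, both conditional-effect curves are linear in a scalar covariate, with illustrative coefficients of our own choosing; the covariate shift and conditional shift terms then have closed forms and sum exactly to the total discrepancy:

```python
import numpy as np

# Illustrative linear conditional-effect curves: theta_S(x) = a_s + b_s * x and
# theta_T(x) = a_t + b_t * x, with covariate means mu_s, mu_t in the two sites.
a_s, b_s = 0.2, 0.5
a_t, b_t = 0.4, 0.4
mu_s, mu_t = 0.0, 1.0

theta_S = a_s + b_s * mu_s   # E_S[theta_S(X)], the source parameter
theta_T = a_t + b_t * mu_t   # E_T[theta_T(X)], the target parameter

# Covariate shift term: E_T[theta_S(X)] - E_S[theta_S(X)]
covariate_shift = b_s * (mu_t - mu_s)
# Conditional shift term: E_T[theta_T(X) - theta_S(X)]
conditional_shift = (a_t - a_s) + (b_t - b_s) * mu_t

print(theta_T - theta_S, covariate_shift + conditional_shift)  # both approx. 0.6
```

Here the covariate shift term is driven by the moved covariate mean, while the conditional shift term reflects the changed effect curve itself; the two add up to the full discrepancy, as Eq. 1 states.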
The common assumption of covariate shift essentially assumes away the second shift in the conditional distribution. We formalize it as follows.
Assumption 1
[Covariate Shift] $P_S(T, Y \mid X) = P_T(T, Y \mid X)$ holds $P_T$-almost surely.

When there is no treatment and $\psi(X, T, Y) = Y$, this assumption is the classical covariate shift assumption in machine learning. For experiments, Assumption 1 is satisfied if the treatment probabilities do not change and the conditional distribution of the potential outcomes is invariant, i.e., if $P_S(Y(1), Y(0) \mid X) = P_T(Y(1), Y(0) \mid X)$.
Assumption 1 implies the second term in Eq. 1 is zero, and thus it suffices to adjust for the shift in observed covariates (the first term). While this is a commonly imposed assumption for the identifiability of target parameters, as discussed in Section 2.3, it is often violated in practice, which implies that the conditional shift (the second term) is often nonzero in real-world applications. Therefore, instead of assuming away the conditional shift, we carefully investigate the relationship between the two shifts to offer insights for moving beyond the covariate shift assumption.
We define distribution shift measures by rescaling the two terms in Eq. 1 by their SDs for scale invariance:
$$\text{Relative conditional shift} = \frac{\mathbb{E}_T[\theta_T(X) - \theta_S(X)]}{\mathrm{sd}_S\big(\psi(X, T, Y) - \theta_S(X)\big)}, \tag{2}$$
$$\text{Relative covariate shift} = \frac{\mathbb{E}_T[\theta_S(X)] - \mathbb{E}_S[\theta_S(X)]}{\mathrm{sd}_S\big(\theta_S(X)\big)}. \tag{3}$$
We will measure the strength of the conditional shift by the “relative conditional shift” Eq. 2. However, an issue with the “relative covariate shift” measure Eq. 3 is numerical instability whenever the denominator $\mathrm{sd}_S(\theta_S(X))$ is close to zero. This might be problematic in social science applications where the explanatory power of covariates can be low. To address this issue, we will use a Mahalanobis-type, “stabilized” measure:
$$\text{Stabilized covariate shift} = \sqrt{\frac{1}{d} \sum_{j=1}^{d} \big(\mathbb{E}_T[X_j] - \mathbb{E}_S[X_j]\big)^2}, \tag{4}$$
where $d$ is the number of covariates. Here, we assume the covariate vector $X$ is whitened (based on $P_S$) before computing Eq. 4, so the $X_j$’s are uncorrelated with unit variance. With correlated features, Eq. 4 would have to be replaced by the Mahalanobis distance. We justify this covariate shift measure from a theoretical perspective in Section 3.3. Importantly, this measure is also invariant under the scaling of features.
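A minimal implementation of such a stabilized, Mahalanobis-type measure might look as follows (our own sketch on synthetic data; the function name is ours). The source mean and covariance are used to whiten both samples before averaging the squared shifts in coordinate means:

```python
import numpy as np

def stabilized_cov_shift(X_s, X_t):
    """Mahalanobis-type covariate shift measure: whiten both samples with the
    source mean/covariance, then take the root mean squared shift in means."""
    mu_s = X_s.mean(axis=0)
    cov_s = np.cov(X_s, rowvar=False)
    L_inv = np.linalg.inv(np.linalg.cholesky(cov_s))   # whitening transform
    Z_s = (X_s - mu_s) @ L_inv.T
    Z_t = (X_t - mu_s) @ L_inv.T
    return np.sqrt(np.mean((Z_t.mean(axis=0) - Z_s.mean(axis=0)) ** 2))

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X_s = rng.normal(size=(5000, 3)) @ A.T                              # correlated covariates
X_t = rng.normal(size=(5000, 3)) @ A.T + np.array([0.3, 0.0, 0.0])  # mean-shifted target

print(stabilized_cov_shift(X_s, X_t))   # positive under a mean shift
print(stabilized_cov_shift(X_s, X_s))   # zero for identical samples
```

Because the whitening is based on the source covariance, the measure is invariant to rescaling or linear recombination of the raw features, which is the scale-invariance principle stated above.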
Remark 1
We note that both i) rescaling by SD for scale invariance and ii) adopting a stabilized measure of covariate shift are crucial for interpretable and robust empirical insights. We illustrate the importance of these considerations through an ablation study in SI Appendix, section 4, which explores alternative distribution shift measures without these elements. These alternative measures either fail to induce the predictive role or lead to much wider intervals in generalization due to numerical instability.
In our evaluations, the two population measures will be replaced by their estimators (52). The estimation details are deferred to SI Appendix, section 2, with a brief summary and specific references given in the corresponding parts of the paper.
3.2. Empirical Evidence: Covariate Shift Can Bound Conditional Shift.
Using data from both the Pipeline project and the ManyLabs1 project, we establish empirical evidence that, with our distribution shift measures, the covariate shift can bound the conditional shift, even though the strength of both may change across hypotheses and sites. Because the covariate shift is estimable in common generalization tasks, researchers can use this bounding relationship to predict the conditional shift, which is usually unobserved. We provide theoretical justification for these empirical findings in the next subsection.
We estimate the two distribution shift measures for every pair of sites for each hypothesis in the Pipeline project and the ManyLabs1 project. For any given hypothesis, we define the estimand following the original analysis (cf. SI Appendix, Tables S2 and S4 for details) for all site pairs. Then, we compute an estimate for the relative conditional shift and an estimate for the relative covariate shift, where the point estimate is the same as that used in Section 2.3, and we develop doubly robust estimators for the denominators. Other quantities in Eqs. 2 and 3 are estimated by sample averages. The estimation details are in SI Appendix, section 2.C.
Fig. 4 compares the conditional shift measure and the covariate shift measure in various contexts. The Left two panels (P, A) and (M, A) show site pairs where one site is in the United States and the other is not, as well as pairs where both sites are in the United States. The Right two panels (P, B) and (M, B) show site pairs within two hypotheses for each project.
Fig. 4.
Our covariate shift measures bound conditional shift measures in various contexts (pivotality). Left: Conditional and covariate shift measures for site pairs between the United States and Europe/non–United States and site pairs within the United States in the Pipeline data (P, A) and the ManyLabs 1 data (M, A). Right: Conditional and covariate shift measures for all site pairs in hypotheses 5 and 6 in Pipeline (P, B), and those in hypotheses 3 and 4 in ManyLabs 1 (M, B). The five largest values are removed for visualization. Panel (C): Empirical quantiles of the ratios between conditional and covariate shift measures within each hypothesis (grey and brown curves). The red curves are multiples of the quantiles of the standard Gaussian for reference.
In (P, A) and (M, A), distribution shifts between US–non-US pairs tend to be larger than those within the United States. In (P, B) and (M, B), the magnitudes of distribution shifts also vary across hypotheses. Despite the variation across contexts, however, the covariate shift measure upper bounds the conditional shift measure most of the time. In addition, when the conditional shift (which is typically unobservable in a generalization task) is larger, the observable covariate shift also tends to be larger, justifying the “predictive role” of the covariate shift for the conditional shift.
Finally, panel (C) of Fig. 4 provides a more quantitative illustration of the predictive role. In the figure, each curve traces empirical quantiles of the ratios of conditional to covariate shift measures for a hypothesis across a series of confidence levels on the horizontal axis. For reference, we compare them with multiples of standard normal distribution quantiles. A few comments are in order:
First, the bounding relationship holds with high probability. Thus, in practice, positing that the conditional shift measure is bounded by the covariate shift measure is a plausible option for establishing a range of the conditional shift strength. We will see reliable effect generalization based on this idea in Section 4.
Second, if one wants to adjust the upper bound based on a desired confidence level, it is reasonable to use a multiple of standard normal quantiles. Indeed, the empirical quantiles are smooth and similar to normal quantiles in general. This suggests a “smooth” and “random” nature of distribution shift, rather than an adversarial one.
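To make the comparison in panel (C) concrete, the following synthetic Python sketch reproduces the qualitative pattern. Under a toy model of our own construction (not the replication data), where the standardized conditional shift behaves like a half-normal and the stabilized covariate shift averages several independent squared shifts, the empirical quantiles of the ratio track multiples of Gaussian quantiles:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
d = 8
cond = np.abs(rng.normal(size=100_000))             # conditional shift proxy
cov = np.sqrt(rng.chisquare(d, size=100_000) / d)   # covariate shift proxy
ratios = cond / cov

nd = NormalDist()
for q in (0.5, 0.8, 0.9, 0.95):
    emp = np.quantile(ratios, q)
    ref = nd.inv_cdf((1 + q) / 2)   # half-normal reference quantile
    print(f"level {q:.2f}: empirical {emp:.2f} vs reference {ref:.2f}")
```

The empirical curve sits slightly above the Gaussian reference because the covariate shift proxy in the denominator fluctuates around one, mirroring the mild inflation visible in panel (C).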
3.3. Theoretical Analysis: Random Distribution Shift Model.
We offer a theoretical framework that motivates the predictive role of covariate shift and rationalizes the empirical evidence from the previous section.
We begin by modeling the data collection procedure as a two-stage process. In the first stage, the underlying distribution is randomly perturbed. This perturbation models unintended changes in the population, deviations from the experimental protocols despite efforts to adhere to them, and the like. In the second stage, data are drawn i.i.d. from the perturbed distributions. This leads to three sources of uncertainty:
Here, “sampling uncertainty” refers to the usual statistical uncertainty arising from randomly drawing observations from an underlying population, and “random shift” refers to the discrepancy between the two underlying distributions due to natural “perturbations.” In the following, we construct the target distribution by randomly perturbing the source distribution. Constructing the source by randomly perturbing the target, or constructing both by randomly perturbing a third distribution, would lead to the same asymptotics.
Our model for distribution shift includes three elements:
We assume that the treatment distribution is invariant, since the treatment probability is fixed and chosen by the scientists for the datasets we consider here.
There is distribution shift in observed covariates , which we model as random. There is shift in some unobserved effect modifiers , modeled as random, too.
The outcome is a function of the treatment indicator, the covariates, and the unobserved modifiers. Thus, the shift in the unobserved modifiers is the driving factor for the conditional shift.
Under randomization, the treatment is independent of the modifiers. Recall that the covariates are observed, while the modifiers are not.
3.3.1. Random distribution shift.
The key idea of our random distribution shift model is that the original probability measure is randomly brought up and down in small pieces which, put together, lead to CLT-like behavior of the estimates with inflated variance. To be precise, we let a collection of events be a disjoint covering of the sample space of the covariates and modifiers. We assume that these “pieces” have the same probability mass. Later, we will take the number of pieces to infinity to describe a scenario where many random factors change the probability masses of the pieces independently.
Our model describes random perturbations of the original distribution on these small event pieces. Specifically, we define the randomly reweighted distribution via
| [5] |
where the weights are i.i.d. positive random variables that are bounded away from zero and have finite variance. As written above, the treatment indicator is assumed to be independent of the modifiers under both distributions, and its distribution is invariant.
Fig. 5 visualizes this idea, where probability masses of small events are independently perturbed by “nature.” Such small, random perturbations are suitable for describing unintended but inevitable distribution shifts in multisite replication studies, such as unintended changes in the study population or random deviations from the experimental protocols despite efforts to adhere to them. We make the random distribution shift model explicit as follows.
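The reweighting of Eq. 5 can be sketched numerically as follows; the exponential weight law and the single-covariate setup are illustrative choices of ours, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 500                                        # number of small pieces
x = rng.normal(size=100_000)                   # draws from the original P
# Split the support into K equal-probability quantile pieces.
edges = np.quantile(x, np.linspace(0, 1, K + 1)[1:-1])
piece = np.searchsorted(edges, x)              # equal-mass piece labels
w = 0.5 + rng.exponential(0.5, size=K)         # i.i.d. positive weights, mean 1
p = w[piece] / w[piece].sum()                  # perturbed, renormalized masses
x_star = rng.choice(x, size=100_000, replace=True, p=p)  # draws from perturbed P
# The perturbed mean drifts from the original by roughly sd(w)/sqrt(K):
print(round(x.mean(), 3), round(x_star.mean(), 3))
```

Because each of the many equal-mass pieces is nudged independently, the induced shift in any smooth functional is small and CLT-like rather than adversarial, which is the behavior the model is designed to capture.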
Fig. 5.
Visualization of the random distribution shift model. The original distribution is randomly perturbed to produce the distribution from which data are i.i.d. drawn. Our model assumes independent perturbation/reweighting of equal-probability small events and takes the number of small events to infinity.
Assumption 2
The outcome is an unknown function of the treatment, the observed covariates, and the unobserved modifiers.
Assumption 3
Let a collection of events be a disjoint covering of the sample space of the covariates and modifiers, each with equal probability mass. We assume that step functions on these pieces approximate square-integrable functions.‡ The target distribution is obtained by randomly perturbing the source distribution according to Eq. 5. In addition, the treatment is independent of the covariates and modifiers under both distributions, and its distribution is invariant.
Making the grid fine-grained and taking limits, we obtain a distributional CLT that describes the shift of empirical means under this two-stage sampling procedure. There are various asymptotic regimes one could consider: in one regime, sampling uncertainty and distributional uncertainty are of the same order (19); in another, distributional uncertainty is of larger order than sampling uncertainty (20). In the following, we focus on the regime where sampling uncertainty and distributional uncertainty are of the same order.
Assumption 4
The sample sizes grow proportionally to the number of pieces, with positive limiting ratios, as the number of pieces tends to infinity.
Theorem 1 (Distributional CLT).
Consider the sample mean of a function over i.i.d. draws from the source distribution and the corresponding sample mean over i.i.d. draws from the target distribution. Under Assumptions 2, 3, and 4, for any square-integrable function, the difference of the two sample means is asymptotically Gaussian,
where the limiting variance combines the usual sampling variance with an additional term whose scale measures the strength of perturbation. For a vector of functions, the variance terms become covariance matrices.
In Theorem 1, the first variance term is the usual asymptotic variance one would obtain under the i.i.d. assumption. In addition, random perturbation of the distributions contributes an additional factor, in which only the variance over the covariates and modifiers counts, because only their distribution is perturbed while the treatment distribution remains invariant.
Theorem 1 describes a marginal perspective that includes two sources of randomness: i) the sampling uncertainty, i.e., the observations being i.i.d. samples from the source and target distributions, respectively, and ii) the distributional uncertainty, where the target distribution is randomly perturbed from the source as described by the random distribution shift model. Conditional on the realized distribution shift, the two empirical averages center around distinct values. Similar perspectives can be found in random effect models commonly used in meta-analysis (53, 54) and in the analysis of heterogeneity in replication studies (40), where each study is assumed to be drawn from a population of studies. Compared with these models, our model is nonparametric, and the symmetric structure allows estimation and generalization with only one dataset (e.g., leveraging the predictive role).
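A small simulation conveys the variance inflation in Theorem 1. We discretize a standard normal into equal-mass atoms, randomly reweight them (distributional uncertainty), then draw an i.i.d. sample (sampling uncertainty); the constants and the truncated-normal weight law are illustrative choices, not the theorem's requirements:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
nd = NormalDist()
K = n = 400                     # pieces and sample size of the same order
atoms = np.array([nd.inv_cdf((k + 0.5) / K) for k in range(K)])
s = 0.5                         # sd of the random weights
reps = 3000
means = np.empty(reps)
for r in range(reps):
    w = np.maximum(rng.normal(1.0, s, size=K), 0.05)  # positive weights
    p = w / w.sum()
    means[r] = rng.choice(atoms, size=n, p=p).mean()  # two-stage sample mean
sd_iid = atoms.std() / np.sqrt(n)        # usual i.i.d. CLT scale
inflation = means.std() / sd_iid         # theory: about sqrt(1 + s**2 * n / K)
print(round(inflation, 2))
```

The observed spread of the two-stage sample means exceeds the i.i.d. rate by the predicted inflation factor, which is the signature of the "sampling plus distributional" uncertainty decomposition.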
Remark 2
We assume that the treatment distribution is invariant in well-controlled experiments. For conceptual clarity, treatment compliance is implicitly assumed to be perfect. While extensions of the random shift model to instrumental variable settings might be feasible (e.g., the compliance pattern shifts across studies), this requires careful empirical justification and theoretical development.
3.3.3. Why covariate shift often upper bounds conditional shift.
We now further discuss how this distributional CLT implies that covariate shift often upper bounds conditional shift.
For simplicity, we focus on deriving the generalization error for difference-in-means estimators. A formal justification of this influence function approximation for a general class of estimators can be found in ref. 19. The numerator of our relative conditional shift measure Eq. 2 equals a difference-in-means estimator (ignoring nuisance estimation for simplicity), applied to the outcome or a transformation of it depending on the hypothesis. With the distributional CLT, the squared relative conditional shift measure obeys
| [6] |
Using the distributional CLT for the covariates, we obtain that the standardized squared mean differences follow a scaled chi-square distribution:
| [7] |
Here, the standardization uses the sample variance of each covariate in the source data. Thus, up to lower-order terms, Eq. 6 is stochastically smaller than Eq. 7. In other words, the standardized conditional shift is stochastically smaller than the standardized covariate shift. This is in line with the empirical phenomenon in Fig. 4. It also justifies replacing Eq. 3 by the stabilized Eq. 4, roughly because the perturbations are homogeneous in different directions. If we average over multiple uncorrelated covariates, by the distributional CLT, the squared covariate shift measure obeys
| [8] |
As the number of covariates grows, Eq. 8 concentrates around its mean. In our empirical studies, the covariates exhibit low correlation; hence, we directly employ the formula Eq. 4.
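The scaled chi-square behavior behind Eqs. 7 and 8 is easy to verify by simulation in the pure-sampling case (no distribution shift, an assumption we impose here only to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(5)
n = m = 500                     # source and target sample sizes
d = 6                           # number of uncorrelated covariates
reps = 2000
stats = np.empty(reps)
for r in range(reps):
    xs = rng.normal(size=(n, d))
    xt = rng.normal(size=(m, d))
    diff = xt.mean(axis=0) - xs.mean(axis=0)
    # Averaged squared standardized mean differences: chi-square(d)/d scaling
    stats[r] = np.mean(diff ** 2) / (1 / n + 1 / m)
print(round(stats.mean(), 2), round(stats.std(), 2))
# The theoretical mean is 1, and the spread shrinks like sqrt(2/d) as d grows.
```

The concentration of the averaged statistic as the number of covariates grows is exactly why the stabilized measure Eq. 4 is numerically better behaved than a per-covariate ratio.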
We remark that our model is one of the potential models that may explain the empirical finding. In SI Appendix, section 5.A, we further contextualize the insights under our model by connecting it to random-effect-type models commonly used in meta-analysis (53, 54) and heterogeneity analysis (40). There, we show that a similar pattern may arise if random parameters specifying the distributions of observed covariates and hidden effect modifiers are of comparable variances, and discuss the benefits of using the current nonparametric model.
These results motivate using the ratio of the estimated conditional shift to the estimated covariate shift as a pivot to create prediction intervals. Next, we propose such prediction intervals and evaluate their empirical performance.
4. Effect Generalization Exploiting the Predictive Role
In this section, we demonstrate that leveraging the predictive role of covariate shift leads to reliable generalization for target distributions. To this end, we build prediction intervals§ for the target population estimate based on our distribution shift measure and evaluate their empirical coverage.
4.1. Constructing Prediction Intervals.
Before presenting the results, we begin with a high-level overview of our prediction intervals based on the predictive role, while we defer technical details on the estimation procedures to SI Appendix, section 2.D.
We consider generalization tasks where a scientist has access to full observations from the source distribution but only the covariates from the target distribution. To construct our prediction interval for the target estimate, we leverage the ratio between the conditional and covariate shift measures:
where the numerator is the estimated conditional shift measure Eq. 2 and the denominator is the covariate shift measure Eq. 4. Note that one can estimate the covariate shift measure, but not the conditional shift measure, in a generalization task. Suppose the distribution of this ratio can be characterized (e.g., using approaches we discuss below) so that one can find bounds L and U obeying, approximately,
By definition, inverting the above event leads to a general form of our prediction interval for the target estimate:
| [9] |
where Eq. 9 involves an estimate of the normalizing quantity in Eq. 2 and an estimator of the target parameter in Eq. 2 that adjusts for the covariate shift. All quantities in Eq. 9 except L and U can be estimated with full observations from the source distribution and the covariate data from the target distribution. In each part of the empirical evaluation in Section 4.2, we further expand on how L and U are calibrated and the specific choice of sites to be evaluated.
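The inversion step behind Eq. 9 can be sketched in a few lines of Python; all names (the adjusted estimate, its scaling, the shift measure, and the bound U) are illustrative stand-ins for the paper's quantities:

```python
def prediction_interval(theta_adj, scale, t_hat, U):
    """Invert {conditional shift <= U * covariate shift} into an
    interval for the target estimate (sketch).

    If the relative conditional shift is |theta_target - theta_adj| / scale
    and it is bounded by U * t_hat, then theta_target lies within
    theta_adj plus or minus U * t_hat * scale.
    """
    half = U * t_hat * scale
    return theta_adj - half, theta_adj + half

# theta_adj: covariate-shift-adjusted estimate from source data;
# t_hat: covariate shift measure estimable from target covariates.
lo, hi = prediction_interval(theta_adj=0.30, scale=0.12, t_hat=0.8, U=1.0)
print(round(lo, 3), round(hi, 3))   # 0.204 0.396
```

The key point is that every input except U is estimable from the source data and the target covariates alone, which is what makes the interval feasible in a generalization task.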
We consider two ways to calibrate L and U under two scenarios of data availability (visualized in Fig. 6):
Fig. 6.
Generalization in two scenarios for the availability of data. Panel (A): Generalization without auxiliary data from source (blue) to target (yellow). Panel (B): Generalization with auxiliary data (green) from the same sites for other hypotheses. The auxiliary data are used to calibrate L and U for a new generalization task.
Constant calibration. We construct prediction intervals assuming the conditional shift measure is bounded by the covariate shift measure [constant bounds on the ratio]. This is theoretically justified under the random distribution shift model (Section 3.3). This approach is applicable to a generalization task with no information other than covariate data from the target site.
Data-adaptive calibration. We construct prediction intervals by calibrating the relative strengths of conditional and covariate shift measures using existing data. This is applicable when some relevant auxiliary data are available (but not full observations in the target site) and we believe they inform the (relative) strengths of distribution shifts in the current generalization task.
Of course, the set of available data in the second approach can be flexible; we explore other scenarios in SI Appendix, section 3.A. These prediction intervals are compared with three baselines:
IID. Prediction intervals under the i.i.d. assumption, ignoring distribution shift (same as Section 2.3).
WorstCase. Prediction intervals based on worst-case bounds under restrictions on the distributional distance between the target distribution and the reweighted source distribution, where the distance bound is calibrated with data (see SI Appendix, section 1.D for details).
Oracle. Prediction intervals calibrated with true knowledge of the relative strength of covariate shift and conditional shift measures. This is the “ideal” but unrealistic version of our method.
We evaluate the generalization performance of different methods by the empirical coverage and average length of prediction intervals across all site pairs for each hypothesis.
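The two calibration rules for the upper bound can be summarized in a few lines of Python; the synthetic auxiliary ratios stand in for estimates from existing data, and the constant value 1 reflects the belief, supported in Section 3.2, that the covariate shift measure bounds the conditional shift measure:

```python
import numpy as np

def calibrate_upper_bound(aux_ratios=None, alpha=0.05):
    """Calibrate the upper bound U on the conditional/covariate shift ratio.

    Constant calibration (no auxiliary data): U = 1, i.e., the covariate
    shift measure directly bounds the conditional shift measure.
    Data-adaptive calibration: an empirical (1 - alpha) quantile of
    ratios estimated from auxiliary generalization tasks.
    """
    if aux_ratios is None:
        return 1.0
    return float(np.quantile(aux_ratios, 1 - alpha))

rng = np.random.default_rng(4)
aux = np.abs(rng.normal(0.0, 0.4, size=200))   # synthetic auxiliary ratios
print(calibrate_upper_bound())                 # constant rule
print(round(calibrate_upper_bound(aux), 2))    # data-adaptive rule
```

When the auxiliary ratios concentrate well below one, the data-adaptive bound is tighter than the constant rule, which is the efficiency gain reported in the evaluations below.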
4.2. Empirical Evaluation.
4.2.1. Without any auxiliary data.
In the first scenario, the scientists have data from the source distribution, but they do not have any information other than covariates from the target distribution. In this setting, researchers can use our proposed approach with constant calibration.
More specifically, we consider generalization from each site to every other site, for each hypothesis in each application. When we construct prediction intervals, we assume all data from the source site are observed while only covariates are observed from the target site. When we then evaluate the statistical performance of various generalization methods, we use the full data in the target site to empirically evaluate how well each method covers the benchmark estimate there. The Oracle method takes L and U in Eq. 9 as empirical quantiles of the realized ratios across site pairs. Ours_Const sets the constant bound U = 1. WorstCase sets its bound as a high empirical quantile (replacing the maximum for stability) of the estimated KL divergence among all site pairs, and computes lower/upper bounds for the parameter with distributionally robust optimization (55) under the corresponding KL-divergence constraint.¶
In Fig. 7, we report the empirical coverage and relative lengths of prediction intervals averaged over all pairs within each hypothesis. Across two distinct applications, our procedure (denoted as Ours_Const) achieves valid coverage in most cases (A). WorstCase prediction intervals achieve the target coverage as well but are much wider than the proposed intervals (B). Not surprisingly, intervals based on the i.i.d. assumption exhibit undercoverage.
Fig. 7.
Effect generalization without auxiliary data. Row (A): Empirical coverage of prediction intervals via constant calibration at the nominal level, together with three baseline methods, using the Pipeline data (Left) and ManyLabs 1 data (Right). Row (B): Average length of prediction intervals for the four methods at the nominal level, normalized by the largest average length, using the Pipeline (Left) and ManyLabs 1 (Right) data.
4.2.2. With auxiliary data.
Next, we examine how the performance of our method improves when researchers use auxiliary data to adaptively calibrate it. We consider a scenario where data from all sites are available for other hypotheses, and prediction intervals are built for a new hypothesis. In practice, this arises when there are existing data from the same set of sites on other research questions or hypotheses.
We randomly sample a subset of hypotheses as “existing hypotheses” with full observations, and treat the remaining ones as “future hypotheses” whose generalization is to be evaluated. For each future hypothesis, Oracle takes the bound as an empirical quantile of the realized ratios for that hypothesis, while Ours sets the bound as the corresponding empirical quantile of the ratios from the existing hypotheses. WorstCase is similar to that in Section 4.2.1, with the KL-divergence bound taken as an empirical quantile of the estimated KL divergences among site pairs in the existing hypotheses. We then evaluate the empirical coverage of these prediction intervals for the future hypotheses. To ensure stable evaluation, the ordering of the hypotheses is randomly permuted 10 times. Additional calibration scenarios (generalizing to new sites for existing hypotheses and new sites for new hypotheses) in SI Appendix, section 2.A deliver similar messages.
Fig. 8 reports the coverage and lengths of prediction intervals. For both projects, our procedure achieves coverage close to the nominal level, with prediction intervals that are much smaller than those based on worst-case bounds, and quite close to the oracle method. As before, prediction intervals based on the i.i.d. assumption exhibit undercoverage.
Fig. 8.
Effect generalization with auxiliary data: We generalize to new hypotheses based on distribution shift measures calibrated from the same sites in other hypotheses. Left: Data collection order, where dark color means earlier. Row (A): Empirical coverage of prediction intervals built with four methods over 10 random draws of the hypothesis ordering, using the Pipeline (Left) and ManyLabs 1 (Right) data. The red dashed line is the nominal level. Row (B): Average length of prediction intervals over 10 random draws of the hypothesis ordering, normalized by the largest average length, using the Pipeline (Left) and ManyLabs 1 (Right) data.
5. Robustness under Purposive Sampling
So far, we have focused on collaborative multisite replication studies to understand realistic distribution shifts in the most idealized setting where researchers make best efforts to be consistent, and use a random shift model to capture all unintended discrepancies that are impossible to fully eliminate (22). However, there are many other cases where intended variations of the study population (41, 56) or experimental design (39) can provide valuable insights into the heterogeneity and robustness of treatment effects.
In this section, we extend our insights to other generalization scenarios. We conduct the same set of empirical analyses on a dataset with deliberately chosen, diverse populations and explain the observed phenomenon under a hybrid model.
5.1. Other Generalization Scenarios.
We begin with several alternative scenarios of generalization and their implications for our analysis and theory. These scenarios deviate from—and could help clarify—the scope of our current model.
As a first example, researchers may choose specific sites that are believed to differ from each other and provide diverse evidence (41, 56, 57). Each site then independently recruits participants and collects data using the same procedure. In this case, one may still expect inevitable random changes (22) that can be captured by ideas similar to our model. However, a certain “deterministic” shift is present, reflecting the researchers’ decisions. Our empirical analyses can still be carried out for such datasets as long as the variables and parameters are defined consistently across studies, yet the theoretical modeling may no longer be suitable.
As a second example, researchers may explicitly control the distribution of certain observed covariates of the recruited participants. In the most extreme case, the distribution of observed covariates is exactly the same across sites, whereas that of unobserved effect modifiers might differ. It is then impossible to gauge the unobserved shifts without additional information. Instead, one may need a worst-case analysis assuming a certain degree of distribution shift in the outcomes.
As a third example, researchers may deliberately vary the experimental design (22, 39)—such as the experimental materials, procedures, and treatment assignment mechanisms—to investigate the robustness of a scientific hypothesis. This may lead to inconsistent definitions of variables, parameters, and estimates. Thus, the analysis and theory in this paper may not apply, and we believe new empirical and theoretical frameworks are needed for these settings.
In the remainder of this section, we study a dataset close to the first example, where the analyses are still applicable.
5.2. Analyzing the KSJ Data.
Ref. 41 argues for the importance of diverse populations for understanding heterogeneity across studies. To this end, the authors deliberately chose several online and offline populations that are expected to differ (we take panels from studies 1 and 2, totaling 13 panels for 4 hypotheses; we refer to this as the KSJ data hereafter). While our random shift model also allows for diverse study populations, it cannot address the deterministic process of site selection by the investigators. Nevertheless, as the variables and parameters are consistent across panels, we can still analyze the relationship between covariate and conditional shifts.
As in Section 2.3, we first study effect generalization methods based on the i.i.d. assumption and the covariate shift assumption. Fig. 9 shows the presence of significant distribution shift (IID) and the insufficient explanatory role of covariate shift (CovShift).
Fig. 9.
Explanatory role of covariate shift in the KSJ dataset; details are as in Fig. 3. Panel (KSJ, A) shows empirical coverage, and Panel (KSJ, B) shows the estimates for different procedures.
We then analyze the covariate and conditional shifts using the same methods as in Section 3.2. Similar to Fig. 4, we compute the two shift measures for each pair of panels and compare their strengths across contexts (“academic”/“commercial” panels, detailed in SI Appendix, section 1). In Fig. 10, we observe that the covariate shift measures upper bound the conditional shift measures across various contexts for most site pairs, although the ratio between the conditional and covariate shift measures seems much smaller. When the conditional shifts are larger (comparing across panels), the covariate shifts are also larger.
Fig. 10.
Comparing the strengths of covariate and conditional shifts in various contexts in the KSJ dataset; details are the same as Fig. 4.
Finally, following Section 4, we use the predictive role to calibrate prediction intervals for new generalization tasks. The constant calibration (same as Section 4.2.1) is illustrated in Fig. 11. We observe that assuming a constant bound leads to valid coverage; our method is again more efficient than WorstCase, yet much more conservative than Oracle, reflecting the conservativeness of the predictive role here. In addition, due to the limited number of hypotheses, we consider a data-adaptive calibration scenario where full observations are available for a random subset of sites, and the new generalization tasks concern the remaining sites. In Fig. 12, when calibrating the bounds with existing data, our method becomes much more efficient.
Fig. 11.
Prediction intervals via constant calibration for effect generalization in the KSJ dataset; details are the same as Fig. 7. Panel (A) shows coverage, and panel (B) shows the relative lengths of prediction intervals.
Fig. 12.
Data-adaptive calibration for effect generalization, where distribution shifts in a subset of existing sites are used to calibrate the bounds for the other sites (Panel A). Panel (B) reports empirical coverage; Panel (C) reports the relative lengths of the prediction intervals.
To summarize, our analysis of the KSJ data again finds the predictive role of the covariate shift, albeit being conservative. Consequently, constant calibration using the predictive role leads to valid yet conservative uncertainty quantification; with auxiliary data, the data-adaptive calibration approach greatly improves the efficiency while maintaining validity.
5.3. Interpretation with Hybrid Shift.
Finally, we provide a high-level discussion on why the conservative predictive role may hold in this purposive sampling scenario. In the KSJ dataset, the distribution shift induced by the choice of researchers can be thought of as deterministic. In addition, there may be numerous random factors in the data collection process which perturb the “ideal” site distributions selected by the researchers in a way similar to our random shift model. Altogether, it would be reasonable to imagine a mixture of deterministic and random distribution shift. In the deterministic part, one would expect the sites/panels to be more diverse in observed variables—since that is what the investigators diversify—than in unobserved variables. On the other hand, the unintended changes in the random shift would lead to perturbations of similar strength to observed and unobserved variables. Combining the two, the estimated covariate shift would be larger than the conditional shift, leading to a conservative predictive role observed here.
6. Discussion
In this work, we offer insights on distribution shifts when inferring parameter estimates in a new site based on data from one site and covariates from the new one. By empirical benchmarking in large-scale replication projects, we find significant distribution shifts between sites. However, approaches that only account for shifts of observed covariates—thereby relying on the explanatory role of covariate shift—are often insufficient for explaining discrepancies between sites.
Instead of using covariates in an explanatory fashion, we propose to use covariates in a predictive fashion. More precisely, we suggest predicting the strength of the shift in the unobserved conditional distribution based on that of the observed covariates. We provide empirical evidence based on large-scale replication studies and offer a theoretical justification under a random distribution shift model. In our empirical applications, we show that our proposed prediction intervals maintain the desired coverage even in the presence of unobservable shifts. While these intervals can sometimes be overconservative, they compare favorably to worst-case approaches, which tend to be overly pessimistic.
Our empirical and theoretical findings open up several exciting future avenues. First, real-world scenarios may involve more complex forms of distribution shifts than the one studied in this work (e.g., those discussed in Section 5.1). Empirical understanding and development of estimation procedures for these new models would be a valuable contribution. Second, the nonnegligible conditional shift suggests the importance of collecting data from diverse sources to properly address the “distributional uncertainty”. Toward this goal, our investigation can provide insights for an important methodological challenge: prioritizing data collection. For example, with a partial covariate shift and partial random shift, it may be beneficial to prioritize the collection of covariates most affected by the shift.
Supplementary Material
Appendix 01 (PDF)
Acknowledgments
We appreciate excellent research assistance by Diana Da In Lee. Egami acknowledges financial support from the NSF (SES-2318659). Rothenhäusler acknowledges financial support from the Dieter Schwarz Foundation, the Dudley Chamber fund, and the David Huntington Foundation.
Author contributions
Y.J., N.E., and D.R. designed research; Y.J., N.E., and D.R. performed research; Y.J., N.E., and D.R. contributed new reagents/analytic tools; Y.J. analyzed data; and Y.J., N.E., and D.R. wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
This article is a PNAS Direct Submission.
*Note that not all sites examine all hypotheses.
†We use prediction intervals rather than the conventional CIs because we only have access to target population estimates (instead of the underlying parameters) for rigorous evaluation purposes.
‡That is, step functions on the pieces can approximate any square-integrable function. This can be achieved relatively easily; e.g., for a continuous random variable, one can set the pieces as intervals whose endpoints are consecutive quantiles under the original distribution.
§We again create prediction intervals for easier evaluation based on target estimates (instead of the underlying parameters).
¶Such computation is unavailable in a real generalization task, and the lower/upper bounds for the parameter are anticonservative for the prediction interval.
Data, Materials, and Software Availability
Previously published data were used for this work (16, 17). All other data are included in the manuscript and/or SI Appendix.
Supporting Information
References
- 1. Shadish W. R., Cook T. D., Campbell D. T., Experimental and Quasi-Experimental Designs for Generalized Causal Inference (Houghton Mifflin, Boston, 2002).
- 2. Hotz V. J., Imbens G. W., Mortimer J. H., Predicting the efficacy of future training programs using past experiences at other locations. J. Econom. 125, 241–270 (2005).
- 3. Imai K., King G., Stuart E. A., Misunderstandings between experimentalists and observationalists about causal inference. J. R. Stat. Soc. Ser. A Stat. Soc. 171, 481–502 (2008).
- 4. Cole S. R., Stuart E. A., Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. Am. J. Epidemiol. 172, 107–115 (2010).
- 5. Tipton E., Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. J. Educ. Behav. Stat. 38, 239–266 (2013).
- 6. Bareinboim E., Pearl J., Causal inference and the data-fusion problem. Proc. Natl. Acad. Sci. U.S.A. 113, 7345–7352 (2016).
- 7. Deaton A., Cartwright N., Understanding and misunderstanding randomized controlled trials. Soc. Sci. Med. 210, 2–21 (2018).
- 8. Stuart E. A., Cole S. R., Bradshaw C. P., Leaf P. J., The use of propensity scores to assess the generalizability of results from randomized trials. J. R. Stat. Soc. Ser. A Stat. Soc. 174, 369–386 (2011).
- 9. Tipton E., et al., Sample selection in randomized experiments: A new method using propensity score stratified sampling. J. Res. Educ. Effect. 7, 114–135 (2014).
- 10. Miratrix L. W., Sekhon J. S., Theodoridis A. G., Campos L. F., Worth weighting? How to think about and use weights in survey experiments. Polit. Anal. 26, 275–291 (2018).
- 11. Dahabreh I. J., Robertson S. E., Tchetgen E. J., Stuart E. A., Hernán M. A., Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics 75, 685–694 (2019).
- 12. Egami N., Hartman E., Elements of external validity: Framework, design, and analysis. Am. Polit. Sci. Rev. 117, 1070–1088 (2023).
- 13. T. T. Cai, H. Namkoong, S. Yadlowsky, Diagnosing model performance under distribution shift. arXiv [Preprint] (2023). http://arxiv.org/abs/2303.02011 (Accessed 17 October 2025).
- 14. Y. Jin, K. Guo, D. Rothenhäusler, Diagnosing the role of observable distribution shift in scientific replications. arXiv [Preprint] (2023). http://arxiv.org/abs/2309.01056 (Accessed 17 October 2025).
- 15. Lu B., Ben-Michael E., Feller A., Miratrix L., Is it who you are or where you are? Accounting for compositional differences in cross-site treatment effect variation. J. Educ. Behav. Stat. 48, 420–453 (2023).
- 16. Schweinsberg M., et al., The pipeline project: Pre-publication independent replications of a single laboratory’s research pipeline. J. Exp. Soc. Psychol. 66, 55–67 (2016).
- 17. Klein R. A., et al., Investigating variation in replicability. Soc. Psychol. 45, 142–152 (2014).
- 18. Kern H. L., Stuart E. A., Hill J., Green D. P., Assessing methods for generalizing experimental impact estimates to target populations. J. Res. Educ. Effect. 9, 103–127 (2016).
- 19. Y. Jeong, D. Rothenhäusler, Calibrated inference: Statistical inference that accounts for both sampling uncertainty and distributional uncertainty. arXiv [Preprint] (2022). http://arxiv.org/abs/2202.11886 (Accessed 17 October 2025).
- 20. Y. Jeong, D. Rothenhäusler, Out-of-distribution generalization under random, dense distributional shifts. arXiv [Preprint] (2024). http://arxiv.org/abs/2404.18370 (Accessed 17 October 2025).
- 21. K. C. Bansak, E. Paulson, D. Rothenhäusler, “Learning under random distributional shifts” in International Conference on Artificial Intelligence and Statistics, S. Dasgupta, S. Mandt, Y. Li, Eds. (PMLR, 2024), pp. 3943–3951.
- 22. McShane B. B., Tackett J. L., Böckenholt U., Gelman A., Large-scale replication projects in contemporary psychological research. Am. Stat. 73, 99–105 (2019).
- 23. Stroebe W., Strack F., The alleged crisis and the illusion of exact replication. Perspect. Psychol. Sci. 9, 59–71 (2014).
- 24. Hudson R., Explicating exact versus conceptual replication. Erkenntnis 88, 2493–2514 (2023).
- 25. Horvitz D. G., Thompson D. J., A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47, 663–685 (1952).
- 26. Robins J. M., Rotnitzky A., Zhao L. P., Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 89, 846–866 (1994).
- 27. Deville J. C., Särndal C. E., Calibration estimators in survey sampling. J. Am. Stat. Assoc. 87, 376–382 (1992).
- 28. Hainmueller J., Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Polit. Anal. 20, 25–46 (2012).
- 29. Hartman E., Grieve R., Ramsahai R., Sekhon J. S., From sample average treatment effect to population average treatment effect on the treated. J. R. Stat. Soc. Ser. A Stat. Soc. 178, 757–778 (2015).
- 30. Buchanan A. L., et al., Generalizing evidence from randomized trials using inverse probability of sampling weights. J. R. Stat. Soc. Ser. A Stat. Soc. 181, 1193–1209 (2018).
- 31. Dahabreh I. J., Robertson S. E., Steingrimsson J. A., Stuart E. A., Hernán M. A., Extending inferences from a randomized trial to a new target population. Stat. Med. 39, 1999–2014 (2020).
- 32. Egami N., Hartman E., Covariate selection for generalizing experimental results: Application to a large-scale development program in Uganda. J. R. Stat. Soc. Ser. A Stat. Soc. 184, 1524–1548 (2021).
- 33. Degtiar I., Rose S., A review of generalizability and transportability. Annu. Rev. Stat. Appl. 10, 501–524 (2023).
- 34. Colnet B., et al., Causal inference methods for combining randomized trials and observational studies: A review. Stat. Sci. 39, 165–191 (2024).
- 35. Klein R. A., et al., Many Labs 2: Investigating variation in replicability across samples and settings. Adv. Methods Pract. Psychol. Sci. 1, 443–490 (2018).
- 36. Coppock A., Leeper T. J., Mullinix K. J., Generalizability of heterogeneous treatment effect estimates across samples. Proc. Natl. Acad. Sci. U.S.A. 115, 12441–12446 (2018).
- 37. McShane B. B., Böckenholt U., Hansen K. T., Modeling and learning from variation and covariation. J. Am. Stat. Assoc. 117, 1627–1630 (2022).
- 38. Delios A., et al., Examining the generalizability of research findings from archival data. Proc. Natl. Acad. Sci. U.S.A. 119, e2120377119 (2022).
- 39. Holzmeister F., et al., Heterogeneity in effect size estimates. Proc. Natl. Acad. Sci. U.S.A. 121, e2403490121 (2024).
- 40. McShane B. B., Böckenholt U., Hansen K. T., Variation and covariation in large-scale replication projects: An evaluation of replicability. J. Am. Stat. Assoc. 117, 1605–1621 (2022).
- 41. Krefeld-Schwalb A., Sugerman E. R., Johnson E. J., Exposing omitted moderators: Explaining why effect sizes differ in the social sciences. Proc. Natl. Acad. Sci. U.S.A. 121, e2306281121 (2024).
- 42. Shimodaira H., Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plan. Inference 90, 227–244 (2000).
- 43. Quinonero-Candela J., Sugiyama M., Schwaighofer A., Lawrence N. D., Dataset Shift in Machine Learning (MIT Press, 2008).
- 44. Pan S. J., Yang Q., A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2009).
- 45. S. Bickel, M. Brückner, T. Scheffer, “Discriminative learning for differing training and test distributions” in Proceedings of the 24th International Conference on Machine Learning, Z. Ghahramani, Ed. (Association for Computing Machinery, New York, NY, 2007), pp. 81–88.
- 46. Gama J., Žliobaitė I., Bifet A., Pechenizkiy M., Bouchachia A., A survey on concept drift adaptation. ACM Comput. Surv. 46, 1–37 (2014).
- 47. Lu J., et al., Learning under concept drift: A review. IEEE Trans. Knowl. Data Eng. 31, 2346–2363 (2018).
- 48. Tversky A., Kahneman D., The framing of decisions and the psychology of choice. Science 211, 453–458 (1981).
- 49. Nosek B. A., Banaji M. R., Greenwald A. G., Math = Male, Me = Female, therefore Math ≠ Me. J. Pers. Soc. Psychol. 83, 44 (2002).
- 50. Särndal C. E., Swensson B., Wretman J., Model Assisted Survey Sampling (Springer Science & Business Media, 2003).
- 51. Jin Y., Rothenhäusler D., Tailored inference for finite populations: Conditional validity and transfer across distributions. Biometrika 111, 215–233 (2024).
- 52. Chernozhukov V., et al., Double/debiased machine learning for treatment and structural parameters. Econom. J. 21, C1–C68 (2018).
- 53. DerSimonian R., Kacker R., Random-effects model for meta-analysis of clinical trials: An update. Contemp. Clin. Trials 28, 105–114 (2007).
- 54. Borenstein M., Hedges L. V., Higgins J. P., Rothstein H. R., A basic introduction to fixed-effect and random-effects models for meta-analysis. Res. Synth. Methods 1, 97–111 (2010).
- 55. Hu Z., Hong L. J., Kullback-Leibler divergence constrained distributionally robust optimization. Optim. Online 1, 9 (2013).
- 56. Krefeld-Schwalb A., Hua X., Johnson E. J., Measuring population heterogeneity requires heterogeneous populations. Proc. Natl. Acad. Sci. U.S.A. 122, e2425536122 (2025).
- 57. N. Egami, D. D. I. Lee, Designing multi-site studies for external validity: Site selection via synthetic purposive sampling. SSRN [Preprint]. https://ssrn.com/abstract=4717330 (Accessed 17 October 2025).
Supplementary Materials
Appendix 01 (PDF)