Author manuscript; available in PMC: 2025 Dec 9.
Published before final editing as: J Comput Graph Stat. 2025 Nov 25:10.1080/10618600.2025.2592762. doi: 10.1080/10618600.2025.2592762

Fast and robust invariant generalized linear models

Parker Knight 1, Ndey Isatou Jobe 1, Rui Duan 1
PMCID: PMC12685035  NIHMSID: NIHMS2123713  PMID: 41367884

Abstract

Statistical integration of diverse data sources is an essential step in the building of generalizable prediction tools, especially in precision health. The invariant features model is a new paradigm for multi-source data integration which posits that a small number of covariates affect the outcome identically across all possible environments. Existing methods for estimating invariant effects suffer from immense computational costs or only offer good statistical performance under strict assumptions. In this work, we provide a general framework for estimation under the invariant features model that is computationally efficient and statistically flexible. We also provide a robust extension of our proposed method to protect against possibly corrupted or misspecified data sources. We demonstrate the robust properties of our method via simulations, and use it to build a transferable prediction model for end stage renal disease using electronic health records from the All of Us research program.

Keywords: data integration, optimization, electronic health records, risk prediction

1. Introduction

Data integration has become a central challenge of contemporary biomedical research. As research networks spread across institutions and even geographical boundaries, practitioners need to think carefully about how to handle heterogeneity and similarity between diverse data sources. This problem is especially pertinent in precision medicine (Martínez-García & Hernández-Lemus, 2022). Distributional shifts between patient subgroups, manifesting themselves in the training and testing data, may severely impact downstream model performance, which in turn can adversely affect decision making at the patient level (Green et al., 2024). If statistical machine learning systems are to play a role in the clinic, statistical researchers must develop models and methods that enable domain experts to better understand and, in some cases, protect against discrepancies between sources of medical data.

One such model that is growing in popularity is the invariant prediction or features model proposed by Peters et al. (2016). Here, the authors presume that the analyst is given data as outcome-feature pairs from a discrete set of sources, referred to as environments, that may correspond to different sub-populations of observations, experimental conditions, or any other source of heterogeneity. Letting ℰ denote this set of environments, the invariant features model assumes that there exists a subset S of features such that for any two environments e, f ∈ ℰ, we have

E[Y^e | X_S^e = x] = E[Y^f | X_S^f = x]    (1)

where (Y^e, X^e) ∈ ℝ × ℝ^p represents the outcome and feature vector drawn from environment e. The features in S are invariant with respect to the environments ℰ, and knowledge of S can be used to build prediction tools that generalize well to previously unseen environments satisfying (1). This model is particularly amenable to large-scale electronic health records data. Due to differences across healthcare systems, such as in clinical workflows, patient demographics, and EHR system implementations, many features derived from EHR may admit effects that are idiosyncratic to their source institution, and hence disease prediction models built using these features will be useless if applied to data from a new clinical setting (Sarwar et al., 2022; Sauer et al., 2022; Singh et al., 2022). Knowledge of a set of features satisfying (1) would greatly improve EHR-based phenotyping and disease screening tools across institutions, and lead researchers to a better understanding of disease symptoms.

The interpretability and potential applicability of the invariant features model has inspired a litany of works exploring methods for estimating the conditional mean E[Y^e | X_S^e] and the set S (Fan et al., 2024; Heinze-Deml et al., 2018; Pfister et al., 2019; Rojas-Carulla et al., 2018; Wang et al., 2024; Yin et al., 2024). While the community has made fundamental strides in building tools for estimation under (1), existing methods are routinely hampered by computational infeasibility. For instance, the original Invariant Causal Prediction algorithm of Peters et al. (2016) requires running a hypothesis test for each S ⊆ [p], and the Environment Invariant Linear Least Squares estimator of Fan et al. (2024) is defined as the solution to a novel subset selection problem which even recent advancements in integer optimization (Bertsimas et al., 2016) cannot solve efficiently. This computational cost effectively prohibits these methods from being useful in realistic data analysis tasks. Recent works such as Yin et al. (2024) and Wang et al. (2024) attempt to design computationally faster methods, but do so by imposing stricter assumptions on the data generating process or by requiring a-priori knowledge of the set S. When these assumptions are not satisfied or such information is unavailable, the analyst has no option but to use an exponentially slow method.

We make two contributions in the present work. First, we provide a computationally efficient procedure for estimating E[Y^e | X_S^e] that is valid for any generalized linear model satisfying (1). Our approach uses a linear relaxation of the estimator proposed in Fan et al. (2024), and we propose an alternating minimization scheme to compute our estimator quickly and accurately. We then propose a robust version of our method that is able to maintain good statistical performance even when some environments fail to satisfy the condition in (1). Robustness is invaluable in real-world settings, as data sources may suffer from measurement error, corruption, or outright misspecification. We demonstrate through an analysis of synthetic data that our proposed method and its robust extension perform well under a variety of settings. Then, we use a real EHR dataset from the All of Us program (All of Us Research Program Investigators, 2019) to build a risk prediction model for end stage renal disease. We show that our proposed methods give the best predictive performance on previously unseen environments even when some of the training data is corrupted.

1.1. Related literature

Our work directly builds on the invariant features model first proposed by the seminal work of Peters et al. (2016). The authors of Peters et al. (2016) propose a multiple testing based procedure for recovering the invariant features S under a linear assumption on (1). In Heinze-Deml et al. (2018) and Pfister et al. (2019), these authors extend this idea to nonlinear and time series models respectively. The first theoretical guarantees on estimating the conditional mean E[Y^e | X_S^e] under a linear model are given in Fan et al. (2024), and these authors extend their results to nonlinear models in Gu et al. (2024). The work of Yin et al. (2024) also aims to estimate E[Y^e | X_S^e] under a linear model and does so using a relaxation similar to the one proposed in the present work. They provide theoretical results showing that their CoCo procedure can identify the invariant effects if the analyst has prior knowledge on the support S. The recent work of Wang et al. (2024) uses a non-convex optimization program to estimate E[Y^e | X_S^e], requiring that distinct environments are generated from additive interventions. Finally, Li and Zhang (2024) studies an invariant features-type model under fairness constraints, and provides algorithms for estimation and feature screening under this model. In the sense that invariant features can be understood as causal features (Peters et al., 2016) for the outcome Y^e, the cited works address causal discovery and causal effect estimation under particular parameterizations of the conditional mean E[Y^e | X_S^e]. While we do not focus on causality in the present work, the reader may keep this interpretation of our proposed algorithm in mind, as it constitutes a fast algorithm for estimating causal effects under the invariant generalized linear model.

This line of work is situated within the broader statistical data integration literature. The invariant features model is closely related to the multi-task learning problem (Zhang & Yang, 2018), as distinct environments can be thought of as different tasks that are modeled simultaneously. The papers Maurer et al. (2013) and Maurer et al. (2016) initiate the contemporary theoretical study of multi-task learning using classical tools from empirical process theory. The more recent work of Tripuraneni et al. (2021) gives sharper results in the multi-task linear model under a low rank feature representation. This model is fully explored in Duchi et al. (2022), Niu et al. (2024), and Tian et al. (2023). In Duan and Wang (2023), the authors provide a general framework for robust multi-task learning when some tasks may be misspecified. Knight and Duan (2024) gives theoretical results for sparse and low rank multi-task learning under data sharing constraints. The invariant features model can also be conceived of as a relaxation of the covariate shift assumption as explored in Liu et al. (2023) and Ma et al. (2023). Here, the end goal is typically to adjust for differences between a source dataset and target data in a transfer learning task. Transfer learning has grown in popularity in its own right, as recent works provide tools for transfer learning under high-dimensional linear models (Li et al., 2022), generalized linear models (Li et al., 2024), as well as in federated settings (Li et al., 2023).

In the machine learning literature, researchers study the data integration problem under the umbrella of domain adaptation (Ben-David et al., 2006). A recent seminal work in this area is the Invariant Risk Minimization framework of Arjovsky et al. (2020), which aims to find a feature representation that mitigates domain-specific spurious correlations. This framework is conceptually similar to the invariant features model, although the community has recently pointed out its shortcomings (Kamath et al., 2021; Rosenfeld et al., 2021). We refer readers to the survey Farahani et al. (2021) for a general overview of developments in this field.

1.2. Structure

The rest of the paper is structured as follows. In Section 2.1 we outline the version of the invariant features model that we consider in detail, and describe the primary statistical challenges arising under this model. In Section 2.2 we describe our proposed Fast Invariant Generalized Linear Model (FILM) estimator and provide an alternating optimization procedure to compute it. Next, we describe the Robust-FILM method in Section 2.3. Section 3 gives simulation results comparing our method to other methods in the literature, and in Section 4 we use FILM to build a risk prediction model for chronic kidney disease using electronic health records from the All of Us program.

2. Method

2.1. Model and problem setup

We let ℰ denote the set of environments, and for each e ∈ ℰ, we observe iid pairs (Y_i^e, X_i^e)_{i=1}^{n_e} for Y_i^e ∈ ℝ and X_i^e ∈ ℝ^p. We will use Y^e ∈ ℝ^{n_e} to refer to the vector of outcomes and X^e ∈ ℝ^{n_e × p} to indicate the matrix of covariates from environment e. Following the invariant features model proposed by Peters et al. (2016), we assume that there exists a set of features S ⊆ [p] such that the following generalized linear model holds for all e ∈ ℰ:

g(E[Y_i^e | X_{i,S}^e]) = θ^e + β_S^T X_{i,S}^e    (2)

where g is a known canonical link function and θ^e ∈ ℝ denotes an environment specific intercept term. We collect θ = (θ^1, …, θ^{|ℰ|})^T ∈ ℝ^{|ℰ|}. Our aim is to estimate the vector β ∈ ℝ^p, which has support supp(β) = S. We emphasize that β is shared across all environments; for this reason, we refer to β as the vector of invariant effects and the features j ∈ S as invariant features.

Let

ε_i^e = Y_i^e − E[Y_i^e | X_{i,S}^e] = Y_i^e − g^{−1}(θ^e + β_S^T X_{i,S}^e).

Under the moment condition assumed in Model (2), we have that E[X_{i,j}^e ε_i^e] = 0 for each j ∈ S, but we have no guarantee that E[X_{i,j}^e ε_i^e] = 0 for any j ∉ S. We call any j with E[X_{i,j}^e ε_i^e] ≠ 0 a spurious feature in environment e with spurious correlation E[X_{i,j}^e ε_i^e]. Including a spurious feature in a prediction model may enhance accuracy within a specific environment but will likely reduce the model’s transferability to other environments. Importantly, we will not presuppose knowledge of the set S; rather, we observe the pairs (Y_i^e, X_i^e)_{i=1}^{n_e} blindly, without knowing which features are invariant and which are spurious. This distinguishes the invariant features model in (2) from, for instance, the instrumental variables model (Angrist et al., 1996), which requires knowledge of which features are uncorrelated with the residual term. This adds further motivation to our aim of estimating β, as a good estimate of these invariant effects may lead us to the features that affect the outcome identically across environments.

The primary difficulty in estimating β under Model (2) arises from the potential presence of spurious features across the environments. Naively using a maximum likelihood or empirical risk minimization estimator in the presence of spurious features will lead to highly biased estimates of the invariant effects. To demonstrate this, suppose that g(x) = x, which yields the linear model for Y_i^e in X_{i,S}^e, and suppose that Y^e is appropriately centered so that θ^e = 0. If we use the single-environment ordinary least squares estimator for β, an elementary calculation reveals

β̂_OLS = β + ((1/n_e) Σ_{i=1}^{n_e} X_i^e (X_i^e)^T)^{−1} (1/n_e) (X^e)^T ε^e  →_P  β + (Σ^e)^{−1} E[X^e ε^e]    (3)

where the limit in probability uses the weak law of large numbers and Slutsky’s theorem, and assumes that the covariates have a finite second moment Σ^e. When E[X^e ε^e] ≠ 0 (i.e., if our data include any spurious features), the OLS estimator is not even consistent. A natural idea to improve this estimator is to run OLS on the pooled data from all the environments in ℰ, resulting in the estimator

β̂_Pooled-OLS = argmin_β (1/|ℰ|) Σ_{e∈ℰ} (1/n_e) ‖Y^e − X^e β‖_2^2    (4)

However, this approach also incurs bias from spurious features, as Proposition 4.1 in Fan et al. (2024) shows

‖β̂_Pooled-OLS − β‖_2 ≳ ‖(1/|ℰ|) Σ_{e∈ℰ} E[X^e ε^e]‖_2    (5)

Thus, simply pooling data from distinct environments is insufficient to address the bias from spurious features. To overcome this, a strategy commonly adopted in the literature (Arjovsky et al., 2020; Fan et al., 2024; Yin et al., 2024) is to augment a pooled loss function with a gradient-based penalty term which exploits our assumption that the data from each environment satisfies the moment condition in Model (2). This is the path that we will take to derive our estimator of β using the general framework outlined in Fan et al. (2024).
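To see this bias numerically, a short simulation (illustrative, not from the paper) can pool two linear-model environments whose second feature is spurious with environment-specific strength gamma, then run OLS on the stacked data:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_env(n, gamma, rng, beta1=2.0):
    """One environment: y depends only on x1; x2 is spurious with strength gamma."""
    x1 = rng.normal(size=n)
    eps = rng.normal(size=n)
    y = beta1 * x1 + eps
    x2 = gamma * y + rng.normal(size=n)   # correlated with the residual through y
    return np.column_stack([x1, x2]), y

# two environments with different spurious strengths, pooled together
Xs, ys = zip(*(draw_env(5000, g, rng) for g in (1.0, 2.0)))
Xp, yp = np.vstack(Xs), np.concatenate(ys)

beta_pooled, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
print(beta_pooled)  # loads heavily on the spurious x2 and shrinks the true effect on x1
```

The pooled fit assigns a clearly nonzero coefficient to the spurious feature and a biased coefficient to the invariant one, matching the bias described in (5).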

Under Model (2), our data from environment e admit a negative log-likelihood of the form

L_e(β, θ^e) = (1/n_e) Σ_{i=1}^{n_e} {ψ(β^T X_i^e + θ^e) − Y_i^e (β^T X_i^e + θ^e)}    (6)

where the function ψ is uniquely determined by g via ψ′ = g^{−1}. Note that we consider L_e as a function on ℝ^p × ℝ; for j ∈ {1, …, p}, we use the notation ∇_j L_e(β, θ^e) = (d/dβ_j) L_e(β, θ^e). The key observation (Fan et al., 2024; Yin et al., 2024) is that for each j ∈ S, we have E[∇_j L_e(β, θ^e)] = 0 for all e ∈ ℰ. This follows directly from (2). Since ∇_j L_e is an average of iid observations, it follows that ∇_j L_e(β, θ^e) →_P 0 for all j ∈ S, e ∈ ℰ, indicating that any reasonable estimators of β and θ should satisfy

∇_j L_e(β̂, θ̂^e) ≈ 0    (7)

for all j ∈ supp(β̂) and e ∈ ℰ. This is the intuition behind the Environment Invariant Linear Least Squares (EILLS) method proposed by Fan et al. (2024). Their EILLS estimator is defined as

β̂_EILLS = argmin_β { (1/|ℰ|) Σ_{e∈ℰ} L_e(β, 0) + λ_1 (1/|ℰ|) Σ_{e∈ℰ} Σ_{j=1}^p 1(β_j ≠ 0) (∇_j L_e(β, 0))^2 + λ_2 ‖β‖_0 }    (8)

where they assume that θ^e = 0 for all e ∈ ℰ. The first penalty term phases out spurious features by enforcing the moment condition (7) across all of the environments, and the second penalty removes features that do not admit any spurious correlation but have no effect on the outcome. The pooled loss function ensures that the resulting estimator maintains good predictive performance, which excludes trivial solutions to (7) such as the zero vector in ℝ^p. The authors of Fan et al. (2024) provide a complete set of theoretical results for β̂_EILLS in the linear model case (i.e., when Model (2) is satisfied with g(x) = x). They show that, under mild conditions, the true vector of invariant effects β is identified by the solution to the population-level analog of the EILLS optimization problem in (8), and give statistical results demonstrating that β̂_EILLS is an efficient estimator of β. We refer the reader to Theorems 4.2, 4.3, and 4.4 in Fan et al. (2024) for details.

Unfortunately, computing β̂_EILLS is infeasible for even moderate p due to the dependence on the support of β in the invariance-inducing penalty. The authors of Fan et al. (2024) propose a brute-force approach, in which the user of the EILLS method iterates through every possible support set S ⊆ [p] and solves Equation (8) over vectors with support S, then takes the vector that gives the best loss value as the resulting estimator. In other words, computing β̂_EILLS requires solving a quadratic program for every set S ⊆ [p], incurring a runtime of at least Ω(2^p). This prevents the EILLS estimator from being a practical tool for building transferable models in realistic data analysis settings, as p may be in the hundreds, thousands, or even millions in contemporary datasets.
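To make the enumeration cost concrete, the sketch below follows the brute-force pattern of one restricted fit per support set. The per-support fit is plain least squares with a crude ℓ0-style penalty rather than the full EILLS objective (the invariance penalty is omitted for brevity), so this illustrates only the 2^p enumeration, not the estimator itself:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
p, n = 8, 200
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)   # only feature 0 matters

n_fits = 0
best_loss, best_S = np.inf, ()
for r in range(p + 1):
    for S in itertools.combinations(range(p), r):   # every support S subset of [p]
        n_fits += 1
        if not S:
            loss = np.mean(y ** 2)
        else:
            b, *_ = np.linalg.lstsq(X[:, list(S)], y, rcond=None)
            loss = np.mean((y - X[:, list(S)] @ b) ** 2) + 0.1 * len(S)  # crude l0 penalty
        if loss < best_loss:
            best_loss, best_S = loss, S

print(n_fits)   # 2**8 = 256 restricted fits even for p = 8
print(best_S)
```

Even this stripped-down version visits 2^p candidate supports, which is what makes exhaustive search hopeless once p reaches realistic sizes.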

2.2. Proposed method

Our first contribution is a new, computationally efficient estimator of β that builds on the EILLS framework. We derive our method via a linear relaxation of the minimization problem in (8) as follows. We parametrize β as an element-wise product β = a ⊙ b, where a ∈ [0,1]^p is a relaxation of the support of β and b ∈ ℝ^p contributes the magnitude of each entry. Although the true coefficient vector β does not admit a unique parametrization of this form, relaxing the support indicator to lie in [0, 1] circumvents the combinatorial nature of the EILLS program in Equation (8). Using this re-parametrization, we define a relaxed loss function as

Q(a, b, θ) = (1/|ℰ|) Σ_{e∈ℰ} L_e(a ⊙ b, θ^e) + λ_1 (1/|ℰ|) Σ_{e∈ℰ} Σ_{j=1}^p a_j^2 (∇_j L_e(a ⊙ b, θ^e))^2 + λ_2 Σ_{j=1}^p a_j(1 − a_j)    (9)

Our Fast Invariant Generalized Linear Model (FILM) estimator is defined as

β̂_FILM = â ⊙ b̂;  (â, b̂, θ̂) = argmin_{a ∈ [0,1]^p, b ∈ ℝ^p, θ ∈ ℝ^{|ℰ|}} Q(a, b, θ)    (10)

We include the second penalty in Equation (9) to encourage the entries of a to live near either 0 or 1, which improves interpretability of the resulting FILM estimator as well as our method’s agreement with the original EILLS approach. Additionally, we use aj2 in the first penalty instead of aj to further down-weight features that have a non-zero gradient while maintaining a functional form that is easy to minimize.
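As a concrete illustration, the relaxed objective in Equation (9) can be written out for the linear-model case (g(x) = x). This is a sketch under assumed conventions (a squared-error loss, which matches the GLM negative log-likelihood up to a term constant in β; environments stored as (X_e, y_e) pairs), not the authors' code:

```python
import numpy as np

def env_loss_and_grad(beta, theta_e, Xe, ye):
    """Squared-error loss L_e and its gradient in beta for one environment."""
    r = ye - (Xe @ beta + theta_e)
    loss = 0.5 * np.mean(r ** 2)
    grad = -Xe.T @ r / len(ye)
    return loss, grad

def Q(a, b, theta, envs, lam1, lam2):
    beta = a * b                                  # element-wise reparametrization
    E = len(envs)
    pooled, invariance = 0.0, 0.0
    for theta_e, (Xe, ye) in zip(theta, envs):
        loss, grad = env_loss_and_grad(beta, theta_e, Xe, ye)
        pooled += loss
        invariance += np.sum(a ** 2 * grad ** 2)  # a_j^2 (grad_j L_e)^2 terms
    # pooled loss + invariance penalty + penalty pushing a_j toward {0, 1}
    return pooled / E + lam1 * invariance / E + lam2 * np.sum(a * (1 - a))

rng = np.random.default_rng(2)
Xe = rng.normal(size=(100, 3))
ye = Xe[:, 0] + rng.normal(size=100)
val = Q(np.ones(3), np.array([1.0, 0.0, 0.0]), np.zeros(1), [(Xe, ye)], lam1=1.0, lam2=0.1)
print(val)  # a finite, nonnegative objective value
```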

While our proposed loss function Q(a, b, θ) removes the combinatorial aspect of the optimization problem, it is still non-trivial to compute β̂_FILM. This is because Q is jointly non-convex in a and b due to the element-wise products in the pooled loss function and the first penalty term. Solving the optimization problem in Equation (10) directly, e.g. via gradient descent, is not guaranteed to yield a global minimizer of Q. To overcome this, we propose an iterative alternating minimization scheme that optimizes over a, b, and θ separately, and leverages a quadratic surrogate objective function that improves computational efficiency.

To describe the (k+1)th step of our algorithm, let (a^(k), b^(k), θ^(k)) denote the values of a, b, and θ obtained from the kth iteration. Our first aim is to compute a^(k+1). Letting β^(k) = a^(k) ⊙ b^(k), we introduce the function

Q̄(a; b^(k), θ^(k), β^(k)) = (1/|ℰ|) Σ_{e∈ℰ} L_e(a ⊙ b^(k), θ^(k)) + (λ_1/|ℰ|) Σ_{e∈ℰ} Σ_{j=1}^p a_j^2 (∇_j L_e(β^(k), θ^(k)))^2 + λ_2 Σ_{j=1}^p a_j(1 − a_j)    (11)

We then define a(k+1) as

a^(k+1) = argmin_{a ∈ [0,1]^p} Q̄(a; b^(k), θ^(k), β^(k))    (12)

Our rationale for optimizing over Q̄ instead of Q with b^(k) and θ^(k) held constant is purely computational. By fixing β^(k) in the gradient term of the invariance-inducing penalty, we simplify both penalties to be quadratic in a, granting us a function that is very efficient to minimize. We also avoid potential spurious minima that may be introduced by optimizing over the products a_j^2 (∇_j L_e(a ⊙ b^(k), θ^(k)))^2 that arise in the form of Q. To compute b^(k+1) and θ^(k+1), we optimize Q directly, granting

b^(k+1) = argmin_{b ∈ ℝ^p} Q(a^(k+1), b, θ^(k))    (13)
θ^(k+1) = argmin_{θ ∈ ℝ^{|ℰ|}} Q(a^(k+1), b^(k+1), θ)    (14)

We point out that the runtime of each constituent optimization problem is quadratic, meaning that optimizing over b and θ separately incurs a runtime of O(p^2 + |ℰ|^2), whereas optimizing over them jointly runs in O((p + |ℰ|)^2). When p is large, optimizing over b and θ separately may therefore provide a nontrivial improvement in runtime, which motivates us to adopt this approach rather than solving for b and θ jointly. We also allow the user to leverage domain-specific knowledge of the data generating mechanism to improve the performance of our algorithm. In many application areas, for instance healthcare or biomedical science, prior studies can provide evidence that particular features, indicated by the set J, have invariant or causal effects on the outcome of interest. In this case, we want to enforce a_j = 1 for all j ∈ J in the optimization process. This is straightforward to achieve in our computation of a^(k+1) by adding an additional constraint to the optimization problem in Equation (12). We demonstrate via simulation studies in Section 3 that such prior knowledge of the invariant features can grant our algorithm better estimation performance, although it is not strictly necessary. We summarize our iterative procedure in Algorithm 2.1. In the supplement, we provide simulation results demonstrating the advantage of Algorithm 2.1 over direct minimization of the objective function Q(a, b, θ).
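For the linear-model case, the alternating updates (12)-(14) can be sketched with simple first-order inner solvers. This is an illustrative sketch, not the paper's implementation: the a-step takes a projected gradient step on the surrogate Q̄ (clipping to [0,1] in place of an exact box-constrained solve), the b-step takes a gradient step on the pooled loss only (omitting the penalty's dependence on b), and the θ-step is exact. Whether such a simplified scheme recovers the invariant effects depends on tuning and initialization, which is part of why the paper pairs Algorithm 2.1 with cross-validated restarts (Algorithm 2.2):

```python
import numpy as np

def film_alternating(envs, lam1=1.0, lam2=0.1, n_iter=300, lr=0.02):
    """Schematic alternating updates for squared-error loss.
    envs is a list of (X_e, y_e) pairs, one per environment."""
    p, E = envs[0][0].shape[1], len(envs)
    a = np.full(p, 0.9)                       # relaxed support, start near 1
    # initialize b at the pooled OLS solution so beta^(0) is predictive
    Xp = np.vstack([Xe for Xe, _ in envs])
    yp = np.concatenate([ye for _, ye in envs])
    b, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
    theta = np.zeros(E)

    def per_env_grads(beta):
        # gradient of each environment's loss L_e at (beta, theta^e)
        return [-Xe.T @ (ye - Xe @ beta - th) / len(ye)
                for th, (Xe, ye) in zip(theta, envs)]

    for _ in range(n_iter):
        # a-step on Q_bar: beta^(k) frozen inside the invariance penalty
        g_list = per_env_grads(a * b)
        g_a = sum(b * g + 2.0 * lam1 * a * g ** 2 for g in g_list) / E
        g_a += lam2 * (1.0 - 2.0 * a)
        a = np.clip(a - lr * g_a, 0.0, 1.0)   # project onto [0, 1]^p
        # b-step: pooled-loss gradient with a held fixed
        g_b = sum(a * g for g in per_env_grads(a * b)) / E
        b = b - lr * g_b
        # theta-step: exact environment-wise intercepts
        theta = np.array([np.mean(ye - Xe @ (a * b)) for Xe, ye in envs])
    return a, a * b

rng = np.random.default_rng(3)
envs = []
for gamma in (1.0, 2.0):                      # two environments
    x1 = rng.normal(size=400)
    y = 2.0 * x1 + rng.normal(size=400)
    x2 = gamma * y + rng.normal(size=400)     # spurious feature
    envs.append((np.column_stack([x1, x2]), y))
a_hat, beta_hat = film_alternating(envs)
print(a_hat, beta_hat)
```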

Remark 2.1. The form of the relaxation used in the definition of our FILM estimator is conceptually similar to that of the constrained causal optimization (CoCo) method of Yin et al. (2024). Given data from a set of environments ℰ, the CoCo algorithm solves

β̂_CoCo = argmin_b ‖(1/|ℰ|) Σ_{e∈ℰ} b̃ ⊙ ∇L_e(b)‖^2  for  b̃_J = 1, b̃_{J^c} = b_{J^c}    (15)

where J is a set of known exogenous variables. The key difference between our FILM approach and the CoCo method is our decoupling of the support and magnitude of β via the parameters a and b. This severely reduces our dependence on knowledge of a set J of exogenous variables. If the set J is misspecified or missing, the statistical performance of the CoCo algorithm may suffer. Furthermore, we include the pooled loss function by default in Equation (9) to phase out uninformative solutions. While the authors of Yin et al. (2024) provide a risk-regularized variant of CoCo, they do not evaluate this approach for generalized linear models systematically via simulations in their work. Finally, the CoCo method is not robust to misspecified or corrupted environments, unlike our Robust-FILM estimator described in Section 2.3. We validate these points empirically in Section 3.

Algorithm 2.1 Alternating minimization for computing β^FILM
Given data (Y_i^e, X_i^e)_{i=1}^{n_e} from a set of environments ℰ
Given a set of known exogenous variables J ⊆ [p]
Initialize a^(0) ∈ [0,1]^p, b^(0) ∈ ℝ^p, θ^(0) ∈ ℝ^{|ℰ|}, and β^(0) = a^(0) ⊙ b^(0)
for k = 0 to max.iter do
    if J ≠ ∅ then
        a^(k+1) ← argmin_{a ∈ [0,1]^p} Q̄(a; b^(k), θ^(k), β^(k)) s.t. a_j = 1 ∀ j ∈ J
    else
        a^(k+1) ← argmin_{a ∈ [0,1]^p} Q̄(a; b^(k), θ^(k), β^(k))
    end if
    b^(k+1) ← argmin_b Q(a^(k+1), b, θ^(k))
    θ^(k+1) ← argmin_θ Q(a^(k+1), b^(k+1), θ)
    β^(k+1) ← a^(k+1) ⊙ b^(k+1)
end for
return (a^(max.iter), b^(max.iter), θ^(max.iter))

We also propose a two-stage cross-validation procedure to choose the tuning parameters λ_1 and λ_2, as well as the initial iterates (a^(0), b^(0), θ^(0)). We evaluate CV error on data from a validation environment that is distinct from the environments used in the training process to ensure that the chosen tuning parameters do not over-fit to the training environments, as λ_1 needs to be calibrated to balance in-sample predictive performance and invariance. Furthermore, comparing the performance of multiple initializations helps guarantee that our iterative algorithm does not get stuck in a spurious local minimum, which would yield a poor estimate β̂_FILM. This cross-validation procedure is detailed in Algorithm 2.2.
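The selection logic of this procedure reduces to a grid search over initializations and tuning pairs, scored on the validation environment. The sketch below uses toy stand-ins (`fit` for a run of the inner algorithm, `val_loss` for the validation criterion); both names are ours, not the paper's:

```python
import itertools
import numpy as np

def select_by_validation(fit, val_loss, inits, tunings):
    """Fit for every (initialization, tuning) pair; keep the best validation score."""
    best_score, best_params = np.inf, None
    for init, lam in itertools.product(inits, tunings):
        params = fit(init, lam)       # stand-in for running the inner algorithm
        score = val_loss(params)      # stand-in for the validation objective
        if score < best_score:
            best_score, best_params = score, params
    return best_params

# toy stand-ins: "fitting" shifts the init by lam; validation prefers values near 0
fit = lambda init, lam: init - lam
val_loss = lambda params: abs(params)
chosen = select_by_validation(fit, val_loss, inits=[0.0, 1.0, 5.0], tunings=[0.0, 1.0])
print(chosen)  # 0.0
```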

Remark 2.2. Our relaxation does not retain the identification and estimation results given in Theorems 4.2 and 4.3 of Fan et al. (2024), because the resulting non-convex objective function Q admits many local solutions that are not necessarily globally optimal. We overcome this challenge by running cross-validation over a grid of initializations as described in Algorithm 2.2. We emphasize that our objective is not to identify and estimate the invariant effects, but rather to build a transferable prediction model across the environments under the invariant generalized linear model. Treating estimation of β as a surrogate for invariant prediction, our proposed FILM estimator achieves excellent performance, as demonstrated in the simulations and real data analysis of Sections 3 and 4 respectively, indicating that our non-standard relaxation is able to recover a transferable prediction rule across a variety of settings.

Algorithm 2.2 Alternating minimization with cross-validation
Given data from a set of training environments ℰ_tr and a validation environment ℰ_val
Given a set of initial points {(a_l^(0), b_l^(0), θ_l^(0)) : l = 1, …, L}
Given a set of tuning parameters {(λ_{1,m}, λ_{2,m}) : m = 1, …, M}
for l = 1 to L do
    for m = 1 to M do
        Compute (â_{l,m}, b̂_{l,m}, θ̂_{l,m}) by running Algorithm 2.1 over the data from ℰ_tr with
        initial point (a_l^(0), b_l^(0), θ_l^(0)) and tuning parameters (λ_{1,m}, λ_{2,m}).
    end for
end for
Compute (l*, m*) = argmin_{l,m} Q_val(â_{l,m}, b̂_{l,m}, θ̂_{l,m}) with tuning parameters (λ_{1,m}, λ_{2,m})
return (â_{l*,m*}, b̂_{l*,m*}, θ̂_{l*,m*})

2.3. Robust extension

Our assumption that Model (2) holds for all e ∈ ℰ may be too strong in practice. In real data settings, some environments may fail to satisfy Model (2) for a variety of reasons, including model shifts, feature misalignment, or data corruption. To protect against possibly misspecified environments, we introduce the robust loss function

Q_Robust(a, b) = Ψ({L_e(a ⊙ b) : e ∈ ℰ}) + λ_1 Ψ({Σ_{j=1}^p a_j^2 (∇_j L_e(a ⊙ b))^2 : e ∈ ℰ}) + λ_2 Σ_{j=1}^p a_j(1 − a_j)    (16)

where Ψ : ℝ^{|ℰ|} → ℝ is a robust measure of centrality, such as the median or trimmed mean.

Our Robust-FILM estimator is then defined as

β̂_Robust-FILM = â ⊙ b̂;  (â, b̂) = argmin_{a ∈ [0,1]^p, b ∈ ℝ^p} Q_Robust(a, b)    (17)

which we also compute via alternating minimization using a slightly modified version of Algorithm 2.1. We provide details in the appendix. This approach is inspired by the median-of-means literature (Devroye et al., 2016; Lecué & Lerasle, 2020), as we replace a ‘global’ sample average (in this case, over all the training environments) with a robust accumulation of ‘local’ sample averages. In Section 3, we demonstrate empirically that an appropriate choice of Ψ grants excellent statistical performance in the presence of outlier environments with only a minor cost in runtime.
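The effect of swapping the environment-average for a robust Ψ can be seen on a toy vector of per-environment loss values (the numbers below are illustrative only):

```python
import numpy as np

def trimmed_mean(values, trim=0.25):
    """Mean after dropping the lowest and highest trim-fraction of values."""
    v = np.sort(np.asarray(values))
    k = int(len(v) * trim)
    return v[k:len(v) - k].mean() if len(v) > 2 * k else v.mean()

env_losses = [0.9, 1.1, 1.0, 25.0]    # one corrupted environment inflates its loss

print(np.mean(env_losses))            # 7.0  -- the plain average is dominated by the outlier
print(np.median(env_losses))          # 1.05 -- robust to it
print(trimmed_mean(env_losses))       # 1.05 -- 25% trim drops one value from each side
```

The same logic carries over term-by-term to Equation (16): each penalty is aggregated across environments with Ψ instead of the mean, so one corrupted environment cannot dominate the objective.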

3. Simulations

We investigate the properties of the FILM and Robust-FILM estimators using data simulated from the invariant features model in (2) with the identity and logit link functions, corresponding to the linear and logistic regression models respectively. We again let ℰ denote the set of environments and form the partition ℰ = I ∪ O, where I denotes the set of inlier environments and O the outliers. For each e ∈ I, we generate n_e i.i.d. copies from the following data generating process:

Data generating process for e ∈ I. Given β, γ^e, θ^e, σ^e:
    X_S^e ∼ N(0, I) ∈ ℝ^s
    For linear models: ε^e ∼ N(0, (σ^e)^2) ∈ ℝ and Y^e = θ^e + β_S^T X_S^e + ε^e ∈ ℝ
    For logistic models: Y^e ∼ Bern(expit(θ^e + β_S^T X_S^e))
    X_{S^c}^e = γ^e Y^e 1_q + N(0, I) ∈ ℝ^q
    Z^e ∼ N(0, I) ∈ ℝ^{p−(s+q)}

where β, γ^e, Σ^e, θ^e, and σ^e are varied in each simulation setting. We collect the covariates into the vector (X_S^e, X_{S^c}^e, Z^e) ∈ ℝ^p. X_{S^c}^e denotes the spurious features with corresponding spurious correlation γ^e. The Z^e vector represents null features that are not related to the outcome, and are included to increase the complexity of the problem. In the results that follow, we set s = 3 and q = 1 with I = {1, 2, 3} and β = (2, 3, 4, 0, 0, …, 0)^T. Note that this defines S = {1, 2, 3} as the set of invariant features.
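The inlier data generating process can be sketched as follows for the linear case; the function and argument names are ours, not the paper's:

```python
import numpy as np

def draw_inlier_env(n, p, beta_S, gamma_e, theta_e, sigma_e, rng):
    """One inlier environment: invariant, spurious (q = 1), and null features."""
    s, q = len(beta_S), 1
    X_S = rng.normal(size=(n, s))                            # invariant features
    eps = rng.normal(scale=sigma_e, size=n)
    y = theta_e + X_S @ beta_S + eps                         # linear outcome model
    X_spur = gamma_e * y[:, None] + rng.normal(size=(n, q))  # spurious features
    Z = rng.normal(size=(n, p - s - q))                      # null features
    return np.hstack([X_S, X_spur, Z]), y

rng = np.random.default_rng(4)
X, y = draw_inlier_env(n=500, p=10, beta_S=np.array([2.0, 3.0, 4.0]),
                       gamma_e=2.0, theta_e=0.0, sigma_e=1.0, rng=rng)
print(X.shape, y.shape)  # (500, 10) (500,)
```

The spurious column is built from the realized outcome, so it is strongly correlated with Y^e within this environment even though it has no invariant effect.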

We will evaluate four versions of our FILM method. The first runs Algorithm 2.2 directly with J = ∅, representing a naive application of FILM that neither incorporates prior information nor protects against misspecified or corrupted environments. Our second implementation modifies this naive approach by setting J = {1} to study the effect of partial oracle knowledge of the support of β on the performance of FILM. Additionally, we fit robust versions of both of these methods by solving Equation (17) with Algorithm 2.2 and Ψ as the median function. In all cases, we use ℰ_tr = {1, 2} ∪ O and ℰ_val = {3}. We fix λ_2 = 0.1 and choose λ_1 with cross validation over a grid of evenly spaced values from 50 to 125. We initialize our method by running cross validation over 30 randomly generated candidate initial points.

To understand the performance of FILM relative to alternative methods, we also fit an oracle generalized linear model using only the invariant features XSe with samples only from the inlier environments I with an environment-level intercept. This model acts as our gold standard, as it uses knowledge of the data-generating process that is typically unavailable to the statistician. We additionally fit the CoCo method by solving Equation (15) with J={1}, and in the linear model setting we fit EILLS with λ1=1 and λ2=0.1. Finally, we fit a basic pooled generalized linear model with all available features and data from all environments.

3.1. Results without outliers

In the first set of results, we set O = ∅, and we investigate the performance of each method discussed above while varying the sample size per environment n, number of features p, number of environments |ℰ| = E, and the minimum strength of spurious correlations within each environment γ. Upon fixing the parameter γ, we choose each γ^e for e ∈ ℰ along an evenly spaced grid from γ to γ + 4. By default, we set n = 500, p = 10, E = 3, and γ = 2. We measure the performance of each method via its estimation error ‖β̂ − β‖_2 and accuracy of variable selection. Here, accuracy is defined as the sum of the number of true positives (β̂_j ≠ 0 for j ∈ S) and true negatives (β̂_j = 0 for j ∉ S), divided by p. The results are given in Figures 1 and 2 respectively.
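The accuracy metric can be written directly (0-based feature indices here, versus 1-based in the text):

```python
import numpy as np

def selection_accuracy(beta_hat, S, p):
    """(true positives + true negatives) / p for an estimated coefficient vector."""
    support = set(np.flatnonzero(beta_hat))        # indices kept by the estimator
    tp = len(support & S)                          # invariant features kept
    tn = len(set(range(p)) - support - S)          # non-invariant features zeroed out
    return (tp + tn) / p

# misses invariant feature 2 and keeps noise feature 3: (2 + 1) / 5 = 0.6
beta_hat = np.array([1.9, 3.2, 0.0, 0.5, 0.0])
print(selection_accuracy(beta_hat, S={0, 1, 2}, p=5))  # 0.6
```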

Figure 1: Median estimation error over 50 replications. Results are given for the linear and logistic models along the rows. The columns of the figure delineate the parameter varied along the x-axis (sample size, number of features, number of environments, and γ). The y-axis is on the logarithmic scale.

Figure 2: Median accuracy over 50 replications. Results are given for the linear and logistic models along the rows. The columns of the figure delineate the parameter varied along the x-axis (sample size, number of features, number of environments, and γ).

In these figures, we can see that all four versions of FILM are able to mimic the performance of the gold standard oracle estimator and EILLS over a variety of settings. The Robust-FILM estimators perform nearly identically to their non-robust analogs, which indicates that Robust-FILM is safe to use in all cases, even when the analyst does not suspect that any outlying environments are present in the data. This concordance between FILM and Robust-FILM begins to fail as the number of environments increases, but this is reasonable, as minimizing over the median of a set of values becomes more challenging as the number of inputs increases. We also see that including knowledge of an invariant feature by setting J = {1} can improve both estimation error and accuracy, but the improvement is not substantial. The CoCo method shows relatively poor performance in these simulations since we give it knowledge of only one entry of the set S. Finally, the naive pooled GLM is badly biased because it ignores cross-environment heterogeneity and the spurious correlations. This bias becomes especially severe as γ increases.

3.2. Results with outliers

Now we add one environment to O to investigate the effectiveness of Robust-FILM when non-trivial outliers are present in the training data. We draw the outlier environment from each of the following four schemes:

  1. Pure noise: We draw X_e ~iid t₂ and Y_e ~iid Unif(−5, 5) (for linear models) or Y_e ~iid Bern(0.1) (for logistic models).

  2. Permuted features: Draw (X_e, Y_e) as if e ∈ I, but return (π(X_e), Y_e) for a permutation π of the features.

  3. Rotated features: Generate (X_e, Y_e) as if e ∈ I, but return (OX_e, Y_e) for a random rotation matrix O.

  4. Flipped effects: Generate (X_e, Y_e) as if e ∈ I, but with causal effects −β.
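The first three schemes can be sketched as follows. This is an illustrative Python sketch under assumed names (`pure_noise`, `permute_features`, `rotate_features`), not the authors' R simulation code; the random rotation is built from the QR decomposition of a Gaussian matrix, one standard construction of a random orthogonal matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def pure_noise(n, p):
    # Scheme 1: heavy-tailed covariates with an outcome unrelated to X
    X = rng.standard_t(df=2, size=(n, p))
    y = rng.uniform(-5, 5, size=n)  # linear-model version of the outcome
    return X, y

def permute_features(X):
    # Scheme 2: apply a random permutation pi to the columns of X
    return X[:, rng.permutation(X.shape[1])]

def rotate_features(X):
    # Scheme 3: multiply each feature vector by a random orthogonal
    # matrix O, obtained via QR decomposition of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], X.shape[1])))
    return X @ Q
```

Because Q is orthogonal, the rotation preserves the norm of each feature vector while destroying the original coordinate alignment between features and effects.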

We keep n=500, E=3, p=10, and γ=2, and evaluate each method by estimation error. The oracle method is computed using only the data from I so that it acts as our gold standard once more. Results are given in Table 1.

Table 1:

Median estimation error over 50 replications under each of the four outlier schemes. For the sake of space, we abbreviate ‘Robust-FILM’ as ‘R-FILM’.

Model Outlier scheme Oracle GLM Pooled GLM FILM FILM(J={1}) R-FILM R-FILM(J={1}) EILLS CoCo
Linear Pure noise 0.037 5.059 5.385 5.244 0.100 0.082 5.385 5.368
- Permuted 0.040 3.871 5.385 5.049 0.081 0.082 2.413 3.784
- Rotated 0.038 4.840 5.385 5.240 0.088 0.077 5.392 5.157
- Flipped effects 0.035 6.152 5.385 5.183 0.091 0.084 5.388 4.025
Logistic Pure noise 0.277 5.035 5.385 5.294 0.572 0.470 5.385 5.275
- Permuted 0.229 4.289 5.385 5.255 0.550 0.422 5.388 5.235
- Rotated 0.232 4.663 5.385 5.334 0.669 0.506 5.357 5.213
- Flipped effects 0.265 5.396 5.385 5.317 0.501 0.414 5.388 5.291

We can see that under each outlier setting, our Robust-FILM procedure achieves the best performance outside of the oracle estimator. Similar to the outlier-free results, we see that including J={1} is helpful, but Robust-FILM performs very well even without this additional information.

3.3. Numerical runtime

We also run a set of benchmarking simulations to evaluate the runtime of EILLS, FILM, and CoCo over varying values of p. For p ∈ {5, 10, 15, 20}, we generate data from 2 environments with n₁ = n₂ = 100 from Model (2) with g(x) = x. We compute β̂_EILLS by solving Equation (8) via brute force, as suggested by Fan et al. (2024). We compute the FILM estimator via Algorithm (2.1), and we compute the CoCo estimator by solving Equation (15). We also include results for Robust-FILM, which are obtained by solving (17) with Ψ(·) = median(·). We fit each method 5 times for each value of p and provide the mean runtime in seconds in Figure 3.

Figure 3:

Average runtime over 5 replications in seconds by number of features for EILLS, FILM, Robust-FILM, and CoCo. The y-axis is on the logarithmic scale.

We see in Figure 3 that the runtime of computing the EILLS estimator increases exponentially in the dimension p, as expected. We also see that CoCo is consistently the fastest method. This is not surprising, as CoCo requires solving only one optimization problem, whereas FILM and Robust-FILM involve a nested series of alternating minimization problems. However, FILM and Robust-FILM still run comparably fast to CoCo while offering superior statistical performance, as discussed in the preceding subsections. Furthermore, Robust-FILM performs similarly to FILM in terms of runtime, suggesting that the flexibility and robustness granted by the Robust-FILM method can be achieved at very little computational cost.
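The robustness mechanism that adds so little runtime is easy to see numerically: replacing the mean of the per-environment objective values with their median bounds the influence of any single corrupted environment. The following is our own toy illustration of this aggregation step, not the authors' objective function.

```python
import numpy as np

def aggregate(losses, robust=False):
    """Aggregate per-environment loss values.

    robust=False mimics mean aggregation across environments;
    robust=True mimics median aggregation (Psi = median), which
    bounds the influence of any single corrupted environment.
    """
    losses = np.asarray(losses, dtype=float)
    return np.median(losses) if robust else losses.mean()

clean = [0.9, 1.0, 1.1]
corrupted = clean + [50.0]  # one wildly misspecified environment
# mean jumps from 1.0 to 13.25, while the median only moves to 1.05
```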

4. Real data analysis

4.1. Background

We demonstrate the real world effectiveness of our FILM method on kidney disease data from the All of Us research program (Mayo et al., 2023). The All of Us program is a nationwide initiative by the National Institutes of Health that aims to integrate multimodal biomedical data, including genetic profiles, electronic health records (EHR), and wearable device measurements, from diverse patient cohorts at partner institutions across the United States. Such a wealth of information offers great opportunities for researchers to learn from heterogeneous populations, and in particular build disease risk prediction models with transferable performance across patient subgroups.

Chronic kidney disease (CKD) is a critical and rapidly growing global health challenge, currently affecting approximately 10% of the global population, or over 850 million individuals worldwide (International Society of Nephrology, 2023). CKD accounts for over 1.5 million deaths annually (American Society of Hematology, 2024) and is projected to become the fifth leading cause of death globally by 2040 (Moura et al., 2024). Advanced CKD may lead to end stage renal disease (ESRD), characterized by complete kidney failure, which requires treatment by dialysis or a kidney transplant (Valderrábano et al., 2001). Early screening for ESRD among patients with CKD can greatly improve quality of life and life expectancy (Lin et al., 2018). Estimated glomerular filtration rate (eGFR) is the most common measure of kidney function and is often used for screening or diagnosis of kidney disease (Kalantar-Zadeh et al., 2021). However, screening based on eGFR alone may be sub-optimal in practice, as formulas for computing eGFR are not always reliable estimators of the underlying glomerular filtration rate (Glassock & Winearls, 2008). Supplementing eGFR-based screening with predictions using EHR-derived features offers a promising route to improved renal disease detection (Haris et al., 2024). To this end, we aim to leverage our proposed FILM method to build a transferable prediction model for ESRD among patients with severe eGFR values. Our outcome is defined as the occurrence of an ESRD diagnosis (SNOMED code 46177005). After careful data processing, we are left with 51 demographic and EHR-derived features with which to train our prediction model. We provide complete details on our data processing pipeline in the appendix.

As healthcare system quality and patient demographics vary greatly across geography in the United States (Bethell et al., 2011; Weaver et al., 2021), we use the state of residence for each patient as the environment variable. We restrict our attention to states in the northeast (New York, Pennsylvania, and Massachusetts) and one representative state from the midwest (Illinois) and the south (Texas), each chosen based on population size. We use the northeast states as our training environments, and aim to build a prediction model whose performance transfers well to Illinois and Texas. Table 2 gives the number of cases and controls that were used in our data analysis from each state.

Table 2:

Case and control totals by state.

Role State n cases n controls
Training Pennsylvania 14 266
- Massachusetts 10 207
- New York 19 118
Testing Illinois 41 378
- Texas 16 256

4.2. Analysis and results

We use the data from the northeast states to train four variants of the logistic regression model: a ridge-regularized logistic regression fit with the glmnet package that pools data across all three states, the CoCo method with a logit link function, and FILM and Robust-FILM using the median as our Ψ(·) function, both with the logit link. Tuning parameters for each method are chosen using cross-validation (for the FILM methods, we use Massachusetts as the validation environment). For each method, we use the estimated effect vector to compute predicted case probabilities for the patients from Illinois and Texas, with which we compute the AUC. These results are given in the first two rows of Table 3.
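The AUC used throughout this section is the standard rank-based statistic: the probability that a randomly chosen case receives a higher predicted probability than a randomly chosen control. A minimal self-contained sketch (our own helper, not tied to any particular package):

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: the probability that a randomly chosen case
    outranks a randomly chosen control, with ties counted as one half."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # case scored above control
    ties = (pos[:, None] == neg[None, :]).sum()  # tied case-control pairs
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```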

Table 3:

AUC for each method computed on the Illinois and Texas testing data. The methods are trained on data from Pennsylvania, Massachusetts, and New York. Results are given for the models trained with and without the outlier environment separately.

Outlier? State Pooled Logistic FILM Robust-FILM CoCo
No Illinois 0.779 0.837 0.831 0.661
- Texas 0.691 0.782 0.771 0.479
Yes Illinois 0.603 0.531 0.759 0.340
- Texas 0.513 0.607 0.748 0.465

To demonstrate the effectiveness of our Robust-FILM approach, we use the patient data from California (n = 338) to construct an outlier environment. Leaving the covariate data unchanged, we flip the case-control indicator (from 0 to 1 or vice versa) for each patient from California with probability 0.5, mimicking the data corruption that results from outcome mislabeling. We then include this corrupted environment in the training set and retrain each of the four methods described above. Performance is again computed as AUC on the Illinois and Texas data. These results are given in the last two rows of Table 3.
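The label-flipping corruption described above can be sketched as follows; `flip_labels` and its arguments are our own illustrative names, not the authors' script.

```python
import numpy as np

def flip_labels(y, prob=0.5, seed=0):
    """Independently flip each binary case-control indicator with the
    given probability, mimicking outcome mislabeling in one environment."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    flips = rng.random(y.shape[0]) < prob  # which patients get mislabeled
    return np.where(flips, 1 - y, y)
```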

Our FILM and Robust-FILM methods achieve the best performance in the outlier-free and outlier-present settings, respectively. The CoCo method suffers in this analysis due to the lack of sufficient oracle knowledge of the true invariant features, which limits its use in practical data analysis settings, and the Pooled Logistic regression incurs bias from ignoring the cross-environment heterogeneity. The EILLS approach of Fan et al. (2024) is not computationally feasible even in this moderate-dimensional (p = 51) setting.

In the outlier-free setting, our FILM method selects the following five features as having nonzero regression coefficients: serum creatinine, acquired hemolytic anemia, bacteremia, renal disorder due to type 1 diabetes mellitus, and systemic sclerosis. In the appendix, we provide a table of each feature’s estimated regression coefficient and corresponding SNOMED code. These features comprise our estimate of the set S of invariant features in Model (2). Each of these features has clinical relevance to end stage renal disease. Creatinine levels are a well-established measure of kidney function (Fink et al., 1999; Levey et al., 1988) and are used to calculate eGFR (Grams et al., 2023). Hemolytic anemia is known to induce kidney injury through the destruction of red blood cells, which releases large amounts of haem into circulation and places undue stress on the kidneys (Van Avondt et al., 2019). Recent studies provide evidence that bacteremia is prevalent among kidney transplant recipients and is associated with subsequent kidney failure and increased rates of mortality (Ito et al., 2016; Jamil et al., 2016; Tsikala-Vafea et al., 2020). Patients with type 1 diabetes are at high risk for developing end stage renal disease (Skupien et al., 2012; Stadler et al., 2006). Finally, renal disorder is common in cases of systemic sclerosis (Woodworth et al., 2016), and renal failure is a major cause of death among patients affected by systemic sclerosis (Cole et al., 2023). Knowledge of these features as provided by our FILM estimator can allow researchers to build transferable and parsimonious prediction models for ESRD.

5. Discussion

In this work, we present the FILM framework for computationally efficient estimation of generalized linear models from multiple environments under the invariant features model. We demonstrate via simulations that our approach performs as well as state-of-the-art methods while greatly improving runtime. Furthermore, we propose a robust extension of our method, called Robust-FILM, that protects against environments that may be corrupted or misspecified, and we show that Robust-FILM preserves statistical performance under a variety of corruption settings. We then use our new methods to build a transferable risk prediction model for end stage renal disease using electronic health records from the All of Us database.

The limitations of our FILM method suggest promising directions for future work. First, FILM is designed only for generalized linear models with a canonical link function. A natural next step is to extend our approach to a more general class of parametric models, and then to nonparametric regression models, perhaps leveraging ideas from Gu et al. (2024). Another possible direction is to extend our FILM approach to handle more general, possibly nonlinear, invariant representations, as considered in Jiang et al. (2023), Nguyen et al. (2021), Shi et al. (2021), and Zhao et al. (2022). This may allow us to integrate data sources in which the different environments correspond to different modalities of data collection, an increasingly common setting in real-world applications. Invariant representation learning also provides an alternative route to robustness beyond the median-of-means approach that we take in this work, as data from a potentially outlying environment may satisfy Model (2) only after a suitable transformation (Maurer et al., 2016; Tian et al., 2023). Another potential direction is to extend the current model to incorporate random effects, either to relax the definition of invariant features or to account for correlated or dependent samples by modeling between- and within-environment variability within the FILM framework. We see this as a valuable piece of future work that will bring its own computational challenges, as is typical when handling generalized linear mixed effects models. Finally, more research is needed to better understand the trade-offs between transferability, robustness, and computational efficiency. The recent work of Gu et al. (2025) explores the fundamental computational difficulty of estimating invariant effects under the linear invariant features model, but further work is needed to characterize the additional cost of robust estimation under this model.

Supplementary Material

Online supplement

R script for FILM method: An R script implementing our FILM and Robust-FILM methods, along with the cross validation procedure (.R file).

R script for data simulation: An R script for simulating data from the invariant features generalized linear model as used in our simulation studies (.R file).

Additional simulation and details on computation and data analysis: A PDF file describing additional simulation results, the algorithm used to compute the Robust-FILM estimator, and additional information pertaining to the real data analysis including data availability (.pdf file).

Acknowledgments

We gratefully acknowledge the All of Us program participants for their contributions, without whom this research would not have been possible. We also thank the National Institutes of Health’s All of Us Research Program for making the data examined in this paper available. We acknowledge support from the National Institutes of Health under Grants R01 GM148494 and R01MH137218. Research reported in this work was partially funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-2024C1-37351). The views and statements in this work are solely the responsibility of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee. Parker Knight is supported by an NSF Graduate Research Fellowship. The authors report that there are no competing interests to declare. We thank Tianxi Cai and Junwei Lu for valuable discussions throughout the course of this work.

References

  1. American Society of Hematology. (2024). Global burden and trend of anemia due to chronic kidney disease in g20 countries from 1990-2021: A benchmarking systematic analysis [Accessed: January 22, 2025]. Blood, 144 (Supplement 1), 5248. [Google Scholar]
  2. Angrist JD, Imbens GW, & Rubin DB (1996). Identification of causal effects using instrumental variables. Journal of the American statistical Association, 91 (434), 444–455. [Google Scholar]
  3. Arjovsky M, Bottou L, Gulrajani I, & Lopez-Paz D (2020, March 27). Invariant risk minimization. [Google Scholar]
  4. Ben-David S, Blitzer J, Crammer K, & Pereira F (2006). Analysis of representations for domain adaptation. Advances in neural information processing systems, 19. [Google Scholar]
  5. Bertsimas D, King A, & Mazumder R (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44 (2), 813–852. 10.1214/15-AOS1388 [DOI] [Google Scholar]
  6. Bethell CD, Kogan MD, Strickland BB, Schor EL, Robertson J, & Newacheck PW (2011). A national and state profile of leading health problems and health care quality for us children: Key insurance disparities and across-state variations. Academic pediatrics, 11 (3), S22–S33. [DOI] [PubMed] [Google Scholar]
  7. Cole A, Ong VH, & Denton CP (2023). Renal disease and systemic sclerosis: An update on scleroderma renal crisis. Clinical reviews in allergy & immunology, 64 (3), 378–391. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Devroye L, Lerasle M, Lugosi G, & Oliveira RI (2016). Sub-gaussian mean estimators. Annals of Statistics. [Google Scholar]
  9. Duan Y, & Wang K (2023). Adaptive and robust multi-task learning. The Annals of Statistics, 51 (5), 2015–2039. 10.1214/23-AOS2319 [DOI] [Google Scholar]
  10. Duchi JC, Feldman V, Hu L, & Talwar K (2022). Subspace recovery from heterogeneous data with non-isotropic noise. Advances in Neural Information Processing Systems, 35, 5854–5866. [Google Scholar]
  11. Fan J, Fang C, Gu Y, & Zhang T (2024). Environment invariant linear least squares. The Annals of Statistics, 52 (5), 2268–2292. [Google Scholar]
  12. Farahani A, Voghoei S, Rasheed K, & Arabnia HR (2021). A brief review of domain adaptation. Advances in data science and information engineering: proceedings from ICDATA 2020 and IKE 2020, 877–894. [Google Scholar]
  13. Fink JC, Burdick RA, Kurth SJ, Blahut SA, Armistead NC, Turner MS, Shickle LM, & Light PD (1999). Significance of serum creatinine values in new end-stage renal disease patients. American journal of kidney diseases, 34 (4), 694–701. [DOI] [PubMed] [Google Scholar]
  14. Glassock RJ, & Winearls C (2008). Screening for ckd with egfr: Doubts and dangers. Clinical Journal of the American Society of Nephrology, 3 (5), 1563–1568. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Grams ME, Brunskill NJ, Ballew SH, Sang Y, Coresh J, Matsushita K, Surapaneni A, Bell S, Carrero JJ, Chodick G, et al. (2023). The kidney failure risk equation: Evaluation of novel input variables including egfr estimated using the ckd-epi 2021 equation in 59 cohorts. Journal of the American Society of Nephrology, 34 (3), 482–494. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Green S, Prainsack B, & Sabatello M (2024). The roots of (in) equity in precision medicine: Gaps in the discourse. Personalized Medicine, 21 (1), 5–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Gu Y, Fang C, Bühlmann P, & Fan J (2024). Causality pursuit from heterogeneous environments via neural adversarial invariance learning. arXiv preprint arXiv:2405.04715. [Google Scholar]
  18. Gu Y, Fang C, Xu Y, Guo Z, & Fan J (2025). Fundamental computational limits in pursuing invariant causal prediction and invariance-guided regularization. arXiv preprint arXiv:2501.17354. [Google Scholar]
  19. Haris M, Raveendra K, Travlos CK, Lewington A, Wu J, Shuweidhi F, Nadarajah R, & Gale CP (2024). Prediction of incident chronic kidney disease in community-based electronic health records: A systematic review and meta-analysis. Clinical Kidney Journal, 17 (5), sfae098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Heinze-Deml C, Peters J, & Meinshausen N (2018). Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6 (2), 20170016. [Google Scholar]
  21. International Society of Nephrology. (2023). Global kidney health atlas [Accessed: January 22, 2025].
  22. Ito K, Goto N, Futamura K, Okada M, Yamamoto T, Tsujita M, Hiramitsu T, Narumi S, Tominaga Y, & Watarai Y (2016). Death and kidney allograft dysfunction after bacteremia. Clinical and experimental nephrology, 20, 309–315. [DOI] [PubMed] [Google Scholar]
  23. Jamil B, Bokhari M, Saeed A, Bokhari MZM, Hussain Z, Khalid T, Bukhari H, Imran M, & Abbasi SA (2016). Bacteremia: Prevalence and antimicrobial resistance profiling in chronic kidney diseases and renal transplant patients. J Pak Med Assoc, 66 (6), 705–9. [PubMed] [Google Scholar]
  24. Jiang Q, Chen C, Zhao H, Chen L, Ping Q, Tran SD, Xu Y, Zeng B, & Chilimbi T (2023). Understanding and constructing latent modality structures in multi-modal representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7661–7671. [Google Scholar]
  25. Kalantar-Zadeh K, Jafar TH, Nitsch D, Neuen BL, & Perkovic V (2021). Chronic kidney disease. The lancet, 398 (10302), 786–802. [Google Scholar]
  26. Kamath P, Tangella A, Sutherland D, & Srebro N (2021). Does invariant risk minimization capture invariance? International Conference on Artificial Intelligence and Statistics, 4069–4077. [Google Scholar]
  27. Knight P, & Duan R (2024). Multi-task learning with summary statistics. Advances in neural information processing systems, 36, 54020. [Google Scholar]
  28. Lecué G, & Lerasle M (2020). Robust machine learning by median-of-means: Theory and practice. Annals of Statistics. [Google Scholar]
  29. Levey AS, Perrone RD, & Madias NE (1988). Serum creatinine and renal function. Annual review of medicine, 39, 465–490. [Google Scholar]
  30. Li S, Cai TT, & Li H (2022). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84 (1), 149–173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Li S, Cai T, & Duan R (2023). Targeting underrepresented populations in precision medicine: A federated transfer learning approach. The Annals of Applied Statistics, 17 (4), 2970–2992. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Li S, & Zhang L (2024). Fairm: Learning invariant representations for algorithmic fairness and domain generalization with minimax optimality. arXiv preprint arXiv:2404.01608. [Google Scholar]
  33. Li S, Zhang L, Cai TT, & Li H (2024). Estimation and inference for high-dimensional generalized linear models with knowledge transfer. Journal of the American Statistical Association, 119 (546), 1274–1285. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Lin E, Chertow GM, Yan B, Malcolm E, & Goldhaber-Fiebert JD (2018). Cost-effectiveness of multidisciplinary care in mild to moderate chronic kidney disease in the united states: A modeling study. PLoS medicine, 15 (3), e1002532. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Liu M, Zhang Y, Liao KP, & Cai T (2023). Augmented transfer regression learning with semi-non-parametric nuisance models. Journal of Machine Learning Research, 24 (293), 1–50. [Google Scholar]
  36. Ma C, Pathak R, & Wainwright MJ (2023). Optimally tackling covariate shift in rkhs-based nonparametric regression. The Annals of Statistics, 51 (2), 738–761. [Google Scholar]
  37. Martínez-García M, & Hernández-Lemus E(2022). Data integration challenges for machine learning in precision medicine. Frontiers in medicine, 8, 784455. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Maurer A, Pontil M, & Romera-Paredes B (2013). Sparse coding for multitask and transfer learning. International conference on machine learning, 343–351. [Google Scholar]
  39. Maurer A, Pontil M, & Romera-Paredes B (2016). The benefit of multitask representation learning. Journal of Machine Learning Research, 17 (81), 1–32. [Google Scholar]
  40. Mayo KR, Basford MA, Carroll RJ, Dillon M, Fullen H, Leung J, Master H, Rura S, Sulieman L, Kennedy N, et al. (2023). The all of us data and research center: Creating a secure, scalable, and sustainable ecosystem for biomedical research. Annual review of biomedical data science, 6 (1), 443–464. [Google Scholar]
  41. Moura AF, et al. (2024). Multidimensional burden of chronic kidney disease in eight countries: Insights from the impact ckd study [Accessed: January 22, 2025]. Kidney International Reports, 9, S263. [Google Scholar]
  42. Nguyen AT, Tran T, Gal Y, & Baydin AG (2021). Domain invariant representation learning with domain density transformations. Advances in Neural Information Processing Systems, 34, 5264–5275. [Google Scholar]
  43. Niu X, Su L, Xu J, & Yang P (2024). Collaborative learning with shared linear representations: Statistical rates and optimal algorithms. arXiv preprint arXiv:2409.04919. [Google Scholar]
  44. All of Us Research Program Investigators. (2019). The “All of Us” research program. New England Journal of Medicine, 381 (7), 668–676. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Peters J, Bühlmann P, & Meinshausen N (2016). Causal inference by using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical Society Series B: Statistical Methodology, 78 (5), 947–1012. [Google Scholar]
  46. Pfister N, Bühlmann P, & Peters J (2019). Invariant causal prediction for sequential data. Journal of the American Statistical Association, 114 (527), 1264–1276. [Google Scholar]
  47. Rojas-Carulla M, Schölkopf B, Turner R, & Peters J (2018, September 24). Invariant models for causal transfer learning. [Google Scholar]
  48. Rosenfeld E, Ravikumar P, & Risteski A (2021, March 27). The risks of invariant risk minimization. [Google Scholar]
  49. Sarwar T, Seifollahi S, Chan J, Zhang X, Aksakalli V, Hudson I, Verspoor K, & Cavedon L (2022). The secondary use of electronic health records for data mining: Data characteristics and challenges. ACM Computing Surveys (CSUR), 55 (2), 1–40. [Google Scholar]
  50. Sauer CM, Chen L-C, Hyland SL, Girbes A, Elbers P, & Celi LA (2022). Leveraging electronic health records for data science: Common pitfalls and how to avoid them. The Lancet Digital Health, 4 (12), e893–e898. [DOI] [PubMed] [Google Scholar]
  51. Shi C, Veitch V, & Blei DM (2021). Invariant representation learning for treatment effect estimation. Uncertainty in artificial intelligence, 1546–1555. [Google Scholar]
  52. Singh H, Mhasawade V, & Chunara R (2022). Generalizability challenges of mortality risk prediction models: A retrospective analysis on a multi-center database. PLOS Digital Health, 1 (4), e0000023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Skupien J, Warram JH, Smiles AM, Niewczas MA, Gohda T, Pezzolesi MG, Cantarovich D, Stanton R, & Krolewski AS (2012). The early decline in renal function in patients with type 1 diabetes and proteinuria predicts the risk of end-stage renal disease. Kidney international, 82 (5), 589–597. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Stadler M, Auinger M, Anderwald C, Kastenbauer T, Kramar R, Feinbock C, Irsigler K, Kronenberg F, & Prager R (2006). Long-term mortality and incidence of renal dialysis and transplantation in type 1 diabetes mellitus. The Journal of Clinical Endocrinology & Metabolism, 91 (10), 3814–3820. [DOI] [PubMed] [Google Scholar]
  55. Tian Y, Gu Y, & Feng Y (2023). Learning from similar linear representations: Adaptivity, minimaxity, and robustness. arXiv preprint arXiv:2303.17765. [Google Scholar]
  56. Tripuraneni N, Jin C, & Jordan M (2021). Provable meta-learning of linear representations. International Conference on Machine Learning, 10434–10443. [Google Scholar]
  57. Tsikala-Vafea M, Basoulis D, Pavlopoulou I, Darema M, Deliolanis J, Daikos GL, Boletis J, & Psichogiou M (2020). Bloodstream infections by gram-negative bacteria in kidney transplant patients: Incidence, risk factors, and outcome. Transplant Infectious Disease, 22 (6), e13442. [DOI] [PubMed] [Google Scholar]
  58. Valderrábano F, Jofre R, & López-Gómez JM (2001). Quality of life in end-stage renal disease patients. American Journal of Kidney Diseases, 38 (3), 443–464. [DOI] [PubMed] [Google Scholar]
  59. Van Avondt K, Nur E, & Zeerleder S (2019). Mechanisms of haemolysis-induced kidney injury. Nature Reviews Nephrology, 15 (11), 671–692. [DOI] [PubMed] [Google Scholar]
  60. Wang Z, Hu Y, Bühlmann P, & Guo Z (2024). Causal invariance learning via efficient optimization of a nonconvex objective. arXiv preprint arXiv:2412.11850. [Google Scholar]
  61. Weaver MR, Nandakumar V, Joffe J, Barber RM, Fullman N, Singh A, Sparks GW, Yearwood J, Lozano R, Murray CJ, et al. (2021). Variation in health care access and quality among us states and high-income countries with universal health insurance coverage. JAMA Network Open, 4 (6), e2114730–e2114730. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Woodworth TG, Suliman YA, Li W, Furst DE, & Clements P (2016). Scleroderma renal crisis and renal involvement in systemic sclerosis. Nature Reviews Nephrology, 12 (11), 678–691. [DOI] [PubMed] [Google Scholar]
  63. Yin M, Wang Y, & Blei DM (2024). Optimization-based causal estimation from heterogeneous environments. J. Mach. Learn. Res, 25, 1–44. [PMC free article] [PubMed] [Google Scholar]
  64. Zhang Y, & Yang Q (2018). An overview of multi-task learning. National Science Review, 5 (1), 30–43. [Google Scholar]
  65. Zhao H, Dan C, Aragam B, Jaakkola TS, Gordon GJ, & Ravikumar P (2022). Fundamental limits and tradeoffs in invariant representation learning. Journal of machine learning research, 23 (340), 1–49. [Google Scholar]
