Published in final edited form as: Ann Appl Stat. 2023 Oct 30;17(4):2970–2992. doi: 10.1214/23-AOAS1747

TARGETING UNDERREPRESENTED POPULATIONS IN PRECISION MEDICINE: A FEDERATED TRANSFER LEARNING APPROACH

By Sai Li 1,a, Tianxi Cai 2,b, Rui Duan 2,c

Abstract

The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research poses a significant barrier to translating precision medicine research into practice. Prediction models are likely to underperform in underrepresented populations due to heterogeneity across populations, thereby exacerbating known health disparities. To address this issue, we propose FETA, a two-way data integration method that leverages a federated transfer learning approach to integrate heterogeneous data from diverse populations and multiple healthcare institutions, with a focus on a target population of interest with limited sample size. We show that FETA achieves performance comparable to the pooled analysis, in which individual-level data are shared across institutions, with only a small number of communications across participating sites. Our theoretical analysis and simulation study demonstrate how FETA’s estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity across populations. We apply FETA to multisite data from the electronic Medical Records and Genomics (eMERGE) Network to construct genetic risk prediction models for extreme obesity. Compared to models trained using target data only, source data only, and all data without accounting for population-level differences, FETA shows superior predictive performance. FETA has the potential to improve estimation and prediction accuracy in underrepresented populations and reduce the gap in model performance across populations.

Keywords: Federated learning, health equity, precision medicine, risk prediction, transfer learning

1. Introduction.

Precision medicine holds promise to improve individual health by integrating a person’s genetics, environment, and lifestyle information to determine the best approach to prevent or treat diseases (Ashley (2016)). Precision medicine research has attracted considerable interest and investment during the past few decades (Collins and Varmus (2015)). With the emergence of electronic health records (EHR) linked with biobank specimens, massive environmental data, and health surveys, we now have increasing opportunities to develop accurate personalized risk prediction models in a cost-effective way (Li et al. (2020)).

Despite the increasing availability of large-scale biomedical data, many demographic subpopulations are observed to be underrepresented in precision medicine research (Kraft et al. (2018), West, Blacksher and Burke (2017)). For example, a disproportionate majority (>75%) of participants in existing genomics studies are of European descent (Martin et al. (2019)). The UK Biobank, one of the largest biobanks, has more than 95% European-ancestry participants (Sudlow et al. (2015)). Mounting evidence has shown that the transferability and generalizability of models trained in a particular population can be low due to a substantial amount of heterogeneity in the underlying distributions of data across populations (Duncan et al. (2019), Kraft et al. (2018), Landry et al. (2018), West, Blacksher and Burke (2017)). For example, the performance of existing genetic risk prediction models in non-European populations has generally been found to be much poorer than in European-ancestry populations, most notably in African-ancestry populations (Duncan et al. (2019)). It remains challenging to optimize prediction model performance for underrepresented populations while accounting for both the heterogeneity and the insufficient sample sizes. However, to advance precision medicine it is crucial to improve the performance of statistical and machine learning models in underrepresented populations so as not to exacerbate health disparities.

We propose to address the lack of representation and disparities in model performance through two data integration strategies: (1) leveraging the shared knowledge from diverse populations and (2) integrating larger bodies of data from multiple healthcare institutions. Data across multiple populations may share a certain amount of similarity that can be leveraged to improve the model performance in an underrepresented population (Cai et al. (2021)). However, conventional methods, where all data are combined and used indistinctly in training and testing, cannot tailor the prediction models to work well for a specific population; if using a population-specific training strategy, the sample size from an underrepresented population is usually not sufficient to train a model with reliable performance. To account for such heterogeneity and lack of representation, we propose to use transfer learning to transfer the shared knowledge learned from diverse populations to an underrepresented population so that comparable model performance can be reached with much less training data (Weiss, Khoshgoftaar and Wang (2016)). In addition, multiinstitutional data integration can improve the sample size of the underrepresented populations and the diversity of data (McCarty et al. (2011)). In recent years, many research consortia and data networks have been built which provide infrastructures for multiinstitutional data integration. For example, the electronic Medical Records and Genomics (eMERGE) Network is a consortium of ten participating sites to investigate the use of EHR systems for genomic research (Gottesman et al. (2013)). In 2019, the Global Biobank Meta-analysis Initiative (GBMI) was founded, which brings together 19 biobanks to understand the genetic basis of human health and disease (Zhou et al. (2021)). We propose to use federated learning to unlock the multiinstitutional health records/biobank data, which overcomes two main barriers to institutional data integration. One is that individual-level information can be highly sensitive and cannot be shared across institutions (van der Haak et al. (2003)). The other is that the often-enormous size of the health records/biobank data makes it infeasible or inefficient to pool all data together due to challenges in data storage, management, and computation (Kushida et al. (2012)). Therefore, as illustrated in Figure 1, we propose a FEderated Transfer Algorithm (FETA) to incorporate data from diverse populations that are stored at multiple institutions to improve the model performance in a target underrepresented population.

Fig. 1. A schematic illustration of the federated transfer learning framework and the problem setting.

Existing transfer learning methods primarily focus on settings where individual-level data can be shared. For example, Cai and Wei (2021) study minimax and adaptive methods for nonparametric classification in the transfer learning setting. Bastani (2020) studies estimation and prediction in high-dimensional linear models where the sample size of the auxiliary study is larger than the number of covariates. Li, Cai and Li (2022) propose a transfer learning algorithm in high-dimensional linear models and study the adaptation to the unknown similarity level. Li, Cai and Li (2020) study transfer learning in high-dimensional Gaussian graphical models with false discovery rate control. Tian and Feng (2021) study transfer learning under high-dimensional generalized linear models. These individual data-based methods cannot be directly extended to federated settings due to data sharing constraints and the potential heterogeneity across sites.

Under data sharing constraints, most federated learning methods focus on settings where the true models are the same across studies. For example, many algorithms fit a common model to data from each institution and then aggregate these local estimates through a weighted average, for example, Chen and Xie (2014), Lee et al. (2017), Li, Lin and Li (2013), Wang et al. (2019a). To improve efficiency, surrogate likelihood approaches have been adopted in recently proposed distributed algorithms (Duan et al. (2019), Duan et al. (2020), Jordan, Lee and Yang (2019)) to approximate the global likelihood. These methods cannot be easily extended to the federated transfer learning setting, where training a model for a target population is of interest and both data sharing constraints and heterogeneity are present. Recently, Liu et al. (2021) and Cai, Liu and Xia (2022) propose distributed multitask learning approaches, which jointly train site-specific regression parameters based on derived summary data. Different from their work, which assumes site-level homogeneous populations, we consider that data in each site are from multiple populations, and the sample sizes can be highly unbalanced. Without assuming that model parameters across populations have the same support, our methods are robust to cases where some populations differ significantly from the target.

To summarize, our work contributes to the current literature from the following perspectives: (1) adopting transfer learning ideas, FETA tackles an important issue in precision medicine where sample sizes from different populations can be highly unbalanced. We provide theoretical analysis, extensive numerical experiments, and real data analysis to demonstrate the improved accuracy and the robustness to the level of heterogeneity; (2) FETA enables federated model training across multiple datasets, requiring only a small number of communications across participating sites while achieving performance comparable to the pooled analysis. Our theoretical analysis and numerical experiments demonstrate how the estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity among populations; and (3) our real data application shows the promise of applying FETA to improve risk prediction in underrepresented populations using data from multicenter large-scale clinical/genomic datasets with communication efficiency and privacy protection.

The rest of the paper is organized as follows. In Section 2 we introduce FETA with detailed discussions of potential options for initialization, tuning parameter selection, and reducing communication costs. In Sections 3 and 4, we provide theoretical guarantees and simulation studies, respectively, to evaluate the performance of the methods and how it is influenced by the communication constraints and the level of heterogeneity across populations. In Section 5 we apply our methods to the construction of prediction models for extreme obesity, an emerging public health issue and an important risk factor for many complex diseases. We combine multicenter data from the eMERGE Network and target the underrepresented African-ancestry population.

2. Method.

2.1. Problem setup and notation.

We build our federated transfer learning methods based on sparse high-dimensional regression models (Tibshirani (1996), Bickel, Ritov and Tsybakov (2009)). These models have been widely applied to precision medicine research for both association studies and risk prediction models, due to the benefits of simultaneous model estimation and variable selection, and the desirable interpretability (Qian et al. (2020)).

We assume there are N subjects in total from K+1 populations. We treat the underrepresented population of interest as the target population, indexed by $k = 0$, while the other K populations are treated as source populations, indexed by $k = 1, \dots, K$. We assume data for the N subjects are stored at M different sites where, due to privacy constraints, no individual-level data are allowed to be shared across sites. We consider the case where K is finite but M is allowed to grow as the total sample size grows to infinity.

Let $\mathcal{N}^{(m,k)}$ be the index set of the data from the kth population in the mth site and $n^{(m,k)} = |\mathcal{N}^{(m,k)}|$ denote the corresponding sample size, for $k = 0, \dots, K$ and $m = 1, \dots, M$. We assume the index sets are known and do not overlap with one another, that is, $\mathcal{N}^{(m,k)} \cap \mathcal{N}^{(m,k')} = \emptyset$ for any $0 \le k < k' \le K$. In precision medicine research, these index sets may be obtained from indicators of minority and disadvantaged groups, such as race, gender, and socioeconomic status. Denote $N^{(k)} = \sum_{m=1}^{M} n^{(m,k)}$ and $N = \sum_{k=0}^{K} N^{(k)}$. We are particularly interested in the challenging scenario $N^{(0)} \ll N$, where the underrepresentation is severe. However, at any given site the relative sample compositions can be arbitrary. It is possible that some sites may not have data from certain populations, that is, $n^{(m,k)} = 0$ for some but not all m, $1 \le m \le M$. We consider the high-dimensional setting where p can be larger than, and even much larger than, $N^{(0)}$ and N.

For the ith subject, we observe an outcome variable $y_i \in \mathbb{R}$ and a set of p predictors $x_i \in \mathbb{R}^p$, including the intercept term. We assume that the target data at the mth site, $\{(x_i, y_i)\}_{i \in \mathcal{N}^{(m,0)}}$, follow a generalized linear model

$g(E[y_i \mid x_i]) = x_i^\top \beta$

with a canonical link function $g(\cdot)$ and a negative log-likelihood function

$L^{(m,0)}(\beta) = \sum_{i \in \mathcal{N}^{(m,0)}} \{\psi(x_i^\top \beta) - y_i x_i^\top \beta\}$

for some unknown parameter $\beta \in \mathbb{R}^p$ and $\psi(\cdot)$ uniquely determined by $g(\cdot)$. Similarly, the data from the kth source population in the mth site are $\{(x_i, y_i)\}_{i \in \mathcal{N}^{(m,k)}}$, and they follow a generalized linear model

$g(E[y_i \mid x_i]) = x_i^\top w^{(k)}$

with negative log-likelihood

$L^{(m,k)}(w^{(k)}) = \sum_{i \in \mathcal{N}^{(m,k)}} \{\psi(x_i^\top w^{(k)}) - y_i x_i^\top w^{(k)}\}$

for some unknown parameter $w^{(k)} \in \mathbb{R}^p$.

Our goal is to estimate β using data from the K+1 populations and M sites. These data are heterogeneous at two levels: for data from different populations, differences may exist in both the regression coefficients, which characterize the conditional distribution $f(y_i \mid x_i)$, and the underlying distribution of the covariates, also known as covariate shift in some related work (Guo (2020)). For data from a given population, the distribution of covariates might also be heterogeneous across sites. We assume the regression parameters to be distinct across populations. In addition, we consider the setting where only summary-level data can be shared across sites.

Despite the presence of between-population heterogeneity, it is reasonable to believe that the population-specific models may share some degree of similarity. For example, in genetic risk prediction models the genetic architectures, captured by regression coefficients, of many complex traits and diseases are found to be highly concordant across ancestral groups (Lam et al. (2019)). It is important to characterize and leverage such similarities so that knowledge can be transferred from the source to the target population.

Under our proposed modeling framework, we characterize the similarities between the kth source population and the target based on the difference between their regression parameters, $\delta^{(k)} = w^{(k)} - \beta$. We consider the following parameter space:

$\Theta(s,h) = \left\{ \theta = (\beta, \delta^{(1)}, \dots, \delta^{(K)}) : \|\beta\|_0 \le s,\ \max_{1 \le k \le K} \|\delta^{(k)}\|_0 \le h \right\},$

where s and h are the upper bounds for the support sizes of β and $\{\delta^{(k)}\}_{k=1}^K$, respectively. Intuitively, a smaller h indicates a higher level of similarity so that the source data can be more helpful for estimating β in the target population. When h is relatively large, incorporating data from source populations may be worse than only using data from the target population to fit the model, also known as negative transfer in the machine learning literature (Weiss, Khoshgoftaar and Wang (2016)). With unknown s and h, in practice, we aim to devise an adaptive estimator to avoid negative transfer under unknown levels of heterogeneity across populations.

Throughout, for real-valued sequences $a_n, b_n$, we write $a_n \lesssim b_n$ if $a_n \le c b_n$ for some universal constant $c \in (0, \infty)$, and $a_n \gtrsim b_n$ if $a_n \ge c' b_n$ for some universal constant $c' \in (0, \infty)$. We say $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $a_n \gtrsim b_n$. We let $c, C, c_0, c_1, c_2, \dots$ denote universal constants. For a vector $v \in \mathbb{R}^d$ and an index set $S \subseteq [d]$, we use $v_S$ to denote the subvector of v corresponding to S. For any vector $b \in \mathbb{R}^p$, let $\mathcal{T}_k(b)$ be the vector formed by setting all but the largest (in magnitude) k elements of b to zero.

2.2. The proposed algorithm.

To motivate our proposed FETA framework, we first consider the ideal case when site-level data can be shared. The transfer learning estimator of β can be obtained via the following three-step procedure:

Step 1: Fit a regression model in each source population. For $k \in \{1, \dots, K\}$, we obtain

$\hat{w}^{(k)} = \arg\min_{b \in \mathbb{R}^p} \left\{ \frac{1}{N^{(k)}} \sum_{m=1}^{M} L^{(m,k)}(b) + \lambda^{(k)} \|b\|_1 \right\}.$ (2.1)

Step 2: Adjust for differences using target data. For $k = 1, \dots, K$, we obtain

$\hat{\delta}^{(k)} = \arg\min_{b \in \mathbb{R}^p} \left\{ \frac{1}{N^{(0)}} \sum_{m=1}^{M} L^{(m,0)}(\hat{w}^{(k)} + b) + \lambda_\delta \|b\|_1 \right\}.$ (2.2)

Threshold $\hat{\delta}^{(k)}$ via $\check{\delta}^{(k)} = \mathcal{T}_{(N^{(0)}/\log p)^{1/2}}(\hat{\delta}^{(k)})$.

Step 3: Joint estimation using source and target data. We obtain

$\hat{\beta} = \arg\min_{b \in \mathbb{R}^p} \left\{ \frac{1}{N} \sum_{m=1}^{M} L^{(m,0)}(b) + \frac{1}{N} \sum_{k=1}^{K} \sum_{m=1}^{M} L^{(m,k)}(b - \check{\delta}^{(k)}) + \lambda_\beta \|b\|_1 \right\},$ (2.3)

where $\{\lambda^{(k)}\}_{k=1}^K$, $\lambda_\delta$, and $\lambda_\beta$ are tuning parameters. Instead of learning β directly from the target data with its limited sample size, we learn $w^{(k)}$ from the source populations and use them to “jumpstart” the model fitting in the target population. More specifically, we learn the difference $\delta^{(k)}$ by offsetting each $\hat{w}^{(k)}$. In Step 3 we pool all the data to jointly learn β, where the estimated differences $\check{\delta}^{(k)}$ are adjusted for in the data from the kth source population. In contrast to existing transfer learning methods based on generalized linear models, the above procedure has benefits in estimation accuracy and in the flexibility to be implemented in the federated setting. Compared to recent work (Tian and Feng (2021)), the above procedure has a faster convergence rate, which is, in fact, minimax optimal under mild conditions. Moreover, our method learns $w^{(k)}$ independently in Steps 1 and 2, while in other related methods (Li, Cai and Li (2022), Tian and Feng (2021)) a pooled analysis is conducted with data from multiple populations. In a federated setting, finding a proper initialization for such a pooled estimator is challenging due to the various levels of heterogeneity.
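To make the three steps concrete, below is a minimal sketch of the pooled procedure (2.1)–(2.3) for the linear-model special case with a single source population (K = 1), where each step reduces to a lasso fit. The function names, the use of scikit-learn's Lasso, and the top-k thresholding helper are our illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def top_k_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
    return out

def pooled_transfer_lasso(X0, y0, X1, y1, lam_w, lam_delta, lam_beta, k_keep):
    """Steps 1-3 of the pooled transfer estimator, squared-error loss.
    X0, y0: target data; X1, y1: source data (already pooled over sites)."""
    # Step 1: lasso fit on the source data alone gives w_hat.
    w_hat = Lasso(alpha=lam_w, fit_intercept=False).fit(X1, y1).coef_

    # Step 2: lasso of the target residuals on X0 estimates the contrast
    # (the loss in (2.2) is evaluated at w_hat + b); then hard-threshold.
    delta_hat = Lasso(alpha=lam_delta, fit_intercept=False).fit(
        X0, y0 - X0 @ w_hat).coef_
    delta_chk = top_k_threshold(delta_hat, k_keep)

    # Step 3: joint lasso; shifting the source responses by X1 @ delta_chk
    # turns L(b - delta_chk) on the source data into a plain squared loss in b.
    X = np.vstack([X0, X1])
    y = np.concatenate([y0, y1 + X1 @ delta_chk])
    return Lasso(alpha=lam_beta, fit_intercept=False).fit(X, y).coef_

# Per the thresholding rule above, one could set, e.g.,
# k_keep = int(np.sqrt(len(y0) / np.log(X0.shape[1])))
```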

To generalize (2.1)–(2.3) to the federated setting, we consider an approximation of $L^{(m,k)}(b)$ by its second-order expansion at a current value $\bar{b}$. That is,

$\tilde{L}^{(m,k)}(b; \bar{b}) = L^{(m,k)}(\bar{b}) + (b - \bar{b})^\top \nabla L^{(m,k)}(\bar{b}) + \frac{1}{2} (b - \bar{b})^\top \nabla^2 L^{(m,k)}(\bar{b}) (b - \bar{b}).$

The higher-order terms are omitted, given that the initial value $\bar{b}$ is sufficiently close to the true parameter. Using these surrogate losses, the sites only need to share three sets of summary statistics: the current estimate $\bar{b}$, the score vector $\nabla L^{(m,k)}(\bar{b})$, and the Hessian matrix $\nabla^2 L^{(m,k)}(\bar{b})$. For $k = 0, \dots, K$, we define

$\nabla L^{(k)}(\bar{b}) = \sum_{m=1}^{M} \nabla L^{(m,k)}(\bar{b}), \qquad \hat{H}^{(k)}(\bar{b}) = \frac{1}{N^{(k)}} \sum_{m=1}^{M} \nabla^2 L^{(m,k)}(\bar{b}),$

$\hat{R}^{(k)}(b; \bar{b}) = \frac{1}{2} (b - \bar{b})^\top \hat{H}^{(k)}(\bar{b}) (b - \bar{b}) + \left\langle b - \bar{b},\ \frac{1}{N^{(k)}} \nabla L^{(k)}(\bar{b}) \right\rangle.$ (2.4)

The functions $\hat{R}^{(k)}(b; \bar{b})$ are the combined surrogate log-likelihood functions for the kth population, based on the previous estimate $\bar{b}$ and the corresponding gradients obtained from the M sites. We then follow similar strategies as (2.1)–(2.3) but replace the full likelihood with the surrogate losses to construct a federated transfer learning estimator for β, as detailed in Algorithm 1.
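As a concrete illustration of the summary statistics in (2.4), the sketch below shows what each site would compute for a logistic model and how a leading site could assemble and minimize the resulting $\ell_1$-penalized surrogate loss. The function names, the proximal gradient (ISTA) solver, and the step-size rule are our illustrative choices under these assumptions, not the paper's software.

```python
import numpy as np

def site_summary(X, y, b_bar):
    """One site's contribution for the logistic loss: score and Hessian at b_bar."""
    prob = 1.0 / (1.0 + np.exp(-X @ b_bar))       # fitted probabilities
    grad = X.T @ (prob - y)                        # score vector of L^(m,k)
    hess = (X * (prob * (1 - prob))[:, None]).T @ X  # Hessian of L^(m,k)
    return grad, hess

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def minimize_surrogate(b_bar, grads, hessians, N_k, lam, n_steps=500):
    """Minimize R_hat(b; b_bar) + lam * ||b||_1, with R_hat as in (2.4),
    by proximal gradient descent. grads/hessians are lists over the M sites."""
    g = sum(grads) / N_k                 # (1/N^(k)) * sum_m grad L^(m,k)(b_bar)
    H = sum(hessians) / N_k              # H_hat^(k)(b_bar)
    step = 1.0 / np.linalg.eigvalsh(H).max()   # inverse Lipschitz constant
    b = b_bar.copy()
    for _ in range(n_steps):
        grad_R = H @ (b - b_bar) + g     # gradient of the quadratic surrogate
        b = soft_threshold(b - step * grad_R, step * lam)
    return b
```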

We discuss strategies for the initialization of β and $\{w^{(k)}\}_{k=1}^K$ in Section 2.3. Algorithm 1 requires T iterations, where within each iteration we collect the first- and second-order derivatives calculated at each site based on the current parameter values. In practice, when iterative communication across sites is not preferred, we can choose T = 1. We show in Section 3 that additional iterations can improve the estimation accuracy.

Remark 1. A special case of Algorithm 1 concerns the linear regression models where $L^{(m,k)}(b) = \sum_{i \in \mathcal{N}^{(m,k)}} (y_i - x_i^\top b)^2$. In such a case, we can directly obtain summary statistics $s^{(m,k)} = \sum_{i \in \mathcal{N}^{(m,k)}} x_i y_i / n^{(m,k)}$ and $H^{(m,k)} = \sum_{i \in \mathcal{N}^{(m,k)}} x_i x_i^\top / n^{(m,k)}$. We can reconstruct $L^{(m,k)}(b)/n^{(m,k)}$, up to an additive constant, from $s^{(m,k)}$ and $H^{(m,k)}$ as $\frac{1}{2} b^\top H^{(m,k)} b - \langle s^{(m,k)}, b \rangle$. Thus, each site only needs to transfer $s^{(m,k)}$ and $H^{(m,k)}$ to perform Steps 1–3 (equations (2.1)–(2.3)) using summary statistics, and no iterative communication is needed.
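A minimal sketch of the one-shot summary statistics in Remark 1, assuming the linear case; the helper names are ours. The reconstructed quadratic can then be minimized with an $\ell_1$ penalty using any solver, for example the proximal gradient loop sketched above.

```python
import numpy as np

def linear_site_summary(X, y):
    """Sufficient statistics for the squared-error loss at one site:
    s = X'y / n and H = X'X / n determine L(b)/n up to a constant."""
    n = len(y)
    return X.T @ y / n, X.T @ X / n

def site_loss_from_summary(b, s, H):
    """Reconstruct L(b)/n up to an additive constant not depending on b."""
    return 0.5 * b @ H @ b - s @ b
```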

For the proper choice of tuning parameters, we provide a theoretical investigation in Section 3. In practical implementation, they can be chosen by cross-validation. In a federated setting, minimizing the communication cost of choosing tuning parameters is important, and we provide an efficient procedure in the following remark.


Remark 2. In a federated setting, cross-validation is preferably done locally, since the communication costs would be much higher if different models needed to be fitted and validated in a federated way. To reduce communication costs, we suggest the following procedure. We first learn the best tuning parameter for each population using local data. For the kth population, we find $I_k = \arg\max_{1 \le m \le M} n^{(m,k)}$, the site with the largest sample size from the kth population. Using data at site $I_k$ from the kth population, we fit a local penalized GLM and use cross-validation to choose the best tuning parameter, denoted by $\lambda_k^*$. The theory in van de Geer (2008) suggests the best tuning parameter is roughly $C_k \sqrt{\log p / n^{(I_k,k)}}$, where $C_k$ is some positive constant not depending on p or $n^{(I_k,k)}$, so we can obtain $C_k^* = \lambda_k^* / \sqrt{\log p / n^{(I_k,k)}}$. When fitting FETA, we choose tuning parameters by rescaling these local best tuning parameters. More specifically, in equation (2.5) we choose $\lambda^{(k)} = C_k^* \sqrt{\log p / N^{(k)}}$; in equation (2.6) we choose $\lambda_\delta = C_0^* \sqrt{\log p / N^{(0)}}$; and in equation (2.7) we choose $\lambda_\beta = C_0^* \sqrt{\log p / N}$. This procedure is implemented in both our simulation study and real data application and is demonstrated to be effective. Most importantly, it does not increase the communication cost, since tuning parameter selection is performed locally within a single site.
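A sketch of the rescaling rule in Remark 2, assuming lam_local is the cross-validated penalty found at the local site with n_local samples (the function and variable names are ours):

```python
import numpy as np

def rescale_lambda(lam_local, n_local, N_new, p):
    """Remark 2: lam_local ~ C * sqrt(log p / n_local), so recover the
    constant C locally and rescale to the federated sample size N_new."""
    C = lam_local / np.sqrt(np.log(p) / n_local)
    return C * np.sqrt(np.log(p) / N_new)

# e.g., lambda^(k) for (2.5) rescales to N^(k), lambda_delta in (2.6) to N^(0),
# and lambda_beta in (2.7) to N.
```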

In practice, it is likely that some source models are substantially different from the target model. In such a case, excluding source populations that are highly different from the target may lead to better performance than including all source populations. In the extreme case where none of the source populations is helpful, it is desirable that our algorithm at least maintain performance comparable to the estimator obtained using the target data alone. We thus propose to increase the robustness of the transfer learning through an aggregation step based on a small target validation dataset in the leading site, which guarantees that, loosely speaking, the aggregated estimator has prediction performance comparable to the best prediction performance among all the candidate estimators (Rigollet and Tsybakov (2011), Tsybakov (2014), Lecué and Rigollet (2014)). In the leading site (denoted as the $m^*$th site), we denote the validation data by $\{(y_i, x_i)\}_{i=1}^{n}$, with sample size $n = c \cdot n^{(m^*,0)}$ for some $c \in (0,1)$, where $n^{(m^*,0)}$ is the sample size of the training data in the leading site from the target population. We first obtain a ranking of the K source populations based on their similarity to the target. We then construct candidate estimators by incorporating only the top k source populations, for $k = 1, 2, \dots, K$. Using the validation data, we select the estimator with the best performance. We summarize the details of the above procedure in Algorithm 2.

In the aggregation step, $\hat{\eta}_t$ can be viewed as a weight vector for $(\hat{\beta}_{t,0}, \dots, \hat{\beta}_{t,K})$. Through the optimization in (2.9) one can find the optimal combination maximizing the log-likelihood on the validation samples, and existing work has shown that the performance of the optimal combination $\hat{\beta}_t^{(\mathrm{agg})}$ is no worse than any of the candidate estimators in $\hat{\mathcal{B}}$ when the sample size of the validation data is not too small (Rigollet and Tsybakov (2011), Tsybakov (2014), Lecué and Rigollet (2014)). Based on our simulation study and real data example, the validation data can be relatively small compared to the training data, and cross-fitting may be used to make full use of all the data. In practice, if there is strong prior knowledge indicating that the level of heterogeneity across populations is low, the aggregation step may be skipped.
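A minimal sketch of the selection flavor of the aggregation step: rank sources by the size of their estimated contrasts, build nested candidates, and keep whichever candidate maximizes the validation log-likelihood. The similarity proxy ($\ell_1$ norm of $\hat{\delta}^{(k)}$), the callback, and the function names are our assumptions; the paper's optimization (2.9) learns a weight vector $\hat{\eta}_t$ over the candidates rather than picking a single one.

```python
import numpy as np

def validation_loglik(beta, X_val, y_val):
    """Logistic log-likelihood on the held-out target validation set."""
    eta = X_val @ beta
    return np.sum(y_val * eta - np.logaddexp(0.0, eta))

def aggregate_by_selection(beta_target_only, fit_with_top_sources,
                           delta_hats, X_val, y_val):
    """Rank sources by ||delta_hat^(k)||_1 (small = similar to target),
    build nested candidates using the top 1,...,K sources plus the
    target-only fit, and return the best candidate on validation data."""
    order = np.argsort([np.sum(np.abs(d)) for d in delta_hats])
    candidates = [beta_target_only]
    candidates += [fit_with_top_sources(order[:k + 1]) for k in range(len(order))]
    scores = [validation_loglik(b, X_val, y_val) for b in candidates]
    return candidates[int(np.argmax(scores))]
```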

Remark 3 (Avoiding sharing Hessian matrices). Algorithm 1 requires each site to transmit Hessian matrices to the leading site, which may not be a concern when p is relatively small. When p is large, we provide possible options to reduce the communication cost of sharing Hessian matrices: (1) If the distributions of the covariates x are homogeneous across sites for a given population, we propose to use Algorithm B.1 in Section B of the Supplementary Material (Li, Cai and Duan (2023)), which only requires the gradient vector from each site. (2) When the distributions of the covariates are heterogeneous across sites, another possibility is to fit a density ratio model between each dataset and the target data in the leading site. Then we can still use the leading target data to approximate the Hessian matrices of the other datasets through the density ratio tilting technique proposed in Duan, Ning and Chen (2022). (3) We can leverage the sparsity structures of the population-level Hessian matrices, denoted by $H^{(m,k)}$, to reduce the communication cost. For example, when constructing polygenic risk prediction, the existing knowledge on the structures of linkage disequilibrium may imply similar block-diagonal structures of the Hessian matrices. In such cases we can apply thresholding to the Hessian matrices and only share the resulting blocks. (4) As demonstrated in our simulation study and real data application, our algorithm with one round of iteration (T = 1) already achieves performance comparable to the pooled analysis. Thus, if choosing T = 1, each site only needs to share Hessian matrices once. If more iterations are allowed, we propose to only update the first-order gradients in the remaining $T - 1$ iterations.
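A tiny sketch of option (3), hard-thresholding a local Hessian so that only its dense entries (e.g., LD blocks) need to be transmitted in sparse form; the relative threshold and the function name are our assumptions:

```python
import numpy as np

def threshold_hessian(H, rel_tol=1e-3):
    """Zero out near-zero entries of a local Hessian; the resulting
    (approximately block-diagonal) sparse matrix is cheap to share."""
    return np.where(np.abs(H) >= rel_tol * np.abs(H).max(), H, 0.0)
```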


2.3. Initialization strategies.

The initialization determines the convergence rate and the number of iterations that Algorithm 1 and Algorithm B.1 need to reach convergence. With data from more than one population, one needs to balance the sample sizes and similarities across populations.

Here we offer two initialization strategies, namely, the single-site initialization and the multisite initialization. The ideal scenario for initialization is that one site has relatively large sample sizes for all K+1 populations. In such a case, we initialize $w^{(k)}$ and β using the single-site initialization. If we cannot find a site with enough data from all K+1 populations, the multisite initialization can be used (a small dispatch sketch follows the list):

  • Strategy 1: Single-site initialization. Find $m^* \in \{1, \dots, M\}$ such that the sample sizes satisfy $n^{(m^*,k)} \asymp \max_{1 \le m \le M} n^{(m,k)}$ for all $k \in \{0, \dots, K\}$. In site $m^*$, we initialize $w^{(k)}$ and β by applying the pooled transfer learning approach introduced in equations (2.1)–(2.3). For example, the All of Us Precision Medicine Initiative aims to recruit one million Americans, with estimates of early recruitment showing up to 75% of participants coming from underrepresented populations (Sankar and Parker (2017)). Such a database can be treated as an initialization site or leading site.

  • Strategy 2: Multisite initialization. We first find $I_k = \arg\max_{1 \le m \le M} n^{(m,k)}$, the site with the largest sample size from the kth population. In site $I_k$, if the sample size of the kth population is much smaller than the site's total sample size, we initialize $w^{(k)}$ by treating the kth population as the target and the other populations as sources and applying the transfer learning approach introduced in equations (2.1)–(2.3). If the kth population is the dominating population, we can simply initialize $w^{(k)}$ using only its own data. The same procedure applies to the initialization of β. For example, when constructing polygenic prediction models, the UK Biobank has around 500,000 European-ancestry samples but only about 3000 African-ancestry samples (Collins (2012)). Thus, if the UK Biobank is selected for initialization of the European-ancestry population, this can be done using only the European-ancestry samples. In contrast, if the UK Biobank is selected for initialization of the African-ancestry population, a transfer learning approach incorporating both European-ancestry and African-ancestry samples is needed to improve the accuracy of the initialization.
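The two strategies can be summarized as a small dispatch rule on the site-by-population sample-size table. In the sketch below, the function name and both cutoffs are our illustrative assumptions; it returns, for each population, the site used to initialize it and whether a local transfer-learning fit is needed.

```python
import numpy as np

def plan_initialization(n, dominance=0.5):
    """n: (M x (K+1)) array with n[m, k] = sample size of population k at
    site m. Returns one (site, strategy) pair per population (Section 2.3)."""
    M, K1 = n.shape
    # Strategy 1: one site with (near-)maximal samples for every population.
    best_site = int(np.argmax(n.sum(axis=1)))
    if np.all(n[best_site] >= 0.5 * n.max(axis=0)):   # "comparable" cutoff: ours
        return [(best_site, "pooled transfer fit at a single site")] * K1
    # Strategy 2: initialize each population at its own largest site.
    plan = []
    for k in range(K1):
        I_k = int(np.argmax(n[:, k]))
        if n[I_k, k] >= dominance * n[I_k].sum():     # population k dominates
            plan.append((I_k, "fit on population k data only"))
        else:
            plan.append((I_k, "local transfer fit with population k as target"))
    return plan
```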

In Section 3 we show that the convergence rates of Algorithm 1 depend on the accuracy of the initial estimators. For the above two strategies, we derive their corresponding convergence rates in the next section. The corresponding theoretical results for Algorithm B.1 can be found in Section D of the Supplementary Material.

3. Theoretical guarantees.

We denote the population Hessian matrices for the kth population at the mth site as $H^{(m,0)} = E[x_i^{(m,0)} x_i^{(m,0)\top} \ddot{\psi}(x_i^{(m,0)\top} \beta)]$ and $H^{(m,k)} = E[x_i^{(m,k)} x_i^{(m,k)\top} \ddot{\psi}(x_i^{(m,k)\top} w^{(k)})]$. We denote the population Hessian matrices across all sites by $H^{(k)} = \sum_{m=1}^{M} n^{(m,k)} H^{(m,k)} / N^{(k)}$ for $k = 0, \dots, K$. We assume the following conditions for the theoretical analysis.

Condition 1. The predictors $\{x_i\}_{i \in \mathcal{N}^{(m,k)}}$ are independent and uniformly bounded with mean zero and covariance $\Sigma^{(m,k)}$ such that

$\max_{1 \le m \le M,\, 0 \le k \le K}\ \max_{i \in \mathcal{N}^{(m,k)}} \|x_i\|_\infty \le C < \infty.$

The covariance matrices $\Sigma^{(m,k)}$ and the Hessian matrices $H^{(m,k)}$ are all positive definite for $m = 1, \dots, M$ and $k = 1, \dots, K$.

Condition 2 (Lipschitz condition of ψ). The random noises $\{y_i - \dot{\psi}(x_i^\top w^{(k)})\}_{i \in \mathcal{N}^{(k)}}$ are independent sub-Gaussian with mean zero for $k = 0, \dots, K$. The second-order derivative $\ddot{\psi}$ is uniformly bounded, and $|\log \ddot{\psi}(a+b) - \log \ddot{\psi}(a)| \le C|b|$ for any $a, b \in \mathbb{R}$.

Condition 1 assumes uniformly bounded designs with positive definite covariance matrices. The distribution of $x_i^{(m,k)}$ can differ across m and k. This assumption is more realistic in practical settings than the homogeneity assumptions in, say, Jordan, Lee and Yang (2019). Heterogeneous covariates are allowed because the Hessian matrices from different sites are transmitted in Algorithm 1. In contrast, Algorithm B.1, which only requires transmitting the gradients across sites, requires a stricter version of Condition 1, as stated in Condition B.1 in the Supplementary Material. On the other hand, the positive definiteness assumption is only required for the Hessian matrices involved in the initialization and the pooled Hessian matrix $H^{(k)}$. Moreover, with unbounded covariates, one may consider relaxing the uniformly bounded designs to sub-Gaussian designs; our theoretical analysis still carries through, but the convergence rate will be inflated by factors of log p. Condition 2 assumes a Lipschitz condition on the link function, which holds for linear, logistic, and multinomial models.

For the tuning parameters, we take

$\lambda^{(k)} = c_0 \left( \frac{\log p}{N^{(k)}} \right)^{1/2}, \quad \lambda_\delta = c_0 \left( \frac{\log p}{N^{(0)}} \right)^{1/2}, \quad \lambda_\beta = c_0 \left\{ \left( \frac{\log p}{N} \right)^{1/2} + \frac{h \log p}{N^{(0)}} \right\}, \quad \text{and} \quad c_n = 2 c_1 s$

for some constants $c_0 > 0$ and $c_1 \ge 1$. We set $c_n$ at the magnitude of s to simplify the theoretical analysis. Indeed, $c_n$ can be chosen as $\min_{1 \le k \le K} c_0 (n^{(\mathrm{init},k)} / \log p)^{1/2}$, where $n^{(\mathrm{init},k)}$ is the sample size for the kth population at its initialization site. In practice, the tuning parameters can be selected by cross-validation, as in Remark 2.

We first show the convergence rate of the pooled transfer learning estimator $\hat{\beta}$ in the following lemma.

Lemma 1 (Pooled transfer learning estimator). Assume Conditions 1 and 2, $h \lesssim s$, $N^{(0)} \log p \lesssim \min_{1 \le k \le K} (N^{(k)})^2$, and $\max_{0 \le k \le K} s \log p / N^{(k)} + h s \log p / N^{(0)} = o(1)$. Then for $\hat{\beta}$ defined in (2.3) and some constant $c_1 > 0$,

$\sup_{\Theta(s,h)} P\left( \|\hat{\beta} - \beta\|_2^2 \gtrsim \frac{s \log p}{N} + \frac{h \log p}{N^{(0)}} \right) \le \exp(-c_1 \log p).$

Lemma 1 demonstrates that the pooled estimator $\hat{\beta}$ attains optimal rates under mild conditions. Its convergence rate is faster than the target-only minimax rate $s \log p / N^{(0)}$ when $N^{(0)} \ll N$ and $h \ll s$. The sample size conditions of Lemma 1 are relatively mild. First, $N \gtrsim K N^{(0)}$ is reasonable, as our target population is underrepresented. The condition $s = o(\min_{0 \le k \le K} N^{(k)} / \log p)$ suggests that it is beneficial to exclude populations with too-small samples from the source data. The conditions $hs = o(N^{(0)} / \log p)$ and $h \lesssim s$ require that the similarity among different populations is sufficiently high.

Lemma 1 also shows the benefits of our two data integration strategies. When we integrate data across multiple sites, N(k) becomes larger, which relaxes the sparsity conditions and improves the convergence rate. On the other hand, when we incorporate data from diverse populations, the total sample size N is increased, which also improves the convergence rate.

In the federated setting, we first provide in Theorem 3.1 a general conclusion that describes how the convergence rate of Algorithm 1 relies on the initial values. We then provide the convergence rates of Algorithm 1 under initialization Strategies 1 and 2 in Corollaries 1 and 2, respectively.

Theorem 3.1 (Error contraction of Algorithm 1). Assume Conditions 1 and 2 and that the true parameters are in the parameter space $\Theta(s,h)$. Assume that $hs \le c N^{(0)}$ and $\max_{1 \le k \le K} s \log p / N^{(k)} + h s \log p / N^{(0)} = o(1)$. If event $E_0$ in (C.16) in the Supplementary Material holds for the initial estimators $\hat{\beta}_0$ and $\{\hat{w}_0^{(k)}\}_{k=1}^K$, then with probability at least $1 - \exp(-c_2 \log p)$, for any finite $t \ge 1$,

$\|\hat{\beta}_t - \beta\|_2^2 \lesssim \frac{s \log p}{N} + \frac{h \log p}{N^{(0)}} + \left( \max_{1 \le k \le K} \|\hat{w}_0^{(k)} - w^{(k)}\|_2 + \|\hat{\beta}_0 - \beta\|_2 \right)^{4t}.$ (3.1)

Theorem 3.1 establishes the convergence rate of $\hat{\beta}_t$ under certain conditions on the initializations. As the conditions in event $E_0$ guarantee that $\|\hat{w}_0^{(k)} - w^{(k)}\|_2 = o(1)$ for all $1 \le k \le K$ and $\|\hat{\beta}_0 - \beta\|_2 = o(1)$, the iterates $\hat{w}_t^{(k)}$ and $\hat{\beta}_t$ converge to $w^{(k)}$ and β in $\ell_2$-norm, respectively. For large enough t, the convergence rate of $\hat{\beta}_t$ is $s \log p / N + h \log p / N^{(0)}$, which is the minimax rate for estimating β in $\Theta(s,h)$. Hence, the FETA estimators converge at the global minimax rate. With proper initialization, the number of iterations needed for convergence can be very small. Detailed analysis based on the initialization strategies proposed in Section 2.3 is provided in the sequel.

Comparing Theorem 3.1 with Lemma 1, we see some important trade-offs in federated learning. First, the larger estimation error of $\hat{\beta}_t$ at small t, in comparison to the pooled version $\hat{\beta}$, is a consequence of leveraging summary information rather than the individual data. Second, while the accuracy of $\hat{\beta}_t$ improves as t increases, the communication cost also increases. A balance between communication efficiency and estimation accuracy needs to be determined based on the practical constraints.

To better understand the convergence rate, we investigate the initialization strategies proposed in Section 2.3. Under the single-site strategy, we have the following conclusion.

Corollary 1 (Algorithm 1 with single-site initialization). We compute $\hat{\beta}_0$ and $\hat{w}_0^{(k)}$ via (2.1)–(2.3) based on the individual data in site $m^*$. Assume Conditions 1 and 2 and that the true parameters are in the parameter space $\Theta(s,h)$. Assume that $N \gtrsim K N^{(0)}$, $N^{(m^*)} \gtrsim K n^{(m^*,0)}$, $h \lesssim s$, and

$\max_{1 \le k \le K} s^2 \log p / n^{(m^*,k)} + s h \log p / n^{(m^*,0)} = o(1).$

Then with probability at least 1-exp-c2logp, for any fixed T1,

$\|\hat{\beta}_T - \beta\|_2^2 \vee E[\{x_*^\top (\hat{\beta}_T - \beta)\}^2] \lesssim \frac{s \log p}{N} + \frac{h \log p}{N^{(0)}} + \left( \frac{s \log p}{N^{(m^*)}} + \frac{h \log p}{n^{(m^*,0)}} \right)^{2T},$

where $x_* \in \mathbb{R}^p$ is an independent test sample with $\Lambda_{\max}(E[x_* x_*^\top]) \le C < \infty$.

Corollary 1 provides estimation and prediction error bounds for FETA after T iterations. If the response is binary, it is of more interest to bound the classification error, which we address in Remark 4. Corollary 1 uses the facts that $\|\hat{w}_0^{(k)} - w^{(k)}\|_2^2 = O_P((s+h) \log p / n^{(m^*,k)})$ and $\|\hat{\beta}_0 - \beta\|_2^2 = O_P(s \log p / N^{(m^*)} + h \log p / n^{(m^*,0)})$ under the current conditions. We see that, after $O(\ln N / \ln N^{(m^*)} + \ln N^{(0)} / \ln n^{(m^*,0)})$ iterations, $\hat{\beta}_t$ has the same convergence rate as the pooled estimator $\hat{\beta}$. If $N \le (N^{(m^*)})^{\alpha}$ and $N^{(0)} \le (n^{(m^*,0)})^{\alpha'}$ for some finite α and α′, then only a constant number of iterations is needed.

Remark 4 (Classification error under logistic link). Assume the conditions of Corollary 1 and the logistic model $\psi(\mu) = \log(1 + \exp\{\mu\})$. For an independent test sample $x_* \in \mathbb{R}^p$, let $\hat{y}_* = 1\{x_*^\top \hat{\beta}_T > 0\}$. If $P\big(|x_*^\top \beta| \ge 2c_1 \sqrt{s \log p / N} + c_1 \sqrt{h \log p / N^{(0)}}\big) \ge 1 - \exp(-c_2 \log p)$ and $T \gtrsim \max\{\log N / \log N^{(m^*)},\ \log N^{(0)} / \log n^{(m^*,0)}\}$, it holds that

$P\big(\hat{y}_* \ne 1\{x_*^\top \beta > 0\}\big) \le \exp(-c_3 \log p)$

for some positive constants $c_1, c_2$, and $c_3$.

Remark 4 states that if the true signal x*β is strong enough with high probability, then, under the conditions of previous theorems, the misclassification rate is o(1).

Next, we study the performance of Algorithm 1 under the multisite initialization strategy. For simplicity, we study the case where $n^{(I_k,k)} \asymp N^{(I_k)}$; that is, the kth population is abundant in site $I_k$, and hence $w^{(k)}$ can be initialized based on the data from only the kth population in site $I_k$. Specifically,

$\hat{\beta}_0 = \arg\min_{b \in \mathbb{R}^p} \left\{ \frac{1}{n^{(I_0,0)}} L^{(I_0,0)}(b) + \lambda_{\beta,0} \|b\|_1 \right\}, \qquad \hat{w}_0^{(k)} = \arg\min_{b \in \mathbb{R}^p} \left\{ \frac{1}{n^{(I_k,k)}} L^{(I_k,k)}(b) + \lambda_0^{(k)} \|b\|_1 \right\}, \quad k = 1, \dots, K.$ (3.2)

Corollary 2 (Algorithm 1 with multisite initialization). We compute $\hat{\beta}_0$ and $\hat{w}_0^{(k)}$ via (3.2) using data in site $I_k$ for $k = 0, \dots, K$. We take $\lambda_{\beta,0} = c_1 (\log p / n^{(I_0,0)})^{1/2}$ and $\lambda_0^{(k)} = c_1 (\log p / n^{(I_k,k)})^{1/2}$ with some large enough constant $c_1$. Assume Conditions 1 and 2 and that the true parameters are in the parameter space $\Theta(s,h)$. Assume that $N \gtrsim K N^{(0)}$, $h \lesssim s$, and $\max_{1 \le k \le K} s^2 \log p / n^{(I_k,k)} + s h \log p / N^{(I_0)} = o(1)$. Then with probability at least $1 - \exp(-c_2 \log p)$, for any fixed $T \ge 1$,

$\|\hat{\beta}_T - \beta\|_2^2 \vee E[\{x_*^\top (\hat{\beta}_T - \beta)\}^2] \lesssim \frac{s \log p}{N} + \frac{h \log p}{N^{(0)}} + \left( \frac{s \log p}{\min_{0 \le k \le K} n^{(I_k,k)}} \right)^{2T},$

where $x_* \in \mathbb{R}^p$ is an independent test sample with $\Lambda_{\max}(E[x_* x_*^\top]) \le C < \infty$.

Corollary 2 uses the facts that $\|\hat{\beta}_0 - \beta\|_2^2 = O_P(s \log p / n^{(I_0,0)})$ and $\|\hat{w}_0^{(k)} - w^{(k)}\|_2^2 = O_P(s \log p / n^{(I_k,k)})$, given the optimization (3.2). In this case, the number of iterations needed for $\hat{\beta}_t$ to reach the same rate as the pooled estimator $\hat{\beta}$ is about $O(\max_{0 \le k \le K} \ln N / \ln n^{(I_k,k)})$.

4. Simulation studies.

4.1. Predictive performance under different levels of heterogeneity when K=1.

Motivated by our real data application, which is introduced in Section 5, we first consider the setting where K = 1. We generate data to mimic genetic risk prediction in a federated network with M = 5 sites. In site $m \in \{1, \dots, 5\}$, we have $n^{(m,0)} = 400$ samples from the target population and $n^{(m,1)} = 2000$ samples from a source population. We set the dimension of $x_i$ to be p = 2000 and generate $x_i$ to mimic genotype data taking values in {0, 1, 2}. The distribution of $x_i$ in the target population differs from that in the source population in terms of LD structures and minor allele frequencies; see Section E of the Supplementary Material for more details. For each subject we generate the binary outcome variable through a logistic regression model

$\mathrm{logit}(E[y_i \mid x_i]) = x_i^\top b_i,$

where $\mathrm{logit}(t) = \log\{t/(1-t)\}$. The regression coefficient vector is $b_i = \beta$ if subject i is from the target population and $b_i = w^{(1)}$ otherwise. The target coefficient β has s = 100 nonzero entries generated from Unif(−0.4, 0.4). The source coefficient $w^{(1)}$ is generated under the following two settings (a code sketch of both follows the list):

(S1) $w_j^{(1)} = \beta_j + \delta \cdot I(j \in H)$, where H is a random subset of [p] with |H| = h and δ is sampled from {−Δ, Δ} with Δ > 0. This corresponds to the setting where a small number of genetic variants have relatively large differences in effect sizes across populations.

(S2) $w_j^{(1)} = \beta_j + \delta_j I(j \in H)$, where H is a random subset with |H| = 5h and $\delta_j \sim_{\mathrm{i.i.d.}} N(0, \Delta/3)$. This corresponds to the setting where a large number of genetic variants have relatively small differences in effect sizes across populations.
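A sketch of how the coefficients in (S1) and (S2) can be generated, with β's nonzero entries drawn as described above; the seeding and helper names are our choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, s = 2000, 100

# Target coefficients: s nonzero entries drawn uniformly from (-0.4, 0.4).
beta = np.zeros(p)
beta[rng.choice(p, s, replace=False)] = rng.uniform(-0.4, 0.4, s)

def source_S1(beta, h, Delta):
    """(S1): h entries shifted by +/- Delta (few, large differences)."""
    w = beta.copy()
    H = rng.choice(len(beta), h, replace=False)
    w[H] += rng.choice([-Delta, Delta], h)
    return w

def source_S2(beta, h, Delta):
    """(S2): 5h entries shifted by N(0, Delta/3) noise, i.e. standard
    deviation sqrt(Delta/3) (many, small differences)."""
    w = beta.copy()
    H = rng.choice(len(beta), 5 * h, replace=False)
    w[H] += rng.normal(0.0, np.sqrt(Delta / 3), 5 * h)
    return w
```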

In both settings the level of heterogeneity increases with h and Δ. We compare the following methods: (1) federated learning based on all the target data (target-only), (2) federated learning based on all the source data (source-only), (3) federated learning based on all data, combining both the source and target (combined), (4) Algorithm 2 with T = 1 (proposed (T=1)), (5) Algorithm 2 with T = 3 (proposed (T=3)), and (6) the pooled transfer learning method in equations (2.1)–(2.3), where data from all sites are pooled together (pooled). The methods are evaluated by the out-of-sample area under the receiver operating characteristic curve (AUC) on a randomly generated testing sample of size n = 1000.

We present in Figure 2 the AUC over 200 replications under the different simulation settings. In setting 1, as Δ and h increase, the prediction performance of all methods decreases. This is consistent with the intuition that a source population more different from the target is less helpful. The estimator combining source and target data outperforms the source-only estimator in all scenarios, as it incorporates target samples. Our proposed estimators, even with only one round of communication (T = 1), achieve better performance than these benchmarks. With T = 3, our methods further improve the performance, especially when heterogeneity is high. Our estimators have performance comparable to the pooled estimator when Δ or h is small. When the level of heterogeneity is high, combining source and target data leads to worse performance than the target-only model, while the transfer learning and proposed methods, owing to the aggregation step, can prevent negative transfer from a source population that is highly different from the target. Results from setting 2 convey similar conclusions. We see that, numerically, even when $\|\delta^{(1)}\|_0$ is relatively large (even larger than s), our methods still yield improvement, given that the magnitudes of the entries are small. We also evaluated the effect of aggregation by comparing the proposed estimator with $\hat{\beta}$ from Algorithm 1 (see Figures E.1 and E.2 in the Supplementary Material). Across all scenarios the proposed estimators with aggregation perform no worse than the estimators without aggregation. The improvement is substantial when the level of heterogeneity is moderate or large.

Fig. 2. Comparison of AUC over 200 replications under simulation settings 1 and 2.

4.2. Robustness under different levels of heterogeneity when K>1.

We then consider the setting with multiple source populations and investigate the robustness of the proposed method when some source populations are more helpful than others. In particular, we choose K = 3 and M = 5; that is, data from three source populations and one target population are collected from five sites. We set $n^{(m,0)} = 400$, $n^{(m,1)} = 600$, $n^{(m,2)} = 1000$, and $n^{(m,3)} = 2000$ for all $m \in \{1,2,3,4,5\}$. We generate β in the same way as described in Section 4.1. To generate $w^{(k)}$, we follow the procedure described in setting (S1) of Section 4.1, where for a “good” source population we choose h = 10 and Δ = 1, and for a “bad” source we choose h = 100 and Δ = 1. We consider the following cases:

  • Case 1: No bad source, that is, $\|\delta^{(k)}\|_0 = 10$ for k = 1, 2, 3.

  • Case 2: One bad source, that is, $\|\delta^{(k)}\|_0 = 10$ for k = 1, 2, and $\|\delta^{(3)}\|_0 = 100$.

  • Case 3: Two bad sources, that is, $\|\delta^{(1)}\|_0 = 10$ and $\|\delta^{(k)}\|_0 = 100$ for k = 2, 3.

  • Case 4: Three bad sources, that is, $\|\delta^{(k)}\|_0 = 100$ for k = 1, 2, 3.

From Figure 3, when all the source populations are helpful (Case 1), the proposed method with T = 1 and T = 3 has nearly the same AUC as the pooled transfer learning method. The combined approach also performs well in Case 1, although it has slightly lower AUC than the proposed method. From Case 1 to Case 4, the number of helpful source populations decreases, and the performance of the proposed method decreases accordingly, but it performs no worse than the target-only estimator across all four cases. In Case 4, none of the source data are helpful, and the combined approach has AUC lower than the target-only approach. The proposed method has nearly the same performance as the target-only model, while the pooled transfer learning method is slightly better than the target-only approach; the remaining gap is probably due to the information loss of federated learning.

Fig. 3. Comparison of AUC over 200 replications under Case 1 to Case 4, where we set K = 3 and M = 5. The number of helpful source populations decreases from Case 1 to Case 4.

In sum, the simulation studies demonstrate that our proposed methods improve predictive accuracy compared to the benchmark methods. Compared to the ideal case where data are pooled together, FETA with only one iteration has comparable or slightly worse performance in most scenarios. When the level of heterogeneity is high, extra iterations can improve the accuracy and close the gap between the federated analysis and the pooled analysis. In general, we recommend the aggregation step to ensure the robustness of the results and to effectively leverage the source data that are most helpful to the target.

5. Risk prediction of extreme obesity using data from the eMERGE network.

Obesity is one of the most significant risk factors for many complex human diseases, including hypertension, coronary heart disease, and type II diabetes (Kaplan (1989)). The prevalence of obesity among adults has more than doubled in the past three decades, and obesity continues to be a public health concern (Pan et al. (2011)). Many studies have suggested that genetics is likely to play a substantial role in the predisposition to obesity and may contribute up to 70% of the risk for the disease (Golden and Kessler (2020)). Over a hundred genes and gene variants related to excess weight have been discovered (Loos and Yeo (2022)), which has built the foundation for genetic risk prediction models. As discussed in Section 1, existing genomics studies of obesity were mostly conducted in European-ancestry populations (Zillikens et al. (2017)), raising concerns regarding the generalizability and transferability of the study results.

5.1. Lack of representation of non-EA populations in eMERGE network.

To study this problem, we applied our proposed method to construct risk prediction models using multicenter data from the eMERGE network, where EHR-derived phenotypes for 55,029 subjects from 10 participating sites were linked with DNA samples from biorepositories (Gottesman et al. (2013)). Among the 10 participating sites, three are children’s hospitals and are not included in this study. Trained EHR phenotyping algorithms are applied to derive the case-control status of extreme obesity, where participants are classified as a “case,” a “control,” or “unknown” (meaning the algorithm is not able to determine the case-control status based on the data available in the EHR system). In our study we exclude participants with unknown case-control status. We consider self-reported race as a population indicator, and Table 1 shows the sample size distribution across races and sites.

Table 1.

Sample sizes by self-reported races and sites.

Site                     | American Indian or Alaska Native | Asian | Black or African American | Unknown | White
Geisinger Health System  | * | * | *    | *   | 1023
Group Health Cooperative | * | * | *    | *   | 252
Marshfield Clinic        | * | * | *    | *   | 1477
Mayo Clinic              | * | * | *    | 47  | 1278
Mount Sinai              | * | * | 1148 | 343 | 185
Northwestern University  | * | * | 167  | *   | 1306
Vanderbilt University    | * | * | 646  | 35  | 1472

* indicates sample sizes less than 20.

A majority of participants are White (>75%) across all seven sites, and in some sites, such as the Geisinger Health System, Marshfield Clinic, and Mayo Clinic, more than 95% of the participants are White. Around 20% of the participants are Black or African American (AA), mostly from Mount Sinai, Northwestern University, and Vanderbilt University. Except for Mount Sinai, White participants are overrepresented in the remaining six sites. Very few participants are Asian (<1%) or American Indian or Alaska Native (<1%); every site has fewer than 20 participants in these racial groups.

In this application we treat the Black and AA population as our target population and the White population as the source. Due to the extremely small sample sizes from the other racial groups, a population-specific model is difficult to fit, and we therefore incorporate them into the source population. As shown in Appendix F of the Supplementary Material, a sensitivity analysis that removes participants whose self-reported race is Asian, American Indian or Alaska Native, or unknown yields results consistent with the current findings, probably due to the extremely small sample sizes of these groups.

5.2. Data processing and feature selection.

Among the seven sites, Mount Sinai, Northwestern, and Vanderbilt have substantial target samples, while Geisinger, Group Health, Marshfield, and Mayo each have a target sample size below 14, which would contribute little to the model fitting while increasing the communication cost. We therefore exclude the target samples from Geisinger, Group Health, Marshfield, and Mayo.

To test the performance of the proposed method, we construct three testing datasets, one each from Mount Sinai, Northwestern, and Vanderbilt. Specifically, from Vanderbilt and Mount Sinai we randomly select two internal testing datasets, each consisting of 300 target samples. We then hold out all the target samples (n = 167) from Northwestern as an external testing dataset, which is excluded from the entire model training stage. We expect this external testing dataset to help evaluate the portability of the trained model when applied to a new dataset.

Predictors include age (coded as birth decades), gender, the top 3 principal components (PCs) calculated from the genotypic data, and top single nucleotide polymorphisms (SNPs), coded as 0, 1, and 2 and selected by their marginal associations with the phenotype. In particular, we fit a regression between the phenotype and each SNP, controlling for age, gender, and the 3 PCs, and then selected the top 2047 SNPs after filtering out SNPs with high LD, high missing proportions, or low minor allele frequency. The detailed filtering procedure and selected SNPs can be found in Section F of the Supplementary Material.
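A sketch of the marginal screening step, assuming a post-QC genotype matrix G (samples × SNPs), a binary phenotype y, and a covariate matrix covars holding age, gender, and the PCs; the function names and the use of statsmodels for the per-SNP logistic fits are our choices, not the paper's pipeline.

```python
import numpy as np
import statsmodels.api as sm

def marginal_screen(G, y, covars, n_top=2047):
    """For each SNP column of G, fit a logistic regression of y on
    [intercept, covariates, SNP] and record the SNP coefficient's
    z-statistic; return the indices of the n_top most associated SNPs."""
    base = sm.add_constant(covars)
    z = np.zeros(G.shape[1])
    for j in range(G.shape[1]):
        X = np.column_stack([base, G[:, j]])
        fit = sm.Logit(y, X).fit(disp=0)
        z[j] = fit.tvalues[-1]          # z-statistic of the SNP coefficient
    return np.argsort(-np.abs(z))[:n_top]
```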

We applied our proposed method with logistic regression. As in the simulation study, we choose the number of iterations to be T = 1 and T = 3 and compare them with the methods defined in Section 4. Methods are evaluated on the three testing datasets from Mount Sinai, Vanderbilt, and Northwestern. To account for sampling variability when selecting the testing datasets from Vanderbilt and Mount Sinai, we repeat the evaluation 20 times; within each replication, prediction performance is evaluated separately on the three testing sets. Mount Sinai was chosen to be the leading site, whose individual-level data were used for the aggregation step.

5.3. Evidence for data heterogeneity between the source and the target populations.

We first investigate the heterogeneity of the distributions of the predictors between the target and the source populations. Using the genotypic data from the target and the source samples, we calculate the minor allele frequencies and pairwise correlations of the SNPs. For clearer presentation, we only plot the SNPs on chromosome 1 in Figure 4; the comparisons of SNPs on chromosomes 2–22 lead to consistent conclusions.

Fig. 4. Comparisons of correlations and minor allele frequencies (MAFs) calculated from the source and the target populations. Among the 2047 selected SNPs, we show the SNPs on chromosome 1 for illustration purposes. The left panel shows the pairwise correlations between SNPs calculated from the source (the lower triangle) and the target (the upper triangle). The right panel compares the MAFs.

In the left panel of Figure 4, we compare the correlations of SNPs calculated from the target samples (the upper triangle) to the correlations calculated from the source samples (the lower triangle). Overall, the magnitudes of the correlations are larger and more structured in the target population than in the source population. The weaker correlations observed in the source population are possibly due to the fact that we use the LD structure obtained from all data to remove SNPs with high correlations, and this structure is closer to the LD structure of the White population because of its dominant sample size. In the right panel of Figure 4, we also observe that the MAFs of SNPs in the target population are highly different from those in the source population. Figure 4 is consistent with existing findings suggesting substantial differences in LD structures and MAFs across ancestral and ethnic groups (Martin et al. (2019)).

In addition, our method yields estimated differences of the coefficients between the source and the target models, that is, $\hat{\delta}$, and Table 2 shows the SNPs with the top differential effects between the two populations. Among the 25 SNPs, we found that most are in introns and intergenic regions. Existing literature found that rs6673201 is associated with the interaction between insulin sensitivity index and BMI (Walford et al. (2016)); MEGF6 (multiple epidermal growth factor-like domains protein 6) was reported to contribute to the periotic mesenchyme and its derivatives, skin epidermis, and certain cells in the brain and ribs (Wang et al. (2019b)); HPSE2 was reported to be related to childhood BMI trajectory and adult lung function (Rathod et al. (2022)); ALOX5AP and TPD52 were reported to be associated with body weight (Kaaman et al. (2006)); and STK39 was shown to be a hypertension susceptibility gene (Wang et al. (2009)). Although there have not been enough multiethnic studies supporting the potentially heterogeneous effects of the above genes and SNPs on obesity across ancestries, our analysis provides potential evidence for further investigation.

Table 2.

Top 25 SNPs having different effects in the source and the target populations.

 # | SNP ID      | Gene name     | Function
 1 | rs6673201   | C1orf198      | 3_prime_UTR
 2 | rs41315290  | MEGF6         | intron
 3 | rs546147    | HPSE2         | intron
 4 | rs57588150  | ALOX5AP       | intergenic
 5 | rs13410278  | AC016723.4    | intergenic
 6 | rs10497334  | STK39         | intron
 7 | rs9975258   | AP000233.4    | intron
 8 | rs75401219  |               | intergenic
 9 | rs845884    | RP11–157J24.2 | intergenic
10 | rs7741633   | TPD52L1       | intron
11 | rs4395726   | TCTE3         | intron
12 | rs7763865   | KIAA0319      | intron
13 | rs2143457   | TMEM14C       | intergenic
14 | rs2485835   | RP11–374I15.1 | intergenic
15 | rs11975134  | NPY           | intergenic
16 | rs10282500  | TBRG4         | intergenic
17 | rs7836712   | ZFPM2         | intron
18 | rs11782183  | PVT1          | intron
19 | rs55641476  | LRRC6         | intron
20 | rs113235293 | KHDRBS3       | intron
21 | rs672918    | GLIS3         | intron
22 | rs10823830  | CDH23         | 3_prime_UTR
23 | rs8008067   | RGS6          |
24 | rs6747667   | C2orf76       | intron
25 | rs4848615   | AC018866.1    | intergenic

5.4. Evaluation of the predictive performance.

Figure 5 presents the prediction performance of the compared methods. Due to the limited sample size, the target-only estimator has the lowest AUC among all methods, with an average AUC around 0.64–0.66 across the three testing datasets. The target-only estimator also has the largest variation across random replications, with AUC ranging from as low as 0.51 to as high as 0.85. The source-only estimator performs better than the target-only estimator, with an average AUC around 0.71. Combining source and target data leads to better performance than a population-specific training strategy, with an average AUC of 0.8. The pooled transfer learning estimator has the highest AUC, around 0.85–0.89. With only one round of communication, FETA achieves performance comparable to the pooled transfer learning estimator (ratios of AUC range from 88% to 107% with a median of 97%). With two additional rounds of communication, the AUC is nearly the same as the pooled estimator (ratios of AUC range from 92% to 102% with a median of 99%).

Fig. 5. Comparisons of prediction performance across three testing datasets. Each colored bar denotes the average AUC over 20 replications, and the error bars indicate the highest and lowest performance.

Interestingly, the performance of FETA with T = 1 and T = 3 is similar when applied to the internal testing sets at Vanderbilt and Mount Sinai. When evaluated on the external testing set from Northwestern University, FETA with T = 3 performs better than with T = 1, indicating that additional iterations might improve the portability of the prediction models. Overall, the results from the external evaluation are similar to the internal evaluation, indicating relatively low between-site heterogeneity. Depending on the goal of the prediction model, one can modify our method to treat a specific site as the target population and other sites as source data. We also performed a sensitivity study in which we exclude from the source samples participants who have unknown self-reported race or who are non-White; the prediction performance is nearly the same as the current results. More details can be found in Appendix F of the Supplementary Material.

In sum, this application demonstrates the feasibility and promise of implementing our methods in large health data networks for the construction of risk prediction models. They may also be applied in other use cases, such as phenotyping or risk profiling, using multicenter data.

6. Discussion.

In this paper we propose federated transfer learning methods to improve the performance of estimation and prediction in underrepresented populations, enabling the incorporation of data from diverse populations stored at different institutions in a communication-efficient manner that protects individual-level data. Our theoretical analysis and numerical experiments demonstrate the improved accuracy and robustness of the proposed methods compared with benchmark methods. The application to the eMERGE Network for constructing prediction models for extreme obesity in the African-ancestry population demonstrates the feasibility of applying our methods to large clinical/genomics consortia for improved risk prediction and stratification among underrepresented groups.

Our proposed FETA framework and the theoretical analyses are illustrated based on high-dimensional GLMs with the ℓ1-penalty, due to their wide application in biomedical research (Morin et al. (2021), Wu, Roy and Stewart (2010), Lange et al. (2014)). However, the proposed framework can be extended to machine learning methods with different penalty terms or model formulations. For example, if the difference in regression parameters between a source and the target population is nonsparse, we can replace the ℓ1-penalty with the ℓ2-penalty or an elastic net type of penalty. In addition, the GLM can be replaced with other methods, such as the support vector machine, which has been studied in the distributed setting (Stolpe, Bhaduri and Das (2016)). To account for the heterogeneity across populations, we can also allow a discrepancy term between the source and the target model parameters. However, some machine learning models may not be directly extended to the FETA framework; for instance, random forests have not, to our knowledge, been formally studied in the distributed setting.
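As an illustration of the penalty swap in the simplest (linear-model) special case, the sketch below estimates the target coefficients as the source coefficients plus a penalized correction fitted on the target residuals, in the spirit of two-step transfer learning approaches (Li, Cai and Li (2022)); switching from a sparse to a nonsparse difference changes only the penalized regression used. The function and variable names are illustrative assumptions, not part of the paper's software.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

def transfer_fit(X_tgt, y_tgt, beta_src, penalty="l1", lam=0.1):
    """Estimate target coefficients as beta_src plus a penalized
    correction delta fitted on the target residuals (linear model).

    penalty="l1" assumes a sparse source-target difference; any other
    value uses an elastic net, suited to a nonsparse difference.
    """
    resid = y_tgt - X_tgt @ beta_src          # offset by the source model
    model = (Lasso(alpha=lam) if penalty == "l1"
             else ElasticNet(alpha=lam, l1_ratio=0.5))
    model.fit(X_tgt, resid)                   # delta-hat via penalized regression
    return beta_src + model.coef_
```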

We account for population-level heterogeneity by allowing both the conditional distribution f(y | x) and the marginal distribution f(x) to differ across populations. We use an aggregation method to improve the robustness of our methods to the level of heterogeneity. To account for site-level heterogeneity, our methods allow f(x) to vary across sites, while f(y | x) is assumed to be shared across sites within a given population. In practice, there may still be site-level heterogeneity that causes differences in f(y | x), and robust methods to account for such heterogeneity need to be incorporated, which we will consider in future work.
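As a simplified stand-in for such an aggregation step (the exact aggregation rule is specified in the paper), the following sketch illustrates the general idea with exponential weighting of candidate estimators by their log-likelihood on a held-out target validation set; all names and inputs are hypothetical.

```python
import numpy as np
from scipy.special import expit

def aggregate(candidates, X_val, y_val, temperature=1.0):
    """Exponentially weight candidate coefficient vectors by their
    logistic log-likelihood on a held-out target validation set."""
    candidates = np.asarray(candidates)            # shape (K, p)
    prob = np.clip(expit(X_val @ candidates.T), 1e-10, 1 - 1e-10)  # (n, K)
    loglik = (y_val[:, None] * np.log(prob)
              + (1 - y_val[:, None]) * np.log(1 - prob)).sum(axis=0)
    w = np.exp((loglik - loglik.max()) / temperature)
    w /= w.sum()                                   # normalized weights
    return w @ candidates                          # aggregated estimator, shape (p,)
```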

Our proposed methods improve the fairness of statistical models by reducing the gap in estimation accuracy across populations that is due to lack of representation. Our theoretical results provide insights for future data collection, as they reveal how the level of heterogeneity affects the accuracy gain and the target-population sample size needed to achieve accuracy comparable to that of the source populations. This differs from methods that impose constraints on model fitting to ensure that the prediction accuracy of an algorithm is at the same level across different populations (Mehrabi et al. (2021)). Since we focus on improving model performance in an underrepresented population, our work cannot guarantee complete fairness in prediction accuracy across all groups. In the future, we can incorporate fairness corrections and constraints into our framework and, in the meantime, consider a wider variety of models, including those for longitudinal and survival analysis and causal inference, to advance algorithmic fairness in precision medicine.

Funding.

Rui Duan was supported by the National Institutes of Health (R01 GM148494), and Tianxi Cai was supported by the National Institutes of Health (R01 LM013614, R01 HL089778).

Acknowledgments.

The authors would like to thank the anonymous referees, an Associate Editor, and the Editor for their constructive comments that improved the quality of this paper. The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000888.v1.p1.


Footnotes

SUPPLEMENTARY MATERIAL

Supplementary Material to "Targeting underrepresented populations in precision medicine: A federated transfer learning approach" (DOI: 10.1214/23-AOAS1747SUPP; .pdf). This supplementary file contains alternative algorithms (Sections A and B), conditions, proofs of theorems, corollaries and additional lemmas (Sections C and D), additional simulation results (Section E) and data processing details (Section F).

REFERENCES

  1. Ashley EA (2016). Towards precision medicine. Nat. Rev. Genet. 17 507–522.
  2. Bastani H (2020). Predicting with proxies: Transfer learning in high dimension. Manage. Sci. 67 2657–3320.
  3. Bickel PJ, Ritov Y and Tsybakov AB (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732. MR2533469 10.1214/08-AOS620
  4. Cai T, Liu M and Xia Y (2022). Individual data protected integrative regression analysis of high-dimensional heterogeneous data. J. Amer. Statist. Assoc. 117 2105–2119. MR4528492 10.1080/01621459.2021.1904958
  5. Cai TT and Wei H (2021). Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. Ann. Statist. 49 100–128. MR4206671 10.1214/20-AOS1949
  6. Cai M, Xiao J, Zhang S et al. (2021). A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. Am. J. Hum. Genet. 108 632–655.
  7. Chen X and Xie M (2014). A split-and-conquer approach for analysis of extraordinarily large data. Statist. Sinica 24 1655–1684. MR3308656
  8. Collins R (2012). What makes UK Biobank special? Lancet 379 1173–1174.
  9. Collins FS and Varmus H (2015). A new initiative on precision medicine. N. Engl. J. Med. 372 793–795.
  10. Duan R, Ning Y and Chen Y (2022). Heterogeneity-aware and communication-efficient distributed statistical inference. Biometrika 109 67–83. MR4374641 10.1093/biomet/asab007
  11. Duan R, Boland MR, Moore JH and Chen Y (2019). ODAL: A one-shot distributed algorithm to perform logistic regressions on electronic health records data from multiple clinical sites. Pacific Symposium on Biocomputing 30–41.
  12. Duan R, Luo C, Schuemie MJ et al. (2020). Learning from local to global: An efficient distributed algorithm for modeling time-to-event data. J. Amer. Med. Inform. Assoc. 27 1028–1036.
  13. Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, Peterson R and Domingue B (2019). Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun. 10 1–9.
  14. Golden A and Kessler C (2020). Obesity and genetics. J. Amer. Assoc. Nurse Pract. 32 493–496.
  15. Gottesman O, Kuivaniemi H, Tromp G, Faucett WA, Li R, Manolio TA, Sanderson SC, Kannry J, Zinberg R et al. (2013). The electronic medical records and genomics (eMERGE) network: Past, present, and future. Genet. Med. 15 761–771.
  16. Guo Z (2020). Inference for high-dimensional maximin effects in heterogeneous regression models using a sampling approach. Preprint, arXiv:2011.07568.
  17. Jordan MI, Lee JD and Yang Y (2019). Communication-efficient distributed statistical inference. J. Amer. Statist. Assoc. 114 668–681. MR3963171 10.1080/01621459.2018.1429274
  18. Kaaman M, Rydén M, Axelsson T, Nordström E, Sicard A, Bouloumie A, Langin D, Arner P and Dahlman I (2006). ALOX5AP expression, but not gene haplotypes, is associated with obesity and insulin resistance. Int. J. Obes. 30 447–452.
  19. Kaplan NM (1989). The deadly quartet: Upper-body obesity, glucose intolerance, hypertriglyceridemia, and hypertension. Arch. Intern. Med. 149 1514–1520.
  20. Kraft SA, Cho MK, Gillespie K et al. (2018). Beyond consent: Building trusting relationships with diverse populations in precision medicine research. Am. J. Bioethics 18 3–20.
  21. Kushida CA, Nichols DA, Jadrnicek R et al. (2012). Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies. Med. Care 50 S82.
  22. Lam M, Chen C-Y, Li Z et al. (2019). Comparative genetic architectures of schizophrenia in East Asian and European populations. Nat. Genet. 51 1670–1678.
  23. Landry LG, Ali N, Williams DR et al. (2018). Lack of diversity in genomic databases is a barrier to translating precision medicine research into practice. Health Aff. 37 780–785.
  24. Lange K, Papp JC, Sinsheimer JS and Sobel EM (2014). Next generation statistical genetics: Modeling, penalization, and optimization in high-dimensional data. Annu. Rev. Stat. Appl. 1 279.
  25. Lecué G and Rigollet P (2014). Optimal learning with Q-aggregation. Ann. Statist. 42 211–224. MR3178462 10.1214/13-AOS1190
  26. Lee JD, Liu Q, Sun Y and Taylor JE (2017). Communication-efficient sparse regression. J. Mach. Learn. Res. 18 Paper No. 5, 30. MR3625709
  27. Li S, Cai T and Duan R (2023). Supplement to "Targeting underrepresented populations in precision medicine: A federated transfer learning approach." 10.1214/23-AOAS1747SUPP
  28. Li S, Cai TT and Li H (2020). Transfer learning in large-scale Gaussian graphical models with false discovery rate control.
  29. Li S, Cai TT and Li H (2022). Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. J. R. Stat. Soc. Ser. B. Stat. Methodol. 84 149–173. MR4400393
  30. Li R, Lin DKJ and Li B (2013). Statistical inference in massive data sets. Appl. Stoch. Models Bus. Ind. 29 399–409. MR3117826 10.1002/asmb.1927
  31. Li R, Chen Y, Ritchie MD et al. (2020). Electronic health records and polygenic risk scores for predicting disease risk. Nat. Rev. Genet. 21 493–502.
  32. Liu M, Xia Y, Cho K and Cai T (2021). Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. J. Mach. Learn. Res. 22 Paper No. 126, 26. MR4279777
  33. Loos RJ and Yeo GS (2022). The genetics of obesity: From discovery to biology. Nat. Rev. Genet. 23 120–133.
  34. Martin AR, Kanai M, Kamatani Y et al. (2019). Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51 584–591.
  35. McCarty CA, Chisholm RL, Chute CG et al. (2011). The eMERGE network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med. Genom. 4 1–11.
  36. Mehrabi N, Morstatter F, Saxena N, Lerman K and Galstyan A (2021). A survey on bias and fairness in machine learning. ACM Comput. Surv. 54 1–35.
  37. Morin O, Vallières M, Braunstein S, Ginart JB, Upadhaya T, Woodruff HC, Zwanenburg A, Chatterjee A, Villanueva-Meyer JE et al. (2021). An artificial intelligence framework integrating longitudinal electronic health records with real-world data enables continuous pan-cancer prognostication. Nat. Cancer 2 709–722.
  38. Pan L, Freedman DS, Gillespie C, Park S and Sherry B (2011). Incidences of obesity and extreme obesity among US adults: Findings from the 2009 Behavioral Risk Factor Surveillance System. Popul. Health Metr. 9 1–9.
  39. Qian J, Tanigawa Y, Du W et al. (2020). A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet. 16 e1009141.
  40. Rathod R, Zhang H, Karmaus W, Ewart S, Mzayek F, Arshad SH and Holloway JW (2022). Association of childhood BMI trajectory with post-adolescent and adult lung function is mediated by pre-adolescent DNA methylation. Respir. Res. 23 1–11.
  41. Rigollet P and Tsybakov A (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771. MR2816337 10.1214/10-AOS854
  42. Sankar PL and Parker LS (2017). The Precision Medicine Initiative's All of Us Research Program: An agenda for research on its ethical, legal, and social issues. Genet. Med. 19 743–750.
  43. Stolpe M, Bhaduri K and Das K (2016). Distributed support vector machines: An overview. In Solving Large Scale Learning Tasks. Lecture Notes in Computer Science 9580 109–138. Springer, Cham. MR3537495 10.1007/978-3-319-41706-6_5
  44. Sudlow C, Gallacher J, Allen N et al. (2015). UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12 e1001779.
  45. Tian Y and Feng Y (2022). Transfer learning under high-dimensional generalized linear models. J. Amer. Statist. Assoc. 1–14. 10.1080/01621459.2022.2071278
  46. Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
  47. Tsybakov AB (2014). Aggregation and minimax optimality in high-dimensional estimation. In Proceedings of the International Congress of Mathematicians-Seoul 2014, Vol. IV 225–246. Kyung Moon Sa, Seoul. MR3727610
  48. van de Geer SA (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645. MR2396809 10.1214/009053607000000929
  49. van der Haak M, Wolff AC, Brandner R et al. (2003). Data security and protection in cross-institutional electronic patient records. Int. J. Med. Inform. 70 117–130.
  50. Walford GA, Gustafsson S, Rybin D, Stančáková A, Chen H, Liu C-T, Hong J, Jensen RA, Rice K et al. (2016). Genome-wide association study of the modified Stumvoll insulin sensitivity index identifies BCL2 and FAM19A2 as novel insulin sensitivity loci. Diabetes 65 3200–3211.
  51. Wang Y, O'Connell JR, McArdle PF, Wade JB, Dorff SE, Shah SJ, Shi X, Pan L, Rampersaud E et al. (2009). Whole-genome association study identifies STK39 as a hypertension susceptibility gene. Proc. Natl. Acad. Sci. USA 106 226–231.
  52. Wang X, Yang Z, Chen X and Liu W (2019a). Distributed inference for linear support vector machine. J. Mach. Learn. Res. 20 Paper No. 113, 41. MR3990467
  53. Wang Y, Song H, Wang W and Zhang Z (2019b). Generation and characterization of Megf6 null and Cre knock-in alleles. Genesis 57 e23262.
  54. Weiss K, Khoshgoftaar TM and Wang D (2016). A survey of transfer learning. J. Big Data 3 1–40.
  55. West KM, Blacksher E and Burke W (2017). Genomics, health disparities, and missed opportunities for the nation's research agenda. JAMA 317 1831–1832.
  56. Wu J, Roy J and Stewart WF (2010). Prediction modeling using EHR data: Challenges, strategies, and a comparison of machine learning approaches. Med. Care S106–S113.
  57. Zhou W et al. (2021). Global Biobank Meta-analysis Initiative: Powering genetic discovery across human diseases. medRxiv.
  58. Zillikens MC, Demissie S, Hsu Y-H, Yerges-Armstrong LM, Chou W-C, Stolk L, Livshits G, Broer L, Johnson T et al. (2017). Large meta-analysis of genome-wide association studies identifies five loci for lean body mass. Nat. Commun. 8 1–13.
