Abstract
Polygenic risk scores (PRS) are now showing promising predictive performance on a wide variety of complex traits and diseases, but there exists a substantial performance gap across different populations. We propose ME-Bayes SL, a method for ancestry-specific polygenic prediction that borrows information in the summary statistics from genome-wide association studies (GWAS) across multiple ancestry groups. ME-Bayes SL conducts Bayesian hierarchical modeling under a multivariate spike-and-slab model for effect-size distribution and incorporates an ensemble learning step to combine information across different tuning parameter settings and ancestry groups. In our simulation studies and data analyses of 16 traits across four distinct studies, totaling 5.7 million participants with a substantial ancestral diversity, ME-Bayes SL shows promising performance compared to alternatives. The method, for example, has an average gain in prediction across 11 continuous traits of 40.2% and 49.3% compared to PRS-CSx and CT-SLEB, respectively, in the African Ancestry population. The best-performing method, however, varies by GWAS sample size, target ancestry, underlying trait architecture, and the choice of reference samples for LD estimation, and thus ultimately, a combination of methods may be needed to generate the most robust PRS across diverse populations.
Keywords: Bayesian hierarchical modeling, Effect-size distribution, Ensemble learning, Genome-wide association studies, Multi-ancestry polygenic prediction, Polygenic architecture
Introduction
Polygenic models for predicting complex traits are widely developed utilizing summary-level association statistics from genome-wide association studies (GWAS). While being on course to translate GWAS results into clinical practice, polygenic risk scores (PRS) encounter obstacles due to the poor predictive performance on underrepresented non-European (non-EUR) ancestry populations, especially those with substantial African ancestry1–4. As sample sizes for GWAS in many non-EUR populations remain low for many traits, applications of PRS often rely on EUR-based models, which underperform in other populations due in part to differences in allele frequencies, SNP effect sizes, and linkage disequilibrium (LD)1–3,5,6.
To improve the poor performance of PRS on non-EUR populations, several multi-ancestry methods have recently been developed to combine information from available GWAS summary statistics and LD reference data across multiple ancestry groups. One simple approach is the weighted PRS7, which trains a linear combination of the PRS developed using single-ancestry methods (e.g., LD clumping and P-value thresholding, C+T) applied separately to available GWAS data across different ancestry groups7. More recent methods attempt to borrow information across ancestry at the level of individual SNPs based on Bayesian methods8,9, penalized regressions10,11, or through the extension of C+T12. However, applications show that no single method performs uniformly the best, and their performance depends on many aspects, including the underlying genetic architecture of the trait, the absolute and relative sample sizes across populations, and the algorithm for the estimation of LD based on the underlying reference dataset12.
We propose ME-Bayes SL, a novel method for developing ancestry-specific PRS by jointly modeling ancestry-specific GWAS summary data across diverse ancestries. ME-Bayes SL consists of two steps: a Bayesian modeling step that explicitly models the genetic correlation in SNP effect sizes across ancestries via a multivariate spike-and-slab prior (“ME-Bayes”), and a super learning (SL) step that seeks an “optimal” combination of a series of PRS obtained from ME-Bayes under different tuning parameter settings and across ancestries. We evaluate the proposed method and benchmark it against a variety of alternatives through large-scale simulation studies and analyses of 16 traits from four different studies: (1) the Population Architecture using Genomics and Epidemiology (PAGE) Study supplemented with data from the Biobank Japan (BBJ) and UK Biobank (UKBB), (2) Global Lipids Genetics Consortium (GLGC), (3) All of Us (AoU), and (4) 23andMe, Inc. These studies, with training data and additional validation samples from the UKBB study, included a total of 3.4 million European (EUR), 226K Admixed African, African, or African American (AFR), 437K Admixed Americans or Hispanic/Latino (AMR), 389K East Asian (EAS), and 56K South Asian (SAS). Results reveal the promising performance of ME-Bayes SL for developing robust PRS in the multi-ancestry setting and identify a number of practical considerations for implementations that are crucial to the performance of the method.
Results
ME-Bayes SL Overview
Considering that GWAS summary-level association statistics can be shared much more easily among research teams than individual-level genotype and phenotype data, we will focus on PRS methods that can use summary-statistics from the GWAS training samples. The implementation of our proposed method ME-Bayes SL, as well as other multi-ancestry methods which we will compare ME-Bayes SL to, requires three (ancestry-specific) datasets from each training ancestry group: (1) GWAS summary data, (2) LD reference data, and (3) a validation (tuning + testing) dataset with genotype and phenotype data for an adequate number of individuals that are independent of GWAS samples and LD reference samples.
We now introduce ME-Bayes SL, a novel method for enhanced ancestry-specific polygenic risk prediction based on available GWAS summary-level association statistics and LD reference data across multiple ancestry groups. ME-Bayes SL consists of two steps (Figure 1): (1) a Bayesian modeling step (“ME-Bayes”) to model the genetic correlation structure in SNP effect size across ancestry groups while accounting for ancestry-specific LD across SNPs, and (2) a super learning (SL) step to construct an “optimal” linear combination of a series of PRS obtained from ME-Bayes under different tuning parameter settings and across all ancestry groups. Additionally, a step 0 was conducted before step 1 to obtain tuned causal SNP proportion and heritability parameters for each training ancestry group from LDpred2. These parameters will be used to specify the prior causal SNP proportions and heritability parameters in ME-Bayes.
Step 1: ME-Bayes: a Bayesian model for estimating ancestry-specific SNP effect sizes
ME-Bayes tailors effect size estimates for each ancestry group by incorporating data from other ancestry groups via Bayesian hierarchical modeling with a multivariate spike-and-slab prior on SNP effect sizes across ancestry groups. For population-specific SNPs, i.e., SNPs with minor allele frequency (MAF)>0.01 in only one ancestry group, we assume a spike-and-slab prior as in LDpred2. For SNPs that are polymorphic across multiple populations, the between-SNP correlation is induced in two aspects: (1) we assume a SNP is causal in all those populations or none, and (2) the effect sizes for causal SNPs across populations are correlated (see Online Methods for details). The prior specification differs from that of the recent method PRS-CSx8 in two respects: (1) the use of a multivariate spike-and-slab prior versus a continuous shrinkage prior to perform shrinkage estimation; and (2) flexible specification of genetic correlation structure across ancestry groups in ME-Bayes SL compared to PRS-CSx, which assumes a single hyperparameter is shared across different ancestry groups and thus incorporates a fairly rigid specification of the correlation structure.
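The two prior assumptions for shared SNPs (causal in all populations or in none, with correlated effect sizes when causal) can be illustrated with a minimal simulation for two ancestry groups. This is only a sketch under stated assumptions: the function name `sample_effects`, the parameter names, and the restriction to two ancestries with a bivariate normal slab are illustrative choices, not the authors' implementation.

```python
import math
import random


def sample_effects(n_snps, p_causal, sigma1, sigma2, rho, seed=0):
    """Draw SNP effect sizes for two ancestries from a bivariate spike-and-slab prior.

    Hypothetical parameters (not taken from the paper):
      p_causal       -- prior probability that a SNP is causal (shared by both ancestries)
      sigma1, sigma2 -- slab standard deviations in ancestry 1 and ancestry 2
      rho            -- between-ancestry correlation of causal effect sizes
    """
    rng = random.Random(seed)
    effects = []
    for _ in range(n_snps):
        if rng.random() < p_causal:
            # Slab: correlated normal effects, built from two independent standard normals.
            z1 = rng.gauss(0.0, 1.0)
            z2 = rng.gauss(0.0, 1.0)
            b1 = sigma1 * z1
            b2 = sigma2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        else:
            # Spike: the SNP is non-causal in both ancestries simultaneously.
            b1 = b2 = 0.0
        effects.append((b1, b2))
    return effects


if __name__ == "__main__":
    draws = sample_effects(n_snps=10000, p_causal=0.01, sigma1=0.05, sigma2=0.05, rho=0.8)
    n_causal = sum(1 for b1, _ in draws if b1 != 0.0)
    print("causal SNPs:", n_causal)
```

Setting `rho` close to 1 makes the two ancestries share nearly identical causal effects, while smaller values allow more ancestry-specific heterogeneity.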
We infer posterior estimates of LD-adjusted SNP effect sizes across different ancestries via an efficient Markov chain Monte Carlo (MCMC) algorithm (Online Methods). Multiple PRS will be developed for each ancestry under carefully designed settings of two sets of tuning parameters, (1) the causal SNP proportion in each ancestry group, which will be used to specify the correlated prior causal probabilities across ancestry groups (Online Methods), and (2) the between-ancestry genetic correlation in SNP effect sizes. Ancestry-specific SNP effect sizes are estimated based on MCMC with an approximation strategy previously implemented in the LDpred2 algorithm13, which substantially reduces the number of iterations required to reach convergence with a spike-and-slab type prior on a large number of correlated SNPs. Detailed MCMC algorithm and estimation procedure are described in Online Methods.
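As a rough illustration of how the two sets of tuning parameters could be enumerated, the sketch below builds a grid of candidate settings. The function name `build_tuning_grid`, the candidate values, and the use of a full Cartesian product are assumptions made for illustration; the text only states that the settings are carefully designed, so the actual grid may be constructed differently.

```python
from itertools import product


def build_tuning_grid(causal_props, genetic_corrs):
    """Enumerate tuning settings for a multi-ancestry model.

    causal_props  -- dict mapping ancestry label to a list of candidate causal SNP
                     proportions (hypothetical values, e.g. centered on LDpred2 estimates)
    genetic_corrs -- list of candidate between-ancestry genetic correlations
    """
    ancestries = sorted(causal_props)
    grid = []
    for combo in product(*(causal_props[a] for a in ancestries)):
        for rho in genetic_corrs:
            grid.append({"causal_prop": dict(zip(ancestries, combo)), "rho": rho})
    return grid


if __name__ == "__main__":
    settings = build_tuning_grid(
        {"EUR": [0.01, 0.001], "AFR": [0.01, 0.001, 0.0001]},  # hypothetical candidates
        [0.7, 0.8, 0.9, 0.95],                                  # hypothetical candidates
    )
    print(len(settings), "tuning settings")
```

Each setting would yield one PRS per ancestry, and the resulting collection of PRS is what the second step combines.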
Step 2: Super Learning (SL)
Research has shown that combining multiple C+T PRS under different p-value thresholds14 or combining the best ancestry-specific PRS across multiple ancestry groups7,8 can significantly improve predictive performance. Thus, as a second step of ME-Bayes SL, we consider combining PRS obtained from the ME-Bayes step both across different tuning parameter settings and across ancestry groups via an SL model trained on the tuning dataset. SL is an ensemble learning method for seeking an “optimal” linear combination of various base learners for prediction15. In our analyses, we consider three linear base learners, including linear regression, elastic net regression16, and ridge regression17. A similar SL procedure was also implemented recently in another multi-ancestry method, CT-SLEB12. In our simulation studies and real data examples, we will show explicitly how much improvement in predictive power can be obtained separately through the Bayesian modeling step and the SL step. Considering that both weighted PRS and PRS-CSx construct a linear combination of the best PRS for each ancestry group, we tried the same approach on our Bayesian model (ME-Bayes) and called this alternative method “weighted ME-Bayes”. We observe on both simulated data and real data that the gain in predictive power by this linear combination strategy is mostly lower than, and sometimes comparable to, the gain by our proposed SL strategy (Supplementary Figures 1–13, “weighted ME-Bayes” versus “ME-Bayes SL”).
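To make the idea of learning a linear combination of candidate PRS on a tuning dataset concrete, the sketch below fits ridge-regression weights with plain Python. It is a simplified stand-in, not the SL procedure itself: a full super learner would cross-validate and combine several base learners (linear regression, elastic net, and ridge regression, as described above), whereas this example uses a single ridge learner. All function names and the `penalty` parameter are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ridge_weights(prs_matrix, trait, penalty=1.0):
    """Fit ridge weights for combining candidate PRS on tuning data.

    prs_matrix -- list of rows, one per tuning individual; each row holds that
                  individual's candidate PRS values (one column per PRS)
    trait      -- list of trait values for the same individuals
    Returns (intercept, weights); the intercept is not penalized.
    """
    n, k = len(trait), len(prs_matrix[0])
    col_means = [sum(row[j] for row in prs_matrix) / n for j in range(k)]
    y_mean = sum(trait) / n
    xc = [[row[j] - col_means[j] for j in range(k)] for row in prs_matrix]
    yc = [y - y_mean for y in trait]
    # Normal equations with a ridge penalty on the diagonal: (X'X + penalty * I) w = X'y.
    xtx = [[sum(r[a] * r[b] for r in xc) + (penalty if a == b else 0.0)
            for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y for r, y in zip(xc, yc)) for a in range(k)]
    weights = solve_linear(xtx, xty)
    intercept = y_mean - sum(w * mu for w, mu in zip(weights, col_means))
    return intercept, weights


def combine_prs(prs_matrix, intercept, weights):
    """Apply fitted weights to candidate PRS, e.g. in an independent testing set."""
    return [intercept + sum(w * x for w, x in zip(weights, row)) for row in prs_matrix]
```

In practice the weights would be fitted on the tuning samples and the combined PRS evaluated on separate testing samples, so that the reported performance is not inflated by overfitting.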
Simulation Settings
We first investigate the performance of ME-Bayes SL and a series of existing methods under various simulated scenarios of the genetic architecture of a continuous trait and absolute and relative GWAS sample sizes across ancestry groups. This large-scale dataset, including simulated genotype and phenotype data for a total of 600,000 individuals across EUR, AFR, AMR, EAS, and SAS, was recently released by our group12. Detailed simulation setup is described in Zhang et al. (2022)12 and briefly summarized in the Supplementary Notes. We apply eight existing approaches for comparison, which include two single-ancestry methods applied to GWAS and LD reference data from the target population: (1) C+T, (2) LDpred2; the same single-ancestry methods applied to GWAS and LD reference data for EUR: (3) EUR C+T, (4) EUR LDpred2; and four existing multi-ancestry methods applied to ancestry-specific GWAS and LD reference data for all ancestry groups: (5) weighted C+T (weighted PRS using C+T as the base method), (6) weighted LDpred2 (weighted PRS using LDpred2 as the base method), (7) PRS-CSx8, and (8) CT-SLEB12. Results from another two recently proposed multi-ancestry methods, PolyPred+18 and XPASS9, on the same simulated dataset are reported in Zhang et al. (2022)12. Taking into account both ancestral diversity and computational efficiency, throughout the text, we restrict all our analyses to the SNPs among approximately 2.0 million SNPs in HapMap 319 plus Multi-Ethnic Genotyping Array (MEGA)20 that are also available in the discovery GWAS, LD reference panel, and validation (tuning + testing) samples. We assess the predictive performance of a PRS by prediction R², i.e., the proportion of variance of the trait explained by the PRS.
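For concreteness, the evaluation metric can be sketched as follows: a PRS is the weighted sum of an individual's genotypes, and for a single predictor the proportion of trait variance it explains equals the squared Pearson correlation between the PRS and the trait. The function names are illustrative, and the sketch assumes no covariate adjustment, which an actual analysis may include.

```python
def polygenic_score(genotypes, effect_sizes):
    """Compute one PRS per individual as the sum of genotype * effect size over SNPs.

    genotypes    -- list of rows, one per individual, each a list of allele counts
    effect_sizes -- list of per-SNP effect size estimates (same SNP order)
    """
    return [sum(g * b for g, b in zip(row, effect_sizes)) for row in genotypes]


def prediction_r2(prs, trait):
    """Squared Pearson correlation between the PRS and the trait.

    This equals the R-squared of a simple linear regression of the trait on the PRS.
    """
    n = len(trait)
    mean_p = sum(prs) / n
    mean_t = sum(trait) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(prs, trait))
    var_p = sum((p - mean_p) ** 2 for p in prs)
    var_t = sum((t - mean_t) ** 2 for t in trait)
    return cov * cov / (var_p * var_t)


if __name__ == "__main__":
    geno = [[0, 1], [1, 2], [2, 0], [1, 1]]   # toy allele counts for 4 individuals, 2 SNPs
    beta = [0.5, -0.2]                        # toy effect sizes
    trait = [0.1, -0.3, 1.2, 0.4]             # toy trait values
    print(prediction_r2(polygenic_score(geno, beta), trait))
```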
Results of the various methods are compared in five simulation settings: (1) fixed common SNP heritability, strong negative selection, with a genetic correlation set to between any two ancestry groups (Figure 2, Supplementary Figures 1–2), (2) fixed per-SNP heritability, strong negative selection, (Supplementary Figures 3–4), (3) fixed per-SNP heritability, strong negative selection, with a weaker between-ancestry genetic correlation (Supplementary Figures 5–6), (4) fixed common SNP heritability, no negative selection, (Supplementary Figures 7–8), and (5) fixed common SNP heritability, mild negative selection, (Supplementary Figures 9–10).
Simulation results
The multi-ancestry methods tend to outperform the single-ancestry methods, except for weighted C+T, which performs worse than LDpred2 when the GWAS sample size of the non-EUR target population becomes adequately large (Figure 2, Supplementary Figures 1–10). When the discovery GWAS sample size of the target non-EUR population is relatively small compared to that of the EUR GWAS, EUR PRS tends to outperform PRS generated based on training data from the target non-EUR population; but as the GWAS sample size of the target non-EUR population increases, the prediction R² of LDpred2 eventually becomes substantially higher than that of EUR C+T and EUR LDpred2. Among the existing multi-ancestry methods, weighted LDpred2, PRS-CSx, and CT-SLEB perform similarly but show advantages over others in different settings: weighted LDpred2 performs well in the scenario of a large causal SNP proportion, while CT-SLEB performs similarly to PRS-CSx but shows some advantages when there is a small causal SNP proportion (0.05%) and when the GWAS sample size for the target non-EUR population is small. Overall, the proposed ME-Bayes SL method outperforms these existing methods in almost all settings. This is expected given that the SNP effect sizes were simulated under a multivariate spike-and-slab distribution as assumed in the ME-Bayes model. The proposed SL step (in ME-Bayes SL) and the alternative linear combination step (in weighted ME-Bayes) only provide minimal improvement in R² on top of ME-Bayes (Supplementary Figures 1–10). This may be because when the specified distribution of SNP effect sizes approximates the true distribution well, the best PRS trained for each ancestry by ME-Bayes can already provide a high predictive power, and an additional step of combining PRS across tuning parameter settings and ancestry groups is unnecessary.
We also checked the computational intensity of ME-Bayes SL in comparison with PRS-CSx. A comparison of computation time between PRS-CSx and CT-SLEB on the same simulation dataset was reported in Zhang et al. (2022)12. With AMD EPYC 7702 64-Core Processors running at 2.0 GHz using a single core, on chromosome 22 and with a total of tuning parameter settings, ME-Bayes SL has an average runtime of approximately 75.9 minutes combining ancestry groups with a total of 17,192 SNPs, 127.2 minutes combining ancestry groups with 17,721 SNPs, and 237.4 minutes across ancestry groups with 17,722 SNPs. Although not as fast as simpler methods such as CT-SLEB and XPASS, ME-Bayes SL is computationally more efficient than PRS-CSx and thus is easier to implement than PRS-CSx, especially when four or more training populations are available to be combined.
PAGE + UKBB + BBJ data analysis with validation on non-EUR individuals from PAGE
We evaluate the performance of the various methods on predicting the polygenic risk of inverse-rank normal transformed BMI (IRNT BMI), high-density lipoprotein (HDL), and low-density lipoprotein (LDL) separately for AFR, AMR, and EAS. We collected ancestry-specific training GWAS summary data for AFR and AMR from PAGE, GWAS summary data for EAS from BBJ, and EUR GWAS summary data from UKBB. The PRS developed by the various methods are evaluated on validation individuals of AFR, AMR, and EAS populations from PAGE. We use genotype data for 498 EUR, 659 AFR, 347 AMR, 503 EAS, and 487 SAS individuals from the 1000 Genomes project as the LD reference data21.
In this set of analyses, we observe that the multi-ancestry methods tend to outperform single-ancestry methods for EUR, AFR, and AMR (Figure 3, Supplementary Figure 11, Supplementary Table 3). For EAS, LDpred2 can reach an R² similar to or higher than that of EUR LDpred2 and multi-ancestry methods, which is possibly because the BBJ GWAS sample sizes for EAS are relatively large. For the proposed method ME-Bayes SL, we observe potential improvement in R² from both the Bayesian modeling step (“ME-Bayes” versus “LDpred2”) and the SL step (“ME-Bayes SL” versus “ME-Bayes”). The linear combination strategy (“weighted ME-Bayes”, Supplementary Figure 11) provides a smaller or similar gain in R² compared to our SL strategy (“ME-Bayes SL”). The relative performance of the various multi-ancestry methods varies by trait and ancestry, and no method is uniformly better than others. In some settings, such as for BMI on AFR and LDL on EAS, ME-Bayes SL PRS gives a lower R² than the PRS trained by weighted LDpred2 and PRS-CSx. But in general, ME-Bayes SL PRS has the best overall performance, with an average increase of 3.6% and 19.6% in R² compared to PRS-CSx and CT-SLEB, respectively, on non-EUR ancestries.
GLGC data analysis with validation on UKBB individuals
We apply the various methods to develop ancestry-specific PRS for four blood lipid traits, including HDL, LDL, total cholesterol (TC), and log of triglycerides (logTG)22, based on ancestry-specific GWAS summary data for EUR, AFR, AMR, EAS, and SAS, from the Global Lipids Genetics Consortium (GLGC). We validate the performance of the various methods on UKBB individuals of AFR, EAS, and SAS origin separately, where the ancestry information of the UKBB validation individuals was determined based on an ancestry genetic component analysis (Supplementary Notes).
We first use genotype data of the unrelated 1000 Genomes samples as the LD reference data21. We observe that ME-Bayes SL PRS performs the best or similarly to the best PRS (Figure 4(a), Supplementary Figure 12(a), and Supplementary Table 4). We see a notable gain in R² comparing ME-Bayes SL PRS to weighted LDpred2 PRS (average increase: 50.7%). ME-Bayes SL outperforms CT-SLEB in most cases (average increase in R²: 27.1%). Although the relative performance between ME-Bayes SL and PRS-CSx varies by ancestry and trait, ME-Bayes SL PRS has a better overall performance, with an average increase of 19.9% in R² compared to PRS-CSx PRS. Similar to the results from the PAGE + UKBB + BBJ analysis, ME-Bayes SL improves R² on top of LDpred2 by both the Bayesian modeling step (“ME-Bayes” versus “LDpred2”, Supplementary Figure 12(a)) and the SL step (“ME-Bayes SL” versus “ME-Bayes”, Supplementary Figure 12(a)). The PRS generated by the alternative linear combination strategy has a similar or lower R² than the PRS generated by our proposed SL strategy (“weighted ME-Bayes” versus “ME-Bayes SL”, Supplementary Figure 12(a)).
It has been shown that LDpred2 sometimes has suboptimal performance based on the widely implemented 1000 Genomes LD reference data23,24, which may be due to convergence issues in the presence of inadequate LD reference sample size and/or ancestry mismatch between 1000 Genomes samples and the target population23. Because it is implemented by an MCMC algorithm that utilizes similar computational tricks as LDpred2, ME-Bayes SL may likewise underperform with the 1000 Genomes reference data. We therefore conduct a sensitivity analysis where we estimate LD based on UKBB tuning samples (10,000 EUR, 4,585 AFR, 687 AMR, 1,010 EAS, 5,427 SAS) instead of the 1000 Genomes samples. We observe that the R² of ME-Bayes SL PRS improves notably compared to using the 1000 Genomes LD reference (Figure 4(b), Supplementary Table 4), especially on AFR (average increase: 33.8%). The R² of PRS-CSx PRS has also increased but not as much as the R² of ME-Bayes SL PRS. This is particularly noteworthy because PRS-CSx by default uses a much larger number of UKBB LD reference samples (375,120 EUR, 7,507 AFR, 687 AMR, 2,181 EAS, and 8,412 SAS), which also overlap with our UKBB testing samples and thus lead to potentially inflated R² estimates. The advantage of ME-Bayes SL now becomes more obvious: it outperforms the existing methods in all scenarios except for HDL in EAS, where it performs slightly worse than PRS-CSx PRS. ME-Bayes SL shows the most notable advantage on AFR, for which PRS are typically not powerful and hard to improve (average increase compared to the best existing method: 38.6%). Interestingly, the alternative weighted ME-Bayes approach has a similar or slightly lower R² than ME-Bayes SL, but it still outperforms PRS-CSx, which utilizes the same linear combination strategy, for almost all traits and ancestry groups (Supplementary Figure 12(b)).
AoU data analysis with validation on UKBB individuals
We also apply the various methods to develop ancestry-specific PRS for height and BMI based on the GWAS summary data we generated from the All of Us Research Program (AoU) for EUR, AFR, and AMR. The performance of the derived PRS is evaluated on UKBB validation samples of AFR ancestry. As in the GLGC data analysis, we first use genotype data of the unrelated 1000 Genomes samples as the LD reference data21 (Figure 5(a), Supplementary Table 5). Although no method is uniformly the best on all traits and ancestry groups, ME-Bayes SL PRS on average has an R² that is 67.5% higher than that of the PRS-CSx PRS and 53.4% higher than that of the CT-SLEB PRS. ME-Bayes SL PRS improves R² on top of the single-ancestry method by both the Bayesian modeling step (ME-Bayes versus LDpred2, Figure 5(a)) and the SL step (ME-Bayes SL versus ME-Bayes, Supplementary Figure 13(a)). The weighted ME-Bayes PRS utilizing a linear combination strategy gives a lower R² than the ME-Bayes SL PRS utilizing the SL strategy (weighted ME-Bayes versus ME-Bayes SL, Supplementary Figure 13(a)).
Similar to the GLGC data analysis, we also conduct a sensitivity analysis where we estimate LD using the UKBB tuning samples (10,000 EUR, 4,585 AFR, 1,010 EAS, 5,427 SAS) instead of the 1000 Genomes data. Unlike in the GLGC data analysis, no PRS has noticeably improved predictive power, even though there is a better ancestry match between the LD reference population and the target population (Figure 5(b), Supplementary Figure 13(b)). Taken together, the results from the GLGC and AoU data analyses suggest that for ME-Bayes SL, the 1000 Genomes LD reference dataset may be adequate for building PRS models with relatively small discovery GWAS, such as the AoU GWAS, but not so with much larger discovery GWAS, such as the GLGC GWAS (N up to 0.89 million). In other words, the ratio of the sample size of the LD reference dataset to the GWAS sample size may matter more than the sample size of the LD reference data itself or the population/ancestry match between datasets.
23andMe data analysis
We have collaborated with 23andMe, Inc. to develop and validate PRS for seven traits for EUR, African American (AFR), Latino (AMR), EAS, and SAS based on a large-scale dataset from 23andMe, Inc. We analyze two continuous traits: (1) heart metabolic disease burden, (2) height, and five binary traits: (3) any cardiovascular disease (any CVD), (4) depression, (5) migraine diagnosis, (6) morning person, and (7) sing back musical note (SBMN). Results are summarized in Figure 6 and Supplementary Table 6. For the two continuous traits, ME-Bayes SL shows a major advantage over the existing methods on AFR and AMR: for example, ME-Bayes SL has a remarkable improvement over two recently proposed advanced methods that perform the best among the existing methods, PRS-CSx (average increase in R²: 49.8%) and CT-SLEB (average increase in R²: 47.5%). For EAS and SAS, ME-Bayes SL performs better than all existing methods considered in all scenarios, except for heart metabolic disease burden in SAS, which has the smallest discovery GWAS, where ME-Bayes SL PRS has an R² slightly lower than that of CT-SLEB PRS but higher than the R² of all other PRS.
For the five binary traits, we observe a similar pattern as for continuous traits, where ME-Bayes SL generally performs better than or similarly to the best of the existing methods, and it shows the biggest improvement in (AUC – 0.5) over existing methods on AFR (average improvement: 14.4%, Figure 6(b), Supplementary Table 6). Averaged across all five traits and four non-EUR ancestry groups, ME-Bayes SL PRS gives an (AUC – 0.5) that is 13.8% higher than that of the PRS-CSx PRS and 9.0% higher than that of the CT-SLEB PRS.
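The comparison metric for binary traits can be sketched as below: AUC is computed from PRS values of cases and controls, and methods are compared by the relative change in (AUC – 0.5), the part of the AUC above chance. The function names and the toy numbers are illustrative and do not come from the study.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    Counts the fraction of (case, control) pairs in which the case has the higher
    score, with ties counted as one half (the Mann-Whitney formulation).
    """
    wins = 0.0
    for c in case_scores:
        for d in control_scores:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def relative_gain_over_chance(auc_new, auc_ref):
    """Relative increase in (AUC - 0.5) of a new method over a reference method."""
    return ((auc_new - 0.5) - (auc_ref - 0.5)) / (auc_ref - 0.5)


if __name__ == "__main__":
    # Hypothetical AUC values for two methods on the same testing sample.
    print(relative_gain_over_chance(0.66, 0.64))
```

Measuring gains relative to (AUC – 0.5) rather than raw AUC avoids understating improvements, since an uninformative score already attains an AUC of 0.5.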
Discussion
We propose ME-Bayes SL, a powerful method for constructing enhanced ancestry-specific PRS integrating information from GWAS summary statistics and LD reference data across multiple ancestry groups. Built based on an extension of spike-and-slab type prior13, ME-Bayes SL enhances the ancestry-specific polygenic prediction by (1) borrowing information from GWAS of other ancestries via specification of a between-ancestry covariance structure in SNP effect sizes, (2) incorporating heterogeneity in LD and MAF distribution across ancestries, and (3) an SL algorithm combining ancestry-specific PRS developed under various possible genetic architectures of the trait. We benchmark our method against a wide variety of alternatives, including multiple state-of-the-art multi-ancestry methods7,8,12, using extensive simulation studies and data analyses. Results show that while no method is uniformly the best, ME-Bayes SL is generally a robust method that shows close to optimal performance across a wide range of scenarios and has the potential to notably improve PRS performance in the AFR population compared to the alternative methods.
One important observation from the data applications is that the advantage of ME-Bayes SL over existing methods tends to be more notable with larger GWAS accompanied by a larger LD reference dataset. In the GLGC data analysis and 23andMe data analysis, where the discovery GWAS sample sizes are relatively large, especially for the non-EUR populations, we can clearly observe that ME-Bayes SL performs almost uniformly better than the existing methods. In contrast, in the PAGE + UKBB + BBJ data analysis, where the GWAS sample sizes for AFR and AMR are relatively small, ME-Bayes SL sometimes shows a suboptimal performance. This trend of more notable advantages with larger GWAS sample sizes and larger LD reference datasets exists not only when comparing ME-Bayes SL to existing methods, but also when comparing the more advanced methods, such as ME-Bayes SL and PRS-CSx, to simpler alternatives, such as the weighted PRS method.
One key factor in implementing ME-Bayes SL is the LD reference data. The analyses of the GLGC and AoU datasets illustrate that the sample size of the LD reference data should be sufficiently large relative to the discovery GWAS sample size for ME-Bayes SL to perform optimally (Figure 4, Supplementary Table 4). The performance of ME-Bayes SL depends on estimated causal SNP proportion parameters from single-ancestry LDpred2 analysis. LDpred2 has previously been shown to underperform sometimes when using 1000 Genomes LD reference data24, which could in turn affect the performance of ME-Bayes SL. Thus, as sample sizes of the training GWAS increase, building a larger LD reference dataset than the widely used 1000 Genomes reference dataset should lead to better performance.
We have compared ME-Bayes SL with a series of recent multi-ancestry methods including PRS-CSx and CT-SLEB, but there are other recently proposed methods that are worth considering. In fact, we have implemented two other multi-ancestry methods, XPASS and PolyPred+, in our simulation study as well as the GLGC, AoU, and 23andMe data analyses, with detailed results reported in Zhang et al. (2022)12. Although computationally very fast, XPASS, which uses a bivariate normal prior under an infinitesimal model, can only combine up to two ancestry groups, and it is always outperformed by ME-Bayes SL (Supplementary Tables 3–6). This shows the importance of including sparsity components in modeling effect-size distribution for Bayesian polygenic prediction. PolyPred+ implements a linear combination of SBayesR25 trained separately on EUR and the target population and a Polyfun26 PRS on EUR that additionally incorporates information from external functional annotations, and thus it is not directly comparable to the other methods. Even so, it performs worse than ME-Bayes SL most of the time (Supplementary Tables 3–6).
Our study also has several limitations. First, ME-Bayes requires two sets of tuning parameters: causal SNP proportion in each ancestry and between-ancestry correlation in effect sizes, the specification of which is relatively complex compared to other methods such as PRS-CSx. In the default setting of ME-Bayes, the candidate values for genetic correlation between a pair of ancestry groups only lie between 0.7 and 0.95, while for some traits, the estimated correlation can be lower8,22. But given the high computational scalability of ME-Bayes, when the number of ancestry groups is not too large, prior information on genetic correlation can be used to specify additional genetic correlation parameter settings to cover a wider range of potential genetic architectures of the trait.
The spike-and-slab type prior in ME-Bayes can be sub-optimal for the effect-size distribution of some traits. For example, in GLGC GWAS, we detect several top SNPs with extremely large association coefficients for all four blood lipid traits, each contributing 0.6%–3.9% of the estimated total heritability. In this case, ME-Bayes induces the same amount of shrinkage on all SNPs, resulting in over-shrinkage of the few large-effect SNPs. We have considered a simple alternative approach to compensate for such over-shrinkage27,28, where for each target ancestry group, we first construct a “top-SNP PRS” using GWAS association coefficients of the few top SNPs for the ancestry, then combine it with the ME-Bayes SL PRS constructed based on the rest of the SNPs. This approach, however, does not provide a more powerful PRS. PRS-CSx, which allows a heavy-tail Strawderman-Berger prior, while theoretically expected to be advantageous for handling such large-effect SNPs, does not show much advantage either. In the future, other heavy-tail type priors, such as the Bayesian Lasso (i.e., Laplacian)29, Horseshoe30, and Bayesian Bridge31, are worth investigating. Another potential limitation of the method originates in the SL step: when the tuning sample is small (e.g., <1000), the prediction algorithms utilized in SL may overfit in the presence of a large number of tuning parameters, ultimately leading to low predictive power in an independent sample.
In our data examples, different methods show advantages in different scenarios in terms of GWAS sample size, LD reference data, the type of trait, and target ancestry. It is thus natural to consider extending our SL step from combining a series of PRS trained within a specific type of method, such as ME-Bayes, to those generated across different methods. ME-Bayes SL can also be modified to enhance performance of PRS by borrowing information simultaneously across ancestries and genetically correlated traits. Two recent studies, both using simple weighting methods, have shown significant potential for cross-trait borrowing to improve PRS performance for individual traits32,33. There is, however, likely to be scope for additional improvement by developing formal Bayesian methods that can utilize flexible models for effect-size distribution simultaneously across ancestries and traits.
In summary, we propose a powerful method for constructing enhanced ancestry-specific PRS combining GWAS summary data and LD reference data across multiple ancestry groups. As sample sizes of the multi-ancestry GWAS and LD reference datasets continue to increase, more advanced methods, such as ME-Bayes SL and PRS-CSx8, are expected to show increasing advantages over simpler alternatives, such as the weighted methods7. Our large-scale simulation study and four unique data examples illustrate the relative performance of a variety of single- and multi-ancestry methods across various settings of ancestry groups, GWAS sample sizes, genetic architecture of the trait, and LD reference panel, which can serve as guidance for method implementation in future applications.
Online Methods
Details of ME-Bayes SL Step 1: ME-Bayes
ME-Bayes conducts Bayesian modeling to generate ancestry-specific ME-Bayes PRS models through joint modeling of GWAS summary data across all available ancestry groups. This step models the genetic correlation structure in SNP effect size across ancestry groups while accounting for ancestry-specific LD and allele frequency information.
Suppose we are interested in predicting the polygenic risk of some trait based on genotype , for an individual of ancestry , with denoting the number of SNPs with a minor allele frequency in ancestry . For demonstration purposes, we assume the trait is continuous, but the results can be directly applied to GWAS summary-level association statistics for discrete traits in the same manner. We assume all SNPs included are biallelic, i.e., each SNP only has two alleles observed in the population. For each ancestry group , we assume a true additive model for genetic variation, , where denotes the underlying joint effect size of , i.e., effect size after adjusting for the effect of other SNPs, for an individual of ancestry , and denotes a zero-mean random error term that includes effects of risk factors other than SNPs. Suppose we have ancestry-specific GWAS summary data, , specifically, the marginal effect sizes of the SNPs and their corresponding standard errors from one-SNP-at-a-time regressions, , for and . Here, and are the indices of GWAS sample, SNP, and ancestry, respectively, denotes a zero-mean random error term that includes effects of other risk factors and all other SNPs, and are the true marginal SNP effect sizes, total number of SNPs, and GWAS sample size, respectively, for ancestry . Our goal is to obtain an estimate of the joint SNP effect sizes, , to construct polygenic risk model for each ancestry group .
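For readers following the notation, the two regression models described above can be written out roughly as follows. This is only a sketch in assumed notation: the symbols $Y_k$, $X_k$, $\beta_k$, $\epsilon_k$, $b_{jk}$, $\delta_{ijk}$, $M_k$, and $N_k$ are illustrative choices and may differ from those used in the original equations.

```latex
% Joint (multi-SNP) additive model for ancestry k (assumed notation)
Y_k = X_k \beta_k + \epsilon_k, \qquad
\beta_k = (\beta_{1k}, \ldots, \beta_{M_k k})^{\top}

% One-SNP-at-a-time (marginal) GWAS regression
% for GWAS sample i, SNP j, and ancestry k
Y_{ik} = X_{ijk}\, b_{jk} + \delta_{ijk}, \qquad
i = 1, \ldots, N_k, \quad j = 1, \ldots, M_k
```

Here $\beta_{jk}$ is the joint effect size that the PRS needs, while $b_{jk}$ is the marginal effect size that the GWAS estimates; the two differ because of LD between SNPs.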
Our analysis is conducted on the standardized scale, where the genotypes are assumed to be standardized to have zero mean and unit variance and the trait values are assumed to have unit variance (for continuous traits). This is reflected by rescaling the GWAS summary statistics so that the variance is equal to the inverse of the GWAS sample size. For computational scalability, we divide the whole genome into a series of independent LD blocks¹, each containing hundreds of (up to ~2900) SNPs, and only consider the between-SNP correlation within each LD block. Such a block structure for LD matrices is considered because it yields predictive power similar to that of the banded-structure LD matrices accounting for LD within a 3cM genetic distance suggested by LDpred2², but it is computationally more efficient and requires less memory. We estimate LD matrices for SNPs within each LD block using PLINK 2.0³ based on the LD block segmentation in Berisa and Pickrell (2015)¹. LD block information was extracted from the R package “lassosum”⁴. Note that the LD block information is available for EUR (1747 blocks, median number of SNPs per block: 816), AFR (2626 blocks, median number of SNPs per block: 716), and EAS (1489 blocks, median number of SNPs per block: 815), but is not currently available for AMR and SAS, and thus we apply the EUR LD information to AMR and SAS for now.
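The rescaling to the standardized scale can be illustrated with a short Python sketch. It assumes that the rescaling divides each raw marginal estimate by its standard error times the square root of the GWAS sample size, which gives a sampling variance of about 1/N; the function name `standardize_effect` and the example numbers are hypothetical.

```python
import math


def standardize_effect(beta_hat, se, n):
    """Rescale a marginal GWAS effect estimate to the standardized scale.

    Dividing by se * sqrt(n) yields an estimate whose sampling variance is
    approximately 1 / n (assumes each SNP explains little trait variance).
    """
    if se <= 0 or n <= 0:
        raise ValueError("se and n must be positive")
    return beta_hat / (se * math.sqrt(n))


# Hypothetical summary statistics for one SNP: beta_hat, se, GWAS sample size.
example = standardize_effect(0.05, 0.01, 10000)
```

Equivalently, the standardized estimate is the z-statistic divided by the square root of N.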
We denote by and the vector of true joint effect sizes and marginal effect sizes estimated from GWAS, respectively, for SNPs within a specific LD block in ancestry . To conduct analyses on the standardized scale, we first divide each raw effect size estimate by . We can then write down the likelihood of the GWAS summary statistics, , where denotes the LD matrix of the SNPs within the LD block , and is a diagonal matrix with diagonal entries being the corresponding GWAS sample sizes for SNPs within the LD block. For population-specific SNPs, i.e., SNPs with an MAF>0.01 in only one ancestry , we assume a spike-and-slab prior as in LDpred2, , where denotes the per-SNP heritability, is the indicator of whether SNP is causal in ancestry , i.e., if and 0 otherwise, and is the proportion of causal SNPs in ancestry . For SNPs that have an MAF>0.01 in all ancestry groups, we induce a prior correlation structure between and for . The prior distribution of the joint effect size given is then specified as follows,
where , and , with denoting the genetic correlation between ancestry groups and . For SNPs that have an MAF>0.01 in only a subset of ancestries , similar prior distributions can be specified for SNP effect sizes within the set of ancestry groups .
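As a reading aid, the likelihood and priors described in words above can be sketched in LaTeX. The notation is assumed rather than taken from the original equations: $D_k$ for the block LD matrix, $N_k$ for the diagonal matrix of sample sizes, $\sigma_k^2$ for the per-SNP heritability, $\gamma_{jk}$ for the causal indicator, $p_k$ for the causal proportion, and $\rho_{k_1 k_2}$ for the between-ancestry genetic correlation. The likelihood shown is one common form for standardized summary statistics and may differ in detail from the authors' expression.

```latex
% Likelihood of standardized marginal estimates within one LD block
% (assumed form)
\hat{\beta}_k \mid \beta_k \sim
  \mathrm{MVN}\!\left(D_k \beta_k,\; N_k^{-1/2} D_k N_k^{-1/2}\right)

% Spike-and-slab prior for an ancestry-specific SNP j, LDpred2-style
\beta_{jk} \mid \gamma_{jk} \sim
\begin{cases}
  N(0, \sigma_k^2) & \text{if } \gamma_{jk} = 1 \\
  0                & \text{if } \gamma_{jk} = 0
\end{cases},
\qquad \gamma_{jk} \sim \mathrm{Bernoulli}(p_k)

% Slab covariance for a SNP that is causal in ancestries k_1 and k_2
\mathrm{Cov}(\beta_{jk_1}, \beta_{jk_2})
  = \rho_{k_1 k_2}\, \sigma_{k_1} \sigma_{k_2}
```

In LDpred2 the slab variance is $h^2 / (M p)$, i.e., the total heritability divided by the expected number of causal SNPs.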
Recall that we introduce variables to denote ancestry-specific causal SNP proportions, and for ancestry-specific SNPs, we assume . Now we generalize this Bernoulli prior to a multinomial prior on for SNPs that exist in a subset of ancestry groups , with probabilities being defined as functions of . We first focus on SNPs that only exist in two ancestry groups : we set , which reflects our assumption that if a SNP is causal in one ancestry group, it is also causal in another. We can then obtain , and . After constructing , we then construct priors for SNPs that exist in three ancestry groups: by specifying , we can obtain the rest of the probabilities Such specifications can be easily extended to apply to SNPs that exist in four ancestry groups, five ancestry groups, etc.
We estimate based on MCMC with an approximation strategy previously implemented in the LDpred2 algorithm², which substantially reduces computation time of the algorithm. There are two sets of tuning parameters which will be estimated by grid search using a tuning dataset independent from the testing samples on which we report : (1) the ancestry-specific causal SNP proportions : we fix to either , the estimated ancestry-specific causal SNP proportions obtained from LDpred2 separately on GWAS summary data of each ancestry, or , i.e., the values of all are set to the LDpred2 estimate of the causal SNP proportion in ancestry (2) the between-ancestry correlation parameters : we consider two settings, i.e., either set to all equal to , or set to 0.75 for any pair of ancestry groups that include AFR and 0.9 otherwise, given that correlation with AFR tends to be weaker than that among other ancestry groups. Prior to the implementation of MCMC, we further estimate the ancestry-specific heritability based on GWAS summary data and LD reference data using LD score regression5.
We now describe the detailed MCMC algorithm and estimation procedure. For SNPs that only exist (MAF>0.01) in one ancestry group, the Gibbs sampler in Vilhjálmsson et al. (2015)6 was implemented. For each SNP that exists in all ancestry groups, we sample and from
where denotes the joint effect sizes for the SNPs within the LD block which SNP is in, .
We first sample from . Here note that obtaining analytically is hard, and thus we approximate it by . For a realization of where , , we first derive
We denote the numerator by , which can be derived as follows:
where
denotes the entry in that corresponds to the correlation between SNPs and if and 0 otherwise, , and . After deriving , we can then sample from .
We obtain the marginal posterior mean of after integrating out :
where
We can easily derive that follows , where
For SNPs that have an MAF>0.01 in a subset of ancestry groups , similar sampling strategy can be conducted but only among ancestry groups . In each MCMC iteration, the prior per-SNP heritability parameter is set to , where denotes the number of causal SNPs estimated from this iteration. The posterior estimate of is obtained by taking the average of obtained from MCMC iterations after a burn-in stage of 100 iterations.
Existing Methods
Single-Ancestry Methods
LD Clumping and Thresholding (C+T)
C+T first constructs a series of PRS by applying an LD clumping step followed by a p-value filtering step with varying p-value cutoffs, then selects the best performing PRS on the tuning dataset. Specifically, an LD clumping step is first conducted to exclude variants that have an absolute pairwise correlation stronger than within a genetic distance (500kb) based on an LD reference dataset. The remaining variants are then filtered by excluding the ones that have a p-value larger than a significance threshold, which, in our analysis, were set to , or 1. A total of 16 scores were created based on these 16 significance thresholds by calculating a weighted sum of the number of effect alleles of the selected SNPs, with weights being the effect size estimates from the discovery GWAS. C+T then selects the score with the “optimal” p-value threshold via parameter tuning with respect to the residual R² (for continuous traits) or residual AUC (for binary traits) on a tuning dataset that is independent of the training and testing samples. C+T was implemented using PLINK 1.90⁷,⁸.
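The clumping and thresholding logic can be sketched in plain Python. This is a simplified greedy illustration and not the PLINK implementation; the function names and the toy inputs, including the r² cutoff of 0.1 and the 500,000 bp window, are assumptions made for the example.

```python
def clump_and_threshold(pvals, r2, positions, r2_cut, window, p_cut):
    """Return indices of SNPs kept after greedy LD clumping and p-value filtering.

    pvals: GWAS p-values; r2: square matrix (list of lists) of squared
    correlations from a reference panel; positions: base-pair positions.
    """
    order = sorted(range(len(pvals)), key=lambda j: pvals[j])
    removed = set()
    index_snps = []
    for j in order:  # visit SNPs from most to least significant
        if j in removed:
            continue
        index_snps.append(j)
        for m in order:
            if m == j or m in removed or m in index_snps:
                continue
            close = abs(positions[m] - positions[j]) <= window
            if close and r2[j][m] > r2_cut:
                removed.add(m)  # clumped away by the index SNP j
    # Thresholding: keep index SNPs with p-value no larger than the cutoff.
    return sorted(j for j in index_snps if pvals[j] <= p_cut)


def prs(genotype, betas, kept):
    """Weighted sum of effect-allele counts over the selected SNPs."""
    return sum(genotype[j] * betas[j] for j in kept)


# Toy data (hypothetical): 4 SNPs, SNP 1 is in strong LD with SNP 0.
pvals = [1e-8, 1e-6, 0.2, 1e-3]
positions = [100, 200, 300, 900000]
r2 = [[1.0, 0.8, 0.05, 0.0],
      [0.8, 1.0, 0.3, 0.0],
      [0.05, 0.3, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
kept_strict = clump_and_threshold(pvals, r2, positions, 0.1, 500000, 0.01)
kept_all = clump_and_threshold(pvals, r2, positions, 0.1, 500000, 1.0)
score = prs([2, 1, 0, 1], [0.1, 0.2, -0.05, 0.3], kept_strict)
```

Repeating the call over a grid of `p_cut` values gives the series of candidate scores from which the best one is chosen on the tuning data.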
LDpred2
LDpred2 is an LD-based Bayesian modeling approach which leverages information from GWAS summary statistics and explicitly models LD correlation structure with correlation matrices being estimated based on an external reference panel6,9. LDpred2 assumes a spike-and-slab prior on SNP effect sizes, i.e., each SNP has a probability to have a non-zero causal effect , and a probability to have no contribution to the phenotypic variation . Here and are treated as tuning parameters and estimated via grid search on a tuning dataset. We ran LDpred2 on each chromosome and GWAS of each ancestry group separately using R packages “bigstatsr” and “bigsnpr”. Two tuning parameters were considered: (1) causal SNP proportion , with default candidate values , and 1.0; (2) total heritability, which is set to the heritability estimated by LD score regression5 multiplied by 0.7, 1, or 1.4. The “sparse” option was not considered.
Multi-Ancestry Methods
Weighted PRS
A simple multi-ancestry method is weighted PRS, which trains an “optimal” linear combination of the effect size estimates obtained based on training data from each single ancestry. Weighted PRS was first proposed in Marquez-Luna et al. (2017)10 to improve the performance of single ancestry C+T PRS. Suppose we have constructed C+T PRS, , and , separately based on GWAS and LD reference panel of each corresponding ancestry group. The weighted C+T PRS is then constructed as where are obtained by fitting a regression model on the tuning dataset. Here we apply the weighted PRS approach on either C+T (“weighted C+T”) or LDpred2 (“weighted LDpred2”).
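The linear-combination step can be illustrated with ordinary least squares in Python. This is a minimal sketch assuming a continuous trait and NumPy; the function names `fit_prs_weights` and `weighted_prs` and the simulated tuning data are hypothetical.

```python
import numpy as np


def fit_prs_weights(prs_matrix, phenotype):
    """Least-squares coefficients (intercept first) for combining PRS columns."""
    design = np.column_stack([np.ones(len(phenotype)), prs_matrix])
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return coef


def weighted_prs(prs_matrix, coef):
    """Apply fitted weights to ancestry-specific PRS (one column per ancestry)."""
    return coef[0] + prs_matrix @ coef[1:]


# Toy tuning data: two ancestry-specific PRS and a noiseless trait.
rng = np.random.default_rng(0)
prs_tuning = rng.normal(size=(200, 2))
y_tuning = 0.3 * prs_tuning[:, 0] + 0.7 * prs_tuning[:, 1]
weights = fit_prs_weights(prs_tuning, y_tuning)
combined = weighted_prs(prs_tuning, weights)
```

The weights are fitted on the tuning dataset only; performance should then be reported on a separate testing dataset.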
PRS-CSx
“PRS-CSx”11 is proposed as the multi-ancestry version of PRS-CS12 which conducts Bayesian modeling followed by an additional step of constructing a linear combination of the best performing PRS trained for each ancestry. PRS-CSx assumes a continuous shrinkage prior named Strawderman-Berger prior on the ancestry-specific effect sizes. For SNPs available in more than one population, this prior induces information sharing across ancestry groups. After the Bayesian modeling step, PRS-CSx further trains a linear combination of the ancestry-specific PRS obtained from the previous step based on the tuning dataset. In all our analyses, we ran PRS-CSx with the default candidate values for the tuning parameter , which is the global shrinkage parameter shared by all SNPs and all ancestries that controls the overall causal SNP proportion. The PRS-CSx software only considers approximately 1.2 million HapMap 3 SNPs and therefore we only report the performance of PRS-CSx PRS based on the HapMap 3 SNPs. We have also tried to apply PRS-CSx to HapMap 3 SNPs plus an additional 0.8 million MEGA SNPs that are also available in the 1000 Genomes reference data. But we found that, on our simulated dataset, the performance of PRS-CSx PRS using the extended HapMap 3 + MEGA SNP set is significantly worse than PRS-CSx using the HapMap 3 SNPs, and in our real data analyses, results from PRS-CSx on the two SNP sets are similar. We therefore stick to the default setting with 1.2 million HapMap 3 SNPs provided by the PRS-CSx software.
CT-SLEB
CT-SLEB is a recently proposed method for multi-ancestry PRS construction13. It first conducts a two-dimensional C+T between EUR GWAS and GWAS of the target population to select SNPs to be included in the target population PRS, then uses an Empirical Bayesian approach to account for genetic correlation across populations, and finally implements an SL algorithm to combine PRS generated under different p-value thresholds in the C+T step. In our analyses, we implemented CT-SLEB with the default setting for p-value threshold, , or 1, and a genetic distance or , where .
Runtimes and memory usage
We compare the computation time and memory usage of ME-Bayes SL and PRS-CSx on chromosome 22 based on the simulated dataset (comparison between PRS-CSx and CT-SLEB on the same dataset has been reported in Zhang et al., 202213). Results from ME-Bayes SL and PRS-CSx combining three ancestry groups (EUR, AFR, and AMR), four ancestry groups (EUR, AFR, AMR, and EAS), and five ancestry groups (EUR, AFR, AMR, EAS, and SAS) are summarized in Supplementary Table 2. The training GWAS sample size is 15,000 for each non-EUR population and 100,000 for EUR population. The tuning and validation dataset each contains 10,000 individuals. All analyses were performed with AMD EPYC 7702 64-Core Processors running at 2.0 GHz. Other than the LDpred2 step which uses parallel computing with 17 cores, all other analyses were conducted using a single core. The reported computation time and memory usage are averaged over 10 replicates.
PAGE + UKBB + BBJ data analysis with validation on non-EUR individuals from PAGE
Three traits, including IRNT BMI, HDL, and LDL, that were available across PAGE, UKBB, and BBJ GWAS for EUR, AFR, AMR (Hispanic), and EAS are analyzed. Ancestry- and trait-specific GWAS sample sizes, validation sample sizes, and number of SNPs analyzed are reported in Supplementary Table 3.1. The training GWAS datasets consist of PAGE, contributing data for AFR and AMR, UKBB, contributing data for EUR, and BBJ, contributing data for EAS. The validation datasets consist of PAGE, contributing data for the three non-EUR ancestry groups, and UKBB, contributing data for EUR. Specifically, we first collect data for a total of 43,769 PAGE individuals of AFR, AMR, or EAS ancestry that have data available for at least one of the three traits. For AFR and AMR, which have relatively large sample sizes in PAGE, we randomly divide the samples within each ancestry group into a training dataset (80%) for conducting GWAS, a tuning dataset (10%) for tuning model parameters and training the SL in CT-SLEB and ME-Bayes SL or the linear combination model in weighted PRS and PRS-CSx, and a testing dataset (10%) for evaluating PRS performance. For EAS, which has a limited sample size in PAGE, we use all PAGE samples for external validation (tuning + testing) and obtain GWAS summary data from BBJ, which has a much larger sample size. To borrow information from large EUR GWAS, we further collect EUR GWAS summary data from UKBB14 released by the Neale Lab. Finally, to tune the causal SNP proportion for EUR, which is required for specifying the prior causal probabilities for non-EUR ancestry groups, we further randomly select 20,000 individuals from UKBB that do not overlap with samples in the EUR UKBB GWAS. Here the ancestry information for individuals from PAGE and UKBB is determined based on self-identified race/ethnicity.
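The 80%/10%/10% random split into training, tuning, and testing sets can be sketched as follows. The function name `split_samples`, the seed, and the toy sample IDs are hypothetical and only illustrate the idea.

```python
import random


def split_samples(sample_ids, seed=2023, fractions=(0.8, 0.1, 0.1)):
    """Randomly split sample IDs into training, tuning, and testing sets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_train = int(round(fractions[0] * len(ids)))
    n_tune = int(round(fractions[1] * len(ids)))
    train = ids[:n_train]
    tune = ids[n_train:n_train + n_tune]
    test = ids[n_train + n_tune:]  # remainder goes to testing
    return train, tune, test


train, tune, test = split_samples(range(1000))
```

Keeping the three sets disjoint is what allows the testing performance to be reported without bias from GWAS training or parameter tuning.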
For AFR and AMR, we conduct GWAS on individuals from the PAGE study to obtain the GWAS summary data. Specifically, we first collect a total of 17,127 AFR and 21,995 AMR individuals from PAGE, then randomly divide the samples in each ancestry into a training set (80%) to conduct GWAS and a validation set (20%), of which half (10% of the total) is used for selecting tuning parameters and training the SL (tuning set), and the other half (10% of the total) is used for reporting PRS performance (testing set). There was no significant difference between training and validation datasets in the distribution of the covariates adjusted for in GWAS. PAGE GWAS: (1) IRNT BMI. For ancestry-specific GWAS analysis on AFR and AMR, measurements of BMI outside of 6 standard deviations from the mean (based on sex and race) were removed. We first created sex-specific residuals for BMI adjusted for age, then inverse normally transformed these residuals. These inverse-normally-transformed residuals were then used in the final analysis where they were further adjusted for self-identified race/ethnicity, study, study center (for MEC and SOL only), and the top 10 genetic principal components (PCs). (2) HDL. For ancestry-specific GWAS analysis on AFR and AMR, untransformed HDL measurements were reported in mg/dL, and were adjusted for each individual’s medication use by adding a constant based on the type of medication used. Details of the adjustment are described in the Supplementary Information in Wojcik et al. (2019)15. Finally, models were adjusted by age at lipid measurement, sex, study, study center (for MEC and SOL only), self-identified race/ethnicity, and top 10 genetic PCs. (3) LDL. For ancestry-specific GWAS analysis on AFR and AMR, untransformed LDL measurements were calculated using the Friedewald Equation16 and reported in mg/dL. The measurements were adjusted for individuals’ medication use by adding a constant based on the type of medication used.
Details of the calculation and adjustment are described in the Supplementary Information in Wojcik et al. (2019)15. Participants who were pregnant at blood draw or had fasted less than 8 hours prior to lipid blood draw were excluded. Finally, models were adjusted by age at lipid measurement, sex, study, study center (for MEC and SOL only), self-identified race/ethnicity, and top 10 genetic PCs.
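Two of the phenotype-processing steps mentioned above, the Friedewald calculation of LDL and the rank-based inverse normal transformation, can be sketched in Python. The Friedewald formula (LDL = TC − HDL − TG/5, all in mg/dL) is standard; the rank offset of 0.5 and the function names are assumptions, since the exact transformation variant used in PAGE is not stated here.

```python
from statistics import NormalDist


def friedewald_ldl(total_chol, hdl, triglycerides):
    """LDL cholesterol (mg/dL) from the Friedewald equation: TC - HDL - TG/5."""
    return total_chol - hdl - triglycerides / 5.0


def inverse_normal_transform(values, offset=0.5):
    """Rank-based inverse normal transform (ties are not handled specially)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Map the rank to a quantile in (0, 1), then to a standard normal score.
        out[i] = NormalDist().inv_cdf((rank - offset) / n)
    return out


ldl = friedewald_ldl(200.0, 50.0, 150.0)  # 200 - 50 - 150/5
z = inverse_normal_transform([3.1, 0.2, 7.5])
```

In the analysis described above, the transformation is applied to sex-specific, age-adjusted BMI residuals rather than to the raw measurements.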
The PAGE individuals included in our analyses are part of the PAGE participant cohort, which were collected from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), Women’s Health Initiative (WHI), Multiethnic Cohort (MEC), and the Icahn School of Medicine at Mount Sinai BioMe biobank in New York City (BioMe)15. Due to the extensive degree of admixture within and between PAGE self-identified racial/ethnic groups, individuals were not reassigned based on their genetic ancestry but remained categorized by their self-identified race/ethnicity. However, we have assigned them to ancestry groupings based on an approximation of mappings to continental-level regions for consistency with other external studies in this manuscript. Written informed consent was obtained for all participants in this study at the relevant recruitment sites. Detailed information about genotyping, data quality control and imputation, selection of unrelated individuals, genetic principal component analysis, and phenotype harmonization is provided in the Supplementary Information in Wojcik et al. (2019)15. Since PAGE has a limited sample size for EAS (4,647), we further collect publicly available GWAS summary data from BBJ (Data and code availability) and use all PAGE individuals for validation on EAS. For BMI, the BBJ GWAS analysis included age, age², sex, status of a series of diseases, and the top 10 genetic PCs as covariates17. For HDL and LDL, the BBJ GWAS analyses included age, sex, status of a series of diseases, and the top 10 genetic PCs as covariates18.
PAGE does not have individuals of EUR ancestry. To borrow information from the much larger EUR GWAS, we further download publicly available EUR GWAS summary data from UKBB (Data and code availability). For all three traits, the UKBB GWAS analyses include age, age², inferred sex, an interaction term between age and inferred sex, an interaction term between age² and inferred sex, and the top 20 genetic PCs as covariates. One thing to note is that for HDL and LDL, measurements are untransformed and reported in mmol/L in UKBB, untransformed and reported in mg/dL in PAGE, and reported in mg/dL then standardized to Z-score in BBJ. Although the measurements are not on the same scale, the correlation in SNP effect size estimates remains the same, allowing the various GWAS summary data to be analyzed jointly. For EUR, we construct a validation dataset of 20,000 independent samples from UKBB that do not overlap with the UKBB GWAS samples. Specifically, we use the genotyping plate and well codes, which are published in the file ukb_sqc_v2.txt by UKBB and are consistent across different project applications, to identify and exclude the individuals included in the UKBB GWAS analysis by the Neale Lab, and then randomly select 20,000 independent individuals from the remaining UKBB samples to conduct parameter tuning (10,000) and testing (10,000). For each ancestry group, we use unrelated samples of the same ancestry from the 1000 Genomes Project as the LD reference data. For EUR, the reported prediction R² values are adjusted for age, sex, and top 10 genetic PCs. For AFR, AMR and EAS, the R² values for BMI are adjusted for age, sex, top 10 genetic PCs, and whether the individual is from the BioMe Biobank, and for HDL and LDL the R² values are adjusted for age at lipid measurement, sex, top 10 genetic PCs, and whether the individual is from the BioMe Biobank.
We conduct the following quality control steps for the GWAS summary-level association statistics: (1) consistent with the procedure in our simulation study and other data analyses, we restrict our analysis to approximately 1.6 million SNPs in HapMap 3 plus MEGA that are also available in the LD reference panel and validation sample; (2) we remove SNPs that have duplicated positions in the GWAS or LD reference panel; (3) for EUR, we remove SNPs that have alleles “AT”, “TA”, “CG”, or “GC” to avoid undetectable strand flips when matching with UKBB validation data; (4) for the implementation of single-ancestry methods, we only keep common SNPs, i.e., SNPs that have ancestry-specific MAF>0.01 in that ancestry group, and for the implementation of multi-ancestry methods we keep all SNPs that have ancestry-specific MAF>0.01 in at least one ancestry group. The Manhattan plots and QQ plots for GWAS are reported in Supplementary Figures 14–16. No inflation is observed based on the genomic inflation factor. We estimate heritability of the three traits for EUR using LD score regression5 based on the 1000 Genomes LD reference data for EUR (Supplementary Table 7).
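Quality control steps (2) to (4) can be sketched as a simple filter in Python. The record layout (dicts with `id`, `chr`, `pos`, `a1`, `a2`, `maf`), the function name `qc_filter`, and the choice to drop every SNP at a duplicated position are assumptions made for illustration.

```python
# Allele pairs whose strand cannot be resolved from the alleles alone.
AMBIGUOUS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def qc_filter(snps, maf_cut=0.01, drop_ambiguous=True):
    """Return IDs of SNPs passing position, strand-ambiguity, and MAF filters."""
    counts = {}
    for s in snps:
        key = (s["chr"], s["pos"])
        counts[key] = counts.get(key, 0) + 1
    kept = []
    for s in snps:
        if counts[(s["chr"], s["pos"])] > 1:
            continue  # duplicated position
        if drop_ambiguous and (s["a1"], s["a2"]) in AMBIGUOUS:
            continue  # strand-ambiguous alleles
        if s["maf"] <= maf_cut:
            continue  # not common in this ancestry
        kept.append(s["id"])
    return kept


# Hypothetical SNP records.
snps = [
    {"id": "rs1", "chr": 1, "pos": 100, "a1": "A", "a2": "G", "maf": 0.20},
    {"id": "rs2", "chr": 1, "pos": 200, "a1": "A", "a2": "T", "maf": 0.30},
    {"id": "rs3", "chr": 1, "pos": 300, "a1": "C", "a2": "T", "maf": 0.005},
    {"id": "rs4", "chr": 2, "pos": 400, "a1": "G", "a2": "A", "maf": 0.10},
    {"id": "rs5", "chr": 2, "pos": 400, "a1": "G", "a2": "T", "maf": 0.20},
    {"id": "rs6", "chr": 2, "pos": 100, "a1": "C", "a2": "A", "maf": 0.40},
]
passed = qc_filter(snps)
```

For multi-ancestry methods, the MAF filter would instead be applied per ancestry, keeping a SNP if it passes in at least one group.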
GLGC data analysis with validation on UKBB individuals
We obtain GWAS summary data from the Global Lipids Genetics Consortium (GLGC) for four blood lipid traits including HDL, LDL, TC, and logTG19 on five ancestry groups including EUR , AFR or admixed AFR , Hispanic , EAS , and SAS . Details of the study design, genotyping, quality control and GWAS are previously described19. We validate performance of the various methods on UKBB individuals. Specifically, we select a random set of 20,000 individuals that are of EUR origin and extract all individuals that are of , , , or Hispanic/Latino origin. The origin of the UKBB individuals was determined by a genetic component analysis (Supplementary Notes). We used 50% of the UKBB samples to tune model parameters and train the SL in CT-SLEB and ME-Bayes SL or the linear combination model in weighted PRS and PRS-CSx (tuning set), and the remaining 50% to evaluate PRS performance (testing set). The prediction of the genetic component has a low accuracy for AMR, and given the small number of identified AMR individuals , we do not report prediction on UKBB AMR. We use genotype data of unrelated individuals from the 1000 Genomes Project or tuning samples from UKBB as the LD reference data20. Ancestry- and trait-specific GWAS sample sizes, validation sample sizes, and number of SNPs analyzed are reported in Supplementary Table 4.1. Based on the genomic inflation factor, no inflation is observed for the various ancestry-specific GWAS. The Manhattan plots and QQ plots are reported in Zhang et al. (2022)13. Heritability of the four traits in EUR is estimated using LD score regression (Supplementary Table 7). All GWAS summary statistics went through the same quality control steps as in the PAGE + UKBB + BBJ data analysis as well as one more step, where we further remove SNPs with a GWAS sample size less than 90% of the total GWAS sample size.
The GWAS summary data from GLGC do not have information on ancestry-specific MAF, and thus we use the 1000G LD reference genotype data to calculate ancestry-specific MAF for the step where we filter out all SNPs that have MAF<0.01 in all ancestry groups. The R² values are adjusted for age, sex, and top 10 genetic PCs.
AoU data analysis with validation on UKBB individuals
The individuals included in our analyses are part of the All of Us participant cohort with information collected according to the All of Us Research Program Operational Protocol (https://allofus.nih.gov/sites/default/files/aou_operational_protocol_v1.7_mar_2018.pdf). Detailed information on genotyping, ancestry determination, quality control, removal of related individuals is provided in the All Of Us Research Program Genomic Research Data Quality Report (https://www.researchallofus.org/wp-content/themes/research-hub-wordpress-theme/media/2022/06/All%20Of%20Us%20Q2%202022%20Release%20Genomic%20Quality%20Report.pdf).
On the All of Us platform, we conduct GWAS for BMI and height separately on unrelated individuals of three ancestry groups including EUR , admixed AFR or AFR , and Hispanic/Latino . The GWAS are adjusted for age, sex, and top 16 genetic PCs. Only about 0.9 million SNPs in HapMap 3 + MEGA are included in our analyses, which is due to the small number of overlapping samples across the filtered WGS data, array data, and phenotype data. Similar to the GLGC data analysis, we validate performance of the various methods on UKBB individuals, i.e., 20,000 EUR individuals and individuals of AFR origin that are identified based on a genetic component analysis (Supplementary Notes). Again, the genetic ancestry prediction accuracy for AMR is low, and considering the small number of identified AMR , we do not report validation results on UKBB AMR. We use genotype data of unrelated individuals from the 1000 Genomes Project or tuning samples from UKBB as the LD reference data. Ancestry- and trait-specific GWAS sample sizes, validation sample sizes, and number of SNPs analyzed are reported in Supplementary Table 5.1. Based on the genomic inflation factor, no inflation is observed other than for height in Hispanic/Latino. The Manhattan plots and QQ plots are reported in Zhang et al. (2022)13. Heritability of the two traits in EUR was estimated using LD score regression5 (Supplementary Table 7). All GWAS summary statistics went through the same quality control steps as in the GLGC data analysis. The R² values are adjusted for age, sex, and top 10 genetic PCs.
23andMe Data Analysis
We develop and validate PRS for seven traits, including (1) heart metabolic disease burden, (2) height, (3) any cardiovascular disease (any CVD), (4) depression, (5) migraine diagnosis, (6) morning person, and (7) sing back musical note (SBMN) for EUR, African American (AFR), Latino (AMR), EAS, and SAS based on a large-scale dataset from 23andMe, Inc. We first conduct GWAS separately on the training dataset (70% samples) for each of the five ancestry groups, then apply the various methods to the generated GWAS summary-level association statistics and LD reference data from the 1000 Genomes Project. Within the remaining 30% of the samples, we use 20% to tune model parameters, train the SL in CT-SLEB and ME-Bayes SL, and the linear combination model in weighted PRS and PRS-CSx, then validate the predictive performance of the constructed PRS on the remaining 10% samples. We observe from our analyses on the other three datasets that ME-Bayes SL almost always outperforms the two alternative methods, ME-Bayes and weighted ME-Bayes, and thus for 23andMe data analysis, we only implement ME-Bayes SL but not the two alternative methods.
All GWAS analyses on the training data from 23andMe, Inc. were performed adjusting for age, sex, and the top 5 genetic PCs. Genotype data of unrelated individuals from the 1000 Genomes Project were used to estimate LD matrices. Detailed information on participant inclusion, genotyping, phenotyping, data imputation and quality control, removing related individuals, ancestry determination, and GWAS analysis is provided in Zhang et al. (2022)13. Ancestry- and trait-specific GWAS sample sizes, validation (tuning + testing) sample sizes, and the number of SNPs analyzed are reported in Supplementary Table 6.1. Based on the genomic inflation factor, no inflation is observed for the various ancestry-specific GWAS. The Manhattan plots and QQ plots are reported in Zhang et al. (2022)13. Heritability of the seven traits in EUR is estimated using LD score regression (Zhang et al., 2022)13. All GWAS summary statistics went through the same quality control steps as in the PAGE + UKBB + BBJ data analysis as well as one more step where we further remove SNPs with a GWAS sample size less than 90% of the total GWAS sample size. The residual R² for the two continuous traits was calculated by first regressing each trait on covariates including age, sex, and the top 5 genetic PCs, and then calculating the proportion of variation of the residual explained by the PRS. The residual AUC for the five binary traits was calculated using the “roc.binary” function in the R package RISCA version 1.0171, adjusting for the same set of covariates as for the continuous traits.
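The residual R² calculation for continuous traits can be sketched in Python with NumPy. The function name `residual_r2` and the toy data are hypothetical; the two steps follow the description above (regress the trait on covariates, then measure the share of residual variance explained by the PRS).

```python
import numpy as np


def residual_r2(trait, covariates, prs):
    """R^2 of the PRS for the trait after regressing out covariates."""
    trait = np.asarray(trait, dtype=float)
    design = np.column_stack([np.ones(len(trait)), covariates])
    coef, *_ = np.linalg.lstsq(design, trait, rcond=None)
    resid = trait - design @ coef  # trait adjusted for covariates
    # For a single predictor, R^2 equals the squared Pearson correlation.
    r = np.corrcoef(resid, np.asarray(prs, dtype=float))[0, 1]
    return r ** 2


# Toy example: the PRS is orthogonal to the covariate, and the trait is an
# exact linear function of both, so the residual is fully explained by the PRS.
cov = np.array([-1.0, 1.0, -1.0, 1.0])
score = np.array([-1.0, -1.0, 1.0, 1.0])
y = 3.0 + 2.0 * cov + 0.5 * score
r2 = residual_r2(y, cov, score)
```

With real data the PRS and covariates are correlated and the trait is noisy, so the residual R² is typically far below 1.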
Supplementary Material
Acknowledgements
This work was supported by the following NIH grants: K99 HG012223 (J.J.), K99 CA256513 (H.Z.), R01 HG010480 (N.C., J.J., and J.Zhang), U01HG011724 (N.C.), R35 HG011944 (G.L.W.), and U01 HG007419 (G.L.W.). We thank the Neale Lab and BBJ for making the GWAS summary data from UKBB and BBJ publicly available. Individual-level genotype and phenotype data for UKBB validation samples were obtained under application 17731. The PAGE Study is supported by the following NIH grants: U01 HG007419, R01 HG010297, and R01 HL151152. The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers: 1 OT2 OD026549; 1 OT2 OD026554; 1 OT2 OD026557; 1 OT2 OD026556; 1 OT2 OD026550; 1 OT2 OD026552; 1 OT2 OD026553; 1 OT2 OD026548; 1 OT2 OD026551; 1 OT2 OD026555; IAA #: AOD 16037; Federally Qualified Health Centers: HHSN 263201600085U; Data and Research Center: 5 U2C OD023196; Biobank: 1 U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: 1 U24 OD023163; Communications and Engagement: 3 OT2 OD023205; 3 OT2 OD023206; and Community Partners: 1 OT2 OD025277; 3 OT2 OD025315; 1 OT2 OD025337; 1 OT2 OD025276. In addition, the AoU Research Program would not be possible without the partnership of its participants. We would like to thank the research participants and employees of 23andMe, Inc. for making this work possible. We would like to thank Liz Noblin, Melissa J. Francis, and Emily Voeglein for helping with the research collaboration agreement with Harvard T.H. Chan School of Public Health, Johns Hopkins Bloomberg School of Public Health and 23andMe, Inc.
The analyses in this paper utilized the high-performance computation Biowulf cluster at National Institutes of Health, USA, Faculty of Arts and Sciences Research Computing Cluster at Harvard University, and the Joint High Performance Computing Exchange at Johns Hopkins Bloomberg School of Public Health.
Footnotes
Declaration of interests
J.Zhan, Y.J., J.O., and B.L.K. are employed by and hold stock or stock options in 23andMe, Inc.
Data and code availability
The simulated genotype data for 600K subjects of EUR, AFR, AMR, EAS, or SAS ancestry can be accessed at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/COXHAP. The EUR GWAS summary data for BMI34, HDL35, and LDL35 based on UKBB samples (GWAS round 2) published by the Neale Laboratory can be downloaded at http://www.nealelab.is/uk-biobank. The EAS GWAS summary data from BBJ for BMI36, HDL37, and LDL37 were downloaded from http://jenger.riken.jp/en/result. Split GWAS summary data from PAGE for BMI, HDL, and LDL stratified for AFR and AMR, as used in the training sets in our data analysis, are available upon request (email to Jin.Jin@Pennmedicine.upenn.edu). Stratified GWAS summary data from PAGE for BMI, HDL and LDL for AFR and AMR (not split for training/validation sets) is available on LDHub (https://ldsc.broadinstitute.org). GWAS summary data from GLGC for HDL, LDL, TC, and logTG stratified for EUR, AFR, AMR, EAS, and SAS can be downloaded at http://csg.sph.umich.edu/willer/public/glgc-lipids2021/results/ancestry_specific/. GWAS summary data from AoU for BMI and height stratified for EUR, AFR, and AMR are available upon request (email to Jin.Jin@Pennmedicine.upenn.edu). GWAS summary data from 23andMe Inc. for heart metabolic disease burden, height, any CVD, depression, migraine diagnosis, morning person, and SBMN stratified for EUR, AFR, AMR, EAS, and SAS can be requested through 23andMe, Inc. to qualified researchers under an agreement with 23andMe, Inc. that protects the privacy of the 23andMe participants. Please visit https://research.23andme.com/collaborate/#dataset-access/ to request data access. Participants included in our 23andMe data analysis provided informed consent and participated in the research online, under a protocol approved by the external AAHRPP-accredited IRB, Ethical & Independent Review Services. 1000 Genomes Phase 3 reference data can be downloaded from https://mathgen.stats.ox.ac.uk/impute/1000GP_Phase3.html. 
Our estimated LD block matrices for EUR, AFR, AMR, EAS, and SAS for approximately 2.0 million SNPs in HapMap 3 plus MEGA that are also available in the 1000 Genomes Project can be downloaded from https://github.com/Jin93/ME-Bayes-SL. LD block information, including the start and end positions of each block, was extracted from the “lassosum” R package and can be downloaded from https://github.com/tshmak/lassosum.
PLINK 1.9: https://www.cog-genomics.org/plink. PLINK 2.0: https://www.cog-genomics.org/plink/2.0/. LDpred2: https://privefl.github.io/bigsnpr/articles/LDpred2.html. The R package “bigsnpr” used in the LDpred2 pipeline is available for download on GitHub at https://github.com/privefl/bigsnpr. PRS-CSx: https://github.com/getian107/PRScsx. CT-SLEB: https://github.com/andrewhaoyu/CTSLEB. LD score regression: https://github.com/bulik/ldsc. The ME-Bayes SL pipeline, along with the R code for the simulation studies and data analyses in this paper, can be accessed at https://github.com/Jin93/ME-Bayes-SL.
References
- 1. Duncan L. et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat Commun 10, 3328 (2019).
- 2. Liu C. et al. Generalizability of Polygenic Risk Scores for Breast Cancer Among Women With European, African, and Latinx Ancestry. JAMA Network Open 4, e2119084 (2021).
- 3. Wojcik G.L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).
- 4. Yu Z. et al. Polygenic Risk Scores for Kidney Function and Their Associations with Circulating Proteome, and Incident Kidney Diseases. J Am Soc Nephrol (2021).
- 5. Rabinowitz J.A. et al. Genetic propensity for risky behavior and depression and risk of lifetime suicide attempt among urban African Americans in adolescence and young adulthood. Am J Med Genet B Neuropsychiatr Genet 186, 456–468 (2021).
- 6. Perkins D.O. et al. Polygenic Risk Score Contribution to Psychosis Prediction in a Target Population of Persons at Clinical High Risk. Am J Psychiatry 177, 155–163 (2020).
- 7. Márquez-Luna C., Loh P.R., South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium & Price A.L. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet Epidemiol 41, 811–823 (2017).
- 8. Ruan Y. et al. Improving polygenic prediction in ancestrally diverse populations. Nature Genetics 54, 573–580 (2022).
- 9. Cai M. et al. A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. Am J Hum Genet 108, 632–655 (2021).
- 10. Tian P. et al. Multiethnic polygenic risk prediction in diverse populations through transfer learning. Front Genet 13, 906965 (2022).
- 11. Sun Q. et al. Improving polygenic risk prediction in admixed populations by explicitly modeling ancestral-specific effects via GAUDI. bioRxiv, 2022.10.06.511219 (2022).
- 12. Zhang H. et al. Novel Methods for Multi-ancestry Polygenic Prediction and their Evaluations in 3.7 Million Individuals of Diverse Ancestry. bioRxiv, 2022.03.24.485519 (2022).
- 13. Privé F., Arbel J. & Vilhjálmsson B.J. LDpred2: better, faster, stronger. Bioinformatics 36, 5424–5431 (2020).
- 14. Privé F., Vilhjálmsson B.J., Aschard H. & Blum M.G.B. Making the Most of Clumping and Thresholding for Polygenic Scores. Am J Hum Genet 105, 1213–1221 (2019).
- 15. van der Laan M.J., Polley E.C. & Hubbard A.E. Super learner. Stat Appl Genet Mol Biol 6, Article 25 (2007).
- 16. Zou H. & Hastie T. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 67, 301–320 (2005).
- 17. Friedman J., Hastie T. & Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33, 1–22 (2010).
- 18. Weissbrod O. et al. Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores. Nat Genet 54, 450–458 (2022).
- 19. International HapMap 3 Consortium et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010).
- 20. Bien S.A. et al. Strategies for Enriching Variant Coverage in Candidate Disease Loci on a Multiethnic Genotyping Array. PLoS One 11, e0167758 (2016).
- 21. Siva N. 1000 Genomes project. Nature Biotechnology 26, 256 (2008).
- 22. Graham S.E. et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (2021).
- 23. Privé F., Arbel J., Aschard H. & Vilhjálmsson B.J. Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. Human Genetics and Genomics Advances 3, 100136 (2022).
- 24. Ge T., Chen C.Y., Ni Y., Feng Y.A. & Smoller J.W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 10, 1776 (2019).
- 25. Lloyd-Jones L.R. et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat Commun 10, 5086 (2019).
- 26. Weissbrod O. et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat Genet 52, 1355–1363 (2020).
- 27. Yang S. & Zhou X. Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. Am J Hum Genet 106, 679–693 (2020).
- 28. Khan A. et al. Genome-wide polygenic score to predict chronic kidney disease across ancestries. Nat Med 28, 1412–1420 (2022).
- 29. Park T. & Casella G. The Bayesian Lasso. Journal of the American Statistical Association 103, 681–686 (2008).
- 30. Carvalho C.M., Polson N.G. & Scott J.G. Handling Sparsity via the Horseshoe. in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics Vol. 5 (eds van Dyk D. & Welling M.) 73–80 (PMLR, Proceedings of Machine Learning Research, 2009).
- 31. Polson N.G., Scott J.G. & Windle J. The Bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 713–733 (2014).
- 32. Truong B. et al. Integrative polygenic risk score improves the prediction accuracy of complex traits and diseases. medRxiv, 2023.02.21.23286110 (2023).
- 33. Albiñana C. et al. Multi-PGS enhances polygenic prediction: weighting 937 polygenic scores. medRxiv, 2022.09.14.22279940 (2022).
- 34. Locke A.E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197–206 (2015).
- 35. Willer C.J. et al. Discovery and refinement of loci associated with lipid levels. Nat Genet 45, 1274–1283 (2013).
- 36. Akiyama M. et al. Genome-wide association study identifies 112 new loci for body mass index in the Japanese population. Nat Genet 49, 1458–1467 (2017).
- 37. Kanai M. et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat Genet 50, 390–400 (2018).
References
- 1. Berisa T. & Pickrell J.K. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics 32, 283–285 (2016).
- 2. Privé F., Arbel J. & Vilhjálmsson B.J. LDpred2: better, faster, stronger. Bioinformatics 36, 5424–5431 (2020).
- 3. Purcell S. & Chang C. PLINK 2.0. URL: www.cog-genomics.org/plink/2.0/.
- 4. Mak T.S.H., Porsch R.M., Choi S.W., Zhou X. & Sham P.C. Polygenic scores via penalized regression on summary statistics. Genet Epidemiol 41, 469–480 (2017).
- 5. Bulik-Sullivan B.K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet 47, 291–295 (2015).
- 6. Vilhjálmsson B.J. et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet 97, 576–592 (2015).
- 7. Chang C.C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
- 8. Purcell S. & Chang C. PLINK 1.90. Vol. 2022.
- 9. Privé F., Arbel J. & Vilhjálmsson B.J. LDpred2: better, faster, stronger. Bioinformatics 36, 5424–5431 (2020).
- 10. Márquez-Luna C., Loh P.R., South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium & Price A.L. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet Epidemiol 41, 811–823 (2017).
- 11. Ruan Y. et al. Improving polygenic prediction in ancestrally diverse populations. Nature Genetics 54, 573–580 (2022).
- 12. Ge T., Chen C.Y., Ni Y., Feng Y.A. & Smoller J.W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun 10, 1776 (2019).
- 13. Zhang H. et al. Novel Methods for Multi-ancestry Polygenic Prediction and their Evaluations in 3.7 Million Individuals of Diverse Ancestry. bioRxiv, 2022.03.24.485519 (2022).
- 14. Sudlow C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 12, e1001779 (2015).
- 15. Wojcik G.L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570, 514–518 (2019).
- 16. Friedewald W.T., Levy R.I. & Fredrickson D.S. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge. Clin Chem 18, 499–502 (1972).
- 17. Akiyama M. et al. Genome-wide association study identifies 112 new loci for body mass index in the Japanese population. Nat Genet 49, 1458–1467 (2017).
- 18. Kanai M. et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat Genet 50, 390–400 (2018).
- 19. Graham S.E. et al. The power of genetic diversity in genome-wide association studies of lipids. Nature 600, 675–679 (2021).
- 20. Siva N. 1000 Genomes project. Nature Biotechnology 26, 256 (2008).