Abstract
Polygenic risk scores (PRS) calculated from genome-wide association studies (GWAS) of Europeans are known to have substantially reduced predictive accuracy in non-European populations, limiting their clinical utility and raising concerns about health disparities across ancestral populations. Here, we introduce a statistical framework named X-Wing to improve predictive performance in ancestrally diverse populations. X-Wing quantifies local genetic correlations for complex traits between populations, employs an annotation-dependent estimation procedure to amplify correlated genetic effects between populations, and combines multiple population-specific PRS into a unified score with GWAS summary statistics alone as input. Through extensive benchmarking, we demonstrate that X-Wing pinpoints portable genetic effects and substantially improves PRS performance in non-European populations, showing 14.1%–119.1% relative gain in predictive R2 compared to state-of-the-art methods based on GWAS summary statistics. Overall, X-Wing addresses critical limitations in existing approaches and may have broad applications in cross-population polygenic risk prediction.
Subject terms: Genetic association study, Genome-wide association studies, Predictive medicine
Polygenic risk scores are used to improve risk prediction for common diseases but typically have reduced accuracy for individuals of non-European ancestry. Here, the authors present an approach that improves polygenic risk score performance in ancestrally diverse populations.
Introduction
Genome-wide association studies (GWAS) have identified tens of thousands of genotype-phenotype associations for human complex traits1,2. Polygenic risk score (PRS) based on GWAS, typically calculated as a weighted sum of trait-associated allele counts across numerous loci in the genome, is an effective tool to quantify the aggregated genetic propensity for a trait or disease3–8. With rapid advances in GWAS sample size and statistical methodology for modeling summary-level data, PRS has shown substantially improved prediction accuracy and great potential in disease risk screening and precision medicine9–11. However, since the vast majority of GWAS participants are of European descent, current PRS models are more effective in Europeans but are known to have substantially reduced accuracy in other populations, which severely limits their clinical utility12–16. There is an urgent need to improve the effectiveness of PRS in diverse human populations and provide equitable access to genomic advances in precision medicine14,17–20.
There have been three types of approaches to improve cross-ancestry genetic prediction in the literature. First, prioritizing causal variants using functional genomic annotations can improve the portability of PRS based on European GWAS21–23. Second, several studies combine multiple PRS trained in various populations using linear regression to optimize the predictive performance in the target (non-European) population16,23,24. The third type of approach parametrizes the degree to which genetic effects are correlated across populations, and integrates GWAS summary statistics from multiple populations in a multivariate model to improve effect size estimation and prediction accuracy in each respective population16,25–27. These models have achieved moderately improved predictive performance compared to conventional single-population approaches, but several critical limitations and challenges remain. First, previous studies used epigenetic regulatory annotations to prioritize variants for PRS21–23. While these annotations improved PRS portability for some traits, they are not designed to quantify the correlated genetic effects between populations28, and there is no guarantee that the same set of annotations will improve PRS performance for all complex traits. Additionally, existing statistical frameworks that leverage functional annotation data to improve PRS29–33 do not apply to multi-ancestry predictive modeling. Finally, in order to combine multiple population-specific PRS, the current practice requires additional data from the target (non-European) population. This includes individual-level genotype and phenotype samples that are independent of the GWAS used to train single-population PRS. In practice, this type of data can be nearly impossible to obtain34. In order to have broad applications, PRS models need to use the increasingly accessible GWAS summary statistics from global populations35–37 as input.
In this work, we introduce a cross-population weighting (X-Wing) framework for genetic prediction. There are three main innovations in our approach. First, we introduce an annotation framework based on cross-population local genetic correlation. This annotation extends our previous work38 to directly quantify correlated (portable) genetic effects between multiple ancestral populations. Second, we introduce a Bayesian method to incorporate functional annotation data into multi-population PRS modeling, where annotation-dependent statistical shrinkage amplifies the effects of annotated variants (i.e., variants with correlated effects between populations). Finally, we resolve a long-standing challenge in the field and introduce a method to combine multiple PRS trained in various populations using GWAS summary data alone as input. We demonstrate the superior performance of X-Wing PRS through extensive benchmarking using numerous GWAS datasets, including UK Biobank (UKB)39, Biobank Japan (BBJ)40, and Population Architecture using Genomics and Epidemiology Consortium (PAGE) study41.
Results
Methods overview
The X-Wing workflow is illustrated in Fig. 1. We have previously developed a scan statistic approach38 for identifying genomic regions with correlated effects on two complex traits. In this paper, we first extend this approach to identify correlated genetic effects on the same trait between two populations. Once identified, these genomic regions explain the shared genetic basis of the phenotype between populations and could be an informative annotation for prioritizing single-nucleotide polymorphisms (SNPs) in PRS models. Next, to quantitatively incorporate this annotation in multi-population PRS modeling, we introduce a Bayesian framework in which annotation-dependent shrinkage parameters allow variable degrees of statistical shrinkage between annotated and non-annotated SNPs. Coupled with other shrinkage parameters that do not depend on functional annotations, this framework amplifies SNP predictors that show correlated effects between populations while ensuring robustness to diverse types of genetic architecture42–45. Although we only explore its performance using the annotation derived from local genetic correlation in this paper, we note that this is a general framework that allows an arbitrary collection of annotation variables as input and also accounts for population-specific linkage disequilibrium (LD) and allele frequencies. Finally, we introduce an innovative strategy to linearly combine multiple PRS trained in different populations using summary association data alone. We employ a summary statistics-based repeated learning approach motivated from our recent work8 and its extension33 to estimate the regression weights for combining multiple PRS. The entire X-Wing procedure only requires GWAS summary data and LD references as input, which is a major advance compared to existing approaches. We present the statistical details and technical discussions in Methods and Supplementary Methods.
X-Wing pinpoints local genetic correlation between ancestral populations
We first carried out simulations to assess the performance of our approach in identifying cross-population local genetic correlations. Using European and East Asian samples in 1000 Genomes Project phase III data46, we simulated chromosome 22 genotypes of 50,000 individuals, and simulated quantitative traits in two populations under an infinitesimal model with varying heritability levels (Methods). When the traits in two populations are independent, X-Wing showed well-controlled type-I error rates (Supplementary Data 1). Since no existing method can estimate local genetic correlation between two distinct ancestral populations, we compared our results with PESCA47, a recently developed approach for estimating the risk SNP proportion shared by two populations, to gain some perspective on the statistical property of our inference results. PESCA also showed well-controlled type-I error across simulation settings, but X-Wing consistently achieved higher statistical power, especially when heritability is large (Fig. 2a).
To assess the robustness of our method to model mis-specification, we considered additional data-generating models in which SNP heritability is enriched in certain genomic regions38 or is dependent on LD and minor allele frequency (MAF)48. We also investigated binary phenotypes using a liability threshold model. We obtained consistent results in these analyses, with our method showing well-controlled type-I error (Supplementary Data 2–4) and superior statistical power (Fig. 2b and Supplementary Fig. 1).
As a robustness check, we also performed simulations based on genome-wide data. X-Wing showed well-calibrated type-I error rates (Supplementary Data 5) and identified more signal regions than PESCA when two populations shared local genetic correlations (Supplementary Fig. 2). Notably, PESCA suffered substantial type-I error inflation when two simulated traits are independent (Supplementary Data 5) and showed high false positive rates when two populations are correlated (Supplementary Data 6).
Local genetic correlation between Europeans and East Asians for 31 traits
We estimated local genetic correlations for 31 complex traits (Supplementary Data 7) between Europeans and East Asians using GWAS summary statistics from UKB (N = 314,921–360,388)39 and BBJ (N = 42,790–159,095)40. In total, we identified 4,160 regions with significant cross-population local genetic correlations across 31 traits (FDR < 0.05; Supplementary Data 8). Of these, the vast majority (4,008 regions) showed positive correlations. Among the identified regions, 958 have genome-wide significant SNPs in both populations and 2,119 have significant SNPs in only one population (Supplementary Fig. 3). The number of significantly correlated regions identified for each trait is positively correlated with the global genetic correlation estimated from genome-wide data25 (Supplementary Fig. 4; correlation r = 0.49). As a comparison, we also applied PESCA to these data, and identified 1,968 risk regions shared by two populations (Supplementary Data 8). Our approach identified more significant regions in 30 out of 31 traits (Fig. 2c). The regions identified by our approach also explained larger proportions of cumulative genetic covariance in all 31 traits (Fig. 2d). Further, all conclusions remained similar when only HapMap3 SNPs were included in the analysis (Supplementary Fig. 5).
Overall, regions with significant local genetic correlations cover 0.06% (basophil) to 1.73% (height) of the genome, but explain 13.22% (diastolic blood pressure) to 60.17% (mean corpuscular volume) of the total genetic covariance between Europeans and East Asians (Fig. 3a and Supplementary Data 9), showing fold enrichments ranging from 28.09 to 546.83. Cross-population genetic correlations inside X-Wing-identified regions are substantially higher than the genome-wide genetic correlation estimates, while correlations in the remaining genome are consistently lower (Fig. 3b). Notably, among the traits we analyzed, basophil count has the lowest cross-population genetic correlation (rg = 0.23) which is consistent with previous reports49,50. But even for basophil count, we observed a substantial genetic correlation in regions identified by our approach (rg = 0.83). To guard against statistical artifacts, we performed falsification tests by simulating a trait that is uncorrelated between populations (Methods). We did not identify significant global or local correlations for this simulated trait (Fig. 3b).
We also sought to replicate local correlations between Europeans and East Asians for four lipid traits (HDL cholesterol, LDL cholesterol, total cholesterol, and triglycerides) in independent data. We used European GWAS from the Global Lipids Genetics Consortium (GLGC, N = 95,454–100,184)51 and East Asian GWAS from the Asian Genetic Epidemiology Network (AGEN, N = 27,657–34,374)52 as the replication datasets (Supplementary Data 10). In total, we identified 124 significant regions for four lipid traits in the replication analysis, 102 of which overlapped with significant regions identified in the discovery stage (Fig. 3c). Regions identified in the discovery stage showed substantial enrichment for genetic covariance in the replication data (greater than 100-fold for all four traits; Supplementary Data 11). Further, we ranked the regions identified in the discovery stage by their p-values. The cumulative proportions of genetic covariance explained by these regions were nearly identical between discovery and replication analyses (Fig. 3d and Supplementary Fig. 6).
Local genetic correlation annotation improves PRS prediction accuracy across populations
Next, we investigated whether incorporating the annotation based on local genetic correlation can improve the cross-ancestry prediction accuracy of PRS. We used European GWAS from UKB and East Asian GWAS from BBJ to train PRS for 31 complex traits, and evaluated PRS performance using independent East Asian samples in UKB (N = 2683). In this analysis, our approach jointly models GWAS in two populations and outputs separate SNP weights for Europeans and East Asians (Methods). Here, we used annotation-informed PRS based on posterior SNP effects estimated for Europeans, and report its performance in the East Asian target sample (thus, quantifying the portability of European scores in the East Asian population). PRS performance is quantified using partial R2 adjusting for covariates (Methods). Our annotation-informed PRS showed a 4.6% (Pwilcoxon = 7.0e-6) and 35.2% (Pwilcoxon = 1.0e-7) median relative improvement in R2 compared to PRS-CSx14 and XPASS20 (Fig. 4a; Supplementary Fig. 7; Supplementary Data 12), demonstrating the effectiveness of incorporating local genetic correlation annotation. In fact, we found both higher overall R2 and larger increase of R2 in annotated genomic regions (i.e., regions with correlated effects between populations) using our approach. PRS using only SNPs outside annotated regions did not show any improvement (Fig. 4b, c and Supplementary Data 13). We also compared our results with PolyFun-pred18, an approach that uses functional fine-mapping to improve PRS performance. Our PRS showed a substantial 78.1% (Pwilcoxon = 5.8e-4) relative gain in R2, suggesting that fine-mapping in European population alone is a sub-optimal approach compared to multi-population joint modeling (Supplementary Fig. 8 and Supplementary Data 12).
X-Wing combines multiple population-specific PRS using GWAS summary statistics
Next, we investigated the benefit of combining multiple PRS trained for different populations into a single score. We evenly split the East Asian target sample in UKB into a validation set, in which we fit a regression model to combine the European and East Asian scores, and a testing set, in which we evaluate the performance of the combined PRS. We compared the prediction accuracy of X-Wing PRS with PRS-CSx, XPASS, and PolyPred+ using the same regression approach to combine scores. X-Wing showed a median relative increase in R2 of 3.9% (Pwilcoxon = 1.0e-6), 46.1% (Pwilcoxon = 1.9e-9), and 24.7% (Pwilcoxon = 0.02) compared to PRS-CSx, XPASS, and PolyPred+ in East Asian target samples, respectively (Fig. 5a, Supplementary Fig. 7, and Supplementary Data 12). We also assessed the combined scores based on UKB, BBJ, and PAGE in admixed Americans and Africans. Our method showed a 3.2% (Pwilcoxon = 0.01) and 1.9% (Pwilcoxon = 0.01) median relative increase in R2 compared to PRS-CSx in admixed Americans and Africans, respectively (Supplementary Figs. 9, 10 and Supplementary Data 14, 15). XPASS was excluded since it cannot take more than two GWAS datasets as input, and PolyPred+ was excluded since it did not release PRS coefficients estimated using PAGE. We also performed sensitivity analyses by varying the size of the genetic correlation annotation, the upper bound of region size, and the merge distance used in identifying local genetic correlations. In addition, we examined PRS performance after excluding the MHC region and explored estimating the global shrinkage parameter using a model tuning approach instead of the full Bayesian procedure (Supplementary Methods). We obtained consistent results in these analyses, demonstrating the robustness of X-Wing to these choices (Supplementary Figs. 11–18, Supplementary Data 16–22). Finally, we performed simulations to benchmark the predictive performance of PRS using X-Wing, PRS-CSx and XPASS (Supplementary Methods). X-Wing shows consistent improvement over PRS-CSx and XPASS in the presence of local genetic correlation across two populations (Supplementary Fig. 19).
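The validation/testing split used for the regression-based combination can be illustrated with a minimal Python sketch. It assumes two precomputed population-specific scores and a quantitative phenotype; all data are simulated, the names `fit_prs_weights` and `combined_r2` are hypothetical, and covariate adjustment (partial R2) is omitted for brevity, so this is not the evaluation pipeline used in the paper.

```python
import numpy as np

def fit_prs_weights(scores_val, y_val):
    """Ordinary least squares weights (intercept first) for combining PRS."""
    X = np.column_stack([np.ones(len(y_val)), scores_val])
    w, *_ = np.linalg.lstsq(X, y_val, rcond=None)
    return w

def combined_r2(scores_test, y_test, w):
    """Squared correlation between the combined score and the phenotype."""
    combined = w[0] + scores_test @ w[1:]
    return float(np.corrcoef(combined, y_test)[0, 1] ** 2)

# Simulated (hypothetical) scores and phenotype, for illustration only.
rng = np.random.default_rng(0)
n = 2000
prs_eur = rng.normal(size=n)
prs_eas = 0.5 * prs_eur + rng.normal(size=n)
y = 0.2 * prs_eur + 0.3 * prs_eas + rng.normal(size=n)
scores = np.column_stack([prs_eur, prs_eas])

half = n // 2                                  # even split: validation / testing
w = fit_prs_weights(scores[:half], y[:half])   # fit weights in validation set
r2 = combined_r2(scores[half:], y[half:], w)   # evaluate in testing set
```

Fitting the weights and evaluating the combined score in disjoint halves avoids overfitting the combination weights to the samples used for evaluation.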
Finally, we demonstrated that population-specific PRS can be combined using GWAS summary data alone. We used summary-statistics-based repeated learning (Methods), instead of regressions trained on reserved samples, to linearly combine multiple PRS. This analytic strategy showed almost identical results compared to the gold-standard regression approach in East Asian, admixed American, and African target samples (regression slope = 0.983, 1.007, and 0.971, respectively) (Fig. 5b, Supplementary Figs. 10, 20, and Supplementary Data 23). Notably, if no external individual-level data are available for regression model training, the current best PRS approach in practice is to use posterior SNP effects estimated for one population (Methods). Compared to the best-performing population-specific scores, X-Wing PRS can be trained using the same input data but showed a substantial improvement in prediction accuracy, with the median relative increase in R2 ranging from 25.4% to 58.5% (Pwilcoxon = 1.3e-8 to 1.9e-9) in East Asians, 14.1% to 74.2% (Pwilcoxon = 4.8e-4 to 2.4e-4) in admixed Americans, and 30.2% to 119.1% (Pwilcoxon = 0.01 to 2.4e-4) in Africans (Fig. 5c and Supplementary Figs. 10, 20, 21). We further compared X-Wing performance with the “-meta” option in PRS-CSx that requires no additional validation cohort. X-Wing showed a median relative increase in R2 of 10.2% (Pwilcoxon = 3.6e-3), 9.6% (Pwilcoxon = 0.02), and 20.2% (Pwilcoxon = 2.4e-4) for traits in East Asians, Africans, and admixed Americans, respectively (Supplementary Fig. 22). We also evaluated X-Wing performance using a binary trait, type-2 diabetes, in East Asians. X-Wing PRS showed both higher liability R2 and higher AUC than PRS-CSx and XPASS (Supplementary Fig. 23)53,54. Overall, X-Wing PRS shows better predictive performance than the alternative methods tested (Supplementary Fig. 24).
Discussion
In this paper, we introduced X-Wing, a sophisticated statistical framework for improving PRS performance in ancestrally diverse populations. X-Wing quantifies cross-population local genetic correlation, and incorporates it as an annotation into a Bayesian framework which amplifies correlated SNP effects between populations through annotation-dependent statistical shrinkage. It also combines multiple population-specific PRS to further improve prediction accuracy while using GWAS summary data alone as input. Applied to numerous GWAS traits, we demonstrated that local genetic correlations help pinpoint portable genetic effects and the annotation-informed PRS shows consistently and substantially improved performance across populations.
Our study presents several methodological innovations that will likely be generalizable and impactful. First, we introduced the concept of cross-population local genetic correlation and developed a scan statistic method to map correlated regions. Complementary to global genetic correlation, local genetic correlation refines the resolution in identifying shared genetic components between populations and provides critical insights into the genetic architecture of complex traits in diverse human populations. Second, we developed a new Bayesian framework that allows the integrative analysis of functional annotation data in multi-population PRS modeling. In this work, we showcased its effectiveness in cross-population risk prediction using an annotation derived from local genetic correlations. But we note that it is a general framework that can incorporate arbitrary sets of annotation data, such as the epigenetic annotations used in the PRS literature, in silico variant annotations based on machine learning exercises, or LD and allele frequencies which have been shown to improve heritability estimation21,23,33,55–57 (Supplementary Methods). It may also be applied to improve PRS portability across other non-ancestry-related demographic groups58. Finally, we introduced a strategy to combine multiple population-specific PRS into one improved score using summary statistics alone. This is innovative since fitting a regression model in an independent sample has long been considered the standard (and only) approach for combining multiple scores. This represents a significant advance in the field since obtaining additional individual-level samples that are independent from input GWAS can be a major challenge in practice. This is also generalizable since the same technique could be used to improve any PRS by creating an “omnibus” score over a number of methods, and the application is not limited to trans-ancestry risk prediction.
In addition to these methodological innovations, our local genetic correlation analysis identified many regions that are of biological interest. We have demonstrated that genomic regions identified by our approach show a substantial effect correlation on basophil count between two populations despite the low genetic correlation estimated from genome-wide data. More specifically, a region spanning 219 KB on chromosome 3 shows correlated effects between Europeans and East Asians for basophil count (Supplementary Fig. 25). Candidate gene GATA2 at this locus encodes a zinc-finger transcription factor which plays an essential role in proliferation, differentiation, and survival of hematopoietic cells59. In particular, expression of GATA2, coupled with CCAAT enhancer-binding protein α (C/EBPα) and transcription factor STAT5, directs the differentiation of granulocyte/monocyte progenitors (GMPs) into basophils60,61. Another correlated region for basophil count is a locus spanning 51 KB on chromosome 3 (Supplementary Fig. 26). Gene IL5RA, which encodes a subunit of a heterodimeric cytokine receptor that specifically binds to interleukin-5 (IL-5), lies 13 KB away from the identified region. Binding of the receptor to its ligand IL-5 is required for the biological activity of IL-5. Notably, IL-5 is a human basophilopoietin that promotes the formation and differentiation of human basophils62,63. Many other traits have interesting findings too. For example, a region spanning 48 KB on chromosome 1 is associated with C-reactive protein in two populations (Supplementary Fig. 27). The locus covers the gene NLRP3, which was identified as a risk gene associated with C-reactive protein levels in an independent GWAS64. NLRP3 encodes a pyrin-like protein that constitutes the NLRP3 inflammasome complex65. It was suggested that the NALP3 inflammasome can activate nuclear factor-κB signaling66 which affects C-reactive protein levels in Hep3B cells64,67. 
These results provide insights into the shared genetic basis of complex traits across ancestrally diverse populations. The local genetic correlation estimation procedure implemented in X-Wing may have broad applications in future studies that involve joint modeling of multi-population GWAS associations.
Our study also has some limitations. First, although our method does not require any individual-level sample with both genotype and phenotype information, it remains crucial to have LD reference panels that match the input GWAS. We observed an improvement in PRS performance when applying our method to highly diverse samples such as the PAGE study, but it remains unclear how to best select LD references for multi-ancestry GWAS and admixed populations68. Second, we generally believe that statistical methods alone cannot fully solve the challenges in cross-population risk prediction14,17. It is an important future direction to apply state-of-the-art methods to the large and highly diverse GWAS conducted in global biobank cohorts36, and to carefully benchmark and combine various annotation data types and PRS training procedures. Third, although we have demonstrated an overall improved prediction accuracy over alternative methods across many traits, the relative improvement in R2 reported for a single trait may be statistically imprecise (Supplementary Data 12) and should be interpreted with caution. Fourth, our simulations were carried out using HapGen2-simulated genotypes, which are known to have a smaller fixation index (FST) than expected between two populations. Fifth, only categorical annotations were used for PRS construction in our analysis. It may be of interest to directly estimate local genetic correlation first, and then incorporate the correlation values as a quantitative annotation to improve PRS.
Finally, the overall superior performance of X-Wing can be attributed to the incorporation of cross-population local genetic correlation and summary statistics-based PRS combination. Although we anticipate improved prediction accuracy after incorporating the local genetic correlation annotation, imprecise estimation of local genetic correlation may affect PRS performance when input GWAS have limited sample size. However, the summary statistics-based PRS combination strategy is robust in our analyses. In cases where there are concerns about the quality of local genetic correlation estimation, integrating summary statistics-based PRS combination into existing methods16,23 should still be a strategy for consideration.
Taken together, X-Wing addresses major challenges in existing PRS methods, showcases multiple innovations in trans-ancestry GWAS modeling, and substantially improves the prediction accuracy of PRS in non-European populations. These methodological advances, in conjunction with the ever-growing GWAS sample size especially in non-European populations, give hope to broad and equitable applications of genomic precision medicine around the globe.
Methods
Quantifying local genetic correlations between ancestral populations
We extend the LOGODetect38 framework to detect genomic regions showing local genetic correlations between two ancestral populations. Suppose the association z-scores for two populations are denoted as $z_k = X_k^{T} Y_k / \sqrt{N_k}$ ($k = 1, 2$). Here, $Y_k$ is an $N_k$-dimensional vector of standardized phenotype values with mean 0 and variance 1, $X_k$ is the standardized genotype matrix of dimension $N_k \times M$, $N_k$ is the GWAS sample size for population $k$, and $M$ is the number of SNPs. We define the scan statistic as
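$$Q(R) = \frac{\sum_{i \in R} z_{1i}\, z_{2i}}{\left(\sum_{i \in R} \sqrt{\Sigma_{1,ii}\,\Sigma_{2,ii}}\right)^{\theta}} \tag{1}$$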
where R is the index set for SNPs in a genomic region, Σk is the variance-covariance matrix of zk, and Σk,ii denotes the i-th diagonal element of Σk. We note that the Σk matrix can be estimated from three quantities: the trait heritability, which can be estimated using GWAS summary statistics25; the LD matrix Vk, which can be estimated using a reference panel; and an unbiased estimator of the squared LD matrix, which depends on the sample size of the LD reference panel. The numerator in the scan statistic is the inner product of association z-scores for two populations in a genomic region, which quantifies the correlation of SNP effect sizes. The denominator in the scan statistic adjusts for the effect of LD in two populations, where a tuning parameter θ controls the impact of LD. Technical details of the scan statistic, the estimator of Σk, and the selection procedure for θ can be found in the Supplementary Methods.
To perform statistical inference, we use the maximal scan statistic over all possible genomic regions as the test statistic:
$$Q_{\max} = \max_{R:\,|R| \le C} \left| Q(R) \right| \tag{2}$$
where C controls the upper bound of the region size (i.e., number of SNPs) and is pre-specified as 2000 in our analyses. Similar to local genetic correlation analysis in a single population38, we draw 5000 Monte Carlo simulations of z-scores for each population to assess the null distribution of Qmax, and we apply the scanning procedure to identify significant genomic regions showing cross-population local genetic correlations. Significant regions less than 100 KB apart are merged into a single segment.
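The scanning and merging steps can be sketched in Python as follows. This is a simplified illustration, not the X-Wing implementation: it assumes the scan statistic is a region-level sum of z-score products divided by a region-level sum of positive per-SNP LD weights raised to the power θ, and the weights `w` are a hypothetical stand-in for the LD-dependent terms. The Monte Carlo null distribution is omitted, and `merge_regions` assumes all regions lie on the same chromosome.

```python
import numpy as np

def max_scan_statistic(z1, z2, w, theta, max_size):
    """Return (Qmax, start, end) over contiguous regions of at most max_size SNPs.

    Assumed form: Q(R) = sum_{i in R} z1_i * z2_i / (sum_{i in R} w_i) ** theta.
    `end` is exclusive; `w` must be strictly positive.
    """
    num = np.concatenate([[0.0], np.cumsum(z1 * z2)])   # prefix sums of products
    den = np.concatenate([[0.0], np.cumsum(w)])         # prefix sums of weights
    best = (0.0, 0, 0)
    for size in range(1, min(max_size, len(z1)) + 1):
        q = (num[size:] - num[:-size]) / (den[size:] - den[:-size]) ** theta
        i = int(np.argmax(np.abs(q)))
        if abs(q[i]) > best[0]:
            best = (float(abs(q[i])), i, i + size)
    return best

def merge_regions(regions, max_gap=100_000):
    """Merge (start_bp, end_bp) regions separated by less than max_gap base pairs."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Toy example: two adjacent SNPs carry concordant signals in both populations.
z1 = np.array([0.0, 0.0, 3.0, 3.0, 0.0, 0.0])
z2 = np.array([0.0, 0.0, 3.0, 3.0, 0.0, 0.0])
q_max, start, end = max_scan_statistic(z1, z2, np.ones(6), theta=0.5, max_size=3)
merged = merge_regions([(100, 200), (50_000, 60_000), (500_000, 600_000)])
```

Prefix sums keep the cost of evaluating every window of a given size linear in the number of SNPs, so the full scan costs on the order of M × C operations in this sketch.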
An annotation-dependent Bayesian horseshoe regression model for PRS
Next, we describe our Bayesian PRS framework with annotation-dependent statistical shrinkage. Consider an additive genetic model:
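$$Y_k = X_k \beta_k + \epsilon_k, \quad \epsilon_k \sim \mathrm{MVN}\left(0, \sigma_k^2 I_k\right), \quad p\left(\sigma_k^2\right) \propto \sigma_k^{-2} \tag{3}$$
where $\beta_k$ is an $M$-dimensional vector of SNP effect sizes in population $k$, and $\epsilon_k$ is a vector of error terms with variance $\sigma_k^2$, to which we assign a non-informative Jeffreys prior69. MVN denotes the multivariate normal distribution, and $I_k$ is an identity matrix.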
We introduce an annotation-dependent shrinkage parameter, in addition to the global and local shrinkage parameters used in the literature16, to employ variable degrees of statistical shrinkage for SNPs in different annotation categories42,43,45. Here we only consider one annotation for simplicity, but our model allows incorporating multiple annotations (Supplementary Methods). For an annotation with A categories, we assign an annotation-dependent horseshoe prior to βjk:
$$\beta_{jk} \sim N\left(0, \frac{\sigma_k^2}{N_k}\, \phi\, \psi_j\, \lambda_{f(j),k}\right) \tag{4}$$
Here, $\beta_{jk}$ denotes the effect of SNP $j$ in population $k$, $\phi$ is the global shrinkage parameter shared across all $M$ SNPs and $K$ populations, $\psi_j$ represents the local shrinkage parameter for SNP $j$, $\lambda_{f(j),k}$ denotes the annotation-dependent shrinkage parameter for SNP $j$ in population $k$, and $f: \{1, \ldots, M\} \to \{1, \ldots, A\}$ is a function that maps the $j$-th SNP to its corresponding category $a$ in the annotation. The annotation-dependent shrinkage parameter is shared across SNPs that are in the same annotation category for a given population, but varies between populations to account for population-specific annotation.
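Given this prior and the marginal least squares estimates $\hat{\beta}_k$ obtained from GWAS summary statistics, the posterior mean of SNP effects in population $k$ is
$$E\left(\beta_k \mid \hat{\beta}_k\right) = \left(D_k + T_k^{-1}\right)^{-1} \hat{\beta}_k \tag{5}$$
where $T_k = \mathrm{diag}\left(\phi\,\psi_1\,\lambda_{f(1),k}, \ldots, \phi\,\psi_M\,\lambda_{f(M),k}\right)$ and $D_k$ is the LD matrix for population $k$.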
To provide intuition for annotation-dependent statistical shrinkage, suppose all SNPs are unlinked (i.e., no LD). Then the LD matrix Dk = I and the posterior mean effect for SNP j in population k is
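$$E\left(\beta_{jk} \mid \hat{\beta}_{jk}\right) = \left(1 - \frac{1}{1 + \phi\,\psi_j\,\lambda_{f(j),k}}\right) \hat{\beta}_{jk} \tag{6}$$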
Since SNPs in an important annotation explain more phenotypic variance (λf(j),k tends to be big), the shrinkage factor will be small if the j-th SNP is in an important annotation. Consequently, there is less statistical shrinkage on SNP effects in genomic regions marked by an important annotation.
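The effect of the annotation-dependent parameter can be checked numerically with a small Python sketch. It assumes the posterior mean solves $(D_k + T_k^{-1})\,b = \hat{\beta}_k$ with $T_k$ a diagonal matrix of the products $\phi\,\psi_j\,\lambda_{f(j),k}$, so that without LD the shrinkage factor is $1/(1 + \phi\,\psi_j\,\lambda_{f(j),k})$. The function name and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def posterior_mean(beta_hat, D, phi, psi, lam):
    """Solve (D + T^{-1}) b = beta_hat, with T = diag(phi * psi * lam)."""
    t = phi * psi * lam
    return np.linalg.solve(D + np.diag(1.0 / t), beta_hat)

beta_hat = np.array([0.02, 0.02])   # identical marginal estimates for two SNPs
psi = np.ones(2)                    # same local shrinkage for both SNPs
lam = np.array([1.0, 9.0])          # SNP 2 lies in an "important" annotation category
phi = 1.0

# No-LD case: D is the identity, so each SNP is shrunk independently.
post = posterior_mean(beta_hat, np.eye(2), phi, psi, lam)
shrink = 1.0 / (1.0 + phi * psi * lam)   # closed-form shrinkage factor
```

With these values the two shrinkage factors are 0.5 and 0.1, so the annotated SNP retains 90% of its marginal estimate while the non-annotated SNP retains 50%.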
To perform the full Bayesian model fitting, we assign half-Cauchy priors to the global, local, and annotation-dependent shrinkage parameters as follows:
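$$\phi^{1/2} \sim C^{+}(1), \quad \psi_j^{1/2} \sim C^{+}(1), \quad \lambda_{a,k}^{1/2} \sim C^{+}(1) \tag{7}$$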
where $C^{+}(1)$ is the standard half-Cauchy distribution with the scale parameter equal to 1.
We employ a simple and efficient block Gibbs sampler to fit the PRS model using GWAS summary statistics and an LD reference panel (Supplementary Methods)70. Following Ruan et al.16, we recommend using 1000 × K Markov Chain Monte Carlo (MCMC) iterations with the first 500 × K iterations as burn-in. We use the full Bayesian approach as default, which does not require validation data to tune the model. An alternative strategy is to select the global shrinkage parameter ϕ from {$10^{-6}$, $10^{-4}$, $10^{-2}$, 1} that maximizes the R2 in the validation sample (Supplementary Methods)16. Our method outputs the posterior mean of population-specific SNP effects. PRS for the target cohort is calculated subsequently as the sum of allele counts weighted by posterior effect estimates.
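The final scoring step is a weighted sum and can be written in a few lines of Python. The sketch below assumes a genotype matrix of effect-allele counts (individuals in rows, SNPs in columns) whose columns are aligned with the posterior effect estimates; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def polygenic_score(genotypes, weights):
    """Weighted sum of allele counts.

    genotypes: (n_individuals, n_snps) array of effect-allele counts in {0, 1, 2}
    weights:   (n_snps,) array of posterior mean SNP effects
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return genotypes @ weights

# Toy example with two individuals and three SNPs.
G = np.array([[0, 1, 2],
              [2, 0, 1]])
beta_post = np.array([0.10, -0.05, 0.20])
scores = polygenic_score(G, beta_post)
```

In practice the counted allele must match the effect allele of each weight; a strand or allele flip would reverse the sign of that SNP's contribution.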
Incorporating local genetic correlation annotation in PRS
Below we explain how to incorporate annotations based on local genetic correlation in our PRS model. Without loss of generality, we assume population 1 is the target population. We break down our algorithm into three steps:
Step 1: Obtain annotation information through local genetic correlation analysis
We perform local genetic correlation analysis between population 1 and population k (k = 2, …, K) to identify the top s regions with positive local genetic correlation. We denote this set of regions as Ωk (e.g., when using UKB, BBJ, and PAGE as training GWAS, we ran local genetic correlation analysis between UKB and PAGE, as well as between BBJ and PAGE). We selected s = 1000 in our primary analysis and demonstrated that PRS performance is robust to the choice of s (Supplementary Figs. 12, 13). We also used regions with both positive and negative local genetic correlation as the annotation and demonstrated that the PRS performs better when only positive regions are used (Supplementary Fig. 28).
Step 2: Estimate posterior mean effects for all SNPs
Our annotation-dependent shrinkage procedure is designed based on two key intuitions. First, we expect poor PRS portability when using GWAS from various ancestral populations (e.g., European and African) to predict trait values in a different target population (e.g., East Asian). Therefore, we want to amplify SNP effects that are more portable (i.e., correlated) between each non-target population and the target population. Second, we do not expect any portability issue when the GWAS population and the target population are the same (e.g., using an East Asian GWAS to build PRS for East Asian target samples). Thus, we do not employ any annotation-dependent shrinkage when estimating posterior SNP effects for the target population.
Specifically, when estimating posterior SNP effects for the target population, we let λf(j),k = 1 for all j = 1, 2, …, M and k = 1, …, K. When estimating the posterior SNP effects for a non-target population k (k = 2, …, K), we use λf(j),k = λ1,k if SNP j is not annotated by Ωk, λf(j),k = λ2,k if SNP j is annotated by Ωk, and λf(j),k′ = 1 for k′ ≠ k. We provide an example for the case where K = 3 in the Supplementary Methods.
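The assignment of annotation-dependent shrinkage parameters to SNPs can be sketched as below. This is a simplified illustration under several assumptions: SNPs lie on a single chromosome, regions are given as hypothetical (start, end) base-pair intervals, and the two category-level parameters are passed in as fixed numbers, whereas in the model they are estimated during fitting. All names are illustrative.

```python
def annotation_category(position, regions):
    """Return 2 if the SNP position falls in any annotated region, else 1."""
    return 2 if any(start <= position <= end for start, end in regions) else 1


def annotation_shrinkage(positions, regions, lam_outside, lam_inside, is_target):
    """Per-SNP annotation-dependent shrinkage parameters for one population."""
    if is_target:
        # No annotation-dependent shrinkage for the target population.
        return [1.0] * len(positions)
    lam = {1: lam_outside, 2: lam_inside}
    return [lam[annotation_category(p, regions)] for p in positions]


# Hypothetical SNP positions and one region with positive local genetic correlation.
positions = [100, 250, 900]
regions = [(200, 300)]

non_target = annotation_shrinkage(positions, regions, 0.5, 2.0, is_target=False)
target = annotation_shrinkage(positions, regions, 0.5, 2.0, is_target=True)
print(non_target, target)
```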
Step 3: Linearly combine multiple population-specific PRS
Based on the posterior mean effects of population k obtained in Step 2, we can calculate the population-specific score PRSk. A common practice to combine these population-specific scores is to fit a regression model using the phenotype Y(v) and the K population-specific PRS in an independent validation dataset from the target population:
$$Y^{(v)} = \sum_{k=1}^{K} w_k\,\mathrm{PRS}_k^{(v)} + \varepsilon \tag{8}$$
Here, the superscript v highlights the fact that phenotypes and PRS in this regression need to be obtained from a validation dataset that is different from any data used for GWAS and PRS model training. Instead of fitting a regression in independent samples, we introduce a strategy to obtain the least squares estimates of the regression weights (i.e., ŵ = (ŵ1, …, ŵK)) using GWAS summary statistics. We introduce this approach in the next section. The final X-Wing PRS is then calculated as:
$$\mathrm{PRS}_{\text{X-Wing}} = \sum_{k=1}^{K} \hat{w}_k\,\mathrm{PRS}_k \tag{9}$$
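For reference, the conventional individual-level approach (regressing the phenotype on the population-specific scores in a validation sample) can be sketched as follows. This is a toy illustration with simulated scores, assuming a centered phenotype and no intercept; the helper names are hypothetical and the small Gaussian-elimination solver stands in for any linear algebra library.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_prs_weights(prs, y):
    """Least squares weights for a centered phenotype regressed on K PRS columns."""
    k = len(prs[0])
    gram = [[sum(row[a] * row[c] for row in prs) for c in range(k)] for a in range(k)]
    cross = [sum(row[a] * yi for row, yi in zip(prs, y)) for a in range(k)]
    return solve_linear_system(gram, cross)


def combine_scores(prs, weights):
    """Weighted sum of the population-specific scores for each individual."""
    return [sum(w * s for w, s in zip(weights, row)) for row in prs]


# Toy validation data: two population-specific PRS and a noisy phenotype.
rng = random.Random(0)
prs_v = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y_v = [0.7 * s1 + 0.3 * s2 + rng.gauss(0, 0.5) for s1, s2 in prs_v]

weights = fit_prs_weights(prs_v, y_v)
combined = combine_scores(prs_v, weights)
print(weights)
```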
Combining multiple PRS with GWAS summary statistics
First, we briefly illustrate that we do not need any individual-level data from the validation sample, and that summary statistics are sufficient for computing the least squares estimator of the PRS combination weights. Then, we provide detailed justifications on how to estimate ŵ using only input GWAS data instead of summary statistics from a validation sample. Suppose we have a validation dataset of N(v) individuals. Then ŵ can be estimated as follows:
$$\hat{w} = \left(\mathrm{PRS}^{(v)\mathsf{T}}\,\mathrm{PRS}^{(v)}\right)^{-1}\mathrm{PRS}^{(v)\mathsf{T}}\,Y^{(v)} \tag{10}$$
Here, Y(v) is the phenotype vector and PRS(v) is the N(v) × K matrix of K population-specific scores in this sample. Further, PRS(v) can be written as PRS(v) = X(v)b, where X(v) is the N(v) × M genotype matrix and b is the M × K matrix of SNP effects. For simplicity, we assume Y(v) is centered, X(v) is standardized, and b quantifies standardized SNP effects. We note that PRS(v)ᵀPRS(v)/N(v) quantifies the covariance of the K population-specific PRS, which can be approximated by the sample covariance obtained from a reference panel (e.g., the LD reference of the target population). Therefore, we have
$$\hat{w} \approx \left[\frac{N^{(v)}}{N^{(\mathrm{ref})}}\,\mathrm{PRS}^{(\mathrm{ref})\mathsf{T}}\,\mathrm{PRS}^{(\mathrm{ref})}\right]^{-1} b^{\mathsf{T}} X^{(v)\mathsf{T}}\,Y^{(v)} \tag{11}$$
where X(v)ᵀY(v) can be obtained from the summary statistics of the validation sample (Supplementary Methods) and b is obtained from the PRS training procedure. N(ref) and PRS(ref) denote the sample size and PRS matrix in the reference panel. Taken together, Eq. (11) shows that an LD reference and summary statistics from a validation sample can be used to estimate ŵ. However, summary statistics from a validation cohort are still difficult to obtain in practice, and it is tempting to replace them with the input GWAS used for PRS training. This is not advisable, because reusing the training GWAS to estimate the combination weights is a textbook example of overfitting. This motivates us to use repeated learning (or a similar cross-validation approach; see Supplementary Methods)71,72 to estimate ŵ.
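The idea that summary-level quantities suffice can be checked numerically. The sketch below is a toy illustration under the assumptions stated in the text (centered phenotype, PRS = Xb): it estimates the combination weights from the vector XᵀY of the validation sample and the PRS computed in a reference panel. When the reference panel is the validation sample itself, the result coincides with individual-level least squares. Function names and simulated inputs are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def prs_matrix(x, b):
    """PRS = X b for an N x M genotype matrix and an M x K effect matrix."""
    k = len(b[0])
    return [[sum(xi[j] * b[j][a] for j in range(len(b))) for a in range(k)] for xi in x]


def weights_from_summary(xty_v, n_v, b, x_ref):
    """Combination weights from validation summary statistics and a reference panel."""
    k = len(b[0])
    n_ref = len(x_ref)
    prs_ref = prs_matrix(x_ref, b)
    # PRS covariance is taken from the reference panel and rescaled to N(v).
    gram = [[(n_v / n_ref) * sum(r[a] * r[c] for r in prs_ref) for c in range(k)]
            for a in range(k)]
    # PRS(v)^T Y(v) = b^T X(v)^T Y(v) needs only the summary-level vector X(v)^T Y(v).
    cross = [sum(b[j][a] * xty_v[j] for j in range(len(b))) for a in range(k)]
    return solve_linear_system(gram, cross)


# Toy data in which the reference panel is the validation sample itself.
rng = random.Random(1)
n, m = 60, 5
x = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n)]
b = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(m)]
y = [rng.gauss(0, 1) for _ in range(n)]
xty = [sum(x[i][j] * y[i] for i in range(n)) for j in range(m)]

w_summary = weights_from_summary(xty, n, b, x)
print(w_summary)
```

With an external reference panel the two estimates would differ by the sampling error of the reference PRS covariance.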
Typically, repeated learning (or cross-validation) requires individual-level genotype and phenotype data since it involves sample splitting. Generalizing the technique in our recent work8 and its extension to handle LD33, we introduce a summary statistics-based repeated learning strategy, which mimics individual-level repeated learning but does not need individual-level GWAS data (Supplementary Methods). This approach has three main steps, which we describe below. Since this approach does not involve a separate validation sample, we perform the analysis using the input GWAS from the target population (e.g., BBJ GWAS when East Asian is the target population), the sample size of which is typically large enough to ensure the performance of repeated learning. Without loss of generality, we denote k = 1 for this (target) population.
Step 1: Subsample GWAS summary statistics for training and validation sets
Suppose we divide the full GWAS sample (X1, Y1) into a training set (X1(tr), Y1(tr)) with N(tr) individuals and a validation set (X1(v), Y1(v)) with N(v) individuals. Given the association z-scores from GWAS summary statistics and genotype data from the reference panel, association summary statistics based on the training and validation sets can be sampled as:
[Equation 12]
where X(ref) is a standardized genotype matrix from the reference panel for the target population, N(ref) is the sample size of the reference panel, and g is an N(ref)-dimensional vector with elements drawn from a standard normal distribution (Supplementary Methods).
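One way to picture the subsampling step is the sketch below. It is not the exact sampling formula of the paper, which is given in the equation above and the Supplementary Methods. Instead it rests on explicit assumptions: genotypes and phenotype are standardized so that XᵀY for the full sample is approximately √N times the z-score vector, the training set is a simple random subsample, per-SNP effects are small, and LD is represented by the reference panel genotypes. Under these assumptions the training statistic has mean (N(tr)/N)·XᵀY and a noise term proportional to X(ref)ᵀg. All names are hypothetical.

```python
import math
import random


def subsample_summary_stats(z, n_total, n_train, x_ref, rng):
    """Draw training and validation X^T Y vectors from full-sample z-scores (assumed form)."""
    n_valid = n_total - n_train
    n_ref = len(x_ref)
    m = len(z)
    # Full-sample X^T Y, assuming standardized genotypes and phenotype.
    xty = [math.sqrt(n_total) * zj for zj in z]
    # Correlated noise with covariance proportional to the reference LD matrix.
    g = [rng.gauss(0.0, 1.0) for _ in range(n_ref)]
    scale = math.sqrt(n_train * n_valid / (n_total * n_ref))
    noise = [scale * sum(x_ref[i][j] * g[i] for i in range(n_ref)) for j in range(m)]
    xty_train = [(n_train / n_total) * s + e for s, e in zip(xty, noise)]
    # Training and validation statistics must add up to the full-sample statistic.
    xty_valid = [s - t for s, t in zip(xty, xty_train)]
    return xty_train, xty_valid


# Toy reference panel and z-scores; 75% of samples go to the training set.
rng = random.Random(2)
n_ref, m = 40, 6
x_ref = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n_ref)]
z = [rng.gauss(0, 1) for _ in range(m)]

xty_tr, xty_v = subsample_summary_stats(z, n_total=100000, n_train=75000,
                                        x_ref=x_ref, rng=rng)
print(xty_tr, xty_v)
```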
Step 2: PRS model training
We train our PRS model using the training summary statistics subsampled for the target population in Step 1 and the full GWAS summary statistics (without subsampling) for the other populations. The output of PRS training is an M × K matrix b whose k-th column contains the standardized SNP effects for population k (Supplementary Methods).
Step 3: Estimate the linear combination weights
We then estimate the PRS weights by
$$\hat{w} = \left[\frac{N^{(v)}}{N^{(\mathrm{ref})}}\,\mathrm{PRS}^{(\mathrm{ref})\mathsf{T}}\,\mathrm{PRS}^{(\mathrm{ref})}\right]^{-1} b^{\mathsf{T}} X_1^{(v)\mathsf{T}}\,Y_1^{(v)} \tag{13}$$
where PRS(ref) = X(ref)b denotes the PRS matrix calculated in the reference panel and X1(v)ᵀY1(v) is the subsampled validation summary statistics. We note that when we calculate PRS(ref)ᵀPRS(ref) using the PRS matrix in the reference panel, essentially only the LD matrix is used: PRS(ref)ᵀPRS(ref) = N(ref)bᵀD(ref)b, where D(ref) = X(ref)ᵀX(ref)/N(ref) is the LD matrix. We choose to calculate PRS(ref)ᵀPRS(ref) using the PRS matrix to reduce computational complexity compared to directly using the LD matrix, but one can still estimate ŵ using only the LD matrix in the reference panel (Supplementary Methods). In practice, we force any negative estimates to be 0 and center PRS in the reference panel. We also normalize the PRS weights.
Finally, we perform P-fold repeated learning. The final linear combination weights are the average of the normalized mixing weights across the P folds:
$$\bar{w} = \frac{1}{P}\sum_{p=1}^{P} \tilde{w}^{(p)} \tag{14}$$
where w̃(p) represents the normalized weights in the p-th fold. To avoid overfitting, we used distinct reference panels from the target population for GWAS summary statistics subsampling, PRS model training, and estimating weights for PRS combination. We provide the equally divided reference panels from 1000G phase 3 data for Europeans, East Asians, Africans, Central/South Asians, and admixed Americans to the users. We also present the extensions of our approach to handle tuning parameters in the PRS model, negative mixing weights from least squares, and multicollinearity between PRS in the Supplementary Methods.
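The post-processing of per-fold weights (truncating negative estimates at 0, normalizing, and averaging across folds) can be sketched as follows. The normalization used here, rescaling the weights to sum to 1, is an assumption made for illustration; the per-fold numbers and function names are hypothetical.

```python
def normalize_weights(raw):
    """Set negative estimates to 0 and rescale to sum to 1 (assumed normalization)."""
    clipped = [max(w, 0.0) for w in raw]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("all weights are non-positive; nothing to normalize")
    return [w / total for w in clipped]


def average_over_folds(fold_weights):
    """Element-wise mean of the normalized weights across P folds."""
    p = len(fold_weights)
    k = len(fold_weights[0])
    return [sum(fold[i] for fold in fold_weights) / p for i in range(k)]


# Hypothetical raw least squares estimates from four folds for K = 3 scores.
raw_by_fold = [
    [0.60, -0.10, 0.20],
    [0.55, 0.05, 0.25],
    [0.62, -0.02, 0.18],
    [0.58, 0.01, 0.22],
]

final_weights = average_over_folds([normalize_weights(w) for w in raw_by_fold])
print(final_weights)
```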
Simulations
We used HAPGEN273 to simulate genotypes for 50,000 individuals each of European and East Asian ancestry from population-matched 1000 Genomes Project data. We only included SNPs with MAF greater than 5% on chromosome 22. After removing strand-ambiguous variants, 55,000 SNPs remained in the dataset and were used for subsequent analysis.
First, we carried out simulations to assess the type I error rates of two methods (i.e., X-Wing and PESCA). We generated the effect size of each SNP for the two populations independently (i.e., under the null) following an infinitesimal model, where the per-SNP heritability was fixed as a constant. Trait heritability was set to be the same for the two populations and varied between 0.001 and 0.01. We also compared the two methods in three additional model settings: a heritability enrichment model, the LDAK model48 (in which SNP heritability depends on LD and MAF), and a binary trait scenario. In the heritability enrichment model, 30% of heritability was attributed to 1000 randomly selected SNPs and 70% of heritability to the remaining SNPs. The LDAK model assumes that the effect size of the j-th SNP follows a normal distribution and that the per-SNP heritability is proportional to [fj(1 − fj)]^0.75 uj, where fj is the MAF and uj is the LDAK weight computed by the LDAK software. In the binary trait scenario, we first simulated the continuous liability following the same infinitesimal model as described above, and then assigned the samples with the top 50% liability as cases and the others as controls. We repeated each simulation setting 100 times. The type I error rate was defined as the proportion of simulation repeats in which correlated regions (for X-Wing) or causal SNPs shared by the two populations (for PESCA) were identified.
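The null infinitesimal model and the binary trait scenario can be sketched as below. This is a small toy version, not the simulation pipeline of the study: independent standard normal draws stand in for standardized genotypes (the study used HAPGEN2), and the sample size, SNP count, and function names are hypothetical.

```python
import math
import random
import statistics


def simulate_infinitesimal_effects(n_snps, h2, rng):
    """Draw effects with constant per-SNP heritability h2 / n_snps."""
    sd = math.sqrt(h2 / n_snps)
    return [rng.gauss(0.0, sd) for _ in range(n_snps)]


def simulate_liability(genotypes, effects, h2, rng):
    """Genetic value plus independent noise with variance 1 - h2."""
    noise_sd = math.sqrt(1.0 - h2)
    return [sum(g * b for g, b in zip(row, effects)) + rng.gauss(0.0, noise_sd)
            for row in genotypes]


def dichotomize_top_half(liability):
    """Label individuals above the median liability as cases (1), others as controls (0)."""
    cutoff = statistics.median(liability)
    return [1 if value > cutoff else 0 for value in liability]


rng = random.Random(3)
n_individuals, n_snps, h2 = 200, 50, 0.01
# Stand-in for standardized genotypes of one population.
genotypes = [[rng.gauss(0.0, 1.0) for _ in range(n_snps)] for _ in range(n_individuals)]

# Under the null, the two populations receive independently drawn effects.
effects_pop1 = simulate_infinitesimal_effects(n_snps, h2, rng)
effects_pop2 = simulate_infinitesimal_effects(n_snps, h2, rng)

liability = simulate_liability(genotypes, effects_pop1, h2, rng)
case_status = dichotomize_top_half(liability)
print(sum(case_status), "cases out of", n_individuals)
```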
Next, we compared the statistical power of X-Wing and PESCA under the heritability enrichment model. We randomly selected a genome segment on chromosome 22 spanning 1000 SNPs as the correlated signal region. We attributed 30% of trait heritability to the signal region. We jointly simulated SNP effect sizes in the correlated signal region for the two populations with a correlation of 0.9, and then simulated effect sizes for the rest of the genome independently between populations. Trait heritability was set to be the same for the two populations and varied between 0.001 and 0.01. We also investigated the LDAK model and the binary trait model. Each simulation setting was repeated 100 times. Statistical power was defined as the proportion of simulation repeats in which at least one identified region (for X-Wing) or at least one shared causal SNP (for PESCA) overlapped with the true signal region. We also performed simulations across the whole genome. We simulated genotypes for 50,000 individuals and 831,636 HapMap3 SNPs using the HAPGEN2 software. We simulated two independent traits for the two populations under the infinitesimal model and assessed the type I error rates of the two methods. To compare statistical power under the heritability enrichment model, we randomly selected 50 genome segments, each spanning 1000 SNPs, as the correlated signal regions. We attributed 30% of trait heritability to the signal regions and 70% to the rest of the genome. The correlation of SNP effect sizes in the correlated signal regions was set to 0.9. We further performed simulations to compare the predictive accuracy (measured by R2) of X-Wing PRS with the existing methods PRS-CSx and XPASS (Supplementary Methods).
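The joint simulation of effect sizes with a correlated signal region can be sketched as follows. This is a minimal illustration of the design described above, assuming normally distributed effects, equal heritability in both populations, and a single contiguous region; the function name and the SNP counts in the example are hypothetical.

```python
import math
import random


def simulate_cross_population_effects(n_snps, region, h2, region_share, rho, rng):
    """Draw effect pairs for two populations with correlation rho inside `region`.

    `region` is a (start, end) pair of SNP indices, with `end` exclusive.
    """
    start, end = region
    n_in = end - start
    n_out = n_snps - n_in
    sd_in = math.sqrt(region_share * h2 / n_in)
    sd_out = math.sqrt((1.0 - region_share) * h2 / n_out)
    pop1, pop2 = [], []
    for j in range(n_snps):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if start <= j < end:
            # Correlated pair inside the signal region: corr(beta1, beta2) = rho.
            pop1.append(sd_in * z1)
            pop2.append(sd_in * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
        else:
            # Independent effects outside the signal region.
            pop1.append(sd_out * z1)
            pop2.append(sd_out * z2)
    return pop1, pop2


rng = random.Random(4)
signal_region = (2000, 3000)  # 1000 consecutive SNPs
beta_pop1, beta_pop2 = simulate_cross_population_effects(
    n_snps=5000, region=signal_region, h2=0.005, region_share=0.3, rho=0.9, rng=rng
)
```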
Analysis of GWAS data from UKB, BBJ, and PAGE study
We evaluated the prediction accuracy of X-Wing PRS using 31 traits in East Asians and 13 traits in admixed Americans and Africans. European and East Asian GWAS summary statistics were obtained from UKB and BBJ (see Data availability). Trans-ancestry GWAS summary statistics for 13 traits were obtained from the PAGE study74 (Supplementary Data 5). East Asian and admixed American target samples in UKB were identified based on the Pan-UKB population assignment75. We removed samples already included in the UKB European GWAS. We also used KING76 to infer sample relatedness, and only kept individuals without any relatives of third degree or closer. We further excluded individuals with conflicting genetically inferred and self-reported sex. The final East Asian, admixed American, and African target samples consist of 2683, 749, and 6490 individuals, respectively. We calculated PRS for these samples using the imputed genotype data provided by UKB, restricted to autosomal SNPs with info score > 0.9, MAF > 0.01, missing rate ≤ 0.01, and Hardy–Weinberg equilibrium test p-value ≥ 1.0e-6.
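The variant-level filters listed above can be expressed as a simple predicate. The record layout and field names (`chrom`, `info`, `maf`, `missing_rate`, `hwe_p`) and the example records are hypothetical and serve only to make the thresholds concrete.

```python
def passes_qc(snp):
    """Apply the variant filters listed in the text to one SNP record."""
    return (
        snp["chrom"] in range(1, 23)      # autosomes only
        and snp["info"] > 0.9             # imputation info score
        and snp["maf"] > 0.01             # minor allele frequency
        and snp["missing_rate"] <= 0.01   # genotype missingness
        and snp["hwe_p"] >= 1.0e-6        # Hardy-Weinberg equilibrium test
    )


# Hypothetical SNP records.
snps = [
    {"id": "snp_a", "chrom": 1, "info": 0.98, "maf": 0.20, "missing_rate": 0.001, "hwe_p": 0.5},
    {"id": "snp_b", "chrom": 2, "info": 0.85, "maf": 0.20, "missing_rate": 0.001, "hwe_p": 0.5},
    {"id": "snp_c", "chrom": 3, "info": 0.95, "maf": 0.005, "missing_rate": 0.001, "hwe_p": 0.5},
    {"id": "snp_d", "chrom": 23, "info": 0.99, "maf": 0.30, "missing_rate": 0.0, "hwe_p": 0.9},
    {"id": "snp_e", "chrom": 5, "info": 0.99, "maf": 0.30, "missing_rate": 0.0, "hwe_p": 1e-8},
]

kept = [snp["id"] for snp in snps if passes_qc(snp)]
print(kept)
```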
We applied X-Wing to obtain the annotations based on pairwise local genetic correlation between European, East Asian, and admixed American populations using UKB, BBJ, and PAGE GWAS summary statistics. We annotated SNPs in the top 500, 1000, and 1500 correlated regions and excluded regions with negative correlations. We then incorporated the annotation into our PRS model, using the 1000G phase 3 data provided in Ruan et al.16 as the LD reference panel and the independent LD blocks provided by LDetect77 for the block Gibbs sampler. When the target population is East Asian, we used UKB and BBJ GWAS as training data and the European and East Asian LD reference panels. For the admixed American and African target populations, we used UKB, BBJ, and PAGE GWAS as training data and the European, East Asian, and admixed American LD reference panels, since the PAGE GWAS consists primarily of Hispanic/Latino individuals16. We randomly split the target cohort into two equally sized halves, using one half as the validation dataset to linearly combine population-specific PRS and the other half as the test dataset to evaluate PRS performance. When the PRS model involves model tuning, the validation dataset is also used to select tuning parameters. We used partial R2 averaged across 100 random splits to benchmark the predictive accuracy of different methods, adjusting for age, sex, age², age × sex, age² × sex, and the top 20 genetic principal components. We quantified the gain as the percentage increase in partial R2 for X-Wing over other methods and reported the p-value from a two-sided Wilcoxon signed-rank test to compare their performance. X-Wing uses local genetic correlation annotations based on genome-wide imputed SNPs in the primary analysis but shows almost identical results using annotations based on HapMap3 SNPs (Supplementary Fig. 29).
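The repeated random splitting and the relative gain metric can be sketched as below. This is a minimal scaffold: the `metric` callback is a placeholder for whatever computes partial R2 on the test half after fitting combination weights on the validation half, and all function names are hypothetical.

```python
import random


def random_even_split(sample_ids, rng):
    """Shuffle the samples and return (validation_ids, test_ids) of near-equal size."""
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def mean_metric_over_splits(sample_ids, metric, n_splits=100, seed=0):
    """Average metric(validation_ids, test_ids) over repeated random splits."""
    rng = random.Random(seed)
    values = [metric(*random_even_split(sample_ids, rng)) for _ in range(n_splits)]
    return sum(values) / n_splits


def relative_gain_percent(r2_new, r2_old):
    """Percentage increase of one R2 value over another."""
    return 100.0 * (r2_new - r2_old) / r2_old


# Example with 749 sample indices (the size of the admixed American target sample).
validation_ids, test_ids = random_even_split(range(749), random.Random(0))
print(len(validation_ids), len(test_ids))
```

When the sample size is odd, as in this example, the test half receives the extra individual.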
When the target population is African, we further replaced the admixed American LD reference panel with the European or African LD reference panel and found that using the admixed American LD reference yields better predictive performance than the alternatives (Supplementary Fig. 30).
We implemented 4-fold repeated learning to estimate the PRS combination weights using GWAS summary statistics and our equally divided 1000G reference panel8,78. In each fold, we first subsampled East Asian (or admixed American) summary statistics for 75% of BBJ (or PAGE study) samples as the training set and the remaining 25% as the validation set. We applied X-Wing using the UKB and subsampled 75% BBJ training data (or UKB, BBJ, and 75% simulated PAGE summary statistics) to obtain the posterior mean effects for each population. We then used these posterior mean effects to calculate PRS in the 1000G dataset for East Asian (or admixed American) samples and estimated the linear combination weights. We calculated the average weight values over the four repeats, used these weights to combine population-specific PRS, and compared the prediction accuracy with that of the combined PRS based on individual-level data in the same target population. The weights selected from our repeated learning procedure for 29/31 traits in East Asians fall within the 95% confidence interval of the weights estimated in an independent sample (Supplementary Fig. 31). X-Wing uses 4-fold repeated learning in the primary analysis but shows almost identical results using 10-fold repeated learning (Supplementary Fig. 32). In our software implementation, we allow users to specify the number of folds in repeated learning.
Implementation details of XPASS, PESCA, PolyFun-pred, PolyPred+ and PRS-CSx are described in the Supplementary Methods.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
We thank Drs. Lauren Schmitz and Jason Fletcher for helpful discussions. Q.L. and J.M. are supported by the University of Wisconsin-Madison Office of the Chancellor and the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation (WARF). L.H. acknowledges research support from the National Natural Science Foundation of China (Grant No. 12071243).
Author contributions
J.M., H.G., L.H., and Q.L. conceived and designed the study. J.M. developed the statistical frameworks for incorporating annotation data into multi-ancestry PRS modeling and combining multiple PRS with GWAS summary data. H.G. developed the method for quantifying the local genetic correlation between distinct populations. J.M. and H.G. performed statistical analyses. G.S. assisted in preparing GWAS summary statistics. Z.Z. assisted in implementing summary statistics-based repeated learning. L.H. and Q.L. advised on statistical and genetic issues. J.M., H.G., L.H., and Q.L. wrote the manuscript. All authors contributed to manuscript editing and approved the manuscript.
Peer review
Peer review information
Nature Communications thanks Shing Wan Choi, Zilin Li, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
This study made use of publicly available datasets. This research has been conducted using the UK Biobank Resource under Application Number 42148. Data from the UK Biobank are available by application to all bona fide researchers in the public interest at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. Phase 3 data of the 1000 Genomes Project are publicly available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/; Pan UK Biobank data are publicly available at https://pan.ukbb.broadinstitute.org; UKB GWAS summary statistics data are publicly available at http://www.nealelab.is/uk-biobank; BBJ GWAS summary statistics data are publicly available at http://jenger.riken.jp/en/result; PAGE study GWAS summary statistics data are publicly available at https://www.ebi.ac.uk/gwas/publications/31217584; PolyFun-pred PRS coefficients data are publicly available at http://data.broadinstitute.org/alkesgroup/polypred_results. All data generated during this study are included in this published article and its supplementary information files. X-Wing posterior SNP effect size estimates in this work are publicly available at https://github.com/qlu-lab/X-Wing.
Code availability
X-Wing software is freely available at https://github.com/qlu-lab/X-Wing.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Jiacheng Miao, Hanmin Guo.
These authors jointly supervised this work: Lin Hou, Qiongshi Lu.
Contributor Information
Lin Hou, Email: houl@tsinghua.edu.cn.
Qiongshi Lu, Email: qlu@biostat.wisc.edu.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-023-36544-7.
References
- 1.Tam V, et al. Benefits and limitations of genome-wide association studies. Nat. Rev. Genet. 2019;20:467–484. doi: 10.1038/s41576-019-0127-1. [DOI] [PubMed] [Google Scholar]
- 2.Visscher PM, et al. 10 years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 2017;101:5–22. doi: 10.1016/j.ajhg.2017.06.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Becker J, et al. Resource profile and user guide of the polygenic index repository. Nat. Hum. Behav. 2021;5:1744–1758. doi: 10.1038/s41562-021-01119-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Ma Y, Zhou X. Genetic prediction of complex traits with polygenic scores: A statistical review. Trends Genet. 2021;37:995–1011. doi: 10.1016/j.tig.2021.06.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Miao J, et al. A quantile integral linear model to quantify genetic effects on phenotypic variability. Proc. Natl Acad. Sci. 2022;119:e2212959119. doi: 10.1073/pnas.2212959119. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Wand H, et al. Improving reporting standards for polygenic scores in risk prediction studies. Nature. 2021;591:211–219. doi: 10.1038/s41586-021-03243-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Zhao, Z., Fritsche, L.G., Smith, J.A., Mukherjee, B. & Lee, S. The construction of cross-population polygenic risk scores using transfer learning. Am. J. Hum. Genet.109, 1998–2008 (2022). [DOI] [PMC free article] [PubMed]
- 8.Zhao Z, et al. PUMAS: fine-tuning polygenic risk scores with GWAS summary statistics. Genome Biol. 2021;22:1–19. doi: 10.1186/s13059-021-02479-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Chatterjee N, Shi J, García-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 2016;17:392. doi: 10.1038/nrg.2016.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Torkamani A, Wineinger NE, Topol EJ. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 2018;19:581–590. doi: 10.1038/s41576-018-0018-x. [DOI] [PubMed] [Google Scholar]
- 11.Lewis CM, Vassos E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 2020;12:44. doi: 10.1186/s13073-020-00742-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Martin AR, et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 2017;100:635–649. doi: 10.1016/j.ajhg.2017.03.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Duncan L, et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat. Commun. 2019;10:1–9. doi: 10.1038/s41467-019-11112-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Martin AR, et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 2019;51:584–591. doi: 10.1038/s41588-019-0379-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Privé F, et al. Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. Am. J. Hum. Genet. 2022;109:12–23. doi: 10.1016/j.ajhg.2021.11.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Ruan Y, et al. Improving polygenic prediction in ancestrally diverse populations. Nat. Genet. 2022;54:573–580. doi: 10.1038/s41588-022-01054-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016;538:161–164. doi: 10.1038/538161a. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Gyawali, P.K. et al. Improving genetic risk prediction across diverse population by disentangling ancestry representations. Preprint at arXiv 10.48550/arXiv.2205.04673 (2022). [DOI] [PMC free article] [PubMed]
- 19.Spence, J.P., Sinnott-Armstrong, N., Assimes, T.L. & Pritchard, J.K. A flexible modeling and inference framework for estimating variant effect sizes from GWAS summary statistics. Preprint at bioRxiv 10.1101/2022.04.18.488696 (2022).
- 20.Tian, P. et al. Multiethnic Polygenic Risk Prediction in Diverse Populations through Transfer Learning. Preprint at bioRxiv 10.1101/2022.03.30.486333 (2022). [DOI] [PMC free article] [PubMed]
- 21.Amariuta T, et al. Improving the trans-ancestry portability of polygenic risk scores by prioritizing variants in predicted cell-type-specific regulatory elements. Nat. Genet. 2020;52:1346–1354. doi: 10.1038/s41588-020-00740-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Weissbrod O, et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 2020;52:1355–1363. doi: 10.1038/s41588-020-00735-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Weissbrod O, et al. Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores. Nat. Genet. 2022;54:450–458. doi: 10.1038/s41588-022-01036-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Márquez-Luna C, Loh PR, Consortium SATD. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet. Epidemiol. 2017;41:811–823. doi: 10.1002/gepi.22083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Cai M, et al. A unified framework for cross-population trait prediction by leveraging the genetic correlation of polygenic traits. Am. J. Hum. Genet. 2021;108:632–655. doi: 10.1016/j.ajhg.2021.03.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Xiao, J. et al. XPXP: improving polygenic prediction by cross-population and cross-phenotype analysis. Bioinformatics38, 1947–1955 (2022). [DOI] [PubMed]
- 27.Zhang, H. et al. Novel Methods for Multi-ancestry Polygenic Prediction and their Evaluations in 5.1 Million Individuals of Diverse Ancestry. Preprint at bioRxiv 10.1101/2022.03.24.485519 (2022).
- 28.Finucane HK, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 2015;47:1228. doi: 10.1038/ng.3404. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Hu Y, et al. Joint modeling of genetically correlated diseases and functional annotations increases accuracy of polygenic risk prediction. PLoS Genet. 2017;13:e1006836. doi: 10.1371/journal.pgen.1006836. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Chen T-H, Chatterjee N, Landi MT, Shi J. A penalized regression framework for building polygenic risk models based on summary statistics from genome-wide association studies and incorporating external information. J. Am. Stat. Assoc. 2021;116:133–143. doi: 10.1080/01621459.2020.1764849. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Hu Y, et al. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS Comput Biol. 2017;13:e1005589. doi: 10.1371/journal.pcbi.1005589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Márquez-Luna C, et al. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat. Commun. 2021;12:1–11. doi: 10.1038/s41467-021-25171-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Zhang Q, Privé F, Vilhjálmsson B, Speed D. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nat. Commun. 2021;12:1–9. doi: 10.1038/s41467-021-24485-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Mills MC, Rahal C. The GWAS Diversity Monitor tracks diversity by disease in real time. Nat. Genet. 2020;52:242–243. doi: 10.1038/s41588-020-0580-y. [DOI] [PubMed] [Google Scholar]
- 35.Wang, Y. et al. Global Biobank analyses provide lessons for developing polygenic risk scores across diverse cohorts. Cell Genomics3, 100241 (2023). [DOI] [PMC free article] [PubMed]
- 36.Zhou, W. et al. Global Biobank Meta-analysis Initiative: Powering genetic discovery across human disease. Cell Genomics2, 100192 (2022). [DOI] [PMC free article] [PubMed]
- 37.Conti DV, et al. Trans-ancestry genome-wide association meta-analysis of prostate cancer identifies new susceptibility loci and informs genetic risk prediction. Nat. Genet. 2021;53:65–75. doi: 10.1038/s41588-020-00748-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Guo H, Li JJ, Lu Q, Hou L. Detecting local genetic correlations with scan statistics. Nat. Commun. 2021;12:2033. doi: 10.1038/s41467-021-22334-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Kanai M, et al. Genetic analysis of quantitative traits in the Japanese population links cell types to complex human diseases. Nat. Genet. 2018;50:390–400. doi: 10.1038/s41588-018-0047-6. [DOI] [PubMed] [Google Scholar]
- 41.Wojcik GL, et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570:514–518. doi: 10.1038/s41586-019-1310-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Carvalho, C.M., Polson, N.G. & Scott, J.G. Handling sparsity via the horseshoe. in Artificial Intelligence and Statistics 73–80 (PMLR, 2009).
- 43.Xu, Z., Schmidt, D.F., Makalic, E., Qian, G. & Hopper, J.L. Bayesian Grouped Horseshoe Regression with Application to Additive Models. 229–240 (Springer International Publishing, Cham, 2016).
- 44.Ge T, Chen C-Y, Ni Y, Feng Y-CA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 2019;10:1–10. doi: 10.1038/s41467-019-09718-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Bhadra A, Datta J, Polson NG, Willard B. Default Bayesian analysis with global-local shrinkage priors. Biometrika. 2016;103:955–969. doi: 10.1093/biomet/asw041. [DOI] [Google Scholar]
- 46.Consortium GP. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Shi H, et al. Localizing components of shared transethnic genetic architecture of complex traits from GWAS summary data. Am. J. Hum. Genet. 2020;106:805–817. doi: 10.1016/j.ajhg.2020.04.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Speed D, Cai N, Johnson MR, Nejentsev S, Balding DJ. Reevaluation of SNP heritability in complex human traits. Nat. Genet. 2017;49:986–992. doi: 10.1038/ng.3865. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Chen M-H, et al. Trans-ethnic and ancestry-specific blood-cell genetics in 746,667 individuals from 5 global populations. Cell. 2020;182:1198–1213. e14. doi: 10.1016/j.cell.2020.06.045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Jain D, et al. Genome-wide association of white blood cell counts in Hispanic/Latino Americans: the Hispanic Community Health Study/Study of Latinos. Hum. Mol. Genet. 2017;26:1193–1204. doi: 10.1093/hmg/ddx024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Teslovich TM, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466:707–713. doi: 10.1038/nature09270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Spracklen CN, et al. Association analyses of East Asian individuals and trans-ancestry analyses with European individuals reveal new loci associated with cholesterol and triglyceride levels. Hum. Mol. Genet. 2017;26:1770–1784. doi: 10.1093/hmg/ddx062. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Scott RA, et al. An Expanded Genome-Wide Association Study of Type 2. Diabetes Eur. Diabetes. 2017;66:2888–2902. doi: 10.2337/db16-1253. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Suzuki K, et al. Identification of 28 new susceptibility loci for type 2 diabetes in the Japanese population. Nat. Genet. 2019;51:379–386. doi: 10.1038/s41588-018-0332-4.
- 55.Wainschtein P, et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nat. Genet. 2022;54:263–273. doi: 10.1038/s41588-021-00997-7.
- 56.Li X, et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 2020;52:969–983. doi: 10.1038/s41588-020-0676-4.
- 57.Zhou H, et al. FAVOR: functional annotation of variants online resource and annotator for variation across the human genome. Nucleic Acids Res. 2022;51:D1300–D1311.
- 58.Mostafavi H, et al. Variable prediction accuracy of polygenic scores within an ancestry group. eLife. 2020;9:e48376. doi: 10.7554/eLife.48376.
- 59.Tsai F-Y, Orkin SH. Transcription factor GATA-2 is required for proliferation/survival of early hematopoietic cells and mast cell formation, but not for erythroid and myeloid terminal differentiation. Blood, J. Am. Soc. Hematol. 1997;89:3636–3643.
- 60.Iwasaki H, et al. The order of expression of transcription factors directs hierarchical specification of hematopoietic lineages. Genes Dev. 2006;20:3010–3021. doi: 10.1101/gad.1493506.
- 61.Li Y, Qi X, Liu B, Huang H. The STAT5–GATA2 pathway is critical in basophil and mast cell differentiation and maintenance. J. Immunol. 2015;194:4328–4338. doi: 10.4049/jimmunol.1500018.
- 62.Denburg JA, Silver JE, Abrams JS. Interleukin-5 is a human basophilopoietin: induction of histamine content and basophilic differentiation of HL-60 cells and of peripheral blood basophil-eosinophil progenitors. Blood. 1991;77:1462–1468. doi: 10.1182/blood.V77.7.1462.1462.
- 63.Falcone FH, Haas H, Gibbs BF. The human basophil: a new appreciation of its role in immune responses. Blood, J. Am. Soc. Hematol. 2000;96:4028–4038.
- 64.Dehghan A, et al. Meta-analysis of genome-wide association studies in >80 000 subjects identifies multiple loci for C-reactive protein levels. Circulation. 2011;123:731–738. doi: 10.1161/CIRCULATIONAHA.110.948570.
- 65.Pétrilli V, Dostert C, Muruve DA, Tschopp J. The inflammasome: a danger sensing complex triggering innate immunity. Curr. Opin. Immunol. 2007;19:615–622. doi: 10.1016/j.coi.2007.09.002.
- 66.Afonina IS, Zhong Z, Karin M, Beyaert R. Limiting inflammation—the negative regulation of NF-κB and the NLRP3 inflammasome. Nat. Immunol. 2017;18:861–869. doi: 10.1038/ni.3772.
- 67.Voleti B, Agrawal A. Regulation of basal and induced expression of C-reactive protein through an overlapping element for OCT-1 and NF-κB on the proximal promoter. J. Immunol. 2005;175:3386–3390. doi: 10.4049/jimmunol.175.5.3386.
- 68.Atkinson EG, et al. Tractor uses local ancestry to enable the inclusion of admixed individuals in GWAS and to boost power. Nat. Genet. 2021;53:195–204. doi: 10.1038/s41588-020-00766-y.
- 69.Jeffreys H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A. Math. Phys. Sci. 1946;186:453–461. doi: 10.1098/rspa.1946.0056.
- 70.Makalic E, Schmidt DF. A simple sampler for the horseshoe estimator. IEEE Signal Process. Lett. 2015;23:179–182. doi: 10.1109/LSP.2015.2503725.
- 71.Allen DM. The relationship between variable selection and data agumentation and a method for prediction. Technometrics. 1974;16:125–127. doi: 10.1080/00401706.1974.10489157.
- 72.Bates S, Hastie T, Tibshirani R. Cross-validation: what does it estimate and how well does it do it? arXiv preprint arXiv:2104.00673. 2021.
- 73.Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011;27:2304–2305. doi: 10.1093/bioinformatics/btr341.
- 74.MacArthur J, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2017;45:D896–D901. doi: 10.1093/nar/gkw1133.
- 75.Pan-UKB team. https://pan.ukbb.broadinstitute.org. 2020.
- 76.Manichaikul A, et al. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559.
- 77.Berisa T, Pickrell JK. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016;32:283–285. doi: 10.1093/bioinformatics/btv546.
- 78.Burman P. A Comparative Study of Ordinary Cross-Validation, v-Fold Cross-Validation and the Repeated Learning-Testing Methods. Biometrika. 1989;76:503–514. doi: 10.1093/biomet/76.3.503.
Data Availability Statement
This study made use of publicly available datasets. This research has been conducted using the UK Biobank Resource under Application Number 42148. Data from the UK Biobank are available by application to all bona fide researchers in the public interest at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. Phase 3 data of the 1000 Genomes Project are publicly available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/; Pan UK Biobank data at https://pan.ukbb.broadinstitute.org; UKB GWAS summary statistics at http://www.nealelab.is/uk-biobank; BBJ GWAS summary statistics at http://jenger.riken.jp/en/result; PAGE study GWAS summary statistics at https://www.ebi.ac.uk/gwas/publications/31217584; and PolyFun-pred PRS coefficients at http://data.broadinstitute.org/alkesgroup/polypred_results. All data generated during this study are included in this published article and its supplementary information files. X-Wing posterior SNP effect size estimates in this work are publicly available at https://github.com/qlu-lab/X-Wing.
X-Wing software is freely available at https://github.com/qlu-lab/X-Wing.