Summary
The development of polygenic risk scores (PRSs) has proved useful to stratify the general European population into different risk groups. However, PRSs are less accurate in non-European populations due to genetic differences across different populations. To improve the prediction accuracy in non-European populations, we propose a cross-population analysis framework for PRS construction with both individual-level (XPA) and summary-level (XPASS) GWAS data. By leveraging trans-ancestry genetic correlation, our methods can borrow information from the Biobank-scale European population data to improve risk prediction in the non-European populations. Our framework can also incorporate population-specific effects to further improve construction of PRS. With innovations in data structure and algorithm design, our methods provide a substantial saving in computational time and memory usage. Through comprehensive simulation studies, we show that our framework provides accurate, efficient, and robust PRS construction across a range of genetic architectures. In a Chinese cohort, our methods achieved 7.3%–198.0% accuracy gain for height and 19.5%–313.3% accuracy gain for body mass index (BMI) in terms of predictive R2 compared to existing PRS approaches. We also show that XPA and XPASS can achieve substantial improvement for construction of height PRSs in the African population, suggesting the generality of our framework across global populations.
Keywords: polygenic risk score, cross-population, ancestry, GWAS, UK Biobankcross-population
Introduction
In the past 15 years, genome-wide association studies (GWASs) have been performed on a wide spectrum of complex traits and diseases, providing an unprecedented opportunity to stratify the general population into different risk groups. With the availability of large-scale GWAS data, polygenic risk scores (PRSs) have been constructed to estimate the genetic predisposition of complex phenotypes by collecting contributions of many single-nucleotide polymorphisms (SNPs). The accurate construction of PRSs holds promise in disease risk prediction, personalized healthcare guidance, disease screening, and therapeutic intervention.1 As one example, it was shown that an appropriately constructed PRS can identify 8% of the population with three-fold increased risk of coronary artery disease (CAD), while monogenic mutations with comparable risk can only cover 0.4% of the population.2 In terms of the area under the receiver operator curve (AUC), the PRS’s accuracy in predicting CAD onset can be as high as 0.81 based on 288,978 independent testing participants from the UK Biobank.3 More recently, a significant improvement of AUC (from 0.73 to 0.80) was achieved in glaucoma prediction by incorporating PRSs in the traditional risk prediction model without genetic information.4
Despite the great promise of PRSs, one major limitation for its broader applications is the fact that most GWASs have been performed on samples with European (EUR) ancestry.5, 6, 7, 8, 9, 10, 11 According to the GWAS diversity monitor,8 about 89% of GWAS participants to date are from European ancestry, while less than 8% of the participants are from East Asian (EAS) ancestry and less than 0.45% are from African (AFR) ancestry. Despite the wealth of GWAS findings derived from Europeans, such an unbalanced sample makeup across global populations may exacerbate the disparities in genetic studies of non-Europeans.6 Recent studies have reported that the PRSs derived from European samples are often less accurate when applied to other populations.12,13 It remains unclear how much the genetic discovery from the European population can be transferred to non-European populations.
The challenges of transferable genetic studies arise from three aspects. First, SNPs with biologically important roles in the non-European populations may be neglected in GWASs if they are absent or have very low allele frequencies in Europeans.14,15 Second, the same SNP may have different effect sizes on the same phenotype across different populations,10,14 limiting the extrapolation value of GWAS findings and the GWAS-derived PRS power in non-discovery populations. Third, the linkage disequilibrium (LD) patterns vary across populations,12,16, 17, 18, 19, 20 exacerbating the bias in extrapolating the PRS for risk prediction.
PRSs can be constructed by using either individual-level or summary-level GWAS data. The individual-level methods construct PRSs by directly taking the genotype and phenotype data as their input. Well-known individual-level approaches include best linear unbiased predictor (BLUP),21 LASSO,22 BayesR,23,24 and BayesS.25 These methods have their limitations in construction of PRS in the trans-ancestry setting. First, they are often too time consuming or memory consuming to be applicable for the biobank-scale GWAS data. While the recently proposed snpnet26 method implements LASSO in a memory-efficient manner, it does not provide much gain in computational efficiency. Second, these methods cannot be easily extended to integrate datasets from heterogeneous ethnic backgrounds, since they do not account for different genetic architectures. While the bivariate BLUP (bvBLUP) is useful to reconcile the different genetic effects, the most popular bvBLUP approach provided by the GCTA software (GCTA-bvBLUP)27 is not scalable to the biobank-scale dataset, and thus cannot fully utilize the large-scale datasets to improve prediction performance. Different from individual-level methods, summary-level PRS methods use only summary statistics (e.g., z-scores or marginal effect size estimates) that are widely available for large-scale GWASs to approximate the prediction of individual-level methods. Therefore, they are more flexible, efficient, and easily scalable to well-powered GWAS data. Notable summary-level methods include P+T procedure,28 LDpred,29 and lassosum.30 By taking the advantage of large sample size, existing PRS models have proved informative in predicting genetic risk of European ancestry. However, in view of the imbalanced population composition and the different genetic structures across populations, existing PRS approaches that do not take the heterogeneous genetic architectures into account may not be easily applied in trans-ancestry genetic prediction, leading to sub-optimal prediction accuracy for under-represented populations. The recently proposed MTAG approach31 provides an effective data integration framework to combine GWAS summary data of multiple phenotypes, which has been proved useful to improve power of association mapping or construction of PRSs in the same population. Despite the success of MTAG, its performance in the construction of PRSs across populations remains largely unknown. To generate more accurate PRSs by leveraging trans-ancestry information from large-scale European GWASs, a pioneer approach, XP-BLUP,32 was recently proposed, which generalizes BLUP by introducing an extra variance component to model the effect sizes of significant SNPs derived from well-powered GWASs of the auxiliary population. The underlying assumption is that the significant SNPs from the auxiliary population are more useful for improving PRS accuracy in the under-represented populations. However, as XP-BLUP only incorporates information from significant SNPs, its improvement in prediction accuracy is limited. Therefore, there is a great need for a comprehensive investigation on the transferability of genetic study in the PRS construction among global populations.
With our innovations in data structure and algorithm design, XPA is able to handle bio-bank scale individual-level datasets and achieves the accurate prediction with the computational cost nearly linear to the SNP number and sample size. To demonstrate the effectiveness of our methods, we considered the data collected from about 33,000 Chinese participants through a direct-to-customer platform. For two anthropometric traits, height and body mass index (BMI), we applied XPA to integrate the Chinese training dataset (about 20,000 participants) with the UKBB dataset and evaluated its prediction accuracy on an independent Chinese testing dataset (about 13,000 participants). XPA achieved R2 = 18.92% for height and R2 = 6.06% for BMI. Compared to the runner up, XPA improved the relative prediction accuracy by 7.3% for height and 19.5% for BMI. We also evaluated XPASS and other related methods that only use summary statistics. Given the summary statistics obtained from the same individual-level GWAS data as its input, XPASS is slightly worse than XPA as expected, but it is more broadly applicable when individual-level data are not accessible. Since the ancestry profiles of Japanese and Chinese are close to each other, summary statistics from a larger East Asian GWAS, the BioBank Japan (BBJ) project,33,34 was used as the training data to construct PRSs for testing dataset. By integrating summary data from BBJ with UKBB, XPASS was able to further improve the prediction accuracy for height with R2 = 19.54%, offering 12.7% relative improvement compared to training with Chinese and UKBB summary data. We also demonstrated that our proposed methods was able to improve the construction of PRSs in African population by integrating UKBB.
Material and methods
Method overview: XPA and XPASS
Due to the polygenicity of human complex traits, it is challenging to construct accurate PRSs for an under-studied target population. On one hand, the PRSs constructed using samples from EUR ancestry becomes less accurate when it is applied to non-European samples. On the other hand, the accuracy of PRSs constructed only using samples from the under-represented target population is limited by the sample size. By integrating small-scale or medium-scale data from the target population with existing large-scale data resources from EUR ancestry, our methods can robustly improve the PRS prediction performance. The key idea of our methods is based on the observed substantial genetic correlation that largely remains for the same trait between populations due to the shared genetic basis. Therefore, a large amount of information in the biobank-scale data of EUR ancestry can be utilized for risk prediction in the under-represented ancestry. By taking individual-level GWAS data as input, XPA offers analytic estimate for the SNP effect sizes using shared information across populations. With our innovations in data structure and algorithm design, such as Boolean representation35,36 and stochastic approximation,37 the analytic SNP effect size estimates can be computed efficiently, allowing us to construct PRSs in the target population accurately. Because of the unavailability of individual-level data for many traits, we have extended XPA to XPASS which requires only the GWAS z-scores and SNP correlation matrices from the target and auxiliary populations, where the SNP correlation matrices can be approximated using block-diagonal correlation matrices from a genotype reference panel of the two populations. The computational complexity of XPA and XPASS is approximately linear to the number of SNPs and sample size, making our framework appealing in cross-population risk prediction with biobank-scale data.
XPA statistical framework
Consider a GWAS dataset {G1,Z1,y1} from an under-represented ancestry, where G1 is an n1 × p genotype matrix, Z1 is an n1 × c1 matrix collecting all covariates (e.g., age, sex, and principal components), and y1 is an n1 × 1 phenotype vector. Due to the polygenicity of complex traits, the accuracy of risk prediction is limited by the sample size. Now suppose a biobank-scale dataset {G2,Z2,y2} from European ancestry is also available, where G2 is an n2 × p genotype matrix, Z2 is an n2 × c2 covariate matrix, y2 is an n2 × 1 phenotype vector, and . Since we are mainly interested in improving risk prediction in the under-represented population, we shall regard it as the target population and the biobank-scale data from European ancestry as the auxiliary dataset. To reconcile the difference of allele frequencies in the two populations, XPA assumes that the SNP effect sizes in both the target and auxiliary populations increase as the allele frequencies decrease38 and works with standardized genotype matrices. Specifically, let and denote the j-th column of G1 and G2, respectively. The corresponding column means and standard deviations are given as and , and , respectively. Then we have the corresponding standardized genotype matrices as and , where the j-th column of X1 and X2 is given as and , respectively. In such a way, each column of X1 and X2 has mean 0 and variance 1/p. We relate genotypes and phenotypes using the following linear models:
(Equation 1) |
where and are fixed effects of covariates, and are vectors collecting the SNP effect sizes from the two populations, and and are independent errors. Of note, the two datasets do not share samples because they are from different populations. Therefore, we can assume that the residual vectors and are independent. To model the polygenic effects and their correlation between two populations, we introduce the following probabilistic structures on and :
(Equation 2) |
where and are the variance components characterizing the polygenic effects in the two populations, respectively, ρ is the trans-ancestry genetic correlation of the same trait across population, and is the corresponding covariance. XPA computes the posterior mean of by combining the target dataset with the auxiliary dataset through their genetic correlation, and therefore constructs an improved genetic prediction when the genetic correlation ρ is nonzero. In contrast to XP-BLUP,32 XPA leverages genome-wide information by using all SNPs rather than only the top SNPs from the auxiliary population. While the inclusion of all SNPs from the biobank-scale auxiliary dataset introduces challenges in computation and data storage, we show that they can be properly addressed by the novel data structure and algorithm design in XPA.
XPA+ for capturing population-specific effects
The XPA framework can incorporate population-specific genetic effects to improve prediction performance. To see this, we denote and as matrices collecting all the covariates, and and as the standardized genotype matrices of SNPs with large effects in populations one and two, respectively. By constructing Z1 and Z2 in Equation 1 as and , respectively, we can re-write Equation 1 as:
(Equation 3) |
where and are fixed effects of the pre-selected SNPs, and are fixed effects of covariates, and , . SNPs in and are selected by applying the P+T procedure to the GWAS summary statistics of the target and the auxiliary populations, respectively. Because and in practice, the vectors of fixed effects and can be accurately estimated.
This flexible model structure can accommodate polygenic effects across population and large population-specific effects. On one hand, the probabilistic structures of and capture the polygenic effects that are correlated between populations, allowing the auxiliary samples to be effectively utilized. On the other hand, and capture the large population-specific genetic effects, which further boost the prediction performance. We call this extension of XPA as XPA+. When no SNPs with large population-specific effects are selected, XPA+ is equivalent to XPA. Once the large-effect SNPs have been selected, the parameters can be estimated in the same way as in XPA.
Parameter estimation in XPA
To obtain the posterior mean of , we first need to estimate the unknown parameters . For convenience, we define , , , , and . The marginal distribution of can be obtained by combining Equations 1 and 2 and taking integration over :
(Equation 4) |
where , , , and . To estimate unknown parameters , we get rid of the covariates by multiplying Equation 4 with the projection matrix , which leads to . For convenience, we use to denote MA for any matrix A involved in our notation, hence we have The method of moments (MoM) offers a computationally efficient estimator of by matching the second-order moment based on the criterion of least-squares: . Taking derivatives of the objective function w.r.t and setting them to zero leads to the estimating equations (see Appendix A):
(Equation 5) |
The bottle neck of solving Equation 5 is computing the traces of squared kinship matrices, which has a computational complexity . This computational overhead can become very large when dealing with biobank-scale data. Instead of computing the traces exactly, we apply a stochastic approximation to derive unbiased estimates of the trace terms.37 Given a matrix and a vector of random variables with and , we have the identity . Using this identity, we can construct the following estimates:
(Equation 6) |
where are B random vectors drawn independently from a distribution with zero mean and identity covariance matrix . The stochastic approximation in Equation 6 requires a computational complexity of . In practice, we found that the traces can be effectively approximated with . Replacing , , and by , , and , respectively, we obtained . Then heritablities of the given phenotype in the target and auxiliary populations are estimated as and , respectively. The shared genetic basis between the two populations can be quantified by the estimated trans-ancestry genetic correlation, computed as , which is the key to utilizing information from the biobank-scale EUR data for risk prediction in the target population.
PRS construction in XPA
Given the estimated parameters , the fixed effects in Equation 1 are estimated by:
(Equation 7) |
where .
The posterior mean of can be derived as (see Appendix A):
(Equation 8) |
Obviously, when the genetic correlation (i.e., ), information could be borrowed from the auxiliary population to the target population to improve the posterior mean estimates.
Finally, to obtain the effect size of dosage genotypes , we need to re-scale the posterior mean by , for . When a new observation of genotype is available, the associated PRS can be computed by its inner product with , i.e., . If the covariates of the new observation are also available, we can predict the phenotype at its original scale by .
For XPA +, we note that by constructing and , the effect sizes of the pre-selected SNPs and can be estimated jointly with and in Equation 7, and the posterior mean in the target population can be computed in the same way as in Equation 8. Similarly, the posterior mean should be re-scaled in the same way as in XPA and the estimated effects of large-effect SNPs should be re-scaled by , for . Then, the PRSs associated with the new observation is computed as , where is the sub-vector of corresponding to the SNPs with large effects.
Despite the simple form of Equation 8, it is highly nontrivial to obtain the posterior mean in biobank-scale data due to the challenges in both data storage in computer memory and computation. To boost the computational speed, we solve the large-scale linear system in Equation 8 (i.e., ) using the efficient conjugate gradient (CG) method. Because the classical CG requires storing the whole matrix, which is infeasible for bio-bank scale dataset, we developed a memory-efficient strategy (as described in the following section) for storing standardized genotype matrices, and designed a highly efficient CG algorithm (see Appendix A). Our CG algorithm has the time complexity of , where κ is the condition number of . Because κ is usually small, the CG algorithm offers substantial computational improvement in solving the linear system.
Data storage for large genotype matrices in XPA
In practice, the sample size of the auxiliary dataset can be very large (e.g., for UKBB), producing both computation and data storage problems. To address this difficulty, we apply the memory-efficient Boolean representation proposed in our previous work BOOST35 and the hash table data structure36 to store the standardized genotypes.
Suppose that we are handling a standardized biobank-scale genotype matrix of 400,000 × 3,000,000, the memory usage of storing such a matrix in double-precision floating-point format is 8 bytes bytes Tb. Therefore, directly storing such a large matrix is usually impractical. However, we note that each genotype only takes 4 possible values even after standardization: ,,, and , where is the mean genotype value at SNP j, sj is the standard deviation of the genotype j, and is the value to be filled in for the missing genotypes. For each SNP, we index the standardized genotype values using a Boolean representation. Specifically, we can use 00 to index , 01 to index , 10 to index , and 11 to index . We then save 4 consecutive individual’s genotypes at a time using the 8-digit Boolean representation, which has 28 = 256 combinations. This constructs a hash table of size 256×4, where each cell contains a standardized genotype value in double precision. The total size of this hash table is 256×4×8 bytes = 8,192 bytes. Finally, we need to construct such a hash table for all 3,000,000 SNPs, which amounts to 8,192 bytes ×3,000,000 = 24.576 Gb. In addition to the hash tables, we need to save the Boolean representations for retrieving genotypes, which takes 400,000 × 3,000,000 / 4 = 3 × 1011 bytes Gb. In total, this memory-efficient storing strategy requires only 305 Gb for storing the UKBB genotype matrix.
Constructing PRS using summary statistics by XPASS
We consider the datasets and from the two populations. The vectors and contain the z-scores derived from the two populations, where and , and are the residual variance of regressing on and on , respectively. Following Vilhjálmsson et al.,29 we assume the z-scores are derived from GWAS datasets with phenotype vectors and standardized to have mean of zero and standard deviation of one. The first few PCs of the reference genotypes from the two populations are given in and . Similar to XPA, we first standardize the reference genotype matrices and to obtain the corresponding and that have column means zero and variances 1/p.
In real applications, the individual-level GWAS data may not be easily accessible. To effectively make use of publicly available GWAS summary statistics, we extend the XPA method as XPASS which requires only the z-scores from GWAS results and reference genotypes from the target and auxiliary populations. For XPASS, we consider the vectors and of the z-scores for p SNPs derived from the target and auxiliary populations, respectively, and the m1 × p matrix and m2 × p matrix being the standardized reference genotype matrices from the corresponding populations. With these summary-level data, we show that XPASS approximates the posterior mean of SNP effect sizes in XPA as (see Appendix A):
where denotes the Kronecker product, and are estimates of heritabilities and for the two populations, respectively, is the estimate of co-heritability between two populations, and and are the LD matrices of target and auxiliary populations, respectively. The parameter estimates , , and can be computed using the summary-level datasets (Appendix A).39,40 Again, the information is shared across populations through the genetic correlation. In practice, XPASS takes the heterogeneous LD patterns into account by computing the LD matrices using and from either subsamples of and or external reference genotypes from the two populations. Because the LD between SNPs decreases exponentially with their distance, we approximate the LD matrices using block diagonal matrices by assuming the SNPs between LD blocks are approximately independent.41 The heterogeneous LD patterns result in different LD partitions across populations. Therefore, we build the approximated LD matrices using partitions derived from either the target or the auxiliary population. The application of XPASS to real datasets suggests that our method is insensitive to the partition strategy. To obtain the dosage scale effect size, we compute . When the genotypes of a new sample is available, we can obtain the PRS by .
XPASS+ for capturing large population-specific effects
Similarly to XPA+, we extend XPASS as XPASS+ to incorporate large population-specific effects to improve prediction performance. We denote and as the standardized genotype matrices collecting the columns of and that correspond to the SNPs with large effects, and and as sub-vectors of and corresponding to the SNPs with large effects, respectively. The large genetic effects can be estimated by XPASS+ as (see Appendix A):
(Equation 9) |
where and are the LD matrices of large-effect SNPs and and are the SNP correlation matrices between large-effect SNPs and all SNPs from the two populations, respectively. With the estimated fixed effects given in Equation 9, the posterior means can be computed as:
(Equation 10) |
Finally, to obtain the effect sizes for the dosage genotypes, we re-scale both the posterior mean and the estimated fixed effects by , for and , for , respectively. When a new observation with genotype is available, its PRS can be computed as , where is the vector of dosage genotypes corresponding to the SNPs with large effects.
Simulation design
We conducted a comprehensive simulation study to compare the performance of XPA and XPASS with other PRS approaches for individual-level data and summary data, respectively. For individual-level approaches, we investigated the prediction accuracy of XPA and XPA+ in comparison with three scalable PRS models, including BLUP and bvBLUP implemented in the GCTA,21 LASSO,22 and XP-BLUP.32 The BLUP and bvBLUP were fitted using the GCTA software v.1.93 (GCTA-BLUP and GCTA-bvBLUP). The LASSO was fitted using the R package glmnetPlus. For single-population-base approaches, GCTA-BLUP and LASSO, we trained them with either only the target dataset or only the auxiliary dataset. In addition, we trained GCTA-BLUP with the combined dataset obtained by directly merging the target and the auxiliary data (GCTA-BLUP-combine). For GCTA-bvBLUP, we followed the instruction from the GCTA forum for integrating two independent samples. For XP-BLUP, we considered 6 settings of p value threshold for selecting candidate SNPs from the auxiliary GWAS results: 5 × 10−6, 1 × 10−6, 5 × 10−7, 1 × 10−7, 5 × 10−8, 1 × 10−8. We revised the original XP-BLUP script to allow for the support of multi-threading computation. For XPA+, we selected the large-effect SNPs by applying the P+T procedure to the target dataset with LD threshold r2 = 0.1 and p value threshold 1 × 10−6.
For summary-level approaches, we compared XPASS and XPASS+ with three alternative summary-level PRS models: P+T procedure, LDpred-inf,29 and lassosum.30 As the performance of non-infinitesimal LDpred is often similar to lassosum,30,42 we only considered LDpred-inf, a special case of LDpred with closed form solution. The LDpred-inf was computed using the ldpred software v.1.0.11 with LD radius set at the recommended value 300,000/3,000 = 100. For P+T procedure, we set the region size as 1,000 kb and the LD threshold as 0.1 and considered 10 p value thresholds according to a previous study:43 5 × 10−8, 1 × 10−6, 1 × 10−4, 0.001, 0.01, 0.05, 0.1, 0.2, 0.5, and 1. We used R package ieugwasr to compute effect sizes in P+T procedure. The lassosum was fitted using the R package v.0.4.4 with default tuning parameter settings. Because these summary-level methods cannot handle multi-ancestry datasets, we trained these models with either only the target dataset or only the auxiliary dataset. To assess the prediction utility of MTAG, we applied MTAG to combine the datasets from the target and auxiliary populations, and then applied LDpred-inf to construct PRS (MTAG+LDpred-inf). For XPASS+, we selected the large-effect SNPs by applying the P+T procedure to the target dataset with LD threshold r2 = 0.1 and p value threshold 1 × 10−6.
To mimic realistic LD patterns, we used the genotypes from the Chinese and UKBB samples to generate the target and auxiliary individuals, respectively. For the target population, 5,000 samples were randomly drawn from the Chinese dataset. For the auxiliary population, we explored seven different sample sizes using random samples from the UKBB dataset: 0, 5,000, 10,000, 30,000, 50,000, 70,000, and 90,000. We included 300,000 SNPs in total by selecting the first 30,000 SNPs from each of chromosomes 1 to 10. Given these SNPs, we simulated their effect sizes with heritability and generated the phenotypes in both populations. To investigate a wide spectrum of genetic architectures, we varied the proportion of non-zero genetic effects and the genetic correlation between the two populations. Specifically, we set the proportion of non-zero genetic effects to be 0.9, 0.01, or 0.001, corresponding to the highly polygenic scenario, the moderately sparse scenario, and the sparse scenario, respectively. We considered three settings of the overall genetic correlation ρ: 0, 0.4, and 0.8, corresponding to no, moderate, and high genetic correlation, respectively. As the effect sizes may not be correlated for all SNPs, we generated the nonzero effects by simulating 80% of them from the bi-variate normal distribution and the rest from two independent normal distributions and for populations one and two, respectively, where p is the number of nonzero effects. By the combinatorial configurations of the proportion of non-zero effects and genetic correlation, nine scenarios were considered in our analysis. To evaluate the prediction performance, we sampled 3,000 additional individuals from the Chinese dataset serving as the test set of the target population. For each simulation setting, we computed the averaged prediction R2 from 10 replications.
Among the individual-level methods considered, XPA, GCTA-BLUP, GCTA-bvBLUP, and XP-BLUP support multi-threading computation. To examine the computational and memory efficiency of these methods, we further evaluated their CPU time and memory usage when different numbers of SNPs were included in the model: 100,000, 200,000, and 300,000. The analyses were performed with 16 threads on the platform of Intel Xeon Gold 6152 CPU.
Application of XPA and XPASS to predict height and BMI in the Chinese population
Following the simulation studies, we applied XPA and XPASS to construct PRSs for height and BMI in the Chinese population by integrating European samples from UKBB and Asian samples from Chinese cohort and BBJ. In the individual-level PRS analysis, we split the Chinese data into the training and testing sets and only included the SNPs overlapping with UKBB. For height, we used 21,069 samples for training and held out 11,852 samples for testing. After quality control and overlapping, 3,776,575 SNPs were used to fit the model. For BMI, we used 18,575 samples for training and 10,572 for testing. After quality control and overlapping, 3,777,871 SNPs were used to fit the model. For Chinese population, we included age, sex, and first 10 principal components as covariates. For UKBB, we used the top 20 principal components, age, squared age, sex, genotyping arrays, and sequencing platforms as covariates.
To benchmark the performance of XPA and XPA+ with existing approaches, we compared the predictive R2 of XPA and XPA+ with six other PRS models. Four of the six methods are designed only for single population analysis, including BLUP, which serves as a baseline model; snpnet, a memory-efficient LASSO implementation for large-scale genetic prediction based on R packages glmnetPlus and glmnet;26 BayesR, a hierarchical Bayesian mixture model;23,24 and BayesS, which accounts for the impact of natural selection.25 We trained all these four models on the Chinese dataset, and additionally trained BLUP on the UKBB dataset using our efficient implementation. The fifth method, GCTA-bvBLUP, was trained with all Chinese samples and 150K subsamples randomly drawn from the UKBB dataset because the large memory requirement of GCTA-bvBLUP makes it infeasible to include more UKBB samples (see the results section for details). We also included a recently proposed approach for cross-population prediction,32 XP-BLUP, in our comparison. We implemented the conventional BLUP method in our XPA software, making it more efficient and scalable to biobank-scale data. Since parameter-tuning is required in snpnet, we randomly took one-third of training samples as validation set and fitted the LASSO model on the rest of the training samples. The GCTB v2.0 was used to fit the BayesR and BayesS models with 25,000 MCMC iterations in total and 5,000 burn-in iterations. We set the initial value of heritability at 0.3 for height and 0.15 for BMI. In BayesS, we set the initial value of the proportion of non-zero effects at 0.05 for both height and BMI. The XP-BLUP requires a list of candidate SNPs selected from the UKBB GWAS results by taking a threshold of p values. We considered eight thresholds for selecting SNPs: 5 × 10−6, 1 × 10−6, 5 × 10−7, 1 × 10−7, 5 × 10−8, 1 × 10−8, 5 × 10−9, and 1 × 10−9. After tuning, the optimal thresholds were found at 1 × 10−8 and 5 × 10−7 for height and BMI, respectively. For XPA+, we set the LD threshold at r2 = 0.1 and the p value threshold at 5−8.
In the summary-level PRS analysis, we obtained the summary statistics of the Chinese dataset and UKBB using the BOLT-LMM software36 and additionally included summary statistics from BBJ (∼170,000 Japanese samples)33,34 as an alternative training data from the target population. We took the intersection of SNPs in the Chinese dataset, UKBB and BBJ, leading to 3,621,504 SNPs to be included in height and 3,562,502 SNPs to be included in BMI, respectively. We first randomly sub-sampled 2,000 individuals form both UKBB and the Chinese dataset as the LD reference panels for the two populations.
Here we mainly compared XPASS and XPASS+ with LDpred and P+T because other related methods with different assumptions on effect sizes, such as SBayesR,43 are expected to have very minor improvement, as shown in the aforementioned individual-level analysis. Because LDpred and P+T were developed for single-population analysis, we considered two ways for training the PRS models. The first way was to train the models separately using the Chinese cohort, BBJ, or UKBB. The second way was to first combine the target and auxiliary datasets using MTAG and then train the models with the combined datasets. When MTAG was applied, a covariance matrix of the estimation error should be first constructed and provided as an input.31 Because the two datasets were from different populations, we applied LD score regression to estimate the intercepts of the Chinese cohort and UKBB summary datasets with their corresponding reference genotypes, respectively. After that, the estimated intercepts were used to construct the diagonal elements of the covariance matrix. The off-diagonal elements of the covariance matrix were set to zero because there was no sample overlap between populations. For XPASS and XPASS+, we considered two configurations of the training sets, i.e., Chinese + UKBB and BBJ + UKBB. Assuming only the SNPs within the same LD block are correlated, XPASS approximates the LD matrices by partitioning the genome into nearly independent blocks. Because the LD block partition is not aligned in EAS and EUR,41 we used the LD block partition derived from both EAS and EUR to construct PRS and then evaluated the sensitivity of PRS to the two of LD block partition strategies.
Because both LDpred and P+T have tuning parameters, we considered a number of parameter settings and determine the optimal values by evaluating prediction performance on the test set. For LDpred, we considered nine settings of the proportion of non-zero effects: 1 × 10−4, 5 × 10−4, 1 × 10−3, 5 × 10−3, 1 × 10−2, 5 × 10−2, 1 × 10−1, 5 × 10−1, and 1. For height, the optimal values turned out to be 5 × 10−2 in Chinese, 1 in BBJ, 1 in UKBB, 10−1 in MTAG-Chinese, and 1 in MTAG-UKBB (Figures S25 and S29). For BMI, the optimal values were 1 × 10−3 in Chinese, 5 × 10−1 in BBJ, 1 in UKBB, 5 × 10−1 in MTAG-Chinese, and 1 in MTAG-UKBB (Figures S26 and S30).
For P+T procedure, we set the LD threshold at r2 = 0.1 and considered ten settings of the proportion of p value threshold: 5 × 10−8, 1 × 10−6, 1 × 10−4, 1 × 10−3, 1 × 10−2, 5 × 10−2, 1 × 10−1, 2 × 10−1, 5 × 10−1, and 1. For height, the optimal values turned out to be 1 × 10−4 in Chinese, 1 × 10−3 in BBJ, 1 × 10−4 in UKBB, 5 × 10−2 in MTAG-Chinese, and 1 × 10−2 in MTAG-UKBB (Figures S25 and S29). For BMI, the optimal values were 1 × 10−6 in Chinese, 5 × 10−2 in BBJ, 1 × 10−2 in UKBB, 1 × 10−1 in MTAG-Chinese, and 1 × 10−2 in MTAG-UKBB (Figures S26 and S30). Since the parameter tuning process involved the testing data, the performance of LDpred and P+T reported here could be slightly optimistic.
When XPASS+ was applied, we set the LD threshold r2 = 0.1 and varied the p value threshold at {10−5,10−6,10−7,10−8} to include large population-specific effects. The selected SNPs are treated as covariates only in the target population. As shown in Figure S33, the prediction performance of XPASS+ was insensitive to the p value threshold. Therefore, we reported the results obtained with p value threshold at 10−6 and used 10−6 as the default setting in the XPASS software.
We used the prediction R2 to examine the prediction performance. Given a test set , we used the predictive R2 to evaluate the performance of PRS. The PRS for the testing samples is constructed by , where is the posterior mean of the effect sizes on the dosage genotypes. Then, the predictive R2 of PRS was defined as the squared correlation between PRSnew and the residual obtained by regressing on , denoted as :
When the covariates were taken into account, XPA generated prediction on the original scale of phenotype: . In this case, we evaluated the R2 by computing the squared correlation between and :
Collection, genotyping, and imputation of Chinese sample
To evaluate the prediction performance of our framework, we have collected genotypes of Chinese individuals from the WeGene platform and participants have signed the consent form. The study was reviewed and approved by the Committee on Research Practices of HKUST in strict compliance with regulations regarding ethical considerations and personal data protection. To comply with the regulations of the Human Genetic Resources Administration of China (HGRAC), all Chinese genotypic and phenotypic data were processed and analyzed in a server located in Shenzhen, China. Researchers who request access to the summary statistics from the Chinese samples must get permission from Ministry of Science and Technology of the People’s Republic of China.
DNA extraction and genotyping were performed on saliva samples. A total of 35,908 Chinese participants were genotyped on the Illumina or Affymetrix platforms. Among all participants, 21,830 were genotyped on the Affymetrix WeGene V1 Arrays covering 596,744 SNPs at the WeGene genotyping center, Shenzhen. The WeGene V1 Array optimized the identification of all known paternal and maternal lineages through adding EAS-relevant 18,963 Y chromosome and 4,448 mtDNA phylogenetic SNPs.44 The remaining 14,078 individuals were genotyped on Illumina WeGene V2 Arrays with a total of about 700,000 SNPs at the WeGene genotyping center, Shenzhen. The WeGene V2 array was designed based on the Illumina Infinium Global Screening Array.
Imputation of unobserved genetic variants was performed using the 1000 Genomes Project Phase 3 reference panel.15 All datasets were phased by SHAPEIT245 and imputed by IMPUTE246 using regular steps and parameters. SNP-level (INFO score0.5) and genotype-per-participant-level (genotype probability0.9) filters were used to exclude poorly imputed variants.
Sample description of the Chinese cohort
The 35,908 genotyped Chinese participants in the Chinese cohort cover 33 out of 34 administrative divisions and 43 out of 56 ethnic groups in China, with the majority of samples from the Southeastern area (Figure S2). Among the 28,796 participants with self-reported ethnic information, there are 26,953 (93.6%) Han Chinese, 440 (1.5%) Manchu, 385 (1.3%) Hui, 201 Mongols (0.7%), and 817 (2.8%) from other minority ethnic groups.
To explore the population structure of the Chinese cohort, we first combined genotypes data from the Chinese cohort and the 1000 Genomes Project and performed a principal component analysis (PCA). As shown in Figure S2, the Chinese samples are overlapped with EAS from the 1000 Genomes Project, while its variance is larger because it is comprised of both Han Chinese and multiple minority groups. We then carried out PCA within Chinese to study differentiation across minority ethnic groups in China. Since the Han Chinese dominates the sample makeup, directly applying PCA to all samples fails to capture the variation among minority groups. Therefore, we first obtained the PC loadings from a subset of samples that included 500 randomly selected Han Chinese (roughly matching the number of Manchu people) and then computed the PC scores of all samples using these PC loadings (Figure S2). The first two principal components represent the latitudinal and longitudinal differentiation behind Chinese population structure. The Han Chinese differs substantially along the latitudinal gradient, while less differentiation is found along the longitude direction, which is consistent with previous reports.47, 48, 49 Among the minority ethnic groups, Manchus are genetically closest to the Han Chinese in the northeastern China. In the same area, Koreans in China are more distant to Han Chinese compared to Manchus. The most differentiated groups are Mongols, Hui, and Tibetan from the northwestern area. In the South China, Zhuang people also differ substantially from Han Chinese.
We considered two anthropometric phenotypes in our analysis, height and BMI. After quality control of genotype and phenotype data (see supplementary note), there were 32,921 samples with self-reported height and 29,147 samples with BMI computed from height and weight. For both phenotypes, there are slightly fewer males than females (15,406 compared to 17,515 for height and 13,721 compared to 15,426 for BMI). The overall distributions of height and BMI are summarized in Figure S1. Regarding the geometric distribution, the Northern Chinese are generally higher in both males and females (Figures S2 and S4). A similar latitudinal differentiation is observed in BMI, where the individuals from the north have higher obesity indices than those from the south (Figures S2 and S6). Besides, the older people are generally shorter and tend to have higher BMI (Figures S5 and S7).
Sample description of UKBB data
The UKBB genotype and phenotype data were obtained from the UK Biobank Access Management System (see web resources). We used the measured height phenotype extracted from Data Field 50 and the BMI value constructed from height and weight from Data Field 21001. To restrict the samples within EUR ancestry, we identified the individuals with self-reported ethnic background as “white” in Data Field 21000 and included only these samples for analysis. After phenotype and genotype quality control (see supplemental note), the UKBB data contain genotype information of 3,776,575 SNPs for 429,312 individuals in the analysis of height and 3,777,818 SNPs for 428,864 individuals in the analysis of BMI. We obtained the summary statistics of the UKBB datasets using the BOLT-LMM software36 with age, squared age, sex, and the first 20 genomic PCs as covariates. The same set of covariates was included in the construction of PRS when applying XPA.
Sample description of GWAS data from African population
To demonstrate the generality of our framework to other populations, we applied XPA and XPASS to two GWAS datasets comprising thousands of samples from African ancestry: Institute for Personalized Medicine (IPM) BioMe biobank10 (phs000925.v1.p1) and UKBB. IPM BioMe biobank aimed to study clinical care processes, with 28% samples of which were African Americans. The African participants from the IPM BioMe cohort were genotyped on Illumina Human OmniExpressExome Chip with a total of about 500K SNPs that passed an initial quality control (QC) process. Imputation of unobserved genetic variants was performed using the 1000 Genomes Project Phase 3 reference panel. All datasets were phased by SHAPEIT2 and imputed by Minimac446,50 using regular steps and default parameters.
After removing ancestry and phenotype outliers and samples with ambiguous sex (see supplemental note), 5,491 confirmed African participants from IPM were included in our study. For UKBB, 3,323 participants with self-reported African ancestry were also included, after the same procedure of sample QC, 2,931 samples remained for our analysis. By projecting the genotypes of IPM and UKBB samples to the PC coordinates derived from the 1000 Genomes Project, we found that the Africans from both datasets overlapped with the AFR samples from the 1000 Genomes Project (Figure S3). We observed similar phenotypic distributions for height between IPM and UKBB Africans, before or after we regressed the covariates (e.g., age, squared age, sex, first 20 genomic PCs) out.
To evaluate the performance of PRS approaches, we combined the African samples from IPM and UKBB, leading to a total of 8,422 African samples with 2,690,737 overlapping SNPs. Then, we randomly selected 1K samples as testing data and used the remaining 7.4K samples as training data. We obtained the summary statistics of the training set using the BOLT-LMM software36 with age, squared age, sex, and the first 20 genomic PCs as covariates. The same set of covariates was included in the construction of PRSs when applying XPA.
Results
Simulation study
With data simulated as described above, we investigated the prediction accuracy of XPA in comparison with alternative methods. To gain some intuition, we first considered the single-population-based methods, such as BLUP and LASSO. These predictive models can be trained using samples from either the target population or the auxiliary population. The performance of BLUP and LASSO trained on the target population can serve as the reference results (dashed lines in Figure 1A). When the genetic correlation was zero, the prediction accuracy of BLUP and LASSO trained on the auxiliary dataset could not be improved regardless of the auxiliary sample size. When the genetic correlation became moderate () or strong (), the prediction accuracy of BLUP and LASSO trained on the auxiliary dataset steadily improved as the auxiliary sample size increased. It is worthwhile to note that BLUP and LASSO trained on the auxiliary dataset were more accurate than those trained on the target population when the correlation was strong and the auxiliary sample size was large.
Among the methods that combine both datasets, GCTA-BLUP-combine had the worst performance in most settings. As expected, when there was no genetic correlation, the inclusion of auxiliary dataset led to worse performance than using only the target dataset. When the genetic correlation was nonzero, the predictive R2 first dropped and then increased with increasing auxiliary sample size. When the auxiliary sample size was large enough (e.g., n2 > 30K), the performance of GCTA-BLUP-combine gradually converged to GCTA-BLUP. For GCTA-bvBLUP, it had similar predictive R2 with XPA when the auxiliary sample size was comparable with the target sample size (i.e., ). However, GCTA-bvBLUP does not account for the allele frequency difference between two populations. Therefore, as we can expect, its prediction accuracy became worse than XPA when the auxiliary sample size was large (i.e., n2 > 30K). Between the cross-population methods XPA and XP-BLUP, XPA was clearly the overall winner, as shown in Figure 1A. When the genetic correlation became moderate () or strong (), the prediction accuracy of XPA steadily improved as the sample size of auxiliary population increased, suggesting that XPA was able to leverage the trans-ancestry genetic correlation for constructing PRSs. When either the genetic correlation or the auxiliary sample size approached zero, XPA reduced to BLUP which was trained on the target dataset, as no information could be borrowed from the auxiliary population in these cases. For XP-BLUP, it assumes that the top SNPs from the auxiliary dataset are more likely to have non-zero effects in the target population, but it does not model the correlation of effect sizes between populations. Therefore, it was worse than XPA in the polygenic and moderately sparse settings, but had better performance in the setting of highly sparse effects. However, in the highly sparse setting, XPA+ extension achieved comparable performance with XP-BLUP by incorporating large population-specific effects. We also noted that the causal SNPs between the target and auxiliary populations largely overlapped in our simulation setting, which was preferred by XP-BLUP. We expect the performance of XPA+ will be better than XP-BLUP when the causal SNPs with large effects are not largely overlapping between the two populations.
Among methods with the support of multi-threading computation (Figure 1B), XP-BLUP had the lowest computational cost and memory usage, as its computation was mostly based on the target dataset. While GCTA-BLUP, GCTA-bvBLUP, and XPA all had increased computational cost as the scale of the auxiliary dataset became larger, the CPU time of XPA was nearly linear in the sample size, providing higher efficiency than GCTA-BLUP and GCTA-bvBLUP when the auxiliary sample size was larger than 50,000. In addition, because we have adopted a memory-efficient strategy in storing genotypes, the memory cost of XPA was also linear in the auxiliary sample size and the number of SNPs. However, the memory cost of GCTA-BLUP and GCTA-bvBLUP increased quadratically with the sample size, limiting its usage in the biobank-scale data analysis.
Next, we compared XPA with XPASS which takes the summary-level data as its input. When the auxiliary data are not available, XPA and XPASS reduce to their special cases, BLUP and LDpred-inf, respectively. When the auxiliary sample size increased in our simulation, we measured the relative improvement of individual-level and summary-level approaches as the difference of prediction R2 between XPA and BLUP, and that of XPASS and LDpred-inf, respectively. As shown in Figure 2B, XPA was slightly better than XPASS, suggesting that XPASS can provide comparable prediction improvement using only summary-level datasets. The advantage of XPA became more apparent as the auxiliary sample size increased, suggesting the importance of developing methods to handle biobank-scale individual-level data. These observations highlight the value of XPA and XPASS in different practical scenarios: XPASS provides well-powered PRSs using cross-population information based on summary-level datasets while XPA can achieve higher accuracy with the availability of individual-level datasets.
As a promising approximation to XPA, XPASS also outperformed existing summary-level PRS models. As shown in Figure 2, XPASS had nearly the same performance with LDpred-inf when either the genetic correlation was zero or the auxiliary dataset was unavailable, consistent with previous observations for individual-level methods. With the availability of the auxiliary dataset and non-zero genetic correlation, XPASS achieved the highest prediction R2 among all compared methods in the polygenic and moderately sparse settings. In the highly sparse setting, its extension XPASS+ had the best performance when the genetic correlation was high () and was comparable to alternative approaches with smaller genetic correlation ( and 0.4). The prediction accuracy of both XPASS and XPASS+ increased with larger auxiliary sample size and stronger genetic correlation. Of note, the improvement of XPASS had a very similar pattern in both the sparse scenario and the polygenic scenario, suggesting the robustness of XPASS to different genetic architectures.51 For models trained on the auxiliary dataset, P+T procedure had the lowest prediction accuracy, followed by LDpred-inf. Because lassosum adopts an elastic net model, it had comparable R2 to LDpred-inf in the sparse scenario and outperformed LDpred-inf with larger sample size.
Construction of PRS for the Chinese population by XPA using the individual-level data from the Chinese cohort and UKBB
To study the performance of XPA and XPASS in real applications, we applied our approaches to construct PRSs for height and BMI in the Chinese population by integrating Chinese and UKBB data. We first investigate the performance of XPA using the individual-level data from Chinese and UKBB. To benchmark the performance of XPA with existing approaches, we compared the predictive R2 of XPA with five other PRS models. XPA estimated the genetic correlations between Chinese and UKBB as 0.71 for height ( with SE = 1.8%, with SE = 0.7%) and 0.66 for BMI ( with SE = 1.7%, with SE = 0.1%), suggesting substantial genetic sharing between the two populations. As summarized in Figures 3A and 3D, regarding the overall performance, XPA had the highest accuracy for both height and BMI, with a substantial improvement compared to the baseline BLUP model trained on the Chinese data only. XPA outperformed the runner up (BLUP trained on UKBB) by 7.3% and 19.5% improvements for height and BMI, respectively. In contrast, both XP-BLUP and GCTA-bvBLUP, the two methods that integrated Chinese and UKBB datasets, had lower predictive R2 than XPA. XP-BLUP was only slightly better than the methods trained on the Chinese data, but inferior to both XPA and BLUP trained on the UKBB only. This is because XP-BLUP only includes information from the significant SNPs in UKBB while XPA can borrow information from UKBB across the whole genome. Due to the memory bottleneck, GCTA-bvBLUP failed to include all UKBB samples, and so could not further improve the prediction accuracy (see Figure S23 for details of comparing the memory usage and computational time). GCTA-bvBLUP took 75.8 h and required 1.07 Tb to integrate 150K UKBB samples with all Chinese samples and achieved its best performance for constructing PRSs with predictive R2 = 15.74%. In contrast, XPA used only 54.5 h (including 9 h for loading data, 3 h for estimating variance components, and 42.5 h for computing the posterior means and estimating fixed effects) and 385 Gb to analyze all Chinese and UKBB samples while achieving 20.2% and 46.4% improvement compared to GCTA-bvBLUP for height and BMI, respectively. When the population-specific large-effect SNPs were utilized by XPA+, we found the predictive R2 further increased to 19.72% in height and remained roughly the same with XPA in BMI.
In clinical applications, it is critical to stratify individuals into different genetic risk groups. In our discussion, we use ‘risk’ as a generic term to describe a trait. To measure the ability of risk stratification, we compared the observed phenotypic values of individuals from the top PRS quantiles with those from the reference group (i.e., 40%–50% quantile in our analysis). For both height and BMI, XPA was most effective in screening individuals with high genetic risk (Figures 3B and 3E). Compared to their reference groups, the individuals in the top 1% of PRS were 8.04 cm (SE = 0.54 cm) taller in height and 2.38 kg/m2 larger in BMI. When XPA+ was applied, the stratification ability was further improved, with 8.27 cm increased height and 2.67 kg/m2 larger BMI for the individuals in the top 1% of PRSs. By incorporating covariates, such as age, sex, and first 10 principal components, XPA can construct prediction on the original phenotypic scale. Compared to the risk prediction model using these covariates only, XPA achieved a three-fold improvement of R2 in height and two-fold improvement of R2 in BMI for both males and females. In terms of the square root of mean squared error (), XPA achieved a 10% improvement in height and 4% improvement in BMI for both males and females (Figures 3C and 3F), respectively. We also found XPA can improve the stratification ability of PRSs for different ethnic groups in the Chinese population (see supplemental note and Figure S24). These results suggest that the prediction accuracy in a target population can be greatly improved by XPA which can effectively incorporates trans-ancestry genetic information.
Construction of PRSs for the Chinese population by XPASS using the summary-level data from trans-ancestry groups
When the individual-level GWAS data are not accessible, XPASS can take summary statistics as its input to construct PRSs. As estimated by XPASS, the genetic correlation between Chinese and UKBB was 0.78 for height ( with SE = 2.4%, with SE = 2.3%) and 0.68 for BMI ( with SE = 1.9%, with SE = 0.7%), comparable to those estimated by XPA. For the genetic correlations between BBJ and UKBB, the estimated genetic correlation computed by XPASS was 0.71 for height ( with SE = 1.9%) and 0.68 for BMI ( with SE = 0.5%). Given the substantial genetic correlations, XPASS could effectively leverage the UKBB summary data to improve prediction accuracy in the Chinese population. As summarized in Figures 4A and 4D, the PRSs derived by XPASS largely outperformed LDpred and P+T trained with the training set from a single population. When XPASS was trained on Chinese+UKBB, the predictive R2 for height increased from 9.65% (LDpred trained on the Chinese data only) to 17.6%. By either applying XPASS+ or using BBJ+UKBB as training set, the predictive R2 further increased to 19.5%, achieving improvement compared to the best method using a single population. XPASS also outperformed MTAG with improvement in terms of R2, when the MTAG output was used as training data for LDpred and P+T. A similar trend of improvement was observed for BMI (R2 = 5.9% for XPASS compared to R2 = 2.16% for LDpred trained on Chinese and R2 = 5.15% for LDpred trained on MTAG-UKBB), although the amount of improvement for BMI was less than that of height. This can be attributed to the lower heritability of BMI. We found that the choice of LD block partitions in XPASS had little effect on its performance, suggesting that XPASS is quite robust to the partitioning strategy.
When the PRS was used to stratify individuals, we found XPASS is more effective in identifying the groups with extreme genetic values (Figures 4B and 4E), consistent with the analyses conducted on the individual-level data. By comparing the PRS distributions (Figures 4C and 4F), we observed that PRSs derived by XPASS from the larger training sets often had broader distribution than those derived by LDpred from the smaller training sets. We note that this observation is consistent with the stratification analysis, since the model with higher stratification ability pulls individuals with extreme PRSs farther away from its mean value.
Influence of the auxiliary sample size on prediction performance
As we can observe in the simulation analyses, a large auxiliary dataset is critical for the performance of cross-population prediction. To systematically investigate how the sample size of auxiliary dataset influences the predictive performance of XPA, we randomly subsampled 20,000–300,000 UKBB individuals as an auxiliary dataset to construct a PRS for height in the Chinese population and evaluated the predictive R2. We also trained BLUP using only auxiliary datasets as benchmark. As expected, due to the population difference between EAS and EUR, BLUP trained on 20,000 sample from UKBB was significantly inferior to that trained on 20,000 samples from Chinese (point P1 in Figure 5A). When the UKBB sample size became larger than 50,000 (point P2 in Figure 5A), it achieved better performance than the model trained on 20,000 samples from Chinese. In contrast, XPA provided stable estimate of genetic correlation (Figure 5B) regardless of the auxiliary sample size, so it always outperformed BLUP and effectively improved the prediction accuracy with the inclusion of more UKBB samples. It is also worth noting that XPA used only 20,000 Chinese and 300,000 Europeans to achieve the comparable performance with BLUP that was trained using all 430,000 European samples (point P3 in Figure 5A), highlighting the importance of including samples from the target population in PRS construction. Comparing XPASS with its special case LDpred-inf29 led to similar conclusions (Figures 5C and 5D). When 250,000 samples were included in the auxiliary dataset, XPASS achieved comparable performance with LDpred-inf using all 430,000 UKBB samples (point P7 in Figure 5C). When XPASS+ was applied, it further improved the performance and outperformed LDpred-inf when only 150,000 UKBB samples were included (Figure S34). By contrasting Figures 5A and 5C, we found individual-level approaches were generally better than summary-level approaches, which is consistent with our simulation results.
Construction of PRS for the African population by XPA and XPASS
To demonstrate the generality of our framework to other populations, we compared the prediction performance of XPA and XPASS with existing approaches using GWAS data from the African population. First, we estimated the trans-ancestry genetic correlation among European, African, and East Asian populations for height. Our results suggest that trans-ancestry genetic correlations of East Asian and European populations are often stronger than those of African and European populations (Figure 6A), consistent with a previous report.52 To construct PRSs, we applied XPA and XPASS to integrate 7.4K African training samples with 430K European samples from UKBB. The prediction accuracy was evaluated on the 1K testing African samples. Clearly, both XPA (Figure 6C) and XPASS (Figure 6D) outperformed BLUP and LDpred and effectively improved the prediction accuracy with the inclusion of UKBB samples. The replication of better prediction performance by combining African and well-powered auxiliary European populations reassures us that, despite the more significant genetic distance, our XPA and XPASS framework still achieved state-of-the-art performance when compared to alternative methods.
Discussion
The genetic architectures of various human traits have been mostly studied in samples with European ancestry, while non-European populations are still under-represented. Whether the genetic discoveries derived from Europeans can be transferred to non-Europeans remains unclear. In this study, we have proposed a unified framework for cross-population analysis (XPA and XPASS) to improve genetic prediction of under-represented populations by leveraging their trans-ancestry genetic correlations with a large and well-powered auxiliary GWAS dataset from another population. By combining the individual-level UKBB samples and Chinese samples, we were able to construct improved PRSs for height and BMI in the Chinese population, demonstrating the utility of trans-ancestry genetic prediction. We also showed that XPASS can achieve comparable prediction performance while only requiring summary-level data. When XPASS was trained using the summary-level BBJ and UKBB data, it produced even better prediction performance than XPA trained with the individual-level UKBB and Chinese data. As we do not have access to the individual-level BBJ data, XPASS offers an effective strategy to make use of existing data resources. We also observed improved prediction accuracy of PRSs in the African population when we applied XPA and XPASS to combine AFR and EUR GWAS data, suggesting the generality of our framework across global populations. When XPA+ and XPASS+ were applied, the population-specific effects can also be incorporated, leading to higher prediction accuracy in height. Because of the better performance and robustness to the choice of p value threshold, we recommend XPA+ and XPASS+ in practice.
Existing studies53,54 have established the connection between the prediction accuracy of PRSs and various parameters from the theoretical perspective, including heritability, sample size, genetic correlation, and the effective number of chromosome segments. These theoretical properties have been supported by practical evidence. For example, Truong et al.55 showed that the prediction accuracy of PRSs increased when individuals with higher-level relatedness were included in the analysis, which was due to the decreasing number of effective chromosome segments. Our work mainly focuses on the side of practice, aiming to develop a scalable and accurate method for the construction of PRSs in the cross-ancestry setting. We have observed that higher genetic correlation and inclusion of more target samples in the training set can improve prediction performance, which is consistent with the theory.53
In real applications, by taking the advantage of widespread pleiotropy among phenotypes, many successful multi-trait models56, 57, 58 have been developed to produce powerful PRSs with increased prediction accuracy. While these approaches have been widely used in risk prediction within a single population, they may not be easily applied to integrate datasets from multi-ancestry backgrounds, since they do not take the heterogeneous genetic architectures into account. Our XPA framework, as compared to existing multi-trait models, provides a scalable solution to effectively combine multi-ancestry datasets by leveraging the stable estimate of trans-ancestry genetic correlation while accounting for the heterogeneous LD structure between populations. The success of XPA sheds light on the transferable genetic basis among global populations and demonstrates the benefits of integrating multi-ancestry datasets in genetic prediction.
While we have mainly focused on the anthropometric traits in this study, it is worth noting that XPA and XPASS can be applied to a wide class of phenotypes, such as complex diseases and molecular phenotypes. Due to the binary nature of diseases, their relationships with genotypes are often better captured by the liability threshold model (LTM). When the individual relatedness is low, the uni-variate linear mixed model (LMM) can be viewed as an approximation of the LTM.27 A recent study59 found that the bi-variate LMM can approximate the bi-variate LTM and produce consistent genetic correlation estimate for binary responses, suggesting the potential of applying XPA and XPASS to predict disease risks.
Our XPA framework needs more investigation in the following directions. First, XPA could be improved by allowing more flexible assumptions on SNP effect sizes. XPA assumes that, for a given population, the variance of the effect sizes of standardized genotypes is a constant, implicitly assuming that the SNP effect sizes increase as the allele frequencies decrease at the rate , where f is the allele frequency (AF). This assumption was also adopted in the previous trans-ancestry analysis.60 Some recent studies have suggested that the effect sizes may not keep increasing when the allele frequency is small due to the negative selection.25,61,62 It was shown that the prediction can benefit from introducing a selection parameter in BayesS (Figure 3A).
Second, the trans-ancestry genetic correlation may not be homogeneous across the genome. The differential selection pressure between populations can induce differences in their AF.34,63 By sorting the SNPs according to the normalized AF difference between EAS and EUR, we estimated the genetic correlation of the effect sizes corresponding to the SNPs with the largest AF differences. As shown in the Figure S35, the trans-ancestry genetic correlation decreases as the AF difference increases. To assess the effect of AF difference on the prediction accuracy, we extended the XPASS model to include an additional genetic component that captures the effects of SNPs with large AF differences across populations (see supplemental note). From our real data analysis, we did not observe significant enrichment of heritability among these SNPs. As a result, we did not obtain a better PRS by modeling the effect sizes of these SNPs as an additional variance component in the extended model (see Table S1). Our results suggest that modeling the effect sizes of SNPs with large AF difference may not be the key to improve PRSs.
Third, the integration of multiple GWASs across populations may further improve the prediction accuracy of XPA and XPASS. It has been shown that jointly modeling multiple genetically correlated phenotypes can produce more accurate prediction within EUR population.56, 57, 58 By applying XPASS to estimate genetic correlations for a wide spectrum of phenotypes between EUR and EAS, we found that many genetically correlated traits discovered in EUR studies are replicated between EAS and EUR (supplemental note). Therefore, when a number of correlated phenotypes are simultaneously available in both populations, a more flexible model that jointly considers these phenotypes may further increase the prediction power.
Fourth, functional annotations may also be included to inform the prediction. The SNPs with biological functions, such as gene regulatory elements,64, 65, 66 epigenomic regulations,40,67,68 and tissue-specific functional pathways, are usually enriched for the heritability of complex traits.69, 70, 71, 72, 73 Recent studies suggest that leveraging the functional annotations in prediction models trained on the EUR datasets produced PRSs with higher accuracy in both EUR74,75 and EAS,76 indicating the substantial overlap of functionally important variants across populations. Hence, integrating functional information to prioritize biologically relevant SNPs in complex traits can potentially increase the prediction accuracy of the XPA framework.
Declaration of interests
The authors declare no competing interests.
Acknowledgments
This work is supported in part by National Key R&D Program of China (2020YFA0713900), Hong Kong Research Grant Council (12301417, 16307818, 16301419, 16308120), Hong Kong Innovation and Technology Fund (PRP/029/19FX), Hong Kong University of Science and Technology (startup grant R9405, Z0428 from the Big Data Institute), and the Open Research Fund from Shenzhen Research Institute of Big Data (2019ORF01004). The computational task for this work was partially performed using the X-GPU cluster supported by the RGC Collaborative Research Fund: C6021-19EF.
Published: March 25, 2021
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2021.03.002.
Contributor Information
Gang Chen, Email: chengangcs@gmail.com.
Can Yang, Email: macyang@ust.hk.
Appendix A
Derivation of normalEquation 5
Note has its first-order moment as zero. Thus, the MoM estimator is then obtained by matching the second-order moment based on the criterion of least-squares:
(Equation A1) |
Knowing the fact that , the OLS objective function in Equation A1 can be re-written as
Taking derivatives of the objective function with respect to and setting them to zero leads to estimating Equation 5.
Derivation of posterior mean (Equation 8)
Given the estimates of parameters and fixed effects , the posterior means of are given as:
(Equation A2) |
with
where and are the posterior means of and , respectively. Note that directly working on this form of posterior mean requires solving a linear system, which is intractable since p is in the order of millions. Here, we derive an equivalent form that computes by solving only an linear system:
(Equation A3) |
where and the last equation is granted by the Woodburry matrix identity. XPA computes the posterior mean of the target population by:
Computation of posterior means and fixed effects with the CG algorithm
The fixed effects and posterior means are given as
and
respectively. Both quantities involve solving the linear systems and . Therefore, we can compute the estimates of fixed effects and the posterior means at the same time while only solving the linear systems once. Specifically, we first construct a working matrix and apply the conjugate gradient approach to solve the combined linear systems , as summarized in Algorithm 1.
Algorithm 1. Conjugate Gradient algorithm for solving [v,U]=Ωˆ−1W.
At each iteration of the CG algorithm, we need to compute
where and are column vectors. With entries of and decoded from the Hash table, we first compute and and then multiply and on their left to obtain the corresponding terms. This operation is highly efficient since it only involves matrix-vector multiplication. The final time complexity of the CG algorithm is , where κ is the condition number of . Because κ is usually small, the CG algorithm offers substantial computational improvement in solving the linear system.
Assumptions for XPASS model
We consider the datasets and from the two populations. The vectors and contain the z-scores derived from the two populations, where and and and are the residual variance of regressing on and on , respectively. Following Vilhjálmsson et al.,29 we assume the z-scores are derived from GWAS datasets with phenotype vectors and standardized to have mean of zero and standard deviation of one. The first few PCs of the reference genotypes from the two populations are given in and . Similar to XPA, we first standardize the reference genotype matrices and to obtain the corresponding and that have column means zero and variances .
Parameter estimation in XPASS
To derive the normal equations using the summary-level datasets, we start by eliminating the and in Equation 5, which leads to
By dividing the three equations by , and , the estimating Equation 5 became:
(Equation A4) |
Note that the summary statistic of the j-th SNP , where the first equation is derived from the assumption that has mean 0 and variance , and the approximation is granted by the assumption that each SNP only contributes a small proportion of the total phenotypic variance. Similarly, . Therefore, we can approximate the terms in the right-hand side of Equation A4 by
In the left-hand side of Equation A4, and are not involved, and the terms involving and can be approximated using the reference genotypes and . For example, can be approximated by , where and . Other terms in the left-hand side can be approximated in the same way. Using these approximations, the final normal equation is given as
(Equation A5) |
where , , , and are heritabilities for the two populations, respectively, and is the co-heritability between two populations. Note that all terms in Equation A5 depend only on the z-scores and reference panels. Solving Equation A5 leads to the parameter estimates .
Given the parameter estimates , the genetic correlation can be estimated by . To test for using summary statistics, we estimate the standard error of by applying the jackknife resampling method, with a leaving-one-chromosome-out strategy.
Derivation ofEquations 9and10
With the model in Equation 3, the estimates of fixed effects and the posterior means are given as
By taking the advantage of the Woodbury matrix identity to compute the matrix inversion of , we obtain
(Equation A6) |
where and are the LD matrices of all SNPs, and are the LD matrices of large-effect SNPs, and and are the SNP correlation matrices between large-effect SNPs and all SNPs from the two populations, respectively.
With the reference genotype matrices, we can approximate the LD matrices in Equation A6 with , , , and . In addition, the terms involving can be approximated by , , , and , where and are sub-vectors of and corresponding to the large-effect SNPs. Replacing the relevant terms with corresponding approximations leads to Equations 9 and 10.
Data and code availability
The publicly available GWAS summary statistics for estimating genetic correlations were obtained from the links summarized in Table S2. Access to genome-wide summary statistics from the Chinese data has to be approved by the contact authors (see Table S2). Researchers who wish to gain access to the data are required to submit the data request by sending an email to the contact authors. GWAS summary statistics from other studies can be accessed using the links provided in Table S2. The UKBB data are available through the UK Biobank Access Management System. The IPM BioMe biobank data are accessible on dbGap with accession number phs000925.v1.p1. The C++ software for XPA and the R package for XPASS are available online.
Web resources
C++ software for XPA, https://github.com/YangLabHKUST/XPA
glmnetPlus, https://github.com/junyangq/glmnetPlus
lassosum, https://github.com/tshmak/lassosum
Popcorn, https://github.com/brielin/Popcorn
R package for XPASS, https://github.com/YangLabHKUST/XPASS
Supplemental information
References
- 1.Torkamani A., Wineinger N.E., Topol E.J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 2018;19:581–590. doi: 10.1038/s41576-018-0018-x. [DOI] [PubMed] [Google Scholar]
- 2.Abul-Husn N.S., Manickam K., Jones L.K., Wright E.A., Hartzel D.N., Gonzaga-Jauregui C., O’Dushlaine C., Leader J.B., Lester Kirchner H., Lindbuchler D.M. Genetic identification of familial hypercholesterolemia within a single U.S. health care system. Science. 2016;354:aaf7000. doi: 10.1126/science.aaf7000. [DOI] [PubMed] [Google Scholar]
- 3.Khera A.V., Chaffin M., Aragam K.G., Haas M.E., Roselli C., Choi S.H., Natarajan P., Lander E.S., Lubitz S.A., Ellinor P.T., Kathiresan S. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 2018;50:1219–1224. doi: 10.1038/s41588-018-0183-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Craig J.E., Han X., Qassim A., Hassall M., Cooke Bailey J.N., Kinzy T.G., Khawaja A.P., An J., Marshall H., Gharahkhani P., NEIGHBORHOOD consortium. UK Biobank Eye and Vision Consortium Multitrait analysis of glaucoma identifies new risk loci and enables polygenic prediction of disease susceptibility and progression. Nat. Genet. 2020;52:160–166. doi: 10.1038/s41588-019-0556-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Bustamante C.D., Burchard E.G., De la Vega F.M. Genomics for the world. Nature. 2011;475:163–165. doi: 10.1038/475163a. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Popejoy A.B., Fullerton S.M. Genomics is failing on diversity. Nature. 2016;538:161–164. doi: 10.1038/538161a. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Need A.C., Goldstein D.B. Next generation disparities in human genomics: concerns and remedies. Trends Genet. 2009;25:489–494. doi: 10.1016/j.tig.2009.09.012. [DOI] [PubMed] [Google Scholar]
- 8.Mills M.C., Rahal C. The GWAS Diversity Monitor tracks diversity by disease in real time. Nat. Genet. 2020;52:242–243. doi: 10.1038/s41588-020-0580-y. [DOI] [PubMed] [Google Scholar]
- 9.Lewis C.M., Vassos E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 2020;12:44. doi: 10.1186/s13073-020-00742-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Wojcik G.L., Graff M., Nishimura K.K., Tao R., Haessler J., Gignoux C.R., Highland H.M., Patel Y.M., Sorokin E.P., Avery C.L. Genetic analyses of diverse populations improves discovery for complex traits. Nature. 2019;570:514–518. doi: 10.1038/s41586-019-1310-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Peterson R.E., Kuchenbaecker K., Walters R.K., Chen C.Y., Popejoy A.B., Periyasamy S., Lam M., Iyegbe C., Strawbridge R.J., Brick L. Genome-wide association studies in ancestrally diverse populations: Opportunities, methods, pitfalls, and recommendations. Cell. 2019;179:589–603. doi: 10.1016/j.cell.2019.08.051. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Lam M., Chen C.Y., Li Z., Martin A.R., Bryois J., Ma X., Gaspar H., Ikeda M., Benyamin B., Brown B.C., Schizophrenia Working Group of the Psychiatric Genomics Consortium. Indonesia Schizophrenia Consortium. Genetic REsearch on schizophreniA neTwork-China and the Netherlands (GREAT-CN) Comparative genetic architectures of schizophrenia in East Asian and European populations. Nat. Genet. 2019;51:1670–1678. doi: 10.1038/s41588-019-0512-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Martin A.R., Kanai M., Kamatani Y., Okada Y., Neale B.M., Daly M.J. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 2019;51:584–591. doi: 10.1038/s41588-019-0379-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Martin A.R., Gignoux C.R., Walters R.K., Wojcik G.L., Neale B.M., Gravel S., Daly M.J., Bustamante C.D., Kenny E.E. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 2017;100:635–649. doi: 10.1016/j.ajhg.2017.03.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A., Abecasis G.R., 1000 Genomes Project Consortium A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Liu J.Z., van Sommeren S., Huang H., Ng S.C., Alberts R., Takahashi A., Ripke S., Lee J.C., Jostins L., Shah T., International Multiple Sclerosis Genetics Consortium. International IBD Genetics Consortium Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 2015;47:979–986. doi: 10.1038/ng.3359. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Easton D.F., Pooley K.A., Dunning A.M., Pharoah P.D., Thompson D., Ballinger D.G., Struewing J.P., Morrison J., Field H., Luben R., SEARCH collaborators. kConFab. AOCS Management Group Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007;447:1087–1093. doi: 10.1038/nature05887. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Mahajan A., Go M.J., Zhang W., Below J.E., Gaulton K.J., Ferreira T., Horikoshi M., Johnson A.D., Ng M.C., Prokopenko I., DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium. Asian Genetic Epidemiology Network Type 2 Diabetes (AGEN-T2D) Consortium. South Asian Type 2 Diabetes (SAT2D) Consortium. Mexican American Type 2 Diabetes (MAT2D) Consortium. Type 2 Diabetes Genetic Exploration by Nex-generation sequencing in muylti-Ethnic Samples (T2D-GENES) Consortium Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat. Genet. 2014;46:234–244. doi: 10.1038/ng.2897. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Waters K.M., Stram D.O., Hassanein M.T., Le Marchand L., Wilkens L.R., Maskarinec G., Monroe K.R., Kolonel L.N., Altshuler D., Henderson B.E., Haiman C.A. Consistent association of type 2 diabetes risk variants found in europeans in diverse racial and ethnic groups. PLoS Genet. 2010;6:e1001078. doi: 10.1371/journal.pgen.1001078. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.McGuire A.L., Gabriel S., Tishkoff S.A., Wonkam A., Chakravarti A., Furlong E.E.M., Treutlein B., Meissner A., Chang H.Y., López-Bigas N. The road ahead in genetics and genomics. Nat. Rev. Genet. 2020;21:581–596. doi: 10.1038/s41576-020-0272-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Yang J., Lee S.H., Goddard M.E., Visscher P.M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 2011;88:76–82. doi: 10.1016/j.ajhg.2010.11.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Friedman J., Hastie T., Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010;33:1–22. [PMC free article] [PubMed] [Google Scholar]
- 23.Habier D., Fernando R.L., Kizilkaya K., Garrick D.J. Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics. 2011;12:186. doi: 10.1186/1471-2105-12-186. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Erbe M., Hayes B.J., Matukumalli L.K., Goswami S., Bowman P.J., Reich C.M., Mason B.A., Goddard M.E. Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J. Dairy Sci. 2012;95:4114–4129. doi: 10.3168/jds.2011-5019. [DOI] [PubMed] [Google Scholar]
- 25.Zeng J., de Vlaming R., Wu Y., Robinson M.R., Lloyd-Jones L.R., Yengo L., Yap C.X., Xue A., Sidorenko J., McRae A.F. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet. 2018;50:746–753. doi: 10.1038/s41588-018-0101-4. [DOI] [PubMed] [Google Scholar]
- 26.Qian J., Tanigawa Y., Du W., Aguirre M., Chang C., Tibshirani R., Rivas M.A., Hastie T. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet. 2020;16:e1009141. doi: 10.1371/journal.pgen.1009141. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Lee S.H., Wray N.R., Goddard M.E., Visscher P.M. Estimating missing heritability for disease from genome-wide association studies. Am. J. Hum. Genet. 2011;88:294–305. doi: 10.1016/j.ajhg.2011.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Purcell S.M., Wray N.R., Stone J.L., Visscher P.M., O’Donovan M.C., Sullivan P.F., Sklar P., International Schizophrenia Consortium Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–752. doi: 10.1038/nature08185. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Vilhjálmsson B.J., Yang J., Finucane H.K., Gusev A., Lindström S., Ripke S., Genovese G., Loh P.R., Bhatia G., Do R., Schizophrenia Working Group of the Psychiatric Genomics Consortium, Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) study Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 2015;97:576–592. doi: 10.1016/j.ajhg.2015.09.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Mak T.S.H., Porsch R.M., Choi S.W., Zhou X., Sham P.C. Polygenic scores via penalized regression on summary statistics. Genet. Epidemiol. 2017;41:469–480. doi: 10.1002/gepi.22050. [DOI] [PubMed] [Google Scholar]
- 31.Turley P., Walters R.K., Maghzian O., Okbay A., Lee J.J., Fontana M.A., Nguyen-Viet T.A., Wedow R., Zacher M., Furlotte N.A., 23andMe Research Team. Social Science Genetic Association Consortium Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 2018;50:229–237. doi: 10.1038/s41588-017-0009-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Coram M.A., Fang H., Candille S.I., Assimes T.L., Tang H. Leveraging multi-ethnic evidence for risk assessment of quantitative traits in minority populations. Am. J. Hum. Genet. 2017;101:218–226. doi: 10.1016/j.ajhg.2017.06.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Akiyama M., Okada Y., Kanai M., Takahashi A., Momozawa Y., Ikeda M., Iwata N., Ikegawa S., Hirata M., Matsuda K. Genome-wide association study identifies 112 new loci for body mass index in the Japanese population. Nat. Genet. 2017;49:1458–1467. doi: 10.1038/ng.3951. [DOI] [PubMed] [Google Scholar]
- 34.Akiyama M., Ishigaki K., Sakaue S., Momozawa Y., Horikoshi M., Hirata M., Matsuda K., Ikegawa S., Takahashi A., Kanai M. Characterizing rare and low-frequency height-associated variants in the Japanese population. Nat. Commun. 2019;10:4393. doi: 10.1038/s41467-019-12276-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Wan X., Yang C., Yang Q., Xue H., Fan X., Tang N.L., Yu W. BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am. J. Hum. Genet. 2010;87:325–340. doi: 10.1016/j.ajhg.2010.07.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Loh P.R., Tucker G., Bulik-Sullivan B.K., Vilhjálmsson B.J., Finucane H.K., Salem R.M., Chasman D.I., Ridker P.M., Neale B.M., Berger B. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 2015;47:284–290. doi: 10.1038/ng.3190. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Wu Y., Sankararaman S. A scalable estimator of SNP heritability for biobank-scale data. Bioinformatics. 2018;34:i187–i194. doi: 10.1093/bioinformatics/bty253. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Speed D., Hemani G., Johnson M.R., Balding D.J. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 2012;91:1011–1021. doi: 10.1016/j.ajhg.2012.10.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Zhou X. A unified framework for variance component estimation with summary statistics in genome-wide association studies. Ann. Appl. Stat. 2017;11:2027–2051. doi: 10.1214/17-AOAS1052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Lu Q., Li B., Ou D., Erlendsdottir M., Powles R.L., Jiang T., Hu Y., Chang D., Jin C., Dai W. A powerful approach to estimating annotation-stratified genetic covariance via GWAS summary statistics. Am. J. Hum. Genet. 2017;101:939–964. doi: 10.1016/j.ajhg.2017.11.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Berisa T., Pickrell J.K. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016;32:283–285. doi: 10.1093/bioinformatics/btv546. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Yang S., Zhou X. Accurate and scalable construction of polygenic scores in large biobank data sets. Am. J. Hum. Genet. 2020;106:679–693. doi: 10.1016/j.ajhg.2020.03.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Lloyd-Jones L.R., Zeng J., Sidorenko J., Yengo L., Moser G., Kemper K.E., Wang H., Zheng Z., Magi R., Esko T. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 2019;10:5086. doi: 10.1038/s41467-019-12653-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Yao X., Tang S., Bian B., Wu X., Chen G., Wang C.-C. Improved phylogenetic resolution for Y-chromosome Haplogroup O2a1c-002611. Sci. Rep. 2017;7:1146. doi: 10.1038/s41598-017-01340-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Delaneau O., Zagury J.-F., Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods. 2013;10:5–6. doi: 10.1038/nmeth.2307. [DOI] [PubMed] [Google Scholar]
- 46.Howie B.N., Donnelly P., Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Liu S., Huang S., Chen F., Zhao L., Yuan Y., Francis S.S., Fang L., Li Z., Lin L., Liu R. Genomic analyses from non-invasive prenatal testing reveal genetic associations, patterns of viral infections, and Chinese population history. Cell. 2018;175:347–359.e14. doi: 10.1016/j.cell.2018.08.016. [DOI] [PubMed] [Google Scholar]
- 48.Chen J., Zheng H., Bei J.-X., Sun L., Jia W.H., Li T., Zhang F., Seielstad M., Zeng Y.-X., Zhang X., Liu J. Genetic structure of the Han Chinese population revealed by genome-wide SNP variation. Am. J. Hum. Genet. 2009;85:775–785. doi: 10.1016/j.ajhg.2009.10.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Xu S., Yin X., Li S., Jin W., Lou H., Yang L., Gong X., Wang H., Shen Y., Pan X. Genomic dissection of population substructure of Han Chinese and its implication in association studies. Am. J. Hum. Genet. 2009;85:762–774. doi: 10.1016/j.ajhg.2009.10.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Fuchsberger C., Abecasis G.R., Hinds D.A. minimac2: faster genotype imputation. Bioinformatics. 2015;31:782–784. doi: 10.1093/bioinformatics/btu704. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Jiang J., Li C., Paul D., Yang C., Zhao H. On high-dimensional misspecified mixed model analysis in genome-wide association study. Ann. Statist. 2016;44:2127–2160. [Google Scholar]
- 52.Nelis M., Esko T., Mägi R., Zimprich F., Zimprich A., Toncheva D., Karachanak S., Piskácková T., Balascák I., Peltonen L. Genetic structure of Europeans: a view from the North-East. PLoS ONE. 2009;4:e5472. doi: 10.1371/journal.pone.0005472. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Lee S.H., Clark S., van der Werf J.H.J. Estimation of genomic prediction accuracy from reference populations with varying degrees of relationship. PLoS ONE. 2017;12:e0189775. doi: 10.1371/journal.pone.0189775. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.van Rheenen W., Peyrot W.J., Schork A.J., Lee S.H., Wray N.R. Genetic correlations of polygenic disease traits: from theory to practice. Nat. Rev. Genet. 2019;20:567–581. doi: 10.1038/s41576-019-0137-z. [DOI] [PubMed] [Google Scholar]
- 55.Truong B., Zhou X., Shin J., Li J., van der Werf J.H.J., Le T.D., Lee S.H. Efficient polygenic risk scores for biobank scale data by exploiting phenotypes from inferred relatives. Nat. Commun. 2020;11:3074. doi: 10.1038/s41467-020-16829-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Li C., Yang C., Gelernter J., Zhao H. Improving genetic risk prediction by leveraging pleiotropy. Hum. Genet. 2014;133:639–650. doi: 10.1007/s00439-013-1401-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Maier R., Moser G., Chen G.-B., Ripke S., Coryell W., Potash J.B., Scheftner W.A., Shi J., Weissman M.M., Hultman C.M., Cross-Disorder Working Group of the Psychiatric Genomics Consortium Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am. J. Hum. Genet. 2015;96:283–294. doi: 10.1016/j.ajhg.2014.12.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Maier R.M., Zhu Z., Lee S.H., Trzaskowski M., Ruderfer D.M., Stahl E.A., Ripke S., Wray N.R., Yang J., Visscher P.M., Robinson M.R. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nat. Commun. 2018;9:989. doi: 10.1038/s41467-017-02769-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Weissbrod O., Flint J., Rosset S. Estimating SNP-based heritability and genetic correlation in case-control studies directly and with summary statistics. Am. J. Hum. Genet. 2018;103:89–99. doi: 10.1016/j.ajhg.2018.06.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Yang L., Neale B.M., Liu L., Lee S.H., Wray N.R., Ji N., Li H., Qian Q., Wang D., Li J., Psychiatric GWAS Consortium: ADHD Subgroup Polygenic transmission and complex neuro developmental network for attention deficit hyperactivity disorder: genome-wide association study of both common and rare variants. Am. J. Med. Genet. B. Neuropsychiatr. Genet. 2013;162B:419–430. doi: 10.1002/ajmg.b.32169. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Speed D., Cai N., Johnson M.R., Nejentsev S., Balding D.J., UCLEB Consortium Reevaluation of SNP heritability in complex human traits. Nat. Genet. 2017;49:986–992. doi: 10.1038/ng.3865. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Speed D., Holmes J., Balding D.J. Evaluating and improving heritability models using summary statistics. Nat. Genet. 2020;52:458–462. doi: 10.1038/s41588-020-0600-y. [DOI] [PubMed] [Google Scholar]
- 63.Turchin M.C., Chiang C.W., Palmer C.D., Sankararaman S., Reich D., Hirschhorn J.N., Genetic Investigation of ANthropometric Traits (GIANT) Consortium Evidence of widespread selection on standing variation in Europe at height-associated SNPs. Nat. Genet. 2012;44:1015–1019. doi: 10.1038/ng.2368. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Cai M., Chen L.S., Liu J., Yang C. IGREX for quantifying the impact of genetically regulated expression on phenotypes. NAR Genom Bioinform. 2020;2:a010. doi: 10.1093/nargab/lqaa010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Farh K.K., Marson A., Zhu J., Kleinewietfeld M., Housley W.J., Beik S., Shoresh N., Whitton H., Ryan R.J., Shishkin A.A. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature. 2015;518:337–343. doi: 10.1038/nature13835. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Shi X., Chai X., Yang Y., Cheng Q., Jiao Y., Chen H., Huang J., Yang C., Liu J. A tissue-specific collaborative mixed model for jointly analyzing multiple tissues in transcriptome-wide association studies. Nucleic Acids Res. 2020;48:e109. doi: 10.1093/nar/gkaa767. e109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Finucane H.K., Bulik-Sullivan B., Gusev A., Trynka G., Reshef Y., Loh P.R., Anttila V., Xu H., Zang C., Farh K., ReproGen Consortium. Schizophrenia Working Group of the Psychiatric Genomics Consortium. RACI Consortium Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 2015;47:1228–1235. doi: 10.1038/ng.3404. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Maurano M.T., Humbert R., Rynes E., Thurman R.E., Haugen E., Wang H., Reynolds A.P., Sandstrom R., Qu H., Brody J. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012;337:1190–1195. doi: 10.1126/science.1222794. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Trynka G., Sandor C., Han B., Xu H., Stranger B.E., Liu X.S., Raychaudhuri S. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nat. Genet. 2013;45:124–130. doi: 10.1038/ng.2504. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Pickrell J.K. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am. J. Hum. Genet. 2014;94:559–573. doi: 10.1016/j.ajhg.2014.03.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., Ziller M.J., Roadmap Epigenomics Consortium Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Ming J., Dai M., Cai M., Wan X., Liu J., Yang C. LSMM: a statistical approach to integrating functional annotations with genome-wide association studies. Bioinformatics. 2018;34:2788–2796. doi: 10.1093/bioinformatics/bty187. [DOI] [PubMed] [Google Scholar]
- 73.Ming J., Wang T., Yang C. LPM: a latent probit model to characterize the relationship among complex traits using summary statistics from multiple GWASs and functional annotations. Bioinformatics. 2020;36:2506–2514. doi: 10.1093/bioinformatics/btz947. [DOI] [PubMed] [Google Scholar]
- 74.Hu Y., Lu Q., Powles R., Yao X., Yang C., Fang F., Xu X., Zhao H. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS Comput. Biol. 2017;13:e1005589. doi: 10.1371/journal.pcbi.1005589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Marquez-Luna C., Gazal S., Loh P.-R., Kim S.S., Furlotte N., Auton A., Price A.L., 23andMe Research Team LDpred-funct: incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. bioRxiv. 2020 doi: 10.1101/375337. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Amariuta T., Ishigaki K., Sugishita H., Ohta T., Koido M., Dey K.K., Matsuda K., Murakami Y., Price A.L., Kawakami E. Improving the trans-ancestry portability of polygenic risk scores by prioritizing variants in predicted cell-type-specific regulatory elements. Nat. Genet. 2020;52:1346–1354. doi: 10.1038/s41588-020-00740-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The publicly available GWAS summary statistics for estimating genetic correlations were obtained from the links summarized in Table S2. Access to genome-wide summary statistics from the Chinese data has to be approved by the contact authors (see Table S2). Researchers who wish to gain access to the data are required to submit the data request by sending an email to the contact authors. GWAS summary statistics from other studies can be accessed using the links provided in Table S2. The UKBB data are available through the UK Biobank Access Management System. The IPM BioMe biobank data are accessible on dbGap with accession number phs000925.v1.p1. The C++ software for XPA and the R package for XPASS are available online.