Summary
Accurate polygenic scores (PGSs) facilitate the genetic prediction of complex traits and aid in the development of personalized medicine. Here, we develop a statistical method called multi-trait assisted PGS (mtPGS), which can construct accurate PGSs for a target trait of interest by leveraging multiple traits relevant to the target trait. Specifically, mtPGS borrows SNP effect size similarity information between the target trait and its relevant traits to improve the effect size estimation on the target trait, thus achieving accurate PGSs. In the process, mtPGS flexibly models the shared genetic architecture between the target and the relevant traits to achieve robust performance, while explicitly accounting for the environmental covariance among them to accommodate different study designs with various sample overlap patterns. In addition, mtPGS uses only summary statistics as input and relies on a deterministic algorithm with several algebraic techniques for scalable computation. We evaluate the performance of mtPGS through comprehensive simulations and applications to 25 traits in the UK Biobank, where in the real data mtPGS achieves an average of 0.90%–52.91% accuracy gain compared to the state-of-the-art PGS methods. Overall, mtPGS represents an accurate, fast, and robust solution for PGS construction in biobank-scale datasets.
Keywords: polygenic scores, PGS, polygenic risk scores, PRS, genome-wide association studies, GWAS, genetic prediction, genetic correlation
Xu et al. propose a statistical method called multi-trait assisted polygenic scores (mtPGS) that constructs accurate polygenic scores for a complex trait of interest by leveraging multiple correlated traits. mtPGS demonstrates enhanced predictive performance in simulations and real applications in the UK Biobank dataset.
Introduction
The polygenic score (PGS) of a complex trait captures an individual’s genetic predisposition for the trait and represents one of the earliest, most stable, and accurately measurable factors underlying the trait.1,2 A PGS is often calculated as a weighted summation of genotypes across single-nucleotide polymorphisms (SNPs),3,4,5,6,7 where the SNP effect size estimates serve as the weights. PGSs are commonly referred to as polygenic risk scores (PRSs) when the trait of interest is disease status.8,9,10 With the abundant availability of genotype and phenotype information collected from large-scale genome-wide association studies (GWASs),3,4,5 PGSs are becoming increasingly used for genetic prediction of complex traits/diseases,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25 disease risk stratification,3,10,26 pleiotropic association detection,27,28,29 transcriptome-wide association studies (TWASs),30,31,32 and Mendelian randomization analysis.33,34 Accurate construction of PGSs can facilitate disease prevention, improve early intervention, and support the development of personalized treatment.
Many statistical methods have been developed to construct PGSs.35,36 Different PGS methods often make distinct modeling assumptions on the genetic effect size distribution, use either individual-level GWAS data or summary statistics, and sometimes incorporate external information such as SNP functional annotations.12,19,36 While most existing PGS methods are univariate in nature and model directly the target complex trait of interest, several multivariate PGS methods have been recently developed to leverage traits that are correlated or relevant to the target trait to facilitate its prediction. Leveraging relevant traits through modeling their genetic correlation with the target trait can facilitate SNP effect size estimation for the target trait and thus improve prediction accuracy. Example multivariate methods that use GWAS individual-level data include the bivariate ridge regression (BVR),37 the multi-trait genomic best linear unbiased prediction (MTGBLUP),38 and the cross-trait penalized regression (CTPR).39 Example multivariate methods that use GWAS summary statistics include weighted multi-trait summary statistic best linear unbiased prediction method (wMT-SBLUP),40 multi-trait analysis of GWAS (MTAG),41 and cross-population and cross-phenotype (XPXP).42
Despite the effectiveness of multivariate PGS modeling, the existing multivariate PGS methods have two important limitations. First, almost all multivariate methods make relatively simple modeling assumptions on the shared genetic architecture between the target trait and the relevant traits. In particular, most methods use a bivariate or multivariate linear mixed model (mvLMM),43,44 which assumes that all SNPs have non-zero effects and that their effect sizes on the target and relevant traits follow a bivariate or multivariate normal distribution characterized by genetic correlations. Such a modeling assumption is not flexible, as a single multivariate normal distribution often cannot effectively capture the SNP effect size distribution across traits and may lead to over-shrinkage of large-effect SNPs and subsequently impede prediction performance. Indeed, recent studies comparing univariate PGS methods found that flexible modeling of the SNP effect size distribution is key to ensuring accurate PGSs across a range of traits with distinct genetic architectures.12,15,19,45,46 Univariate methods that make flexible modeling assumptions on the genetic architectures, such as DPR,12 PRS-CS,15 and DBSLMM,45 all outperform standard polygenic models such as the best linear unbiased predictor (BLUP), which assumes normality on the SNP effect size distribution. Therefore, it is important to extend the existing multivariate PGS methods toward flexible modeling of the shared genetic architecture to improve PGS accuracy. Such a modeling extension, and its application to large-scale biobank datasets such as UK Biobank (UKB),47 requires the development of new computationally efficient algorithms. Second, most existing multivariate PGS methods have primarily focused on modeling the genetic correlation among traits to borrow the SNP effect size similarity information from the relevant traits to improve the PGS for the target trait. 
However, environmental correlation due to sample overlap is another crucial and indispensable component of the phenotypic correlation.44 Many large-scale GWASs, such as UKB, often collect thousands of phenotypes on the same set of individuals.47,48,49,50 These phenotypes may be correlated with each other due to shared environmental exposures. Failing to account for such environmental correlation may lead to reduced PGS performance, especially in datasets where a large proportion of samples have multiple phenotypic measurements. Indeed, a recent study42 shows that accounting for the environmental correlation in addition to modeling pleiotropy helps improve PGS accuracy.
Here, we present an accurate and scalable multivariate PGS method, which we refer to as the multi-trait assisted PGS (mtPGS), that addresses the above two limitations to effectively leverage multiple correlated traits to predict the target trait of interest. mtPGS makes a flexible modeling assumption on the SNP effect size distribution underneath the shared genetic architecture and explicitly incorporates environmental correlation to aid the prediction of the target trait. mtPGS also takes advantage of a deterministic search algorithm with several algebraic techniques to achieve scalable computation. We illustrate the benefits of mtPGS through comprehensive simulations and applications to 25 traits in the UKB.
Subjects and methods
Method overview
Here, we present an overview of mtPGS, with its technical details provided in the supplemental methods. Our goal is to leverage multiple correlated traits and flexibly model their shared genetic and environmental architecture for accurate genetic prediction on a target trait of interest. To do so, we denote y0 as an n0 vector of the target trait that is measured on n0 samples. We consider M relevant traits and denote ym as an nm vector of the mth trait measured on nm samples (m = 1, …, M). These relevant traits may be obtained in the same study as the target trait or from a separate study with either overlapped or non-overlapped samples. The detailed selection strategy for the relevant traits is described in a later section. We assume that all individuals have their genotypes measured on a common set of p SNPs. We examine the relevant traits one at a time. For each relevant trait in turn, we leverage its shared genetic and/or environmental architecture with the target trait to construct a PGS for the target trait. Afterward, we combine the different PGSs for the target trait constructed based on the different relevant traits into a single PGS as the final PGS. We describe these steps in detail in the following sections.
Bivariate modeling of the target trait and one relevant trait at a time
First, we consider modeling the target trait and one relevant trait jointly through a bivariate modeling framework. Specifically, for the particular trait pair of y0 and ym, we separate the individuals into three subgroups: one subgroup of overlapped individuals and two subgroups of non-overlapped individuals. We denote $n_{ms}$ as the number of overlapping individuals with both y0 and ym measured; $n_{0n} = n_0 - n_{ms}$ as the number of non-overlapping individuals who were measured only for the target trait y0; and $n_{mn} = n_m - n_{ms}$ as the number of non-overlapping individuals who were measured only for the relevant trait ym. For the $n_{ms}$ overlapped individuals, we denote $\mathbf{y}_{0s}$, $\mathbf{y}_{ms}$ as the phenotype vectors of the target trait and the relevant trait, respectively; $\mathbf{X}_{s}$ as the $n_{ms}$ by p genotype matrix; and $\boldsymbol{\varepsilon}_{0s}$, $\boldsymbol{\varepsilon}_{ms}$ as the vectors of residual errors. For the non-overlapping individuals, we denote $\mathbf{y}_{0n}$ and $\mathbf{y}_{mn}$ as the vectors of phenotypic measurements for the $n_{0n}$ and $n_{mn}$ individuals, respectively; $\mathbf{X}_{0n}$ and $\mathbf{X}_{mn}$ as the corresponding $n_{0n}$ by p and $n_{mn}$ by p genotype matrices; and $\boldsymbol{\varepsilon}_{0n}$ and $\boldsymbol{\varepsilon}_{mn}$ as the corresponding $n_{0n}$ and $n_{mn}$ vectors of residual errors. Within each of the three subgroups of individuals, we center and standardize each phenotype vector and each column of the genotype matrices to have zero mean and unit standard deviation. With these notations, we consider the following three regression models:

$$\mathbf{y}_{0n} = \mathbf{X}_{0n}\boldsymbol{\beta}_{0} + \boldsymbol{\varepsilon}_{0n} \qquad \text{(Equation 1)}$$

$$\mathbf{y}_{mn} = \mathbf{X}_{mn}\boldsymbol{\beta}_{m} + \boldsymbol{\varepsilon}_{mn} \qquad \text{(Equation 2)}$$

$$\mathbf{y}_{0s} = \mathbf{X}_{s}\boldsymbol{\beta}_{0} + \boldsymbol{\varepsilon}_{0s}, \quad \mathbf{y}_{ms} = \mathbf{X}_{s}\boldsymbol{\beta}_{m} + \boldsymbol{\varepsilon}_{ms} \qquad \text{(Equation 3)}$$

where $\boldsymbol{\beta}_{0}$ and $\boldsymbol{\beta}_{m}$ are p-vectors of genotype effect sizes for the two phenotypes, respectively. In the above model, we accommodate both overlapped and non-overlapped individuals through three separate equations to account for distinct sample overlap patterns. The three equations are interconnected by a common set of genetic effect sizes $\boldsymbol{\beta}_{0}$ and $\boldsymbol{\beta}_{m}$, whose similarity characterizes the shared genetic architecture between the traits. To flexibly model the shared genetic architecture between the two traits, we extend the univariate modeling framework of DBSLMM45 toward bivariate modeling and assume that the effect sizes of the jth SNP on the two traits follow a mixture of four bivariate normal distributions,
$$\begin{pmatrix}\beta_{0j}\\ \beta_{mj}\end{pmatrix} \sim \pi_{ll}\,\mathrm{BN}(\mathbf{0}, \boldsymbol{\Sigma}_{ll}) + \pi_{ls}\,\mathrm{BN}(\mathbf{0}, \boldsymbol{\Sigma}_{ls}) + \pi_{sl}\,\mathrm{BN}(\mathbf{0}, \boldsymbol{\Sigma}_{sl}) + \pi_{ss}\,\mathrm{BN}(\mathbf{0}, \boldsymbol{\Sigma}_{ss}), \quad \boldsymbol{\Sigma}_{ab} = \begin{pmatrix}\sigma_{0,a}^{2} & \rho_g\,\sigma_{0,a}\sigma_{m,b}\\ \rho_g\,\sigma_{0,a}\sigma_{m,b} & \sigma_{m,b}^{2}\end{pmatrix} \qquad \text{(Equation 4)}$$

where, with probability $\pi_{ll}$, the jth SNP has large effects on both traits; with probability $\pi_{ls}$ and $\pi_{sl}$, the jth SNP has a large effect on one trait and a small effect on the other; and with probability $\pi_{ss}$, the jth SNP has small effects on both traits. Above, BN refers to a bivariate normal distribution; $\sigma_{0,l}^{2}$ and $\sigma_{m,l}^{2}$ are the variance components of large SNP effect sizes; $\sigma_{0,s}^{2}$ and $\sigma_{m,s}^{2}$ are the variance components of small SNP effect sizes; and $\rho_g$ represents the genetic correlation between the two traits. Our mixture modeling assumption on the effect sizes represents a hybrid between the sparse modeling assumption and the polygenic modeling assumption,19,45 facilitating robust and accurate prediction performance across a range of phenotypes with various underlying genetic architectures. In addition, the mixture of four bivariate normal distributions effectively categorizes SNPs into four groups: SNPs with large effects on both traits, SNPs with large effects on one trait or the other, and SNPs with small effects on both traits. By categorizing SNPs into four groups, mtPGS places a different amount of shrinkage on the SNP effect sizes in each group, leading to proper shrinkage of the small effects without over-shrinkage of the large effects.
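To make the four-group mixture concrete, the following is a minimal sketch that draws effect-size pairs from such a mixture. The function name, the variance values, the group probabilities, and the choice of a single correlation shared by all four components are assumptions made for illustration, not values or code taken from mtPGS.

```python
import numpy as np

def sample_mixture_effects(p, pis, var_large, var_small, rho_g, rng):
    """Draw effect-size pairs (target, relevant) for p SNPs.

    pis       : probabilities of the four groups, ordered as
                (large/large, large/small, small/large, small/small).
    var_large : (target, relevant) variances of large effects (assumed values).
    var_small : (target, relevant) variances of small effects (assumed values).
    rho_g     : correlation shared by all four components (an assumption).
    """
    sd_l = np.sqrt(np.asarray(var_large, dtype=float))
    sd_s = np.sqrt(np.asarray(var_small, dtype=float))
    # Row k holds the (target, relevant) standard deviations of component k.
    sds = np.array([
        [sd_l[0], sd_l[1]],
        [sd_l[0], sd_s[1]],
        [sd_s[0], sd_l[1]],
        [sd_s[0], sd_s[1]],
    ])
    # Assign every SNP to one of the four groups.
    group = rng.choice(4, size=p, p=pis)
    # Standard normal pairs with correlation rho_g, then scale per group.
    z0 = rng.standard_normal(p)
    z1 = rho_g * z0 + np.sqrt(1.0 - rho_g**2) * rng.standard_normal(p)
    beta = np.column_stack([z0, z1]) * sds[group]
    return beta, group

rng = np.random.default_rng(0)
beta, group = sample_mixture_effects(
    p=20_000, pis=[0.01, 0.02, 0.02, 0.95],
    var_large=(1e-2, 1e-2), var_small=(1e-5, 1e-5), rho_g=0.5, rng=rng,
)
```

Because each group has its own pair of variances, a shrinkage estimator fitted under this prior can shrink the many small effects strongly while leaving the few large effects nearly untouched.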
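Besides the genetic effects, we also consider the modeling assumptions on the residual errors, which represent the environmental component underlying the two traits. For the $n_{0n}$ and $n_{mn}$ non-overlapped individuals with only one trait measured, we assume that $\varepsilon_{0n,i}$ for the ith individual follows a normal distribution $N(0, \sigma_{e0}^{2})$ while $\varepsilon_{mn,j}$ for the jth individual follows another normal distribution $N(0, \sigma_{em}^{2})$. For the $n_{ms}$ overlapped individuals with both traits measured, we assume that their environmental effects for the ith individual follow a bivariate normal distribution,

$$\begin{pmatrix}\varepsilon_{0s,i}\\ \varepsilon_{ms,i}\end{pmatrix} \sim \mathrm{BN}\left(\mathbf{0}, \begin{pmatrix}\sigma_{e0}^{2} & \rho_e\,\sigma_{e0}\sigma_{em}\\ \rho_e\,\sigma_{e0}\sigma_{em} & \sigma_{em}^{2}\end{pmatrix}\right) \qquad \text{(Equation 5)}$$

where $\rho_e$ represents the environmental correlation between the two traits and $\sigma_{e0}^{2}$, $\sigma_{em}^{2}$ are the variance components of the environmental effects for the two traits.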
Finally, we note that in the above model, we introduce the genetic and environmental correlation parameters $\rho_g$ and $\rho_e$ to leverage additional information from the relevant trait to improve the prediction accuracy of the target trait. Specifically, the genetic correlation parameter allows us to borrow information from the SNP effect size estimates on the relevant trait to improve the SNP effect estimation on the target trait, thus improving the PGS on the target trait. The environmental correlation allows us to control for trait correlation in the overlapped individuals due to the shared environment, which facilitates SNP effect size estimation and PGS construction.
Deterministic inference algorithm and use of summary statistics
With the above model, our goal is to obtain the posterior estimates for the effect sizes $\boldsymbol{\beta}_{0}$ for the target trait. The effect size estimates would allow us to construct a PGS for the target trait, in the form of

$$s_m = \mathbf{x}^{T}\hat{\boldsymbol{\beta}}_{0}$$

for a new individual with a p-dimensional genotype vector of x, where the subscript m is used to emphasize the fact that the PGS for the target trait is constructed with the help of the mth relevant trait. Estimating $\boldsymbol{\beta}_{0}$, however, is computationally challenging, as the above model requires categorizing each SNP into four distinct categories, which leads to a likelihood with a high-dimensional integration that cannot be solved analytically. A standard numerical inference algorithm, such as Markov chain Monte Carlo (MCMC), needs to effectively explore this high-dimensional space and would inevitably result in a heavy computational burden. To ensure scalable computation, we develop an approximate algorithm for inference. We build upon the intuition that we can often obtain a reasonable list of large-effect SNPs for the target trait through a simple deterministic search strategy. Specifically, we use the LD clumping and p value thresholding (C+T) procedure implemented in the PLINK software (v.1.90b6.9) to select a set of large-effect SNPs for the target trait. In the C+T procedure, there are three hyper-parameters that need to be specified: LD window size, p value threshold p, and LD threshold r2. In mtPGS, we fix the LD window size to be 1,000 kb. For p and r2, we tune these two hyper-parameters in the validation set based on four different values of p (1e−5, 1e−6, 1e−7, and 1e−8) and three different values of r2 (0.1, 0.2, and 0.25). In the absence of a validation set, we set these hyper-parameters as fixed values (p = 1e−6, r2 = 0.2). Because the number of large-effect SNPs is often small (e.g., ranges from 2 to 2,875 in the real data applications), we can treat their effect sizes as fixed effects and fit the following approximate model for inference:
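$$\mathbf{y}_{0n} = \mathbf{X}_{0n}^{(l)}\boldsymbol{\beta}_{0}^{(l)} + \mathbf{X}_{0n}^{(s)}\boldsymbol{\beta}_{0}^{(s)} + \boldsymbol{\varepsilon}_{0n} \qquad \text{(Equation 6)}$$

$$\mathbf{y}_{mn} = \mathbf{X}_{mn}^{(l)}\boldsymbol{\beta}_{m}^{(l)} + \mathbf{X}_{mn}^{(s)}\boldsymbol{\beta}_{m}^{(s)} + \boldsymbol{\varepsilon}_{mn} \qquad \text{(Equation 7)}$$

$$\mathbf{y}_{0s} = \mathbf{X}_{s}^{(l)}\boldsymbol{\beta}_{0}^{(l)} + \mathbf{X}_{s}^{(s)}\boldsymbol{\beta}_{0}^{(s)} + \boldsymbol{\varepsilon}_{0s}, \quad \mathbf{y}_{ms} = \mathbf{X}_{s}^{(l)}\boldsymbol{\beta}_{m}^{(l)} + \mathbf{X}_{s}^{(s)}\boldsymbol{\beta}_{m}^{(s)} + \boldsymbol{\varepsilon}_{ms} \qquad \text{(Equation 8)}$$

where $\boldsymbol{\beta}_{0}^{(l)}$, $\boldsymbol{\beta}_{m}^{(l)}$ are the $p_l$ vectors of effect sizes of the large-effect SNPs; $\boldsymbol{\beta}_{0}^{(s)}$, $\boldsymbol{\beta}_{m}^{(s)}$ are the $p_s$ vectors of effect sizes of the small-effect SNPs; $\mathbf{X}_{0n}^{(l)}$, $\mathbf{X}_{mn}^{(l)}$ and $\mathbf{X}_{0n}^{(s)}$, $\mathbf{X}_{mn}^{(s)}$ are the genotype matrices of the large-effect and small-effect SNPs for the non-overlapping individuals, respectively; and $\mathbf{X}_{s}^{(l)}$ and $\mathbf{X}_{s}^{(s)}$ are the genotype matrices of the large-effect and small-effect SNPs for the overlapping individuals, respectively. We treat $\boldsymbol{\beta}_{0}^{(l)}$ and $\boldsymbol{\beta}_{m}^{(l)}$ as fixed effects and further assume that $(\boldsymbol{\beta}_{0}^{(s)T}, \boldsymbol{\beta}_{m}^{(s)T})^{T} \sim \mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma}_{g} \otimes \mathbf{I}_{p_s})$ and $(\boldsymbol{\varepsilon}_{0s}^{T}, \boldsymbol{\varepsilon}_{ms}^{T})^{T} \sim \mathrm{MVN}(\mathbf{0}, \boldsymbol{\Sigma}_{e} \otimes \mathbf{I}_{n_{ms}})$, where MVN denotes the multivariate normal distribution; $\otimes$ denotes the Kronecker product; $\boldsymbol{\Sigma}_{g}$ denotes the genetic covariance matrix of the small-effect SNPs; and $\boldsymbol{\Sigma}_{e}$ denotes the environmental covariance matrix for the overlapping individuals. With the above approximate model, we can directly obtain the effect size estimates analytically, in the form of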
| (Equation 9) |
| (Equation 10) |
where , , , and , are the SNP heritability estimates for the target trait and the relevant trait, respectively.
The above estimation form allows us to directly extend mtPGS toward using GWAS summary statistics as input. Specifically, the above estimation form in Equations 9 and 10 contains the LD matrices $\mathbf{X}_{0n}^{T}\mathbf{X}_{0n}/n_{0n}$, $\mathbf{X}_{mn}^{T}\mathbf{X}_{mn}/n_{mn}$, and $\mathbf{X}_{s}^{T}\mathbf{X}_{s}/n_{ms}$ as well as linear products such as $\mathbf{X}_{0n}^{T}\mathbf{y}_{0n}$ and $\mathbf{X}_{s}^{T}\mathbf{y}_{0s}$. The linear products can be computed based on GWAS marginal Z scores, which, for the jth SNP and the mth trait, are in the form of $z_{mj} = \mathbf{x}_{mj}^{T}\mathbf{y}_{m}/\sqrt{n_m}$, where $\mathbf{x}_{mj}$ represents the jth column of the genotype matrix $\mathbf{X}_{m}$. Therefore, the required GWAS summary statistics include an LD matrix estimated from a reference panel as well as two sets of marginal Z scores from overlapped individuals for the two traits and another two sets of marginal Z scores from non-overlapped individuals for the two traits. In addition, the inference algorithm also works when only two sets of marginal Z scores for the two traits are available, as long as the proportion of sample overlap between the two traits is also provided (details in supplemental methods).
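As an illustration of how marginal Z scores relate to the standardized genotype and phenotype data described above, the sketch below computes them from individual-level data. It assumes the common definition in which the Z score of a SNP equals the inner product of its standardized genotype column with the standardized phenotype, divided by the square root of the sample size; the function names and the toy data are hypothetical.

```python
import numpy as np

def standardize(a):
    """Center and scale to zero mean and unit standard deviation (column-wise)."""
    a = np.asarray(a, dtype=float)
    return (a - a.mean(axis=0)) / a.std(axis=0)

def marginal_z_scores(genotypes, phenotype):
    """Assumed form: z_j = x_j^T y / sqrt(n) with standardized x_j and y."""
    n = genotypes.shape[0]
    X = standardize(genotypes)
    y = standardize(phenotype)
    return X.T @ y / np.sqrt(n)

# Toy data: 500 individuals, 20 SNPs coded 0/1/2, one SNP with a real effect.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.3, size=(500, 20)).astype(float)
y = 0.2 * G[:, 0] + rng.standard_normal(500)
z = marginal_z_scores(G, y)
```

Under this definition each Z score equals the marginal Pearson correlation multiplied by the square root of n, which is why Z scores plus sample sizes suffice to recover the linear products needed by a summary-statistics method. A monomorphic SNP (zero variance) would need to be filtered out before standardization.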
Our algorithm effectively extends the scalable algorithm of DBSLMM toward bivariate modeling. In the process, we also adopt four additional procedures to further improve the computational efficiency of the mtPGS algorithm. First, instead of estimating the SNP heritability ($h_0^2$, $h_m^2$) for the two traits and the genetic ($\rho_g$) and environmental ($\rho_e$) correlations between the two traits in our model, we follow the main idea of DBSLMM45 and obtain the estimates of $h_0^2$, $h_m^2$, $\rho_g$, and $\rho_e$ externally by using GECKO44 and treat these estimates as fixed before inferring the other parameters in our model. GECKO is a computational method for estimating both genetic and environmental covariances using GWAS summary statistics. GECKO fits the same linear mixed model used by many other approaches51 for genetic covariance estimation and relies on a composite likelihood-based algorithm to improve the estimation accuracy of existing method of moments (MoM)-based estimation algorithms while keeping computation in check. Second, we take advantage of the block LD structure of the human genome and follow Berisa et al.52 to approximate the LD matrix with a block-diagonal matrix. The block-diagonal approximation to the LD matrix, when further paired with eigen-decomposition on the genetic and environmental covariance matrices, allows us to obtain the SNP effect size estimates in Equations 9 and 10 in each LD block separately. Third, we apply the Woodbury matrix identity for SNP effect size estimation in Equations 9 and 10, which effectively transforms large matrix multiplications into small matrix multiplications. Finally, we leverage the preconditioned conjugate gradient algorithm to solve the linear systems in Equations 9 and 10,53 thus bypassing the computationally expensive matrix inversion for scalable computation. These technical details are provided in the supplemental methods. 
As a result, the computational complexity of our algorithm scales linearly with respect to the number of SNPs when GWAS summary statistics are available as input, and further scales linearly with respect to the number of individuals when summary statistics are absent.
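The two linear-algebra devices named above can be illustrated on a generic ridge-type system within a single LD block. The sketch below is not the mtPGS estimator in Equations 9 and 10; it assumes a simplified system $(\mathbf{X}^{T}\mathbf{X}/n + \lambda\mathbf{I})\mathbf{b} = \mathbf{r}$ with a reference panel of n individuals and p SNPs, and all names and dimensions are hypothetical.

```python
import numpy as np

def ridge_solve_woodbury(X, r, lam):
    """Solve (X^T X / n + lam * I_p) b = r through an n x n system.

    By the Woodbury identity the p x p inverse is replaced by an n x n
    solve, which is cheaper when the panel size n is smaller than p.
    """
    n = X.shape[0]
    small = n * lam * np.eye(n) + X @ X.T
    return (r - X.T @ np.linalg.solve(small, X @ r)) / lam

def pcg(A, b, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned conjugate gradient for symmetric positive definite A."""
    m_inv = 1.0 / np.diag(A)          # diagonal (Jacobi) preconditioner
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                     # residual
    z = m_inv * r                     # preconditioned residual
    d = z.copy()                      # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ad = A @ d
        alpha = rz / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = m_inv * r
        rz_new = r @ z
        d = z + (rz_new / rz) * d
        rz = rz_new
    return x

# One toy LD block: 50 reference individuals, 200 SNPs.
rng = np.random.default_rng(2)
n, p, lam = 50, 200, 0.5
X = rng.standard_normal((n, p))
r = rng.standard_normal(p)
A = X.T @ X / n + lam * np.eye(p)

b_direct = np.linalg.solve(A, r)
b_woodbury = ridge_solve_woodbury(X, r, lam)
b_pcg = pcg(A, r)
```

With a block-diagonal LD approximation, a solve of this kind would be repeated independently for each block, so the total cost grows with the number of blocks rather than with the cube of the genome-wide SNP count.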
Combining PGSs from multiple relevant traits
The above two sections have described the details for obtaining the PGS for the target trait using one relevant trait. For a new individual, we use the SNP effect size estimates from Equations 9 and 10 to construct the PGS for the target trait. We denote the resulting PGS as $s_m$, as it uses information from the mth relevant trait. When we have more than one relevant trait, we can combine the PGSs constructed based on all these relevant traits through a weighted PGS framework. Specifically, we obtain our final PGS for the target trait in the form of

$$s = \sum_{m=1}^{M} w_m s_m$$

where $w_m$ represents the weight for the mth PGS. We infer the weights $w_m$ through cross-validation. Specifically, we split the data into three sets: a training set, a validation set, and a test set. We fit our model in the training set, obtain the PGS estimates in the validation set, and further infer the weights in the validation set through a multiple linear regression where the target trait is treated as the outcome and the M PGS values constructed using different relevant traits are treated as the covariates. We finally evaluate the performance of the combined PGSs in the test set. This weighted PGS framework allows us to leverage multiple relevant traits to improve the genetic prediction on the target trait while keeping computation in check.
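The weighting step can be sketched as an ordinary least-squares fit in the validation set followed by a weighted sum in the test set. The function names and the toy data below are hypothetical, and including an intercept in the regression is an assumption, since the text does not say whether one is used.

```python
import numpy as np

def fit_pgs_weights(S_val, y_val):
    """Regress the target trait on the M per-trait PGSs in the validation set.

    S_val : (n_val, M) matrix, column m is the PGS built with relevant trait m.
    Returns the intercept (an assumed extra term) and the M weights.
    """
    design = np.column_stack([np.ones(len(y_val)), S_val])
    coef, *_ = np.linalg.lstsq(design, y_val, rcond=None)
    return coef[0], coef[1:]

def combine_pgs(S, intercept, weights):
    """Final PGS: weighted sum of the M per-trait PGSs."""
    return intercept + S @ weights

# Toy validation data with three relevant traits, the third uninformative.
rng = np.random.default_rng(3)
S_val = rng.standard_normal((1000, 3))
y_val = 0.1 + S_val @ np.array([0.5, 0.3, 0.0]) + 0.1 * rng.standard_normal(1000)
intercept, w = fit_pgs_weights(S_val, y_val)

S_test = rng.standard_normal((5, 3))
final_pgs = combine_pgs(S_test, intercept, w)
```

An uninformative relevant trait receives a weight near zero in this fit, so adding it costs little beyond the estimation noise of one extra regression coefficient.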
We refer to our method as the multi-trait assisted PGS (mtPGS). mtPGS is implemented in the mtPGS software, freely available at http://xzlab.org/software.html or https://github.com/xuchang0201/mtPGS.
Compared methods
We compared mtPGS with seven existing summary statistics-based PGS methods: DBSLMM, SBLUP, PRS-CS, MegaPRS, wMT-SBLUP, MTAG, and XPXP. The first four methods are univariate PGS methods that focus only on the target trait, while the last three methods are multivariate PGS methods that can take advantage of multiple relevant traits to construct PGSs for the target trait. All these methods are applied using only GWAS summary statistics and an LD matrix constructed based on a reference panel. We did not include methods such as MTGBLUP and CTPR in comparison as these methods apply only to individual-level data.54 To ensure fair comparison across methods, we used the same reference panel of 500 individuals45,46 randomly sampled from UKB to construct the LD matrix and used this matrix as input for all methods except PRS-CS. For PRS-CS, the software requires using the pre-computed LD matrix available on the software website, which was computed based on European samples in the 1000 Genomes Project phase 3 dataset.
For DBSLMM, we used the deterministic version (v.0.3) for model fitting with default parameter settings. For SBLUP, we used the GCTA software (v.1.93.2beta) to fit the model. In SBLUP, following Maier et al.,40 we set the LD window size to 2,000 kb and calculated the input parameter as $\lambda = M(1/h^2 - 1)$, where M is the total number of SNPs used in the analysis and h2 is the SNP heritability. For wMT-SBLUP, we used the SMTpred python software for fitting (see web resources). In particular, for each phenotype of the target trait and relevant traits, we first fitted SBLUP to obtain SNP effect size estimates and then computed individual PGS predictors using the “score” function in the PLINK55 software. Afterward, we constructed wMT-SBLUP predictors for the target trait using the individual SBLUP predictors of the target trait and relevant traits as input. For PRS-CS, we used the PRS-CS software (v.1.0.0) for model fitting. In PRS-CS, following Ge et al.,15 we set the hyper-parameter a to the default value of 1 and the hyper-parameter b to the default value of 0.5. For the global shrinkage parameter φ in PRS-CS, due to memory and computational time constraints, we followed Ge et al.15 and Xiao et al.42 and set it to 10⁻² in the simulation studies. In the real data applications, we followed Ge et al.15 and explored four different choices for φ: 10⁻⁶, 10⁻⁴, 10⁻², and 1. We selected the optimal φ based on performance in the validation set. For MegaPRS, we used the LDAK software (v.5.2) for model fitting.56 In MegaPRS, following Zabad et al.,57 we estimated the per-predictor heritabilities assuming the GCTA model and constructed the PGS by using the BayesR model. We set the LD window size to be 1,000 kb in both simulations and real data applications as suggested by the authors of the method. For MTAG, following Turley et al.41 and Chung et al.,39 we constructed the PGS by using the multi-trait adjusted summary statistics from MTAG as input for LDpred. 
Specifically, we first used MTAG to combine summary statistics from GWASs of different traits into multi-trait adjusted summary statistics and then used these summary statistics as input for LDpred to construct a PGS for the target trait. For the LDpred stage of model fitting, we used the LDpred software (v.1.0.11), which requires the user to specify two tuning parameters: the LD radius and the fraction of causal variants. For the LD radius parameter, we set it to be the recommended value (m/3,000) in the simulations, with m being the number of SNPs, and set it to be 200 in the real data applications to ensure scalability following Yang et al.45 For the fraction of causal variants, we used a validation set to tune the parameter in the simulation studies. Specifically, we followed Vilhjálmsson et al.21 and Lloyd-Jones et al.11 and explored nine different choices for this parameter: 1, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, and 0.0001. The value with the highest prediction R2 in the validation set was selected as the optimal fraction parameter. In the real data applications, due to memory and computational time constraints, we set the fraction of causal variants to be 1 following Turley et al.41 For XPXP, we used the XPXP software for model fitting (see web resources). XPXP is primarily designed for cross-population PGS construction and requires two reference panels, one for the target population and one for the auxiliary population. Therefore, we partitioned the original UKB reference panel into two equal-sized subsets and supplied them as reference panel inputs for XPXP.
Simulations
We performed comprehensive simulation studies to compare the performance of mtPGS with the other PGS methods. An overview of the simulation settings analyzed in this paper is summarized in Table S1. Specifically, we randomly selected 12,000 individuals with European ancestry from the UKB data (detailed description in the next section). For these individuals, we obtained a total of m = 100,000 SNPs by selecting the first 10,000 SNPs from each of the chromosomes 1 to 10. We first consider the main simulation settings with k = 2 traits. We simulated the two traits on the same set of individuals (i.e., complete-overlap study design), with the first trait being the target trait and the second trait being the relevant trait. In the main simulations, we examined three distinct genetic architectures: polygenic (scenario I), sparse (scenario II), and hybrid (scenario III).
Scenario I is a polygenic scenario, where all SNPs are assumed to have non-zero effects on both traits. We simulated the effect sizes of the ith SNP on the two traits based on a bivariate normal distribution, with mean zero, variances $h_0^2/m$, $h_m^2/m$, and correlation $\rho_g$. In addition, we simulated the environmental effects for the two traits in the jth individual from another bivariate normal distribution, with mean zero, variances $1 - h_0^2$, $1 - h_m^2$, and correlation $\rho_e$.
Scenario II is a sparse scenario, where we randomly selected 1% of the SNPs to have non-zero effects on at least one trait. We denoted the number of non-zero effect SNPs as p. Among the non-zero effect SNPs, we randomly selected 80% of them to have correlated effects on the two traits, where their effect sizes on the two traits come from a bivariate normal distribution $\mathrm{BN}\left(\mathbf{0}, \begin{pmatrix} h_0^2/p & \rho_g\sqrt{h_0^2 h_m^2}/(0.8p) \\ \rho_g\sqrt{h_0^2 h_m^2}/(0.8p) & h_m^2/p \end{pmatrix}\right)$. Here, we set the per-SNP effect size covariance to be $\rho_g\sqrt{h_0^2 h_m^2}/(0.8p)$ since the genetic covariance is contributed by a total of 0.8p SNPs. For the remaining SNPs with uncorrelated effects, we simulated their effect sizes on the two traits from two independent normal distributions, $N(0, h_0^2/p)$ and $N(0, h_m^2/p)$, respectively.
Scenario III is a hybrid of the polygenic and sparse scenarios. Here, we assumed that all SNPs have non-zero effects, but their effect sizes come from a mixture of two different distributions. Specifically, we randomly selected 1% of SNPs (denoted as pl) to have relatively large effects while assigning the remaining SNPs (denoted as ps) to have small effects. We set the proportion of SNP heritability explained by the large-effect SNPs to be 0.3. In particular, we simulated the effect sizes of each large-effect SNP on the two traits based on a bivariate normal distribution . We simulated the effect sizes of each small-effect SNP on the two traits based on another bivariate normal distribution: .
For each simulation scenario, we set the heritability of the two traits $h_0^2$, $h_m^2$ to be either 0.25 or 0.5, and we varied the genetic correlation $\rho_g$ to be 0.25, 0.5, or 0.75 and the environmental correlation $\rho_e$ to be 0.1, 0.15, or 0.2, for a total of 36 simulation settings (4 × 3 × 3), with ten simulation replicates per setting. In each simulation replicate, we simulated the SNP effect sizes and environmental effects, and then summed them to obtain the simulated traits. With the simulated data, we divided the samples randomly into a training set of 10,000 individuals, a validation set of 1,000 individuals, and a test set of 1,000 individuals. We fitted the linear regression model implemented in the GEMMA43 software (v.0.98.1) in the training data to compute the marginal Z scores. We also randomly selected 500 individuals from the remaining UKB data and treated them as a reference panel for LD matrix computation. With the marginal Z scores from the training data and the LD matrix from the reference panel, we fitted different PGS methods to obtain SNP effect size estimates. For PGS methods that require parameter tuning (i.e., mtPGS, MegaPRS, and LDpred), we used the validation set to select the optimal tuning parameters. We then evaluated the performance of all methods using R2 in the test set. Due to the small sample size and limited number of SNPs in the simulations, it is challenging to obtain accurate estimates of SNP heritability and genetic/environmental correlations. Therefore, following Lloyd-Jones et al.11 and Yang et al.,45 we supplied true SNP heritability and genetic/environmental correlations for all PGS methods in the simulations. For each simulation setting, we calculated the mean R2 in the test set across the ten replicates as the evaluation metric.
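The structure of the main simulations can be sketched as follows for the polygenic scenario. This is a toy illustration under stated assumptions: genotypes are random standardized values rather than UKB data, per-SNP effect variances are taken as heritability divided by the number of SNPs, environmental variances as one minus heritability, and R2 is computed as a squared Pearson correlation. All function names are hypothetical.

```python
import itertools
import numpy as np

def simulate_polygenic_pair(X, h2_0, h2_m, rho_g, rho_e, rng):
    """Simulate a target and a relevant trait on the same individuals."""
    n, m = X.shape
    g_cov = rho_g * np.sqrt(h2_0 * h2_m)
    # Assumed per-SNP covariance: total genetic covariance divided by m.
    beta = rng.multivariate_normal(
        [0.0, 0.0], np.array([[h2_0, g_cov], [g_cov, h2_m]]) / m, size=m)
    e_cov = rho_e * np.sqrt((1 - h2_0) * (1 - h2_m))
    eps = rng.multivariate_normal(
        [0.0, 0.0], [[1 - h2_0, e_cov], [e_cov, 1 - h2_m]], size=n)
    return X @ beta + eps, beta

def split_indices(n, n_train, n_val, n_test, rng):
    """Random disjoint training, validation, and test index sets."""
    idx = rng.permutation(n)
    a, b = n_train, n_train + n_val
    return idx[:a], idx[a:b], idx[b:b + n_test]

def r_squared(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    return np.corrcoef(y, y_hat)[0, 1] ** 2

# 4 heritability pairs x 3 genetic x 3 environmental correlations = 36 settings.
settings = list(itertools.product(
    itertools.product([0.25, 0.5], repeat=2),
    [0.25, 0.5, 0.75],
    [0.1, 0.15, 0.2],
))

rng = np.random.default_rng(4)
X = rng.standard_normal((1200, 500))      # toy stand-in for standardized genotypes
(h2_0, h2_m), rho_g, rho_e = settings[0]
Y, beta = simulate_polygenic_pair(X, h2_0, h2_m, rho_g, rho_e, rng)
train, val, test = split_indices(1200, 1000, 100, 100, rng)
```

In a full replicate, each PGS method would be fitted on the training indices, tuned on the validation indices, and scored with `r_squared` on the test indices, with the mean over ten replicates reported per setting.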
Besides the above simulation settings with completely overlapped samples, we also examined the influence of sample overlap on the predictive performance of different PGS methods. Specifically, we considered two alternative study designs: a non-overlap study design where the two traits are measured on two different sets of individuals and a partial-overlap study design where a fraction of individuals are measured for two traits while the remaining individuals are measured for only one trait. For the non-overlap study design, we randomly selected two sets of n = 10,000 individuals to serve as the training sets for the two traits separately. For the partial-overlap study design, for the training set, we used the same 10,000 individuals in the training set of the complete-overlap study design to simulate the target trait. In addition, we randomly selected 7,000 individuals among them to serve as the common set of individuals with both traits measured. We then randomly selected another 3,000 individuals from UKB and added them to the common set of 7,000 individuals to serve as the training set for the second trait. Therefore, for the training set in the partial-overlap study design, we have a total of 13,000 individuals, with about half of them (7,000) measured for both traits. We used the same validation and test sets for the target trait from the complete-overlap study design for both non-overlap and partial-overlap study designs. For the above two study designs, we set $\rho_g$ to be 0.5. We set $\rho_e$ to be 0.1 in the partial-overlap study design. For each of these study designs, we selected the same 100,000 SNPs as in the complete-overlap study design, varied the heritability of both traits to be either 0.25 or 0.5 ($h_0^2 = h_m^2$), and simulated the SNP effect sizes under the three genetic architectures described in scenarios I–III, for a total of 12 (2 × 6) simulation settings. 
For each setting in the non-overlap study design, we computed two sets of GWAS summary statistics, one for each trait, to serve as the inputs for PGS methods. For each setting in the partial-overlap study design, we performed two different analyses. In the first analysis, we computed two sets of summary statistics, one for each trait, based on the combined overlapped and non-overlapped individuals from each trait. We then supplied the two sets of summary statistics as inputs for mtPGS and the other PGS methods. We used Equations 18 and 19 in the supplemental methods for approximate model fitting in mtPGS. In the second analysis, we computed four sets of GWAS summary statistics: two from the overlapped individuals for the two traits, and two from the non-overlapped individuals for the two traits. We then supplied the four sets of summary statistics as input for mtPGS and used Equations 9 and 10 for model fitting. We compared the predictive performance of mtPGS in the two analyses to examine the accuracy of the approximation in Equations 18 and 19 in the supplemental methods. For each simulation setting, we performed ten replicates and calculated the mean R2 as the evaluation metric.
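For readers unfamiliar with the summary statistics used as input, a marginal Z score for a single SNP can be sketched as the slope of a simple linear regression divided by its standard error. This is a plain-Python illustration without covariates, not a reproduction of GEMMA's implementation; the function name and the toy data are hypothetical.

```python
import math


def marginal_z(genotype, phenotype):
    """Simple linear regression of phenotype on one SNP.

    Returns (beta, standard error, Z score) for the slope.
    """
    n = len(genotype)
    mean_x = sum(genotype) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in genotype)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotype, phenotype))
    beta = sxy / sxx
    # Residual sum of squares around the fitted line.
    rss = sum((y - mean_y - beta * (x - mean_x)) ** 2
              for x, y in zip(genotype, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se, beta / se


# Toy data: allele counts and a phenotype that rises with the allele count.
toy_genotype = [0, 1, 2, 0, 1, 2]
toy_phenotype = [0.1, 1.9, 4.2, -0.1, 2.1, 3.8]
beta, se, z = marginal_z(toy_genotype, toy_phenotype)
```

Repeating this computation for every SNP yields one set of summary statistics per trait and per group of individuals.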
Besides the simulation settings with two traits, we also simulated settings with multiple traits to examine whether incorporating multiple relevant traits can further improve the predictive performance of multivariate PGS methods. Specifically, we considered a complete-overlap study design and set the number of relevant traits M to be either one, two, or three, with the heritability of all traits set to be 0.5. For settings with more than one relevant trait, we assumed that each relevant trait is genetically correlated with the target trait, with the genetic correlation between the relevant traits set to be 0. We further assumed that all traits are environmentally correlated. We simulated the SNP effect sizes and environmental effects under the polygenic architecture described in scenario I to obtain the simulated traits. We then computed GWAS summary statistics using GEMMA. For multivariate PGS methods, we fitted prediction models in the training set, tuned hyper-parameters in the validation set, and obtained individual predictors in the test set. For each specific M, we performed ten simulation replicates and calculated the mean R2 as the evaluation metric. Besides the main simulations, we also performed additional simulations to examine the influence of the number of traits on mtPGS accuracy and how such influence may vary under different genetic correlations between the target trait and the relevant traits. Specifically, we extended the main simulations by simulating up to ten relevant traits, with the genetic correlation between the target trait and each relevant trait set to be either 0.1, 0.2, or 0.5. In these additional simulations, we performed mtPGS analysis and evaluated the accuracy of mtPGS under increasing numbers of relevant traits and varying genetic correlation values.
UKB data
We performed simulations and real data applications based on genotype and phenotype data from the UKB.47 The UKB recruited 502,618 participants aged 40–69 years from across the United Kingdom between 2006 and 2010. Comprehensive phenotypic and health-related information was collected for enrolled individuals, and 488,377 individuals were genotyped at ∼800,000 genetic markers using the UK BiLEVE and UK Biobank Axiom Arrays from Affymetrix. Genotype data were imputed using the Haplotype Reference Consortium (HRC) and UK10K + 1000 Genomes Project phase 3 data as reference panels, which resulted in ∼93 million SNPs in total. For the sample quality control (QC) process, we followed the procedures described in the online resources of the Neale lab (see web resources). Specifically, we retained individuals (1) who are included in the UKB version 3 imputed genotype data; (2) who are included in the genotype principal component (PC) computation; and (3) who are identified as having White British ancestry by self-reporting “white-British,” “Irish,” or “White” and lying within 7 standard deviations of the first six PCs. We further filtered out individuals (1) who are identified as outliers in heterozygosity and missing rates; (2) who have sex chromosome aneuploidy, defined as putatively carrying abnormal sex chromosome configurations; (3) who are excluded from the kinship inference procedure; (4) who have more than ten putative third-degree relatives in the kinship table; or (5) who withdrew their consent to participate in the study. For variant QC, we focused on autosomal SNPs in the version 3 imputed genotype data and removed SNPs (1) with a missing percentage >5%; (2) with a minor allele frequency (MAF) <0.05; (3) with an INFO score <0.8; (4) with a Hardy-Weinberg equilibrium (HWE) test p value < 10⁻⁷; or (5) that are duplicated. A total of 361,112 individuals and 6,171,964 SNPs were retained after the above QC steps.
Following Bulik-Sullivan et al.51 and the Neale lab, we further restricted the set of SNPs to those listed in the pre-computed LD scores of European individuals provided by the LDSC website (see web resources), which resulted in 1,119,148 SNPs as the final set for analysis. All QC procedures described above were conducted using the PLINK55 software.
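The variant-level filters listed above can be sketched as a simple predicate over per-SNP records. The QC in the study was run with PLINK; the Python below only illustrates the thresholds, and the record field names (`id`, `missing_rate`, `maf`, `info`, `hwe_p`) are hypothetical. Dropping every copy of a duplicated ID is one possible reading of the duplicate rule and is an assumption here.

```python
from collections import Counter


def passes_thresholds(snp):
    """Apply the four per-variant thresholds listed in the text."""
    return (
        snp["missing_rate"] <= 0.05   # remove missing percentage > 5%
        and snp["maf"] >= 0.05        # remove MAF < 0.05
        and snp["info"] >= 0.8        # remove INFO score < 0.8
        and snp["hwe_p"] >= 1e-7      # remove HWE p value < 10^-7
    )


def filter_variants(snps):
    """Keep SNPs that pass all thresholds and whose ID is not duplicated."""
    id_counts = Counter(snp["id"] for snp in snps)
    return [
        snp for snp in snps
        if id_counts[snp["id"]] == 1 and passes_thresholds(snp)
    ]
```

A subsequent restriction to a fixed SNP list, such as the set of SNPs with pre-computed LD scores, would be a plain membership test on the retained IDs.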
Cross-validation in UKB
We performed cross-validations in UKB to evaluate the predictive performances of different PGS methods. Following Maier et al.,40 Morrison et al.,58 and Yang et al.,45 we selected 15 quantitative traits and 10 binary traits with (1) heritability >0.05 based on the Neale lab’s heritability estimation results (see web resources), (2) total sample size >50,000, and (3) number of affected individuals >1,000 for binary traits. The 15 quantitative traits can be categorized into three groups: a group of physical measurements that include standing height (SH), forced vital capacity (FVC), forced expiratory volume in 1 s (FEV1), basal metabolic rate (BMR), systolic blood pressure (SBP), and body mass index (BMI); a group of blood cell traits that include lymphocyte count (LYMPH), platelet count (PLT), platelet distribution width (PDW), neutrophil count (NEUT), red blood cell (erythrocyte) count (RBC), and hemoglobin concentration (HGB); and a group of other quantitative traits that include fluid intelligence score (GF), age at first live birth (AFLB), and birth weight (BW). The ten binary traits are all complex disease traits and include angina (AG), high cholesterol (HCH), hypertension (HTN), type 2 diabetes (T2D), heart attack/myocardial infarction (MI), ulcerative colitis (UC), asthma (AS), cholelithiasis/gall stones (CL), atrial fibrillation (AF), and gout (GO). An overview of the phenotypes analyzed in this paper is provided in Table S2 for quantitative traits and Table S3 for binary traits.
In the analysis, we first randomly selected 500 individuals from the UKB data to serve as a reference panel for calculating the SNP LD matrix. Afterward, we randomly selected 80% of the remaining individuals as the training set (n = 288,490), 10% as the validation set (n = 36,061), and 10% as the test set (n = 36,061). For each trait in turn, we extracted the corresponding records from the UKB data and constructed the phenotypes based on numeric measurements of quantitative traits and illness coding of binary traits. For the quantitative traits, we focused on the phenotypic values measured during the initial visit among the three assessment center visits for analysis. Due to missing phenotype measurements, the sample size for the quantitative traits ranges from 93,602 to 287,827 in the training set, from 11,602 to 35,985 in the validation set, and from 11,745 to 35,995 in the test set. For the binary traits, we followed Maier et al.40 to focus on the disease traits defined by self-reported non-cancer illness coding (data field 20002) instead of the ICD10 code diagnoses to maximize the sample size. For both quantitative and binary traits, we fitted a standard linear regression model in the training data to obtain the marginal Z scores using the GEMMA software. Following the Neale lab (see web resources), all quantitative traits were first adjusted for sex, age, age², genotyping array, and the top 20 genotype PCs by regressing the phenotype on these covariates to obtain the phenotype residuals. We then transformed the residuals to a standard normal distribution using quantile-quantile normalization and supplied them to GEMMA as phenotypic values. For binary traits, we followed Yang et al.45 and directly fitted a standard linear regression model using the original binary outcomes for each SNP-phenotype pair, treating sex, age, age², genotyping array, and the top 20 genotype PCs as covariates in the GEMMA software to obtain the marginal Z scores.
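The transformation of phenotype residuals to a standard normal distribution can be sketched as a rank-based inverse normal transform. This is one common way to carry out such a normalization and is assumed here; the function name, the rank offset of 0.5, and the absence of tie handling are choices made for the illustration, not details taken from the study.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Map values to standard normal quantiles according to their ranks.

    Assumes no ties; the offset keeps the quantiles strictly inside (0, 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    transformed = [0.0] * n
    for rank, i in enumerate(order, start=1):
        transformed[i] = std_normal.inv_cdf((rank - offset) / n)
    return transformed


# Toy residuals: the middle value maps to 0, the extremes to +/- the same quantile.
toy_transformed = inverse_normal_transform([3.0, 1.0, 2.0])
```

The transform preserves the ordering of the residuals while removing skewness and outliers, which is why it is often applied before fitting a linear model.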
Besides the analyses based on summary statistics generated by linear regression, we also performed sensitivity analyses to evaluate the prediction accuracy of mtPGS for binary traits when summary statistics from logistic regression are used as inputs. Specifically, for each of the ten binary traits in turn, we fitted logistic regression for each SNP-phenotype pair using LDAK software (see web resources) while adjusting for the same set of covariates as in our original linear regression analysis. We then supplied the marginal Z scores from logistic regression to mtPGS for PGS construction and compared the prediction accuracy with results based on linear regression.
With the GWAS summary statistics and the reference panel, we applied different PGS methods to the training data and evaluated their performances in the test data. We examined each of the 25 traits one at a time and treated it as the target trait. For univariate PGS methods, we directly fitted prediction models for the target trait to obtain the SNP effect size estimates. For multivariate PGS methods, we first selected relevant traits for the target trait. Specifically, for each target trait in turn, we examined the remaining 24 traits one at a time and selected those that have statistically significant genetic correlation (p value < 0.05/24) with the target trait to serve as the relevant traits. For the target trait that has no significant genetic correlation with any of the 24 traits, we directly selected the trait with the smallest p value of genetic correlation to serve as the relevant trait. We then modeled the target trait with all selected relevant traits jointly. The number of relevant traits ranges from 1 to 17 for each target trait. For the bivariate modeling stage in mtPGS, the SNP effect size estimates were obtained based on only two sets of summary statistics for each pair of the target trait and relevant trait, as the traits were measured in the same set of UKB participants. The genetic and environmental variance component estimates were obtained using GECKO44 with the pre-computed LD scores from the LDSC website. We used the “score” function in PLINK to calculate the PGS in the test data using the estimated SNP effects and evaluated the performance of different PGS methods using R2 for quantitative traits and area under the curve (AUC; calculated by R package pROC v.1.18.0) for binary traits.
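The rule for choosing relevant traits can be sketched as a Bonferroni-style filter with a fallback to the single most significant trait. The function name and the dictionary of genetic-correlation p values are hypothetical; the threshold of 0.05 divided by the number of candidate traits (24 in the study) follows the description above.

```python
def select_relevant_traits(rg_pvalues, alpha=0.05):
    """Select relevant traits for one target trait.

    rg_pvalues: maps each candidate trait to the p value of its genetic
    correlation with the target trait.
    """
    threshold = alpha / len(rg_pvalues)  # 0.05/24 with 24 candidates
    significant = [t for t, p in rg_pvalues.items() if p < threshold]
    if significant:
        # Most significant first.
        return sorted(significant, key=rg_pvalues.get)
    # No significant trait: fall back to the smallest p value.
    return [min(rg_pvalues, key=rg_pvalues.get)]
```

Under this rule every target trait receives at least one relevant trait, which is consistent with the reported range of 1 to 17 relevant traits per target trait.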
Besides the main analyses of 25 traits, we also performed two additional analyses to further examine the performance of mtPGS in the UKB. First, we evaluated the additional accuracy gain brought by the step of combining PGSs from multiple relevant traits over the step of pairwise analysis using only one relevant trait for mtPGS. For the pairwise PGS with one relevant trait, we selected the trait with the smallest p value of genetic correlation with the target trait to serve as the relevant trait. Following the selection procedure of relevant traits described previously, we identified more than one relevant trait for 14 target quantitative traits and 6 target binary traits. Therefore, we performed comparisons on these 20 traits. Second, we combined PGSs from the other PGS methods based on the same set of relevant traits using the same linear aggregation method as we did for mtPGS. Specifically, for the univariate PGS methods, we combined the univariate PGS of the target trait with that of each relevant trait. For the multivariate PGS methods, we combined the PGSs of the target trait from pairwise analysis of the target and each relevant trait. In either case, we obtained the weights from the validation set.
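The linear aggregation step, in which weights for several PGSs are learned in the validation set and then applied elsewhere, can be sketched with an ordinary least-squares fit. Ordinary least squares is used here as a simple stand-in and may differ from the regression actually used by each method; all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_pgs_weights(val_scores, val_phenotype):
    """Regress the validation phenotype on the PGSs.

    val_scores: one row per individual, one column per PGS.
    Returns [intercept, weight_1, weight_2, ...].
    """
    design = [[1.0] + list(row) for row in val_scores]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, val_phenotype))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


def combine_pgs(scores, weights):
    """Apply the fitted weights to obtain one combined PGS per individual."""
    return [weights[0] + sum(w * s for w, s in zip(weights[1:], row))
            for row in scores]
```

In use, `fit_pgs_weights` would be called on validation-set scores and `combine_pgs` on test-set scores, so that the weights are never tuned on the individuals used for evaluation.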
Results
Simulations
A method schematic of mtPGS is shown in Figure 1, with details provided in the subjects and methods. We performed simulation studies to evaluate the predictive performance of mtPGS and compared it with seven other PGS methods. The seven other methods include four univariate PGS methods (DBSLMM, SBLUP, PRS-CS, and MegaPRS) and three multivariate methods (wMT-SBLUP, MTAG, and XPXP). Simulation and comparison details are provided in the subjects and methods. Briefly, we used the genotypes from UKB to simulate two or more traits on a set of n = 12,000 individuals. We divided these individuals into three sets: a training set (n = 10,000), a validation set (n = 1,000), and a test set (n = 1,000). Among the simulated traits, we treated the first trait as the target trait and the remaining traits as the relevant traits. We then predicted the target trait either with (for multivariate PGS methods) or without (for univariate PGS methods) the relevant traits. In the main simulations, we primarily focused on settings with two traits, including one target and one relevant trait, and considered three different genetic architectures underlying the two traits: polygenic (scenario I), sparse (scenario II), and hybrid (scenario III). The genetic architectures differ in the number of causal SNPs and the SNP effect size distribution. In the main simulations, we varied the SNP heritability of each trait, the genetic correlation, and the environmental correlation for the two traits under each genetic architecture, resulting in a total of 108 simulation settings across three scenarios, with ten simulation replicates per setting. Besides the main settings with two traits, we also examined additional settings with more than two traits. An overview of the simulation settings analyzed in this paper is provided in Table S1.
In each simulation setting, we fitted different PGS methods on the training set using summary statistics, tuned hyper-parameters in the validation set if necessary, and evaluated their prediction performance in the test set by computing the mean R2 across replicates.
Figure 1.
Schematic of mtPGS
mtPGS performs multi-trait analysis to leverage multiple relevant traits for accurate and scalable PGS construction for a target trait of interest. mtPGS takes input of the GWAS summary statistics for each individual trait (top left) and an LD matrix calculated based on a reference panel (bottom left). mtPGS examines the target trait with the relevant traits one at a time, explicitly models the shared genetic architecture and environmental exposures between the target trait and the relevant trait through a bivariate modeling framework, and constructs a PGS for the target trait by leveraging the relevant trait (top middle). Afterward, mtPGS infers the weights for different PGSs constructed based on different relevant traits in the validation set through a weighted regression model (bottom middle), which allows mtPGS to combine the different PGSs into a single PGS as the final PGS for the target trait of interest (right).
Overall, the simulation results showed that mtPGS achieved improved prediction accuracy across the majority of simulation settings compared to existing PGS methods. Across the 108 main simulation settings, mtPGS was ranked as the most accurate method in 80 settings and was ranked as the second most accurate method in 24 settings (Figures 2 and S1–S17). In the 80 settings where mtPGS was the best, mtPGS achieved an average of 3.65% accuracy gain (median = 3.46%, range = 0.01%–13.53%) over the second-best method. In the 28 settings where mtPGS was not the best, on average, mtPGS was 2.85% less accurate (median = 1.91%, range = 0.14%–10.09%) than the best method. The performance of mtPGS is especially favored in settings where the genetic architecture underlying the traits has a polygenic component (Figure 2B). In particular, mtPGS was ranked as the best method in 33 out of the 36 polygenic settings, 29 (out of 36) hybrid settings, and 18 (out of 36) sparse settings. MegaPRS achieved the second-best prediction performance and was ranked as the best in 21 settings and the second-best in 34 settings. PRS-CS (best in 7 and second in 20) and XPXP (second in 27) also performed well across the simulation settings. DBSLMM was ranked as the second-best in three settings while SBLUP, wMT-SBLUP, and MTAG were neither the best nor the second-best in any of these simulation settings.
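The ranking and percentage comparisons reported above can be sketched as follows, assuming that the accuracy gain is the relative difference in mean R2 between two methods; the exact definition is not restated in this section, so this is an assumption. The function names are hypothetical, and the two values in the example are the mean R2 values for mtPGS and SBLUP quoted for one hybrid-scenario setting.

```python
def rank_methods(mean_r2):
    """Return method names sorted from most to least accurate."""
    return sorted(mean_r2, key=mean_r2.get, reverse=True)


def relative_gain_percent(value, reference):
    """Relative accuracy difference of `value` over `reference`, in percent."""
    return 100.0 * (value - reference) / reference


# Two mean R2 values taken from the text, used purely as an illustration.
example = {"mtPGS": 0.178, "SBLUP": 0.168}
ranking = rank_methods(example)
gain = relative_gain_percent(example["mtPGS"], example["SBLUP"])
```

Under this definition, a gain computed against the second-ranked method is positive whenever a method ranks first, and the corresponding loss against the best method is obtained by swapping the two arguments.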
Figure 2.
Predictive performance of eight PGS methods in simulations
(A) Jitter plots show the prediction R2 across ten simulation replicates for different PGS methods in each of the six simulation settings. Compared methods include mtPGS (light blue), DBSLMM (purple), SBLUP (dark pink), PRS-CS (green), MegaPRS (light pink), wMT-SBLUP (dark blue), MTAG (red), and XPXP (orange). Results are shown for the three scenarios in the complete-overlap study design with the heritability of the target trait set to 0.25 (first row) and 0.5 (second row) under the baseline setting (one relevant trait, with the heritability of the relevant trait, the genetic correlation, and the environmental correlation held at their baseline values). Solid line represents the mean R2 across ten replicates, with the numerical value also displayed above the line. Best ranking PGS method is denoted with double asterisks and second-best PGS method is denoted with single asterisk.
(B) Pie charts show the proportion of times each PGS method is ranked as the best across simulation settings/replicates for each of the three scenarios in the complete-overlap study design. Each scenario consists of 36 simulation settings, each with ten simulation replicates. The value in each slice represents the number of best-ranking settings for the corresponding PGS method.
One unique feature of mtPGS is its explicit modeling of the genetic correlation between the target trait and the relevant trait, which allows mtPGS to take advantage of the shared genetic architecture between the two traits to enhance PGS accuracy for the target trait. Indeed, the accuracy improvement brought by mtPGS over the other methods becomes more apparent with increasing genetic correlation between the two traits. For example, in the hybrid scenario of the main simulation, with the remaining parameters held fixed, mtPGS was 2.74%, 2.99%, and 4.31% more accurate than the second-best method at genetic correlations of 0.25, 0.50, and 0.75, respectively (Figures S2, S5, and S8). Across the main simulation settings, the percentage gain brought by mtPGS over the second-best method is also positively correlated with the genetic correlation (Spearman correlation = 0.21; p value = 2.87 × 10⁻²).
Another unique feature of mtPGS is its explicit modeling of the environmental correlation between the target trait and the relevant trait, which allows mtPGS to take advantage of the shared environmental effects in the presence of sample overlap to enhance PGS accuracy. To illustrate such benefits, besides the main simulations with completely overlapped samples, we also considered two additional study designs: a non-overlap study design where the two traits are measured on two different sets of n = 10,000 individuals and a partial-overlap study design where a fraction of the individuals (n = 7,000) were measured for both traits while the remaining individuals (n = 3,000 for each trait) were measured for only one trait. We found that mtPGS achieved robust predictive performance across settings with different study designs. Across the 12 simulation settings under these two study designs, mtPGS was ranked as the most accurate method in 8 settings (4 in non-overlap and 4 in partial-overlap) and was ranked as the second most accurate method in the remaining 4 settings (Figures 3A, S18, and S19). In the 8 simulation settings where mtPGS was the best, mtPGS outperformed the second-best method by an average of 8.24% (median = 5.69%, range = 1.01%–22.72%) in terms of mean R2 across the simulation replicates. In the 4 settings where mtPGS was the second-best, mtPGS was on average 3.98% (median = 4.08%, range = 1.85%–5.90%) less accurate than the best method. Notably, in the partial-overlap study design, an approximate version of mtPGS that makes use of two sets of GWAS summary statistics also achieved similar predictive performance as compared to the original version of mtPGS that uses four sets of GWAS summary statistics (Figure S20). 
In particular, using two sets of summary statistics, mtPGS incurred only an average of 2.76% accuracy loss (median = 2.37%, range = 0.15%–7.83%) across six settings of the partial-overlap study design as compared to using four sets of summary statistics. The relatively stable results suggest that the mtPGS modeling framework is reasonably robust with respect to the input data type in the scenarios with partially overlapped samples.
Figure 3.
Comparison of eight PGS methods in simulation settings with different sample overlap designs and with various numbers of relevant traits
Results are shown for simulation settings under the polygenic scenario.
(A) Jitter plots show the prediction R2 across ten replicates for eight PGS methods in non-overlap and partial-overlap study designs with the heritability of the target trait set to 0.25 (first row) and 0.5 (second row). Solid line represents the mean R2 across ten replicates, with the numerical value also displayed above the line. Best ranking PGS method is denoted with double asterisks, and second-best PGS method is denoted with single asterisk.
(B) Bar plot shows the mean R2 for four multivariate PGS methods across ten replicates (y axis) for settings with different numbers of relevant traits (x axis), with the numerical value also displayed on top of the bar.
A closer look at the simulation results provides us with further insights. First, the performance of the different PGS methods generally improved when the heritability of the target trait was increased from 0.25 to 0.5. The increase in performance is typically on the order of a 2- to 3-fold improvement for most PGS methods, with the only exception being SBLUP. For SBLUP, its performance was much lower than that of the other methods when the heritability was 0.25 but became comparable to the other methods when the heritability increased to 0.5. For example, in the baseline setting of the hybrid scenario, the mean R2 for SBLUP across simulation replicates in the complete-overlap study design was only 0.003 when the heritability was 0.25 and increased to 0.168 when the heritability was 0.5 (Figure 2A). As a comparison, the mean R2 values for mtPGS were 0.062 and 0.178 for settings with a heritability of 0.25 and 0.5, respectively. Second, the presence of sample overlap does not influence the performance of univariate PGS methods but decreases the performance of some multivariate PGS methods other than mtPGS. For example, in the baseline setting of the sparse scenario, wMT-SBLUP was 4.02% and 14.73% more accurate than the univariate method SBLUP in the non-overlap and partial-overlap study designs, respectively. However, it was 2.01% less accurate than SBLUP in the complete-overlap study design (Figures 2A and S18). The accuracy loss of wMT-SBLUP in settings with sample overlap may be attributed at least in part to its misspecification of the environmental covariance in these settings.
Finally, mtPGS can incorporate more than one relevant trait to facilitate the construction of PGSs for the target trait. To examine the performance of mtPGS in the presence of multiple relevant traits, we simulated one, two, or three relevant traits. In the simulations, the predictive performance of all multivariate PGS methods improved when the number of relevant traits M increased from one to three, though their relative rank did not change (Figure 3B). The improved predictive performance of multivariate PGS methods with an increasing number of relevant traits highlights the benefits of leveraging multiple relevant traits to improve the prediction of the target trait. Besides the main simulations, we also examined the influence of the number of traits on mtPGS accuracy and how such influence may vary under different genetic correlations between the target trait and the relevant traits. The results showed that, when the genetic correlation was 0.1 or 0.2, the prediction R2 of mtPGS first increased with an increasing number of relevant traits and then decreased when more relevant traits were included (Figure S21). However, when the genetic correlation was 0.5, the prediction R2 increased almost monotonically with an increasing number of relevant traits, which is expected as the relevant traits in this case are much more informative than those with a genetic correlation of 0.1 or 0.2. Consequently, when the genetic correlation is 0.5, the benefits of including even ten relevant traits still outweigh the potential bias introduced by the additional model parameters.
Cross-validation in UKB
We evaluated the performance of mtPGS and the other PGS methods by performing cross-validations for 25 traits, including 15 quantitative traits (details in Table S2) and 10 binary traits (details in Table S3), in the UKB. Analysis details are provided in the subjects and methods. Briefly, we focused on 361,112 European-ancestry individuals and a common set of 1,119,148 SNPs. We partitioned the individuals into a training set of 80% individuals, a validation set of 10% individuals, a test set of 10% individuals, and a reference panel of 500 individuals. For multivariate PGS methods, we treated each trait as the target trait and selected the relevant traits among the remaining 24 traits for analysis. The number of the relevant traits used for each target trait ranges from 1 to 17. For each trait in turn, we fitted PGS methods in the training set, tuned hyper-parameters in the validation set, and evaluated their performance in the test set by computing R2 for quantitative traits and AUC for binary traits.
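The two evaluation metrics can be sketched in plain Python as below. R2 is computed here as the squared Pearson correlation between the PGS and the phenotype, which is a common convention but an assumption in this context, and the AUC is computed by direct pairwise comparison rather than with the pROC package used in the study. Function names are hypothetical.

```python
def r_squared(pred, obs):
    """Squared Pearson correlation between predictions and observations."""
    n = len(pred)
    mp = sum(pred) / n
    mo = sum(obs) / n
    cov = sum((p - mp) * (o - mo) for p, o in zip(pred, obs))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_o = sum((o - mo) ** 2 for o in obs)
    return cov * cov / (var_p * var_o)


def auc(scores, labels):
    """Probability that a random case scores higher than a random control.

    labels: 1 for affected, 0 for unaffected; ties count as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The pairwise AUC is quadratic in the sample size, so for test sets of tens of thousands of individuals a rank-based formulation would be the more practical choice.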
Overall, mtPGS achieved the best performance among all PGS methods across traits. Specifically, mtPGS was ranked as the most accurate for 13 of the 15 quantitative traits (Figure 4) and for 7 of the 10 binary traits (Figure S22). It was also ranked as the second-best method for the remaining 2 quantitative traits and 3 binary traits. In the 20 traits where mtPGS was the best, compared with the second-best method, mtPGS achieved an average accuracy gain of 2.25% (median = 1.99%, range = 0.06%–5.28%) for quantitative traits and 0.89% (median = 0.67%, range = 0.15%–2.43%) for binary traits. In the 5 traits where mtPGS was the second-best, mtPGS was on average 5.23% (median = 5.23%, range = 2.46%–8.00%) less accurate than the best method for quantitative traits and 0.96% (median = 1.33%, range = 0.05%–1.49%) less accurate for binary traits. The performance of mtPGS was followed by MegaPRS (best for 2 quantitative traits; second-best for 5 quantitative traits and 1 binary trait), XPXP (second-best for 2 quantitative traits and 5 binary traits), DBSLMM (second-best for 5 quantitative traits and 1 binary trait), wMT-SBLUP (best for 2 binary traits; second-best for 1 quantitative trait), and PRS-CS (best for 1 binary trait). Both SBLUP and MTAG were neither the best nor the second-best in any of these traits. For quantitative traits, mtPGS on average outperformed DBSLMM, SBLUP, PRS-CS, MegaPRS, wMT-SBLUP, MTAG, and XPXP by 12.00%, 24.83%, 52.15%, 4.98%, 23.85%, 52.91%, and 16.21%, respectively. For binary traits, mtPGS outperformed these methods by an average of 2.56%, 8.73%, 5.06%, 4.32%, 5.06%, 7.89%, and 0.90%, respectively.
Figure 4.
Comparison of eight PGS methods for predicting 15 quantitative traits in UKB
Compared methods include mtPGS (light blue), DBSLMM (purple), SBLUP (dark pink), PRS-CS (green), MegaPRS (light pink), wMT-SBLUP (dark blue), MTAG (red), and XPXP (orange). Title in each panel shows the abbreviation of one of the 15 quantitative traits: fluid intelligence score (GF), birth weight (BW), body mass index (BMI), basal metabolic rate (BMR), age at first live birth (AFLB), red blood cell (erythrocyte) count (RBC), hemoglobin concentration (HGB), platelet count (PLT), platelet distribution width (PDW), lymphocyte count (LYMPH), neutrophil count (NEUT), forced vital capacity (FVC), forced expiratory volume in 1 s (FEV1), systolic blood pressure (SBP), and standing height (SH). Results are shown for the prediction of 15 quantitative traits in the UKB cross-validation, where univariate PGS methods were applied directly for the target trait and multivariate PGS methods were applied by performing multi-trait analysis with relevant traits. The bar plot in each panel shows the prediction R2 for each method in the test set, with the numerical value also displayed on top of the bar. Best ranking PGS method is denoted with double asterisks, and second-best PGS method is denoted with single asterisk.
A careful examination of the UKB results provides further insights. First, consistent with previous observations,45 the predictive performances of all PGS methods for quantitative traits were positively correlated with their SNP heritability (Pearson correlation ranges from 0.871 to 0.978 for quantitative traits, Figure S23). The correlations between predictive performance and SNP heritability for binary traits are less clear (Pearson correlation ranges from −0.075 to 0.587 for binary traits), likely due to the low disease prevalence in the UKB. Second, the predictive performance of PRS-CS in UKB is much worse than its performance in the simulations (Figures 4 and S22). In particular, PRS-CS became the worst method for 8 out of 15 quantitative traits and for 2 out of 10 binary traits. As a comparison, the performance of mtPGS in UKB is largely consistent with its performance in the simulations. The stable performance of mtPGS in UKB may partially be due to its analytic solutions, which circumvent the convergence issues encountered by MCMC-based PGS methods observed in previous studies.18,46 Finally, the prediction accuracy of mtPGS for the ten binary traits was almost identical whether we used summary statistics based on linear regression or logistic regression, with only an average of 0.01% accuracy loss when using summary statistics based on logistic regression compared to linear regression (Figure S24). The nearly identical results suggest that the mtPGS model fitting is reasonably robust with respect to how the summary statistics are generated.
Besides the main analyses of 25 traits, we also performed two additional analyses to examine the performance of mtPGS in the UKB. First, we evaluated the additional accuracy gain brought by the step of combining PGSs from multiple relevant traits for mtPGS. Here, we focused on the 14 target quantitative traits and 6 target binary traits for which we identified more than one relevant trait. The results showed that the step of combining weighted PGSs improved accuracy on top of the pairwise PGS in 13 out of the 14 quantitative traits and in 5 out of the 6 binary traits, with an average accuracy gain of 4.59% (range = −1.35% to +17.74%) across all 14 quantitative traits and 0.90% (range = −0.26% to +2.90%) across all 6 binary traits (Figure S25). Therefore, the step of combining PGSs from multiple relevant traits provided additional improvement on mtPGS accuracy, with the accuracy gain observed in the real data similar to what was observed in the simulations. Second, we combined PGSs from other PGS methods based on the same set of relevant traits using the same linear aggregation method as we did for mtPGS. The results showed that most PGS methods achieved improved predictive performance when different PGSs were combined using the linear aggregation method (Figures S26 and S27). Specifically, compared to the original univariate or multivariate PGS, the weighted PGS of all methods achieved improved accuracy for the 15 quantitative traits, with the average accuracy gain ranging from 1.52% to 9.73%. In addition, the weighted PGS of all methods, with the only exception of wMT-SBLUP, achieved improved accuracy for the 10 binary traits, with the average accuracy gain ranging from −0.09% to 3.66%. 
These results dovetail with the conclusions from recent studies that a weighted linear combination of PGSs can further improve predictive performance for both univariate and multivariate methods.59,60 Compared to the weighted PGSs from the other methods, mtPGS still achieved the best performance among all PGS methods across traits. Specifically, for the 15 quantitative traits, mtPGS on average outperformed the four univariate methods DBSLMM, SBLUP, PRS-CS, and MegaPRS by 8.22%, 19.95%, 42.40%, and 3.40%, respectively, and outperformed the three multivariate methods wMT-SBLUP, MTAG, and XPXP by 19.99%, 42.82%, and 5.84%, respectively. For the 10 binary traits, mtPGS on average outperformed the four univariate methods by 1.02%, 4.95%, 4.22%, and 2.12%, respectively, and outperformed the three multivariate methods by 5.14%, 7.52%, and 0.66%, respectively.
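The linear aggregation of PGSs discussed above can be illustrated with a minimal sketch. It assumes that the combination weights are fit by ordinary least squares in a validation set, which is one common way to build a weighted linear combination and is not necessarily the exact procedure used in mtPGS. All function names and the toy numbers are hypothetical.

```python
# Hypothetical sketch: combine K PGSs for one target trait with a weighted
# linear combination. Weights are fit by ordinary least squares (an assumption).

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the PGSs may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_pgs_weights(pgs_matrix, phenotype):
    """Return [intercept, w_1, ..., w_K] from the normal equations.

    pgs_matrix: one row per validation individual, each row holding K PGS values.
    phenotype:  observed trait values for the same individuals.
    """
    design = [[1.0] + [float(v) for v in row] for row in pgs_matrix]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, phenotype)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def combine_pgs(pgs_matrix, weights):
    """Apply fitted weights to new individuals to obtain the combined PGS."""
    return [
        weights[0] + sum(w * v for w, v in zip(weights[1:], row))
        for row in pgs_matrix
    ]


# Toy validation data (made up): two PGSs per individual.
validation_pgs = [[1, 0], [0, 1], [1, 1], [2, 1]]
validation_y = [2.5, -0.5, 1.5, 3.5]
weights = fit_pgs_weights(validation_pgs, validation_y)
combined_test_pgs = combine_pgs([[3, 2]], weights)
```

In practice the weights would be estimated in a validation set that is separate from the test set in which accuracy is reported, so that the combination step does not overfit.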
Finally, we examined the computing time and memory cost of different PGS methods. mtPGS compares favorably to the other PGS methods with relatively low computing time and memory cost (Table S4).
Discussion
We have presented mtPGS, a PGS method that leverages multiple relevant traits for accurate and scalable PGS construction for a target trait of interest. mtPGS not only flexibly models the shared genetic architecture between the target trait and the relevant traits but also explicitly models their shared environmental architecture, thus achieving accurate and robust prediction performance across a wide range of shared genetic architectures and study designs with distinct sample overlaps. In addition, mtPGS requires only summary statistics for model fitting and relies on a deterministic inference algorithm to achieve scalable computation. We have illustrated the benefits of mtPGS through simulations and applications to 25 UKB traits.
mtPGS flexibly models the shared genetic architecture between the target trait and the relevant traits by categorizing SNPs into different categories based on their effect sizes. Such SNP categorization allows mtPGS to introduce distinct shrinkage on the estimated effect sizes in different SNP categories, achieving proper shrinkage of small effect estimates while avoiding over-shrinkage of large effect estimates. To achieve SNP categorization, mtPGS incorporates a clumping/thresholding-based searching strategy to select SNPs with potentially large effect sizes. While such a searching strategy is effective and efficient, we note that more sophisticated selection approaches may improve the accuracy of mtPGS further. For example, instead of applying the clumping/thresholding strategy to the marginal GWAS summary statistics obtained from univariate regression for the target trait, one may apply the same strategy to GWAS summary statistics obtained from multivariate regression to improve the detection of large-effect SNPs.61 In addition, fine-mapping methods such as SuSiE62 and FINEMAP63 may also be used to identify SNPs with potentially large effects. As the analytic solution in mtPGS can be obtained for any searching strategy, pairing mtPGS with more powerful searching strategies in the future may lead to improved PGS accuracy.
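The general idea of a clumping/thresholding search can be sketched as follows. This is a simplified, hypothetical illustration rather than the mtPGS implementation: it ignores the physical distance window that clumping tools typically apply, and the SNP names, thresholds, and LD values are made up for the example.

```python
# Hypothetical sketch of greedy clumping + thresholding (C+T) to pick
# candidate large-effect SNPs from marginal GWAS p-values.

def clump_and_threshold(snp_ids, p_values, ld_r2, p_threshold=5e-8, r2_threshold=0.1):
    """Select approximately independent SNPs that pass a p-value threshold.

    snp_ids:      list of SNP identifiers.
    p_values:     marginal association p-values, same order as snp_ids.
    ld_r2:        dict mapping frozenset({snp_a, snp_b}) -> r^2;
                  pairs that are absent are treated as r^2 = 0.
    p_threshold:  keep only SNPs with p < p_threshold (illustrative default).
    r2_threshold: drop a SNP if its r^2 with an already selected SNP
                  exceeds this value (illustrative default).
    """
    # Thresholding: keep significant SNPs, ordered from most to least significant.
    candidates = sorted(
        (p, snp) for snp, p in zip(snp_ids, p_values) if p < p_threshold
    )
    selected = []
    # Clumping: accept a SNP only if it is not in LD with any selected SNP.
    for _, snp in candidates:
        in_ld = any(
            ld_r2.get(frozenset((snp, kept)), 0.0) > r2_threshold
            for kept in selected
        )
        if not in_ld:
            selected.append(snp)
    return selected


# Toy example with made-up values.
toy_snps = ["snpA", "snpB", "snpC", "snpD"]
toy_p = [1e-12, 1e-9, 1e-10, 0.01]
toy_ld = {frozenset(("snpA", "snpB")): 0.8, frozenset(("snpB", "snpC")): 0.05}
large_effect_candidates = clump_and_threshold(toy_snps, toy_p, toy_ld)
```

In this toy example snpB is dropped because it is in strong LD with the more significant snpA, and snpD is dropped because it does not pass the p-value threshold. The SNPs returned by such a search would form the candidate large-effect category, with the remaining SNPs treated as small-effect SNPs.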
We have examined the performance of mtPGS in incorporating multiple relevant traits for predicting the target trait of interest. Because mtPGS uses a pairwise modeling framework to incorporate the relevant traits one at a time, there is in principle no limit to the number of relevant traits that can be jointly analyzed using mtPGS. However, careful selection of the relevant traits may be necessary to ensure that the included traits are informative and to preserve subsequent prediction accuracy.44,51 Selecting appropriate relevant traits can be achieved either based on prior biological knowledge or by performing genetic correlation analysis as done in the present study. With the selected relevant traits, mtPGS relies on the pre-computed genetic and environmental correlation estimates from GECKO, which accounts for the environmental covariance among traits, to model the genetic correlation among traits. Using the pre-computed correlation estimates ensures scalable computation in mtPGS. However, the predictive performance of mtPGS is likely subject to the estimation accuracy of the genetic correlation estimates obtained from GECKO. Consequently, exploring the use of other methods for genetic covariance estimation or exploring the use of local genetic covariance estimation approaches such as SUPERGNOVA64 and LAVA65 may improve the performance of mtPGS further.
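As a rough illustration of selecting relevant traits from a genetic correlation analysis, the sketch below filters candidate traits by the magnitude and significance of their estimated genetic correlation with the target trait. The cutoffs, trait names, and estimates are hypothetical and are not the criteria used in the present study.

```python
# Hypothetical sketch: pick relevant traits from pre-computed genetic
# correlation estimates with the target trait.

def select_relevant_traits(rg_estimates, min_abs_rg=0.2, max_p=0.05):
    """Return trait names ordered by decreasing |genetic correlation|.

    rg_estimates: dict mapping trait name -> (rg_estimate, p_value).
    min_abs_rg:   minimum absolute genetic correlation (illustrative cutoff).
    max_p:        maximum p-value for the correlation (illustrative cutoff).
    """
    chosen = [
        (trait, rg)
        for trait, (rg, p_value) in rg_estimates.items()
        if abs(rg) >= min_abs_rg and p_value <= max_p
    ]
    chosen.sort(key=lambda item: abs(item[1]), reverse=True)
    return [trait for trait, _ in chosen]


# Toy estimates (made up) for four candidate traits.
toy_rg = {
    "trait_A": (0.60, 1e-10),
    "trait_B": (-0.35, 1e-4),
    "trait_C": (0.05, 0.40),   # too weakly correlated
    "trait_D": (0.50, 0.20),   # correlation not significant
}
relevant_traits = select_relevant_traits(toy_rg)
```

Under the pairwise framework described above, each selected trait would then be paired with the target trait in turn.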
While we have primarily focused on unrelated individuals in both simulations and real data applications in the present study, we acknowledge that individual relatedness can occur in some human GWASs and is especially prevalent in animal breeding programs.66 Prediction with PGSs in the presence of related individuals is often carried out based on models that are relevant to or are extensions of linear mixed effect models (LMMs) and best linear unbiased predictors (BLUPs).36 For example, BSLMM is an extension of an LMM and has been widely used for predictions in animal breeding programs with heavy individual relatedness as well as in human GWASs with related individuals.19,66,67,68 As a multivariate extension of BSLMM, mtPGS models small-effect SNPs to capture the shared polygenic components underlying the traits and thus the individual relatedness. Consequently, mtPGS is expected to capture well the individual relatedness that may be encountered in certain GWASs. Future efforts are needed to validate this expectation in GWASs with related individuals.
The modeling framework of mtPGS is general and can be extended to incorporate external information such as SNP functional annotations to further improve prediction accuracy. Specifically, we can modify the prior assumption on the effect size of each SNP to be dependent on its functional annotations and then rely on the data to inform us of the contribution from each annotation. Incorporating SNP functional annotations into the prior SNP effect size distribution and inferring their contribution is known to improve prediction performance.16,69,70 Unfortunately, incorporating SNP functional annotations increases the mtPGS model complexity, making it challenging to obtain an analytic solution for the SNP effect size estimates and thus likely requiring future development of alternative fitting algorithms.
Data and code availability
The UK Biobank data are from UK Biobank resource under application number 24460. mtPGS is implemented in the mtPGS software, freely available at https://github.com/xuchang0201/mtPGS.
Acknowledgments
This study was supported by the National Institutes of Health (NIH) grant R01HG009124. This study has been conducted using UK Biobank resources under application number 24460. UK Biobank was established by the Wellcome Trust medical charity, Medical Research Council, Department of Health, Scottish Government, and the Northwest Regional Development Agency. It has also had funding from the Welsh Assembly Government, British Heart Foundation, and Diabetes UK. The authors also thank Dr. Kirsten Herold at the UM-SPH writing lab for her helpful editorial suggestions.
Declaration of interests
The authors declare no competing interests. S.K.G. is a non-compensated member of the Medical Advisory Board of the FMD Society of America (FMDSA) and the Scientific Advisory Board of SCAD Alliance. Both organizations are non-profit institutions.
Published: September 15, 2023
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2023.08.016.
Web resources
MegaPRS, https://dougspeed.com/megaprs/
UK Biobank-Neale lab, http://www.nealelab.is/uk-biobank
UK Biobank, https://www.ukbiobank.ac.uk/
wMT-SBLUP, https://github.com/uqrmaie1/smtpred
References
- 1. de los Campos G., Vazquez A.I., Hsu S., Lello L. Complex-Trait Prediction in the Era of Big Data. Trends Genet. 2018;34:746–754. doi: 10.1016/j.tig.2018.07.004.
- 2. Khera A.V., Chaffin M., Wade K.H., Zahid S., Brancale J., Xia R., Distefano M., Senol-Cosar O., Haas M.E., Bick A., et al. Polygenic Prediction of Weight and Obesity Trajectories from Birth to Adulthood. Cell. 2019;177:587–596.e9. doi: 10.1016/j.cell.2019.03.028.
- 3. Visscher P.M., Wray N.R., Zhang Q., Sklar P., McCarthy M.I., Brown M.A., Yang J. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 2017;101:5–22. doi: 10.1016/j.ajhg.2017.06.005.
- 4. Visscher P.M., Brown M.A., McCarthy M.I., Yang J. Five years of GWAS discovery. Am. J. Hum. Genet. 2012;90:7–24. doi: 10.1016/j.ajhg.2011.11.029.
- 5. Loos R.J.F. 15 years of genome-wide association studies and no signs of slowing down. Nat. Commun. 2020;11:5900. doi: 10.1038/s41467-020-19653-5.
- 6. Mavaddat N., Michailidou K., Dennis J., Lush M., Fachal L., Lee A., Tyrer J.P., Chen T.-H., Wang Q., Bolla M.K., et al. Polygenic Risk Scores for Prediction of Breast Cancer and Breast Cancer Subtypes. Am. J. Hum. Genet. 2019;104:21–34. doi: 10.1016/j.ajhg.2018.11.002.
- 7. So H.C., Kwan J.S.H., Cherny S.S., Sham P.C. Risk prediction of complex diseases from family history and known susceptibility loci, with applications for cancer screening. Am. J. Hum. Genet. 2011;88:548–565. doi: 10.1016/j.ajhg.2011.04.001.
- 8. Lewis C.M., Vassos E. Polygenic risk scores: From research tools to clinical instruments. Genome Med. 2020;12:44. doi: 10.1186/s13073-020-00742-5.
- 9. Gibson G. On the utilization of polygenic risk scores for therapeutic targeting. PLoS Genet. 2019;15:e1008060. doi: 10.1371/journal.pgen.1008060.
- 10. Ibanez L., Farias F.H.G., Dube U., Mihindukulasuriya K.A., Harari O. Polygenic Risk Scores in Neurodegenerative Diseases: a Review. Curr. Genet. Med. Rep. 2019;7:22–29. doi: 10.1007/s40142-019-0158-0.
- 11. Lloyd-Jones L.R., Zeng J., Sidorenko J., Yengo L., Moser G., Kemper K.E., Wang H., Zheng Z., Magi R., Esko T., et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat. Commun. 2019;10:5086. doi: 10.1038/s41467-019-12653-0.
- 12. Zeng P., Zhou X. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nat. Commun. 2017;8:456. doi: 10.1038/s41467-017-00470-2.
- 13. VanRaden P.M. Efficient methods to compute genomic predictions. J. Dairy Sci. 2008;91:4414–4423. doi: 10.3168/jds.2007-0980.
- 14. Robinson M.R., Kleinman A., Graff M., Vinkhuyzen A.A.E., Couper D., Miller M.B., Peyrot W.J., Abdellaoui A., Zietsch B.P., Nolte I.M., et al. Genetic evidence of assortative mating in humans. Nat. Hum. Behav. 2017;1. doi: 10.1038/s41562-016-0016.
- 15. Ge T., Chen C.Y., Ni Y., Feng Y.C.A., Smoller J.W. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 2019;10:1776. doi: 10.1038/s41467-019-09718-5.
- 16. Hu Y., Lu Q., Liu W., Zhang Y., Li M., Zhao H. Joint modeling of genetically correlated diseases and functional annotations increases accuracy of polygenic risk prediction. PLoS Genet. 2017;13:e1006836. doi: 10.1371/journal.pgen.1006836.
- 17. Zhao Z., Yi Y., Song J., Wu Y., Zhong X., Lin Y., Hohman T.J., Fletcher J., Lu Q. PUMAS: fine-tuning polygenic risk scores with GWAS summary statistics. Genome Biol. 2021;22:257. doi: 10.1186/s13059-021-02479-9.
- 18. Privé F., Vilhjálmsson B.J., Aschard H., Blum M.G.B. Making the Most of Clumping and Thresholding for Polygenic Scores. Am. J. Hum. Genet. 2019;105:1213–1221. doi: 10.1016/j.ajhg.2019.11.001.
- 19. Zhou X., Carbonetto P., Stephens M. Polygenic Modeling with Bayesian Sparse Linear Mixed Models. PLoS Genet. 2013;9:e1003264. doi: 10.1371/journal.pgen.1003264.
- 20. Euesden J., Lewis C.M., O’Reilly P.F. PRSice: Polygenic Risk Score software. Bioinformatics. 2015;31:1466–1468. doi: 10.1093/bioinformatics/btu848.
- 21. Vilhjálmsson B.J., Yang J., Finucane H.K., Gusev A., Lindström S., Ripke S., Genovese G., Loh P.R., Bhatia G., Do R., et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am. J. Hum. Genet. 2015;97:576–592. doi: 10.1016/j.ajhg.2015.09.001.
- 22. Hu Y., Lu Q., Powles R., Yao X., Yang C., Fang F., Xu X., Zhao H. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS Comput. Biol. 2017;13:e1005589. doi: 10.1371/journal.pcbi.1005589.
- 23. Márquez-Luna C., Gazal S., Loh P.R., Kim S.S., Furlotte N., Auton A., 23andMe Research Team. Price A.L., Bell R.K., Bryc K., et al. Incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. Nat. Commun. 2021;12:6052. doi: 10.1038/s41467-021-25171-9.
- 24. Choi S.W., O’Reilly P.F. PRSice-2: Polygenic Risk Score software for biobank-scale data. GigaScience. 2019;8. doi: 10.1093/gigascience/giz082.
- 25. International Schizophrenia Consortium. Purcell S.M., Wray N.R., Stone J.L., O’Donovan M.C., Sullivan P.F., Sklar P., Morris D.W., O’Dushlaine C.T., et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–752. doi: 10.1038/nature08185.
- 26. Ding Y., Hou K., Burch K.S., Lapinska S., Privé F., Vilhjálmsson B., Sankararaman S., Pasaniuc B. Large uncertainty in individual polygenic risk score estimation impacts PRS-based risk stratification. Nat. Genet. 2022;54:30–39. doi: 10.1038/s41588-021-00961-5.
- 27. Katz A.E., Yang M.-L., Levin M.G., Tcheandjieu C., Mathis M., Hunker K., Blackburn S., Eliason J.L., Coleman D.M., Fendrikova-Mahlay N., et al. Fibromuscular Dysplasia and Abdominal Aortic Aneurysms Are Dimorphic Sex-Specific Diseases With Shared Complex Genetic Architecture. Circ. Genom. Precis. Med. 2022;15:e003496. doi: 10.1161/circgen.121.003496.
- 28. Saw J., Yang M.L., Trinder M., Tcheandjieu C., Xu C., Starovoytov A., Birt I., Mathis M.R., Hunker K.L., Schmidt E.M., et al. Chromosome 1q21.2 and additional loci influence risk of spontaneous coronary artery dissection and myocardial infarction. Nat. Commun. 2020;11:4432. doi: 10.1038/s41467-020-17558-x.
- 29. Fritsche L.G., Gruber S.B., Wu Z., Schmidt E.M., Zawistowski M., Moser S.E., Blanc V.M., Brummett C.M., Kheterpal S., Abecasis G.R., Mukherjee B. Association of Polygenic Risk Scores for Multiple Cancers in a Phenome-wide Study: Results from The Michigan Genomics Initiative. Am. J. Hum. Genet. 2018;102:1048–1061. doi: 10.1016/j.ajhg.2018.04.001.
- 30. Gusev A., Ko A., Shi H., Bhatia G., Chung W., Penninx B.W.J.H., Jansen R., De Geus E.J.C., Boomsma D.I., Wright F.A., et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 2016;48:245–252. doi: 10.1038/ng.3506.
- 31. Gamazon E.R., Wheeler H.E., Shah K.P., Mozaffari S.V., Aquino-Michaels K., Carroll R.J., Eyler A.E., Denny J.C., GTEx Consortium. Nicolae D.L., et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 2015;47:1091–1098. doi: 10.1038/ng.3367.
- 32. Nagpal S., Meng X., Epstein M.P., Tsoi L.C., Patrick M., Gibson G., De Jager P.L., Bennett D.A., Wingo A.P., Wingo T.S., Yang J. TIGAR: An Improved Bayesian Tool for Transcriptomic Data Imputation Enhances Gene Mapping of Complex Traits. Am. J. Hum. Genet. 2019;105:258–266. doi: 10.1016/j.ajhg.2019.05.018.
- 33. Richardson T.G., Harrison S., Hemani G., Smith G.D. An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome. Elife. 2019;8:e43657. doi: 10.7554/eLife.43657.001.
- 34. Shen X., Howard D.M., Adams M.J., Hill W.D., Clarke T.K., Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium. Deary I.J., Whalley H.C., McIntosh A.M., Mattheisen M., et al. A phenome-wide association and Mendelian Randomisation study of polygenic risk for depression in UK Biobank. Nat. Commun. 2020;11:2301. doi: 10.1038/s41467-020-16022-0.
- 35. Choi S.W., Mak T.S.H., O’Reilly P.F. Tutorial: a guide to performing polygenic risk score analyses. Nat. Protoc. 2020;15:2759–2772. doi: 10.1038/s41596-020-0353-1.
- 36. Ma Y., Zhou X. Genetic prediction of complex traits with polygenic scores: a statistical review. Trends Genet. 2021;37:995–1011. doi: 10.1016/j.tig.2021.06.004.
- 37. Li C., Yang C., Gelernter J., Zhao H. Improving genetic risk prediction by leveraging pleiotropy. Hum. Genet. 2014;133:639–650. doi: 10.1007/s00439-013-1401-5.
- 38. Maier R., Moser G., Chen G.B., Ripke S., Cross-Disorder Working Group of the Psychiatric Genomics Consortium. Coryell W., Potash J.B., Scheftner W.A., Shi J., Weissman M.M., et al. Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am. J. Hum. Genet. 2015;96:283–294. doi: 10.1016/j.ajhg.2014.12.006.
- 39. Chung W., Chen J., Turman C., Lindstrom S., Zhu Z., Loh P.R., Kraft P., Liang L. Efficient cross-trait penalized regression increases prediction accuracy in large cohorts using secondary phenotypes. Nat. Commun. 2019;10:569. doi: 10.1038/s41467-019-08535-0.
- 40. Maier R.M., Zhu Z., Lee S.H., Trzaskowski M., Ruderfer D.M., Stahl E.A., Ripke S., Wray N.R., Yang J., Visscher P.M., Robinson M.R. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nat. Commun. 2018;9:989. doi: 10.1038/s41467-017-02769-6.
- 41. Turley P., Walters R.K., Maghzian O., Okbay A., Lee J.J., Fontana M.A., Nguyen-Viet T.A., Wedow R., Zacher M., Furlotte N.A., et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 2018;50:229–237. doi: 10.1038/s41588-017-0009-4.
- 42. Xiao J., Cai M., Hu X., Wan X., Chen G., Yang C. XPXP: Improving polygenic prediction by cross-population and cross-phenotype analysis. Bioinformatics. 2022;38:1947–1955. doi: 10.1093/bioinformatics/btac029.
- 43. Zhou X., Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat. Methods. 2014;11:407–409. doi: 10.1038/nmeth.2848.
- 44. Gao B., Yang C., Liu J., Zhou X. Accurate genetic and environmental covariance estimation with composite likelihood in genome-wide association studies. PLoS Genet. 2021;17:e1009293. doi: 10.1371/journal.pgen.1009293.
- 45. Yang S., Zhou X. Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. Am. J. Hum. Genet. 2020;106:679–693. doi: 10.1016/j.ajhg.2020.03.013.
- 46. Pain O., Glanville K.P., Hagenaars S.P., Selzam S., Fürtjes A.E., Gaspar H.A., Coleman J.R.I., Rimfeld K., Breen G., Plomin R., et al. Evaluation of polygenic prediction methodology within a reference-standardized framework. PLoS Genet. 2021;17:e1009021. doi: 10.1371/journal.pgen.1009021.
- 47. Sudlow C., Gallacher J., Allen N., Beral V., Burton P., Danesh J., Downey P., Elliott P., Green J., Landray M., et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779.
- 48. Chen Z., Chen J., Collins R., Guo Y., Peto R., Wu F., Li L., China Kadoorie Biobank CKB collaborative group. Yang X., Williams A., et al. China Kadoorie Biobank of 0.5 million people: Survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol. 2011;40:1652–1666. doi: 10.1093/ije/dyr120.
- 49. Nagai A., Hirata M., Kamatani Y., Muto K., Matsuda K., Kiyohara Y., Ninomiya T., Tamakoshi A., Yamagata Z., Mushiroda T., et al. Overview of the BioBank Japan Project: Study design and profile. J. Epidemiol. 2017;27:S2–S8. doi: 10.1016/j.je.2016.12.005.
- 50. Locke A.E., Steinberg K.M., Chiang C.W.K., Service S.K., Havulinna A.S., Stell L., Pirinen M., Abel H.J., Chiang C.C., Fulton R.S., et al. Exome sequencing of Finnish isolates enhances rare-variant association power. Nature. 2019;572:323–328. doi: 10.1038/s41586-019-1457-z.
- 51. Bulik-Sullivan B., Finucane H.K., Anttila V., Gusev A., Day F.R., Loh P.R., ReproGen Consortium. Psychiatric Genomics Consortium. Genetic Consortium for Anorexia Nervosa of the Wellcome Trust Case Control Consortium 3. Duncan L., et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 2015;47:1236–1241. doi: 10.1038/ng.3406.
- 52. Berisa T., Pickrell J.K. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016;32:283–285. doi: 10.1093/bioinformatics/btv546.
- 53. Kaasschieter E.F. Preconditioned conjugate gradients for solving singular systems. J. Comput. Appl. Math. 1988;24:265–275.
- 54. Chung W. Statistical models and computational tools for predicting complex traits and diseases. Genomics Inform. 2021;19:e36. doi: 10.5808/gi.21053.
- 55. Chang C.C., Chow C.C., Tellier L.C., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8.
- 56. Zhang Q., Privé F., Vilhjálmsson B., Speed D. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nat. Commun. 2021;12:4192. doi: 10.1038/s41467-021-24485-y.
- 57. Zabad S., Gravel S., Li Y. Fast and accurate Bayesian polygenic risk modeling with variational inference. Am. J. Hum. Genet. 2023;110:741–761. doi: 10.1016/j.ajhg.2023.03.009.
- 58. Morrison J., Knoblauch N., Marcus J.H., Stephens M., He X. Mendelian randomization accounting for correlated and uncorrelated pleiotropic effects using genome-wide summary statistics. Nat. Genet. 2020;52:740–747. doi: 10.1038/s41588-020-0631-4.
- 59. Yang S., Zhou X. PGS-server: Accuracy, robustness and transferability of polygenic score methods for biobank scale studies. Brief. Bioinform. 2022;23. doi: 10.1093/bib/bbac039.
- 60. Albiñana C., Zhu Z., Schork A.J., Ingason A., Aschard H., Brikell I., Bulik C.M., Petersen L.V., Agerbo E., Grove J., et al. Multi-PGS enhances polygenic prediction: weighting 937 polygenic scores. Preprint at medRxiv. doi: 10.1101/2022.09.14.22279940.
- 61. Stephens M. A Unified Framework for Association Analysis with Multiple Related Phenotypes. PLoS One. 2013;8:e65245. doi: 10.1371/journal.pone.0065245.
- 62. Weissbrod O., Kanai M., Shi H., Gazal S., Peyrot W.J., Khera A.V., Okada Y., Biobank Japan Project. Martin A.R., Finucane H.K., Price A.L. Leveraging fine-mapping and multipopulation training data to improve cross-population polygenic risk scores. Nat. Genet. 2022;54:450–458. doi: 10.1038/s41588-022-01036-9.
- 63. Benner C., Spencer C.C.A., Havulinna A.S., Salomaa V., Ripatti S., Pirinen M. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics. 2016;32:1493–1501. doi: 10.1093/bioinformatics/btw018.
- 64. Zhang Y., Lu Q., Ye Y., Huang K., Liu W., Wu Y., Zhong X., Li B., Yu Z., Travers B.G., et al. SUPERGNOVA: local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits. Genome Biol. 2021;22:262. doi: 10.1186/s13059-021-02478-w.
- 65. Werme J., van der Sluis S., Posthuma D., de Leeuw C.A. An integrated framework for local genetic correlation analysis. Nat. Genet. 2022;54:274–282. doi: 10.1038/s41588-022-01017-y.
- 66. Wray N.R., Kemper K.E., Hayes B.J., Goddard M.E., Visscher P.M. Complex trait prediction from genome data: Contrasting EBV in livestock to PRS in humans. Genetics. 2019;211:1131–1141. doi: 10.1534/genetics.119.301859.
- 67. Lloyd-Jones L.R., Robinson M.R., Moser G., Zeng J., Beleza S., Barsh G.S., Tang H., Visscher P.M. Inference on the genetic basis of eye and skin color in an admixed population via bayesian linear mixed models. Genetics. 2017;206:1113–1126. doi: 10.1534/genetics.116.193383.
- 68. Gualdrón Duarte J.L., Gori A.S., Hubin X., Lourenco D., Charlier C., Misztal I., Druet T. Performances of Adaptive MultiBLUP, Bayesian regressions, and weighted-GBLUP approaches for genomic predictions in Belgian Blue beef cattle. BMC Genom. 2020;21:545. doi: 10.1186/s12864-020-06921-3.
- 69. Chen T.H., Chatterjee N., Landi M.T., Shi J. A Penalized Regression Framework for Building Polygenic Risk Models Based on Summary Statistics From Genome-Wide Association Studies and Incorporating External Information. J. Am. Stat. Assoc. 2021;116:133–143. doi: 10.1080/01621459.2020.1764849.
- 70. Lu Q., Li B., Ou D., Erlendsdottir M., Powles R.L., Jiang T., Hu Y., Chang D., Jin C., Dai W., et al. A Powerful Approach to Estimating Annotation-Stratified Genetic Covariance via GWAS Summary Statistics. Am. J. Hum. Genet. 2017;101:939–964. doi: 10.1016/j.ajhg.2017.11.001.