A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics

Geyu Zhou; Hongyu Zhao

doi:10.1371/journal.pgen.1009697

2021 Jul 26;17(7):e1009697. doi: 10.1371/journal.pgen.1009697


Editor: Doug Speed

PMCID: PMC8341714 PMID: 34310601

Abstract

Genetic prediction of complex traits has great promise for disease prevention, monitoring, and treatment. The development of accurate risk prediction models is hindered by the wide diversity of genetic architecture across different traits, limited access to individual level data for training and parameter tuning, and the demand for computational resources. To overcome the limitations of most existing methods, which make explicit assumptions on the underlying genetic architecture and need a separate validation data set for parameter tuning, we develop a summary statistics-based nonparametric method that does not rely on validation datasets to tune parameters. In our implementation, we refine the commonly used likelihood assumption to deal with the discrepancy between summary statistics and the external reference panel. We also leverage the block structure of the reference linkage disequilibrium matrix for implementation of a parallel algorithm. Through simulations and applications to twelve traits, we show that our method is adaptive to different genetic architectures, statistically robust, and computationally efficient. Our method is available at https://github.com/eldronzhou/SDPR.

Author summary

Recently there has been much interest in predicting an individual’s phenotype from genetic information, which has great promise for disease prevention, monitoring, and treatment. It has been found that there is great variation in the genetic architecture underlying different complex traits, including the number of genetic variants involved and the distribution of the effect sizes of genetic variants. How to model such genetic contribution is a key aspect for accurate prediction of complex traits. So far, most existing methods make specific assumptions about the shape of the genetic contribution. If these assumptions are not correct, the prediction accuracy might be compromised. Here we propose a method that learns the shape of the genetic contribution without making any explicit assumptions. We found that our method achieved robust performance when compared with other recently developed methods through simulation and real data analysis. Our method is also more feasible in practice, since it supports the use of public summary statistics and consumes only a small amount of computational resources.

Introduction

Results from large-scale genome-wide association studies (GWAS) offer valuable information to predict personal traits based on genetic markers through polygenic risk scores (PRS) calculated from different methods. For one individual, PRS is typically calculated as the linear sum of the number of the risk alleles weighted by the effect size for each marker, such as single nucleotide polymorphism (SNP) [1]. PRS has gained great interest recently due to its demonstrated ability to identify individuals with higher disease risk for more effective prevention and monitoring [2].
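As a minimal sketch of the weighted sum described above, the following Python function computes a PRS for one individual. The function name, the genotype coding (0, 1, or 2 copies of the risk allele), and the example values are illustrative assumptions, not taken from any particular software.

```python
def polygenic_risk_score(allele_counts, effect_sizes):
    """Weighted sum of risk-allele counts (0, 1, or 2 per SNP) for one individual."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("one effect size is needed per SNP")
    return sum(g * b for g, b in zip(allele_counts, effect_sizes))


# Hypothetical example: three SNPs with made-up effect sizes.
genotype = [2, 0, 1]
weights = [0.10, -0.05, 0.20]
score = polygenic_risk_score(genotype, weights)
```

The methods compared in this paper differ only in how the per-SNP weights are estimated; the scoring step itself stays the same.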

Appropriate construction of PRS requires the development of statistical methods to jointly estimate the effect sizes of all genetic markers in an accurate and efficient way. Statistical challenges associated with the design of PRS methods largely reside in how to account for linkage disequilibrium (LD) among the markers and how to capture the genetic architecture of traits. Meanwhile, practical issues to be addressed include making use of summary statistics as input, as well as reducing the computational burden.

One simple method to compute PRS is to use a subset of SNPs in GWAS summary statistics formed by pruning out SNPs in LD and selecting those below a p value threshold (P+T) [1]. P+T is computationally efficient, though the prediction accuracy can usually be improved by using more sophisticated methods [3]. At present, most of the existing methods that allow the use of summary statistics as input assume a prior distribution on the effect sizes of the SNPs in the genome and fit the model under the Bayesian framework. Methods differ in the choice of the prior distribution. For example, LDpred and LDpred2 assume a point-normal mixture distribution or a single normal distribution [3,4]. SBayesR assumes a mixture of three normal distributions with a point mass at zero [5]. PRS-CS proposes a conceptually different class of continuous shrinkage priors [6]. In reality, there is wide diversity in the distribution of effect sizes for complex traits [7]. Therefore, choosing a specific parametric prior may lead to model misspecification if the true genetic architecture cannot be captured by the assumed parametric distribution. A natural solution is to consider a generalizable nonparametric prior, such as the Dirichlet process [8]. Dirichlet process regression (DPR) was shown to be adaptive to different parametric assumptions and could achieve robust performance when applied to different traits [9]. However, DPR requires access to individual-level genotype and phenotype data and has expensive computational cost when applied to large-scale GWAS data.
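The P+T idea mentioned at the start of this paragraph can be illustrated with a greedy clumping sketch. This is a simplified illustration under stated assumptions: the function name and the dense r² matrix input are hypothetical, and PLINK's actual clumping additionally restricts comparisons to a physical distance window.

```python
def prune_and_threshold(p_values, r2, p_threshold, r2_threshold):
    """Greedy clumping: keep the most significant SNPs, dropping any SNP in LD with a kept one."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    kept = []
    for i in order:
        if p_values[i] > p_threshold:
            break  # all remaining SNPs are even less significant
        if all(r2[i][j] < r2_threshold for j in kept):
            kept.append(i)
    return sorted(kept)


# Hypothetical example: SNPs 0 and 1 are in strong LD, SNP 2 is not significant.
p_values = [1e-9, 1e-8, 0.2, 1e-4]
r2 = [[1.0, 0.9, 0.0, 0.0],
      [0.9, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
selected = prune_and_threshold(p_values, r2, p_threshold=1e-3, r2_threshold=0.2)
```

In this toy input, SNP 1 is dropped because it is in LD with the more significant SNP 0, and SNP 2 fails the p value threshold.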

In this work, we derive a summary statistics-based method, called SDPR, which does not rely on specific parametric assumptions on the effect size distribution. SDPR connects the marginal coefficients in summary statistics with true effect sizes through Bayesian multiple Dirichlet process regression. We utilize the concept of approximately independent LD blocks and overparametrization to develop a parallel and fast-mixing Markov Chain Monte Carlo (MCMC) algorithm [10,11]. Through simulations and real data applications, we demonstrate the advantages of our method in terms of improved computational efficiency and more robust prediction performance, without the need for a validation dataset to select tuning parameters.

Methods

Overview of SDPR

Suppose GWAS summary statistics are derived from N individuals and p genetic markers. The phenotypes and genotypes can be related through a multivariate linear model,

y = X β + ϵ

(1)

where y is an N×1 vector of phenotypes, X is an N×p matrix of genotypes, and β is a p×1 vector of effect sizes. We further assume, without loss of generality, that both y and the columns of X have been standardized. GWAS summary statistics usually contain the per SNP effect size $\hat{β}$ directly obtained or well approximated through the marginal regression $\hat{β} = \frac{X^{T} y}{N}$ . Under the assumption that each individual SNP explains a relatively small percentage of phenotypic variance, the residual variance can be set to 1 and the likelihood function can be evaluated from

\hat{β} | β \sim N (R β, \frac{R}{N})

(2)

where R is the reference LD matrix [3,12].
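The mean and covariance in Eq (2) can be traced back to Eq (1) in one line. The sketch below assumes that the in-sample LD matrix $\frac{X^{T} X}{N}$ equals R and that ϵ has identity covariance (residual variance set to 1, as stated above); both are simplifying assumptions for illustration.

```latex
\hat{β} = \frac{X^{T} y}{N}
        = \frac{X^{T} X}{N} β + \frac{X^{T} ϵ}{N}
        = R β + \frac{X^{T} ϵ}{N},
\qquad
\mathrm{Var}\left( \frac{X^{T} ϵ}{N} \right)
        = \frac{X^{T} X}{N^{2}}
        = \frac{R}{N}
```

When R is instead estimated from an external reference panel, the first assumption holds only approximately.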

Like many Bayesian methods, we assume that the effect size of the i-th SNP, β_i, follows a normal distribution with mean 0 and variance σ². In contrast to methods assuming one particular parametric distribution, we consider placing a Dirichlet process prior on σ², i.e.,

β_{i} \sim N (0, σ^{2}), σ^{2} \sim DP (H, α)

(3)

where H is the base distribution and α is the concentration parameter controlling the shrinkage of the distribution on σ² toward H. To improve the mixing of MCMC and avoid the informativeness issue of inverse gamma distribution, we follow Gelman’s advice to overparametrize the model by writing β_i = ηγ_i and use the square of uniform distribution as the base distribution H [13]. This is explained thoroughly in the section Dirichlet Process Prior of S1 Text.

The Dirichlet process has several equivalent probabilistic representations, of which the stick-breaking process is commonly used for its convenience in model fitting [14]. The stick-breaking representation views the Dirichlet process as an infinite Gaussian mixture model

β_{i} \sim \sum_{k = 1}^{\infty} p_{k} N (0, σ_{k}^{2}), p_{k} = V_{k} \prod_{m = 1}^{k - 1} (1 - V_{m}), V_{k} \sim \mathrm{Beta} (1, α), σ_{k}^{2} \sim H

(4)

In practice, truncation needs to be applied so that the maximum number of components of the mixture model is finite. We found that setting the maximum number of components to 1000 was sufficient for our simulation and real data application because the number of non-trivial components, to which SNPs were assigned, was far fewer than 1000. The variance of the first component of the mixture model is further fixed to 0, in analogy to Bayesian variable selection. We designed a parallel MCMC algorithm and implemented it in C++. The details of the algorithm can be found in the section MCMC Algorithm of S1 Text.
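The truncated stick-breaking weights in Eq (4) can be sketched in a few lines of Python. This is an illustration only, not the SDPR implementation: the function name is hypothetical, and setting the last break V_K to 1 so that the weights sum to one is a common truncation convention assumed here.

```python
import random


def stick_breaking_weights(alpha, n_components, rng):
    """Draw truncated stick-breaking weights p_1..p_K with V_k ~ Beta(1, alpha)."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for k in range(n_components):
        # Truncation (assumed convention): the last component takes what is left.
        v = 1.0 if k == n_components - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


rng = random.Random(0)
p = stick_breaking_weights(alpha=1.0, n_components=1000, rng=rng)
```

With a moderate α, most of the mass typically falls on the first few components, which is consistent with the observation above that far fewer than 1000 components end up being used.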

Robust design of the likelihood function

Unlike individual-level data based methods, summary statistics based methods typically rely on an external reference panel to estimate the LD matrix R. Ideally, the same set of individuals in the reference panel should be used to generate the summary statistics. However, due to the limited access to the individual level data of original GWAS studies, an external database with matched ancestry like the 1000 Genomes Project [15] or UK Biobank [16] is usually used instead to compute the reference LD matrix. It is possible that effect sizes of SNPs in summary statistics deviate from what is expected given the likelihood function and reference LD matrix, especially for SNPs in strong LD that are genotyped on different individuals (Table A in S1 Text). This issue was also noted in Section 5.5 of the RSS paper [12]. Failure to account for such discrepancy can cause severe model misspecification problems for SDPR and possibly other methods. One can derive that, if SNPs are genotyped on different individuals, then the likelihood function (2) should be modified as

\hat{β} | β \sim N (R β, R ° H)

(5)

where ° is the Hadamard product, $H_{ii} = \frac{1}{N_{i}}$, $H_{ij} = \frac{N_{s,ij}}{N_{i} N_{j}}$ $(i \neq j)$, N_i is the sample size of SNP i, N_j is the sample size of SNP j, and N_s,ij is the number of shared individuals genotyped for SNPs i and j. Evaluation of likelihood function (5) requires the knowledge about the sample size and inclusion of each study for each SNP. For example, SNPs of GWAS summary statistics of lipid traits were genotyped on two arrays in two separate cohorts (GWAS chip: N₁≈95,000; Metabochip: N₂≈94,000) [17]. Based on this information, N_s,ij is set to 0 if SNPs i and j were genotyped on different arrays, N₁ if SNP i was genotyped on GWAS chip and SNP j was genotyped on both arrays, and N₂ if SNP i was genotyped on Metabochip and SNP j was genotyped on both arrays.
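To make the covariance R ° H in Eq (5) concrete, the sketch below builds it for three hypothetical SNPs using the two lipid cohorts described above. The function names, the toy LD values, and the assumption that the two cohorts share no individuals are illustrative choices, not part of the SDPR code.

```python
def shared_sample_size(cohorts_i, cohorts_j, cohort_sizes):
    """Number of individuals genotyped for both SNPs, assuming disjoint cohorts."""
    return sum(cohort_sizes[c] for c in cohorts_i & cohorts_j)


def covariance_eq5(ld, snp_cohorts, cohort_sizes):
    """Return the Hadamard product R ° H from Eq (5) as a nested list."""
    p = len(ld)
    n = [sum(cohort_sizes[c] for c in snp_cohorts[i]) for i in range(p)]
    cov = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i == j:
                h = 1.0 / n[i]
            else:
                n_shared = shared_sample_size(snp_cohorts[i], snp_cohorts[j], cohort_sizes)
                h = n_shared / (n[i] * n[j])
            cov[i][j] = ld[i][j] * h
    return cov


# Approximate cohort sizes from the lipid example; LD values are made up.
sizes = {"gwas_chip": 95000, "metabochip": 94000}
snp_cohorts = [{"gwas_chip"}, {"metabochip"}, {"gwas_chip", "metabochip"}]
ld = [[1.0, 0.8, 0.5],
      [0.8, 1.0, 0.6],
      [0.5, 0.6, 1.0]]
cov = covariance_eq5(ld, snp_cohorts, sizes)
```

SNPs 0 and 1 sit on different arrays, so their marginal estimates are uncorrelated in this model even though their LD is 0.8, which is the kind of discrepancy Eq (2) cannot represent.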

In reality, GWAS summary statistics are often obtained through meta-analysis, and information above is generally not available. Besides, double genomic control is applied to many summary statistics, which may lead to deflation of effect sizes [18,19]. Therefore, we consider evaluating the likelihood function from the following distribution.

\frac{\hat{β}}{c} | β \sim N (R β, \frac{R + N a I}{N})

(6)

More specifically, the input is divided by a constant c provided by SumHer if application of double genomic control significantly deflates the effect sizes [18]. Compared with Eq (5), the correlation between two SNPs is $\frac{R_{i j}}{1 + N a}$ rather than $\frac{R_{i j} N_{s, i j}}{\sqrt{N_{i} N_{j}}}$ . The connection between Eq (6) and LDSC is discussed in the relevant section of S1 Text. For simulated data, c was set to 1 and a was set to 0 for Scenarios 1A-1C, 4 and 5, since there was no above-mentioned discrepancy in these scenarios. In real data application, Na was set to 0.1 except for lipid traits, and c was set to 1 except for BMI (BMI c = 0.74 given by SumHer).
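The two correlation expressions compared above follow from dividing each off-diagonal covariance by the square root of the product of the corresponding variances. A small sketch, with hypothetical function names, makes the effect of Na easy to inspect:

```python
import math


def implied_correlation_eq5(r_ij, n_i, n_j, n_shared):
    """Correlation of marginal estimates for SNPs i and j under Eq (5)."""
    return r_ij * n_shared / math.sqrt(n_i * n_j)


def implied_correlation_eq6(r_ij, na):
    """Correlation of marginal estimates under Eq (6); na is the product N * a."""
    return r_ij / (1.0 + na)


# Hypothetical LD of 0.9 between two SNPs.
full_overlap = implied_correlation_eq5(0.9, 50000, 50000, 50000)
shrunk = implied_correlation_eq6(0.9, 0.25)
```

With complete sample overlap Eq (5) returns the reference LD unchanged, while Na = 0.25 shrinks every off-diagonal correlation to 80% of its reference value.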

Construction and partition of reference LD matrix

We use an empirical Bayes shrinkage estimator to construct the LD matrix since an external reference panel like 1000G contains a limited number of individuals [20]. The LD matrix can be divided into small “independent” blocks to allow for efficient update of posterior effect sizes using the blocked Gibbs sampler [6]. At present, ldetect is widely used for performing such tasks [10]. However, ldetect sometimes produces false positive partitions that violate the likelihood assumption of Eq (2) (Fig A in S1 Text). To address this issue, we designed a simple and fast algorithm for partitioning independent blocks. The new algorithm ensures that each SNP in one LD block does not have nonignorable correlation (r² > 0.1) with SNPs in other blocks so that the likelihood assumption of Eq (2) is less likely to be violated (Fig A in S1 Text).
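The paper does not spell out the partitioning algorithm here, so the following is only one possible way to satisfy the stated guarantee, written as a hypothetical sketch. It assumes SNPs are ordered by position and that a dense r² matrix is available; a practical implementation would restrict the search to a window.

```python
def partition_ld_blocks(r2, threshold=0.1):
    """Split ordered SNPs into contiguous blocks with no cross-block r2 above threshold."""
    p = len(r2)
    blocks = []
    start = 0
    boundary = 0  # furthest SNP index linked to the current block
    for i in range(p):
        # Find the furthest downstream SNP correlated with SNP i.
        for j in range(p - 1, i, -1):
            if r2[i][j] > threshold:
                boundary = max(boundary, j)
                break
        if i == boundary:
            blocks.append((start, i))  # inclusive index range
            start = i + 1
            boundary = start
    return blocks


# Hypothetical r2 matrix: SNPs 0-1-2 are chained together, SNPs 3-4 are linked.
r2 = [[1.00, 0.50, 0.05, 0.05, 0.05],
      [0.50, 1.00, 0.30, 0.05, 0.05],
      [0.05, 0.30, 1.00, 0.05, 0.05],
      [0.05, 0.05, 0.05, 1.00, 0.20],
      [0.05, 0.05, 0.05, 0.20, 1.00]]
blocks = partition_ld_blocks(r2)
```

A block is closed only once the scan reaches the furthest SNP linked to any of its members, so every pair with r² above the threshold ends up in the same block.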

Other methods

We compared the performance of SDPR with seven other methods: (1) PRS-CS as implemented in the PRS-CS software; (2) SBayesR as implemented in the GCTB software (version 2.02); (3) LDpred as implemented in the LDpred software (version 1.0.6); (4) P+T as implemented in the PLINK software (version 1.90) [21]; (5) LDpred2 as implemented in the bigsnpr package (version 1.6.1); (6) Lassosum as implemented in the lassosum package (version 0.4.5) [22]; and (7) DBSLMM as implemented in the DBSLMM package (version 0.21) [23]. We used the default parameter setting for all methods. For PRS-CS, the global shrinkage parameter was specified as {1e-6, 1e-4, 1e-2, 1, auto}. For SBayesR, gamma was specified as {0, 0.01, 0.1, 1} and pi was specified as {0.95, 0.02, 0.02, 0.01}. For LDpred, the polygenicity parameter was specified as {1e-5, 3e-5, 1e-4, 3e-4, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, LDpred-Inf}. For P+T, SNPs in GWAS summary statistics were clumped for r² iterated over {0.2, 0.4, 0.6, 0.8}, and for p value threshold iterated over {5e-8, 5e-6, 1e-5, 1e-4, 5e-4, 1e-3, 1e-2, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}. For LDpred2, we ran LDpred2-inf, LDpred2-auto and LDpred2-grid, and reported the best performance of three options. The grid of hyperparameters was set as non-sparse, p in a sequence of 21 values from 10⁻⁵ to 1 on a log-scale, and h² within {0.7, 1, 1.4} of h²_LDSC. For lassosum, lambda was set in a sequence of 20 values from 0.001 to 0.1 on a log-scale, and s within {0.2, 0.5, 0.9, 1}. For DBSLMM, p value threshold was iterated within {10⁻⁵, 10⁻⁶, 10⁻⁷, 10⁻⁸}, r2 was iterated within {0.05, 0.1, 0.15, 0.2, 0.25}, and h² was set as h²_LDSC. We tuned the parameters for PRS-CS, LDpred, P+T, LDpred2, lassosum, and DBSLMM using the validation dataset.

Genome-wide simulations

We used genotypes from UK Biobank to perform simulations. UK Biobank’s database contains extensive phenotypic and genotypic data of over 500,000 individuals in the United Kingdom [16]. We selected 276,732 unrelated individuals of European ancestry based on data field 22021 and 22006. A subset of these individuals was randomly selected to form the training, validation and test datasets. Training datasets contained 10,000, 50,000, and 100,000 individuals, while validation and test datasets contained 10,000 individuals. We applied quality control (MAF > 0.05, genotype missing rate < 0.01, INFO > 0.3, pHWE > 1e-5) to select 4,458,556 SNPs from the original ~96 million SNPs. We then intersected these SNPs with 1000G HM3 SNPs (MAF > 0.05) and removed those in the MHC region (Chr6: 28–34 Mb) to form a set of 681,828 SNPs for simulation.

To cover a range of genetic architectures, we simulated effect sizes of SNPs under the following scenarios: (1)-(3) $β_{j} \sim π N (0, \frac{h^{2}}{M π}) + (1 - π) δ_{0}$ , where h² = 0.5, M = 681,828, and π equaled 10⁻⁴ (scenario 1A), 10⁻³ (scenario 1B) and 10⁻² (scenario 1C); (4) $β_{j} \sim \sum_{i = 1}^{3} π_{i} N (0, c_{i} σ^{2}) + (1 - \sum_{i = 1}^{3} π_{i}) δ_{0}$ where c = (1, 0.1, 0.01), π = (10⁻⁴, 10⁻⁴, 10⁻²) with σ² calculated so that the total heritability equaled 0.5; (5) $β_{j} \sim N (0, \frac{h^{2}}{M})$ . Importantly, scenarios 1A-1C satisfied the assumption of LDpred/LDpred2, scenario 5 satisfied the assumption of LDpred-inf/LDpred2-inf, whereas scenario 4 satisfied the assumption of SBayesR. Phenotypes were generated from simulated effect sizes using GCTA-sim, and marginal linear regression was performed on the training data to obtain summary statistics using PLINK2 [24,25]. In each scenario, we performed 10 simulation replicates.
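The point-normal prior used in scenarios 1A-1C can be sampled directly. The sketch below is illustrative: the function name, the seed, and the smaller M are assumptions chosen so the example runs quickly, and the actual phenotypes in the paper were generated with GCTA-sim.

```python
import math
import random


def simulate_point_normal(m, pi, h2, rng):
    """Draw beta_j ~ pi * N(0, h2 / (m * pi)) + (1 - pi) * delta_0 for m SNPs."""
    sd = math.sqrt(h2 / (m * pi))
    return [rng.gauss(0.0, sd) if rng.random() < pi else 0.0 for _ in range(m)]


rng = random.Random(1)
beta = simulate_point_normal(m=100000, pi=0.01, h2=0.5, rng=rng)
n_causal = sum(1 for b in beta if b != 0.0)
total_variance = sum(b * b for b in beta)  # close to h2 on average
```

Because the per-SNP variance scales as 1/(Mπ), the expected sum of squared effects stays at h² regardless of how sparse the architecture is.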

We applied different methods on the training data, and used the 10,000 individuals in the validation dataset to estimate the LD matrix. Parameters for LDpred, P+T, PRS-CS, LDpred2, lassosum, and DBSLMM were also tuned using the validation data. We then evaluated the prediction performance on the test data by computing the square of Pearson correlation of PRS with simulated phenotypes.

Real data application using public summary statistics and UK biobank data

We obtained public GWAS summary statistics for 12 traits and evaluated the prediction performance of each method using the UK Biobank data. Individuals in GWAS do not overlap with individuals in UK Biobank. For this reason, we did not use the latest summary statistics of height and BMI [26]. To standardize the input summary statistics, we generally followed the guidelines of LDHub to perform quality control on the GWAS summary statistics [27]. We removed strand ambiguous (A/T and G/C) SNPs, insertions and deletions (INDELs), and SNPs with an effective sample size less than 0.67 times the 90th percentile of sample size. SNPs within the MHC region were removed except for IBD, since the MHC region plays an important role in autoimmune diseases. The remaining SNPs were then intersected with 1000G HM3 SNPs provided in the PRS-CS reference panel. Table 1 shows the number of SNPs present in the summary statistics for each trait after performing quality control.

Table 1. Summary information about the sample size and SNPs in GWAS summary statistics and UK Biobank datasets.

For binary traits, effective sample size was used ( $\frac{4 N_{case} N_{control}}{N_{case} + N_{control}}$ ) and the validation datasets consisted of equal numbers of cases and controls. If the summary statistics included sample sizes for individual SNPs, the median of all SNPs passing QC was reported. For binary traits, the numbers of cases and controls were reported in parentheses.

Trait	GWAS sample size	GWAS ref	1KG HM3 & UKB & GWAS SNPs	UKB validation Sample size	UKB testing sample size
Height	252,230	[29]	885,791	138,066	138,066
BMI	233,766	[30]	886,654	137,921	137,920
HDL	94,288	[17]	868,645	37,774	37,774
LDL	89,866	[17]	868,179	40,807	40,807
Total Cholesterol	94,571	[17]	868,167	40,898	40,898
Triglycerides	90,989	[17]	868,243	40,858	40,857
Coronary artery disease	61,294 (22,233/64,762)	[31]	814,337	4475/4475	4475/258,345
Breast Cancer	227,688 (122,977/105,974)	[32]	927,706	4539/4539	4539/133,649
Inflammatory bowel disease	32,372 (12,882/21,770)	[33]	918,369	1840/1840	1839/198,815
Type 2 diabetes	156,109 (26,676/132,532)	[34]	974,907	7240/7240	7239/182,292
Bipolar	41,606 (20,129/21,524)	[35]	928,032	832/832	832/176,069
Schizophrenia	65,955 (33,426/32,541)	[35]	941,216	223/223	223/203,471
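The effective sample size formula from the table caption can be checked against the case and control counts listed in Table 1. The function name below is a hypothetical helper, not part of any published software.

```python
def effective_sample_size(n_case, n_control):
    """Effective sample size 4 * N_case * N_control / (N_case + N_control)."""
    return 4.0 * n_case * n_control / (n_case + n_control)


# Case and control counts for bipolar disorder and schizophrenia from Table 1.
n_eff_bipolar = effective_sample_size(20129, 21524)
n_eff_schizophrenia = effective_sample_size(33426, 32541)
```

Rounded to the nearest integer, these reproduce the GWAS sample sizes reported for the two traits (41,606 and 65,955).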


For UK Biobank, we first selected unrelated European individuals as we did in simulations. We then applied quality control (MAF > 0.01, genotype missing rate < 0.05, INFO > 0.8, pHWE > 1e-10) to obtain a total of 1,114,176 HM3 SNPs. UK Biobank participants with six quantitative traits, namely height, body mass index (BMI), high-density lipoproteins (HDL), low-density lipoproteins (LDL), total cholesterol, and triglycerides, were selected based on relevant data fields (Section selection of phenotypes in S1 Text). Selected participants were randomly assigned to form validation and test datasets, each comprising half of the individuals. Cases for each of six diseases, namely coronary artery disease, breast cancer, inflammatory bowel disease (IBD), type 2 diabetes, bipolar, and schizophrenia, were selected based on self-reported questionnaire and ICD code in the electronic hospital record (EHR). Controls were selected among participants in the EHR based on certain exclusion criteria (Section selection of phenotypes in S1 Text). The validation dataset consisted of an equal number of cases and controls, and the remaining individuals were assigned to the test dataset (Table 1). Random assignment of individuals to validation and test datasets was repeated 10 times.

For six quantitative traits, we reported the prediction R² of PRS (variance explained by PRS) defined as $R^{2} = 1 - \frac{S S_{1}}{S S_{0}}$ , where SS₀ is the sum of squares of the residuals of the restricted linear regression model with covariates (an intercept, age, sex, top 10 PCs of the genotype data), and SS₁ is the sum of squares of the residuals of the full linear regression model (covariates above and PRS). For six diseases, we reported the AUC of PRS only for better comparison of different methods.
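The definition of R² above compares two nested least-squares fits. The sketch below implements it in plain Python by solving the normal equations; the function names and the toy data are hypothetical, and a real analysis would use a statistics package and the full covariate set.

```python
def residual_sum_of_squares(columns, y):
    """Fit y on the given predictor columns by least squares and return the RSS."""
    k, n = len(columns), len(y)
    # Augmented normal equations (X^T X | X^T y), solved by Gauss-Jordan elimination.
    a = [[sum(columns[r][t] * columns[c][t] for t in range(n)) for c in range(k)]
         + [sum(columns[r][t] * y[t] for t in range(n))] for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * z for x, z in zip(a[r], a[col])]
    coef = [a[r][k] / a[r][r] for r in range(k)]
    fitted = [sum(coef[c] * columns[c][t] for c in range(k)) for t in range(n)]
    return sum((y[t] - fitted[t]) ** 2 for t in range(n))


def incremental_r2(covariates, prs, y):
    """R^2 = 1 - SS1 / SS0: SS0 from covariates only, SS1 from covariates plus PRS."""
    ss0 = residual_sum_of_squares(covariates, y)
    ss1 = residual_sum_of_squares(covariates + [prs], y)
    return 1.0 - ss1 / ss0


# Toy data: the phenotype is an exact linear function of age and PRS.
intercept = [1.0] * 6
age = [40.0, 50.0, 60.0, 45.0, 55.0, 65.0]
prs = [0.5, -1.0, 0.3, 1.2, -0.7, -0.3]
y = [2.0 + 0.01 * a + 1.5 * s for a, s in zip(age, prs)]
r2 = incremental_r2([intercept, age], prs, y)
```

In this toy case the PRS explains essentially all variance left after the covariates, so the value is close to 1; with real data it measures only the share of residual variance attributable to the PRS.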

Results

Adaptiveness of Dirichlet process prior

Theoretically, Dirichlet process as an infinite Gaussian mixture model is able to approximate any continuous parametric distribution, thus including other published parametric distributions as special cases [28]. For example, the density of Dirichlet process prior adapts well to the density of normal distribution (LDpred-inf), point normal mixture distribution (LDpred/LDpred2), and three-point normal mixture distribution (SBayesR) (Fig B in S1 Text). Compared with SBayesR, Dirichlet process prior does not constrain the relationship between three non-zero normal variance components. We also explicitly incorporate Bayesian variable selection by setting the first variance component as 0, which is different from PRS-CS. The adaptiveness of Dirichlet process prior potentially makes it more robust to the distribution of effect sizes of real traits.

Simulations

We first compared the performance and computational time of SDPR with DPR in a small-scale simulation setting using 10,000 individuals and 58,432 SNPs on chromosome 1. The effect sizes were generated under the mixture of a Dirac delta and three normal distributions with total heritability fixed as 0.3. We fitted the DPR model with four components and 5000 MCMC iterations, and the SDPR model with the input of summary statistics. The average R² of DPR was 0.227, and the average R² of SDPR was 0.204 (Fig C in S1 Text). DPR took about 3.5 hours and consumed 10.4 Gb of memory to finish MCMC, while SDPR took only 10 minutes and used 1.1 Gb of memory. This demonstrated the improved computational efficiency of SDPR over DPR without much loss of prediction accuracy.

We then compared the performance of SDPR with several other summary statistics-based methods via genome-wide simulations across different genetic architectures and training sample sizes. Effect sizes of SNPs were simulated under a point-normal mixture model with increasing number of causal variants, a point-three-normal mixture model satisfying SBayesR’s assumption, and a normal model satisfying LDpred-inf’s assumption (details in Methods). The heritability was fixed as 0.5 and 10 replicates were performed in each simulation setting. Tuning parameters of PRS-CS, LDpred, P+T, LDpred2, lassosum, and DBSLMM were selected using a validation dataset (N = 10,000). 10,000 individuals in the validation dataset were used to construct the LD matrix. We evaluated the prediction performance on the independent test data (N = 10,000) using the squared Pearson correlation coefficient (R²).

The prediction accuracy of all methods generally increased with the sample size of training data (Fig 1 and Tables B-F in S1 Text). Similarly, all methods performed better when the number of causal variants was small. Since the standard error of the regression coefficient estimator in GWAS summary statistics is roughly inversely proportional to the square root of the sample size of the training cohort, the dominance of noise over signal poses significant challenges for accurate estimation of effect sizes when the training sample size or per SNP effect size is small.

SDPR, LDpred2, and SBayesR performed better than other methods in the sparse setting (Fig 1 Scenarios 1A-1C and Tables B-E in S1 Text). Consistent with others’ findings, we observed that when the genetic architecture was sparse, the performance of LDpred decreased as the training sample size increased [6]. In contrast, LDpred2 performed significantly better than LDpred. Meanwhile, PRS-CS performed worse when the training sample size was small. In the polygenic setting, SDPR and LDpred-inf/LDpred2-inf performed better than other methods (Fig 1 Scenario 5 and Table F in S1 Text). Overall, SDPR and LDpred2 performed well across a range of simulated sparse and polygenic genetic architectures. LDpred2 is expected to perform well in Scenarios 1A-1C and 5 since these scenarios satisfied the assumptions of LDpred2/LDpred2-inf. The robust performance of SDPR demonstrates the advantage of using a Dirichlet process prior to model the genetic architecture.

It is important to note that while SBayesR and SDPR do not need a validation dataset to tune parameters, they may be more susceptible to heterogeneity and errors in the summary statistics. Therefore, we tested whether our modified likelihood function (6) makes SDPR more robust when dealing with discrepancies between summary statistics and reference panel. We generated summary statistics from 50,000 individuals under the same setting as scenario 1B. For half of the SNPs (340,914), linear regression was performed on 40,000 individuals to obtain the marginal effect sizes. According to Eq (5), the correlation of effect sizes of these SNPs would be 80% of what was expected from the reference panel. Such discrepancy indeed caused the divergence of SBayesR, while SDPR with modified likelihood function (6) converged and performed well (N = 50,000, Na = 0.25, R² = 0.422).

Real data applications

We compared the performance of SDPR with other methods in real datasets to predict six quantitative traits (height, body mass index, high-density lipoproteins, low-density lipoproteins, total cholesterol, and triglycerides) and six diseases (coronary artery diseases, breast cancer, inflammatory bowel disease, type 2 diabetes, bipolar, and schizophrenia) in UK Biobank. We obtained public GWAS summary statistics of these traits and performed quality control to standardize the input (details in Methods; Table 1). A total of 503 1000G EUR individuals were used to construct the reference LD matrix for SDPR, PRS-CS, LDpred, P+T, LDpred2, lassosum, and DBSLMM. For SBayesR, we used 5000 EUR individuals in UK Biobank to create the LD matrix (shrunken and sparse) instead, as it was reported to have suboptimal prediction accuracy when using 1000G samples [5].

For six continuous traits, the prediction performance was measured by variance of phenotype explained by PRS (Fig 2 and Table G in S1 Text). Overall, SDPR, PRS-CS and LDpred2 performed better than other methods, and there was minimal difference among these three methods. In terms of ranking, SDPR and PRS-CS performed best for height. SDPR and LDpred2 performed best for BMI. SDPR performed best for HDL, LDL and total cholesterol, while PRS-CS performed best for triglycerides. We observed convergence issues when running SBayesR on these traits, and followed its manual to filter SNPs based on GWAS P-values and LD R-squared (--p-value 0.4 --rsq 0.9). The filtering approach improved the prediction performance of SBayesR, but it still failed to achieve top tier performance. We suspect that the convergence issue of SBayesR was also caused by the violation of the likelihood assumption, similar to what we observed in the simulation. To address this issue, our approach of modifying the likelihood function might be better than the simple filtering approach used in SBayesR and P+T as it retained all SNPs for prediction.

For six disease traits, the prediction performance was measured by AUC of PRS only (Fig 3 and Table H in S1 Text). Overall, SDPR achieved top tier performance (within 0.003 difference of AUC of the best method) for five out of six diseases. In terms of ranking, LDpred and LDpred2 performed best for coronary artery disease. SDPR and PRS-CS performed best for breast cancer. LDpred2 performed best for IBD. For schizophrenia and type 2 diabetes, SBayesR performed best. LDpred, SDPR, LDpred2 and SBayesR performed best for bipolar.

Consistent with simulations, SBayesR performed similarly to SDPR when there was no convergence issue (IBD, type 2 diabetes, schizophrenia, bipolar vs height, lipid traits). In general, PRS-CS performed better when the training sample size was large (height and breast cancer vs IBD and type 2 diabetes) and LDpred performed better when the training sample size was small (coronary artery disease, IBD vs height, breast cancer). LDpred2 performed significantly better than LDpred, achieving highly competitive performance. SDPR performed best among methods that do not need parameter tuning (PRS-CS auto, SBayesR, LDpred2 auto) (Table I and J in S1 Text). Taken together, our design of the likelihood function and usage of the Dirichlet process prior give SDPR generally robust performance across different genetic architectures and training sample sizes.

Computational time

SDPR is implemented in C++ to best utilize the resources of high-performance computing facilities. SDPR optimizes the speed of the computational bottleneck by using SIMD programming, parallelization over independent LD blocks, and a high-performance linear algebra library. Besides, SDPR by default runs analysis on each chromosome in parallel because the genetic architecture may be different across chromosomes. We benchmarked the computational time and memory usage of each method on an Intel Xeon Gold 6240 processor (2.60 GHz). For SDPR and PRS-CS, we parallelized computation over 22 chromosomes and used three threads per chromosome for the linear algebra library (22 × 3 = 66 threads in total). Time and memory usage were reported for the longest chromosome, which was the rate limiting step. For LDpred, SBayesR and P+T, no parallelization was used. LDpred2 was run in the genome-wide mode with 10 threads for parallel computation. DBSLMM and lassosum were run with 3 threads for parallel computation. The evaluation was based on a fixed number of MCMC iterations: 1000 for SDPR and PRS-CS (default), 4000 for SBayesR (non-default but achieved generally good performance in simulations and real data application), 100 for LDpred (default), and 1000 for LDpred2 (default). One should keep in mind that the number of MCMC iterations and threads for parallel computation affects the computation time significantly, though we did not explore it in this paper since each method also has different convergence and computational properties.

Table 2 shows that SDPR was able to finish the analysis in 15 minutes for most traits and required no more than 3 Gb of memory for each chromosome. SBayesR was also fast but the memory usage was significant for five diseases as no SNPs were removed to improve the convergence. The speed of PRS-CS, LDpred, P+T, LDpred2, lassosum, and DBSLMM was impeded by the need to iterate over tuning parameters. PRS-CS used less memory because the largest LD block output by ldetect was smaller than the largest block used by SDPR.

Table 2. Computational time and memory usage of different methods for 12 traits.

The computational time is in hours. Memory usage of each method, listed in parentheses, is measured in Gigabytes (Gb). We did not include the time of computing PRS in the validation and test datasets except for P+T, lassosum, LDpred2, and DBSLMM, because such computation was non-trivial for methods with a large grid of tuning parameters.

Trait	SDPR	PRS-CS	SBayesR	LDpred	P+T	LDpred2	Lassosum	DBSLMM
Height	0.20 (2.4)	2.5 (0.7)	0.92 (12.6)	5.0 (15.5)	0.6 (1.1)	5.5 (31.2)	0.50 (2.6)	1.7 (1.1)
BMI	0.18 (2.4)	2.8 (0.7)	0.50 (7.6)	4.9 (15.1)	0.5 (1.1)	5.4 (30.8)	0.45 (2.6)	0.60 (1.1)
HDL	0.20 (2.4)	1.6 (0.7)	0.68 (8.5)	5.1 (15.6)	0.5 (1.1)	3.9 (31.7)	0.41 (2.2)	0.44 (1.1)
LDL	0.22 (2.4)	2.2 (0.7)	0.67 (8.7)	5.1 (15.6)	0.6 (1.1)	5.5 (31.6)	0.42 (2.2)	0.61 (1.1)
Total cholesterol	0.25 (2.4)	2.2 (0.7)	0.48 (8.7)	5.1 (15.4)	0.5 (1.1)	4.1 (31.7)	0.40 (2.6)	0.60 (1.1)
Triglycerides	0.21 (2.4)	2.2 (0.7)	0.50 (8.3)	5.1 (15.5)	0.5 (1.1)	3.5 (31.6)	0.42 (2.6)	0.62 (1.1)
Coronary artery disease	0.23 (2.3)	1.9 (0.7)	0.39 (7.0)	4.7 (14.0)	0.3 (1.1)	3.5 (27.1)	0.33 (2.2)	0.77 (1.1)
Breast cancer	0.20 (2.9)	2.7 (0.7)	0.63 (42.3)	5.5 (16.4)	0.5 (1.1)	4.6 (37.1)	0.42 (2.7)	0.65 (1.1)
IBD	0.28 (2.8)	2.2 (0.7)	0.78 (39.5)	5.1 (16.0)	0.6 (1.1)	3.7 (33.4)	0.45 (2.7)	0.68 (1.1)
Type 2 diabetes	0.31 (2.9)	2.4 (0.7)	0.87 (47.4)	5.5 (17.4)	0.5 (1.1)	4.5 (37.2)	0.51 (2.8)	0.63 (1.2)
Schizophrenia	0.28 (2.7)	2.3 (0.7)	2.6 (42.1)	5.3 (16.4)	0.5 (1.1)	4.4 (36.8)	0.43 (2.3)	0.64 (1.1)
Bipolar	0.28 (2.8)	2.2 (0.7)	1.7 (43.8)	5.3 (16.3)	0.5 (1.1)	4.4 (36.8)	0.45 (2.6)	0.64 (1.1)
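The timing and memory statements for SDPR can be checked against the SDPR column of Table 2. The following Python sketch transcribes those values (hours, with memory in gigabytes) and counts how many traits finished within 15 minutes; the variable names are illustrative.

```python
# SDPR column of Table 2: trait -> (hours, memory in gigabytes).
sdpr = {
    "Height": (0.20, 2.4),
    "BMI": (0.18, 2.4),
    "HDL": (0.20, 2.4),
    "LDL": (0.22, 2.4),
    "Total cholesterol": (0.25, 2.4),
    "Triglycerides": (0.21, 2.4),
    "Coronary artery disease": (0.23, 2.3),
    "Breast cancer": (0.20, 2.9),
    "IBD": (0.28, 2.8),
    "Type 2 diabetes": (0.31, 2.9),
    "Schizophrenia": (0.28, 2.7),
    "Bipolar": (0.28, 2.8),
}

FIFTEEN_MINUTES_IN_HOURS = 0.25

# Traits whose reported time is at most 15 minutes.
fast_traits = [t for t, (hours, _) in sdpr.items() if hours <= FIFTEEN_MINUTES_IN_HOURS]
max_memory = max(mem for _, mem in sdpr.values())
mean_minutes = 60 * sum(hours for hours, _ in sdpr.values()) / len(sdpr)

print(f"{len(fast_traits)} of {len(sdpr)} traits finished within 15 minutes")
print(f"mean time: {mean_minutes:.1f} min, peak memory: {max_memory} gigabytes")
```

With these values, 8 of the 12 traits are at or below 0.25 hours, and the largest memory figure is 2.9, consistent with the summary given for SDPR.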


Discussion

Building on the success of genome-wide association studies, polygenic prediction of complex traits has shown great promise with both public health and clinical relevance. Recently, there has been growing interest in developing non-parametric or semi-parametric approaches that make minimal assumptions about the distribution of effect sizes to improve genetic risk prediction [9,36,37]. However, these methods require access to individual-level data (DPR) [9], require external training datasets (NPS) [36], or do not account for LD (So’s method) [37]. Other widely used methods usually make specific parametric assumptions and require external validation or pseudo-validation datasets to optimize prediction performance [3,6,22]. To address the limitations of existing methods, we have proposed SDPR, a non-parametric method that is adaptive to different genetic architectures, statistically robust, and computationally efficient. Through simulations and real data applications, we have illustrated that SDPR is simple to use in practice, fast, and able to achieve competitive performance.

One of the biggest challenges for summary statistics-based methods is how to deal with the mismatch between summary statistics and the reference panel. In our experience, misspecification of the correlation of marginal effect sizes for SNPs in high LD can sometimes cause severe MCMC convergence issues, especially for methods that do not rely on parameter tuning. Our investigation revealed that even when LD is estimated from a perfectly matched reference panel, if SNPs were genotyped on different individuals, the correlation/covariance of marginal effect sizes in the summary statistics can differ from what is expected from the reference panel. We proposed a modified likelihood function to deal with this issue and observed improved convergence of MCMC. Our approach may be applied in a broader setting, given that many summary statistics-based methods assume $\hat{\beta} \mid \beta \sim N(R\beta, \frac{R}{N})$ or $z \mid \beta \sim N(\sqrt{N} R \beta, R)$. When the sample size is small, the noise and heterogeneity of GWAS summary statistics pose a greater challenge for methods that try to learn every parameter from the data (PRS-CS auto, LDpred2-auto, SBayesR, and SDPR). Under such circumstances, it is advantageous for methods like LDpred/LDpred2 to use an independent validation dataset to select the optimal parameters.
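The effect of genotyping SNPs on different individuals can be reproduced in a small simulation. The Python sketch below is an illustration under simplifying assumptions, not the authors' analysis: genotypes are replaced by standardized Gaussian variables for two SNPs with LD correlation `r`, the phenotype is pure noise (all effects are zero), and the function name `simulate_z_correlation` is hypothetical. It compares the correlation of the two z-scores across replicates when both SNPs are tested on the same individuals and when they are tested on disjoint sets of individuals.

```python
import numpy as np


def simulate_z_correlation(r=0.8, n=500, n_rep=2000, seed=0):
    """Return (corr of z-scores on same individuals, corr on disjoint individuals)."""
    rng = np.random.default_rng(seed)
    z_same = np.empty((n_rep, 2))
    z_diff = np.empty((n_rep, 2))
    for i in range(n_rep):
        # Two standardized "genotypes" with correlation r for 2n individuals.
        g1 = rng.standard_normal(2 * n)
        g2 = r * g1 + np.sqrt(1 - r**2) * rng.standard_normal(2 * n)
        y = rng.standard_normal(2 * n)  # null phenotype: no genetic effect

        # Both SNPs tested on the first n individuals.
        z_same[i] = [g1[:n] @ y[:n], g2[:n] @ y[:n]]
        # SNP 1 tested on the first n individuals, SNP 2 on the other n.
        z_diff[i] = [g1[:n] @ y[:n], g2[n:] @ y[n:]]
    # Scale the sums so that each z-score has variance close to 1.
    z_same /= np.sqrt(n)
    z_diff /= np.sqrt(n)
    corr_same = np.corrcoef(z_same, rowvar=False)[0, 1]
    corr_diff = np.corrcoef(z_diff, rowvar=False)[0, 1]
    return corr_same, corr_diff


corr_same, corr_diff = simulate_z_correlation()
print(f"same individuals:     corr(z1, z2) = {corr_same:.2f}")
print(f"disjoint individuals: corr(z1, z2) = {corr_diff:.2f}")
```

In expectation the first correlation is close to `r`, which is what a likelihood with covariance $R$ assumes, while the second is close to zero even though the LD between the two SNPs is unchanged.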

Although we have focused on polygenic prediction with SDPR in this paper, SDPR can also provide estimates of heritability, genetic architecture, and posterior inclusion probability (PIP) for fine mapping. These issues will be fully explored in our future studies. SDPR can also be extended as a summary statistics-based tool to predict gene expression levels for transcriptome-wide association studies, since a previous study has shown that an individual-level data-based Dirichlet process model improves transcriptomic data imputation [38].

Although our method performs robustly in comparison with other methods, we caution that for most traits the prediction accuracy is currently still too limited for direct application in clinical settings. From our perspective, three factors affect the prediction accuracy. First, how much heritability is explained by common SNPs for diseases and complex traits? Second, if diseases or complex traits have relatively moderate heritability, is the GWAS sample size large enough to allow accurate estimation of effect sizes? Third, if the above two conditions are met, is a method able to have good prediction performance? The first two questions have been discussed in the literature [7,39,40]. As for method development, we have focused on addressing the third question in this paper, and we think SDPR represents a solid step in polygenic risk prediction.

Finally, we suggest two technical directions for further development of SDPR. First, SDPR may perform better after incorporating functional annotation, as methods utilizing functional annotation generally perform better [41]. Second, studies have shown that PRS developed using EUR GWAS summary statistics do not transfer well to other populations [42,43]. We can further modify the likelihood function to account for different LD patterns across populations to improve the performance of trans-ethnic PRS.

Supporting information

S1 Text. Supplementary note explaining the methods in detail.

(DOCX)


Acknowledgments

We conducted the research using the UK Biobank resource under an approved data request (ref: 29900). We sincerely thank GIANT, GLGC, CARDIoGRAMplusC4D, BCAC, IIBDGC, DIAGRAM, and PGC consortia for making their GWAS summary data publicly accessible.

Data Availability

Genotype and phenotype data are third party data from UK Biobank (www.ukbiobank.ac.uk) and cannot be shared publicly because only approved users can have access. However, if the user has access to UK Biobank, we have provided all scripts to reproduce the results in our manuscript on https://github.com/eldronzhou/SDPR_paper. Access to data was obtained under application number 29900. More specifically, in our analysis, we used version 2 of UK Biobank imputed genotype (Identifier: ukb_imp_chr[1-22]_v2.bgen), phenotype (Identifier: ukb29401.enc_ukb) and in hospital records (Identifier: ukb_hesin_diag10.tsv, ukb_hesin_diag9.tsv, ukb_hesin.tsv). For height and BMI, the GWAS summary statistics were downloaded from https://portals.broadinstitute.org/collaboration/giant/images/0/01/GIANT_HEIGHT_Wood_et_al_2014_publicrelease_HapMapCeuFreq.txt.gz and https://portals.broadinstitute.org/collaboration/giant/images/1/15/SNP_gwas_mc_merge_nogc.tbl.uniq.gz. For HDL, LDL, TC and TG, summary statistics were downloaded from http://csg.sph.umich.edu/willer/public/lipids2013/. For CAD, summary statistics were downloaded from http://www.cardiogramplusc4d.org/data-downloads/. For breast cancer, summary statistics were downloaded from http://bcac.ccge.medschl.cam.ac.uk/bcacdata/oncoarray/oncoarray-and-combined-summary-result/gwas-summary-results-breast-cancer-risk-2017/. For IBD, summary statistics were downloaded from ftp://ftp.sanger.ac.uk/pub/consortia/ibdgenetics/iibdgc-trans-ancestry-filtered-summary-stats.tgz. For T2D, summary statistics were downloaded from https://www.diagram-consortium.org/downloads.html. For SCZ and BP, summary statistics were downloaded from https://figshare.com/articles/dataset/cdg2018-bip-scz/14672019?file=28169349 and https://figshare.com/articles/dataset/cdg2018-bip-scz/14672019?file=28169361.

Funding Statement

This work was supported in part by NIH grant NIH GM 134005 and NSF grants DMS 1713120 and 1902903 (H.Z.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460(7256):748–52. Epub 2009/07/03. doi: 10.1038/nature08185; PubMed Central PMCID: PMC3912837.
2. Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. 2018;50(9):1219–24. Epub 2018/08/15. doi: 10.1038/s41588-018-0183-z; PubMed Central PMCID: PMC6128408.
3. Vilhjalmsson BJ, Yang J, Finucane HK, Gusev A, Lindstrom S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. American journal of human genetics. 2015;97(4):576–92. Epub 2015/10/03. doi: 10.1016/j.ajhg.2015.09.001; PubMed Central PMCID: PMC4596916.
4. Privé F, Arbel J, Vilhjálmsson BJ. LDpred2: better, faster, stronger. Bioinformatics. 2020. Epub 2020/12/17. doi: 10.1093/bioinformatics/btaa1029.
5. Lloyd-Jones LR, Zeng J, Sidorenko J, Yengo L, Moser G, Kemper KE, et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nature communications. 2019;10(1):5086. doi: 10.1038/s41467-019-12653-0.
6. Ge T, Chen CY, Ni Y, Feng YA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature communications. 2019;10(1):1776. Epub 2019/04/18. doi: 10.1038/s41467-019-09718-5; PubMed Central PMCID: PMC6467998.
7. Zhang Y, Qi G, Park J-H, Chatterjee N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nature Genetics. 2018. doi: 10.1038/s41588-018-0193-x.
8. Ferguson TS. A Bayesian Analysis of Some Nonparametric Problems. Ann Statist. 1973;1(2):209–30. doi: 10.1214/aos/1176342360.
9. Zeng P, Zhou X. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nature communications. 2017;8(1):456. Epub 2017/09/08. doi: 10.1038/s41467-017-00470-2; PubMed Central PMCID: PMC5587666.
10. Berisa T, Pickrell JK. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016;32(2):283–5. Epub 2015/09/24. doi: 10.1093/bioinformatics/btv546; PubMed Central PMCID: PMC4731402.
11. Gelman A, van Dyk DA, Huang Z, Boscardin JW. Using Redundant Parameterizations to Fit Hierarchical Models. Journal of Computational and Graphical Statistics. 2008;17(1):95–122. doi: 10.1198/106186008X287337.
12. Zhu X, Stephens M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann Appl Stat. 2017;11(3):1561–92. Epub 2018/02/06. doi: 10.1214/17-aoas1046; PubMed Central PMCID: PMC5796536.
13. Gelman A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal. 2006;1(3):515–34. doi: 10.1214/06-BA117A.
14. Ishwaran H, James LF. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association. 2001;96(453):161–73. doi: 10.1198/016214501750332758.
15. Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi: 10.1038/nature15393.
16. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. Epub 2018/10/12. doi: 10.1038/s41586-018-0579-z.
17. Willer CJ, Schmidt EM, Sengupta S, Peloso GM, Gustafsson S, Kanoni S, et al. Discovery and refinement of loci associated with lipid levels. Nat Genet. 2013;45(11):1274–83. Epub 2013/10/08. doi: 10.1038/ng.2797; PubMed Central PMCID: PMC3838666.
18. Speed D, Balding DJ. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat Genet. 2019;51(2):277–84. Epub 2018/12/05. doi: 10.1038/s41588-018-0279-5; PubMed Central PMCID: PMC6485398.
19. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997–1004. Epub 2001/04/21. doi: 10.1111/j.0006-341x.1999.00997.x.
20. Schafer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical applications in genetics and molecular biology. 2005;4:Article32. Epub 2006/05/02. doi: 10.2202/1544-6115.1175.
21. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics. 2007;81(3):559–75. Epub 2007/08/19. doi: 10.1086/519795; PubMed Central PMCID: PMC1950838.
22. Mak TSH, Porsch RM, Choi SW, Zhou X, Sham PC. Polygenic scores via penalized regression on summary statistics. Genet Epidemiol. 2017;41(6):469–80. Epub 2017/05/10. doi: 10.1002/gepi.22050.
23. Yang S, Zhou X. Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. American journal of human genetics. 2020;106(5):679–93. Epub 2020/04/25. doi: 10.1016/j.ajhg.2020.03.013; PubMed Central PMCID: PMC7212266.
24. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics. 2011;88(1):76–82. Epub 2010/12/21. doi: 10.1016/j.ajhg.2010.11.011; PubMed Central PMCID: PMC3014363.
25. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. Epub 2015/02/28. doi: 10.1186/s13742-015-0047-8; PubMed Central PMCID: PMC4342193.
26. Yengo L, Sidorenko J, Kemper KE, Zheng Z, Wood AR, Weedon MN, et al. Meta-analysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry. Hum Mol Genet. 2018;27(20):3641–9. Epub 2018/08/21. doi: 10.1093/hmg/ddy271; PubMed Central PMCID: PMC6488973.
27. Zheng J, Erzurumluoglu AM, Elsworth BL, Kemp JP, Howe L, Haycock PC, et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics. 2017;33(2):272–9. Epub 2016/11/03. doi: 10.1093/bioinformatics/btw613; PubMed Central PMCID: PMC5542030.
28. Lijoi A, Prünster I, Walker SG. On Consistency of Nonparametric Normal Mixtures for Bayesian Density Estimation. Journal of the American Statistical Association. 2005;100(472):1292–6. doi: 10.1198/016214505000000358.
29. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet. 2014;46(11):1173–86. Epub 2014/10/06. doi: 10.1038/ng.3097; PubMed Central PMCID: PMC4250049.
30. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518(7538):197–206. Epub 2015/02/13. doi: 10.1038/nature14177; PubMed Central PMCID: PMC4382211.
31. Mehta NN. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Circ Cardiovasc Genet. 2011;4(3):327–9. Epub 2011/06/16. doi: 10.1161/CIRCGENETICS.111.960443; PubMed Central PMCID: PMC3125595.
32. Michailidou K, Lindström S, Dennis J, Beesley J, Hui S, Kar S, et al. Association analysis identifies 65 new breast cancer risk loci. Nature. 2017;551(7678):92–4. Epub 2017/10/24. doi: 10.1038/nature24284; PubMed Central PMCID: PMC5798588.
33. Liu JZ, van Sommeren S, Huang H, Ng SC, Alberts R, Takahashi A, et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat Genet. 2015;47(9):979–86. Epub 2015/07/21. doi: 10.1038/ng.3359; PubMed Central PMCID: PMC4881818.
34. Scott RA, Scott LJ, Mägi R, Marullo L, Gaulton KJ, Kaakinen M, et al. An Expanded Genome-Wide Association Study of Type 2 Diabetes in Europeans. Diabetes. 2017;66(11):2888–902. Epub 2017/06/02. doi: 10.2337/db16-1253; PubMed Central PMCID: PMC5652602.
35. Genomic Dissection of Bipolar Disorder and Schizophrenia, Including 28 Subphenotypes. Cell. 2018;173(7):1705–15.e16. Epub 2018/06/16. doi: 10.1016/j.cell.2018.05.046; PubMed Central PMCID: PMC6432650.
36. Chun S, Imakaev M, Hui D, Patsopoulos NA, Neale BM, Kathiresan S, et al. Non-parametric Polygenic Risk Prediction via Partitioned GWAS Summary Statistics. The American Journal of Human Genetics. 2020;107(1):46–59. doi: 10.1016/j.ajhg.2020.05.004.
37. So HC, Sham PC. Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Scientific reports. 2017;7:41262. Epub 2017/02/02. doi: 10.1038/srep41262; PubMed Central PMCID: PMC5286518.
38. Nagpal S, Meng X, Epstein MP, Tsoi LC, Patrick M, Gibson G, et al. TIGAR: An Improved Bayesian Tool for Transcriptomic Data Imputation Enhances Gene Mapping of Complex Traits. The American Journal of Human Genetics. 2019;105(2):258–66. doi: 10.1016/j.ajhg.2019.05.018.
39. Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genome-wide association studies. American journal of human genetics. 2011;88(3):294–305. Epub 2011/03/08. doi: 10.1016/j.ajhg.2011.02.002; PubMed Central PMCID: PMC3059431.
40. Chatterjee N, Wheeler B, Sampson J, Hartge P, Chanock SJ, Park JH. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet. 2013;45(4):400–5, 5e1-3. Epub 2013/03/05. doi: 10.1038/ng.2579; PubMed Central PMCID: PMC3729116.
41. Hu Y, Lu Q, Powles R, Yao X, Yang C, Fang F, et al. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS computational biology. 2017;13(6):e1005589. doi: 10.1371/journal.pcbi.1005589.
42. Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nature communications. 2019;10(1):3328. doi: 10.1038/s41467-019-11112-0.
43. Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019;51(4):584–91. Epub 2019/03/31. doi: 10.1038/s41588-019-0379-x; PubMed Central PMCID: PMC6563838.

PLoS Genet. doi: 10.1371/journal.pgen.1009697.r001

Decision Letter 0

David Balding, Doug Speed

14 Mar 2021

Dear Dr Zhao,

Thank you very much for submitting your Methods entitled 'A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics' to PLOS Genetics. We apologise for the slow turnaround, which was largely caused by a difficulty finding suitable reviewers.

Your paper has now been reviewed by three expert reviewers, who provide comments below. While they had positive comments, they also raised important concerns, which mean that the paper cannot be accepted in its current state. However, we are willing to consider a revised version that addresses the comments of the reviewers. In particular, you will note that they have requested that additional methods be included in the comparisons, including DPR, LassoSum and LDPred2. Further, they have questions about the robustness of your method, including whether it is necessary to first exclude large effects.

Should you decide to revise the manuscript for further consideration here, your revisions should address all the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Doug Speed

Guest Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: Zhou and Zhao present SDPR, a fast and robust Bayesian nonparametric method for building polygenic scores based on summary statistics. Lately, many Bayesian methods have been developed to derive polygenic scores based on summary statistics as well. They all are very similar in the algorithm and implementation they use, with the exception that they assume different prior distributions. At first sight, SDPR does not seem very different.

Therefore there are a few comments/questions I would like the authors to address so that I can better understand what SDPR adds to the current literature. Then and only then could I recommend SDPR to be published in PLOS Genetics.

Major comments (in no particular order):

- I first want to congratulate the authors for sharing their analysis code on GitHub.

- The authors claim that their method is more robust to different architectures because they use Dirichlet processes. Previously, PRS-CS claimed the same thing by introducing continuous shrinkage priors, as did SBayesR by using a point mass at 0 and a mixture of 3 normal distributions (instead of only one, as in e.g. LDpred). The authors present a hierarchical Bayesian framework similar (for non-Bayesian experts) to the one used in PRS-CS and present the Dirichlet process as an infinite Gaussian mixture model. Therefore I wonder how different SDPR is from PRS-CS and SBayesR. I would urge the authors to explain these differences in more detail and how they think this could make SDPR more robust to different architectures than other published methods.

- Sometimes I wonder if allowing too much flexibility in the model is really a good thing. I guess more flexibility can hinder convergence or even make fitting diverge, as is sometimes the case in SBayesR. Is this why LDpred works better when summary statistics come from a GWAS with a small sample size? I guess that SDPR would then work only with GWAS summary statistics from large samples? The authors could comment on this to expand the Discussion.

- Again, maybe a naive question, but the authors present their method as nonparametric, yet I see that many parametric distributions are assumed as priors for everything. What does “nonparametric” mean here? How is SDPR less parametric than e.g. LDpred2-auto, which just assumes that $\beta \sim N(0, h^2/(Mp))$ and $p \sim U(0,1)$, and estimates $h^2 = \beta^T R \beta$?

- The authors say they “refine the commonly used likelihood assumption to deal with the discrepancy between summary statistics and external reference panel”. In my opinion, this is a very interesting point of this paper. However, the authors use simulation scenarios that do not allow this to be shown at all. Clear comparisons that show how this change makes SDPR more robust to the discrepancy between summary statistics and external reference panels would be well received. An ultimate test (which might be too difficult) would be to use a reference panel from individuals of e.g. South Asian ancestry, or at least a mixture of European and South Asian ancestry.

- On the same point, is quality control not enough to control the discrepancy between summary statistics and external reference panel? E.g. the ones proposed in LDpred2 (https://doi.org/10.1093/bioinformatics/btaa1029) or in DENTIST (https://doi.org/10.1101/2020.07.09.196535), or maybe even better, using both.

- Again talking about robustness of methods, the authors remove the MHC region, which is usually performed for methods that are not robust to long-range LD regions. Is it the case for SDPR? How much signal is lost when removing this region? SNPs with extremely large effects are also removed, which should be the ones that are most useful for prediction. I wonder then if SDPR is really a robust method. By the way, these two removals of SNPs should never be called “quality control”.

- A simulation scenario with p=0.1 (and even p=1) should be added since many traits are thought to be very polygenic.

- Two methods that have been shown to perform very well in the literature, namely lassosum and LDpred2, are missing from the comparisons.

Minor comments:

- Figures could be made nicer and more readable by using ‘+ theme_bw(16)’ (or even 18 if necessary).

- Maybe report the AUC of the PGS only, not of the full model, for better comparison of methods, and starting the y-axis at 0.5.

- In the introduction, what does “reparameterization” refer to here?

- Is correction for genomic control really necessary if using Z-Scores instead of p-values?

Reviewer #2: The manuscript describes a summary statistics-based non-parametric method, SDPR, for genetic prediction of complex traits. The authors applied SDPR in four simulation settings and in applications to twelve traits in the UK Biobank. In the simulations, SDPR performs quite similarly to SBayesR but outperforms the other methods. In the UK Biobank applications, SDPR outperforms the other methods in more than half of the examined traits. Overall, I think this is a very nice method that adapts DPR to summary statistics and large-scale data applications. It is a timely contribution to the PRS field. The developed method has the potential to be widely used. I only have a few main comments:

1. In our experience, the four compared methods (PRS-CS, SBayesR, LDpred, P+T) are not among the most accurate PRS methods and are usually quite easy to outperform. It would be useful to add comparisons with lassosum and DBSLMM, both of which work quite well across a range of settings.

2. As a related note, I am a bit surprised to see that SDPR performs similarly to SBayesR in all simulation settings. It would be helpful to identify some simulation settings where SDPR can clearly outperform SBayesR. SDPR is a polygenic model that assumes all SNPs to have non-zero effects, while SBayesR is a sparse model that sets a proportion of SNPs to have zero effects. Therefore, it might be helpful to explore a few polygenic settings where all or a large fraction of SNPs have non-zero effects. This way, the new simulation results will become well aligned with the current real data results.

3. Given that SDPR is a summary statistics version of DPR, it would be beneficial to compare SDPR with DPR in some small-scale simulations. These simulations can help benchmark the computational gains brought by SDPR over DPR and evaluate the potential accuracy loss in SDPR as compared to DPR.

Minor comment:

1. On line 56-357 on page 20, the authors mentioned that "However, these methods either require external training datasets or do no account for LD". This statement about ref 9 (DPR) does not appear to be accurate. As far as I am aware, DPR does not require an external training dataset and does account for LD; it just does not model summary statistics as SDPR does.

Reviewer #3: The authors present a non-parametric PRS based on a new idea for data-driven adaptive modeling of the underlying effect size distribution. Extending previous work on Dirichlet process regression, which required individual-level genotype data, they propose a new MCMC algorithm that allows a PRS to be trained with summary-level GWAS data and a reference LD panel. The performance of the new approach is benchmarked in simulated and real data. The work is solid and will be of great interest to the PLoS Genetics readership if the following concerns are addressed.

Simulation scenarios: I am somewhat disappointed that the new method does not outperform SBayesR in simulation. Does SDPR outperform SBayesR when the effect sizes are sampled from a distribution that is very different from the BayesR model? SDPR performs very robustly in real data; however, it's difficult to know whether this robustness comes from the flexible non-parametric prior or some other features of the algorithm.

PRS-CS: I think PRS-CS authors share code to prepare custom LD matrices upon request. Comparisons to PRS-CS are particularly important here because it is more similar to SDPR than SBayesR is. I agree that the Dirichlet process would be more adaptive and flexible than PRS-CS's Strawderman-Berger prior on $\sigma^2$, but it needs to be shown clearly in which conditions this leads to higher accuracy.

Extremely large effects: the authors "excluded extremely large effects (z^2 > 80) to improve the convergence of MCMC" (p. 10) along with MHC. I think this decision is understandable and practical, but I wonder whether SDPR is less accurate in handling large-effect tails than other methods.
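The exclusion rule quoted above can be sketched in a few lines of Python. This is only an illustration of the z^2 > 80 threshold applied to a vector of GWAS z-scores; the function name and return format are hypothetical, it is not taken from the SDPR code base, and the removal of the MHC region is not shown.

```python
import numpy as np


def drop_large_effects(z, threshold=80.0):
    """Return the z-scores with z^2 <= threshold and the boolean keep mask."""
    z = np.asarray(z, dtype=float)
    keep = z ** 2 <= threshold  # SNPs with z^2 > threshold are excluded
    return z[keep], keep


if __name__ == "__main__":
    z_scores = [1.0, -9.5, 3.0, 8.9]  # made-up values for illustration
    kept, mask = drop_large_effects(z_scores)
    print(kept, mask)
```

Returning the mask alongside the filtered values makes it possible to apply the same exclusion to the matching rows of a reference LD matrix.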

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

PLoS Genet. 2021 Jul 26;17(7):e1009697. doi: 10.1371/journal.pgen.1009697.r002

Author response to Decision Letter 0

25 May 2021

Attachment

Submitted filename: Reviewer.pdf

Additional data file (109.9KB, pdf)

PLoS Genet. doi: 10.1371/journal.pgen.1009697.r003

Decision Letter 1

David Balding, Doug Speed

24 Jun 2021

Dear Dr Zhao,

Thank you very much for submitting a revised version of your Methods entitled 'A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics' to PLOS Genetics.

Your article has now been reviewed by the original three reviewers. You will see from their comments that they are all satisfied that you addressed their original comments. However, Reviewer 2 has a question about your GitHub page and version consistency. Therefore, please address this new comment of Reviewer 2 (as well as correct the typos noted by Reviewer 1).

Specifically, we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Doug Speed

Guest Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: I thank the authors for their revised work and do not have any further comment.

Just two typos:

- Hammond -> Hadamard

- ldtect -> ldetect

Reviewer #2: All my comments are well addressed and the paper is ready to publish. I just want to bring up one minor technical inconsistency between the GitHub code and the main text. On line 177, the authors listed a set of hyper-parameter choices that were used for fitting DBSLMM, which implies that the tuning version of DBSLMM was used. However, based on the GitHub code, it seems that the automatic version of DBSLMM was used, which estimates all hyper-parameters automatically from the training data (no validation data was used in the fitting code there). The automatic version has slightly worse performance than the tuning version, but comparing to either version is completely fine and would demonstrate the benefits of SDPR. I don't know if I went to the wrong GitHub site; if not, it would be important to keep the main text consistent with the GitHub code.

Reviewer #3: The authors addressed all concerns.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

PLoS Genet. 2021 Jul 26;17(7):e1009697. doi: 10.1371/journal.pgen.1009697.r004

Author response to Decision Letter 1

1 Jul 2021

Attachment

Submitted filename: response.letter.docx

Additional data file (37.7KB, docx)

PLoS Genet. doi: 10.1371/journal.pgen.1009697.r005

Decision Letter 2

David Balding, Doug Speed

5 Jul 2021

Dear Dr Zhao,

Thank you for making the requested changes to your article "A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics", and for explaining the GitHub concerns. I am happy to say your article has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Doug Speed

Guest Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly:

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-01887R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

PLoS Genet. doi: 10.1371/journal.pgen.1009697.r006

Acceptance letter

David Balding, Doug Speed

20 Jul 2021

PGENETICS-D-20-01887R2

A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics

Dear Dr Zhao,

We are pleased to inform you that your manuscript entitled "A fast and robust Bayesian nonparametric method for prediction of complex traits using summary statistics" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Agota Szep

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data


Supplementary Materials

S1 Text. Supplementary note explaining the methods in detail.

(DOCX)

Additional data file (6.2MB, docx)


Data Availability Statement

References

1. Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460(7256):748–52. Epub 2009/07/03. doi: 10.1038/nature08185; PubMed Central PMCID: PMC3912837.

2. Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. 2018;50(9):1219–24. Epub 2018/08/15. doi: 10.1038/s41588-018-0183-z; PubMed Central PMCID: PMC6128408.

3. Vilhjalmsson BJ, Yang J, Finucane HK, Gusev A, Lindstrom S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. American journal of human genetics. 2015;97(4):576–92. Epub 2015/10/03. doi: 10.1016/j.ajhg.2015.09.001; PubMed Central PMCID: PMC4596916.

4. Privé F, Arbel J, Vilhjálmsson BJ. LDpred2: better, faster, stronger. Bioinformatics. 2020. Epub 2020/12/17. doi: 10.1093/bioinformatics/btaa1029.

5. Lloyd-Jones LR, Zeng J, Sidorenko J, Yengo L, Moser G, Kemper KE, et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nature communications. 2019;10(1):5086. doi: 10.1038/s41467-019-12653-0.

6. Ge T, Chen CY, Ni Y, Feng YA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature communications. 2019;10(1):1776. Epub 2019/04/18. doi: 10.1038/s41467-019-09718-5; PubMed Central PMCID: PMC6467998.

7. Zhang Y, Qi G, Park J-H, Chatterjee N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nature Genetics. 2018. doi: 10.1038/s41588-018-0193-x.

8. Ferguson TS. A Bayesian Analysis of Some Nonparametric Problems. Ann Statist. 1973;1(2):209–30. doi: 10.1214/aos/1176342360.

9. Zeng P, Zhou X. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nature communications. 2017;8(1):456. Epub 2017/09/08. doi: 10.1038/s41467-017-00470-2; PubMed Central PMCID: PMC5587666.

10. Berisa T, Pickrell JK. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics. 2016;32(2):283–5. Epub 2015/09/24. doi: 10.1093/bioinformatics/btv546; PubMed Central PMCID: PMC4731402.

11. Gelman A, van Dyk DA, Huang Z, Boscardin JW. Using Redundant Parameterizations to Fit Hierarchical Models. Journal of Computational and Graphical Statistics. 2008;17(1):95–122. doi: 10.1198/106186008X287337.

12. Zhu X, Stephens M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann Appl Stat. 2017;11(3):1561–92. Epub 2018/02/06. doi: 10.1214/17-aoas1046; PubMed Central PMCID: PMC5796536.

13. Gelman A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal. 2006;1(3):515–34. doi: 10.1214/06-BA117A.

14. Ishwaran H, James LF. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association. 2001;96(453):161–73. doi: 10.1198/016214501750332758.

15. Auton A, Abecasis GR, Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi: 10.1038/nature15393.

16. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–9. Epub 2018/10/12. doi: 10.1038/s41586-018-0579-z.

17. Willer CJ, Schmidt EM, Sengupta S, Peloso GM, Gustafsson S, Kanoni S, et al. Discovery and refinement of loci associated with lipid levels. Nat Genet. 2013;45(11):1274–83. Epub 2013/10/08. doi: 10.1038/ng.2797; PubMed Central PMCID: PMC3838666.

18. Speed D, Balding DJ. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat Genet. 2019;51(2):277–84. Epub 2018/12/05. doi: 10.1038/s41588-018-0279-5; PubMed Central PMCID: PMC6485398.

19. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997–1004. Epub 2001/04/21. doi: 10.1111/j.0006-341x.1999.00997.x.

20. Schafer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical applications in genetics and molecular biology. 2005;4:Article32. Epub 2006/05/02. doi: 10.2202/1544-6115.1175.

21. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. American journal of human genetics. 2007;81(3):559–75. Epub 2007/08/19. doi: 10.1086/519795; PubMed Central PMCID: PMC1950838.

22. Mak TSH, Porsch RM, Choi SW, Zhou X, Sham PC. Polygenic scores via penalized regression on summary statistics. Genet Epidemiol. 2017;41(6):469–80. Epub 2017/05/10. doi: 10.1002/gepi.22050.

23. Yang S, Zhou X. Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. American journal of human genetics. 2020;106(5):679–93. Epub 2020/04/25. doi: 10.1016/j.ajhg.2020.03.013; PubMed Central PMCID: PMC7212266.

24. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. American journal of human genetics. 2011;88(1):76–82. Epub 2010/12/21. doi: 10.1016/j.ajhg.2010.11.011; PubMed Central PMCID: PMC3014363.

25. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. Epub 2015/02/28. doi: 10.1186/s13742-015-0047-8; PubMed Central PMCID: PMC4342193.

26. Yengo L, Sidorenko J, Kemper KE, Zheng Z, Wood AR, Weedon MN, et al. Meta-analysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry. Hum Mol Genet. 2018;27(20):3641–9. Epub 2018/08/21. doi: 10.1093/hmg/ddy271; PubMed Central PMCID: PMC6488973.

27. Zheng J, Erzurumluoglu AM, Elsworth BL, Kemp JP, Howe L, Haycock PC, et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics. 2017;33(2):272–9. Epub 2016/11/03. doi: 10.1093/bioinformatics/btw613; PubMed Central PMCID: PMC5542030.

28. Lijoi A, Prünster I, Walker SG. On Consistency of Nonparametric Normal Mixtures for Bayesian Density Estimation. Journal of the American Statistical Association. 2005;100(472):1292–6. doi: 10.1198/016214505000000358.

29. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet. 2014;46(11):1173–86. Epub 2014/10/06. doi: 10.1038/ng.3097; PubMed Central PMCID: PMC4250049.

30. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518(7538):197–206. Epub 2015/02/13. doi: 10.1038/nature14177; PubMed Central PMCID: PMC4382211.

31. Mehta NN. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Circ Cardiovasc Genet. 2011;4(3):327–9. Epub 2011/06/16. doi: 10.1161/CIRCGENETICS.111.960443; PubMed Central PMCID: PMC3125595.

32. Michailidou K, Lindström S, Dennis J, Beesley J, Hui S, Kar S, et al. Association analysis identifies 65 new breast cancer risk loci. Nature. 2017;551(7678):92–4. Epub 2017/10/24. doi: 10.1038/nature24284; PubMed Central PMCID: PMC5798588.

33. Liu JZ, van Sommeren S, Huang H, Ng SC, Alberts R, Takahashi A, et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat Genet. 2015;47(9):979–86. Epub 2015/07/21. doi: 10.1038/ng.3359; PubMed Central PMCID: PMC4881818.

34. Scott RA, Scott LJ, Mägi R, Marullo L, Gaulton KJ, Kaakinen M, et al. An Expanded Genome-Wide Association Study of Type 2 Diabetes in Europeans. Diabetes. 2017;66(11):2888–902. Epub 2017/06/02. doi: 10.2337/db16-1253; PubMed Central PMCID: PMC5652602.

35. Genomic Dissection of Bipolar Disorder and Schizophrenia, Including 28 Subphenotypes. Cell. 2018;173(7):1705–15.e16. Epub 2018/06/16. doi: 10.1016/j.cell.2018.05.046; PubMed Central PMCID: PMC6432650.

36. Chun S, Imakaev M, Hui D, Patsopoulos NA, Neale BM, Kathiresan S, et al. Non-parametric Polygenic Risk Prediction via Partitioned GWAS Summary Statistics. The American Journal of Human Genetics. 2020;107(1):46–59. doi: 10.1016/j.ajhg.2020.05.004.

37. So HC, Sham PC. Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Scientific reports. 2017;7:41262. Epub 2017/02/02. doi: 10.1038/srep41262; PubMed Central PMCID: PMC5286518.

38. Nagpal S, Meng X, Epstein MP, Tsoi LC, Patrick M, Gibson G, et al. TIGAR: An Improved Bayesian Tool for Transcriptomic Data Imputation Enhances Gene Mapping of Complex Traits. The American Journal of Human Genetics. 2019;105(2):258–66. doi: 10.1016/j.ajhg.2019.05.018.

39. Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genome-wide association studies. American journal of human genetics. 2011;88(3):294–305. Epub 2011/03/08. doi: 10.1016/j.ajhg.2011.02.002; PubMed Central PMCID: PMC3059431.

40. Chatterjee N, Wheeler B, Sampson J, Hartge P, Chanock SJ, Park JH. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet. 2013;45(4):400–5, 5e1-3. Epub 2013/03/05. doi: 10.1038/ng.2579; PubMed Central PMCID: PMC3729116.

41. Hu Y, Lu Q, Powles R, Yao X, Yang C, Fang F, et al. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS computational biology. 2017;13(6):e1005589. doi: 10.1371/journal.pcbi.1005589.

42. Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nature communications. 2019;10(1):3328. doi: 10.1038/s41467-019-11112-0.

43. Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019;51(4):584–91. Epub 2019/03/31. doi: 10.1038/s41588-019-0379-x; PubMed Central PMCID: PMC6563838.


