Author manuscript; available in PMC: 2020 Dec 8.
Published in final edited form as: Genet Epidemiol. 2020 Jul 21;44(7):676–686. doi: 10.1002/gepi.22339

A principal component approach to improve association testing with polygenic risk scores

Brandon J Coombes 1,*, Alexander Ploner 2, Sarah E Bergen 2, Joanna M Biernacka 1,3
PMCID: PMC7722089  NIHMSID: NIHMS1637380  PMID: 32691445

Abstract

Polygenic risk scores (PRSs) have become an increasingly popular approach for demonstrating polygenic influences on complex traits and for establishing common polygenic signals between different traits. PRSs are typically constructed using pruning and thresholding (P+T), but the best choice of parameters is uncertain; thus multiple settings are used and the best is chosen. Optimization can lead to inflated type I error. Permutation procedures can correct this, but they can be computationally intensive. Alternatively, a single parameter setting can be chosen a priori for the PRS, but choosing suboptimal settings results in loss of power. We propose computing PRSs under a range of parameter settings, performing principal component analysis (PCA) on the resulting set of PRSs, and using the first PRS-PC in association tests. The first PC reweights the variants included in the PRS to achieve maximum variation over all PRS settings used. Using simulations and a real data application to study PRS association with bipolar disorder and psychosis in bipolar disorder, we compare the performance of the proposed PRS-PCA approach with a permutation test and an a priori selected p-value threshold. The PRS-PCA approach is simple to implement, outperforms the other strategies in most scenarios, and provides an unbiased estimate of prediction performance.

Keywords: polygenic risk scores, principal component analysis, weighting, permutation

Introduction

Polygenic risk scores (PRSs) have become an increasingly popular tool in genetics research. PRSs leverage summary statistics from previous genome-wide association studies (GWASs) to predict risk for individuals in a new population. If the individuals’ predicted risk is associated with their phenotype, this approach provides evidence of polygenic effects even when no genome-wide significant variants exist. When a PRS for one trait is associated with another trait, this approach can be used to establish common polygenic signals between two different traits (Torkamani et al., 2018).

A PRS is a weighted sum of an individual’s alleles, where allele weights are estimated based on their effects in a GWAS in a different sample (International Schizophrenia Consortium et al., 2009). A simple summation across single nucleotide polymorphisms (SNPs) while ignoring the linkage disequilibrium (LD) among them would not be appropriate because trait-associated regions with high LD would be over-weighted. There are several approaches that account for LD in PRS construction. The most common approach, the so-called “pruning-and-thresholding” (P+T) method, constructs the PRS by first removing SNPs in high LD to obtain a set of roughly independent SNPs (pruning) and then including only SNPs that have a p-value below a certain value (thresholding) (Choi & O’Reilly, 2019; International Schizophrenia Consortium et al., 2009). Other methods use penalization to shrink most of the SNP effects to zero (Mak et al., 2017) or use a Bayesian prior that incorporates the LD structure to place downward bias on all of the SNP effects (Ge et al., 2019; Vilhjálmsson et al., 2015).

Regardless of the method, construction of a PRS requires specification of tuning parameters, such as the pruning and thresholding parameters in the P+T method. Typically, PRS analysis involves constructing multiple PRSs across a range of the tuning parameters, followed by selection of the optimal PRS for prediction (i.e. the one that gives the strongest evidence for association). For example in P+T, researchers construct the PRS using around ten different thresholds, and then use all generated PRSs in the analysis. This optimization can inflate the probability of a type I error if the multiple testing inherent in choosing the best PRS is not accounted for. Inflated type I error can be guarded against by using permutations to evaluate significance of the selected PRS; we refer to this approach as Opt-Perm. Here, while the p-value would be corrected for multiple testing, the optimized PRS may still be over-fit and thus the corresponding R2 value, which measures the proportion of variation in the trait explained by the PRS, would be inflated. Moreover, permutation procedures can become quite computationally intensive. It has also been proposed to use external or internal validation to choose tuning parameters and avoid permutations. However, external validation datasets are often not available, especially for rarely-studied phenotypes (Mak et al., 2018), and in smaller samples, splitting the data into training and validation sets can decrease power. As an alternative to the optimization approach, one could a priori choose a single tuning parameter setting (e.g. fixing the p-value threshold and LD pruning level) to construct a single PRS. This approach has been used recently in investigations to test for associations of one PRS with many phenotypes (Richardson et al., 2019; Zheutlin et al., 2019). By not optimizing over a set of tuning parameters for each test of association, this strategy avoids further increasing the multiple testing and computation time. However, a sub-optimal PRS may be selected, leading to poor prediction and low power to test for association of the PRS with the trait.

Here, we instead compute PRSs over a range of tuning parameter settings, perform principal component analysis (PCA) on the set of PRSs, and use only the first PRS-PC for association testing. The first PC captures the largest amount of variation in the computed PRSs and thus could have better discrimination of the phenotype we are testing. This strategy was recently implemented in studies of polygenic effects on schizophrenia as well as brain imaging (Alnæs et al., 2019; Bergen et al., 2019; De Lange et al., 2019; Maglanoc et al., 2019). This unsupervised approach incorporates all computed scores across a range of tuning parameters and, importantly, is agnostic regarding the outcome of interest and thereby maintains correct type I error. Additionally, the PRS-PCA approach produces a score that is not overfit, which can be used to assess predictive performance of the PRS using measures such as R2 or area under the receiver operating characteristic curve (AUC).

Here, we assess the statistical properties of the proposed method in the context of P+T PRS analysis. We begin by constructing PRSs using the P+T approach across a range of p-value thresholds. We then compare the performance of the PRS-PCA approach with the Opt-Perm approach and a priori selection of the p-value threshold tuning parameter. Using simulations and analysis of the Mayo Clinic Bipolar Disorder (BD) Biobank data, we show that the PRS-PCA approach maintains correct type I error and outperforms the other PRS strategies in most scenarios. With the increasing availability of GWAS summary statistics, application of PRS strategies to test for association between many different PRSs and multiple phenotypes is becoming common (B. Coombes et al., 2020; Grigoroiu-Serbanescu et al., 2020; Zheutlin et al., 2019). The PRS-PCA approach can substantially reduce multiple testing without suffering loss of power and is easy to implement, making it especially beneficial in this context.

Methods

Polygenic risk scores

Let $G_{ij}$ denote the number of copies of the reference allele for the $j$th SNP for the $i$th individual, possibly estimated via imputation. Let $\hat{\beta}_j$ and $p_j$ be the estimated effect and p-value, respectively, for the $j$th SNP. The PRS for the $i$th individual is then $\sum_{j=1}^{J} G_{ij}\hat{\beta}_j$ over a set of $J$ markers. Using GWAS summary statistics from a prior analysis of a trait of interest and LD structure estimated either from a reference panel or the target data, the set of $J$ markers to include in the sum and their estimated effects are usually chosen using a P+T strategy (Choi & O’Reilly, 2019). Briefly, P+T “prunes” markers throughout the genome to obtain approximately independent SNPs and only uses SNPs with p-values below a certain threshold to estimate the PRS. This approach can be optimized over different p-value thresholds, clump sizes, and LD measures to determine approximate independence. Typically, researchers a priori select the level of pruning and optimize the p-value threshold over a set of $K$ constructed PRSs, $\mathrm{PRS}_1, \ldots, \mathrm{PRS}_K$, based on thresholds $t_1 < t_2 < \cdots < t_K$. Each PRS can be written as

$$\mathrm{PRS}_k = \sum_{j=1}^{J} G_{ij}\,\beta_j^{(k)}$$

where $\beta_j^{(k)} = \hat{\beta}_j\, I_{[0,t_k]}(p_j)$ is a hard-threshold version of the effect size $\hat{\beta}_j$ and $I_{[0,t_k]}(p_j)$ is the indicator of whether $p_j \le t_k$. However, by searching for the best tuning parameter setting, the PRS can be overfit to the target data and the test of association of the PRS with a trait can have inflated type I error. This can be corrected by using permutations to generate empirical p-values for association with the optimized PRS.
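The hard-threshold construction described above can be sketched in a few lines. This is a minimal illustration only: the function name `threshold_prs` and the toy genotypes, effects, and p-values are hypothetical, the SNPs are assumed to be already pruned, and in practice the scores would come from dedicated software such as PRSice2.

```python
# Toy sketch of P+T scoring; all names and values are illustrative assumptions.
def threshold_prs(genotypes, betas, pvalues, thresholds):
    """Return one list of PRS values (one per individual) for each threshold.

    genotypes: per-individual lists of allele counts (G_ij), SNPs already pruned
    betas:     estimated SNP effects from the GWAS summary statistics
    pvalues:   GWAS p-values for the same SNPs
    """
    scores = []
    for t in thresholds:
        # Hard threshold: keep an effect only when its p-value is at most t.
        kept = [b if p <= t else 0.0 for b, p in zip(betas, pvalues)]
        scores.append([sum(g * b for g, b in zip(row, kept)) for row in genotypes])
    return scores

genotypes = [[0, 1, 2], [2, 0, 1], [1, 1, 0]]   # 3 individuals x 3 SNPs
betas = [0.5, -0.2, 0.1]
pvalues = [1e-9, 0.003, 0.4]
thresholds = [5e-8, 0.01, 1.0]

prs_by_threshold = threshold_prs(genotypes, betas, pvalues, thresholds)
```

Each additional threshold can only add SNPs to the score, which is why PRSs computed at neighbouring thresholds tend to be strongly correlated.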

PRS-PCA approach

Instead of using the target data to choose the best PRS, we propose using an unsupervised approach to construct a single PRS from the set of PRSs ($\mathrm{PRS}_1, \ldots, \mathrm{PRS}_K$) computed over a range of tuning parameter values (e.g. p-value thresholds). Because we are not interested in the actual scale of the PRS, we perform a PCA on the correlation matrix rather than the covariance matrix. This is equivalent to performing PCA on the matrix of standardized PRSs

$$\overline{\mathrm{PRS}}_k = \frac{\mathrm{PRS}_k - m_k}{s_k}$$

where $m_k$ and $s_k$ are the mean and standard deviation of the $k$th PRS. The PCs are simply weighted sums of the underlying variables $\overline{\mathrm{PRS}}_1, \ldots, \overline{\mathrm{PRS}}_K$. Specifically, we can write the first PC as $\mathrm{PC}_1 = l_1\overline{\mathrm{PRS}}_1 + \cdots + l_K\overline{\mathrm{PRS}}_K$, where $l_k$ are the loadings for the first PC extracted by the PCA.

To demonstrate how PRS-PCA reweights the original effect sizes $\hat{\beta}_j$, we can first rewrite $\mathrm{PC}_1$ as

$$\mathrm{PC}_1 = \sum_{k=1}^{K} l_k\,\overline{\mathrm{PRS}}_k = \sum_{k=1}^{K} l_k\,\frac{\mathrm{PRS}_k - m_k}{s_k} = \sum_{k=1}^{K} \frac{l_k}{s_k}\,\mathrm{PRS}_k - \sum_{k=1}^{K} \frac{l_k m_k}{s_k} = \sum_{k=1}^{K} \tilde{l}_k\,\mathrm{PRS}_k - C$$

where $\tilde{l}_k = l_k / s_k$ are the standardized loadings and $C = \sum_{k=1}^{K} l_k m_k / s_k$ is a constant term depending only on terms that do not vary between subjects. Without loss of generality, we drop the constant and define the PRS-PCA as

$$\mathrm{PRS\text{-}PCA} = \mathrm{PC}_1 + C = \sum_{k=1}^{K} \tilde{l}_k\,\mathrm{PRS}_k = \sum_{k=1}^{K} \tilde{l}_k \sum_{j=1}^{J} G_{ij}\,\beta_j^{(k)} = \sum_{j=1}^{J} G_{ij} \sum_{k=1}^{K} \tilde{l}_k\,\beta_j^{(k)} = \sum_{j=1}^{J} G_{ij}\,\hat{\beta}_j\, w(p_j)$$

where $w(p_j) = \sum_{k=1}^{K} \tilde{l}_k\, I_{[0,t_k]}(p_j)$ is the PRS-PCA weight on $\hat{\beta}_j$, which depends on how many thresholds the p-value is below and the loadings of the PRSs corresponding to those p-value thresholds. Assuming $\tilde{l}_k > 0$ for all $k$, we divide $w(p_j)$ by $\sum_{k=1}^{K} \tilde{l}_k$ to ensure the weights are between zero and one. Thus, we can see that the weight function corresponds to progressive shrinkage of the original effect sizes $\hat{\beta}_j$. The most significant SNPs retain their original effect sizes while SNPs with larger p-values are multiplied by weights that are smaller than one and decrease as a function of the p-value. SNPs with $p_j > t_K$ have their effect size shrunk to zero. Because the loadings used in the weights are from PCA, PRS-PCA captures the greatest variation of the PRSs computed under the different thresholds used. We use this reweighted PRS (i.e. PRS-PCA) to test for association with the phenotype rather than performing K different association tests with PRSs at different thresholds and then choosing the best one.
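The derivation above can be checked numerically on a small example. The sketch below is not the authors' supplementary R code: it uses pure Python, obtains the first-PC loadings by power iteration on the correlation matrix (a stand-in for a standard PCA routine, valid here because the toy PRSs are all positively correlated), and all function names and toy values are hypothetical.

```python
import math

def standardize(values):
    """Return (standardized values, mean, sample SD) for one PRS vector."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values], mean, sd

def first_pc_loadings(std_scores, n_iter=500):
    """Leading eigenvector of the correlation matrix via power iteration."""
    k, n = len(std_scores), len(std_scores[0])
    corr = [[sum(a * b for a, b in zip(std_scores[r], std_scores[c])) / (n - 1)
             for c in range(k)] for r in range(k)]
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(n_iter):
        nxt = [sum(corr[r][c] * vec[c] for c in range(k)) for r in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec

def prs_pca_weight(p, thresholds, scaled_loadings):
    """w(p): sum of standardized loadings over thresholds with p <= t_k, scaled to [0, 1]."""
    passed = sum(l for t, l in zip(thresholds, scaled_loadings) if p <= t)
    return passed / sum(scaled_loadings)

# Toy PRSs for 5 individuals at K = 3 thresholds (illustrative values only).
thresholds = [5e-8, 0.01, 0.5]
prs = [
    [0.0, 1.0, 0.5, 0.2, 0.8],
    [-0.2, 1.0, 0.3, 0.1, 0.9],
    [0.0, 1.1, 0.3, 0.4, 0.7],
]
standardized = [standardize(scores) for scores in prs]
loadings = first_pc_loadings([s[0] for s in standardized])
scaled_loadings = [l / s[2] for l, s in zip(loadings, standardized)]   # l_k / s_k
constant = sum(l * s[1] / s[2] for l, s in zip(loadings, standardized))  # C

n_people = len(prs[0])
pc1 = [sum(l * s[0][i] for l, s in zip(loadings, standardized)) for i in range(n_people)]
prs_pca = [sum(l * scores[i] for l, scores in zip(scaled_loadings, prs)) for i in range(n_people)]
weights = [prs_pca_weight(p, thresholds, scaled_loadings) for p in (1e-9, 0.005, 0.2, 0.9)]
```

In this sketch `prs_pca` equals `pc1` shifted by `constant`, and `weights` decreases from one (a SNP below every threshold) to zero (a SNP above the largest threshold), mirroring the progressive shrinkage described in the text.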

Simulations

To estimate empirical type I error and power of the previous and newly proposed methods under different scenarios, we simulated data with and without genotype-phenotype associations. We generated genotypes by sampling without replacement from the Mayo Clinic Bipolar Disorder Biobank sample, followed by generating phenotypes conditional on (or independent of) the genotypes. The Mayo Clinic Bipolar Disorder Biobank collection, genotyping, and genetic data quality control have been described in previous publications (Frye et al., 2015; Markota et al., 2018). Briefly, the Illumina HumanOmniExpress platform was used to genotype cases and controls. For quality control purposes, we excluded subjects with <98% call rate and related subjects. SNPs with call rate <98%, MAF < 0.01, and those not in Hardy-Weinberg Equilibrium (HWE; P<1e-06) were removed. After these steps 643 011 SNPs, 968 cases, and 777 controls remained.

We explored the performance of the methods using sample sizes of N = 500 and 1500 with a balanced case-control design. To avoid assigning “causal” effects to SNPs in LD, we first clumped the summary statistics using PLINK v1.90 (--clump-kb 250 --clump-p=1 --clump-r2=0.1) to obtain 93 802 approximately independent SNPs. To choose realistic effect sizes for SNPs across the genome, we randomly chose the effect size of each independent SNP from a normal distribution with mean equal to $\log(\widehat{OR})$ and standard deviation $\widehat{SE}$ of the corresponding SNP in the summary statistics from the Psychiatric Genomics Consortium (PGC) Schizophrenia (SCZ) GWAS (Consortium et al., 2014). We previously showed this PRS was associated with psychosis during mania in bipolar disorder (Markota et al., 2018). Finally, we varied the level of polygenicity of the trait by selecting the “causal” SNPs from the pruned set with absolute value of $\log(\widehat{OR})$ greater than 0.01 (high; 71 694 SNPs), 0.07 (medium; 1493 SNPs), and 0.15 (low; 31 SNPs), respectively, in the PGC-SCZ GWAS. Using GCTA (Yang et al., 2011), we simulated the liability of a trait and varied the heritability of the liability to be 0, 0.2, 0.4, 0.6, or 0.8. The final simulated liability was then dichotomized at the median to create a balance of cases and controls. For each scenario, data were simulated 3000 and 1000 times to estimate empirical type I error (heritability = 0) and power (heritability > 0), respectively.
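The phenotype-generation steps above can be illustrated with a simplified stand-in. The sketch below does not reproduce GCTA: it assumes a liability equal to a genetic component rescaled to variance h2 plus independent normal noise with variance 1 - h2, uses randomly generated toy genotypes rather than Biobank data, and all function names and values are hypothetical.

```python
import random
import statistics

def draw_effects(log_odds_ratios, standard_errors, rng):
    """Draw each SNP effect from a normal with mean log(OR-hat) and SD SE-hat."""
    return [rng.gauss(m, s) for m, s in zip(log_odds_ratios, standard_errors)]

def simulate_case_control(genotypes, effects, h2, rng):
    """Simulate a liability with heritability h2 and dichotomize it at the median."""
    genetic = [sum(g * b for g, b in zip(row, effects)) for row in genotypes]
    mean_g = statistics.mean(genetic)
    sd_g = statistics.pstdev(genetic)
    # Rescale the genetic component to variance h2 (it vanishes when h2 = 0).
    scaled = [(g - mean_g) / sd_g * h2 ** 0.5 for g in genetic]
    liability = [g + rng.gauss(0.0, (1.0 - h2) ** 0.5) for g in scaled]
    cutoff = statistics.median(liability)
    return [1 if x > cutoff else 0 for x in liability]

rng = random.Random(2020)
n_people, n_snps = 200, 50
# Toy allele counts; the paper resampled real genotypes instead.
genotypes = [[rng.choice([0, 1, 2]) for _ in range(n_snps)] for _ in range(n_people)]
effects = draw_effects([0.05] * n_snps, [0.01] * n_snps, rng)

status_null = simulate_case_control(genotypes, effects, 0.0, rng)  # type I error setting
status_alt = simulate_case_control(genotypes, effects, 0.4, rng)   # power setting
```

Splitting at the median of an even-sized sample yields exactly half cases and half controls, which matches the balanced design described above.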

We used PRSice2 (Choi & O’Reilly, 2019) to compute PRSs under different p-value thresholds in the simulated datasets using the PGC-SCZ summary statistics. The simulated datasets were then analyzed using PRS-PCA as well as Opt-Perm, and a priori selection of a p-value threshold ($p_T = 5\times10^{-8}$, 0.01, or 1). To explore the effect of the set of p-value thresholds searched on the performance of PRS-PCA and Opt-Perm, we computed PRSs at either K = 5 ($p_T = 5\times10^{-8}, 10^{-6}, 10^{-4}, 0.01, 1$), K = 11 ($p_T = 5\times10^{-8}, 10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 0.001, 0.01, 0.05, 0.1, 0.5, 1$), or K = 100 (thresholds chosen using an evenly spaced grid on the negative $\log_{10}$ scale from 0 ($p_T = 1$) to 7.3 ($p_T = 5\times10^{-8}$)) p-value thresholds. It should be noted that the default search implemented in PRSice2 searches multiple hundreds of p-value thresholds over a grid from $5\times10^{-8}$ to 1. We chose a smaller grid to reduce computational expense in our simulations. Logistic regression models were fit to estimate the p-value of association (H0: PRS not associated with the trait) of the PRS approaches with the simulated trait using R (R Core Team, 2018) version 3.5.2. We also estimated the percent of variation of the binary phenotype explained by each PRS using Nagelkerke’s R2 for each PRS approach using the R package rsq (Zhang, 2018). For each scenario in the simulations, power and type I error were calculated using an α = 0.05 significance level.
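As an illustration of the K = 100 setting, a grid that is evenly spaced on the negative log10 scale can be generated as follows. The function name `log10_threshold_grid` is hypothetical, and this is only one plausible reading of the grid described above, not the authors' code.

```python
def log10_threshold_grid(k, max_neg_log10=7.3):
    """Return k p-value thresholds evenly spaced on the -log10 scale.

    The grid runs from -log10(p) = 0 (p = 1) to max_neg_log10
    (7.3 corresponds to p of roughly 5e-8), returned in increasing order.
    """
    step = max_neg_log10 / (k - 1)
    return sorted(10 ** -(i * step) for i in range(k))

grid = log10_threshold_grid(100)
smallest, largest = grid[0], grid[-1]
```

Spacing the grid on the log scale places many thresholds among the small p-values, where a linear grid between 0 and 1 would place almost none.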

Application to Mayo Clinic Bipolar Biobank Data

To compare performance across PRS approaches, we used publicly available GWAS summary statistics to calculate PRSs for a variety of traits for subjects in the Mayo Bipolar Biobank dataset, including: SCZ (Consortium et al., 2014), BD (Stahl et al., 2019), major depressive disorder (MDD) (Wray et al., 2018), attention deficit and hyperactivity disorder (ADHD) (Demontis et al., 2019), anxiety disorders (Otowa et al., 2016), post-traumatic stress disorder (PTSD) (Duncan et al., 2018), obsessive compulsive disorder (OCD)(International Obsessive Compulsive Disorder Foundation Genetics Collaborative (IOCDF-GC) and OCD Collaborative Genetics Association Studies (OCGAS), 2018), anorexia nervosa (AN) (Bulik et al., 2017), insomnia (Lane et al., 2019), and educational attainment (EA) (J. J. Lee et al., 2018). We used PRSice2 (Choi & O’Reilly, 2019) to compute the PRSs using the same settings described for the simulations. Some smaller p-value thresholds were not applicable for GWAS without genome-wide significant variants. We used the various PRS approaches to test for association of each PRS with BD case-control status (N cases = 968; N controls = 777) to compare the performances of the methods in a well-studied phenotype. We additionally repeated these analyses using the history of psychosis during mania in BD cases (N with manic psychosis = 336; N without psychosis = 309) as the phenotype. We recently demonstrated that psychosis during mania is associated with polygenic risk of schizophrenia (Markota et al., 2018). No large GWAS exists for this phenotype, thus, PRS approaches can be quite useful here to elucidate potential differences in genetic background between bipolar cases with and without psychosis, and the genetic overlap of this phenotype with other psychiatric traits in addition to SCZ. 
We used logistic regression to test for association of each PRS with BD or psychosis status after controlling for the first four principal components of the genotype data to adjust for population stratification. P-values for the Opt-Perm method were calculated using up to 100,000 permutations. We estimated the percent of variation of the binary phenotypes explained by each PRS using Nagelkerke’s R2. For the Opt-Perm approach, we followed the standard approach of reporting the Nagelkerke’s R2 estimate for the best p-value threshold, which is a biased overestimate of the true R2.
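The permutation logic behind Opt-Perm can be sketched as below. This is a simplified illustration under stated assumptions: the absolute correlation between a PRS and case status stands in for the covariate-adjusted logistic regression test used in the paper, the data are randomly generated toy values, and the function names are hypothetical.

```python
import random

def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5

def opt_perm_pvalue(prs_list, status, n_perm, rng):
    """Empirical p-value for the best of K PRSs, obtained by permuting the phenotype."""
    def best_statistic(y):
        # "Optimization" step: keep the strongest association over all thresholds.
        return max(abs_corr(prs, y) for prs in prs_list)

    observed = best_statistic(status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

rng = random.Random(7)
status = [0, 1] * 30
null_prs = [[rng.gauss(0, 1) for _ in status] for _ in range(3)]
signal_prs = null_prs[:2] + [[y + rng.gauss(0, 0.1) for y in status]]

p_null = opt_perm_pvalue(null_prs, status, 199, rng)
p_signal = opt_perm_pvalue(signal_prs, status, 199, rng)
```

Because the maximum over thresholds is recomputed inside every permutation, the selection step is part of the null distribution. The smallest attainable p-value is 1 / (n_perm + 1), which is why very strong associations can only be reported as falling below a bound.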

Results

Type I error

Table 1 shows the empirical type I error and median R2 for each method with varying set of p-value thresholds used, corresponding to setting the heritability of the liability equal to zero. A total of 3000 simulations were performed for each scenario (row) in Table 1. The PRS-PCA approach maintains correct type I error in all scenarios. As expected, optimization of p-value thresholds without correction for multiple testing results in inflated type I error, which worsens as the number of thresholds searched increases. Permutations correct the type I error. However, it should be noted that while both PRS-PCA and Opt-Perm maintain correct type I error, the median R2 for Opt-Perm is greater than zero, demonstrating that the median estimate of variance explained by the PRS is biased upward for Opt-Perm. Fig 1 shows that this bias can become quite large under the null hypothesis and is always larger than the bias of the PRS-PCA approach. This was especially true for the smaller sample size (N = 500).
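As a rough guide to reading the estimates in Table 1, the Monte Carlo uncertainty of an empirical type I error rate can be approximated with a binomial standard error. This is a back-of-envelope check that assumes independent simulation replicates; it is not a calculation reported by the authors. With a true rate of α = 0.05 and B = 3000 replicates:

$$\mathrm{SE}(\hat{\alpha}) = \sqrt{\frac{\alpha(1-\alpha)}{B}} = \sqrt{\frac{0.05 \times 0.95}{3000}} \approx 0.0040, \qquad 0.05 \pm 1.96 \times 0.0040 \approx [0.042,\ 0.058]$$

Under this approximation, the PCA and Opt-Perm estimates in Table 1 (0.044 to 0.054) are consistent with a nominal 0.05 level, whereas the uncorrected Opt estimates (0.145 to 0.255) lie far outside the interval.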

Table 1.

Empirical Type I error and median R2 in each simulation setting for each method with sample size N and number of parameters searched K. PCA = first PC of search, Opt = Select best parameter, Opt-Perm = Permutation of Opt

| N | K | PCA: α (median R2) | Opt: α (median R2) | Opt-Perm: α (median R2) |
|---|---|---|---|---|
| 500 | 5 | 0.054 (0) | 0.151 (0.2%) | 0.051 (0.2%) |
| 500 | 11 | 0.052 (0) | 0.194 (0.3%) | 0.053 (0.3%) |
| 500 | 100 | 0.049 (0) | 0.255 (0.5%) | 0.050 (0.5%) |
| 1500 | 5 | 0.047 (0) | 0.145 (0.1%) | 0.051 (0.1%) |
| 1500 | 11 | 0.046 (0) | 0.188 (0.1%) | 0.047 (0.1%) |
| 1500 | 100 | 0.049 (0) | 0.250 (0.1%) | 0.044 (0.1%) |

Fig 1.

Comparison of Nagelkerke’s R2 for PRS-PCA and Opt-Perm under the null hypothesis of no polygenic effect with varying the set of p-value thresholds used (K) and sample size (N). R2 asymptotically goes to zero under the null.

Investigation of SNP weights in the PRS-PCA approach

For the set of 93 802 independent SNPs obtained after pruning, we performed PCA on the K PRSs (K = 5, 11, or 100) and plotted the stepwise weight curves for w(pj) described in the Methods (Fig 2). These curves show that as p-value increases, the SNP effect is down-weighted more depending on the choice of K thresholds. For example, with our set of K = 5 thresholds, variants with p-value in [1, 0.01), [0.01, 0.0001), [0.0001, 1e-6), [1e-6, 5e-8), [5e-8, 0] were down-weighted by 22%, 6%, 2%, 0.7%, and 0%, respectively. With our K = 11 thresholds, SNPs with p-value in [1, 0.1) and [0.1, 0.05) were severely down-weighted by 61% and 41%, respectively. With K = 100, a smoother form of the down-weighting of SNPs based on their p-values occurs.

Fig 2.

Scaled weights on effect sizes of SNPs (betas) for SCZ-PRS assigned by PRS-PCA in the Mayo Clinic sample. Weights depend on the p-value thresholds used in the analysis (K = 5, 11, or 100), PCA loadings, and the p-value of the SNP from the training sample.

Power

We assessed the power to detect association using the various PRS approaches for a trait with high, medium, or low polygenicity using 1000 simulations for a given sample size and heritability. Empirical power for these methods is shown in Fig 3. All methods had more power using a sample size of 1500, but the relative performance of the methods remained unchanged between sample sizes. The PRS-PCA approach, regardless of the set of thresholds used (K = 5, 11, or 100), had greater power than Opt-Perm in almost all scenarios and achieved nearly the same power as the PRS constructed with fixed threshold matching the simulation setting (pT=5e-8 and 1 for low and high polygenicity, respectively). The PRS-PCA approach performed best with K = 11 when the trait was highly polygenic. PRS-PCA with the larger set of thresholds (K = 100) did not perform as well in this scenario, but otherwise had generally similar performance as PRS-PCA with K = 11. When the trait had few causal variants, PRS-PCA with the set of K=5 thresholds performed marginally better. While a similar pattern of performance was seen for Opt-Perm, the Opt-Perm performance varied less than PRS-PCA with respect to the choice of thresholds used in the optimization.

Fig 3.

Empirical power of each method given a trait with high (left; |log(OR)| > 0.01), medium (center; |log(OR)| > 0.07), or low (right; |log(OR)| > 0.15) polygenicity with sample size of N = 500 (top) or 1500 (bottom). GWS = genome-wide significant p-value threshold (5e-8). Line thickness corresponds to PRS-PCA and Opt-Perm with varying K (thicker = more thresholds used).

Illustration of Approach: Application to Mayo Clinic Bipolar Biobank Data

For the psychiatric traits considered in the real data analysis, the PRS-PCAs were weakly correlated with each other, with most correlations between 0 and 0.15 and the highest correlation (r = 0.43) between PTSD and AN (Supplementary Figure 1). We also estimated the variation in the PRSs explained by the first PC as well as the PCA loadings for the PRSs of each trait from Table 2 (Supplementary Figure 2). For PRSs with the same set of p-value thresholds, the pattern of PC-loadings was very similar. The first PRS-PC explained between 35% and 75% of the variation in PRSs computed at different p-value thresholds regardless of K and the amount of variation explained by the first PRS-PC was higher for traits where the PRSs calculated under different p-value thresholds were more correlated (EA and SCZ in Supplementary Figure 3).

Table 2.

Comparison of PRS approaches testing for association of each PRS with BD case-control status. The traits are sorted by the PCA approach (K = 11) p-value. Prediction performance was measured by Nagelkerke’s R2.

| PRS | PCA (K = 5): p-value (R2) | PCA (K = 11): p-value (R2) | PCA (K = 100): p-value (R2) | Opt: p-value (R2)* | Best Threshold | Opt-Perm: p-value |
|---|---|---|---|---|---|---|
| BD | 8e-14 (4.4%) | 6e-17 (5.6%) | 3e-14 (4.6%) | 4e-14 (4.6%) | 1 | < 1e-05 |
| SCZ | 5e-06 (1.6%) | 5e-08 (2.3%) | 5e-07 (1.9%) | 2e-11 (3.5%) | 0.05 | < 1e-05 |
| MDD | 0.008 (0.5%) | 7e-05 (1.2%) | 0.002 (0.7%) | 5e-06 (1.6%) | 0.05 | 4e-05 |
| ADHD | 0.021 (0.4%) | 0.003 (0.6%) | 0.023 (0.3%) | 3e-04 (1.0%) | 1 | 0.003 |
| Insomnia | 0.015 (0.4%) | 0.004 (0.6%) | 0.006 (0.5%) | 5e-04 (0.9%) | 0.01 | 0.005 |
| Anxiety | 0.129 (0.1%) | 0.023 (0.3%) | 0.037 (0.3%) | 0.013 (0.4%) | 0.1 | 0.095 |
| PTSD | 0.075 (0.2%) | 0.043 (0.3%) | 0.034 (0.3%) | 0.021 (0.4%) | 1 | 0.131 |
| OCD | 0.641 (0.0%) | 0.072 (0.2%) | 0.325 (0.0%) | 0.022 (0.4%) | 0.2 | 0.137 |
| AN | 0.119 (0.1%) | 0.102 (0.1%) | 0.065 (0.2%) | 0.184 (0.1%) | 1 | 0.656 |
| EA | 0.306 (0.0%) | 0.278 (0.0%) | 0.379 (0.0%) | 0.153 (0.1%) | 1 | 0.527 |

* Nagelkerke’s R2 for the Opt approach is estimated from the best performing p-value threshold searched (K = 11). The best p-value threshold from the K = 11 thresholds searched is reported.

We first considered PRS analyses of BD case-control phenotype, which has been extensively studied (Ruderfer et al., 2018; Stahl et al., 2019). Overall, the PRS-PCA method performed best with our proposed set of K = 11 thresholds. The PRS-PCA method with K = 11 and K = 100 resulted in p-values < 0.05 for 7 of the 10 traits considered. Meanwhile, only 5 PRSs were significantly associated with BD based on Opt-Perm with K = 11. Additionally, a precise Opt-Perm p-value could not be estimated for the strongest two PRS associations (BD and SCZ), due to the computational burden of running more than 100,000 permutations. While the PRS-PCA method took an average of 1.9 seconds to run all three settings of K (5, 11, and 100), the Opt-Perm approach took about 0.09 seconds per permutation (Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz), which means that 100,000 permutations took around 2.5 hours without parallelization.
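The run-time comparison above follows from simple arithmetic on the reported timings (0.09 seconds per permutation, 100,000 permutations, and 1.9 seconds for PRS-PCA); the ratio is a derived figure, not one reported by the authors:

$$0.09\ \text{s} \times 100{,}000 = 9{,}000\ \text{s} = 2.5\ \text{h}, \qquad \frac{9{,}000\ \text{s}}{1.9\ \text{s}} \approx 4{,}700$$

On these figures, a single Opt-Perm test without parallelization takes roughly 4,700 times as long as running PRS-PCA under all three settings of K.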

We also analyzed a case-only phenotype, history of manic psychosis (Table 3), for which no large samples exist. No method performed uniformly best. The PRS-PCA method with K = 11 showed that the PRSs for EA, BD, and SCZ were higher in cases with psychosis than those without. With K = 5, the PRS-PCA analysis failed to detect association with SCZ PRS. The Opt-Perm method (with K = 11) provided weaker evidence of association with the BD-PRS, but stronger evidence of a PTSD PRS association than the PRS-PCA method.

Table 3.

Comparison of PRS approaches testing for association of each PRS with presence of psychosis among cases of bipolar disorder (N = 645). The traits are sorted by the PCA approach (K = 11) p-value. Prediction performance was measured by Nagelkerke’s R2.

| PRS | PCA (K = 5): p-value (R2) | PCA (K = 11): p-value (R2) | PCA (K = 100): p-value (R2) | Opt: p-value (R2)* | Best Threshold | Opt-Perm: p-value |
|---|---|---|---|---|---|---|
| EA | 6e-04 (2.0%) | 9e-04 (2.0%) | 0.001 (2.0%) | 9e-04 (2.0%) | 0.0001 | 0.006 |
| BD | 0.014 (1.1%) | 0.020 (1.0%) | 0.013 (1.1%) | 0.019 (1.0%) | 0.0001 | 0.132 |
| SCZ | 0.141 (0.3%) | 0.045 (0.7%) | 0.071 (0.5%) | 0.002 (1.8%) | 0.05 | 0.013 |
| PTSD | 0.629 (0.0%) | 0.115 (0.4%) | 0.319 (0.1%) | 0.006 (1.4%) | 0.05 | 0.041 |
| Anxiety | 0.459 (0.0%) | 0.270 (0.1%) | 0.339 (0.0%) | 0.219 (0.2%) | 0.2 | 0.829 |
| AN | 0.732 (0.0%) | 0.551 (0.0%) | 0.972 (0.0%) | 0.013 (1.1%) | 1 | 0.075 |
| OCD | 0.674 (0.0%) | 0.584 (0.0%) | 0.863 (0.0%) | 0.294 (0.1%) | 1e-5 | 0.888 |
| ADHD | 0.806 (0.0%) | 0.671 (0.0%) | 0.865 (0.0%) | 0.153 (0.3%) | 1 | 0.664 |
| MDD | 0.890 (0.0%) | 0.718 (0.0%) | 0.955 (0.0%) | 0.087 (0.5%) | 1 | 0.438 |
| Insomnia | 0.618 (0.0%) | 0.884 (0.0%) | 0.432 (0.0%) | 0.059 (0.6%) | 0.0001 | 0.350 |

* Nagelkerke’s R2 for the Opt approach is estimated from the best performing p-value threshold searched (K = 11). The best p-value threshold from the K = 11 thresholds searched is reported.

Discussion

In this paper we proposed a method of PRS analysis that uses PCA to concentrate the maximum variation in a set of PRSs in a single PC, and then tests for association of the phenotype with only the first PRS-PC. This method avoids optimizing the parameters to construct the PRS, which inflates the probability of a type I error if unaccounted for, and is computationally faster than using permutations to correct for the inflation (Supplementary Table 1). Through simulations, we showed that the PRS-PCA approach can be as or more powerful than the Opt-Perm approach that relies on permutations to compute the p-value. We showed how the PRS-PCA uses PC loadings to reweight the original SNP effects such that SNPs with larger p-values are down-weighted more.

The different sets of thresholds explored performed similarly in most scenarios for PRS-PCA, and even more so for Opt-Perm. Our set of K = 11 thresholds is a typical choice among analysts, and similar searches have been used previously (Bergen et al., 2019). This setting for PRS-PCA performed best in most of the simulated scenarios (Fig 3), as well as in the real data application (Table 2). In addition to being computationally faster than the Opt-Perm approach, PRS-PCA tests a single PRS rather than selecting the most predictive PRS in a particular dataset, so it does not overestimate PRS performance (e.g. area under the curve or proportion of variation explained); in fact, it is expected to underestimate performance if only the first PC is used for prediction. On the other hand, the upward bias in prediction performance can be substantial for Opt-Perm when applied to small sample sizes (Table 1).

In our real data application, all of the methods identified five PRSs (BD, SCZ, MDD, ADHD, and insomnia) that were significantly higher in cases with BD than in controls (Table 2). However, only PRS-PCA with K = 11 and K = 100 identified an additional two traits, PTSD and anxiety, with higher genetic load in cases. All of these traits have been shown to have significant genetic correlation or PRS association with BD in larger samples (Di Florio et al., 2020; Grigoroiu-Serbanescu et al., 2020; P. H. Lee et al., 2019; Nievergelt et al., 2019). In our analysis of the case-only phenotype of manic psychosis, both the PRS-PCA (K = 11) and the Opt-Perm approaches reproduced our previous finding that the PRS for SCZ is higher in cases with a history of manic psychosis (N = 336) than in those without a history of psychosis (N = 309) (Markota et al., 2018). Both methods also showed that cases with manic psychosis had higher genetic load for EA. While psychosis in the context of BD has been less studied, prior studies have shown small positive genetic correlation between EA and SCZ, and a PRS for EA has been found to be higher in people with SCZ (Power et al., 2015). Finally, only the PRS-PCA approach found evidence that the PRS for BD was higher in cases with manic psychosis. This may suggest that a higher genetic load for BD is associated with more severe symptoms of BD.

The PRS-PCA approach controls type I error while maintaining good power. This approach is well-suited to hypothesis testing with many PRSs, because it prevents overfitting each PRS to the outcome and does not require choosing one p-value threshold for all PRSs (Mullins et al., 2019; Richardson et al., 2019; Zheutlin et al., 2019), which can reduce power as seen in our simulations. In this study, we explored how the PRS-PCA approach can improve PRS analyses that implement P+T, by using the first PC of the PRSs computed across different thresholds. While more than one PC could be used for association testing in this approach, others have found that subsequent PCs explained relatively little of the variation in the phenotype (Bergen et al., 2019). A similar observation was made in the context of PCA-based SNP-set tests (Ballard et al., 2010). Furthermore, representing polygenic risk with one score allows for ease of interpretation. Here, we have focused on the most popular implementation of PRS: P+T. Future investigation is needed to test whether the same PCA approach can be used to avoid optimizing PRSs over other tuning parameters with non-P+T PRS approaches, such as lassosum (Mak et al., 2017), LDpred (Vilhjálmsson et al., 2015), or PRS-CS (Ge et al., 2019). Furthermore, PRSs constructed with different methods could easily be combined using the PCA approach. Here, the PRSs are summed using the loadings for PC1 as weights. However, other weights might perform better in certain situations. The PRS-PCA approach could be easily incorporated into the adaptive Sum of Powered Score (aSPU) method, which performs well in many different contexts (B. J. Coombes et al., 2018; Pan, Chen, et al., 2015; Pan, Kwak, et al., 2015). This topic warrants future investigation.

In this paper, we advocate for use of a powerful method of efficiently consolidating information harnessed by PRSs, which avoids the multiple testing inherent in the popular optimization approach. In studies that aim to test for association of PRSs with more than one phenotype such as a PRS PheWAS (Zheutlin et al., 2019) or more than one PRS (B. Coombes et al., 2020), the PRS-PCA approach would substantially reduce the multiple testing that would occur with the optimization approach. With the growing use of PRSs, the PRS-PCA approach gives researchers an unbiased and powerful approach to index polygenic risk.

Supplementary Material

Acknowledgments

This work was supported by the Marriott Foundation, the Mayo Clinic Center for Individualized Medicine, and the National Institute of Mental Health (R01 MH121924).

Footnotes

Conflicts of Interest

The authors have no conflicts of interest to report.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request. R Code to implement PRS-PCA has been made available in the Supplementary Material.

References

  1. Alnæs D, Kaufmann T, Van Der Meer D, Córdova-Palomera A, Rokicki J, Moberget T, Bettella F, Agartz I, Barch DM, Bertolino A, Brandt CL, Cervenka S, Djurovic S, Doan NT, Eisenacher S, Fatouros-Bergman H, Flyckt L, Di Giorgio A, Haatveit B, … Westlye LT (2019). Brain Heterogeneity in Schizophrenia and Its Association with Polygenic Risk. JAMA Psychiatry, 76(7), 739–748. 10.1001/jamapsychiatry.2019.0257
  2. Ballard DH, Cho J, & Zhao H (2010). Comparisons of multi-marker association methods to detect association between a candidate region and disease. Genetic Epidemiology, 34(3), 201–212. 10.1002/gepi.20448
  3. Bergen SE, Ploner A, Howrigan D, O’Donovan MC, Smoller JW, Sullivan PF, Sebat J, Neale B, & Kendler KS (2019). Joint Contributions of Rare Copy Number Variants and Common SNPs to Risk for Schizophrenia. American Journal of Psychiatry, 176(1), 29–35. 10.1176/appi.ajp.2018.17040467
  4. Bulik C, Duncan L, Breen G, & PGC_AN Working Group. (2017). The PGC GWAS Meta-Analysis of Anorexia Nervosa: SNP Heritability, Genetic Correlations, and SNP Results. European Neuropsychopharmacology, 27, S360–S361. 10.1016/j.euroneuro.2016.09.381
  5. Choi SW, & O’Reilly PF (2019). PRSice-2: Polygenic Risk Score software for biobank-scale data. GigaScience, 8(7). 10.1093/gigascience/giz082
  6. Schizophrenia Working Group of the Psychiatric Genomics Consortium, Ripke S, Neale BM, Corvin A, Walters JTR, Farh K-H, Holmans PA, Lee P, Bulik-Sullivan B, Collier DA, Huang H, Pers TH, Agartz I, Agerbo E, Albus M, Alexander M, Amin F, Bacanu SA, Begemann M, … O’Donovan MC (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature, 511(7510), 421–427. 10.1038/nature13595
  7. Coombes BJ, Basu S, & McGue M (2018). A linear mixed model framework for gene-based gene–environment interaction tests in twin studies. Genetic Epidemiology, 42(7). 10.1002/gepi.22150
  8. Coombes B, Markota M, Mann J, Colby C, Stahl E, Talati A, Pathak J, Weissman M, McElroy S, Frye M, & Biernacka JM (2020). Dissecting clinical heterogeneity of bipolar disorder using multiple polygenic risk scores. MedRxiv.
  9. De Lange AMG, Kaufmann T, Van Der Meer D, Maglanoc LA, Alnæs D, Moberget T, Douaud G, Andreassen OA, & Westlye LT (2019). Population-based neuroimaging reveals traces of childbirth in the maternal brain. Proceedings of the National Academy of Sciences of the United States of America, 116(44), 22341–22346. 10.1073/pnas.1910666116
  10. Demontis D, Walters RK, Martin J, Mattheisen M, Als TD, Agerbo E, Baldursson G, Belliveau R, Bybjerg-Grauholm J, Bækvad-Hansen M, Cerrato F, Chambert K, Churchhouse C, Dumont A, Eriksson N, Gandal M, Goldstein JI, Grasby KL, Grove J, … Neale BM (2019). Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder. Nature Genetics, 51(1), 63–75. 10.1038/s41588-018-0269-7
  11. Di Florio A, Lewis KJS, Richards A, Karlsson R, Leonenko G, Jones SE, Jones HJ, Gordon-Smith K, Forty L, Escott-Price V, Owen MJ, Weedon MN, Jones L, Craddock N, Jones I, Landén M, & O’Donovan MC (2020). Comparison of Genetic Liability for Sleep Traits among Individuals with Bipolar Disorder I or II and Control Participants. JAMA Psychiatry, 77(3), 303–310. 10.1001/jamapsychiatry.2019.4079
  12. Duncan LE, Ratanatharathorn A, Aiello AE, Almli LM, Amstadter AB, Ashley-Koch AE, Baker DG, Beckham JC, Bierut LJ, Bisson J, Bradley B, Chen CY, Dalvie S, Farrer LA, Galea S, Garrett ME, Gelernter JE, Guffanti G, Hauser MA, … Koenen KC (2018). Largest GWAS of PTSD (N=20 070) yields genetic overlap with schizophrenia and sex differences in heritability. Molecular Psychiatry, 23(3), 666–673. 10.1038/mp.2017.77
  13. Frye MA, McElroy SL, Fuentes M, Sutor B, Schak KM, Galardy CW, Palmer BA, Prieto ML, Kung S, Sola CL, & others. (2015). Development of a bipolar disorder biobank: differential phenotyping for subsequent biomarker analyses. International Journal of Bipolar Disorders, 3(1), 14.
  14. Ge T, Chen CY, Ni Y, Feng YCA, & Smoller JW (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature Communications, 10(1). 10.1038/s41467-019-09718-5
  15. Grigoroiu-Serbanescu M, Giaroli G, Thygesen JH, Shenyan O, Bigdeli TB, Bass NJ, Diaconu CC, Neagu AI, Forstner AJ, Degenhardt F, Herms S, Nöthen MM, & McQuillin A (2020). Predictive power of the ADHD GWAS 2019 polygenic risk scores in independent samples of bipolar patients with childhood ADHD. Journal of Affective Disorders, 265, 651–659. 10.1016/j.jad.2019.11.109
  16. International Obsessive Compulsive Disorder Foundation Genetics Collaborative (IOCDF-GC) and OCD Collaborative Genetics Association Studies (OCGAS). (2018). Revealing the complex genetic architecture of obsessive-compulsive disorder using meta-analysis. Molecular Psychiatry, 23(5), 1181–1188. 10.1038/mp.2017.154
  17. International Schizophrenia Consortium, Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, & Sklar P (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460(7256), 748–752. 10.1038/nature08185
  18. Lane JM, Jones SE, Dashti HS, Wood AR, Aragam KG, van Hees VT, Strand LB, Winsvold BS, Wang H, Bowden J, Song Y, Patel K, Anderson SG, Beaumont RN, Bechtold DA, Cade BE, Haas M, Kathiresan S, Little MA, … Saxena R (2019). Biological and clinical insights from genetics of insomnia symptoms. Nature Genetics. 10.1038/s41588-019-0361-7
  19. Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, Nguyen-Viet TA, Bowers P, Sidorenko J, Karlsson Linnér R, Fontana MA, Kundu T, Lee C, Li H, Li R, Royer R, Timshel PN, Walters RK, Willoughby EA, … Cesarini D (2018). Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics, 50(8), 1112–1121. 10.1038/s41588-018-0147-3
  20. Lee PH, Anttila V, Won H, Feng YCA, Rosenthal J, Zhu Z, Tucker-Drob EM, Nivard MG, Grotzinger AD, Posthuma D, Wang MMJ, Yu D, Stahl EA, Walters RK, Anney RJL, Duncan LE, Ge T, Adolfsson R, Banaschewski T, … Smoller JW (2019). Genomic Relationships, Novel Loci, and Pleiotropic Mechanisms across Eight Psychiatric Disorders. Cell, 179(7), 1469–1482.e11. 10.1016/j.cell.2019.11.020
  21. Maglanoc LA, Kaufmann T, van der Meer D, Marquand AF, Wolfers T, Jonassen R, Hilland E, Andreassen OA, Landrø NI, & Westlye LT (2019). Brain connectome mapping of complex human traits and their polygenic architecture using machine learning. Biological Psychiatry. 10.1016/j.biopsych.2019.10.011
  22. Mak TSH, Porsch RM, Choi SW, & Sham PC (2018). Polygenic scores for UK Biobank scale data. BioRxiv, 252270. 10.1101/252270
  23. Mak TSH, Porsch RM, Choi SW, Zhou X, & Sham PC (2017). Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology, 41(6), 469–480. 10.1002/gepi.22050
  24. Markota M, Coombes BJ, Larrabee BR, McElroy SL, Bond DJ, Veldic M, Colby CL, Chauhan M, Cuellar-Barboza AB, Fuentes M, & others. (2018). Association of schizophrenia polygenic risk score with manic and depressive psychosis in bipolar disorder. Translational Psychiatry, 8(1), 188.
  25. Mullins N, Bigdeli TB, Børglum AD, Coleman JRI, Demontis D, Mehta D, Power RA, Ripke S, Stahl EA, Starnawska A, Anjorin A, Corvin A, Sanders AR, Forstner AJ, Reif A, Koller AC, Świątkowska B, Baune BT, Müller-Myhsok B, … Lewis CM (2019). GWAS of Suicide Attempt in Psychiatric Disorders and Association With Major Depression Polygenic Risk Scores. American Journal of Psychiatry, 176(8), 651–660. 10.1176/appi.ajp.2019.18080957
  26. Nievergelt CM, Maihofer AX, Klengel T, Atkinson EG, Chen CY, Choi KW, Coleman JRI, Dalvie S, Duncan LE, Gelernter J, Levey DF, Logue MW, Polimanti R, Provost AC, Ratanatharathorn A, Stein MB, Torres K, Aiello AE, Almli LM, … Koenen KC (2019). International meta-analysis of PTSD genome-wide association studies identifies sex- and ancestry-specific genetic risk loci. Nature Communications, 10(1), 1–16. 10.1038/s41467-019-12576-w
  27. Otowa T, Hek K, Lee M, Byrne EM, Mirza SS, Nivard MG, Bigdeli T, Aggen SH, Adkins D, Wolen A, Fanous A, Keller MC, Castelao E, Kutalik Z, Der Auwera SV, Homuth G, Nauck M, Teumer A, Milaneschi Y, … Hettema JM (2016). Meta-analysis of genome-wide association studies of anxiety disorders. Molecular Psychiatry, 21(10), 1391–1399. 10.1038/mp.2015.197
  28. Pan W, Chen YM, & Wei P (2015). Testing for polygenic effects in genome-wide association studies. Genetic Epidemiology, 39(4), 306–316. 10.1002/gepi.21899
  29. Pan W, Kwak I-Y, & Wei P (2015). A powerful pathway-based adaptive test for genetic association with common or rare variants. The American Journal of Human Genetics, 97(1), 86–98.
  30. Power RA, Steinberg S, Bjornsdottir G, Rietveld CA, Abdellaoui A, Nivard MM, Johannesson M, Galesloot TE, Hottenga JJ, Willemsen G, Cesarini D, Benjamin DJ, Magnusson PKE, Ullén F, Tiemeier H, Hofman A, van Rooij FJA, Walters GB, Sigurdsson E, … Stefansson K (2015). Polygenic risk scores for schizophrenia and bipolar disorder predict creativity. Nature Neuroscience, 18(7), 953–955. 10.1038/nn.4040
  31. R Core Team. (2018). R: A Language and Environment for Statistical Computing. https://www.r-project.org/
  32. Richardson TG, Harrison S, Hemani G, & Smith GD (2019). An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome. ELife, 8. 10.7554/eLife.43657
  33. Ruderfer DM, Ripke S, McQuillin A, Boocock J, Stahl EA, Pavlides JMW, Mullins N, Charney AW, Ori APS, Loohuis LMO, Domenici E, Di Florio A, Papiol S, Kalman JL, Trubetskoy V, Adolfsson R, Agartz I, Agerbo E, Akil H, … Kendler KS (2018). Genomic Dissection of Bipolar Disorder and Schizophrenia, Including 28 Subphenotypes. Cell, 173(7), 1705–1715.e16. 10.1016/j.cell.2018.05.046
  34. Stahl EA, Breen G, Forstner AJ, McQuillin A, Ripke S, Trubetskoy V, Mattheisen M, Wang Y, Coleman JRI, Gaspar HA, de Leeuw CA, Steinberg S, Pavlides JMW, Trzaskowski M, Byrne EM, Pers TH, Holmans PA, Richards AL, Abbott L, … Sklar P (2019). Genome-wide association study identifies 30 loci associated with bipolar disorder. Nature Genetics, 51(5), 793–803. 10.1038/s41588-019-0397-8
  35. Torkamani A, Wineinger NE, & Topol EJ (2018). The personal and clinical utility of polygenic risk scores. Nature Reviews Genetics, 19(9), 581–590. 10.1038/s41576-018-0018-x
  36. Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, Genovese G, Loh P-R, Bhatia G, Do R, Hayeck T, Won H-H, Schizophrenia Working Group of the Psychiatric Genomics Consortium, Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) study, Kathiresan S, Pato M, Pato C, Tamimi R, Stahl E, Zaitlen N, … Price AL (2015). Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. American Journal of Human Genetics, 97(4), 576–592. 10.1016/j.ajhg.2015.09.001
  37. Wray NR, Ripke S, Mattheisen M, Trzaskowski M, Byrne EM, Abdellaoui A, Adams MJ, Agerbo E, Air TM, Andlauer TMF, Bacanu SA, Bækvad-Hansen M, Beekman AFT, Bigdeli TB, Binder EB, Blackwood DRH, Bryois J, Buttenschøn HN, Bybjerg-Grauholm J, … Sullivan PF (2018). Genome-wide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nature Genetics, 50(5), 668–681. 10.1038/s41588-018-0090-3
  38. Yang J, Lee SH, Goddard ME, & Visscher PM (2011). GCTA: A Tool for Genome-wide Complex Trait Analysis. The American Journal of Human Genetics, 88(1), 76–82. 10.1016/j.ajhg.2010.11.011
  39. Zhang D (2018). rsq: R-Squared and Related Measures. https://cran.r-project.org/package=rsq
  40. Zheutlin AB, Dennis J, Karlsson Linnér R, Moscati A, Restrepo N, Straub P, Ruderfer D, Castro VM, Chen C-Y, Ge T, Huckins LM, Charney A, Kirchner HL, Stahl EA, Chabris CF, Davis LK, & Smoller JW (2019). Penetrance and Pleiotropy of Polygenic Risk Scores for Schizophrenia in 106,160 Patients Across Four Health Care Systems. American Journal of Psychiatry, 176(10), 846–855. 10.1176/appi.ajp.2019.18091085
