PLOS Genetics. 2022 Mar 24;18(3):e1010105. doi: 10.1371/journal.pgen.1010105

Significant sparse polygenic risk scores across 813 traits in UK Biobank

Yosuke Tanigawa 1,2,*, Junyang Qian 3, Guhan Venkataraman 1, Johanne Marie Justesen 1, Ruilin Li 4, Robert Tibshirani 1,3, Trevor Hastie 1,3, Manuel A Rivas 1,*
Editor: Samuli Ripatti 5
PMCID: PMC8946745  PMID: 35324888

Abstract

We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 x 10−5) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman’s ⍴ = 0.61, p = 2.2 x 10−59 for quantitative traits, ⍴ = 0.21, p = 9.6 x 10−4 for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).

Author summary

Polygenic risk score (PRS), an approach to estimate genetic predisposition on disease liability by aggregating the effects across multiple genetic variants, has attracted increasing research interest. While there have been improvements in the predictive performance of PRS for some traits, the applicability of PRS models across a wide range of human traits has not been clear. Here, applying penalized regression using Batch Screening Iterative Lasso (BASIL) algorithm to more than 269,000 individuals of white British ancestry in UK Biobank, we systematically characterize PRS models across more than 1,500 traits. We report 813 traits with PRS models of statistically significant predictive performance. While the statistical significance does not necessarily directly translate into clinical relevance, we investigate the properties of the 813 significant PRS models and report a significant correlation between predictive performance and estimated SNP-based heritability. We find that the number of genetic variants selected in our sparse PRS model is significantly correlated with the incremental predictive performance in both quantitative and binary traits. Our transferability assessment of PRS models in UK Biobank revealed that the sparse PRS models trained on individuals of European ancestry had a lower predictive performance for individuals of African and Asian ancestry groups.

Introduction

Polygenic risk score (PRS), an estimate of an individual’s genetic liability to a trait or disease, has been proposed for disease risk prediction with potential clinical relevance for some traits [1,2]. Owing to larger training samples and methodological advances in variable selection and effect size estimation, PRS predictive performance has improved [3–17]. However, it has not been clear how well PRS models perform when they are applied to a wide range of traits, or how well they transfer across ancestry groups. Rich phenotypic information in large-scale genotyped cohorts provides an opportunity to address this question.

Here, we present significant sparse PRSs across 813 traits in the UK Biobank [18,19]. We applied the recently developed batch screening iterative lasso (BASIL) algorithm implemented in the R snpnet package [10] across more than 1,500 traits consisting of binary outcomes and quantitative traits, including disease outcomes and biomarkers, respectively (Fig 1, S1 Table). As opposed to most of the recently developed PRS methods that take genome-wide association study (GWAS) summary statistics as input, BASIL/snpnet is capable of performing variable selection and effect size estimation simultaneously from individual-level genotype and phenotype data. BASIL/snpnet results in sparse PRS models, meaning that most genetic variants in the input dataset have zero coefficients. For example, the snpnet PRS for standing height, a classic example of a polygenic trait, includes 51,209 variants; that is, it has non-zero coefficients for 4.7% of the 1,080,968 genetic variants and allelotypes present in the input genetic data. Moreover, this approach does not require the explicit specification of the underlying genetic architecture of traits, which makes it suitable for a phenome-wide application of PRS modeling. Using individuals in a hold-out test set, we evaluated the predictive performance of the PRS models and its statistical significance, resulting in 813 significant (p < 2.5 x 10−5) PRS models. We find a significant correlation between the number of the genetic variants selected in the model and the incremental predictive performance compared to the covariate-only models across quantitative traits and binary traits. We assess the transferability of the PRS models across ancestry groups using individuals from non-British white, African, South Asian, and East Asian ancestry in the UK Biobank. We make the coefficients of the PRS models publicly available via the PRS map web application on the Global Biobank Engine [20] (https://biobankengine.stanford.edu/prs).

Fig 1. Significant sparse polygenic risk scores (PRSs) across 813 traits in the UK Biobank.


(A) We analyzed a total of more than 378,000 unrelated individuals and 1,565 traits in UK Biobank. We used 80% of individuals of white British ancestry for score development. For evaluation, we used the remaining 20% of individuals and additional individuals in other ancestry groups. (B) The full list of 1,565 traits with predictive performance is shown as a sortable table at Global Biobank Engine (https://biobankengine.stanford.edu/prs). (C) The predictive performance of PRS models for quantitative traits is summarized as a heatmap comparing the predicted risk score (Z-score) and observed trait value (left) and mean and standard error of trait values stratified by percentile bin (right). (D) The predictive performance of PRS models for binary traits is summarized as PRS score distribution stratified by case/control status (left) and odds ratio stratified by percentile bin (right). (E) The non-zero coefficients of the sparse PRS model are shown. (F) The predictive performance is evaluated in the training and test sets, which consist of individuals of white British ancestry, as well as in additional sets consisting of individuals from non-British white, African, South Asian, and East Asian ancestry groups in the UK Biobank.

Results

Characterizing sparse PRS models with BASIL algorithm

To build sparse PRSs across a wide range of phenotypes, we compiled a total of 1,565 traits in the UK Biobank. We grouped them into trait categories, such as disease outcomes, anthropometry measures, and cancer phenotypes (S1 Table, Methods). We analyzed a total of 1,080,968 genetic variants and allelotypes from the directly-genotyped variants [19], imputed HLA allelotypes [21], and copy number variants [22]. Using 80% (n = 269,704) of unrelated individuals of white British ancestry, we applied batch screening iterative lasso (BASIL) implemented in the R snpnet package [10]. This recently developed method characterizes PRS models by simultaneously performing variable selection and effect size estimation. Applying different levels of penalization in the Lasso regression with penalty factors, we prioritized the medically relevant alleles in the PRS model. Specifically, we used the predicted consequence of the genotyped variants and the pathogenicity information in the ClinVar database. We prioritized protein-truncating variants, protein-altering variants, imputed HLA allelotype, and known pathogenic and likely-pathogenic variants by assigning lower penalty factors (Methods). As unpenalized covariates, we included age, sex, and the loadings of the top ten principal components (PCs) of genotypes. For 35 blood and urine biomarker traits, we took the snpnet PRS models from a recently published study [23], where the PRS models were characterized with the same methods on the same set of individuals following the adjustment for an extensive list of technical covariates, including fasting time and dilution factors, as well as for age, sex, and genotype PCs.

To evaluate the predictive performance (R2 for quantitative traits and observed scale Nagelkerke’s pseudo-R2 [also known as Cragg and Uhler’s pseudo-R2] [24,25] for binary traits) and its statistical significance, we focused on the remaining 20% of unrelated individuals in the hold-out test set (n = 67,425) as well as additional sets of unrelated individuals in the following ancestry groups in UK Biobank: non-British European (non-British white, n = 24,905), African (n = 6,497), South Asian (n = 7,831), and East Asian (n = 1,704) (S2 Table, Methods). We found 813 PRS models with significant (p < 2.5 x 10−5 = 0.05/2,000, adjusted for multiple hypothesis testing with Bonferroni method) predictive performance in the hold-out test set of white British individuals (Methods). For the binary traits, we also evaluated the receiver operating characteristic area under the curve [ROC-AUC] and Tjur’s Coefficient of Discrimination (Tjur’s pseudo-R2) [26].
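
The significance threshold and the pseudo-R2 named above can be illustrated with a short sketch. It assumes the standard definition of Nagelkerke’s (Cragg and Uhler’s) pseudo-R2 in terms of the log-likelihoods of a null and a fitted logistic model; the function names are hypothetical, and how the likelihoods were obtained in the study is not shown in this section.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def nagelkerke_r2(loglik_null: float, loglik_model: float, n: int) -> float:
    """Nagelkerke's pseudo-R2 from the log-likelihoods of two nested models.

    loglik_null  : log-likelihood of the reference (e.g. intercept-only) model
    loglik_model : log-likelihood of the fitted model
    n            : number of individuals
    """
    # Cox and Snell's pseudo-R2
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    # Largest value Cox and Snell's pseudo-R2 can take for this null model
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_cox_snell


# The threshold quoted in the text: 0.05 / 2,000
threshold = bonferroni_threshold(0.05, 2000)
```

Dividing by the maximum attainable Cox and Snell value rescales the statistic so that a perfectly fitting model reaches 1.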

The participants of the UK Biobank were genotyped on two different arrays: about 10% of participants were genotyped on the UK BiLEVE Axiom array, whereas the rest were genotyped on the UK Biobank Axiom array [19]. To account for the potential biases correlated with the types of arrays, we evaluated the predictive performance of the PRS by accounting for the types of the arrays in addition to the age, sex, and the top ten genotype PCs. We found the identity of the UK Biobank assessment centers mostly has a non-significant impact on the predictive performance (S1 Fig, Methods).

To assess the degree of prioritization of the medically relevant alleles, we selected standing height, body mass index (BMI), high cholesterol, and asthma. We compared the predictive performance and the number of genetic variants for each functional category. For the four selected traits, we found little difference in the predictive performance (R2 = 0.177 vs. 0.176 for the PRS model with penalty factor and without penalty factor, respectively, for standing height, R2 = 0.111 vs. 0.111 for BMI, AUC = 0.620 vs. 0.619 for high cholesterol, and AUC = 0.617 vs. 0.617 for asthma) (S2 Fig), while we saw an enrichment of the number of the medically relevant alleles with non-zero coefficients in the PRS model with prioritization (2.15-fold enrichment for standing height, 2.75-fold for BMI, 4.14-fold for high cholesterol, and 4.34-fold for asthma) (Table 1 and S3 Table), highlighting the flexibility of BASIL/snpnet in assigning different levels of penalization based on the variant-level information.

Table 1. The prioritization of the medically relevant alleles with penalty factors.

The numbers of the genetic variants or allelotypes with non-zero coefficient values are shown for the selected four traits. The denominator represents the total number of variables included in the model. The numerator represents the number of the medically relevant alleles, which are one of the following: protein-truncating variants, protein-altering variants, imputed HLA allelotypes, the pathogenic or likely-pathogenic variants in the ClinVar database. The enrichment of the medically relevant variants is also shown.

Number of selected genetic variants or allelotypes (medically relevant / total)
Trait             | With penalty factor | Without penalty factor | Enrichment
Standing height   | 4187 / 51209        | 2129 / 55937           | 2.15
Body mass index   | 2543 / 27126        | 977 / 28667            | 2.75
High cholesterol  | 969 / 5987          | 215 / 5506             | 4.14
Asthma            | 1022 / 6430         | 250 / 6819             | 4.34
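
As a check on the enrichment column, the sketch below recomputes it from the counts in Table 1, assuming enrichment is the ratio of the two proportions of medically relevant alleles (with versus without penalty factors). The function name is hypothetical.

```python
# Counts from Table 1: (medically relevant, total selected) with and without penalty factors
table1 = {
    "Standing height": ((4187, 51209), (2129, 55937)),
    "Body mass index": ((2543, 27126), (977, 28667)),
    "High cholesterol": ((969, 5987), (215, 5506)),
    "Asthma": ((1022, 6430), (250, 6819)),
}


def fold_enrichment(with_penalty, without_penalty):
    """Ratio of the proportions of medically relevant alleles among selected variables."""
    (relevant_with, total_with) = with_penalty
    (relevant_without, total_without) = without_penalty
    return (relevant_with / total_with) / (relevant_without / total_without)


for trait, (with_penalty, without_penalty) in table1.items():
    print(f"{trait}: {fold_enrichment(with_penalty, without_penalty):.2f}")
```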

With the same set of four traits, we asked whether including the imputed genetic variants could improve the predictive performance. We saw some gain in the predictive performance for three traits but not for standing height (S3 Fig). Based on those results, we decided to move on to the phenome-wide application of the BASIL algorithm implemented in the R snpnet package on the directly genotyped variants, imputed allelotypes, and copy number variants while prioritizing the medically relevant alleles with penalty factors.

Significance and estimated effect size of sparse PRS models

We estimated the SNP-based heritability by applying linkage disequilibrium (LD) score regression (LDSC) [27] on genome-wide association study (GWAS) summary statistics. We compared it against the predictive performance (R2 for quantitative traits and Nagelkerke’s pseudo-R2 for binary traits) of the significant PRS models (Fig 2). Across 244 binary traits and 569 quantitative traits with significant PRS models, we found higher estimated observed scale heritability for quantitative traits. Overall, we found a significant correlation between the estimated SNP-based observed scale heritability and predictive performance (Spearman’s rank correlation coefficient ⍴ = 0.44, p-value = 3.5 x 10−13 for binary traits, ⍴ = 0.46, p-value = 1.4 x 10−31 for quantitative traits).

Fig 2. Comparison of the estimated SNP-based heritability and predictive performance across the 813 traits with significant PRSs.


The predictive performance (Nagelkerke’s pseudo-R2 for 244 binary traits [left] and R2 for 569 quantitative traits [right]) of the PRS models that only consider genetic variants are compared against the estimated SNP-based heritability. Both metrics are shown in observed scale and depend on the proportion of cases in the target and discovery cohorts. The solid gray lines represent y = x. We show the points on the bottom left corners in the inset plots. The error bars represent standard error. BMD: Bone mineral density.

The basic covariates alone are already informative for phenotype prediction. To assess the incremental utility of PRSs, we quantified the incremental predictive performance by comparing the predictive performance of the full model that considers both genotypes and covariates and that of the covariate-only model across the 813 traits with significant sparse PRS. We found most traits have a modest increase in the effect sizes of the prediction with a few notable exceptions, such as celiac disease (Nagelkerke’s pseudo-R2 = 0.149 in the full model vs 0.006 in the covariate-only model, p = 3.8 x 10−162), hair color (red) (Nagelkerke’s pseudo-R2 = 0.603 vs. 0.008, p < 1 x 10−300), mean platelet volume (R2 = 0.36 vs. 0.001, p < 1 x 10−300), heel bone mineral density (R2 = 0.20 vs. 0.06, p < 1 x 10−300), and blood and urine biomarker traits [23] (Figs 3 and 4).
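
The incremental predictive performance is a simple difference of two metrics, which the sketch below applies to the values reported in this paragraph. The `r_squared` helper uses one common definition of R2 (1 minus the ratio of residual to total sum of squares); whether the study used this form or a squared correlation is an assumption here, and all names are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (one common definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


def incremental_performance(full, covariates_only):
    """Gain of the full (genotype + covariates) model over the covariate-only model."""
    return full - covariates_only


# (full model, covariate-only model) as reported in the text
reported = {
    "celiac disease": (0.149, 0.006),
    "hair color (red)": (0.603, 0.008),
    "mean platelet volume": (0.36, 0.001),
    "heel bone mineral density": (0.20, 0.06),
}
increments = {
    trait: incremental_performance(full, cov) for trait, (full, cov) in reported.items()
}
```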

Fig 3. Incremental predictive performance of PRS models across the 813 traits with significant predictive performance in the hold-out test set individuals of white British ancestry.


The predictive performance (Nagelkerke’s pseudo-R2 for 244 binary traits [left] and R2 for 569 quantitative traits [right]) of the full models that consider both the genotype and covariates is compared against that of the covariate-only models, and their difference (the incremental predictive performance) is shown as a histogram.

Fig 4. The sparse PRS model and their predictive performance for celiac disease.


(A, B) The predictive performance of celiac disease PRS. (A) The celiac disease PRS distribution (y-axis) in a hold-out test set stratified by the disease case status (x-axis). The dashed lines represent the mean and the quantiles are shown as box plots. (B) The disease prevalence odds ratio compared to the individuals with middle (40–60th percentile) PRS score stratified by PRS percentile bins. The error bars represent standard error (SE). (C) The coefficients of the celiac disease PRS model. The estimated effect size (y-axis) for each genetic variant (x-axis) is shown. The gene symbols are annotated in the plot for coding variants and HLA allelotypes with large effect size estimates.

Sparse PRS models offer an interpretation of genomic loci underlying the polygenic risk

Celiac disease is an autoimmune disorder of the small intestine that is triggered by gluten consumption. The sparse PRS model for this trait, for example, consists of 428 variants, including the imputed HLA allelotypes and variants near the MHC region in chromosome 6 [19,21]. The PRS model also contains genetic variants in all other autosomes, including a previously implicated missense variant in chromosome 12 (rs3184504, log(OR) = 0.15 in multivariate PRS model) in SH2B3. This gene encodes SH2B adaptor protein 3, which is involved in cellular signaling, hematopoiesis, and cytokine receptors [28] (Fig 4).

The size of the PRS model is correlated with the incremental predictive performance

The significant PRS models have a wide range of the number of variables selected in the model, ranging from only one variable for iritis PRS (HLA allelotype, HLA-B*27:05, at the well-established HLA-B*27 locus [29,30]) to 51,209 variants selected for standing height PRS (Fig 5). We examined whether there is a relationship between the number of active variables in the significant PRS model and the incremental predictive performance. The significant correlation between the two quantities is stronger in quantitative (Spearman’s rank correlation coefficient ⍴ = 0.61, p = 2.2 x 10−59) traits than in binary (⍴ = 0.21, p = 9.6 x 10−4), reflecting the difference in power between binary and quantitative traits [31].
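
Spearman’s rank correlation, used above to relate model size to incremental predictive performance, is the Pearson correlation of the ranks. A minimal pure-Python sketch follows; it assigns average ranks to ties, and the example values in the usage lines are toy data rather than results from the study.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: model sizes and hypothetical incremental R2 values
model_sizes = [1, 428, 5987, 27126, 51209]
toy_increments = [0.001, 0.01, 0.02, 0.05, 0.10]
rho = spearman_rho(model_sizes, toy_increments)
```

Because the statistic depends only on ranks, it is insensitive to the very wide range of model sizes (from 1 to 51,209 variables).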

Fig 5. Comparison of the effect size and the model size of sparse PRS.


The number of the genetic variants included in the model (size of the model, x-axis) and the incremental predictive performance (effect size of the model, y-axis) are shown for 244 binary traits (left) and 569 quantitative traits (right). TTE: time-to-event phenotype.

Sparse PRS models exhibit limited transferability across ancestry groups

While the majority of the participants in the UK Biobank are of European ancestry, the inclusion of individuals from African and Asian ancestry enables an assessment of the transferability of the PRS models across ancestry groups in UK Biobank. In addition to the hold-out test set that we derived from the white British population, we focused on additional sets of individuals from non-British European (non-British white), African, South Asian, and East Asian ancestry groups and compared the incremental predictive performance with that in the white British hold-out test set (Fig 6). For quantitative traits, the models predicted well for non-British white individuals (linear regression fit of the incremental predictive performance: y = 0.91x), but they showed limited transferability to the non-European ancestry groups (y = 0.56x, y = 0.47x, and y = 0.13x for South Asian, East Asian, and African, respectively). Similarly, for binary traits, the non-British white group showed higher transferability (y = 0.80x) than the non-European ancestry groups (y = 0.027x, y = 0.059x, and y = -0.145x for South Asian, East Asian, and African, respectively).
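
The slopes quoted above summarize how much of the incremental predictive performance in the white British test set is retained in each target group. The sketch below assumes a least-squares fit through the origin, which the form y = ax suggests but this section does not confirm; the function name and the toy values are hypothetical.

```python
def slope_through_origin(x, y):
    """Least-squares slope a for the no-intercept model y = a * x."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Toy data: per-trait incremental R2 in the white British test set (x)
# and in a target ancestry group (y) that retains half of the signal
test_set_increment = [0.01, 0.05, 0.10, 0.20]
target_increment = [0.5 * value for value in test_set_increment]
slope = slope_through_origin(test_set_increment, target_increment)
```

A slope near 1 would indicate near-complete transferability, while a slope near 0 would indicate that almost none of the incremental performance carries over.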

Fig 6. Transferability assessment of PRS models across ancestry groups in the UK Biobank.


The incremental predictive performance (Nagelkerke’s pseudo-R2 for 244 binary traits (A, B) and incremental R2 for 569 quantitative traits (C, D)) was quantified in individuals in different ancestry groups in the UK Biobank and was compared against that in the hold-out test set constructed from the individuals in the white British ancestry group. (A, C) The difference in the incremental predictive performance between the target group (x-axis, double-coded with color) and the source white British cohort. The median values are shown as black horizontal bars and numbers. (B, D) Comparison of the incremental predictive performance in the target group (color) and the test set. A linear regression fit is shown for each ancestry group with the dashed lines. The slopes of the regression lines are also shown.

Discussion

In this study, we performed a systematic scan of polygenic prediction across more than 1,500 traits and reported 813 significant sparse PRS models. We found a correlation between the predictive performance of the significant PRS models and SNP-based heritability estimates. We assessed the effect size of the PRS model by quantifying the incremental predictive performance, which we define as the difference in the predictive performance between the covariate-only model and the full model consisting of both covariates and genetics. In both quantitative and binary traits, we find a significant correlation between the number of independent loci included in the model and their incremental predictive performance.

Our study is complementary to many other studies that focus on fewer traits to construct PRS models from GWAS meta-analysis and mixed models. While the sample size in our study is sufficiently large to observe statistical significance in predictive performance across hundreds of traits, statistical significance does not necessarily imply clinical relevance of the PRS models. Moreover, population-based recruitment in UK Biobank may not be the best strategy to achieve the highest predictive performance for some traits. A disease-focused study [6,32–34] would be an attractive alternative strategy, especially when multiple genotyped cohorts recruited for the same disease are available or the disease of interest has a low population prevalence. Our study, instead, focused on the phenome-wide application of PRS across hundreds of traits in a single cohort by applying the BASIL algorithm with a readily available implementation in the R snpnet package [10], which does not require explicit modeling of underlying genetic architecture across a wide variety of traits.

For binary traits, we used observed scale pseudo-R2 and observed scale SNP-based heritability estimates, given that population prevalence is available for only a subset of binary traits considered in the present study. Conversion to liability scale estimates will further enhance the validity of the comparison [35] and is of interest for future investigation.

Like other PRS approaches that consider datasets from one source population in the PRS training, our sparse model trained on the individual-level data of white British individuals showed limited transferability across diverse ancestry groups [36–38]. The sample sizes of non-European ancestry groups in UK Biobank are smaller than those of European ancestry groups. In general, that will result in larger uncertainties in predictive performance assessment. Nonetheless, when we assess the incremental predictive performance across ancestry groups by comparing the full model consisting of the genetic data and basic covariates and the covariate-only model, we found the binary traits, including disease outcomes, have lower transferability compared to quantitative traits, including biomarkers, blood measurements, and anthropometric traits. The power difference between binary and quantitative traits [31], limitation in power for some traits, especially for the binary traits with limited case counts, and differences in heritability may be contributing factors to the observed difference. Improvements of PRS models with high transferability across ancestry groups and the admixed individuals are of interest for future research.

Given the medical relevance [39–51], we prioritized pathogenic and likely-pathogenic variants reported in ClinVar [52] as well as predicted protein-truncating and protein-altering variants (Methods). Our analysis focusing on four traits suggests that prioritizing the medically relevant alleles does not necessarily improve the predictive performance. While our sparse PRS models show enrichment in the number of selected medically relevant alleles, there is no guarantee that the genetic variants included in the sparse PRS models are causal. This warrants further follow-up analysis with statistical fine-mapping and detailed functional characterization at each locus.

The increased availability of PRS models across multiple traits [17] exhibits a wide range of applications, including the improved genetic risk prediction of disease [23,53] and the identification of causal relationships across complex traits [54]. We provide the results on the Global Biobank Engine (https://biobankengine.stanford.edu/prs) as well as on the PGS catalog [17] and envision the resource will serve as an important basis to understand the polygenic basis of complex traits.

Methods

Ethics statement

This research has been conducted using the UK Biobank Resource under Application Number 24983, “Generating effective therapeutic hypotheses from genomic and hospital linkage data” (http://www.ukbiobank.ac.uk/wp-content/uploads/2017/06/24983-Dr-Manuel-Rivas.pdf). Based on the information provided in Protocol 44532, the Stanford IRB has determined that the research does not involve human subjects as defined in 45 CFR 46.102(f) or 21 CFR 50.3(g). All participants of the UK Biobank provided written informed consent (more information is available at https://www.ukbiobank.ac.uk/2018/02/gdpr/).

Study population and genetic data

UK Biobank is a population-based cohort study collected from multiple sites across the United Kingdom [18]. To minimize the variabilities due to population structure in our dataset, we restricted our analyses to unrelated individuals based on the following four criteria [46,55] reported by the UK Biobank in the sample QC file, “ukb_sqc_v2.txt”: (1) used to compute principal components (“used_in_pca_calculation” column); (2) not marked as outliers for heterozygosity and missing rates (“het_missing_outliers” column); (3) do not show putative sex chromosome aneuploidy (“putative_sex_chromosome_aneuploidy” column); and (4) have at most ten putative third-degree relatives (“excess_relatives” column).
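
The four sample-level criteria can be expressed as a single predicate over one row of the sample QC file. This is a sketch only: it assumes each column has already been parsed into an integer flag (1 = yes, 0 = no), and the function name is hypothetical.

```python
def passes_sample_qc(row):
    """Return True if a sample satisfies all four criteria listed above.

    `row` maps sample QC column names to integer flags (assumed 0/1 encoding).
    """
    return (
        row["used_in_pca_calculation"] == 1                 # (1) used to compute PCs
        and row["het_missing_outliers"] == 0                # (2) not a heterozygosity/missingness outlier
        and row["putative_sex_chromosome_aneuploidy"] == 0  # (3) no putative sex chromosome aneuploidy
        and row["excess_relatives"] == 0                    # (4) not flagged for excess relatives
    )


example_row = {
    "used_in_pca_calculation": 1,
    "het_missing_outliers": 0,
    "putative_sex_chromosome_aneuploidy": 0,
    "excess_relatives": 0,
}
keep = passes_sample_qc(example_row)
```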

Using a combination of genotype principal components (PCs), the self-reported ancestry (UK Biobank Field ID 21000, https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=21000), and the “in_white_British_ancestry_subset” column in the sample QC file from UK Biobank, we subsequently focused on people of self-identified white British (n = 337,129), self-identified non-British white (n = 24,905), African (n = 6,497), South Asian (n = 7,831), and East Asian (n = 1,704) ancestry as described elsewhere [23]. Briefly, we used a two-step procedure to define the five groups. We first used the genotype principal component loadings of the individuals and set thresholds on component 1 and component 2 as follows: (1) self-identified White British: -20 ≤ PC1 ≤ 40 and -25 ≤ PC2 ≤ 10 and in_white_British_ancestry_subset == 1; (2) self-identified non-British White: -20 ≤ PC1 ≤ 40, -25 ≤ PC2 ≤ 10, has a self-reported ancestry of White, and does not identify themselves as White British; (3) African: 260 ≤ PC1, 50 ≤ PC2, and does not identify themselves as any of the following: Asian, White, Mixed, or Other population groups; (4) South Asian: 40 ≤ PC1 ≤ 120, -170 ≤ PC2 ≤ -80, and does not identify themselves as any of the following: Black, White, Mixed, or Other population groups; and (5) East Asian: 130 ≤ PC1 ≤ 170, PC2 ≤ -230, and does not identify themselves as any of the following: Black, White, Mixed, or Other population groups. To refine the population definition by removing the outliers, we computed population-specific genotype PCs using approximately LD independent (R2 < 0.5) common (population-specific minor allele frequency > 5%) biallelic variants outside of the major histocompatibility complex region [23]. We applied the following thresholds [23]: (1) South Asian: -0.02 ≤ population-specific PC1 ≤ 0.03, -0.05 ≤ population-specific PC2 ≤ 0.02; and (2) East Asian: -0.01 ≤ population-specific PC1 ≤ 0.02, -0.02 ≤ population-specific PC2 ≤ 0.
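
The first step of the two-step procedure (thresholds on the global PC1 and PC2 combined with self-report) can be sketched as a function. The representation of self-reported ancestry as a broad group label plus a separate British flag is an assumption made for illustration, as are the function name and return labels; the second, population-specific refinement step is not included.

```python
def assign_group(pc1, pc2, broad_group, self_reported_british, in_white_british_subset):
    """First-step ancestry group assignment from global PCs and self-report.

    broad_group: hypothetical label such as "White", "Black", "Asian", "Mixed", "Other"
    self_reported_british: True if the participant self-identifies as White British
    in_white_british_subset: True if in_white_British_ancestry_subset == 1
    Returns a group label, or None if no definition matches.
    """
    in_european_box = -20 <= pc1 <= 40 and -25 <= pc2 <= 10
    if in_european_box and in_white_british_subset:
        return "white_british"
    if in_european_box and broad_group == "White" and not self_reported_british:
        return "non_british_white"
    if pc1 >= 260 and pc2 >= 50 and broad_group not in {"Asian", "White", "Mixed", "Other"}:
        return "african"
    not_excluded_for_asian = broad_group not in {"Black", "White", "Mixed", "Other"}
    if 40 <= pc1 <= 120 and -170 <= pc2 <= -80 and not_excluded_for_asian:
        return "south_asian"
    if 130 <= pc1 <= 170 and pc2 <= -230 and not_excluded_for_asian:
        return "east_asian"
    return None
```

Individuals who match none of the five definitions return `None` and would be left out of the analysis sets.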

We randomly split the white British cohort into 70% training (n = 235,991), 10% validation (to select the optimal sparsity level) (n = 33,713), and 20% test (n = 67,425) sets [23,56]. We used the same split of training, validation, and test set for all tested traits. The non-British white, African, South Asian, and East Asian samples were only used as test sets.
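
A random 70/10/20 split of this kind can be sketched as follows. The seed, the rounding rule, and the function name are assumptions, so the sketch will not reproduce the exact set memberships or counts used in the study.

```python
import random


def split_ids(sample_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle sample IDs and split them into training, validation, and test sets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # the remainder forms the test set
    return train, val, test


train_ids, val_ids, test_ids = split_ids(range(1000))
```

Fixing the split once and reusing it for every trait, as described above, keeps the test individuals unseen by all models.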

Variant quality control and variant annotation

We used genotype datasets (release version 2 for the directly genotyped variants and the imputed HLA allelotype datasets) [19], the CNV dataset [22], and the hg19 human genome reference for the main PRS analyses in the study. Additionally, we considered imputed variants (release version 3) to investigate whether the imputed variants would improve the predictive performance. We annotated the directly-genotyped variants using Ensembl’s Variant Effect Predictor (VEP) (version 101) [57,58] with the LOFTEE plugin (https://github.com/konradjk/loftee) [49], for which we created a Docker container image (https://github.com/yk-tanigawa/docker-ensembl-vep-loftee). Using ClinVar (version 20200914) [28], we annotated “pathogenic” and “likely pathogenic” variants.

We performed variant quality control as described elsewhere [23,46,55]. Briefly, we focused on the variants passing the following criteria: (1) outside of the major histocompatibility complex (MHC) region (hg19 chr6:25477797–36448354); (2) the missingness of the variant is less than 1%, considering that the two genotyping arrays (the UK BiLEVE Axiom array and the UK Biobank Axiom array) cover a slightly different set of variants [19]; (3) the minor-allele frequency is greater than 0.01%; (4) the Hardy-Weinberg disequilibrium test p-value is greater than 1.0 x 10−7; (5) passed the comparison of minor allele frequency with the gnomAD dataset (version 2.0.1) as described before [46,49]; and (6) not among the 11 variants that we removed after manually investigating the cluster plots for a subset of variants and finding unreliable genotype calls [46].

We grouped the VEP-predicted consequence of the variants into six groups: protein-truncating variants (PTVs), protein-altering variants (PAVs), proximal coding variants (PCVs), intronic variants (Intronic), variants in the untranslated region (UTR), and other variants (Others). Our grouping rule of the VEP-predicted consequence is summarized in S4 Table.

We included the imputed copy number variants (CNVs) [22] and imputed HLA allelotypes [21]. The CNVs were called using PennCNV (v.1.0.4) [59] on raw signal intensity data from each genotyping array as described elsewhere [22]. Because the precise location of the CNVs is not identified, we did not infer the functional consequences of CNVs with variant annotation. The HLA allelotypes at the HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, -DRB1, -DRB3, -DRB4, and -DRB5 loci were imputed using HLA*IMP:02, and the imputed dosage file is provided by the UK Biobank. We included 156 alleles across all 11 loci that had a frequency of 0.1% or greater in the white British individuals. We rounded allele dosages that were within plus or minus 0.1 of 0, 1, or 2 and excluded the remaining nonzero entries. We also excluded erroneous total allele counts post-rounding [21].
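
The dosage rounding rule for HLA allelotypes can be sketched as a small function. Treating excluded entries as missing (`None`) is an assumption, as is the function name; the handling of values lying exactly on the 0.1 boundary is not specified in the text.

```python
def round_hla_dosage(dosage, tolerance=0.1):
    """Round an imputed allele dosage to 0, 1, or 2 when it is close enough.

    Returns the rounded allele count, or None (treated as missing here)
    when the dosage is not within `tolerance` of 0, 1, or 2.
    """
    for allele_count in (0, 1, 2):
        if abs(dosage - allele_count) <= tolerance:
            return allele_count
    return None


rounded = [round_hla_dosage(d) for d in (0.03, 0.97, 1.95, 0.5, 1.4)]
```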

When evaluating whether the inclusion of the imputed variants would improve the predictive performance of the PRS models, we focused on the 5,931,362 imputed variants [19] that met the following criteria: (1) imputation INFO score is greater than 0.7, (2) minor allele frequency computed across the entire ∼500k genotyped samples (UK Biobank Resource 1967, https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=1967) is greater than 0.01, (3) the variant is biallelic, (4) the variant is not present in the directly genotyped variants, and (5) missingness is less than 1%. We subsequently combined the imputed variant dataset with the directly genotyped variants, imputed HLA allelotypes, and copy number variants.
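
The five inclusion criteria for imputed variants translate directly into a filter. The record layout, field names, and variant identifiers below are hypothetical and chosen only to illustrate the thresholds stated above.

```python
from dataclasses import dataclass


@dataclass
class ImputedVariant:
    variant_id: str
    info: float          # imputation INFO score
    maf: float           # minor allele frequency across the genotyped samples
    n_alleles: int       # 2 for a biallelic variant
    missing_rate: float  # fraction of samples with a missing call


def keep_imputed_variant(variant, genotyped_ids):
    """Apply the five inclusion criteria for imputed variants."""
    return (
        variant.info > 0.7
        and variant.maf > 0.01
        and variant.n_alleles == 2
        and variant.variant_id not in genotyped_ids
        and variant.missing_rate < 0.01
    )


genotyped = {"var_on_array"}
candidate = ImputedVariant("var_imputed_only", info=0.95, maf=0.12, n_alleles=2, missing_rate=0.0)
keep_candidate = keep_imputed_variant(candidate, genotyped)
```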

Phenotype definitions in the UK Biobank

We analyzed a wide variety of traits in the UK Biobank, including disease outcome [46,60], family history [46,60], cancer registry data [46], blood and urine biomarkers [23], hematological measurements, and other binary and quantitative phenotypes [55,56]. Some phenotype information collected at UK Biobank’s assessment center contains up to four instances, which correspond to (1) the initial assessment visit (2006–2010), (2) the first repeat assessment visit (2012–2013), (3) the imaging visit (2014-), and (4) the first repeat imaging visit (2019-). Briefly, for binary traits, we performed manual curation of phenotypic definitions and assigned “case” status if the participants are classified as the case in at least one of their visits and “control” otherwise. For quantitative traits, we took the median of non-NA values, as described elsewhere [55].
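
The aggregation across visits can be sketched as two small helpers: a participant is a case if classified as a case at any visit, and a quantitative value is the median of the non-missing measurements. Representing a missing instance as `None` and the function names are assumptions for illustration.

```python
from statistics import median


def binary_status(per_visit_case_flags):
    """'case' if classified as a case in at least one visit, otherwise 'control'."""
    return "case" if any(flag is True for flag in per_visit_case_flags) else "control"


def quantitative_value(per_visit_values):
    """Median of the non-missing values across visits (None if all are missing)."""
    observed = [value for value in per_visit_values if value is not None]
    return median(observed) if observed else None


status = binary_status([False, None, True, None])
height = quantitative_value([170.0, None, 172.0, None])
```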

Previously, we analyzed blood and urine biomarker traits, investigating the effects of covariates on the biomarker levels and derived covariate-adjusted biomarker values [23]. Briefly, we used a linear regression model to account for the covariate effects on the log-transformed measurement values from UK Biobank and adjusted for principal component loadings of genotype, age, sex, age by sex interactions, self-identified ancestry group, self-identified ancestry group by sex interactions, fasting time, estimated sample dilution factor, assessment center indicators, genotyping batch indicators, time of sampling during the day, the month of assessment, and day of the assay. We used the PRS models trained for the covariate-adjusted traits [23]. To quantify the incremental predictive performance against the covariate-only models, we quantified predictive performance against the original measurement values, except eGFR, AST/ALT ratio, and non-albumin protein, where we used the covariate-adjusted trait values. Those three traits are derived from covariate-adjusted biomarkers [23] and do not have raw measurement values.

The 1,565 traits with at least 100 cases (for binary traits) or non-NA measurements (for quantitative traits) analyzed in this study are listed in S1 Table.

Construction of sparse PRS models

Using the batch screening iterative Lasso (BASIL) algorithm implemented in the R snpnet package [10], we constructed the sparse PRS models for the 1,565 traits. We used the Gaussian family and the R2 metric for quantitative traits, whereas we used the binomial family and the AUC-ROC metric for the binary traits [10]. For each trait, we fit a series of regression models with a varying degree of sparsity on the training set, consisting of 70% (n = 235,991) of unrelated individuals of white British ancestry. The predictive performance of each of the models is evaluated on the validation set, which consists of 10% (n = 33,713) of unrelated individuals of white British ancestry, to guide the selection of the optimal level of sparsity. We selected the sparsity that maximizes the predictive performance in the validation set. We subsequently refit the penalized regression model using the individuals in the combined training and validation sets (n = 269,704), which we denote as the score development set, to maximize the power in the regression model [10]. We used the same training, validation, and test set split across all the PRS models analyzed in this study.
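The split-then-select procedure can be sketched in a few lines. This is not the snpnet implementation: the random shuffle with a fixed seed, the helper names, and the toy validation metrics are assumptions made for the example. The actual model fitting along the regularization path is done by BASIL and is omitted here.

```python
import random


def split_individuals(ids, seed=0, fractions=(0.7, 0.1, 0.2)):
    """Shuffle once and cut into training / validation / test sets.
    Reusing the same seed gives the same split for every trait."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def select_sparsity(validation_metric_by_lambda):
    """Return the penalty strength whose model scores best on validation."""
    return max(validation_metric_by_lambda, key=validation_metric_by_lambda.get)


train, val, test = split_individuals(range(1000))

# Toy validation metrics (e.g. R2 or AUC) along a path of lambda values.
best_lambda = select_sparsity({0.10: 0.20, 0.05: 0.27, 0.01: 0.25})
```

After the best penalty strength is chosen, the model would be refit on `train + val` (the score development set) before evaluation on `test`.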

As opposed to many PRS methods that operate on the GWAS summary statistics [3–9,13–15], our method takes individual-level genotype and phenotype data. Using L1 penalized regression (also known as Lasso), BASIL simultaneously performs variable selection and effect size estimation of the selected variants. We included age, sex, and top ten population-specific genotype PC loadings computed for the white British individuals [23] as unpenalized covariates. Thanks to the L1 penalty term in the objective function, which penalizes the sum of the absolute values of the regression coefficients and shrinks many of them to exactly zero, the resulting models will be sparse, meaning that they will have fewer genetic variants than unpenalized models [10].
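For reference, the L1-penalized objective for a quantitative trait can be written in standard Lasso notation. The symbols are introduced here for illustration and are not taken from the paper: y_i is the trait value of individual i, z_i the vector of unpenalized covariates with coefficients γ, x_i the vector of p genetic variants with coefficients β, and λ ≥ 0 the penalty strength that controls sparsity.

```latex
\min_{\beta_0,\,\gamma,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}
\left(y_i - \beta_0 - z_i^{\top}\gamma - x_i^{\top}\beta\right)^{2}
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

For binary traits, the squared-error term is replaced by the negative binomial (logistic) log-likelihood. The penalty acts only on β, so the covariate effects γ are never shrunk, and larger values of λ set more entries of β to exactly zero.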

To prioritize coding variants over non-coding variants in linkage, we assigned three levels of penalty factors (also known as penalty scaling parameter) [61]: 0.5 for pathogenic variants in ClinVar [52] or protein-truncating variants according to VEP-based variant annotation [58]; 0.75 for likely pathogenic variants in ClinVar, VEP-predicted protein-altering variants, or imputed allelotypes; and 1.0 for all other variants. The assignment rules of the penalty factors are summarized in (S5 Table). The variants with lower values of penalty factors are prioritized in the L1 penalized regression. To assess the degree of prioritization of the medically relevant alleles and their impacts on the predictive performance, we focused on four traits (standing height, BMI, high cholesterol, and asthma) and fit a separate model without penalty factors. We compared the number of selected variants and the predictive performance.
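The three-level assignment can be sketched as a small lookup. The annotation labels (`"pathogenic"`, `"likely_pathogenic"`, `"ptv"`, `"pav"`) and the function name are hypothetical, and the rule that the lowest applicable factor wins is an assumption; S5 Table gives the rules actually used. In glmnet-style solvers [61], the factor multiplies the penalty strength for that variant, so a smaller factor means weaker shrinkage.

```python
def penalty_factor(clinvar: str = "", consequence: str = "",
                   is_allelotype: bool = False) -> float:
    """Return the penalty factor for one variant (lower means prioritized)."""
    # Level 1: ClinVar pathogenic or protein-truncating variants.
    if clinvar == "pathogenic" or consequence == "ptv":
        return 0.5
    # Level 2: likely pathogenic, protein-altering, or imputed allelotypes.
    if clinvar == "likely_pathogenic" or consequence == "pav" or is_allelotype:
        return 0.75
    # Level 3: all other variants.
    return 1.0


factors = {
    "truncating": penalty_factor(consequence="ptv"),
    "missense": penalty_factor(consequence="pav"),
    "hla_allele": penalty_factor(is_allelotype=True),
    "intronic": penalty_factor(consequence="intronic"),
}
```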

Predictive performance and transferability of PRS models

We evaluated the predictive performance (R2 for quantitative traits and Nagelkerke’s pseudo-R2 [also known as Cragg and Uhler’s pseudo-R2] [24,25] for binary traits) of PRS models (S6 Table). For the binary traits, we also evaluated the receiver operating characteristic area under the curve [ROC-AUC] and Tjur’s Coefficient of Discrimination (Tjur’s pseudo-R2) [26]. For R2 and ROC-AUC, we evaluated the 95% confidence interval of predictive performance using approximate standard error of R2 [62,63] and DeLong’s method [64], respectively. We used the individuals in the hold-out test set (n = 67,425) of white British ancestry as well as additional sets of individuals in non-British white (n = 24,905), African (n = 6,497), South Asian (n = 7,831), and East Asian (n = 1,704) ancestry groups. We evaluated the predictive performance of (1) the genotype-only model, (2) the covariate-only model, and (3) the full model that considers both covariates and genotypes. We computed the difference between the full model and the covariate-only model to derive the incremental predictive performance.
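The two pseudo-R2 metrics for binary traits follow directly from their standard definitions [24–26]. The sketch below assumes that the log-likelihoods of the intercept-only and fitted logistic models, or the fitted case probabilities, are already available; the function names are ours.

```python
import math
from statistics import mean


def nagelkerke_r2(loglik_null: float, loglik_model: float, n: int) -> float:
    """Cox-Snell R2 rescaled by its maximum attainable value."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_cox_snell


def tjur_r2(y, p):
    """Mean predicted probability among cases minus that among controls."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    return mean(cases) - mean(controls)


def incremental(full: float, covariate_only: float) -> float:
    """Incremental predictive performance of the PRS over the covariates."""
    return full - covariate_only


# Toy example with four individuals (two cases, two controls).
tjur = tjur_r2([1, 1, 0, 0], [0.8, 0.6, 0.3, 0.1])
```

A model that is no better than the intercept-only model gives a Nagelkerke value of 0, and a model with a log-likelihood of 0 (perfect fit) gives 1, which is the purpose of the rescaling.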

To evaluate the predictive performance of the covariate-only model in the hold-out test set of white British ancestry, we fit a generalized regression model, trait ∼ age + sex + array + Genotype PCs, using the individuals in the score development set. We subsequently computed the risk scores based on the covariate terms for the individuals in the hold-out test set. The array is an indicator variable denoting the types of the genotyping array (either the UK BiLEVE Axiom array or the UK Biobank Axiom array). For the individuals in non-British white, African, South Asian, and East Asian ancestry groups, we took the ancestry group-specific PCs computed for each set [23] and fit the same regression model for each group. We did not use the array indicator variable for African, South Asian, and East Asian because all individuals in those ancestry groups were genotyped on the UK Biobank Axiom Array (S2 Table).

To evaluate the predictive performance of the genotype-only model, we computed the polygenic risk score for the sets of individuals for evaluation using the --score command implemented in plink2 [65]. We quantified evaluation metrics (R2, Nagelkerke’s pseudo-R2, ROC-AUC, and Tjur’s pseudo-R2).
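Conceptually, the polygenic risk score of an individual is a weighted sum of effect-allele dosages over the variants in the sparse model. The sketch below illustrates only that arithmetic; it does not reproduce plink2's output format, scaling, or handling of missing genotypes. Skipping missing genotypes is a simplification made here, and the variant identifiers are made up.

```python
def polygenic_score(dosages: dict, weights: dict) -> float:
    """Sum of (model weight x effect-allele dosage) over model variants.
    Variants with a missing dosage are skipped in this simplified sketch."""
    total = 0.0
    for variant, weight in weights.items():
        dosage = dosages.get(variant)
        if dosage is None:
            continue
        total += weight * dosage
    return total


# Hypothetical model weights (non-zero coefficients) and one individual's dosages.
weights = {"var1": 0.5, "var2": -0.2, "var3": 1.0}
dosages = {"var1": 2, "var2": 1, "var3": None}
score = polygenic_score(dosages, weights)  # 0.5*2 + (-0.2)*1
```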

To evaluate the predictive performance of the full model, we fit a model, trait ∼ 1 + covariate-only score + PRS, using the covariate-only score and PRS described above. The constant term accounts for the potential differences in the trait mean (for quantitative traits) or case prevalence (for binary traits) between the score development population and the target population. We looked at the p-value reported for the PRS term for the statistical significance of the PRS model. We used p < 2.5 x 10−5 (= 0.05/2000, adjusted for multiple hypothesis testing using the Bonferroni method for the number of traits analyzed in the study) as the significance threshold.
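The significance threshold is a plain Bonferroni calculation, sketched below with hypothetical names. Note that dividing by 2,000 rather than by the 1,565 traits analyzed makes the threshold slightly more conservative.

```python
ALPHA = 0.05
N_TESTS = 2000                 # denominator stated in the text
THRESHOLD = ALPHA / N_TESTS    # 2.5 x 10^-5


def is_significant(p_value: float, threshold: float = THRESHOLD) -> bool:
    """True when the p-value of the PRS term is below the Bonferroni threshold."""
    return p_value < threshold


# Toy p-values for the PRS term of three traits.
flags = [is_significant(p) for p in (1e-6, 3e-5, 0.01)]
```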

We also computed the difference in R2 or Nagelkerke’s pseudo-R2 between the full and covariate-only models to derive the incremental predictive performance.

SNP-based heritability estimation

To compare the incremental predictive performance of the PRS models with SNP-based heritability, we applied genome-wide association analysis with PLINK. Specifically, we applied the --glm command in PLINK [65] v2.00-alpha with age, sex, array, the number of CNVs, the length of CNVs, and the top ten genotype PC loadings as covariates. The array is an indicator variable denoting whether the UK Biobank Axiom array or UK BiLEVE Axiom array was used in the genotyping. We included this term if the variants were directly measured on both arrays. The number and the length of the CNVs are described elsewhere [22]. The genotype PCs are the principal component (PC) loadings of individuals. We computed the population-specific PCs using the unrelated individuals in white British and used the first 10 PCs [23]. In the regression analysis, we standardized the variance of the covariates (--covar-variance-standardize option) and applied quantile normalization for the quantitative phenotype (--pheno-quantile-normalize option). Note that we did not perform quantile normalization in the PRS analysis. We used the “cc-residualize” and “firth-residualize” options that implement the approximation [66] for efficient computation of GWAS p-values. We subsequently applied linkage disequilibrium (LD) score regression (LDSC) [27] and characterized the SNP-based heritability (S7 Table). We compared the predictive performance of the PRS models and the LDSC-based heritability estimates.

Correlation analysis of the number of genetic variants and predictive performance of PRS models

We applied Spearman’s correlation test implemented in R to assess the rank correlation between the size (the number of genetic variants included in the model) and the effect size (the incremental predictive performance) of the PRS model.
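Spearman's ⍴ is the Pearson correlation of the ranks. The study used the test implemented in R; the sketch below is a standard-library Python equivalent of the coefficient only (no p-value), with ties given their average rank, and the toy numbers are made up.

```python
import math
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: model size (number of variants) and incremental performance.
n_variants = [10, 200, 3000, 40000]
incremental_r2 = [0.001, 0.01, 0.02, 0.3]
rho = spearman_rho(n_variants, incremental_r2)
```

Because the coefficient depends only on ranks, it is insensitive to the very different scales of model size and incremental predictive performance.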

Statistics

For computational and statistical analysis, we used Jupyter Notebook [67], R [68], R tidyverse package [69], and GNU parallel [70]. The p-values were computed from two-sided tests unless otherwise specified.

Supporting information

S1 Fig. Statistical significance of the assessment center terms in phenotype prediction.

We fit a regression model on age, sex, the types of genotyping arrays, polygenic risk score, and assessment centers for each of the 1,565 traits analyzed in the study. The frequency of the statistical significance (-log10(P)) of assessment center variables was shown. The cumulative frequency was shown on the secondary axis on the right. The statistical significance after the Bonferroni correction was shown as a red vertical line.

(TIF)

S2 Fig. The impact of prioritizing the medically relevant alleles with penalty factors on the predictive performance of snpnet PRS models.

The predictive performance (AUC for binary traits and R2 for quantitative traits) evaluated across hold-out test set individuals of different ancestry groups in UK Biobank are shown for four traits. The error bars represent the 95% confidence interval.

(TIF)

S3 Fig. The impact of the imputed genetic variants on the predictive performance of snpnet PRS models.

The predictive performance (AUC for binary traits and R2 for quantitative traits) evaluated across hold-out test set individuals of different ancestry groups in UK Biobank are shown for four traits. The error bars represent the 95% confidence interval.

(TIF)

S1 Table. List of traits analyzed in the study and the predictive performance of the corresponding PRS models.

For the 1,565 traits analyzed in the study, the following information is shown: trait category, the phenotype ID in Global Biobank Engine (GBE ID), trait name, the types of link functions in a generalized linear model (Gaussian for quantitative traits and Binomial for binary traits), the predictive performance of the genotype-only model, covariate-only model, the full model that considers both genotype and covariates, as well as the incremental predictive performance (Delta[Full, covariates-only]), the number of genetic variants included in the PRS model, the statistical significance of the incremental predictive performance in a hold-out test set consisting of a subset of white British individuals in the UK Biobank, whether the p-value is significant after multiple-hypothesis correction (p < 2.5 x 10−5), the score ID in polygenic score (PGS) catalog, the experimental factor ontology term ID of the mapped traits in PGS catalog, and the label of the mapped traits in PGS catalog.

(XLSX)

S2 Table. The cohort characteristics.

For each ancestry group in UK Biobank, the number of individuals (n), age (mean and standard deviation [sd]), sex (percentage of male individuals), and the fraction of individuals genotyped on the UK Biobank Axiom Array are shown. The statistics for the white British ancestry group are shown for the 70% training set, 10% validation set, and 20% test set.

(XLSX)

S3 Table. The number of variants with non-zero BETAs is shown across four traits.

For each trait, we compared two models: without and with penalty factors to prioritize the medically relevant alleles.

(XLSX)

S4 Table. The variant consequence grouping.

We grouped the Ensembl’s variant effect predictor (VEP)-predicted consequence of the genetic variants into six groups (Consequence group): protein-truncating variants (PTVs), protein-altering variants (PAVs), protein-coding variants (PCVs), intronic variants (Intronic), variants in untranslated region (UTR), and other non-coding variants (Others). The links to the sequence ontology (SO) term detailing the definition of each of the predicted consequences are shown.

(XLSX)

S5 Table. The penalty factor assignment rule.

We used the VEP-predicted consequence and ClinVar annotation to prioritize protein-truncating, protein-altering, and (likely) pathogenic variants by assigning lower penalty factor values. The penalty factor and the number of variants stratified by genetic variants (genotype or allelotype), predicted consequence, and ClinVar annotation are shown.

(XLSX)

S6 Table. The predictive performance of PRS models.

For each trait (Trait category, GBE_ID, and Trait Name), we show the types of link functions in a generalized linear model (GLM family column, Gaussian for quantitative traits and binomial for binary traits), the population split (population), the types of the predictive model (model column), the types of evaluation metric (R2 [R2], Nagelkerke’s pseudo-R2 [NagelkerkeR2], AUROC [AUC], or Tjur’s Coefficient of Discrimination [TjurR2]), the value of the specified metric and its lower and upper bound of 95% confidence interval, and the statistical significance (p-value).

(XLSX)

S7 Table. Estimated SNP-based heritability.

For each trait with a significant PRS model (trait, trait_name, and trait_category), we show the types of link functions in a generalized linear model (family column, Gaussian for quantitative traits and binomial for binary traits), estimated SNP-based observed scale heritability with standard error (h2_obs and h2_obs_se), lambda GC (lambda_GC), mean chi-square statistic (mean_chi2), LD score regression intercept and its standard error (intercept and intercept_se), and the proportion of the inflation attributed to the LD score regression intercept, defined by (intercept -1)/(mean(chi-square)-1), and its standard error (ratio and ratio_se).

(XLSX)

Acknowledgments

Some of the computing for this project was performed on the Sherlock cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to these research results. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies; funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability

The sparse PRS model weights generated from this study are available on the Global Biobank Engine (https://biobankengine.stanford.edu/prs). The significant PRS models are also available at the PGS catalog (https://www.pgscatalog.org/publication/PGP000244/ and https://www.pgscatalog.org/publication/PGP000128/, score IDs are listed in S1 Table). The BASIL algorithm implemented in the R snpnet package was used in the PRS analysis, which is available at https://github.com/rivas-lab/snpnet. The analyses presented in this study were based on the individual-level data accessed through the UK Biobank: https://www.ukbiobank.ac.uk.

Funding Statement

This work has been supported by The National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) [R01HG010140 to M.A.R.]; NIH [5U01 HG009080 to M.A.R., 5R01 EB 001988-21 to T.H., and 5R01 EB001988-16 to R.T.]; National Science Foundation [DMS-1407548 to T.H., DMS1208164 to R.T.]; Stanford University School of Medicine [to Y.T., R.L., and M.A.R.]; and the Funai Foundation for Information Technology [to Y.T.]. The authors of this manuscript have received the following salary support: NHGRI of NIH [R01HG010140 to Y.T. and M.A.R., R01HG008155 to Y.T.]; NIH [5U01 HG009080 to M.A.R.]; and the National Institute on Aging of NIH [R01AG067151 to Y.T.]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies; funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Lewis CM, Vassos E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 2020;12: 44. doi: 10.1186/s13073-020-00742-5
  • 2. Wray NR, Lin T, Austin J, McGrath JJ, Hickie IB, Murray GK, et al. From Basic Science to Clinical Application of Polygenic Risk Scores: A Primer. JAMA Psychiatry. 2021;78: 101–109. doi: 10.1001/jamapsychiatry.2020.3049
  • 3. Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet. 2015;97: 576–592. doi: 10.1016/j.ajhg.2015.09.001
  • 4. Mak TSH, Porsch RM, Choi SW, Zhou X, Sham PC. Polygenic scores via penalized regression on summary statistics. Genet Epidemiol. 2017;41: 469–480. doi: 10.1002/gepi.22050
  • 5. Zhu X, Stephens M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann Appl Stat. 2017;11: 1561–1592. doi: 10.1214/17-aoas1046
  • 6. Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. 2018;50: 1219–1224. doi: 10.1038/s41588-018-0183-z
  • 7. Ge T, Chen C-Y, Ni Y, Feng Y-CA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun. 2019;10: 1776. doi: 10.1038/s41467-019-09718-5
  • 8. Choi SW, O’Reilly PF. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience. 2019;8: giz082. doi: 10.1093/gigascience/giz082
  • 9. Lloyd-Jones LR, Zeng J, Sidorenko J, Yengo L, Moser G, Kemper KE, et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nat Commun. 2019;10: 5086. doi: 10.1038/s41467-019-12653-0
  • 10. Qian J, Tanigawa Y, Du W, Aguirre M, Chang C, Tibshirani R, et al. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet. 2020;16: e1009141. doi: 10.1371/journal.pgen.1009141
  • 11. Li R, Chang C, Justesen JM, Tanigawa Y, Qiang J, Hastie T, et al. Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank. Biostatistics. 2020; kxaa038. doi: 10.1093/biostatistics/kxaa038
  • 12. Li R, Chang C, Tanigawa Y, Narasimhan B, Hastie T, Tibshirani R, et al. Fast Numerical Optimization for Genome Sequencing Data in Population Biobanks. Bioinformatics. 2021; btab452. doi: 10.1093/bioinformatics/btab452
  • 13. Privé F, Arbel J, Vilhjálmsson BJ. LDpred2: better, faster, stronger. Bioinformatics. 2020;36: 5424–5431. doi: 10.1093/bioinformatics/btaa1029
  • 14. Choi SW, Mak TS-H, O’Reilly PF. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 2020;15: 2759–2772. doi: 10.1038/s41596-020-0353-1
  • 15. Ojavee SE, Kousathanas A, Trejo Banos D, Orliac EJ, Patxot M, Läll K, et al. Genomic architecture and prediction of censored time-to-event phenotypes with a Bayesian genome-wide analysis. Nat Commun. 2021;12: 2337. doi: 10.1038/s41467-021-22538-w
  • 16. Wand H, Lambert SA, Tamburro C, Iacocca MA, O’Sullivan JW, Sillari C, et al. Improving reporting standards for polygenic scores in risk prediction studies. Nature. 2021;591: 211–219. doi: 10.1038/s41586-021-03243-6
  • 17. Lambert SA, Gil L, Jupp S, Ritchie SC, Xu Y, Buniello A, et al. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat Genet. 2021;53: 420–425. doi: 10.1038/s41588-021-00783-5
  • 18. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12: e1001779. doi: 10.1371/journal.pmed.1001779
  • 19. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562: 203–209. doi: 10.1038/s41586-018-0579-z
  • 20. McInnes G, Tanigawa Y, DeBoever C, Lavertu A, Olivieri JE, Aguirre M, et al. Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics. Bioinformatics. 2018;35: 2495–2497. doi: 10.1093/bioinformatics/bty999
  • 21. Venkataraman GR, Olivieri JE, DeBoever C, Tanigawa Y, Justesen JM, Dilthey A, et al. Pervasive additive and non-additive effects within the HLA region contribute to disease risk in the UK Biobank. bioRxiv. 2020. p. 2020.05.28.119669. doi: 10.1101/2020.05.28.119669
  • 22. Aguirre M, Rivas MA, Priest J. Phenome-wide Burden of Copy-Number Variation in the UK Biobank. Am J Hum Genet. 2019;105: 373–383. doi: 10.1016/j.ajhg.2019.07.001
  • 23. Sinnott-Armstrong N, Tanigawa Y, Amar D, Mars N, Benner C, Aguirre M, et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat Genet. 2021;53: 185–194. doi: 10.1038/s41588-020-00757-z
  • 24. Cragg JG, Uhler RS. The Demand for Automobiles. Can J Econ. 1970;3: 386–406. doi: 10.2307/133656
  • 25. Nagelkerke NJD. A note on a general definition of the coefficient of determination. Biometrika. 1991;78: 691–692. doi: 10.1093/biomet/78.3.691
  • 26. Tjur T. Coefficients of Determination in Logistic Regression Models—A New Proposal: The Coefficient of Discrimination. Am Stat. 2009;63: 366–372. doi: 10.1198/tast.2009.08210
  • 27. Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh P-R, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet. 2015;47: 1228–1235. doi: 10.1038/ng.3404
  • 28. Trynka G, Hunt KA, Bockett NA, Romanos J, Mistry V, Szperl A, et al. Dense genotyping identifies and localizes multiple common and rare variant association signals in celiac disease. Nat Genet. 2011;43: 1193–1201. doi: 10.1038/ng.998
  • 29. Chang JH, McCluskey PJ, Wakefield D. Acute anterior uveitis and HLA-B27. Surv Ophthalmol. 2005;50: 364–388. doi: 10.1016/j.survophthal.2005.04.003
  • 30. Qi J, Li Q, Lin Z, Liao Z, Wei Q, Cao S, et al. Higher risk of uveitis and dactylitis and older age of onset among ankylosing spondylitis patients with HLA-B*2705 than patients with HLA-B*2704 in the Chinese population. Tissue Antigens. 2013;82: 380–386. doi: 10.1111/tan.12254
  • 31. Yang J, Wray NR, Visscher PM. Comparing apples and oranges: equating the power of case-control and quantitative trait association studies. Genet Epidemiol. 2010;34: 254–257. doi: 10.1002/gepi.20456
  • 32. Nikpay M, Goel A, Won H-H, Hall LM, Willenborg C, Kanoni S, et al. A comprehensive 1,000 Genomes-based genome-wide association meta-analysis of coronary artery disease. Nat Genet. 2015;47: 1121–1130. doi: 10.1038/ng.3396
  • 33. Inouye M, Abraham G, Nelson CP, Wood AM, Sweeting MJ, Dudbridge F, et al. Genomic risk prediction of coronary artery disease in 480,000 adults: Implications for primary prevention. J Am Coll Cardiol. 2018;72: 1883–1893. doi: 10.1016/j.jacc.2018.07.079
  • 34. Mars N, Koskela JT, Ripatti P, Kiiskinen TTJ, Havulinna AS, Lindbohm JV, et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat Med. 2020;26: 549–557. doi: 10.1038/s41591-020-0800-0
  • 35. Lee SH, Goddard ME, Wray NR, Visscher PM. A better coefficient of determination for genetic profile analysis. Genet Epidemiol. 2012;36: 214–224. doi: 10.1002/gepi.21614
  • 36. Martin AR, Gignoux CR, Walters RK, Wojcik GL, Neale BM, Gravel S, et al. Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. Am J Hum Genet. 2017;100: 635–649. doi: 10.1016/j.ajhg.2017.03.004
  • 37. Kim MS, Patel KP, Teng AK, Berens AJ, Lachance J. Genetic disease risks can be misestimated across global populations. Genome Biol. 2018;19: 179. doi: 10.1186/s13059-018-1561-7
  • 38. Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019;51: 584–591. doi: 10.1038/s41588-019-0379-x
  • 39. Cohen J, Pertsemlidis A, Kotowski IK, Graham R, Garcia CK, Hobbs HH. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nat Genet. 2005;37: 161–165. doi: 10.1038/ng1509
  • 40. Cohen JC, Boerwinkle E, Mosley TH Jr, Hobbs HH. Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N Engl J Med. 2006;354: 1264–1272. doi: 10.1056/NEJMoa054013
  • 41. Rivas MA, Beaudoin M, Gardet A, Stevens C, Sharma Y, Zhang CK, et al. Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat Genet. 2011;43: 1066–1073. doi: 10.1038/ng.952
  • 42. Rivas MA, Pirinen M, Conrad DF, Lek M, Tsang EK, Karczewski KJ, et al. Human genomics. Effect of predicted protein-truncating genetic variants on the human transcriptome. Science. 2015;348: 666–669. doi: 10.1126/science.1261877
  • 43. Rivas MA, Graham D, Sulem P, Stevens C, Desch AN, Goyette P, et al. A protein-truncating R179X variant in RNF186 confers protection against ulcerative colitis. Nat Commun. 2016;7: 12342. doi: 10.1038/ncomms12342
  • 44. Narasimhan VM, Hunt KA, Mason D, Baker CL, Karczewski KJ, Barnes MR, et al. Health and population effects of rare gene knockouts in adult humans with related parents. Science. 2016;352: 474–477. doi: 10.1126/science.aac8624
  • 45. Saleheen D, Natarajan P, Armean IM, Zhao W, Rasheed A, Khetarpal SA, et al. Human knockouts and phenotypic analysis in a cohort with a high rate of consanguinity. Nature. 2017;544: 235–239. doi: 10.1038/nature22034
  • 46. DeBoever C, Tanigawa Y, Lindholm ME, McInnes G, Lavertu A, Ingelsson E, et al. Medical relevance of protein-truncating variants across 337,205 individuals in the UK Biobank study. Nat Commun. 2018;9: 1612. doi: 10.1038/s41467-018-03910-9
  • 47. Emdin CA, Khera AV, Chaffin M, Klarin D, Natarajan P, Aragam K, et al. Analysis of predicted loss-of-function variants in UK Biobank identifies variants protective for disease. Nature Communications. 2018. doi: 10.1038/s41467-018-03911-8
  • 48. Tanigawa Y, Wainberg M, Karjalainen J, Kiiskinen T, Venkataraman G, Lemmelä S, et al. Rare protein-altering variants in ANGPTL7 lower intraocular pressure and protect against glaucoma. PLoS Genet. 2020;16: e1008682. doi: 10.1371/journal.pgen.1008682
  • 49. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581: 434–443. doi: 10.1038/s41586-020-2308-7
  • 50. Lam BYH, Williamson A, Finer S, Day FR, Tadross JA, Gonçalves Soares A, et al. MC3R links nutritional state to childhood growth and the timing of puberty. Nature. 2021. doi: 10.1038/s41586-021-04088-9
  • 51. Backman JD, Li AH, Marcketta A, Sun D, Mbatchou J, Kessler MD, et al. Exome sequencing and analysis of 454,787 UK Biobank participants. Nature. 2021. doi: 10.1038/s41586-021-04103-z
  • 52. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46: D1062–D1067. doi: 10.1093/nar/gkx1153
  • 53. Chung W, Chen J, Turman C, Lindstrom S, Zhu Z, Loh P-R, et al. Efficient cross-trait penalized regression increases prediction accuracy in large cohorts using secondary phenotypes. Nat Commun. 2019;10: 569. doi: 10.1038/s41467-019-08535-0
  • 54. Richardson TG, Harrison S, Hemani G, Davey Smith G. An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome. Elife. 2019;8: e43657. doi: 10.7554/eLife.43657
  • 55. Tanigawa Y, Li J, Justesen JM, Horn H, Aguirre M, DeBoever C, et al. Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology. Nat Commun. 2019;10: 4064. doi: 10.1038/s41467-019-11953-9
  • 56. Aguirre M, Tanigawa Y, Venkataraman GR, Tibshirani R, Hastie T, Rivas MA. Polygenic risk modeling with latent trait-related genetic components. Eur J Hum Genet. 2021;29: 1071–1081. doi: 10.1038/s41431-021-00813-0
  • 57. Yates AD, Achuthan P, Akanni W, Allen J, Allen J, Alvarez-Jarreta J, et al. Ensembl 2020. Nucleic Acids Res. 2020;48: D682–D688. doi: 10.1093/nar/gkz966
  • 58. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17: 122. doi: 10.1186/s13059-016-0974-4
  • 59. Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SFA, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007;17: 1665–1674. doi: 10.1101/gr.6861907
  • 60. DeBoever C, Tanigawa Y, Aguirre M, McInnes G, Lavertu A, Rivas MA. Assessing Digital Phenotyping to Enhance Genetic Studies of Human Diseases. Am J Hum Genet. 2020;106: 611–622. doi: 10.1016/j.ajhg.2020.03.007
  • 61. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010;33: 1–22. doi: 10.18637/jss.v033.i01
  • 62. Olkin I, Finn JD. Correlations redux. Psychol Bull. 1995;118: 155–164. doi: 10.1037/0033-2909.118.1.155
  • 63. Cohen J, Cohen P, West SG, Aiken LS. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Routledge; 2013. https://play.google.com/store/books/details?id=fAnSOgbdFXIC
  • 64. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44: 837–845. doi: 10.2307/2531595
  • 65. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4: 7. doi: 10.1186/s13742-015-0047-8
  • 66. Mbatchou J, Barnard L, Backman J, Marcketta A, Kosmicki JA, Ziyatdinov A, et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat Genet. 2021;53: 1097–1103. doi: 10.1038/s41588-021-00870-7
  • 67. Kluyver T, Ragan-Kelley B, Pérez F, Granger B, Bussonnier M, Frederic J, et al. Jupyter Notebooks—a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas. IOS Press; 2016. pp. 87–90.
  • 68. R Core Team. R: A language and environment for statistical computing. 2019. https://www.R-project.org/
  • 69.Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, et al. Welcome to the tidyverse. J Open Source Softw. 2019;4: 1686. doi: 10.21105/joss.01686 [DOI] [Google Scholar]
  • 70.Tange O. GNU Parallel 2018. 2018. doi: 10.5281/zenodo.1146014 [DOI] [Google Scholar]

Decision Letter 0

Scott M Williams, Samuli Ripatti

12 Oct 2021

Dear Dr Tanigawa,

Thank you very much for submitting your Research Article entitled 'Significant Sparse Polygenic Risk Scores across 428 traits in UK Biobank' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you to address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Samuli Ripatti

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this paper, the authors have applied BASIL, a method previously developed by the authors, to 1600 traits in the UK Biobank. They have also provided a Global Biobank Engine which shows the predictive power of their PRS. However, given the current state of this paper, I cannot recommend it for publication, and here are my reasons:

1. Most of the methods were “described elsewhere”; reading the current paper alone, readers cannot tell exactly how the PRS were calculated. It is also unclear how the authors incorporated the HLA allelotype and CNV data, and how the penalty factors were applied in the BASIL model.

2. In addition, the authors did not provide any details of how they obtained the GWAS summary statistics required for PRS calculation (presumably from the 70% UK Biobank training data, using PLINK?).

3. Usually, we would include the UK Biobank assessment centre as a covariate in UK Biobank analyses to avoid systematic collection error. In addition, for blood biomarkers, we usually want to include fasting time, dilution factor and statin use, as those usually have a significant impact on the model fit.

4. Was the metric reported based on the test sets?

5. How did the authors use the PCs and self-reported ancestry to identify the sample population? K-means clustering on PC1 and PC2? Or did they calculate the Euclidean distance between each sample and the PC centroid of each self-reported cluster?

6. Given the small sample size of non-European samples, did the authors also split them into validation and test sets, or was the training all done on the validation sets? Based on page 10, lines 230-242, I am guessing that the GWAS was performed on the white British individuals, that parameter optimization / variable selection was also done on the white British individuals, and that the predictive performance was then evaluated in each of the populations? Or were the parameter optimization / variable selection also done in each of the populations?

7. Looking at the results shown in the Global Biobank Engine, there are many traits where the covariates have a predictive performance of 0 (assuming this is measured in R2). Specifically, for Lipoprotein A, the PRS performance is as high as 0.57 but the covariates have a performance of 0, which is hard to believe.

8. The correlation between the number of selected variables and the predictive performance sounds like an issue with power. This is similar to the self-contained test statistics in gene set studies, where including more information gives a higher chance of high predictive performance. The lack of correlation in binary traits might be due to rare variants that have large effects or to ascertainment of cases and controls. For example, only one variant was selected for Iritis. Also, if we look at the Global Biobank Engine, we can see a lot of duplicated traits that were assigned to different categories. For example, Lipoprotein A is listed under both biomarkers and blood assays. Were these duplicates removed from the correlation analyses? If not, then duplicated traits or highly correlated traits (e.g. hand grip strength left and right) might have inflated the correlation. Considering the trait definitions, it is also much easier to have duplication and correlation between quantitative traits than between binary traits.
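
The second option raised in point 5 (distance to the PC centroid of each self-reported group) can be illustrated with a minimal sketch. This is not the authors' procedure: the function names, the optional distance cutoff, and the toy two-dimensional PC coordinates are all hypothetical and serve only to make the question concrete.

```python
import math

def centroids(pcs, labels):
    """Mean PC vector per self-reported group (samples with label None are skipped)."""
    sums, counts = {}, {}
    for vec, lab in zip(pcs, labels):
        if lab is None:
            continue
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def assign_nearest(vec, cents, max_dist=None):
    """Return the group whose centroid is closest in Euclidean distance.

    If max_dist is given (a hypothetical cutoff), samples farther than
    max_dist from every centroid are left unassigned (None).
    """
    lab, d = min(((l, math.dist(vec, c)) for l, c in cents.items()),
                 key=lambda t: t[1])
    if max_dist is not None and d > max_dist:
        return None
    return lab

# Toy data: two self-reported groups "A" and "B", one unlabeled sample.
pcs = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0), (1.0, 1.0)]
labels = ["A", "A", "B", "B", None]
cents = centroids(pcs, labels)

near = assign_nearest((1.0, 1.0), cents)               # close to group A
far = assign_nearest((5.0, 5.0), cents, max_dist=2.0)  # too far from both
```

Unlike K-means, this approach keeps the group definitions tied to the self-reported labels; whether a cutoff is applied, and on how many PCs, would need to be stated by the authors.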

Reviewer #2: Here the authors investigate properties of PRS across 428 traits in the UK. The work is nicely conducted and explained.

The Introduction and Discussion need to better signpost what this study IS and what it is NOT about. For example, it is NOT promoting BASIL as the best PRS method, but rather it can be considered as a method that can be easily applied across many traits and could be useful across a range of genetic architectures without explicitly modelling genetic architectures. Also it is NOT proposing the best predictor for a trait because a) it only uses UKB data and not other GWAS data available for some traits, and b) relatives are excluded, whereas (although independence of discovery and test samples is important) the GWAS discovery could be more powerful if relatives were included (I am not saying you need to include the relatives for the purpose of this paper but rather that you should more clearly define its boundaries). The purpose of this study is more about considering the properties of PRS of many traits from the same data set and examining trends across the traits.

1. Line 40 Define “sparse PRS”, this may be unclear to some readers.

2. Lines 55-61 do not define how you made discovery, tuning and testing samples; although the info is in the methods, a brief summary is needed to interpret the results presented.

3. Figure 1 Axis labels too small – especially part D- lake plot – not an informative title; The entries of column 1 are not self-evident in terms of discovery/target, from lines 139-145 I see that other ancestries were not included in discovery sample, but not obvious from Fig 1 legend. Line 206, add to avoid ambiguity “The non-British white, African, South Asian and East Asian samples were only used as test sets”

4. I think “ethnic” is now regarded as cultural, and the preferred term in this context is “ancestry”

5. Figure 5 would benefit from “quotable” mean number stats for each ancestry

6. It would be of interest to have a plot of x-axis SNP-based heritability, y-axis increase in r2/AUC.

7. Given the differing sizes of test sets across ancestries it would be good to remind readers how this does/does not impact on interpretation of cross-ancestry comparisons.

Reviewer #3: In this paper Tanigawa et al. describe the systematic creation of polygenic risk scores (PRS) for > 1,600 traits using data from the UK Biobank (UKB). The construction and evaluation of the PRS is well-described, and the main result of the manuscript is a large resource of PRS built using a single method and a comprehensive web portal describing the results that will be useful for others looking to better understand the performance of each score and apply them to other cohorts. I do not have any major concerns about the manuscript; however, I think some of the unique features of the analysis should be better described and contextualised:

• The choice of variants (directly genotyped, imputed HLA, and CNVs) is quite different from classical PRS analyses, which usually employ the full set of imputed variants with MAF/INFO filtering. Does the performance improve if these imputed variants are included in the dataset? It is probably relevant to list the genotyping arrays employed, and adjust for the different arrays used in the performance evaluation.

• The prioritization of medically relevant variants (ClinVar pathogenic/likely-pathogenic, VEP predicted protein-truncating/altering variants) for non-zero effect weights in the PRS is also quite an interesting addition; however, I was surprised to see no quantitative analysis of its impact on PRS performance. I would also hypothesize that the weighting impacts the number of variants selected in the model (Figure 4)? Some comparison of the PRS performance and transferability with/without the variant prioritisation is necessary.

• Are there any obvious reasons that the correlation of predictiveness and number of variants changes? Is it dictated by differences in effect-size distributions or the MAF of selected variants?

Minor comments:

• A table with the age/sex/follow-up time/ancestry breakdown of the different training and test sets should be included. Were individuals included in the 70% training set consistent across all PRS being built?

• Description of how the p-value threshold for incremental predictiveness was selected should be provided.

• A major advantage of the BASIL/snpnet application in comparison to other PRS-derivation methods seems to be that it does not rely on LD reference panels which often limit the PRS derivation set to being a single-ancestry group. Given that the manuscript is somewhat focused on transferability of sparse PRS: would it be possible to derive new PRS using a random sample of the entire cohort (all ancestries) and evaluate how the multi-ancestry PRS compare to European-PRS at the whole population and single-ancestry level? [I realize this is beyond the scope of the current analysis but would be informative and may greatly improve the impact]

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Shing Wan Choi

Reviewer #2: No

Reviewer #3: No

Decision Letter 1

Scott M Williams, Samuli Ripatti

30 Nov 2021

Dear Dr Tanigawa,

Thank you very much for submitting your Research Article entitled 'Significant Sparse Polygenic Risk Scores across 813 traits in UK Biobank' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you to address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

Yours sincerely,

Samuli Ripatti

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: This manuscript has certainly been improved with the addition of more detailed descriptions of the methods and procedures involved. Thank you to the authors for making all these efforts.

Overall, I am still slightly confused as to what the main messages of the current paper are. I am also slightly concerned about some interpretations of the results.

1. While the authors have now included many of the needed details regarding the procedures and methods, some critical information is still missing. For example, quantitative traits were calculated as the “median of non-NA values, as describe elsewhere”; does that mean that the authors took the measurements across multiple assessment timepoints and took the median of those? Did the authors perform any quality control on the phenotypes to remove outliers?

2. In a similar vein, for the blood and urine biomarkers, the covariate-adjusted phenotypes were calculated using the log-transformed phenotypic values, and the incremental predictive performance was calculated against the predictive value based on the original measurements. Were the original measurements also log transformed? Or were the untransformed values used? If it is the latter, wouldn’t that introduce some bias? In addition, it is not uncommon to have a blood or urine biomarker measurement of 0. In those scenarios, log transformation leads to an undefined value. How was that accounted for?

3. For the SNP-heritability estimates, the authors performed GWAS on the quantile-normalized phenotypes. Were the phenotypes also log transformed? It is difficult to assess the relationship between the PRS performance and SNP-heritability if they were computed on phenotypes that have undergone different transformations. Also, was the quantile normalization done on both quantitative traits and binary traits?

4. It is odd to have PRS that report a higher predictive performance than the SNP-heritability, as the SNP-heritability is the theoretical upper bound of PRS performance. It would be helpful if the authors could provide an explanation as to why the PRS performance is higher than the SNP-heritability (possibly due to different phenotypic transformations, or because the PRS include information that was excluded from the SNP-heritability estimate?). Standard errors of the predictions should ideally also be reported to provide a better understanding of the power.

5. Based on how this paper is structured, it seems that the main message is that there is a significant positive correlation between the number of active variables in the PRS model and the incremental predictive performance in quantitative traits but not in binary traits, and that this is “highlighting the presence of diverse genetic architecture across disease outcomes.” However, because the population prevalence of the binary traits is usually not known, and because the UK Biobank is a prospective cohort where the case numbers might not reflect the true population prevalence, the prediction performance of the binary traits and their SNP-heritability estimates will likely be biased by ascertainment. In addition, in the main analysis, the authors “used the same split of training, validation and test set for all tested traits.”, which means that the case-control ratio for the binary traits likely differs between the different sets of samples, leading to a greater disparity in performance. Considering the lower heritability of binary traits (mean = 0.04 for binary traits, mean = 0.23 for quantitative traits, based on the provided supplementary data), the reporting on the observed instead of the liability scale, and the different levels of ascertainment bias, it is not surprising that the correlation between the number of active variables in the PRS model and the incremental predictive performance in binary traits is not significant. It might therefore be slightly misleading to conclude that the lack of correlation in binary traits, in contrast to its presence in quantitative traits, is a result of “the presence of diverse genetic architecture”.

6. Similar to the above comment, the case-control ratio in different populations might also differ, which was not accounted for here.

Other minor comments:

1. On line 197, line 249 and line 532, a different style of citation seems to be used (ref:[#] instead of [#]).

2. For Figure 4 top right, are the ranges inclusive or exclusive? E.g. will a sample at the 10th percentile be grouped in [0-10%] or [10-20%]? Also, for multi-panel plots, it might be easier if the individual sub-plots were also labeled (e.g. 4a, 4b, 4c).

Reviewer #2: The authors have addressed my comments, but the revision has introduced some strong statements in the discussion which I believe are scale and power dependent. Therefore, I have additional comments.

New Figure 2A. For binary traits, estimates of SNP-based heritability depend on the proportion of the GWAS discovery sample that are cases, and pseudo-R2 depends on the proportion of the target sample that are cases. Although this requires a user-specified lifetime risk, it would make more sense for these axes to be on the liability scale (even if the lifetime risk used is the proportion of cases in the sample, since all traits are in UKB), since then both axes are on the same scale and comparisons across traits are more valid.
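
For readers unfamiliar with the liability-scale conversion mentioned here, the following is a minimal sketch of a commonly used transformation of an observed-scale (0/1) heritability estimate, assuming a normally distributed liability with a threshold. The function name and the example numbers are hypothetical, and this is not necessarily the calculation the authors would use; note also that converting a prediction R2 (rather than a heritability estimate) to the liability scale uses a related but not identical correction.

```python
from statistics import NormalDist

def observed_to_liability(h2_obs, K, P):
    """Convert an observed-scale heritability estimate to the liability scale.

    h2_obs: estimate on the observed (0/1) scale
    K:      assumed population prevalence (lifetime risk), user-specified
    P:      proportion of cases in the sample
    """
    if not (0.0 < K < 1.0 and 0.0 < P < 1.0):
        raise ValueError("K and P must lie strictly between 0 and 1")
    std = NormalDist()
    t = std.inv_cdf(1.0 - K)  # liability threshold for prevalence K
    z = std.pdf(t)            # standard normal density at the threshold
    return h2_obs * (K * (1.0 - K)) ** 2 / (z ** 2 * P * (1.0 - P))

# Hypothetical example: the same observed-scale value maps to different
# liability-scale values depending on the assumed prevalence.
common = observed_to_liability(0.10, K=0.5, P=0.5)
rare = observed_to_liability(0.10, K=0.01, P=0.01)
```

When the sample case proportion is taken as the lifetime risk (P = K), as the reviewer suggests is acceptable here, the factor reduces to K(1 - K) / z^2.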

Figure 5A and Figure 6 LHS use “incremental AUC”. AUC has the nice property that it doesn’t depend on the proportion of cases in the sample, but other than that it has very non-linear properties with respect to quantitative genetic metrics of polygenic traits such as heritability. For example, while a linear relationship might be expected in incremental R2 for quantitative traits (Figure 6 bottom left quadrant), I wouldn’t expect a linear relationship in incremental AUC. This may impact the conclusion on line 331, “we found a significant correlation across quantitative traits but not within binary traits”. I suggest that R2 on the liability scale is used for these analyses.

The point being made here, “While the underlying genetic architecture of binary traits may span the gamut of a wide variety of polygenicity, that of highly heritable quantitative traits may not be compatible with monogenic inheritance as illustrated in the wide adoption of Fisher’s infinitesimal model”, is a very broad statement not really relevant to the study; I suggest deleting it. Moreover, expressions of genes are quantitative traits that likely span the gamut of genetic architectures.
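
The stated property of AUC can be made concrete with a small sketch. It computes AUC directly as the probability that a randomly chosen case scores higher than a randomly chosen control (ties counted as one half), and then changes only the case proportion by replicating the controls. The scores are made-up toy values, not data from the manuscript.

```python
def auc(case_scores, control_scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    Ties contribute 0.5. This is the pairwise (Mann-Whitney) definition,
    written as a plain double loop for clarity rather than speed.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Toy scores (hypothetical).
cases = [0.9, 0.7, 0.4]
controls = [0.8, 0.5, 0.3, 0.2]

balanced = auc(cases, controls)         # 3 cases vs 4 controls
imbalanced = auc(cases, controls * 25)  # 3 cases vs 100 controls, same score distribution
```

Both values are identical because the within-group score distributions are unchanged; only the sampling variance of the estimate, not its expectation, is affected by the case proportion.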

I am concerned about the new conclusions that contrast binary traits with quantitative traits with only a nod to differences in power. It is intuitive that for the same N (i.e. UKB sample size), as the proportion of cases tends to zero, the power of the sample for detection of association is reduced. I think Yang et al (2009) equation 3 could help quantify expectations (doi:10.1002/gepi.20456).

Supp Table 6 seems to have a column missing: across the labels in columns A-D there are 3 sets of results. Model column? I have never seen TjurR2 presented before in this context. It is presented together with NagelkerkeR2. There is no justification as to why TjurR2 should be presented. Both, I believe, are dependent on the proportion of cases in the sample. Some of the AUC values seem implausibly high given the R2? Check?
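
The intuition about power can be sketched with the widely used effective sample size heuristic for case-control designs, 4 * n_cases * n_controls / (n_cases + n_controls). This is only an illustration of the trend and is not the equation from the cited paper; the total sample size below is an arbitrary round number, not the UK Biobank sample size.

```python
def effective_sample_size(n_cases, n_controls):
    """Effective sample size of a case-control sample.

    Equals the total N for a balanced design and shrinks toward zero
    as the case proportion tends to zero (or one).
    """
    if n_cases <= 0 or n_controls <= 0:
        raise ValueError("need at least one case and one control")
    return 4.0 * n_cases * n_controls / (n_cases + n_controls)

N = 100_000  # hypothetical total sample size, held fixed
n_eff = {}
for case_fraction in (0.5, 0.1, 0.01, 0.001):
    n_cases = int(N * case_fraction)
    n_eff[case_fraction] = effective_sample_size(n_cases, N - n_cases)
```

Under this heuristic, a trait with 1% cases in a fixed cohort carries roughly 4% of the effective sample size of a balanced design, which is the kind of difference in power the comment asks the authors to account for.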

Reviewer #3: The additional analyses and explanations in this revision result in a much improved manuscript describing the phenome-wide application of BASIL to derive PGS in UKB. The authors have addressed all my concerns (especially with respect to the description of variant-penalties), the analyses are technically sound and well described.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Shing Wan Choi

Reviewer #2: No

Reviewer #3: No

Decision Letter 2

Scott M Williams, Samuli Ripatti

15 Feb 2022

Dear Dr Tanigawa,

We are pleased to inform you that your manuscript entitled "Significant Sparse Polygenic Risk Scores across 813 traits in UK Biobank" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Samuli Ripatti

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Human Variation

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Please address the one remaining request from the reviewer.

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: With the latest update, the authors have addressed most of my concerns. Thank you for the hard work.

Reviewer #2: Thank you for addressing the comments.

I understand your choices to report SNP-based heritability on the observed scale and Nagelkerke's R2. Please add a sentence, including in the figure legends, to remind readers that both of these metrics depend on the proportion of cases in the samples (discovery and target, respectively).

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Shing Wan Choi

Reviewer #2: No

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-21-01210R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

Scott M Williams, Samuli Ripatti

28 Feb 2022

PGENETICS-D-21-01210R2

Significant Sparse Polygenic Risk Scores across 813 traits in UK Biobank

Dear Dr Tanigawa,

We are pleased to inform you that your manuscript entitled "Significant Sparse Polygenic Risk Scores across 813 traits in UK Biobank" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Statistical significance of the assessment center terms in phenotype prediction.

    We fit a regression model on age, sex, the types of genotyping arrays, polygenic risk score, and assessment centers for each of the 1,565 traits analyzed in the study. The frequency of the statistical significance (-log10(P)) of the assessment center variables is shown, with the cumulative frequency on the secondary axis on the right. The significance threshold after the Bonferroni correction is shown as a red vertical line.

    (TIF)

    S2 Fig. The impact of prioritizing the medically relevant alleles with penalty factors on the predictive performance of snpnet PRS models.

    The predictive performance (AUC for binary traits and R2 for quantitative traits), evaluated across hold-out test set individuals of different ancestry groups in UK Biobank, is shown for four traits. The error bars represent the 95% confidence interval.

    (TIF)

    S3 Fig. The impact of the imputed genetic variants on the predictive performance of snpnet PRS models.

    The predictive performance (AUC for binary traits and R2 for quantitative traits), evaluated across hold-out test set individuals of different ancestry groups in UK Biobank, is shown for four traits. The error bars represent the 95% confidence interval.

    (TIF)

    S1 Table. List of traits analyzed in the study and the predictive performance of the corresponding PRS models.

    For the 1,565 traits analyzed in the study, the following information is shown: trait category, the phenotype ID in Global Biobank Engine (GBE ID), trait name, the type of link function in a generalized linear model (Gaussian for quantitative traits and Binomial for binary traits), the predictive performance of the genotype-only model, the covariate-only model, and the full model that considers both genotype and covariates, as well as the incremental predictive performance (Delta[Full, covariates-only]), the number of genetic variants included in the PRS model, the statistical significance of the incremental predictive performance in a hold-out test set consisting of a subset of white British individuals in the UK Biobank, whether the p-value is significant after multiple-hypothesis correction (p < 2.5 x 10−5), the score ID in the polygenic score (PGS) catalog, the experimental factor ontology term ID of the mapped traits in the PGS catalog, and the label of the mapped traits in the PGS catalog.

    (XLSX)

    S2 Table. The cohort characteristics.

    For each ancestry group in UK Biobank, the number of individuals (n), age (mean and standard deviation [sd]), sex (percentage of male individuals), and the fraction of individuals genotyped on the UK Biobank Axiom Array are shown. The statistics for the white British ancestry group are shown separately for the 70% training set, 10% validation set, and 20% test set.

    (XLSX)

    S3 Table. The number of variants with non-zero BETAs is shown across four traits.

    For each trait, we compared two models: without and with penalty factors to prioritize the medically relevant alleles.

    (XLSX)

    S4 Table. The variant consequence grouping.

    We grouped the Ensembl’s variant effect predictor (VEP)-predicted consequence of the genetic variants into six groups (Consequence group): protein-truncating variants (PTVs), protein-altering variants (PAVs), protein-coding variants (PCVs), intronic variants (Intronic), variants in untranslated region (UTR), and other non-coding variants (Others). The links to the sequence ontology (SO) term detailing the definition of each of the predicted consequences are shown.

    (XLSX)

    S5 Table. The penalty factor assignment rule.

    We used the VEP-predicted consequence and ClinVar annotation to prioritize protein-truncating, protein-altering, and (likely) pathogenic variants by assigning lower penalty factor values. The penalty factor and the number of variants, stratified by the type of genetic variant (genotype or allelotype), predicted consequence, and ClinVar annotation, are shown.

    (XLSX)

    S6 Table. The predictive performance of PRS models.

    For each trait (Trait category, GBE_ID, and Trait Name), we show the types of link functions in a generalized linear model (GLM family column, Gaussian for quantitative traits and binomial for binary traits), the population split (population), the types of the predictive model (model column), the types of evaluation metric (R2 [R2], Nagelkerke’s pseudo-R2 [NagelkerkeR2], AUROC [AUC], or Tjur’s Coefficient of Discrimination [TjurR2]), the value of the specified metric and its lower and upper bound of 95% confidence interval, and the statistical significance (p-value).

    (XLSX)
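As a rough illustration of two of the binary-trait metrics named in S6 Table, the sketch below computes Nagelkerke's pseudo-R2 and Tjur's coefficient of discrimination from observed case/control labels and predicted probabilities. It uses the standard textbook definitions and an intercept-only null model; the function names and the toy data are hypothetical, and this is not the code used in the study. Because the null log-likelihood is a function of the case proportion, the sketch also makes visible why Nagelkerke's R2 depends on the proportion of cases in the evaluation sample.

```python
import math


def bernoulli_loglik(y, p):
    """Log-likelihood of 0/1 labels y under predicted probabilities p.

    Assumes every probability lies strictly between 0 and 1.
    """
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def nagelkerke_r2(y, p):
    """Nagelkerke's pseudo-R2 relative to an intercept-only null model.

    Assumption: the null model predicts the sample case proportion for
    everyone, so the sample must contain both cases and controls.
    """
    n = len(y)
    case_proportion = sum(y) / n
    ll_null = bernoulli_loglik(y, [case_proportion] * n)
    ll_model = bernoulli_loglik(y, p)
    # Cox-Snell R2 and its maximum attainable value for this sample.
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    cox_snell_max = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / cox_snell_max


def tjur_r2(y, p):
    """Tjur's coefficient of discrimination: mean predicted probability
    among cases minus mean predicted probability among controls."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


# Toy data (hypothetical): two cases and two controls.
labels = [1, 1, 0, 0]
predicted = [0.8, 0.7, 0.3, 0.2]
print(nagelkerke_r2(labels, predicted))
print(tjur_r2(labels, predicted))
```

A model that predicts the sample case proportion for every individual scores zero on both metrics, which gives a quick sanity check when adapting the sketch.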

    S7 Table. Estimated SNP-based heritability.

    For each trait with a significant PRS model (trait, trait_name, and trait_category), we show the types of link functions in a generalized linear model (family column, Gaussian for quantitative traits and binomial for binary traits), estimated SNP-based observed scale heritability with standard error (h2_obs and h2_obs_se), lambda GC (lambda_GC), mean chi-square statistic (mean_chi2), LD score regression intercept and its standard error (intercept and intercept_se), and the proportion of the inflation attributed to the LD score regression intercept, defined by (intercept -1)/(mean(chi-square)-1), and its standard error (ratio and ratio_se).

    (XLSX)
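The ratio column in S7 Table is defined in the caption as (intercept - 1)/(mean(chi-square) - 1). A minimal sketch of that arithmetic follows; the function name and the example values are hypothetical and are not taken from the table.

```python
def ldsc_ratio(intercept: float, mean_chi2: float) -> float:
    """Proportion of test-statistic inflation attributed to the LD score
    regression intercept: (intercept - 1) / (mean chi-square - 1).

    The ratio is undefined when the mean chi-square equals 1, and it is
    not interpretable when the mean chi-square is below 1, so both cases
    are rejected here (a choice made for this sketch).
    """
    if mean_chi2 <= 1.0:
        raise ValueError("mean chi-square must exceed 1 for the ratio to be meaningful")
    return (intercept - 1.0) / (mean_chi2 - 1.0)


# Hypothetical values: intercept 1.05 and mean chi-square 1.5.
print(ldsc_ratio(1.05, 1.5))
```

With these illustrative inputs, about a tenth of the inflation above 1 would be attributed to the intercept rather than to polygenic signal.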

    Attachment

    Submitted filename: PRSmap 20211109 1140 response to reviewers.pdf

    Attachment

    Submitted filename: 2022.01.27 10.34 PRSmap response to reviewers (2022 Jan).pdf

    Data Availability Statement

    The sparse PRS model weights generated from this study are available on the Global Biobank Engine (https://biobankengine.stanford.edu/prs). The significant PRS models are also available at the PGS catalog (https://www.pgscatalog.org/publication/PGP000244/ and https://www.pgscatalog.org/publication/PGP000128/, score IDs are listed in S1 Table). The BASIL algorithm implemented in the R snpnet package was used in the PRS analysis, which is available at https://github.com/rivas-lab/snpnet. The analyses presented in this study were based on the individual-level data accessed through the UK Biobank: https://www.ukbiobank.ac.uk.


    Articles from PLoS Genetics are provided here courtesy of PLOS
