Winner's Curse Correction and Variable Thresholding Improve Performance of Polygenic Risk Modeling Based on Genome-Wide Association Study Summary-Level Data

Jianxin Shi; Ju-Hyun Park; Jubao Duan; Sonja T Berndt; Winton Moy; Kai Yu; Lei Song; William Wheeler; Xing Hua; Debra Silverman; Montserrat Garcia-Closas; Chao Agnes Hsiung; Jonine D Figueroa; Victoria K Cortessis; Núria Malats; Margaret R Karagas; Paolo Vineis; I-Shou Chang; Dongxin Lin; Baosen Zhou; Adeline Seow; Keitaro Matsuo; Yun-Chul Hong; Neil E Caporaso; Brian Wolpin; Eric Jacobs; Gloria M Petersen; Alison P Klein; Donghui Li; Harvey Risch; Alan R Sanders; Li Hsu; Robert E Schoen; Hermann Brenner; MGS (Molecular Genetics of Schizophrenia) GWAS Consortium; GECCO (The Genetics and Epidemiology of Colorectal Cancer Consortium); The GAME-ON/TRICL (Transdisciplinary Research in Cancer of the Lung) GWAS Consortium; PRACTICAL (PRostate cancer AssoCiation group To Investigate Cancer Associated aLterations) Consortium; PanScan Consortium; The GAME-ON/ELLIPSE Consortium; Rachael Stolzenberg-Solomon; Pablo Gejman; Qing Lan; Nathaniel Rothman; Laufey T Amundadottir; Maria Teresa Landi; Douglas F Levinson; Stephen J Chanock; Nilanjan Chatterjee

PLoS Genet. 2016 Dec 30;12(12):e1006493. doi: 10.1371/journal.pgen.1006493

Jianxin Shi ¹,*, Ju-Hyun Park ², Jubao Duan ³, Sonja T Berndt ¹, Winton Moy ⁴, Kai Yu ¹, Lei Song ¹, William Wheeler ⁵, Xing Hua ¹, Debra Silverman ¹, Montserrat Garcia-Closas ¹, Chao Agnes Hsiung ⁶, Jonine D Figueroa ¹,⁷, Victoria K Cortessis ⁸,⁹, Núria Malats ¹⁰, Margaret R Karagas ¹¹, Paolo Vineis ¹²,¹³, I-Shou Chang ¹⁴, Dongxin Lin ¹⁵,¹⁶, Baosen Zhou ¹⁷, Adeline Seow ¹⁸, Keitaro Matsuo ¹⁹, Yun-Chul Hong ²⁰, Neil E Caporaso ¹, Brian Wolpin ²¹,²², Eric Jacobs ²³, Gloria M Petersen ²⁴, Alison P Klein ²⁵,²⁶, Donghui Li ²⁷, Harvey Risch ²⁸, Alan R Sanders ³, Li Hsu ²⁹, Robert E Schoen ³⁰, Hermann Brenner ³¹,³²,³³; MGS (Molecular Genetics of Schizophrenia) GWAS Consortium ¶; GECCO (The Genetics and Epidemiology of Colorectal Cancer Consortium) ¶; The GAME-ON/TRICL (Transdisciplinary Research in Cancer of the Lung) GWAS Consortium ¶; PRACTICAL (PRostate cancer AssoCiation group To Investigate Cancer Associated aLterations) Consortium ¶; PanScan Consortium ¶; The GAME-ON/ELLIPSE Consortium ¶; Rachael Stolzenberg-Solomon ¹, Pablo Gejman ³, Qing Lan ¹, Nathaniel Rothman ¹, Laufey T Amundadottir ¹, Maria Teresa Landi ¹, Douglas F Levinson ³⁴, Stephen J Chanock ¹, Nilanjan Chatterjee ¹,³⁵,³⁶,*

Editor: Samuli Ripatti³⁷

¹Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States of America

²Department of Statistics, Dongguk University, Seoul, Korea

³Center for Psychiatric Genetics, Department of Psychiatry and Behavioral Sciences, North Shore University Health System Research Institute, University of Chicago Pritzker School of Medicine, Evanston, Illinois, United States of America

⁴Dept. of Statistics, Northern Illinois University, DeKalb, Illinois, United States of America

⁵Information Management Services, Inc., Rockville, Maryland, United States of America

⁶Institute of Population Health Sciences, National Health Research Institutes, Miaoli, Taiwan

⁷Usher Institute of Population Health Sciences and Informatics, The University of Edinburgh, Medical School, Edinburgh, United Kingdom

⁸Department of Preventive Medicine and Department of Obstetrics and Gynecology, USC Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America

⁹Norris Comprehensive Cancer Center, USC Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America

¹⁰Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain

¹¹Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire, United States of America

¹²Human Genetics Foundation, Turin, Italy

¹³MRC-PHE Centre for Environment and Health, School of Public Health, Imperial College London, London, United Kingdom

¹⁴National Institute of Cancer Research, National Health Research Institutes, Zhunan, Taiwan

¹⁵Department of Etiology & Carcinogenesis, Cancer Institute and Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

¹⁶State Key Laboratory of Molecular Oncology, Cancer Institute and Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China

¹⁷Department of Epidemiology, School of Public Health, China Medical University, Shenyang, China

¹⁸Saw Swee Hock School of Public Health, National University of Singapore, Singapore

¹⁹Division of Molecular Medicine, Aichi Cancer Center Research Institute, Chikusa-ku, Nagoya, Japan

²⁰Department of Preventive Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea

²¹Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America

²²Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, United States of America

²³Epidemiology Research Program, American Cancer Society, Atlanta, Georgia, United States of America

²⁴Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, United States of America

²⁵Department of Oncology, the Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America

²⁶Department of Epidemiology, the Bloomberg School of Public Health, Baltimore, Maryland, United States of America

²⁷Department of Gastrointestinal Medical Oncology, University of Texas M.D. Anderson Cancer Center, Houston, Texas, United States of America

²⁸Department of Chronic Disease Epidemiology, Yale School of Public Health, New Haven, Connecticut, United States of America

²⁹Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America

³⁰Department of Medicine and Epidemiology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania, United States of America

³¹Division of Clinical Epidemiology and Aging Research, German Cancer Research Center (DKFZ), Heidelberg, Germany

³²Division of Preventive Oncology, German Cancer Research Center (DKFZ) and National Center for Tumor Diseases (NCT), Heidelberg, Germany

³³German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany

³⁴Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, California, United States of America

³⁵Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, United States of America

³⁶Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America

³⁷Institute for Molecular Medicine Finland (FIMM), Finland

The authors have declared that no competing interests exist.

Conceptualization: JS NC.
Formal analysis: JS JHP NC.
Methodology: JS JHP NC.
Resources: JD STB DS MGC CAH JDF VKC NM MRK PV ISC DLin BZ AS KM YCH NEC BW EJ GMP APK DLi HR ARS LH RES HB RSS PG QL NR LTA MTL DFL SJC.
Writing – original draft: JS NC.
Writing – review & editing: JS JHP JD SB WM KY LS WW XH DS MGC CAH JDF VKC NM MRK PV ISC DLin BZ AS KM YCH NEC BW EJ GMP APK DLi HR ARS LH RES HB RSS PG QL NR LTA MTL DFL SJC NC.

¶ Members of MGS Consortium, GECCO Consortium, GAME-ON/TRICL Consortium, PRACTICAL Consortium, PanScan Consortium and GAME-ON/ELLIPSE are provided in S1 Text.

* E-mail: Jianxin.Shi@nih.gov (JS); nilanjan@jhu.edu (NC)

PMCID: PMC5201242 PMID: 28036406

Abstract

Recent heritability analyses have indicated that genome-wide association studies (GWAS) have the potential to improve genetic risk prediction for complex diseases based on the polygenic risk score (PRS), a simple modelling technique that can be implemented using summary-level data from the discovery samples. We herein propose modifications to improve the performance of PRS. We introduce threshold-dependent winner’s-curse adjustments for marginal association coefficients that are used to weight the single-nucleotide polymorphisms (SNPs) in PRS. Further, as a way to incorporate external functional/annotation knowledge that could identify subsets of SNPs highly enriched for associations, we propose variable thresholds for SNP selection. We applied our methods to GWAS summary-level data of 14 complex diseases. Across all diseases, a simple winner’s curse correction uniformly led to enhancement of performance of the models, whereas incorporation of functional SNPs was beneficial only for selected diseases. Compared to the standard PRS algorithm, the proposed methods in combination led to a notable gain in efficiency (25–50% increase in the prediction R²) for 5 of 14 diseases. As an example, for GWAS of type 2 diabetes, winner’s curse correction improved prediction R² from 2.29% based on the standard PRS to 3.10% (P = 0.0017) and incorporating functional annotation data further improved R² to 3.53% (P = 2×10⁻⁵). Our simulation studies illustrate why differential treatment of certain categories of functional SNPs, even when shown to be highly enriched for GWAS-heritability, does not lead to proportionate improvement in genetic risk-prediction because of non-uniform linkage disequilibrium structure.

Author Summary

Large GWAS have identified tens or even hundreds of common SNPs significantly associated with individual complex diseases; however, these SNPs typically explain a small proportion of phenotypic variance. Recently, heritability analyses based on GWAS data have suggested that common SNPs have the potential to explain a substantially larger fraction of phenotypic variance and to improve genetic risk prediction. Because of the polygenic nature of complex diseases, improving genetic risk prediction typically requires substantially increasing the sample size of the discovery set. Thus, it is crucial to develop more efficient algorithms using existing GWAS summary data. In this article, we extend the polygenic risk score (PRS) method by adjusting the marginal effect size of SNPs for winner’s curse and by incorporating external functional annotation data. Theoretical analysis and simulation studies show that the performance improvement depends on the genetic architecture of the trait, the sample size of the discovery set, the degree of enrichment of association for SNPs annotated as “high-prior”, and the linkage disequilibrium patterns of these SNPs. We applied our method to the summary data of 14 GWAS. Our method achieved a 25–50% gain in efficiency (measured in the prediction R²) for 5 of 14 diseases compared to the standard PRS.

Introduction

Large genome-wide association studies (GWAS) have accelerated the discovery of dozens or even hundreds of common single nucleotide polymorphisms (SNPs) associated with individual complex traits and diseases, such as height [1, 2], body mass index [3] and common cancers (e.g., breast [4] and prostate [5] cancers). Although individual SNPs typically have small effects, cumulative results have provided insight into underlying biologic pathways and, for some common diseases like breast cancer, have yielded levels of risk-stratification that could be useful as part of prevention efforts [6]. Analyses of GWAS heritability using algorithms such as GCTA [7, 8] have shown that common SNPs have the potential to explain a substantially larger fraction of the variation of many traits.

The future yield of GWAS, for both discovery and prediction, depends heavily on the underlying effect-size distribution (ESD) of susceptibility SNPs [9, 10]. A number of alternative types of analyses of ESD now point towards a polygenic architecture for most complex traits, in which thousands or even tens of thousands of common SNPs, each with a small effect size, together can explain a substantial fraction of heritability [11, 12]. Mathematical analyses of power indicate that, because of the polygenic nature of complex traits, future studies will need large sample sizes, often an order of magnitude larger than even some of the largest studies to date, to improve the accuracy of genetic risk-prediction [10, 11]. Nevertheless, for current datasets, there remains an opportunity to develop more efficient algorithms for improving the models [13].

Available algorithms for polygenic risk score (PRS) prediction models have varying degrees of complexity. The simplest of these methods, widely implemented in large GWAS, selects SNPs based on a threshold for the significance of the marginal association test statistics and then weights the selected SNPs by their estimated marginal strength of association [14]. The threshold for SNP selection can be optimized to improve the predictive performance in an independent validation dataset. For a number of traits with large GWAS sample sizes, it has been shown that an optimally selected threshold can improve risk prediction compared to that based on the genome-wide significance threshold used for discovery [15]. A number of newer methods involving the joint analysis of all SNPs using sophisticated mixed-effect modeling techniques have recently been developed and may lead to further increases in model performance [16–18].

In this report, we propose simple modifications to the widely used PRS modeling techniques using only GWAS summary-level data. First, drawing from the lasso [19] algorithm, we propose a simple threshold-dependent winner’s curse adjustment for marginal association coefficients that can be used to weight the SNPs in PRS. Second, to exploit external functional knowledge that might identify subsets of SNPs highly enriched for association signals, we consider using multiple thresholds for SNP selection based on group membership and identify an optimal set of thresholds through an independent validation dataset. We demonstrate the value of our new methods using summary-level results from large GWAS across a spectrum of traits, some with available independent validation datasets to assess the performance of these methods. Available resources, such as annotation databases and expression and methylation quantitative trait locus (QTL) analyses, were employed to identify groups of SNPs that are likely to be enriched for association with the trait of interest. We evaluated the utility of this information for risk-prediction for the respective outcomes. We also report on the performance of the new algorithms using simulation studies that incorporate realistic genetic architecture, linkage disequilibrium patterns and enrichment factors for underlying functional SNPs.

Results

Overview of statistical approach

Let Z_m, P_m, $\hat{β}_{m}$, and $\hat{σ}_{m}$ (m = 1, …, T) denote the Z-statistics, the two-sided P-values, the estimated association coefficients and their standard errors available as part of summary-level results for T SNPs from a GWAS. We assume that each genotypic value is normalized to have mean zero and unit variance and that $\hat{β}_{m}$ is rescaled to correspond to the normalized genotypic values. We assume that M SNPs are selected after LD-clumping, a SNP pruning procedure guided by the association P-values [20]. Let g_im be the genotype of SNP m for subject i. The simplest and most popular form of the PRS is

$$\mathrm{PRS}_{i}(α) = \sum_{m=1}^{M} \hat{β}_{m} \, I(P_{m} < α) \, g_{im}, \tag{1}$$

where the threshold α for the P-values can be chosen to optimize the predictive performance of PRS in an independent validation dataset. Here, I(⋅) is an indicator function. Because PRS_i(α) uses a single threshold to select SNPs, we refer to it as one-dimensional PRS or 1D PRS. In what follows, we extend PRS_i(α) by incorporating annotation data and correcting for the upward bias in $\hat{β}_{m}$ caused by winner’s curse.
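As a minimal sketch of Eq (1), the snippet below computes two-sided P-values from the coefficient and its standard error and then sums the weighted genotypes of the SNPs that pass the threshold. The function names, the toy summary statistics and the single subject's genotypes are all hypothetical, and the SNPs are assumed to be already LD-clumped and standardized.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # N(0, 1)


def two_sided_p(beta_hat, se):
    """Two-sided P-value of the Z-statistic beta_hat / se."""
    z = beta_hat / se
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def prs_1d(genotypes, beta_hat, p_values, alpha):
    """Eq (1): sum of beta_hat[m] * g[m] over SNPs with P_m < alpha.

    genotypes holds one subject's standardized genotypes (length M).
    """
    return sum(b * g for g, b, p in zip(genotypes, beta_hat, p_values) if p < alpha)


# Hypothetical summary statistics for M = 4 clumped SNPs.
beta_hat = [0.10, -0.05, 0.02, 0.08]
se = [0.02, 0.02, 0.02, 0.02]
p_values = [two_sided_p(b, s) for b, s in zip(beta_hat, se)]

# Hypothetical standardized genotypes of one subject.
subject = [1.2, -0.3, 0.5, -1.0]
score = prs_1d(subject, beta_hat, p_values, alpha=0.05)
```

With α = 0.05 the third SNP (Z = 1) is excluded, so only three SNPs contribute to `score`. In practice α would be tuned on a validation set rather than fixed in advance.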

2D PRS

Information from various functional studies, annotation databases and GWAS of various traits is increasingly available to allow identification of subsets of SNPs that can be considered to have a high prior probability of association with a given trait. Various types of enrichment analyses, whether based on the distribution of summary-level statistics [21] or on more advanced heritability-partitioning analyses [22, 23], have shown empirical evidence of strong enrichment of GWAS association signals for different categories of SNPs which represent only a relatively small fraction of all GWAS SNPs. However, very few systematic studies have examined whether and how such enrichment information can be utilized to improve models for genetic risk prediction. We consider a simple modification to PRS to explore this issue. We assume that the set of M SNPs can be partitioned into two mutually exclusive groups, S₁ and S₂, where S₁ is a relatively small subset of “high-prior” SNPs (referred to as HP) and S₂ contains the remainder of the GWAS SNPs (referred to as “low-prior” SNPs or LP) that can be considered part of an “agnostic” search. We allow differential treatment of the SNPs in the PRS:

$$\mathrm{PRS}_{i}(α_{1}, α_{2}) = \sum_{m \in S_{1}} \hat{β}_{m} \, I(P_{m} < α_{1}) \, g_{im} + \sum_{m \in S_{2}} \hat{β}_{m} \, I(P_{m} < α_{2}) \, g_{im} \tag{2}$$

and select the optimal (α₁, α₂) based on independent validation dataset(s). Intuitively, SNPs in the HP group are included at a less rigorous threshold than SNPs in the LP group to optimize the performance. We refer to the PRS in Eq (2) as two-dimensional PRS or 2D PRS.
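The following sketch illustrates Eq (2) together with a brute-force search for (α₁, α₂) on a validation set. It is an illustration under stated assumptions, not the authors' implementation: the squared Pearson correlation between score and outcome stands in for the prediction R², and the grid, the data and every function name are hypothetical.

```python
def prs_2d(genotypes, beta_hat, p_values, is_hp, alpha_hp, alpha_lp):
    """Eq (2): HP SNPs enter at alpha_hp, LP SNPs at alpha_lp."""
    total = 0.0
    for g, b, p, hp in zip(genotypes, beta_hat, p_values, is_hp):
        threshold = alpha_hp if hp else alpha_lp
        if p < threshold:
            total += b * g
    return total


def squared_correlation(x, y):
    """Squared Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def best_thresholds(val_genotypes, val_y, beta_hat, p_values, is_hp, grid):
    """Return (r2, alpha_hp, alpha_lp) maximizing r2 over grid x grid."""
    best = None
    for a1 in grid:
        for a2 in grid:
            scores = [prs_2d(g, beta_hat, p_values, is_hp, a1, a2)
                      for g in val_genotypes]
            r2 = squared_correlation(scores, val_y)
            if best is None or r2 > best[0]:
                best = (r2, a1, a2)
    return best


# Hypothetical example: SNP 2 is high-prior, SNPs 1 and 3 are low-prior.
beta_hat = [0.10, 0.04, 0.03]
p_values = [1e-6, 0.02, 0.02]
is_hp = [False, True, False]

demo = prs_2d([1.0, 1.0, 1.0], beta_hat, p_values, is_hp, 0.05, 0.001)

val_genotypes = [[1, 0, 1], [0, 1, -1], [-1, 0, 1], [0, -1, -1]]
val_y = [0.1, 0.04, -0.1, -0.04]  # driven by SNPs 1 and 2 only
best_r2, best_a1, best_a2 = best_thresholds(
    val_genotypes, val_y, beta_hat, p_values, is_hp, grid=[0.001, 0.05])
```

In this toy setting the search picks the liberal threshold for the HP SNP and the stringent one for the LP SNPs, which is the pattern the text describes. A real analysis would use a much finer grid and a separate test set to avoid overfitting the thresholds.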

When the genetic architecture parameters are known and SNPs are independent, we derived the theoretical predictive performance of 2D PRS and the corresponding optimal (α₁, α₂) following analytic techniques similar to those derived for 1D PRS [11] (Materials and Methods). Fig 1A shows the theoretically derived area under the curve (AUC) for a binary trait based on 1D PRS and 2D PRS. For both PRS models, the AUC increases with the sample size of the discovery dataset. The 2D PRS can improve on the 1D PRS, with the magnitude of the improvement depending on the discovery sample size and on the enrichment fold change Δ of the HP SNPs. Here, Δ is defined as the ratio of the proportion of causal SNPs in HP to the overall proportion of causal SNPs. A larger value of Δ indicates a greater enrichment of causal SNPs in HP. Fig 1B shows the P-value thresholds (α₁, α₂) that maximize the prediction of 2D PRS for a given discovery sample size. The optimal P-value threshold for including HP SNPs is more liberal than that for LP SNPs, and the difference diminishes as the training sample size becomes very large.

Fig 1. The theoretical calculation assumes M = 53,163 independent SNPs, of which 5,000 are causal for a binary trait, similar to the simulation studies. The high-prior (HP) SNP set has 5,000 SNPs and the low-prior (LP) SNP set has 48,163 SNPs. Δ is the enrichment fold of HP SNPs in the causal SNP set. (A) The prediction AUC for 1D PRS and 2D PRS. (B) The optimal P-value thresholds for including HP and LP SNPs in 2D PRS. For both plots, the x-coordinate is the discovery sample size, assuming equal numbers of cases and controls.
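To make the definition of Δ concrete, the short sketch below converts an enrichment fold into an expected number of causal SNPs in the HP set, using the SNP counts from the Fig 1 setup. The value Δ = 4 is a hypothetical choice for illustration, and the function name is not from the paper.

```python
def expected_causal_in_hp(n_snps, n_causal, n_hp, delta):
    """Expected causal SNPs in HP when HP is enriched delta-fold.

    delta = (proportion causal in HP) / (overall proportion causal).
    """
    overall_prop = n_causal / n_snps
    hp_prop = delta * overall_prop
    if hp_prop > 1.0:
        raise ValueError("delta implies more causal SNPs than HP SNPs")
    return hp_prop * n_hp


# Fig 1 setup: 53,163 SNPs, 5,000 causal, 5,000 HP SNPs; delta = 4 is hypothetical.
n_causal_hp = expected_causal_in_hp(53163, 5000, 5000, delta=4)
```

Under these numbers about 9.4% of all SNPs are causal, so a four-fold enrichment puts roughly 1,881 of the 5,000 causal SNPs in the HP set.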

PRS with winner’s curse correction

In PRS, only SNPs with P-values less than a specific threshold are included. This selection affects the probability density of $\hat{β}_{m}$ for selected SNPs and may cause upward bias in the estimate, an effect called winner’s curse. Methods have been proposed to reduce the selection bias in GWAS [24–26]; however, it is not clear whether winner’s curse corrections improve the performance of PRS. Let β_m denote the true effect size and assume that $\hat{β}_{m} \sim N(β_{m}, \hat{σ}_{m}^{2})$. Following Zhong and Prentice [26], we consider a shrinkage estimator $\hat{β}_{m}^{\mathrm{mle}}(α)$ that maximizes a conditional likelihood

$$P(\hat{β}_{m} \mid P_{m} < α) = \frac{ϕ\left((\hat{β}_{m} - β_{m}) / \hat{σ}_{m}\right) / \hat{σ}_{m}}{Φ\left(β_{m} / \hat{σ}_{m} - λ(α) / \hat{σ}_{m}\right) + Φ\left(- β_{m} / \hat{σ}_{m} - λ(α) / \hat{σ}_{m}\right)} \, I(|\hat{β}_{m}| \geq λ(α)),$$

where ϕ() is the density function of N(0,1), Φ() is the cumulative distribution function of N(0,1) and $λ (α) = Φ^{- 1} (1 - α / 2) {\hat{σ}}_{m}$ . The 1D PRS and 2D PRS after winner’s curse correction are calculated as

$$\mathrm{PRS}_{i}^{\mathrm{mle}}(α) = \sum_{m=1}^{M} \hat{β}_{m}^{\mathrm{mle}}(α) \, I(P_{m} < α) \, g_{im} \tag{3}$$

and

$$\mathrm{PRS}_{i}^{\mathrm{mle}}(α_{1}, α_{2}) = \sum_{m \in S_{1}} \hat{β}_{m}^{\mathrm{mle}}(α_{1}) \, I(P_{m} < α_{1}) \, g_{im} + \sum_{m \in S_{2}} \hat{β}_{m}^{\mathrm{mle}}(α_{2}) \, I(P_{m} < α_{2}) \, g_{im}, \tag{4}$$

respectively. Because $\hat{β}_{m}^{\mathrm{mle}}(α)$ is a maximum likelihood estimator, we refer to it as the MLE winner’s curse correction. For selection of the optimal threshold parameter(s), it is critical that bias correction is performed simultaneously with SNP selection for each value of the threshold parameters. This approach, although conceptually straightforward, is computationally intensive when analyzing a large number of SNPs and a grid of P-value thresholds.
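A minimal sketch of the conditional-likelihood correction for one selected SNP is given below. It assumes the normal model stated above and maximizes the conditional log-likelihood by a plain grid search on [0, |β̂|], which is a simplification chosen for clarity and not the optimizer used in the paper. The function names and grid size are hypothetical.

```python
import math
from statistics import NormalDist

_N01 = NormalDist()  # N(0, 1)


def conditional_loglik(beta, beta_hat, se, alpha):
    """Log of the conditional density of beta_hat given P < alpha."""
    c = _N01.inv_cdf(1.0 - alpha / 2.0)  # lambda(alpha) / se
    z, mu = beta_hat / se, beta / se
    # Log of phi(z - mu) / se, written out to avoid underflow.
    log_num = -0.5 * (z - mu) ** 2 - 0.5 * math.log(2.0 * math.pi) - math.log(se)
    log_den = math.log(_N01.cdf(mu - c) + _N01.cdf(-mu - c))
    return log_num - log_den


def mle_corrected_beta(beta_hat, se, alpha, n_grid=2001):
    """Grid-search maximizer of the conditional likelihood.

    The search covers [0, |beta_hat|]; the sign of beta_hat is restored
    afterwards. Intended for SNPs that already satisfy P < alpha.
    """
    b = abs(beta_hat)
    candidates = [b * k / (n_grid - 1) for k in range(n_grid)]
    best = max(candidates, key=lambda beta: conditional_loglik(beta, b, se, alpha))
    return math.copysign(best, beta_hat)


moderate = mle_corrected_beta(3.0, 1.0, alpha=0.05)   # Z = 3
borderline = mle_corrected_beta(2.0, 1.0, alpha=0.05)  # Z just past the cut-off
strong = mle_corrected_beta(10.0, 1.0, alpha=0.05)     # Z far past the cut-off
```

The three calls show the expected behaviour: estimates barely past the threshold are shrunk heavily, while estimates far beyond it are left almost unchanged. Because the correction depends on α, it has to be recomputed for every threshold on the grid, which is the computational cost mentioned above.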

A computationally more attractive approach is to build a PRS using lasso [19] based on summary level data from a GWAS. Suppose that we have M independent SNPs and N training samples with phenotype y_j. We assume that genotypic values g_jm are standardized to have mean zero and unit variance. We estimate parameters (β₀, β₁, …, β_M) by minimizing a penalized loss function:

$$\frac{1}{2} \sum_{j=1}^{N} \left(y_{j} - β_{0} - \sum_{m=1}^{M} β_{m} g_{jm}\right)^{2} + λ \sum_{m=1}^{M} |β_{m}|, \tag{5}$$

where λ controls the sparseness of the prediction model. Let $\hat{β}_{m} = \sum_{j=1}^{N} (y_{j} - \bar{y}) g_{jm}$ be the marginal estimate of β_m. When SNPs are independent, the solution to Eq (5) is [19]

$$\hat{β}_{m}^{\mathrm{lasso}}(λ) = \mathrm{sign}(\hat{β}_{m}) \left| |\hat{β}_{m}| - λ \right| I(|\hat{β}_{m}| > λ). \tag{6}$$

The resulting linear prediction model, or equivalently the PRS, is given as

$$\mathrm{PRS}_{i}^{\mathrm{lasso}}(λ) = \sum_{m=1}^{M} \hat{β}_{m}^{\mathrm{lasso}}(λ) \, g_{im} = \sum_{m=1}^{M} \mathrm{sign}(\hat{β}_{m}) \left| |\hat{β}_{m}| - λ \right| I(|\hat{β}_{m}| > λ) \, g_{im}.$$

Because the event {P_m < α} is equivalent to the event $\{|\hat{β}_{m}| > λ(α)\}$ with $λ(α) = Φ^{-1}(1 - α/2) \, \mathrm{sd}(\hat{β}_{m})$, we can rewrite $\mathrm{PRS}_{i}^{\mathrm{lasso}}(λ)$ as

$$\mathrm{PRS}_{i}^{\mathrm{lasso}}(α) = \sum_{m=1}^{M} \mathrm{sign}(\hat{β}_{m}) \left| |\hat{β}_{m}| - λ(α) \right| I(P_{m} < α) \, g_{im}. \tag{7}$$

Similarly, considering the lasso problem with two penalty terms by minimizing

$$\frac{1}{2} \sum_{j=1}^{N} \left(y_{j} - β_{0} - \sum_{m=1}^{M} β_{m} g_{jm}\right)^{2} + λ_{1} \sum_{m \in S_{1}} |β_{m}| + λ_{2} \sum_{m \in S_{2}} |β_{m}|$$

leads to a 2D PRS

$$\mathrm{PRS}_{i}^{\mathrm{lasso}}(α_{1}, α_{2}) = \sum_{m \in S_{1}} \hat{β}_{m}^{\mathrm{lasso}}(λ(α_{1})) \, I(P_{m} < α_{1}) \, g_{im} + \sum_{m \in S_{2}} \hat{β}_{m}^{\mathrm{lasso}}(λ(α_{2})) \, I(P_{m} < α_{2}) \, g_{im}. \tag{8}$$

Note that the above derivation assumes independence between SNPs. In reality, nearby SNPs may still be in weak LD even after aggressive LD-clumping using r² < 0.1. Thus, Eq (6) approximates the exact lasso solution that formally adjusts for correlation. The similarity between $\mathrm{PRS}_{i}^{\mathrm{mle}}(α)$ in Eq (3) and $\mathrm{PRS}_{i}^{\mathrm{lasso}}(α)$ in Eq (7) suggests that the lasso shrinkage estimator in Eq (6) provides an alternative approach for reducing the bias caused by winner’s curse. This observation motivated us to use the shrinkage estimator in Eq (6) to build PRS for a binary trait, where $\hat{β}_{m}$ is marginally estimated. Because the models in Eqs (7) and (8) are approximations to the true lasso prediction model in the presence of weak LD between SNPs, we refer to them as PRS with lasso-type winner’s curse correction.
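The lasso-type weight in Eqs (6) and (7) is a soft-thresholding rule, sketched below for a single SNP and for a 1D PRS. The function names and numbers are hypothetical; the only inputs are the marginal coefficient, its standard error and the threshold α, matching the summary-level setting described above.

```python
import math
from statistics import NormalDist


def lasso_type_weight(beta_hat, se, alpha):
    """Soft-threshold beta_hat at lambda(alpha) = Phi^{-1}(1 - alpha/2) * se."""
    lam = NormalDist().inv_cdf(1.0 - alpha / 2.0) * se
    if abs(beta_hat) <= lam:  # equivalent to P_m >= alpha: SNP not selected
        return 0.0
    return math.copysign(abs(beta_hat) - lam, beta_hat)


def prs_lasso_1d(genotypes, beta_hat, se, alpha):
    """Eq (7): PRS with lasso-type winner's curse correction."""
    return sum(lasso_type_weight(b, s, alpha) * g
               for g, b, s in zip(genotypes, beta_hat, se))


w_selected = lasso_type_weight(0.10, 0.02, alpha=0.05)   # Z = 5, kept and shrunk
w_dropped = lasso_type_weight(0.02, 0.02, alpha=0.05)    # Z = 1, set to zero
w_negative = lasso_type_weight(-0.10, 0.02, alpha=0.05)  # sign is preserved
```

Unlike the conditional-likelihood estimator, this weight has a closed form, so recomputing it over a grid of thresholds is cheap. For the 2D version in Eq (8), the same function would be called with α₁ for HP SNPs and α₂ for LP SNPs.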

Simulation results

We performed simulations to evaluate the performance of six PRS prediction methods: 1D and 2D PRS without and with winner’s curse correction (MLE and lasso-type correction). To make simulations realistic in terms of the distribution of minor allele frequencies (MAF) and LD, we simulated quantitative traits with specific genetic architecture by conditioning on the genotypes of a lung cancer GWAS [27], which had 11,924 samples of European ancestry and 485,315 autosomal SNPs after quality control. We randomly selected 10,000 samples as a discovery set and 1,924 as a validation set. The causal SNP set consisted of 5,000 SNPs in linkage equilibrium. In the first set of simulations, the HP SNPs were randomly selected from LD-pruned SNPs across the genome. In the second set of simulations, we simulated HP SNPs located in conserved regions (CR) [28], which were recently reported to be highly enriched for association signal of 17 complex traits based on a heritability partitioning analysis [23].

The simulation results are summarized in Fig 2. First, winner’s curse corrections slightly improved prediction in most if not all simulations and in particular improved more for the 1D PRS than the 2D PRS. We also observed that the two winner’s curse correction methods performed similarly. Second, if HP SNPs were chosen randomly in the LD-pruned SNP set and were strongly enriched for causal SNPs, 2D PRS substantially improved the prediction over 1D PRS. As expected, the improvement increased quickly with the enrichment fold change Δ. Consistent with theoretical analysis assuming independent SNPs (Fig 1B), the optimal P-value threshold for HP SNPs was more liberal than that for LP SNPs (S1 Table).

However, when we used CR-SNPs as the HP SNPs, the improvement of 2D PRS was smaller than in the simulations with randomly selected HP SNPs, even with the same enrichment fold change. To investigate whether the difference was caused by different local LD structure, for each SNP we counted the number of SNPs in The 1000 Genomes Project [29] that were located less than 1 Mb away and had r² ≥ 0.8 with it. For the 9,940 CR-SNPs used in our simulations, the average number of such LD SNPs was 22.4 (median = 12), compared with 6.4 (median = 2) for non-CR SNPs (see also the histograms in S1 Fig). Thus, CR-SNPs are enriched in regions with strong LD, which may explain why CR-SNPs (and other functional categories with similar LD structure) do not improve risk prediction as much as would be expected from their enriched heritability.
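The LD-partner count described above can be sketched as follows. The input layout (a position map and a dictionary of pairwise r² values for one chromosome) and the SNP identifiers are hypothetical; the paper does not describe how the counts were computed in software.

```python
def count_ld_partners(positions, r2, max_dist=1_000_000, min_r2=0.8):
    """Count, per SNP, the other SNPs within max_dist bp with r2 >= min_r2.

    positions: {snp_id: base-pair position} for one chromosome.
    r2: {(snp_a, snp_b): r2 value}, each unordered pair stored once.
    """
    counts = {snp: 0 for snp in positions}
    for (a, b), value in r2.items():
        if value >= min_r2 and abs(positions[a] - positions[b]) < max_dist:
            counts[a] += 1
            counts[b] += 1
    return counts


# Hypothetical toy data: rs1 and rs2 are close and in strong LD;
# rs3 is strongly correlated with rs1 but lies more than 1 Mb away.
positions = {"rs1": 100_000, "rs2": 150_000, "rs3": 2_000_000}
r2 = {("rs1", "rs2"): 0.9, ("rs1", "rs3"): 0.95, ("rs2", "rs3"): 0.1}
ld_counts = count_ld_partners(positions, r2)
```

Averaging such counts separately over CR-SNPs and non-CR SNPs would give the kind of comparison reported in the text.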

Results of analyzing real GWAS data sets

We applied the six PRS methods to 14 traits with either individual-level GWAS data or summary-level data (Tables 1 and 2). We defined the HP SNP set S₁ using expression QTL SNPs (eSNPs) in blood, tissue-specific eSNPs and methylation QTL SNPs (meSNPs), SNPs related to cis-regulatory elements (referred to as CRE-SNPs), SNPs related to genomic regions conserved across mammals (referred to as CR-SNPs) and SNPs identified by pleiotropic analyses (referred to as PT-SNPs). Details about annotation data are provided in Materials and Methods. The annotation data used for each trait are summarized in S2 Table. For traits with individual-level data but without independent validation samples, we used cross-validation to estimate performance.

Table 1. GWAS data sets with individual level data.

Data source	Ancestry	Disease	(Cases, controls)	Cross-validation
WTCCC	European	Bipolar disorder	(1817, 2928)	5-fold
WTCCC	European	Coronary artery disease	(1878, 2928)	5-fold
WTCCC	European	Crohn’s disease	(1729, 2928)	5-fold
WTCCC	European	Hypertension	(1934, 2928)	5-fold
WTCCC	European	Rheumatoid arthritis	(1894, 2928)	5-fold
WTCCC	European	Type 1 diabetes	(1939, 2928)	5-fold
NCI GWAS	European	Bladder cancer	(5937, 10862)	10-fold
NCI GWAS	Asian	Lung cancer in non-smoking females	(5510, 4544)	10-fold
NCI GWAS	European	Pancreatic cancer	(5066, 8807)	10-fold

Table 2. GWAS data with summary level data.

Disease	Discovery ancestry	Discovery data	Discovery (cases, controls)	Validation ancestry	Validation data	Validation (cases, controls)
Type 2 diabetes	European	DIAGRAM, GERA	(17802, 105109)	European	GERA	(1500, 1500)
Lung cancer	European	TRICL	(11300, 15952)	European	PLCO	(1237, 1330)
Schizophrenia	European	PGC2	(31560, 42951)	European	MGS	(2681, 2653)
Colorectal cancer	European	GECCO	(9719, 10937)	European	PLCO	(1000, 2302)
Prostate cancer	European, African, Japanese, Latino	PRACTICAL, ELLIPSE	(38703, 40796)	European	Pegsus	(4600, 2941)

Polygenic risk prediction of type 2 diabetes

We first use type-2 diabetes (T2D) as an example to illustrate our methods. Fig 3A presents the 1D PRS results for T2D. The standard 1D PRS without winner’s curse correction had a prediction R² = 2.29% by including SNPs with P≤2×10⁻³. The winner’s curse correction improved R² to 3.10% using the lasso-type correction and 2.67% using the MLE correction.

Next, we investigated whether functional annotation could further improve risk prediction. We considered CR-SNPs, eSNPs and meSNPs in adipose tissue, and SNPs related to different histone marks and their combinations as HP SNP sets. These SNPs were enriched in T2D GWAS, as exemplified by the QQ plot in Fig 3B for an HP SNP set comprising eSNPs/meSNPs in adipose tissue and SNPs related to H3K4me3 in the pancreatic islet cell line. Note that the SNPs have been pruned to have pairwise r² ≤ 0.1, so the observed enrichment is unlikely to be an artifact of extensive LD. Fig 3C illustrates how the prediction R² of a 2D PRS depends on the P-value thresholds for the HP and LP SNPs. The prediction R² was maximized using a more liberal P-value threshold of 0.03 for HP SNPs and a more stringent threshold of 0.005 for LP SNPs. This optimal 2D PRS had 8,018 HP SNPs and 2,033 LP SNPs.

Fig 3D reports the prediction R², AUC and the significance for testing whether an alternative PRS method could improve the standard 1D PRS. The best predictions were achieved by the 2D PRS with lasso-type correction: R² = 3.48% using eSNPs/meSNPs and CR-SNPs and R² = 3.53% using eSNPs/meSNPs and H3K4me3 SNPs in pancreatic islet cell line (52.0% and 54.1% efficiency gain compared to 2.29% using standard 1D PRS, respectively). These improvements were statistically significant compared to the 1D standard PRS (P = 0.00002 and 0.00004, respectively). Of note, the recently developed method LD-pred [31] that models the LD information only slightly improved prediction R² from 2.47% to 2.73% (10.5% efficiency gain) using DIAGRAM summary statistics as discovery. Results are summarized in S3 Table (prediction R², AUC and Nagelkerke R²), S4 Table (P-value for testing significance of improvement) and S5 Table (optimal thresholds for SNP selection).
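The efficiency gains quoted above appear to be relative increases in prediction R² over the corresponding baseline. This reading is inferred from the reported numbers rather than from a definition stated in the text; under that assumption the arithmetic is

$$\frac{3.53 - 2.29}{2.29} \approx 54.1\%, \qquad \frac{3.48 - 2.29}{2.29} \approx 52.0\%, \qquad \frac{2.73 - 2.47}{2.47} \approx 10.5\%.$$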

Results for WTCCC data

The prediction R² values for six diseases in WTCCC data are reported in Fig 4A. The AUCs and Nagelkerke R² are summarized in S6 Table. Optimal thresholds for SNP selection are in S7 Table. The lasso-type winner’s curse correction improved the 1D PRS predictions for Crohn’s disease (CD), rheumatoid arthritis (RA) and type 1 diabetes (T1D). The 2D PRS improved the prediction for CD (6.65% to 7.71% using blood eSNPs). Combining functional data and lasso-type correction gave a prediction R² = 8.75% for CD (31.6% efficiency gain over the standard 1D PRS). However, because of the small sample size in the validation sample, the improvements were not statistically significant.

Results for three cancer GWAS with individual genotype data

Results are summarized in Fig 4B (prediction R²), S8 Table (AUC and Nagelkerke R²), S9 Table (P-value for testing significance of improvement) and S10 Table (optimal thresholds for SNP selection). The standard 1D PRS achieved an R² = 1.12% for bladder cancer, 2.35% for Asian nonsmoking female lung cancer and 2.2% for pancreatic cancer, indicating the difficulty of genetic risk prediction for these cancers. The 2D PRS with lasso-type correction improved the prediction, although the various annotation datasets gave different degrees of improvement. For bladder cancer, the greatest efficiency gain (R² = 1.64%, 46.4% efficiency gain over the standard 1D PRS and 27.1% efficiency gain over the 1D PRS with lasso-type correction) was achieved with the SNPs related to the lung tissue/cell line expression data (eSNPs, meSNPs, H3K4me3 SNPs in SAEC), which performed slightly better than the SNPs related to histone marks in bladder cell line (R² = 1.46%). For non-smoking female Asian lung cancer, the 2D PRS incorporating PT-0.001 SNPs or H3K4me3 SNPs in HAEC improved R² to 2.84%. For pancreatic cancer, the 2D PRS incorporating CR-SNPs, SNPs related to histone marks of pancreatic islet and adipose eSNPs/meSNPs improved prediction R² by approximately 30% compared with the standard 1D PRS. Many of the improvements over the standard 1D PRS were statistically significant (S9 Table), e.g., P = 0.025 for 2D PRS with H3K4me3 SNPs in HAEC for bladder cancer, P = 0.025 for 2D PRS with PT-0.001 SNPs for Asian lung cancer and P = 0.047 (0.023, 0.023) for 2D PRS with CR-SNPs (PT-0.001, PT-0.01 SNPs) for pancreatic cancer.

Results for four large-scale summary-statistics datasets

Prediction results for lung cancer, schizophrenia, prostate cancer and colorectal cancer are reported in Fig 4C (prediction R²), S3 Table (AUC and Nagelkerke R²), S4 Table (P-values for testing whether improvements were significant), S5 Table (optimal P-value thresholds for SNP selection in 2D PRS) and S2 Fig. For lung cancer, the standard 1D PRS had an R² = 1.13%. The best prediction R² = 1.65% was achieved by lasso-corrected 2D PRS with eSNPs/meSNPs in lung tissues, blood eSNPs and SNPs related with H3K4me3 in SAEC. This prediction accuracy was achieved with optimal P-value thresholds for the 2D PRS of 0.008 for HP SNPs and 5 × 10⁻⁶ for LP SNPs. However, the improvement was not statistically significant. For schizophrenia, the lasso-type correction improved the 1D PRS R² from 14.01% to 14.94%; the 2D PRS with CR-SNPs slightly further improved the R² to 15.37%, and the improvement was highly statistically significant (P = 3.2 × 10⁻¹⁰). For CRC and prostate cancer, neither winner’s curse correction nor 2D PRS improved prediction.

Discussion

Our study demonstrates that the predictive performance of GWAS PRS models can be improved based on a combination of a simple adjustment to the threshold levels of SNP selection and weights of selected SNPs. The degree of gain, however, is not uniform and depends on multiple factors, including the genetic architecture of the trait, sample size of the discovery sample set, degree of enrichment of association in selected set of “high-prior” SNPs and the linkage disequilibrium patterns of these SNPs with the rest of the genome.

The simple winner’s curse correction of SNP weights using the lasso-type method leads to an improvement in performance of PRS uniformly across all studied diseases. For some diseases, such as type-2 diabetes (Fig 3 and S3 Table) or Crohn’s disease (Fig 4 and S6 Table), this correction alone led to notable improvement in the performance of PRS. The optimal weighting of SNPs would depend on the true effect size distribution of the underlying susceptibility SNPs. Lasso-type weights can be expected to be optimal under a double exponential distribution [19, 32], and it is possible that the weighting could be improved further under alternative models of effect-size distribution. It is, however, encouraging that irrespective of what might be the true effect-size distribution, which is likely to vary across the diseases of study, our simple lasso-type correction improves over the standard PRS without adding any additional computational complexity.
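To make the shrinkage idea concrete, the following is a minimal sketch that assumes the lasso-type correction takes the usual soft-thresholding form, in which each estimated coefficient is pulled toward zero by a constant and set to zero if it falls below that constant. The function name and the shrinkage parameter `lam` are illustrative; the exact definition used for the PRS is given with the method and is not reproduced in this section.

```python
import math


def soft_threshold(beta_hat: float, lam: float) -> float:
    """Shrink an estimated coefficient toward zero by lam (assumed form).

    Returns sign(beta_hat) * (|beta_hat| - lam) when |beta_hat| > lam,
    and 0.0 otherwise.
    """
    magnitude = abs(beta_hat) - lam
    if magnitude <= 0:
        return 0.0
    return math.copysign(magnitude, beta_hat)


if __name__ == "__main__":
    # Hypothetical marginal estimates and shrinkage parameter.
    for b in (0.05, -0.05, 0.01):
        print(b, soft_threshold(b, 0.02))
```

Under this assumed form, large estimates are reduced by a fixed amount, which is the behavior that counteracts the upward bias of coefficients selected for being significant.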

The effect of using various threshold levels for different functional categories of SNPs on the performance of the model varied by disease as well as the functional annotation of external data sets employed in our analytical approach. After adjustment with lasso-type weights, the use of two-dimensional threshold based on prioritized SNPs led to notably higher values of R² for lung cancer in Caucasians, bladder cancer, type-2 diabetes and pancreatic cancer. Consistent with theoretical expectations, for each of the traits, the optimal thresholds selected were more liberal for the associated category of high-prior SNPs than those for complementary set.

Our simulation study illustrated how the improvement in performance of the PRS model due to differential treatment of certain categories of SNPs is modest even when these SNPs have been categorized to be highly enriched for heritability [22]. For example, recent heritability partitioning analysis has identified SNPs in conserved DNA regions, representing 2.6% of the genome, to be highly enriched for GWAS heritability for many diseases (explaining 35% heritability on average). Our theoretical calculations suggest that if only independent SNPs are analyzed, use of a subset of SNPs similarly enriched for heritability is expected to yield much higher improvement in the performance of the model (Fig 1). Our simulation studies showed that a similarly large gain is expected even in the presence of naturally occurring LD pattern if these SNPs are selected randomly from the genome. However, when we simulated high-prior SNPs based on the exact location of conserved regions, the improvement was modest, within the range of observed data. The CR-SNPs represent a highly unusual linkage disequilibrium pattern in that they are in high degree of LD with an unusually large number of neighboring SNPs (S1 Fig).

In the future, more detailed and accurate assessment of the functional annotation of SNPs should improve performance of PRS models. Our method requires only simple modifications to the standard PRS algorithm and can thereby be used to rapidly evaluate the effectiveness of many alternative strategies. In the current study, we used physical location information pertaining to histone marks to define high-priority SNP. However, a SNP located in histone marks does not necessarily cause the variation in histone binding. Thus, a more reasonable approach is to identify genetic variants associated with histone variation across subjects in order to define high-priority SNP sets. These types of histone QTLs have recently been reported in small-scale studies based on HapMap samples [33, 34]. We expect that histone QTL SNPs identified in future large-scale tissue specific studies might be more informative for risk prediction.

We have investigated the performance of the various algorithms using criteria that reflect how much of the variability of the observed outcomes can be explained by the PRS in the validation dataset. For clinical applications of risk models, however, it is important to evaluate whether models are well calibrated, that is, to what extent they can produce unbiased estimates of risk for individuals with different SNP profiles. Earlier studies have noted that the standard PRS can be mis-calibrated and additional calibration steps may be needed when applying PRS in a clinical setting. In this regard, we find that a winner’s curse correction can alleviate calibration bias of the standard PRS, but substantial residual bias remains in some situations (S11 Table). The regression relationship between overall PRS and disease status can be estimated based on a relatively small validation sample and can also be used to re-scale PRS for producing calibrated risk estimates.

We used several different metrics for evaluating the potential impact of an improved PRS for risk stratification. The percentage gain in prediction R² due to improved PRS is substantial for several diseases. For these diseases, the impact of an improved PRS on the overall discriminatory performance of the models is noticeable but small (increase in AUC value between 1–2%). However, even a modest increment in AUC value can lead to identification of a substantially higher fraction of individuals who are at the tails of the risk distribution and hence likely to consider clinical decisions (S12 Table).

A limitation of our method is that we use stringent LD-pruning for creating sets of independent SNPs. However, this may result in loss of predictive power of models as SNPs in moderate or low LD may still harbor independent association signals. The LD-pred [31] method has been proposed to better account for correlated SNPs in building PRS using GWAS summary-level data and has been shown to lead to improved performance over standard PRS for some diseases such as schizophrenia. The LD-pred method also uses a specific form of prior distribution for obtaining “shrunken” estimates of the regression coefficients for the SNPs in the model. Although we did not make direct comparisons, it appears that the LD-pred method gains over standard PRS by improving the accounting for correlation between risk SNPs. In contrast, in our algorithm, which used stringent LD pruning, the gain in performance over the standard PRS mainly came from the lasso-type winner’s curse correction and the use of variable thresholds to account for HP and LP SNPs. Thus it is possible that in the future the complementary strengths of the algorithms can be combined to develop more powerful PRS.

In conclusion, we have proposed a set of simple methods for constructing PRS for genetic risk prediction using GWAS summary-level data. The proposed methods are computationally not onerous and yet show a noteworthy gain in performance. A major strength of our study is that we evaluated the proposed methods across a large number of scenarios reflecting a spectrum of underlying genetic architectures for different complex diseases, sample size of the study and available functional annotation. These studies and additional simulations provide comprehensive insights to promises and limitations of genetic risk prediction models in the near future.

Materials and Methods

LD-pruning and LD-clumping

The performance of PRS is typically improved if genetic markers are pruned for LD. LD-pruning procedures that ignore GWAS P-values frequently prune out the most significant SNPs and may reduce performance. Instead, we use the LD-clumping procedure implemented in PLINK [20] that chooses the most significant SNP from a set of SNPs in LD guided by GWAS P-values. After LD-clumping, no SNPs with physical distance less than 500kb have LD r² ≥ 0.1.
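The P-value-guided selection described above can be illustrated with a greedy sketch. This is a simplified stand-in, not PLINK's implementation: the function name, the precomputed pairwise `r2` matrix and the default window and LD cutoff (taken from the values quoted in the text) are assumptions made for illustration.

```python
from typing import List, Sequence


def ld_clump(
    pvals: Sequence[float],
    chrom: Sequence[int],
    pos: Sequence[int],
    r2: Sequence[Sequence[float]],
    window_bp: int = 500_000,
    r2_max: float = 0.1,
) -> List[int]:
    """Greedy clumping: visit SNPs from most to least significant,
    keep each SNP not yet removed, and remove its neighbors that lie
    within window_bp on the same chromosome and have r2 >= r2_max."""
    n = len(pvals)
    order = sorted(range(n), key=lambda j: pvals[j])
    removed = [False] * n
    kept = []
    for j in order:
        if removed[j]:
            continue
        kept.append(j)
        for k in range(n):
            if k == j or removed[k]:
                continue
            close = chrom[k] == chrom[j] and abs(pos[k] - pos[j]) < window_bp
            if close and r2[j][k] >= r2_max:
                removed[k] = True
    return sorted(kept)


if __name__ == "__main__":
    # Toy data: SNP 1 is the index SNP and removes SNP 0 (r2 = 0.6).
    pvals = [1e-3, 1e-8, 0.5, 1e-4]
    chrom = [1, 1, 1, 1]
    pos = [100, 200, 300, 900_000]
    r2 = [
        [1.0, 0.6, 0.3, 0.0],
        [0.6, 1.0, 0.05, 0.9],
        [0.3, 0.05, 1.0, 0.0],
        [0.0, 0.9, 0.0, 1.0],
    ]
    print(ld_clump(pvals, chrom, pos, r2))
```

In the toy data, SNP 3 is retained despite its high r² with SNP 1 because the two are more than 500kb apart, which mirrors the distance condition in the text.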

Expanding HP SNP set through LD

Suppose S₁ is a given HP set defined based on external annotation data (see section Annotation datasets). Any SNP in high LD with a SNP in S₁ is also considered to be an HP SNP. Thus, we expanded S₁ by including all SNPs that were in high LD (r² ≥ 0.8) with any SNP in the original S₁.
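As a small illustration of this expansion step, the sketch below adds every SNP whose r² with a member of the original set reaches the cutoff. The representation of LD as a dictionary of SNP pairs and the function name are assumptions made for the example; only the r² ≥ 0.8 rule comes from the text.

```python
from typing import Dict, Iterable, Set, Tuple


def expand_hp_set(
    hp: Iterable[str],
    r2_pairs: Dict[Tuple[str, str], float],
    r2_min: float = 0.8,
) -> Set[str]:
    """Return the HP set plus all SNPs in high LD with an original member.

    Expansion is done in a single pass against the original set, so it is
    not transitive: proxies of proxies are not added.
    """
    original = set(hp)
    expanded = set(original)
    for (snp_a, snp_b), r2 in r2_pairs.items():
        if r2 < r2_min:
            continue
        if snp_a in original:
            expanded.add(snp_b)
        if snp_b in original:
            expanded.add(snp_a)
    return expanded


if __name__ == "__main__":
    # Hypothetical SNP identifiers and pairwise r² values.
    pairs = {("rs1", "rs2"): 0.9, ("rs2", "rs3"): 0.95, ("rs1", "rs4"): 0.5}
    print(sorted(expand_hp_set({"rs1"}, pairs)))
```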

Simulations

We simulated quantitative traits with specific genetic architecture by conditioning on the genotypes of a lung cancer GWAS [27], including 11,924 samples of European ancestry and 485,315 autosomal SNPs after quality control. The simulation scheme is summarized in the following steps:

We performed LD-pruning implemented in PLINK so that no SNPs within 500kb were in LD at threshold r² = 0.1. After LD-pruning, M = 53,163 autosomal SNPs (denoted as S) were left.
Denote S₁ as the putative HP SNP set and S₂ = S \ S₁ as the LP SNP set. We selected a set of 5000 “causal” SNPs (denoted as C) from the pruned SNP set S. If C is randomly selected, i.e., S₁ is not enriched with causal SNPs, we expect | S₁ ∩ C | = | C || S₁| / M SNPs overlapping between S₁ and C. Thus, we defined the enrichment fold change for S₁ as
$Δ = \frac{| S_{1} \cap C |}{| C | | S_{1} | / M} .$

The enrichment fold change Δ ranged from 2 to 4 in simulations.
We simulated quantitative traits according to y_i = Σ_{t∈C} β_t g_it + ε_i, where the β_t were simulated independently from a Gaussian mixture distribution $β_{t} \sim π N (0, σ_{1}^{2}) + (1 - π) N (0, σ_{2}^{2})$ with π = 0.1. Here, $σ_{1}^{2}$ , $σ_{2}^{2}$ and Var(ε_i) were scaled so that Var(y_i) = 1. The phenotypic variances explained by the two components were $h_{1}^{2} = | C | π σ_{1}^{2} = 0.1$ and $h_{2}^{2} = | C | (1 - π) σ_{2}^{2} = 0.4$ . We assume the same effect-size distribution for both HP and LP causal SNPs, but the proportion of causal SNPs is higher in the former than in the latter group. Under this assumption, Δ also reflects the ratio of heritability explained on a per-SNP basis in the HP set compared with the LP set.
We randomly selected 10,000 samples as a discovery set and 1,924 as a validation set. We performed GWAS association analysis for all 485,315 autosomal SNPs in the discovery sample. The summary statistics were used to calculate PRS for each sample in the validation sample. The prediction R² was calculated as max_λ cor²(PRS_i(λ), y_i) for 1D PRS methods and max_{λ₁, λ₂} cor²(PRS_i(λ₁, λ₂), y_i) for 2D PRS methods. We repeated the simulation 50 times for each set of parameters and report the average prediction R².
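The causal-SNP selection and effect-size steps above can be sketched as follows. This is an illustrative sketch only: the function names, the rounding of the number of HP causal SNPs and the use of integer SNP identifiers are assumptions, while the formulas for Δ and for the mixture variances follow the definitions in the steps.

```python
import random


def enrichment_fold(hp_set, causal_set, m_total):
    """Delta = |S1 ∩ C| / (|C| |S1| / M)."""
    overlap = len(set(hp_set) & set(causal_set))
    expected = len(causal_set) * len(hp_set) / m_total
    return overlap / expected


def sample_causal(hp_set, lp_set, n_causal, delta, rng):
    """Draw causal SNPs so that the HP set has enrichment close to delta."""
    m_total = len(hp_set) + len(lp_set)
    n_hp = round(delta * n_causal * len(hp_set) / m_total)
    hp_causal = rng.sample(sorted(hp_set), n_hp)
    lp_causal = rng.sample(sorted(lp_set), n_causal - n_hp)
    return set(hp_causal) | set(lp_causal)


def mixture_variances(n_causal, pi=0.1, h2_1=0.1, h2_2=0.4):
    """Solve h1^2 = |C| pi s1^2 and h2^2 = |C| (1 - pi) s2^2 for s1^2, s2^2."""
    return h2_1 / (n_causal * pi), h2_2 / (n_causal * (1 - pi))


def simulate_effects(causal_set, rng, pi=0.1, h2_1=0.1, h2_2=0.4):
    """Draw one effect per causal SNP from the two-component mixture."""
    var1, var2 = mixture_variances(len(causal_set), pi, h2_1, h2_2)
    effects = {}
    for snp in sorted(causal_set):
        var = var1 if rng.random() < pi else var2
        effects[snp] = rng.gauss(0.0, var ** 0.5)
    return effects


if __name__ == "__main__":
    rng = random.Random(1)
    hp = set(range(100))          # toy HP set S1
    lp = set(range(100, 1000))    # toy LP set S2
    causal = sample_causal(hp, lp, n_causal=100, delta=2.0, rng=rng)
    print(enrichment_fold(hp, causal, m_total=1000))
    print(mixture_variances(5000))
```

With |C| = 5000 and π = 0.1, the variances implied by the stated heritabilities are σ₁² = 0.1/500 and σ₂² = 0.4/4500; the residual variance needed for Var(y_i) = 1 would then depend on the genotype scaling, which this sketch does not model.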

Recently, Finucane et al. [23] reported the heritability explained by common SNPs in multiple functional categories for 17 traits. Interestingly, they found that common SNPs located in regions that are conserved in mammals [28] accounted for about 2.6% of total common SNPs but explained approximately 35% of total heritability on average across these traits, suggesting a 13.5-fold enrichment. Thus, we were motivated to investigate whether SNPs related with the conserved regions (CR) may be useful for 2D PRS methods. We downloaded the CR annotations (http://compbio.mit.edu/human-constraint/data/gff/), identified common SNPs located in any CR and also identified their LD SNPs with r² ≥ 0.8. These SNPs are referred to as CR-SNPs, which were used as the HP set S₁ in simulations. We found 9,940 CR-SNPs overlapping with the 53,163 LD-pruned SNPs. To investigate whether specific genomic locations of CR-SNPs influence the performance of 2D PRS, we also performed simulations using a set S₁ of random SNPs that has the same size and associated heritability as the CR-SNPs.

WTCCC GWAS data

The Wellcome Trust Case Control Consortium [30] (WTCCC) data consisted of two control data sets (1958 Cohort samples and NBS control samples) and seven diseases: bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), Type 1 diabetes (T1D) and Type 2 diabetes (T2D). Since we analyzed T2D using a much larger discovery sample, we did not analyze the T2D data in WTCCC. Because cases and controls were genotyped in different batches, differential errors between cases and controls might cause a serious overestimate of the risk prediction. Thus, we performed very rigorous quality control (QC) by removing duplicate samples, first or second degree relatives, samples with missing rate greater than 5% and non-European samples identified from EigenStrat [35] analysis. For each disease, we excluded SNPs with MAF<5%, missing rate >2%, missing rate difference >1% between cases and controls or P_HWE<10⁻⁴ in the control samples. For each PRS method and each disease, we estimated the prediction R² by five-fold cross-validation.
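The SNP-level exclusion rules listed above amount to a simple per-SNP filter. The sketch below encodes them directly; the function and argument names are illustrative, and the per-SNP summaries (MAF, missing rates, HWE P-value in controls) are assumed to have been computed beforehand.

```python
def passes_snp_qc(
    maf: float,
    missing_rate: float,
    missing_rate_cases: float,
    missing_rate_controls: float,
    p_hwe_controls: float,
) -> bool:
    """Return True if the SNP survives all four exclusion criteria."""
    if maf < 0.05:                      # MAF < 5%
        return False
    if missing_rate > 0.02:             # missing rate > 2%
        return False
    if abs(missing_rate_cases - missing_rate_controls) > 0.01:
        return False                    # differential missingness > 1%
    if p_hwe_controls < 1e-4:           # HWE P-value < 10^-4 in controls
        return False
    return True


if __name__ == "__main__":
    # Hypothetical SNP summaries.
    print(passes_snp_qc(0.20, 0.010, 0.012, 0.008, 0.3))    # kept
    print(passes_snp_qc(0.20, 0.010, 0.030, 0.005, 0.3))    # dropped
```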

Three cancer GWAS with individual genotype data

We analyzed three cancer GWAS with individual level genotype data: the bladder cancer [36, 37] GWAS of European ancestry including 5,937 cases and 10,862 controls, the pancreatic cancer GWAS [38] of European ancestry (after excluding samples with Asian or African ancestry) including 5,066 cases and 8,807 controls, and the Asian non-smoking female lung cancer GWAS [39] with 5,510 cases and 4,544 controls. After QC, the bladder cancer GWAS had 463,559 autosomal SNPs and the Asian lung cancer GWAS had 329,703 autosomal SNPs. The pancreatic cancer GWAS included samples from three studies that used different genotyping platforms. For convenience, we analyzed 267,935 autosomal SNPs that overlapped in all three platforms. The prediction performance was evaluated using ten-fold cross-validation.

Five large GWAS with summary statistics and independent validation samples

For T2D, we downloaded the summary statistics of the DIAGRAM (DIAbetes Genetics Replication And Meta-analysis) consortium [40] with 12,171 cases and 56,862 controls for 2.5 million SNPs imputed to the Hapmap2 reference panel. We also downloaded the GERA (Genetic Epidemiology Research on Adult Health and Aging) GWAS data of European ancestry with 7,131 T2D patients and 49,747 samples without T2D (but who may have other medical conditions, e.g., 27.4% with cancers, 25.4% with asthma, 25.4% with allergic rhinitis and 12.4% with depression). We randomly selected 5,631 T2D patients and 48,247 non-T2D subjects from GERA as the discovery set, performed association analysis adjusting for the top 10 PCA scores and meta-analyzed the results with the summary statistics from DIAGRAM for 353,196 autosomal SNPs overlapping between the two studies. The resulting summary statistics were used to build PRS risk models, which were validated in the remaining 1,500 T2D patients and 1,500 non-T2D subjects in GERA.

The PGC2 (Psychiatric Genetics Consortium) schizophrenia GWAS meta-analysis consisted of 34,241 cases and 45,604 controls [41] (http://www.med.unc.edu/pgc/downloads). Summary statistics were obtained by meta-analyzing all PGC2 schizophrenia GWAS except the MGS [42] (Molecular Genetics of Schizophrenia) subjects of European ancestry. The summary statistics were used to build PRS models, which were validated in MGS samples with 2,681 cases and 2,653 controls.

The TRICL (Transdisciplinary Research in Cancer of the Lung) GWAS consortium consisted of 12,537 lung cancer cases and 17,285 controls [43, 44]. We performed meta-analysis using TRICL samples excluding the samples from the PLCO [27] (Prostate, Lung, Colon, and Ovary Cohort Study) study. The summary statistics based on 11,300 cases and 15,952 controls were used to build risk models, which were validated in the PLCO lung GWAS samples with 1,237 cases and 1,333 controls.

For colorectal cancer, we performed meta-analysis for the GECCO (Genetics and Epidemiology of Colorectal Cancer Consortium) [45] GWAS data after excluding the PLCO GWAS data. The PLCO samples were genotyped using two different genotyping platforms with different marker densities: one had approximately 500K SNPs and the other had only 250K SNPs. Thus, we first imputed the genotypes to the Hapmap2 reference panel using IMPUTE2 [46] and selected SNPs with imputation r² ≥ 0.9 for risk prediction. The discovery sample consisted of 9,719 cases and 10,937 controls from 19 studies. The PLCO validation sample had 1,000 cases and 2,302 controls.

The summary statistics for prostate cancer were obtained from the PRACTICAL (PRostate cancer AssoCiation group To Investigate Cancer Associated aLterations) consortium and The GAME-ON/ELLIPSE (Elucidating Loci Involved in Prostate Cancer Susceptibility) Consortium with samples from populations of European, African, Japanese and Latino ancestry [5]. The discovery samples consisted of 38,703 cases and 40,796 controls after excluding the NCI Pegasus GWAS samples with 4,600 cases and 2,941 controls, which were used for validation. We analyzed 536,057 autosomal SNPs after QC that overlapped between the validation and the discovery sample summary statistics.

Annotation data sets

For many traits, GWAS risk SNPs have been reported to show enrichment for eQTLs, methylation QTLs (meQTLs) and cis-regulatory elements (CREs). In addition, recent studies have reported extensive genetic pleiotropy across diseases and traits, e.g. psychiatric diseases [47, 48], schizophrenia and cardiovascular-disease risk factors, including blood pressure, triglycerides, low- and high-density lipoprotein, body mass index (BMI) and waist-to-hip ratio (WHR) [49]. This information may potentially improve risk prediction if the SNPs identified from the secondary trait are highly enriched in the GWAS of the primary trait. Thus, we defined the HP SNP set S₁ using eQTL SNPs (referred to as eSNPs) in blood, tissue specific eSNPs and meQTL SNPs (referred to as meSNPs), SNPs related with CREs (referred to as CRE-SNPs), SNPs related with genomic regions conserved across mammals (referred to as CR-SNPs) and SNPs identified by pleiotropic analyses (referred to as PT-SNPs). Here, LD was calculated based on the genotype data of relevant ancestry in The 1000 Genomes Project [29]. Note that the availability of functional annotation data depends on tissue types. However, for all diseases studied in the paper, we have used blood eSNPs and CR-SNPs because blood eSNPs are enriched for GWAS of all these traits and CR-SNPs were highly enriched in many traits by a heritability partitioning analysis [23].

eSNPs and meSNPs

Blood cis-eSNPs were identified from two large-scale eQTL studies in European populations. One study involved a transcriptome sequencing project of 922 subjects [50] and the other involved a microarray study of 5,311 subjects [51] (http://genenetwork.nl/bloodeqtlbrowser/). Because of its very large sample size, the second study had the power to detect eSNPs with even tiny effect sizes which may not have meaningful functional importance. Thus, we included eSNPs with association P-value <10⁻⁶ with any gene in the cis region in the second study. For both Asian and European lung cancer GWAS data, we used eSNPs [52] and meSNPs [53] based on lung tissues. For T2D, we used eSNPs [54] and meSNPs [55] based on adipose tissues (http://www.muther.ac.uk/Data.html). Furthermore, detected trans-SNPs are much fewer than cis-SNPs and the replication rate of trans-eSNPs was much lower than cis-SNPs [54], suggesting that including trans-SNPs would be unlikely to improve risk prediction. Thus, we did not include trans-SNPs.

CRE-SNPs

CREs are regions of noncoding DNA regulating the transcription of nearby genes. SNPs located in CREs may change the binding of specific transcription factors and thus the expression of the target genes. Typically, CREs are identified through ChIP-Seq experiments of histone modifications. We downloaded “peak” data (each peak represents one CRE) of specific sets of histone methylation markings, acetylation markings and DNase I hypersensitive sites (DHSs) from the ROADMAP project website for relevant cell lines. For each identified CRE (‘peak’), we identified common SNPs with MAF>1%. For prostate cancer, we used the ChIP-Seq data for H3K27Ac and the transcription factor TCF7L2 [56] to define HP SNP sets.

PT-SNPs

The summary statistics for height [1, 2], BMI and obesity [3, 57], WHR [58], waist circumference (WC) [58], hip circumference (HIP) [58] were downloaded from the GIANT consortium website. The summary statistics for GWAS meta-analysis of cardiovascular-disease risk factors [59], including triglycerides (TG), low-density lipoprotein (LDL) and high-density lipoprotein (HDL), were also used for 2D PRS.

We investigated whether or not each tentative HP SNP set was enriched for GWAS associations by examining the quantile-quantile (QQ) plot, which was made for HP SNPs vs. LP SNPs after LD-clumping. The SNP sets not enriched for GWAS associations were not expected to improve risk prediction in 2D PRS. Thus, for each disease, we only included HP SNP sets for 2D PRS when they showed strong enrichment in QQ plots. Interestingly, blood eSNPs were enriched for almost all diseases. CR-SNPs showed modest enrichment for the majority of the diseases. Thus, blood eSNPs and CR-SNPs were used for 2D PRS for all diseases. In addition, eSNPs and meSNPs derived in lung tissues were enriched in lung cancer GWAS of both European and Asian ancestry. The SNPs related with enhancer and active promoter regions (characterized by H3K4me3, H3K9-14Ac, H3K36me3, H3K4me1, H3K9ac and H3K9me3) were enriched for GWAS associations but SNPs related with the repressive regions (characterized by H3K27me3) were not. Thus, we included SNPs related with these enhancer and active promoter regions for 2D PRS. DHS SNPs were not strongly enriched and thus were excluded. Recently, we have shown a significantly shared genetic component between lung cancer and bladder cancer risk [60]. Thus, we also used HP SNPs derived based on lung tissues or cell lines for predicting bladder cancer risk. Furthermore, we found that SNPs identified through pleiotropic analysis were enriched in multiple diseases. For example, SNPs with P-value <0.001 in GWAS of height, HDL, LDL, TC, TG, WC, obesity, HIP and T2D were enriched in lung cancer GWAS. Because our 2D PRS methods required a relatively large number of HP SNPs to achieve improvement, we combined the SNPs with P-value <10⁻³ (or 10⁻²) in at least one trait into an HP SNP set referred to as PT-0.001 (or PT-0.01).

Testing the statistical significance of improvement for risk prediction

For WTCCC and three cancer GWAS data sets with individual genotype data, we used K-fold cross-validation to estimate prediction R². Here, K = 5 for WTCCC data and K = 10 for cancer GWAS data. We were interested in testing whether the prediction of a new PRS method was significantly better than that of the standard 1D PRS defined in Eq (1). For the i^th cross-validation, we denote $R_{i, 0}^{2}$ as the maximum prediction for the standard 1D PRS optimized across P-value thresholds, $R_{i, 1}^{2}$ as the maximum prediction for a new PRS method optimized across all P-value thresholds for 1D PRS and all pairs of P-value thresholds for 2D PRS. We defined $δ_{i} = R_{i, 1}^{2} - R_{i, 0}^{2}$ and estimated its variance as ${\hat{σ}}^{2} = Σ_{i = 1}^{K} {(δ_{i} - \bar{δ})}^{2} / (K - 1)$ with $\bar{δ} = Σ_{i = 1}^{K} δ_{i} / K$ . We calculated the statistic $T = \bar{δ} / \sqrt{{\hat{σ}}^{2} / K}$ and evaluated its significance using the t-distribution. For the five diseases with independent validation samples, we used bootstrap to estimate the variance of the R² estimates to test significance [29].
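The cross-validation test statistic defined above can be computed in a few lines. The sketch below follows the formulas for δ_i, the variance estimate and T. Two choices are assumptions that the text does not specify: the reference distribution is taken to be Student's t with K − 1 degrees of freedom, and the reported P-value is the one-sided upper tail. The tail probability is obtained by Simpson's rule so that only the standard library is needed.

```python
import math


def t_upper_tail(t_stat: float, df: int, steps: int = 2000) -> float:
    """P(T_df > t_stat), integrating the Student-t density by Simpson's rule."""
    if t_stat < 0:
        return 1.0 - t_upper_tail(-t_stat, df, steps)
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x: float) -> float:
        return norm * (1 + x * x / df) ** (-(df + 1) / 2)

    h = t_stat / steps                      # steps is even, as Simpson requires
    total = pdf(0.0) + pdf(t_stat)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3              # 0.5 minus the area on [0, t_stat]


def cv_improvement_test(r2_new, r2_standard):
    """Return (T, one-sided P) for per-fold gains of a new PRS over the standard PRS."""
    k = len(r2_new)
    deltas = [a - b for a, b in zip(r2_new, r2_standard)]
    mean_delta = sum(deltas) / k
    var_delta = sum((d - mean_delta) ** 2 for d in deltas) / (k - 1)
    t_stat = mean_delta / math.sqrt(var_delta / k)
    return t_stat, t_upper_tail(t_stat, k - 1)


if __name__ == "__main__":
    # Hypothetical per-fold R² values for K = 5 folds.
    new = [0.02, 0.03, 0.04, 0.05, 0.06]
    std = [0.01, 0.01, 0.01, 0.01, 0.01]
    print(cv_improvement_test(new, std))
```

Because the folds share training data, the δ_i are not strictly independent, so a P-value obtained this way is best read as approximate.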

Theoretical prediction performance assuming independent SNPs

Suppose that for a given trait of interest Y, there are two predefined SNP sets: the high priority (HP) SNP set S₁ and the low priority (LP) SNP set S₂. SNPs have been pruned and are in linkage equilibrium. We assume that S₁ has M₁ independent susceptibility SNPs and M₃ null SNPs while S₂ has M₂ susceptibility SNPs and M₄ independent null SNPs. Following Chatterjee et al. [11], we assume that the true relationship between outcome Y and independent susceptibility SNPs is modeled as follows:

$Y = \sum_{i = 1}^{M_{1}} β_{1 i} g_{1 i} + \sum_{j = 1}^{M_{2}} β_{2 j} g_{2 j} + \sum_{k = 1}^{M_{3}} 0 \cdot g_{3 k} + \sum_{l = 1}^{M_{4}} 0 \cdot g_{4 l} + ϵ,$

where all Y and the genotypic values g’s are standardized so that E(Y) = 0, Var(Y) = 1, E(g) = 0 and Var(g) = 1, and the error term ϵ ~ N(0, σ²) and is independent of the genotypic values.

From a discovery GWAS data set of size N, we have the regression coefficient ${\hat{β}}_{i}$ and the two-sided P-value P_i for each SNP. We build an additive prediction model by including SNPs in S₁ with P-value ≤ α₁ and SNPs in S₂ with P-value ≤ α₂:

$\hat{Y} (α_{1}, α_{2}) = \sum_{i = 1}^{M_{1}} {\hat{β}}_{1 i} γ_{1 i} (α_{1}) g_{1 i} + \sum_{j = 1}^{M_{2}} {\hat{β}}_{2 j} γ_{2 j} (α_{2}) g_{2 j} + \sum_{k = 1}^{M_{3}} {\hat{β}}_{3 k} γ_{3 k} (α_{1}) g_{3 k} + \sum_{l = 1}^{M_{4}} {\hat{β}}_{4 l} γ_{4 l} (α_{2}) g_{4 l},$

where γ (α) = I (P ≤ α) with I (⋅) being an indicator function.
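The two-threshold prediction model can be written compactly in code. The sketch below is illustrative: it computes a score in which HP SNPs enter when P ≤ α₁ and LP SNPs enter when P ≤ α₂, and then searches a grid of threshold pairs for the highest squared correlation with the outcome. The function names, the boolean `is_hp` flag and the grid are assumptions made for the example.

```python
import math


def prs_2d(genotypes, beta_hat, pvals, is_hp, alpha_hp, alpha_lp):
    """Score each individual: sum of beta_hat * genotype over selected SNPs."""
    scores = []
    for row in genotypes:
        total = 0.0
        for g, b, p, hp in zip(row, beta_hat, pvals, is_hp):
            if p <= (alpha_hp if hp else alpha_lp):   # gamma(alpha) = I(P <= alpha)
                total += b * g
        scores.append(total)
    return scores


def squared_cor(x, y):
    """Squared Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy / math.sqrt(sxx * syy)) ** 2


def best_thresholds(genotypes, y, beta_hat, pvals, is_hp, grid):
    """Grid search over (alpha_hp, alpha_lp); returns (R², alpha_hp, alpha_lp)."""
    best = (-1.0, None, None)
    for a_hp in grid:
        for a_lp in grid:
            scores = prs_2d(genotypes, beta_hat, pvals, is_hp, a_hp, a_lp)
            r2 = squared_cor(scores, y)
            if r2 > best[0]:
                best = (r2, a_hp, a_lp)
    return best


if __name__ == "__main__":
    # Toy data: three SNPs, the second one flagged as high-prior.
    geno = [[0, 0, 1], [1, 0, 0], [0, 2, 2], [2, 1, 0], [1, 2, 1]]
    beta = [0.5, 0.2, -0.3]
    pval = [1e-4, 0.02, 0.3]
    hp = [False, True, False]
    y = [0.5 * r[0] + 0.2 * r[1] for r in geno]
    print(best_thresholds(geno, y, beta, pval, hp, grid=[1e-3, 0.05, 0.5]))
```

Setting `alpha_hp == alpha_lp` recovers a single-threshold (1D) score, so the same functions can be used to compare the two approaches on the same data. Selecting thresholds and measuring R² on the same sample, as in this toy usage, gives an optimistic estimate.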

The predictive correlation coefficient (PCC) for the predictive model can be expressed as

$\begin{array}{l} \mathrm{PCC} (α_{1}, α_{2}) = \mathrm{cor} (Y, \hat{Y} (α_{1}, α_{2})) \\ = \frac{\sum_{i = 1}^{M_{1}} β_{1 i} {\hat{β}}_{1 i} γ_{1 i} (α_{1}) + \sum_{j = 1}^{M_{2}} β_{2 j} {\hat{β}}_{2 j} γ_{2 j} (α_{2})}{\sqrt{\sum_{i = 1}^{M_{1}} {\hat{β}}_{1 i}^{2} γ_{1 i} (α_{1}) + \sum_{j = 1}^{M_{2}} {\hat{β}}_{2 j}^{2} γ_{2 j} (α_{2}) + \sum_{k = 1}^{M_{3}} {\hat{β}}_{3 k}^{2} γ_{3 k} (α_{1}) + \sum_{l = 1}^{M_{4}} {\hat{β}}_{4 l}^{2} γ_{4 l} (α_{2})}} . \end{array}$

Following Chatterjee et al. (2014), one can verify that PCC follows a normal distribution by the central limit theorem and the strong law of large numbers. Therefore, the expected value of PCC can be approximated as

$\begin{array}{l} E (\mathrm{PCC} (α_{1}, α_{2})) = \frac{\sum_{i = 1}^{M_{1}} β_{1 i} e_{N, α_{1}} (β_{1 i}) \mathrm{pow} (N, β_{1 i}, α_{1}) + \sum_{j = 1}^{M_{2}} β_{2 j} e_{N, α_{2}} (β_{2 j}) \mathrm{pow} (N, β_{2 j}, α_{2})}{\sqrt{\sum_{i = 1}^{M_{1}} ν_{N, α_{1}} (β_{1 i}) \mathrm{pow} (N, β_{1 i}, α_{1}) + \sum_{j = 1}^{M_{2}} ν_{N, α_{2}} (β_{2 j}) \mathrm{pow} (N, β_{2 j}, α_{2}) + M_{3} α_{1} ν_{N, α_{1}} (0) + M_{4} α_{2} ν_{N, α_{2}} (0)}} \\ \approx \frac{M_{1} \int β e_{N, α_{1}} (β) \mathrm{pow} (N, β, α_{1}) f_{1} (β) d β + M_{2} \int β e_{N, α_{2}} (β) \mathrm{pow} (N, β, α_{2}) f_{2} (β) d β}{\sqrt{M_{1} \int ν_{N, α_{1}} (β) \mathrm{pow} (N, β, α_{1}) f_{1} (β) d β + M_{2} \int ν_{N, α_{2}} (β) \mathrm{pow} (N, β, α_{2}) f_{2} (β) d β + M_{3} α_{1} ν_{N, α_{1}} (0) + M_{4} α_{2} ν_{N, α_{2}} (0)}} , \end{array}$

where $e_{N, α} (β) = E (\hat{β} | γ (α) = 1)$ , $ν_{N, α} (β) = E ({\hat{β}}^{2} | γ (α) = 1)$ , pow (N, β, α) is power to detect a SNP with effect size β at a significance level α in a GWAS with size N, and f₁(⋅) and f₂(⋅) are effect-size distributions for HP and LP susceptibility SNPs, respectively.
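To give a feel for the quantities pow(N, β, α) and e_{N, α}(β), the sketch below evaluates them under a simplifying assumption that is not stated in the text: the estimate of a standardized effect is treated as normally distributed around β with variance 1/N, and a SNP is selected when the two-sided z-test passes level α. Under that assumption both quantities have closed forms based on the standard normal density and distribution function. The function names are illustrative.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal


def power(n: int, beta: float, alpha: float) -> float:
    """P(|Z + sqrt(n) * beta| > z_{1 - alpha/2}) for Z ~ N(0, 1)."""
    z = _STD.inv_cdf(1 - alpha / 2)
    mu = math.sqrt(n) * beta
    return _STD.cdf(mu - z) + _STD.cdf(-mu - z)


def mean_given_selected(n: int, beta: float, alpha: float) -> float:
    """E(beta_hat | selected) under beta_hat ~ N(beta, 1/n).

    Uses E[Z; Z > a] = phi(a) and E[Z; Z < b] = -phi(b) for the two tails.
    """
    z = _STD.inv_cdf(1 - alpha / 2)
    mu = math.sqrt(n) * beta
    tail_mean = (_STD.pdf(z - mu) - _STD.pdf(z + mu)) / math.sqrt(n)
    return beta + tail_mean / power(n, beta, alpha)


if __name__ == "__main__":
    n, alpha = 10_000, 1e-4
    for beta in (0.01, 0.03, 0.10):
        print(beta, power(n, beta, alpha), mean_given_selected(n, beta, alpha))
```

For a weakly powered SNP the conditional mean exceeds the true β, which is the winner's curse; for a strongly powered SNP the two nearly coincide, so a correction matters most for SNPs near the selection threshold.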

In our numerical calculations, we assumed that the effect sizes of the susceptibility SNPs in the HP and LP sets followed the same distribution $β \sim π N (0, σ_{1}^{2}) + (1 - π) N (0, σ_{2}^{2})$ , consistent with simulations. We performed a grid search to identify the P-value thresholds (α₁, α₂) that maximize E(PCC(α₁, α₂)). For binary disease outcomes, AUC can be expressed as a function of PCC [11].

Supporting Information

S1 Table. Optimal P-value thresholds for including SNPs for 1D and 2D PRS in simulation studies.

(DOC)

S2 Table. GWAS and functional annotation data for developing genetic risk prediction models.

(DOC)

S3 Table. Prediction R², Nagelkerke R² and AUC for five large scale GWAS summary statistics with independent validation data.

(DOC)

S4 Table. P-values for testing whether a PRS statistically significantly improved the risk prediction for five large-scale GWAS summary statistics based on bootstrap.

(DOC)

S5 Table. Optimal P-value thresholds for including SNPs for 1D and 2D PRS for five diseases with large-scale discovery data and independent validation samples.

(DOC)

S6 Table. Prediction R², Nagelkerke R² and AUC in WTCCC, based on five-fold cross-validation.

(DOC)

S7 Table. Optimal P-value thresholds for including SNPs for 1D and 2D PRS for WTCCC data.

(DOC)

S8 Table. Prediction R², Nagelkerke R² and AUC in the three cancer GWAS data sets, based on 10-fold cross-validation.

(DOC)

S9 Table. P-values for testing whether a PRS significantly improved the risk prediction for three cancer GWAS.

(DOC)

S10 Table. Optimal P-value thresholds for including SNPs for 1D and 2D PRS for three cancer GWAS.

(DOC)

S11 Table. Calibration comparison for 1D PRS modeling with or without winner’s curse correction.

(DOC)

S12 Table. Implication of identifying high-risk subjects based on PRS.

(DOCX)

S1 Text. Additional acknowledgements.

(DOC)

S1 Fig. Randomly selected SNPs and SNPs related with conserved genomic regions (CR-SNPs) have different local linkage disequilibrium (LD) pattern.

(TIF)

S2 Fig. The prediction R² for four diseases with large-scale discovery samples.

(TIF)

Acknowledgments

This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, MD (http://biowulf.nih.gov). This study made use of data generated by the Wellcome Trust Case Control Consortium (WTCCC). A full list of the investigators who contributed to the generation of the data is available at www.wtccc.org.uk. We thank Hilary Kiyo Finucane and Alkes Price for providing the annotation data for conserved DNA regions. We would like to acknowledge all the investigators, their support staff, and their funding support who contributed to GWAS of lung cancer among non-smoking females in Asia, as part of the Female Lung Cancer Consortium in Asia (FLCCA), described in reference 39. We would like to acknowledge all the investigators, their support staff, and their funding support who contributed to GWAS of bladder cancer, described in reference 36 and in reference 37. More information for the MGS consortium, the GECCO consortium, the GAME-ON/TRICL consortium, the GAME-ON/PRACTICAL consortium, the GAME-ON/ELLIPSE consortium and the PanScan consortium can be found in S1 Text.

Data Availability

The GWAS genotype data are not publicly available, in order to protect patient privacy. Summary-level data or genotype data can be applied for from dbGaP or from the specific GWAS consortium. Access to WTCCC data is available by application to the Wellcome Trust Case Control Consortium Data Access Committee following the link https://www.sanger.ac.uk/legal/DAA/MasterController. Access to the GWAS of pancreatic cancer can be applied for through the PanC4 consortium (Email: eduell@iconcologia.net; Website: www.panc4.org). Access to the colorectal cancer GWAS data can be applied for through the GECCO Consortium (Genetics and Epidemiology of Colorectal Cancer Consortium) (Dr. Ulrike Peters, Member, Fred Hutchinson Cancer Research Center. Email: upeters@fhcrc.org). Summary-level data for European lung cancer can be applied for from the TRICL consortium (Transdisciplinary Research in Cancer of the Lung) (Dr. Christopher I Amos, Norris Cotton Cancer Center, Dartmouth College. Email: Christopher.I.Amos@dartmouth.edu). Summary-level data for prostate cancer GWAS can be applied for from the PRACTICAL consortium (Prostate Cancer Association Group to Investigate Cancer Associated Alterations in the Genome. Website: http://practical.ccge.medschl.cam.ac.uk/) and the GAME-ON/ELLIPSE consortium (Elucidating Loci Involved in Prostate Cancer Susceptibility. Website: http://epi.grants.cancer.gov/gameon/index.html). Access to the following GWAS individual-level data can be applied for through the dbGaP website (https://www.ncbi.nlm.nih.gov/gap): Female Lung Cancer Consortium in Asia (FLCCA), phs000716.v1.p1; bladder cancer, phs000346.v1.p1; Molecular Genetics of Schizophrenia, phs000167.v1.p1; Genetic Epidemiology Research on Adult Health and Aging (GERA), phs000674.v1.p1; Lung cancer GWAS in EAGLE (Environment and Genetics in Lung Cancer Etiology Study), phs000093.v2.p2.

Funding Statement

The research was supported by the NIH Intramural Research program. The TRICL Consortium was supported by NIH grant U19 CA148127. Funding for GECCO consortium: National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services (U01 CA137088; R01 CA059045). ASTERISK: a Hospital Clinical Research Program (PHRC) and supported by the Regional Council of Pays de la Loire, the Groupement des Entreprises Françaises dans la Lutte contre le Cancer (GEFLUC), the Association Anne de Bretagne Génétique and the Ligue Régionale Contre le Cancer (LRCC). COLO2&3: National Institutes of Health (R01 CA60987). DACHS: German Research Council (Deutsche Forschungsgemeinschaft, BR 1704/6-1, BR 1704/6-3, BR 1704/6-4 and CH 117/1-1), and the German Federal Ministry of Education and Research (01KH0404 and 01ER0814). DALS: National Institutes of Health (R01 CA48998 to M. L. Slattery). HPFS is supported by the National Institutes of Health (P01 CA 055075, UM1 CA167552, R01 137178, R01 CA151993 and P50 CA127003), NHS by the National Institutes of Health (UM1 CA186107, R01 CA137178, P01 CA87969, R01 CA151993 and P50 CA127003) and PHS by the National Institutes of Health (R01 CA042182). MEC: National Institutes of Health (R37 CA54281, P01 CA033619, and R01 CA63464). OFCCR: National Institutes of Health, through funding allocated to the Ontario Registry for Studies of Familial Colorectal Cancer (U01 CA074783); see CCFR section above. Additional funding toward genetic analyses of OFCCR includes the Ontario Research Fund, the Canadian Institutes of Health Research, and the Ontario Institute for Cancer Research, through generous support from the Ontario Ministry of Research and Innovation. PMH: National Institutes of Health (R01 CA076366 to P.A. Newcomb). VITAL: National Institutes of Health (K05 CA154337). WHI: The WHI program is funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services through contracts HHSN268201100046C, HHSN268201100001C, HHSN268201100002C, HHSN268201100003C, HHSN268201100004C, and HHSN271201100004C. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Allen HL, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467(7317):832–8. doi:10.1038/nature09410
2. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet. 2014;46(11):1173–86. doi:10.1038/ng.3097
3. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Felix R, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518(7538):197–206. doi:10.1038/nature14177
4. Michailidou K, Beesley J, Lindstrom S, Canisius S, Dennis J, Lush MJ, et al. Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer. Nat Genet. 2015;47(4):373–80. doi:10.1038/ng.3242
5. Al Olama AA, Kote-Jarai Z, Berndt SI, Conti DV, Schumacher F, Han Y, et al. A meta-analysis of 87,040 individuals identifies 23 new susceptibility loci for prostate cancer. Nat Genet. 2014;46(10):1103–9. doi:10.1038/ng.3094
6. Mavaddat N, Pharoah PD, Michailidou K, Tyrer J, Brook MN, Bolla MK, et al. Prediction of breast cancer risk based on profiling with common genetic variants. J Natl Cancer Inst. 2015;107(5). doi:10.1093/jnci/djv036
7. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–9. doi:10.1038/ng.608
8. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82. doi:10.1016/j.ajhg.2010.11.011
9. Park JH, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet. 2010;42(7):570–5. doi:10.1038/ng.610
10. Dudbridge F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 2013;9(3):e1003348. doi:10.1371/journal.pgen.1003348
11. Chatterjee N, Wheeler B, Sampson J, Hartge P, Chanock SJ, Park JH. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet. 2013;45(4):400–5. doi:10.1038/ng.2579
12. Stahl EA, Wegmann D, Trynka G, Gutierrez-Achury J, Do R, Voight BF, et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat Genet. 2012;44(5):483–9. doi:10.1038/ng.2232
13. Chatterjee N, Shi J, Garcia-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat Rev Genet. 2016;17(7):392–406. doi:10.1038/nrg.2016.27
14. Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460(7256):748–52. doi:10.1038/nature08185
15. International Schizophrenia Consortium, Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460(7256):748–52. doi:10.1038/nature08185
16. Golan D, Rosset S. Effective Genetic-Risk Prediction Using Mixed Models. Am J Hum Genet. 2014;95(4):383–93. doi:10.1016/j.ajhg.2014.09.007
17. Speed D, Balding DJ. MultiBLUP: improved SNP-based prediction for complex traits. Genome Research. 2014;24(9):1550–7. doi:10.1101/gr.169375.113
18. Maier R, Moser G, Chen GB, Ripke S, Coryell W, Potash JB, et al. Joint Analysis of Psychiatric Disorders Increases Accuracy of Risk Prediction for Schizophrenia, Bipolar Disorder, and Major Depressive Disorder. Am J Hum Genet. 2015;96(2):283–94. doi:10.1016/j.ajhg.2014.12.006
19. Tibshirani R. Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met. 1996;58(1):267–88.
20. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. doi:10.1086/519795
21. Schork AJ, Thompson WK, Pham P, Torkamani A, Roddey JC, Sullivan PF, et al. All SNPs Are Not Created Equal: Genome-Wide Association Studies Reveal a Consistent Pattern of Enrichment among Functionally Annotated SNPs. PLoS Genet. 2013;9(4):e1003449. doi:10.1371/journal.pgen.1003449
22. Gusev A, Lee SH, Trynka G, Finucane H, Vilhjalmsson BJ, Xu H, et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am J Hum Genet. 2014;95(5):535–52. doi:10.1016/j.ajhg.2014.10.004
23. Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh PR, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet. 2015;47(11):1228–35. doi:10.1038/ng.3404
24. Garner C. Upward bias in odds ratio estimates from genome-wide association studies. Genet Epidemiol. 2007;31(4):288–95. doi:10.1002/gepi.20209
25. Sun L, Bull SB. Reduction of selection bias in genomewide studies by resampling. Genet Epidemiol. 2005;28(4):352–67. doi:10.1002/gepi.20068
26. Zhong H, Prentice RL. Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics. 2008;9(4):621–34. doi:10.1093/biostatistics/kxn001
27. Landi MT, Chatterjee N, Yu K, Goldin LR, Goldstein AM, Rotunno M, et al. A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am J Hum Genet. 2009;85(5):679–91. doi:10.1016/j.ajhg.2009.09.012
28. Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature. 2011;478(7370):476–82. doi:10.1038/nature10530
29. Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. doi:10.1038/nature11632
30. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–78. doi:10.1038/nature05911
31. Vilhjalmsson BJ, Yang J, Finucane HK, Gusev A, Lindstrom S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet. 2015;97(4):576–92. doi:10.1016/j.ajhg.2015.09.001
32. Park T, Casella G. The Bayesian Lasso. Journal of the American Statistical Association. 2008;103(482):681–6.
33. Kilpinen H, Waszak SM, Gschwind AR, Raghav SK, Witwicki RM, Orioli A, et al. Coordinated Effects of Sequence Variation on DNA Binding, Chromatin Structure, and Transcription. Science. 2013;342(6159):744–7. doi:10.1126/science.1242463
34. McVicker G, van de Geijn B, Degner JF, Cain CE, Banovich NE, Raj A, et al. Identification of Genetic Variants That Affect Histone Modifications in Human Cells. Science. 2013;342(6159):747–9. doi:10.1126/science.1242429
35. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–9. doi:10.1038/ng1847
36. Rothman N, Garcia-Closas M, Chatterjee N, Malats N, Wu X, Figueroa JD, et al. A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet. 2010;42(11):978–84. doi:10.1038/ng.687
37. Figueroa JD, Ye Y, Siddiq A, Garcia-Closas M, Chatterjee N, Prokunina-Olsson L, et al. Genome-wide association study identifies multiple loci associated with bladder cancer risk. Hum Mol Genet. 2014;23(5):1387–98. doi:10.1093/hmg/ddt519
38. Wolpin BM, Rizzato C, Kraft P, Kooperberg C, Petersen GM, Wang ZM, et al. Genome-wide association study identifies multiple susceptibility loci for pancreatic cancer. Nat Genet. 2014;46(9):994–1000. doi:10.1038/ng.3052
39. Lan Q, Hsiung CA, Matsuo K, Hong YC, Seow A, Wang ZM, et al. Genome-wide association analysis identifies new lung cancer susceptibility loci in never-smoking women in Asia. Nat Genet. 2012;44(12):1330–5. doi:10.1038/ng.2456
40. Voight BF, Scott LJ, Steinthorsdottir V, Morris AP, Dina C, Welch RP, et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet. 2010;42(7):579–89. doi:10.1038/ng.609
41. Ripke S, Neale BM, Corvin A, Walters JTR, Farh KH, Holmans PA, et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511(7510):421–7. doi:10.1038/nature13595
42. Shi JX, Levinson DF, Duan JB, Sanders AR, Zheng YL, Pe'er I, et al. Common variants on chromosome 6p22.1 are associated with schizophrenia. Nature. 2009;460(7256):753–7. doi:10.1038/nature08192
43. Timofeeva MN, Hung RJ, Rafnar T, Christiani DC, Field JK, Bickeboller H, et al. Influence of common genetic variation on lung cancer risk: meta-analysis of 14 900 cases and 29 485 controls. Hum Mol Genet. 2012;21(22):4980–95. doi:10.1093/hmg/dds334
44. Wang YF, Mckay JD, Rafnar T, Wang ZM, Timofeeva MN, Broderick P, et al. Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nat Genet. 2014;46(7):736–41. doi:10.1038/ng.3002
45. Peters U, Jiao S, Schumacher FR, Hutter CM, Aragaki AK, Baron JA, et al. Identification of Genetic Susceptibility Loci for Colorectal Tumors in a Genome-Wide Meta-analysis. Gastroenterology. 2013;144(4):799–807. doi:10.1053/j.gastro.2012.12.020
46. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5(6):e1000529. doi:10.1371/journal.pgen.1000529
47. Smoller JW, Craddock N, Kendler K, Lee PH, Neale BM, Nurnberger JI, et al. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet. 2013;381(9875):1371–9. doi:10.1016/S0140-6736(12)62129-1
48. Lee SH, Ripke S, Neale BM, Faraone SV, Purcell SM, Perlis RH, et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat Genet. 2013;45(9):984–94. doi:10.1038/ng.2711
49. Andreassen OA, Djurovic S, Thompson WK, Schork AJ, Kendler KS, O'Donovan MC, et al. Improved Detection of Common Variants Associated with Schizophrenia by Leveraging Pleiotropy with Cardiovascular-Disease Risk Factors. Am J Hum Genet. 2013;92(2):197–209. doi:10.1016/j.ajhg.2013.01.001
50. Battle A, Mostafavi S, Zhu XW, Potash JB, Weissman MM, McCormick C, et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Research. 2014;24(1):14–24. doi:10.1101/gr.155192.113
51. Westra HJ, Peters MJ, Esko T, Yaghootkar H, Schurmann C, Kettunen J, et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat Genet. 2013;45(10):1238–U195. doi:10.1038/ng.2756
52. Hao K, Bosse Y, Nickle DC, Pare PD, Postma DS, Laviolette M, et al. Lung eQTLs to Help Reveal the Molecular Underpinnings of Asthma. PLoS Genet. 2012;8(11):e1003029. doi:10.1371/journal.pgen.1003029
53. Shi J, Marconett CN, Duan J, Hyland PL, Li P, Wang Z, et al. Characterizing the genetic basis of methylome diversity in histologically normal human lung tissue. Nat Commun. 2014;5:3365. doi:10.1038/ncomms4365
54. Grundberg E, Small KS, Hedman AK, Nica AC, Buil A, Keildson S, et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet. 2012;44(10):1084–9. doi:10.1038/ng.2394
55. Grundberg E, Meduri E, Sandling JK, Hedman AK, Keildson S, Buil A, et al. Global analysis of DNA methylation variation in adipose tissue from twins reveals links to disease-associated variants in distal regulatory elements. Am J Hum Genet. 2013;93(5):876–90. doi:10.1016/j.ajhg.2013.10.004
56. Hazelett DJ, Rhie SK, Gaddis M, Yan CL, Lakeland DL, Coetzee SG, et al. Comprehensive Functional Annotation of 77 Prostate Cancer Risk Loci. PLoS Genet. 2014;10(1):e1004102. doi:10.1371/journal.pgen.1004102
57. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, Jackson AU, et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet. 2010;42(11):937–U53. doi:10.1038/ng.686
58. Berndt SI, Gustafsson S, Magi R, Ganna A, Wheeler E, Feitosa MF, et al. Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat Genet. 2013;45(5):501–U69. doi:10.1038/ng.2606
59. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, Koseki M, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466(7307):707–13. doi:10.1038/nature09270
60. Sampson JN, Wheeler WA, Yeager M, Panagiotou O, Wang Z, Berndt SI, et al. Analysis of Heritability and Shared Heritability Based on Genome-Wide Association Studies for Thirteen Cancer Types. J Natl Cancer Inst. 2015;107(12):djv279. doi:10.1093/jnci/djv279

Data Availability Statement

[pgen.1006493.ref001] 1.Allen HL, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467(7317):832–8. 10.1038/nature09410 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref002] 2.Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat Genet. 2014;46(11):1173–86. 10.1038/ng.3097 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref003] 3.Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Felix R, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015;518(7538):197–206. 10.1038/nature14177 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref004] 4.Michailidou K, Beesley J, Lindstrom S, Canisius S, Dennis J, Lush MJ, et al. Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer. Nat Genet. 2015;47(4):373–380. 10.1038/ng.3242 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref005] 5.Al Olama AA, Kote-Jarai Z, Berndt SI, Conti DV, Schumacher F, Han Y, et al. A meta-analysis of 87,040 individuals identifies 23 new susceptibility loci for prostate cancer. Nat Genet. 2014;46(10):1103–9. 10.1038/ng.3094 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref006] 6.Mavaddat N, Pharoah PD, Michailidou K, Tyrer J, Brook MN, Bolla MK, et al. Prediction of breast cancer risk based on profiling with common genetic variants. J Natl Cancer Inst. 2015;107(5). 10.1093/jnci/djv036 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref007] 7.Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565–9. 10.1038/ng.608 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref008] 8.Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet. 2011;88(1):76–82. 10.1016/j.ajhg.2010.11.011 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref009] 9.Park JH, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, et al. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet. 2010;42(7):570–575. 10.1038/ng.610 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref010] 10.Dudbridge F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 2013;9(3):e1003348 10.1371/journal.pgen.1003348 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref011] 11.Chatterjee N, Wheeler B, Sampson J, Hartge P, Chanock SJ, Park JH. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nat Genet. 2013;45(4):400–5,. 10.1038/ng.2579 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref012] 12.Stahl EA, Wegmann D, Trynka G, Gutierrez-Achury J, Do R, Voight BF, et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nat Genet. 2012;44(5):483–9. 10.1038/ng.2232 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref013] 13.Chatterjee N, Shi J, Garcia-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat Rev Genet. 2016;17(7):392–406. 10.1038/nrg.2016.27 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref014] 14.Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, Sullivan PF, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460(7256):748–52. 10.1038/nature08185 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref015] 15.International Schizophrenia C, Purcell SM, Wray NR, Stone JL, Visscher PM, O'Donovan MC, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460(7256):748–52. 10.1038/nature08185 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref016] 16.Golan D, Rosset S. Effective Genetic-Risk Prediction Using Mixed Models. Am J Hum Genet. 2014;95(4):383–93. 10.1016/j.ajhg.2014.09.007 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref017] 17.Speed D, Balding DJ. MultiBLUP: improved SNP-based prediction for complex traits. Genome Research. 2014;24(9):1550–7. 10.1101/gr.169375.113 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref018] 18.Maier R, Moser G, Chen GB, Ripke S, Coryell W, Potash JB, et al. Joint Analysis of Psychiatric Disorders Increases Accuracy of Risk Prediction for Schizophrenia, Bipolar Disorder, and Major Depressive Disorder. A J Hum Genet. 2015;96(2):283–94. 10.1016/j.ajhg.2014.12.006 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref019] 19.Tibshirani R. Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met. 1996;58(1):267–88. [Google Scholar]

[pgen.1006493.ref020] 20.Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. 10.1086/519795 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref021] 21.Schork AJ, Thompson WK, Pham P, Torkamani A, Roddey JC, Sullivan PF, et al. All SNPs Are Not Created Equal: Genome-Wide Association Studies Reveal a Consistent Pattern of Enrichment among Functionally Annotated SNPs. PLoS Genet. 2013;9(4): e1003449 10.1371/journal.pgen.1003449 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref022] 22.Gusev A, Lee SH, Trynka G, Finucane H, Vilhjalmsson BJ, Xu H, et al. Partitioning heritability of regulatory and cell-type-specific variants across 11 common diseases. Am J Hum Genet. 2014;95(5):535–52. 10.1016/j.ajhg.2014.10.004 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref023] 23.Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh PR, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet. 2015; 47(11):1228–35. 10.1038/ng.3404 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref024] 24.Garner C. Upward bias in odds ratio estimates from genome-wide association studies. Genet Epidemiol. 2007;31(4):288–95. 10.1002/gepi.20209 [DOI] [PubMed] [Google Scholar]

[pgen.1006493.ref025] 25.Sun L, Bull SB. Reduction of selection bias in genomewide studies by resampling. Genet Epidemiol. 2005;28(4):352–67. 10.1002/gepi.20068 [DOI] [PubMed] [Google Scholar]

[pgen.1006493.ref026] 26.Zhong H, Prentice RL. Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics. 2008;9(4):621–34. 10.1093/biostatistics/kxn001 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref027] 27.Landi MT, Chatterjee N, Yu K, Goldin LR, Goldstein AM, Rotunno M, et al. A genome-wide association study of lung cancer identifies a region of chromosome 5p15 associated with risk for adenocarcinoma. Am J Hum Genet. 2009;85(5):679–91. 10.1016/j.ajhg.2009.09.012 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref028] 28.Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature. 2011;478(7370):476–82. 10.1038/nature10530 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref029] 29.Altshuler DM, Durbin RM, Abecasis GR, Bentley DR, Chakravarti A, Clark AG, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56–65. 10.1038/nature11632 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1006493.ref030] 30.Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–78. 10.1038/nature05911 [DOI] [PMC free article] [PubMed] [Google Scholar]

31. Vilhjalmsson BJ, Yang J, Finucane HK, Gusev A, Lindstrom S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. Am J Hum Genet. 2015;97(4):576–92. doi: 10.1016/j.ajhg.2015.09.001

32. Park T, Casella G. The Bayesian Lasso. Journal of the American Statistical Association. 2008;103(482):681–6.

33. Kilpinen H, Waszak SM, Gschwind AR, Raghav SK, Witwicki RM, Orioli A, et al. Coordinated Effects of Sequence Variation on DNA Binding, Chromatin Structure, and Transcription. Science. 2013;342(6159):744–7. doi: 10.1126/science.1242463

34. McVicker G, van de Geijn B, Degner JF, Cain CE, Banovich NE, Raj A, et al. Identification of Genetic Variants That Affect Histone Modifications in Human Cells. Science. 2013;342(6159):747–9. doi: 10.1126/science.1242429

35. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–9. doi: 10.1038/ng1847

36. Rothman N, Garcia-Closas M, Chatterjee N, Malats N, Wu X, Figueroa JD, et al. A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet. 2010;42(11):978–84. doi: 10.1038/ng.687

37. Figueroa JD, Ye Y, Siddiq A, Garcia-Closas M, Chatterjee N, Prokunina-Olsson L, et al. Genome-wide association study identifies multiple loci associated with bladder cancer risk. Hum Mol Genet. 2014;23(5):1387–98. doi: 10.1093/hmg/ddt519

38. Wolpin BM, Rizzato C, Kraft P, Kooperberg C, Petersen GM, Wang ZM, et al. Genome-wide association study identifies multiple susceptibility loci for pancreatic cancer. Nat Genet. 2014;46(9):994–1000. doi: 10.1038/ng.3052

39. Lan Q, Hsiung CA, Matsuo K, Hong YC, Seow A, Wang ZM, et al. Genome-wide association analysis identifies new lung cancer susceptibility loci in never-smoking women in Asia. Nat Genet. 2012;44(12):1330–5. doi: 10.1038/ng.2456

40. Voight BF, Scott LJ, Steinthorsdottir V, Morris AP, Dina C, Welch RP, et al. Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet. 2010;42(7):579–89. doi: 10.1038/ng.609

41. Ripke S, Neale BM, Corvin A, Walters JTR, Farh KH, Holmans PA, et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511(7510):421–7. doi: 10.1038/nature13595

42. Shi JX, Levinson DF, Duan JB, Sanders AR, Zheng YL, Pe'er I, et al. Common variants on chromosome 6p22.1 are associated with schizophrenia. Nature. 2009;460(7256):753–7. doi: 10.1038/nature08192

43. Timofeeva MN, Hung RJ, Rafnar T, Christiani DC, Field JK, Bickeboller H, et al. Influence of common genetic variation on lung cancer risk: meta-analysis of 14 900 cases and 29 485 controls. Hum Mol Genet. 2012;21(22):4980–95. doi: 10.1093/hmg/dds334

44. Wang YF, Mckay JD, Rafnar T, Wang ZM, Timofeeva MN, Broderick P, et al. Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nat Genet. 2014;46(7):736–41. doi: 10.1038/ng.3002

45. Peters U, Jiao S, Schumacher FR, Hutter CM, Aragaki AK, Baron JA, et al. Identification of Genetic Susceptibility Loci for Colorectal Tumors in a Genome-Wide Meta-analysis. Gastroenterology. 2013;144(4):799–807. doi: 10.1053/j.gastro.2012.12.020

46. Howie BN, Donnelly P, Marchini J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 2009;5(6):e1000529. doi: 10.1371/journal.pgen.1000529

47. Smoller JW, Craddock N, Kendler K, Lee PH, Neale BM, Nurnberger JI, et al. Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis. Lancet. 2013;381(9875):1371–9. doi: 10.1016/S0140-6736(12)62129-1

48. Lee SH, Ripke S, Neale BM, Faraone SV, Purcell SM, Perlis RH, et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat Genet. 2013;45(9):984–94. doi: 10.1038/ng.2711

49. Andreassen OA, Djurovic S, Thompson WK, Schork AJ, Kendler KS, O'Donovan MC, et al. Improved Detection of Common Variants Associated with Schizophrenia by Leveraging Pleiotropy with Cardiovascular-Disease Risk Factors. Am J Hum Genet. 2013;92(2):197–209. doi: 10.1016/j.ajhg.2013.01.001

50. Battle A, Mostafavi S, Zhu XW, Potash JB, Weissman MM, McCormick C, et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Research. 2014;24(1):14–24. doi: 10.1101/gr.155192.113

51. Westra HJ, Peters MJ, Esko T, Yaghootkar H, Schurmann C, Kettunen J, et al. Systematic identification of trans eQTLs as putative drivers of known disease associations. Nat Genet. 2013;45(10):1238–U195. doi: 10.1038/ng.2756

52. Hao K, Bosse Y, Nickle DC, Pare PD, Postma DS, Laviolette M, et al. Lung eQTLs to Help Reveal the Molecular Underpinnings of Asthma. PLoS Genet. 2012;8(11):e1003029. doi: 10.1371/journal.pgen.1003029

53. Shi J, Marconett CN, Duan J, Hyland PL, Li P, Wang Z, et al. Characterizing the genetic basis of methylome diversity in histologically normal human lung tissue. Nat Commun. 2014;5:3365. doi: 10.1038/ncomms4365

54. Grundberg E, Small KS, Hedman AK, Nica AC, Buil A, Keildson S, et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet. 2012;44(10):1084–9. doi: 10.1038/ng.2394

55. Grundberg E, Meduri E, Sandling JK, Hedman AK, Keildson S, Buil A, et al. Global analysis of DNA methylation variation in adipose tissue from twins reveals links to disease-associated variants in distal regulatory elements. Am J Hum Genet. 2013;93(5):876–90. doi: 10.1016/j.ajhg.2013.10.004

56. Hazelett DJ, Rhie SK, Gaddis M, Yan CL, Lakeland DL, Coetzee SG, et al. Comprehensive Functional Annotation of 77 Prostate Cancer Risk Loci. PLoS Genet. 2014;10(1):e1004102. doi: 10.1371/journal.pgen.1004102

57. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, Jackson AU, et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nat Genet. 2010;42(11):937–U53. doi: 10.1038/ng.686

58. Berndt SI, Gustafsson S, Magi R, Ganna A, Wheeler E, Feitosa MF, et al. Genome-wide meta-analysis identifies 11 new loci for anthropometric traits and provides insights into genetic architecture. Nat Genet. 2013;45(5):501–U69. doi: 10.1038/ng.2606

59. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, Koseki M, et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature. 2010;466(7307):707–13. doi: 10.1038/nature09270

60. Sampson JN, Wheeler WA, Yeager M, Panagiotou O, Wang Z, Berndt SI, et al. Analysis of Heritability and Shared Heritability Based on Genome-Wide Association Studies for Thirteen Cancer Types. J Natl Cancer Inst. 2015;107(12):djv279. doi: 10.1093/jnci/djv279
