Author manuscript; available in PMC: 2024 Mar 27.
Published in final edited form as: Genet Epidemiol. 2012 May 29;36(5):517–524. doi: 10.1002/gepi.21644

Unidentified Genetic Variants Influence Pancreatic Cancer Risk: An Analysis of Polygenic Susceptibility in the PanScan Study

Brandon L Pierce 1,2,*, Lin Tong 1, Peter Kraft 3, Habibul Ahsan 1,2,4
PMCID: PMC10967700  NIHMSID: NIHMS1565352  PMID: 22644738

Abstract

Genome-wide association (GWA) studies have identified several pancreatic cancer (PanCa) susceptibility loci. Methods for assessment of polygenic susceptibility can be employed to detect the collective effect of additional association signals for PanCa. Using data on 492,651 autosomal single nucleotide polymorphisms (SNPs) from the PanScan GWA study (2,857 cases, 2,967 controls), we employed polygenic risk score (PRS) cross-validation (CV) methods to (a) confirm the existence of unidentified association signals, (b) assess the predictive value of PRSs, and (c) assess evidence for polygenic effects in specific genomic locations (genic vs. intergenic). After excluding SNPs in known PanCa susceptibility regions, we constructed PRS models using a training GWA dataset and then tested the model in an independent testing dataset using fourfold CV. We also employed a “power-replication” approach, where power to detect SNP associations was calculated using a training dataset, and power was tested for association with “replication status” in a testing dataset. PRSs constructed using ≥10% of genome-wide SNPs showed significant association with PanCa (P < 0.05) across the majority of CV analyses. Associations were stronger for PRSs restricted to genic SNPs compared to intergenic PRSs. The power-replication approach produced weaker associations that were not significant when restricting to SNPs with low pairwise linkage disequilibrium, whereas PRS results were robust to such restrictions. Although the PRS approach will not dramatically improve PanCa prediction, it provides strong evidence for unidentified association signals for PanCa. Our results suggest that focusing association studies on genic regions and conducting larger GWA studies can reveal additional PanCa susceptibility loci.

Keywords: polygenic, pancreatic cancer, risk score, cross-validation, genome-wide association study (GWAS), single nucleotide polymorphism (SNP)

INTRODUCTION

Individuals with a family history of pancreatic cancer (PanCa) are at increased risk of being diagnosed with the disease [Schenk et al., 2001; Tersmette et al., 2001] and genetic factors have been estimated to account for approximately 36% of the variance in PanCa incidence [Lichtenstein et al., 2000]. To date, four pancreatic susceptibility loci have been identified in genome-wide association (GWA) studies, with all four being identified by the Cancer Genetic Markers of Susceptibility (CGEMS) PanScan multistage study [Amundadottir et al., 2009; Petersen et al., 2010]. Recently, a fifth locus, in the HNF1A region, has been detected in an association study of PanScan data focused on variants with known effects on human biology [Pierce and Ahsan, 2011]. Because a small proportion of familial aggregation can be explained by known genetic factors (both common and rare [Shi et al., 2009]), it is likely that additional susceptibility variants exist. However, it is unknown to what extent this unexplained heritability is due to common variants with associations that are too weak to detect in the existing PanCa GWA data. Thus, it is unclear how successful larger, more powerful GWA studies will be at identifying additional susceptibility loci.

Recently, polygenic risk score (PRS) methods have been employed to assess the evidence for unidentified susceptibility loci for several diseases [Bush et al., 2010; Machiela et al., 2011; Purcell et al., 2009; Witte and Hoffmann, 2011]. This method involves constructing a multi-SNP model for disease risk based on many weakly associated single nucleotide polymorphisms (SNPs) that do not reach genome-wide significance levels. Then, using an independent set of study participants, the predictive ability of the model can be assessed, to determine if the model contains SNPs representing true susceptibility loci. Purcell et al. [2009] originally employed this approach to confirm the existence of SNPs showing weak associations with risk for schizophrenia and bipolar disorder. More recently, two groups have employed this approach to explore the degree to which unidentified SNPs associate with breast and prostate cancer [Machiela et al., 2011; Witte and Hoffmann, 2011].

Evidence of unidentified susceptibility variants for bipolar disorder has also been found using an alternative strategy that we refer to as the power-replication approach [Smith et al., 2011]. For this approach, power to detect association is calculated based on association estimates from a GWA dataset; these power estimates are then tested for association with replication status (P < 0.05 and consistent direction of association) in an independent GWA dataset. Observing such an association implies that SNPs with high power estimates are enriched for true association signals.

In this study, we use existing data from a large GWA study of PanCa (the PanScan study) to confirm existence of additional pancreatic susceptibility variants using both approaches described above: the PRS analysis and the power-replication approach. We also assess the predictive ability of the PRS and conduct all analyses stratified by genic and intergenic SNPs to determine where unidentified pancreatic susceptibility variants are likely to reside.

METHODS

The CGEMS PanScan-I and PanScan-II GWA studies have been previously described [Amundadottir et al., 2009; Petersen et al., 2010]. Briefly, cases and controls were drawn from 12 cohort studies and eight case-control studies. All cases were diagnosed with primary adenocarcinoma of the exocrine pancreas. Controls were matched to cases based on birth year, sex, and race/ethnicity and were free of PanCa at the time of diagnosis of the matched case. Sample quality control and genotyping were conducted at the National Cancer Institute’s Core Genotyping Facility using Illumina HumanHap550 and HumanHap550-Duo SNP arrays (PanScan-I) and Illumina Human 610-Quad arrays (PanScan-II) [Amundadottir et al., 2009; Petersen et al., 2010]. In total, CGEMS provided high-quality genotype data for 1,895 cases and 1,937 controls from PanScan-I and for 1,478 cases and 1,534 controls from PanScan-II (after excluding duplicate samples). All data were downloaded from the database of Genotypes and Phenotypes [Mailman et al., 2007] (Accession number: phs000206v3p2).

To create a single large dataset that could be randomly subdivided for cross-validation (CV) purposes, we combined the PanScan I and II data into one large GWA dataset and restricted to 493,619 autosomal SNPs that were present on all platforms and had call rates >0.95, minor allele frequencies (MAF) >0.04, and Hardy-Weinberg P-values >0.00001. These thresholds were chosen to be slightly more stringent than the original PanScan publications [Amundadottir et al., 2009; Petersen et al., 2010], due to the fact that we combined data across different genotyping platforms. We assessed population structure using ~12,000 SNPs with low pair-wise linkage disequilibrium (LD) [Yu et al., 2008] and high call rates (>99%) in PanScan and HapMap3 samples (from CEU, YRI, and CHB + JPT datasets). The EIGENSTRAT principal components analysis (PCA) program [Price et al., 2006] was used to identify and exclude participants who did not cluster tightly with the CEU HapMap samples (253 in PanScan-I; 753 in PanScan-II). Based on identity-by-descent estimates, one individual from each suspected first- or second-degree relative pair was removed (14 in PanScan-I; one in PanScan-II), resulting in a total sample size of 2,857 cases and 2,967 controls.

We then excluded the index SNPs for the five established PanCa susceptibility loci (rs9543325, rs3790844, rs401681, rs505922, and rs7310409) and 963 SNPs residing in the 1 Mb regions surrounding each of these index SNPs (500 kb on each side), resulting in 492,651 autosomal SNPs for analysis purposes. PCA was used to generate principal components of European ancestry [Price et al., 2006]. To evaluate the effects of removing SNPs to eliminate high pairwise LD values (as measured by r2), as done in prior PRS analyses of GWA data [Bush et al., 2010; Machiela et al., 2011; Purcell et al., 2009; Witte and Hoffmann, 2011], we pruned SNPs from our dataset based on LD using the PLINK --indep-pairwise command. Using this command, we derived three SNP datasets for analysis: a set of SNPs with no pairwise r2 > 0.2 (n = 94,488; as in Bush et al. [2010] and Machiela et al. [2011]), a set of SNPs with no pairwise r2 > 0.5 (n = 225,022; as in Witte and Hoffmann [2011]), and the full unpruned dataset (n = 492,651; as in Evans et al. [2009]).
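The idea behind LD pruning can be illustrated with a minimal sketch. This is not PLINK's windowed algorithm; it is a simplified greedy procedure, assumed here for illustration only, that keeps a SNP when its squared genotype correlation (r2) with every previously kept SNP does not exceed the chosen threshold. The function names and the toy genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two vectors of minor-allele counts."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as 0 here
    return cov * cov / (var_x * var_y)


def greedy_ld_prune(genotypes, threshold):
    """Keep SNPs (in input order) whose r2 with all kept SNPs is <= threshold.

    genotypes: dict mapping SNP id -> list of allele counts (0, 1, or 2).
    Simplification (assumption): every pair is compared, with no sliding window.
    """
    kept = []
    for snp, counts in genotypes.items():
        if all(r_squared(counts, genotypes[k]) <= threshold for k in kept):
            kept.append(snp)
    return kept


# Toy data (invented): snpB duplicates snpA, snpC is uncorrelated with snpA.
toy = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],
    "snpC": [1, 2, 0, 0, 2, 1],
}
print(greedy_ld_prune(toy, 0.2))
```

A stricter threshold (0.2) removes more SNPs than a looser one (0.5), which is why the three analysis datasets differ so much in size.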

To create datasets for fourfold CV analysis of the multi-SNP PRS models, we randomly assigned each PanScan participant to one of four groups of approximately equal size and case-control ratio (714 cases, 742 controls; Figure 1). Three of the four datasets were combined as a “training set” and the fourth dataset was used as the “testing set.” This process was repeated for all possible combinations of the four randomly generated groups, producing four training sets and four corresponding testing sets. The training sets were used to derive association estimates for all SNPs in the dataset, using unconditional logistic regression models adjusted for age, sex, and five principal components. We used a log-additive genetic model with the major allele as the reference category (i.e., 0, 1, or 2 minor alleles).
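The fourfold split described above could be sketched as follows. This is only an illustration under stated assumptions: cases and controls are shuffled separately so each group keeps roughly the same case-control ratio, and the participant identifiers, seed, and function names are hypothetical rather than taken from the study.

```python
import random


def stratified_folds(case_ids, control_ids, k=4, seed=0):
    """Randomly assign participants to k groups, balancing cases and controls."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for group in (case_ids, control_ids):
        shuffled = list(group)
        rng.shuffle(shuffled)
        # Deal shuffled participants round-robin so group sizes differ by at most one.
        for i, pid in enumerate(shuffled):
            folds[i % k].append(pid)
    return folds


def train_test_pairs(folds):
    """Each fold serves once as the testing set; the remaining folds form the training set."""
    pairs = []
    for i, test in enumerate(folds):
        train = [pid for j, fold in enumerate(folds) if j != i for pid in fold]
        pairs.append((train, test))
    return pairs


# Sample sizes from the text; the identifiers themselves are made up.
cases = [f"case{i}" for i in range(2857)]
controls = [f"ctrl{i}" for i in range(2967)]
folds = stratified_folds(cases, controls)
pairs = train_test_pairs(folds)
for train, test in pairs:
    print(len(train), len(test))
```

With these sample sizes each testing set holds about 714 cases and 742 controls, and each training set holds the remaining three quarters of the participants.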

Fig. 1.

Overview of the polygenic risk score CV analysis workflow.

For each individual in the testing set, the PRS was calculated as follows: using the results from the analysis of the training set, we first set a P-value threshold (0.0001, 0.001, 0.01, 0.1, 0.25, or 0.5) to select SNPs for inclusion in the PRS model. For each SNP that passed this P-value threshold, the number of minor alleles carried by each individual in the testing set (0, 1, or 2) was multiplied by the SNP’s log(OR) derived from the training set. For each individual, these weighted allele counts were then summed over all SNPs passing the threshold and divided by the total number of SNPs to produce the PRS (as implemented in the PLINK --score command [Purcell et al., 2007]). These scores were then tested for association with case-control status in the testing dataset, again using logistic regression. Four CV analyses were performed for each P-value threshold, producing four estimates of association for each threshold. The predictive ability of the PRS was assessed using the logistic regression coefficient of determination (R2) and the area under the receiver operating characteristic curve (AUC). A summary of the design for PRS analysis is shown in Figure 1. We chose fourfold CV because it uses the majority of the data for training (75%), but maintains a relatively large sample size for testing and prediction (25%) and provides four independent assessments of the PRS model for evaluation of consistency across testing sets.
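The score calculation described above can be written out as a short sketch. It assumes that "passing" the threshold means a training-set P-value strictly below it and that every selected SNP has a non-missing genotype; both are simplifying assumptions, and the SNP names, ORs, and P-values are invented for illustration. It is not a reimplementation of PLINK.

```python
import math


def polygenic_risk_score(allele_counts, training_results, p_threshold):
    """Average of (minor-allele count x training log(OR)) over selected SNPs.

    allele_counts: dict SNP id -> 0, 1, or 2 for one testing-set individual.
    training_results: dict SNP id -> (odds_ratio, p_value) from the training set.
    """
    selected = [snp for snp, (_, p) in training_results.items() if p < p_threshold]
    if not selected:
        return 0.0  # no SNP passes the threshold (assumed convention)
    weighted_sum = sum(
        allele_counts[snp] * math.log(training_results[snp][0]) for snp in selected
    )
    return weighted_sum / len(selected)


# Invented training results and one invented testing-set individual.
training = {"rs_a": (1.2, 0.03), "rs_b": (0.8, 0.20), "rs_c": (1.5, 0.60)}
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
score = polygenic_risk_score(person, training, p_threshold=0.25)
print(score)
```

In this toy case only rs_a and rs_b pass the 0.25 threshold, so the score is (2·log 1.2 + 1·log 0.8) / 2. Relaxing the threshold admits more SNPs, most of them null, which is why the score's association with disease depends on the threshold chosen.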

For the power-replication analysis, we used methods closely related to those described by Smith et al. [2011]. First, we randomly divided the PanScan participants into one training dataset and one testing dataset of nearly identical size and case-control ratio (1,429 cases, 1,484 controls) for twofold CV. Our rationale for using a 50:50 split and twofold CV is given at the end of this paragraph. For the training set, we obtained the odds ratio (OR) for each SNP using logistic regression. Based on this OR and the observed MAF for each SNP, we calculated the power to detect an association of the same magnitude in the testing dataset (714 cases, 742 controls) using the POWER procedure in SAS, version 9.2 (SAS Institute Inc., Cary, NC) [Shieh, 2000]. The POWER procedure calculated power based on the likelihood ratio chi-square test of a single predictor in binary logistic regression assuming a two-sided test and a significance threshold of 0.05. We then tested each SNP for association with PanCa in the testing dataset and recorded the P-value and OR. In the final step, we tested the association between power (based on ORs and MAFs in the training set) and the replication status in the testing set (P < 0.05 in the testing set with same direction of association as the training set) using logistic regression. To clarify, this regression was run on a dataset with SNPs as observations, not individuals. We used twofold CV (i.e., a 50:50 split of the data) for the power-replication analyses because preliminary analyses suggested that a testing set containing only 25% of the data did not produce consistent association estimates across the four CV sets. This is likely due to the fact that detecting polygenic effects using the power-replication method is dependent on detecting nominally significant (P < 0.05) associations for SNPs in the testing set, while the PRS method is not. Detecting a substantial number of significantly associated SNPs is less likely in smaller testing sets (e.g., fourfold CV).
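The replication criterion and the SNP-level tabulation can be illustrated with a brief sketch. The power values are taken as given inputs here (the study computed them with the SAS POWER procedure, which this sketch does not reproduce), and the function names, the single cut-point, and the toy SNP records are hypothetical.

```python
import math


def replicated(train_or, test_or, test_p, alpha=0.05):
    """True if the testing-set P-value is below alpha and both ORs point the same way."""
    same_direction = math.log(train_or) * math.log(test_or) > 0
    return test_p < alpha and same_direction


def percent_replicated_by_bin(snps, bin_edges):
    """Percentage of SNPs replicating within each power bin.

    snps: list of dicts with keys 'power', 'train_or', 'test_or', 'test_p'.
    bin_edges: ascending upper bounds; a final bin catches everything above the last edge.
    """
    totals = [0] * (len(bin_edges) + 1)
    hits = [0] * (len(bin_edges) + 1)
    for s in snps:
        b = sum(s["power"] > edge for edge in bin_edges)  # index of the bin for this SNP
        totals[b] += 1
        hits[b] += replicated(s["train_or"], s["test_or"], s["test_p"])
    return [100.0 * h / t if t else None for h, t in zip(hits, totals)]


# Invented SNP-level records; each row is one SNP, not one participant.
toy_snps = [
    {"power": 0.05, "train_or": 1.02, "test_or": 0.99, "test_p": 0.80},
    {"power": 0.30, "train_or": 1.20, "test_or": 1.15, "test_p": 0.01},
    {"power": 0.40, "train_or": 0.80, "test_or": 1.10, "test_p": 0.03},
    {"power": 0.10, "train_or": 0.90, "test_or": 0.85, "test_p": 0.04},
]
print(percent_replicated_by_bin(toy_snps, bin_edges=[0.25]))
```

The third toy SNP reaches P < 0.05 in the testing set but in the opposite direction, so it does not count as replicated. Under the null, about 2.5% of SNPs would be expected to meet both conditions by chance.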

To compare the relative contributions of SNPs near gene regions vs. intergenic SNPs to PanCa susceptibility, we repeated the PRS and power-replication analyses described above, but restricted to subsets of SNPs defined by their proximity to exons and transcribed regions. We obtained NCBI36/hg18 coordinates for the exons and transcribed regions for all refSeq genes from the UCSC genome browser (http://genome.ucsc.edu/cgi-bin/hgTables) in order to assign SNPs to categories (i.e., exon, transcribed region, intergenic) based on these coordinates. SNPs were assigned to these categories after expanding the size of the exon regions (by 5 kb on each side) and the transcribed regions (15 kb on each side) to allow for the inclusion of SNPs that are in LD with untyped (or pruned-out) SNPs that reside in exons and transcribed regions [Wang et al., 2010]. In other words, because LD often spans across exons and introns and into intergenic regions, it is difficult to definitively categorize the location of any association signal near these boundaries; thus we extended the boundaries to attempt to capture SNPs representing variation in exons and transcripts. In addition, expanding the size of the transcribed regions allows for inclusion of proximal regulatory variants. The boundaries of the exon regions were expanded by only 5 kb because using a larger extension (i.e., 15 kb) resulted in a set of SNPs that was very similar to the set of “transcript SNPs”; thus we used a 5 kb extension to create a set of SNPs that was distinct from the “transcript SNPs,” with increased enrichment for exonic variants. Of the 492,651 SNPs in the analysis dataset, approximately 3% resided in exons and 46% resided in transcribed regions. After expanding the exon and transcript windows, these percentages were 28% and 56%, respectively.
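The window-based assignment can be sketched as below. The sketch assumes a single chromosome, inclusive base-pair coordinates, and that a SNP outside every expanded transcript window is labeled intergenic; the gene coordinates and function names are invented for illustration.

```python
def in_any_window(position, regions, flank):
    """True if position falls within any (start, end) region extended by flank bp per side."""
    return any(start - flank <= position <= end + flank for start, end in regions)


def classify_snp(position, exons, transcripts, exon_flank=5_000, transcript_flank=15_000):
    """Return the set of location labels for one SNP on a single chromosome."""
    labels = set()
    if in_any_window(position, exons, exon_flank):
        labels.add("exon")
    if in_any_window(position, transcripts, transcript_flank):
        labels.add("transcript")
    else:
        labels.add("intergenic")  # assumed definition: outside all expanded transcripts
    return labels


# One hypothetical gene with two exons at either end of its transcript.
transcripts = [(100_000, 150_000)]
exons = [(100_000, 101_000), (149_000, 150_000)]
for pos in (104_000, 120_000, 160_000, 170_000):
    print(pos, sorted(classify_snp(pos, exons, transcripts)))
```

Because exons lie inside transcripts and the exon flank is smaller than the transcript flank, the "exon" set is a subset of the "transcript" set, while "transcript" and "intergenic" partition the SNPs.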

RESULTS

The results for the PRS analysis using the SNP set with pairwise r2 < 0.2 are shown in Table I. The PRS showed consistent evidence of positive association with pancreatic cancer risk (P < 0.05 and OR >1) across three of the four independent testing sets when P-value thresholds of 0.1, 0.25, or 0.5 were used to select SNPs from the training set. The other testing set (CV set 1) generated P-values that were borderline significant at these thresholds. These results provide robust evidence of undetected association signals in the PanScan dataset. Associations for the PRS were much less pronounced when training set P-value thresholds of 0.0001 and 0.001 were used to select SNPs for the PRS; the score based on a threshold of 0.01 showed moderate evidence of association. The predictive value of the PRS, as measured by R2 and the AUC, showed very mild increases as the P-value threshold for the PRS was relaxed. The choice of r2 value for LD-pruning did not have a substantial impact on our results. For example, when analyses were conducted using all 492,651 SNPs or a SNP set with pairwise r2 < 0.5, our results were similar (Table II).

TABLE I.

Association and prediction for polygenic risk scores in relation to PanCa risk using fourfold CV

| P threshold for PRS | CV set | Number of SNPs in PRS (a) | OR (b) | 95% CI | P | R2 (c) | AUC |
|---|---|---|---|---|---|---|---|
| 0.0001 | 1 | 12 | 0.90 | 0.81–1.00 | 0.05 | 0.012 | 0.557 |
| 0.0001 | 2 | 8 | 0.91 | 0.82–1.01 | 0.08 | 0.015 | 0.570 |
| 0.0001 | 3 | 6 | 0.94 | 0.85–1.05 | 0.27 | 0.005 | 0.539 |
| 0.0001 | 4 | 7 | 1.05 | 0.94–1.16 | 0.41 | 0.009 | 0.554 |
| 0.001 | 1 | 92 | 0.98 | 0.88–1.08 | 0.67 | 0.010 | 0.544 |
| 0.001 | 2 | 105 | 0.97 | 0.87–1.07 | 0.51 | 0.013 | 0.561 |
| 0.001 | 3 | 85 | 1.08 | 0.97–1.19 | 0.17 | 0.006 | 0.536 |
| 0.001 | 4 | 93 | 1.07 | 0.96–1.19 | 0.21 | 0.010 | 0.558 |
| 0.01 | 1 | 1,027 | 1.07 | 0.96–1.18 | 0.23 | 0.011 | 0.558 |
| 0.01 | 2 | 961 | 1.11 | 1.00–1.23 | 0.05 | 0.015 | 0.564 |
| 0.01 | 3 | 981 | 1.04 | 0.94–1.15 | 0.49 | 0.005 | 0.538 |
| 0.01 | 4 | 987 | 1.13 | 1.01–1.25 | 0.03 | 0.012 | 0.564 |
| 0.1 | 1 | 9,645 | 1.07 | 0.96–1.19 | 0.20 | 0.011 | 0.557 |
| 0.1 | 2 | 9,507 | 1.11 | 1.00–1.23 | 0.05 | 0.015 | 0.566 |
| 0.1 | 3 | 9,531 | 1.12 | 1.01–1.25 | 0.03 | 0.008 | 0.547 |
| 0.1 | 4 | 9,626 | 1.12 | 1.01–1.25 | 0.03 | 0.012 | 0.565 |
| 0.25 | 1 | 23,808 | 1.11 | 1.00–1.23 | 0.06 | 0.012 | 0.564 |
| 0.25 | 2 | 23,703 | 1.15 | 1.04–1.28 | 0.01 | 0.017 | 0.572 |
| 0.25 | 3 | 23,728 | 1.16 | 1.05–1.29 | 0.01 | 0.010 | 0.555 |
| 0.25 | 4 | 23,915 | 1.14 | 1.02–1.26 | 0.02 | 0.013 | 0.569 |
| 0.5 | 1 | 47,215 | 1.10 | 0.99–1.22 | 0.08 | 0.012 | 0.561 |
| 0.5 | 2 | 47,433 | 1.14 | 1.03–1.27 | 0.01 | 0.017 | 0.572 |
| 0.5 | 3 | 47,383 | 1.13 | 1.02–1.26 | 0.02 | 0.008 | 0.549 |
| 0.5 | 4 | 47,413 | 1.13 | 1.02–1.26 | 0.02 | 0.013 | 0.567 |

(a) SNPs in training set limited to those with low pairwise LD: r2 < 0.2 (n = 94,259).

(b) ORs based on PRS that has been divided by its standard deviation to generate ORs that correspond to a one standard deviation change in the PRS.

(c) Generalized R2 measure for the fitted logistic model.

PRS, polygenic risk score; CV, cross-validation; OR, odds ratio; CI, confidence interval; AUC, area under the curve.

TABLE II.

Association and prediction for polygenic risk scores in relation to PanCa using fourfold CV and varying the r2 threshold for LD pruning

| SNP set | CV set | Number of SNPs in PRS (a) | OR (b) | 95% CI | P | R2 (c) | AUC |
|---|---|---|---|---|---|---|---|
| All SNPs | 1 | 124,754 | 1.06 | 0.96–1.18 | 0.25 | 0.011 | 0.559 |
| All SNPs | 2 | 123,874 | 1.16 | 1.05–1.29 | 0.01 | 0.018 | 0.574 |
| All SNPs | 3 | 124,038 | 1.15 | 1.03–1.27 | 0.01 | 0.009 | 0.555 |
| All SNPs | 4 | 125,106 | 1.09 | 0.98–1.21 | 0.10 | 0.011 | 0.562 |
| Pairwise r2 < 0.5 | 1 | 56,999 | 1.10 | 0.99–1.23 | 0.07 | 0.012 | 0.566 |
| Pairwise r2 < 0.5 | 2 | 56,817 | 1.14 | 1.03–1.26 | 0.02 | 0.017 | 0.572 |
| Pairwise r2 < 0.5 | 3 | 56,410 | 1.18 | 1.06–1.32 | 0.00 | 0.011 | 0.559 |
| Pairwise r2 < 0.5 | 4 | 57,281 | 1.09 | 0.99–1.21 | 0.10 | 0.011 | 0.560 |
| Pairwise r2 < 0.2 | 1 | 23,808 | 1.11 | 1.00–1.23 | 0.06 | 0.012 | 0.564 |
| Pairwise r2 < 0.2 | 2 | 23,703 | 1.15 | 1.04–1.28 | 0.01 | 0.017 | 0.572 |
| Pairwise r2 < 0.2 | 3 | 23,728 | 1.16 | 1.05–1.29 | 0.01 | 0.010 | 0.555 |
| Pairwise r2 < 0.2 | 4 | 23,915 | 1.14 | 1.02–1.26 | 0.02 | 0.013 | 0.569 |

(a) The P-value threshold for selecting SNPs from the training set was 0.25.

(b) ORs based on PRS that has been divided by its standard deviation to generate ORs that correspond to a one standard deviation change in the PRS.

(c) Generalized R2 measure for the fitted logistic model.

PRS, polygenic risk score; CV, cross-validation; OR, odds ratio; CI, confidence interval; AUC, area under the curve.

The PRS analysis was then restricted to SNPs only in exonic regions (±5 kb) and transcribed regions (±15 kb) using a P-value threshold of 0.25 (selected based on the association results in Table I). In this analysis, associations for the PRSs based on coding region SNPs were stronger than those observed when analyses were restricted to nontranscript SNPs (Table III). This finding was generally consistent across all four CV sets.

TABLE III.

Association and prediction for polygenic risk scores based on exon, transcript, and nontranscript SNPs in relation to PanCa using fourfold CV

| SNPs in PRS | CV set | Number of SNPs in PRS (a) | OR (b) | 95% CI | P | R2 (c) | AUC |
|---|---|---|---|---|---|---|---|
| Exon SNPs (d) | 1 | 6,684 | 1.06 | 0.96–1.18 | 0.25 | 0.011 | 0.556 |
| Exon SNPs (d) | 2 | 6,563 | 1.11 | 1.00–1.23 | 0.05 | 0.015 | 0.568 |
| Exon SNPs (d) | 3 | 6,626 | 1.14 | 1.03–1.27 | 0.01 | 0.009 | 0.551 |
| Exon SNPs (d) | 4 | 6,629 | 1.12 | 1.01–1.25 | 0.03 | 0.012 | 0.565 |
| Transcript SNPs (d) | 1 | 13,379 | 1.10 | 0.99–1.22 | 0.09 | 0.012 | 0.565 |
| Transcript SNPs (d) | 2 | 13,183 | 1.16 | 1.05–1.29 | 0.01 | 0.018 | 0.576 |
| Transcript SNPs (d) | 3 | 13,212 | 1.20 | 1.08–1.33 | 0.0008 | 0.012 | 0.563 |
| Transcript SNPs (d) | 4 | 13,355 | 1.10 | 0.99–1.22 | 0.08 | 0.011 | 0.566 |
| Intergenic SNPs | 1 | 10,429 | 1.05 | 0.95–1.17 | 0.33 | 0.010 | 0.553 |
| Intergenic SNPs | 2 | 10,520 | 1.05 | 0.94–1.16 | 0.4 | 0.013 | 0.562 |
| Intergenic SNPs | 3 | 10,516 | 1.03 | 0.93–1.15 | 0.56 | 0.005 | 0.535 |
| Intergenic SNPs | 4 | 10,560 | 1.10 | 0.99–1.22 | 0.08 | 0.011 | 0.560 |

(a) SNPs in training set limited to those with low pairwise LD: r2 < 0.2 (n = 94,259). The P-value threshold for selecting SNPs from the training set was 0.25.

(b) ORs based on PRS that has been divided by its standard deviation to generate ORs that correspond to a one standard deviation change in the PRS.

(c) Generalized R2 measure for the fitted logistic model.

(d) Exon and transcript regions were extended on each side by 5 kb and 15 kb, respectively.

PRS, polygenic risk score; CV, cross-validation; OR, odds ratio; CI, confidence interval; AUC, area under the curve.

Associations between “power to detect association” (based on ORs and MAFs from a training set) and replication (P < 0.05 with same direction of association in a testing set) are presented in Table IV. In the absence of LD pruning, power showed a clear association with replication status in analyses of “all SNPs” and when restricting to transcribed SNPs (P ≤ 0.01 for each CV set). Analyses restricted to exonic and intergenic SNPs did not show significant associations. However, the unpruned SNP dataset contained many correlated observations, violating the independent observation assumption for logistic regression. When the SNPs were pruned based on pairwise LD (i.e., restricting to SNPs with pairwise r2 < 0.2), the power-replication associations were no longer statistically significant. However, the observed ORs for the pruned SNP dataset were greater than one for all CV sets, except for analyses restricted to intergenic SNPs. The magnitude of the association between power and replication was stronger in analyses restricted to exonic and transcribed SNPs than in analyses based on intergenic SNPs only, consistent with our observations from the PRS analysis. This was especially true for analyses of the pruned SNP dataset (Table IV). The exact proportions of SNPs replicating in the testing set are shown in Table V, stratified by power estimates derived from the training data. In general, the proportion of replicated SNPs shows slight increases across quintiles of power, with the most prominent increases between the two highest quintiles. Associations for SNPs in the top decile of power were not substantially different from those in the top quintile.

TABLE IV.

Associations between power to detect association (based on training data) and replication status (P < 0.05 and association in the same direction as training data) in the testing dataset using twofold cross-validation

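| SNP set: LD features | SNP set: Location | CV set | Number of SNPs analyzed | OR | 95% CI | P |
|---|---|---|---|---|---|---|
| All SNPs | All | 1 | 492,651 | 1.18 | 1.05–1.32 | 0.004 |
| All SNPs | All | 2 | 492,651 | 1.19 | 1.07–1.33 | 0.002 |
| All SNPs | Exonic (a) | 1 | 136,424 | 1.15 | 0.93–1.41 | 0.20 |
| All SNPs | Exonic (a) | 2 | 136,424 | 1.11 | 0.90–1.37 | 0.34 |
| All SNPs | Transcript (a) | 1 | 274,680 | 1.20 | 1.04–1.40 | 0.01 |
| All SNPs | Transcript (a) | 2 | 274,680 | 1.26 | 1.09–1.46 | 0.002 |
| All SNPs | Intergenic | 1 | 217,971 | 1.15 | 0.97–1.36 | 0.12 |
| All SNPs | Intergenic | 2 | 217,971 | 1.11 | 0.94–1.32 | 0.22 |
| Pairwise r2 < 0.2 | All | 1 | 94,259 | 1.15 | 0.89–1.48 | 0.30 |
| Pairwise r2 < 0.2 | All | 2 | 94,259 | 1.08 | 0.84–1.40 | 0.54 |
| Pairwise r2 < 0.2 | Exonic (a) | 1 | 26,295 | 1.43 | 0.91–2.26 | 0.13 |
| Pairwise r2 < 0.2 | Exonic (a) | 2 | 26,295 | 1.31 | 0.81–2.11 | 0.27 |
| Pairwise r2 < 0.2 | Transcript (a) | 1 | 52,719 | 1.32 | 0.94–1.84 | 0.11 |
| Pairwise r2 < 0.2 | Transcript (a) | 2 | 52,719 | 1.33 | 0.95–1.86 | 0.09 |
| Pairwise r2 < 0.2 | Intergenic | 1 | 41,540 | 0.95 | 0.64–1.42 | 0.81 |
| Pairwise r2 < 0.2 | Intergenic | 2 | 41,540 | 0.82 | 0.55–1.22 | 0.33 |

(a) Exon and transcript regions were extended on each side by 5 kb and 15 kb, respectively.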

LD, linkage disequilibrium; CV, cross-validation; OR, odds ratio; CI, confidence interval.

TABLE V.

Proportion of SNPs replicating (P < 0.05 and with the same direction) in the testing dataset according to power estimated from the training dataset using twofold CV

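| SNP set: LD features | SNP set: Location | CV set | Number of SNPs analyzed | Power Q1 (≥0.05) | Power Q2 (0.06–0.08) | Power Q3 (0.09–0.13) | Power Q4 (0.14–0.25) | Power Q5 (>0.25) | Top decile of power (>0.38) |
|---|---|---|---|---|---|---|---|---|---|
| All SNPs | All | 1 | 492,651 | 2.51 | 2.65 | 2.55 | 2.63 | 2.71 | 2.80 |
| All SNPs | All | 2 | 492,651 | 2.43 | 2.62 | 2.51 | 2.62 | 2.68 | 2.72 |
| All SNPs | Exonic (a) | 1 | 136,424 | 2.69 | 2.70 | 2.51 | 2.95 | 2.80 | 2.85 |
| All SNPs | Exonic (a) | 2 | 136,424 | 2.56 | 2.53 | 2.55 | 2.51 | 2.70 | 2.70 |
| All SNPs | Transcript (a) | 1 | 274,680 | 2.67 | 2.69 | 2.56 | 2.67 | 2.83 | 2.90 |
| All SNPs | Transcript (a) | 2 | 274,680 | 2.46 | 2.56 | 2.47 | 2.56 | 2.72 | 2.78 |
| All SNPs | Intergenic | 1 | 217,971 | 2.31 | 2.61 | 2.53 | 2.59 | 2.55 | 2.67 |
| All SNPs | Intergenic | 2 | 217,971 | 2.40 | 2.69 | 2.57 | 2.67 | 2.66 | 2.66 |
| Pairwise r2 < 0.2 | All | 1 | 94,259 | 2.58 | 2.58 | 2.42 | 2.59 | 2.67 | 2.75 |
| Pairwise r2 < 0.2 | All | 2 | 94,259 | 2.44 | 2.78 | 2.48 | 2.56 | 2.66 | 2.60 |
| Pairwise r2 < 0.2 | Exonic (a) | 1 | 26,295 | 2.62 | 2.85 | 2.32 | 3.00 | 2.97 | 3.23 |
| Pairwise r2 < 0.2 | Exonic (a) | 2 | 26,295 | 2.40 | 2.68 | 2.45 | 2.36 | 2.91 | 2.78 |
| Pairwise r2 < 0.2 | Transcript (a) | 1 | 52,719 | 2.57 | 2.77 | 2.32 | 2.59 | 2.88 | 2.86 |
| Pairwise r2 < 0.2 | Transcript (a) | 2 | 52,719 | 2.33 | 2.81 | 2.42 | 2.42 | 2.87 | 2.79 |
| Pairwise r2 < 0.2 | Intergenic | 1 | 41,540 | 2.55 | 2.44 | 2.55 | 2.59 | 2.36 | 2.55 |
| Pairwise r2 < 0.2 | Intergenic | 2 | 41,540 | 2.59 | 2.76 | 2.55 | 2.74 | 2.40 | 2.36 |

(a) Exon and transcript regions were extended on each side by 5 kb and 15 kb, respectively.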

LD, linkage disequilibrium; CV, cross-validation.

Due to concerns that SNPs with low MAF may be more likely to replicate than high MAF SNPs (under the null hypothesis) due to failure of asymptotics, we repeated all power-replication analyses after excluding all SNPs with MAF < 0.10. The results were very similar to the results obtained from the full SNP dataset, but with somewhat stronger positive associations observed across all SNP categories (exon, transcript, intergenic, etc.). This implies that our estimates of association were not inflated by biases arising from testing low MAF SNPs. To further ensure the validity of our results, we permuted the case-control phenotypes and repeated our PRS analyses. As expected, we confirmed that there is no evidence of association between the PRS and PanCa risk in any CV set with permuted phenotypes (not presented).

DISCUSSION

In this paper, we assess, for the first time, empirical evidence for unidentified common markers of PanCa risk, using two different CV techniques: a PRS approach and a power-replication approach. The PRS approach shows that most susceptibility SNPs do not reside in the extreme end of the P-value distribution in this GWA dataset, but may have P-values > 0.01. The associations observed for the PRSs were very consistent across all four testing sets, which are independent hypothesis tests, providing very strong evidence that the associations we observe are not false positives.

For our PRS models (with a P-value threshold ≥0.1), the R2 ranged from 0.008 to 0.017 and the AUC from 0.547 to 0.572. Thus, the predictive ability of the PRS is very similar to, or perhaps slightly greater than, the predictive ability of any one of the five known PanCa susceptibility variants, which have R2 values that range from 0.008 to 0.010 and AUC values of 0.550 to 0.557. The R2 and AUC for all five established SNPs (in the same model) were 0.026 and 0.592, respectively, and these increased to 0.029 and 0.600 after including the PRS in the model. Assuming a heritability of 36% [Lichtenstein et al., 2000] and a lifetime risk of 1.5% [Howlader et al., 2011] for PanCa, population genetics theory [Wray et al., 2010] and formulae provided by Wray and colleagues (http://gump.qimr.edu.au/genroc/genroc_calc_nrw.cgi) suggest that the five established susceptibility variants explain ~4% of the heritability for PanCa. Adding the PRS to this model increased the heritability explained to ~5%.
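To make the AUC values easier to interpret, the following minimal sketch computes the AUC as the probability that a randomly chosen case has a higher score than a randomly chosen control, counting ties as one half. This rank-based definition is standard, but the function name and toy scores are hypothetical, and the sketch ignores the covariate adjustment used in the fitted logistic models.

```python
def auc(case_scores, control_scores):
    """Fraction of case-control pairs in which the case scores higher (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Invented scores: 4.5 of the 6 case-control pairs favor the case.
print(auc([0.3, 0.1, 0.4], [0.2, 0.1]))
```

An AUC of 0.5 corresponds to no discrimination, so values in the range 0.55 to 0.57 indicate that a case outranks a control only slightly more often than chance.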

The power-replication approach suggests that SNPs in the right tail of the power distribution (i.e., estimated power >25%) are slightly more likely to replicate than SNPs with lower power, although these results were not statistically significant. Results from both the PRS and the power-replication method suggest that susceptibility loci are more likely to reside within or in close proximity to transcribed regions rather than in intergenic regions. This is consistent with the observation that four of the five PanCa susceptibility loci identified are located in or very close to coding regions.

While both the PRS and the power-replication approach have the ability to detect the presence of many SNPs of weak effect, the methods have key differences. The former assesses the ability of a risk score to predict case/control status, while the latter is a comparison of SNP characteristics across datasets. Thus, the power-replication approach does not provide any information on the predictive value of the unidentified variants. Based on our results, the PRS approach appears to be more robust to LD-pruning decisions, as it produces associations of similar magnitude and significance for both LD-pruned and unpruned datasets. In contrast, the power-replication method produces more significant associations for unpruned SNP data as compared to data that has been pruned for LD. This finding of decreased significance for LD-pruned SNP data is consistent with the work of Smith et al. [2011]. This phenomenon is likely due to the fact that the confidence intervals are wider for estimates derived from pruned SNP data. More specifically, the full SNP data contains more information on SNP-disease associations than does the pruned data. However, a substantial amount of this additional information is actually redundant due to high LD between neighboring SNPs, whereas the pieces of information in the pruned SNP data are largely independent. This additional, redundant information in the unpruned data is not likely to substantially alter the signal-to-noise ratio in the data, but will increase the sample size (i.e., number of SNPs), thereby narrowing the confidence intervals for the association between power and replication status. The relative contribution of redundant vs. nonredundant association signals to the discrepancy between the significance of the association estimates derived from the pruned and unpruned results is unclear.

Given the existence of additional associated variants in this dataset, our results suggest that, in theory, larger GWA studies of PanCa will detect new susceptibility loci. However, the underlying effect sizes and sample sizes needed to detect them need to be explored in future studies. Extremely large studies of PanCa are inherently difficult to carry out, due to the relatively low incidence rate for PanCa in the United States [Jemal et al., 2010] and elsewhere [Jemal et al., 2011] and the typically short amount of time between diagnosis and death [Fesinmeyer et al., 2005]. Such studies will require substantial financial resources and effort; however, as the cost of genotyping decreases, international collaboration increases, and data on additional cases are accumulated, very large GWA studies of PanCa may become feasible.

There are several other recently developed methods that use GWA data and results to examine the extent to which unidentified susceptibility variants contribute to disease risk. For example, methods for obtaining heritability estimates from mixed models that utilize kinship matrices derived from dense genome-wide SNP data have been developed [Lee et al., 2011; Yang et al., 2010]. The methods used in this work are quite different from these GWA-based heritability methods, as we are detecting collective associations that arise due to many weak SNP-disease associations, rather than inferring heritability based on genome-wide identity-by-descent information. Thus, our results represent currently undetected variants that, in theory, can be detected using GWA methods that examine primarily common SNPs.

Park and colleagues have developed methods for estimating the distribution of susceptibility variant effect sizes, calculation of power to detect such variants, and heritability explained by all variants [Park et al., 2010]. This method uses information on known susceptibility variants to estimate a distribution of effect sizes for the trait under study. Thus, this method is not based on detecting weak associations in GWA data. This method has worked well for traits with many known genetic determinants (e.g., Crohn’s disease, height, common cancers). However, because few GWA studies have been conducted for PanCa, it is unclear how well these methods would work with so little data from which to extrapolate an effect size distribution. In a similar fashion, So and colleagues [So et al., 2010] developed a method to estimate the total number of susceptibility variants underlying complex diseases. The estimation is conducted under the assumption that the distribution of variance explained for all susceptibility variants is exponential, using a liability threshold model for binary traits. So et al. did not test for the existence of their hypothesized unobserved susceptibility variants using GWA data.

There are several unique aspects of our work compared to prior studies using similar methodology [Bush et al., 2010; Machiela et al., 2011; Purcell et al., 2009; Smith et al., 2011; Witte and Hoffmann, 2011]. For example, this is the first study that has used a PRS approach focused solely on genic or intergenic SNPs, and the first to compare such results. In addition, this is the first study to compare the performance of the PRS and power-replication approaches, providing evidence that the PRS approach may be more powerful for detecting polygenic effects. This work is also the first to explore the effects of LD pruning decisions on results from PRS analysis. Our study used a 75:25 split of the GWA data for CV for the PRS method, as opposed to prior studies using 50:50 splits of the data without CV [Witte and Hoffmann, 2011] and 90:10 splits with CV [Machiela et al., 2011]. We feel our method balances the need to generate accurate association estimates in the training set and the ability to detect significant associations in the testing set. For the power-replication approach, we used only two CV sets, because in preliminary analyses, a 75:25 split of the data produced somewhat inconsistent association estimates across the four CV sets. Thus, we used a 50:50 split and achieved better consistency.
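The 75:25 split with fourfold CV described above can be sketched as follows. This is a simplified illustration, not the analysis code used in this study. It assumes allelic log odds ratios as SNP weights, ranking of SNPs by the absolute allelic z statistic, a continuity correction of 0.5, and no covariate adjustment. All function names and the `top_fraction` parameter are hypothetical.

```python
import random
from math import log, sqrt

def fold_indices(n_samples, n_folds=4, seed=0):
    """Shuffle sample indices and deal them into n_folds near-equal folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[k::n_folds] for k in range(n_folds)]

def allelic_z_and_log_or(genotypes, status, rows, snp):
    """Allelic z statistic and log odds ratio for one SNP among `rows`.

    0.5 is added to every cell of the 2x2 allele table to avoid zeros.
    """
    a = b = c = d = 0.5
    for i in rows:
        g = genotypes[i][snp]
        if status[i] == 1:       # case
            a += g
            b += 2 - g
        else:                    # control
            c += g
            d += 2 - g
    log_or = log(a * d / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or / se, log_or

def cross_validated_prs(genotypes, status, top_fraction=0.10, n_folds=4, seed=0):
    """Train SNP weights on n_folds - 1 folds and score the held-out fold.

    Returns one (test_rows, scores) pair per fold.
    """
    n_snps = len(genotypes[0])
    n_keep = max(1, int(round(top_fraction * n_snps)))
    folds = fold_indices(len(status), n_folds, seed)
    results = []
    for k, test_rows in enumerate(folds):
        train_rows = [i for j, fold in enumerate(folds) if j != k for i in fold]
        stats = [allelic_z_and_log_or(genotypes, status, train_rows, s)
                 for s in range(n_snps)]
        # Keep the SNPs with the strongest training-set evidence.
        ranked = sorted(range(n_snps), key=lambda s: abs(stats[s][0]), reverse=True)
        selected = ranked[:n_keep]
        # PRS = sum of (training log odds ratio x allele count) in the test fold.
        scores = [sum(stats[s][1] * genotypes[i][s] for s in selected)
                  for i in test_rows]
        results.append((test_rows, scores))
    return results
```

With four folds, each training set contains roughly 75% of the samples and each testing set roughly 25%. In a full analysis, case status in each testing fold would then be regressed on the score (for example, by logistic regression); that step is omitted here.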

A key limitation of the methods used in this work (and the kinship-based heritability method) is that they do not provide insight into which specific SNPs or pathways are driving the observed polygenic effects. Additional research is needed to identify the causal variants and/or pathways. However, we have demonstrated that PRS analysis, when restricted to subsets of SNPs (i.e., genic or intergenic SNPs), can be used to address specific hypotheses. Future studies could use information on pathways or genome structure to group SNPs in ways that allow for assessment of polygenic effects within these classes of SNPs. Such methods could shed additional light on the location and functional characteristics of causal SNPs. While neither of the methods used in this work required any assumptions regarding the distribution of the effect sizes or the total number of unidentified susceptibility variants, our methods do not provide any information on these unknown parameters.
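As a minimal sketch of how SNPs might be grouped for class-specific PRS analysis, the function below accumulates a separate score for each annotation label. The labels, weights, and function name are hypothetical. In practice, SNP selection and weighting would be carried out in a training set, possibly within each class separately.

```python
from collections import defaultdict

def prs_by_group(genotype_row, weights, labels):
    """Return {label: score}, summing weight x allele count within each label."""
    scores = defaultdict(float)
    for count, weight, label in zip(genotype_row, weights, labels):
        scores[label] += weight * count
    return dict(scores)

# Hypothetical example: five SNPs, three annotated as genic.
weights = [0.10, -0.05, 0.20, 0.02, -0.08]   # training-set log odds ratios
labels = ["genic", "genic", "genic", "intergenic", "intergenic"]
person = [2, 1, 0, 1, 2]                     # allele counts for one individual

group_scores = prs_by_group(person, weights, labels)
```

Because the score is additive across SNPs, the group-specific scores sum to the score based on all SNPs, so each class can be tested for association on its own. The same function accepts pathway labels in place of the genic and intergenic labels.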

In summary, we have used a PRS approach and a power-replication approach to confirm the existence of additional susceptibility loci for PanCa. Both of these methods suggest that unidentified susceptibility variants are more likely to reside within genes than within intergenic regions. Larger GWA studies and/or innovative secondary analyses of existing PanCa GWA data may be promising methods for identifying these additional susceptibility loci.

ACKNOWLEDGMENTS

The authors would like to thank the Database of Genotypes and Phenotypes (dbGaP), all Investigators who contributed the phenotype data and DNA samples to the PanScan project, and the National Cancer Institute, the primary funder of the PanScan genome-wide association study. This work was supported by the Department of Defense [W81XWH-10-1-0499 to B.P.] and the National Institutes of Health [CA122171 and CA102484 to H.A.].

REFERENCES

  1. Amundadottir L, Kraft P, Stolzenberg-Solomon RZ, Fuchs CS, Petersen GM, Arslan AA, Bueno-de-Mesquita HB, Gross M, Helzlsouer K, Jacobs EJ, LaCroix A, Zheng W, Albanes D, Bamlet W, Berg CD, Berrino F, Bingham S, Buring JE, Bracci PM, Canzian F, Clavel-Chapelon F, Clipp S, Cotterchio M, de Andrade M, Duell EJ, Fox JW Jr., Gallinger S, Gaziano JM, Giovannucci EL, Goggins M, Gonzalez CA, Hallmans G, Hankinson SE, Hassan M, Holly EA, Hunter DJ, Hutchinson A, Jackson R, Jacobs KB, Jenab M, Kaaks R, Klein AP, Kooperberg C, Kurtz RC, Li D, Lynch SM, Mandelson M, McWilliams RR, Mendelsohn JB, Michaud DS, Olson SH, Overvad K, Patel AV, Peeters PH, Rajkovic A, Riboli E, Risch HA, Shu XO, Thomas G, Tobias GS, Trichopoulos D, Van Den Eeden SK, Virtamo J, Wactawski-Wende J, Wolpin BM, Yu H, Yu K, Zeleniuch-Jacquotte A, Chanock SJ, Hartge P, Hoover RN. 2009. Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat Genet 41(9):986–990. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bush WS, Sawcer SJ, de Jager PL, Oksenberg JR, McCauley JL, Pericak-Vance MA, Haines JL. 2010. Evidence for polygenic susceptibility to multiple sclerosis–the shape of things to come. Am J Hum Genet 86(4):621–625. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Evans DM, Visscher PM, Wray NR. 2009. Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk. Hum Mol Genet 18(18):3525–3531. [DOI] [PubMed] [Google Scholar]
  4. Fesinmeyer MD, Austin MA, Li CI, De Roos AJ, Bowen DJ. 2005. Differences in survival by histologic type of pancreatic cancer. Cancer Epidemiol Biomarkers Prev 14(7):1766–1773. [DOI] [PubMed] [Google Scholar]
  5. Howlader N, Noone AM, Krapcho M, Neyman N, Aminou R, Waldron W, Altekruse SF, Kosary CL, Ruhl J, Tatalovich Z, Cho H, Mariotto A, Eisner MP, Lewis DR, Chen HS, Feuer EJ, Cronin KA, Edwards BK, editors. 2011. SEER Cancer Statistics Review, 1975–2008. Bethesda, MD: National Cancer Institute. [Google Scholar]
  6. Jemal A, Siegel R, Xu J, Ward E. 2010. Cancer statistics, 2010. CA Cancer J Clin 60(5):277–300. [DOI] [PubMed] [Google Scholar]
  7. Jemal A, Bray F, Center MM, Ferlay J, Ward E, Forman D. 2011. Global cancer statistics. CA Cancer J Clin 61(2):69–90. [DOI] [PubMed] [Google Scholar]
  8. Lee SH, Wray NR, Goddard ME, Visscher PM. 2011. Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet 88(3):294–305. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Lichtenstein P, Holm NV, Verkasalo PK, Iliadou A, Kaprio J, Koskenvuo M, Pukkala E, Skytthe A, Hemminki K. 2000. Environmental and heritable factors in the causation of cancer–analyses of cohorts of twins from Sweden, Denmark, and Finland. N Engl J Med 343(2):78–85. [DOI] [PubMed] [Google Scholar]
  10. Machiela MJ, Chen CY, Chen C, Chanock SJ, Hunter DJ, Kraft P. 2011. Evaluation of polygenic risk scores for predicting breast and prostate cancer risk. Genet Epidemiol 35(6):506–514. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L, Popova N, Pretel S, Ziyabari L, Lee M, Shao Y, Wang ZY, Sirotkin K, Ward M, Kholodov M, Zbicz K, Beck J, Kimelman M, Shevelev S, Preuss D, Yaschenko E, Graeff A, Ostell J, Sherry ST. 2007. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 39(10):1181–1186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Park JH, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, Chatterjee N. 2010. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 42(7):570–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Petersen GM, Amundadottir L, Fuchs CS, Kraft P, Stolzenberg-Solomon RZ, Jacobs KB, Arslan AA, Bueno-de-Mesquita HB, Gallinger S, Gross M, Helzlsouer K, Holly EA, Jacobs EJ, Klein AP, LaCroix A, Li D, Mandelson MT, Olson SH, Risch HA, Zheng W, Albanes D, Bamlet WR, Berg CD, Boutron-Ruault MC, Buring JE, Bracci PM, Canzian F, Clipp S, Cotterchio M, de Andrade M, Duell EJ, Gaziano JM, Giovannucci EL, Goggins M, Hallmans G, Hankinson SE, Hassan M, Howard B, Hunter DJ, Hutchinson A, Jenab M, Kaaks R, Kooperberg C, Krogh V, Kurtz RC, Lynch SM, McWilliams RR, Mendelsohn JB, Michaud DS, Parikh H, Patel AV, Peeters PH, Rajkovic A, Riboli E, Rodriguez L, Seminara D, Shu XO, Thomas G, Tjonneland A, Tobias GS, Trichopoulos D, Van Den Eeden SK, Virtamo J, Wactawski-Wende J, Wang Z, Wolpin BM, Yu H, Yu K, Zeleniuch-Jacquotte A, Fraumeni JF Jr., Hoover RN, Hartge P, Chanock SJ. 2010. A genome-wide association study identifies pancreatic cancer susceptibility loci on chromosomes 13q22.1, 1q32.1 and 5p15.33. Nat Genet 42(3):224–228. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Pierce BL, Ahsan H. 2011. Genome-wide “pleiotropy scan” identifies HNF1A region as a novel pancreatic cancer susceptibility locus. Cancer Res 71(13):4352–4358. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. 2006. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8):904–909. [DOI] [PubMed] [Google Scholar]
  16. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, Sklar P. 2009. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460(7256):748–752. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Schenk M, Schwartz AG, O’Neal E, Kinnard M, Greenson JK, Fryzek JP, Ying GS, Garabrant DH. 2001. Familial risk of pancreatic cancer. J Natl Cancer Inst 93(8):640–644. [DOI] [PubMed] [Google Scholar]
  19. Shi C, Hruban RH, Klein AP. 2009. Familial pancreatic cancer. Arch Pathol Lab Med 133(3):365–374. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Shieh G. 2000. A comparison of two approaches for power and sample size calculations in logistic regression models. Communications in Statistics-Simulation 29:763–791. [Google Scholar]
  21. Smith EN, Koller DL, Panganiban C, Szelinger S, Zhang P, Badner JA, Barrett TB, Berrettini WH, Bloss CS, Byerley W, Coryell W, Edenberg HJ, Foroud T, Gershon ES, Greenwood TA, Guo Y, Hipolito M, Keating BJ, Lawson WB, Liu C, Mahon PB, McInnis MG, McMahon FJ, McKinney R, Murray SS, Nievergelt CM, Nurnberger JI Jr., Nwulia EA, Potash JB, Rice J, Schulze TG, Scheftner WA, Shilling PD, Zandi PP, Zollner S, Craig DW, Schork NJ, Kelsoe JR. 2011. Genome-wide association of bipolar disorder suggests an enrichment of replicable associations in regions near genes. PLoS Genet 7(6):e1002134. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. So HC, Yip BH, Sham PC. 2010. Estimating the total number of susceptibility variants underlying complex diseases from genome-wide association studies. PLoS One 5(11):e13898. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Tersmette AC, Petersen GM, Offerhaus GJ, Falatko FC, Brune KA, Goggins M, Rozenblum E, Wilentz RE, Yeo CJ, Cameron JL, Kern SE, Hruban RH. 2001. Increased risk of incident pancreatic cancer among first-degree relatives of patients with familial pancreatic cancer. Clin Cancer Res 7(3):738–744. [PubMed] [Google Scholar]
  24. Wang K, Li M, Hakonarson H. 2010. Analysing biological pathways in genome-wide association studies. Nat Rev Genet 11(12):843–854. [DOI] [PubMed] [Google Scholar]
  25. Witte JS, Hoffmann TJ. 2011. Polygenic modeling of genome-wide association studies: an application to prostate and breast cancer. OMICS 15(6):393–398. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Wray NR, Yang J, Goddard ME, Visscher PM. 2010. The genetic interpretation of area under the ROC curve in genomic profiling. PLoS Genet 6(2):e1000864. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM. 2010. Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42(7):565–569. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Yu K, Wang Z, Li Q, Wacholder S, Hunter DJ, Hoover RN, Chanock S, Thomas G. 2008. Population substructure and control selection in genome-wide association studies. PLoS One 3(7):e2551. [DOI] [PMC free article] [PubMed] [Google Scholar]
