Skip to main content
American Journal of Human Genetics logoLink to American Journal of Human Genetics
. 2018 May 17;102(6):1048–1061. doi: 10.1016/j.ajhg.2018.04.001

Association of Polygenic Risk Scores for Multiple Cancers in a Phenome-wide Study: Results from The Michigan Genomics Initiative

Lars G Fritsche 1,2,3, Stephen B Gruber 4, Zhenke Wu 1,5, Ellen M Schmidt 1, Matthew Zawistowski 1,2, Stephanie E Moser 6, Victoria M Blanc 7, Chad M Brummett 6,8, Sachin Kheterpal 6,8, Gonçalo R Abecasis 1,2, Bhramar Mukherjee 1,2,5,9,10,11,
PMCID: PMC5992124  PMID: 29779563

Abstract

Health systems are stewards of patient electronic health record (EHR) data with extraordinarily rich depth and breadth, reflecting thousands of diagnoses and exposures. Measures of genomic variation integrated with EHRs offer a potential strategy to accurately stratify patients for risk profiling and discover new relationships between diagnoses and genomes. The objective of this study was to evaluate whether polygenic risk scores (PRS) for common cancers are associated with multiple phenotypes in a phenome-wide association study (PheWAS) conducted in 28,260 unrelated, genotyped patients of recent European ancestry who consented to participate in the Michigan Genomics Initiative, a longitudinal biorepository effort within Michigan Medicine. PRS for 12 cancer traits were calculated using summary statistics from the NHGRI-EBI catalog. A total of 1,711 synthetic case-control studies was used for PheWAS analyses. There were 13,490 (47.7%) patients with at least one cancer diagnosis in this study sample. PRS exhibited strong association for several cancer traits they were designed for, including female breast cancer, prostate cancer, melanoma, basal cell carcinoma, squamous cell carcinoma, and thyroid cancer. Phenome-wide significant associations were observed between PRS and many non-cancer diagnoses. To differentiate PRS associations driven by the primary trait from associations arising through shared genetic risk profiles, the idea of “exclusion PRS PheWAS” was introduced. Further analysis of temporal order of the diagnoses improved our understanding of these secondary associations. This comprehensive PheWAS used PRS instead of a single variant.

Keywords: humans, multifactorial inheritance, neoplasms, electronic health records, hospitals, phenotype, genetic variation, genome-wide association study, risk, phenome-wide association study

Introduction

In the past decade, genome-wide association studies (GWASs) using single-nucleotide polymorphisms (SNPs) led to discovery of many common disease susceptibility loci.1, 2, 3 An alternative agnostic way of exploring gene-disease association is through phenome-wide association studies (PheWASs).4, 5, 6 PheWASs enable simultaneous exploration of the association between genetic variants and a broad spectrum of physiological/clinical phenotypes. To explore the joint genome × phenome landscape, one needs access to both electronic health records (EHRs) and GWAS data. The promise and potential of these studies have recently been illustrated by the electronic Medical Records and Genomics (eMERGE) network.7, 8 Beyond genetic associations, EHR has enabled discovery of new associations between disease and secondary effects of drugs or blood biomarker levels.9, 10, 11

PheWASs have been used both to replicate known genetic-phenotypic associations and to discover new consequences for disease-associated variants. PheWASs use computable phenotypes derived from EHR databases. Traditional PheWASs have used International Classification of Disease (ICD) codes to define a set of computable phenotypes or “PheWAS codes” defined and validated by experts using a combination of ICD codes.12 Standard PheWASs have primarily focused on correlating genetic variants, one at a time, to a spectrum of phenotypes. When each variant is associated with a small effect size, these studies can provide only limited insight. For this reason, many areas of genetics now use ensembles of variants that cumulatively explain substantial variation in disease risk.13, 14, 15 For example, PRS constructed from multiple GWASs identified loci that have been proposed for cancer screening, risk prediction, and risk stratification.16, 17, 18, 19

In this paper, we present the exploration of PRS in a PheWAS setting instead of a traditional PheWAS that considers single variants, one at a time. We focus on cancer traits while constructing the PRS. We construct PRS for multiple cancers including some of the most common groupings of cancers in the United States—prostate cancer (PCa [MIM: 176807]), breast cancer (MIM: 114480), colorectal cancer (MIM: 114500), lung cancer (MIM: 211980), melanoma of skin (MIM: 155601), and basal cell carcinoma (MIM: 614740)—and correlate them with PheWAS codes.

Our study is based on the Michigan Genomics Initiative (MGI) launched in 2012, a biorepository effort to create a longitudinal cohort of participants in Michigan Medicine.

MGI enrolled participants undergoing anesthesia prior to a surgery or diagnostic procedure, creating a patient community with genome-wide data, electronic health information, and permission for follow-up and re-contact in future studies. Our current analysis of 28,260 patients in MGI indicates that 47.7% of these patients have at least one current or previous neoplasm diagnosis (excluding benign neoplasms). This presents a unique opportunity to study multiple cancer outcomes leveraging both EHR and genomic data in MGI.

At the same time, this enrichment of cancer patients in MGI highlights some of the special features of the sampling frame for the study and source population. Because of the self-selective, consent-based nature of MGI patient enrollment, the sample selection mechanism is non-probabilistic, that is, the probability of a sampling unit being included in the study is not pre-determined. The MGI sample is enriched for neoplasm diagnoses, which could be related to the fact that many surgeries are diagnostic procedures that are specifically related to cancer treatment and screening (e.g., colonoscopy, skin biopsies). Cancer patients undergo surgery more frequently than the general population and frequently choose an academic medical center for diagnostic and/or interventional procedures. The analytic framework presented in this paper conducts careful sensitivity analysis for protecting our inference against such selection biases, unbalanced case-control ratios, and phenotypic enrichment.

There are several distinct aspects to our study. Our study represents a comprehensive PheWAS focused on using PRS in a cancer-enriched cohort accrued in an academic health center. Our study is also a PheWAS focused on cancer. Our results demonstrate that PRS, a summary score constructed based on results of large population-based GWASs, can be potentially useful for cancer risk stratification among patients in an academic medical center. We also note that when a PRS-based PheWAS leads to the association of a cancer-specific PRS (e.g., prostate cancer PRS) with other secondary related phenotypes (e.g., erectile dysfunction or urinary incontinence), these findings may require careful consideration. We observe that many of these secondary associations are often driven by the primary cancer diagnosis. We introduce the notion of “exclusion PRS PheWAS” to detect independent secondary associations that have shared genetic etiology. We extract the temporal order of diagnoses from the EHR to shed further insight into these secondary associations.

Subjects and Methods

MGI Cohort

Participants were recruited through Michigan Medicine health system while awaiting diagnostic or interventional procedures either during a preoperative visit prior to the procedure or on the day of procedure that required anesthesia. Opt-in written informed consent is obtained. In addition to coded biosamples and protected secure health information, participants understand that all EHR, claims, and national data sources linkable to the participant may be incorporated into the MGI databank. Each participant donates a blood sample for genetic analysis, undergoes baseline vital signs and a comprehensive history and physical, and completes validated self-report measures of pain, mood, and function, including NIH Patient Reported Outcomes Measurement Information System (PROMIS) measures. Data were collected according to Declaration of Helsinki principles. Study participants provided written informed consent, and protocols were reviewed and approved by local ethics committees (IRB ID HUM00099605). In the current study, we report results obtained from 28,260 genotyped samples of European ancestry with available integrated EHR data (see summary characteristics of the cohort in Table 1).

Table 1.

Demographics and Clinical Characteristics of the Final Analytic Dataset

Characteristic Analytic Data Set
n 28,260
Females, n (%) 15,113 (53.5%)
Mean age, years (SD) 54.1 (15.9)
Total number of ICD9 code days 3.5 million
Number of unique ICD9 codes 10,322
Median number of visits per participant 23
Median days between first and last visit 1,265
Median ICD9 code days per participant 28

Genotyping and Sample Quality Control (QC)

DNA from 37,412 blood samples was genotyped on two batches of customized Illumina Infinium CoreExome-24 bead arrays (“UM_HUNT_Biobank_11788091_A1” [n = 21,207] and “UM_HUNT_Biobank_v1-1_20006200_A” [n = 16,205]) that in addition to standard genome-wide tagging SNPs (n = ∼240,000) and exomic variants (n = ∼280,000) contained about 70,000 additional custom content variants, e.g., candidate variants from GWASs, nonsense and missense variants from sequencing studies, ancestry informative markers, and Neanderthal variants. Genotype analysis was performed with Illumina GenomeStudio (module 1.9.4, algorithm GenTrain 2.0). After initial clustering, variant cluster boundaries were re-defined in a second run using only individuals with call rate of at least 99% while the remaining samples were genotyped afterwards.

We excluded samples with (1) call rate < 99%, (2) estimated contamination > 2.5% (BAF Regress),20 (3) large chromosomal copy number variants (single chromosome with missingness ≥ five times larger than other chromosomes), (4) lower call rate than its technical duplicate or twin, (5) gonosomal constellations other than XX and XY, or (6) whose inferred sex did not match the reported gender. We excluded variants if (1) their probes could not be perfectly mapped or mapped perfectly to multiple position in the human genome assembly (Genome Reference Consortium Human genome build 37 and revised Cambridge Reference Sequence of the human mitochondrial DNA; BLAT),21 (2) they showed deviations from Hardy Weinberg equilibrium in European ancestry samples (p < 0.0001), (3) had a call rate < 99%, (4) another variant with higher call rate assayed the same variant, or (5) the allele frequency differences between the two array versions within unrelated, European ancestry samples had a p value < 0.001 (PLINK v.1.90).22 After quality control, 392,323 polymorphic variants remained.

Before preparing the final analytical dataset, we reduced the data to 33,028 samples for which we had complete age and ICD9 data available. Next, we estimated pairwise kinship with the software KING23 and limited further analysis to a subset that contained no pairs of individuals with a first- or second-degree relationship. We inferred recent ancestry by projecting all genotyped samples into the space of the principal components of the Human Genome Diversity Project reference panel using PLINK (938 unrelated individuals).24, 25 We limited the principal component analysis to variants that were shared between the HGDP reference and the MGI data, had a minor allele frequency > 1%, and remained after LD pruning (r2 < 0.5; PLINK). Samples of recent European ancestry (∼90% of participants) were defined as samples that fell into a circle around the center of the European HGDP populations in the PC1 versus PC2 space, whereas the circle’s radius was set to 1/8 of the distance between the center of the European HGDP populations and the centroid of the centers of the European, East Asian, and Sub-Saharan populations (Figure S1). Principal components were stored and used for further association tests. After quality control, 28,260 unrelated, genotyped individuals of recent European ancestry with age and ICD9 data remained for further analysis.

Phasing and Genotype Imputation

We imputed genotypes of the Haplotype Reference Consortium using the Michigan Imputation Server26 and filtered poorly imputed variants with R2 < 0.3 and/or minor allele frequency (MAF) < 0.1% resulting in more than 17 million imputed variants available after quality control and filtering. The obtained accuracy for imputed variants, i.e., the average empirical R2 values for different MAF frequency bins, was 0.89 (0.1% ≤ MAF ≤ 0.5%), 0.94 (0.5% < MAF ≤ 5%), and 0.96 (MAF > 5%).

Phenome Generation

We extracted the ICD9 data for 28,260 unrelated, genotyped individuals of recent European ancestry and mapped a total of 3.5 million ICD9 codes to PheWAS codes (PheWAS translation table version 1.2).12 The ICD9 codes (10,322 unique ICD9 codes) were aggregated to PheWAS traits using the PheWAS R package.12 Cases for a given PheWAS code were defined if an individual had at least one assignment of that PheWAS code in their record. The remaining individuals that did not have overlapping PheWAS codes that are a part of the exclusion criteria were considered as control subjects. A total of 1,857 case-control studies were generated of which 1,711 with ≥20 cases were used for further analyses (see Table S1; phenotypes with <50 cases were coded as “<50”).

GWAS Catalog SNP Extraction and Construction of PRS

We downloaded previously reported GWAS variants from the NHGRI-EBI Catalog (file date: June 31, 2017).27, 28 None of the discovery studies included in the catalog used any subset of the MGI cohort. This is primarily because MGI started recruiting in 2012 and the genotype data became available only recently. Variant positions were converted to GRCh37 using variant IDs from dbSNP build 144 (UCSC Genome Browser) after updating outdated dbSNP IDs to their merged dbSNP IDs. Entries with missing risk alleles, risk allele frequencies, or odds ratios were excluded. If a reported risk allele did not match any of the reported forward strand alleles of a non-ambiguous SNP (not A/T or C/G) in the 1000 Genomes Project genotype data, we assumed minus strand designation and corrected the effect allele to its complementary base of the forward strand. Entries with a reported risk allele that did not match any of the alleles of an ambiguous SNP (A/T and C/G) in the 1000 Genomes Project data would have been excluded at this step. We included only entries with broad European ancestry (as reported by the NHGRI-EBI Catalog). To allow an additional quality control check, we compared the reported risk allele frequencies (RAF) in controls with the frequencies of the 503 European samples of the 1000 Genomes Project reference data (Phase 3, release 5).29 We then excluded entries whose RAF deviated more than 15% from the reference. This chosen threshold is subjective and was based on clear differentiation between correct and likely flipped alleles on the two diagonals (see Figure S2) as noted frequently in GWAS meta-analyses quality control procedures.30 For each analyzed cancer type, we extracted overlapping GWAS hits in our genotype data and estimated pairwise LD (r2) using the available allele dosages of the corresponding controls. For pairwise correlated SNPs (r2 > 0.5) or SNPs with multiple entries, we kept the SNP with the younger publication date (and smaller p value, if necessary) and excluded the other (Figure S2 and Table S2). Finally, we weighted the allele dosages of risk SNPs of the risk increasing alleles with their reported log odds ratios and calculated PRS as their sum. Namely, for subject j in MGI, the PRS was of the form PRSj = iβiGij where the sum extends over all included loci, βi are the log odds ratios retrieved from the GWAS catalog for locus i, and Gij was the measured dosage data for the risk allele on locus i in subject j. This variable was created for each MGI participant and for each cancer separately.

Statistical Analysis

For the current study, we initially explored 30 cancer traits that had matching entries in the GWAS catalog (Table S3) and restricted our analysis to 12 cancer traits with at least 5 risk SNPs detected in the GWAS catalog after filtering that had relatively larger samples sizes in MGI (namely n ≥ 250 case subjects) (Table 2). Logistic regression was used for all genetic association analysis. Firth’s bias reduction method was applied to all single SNP and PRS models to resolve the problem of separation in logistic regression (Logistf in R package “EHR;” see Web Resources),31 a common problem for binary or categorical outcome models when for a certain part of the covariate space there is only one observed value of the outcome which often leads to very large parameter estimates and standard errors. Firth’s bias-reduction32 is a penalized likelihood method that reduces the bias in such situations by adding a penalty term to the likelihood.

Table 2.

Association Analysis of Cancer Traits with at Least Five NHGRI EBI GWAS Catalog Risk SNPs and More than 250 Cases in MGI

Cancer Traita No. Case Subjects No. Control Subjects No. Risk SNPs used for PRSb Effect Size Correlation[ρˆ]between GWAS Catalog and MGI [95% CI] Effect Size Correspondence (Lin’s CCC) MGI versus GWAS Catalog [95% CI] Estimated PRS-Cancer Association
PRS Odds Ratio [95% CI] pc Continuous PRS (Transformed to Standard Normal) Point Estimate [95% CI] pd
Breast cancer (female)e 1,827 11,073 78 0.67 [0.53,0.78] 0.64 [0.51, 0.75] 2.3 [2.0,2.7]; 2.5 × 10−29 1.4 [1.3;1.5]; 3.6 × 10−37
Cancer of prostatee 1,425 9,793 93 0.81 [0.73,0.87] 0.74 [0.66, 0.81] 3.3 [2.7,3.9]; 3.7 × 10−43 1.7 [1.6;1.8]; 3.8 × 10−69
Melanomas of skine 1,404 23,798 16 0.92 [0.77,0.97] 0.91 [0.77, 0.97] 2.4 [2.0,2.8]; 2.6 × 10−31 1.4 [1.3;1.5]; 6.7 × 10−36
Basal cell carcinomae 1,124 23,798 19 0.88 [0.71,0.95] 0.85 [0.68, 0.94] 2.7 [2.2,3.2]; 1.1 × 10−27 1.5 [1.4;1.6]; 3.3 × 10−44
Cancer of bladder 978 26,748 16 0.65 [0.22,0.86] 0.57 [0.22, 0.79] 1.4 [1.2,1.7]; 0.00018 1.2 [1.1;1.2]; 4.9 × 10−6
Non-Hodgkins lymphoma 878 26,794 18 0.51 [0.05, 0.79] 0.24 [0.028,0.43] 1.3 [1.1,1.6]; 0.0063 1.1 [1.0;1.2]; 0.0029
Colorectal cancer 718 22,183 42 0.48 [0.21,0.68] 0.39 [0.17,0.58] 1.3 [1.1,1.6]; 0.011 1.1 [1.1;1.2]; 0.00078
Squamous cell carcinomae 703 23,798 5 0.95 [0.39,1.00] 0.92 [0.57, 0.99] 2.0 [1.6,2.5]; 2.0 × 10−10 1.4 [1.3;1.5]; 1.8 × 10−18
Malignant neoplasm of kidney, except pelvis 613 26,748 7 0.33 [−0.57,87] 0.053 [−0.094, 0.20] 0.98 [0.77,1.3]; 0.89 0.99 [0.91;1.1]; 0.86
Cancer of bronchus, lung 570 27,596 9 0.90 [0.60,0.98] 0.82 [0.53, 0.93] 1.2 [0.91,1.6]; 0.13 1.1 [0.99;1.2]; 0.091
Thyroid cancere 472 26,692 8 0.97 [0.82,0.99] 0.94 [0.82, 0.98] 3.2 [2.5,4.5]; 1.8 × 10−18 1.4 [1.3;1.6]; 4.8 × 10−19
Cancer of brain and nervous system 321 27,069 9 0.79 [0.26,0.95] 0.66 [0.25,0.87] 1.3 [0.92,1.7]; 0.13 1.1 [1.0;1.2]; 0.042

Abbreviations: PRS, polygenic risk score; CI, confidence interval.

a

Underlying ICD9 codes are listed in Table S4.

b

GWAS Catalog SNPs after quality control; corresponding summary statistics are listed in Tables S2 and S5.

c

Odds ratio for each cancer with top PRS quartile compared to bottom PRS quartile. Point estimates, confidence intervals, and p values are obtained by fitting Firth’s Bias-Corrected Logistic Regression.

d

Association of each cancer with continuous PRS that were transformed to standard normal distribution. Point estimates, confidence intervals, and p values are obtained by fitting Firth’s Bias-Corrected Logistic Regression.

e

PRS OR > 1.5, pcontinous PRS < 2.9 × 10−5, selected for PRS PheWAS analysis

To estimate the association of PRS with the primary cancer phenotype, we first determined the PRS quartiles using all control samples, categorized all samples according to these PRS quartiles, and fitted Firth bias-corrected logistic regression adjusting for age, sex, genotyping array, and the first four principal components. We report odds ratios corresponding to the top versus the bottom quartile PRS (reference), referred to as PRS OR. We also used continuous PRS instead of the categorized version as the covariate for enhanced power. For comparability of effect sizes corresponding to the continuous PRS across cancer traits, we transformed the PRS to the standard normal distribution using “ztransform” of the R package “GenABEL” (see Web Resources).

To compare reported associations of individual GWAS catalog SNPs with association observed in the MGI dataset, we tested the association between reported GWAS hits and its corresponding trait using Firth bias-corrected logistic regression implemented in EPACTS (v.3.3, see Web Resources). Age, sex, genotyping array, and principal components 1–4 were included as covariates (see Kinship and Ancestry Inference).

To determine the agreement of estimated effect sizes [estimated log(odds ratios)] between the MGI case-control studies and the published GWAS catalog hits, we estimated Pearson’s correlation coefficient [ρˆ] and Lin’s concordance measure between the two sets of coefficients.33, 34 Toward more standard discovery type genome-wide association analysis with MGI data, we performed GWAS for the nine cancer traits where the correspondence between the effect sizes were relatively strong [ρˆ 0.6 ] (Table 2). For computationally efficient GWA analysis, we used the score test-based saddle point approximation (SPA)35 method adjusting for age, sex, genotyping array, and the first four principal components. SPA was reported to provide accurate test statistics even for extremely unbalanced case-control ratios similar to Firth bias corrected logistic regression (see below) but was estimated to be 100 times faster than the latter.35

For our primary PRS-PheWAS, for the six PRS in Table 2 that showed strong and significant association, we conducted Firth bias-corrected logistic regression by fitting a model of the following form and repeated them for each of the 1,711 phenotypes.

logit (P(Disease=1|PRS, Age, Sex, Array, PC))

=β0+βPRSPRS+βAgeAge+ βSexSex+βArrayArray+βPC

where the PCs were the first four principal components obtained from the principal component analysis of the genotyped GWAS markers and where “Array” represents the two genotyping array versions used in MGI accounting for potential batch effects. To adjust for multiple testing, we applied the conservative phenome-wide Bonferroni correction according to the 1,711 analyzed PheWAS codes (Table S1). Through a PheWAS plot, we present –log10 (p values) corresponding to each of the 1,711 association tests for H0:βPRS=0. Directional arrows on the PheWAS plot indicate whether a phenome-wide significant trait was positively or negatively associated with the PRS.

Furthermore, our extensive sensitivity analyses included (1) similar models adjusting for 20 PCs, (2) matching case to control subjects and conducting conditional logistic regression analysis, and (3) using the unweighted risk allele counts as predictor. The reason for these three sensitivity analyses was to check (1) whether the first four PCs were sufficient to control for population stratification, (2) whether differences in age and sex distributions or extreme case-control ratios influenced the main analysis, and (3) whether ignoring effect sizes and using total risk allele count produced similar results. For (2), we matched case and control subjects using the R package “MatchIt” and applied nearest neighbor matching for age, PC1–4 (using Mahalanobis-metric matching; matching window caliper/width of 0.25 standard deviations), and exact matching for sex. We considered a varying set of case:control matching ratios from 1:1 to 1:10. We observed gain in precision with increasing the number of matched control subjects per case subject, but the gain in precision became negligible after 1:10 matching ratio (Table S6). Moreover, for some cancers with large number of case subjects, we could not attain 1:10 matching ratio for all case subjects and ended up with varying number of control subjects per case subject. For example, for prostate cancer the average number of matched control subjects per case subject was around 5.36

To investigate the possibility of the secondary trait associations with PRS being completely driven by the primary trait association, we performed a second set of PheWAS after excluding individuals affected with the primary cancer trait for which the PRS was constructed, referred to as “exclusion PRS PheWAS.” We applied exclusion PRS PheWAS instead of a PRS PheWAS that uses the primary cancer trait as covariate because the control exclusion criteria implemented in the PheWAS phenotype construction pipeline will often eliminate these primary cancer case subjects from being eligible control subjects for some selected secondary related phenotypes and thus a logistic regression analysis will lead to complete separation.12 We also stratified the MGI dataset (or the corresponding gender subset depending on cancer type) into ten groups of equal size by PRS deciles and determined the percentage of observed case subjects for secondary traits in each risk decile and conducted a test of significance in difference in proportions across the deciles before and after removing individuals affected with cancer traits related to the primary cancer trait. As a follow-up tool to understand the secondary associations, we created a plot to display the temporal ordering of diseases plotted against time of diagnoses. If not stated otherwise, analyses were performed using R 3.4.1 (see Web Resources).

Results

In the current study, we report results obtained from 28,260 genotyped and unrelated samples of inferred European ancestry with available integrated ICD9-based EHR data. The study sample contains 53.5% females and the mean age is 54 years (see Table 1 for summary). We conducted our initial analysis on 12 cancer traits that after quality control had at least five independent risk variants in the NHGRI EBI GWAS Catalog and more than 250 case subjects in our cohort (Table 2, Table S2). Table 2 summarizes data on 8,423 distinct individuals that were affected by at least 1 of the 12 cancers. Of these patients, 6,398 had one cancer, 1,574 had two cancers, and 451 had more than two cancer sites involved.

Correspondence of MGI Effect Estimates with Those Reported in GWAS

To assess the calibration properties of the 12 ICD9-based cancer case-control studies, we first compared the concordance of observed effect estimates (log odds ratios) from MGI with published effect estimates reported in the NHGRI EBI GWAS Catalog.

We found strong positive correlation (estimated Pearson’s correlation coefficient [ρˆ] > 0.6) between the MGI and GWAS reported estimates for 9 of the 12 cancers: female breast cancer (78 SNPs; [ρˆ] = 0.67 [95% CI: 0.53,0.78]), prostate cancer (PCa; 93 SNPs; [ρˆ] = 0.81 [0.73,0.87]), melanoma (16 SNPs; [ρˆ] = 0.92 [0.77,0.97]), basal cell carcinoma (19 SNPs; [ρˆ] = 0.88 [0.71,0.95]), bladder cancer ([MIM: 109800]; 16 SNPs; [ρˆ] = 0.65 [0.22,0.86]), squamous cell carcinoma (5 SNPs; [ρˆ] = 0.95 [0.39,1]), lung cancer ([MIM: 211980]; 9 SNPs; [ρˆ] = 0.90 [0.6,0.98]), thyroid cancer (9 SNPs; [ρˆ] = 0.79 [0.26,0.95]), and cancer of brain and nervous system (9 SNPs; [ρˆ] = 0.79 [0.26,0.95]) (Tables 2 and S5; Figures 1 and S3).

Figure 1.

Figure 1

Calibration of Association Parameters

Calibration of association parameters between the MGI-GWAS and NHGRI-EBI GWAS Catalog derived effect estimates [log(OR)] for (A) breast cancer (females only), (B) cancer of prostate, (C) melanoma, (D) basal cell carcinoma, (E) squamous cell carcinoma, and (F) thyroid cancer. The agreement of two sets of SNP-specific beta coefficients (non-reference allele is the effect allele), their Pearson correlation (coefficient ρˆ, incl. 95% confidence interval and p) and Lin’s correspondence correlation (coefficient CCC; incl. 95% confidence interval) are shown; dashed line indicates perfect concordance; solid line indicates fitted line.

Cancer GWASs in MGI

After having established strong positive correlation for 9 of the 12 cancer traits and thus a phenotype quality that appears to be in line with their corresponding published GWASs, we performed for each of these 9 cancers a GWAS to explore our ability to replicate and/or uncover cancer risk variants in a genome-wide setting. For the 9 cancers, we could replicate a total 55 of the 253 included risk SNPs with consistent effect orientation with p < 0.05 after correcting for the number of SNPs per phenotype (Table S7). We found genome-wide significant signals (p < 5 × 10−8) for female breast cancer, melanoma of skin, basal cell carcinoma, squamous cell carcinoma, and thyroid cancer. All but one of the genome-wide signals were found in loci already reported in the GWAS catalog for the corresponding cancer trait or related phenotypes. For instance, the four melanoma of skin loci with risk variants near SLC45A2 (MIM: 606202), IRF4 (MIM: 601900), MC1R (MIM: 155555), and ASIP/RALY (MIM: 600201) were previously reported to be associated with melanoma, non-melanoma skin cancer, squamous cell carcinoma, or basal cell carcinoma.37, 38, 39, 40, 41 Also, the two breast cancer risk loci near FGFR2 (MIM: 176943) and FGF3/FGF4 (MIM: 164950/164980) as well as the thyroid cancer risk loci near NRG1 (MIM: 142445) and FOXE1 (MIM: 602617) were previously described.42, 43, 44, 45 The only potentially novel finding was the SNP rs77909434 on chromosome 13 showing borderline genome-wide association with melanoma (MAF in case subjects = 5.3%; MAF in control subjects = 3.4%; p = 1.5 × 10−8) located 53 kb downstream of the Fibroblast Growth Factor 9 gene (FGF9 [MIM: 600921]) on chromosome 13. Since multiple phenotypes were involved in the genome-wide explorations, this SNP would not have passed the Bonferroni multiple testing correction for multiple GWASs. Further exploration in larger studies are warranted to substantiate this suggestive finding. We present GWAS Manhattan and QQ plots for all nine cancer traits in Figure S4.

Owing to the smaller sample sizes compared to the studies included in the NHGRI-EBI GWAS Catalog, only 8 of 253 catalog SNPs exceeded the genome-wide significance (Table S5). However, we found cataloged risk SNPs in Table 2 were markedly enriched in the top 1% of GWAS associations, especially for the larger case-control studies. For example, 27 out of the 93 GWAS Catalog PCa risk SNPs fall in the top 1% of associated SNPs in the MGI GWAS (with p < 0.0083) (Table S8).

Replicability of PRS Primary Cancer Association

PRS integrates multiple SNPs, weighted by prior effect estimates and is expected to substantially improve the power to detect an association compared to an analysis with individual variants. To evaluate the association of PRS with the primary cancer trait, we estimated the OR for patients in the top risk quartile compared to the bottom quartile (PRS OR). Six of the 12 cancer PRS revealed an at least 2-fold enrichment of case subjects with pQ1vsQ4 < 2.0 × 10−10, all of which also showed strong positive correlation between the MGI and GWAS reported estimates (see above): female breast cancer (PRS OR = 2.3 [95% CI: 2.0;2.7], pQ1vsQ4 = 2.5 × 10−29), prostate cancer (PRS OR = 3.3 [95% CI: 2.7;3.9], pQ1vsQ4 = 3.7 × 10−43), melanoma (PRS OR = 2.4 [95% CI: 2.0;2.8], pQ1vsQ4 = 2.6 × 10−31), basal cell carcinoma (PRS OR = 2.7 [95% CI: 2.2;3.2], pQ1vsQ4 = 1.1 × 10−27), squamous cell carcinoma (PRS OR = 2.0 [95% CI: 1.6;2.5], pQ1vsQ4 = 2.0 × 10−10), and thyroid cancer (PRS OR = 3.2 [95% CI: 2.5;4.5], pQ1vsQ4 = 1.8 × 10−18) (Figures 1A–1F, Table 2).

The corresponding p values obtained from Firth’s bias-reduced logistic regression using continuous PRS were even stronger as expected and indicated that these six cancer traits would withstand a Bonferroni multiple testing correction in a phenome setting (1,711 traits; pPRS < 2.9 × 10−5): female breast cancer (pPRS = 3.6 × 10−37), prostate cancer (pPRS = 3.8 × 10−69), melanoma (pPRS = 6.7 × 10−36), basal cell carcinoma (pPRS = 3.3 × 10−44), cutaneous squamous cell carcinoma (SCC, pPRS = 1.8 × 10−18), and thyroid cancer ([MIM: 188550]; pPRS = 4.8 × 10−19) (Tables 2 and S3). We excluded the remaining six cancer traits from further investigation, because in this initial analysis they showed only little or moderate association (PRS OR < 1.5) and consequently only modest power for subsequent exploration of phenome-wide associations (Table 2).

PRS PheWAS

Next, we evaluated each of the six remaining PRS that were strongly associated with the primary cancer trait across a collection of 1,711 EHR-derived phenotypes (not limited to cancer traits) with at least 20 case subjects each (Table S1). For each of the six cancer PRS, we found strongest associations with their primary traits, except for squamous cell carcinoma PRS which revealed its strongest association with the more general skin cancer trait definition (pPRS = 7.2 × 10−61) (Figure 2). Overall, we found no or little sign for inflation in our PheWAS results (median chi-square based Lambda ≤ 1.16). Notably, we observed deflation for some PRS PheWASs that might be caused by lack of power especially for the phenotypes with small number of case subjects (Figure S5). We displayed the results from the three types of sensitivity PheWAS analyses next to the original results: conditional logistic regression results from 1:10 matched (for age, sex, and first four PCs) case-control studies; adjusting for 20 principal components; and using unweighted sum of risk allele counts instead of weighted PRS (Figures S6–S11). The results remained robust with respect to these design and analytic choices.

Figure 2.

Figure 2

PRS PheWAS Plots

PRS PheWAS plots for (A) breast cancer (females only), (B) cancer of prostate, (C) melanoma, (D) basal cell carcinoma, (E) squamous cell carcinoma, and (F) thyroid cancer. 1,711 traits are grouped into 16 color-coded categories as shown on the horizontal axis; the p values for testing the associations of PRS with the traits are minus log-base-10-transformed and shown on the vertical axis. Triangles indicate phenome-wide significant associations with their effect orientation (up, risk increasing; down, risk decreasing). PRS upon multiplicity adjustment (see Subjects and Methods). The solid horizontal line for p = 2.9 × 10−5 cut-off.

Secondary Associations

In addition, we identified for each PRS associations with secondary traits besides their primary traits (Figures 2A–2F, Table S9). For example, we observed associations of the three skin cancer PRSs (PRS for melanoma, basal cell carcinoma, and squamous cell carcinoma) with overall skin cancer and other skin cancer subcategories—expected due to their overlapping SNP sets—but also significantly associated with multiple dermatologic phenotypes, e.g., actinic keratosis (pPRS < 1.2 × 10−10) and other degenerative skin conditions or disorders, all potential pre-cancer stages (Figures 2C–2E).

Similarly, the female breast cancer PRS was associated not only with breast cancer (pPRS = 3.6 × 10−37) but also with acquired absence of breast (pPRS = 2.4 × 10−14), abnormal mammogram (pPRS = 1.3 × 10−8), benign neoplasms of the breast (pPRS = 1.8 × 10−7), and benign mammary dysplasias (pPRS = 1.2 × 10−5) (Figure 2A). The PRS originally constructed for prostate cancer was associated with prostate cancer (pPRS = 3.8 × 10−69), as expected, but also with four additional traits: elevated prostate specific antigen (pPRS = 9.3 × 10−27), erectile dysfunction (pPRS = 6.3 × 10−15), urinary incontinence (pPRS = 6.6 × 10−11), frequency of urination and polyuria (pPRS = 2.9 × 10−6), and hyperplasia of prostate (pPRS = 3.6 × 10−6) (Figure 2B, Table S9).

While all of the above mentioned secondary trait associations were in the same effect orientation as their primary traits, i.e., increasing PRSs were associated with increased risk for the secondary trait, we observed an association of increasing thyroid cancer PRS with decreased risk for hypothyroidism (pPRS = 7.0 × 10−10) (Figure 2F).

Exploring Secondary PRS PheWAS Associations via Exclusion PRS PheWAS

Since we already applied exclusion criteria to the control subjects during our phenome generation, e.g., individuals with elevated prostate-specific antigen levels were excluded from being control subjects for prostate cancer and vice versa, we could not adjust for the primary cancer traits as a predictor in logistic regression models to identify independent secondary PRS PheWAS associations due to the issue of complete separation. To alternatively explore the secondary associations in PRS PheWAS (Figure 2), we proposed and performed exclusion PRS PheWAS by removing subjects affected with the cancer or related cancer traits for which the PRS was constructed. After removing all breast cancer case subjects (n = 1,894), no association with breast cancer PRS remained significant, i.e., acquired absence of breast (pPRS = 0.52), abnormal mammogram (pPRS = 0.76), or benign neoplasms of the breast (pPRS = 0.49), indicating that the secondary trait associations were driven by the primary trait (Figure S12A). However, we noted that the majority of case subjects of the non-neoplasm phenotype “acquired absence of breast” (>94.4%; 624 of 661) were removed in this step as they are highly correlated with breast cancer. We made similar observations for prostate cancer PRS where none of the previously detected secondary trait associations remained phenome-wide significant after removing all 1,425 prostate cancer case subjects (Figure S12B).

In contrast, we found a markedly stronger association between hypothyroidism and thyroid cancer PRS after removing 472 thyroid cancer case subjects (pPRS = 4.7 × 10−19) compared to the full analysis (pPRS = 7.0 × 10−10) which is consistent with the effect orientations between thyroid cancer PRS and hypothyroidism (Figure 2F).

To account for the substantial overlap between skin cancer subtypes, e.g., 253 of the 1,404 individuals affected with melanoma are also affected by basal and/or squamous cell carcinoma (Figure S13) and to account for the likely intensified skin cancer screening of individuals that were diagnosed with skin cancer once in their life time, we excluded any type of skin cancer (n = 3,910) and repeated the PheWAS for melanoma, basal cell carcinoma, and squamous cell carcinoma PRS. After doing so, only actinic keratosis remained statistically associated with squamous cell carcinoma PRS while all of the previously observed associations mainly driven by skin cancer diagnoses disappeared (Figures 2C–2E and S12C–S12E). The association between squamous cell carcinoma PRS and actinic keratosis was less pronounced after excluding skin cancer cases but still remained phenome-wide significant (pPRS = 2.3 × 10−36 versus pPRS = 1.1 × 10−12).

To further understand the discovered secondary associations in the PRS PheWAS analyses (Figures 2 and S12), we conducted a simple follow-up analysis by stratifying the data into PRS deciles. We discuss selected secondary trait associations only for the prostate cancer (PCa), squamous cell carcinoma (SCC), and thyroid examples in the main text and relegate their comprehensive analysis and a similar analysis of breast cancer, melanoma, and basal cell carcinoma PRS to the supplemental material (Table S10). For prostate cancer, we stratified a total of 12,026 male individuals in MGI with age ≥ 30 years into deciles of PCa PRS. The observed PCa PRS associations in the PheWAS analysis are further supported by their respective increasing trait prevalences that are observed across 10 PCa PRS decile-stratified strata (Figure 3A; Table S10). These strata were not adjusted for confounders, but it is less likely that the PRS is strongly associated with other covariates. A striking observation is that the proportion of PCa cases in lowest versus highest decile of PCa PRS is 5.4% versus 23.4% (Δ = 18.0% [95% CI, 15.2 to 20.7%]; p = 2.4 × 10−35), emphasizing that the PRS can distinguish well between high- and low-risk individuals in a realistic academic medical center population.

Figure 3.

Figure 3

Proportion of Primary and Secondary Traits Stratified by PRS Deciles

Percentage of primary and selected secondary traits in each cancer PRS category for (A) prostate cancer, (B) squamous cell carcinomas, and (C) thyroid cancer. Observed percentages in the MGI database as represented by the height of bars for each of ten increasing decile-stratified PRS strata from left to right. The PRS’s underlying trait is shown on top and secondary traits below with (blue) and without (green) overlapping relevant cancer cases. Only individuals with age ≥ 30 years were included in each analysis, and the prostate cancer PRS example includes only male individuals (see Table S10 for detailed sample sizes and percentages).

Focusing on the secondary traits that reached phenome-wide significance with the PCa PRS, all these traits are known to be associated with PCa: erectile dysfunction (ED), urinary incontinence (UI) (which commonly follows invasive surgical removal of the prostate), and elevated prostate-specific antigen levels (ePSA) (which is a known biomarker for an increased PCa risk, being closely monitored after prostatectomy). For example, when comparing the lowest versus the highest PRS risk decile, we found significant differences for ePSA (3.9% versus 11.2%; Δ = 7.2% [95% CI, 5.1 to 9.4%]; p = 3.0 × 10−11), ED (9.7% versus 17.1%; Δ = 7.4% [95% CI, 4.6 to 10.2%]; p = 1.5 × 10−7), and UI (4.7% versus 11.9%; Δ = 7.1% [95% CI, 4.8 to 9.3%]; p = 4.2 × 10−10). To test whether these associations are early indicators for PCa or whether they are driven by the fact that subjects affected by these secondary traits are also PCa-affected case subjects (perhaps as a side effect of PCa treatment), we removed PCa-affected case subjects and evaluated secondary disease prevalence across PCa PRS deciles. By doing so, prevalence of all secondary traits became roughly constant across PRS strata (Figure 3A; Table S10) and can be illustrated by the comparison of the proportions of the lowest versus the highest PRS risk decile: ePSA (2.8% versus 2.2%, Δ = −0.6% [95% CI, −1.9 to 0.8%]; p = 0.44), ED (7.6% versus 6.7%, Δ = −1.1% [95% CI, −3.1 to 1.1%]; p = 0.38), and UI (3.0% versus 2.8%, Δ = 0.2% [95% CI, −1.6 to 1.3%]; p = 0.90). Based on these observations, we hypothesize that the association of PCa PRS on the secondary traits ePSA, ED, and UI were driven by the PCa diagnosis, through either prior symptoms of PCa or prescribed medication, chemotherapy, or surgical procedures for prostate removal (Table S10).

For SCC PRS stratification, there was a gradual increase of individuals affected with SCC with increasing PRS risk deciles, a trend that was also noted for actinic keratosis, dermatitis due to solar radiation, and seborrheic keratosis (Figure 3B). However, when excluding case subjects with skin cancer, the upward trend for the latter two phenotypes disappeared. The previously observed difference between the top and bottom PRS risk decile of individuals affected with actinic keratosis (5.5% versus 13.2%, Δ = 7.7% [95% CI, 6.1 to 9.3%]; p = 4.0 × 10−21) was markedly reduced after excluding skin cancer case subjects but still remained significant (2.8% versus 5.1%, Δ = 2.4% [95% CI, 1.3 to 3.5%]; p = 1.6 × 10−5) (Table S10), suggesting the potential for common genetic risk profiles between SCC and actinic keratosis. Since actinic keratosis is a known precursor for squamous cell carcinoma,46 our approach indicated that it is possible to identify phenotypic risk factors through phenome-wide association scans and careful follow-up investigation of primary and secondary diagnoses.

Finally, we found an attenuated association between increasing thyroid cancer PRS and reduced risk for hypothyroidism: within all 25,681 samples ≥ 30 years of age, the difference between bottom and top decile was Δ = −3.5% ([95% CI, −5.4 to −1.6%]; 15.1% versus 11.5%; p = 2.5 × 10−4) and after excluding 452 thyroid cancer case subjects, it increased to Δ = −5.3% ([95% CI, −7.1 to −3.5%]; 14.4% versus 9.1%; p = 4.5 × 10−9) (Table S10). Several studies previously reported genetic overlap of a subset of thyroid cancer risk variants and variants associated with serum levels of thyroid stimulating hormone (TSH), which matches the current observed association between thyroid cancer risk and risk for hypothyroidism.44, 45

To further our understanding of the observed secondary associations, we take advantage of the temporally resolved electronic health records data and explore the temporal order in which the diagnoses appear. Figure 4 shows that actinic keratosis diagnosis mostly precedes the diagnosis of squamous cell carcinoma, sometimes by even 10 years. Erectile dysfunction or hypothyroidism, known side effects of treatment of prostate and thyroid cancer (respectively), are mostly identified within a short time frame of primary cancer diagnosis. In contrast, elevated PSA, used as a screening tool for prostate cancer with known shared genetic correlation, is observed mostly prior to a prostate cancer diagnosis and also after treatment as a prognostic marker. Having access to the electronic health records enables us to explore these temporally ordered data patterns and understand the explanation behind these secondary associations.

Figure 4.

Figure 4

Temporal Order of Diagnoses

Order was as follows: (A) elevated PSA levels (ePSA) and PCa in 452 individuals with PCa and ePSA, (B) erectile dysfunction (ED) and prostate cancer (PCa) in 575 individuals with ED and PCa, (C) actinic keratosis (AK) and squamous cell carcinoma (SCC) in 286 individuals with AK and SCC, and (D) hypothyrodism (HT) and thyroid cancer (TCa) in 298 individuals with HT and TC. The time of the first non-cancer diagnosis relative to the cancer diagnosis is shown in weeks, before (blue) and after (red) the cancer diagnosis.

Discussion

Integration of large-scale biorepositories such as genetic data with EHRs are becoming increasingly common and indispensable for next-generation etiology studies. In this paper, we proposed, demonstrated, and tested trait-specific PRS that summarize the results of large population-based GWASs toward cancer risk prediction in an actual academic medical center population managed by Michigan Medicine. Data repositories like MGI allow us to explore many traits simultaneously whereas population-based case-control studies focus on one specific trait. It is indeed encouraging that the results of population-based studies corroborate with the phenotypes computed from EHR data. We found improved trait prediction power of the composite PRS compared to single-SNP analyses. We also replicated cataloged associations of SNPs for some cancer traits, observed excellent correspondence of effect estimates, and discovered secondary trait associations with cancer PRS that were not driven by the primary cancer diagnoses.

In this comprehensive PRS PheWAS that focused on cancer, we introduced a sequence of analytic strategies. We presented a principled framework and quality-control pipeline to create a PRS from a large curated, public database and to perform PRS PheWASs in a potentially biased sample. We introduced a primary PheWAS using Firth’s bias-reduced logistic regression, which has the advantage of resolving the problem of separation in logistic regression and providing well-controlled type I error rates for unbalanced case-control studies with relatively small sample counts (see Logistf in Web Resources).31, 47 These issues are often present in large EHR-based phenomes where control subjects are frequently hundredfold more abundant than case subjects. In addition, we conducted thorough sensitivity analyses to check the robustness of our findings by using PheWASs with unweighted risk allele counts, adjusting for 20 PCs and PRS PheWASs based on matched control subjects. All our reported results remained robust under these sensitivity analyses.

To distinguish PRS-trait associations that truly derive from a shared genetic risk profile from secondary associations that are potentially driven by the primary trait (for example urinary incontinence or erectile dysfunction following prostate cancer treatment), we further introduced a modified PRS PheWAS approach that excludes the PRS’s underlying cancer traits. While reducing overall sample size, this “exclusion PRS PheWAS” approach is statistically preferable in contrast to a PRS PheWAS that conditions on the primary cancer trait. A conditional PheWAS approach is often affected by unilaterally applied exclusion criteria of control subjects that occur during the phenome construction, e.g., PCa case subjects were excluded from being eligible control subjects for elevated PSA levels and vice versa. Our approach could directly discard trait associations driven by the primary cancer diagnosis and has the potential to identify clinically useful diagnostic traits among many that are conveniently measured in panel tests of biomarkers. When an association with a secondary trait disappears by removing the primary cancer cases in an exclusion PheWAS, there can be several alternative explanations: truly shared genetic correlation, intensified screening/examination due to detection of an initial cancer, a screening biomarker/pre-cancer phenotype, or simply post treatment effects. We used the temporal ordering of the diagnoses to understand which of the above explanations appear plausible for a given scenario. Further exploration of our findings in larger biobank studies, like the UK Biobank study, is warranted and will empower a deeper understanding of relevant pre-cancer traits.48

There are several limitations to the current study. We decided to rely on the associations reported in the NHGRI-EBI GWAS Catalog instead of focusing on the latest and largest GWAS specific for each cancer trait. Our rationale for choosing the NHGRI-EBI GWAS Catalog as our source for extracting summary statistics were primarily three-fold. (1) Data quality: Summary statistics in the GWAS Catalog underwent a detailed expert curation and harmonization28, 49 that avoids redundancy and allows reliable SNP position extraction and most importantly ancestry matching. We wanted to use a database that is publicly accessible and applies the same set of criteria to update reported results across a wide variety of phenotypes. (2) Reproducibility: We provided detailed instructions on how to extract and filter GWAS Catalog summary statistics to construct PRS. This will allow interested readers to easily apply our approach to the regularly updated GWAS catalog versions or to a different ancestry group and/or broad set of disease categories without requiring detailed and deep literature searches that could be somewhat subjective. (3) Scalability to multiple phenotypes: One can construct PRS for specific cancers of primary interest from the latest GWAS meta-analyses following the same prescriptions we provided. Using the latest GWAS result is likely to enhance power of a PRS PheWAS. Similarly, using a PRS that is based on a truly polygenic model with many more variant (or the entire genome) instead of considering the GWAS hits may reveal new associations.

We restricted our analysis to GWAS results from studies of broad European ancestry to match them to our cohort of predominantly European ancestry and to allow an extra filtering of potential swaps in directionality of risk allele in published GWASs that otherwise could have negatively affected the correlative properties of our constructed PRS. One could modify or extend construction of PRS based on global ancestry, functionality of the variant, and use of other weighting schemes. Stratifying the present analysis by young-onset cancers, metastatic/aggressive cancers, or tumor subtype will shed further insight into cancer biology, cancer genetics, and specificity of the PRS-cancer association. We have mostly ignored the temporal ordering in the diagnoses codes by defining dichotomous phenotypes of interest. Exploring the time-stamped data in greater detail may be instrumental in understanding the secondary associations like the negative association between hypothyroidism and thyroid cancer PRS.

Though we note some very encouraging and promising results for the cancer traits with modest number of case and control subjects and with a larger number of variants reported in the NHGRI-EBI GWAS catalog, we also note that the correlation of effect estimates or the PRS-cancer association was not very strong for some cancers (Table 2). This could be due to limited sample size/power, heterogeneity in the definition of the cancer phenotype, incomplete specification of PRS, differences in allele frequencies in the MGI population, or misclassification of ICD9 codes. To address concerns with misclassification, we conducted detailed chart review of 50 randomly sampled case subjects with at least one cancer PheWAS code and verified their primary and secondary cancer diagnosis. We could verify 149 of the 151 diagnoses and found 49/50 patients to have accurate record of their cancer diagnosis. Based on this we conclude that the rate of misclassification will likely be low for ICD9 codes associated with cancer.

In this paper, we have focused on cancer traits. The low misclassification rate of cancer traits, typical within academic health and cancer centers, along with effective sensitivity analyses partly protect the results against imprecise case definitions and confounding. For non-cancer disease traits, more stringent ICD9-defined cases, e.g., by repeated ICD9 diagnoses, of adequate sample sizes might alleviate the biases from case misclassification. Future analysis will need to control for potentially different levels of misclassification error across phenotypes.

Our phenome comprised a total of 1,711 ICD9-based phenotypes and by its implemented design of hierarchical phenotypes with different levels of specificity induce a certain degree of redundancy. While we applied the multiple testing correction for 1,711 performed tests, we acknowledge that this threshold might be too conservative. For example, we estimated a maximal set of 1,452 phenotypes with all pairwise correlations r2 < 0.5 before applying any exclusion criteria to the control subjects. In addition, the PheWAS approach often applies similar exclusion criteria to related phenotypes and thereby further reduces the observable independence of case-control studies. Future studies are needed to determine the effective number of independent tests in such a phenome-wide analysis.

Besides the ICD9 codes used for case and control definitions, EHR databases generally contain vast amount of additional patient information including ICD10 codes, temporal laboratory tests, drug prescriptions, inpatient and outpatient records, etc. Future analyses that leverage these heterogeneous data sources that might be predictive of disease outcomes could further improve disease risk predictions. It will be interesting to study whether PRS for cancer risk behaviors like smoking, alcohol, and obesity predict cancer phenotypes. Tailored and validated models capable of integrating multiple sources of molecular and environmental data for predicting risks of disease will be crucial.

Acknowledgments

The authors acknowledge the Michigan Genomics Initiative, University of Michigan Precision Health, and the University of Michigan Medical School Central Biorepository for providing the collection, biospecimen storage, management, and distribution services in support of the research reported in this publication. This material is based in part upon work supported by the National Institutes of Health/NIH (NCI P30CA046592) and by the National Science Foundation under grant number DMS-1712933. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Published: May 17, 2018

Footnotes

Supplemental Data include 13 figures and 10 tables and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.04.001.

Web Resources

Supplemental Data

Document S1. Figures S1–S13 and Tables S6–S8
mmc1.pdf (3.6MB, pdf)
Table S1. Overview of MGI Phenome (1,711 Case Control Studies)
mmc2.xlsx (100KB, xlsx)
Table S2. Extracted GWAS Catalog Entries and Filtering
mmc3.xlsx (268.4KB, xlsx)
Table S3. Overview of 30 Cancer Traits in MGI that Had Matching NHGRI EBI GWAS Catalog Entries
mmc4.xlsx (14.5KB, xlsx)
Table S4. Underlying ICD9 Codes of the Analyzed Cancer PheWAS Codes
mmc5.xlsx (35.7KB, xlsx)
Table S5. Effect Size Comparison between NHGRI EBI GWAS Catalog Entries and Summary Statistics of the Current Study
mmc6.xlsx (43.6KB, xlsx)
Table S9. PheWAS Results of Analyzed Cancer Traits
mmc7.xlsx (219.5KB, xlsx)
Table S10. Trait Prevalences of Primary and Secondary Phenotypes that Showed Phenome-wide Significant Association with Female Breast Cancer PRS (A), Prostate Cancer PRS (B), Melanoma PRS (C), Basal Cell Carcinoma PRS (D), Squamous Cell Carcinoma PRS (E), and Thyroid Cancer PRS (F) Stratified by 10 Increasing Decile-Stratified PRS Strata
mmc8.xlsx (29.8KB, xlsx)
Document S2. Article plus Supplemental Data
mmc9.pdf (6.1MB, pdf)

References

  • 1.Witte J.S. Genome-wide association studies and beyond. Annu. Rev. Public Health. 2010;31:9–20. doi: 10.1146/annurev.publhealth.012809.103723. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Manolio T.A. Genomewide association studies and assessment of the risk of disease. N. Engl. J. Med. 2010;363:166–176. doi: 10.1056/NEJMra0905980. [DOI] [PubMed] [Google Scholar]
  • 3.Raychaudhuri S. Mapping rare and common causal alleles for complex human diseases. Cell. 2011;147:57–69. doi: 10.1016/j.cell.2011.09.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Denny J.C., Ritchie M.D., Basford M.A., Pulley J.M., Bastarache L., Brown-Gentry K., Wang D., Masys D.R., Roden D.M., Crawford D.C. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010;26:1205–1210. doi: 10.1093/bioinformatics/btq126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Denny J.C., Bastarache L., Ritchie M.D., Carroll R.J., Zink R., Mosley J.D., Field J.R., Pulley J.M., Ramirez A.H., Bowton E. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 2013;31:1102–1110. doi: 10.1038/nbt.2749. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Bush W.S., Oetjens M.T., Crawford D.C. Unravelling the human genome-phenome relationship using phenome-wide association studies. Nat. Rev. Genet. 2016;17:129–145. doi: 10.1038/nrg.2015.36. [DOI] [PubMed] [Google Scholar]
  • 7.Verma A., Verma S.S., Pendergrass S.A., Crawford D.C., Crosslin D.R., Kuivaniemi H., Bush W.S., Bradford Y., Kullo I., Bielinski S.J. eMERGE Phenome-Wide Association Study (PheWAS) identifies clinical associations and pleiotropy for stop-gain variants. BMC Med. Genomics. 2016;9(Suppl 1):32. doi: 10.1186/s12920-016-0191-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Rasmussen L.V., Overby C.L., Connolly J., Chute C.G., Denny J.C., Freimuth R., Hartzler A.L., Holm I.A., Manzi S., Pathak J. Practical considerations for implementing genomic information resources. Experiences from eMERGE and CSER. Appl. Clin. Inform. 2016;7:870–882. doi: 10.4338/ACI-2016-04-RA-0060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Ananthakrishnan A.N., Cagan A., Cai T., Gainer V.S., Shaw S.Y., Churchill S., Karlson E.W., Murphy S.N., Liao K.P., Kohane I. Statin use is associated with reduced risk of colorectal cancer in patients with inflammatory bowel diseases. Clin. Gastroenterol. Hepatol. 2016;14:973–979. doi: 10.1016/j.cgh.2016.02.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Ananthakrishnan A.N., Cagan A., Cai T., Gainer V.S., Shaw S.Y., Churchill S., Karlson E.W., Murphy S.N., Kohane I., Liao K.P., Xavier R.J. Common genetic variants influence circulating vitamin D levels in inflammatory bowel diseases. Inflamm. Bowel Dis. 2015;21:2507–2514. doi: 10.1097/MIB.0000000000000524. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Ananthakrishnan A.N., Cagan A., Cai T., Gainer V.S., Shaw S.Y., Savova G., Churchill S., Karlson E.W., Murphy S.N., Liao K.P., Kohane I. Identification of nonresponse to treatment using narrative data in an electronic health record inflammatory bowel disease cohort. Inflamm. Bowel Dis. 2016;22:151–158. doi: 10.1097/MIB.0000000000000580. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Carroll R.J., Bastarache L., Denny J.C. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics. 2014;30:2375–2376. doi: 10.1093/bioinformatics/btu197. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Mavaddat N., Pharoah P.D., Michailidou K., Tyrer J., Brook M.N., Bolla M.K., Wang Q., Dennis J., Dunning A.M., Shah M. Prediction of breast cancer risk based on profiling with common genetic variants. J. Natl. Cancer Inst. 2015;107 doi: 10.1093/jnci/djv036. djv036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Frampton M., Houlston R.S. Modeling the prevention of colorectal cancer from the combined impact of host and behavioral risk factors. Genet. Med. 2017;19:314–321. doi: 10.1038/gim.2016.101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Szulkin R., Whitington T., Eklund M., Aly M., Eeles R.A., Easton D., Kote-Jarai Z.S., Amin Al Olama A., Benlloch S., Muir K., Australian Prostate Cancer BioResource. Practical Consortium Prediction of individual genetic risk to prostate cancer using a polygenic score. Prostate. 2015;75:1467–1474. doi: 10.1002/pros.23037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Maas P., Barrdahl M., Joshi A.D., Auer P.L., Gaudet M.M., Milne R.L., Schumacher F.R., Anderson W.F., Check D., Chattopadhyay S. Breast cancer risk from modifiable and nonmodifiable risk factors among white women in the United States. JAMA Oncol. 2016;2:1295–1302. doi: 10.1001/jamaoncol.2016.1025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Chatterjee N., Shi J., García-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat. Rev. Genet. 2016;17:392–406. doi: 10.1038/nrg.2016.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Garcia-Closas M., Rothman N., Figueroa J.D., Prokunina-Olsson L., Han S.S., Baris D., Jacobs E.J., Malats N., De Vivo I., Albanes D. Common genetic polymorphisms modify the effect of smoking on absolute risk of bladder cancer. Cancer Res. 2013;73:2211–2220. doi: 10.1158/0008-5472.CAN-12-2388. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Hsu L., Jeon J., Brenner H., Gruber S.B., Schoen R.E., Berndt S.I., Chan A.T., Chang-Claude J., Du M., Gong J. A model to determine colorectal cancer risk using common genetic susceptibility loci. Gastroenterology. 2015;148:1330–1339. doi: 10.1053/j.gastro.2015.02.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Jun G., Flickinger M., Hetrick K.N., Romm J.M., Doheny K.F., Abecasis G.R., Boehnke M., Kang H.M. Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. Am. J. Hum. Genet. 2012;91:839–848. doi: 10.1016/j.ajhg.2012.09.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Kent W.J. BLAT--the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Chang C.C., Chow C.C., Tellier L.C., Vattikuti S., Purcell S.M., Lee J.J. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Manichaikul A., Mychaleckyj J.C., Rich S.S., Daly K., Sale M., Chen W.M. Robust relationship inference in genome-wide association studies. Bioinformatics. 2010;26:2867–2873. doi: 10.1093/bioinformatics/btq559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Wang C., Zhan X., Bragg-Gresham J., Kang H.M., Stambolian D., Chew E.Y., Branham K.E., Heckenlively J., Fulton R., Wilson R.K., FUSION Study Ancestry estimation and control of population stratification for sequence-based association studies. Nat. Genet. 2014;46:409–415. doi: 10.1038/ng.2924. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Li J.Z., Absher D.M., Tang H., Southwick A.M., Casto A.M., Ramachandran S., Cann H.M., Barsh G.S., Feldman M., Cavalli-Sforza L.L., Myers R.M. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–1104. doi: 10.1126/science.1153717. [DOI] [PubMed] [Google Scholar]
  • 26.McCarthy S., Das S., Kretzschmar W., Delaneau O., Wood A.R., Teumer A., Kang H.M., Fuchsberger C., Danecek P., Sharp K., Haplotype Reference Consortium A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016;48:1279–1283. doi: 10.1038/ng.3643. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Welter D., MacArthur J., Morales J., Burdett T., Hall P., Junkins H., Klemm A., Flicek P., Manolio T., Hindorff L., Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–D1006. doi: 10.1093/nar/gkt1229. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.MacArthur J., Bowler E., Cerezo M., Gil L., Hall P., Hastings E., Junkins H., McMahon A., Milano A., Morales J. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog) Nucleic Acids Res. 2017;45(D1):D896–D901. doi: 10.1093/nar/gkw1133. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Auton A., Brooks L.D., Durbin R.M., Garrison E.P., Kang H.M., Korbel J.O., Marchini J.L., McCarthy S., McVean G.A. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Winkler T.W., Day F.R., Croteau-Chonka D.C., Wood A.R., Locke A.E., Mägi R., Ferreira T., Fall T., Graff M., Justice A.E., Genetic Investigation of Anthropometric Traits (GIANT) Consortium Quality control and conduct of genome-wide association meta-analyses. Nat. Protoc. 2014;9:1192–1212. doi: 10.1038/nprot.2014.071. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Heinze G. A comparative investigation of methods for logistic regression with separated or nearly separated data. Stat. Med. 2006;25:4216–4226. doi: 10.1002/sim.2687. [DOI] [PubMed] [Google Scholar]
  • 32.Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80:27–38. [Google Scholar]
  • 33.Lin L.I. A note on the concordance correlation coefficient. Biometrics. 2000;56:324–325. [Google Scholar]
  • 34.Lin L.I. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45:255–268. [PubMed] [Google Scholar]
  • 35.Dey R., Schmidt E.M., Abecasis G.R., Lee S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am. J. Hum. Genet. 2017;101:37–49. doi: 10.1016/j.ajhg.2017.05.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Ho D.E., Imai K., King G., Stuart E.A. MatchIt: nonparametric preprocessing for parametric causal inference. J. Stat. Softw. 2011;42:1–28. [Google Scholar]
  • 37.Nan H., Xu M., Kraft P., Qureshi A.A., Chen C., Guo Q., Hu F.B., Curhan G., Amos C.I., Wang L.E. Genome-wide association study identifies novel alleles associated with risk of cutaneous basal cell carcinoma and squamous cell carcinoma. Hum. Mol. Genet. 2011;20:3718–3724. doi: 10.1093/hmg/ddr287. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Barrett J.H., Iles M.M., Harland M., Taylor J.C., Aitken J.F., Andresen P.A., Akslen L.A., Armstrong B.K., Avril M.F., Azizi E., GenoMEL Consortium Genome-wide association study identifies three new melanoma susceptibility loci. Nat. Genet. 2011;43:1108–1113. doi: 10.1038/ng.959. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Bishop D.T., Demenais F., Iles M.M., Harland M., Taylor J.C., Corda E., Randerson-Moor J., Aitken J.F., Avril M.F., Azizi E. Genome-wide association study identifies three loci associated with melanoma risk. Nat. Genet. 2009;41:920–925. doi: 10.1038/ng.411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Brown K.M., Macgregor S., Montgomery G.W., Craig D.W., Zhao Z.Z., Iyadurai K., Henders A.K., Homer N., Campbell M.J., Stark M. Common sequence variants on 20q11.22 confer melanoma susceptibility. Nat. Genet. 2008;40:838–840. doi: 10.1038/ng.163. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Asgari M.M., Wang W., Ioannidis N.M., Itnyre J., Hoffmann T., Jorgenson E., Whittemore A.S. Identification of susceptibility loci for cutaneous squamous cell carcinoma. J. Invest. Dermatol. 2016;136:930–937. doi: 10.1016/j.jid.2016.01.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Hunter D.J., Kraft P., Jacobs K.B., Cox D.G., Yeager M., Hankinson S.E., Wacholder S., Wang Z., Welch R., Hutchinson A. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat. Genet. 2007;39:870–874. doi: 10.1038/ng2075. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Turnbull C., Ahmed S., Morrison J., Pernet D., Renwick A., Maranian M., Seal S., Ghoussaini M., Hines S., Healey C.S., Breast Cancer Susceptibility Collaboration (UK) Genome-wide association study identifies five new breast cancer susceptibility loci. Nat. Genet. 2010;42:504–507. doi: 10.1038/ng.586. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Gudmundsson J., Sulem P., Gudbjartsson D.F., Jonasson J.G., Masson G., He H., Jonasdottir A., Sigurdsson A., Stacey S.N., Johannsdottir H. Discovery of common variants associated with low TSH levels and thyroid cancer risk. Nat. Genet. 2012;44:319–322. doi: 10.1038/ng.1046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Gudmundsson J., Sulem P., Gudbjartsson D.F., Jonasson J.G., Sigurdsson A., Bergthorsson J.T., He H., Blondal T., Geller F., Jakobsdottir M. Common variants on 9q22.33 and 14q13.3 predispose to thyroid cancer in European populations. Nat. Genet. 2009;41:460–464. doi: 10.1038/ng.339. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Werner R.N., Sammain A., Erdmann R., Hartmann V., Stockfleth E., Nast A. The natural history of actinic keratosis: a systematic review. Br. J. Dermatol. 2013;169:502–518. doi: 10.1111/bjd.12420. [DOI] [PubMed] [Google Scholar]
  • 47.Ma C., Blackwell T., Boehnke M., Scott L.J., GoT2D investigators Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genet. Epidemiol. 2013;37:539–550. doi: 10.1002/gepi.21742. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Sudlow C., Gallacher J., Allen N., Beral V., Burton P., Danesh J., Downey P., Elliott P., Green J., Landray M. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Morales J., Bowler E.H., Buniello A., Cerezo M., Hall P., Harris L.W., Hastings E., Junkins H.A., Malangone C., McMahon A.C. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. bioRxiv. 2017 doi: 10.1186/s13059-018-1396-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Document S1. Figures S1–S13 and Tables S6–S8
mmc1.pdf (3.6MB, pdf)
Table S1. Overview of MGI Phenome (1,711 Case Control Studies)
mmc2.xlsx (100KB, xlsx)
Table S2. Extracted GWAS Catalog Entries and Filtering
mmc3.xlsx (268.4KB, xlsx)
Table S3. Overview of 30 Cancer Traits in MGI that Had Matching NHGRI EBI GWAS Catalog Entries
mmc4.xlsx (14.5KB, xlsx)
Table S4. Underlying ICD9 Codes of the Analyzed Cancer PheWAS Codes
mmc5.xlsx (35.7KB, xlsx)
Table S5. Effect Size Comparison between NHGRI EBI GWAS Catalog Entries and Summary Statistics of the Current Study
mmc6.xlsx (43.6KB, xlsx)
Table S9. PheWAS Results of Analyzed Cancer Traits
mmc7.xlsx (219.5KB, xlsx)
Table S10. Trait Prevalences of Primary and Secondary Phenotypes that Showed Phenome-wide Significant Association with Female Breast Cancer PRS (A), Prostate Cancer PRS (B), Melanoma PRS (C), Basal Cell Carcinoma PRS (D), Squamous Cell Carcinoma PRS (E), and Thyroid Cancer PRS (F) Stratified by 10 Increasing Decile-Stratified PRS Strata
mmc8.xlsx (29.8KB, xlsx)
Document S2. Article plus Supplemental Data
mmc9.pdf (6.1MB, pdf)

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics

RESOURCES