Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2015 Aug 24.
Published in final edited form as: Genet Med. 2014 Oct 23;17(7):536–544. doi: 10.1038/gim.2014.143

Summarizing polygenic risks for complex diseases in a clinical whole genome report

Sek Won Kong 1,8,*, In-Hee Lee 1,8, Ignaty Leschiner 2,8, Joel Krier 2,8, Peter Kraft 3, Heidi L Rehm 4,5,8, Robert C Green 6,8, Isaac S Kohane 1,8, Calum A MacRae 6,8; the MedSeq Project7
PMCID: PMC4547452  NIHMSID: NIHMS713002  PMID: 25341114

Abstract

Purpose

Disease-causing mutations and pharmacogenomic variants are of primary interest for clinical whole-genome sequencing. However, estimating genetic liability for common complex diseases using established risk alleles might one day prove clinically useful.

Methods

We compared polygenic scoring methods using a case-control data set with independently discovered risk alleles in the MedSeq Project. For eight traits of clinical relevance in both the primary-care and cardiomyopathy study cohorts, we estimated multiplicative polygenic risk scores using 161 published risk alleles and then normalized using the population median estimated from the 1000 Genomes Project.

Results

Our polygenic score approach identified the overrepresentation of independently discovered risk alleles in cases as compared with controls using a large-scale genome-wide association study data set. In addition to normalized multiplicative polygenic risk scores and rank in a population, the disease prevalence and proportion of heritability explained by known common risk variants provide important context in the interpretation of modern multilocus disease risk models.

Conclusion

Our approach in the MedSeq Project demonstrates how complex trait risk variants from an individual genome can be summarized and reported for the general clinician and also highlights the need for definitive clinical studies to obtain reference data for such estimates and to establish clinical utility.

Keywords: clinical whole-genome sequencing, common complex disorders, polygenic score, risk alleles

INTRODUCTION

As the cost of sequencing decreases, the clinical utility of whole genome sequence (WGS) is currently undergoing intensive investigation as a tool for precise diagnosis, risk prediction, and therapeutic guidance1; WGS is also undergoing evaluation from ethical and legal perspectives.2,3 The MedSeq project is a randomized clinical trial studying the integration of WGS into clinical care in two specific contexts4: patients from a specialty clinic with a focus on Mendelian forms of inherited cardiomyopathy and patients from a primary-care practice. In each of these clinical settings, pathogenic variants in known Mendelian disease genes, loss-of-function variants in disease-associated genes across the genome, and other actionable variants, including alleles of pharmacogenetic importance, are the major focus of the whole genome report. However, one of the advantages of WGS over whole-exome sequencing (WES) is that the former provides genomic variants in intronic and other non-coding regions, where the majority of common alleles associated with disease traits reside.5,6 As a result, WGS also has the potential, when interpreted in the context of rigorous population data, to enable the efficient estimation of genetic liability for common complex diseases as well as the discovery of possible modifier effects on rare alleles of larger effect size.

One of the most interesting and relevant questions for WGS reporting is in regard to how to define and present data on common alleles associated with increased or decreased risk for certain diseases,7 particularly those with potential therapeutic implications.8 Several approaches might be used to estimate the composite risk for a given trait and to allow its communication to general clinicians. Risk alleles are typically discovered using a case-control design in which the frequency of each allele in cases is compared to that in controls. An allele observed at a higher frequency in cases is considered to be a risk allele and represents a marker for all adjacent variants in linkage disequilibrium.9 Conversely, an allele with a lower frequency in cases is sometimes reported as a “protective” allele. However, the very same allele in different populations may represent distinct haplotypes, whereas case and control definitions are also necessarily imperfect. Thus, without longitudinal cohort studies, it may be difficult to establish the clinical validity of common alleles. In this context, in the current study, we focused on the single-nucleotide polymorphisms (SNPs) more frequently found in cases from genome-wide association studies (GWASs) as listed in the National Human Genome Research Institute GWAS catalogue in the current study.10

An intuitive approach to combine information from several genetic tests is to multiply likelihood ratios with pretest odds of population-specific lifetime disease risk estimates.11,12 However, for the majority of risk alleles, objective likelihood ratios are not available. Polygenic risk scores (PRSs) have been proposed by several investigators7,1317 to combine multiple risk alleles, including those that fail to attain genome-wide significance in association studies, on the basis that there may be genetic epistasis, interaction with environmental factors, or aggregate effects that can be captured.18 To this end, a multiplicative model including seven risk alleles for breast cancer for risk stratification.17 Aggregating the information from a larger number of sub-threshold risk alleles has also been used, testing the classic model of polygenic inheritance.13,16 These studies highlighted the possibility of using polygenic scores in the context of conditioning nongenetic clinical information, although the performances of such PRSs were inconsistent across different diseases.19,20

Although the prediction of disease risk based solely on genotype is not currently standard of care in medical practice, it may soon be useful for patients and clinicians to know whether a patient presents a high-risk genomic profile for a specific trait or disease as compared with the population norm.21,22 This may be the case even when there are no robust independent data regarding the clinical utility of genetic predictors, given the known role of multiple subjective variables in situations of clinical equipoise. Here we summarize multiple risk alleles by calculating a normalized PRS using a population-scale WGS data set from the 1000 Genomes Project (1KGP).23 Our approach demonstrates how complex trait risk variants from individual genomes can be efficiently summarized and reported in a clinical context, highlighting the clinical uncertainties of interpretation while facilitating the use of the available information in clinical decision making.

MATERIALS AND METHODS

Risk alleles

The National Human Genome Research Institute’s GWAS catalog (http://www.genome.gov/admin/gwascatalog.txt) was downloaded on 03/12/2013.10 The catalog contained a total of 9,785 records corresponding to 8,384 risk alleles. We used a series of filtering steps to retain only informative SNPs for the PRS estimates as detailed in Supplementary Figure S1 online. The excluded SNPs with each filtering step can be found at the second to the rightmost column–“Filtering Status”–of Supplementary Table S1. For the risk alleles with odds ratios (ORs) < 1, we followed the GWAS catalog’s inversion of ORs using the alternative alleles as risk alleles. A total of 1,565 risk alleles for 182 traits met our filtering criteria (Supplementary Table S1 online).

To test our approach to the reporting of common allele variations in the MedSeq project, we selected 8 binary phenotypes–abdominal aortic aneurysm, atrial fibrillation, coronary heart disease (CHD), type 3 diabetes (T2D), hypertension, obesity/metabolic syndrome, platelet aggregation, and QT prolongation–that are factors frequently weighed in decision making in both primary care and cardiology subspecialty settings. Quantitative phenotypes were not included due to the inconsistency in phenotype measures and descriptions between studies. A total of 161 risk alleles were then incorporated into PRS estimates for the eight selected phenotypes.

Calculating polygenic risk scores

Several approaches to polygenic risk scoring exist, the majority summing up all risk alleles present in an individual genome and assigning allele-specific weighting. The simplest method is to treat all risk alleles equally, that is, an allele counting method where the weight equals to 1.20 Alternatively, observed effect sizes can be used to weight each risk allele differently.13,16 We calculated a multiplicative PRS (MPRS) as detailed in the Supplementary Materials and Methods online. Briefly, the MPRS for each phenotype was calculated as the product of ORs. Thus, the log (MPRS) is equivalent to the OR-weighted sum of risk allele counts.20 The population attribution risk (PAR) method integrates population allele frequency (AF) and OR.15 A single SNP PAR was established as AFi (ORi – 1)/(AFi × (ORi – 1) + 1), in which AFi is the prevalence of the risk allele at the ith locus in the control population, and ORi is the OR of the risk allele at the ith locus. The multi-SNP PAR was calculated on the basis of the single SNP PAR for each associated SNP: 1 – Π(1 – PARi), in which PARi is the single SNP PAR for the ith locus. The raw scores from counting, and the MPRS and PAR methods were normalized using the median score of the European (EUR) genotypes (N = 392) in the 1KGP, and the ranks of the individual’s score are reported as deciles.

Testing the performance of the MPRS with a GWAS data set

To compare the distribution of polygenic scores between cases and controls, we used the Wellcome Trust Case Control Consortium (WTCCC) phase I data set, which genotyped 16,179 individuals with the Affymetrix GeneChip Human Mapping 500K arrays.24 The details of the WTCCC data set are described in the Supplementary Materials and Methods online. We selected the subset of risk alleles represented on the Affymetrix 500K arrays to calculate the MPRS and performed the analysis after excluding those risk alleles that were originally reported with the WTCCC data set.24 Genotype imputation was not performed since the estimated 5–6% imputation error rate25 might result in significant changes in MPRS decile (see Results). The MPRS percentile for each individual was calculated for each trait against 2,938 controls. As noted, for SNPs in linkage disequilibrium (r2 > 0.5), we chose the risk allele with the largest effect size. The SNPs in the major histocompatibility complex region of chromosome 6–rs6458307, rs9469220, rs615672, rs6457617, rs9272346, and rs9465871–were excluded when calculating MPRS for Crohn disease, type 1 diabetes, and rheumatoid arthritis.

RESULTS

Correlation between different polygenic scoring methods

The numbers of reported risk alleles per trait skewed to the right because a small number of traits were associated with a majority of risk alleles. Risk alleles for multiple sclerosis (n = 105), CD (n = 95), T2D (n = 77), ulcerative colitis (n = 64), and CHD (n = 62) constituted 25.7% of 1,565 alleles. Forty-three traits were associated with a single reported risk allele. The median OR was 1.25 (interquartile range (IQR) 1.15 – 1.45), and 461 risk alleles exhibited ORs of more than 1.45. The majority of risk alleles were found in non-protein coding regions (91.0% of 1,565): 55.7% (872/1565) lie within intergenic regions while 553 (35.3%) were intronic. A total of 103 (6.6%) risk alleles were found in coding regions, and 14 and 23 were mapped to 5’-UTR and 3’-UTR, respectively. The AFs ranged from 0.011 to 0.983 with an average of 0.422. Risk AFs were not listed for 265 loci in the original discovery studies.

We compared the three methods for combining risk alleles: counting, MPRS, and the multi-SNP PAR outlined in the Methods section. For each individual in the 1KGP EUR population (N=379), we calculated polygenic scores for eight cardiac phenotypes: abdominal aortic aneurysm, atrial fibrillation, CHD, T2D, hypertension, obesity/metabolic syndrome, platelet aggregation, and QT prolongation. The scores from three methods showed significant positive correlations for all 8 traits (Kendall’s tau, P < 2.2×10−16; Supplementary Table S2 online); however, the counting method when used with small numbers of risk alleles yielded nonunique scores in 379 EUR individuals (Supplementary Figure S2 online).

To check whether the subgroups of highest genetic risk–i.e., those within the 10th decile–could be consistently defined by different summary methods, we selected two common complex traits–CHD and T2D, which had 62 and 77 risk alleles, respectively, that met our filtering criteria. The percentile rank of each individual was calculated using all three methods, and decile ranks were compared between polygenic scoring approaches. The three methods showed significant positive correlations overall (Figure 1 and Supplementary Table S2 online), with the correlation between MPRS and PAR being the highest (Kendall’s tau = 0.7229 (Figure 1c) and 0.6928 (Figure 1g) for CHD and T2D, respectively). However, identifying subgroups within the 10th decile varied significantly by the summary method used. The concordance rate for 10th decile in CHD PRS was 49% between MPRS and PAR methods (Figure 1d). Among 38 individuals in the 10th decile as ascertained by counting CHD risk alleles, 23 and 16 were in the 10th decile as ascertained by MPRS and PAR methods, respectively (Figure 1d). Similarly, 25 individuals were in the 10th decile as ascertained by counting and PAR for T2D, and 22 were in the 10th decile as ascertained by MPRS and PAR (Figure 1h).

Figure 1. Comparison of polygenic score calculation methods.

Figure 1

Using the risk alleles and allele frequencies reported in the GWAS catalog, we calculated polygenic scores for 379 individuals of the 1000 Genomes Project European cohort. We counted the number of risk alleles in an individual—counting method—and compared with the multiplicative polygenic risk score (MPRS) and multiple single-nucleotide polymorphism (SNP) population attribution risk (PAR) using odd ratios (ORs) and ORs with risk allele frequency, respectively. Red circles represent the individuals in the same decile according to MPRS and PAR. The resulting decile of the counting method was different from those from MPRS and PAR, although they were significantly correlated (c and g). The results for coronary heart disease (60 risk alleles, a–c) and type 2 diabetes (70 risk alleles, e–g) showed the same trend. Venn diagrams show the agreement between polygenic scoring methods for the individuals in the lOth deciles by three methods (d and h). GWAS, genome-wide association study.

PAR provides more intuitive interpretation of genetic risk by combining AF and effect size. However, the prevalence of some risk alleles varies widely across ethnic groups, as indeed may the risk associated with individual alleles. If the AF in the discovery population deviates from the population mean or if the data are from individuals of different ethnic background than the original study, then there may be large effects on the estimated PAR. Thus, at present the validity of PAR is limited for many traits. The validity of counting method is also limited due to nonunique scores for the traits with fewer risk alleles (Supplementary Figure S2a,c,e,f online). Therefore, we chose the normalized MPRS for further evaluation.

There were also significant differences in MPRS distributions among the four ethnic groups. We compared the distribution of the MPRS for each phenotype between ethnic groups using one-way analysis of variance followed by post hoc tests. With the reported risk alleles, 168 of 182 traits analyzed showed significant differences between ethnic groups (Bonferroni corrected analysis of variance P < 0.01, Supplementary Table S3 online), reinforcing the widely held notion that an individual’s polygenic scores can be rigorously interpreted only in the context of the matched ethnic background.

Performance of polygenic scores with a case-control data set

To check the distribution of MPRS in cases as compared with that of controls, we used the WTCCC phase I data set.24 We calculated an MPRS for each individual for seven diseases and two control groups, excluding the risk alleles originally reported for the WTCCC data set (Table 1). The five hypertension risk alleles in the GWAS catalog were not sufficient to rank all cases and controls because of tied scores; otherwise, the distributions of MPRS for six diseases showed significant differences between cases and controls (Tukey’s honestly significant difference (HSD), all P < 0.001 for cases as compared with control groups). For all phenotypes, there was no significant difference of MPRS distributions between 1958 British Birth Cohort and the UK Blood Services cohort (Figure 2). Validating a single risk allele with an independently collected data set often produces inconsistent results26; however, our polygenic score approach successfully identified the overrepresentation of independently discovered risk alleles in cases.

Table 1.

Predictive value of high-risk group defined by the 10th decile of the polygenic score

Disease Number of
risk alleles in
GWAS catalog
Number of risk alleles after
excluding the ones discovered
in the original study
Lifetime
prevalence
in general
population
Relative risk for 10th
decile polygenic score
in WTCCC data set
Positive predictive
value for 10th decile
polygenic score
Risk to
first-
degree
relatives
Bipolar disorder 66 60 2.10% 1.88 5.58% 7–10
Coronary heart disease 71 68 6% 1.43 10.15% 1.60–1.62
Crohn disease 116 107 0.1–16/100,000 1.91 0.04% 30
Hypertension 5 3 28.6% (adults 20 and over) NA NA 2.5–3.5
Rheumatoid arthritis 59 56 0.5–1.0% 1.43 1.74% 2–5
Type 1 diabetes 43 38 0.42% 2.22 1.51% 15
Type 2 diabetes 116 111 7.90% 1.38 12.40% 2–4

Using the Wellcome Trust Case Control Consortium (WTCCC) case-control data set, we compared the distribution of multiplicative polygenic risk scores (MPRSs). A small proportion of risk alleles that were originally reported with the WTCCC data were excluded. Lifetime prevalence and sibling relative risk were retrieved from literature, and positive predictive value (PPV) was calculated for the individuals in the 10th decile for each disease. As compared with the sibling relative risks, PPV was small for the high-risk group according to MRPS, suggesting limited clinical validity.

GWAS, genome-wide association study; NA, not available.

Figure 2. Distribution of polygenic scores in a case-control data set.

Figure 2

The Wellcome Trust Case Control Consortium (WTCCC) phase I data set (N = 16,179 individuals) consisted of two control groups—the 1958 British Birth Cohort (58BC) and common controls recruited from the UK Blood Services (NBS)—and six disease groups; Crohn disease (CD), bipolar disorder (BD), coronary heart disease (CHD), type 1 diabetes (T1D), type 2 diabetes (T2D), and rheumatoid arthritis (RA). We compared the multiplicative polygenic risk score (MPRS) distributions between cases and controls, except for the hypertension group because of the small number of risk alleles (see Table 1). For all phenotypes, no significant difference was found between 58BC and NBS, and the mean MPRS of case groups was significantly higher as compared with the two control groups (Tukey's honestly significant difference P values < 0.001 for all case versus control groups).

Polygenic scores for each phenotype were sorted into 10 bins in the control group, and the score decile of each case was then determined using the score range of 1st to 10th decile in controls. Each bin had ~294 control individuals and different numbers of cases according to the MPRS. As expected, we observed a significant overrepresentation of cases as compared with controls in upper deciles (Supplementary Figure S3 online). For the patients with CD, 27.3% were in 10th decile compared to 2.35% in 1st decile, which resulted in the relative risk of 1.91 in this data set. However, the positive predictive value for those individuals in 10th decile was 0.044% using the upper-bound CD prevalence of 16/100,000.27 Positive predictive value increased with the prevalence of the trait, as summarized in Table 1, and was as high as 12.4% for T2D. Given the relatively low narrow-sense heritability of 0.05–0.10 for T2D,28 the clinical validity of analyzing common risk alleles for unsegmented common diseases is likely to be limited.29 We also measured the performance of polygenic scores using the area under the receiver operating characteristics curve (AUC). Except for CD (AUC 0.704), overall performances of polygenic scores for diseases were poor (AUCs 0.592 (bipolar disorder), 0.622 (coronary artery disease), 0.595 (T2D), 0.604 (T1D), and 0.614 (rheumatoid arthritis), Supplementary Figure S4 online).

Stability of summary method with fewer risk alleles

In light of potential inaccuracy in genotyping, we checked the stability of the MPRS rank of an individual in a population by comparing the original decile using all reported risk alleles with the deciles recalculated using smaller numbers of randomly selected risk alleles. A total of 111 risk alleles were reported for T2D (Table 1), and we randomly selected n risk alleles to recalculate the MPRS and the relevant decile. For the individuals in 10th decile with all 111 alleles, we traced the change of decile ranks with random exclusion of n risk alleles from 1 to 56 (Supplementary Figure S5 online). This procedure was repeated 100 times for each n, and the mean decile was plotted. Excluding 20% of risk alleles (blue dotted line in Supplementary Figure S5 online) did not result in a change of classification by more than two deciles on average; however, 25% of instances were equal to or less than the 9th decile (Table 2). With 50% of risk alleles, only 56.8% were in 10th decile. For other phenotypes with small numbers of risk alleles, excluding a single risk allele could change scores from the highest decile to lower deciles or vice versa.

Table 2.

Stability of polygenic risk scores with fewer risk alleles

−1 −10% −20% −30% −40% −50%
10th decile 98.9 84.6 75.0 71.9 64.1 56.8
9th decile 20 13.2 18.3 17.2 18.5 19.9
8th decile 2.3 4.8 6.5 8.7 10.5
7th decile 0.5 1.8 2.9 4.5 6.1
6th decile 0.2 0.6 1.3 2.5 3.6
5th decile 0.3 0.7 1.3 2.1
4th decile 0.1 0.2 0.7 1.1
3rd decile 0 1 0.1 0.3 0.6
2nd decile 0.2 0.2
1st decile 0.1 0.1

We tested whether the subgroup in the 10th decile remained as the highest risk group with random exclusion of n risk alleles from 1 to 50% because a few risk alleles may have genotyping errors or a proportion of risk alleles can be associated with an increased risk in a specific ethnic group. For each n, we sampled 100 times, and average change of decile was listed. The greener shades represent higher concordance with the original decile using all type 2 diabetes risk alleles (n = 111) as compared to yellow shades. Blank cells represent no observation, and numeric values are mean percentage concordant with the original ranks, i.e., 10th decile.

Summarizing cardiac risk alleles in the clinical context

To summarize polygenic relative risks from known risk alleles for general clinicians and patients, we prepared a report on cardiovascular disease risk from common genetic variation as a part of a Cardiac Supplement to our Genome Report in the MedSeq project.4 The reports include the disease prevalence and narrow-sense heritability in conjunction with an estimated MPRS for a limited number of common cardiac traits of relevance for decision support in both primary prevention and in specialist care of inherited heart disease. For eight traits (abdominal aortic aneurysm, atrial fibrillation, CHD, T2D, hypertension, obesity/metabolic syndrome, platelet aggregation, and QT prolongation) implicated in cardiac diseases with qualitative outcome measures, the effect sizes of risk alleles selected for these cardiac phenotypes were small to moderate (average OR 1.23, range 1.06 – 3.57) (Table 3). We normalized MPRS to the 1KGP data set, including four ethnic groups, to calculate relative risks compared to estimated population norms. Across the four ethnic groups, the number of risk alleles per individual was significantly different (one-way analysis of variance P < 0.0001). The East Asian individuals had more risk alleles (mean ± SD 105.5 ± 4.82) compared to the other ethnic groups (Tukey’s HSD P < 0.0001 for all 3 comparisons). The average number of risk alleles in Admixed American individuals (102.5 ± 5.05) was not significantly different from those of EUR (102.0 ± 4.79) and African (103.6 ± 4.37) (Tukey’s HSD P = 0.0853 and 0.745 respectively) individuals, but the difference between African and EUR individuals was significant (Tukey’s HSD P = 0.0005). The differences were partly attributable to biases in discovery cohorts (Supplementary Table S4 online). More than two-thirds of risk alleles (70.8%) were reported from studies with EUR populations. East Asian (20.5%) and African (6.8%) populations were underrepresented in previous studies. For instance, seven risk alleles associated with obesity were discovered from two independent studies of EUR populations. Of these, five risk alleles–rs10508503, rs2116830, rs988712, rs1805081, and rs1421085–are rare (AF ≤ 0.05) in the African group, and two risk alleles– rs10508503 and rs2116830 –are not present in any East Asian individuals in the 1KGP. The average MPRS in EUR was higher as compared with those of the other ethnic groups (one-way analysis of variance with Dunnett’s post hoc tests with EUR as control, P < 0.001). Thus, an individual in the interquartile range of MPRS in the EUR population might be placed in 9th and 10th decile in the other ethnic groups.

Table 3.

A summary of risk alleles for the cardiac supplement in the MedSeq Project

Phenotype Contextual data Patient results
Population
prevalence of
phenotype for age 56
Proportion of variation in
phenotype liability explained
by common genetic variants
Number
of risk loci
evaluated
Number of
total risk alleles
identified
Polygenic
relative risk
Percentile
rank of
relative risk
Abdominal aortic aneurysm 6% Unknown 3 3/6 1.1 60–70th
Atrial fibrillation 2% 0.1 11 7/22 1.2 60–70th
Coronary heart disease 6% <10% 60 55/120 1.2 50–60th
Type 2 diabetes 13% 5–10% 70 71/140 ≥3.0 90–100th
Hypertension 52% <10% 3 3/6 0.9 30–40th
Obesity 37% 1–2% 7 8/14 1.6 80–90th
Venous thromboembolism Unknown 5–10% 4 4/8 ≥3.0 90–100th
QT prolongation unknown 0.07 3 4/6 ≤0.8 0–10th

The table summarizes the risk alleles conferring small to moderate risk modification for eight cardiac phenotypes. As the data utilized in the analysis were derived from nonlongitudinal association studies, “Relative Risk from Common Genetic Variation” and “Percentile Rank of Relative Risk from Common Genetic Variation” values have been estimated using the 1000 Genomes Project European cohort. The contextual data provide the relative contribution of risk alleles to phenotype. Because the “Proportion of Variation in phenotype Liability Explained by Common Genetic Variants” is less than 10% of total genetic liability, the clinical validity of “Percentile Rank of Relative Risk” is limited and should be interpreted with detailed family and medical history, and lab test results.

Table 3 demonstrates our current format for reporting of the MPRS and the other contextual information outlined above. Age-specific prevalence is also reported, with the proportion of variation in phenotype liability explained by common genetic variants based on the extant literature. The number of risk loci and total risk alleles identified, normalized MPRS truncated at 10 and 90 percentiles for the outlier values, and percentile rank are reported. The clinical application of this result summary (albeit in the absence of objective clinical utility) will be investigated in the MedSeq Project and other longitudinal studies. As such, it will be important also to emphasize the changing context and evolving limitations of genetic risk assessment attributable to common variants. For instance, the estimated heritability of T2D from family studies ranges from 0.3 to 0.6 as compared with the more modest proportion of variation in phenotype liability explained by common genetic variants (0.05–0.1). Although much more rigorous data will be required for the demonstration of formal clinical utility, the combination of a detailed family history, with even current risk predictions for common diseases attributable to common genetic variants, may be informative for clinicians and patients to promote specific health behaviors.

DISCUSSION

Predicting the genetic liability for a particular disease based on the reported risk alleles is currently not useful in medicine practice. Indeed, even alleles with large effect sizes are of little utility for predicting clinically meaningful outcomes. In most common disorders, the contribution of acquired or environmental risk factors is considered to be of much greater importance than the inherited contribution. These limitations of genetic prediction are also a function of the context in which the extant genetic data have been collected; for common phenotypes, the context is usually case-control studies that are not designed or powered to derive the trait’s genetic architecture. For most diseases, rigorous heritability estimates are scant, genetic studies have used low-resolution phenotypes, and outcomes data are incomplete. For all but a few genotypes there are no robust data for clinical utility. If genome sequencing and common genetic variation are to play a substantial role in precision medicine (it is expected that they will), then there will have to be considerable investment in rigorous large-scale studies in clinical cohorts for which validity, clinical utility, and cost-effectiveness can be demonstrated.12,30,31

One of the prerequisites for studies that will be necessary to establish the role of WGS in the clinic is standardized reporting strategies for genome-scale data. These will be required not only to communicate the primary results but also to inform the clinician of additional nongenomic data and to supply the nuanced context necessary for secondary interpretation. In the current study, we have proposed summarizing polygenic risks using the ranks in a population instead of providing absolute disease risk estimates attributable to known risk alleles.17 Clinicians and patients can review the genetic information in the context of the medical and family histories, lifestyle, and laboratory test results. These are all important elements that can condition interpretation of any genotype and frame the doctor-patient relationship for a range of health-promoting behaviors. Thus, an individual with the highest polygenic disease risk may have a modest overall risk once nongenetic factors are considered. Importantly, the reproducibility and stability of risk prediction in such a complex context are likely to limit the clinical validity of genetics.32 Kalf and colleagues compared the three polygenic relative risk prediction methods of current direct-to-consumer genotyping companies33 and found significant discordance. For six multifactorial diseases, the personal genome tests marketed by the three companies had limited predictive ability (atrial fibrillation, T2D, and prostate cancer), a considerable probability (20–27%) of predicting effects in the “opposite” direction (age-related macular degeneration and CD), or substantial differences in absolute risks at the individual level (celiac disease).

There are also some significant limitations to our approach. First, we restricted our model to narrow-sense heritability, aggregating the additive contributions of each risk allele to the phenotype and ignoring potential dependencies between the risk alleles for the same phenotype. As a consequence, estimating genetic risks from multiple risk alleles may overestimate the total heritability or genetic risk. Second, we chose the 1KGP cohort to calculate the background distribution of MPRS, but this cohort contains only a few hundred individuals of each major ethnic group, so the samples were not large enough to accurately match genetic background or to estimate population norms. Third, the original discovery and replication cohorts undoubtedly have biases in population structure and cryptic relatedness34 because the observed levels of MPRS in the 1KGP population were considerably smaller than the expected 3N levels with n risk alleles in our analysis. Indeed, even small numbers of genotyping errors result in significant changes in polygenic risk, as shown in our simulation analysis. Fourth, we also found significant errors throughout the current GWAS catalog. For instance, in some cases risk AF was replaced by the OR, or minor alleles were reported as major, with downstream errors in direction and magnitude of effect. Much more stringent data sets will be necessary for clinical interpretation and decision support. Finally, we did not undertake analysis for detection of copy number or other structural variations in the current study, given the limits of current analytic tools, and the phenotypic associations of such variants are not well established, except for specific oncogenic driver mutations.35 As analytic techniques improve and associations are defined, WGS data sets can be reanalyzed for such structural variants.

Family history remains the most commonly used genetic information in clinical practice. Because collecting family history is an important part of standard medical assessment and can contribute independent genetic information beyond any measured risk alleles, future prospective studies should seek to combine family history and allelic risk predictions. Some such population-scale data sets have accumulated in direct-to-consumer companies over several years and would provide an invaluable resource to the biomedical research community if shared with appropriate privacy protection. The successful implementation of genomic medicine will require the systematic collection of phenotypic data and environmental risk factors, drug responses, and quantitative outcomes. The deconvolution even of the limited genotypic data interpretable at present will require vast data sets that can only be mustered by collaborative projects on a global scale. The unstated inference is that for genomic medicine to be rigorously evaluated, it must first be incorporated into general clinical practice, overturning the “evidence-first” strategy of modern medicine.

Supplementary Material

Supplemental Figure 3
Supplementary Materials and Methods
Supplemental Figure 4
Supplemental Figure 5
Supplemental Table 1
Supplemental Table 2
Supplemental Table 3
Supplemental Table 4
Supplementary Figure 1
Supplementary Figure 2

ACKNOWLEDGEMENTS

The MedSeq Project is supported by the National Institutes of Health (NIH) National Human Genome Research Institute (U01-HG006500). Members of the MedSeq Project are as follows: David W. Bates, MD, Alexis D. Carere, MA, MS, Allison Cirino, MS, Lauren Connor, Kurt D. Christensen, MPH, PhD, Jake Duggan, Robert C. Green, MD, MPH, Carolyn Y. Ho, MD, Joel B. Krier, MD, William J. Lane, MD, PhD, Denise M. Lautenbach, MS, Lisa Lehmann, MD, PhD, MSc, Christina Liu, Calum A. MacRae, MD, PhD, Rachel Miller, MA, Cynthia C. Morton, PhD, Christine E. Seidman, MD, Shamil Sunyaev, PhD, Jason L. Vassy, MD, MPH, SM, Brigham and Women’s Hospital and Harvard Medical School; Sandy Aronson, ALM, MA, Ozge Ceyhan-Birsoy, PhD, Siva Gowrisankar, Ph.D., Matthew S. Lebo, PhD, Ignat Leschiner, PhD, Kalotina Machini, PhD, MS, Heather M. McLaughlin, PhD, Danielle R. Metterville, MS, Heidi L. Rehm, PhD, Partners Personalized Medicine; Jennifer Blumenthal-Barby, PhD, Lindsay Zausmer Feuerman, MPH, Amy L. McGuire, JD, PhD, Sarita Panchang, Jill Oliver Robinson, MA, Melody J. Slashinski, MPH, PhD, Baylor College of Medicine, Center for Medical Ethics and Health Policy; Stewart C. Alexander, PhD, Kelly Davis, Peter A. Ubel, MD, Duke University; Peter Kraft, PhD, Harvard School of Public Health; J. Scott Roberts, PhD, University of Michigan; Judy E. Garber, MD, MPH, Dana-Farber Cancer Institute; Tina Hambuch, PhD, Illumina, Inc.; Michael F. Murray, MD, Geisinger Health System; Isaac S. Kohane, MD, PhD, Sek Won Kong, MD, In-Hee Lee, PhD, Boston Children’s Hospital.

Footnotes

DISCLOSURE

The authors declare no conflict of interest.

REFERENCES

  • 1.Biesecker LG, Burke W, Kohane I, Plon SE, Zimmern R. Next-generation sequencing in the clinic: are we ready? Nature reviews. Genetics. 2012 Nov;13(11):818–824. doi: 10.1038/nrg3357. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.McGuire AL, McCullough LB, Evans JP. The indispensable role of professional judgment in genomic medicine. JAMA : the journal of the American Medical Association. 2013 Apr 10;309(14):1465–1466. doi: 10.1001/jama.2013.1438. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Pyeritz RE. The coming explosion in genetic testing--is there a duty to recontact? The New England journal of medicine. 2011 Oct 13;365(15):1367–1369. doi: 10.1056/NEJMp1107564. [DOI] [PubMed] [Google Scholar]
  • 4.Vassy JL, Lautenbach DM, McLaughlin HM, et al. The MedSeq Project: a randomized trial of integrating whole genome sequencing into clinical medicine. Trials. 2014 Mar 20;15(1):85. doi: 10.1186/1745-6215-15-85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Dewey FE, Grove ME, Pan C, et al. Clinical interpretation and implications of whole-genome sequencing. JAMA : the journal of the American Medical Association. 2014 Mar 12;311(10):1035–1045. doi: 10.1001/jama.2014.1717. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America. 2009 Jun 9;106(23):9362–9367. doi: 10.1073/pnas.0903103106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Lyssenko V, Jonsson A, Almgren P, et al. Clinical risk factors, DNA variants, and the development of type 2 diabetes. The New England journal of medicine. 2008 Nov 20;359(21):2220–2232. doi: 10.1056/NEJMoa0801869. [DOI] [PubMed] [Google Scholar]
  • 8.Manolio TA. Bringing genome-wide association findings into clinical use. Nature reviews. Genetics. 2013 Aug;14(8):549–558. doi: 10.1038/nrg3523. [DOI] [PubMed] [Google Scholar]
  • 9.Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genetics in medicine : official journal of the American College of Medical Genetics. 2002 Mar-Apr;4(2):45–61. doi: 10.1097/00125817-200203000-00002. [DOI] [PubMed] [Google Scholar]
  • 10.Welter D, MacArthur J, Morales J, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic acids research. 2014 Jan;42(Database issue):D1001–D1006. doi: 10.1093/nar/gkt1229. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Yang Q, Khoury MJ, Botto L, Friedman JM, Flanders WD. Improving the prediction of complex diseases by testing for multiple disease-susceptibility genes. American journal of human genetics. 2003 Mar;72(3):636–649. doi: 10.1086/367923. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Ashley EA, Butte AJ, Wheeler MT, et al. Clinical assessment incorporating a personal genome. Lancet. 2010 May 1;375(9725):1525–1535. doi: 10.1016/S0140-6736(10)60452-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Stahl EA, Wegmann D, Trynka G, et al. Bayesian inference analyses of the polygenic architecture of rheumatoid arthritis. Nature genetics. 2012 May;44(5):483–489. doi: 10.1038/ng.2232. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Wray NR, Yang J, Hayes BJ, Price AL, Goddard ME, Visscher PM. Pitfalls of predicting complex traits from SNPs. Nature reviews. Genetics. 2013 Jul;14(7):507–515. doi: 10.1038/nrg3457. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Kraft P, Wacholder S, Cornelis MC, et al. Beyond odds ratios--communicating disease risk based on genetic profiles. Nature reviews. Genetics. 2009 Apr;10(4):264–269. doi: 10.1038/nrg2516. [DOI] [PubMed] [Google Scholar]
  • 16.Purcell SM, Wray NR, Stone JL, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009 Aug 6;460(7256):748–752. doi: 10.1038/nature08185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Pharoah PD, Antoniou AC, Easton DF, Ponder BA. Polygenes, risk prediction, and targeted prevention of breast cancer. The New England journal of medicine. 2008 Jun 26;358(26):2796–2803. doi: 10.1056/NEJMsa0708739. [DOI] [PubMed] [Google Scholar]
  • 18.Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009 Oct 8;461(7265):747–753. doi: 10.1038/nature08494. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Machiela MJ, Chen CY, Chen C, Chanock SJ, Hunter DJ, Kraft P. Evaluation of polygenic risk scores for predicting breast and prostate cancer risk. Genetic epidemiology. 2011 Sep;35(6):506–514. doi: 10.1002/gepi.20600. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Evans DM, Visscher PM, Wray NR. Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk. Human molecular genetics. 2009 Sep 15;18(18):3525–3531. doi: 10.1093/hmg/ddp295. [DOI] [PubMed] [Google Scholar]
  • 21.Pashayan N, Pharoah P. Translating genomics into improved population screening: hype or hope? Human genetics. 2011 Jul;130(1):19–21. doi: 10.1007/s00439-011-0985-x. [DOI] [PubMed] [Google Scholar]
  • 22.Kathiresan S, Melander O, Anevski D, et al. Polymorphisms associated with cholesterol and risk of cardiovascular events. The New England journal of medicine. 2008 Mar 20;358(12):1240–1249. doi: 10.1056/NEJMoa0706728. [DOI] [PubMed] [Google Scholar]
  • 23.The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012 Nov 1;491(7422):56–65. doi: 10.1038/nature11632. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Wellcome Trust Case Control C. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007 Jun 7;447(7145):661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nature reviews. Genetics. 2010 Jul;11(7):499–511. doi: 10.1038/nrg2796. [DOI] [PubMed] [Google Scholar]
  • 26.Ioannidis JP, Thomas G, Daly MJ. Validating, augmenting and refining genome-wide association signals. Nature reviews. Genetics. 2009 May;10(5):318–329. doi: 10.1038/nrg2544. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Lakatos PL. Recent trends in the epidemiology of inflammatory bowel diseases: up or down? World journal of gastroenterology : WJG. 2006 Oct 14;12(38):6102–6108. doi: 10.3748/wjg.v12.i38.6102. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Weedon MN, Clark VJ, Qian Y, et al. A common haplotype of the glucokinase gene alters fasting glucose and birth weight: association in six studies and population-genetics analyses. American journal of human genetics. 2006 Dec;79(6):991–1001. doi: 10.1086/509517. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Hunter DJ, Khoury MJ, Drazen JM. Letting the genome out of the bottle--will we get our wish? The New England journal of medicine. 2008 Jan 10;358(2):105–107. doi: 10.1056/NEJMp0708162. [DOI] [PubMed] [Google Scholar]
  • 30.Patel CJ, Sivadas A, Tabassum R, et al. Whole Genome Sequencing in support of Wellness and Health Maintenance. Genome medicine. 2013 Jun 27;5(6):58. doi: 10.1186/gm462. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Chen R, Mias GI, Li-Pook-Than J, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell. 2012 Mar 16;148(6):1293–1307. doi: 10.1016/j.cell.2012.02.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Ng PC, Murray SS, Levy S, Venter JC. An agenda for personalized medicine. Nature. 2009 Oct 8;461(7265):724–726. doi: 10.1038/461724a. [DOI] [PubMed] [Google Scholar]
  • 33.Kalf RR, Mihaescu R, Kundu S, de Knijff P, Green RC, Janssens AC. Variations in predicted risks in personal genome testing for common complex diseases. Genetics in medicine : official journal of the American College of Medical Genetics. 2013 Jun 27; doi: 10.1038/gim.2013.80. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.McCarthy MI, Abecasis GR, Cardon LR, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature reviews. Genetics. 2008 May;9(5):356–369. doi: 10.1038/nrg2344. [DOI] [PubMed] [Google Scholar]
  • 35.Forbes SA, Bindal N, Bamford S, et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic acids research. 2011 Jan;39(Database issue):D945–D950. doi: 10.1093/nar/gkq929. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental Figure 3
Supplementary Materials and Methods
Supplemental Figure 4
Supplemental Figure 5
Supplemental Table 1
Supplemental Table 2
Supplemental Table 3
Supplemental Table 4
Supplementary Figure 1
Supplementary Figure 2

RESOURCES