Genome-wide association (GWA) studies involve genotyping hundreds of thousands to millions of markers, usually single nucleotide polymorphisms (SNPs), across the human genome to identify the role of common genetic variation (often defined as >5% prevalence of the minor allele) in the manifestation of a variety of traits (phenotypes), ranging from a predisposition to common diseases (eg, cancer, heart disease), to drug response (eg, efficacy, toxicity), to quantitative characteristics (eg, height, blood pressure). These studies have been successful in identifying genetic associations, with more than 1600 published GWA studies on SNPs at a genome-wide significance level (P<5 × 10⁻⁸) for more than 280 traits.1 However, these studies are costly. Data from thousands of people are required to achieve adequate statistical power for the identification of new genetic variants. Large sample sizes are necessary in part because of the sheer number of statistical tests performed (resulting in numerous false-positives) and because of the small effect of each genetic variant (relative risks or odds ratios on the order of 1.1-1.5) commonly observed. Thus, creative ways to maximize reuse of GWA study data to interrogate other traits and outcomes in a cost-effective manner are a high priority.
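The link between the number of tests and the stringent significance level can be made concrete with a small calculation. The Python sketch below is illustrative only: it assumes roughly 1 million independent tests (a round number chosen for illustration, not a figure taken from the studies discussed here) and applies a simple Bonferroni correction, which is one common way to motivate a threshold of 5 × 10⁻⁸. The function and variable names are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that holds the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_false_positives(per_test_alpha: float, n_tests: int) -> float:
    """Expected number of truly null markers that pass the threshold by chance."""
    return per_test_alpha * n_tests


# Assumption: about 1 million independent tests of common variants.
N_TESTS = 1_000_000
ALPHA = 0.05

threshold = bonferroni_threshold(ALPHA, N_TESTS)
uncorrected = expected_false_positives(ALPHA, N_TESTS)
corrected = expected_false_positives(threshold, N_TESTS)

print(f"Bonferroni-corrected threshold: {threshold:.1e}")
print(f"Expected false-positives at P<0.05: {uncorrected:,.0f}")
print(f"Expected false-positives at the corrected threshold: {corrected:.2f}")
```

Under these assumptions, testing every marker at P<0.05 would be expected to yield tens of thousands of chance findings, which is why a far smaller per-test threshold, and therefore a far larger sample, is needed.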
In this issue of Mayo Clinic Proceedings, Bielinski et al2 report on a promising approach for reusing GWA study data. Using large-scale genotyping data from participants in existing studies, they performed a GWA study of a quantitative phenotype (bilirubin level) that was completely independent from the phenotypes proposed in the original studies. Because their phenotype was obtained from the electronic medical record (EMR) and no additional genotyping was needed, this study was performed quickly and inexpensively compared with the standard approach used for GWA studies. Using existing GWA study data and available measures of bilirubin level from the EMR for more than 4000 participants, Bielinski et al identified genome-wide significant associations for serum bilirubin level on chromosome 2 corresponding to the UGT1A cluster and on chromosome 12 in the SLCO1B1 gene. Importantly, this study confirmed results from a previous primary GWA study and linkage studies, highlighting the validity of this approach.
The Mayo Genome Consortia (MayoGC) described by Bielinski et al is a collection of data from GWA studies conducted at Mayo Clinic. Phase 1 of the MayoGC is completed and consists of 6307 participants from 3 GWA studies. Phase 2 will add 11 studies, for available GWA study data on more than 11,500 participants from a single institution (Mayo Clinic).2 The MayoGC is a unique resource, with a common EMR enabling the systematic definition of phenotypes, including longitudinal changes in disease status and quantitative measurements. For example, serial measurements of bilirubin level were available for 58% of the patients in the study by Bielinski et al. For phase 1 of the MayoGC, the median medical record length (ie, the number of years a person has been seen at Mayo Clinic) ranged from 6.3 years in one GWA study group to 29.8 years in another, with 65% of patients having a medical record length greater than 10 years. As demonstrated by this longevity of the medical record, most persons with data in the MayoGC will have had routine blood tests, medical examinations, and screening studies at the recommended ages. The MayoGC provides opportunities to study any available laboratory-based measures using a GWA study. Importantly, in contrast to other GWA studies, those using MayoGC data could analyze genetic variants for a broad range of diseases or traits, allowing insight into pathologic mechanisms across a range of disease types, something not possible in single phenotype studies.3 The Mayo EMR also captures inpatient and outpatient prescription drug information, allowing for pharmacogenetic studies of single and/or multiple drugs. Because one of the criteria for inclusion in MayoGC studies was a written consent for use of genetic data in other research studies, no additional patient contact is required to accomplish these analyses.
The MayoGC was motivated by the Electronic Medical Records and Genomics (eMERGE) Network. The eMERGE Network is an initiative of the National Human Genome Research Institute, which funded collaborations across multiple institutions to conduct GWA studies on phenotypes obtained from EMR data.4 This network seeks to define EMR-derived phenotypes across multiple sites with different EMR systems and populations.5,6 Several similar national efforts are under way, including BioVU (Vanderbilt University), i2b2 (Partners Healthcare System and Harvard Medical School), and the Kaiser Permanente Research Program on Genes, Environment, and Health (RPGEH).3
The most important aspect of the MayoGC and other EMR-based consortia is their cost-effectiveness. Studies can be conducted at one or a small number of institutions with a common EMR, an existing infrastructure for coding disorders and laboratory measures, and experience with natural language processing.7 Although validation of new phenotypes extracted from the narrative text of the EMR by means of natural language processing carries a cost, it is marginal compared with the costs of genotyping and phenotyping for clinical research (eg, sampling an entire cohort[s] for a new outcome).3 Also, many phenotypes have already been defined using the EMR in the course of individual Mayo GWA studies, increasing the efficiencies of the MayoGC collaboration. Finally, as highlighted by Bielinski et al, at least 6% of MayoGC participants are also members of the Mayo Clinic Biobank, another source of both exposure and biospecimen data.2 Blood samples have been collected for research purposes and could be used for exploration of loci for entirely new blood markers.
Initiatives such as the MayoGC and the eMERGE Network allow for the examination of millions of genetic variants with multiple phenotypes, resulting in continued discoveries of genetic loci. These new discoveries show promise for translation to the clinical arena, although the delay in moving these results from laboratory bench to bedside has resulted in some skepticism to date. Such skepticism is understandable. The newly identified SNPs were thought to hold the answer to personalized risk prediction, but, as we currently appreciate, the situation is not that simple. For most diseases or therapies, few genetic loci (10-30 SNPs for most phenotypes) have been identified. Risk models incorporating these genetic loci have shown little to modest improvement in individual risk prediction8,9 because of the small effect of each genetic variant and the limited variability in phenotype explained by these variants. For example, inclusion of 10 SNPs to the Gail model (minus diagnosis of atypical hyperplasia), a model widely used in breast cancer risk prediction, modestly improved performance, with an area under the curve (AUC) of 58.0% for the Gail model factors alone and 61.8% for Gail model plus the 10 SNPs (an AUC of 50% denotes random classification, whereas 100% is perfect classification). However, the ability of the SNP-only (genetic) model to predict risk was similar to that of the Gail model, which considers clinical risk factors only (AUC = 59.7% vs AUC = 58.0%, respectively). With the identification of more realistic numbers of loci (eg, 50, 100, 200, or even thousands), models incorporating genetics may yet provide improvement for personalized risk assessment.
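For readers less familiar with the AUC, the following minimal Python sketch shows how it can be computed as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as one half. The risk scores are made up for illustration and are not data from the Gail model analysis; the function name `auc` is hypothetical.

```python
from itertools import product


def auc(case_scores, control_scores):
    """Fraction of case-control pairs in which the case has the higher score.

    Ties contribute one half, so an uninformative model scores 0.5.
    """
    wins = 0.0
    for case, control in product(case_scores, control_scores):
        if case > control:
            wins += 1.0
        elif case == control:
            wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted risks, for illustration only.
perfect = auc([0.9, 0.8], [0.2, 0.1])        # every case outranks every control
uninformative = auc([0.5, 0.5], [0.5, 0.5])  # model cannot separate the groups
partial = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

print(f"perfect={perfect:.3f}, uninformative={uninformative:.3f}, partial={partial:.3f}")
```

Read this way, an AUC near 60% means that a model ranks a case above a control only somewhat more often than a coin toss would.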
The potential for translation of GWA study results to the clinical arena has been realized in the area of pharmacogenetics. A recent review by Wang et al10 highlights several examples of how GWA study results hold promise for physicians in optimizing drug selection, dose, and treatment duration as well as in preventing adverse drug reactions. They present an example of a GWA study of warfarin dosing11 that resulted in confirmation of previously identified genetic associations (CYP2C9 and VKORC1) and identification of a novel locus at CYP4F2; these 3 genes were also implicated in a GWA study of maintenance dosage of a related anticoagulant, acenocoumarol.10,12 When genotype information on CYP2C9 and VKORC1 was included in clinical studies of warfarin dosing, outcomes improved13 and hospitalizations for hemorrhage were reduced by 28%,14 leading to revisions to the warfarin label by the US Food and Drug Administration.10 Importantly, pharmacogenetic GWA study results present unique challenges for replication because it is often difficult to identify appropriate populations, limiting or delaying the incorporation of genetic results into clinical care.10 Consortia such as the MayoGC, which allow for the mining of drug information and toxicities captured in the EMR, could provide replication of variants found associated with efficacy or toxicity of treatments for more common diseases, such as diabetes and heart disease.
Genome-wide association studies remain a promising avenue for uncovering the genetic basis of health and disease. The continued identification of genetic variants and the characterization of their function could inform future prevention and treatment strategies and, possibly, allow for the targeting of high-risk individuals. Creative use of existing resources such as the MayoGC will enable new discoveries and clinical translation, with cost and time efficiencies. Similar efforts are under way at many medical institutions,3 maximizing the value of these GWA study data and reducing costs. By reducing the amount of necessary genotyping through the reuse and recycling of existing genetic data, GWA studies can go “green.”
Acknowledgments
We would like to thank James R. Cerhan, MD, PhD, and Susan Slager, PhD, for their helpful review of the submitted editorial.
Footnotes
See also page 606
References
- 1. Hindorff LA, Junkins HA, Hall PN, Mehta JP, Manolio TA. A Catalog of Published Genome-Wide Association Studies. http://www.genome.gov/gwastudies/ Accessed May 27, 2011.
- 2. Bielinski SJ, Chai HS, Pathak J, et al. Mayo Genome Consortia: a genotype-phenotype resource for genome-wide association studies with an application to the analysis of circulating bilirubin levels. Mayo Clin Proc. 2011;86(7):606-614.
- 3. Kohane IS. Using electronic health records to drive discovery in disease genomics. Nat Rev Genet. 2011;12:417-428.
- 4. McCarty CA, Chisholm RL, Chute CG, et al; eMERGE Team. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011;4:13.
- 5. Kho AN, Pacheco JA, Peissig PL, et al. Electronic medical records for genetic research: results of the eMERGE Consortium. Sci Transl Med. 2011;3(79):79re1.
- 6. Kullo IJ, Fan J, Pathak J, Savova GK, Ali Z, Chute CG. Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. J Am Med Inform Assoc. 2010;17:568-574.
- 7. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17:507-513.
- 8. Ripatti S, Tikkanen E, Orho-Melander M, et al. A multilocus genetic risk score for coronary heart disease: case-control and prospective cohort analyses. Lancet. 2010;376:1393-1400.
- 9. Wacholder S, Hartge P, Prentice R, et al. Performance of common genetic variants in breast-cancer risk models. N Engl J Med. 2010;362:986-993.
- 10. Wang L, McLeod HL, Weinshilboum RM. Genomics and drug response. N Engl J Med. 2011;364:1144-1153.
- 11. Takeuchi F, McGinnis R, Bourgeois S, et al. A genome-wide association study confirms VKORC1, CYP2C9, and CYP4F2 as principal genetic determinants of warfarin dose. PLoS Genet. 2009;5:e1000433.
- 12. Teichert M, Eijgelsheim M, Rivadeneira F, et al. A genome-wide association study of acenocoumarol maintenance dosage. Hum Mol Genet. 2009;18:3758-3768.
- 13. Klein TE, Altman RB, Eriksson N, et al. Estimation of the warfarin dose with clinical and pharmacogenetic data. N Engl J Med. 2009;360:753-764.
- 14. Epstein RS, Moyer TP, Aubert RE, et al. Warfarin genotyping reduces hospitalization rates: results from the MM-WES (Medco-Mayo Warfarin Effectiveness Study). J Am Coll Cardiol. 2010;55:2804-2812.
