Abstract
The International HapMap Project produced a genome-wide database of human genetic variation for use in genetic association studies of common diseases. The initial output of these studies has been overwhelming, with over 150 risk loci identified in studies of more than 60 common diseases and traits. These associations have suggested previously unsuspected etiologic pathways for common diseases that will be of use in identifying new therapeutic targets and developing targeted interventions based on genetically defined risk. Here we examine the development and application of the HapMap to genome-wide association (GWA) studies; present and future technologies for GWA research; current major efforts in GWA studies; successes and limitations of the GWA approach in identifying polymorphisms related to complex diseases; data release and privacy policies; use of these findings by clinicians, the public, and academic physicians; and sources of ongoing authoritative information on this rapidly evolving field.
Keywords: complex diseases, genetic association, genomic variation
THE HAPMAP: BUILDING THE FOUNDATION FOR GENOME-WIDE ASSOCIATION STUDIES
The International HapMap Project was designed to create a public, genome-wide database of patterns of common human sequence variation to guide genetic studies of human health and disease, including genome-wide association (GWA) studies (1, 2). Identifying genetic influences on complex diseases would be quite difficult if the risk-associated allelic variants at a particular disease-causing locus were very rare, so that for a disease to be common there would be many different causative alleles. The HapMap was instead designed to facilitate identification of commonly occurring disease-causing variants based upon the “common disease, common variant” hypothesis (3). This hypothesis suggests that at least some of the genetic influences on many common diseases are attributable to a limited number of common allelic variants that are present in more than 5% of the population.
GWA studies attempt to identify these common disease-causing variants by using high-throughput genotyping technologies to assay hundreds of thousands of common single nucleotide polymorphisms (SNPs) and relate them to clinical conditions and measurable traits. Because of the strong associations among SNPs in most chromosomal regions, only a few carefully chosen SNPs need to be typed in each region to predict the likely alleles at the rest of the SNPs in that region. Selecting the best tag SNPs requires precise mapping of the patterns of linkage disequilibrium (LD) among SNPs, which differ somewhat across ancestral groups. The need for precise LD maps to facilitate genetic association studies was the stimulus for developing the human haplotype map (2, 4).
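The tagging idea can be made concrete with a small calculation. The sketch below computes r², a standard pairwise measure of LD, between two biallelic SNPs from phased chromosomes. The function name `ld_r2` and the example haplotypes are illustrative assumptions, not HapMap data or HapMap software.

```python
def ld_r2(haplotypes):
    """Return r^2 between two biallelic SNPs.

    `haplotypes` is a list of (a, b) pairs, one per phased chromosome,
    with each allele coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at SNP B
    # Frequency of chromosomes carrying allele 1 at both SNPs
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Hypothetical data: the two SNPs always travel together on a chromosome
perfect_ld = [(1, 1)] * 4 + [(0, 0)] * 6
# Hypothetical data: every allele combination is equally frequent
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)] * 3

r2_perfect = ld_r2(perfect_ld)
r2_none = ld_r2(no_ld)
```

When r² between a typed SNP and an untyped SNP is close to 1, the typed SNP serves as a good tag for the untyped one; when it is close to 0, the untyped SNP must be assayed directly.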
The International HapMap Project was a consortium of researchers in Canada, China, Japan, Nigeria, the United Kingdom, and the United States, organized to produce a human haplotype map by genotyping 270 samples from four populations with geographically diverse ancestry (1, 2). These samples included 30 mother–father–adult child trios from the Yoruba in Ibadan, Nigeria; 30 trios from the CEPH (Centre d’Etude du Polymorphisme Humain) collection of Utah residents of Northern and Western European ancestry; 45 unrelated Han Chinese individuals in Beijing, China; and 45 unrelated Japanese individuals in Tokyo, Japan. Approximately 1 million SNPs were genotyped and their LD patterns characterized in Phase I of the project. A description was published in 2005 (1), but the data were available long before this and were central to several early genomic discoveries in complex diseases (5, 6). The Phase II HapMap of more than 3 million SNPs was published in 2007 (7).
Subsequent research has shown that tag SNPs chosen using the HapMap are generally applicable across other populations, but there are some limitations, particularly for rarer SNPs and for populations with substantial proportions of recent African ancestry (8). To allow better choice of tag SNPs and more detailed analyses for diverse populations, additional samples were collected from the same four initial HapMap populations and from seven additional populations: Luhya in Webuye, Kenya; Maasai in Kinyawa, Kenya; Tuscans in Italy; Gujarati Indians in Houston, Texas; Chinese in metropolitan Denver, Colorado; persons of Mexican ancestry in Los Angeles, California; and persons of African ancestry in the Southwestern United States (9). These 1301 extended HapMap samples are now available from the Coriell Institute and have been genotyped on the Affymetrix 6.0 platform and the Illumina 1 million SNP chip. Genome-wide sequencing of these samples to develop a comprehensive catalog of rarer variants will begin soon as part of the international 1000 Genomes Project (http://www.1000genomes.org).
GENOME-WIDE ASSOCIATION TECHNOLOGIES, PRESENT AND FUTURE
GWA studies have been defined by the National Institutes of Health (NIH) as any studies of common genetic variation across the entire human genome designed to identify genetic associations with observable traits (10). Implicit in this definition is that sufficient numbers of SNPs are typed to capture the vast majority of common variations (as noted above, these are alleles with a frequency of at least 5% in a population) throughout the entire genome. Such studies typically involve hundreds of thousands of SNPs and are not limited to known genes or regulatory regions. Instead, they assess genetic variation genome-wide in an almost “agnostic” fashion, unconstrained by current imperfect understanding of genome structure and function (11).
Technologies for high-throughput assays of thousands, and then tens and hundreds of thousands, of SNPs developed in parallel with the progress of the HapMap, as it became clear that denser maps could effectively capture the majority of human genetic variation (12, 13). These advances have made possible the dense genotyping needed to characterize the SNP variation within an individual, at a sufficiently low cost to allow the large sample sizes needed for comparisons of persons with and without disease. As genotyping platforms expand to include ever more tag SNPs, they capture increasingly larger proportions of the variation in any population, so that even samples of recent African ancestry, characterized by greater variation and shorter stretches of LD (14), have most of the genome covered at high r² (7).
Current-generation high-throughput genotyping platforms are extraordinarily efficient at genotyping SNPs but are less effective at genotyping structural variants, such as insertions, deletions, inversions, and copy number variants (CNVs). These variants are common in the human genome, though not as common as SNPs (15). The HapMap was not designed to capture these variants, although it can be used indirectly to do so, particularly for deletions that are in strong LD with SNPs (16). CNVs, in which stretches of genomic sequence roughly 1 kb to 3 Mb in size are deleted or duplicated in varying numbers, have gained increasing attention because of their apparent ubiquity and potential dosage effect on gene expression (17).
A critical question related to methods for typing CNVs is whether they usually arise from a single originating event and then are propagated with diminishing degrees of LD on a single haplotype background, or instead are frequently regenerated on varying haplotype backgrounds (18). The former situation would be conducive to tagging and indirect interrogation with HapMap-based genotyping platforms, while the latter would likely require direct interrogation or genomic sequencing for reliable association studies. Expansions and refinements of current genotyping platforms are increasingly focused on capturing CNVs adequately, and some success has already been achieved (19). Array and sequencing methods are also being used to type structural variants, using the HapMap samples for development and cross-validation of the methods (20, 21).
The identification of rare, potentially causal variants that are poorly tagged by existing genotyping platforms will require sequencing DNA from large numbers of people for the genomic regions showing strong associations with complex traits (22). The 1000 Genomes Project plans to produce modest sequence coverage (an average of four sequencing reads at any place in the genome) of ~1500 individuals that will extend the catalog of human genetic variation to variants present in 1%–5% of the population (10). It will thus limit the follow-up sequencing needed for investigating specific association findings to the search for very rare variants. Fine-mapping of candidate regions with common and rare SNPs optimally chosen, based on HapMap data, to maximize the regional genomic variation captured while minimizing costs, will refine association signals and narrow the list of possible functional variants.
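To give a feel for what an average of four reads per site implies for any one person, the sketch below assumes that read depth at a site follows a Poisson distribution. That model is our simplifying assumption for illustration and is not stated by the project. Under it, one can estimate how often a site is missed entirely and how often the variant allele of a heterozygote is never sampled.

```python
import math


def prob_at_most(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k + 1))


MEAN_DEPTH = 4.0  # average reads per site, as quoted in the text

# Chance that a given site receives no reads at all in one individual
p_site_uncovered = prob_at_most(0, MEAN_DEPTH)

# In a heterozygote, roughly half the reads carry the variant allele,
# so the variant-allele depth is modeled as Poisson(MEAN_DEPTH / 2)
p_variant_unseen = prob_at_most(0, MEAN_DEPTH / 2)
```

Under this assumed model a heterozygous variant goes unobserved in a single individual more than one time in ten, which is why low-coverage designs depend on pooling evidence across many sequenced individuals.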
CURRENT MAJOR EFFORTS IN GENOME-WIDE ASSOCIATION STUDIES
The first association study generally considered to be truly genome-wide was published in March 2005 (23), and by August 2008 > 170 such publications had identified > 150 genetic loci associated with > 60 complex diseases and traits (9, 24). Although many such efforts have been and will continue to be undertaken individually, or as part of single large-scale studies such as the deCODE database (25) or the National Heart, Lung, and Blood Institute’s Framingham Study (26), the value of collaborative efforts across studies and even across diseases is increasingly being recognized.
A series of coordinated GWA publications in early 2007 in prostate cancer, breast cancer, and myocardial infarction demonstrated the value of assessing association reports in multiple studies simultaneously (9). The joint publication of individual and combined associations with type 2 diabetes in three collaborating studies definitively showed the importance of combining individual-level genotype and phenotype data in > 30,000 subjects to identify associations across several studies that no single study could reliably identify on its own (27–29). This approach was subsequently expanded by the addition of seven more diabetes studies in the Diabetes Genetics Replication and Meta-analysis (DIAGRAM) Consortium, with an effective sample size of > 50,000 (30). Similarly large efforts focused on a single phenotype or closely related phenotypes in tens of thousands of subjects have yielded variants related to obesity (31), lipids (32), height (33), and other traits.
A more challenging and somewhat more controversial approach has been to combine individual-level data from cases with several related or even unrelated conditions and compare them to a common control group, in an effort to expand sample size and increase study power for each condition. The success of this method was demonstrated by the landmark Wellcome Trust Case Control Consortium (WTCCC) study of 2000 cases of each of seven common diseases and 3000 shared controls (34). This study provided many fundamental methodologic advances, including demonstration of the robustness of a single control group, the value of using cases of some diseases as controls for others, the greater power provided by increased sample size (numbers of subjects) rather than increased genomic coverage (numbers of SNPs), the critical need for manual review of automated genotyping calls, and the reliability of imputed genotypes for SNPs that were not actually typed by the genotyping platform. This approach of common controls and combined case groups used as controls was also employed by the WTCCC in its smaller study of 14,500 nonsynonymous SNPs in four autoimmune diseases, and in dense genome-wide genotyping of African cases of tuberculosis and malaria (35, 36). The Wellcome Trust recently announced plans to conduct genome-wide genotyping in 120,000 additional people to identify variants related to 25 diseases and traits using the same approach (37).
Other collaborative studies of multiple diseases have focused less on combining genotype-phenotype associations than on sharing methods for genotyping quality control, data analysis, imputation, and data distribution. Experience gained from early quality-control efforts in programs such as the Genetic Association Information Network (GAIN) of six complex diseases has been of great value in speeding the completion and analysis of genotyping in later studies (19). Several other collaborative programs are currently in the pipeline (Table 1).
Table 1.
Collaborative genome-wide association studies (adapted from Reference 9)
| Study name | Genetic Association Information Network (GAIN) | Genes, Environment, and Health Initiative (GEI) | SNP Typing for Association with multiple Phenotypes from Existing Epidemiologic Data (STAMPEED) | Cancer Genetic Markers of Susceptibility (CGEMS) | Psychiatric Genomewide Association Study Consortium (PGC) |
|---|---|---|---|---|---|
| URL | http://www.fnih.org/GAIN2/home new.shtml | http://www.gei.nih.gov/ | http://public.nhlbi.nih.gov/GeneticsGenomics/home/stampeed.aspx | http://cgems.cancer.gov/ | http://sullivanlab.unc.edu/pgc/index.html |
| Traits or diseases studied | attention deficit/hyperactivity disorder | type 2 diabetes | early-onset myocardial infarction | prostate cancer | autism |
| | major depressive disorder | maternal metabolism and birth weight | asthma | breast cancer | attention deficit/hyperactivity disorder |
| | bipolar I disorder | preterm birth | platelet phenotypes | pancreatic cancer | bipolar disorder |
| | schizophrenia | oral clefts | coronary heart disease and other heart, lung, and blood disorders | lung cancer | major depressive disorder |
| | type 1 diabetic nephropathy | dental caries | childhood respiratory outcomes | bladder cancer | schizophrenia |
| | psoriasis | coronary disease | hematopoietic cell transplant outcome | renal cancer | |
| | | lung cancer | arteriosclerosis in hypertensives | | |
| | | addiction | asthma and lung function | | |
| | | | cardiovascular risk factors | | |
| | | | atherosclerosis pathway genes | | |
| | | | cardiovascular events | | |
| | | | early coronary artery disease | | |
| | | | phenotypic variability in sickle-cell anemia | | |
| | | | longevity to age 100 | | |
SUCCESSES IN IDENTIFYING VARIANTS RELATED TO COMPLEX DISEASES
The first notable success of the GWA method came in March 2005, with the identification of a variant in the gene for complement factor H (CFH) associated with age-related macular degeneration (23). Two additional GWA studies were published within that year, of Parkinson’s disease and obesity (38, 39), but efforts at replicating these findings have produced inconsistent results (40, 41). In 2006, strong, robust associations with electrocardiographic QT interval prolongation (42), neovascular macular degeneration (43), and inflammatory bowel disease (44) were identified and have since been the subjects of a substantial body of follow-up research to determine gene function and population impact.
The pace of genomic discovery increased dramatically in 2007, following the increased availability of high-density genotyping platforms and experience in interpreting the results. Simultaneous publication of coordinated efforts in multiple diseases, and of the WTCCC study, have been described above. Rapid progress has continued into 2008 with identification of > 150 loci for > 60 common diseases and traits (Figure 1). Indeed, as Hunter & Kraft have noted, “There have been few, if any, similar bursts of discovery in the history of medical research” (45).
Figure 1.
SNP-trait associations detected in GWA studies. Associations significant at p < 9.9 × 10⁻⁷ and reported through June 2008 are shown according to chromosomal location and involved or nearby gene, if any. Colored boxes indicate similar diseases or traits. Adapted from Reference 9 with permission.
Unique aspects of the GWA method have made these discoveries possible. For example, GWA studies allow the investigator to narrow an association region to a 10–100 kilobase length of DNA, in contrast to the 5–10 megabases usually detected in familial linkage studies. Because GWA regions typically contain only a few genes, rather than the dozens or hundreds implicated in linkage regions, potentially causative variants can be examined much more rapidly and in greater depth. As noted above, systematic interrogation of the entire genome frees the investigator from reliance on inaccurate prior hypotheses based on incomplete understanding of disease pathogenesis and genome structure and function. The critical importance of this is illustrated by the fact that many of the associations identified to date, such as CFH in macular degeneration (23) and TCF7L2 in type 2 diabetes (6, 46), have been surprising: the genes were not previously suspected of being related to the disease. Some, such as the strong associations of prostate cancer with SNPs in the 8q24 region (47) and Crohn’s disease with the 5p13 region (34), have been in genomic regions containing no known genes at all. And because current genotyping assays capture the vast majority of human variation genome-wide, rather than being focused on particular regions or pathways, once a GWA scan is completed it can be applied to any condition or trait measured in that same individual and consistent with his or her informed consent.
Several of these discoveries have suggested etiologic pathways not previously implicated in these diseases, such as the autophagy pathway in inflammatory bowel disease (48), the complement pathway in macular degeneration (23), and the HLA-C locus in control of viral load in HIV infection (49). Of considerable interest in determining pathophysiology have been variants or regions implicated in multiple diseases, such as the 8q24 region in prostate, breast, and colorectal cancers and the PTPN2 gene in type 1 diabetes and Crohn’s disease (9).
LIMITATIONS OF THE GENOME-WIDE ASSOCIATION METHOD
Important limitations of GWA studies should also be kept in mind. One is their enormous potential for generating false-positive or spurious associations. Because they test hundreds of thousands of statistical hypotheses (one for each allele or genotype assessed), many associations will reach nominal significance by chance alone. At the usual p < 0.05 level of significance, an association study of one million SNPs would be expected to show about 50,000 SNPs to be “associated” with disease, almost all spuriously. One response to this problem is to reduce the false-positive rate by applying the Bonferroni correction, in which the conventional p-value threshold is divided by the number of tests performed (45). A one-million-SNP survey would thus use a threshold of p < 0.05/10⁶, or 5 × 10⁻⁸, to identify associations unlikely to have occurred by chance. This correction has been criticized as overly conservative, but it remains the most commonly used approach to date (50). Another cause of the false-positive associations to which GWA studies are prone is population stratification. Allele frequencies vary between population subgroups, such as those defined by ethnicity or geographic origin, and these subgroups in turn differ in their risk for disease. GWA studies may then falsely identify the subgroup-associated genes as related to disease (50). Genotyping error is another important cause of spurious associations that must be carefully sought and corrected.
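The multiple-testing arithmetic in this paragraph can be reproduced in a few lines. The sketch below is a minimal illustration; the function names are ours, and it assumes the simplest case in which every tested SNP is truly unassociated with disease.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing `alpha` when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that holds the family-wise error rate at `alpha`."""
    return alpha / n_tests


n_snps = 1_000_000

# Chance findings expected at the conventional threshold
spurious_at_nominal = expected_false_positives(n_snps, 0.05)

# Corrected per-SNP threshold for a one-million-SNP survey
corrected = bonferroni_threshold(n_snps)

# Chance findings expected across the whole scan at the corrected threshold
spurious_at_corrected = expected_false_positives(n_snps, corrected)
```

At the corrected threshold the expected number of chance findings across the entire scan falls from tens of thousands to about 0.05. The price is reduced power to detect true associations of modest effect, which is the basis of the criticism that the correction is conservative.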
These problems have also plagued candidate-gene association studies, where systematic reviews showed that the vast majority of initial associations could not be replicated (51). This experience has led to calls for all genetic association reports to include documented replication of findings as a prerequisite for publication. Consensus guidelines for replication in any GWA study, and, crucially, for complete description of the initial study so that replication is possible, have been developed (52).
Another limitation of GWA studies is their lack of power for identifying associations with rare sequence variants, since these are poorly represented on current genotyping platforms, as are structural variants. The often limited information available on environmental exposures and other nongenetic risk factors in GWA studies will make it difficult to identify gene-environment interactions, or modification of gene-disease associations in the presence of environmental factors. Most variants identified to date are of relatively modest effect size, conferring less than a twofold increase in disease risk and necessitating large sample sizes to detect their effect (9, 34). Although the importance of risk factors with small effect is debated, the potential for purifying selection to eliminate risk variants of large effect is unique to genetic studies and will tend to keep effect sizes small for common variants (53). Modest associations can point the way to important therapeutic avenues, and, when considered in combination, may identify persons at substantially increased risk (28). Such information can be particularly important, even in the absence of specific pharmaceutical agents targeted to such individuals, for more aggressive efforts to reduce known risk factors that are modifiable, such as obesity in prediabetes and smoking in age-related macular degeneration (9).
DATA RELEASE AND PARTICIPANT PRIVACY
GWA studies produce massive data sets, often representing substantial investments of public funds and providing unparalleled opportunities for research into complex diseases. Recognizing the research potential of these data sets, and following an extended period of public comment, NIH released its Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies, recommending widespread and responsible release of GWA data to the scientific community through the Database of Genotype and Phenotype (dbGaP) of the National Center for Biotechnology Information (10, 54). Study descriptions and protocols are available in the open-access portion of the dbGaP website; individual-level data are provided through a controlled-access process consistent with participants’ informed consent. This commitment to rapid data release builds on the now well-established ethic in genomic community research projects of maximizing data access. Other GWA data access sites include the Cancer Genetic Markers of Susceptibility (CGEMS) data portal (55) and the European Genotype Archive (EGA) (56). Policies for data release have been developed collaboratively among these projects and are quite similar. Published GWA studies and major findings are also catalogued by the National Human Genome Research Institute (NHGRI) (24), and GWA literature citations are available through the Centers for Disease Control and Prevention (57).
The extensive genotype and phenotype information deposited in dbGaP raises important questions about possible risks to confidentiality of individual participants in broad data-sharing models. NIH policies were thus developed with deliberate attention to participant protections, both in the process of data submission from the original studies and in the processes of data access and use by outside investigators. A key aspect of the protections provided in dbGaP is removal of potentially identifying information prior to data submission, using criteria very similar to those described within the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule (10).
Substantial participant protections are also applied at the dbGaP data-user level through a process managed by a Data Access Committee (DAC) composed of senior NIH staff. Investigators interested in obtaining controlled-access dbGaP data submit a Data Access Request, cosigned by their institution, constituting their agreement to abide by the principles and practices detailed in the NIH GWA study policy. These include keeping the data secure; using them only for the approved research purposes; acknowledging NIH policies on publications and intellectual property (IP); and submitting periodic reports on data use. Data users also agree not to distribute individual-level data in any form to any third parties (other than their own research staff who have agreed to the terms of access), nor to attempt to identify individual study participants.
Recognizing the unprecedented pace of scientific progress in this field, NIH has designed its policies on data sharing in GWA studies to adapt to rapid technical advances. For example, data summaries and other group-level data such as allele frequencies and association statistics were initially provided in the open-access portion of dbGaP, in the belief that grouped data carried no threats to individual privacy. An innovative analysis for resolving the presence of an individual’s DNA in a mix of DNA (as from a mass disaster) subsequently showed that an individual could be determined to have contributed to grouped allele frequency data with high reliability if one had data on that individual’s genotypes at hundreds of thousands of SNPs (58). NIH responded swiftly to remove these data sets from open access and place them behind the controlled-access process, and to notify other major data providers as well, who took similar actions (59). Data access policies will continue to evolve with ongoing scientific advances in the field, to ensure that state-of-the-art data can be distributed and used in the most responsible and productive manner.
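The intuition behind this privacy finding can be illustrated with a toy simulation. The sketch below is a deliberately simplified distance comparison of our own construction and does not reproduce the published analysis (58): for each SNP it asks whether a person's genotype lies closer to the pooled allele frequency than to a reference population frequency, then averages over SNPs. All names, the pool size, the SNP count, and the simulated frequencies are hypothetical, and far fewer SNPs are used than the hundreds of thousands mentioned in the text so that the example runs quickly.

```python
import random


def genotype(p, rng):
    """Allele dosage as a fraction (0, 0.5, or 1) at a SNP with allele frequency p."""
    return ((rng.random() < p) + (rng.random() < p)) / 2


def membership_score(person, pool_freqs, ref_freqs):
    """Mean over SNPs of |y - ref| - |y - pool|.

    Larger values mean the person's genotypes sit closer to the pooled
    frequencies than to the reference frequencies.
    """
    total = 0.0
    for y, pooled, ref in zip(person, pool_freqs, ref_freqs):
        total += abs(y - ref) - abs(y - pooled)
    return total / len(person)


rng = random.Random(0)          # fixed seed so the simulation is repeatable
N_SNPS, POOL_SIZE = 10_000, 25  # hypothetical sizes chosen for speed

# Simulated reference-population allele frequencies
ref_freqs = [rng.uniform(0.1, 0.9) for _ in range(N_SNPS)]

# Simulated study participants and one person who did not participate
pool = [[genotype(p, rng) for p in ref_freqs] for _ in range(POOL_SIZE)]
outsider = [genotype(p, rng) for p in ref_freqs]

# Group-level data of the kind once released openly: pooled allele frequencies
pool_freqs = [sum(person[j] for person in pool) / POOL_SIZE for j in range(N_SNPS)]

score_member = membership_score(pool[0], pool_freqs, ref_freqs)
score_outsider = membership_score(outsider, pool_freqs, ref_freqs)
```

In this toy setting the participant's score exceeds the outsider's because each participant pulls the pooled frequency slightly toward his or her own genotype at every SNP. The per-SNP effect is tiny, but it accumulates across many SNPs, which is why dense genotype data are needed for the inference.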
Ensuring confidentiality and privacy is vital for databases containing individual-level genotype or phenotype information. Important concerns about the potential for persons carrying risk-associated variants (i.e., essentially everyone) becoming the object of discrimination by employers or insurers must be addressed. Application of GWA findings and other genomic research will be greatly facilitated by the formal legal protection against discrimination based on genetic information provided by the Genetic Information Nondiscrimination Act (GINA) that was signed into law in May 2008. When it goes into effect in 2009, GINA will protect against discrimination by health insurers and employers on the basis of genetic information. It is particularly important for GWA studies because of the breadth of information obtained in such studies; almost certainly every individual will carry at least one risk allele for at least one common disease. Protections under GINA will not only shield study subjects from the risk of genetic discrimination due to their participation in research, but more importantly will ensure that clinicians can order genetic tests identified from GWA studies to make more effective treatment decisions and can place this information in patients’ records without risk to patients or their families.
USE OF FINDINGS BY PHYSICIANS AND THE PUBLIC
Although GWA discovery studies provide valuable clues to genomic function and pathophysiologic mechanisms, they are only a first step in identification of disease genes and are many steps removed from actual clinical application. Nonetheless, they tend to receive considerable media attention and have the potential for generating queries from patients about whether to get tested for the “new gene for Disease X” based on the latest report (50). As noted above, many SNPs identified from such studies, as well as the genes or regions containing them, are currently of unknown function. In addition, SNPs from GWA studies in complex diseases (unlike many Mendelian disorders) do not predict unequivocally who will develop disease and who will remain free of it. Instead, individuals carrying a particular risk genotype implicated in a GWA study have a greater risk (and sometimes only a modestly greater risk) of developing a complex disease than those who do not.
The distinction between disease prediction and disease susceptibility is important because for many common variants, a substantial number of persons who do not carry the at-risk genotype may develop disease anyway owing to environmental or other factors. Indeed, for common diseases such as hypertension or diabetes, environmental or lifestyle factors may play such a strong role relative to genetics that many individuals with the at-risk genotype will develop disease for reasons that are probably unrelated to genotype, and others with the at-risk genotype may remain healthy in the absence of other important environmental exposures (60). Identifying subgroups of individuals in whom SNP-outcome associations differ according to the presence or absence of other SNPs or environmental factors might eventually be of considerable clinical use, particularly for environmental factors that can be modified.
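A short worked example may make the distinction between susceptibility and prediction concrete. The sketch below splits an overall disease prevalence into risks for carriers and non-carriers of a risk genotype, given the genotype frequency and its relative risk. The function name and all input values are hypothetical and chosen only for illustration.

```python
def genotype_risks(prevalence, carrier_freq, relative_risk):
    """Return (carrier risk, non-carrier risk, share of cases in non-carriers).

    Solves prevalence = carrier_freq * RR * r0 + (1 - carrier_freq) * r0
    for the non-carrier risk r0.
    """
    risk_noncarrier = prevalence / (carrier_freq * relative_risk + (1 - carrier_freq))
    risk_carrier = relative_risk * risk_noncarrier
    share_cases_noncarrier = (1 - carrier_freq) * risk_noncarrier / prevalence
    return risk_carrier, risk_noncarrier, share_cases_noncarrier


# Hypothetical inputs: 10% of the population develops the disease,
# 30% carry the at-risk genotype, and carriers have 1.5 times the risk
risk_carrier, risk_noncarrier, share_noncarrier = genotype_risks(0.10, 0.30, 1.5)
```

With these invented inputs, carriers have roughly a 13% risk compared with about 9% for non-carriers, so most carriers remain free of disease, and about 61% of all cases arise in people without the at-risk genotype.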
The consensus at present is that GWA findings provide important clues to disease etiology and pathways to treatment, but current information is far too preliminary to support their use in prevention or treatment recommendations. Use of GWA findings in screening for disease risk, though beginning to be marketed commercially, is problematic. Although getting the latest “gene test” may be alluring, evidence is needed that such screening adds information to known risk factors (such as age, smoking, obesity, and family history), that effective interventions are available, that improved outcomes justify the associated costs, and that obtaining this information does not have serious adverse consequences for patients and their families.
Given the availability of many genetic tests to anyone willing to pay for them, however, clinicians are soon likely to face anxious patients equipped with genotype information showing them to be at risk for multiple diseases. This may provide a “teachable” moment for encouraging patients to apply known preventive strategies against the conditions for which they are at increased risk. Such encounters also provide critical opportunities to discourage complacency about preventive strategies for conditions for which genotyping information suggests a patient is not at increased risk. This is because so little is known about genetic influences on complex diseases and because variants identified to date typically explain so small a proportion of population risk. It may be useful to point out to patients considering purchasing these tests that obtaining a family history is often simpler and almost always cheaper. A positive family history typically confers a three- to fourfold increase in the risk of many diseases and is extremely useful in identifying persons to target for more intensive screening (61).
UTILITY OF FINDINGS FOR RESEARCH
Perhaps the greatest initial utility of GWA findings is in the clues they provide for disease etiology, therapeutic targets, and gene function. As noted above, several of these discoveries have suggested etiologic pathways and therapeutic opportunities not previously implicated in the complex diseases with which they are associated, such as the autophagy pathway in inflammatory bowel disease, the complement pathway in macular degeneration, and the HLA-C locus in control of viral load in HIV infection (9). Intriguing potential genetic connections between diseases previously believed to be unrelated (such as the finding that risks of type 2 diabetes, coronary disease, and familial melanoma are all associated with variants near CDKN2A/B, or that risks of Crohn’s disease and type 1 diabetes are related to variants near PTPN2) suggest new avenues of research in identifying other similarities in etiology, progression, or treatment of these conditions. The not infrequent occurrence of associations in “gene deserts” far from any known genes invites the question of whether studies of disease pathogenesis have been too focused on coding regions of the genome and have missed other important structural and functional clues to genomic regulation.
Research to pursue initial GWA discoveries will include replication studies in the same phenotypes and populations, to ensure the robustness of the findings, and in similar but not identical phenotypes and populations, to extend the findings and increase understanding of their mechanisms and importance (52). Investigation of disease subtypes, such as estrogen receptor–positive versus -negative breast cancer, or young-onset or severely progressive forms of prostate cancer or diabetes, may be of great value in identifying which subgroups of alleles confer the highest risk and which subgroups of persons carry those alleles. Functional studies of highly replicated variants, in experimental models such as knockdown and overexpression studies (9) and in relationship to gene expression, as recently demonstrated for asthma-associated variants in ORMDL3 (62), will help to determine the mechanisms of gene function and how they are perturbed in disease, providing insights into possible preventive or therapeutic strategies.
AUTHORITATIVE SOURCES OF INFORMATION
The HapMap continues to evolve, with new SNPs being identified and LD patterns defined in both the original and newer HapMap populations. The primary portal to HapMap genotype data, as well as publications, tutorials, and other relevant resources, is the International HapMap Project website at http://www.hapmap.org. An up-to-date catalog of GWA studies is provided by the NHGRI’s Office of Population Genomics at http://www.genome.gov/GWAstudies. This site lists all published studies attempting to assay 100,000 SNPs or more, noting the trait under investigation, the top new associations identified, their genomic region and nearby genes, p-values, odds ratios, and links to the PubMed citations. Study descriptions, protocols, and association findings are available for many NIH-supported GWA studies in dbGaP at http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap, and individual-level data may be requested for download through the controlled-access portion of that site. Those seeking additional information on specific genes related to complex diseases should consult Online Mendelian Inheritance in Man (OMIM) at http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM, the definitive catalog of human genes and genetic disorders. Findings from GWA studies are added to genes described in OMIM on a regular basis. More relevant to clinicians and patients may be the website and materials produced by the National Coalition for Health Professional Education in Genetics, a coalition of health professional organizations whose purpose is to promote health professional education and access to information about advances in human genetics, at http://www.nchpeg.org.
SUMMARY
The genome-wide database of human genetic variation produced by the International HapMap Project has provided a radically new approach for searching for genetic variants associated with complex diseases. The overwhelming success of these studies has led to surprising new insights into disease pathophysiology and therapeutic approaches, as well as new questions about genomic structure and function (and their interaction with genomic variation and environmental factors) in disease causation.
GWA studies represent a powerful new tool for identifying genetic variants related to complex diseases, but they also have important limitations, including their potential for false-positive results and lack of sensitivity to detect rare variants. Their primary uses for the foreseeable future are likely to be in the investigation of biologic pathways of disease causation and normal health and development. Clinical application of these findings will require firm evidence that testing for them adds information to known risk factors, that effective interventions are available, that improved outcomes justify the associated costs, and that obtaining this information does not have serious adverse consequences for patients and their families. Although most GWA findings are clearly several steps removed from mainstream clinical use at present, functional investigation and experimental application of these findings are expected to produce new advances in the prevention and treatment of common diseases.
FUTURE ISSUES
Defining the functional properties of genomic variants identified through GWA studies, including effects on gene expression, protein structure, and protein function.
Identifying copy number variants that may be contributing significantly to common disease risk but are scored inconsistently by current technologies.
Identifying rarer sequence variants that may be causative or additive in the disease associations identified in GWA studies.
Determining the population prevalence and risk associated with putative causal variants in unbiased and diverse population samples.
Estimating the increment in risk over established risk factors provided by GWA-defined variants.
Using information from GWA studies to identify new targets for therapeutic intervention.
Glossary
- GWA
genome-wide association
- Single nucleotide polymorphism (SNP)
site within the genome that differs by a single nucleotide base across different individuals
- Polymorphism
a form of genetic variation in which each allele occurs in at least 1% of the population
- Tag SNP
representative SNP in a region of the genome with high linkage disequilibrium to other variants
- Linkage disequilibrium (LD)
association of alleles at two or more sites on the same chromosome that are inherited together more often than expected by chance
- Haplotype
a combination of alleles at multiple linked sites on a single chromosome that are transmitted together
- r²
linkage disequilibrium coefficient: the squared correlation between alleles at two sites, reflecting how consistently specific alleles at the two sites occur together
- Copy number variant (CNV)
a DNA sequence of hundreds to thousands of base pairs that occurs a variable number of times across individuals
- Nonsynonymous SNP
a SNP for which each allele encodes a different amino acid in the protein sequence
- dbGaP
Database of Genotype and Phenotype
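As an illustration of the linkage disequilibrium coefficient r² defined in the glossary above, the following minimal Python sketch computes it from a list of phased two-site haplotypes using the standard formula r² = D² / [pA(1 − pA) pB(1 − pB)], where D = pAB − pA·pB. The function name `ld_r2`, its arguments, and the toy haplotypes are illustrative assumptions, not part of the HapMap resources or of any particular software package.

```python
from typing import Iterable, Tuple


def ld_r2(haplotypes: Iterable[Tuple[str, str]], allele1: str, allele2: str) -> float:
    """Squared correlation (r^2) between allele1 at site 1 and allele2 at site 2.

    haplotypes: one (site-1 allele, site-2 allele) pair per phased chromosome.
    Both sites are assumed to be biallelic (hypothetical input format).
    """
    haps = list(haplotypes)
    n = len(haps)
    if n == 0:
        raise ValueError("at least one haplotype is required")

    # Allele and haplotype frequencies estimated from the sample.
    p_a = sum(1 for a, _ in haps if a == allele1) / n
    p_b = sum(1 for _, b in haps if b == allele2) / n
    p_ab = sum(1 for a, b in haps if a == allele1 and b == allele2) / n

    # D measures the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")
    return d * d / denom


if __name__ == "__main__":
    # Toy data: alleles A and G always travel together (complete LD).
    linked = [("A", "G")] * 6 + [("C", "T")] * 4
    # Toy data: all four haplotypes equally frequent (no LD).
    unlinked = [("A", "G"), ("A", "T"), ("C", "G"), ("C", "T")]
    print(ld_r2(linked, "A", "G"))
    print(ld_r2(unlinked, "A", "G"))
```

In the first toy sample the two sites are perfect proxies for one another, so r² is essentially 1, which is the situation in which a tag SNP captures a neighboring variant; in the second sample the sites are independent and r² is 0.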
Footnotes
The U.S. Government has the right to retain a nonexclusive, royalty-free license in and to any copyright covering this paper.
DISCLOSURE STATEMENT
The authors are not aware of any factors that might be perceived as affecting the objectivity of this review.
LITERATURE CITED
- 1. International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320. doi: 10.1038/nature04226.
- 2. International HapMap Consortium. The International HapMap Project. Nature. 2003;426:789–794. doi: 10.1038/nature02168.
- 3. Collins FS, Guyer MS, Chakravarti A. Variations on a theme: cataloging human DNA sequence variation. Science. 1997;278:1580–1581. doi: 10.1126/science.278.5343.1580.
- 4. Eberle MA, Ng PC, Kuhn K, et al. Power to detect risk alleles using genome-wide tag SNP panels. PLoS Genet. 2007;3:1827–1837. doi: 10.1371/journal.pgen.0030170.
- 5. Benusiglio PR, Lesueur F, Luccarini C, et al. Common variation in EMSY and risk of breast and ovarian cancer: a case-control study using HapMap tagging SNPs. BMC Cancer. 2005;5:81. doi: 10.1186/1471-2407-5-81.
- 6. Grant SF, Thorleifsson G, Reynisdottir I, et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 2006;38:320–323. doi: 10.1038/ng1732.
- 7. International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. doi: 10.1038/nature06258.
- 8. de Bakker PI, Burtt NP, Graham RR, et al. Transferability of tag SNPs in genetic association studies in multiple populations. Nat. Genet. 2006;38:1298–1303. doi: 10.1038/ng1899.
- 9. Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. J. Clin. Invest. 2008;118:1590–1605. doi: 10.1172/JCI34772.
- 10. Policy for sharing of data obtained in NIH supported or conducted genome-wide association studies (GWAS). Federal Register 8/30/07. 2007. http://grants.nih.gov/grants/guide/notice-files/NOT-OD-07-088.html [accessed 9/12/2008]
- 11. Carlson CS. Agnosticism and equity in genome-wide association studies. Nat. Genet. 2006;38:605–606. doi: 10.1038/ng0606-605.
- 12. Wang DG, Fan JB, Siao CJ, et al. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science. 1998;280:1077–1082. doi: 10.1126/science.280.5366.1077.
- 13. Matsuzaki H, Dong S, Loi H, et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat. Methods. 2004;1:109–111. doi: 10.1038/nmeth718.
- 14. Gabriel SB, Schaffner SF, Nguyen H, et al. The structure of haplotype blocks in the human genome. Science. 2002;296:2225–2229. doi: 10.1126/science.1069424.
- 15. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat. Genet. 2005;37:727–732. doi: 10.1038/ng1562.
- 16. Komura D, Shen F, Ishikawa S, et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res. 2006;16:1575–1584. doi: 10.1101/gr.5629106.
- 17. Stranger BE, Forrest MS, Dunning M, et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007;315:848–853. doi: 10.1126/science.1136678.
- 18. Estivill X, Cox NJ, Chanock SJ, et al. SNPs meet CNVs in genome-wide association studies: HGV2007 meeting report. PLoS Genet. 2008;4:e1000068. doi: 10.1371/journal.pgen.1000068.
- 19. Manolio TA, Rodriguez LL, Brooks L, et al. New models of collaboration in genome-wide association studies: the Genetic Association Information Network. Nat. Genet. 2007;39:1045–1051. doi: 10.1038/ng2127.
- 20. Estivill X, Armengol L. Copy number variants and common disorders: filling the gaps and exploring complexity in genome-wide association studies. PLoS Genet. 2007;3:1787–1799. doi: 10.1371/journal.pgen.0030190.
- 21. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature. 2008;453:56–64. doi: 10.1038/nature06862.
- 22. Frayling TM, McCarthy MI. Genetic studies of diabetes following the advent of the genome-wide association study: Where do we go from here? Diabetologia. 2007;50:2229–2233. doi: 10.1007/s00125-007-0825-7.
- 23. Klein RJ, Zeiss C, Chew EY, et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308:385–389. doi: 10.1126/science.1109557.
- 24. National Human Genome Research Institute. A catalog of genome-wide association studies. http://www.genome.gov/GWAstudies/ [accessed 4/28/08]
- 25. Gulcher J, Kong A, Stefansson K. The genealogic approach to human genetics of disease. Cancer J. 2001;7:61–68.
- 26. Cupples LA, Arruda HT, Benjamin EJ, et al. The Framingham Heart Study 100K SNP genome-wide association study resources: overview of 17 phenotype working group reports. BMC Med. Genet. 2007;8 Suppl. 1:S1. doi: 10.1186/1471-2350-8-S1-S1.
- 27. Saxena R, Voight BF, Lyssenko V, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science. 2007;316:1331–1336. doi: 10.1126/science.1142358.
- 28. Scott LJ, Mohlke KL, Bonnycastle LL, et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science. 2007;316:1341–1345. doi: 10.1126/science.1142382.
- 29. Zeggini E, Weedon MN, Lindgren CM, et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science. 2007;316:1336–1341. doi: 10.1126/science.1142364.
- 30. Zeggini E, Scott LJ, Saxena R, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat. Genet. 2008;40:638–645. doi: 10.1038/ng.120.
- 31. Frayling TM, Timpson NJ, Weedon MN, et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science. 2007;316:889–894. doi: 10.1126/science.1141634.
- 32. Kathiresan S, Melander O, Guiducci C, et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat. Genet. 2008;40:189–197. doi: 10.1038/ng.75.
- 33. Weedon MN, Lango H, Lindgren CM, et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nat. Genet. 2008;40:575–583. doi: 10.1038/ng.121.
- 34. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911.
- 35. Burton PR, Clayton DG, Cardon LR, et al. Wellcome Trust Case Control Consortium; Australo-Anglo-American Spondylitis Consortium (TASC). Association scan of 14,500 nonsynonymous SNPs in four diseases identifies autoimmunity variants. Nat. Genet. 2007;39:1329–1337. doi: 10.1038/ng.2007.17.
- 36. Wellcome Trust Case Control Consortium. Overview. http://www.wtccc.org.uk/info/overview.shtml
- 37. Wellcome Trust. Largest ever study of genetics of common diseases just got bigger. 2008. News release. http://www.wellcome.ac.uk/News/Media-office/Press-releases/2008/WTD039438.htm [accessed 4/29/08]
- 38. Maraganore DM, de Andrade M, Lesnick TG, et al. High-resolution whole-genome association study of Parkinson disease. Am. J. Hum. Genet. 2005;77:685–693. doi: 10.1086/496902.
- 39. Herbert A, Gerry NP, McQueen MB, et al. A common genetic variant is associated with adult and childhood obesity. Science. 2006;312:279–283. doi: 10.1126/science.1124779.
- 40. Myers RH. Considerations for genomewide association studies in Parkinson disease. Am. J. Hum. Genet. 2006;78:1081–1082. doi: 10.1086/504730.
- 41. Lyon HN, Emilsson V, Hinney A, et al. The association of a SNP upstream of INSIG2 with body mass index is reproduced in several but not all cohorts. PLoS Genet. 2007;3:e61. doi: 10.1371/journal.pgen.0030061.
- 42. Arking DE, Pfeufer A, Post W, et al. A common genetic variant in the NOS1 regulator NOS1AP modulates cardiac repolarization. Nat. Genet. 2006;38:644–651. doi: 10.1038/ng1790.
- 43. Dewan A, Liu M, Hartman S, et al. HTRA1 promoter polymorphism in wet age-related macular degeneration. Science. 2006;314:989–992. doi: 10.1126/science.1133807.
- 44. Duerr RH, Taylor KD, Brant SR, et al. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science. 2006;314:1461–1463. doi: 10.1126/science.1135245.
- 45. Hunter DJ, Kraft P. Drinking from the fire hose—statistical issues in genomewide association studies. N. Engl. J. Med. 2007;357:436–439. doi: 10.1056/NEJMp078120.
- 46. Sladek R, Rocheleau G, Rung J, et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature. 2007;445:881–885. doi: 10.1038/nature05616.
- 47. Yeager M, Orr N, Hayes RB, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat. Genet. 2007;39:645–649. doi: 10.1038/ng2022.
- 48. Rioux JD, Xavier RJ, Taylor KD, et al. Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nat. Genet. 2007;39:596–604. doi: 10.1038/ng2032.
- 49. Fellay J, Shianna KV, Ge D, et al. A whole-genome association study of major determinants for host control of HIV-1. Science. 2007;317:944–947. doi: 10.1126/science.1143767.
- 50. Pearson TA, Manolio TA. How to interpret a genome-wide association study. JAMA. 2008;299:1335–1344. doi: 10.1001/jama.299.11.1335.
- 51. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet. Med. 2002;4:45–61. doi: 10.1097/00125817-200203000-00002.
- 52. Chanock SJ, Manolio T, Boehnke M, et al. Replicating genotype-phenotype associations. Nature. 2007;447:655–660. doi: 10.1038/447655a.
- 53. Gorlov IP, Gorlova OY, Sunyaev SR, et al. Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am. J. Hum. Genet. 2008;82:100–112. doi: 10.1016/j.ajhg.2007.09.006.
- 54. Mailman MD, Feolo M, Jin Y, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat. Genet. 2007;39:1181–1186. doi: 10.1038/ng1007-1181.
- 55. National Cancer Institute. Cancer Genetic Markers of Susceptibility (CGEMS) data portal. https://caintegrator.nci.nih.gov/cgems/ [accessed 4/29/08]
- 56. European Genotype Archive. http://www.ebi.ac.uk/ega/page.php?page=home [accessed 4/29/08]
- 57. Centers for Disease Control and Prevention. HuGE Navigator. http://www.hugenavigator.net/ [accessed 9/12/2008]
- 58. Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008;4(8):e1000167. doi: 10.1371/journal.pgen.1000167.
- 59. Couzin J. Genetic privacy. Whole-genome data not anonymous, challenging assumptions. Science. 2008;321(5894):1278. doi: 10.1126/science.321.5894.1278.
- 60. Cooper RS. Gene-environment interactions and the etiology of common complex disease. Ann. Intern. Med. 2003;139:437–440. doi: 10.7326/0003-4819-139-5_part_2-200309021-00011.
- 61. Guttmacher AE, Collins FS, Carmona RH. The family history—more important than ever. N. Engl. J. Med. 2004;351:2333–2336. doi: 10.1056/NEJMsb042979.
- 62. Moffatt MF, Kabesch M, Liang L, et al. Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature. 2007;448:470–473. doi: 10.1038/nature06014.

