Author manuscript; available in PMC: 2021 Feb 11.
Published in final edited form as: Nat Rev Genet. 2009 May;10(5):318–329. doi: 10.1038/nrg2544

Validating, augmenting and refining genome-wide association signals

John P A Ioannidis ‡,*, Gilles Thomas §, Mark J Daly ¶,#
PMCID: PMC7877552  NIHMSID: NIHMS1668198  PMID: 19373277

Abstract

Studies using genome-wide platforms have yielded an unprecedented number of promising signals of association between genomic variants and human traits. This Review addresses the steps required to validate, augment and refine such signals to identify underlying causal variants for well-defined phenotypes. These steps include: large-scale exact replication across both similar and diverse populations; fine mapping and resequencing; determination of the most informative markers and multiple independent informative loci; incorporation of functional information; and improved phenotype mapping of the implicated genetic effects. Even in cases for which replication proves that an effect exists, confident localization of the causal variant often remains elusive.


Studies using genome-wide platforms and increased sample sizes have recently yielded an unprecedented number of promising signals of association between human traits and genetic variants1. The list of association signals is growing weekly2. However, these signals are only markers of putative risk and are not necessarily the culprits — the functional genetic variants themselves. Finding these variants is essential for understanding the biological processes underlying disease pathogenesis, offering hints for developing new treatments for common diseases and identifying modifiable non-genetic exposures in these biological pathways. The identification of these variants might also improve predictive models of disease risk3.

Identifying causal variants is a difficult task and it cannot be guaranteed in advance that extensive efforts will improve biological knowledge, therapeutic potential, knowledge of modifiable exposures and predictive models. However, rational considerations can be made about which methods might be more fruitful in specific circumstances and how research steps might be prioritized. These steps include: the confirmation of association signals through replication; testing whether the signals can be generalized across different populations; fine mapping and resequencing; determining the most informative markers and finding multiple independent markers under each signal; documenting the biological function; appreciating the exact phenotypes involved in the association; and appreciating potential pleiotropy.

We briefly describe the available methods, their advantages and disadvantages, and how they might be prioritized. Even when these methods are properly combined, one should be cautious before claiming that the causative variants have been reliably identified.

Large-scale exact replication

Genome-wide significance.

Associations that have been identified from a single genome-wide association (GWA) data set rarely have definitive statistical support. p values of ≤10⁻⁷ are required for genome-wide significance. A p value of approximately 10⁻⁷ in the GWA setting corresponds to a p value of approximately 0.05 for a traditional, classical epidemiological study in which only one hypothesis is being tested. As the number of analyses performed has increased, the 0.05 threshold has been shown to be insufficient, even in traditional epidemiological studies. Similarly, as hundreds or even thousands of different GWA studies are performed and each study could incorporate multiple exploratory analyses in the pursuit of associations for diverse phenotypes, the currently adopted GWA significance thresholds could be very lenient. Given the many analyses and outcomes that are assessed in a GWA study, even a p value of 10⁻¹⁰ or less might be necessary to safely confirm an association4,5.
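The correspondence between a single-hypothesis threshold of 0.05 and a genome-wide threshold of about 10⁻⁷ can be illustrated with a simple Bonferroni-style division. The sketch below is only an illustration: the number of independent tests (500,000) is an assumption chosen to make the arithmetic visible, not a figure taken from the text, and the function name is hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p value threshold that keeps the family-wise error rate near alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumed (hypothetical) number of effectively independent tests in one GWA scan.
N_INDEPENDENT_TESTS = 500_000

threshold = bonferroni_threshold(0.05, N_INDEPENDENT_TESTS)
print(f"genome-wide threshold: {threshold:.1e}")
```

Under this assumption the per-test threshold is of the order of 10⁻⁷. Allowing for further analyses, phenotypes or studies increases the divisor and pushes the threshold lower, which is the concern raised above.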

Nevertheless, most of the proposed associations from the GWA literature have more modest support. For example, data from the National Human Genome Research Institute (NHGRI) GWA studies catalogue2,7,8 (as of January 31, 2009) show that there are 1,321 entries of discovered associations with a p value of <10⁻⁵, but only 550 of these entries have a p value of <5 × 10⁻⁸. Therefore, many of the loci reported in the literature might need replication in additional, independent data sets before they can be considered to be sufficiently reliable markers. Fortunately, such independent efforts tend to rapidly follow initial (stage 1) reports. The recommended approach (known as exact replication) is to examine the same variant of interest for association in diverse data sets using the same analysis model6.

Sources of data.

Sources of data include the original GWA data sets and additional replication studies that focus on specific variants of interest that have promising stage 1 results. In several fields, many teams could be conducting GWA studies on the same trait or disease and can therefore provide material for replication.

Consortia of investigators are also becoming increasingly popular and can build sufficient sample sizes for gene discovery. Consortia can perform meta-analyses of data from multiple GWA studies and additional replication data sets9,10. For example, in the DIAGRAM (Diabetes Genetics Replication and Meta-analysis) consortium for type 2 diabetes, data have been combined from 3 GWA and 16 replication studies11. Similarly, a Crohn’s disease meta-analysis combined data from three published GWA studies with similarly sized replication studies12. A large consortium (GIANT; Genomewide Investigation of Anthropometric Measures) compiles information on anthropometric traits, such as body mass index or height, which is collected in association studies for many different diseases13. Given the large number of teams and investigators involved, it is important that the participating teams reach an explicit consensus for data-sharing and data-analysis plans14, and that data integration is anticipated as early as possible during individual studies.

Confirmation of markers: multistage design versus joint analysis.

Stage 1 data are rarely conclusive of association, and the replication of putative markers is always needed. To achieve p values of much less than 10⁻⁷ for realistic odds ratios (that is, <1.3), large sample sizes are needed. One method to reduce genotyping costs is to perform multistage procedures15,16. Typically, a comprehensive coverage of the genome is carried out in stage 1, and one or more subsequent stages focus on the promising regions that were defined by the previous stage15. These follow-up stages typically involve a small percentage of the initial SNP set. The total genotyping cost is decreased with a minimal reduction in power when the stages have been carefully designed17.

There are nonetheless several drawbacks to multistage designs. First, the decrease in cost is mitigated by the high cost per genotype that is incurred by the custom manufacturing of a chip that explores selected SNPs. The progressive decrease in the cost of high-throughput platforms further decreases the cost benefit. Second, the selection of the SNPs that are carried to subsequent stages depends on the association test statistic. Many GWA studies use variants of the Cochran–Armitage test, which favours the identification of SNPs with multiplicative (log-additive, per allele) effects, but is less efficient at detecting risk alleles that are fully recessive or dominant. An alternative would be to use a model-free genotype test, but this is slightly less efficient at identifying multiplicative SNPs and also requires an additional degree of freedom. Third, SNPs with low p values that are in strong linkage disequilibrium (LD) (typically r² > 0.8) with a SNP that has a lower p value are often removed from the follow-up set, although the SNP with the lowest p value in this family of proxies is not necessarily more informative or closer to the causal variant. Therefore, many regions that had initially been detected that contained multiple promising SNPs are subsequently explored using a single SNP, which leads to less robust information for those regions. Fourth, single GWA studies have limited power to detect associated SNPs with small odds ratios and are notoriously underpowered to detect gene–gene or gene–environment interactions. Fifth, simulations show that the detection probability is prohibitively low when fewer than 25% of individuals in the total sample are genotyped in stage 1 (assuming a total available sample of 8,000 cases and 8,000 controls and odds ratios of 1.1–1.3)16.
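To make the second drawback concrete, the following is a minimal sketch of a Cochran–Armitage trend test on a 2 × 3 table of genotype counts, using the textbook form of the statistic with additive weights (0, 1, 2). The genotype counts and the function name are hypothetical, and the GWA studies discussed here may use other variants of the test.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Return (z, two_sided_p) for a 2 x k table of genotype counts.

    cases[i] and controls[i] are the counts carrying genotype i; weights
    encode the assumed dose-response (0, 1, 2 = additive, per allele).
    """
    k = len(weights)
    if len(cases) != k or len(controls) != k:
        raise ValueError("cases, controls and weights must have equal length")
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    # Weighted difference between the observed case and control counts.
    t_stat = sum(w * (r * n_controls - s * n_cases)
                 for w, r, s in zip(weights, cases, controls))
    sum_sq = sum(w * w * c * (n - c) for w, c in zip(weights, totals))
    cross = sum(weights[i] * weights[j] * totals[i] * totals[j]
                for i in range(k) for j in range(i + 1, k))
    variance = n_cases * n_controls / n * (sum_sq - 2 * cross)
    if variance <= 0:
        raise ValueError("table has no variation in genotype or status")
    z = t_stat / math.sqrt(variance)
    # Two-sided normal p value; erfc stays accurate for very small p values.
    return z, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts for genotypes with 0, 1 and 2 copies of the risk allele.
z, p = cochran_armitage_trend(cases=(200, 500, 300), controls=(300, 500, 200))
print(f"z = {z:.3f}, p = {p:.2e}")
```

Replacing the weights with (0, 1, 1) or (0, 0, 1) would target dominant or recessive effects instead, which illustrates why the choice of test statistic shapes which SNPs are carried forward to later stages.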

The deficiencies of a multistage procedure can be addressed by jointly analysing multiple independently performed stage 1 GWA studies. The joint analysis of the replication data sets from different GWA studies is of little interest, as there is typically minimal overlap between the SNPs that are investigated by follow-up studies. Therefore, the most efficient approach is to have a common plan among different teams for the discovery and replication of associations.

Combining data sets.

Although joint analyses are preferable, merging data sets is not always straightforward. Current GWA platforms do not capture the whole range of common variants, and the variants that are included on different platforms have limited overlap. Several imputation methods are available18–22 (BOX 1) that allow data sets from different platforms to be combined by reference to a common standard, for example, HapMap. When combining GWA data sets, several considerations apply before any quantitative synthesis can be made9, including: scrutiny of the epidemiological design (whether there are peculiarities and potential sources of bias in each study that contributes data); rigorous quality checks (including the evaluation of the Hardy–Weinberg equilibrium, the rate of missing genotypes and the imputation accuracy scores); comparison and standardization (or at least sufficient harmonization) of the analytical methods, the definitions of phenotypes or traits and the adjustments for covariates (for example, age or gender) used in each data set; appropriate adjustments for any overlap between the contributed data sets (for example, if two data sets use the same set of controls); adjustments for relatedness and population stratification; and other data consistency issues, such as consideration of the similarity of the reference HapMap build, imputation methods and adjustment for imputation accuracy across data sets.

Box 1 | Imputation methods for genome-wide association (GWA) data.

Imputation methods try to best guess the missing genotypes using observed genotypes. Using imputation methods in the GWA setting, one can cut down the direct genotyping cost and can also combine data across GWA studies that have used different platforms with limited overlap in the SNPs that were genotyped using each platform. The computation time increases linearly with the number of markers and approximately quadratically with the number of states at each marker.

Common approaches

A popular approach is to use a variant of the expectation maximization algorithm, an iterative frequentist approach that aims to maximize likelihood. Some values are selected at random for the missing genotypes, and the likelihood of seeing the observed genotypes, if these hypothetical values of the missing genotypes were true, is examined. The imputed values are sequentially corrected over 10–100 iterations until the estimated error is minimized. An expectation maximization algorithm can get caught in a local maximum (providing a false impression of the best solution), so it should be run many times with different initial states to increase confidence in the solution.

Alternatively, Bayesian models can be fitted using Markov chain Monte Carlo (MCMC) algorithms. MCMC algorithms are also iterative and explore the entire model space, not just the maximum (the best solution). This is advantageous for small sets of markers, but these methods (for example, as in PHASE 2.1) require tens of thousands of iterations and therefore a long computational time.

Dealing with error

There are two types of error in imputation: the switch error rate (the proportion of successive pairs of heterozygote markers in an individual with incorrect phasing with respect to each other) and the imputation error rate (the proportion of missing data genotypes that are incorrectly imputed). To assess how well the imputation performs, one can label a small proportion of known genotypes as though they are missing. Before this assessment, a decision should be made on what an acceptable threshold of error rates is.
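The masking check described above can be sketched as follows. This is a minimal illustration under stated assumptions: genotypes are coded 0, 1 or 2, the function names are hypothetical, and `mode_impute` is a deliberately naive placeholder (it fills each gap with the most common genotype at that marker) that stands in for a real imputation programme.

```python
import random
from collections import Counter


def mode_impute(matrix):
    """Placeholder imputer: fill missing calls (None) with the marker's modal genotype."""
    filled = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        # Arbitrary fallback of 0 if nothing was observed at this marker.
        fill = Counter(observed).most_common(1)[0][0] if observed else 0
        for row in filled:
            if row[j] is None:
                row[j] = fill
    return filled


def masked_error_rate(genotypes, impute, mask_fraction=0.05, seed=0):
    """Hide a random fraction of known genotypes, impute them, and
    return the proportion of hidden genotypes that were imputed wrongly."""
    if not 0 < mask_fraction <= 1:
        raise ValueError("mask_fraction must be in (0, 1]")
    cells = [(i, j) for i, row in enumerate(genotypes) for j in range(len(row))]
    if not cells:
        raise ValueError("genotype matrix is empty")
    n_mask = min(len(cells), max(1, round(mask_fraction * len(cells))))
    masked = random.Random(seed).sample(cells, n_mask)
    working = [list(row) for row in genotypes]
    for i, j in masked:
        working[i][j] = None  # treat the known genotype as though it were missing
    imputed = impute(working)
    wrong = sum(imputed[i][j] != genotypes[i][j] for i, j in masked)
    return wrong / n_mask


# Hypothetical genotype matrix: rows are individuals, columns are markers.
toy = [[0, 1, 2], [0, 1, 1], [0, 2, 2], [1, 1, 2], [0, 1, 2]]
print(masked_error_rate(toy, mode_impute, mask_fraction=0.2))
```

The resulting error rate would then be compared with the acceptability threshold that was fixed in advance, as recommended above.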

Accuracy or consistency for a particular imputed genotype is typically inferred by examining the consistency of the imputation during multiple steps of the same expectation maximization iteration or at the end of multiple different expectation maximization iterations. One should decide what level of accuracy will be tolerated and whether imputed genotypes with low accuracy (for example, <80% or <90%) should be excluded from further consideration and association analyses. Alternatively, some association analyses might explicitly weight all the possible values of an imputed marker.

Markers are difficult to impute with any accuracy when they are not in at least modest linkage disequilibrium with other directly genotyped markers. Direct genotyping of imputed markers that seem to represent interesting association signals is recommended for verification of the results.

There are many available programmes, including BEAGLE, IMPUTE, MACH, PLINK and BIMBAM, that have been compared in a number of studies with simulation or real data18–22. These programmes tend to give similar results but they use different heuristic methods. They can differ in computational time and their accuracy can also vary depending on the type of data and sample size.

Data from multiple data sets can be synthesized using meta-analysis methods. These methods involve either the combination of p values10,23 or effect sizes24–28 (BOX 2). Cumulative meta-analyses can update the calculations once new data are available.

Box 2 | Meta-analysis methods for genome-wide association data.

Combining p values

Methods that combine p values test the null hypothesis of no association in any of the combined data sets. The alternative hypothesis is that there is association in at least one data set. These methods are easy to compute and have adequate power. However, they have been abandoned in most non-genetic fields because they have major drawbacks. These include: the inability to calculate summary effect sizes; the inability to measure and factor potential heterogeneity across data sets into the calculations; uncertainty about how to weight data sets (and whether to weight them at all); dependence on normality assumptions for generating p values in single data sets and combining them; uncertainty whether other distributions should be used instead; and susceptibility to bias, if any one data set is biased (the method tests whether at least one data set shows an effect). Conclusions can be misleading when many analyses are performed (for example, with different phenotype definitions or adjustments) on each data set and only the best emerging p value from each data set is retained.
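As an illustration of this family of methods, the sketch below implements Fisher's method, one common way of combining independent p values. It is shown only as an example and is not necessarily the method used in the studies cited here. The function name is hypothetical. The chi-square tail probability is computed with the closed form that holds for an even number of degrees of freedom, so no external library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom if the null holds in all k independent data sets."""
    k = len(p_values)
    if k == 0 or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("need at least one p value, each in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # X / 2
    # Survival function of chi-square with 2k df: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical p values for the same variant in three independent data sets.
print(fisher_combined_p([0.01, 0.04, 0.20]))
```

The function returns only a combined p value. It gives no summary effect size and no measure of heterogeneity, and it ignores the direction of effect in each data set, which mirrors the drawbacks listed above.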

Combining effect sizes

Effect size meta-analysis methods use information on the effect sizes of the variant and calculate summary effect sizes that can be meaningfully translated; for example, a 1.25-fold increase in the odds of schizophrenia can tell us how big the magnitude of association is, but a p value of 10⁻¹⁰ alone does not convey this information. These methods also allow the extent of between-study heterogeneity to be estimated and tested. Statistical tests for heterogeneity, such as the Q test, have low power when few studies (<10) are available. Similarly, with few data sets, the 95% confidence intervals of the metrics of the amount of heterogeneity that is beyond chance, such as I², are very large. Therefore, one should be cautious about concluding whether heterogeneity is present.

Fixed effects models assume that the true underlying genetic effect in all data sets is the same and that the observed differences are due to chance alone. This assumption is violated when heterogeneity is shown to exist. Even when heterogeneity is not shown, the assumption is tenuous, owing to potential differences in phenotype definitions, linkage disequilibrium structure and many other known or unknown sources of variation across data sets.

Random effects models assume that each data set has its own true underlying effect within a population of true underlying effects; they estimate the effect in the average population and the uncertainty in this estimation. Random effects are more realistic because they explicitly allow for the possibility that genetic effects differ across studies. The most popular method for estimating the between-study variance is the DerSimonian and Laird method, but more sophisticated methods also exist. Estimation is difficult with very small samples and uncommon variants.

In the absence of between-study heterogeneity, the results derived from fixed and random effects models coincide. In the random effects framework, one can also estimate the uncertainty that exists not only for the average effect, but for any population effect and the predicted interval. For example, the 95% predicted interval is expected to encompass the true genetic effects of 95% of the diverse populations from which the observed populations have been sampled. The predicted interval is typically wider, especially when the estimated between-study variance is large.
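The fixed and random effects calculations described in Box 2 can be sketched as follows, assuming that each data set contributes an effect estimate (for example, a log odds ratio) and its standard error. The sketch uses inverse-variance weighting and the DerSimonian and Laird moment estimator of the between-study variance. The function name and the example numbers are hypothetical.

```python
import math


def meta_analysis(effects, std_errors):
    """Inverse-variance fixed-effect and DerSimonian-Laird random-effects summaries."""
    k = len(effects)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with matching standard errors")
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect summary.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = k - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)               # between-study variance (DL)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    random_effect = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    return {
        "fixed": fixed, "se_fixed": math.sqrt(1.0 / sum_w),
        "random": random_effect, "se_random": math.sqrt(1.0 / sum(w_re)),
        "Q": q, "I2": i2, "tau2": tau2,
    }


# Hypothetical log odds ratios and standard errors from four data sets.
result = meta_analysis([0.18, 0.25, 0.10, 0.31], [0.05, 0.08, 0.06, 0.10])
print({name: round(value, 4) for name, value in result.items()})
```

When the estimated between-study variance is zero, the random-effects weights reduce to the fixed-effect weights and the two summaries coincide. With so few data sets, Q and I² are themselves very uncertain, as noted in the box.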

Typically, no penalty is applied for the sequential testing of the null hypothesis at multiple time points, but one might argue that a correction for such multiple data analyses is justified29. If so, associations discovered after repeated analyses of sequentially accumulating data sets might need more stringent p values to be as credible as associations that arise when only one analysis is performed, including all data. On the basis of anecdotal experience, many teams have probably not published their GWA studies because the addition of replication data does not reach a convincing level of significance, and they then continue to accrue more replication data until a suitable threshold is reached.

Sequential discovery of additional variants.

Combination of data from several GWA studies not only improves the strength of the evidence for associations that have already been proposed, but also allows the identification of new, previously unsuspected associated loci. These variants can then be subjected to further replication efforts. For example, a meta-analysis combined data from 3 GWA studies on Crohn’s disease that had 3,230 cases and 4,829 controls, and found 526 variants in 74 loci that had p values of <5 × 10⁻⁵ (REF. 12). Eleven loci were previously proposed to be associated with Crohn’s disease, but further replication of the other 63 loci confirmed a further 21 loci for which combined data gave a p value of <5 × 10⁻⁸.

Even using large-scale synthesis of data from several GWA studies, the further addition of more GWA data will probably continue to be useful. Most genetic effects that are pursued for common variants seem to be very small, which could reflect odds ratios below 1.3, and possibly even less than 1.1 (REFS 1,12). The combined data in the Crohn’s disease meta-analysis would have adequate power to detect single loci that explain over 0.2% of the risk variance. The power for more subtle associations decreases steeply. Over 100 of these subtle associations might exist and their discovery will therefore require extremely large GWA data sets.

Winner’s curse.

Associations that pass desired thresholds of statistical significance tend to have inflated effect sizes (the so-called winner’s curse)30,31. This phenomenon should be accounted for when estimates are made about the required sample size for subsequent replication studies. The predictive ability of the discovered associations and the estimate of the risk variance explained by the associations are also inflated. The magnitude of the winner’s curse is inversely related to the power of the study. In typical circumstances, for 10% power, the inflation of an additive effect could be approximately 60%; however, for 60% power, the inflation would be only 10%. For small effects, even large meta-analyses could be grossly underpowered and emerging associations could be considerably inflated. For rare variants, the power can be <1%, and therefore associations that are discovered for rare variants will have extremely inflated effects and the true effect size should await further replication. Although analytical methods exist for estimating the amount of inflation, and such estimates can be considered in power and sample size analyses for future studies, a more reliable estimation of the effect size would be typically obtained from evaluating the association in additional data sets.
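The inverse relationship between power and inflation can be illustrated analytically under a simple model. The sketch below assumes a normally distributed effect estimate, a one-sided significance threshold (5 × 10⁻⁸ by default) and selection purely on statistical significance. These are assumptions made for illustration, so the numbers it produces are not expected to reproduce the figures quoted above. The function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def relative_inflation(power, alpha=5e-8):
    """Expected relative overestimation of an effect, given that it reached
    one-sided significance at level alpha in a study with the stated power."""
    if not 0 < power < 1:
        raise ValueError("power must be in (0, 1)")
    if not 0 < alpha < 0.5:
        raise ValueError("alpha must be in (0, 0.5)")
    critical = -STD_NORMAL.inv_cdf(alpha)          # cut-off on the z scale
    mean_z = critical + STD_NORMAL.inv_cdf(power)  # true effect / standard error
    if mean_z <= 0:
        raise ValueError("power must exceed alpha for a positive true effect")
    # Mean of a normal truncated below at the cut-off, minus the true mean.
    shift = STD_NORMAL.pdf(critical - mean_z) / power
    return shift / mean_z


for pw in (0.01, 0.10, 0.60, 0.90):
    print(f"power {pw:.0%}: inflation {relative_inflation(pw):.0%}")
```

Whatever the exact values, the inflation shrinks steadily as power increases, which is why associations found with very low power should await replication before their effect sizes are taken at face value.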

Heterogeneity.

Genetic effects can vary across data sets (heterogeneity). Statistical heterogeneity can affect the ability to detect associations32 and, when detected, it poses questions about the reasons for its existence. However, one should acknowledge that unless many data sets are combined, there will be a large degree of uncertainty about the amount of estimated between-study heterogeneity33. It is practically impossible to prove the absence of any between-study heterogeneity. When multiple large data sets do not show significantly different effects at a confirmed locus, it can be concluded that even if heterogeneity is present, it is probably small. Conversely, if large heterogeneity is readily documented, it should be taken seriously, despite uncertainty about its exact magnitude.

Heterogeneity can reflect errors and biases or genuine diversity34. Errors or biases include, but are not limited to, genotyping errors, phenotype misclassification, unaccounted population stratification or selective reporting of results. Heterogeneity can also be due to genuine differences in the pertinent LD blocks across populations or differences in phenotype ascertainment across populations. Finally, diversity in genetic effects across studies could reflect differences in environmental exposures across populations or differences in allele frequency at an independent interacting locus. Knowledge of these parameters can help to fine tune the genetic, phenotypic and environmental modulation of the proposed association. Unfortunately, several of these parameters are usually either unknown or vary concurrently in the same study, and it is difficult to pinpoint which is the most influential parameter in shaping the observed heterogeneity.

Ideally, case and control samples should be drawn randomly from the same population to minimize the chances of population stratification. To limit selection bias, data sets can use nested case–control samples from well-defined prospective cohorts. However, the number of cases available from prospective cohorts is usually low and investigators might use convenience samples, thus increasing the risk of population stratification. Also, owing to the present cost of genome-wide genotyping or, for some studies, owing to the lack of proper controls, it can be advantageous to combine genotypes from several control groups that are obtained independently from the cases. The gain in power that is due to an increased number of controls is expected to overcome the loss of power that is due to potential stratification. In one seminal experiment, a common set of controls collected through two different procedures was successfully used against several case groups, each of which was assembled on the basis of a different phenotype and using different designs35. Methods such as principal components analysis can probe population stratification and correct for it at the cost of a small decrease in power36.

Historically, the heterogeneity of published association results has been a sign of false positive associations37 that arise from optimistic statistical interpretation and contributions from errors. However, with rigorous quality control and statistical methods, in the future we should be able to discover true biological heterogeneity, if it exists.

Handling differential linkage

Linkage in the close neighbourhood of a marker.

Given the extent of LD in the human genome38, any particular variant is likely to have many close proxies. An evaluation of 5 Mb of the human genome that was sequenced and genotyped at a high resolution (the ENCODE regions) in the HapMap documented that more than one-half of all common SNPs had more than 10 neighbouring SNPs with r² > 0.8 (REF. 39). Furthermore, small insertions and deletions and larger copy number variants (CNVs) are also in LD with SNPs, adding to the diversity of variants that are implicated by any GWA study40. Given such a correlation structure, it is difficult to identify the causal variant. Many variants with less than perfect correlation to the initially identified variant should be considered. This strategy substantially increases the number of variants to pursue.

Exploiting heterogeneity.

One can test emerging associations in populations of different ancestry and populations in which the LD blocks in the area that surrounds the associated variant are known to be different (they have different recombination hot spots and different allele frequencies) from the LD blocks in the population in which the marker was originally identified. This strategy should probably be applied to markers that have already achieved robust statistical support in replication studies across populations of similar LD. When there are several markers at the same locus, each with strong association, replication in a population with a different LD pattern can help identify which markers have the strongest and most consistent LD to the causative variant.

The approach of exploiting genetic heterogeneity has some limitations. As discussed above, the causes of heterogeneity have substantial statistical uncertainty. Modest LD differences will not necessarily result in different genetic effects. Heterogeneity in the genetic effects might be due to various other causes, beyond differences in LD with the causative variant. Populations of different ancestry might have different lifestyles or other environmental exposures that confound interpretation. Genetic interactions with unlinked alleles that differ in frequency between populations could also manifest as effect heterogeneity. Nevertheless, for replicated candidate gene associations, heterogeneity between racial descents is rare41. Most GWA studies have used populations of Caucasian ancestry, but examples accrue from markers that have been investigated in populations of diverse ancestry42–48 (BOX 3).

Box 3 | Genetic effects of genome-wide association-derived markers in populations of different ancestry.

The results of three studies on genome-wide association (GWA)-derived markers for breast cancer, type 2 diabetes (T2D) and atrial fibrillation are shown in the figure, left panel, for which evaluation has been performed in both Caucasian and Asian ancestry populations. The per-allele odds ratios are compared in the two populations.

Five low-penetrance risk factors for breast cancer had similar odds ratios in East Asian and Caucasian populations42, and most loci for T2D discovered in Caucasian GWA studies had similar odds ratios in Asian populations, even though the allele frequencies were often different43. Some of the variants (for example, cyclin-dependent kinase inhibitor 2A and 2B variants for T2D, and 8q and lymphocyte-specific 1 locus markers for breast cancer) were not nominally statistically significantly replicated in the Asian ancestry samples, but the effects in Caucasian populations were so small (odds ratios of 1.08–1.12) that it is difficult to tell whether there is no effect in Asians (estimated odds ratios of 1.03–1.06) and whether the differences are beyond chance.

In another example, of the two variants that were associated with the risk of atrial fibrillation in Caucasian populations, only one had an effect (albeit smaller in magnitude) in Asian ancestry populations (not shown, as it is an outlier in the plot with an odds ratio of 1.72 in Caucasian and 1.42 in Asian populations), and the other variant had no detectable association in populations of Asian ancestry, despite a clear effect in Caucasian populations (odds ratio of 1.39)44.

Linkage disequilibrium (LD) blocks tend to be substantially different in populations of African ancestry compared with samples of Caucasian or Asian descent. Therefore, refinement and selection of the best markers might be most efficient in African populations. Two examples are shown in the right panel: variation in the fat mass and obesity-associated gene (FTO) and obesity, and variation in the TCF7L2 gene and T2D.

For the FTO locus in the original discovery45, rs9939609 had the strongest obesity association signal in Caucasian populations, but there were at least two other SNPs (rs3751812 and rs8050136) with an r² value of 1 in Caucasians and many other neighbouring SNPs also had significant association signals. Conversely, when the two perfect proxies and another 11 SNPs from the same region were tested in a population of African ancestry, only rs3751812 showed a nominally significant signal (odds ratio of 1.31, p value of 0.017; however, these values are unadjusted for multiple comparisons). Interestingly, rs3751812 and the originally proposed rs9939609 SNP have an r² value of 1 in Caucasians but only 0.058 in populations of African ancestry. Of note, the originally proposed rs9939609 SNP and the rs8050136 SNP do not seem to have clear association signals in Chinese populations either46.

Another example in which replication with intended heterogeneity resulted in refining the association was the evaluation of TCF7L2 variation and T2D. Although the original discovery in populations of Caucasian ancestry showed that one microsatellite and two SNPs in TCF7L2 were strongly associated with T2D47, only one of these markers (rs7903146) showed a consistent association in West Africans (estimated odds ratio of 1.45 in African and 1.48 in Caucasian populations)48.

In search of the affected genes: local, proximal and distant effects.

Statistical genetics alone might not permit the isolation of one causal variant, and finding the affected genes is even more convoluted. The location of the causal variants can be linked to the affected genes in different ways49–51 (BOX 4).

Box 4 | Where is the culprit? Functional variants and affected genes.

Whether a genome-wide association (GWA) study and subsequent efforts have found the causal functional DNA variants and whether they have found the genes that these variants belong to or regulate are two separate issues. The top markers hit by the GWA study could be the causal variants or just one or more linked (correlated) markers. Given that only a small fraction of the over 10 million common genome variants are directly genotyped by current GWA platforms, it is far more likely that the top-scoring markers are simply correlated markers. If a linked marker has been hit, the causal variant could be another common variant, a rare variant or a structural variant (see the figure, part a). Moreover, the causal variant could represent either coding or regulatory variation. The causative gene could be directly hit or it could be a gene in the immediate vicinity (the hit marker is in the same gene or its immediately neighbouring regulatory areas), a gene in the wider vicinity (in the same LD block as the hit marker but not the gene closest to it) or a remote gene affected by some regulatory circuit (see the figure, part b). The causative genes could also be many genes that are affected by the same regulatory circuit.

The causal variant should be in the vicinity of the best associated SNP, that is, its distance (in centiMorgans) to the best associated SNP should be small. Hot spots of recombination in the vicinity of the best associated SNP provide some logical boundaries. Typically, the bounded areas are 5–100 kb in European populations and tend to be smaller in African populations39. However, the causative genes do not need to be bound by the same recombination hot spots.

The power of a SNP marker to show association is related to its correlation coefficient with the causal variant. The sample size required to detect an association by typing a marker that has a correlation coefficient r² with the causal variant is inflated by 1/r² compared with the sample size required for direct typing of the causal variant49,50. Moreover, the correlation r(g–ph) between a causal variant (g) and the phenotype of interest (ph) is decreased by the square root of the product of the reliability with which the causal variant is measured (rel(g)) and the reliability with which the phenotype is measured (rel(ph)) (REF. 51). The lower the r² between the causal variant and its marker, the lower the reliability rel(g) and the greater the decrease in the association. Thus, poorly linked markers are less likely to be found in the top ranks of association than strongly linked markers. However, there is no assurance that the correlation between the best hit and the causal variant is perfect. Moreover, the observed order of the p value or other ranking for the associated variants is subject to random noise, even in the absence of any bias. On average, one expects to get a stronger association signal for the causal variant than for variants that are less than perfectly linked to the causal variant. However, owing to chance error, a linked marker can produce a stronger association signal than the causal variant. Given that there are usually many highly linked variants, the observed top-ranking variant is usually not the causal variant, even if the causal variant has been directly genotyped. Thus a wide net should be spread to evaluate a range of linked variants in the region of interest.
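The two relationships described above (the 1/r² inflation of sample size and the attenuation of the genotype–phenotype correlation by measurement reliability) can be illustrated with a minimal Python sketch. The function names and the numerical inputs are hypothetical and chosen only for illustration; they do not come from any cited study.

```python
import math


def marker_sample_size(n_causal: float, r2: float) -> float:
    """Sample size needed when a proxy marker is typed instead of the causal variant.

    n_causal: sample size that would suffice if the causal variant were typed directly.
    r2: squared correlation (LD) between the marker and the causal variant.
    """
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return n_causal / r2  # inflation by a factor of 1/r2


def attenuated_correlation(r_true: float, rel_g: float, rel_ph: float) -> float:
    """Observed genotype-phenotype correlation after imperfect measurement.

    The true correlation is multiplied by sqrt(rel_g * rel_ph).
    """
    return r_true * math.sqrt(rel_g * rel_ph)


# Illustrative (made-up) inputs: a marker with r2 = 0.5 doubles the required sample size.
n_needed = marker_sample_size(1000, 0.5)
# Illustrative (made-up) reliabilities of 0.8 (genotype proxy) and 0.7 (phenotype).
r_observed = attenuated_correlation(0.10, 0.8, 0.7)
print(n_needed, round(r_observed, 4))
```

Under these assumed values, a true correlation of 0.10 would be observed as roughly 0.075, which shows how even moderate unreliability in both measurements erodes the signal.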

Sometimes, variants can easily be found that cause non-synonymous changes in the coding region of a gene. Examples in this category include the non-synonymous functional variant in integrin-αM (encoded by ITGAM) that has been associated with systemic lupus erythematosus (SLE)52, the non-synonymous SNP in protein tyrosine phosphatase non-receptor 22 (PTPN22) that protects from Crohn’s disease, but increases the risk of rheumatoid arthritis and type 1 diabetes12, or the non-synonymous SNP in thyroid adenoma-associated (THADA) that is associated with type 2 diabetes11. However, far fewer than one-half of the reported GWA results seem to point to such obvious variants.

The discovered markers can be in non-coding regions or the relevant haplotype block can extend across several genes. Sometimes the identified markers lie in gene deserts; for example, the 8q24 gene desert42,53–62 (BOX 5). The affected genes can be near to or in the wider vicinity of the marker, or even on other chromosomes.

Box 5 |. Gene deserts: the 8q24 region and susceptibility to various cancers.

Many studies have found signals of association for diverse cancers in a long stretch of 8q24 that contains no characterized genes (a gene desert). Two association studies searching for prostate cancer susceptibility exploited either a linkage study performed on a large Icelandic population53 or a genome-wide admixture study on a group of African Americans54 and identified the same associated marker in prostate region 1 (rs1447295). Subsequently, independent GWA studies performed on various populations of European origin identified four additional susceptibility loci in the same chromosomal sub-band. One of these loci is clearly involved in multiple cancer types (prostate55,56, colon57 and ovary58), and the other three have currently been unambiguously associated with only one cancer type (prostate region 3 (REFS 56,59), breast43 and bladder60). These different regions might be associated with an even larger range of cancer types61,62, but replication of the proposed associations is warranted for them to be widely accepted. For each of the five associated regions, the marker with the highest significance for cancer associations (as of late 2008) is indicated (see the figure). The blocks of linkage disequilibrium in which these markers are located are devoid of characterized genes. They are located 30 kb to 500 kb upstream of the nearest characterized gene (the MYC oncogene). A parsimonious hypothesis is that the various cancer predispositions observed in the 8q24 region are related to dysregulated MYC expression, but the variants are far from MYC and this hypothesis has not been supported by clear-cut experimental data, and thus the interpretation of the mystery of this gene desert remains open.

Several instructive points can be derived from the 30 risk loci that have been identified for Crohn’s disease. Approximately one-half of the associated loci contain more than one gene in strong LD with the associated variant. However, although the true causal variant should be in strong LD with the detected SNP, there is no reason to presume that the gene affected by the causal variant need be in LD or even close to the causal variant. Although most of the regulatory variation that influences a gene is probably close to the gene itself, as suggested by analyses of conserved transcription factor binding sites63 and cis-acting expression associations (expression quantitative trait loci; eQTLs)64, long-range trans-acting regulatory variation has been documented65. This type of variation must be considered a viable possibility as, of the 30 Crohn’s disease associations, two-thirds do not seem to map to amino acid-coding variations and six map to regions without any known protein-coding genes. For example, variation in a 5p13 gene desert correlates with the expression of prostaglandin E2 receptor EP4 subtype (PTGER4), which is 270 kb away66.

Fine mapping and resequencing

Fine mapping an established association.

The aim of fine mapping is to select a non-redundant set of either perfectly correlated or highly correlated polymorphisms in a region, which is typically defined by a haplotype block. Although this approach seems straightforward, it poses several technical challenges. First, the complete common SNP set is not known. The second version of the HapMap67 is currently the most detailed resource for these analyses, but it contains only ~30% of the common SNPs that are present in the genome. To acquire the entire list of correlated variants, the regions of association, which can sometimes be long, must be sequenced in a meaningful number of reference samples. However, the 1000 Genomes Project will soon complete the genome-wide resequencing of more than 1,000 individuals and thus dramatically improve the coverage of common DNA variation. Improvements in next-generation sequencing make it increasingly possible to sequence the whole region of interest for large population samples68,69. Meanwhile, the effectiveness of imputation methods for filling in missing genotype information will allow the 1000 Genomes Project resource to be integrated in large case–control studies.

Even if the catalogue of correlated variants is essentially complete, the emerging picture could still remain unclear. The aim of fine mapping is to evaluate the statistical evidence at neighbouring correlated variants, but the differences between them can be small; for example, it is difficult to distinguish the true causal variant that has an odds ratio of 1.2 from partially correlated neighbouring SNPs that have odds ratios (that arise from LD) of 1.1–1.15, although it is not necessary to correct for a genome-wide testing burden to test this focused hypothesis. Methods for correction of multiple correlated tests are beyond the scope of this Review, but REFS 70–72 describe applicable procedures.

As the 1000 Genomes Project is also compiling a complete picture of CNVs and small insertion–deletion (indel) polymorphisms, it will provide an important connection between these fine mapping activities and the identification of other DNA variants, which can then be analysed for functional consequences. For example, in Crohn’s disease, associated non-coding variation identified near the immunity-related GTPase family M gene (IRGM)73 was subsequently shown to be a proxy for a 20 kb deletion upstream of the gene that significantly influences gene expression.

Deeper sequencing to seek other associated variants.

Although fine mapping of individual common SNPs might be well supported by resources of genetic variation, many important rarer variants will remain beyond the reach of indirect LD-based approaches because they are either too rare in the general population74,75 (although they might have higher penetrance in disease) or are mutations that have arisen recently. The effectiveness of deep candidate gene sequencing of the tails of population phenotype distributions has been established as a complementary strategy76,77; however, these studies have only recently become efficient and affordable to perform on a larger scale, owing to the advent of next-generation sequencing technologies. Although the original signals detected in GWA studies can often be common variants, the discovery of highly penetrant rare variants in the same region might be crucial.

Most informative and independent markers

Selecting the best correlated variant for a single locus.

Most variants that map to the same genetic locus do not contribute independent information. If the different variants are in perfect LD, any of the variants can be chosen for further evaluations and replications. However, if the data on which perfect LD is based are limited, the LD might prove to be less than perfect once more samples are genotyped. If there is perfect correlation for k genotypes between two variants, the 95% confidence interval for the proportion of discordant genotypes is approximately 0% to (300/k)%. Thus, if 100 subjects have been genotyped and two variants are perfect proxies, 3% discordance in genotypes is still plausible. The LD could also be substantially different in populations of different ancestry.
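The 300/k approximation used above corresponds to what is often called the 'rule of three'. A minimal Python sketch, assuming a one-sided 95% bound and independent genotype pairs, compares the approximation with the exact binomial bound obtained by solving (1 − p)^k = 0.05; the function names are hypothetical.

```python
def rule_of_three_upper(k: int) -> float:
    """Approximate upper 95% bound on the discordance proportion after 0 of k discordant."""
    return 3.0 / k


def exact_upper(k: int, alpha: float = 0.05) -> float:
    """Exact one-sided bound: the largest p for which observing 0 of k is still plausible.

    Solves (1 - p) ** k = alpha for p.
    """
    return 1.0 - alpha ** (1.0 / k)


for k in (30, 100, 1000):
    approx = rule_of_three_upper(k)
    exact = exact_upper(k)
    print(f"k={k}: rule of three {approx:.4%}, exact {exact:.4%}")
```

For k = 100 the two bounds agree closely (about 3%), which is the figure quoted in the text.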

If the observed LD between two variants is not perfect, but is high (r² > 0.90), then one variant could be selected on the basis of diverse considerations. These include selecting the variant with the best level of statistical significance (the best p value), the one that has the highest likelihood ratio or the best Bayes factor under specific prior assumptions. These rules often lead to the selection of the same variants, but sometimes they can lead to discrepancies. The effect size, the proportion of variance explained and the lowest between-data set heterogeneity can also be useful considerations, especially if the marker is considered for predictive purposes. However, usually the differences in these parameters across the compared variants are subtle. If not, one should test whether these variants contribute independent information.

In several examples, the variant with the strongest support in the original discovery phase was replaced by another variant in high LD that had a much stronger effect in further replication studies73,78. If the number of variants in high LD that have promising signals is small and there are no financial constraints, it can be useful to continue the replication process by testing several of the variants in subsequent samples. Testing several variants in high LD rather than just one might also be sensible when the LD block contains more than one gene — in which case, it might be worth considering markers from each gene.

Deciding on the contribution of two or more independent loci.

Markers with promising association signals that are not in LD are expected to contribute independent effects to the phenotype of interest. Therefore, all independent signals should be carried forth in further research efforts.

For markers that have modest LD, the situation is more complicated. One can use different methods to probe the contribution of each marker. Multivariate (or conditional) analyses can be used, in which each SNP is considered as a separate variable. When multiple independent loci are implicated, their discovery is subject to the, often underappreciated, limitations of regression models. These limitations include: overfitting, inflated effects for selected markers, dependence of the resulting model on the model-building approach and criteria for keeping or deleting markers, potential colinearity in the presence of correlated markers, and dependence on influential outliers and erroneous data. Advanced analysis and modelling options can bypass or ameliorate some of these problems (for example, see REF. 79), but caution is warranted, particularly when a study is underpowered and there are large correlations between markers. Alternative methods, such as classification trees or artificial intelligence (neural network classifiers), do not necessarily improve the performance of the models and also require an explicit description of how they have been applied to be reproducible.
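The logic of a conditional analysis can be illustrated with a deliberately simple stand-in. The sketch below does not fit a multivariable logistic regression; instead it uses a stratified Mantel–Haenszel odds ratio, which asks the same question in a cruder form: does marker B remain associated with disease within strata defined by marker A? All counts are invented for illustration, and the function names are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a: cases carrying B, b: cases not carrying B,
    c: controls carrying B, d: controls not carrying B.
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio across strata of (a, b, c, d) tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Made-up counts: 2x2 tables for marker B, stratified by carrier status at marker A.
strata_by_a = [
    (80, 20, 40, 10),  # carriers of marker A
    (10, 40, 20, 80),  # non-carriers of marker A
]

# Collapsing over A gives the marginal (unconditional) table for B.
pooled = tuple(sum(column) for column in zip(*strata_by_a))
marginal_or = odds_ratio(*pooled)
conditional_or = mantel_haenszel_or(strata_by_a)
print(marginal_or, conditional_or)
```

In this constructed example, B looks associated with disease when analysed alone, but the association disappears once A is held fixed, which is the pattern expected when B is merely a correlated proxy for A. With real data, a regression-based conditional analysis would usually be preferred because it handles genotype dosage and covariates.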

Once a set of markers has been selected, one can construct haplotypes and also evaluate haplotype effects. Several methods are available to infer haplotypes (see BOX 1 on imputation methods, for which many principles are similar, and a review on Bayesian methods80). Ideally, the resulting multimarker models or risk haplotypes should be replicated again in additional independent populations to assess their consistency. Results should be interpreted cautiously and in the context of other data, including potential functional information. It should not be assumed that the results can be extrapolated to other populations. In summary, this is not an easy task.

Case studies.

As an example, several studies have tried to dissect the contribution of multiple independent loci in interferon regulatory factor 5 (IRF5) to SLE81–84. Initially, an exon 1B splice site marker (rs2004640) had the strongest association (the lowest p value). After resequencing, the association signal was strongest for three highly correlated SNPs (group 1) that did not include the previously studied rs2004640 (REF. 81). Logistic regression conditioning on the group 1 rs2070197 variant showed that a second set of correlated SNPs (group 2; including the previously described rs2004640) were independently associated with SLE risk. After including both a group 1 and a group 2 variant in the model, a third set of six highly correlated SNPs (group 3) were independently associated with risk of SLE. On the basis of functional considerations, plausible biological mechanisms were suggested for the group 2 (altered splicing) and 3 (alteration of 3′ UTR length and mRNA stability) variants, but not for the group 1 variants. A new logistic regression conditioned on a group 2 variant and a group 3 variant; that is, the selection of conditioning was not based on statistical significance but on additional functional considerations. This led to the discovery of an additional independent group and the tagging of an indel in exon 6 that results in protein isoforms with differential ability to initiate transcription of IRF5 target genes. Risk haplotypes were thus constructed on the basis of three variants.

However, other studies reached different conclusions. A Swedish study identified 16 SNPs and 2 length polymorphisms in IRF5 that associate with SLE82. This study, which used a Bayesian approach, selected a model with only two markers: rs10488631 (a marker from group 1 of the previous study) and a novel CGGGG indel that is 64 bp upstream of exon 1A and was thought to explain the described associations of the group 2 and group 3 markers in the previous study. To further complicate the interpretation of these studies, other studies found that many risk markers in European populations were not polymorphic in Asian populations and identified different risk variants and haplotypes83,84.

Another example that highlights the complexity of multiple independent markers is provided by complement factor H (CFH) and age-related macular degeneration. The original discovery was a common polymorphism, Y402H, which was strongly associated with disease susceptibility; however, subsequent studies have shown the importance of multiple variants beyond Y402H in CFH and its vicinity85–87.

Deciding on the optimal model of genetic inheritance.

Typically, it is not strongly believed that the associations to be discovered should follow a particular model of inheritance (for example, dominant, recessive or codominant). One can perform analyses with different models and select the best-fit model or use model-free approaches. In studies of typical sample sizes, it is practically impossible to confidently assess which model fits the data best — specific patterns might reflect noise rather than a specific fit to a model. The situation might improve when data are collected across many data sets and appropriate Bayesian meta-analysis methods are available to identify the best-fit models88. Such meta-analyses can also accommodate heterogeneity in the genetic model across data sets. Heterogeneity might be more prominent when the markers are more remote proxies of the causal variant and when LD differs across data sets.

Function

There are two types of functional data: those that help understand the molecular change introduced by the identified DNA variant and those that show how that molecular change alters a biological process that is relevant to disease.

On the basis of observations from Mendelian disorders (in which gene inactivation and protein-coding polymorphisms dominate89) and incontrovertible evidence that selection acts most strongly against protein-coding polymorphisms, a truncating (stop codons and frame shifts) or non-synonymous coding change is considered to be a leading candidate for the causative variant. Large-scale gene expression data sets — for example, eQTLs — can also help to distinguish between SNPs. However, when the effect size is small, neither eQTLs nor coding SNPs can confirm the causal SNP. There are noteworthy examples in which association to a coding SNP was later shown to be a result of LD with a stronger causal non-coding variant78.

Several in silico analyses might help to functionally characterize associated variants. Polymorphisms in transcription factor binding sites might be functional and the conservation of potential motifs across species68 and of clusters of motifs upstream of many genes90 can identify true transcription factor binding sites. Similarly, computational methods can be used to identify non-coding RNAs, microRNA binding sites and polymorphisms that create or abrogate poly(A) tracts. However, computational algorithms have limitations; for example, an empirical evaluation has shown that commonly used algorithms make >50% errors in predicting transcription factor binding sites91.

In vitro experiments can also help to evaluate the expression of genes in cells of different genetic backgrounds, or identify the functional changes that arise when genes are knocked down by RNAi or chemically enhanced or suppressed. In cases in which a protein has been clearly implicated in a biological process, in vivo measurements of human samples might also be feasible. Introduction of a specific DNA variant into a model system or perturbation of a gene in vivo might provide further insights into the influence of the variant on biological processes or functions, and confirm the effect of the variant on the phenotype of a complete organism.

The balance between these approaches can vary. Immunological defects can be effectively studied through in vitro experimentation using immune cells removed from the circulation of patients or controls92. For bacterial autophagy experiments in Crohn’s disease78,93, it also seems straightforward to extrapolate the results from intracellular responses to bacteria in vitro. By contrast, brain-specific effects will be far more difficult to study in vitro as sampling brain tissue from patients or maintaining neuronal cell lines is more difficult logistically. Therefore, murine, zebrafish and other model organism studies can offer opportunities that are not available using in vivo or in vitro studies in humans94,95.

Overall, functional data are helpful, but are not a substitute for robust statistical replication of proposed or refined associations. There is some empirical evidence that there is limited concordance between functional and epidemiological data96. Given the multiplicity of functional analyses that can be performed, selective interpretation of the functional results is a potential source of bias.

Phenome mapping

Phenotype definitions can vary across data sets. Furthermore, different phenotypes often constitute a dense correlation network. When a robust association signal is documented, it can pertain to the tested phenotype, to one or more other correlated phenotypes or to both the tested and other phenotypes97–111 (BOX 6). Pleiotropy can also exist. Investigators have proposed ‘diseasome’ diagrams in which different disease phenotypes are connected by links whose strength is proportional to the number of genes, among those that seem to regulate their risk, that the phenotypes have in common112–114.

Box 6 |. Correlation between phenotypes.

Different phenotypes can be substantially correlated. The figure shows observed Pearson correlation coefficients between metabolic syndrome-related phenotypes97 (top) and between various candidate schizophrenia-related endophenotypes (bottom)98. Correlations that exceed absolute values of 0.20 are in red and correlations with absolute values of 0.10–0.19 are in green. Some correlation might be due to shared genetic covariance. A detected genetic association for one phenotype might reflect associations for other correlated phenotypes.

Some genetic effects are partly or totally explained through an association with another phenotype. For example, fat mass and obesity-associated (FTO) is robustly associated with obesity, but its original discovery was as a marker for type 2 diabetes (T2D)99. Given that obesity increases T2D risk, a gene variant that increases obesity risk could also increase T2D risk solely through effects on weight. Indeed, in a study that matched diabetic cases and controls for body mass index, the FTO variant was no longer significantly associated with T2D100,101. In some fields, such as for psychological or mental traits and diseases, phenotype associations can be strong and difficult to disentangle. For example, the correlation coefficient between intelligence quotient (IQ) and schizophrenia is −0.61, and shared genetic variance accounts for 92% of the covariance between these phenotypes102. A genome-wide association (GWA) study for schizophrenia might find hits that are associated primarily with IQ, and adjustment, matching or stratification for IQ would be needed to identify variants that are associated with schizophrenia but not IQ.

A gene variant might be truly associated with two or more different correlated phenotypes. For example, several gene variants have been identified that seem to confer susceptibility to multiple autoimmune syndromes and diseases that share some partial clinical overlap, as is the case for protein tyrosine phosphatase non-receptor 22 (PTPN22) and cytotoxic T lymphocyte antigen 4 (CTLA4) variants103–105. These observations could indicate some common pathogenetic considerations. Genes could also have clear pleiotropic effects on phenotypes that are apparently clinically uncorrelated. Opposite effects can exist; for example, a variant in the hepatocyte nuclear factor 1β (HNF1B) gene seems to increase the risk of prostate cancer but protect from diabetes106, and a missense variant in the glucokinase regulator gene (GCKR) is apparently associated with increased concentrations of plasma triglyceride and C-reactive protein, but lower fasting glucose concentrations107.

Among many highly correlated phenotypes, the phenotype that is measured with the least error (the highest reliability) should be optimal to use for detecting and refining associations108. Genetic effects can be extensively blunted even by modest phenotypic error; for example, for personality scales, reliabilities range between 0.7 and 0.85 (REF. 109), and this could almost annihilate the power to detect subtle associations of variants with such personality traits. Another hint is that for phenotypes that vary a lot over time, using repeated measurements might enhance power, especially if genetic effects are not limited to a specific age110. Finally, some fields evaluate a large number of correlated phenotypes; for example, 21 studies on asthma have used 485 different phenotypes or analyses for the response to treatment in asthma111. This creates challenges in standardizing and juxtaposing information across data sets.

Different research teams often collect data on different phenotypes and complete data from all teams might be available for only a few phenotypes111. The same challenges apply to the evaluation of environmental exposures. Although it is interesting to assess gene variants in populations with different environmental exposures (these exposures are cumulatively called the exposome115), data are often either lacking or subject to high levels of error in measurements.

The discovery of associations should use carefully defined phenotypes that are as similar as possible across combined data sets. Once a signal is documented, exploring associations with other diverse and correlated phenotypes can be pursued. The same strategy applies to the examination of clinically interesting subgroups and subsets116. The dangers of unaccounted multiplicity of comparisons should not be understated.

Conclusions and perspectives

We have discussed a number of steps that can be followed in trying to validate, augment and refine detected genome-wide associations. These steps have a rational sequence (FIG. 1), but this can also be modulated by the availability of resources and data, and the specific challenges that arise in each field.

Figure 1 |. Putting it in order.

The arrows show the flow of research effort and thicker arrows show the more common sequence for the use of these methods. Typically, robust statistical documentation of an association takes precedence over refinement efforts. Otherwise, one runs the risk of spending funds and efforts on associations that are false positives with low yield. After robust exact replication, one can proceed directly to functional evidence if there is an immediately obvious observation (for example, a striking coding change), or wait to perform functional studies only after additional genetic and epidemiological refinement. Genetic and epidemiological refinements typically proceed in parallel. In this process, if definitive genetic or epidemiological evidence arises, one can again proceed with functional assays. Finally, if suitable data are available, phenome mapping can be exploited at different stages: at early stages for maximizing power, and detecting and replicating the strongest associations; or at late stages for assessing the full phenotypic range.

Eventually, if one starts from one or more variants and reaches completely different DNA variants through a meandering path that involves a combination of these techniques, it might be advisable to re-evaluate their exact replication117 in large-scale population samples, acknowledging the dangers of hidden multiplicity of analyses. Public availability of data and analyses might improve the credibility of the discovery and refinement processes118–120.

The process described above will often fail to identify the causal variants with certainty. Both the genetic and phenotypic architecture could have dense correlation patterns that are difficult to decipher. Functional insights can also often be tenuous. Although the discovery of GWA signals is exciting, the amount of work required to achieve and confirm causal variants should not be underestimated.

Pleiotropy.

The effect of a gene on more than one phenotype or disease.

Meta-analysis.

An analysis that combines the evidence from multiple data sets.

Odds ratio.

A measurement of association that is commonly used in case–control studies. It is defined as the odds of exposure to the susceptible genetic variant in cases compared with the odds of exposure in controls. If the odds ratio is significantly greater than one, then the genetic variant is associated with the disease.
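As an illustration of this definition, the following Python sketch computes an odds ratio from a 2×2 table of carrier counts and attaches an approximate 95% confidence interval using the standard error of the log odds ratio (the Woolf method). The counts and the function name are hypothetical.

```python
import math


def odds_ratio_ci(case_exp, case_unexp, ctrl_exp, ctrl_unexp, z=1.96):
    """Odds ratio with an approximate confidence interval (Woolf / log method).

    case_exp, case_unexp: cases with and without the variant.
    ctrl_exp, ctrl_unexp: controls with and without the variant.
    All four counts must be positive for the log method to apply.
    """
    or_ = (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)
    se_log = math.sqrt(1 / case_exp + 1 / case_unexp + 1 / ctrl_exp + 1 / ctrl_unexp)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper


# Made-up counts: 300 of 1,000 cases and 250 of 1,000 controls carry the variant.
or_, lower, upper = odds_ratio_ci(300, 700, 250, 750)
print(f"OR = {or_:.3f}, 95% CI {lower:.3f} to {upper:.3f}")
```

An interval that excludes 1 is the usual informal reading of 'significantly greater than one' for a single test; it does not account for genome-wide multiplicity.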

Cochran–Armitage test.

A genotype-based contingency table test for association that is well suited to the detection of trends across ordinal categories (in this case, genotypes).
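A minimal Python sketch of the trend test follows, assuming additive scores of 0, 1 and 2 for the three genotypes and a chi-squared reference distribution with one degree of freedom. The genotype counts are invented, and the function name is hypothetical.

```python
import math


def cochran_armitage(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x k table of genotype counts.

    cases, controls: counts per genotype category (same order as scores).
    Returns the chi-squared statistic (1 degree of freedom) and its p value.
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_t_cases = sum(t * r for t, r in zip(scores, cases))
    sum_t_total = sum(t * m for t, m in zip(scores, totals))
    sum_t2_total = sum(t * t * m for t, m in zip(scores, totals))
    # Trend statistic and the score-variance term of the genotype distribution.
    trend = n * sum_t_cases - n_cases * sum_t_total
    spread = n * sum_t2_total - sum_t_total ** 2
    chi2 = n * trend ** 2 / (n_cases * n_controls * spread)
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts for genotypes with 0, 1 and 2 copies of the minor allele.
chi2, p_value = cochran_armitage(cases=(10, 20, 30), controls=(30, 20, 10))
print(chi2, p_value)
```

The scores encode the assumed genetic model; other choices, such as (0, 1, 1) for a dominant model, would change the statistic.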

r².

(Squared correlation coefficient). For linkage disequilibrium, it provides a measure of the strength (but, being a squared quantity, not the direction) of a linear relationship between the genotypes of two variants, each expressed as a number of minor alleles.
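A minimal Python sketch, assuming unphased genotypes coded as minor-allele counts (0, 1 or 2), computes this quantity as the squared Pearson correlation between two genotype vectors. The genotype vectors are invented and the function name is hypothetical; haplotype-based estimates of LD used in practice can differ from this genotype-based value.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors (minor-allele counts)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least two subjects")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a monomorphic variant has no defined correlation")
    return sxy * sxy / (sxx * syy)


# Made-up genotypes for eight subjects.
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 1, 2, 1, 0, 2, 0, 1]      # mostly, but not perfectly, concordant with snp_a
snp_a_flipped = [2 - g for g in snp_a]  # same variant coded by the other allele

print(genotype_r2(snp_a, snp_b))
print(genotype_r2(snp_a, snp_a_flipped))
```

Recoding a variant by its other allele reverses the sign of the correlation but leaves the squared value unchanged, so a perfect proxy gives a value of 1 under either coding.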

Proxy.

A highly correlated DNA variant that is an adequate substitute in an association study.

Detection probability.

For a two-stage design, this is the probability that a disease-associated SNP will have a p value among the lowest ranks of p values at stage 1 and, among those SNPs selected at stage 1, that a disease-associated SNP will also have a p value among the lowest ranks of p values at stage 2.

Hardy–Weinberg equilibrium.

A theoretical description of the relationship between genotype and allele frequencies that is based on an expectation in a stable population undergoing random mating in the absence of selection, new mutations and gene flow. Under these conditions, and in the absence of linkage disequilibrium, the genotype frequencies are equal to the product of the allele frequencies.
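A minimal Python sketch of a Hardy–Weinberg check for a biallelic marker follows, assuming a Pearson chi-squared test with one degree of freedom; exact tests are often preferred when genotype counts are small. The counts and the function name are hypothetical.

```python
import math


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-squared test of Hardy-Weinberg proportions for one biallelic marker.

    n_aa, n_ab, n_bb: observed counts of the three genotypes.
    Returns expected counts, the chi-squared statistic and a 1-df p value.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return expected, chi2, p_value


# Made-up counts: the first set matches Hardy-Weinberg proportions, the second has too few heterozygotes.
print(hwe_chi2(360, 480, 160))
print(hwe_chi2(400, 400, 200))
```

A deficit of heterozygotes, as in the second set of counts, is one pattern that can point to genotyping error or population stratification.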

Imputation accuracy.

This describes the different ways to treat missing genotypes in a data set. Imputed genotypes with less than a pre-specified accuracy can be considered missing or genotypes can be weighted in the calculations on the basis of the estimated imputation accuracy.

Population stratification.

The situation that arises when a population contains several subpopulations that differ in their genetic characteristics.

Frequentist.

A statistical approach for assessing whether a hypothesis is correct or an alternative should be adopted.

Markov chain Monte Carlo.

An iterative computational approach for identifying the most likely model among many possible models.

Phasing.

The determination of the haplotype phase (the arrangement of alleles at two loci on homologous chromosomes) from genotype data using statistical methods.

Winner’s curse.

The inflation of effect sizes compared with the true effect size for associations that are discovered on the basis of passing specific statistical significance or other selection thresholds.
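The winner's curse can be made concrete with a small simulation. The Python sketch below assumes that many studies estimate the same true effect with normally distributed error and that only estimates passing a one-sided significance threshold are 'discovered'; all parameter values are invented and the function name is hypothetical.

```python
import random
import statistics


def simulate_winners_curse(true_effect=0.1, se=0.1, z_threshold=1.96,
                           n_studies=10000, seed=1):
    """Compare the mean of all estimates with the mean of 'significant' ones.

    Each simulated study draws an estimate from Normal(true_effect, se).
    A study is selected when estimate / se exceeds z_threshold.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    selected = [b for b in estimates if b / se > z_threshold]
    return statistics.mean(estimates), statistics.mean(selected), len(selected)


mean_all, mean_selected, n_selected = simulate_winners_curse()
print(f"mean of all estimates:      {mean_all:.3f}")
print(f"mean of selected estimates: {mean_selected:.3f} (n = {n_selected})")
```

Because every selected estimate must exceed the threshold, the selected mean is necessarily larger than the true effect in this setting, whereas the mean over all studies stays close to it.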

I².

A metric of between-study heterogeneity taking values between 0 and 100%, which describes how much of the observed variability in effects between studies is beyond chance.

Fixed effects model.

A set of methods for combining data that assumes there is a common effect in all data sets and that observed effects only differ by chance.

Random effects model.

A set of methods for combining data that assumes that genetic effects are different across different populations.
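The three preceding entries (I², fixed effects and random effects) can be illustrated together. The Python sketch below assumes inverse-variance weighting and the DerSimonian–Laird estimate of between-study variance, which is one common choice among several; the effect estimates (for example, log odds ratios) and standard errors are invented, and the function name is hypothetical.

```python
import math


def meta_analysis(effects, ses):
    """Inverse-variance fixed effects and DerSimonian-Laird random effects summaries."""
    weights = [1.0 / se ** 2 for se in ses]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed effects summary.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = 100.0 * (q - df) / q if q > df else 0.0
    # Between-study variance (tau squared), truncated at zero.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    re_weights = [1.0 / (se ** 2 + tau2) for se in ses]
    sum_re = sum(re_weights)
    random_ = sum(w * y for w, y in zip(re_weights, effects)) / sum_re
    return {
        "fixed": fixed, "se_fixed": math.sqrt(1.0 / sum_w),
        "random": random_, "se_random": math.sqrt(1.0 / sum_re),
        "Q": q, "I2": i2, "tau2": tau2,
    }


# Made-up log odds ratios from three data sets with equal standard errors.
result = meta_analysis(effects=[0.1, 0.3, 0.5], ses=[0.1, 0.1, 0.1])
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With these inputs the two models give the same summary effect, but the random effects standard error is about twice as large, reflecting the heterogeneity that the fixed effects model attributes to chance.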

Phenotype misclassification.

This describes the situation in which cases are classified as controls or controls are classified as cases for binary outcomes. The equivalent problem for continuous traits is measurement error.

Nested case–control.

A design in which cases and controls are sampled from a pre-existing larger cohort.

Convenience sample.

A sample of controls or of cases with a trait of interest that is available for another purpose and has not been collected for the purpose of the specific research project or with an explicit sampling scheme.

Principal components analysis.

A statistical method used to simplify data sets by transforming a series of correlated variables into a smaller number of uncorrelated factors.

Copy number variant.

A class of DNA sequence variants (including deletions and duplications) that lead to a departure from the expected diploid representation of DNA sequence.

Recombination hot spot.

A small (usually one to a few kilobases) chromosomal region in which the frequency of meiotic recombination is much higher than average. Hot spots of recombination can be recognized by observing that all pairs of SNPs that encompass the region have a low D′ value.
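
For readers unfamiliar with D′, the sketch below computes it for a pair of biallelic SNPs using the standard normalisation of the disequilibrium coefficient D by its maximum attainable value. The function name, argument names and example frequencies are illustrative assumptions, and the absolute value is returned, as is common when reporting D′.

```python
def d_prime(p_ab, p_a, p_b):
    """|D'| for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("D' is undefined when a locus is monomorphic")
    return abs(d) / d_max


# No historical recombination between the SNPs: alleles A and B always travel together.
complete_ld = d_prime(p_ab=0.5, p_a=0.5, p_b=0.5)
# Alleles combine at random, as expected for SNPs on either side of a hot spot.
no_ld = d_prime(p_ab=0.25, p_a=0.5, p_b=0.5)
print(complete_ld, no_ld)
```

The first pair gives a value of 1 and the second a value of 0; SNP pairs that span a hot spot would be expected to fall towards the lower end of this range.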

Gene desert.

A stretch of the genome that contains no known protein-coding gene.

Expression quantitative trait locus.

A locus at which genetic allelic variation is associated with variation in gene expression.

Bayes factor.

The ratio of the prior probabilities of the null hypothesis compared with the alternative hypotheses over the ratio of the posterior probabilities. This can be interpreted as the relative odds that the hypothesis is true before and after examining the data.
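
Written as an equation, the definition above reads as follows. The notation is an assumption introduced here for illustration: H_0 is the null hypothesis, H_1 the alternative and D the observed data.

```latex
\[
\mathrm{BF}
  = \frac{P(H_0)\,/\,P(H_1)}{P(H_0 \mid D)\,/\,P(H_1 \mid D)}
  = \frac{P(D \mid H_1)}{P(D \mid H_0)}
\]
```

The second equality follows from Bayes' theorem, because the posterior odds equal the prior odds multiplied by the likelihood ratio. Under this convention, values greater than 1 indicate that the data favour the alternative: for example, if the odds in favour of the null are 1,000:1 before seeing the data and 10:1 afterwards, the Bayes factor is 100.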

Regression model.

A model that evaluates the association between one or multiple variables with an outcome of interest.

Overfitting.

In a regression model, the tendency to obtain better fit to the available data than to other independent data.

Bayesian method.

Any approach that uses a combination of prior beliefs and observed data to generate posterior beliefs.

Endophenotype.

A physiological or other trait that is related to a disease trait and is measured independently of the disease.

Acknowledgements

Scientific support for this project was provided through the Tufts Clinical and Translational Science Institute (Tufts CTSI) under funding from the National Institutes of Health/National Center for Research Resources (UL1 RR025752). Points of view or opinions in this paper are those of the authors and do not necessarily represent the official position or policies of the Tufts CTSI.

RESOURCES