Proceedings of the National Academy of Sciences of the United States of America. 2009 Feb 6;106(10):3871–3876. doi: 10.1073/pnas.0812824106

Power of deep, all-exon resequencing for discovery of human trait genes

Gregory V Kryukov a, Alexander Shpunt a,b, John A Stamatoyannopoulos c, Shamil R Sunyaev a,1
PMCID: PMC2656172  PMID: 19202052

Abstract

The ability to sequence cost-effectively all of the coding regions of a given individual genome is rapidly approaching, with the potential for whole-genome resequencing not far behind. Initiatives are currently underway to phenotype hundreds of thousands of individuals for major human traits. Here, we determine the power for de novo discovery of genes related to human traits by resequencing all human exons in a clinical population. We analyze the potential of the gene discovery strategy that combines multiple rare variants from the same gene and treats genes, rather than individual alleles, as the units for the association test. By using computer simulations based on deep resequencing data for the European population, we show that genes meaningfully affecting a human trait can be identified in an unbiased fashion, although large sample sizes would be required to achieve substantial power.

Keywords: association studies, polymorphism, rare variants, sequencing


Whole-genome association studies based on genotyping have recently demonstrated potential for identifying SNPs and haplotypes associated with a range of common clinical phenotypes (1–3). However, only a small fraction of observed phenotypic variation is currently attributable to identified allelic variants. Association studies are fundamentally limited by previously known genetic variation, featuring predominantly high-frequency SNPs. By contrast, deep resequencing has the potential to reveal a vast trove of low-frequency alleles. Low-cost, high-throughput sequencing technologies hold the potential to propel discovery of gene–phenotype associations incorporating low-frequency allelic variation on a large scale.

Although knowledge of all variants segregating in the population would seem to increase the power of genetic analysis, this prospect faces daunting statistical challenges, because an expanding pool of variants requires more stringent multiple testing correction, whereas the power to detect association with low-frequency variants is reduced. This problem may be surmounted by pooling allelic variants in a single candidate gene (4–6) or pathway (7–9). However, if most variation in a gene or pathway is neutral, this pooling strategy will not provide a sufficient signal-to-noise ratio (10). To enrich variation in functionally significant alleles, the analysis should be limited to nonsynonymous coding variation as one obvious functional class. Site-directed mutagenesis and comparative genomics have shown that a large fraction of de novo missense mutations are of functional significance (11–14). Consequently, many mildly deleterious coding variants are expected to be segregating in the human population at low allele frequencies, as was originally proposed by Tomoko Ohta in the “nearly neutral theory” of molecular evolution (15). Indeed, a statistically significant excess of combined rare missense variation in individuals at phenotypic extremes was detected in candidate gene studies for several phenotypes (4–6).

Results

Simulation of Resequencing Studies.

Although success with highly targeted candidate gene resequencing has been reported (4–6), the potential for unbiased discovery of new gene–phenotype associations by resequencing large numbers of genes in large numbers of individuals has not been considered. It is therefore unclear whether clinical populations of realistic size would provide sufficient statistical power for new gene discovery by using this approach. We address these questions through simulation of resequencing studies (Fig. 1A). We focus on quantitative rather than qualitative phenotypes (Fig. 1B). However, our strategy involves comparison of two groups of individuals at phenotypic extremes and, therefore, can be extended to dichotomous traits with specified penetrance. Quantitative traits allow for additional flexibility in selecting the most informative individuals. Our simulations make no assumptions about the existence of specific causal variants with specified allele frequencies. Instead, they rely on the influx of new mutations and the resulting collective effect of low-frequency hypomorphic alleles.

Fig. 1.

Simulated resequencing study. (A) Design of simulated resequencing study. Color marks represent various new mutations discovered in sequenced individuals. (B) Modeling the effect of mutations on phenotype. Distribution of the quantitative trait (QT) for noncarriers is shown in gray. Distribution of the same quantitative trait for individuals that carry at least one moderately deleterious mutation is shown in red. The observed distribution of QT in the whole population is the sum of these distributions. Individuals from the phenotypic extremes, marked in blue, are subjects for resequencing.

Assessing the feasibility of identifying human trait genes from “all-exon” sequence data is tantamount to determining the potential for detecting the impact of cumulative variation within an individual gene on a phenotype at a high level of statistical significance (2.5 × 10⁻⁶, equivalent to a level of P < 0.05 following Bonferroni correction for 20,000 genes). We combine multiple rare variants from the same gene and treat genes rather than individual alleles as the units for the association test.
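The genome-wide threshold quoted above is simply the family-wise level divided by the number of genes tested. A minimal sketch of that arithmetic follows; the function name is illustrative and does not come from the paper.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Family-wise level of 0.05 spread over 20,000 gene-level tests.
threshold = bonferroni_threshold(0.05, 20_000)
print(f"{threshold:.1e}")  # 2.5e-06
```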

We based our computational model of coding human genetic variation on existing population sequencing data and extrapolated this model to even larger population samples. We analyzed the deepest systematic resequencing dataset currently available, comprising 58 genes (exons plus flanking intronic and intergenic regions) that were resequenced in 757 individuals of European ancestry (8). The present analysis is focused exclusively on individuals of European ancestry because of the lack of deep resequencing data from other populations.

The feasibility of gene mapping by resequencing coding regions depends on mutation rate, population demographic history, selection coefficients, and phenotypic effects associated with new missense mutations. We assumed a mutation rate of 1.8 × 10⁻⁸ per generation (16), and used the conventional four-parametric model of the history of the European population with long-term constant size followed by a bottleneck and then by an exponential expansion (17).

Demographic Model.

To estimate parameters of the demographic model, we computed a likelihood function for the observed site-frequency spectrum of synonymous and noncoding SNPs by using diffusion approximation of the Wright–Fisher model. We used the infinite-number-of-sites model and obtained a time-dependent solution for the population with the variable effective size (18, 19) (see Materials and Methods). We verified this analytical approach using forward simulations [supporting information (SI) Appendix, Fig. S1].

The likelihood has a single well-defined maximum in the space of the four parameters of the model (Fig. 2, SI Appendix, Fig. S3). The largest uncertainty was observed in the bottleneck population size due to sparse data on high-frequency SNPs in the resequencing dataset that we used.

Fig. 2.

Two-dimensional sections of the likelihood surface for the demographic model that was fitted to the systematic resequencing data. In the population history model, a long-term constant population size is followed by a bottleneck and subsequent exponential population growth. The model has four parameters and is limited to the European population: N1, ancestral population size; Nb, bottleneck population size; Nf, final population size; π, time of the population expansion since the bottleneck.

Our demographic model reproduces the observed site-frequency spectrum well (Fig. 3A). Under this model, the spectrum of neutral genetic variation in European populations is explained best by population growth that started approximately 7,500–9,000 years ago (average, 20–25 years per generation), coinciding with the spread of agriculture in Europe (20). As expected, our analysis of deep resequencing data resulted in a larger estimate of current effective population size (900,000) compared with demographic models constructed by using smaller population samples (17, 21–25). Explicit modeling of exponential population growth enabled us to extrapolate more accurately and thus simulate outcomes of resequencing projects involving significantly larger numbers of individuals.

Fig. 3.

Agreement of experimental allele frequency spectra with the modeled spectra for (A) neutral SNPs and (B) missense SNPs.

The deep resequencing dataset used in our study is not informative about the ancient population bottleneck, a well-known feature of the demographic history of Europeans. This is not surprising because almost all variants in this dataset have low population frequency. According to the demographic model, essentially all rare variants originated by mutations in the recent population growth phase, whereas most frequent SNPs originated by mutations prior to the growth phase (SI Appendix, Fig. S4A).

To ensure that this is not an artifact of our method, we applied it to the SeattleSNPs sequencing dataset, which covers as many as 320 genes in as few as 23 individuals of European origin. The inferred population history is in general agreement with previous coalescence-based analyses and reproduces a deep ancient bottleneck (21–25). However, the SeattleSNPs dataset is not informative about the recent population growth, i.e., different growth rates have comparable likelihoods. Combined analysis of the two datasets (maximizing the product of the two likelihoods) results in a demographic history that includes both a deep ancient bottleneck and recent growth. This demographic history model generates numbers of rare SNPs highly similar to the model based solely on the deep sequencing dataset (SI Appendix, Fig. S4B). This model also predicts that the overwhelming majority of rare variants originated in the recent population growth phase.

It is likely that modeling the demographic history of Europeans by using datasets rich in both rare and frequent variants would require more complex demographic models. However, because almost all rare variants originated in the population growth phase, the recent history mostly determines results of our study. The parameters of our model corresponding to the recent history were estimated with a high degree of confidence by using the deep resequencing dataset. Therefore, we present here results obtained by using this demographic model.

Distribution of Selection Coefficients.

As a next step, we added a distribution of selection coefficients for new missense mutations to the model and estimated this distribution from the site-frequency spectrum of nonsynonymous SNPs by using forward simulations. We modeled the distribution of selection coefficients by a gamma distribution with parameters estimated by maximum likelihood. The simulated site-frequency spectrum agrees well with the observed spectrum for nonsynonymous SNPs (Fig. 3B). All suboptimal distributions consistent with the data have the maximum probability mass in the range of selection coefficients between 0.0006 and 0.004 (Fig. 4), which is in good agreement with multiple recent studies (11–14). There is a consensus that the majority of new missense mutations are moderately deleterious both in humans (11–14) and flies (26).

Fig. 4.

Distribution of selection coefficients for de novo missense mutations. The distribution was modeled by a gamma distribution and fitted to deep resequencing data by the maximum likelihood method. Mutations with selection coefficient < 10⁻⁴ were assumed to have no effect on quantitative phenotype in our model and are shown in green. Mutations assumed to be functional are shown in orange.

We note that this analysis assumed complete neutrality of synonymous sites. Incorporating weak negative selection acting on synonymous sites would result in slightly higher estimates of selection coefficients for nonsynonymous mutations and in lower estimates of the recent population growth rate.

Simulation of Phenotypic Variation.

To simulate population phenotypic variation, we first simulated nonsynonymous genetic variation in the average human protein coding gene (∼500 aa) by using the estimated demographic history and distribution of selection coefficients. To simulate the corresponding quantitative trait (QT) variation, we used a simple model of genotype–phenotype relationship. We assumed that QT is distributed normally and individuals carrying at least one functional nonsynonymous allele have QT values distributed with a shifted mean, but with the same variance (Fig. 1B). In our simulation we subdivided new nonsynonymous mutations into functional (i.e., affecting QT) and nonfunctional classes by using two different approaches. First, we assumed all missense mutations with selection coefficient > 10⁻⁴ to be functional. This assumption implies neither that a given mutation is under selection due to its effect on the QT of interest, nor that the QT of interest is under selection at all. The simple reasoning behind this assumption is that if an amino acid change is visible to purifying selection through its effect on at least some traits, it affects protein molecular function.
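The genotype–phenotype model described above (normal QT, shifted mean for carriers, same variance) can be sketched in a few lines. This is an illustrative simplification, not the simulation code used in the study: `carrier_freq`, the fraction of individuals carrying at least one functional allele, is a hypothetical input here, whereas in the paper it arises from the simulated demographic history and selection coefficients.

```python
import random


def simulate_qt(n_individuals, carrier_freq, shift_sd, seed=0):
    """Return a list of (is_carrier, qt) pairs.

    Noncarriers draw QT from N(0, 1); carriers draw QT from N(shift_sd, 1),
    i.e. the same variance with the mean shifted by shift_sd standard deviations.
    carrier_freq is a hypothetical stand-in for the simulated variant spectrum.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_individuals):
        is_carrier = rng.random() < carrier_freq
        mean = shift_sd if is_carrier else 0.0
        population.append((is_carrier, rng.gauss(mean, 1.0)))
    return population


# Example with assumed values: 5% carriers and a 0.5 sigma shift.
population = simulate_qt(20_000, carrier_freq=0.05, shift_sd=0.5, seed=1)
```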

In addition, to analyze the possibility that the effect of mutations on the QT of interest is independent of the integrated strength of natural selection (either direct or pleiotropic), we randomly assigned mutations to functional and nonfunctional classes irrespective of the selection coefficients associated with them.

We further assumed that all functional missense mutations in a given gene bias QT in one direction. This assumption is based on the fact that the vast majority of de novo amino acid mutations with a measurable effect reduce protein activity. Gain-of-function mutations are much less frequent and are restricted to specific residues or domains. Although very important in a number of rare, Mendelian syndromes, gain-of-function mutations do not represent a sizable fraction of human deleterious variation, and thus can be disregarded in our analysis. Loss-of-function mutations in different genes can bias QT in different directions (depending on the molecular function and position in biological networks), but loss-of-function mutations in the same gene should bias QT in the same direction.

The exact distribution of effects of new missense mutations on QTs is unknown, but on average, a de novo missense mutation in a gene highly relevant to a given phenotype should have a phenotypic effect smaller than that of new complete loss-of-function mutations but larger than that of segregating nonsynonymous SNPs.

De novo mutations associated with dominant syndromes caused by haploinsufficiency are known to cause shifts in QT means exceeding one standard deviation (σ). For example, loss-of-function mutations in NSD1 causing Sotos syndrome increase mean occipitofrontal head circumference by >2σ (27); mutations in JAG1 causing Alagille syndrome decrease mean height by 2σ (28); dominant-negative mutations in FBN1 producing Marfan syndrome reduce mean bone mineral density by 1.5σ (29); and LDL receptor mutations causing hypercholesterolemia increase mean LDL-bound cholesterol level by >4σ (30). However, a segregating nonsynonymous SNP A390P in CETP is associated with 0.4σ decrease in plasma HDL-C levels (31). We therefore concentrated on values of 0.25σ and 0.5σ for new missense mutations but also considered a wide range of QT means shifts (SI Appendix, Fig. S5A). In the context of a trait such as human height, 0.25σ would correspond to a change in mean height of 0.5 inch.

Simulation of Resequencing of Individuals from Phenotypic Extremes.

Next, we simulated results of resequencing of individuals with extreme QT values. Simulated individuals were sorted according to their QT values and gene sequence information was extracted for individuals from lower and upper percentiles. Although it is easy to incorporate sequencing errors in the model, in this work we assumed ideal quality of sequence data.
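Selecting individuals from the two phenotypic extremes amounts to sorting by QT and taking the lowest and highest fractions. The sketch below assumes each individual is represented as an `(is_carrier, qt)` pair; that representation and the function name are illustrative choices, not details from the paper.

```python
def phenotypic_extremes(population, tail_fraction):
    """Split off the lower and upper tails of a population sorted by QT.

    population: list of (is_carrier, qt) pairs (an assumed representation).
    tail_fraction: fraction of individuals taken from each extreme, e.g. 0.05.
    """
    n_tail = round(len(population) * tail_fraction)
    if n_tail == 0:
        raise ValueError("tail_fraction is too small for this population size")
    ranked = sorted(population, key=lambda individual: individual[1])
    return ranked[:n_tail], ranked[-n_tail:]


# 100,000 phenotyped individuals with 5% tails give 5,000 individuals per extreme.
toy = [(False, float(i)) for i in range(100_000)]
lower, upper = phenotypic_extremes(toy, 0.05)
print(len(lower), len(upper))  # 5000 5000
```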

Simulation of Detection of QT Affecting Genes.

Finally, we tested whether the lower and upper tails of the trait distribution were significantly (at the genome-wide level) different in the total amount of rare nonsynonymous variants. Current candidate gene-based studies focus on variants observed exclusively in one of the tails. With increasing sample size, this approach will have decreased power because it would ignore variants observed predominantly, albeit not exclusively, in one of the extremes (Fig. 5). We used two alternative approaches: one in which only rare variants were analyzed and the other in which information on common variants was also included to further improve power (32).

Fig. 5.

Excess of missense variants at one of the phenotypic extremes. Data are shown for simulation of 5,000 individuals sequenced at each of two 5% phenotypic extremes. Sums of the alleles (cumulative frequencies of pooled variants) were averaged over 10,000 simulations. Shift of quantitative trait median in mutation carriers is assumed to be equal to 0.5σ.

In the “rare variants only” approach, SNPs with an observed frequency above 5% in either of the two tails were considered to be common and were excluded from further analysis. We then calculated the cumulative frequency of rare variants separately at each phenotypic extreme by summing counts of rare SNPs. We calculated the statistical significance of the departure of the rare-variant distribution between the two phenotypic tails from a 50%/50% ratio by using the χ² test. If the obtained P-value was < 0.05 after Bonferroni correction for 20,000 genes, the association between the gene and the phenotype was counted as identified.
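A test of pooled rare-variant counts against a 50%/50% split can be written as a one-degree-of-freedom χ² goodness-of-fit test. The sketch below is one plausible reading of the procedure, not the authors' implementation; the counts in the example are hypothetical. It uses the identity that the upper-tail probability of a χ² variable with 1 degree of freedom equals erfc(√(χ²/2)).

```python
import math


def rare_variant_chi2(count_upper, count_lower):
    """Chi-square test of two pooled allele counts against an even split.

    Returns (chi2, p_value) for 1 degree of freedom.
    """
    total = count_upper + count_lower
    if total == 0:
        raise ValueError("no rare variants observed in either tail")
    expected = total / 2
    chi2 = ((count_upper - expected) ** 2 + (count_lower - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, 1 df
    return chi2, p_value


# Hypothetical pooled counts of rare missense alleles in the two tails.
chi2, p_value = rare_variant_chi2(120, 60)
genome_wide = p_value < 0.05 / 20_000  # Bonferroni correction for 20,000 genes
```

In this hypothetical case χ² is 20, and the gene is declared identified only if the P-value clears the genome-wide threshold.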

To incorporate information about common alleles, we compared contingency tables of allele counts between the two phenotypic extremes, considering all SNPs with minimal observed frequency above 5% independently and the remaining rare SNPs grouped into one category with all allele counts summed. Dissimilarity between the distributions of allele counts in the two phenotypic extremes was detected by using the χ² test.
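The contingency-table variant can be sketched as a Pearson χ² statistic on a 2 × k table, with one column per common SNP and one column for all rare SNPs pooled. The details below are assumptions for illustration: a SNP is treated as common when its frequency exceeds the cutoff in either tail (the rule stated for the rare-variants-only approach), and only the statistic and its degrees of freedom are returned.

```python
def tail_contingency_chi2(snp_counts, chromosomes_per_tail, common_cutoff=0.05):
    """Pearson chi-square statistic comparing allele counts between two tails.

    snp_counts: dict mapping SNP id -> (count in upper tail, count in lower tail).
    Common SNPs each get their own column; rare SNPs are summed into one column.
    Returns (chi2, degrees_of_freedom).
    """
    columns = []
    rare_upper = rare_lower = 0
    for upper, lower in snp_counts.values():
        if max(upper, lower) / chromosomes_per_tail > common_cutoff:
            columns.append((upper, lower))
        else:
            rare_upper += upper
            rare_lower += lower
    if rare_upper + rare_lower > 0:
        columns.append((rare_upper, rare_lower))

    row_upper = sum(upper for upper, _ in columns)
    row_lower = sum(lower for _, lower in columns)
    if row_upper == 0 or row_lower == 0:
        raise ValueError("each tail needs at least one observed allele")
    total = row_upper + row_lower

    chi2 = 0.0
    for upper, lower in columns:
        column_total = upper + lower
        for observed, row_total in ((upper, row_upper), (lower, row_lower)):
            expected = row_total * column_total / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2, len(columns) - 1


# Hypothetical counts: one common SNP and two rare SNPs, 1,000 chromosomes per tail.
chi2, dof = tail_contingency_chi2({"a": (200, 100), "b": (3, 1), "c": (5, 1)}, 1000)
```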

Estimation of Number of Required Individuals and Power of Resequencing Studies.

For each set of parameters the simulation described above was run 10,000 times. Power of the test was calculated as the fraction of simulations in which enrichment of rare variants in the proper phenotypic tail was detected. As shown in Table 1, achieving substantial power for individual gene discovery by using this study design would require large population samples. Sequencing of 10,000 individuals (with 5,000 from the upper and 5,000 from the lower fifth percentiles) would require 100,000 phenotyped individuals and would provide 77% and 40% power for genes with the effect of new missense mutations of 0.5σ and 0.25σ, respectively. The power estimates were 78% and 49% for the above parameters, under the model that assumed disassociation between a mutation's effect on phenotype and the strength of purifying selection acting on the mutation.
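Power defined as the fraction of simulations in which enrichment is detected in the proper tail can be illustrated with a toy Monte Carlo loop. This sketch is far simpler than the study's simulations: it models each individual only as a carrier or noncarrier with a hypothetical `carrier_freq`, rather than simulating the variant spectrum from demography and selection, so its output should not be compared with Table 1.

```python
import math
import random


def estimate_power(n_phenotyped, tail_fraction, carrier_freq, shift_sd,
                   n_sims, alpha, seed=0):
    """Fraction of simulated studies with significant carrier excess in the upper tail."""
    rng = random.Random(seed)
    n_tail = round(n_phenotyped * tail_fraction)
    detected = 0
    for _ in range(n_sims):
        population = []
        for _ in range(n_phenotyped):
            is_carrier = rng.random() < carrier_freq
            qt = rng.gauss(shift_sd if is_carrier else 0.0, 1.0)
            population.append((qt, is_carrier))
        population.sort()
        lower = sum(carrier for _, carrier in population[:n_tail])
        upper = sum(carrier for _, carrier in population[-n_tail:])
        if lower + upper == 0:
            continue
        chi2 = (upper - lower) ** 2 / (upper + lower)  # test against a 50/50 split
        p_value = math.erfc(math.sqrt(chi2 / 2))       # chi-square survival, 1 df
        if p_value < alpha and upper > lower:          # enrichment in the proper tail
            detected += 1
    return detected / n_sims


# Small, fast example with assumed values (not the parameters of Table 1).
power = estimate_power(2_000, 0.1, carrier_freq=0.05, shift_sd=1.0,
                       n_sims=20, alpha=0.05, seed=1)
```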

Table 1.

Estimated power of gene mapping by complete resequencing

Effect of functional mutations     No. of sequenced   No. of phenotyped individuals
(in fractions of standard          individuals        12,500   25,000   50,000   100,000   200,000
deviation)
0.25σ                              5,000              0.11     0.18     0.24     –         –
                                   10,000             –        0.24     0.31     0.40      –
                                   20,000             –        –        0.38     0.51      0.59
0.5σ                               5,000              0.36     0.47     0.57     –         –
                                   10,000             –        0.56     0.69     0.77      –
                                   20,000             –        –        0.76     0.84      0.88

Incorporating frequent variants in addition to rare variants increases the power by 3%. It should be noted, however, that our analysis was focused on the utility of rare, deleterious SNPs in identification of genes affecting human traits. Our simulations do not incorporate balancing selection, selection reversal, antagonistic pleiotropy, or any other evolutionary forces that might lead to high-frequency functional variants. Correspondingly, the real gain of power from analysis of common variants in addition to pooled rare variants might be significantly higher than these estimates.

Sampling of suboptimal demographic histories and distributions of selection coefficients (SI Appendix, Fig. S6 and Table S1) suggests that our power estimates are within the range of 20% to 70%. Dependencies of the power estimates on other model parameters are shown in SI Appendix, Fig. S5.

A much smaller sample size of 1,000 individuals would have >75% power to detect effect sizes of 2σ. If pathways rather than individual genes were considered as units of the association test, a sample of 1,000 would be sufficient to achieve >60% power (assuming 20 genes per pathway and an effect size of 0.5σ).

Although 100,000 individuals is substantially larger than any currently existing well-phenotyped clinical population, efforts to aggregate populations on this scale are underway and will be greatly facilitated by electronic medical records. Notably, the total number of unique individuals in published systematic retrospective clinical studies currently available for genetic analysis substantially exceeds 100,000. Genome-wide association studies have already been conducted on >15,000 individuals (2). Also, numerous trait genes may in fact be uncovered with far shallower levels of resequencing.

Meta-Analysis of MC4R Gene Resequencing Studies.

To illustrate the feasibility of this approach, we performed a meta-analysis of four resequencing datasets of the MC4R gene, mutations in which are known to be strongly overrepresented among severely obese individuals (8, 33–35) (SI Appendix, Table S2). Forty-three individuals with rare missense variants were found among 2,940 obese individuals (99th BMI percentile), whereas only 3 individuals with rare missense variants were detected in 1,925 lean individuals (<15th BMI percentile). With the described approach, MC4R would be readily detected with a P-value < 5 × 10⁻⁷, achieving genome-wide statistical significance. As such, this gene could have been discovered by resequencing a sample of < 5,000 individuals.

Discussion

Importantly, our simulations do not assume that protein-coding variation is solely responsible for phenotypic variation. In contrast to previous studies (36–38), the goal of these simulations was not to predict the allelic spectrum of human disease, but rather to probe the utility of rare missense SNPs in discovery of genes influencing human traits. The considered study design is similar in concept to a genetic screen utilizing naturally occurring mutations. Even if a large fraction of trait variance were explained by noncoding cis-regulatory variation (e.g., causing changes in expression levels of genes associated with phenotypes), hypomorphic coding mutations arising in the same genes would lead to some level of functional polymorphism detectable by resequencing of these genes in a sufficiently large population. Accordingly, resequencing of only coding regions should be sufficient to identify genes affecting traits of interest, but to explain population variation in phenotype, resequencing of noncoding regions might be necessary. We note, however, that extending the very same type of analysis to highly conserved noncoding regions is unlikely to be successful because the effect of mutations in these regions is probably much weaker than the effect of coding mutations. Deletions of ultraconserved regions were not shown to lead to any detectable phenotypes (39), and population genetics studies suggest that conservation of many of the conserved regions is maintained by weak selective pressure (40–42).

We conclude that a simple study design combining the pending availability of both low-cost–high-volume sequencing technology and large well-phenotyped population samples has the potential to facilitate discovery of genes relevant to human phenotypes.

Materials and Methods

Demographic Model.

Overall, the demographic model included four parameters: (i) long-term ancestral effective population size; (ii) bottleneck population size; (iii) duration of exponential growth in generations; and (iv) recent effective population size. In the analysis, the experimental site-frequency spectrum was represented by synonymous and noncoding variants combined. Allelic spectra of these categories of SNPs were similar, and pooling them increased the amount of data. Although we used data on 757 sequenced individuals (1,514 chromosomes), base calls failed in some cases, so we approximated the number of sequenced chromosomes as 1,400. The combined number of sequenced neutral nucleotide sites was approximated as 63,000 (see SI Appendix, Methods).
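The four-parameter history can be written as a population-size function of time. The sketch below makes two assumptions that the text does not spell out: the bottleneck is an instantaneous drop to the bottleneck size at the start of the growth phase, and growth is exponential from the bottleneck size to the recent size. The recent size of 900,000 comes from the Results; the 360 generations correspond to 9,000 years at 25 years per generation; the ancestral and bottleneck sizes in the example are hypothetical.

```python
import math


def population_size(generations_ago, n_ancestral, n_bottleneck, n_final,
                    growth_generations):
    """Effective population size at a given number of generations before present.

    Assumes constant ancestral size, an instantaneous bottleneck, then exponential
    growth from n_bottleneck to n_final over growth_generations generations.
    """
    if generations_ago > growth_generations:
        return float(n_ancestral)
    growth_rate = math.log(n_final / n_bottleneck) / growth_generations
    elapsed = growth_generations - generations_ago
    return n_bottleneck * math.exp(growth_rate * elapsed)


# n_ancestral and n_bottleneck are placeholder values for illustration only.
for g in (1000, 360, 180, 0):
    print(g, round(population_size(g, 10_000, 5_000, 900_000, 360)))
```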

Neutral Wright–Fisher Model for Variable Population Size.

Let ϕ(q,t|p) be the probability density that the allele frequency in the t-th generation is between q and q + dq (0 < q < 1), given its starting frequency p. It has been shown (43–47) that ϕ(q,t|p) satisfies the forward Kolmogorov equation:

$$\frac{\partial \phi(q,t\mid p)}{\partial t} = \frac{1}{4N_t}\,\frac{\partial^2}{\partial q^2}\Bigl[q(1-q)\,\phi(q,t\mid p)\Bigr]$$

For a constant population size, $N_t = N_0$, the transient solution is:

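$$\phi(q,t\mid p) = \sum_{i=0}^{\infty} \frac{(2i+3)\,(1-r^2)}{(i+1)(i+2)}\; C_i^{3/2}(r)\, C_i^{3/2}(z)\; e^{-(i+1)(i+2)\,t/(4N_0)}, \qquad r = 1-2p,\quad z = 1-2q$$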

where $C_i^{3/2}$ is the Gegenbauer polynomial with λ = 3/2. For a general time-dependent population size $N_t$, we can obtain ϕ(q,t|p) by defining the effective time $t'$, such that $dt' = (N_0/N_t)\,dt$. Written in terms of the effective time $t'$, the forward Kolmogorov equation becomes

$$\frac{\partial \phi}{\partial t'} = \frac{1}{4N_0}\,\frac{\partial^2}{\partial q^2}\Bigl[q(1-q)\,\phi\Bigr]$$

Consequently, the solution at time π is

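$$\phi(q,\pi\mid p) = \phi_{N_0}\bigl(q,\,t'(\pi)\mid p\bigr), \qquad t'(\pi) = \int_0^{\pi} \frac{N_0}{N_t}\,dt$$

where $\phi_{N_0}$ denotes the constant-size transient solution.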

For $N_t = N_0\,e^{\gamma t}$,

$$t'(\pi) = \int_0^{\pi} e^{-\gamma t}\,dt = \frac{1 - e^{-\gamma\pi}}{\gamma}$$
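The effective-time rescaling dt′ = (N0/Nt)dt can be checked numerically for exponential growth, where integrating N0/Nt = e^(−γt) over the elapsed time gives (1 − e^(−γ·elapsed))/γ. The sketch below compares that closed form with a midpoint-rule integral; the function names and the example values of the elapsed time and γ are illustrative assumptions.

```python
import math


def effective_time_closed_form(elapsed, gamma):
    """Effective time for N_t = N_0 * exp(gamma * t): (1 - exp(-gamma * elapsed)) / gamma."""
    if gamma == 0:
        return float(elapsed)  # constant size: effective time equals real time
    return -math.expm1(-gamma * elapsed) / gamma


def effective_time_numeric(elapsed, gamma, steps=10_000):
    """Midpoint-rule integral of N_0 / N_t = exp(-gamma * t) from 0 to elapsed."""
    dt = elapsed / steps
    return sum(math.exp(-gamma * (k + 0.5) * dt) for k in range(steps)) * dt


# Hypothetical values: 360 generations of growth at rate 0.0144 per generation.
closed = effective_time_closed_form(360, 0.0144)
numeric = effective_time_numeric(360, 0.0144)
print(abs(closed - numeric) < 1e-4)  # True
```

Under growth the effective time is much shorter than the elapsed time, which reflects weaker drift in a larger population.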

Site-Frequency Spectrum at Present.

The expected site-frequency spectrum f(q), produced by the above demographic model in today's population, is given by the sum of the contribution of ancestral mutations that originated in the stationary phase prior to the bottleneck and the contribution of recent mutations occurring in the population with the cumulative rate $2N_t\mu$ per generation t.

graphic file with name zpq99909-6661-m06.jpg

Likelihood.

In a sample of $N_s$ chromosomes, the number of segregating sites observed with polymorphic multiplicity $i$ is given by

$$F_i = \int_0^1 \binom{N_s}{i}\, q^{i}\,(1-q)^{N_s-i}\, f(q)\,dq$$

In the absence of additional information about which allele is ancestral, what is actually recorded is the minority multiplicity, i.e., $F_j^{(b)} = 63{,}000\,(F_j + F_{N_s-j})$ for $j = 1,\ldots,N_s/2 - 1$. The number of monomorphic sites is given by $F_0 = 63{,}000 - \sum_{j=1}^{N_s/2} F_j^{(b)}$. The probability that a SNP will be found $j$ times out of $N_s$ is then $P_j = F_j^{(b)}/63{,}000$.
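The folding step described above can be sketched directly from the stated relations. The input is a list of per-site expected fractions for each derived-allele count 0..Ns. One detail is an assumption: the class j = Ns/2, which the text does not define explicitly, is counted once rather than doubled. The toy numbers in the example are hypothetical.

```python
def fold_spectrum(unfolded, n_sites):
    """Fold a per-site unfolded spectrum into minor-allele counts.

    unfolded[i]: expected per-site fraction with the derived allele seen i times,
                 for i = 0..N_s (N_s assumed even).
    Returns (folded counts by j, number of monomorphic sites, probabilities P_j).
    """
    n_s = len(unfolded) - 1
    if n_s < 2 or n_s % 2:
        raise ValueError("expects an even number of chromosomes, at least 2")
    half = n_s // 2
    folded = {j: n_sites * (unfolded[j] + unfolded[n_s - j]) for j in range(1, half)}
    folded[half] = n_sites * unfolded[half]  # assumption: middle class counted once
    monomorphic = n_sites - sum(folded.values())
    probabilities = {j: count / n_sites for j, count in folded.items()}
    return folded, monomorphic, probabilities


# Toy spectrum for 4 chromosomes and 1,000 sites (the paper uses 63,000 sites).
folded, monomorphic, probabilities = fold_spectrum([0.9, 0.04, 0.03, 0.02, 0.01], 1000)
```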

Assuming independence between SNPs, the likelihood is given by

graphic file with name zpq99909-6661-m08.jpg

Note that because the likelihood includes monomorphic sites, maximizing it maximizes match in the observed units and not merely the distribution form. The 4D likelihood surface was studied extensively and shown to possess a single maximum. Its 2D projections are shown in Fig. 2.

Estimation of Selection Coefficient Distributions.

The distribution of selection coefficients was modeled as a gamma distribution with the addition of completely neutral sites (selection coefficient, s = 0) and sites under “lethally” strong selection (s = 1). Intermediate categories of sites with selection coefficients in the range of 10⁻⁵ to 10⁻¹ were modeled by a gamma distribution:

$$f(s) = \frac{s^{\,k-1}\, e^{-s/\theta}}{\Gamma(k)\,\theta^{k}}$$

where Γ(k) is the gamma function. The continuous gamma distribution was transformed to a 20-bin histogram by numerical integration within corresponding bin boundaries uniformly spaced on the logarithmic scale between 10⁻⁵ and 10⁻¹. We determined parameters of the distribution by using an exhaustive grid search. Fractions of neutral and invariable sites were allocated in 5% minimal “blocks,” while the remaining 20 bins described by the gamma distribution were normalized so that the sum of all bins was equal to 1. All possible values from 0% to 100% were tested for values of extreme bins. Values of the θ and k parameters were taken from the logarithmic grid: θ from 10⁻⁵ to 10⁻¹ with 10 step; k from 10⁻² to 10 with 10 step.
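The discretization of the gamma density into 20 logarithmically spaced bins can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the density uses the shape k and scale θ parameterization, integration is a midpoint rule in log space, and the values of `k`, `theta`, `neutral_fraction` and `lethal_fraction` in the example are hypothetical.

```python
import math


def gamma_pdf(s, k, theta):
    """Gamma density with shape k and scale theta."""
    return s ** (k - 1) * math.exp(-s / theta) / (math.gamma(k) * theta ** k)


def discretize_gamma(k, theta, neutral_fraction, lethal_fraction,
                     low=1e-5, high=1e-1, n_bins=20, steps=200):
    """Integrate the gamma density over log-spaced bins and rescale the bins.

    The bin masses are scaled to sum to 1 - neutral_fraction - lethal_fraction,
    so that neutral sites, lethal sites and the 20 bins together sum to 1.
    """
    edges = [low * (high / low) ** (i / n_bins) for i in range(n_bins + 1)]
    masses = []
    for left, right in zip(edges, edges[1:]):
        u0 = math.log(left)
        du = (math.log(right) - u0) / steps
        mass = 0.0
        for m in range(steps):
            s = math.exp(u0 + (m + 0.5) * du)
            mass += gamma_pdf(s, k, theta) * s * du  # ds = s du in log space
        masses.append(mass)
    scale = (1.0 - neutral_fraction - lethal_fraction) / sum(masses)
    return edges, [mass * scale for mass in masses]


edges, bins = discretize_gamma(k=0.5, theta=1e-3,
                               neutral_fraction=0.20, lethal_fraction=0.05)
```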

Allele frequency spectra were computed by using forward simulations with a predefined demographic history and a chosen distribution of selection coefficients. The log-likelihood, LL, for each model was computed as:

graphic file with name zpq99909-6661-m10.jpg

where $n_i^{\mathrm{exp}}$ is the number of nonsynonymous SNPs in which the minor allele was observed $i$ times in the experimental resequencing dataset, $n_i^{\mathrm{simul}}$ is the number of SNPs with simulated minor allele count equal to $i$, $n_{\mathrm{total}}$ is the total number of simulated “missense” sites (20,000), and $M$ is the number of sequenced chromosomes.

Acknowledgments.

We thank N. Ahituv and L. Pennacchio for sharing resequencing data, and A. Kondrashov, I. Kohane, and G. Church for helpful discussions. This work was supported by National Institutes of Health Grants GM078598 and MH084676 (to S.R.S.), R01GM071852 (to J.A.S.) and National Institutes of Health roadmap initiative grant U54LM008748.

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0812824106/DCSupplemental.

References

  • 1.Couzin J, Kaiser J. Genome-wide association. Closing the net on common disease genes. Science. 2007;316:820–822. doi: 10.1126/science.316.5826.820. [DOI] [PubMed] [Google Scholar]
  • 2.Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Frayling TM. Genome-wide association studies provide new insights into type 2 diabetes aetiology. Nat Rev Genet. 2007;8:657–662. doi: 10.1038/nrg2178.
  • 4.Cohen JC, et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science. 2004;305:869–872. doi: 10.1126/science.1099870.
  • 5.Kotowski IK, et al. A spectrum of PCSK9 alleles contributes to plasma levels of low-density lipoprotein cholesterol. Am J Hum Genet. 2006;78:410–422. doi: 10.1086/500615.
  • 6.Romeo S, et al. Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat Genet. 2007;39:513–516. doi: 10.1038/ng1984.
  • 7.Riley BM, et al. Impaired FGF signaling contributes to cleft lip and palate. Proc Natl Acad Sci USA. 2007;104:4512–4517. doi: 10.1073/pnas.0607956104.
  • 8.Ahituv N, et al. Medical sequencing at the extremes of human body mass. Am J Hum Genet. 2007;80:779–791. doi: 10.1086/513471.
  • 9.Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutat Res. 2007;615:28–56. doi: 10.1016/j.mrfmmm.2006.09.003.
  • 10.Hirschhorn JN, Altshuler D. Once and again-issues surrounding replication in genetic association studies. J Clin Endocrinol Metab. 2002;87:4438–4441. doi: 10.1210/jc.2002-021329.
  • 11.Eyre-Walker A, Keightley PD. The distribution of fitness effects of new mutations. Nat Rev Genet. 2007;8:610–618. doi: 10.1038/nrg2146.
  • 12.Kryukov GV, Pennacchio LA, Sunyaev SR. Most rare missense alleles are deleterious in humans: Implications for complex disease and association studies. Am J Hum Genet. 2007;80:727–739. doi: 10.1086/513473.
  • 13.Yampolsky LY, Kondrashov FA, Kondrashov AS. Distribution of the strength of selection against amino acid replacements in human proteins. Hum Mol Genet. 2005;14:3191–3201. doi: 10.1093/hmg/ddi350.
  • 14.Eyre-Walker A, Woolfit M, Phelps T. The distribution of fitness effects of new deleterious amino acid mutations in humans. Genetics. 2006;173:891–900. doi: 10.1534/genetics.106.057570.
  • 15.Ohta T. Slightly deleterious mutant substitutions in evolution. Nature. 1973;246:96–98. doi: 10.1038/246096a0.
  • 16.Kondrashov AS. Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases. Hum Mutat. 2003;21:12–27. doi: 10.1002/humu.10147.
  • 17.Adams AM, Hudson RR. Maximum-likelihood estimation of demographic parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms. Genetics. 2004;168:1699–1712. doi: 10.1534/genetics.104.030171.
  • 18.Evans SN, Shvets Y, Slatkin M. Non-equilibrium theory of the allele frequency spectrum. Theor Popul Biol. 2007;71:109–119. doi: 10.1016/j.tpb.2006.06.005.
  • 19.Griffiths RC. The frequency spectrum of a mutation, and its age, in a general diffusion model. Theor Popul Biol. 2003;64:241–251. doi: 10.1016/s0040-5809(03)00075-3.
  • 20.Ammerman AJ, Cavalli-Sforza LL. The Neolithic Transition and the Genetics of Populations in Europe. Princeton, NJ: Princeton Univ Press; 1984.
  • 21.Marth GT, Czabarka E, Murvai J, Sherry ST. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics. 2004;166:351–372. doi: 10.1534/genetics.166.1.351.
  • 22.Williamson SH, et al. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Natl Acad Sci USA. 2005;102:7882–7887. doi: 10.1073/pnas.0502300102.
  • 23.Schaffner SF, et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 2005;15:1576–1583. doi: 10.1101/gr.3709305.
  • 24.Voight BF, et al. Interrogating multiple aspects of variation in a full resequencing data set to infer human population size changes. Proc Natl Acad Sci USA. 2005;102:18508–18513. doi: 10.1073/pnas.0507325102.
  • 25.Keinan A, Mullikin JC, Patterson N, Reich D. Measurement of the human allele frequency spectrum demonstrates greater genetic drift in East Asians than in Europeans. Nat Genet. 2007;39:1251–1255. doi: 10.1038/ng2116.
  • 26.Loewe L, Charlesworth B. Inferring the distribution of mutational effects on fitness in Drosophila. Biol Lett. 2006;2:426–430. doi: 10.1098/rsbl.2006.0481.
  • 27.Cole TR, Hughes HE. Sotos syndrome: A study of the diagnostic criteria and natural history. J Med Genet. 1994;31:20–32. doi: 10.1136/jmg.31.1.20.
  • 28.Rovner AJ, et al. Rethinking growth failure in Alagille syndrome: The role of dietary intake and steatorrhea. J Pediatr Gastroenterol Nutr. 2002;35:495–502. doi: 10.1097/00005176-200210000-00007.
  • 29.Moura B, et al. Bone mineral density in Marfan syndrome. A large case-control study. Joint Bone Spine. 2006;73:733–735. doi: 10.1016/j.jbspin.2006.01.026.
  • 30.Brown MS, Goldstein JL. A receptor-mediated pathway for cholesterol homeostasis. Science. 1986;232:34–47. doi: 10.1126/science.3513311.
  • 31.Spirin V, et al. Common single-nucleotide polymorphisms act in concert to affect plasma levels of high-density lipoprotein cholesterol. Am J Hum Genet. 2007;81:1298–1303. doi: 10.1086/522497.
  • 32.Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. Am J Hum Genet. 2008;83:311–321. doi: 10.1016/j.ajhg.2008.06.024.
  • 33.Hinney A, et al. Prevalence, spectrum, and functional characterization of melanocortin-4 receptor gene mutations in a representative population-based sample and obese adults from Germany. J Clin Endocrinol Metab. 2006;91:1761–1769. doi: 10.1210/jc.2005-2056.
  • 34.Larsen LH, et al. Prevalence of mutations and functional analyses of melanocortin 4 receptor variants identified among 750 men with juvenile-onset obesity. J Clin Endocrinol Metab. 2005;90:219–224. doi: 10.1210/jc.2004-0497.
  • 35.Hinney A, et al. Melanocortin-4 receptor gene: Case-control study and transmission disequilibrium test confirm that functionally relevant mutations are compatible with a major gene effect for extreme obesity. J Clin Endocrinol Metab. 2003;88:4258–4267. doi: 10.1210/jc.2003-030233.
  • 36.Pritchard JK. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet. 2001;69:124–137. doi: 10.1086/321272.
  • 37.Reich DE, Lander ES. On the allelic spectrum of human disease. Trends Genet. 2001;17:502–510. doi: 10.1016/s0168-9525(01)02410-6.
  • 38.Pritchard JK, Cox NJ. The allelic architecture of human disease genes: Common disease-common variant...or not? Hum Mol Genet. 2002;11:2417–2423. doi: 10.1093/hmg/11.20.2417.
  • 39.Ahituv N, et al. Deletion of ultraconserved elements yields viable mice. PLoS Biol. 2007;5:e234. doi: 10.1371/journal.pbio.0050234.
  • 40.Kryukov GV, Schmidt S, Sunyaev S. Small fitness effect of mutations in highly conserved non-coding regions. Hum Mol Genet. 2005;14:2221–2229. doi: 10.1093/hmg/ddi226.
  • 41.Keightley PD, Kryukov GV, Sunyaev S, Halligan DL, Gaffney DJ. Evolutionary constraints in conserved nongenic sequences of mammals. Genome Res. 2005;15:1373–1378. doi: 10.1101/gr.3942005.
  • 42.Chen CT, Wang JC, Cohen BA. The strength of selection on ultraconserved elements in the human genome. Am J Hum Genet. 2007;80:692–704. doi: 10.1086/513149.
  • 43.Ewens W. Mathematical Population Genetics I. Theoretical Introduction. Heidelberg: Springer; 2004.
  • 44.Kimura M. Solution of a process of random genetic drift with a continuous model. Proc Natl Acad Sci USA. 1955;41:144–150. doi: 10.1073/pnas.41.3.144.
  • 45.Kimura M. Diffusion models in population genetics. J Appl Prob. 1964;1:177–232.
  • 46.Joyce P, Tavaré S. The distribution of rare alleles. J Math Biol. 1995;33:602–618. doi: 10.1007/BF00298645.
  • 47.Myers S, Fefferman C, Patterson N. Can one learn history from the allelic spectrum? Theor Popul Biol. 2008;73:342–348. doi: 10.1016/j.tpb.2008.01.001.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences