Abstract
Comparative genomics based on sequenced referenced genomes is essential to hypothesis generation and testing within population genetics. However, selection of candidate regions for further study on the basis of elevated or depressed divergence between species leads to a divergence-based ascertainment bias in the site frequency spectrum within selected candidate loci. Here, a method to correct this problem is developed that obtains maximum-likelihood estimates of the unascertained allele frequency distribution using numerical optimization. I show how divergence-based ascertainment may mimic the effects of natural selection and offer correction formulae for performing proper estimation into the strength of selection in candidate regions in a maximum-likelihood setting.
Introduction
The recent explosion in genome sequencing has proven to be an invaluable resource to evolutionary genetics. For the first time investigators are able to explore hypotheses about evolutionary processes that shape DNA at the level of whole genomes. In particular those genomic regions that have evolved either very slowly (e.g. [1]) or very quickly (e.g. [2]) among species are of considerable interest, as the evolution of these regions should be dominated by the deterministic forces of natural selection [3].
The population genetics approach of comparing patterns of genetic divergence among species to patterns of polymorphism within species provides a powerful tool towards unraveling the myriad of forces at work in genetic evolution. A typical approach is as follows: 1) pick candidate regions for study based on levels of divergence between reference genome sequences, and then 2) attempt to infer causes of evolution based on polymorphism data from those candidate regions. An example of the approach is the work of Drake et al. [4], wherein the authors attempted to understand whether conserved non-coding sequences (CNCs) in the human genome were selectively constrained or mutational cold-spots. As the allele frequency distribution from CNCs was shifted towards rare alleles relative to non-conserved sequences, the authors concluded that selection against deleterious mutations was occurring in CNCs, and thus such sequences were likely to be functional and not in fact mutationally cold.
The problem with this approach is that selection of candidate regions based on patterns of divergence between species creates an ascertainment bias with rather large effects on the expected site frequency spectrum (Fig. 1). Intuitively, this can be understood by considering that polymorphisms can often be confused with fixed differences between species when single, or few, sequence comparisons are used to measure divergence. In the case of genomic regions selected for conservation, divergence-based ascertainment biases against intermediate or high frequency derived alleles, as such alleles would likely show up in between species sequence comparisons as fixed differences, causing candidate regions to be rejected. Conversely, genomic regions identified as rapidly evolving, are expected to be enriched for high and intermediate frequency polymorphisms, as these are more likely to contribute to the signal of divergence (see [5]). It is also worth noting that this skew in the frequency spectrum will lead to an increase or decrease in the expected number of segregating sites recovered from divergence ascertained loci.
Analysis
As the divergence time between species increases, the expected ratio of fixed differences to segregating sites in a sample also increases. One should then expect the strength of a divergence-based ascertainment bias to decrease with increasing genetic distance between species used in the ascertainment phase. This decay in divergence-based ascertainment bias is shown in Figure 2, where coalescent simulations are performed as in Figure 1 but species divergence times vary. Tajima's D [6] is used here as a one dimensional proxy for the site frequency spectrum. Even at species divergence times as great as 100 (in units of 4Ne generations) a substantial divergence-based ascertainment bias is apparent.
To correct for this divergence-based ascertainment, we wish to obtain an estimate of the unascertained site frequency spectrum. Following the general framework of Nielsen et al. [7], let pj be the frequency of segregating sites with derived allele frequency j in a sample of n chromosomes that has undergone no ascertainment bias (1<j≤n−1). With a set of observed counts of allele frequencies in hand we aim to find the maximum-likelihood estimate of the site frequency spectrum, . We assume that the presence or absence of the derived version of each segregating site in the original divergence comparison is known. The likelihood function for the site frequency spectrum, is then
(1) |
where S is the number of segregating sites in the sample, Xi is the derived allele frequency at the ith site including those in the reference genome used to ascertain the site, and Asci is the notation for the ascertainment of a segregating site in the original divergence comparison. It is clear that since ascertainment depends only on how many ancestral and derived actually occur at site i (Nielsen et al. [7]). Thus the crucial quantity for correcting divergence-based ascertainment is the probability of our ascertainment condition, which can be incorporated at a site by site basis to correct for arbitrary levels of divergence (see Figure 3). I assume that divergence ascertainment was performed in a subsample, size d, of our final sample, size n, thus the probability of ascertainment is
In most cases, for example that of Drake et al. [4], d = 1, corresponding to ascertaining divergence from a single genomic reference sequence. If d>1, divergence is assumed to be measured by the number of fixed differences (see [8] for a cogent treatment of this statistic).
With this probability in hand, one can then estimate via numerical optimization of the likelihood function (Equation 1). In the present case optimization was performed by using a version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [9] however any standard optimization scheme will work. Figure 3 shows the results of maximum-likelihood estimation of the site frequency spectrum from the simulated data presented in Figure 1 (source code to perform this correction is available upon request). As can be noted visually, very accurate estimates of the unascertained allele frequency distribution are recovered from the divergence ascertained data. It is important to note that unlike in Nielsen et al. [7] where the authors aim to find the “true” site frequency spectrum, we already know the site frequency spectrum of our region(s) and instead we aim to correct for poor ascertainment in which polymorphisms were mistaken for fixed differences in deciding which regions of the genome to study.
Often estimates of evolutionary parameters are of central interest to the investigator. In particular the strength of selection acting on a region may be an important quantity to examine (e.g. [10]). The framework presented here is easily incorporated into such estimation. One can describe the probability of a derived allele segregating in a sample at frequency Xi as a function of its selection coefficient , and thus to estimate the strength of selection acting on a locus with observed site frequency spectrum via maximum-likelihood (see [11]–[13]). Generically, the likelihood function takes the form
(2) |
(see [12], [13] for ). Using Bayes rule, and noting that the probability of divergence ascertainment is conditionally independent of α given Xi = xi, i.e. , we can then write down the divergence ascertainment corrected version of Equation 1:
where the denominator is simply the normalization factor
The divergence ascertainment corrected likelihood function can then be written as
(3) |
Figure 4 shows the effect of divergence-based ascertainment on ML estimates from simulated loci selected for depressed levels of divergence. The mean from uncorrected estimates (i.e. Equation 2) is α̂ = −1.34, indicating evidence of weak negative selection, even though these data have been generated under a standard neutral model. MLEs of α using the divergence-based ascertainment corrected likelihood function (Equation 3), restores the expected value to approximately zero (mean α̂ = 0.001). This ascertainment corrected version of the likelihood function is also useful in a Bayesian setting, for an example see Katzman et al. [14], where it was used in a Bayesian Hierarchical model for estimating distributions of selection coefficients from divergence ascertained data.
One concern with our ascertainment correction is that we might mask the “true” signature of selection acting at a locus. Figure 5 shows the effect of our correction on ML estimates of selection coefficients from loci simulated under a deleterious alleles model (i.e. negative selection; see figure caption for details) that again have been selected on the basis of depressed divergence. Of particular concern is the case of weak negative selection, thus simulations were performed where new alleles were assigned a negative selection coefficient with the strength of selection α = −5.0. Under this model, both the uncorrected and the corrected estimates were relatively close to the true selection coefficient (uncorrected mean α̂ = −5.36, corrected mean α̂ = −4.91), however the ascertainment correction does seem to improve estimates. Evidence of selection when estimates have been corrected for divergence ascertainment can thus be considered conservative.
Conclusions
Reference genome sequences are an invaluable resource, however their utility in identifying candidate regions for further study presents the pitfall of divergence-based ascertainment biases in population genetics investigation. Divergence-based ascertainment bias calls into question the validity of earlier studies of the site frequency spectrum from regions which have been selected based on their evolutionary rate between species but have ignored this source of error. In this study I offer simple corrections for those biases which may be used to estimate the unascertained site frequency spectrum of divergence ascertained data as well as for estimation of evolutionary parameters when the sequence used for ascertainment is included in the population sample. In particular, accounting for ascertainment when estimating selection coefficients is imperative when loci have been selected on the basis of divergence between species.
Acknowledgments
I thank D. Haussler, S. Katzman, C. Jones, M. McPeek, and M. Hahn for their discussion and comments on this manuscript.
Footnotes
Competing Interests: The author has declared that no competing interests exist.
Funding: This study was funded by an NIH/NHGRI fellowship and Dartmouth College. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1.Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, et al. Ultraconserved elements in the human genome. Science. 2004;304:1321–1325. doi: 10.1126/science.1098119. [DOI] [PubMed] [Google Scholar]
- 2.Pollard KS, Salama SR, King B, Kern AD, Dreszer T, et al. Forces shaping the fastest evolving regions in the human genome. PLoS Genet. 2006;2:1553–7404. doi: 10.1371/journal.pgen.0020168. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Gillespie JH. The causes of molecular evolution. New York: Oxford University Press; 1991. [Google Scholar]
- 4.Drake JA, Bird C, Nemesh J, Thomas DJ, Newton-Cheh C, et al. Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat Genet. 2005;38:223–227. doi: 10.1038/ng1710. [DOI] [PubMed] [Google Scholar]
- 5.Nielsen R, Bustamante CD, Clark AG, Glanowski S, Sackton TB, et al. A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol. 2005;3:1545–7885. doi: 10.1371/journal.pbio.0030170. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123:585–595. doi: 10.1093/genetics/123.3.585. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Nielsen R, Hubisz MJ, Clark AG. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics. 2004;168:2373–82. doi: 10.1534/genetics.104.031039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Hey J. The structure of genealogies and the distribution of fixed differences between DNA sequence samples from natural populations. Genetics. 1991;128:831–40. doi: 10.1093/genetics/128.4.831. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes in C. Cambridge University Press; 1992. [Google Scholar]
- 10.Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Hubisz MT, et al. Natural selection on protein-coding genes in the human genome. Nature. 2005;437:1153–1157. doi: 10.1038/nature04240. [DOI] [PubMed] [Google Scholar]
- 11.Sawyer SA, Hartl DL. Population genetics of polymorphism and divergence. Genetics. 1992;132:1161–1176. doi: 10.1093/genetics/132.4.1161. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Bustamante CD, Wakeley J, Sawyer SA, Hartl DL. Directional selection and the site-frequency spectrum. Genetics. 2001;159:1779–1788. doi: 10.1093/genetics/159.4.1779. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, et al. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Natl Acad Sci U S A. 2005;102:7882–7887. doi: 10.1073/pnas.0502300102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Katzman S, Kern AD, Bejerano G, Fewell G, Fulton L, et al. Human ultraconserved elements are ultraselected. Science. 2007;317:915. doi: 10.1126/science.1142430. [DOI] [PubMed] [Google Scholar]
- 15.Gillespie JH. Junk ain't what junk does: neutral alleles in a selected context. Gene. 1997;205:291–299. doi: 10.1016/s0378-1119(97)00470-8. [DOI] [PubMed] [Google Scholar]
- 16.Kern AD, Jones CD, Begun DJ. Molecular population genetics of male accessory gland proteins in the Drosophila simulans complex. Genetics. 2004;167:725–735. doi: 10.1534/genetics.103.020883. [DOI] [PMC free article] [PubMed] [Google Scholar]