PLOS Genetics. 2009 Oct 2;5(10):e1000674. doi: 10.1371/journal.pgen.1000674

Inference of Microbial Recombination Rates from Metagenomic Data

Philip L F Johnson 1,*, Montgomery Slatkin 2
Editor: Bruce Walsh3
PMCID: PMC2745702  PMID: 19798447

Abstract

Metagenomic sequencing projects from environments dominated by a small number of species produce genome-wide population samples. We present a two-site composite likelihood estimator of the scaled recombination rate, ρ = 2N_e c, that operates on metagenomic assemblies in which each sequenced fragment derives from a different individual. This new estimator properly accounts for sequencing error, as quantified by per-base quality scores, and missing data, as inferred from the placement of reads in a metagenomic assembly. We apply our estimator to data from a sludge metagenome project to demonstrate how this method will elucidate the rates of exchange of genetic material in natural microbial populations. Surprisingly, for a fixed amount of sequencing, this estimator has lower variance than similar methods that operate on more traditional population genetic samples of comparable size. In addition, we can infer variation in recombination rate across the genome because metagenomic projects sample genetic diversity genome-wide, not just at particular loci. The method itself makes no assumption specific to microbial populations, opening the door for application to any mixed population sample where the number of individuals sampled is much greater than the number of fragments sequenced.

Author Summary

At a broad scale, the exchange of genetic material through homologous recombination (i.e. what happens in animals during sex) increases the potential rate of adaptation. Bacteria often reproduce clonally, without recombination, by making exact copies of their genomes, but they also have mechanisms analogous to sex that allow them to recombine sporadically. Despite microbes' critical role at the base of our world's ecosystem, microbiologists know surprisingly little about how microbes grow and evolve outside the laboratory. Metagenomic sequencing projects provide a means to sample the genetic diversity of natural microbial populations and have the potential to reveal much about the ecology and evolution of these populations. Here we present a novel method to estimate the recombination rate from metagenomic data, while explicitly allowing for imperfections such as sequencing error and missing data.

Introduction

Microbial populations exchange homologous genetic material at different rates, dramatically affecting the evolutionary potential of the population. While basal mutation rates can be estimated via long-term within-laboratory evolution experiments [1], recombination rates are more difficult to infer because they require identification of multiple alleles at multiple loci in multiple individuals. Further, biogeographic barriers and interspecies interactions may lead to qualitatively different effects than growth in axenic laboratory culture, making determination of recombination rates in an organism's natural environment critical to accurate interpretation [2]. For the purpose of this study, we ignore the mechanism behind homologous recombination (i.e. transformation, transduction, or conjugation) and focus on its effect on genetic diversity.

Much research has investigated human recombination hotspots [3], yet almost nothing is known about variation in microbial recombination rates within a genome. In specific instances, however, studies have experimentally identified sequence motifs associated with recombination hotspots in some species of bacteria and yeast [4]. Mounting evidence suggests that regions known as CRISPR (Clusters of Regularly Interspaced Short Palindromic Repeats) form the basis of a bacterial immune system against phage in which chunks of the phage genome are inserted into the CRISPR region [5]. Thus a reasonable hypothesis would be that these regions or other regions with similar effect might recombine with greater frequency than the rest of the genome.

Inference of a genome-wide, fine-scale recombination map requires both extensive genome-wide sampling of the genetic diversity within the population of interest as well as an appropriate population genetic model, neither of which has been previously available for microbial populations. Microbial population surveys have primarily sequenced a small number of loci (“multi-locus sequence typing”) [6], which yield no information about variation in local recombination rate. Current methods tailored to microbial populations rely on low-power summary statistics [7],[8], heuristics instead of explicitly modeling the source of the recombining fragments [9], or parsimony based on manual inspection [10]. A few studies (e.g. [2],[11]) applied a more rigorous likelihood-based approach using a population genetic model ([12]; discussed more below), but these were still able to estimate only a genome-wide average rate of recombination.

Recently, large-scale metagenomic sequencing projects have begun to generate genome-wide population samples by sequencing random reads from a pool of DNA extracted from all microorganisms in a given environment. Projects that sample environments dominated by only a few microbial “species” are able to assemble near-complete genomes [13],[14], in which the constituent reads contain information about the genetic diversity in the population. Considering the large number of individuals in the sampled community relative to the number of reads sequenced, each read derives almost certainly from a different individual microorganism. With average read depths as high as ten [14], the resulting data hold rich potential for population genetic analysis [15],[16].

Given these data, we can make inferences about parameters such as mutation rate and recombination rate. In population genetic theory, the per-generation mutation rate, μ, and per-generation recombination rate, c, almost always appear in conjunction with the effective population size, N_e, as the parameters θ = 2N_e μ and ρ = 2N_e c. In our microbial context, we assume a single recombination event leads to the replacement of a short tract of sequence, creating two recombination breakpoints. A full likelihood method would yield maximal power by calculating the probability of observing the entire pattern of polymorphism across all samples, given the parameters θ and ρ. In practice, however, this approach is extremely computationally intensive [17], and even a recent breakthrough using a Markov chain Monte Carlo technique only extends full-likelihood to input data containing fewer than Inline graphic SNPs [18]. Instead, we follow the lead of previous researchers who sacrificed power for greater practicality by using a composite likelihood method [12],[19],[20] that approximates the true likelihood, as detailed in the Methods section.

However, metagenomic population samples differ from traditional population samples and, as a result, provide new challenges to estimating recombination. First, the sample size varies according to the read depth at a given location instead of being fixed across all loci. Second, the quality of each base call varies along each read, and the random nature of the metagenomic method prevents independent replication of the sampling and sequencing steps to confirm observed polymorphisms. Finally, linkage information is greatly reduced in that instead of the traditional approach of sampling the same individual at all loci, each fragment of DNA derives from a different individual. Depending on the sequencing technology and whether reads were sequenced in pairs, these data will reveal, at most, linkage within two reads of Inline graphic nucleotides that are separated by a distance generally less than 40 kilobases.

As high-throughput sequencing becomes ever cheaper, the number of projects producing this sort of data will only increase. The Human Microbiome Project (http://www.hmpdacc.org/) plans to perform metagenomic sequencing of microbes found at five sites around the body. A particularly intriguing future application will be to sequence mixtures of pathogens sampled from within a single infected human. These data, combined with the methods presented here, will allow inferences about the interplay between the immune response and recombination within pathogens.

Methods

We start by deriving our two-locus composite likelihood estimator based on the idea of Hudson [19] and the estimator of McVean et al. [12] but now allowing for realistic amounts of missing data and sequencing error. Sequencing error probabilities are taken as given in the form of per-base quality scores. The resulting likelihood calculation becomes computationally infeasible on metagenomic-scale data, so we further describe several numerical approximations that make our implementation tractable. Finally we define a statistic to quantify the amount of missing data. This statistic will aid analysis and discussion of our estimator of ρ.

Composite-likelihood estimator

Our input data consist of a metagenomic assembly (i.e. alignment of reads to a scaffold), untrimmed FASTA sequences for the reads, quality scores for each base in each read and, if applicable, information about read pairs. We explicitly do not consider any uncertainty in either the assembly or in the quality scores for the practical reason that current assembly algorithms and base callers do not generate this information; however, in principle, our method could be extended to incorporate these sources of uncertainty. Given these data, we wish to estimate two population genetic parameters: θ and ρ.

Following [12], we assume that each site in the assembly has at most two different nucleotides and arbitrarily label these as zero and one. In the rare event that more than two distinct nucleotides are observed, we again arbitrarily label them zero and one after first grouping the nucleotides into two categories: the most common nucleotide and everything else. In the case of a tie for the most common nucleotide, we pick one at random. Given this labeling, we can represent the state of a read at a given position by 0, 1 or ?, where the question mark represents missing data. Analogously, we represent the state of a single chromosome at two positions simultaneously: 00, 01, 10, 11, 0?, 1?, ?0, ?1 (ignoring ??, since this conveys no information). An example is given in Figure 1 and described below. Note that, in a metagenomic context, “a single chromosome” means that both nucleotides are either on the same read or on two paired reads. We assume that the total number of sequenced reads is much less than the total number of cells in the sampled environment such that the probability of two independent (unpaired) reads deriving from the same original cell/chromosome is essentially zero.
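The labeling rule can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, `None` is used here to mark a read that does not cover the site, and assigning 0 to the most common nucleotide is an assumption (the text says only that the labels are arbitrary).

```python
import random
from collections import Counter

def label_site(column, rng=random):
    """Map the nucleotides observed at one site to '0', '1' or '?'.

    `column` holds one nucleotide per chromosome, or None where the
    chromosome does not cover the site. Assumption of this sketch: the
    most common nucleotide is labeled '0' and everything else '1';
    ties for the most common nucleotide are broken at random.
    """
    counts = Counter(b for b in column if b is not None)
    if not counts:
        return ['?'] * len(column)
    top = max(counts.values())
    major = rng.choice(sorted(b for b, c in counts.items() if c == top))
    return ['?' if b is None else ('0' if b == major else '1') for b in column]

def two_site_state(label_i, label_j):
    """Combine the labels of one chromosome at two sites, e.g. '0' + '?' -> '0?'."""
    return label_i + label_j
```

For example, a column `['A', 'A', 'C', None, 'G']` would be labeled `['0', '0', '1', '?', '1']`, with the two minority nucleotides grouped together.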

Figure 1. Cartoon metagenomic assembly.


Three chromosomes, each with paired-end reads (bold horizontal lines) separated by a gap (dashed line). Assembly contains three polymorphic sites (vertical bars), which create three pairs of polymorphic sites (arrows). Note that our method actually uses all pairs of sites, not just polymorphic ones.

First we outline our notation more formally. The assembly, Inline graphic, extends from position 1 to position Inline graphic and contains information about both the content of the reads and their position. The set of quality scores, Inline graphic, contains one quality score for each base in each read in the assembly. We assume Phred-calibrated quality scores [21], so any particular quality score, Inline graphic, can be converted into an error probability, Inline graphic, by means of the formula Inline graphic. The configuration for a pair of sites, Inline graphic (Inline graphic), is a vector of eight numbers corresponding to the number of chromosomes observed in each of the eight states (00, 01, etc.). For example, in Figure 1 the configuration of the leftmost pair of polymorphic sites is {Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic}. In addition to the configuration at pair Inline graphic, we also have the set of quality scores, Inline graphic (Inline graphic).
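As a small illustration of this notation, the sketch below converts a Phred quality score q into an error probability using the standard Phred definition, 10^(-q/10), and tallies a list of two-site states into the eight-element configuration vector. The function names and the string encoding of states are assumptions of this sketch, not part of the original implementation.

```python
from collections import Counter

# Order of the eight informative two-site states, as listed in the text.
STATES = ("00", "01", "10", "11", "0?", "1?", "?0", "?1")

def phred_to_error(q):
    """Standard Phred conversion: error probability = 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)

def configuration(chromosomes):
    """Count chromosomes in each of the eight two-site states.

    `chromosomes` is an iterable of two-character strings such as '01'
    or '?1'. Entries equal to '??' carry no information and are dropped.
    """
    counts = Counter(c for c in chromosomes if c != "??")
    return [counts.get(s, 0) for s in STATES]
```

A quality score of 20 therefore corresponds to an error probability of 0.01, and the chromosomes `["00", "01", "0?", "??", "00"]` give the configuration `[2, 1, 0, 0, 1, 0, 0, 0]`.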

We wish to calculate the likelihood of the observed data, Inline graphic, given the quality scores, Inline graphic, and the population genetic parameters of interest, Inline graphic and Inline graphic. We approximate the true likelihood with the composite likelihood:

[Equation (1): displayed as an image in the original; not recoverable from this text]

in which the two-locus configurations are treated as though they were independent among pairs of sites. We take the mutation rate (and thereby Inline graphic) to be constant and independent across all sites in the assembly, conditional on the genealogy. However, the recombination rate between two sites Inline graphic and Inline graphic depends on their distance apart, Inline graphic, as measured by the number of nucleotides separating them. We model recombination in microbial populations as occurring via gene conversion with recombination tract lengths drawn from an exponential distribution [12],[22],[23]:

[Equation (2): displayed as an image in the original; not recoverable from this text]

where Inline graphic is the average length of the recombination tract. Theoretically Inline graphic and Inline graphic might be identifiable, but in practice our data are insufficient to separate them. Instead we fix Inline graphic and estimate Inline graphic, similar to the approach taken by McVean et al. [12]. Minor misspecification of Inline graphic will simply rescale Inline graphic, although major misspecification of Inline graphic will also change the right-hand side of (2).

Now we turn to the likelihood of a single two-locus configuration. We first account for sequencing error by summing over all possibilities for the truth, Inline graphic:

[Equation (3): displayed as an image in the original; not recoverable from this text]

where the sum iterates over all Inline graphic elements of the set of possible two-locus configurations, Inline graphic, and Inline graphic is the average number of reads at each site. The first term inside the sum is the error probability, while the second term is the two-locus likelihood without any error. We assume that sequencing errors cause a switch from 0 to 1 and vice versa:

[Equation (4): displayed as an image in the original; not recoverable from this text]

where Inline graphic and the subscript Inline graphic indexes the same position in the same read in the quality scores, the observation, and the truth. In other words, all mismatches between the truth and observed must be the result of an error, while all matches between the truth and the observed cannot have been caused by an error.
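The error term described in words above can be sketched as follows: each observed base contributes its error probability if it differs from the assumed truth and one minus that probability if it matches. The function name, the string encoding, and the choice to skip missing ('?') positions are assumptions of this sketch.

```python
def error_probability(observed, truth, error_probs):
    """Probability of the observed 0/1 labels given the true labels.

    observed, truth: equal-length strings of '0', '1' (and '?' in
    `observed` for missing bases). error_probs: one error probability
    per position, e.g. obtained from Phred quality scores.
    Assumption of this sketch: a missing base contributes no factor.
    """
    if not (len(observed) == len(truth) == len(error_probs)):
        raise ValueError("inputs must have the same length")
    p = 1.0
    for o, t, eps in zip(observed, truth, error_probs):
        if o == '?':
            continue  # nothing was observed at this position
        # A mismatch must be an error; a match cannot be one.
        p *= eps if o != t else 1.0 - eps
    return p
```

For instance, observing `"01"` when the truth is `"00"` with error probabilities 0.01 and 0.1 gives 0.99 × 0.1 = 0.099.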

Next we account for missing data by summing over all possibilities for the unknown nucleotides in the complete configuration, Inline graphic:

[Equation (5): displayed as an image in the original; not recoverable from this text]

where the sum iterates over all elements of the set of configurations compatible with the observed data, Inline graphic (i.e. those that satisfy the constraints Inline graphic, etc.). The first term inside the sum accounts for missing data, while the second term is the pure two-locus likelihood. If we treat the configurations Inline graphic and Inline graphic as a specific ordering of chromosomes, then this first term has a binary value of 1 for all configurations Inline graphic that match Inline graphic at non-missing positions and 0 otherwise. As a result of our definition for the set Inline graphic, all configurations Inline graphic will match Inline graphic at non-missing positions, so the first term is always 1. We describe calculation of the second term in the next section below.
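A minimal sketch of this sum over missing data, treating the configuration as an ordered list of chromosomes: every '?' is expanded to both 0 and 1, and the complete-data likelihood is summed over the resulting configurations. The names below are hypothetical, and `complete_likelihood` stands in for whatever function or table supplies the two-locus likelihood without missing data.

```python
from itertools import product

def completions(chromosomes):
    """Yield every complete ordered configuration compatible with the data.

    `chromosomes` is a list of two-character states such as '0?' or '11'.
    Each '?' may be either '0' or '1'; observed positions are kept as is.
    """
    options = [
        [a + b
         for a in ("01" if c[0] == "?" else c[0])
         for b in ("01" if c[1] == "?" else c[1])]
        for c in chromosomes
    ]
    for combo in product(*options):
        yield list(combo)

def marginal_likelihood(chromosomes, complete_likelihood):
    """Sum a complete-data likelihood over all compatible completions."""
    return sum(complete_likelihood(c) for c in completions(chromosomes))
```

The number of completions doubles with each missing nucleotide, which is the source of the exponential cost discussed under "Complexity and approximations".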

We arrive at the final composite likelihood equation by taking (1) and substituting in (3), (4) and (5), which leaves us with four nested products and sums of significant size as discussed below.

Now we wish to find maximum likelihood estimates of our parameters. Joint maximization of θ and ρ is computationally impractical. Instead, we perform a two-step estimation procedure in which we first estimate θ from single sites using a previously developed method that correctly handles sequencing error [15] and then estimate ρ from pairs of sites by numerically maximizing (1) while holding θ fixed at this estimate.

Two-locus complete likelihoods without error

We pre-calculate and store the two-locus likelihoods for all possible complete two-locus configurations without error (i.e. the second term in (5)) for a single sample size, Inline graphic, across a range of Inline graphic values and a single fixed Inline graphic value. We generate this table of likelihoods by running a slightly modified version of the complete program from the LDhat package [12], which assumes a finite sites Jukes-Cantor style biallelic mutation model and uses the neutral coalescent-with-recombination importance sampling method of Fearnhead and Donnelly [24]. The original complete program computed likelihoods only for configurations in which both sites were observed to be polymorphic; our modification enables the calculation of likelihoods for configurations with one polymorphic site and one fixed site. We deduce the final probability of both sites being fixed by subtracting all other probabilities from 1.

Given this table for a fixed sample size Inline graphic and fixed θ, we can exactly infer an analogous table for smaller sample sizes and approximately infer a table for different values of θ.

A smaller sample size table can be directly generated for an arbitrary new sample size, Inline graphic; however, in the interests of clarity, we will describe how to generate a table when Inline graphic, which can be iterated. Let the vector Inline graphic denote a configuration of sample size Inline graphic. Assuming probabilities for ordered configurations (as generated by complete by default), the probability of this new configuration is the sum of the probabilities of Inline graphic, Inline graphic, Inline graphic and Inline graphic.
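The one-step reduction can be sketched as below, assuming the table is stored as a dictionary keyed by the counts (n00, n01, n10, n11) and holding the probability of one specific ordering with those counts. That storage layout and the function name are assumptions of this sketch, not a description of the files produced by complete.

```python
def reduce_sample_size(table):
    """Derive the table for sample size n-1 from the table for size n.

    table: dict mapping (n00, n01, n10, n11), with counts summing to n,
    to the probability of one ordered configuration with those counts.
    The probability of a smaller configuration is the sum over the four
    possible states of the one chromosome that was removed.
    """
    n = sum(next(iter(table)))
    smaller = {}
    for n00 in range(n):
        for n01 in range(n - n00):
            for n10 in range(n - n00 - n01):
                n11 = n - 1 - n00 - n01 - n10
                cfg = (n00, n01, n10, n11)
                smaller[cfg] = sum(
                    # Add one chromosome in state `idx` to the smaller configuration.
                    table[tuple(c + (k == idx) for k, c in enumerate(cfg))]
                    for idx in range(4)
                )
    return smaller
```

Applying the function repeatedly yields tables for any smaller sample size, mirroring the iteration mentioned in the text.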

Adjusting the table for a different Inline graphic poses a greater challenge. One option would be to run complete many times to generate tables for different values of Inline graphic, but this would be extremely time-consuming. Our alternative solution takes advantage of the fact that, while Inline graphic strongly affects the relative probabilities among the three broad categories of (both-sites-polymorphic, one-site-fixed, both-sites-fixed), Inline graphic only mildly affects the relative probabilities of different configurations within these categories. The approximate probability of a site being polymorphic under the finite sites mutation model in a sample of size Inline graphic is Inline graphic (approximate in the sense that this ignores the slight possibility of a site being polymorphic but having back mutations erase all traces of that polymorphism). If two sites are independent Inline graphic, then the probabilities corresponding to these three categories of pairs are Inline graphic, Inline graphic, Inline graphic. Now we assume that the ratio between the probabilities of these categories is independent of Inline graphic and approximate the probabilities of configurations under some new Inline graphic by multiplying by Inline graphic (if both sites are polymorphic) or Inline graphic (if one site is fixed). If both sites are fixed, then we again deduce the probability by subtracting all other probabilities from 1.

Given these tabulated (or calculated) values, we use linear interpolation to arrive at the final probability for a given ρ. Linear interpolation as well as our numerical maximization algorithm require that the likelihood surface be reasonably smooth. The importance sampling algorithm leaves a small amount of error in its estimate of the likelihood, which can lead to small wiggles in the likelihood surface. We solve this problem by smoothing the tabulated values where necessary via cubic splines. Also, for configurations with a single fixed site, the importance sampling algorithm did not reduce the variance in the likelihood below the very low level of the slope across ρ, leading to numerical difficulties performing maximization on a non-smooth likelihood surface. We avoid this problem by making the likelihoods for these configurations constant across ρ by setting them equal to their average value.
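The interpolation and averaging steps can be sketched as follows, assuming the table stores one likelihood per value on an ascending grid with at least two points. The function names are hypothetical, and the cubic-spline smoothing step is not shown.

```python
from bisect import bisect_right

def interpolate(grid, values, x):
    """Linearly interpolate tabulated `values` at `x`.

    grid: ascending list of at least two tabulated parameter values.
    values: likelihoods tabulated at those grid points.
    """
    if not grid[0] <= x <= grid[-1]:
        raise ValueError("x lies outside the tabulated range")
    hi = min(bisect_right(grid, x), len(grid) - 1)
    lo = hi - 1
    w = (x - grid[lo]) / (grid[hi] - grid[lo])
    return (1.0 - w) * values[lo] + w * values[hi]

def flatten_to_mean(values):
    """Replace a tabulated curve by its average, making it constant."""
    mean = sum(values) / len(values)
    return [mean] * len(values)
```

`flatten_to_mean` corresponds to the treatment of configurations with a single fixed site, whose likelihoods are set to their average across the grid.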

Complexity and approximations

As alluded to earlier, a brute force implementation of the four nested loops in the composite likelihood function would take Inline graphic time where Inline graphic is the length of the assembly (or region of interest), Inline graphic is the read depth and Inline graphic is the average number of missing nucleotides at each site, assuming a constant read depth. Real metagenomic data have variable read depth, which makes the situation even worse with the sequencing error component (Inline graphic) dominating the complexity at high-depth sites (i.e. where Inline graphic). Instead we make several approximations:

  1. Reduce the amount of low-quality data. We allow no more than five bases with quality below 20 (1 in 100 chance of error) in any pair of sites. For an average read depth of Inline graphic and a quality distribution from Sanger sequencing, this cutoff eliminates ∼3% of our lowest-information-content data for a significant speed increase.

  2. Skip nearby pairs of sites. We consider only those pairs separated by at least 10 bases (in (1), change the second product to start at Inline graphic) and we only make pairs for every 5th site (in (1), change the first product to take values Inline graphic). Any given pair of adjacent sites is highly unlikely to have had a recombination breakpoint between them. If the sites are separated by a greater distance, the chance of a recombination breakpoint between them increases. Thus this approximation sacrifices a small amount of information to reduce the overall number of pairs of sites. Empirically, simulations suggest this approximation does not greatly increase the variance of Inline graphic.

  3. Only use pairs of sites spanned by at least one chromosome (i.e. using the statistic defined in the next section, P_s > 0). Pairs of sites not meeting this criterion tend to be far apart and contain relatively little information.

  4. When accounting for error, only consider plausible true configurations, instead of all possible configurations. For a given pair of sites, we first sort the quality scores in ascending order (Inline graphic). Then we iterate over truths in decreasing order of probability (for one error: Inline graphic, then Inline graphic, etc.; for two errors: Inline graphic, then Inline graphic, etc.) until the probability is less than Inline graphic times as likely as the most probable configuration.
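The first two approximations in the list above amount to simple filters on which pairs of sites enter the composite likelihood. The sketch below is one possible reading: positions are 0-based, "separated by at least 10 bases" is taken to mean a coordinate difference of at least 10, and only the first site of each pair is restricted to every 5th position. These choices and the function names are assumptions.

```python
def candidate_pairs(length, min_sep=10, step=5):
    """Yield pairs (i, j) of site indices to include in the composite likelihood.

    The first site takes every `step`-th position; the second site lies
    at least `min_sep` positions to its right.
    """
    for i in range(0, length, step):
        for j in range(i + min_sep, length):
            yield i, j

def passes_quality_filter(quals, threshold=20, max_low=5):
    """Keep a pair of sites only if few of its bases have low quality.

    quals: Phred scores of all bases observed at the two sites.
    """
    return sum(q < threshold for q in quals) <= max_low
```

With the defaults, a 12-base region contributes only the pairs (0, 10) and (0, 11), which shows how sharply the number of evaluated pairs drops for short regions.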

Given these approximations, a standard desktop computer can perform this estimation for 10 kb of sequence, average depth of 10 and a realistic error distribution in less than one hour.
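The fourth approximation in the list above, restricting the sum over truths to plausible ones, can be sketched as a pruned search over sets of erroneous positions. This is not the authors' exact iteration order: it collects all error sets whose probability relative to the error-free truth stays above a cutoff and then sorts them. The cutoff value is left as a required argument because the original value is not recoverable here, and the sketch assumes every error probability is below 0.5.

```python
def plausible_truths(error_probs, cutoff):
    """List error sets whose relative probability is at least `cutoff`.

    error_probs: per-base error probabilities (each assumed < 0.5).
    Returns (positions_in_error, relative_probability) tuples, sorted from
    most to least probable, where the probability is relative to the
    truth that contains no errors.
    """
    # Visit positions from most to least error-prone (lowest quality first).
    order = sorted(range(len(error_probs)), key=lambda k: -error_probs[k])
    odds = [error_probs[k] / (1.0 - error_probs[k]) for k in order]
    results = []

    def extend(start, chosen, rel):
        results.append((tuple(sorted(chosen)), rel))
        for idx in range(start, len(order)):
            new_rel = rel * odds[idx]
            if new_rel < cutoff:
                break  # later positions have even smaller odds
            extend(idx + 1, chosen + [order[idx]], new_rel)

    extend(0, [], 1.0)
    results.sort(key=lambda r: -r[1])
    return results
```

Because each additional error multiplies the probability by odds below one, the search can stop extending a set as soon as it falls under the cutoff, which keeps the number of truths considered small when most bases have high quality.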

P_s statistic

Before we discuss our results, we need to quantify the amount of missing data between a given pair of sites. Define Inline graphic to be the proportion of chromosomes that span a particular pair of sites: Inline graphic, where Inline graphic is the number of chromosomes spanning both sites (i.e. both sites are covered either by the same read or by paired reads) and Inline graphic is the average number of chromosomes covering each site separately (Inline graphic and Inline graphic, respectively).

The average value of this statistic together with the average sample size provide an indirect measure for the amount of information about recombination captured by pairs of sites within a given dataset.
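Following the verbal definition above, the statistic for one pair of sites is the number of chromosomes spanning both sites divided by the average number of chromosomes covering each site. A minimal sketch, with hypothetical argument names and an assumed return value of zero for pairs with no coverage at all:

```python
def spanning_proportion(n_span, n_i, n_j):
    """Proportion of chromosomes that span a pair of sites.

    n_span: chromosomes covering both sites (same read or paired reads).
    n_i, n_j: chromosomes covering the first and second site, respectively.
    """
    mean_cover = (n_i + n_j) / 2.0
    if mean_cover == 0:
        return 0.0  # assumption: no coverage means nothing spans the pair
    return n_span / mean_cover
```

For example, two spanning chromosomes at a pair of sites each covered by ten chromosomes give a value of 0.2.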

Sludge data

We applied our technique to the first 500 kb of the assembly of Candidatus Accumulibacter phosphatis from a recent metagenomic sequencing project of activated sludge from a wastewater treatment plant [13]. The sludge we analyzed came from a laboratory bioreactor in Madison, Wisconsin that had been seeded from a local wastewater treatment plant. We received the data (P. Hugenholtz, personal communication) in the form of a finished assembly consisting of ACE and PhD files covering a Inline graphic megabase scaffold of average depth Inline graphic. Equivalent data in a different form are also available directly from the Joint Genome Institute via the IMG/M system [25] and the NCBI Trace Archive (genome project id 17657).

Results

We first investigate the information content of a single pair of sites as a function of the amount of missing data. This information sets an upper bound on our estimator's performance since we use the composite likelihood instead of the true likelihood. In particular, the Fisher information, Inline graphic, for a single pair of sites with depth Inline graphic decreases with Inline graphic, although the information only falls off dramatically for Inline graphic (Figure 2). We find these results encouraging since the average Inline graphic of pairs in the actual sludge metagenome falls just above this threshold at 0.21. Note that the Fisher information holds little meaning on an absolute scale since we calculate the information for a single pair of sites rather than for our actual data with many dependent pairs. Instead, the values in Figure 2 should be interpreted on a relative scale. For instance, for Inline graphic, approximately ten independent pairs with Inline graphic would contain the same information about Inline graphic as a single pair with Inline graphic.

Figure 2. Information about ρ as a function of missing data.


Fisher information for a single pair of sites of depth Inline graphic, with varying amount of missing data as quantified by Inline graphic (0 = no chromosomes span both sites; 1 = all chromosomes span both sites) for different values of Inline graphic.

The bulk of our analyses rely on simulated data where we know the truth and can evaluate the performance of our estimator. We use the program ms [26] in combination with seq-gen [27] to generate sequences across a 10 kb region under a finite-sites model of mutation (Inline graphic unless specified otherwise) and the coalescent with recombination. We simulate recombination as gene conversion with mean tract length fixed at Inline graphic (see equation 2). The sample size (i.e. number of simulated chromosomes) is Inline graphic where Inline graphic is the average read depth and Inline graphic is the length of each read in a read-pair. We transform these sequences into metagenomic-style data by randomly distributing read starts uniformly across the simulated region and trimming each simulated sequence to only be present for the length of three segments: one read, the gap between read pairs, and one read. Our simulation assumes no variation in read length or distance between read pairs. Note that a gap of zero produces the same effect as unpaired reads with double the read length. For results with sequencing error, we assign quality scores from the true Sanger sequencing quality score distribution as determined from the sludge data. A “sequencing error” causes a switch from the true nucleotide to each of the other three with probability 1/3. Given that we are simulating relatively small datasets with low information content, we occasionally generate an assembly with a maximum likelihood at ρ̂ = ∞. We exclude these values from all further analyses, but, for each parameter set, we report the proportion of replicates that yielded infinite parameter estimates either in Table 1 or in the text below.
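The trimming step can be illustrated with a short sketch that masks one simulated chromosome down to a single read pair. It assumes that bases in the gap between the two reads are unobserved, and it restricts the start position so that the whole read pair fits inside the region; how the original simulations handled region edges is not stated, so this is an assumption, as are the function and argument names.

```python
import random

def metagenomic_fragment(sequence, read_len, gap, rng=random):
    """Mask a simulated chromosome so only one read pair remains visible.

    Returns a string of the same length in which positions outside the
    two reads are replaced by '?'. The start of the first read is drawn
    uniformly from the positions where read + gap + read still fits.
    """
    span = 2 * read_len + gap
    if span > len(sequence):
        raise ValueError("read pair is longer than the simulated region")
    start = rng.randrange(0, len(sequence) - span + 1)
    masked = ['?'] * len(sequence)
    for lo in (start, start + read_len + gap):
        # Copy the bases of one read back into the masked chromosome.
        masked[lo:lo + read_len] = sequence[lo:lo + read_len]
    return ''.join(masked)
```

Setting `gap=0` produces one contiguous visible stretch of twice the read length, matching the remark that a gap of zero behaves like unpaired reads of double length.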

Table 1. Proportion of simulation replicates with ρ̂ = ∞ for each parameter set.

Parameters Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
0.01,75,0 0.020 0.0040 0.0040 0.0040 0.0060 0.0020 0.004
0.01,500,0 0.026 0.0160 0.0080 0.0140 0.0020 0.0060 0.004
0.01,500,100 0.030 0.0140 0.0080 0.0140 0.0020 0.0120 0.012
0.01,500,500 0.026 0.0180 0.0060 0.0160 0.0080 0.0020 0.012
0.01,1000,0 0.022 0.0065 0.0045 0.0022 0.0046 0.0067 0.011
0.01,500,0* 0.016 0.0040 0.0000 0.0000 0.0000 0.0000 0.000
0.01,5000,0* 0.000 0.0000 0.0000 0.0000 0.0000 0.0000 0.004

First column lists parameters: θ, read length, gap between paired-end reads. Asterisk (*) signifies simulations without sequencing error.

We analyzed the performance of our estimator in the presence of sequencing error across a range of plausible values of Inline graphic (0.002 to 0.04), read lengths roughly corresponding to current Illumina, 454 and Sanger sequencing technologies (75, 500, 1000) and gaps between read-pairs (0, 100, 500) by calculating the root mean squared error (RMSE) relative to the true value (Inline graphic; Figure 3). Note that while RMSE conveniently summarizes our estimator's sampling distribution, it obscures the inherent asymmetry of the distribution caused by the constraint Inline graphic. A clear trend emerges with lower relative RMSE accompanying increased recombination. The estimator has little bias (results not shown) and, for Inline graphic, we are able to reliably estimate within a factor of Inline graphic of the true value. For most parameters, increasing the read length reduces the variance by virtue of increasing Inline graphic, but for larger Inline graphic the results for 1 kb reads appear slightly worse than for 0.5 kb reads. Increasing the gap between the paired-end reads increases the variance for all except the very smallest Inline graphic. Intuitively, this makes sense: if all pairs of sites are very close together with low Inline graphic then a recombination event will only rarely occur between them; however, if all pairs are far apart with high Inline graphic then recombination events will saturate between the pairs of sites.
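The summary statistic used here can be sketched as follows, assuming that "relative to the true value" means the RMSE divided by the true parameter and that infinite estimates are dropped before averaging, as stated for the simulations. The function names are hypothetical.

```python
import math

def relative_rmse(estimates, true_value):
    """Root mean squared error of the finite estimates, divided by the truth."""
    finite = [e for e in estimates if math.isfinite(e)]
    if not finite:
        raise ValueError("no finite estimates to summarize")
    mse = sum((e - true_value) ** 2 for e in finite) / len(finite)
    return math.sqrt(mse) / true_value

def proportion_infinite(estimates):
    """Fraction of replicates whose estimate is infinite."""
    return sum(math.isinf(e) for e in estimates) / len(estimates)
```

Reporting both quantities together matters because the RMSE alone hides the replicates that were excluded for yielding infinite estimates.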

Figure 3. Performance of ρ̂.


Root mean squared error (RMSE) of Inline graphic relative to the true Inline graphic for paired-end reads with (A) different read lengths with gap = 0 separating the pairs and (B) different gap lengths with read length = 500. RMSE calculated from 500 replicate simulations of assembly size 10 kb, Inline graphic, Inline graphic, Sanger-distributed sequencing error.

With the above results suggesting that longer read lengths do not always yield a better estimate, we decided to directly compare a metagenomic-style sample to a “standard” population genetic sample in which the same individuals are sequenced at all loci. The fair comparison keeps the total number of sequenced bases constant, so we simulate a 10 kb region with either 100 reads of 1 kb each or 10 reads of 10 kb each (Figure 4). For simplicity, we do not simulate sequencing error. As analyzed in the Discussion, despite the average depth being identical between the two sets of simulations, the metagenomic sample (on the left) exhibits less bias and much lower variance than the standard sample (on the right).

Figure 4. Metagenomic versus standard population sampling.


Metagenomic data on left has 100 reads of 1 kb each; standard data on right has 10 reads of 10 kb each. Circles correspond to mean; whiskers show 2.5% and 97.5% percentiles for 250 replicate simulations of assembly size 10 kb, Inline graphic, Inline graphic, no read pairs, no sequencing error.

Next we tested our approximation that adjusts the two-site likelihoods for different values of Inline graphic (see Methods subsection “Two-locus complete likelihoods without error”) by fixing Inline graphic and simulating across Inline graphic ranging from 0.002 to 0.025 while estimating Inline graphic using a two-site likelihood table generated for Inline graphic (Figure 5). Again we do not simulate sequencing error to focus on the effects of Inline graphic. Here we see that the correction (on the right in Figure 5) works quite well for Inline graphic above the likelihood table's driving value (i.e. Inline graphic) and somewhat less well for lower Inline graphic, with 3% of the simulations for Inline graphic giving infinite estimates. However, the uncorrected estimator (on the left) is strongly biased, with 98% of simulations for Inline graphic resulting in infinite (unplotted) estimates and 26% of those for Inline graphic. No other parameter values yielded any infinite estimates. The low Inline graphic results are exacerbated by the correlation of Inline graphic with the number of polymorphic sites. Lower Inline graphic means fewer polymorphic sites; since the majority of information about recombination rate comes from polymorphic sites, we see a larger variance in our estimate of Inline graphic for low Inline graphic.

Figure 5. Likelihood correction for different θs.

All simulations use Inline graphic (dashed horizontal line) and estimates use a likelihood table created for Inline graphic. Left panel makes no adjustment for mismatch between the table θ and the true θ. The odd variance for Inline graphic stems from 98% of simulations yielding infinite estimates and the remaining estimates being highly biased. Right panel uses the correction described in the Methods section. Circles correspond to mean; whiskers show 2.5% and 97.5% percentiles for 250 replicate simulations of assembly size 10 kb, Inline graphic, read length 1000, no read pairs, no sequencing error.

Finally we apply our estimator to the sludge metagenomics project by sliding a 50 kb window in 25 kb steps across the first 500 kb of the assembly and independently estimating the recombination rate within each window (Figure 6). All windows produced finite estimates with Inline graphic.

Figure 6. Parameter estimates from sludge data.

Estimates generated by sliding 50 kb window in steps of 25 kb across first 500 kb of the sludge assembly.
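The window layout can be enumerated as follows. This is a minimal sketch rather than the code used for the analysis: the function name is hypothetical, coordinates are assumed to be 0-based and half-open, and only windows that fit entirely inside the first 500 kb are kept.

```python
def sliding_windows(total_len: int, window: int, step: int) -> list[tuple[int, int]]:
    """Return (start, end) pairs for every full window, 0-based and half-open."""
    return [(start, start + window)
            for start in range(0, total_len - window + 1, step)]


# 50 kb windows moved in 25 kb steps across the first 500 kb of the assembly
windows = sliding_windows(total_len=500_000, window=50_000, step=25_000)
print(len(windows), windows[0], windows[-1])
```

Under these assumptions there are 19 windows, and each window shares 25 kb with its neighbor, so adjacent estimates are computed separately but from overlapping data.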

Discussion

The two-site composite likelihood estimator appears to be better suited for metagenomic samples (i.e. the purpose of this paper) than for standard population genetic samples (i.e. the purposes of [12],[19]) as seen from Figure 4. We believe this results from the balance of two opposing factors: greater linkage (less missing data) pushes the advantage toward the standard sample, while a larger genealogy with more independence pushes the advantage back toward the metagenomic sample. For the parameter ranges investigated here, the latter force wins and we see that the estimates for metagenomic samples have both less bias and lower variance for a fixed amount of sequencing. This result makes sense given the nature of the composite likelihood technique in which we treat each pair of sites as though it were independent of every other pair. The more chromosomes that are sampled, the more closely this independence assumption matches reality. An intriguing open question is how the composite likelihood estimator on metagenomic data compares to a full likelihood estimator on standard data, but we do not pursue this topic here.
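To make the independence assumption concrete, the sketch below sums pairwise log-likelihoods and maximizes the total over a grid of candidate rates. It is a toy under stated assumptions: `toy_pair_loglik` is a hypothetical placeholder and not the coalescent two-site likelihood table used by the estimator, the data are invented for illustration, and the rate is assumed to scale linearly with the distance between sites.

```python
import math


def toy_pair_loglik(discordant: bool, rho: float, distance: int) -> float:
    """Placeholder two-site log-likelihood (NOT the coalescent lookup table)."""
    p_linked = math.exp(-rho * distance)
    return math.log(1.0 - p_linked) if discordant else math.log(p_linked)


def composite_loglik(rho: float, pairs: list[tuple[int, bool]]) -> float:
    """Add up pairwise log-likelihoods as if every pair were independent."""
    return sum(toy_pair_loglik(disc, rho, dist) for dist, disc in pairs)


def maximize_on_grid(pairs: list[tuple[int, bool]], grid: list[float]) -> float:
    """Return the grid value with the highest composite log-likelihood."""
    return max(grid, key=lambda rho: composite_loglik(rho, pairs))


# Invented data: ten site pairs 100 bases apart, half of them discordant
pairs = [(100, True)] * 5 + [(100, False)] * 5
grid = [k / 1000 for k in range(1, 21)]  # candidate per-site rates 0.001 .. 0.020
best_rho = maximize_on_grid(pairs, grid)
print(best_rho)
```

The structure, a sum over pairs followed by a one-dimensional maximization, is what the independence assumption buys; the quality of the result depends on how nearly independent the pairs really are.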

The bias in the standard sample estimates (Figure 4) surprised us given theoretical results that assert consistency for the composite likelihood estimator [28]. However, consistency is an asymptotic feature and does not necessarily hold for finite samples. Indeed, further simulations of standard samples with greater sample depth reduced the bias to essentially zero with depth Inline graphic (results not shown). Given that metagenomic samples appear nearly unbiased with depth Inline graphic, the added independence of the metagenomic sample must allow the estimator to converge faster toward the asymptotic results.

Further, in contrast to Hudson's and McVean's programs (maxhap and LDhat, respectively), our method makes use of all pairs of sites, including sites observed to be fixed. We include these sites primarily as a byproduct of properly accounting for sequencing error, but these additional data also help reduce our variance. As a bonus, using all sites automatically makes our pairwise likelihoods true likelihoods, thus fulfilling one of the requirements for Fearnhead's [28] results proving the consistency of the composite likelihood estimator. If fixed sites were not included, then the pairwise likelihoods would need to be made conditional on only using pairs of segregating sites, which becomes computationally challenging when dealing with missing data. In fact, while maxhap and LDhat allow missing entries in their input data, this feature is not described in the accompanying papers [12],[19], and these implementations do not properly condition their likelihoods to account for the fact that they only use segregating sites. The only disadvantage of using all pairs of sites is that the likelihood calculation scales linearly with the number of pairs and thus using all pairs takes longer; however, our implementation still runs in a reasonable amount of time on realistic amounts of data (see “Complexity and approximations” subsection in Methods).

Real data include sequencing errors, which have the potential to bias population genetic inference and increase the variance of estimators [29]. Trimming the data based on quality scores will help reduce these problems, but the remaining error must still be taken into account. We do not have analytic theory quantifying the amount of bias introduced by sequencing error, but simulations show that unaccounted-for errors produce estimates biased toward a specific finite value of ρ that depends on the read length and gap size (results not shown). Intuitively, sequencing error primarily produces singletons, which yield different configurations depending on the distance separating the two sites with errors. If the two sites are close together, then errors will tend to generate 01 and 10 states. If the two sites are far apart, then errors will tend to generate 1? and ?1 states. The first group of states (01, 10) provides evidence for higher recombination since, if both mutations originally fell on the same chromosome (state 11), then recombination would have been necessary to break them up to be (01, 10). The second group of states (1?, ?1) provides evidence for lower recombination since this pattern of missing data is more likely to have arisen from (11, 11) states, which are suggestive of no recombination, than from (01, 10) states. Thus sequencing error introduces a highly artificial pattern of configurations, with a combination of evidence for high recombination between close pairs of sites and low recombination between distant pairs of sites leading to a maximum likelihood at an intermediate value. For paired-end reads of 500 bases separated by a gap of 0, errors drive toward Inline graphic.
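The two groups of configurations can be reproduced with a small sketch. The helper below is hypothetical and simplified: it assumes single, unpaired reads of 500 bases, codes the erroneous base as the derived allele "1", and uses "?" for a site the read does not cover.

```python
def two_site_state(read_start: int, read_len: int, derived_sites: set[int],
                   site_a: int, site_b: int) -> str:
    """Two-character state of one read at a pair of sites ('0', '1' or '?')."""
    def allele(site: int) -> str:
        if not read_start <= site < read_start + read_len:
            return "?"  # site falls outside the read: missing data
        return "1" if site in derived_sites else "0"
    return allele(site_a) + allele(site_b)


READ_LEN = 500

# Close pair (100 bases apart): each read carries one error but covers both sites
close = [
    two_site_state(900, READ_LEN, {1000}, 1000, 1100),
    two_site_state(950, READ_LEN, {1100}, 1000, 1100),
]

# Distant pair (600 bases apart): no single read can cover both sites
far = [
    two_site_state(900, READ_LEN, {1000}, 1000, 1600),
    two_site_state(1300, READ_LEN, {1600}, 1000, 1600),
]

print(close, far)
```

Errors at nearby sites show up as the 10 and 01 states, while errors at sites farther apart than one read length show up as 1? and ?1.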

The striking inverse correlation between the estimates of θ and ρ from the sludge data (Figure 6) could either be the result of an unknown artifact or a biological reality stemming from a dependence between recombination efficiency and sequence divergence. One possibility for an artifact would be sequencing error not accounted for in the quality scores (e.g. a PCR error before sequencing). Such errors would certainly lead to increased estimates of θ, but, on the basis of our simulations, seem unlikely to drive ρ down to 0. Also, such errors would have to occur non-uniformly across the genome at a granularity of 50 kb, which seems implausible. Another potential source for an artifact is our two-step estimation procedure in which we first estimate θ without regard to recombination and then estimate ρ conditional on θ. Again, however, simulations reveal that, while θ affects the variance of the estimate of ρ, the estimator is unbiased across all tested θ and shows no correlation between the estimates of θ and ρ (results not shown). Without a clear artifactual explanation, we turn toward biology. Laboratory experiments have shown a negative log-linear dependence between sequence divergence and transformation efficiency [30], and an analysis of a different metagenomic dataset found a similar dependence between divergence and parsimoniously-inferred recombination events [10]. Our data suggest that this pattern holds at a finer resolution with subtle increases in diversity, as quantified by θ, leading to lower rates of recombination in a log-linear manner, with the exception of regions in which recombination appears nonexistent (Figure 7).

Figure 7. Log-linear relationship between θ and ρ.

Circles are data from Figure 6; dashed line is log-linear regression using non-zero values of ρ.
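A regression of this kind can be sketched as an ordinary least-squares fit of ln(ρ) on θ that skips windows with ρ = 0, since the logarithm is undefined there. The function name, the use of the natural logarithm, and all numbers below are assumptions made for illustration; none of the values are estimates from the sludge data.

```python
import math


def fit_log_linear(theta: list[float], rho: list[float]) -> tuple[float, float]:
    """Least-squares fit of ln(rho) = intercept + slope * theta, skipping rho == 0."""
    points = [(t, math.log(r)) for t, r in zip(theta, rho) if r > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical per-window values lying exactly on ln(rho) = -5 - 200 * theta
theta = [0.002, 0.004, 0.006, 0.008, 0.010]
rho = [math.exp(-5.0 - 200.0 * t) for t in theta]
rho[2] = 0.0  # a window with no detectable recombination is left out of the fit

intercept, slope = fit_log_linear(theta, rho)
print(intercept, slope)
```

A negative slope corresponds to recombination falling off as diversity rises; the zero-ρ window does not influence the fitted line.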

On an absolute scale, these estimates from the sludge data fall into a plausible range for bacterial populations. For instance, in Campylobacter jejuni Inline graphic [31] and in Neisseria meningitidis Inline graphic ranges from 0.00270 to 0.034 [11]. However, previous estimates of microbial recombination rates have been based on much smaller amounts of data (in these examples, Inline graphic bases) relative to the sludge windows of 50 kilobases. In addition, C. jejuni and N. meningitidis are both pathogens, which makes for a quite different ecological and evolutionary environment than that of the nonpathogenic sludge bacterium A. phosphatis. When the sludge estimates of mutation and recombination are viewed relative to each other, we see that mutation events generally occur more frequently than recombination events (Inline graphic), which places A. phosphatis more toward the clonal end of the bacterial spectrum [32].

Overall, our new estimator produces surprisingly accurate estimates of recombination rate, particularly considering the amount of missing data. The real power of the estimator derives from the greater independence of the genealogies underlying the sample; sequencing error and missing data present hurdles to accessing this information but our estimator has surmounted them. Despite our motivation from microbial populations, our method itself makes no assumptions inherent to microbial populations. For our purpose, a “metagenomic” sample simply means sampling a mixture of a large number of individuals from a single species, in which each read (or pair of reads) can be safely assumed to have originated from a different individual. Given the results from the comparison to a standard sample, the metagenomic approach should always be followed to obtain maximal information about recombination for a fixed amount of sequencing.

An implementation of our Population genetic Inference In Metagenomics (PIIM) method is freely available for download from http://ib.berkeley.edu/labs/slatkin/software.html.

Acknowledgments

We thank Phil Hugenholtz at JGI for sharing the assembly of the sludge data.

Footnotes

The authors have declared that no competing interests exist.

This research was supported by National Institutes of Health grant R01-GM40282 to MS and by a Chang-Lin Tien Fellowship to PLFJ. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Lenski RE, Winkworth CL, Riley MA. Rates of DNA sequence evolution in experimental populations of Escherichia coli during 20,000 generations. J Mol Evol. 2003;56:498–508. doi: 10.1007/s00239-002-2423-0.
  • 2. Whitaker RJ, Grogan DW, Taylor JW. Recombination shapes the natural population structure of the hyperthermophilic archaeon Sulfolobus islandicus. Mol Biol Evol. 2005;22:2354–61. doi: 10.1093/molbev/msi233.
  • 3. Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. A fine-scale map of recombination rates and hotspots across the human genome. Science. 2005;310:321–4. doi: 10.1126/science.1117196.
  • 4. Smith GR. Hotspots of homologous recombination. Experientia. 1994;50:234–41. doi: 10.1007/BF01924006.
  • 5. Brouns SJ, Jore MM, Lundgren M, Westra ER, Slijkhuis RJ, et al. Small CRISPR RNAs guide antiviral defense in prokaryotes. Science. 2008;321:960–4. doi: 10.1126/science.1159689.
  • 6. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, et al. Opinion: Re-evaluating prokaryotic species. Nat Rev Microbiol. 2005;3:733–9. doi: 10.1038/nrmicro1236.
  • 7. Fraser C, Hanage WP, Spratt BG. Neutral microepidemic evolution of bacterial pathogens. Proc Natl Acad Sci U S A. 2005;102:1968–73. doi: 10.1073/pnas.0406993102.
  • 8. Smith JM, Smith NH, O'Rourke M, Spratt BG. How clonal are bacteria? Proc Natl Acad Sci U S A. 1993;90:4384–8. doi: 10.1073/pnas.90.10.4384.
  • 9. Didelot X, Falush D. Inference of bacterial microevolution using multilocus sequence data. Genetics. 2007;175:1251–66. doi: 10.1534/genetics.106.063305.
  • 10. Eppley JM, Tyson GW, Getz WM, Banfield JF. Genetic exchange across a species boundary in the archaeal genus Ferroplasma. Genetics. 2007;177:407–16. doi: 10.1534/genetics.107.072892.
  • 11. Jolley KA, Wilson DJ, Kriz P, McVean G, Maiden MC. The influence of mutation, recombination, population history, and selection on patterns of genetic diversity in Neisseria meningitidis. Mol Biol Evol. 2005;22:562–9. doi: 10.1093/molbev/msi041.
  • 12. McVean G, Awadalla P, Fearnhead P. A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics. 2002;160:1231–41. doi: 10.1093/genetics/160.3.1231.
  • 13. Martín HG, Ivanova N, Kunin V, Warnecke F, Barry KW, et al. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol. 2006;24:1263–9. doi: 10.1038/nbt1247.
  • 14. Tyson G, Chapman J, Hugenholtz P, Allen E, Ram R, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. doi: 10.1038/nature02340.
  • 15. Johnson PLF, Slatkin M. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res. 2006;16:1320–7. doi: 10.1101/gr.5431206.
  • 16. Simmons SL, Dibartolo G, Denef VJ, Goltsman DS, Thelen MP, et al. Population genomic analysis of strain variation in Leptospirillum group II bacteria involved in acid mine drainage formation. PLoS Biol. 2008;6:e177. doi: 10.1371/journal.pbio.0060177.
  • 17. Stumpf MP, McVean GA. Estimating recombination rates from population-genetic data. Nat Rev Genet. 2003;4:959–68. doi: 10.1038/nrg1227.
  • 18. Wang Y, Rannala B. Population genomic inference of recombination rates and hotspots. Proc Natl Acad Sci U S A. 2009;106:6215–9. doi: 10.1073/pnas.0900418106.
  • 19. Hudson RR. Two-locus sampling distributions and their application. Genetics. 2001;159:1805–17. doi: 10.1093/genetics/159.4.1805.
  • 20. Wall JD. Estimating recombination rates using three-site likelihoods. Genetics. 2004;167:1461–73. doi: 10.1534/genetics.103.025742.
  • 21. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998;8:186–94.
  • 22. Frisse L, Hudson RR, Bartoszewicz A, Wall JD, Donfack J, et al. Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am J Hum Genet. 2001;69:831–43. doi: 10.1086/323612.
  • 23. Langley CH, Lazzaro BP, Phillips W, Heikkinen E, Braverman JM. Linkage disequilibria and the site frequency spectra in the su(s) and su(w(a)) regions of the Drosophila melanogaster X chromosome. Genetics. 2000;156:1837–52. doi: 10.1093/genetics/156.4.1837.
  • 24. Fearnhead P, Donnelly P. Estimating recombination rates from population genetic data. Genetics. 2001;159:1299–318. doi: 10.1093/genetics/159.3.1299.
  • 25. Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res. 2008;36:D534–8. doi: 10.1093/nar/gkm869.
  • 26. Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–8. doi: 10.1093/bioinformatics/18.2.337.
  • 27. Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997;13:235–8. doi: 10.1093/bioinformatics/13.3.235.
  • 28. Fearnhead P. Consistency of estimators of the population-scaled recombination rate. Theor Popul Biol. 2003;64:67–79. doi: 10.1016/s0040-5809(03)00041-8.
  • 29. Johnson PLF, Slatkin M. Accounting for bias from sequencing error in population genetic estimates. Mol Biol Evol. 2008;25:199–206. doi: 10.1093/molbev/msm239.
  • 30. Roberts MS, Cohan FM. The effect of DNA-sequence divergence on sexual isolation in Bacillus. Genetics. 1993;134:401–8. doi: 10.1093/genetics/134.2.401.
  • 31. Wilson DJ, Gabriel E, Leatherbarrow AJ, Cheesbrough J, Gee S, et al. Rapid evolution and the importance of recombination to the gastroenteric pathogen Campylobacter jejuni. Mol Biol Evol. 2009;26:385–97. doi: 10.1093/molbev/msn264.
  • 32. Hanage WP, Fraser C, Spratt BG. The impact of homologous recombination on the generation of diversity in bacteria. J Theor Biol. 2006;239:210–9. doi: 10.1016/j.jtbi.2005.08.035.
