An alignment-free test for recombination

Bernhard Haubold; Linda Krause; Thomas Horn; Peter Pfaffelhuber

doi:10.1093/bioinformatics/btt550

Bioinformatics. 2013 Sep 23;29(24):3121–3127. doi: 10.1093/bioinformatics/btt550

PMCID: PMC5994939 PMID: 24064419

Abstract

Motivation: ‘Why recombination?’ is one of the central questions in biology. This has led to a host of methods for quantifying recombination from sequence data. These methods are usually based on aligned DNA sequences. Here, we propose an efficient alignment-free alternative.

Results: Our method is based on the distribution of match lengths, which we look up using enhanced suffix arrays. By eliminating the alignment step, the test becomes fast enough for application to whole bacterial genomes. Using simulations we show that our test has power similar to that of established tests when applied to long pairs of sequences. When applied to 58 genomes of Escherichia coli, we pick up the strongest recombination signal from a 125 kb horizontal gene transfer engineered 20 years ago.

Availability and implementation: We have implemented our method in the command-line program rush. Its C sources and documentation are available under the GNU General Public License from http://guanine.evolbio.mpg.de/rush/.

Contact: haubold@evolbio.mpg.de

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

It is surprisingly difficult to account for the prevalence of sex and recombination in nature (Otto and Lenormand, 2002). Classical explanations for the evolution of sex are based on the realization that it can speed up adaptation (Fisher, 1930/1999, ch. 6). In addition, it removes deleterious mutations from the population (Muller, 1932, 1964). However, recombination also leads to the breakup of coadapted genes, which raises the question of whether recombination has an evolutionary cause, or is a mere consequence of molecular mechanisms such as DNA repair (Felsenstein, 1974). Perhaps both views are correct, as the contemporary consensus is that sex and recombination would only be maintained in multicellular organisms with high mutation rates and largely negative fitness interactions between genes (Otto, 2007). Irrespective of this uncertainty about the function of recombination in unicellular organisms, horizontal gene transfer in bacteria has attracted particular attention, as it often underlies the emergence of virulent pathogens (Baquero, 2004).

The enigma of recombination, combined with its prevalence and clinical importance, has inspired the development of numerous methods for assessing genetic exchange (Posada, 2002). These fall into two broad categories: methods for detecting the presence of recombination, and methods for estimating its rate. We concentrate here on the simpler detection problem. In the past, this has been solved in two ways: by looking for clustering of mutations or by identifying multiple mutations to the same nucleotide in distinct lineages. Such recurrent mutations are called homoplasies. Methods based on the detection of clustered polymorphisms include the widely used Max-χ² method (Maynard Smith, 1992) and the runs method implemented in the popular GENECONV program (Sawyer, 1989). Both are based on the realization that the time to the most recent common ancestor varies along a recombining sequence (Fig. 1). Because the number of mutations that affect a genomic region is proportional to the time to the most recent common ancestor of the sampled sequences, the number of mutations varies along the sequence; as a result, the mutations are clustered.

Fig. 1. — Simulated time to the most recent common ancestor (TMRCA) and mutations (vertical lines) along a recombining stretch of DNA sequence. TMRCA is proportional to the population size

Classical homoplasy-based methods rest on the assumption that any nucleotide mutates at most once, thus generating three haplotypes between a pair of polymorphisms, say 10, 01 and 00. The missing fourth possible haplotype, 11 in our example, can only be generated through recombination. The detection of such haplotype quartets is used to determine the minimum number of recombination events in an aligned sample of homologous sequences (Hudson, 1985). More recently, this diagnosis of either presence or absence of a recombination event between a pair of polymorphisms was generalized to inferring possibly more than one recombination in samples of more than four sequences. This has been formalized as the Φ_w statistic and implemented in Phi, a fast tool for detecting recombination (Bruen et al., 2006). Like all homoplasy-based tests, Φ_w tends to have greater power than its rivals based on polymorphism clustering (Wiuf et al., 2001).
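The haplotype argument above is commonly known as the four-gamete test. The following is a minimal sketch of that idea only, assuming biallelic sites coded as the characters 0 and 1; the function names and the sample are hypothetical, and the code is not the algorithm implemented in Phi.

```python
from itertools import combinations

def four_gamete_violation(haplotypes, i, j):
    """Return True if sites i and j together show all four haplotypes 00, 01, 10, 11."""
    observed = {(h[i], h[j]) for h in haplotypes}
    return len(observed) == 4

def incompatible_pairs(haplotypes):
    """List all pairs of sites that show four haplotypes, which cannot arise
    without recombination if every site mutated at most once."""
    n_sites = len(haplotypes[0])
    return [(i, j) for i, j in combinations(range(n_sites), 2)
            if four_gamete_violation(haplotypes, i, j)]

# Hypothetical sample of four sequences at three polymorphic sites.
sample = ["000", "100", "010", "111"]
print(incompatible_pairs(sample))  # only sites 0 and 1 show all four haplotypes
```

A sample needs at least four sequences before a pair of sites can show all four haplotypes.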

The detection of recombination among a sample of homologous sequences is thus a well-understood problem, as long as an alignment of the sequences is available. However, aligning genomes remains challenging despite great advances in this field (Bray and Pachter, 2004; Darling et al., 2004). At the same time, there is interest in analyzing recombination at the scale of bacterial genomes, if not greater (Didelot et al., 2010). Fortunately, the nucleotide-wise assignment of homology that defines alignments is not necessary to test for recombination. We show this by developing a fast alignment-free test for recombination. Like Max-χ² (Maynard Smith, 1992) and GENECONV (Sawyer, 1989), our test is based on the detection of polymorphism clustering. But instead of scoring single nucleotide polymorphisms (SNPs), we record the lengths of exact matches between pairs of sequences. To a first approximation this corresponds to the distances between SNPs. Recombination leads to an increase in the fluctuation of match lengths when compared with no recombination. Because match lengths can be looked up efficiently using modern string algorithms (Puglisi et al., 2007), our test scans a pair of Escherichia coli genomes totaling 10 Mb in only 8 s on a contemporary laptop.

In the following, we derive our test and demonstrate its sensitivity and specificity through simulation. We also use simulations to compare it with two published tests, Max-χ² and Φ_w. Finally, we apply our test to 58 E. coli genomes to search for horizontal gene transfer.

2 METHODS

2.1 Derivation

Consider a query sequence, q, and a subject sequence, s. At every position i in q we look for the shortest substring that is absent from s. We call this SHortest Unique subSTRING the ‘shustring’ (Haubold et al., 2005) at position i and denote its length by X_i. For example, if the shustring starting at the first position in q is k nucleotides long, then X_1 = k. Our approach is built on the assumption that in pairs of homologous DNA sequences the shustrings correspond to homologous matches and hence their lengths represent distances to the next SNP. This assumption is reasonable for closely related sequences such as those sampled from populations or recently diverged species.
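The definition of X_i can be illustrated directly. The sketch below is a naive quadratic-time computation on a hypothetical toy pair of sequences (not the example used in the article); the function name and the handling of positions where the match runs to the end of q are assumptions of this sketch.

```python
def shustring_lengths_naive(q, s):
    """X_i for every position i in q: length of the shortest substring of q
    starting at i that does not occur anywhere in s.

    If the whole suffix q[i:] occurs in s, the shustring would have to include
    an end-of-sequence sentinel; we then report len(q) - i + 1."""
    lengths = []
    for i in range(len(q)):
        k = 1
        # Grow the candidate substring until it is absent from s
        # or we run past the end of q.
        while i + k <= len(q) and q[i:i + k] in s:
            k += 1
        lengths.append(k)
    return lengths

# Hypothetical toy sequences that differ at a single site (position 4, 0-based).
q = "ACGTACGA"
s = "ACGTTCGA"
print(shustring_lengths_naive(q, s))  # [5, 4, 3, 2, 4, 4, 3, 2]
```

Up to the mismatch, the lengths count down 5, 4, 3, 2, which is the distance to the next SNP plus one.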

Conceptually, we identify shustrings using a suffix tree of q and s (Gusfield, 1997). We say ‘conceptually’ because our actual implementation is based on an abstract version of a suffix tree called ‘enhanced suffix array’ (Abouelhoda et al., 2002). Figure 2 shows the suffix tree of q and s. Each suffix is represented by a path from the root to a leaf. For example, the leaf labeled with position 1 of q corresponds to the suffix of q starting at that position. Moreover, repeated prefixes are collapsed into single paths: suffixes that share a prefix also share the corresponding path, so the shared prefix appears only once in the tree. A sentinel character appended to q and s differs from every character in q and s, even from itself. Its addition ensures that a suffix that is also a prefix of another suffix is still represented by a leaf in the tree.

Fig. 2. — Suffix tree of our example query and subject sequences, q and s

To identify the shustring starting at, say, the first position in q, we visit the corresponding leaf and climb toward the root until we find a node, n, with a subject leaf in the subtree rooted on n. In our example, we carry out two climbing steps. Then we extend the path label from the root to the node we have reached by one character toward the starting leaf, to find our first shustring. In this way, we calculate the shustrings at every position in q.
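Climbing from a query leaf to the nearest node that has a subject leaf below it amounts to finding the longest prefix the query suffix shares with any subject suffix. As a rough illustration, the sketch below swaps the suffix tree for a sorted list of subject suffixes searched with binary search. This is a deliberate simplification for toy input, not the enhanced suffix array used by rush, and all names are hypothetical.

```python
from bisect import bisect_left

def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def shustring_lengths_sorted(q, s):
    """Shustring lengths of q with respect to s via a sorted list of subject suffixes.

    Storing explicit suffixes costs quadratic memory, so this is only meant
    for short example sequences."""
    suffixes = sorted(s[j:] for j in range(len(s)))
    lengths = []
    for i in range(len(q)):
        query_suffix = q[i:]
        pos = bisect_left(suffixes, query_suffix)
        # The longest match with s is found at one of the two lexicographic
        # neighbours of the query suffix.
        best = 0
        for neighbour in suffixes[max(pos - 1, 0):pos + 1]:
            best = max(best, lcp(query_suffix, neighbour))
        lengths.append(best + 1)  # extend the longest match by one character
    return lengths

# Hypothetical toy sequences, not the example of Figure 2.
q = "ACGTACGA"
s = "ACGTTCGA"
print(shustring_lengths_sorted(q, s))  # [5, 4, 3, 2, 4, 4, 3, 2]
```

Adding one to the longest match mirrors the extension of the path label by one character toward the starting leaf.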

Our test statistic is based on the lengths of shustrings, X_i. Without recombination, and if sequences differ at a fraction π of their sites, every site i has probability π of differing between q and s, independently of other sites. Using this model, which is based on the infinite sites model from population genetics, we obtain for the average shustring length

$$\mathrm{E}[X_i] \approx \frac{1}{\pi}.$$

Hence, π can be estimated from unaligned sequences as the inverse of the average shustring length (Haubold et al., 2011). Recall that recombination leads to fluctuations in the coalescence times along a sequence and therefore to clustering of polymorphisms (Fig. 1). As a result, the mean shustring length increases, which might be used as an indicator of recombination (Haubold and Pfaffelhuber, 2012). Unfortunately, we found that it was impossible to infer the expected average shustring length without estimating π to a precision attainable only with alignments. However, the empirical variance of shustring lengths, s², is a more promising statistic. Its expectation without recombination is a function of π, which we can estimate from the average shustring length (Haubold and Pfaffelhuber, 2012). This expected variance is then compared with the observed variance, s², to test the null hypothesis that the two are equal. For this, we derived the variance of s² in the absence of recombination, Var[s²] [Equation (1); Supplementary Material], and assumed that s² is normally distributed around its expectation, E[s²]. This means that we can use a one-sided test for the normalized difference between s² and its expectation,

$$\frac{s^2 - \mathrm{E}[s^2]}{\sqrt{\mathrm{Var}[s^2]}}, \qquad (2)$$

which is approximately normally distributed with mean 0 and standard deviation 1. We also define the ratio of the observed to the expected variance, Q, as a rough measure of recombination.
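To make the quantities in this derivation concrete, here is a small simulation sketch under the no-recombination model. It rests on several labeled assumptions. Shustring lengths are taken directly as distances to the next mismatch. The expected variance is the variance of a geometric distribution, (1 − π)/π², which follows from the independent-sites model but is not copied from the article. The variance of s² is passed in as a parameter because Equation (1) is not reproduced here. All function names are hypothetical.

```python
import math
import random

def shustring_lengths_from_mismatches(is_mismatch):
    """Distance from each site to the next mismatch, counting the mismatch itself.
    Sites after the last mismatch are skipped."""
    lengths = []
    next_mismatch = None
    for i in range(len(is_mismatch) - 1, -1, -1):
        if is_mismatch[i]:
            next_mismatch = i
        if next_mismatch is not None:
            lengths.append(next_mismatch - i + 1)
    return lengths  # reversed order, which does not affect mean or variance

def summarize(lengths):
    """Return the estimate of pi, the observed variance s2, the expected variance and Q."""
    n = len(lengths)
    mean = sum(lengths) / n
    s2 = sum((x - mean) ** 2 for x in lengths) / (n - 1)
    pi_hat = 1.0 / mean                           # inverse of the average shustring length
    expected_var = (1.0 - pi_hat) / pi_hat ** 2   # geometric model (assumption of this sketch)
    return pi_hat, s2, expected_var, s2 / expected_var

def one_sided_p(s2, expected_var, var_s2):
    """Normalized difference as in Equation (2) and its upper-tail normal P-value."""
    z = (s2 - expected_var) / math.sqrt(var_s2)
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p

random.seed(1)
pi = 0.02
sites = [random.random() < pi for _ in range(500_000)]
pi_hat, s2, expected_var, q_ratio = summarize(shustring_lengths_from_mismatches(sites))
print(pi_hat, q_ratio)
```

With independent mismatches the estimate of π should lie near 0.02 and Q near 1; clustered mismatches would inflate s² and hence Q.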

2.2 Implementation

We implemented our test in the program rush, which stands for ‘Recombination detection Using SHustrings’. The program rush takes as input a query and a subject DNA sequence in FASTA format and computes Q, the test statistic of Equation (2), and the corresponding P-value. Its suffix array construction is based on the deep-shallow algorithm and implementation by Manzini and Ferragina (2002), which is one of the most efficient string indexing methods available (Puglisi et al., 2007).

2.3 Data

We downloaded complete genome sequences from GenBank for the 58 E. coli strains listed in Supplementary Table S1. Because rush is restricted to the analysis of sequences consisting solely of the nucleotide designations A, C, G and T, all other characters were removed prior to the analysis.
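A filtering step of this kind could look roughly as follows. This is a sketch for a single-record FASTA string; the function name is hypothetical, and converting lower-case letters to upper case before filtering is an assumption of this sketch rather than documented behavior of rush.

```python
def clean_fasta_sequence(fasta_text):
    """Return (header, sequence) with every character other than A, C, G, T removed.

    Assumes a single-record FASTA string; lower-case letters are upper-cased
    first (an assumption of this sketch)."""
    lines = fasta_text.strip().splitlines()
    header = lines[0]
    raw = "".join(lines[1:]).upper()
    sequence = "".join(base for base in raw if base in "ACGT")
    return header, sequence

# Hypothetical record containing ambiguity codes N, R and Y.
header, sequence = clean_fasta_sequence(">example\nACGNNT\nRYacg\n")
print(header, sequence)  # >example ACGTACG
```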

2.4 Simulations and computer programs

We generated samples of homologous DNA sequences using the coalescent simulation program ms (Hudson, 2002) in conjunction with ms2dna available from http://guanine.evolbio.mpg.de/bioBox/

These simulations were conditioned on the population mutation rate, θ, and the population recombination rate, ρ. As in ms (Hudson, 2002), θ = 4N_eμ, where N_e is the effective population size and μ is the mutation probability per generation. Under the infinite sites model, θ is equal to the expected number of pairwise mismatches, π. Similarly, ρ = 4N_ec, where c is the probability of recombination per generation. For more background on coalescent theory see the excellent introduction by Wakeley (2009).
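As a rough illustration of how such parameters are handed to ms, the sketch below assembles an ms argument list using its `-t` (θ for the whole locus) and `-r` (ρ for the whole locus, followed by the number of sites) options. The helper name and the example values are hypothetical, and scaling per-site rates by the number of sites is a simplifying assumption of this sketch; the exact commands used in the study are not given in the text.

```python
def ms_command(n_samples, n_replicates, theta_per_site, rho_per_site, n_sites):
    """Build the argument list for ms; theta and rho are scaled to the whole locus."""
    theta = theta_per_site * n_sites
    rho = rho_per_site * n_sites
    return ["ms", str(n_samples), str(n_replicates),
            "-t", f"{theta:g}", "-r", f"{rho:g}", str(n_sites)]

# Hypothetical example: one pair of 100 kb sequences with per-site rates of 0.01.
print(" ".join(ms_command(2, 1, 0.01, 0.01, 100000)))
# ms 2 1 -t 1000 -r 1000 100000
```

The resulting list can be passed to `subprocess.run` if ms is installed; its haplotype output would then still need conversion to DNA sequences, for which the text uses ms2dna.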

Max-χ² and Φ_w values were computed using the program Phi (Bruen et al., 2006). The distances between the E. coli genomes were computed using kr (Domazet-Lošo and Haubold, 2009) and the tree based on these distances was computed and drawn with Phylip (Felsenstein, 2005) using neighbor joining and midpoint rooting.

3 RESULTS

3.1 Simulations

We explored the accuracy of the newly derived Equation (1) through simulations. In Figure 3A, we varied sequence length. The simulated values of the variance of s², Var[s²], were always smaller than those given by Equation (1), but the two curves are close. Similarly, when varying π in Figure 3B, we found that Equation (1) is greater than the simulated value, though not by much. Notice also that, as expected from Equation (1), a 10-fold change in sequence length corresponds to a 10-fold change in Var[s²] (Fig. 3A), whereas a 10-fold change in π corresponds to an enormous 10⁵-fold change in Var[s²] (Fig. 3B).

Fig. 3. — Comparing the variance of the variance in shustring length, Var[s²], observed in simulations without recombination with that calculated according to Equation (1), as a function of sequence length (A) and genetic diversity, π (B). Number of replicates: 10⁴; in A, π was held fixed; in B, sequence length 10⁵

This comparison between simulated and computed Var[s²] values suggests that our test is conservative. Accordingly, we found that at a low rate of recombination the null hypothesis of no recombination is rejected at the 5% significance level with a frequency of only 0.032 rather than the expected 0.05 (Fig. 4). However, with increasing recombination the rejection rate grows to >0.99 as ρ approaches π. When ρ exceeds π, the rejection rate declines again, as our test statistic is maximal when ρ is close to π. We also compared the rejection rate of our test statistic with that of the homoplasy-based test statistic Φ_w, which is computed from four or more aligned sequences (Bruen et al., 2006). Figure 4 shows that our test statistic has a rejection curve similar to that of Φ_w, with slightly less sensitivity at the simulated parameter combination.

Fig. 4. — Our test statistic has sensitivity similar to that of the alignment-based Φ_w (Bruen et al., 2006). The graph shows the fraction of null hypotheses rejected at the 5% significance level as a function of the rate of recombination, ρ. Horizontal line: 0.05; sample size for Φ_w: 4

However, the sensitivity of our test statistic and of Φ_w depends on the number of mismatches sampled. Figure 5 shows the rejection rate as a function of sequence length at fixed π and a length-invariant expected number of recombination events of 100. When sequence length is 10⁵, both tests have little power, whereas 10 times longer sequences yield high rejection rates, with only a marginal difference between the two tests.

Fig. 5. — Rejection rate as a function of sequence length for our test statistic compared with Φ_w (Bruen et al., 2006); expected number of recombination events for the entire region: 100

It is reassuring that our test statistic is highly sensitive in the limit of long sequences. This is equivalent to saying that it is sensitive when applied to polymorphic sequences as long as π is small. This proviso follows from the assumption that shustrings are homologous. The greater π, the more shustrings occur that are generated by random non-homologous matches between query and subject.

Traditional methods for detecting recombination such as Max-χ² were designed for sequences no longer than a few kilobases. To compare our test with Max-χ², we used the simulation scheme by Bruen et al. (2006): for Φ_w and Max-χ², we simulated samples of 10 sequences with fixed θ and ρ. For 1 kb sequences, Φ_w was more sensitive than Max-χ², in agreement with Bruen et al. (2006) (Fig. 6). For our test, we simulated pairs of sequences with the same θ, and a ρ value that resulted in the same expected number of recombination events as in the sample of 10, following Hudson and Kaplan (1985). For kilobase-length sequences, our test has little power (Fig. 6). However, the power of all methods to detect recombination increases with sequence length (Wiuf et al., 2001). For sequences of 100 kb or more, our test is sensitive. Incidentally, this is the sequence length where we stopped the Max-χ² simulations as they became too slow.

Fig. 6. — Comparing rejection frequencies between our test statistic, Max-χ² and Φ_w as a function of the length of the input sequences. The simulation parameters are described in the text

The Φ_w statistic was developed as a fast alternative to methods such as Max-χ². We, therefore, simulated sequence quartets of lengths 10⁴–10⁷ bp and timed Phi given an alignment. The slope of the resulting log–log run time curve was 2.2, that is, doubling the sequence length increases the run time 2^2.2 ≈ 4.6-fold. We compared this with rush when applied to sequence pairs of the same lengths. The slope of its run time curve is 1.2. This is still a bit steeper than the optimal slope of 1, and means that with double the input data rush takes 2.3 times longer. As a consequence, Phi is eventually outperformed by rush for sequences longer than 100 kb (Fig. 7).
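The relation between the slope of a log–log run time curve and the cost of doubling the input can be checked with a few lines of arithmetic. The sketch below uses hypothetical function names and synthetic timings rather than the measurements behind Figure 7.

```python
import math

def loglog_slope(lengths, times):
    """Least-squares slope of log(time) against log(length)."""
    xs = [math.log(x) for x in lengths]
    ys = [math.log(y) for y in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def doubling_factor(slope):
    """Factor by which run time grows when the input length doubles."""
    return 2.0 ** slope

# Slopes quoted in the text: 2.2 for Phi, 1.2 for rush.
print(round(doubling_factor(2.2), 1), round(doubling_factor(1.2), 1))  # 4.6 2.3

# Synthetic timings that follow a power law with exponent 1.2.
lengths = [10**4, 10**5, 10**6, 10**7]
times = [1e-6 * n**1.2 for n in lengths]
print(round(loglog_slope(lengths, times), 3))
```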

Fig. 7. — Comparing the run times of `Phi` and `rush` as a function of sequence length

Because Phi requires aligned sequences as input, we also timed aligning sequence quartets with Mavid (Bray and Pachter, 2004). Mavid is fast and the slope of its run time curve is only 1.1, making it virtually optimal. Still, rush is roughly 100 times faster than alignment-based Phi.

In our final set of simulations, we investigated the potentially confounding effects of codon data and changes in GC content, and found that the rejection rate was unaffected by either (Supplementary Figs. S1 and Supplementary Data).

3.2 Horizontal gene transfer in E.coli

Having established the accuracy and robustness of rush through simulations, we applied it to the 58 fully sequenced E.coli genomes available from GenBank at the time of writing (Supplementary Table S1). Figure 8 shows a cluster diagram of the strains computed from their complete genomes.

Fig. 8. — Midpoint-rooted neighbor joining tree of 58 *E.coli* strains. The two genomes with the highest recombination measure Q, K12_MG1655 and KO11, are indicated by arrows

The 58 × 57 = 3306 pairwise recombination tests between the 58 E. coli strains took rush 15 h, 9 min and 15 s on an Intel Xeon 2.40 GHz CPU, an average of 16.5 s per test. In 97% of these tests the null hypothesis was rejected, indicating that the model assumption of uniformly distributed mutations is usually violated across whole bacterial genomes. Figure 9 shows the distribution of the 3306 values of our recombination measure Q; its median is 2.951 with a huge range of 0.587–40.939. We focused on the two strains with the largest Q, KO11 versus K12_MG1655, which are marked by arrows in the cluster diagram (Fig. 8).
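The number of tests and the average time per test follow from simple arithmetic, sketched here with hypothetical variable names. Each genome serves as query against every other genome as subject, so ordered pairs are counted.

```python
n_genomes = 58
n_tests = n_genomes * (n_genomes - 1)      # ordered query/subject pairs
total_seconds = 15 * 3600 + 9 * 60 + 15    # 15 h, 9 min and 15 s
print(n_tests, round(total_seconds / n_tests, 1))  # 3306 16.5
```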

Fig. 9. — Histogram of all 3306 measures of recombination, Q, computed in pairwise comparisons between the 58 *E.coli* strains shown in Figure 8

To search for evidence of horizontal gene transfer, we looked for regions in the KO11 genome that were more closely related to K12_MG1655 than to its closest relative. The cluster diagram (Fig. 8) indicates that the closest relative of KO11 is KO11FL. However, this is an artifact of the clustering algorithm, which guarantees finding the correct tree only for distances that actually fit a tree. Empirical distances are often not strictly tree-like, especially in the presence of recombination. When we look up the raw pairwise distances, we find that the distance between KO11 and KO11FL is detectable, whereas that between KO11 and W1 or W2 is below the sensitivity of kr, the program we used for estimating the substitution rates between genomes (Domazet-Lošo and Haubold, 2009). Hence, we took W1 as the closest relative of KO11. Figure 10 shows that at position 4.4 Mb, KO11 contains large fragments more similar to K12_MG1655 than to W1. The longest of these spans 102 kb and is located at 4 385 866–4 487 685. When blasting this region against W1, the best hit is 46.9 kb long and contains 367 mismatches and 13 gaps. In contrast, the best hit in K12_MG1655 is 78.6 kb long with just 8 mismatches and 1 gap. The next best hit in K12_MG1655 is 23.5 kb long with 4 mismatches and 0 gaps. This is strong evidence for horizontal gene transfer. It appears to have affected a common ancestor of KO11 and KO11FL because in KO11FL the 102 kb query generates two top hits, one 68.2 kb long, the other 33.5 kb with just 2 mismatches and 0 gaps.

Fig. 10. — Comparison between *E. coli* strain KO11 as query and the two strains K12_MG1655 and W1 as subject using the program `alfy` (Domazet-Lošo and Haubold, 2011). Regions where KO11 is most closely related to W1 are shown in light gray, regions closer to K12_MG1655 in black; query regions with no close homolog in the subject sequences are shown in white. The x-axis gives positions in megabases

4 DISCUSSION

To make the most of genomic sequences, we would ideally use software that allows us to query them interactively. Our fast alignment-free test of recombination is intended as a step toward this goal. The program rush achieves its speed by using string-indexing techniques similar to those applied in the genome aligner MUMmer (Kurtz et al., 2004). The central feature of this approach to sequence analysis is that it is ‘optimal’ in the sense that in theory it runs in time linear in the length of the sequences analyzed. In practice, programs based on modern string-indexing techniques are fast and memory efficient (Puglisi et al., 2007). We have been interested for some time now in bringing the power of these algorithms to biology by combining them with modeling the distribution of shortest unique substrings (shustrings), which we take as a proxy for the distribution of the distances to the next polymorphism (Domazet-Lošo and Haubold, 2009, 2011; Haubold et al., 2005, 2009). In particular, in our previous derivation of an alignment-free estimator of genetic diversity we noted that the mean shustring length is sensitive to recombination (Haubold and Pfaffelhuber, 2012).

Here, we have used the test statistic defined in Equation (2), which is based on the variance of the shustring length. Specifically, we compared the observed variance, s², with its expectation in the absence of recombination. We tested the null hypothesis that the two are equal by deriving the variance of s² and constructing a parametric hypothesis test. This contrasts with the approach taken by the authors of Phi, our main point of reference: they test the null hypothesis of no recombination using a permutation test (Bruen et al., 2006). This was not an option for us because generating the null distribution of shustring lengths by shuffling polymorphisms requires an alignment. Moreover, a parametric approach is faster, albeit sometimes less accurate, than its Monte Carlo equivalent. In our case, the computed variance of s² is always slightly larger than its simulated counterpart (Fig. 3). Our test is thus conservative when applied to simulated data. This is illustrated by Figure 4, where no recombination leads to 3% rejections at a significance level of 5%. However, a 3% rejection rate is reasonably close to the expected 5%, giving us additional confidence that the test statistic behaves as desired.

We compared our test with Max-χ² and Φ_w using one of the parameter combinations of Bruen et al. (2006). Our 1 kb results in Figure 6 were compatible with theirs; our results also document that our test is not a replacement for established methods, but rather complements them for megabase-length sequences. Application to sequences of this length is possible because rush runs ∼100 times faster than Phi plus alignment (Fig. 7).

Given that our test is conservative (Fig. 4), it might come as a surprise that 97% of the pairs of E. coli genomes tested yielded a significant test statistic (Fig. 9). This points to a weakness our method shares with all methods for detecting recombination that are based on the identification of clustered polymorphisms. These methods are sensitive to variations in the rate of mutation. This is the reason for the superiority of homoplasy-based methods, including Φ_w, over cluster-based methods (Bruen et al., 2006).

In the present study, we compared in particular the alignment-based test statistic Φ_w with our alignment-free test statistic. An important difference between them is that Φ_w is applied to at least four aligned sequences, whereas our test statistic compares unaligned pairs of query and subject sequences. In our example application the subject was always a single genome, but it could also consist of several concatenated sequences.

Statistical tests exist independently of their implementations. However, to analyze simulated and experimental data, we rely on the tests’ implementations, which may or may not be the best possible. With this proviso in mind, we computed Φ_w using the published program Phi (Bruen et al., 2006), and our test statistic using our new program rush. The program rush is always faster than the alignment step necessary for applying Phi. Moreover, for sequences longer than 10⁵ bp, rush is even faster than Phi given an alignment (Fig. 7). Its ability to work without alignment not only makes rush fast, it also facilitates its application to genomes that are available only as sets of contigs. This is useful because genomes are increasingly published in this form rather than as fully assembled chromosomes. In our study, not aligning enabled us to efficiently search for the genome pair with the largest value of Q, which was KO11 as query and K12_MG1655 as subject. The closest relative of KO11 is strain W. Together with strains K12, B and C this belongs to the select group of four E. coli strains classified as Risk Group 1 organisms; these are safe to use in the laboratory. It turns out that KO11 was engineered in 1991 from W to produce ethanol (Ohta et al., 1991; Turner et al., 2012). In the process, two chunks of DNA were transferred into what became KO11: three genes for ethanol production from Zymomonas mobilis and 125 kb of the laboratory workhorse E. coli K12_MG1655. These 125 kb comprise the uvrA–mutL region of the K12_MG1655 chromosome. The 102 kb fragment we investigated starts inside the uvrA locus. In other words, we have picked up a region that was acquired from K12_MG1655 20 years ago. During these 20 years KO11 was serially transferred and evolved into KO11FL. We found that KO11FL had diverged more from its ancestor KO11 than from their common ancestor W. This was also reported by the team that sequenced KO11FL (Turner et al., 2012).

5 CONCLUSION

Our fast method for detecting recombination from unaligned genomes is accurate when applied to simulated data. Applied to bacterial genomes, it diagnoses recombination too frequently, due to variation in mutation rate across long sequences. This is a weakness our test shares with other methods to detect recombination based on identifying polymorphism clustering (Bruen et al., 2006). Nevertheless, using the recombination measure Q, we discovered the strongest signal for the pair of E. coli genomes that had undergone an engineered 125 kb horizontal gene transfer 20 years ago.

ACKNOWLEDGEMENT

The authors are grateful to Paul Rainey for helpful comments.

Funding: Deutsche Forschungsgemeinschaft through grant Pf672/3-1 (to P.P.).

Conflict of Interest: none declared.

REFERENCES

Abouelhoda M, et al. Proceedings of the Second Workshop on Algorithms in Bioinformatics. 2002. The enhanced suffix array and its applications to genome analysis. Lecture Notes in Computer Science 2452, Springer-Verlag, pp. 449–463.
Baquero F. From pieces to patterns: evolutionary engineering in bacterial pathogens. Nat. Rev. Microbiol. 2004;2:510–518. doi: 10.1038/nrmicro909.
Bray N, Pachter L. MAVID: constrained ancestral alignment of multiple sequences. Genome Res. 2004;14:693–699. doi: 10.1101/gr.1960404.
Bruen TC, et al. A simple and robust statistical test for detecting the presence of recombination. Genetics. 2006;172:2665–2681. doi: 10.1534/genetics.105.048975.
Darling ACE, et al. Mauve: multiple alignment of conserved genomic sequence with rearrangement. Genome Res. 2004;14:1394–1403. doi: 10.1101/gr.2289704.
Didelot X, et al. Inference of homologous recombination in bacteria using whole genome sequences. Genetics. 2010;186:1435–1449. doi: 10.1534/genetics.110.120121.
Domazet-Lošo M, Haubold B. Efficient estimation of pairwise distances between genomes. Bioinformatics. 2009;25:3221–3227. doi: 10.1093/bioinformatics/btp590.
Domazet-Lošo M, Haubold B. Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics. 2011;27:1466–1472. doi: 10.1093/bioinformatics/btr176.
Felsenstein J. The evolutionary advantage of recombination. Genetics. 1974;78:737–756. doi: 10.1093/genetics/78.2.737.
Felsenstein J. PHYLIP (phylogeny inference package) version 3.6. 2005.
Fisher RA. The Genetical Theory of Natural Selection. Variorum edn. Oxford: Oxford University Press; 1930/1999.
Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press; 1997.
Haubold B, Pfaffelhuber P. Alignment-free population genomics: an efficient estimator of sequence diversity. Genes Genomes Genet. 2012;2:883–889. doi: 10.1534/g3.112.002527.
Haubold B, et al. Estimating mutation distances from unaligned genomes. J. Comput. Biol. 2009;16:1487–1500. doi: 10.1089/cmb.2009.0106.
Haubold B, et al. Genome comparison without alignment using shortest unique substrings. BMC Bioinformatics. 2005;6:123. doi: 10.1186/1471-2105-6-123.
Haubold B, et al. Alignment-free estimation of nucleotide diversity. Bioinformatics. 2011;27:449–455. doi: 10.1093/bioinformatics/btq689.
Hudson RR. The sampling distribution of linkage disequilibrium under an infinite allele model without selection. Genetics. 1985;109:611–631. doi: 10.1093/genetics/109.3.611.
Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. doi: 10.1093/bioinformatics/18.2.337.
Hudson RR, Kaplan NL. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics. 1985;111:147–164. doi: 10.1093/genetics/111.1.147.
Kurtz S, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12. doi: 10.1186/gb-2004-5-2-r12.
Manzini G, Ferragina P. ESA’02: Proceedings of the 10th Annual European Symposium on Algorithms. London: Springer-Verlag; 2002. Engineering a lightweight suffix array construction algorithm; pp. 698–710.
Maynard Smith J. Analysing the mosaic structure of genes. J. Mol. Evol. 1992;34:126–129. doi: 10.1007/BF00182389.
Muller HJ. Some genetic aspects of sex. Am. Nat. 1932;66:118–138.
Muller HJ. The relation of recombination to mutational advance. Mutat. Res. 1964;1:2–9. doi: 10.1016/0027-5107(64)90047-8.
Ohta K, et al. Genetic improvement of Escherichia coli for ethanol production: chromosomal integration of Zymomonas mobilis genes encoding pyruvate decarboxylase and alcohol dehydrogenase II. Appl. Environ. Microbiol. 1991;57:893–900. doi: 10.1128/aem.57.4.893-900.1991.
Otto SP. Unravelling the evolutionary advantage of sex: a commentary on ‘Mutation-selection balance and the evolutionary advantage of sex and recombination’ by Brian Charlesworth. Genet. Res. Camb. 2007;89:447–449. doi: 10.1017/S001667230800966X.
Otto SP, Lenormand T. Resolving the paradox of sex and recombination. Nat. Rev. Genet. 2002;3:252–261. doi: 10.1038/nrg761.
Posada D. Evaluation of methods for detecting recombination from DNA sequences: empirical data. Mol. Biol. Evol. 2002;19:708–717. doi: 10.1093/oxfordjournals.molbev.a004129.
Puglisi SJ, et al. A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 2007;39:4.
Sawyer SA. Statistical tests for detecting gene conversion. Mol. Biol. Evol. 1989;6:526–538. doi: 10.1093/oxfordjournals.molbev.a040567.
Turner PC, et al. Optical mapping and sequencing of the Escherichia coli KO11 genome reveal extensive chromosomal rearrangements, and multiple tandem copies of the Zymomonas mobilis pdc and adhB genes. J. Ind. Microbiol. Biotechnol. 2012;39:629–639. doi: 10.1007/s10295-011-1052-2.
Wakeley J. Coalescent Theory: An Introduction. Colorado: Roberts & Company; 2009.
Wiuf C, et al. A simulation study of the reliability of recombination detection methods. Mol. Biol. Evol. 2001;18:1929–1939. doi: 10.1093/oxfordjournals.molbev.a003733.


