Evolutionary Bioinformatics Online. 2007 Feb 21;2:295–302.

Genome Halving with an Outgroup

Chunfang Zheng, Qian Zhu, David Sankoff
PMCID: PMC2674651  PMID: 19455223

Abstract

Some genomes are known to have incurred a genome doubling (tetraploidization) event in their evolutionary history, and this is reflected today in patterns of duplicated segments scattered throughout their chromosomes. These duplications may be used as data to “halve” the genome, i.e. to reconstruct the ancestral genome at the moment of tetraploidization, but the solution is often highly non-unique. To resolve this problem, we adapt the genome halving algorithm of El-Mabrouk and Sankoff to take account of an external reference genome. We apply this to reconstruct the tetraploid ancestor of maize, using either rice or sorghum as the reference.

Keywords: tetraploid, genome doubling, cereals, genome rearrangement, synteny, algorithms

Introduction

Many genomes have been shown to result from an ancestral doubling, or tetraploidization, event, followed by a period of diploidization, i.e. the loss of compartmentalization between the two original copies of the genome, as well as genome rearrangement through intra- and interchromosomal movement of genetic material. The genome halving problem is to reconstruct the ancestral genome on the basis of a decomposition of the present-day genome into a set of apparently duplicated blocks of genes or DNA sequence dispersed among the chromosomes. A quantitative approach to this problem was first discussed by Seoighe and Wolfe (1998) in the context of the genome doubling of the ancestor of the yeast Saccharomyces cerevisiae. At the same time, motivated by studies of genome duplication in early vertebrates (Nadeau and Sankoff, 1997), El-Mabrouk and colleagues (1998, 1999a, 1999b) published a series of papers on the combinatorial optimization approach to the problem, culminating in a general solution (El-Mabrouk and Sankoff, 2003). Further refinements have been published by Alekseyev and Pevzner (2004).

Seoighe and Wolfe (1998) noted the extreme non-uniqueness associated with the solution to the genome halving problem and suggested that this difficulty could be attenuated through the use of a reference genome, or outgroup. The suggestion to use a reference genome was taken up to study the post-tetraploidization evolution of S. cerevisiae, both in reference to the genome of Ashbya gossypii (Dietrich et al. 2004) and to that of Kluyveromyces waltii (Kellis et al. 2004), though without recourse to genome rearrangement or genome halving algorithms. Similar research compared mammalian genomes with the tetraploid ancestor of the pufferfish Tetraodon nigroviridis (Jaillon et al. 2004). In the present paper, we formalize this strategy by developing a general algorithm to reconstruct an ancestral tetraploid genome with reference to an outgroup genome. We apply it to infer the ancestor of the maize (Zea mays) genome, with the rice (Oryza sativa) and sorghum (Sorghum bicolor) genomes as outgroups. For this purpose, we are concerned only with duplicated blocks in maize, and their single-copy counterparts in rice and sorghum, as extracted from the Gramene database (Jaiswal et al. 2006), and not the rest of the genomes.

Our strategy is to generate all the solutions to the genome halving problem for the maize genome, and to focus on the subset of these that have a minimum rearrangement distance with the rice (or sorghum) genome. We formulate a search heuristic to transcend the set of optimal halving solutions to find the most realistic ancestral genome that minimizes the sum of the distance between the ancestral tetraploid and present-day maize and the distance between rice (or sorghum) and the diploid form of the ancestor.

The Data

It is generally agreed that the maize genome underwent a genome doubling event some 11–16 million years ago (Gaut and Doebley, 1997). While some duplicated regions clearly attest to this event, there is no consensus on the exact inventory of such regions. Moore et al. (1995) and Wilson et al. (1999) presented two largely consistent views of syntenic blocks across the cereals based on the mapping evidence at the time. These included 14 and 19 duplicated blocks in the maize genome. Gaut (2001) gave a more comprehensive account of the pattern of 23 duplicated regions, based on maize genomic sequence data in 2001. He did not completely establish the relative position of all the syntenies on the chromosomes in this work.

Even now that the rice genome has been sequenced, and the maize genome project is well-advanced, it is no trivial matter to identify the duplicate blocks resulting from the tetraploidization event. The maize genome has many other duplicated segments dating from periods both after and before the tetraploidization and even before the divergence from the other cereals. This is complicated by post-tetraploidization genome rearrangement events, deletions and insertions of genetic material, transpositions of genes or larger segments from one site on the genome to another, and loss of homology between the parts of the duplicated regions.

The databank which has the most information on the syntenies among the cereal genomes is Gramene (Jaiswal et al. 2006). The current version at time of writing is release 21. From this we can obtain a conservative (i.e. confined to high homology regions only) estimate of duplicate blocks by comparison with the rice genome. For example, in Figure 1, we can visually identify large duplicated regions in chromosomes 1 and 9, chromosomes 1 and 5, and possibly a number of smaller ones, all by virtue of their common homology with regions of rice chromosome 3.

Figure 1. Syntenies between rice chromosome 3 and maize chromosomes, as produced by Gramene.

Unfortunately, there is as yet no comparison of syntenic blocks between sorghum and the other genomes on Gramene. However, there are extensive mapping data of various kinds of markers. We bolstered our preceding data collection by searching for sets of duplicate markers in maize that had single copies in sorghum and rice, comparing mainly the Patterson, 2003, genetic map for sorghum, the IBM2 Neighbours, 2004 and Cornell Wilson, 1999, genetic maps for maize and the Annotated Nipponbare Sequence, 2006, sequence map for rice. All the markers satisfying these criteria fell into the rice-maize syntenies established previously. Based on these criteria, i.e. markers identified as homologous in Gramene, with a single copy in each of rice and sorghum and two copies in maize, plus the requirement that the maize and rice copies fall into the appropriate, previously identified, rice-maize syntenic blocks, we could now identify 34 syntenic blocks as basic data for our reconstruction. These data are depicted in Figure 2, but should be considered to constitute a working hypothesis; definitive data must await the finishing of the maize genome, the sequencing of the sorghum genome, and the further application of global alignment and synteny block construction methods.

Figure 2. Order of syntenic blocks in rice, sorghum and, in two copies each, maize.

The Genome Halving Algorithm

Distance based on genomic structure d(X,Y) is calculated by rapid, albeit complicated, rearrangement algorithms for finding the minimum number of operations necessary to convert one genome X into another Y. The genomes are represented by signed permutations on 1,···, n and the biologically-motivated operations generally include inversions (implying as well change of sign, i.e. change of strand) of chromosomal segments, reciprocal translocations (of telomere-containing segments of two chromosomes) and chromosome fission or fusion. They may also include transpositions (including “jumping genes”) of segments from one site to another on a chromosome or interchanges of segments on a chromosome, both of which count as two steps compared to one for the previously mentioned operations.
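As a minimal sketch of the representation described above, a linear chromosome can be held as a list of signed integers, with an inversion reversing a segment and flipping its signs. The function names `invert`, `fission` and `fusion` and the list encoding are illustrative assumptions, not taken from the rearrangement software used in this work; translocations and transpositions are omitted.

```python
from typing import List, Tuple

Chromosome = List[int]  # signed block labels; the sign encodes the strand


def invert(chrom: Chromosome, i: int, j: int) -> Chromosome:
    """Invert chrom[i..j] inclusive: reverse the segment and flip each sign."""
    if not 0 <= i <= j < len(chrom):
        raise ValueError("need 0 <= i <= j < len(chrom)")
    segment = [-g for g in reversed(chrom[i:j + 1])]
    return chrom[:i] + segment + chrom[j + 1:]


def fission(chrom: Chromosome, k: int) -> Tuple[Chromosome, Chromosome]:
    """Split one linear chromosome into two at position k."""
    return chrom[:k], chrom[k:]


def fusion(a: Chromosome, b: Chromosome) -> Chromosome:
    """Join two linear chromosomes end to end."""
    return a + b


chromosome = [1, 2, 3, 4, 5]
inverted = invert(chromosome, 1, 3)  # blocks 2..4 reversed, signs flipped
```

Applying the same inversion twice restores the original order, which is why each such operation counts as a single reversible step.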

Rearrangement algorithms (e.g. Tesler, 2002) make use of the bi-coloured “breakpoint graph” or similar structure, where each end of an oriented syntenic block, gene or marker on genome X is joined by a red edge to the adjoining end of the adjacent syntenic block, gene or marker, and these same ends, represented by the 2n vertices in the graph, are joined by black edges determined by the adjacencies in genome Y. The breakpoint graphs necessarily consist of disjoint alternating cycles and/or paths, and it can be shown that d(X,Y) = n − c, where c is the number of cycles (in the case X and Y consist of single circular chromosomes), or d(X,Y) = n + χ − c − Π, where χ is the maximum number of linear chromosomes in X and Y, and Π counts the number of certain kinds of paths in the graph. The actual operations, d(X,Y) in number, may be reconstructed by splitting large cycles in the breakpoint graph into two cycles each, until (in the circular case) there are n cycles, each made up of two vertices, one red edge and one black edge. Every time a cycle is split, this corresponds to one rearrangement operation.
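The cycle-count formula for the single circular chromosome case, d = n − c, can be illustrated with a short sketch. It builds the red and black adjacencies between block extremities and counts alternating cycles. The helper names and the head/tail encoding are assumptions made for illustration; the sketch evaluates only the n − c expression quoted in the text and ignores any correction terms that specific operation sets (for example inversions only) may require.

```python
from typing import Dict, List, Tuple

Extremity = Tuple[int, str]  # (block id, 't' for tail or 'h' for head)


def adjacencies(genome: List[int]) -> Dict[Extremity, Extremity]:
    """Pair up neighbouring block ends of one circular chromosome."""
    def left(g: int) -> Extremity:
        return (abs(g), 't' if g > 0 else 'h')

    def right(g: int) -> Extremity:
        return (abs(g), 'h' if g > 0 else 't')

    adj: Dict[Extremity, Extremity] = {}
    n = len(genome)
    for k in range(n):
        a, b = right(genome[k]), left(genome[(k + 1) % n])
        adj[a] = b
        adj[b] = a
    return adj


def count_cycles(x: List[int], y: List[int]) -> int:
    """Count alternating red/black cycles in the breakpoint graph of x and y."""
    red, black = adjacencies(x), adjacencies(y)
    seen = set()
    cycles = 0
    for start in red:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = red[v]       # follow the red edge (adjacency in x)
            seen.add(w)
            v = black[w]     # then the black edge (adjacency in y)
    return cycles


def cycle_distance(x: List[int], y: List[int]) -> int:
    """Evaluate n - c for two circular genomes on the same n blocks."""
    return len(x) - count_cycles(x, y)
```

For instance, `cycle_distance([1, 2, 3, 4], [1, -3, -2, 4])` evaluates to 1, matching the single inversion that separates the two orders, while identical genomes give n two-vertex cycles and distance 0.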

In the rearrangement algorithms, construction of the breakpoint graph is an easy preliminary step. The genome halving algorithms (El-Mabrouk and Sankoff, 2003; Alekseyev and Pevzner, 2004) also make use of the breakpoint graph, but the problem here is building the breakpoint graph where one of the genomes (the tetraploid) is unknown. This is done by segregating the vertices of the graph in a natural way into subsets, such that the vertices of all cycles must fall within a single subset, and then constructing these cycles in an optimal way within each subset so that the red edges correspond to the structure of the known genome and the black edges define the adjacencies of a tetraploid.

A Heuristic for Minimizing d(U, A) + d(A ⊕ A,T)

Let T be a genome consisting of χ chromosomes and 2n genes, syntenic blocks or other markers, g_{1,1},···, g_{1,n}; g_{2,1},···, g_{2,n}, dispersed in any order on the chromosomes. For each i, we call g_{1,i} and g_{2,i} “duplicates,” but there is no particular property distinguishing elements of the set of g_{1,i} from the set of g_{2,i}. A potential “ancestral tetraploid” of T is written A ⊕ A, and consists of 2Ψ chromosomes, where some half (Ψ) of the chromosomes contains exactly one of g_{1,i} or g_{2,i} for each i = 1,···, n. The remaining Ψ chromosomes are each identical to one in the first half, in that where g_{1,i} appears on a chromosome in the first half, g_{2,i} appears on the corresponding chromosome in the second half, and where g_{2,i} appears in the first half, g_{1,i} appears in the second. We define A to be either of the two halves of A ⊕ A, where the subscript 1 or 2 is suppressed from each g_{1,i} or g_{2,i}. These Ψ chromosomes, and the n genes, syntenic blocks or markers they contain, g_1,···, g_n, constitute a potential “ancestral diploid” of T.

A solution of the genome halving problem for T is any A such that d(A ⊕ A, T) is minimal.

Any genome U is a reference genome for T if it contains the n genes, syntenic blocks or markers g_1,···, g_n.

Let U be a reference genome for T. The central problem in this paper is to find a potential ancestral diploid genome A such that d(U, A) + d(A ⊕ A, T) is minimized.
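The relationship between a diploid A and its doubled form A ⊕ A can be sketched as follows. The `(signed block, copy)` tuple encoding, the function names `double` and `suppress_copies`, and the `first_copy` argument are hypothetical conveniences introduced here; they only illustrate the definitions above and are not the data structures of the halving algorithm.

```python
from typing import Dict, List, Optional, Tuple

Marker = Tuple[int, int]  # (signed block id, copy index 1 or 2)


def double(diploid: List[List[int]],
           first_copy: Optional[Dict[int, int]] = None) -> List[List[Marker]]:
    """Build A + A: each chromosome appears twice, with copy labels swapped.

    first_copy maps a block id to the copy (1 or 2) placed in the first half;
    blocks not listed default to copy 1.
    """
    first_copy = first_copy or {}
    first_half: List[List[Marker]] = []
    second_half: List[List[Marker]] = []
    for chrom in diploid:
        top, bottom = [], []
        for g in chrom:
            c = first_copy.get(abs(g), 1)
            if c not in (1, 2):
                raise ValueError("copy index must be 1 or 2")
            top.append((g, c))
            bottom.append((g, 3 - c))  # the other duplicate
        first_half.append(top)
        second_half.append(bottom)
    return first_half + second_half


def suppress_copies(half: List[List[Marker]]) -> List[List[int]]:
    """Drop the copy subscripts to recover the ancestral diploid."""
    return [[g for g, _ in chrom] for chrom in half]


A = [[1, -2], [3]]
AA = double(A, first_copy={2: 2})
```

Either half of `AA` reduces to the same diploid once copy labels are suppressed, which mirrors the statement that A may be taken as either half of A ⊕ A.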

Let S be the set of solutions of the genome halving algorithm for T. As an initial step to our heuristic, schematized in Figure 3, we confine our search to S.

Figure 3. Procedure for finding ancestral tetraploid. T = genome made up of duplicated markers, U = reference genome. S = set of solutions to the genome halving problem. S′ = subset closest to U, A(i) = genome on trajectory from A ∈ S′ to U.

For each solution A ∈ S, we calculate the rearrangement distance d(U, A) between the reference genome U and A. This is feasible even for large S because of the rapidity of the rearrangement calculation. We then define

$$S' = \{A \in S \mid d(U, A) = \min_{X \in S} d(U, X)\}. \tag{1}$$

By definition, there is no minimizing genome in S\S′.
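The selection of S′ in equation (1) amounts to a minimum filter over S. The sketch below takes the distance as a caller-supplied function; `toy_distance` is a deliberately simple stand-in (a count of mismatched positions), not a rearrangement distance, and the four toy orders are invented purely to exercise the filter.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

G = TypeVar("G")


def closest_to_reference(solutions: Sequence[G], reference: G,
                         distance: Callable[[G, G], int]) -> Tuple[List[G], int]:
    """Return the solutions at minimum distance from the reference, and that minimum."""
    if not solutions:
        raise ValueError("solution set is empty")
    dists = [distance(reference, a) for a in solutions]
    best = min(dists)
    return [a for a, d in zip(solutions, dists) if d == best], best


def toy_distance(x: List[int], y: List[int]) -> int:
    """Stand-in only: number of positions at which two equal-length orders differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


U = [1, 2, 3, 4]
S = [[1, 2, 4, 3], [2, 1, 4, 3], [1, 3, 2, 4], [4, 3, 2, 1]]
S_prime, d_min = closest_to_reference(S, U, toy_distance)
```

Because every element of S has the same value of d(A ⊕ A, T), filtering on d(U, A) alone is enough to minimize the sum within S.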

To look for a better genome outside of S, for each A ∈ S′, we assume that any such genome will be found on a path between some element of S′ and U. We calculate the d(U, A) genomes, other than A, on a parsimonious trajectory A, A(1), A(2),···, U from A to U. Note that d(U, A(i)) = d(U, A) − i. Then we search for an A(i) such that

$$d(U, A(i)) + d(A(i) \oplus A(i), T) < d(U, A) + d(A \oplus A, T). \tag{2}$$

(Note that it is not necessary to try A(1) though it is closer by one step to U than A is, because A(1) ⊕ A(1) is also farther from T by at least one step, since it is not in S.) Our final solution set S″ is the set of A(i), over all genomes A ∈ S′, and all trajectories between A and U, that satisfy inequality (2) and that minimize the left hand side of (2).

If S″ is empty, then S′ is the final set of minimizing genomes.
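The trajectory search can be sketched under the assumption that three callables are available: `d_ref(X)` for d(U, X), `d_doubled(X)` for d(X ⊕ X, T), and `trajectories(A)` yielding sampled parsimonious paths [A(1), A(2), ..., U]. All three names, and the integer "genomes" in the toy example, are hypothetical placeholders; producing real trajectories and distances requires the rearrangement and halving algorithms themselves.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

G = TypeVar("G")


def search_trajectories(
    s_prime: Sequence[G],
    d_ref: Callable[[G], int],
    d_doubled: Callable[[G], int],
    trajectories: Callable[[G], Iterable[Sequence[G]]],
) -> Tuple[int, List[G]]:
    """Return the best objective value and the genomes attaining it.

    Falls back to s_prime when no trajectory genome satisfies inequality (2).
    """
    best_value = None
    best: List[G] = []
    for a in s_prime:
        baseline = d_ref(a) + d_doubled(a)
        for path in trajectories(a):
            for x in path[1:]:  # path[0] is A(1), which cannot improve on A
                value = d_ref(x) + d_doubled(x)
                if value >= baseline:
                    continue    # inequality (2) not satisfied
                if best_value is None or value < best_value:
                    best_value, best = value, [x]
                elif value == best_value and x not in best:
                    best.append(x)
    if best_value is None:
        return min(d_ref(a) + d_doubled(a) for a in s_prime), list(s_prime)
    return best_value, best


# Toy example: "genomes" are integers on a line and the reference U is 0.
table = {5: 10, 4: 11, 3: 11, 2: 11, 1: 14, 0: 16}   # invented d(X + X, T) values
value, genomes = search_trajectories(
    s_prime=[5],
    d_ref=abs,
    d_doubled=table.__getitem__,
    trajectories=lambda a: [list(range(a - 1, -1, -1))],
)
```

In this toy setting the starting genome has objective 5 + 10 = 15, and the search settles on the intermediate genome with objective 2 + 11 = 13.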

Complexity

Since both genome halving and genome rearrangement are essentially linear in n, the execution time of our search is O(n|S| + φn²|S′|), the second term measuring the number of steps between genomes in S′ and U and the time to calculate the distance to U at each step, and the number φ of different paths sampled per element in S′. In our example, biological reality motivates constraining the search so that all chromosome fissions are carried out first, as far as compatible with the optimality of the path. This is because the loss of chromosomes is likely to occur around the time of diploidization, so the path back from A towards the ancestor should attempt to restore the number of chromosomes to what it is in sorghum or rice as soon as possible, i.e. for some A(i), where i is as small as possible.

Results

The genome halving algorithm usually involves some arbitrary choices in constructing the optimal ancestral tetraploid. In the case of the maize genome, this leads to more than 5,000,000 different execution paths for the algorithm. Not all of these lead to different results, but distinct solutions in S surely number in the hundreds of thousands, if not millions; a sample of 15,000 paths resulted in over 13,000 different solutions. The original data set not being very large (34 blocks in each of the two reference genomes, 68 in maize), this exemplifies the extreme lack of uniqueness in the results of genome halving.

Figure 4. Results of search for ancestral tetraploid. M, R, SO = maize, rice, sorghum genomes. S = set of solutions to the genome halving problem. S′_R, S′_SO = subsets closest to R, SO. A_R(i) = genome on trajectory from A_R ∈ S′_R to R. A_SO(i) = genome on trajectory from A_SO ∈ S′_SO to SO.

Figure 5. Order of syntenic blocks in the reconstructed diploid maize ancestor, compared to sorghum, with the same rice chromosomal colour coding as in Figure 2.

When we bring the reference genomes to bear, we first note that over all X ∈ S, the distance d(X, SO) ranges from 19 (for the solutions in S′_SO) to 28, while d(X, R) ranges from 19 (for the solutions in S′_R) to 27. The sets S′_SO and S′_R, however, contain only 8 and 24 solutions, respectively. Thus there is a massive reduction of non-uniqueness induced by appealing to a reference. Then, in venturing outside of S on the paths from pre-tetraploid versions of elements of S′ towards the reference, either rice or sorghum, we find even fewer genomes X with a minimum sum of distance to the reference (X as a diploid) plus distance to maize (X ⊕ X as a tetraploid). For example, the genome A_SO(3) in Figure 4, depicted in Figure 5, satisfies

$$d(\mathrm{SO}, A_{\mathrm{SO}}(3)) + d(A_{\mathrm{SO}}(3) \oplus A_{\mathrm{SO}}(3), M) = 16 + 29 < d(\mathrm{SO}, A) + d(A \oplus A, M) = 19 + 27, \tag{3}$$

for all A ∈ S′_SO. There are only two other solutions with value 45 for the objective function, one a step closer (an A(4)) and one a step further (an A(2)), from the sorghum genome. In the case of a rice reference, there is actually a unique solution, with d(R, X) + d(X ⊕ X, M) = 44.
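The arithmetic behind inequality (3) can be spelled out using only the values reported above and the relation d(U, A(i)) = d(U, A) − i, with A the element of S′ at which the trajectory starts:

```latex
\begin{align*}
d(\mathrm{SO}, A_{\mathrm{SO}}(3)) &= d(\mathrm{SO}, A) - 3 = 19 - 3 = 16,\\
d(A_{\mathrm{SO}}(3) \oplus A_{\mathrm{SO}}(3), M) &= 29 = d(A \oplus A, M) + 2,\\
16 + 29 = 45 &< 46 = 19 + 27.
\end{align*}
```

In other words, moving three steps towards sorghum costs only two additional steps to maize, for a net gain of one step in the objective.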

Thus we have almost completely eliminated the non-uniqueness of the solutions to the genome-halving problem, though of course the number of solutions found will still depend on the data set. It is also possible that a better solution is to be found off the paths we have explored, although this is unlikely for the relatively small example represented by these cereal genomes.

Conclusions

We have been working with a small data set, and the differences between the optimal solution and suboptimal solutions are small, as in inequality (3). As more data become available on maize and especially sorghum, our reconstructions should be better and the role of the reference genome in zeroing in on a unique solution for genome halving will be clarified. This should also allow for statistical validation.

Our analysis used sorghum and rice as reference genomes in two separate analyses. It is gratifying that using sorghum alone as reference produced an ancestral maize genome closer, not only to sorghum, but also to rice, than any candidate ancestor based on genome halving with no reference. Nevertheless, it would be interesting to formally combine gene order information from both rice and sorghum simultaneously in reconstructing the maize ancestor. Along the lines of our current analysis, first finding S, then S′, and finally an optimal A(i), we could define S′ as the subset of S whose elements A each induce a minimal solution of the median problem (Sankoff and Blanchette, 1997; Siepel, 2001), i.e. for which there is a genome X, such that d(A, X) + d(U1, X) + d(U2, X) is minimal over all A ∈ S. Then the search for an optimal A(i) could proceed on the paths from all A ∈ S′ to X.

A more difficult theoretical problem would be to replace our sequential procedure by a single algorithm searching for the A which minimizes d(U, A) + d(A ⊕ A, T). It is not clear whether this is a hard problem, given that genome halving and genome rearrangement are both solvable in close to linear time. But there is no obvious way of modifying the halving algorithm so that it could take account of a reference genome while retaining optimality. Some of the searches we have performed here might be incorporated directly into the halving algorithm to transform it into a heuristic method, and this might work even for the direct minimization of d(U1, A) + d(U2, A) + d(A ⊕ A, T).

References

  1. Alekseyev MA, Pevzner PA. Genome halving problem revisited. In: Lodaya K, Mahajan M, editors. Proceedings of FSTTCS 2004: Foundations of Software Technology and Theoretical Computer Science. Lecture Notes in Computer Science, 3328. Heidelberg: Springer; 2004. pp. 1–15.
  2. Dietrich FS, Voegeli S, Brachat S, et al. The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science. 2004;304:304–7. doi: 10.1126/science.1095781.
  3. El-Mabrouk N, Bryant D, Sankoff D. Reconstructing the predoubling genome. In: Istrail S, Pevzner P, Waterman M, editors. Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99). New York: ACM Press; 1999a. pp. 154–63.
  4. El-Mabrouk N, Nadeau JH, Sankoff D. Genome halving. In: Farach-Colton M, editor. Combinatorial Pattern Matching, Ninth Annual Symposium. Lecture Notes in Computer Science, 1448. Heidelberg: Springer; 1998. pp. 235–50.
  5. El-Mabrouk N, Sankoff D. On the reconstruction of ancient doubled circular genomes using minimum reversals. In: Asai K, Miyano S, Takagi T, editors. Genome Informatics. Tokyo: Universal Academy Press; 1999b. pp. 83–93.
  6. El-Mabrouk N, Sankoff D. The reconstruction of doubled genomes. SIAM Journal on Computing. 2003;32:754–92.
  7. Gaut BS. Patterns of chromosomal duplication in maize and their implications for comparative maps of the grasses. Genome Res. 2001;11:55–66. doi: 10.1101/gr.160601.
  8. Gaut BS, Doebley JF. DNA sequence evidence for the segmental allotetraploid origin of maize. Proc. Natl. Acad. Sci. 1997;94:6809–14. doi: 10.1073/pnas.94.13.6809.
  9. Jaillon O, Aury JM, Brunet F, et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature. 2004;431:946–57. doi: 10.1038/nature03025.
  10. Jaiswal P, Ni J, Yap I, et al. Gramene: a bird’s eye view of cereal genomes. Nucleic Acids Research. 2006;34:D717–23. doi: 10.1093/nar/gkj154. URL: http://www.gramene.org.
  11. Kellis M, Birren B, Lander E. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004;428:617–24. doi: 10.1038/nature02424.
  12. Moore G, Devos KM, Wang Z, Gale MD. Cereal genome evolution: Grasses, line up and form a circle. Curr. Biol. 1995;5:737–9. doi: 10.1016/s0960-9822(95)00148-5.
  13. Nadeau JH, Sankoff D. Comparable rates of gene loss and functional divergence after genome duplications early in vertebrate evolution. Genetics. 1997;147:1259–66. doi: 10.1093/genetics/147.3.1259.
  14. Sankoff D, Blanchette M. The median problem for breakpoints in comparative genomics. In: Jiang T, Lee DT, editors. Proceedings of the Third International Computing and Combinatorics Conference (COCOON 1997). Lecture Notes in Computer Science, 1276. Heidelberg: Springer; 1997. pp. 251–63.
  15. Seoighe C, Wolfe KH. Extent of genomic rearrangement after genome duplication in yeast. Proc. Natl. Acad. Sci. U.S.A. 1998;95:4447–52. doi: 10.1073/pnas.95.8.4447.
  16. Siepel A. Exact algorithms for the reversal median problem. Master’s thesis; University of New Mexico: 2001.
  17. Tesler G. Efficient algorithms for multichromosomal genome rearrangements. Journal of Computer and System Sciences. 2002;65:587–609.
  18. Wilson WA, Harrington SE, Woodman WL, et al. Inferences on the genome structure of progenitor maize through comparative analysis of rice, maize and the domesticated panicoids. Genetics. 1999;153:453–73. doi: 10.1093/genetics/153.1.453.

Articles from Evolutionary Bioinformatics Online are provided here courtesy of SAGE Publications
