Abstract
Motivations
Gene trees inferred solely from multiple alignments of homologous sequences often contain weakly supported and uncertain branches. Information for their full resolution may lie in the dependency between gene families and their genomic context. Integrative methods, using species tree information in addition to sequence information, often rely on a computationally intensive tree space search which forecloses an application to large genomic databases.
Results
We propose a new method, called ProfileNJ, that takes a gene tree with statistical supports on its branches, and corrects its weakly supported parts by using a combination of information from a species tree and a distance matrix. Its low running time enabled us to use it on the whole Ensembl Compara database, for which we propose an alternative, arguably more plausible set of gene trees. This allowed us to perform a genome-wide analysis of duplication and loss patterns on the history of 63 eukaryote species, and predict ancestral gene content and order for all ancestors along the phylogeny.
Availability
A web interface called RefineTree, including ProfileNJ as well as a other gene tree correction methods, which we also test on the Ensembl gene families, is available at: http://www-ens.iro.umontreal.ca/~adbit/polytomysolver.html. The code of ProfileNJ as well as the set of gene trees corrected by ProfileNJ from Ensembl Compara version 73 families are also made available.
Introduction
Several gene tree databases from whole genomes are available, including Ensembl Compara [1], Hogenom [2], Phog [3], MetaPHOrs [4], PhylomeDB [5], Panther [6]. However they are known to contain many errors and uncertainties, in particular for unstable families [7]. Their use for accurate ancestral genome inference, orthology detection, or the study of genome dynamics could lead to erroneous results. For example Ensembl Compara trees, when reconciled with a species tree to annotate gene duplications and losses, systematically and unrealistically overestimate the number of genes in ancestral genomes, and lead to erroneous predictions of ancestral chromosome structures [8]. It is a known artifact, and a substantial number of nodes in the Ensembl gene trees are labeled as “dubious” [9].
Reasons for errors in gene trees are numerous. When constructed from multiple sequence alignments of homologous genes, they are dependent on gene annotations, gene family clustering or alignment quality, as well as on the accuracy of the models and algorithms used. But above all, gene sequences often do not contain enough substitutions to resolve all the branches of a phylogeny, or alternatively, too many such that the substitution history is saturated. Therefore sequence based methods, computing gene trees from sequence information (e.g. PhyML [10], RAxML [11], MrBayes [12], PhyloBayes [13]), are usually accompanied with measures of statistical support on their branches or a posteriori distributions of likely trees.
Another category of methods, designated here as integrative methods, use a species tree, in addition to a multiple sequence alignment, to model gene gains and losses inferred from the reconciliation between gene and species trees (e.g. TreeBeST [14], TreeFix [15], NOTUNG [16], PhylDog [8], ALE [17], GSR [18, 19], SPIMAP [20], Giga [21], MowgliNNI [22]). They all report gene trees with better accuracy compared with sequence based methods. But they leave a large space for improvement, both on tree quality and on computing time. In terms of models, they often assume unrealistic loss/retention ratios [23]. In terms of computation strategy, most of them use tree space exploration based on small modifications or local moves on branches (typically NNI, SPR, TBR), usually proposed at random. Moves are accepted or rejected according to hill-climbing, Metropolis-like criteria, or other statistical or empirical arguments. Such exploration methods are computationally intensive and do not scale well as databases grow in size. Consequently, database construction pipelines such as TreeBeST (constructing the Ensembl Compara gene trees [1]) have to adopt compromises, exploring a limited tree space. Improving local exploration can be done by using some correction techniques, most of them based on the idea of selecting local moves reducing the cost of reconciliation with a species tree (e.g. [16, 22, 24–33]). But even with such improvements, it remains that most local search strategies have no guarantee, neither on running time, nor on the quality of the solution.
In this paper, we propose a new gene tree correction method, called ProfileNJ, which can be directly used as a fast integrative method, without local search. It is a deterministic approach with a guaranteed time complexity. ProfileNJ takes as input a starting tree with supports on its branches, and outputs a set of rooted binary trees containing all well-supported branches of the starting tree, and minimizing the number of duplications and losses when reconciled with a given species tree. It is based on PolytomySolver, a previous algorithm developed by our group [31] for resolving a non-binary tree in a way minimizing a reconciliation cost. Its theoretical complexity has been shown to outperform previous algorithms for the same problem, namely NOTUNG [16] and Zheng and Zhang’s algorithm [34]. ProfileNJ extends PolytomySolver by integrating Neighbor-Joining (NJ) principles to choose among the numerous optimal solutions. Among all trees with equal reconciliation cost exhibiting the same gene count on branches, a choice is made with NJ principles, based on a distance matrix computed from gene sequences. Other extensions are considered such as allowing different costs for duplications and losses, as well as non rooted trees as input. These extensions turn an algorithmic principle into a workable method suited for constructing trees from biological data.
We compare ProfileNJ with TreeFix [15]. Indeed, among all correction methods, TreeFix adopts the most similar evaluation strategy, by exploring neighboring trees which are statistically equivalent according to the sequences. Moreover, TreeFix is among the best available integrative methods, according to the quality of the output trees and running time. On simulations, both algorithms achieve results of comparable quality, but ProfileNJ is several times faster. We also ran ProfileNJ on the whole set of gene families from the Ensembl database, which is out of reach for competing methods with comparable quality. The trees for the whole database were processed by ProfileNJ in less than 48h sequential time. In the same amount of time, 0.5% of the database was corrected by TreeFix. The trees we propose compare very favorably with the trees stored in Ensembl. We then used the reconstructed trees to study genome evolution across the 63 eukaryotic species from the Ensembl database (release 73). A whole genome analysis of duplication patterns is provided, pointing at certain branches which seem to show acceleration of duplication or loss processes.
ProfileNJ is integrated into a modular interface called RefineTree that contains two other gene tree correction tools using information on extant and ancestral synteny [32, 33]. We could thus evaluate the results of a pipeline taking into account gene sequence, gene content and chromosome structure evolution on the Ensembl database according to several criteria: (1) likelihood ratio based on the Ensembl alignments; (2) ancestral genome sizes based on reconciliation with the species tree; (3) linearity of ancestral chromosomal segments computed with DeCo [35]. We discuss the improvements brought by each type of method and the distance of their output to “true” gene trees, in the light of incomplete lineage sorting and gene conversion.
Description of ProfileNJ
The basic vocabulary of phylogenetic trees is taken from [36], and the reconciliation method between a rooted binary gene tree G and a rooted binary species tree S is recalled in the Method section. Just note that in a reconciled gene tree G, each node (representing an extant gene if at a leaf or an ancestral gene if at an internal node) is mapped to the node of S corresponding to the genome the gene belongs to. Edges of G are subdivided, adding extra vertices and pending edges so that the extremities of an edge map either to the same node, or to two extremities of an edge of S. An internal node of G is a duplication if it maps to the same node of S as one of its child. The number of genes in a species s ∈ V(S) induced by a reconciled gene tree G is defined as the number of nodes x ∈ V(G) mapped to s, such that the parent of x does not map to s. The reconciliation cost is either the number of duplications and losses, or a linear combination of the two if different weights are given to the two kinds of events. Also note that when two trees G1 and G2 have the same genes at their leaves, we can say that a branch of G1 is present in G2 if the bipartition of the leaves induced by this branch in G1 is also induced by a branch of G2.
ProfileNJ is a gene tree correction algorithm that takes as input a gene tree (rooted or non rooted) G for a given gene family with supports on its branches, and improves it according to the available information, taken from a species tree, a distance matrix and a threshold number for statistical support. It can be viewed as a generalization of three different standard algorithms designed for evolutionary studies:
The Wagner parsimony method applied to the inference of ancestral gene contents from the extant gene contents by minimizing a duplication and loss cost [37];
The Neighbor-Joining [38] (NJ) method which constructs a tree from a distance matrix D between taxa;
The reconciliation of a gene tree G with a species tree S [39].
Whereas these three methods do not have much in common a priori, they are all bricks of our solution and each of them reduces to some particular case of our problem. ProfileNJ outputs a rooted binary gene tree Gc on the same gene family as the input gene tree G, where all branches of G with a support above the threshold are present in Gc. Among all such trees, ProfileNJ outputs those minimizing a duplication and loss cost when reconciled with the species tree with respect to the NJ criterion.
ProfileNJ is an extension of PolytomySolver, a previous algorithm developed by our group [31]. We first describe the principle of the latter and then describe the additions.
PolytomySolver: It takes as input a multifurcated rooted gene tree G (with non-binary nodes) and a binary rooted species tree S. It outputs a binary rooted gene tree containing all branches of G, that minimizes the number of duplications and losses when reconciled with S. It has been shown by [31] that each polytomy (multifurcated node) of G can be considered independently. Therefore, in the following, we restrict the presentation to a single polytomy P (s.f. polytomy P in Fig 1).
The algorithm, based on dynamic programming, computes a table M where, for each node (including leaves) s of S and each integer k (limits on k are discussed in [31]), M(s, k) is the reconciliation cost of a gene tree with k genes in species s before any duplication in s. For example, in Fig 1, P has three genes belonging to genome b, and thus M(b, 1) = 2 as any solution having one gene in b before any duplication in b means that two duplications must have occurred in b, while M(b, 4) = 1 as having four genes induces one gene loss on b.
The final cost of a minimum refinement of the polytomy is given by M(r, 1), where r is the root of S. Using a backtracking approach, PolytomySolver then outputs a count vector V containing the number of genes per node of S. Notice that, by construction, two brother nodes of S (nodes with the same parent) have the same count. Then a gene tree G of minimum cost M(r, 1) is found, such that in the reconciliation of G with S there are exactly V[s] genes in each s ∈ V(S). For example, the final binary tree in Fig 1 has two maximal trees rooted at b, as required by the count vector V.
If the reconciliation cost is the number of duplications and losses (i.e. the same unit cost is attributed to each duplication or loss), as it was initially published [31], the table M can be constructed in time linear in the size of S [31], leading to a linear-time algorithm for finding one optimal refinement of the polytomy. Moreover, we showed recently [40] that linearity can be extended to a whole gene tree involving multiple polytomies. For weighted operations, i.e. different costs for duplications and losses, the algorithm runs in quadratic time [40].
Extensions: ProfileNJ consists in contracting branches with low support in G, which leads to polytomies, and applying Polytomysolver. If G is not rooted, one node is chosen as the root. All nodes can be tried and one minimizing the cost can be chosen. The first phases of Polytomysolver are applied, until the construction of a count vector V.
Our main extension concerns a treatment of the multiplicity of solutions, as it can be exponential. Indeed, the backtracking procedure mentioned above may lead to many optimal count vectors, and for each count vector, there are possibly several gene trees in agreement with it. Therefore, exploring the set of optimal trees requires exploring the set of all gene trees in agreement with each count vector. For example, the count vector of Fig 1 induces two duplications in b. However this vector involves no information on which of the three genes b1, b2, b3 should be joined first. Such information can be deduced from the pairwise alignment distance between gene sequences.
Suppose that a pairwise distance matrix D is available for the gene family. Then the problem can be seen as selecting, among all optimal solutions possibly output by PolytomySolver given a vector V, the one best reflecting the distance matrix D. The problem of constructing a solution such that its induced distance is close to D according to a standard measure of metric spaces comparison is NP-complete. But it is also known to be empirically and, to a certain extent, theoretically, well approximated by Neighbor-Joining (NJ) [38, 41]. In ProfileNJ, we use such an NJ approach for choosing neighboring genes.
As in the NJ algorithm, a metric space E induced by D on the leaves of P, is progressively augmented with newly created genes. The algorithm proceeds by successively joining pairs of nodes (points of E), eventually leading to a full binary tree. For example, in Fig 1, the initial metric space E contains the nodes {a1, a2, b1, b2, b3, c}. Joining the nodes b1 and b2 leads to the new set of nodes {a1, a2, b3, b4, c}. Nodes to be joined are selected according to the NJ criterion, namely we select from a node set of size n, the couple of genes x and y minimizing:
(1) |
The metric space E is updated after each join r = (x, y) by removing x and y, adding a new node r, and computing the distance between the newly created node r with each element t of E. When x and y are not created artificially (i.e. they are not loss nodes created with the last instruction of Algorithm 1), this is done using the NJ formula:
(2) |
Otherwise, if x is a loss, we set D(r, t) = D(y, t) and if y is a loss, D(r, t) = D(x, t).
A full pseudo-code of the extension part of ProfileNJ is written as Algorithm 1. It works on one polytomy P, assuming that all polytomies below have been resolved. It takes as input a count vector V, the species tree S and a distance matrix D defining the metric space E. It outputs a refinement of P in agreement with V, resulting from the performed joins on the nodes of E. Given a node s of S, denote by E(s) the subset of E restricted to the genes belonging to s, and by m(s) = |E(s)| the multiplicity of s in E. The tree S is processed bottom-up. For each internal node s, speciations are considered first by clustering, using the NJ criterion, the genes from E(sl) with the genes from E(sr), where sl and sr are the two children of s. If the obtained multiplicity m(s) of s is greater than the desired count V[s], then duplications are performed, again using the NJ criterion for choosing the gene pairs in E(s) to be joined. Otherwise, if m(s) is lower than the desired count V[s] of gene copies, then losses are predicted.
Algorithm 1 ProfileNJ (S,P,V,D)
Let E be the metric space with nodes corresponding to the leaves of P;
For each node s of S in a bottom-up traversal of S Do
If s is an internal node of S with chidren sl, sr Do
{By construction, V[sl] = V[sr] = n}
For i = 1 to n Do
Choose in E(sl) × E(sr) the gene pair (gl, gr) minimizing Eq (1) and create
the node g = (gl, gr);
Remove gl and gr from E and add g;
Compute D(g, g′) for all g′ ∈ E using Eq (2);
End For
End If
If m(s)>V[s] Do
For i = 1 to m(s) − V[s] Do
{Perform m(s) − V[s] duplications}
Choose in E(s) × E(s) the gene pair (g1, g2) minimizing Eq (1), and create
the node g = (g1, g2);
Remove g1 and g2 from E and add g;
Compute D(g, g′) for all g′ ∈ E using Eq (2)
End For
End If
Else If m(s)<V[s] Do
{Perform V[s] − m(s) losses}
Add V[s] − m(s) artificial genes to E(s), each with infinite distance to all elements of E;
End If
End For
Complexity: Let G be the solution output by the algorithm, and suppose that G has n leaves after the inclusion of lost genes. Then exactly n − 1 NJ operations have been performed. Each join calculation is restricted to a subset of the genes, and so the time required to perform these joins is bounded by the time required to run the classical NJ algorithm on the n leaves of G, which is O(n3). Note that n can be as large as |V(P)||V(S)|, making the worst case running time O(|V(P)|3|V(S)|3). However this worst case only occurs when O(|V(S)|) losses are inserted on each branch of the solution. In practice n is in O(|V(P)|).
A Multi-functional algorithm: ProfileNJ is a phylogenetic tool that generalizes several usually unrelated standard methods. Indeed, if G is a binary rooted tree, then ProfileNJ can be seen as a reconciliation tool. If G is unrooted, then ProfileNJ can be used to choose an appropriate root according to the induced reconciliation cost. On the other hand, various ways of contracting branches can be considered. For example an exploration scheme contracting the branches one by one and applying ProfileNJ can be considered, which would be equivalent to local modifications [27]. A more radical modification would be to contract all branches, leading to a star tree. In this case, ProfileNJ can be seen as a tool for computing ancestral gene content with Wagner parsimony, minimizing the cost of duplications and losses. If the star tree has all its genes belonging to a single species, ProfileNJ returns an NJ tree. Other kinds of contraction schemes can be imagined, as contracting branches around “Non Apparent Duplications” [42], or “Dubious duplications” stored in the Ensembl trees.
Notice that the pseudo-code for ProfileNJ has been given for a single count vector. For a full exploration of the tree space, all count vectors should be considered. This is what we do in the simulation section.
Efficiency of ProfileNJ
Efficiency of the NJ criterion
We ran ProfileNJ twice on the same data set of 20519 trees (the Ensembl Compara gene families), except that once the distance matrix was computed using the Ensembl nucleotide alignments with FastDist from the FastPhylo package [43], and once the distance matrix was random. The starting tree was computed for every family using PhyML on the nucleic alignments, and all branches with aLRT support <0.95 were contracted. In average 55% of the branches were contracted. A histogram of the full distribution is shown in Fig A in S1 File.
Then we computed the likelihood of both trees for every family with PhyML. Among the trees for which the likelihood was different (55% of all tested trees), 76% were in favor of the trees built with the FastDist distance matrix. These 76% of trees account for 95% of the total sum of likelihood differences on all trees.
The comparisons are clearly in favor of the NJ criterion over no criterion at all, while quantitatively there remains a small but non negligible part of the trees for which no criterion (the random distance matrix) gives an unexplained slightly, but significantly, better likelihood.
Efficiency of the tree space exploration strategy on simulated gene trees
We compared ProfileNJ with TreeFix, the most closely related tool, on simulated data. The principle of TreeFix is to randomly explore, by local moves, the space of trees that are statistically equivalent to the input tree, and report the one with the best reconciliation cost. Instead, we take a deterministic and more targeted approach by focusing on weakly supported branches of the tree, with a possibly deep modification of the tree. The comparison with TreeFix is therefore intended to compare these two space exploration strategies.
In [15], TreeFix has been compared with NOTUNG [16] and SPIMAP [20], showing a better accuracy than NOTUNG, and a higher speed than SPIMAP. We perform a similar comparison on the same simulated dataset of 16 fungi. This dataset consists of simulated gene families generated under the SPIMAP model and their corresponding nucleotide alignments, for four different rates of duplication and loss (DL) events: (1 × rD, 1 × rL), (2 × rD, 2 × rL), (4 × rD, 4 × rL) and (4 × rD, 1 × rL), where rD and rL are respectively the estimated duplication and loss rates for fungi. For instance, a (2 × rD, 2 × rL)-simulated gene family is expected to have, on average, two times more duplications and losses than a real gene family in fungi. Comparisons reported in this section are performed on 2575 simulated gene families randomly chosen from the four fungi datasets with different DL rates.
An initial maximum likelihood (ML) tree is constructed for each simulated gene family with RAxML v-8.1.2 [11], with the rapid bootstrap algorithm, under the GTR-Γ model and the majority rule consensus tree as bootstopping criterion. A randomly rooted tree is then provided as input to TreeFix (as TreeFix requires the input tree to be rooted), while a multifurcated non rooted tree obtained by contracting the branches with support lower than 95% is provided as input to ProfileNJ. We used default parameters for both programs. Among the set of all binary trees output by ProfileNJ (one for each count vector), the best statistically supported tree was selected using RAxML under the GTR-Γ model of nucleotide substitution.
For RAxML, TreeFix and ProfileNJ trees, we measured the Robinson-Foulds (RF) distance to true trees, compared the reconstructed tree with the true tree using site-wise likelihoods (see Fig G in S1 File), measured the accuracy of the duplication and loss scenarios (Fig E in S1 File), the sensitivity of the accuracy to gene family size (Fig F in S1 File), the sensitivity to species tree errors (Fig H in S1 File), and the running time.
Fig 2 illustrates the results for the RF distance. It shows that sequence-only does not contain enough signal to lead to the true tree for our simulated dataset, and integrating additional information from the species tree actually improves reconstruction. Indeed, TreeFix and ProfileNJ reconstruct around 75% of true trees, compared with only 10% for RAxML. We investigated some cases where erroneous gene trees were inferred, and found that often, the true scenario was not parsimonious in terms of duplications and losses, while TreeFix and ProfileNJ chose duplications that are too recent in order to avoid losses. An example is given in Fig D in S1 File.
The performances of TreeFix and ProfileNJ are similar in terms of distance to the true tree. As for RAxML, it gives the best likelihood, which is not surprising as it is specifically designed for that. The returned likelihood is even usually higher than the likelihood of the true tree, but not significantly according to an AU test (Approximately unbiased [44], see Fig G in S1 File). TreeFix is designed to produce trees which are not significantly different than the ML tree, which we could check: 1.36% of the trees fail the AU test against the ML tree at α = 0.05, while the proportion jumps to 9.17% for ProfileNJ. It is noticeable that this has no visible consequence on the distance to the true tree (Fig C in S1 File).
Fig 3 shows that ProfileNJ outperforms TreeFix in running-time, the gap between the two algorithms increasing with tree size. This figure also shows that the most time-consuming step in ProfileNJ is tree selection. For a tree of size 30, ProfileNJ is about four to seven times faster than TreeFix, and about 15 times faster if we discard statistical support evaluation and tree selection step with RAxML. This includes the construction of the distance matrix, but not the construction of the initial RAxML, as it is common to both methods.
Other analyses, including the sensitivity to gene family size and the number of duplications and losses, are reported in S1 File. They lead to the same conclusions: TreeFix and ProfileNJ have similar performance on all measures except running time for which ProfileNJ is significantly better.
Application on biological data
Software: RefineTree
ProfileNJ is integrated in a new modular online software, called RefineTree, designed to combine a number of correction techniques, with an easy-to-use interface (see Fig 4). The present version includes, in addition to ProfileNJ, a tool called ParalogyCorrector [32] for correcting orthology relations. ParalogyCorrector takes as input a gene tree and a set of known pairwise orthology relations between genes, which would typically be derived from synteny comparisons, and constructs the tree which is the closest to the input tree according to the RF distance, with the constraint that couples of putative orthologs must be orthologs in the reconciliation (see Method section).
RefineTree can be used in a modular way, according to the user’s specifications. It has been designed to be easily extensible to other tools. For example instead of asking the user to input his own orthology relations, tools for inferring putative orthologs can be included.
Protocol
The low running time of ProfileNJ makes it possible to run it on large databases. Here, we exhaustively use ProfileNJ to correct all trees of the Ensembl Compara database, containing 63 eukaryotic whole genomes. Gene trees are constructed for 20519 families. In order to quantify the contribution of ProfileNJ and the contribution of methods using other kinds of information as synteny, we compared three sets of trees on the whole database.
Ensembl trees: Trees stored in the Ensembl gene family database (see Method section);
ProfileNJ trees: Trees output by ProfileNJ with unrooted PhyML trees as input (where branches with aLRT support <0.95 are contracted) and FastDist distance matrices. A single solution is retained for the rooting leading to a minimum weighted reconciliation cost (see Method section);
Synteny trees: Trees output by either ParalogyCorrector or Unduplicator [32] (the two are computed and the most likely according to the sequence is chosen) with ProfileNJ trees as input, using PhylDiag [45] and DeCo [35] to infer synteny constraints (see Method section and Fig 5)
Results on Ensembl gene trees
The PhyML trees for the 20519 gene families were processed by ProfileNJ in less than 48h sequential time. In the same amount of time, 0.5% of the database was corrected by TreeFix.
We evaluated the resulting trees (all made available at the web page indicated in the abstract) according to sequence likelihood, ancestral genome content and ancestral chromosome linearity. The ancestral genome content metric is based on the assumption that the distribution of ancestral gene content sizes should be close to that of extant genomes. Incorrect trees are known to require additional duplications to be reconciled with the species tree, which tends to increase the number of genes in ancestral genomes. The ancestral chromosome linearity metric is based on the assumption that the linearity of ancestral genomes is expected to be as close as possible to that of the extant genomes, with each gene having zero, one or two neighbors, with a peak at two (having genes with zero or one neighbor is usually due to partially assembled genomes).
Results are given in Fig 6. ProfileNJ trees show a better behavior than Ensembl trees according to the three measures: more than 2/3 of the trees have a better likelihood than Ensembl trees, ancestral genome content distribution is much closer to the extant one, and linearity of chromosomes is higher. Therefore this set of trees, achieving better performance according to sequence evolution, gene content evolution and chromosome evolution, is arguably a better dataset than the one stored in the Ensembl database.
However, results obtained when we include synteny information are less clear. Indeed, quality of synteny trees drops in terms of likelihood (Fig 6(A)), but jumps in terms of the stability of gene content and the linearity of ancestral chromosomes (Fig 6(B) and 6(C)).
Results on duplication and loss history in Eukaryotes
Partial patterns of duplications and losses in eukaryotes have been considered in previous studies, as for example by [8] in mammals with a subset of gene families, or by [46] in vertebrates with a subset of species. The ability of ProfileNJ to handle the whole Ensembl database allowed us to perform a more exhaustive study. In addition to gene trees, we reconstructed all ancestral gene contents and organizations. Gene content is computed according to reconciliation (see Methods), and genome organization, which consists in sets of links between consecutive genes, is inferred with DeCo. Genes are not always clustered into full linear genomes. Such non-linearity has diverse causes that we do not wish to mask with an ad-hoc linearization method. An interesting property of DeCo is to highlight genes or groups of genes evolving together in parts of the tree. For example 8488 blocks of co-duplicated genes are inferred by DeCo on the considered eukaryote dataset. Most of them contain only a few number of genes (83% contain 2 genes). The largest blocks are found in the terminal branches leading to Danio rerio and Caenorhabditis elegans.
Fig 7 shows the result for the full genomes of the full phylogeny of the 63 Ensembl species. As seen in Fig 7, duplication rates are highly variable across branches of the phylogeny. Branches with a large number of duplications (hot branches) are those leading to vertebrates, which is in agreement with the two rounds of whole genome duplication hypothesis. Interestingly, the speciation event leading to Petromyzon marinus, which is usually thought to have diverged after these events [47], precedes the hot branches. This may be in agreement with recent results based on the analysis of Hox clusters in the Japanese lamprey [48]. Another hot branch leads to eutherian mammals, which was also found by two other studies [8, 46] with partial data. These two hottest internal branches are exactly the ones found by Mahmudi et al [46] using a probabilistic technique, but using only 9 species due to computational cost. Other hot branches are terminal, the hottest being those leading to Caenorhabditis elegans and Danio rerio. This is possibly due to ongoing dynamics of polymorphic copy number variations. The same tree showing the number of losses is provided in Fig I in S1 File.
Discussion
ProfileNJ is a new gene tree correction method based on exploring a restricted tree space and choosing the most likely tree according to a species tree and a distance matrix on gene sequences. It is shown to be accurate and it outperforms in running time the most comparable existing correction methods. Efficiency in running time allowed us to apply ProfileNJ to the entire Ensembl database.
Trees obtained by correcting PhyML trees with ProfileNJ are arguably better than gene trees stored in Ensembl, according to sequence likelihood, ancestral genome content and ancestral chromosome linearity. We also corrected directly the Ensembl trees and the results (not shown) were similar, ProfileNJ giving better ancestral genomes and more likely trees than the starting trees. Based on such accurate trees, we have been able to perform an exhaustive study of the patterns of duplication and loss on the phylogeny of the 63 Eukaryote species included in Ensembl.
Accounting for synteny
In addition to sequence and phylogeny, we also used synteny information and orthology relations as correction criteria. As gene trees contain the most complete information about a gene family history, detecting orthologs or studying gene repertoire evolution should be achieved by interpreting trees. But due to the rate of errors in the current trees stored in databases, orthology is often assessed with a series of techniques including synteny [50] and Reciprocal Best Hits, while the evolution of gene repertoires is often studied with phyletic profile techniques [51]. What we presented here is a way of integrating those diverse techniques into a phylogenetic framework.
Unfortunately, integrating synteny information results in a drop of gene tree quality in terms of likelihood. One interpretation is that achieving orthology constraints may require breaking well supported branches. Part of the likelihood drop could also be interpreted as an inadequacy of the sequence evolution models to appropriately account for gene families with a high rate of duplications. Further, as observed in our simulations, the true tree is not necessarily the ML tree. Finally, the likelihood is computed with an alignment that usually results from a guiding tree, estimated using fast but crude approaches, and often different from the tested tree. Some synteny trees might therefore be better trees even in cases where sequence likelihood disagrees, because sequence likelihood can be incorrect.
However there is a third interpretation. Synteny information describes the history of loci [52], while phylogenetic models describe the evolution of sequences. Loci and sequences often have the same history, but they may differ following gene conversion or incomplete lineage sorting (ILS).
In case of ILS or gene conversion, two different true versions of the gene history are concurrent. In Fig 8 the gene as a locus has a history depicted by the right tree, while the gene as a sequence has a history depicted by the left tree. None of the two are wrong, but they are significantly different. They highlight the ambiguity of the definition of a gene, which yields an ambiguity in its history. Sequence trees will have a high likelihood and mediocre results for gene contents and synteny when constructed from duplication and loss scenarios, while it is the opposite for locus trees. A probabilistic model that incorporates ILS in sequence and duplications and losses in loci has been proposed [52]. However, no model is currently able to handle conversion.
Not only the gene trees
Using genome evolution in the construction of the gene trees, we get ancestral genomes as a byproduct. They are made of genes and sets of gene adjacencies. They are still too big (in terms of gene number) and too non linear to be fully trusted. This is partly due to incorrect gene trees in our output, or incorrect inferences from DeCo, but also to problems in sequencing, assembling, annotating genomes, clustering families or inferring the species tree. Good methods for finding linear structures from a set of adjacencies exist [53]. Here we rather used non-linearity as a testimony of the flaws of the data and methods used to reconstruct genome evolution.
Although gene trees are “better” with our correction, they still could be improved. The likelihood drop for synteny correction is indeed surprising, as these corrections lead to ancestral genomes that are closer to gene content and gene neighborhoods of extant genomes. We would need better exploration schemes with integrated models to really trust gene trees on a whole genome database within a deep phylogeny.
Methods
Families, alignments and trees were taken from Ensembl Compara release 73, sept 2013. They were computed with a pipeline called TreeBest, but we simply call them the “Ensembl trees”. There are 20529 rooted trees, each corresponding to a gene family, for a total of 1091891 genes taken from 63 species. 13128 of these trees have 3 leaves or more, and are thus treated by our phylogenetic study. Information on gene position on chromosomes, scaffolds or contigs is available at ftp://ftp.ensembl.org/pub/release-73/emf/ensembl-compara/homologies/.
Use of ProfileNJ on Ensembl
PhyML was used with default parameters to compute maximum likelihood trees from the protein multiple alignments taken from Ensembl, for families with 4 sequences or more. An aLRT support was computed, and only branches with support ≥0.95 were preserved by ProfileNJ. FastDist was run on DNA alignments to provide a distance matrix. Then ProfileNJ was run with the command (an example is given for the first family).
ProfileNJ -s Compara.73.species_tree \\
-g data/famille_1.start_tree \\
-d data/famille_1.dist \\
-o data/famille_1.tree \\
-n -r best -c nj --slimit 1 \\
--plimit 1 --firstbest --cost 1 0.99999
We tested the sensitivity of the method to the choice of the threshold parameter for contracting unsupported branches. The threshold is a trade-off between the amount of change in a tree and the probability that the resulting tree is rejected. Values that are too high would avoid exploring a large space around the starting tree while small values would lead to low likelihood trees. It has to be settled empirically. For example.80 bootstrap threshold was considered an acceptable threshold in some genomic studies [54].
Ancestral genes and genomes with DeCo
DeCo [35] computes ancestral genes and gene neighborhoods from gene trees and extant gene neighborhoods. It was used to compare ancestral gene contents and orders with different sets of gene trees for the same set of gene families to generate Fig 6. It takes as input rooted gene trees and adjacencies, i.e. couples of extant genes that are consecutive on a chromosomes. Ancestral adjacencies are constructed according to a parsimony principle minimizing the number of gains and losses of adjacencies.
Adjacencies at different loci on chromosomes are supposed to evolve independently one from the other. As a consequence, although an extant gene should belong to at most two adjacencies (zero or one is frequent because of low assembly quality of some genomes), it is not necessarily the case for ancestral genes. As a consequence ancestral genomes are not necessarily linear arrangements of genes. It might be seen as a weakness of this ancestral genome reconstruction method, but for us it is a criterion to evaluate the quality of gene trees. Indeed, a high part of the non linearity of ancestral genomes is not due to the inadequacy of the software itself, but to the quality of the input data. It has been remarked that a significant improvement in the linearity of ancestral genomes was obtained by constructing gene trees according to more integrative models [8, 55].
Information from extant synteny
Orthology constraints for synteny trees are inferred as follows. If several genes are found consecutive in one genome, and their homologs are also found consecutive in the other genome, the common linear arrangement was in the ancestor and the homologous genes are probably orthologous. This hypothesis is incorrect in at least three cases: (1) if the whole block of genes was duplicated, (2) if there is a tandem duplication of a gene followed by a differential loss in the two species, or (3) if a gene is converted by a paralog. To handle these cases, we require that (1) the majority of the homologous genes are indeed predicted as orthologs by phylogeny, (2) the common ancestor of two homologous genes does not lead to two paralogous descendants placed in tandem in one species. In case (3), we are in a situation where the loci are orthologous but not the sequences. In that case we construct the “locus tree” [52] and trust syntenic information over gene sequence information.
First we ran PhylDiag as follows, for each pair of genomes. Files genome_1, genome_2 and ancestral_genes respectively contain the ordered list of genes from each genome, and the list of families clustering the genes as in the Ensembl database.
phylDiag.py genome_1 genome_2 ancestral_genes \\
-gapMax = 2 -pThreshold = 0.00000005 \\
-filterType = InBothSpecies -multiprocess \\
-minChromLength = 2 >syntenyblocks_1_2
The statistical threshold is calculated in order to minimize the number of false positives, taking into account the number (2211) of comparisons between pairs of species and an expected number (500, as an approximate average of the number of found synteny blocks) of synteny blocks for each comparison (0.05/(2211*500) ≈ 5e − 8).
For each synteny block found by PhylDiag, we kept only the genes that had one single exemplar in the two blocks from both species. We counted the number of such pairs of genes, and referred to an LCA reconciliation of the output trees of ProfileNJ to check that most pairs are orthologs (their common ancestor is labeled by a speciation). We discarded the blocks that did not fit this condition. This discards possible block duplications.
For the remaining blocks, and for each couple of uniquely represented genes a and b, we required that the LCA node X of a and b in the reconciled ProfileNJ tree is not a supported duplication: let X1 and X2 be the two children of the node X labeled as a duplication (so X1 and X2 are in the same species as X), the genes a and b are not kept as putative orthologs if one of the branches XX1 and XX2 has a high support (>0.95), and there are two genes, x1 and x2, which respectively descend from X1 and X2, which are located on the same genome. This discards possible tandem duplications in the block, followed by differential losses of copies.
The output trees from ProfileNJ as well as the filtered pairs of putative orthologs were given as input to ParalogyCorrector, which finds the tree that is as close as possible to the input tree in terms of RF distance, such that in an LCA reconciliation, all pairs of putative orthologs have an LCA node annotated as a speciation.
Information from ancestral synteny
From the results of DeCo on the output gene trees produced by ProfileNJ, we used an “unduplication” principle as in [33] every time we found that an ancestral gene x had three neighbors a, b, c, two of them (say a, b) arising from a duplication node d in a single gene tree. In that case, we rearranged the four grand children of d so that the clade under d has an LCA which is annotated as a speciation in the LCA reconciliation. See an insight into its functioning in Fig 9.
Likelihood ratio tests
We computed the likelihood of all trees according to the HKY85 model with PhyML on nucleotide alignments. To test the significance of a likelihood difference, we computed the AU tests (Approximately unbiased, [44] with RAxML and Consel [56]. Unless otherwise stated, we use “significantly” better for a tree with a AU value >0.95.
Supporting Information
Data Availability
The code of ProfileNJ is available at: https://github.com/UdeM-LBIT/profileNJ and a web interface called RefineTree, including ProfileNJ as well as several published gene tree correction methods based on synteny, which we also test on the Ensembl gene families, is available at: http://www-ens.iro.umontreal.ca/~adbit/polytomysolver.html. The set of gene trees corrected by ProfileNJ from Ensembl Compara version 73 families is available at: ftp://pbil.univ-lyon1.fr/pub/datasets/PlosOne2016/ and is also provided as a Supporting Information file.
Funding Statement
MS, LG, BB and ET were supported by the French Agence Nationale de la Recherche (ANR) Grant ANR-10-BINF-01-01 “Ancestrome”. EN, ML, JS and NEM were supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), and the “Fonds de recherche du Québec Nature et technologies” (FRQNT) of Quebec. Computations were made on the supercomputer “Briarée” from Université de Montréal, managed by Calcul Québec and Compute Canada.
References
- 1. Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara gene trees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Research. 2009;19:327–335. 10.1101/gr.073585.107 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, et al. Databases of homologous gene families for comparative genomics. BMC Bioinformatics. 2009;10 Suppl 6:S3 10.1186/1471-2105-10-S6-S3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Datta RS, Meacham C, Samad B, Neyer C, Sjölander K. Berkeley PHOG: PhyloFacts orthology group prediction web server. Nucleic Acids Research. 2009;37:W84–W89. 10.1093/nar/gkp373 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Pryszcz LP, Huerta-Cepas J, Gabaldón T. MetaPhOrs: orthology nd paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Research. 2011;39:e32 10.1093/nar/gkq953 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Huerta-Cepas J, Capella-Gutierrez S, Pryszcz LP, Denisov I, Kormes D, Marcet-Houben M, et al. PhylomeDB v3.0: an expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions. Nucleic Acids Research. 2011;39:D556–D560. 10.1093/nar/gkq1109 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Mi H, Muruganujan A, Thomas PD. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Research. 2012;41:D377–D386. 10.1093/nar/gks1118 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Boeckmann B, Robinson-Rechavi M, Xenarios I, Dessimoz C. Conceptual framework and pilot study to benchmark phylogenomic databases based on reference gene trees. Brief Bioinform. 2011;12(5):423–435. 10.1093/bib/bbr034 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Research. 2013;23:323–330. 10.1101/gr.141978.112 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic Acids Research. 2014;42(Database issue):D749–D755. 10.1093/nar/gkt1196 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Guindon S, Gascuel O. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology. 2003;52:696–704. 10.1080/10635150390235520 [DOI] [PubMed] [Google Scholar]
- 11. Stamatakis A. RAxML-VI-HPC: Maximum likelihood-based phylogenetic analysis with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–2690. 10.1093/bioinformatics/btl446 [DOI] [PubMed] [Google Scholar]
- 12. Ronquist F, Huelsenbeck JP. MrBayes3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19:1572–1574. 10.1093/bioinformatics/btg180 [DOI] [PubMed] [Google Scholar]
- 13. Lartillot N, Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution. 2004;21(6):1095–1109. 10.1093/molbev/msh112 [DOI] [PubMed] [Google Scholar]
- 14.Schreiber F, Patricio M, Muffato M, Pignatelli M, Bateman A. TreeFam v9: a new website, more species and orthology-on-the-fly. Nucleic Acids Research. 2013;. [DOI] [PMC free article] [PubMed]
- 15. Wu YC, Rasmussen MD, Bansal MS, Kellis M. TreeFix: Statistically informed gene tree error correction using species trees. Systematic Biology. 2013;62(1):110–120. 10.1093/sysbio/sys076 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Chen K, Durand D, Farach-Colton M. Notung: Dating Gene Duplications using Gene Family Trees. Journal of Computational Biology. 2000;7:429–447. 10.1089/106652700750050871 [DOI] [PubMed] [Google Scholar]
- 17. Szöllősi GJ, Rosikiewicz W, Boussau B, Tannier E, Daubin V. Efficient exploration of the space of reconciled gene trees. Systematic Biology. 2013;62(6):901–912. 10.1093/sysbio/syt054 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Akerborg O, Sennblad B, Arvestad L, Lagergren J. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences USA. 2009;106(14):5714–5719. 10.1073/pnas.0806251106 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Arvestad L, Berglund AC, Lagergren J, Sennblad B. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution In: RECOMB; 2004. p. 326–335. [Google Scholar]
- 20. Rasmussen MD, Kellis M. A bayesian approach for fast and accurate gene tree reconstruction. Molecular Biology and Evolution. 2011;28(1):273–290. 10.1093/molbev/msq189 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Thomas PD. GIGA: a simple, efficient algorithm for gene tree inference in the genomic age. BMC Bioinformatics. 2010;11:312 10.1186/1471-2105-11-312 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Nguyen TH, Ranwez V, Pointet S, Chifolleau AMA, Doyon JP, Berry V. Reconciliation and local gene tree rearrangement can be of mutual profit. Algorithms for Molecular Biology. 2013;8(1):12 10.1186/1748-7188-8-12 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Konrad A, Teufel AI, Grahnen JA, Liberles DA. Toward a general model for the evolutionary dynamics of gene duplicates. Genome Biology and Evolution. 2011;3:1197–1209. 10.1093/gbe/evr093 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Durand D, Haldórsson BV, Vernot B. A hybrid micro-macroevolutionary approach to gene tree reconstruction. Journal of Computational Biology. 2006;13:320–335. 10.1089/cmb.2006.13.320 [DOI] [PubMed] [Google Scholar]
- 25. Gorecki P, Eulenstein O. A linear-time algorithm for error-corrected reconciliation of unrooted gene trees In: ISBRA. vol. 6674 of LNBI. Springer-Verlag; 2011. p. 148–159. [Google Scholar]
- 26. Gorecki P, Eulenstein O. Algorithms: simultaneous error-correction and rooting for gene tree reconciliation and the gene duplication problem. BMC Bioinformatics. 2011;13(Supp 10):S14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Chaudhary R, Burleigh JG, Eulenstein O. Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence. BMC Bioinformatics. 2011;13(Supp.10):S11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Berglund-Sonnhammer AC, Steffansson P, Betts MJ, Liberles DA. Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. Journal of Molecular Evolution. 2006;63:240–250. 10.1007/s00239-005-0096-1 [DOI] [PubMed] [Google Scholar]
- 29. Doroftei A, El-Mabrouk N. Removing Noise from Gene Trees In: WABI. vol. 6833 of LNBI/LNBI; 2011. p. 76–91. [Google Scholar]
- 30. Swenson KM, Doroftei A, El-Mabrouk N. Gene Tree Correction for Reconciliation and Species Tree Inference. Algorithms for Molecular Biology. 2012;7(1):31 10.1186/1748-7188-7-31 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Lafond M, Swenson KM, El-Mabrouk N. An Optimal Reconciliation Algorithm for Gene Trees with Polytomies In: LNCS. vol. 7534 of WABI; 2012. p. 106–122. [Google Scholar]
- 32. Lafond M, Semeria M, Swenson KM, Tannier E, El-Mabrouk N. Gene tree correction guided by orthology. BMC Bioinformatics. 2013;14 (supp 15)(S5). 10.1186/1471-2105-14-S15-S5 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Chauve C, El-Mabrouk N, Guéguen L, Semeria M, Tannier E. Duplication, Rearrangement and Reconciliation: A Follow-Up 13 Years Later In: Chauve C, El-Mabrouk N, Tannier E, editors. Models and Algorithms for Genome Evolution. London: Springer; 2013. p. 47–62. [Google Scholar]
- 34. Zheng Y, Zhang L. Reconciliation with Non-binary Gene Trees Revisited In: Lecture Notes in Computer Science. vol. 8394; 2014. p. 418–432. [Google Scholar]
- 35. Bérard S, Gallien C, Boussau B, Szöllősi GJ, Daubin V, Tannier E. Evolution of gene neighborhoods within reconciled phylogenies. Bioinformatics. 2012;28(18):i382–i388. 10.1093/bioinformatics/bts374 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Semple C, Steel M. Phylogenetics. Oxford Univ Press; 2003. [Google Scholar]
- 37. Kluge AG, Farris JS. Quantitative phyletics and the evolution of anurans. Syst Zool. 1969;18:1–32. 10.2307/2412407 [DOI] [Google Scholar]
- 38. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution. 1987;4:406–425. [DOI] [PubMed] [Google Scholar]
- 39. Maddison WP. Gene trees in species trees. Syst Biol. 1997;46:523–536. 10.1093/sysbio/46.3.523 [DOI] [Google Scholar]
- 40.Lafond M, Noutahi E, El-Mabrouk N. Efficient Non-binary Gene Tree resolution with Weighted Reconciliation Cost; 2016.
- 41. Gascuel O, Steel M. Neighbor-joining revealed. Mol Biol Evol. 2006;23(11):1997–2000. 10.1093/molbev/msl072 [DOI] [PubMed] [Google Scholar]
- 42. Lafond M, Chauve C, Dondi R, El-Mabrouk N. Polytomy refinement for the correction of dubious duplications in gene trees. Bioinformatics. 2014;30(17):i519–i526. 10.1093/bioinformatics/btu463 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Khan MA, Elias I, Sjölund E, Nylander K, Guimera RV, Schobesberger R, et al. Fastphylo: fast tools for phylogenetics. BMC Bioinformatics. 2013;14:334 10.1186/1471-2105-14-334 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Shimodaira H. An approximately unbiased test of phylogenetic tree selection. Syst Biol. 2002;51(3):492–508. 10.1080/10635150290069913 [DOI] [PubMed] [Google Scholar]
- 45. Lucas JM, Muffato M, Roest Crollius H. PhylDiag: identifying complex synteny blocks that include tandem duplications using phylogenetic gene trees. BMC Bioinformatics. 2014;15(1):268 10.1186/1471-2105-15-268 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Mahmudi O, Sjöstrand J, Sennblad B, Lagergren J. Genome-wide probabilistic reconciliation analysis across vertebrates. BMC Bioinformatics. 2013;14 Suppl 15:S10 10.1186/1471-2105-14-S15-S10 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Smith JJ, Kuraku S, Holt C, Sauka-Spengler T, Jiang N, Campbell MS, et al. Sequencing of the sea lamprey (Petromyzon marinus) genome provides insights into vertebrate evolution. Nature genetics. 2013;45(4):415–421. 10.1038/ng.2568 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Mehta TK, Ravi V, Yamasaki S, Lee AP, Lian MM, Tay BH, et al. Evidence for at least six Hox clusters in the Japanese lamprey (Lethenteron japonicum). Proceedings of the National Academy of Sciences. 2013;110(40):16044–16049. 10.1073/pnas.1315760110 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Rambaut A. Figtree; 2006. http://tree.bio.ed.ac.uk/software/figtree/
- 50. Sonnhammer ELL, Gabaldón T, Sousa da Silva AW, Martin M, Robinson-Rechavi M, Boeckmann B, et al. Big data and other challenges in the quest for orthologs. Bioinformatics. 2014; p. btu492. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. Cohen O, Ashkenazy H, Burstein D, Pupko T. Uncovering the co-evolutionary network among prokaryotic genes. Bioinformatics. 2012;28(18):i389–i394. 10.1093/bioinformatics/bts396 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Rasmussen MD, Kellis M. Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Research. 2012;22(4):755–765. 10.1101/gr.123901.111 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Maňuch J, Patterson M, Wittler R, Chauve C, Tannier E. Linearization of ancestral multichromosomal genomes. BMC Bioinformatics. 2012;13 Suppl 19:S11 10.1186/1471-2105-13-S19-S11 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Abby SS, Tannier E, Gouy M, Daubin V. Lateral gene transfer as a support for the tree of life. Proceedings of the National Academy of Sciences USA. 2012;109(13):4962–4967. 10.1073/pnas.1116871109 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Patterson M, Szöllősi G, Daubin V, Tannier E. Lateral gene transfer, rearrangement, reconciliation. BMC Bioinformatics. 2013;14 Suppl 15:S4 10.1186/1471-2105-14-S15-S4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Shimodaira H, Hasegawa M. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics. 2001;17(12):1246–1247. 10.1093/bioinformatics/17.12.1246 [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The code of ProfileNJ is available at: https://github.com/UdeM-LBIT/profileNJ and a web interface called RefineTree, including ProfileNJ as well as several published gene tree correction methods based on synteny, which we also test on the Ensembl gene families, is available at: http://www-ens.iro.umontreal.ca/~adbit/polytomysolver.html. The set of gene trees corrected by ProfileNJ from Ensembl Compara version 73 families is available at: ftp://pbil.univ-lyon1.fr/pub/datasets/PlosOne2016/ and is also provided as a Supporting Information file.