Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets

Xiaofan Zhou; Xing-Xing Shen; Chris Todd Hittinger; Antonis Rokas

doi:10.1093/molbev/msx302

2017 Nov 21;35(2):486–503. doi: 10.1093/molbev/msx302


PMCID: PMC5850867 PMID: 29177474

Abstract

The sizes of the data matrices assembled to resolve branches of the tree of life have increased dramatically, motivating the development of programs for fast, yet accurate, inference. For example, several different fast programs have been developed in the very popular maximum likelihood framework, including RAxML/ExaML, PhyML, IQ-TREE, and FastTree. Although these programs are widely used, a systematic evaluation and comparison of their performance using empirical genome-scale data matrices has so far been lacking. To address this question, we evaluated these four programs on 19 empirical phylogenomic data sets with hundreds to thousands of genes and up to 200 taxa with respect to likelihood maximization, tree topology, and computational speed. For single-gene tree inference, we found that the more exhaustive and slower strategies (ten searches per alignment) outperformed faster strategies (one tree search per alignment) using RAxML, PhyML, or IQ-TREE. Interestingly, single-gene trees inferred by the three programs yielded comparable coalescent-based species tree estimations. For concatenation-based species tree inference, IQ-TREE consistently achieved the best-observed likelihoods for all data sets, and RAxML/ExaML was a close second. In contrast, PhyML often failed to complete concatenation-based analyses, whereas FastTree was the fastest but generated lower likelihood values and more dissimilar tree topologies in both types of analyses. Finally, data matrix properties, such as the number of taxa and the strength of phylogenetic signal, sometimes substantially influenced the programs’ relative performance. Our results provide real-world gene and species tree phylogenetic inference benchmarks to inform the design and execution of large-scale phylogenomic data analyses.

Keywords: molecular evolution, tree space, topology, heuristic search

Introduction

Phylogenetic analysis, that is, the identification of the tree best representing the evolutionary history of the underlying data, is of fundamental importance to many biological disciplines, including but not limited to systematics, molecular evolution, and comparative genomics (Felsenstein 2003; Xia 2013; Hamilton 2014; Yang 2014). However, finding the best tree is an exceptionally difficult task, both because the evaluation of each tree requires a considerable number of calculations (Bryant et al. 2005) and because the number of candidate strictly bifurcating trees grows very rapidly with the number of sequences (Felsenstein 1978); for example, there are ∼8 × 10²¹ possible rooted topologies for a set of 20 taxa. Therefore, fast programs that employ heuristic algorithms to efficiently infer the best tree (or nearly as good alternatives) are of pivotal importance to phylogenetic analysis. This is evidenced by the success of the Neighbor-Joining (NJ) method, a distance-based clustering (instead of tree searching) algorithm (Saitou and Nei 1987) that is the most highly cited phylogenetic method (Van Noorden et al. 2014). NJ and its variants (e.g., BIONJ, which takes the variance of distance estimation into consideration) (Gascuel 1997; Bruno et al. 2000) were among the few available options for analyzing large data sets until the 2000s, and are still widely used today to quickly produce good starting points for more sophisticated methods (e.g., Guindon et al. 2010; Nguyen et al. 2015).
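The growth in the number of candidate trees can be made concrete with a short calculation. The sketch below uses the standard counting result that n labeled taxa admit (2n − 3)!! rooted and (2n − 5)!! unrooted strictly bifurcating topologies; the function names are illustrative and not taken from any of the programs discussed here.

```python
def count_rooted_topologies(n_taxa: int) -> int:
    """Number of rooted, strictly bifurcating, leaf-labeled trees: (2n - 3)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count


def count_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees: (2n - 5)!!, i.e., the rooted count for n - 1 taxa."""
    if n_taxa < 3:
        return 1
    return count_rooted_topologies(n_taxa - 1)


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, count_rooted_topologies(n), count_unrooted_topologies(n))
```

For 20 taxa the rooted count is on the order of 8 × 10²¹, in line with the figure quoted above, which is why exhaustive evaluation is out of reach even for modest numbers of sequences.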

It is now generally accepted that statistical methods, such as maximum likelihood (ML) (Felsenstein 1981), produce more reliable results than distance and parsimony methods (Yang and Rannala 2012; Whelan and Morrison 2017). However, ML-based methods are also computationally more expensive, necessitating the use of heuristic search algorithms for searching the enormity of tree space (Chor and Tuller 2005). Heuristic search algorithms typically adopt iterative, “hill-climbing” optimization techniques that involve three steps: 1) generate a quick starting tree (e.g., BIONJ tree, stepwise-addition parsimony tree, etc.); 2) modify the tree using certain topological rearrangement rules and evaluate the resultant trees under the ML criterion; and 3) replace the starting tree and repeat step 2 if the rearrangements identify a better tree, or otherwise terminate the search. The most common rearrangement algorithms for step 2 are Nearest-Neighbor-Interchange (NNI), where the four subtrees connected by a given internal branch are re-arranged to form two new, alternative topologies (Robinson 1971), and Subtree-Pruning-and-Regrafting (SPR), in which a given subtree is detached from the full tree and re-inserted onto each of the remaining branches (Swofford et al. 1996). SPR is more expansive in searching tree space than NNI since it can evaluate many more trees from one initial topology, but it is also much slower because of the extra tree evaluations.
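The difference in search breadth between the two rearrangement types can be illustrated by counting how many distinct topologies lie one move away from a given tree. The sketch below relies on standard combinatorial results for unrooted bifurcating trees with n leaves, namely 2(n − 3) NNI neighbors and 2(n − 3)(2n − 7) SPR neighbors; these formulas are not taken from this article, and the heuristics used by actual programs evaluate only a subset of these candidates.

```python
def nni_neighborhood_size(n_taxa: int) -> int:
    """Each of the n - 3 internal branches allows two alternative NNI topologies."""
    return 2 * (n_taxa - 3)


def spr_neighborhood_size(n_taxa: int) -> int:
    """Number of distinct topologies one SPR move away (unrooted bifurcating tree)."""
    return 2 * (n_taxa - 3) * (2 * n_taxa - 7)


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, nni_neighborhood_size(n), spr_neighborhood_size(n))
```

The NNI neighborhood grows linearly with the number of taxa whereas the SPR neighborhood grows quadratically, which is consistent with SPR being both more thorough and slower.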

Four of the most popular fast ML-based phylogenetic programs that differ in their choices or implementations of rearrangement algorithms are PhyML (Guindon et al. 2003, 2010), RAxML/ExaML (Stamatakis 2014; Kozlov et al. 2015), FastTree (Price et al. 2010), and IQ-TREE (Nguyen et al. 2015). First introduced in the early 2000s, PhyML has been one of the most widely used programs for ML-based phylogenetic inference (Guindon et al. 2003). The original algorithm was based solely on NNI and achieved performance comparable to that of other contemporary ML methods but with much lower computational costs. The latest version of PhyML (version 20160530) performs hill-climbing tree searches using SPR rearrangements in early stages and NNI rearrangements in later stages of the tree search (Guindon et al. 2010). Specifically, during the SPR-based search, candidate re-grafting positions are first filtered based on parsimony scores; the most parsimonious ones are then subject to approximate ML evaluation where branch-lengths are only re-optimized at the branches adjacent to the pruning and re-grafting positions. To accelerate the tree search, the best “up-hill” SPR move for each subtree is accepted immediately, potentially leading to the simultaneous application of multiple SPRs in one round. Once the search has converged to a single topology, the resultant tree is further optimized by NNI-based hill-climbing. Similar to the SPR stage, PhyML evaluates candidate NNIs only approximately by re-optimizing the five relevant branches, and may apply multiple NNI moves simultaneously at each round. The addition of the SPR algorithm in PhyML has significantly improved its accuracy, although at the cost of longer runtimes (Guindon et al. 2010).

RAxML is another widely used program for fast estimation of ML trees (Stamatakis 2006, 2014). The latest version (8.2.11) implements the standard SPR-based hill-climbing algorithm and employs important heuristics to reduce the number of unpromising SPR candidates, including: 1) candidate re-grafting positions are limited to only those within a certain distance from the pruning position (known as the “lazy subtree rearrangement”) (Stamatakis et al. 2005); and 2) if the re-grafting to a candidate position results in a substantially worse likelihood value, all branches further away from that point will be ignored (Stamatakis et al. 2007). As in PhyML, the approaches of approximate prescoring of SPR candidates and simultaneous SPRs are also used by RAxML to speed up the analysis (Stamatakis et al. 2005). In addition to RAxML, its sister program ExaML is specifically engineered for large concatenated data sets (Kozlov et al. 2015); it achieves greatly enhanced parallel efficiency through a novel load-balancing algorithm and parallel I/O optimization. As RAxML has exhibited excellent performance in both accuracy and speed (Stamatakis 2006), it is considered by many to be the state-of-the-art fast ML phylogenetic program.

Although both PhyML and RAxML represent great advances in developing fast and accurate phylogenetic programs, efforts aimed at improving the speed of ML tree estimation continue. For example, the recently developed FastTree program can be orders of magnitude faster than either PhyML or RAxML/ExaML (Price et al. 2010). FastTree (latest version 2.1.10) first constructs an approximate NJ starting tree which is then improved under the minimum evolution criterion using both NNI and SPR rearrangements, followed by ML-based NNI rearrangements to search for the final tree. With computational efficiency at the very heart of its design, FastTree makes heavy use of heuristics at all stages to limit the numbers of tree searches and likelihood optimizations. As a tradeoff, FastTree generates less accurate tree estimates than SPR-based ML methods (Price et al. 2010). The substantial edge of the FastTree program in speed has made it very popular, particularly in analyses of very large phylogenomic data matrices.

An important weakness of pure hill-climbing methods is that they can be easily trapped in local optima. The IQ-TREE program, the most recent of the four fast ML-based phylogenetic programs, was developed aiming to overcome this local optimum problem through the use of stochastic techniques (Nguyen et al. 2015). Specifically, IQ-TREE (latest version 1.5.5) generates multiple starting trees instead of one and subsequently maintains a pool of candidate trees during the entire analysis. The tree inference proceeds in an iterative manner; at every iteration, IQ-TREE selects a candidate tree randomly from the pool, applies stochastic perturbations (e.g., random NNI moves) onto the tree, and then uses the modified tree to initiate an NNI-based hill-climbing tree search. If a better tree is found, the worst tree in the current pool is replaced and the analysis continues; otherwise, the iteration is considered unsuccessful and the analysis terminates after a certain number of unsuccessful iterations. IQ-TREE takes advantage of successful preexisting heuristics (e.g., simultaneous NNIs [Guindon et al. 2003]) and a highly optimized implementation of likelihood functions (Flouri et al. 2015) for better computational efficiency.
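The candidate-pool strategy described above can be sketched in a few lines. This is a toy illustration under stated assumptions and not IQ-TREE's implementation: the "trees" are integers on a one-dimensional landscape with many local optima, the functions `toy_score`, `toy_hill_climb`, and `toy_perturb` are hypothetical stand-ins for likelihood evaluation, NNI hill-climbing, and random NNI moves, a candidate counts as better only if it beats the current best tree in the pool, and the failure counter is reset after each success.

```python
import random


def stochastic_pool_search(starts, score, perturb, hill_climb, max_failures=20, seed=0):
    """Keep a pool of candidates; perturb a random one, re-optimize, replace the worst on success."""
    rng = random.Random(seed)
    pool = sorted((hill_climb(s) for s in starts), key=score, reverse=True)
    failures = 0
    while failures < max_failures:
        candidate = hill_climb(perturb(rng.choice(pool), rng))
        if score(candidate) > score(pool[0]):
            pool[-1] = candidate  # the worst pool member is replaced
            pool.sort(key=score, reverse=True)
            failures = 0  # assumption: only consecutive failures count toward stopping
        else:
            failures += 1
    return pool[0]


def toy_score(x):
    # Global optimum at x = 70; every multiple of 10 is a local optimum.
    return -abs(x - 70) + (15 if x % 10 == 0 else 0)


def toy_hill_climb(x):
    # Pure hill-climbing: move to the better neighbor until no neighbor improves.
    while True:
        best = max((x - 1, x + 1), key=toy_score)
        if toy_score(best) <= toy_score(x):
            return x
        x = best


def toy_perturb(x, rng):
    return x + rng.randint(-15, 15)


if __name__ == "__main__":
    best = stochastic_pool_search([0, 20, 40], toy_score, toy_perturb, toy_hill_climb)
    print(best, toy_score(best))
```

In this toy setting, pure hill-climbing from any of the three starting points stays trapped at its local optimum, whereas the perturbation step gives the search a chance to escape; the result is never worse than the best hill-climbed starting point.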

These four programs offer different tradeoffs between the extent of tree space searched and speed in fast phylogenetic inference, and they may exhibit different behaviors toward diverse phylogenomic data sets whose properties (e.g., taxon number and gene number) and evolutionary characteristics (e.g., age of lineage, taxonomic range, and evolutionary rate) vary. Therefore, a good understanding of their relative performance across diverse empirical phylogenomic data matrices is critical to the success of phylogenetic inference when computational resources are limited. This is particularly relevant for large-scale studies using data matrices of ever-increasing data volumes and complexities. So far, these four programs have only been evaluated using simulated data (Guindon et al. 2010; Price et al. 2010; Liu et al. 2011), and empirical data sets with wide ranges of taxon numbers (e.g., up to 237,882 taxa in Price et al. [2010]) but relatively small numbers of genes (from ∼10 [Price et al. 2010; Liu et al. 2011; Chernomor et al. 2016] to ∼200 [Guindon et al. 2010; Money and Whelan 2012; Nguyen et al. 2015]), which might not well approximate today’s state-of-the-art phylogenomic data matrices. In these studies, RAxML and PhyML showed largely similar performance in identifying trees of higher likelihood scores (Guindon et al. 2010; Money and Whelan 2012), whereas IQ-TREE exhibited improved efficiency compared with both RAxML and PhyML (Nguyen et al. 2015; Chernomor et al. 2016). On the other hand, FastTree was found to be much faster than RAxML and PhyML but reported lower likelihood scores for data sets with both small and large numbers of sequences (Guindon et al. 2010; Price et al. 2010; Liu et al. 2011). However, it remains unclear if these patterns would hold for empirical data sets with large numbers of loci and for species tree estimation based on genome-scale data.

To comprehensively evaluate the four fast ML-based phylogenetic programs (table 1), we used a large collection of 19 empirical phylogenomic data sets representing a wide range of properties, including data types (both DNA and protein data), numbers of taxa (up to 200) and genes (up to 14,446), and taxonomic range for diverse animal, plant, and fungal lineages (table 2; for details on the source of each data set, see supplementary table S1, Supplementary Material online). For each of these data sets, we compared the performance of all programs for single-gene tree inference and for coalescent-based and concatenation-based species tree inference, the two major current approaches to inferring species phylogenies from phylogenomic data (Liu et al. 2015). In the coalescent-based approach, the species tree is estimated by considering all individually inferred single-gene trees using coalescent methods that take into account that the histories of genes may differ from those of species due to incomplete lineage sorting (fig. 1A), whereas in the concatenation-based approach, the species tree is estimated from the supermatrix derived by concatenating all single-gene alignments (fig. 1B).
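The construction of a supermatrix for the concatenation-based approach can be sketched as follows. This is a minimal illustration rather than the procedure used by any particular study: alignments are represented as plain dictionaries mapping taxon names to aligned sequences, taxa missing from a gene are padded with gap characters (an assumption, although a common convention), and the returned coordinate ranges are the kind of information a partitioned analysis would need.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Concatenate per-gene alignments into a supermatrix and record 1-based partition ranges."""
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this gene are filled with the missing-data symbol.
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    genes = [
        {"A": "ACGT", "B": "ACGA"},
        {"A": "TT", "C": "TG"},
    ]
    matrix, parts = concatenate_alignments(genes)
    print(matrix)
    print(parts)
```

In the coalescent-based approach, by contrast, each gene alignment would be analyzed separately and only the resulting gene trees would be passed on to the species tree method.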

Table 1.

Overview of the Four Fast ML-Based Phylogenetic Programs Evaluated in This Study.

Programs	Optimality Criterion	Starting Tree	Topological Moves	Supported Models (AA)	Supported Models (DNA)	Partitioned Analysis
RAxML v8.2.0 (ExaML v3.0.17)	ML	Parsimony/random/custom	SPR	Common and custom models	JC69, K80, HKY85, GTR	Y
PhyML v20160530	ML	Parsimony/random/custom	Interleaved NNI and SPR	Common and custom models	Common and custom models	Y
IQ-TREE v1.4.2	ML	BIONJ and multiple parsimony/random/custom	NNI and stochastic perturbation	Common and custom models	Common and custom models	Y
FastTree v2.1.9	ML	Heuristic NJ	NNI and SPR (ME) followed by NNI (ML)	JTT, WAG, LG	JC69, GTR	N


Note.—ML, maximum likelihood; ME, minimum evolution; NJ, neighbor joining; NNI, nearest neighbor interchange; SPR, subtree pruning and re-grafting.

Table 2.

Overview of the 19 Phylogenomic Data Sets Included in This Study.

Study	Data Set^a (AA)	Data Set^a (DNA)	Genes	Taxa	Taxonomic Group	Data Type
Nagy et al. (2014)	NagyA1		594	60	Fungi	Genome
Misof et al. (2014)	MisoA2		1,478	144	Insects	Transcriptome
		MisoD2a^b,c
		MisoD2b^d
Wickett et al. (2014)	WickA3		844	103	Land plants	Transcriptome
		WickD3a^c
		WickD3b^d
Chen et al. (2015)	ChenA4		4,682	58	Vertebrates	Transcriptome
Struck et al. (2015)	StruA5		679	100	Worms	Transcriptome
Borowiec et al. (2015)	BoroA6		1,080	36	Metazoans	Genome
Whelan et al. (2015)	WhelA7		210	70	Metazoans	Transcriptome
Yang et al. (2015)	YangA8		1,122	95	Caryophyllales	Transcriptome
Shen et al. (2016b)	ShenA9		1,233	96	Yeasts	Genome
Song et al. (2012)		SongD1	424	37	Mammals	Genome
Xi et al. (2014)		XiD4	310	46	Flowering plants	Transcriptome
Jarvis et al. (2014)		JarvD5a^e	14,446	48	Birds	Genome
Jarvis et al. (2014)		JarvD5b^e	2,022	48	Birds	Genome
Prum et al. (2015)		PrumD6	259	200	Birds	Target enrichment
Tarver et al. (2016)		TarvD7	11,178	36	Mammals	Genome


^a Data sets are named using the first four letters of the first author’s last name from the study in which the data set was generated, followed by the letter A (for amino acid) or D (for DNA), followed by a unique numeric or alphanumeric identifier.

^b Data set MisoD2a does not have a corresponding supermatrix from the original study.

^c DNA data sets MisoD2a and WickD3a include the codon-based alignments corresponding to the amino acid alignments in data sets MisoA2 and WickA3, respectively.

^d DNA data sets MisoD2b and WickD3b include the full codon-based alignments corresponding to the amino acid alignments in data sets MisoA2 and WickA3, respectively, with the third codon positions removed.

^e Data set JarvD5b was derived from data set JarvD5a through statistical binning (Mirarab et al. 2014), and the two data sets correspond to the same supermatrix.

Fig. 1. — Schematics of the (A) single-gene tree inference test as well as the coalescent-based and (B) concatenation-based species tree inference tests used to evaluate the performance of fast phylogenetic programs in phylogenomic analysis.

In single-gene tree estimation, we found that, although the more comprehensive analysis strategy (ten searches per alignment using RAxML, PhyML, or IQ-TREE) performed considerably better than fast strategies (one tree search per alignment using the same programs), all produced results of comparable quality when the inferred gene trees were used for coalescent-based species tree inference. The impact of tree search numbers and starting tree types on the efficiency of single-gene alignment analysis was also investigated. For the concatenation-based species tree inference, we found that, in some cases, IQ-TREE recovered trees with higher likelihood scores than RAxML/ExaML, although both showed the best performance for most data sets. Importantly, IQ-TREE exhibited comparable or better speed in both coalescent-based and concatenation-based species tree inference compared with RAxML/ExaML. In contrast, FastTree produced significantly worse single-gene and species trees than the other three programs even when allowed to run multiple times, whereas PhyML did not scale well to supermatrices because the concatenation-based species tree inferences failed to complete for multiple data sets. Overall, our benchmarking of the four fast ML-based phylogenetic programs against 19 state-of-the-art data matrices is highly informative for the design of efficient data analysis strategies in phylogenomic studies with tens to 200 taxa.

Results and Discussion

A Comprehensive Collection of Empirical Data

For a comprehensive evaluation of the four fast ML-based phylogenetic programs, we retrieved 19 data sets from 14 recently published phylogenomic studies (table 2; see supplementary table S1, Supplementary Material online for detailed sources of each data set), representing a wide range of characteristics: 1) they include both amino acid and nucleotide data sets (nine and ten, respectively); 2) they contain either moderate numbers of taxa (e.g., PrumD6, 200 taxa, and 259 genes [Prum et al. 2015]), large numbers of genes (e.g., JarvD5a, 48 taxa, and 14,446 genes [Jarvis et al. 2014]), or both (e.g., MisoA2, 144 taxa, and 1,478 genes [Misof et al. 2014]); 3) they cover three major taxonomic groups (i.e., animals, plants, and fungi) and various depths within each group (e.g., data sets SongD1 [Song et al. 2012], ChenA4 [Chen et al. 2015], and WhelA7 [Whelan et al. 2015] cover mammals, vertebrates, and metazoans, respectively); and 4) they consist of sequence data derived from different technologies (e.g., some data sets were built entirely on whole genome sequences [Song et al. 2012; Jarvis et al. 2014; Shen et al. 2016b; Tarver et al. 2016], whereas some others contained mostly transcriptome sequencing data [Misof et al. 2014; Wickett et al. 2014; Yang et al. 2015]). In addition, these data sets were assembled and curated in state-of-the-art phylogenomic studies and thus are of high quality. Therefore, these data sets are well suited for benchmarking the performance of fast phylogenetic programs in the context of phylogenomics. At the same time, since here we only examined data sets with up to 200 taxa, the patterns revealed in our study might not necessarily hold for larger data matrices with thousands or more taxa.

Performance Test I: Single-Gene Tree Inference

In the first test, we examined the performance of four fast ML-based phylogenetic programs (i.e., RAxML, PhyML, IQ-TREE, and FastTree) in inferring single-gene trees (fig. 1A). We designed seven strategies, including four basic strategies in which each program was used to infer each gene tree from a single starting tree (these were named RAxML, PhyML, IQ-TREE, and FastTree), as well as three more comprehensive strategies in which each of RAxML, PhyML, and IQ-TREE was used to infer each gene tree from ten replicates (these were named RAxML-10, PhyML-10, and IQ-TREE-10). In both RAxML-10 and PhyML-10, five of the starting trees were obtained via parsimony (including the ones used in the RAxML and PhyML strategies, respectively) and the other five were random starting trees. On the other hand, IQ-TREE-10 consists of ten independent IQ-TREE searches, including the one performed in IQ-TREE.

The seven strategies were compared for the likelihood scores and topologies of their single-gene tree inferences, as well as for their computational speeds. Since the true evolutionary histories are unknown for the empirical data used here, we identified the tree with the highest likelihood score for each alignment (hereafter referred to as the “best-observed” tree) among trees inferred by the seven strategies and the trees reported in previous studies, if available. These “best-observed” trees were used as the reference in the comparisons of likelihood score and topology.
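The bookkeeping behind the “best-observed” reference can be sketched as follows. This is a minimal illustration with hypothetical names: log-likelihood scores are held in a nested dictionary (alignment, then strategy), ties are credited to every strategy that reaches the best score, and the numerical tolerance `tol` used to decide whether two scores are equal is an assumption rather than a value taken from this study. Scores of trees reported in previous studies could be included simply as an additional key.

```python
def best_observed_frequencies(scores, tol=1e-6):
    """Fraction of alignments on which each strategy reaches the best-observed log-likelihood."""
    hits = {}
    for per_strategy in scores.values():
        best = max(per_strategy.values())  # best-observed score for this alignment
        for strategy, lnl in per_strategy.items():
            hits.setdefault(strategy, 0)
            if best - lnl <= tol:  # ties are credited to every strategy involved
                hits[strategy] += 1
    n_alignments = len(scores)
    return {strategy: count / n_alignments for strategy, count in hits.items()}


if __name__ == "__main__":
    # Illustrative log-likelihoods only; these are not values from the study.
    example = {
        "gene1": {"RAxML": -1000.0, "IQ-TREE": -1000.0, "FastTree": -1012.5},
        "gene2": {"RAxML": -2050.3, "IQ-TREE": -2049.8, "FastTree": -2075.0},
    }
    print(best_observed_frequencies(example))
```

Because ties are credited to all strategies involved, the frequencies across strategies can sum to more than one.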

Likelihood Score Maximization

We first examined the performance of the seven strategies in likelihood score maximization on single-gene alignments (supplementary table S2, Supplementary Material online) by calculating the frequencies with which each of the seven strategies had the highest score (fig. 2). Overall, IQ-TREE-10 and RAxML-10 had the highest frequencies of finding the highest likelihood scores (80.17% and 75.99%, respectively) and reported the highest likelihood scores more frequently than the other strategies in all data sets except for JarvD5b, for which IQ-TREE-10 performed the best but IQ-TREE slightly outperformed RAxML-10, highlighting the benefit of using multiple starting trees. Importantly, the performances of IQ-TREE-10 and RAxML-10 varied substantially among data sets; whereas the two strategies performed very similarly on several data sets (e.g., NagyA1 and SongD1), in others RAxML-10 outperformed IQ-TREE-10 by large margins (e.g., MisoA2, MisoD2a, and MisoD2b), or vice versa (e.g., JarvD5b).

Fig. 2. — Performance of fast phylogenetic programs in the inference of single-gene trees. The bar-plots show the frequencies with which each of the seven analysis strategies produced the best likelihoods for single-gene alignments in each of the (A) protein and (B) DNA data sets. Note that the best likelihood score for a given single-gene alignment can be found by more than one strategy; therefore the sum of frequencies for a data set may be greater than one.

Notably, the basic strategy IQ-TREE was the third best strategy with an overall frequency of 54.03%, slightly higher than that of the more comprehensive strategy PhyML-10 (52.35%). In fact, IQ-TREE not only outperformed PhyML-10 in 11/19 data sets, but also showed higher frequency than RAxML-10 in the data set JarvD5b, as noted earlier. On the other hand, PhyML-10 performed consistently better than RAxML and PhyML, two basic strategies whose overall frequencies were fifth and sixth, respectively, and considerably lower (35.98% and 24.17%) than the first to the fourth best (IQ-TREE-10, RAxML-10, IQ-TREE, and PhyML-10). Among basic strategies, RAxML performed better than IQ-TREE on only four (MisoA2, StruA5, MisoD2a, and MisoD2b) data sets, yet neither of them performed well on these data sets. Both IQ-TREE and RAxML found higher likelihood scores more often than PhyML in all data sets except for JarvD5b in which RAxML had slightly lower frequency.

In comparison, the likelihood scores obtained by FastTree were much lower than those of the other six strategies; the program produced the highest likelihood scores in only 1.67% of all alignments. However, FastTree also had substantial advantages in computational speed compared with the others (see below). Since FastTree can initiate tree searches using distinct starting trees, we performed additional FastTree analyses for selected data sets, consisting of 100 tree searches for each alignment starting from 50 parsimony trees and 50 random trees. The results show that in the vast majority of cases FastTree still generated worse likelihood scores than the other strategies even after compensating for the differences in runtime by repeating the search 100 times (supplementary table S3, Supplementary Material online).

To further investigate the relative performance of the strategies using RAxML, PhyML, and IQ-TREE, we carried out pairwise comparisons between the three comprehensive strategies (i.e., RAxML-10, PhyML-10, and IQ-TREE-10) and also between their corresponding basic strategies (i.e., RAxML, PhyML, and IQ-TREE) (supplementary fig. S1, Supplementary Material online). The overall trend is the same as that observed in figure 2; on most data sets, IQ-TREE-10 found better likelihood scores more frequently than RAxML-10 which, in turn, outperformed PhyML-10; the same is true for the basic strategies. Interestingly, the three programs showed much closer performance when multiple tree searches were conducted. For instance, compared with RAxML, IQ-TREE found trees with equally good likelihood scores on 32.67% of all alignments and better scores on 43.96% of all alignments; the frequencies changed to 60.44% and 21.38%, respectively, in the comparison between IQ-TREE-10 and RAxML-10. Nonetheless, IQ-TREE-10 and RAxML-10 still showed considerable advantages over PhyML-10; cumulatively, they found higher likelihood scores on 40.27% and 41.77%, respectively, of all alignments than PhyML-10, whereas PhyML-10 found better scores on only ∼12% of all alignments in both comparisons.

Tree Topology

Trees with similar likelihood scores may differ substantially in their topologies, or vice versa. Hence, it is important to also examine the topological similarities between trees inferred by different methods in addition to their likelihood scores. Our evaluation is based on empirical data sets for which the true evolutionary histories are unknown, thus preventing a direct measurement of topological accuracy. Instead, we compared the trees inferred by various methods against the best-observed tree (i.e., the tree with the highest likelihood score) for each alignment. The rationale for using the best-observed ML trees as the references in our comparison is that, under the ML optimality criterion (which underlies all the methods examined here), the topologies of the trees with the highest likelihood scores are considered the best (currently known) answer.

We measured the normalized Robinson–Foulds, or nRF, distances (Robinson and Foulds 1981) between trees inferred by the seven strategies on each alignment against the corresponding best-observed tree. Overall, there was a strong positive correlation between the differences in likelihood scores and the topological distances when comparing inferred trees to the best-observed trees (Spearman’s correlations of 0.87 for all alignments and above 0.90 for most data sets, P-values <2.2 × 10⁻¹⁶ in all cases). In other words, strategies that yielded likelihood scores closest or equal to the best-observed likelihood scores tended to be those whose topologies were also closest or identical to the best-observed topologies (supplementary table S4, Supplementary Material online; see fig. 3 for data set YangA8 as an example).
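The nRF distance can be computed from the sets of nontrivial bipartitions (splits) of two trees on the same taxa. The sketch below is a minimal illustration and not the software used in this study: each tree is supplied as a collection of clades (sets of taxon names) rather than parsed from a tree file, and the normalization by 2(n − 3) assumes that both trees are unrooted and fully bifurcating.

```python
def nontrivial_splits(clades, taxa):
    """Canonical set of nontrivial bipartitions implied by a collection of clades."""
    taxa = frozenset(taxa)
    reference = min(taxa)
    splits = set()
    for clade in clades:
        side = frozenset(clade)
        if reference in side:
            side = taxa - side  # always store the side without the reference taxon
        if 1 < len(side) < len(taxa) - 1:  # skip splits that separate a single leaf
            splits.add(side)
    return splits


def normalized_rf(clades_a, clades_b, taxa):
    """Robinson-Foulds distance divided by its maximum, 2(n - 3), for bifurcating trees."""
    n_taxa = len(set(taxa))
    if n_taxa < 4:
        raise ValueError("at least four taxa are needed for a nontrivial split")
    splits_a = nontrivial_splits(clades_a, taxa)
    splits_b = nontrivial_splits(clades_b, taxa)
    return len(splits_a ^ splits_b) / (2 * (n_taxa - 3))


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E"]
    tree1 = [{"A", "B"}, {"D", "E"}]  # ((A,B),C,(D,E))
    tree2 = [{"A", "C"}, {"D", "E"}]  # ((A,C),B,(D,E))
    print(normalized_rf(tree1, tree2, taxa))
```

An nRF of 0 means the two topologies are identical, and 1 means they share no nontrivial bipartition.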

Fig. 3. — The performances of fast phylogenetic programs with respect to likelihood maximization and tree topology are positively correlated. Dots in the scatter plot correspond to trees inferred by various analysis strategies from single-gene alignments in data set YangA8. Log-likelihood score differences between inferred trees and the “best-observed” trees are plotted against the corresponding topological distances. The log-likelihood score differences are shown in logarithmic scale (with the addition of a small value of 0.01). The violin plots on the top and right show the distributions of log-likelihood differences (top) and topological distances (right), respectively, for trees inferred by each strategy.

Among the seven strategies, IQ-TREE-10, RAxML-10, and IQ-TREE showed the best performance in tree topology with median nRF distances of 0 for more than half of the data sets (supplementary table S5, Supplementary Material online); this was unsurprising since these strategies contributed most of the best-observed trees. PhyML-10, RAxML, and PhyML also performed relatively well, with median nRF distances less than 0.03, 0.06, and 0.13, respectively, for ten or more data sets. Here again, FastTree was behind the other strategies as it led to median nRF distances greater than 0.33 for most data sets.

Computational Speed

To compare the computational speed of the seven strategies, we first measured the runtimes of RAxML (using a parsimony starting tree), PhyML (using a parsimony starting tree), IQ-TREE, and FastTree, as well as of RAxML and PhyML analyses using one random starting tree (referred to as RAxML(RT) and PhyML(RT), respectively). We then plotted the runtimes of all these strategies against that of RAxML (fig. 4; supplementary table S6, Supplementary Material online), and found strong positive correlations between the speeds of strategies over a wide range of runtimes (Spearman’s correlation ≥0.91 for all combinations of data types and strategies, P-values <2.2 × 10⁻¹⁶ in all cases). The runtimes of RAxML(RT) and PhyML(RT) were highly similar to those of RAxML and PhyML, suggesting that RAxML-10 and PhyML-10 would take about ten times longer than RAxML and PhyML, respectively (supplementary table S7, Supplementary Material online). Interestingly, PhyML was ∼1.5 times faster than RAxML on protein alignments, but ∼3.1 times slower on DNA alignments. In contrast, IQ-TREE was faster than RAxML for both protein and DNA data (∼1.6 and ∼1.1 times faster, respectively), and the runtime of IQ-TREE-10 would simply be ten times longer since it consists of ten independent IQ-TREE analyses. Lastly, FastTree was substantially more time-efficient than RAxML on both DNA alignments (∼47.9 times faster) and protein alignments (∼95.4 times faster). In addition, the time advantage of FastTree was greater for alignments requiring longer runtimes; for instance, our linear regression analysis suggests that FastTree might run ∼162.0 times faster than RAxML on the largest single protein alignments but only ∼9.6 times faster on the smallest ones.
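A runtime-dependent speed advantage of this kind can be illustrated with a simple regression. The sketch below is one possible formulation under stated assumptions and not the exact analysis performed here: it fits an ordinary least-squares line to log10-transformed runtimes (the log-log form is an assumption motivated by the logarithmic axes of figure 4), and the function names and the synthetic runtimes are hypothetical.

```python
import math


def fit_loglog(reference_times, other_times):
    """Least-squares fit of log10(other) = intercept + slope * log10(reference)."""
    xs = [math.log10(t) for t in reference_times]
    ys = [math.log10(t) for t in other_times]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predicted_speedup(reference_time, slope, intercept):
    """How many times faster the other program is predicted to be at a given reference runtime."""
    predicted_other = 10 ** (intercept + slope * math.log10(reference_time))
    return reference_time / predicted_other


if __name__ == "__main__":
    # Synthetic runtimes in seconds (not measurements from the study).
    reference = [10.0, 100.0, 1000.0]
    faster = [0.1 * t ** 0.8 for t in reference]
    slope, intercept = fit_loglog(reference, faster)
    print(slope, intercept)
    print(predicted_speedup(10.0, slope, intercept), predicted_speedup(1000.0, slope, intercept))
```

A fitted slope below 1 means that the faster program's runtime grows more slowly than the reference runtime, so the predicted speedup increases for alignments that take longer to analyze.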

Fig. 4. — Runtime comparisons of fast phylogenetic programs in single-gene tree inferences. The runtimes required by each strategy to analyze a randomly selected subset of all protein (top row) and DNA (bottom row) alignments are plotted against the corresponding runtimes of RAxML. All runtimes (in seconds) are shown in logarithmic scale.

Overall, our results at the level of single-gene tree inference are consistent with previous, smaller-scale studies on the better efficiency of IQ-TREE relative to RAxML and PhyML (all using one search per alignment) (Nguyen et al. 2015), and the inferior performance of FastTree in likelihood score maximization when compared with other programs (Guindon et al. 2010; Liu et al. 2011). However, in contrast to previous observations (Guindon et al. 2010), we found that RAxML outperformed PhyML in nearly all data sets. This difference might be due to the small number of alignments examined in the previous study (Guindon et al. 2010) and the numerous updates of both programs since then. Another study (Liu et al. 2011) compared the performance of RAxML and FastTree on ten ribosomal RNA data sets and found that FastTree can sometimes generate more accurate trees than RAxML, typically on alignments with lower quality and fewer sequences. Importantly, Liu et al. (2011) examined data sets with highly reliable curated phylogenies as references, which are not available in most empirical studies, and also much greater numbers of taxa (between 263 and 27,643 in most cases) than the ones examined in our study (up to 200).

Implications for Efficient Tree Search on Single-Gene Alignments

The inclusion of RAxML-10, PhyML-10, and IQ-TREE-10 in our evaluation provided an opportunity to examine the effect of running multiple independent tree searches. For each of the three strategies, we first determined the highest likelihood score for each alignment, and then calculated the percentages of alignments for which the highest scores were found by given numbers of tree searches (supplementary fig. S2, Supplementary Material online). In IQ-TREE-10, the highest likelihood scores were found in the first tree search for more than 70% of the alignments in 11/19 data sets (which explains the excellent performance of IQ-TREE in fig. 2), and the frequencies quickly approached 100% with additional tree searches. In contrast, the first tree search in PhyML-10 found the highest likelihood scores for far fewer alignments (less than 30% in 10/19 data sets), and the frequencies increased more evenly with increasing numbers of tree searches. The plots of RAxML-10 lie in between those of IQ-TREE-10 and PhyML-10 in most data sets. Interestingly, however, in some data sets (e.g., MisoA2, StruA5, MisoD2a, MisoD2b), all three strategies showed almost the same linear increases in their frequencies of finding the highest scores with the number of tree searches (about 10% of the highest likelihood scores were found in each tree search). These results suggest that efficient tree search strategies are likely to vary between data sets and fast phylogenetic programs. To avoid unnecessary (or insufficient) tree search efforts, it is important to monitor the likelihood improvements over rounds of independent searches.
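
One way to monitor such improvements is to tabulate, for a set of alignments, the fraction for which the highest likelihood score has already been found after each successive search. The following is a minimal sketch under stated assumptions: the function name, the tolerance used to treat two log-likelihoods as equal, and the example scores are all hypothetical and are not taken from the analyses reported here.

```python
def fraction_best_found_by_search(scores_per_alignment, tolerance=1e-6):
    """Cumulative fraction of alignments whose best score was reached by search k.

    scores_per_alignment: one list per alignment, holding the log-likelihood
    of each independent tree search in the order the searches were run.
    tolerance: assumed threshold below which two scores count as equal.
    """
    n_searches = len(scores_per_alignment[0])
    counts = [0] * n_searches
    for scores in scores_per_alignment:
        best = max(scores)
        # Index of the first search that reached the best score.
        first_hit = next(i for i, s in enumerate(scores) if best - s <= tolerance)
        for k in range(first_hit, n_searches):
            counts[k] += 1
    return [c / len(scores_per_alignment) for c in counts]

# Hypothetical log-likelihoods: three alignments, three searches each.
example_scores = [
    [-100.0, -101.0, -100.5],  # best score found in search 1
    [-205.2, -204.9, -204.9],  # best score found in search 2
    [-50.0, -49.0, -48.5],     # best score found in search 3
]
fractions = fraction_best_found_by_search(example_scores)
print(fractions)
```

A curve that plateaus early suggests that additional searches add little for that program and data set, whereas a steadily rising curve suggests that more searches may still be worthwhile.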

Additionally, the use of both parsimony and random starting trees in RAxML-10 and PhyML-10 allowed us to investigate the relative performance of the two types of starting tree. In our comparisons, parsimony and random starting trees showed comparable overall performance (supplementary fig. S3, Supplementary Material online). For RAxML (supplementary fig. S3A, Supplementary Material online), five (or one) searches per alignment using random starting trees found better likelihood scores than using parsimony starting trees for only an additional 3.47% (or 1.86%) of all alignments. In addition, equally good likelihood scores were obtained using both types of starting trees on 50.12% (or 31.73%) of all alignments when five (or one) RAxML searches were conducted. However, at the level of individual data sets, random starting trees outperformed parsimony starting trees on 16 data sets regardless of the number of tree searches. A similar pattern was also observed for PhyML (supplementary fig. S3B, Supplementary Material online). Together with their similar runtime performance (fig. 4), these results suggest that the two types of starting trees are similarly efficient in the analysis of single-gene alignments with moderate sequence numbers, although random starting trees might be slightly more advantageous.

Performance Test II: Coalescent-Based Species Tree Inference

In the second test, we assessed the fast ML-based phylogenetic programs in the context of the “two-step” coalescent-based species tree inference, in which single-gene trees were first estimated from individual alignments by each examined strategy and then used collectively to infer the species tree by the coalescent-based method (fig. 1A) (Liu et al. 2015). Here, we used the single-gene trees produced in Performance Test I as input for the ASTRAL program (Mirarab and Warnow 2015), which was used to infer coalescent-based species trees. The species trees inferred by the seven strategies were then compared with the species tree estimated from the best-observed gene trees (referred to as best-observed species trees hereafter) to measure the topological distances (i.e., nRF distances).
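
For readers unfamiliar with the nRF distance, the following minimal sketch shows how it can be computed for two trees on the same set of taxa. It assumes that trees are supplied as nested tuples of taxon names and that the Robinson-Foulds distance is normalized by 2(n − 3), the maximum for two fully resolved unrooted trees with n taxa; both the representation and the function names are illustrative assumptions and do not describe the software used in this study.

```python
def leaf_set(tree):
    """Set of taxon names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions of the tree, treated as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixes which side of each split is stored
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Keep only splits with at least two taxa on each side.
        if 1 < len(clade) < len(all_leaves) - 1:
            side = clade if anchor not in clade else all_leaves - clade
            found.add(side)
        return clade

    visit(tree)
    return found

def normalized_rf(tree_a, tree_b):
    """Robinson-Foulds distance divided by 2(n - 3); assumes binary trees."""
    n = len(leaf_set(tree_a))
    return len(splits(tree_a) ^ splits(tree_b)) / (2 * (n - 3))

# Hypothetical five-taxon trees.
tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "C"), ("B", ("D", "E")))
print(normalized_rf(tree_1, tree_1), normalized_rf(tree_1, tree_2))
```

In this toy example the two trees share the split separating D and E from the remaining taxa but disagree on the other internal branch, so half of the maximum possible distance is realized.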

We first determined for each data set the topological distances between the species tree inferred from the best-observed single-gene trees and those inferred from the gene trees estimated by each of the seven strategies. In that regard, the species tree estimations of all six strategies using RAxML, PhyML, or IQ-TREE displayed comparably small topological distances to the best-observed species trees (median nRF distances ranged between 0 and 0.03 across data sets), whereas the species trees inferred by FastTree were considerably more dissimilar (median nRF distance of 0.121) (table 3). When we only considered the bipartitions or splits that were strongly supported (i.e., had quartet-based posterior probability, or PP, support greater than or equal to 0.9 [Sayyari and Mirarab 2016]), the species trees inferred by these strategies became even more similar to the best-observed species trees, although FastTree-generated species trees still showed the greatest topological distances (supplementary table S8, Supplementary Material online). Nonetheless, for most strategies and data sets, the species tree estimates were much more similar to the best-observed trees than the corresponding single-gene tree inferences (table 3; supplementary tables S5 and S8, Supplementary Material online).

Table 3.

Normalized Robinson-Foulds Distances between the Coalescent-Based Species Trees Estimated from Gene Trees Inferred by Various Strategies and the “Best-Observed” Gene Trees.

Data Type	Data Set	RAxML-10	PhyML-10	IQ-TREE-10	RAxML	PhyML	IQ-TREE	FastTree
Amino acid	NagyA1	0.035	0.035	0.018	0.07	0.035	0.035	0.123
	MisoA2	0.007	0.014	0.028	0.028	0.021	0.035	0.099
	WickA3	0.01	0.01	0	0.01	0.03	0.01	0.09
	ChenA4	0	0	0	0	0	0	0
	StruA5	0.103	0.124	0.155	0.124	0.186	0.124	0.289
	BoroA6	0	0.03	0	0	0.03	0	0.121
	WhelA7	0.03	0	0	0.06	0.015	0.015	0.06
	YangA8	0.022	0	0	0.011	0.011	0	0.054
	ShenA9	0.011	0.022	0	0.032	0.022	0.032	0.054
Nucleotide	SongD1	0	0	0	0	0	0	0
	MisoD2a	0.007	0.05	0.043	0.043	0.071	0.05	0.206
	MisoD2b	0.007	0.035	0.035	0.05	0.043	0.064	0.156
	WickD3a	0.03	0.01	0.02	0.03	0.02	0.04	0.15
	WickD3b	0.01	0.01	0	0.02	0.03	0.01	0.09
	XiD4	0	0.023	0.023	0.023	0.023	0.023	0.186
	JarvD5a	0.022	0.022	0	0	0	0	0.4
	JarvD5b	0	0.022	0	0.067	0.044	0.022	0.289
	PrumD6	0.03	0.041	0.025	0.051	0.091	0.066	0.137
	TarvD7	0	0	0	0	0	0	0
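
The median nRF distances quoted in the text can be recomputed directly from table 3. The short sketch below does so for two of the seven strategies as an illustration; the values are transcribed from the table in row order (NagyA1 through TarvD7), and the variable names are our own.

```python
from statistics import median

# nRF distances transcribed from table 3, in row order (NagyA1 ... TarvD7).
nrf_by_strategy = {
    "RAxML-10": [0.035, 0.007, 0.01, 0, 0.103, 0, 0.03, 0.022, 0.011,
                 0, 0.007, 0.007, 0.03, 0.01, 0, 0.022, 0, 0.03, 0],
    "FastTree": [0.123, 0.099, 0.09, 0, 0.289, 0.121, 0.06, 0.054, 0.054,
                 0, 0.206, 0.156, 0.15, 0.09, 0.186, 0.4, 0.289, 0.137, 0],
}

# Median across the 19 data sets for each strategy.
medians = {name: median(values) for name, values in nrf_by_strategy.items()}
for name, value in medians.items():
    print(name, value)
```

With 19 data sets, the median is the tenth-smallest value in each column, which gives 0.01 for RAxML-10 and 0.121 for FastTree.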


We further assessed the confidence levels (i.e., PP supports) of the incongruent bipartitions or splits identified in the abovementioned species tree comparison. Worryingly, the incongruent splits between the species tree inferred using FastTree-generated gene trees as input and the best-observed species tree received significantly higher PP supports than those of the other strategies (fig. 5; see supplementary table S9, Supplementary Material online, for the results of Wilcoxon rank-sum tests). The median PP values of these splits were 0.81 for protein data sets and close to 1 for DNA data sets. Both of these values were much higher than those of the other six strategies, which were all below 0.60 and 0.71 for protein and DNA data sets, respectively.

Fig. 5. — Incongruent splits in coalescent-based species trees estimated by the strategies using RAxML, PhyML, and IQ-TREE are weakly supported. The violin plots show the distribution of local posterior probabilities for incongruent splits in coalescent-based species trees estimated by various analysis strategies. Here, incongruent splits are defined as the splits that are not present in species trees estimated from best-observed single-gene trees. The areas of violin plots are proportional to the total numbers of incongruent splits. The gray dots and bars in each violin plot indicate the median and the first/third quartiles of the local posterior probabilities, respectively.

Performance Test III: Concatenation-Based Species Tree Inference

In the third test, we examined the relative performance of the four programs in concatenation analysis of 17 taxon- and gene-rich supermatrices (we conducted concatenation analyses on 17, rather than 19, data matrices because: 1) JarvD5a and JarvD5b correspond to different partitioning strategies from the same supermatrix [Jarvis et al. 2014], and 2) MisoD2a does not have a corresponding supermatrix available from the original study [Misof et al. 2014]) (fig. 1B; table 2). Here, we again focused on the programs’ performance on likelihood score maximization, tree topology, and computational speed. However, as PhyML required excessive runtime or memory, or crashed, on multiple data sets, its results are not included in the evaluation. In addition to our analyses, all the supermatrices have also been previously extensively analyzed using either RAxML or ExaML (e.g., Jarvis et al. 2014; Misof et al. 2014; Wickett et al. 2014). Therefore, we included the reported likelihood scores and topologies (which we refer to as “RAxML/ExaML-published” trees) in our examination of relative performance.