Skip to main content
Bioinformatics Advances logoLink to Bioinformatics Advances
. 2023 Jan 30;3(1):vbad008. doi: 10.1093/bioadv/vbad008

SCAMPP+FastTree: improving scalability for likelihood-based phylogenetic placement

Gillian Chu 1,, Tandy Warnow 2,
Editor: Thomas Lengauer
PMCID: PMC9933845  PMID: 36818728

Abstract

Summary

Phylogenetic placement is the problem of placing ‘query’ sequences into an existing tree (called a ‘backbone tree’). One of the most accurate phylogenetic placement methods to date is the maximum likelihood-based method pplacer, using RAxML to estimate numeric parameters on the backbone tree and then adding the given query sequence to the edge that maximizes the probability that the resulting tree generates the query sequence. Unfortunately, this way of running pplacer fails to return valid outputs on many moderately large backbone trees and so is limited to backbone trees with at most ∼10 000 leaves. SCAMPP is a technique to enable pplacer to run on larger backbone trees, which operates by finding a small ‘placement subtree’ specific to each query sequence, within which the query sequence are placed using pplacer. That approach matched the scalability and accuracy of APPLES-2, the previous most scalable method. Here, we explore a different aspect of pplacer’s strategy: the technique used to estimate numeric parameters on the backbone tree. We confirm anecdotal evidence that using FastTree instead of RAxML to estimate numeric parameters on the backbone tree enables pplacer to scale to much larger backbone trees, almost (but not quite) matching the scalability of APPLES-2 and pplacer-SCAMPP. We then evaluate the combination of these two techniques—SCAMPP and the use of FastTree. We show that this combined approach, pplacer-SCAMPP-FastTree, has the same scalability as APPLES-2, improves on the scalability of pplacer-FastTree and achieves better accuracy than the comparably scalable methods.

Availability and implementation

https://github.com/gillichu/PLUSplacer-taxtastic.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.

1 Introduction

Phylogenetic placement is the problem of placing sequences (called ‘queries’) into an existing phylogeny (called a ‘backbone tree’). Phylogenetic placement methods can be used to add new sequences into an existing phylogenetic tree, to taxonomically characterize new sequences and to perform abundance profiling for microbiome samples (Czech et al., 2022).

In many applications, large phylogenies are preferred over smaller phylogenies, because increasing ‘taxon sampling’ can improve biological conclusions about evolutionary events and processes (Nabhan and Sarkar, 2012; Zwickl and Hillis, 2002), and updating large phylogenies to include newly generated sequences is a standard task in modern biology. An obvious current example of a growing phylogeny is that of SARS-COV-2. Updating a large phylogeny through the addition of new sequences into an already computed maximum likelihood tree or maximum parsimony tree is computationally preferable over recalculating the tree from scratch, as both problems are NP-hard (Foulds and Graham, 1982; Roch, 2006) and the most accurate heuristics are computationally intensive or even infeasible on large trees (Park et al., 2021). Microbiome analysis is another application where large backbone trees can be beneficial. For example, TIPP (Nguyen et al., 2014) is a method that uses phylogenetic placement to taxonomically identify reads from microbiome samples and then combines the taxonomic identification information to estimate abundance profiles. As shown in Shah et al. (2021), the accuracy of these abundance profiles improves substantially when the backbone tree increases in size.

However, only a few phylogenetic placement methods can run on large backbone trees. In particular, the most accurate placement methods, which are based on maximum likelihood [e.g. pplacer (Matsen et al., 2010) and EPA-ng (Barbera et al., 2019)], are generally limited to small backbone trees (at most a few thousand leaves) due to computational reasons. While EPA-ng is generally faster and more scalable than pplacer, and Linard et al. (2019) did not find any noteworthy differences in accuracy between the two methods, other studies (Balaban et al., 2020, 2022; Wedell et al., 2022) have found pplacer to be more accurate than EPA-ng on those datasets that are small enough for pplacer to run well. Alternative approaches for phylogenetic placement include distance-based techniques, such as APPLES-2 (Balaban et al., 2022), that can be very fast and scalable; however, so far these approaches are not as reliably accurate as pplacer.

In this study, we focus on improving the scalability of the maximum likelihood placement method pplacer. The input to pplacer is a backbone tree with a multiple sequence alignment that includes the sequences at the leaves as well as the query sequences. The first step in pplacer is to estimate the numerical model parameters on the backbone tree, such as branch lengths defining expected numbers of substitutions and the substitution rate matrix for the Generalized Time Reversible model (Tavaré, 1986); this is often performed by RAxML (Stamatakis, 2006), given its superiority for numeric parameter estimation compared to, for example, FastTree (Price et al., 2010). Then, for each query sequence q, pplacer computes the probability that q is placed on each edge; as a result, these values can be used to identify the maximum likelihood edge for final placement. However, pplacer used with RAxML (referred to as pplacer-RAxML) fails to run on large backbone trees with more than a few thousand sequences, either due to a memory-caused segmentation fault or by producing illegal outputs in the form of negative infinity log-likelihood scores. These problems may be numerical, as Wedell et al. (2022) showed that once the backbone tree is very large, the log-likelihood scores may become too large in magnitude for pplacer. While EPA-ng does not seem to have the same numerical issues as pplacer, it also has computational costs that make it unable to run on datasets with more than about 30 000 sequences (Wedell et al., 2022). Thus, both of the leading maximum likelihood methods have limitations to somewhat small backbone trees.

Two approaches have been used to address the limited scalability of placement methods, such as pplacer-RAxML. One approach for improving placement scalability uses divide-and-conquer; for example, SCAMPP (Wedell et al., 2022), adds each query sequence into the backbone tree by first letting the query sequence select a small placement subtree within the backbone tree. Then, it runs the phylogenetic placement method of choice (e.g. pplacer or EPA-ng) to place the query sequence into that placement subtree. By using branch length information, SCAMPP then identifies the correct edge in the backbone tree for the query sequence. Using SCAMPP with pplacer-RAxML (i.e. pplacer-SCAMPP-RAxML) and limiting the placement subtree to at most 2000 leaves allows pplacer-RAxML to run on backbone trees with up to 200 000 leaves, achieving high accuracy (Wedell et al., 2022).

The second approach is anecdotal: two studies [Balaban et al. (2022) and Wedell et al. (2022)] used FastTree-2 (Price et al., 2010) instead of RAxML to estimate numeric parameters on large backbone trees, followed by the use of the software package taxtastic (Fred Hutchinson Cancer Research Center, 2022) to reformat the numeric parameters suitably for pplacer. In these two studies, this substitution allows pplacer to run on larger datasets without producing negative infinity log likelihood values.

Yet, no prior study has compared pplacer-RAxML to pplacer with FastTree for numeric parameter estimation (i.e. pplacer-FastTree). In addition, no study has combined these techniques to see how the two methods operate together; that is, follow the SCAMPP divide-and-conquer framework, but instead of using pplacer-RAxML to place the query sequence into the placement subtree, use pplacer-FastTree (i.e. using pplacer within the divide-and-conquer strategy of SCAMPP and FastTree for numeric parameter estimation).

In this study, we explore pplacer-FastTree, pplacer-RAxML, pplacer-SCAMPP-RAxML and pplacer-SCAMPP-FastTree on a collection of simulated and biological datasets, in order to determine which methods provide the best accuracy and scalability. We confirm that the replacement of RAxML by FastTree for numeric parameter estimation consistently enables pplacer to scale to larger backbone trees (though not quite matching the scalability of APPLES-2 or pplacer-SCAMPP-RAxML), and that pplacer-FastTree is similar in accuracy to pplacer-RAxML. We also find that pplacer-SCAMPP-FastTree has low-computational requirements and matches the scalability of APPLES-2 and pplacer-SCAMPP, the previous two most scalable methods (which can scale to backbone trees with 200 000 leaves). Moreover, pplacer-SCAMPP-FastTree improves on the accuracy of both APPLES-2 and pplacer-SCAMPP-RAxML.

2 Study design

2.1 Overview

We explored six phylogenetic placement pipelines, which depend on both the method used to estimate numeric parameters as well as the phylogenetic placement method. The phylogenetic placement methods we studied are pplacer-RAxML (also referred to simply as pplacer), pplacer-FastTree, pplacer-SCAMPP-RAxML and pplacer-SCAMPP-FastTree. By design, SCAMPP allows a user-input parameter B for the placement size; the fact that pplacer-FastTree can run on larger backbone trees means we can use larger settings for B (which determines the size of the placement subtree) when using pplacer-FastTree for the placement step. We additionally compare these methods to EPA-ng and APPLES-2. We used two collections of simulated datasets and three biological datasets to evaluate accuracy.

2.2 Placement pipelines studied

The input to a phylogenetic placement pipeline is a backbone tree and backbone alignment for the leaf sequences and a single query sequence, which is also aligned to the backbone sequences (since the query sequences are inserted independently, the extension to multiple query sequences is trivial). The pipelines we explored first estimate branch lengths and other numeric parameters for each backbone tree then use the specified phylogenetic placement method to insert the query sequence into the tree.

2.2.1 The pplacer-RAxML pipeline

Given a backbone alignment and tree, we use RAxML v. 7.2.6 to estimate numeric parameters on the topology. We then place the query sequence(s) into the tree using pplacer.

2.2.2 The pplacer-FastTree pipeline

Given a backbone alignment and tree, we use FastTree-2 to estimate the numeric parameters on the topology. Then, we use the taxtastic package to reformat the tree topology, numeric parameters and alignment into a reference package for use with pplacer. Then, pplacer uses this reference package to place queries into large datasets.

2.2.3 The pplacer-SCAMPP-RAxML pipeline

Given a backbone alignment and tree, we use RAxML 7.2.6 to estimate the numeric parameters on the topology. Then, given a query sequence, pplacer-SCAMPP-RAxML uses the SCAMPP framework to find the nearest leaf (using Hamming distance) to the given query sequence, and uses this to extract a placement subtree of size B [with B = 2000 by default, as recommended in Wedell et al. (2022)]. To place the query sequence into the subtree, pplacer-SCAMPP-RAxML runs pplacer for query placement. The placement in the subtree provides the location as well as how the selected branch subdivides its branch length, which is then used to place the query sequence.

2.2.4 The pplacer-SCAMPP-FastTree pipeline

Given a backbone alignment and tree, we use FastTree 2 to estimate the numeric parameters on the topology. Then, given a query sequence, pplacer-SCAMPP-FastTree uses the SCAMPP framework to find the nearest leaf (using Hamming distance) to the given query sequence, and uses this to extract a placement subtree of size B. Since pplacer-FastTree can potentially run on larger trees than pplacer, the optimal value for B may be larger than its default setting in pplacer-SCAMPP-RAxML. To place the query sequence into the subtree, pplacer-SCAMPP-FastTree builds a taxtastic reference package on the subtree using the backbone sequences and numeric parameters, and provides pplacer with this taxtastic reference package for query placement.

2.2.5 The EPA-ng pipeline

The EPA-ng pipeline is also a maximum likelihood approach to phylogenetic placement, and the guidelines recommend the use of RAxML-ng to estimate branch lengths and other numeric model parameters.

2.2.6 The APPLES-2 pipeline

As a distance-based method, APPLES-2 differs from all other methods presented so far, which are maximum likelihood-based methods. The developers of APPLES-2 recommend the use of branch lengths estimated under minimum evolution. Given the topology of the backbone tree, we estimate the branch lengths under minimum evolution using FastTree-2 to estimate branch lengths on the tree with the ‘no ML’ option; this produces minimum evolution branch lengths, abbreviated here by ‘ME’.

2.3 Datasets

We explore the phylogenetic placement methods on previously published nucleotide datasets, both biological and simulated. For the simulated datasets we include ROSE 1000M1, ROSE 1000M5, RNASim and nt78. For the biological datasets, we include CRW 16S.B.ALL, PEWO green85 and PEWO LTP_s128_SSU. The empirical properties of each dataset are provided in Supplementary Materials.

2.3.1 ROSE 1000-sequence datasets

The ROSE datasets (1000M1 and 1000M5) have 1000 sequences each and were simulated with substitutions [under the GTR+GAMMA model (Tavaré, 1986)] with medium length indels using ROSE (Stoye et al., 1998); these were originally simulated for the SATé study (Liu et al., 2009) but have also since been used in studies of alignment accuracy (Park et al., 2023; Smirnov and Warnow, 2021a, b) and placement accuracy (Mirarab et al., 2012). We used the provided binary model trees and the true alignment. Each model condition dataset contains 20 replicates. In this study, we arbitrarily chose the first five replicates, where each replicate contains 1000 taxa. We explore our phylogenetic placement methods on the 1000M1 and 1000M5 datasets, where the M indicates a ‘medium’ gap length, and 1 indicates the highest rate of evolution, whereas 5 indicates the lowest rate of evolution (Liu et al., 2009). The rates of evolution are also reflected in the average p-distance (i.e. normalized Hamming distance) for each of these datasets.

2.3.2 RNASim-VS

We use the RNASim-VS datasets, which are subsets of the million-sequence RNASim simulated dataset (Mirarab et al., 2015). The RNASim simulation evolves RNA sequences under a non-homogeneous biophysical fitness model which reflects selective pressures to maintain the RNA secondary structure (as opposed to models such as GTR where there is no selection operating on the sequences) (Mirarab et al., 2015). This RNASim dataset has been widely used in studies evaluating alignment accuracy (Mirarab et al., 2015; Nguyen et al., 2015; Smirnov and Warnow, 2021a, b). Part of the RNASim dataset (i.e. RNASim-VS) has been used to study phylogenetic placement accuracy in prior studies (Balaban et al., 2022; Wedell et al., 2022), and was specifically used to study APPLES (Balaban et al., 2020), APPLES-2 (Balaban et al., 2022), pplacerDC (Koning et al., 2021) and pplacer-SCAMPP-RAxML. The RNASim Variable-Size (RNASim-VS) datasets provide true phylogenetic trees, true multiple sequence alignments, and estimated maximum likelihood trees computed using FastTree (Price et al., 2010). We use three subsets from the full RNASim-VS dataset, containing 50 000 sequences, 100 000 and 200 000 sequences. We refer to each subset as RNASim-50k, RNASim-100k and RNASim-200k. For RNASim-50k, we use the first five replicates. For the two larger datasets, we run on one replicate (arbitrarily picking the first replicate).

2.3.3 nt78

This simulated nucleotide dataset contains 78 132 sequences and 20 replicates and was created to study FastTree-2 (Price et al., 2010). We chose the first replicate to use as it has been used in other studies as well (Wedell et al., 2022). The nt78 dataset was simulated using ROSE (Stoye et al., 1998) under the HKY (Hasegawa et al., 1985) model to resemble 16S sequences, and with different evolutionary rates for each site selected from 16 different rate categories. See the Supplementary Materials for additional information.

We used three biological datasets (Supplementary Table S3), ranging in size from 5088 to 27 643 sequences; each comes with a multiple sequence alignment and a tree that has been computed on the multiple sequence alignment. Two of these (green85 and the LTP_s128_SSU) are from the PEWO collection (Linard et al., 2021): the green85 biological dataset is originally from the Greengenes database (DeSantis et al., 2006) and contains 5 088 sequences, while the LTP_s128_SSU dataset contains 12 953 aligned sequences. The final biological dataset is the 16S.B.ALL dataset from the Comparative Ribosomal Website (Cannone et al., 2002), with 27 643 sequences, and whose reference tree is a maximum likelihood tree computed using RAxML on the reference alignment.

2.3.4 Fragmentary protocol

We evaluated the placement methods when given fragmentary sequences on the RNASim-VS and nt78 datasets. Given a query sequence with original length lo, a random starting position was selected by sampling from a normal distribution. Two sequences of different length were generated for each query sequence: in the low fragmentary case, sequence length was sampled using a normal distribution with mean as 25% of lo and standard deviation as 60 basepairs (N(μ=0.25lo,σ=60)). In the high fragmentary case, sequence length was sampled using N(μ=0.10lo,σ=10). Note that these distributions were chosen because 154 basepairs is close to the Illumina read sequence length, making a study of placing read-length query sequences more relevant to downstream applications of phylogenetic placement. The fragment lengths for RNASim-VS and nt78 are reported in the context of those experiments.

2.3.5 Leave-one-out experiments

All experiments in this study were run as leave-one-out experiments, with results reported on a per-query basis. For all datasets, we randomly choose 200 sequences to be used as the query sequences. Thus, for a given dataset with n sequences, with one selected as a query sequence, we use each phylogenetic placement method to place the selected query sequence into the backbone tree containing the other n1 sequences.

2.4 Criteria

We evaluated phylogenetic placement error using delta error, a criterion that has been used before (Balaban et al., 2020; Mirarab et al., 2012; Wedell et al., 2022). The formal definition is provided in the Supplementary Materials, and a high-level description is given here. Given a query sequence q, let T denotes the reference tree (i.e. true tree for simulated data or estimated tree for biological data) that contains all the sequences in the backbone tree and also the given query sequence q; we do not require that the backbone tree be identical to T restricted to its leafset, so that the backbone tree may have error. To measure error in the backbone tree with respect to the reference tree (minus the query sequence), we report the number of missing branches; we similarly measure error in the tree resulting from adding the query sequence to the backbone tree by comparing it to the reference tree with the query sequence. The number of missing branches may stay the same or increase but cannot go down after adding the query sequence. The delta error is the change in the number of missing branches (also known as false negative, or FN) incurred by placing the query sequence into the backbone tree; hence, the delta error is always non-negative.

2.5 Experiments

  • Experiment 1: We compared the behavior of pplacer-RAxML and pplacer-FastTree when placing full-length query sequences using ROSE-1000M1, ROSE-1000M5 and nt78 datasets.

  • Experiment 2: We explored pplacer-FastTree and pplacer-SCAMPP-RAxML on nt78 and RNASim-50k, using both full and fragmentary query sequences.

  • Experiment 3: We evaluate the impact of changing the placement tree size in pplacer-SCAMPP-FastTree on all datasets and compare to APPLES-2 and EPA-ng.

We present delta error (with standard error), runtime and memory consumption (with standard deviation) in the subsequent figures; distributions of delta error are shown in the Supplementary Figures S2–S7. Where no placement tree size was specified, the entire dataset (besides the query) was used. All experiments were performed on the UIUC Campus Cluster, using the secondary queue with a time-limit of 4 wall-time hours per query and 64 GB of available memory.

3 Results and discussion

3.1 Experiment 1

This experiment explores the behavior of pplacer-RAxML and pplacer-FastTree for placing full-length sequences into backbone trees for the simulated ROSE and nt78 datasets. On the ROSE 1000M1 and 1000M5 datasets (Supplementary Fig. S1b and c, respectively), pplacer-RAxML and pplacer-FastTree produce delta error results within 0.01 of each other. However, pplacer-RAxML fails to return valid results on the nt78 dataset, which contains 78 132 sequences (Supplementary Fig. S1a), while pplacer-FastTree has no trouble on this dataset. We also see that pplacer-FastTree is faster than pplacer-RAxML but both have low memory usage when they run. Finally, pplacer-FastTree has low delta error in every model condition (much less than 0.2 of an edge), and incurs lower delta error on the lower rate of evolution (1000M5) than on the higher rate of evolution (1000M1).

Since the only difference between pplacer-RAxML and pplacer-FastTree is the software used to estimate the numeric parameters (i.e. RAxML or FastTree), this experiment shows that there is very little difference in accuracy between the two phylogenetic placement methods for those datasets on which both methods can complete. This is perhaps surprising, as previous comparisons of RAxML and FastTree established that RAxML produced better maximum likelihood scores than FastTree, suggesting that RAxML produces more accurate numeric parameter estimates on a given tree than FastTree (Liu et al., 2011). Assuming this trend to hold on these datasets as well, this study shows that any improvement in numeric parameter estimation provided by RAxML may not be important for improving phylogenetic placement accuracy. Therefore, while pplacer-RAxML may fail to produce valid results for placing into very large backbone trees, pplacer-FastTree has greater scalability and provides a close level of accuracy, making it an acceptable substitution to pplacer-RAxML.

3.2 Experiment 2

In this experiment, we explore the behavior of pplacer-FastTree and pplacer-SCAMPP-RAxML(2k) (where ‘2k’ indicates that the placement tree size is 2000) on the simulated nt78 and RNASim-50k datasets, using both full and fragmentary query sequences. The results show different trends on the full-length and fragmentary sequences, and so we discuss these separately.

For the full-length sequences, there is a small advantage to pplacer-FastTree for placement accuracy but a huge advantage for running time and memory usage for pplacer-SCAMPP-RAxML(2k) (Fig. 1). The small advantage in placement accuracy is interesting, especially since Experiment 1 indicated that the choice of RAxML and FastTree for numeric parameter estimation had little or no impact on accuracy. Hence, this suggests that the improvement in accuracy for pplacer-FastTree over pplacer-SCAMPP-RAxML(2k) is the result of being able to place query sequences into the entire backbone tree instead of into a placement subtree with only 2k sequences.

Fig. 1.

Fig. 1.

(Experiment 2) Per-query comparison between pplacer-SCAMPP-RAxML and pplacer-FastTree. (a, b) Results on RNASim-50k (50 000 sequences) and (c, d) results on nt78 (78 132 sequences). The top row shows average delta error over 200 queries, middle row shows runtime, and bottom shows peak memory usage (each per query). The error bars show standard error in the case of delta error, and standard deviation in the case of runtime and memory. The results shown here placing fragmentary queries in the nt78 dataset have not been explored in other papers before

The differences between methods are larger when placing fragmentary sequences (here, 10% in length). In this case, pplacer-SCAMPP-RAxML(2k) delta error was larger than for pplacer-FastTree, with a very large increase on the nt78 dataset where pplacer-FastTree had delta error of 1.54 and pplacer-SCAMPP-RaxML(2k) had delta error above 20. It seems very unlikely that using FastTree instead of RAxML would improve accuracy to this extent and instead suggests therefore that it is the restriction to a small subtree of only 2000 leaves that is problematic for pplacer-SCAMPP-RAxML(2k).

The results from Experiment 2 motivate Experiment 3, where we study the effect of varying the size of the placement subtree on accuracy and runtime.

3.3 Experiment 3

Experiment 3 evaluates the impact of changing the placement tree size on pplacer-SCAMPP-FastTree on both simulated (Supplementary Table S1) and biological (Supplementary Table S2) datasets, for all studied methods, including EPA-ng, pplacer-RAxML and pplacer-SCAMPP-RAxML(2k). These tables show that EPA-ng and pplacer-RAxML are not able to run on the larger backbone trees and that pplacer-SCAMPP-RAxML(2k) is not as accurate as pplacer-SCAMPP-FastTree. Hence, here we examine only the remaining methods: pplacer-SCAMPP-FastTree, pplacer-FastTree and APPLES-2. We show these results in three figures, first exploring performance on the nt78 dataset, then on RNASim datasets, and finally on biological datasets.

3.3.1 Results on the nt78 dataset

In Figure 2, we see substantial differences in accuracy between methods that depend on the query sequence length. Hence, we begin with trends for full-length query sequences and then address how these change as the query sequences become shorter.

Fig. 2.

Fig. 2.

(Experiment 3) Per-query results on the nt78 dataset (78 132 sequences) on full and fragmentary query sequences, comparing the pplacer variants to APPLES-2. (a) Results on the full query sequence length. (b) Results on the 0.25 fragment length (322 basepairs). (c) Results on the 0.10 fragment length (127 basepairs). The top row shows average delta error over 200 queries, middle row shows runtime, and bottom shows peak memory usage (each per query)

On the full-length sequences, all methods have very low delta error (0.14–0.22), with APPLES-2 slightly higher delta error than the other methods. Delta error increases for all methods as query sequence lengths decrease, with very clear distinctions between methods observed especially on the shortest query sequences. Thus, on the queries that are 10% of the full length, the lowest delta error is achieved by pplacer-FastTree (at 1.54) and the second lowest (2.84) is achieved by APPLES-2. In contrast, the delta error for pplacer-SCAMPP-FastTree(B) depends very strongly on B (the placement tree size), and is only acceptably low when B=40K, where it achieves delta error of 3.74.

Thus, this experiment shows the importance of a large placement subtree size B for ensuring good accuracy for pplacer-SCAMPP-FastTree(B) on the nt78 datasets when placing short query sequences. It also shows the resilience of pplacer-FastTree, which maintains its top accuracy across all three query length settings.

A comparison of methods with respect to their computational effort is also worthwhile. The runtimes are highest on the full-length sequences and then decrease with the query sequence length. APPLES-2 is consistently the fastest and pplacer-FastTree and pplacer-SCAMPP-FastTree(40K) are the two slowest. Memory usage is highest for pplacer-FastTree, followed by pplacer-SCAMPP-FastTree(40K).

3.3.2 Results on the RNASim datasets

Somewhat different trends are revealed when examining placement into the RNASim datasets (Fig. 3), with two query sequence lengths for RNASim-50K and only full-length queries for RNASim-200K. APPLES-2 has higher delta error than the pplacer variants on all datasets, with much higher error on the short query sequences for RNASim-50K. We also see that delta error increases for all methods as the query sequence length decreases, but in this case they remain low (below 0.2) for the pplacer variants, even for the 0.25-length query sequences.

Fig. 3.

Fig. 3.

(Experiment 3) Per-query comparison between pplacer-FastTree, pplacer-SCAMPP-FastTree and APPLES-2 on the RNASim datasets. We place full-length and 0.25-fragmentary query sequences into the RNASim-50k dataset and full-length query sequences into the RNASim-200k datasets. The top row shows average delta error over 200 queries, middle row shows runtime, and bottom shows peak memory usage (each per query). We note that pplacer-FastTree failed to complete on RNASim-200k, given 64 GB of memory to run

On the largest backbone tree (i.e. the RNASim-200K model condition), pplacer-FastTree fails to complete due to an out of memory error, but the other methods succeed. Moreover, pplacer-SCAMPP-FastTree(B) completes for all settings of B, as does APPLES-2, and all achieve very low delta error (lower than on the RNASim-50K model condition). Effectively there is no need for B to be large on these model conditions, as all settings for B produce the same low error for pplacer-SCAMPP-FastTree(B).

The comparison between methods in terms of computational effort shows, therefore, that pplacer-FastTree has a computational limit so that it is unable to complete on the RNASim-200K dataset. Comparisons between the other methods are as expected, with running time increasing for pplacer-SCAMPP-FastTree(B) as B increases, and APPLES-2 remaining the fastest method.

3.3.3 Results on the biological datasets

In Figure 4, we explore results for phylogenetic placement of full-length query sequences on biological datasets: 16S.B.ALL with 27 643 sequences, green85 with 5088 sequences and LTP_s128_SSU with 12 953 sequences. As these datasets are all smaller than the study simulated datasets, we explored pplacer-SCAMPP-FastTree(B) with two values for B: one using 25% of the number of leaves in the tree and the other using 50% of the number of leaves in the tree. We also examine pplacer-FastTree and APPLES-2.

Fig. 4.

Fig. 4.

(Experiment 3) Per-query results placing full-length query sequences into three biological datasets. We show average delta error, runtime and memory on (a) 16S.B.ALL (27 643 sequences), (b) green85 (5088 sequences) and (c) LTP_s128_SSU (12 953 sequences). The number of sequences used as the placement subtree size for pplacer-SCAMPP-FastTree is shown in parentheses as a percentage of the total number of sequences. The top row shows average delta error over 200 queries, middle shows runtime, and bottom shows peak memory usage (each per query). Memory usage for these methods is extremely low—well below 1 GB—see Supplementary Table S3 for values

On all three datasets, APPLES-2 consistently produced the highest delta error—in some cases much higher—while there were no noteworthy differences in accuracy between the remaining methods. As with other experiments, APPLES-2 was the fastest method, and increasing B for methods including SCAMPP increased the runtime. Memory usage was negligible for all methods.

In interpreting these trends, it is helpful to realize that the reference tree is itself an estimated tree, typically based on a maximum likelihood analysis of an estimated alignment. Thus, the superior accuracy of likelihood-based placement methods compared to APPLES-2 may be due to use of likelihood in placing the query sequences, compared to the use of distances in APPLES-2.

3.3.4 Summary of trends

The experiments performed here, as well as in other studies (e.g. Wedell et al., 2022), provide insights into when each phylogenetic placement method is likely to provide good accuracy, as well as their computational limitations. Here, we summarize these trends (see also Supplementary Tables S1 and S2).

One trend consistently seen is that placement error is higher for fragmentary sequences compared to full-length sequences; this is expected, since shorter sequences have less information available to determine placement (Wedell et al., 2022). Another trend consistently seen is that APPLES-2 often provides good, although not the best accuracy. Results reported here generally suggest that the maximum likelihood methods provide superior accuracy. Hence, in general, if runtime is the most important criterion, then APPLES-2 is the winner, but using APPLES-2 can reduce accuracy, making the choice a runtime/accuracy trade off. The remaining discussion focuses on accuracy, as a function of the backbone tree size and the query sequence length, and makes comparisons between the tested maximum likelihood placement methods.

On the simulated datasets we examined, pplacer-FastTree either was uniquely the most accurate or tied for the most accurate of all tested methods. Furthermore, although pplacer-FastTree was not the most accurate on one biological dataset (following EPA-ng by a small amount), it was again best or tied for best on the other two biological datasets. Given that we find very small accuracy differences between pplacer-RAxML and pplacer-FastTree when placing into relatively small trees and that previous studies (e.g. Balaban et al., 2020) have found pplacer-RAxML to provide superior accuracy to other methods, our finding that pplacer-FastTree produces high accuracy throughout our experiments is consistent with prior studies. Another trend observed in this experiment and elsewhere (Wedell et al., 2022) is that pplacer-RAxML has reduced accuracy when placing into some even moderately large trees [and Wedell et al. (2022) observed that pplacer-RAxML can return placements with negative log infinity values for large trees]. These trends suggest that pplacer-RAxML has numeric issues on large trees. In contrast, pplacer-FastTree is much more scalable: it succeeded in placing into the RNASim-100K tree (with 100 000 leaves) and only failed on the RNASim-200K tree due to only having 64 Gb of available memory (Supplementary Table S1). Another trend observed is that pplacer-RAxML and EPA-ng both have much higher computational requirements than pplacer-FastTree and become infeasible at much smaller backbone trees (Supplementary Table S1). Thus, when accuracy is the most important consideration and the backbone tree is not too large, then we would recommend the use of pplacer-FastTree. For those backbone trees where pplacer-FastTree cannot be run (or is considered too slow to use), then the most reliable method is pplacer-SCAMPP-FastTree(B), and the question becomes how to set B, the size of the placement tree.

Our study showed there are model conditions where increasing B helps accuracy, and generally, it does not hurt accuracy to do so. The conditions where increasing B is likely to be needed are for placing short sequences, and the implication is that short sequences make it more difficult for the SCAMPP technique to reliably find a good placement tree when B is small. However, we also observed that increasing B improved accuracy on the biological datasets. Thus, in general, increasing B is desirable for accuracy, but will increase the runtime, making this a runtime/accuracy tradeoff.

The observation that large values for B within pplacer-SCAMPP improves accuracy, sometimes dramatically, explains why pplacer-SCAMPP-FastTree can achieve higher accuracy than pplacer-SCAMPP-RAxML. Specifically, each of these methods is limited by the placement tree size B that it can use, with the values for B in pplacer-SCAMPP-FastTree(B) limited to whatever tree sizes pplacer-FastTree can handle (which can be large, as we saw pplacer-FastTree could place into backbone trees with 100 000 leaves in Supplementary Table S1). Similarly the values for B in pplacer-SCAMPP-RAxML(B) are limited to the tree sizes possible for pplacer-RAxML, but with the realization that setting B large within pplacer-SCAMPP-RAxML(B) can reduce accuracy [as observed here and in Wedell et al. (2022)] or may lead to failures. Thus, the best settings for B are completely different for these two pipelines. Finally, our study also shows that for some datasets (e.g. the biological datasets and when placing short sequences) that large values for B provide the best accuracy; hence, restricting B to a small value so that pplacer-RAxML can run means that pplacer-SCAMPP-RAxML is unable to reliably achieve the same level of accuracy as pplacer-SCAMPP-pplacer.

4 Conclusion

Phylogenetic placement methods are used in several downstream bioinformatics pipelines, such as taxonomic identification of microbiome samples and updating large phylogenetic trees, where the ability to place sequences into large trees can improve placement accuracy. Several studies have shown that pplacer—which uses maximum likelihood—is the most accurate phylogenetic placement method. However, when used in its default mode, pplacer has been shown to be limited to placing into small phylogenetic ‘backbone’ trees.

This study confirms that using FastTree for numeric parameter estimation improves pplacer’s scalability without compromising its accuracy. However, we find that pplacer-FastTree failed on the RNASim-200K dataset due to an out of memory error, while pplacer-SCAMPP-RAxML(2k) can complete on that dataset (and by design is not limited to backbone trees of that size). Hence, the simple replacement of RAxML by FastTree improves scalability only to a limited extent. However, combining the divide-and-conquer strategy of SCAMPP and replacing RAxML with FastTree for numeric parameter estimation provides scalability to the RNASim-200K dataset—the largest backbone tree that any of these methods have been shown to place into. Moreover, by selecting the placement subtree size (B) appropriately, the user can explore the runtime/accuracy trade off, with the expectation that larger values of B will be desirable when placing very short sequences into backbone trees.

This study presented two ways for improving the scalability of pplacer, a highly accurate maximum likelihood phylogenetic placement method, to run on large backbone trees. The first, pplacer-FastTree, provides excellent accuracy, often the best of the competing methods, across all conditions on which it could run, but is more computationally intensive and can fail to complete on the largest backbone trees we examined (RNASim-200k, with 200 000 leaves), due to memory requirements. The second, pplacer-SCAMPP-FastTree, achieves the same scalability of the prior most scalable and accurate methods, pplacer-SCAMPP-RAxML and APPLES-2, but is more accurate than both.

This study suggests several directions for future work. One question that remains unanswered is why there are numeric problems in placing into large placement trees with pplacer using RAxML numeric parameters but not with FastTree numeric parameters. We also note that as the output file to pplacer-FastTree and pplacer-SCAMPP-FastTree contains the most likely edge and its corresponding probability; additional work could leverage the uncertainty of query placements (i.e. where the most likely edge has very low probability) (Czech et al., 2022).

Supplementary Material

vbad008_Supplementary_Data

Acknowledgements

The authors thank Eleanor Wedell, Morgan Price and Erick Matsen for helpful discussion, and the anonymous reviewers for their suggestions, which improved the article.

Contributor Information

Gillian Chu, Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

Tandy Warnow, Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

Author contributions

Gillian Chu (Investigation [equal], Methodology [equal], Software [lead], Writing—review & editing [supporting]) and Tandy Warnow (Conceptualization [lead], Investigation [equal], Methodology [equal], Project administration [lead], Resources [lead], Supervision [lead], Writing—original draft [lead], Writing—review & editing [lead])

Funding

This work was supported by funds from the National Science Foundation [# 2006069 to T.W.]; and fellowship support from the National Science Foundation Graduate Research Fellowship Program to G.C.

Conflict of Interest: none declared.

References

  1. Balaban M. et al. (2020) APPLES: scalable distance-based phylogenetic placement with or without alignments. Syst. Biol., 69, 566–578. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Balaban M. et al. (2022) Fast and accurate distance-based phylogenetic placement using divide and conquer. Mol. Ecol. Resour., 22, 1213–1227. [DOI] [PubMed] [Google Scholar]
  3. Barbera P. et al. (2019) EPA-ng: massively parallel evolutionary placement of genetic sequences. Syst. Biol., 68, 365–369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Cannone J.J. et al. (2002) The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Czech L. et al. (2022) Metagenomic Analysis Using Phylogenetic Placement—A Review of the First Decade. In: Setubal, J.C., Kyrpides, N. (eds) Computational Methods for Microbiome Analysis. Vol. 2, Frontiers in Bioinformatics, Lausanne, Switzerland. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. DeSantis T.Z. et al. (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol., 72, 5069–5072. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Foulds L.R., Graham R.L. (1982) The Steiner problem in phylogeny is NP-complete. Adv. Appl. Math., 3, 43–49. [Google Scholar]
  8. Fred Hutchinson Cancer Research Center (2022) taxtastic. http://fhcrc.github.io/taxtastic/ (10 February 2023, date last accessed).
  9. Hasegawa M. et al. (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol., 22, 160–174. [DOI] [PubMed] [Google Scholar]
  10. Koning E. et al. (2021) pplacerDC: a new scalable phylogenetic placement method. In: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, Gainesville, FL, pp. 1–9.
  11. Linard B. et al. (2019) Rapid alignment-free phylogenetic identification of metagenomic sequences. Bioinformatics, 35, 3303–3312. [DOI] [PubMed] [Google Scholar]
  12. Linard B. et al. (2021) PEWO: a collection of workflows to benchmark phylogenetic placement. Bioinformatics, 36, 5264–5266. [DOI] [PubMed] [Google Scholar]
  13. Liu K. et al. (2009) Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science, 324, 1561–1564. [DOI] [PubMed] [Google Scholar]
  14. Liu K. et al. (2011) RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS One, 6, e27731. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Matsen F.A. et al. (2010) pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11, 1–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Mirarab S. et al. (2012) SEPP: SATé-enabled phylogenetic placement. In: Altman, R., Dunker, A.K., Hunter, L., Murray, T.A., Klein, T.E. (eds) Biocomputing 2012. World Scientific, Singapore, pp. 247–258. [DOI] [PubMed] [Google Scholar]
  17. Mirarab S. et al. (2015) PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol., 22, 377–386. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Nabhan A.R., Sarkar IN. (2012) The impact of taxon sampling on phylogenetic inference: a review of two decades of controversy. Brief. Bioinform., 13, 122–134. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Nguyen N.-P. et al. (2014) TIPP: taxonomic identification and phylogenetic profiling. Bioinformatics, 30, 3548–3555. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Nguyen N.-P. et al. (2015) Ultra-large alignments using phylogeny-aware profiles. Genome Biol., 16, 1–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Park M. et al. (2021) Disjoint tree mergers for large-scale maximum likelihood tree estimation. Algorithms, 14, 148. [Google Scholar]
  22. Park M. et al. (2023) UPP2: fast and accurate alignment estimation of datasets with fragmentary sequences. Bioinformatics, 39, btad007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Price M.N. et al. (2010) Fasttree 2–approximately maximum-likelihood trees for large alignments. PLoS One, 5, e9490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Roch S. (2006) A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans. Comput. Biol. Bioinform., 3, 92–94. [DOI] [PubMed] [Google Scholar]
  25. Shah N. et al. (2021) TIPP2: metagenomic taxonomic profiling using phylogenetic markers. Bioinformatics, 37, 1839–1845. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Smirnov V., Warnow T. (2021a) MAGUS: multiple sequence alignment using graph clustering. Bioinformatics, 37, 1666–1672. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Smirnov V., Warnow T. (2021b) Phylogeny estimation given sequence length heterogeneity. Syst. Biol., 70, 268–282. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Stamatakis A. (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, 22, 2688–2690. [DOI] [PubMed] [Google Scholar]
  29. Stoye J. et al. (1998) Rose: generating sequence families. Bioinformatics, 14, 157–163. [DOI] [PubMed] [Google Scholar]
  30. Tavaré S. (1986) Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. Lectures on Mathematics in the Life Sciences. Vol. 17, American Mathematical Society, Providence, RI, pp. 57–86. [Google Scholar]
  31. Wedell E. et al. (2022) SCAMPP: scaling alignment-based phylogenetic placement to large trees. IEEE/ACM Trans. Comput. Biol. Bioinform., doi: 10.1109/TCBB.2022.3170386. [DOI] [PubMed] [Google Scholar]
  32. Zwickl D.J., Hillis D.M. (2002) Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol., 51, 588–598. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

vbad008_Supplementary_Data

Articles from Bioinformatics Advances are provided here courtesy of Oxford University Press

RESOURCES