Systematic Biology. 2019 Sep 23;69(3):566–578. doi: 10.1093/sysbio/syz063

APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments

Metin Balaban 1, Shahab Sarmashghi 2, Siavash Mirarab 2
Editor: David Posada
PMCID: PMC7164367  PMID: 31545363

Abstract

Placing a new species on an existing phylogeny has increasing relevance to several applications. Placement can be used to update phylogenies in a scalable fashion and can help identify unknown query samples using (meta-)barcoding, skimming, or metagenomic data. Maximum likelihood (ML) methods of phylogenetic placement exist, but these methods are not scalable to reference trees with many thousands of leaves, limiting their ability to benefit from the dense taxon sampling in modern reference libraries. They also rely on assembled sequences for the reference set and aligned sequences for the query. Thus, ML methods cannot analyze data sets where the reference consists of unassembled reads, a scenario relevant to emerging applications of genome skimming for sample identification. We introduce APPLES, a distance-based method for phylogenetic placement. Compared to ML, APPLES is an order of magnitude faster and more memory efficient, and unlike ML, it is able to place on large backbone trees (tested for up to 200,000 leaves). We show that using dense references improves accuracy substantially, so that APPLES on dense trees is more accurate than ML on the sparser trees where ML can run. Finally, APPLES can accurately identify samples without assembled references or aligned queries using k-mer-based distances, a scenario that ML cannot handle. APPLES is publicly available at github.com/balabanmetin/apples.

Keywords: Distance-based methods, genome skimming, phylogenetic placement


Phylogenetic placement is the problem of finding the optimal position for a new query species on an existing backbone (or, reference) tree. Placement, as opposed to a de novo reconstruction of the full phylogeny, has two advantages. In some applications (discussed below), placement is all that is needed, and in terms of accuracy, it is as good as, and even better than (Janssen et al. 2018), de novo reconstruction. Moreover, placement can be more scalable than de novo reconstruction when dealing with very large trees.

Earlier research on placement was motivated by scalability. For example, placement is used in greedy algorithms that start with an empty tree and add sequences sequentially (e.g., Felsenstein 1981; Desper and Gascuel 2002). Each placement requires polynomial (often linear) time with respect to the size of the backbone, and thus, these greedy algorithms are scalable (often requiring quadratic time). Despite computational challenges (Warnow 2017), there has been much progress in the de novo reconstruction of ultralarge trees (e.g., thousands to millions of sequences) using both maximum likelihood (ML) (e.g., Price et al. 2010; Nguyen et al. 2015) and distance-based (e.g., Lefort et al. 2015) approaches. However, these large-scale reconstructions require significant resources. As new sequences continually become available, placement can be used to update existing trees without repeating previous computations on the full data set.

More recently, placement has found a new application in sample identification: given one or more query sequences of unknown origins, detect the identity of the (set of) organism(s) that could have generated that sequence. These identifications can be made easily using sequence matching tools such as BLAST (Altschul et al. 1990) when the query either exactly matches or is very close to a sequence in the reference library. However, when the sequence is novel (i.e., has low similarity to known sequences in the reference), this closest match approach is not sufficiently accurate (Koski and Golding 2001), leading some researchers to adopt a phylogenetic approach (Sunagawa et al. 2013; Nguyen et al. 2014). Sample identification is essential to the study of mixed environmental samples, especially of the microbiome, both using 16S profiling (e.g., Gill et al. 2006; Krause et al. 2008) and metagenomics (e.g., von Mering et al. 2007). It is also relevant to barcoding (Hebert et al. 2003) and metabarcoding (Clarke et al. 2014; Bush et al. 2017) and quantification of biodiversity (e.g., Findley et al. 2013). Driven by applications to microbiome profiling, placement tools like pplacer (Matsen et al. 2010) and the evolutionary placement algorithm (EPA(-ng)) (Berger et al. 2011; Barbera et al. 2019) have been developed. Researchers have also developed methods for aligning query sequences (e.g., Berger and Stamatakis 2011; Mirarab et al. 2012) and for downstream steps (e.g., Stark et al. 2010; Matsen and Evans 2013). These publications have made a strong case that for sample identification, placement is sufficient (i.e., de novo reconstruction is not needed). Moreover, some studies (e.g., Janssen et al. 2018) have shown that when dealing with fragmentary reads typically found in microbiome samples, placement can be more accurate than de novo construction and can lead to improved associations of microbiome with clinical information.

Existing phylogenetic placement methods have focused on the ML inference of the best placement, a successful approach that nevertheless suffers from two shortcomings. First, ML can only be applied when the reference species are assembled into full-length sequences (e.g., an entire gene) and are aligned; however, in new applications that we will describe, assembling (and hence aligning) the reference set is not possible. Second, ML, while somewhat scalable, is still computationally demanding, especially in memory usage, and cannot place on backbone trees with many thousands of leaves. As the density of the reference substantially impacts the accuracy and resolution of placement, this inability to use ultralarge trees as the backbone also limits accuracy. This limitation has motivated alternative methods using locality-sensitive hashing (Brown and Truszkowski 2013) and divide-and-conquer (Mirarab et al. 2012).

Assembly-free and alignment-free sample identification using genome skimming (Dodsworth 2015) can also benefit from phylogenetic placement. A genome skim is a shotgun sample of the genome sequenced at low coverage (e.g., 1X), so low that assembling the nuclear genome is not possible (though mitochondrial or plastid genomes can often be assembled). Genome skimming promises to replace traditional marker-based barcoding of biological samples (Coissac et al. 2016), but limiting analyses to organelle genomes can limit resolution. Moreover, mapping reads to reference genomes is possible only for species that have been assembled, which is a small fraction of the biodiversity on Earth. Sarmashghi et al. (2019) have recently shown that using shared $k$-mers, the distance between two unassembled genome skims with low coverage can be accurately estimated. This approach, unlike assembling organelle genomes, uses data from the entire nuclear genome and hence promises to provide a higher resolution (e.g., at species or subspecies levels) while keeping the sequencing cost low. However, ML and other methods that require assembled sequences cannot analyze genome skims, where both the reference and the query species are unassembled genome-wide bags of reads.

Distance-based approaches to phylogenetics are well-studied, but no existing tool can perform distance-based placement of a query sequence on a given backbone. The distance-based approach promises to solve both shortcomings of ML methods. Distance-based methods are computationally efficient and do not require assemblies. They only need distances (however computed). Thus, they can take as input assembly-free estimates of genomic distance estimated from low coverage genome skims using Skmer (Sarmashghi et al. 2019) or other alternatives (Yi and Jin 2013; Haubold 2014; Fan et al. 2015; Benoit et al. 2016; Ondov et al. 2016; Leimeister et al. 2017; Jain et al. 2018). While alignment-based phylogenetics has been traditionally more accurate than alignment-free methods when both methods are possible, in these new scenarios, only alignment-free methods are applicable.

Here, we introduce a new method for distance-based phylogenetic placement called APPLES (Accurate Phylogenetic Placement using LEast Squares). APPLES uses dynamic programming to find the optimal distance-based placement of a sequence with running time and memory usage that scale linearly with the size of the backbone tree. We test APPLES in simulations and on real data, both for alignment-free and aligned scenarios.

Description

Background and Notations

Let an unrooted tree $T = (V, E)$ be a weighted connected acyclic undirected graph with leaves denoted by $\mathcal{L} = \{1, \ldots, n\}$. We let $T^*$ be the rooting of $T$ on a leaf $1$ obtained by directing all edges away from $1$. For node $u \in V$, let $p(u)$ denote its parent, $c(u)$ denote its set of children, $sib(u)$ denote its siblings, and $g(u)$ denote the set of leaves at or below $u$ (i.e., those that have $u$ on their path to the root), all with respect to $T^*$. Also let $l(u)$ denote the length of the edge $(p(u), u)$.

The tree $T$ defines an $n \times n$ matrix where each entry $d_{ij}(T)$ corresponds to the path length between leaves $i$ and $j$. We further generalize this definition so that $d_{uv}(T)$ indicates the length of the undirected path between any two nodes of $T$ (when clear, we simply write $d_{uv}$). Given some input data, we can compute a matrix of all pairwise sequence distances $\Delta$, where the entry $\delta_{ij}$ indicates the dissimilarity between species $i$ and $j$. When the sequence distance $\delta_{ij}$ is computed using (the correct) phylogenetic model, it will be a noisy but statistically consistent estimate of the tree distance $d_{ij}(T)$ (Felsenstein 2003). Given these “phylogenetically corrected” distances (e.g., $\delta_{ij} = -\frac{3}{4}\ln\bigl(1 - \frac{4}{3}h_{ij}\bigr)$ is the corrected Hamming distance $h_{ij}$ under the Jukes and Cantor (1969) model), we can define optimization problems to recover the tree that best fits the distances. A natural choice is minimizing the (weighted) least square difference between tree and sequence distances:

$$\hat{T} = \arg\min_{T} \sum_{i,j \in \mathcal{L}} w_{ij}\,\bigl(\delta_{ij} - d_{ij}(T)\bigr)^2 \qquad (1)$$

Here, weights (e.g., $w_{ij}$) are used to reduce the impact of large distances (expected to have high variance). A general weighting schema can be defined as $w_{ij} = \delta_{ij}^{-k}$ for a constant value $k \in \mathbb{N}$. Standard choices of $k$ include $k = 0$ for the ordinary least squares (OLS) method of Cavalli-Sforza and Edwards (1967), $k = 1$ due to Beyer et al. (1974) (BE), and $k = 2$ due to Fitch and Margoliash (1967) (FM).
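As an illustration of the weighting schemes, the following minimal Python sketch evaluates the weighted least squares error of Eq. 1 for one fixed tree, given that tree's path-length matrix. It is not taken from APPLES: the function name and the toy matrices used with it are hypothetical, weights are assumed to have the form $\delta_{ij}^{-k}$ (with $k = 0, 1, 2$ for OLS, BE, and FM), and each unordered pair of leaves is counted once.

```python
import numpy as np

def weighted_ls_error(delta, tree_dist, k=2):
    """Sum over leaf pairs i < j of delta_ij**(-k) * (delta_ij - d_ij)**2.

    delta     : symmetric matrix of sequence distances (hypothetical input)
    tree_dist : symmetric matrix of path lengths on the candidate tree
    k         : weighting exponent (0 = OLS, 1 = BE, 2 = FM)
    """
    delta = np.asarray(delta, dtype=float)
    tree_dist = np.asarray(tree_dist, dtype=float)
    i, j = np.triu_indices_from(delta, k=1)   # each unordered pair once
    d, t = delta[i, j], tree_dist[i, j]
    weights = d ** (-float(k))                # assumes all pairwise distances > 0
    return float(np.sum(weights * (d - t) ** 2))
```

With larger $k$, a given absolute misfit on a long distance contributes less to the total error than the same misfit on a short distance.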

Finding the tree that minimizes Eq. 1 is NP-complete (Day and Sankoff 1987). However, decades of research have produced heuristics like neighbor-joining (Saitou and Nei 1987), alternative formulations like (balanced) minimum evolution (ME) (Cavalli-Sforza and Edwards 1967; Desper and Gascuel 2002), and several effective tools for solving the problem heuristically (e.g., FastME by Lefort et al. 2015, DAMBE by Xia 2018, and Ninja by Wheeler 2009).

Problem Statement

We let $P(u, x_1, x_2)$ be the tree obtained by adding a query taxon $q$ on an edge $(p(u), u)$, creating three edges $(t, q)$, $(p(u), t)$, and $(t, u)$, with lengths $x_1$, $x_2$, and $l(u) - x_2$, respectively (Fig. 1). When clear, we simply write $P$ and note that $P$ induces $T^*$ both in topology and branch length. We now define the problem.

Figure 1.


Any placement of $q$ can be characterized as a tree $P(u, x_1, x_2)$, shown here. The backbone tree $T^*$ is an arborescence on leaves $\mathcal{L}$, rooted at leaf $1$. Query taxon $q$ is added on the edge between $u$ and $p(u)$, creating a node $t$. All placements on this edge are characterized by $x_1$, the length of the pendant branch, and $x_2$, the distance between $t$ and $p(u)$.

  • Least squares phylogenetic placement (LSPP):

    • [Input:] A backbone tree $T$ on $\mathcal{L}$, a query species $q$, and a vector $\Delta_{q*}$ with elements $\delta_{qi}$ giving sequence distances between $q$ and every species $i \in \mathcal{L}$;

    • [Output:] The placement tree $P$ that adds $q$ on $T$ and minimizes
      $$Q(P) = \sum_{i=1}^{n} w_{qi}\,\bigl(\delta_{qi} - d_{qi}(P)\bigr)^2 \qquad (2)$$
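To make the objective concrete, the tree distances between the query and each backbone leaf can be written out explicitly for a placement on the branch above a node $u$. The expression below is a sketch under assumed notation, which is one reading of the definitions in this section rather than a quotation of the original equations: $x_1$ is the pendant branch length, $x_2$ is the distance from the new internal node to the parent $p(u)$, $l(u)$ is the length of the branch above $u$, $g(u)$ is the set of leaves at or below $u$, and $d_{ui}$ is the path length between node $u$ and leaf $i$ in the backbone.

```latex
d_{qi}(P) =
\begin{cases}
x_1 + \bigl(l(u) - x_2\bigr) + d_{ui}, & i \in g(u),\\[4pt]
x_1 + x_2 + d_{p(u)\,i}, & i \notin g(u).
\end{cases}
```

Because each distance is linear in $(x_1, x_2)$, the least squares error on a fixed branch is a quadratic function of these two variables.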

Linear Time Algorithm

The number of possible placements of $q$ is $2n - 3$, one per branch of the backbone. Therefore, LSPP can be solved by simply iterating over all the branches, optimizing the score for each branch, and returning the placement with the minimum least square error. A naive algorithm can accomplish this in $\Theta(n^2)$ running time by optimizing Eq. 2 for each of the $2n - 3$ branches. However, using dynamic programming, the optimal solution can be found in linear time.
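The naive quadratic-time baseline can be sketched as follows. This is a minimal illustration and not the linear-time dynamic programming used by APPLES: for every backbone edge it solves a two-variable weighted least squares problem for the pendant length `x1` and the attachment position `x2`, then keeps the edge with the smallest error. All names (`distances_from`, `fit_edge`, `naive_lspp`) and the edge-list tree representation are hypothetical, and the sketch assumes strictly positive edge lengths and strictly positive query distances. When the unconstrained optimum violates the branch length constraints, the sketch checks the three boundary faces, relying on convexity of the objective.

```python
import numpy as np
from collections import defaultdict

def distances_from(edges, source):
    """Path lengths from `source` to every node of an edge-weighted tree."""
    adj = defaultdict(list)
    for a, b, length in edges:
        adj[a].append((b, length))
        adj[b].append((a, length))
    dist, stack = {source: 0.0}, [source]
    while stack:
        x = stack.pop()
        for y, length in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + length
                stack.append(y)
    return dist

def fit_edge(A, y, length):
    """Minimize ||A @ (x1, x2) - y||^2 subject to x1 >= 0 and 0 <= x2 <= length."""
    def err(x):
        return float(np.sum((A @ np.asarray(x) - y) ** 2))
    x1, x2 = np.linalg.lstsq(A, y, rcond=None)[0]
    if x1 >= 0 and 0 <= x2 <= length:
        return float(x1), float(x2), err((x1, x2))
    # Convex objective: if the free optimum is infeasible, search the boundary.
    c = float(np.clip(A[:, 1] @ y / (A[:, 1] @ A[:, 1]), 0.0, length))
    candidates = [(0.0, c)]                            # face x1 = 0
    for c in (0.0, length):                            # faces x2 = 0, x2 = length
        x1 = max(0.0, float(A[:, 0] @ (y - A[:, 1] * c) / (A[:, 0] @ A[:, 0])))
        candidates.append((x1, c))
    x1, x2 = min(candidates, key=err)
    return x1, x2, err((x1, x2))

def naive_lspp(edges, leaves, delta, k=2):
    """Try every edge (a, b, length); x2 is measured from endpoint a."""
    dist = {i: distances_from(edges, i) for i in leaves}
    best = None
    for a, b, length in edges:
        rows, rhs = [], []
        for i in leaves:
            s = delta[i] ** (-k / 2)        # square root of the weight delta**-k
            if dist[i][b] < dist[i][a]:     # leaf i lies on b's side of the edge
                rows.append([s, -s])
                rhs.append(s * (delta[i] - length - dist[i][b]))
            else:                           # leaf i lies on a's side of the edge
                rows.append([s, s])
                rhs.append(s * (delta[i] - dist[i][a]))
        x1, x2, q = fit_edge(np.array(rows), np.array(rhs), length)
        if best is None or q < best[3]:
            best = ((a, b), x1, x2, q)
    return best
```

Each edge costs time linear in the number of leaves, which gives the quadratic total. The dynamic programming described next removes the per-edge linear factor.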

Theorem 1.

The LSPP problem can be solved with $\Theta(n)$ running time and memory.

The proof (given in Supplementary Appendix 1 available on Dryad at https://doi.org/10.5061/dryad.78nf7dq) follows easily from three lemmas that we next state. The algorithm starts with precomputing a fixed-size set of values for each node. For any node Inline graphic and exponents Inline graphic and Inline graphic, let Inline graphic and for Inline graphic, let Inline graphic. Note that Inline graphic. Similarly, for Inline graphic, let Inline graphic for Inline graphic and let Inline graphic.

Lemma 2.

The set of all Inline graphic and Inline graphic values can be precomputed in Inline graphic time with two tree traversals using the dynamic programming given by:

Lemma 2. (3)
Lemma 2. (4)

Lemma 3.

Equation 2 can be rearranged (see Supplementary Appendix 1, Eq. S2 available on Dryad) such that computing Inline graphic for a given Inline graphic requires a constant time computation using Inline graphic and Inline graphic values for Inline graphic and Inline graphic.

Thus, after a linear time precomputation, we can compute the error for any given placement in constant time. It remains to show that for each node, the optimal placement on the branch above it (e.g., Inline graphic and Inline graphic) can be computed in constant time.

Lemma 4.

For a fixed node Inline graphic, if Inline graphic, then

Lemma 4. (5)

and hence Inline graphic can be computed in constant time.

Non-negative branch lengths

The solution to Equation 5 does not necessarily conform to constraints $x_1 \geq 0$ and $0 \leq x_2 \leq l(u)$. However, the following lemma (proof in Supplementary Appendix 1 available on Dryad) allows us to easily impose the constraints by choosing optimal boundary points when unrestricted solutions fall outside boundaries.

Lemma 5.

With respect to variables $x_1$ and $x_2$, $Q(P(u, x_1, x_2))$ is a convex function.

Minimum evolution

An alternative to directly using minimum least square error (MLSE) (Eq. 1) is the ME principle (Cavalli-Sforza and Edwards 1967; Rzhetsky and Nei 1992). Our algorithm can also optimize the ME criterion: after computing $x_1$ and $x_2$ by optimizing MLSE for each node $u$, we choose the placement with the minimum total branch length. This is equivalent to using $\arg\min_u x_1$, since the value of $x_2$ does not contribute to total branch length. Other solutions for ME placement exist (Desper and Gascuel 2002), a topic we return to in the Discussion section.

Hybrid

We have observed cases where ME is correct more often than MLSE, but when it is wrong, unlike MLSE, it has a relatively high error. This observation led us to design a hybrid approach. After computing $x_1$ and $x_2$ for all branches, we first select the top $\log_2(n)$ edges with minimum $Q(P)$ values (this requires $\Theta(n \log \log n)$ time). Among this set of edges, we place the query on the edge satisfying the ME criterion.
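The three selection criteria can be contrasted with a short Python sketch. It assumes that a per-edge optimization has already produced one candidate tuple `(edge, x1, x2, error)` for every backbone branch, and that minimizing total branch length amounts to minimizing the pendant length `x1`. The function name, the tuple layout, and the default shortlist size of about $\log_2 n$ edges are assumptions made for illustration, not the APPLES interface.

```python
import heapq
import math

def select_placement(candidates, n_leaves, criterion="mlse", top=None):
    """Pick one candidate (edge, x1, x2, error) according to the criterion.

    mlse   : smallest least squares error
    me     : smallest pendant length x1 (minimum added branch length)
    hybrid : shortlist the `top` lowest-error edges, then apply ME among them
    """
    if criterion == "mlse":
        return min(candidates, key=lambda c: c[3])
    if criterion == "me":
        return min(candidates, key=lambda c: c[1])
    if criterion == "hybrid":
        if top is None:
            top = max(1, math.ceil(math.log2(n_leaves)))  # assumed default size
        shortlist = heapq.nsmallest(top, candidates, key=lambda c: c[3])
        return min(shortlist, key=lambda c: c[1])
    raise ValueError(f"unknown criterion: {criterion}")
```

Using a bounded shortlist keeps the hybrid rule close to MLSE when the ME choice would be far from the best-fitting edges.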

APPLES Software

We implemented the algorithm described above in a software package called APPLES. APPLES uses TreeSwift (Moshiri 2018) for phylogenetic operations, and it generates the output in the jplace format (Matsen et al. 2012). APPLES can compute distances using vectorized numpy (Oliphant 2006) operations but can also use input distance matrices (e.g., generated using FastME or Skmer). When computing distances internally, APPLES ignores positions that have a gap in at least one of the two sequences. By default, APPLES uses the JC69 model to compute phylogenetic distances (Jukes and Cantor 1969) without the $\Gamma$ model of rate variation. It computes distances independently for all pairs, and not simultaneously as suggested by Tamura et al. (2004).
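The distance computation described here can be sketched in a few lines of numpy. This is an illustrative approximation and not the APPLES source code: the function name and the choice of `-` as the gap character are assumptions, and the two sequences are assumed to be rows of the same alignment (equal length). Positions with a gap in either sequence are dropped before the Hamming distance is corrected under JC69.

```python
import numpy as np

def jc69_distance(seq_a, seq_b, gap="-"):
    """JC69-corrected distance between two aligned sequences of equal length."""
    a = np.array(list(seq_a))
    b = np.array(list(seq_b))
    keep = (a != gap) & (b != gap)        # ignore sites gapped in either sequence
    if not keep.any():
        return float("nan")               # no comparable sites
    h = float(np.mean(a[keep] != b[keep]))  # normalized Hamming distance
    if h >= 0.75:
        return float("inf")               # correction undefined (saturation)
    return float(-0.75 * np.log(1.0 - 4.0 * h / 3.0))
```

Because each pair is handled independently, the set of sites used can differ from one pair to the next.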

By default, APPLES uses FM weighting, the MLSE selection criterion, enforcement of non-negative branch lengths, and JC69 distances. When not specified otherwise, these default parameters are used (the default setting is referred to as APPLES*).

Benchmark

Data Sets

We benchmark accuracy and scalability of APPLES in two settings: sample identification using assembly-free genome skims on real biological data and placement using aligned sequences on simulated data.

Real genome skim data sets

Columbicola genome skims

We use a set of 61 genome skims by Boyd et al. (2017), including 45 known lice species (some represented multiple times) and 7 undescribed species. We generate lower coverage skims of 0.1Gb or 0.5Gb by randomly subsampling the reads from the sequence read archives (SRA) provided by the original publication (NCBI BioProject PRJNA296666). We use BBTools (Bushnell 2014) to filter subsampled reads for adapters and contaminants and remove duplicated reads. Since this data set is not assembled, the coverage of the genome skims is unknown; Skmer estimates the coverage to be between 0.2X and 1X for 0.1Gb samples (and 5 times that coverage with 0.5Gb).

Anopheles and Drosophila data sets

We also use two insect data sets used by Sarmashghi et al. (2019): a data set of 22 Anopheles and a data set of 21 Drosophila genomes (Supplementary Appendix 3 available on Dryad), both obtained from InsectBase (Yin et al. 2016). For both data sets, genome skims with 0.1Gb and 0.5Gb sequence were generated from the assemblies using the short-read simulator tool ART, with the read length Inline graphic and default error profile. Since species have different genome sizes, with 0.1Gb data, our subsampled genome skims range in coverage from 0.35X to 1X for Anopheles and from 0.4X to 0.8X for Drosophila.

More recently, Miller et al. (2018) sequenced several Drosophila genomes, including 12 species shared with the InsectBase data set. Sarmashghi et al. (2019) subsampled the SRAs from this second project to 0.1Gb or 0.5Gb and, after filtering contaminants, obtained artificial genome skims. We can use these genome skims as query and the genome skims from the InsectBase data set as the backbone. Since the reference and query come from two projects, the query genome skim can have a nonzero distance to the same species in the reference set, providing a realistic test of sample identification applications.

Backbone trees

For all genome skimming data sets, we inferred the backbone tree using FastME from the JC69 distance matrix computed from genome skims using Skmer.

Simulated alignment-based data sets

GTR

We use a 101-taxon data set available from Mirarab and Warnow (2015). Sequences were simulated under the general time reversible (GTR) plus $\Gamma$ model of site rate heterogeneity using INDELible (Fletcher and Yang 2009) on gene trees that were simulated using SimPhy (Mallo et al. 2016) under the coalescent model evolving on species trees generated under the Yule model. Note that the same model is used for inference under ML placement methods (i.e., no model misspecification). We took all 20 replicates of this data set with mutation rates between Inline graphic and Inline graphic, and for each replicate, we selected five genes at random among many candidates that satisfy the condition that the RF (Robinson and Foulds 1981) distance between the true tree and the tree inferred from the sequences is at most 20%. Thus, we have a total of 100 backbone trees. This data set is the simplest test case, where model violation or misalignment is not a concern.

RNASim

Guo et al. (2009) designed a complex model of RNA evolution that does not make usual i.i.d. assumptions of sequence evolution. Instead, it uses models of energy of the secondary structure to simulate RNA evolution by a mutation–selection population genetics model. This model is based on an inhomogeneous stochastic process without a global substitution matrix. The model complexity of RNASim allows us to test both ML and APPLES under a substantially misspecified model. An RNASim data set of one million sequences (with E. coli SSU rRNA used as the root), which consists of a multiple sequence alignment and true phylogeny, is available from Mirarab et al. (2015). We created several subsets of the full RNASim data set.

  • Heterogeneous: We first randomly subsampled the full data set to create 10 data sets of size 10,000. Then, we chose the largest clade of size at most 250 from each replicate; this gives us 10 backbone trees of mean size Inline graphic.

  • Varied diameter: To evaluate the impact of the evolutionary diameter (i.e., the highest distance between any two leaves in the backbone), we also created data sets with low, medium, and high diameters. We sampled the largest five clades of size at most 250 from each of the 10 replicates used for the heterogeneous data set. Among these 50 clades, we picked the bottom, middle, and top five clades in diameter, which had diameter in Inline graphic (mean: 0.36), Inline graphic (mean: 0.51), and Inline graphic (mean: 0.82), respectively.

  • Varied size (RNASim-VS): We randomly subsampled the full data set to create five replicates of data sets of size (Inline graphic): Inline graphic, Inline graphic, Inline graphic, 10,000, 50,000, and 100,000, and 1 replicate (due to size) of size 200,000. For replicates that contain at least Inline graphic species, we removed sites that contain gaps in 95% or more of the sequences in the alignment.

  • Query scalability (RNASim-QS): We first randomly subsampled the full data set to create a data set of size Inline graphic. Then for Inline graphic1 to 49,152 queries (choosing all Inline graphic) we created five replicates of Inline graphic query sequences, again randomly subsampling from the full alignment with one million sequences.

  • Alignment error (RNASim-AE): Mirarab et al. (2015) used PASTA to estimate alignments on subsets of the RNASim data set with up to 200,000 sequences. We use their reported alignment with 200,000 or 10,000 sequences (taking only replicate 1 in this case).

Backbone alignment and tree

Backbone alignments

We present results both based on true backbone alignments (for all data sets) and PASTA-estimated alignments (for large data sets). The true alignments are known from the simulations. To test the accuracy in the presence of alignment error, we use the available PASTA backbone alignment for RNASim-AE data set. The alignments have considerable error as measured by FastSP (Mirarab and Warnow 2011): 11.5% and 12.7% SPFN (sum-of-pairs false negative), 10.9% and 11.7% SPFP (sum-of-pairs false positive), and only 165 and 848 fully correctly aligned sites (2.2% and 6.4%), respectively for 10,000 and 200,000 sequences. As before, here, we remove sites with more than 95% gaps in the estimated alignments.

Backbone trees

For true alignments, we ran RAxML (Stamatakis 2014) using the GTRGAMMA model for all data sets to estimate the topology of the backbone tree, except for the RNASim-AE and RNASim-VS data sets, where due to the size, we used FastTree-2 (Price et al. 2010). When an estimated alignment is used (RNASim-AE data set), we used the coestimated tree output by PASTA, which is itself computed using FastTree-2. We always re-estimated branch lengths and model parameters on that fixed topology using RAxML (switching to GTRCAT for Inline graphic) before running ML methods. For APPLES, we re-estimated branch lengths using FastTree-2 under the JC69 model to match the model used for estimating distances.

Alignment of queries

For analyses with the true backbone alignment, we use the true alignment of the queries to the backbone. In analyses with estimated backbone alignments, we align the query sequences to the estimated backbone alignment using SEPP (Mirarab et al. 2012), which is a divide-and-conquer method that internally uses HMMER (Eddy 1998; Eddy 2009), with the alignment subset size set to 10% of the full set (default setting). We use the resulting extended alignment, after masking unaligned sites, to run both APPLES and EPA-ng, in both cases placing on the full backbone tree. We also report results using default SEPP (which runs pplacer internally); however, here, due to limitations of pplacer, we use the default setting of SEPP, which sets the placement subset size to 10% of the full set.

Methods Compared

For aligned data, we compare APPLES to two ML methods: pplacer (Matsen et al. 2010) and EPA-ng (Barbera et al. 2019). Matsen et al. (2010) found pplacer to be substantially faster than EPA (Berger and Stamatakis 2011) while their accuracy was similar. EPA-ng improves the scalability of EPA; thus, we compare to EPA-ng in analyses that concern scalability (e.g., RNASim-VS). We run pplacer and EPA-ng in their default mode using the GTR+$\Gamma$ model and use their best hit (ML placement). We also compare with a simple method referred to as CLOSEST that places the query as the sister to the species with the minimum distance to it. CLOSEST is meant to emulate the use of BLAST (if it could be used). For the assembly-free setting, existing phylogenetic placement methods cannot be used, and we compared only against CLOSEST.
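The CLOSEST baseline reduces to a single lookup over the query's distance vector. The sketch below assumes the distances are held in a dictionary keyed by reference species; the function name is hypothetical, and ties are broken by dictionary order, which is an arbitrary choice made for this illustration.

```python
def closest(delta):
    """Return the reference species with the smallest distance to the query.

    delta : dict mapping each reference species to its distance from the query.
    The query would then be attached as sister to the returned species.
    """
    return min(delta, key=delta.get)
```

Unlike a least squares placement, this rule can only attach the query to a pendant branch, never to an internal branch of the backbone.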

To run APPLES on assembly-free data sets, we first compute genomic distances using Skmer (Sarmashghi et al. 2019). We then correct these distances using the JC69 model, without the $\Gamma$ model of rate variation. For APPLES on alignment-based analyses, we let APPLES compute distances for JC69. We also use FastME (see Supplementary Appendix 4 available on Dryad) to compute distances according to four more models: JC69+$\Gamma$ (Jin and Nei 1990), the six-parameter Tamura and Nei (1993) (TN93) model, TN93+$\Gamma$ (Waddell and Steel 1997), and the 12-parameter general Markov model (Lockhart et al. 1994). Pairing $\Gamma$ with GTR is theoretically possible in the absence of noise; however, the method can run into problems on real data (Waddell and Steel 1997). Thus, we do not include a GTR model directly. Instead, we use the log-det approach that can handle the most general (12-parameter) Markov model (Lockhart et al. 1994); however, log-det cannot account for rate heterogeneity across sites (Waddell and Steel 1997). In the JC69+$\Gamma$ and TN93+$\Gamma$ models, we used the $\alpha$ parameter computed by RAxML (Stamatakis 2014) run on the backbone alignment and given the backbone tree.

Evaluation Procedure

To evaluate the accuracy, we use a leave-one-out strategy. We remove each leaf Inline graphic from the backbone tree Inline graphic and place it back on this Inline graphic tree to obtain the placement tree Inline graphic. On the RNASim-VS data set, due to its large size, we only removed and added back 200 randomly chosen leaves per replicate. On the RNASim-AE data set, we remove 200 queries from the backbone at the same time (leave-many-out). Finally, for RNASim-QS, we place Inline graphic queries in one run, allowing the methods to benefit from optimizations designed for multiple queries, but note that queries are not selected from the backbone tree but are instead selected from the full data set. In all cases, placement of queries is with respect to the backbone and not other queries.

Delta error

We measure the accuracy of a placement tree $P$ of a single query $q$ on a backbone tree $T$ on leafset $\mathcal{L}$ with respect to the true tree $T^{\mathrm{true}}$ on $\mathcal{L} \cup \{q\}$ using delta error:

$$\Delta e(P) = \bigl|B(T^{\mathrm{true}}) \setminus B(P)\bigr| - \bigl|B(T^{\mathrm{true}} \setminus q) \setminus B(T)\bigr| \qquad (6)$$

where $B(\cdot)$ is the set of bipartitions of a tree and $T^{\mathrm{true}} \setminus q$ is the true tree restricted to $\mathcal{L}$. Note that $\Delta e(P) \geq 0$ because adding $q$ cannot decrease the number of missing branches in $T$. We report delta error averaged over all queries (denoted as $\overline{\Delta e}$). The backbone tree is estimated from the same data used in distance calculation, whereas the true tree is either the ground truth or the gold standard that best approximates the truth. In leave-one-out experiments, placing $q$ in the same location it occupied in the backbone before being left out can still have a nonzero delta error because the backbone tree is not the true tree. We refer to the placement of a leaf into its position in the backbone tree as the de novo placement. In leave-many-out experiments, we measure the delta error of each query separately (not the delta error of the combination of all queries). On biological data, where the true tree is unknown, we instead use a published phylogenetic tree as the gold standard (Supplementary Appendix 2, Fig. S1 available on Dryad). For Drosophila and Anopheles, we use the tree available from the Open Tree of Life (Hinchliff et al. 2015), and for Columbicola, we use the ML concatenation tree published by Boyd et al. (2017).
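A minimal Python sketch of the delta error computation follows. It assumes that delta error is the number of true-tree bipartitions missing from the placement tree minus the number of restricted true-tree bipartitions missing from the backbone, as the definition above describes in words. Trees are represented as plain lists of `(node, node)` edges with unique leaf labels, which is a convenience of this sketch; the function names are hypothetical. Restricting the true tree to the backbone leaves is done by ignoring the query when collecting each side of a split.

```python
from collections import defaultdict

def bipartitions(edges, drop=()):
    """Non-trivial leaf bipartitions of an unrooted tree given as (a, b) edges.

    Leaves listed in `drop` are ignored, which restricts the tree to the rest.
    Each split is stored as the side that excludes a fixed anchor leaf.
    """
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    leaves = {v for v in adj if len(adj[v]) == 1} - set(drop)
    anchor = min(leaves)
    splits = set()
    for a, b in edges:
        # Collect the leaves reachable from b without crossing edge (a, b).
        side, stack, seen = set(), [b], {a, b}
        while stack:
            x = stack.pop()
            if x in leaves:
                side.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        if anchor in side:
            side = leaves - side
        if 2 <= len(side) <= len(leaves) - 2:   # skip trivial splits
            splits.add(frozenset(side))
    return splits

def delta_error(true_edges, placement_edges, backbone_edges, query):
    """Increase in missing true bipartitions caused by placing `query`."""
    missing_after = bipartitions(true_edges) - bipartitions(placement_edges)
    missing_before = bipartitions(true_edges, drop=[query]) - bipartitions(backbone_edges)
    return len(missing_after) - len(missing_before)
```

For example, on a five-leaf true tree, placing the query as sister to the wrong leaf of a correct four-leaf backbone loses one true bipartition, so the delta error is 1.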

Benchmark Results

Assembly-Free Placement of Genome Skims

On our three biological genome skim data sets, APPLES* successfully places the queries on the optimal position in most cases (97%, 95%, and 71% for Columbicola, Anopheles, and Drosophila, respectively) and is never off from the optimal position by more than one branch. Other versions of APPLES are less accurate than APPLES*; for example, APPLES with ME can have up to five wrong branches (Table 1). On genome skims, where assembly and alignment are not possible, existing placement tools cannot be used, and the only alternative is the CLOSEST method (emulating BLAST if assembly was possible). CLOSEST finds the optimal placement only 54% and 57% of the time for Columbicola and Drosophila; moreover, it can be off from the best placement by up to seven branches for the Columbicola data set. On the Anopheles data set, where the gold standard tree is unresolved (Supplementary Appendix 2, Fig. S1 available on Dryad), all methods perform similarly.

Table 1.

Assembly-free placement of genome skims. We show the percentage of placements into the optimal position (those that do not increase the delta error), the average delta error (mean Δe), and the maximum delta error (max Δe) for APPLES, assignment to the CLOSEST species, and the placement to the position in the backbone (DE-NOVO) over the 61 (a), 22 (b), and 21 (c) placements. Results are shown for genome skims with 0.1 Gbp of reads. Delta error is the increase in the missing branches between the true tree (or the gold standard for biological data) and the backbone tree after placing each query.

                 (a) Columbicola          (b) Anopheles            (c) Drosophila
Method           %    mean Δe   max Δe    %    mean Δe   max Δe    %    mean Δe   max Δe
APPLES*          97   0.03      1         95   0.05      1         71   0.29      1
APPLES-ME        84   0.28      5         95   0.05      1         67   0.42      2
APPLES-HYBRID    87   0.16      2         95   0.05      1         67   0.33      1
CLOSEST          54   1.15      7         91   0.09      1         57   0.62      3
DE-NOVO          98   0.02      1         95   0.05      1         71   0.29      1

APPLES* is less accurate on the Drosophila data set than other data sets. However, here, simply placing each query on its position in the backbone tree would lead to identical results (Table 1). Thus, placements by APPLES* are as good as the de novo construction, meaning that errors of APPLES* are entirely due to the differences between our backbone tree and the gold standard tree. Moreover, these errors are not due to low coverage; increasing the genome skim size 5Inline graphic (to 0.5Gb) does not decrease error (Supplementary Appendix 3, Table S4 available on Dryad).

On the Drosophila data set, we next tested a more realistic sample identification scenario using the 12 genome skims from a separate study (which thus have nonzero distance to the corresponding species in the backbone tree). As desired, APPLES* places all 12 queries from the second study as sister to the corresponding species in the reference data set.

Alignment-Based Placement

We first compare the accuracy and scalability of APPLES* to ML methods and then compare various settings of APPLES. For ML, we use pplacer (shown everywhere) and EPA-ng (shown only when we study scalability and work on large backbones).

Comparison to ML without alignment error

GTR data set

On this data set, where it faces no model misspecification, pplacer has high accuracy. It finds the best placement in 84% of cases and is off by one edge in 15% (Fig. 2a); its mean delta error (Inline graphic) is only Inline graphic edges. APPLES* is also accurate, finding the best placement in 78% of cases and resulting in the mean Inline graphic0.28 edges. Thus, even though pplacer uses ML and faces no model misspecification and APPLES* uses distances based on a simpler model, the accuracy of the two methods is within Inline graphic edges on average. In contrast, CLOSEST has poor accuracy and is correct only 50% of the time, with the mean Inline graphic of 1.0 edge.

Figure 2.

Accuracy on simulated data. We show empirical cumulative distribution of the delta error, defined as the increase in the number of missing branches in the tree compared to the true tree after placement. We compare pplacer (dotted), CLOSEST match (dashed), and APPLES with FM weighting and JC69 distances and MLSE (APPLES*), ME, or hybrid optimization. a) GTR data set. b) RNASim-Heterogeneous. c) RNASim-varied diameter, shown in boxes: low, medium (mid), or high. Distributions are over 10,000 (a), 2450 (b), and 3675 (c) points.

Model misspecification

On the small RNASim data (with subsampled clades of Inline graphic250 species), both APPLES* and pplacer face model misspecification. Here, the accuracy of APPLES* is very close to ML using pplacer. On the heterogeneous subset (Fig. 2b and Table 2), pplacer and APPLES* find the best placement in 88% and 85% of cases and have a mean delta error of 0.13 and 0.17 edges, respectively. Both methods are much more accurate than CLOSEST, which has a delta error of 0.87 edges on average.

Table 2.

The delta error for APPLES*, CLOSEST match, and pplacer on the RNASim-varied diameter data set (low, medium, or high) and the RNA-heterogeneous data set. Measurements are shown over 1250 placements for each diameter size category, corresponding to 5 backbone trees and 250 placements per replicate.

           Low                Medium             High               Heterogeneous
Method     %    Mean  Max     %    Mean  Max     %    Mean  Max     %    Mean  Max
APPLES*    86   0.15  2       85   0.18  5       84   0.18  3       85   0.17  5
CLOSEST    59   0.88  13      60   0.88  13      60   0.85  14      60   0.87  14
pplacer    88   0.13  2       89   0.11  3       87   0.13  3       88   0.13  3

Impact of diameter

When we control the tree diameter, APPLES* and pplacer remain very close in accuracy (Fig. 2c). The changes in error are small and not monotonic as the diameters change (Table 2). The accuracies of the two methods at low and high diameters are similar. The two methods are most divergent in the medium diameter case, where pplacer has its lowest error (Inline graphic 0.11) and APPLES* has its highest error (Inline graphic 0.18).

To summarize results on the small RNASim data set with model misspecification, although APPLES* uses a parameter-free model, its accuracy is extremely close to ML using pplacer with the GTR+Inline graphic model.

Impact of taxon sampling

The real advantage of APPLES* over pplacer becomes clear for placing on larger backbone trees (Fig. 3 and Table 3). For backbone sizes of 500 and 1000, pplacer continues to be slightly more accurate than APPLES* (mean Inline graphic of pplacer is better than APPLES* by 0.09 and 0.23 edges, respectively). However, with backbones of 5000 leaves, pplacer fails to run on 449/1000 cases, producing infinite likelihood values (perhaps due to numerical issues), and has Inline graphic times higher error than APPLES* on the rest (Supplementary Appendix 2, Fig. S2 available on Dryad). Since pplacer could not scale to 5000 leaves, we also test the recent method, EPA-ng (Barbera et al. 2019). Given the 64GB of memory available on our machine, EPA-ng is able to run on data sets with up to 10,000 leaves. EPA-ng finds the correct placement less often than pplacer but is close to APPLES* (Fig. 3a). However, when it has error, it tends to have a somewhat lower distance to the correct placement, making its mean error slightly better than APPLES* (Fig. 3b and Table 3).

Figure 3.

Results on RNASim-VS. a) Placement accuracy with taxon sampling ranging from 500 to 200,000. b) The empirical cumulative distribution of the delta error, shown for 500 Inline graphic 10,000 where EPA-ng can run. c,d) Running time and peak memory usage of placement methods for a single placement. Lines are fitted in the log–log scale and their slope (indicated on the figure) empirically estimates the polynomial degree of the asymptotic growth (Inline graphic). APPLES lines are fitted to Inline graphic5000 points because the first two values are small and irrelevant to asymptotic behavior. All calculations are on 8-core, 2.6GHz Intel Xeon CPUs (Sandy Bridge) with 64GB of memory, with each query placed independently and given 1 CPU core and the entire memory.

Table 3.

Percentage of correct placements (shown as %) and the mean delta error (Inline graphic, shown as Mean) on the RNASim data sets with various backbone sizes (Inline graphic). % and Inline graphic are over 1000 placements (except n = 200,000, which is over 200 placements). Running pplacer and EPA-ng was not possible (n.p) for trees with at least 10,000 leaves and failed in some cases (number of fails shown) for 5000 leaves.

           n = 500       n = 1000      n = 5000          n = 10,000    n = 100,000   n = 200,000
Method     %    Mean     %    Mean     %           Mean  %    Mean     %    Mean     %    Mean
APPLES*    75   0.32     71   0.43     77          0.37  79   0.33     84   0.25     87   0.25
CLOSEST    52   1.16     53   1.18     54          1.15  59   0.90     61   0.69     63   0.70
EPA-ng     73   0.33     73   0.31     78          0.24  79   0.22     n.p  n.p      n.p  n.p
pplacer    80   0.23     81   0.20     Fail (800)  n.p   n.p  n.p      n.p  n.p      n.p  n.p

For backbone trees with more than 10,000 leaves, pplacer and EPA-ng are not able to run given computational resources at hand, and CLOSEST is not very accurate (finding the best placement in only 59% of cases). However, APPLES* continues to be accurate for all backbone sizes. As the backbone size increases, the taxon sampling of the tree is improving (recall that these trees are all random subsets of the same tree). With denser backbone trees, APPLES* has increased accuracy despite placing on larger trees (Fig. 3a, Table 3). For example, using a backbone tree of 200,000 leaves, APPLES* is able to find the best placement of query sequences in 87% of cases, which is better than the accuracy of either APPLES* or ML tools on any backbone size. Thus, an increased taxon sampling helps accuracy, but ML tools are limited in the size of the tree they can handle given relatively powerful machines (e.g., 64GB of memory).

Running time and memory

As the backbone size increases, the running times of pplacer and APPLES grow close to linearly with the size of the backbone tree, Inline graphic, whereas running time of EPA-ng seems to grow with Inline graphic (Fig. 3c). APPLES is on average 13 times faster than pplacer and 7.5 times faster than EPA-ng on backbone trees with 1000 leaves, and is 41 times faster than EPA-ng with 10,000-taxon backbones.
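The empirical exponents read off the log-log fits can be estimated with an ordinary least-squares line through (log n, log time). The sketch below uses synthetic timings generated with a known exponent, not the measurements behind Fig. 3; `loglog_slope` is a hypothetical helper name.

```python
import math


def loglog_slope(sizes, values):
    """Least-squares slope of log(values) against log(sizes).

    For values growing like c * n**a, the returned slope approximates a.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic running times (seconds) that grow as n**1.1 by construction.
sizes = [5000, 10000, 100000, 200000]
synthetic_times = [0.002 * n ** 1.1 for n in sizes]

slope = loglog_slope(sizes, synthetic_times)
print(f"estimated exponent: {slope:.2f}")
```

Restricting the fit to the larger sizes, as done for the APPLES lines, keeps constant overheads at small n from flattening the estimated slope.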

The memory consumption of all methods increases close to linearly as Inline graphic increases, but APPLES requires dramatically less memory (Fig. 3d). For example, for placing on a backbone with 10,000 leaves, EPA-ng requires 51GB of memory, whereas APPLES requires only 0.4GB. APPLES easily scales to a backbone of 200,000 sequences, running in only 1 min and using 6GB of memory per query (including all precomputations in the dynamic programming). These numbers also include the time and memory needed to compute the distance between the query sequence and all the backbone sequences.

We next test the scalability of APPLES* and EPA-ng with respect to the number of queries, Inline graphic, in one run given 28 CPU cores and 28 threads on the RNASim-QS data set. Both methods spend time on preprocessing steps that will be amortized over a large number of queries. The running time of APPLES*, as expected, increases linearly with Inline graphic and grows more slowly for Inline graphic (due to the preprocessing) (Fig. 4). The patterns of running time of EPA-ng are surprising. Increasing Inline graphic from 1 to 768 decreases the running time instead of increasing it. While the exact reasons for the reductions are not clear to us, we note that EPA-ng developers have taken great care to implement various optimizations, especially for utilizing multiple cores. Only with Inline graphic do we start to see increasing running times for EPA-ng. With Inline graphic, the running time of EPA-ng seems to grow with Inline graphic; however, we suspect further increasing Inline graphic would increase the exponent (asymptotic running time cannot theoretically be less than Inline graphic as all queries need to be read). Aside from the fluctuation due to small sample size for small Inline graphic values, the number of queries does not seem to affect the accuracy of either method, as expected, since both methods treat queries independently (Supplementary Appendix 2, Fig. S3 available on Dryad).

Figure 4.

Scalability with respect to the number of queries. We show wall-clock running time with respect to increased numbers of queries (Inline graphic) in one execution given 28 CPU cores and 28 threads on an Intel Xeon E5 CPU with 64 GB of memory. We fit a line to running times in log-log scale only for Inline graphic because otherwise, the preprocessing time would distort estimates (note: Inline graphic except in approximation if Inline graphic).

Comparing parameters of APPLES

Comparing five models of sequence evolution that can be used with APPLES, we see similar patterns of accuracy across all models despite their varying complexity, ranging from 0 to 12 parameters (Supplementary Appendix 2, Fig. S4 available on Dryad). Since the JC69 model is parameter-free and results in similar accuracy to others, we opted to use it as the default. Next, we ask whether imposing the constraint to disallow negative branch lengths improves the accuracy. The answer depends on the optimization strategy. Forcing non-negative lengths marginally increases the accuracy for MLSE but dramatically reduces the accuracy for ME (Supplementary Appendix 2, Fig. S5a,b available on Dryad). Thus, we always impose non-negative constraints on MLSE but never for ME. Likewise, our Hybrid method includes the constraint for the first MLSE step but not for the following ME step (Supplementary Appendix 2, Fig. S5c available on Dryad).
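For reference, the JC69 distance is the standard closed-form correction of the proportion of mismatching sites, d = -(3/4) ln(1 - 4p/3). The sketch below is a minimal illustration of that formula and not the APPLES implementation; skipping gaps and ambiguous characters and returning infinity at saturation are assumptions made here.

```python
import math


def jc69_distance(seq_a, seq_b):
    """JC69 distance between two aligned nucleotide sequences.

    Sites where either sequence has a gap or ambiguity code are skipped
    (an assumption of this sketch).
    """
    valid = set("ACGT")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)  # proportion of mismatches
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# One mismatch over eight comparable sites (p = 0.125).
print(round(jc69_distance("ACGTACGT", "ACGTACGA"), 4))  # 0.1367
```

Because the formula has no free parameters, nothing needs to be estimated per pair, which is consistent with the choice of JC69 as a simple default.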

The next parameter to choose is the weighting scheme. Among the three methods available in APPLES, the best accuracy belongs to the FM scheme closely followed by the BE (Supplementary Appendix 2, Fig. S6 available on Dryad). The OLS scheme, which does not penalize long distances, performs substantially worse than FM and BE. Thus, the most aggressive form of weighting (FM) results in the best accuracy. Fixing the weighting scheme to FM and comparing the three optimization strategies (MLSE, ME, and Hybrid), the MLSE approach has the best accuracy (Fig. 2), finding the correct placement 84% of the time (mean error: 0.18), and ME has the lowest accuracy, finding the best placement in only 67% of cases (mean error: 0.70). The Hybrid approach is between the two (mean error: 0.34) and fails to outperform MLSE on this data set. However, when we restrict the RNASim backbone trees to only 20 leaves, we observe that Hybrid can have the best accuracy (Supplementary Appendix 2, Fig. S7 available on Dryad).
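To make the role of weighting concrete, the sketch below evaluates a weighted least-squares error in which each squared residual is divided by the sequence distance raised to a power k. It assumes the usual definitions of the three schemes (k = 0 for OLS, k = 1 for BE, k = 2 for FM), which are not restated in this section; the function name and the toy distances are hypothetical.

```python
# Assumed exponents for the three weighting schemes.
EXPONENT = {"OLS": 0, "BE": 1, "FM": 2}


def weighted_ls_error(seq_dists, tree_dists, scheme="FM"):
    """Sum of squared residuals, each down-weighted by seq_dist ** k.

    seq_dists: sequence-based distances from the query to backbone leaves
               (must be positive when k > 0).
    tree_dists: corresponding path lengths on the tree after placement.
    """
    k = EXPONENT[scheme]
    total = 0.0
    for delta, d in zip(seq_dists, tree_dists):
        if k > 0 and delta <= 0:
            raise ValueError("sequence distances must be positive")
        total += (delta - d) ** 2 / delta ** k
    return total


# Two leaves with the same absolute residual (0.1) but different distances.
seq_dists = [0.1, 1.0]
tree_dists = [0.2, 1.1]
for scheme in ("OLS", "BE", "FM"):
    print(scheme, round(weighted_ls_error(seq_dists, tree_dists, scheme), 4))
# OLS 0.02
# BE 0.11
# FM 1.01
```

With equal residuals, the nearby leaf dominates the error under FM, whereas OLS treats both leaves alike; this is the sense in which FM down-weights long distances most aggressively.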

Comparison to ML with alignment error

We next test the impact of alignment errors. On the RNASim-AE data set with Inline graphic = 200,000 sequences, we observe Inline graphic placement accuracy using SEPP+APPLES* (Fig. 5), which is a statistically significant (Inline graphic according to McNemar’s test) but relatively modest Inline graphic reduction compared to placing using the true alignments. On the Inline graphic = 10,000 backbone, placement accuracy drops from 83% on the true alignment to Inline graphic using SEPP+APPLES*, and from 81% to Inline graphic using SEPP+EPA-ng; neither change is statistically significant (Inline graphic and Inline graphic, respectively). Despite the relatively small drops in placement accuracy, the impact of alignment error on delta error can be more pronounced (Fig. 5). The mean delta error goes up from Inline graphic to Inline graphic for Inline graphic 10,000, which is statistically significant (Inline graphic according to a one-sided paired t-test). For Inline graphic, the error also increases significantly (Inline graphic) from Inline graphic to Inline graphic edges.

Figure 5.

Impact of alignment error on placement accuracy. Left and right panels show placement accuracy and mean delta error of leave-many-out experiments for backbone size of 10,000 and 200,000 (200 queries each).

Recall that SEPP+APPLES* eliminates the need for decomposing the backbone tree into smaller placement subtrees, as default SEPP must do to deal with memory requirements of pplacer (which it internally uses). Comparison of default SEPP and SEPP+APPLES* shows that incorporating APPLES* inside SEPP reduces the mean delta error from Inline graphic to Inline graphic while lowering placement accuracy from 81% to 80%; however, neither change is statistically significant (P = 0.15, one-sided paired t-test, and P = 0.66, McNemar’s test, respectively). To summarize, highly accurate placement is possible even with estimated alignments with backbones of size up to 200,000 using the divide-and-conquer methods PASTA and SEPP for estimating alignments of the backbone and query, respectively.
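Because the same queries are placed under both conditions, correctness is compared on paired outcomes, which is the setting McNemar’s test addresses. The sketch below shows an exact (binomial) two-sided version on made-up outcomes; the text does not state which variant of the test was used, and the function names and data are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count queries placed correctly by only one of the two methods."""
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(only_a, only_b)
    # Under the null, each discordant pair favors either method with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Made-up per-query outcomes (True = placed on the correct branch).
true_alignment = [True] * 9 + [True] + [False] * 2
estimated_alignment = [False] * 9 + [True] + [False] + [True]

only_true, only_est = discordant_counts(true_alignment, estimated_alignment)
print(only_true, only_est, mcnemar_exact(only_true, only_est))
```

Only the discordant queries enter the test, so two conditions with similar overall accuracy can still differ significantly if the disagreements are one-sided.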

Discussion

We introduced APPLES: a new method for adding query species onto large backbone trees using both unassembled genome skims and aligned data. We now provide further observations on our results and on distance-based placement.

Further Observations on the Results

The accuracy of APPLES was very close to ML in most settings where we could run ML; the accuracy advantages of ML were particularly small for the RNASim data set where both methods face model misspecification. As expected from the substantial evidence in the literature (Zwickl and Hillis 2002; Hillis et al. 2003), improved taxon sampling increased the accuracy of placement. Thus, overall, the best accuracy on the RNASim data set was obtained by APPLES* run on the full reference data set, further motivating its use when large backbones are available. Despite many strides made in terms of scalability by the new method EPA-ng, ML methods still have to restrict their backbone to at most several thousand species given reasonable amounts of memory (up to 64GB in our case). We also note that it is possible to follow the APPLES* placement with a round of ML placement on smaller trees, but the small differences in accuracy of ML and APPLES* on smaller trees did not give us compelling reasons to try such hybrid approaches.

APPLES was an order of magnitude or more faster and less memory-hungry than ML tools (pplacer and EPA-ng) for single query runs. However, for placing large numbers of queries (e.g., as found in metagenomic data sets) on a relatively small backbone (Inline graphic), EPA-ng had an advantage since it is specifically designed to tackle scalability of multiple queries.

Advantages in memory consumption and scalability to large backbone trees remain for APPLES* regardless of the number of queries. The Python APPLES code is not optimized nearly as much as EPA-ng and can also benefit from some of the heuristic techniques used by EPA-ng. We plan for future versions of the code to focus on improved scalability as the number of queries increases.

By incorporating APPLES inside SEPP, we were able to create a method that can do both alignment and placement on very large backbones with reasonable computational requirements and high accuracy. We observed relatively small reductions in accuracy as a result of alignment error, a pattern that we find remarkable given the size of the tree and the amount of error in the estimated alignment. The default SEPP method deals with large backbone trees by dividing the backbone tree into “placement” subtrees and choosing which subtree to place on using bit-scores produced by hidden Markov models (HMMs) trained on subsets of sequences. The original paper had shown that the best accuracy is obtained with the largest possible backbone subtrees given computational limitations. APPLES now enables us to eliminate the need for decomposition into placement subsets, and in doing so, reduces placement error.

In our analyses, we observed no advantage in using models more complex than JC69+Inline graphic for distance calculation inside APPLES. However, these results may be due to our pairwise estimation of model parameters (e.g., base compositions). More complex models may perform better if we instead estimate model parameters on the backbone alignment/tree and reuse the parameters for queries (or simultaneously among all queries and the reference sequences). Simultaneous estimation of distances has many advantages over using independent distances for the de novo case (Tamura et al. 2004; Xia 2009); these results give us hope that using simultaneous distances inside APPLES can further improve its accuracy.

Branch lengths of our backbone trees were computed using the same distance model as the one used for computing the distance of the query to backbone species. Using consistent models for the query and for the backbone branch lengths is essential for obtaining good accuracy (see Supplementary Appendix 2, Fig. S8 available on Dryad for evidence). Thus, in addition to having a large backbone tree at hand, we need to ensure that branch lengths are computed using the right model. Fortunately, FastTree-2 can compute both topologies and branch lengths on large trees in a scalable fashion, without a need for quadratic time/memory computation of distance matrices (Price et al. 2010).

Observations on Distance-Based Placement

Phylogenetic insertion using the ME criterion has been previously studied for the purpose of creating an algorithm for greedy minimum evolution. Desper and Gascuel (2002) have designed a method that, given the tree Inline graphic, can update it to get a tree with Inline graphic leaves in Inline graphic after precomputation of a data structure that gives the average sequence distances between all adjacent clusters in Inline graphic. The formulation by Desper and Gascuel (2002) has a subtle but consequential difference from our ME placement. Their algorithm does not compute branch lengths for the inserted sequence (e.g., Inline graphic and Inline graphic). It is able to compute the optimal placement topology without knowing branch lengths of the backbone tree. Instead, it relies on pairwise distances among backbone sequences (Inline graphic), which are precomputed and saved in the data structure mentioned before. In the context of the greedy algorithm for tree inference, in each iteration, the data structure can be updated in Inline graphic, which does not impact the overall running time of the algorithm. However, if we were to start with a tree of Inline graphic leaves, computing this structure from scratch would still require Inline graphic. Thus, computing the placement for a new query would need quadratic time, unless the Inline graphic precomputation is allowed to be amortized over Inline graphic queries. Our formulation, in contrast, uses branch lengths of the backbone tree (which is assumed fixed) and thus never uses pairwise distances among the backbone sequences. Thus, using tree distances is what allows us to develop a linear time algorithm. Finally, we note that in our experimental analyses, we were not able to test the distance-based algorithm of Desper and Gascuel (2002) because it is available only as part of the greedy algorithm inside FastME and not as a stand-alone feature for placing on a given tree.

We emphasize that our results in assembly-free tests do not advocate for the use of assembly-free methods when assemblies are available. Moreover, we have no evidence that assembly-free methods are effective in inferring deep branches of a phylogeny. Instead, our results show that assembly-free phylogenetic placement is effective in sample identification where assembly is not possible due to low coverage. In assembly-free analyses, we used Skmer to get distances because alternative alignment-free methods of estimating distance generally either require assemblies (e.g., Haubold 2014; Leimeister and Morgenstern 2014; Leimeister et al. 2017) or higher coverage than Skmer (e.g., Yi and Jin 2013; Benoit et al. 2016; Ondov et al. 2016); however, combining APPLES with other alignment-free methods can be attempted in the future (finding the best way of computing distances without assemblies was not our focus). Moreover, the Skmer paper has described a trick that can be used to compute log-det distances from genome skims. Future studies should test whether using that trick and using GTR instead of JC69 improves accuracy.

Availability

The APPLES software is publicly available as open source at https://github.com/balabanmetin/apples.

Funding

This work was supported by the National Science Foundation (NSF) [IIS-1565862]; National Institutes of Health (NIH) [subaward 5P30AI027767-28 to M.B. and S.M.]; and NSF [NSF-1815485 to M.B., S.S., and S.M.]. Computations were performed on the San Diego Supercomputer Center (SDSC) through XSEDE allocations, which is supported by the NSF [ACI-1053575].

Supplementary material

Data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.78nf7dq.

References

  1. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410.
  2. Barbera P., Kozlov A.M., Czech L., Morel B., Darriba D., Flouri T., Stamatakis A. 2019. EPA-ng: massively parallel evolutionary placement of genetic sequences. Syst. Biol. 68:365–369.
  3. Benoit G., Peterlongo P., Mariadassou M., Drezen E., Schbath S., Lavenier D., Lemaitre C. 2016. Multiple comparative metagenomics using multiset k-mer counting. PeerJ Comput. Sci. 2:e94.
  4. Berger S.A., Stamatakis A. 2011. Aligning short reads to reference alignments and trees. Bioinformatics 27:2068–2075.
  5. Berger S.A., Krompass D., Stamatakis A. 2011. Performance, accuracy, and Web server for evolutionary placement of short sequence reads under maximum likelihood. Syst. Biol. 60:291–302.
  6. Beyer W.A., Stein M.L., Smith T.F., Ulam S.M. 1974. A molecular sequence metric and evolutionary trees. Math. Biosci. 19:9–25.
  7. Boyd B.M., Allen J.M., Nguyen N., Sweet A.D., Warnow T., Shapiro M.D., Villa S.M., Bush S.E., Clayton D.H., Johnson K.P. 2017. Phylogenomics using target-restricted assembly resolves intra-generic relationships of parasitic lice (Phthiraptera: Columbicola). Syst. Biol. 66:896–911.
  8. Brown D., Truszkowski J. 2013. LSHPlace: fast phylogenetic placement using locality-sensitive hashing. Pac. Symp. Biocomput. 310–319.
  9. Bush A., Sollmann R., Wilting A., Bohmann K., Cole B., Balzter H., Martius C., Zlinszky A., Calvignac-Spencer S., Cobbold C.A., Dawson T.P., Emerson B.C., Ferrier S., Gilbert M.T.P., Herold M., Jones L., Leendertz F.H., Matthews L., Millington J.D.A., Olson J.R., Ovaskainen O., Raffaelli D., Reeve R., Rödel M.-O., Rodgers T.W., Snape S., Visseren-Hamakers I., Vogler A.P., White P.C.L., Wooster M.J., Yu D.W. 2017. Connecting Earth observation to high-throughput biodiversity data. Nat. Ecol. Evol. 1:0176.
  10. Bushnell B. 2014. Bbtools Software Package. Available from: URL http://sourceforge.net/projects/bbmap.
  11. Cavalli-Sforza L.L., Edwards A.W. 1967. Phylogenetic analysis. Models and estimation procedures. Am. J. Hum. Genet. 19(3 Pt 1):233–257.
  12. Clarke L.J., Soubrier J., Weyrich L.S., Cooper A. 2014. Environmental metabarcodes for insects: in silico PCR reveals potential for taxonomic bias. Mol. Ecol. Resour. 14:1160–1170.
  13. Coissac E., Hollingsworth P.M., Lavergne S., Taberlet P. 2016. From barcodes to genomes: extending the concept of DNA barcoding. Mol. Ecol. 25:1423–1428.
  14. Day W.H.E., Sankoff D. 1987. Computational complexity of inferring phylogenies from chromosome inversion data. J. Theoret. Biol. 124(2):213–218.
  15. Desper R., Gascuel O. 2002. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comput. Biol. 9:687–705.
  16. Dodsworth S. 2015. Genome skimming for next-generation biodiversity analysis. Trends Plant Sci. 20:525–527.
  17. Eddy S.R. 1998. Profile hidden Markov models. Bioinformatics 14:755–763.
  18. Eddy S.R. 2009. A new generation of homology search tools based on probabilistic inference. Genome Inform. 23:205–211.
  19. Fan H., Ives A.R., Surget-Groba Y., Cannon C.H. 2015. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16:522.
  20. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
  21. Felsenstein J. 2003. Inferring phylogenies. Massachusetts: Sinauer Associates.
  22. Findley K., Oh J., Yang J., Conlan S., Deming C., Meyer J.A., Schoenfeld D., Nomicos E., Park M., Kong H.H., Segre J.A. 2013. Topographic diversity of fungal and bacterial communities in human skin. Nature 498:367–370.
  23. Fitch W.M., Margoliash E. 1967. Construction of phylogenetic trees. Science 155:279–284.
  24. Fletcher W., Yang Z. 2009. INDELible: a flexible simulator of biological sequence evolution. Mol. Biol. Evol. 26:1879–1888.
  25. Gill S.R., Pop M., Deboy R.T., Eckburg P.B., Turnbaugh P.J., Samuel B.S., Gordon J.I., Relman D.A., Fraser-Liggett C.M., Nelson K.E. 2006. Metagenomic analysis of the human distal gut microbiome. Science (New York, N.Y.) 312:1355–1359.
  26. Guo S., Wang L.-S., Kim J. 2009. Large-scale simulation of RNA macroevolution by an energy-dependent fitness model. arXiv 0912.2326.
  27. Haubold B. 2014. Alignment-free phylogenetics and population genetics. Brief. Bioinform. 15:407–418.
  28. Hebert P.D.N., Cywinska A., Ball S.L., deWaard J.R. 2003. Biological identifications through DNA barcodes. Proc. R. Soc. B 270:313–321.
  29. Hillis D.M., Pollock D.D., McGuire J.A., Zwickl D.J. 2003. Is sparse taxon sampling a problem for phylogenetic inference? Syst. Biol. 52:124–126.
  30. Hinchliff C.E., Smith S.A., Allman J.F., Burleigh J.G., Chaudhary R., Coghill L.M., Crandall K.A., Deng J., Drew B.T., Gazis R., Gude K., Hibbett D.S., Katz L.A., Laughinghouse H.D., McTavish E.J., Midford P.E., Owen C.L., Ree R.H., Rees J.A., Soltis D.E., Williams T.L., Cranston K.A. 2015. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc. Natl. Acad. Sci. USA 112:12764–12769.
  31. Jain C., Rodriguez-R L.M., Phillippy A.M., Konstantinidis K.T., Aluru S. 2018. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9:5114.
  32. Janssen S., McDonald D., Gonzalez A., Navas-Molina J.A., Jiang L., Xu Z.Z., Winker K., Kado D.M., Orwoll E., Manary M., Mirarab S., Knight R. 2018. Phylogenetic placement of exact amplicon sequences improves associations with clinical information. mSystems 3:e00021-18.
  33. Jin L., Nei M. 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. 7:82–102.
  34. Jukes T.H., Cantor C.R. 1969. Evolution of protein molecules. In: Munro H.N., editor. Mammalian protein metabolism, Vol. III. New York: Academic Press; p. 21–132.
  35. Koski L.B., Golding G.B. 2001. The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol. 52:540–542.
  36. Krause L., Diaz N.N., Goesmann A., Kelley S., Nattkemper T.W., Rohwer F., Edwards R.A., Stoye J.. 2008. Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res., 36:2230–2239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Lefort V., Desper R., Gascuel O.. 2015. FastME 2.0: A comprehensive, accurate, and fast distance-based phylogeny inference program. Mol. Biol. Evol., 32:2798–2800. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Leimeister C.-A., Morgenstern B.. 2014. kmacs: the k -mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30:2000–2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Leimeister C.-A., Sohrabi-Jahromi S., Morgenstern B. 2017. Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinformatics, 33:971–979.
  40. Lockhart P.J., Steel M.A., Hendy M.D., Penny D. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol., 11:605–612.
  41. Mallo D., De Oliveira Martins L., Posada D. 2016. SimPhy: Phylogenomic simulation of gene, locus, and species trees. Syst. Biol., 65:334–344.
  42. Matsen IV F.A., Evans S.N. 2013. Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison. PLoS One, 8(3):1–15.
  43. Matsen F.A., Kodner R.B., Armbrust E.V. 2010. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11:538.
  44. Matsen F.A., Hoffman N.G., Gallagher A., Stamatakis A. 2012. A format for phylogenetic placements. PLoS One, 7:e31009.
  45. Miller D.E., Staber C., Zeitlinger J., Hawley R.S. 2018. Highly contiguous genome assemblies of 15 Drosophila species generated using nanopore sequencing. G3: Genes, Genomes, Genet., 8:3131–3141.
  46. Mirarab S., Warnow T. 2011. FastSP: linear time calculation of alignment accuracy. Bioinformatics, 27:3250–3258.
  47. Mirarab S., Warnow T. 2015. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics, 31:i44–i52.
  48. Mirarab S., Nguyen N., Warnow T. SEPP: SATé-enabled phylogenetic placement. In: Altman R.B., Dunker A.K., Hunter L., Murray T., Klein T.E., editors. Pacific Symposium on Biocomputing. Algorithms in Bioinformatics. Hackensack: World Scientific Publishing Co; p. 247–258.
  49. Mirarab S., Nguyen N., Guo S., Wang L.-S., Kim J., Warnow T. 2015. PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol., 22:377–386.
  50. Moshiri N. 2018. TreeSwift: a massively scalable Python tree package. New York: Cold Spring Harbor Laboratory; bioRxiv.
  51. Nguyen L.T., Schmidt H.A., Von Haeseler A., Minh B.Q. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol., 32(1):268–274.
  52. Nguyen N.-p., Mirarab S., Liu B., Pop M., Warnow T. 2014. TIPP: taxonomic identification and phylogenetic profiling. Bioinformatics, 30:3548–3555.
  53. Oliphant T.E. 2006. A guide to NumPy, Vol. 1. USA: Trelgol Publishing.
  54. Ondov B.D., Treangen T.J., Melsted P., Mallonee A.B., Bergman N.H., Koren S., Phillippy A.M. 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol., 17:132.
  55. Price M.N., Dehal P.S., Arkin A.P. 2010. FastTree-2 approximately maximum-likelihood trees for large alignments. PLoS One, 5:e9490.
  56. Robinson D.F., Foulds L.R. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147.
  57. Rzhetsky A., Nei M. 1992. A simple method for estimating and testing minimum-evolution trees. Mol. Biol. Evol., 9:945.
  58. Saitou N., Nei M. 1987. The neighbour-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
  59. Sarmashghi S., Bohmann K., Gilbert M.T.P., Bafna V., Mirarab S. 2019. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol. 20:34.
  60. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30:1312–1313.
  61. Stark M., Berger S.A., Stamatakis A., von Mering C. 2010. MLTreeMap—accurate maximum likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics, 11:461.
  62. Steel M. 2009. A basic limitation on inferring phylogenies by pairwise sequence comparisons. J. Theoret. Biol. 256:467–472.
  63. Sunagawa S., Mende D.R., Zeller G., Izquierdo-Carrasco F., Berger S.A., Kultima J.R., Coelho L.P., Arumugam M., Tap J., Nielsen H.B., Rasmussen S., Brunak S., Pedersen O., Guarner F., de Vos W.M., Wang J., Li J., Dore J., Ehrlich S.D., Stamatakis A., Bork P. 2013. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods, 10:1196–1199.
  64. Tamura K., Nei M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial-DNA in humans and chimpanzees. Mol. Biol. Evol., 10:512–526.
  65. Tamura K., Nei M., Kumar S. 2004. Prospects for inferring very large phylogenies by using the neighbor-joining method. Proc. Natl. Acad. Sci. USA, 101:11030–11035.
  66. von Mering C., Hugenholtz P., Raes J., Tringe S.G., Doerks T., Jensen L.J., Ward N., Bork P. 2007. Quantitative phylogenetic assessment of microbial communities in diverse environments. Science 315:1126–1130.
  67. Waddell P.J., Steel M.A. 1997. General time-reversible distances with unequal rates across sites: mixing gamma and inverse Gaussian distributions with invariant sites. Mol. Phylogenet. Evol. 8:398–414.
  68. Warnow T. 2017. Computational phylogenetics: an introduction to designing methods for phylogeny estimation. Cambridge: Cambridge University Press.
  69. Wheeler T.J. 2009. Large-scale neighbor-joining with NINJA. In: Salzberg S.L., Warnow T., editors. Algorithms in Bioinformatics. Berlin, Heidelberg: Springer Berlin Heidelberg; p. 375–389.
  70. Xia X. 2009. Information-theoretic indices and an approximate significance test for testing the molecular clock hypothesis with genetic distances. Mol. Phylogenet. Evol. 52:665–676.
  71. Xia X. 2018. DAMBE7: new and improved tools for data analysis in molecular biology and evolution. Mol. Biol. Evol. 35:1550–1552.
  72. Yi H., Jin L. 2013. Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Res. 41:e75.
  73. Yin C., Shen G., Guo D., Wang S., Ma X., Xiao H., Liu J., Zhang Z., Liu Y., Zhang Y., Yu K., Huang S., Li F. 2016. InsectBase: a resource for insect genomes and transcriptomes. Nucleic Acids Res. 44:D801–D807.
  74. Zwickl D.J., Hillis D.M. 2002. Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51:588–598.

Articles from Systematic Biology are provided here courtesy of Oxford University Press
