Abstract
Phylogenetic tree inference using deep DNA sequencing is reshaping our understanding of rapidly evolving systems, such as the within-host battle between viruses and the immune system. Densely sampled phylogenetic trees can contain special features, including sampled ancestors in which we sequence a genotype along with its direct descendants, and polytomies in which multiple descendants arise simultaneously. These features are apparent after identifying zero-length branches in the tree. However, current maximum-likelihood based approaches are not capable of revealing such zero-length branches. In this paper, we find these zero-length branches by introducing adaptive-LASSO-type regularization estimators for the branch lengths of phylogenetic trees, deriving their properties, and showing regularization to be a practically useful approach for phylogenetics.
Keywords: phylogenetics, ℓ1 regularization, adaptive LASSO, sparsity, model selection, consistency, FISTA
1. Introduction
Phylogenetic methods, originally developed to infer evolutionary relationships among species separated by millions of years, are now widely used in biomedicine to investigate very short-time-scale evolutionary history. For example, mutations in viral genomes can inform us about patterns of infection and evolutionary dynamics as they evolve in their hosts on a time-scale of years (Grenfell et al., 2004). Antibody-making B cells diversify in just a few weeks, with a mutation rate around a million times higher than the typical mutation rate for cell division (Kleinstein et al., 2003). Although general-purpose phylogenetic methods have proven useful in these biomedical settings, the basic assumption that evolutionary trees follow a bifurcating pattern need not hold. Our goal is to develop a penalized maximum-likelihood approach to infer non-bifurcating trees (Figure 1).
Figure 1. (a) A cartoon evolutionary scenario, with sampled ancestors (gray dots) and a multifurcation (dashed box). (b) A corresponding standard maximum likelihood phylogenetic inference, without regularization or thresholding.
Although our practical interests concern inference for finite-length sequence data, some situations in biology will lead to non-bifurcating phylogenetic trees, even in the theoretical limit of infinite sequence information. For example, a retrovirus such as HIV incorporates a copy of its genetic material into the host cell upon infection. This genetic material is then used for many copies of the virus, and when more than two descendants from this infected cell are then sampled for sequencing, the correct phylogenetic tree forms a multifurcation from these multiple descendants (a.k.a. a polytomy). In other situations we may sample an ancestor along with a descendant cell, which will appear as a node with a single descendant edge (Figure 1). For example, antibody-making B cells evolve within host in dense accretions of cells called germinal centers in order to better bind foreign molecules (Victora and Nussenzweig, 2012). In such settings it is possible to sample a cell along with its direct descendant. Indeed, upon DNA replication in cell division, one cell inherits the original DNA of the coding strand, while the other inherits a copy which may contain a mutation from the original. If we sequence both of these cells, the first cell is the genetic ancestor of the second cell for this coding region. In this case the correct configuration of the two genotypes is that the first cell is a sampled ancestor of the second cell.
However DNA sequences are finite and often rather short, limiting the amount of information available with which to infer phylogenetic trees. Even though entire genomes are large, the segment of interest for a phylogenetic analysis is frequently small. For example, B cells evolve rapidly only in the hundreds of DNA sites used to encode antibodies, and thus sequencing is typically applied only to this region (Georgiou et al., 2014). Similarly, modern applications of pathogen outbreak analysis using sequencing (Gardy et al., 2015) frequently observe the same sequence, indicating that sampling is dense relative to mutation rates. Because genetic recombination and processes such as viral reassortment (Chen and Holmes, 2008) break the assumption that genetic data has evolved according to a single tree, practitioners often restrict analysis to an even shorter region that they believe has evolved according to a single process.
These shorter sequences further motivate methods for correct non-bifurcating tree inference. Indeed, even if a collection of sequences did in fact diverge in a bifurcating fashion, if no mutations happened in the sequenced region during this diversification (i.e., a zero-length branch), then a non-bifurcating representation is appropriate. We thus expect multifurcations and sampled ancestors whenever the interval between bifurcations is short compared to the total mutation rate in the sequenced region.
Non-bifurcating tree inference has thus far been via Bayesian phylogenetics, with the two deviations from bifurcation handled in two separate lines of work. For multifurcations, Lewis et al. (2005, 2015) develop a prior on phylogenetic trees with positive mass on multifurcating trees, and then perform tree estimation using reversible jump MCMC (rjMCMC) moves between trees. For sampled ancestors, Gavryushkina et al. (2014, 2016) introduce a prior on trees with sampled ancestors and then also use rjMCMC for inference. To our knowledge no priors have been defined that place mass on trees with both multifurcations and sampled ancestors.
Current biomedical applications require a more computationally efficient alternative than these Bayesian techniques. Indeed, current methods for real-time phylogenetics in the course of a viral outbreak use maximum likelihood (Neher and Bedford, 2015; Libin et al., 2017), which is orders of magnitude faster than Bayesian analyses. This is essential because the time between new sequences being added to the database can be shorter than the required execution time for a Bayesian analysis. However, to our knowledge an appropriate maximum-likelihood alternative to such rjMCMC phylogenetic inference for multifurcating trees does not yet exist.
Elsewhere in statistics, researchers find the set of parameters with zero values via penalized maximum likelihood inference, commonly maximizing the sum of a penalty term and a log likelihood function. When the penalty term has a nonzero slope as each variable approaches zero, the penalty will have the effect of “shrinking” that variable to zero when there is not substantial evidence from the likelihood function that it should be nonzero. There is now a substantial literature on such estimators, of which L1 penalized estimators such as LASSO (Tibshirani, 1996) are the most popular.
In this paper, we introduce such regularization estimators into phylogenetics, derive their properties, and show this regularization to be a practically-useful approach for phylogenetics via new algorithms and experiments. Specifically, we first show consistency: that the LASSO and its adaptive variants find all zero-length branches in the limit of long sequences with an appropriate penalty weight. We also derive new algorithms for phylogenetic LASSO and show them to be effective via simulation experiments and application to a Dengue virus data set. Throughout this paper, we assume that the topology of the tree is known, and that the branch lengths of the trees are bounded above by some constant.
Phylogenetic LASSO is challenging and requires additional new techniques above those for classical LASSO. First, the phylogenetic log-likelihood function is non-linear and non-convex. More importantly, unlike the standard settings for model selection where the variables can receive both positive and negative values, the branch lengths of a tree are non-negative. Thus, the objective function of phylogenetic LASSO can only be defined on a constrained compact space, for which the “true parameter” lies on the boundary of the domain. Furthermore the behavior of the phylogenetic log-likelihood on this boundary is untamed: when multiple branch lengths of a tree approach zero at the same time, the log-likelihood function may diverge to infinity, even if it is analytic in the inside of the domain of definition. The geometry of the subset of the boundary where these singularities happen is non-trivial, especially in the presence of randomness in data. All of these issues combine to make theoretical analyses and practical implementation of these estimators an interesting challenge.
2. Mathematical framework
2.1. Phylogenetic tree.
A phylogenetic tree is a tree graph τ such that each leaf has a unique name, and such that each edge e of the tree is associated with a non-negative number qe. We will denote by E and V the set of edges and vertices of the tree, respectively. We will refer to τ and (qe)e∈E as the tree topology and the vector of branch lengths, respectively. Any edge adjacent to a leaf is called a pendant edge, and any other edge is called an internal edge. A pendant edge with zero branch length leads to a sampled ancestor while an internal edge with zero branch length is part of a polytomy.
As mentioned above, we assume that the topology τ of the tree is known and we are interested in reconstructing the vector of branch lengths. Since the tree topology is fixed, the tree is completely represented by the vector of branch lengths q. We will consider the set of all phylogenetic trees with topology τ and branch lengths bounded from above by some g0 > 0. This arbitrary upper bound on branch lengths is for mathematical convenience and does not represent a real constraint for the short-term evolutionary setting of interest here.
2.2. Phylogenetic likelihood.
We now summarize the likelihood-based formulation of phylogenetics. The input data for this formulation is a collection of molecular sequences (such as DNA sequences) that have been aligned into a collection of sites. We assume that the differences between sequences at each site are due to a point-mutation process that is modeled with a continuous-time Markov chain. One can use a dynamic program to calculate a likelihood (detailed below), allowing one to select the maximum-likelihood phylogenetic tree and model parameters.
We will follow the most common setting for likelihood-based phylogenetics: a reversible continuous-time Markov chain model of substitution which is IID across sites. Briefly, let Ω denote the set of states and let r = |Ω|; for convenience, we assume the states have indices 1 to r. We assume that mutation events occur according to a continuous-time Markov chain on states Ω. Specifically, the probability of ending in state y after “evolutionary time” t given that the site started in state x is given by the xy-th entry of P(t), where P(t) is the matrix-valued function P(t) = e^{Qt}, and the matrix Q is the instantaneous rate matrix of the evolutionary model. Here Q is normalized to have mean rate 1, so “evolutionary time” t is measured in terms of the expected number of substitutions per site. We assume that the rate matrix Q is reversible with respect to a stationary distribution η on the set of states Ω.
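As a concrete sketch of this setup (our own illustration, not code from the paper), the following computes P(t) = e^{Qt} for the Jukes-Cantor model, whose rate matrix normalized to mean rate 1 has off-diagonal entries 1/3 and a uniform stationary distribution; the function name and the eigendecomposition route are our own choices.

```python
import numpy as np

# Jukes-Cantor rate matrix on r = 4 DNA states, normalized to mean rate 1:
# off-diagonal entries 1/3, diagonal entries -1, uniform stationary distribution.
Q = np.full((4, 4), 1.0 / 3.0)
np.fill_diagonal(Q, -1.0)

def transition_matrix(t):
    """P(t) = exp(Qt), computed via eigendecomposition (this Q is symmetric)."""
    w, V = np.linalg.eigh(Q)
    return V @ np.diag(np.exp(w * t)) @ V.T
```

Each row of P(t) sums to one, and as t grows every row approaches the stationary distribution (1/4, 1/4, 1/4, 1/4); for Jukes-Cantor the result agrees with the closed form P_xy(t) = 1/4 + (3/4)e^{−4t/3} for x = y and 1/4 − (1/4)e^{−4t/3} otherwise.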
We will use the term state assignment to refer to a single-site labelling of the leaves of the tree by characters in Ω. For a fixed vector of branch lengths q, the phylogenetic likelihood is defined as follows and will be denoted by L(q). Let Yk = (Y(1), Y(2),…, Y(k)) ∈ Ω^{N×k} be the observed sequences (with characters in Ω) of length k over N leaves (i.e., each of the Y(i)’s is a state assignment). We will say that a function f extends a function g if f has a larger domain than g but agrees with g on its domain. The likelihood of observing Yk given the tree has the form

L(q) = ∏_{i=1}^{k} ∑_{a_i} η(a_i(ρ)) ∏_{(u,v)∈E} P_{a_i(u) a_i(v)}(q_{uv}),

where ρ is any internal node of the tree, a_i ranges over all extensions of Y(i) to the internal nodes of the tree, a_i(u) denotes the state assigned to node u by a_i, P_{xy}(t) denotes the transition probability from character x to character y across an edge of length t defined by a given evolutionary model, and η is the stationary distribution of this evolutionary model. The value of the likelihood does not depend on the choice of ρ due to the reversibility assumption.
We will also denote ℓk(q) = log L(q) and refer to it as the log-likelihood function given the observed sequence data. We allow the likelihood of a tree given data to be zero, and thus ℓk is defined on [0, g0]^E with values in the extended real line [−∞, 0]. We note that ℓk is continuous; that is, for any vector of branch lengths q and any sequence q^(n) → q, we have ℓk(q^(n)) → ℓk(q), even if ℓk(q) = −∞.
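To make the sum-over-extensions form of the likelihood concrete, here is an illustrative brute-force evaluation of a single-site likelihood on a three-leaf star tree under Jukes-Cantor (a hypothetical helper, not the paper's implementation). With a single internal node, the sum over extensions reduces to a sum over the root state.

```python
import numpy as np

def jc_prob(t):
    """Jukes-Cantor transition matrix P(t) in closed form (mean rate 1)."""
    e = np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), 0.25 + 0.75 * e, 0.25 - 0.25 * e)

def site_likelihood_star(branch_lengths, leaf_states):
    """Single-site likelihood on a star tree rooted at its internal node rho:
    sum over the unobserved root state x of eta(x) * prod_i P_{x, y_i}(q_i),
    with eta uniform under Jukes-Cantor."""
    total = 0.0
    for x in range(4):
        term = 0.25  # eta(x)
        for q, y in zip(branch_lengths, leaf_states):
            term *= jc_prob(q)[x, y]
        total += term
    return total
```

For instance, with all branch lengths zero and identical leaf states the likelihood is the stationary probability 1/4, while for very long branches it approaches the product of stationary probabilities over leaves.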
Each vector of branch lengths q generates a distribution on the state assignment of the leaves, hereafter denoted by Pq. We will make the following assumptions:
Assumption 2.1 (Model identifiability). For all q, q′ ∈ [0, g0]^E, Pq = Pq′ implies q = q′.
Assumption 2.2. The data Yk are generated on a tree topology τ with vector of branch lengths q* according to the above Markov process, where some components of q* might be zero. We assume further that the tree distance (the sum of branch lengths) between any pair of leaves of the true tree is strictly positive.
We note that model identifiability (Assumption 2.1) is a standard assumption and is essential for inferring evolutionary histories from data in the likelihood-based framework. This condition holds for a wide range of evolutionary models that are used in phylogenetic inference, including the Jukes-Cantor, Kimura, and other time-reversible models (Chang, 1996; Allman et al., 2008; Allman and Rhodes, 2008). The second criterion of Assumption 2.2 ensures that no two leaves will be labeled with identical sequences as sequence length k becomes long.
2.3. Regularized estimators for phylogenetic inference.
Throughout the paper, we consider regularization-type estimators, which are defined as minimizers of the negative phylogenetic log-likelihood penalized with various penalty functions Rk:

q^{k,Rk} ∈ argmin_{q ∈ [0, g0]^E} { −(1/k) ℓk(q) + λk Rk(q) }. | (2.1) |
Here Rk denotes the penalty function and λk is the regularization parameter that controls how the penalty function impacts the estimates. Different forms of the penalty function will lead to different statistical estimators of the generating tree.
The existence of a minimizer as in (2.1) is guaranteed by the following Lemma (proof in the Appendix):
Lemma 2.3. If the penalty Rk is continuous on [0, g0]^E, then for any λk > 0 and observed sequences Yk, there exists a q^{k,Rk} ∈ [0, g0]^E minimizing (2.1).
We are especially interested in the ability of the estimators to detect polytomies and sampled ancestors. This leads us to the following definition of topological consistency, which in the usual variable selection setting is sometimes called sparsistency.
Definition 2.4. For any vector of branch lengths q, we denote the index set of zero entries by Z(q) = {e ∈ E : qe = 0}.

We say a regularized estimator q^{k,Rk} with penalty function Rk is topologically consistent if for all data-generating branch lengths q*, we have

lim_{k→∞} P( Z(q^{k,Rk}) = Z(q*) ) = 1.
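As an illustration of this definition, a minimal sketch (function names are our own) of the zero-entry index set and the topological-consistency check it induces for a single estimate:

```python
def zero_set(q, tol=0.0):
    """Index set Z(q) of zero entries of a branch-length vector; a small
    tolerance can absorb numerical round-off from the optimizer."""
    return {e for e, qe in enumerate(q) if qe <= tol}

def same_topology(q_hat, q_true, tol=1e-8):
    """Does the estimate recover exactly the zero-length edges of the truth?"""
    return zero_set(q_hat, tol) == zero_set(q_true)
```

Topological consistency asks that same_topology holds with probability tending to one as the sequence length grows.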
Definition 2.5 (Phylogenetic LASSO). The phylogenetic LASSO estimator is (2.1) with the standard LASSO penalty Rk(q) = ∥q∥1, which in our setting of non-negative qe is simply Rk(q) = ∑_{e∈E} qe.

We will use q^{k,lasso} to denote the phylogenetic LASSO estimate, namely the minimizer of (2.1) with this penalty.
In general, the classical LASSO may not be topologically consistent, since the ℓ1 penalty forces the coefficients (in our case, the branch lengths) to be equally penalized (Zou, 2006). To address this issue, one may assign different weights to different branch lengths by first constructing a naive estimate of the branch lengths, then using this initial estimate to design the weights of the penalties. Such an “adaptive” procedure is exactly the idea of adaptive LASSO, as follows.
Definition 2.6 (Adaptive LASSO (Zou, 2006)). The phylogenetic adaptive LASSO estimator is (2.1) with penalty function

Sk(q) = ∑_{e∈E} qe / (q^{k,lasso}_e)^γ

for some γ > 0, where q^{k,lasso} is the phylogenetic LASSO estimate. The regularizing parameter of the adaptive LASSO estimator will be denoted by αk. Here, we use the convention that 0/0 = 0, which means that zero branch lengths contribute nothing to the penalty.
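A small sketch of evaluating such an adaptive penalty (a hypothetical helper; the 0/0 = 0 convention is implemented by skipping edges whose first-stage estimate is zero):

```python
def adaptive_penalty(q, q_lasso, gamma=2.0):
    """Adaptive-LASSO penalty: sum_e q_e / (q_lasso_e)**gamma, where q_lasso
    is a first-stage estimate used to set per-edge weights."""
    total = 0.0
    for qe, we in zip(q, q_lasso):
        if we == 0.0:
            # convention 0/0 = 0: a zero first-stage edge adds nothing
            # (in practice such an edge is kept at zero in the second stage)
            continue
        total += qe / we ** gamma
    return total
```

Edges with small first-stage estimates receive large weights and are shrunk harder, which is what makes the adaptive variant able to recover the zero set when plain LASSO cannot.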
A reviewer has pointed out a nice connection between the adaptive LASSO objective and that of weighted least squares phylogenetics. In weighted least squares phylogenetics, one finds branch lengths and tree topology that minimize the sum of weighted squared differences between a given set of molecular sequence distances Di,j and inferred distances di,j on a phylogenetic tree. Fitch and Margoliash (1967) propose that these squared distances should be weighted by 1/D²_{i,j}, while Beyer et al. (1974) propose weighting with 1/D_{i,j}. These are structurally similar to our definition of phylogenetic adaptive LASSO for γ = 2 and γ = 1, respectively, although in our case these terms are penalties rather than the primary objective function.
Definition 2.7 (Multiple-step adaptive LASSO (Bühlmann and Meier, 2008)). The phylogenetic multiple-step LASSO is defined recursively with the phylogenetic LASSO estimator as the base case (m = 1), and the penalty function in (2.1) at step m being

R^[m]_k(q) = ∑_{e∈E} qe / (q^{[m−1]}_e)^γ,

where γ > 0 and q^{[m−1]} is the (m − 1)-step regularized estimator with penalty function R^[m−1]_k. The regularizing parameter of the m-step adaptive LASSO estimator will be denoted by α^[m]_k. Again, we use the convention that 0/0 = 0.
In this paper, we aim to prove that if the weights are data-dependent and cleverly chosen, then the estimators are topologically consistent. Our proof design relies on the parameter γ, which dictates how strongly we penalize small edges in the estimation. In Section 3, we show that if γ is sufficiently large (γ > β − 1, where β is a constant that depends on the structure of the problem), the corresponding LASSO procedures are topologically consistent.
2.4. Related work.
There is a large literature on penalized M-estimators with possibly non-convex loss or penalty functions from both theoretical and numerical perspectives. The optimization of such estimators has its own rich literature:
In the context of least squares and convex regression with non-convex penalties, several numerical procedures have been proposed, including local quadratic approximation (LQA) (Fan and Li, 2001), the minorize-maximize (MM) algorithm (Hunter and Li, 2005), local linear approximation (LLA) (Zou and Li, 2008), the concave-convex procedure (CCP) (Kim et al., 2008) and coordinate descent (Kim et al., 2008; Mazumder et al., 2011). Zhang and Zhang (2012) provided statistical guarantees for global optima of least-squares linear regression with non-convex penalties and showed that gradient descent starting from a LASSO solution would terminate in specific local minima. Fan et al. (2014) proved for convex losses that the LLA algorithm initialized with a LASSO solution attains a local solution with oracle statistical properties. Wang et al. (2013) proposed a calibrated concave-convex procedure that can achieve the oracle estimator.
To enable these analyses, various sufficient conditions for the success of ℓ1-relaxations have been proposed, including restricted eigenvalue conditions (Bickel et al., 2009; Meinshausen and Yu, 2009) and the sparse Riesz condition (Zhang and Huang, 2008). Pan and Zhang (2015) also provide results showing that under restricted eigenvalue assumptions, a certain class of non-convex penalties yields estimates that are consistent in ℓ2-norm.
For studies of regularized estimators with non-convex losses, one prominent approach is to impose a weaker condition known as restricted strong convexity on the empirical loss function, which involves a lower bound on the remainder in the first-order Taylor expansion of the loss function (Negahban et al., 2012; Agarwal et al., 2010; Loh and Wainwright, 2011, 2013, 2017).
Outside of the optimization framework, previous work has developed regularized procedures aiming at support recovery and model selection. The goal of this research is to identify the support (the non-zero components) of the data-generating vector of parameters. Meinshausen and Bühlmann (2006) and Zhao and Yu (2006) prove that the Irrepresentable Conditions are almost necessary and sufficient for LASSO to select the true model, which provides a foundation for applications of LASSO for feature selection and sparse representation. Under a sparse Riesz condition on the correlation of design variables, Zhang and Huang (2008) prove that the LASSO selects a model of the correct order of dimensionality and selects all coefficients of greater order than the bias of the selected model. Zou (2006) introduces the adaptive LASSO algorithm, which produces a consistent estimate of the support even in cases when the LASSO may not be consistent. Loh and Wainwright (2017) also show that for certain non-convex optimization problems, under restricted strong convexity and a beta-min condition (which provides a lower bound on the minimum signal strength), support recovery consistency may be guaranteed.
Our approach for phylogenetic LASSO is inspired by previous work of Zou (2006) and Bühlmann and Meier (2008) who carefully choose the weights of the penalty function. To enable the theoretical analyses of the constructed estimators, we derive a new condition that is similar to the Restricted Strong Convexity condition (Loh and Wainwright, 2013; Loh, 2017). However, instead of imposing regularity conditions directly on the empirical log-likelihood function, we use concentration arguments to analyze the empirical log-likelihood function through its expectation.
We note that penalized likelihood has appeared before in phylogenetics (Kim and Sanderson, 2008; Dinh et al., 2018), although we believe ours to be the first application of a LASSO-type penalty on phylogenetic branch lengths.
3. Theoretical properties of LASSO-type regularized estimators for phylogenetic inference
We next show convergence and topological consistency of the LASSO-type phylogenetic estimators introduced in the previous section. As described in the introduction, phylogenetic LASSO is a non-convex regularization problem for which the true estimates lie on the boundary of a space on which the likelihood function is untamed. To circumvent these problems, we take a minor departure from the standard approach for the analysis of non-convex regularization: instead of imposing regularity conditions directly on the empirical log-likelihood function, we study the expected per-site log-likelihood and investigate its regularity. This function enables us to isolate the singular points and derive a local regularity condition that is similar to the Restricted Strong Convexity condition (Loh and Wainwright, 2013; Loh, 2017). This leads us to study the fast-rate generalization of the empirical log-likelihood in a PAC learning framework (Van Erven et al., 2015; Dinh et al., 2016).
3.1. Definitions and lemmas.
We begin by setting the stage with needed definitions and lemmas. All proofs have been deferred to the Appendix.
Definition 3.1. We define the expected per-site log-likelihood

ϕ(q) = E_{ψ∼Pq*}[ log Pq(ψ) ]

for any vector of branch lengths q.
Definition 3.2. For any μ > 0, we denote by Uμ the set of all branch length vectors q ∈ [0, g0]^E such that Pq(ψ) ≥ μ for all state assignments ψ to the leaves.
We have the following result, where ∥·∥ denotes the ℓ2-norm on ℝ^E.
Lemma 3.3 (Limit likelihood). The vector q* is the unique maximizer of ϕ, and

(1/k) ℓk(q) → ϕ(q) almost surely as k → ∞. | (3.1) |

Moreover, there exist β ≥ 2 and c1 > 0 depending on N, Q, η, g0, μ such that

ϕ(q*) − ϕ(q) ≥ c1 ∥q − q*∥^β for all q ∈ Uμ. | (3.2) |
Proof. The first statement follows from the identifiability assumption, and (3.1) is a direct consequence of the Law of Large Numbers. Equation (3.2) follows from the Lojasiewicz inequality (Ji et al., 1992) for ϕ on Uμ, which applies because ϕ is an analytic function defined on the compact set Uμ with q* as its only maximizer in Uμ. □
Group-based DNA sequence evolution models are a class of relatively simple models that have transition matrix structure compatible with an algebraic group (Evans and Speed, 1993). From Lemma 6.1 of Dinh et al. (2018), we have
Remark 3.4. For group-based models, we can take β = 2.
For any μ > 0, we also have the following estimates showing local Lipschitzness of the log-likelihood functions, recalling that k is the number of sites.
Lemma 3.5. For any μ > 0, there exists a constant c2(N, Q, η, g0, μ) > 0 such that

|ϕ(q) − ϕ(q′)| ≤ c2 ∥q − q′∥ | (3.3) |

and

(1/k) |ℓk(q) − ℓk(q′)| ≤ c2 ∥q − q′∥ | (3.4) |

for all q, q′ ∈ Uμ.
Fix an arbitrary μ > 0. For any q ∈ Uμ we consider the excess loss

Wq(ψ) = log Pq*(ψ) − log Pq(ψ)

and derive a PAC lower bound on the deviation of the excess loss from its expected value on Uμ. First note that since the sites Yk are independent and identically distributed, we have

E[Wq] = ϕ(q*) − ϕ(q).

Moreover, from Lemma 3.5, we have ∥Wq∥∞ ≤ c2 ∥q − q*∥. This implies by (3.2) that

E[Wq] ≥ c1 ∥q − q*∥^β | (3.5) |

for all q ∈ Uμ.
Lemma 3.6. Let Gk be the set of all branch length vectors such that . Let β ≥ 2 be the constant in Lemma 3.3. For any δ > 0 and previously specified variables there exists C(δ, N, Q, η, g0, μ, β) ≥ 1 (independent of k) such that for any k ≥ 3, we have:
with probability greater than 1 − δ.
We also need the following preliminary lemma from (Dinh et al., 2016).
Lemma 3.7. Given 0 < ν < 1, there exist constants C1, C2 > 0 depending only on ν such that for all a, b, x > 0, if x ≤ a x^ν + b then x ≤ C1 a^{1/(1−ν)} + C2 b.
3.2. Convergence and topological consistency of regularized phylogenetics.
We now show convergence and topological consistency of qk,Rk, the regularized estimator (2.1), for various choices of penalty Rk as the sequence length k increases. For convenience, we will assume throughout this section that the parameters N, Q, η, g0, μ and β (defined in the previous section) are fixed.
We first have the following two lemmas guaranteeing that if μ is carefully chosen, then a neighborhood V of q* and the regularized estimator q^{k,Rk} lie inside Uμ with high probability.
Lemma 3.8. There exist μ* > 0 and an open neighborhood V of q* in [0, g0]^E such that V ⊂ Uμ*.
Lemma 3.9. If the sequence {λkRk(q*)} is bounded, then for any δ > 0, there exist μ(δ) > 0 and K(δ) > 0 such that q^{k,Rk} ∈ Uμ(δ) for all k ≥ K(δ) with probability at least 1 − 2δ.
These results enable us to prove a series of theorems establishing consistency and topological consistency of phylogenetic adaptive and multi-step adaptive LASSO. As part of this development we will first use as a hypothesis, and then establish, the technical condition that there exists a C3 > 0 independent of k such that

Rk(q*) ≤ C3. | (3.6) |
This will form an essential part of our recursive proof. As the first step in this project, choosing μ to satisfy these lemmas, we can use the deviation bound of Lemma 3.6 to prove
Theorem 3.10. If λkRk(q*) → 0, then q^{k,Rk} converges to q* almost surely.
Moreover, letting β ≥ 2 be the constant in Lemma 3.3, for any δ > 0 there exist C(δ) > 0 and K(δ) > 0 such that for all k ≥ K, with probability at least 1 − δ we have
| (3.7) |
If we assume further that there exists a C3 > 0 independent of k satisfying (3.6), then there exists C′(δ) > 0 such that for all k ≥ K,
with probability at least 1 − δ.
Another goal of this section is to prove that the phylogenetic LASSO is able to detect zero-length edges, which then yield polytomies and sampled ancestors. Since the estimators are defined recursively, we will establish these properties of adaptive and multi-step phylogenetic LASSO through an inductive argument. Throughout this section, we will continue to use q^{k,Rk} to denote the regularized estimator (2.1). We will use q^{k,Sk} to denote the corresponding adaptive estimator, where Sk(q) = ∑_{e∈E} qe / (q^{k,Rk}_e)^γ for some γ > 0. We will use αk as the regularizing parameter for the second step (regularizing with Sk) and keep λk as the parameter for the first step. These two need not be equal.
For positive sequences fk, gk, we will use the notation fk ≻ gk to mean that fk/gk → ∞ as k → ∞. We have the following result showing consistency of adaptive LASSO, and setting the stage to show topological consistency of adaptive LASSO.
Theorem 3.11. Assume that and that
We have
and the estimator is consistent.
If there exists C3 independent of k satisfying (3.6) then the estimator is topologically consistent.
We also obtain the following Lemma, which establishes the regularity of the multiple-step adaptive LASSO, as described by Equation (3.6):
Lemma 3.12. If q^{k,Sk} is topologically consistent and q^{k,Rk} is consistent, then there exists a C3 independent of k such that the penalty Sk satisfies condition (3.6).
This recursive regularity condition helps establish the main result:
Theorem 3.13. If
and
| (3.8) |
then
- The adaptive LASSO and the m-step LASSO are topologically consistent for all 1 ≤ m ≤ M.
- For all 0 ≤ m ≤ M, the m-step LASSO (including the phylogenetic LASSO and adaptive LASSO) are consistent. Moreover, for all δ > 0 and 0 ≤ m ≤ M, there exists C[m] (δ) > 0 such that for all k ≥ K,
with probability at least 1 − δ. In other words, the convergence of m-step LASSO is of order
where denotes big-O-in-probability.
Remark 3.14. If we further assume that γ > β − 1, then the results of Theorem 3.13 are valid if the regularizing parameter of the m-step estimator is independent of m. This enables us to keep the regularizing parameters unchanged through successive applications of the multi-step estimator.
Similarly, the Theorem applies if γ > β − 1 and
for all m = 1, …, M.
Remark 3.15. Consider the case β = 2 (for example, for group-based models), ϵ > 0 and γ > 1. If we choose (independent of m) such that
then the convergence of m-step LASSO is of order
We further note that for group-based models, we can take β = 2, and that the theoretical results derived in this section apply for γ > 1. The limit case γ = 1 is interesting, and we believe that the results still hold there. However, the techniques we employ in our framework, including the recursive arguments in Theorem 3.13 and the concentration argument (Lemma 3.6), cannot be adapted to resolve this case. This issue arises from the fact that less is known about the empirical phylogenetic likelihood than about its counterparts in classical statistical analyses, which forces us to investigate it indirectly through a concentration argument.
4. Algorithms
In this section, we aim to design a robust solver for the phylogenetic LASSO problem. Many efficient algorithms have been proposed for the LASSO minimization problem
min_{q} { g(q) + λ ∥q∥1 } | (4.1) |
for a variety of objective functions g. Note that we now drop the subscript k denoting sequence length from λ, as we now consider a fixed data set, in contrast to the previous theoretical analysis. When g is the least-squares loss of linear regression, Efron et al. (2004) introduced least angle regression (LARS), which efficiently computes not only the estimates but also the entire solution path. In more general settings, the iterative shrinkage-thresholding algorithm (ISTA) is a typical proximal gradient method that utilizes an efficient and sparsity-promoting proximal mapping operator (also known as the soft-thresholding operator) in each iteration. Adopting Nesterov’s acceleration technique, Beck and Teboulle (2009) proposed a fast ISTA (FISTA) that provably improves the convergence rate significantly.
These previous algorithms do not directly apply to phylogenetic LASSO. LARS is designed for regression and does not apply here. Classical proximal gradient methods are not directly applicable to the phylogenetic LASSO for the following reasons: (i) Nonconvexity. The negative phylogenetic log-likelihood is usually non-convex, so the convergence analysis (described briefly in Section 4.1 below) may not hold; nonconvexity also makes it much harder to adapt to local smoothness, which can lead to slow convergence. (ii) Bounded domain. ISTA and FISTA assume there are no constraints, while in phylogenetic inference we need the branch lengths to be nonnegative: q ≥ 0. (iii) Regions of infinite cost. Unlike typical cost functions, the negative phylogenetic log-likelihood can be infinite, especially when q is sparse, as shown in the following proposition.
Proposition 4.1. Let Y = (y1, y2, …, yN) ∈ Ω^N be an observed character vector on one site. If yi ≠ yj and there is a path (u0, u1), (u1, u2), …, (us, us+1) with u0 = i, us+1 = j on the topology τ such that q_{uℓ uℓ+1} = 0 for ℓ = 0, …, s, then the likelihood of Y is zero, and hence the log-likelihood is −∞.
Proof. Let a be any extension of Y to the internal nodes. Since a(u0) = yi ≠ yj = a(us+1), there must be some ℓ ∈ {0, …, s} such that a(uℓ) ≠ a(uℓ+1). Therefore

P_{a(uℓ) a(uℓ+1)}(q_{uℓ uℓ+1}) = P_{a(uℓ) a(uℓ+1)}(0) = 0,

so the contribution of every extension a to the likelihood vanishes, and the likelihood of Y is zero. □
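A numerical illustration of this proposition (our own toy code, assuming the Jukes-Cantor model): for two leaves joined through a root by zero-length branches, differing observed states force the single-site likelihood to be exactly zero.

```python
import numpy as np

def jc_prob(t):
    """Jukes-Cantor transition matrix P(t) in closed form (mean rate 1)."""
    e = np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), 0.25 + 0.75 * e, 0.25 - 0.25 * e)

def cherry_site_likelihood(q1, q2, y1, y2):
    """Single-site likelihood of two leaves joined at a root node:
    sum_x eta(x) * P_{x,y1}(q1) * P_{x,y2}(q2), with eta uniform."""
    P1, P2 = jc_prob(q1), jc_prob(q2)
    return sum(0.25 * P1[x, y1] * P2[x, y2] for x in range(4))
```

With q1 = q2 = 0, P(0) is the identity, so no root state can match two different leaf states and the likelihood vanishes; any positive branch lengths restore a positive likelihood.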
In what follows, we briefly review the proximal gradient methods (ISTA) and their accelerations (FISTA), and provide an extension of FISTA to accommodate the above issues.
4.1. Proximal Gradient Methods.
Consider the nonsmooth ℓ1-regularized problem (4.1). Gradient descent generally does not work due to the non-differentiability of the ℓ1 norm. The key insight of the proximal gradient method is to view the gradient descent update as the minimization of a local linear approximation to g plus a quadratic proximity term. This suggests the update strategy
$$q_n = \arg\min_{q}\; g(q_{n-1}) + \nabla g(q_{n-1})^\top (q - q_{n-1}) + \frac{1}{2t_n}\|q - q_{n-1}\|^2 + \lambda \|q\|_1 \qquad (4.2)$$
where $t_n$ is the step size. Note that (4.2) corresponds to the proximal map of $h(q) = \lambda\|q\|_1$ evaluated at $q_{n-1} - t_n \nabla g(q_{n-1})$, where the proximal map is defined as follows
$$\operatorname{prox}_{t h}(x) = \arg\min_{u}\; \frac{1}{2t}\|u - x\|^2 + h(u) \qquad (4.3)$$
If the regularization function h is simple, (4.3) is usually easy to solve. For example, when $h(q) = \lambda\|q\|_1$, it is solved by the soft-thresholding operator
$$\mathcal{S}_{\kappa}(x)_i = \operatorname{sign}(x_i)\,(|x_i| - \kappa)_+$$
where $(z)_+ = \max(z, 0)$ denotes the positive part. Applying this operator to (4.2), we get the ISTA update formula
$$q_n = \mathcal{S}_{\lambda t_n}\!\left(q_{n-1} - t_n \nabla g(q_{n-1})\right) \qquad (4.4)$$
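For intuition, the soft-thresholding operator and the ISTA update (4.4) take only a few lines of NumPy; the toy quadratic cost below is ours, purely for illustration:

```python
import numpy as np

def soft_threshold(x, kappa):
    """Proximal map of kappa * ||.||_1: sign(x) * (|x| - kappa)_+."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def ista_step(q, grad_g, t, lam):
    """One ISTA update (4.4): a gradient step on g followed by soft-thresholding."""
    return soft_threshold(q - t * grad_g(q), lam * t)

# Toy smooth cost g(q) = 0.5 * ||q - b||^2, so grad_g(q) = q - b.
b = np.array([3.0, 0.1, -2.0])
q1 = ista_step(np.zeros(3), lambda q: q - b, t=1.0, lam=1.0)
# q1 = soft_threshold(b, 1.0) = [2.0, 0.0, -1.0]
```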
Let $f(q) = g(q) + \lambda\|q\|_1$. Assume g is convex and $\nabla g$ is Lipschitz continuous with Lipschitz constant L; if a constant step size $t_n \equiv t \le 1/L$ is used, then ISTA converges at rate
$$f(q_n) - f(q^*) \le \frac{\|q_0 - q^*\|^2}{2tn} \qquad (4.5)$$
where q* is the optimal solution. This means ISTA converges sublinearly whenever the step size lies in the interval $(0, 1/L]$. Note that ISTA can achieve linear convergence if g is strongly convex.
The convergence rate in (4.5) can be significantly improved using Nesterov’s acceleration technique. The acceleration comes from taking the gradient step at an extrapolated point formed from a weighted combination of the current and previous iterates, similar to gradient descent with momentum. This leads to Algorithm 1, which is essentially equivalent to the fast iterative shrinkage-thresholding algorithm (FISTA) introduced by Beck and Teboulle (2009). Under the same conditions, FISTA enjoys a significantly faster convergence rate
$$f(q_n) - f(q^*) \le \frac{2\|q_0 - q^*\|^2}{t(n+1)^2} \qquad (4.6)$$
Notice that both of the above convergence rates require the step size $t \le 1/L$. In practice, however, the Lipschitz constant is usually unavailable, and backtracking line search is commonly used instead.
Algorithm 1.
Fast Iterative Shrinkage-Thresholding Algorithm (FISTA)
| Input: initial value $q_0$, step size t, regularization coefficient λ | ||
| 1: | Set $q_{-1} = q_0$, $n = 1$ | |
| 2: | while not converged do | |
| 3: | $p_n = q_{n-1} + \frac{n-2}{n+1}\,(q_{n-1} - q_{n-2})$ | ⊳ Nesterov’s Acceleration |
| 4: | $q_n = \mathcal{S}_{\lambda t}\!\left(p_n - t\,\nabla g(p_n)\right)$ | ⊳ Soft-Thresholding Operator |
| 5: | $n = n + 1$ | |
| 6: | end while | |
| Output: $q_{n-1}$ | ||
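A minimal NumPy sketch of FISTA with the simplified (n − 2)/(n + 1) momentum schedule, applied to a toy LASSO regression cost in place of the phylogenetic likelihood (the problem instance is ours, for illustration only):

```python
import numpy as np

def soft_threshold(x, kappa):
    """Proximal map of kappa * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def fista(grad_g, q0, t, lam, n_iter=500):
    """FISTA (Algorithm 1 style) with simplified momentum."""
    q_prev = q0.copy()
    q = q0.copy()
    for n in range(1, n_iter + 1):
        p = q + (n - 2) / (n + 1) * (q - q_prev)        # Nesterov's acceleration
        q_prev = q
        q = soft_threshold(p - t * grad_g(p), lam * t)  # soft-thresholding
    return q

# LASSO regression example: g(q) = 0.5 * ||A q - y||^2, 3 true nonzeros.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
y = A @ np.concatenate([np.ones(3), np.zeros(7)]) + 0.01 * rng.normal(size=50)
L = np.linalg.norm(A, 2) ** 2                           # Lipschitz constant of grad g
q_hat = fista(lambda q: A.T @ (A @ q - y), np.zeros(10), 1.0 / L, lam=1.0)
```

The recovered `q_hat` should have its first three entries near 1 (slightly shrunk by the penalty) and the remaining seven at or near zero.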
4.2. Projected FISTA.
FISTA usually assumes no constraints on the parameters. However, in the phylogenetic case the branch lengths must be non-negative (q ≥ 0). To address this issue, we combine the projected gradient method (which can be viewed as proximal gradient as well) with FISTA to ensure non-negative updates. We refer to this hybrid as projected FISTA (pFISTA). Note that a similar strategy has been adopted by Liu et al. (2016) in tight-frame-based magnetic resonance image reconstruction. Let C be a convex feasible set, and define the indicator function $I_C$ of the set C:
$$I_C(q) = \begin{cases} 0, & q \in C \\ +\infty, & q \notin C. \end{cases}$$
With the constraint q ∈ C, we consider the following projected proximal gradient update
$$q_n = \operatorname{prox}_{t_n (h + I_C)}\!\left(q_{n-1} - t_n \nabla g(q_{n-1})\right) \qquad (4.7)$$
where $h(q) = \lambda\|q\|_1$. Using forward-backward splitting (see Combettes and Wajs, 2006), (4.7) can be approximated as
$$q_n \approx \Pi_C\!\left(\operatorname{prox}_{t_n h}\!\left(q_{n-1} - t_n \nabla g(q_{n-1})\right)\right) \qquad (4.8)$$
where $\Pi_C$ is the Euclidean projection onto C. When $C = \{q : q \ge 0\}$ and $h(q) = \lambda\|q\|_1$, we have the following pFISTA update formula
$$q_n = \left(\mathcal{S}_{\lambda t_n}\!\left(p_n - t_n \nabla g(p_n)\right)\right)_+ = \left(p_n - t_n \nabla g(p_n) - \lambda t_n\right)_+$$
Note that in this case, (4.8) is in fact exact. The projected ISTA (pISTA) update formula can be derived similarly, and we omit it here.
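The exactness is easy to verify numerically: for the nonnegative orthant, projecting after soft-thresholding coincides with the exact proximal map of $\lambda\|\cdot\|_1 + I_{\{q \ge 0\}}$, which is componentwise $(x - \lambda t)_+$. A quick check (with our own helper names):

```python
import numpy as np

def soft_threshold(x, kappa):
    """Proximal map of kappa * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def projected_soft_threshold(x, kappa):
    """Exact prox of kappa * ||.||_1 + I_{q >= 0}: componentwise (x - kappa)_+."""
    return np.maximum(x - kappa, 0.0)

x = np.array([2.0, 0.3, -1.5])
# Projecting after soft-thresholding agrees with the exact prox,
# which is why (4.8) is exact in this case.
assert np.allclose(np.maximum(soft_threshold(x, 0.5), 0.0),
                   projected_soft_threshold(x, 0.5))
```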
4.3. Restarting.
To accommodate nonconvexity and possible infinities of the phylogenetic cost function, we adopt the restarting technique introduced by O’Donoghue and Candes (2013), who used it as a heuristic means of improving the convergence rate of accelerated gradient schemes. In the phylogenetic case, due to the nonconvexity of the negative phylogenetic log-likelihood, backtracking line search may fail to adapt to local smoothness, which can lead to inefficiently small step sizes. Moreover, the LASSO penalty frequently pushes us into the “forbidden” zone $\{q : g(q) = \infty\}$, especially when there are many short branches. We therefore adjust the restarting criteria as follows:
Small step size: restart whenever $t_n$ is less than a restart threshold ϵ.
Infinite cost: restart whenever $g(p_n) = \infty$.
Equipping FISTA with projection and adaptive restarting, we obtain an efficient phylogenetic LASSO solver that we summarize in Algorithm 2.
Algorithm 2.
Projected FISTA with Restarting
| Input: initial $q_0$, default step size t, regularization coefficient λ, restart threshold ϵ, backtracking line search parameter ω ∈ (0, 1). | ||
| 1: | while not converged do | |
| 2: | Set $q_{-1} = q_0$, $n = 1$ | |
| 3: | while not converged do | |
| 4: | $p_n = q_{n-1} + \frac{n-2}{n+1}\,(q_{n-1} - q_{n-2})$ | ⊳ Nesterov’s Acceleration |
| 5: | if $g(p_n) = \infty$ then | ⊳ Restarting |
| 6: | break the inner loop | |
| 7: | end if | |
| 8: | $t_n = t$ | |
| 9: | Adapt $t_n$ through backtracking line search with ω | |
| 10: | if $t_n < \epsilon$ then | ⊳ Restarting |
| 11: | break the inner loop | |
| 12: | end if | |
| 13: | $q_n = \left(\mathcal{S}_{\lambda t_n}\!\left(p_n - t_n\,\nabla g(p_n)\right)\right)_+$ | ⊳ Projected Soft-Thresholding Operator |
| 14: | $n = n + 1$ | |
| 15: | end while | |
| 16: | Set $q_0 = q_{n-1}$ | |
| 17: | end while | |
| Output: $q_0$ | ||
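A compact sketch of Algorithm 2 for a generic smooth cost g. This is our own simplified rendering, not the released adaLASSO-phylo implementation; the toy nonnegative LASSO at the end has the closed-form solution max(b − λ, 0), which the solver should recover.

```python
import numpy as np

def projected_soft_threshold(x, kappa):
    """Prox of kappa * ||.||_1 + I_{q >= 0}: componentwise (x - kappa)_+."""
    return np.maximum(x - kappa, 0.0)

def pfista_restart(g, grad_g, q0, t_default, lam, eps=5e-8, omega=0.5,
                   n_outer=20, n_inner=100):
    """Projected FISTA with restarting (sketch of Algorithm 2).

    g / grad_g: the (possibly nonconvex) cost and its gradient.
    Restarts are triggered by infinite cost or a too-small step size."""
    q = np.asarray(q0, dtype=float)
    for _ in range(n_outer):
        q_prev = q.copy()
        restarted = False
        for n in range(1, n_inner + 1):
            p = q + (n - 2) / (n + 1) * (q - q_prev)   # Nesterov's acceleration
            if not np.isfinite(g(p)):                  # restart: infinite cost
                restarted = True
                break
            grad = grad_g(p)
            t = t_default
            # Backtracking line search: shrink t until the proximal step does
            # not exceed the local quadratic model of the smooth part.
            while t >= eps:
                q_new = projected_soft_threshold(p - t * grad, lam * t)
                d = q_new - p
                if (np.isfinite(g(q_new))
                        and g(q_new) <= g(p) + grad @ d + d @ d / (2 * t)):
                    break
                t *= omega
            if t < eps:                                # restart: small step size
                restarted = True
                break
            q_prev, q = q, q_new
        if not restarted:
            break
    return q

# Toy nonnegative LASSO: g(q) = 0.5 * ||q - b||^2; solution is (b - lam)_+.
b = np.array([2.0, -3.0, 0.4])
q_hat = pfista_restart(lambda q: 0.5 * np.sum((q - b) ** 2),
                       lambda q: q - b, np.zeros(3), t_default=1.0, lam=1.0)
```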
Remark 4.2. Note that the adaptive phylogenetic LASSO
$$\min_{q \ge 0}\; g(q) + \lambda \sum_{e} \hat{w}_e\, q_e \qquad (4.9)$$
is equivalent to (using / to denote componentwise division)
$$\min_{\tilde{q} \ge 0}\; g(\tilde{q} / \hat{w}) + \lambda \|\tilde{q}\|_1, \qquad \tilde{q}_e = \hat{w}_e\, q_e,$$
where $\hat{w}$ is the vector of adaptive weights. Therefore, Algorithm 2 can also be used to solve the (multi-step) adaptive phylogenetic LASSO.
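The equivalence in Remark 4.2 is a pure reparametrization and is easy to check numerically; the stand-in smooth cost and the particular weights below are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.uniform(0.0, 0.1, size=5)    # nonnegative branch lengths
w = rng.uniform(0.5, 2.0, size=5)    # adaptive weights, e.g. 1 / |q_hat_e|^gamma
lam = 10.0
b = rng.normal(size=5)

def g(q):
    """Stand-in smooth cost in place of the negative log-likelihood."""
    return 0.5 * np.sum((q - b) ** 2)

# The weighted objective in q equals the plain-L1 objective in q_tilde = w * q.
q_tilde = w * q
weighted = g(q) + lam * np.sum(w * np.abs(q))
plain = g(q_tilde / w) + lam * np.sum(np.abs(q_tilde))
```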
Table 2.
Comparison of the support values obtained from different methods on the DENV4 Brazil clade data set. All analyses used the Jukes-Cantor model.
| Edge | ML bootstrap | MCMC | rjMCMC | ML+thresholding+bootstrap | | | ML+adaLASSO+bootstrap | | |
|---|---|---|---|---|---|---|---|---|---|
| | | | C = 1 | κ = 1e-06 | κ = 5e-05 | κ = 1e-04 | λ = 150 | λ = 300 | λ = 450 |
| 1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2 | 7 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 93 | 54 | 1 | 82 | 65 | 58 | 66 | 66 | 66 |
| 4 | 95 | 100 | 100 | 95 | 95 | 95 | 95 | 95 | 95 |
| 5 | 38 | 34 | 1 | 0 | 0 | 0 | 4 | 4 | 4 |
| 6 | 63 | 100 | 100 | 61 | 61 | 61 | 63 | 63 | 63 |
| 7 | 10 | 17 | 20 | 8 | 8 | 7 | 6 | 6 | 6 |
| 8 | 52 | 65 | 31 | 45 | 44 | 44 | 44 | 44 | 44 |
| 9 | 11 | 34 | 1 | 0 | 0 | 0 | 2 | 2 | 2 |
| 10 | 10 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 61 | 100 | 66 | 57 | 56 | 22 | 61 | 61 | 61 |
| 12 | 13 | 17 | 3 | 8 | 2 | 0 | 3 | 3 | 3 |
| 13 | 9 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 23 | 19 | 3 | 9 | 2 | 0 | 4 | 3 | 4 |
| 15 | 13 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16 | 99 | 100 | 100 | 99 | 99 | 96 | 99 | 99 | 99 |
| 17 | 94 | 100 | 100 | 94 | 94 | 94 | 94 | 94 | 94 |
| 18 | 88 | 100 | 100 | 88 | 88 | 88 | 88 | 88 | 88 |
| 19 | 6 | 24 | 24 | 4 | 4 | 3 | 3 | 3 | 3 |
5. Experiments
In this section, we first demonstrate the efficiency of the proposed algorithm for solving the phylogenetic LASSO problem when combined with maximum-likelihood phylogenetic inference. We then show (non-adaptive) phylogenetic LASSO does not appear to be strong enough to find zero edges on simulated data; adaptive phylogenetic LASSO performs much better. We, therefore, compare our adaptive phylogenetic LASSO with simple thresholding and rjMCMC on simulated data and then apply it to some real data sets. For all simulation and inference, we use the simplest Jukes and Cantor (1969) model of DNA substitution, in which all substitutions have equal rates. The choice of regularization λ is to a certain extent data dependent. For the simulated data, we choose a range of λs to demonstrate the balance between miss rate and false alarm rate (Figure 3). For the Dengue virus data, we find that the performance is fairly insensitive to the regularization coefficient once the regularization coefficient is reasonably large (Figure 5).
Figure 3.
Topological consistency comparison of different phylogenetic LASSO procedures on simulation 2. Upper Left panel: number of identified zero branches after various numbers of multistep adaptive LASSO cycles. Upper Right panel: the number of misidentified zero branches (false alarm) and the number of unidentified zero branches (miss detection) for simple thresholding and multistep adaptive phylogenetic LASSO at different cycles. Bottom panel: miss rate and false alarm rate as a function of the regularization coefficient with 4 cycles.
Figure 5.
The distributions (across bootstrap replicates) of numbers of zero branches detected by adaptive phylogenetic LASSO for various regularization coefficients.
We use PhyloInfer to compute the phylogenetic likelihood via the pruning algorithm (Felsenstein, 1981); the package can be found at https://github.com/zcrabbit/PhyloInfer. PhyloInfer is a Python package originally developed for extending Hamiltonian Monte Carlo to Bayesian phylogenetic inference (Dinh et al., 2017). The code for the adaptive phylogenetic LASSO is made available at https://github.com/matsengrp/adaLASSO-phylo.
5.1. Efficiency of pFISTA for solving the phylogenetic LASSO.
The fast convergence rate of FISTA (or pFISTA) need not hold when the cost function g is nonconvex. However, we can expect that g is well approximated by a quadratic function near the optimum (or some local mode) q* (O’Donoghue and Candes, 2013). That is, there exists a neighborhood of q* inside of which
$$g(q) \approx g(q^*) + \frac{1}{2}(q - q^*)^\top \nabla^2 g(q^*)(q - q^*).$$
Once the iterates enter this domain, we observe behavior consistent with the convergence analysis in Section 4.1.
To test the efficiency of pFISTA in different scenarios, we consider various simulated data sets generated from “sparse” unrooted trees with 100 tips and 50 randomly chosen zero branches as follows. All simulated data sets contain 1000 independent observations on the leaf nodes. We set the minimum step size ϵ = 5e-08 for restarting.
We use the following simulation setups, in which branch lengths are expressed in the traditional units of expected number of substitutions per site.
Simulation 1. (No short branches). All nonzero branches have length 0.05. Because there are no short nonzero branches, branches that are originally nonzero are less likely to be collapsed to zero and we expect no restarting is needed.
Simulation 2. (A few short branches). For all the nonzero branches, we randomly choose 15 of them and set their lengths to 0.002. All the other branches have length 0.05. In this setting, there are a few short branches that are likely to be shrunken to zero. As a result, several restarts may be needed before convergence.
We see that when the model does not have very short nonzero branches and the phylogenetic cost is more regular, pFISTA finds the quadratic domain quickly and performs consistently with the corresponding convergence rate in equation (4.6), even without restarting (Figure 2, left). When the model does have many very short branches and the negative phylogenetic log-likelihood is highly nonconvex, pFISTA with restarting still manages to reach the quadratic domain quickly and exhibits fast convergence thereafter. Furthermore, we find the small-stepsize restarting criterion useful for adapting to changing local smoothness and facilitating mode exploration. In both situations, pFISTA performs consistently better than pISTA. Indeed, pISTA is monotonic and so is more likely to get stuck in local minima, making it less suitable for nonconvex optimization. We therefore use pFISTA with restarting as our default algorithm in all the following experiments.
Figure 2.
pISTA vs pFISTA on simulated data sets in terms of the relative error $(f(q_n) - f(q^*))/f(q^*)$, where $f(q) = g(q) + \lambda\|q\|_1$ and q* is the optimal solution, obtained from a long run of pFISTA. We used a fixed penalty coefficient λ for each run. Left panel: simulation 1; right panel: simulation 2. In simulation 2, we tried two restarting strategies: restart whenever $g(p_n) = \infty$ (partial), and restart whenever $g(p_n) = \infty$ or $t_n < \epsilon$ (full).
Remark 5.1. Like other non-convex optimization algorithms, pFISTA with restart may be sensitive to the starting position of the parameters. However, due to the momentum introduced in Nesterov’s acceleration (which causes the ripples in Figure 2) and adaptive restarting, pFISTA with restart is more likely to escape local minima and potentially arrive at the global minimum.
5.2. Performance of phylogenetic LASSO.
Through simulation we also find that in practice the (non-adaptive) phylogenetic LASSO penalty is not strong enough to find all zero branches. Indeed, the phylogenetic LASSO recovers only around 60% of the sparsity present in the true models, and a larger penalty does not necessarily yield more sparsity (Table 1). This suggests using the multistep adaptive phylogenetic LASSO, which has been proven to be topologically consistent under mild conditions (Theorem 3.13).
Table 1.
Number of correct zero-length branches found by the (non-adaptive) phylogenetic LASSO using various penalty coefficients in both simulation models, each of which has 50 zero-length branches.
| λ | 1 | 5 | 10 | 20 | 40 | 80 | 160 |
|---|---|---|---|---|---|---|---|
| Simulation 1 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| Simulation 2 | 32 | 32 | 32 | 32 | 32 | 32 | 31 |
5.3. Performance of adaptive phylogenetic LASSO.
Next, we demonstrate that the topologically consistent (multistep) adaptive phylogenetic LASSO significantly enhances sparsity on simulated data compared to the phylogenetic LASSO. We use the more difficult simulation 2, which has a combination of zero and very short branches. In what follows (and for the rest of this section), we compute the adaptive and multistep adaptive phylogenetic LASSO as described in Section 2.3. Note that m = 1 (first cycle) is the phylogenetic LASSO and m = 2 (second cycle) corresponds to the adaptive phylogenetic LASSO. Therefore, we can compare all phylogenetic LASSO estimators by simply running the multistep adaptive phylogenetic LASSO with maximum cycle number M ≥ 2. Our theoretical results require γ > 1; however, we have found that in practice large γ often leads to extreme adaptive weights and hence numerical instability. We therefore use γ = 1 in the following experiments and defer results for γ > 1 (with guaranteed topological consistency) to the Appendix.
We run the multistep phylogenetic LASSO with M = 4 cycles. To test the topological consistency of the estimators, we use different initial regularization coefficients λ[0] = 10, 20, 30, 40, 50 and update the regularization coefficients according to
which maintains a relatively stable regularization among the adaptive LASSO steps because the varying part of the regularization is roughly for the mth cycle. This formula provides reasonably good balance between sparsity (identified zero branches) and numerical stability in our experiments.
We find that multistep adaptive phylogenetic LASSO does improve sparsity identification while maintaining a relatively low misidentification rate. Indeed, as the cycle number increases, the estimator now is able to identify more zero branches (Figure 3, upper left panel). Moreover, unlike the phylogenetic LASSO (m = 1), we do observe more sparsity when the regularization coefficient increases at cycles m > 1. As more cycles are run and larger penalty coefficients are used, we see that multistep adaptive LASSO manages to reduce miss detection (i.e. unidentified zero branches) without introducing many extra false alarms (misidentified zero branches; Figure 3, upper right panel). In contrast, simple thresholding is more likely to misidentify zero branches when larger thresholds are used to bring down miss detection. The choice of the regularization coefficient λ is also important. While small λ is not enough for detecting most zero branches, these simulations show that a too-large λ is likely to increase the number of false detections (Figure 3, bottom panel). On the other hand, as described below, for real data we find less dependence on the exact value of λ.
5.4. Short Edge Detection.
Previous work has proposed Bayesian approaches to infer non-bifurcating tree topologies by assigning priors that cover all of the tree space, including less-resolved tree topologies (Lewis et al., 2005, 2015). Since the numbers of branches (parameters) are different among those tree topologies, reversible-jump MCMC (rjMCMC) is used for posterior inference. Both sparsity-promoting priors and adaptive LASSO are means of sparsity encouragement that allow us to discover non-bifurcating tree topologies, which as described in the Introduction make different evolutionary statements than their resolved counterparts. However, those sparsity encouraging procedures also make it much more difficult to detect relatively short edges. Thus, we would like to understand the performance of methods in terms of detection probability: the probability of inferring a branch to be of non-zero length.
To investigate how short an edge can be and still be detected by both methods, we follow Lewis et al. (2005) and simulate a series of data sets using the same tree as in Simulation 2. All branch lengths are the same as in that simulation except those for the 15 randomly chosen short branches, each of which we set to 0.0, 0.002, 0.004, 0.006, 0.008, or 0.010 for the various trials; the nonzero short branches are meant to be particularly challenging to distinguish from the actual zero branches. For each of these six lengths, we simulate 100 data sets of the same size (1000 sites). These short branch lengths are multiples of 1/1000, the length that provides, on average, one mutation along the branch of interest per data set. Note that branch lengths represent the expected number of mutations per site, so, for example, a branch length of 0.001 does not guarantee that a mutation occurs on the branch of interest in every simulated data set. We run the multistep adaptive phylogenetic LASSO with M = 4 cycles and initial regularization coefficient λ[0] = 50, and rjMCMC with the polytomy prior C = 1 (C is the ratio of prior mass between trees with successive numbers of internal nodes, as defined in Lewis et al. (2005)). The detection probabilities of rjMCMC are the averaged split posterior probabilities of the corresponding branches over the 100 independent data sets.
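To see why a short branch need not produce any mutation in a given data set: the expected number of mutations on a branch is its length times the number of sites, and mutation counts along a branch are approximately Poisson under the model, so a branch of length 0.001 leaves roughly a 37% chance of zero mutations in a 1000-site alignment. A back-of-the-envelope check (our own illustration):

```python
import math

# Expected mutations on a branch = branch length (subs/site) * number of sites.
length, sites = 0.001, 1000
expected_mutations = length * sites          # 1.0 on average

# Mutations along the branch are approximately Poisson, so the chance that a
# 1000-site data set contains *no* mutation on this branch is about e^{-1}.
p_no_mutation = math.exp(-expected_mutations)
```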
We find that multistep adaptive phylogenetic LASSO indeed strikes a better balance between identifying zero branches and detecting short branches than rjMCMC in this simulation study (Figure 4). In addition to being slightly better at identifying zero branches than rjMCMC (partly due to the weak polytomy prior C = 1), multistep adaptive phylogenetic LASSO has a substantially improved detection probability for short branches (Figure 4). Also note that sufficiently long branches (about 10 expected substitutions per data set) are needed for an edge to be reliably detected using either method.
Figure 4.
Performance of multistep (4 cycle) adaptive phylogenetic LASSO and rjMCMC at detecting short branches. Detection probability is the probability of inferring a branch to be of non-zero length. Therefore, the ideal detection probability is 1 for non-zero length branches (all except for the first value on the x-axis) and 0 for zero length branches.
5.5. Dengue Virus Data.
We now compare our adaptive phylogenetic LASSO methods to others on a real data set. So far, we have tested the performance of multistep adaptive phylogenetic LASSO on a fixed topology. For real data sets, the underlying phylogenies are unknown and hence have to be inferred from the data. We therefore propose to use multistep adaptive phylogenetic LASSO as a sparsity-enforcing procedure after traditional maximum likelihood based inferences. In what follows, we use this combined procedure together with bootstrapping to measure edge support on a real data set of the Dengue genome sequences. In our experiment, we consider one typical subset of the 4th Dengue serotype (“DENV4”) consisting of 22 whole-genome sequences from Brazil curated by the nextstrain project (Hadfield et al., 2018) and originally sourced from the LANL hemorrhagic fever virus database (Kuiken et al., 2012). The sequence alignment of these sequences comprises 10756 nucleotide sites.
Following Lewis et al. (2005), we conduct our analysis using the following methods: (1) maximum likelihood bootstrapping columns of a sequence alignment; (2) a conventional MCMC Bayesian inference restricted to fully resolved tree topologies; (3) a reversible-jump MCMC method moving among fully resolved as well as polytomous tree topologies; (4) two combined procedures, maximum likelihood bootstrapping plus multistep adaptive phylogenetic LASSO and maximum likelihood bootstrapping plus thresholding, both of which allow fully bifurcating and non-bifurcating tree topologies. Maximum likelihood bootstrap analysis is performed using RAxML (Stamatakis, 2014) with 1000 replicates. The conventional MCMC Bayesian analysis is done in MrBayes (Ronquist et al., 2012), where we place a uniform prior on the fully resolved topology and an Exponential (rate 10) prior on the branch lengths. The rjMCMC analysis is run in p4 (Foster, 2004), using a flat polytomy prior with C = 1; the p4 code can be found at https://github.com/Anaphory/p4-phylogeny. For each Bayesian approach, a single Markov chain was run for 8e+06 generations after a 2e+06-generation burn-in period. Trees and branch lengths were sampled every 1000 generations, yielding 8000 samples. Both combined procedures are implemented based on the bootstrapped ML trees obtained in (1). For multistep adaptive phylogenetic LASSO, we use M = 4 cycles and test different initial regularization coefficients λ[0] = 150, 300, 450. We set the thresholds κ = 1e-06, 5e-05, 1e-04 for the simple thresholding method.
Figure 5 shows the distributions (across bootstrap replicates) of the numbers of zero branches detected by adaptive phylogenetic LASSO as a function of regularization coefficient λ. We see lots of detected zero branches, indicating the existence of non-bifurcating topologies for this data set. Furthermore, we find that even unpenalized (λ = 0) optimization recovers quite a few zero branches with our more thorough optimization allowing branch lengths to go all the way to zero.
Figure 6 shows the consensus tree obtained from the conventional MCMC samples. Each interior edge has its index number i and its support value (expressed as a percentage) $s_i$ right above it, in the form $i : s_i$. Following Lewis et al. (2005), we re-estimate the support values for all interior edges (splits) on this MCMC consensus tree using the aforementioned methods and summarize the results in Table 2. As one of the state-of-the-art approaches for identifying non-bifurcating topologies, rjMCMC is able to detect edges with exactly zero support (edges 2, 10, 13, 15). Due to their sparsity-encouraging nature, ML+thresholding+bootstrap and ML+adaLASSO+bootstrap can identify these zero edges (in contrast to standard MCMC and ML+bootstrap), and the detected zero edges are largely consistent with rjMCMC. Moreover, we also examine the support estimates of all splits observed in the standard MCMC samples, and find that adaptive phylogenetic LASSO tends to provide the closest estimates to rjMCMC (for zero-edge detection) among all the alternatives (Figure 7). Overall, we see that the (multistep) adaptive phylogenetic LASSO is able to reveal non-bifurcating structures comparable to the rjMCMC Bayesian approach when applied to maximum likelihood tree topologies, and is less likely to misidentify weakly supported edges than simple thresholding.
Figure 6.
Consensus tree resulting from the conventional MCMC Bayesian inference on the Brazil clade from DENV4.
Figure 7.
A comparison of different methods on the support estimates of all splits observed in the MCMC samples. Splits are sorted by their support under rjMCMC.
6. Conclusion
We study ℓ1-penalized maximum likelihood approaches for phylogenetic inference, with the goal of recovering non-bifurcating tree topologies. We prove that these regularized maximum likelihood estimators are asymptotically consistent under mild conditions. Furthermore, we show that the (multistep) adaptive phylogenetic LASSO is topologically consistent and therefore is able to detect non-bifurcating tree topologies that may contain polytomies and sampled ancestors. We present an efficient algorithm for solving the corresponding optimization problem, which is inherently more difficult than standard ℓ1-penalized problems with regular cost functions. The algorithm is based on recent developments on proximal gradient descent methods and their various acceleration techniques (Beck and Teboulle, 2009; O’Donoghue and Candes, 2013).
Our method is closest in spirit to rjMCMC, which is a rigorous means of inferring the posterior distribution of potentially multifurcating topologies, and thus we have limited our performance comparisons to this method. However, there is precedent for using a hypothesis-testing framework to test if parts of the tree should be multifurcating. Jackman et al. (1999) use a parametric bootstrap test and a randomization test built on maximum parsimony to evaluate the strength of support for a rapid-evolution scenario. Walsh et al. (1999) determine the number of base pairs required to resolve a given phylogenetic divergence with a prior hypothesis about the amount of time during which this divergence could have occurred. Also, one might wish to use the SOWH test (Swofford et al., 1996; Goldman et al., 2000; Susko, 2014) to find an appropriate threshold by increasing the threshold progressively until the SOWH test detects a significant difference between the ML tree and the thresholded tree. However, we have shown that thresholding is less effective than our method across a range of threshold values (Figure 3). Furthermore, existing software implementations of SOWH cannot test multifurcating topologies because inference under a multifurcation constraint is not supported in current tree inference packages.
We have performed a wide range of experiments to demonstrate the efficiency and effectiveness of our method. We show in a synthetic study that although the (non-adaptive) phylogenetic LASSO has difficulty finding zero-length branches, the adaptive phylogenetic LASSO provides significant improvement in sparsity recovery, validating its theoretical properties. Although we assume a fixed tree topology for deriving statistical consistency, our method can be used to discover non-bifurcating tree topologies in real data problems when combined with traditional maximum likelihood phylogenetic inference methods. Our experiments have shown that the adaptive phylogenetic LASSO performs comparably with MCMC-based sparsity-encouraging procedures (rjMCMC) in terms of sparsity recovery while being computationally more efficient as an optimization approach. We also compare our method to a heuristic simple thresholding approach and find that regularization permits more consistent performance. Finally, we show that compared to rjMCMC, the adaptive phylogenetic LASSO is more likely to detect short branches while identifying zero branches with high accuracy. It is worth mentioning that while substantial sparsity can be detected by maximizing the likelihood with nonnegativity constraints, the adaptive phylogenetic LASSO can be advantageous when there exist challenging zero branches in tree topologies with high likelihoods. Our results offer new insights into non-bifurcating phylogenetic inference methods and support the use of the ℓ1 penalty in statistical modeling in more general settings.
We leave some questions to future work. For the theory, the rate with which we can allow the number of leaves to go to infinity in terms of the sequence length is not yet known. We have also not explored the extent to which the optimal penalized tree is a contraction of the ML unpenalized tree. Also, although we have laid the algorithmic foundation for efficient penalized inference, there is further work to be done to make a streamlined implementation that is integrated with existing phylogenetic inference packages.
Supplementary Material
7. Acknowledgements
The authors would like to thank Vladimir Minin and Noah Simon for helpful discussions, and Sidney Bell for helping with the Dengue sequence data. This work was supported by National Institutes of Health grants R01-GM113246, R01-AI120961, U19-AI117891, and U54-GM111274 as well as National Science Foundation grant CISE-1564137. The research of Frederick Matsen was supported in part by a Faculty Scholar grant from the Howard Hughes Medical Institute and the Simons Foundation.
References
- Agarwal A, Negahban S, and Wainwright MJ (2010). Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pp. 37–45.
- Allman ES, Ané C, and Rhodes JA (2008). Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. Appl. Probab. 40(1), 229–249.
- Allman ES and Rhodes JA (2008, January). Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math. Biosci. 211(1), 18–33.
- Beck A and Teboulle M (2009, January). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202.
- Beyer WA, Stein ML, Smith TF, and Ulam SM (1974, February). A molecular sequence metric and evolutionary trees. Math. Biosci. 19(1), 9–25.
- Bickel PJ, Ritov Y, and Tsybakov AB (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37(4), 1705–1732.
- Bühlmann P and Meier L (2008, August). Discussion: One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 36(4), 1534–1541.
- Chang JT (1996). Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences 137(1), 51–73.
- Chen R and Holmes EC (2008, June). The evolutionary dynamics of human influenza B virus. J. Mol. Evol. 66(6), 655–663.
- Combettes PL and Wajs VR (2006). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation 4(4), 1168–1200.
- Dinh V, Bilge A, Zhang C, and Matsen FA IV (2017, July). Probabilistic Path Hamiltonian Monte Carlo. In Proceedings of the 34th International Conference on Machine Learning, pp. 1009–1018.
- Dinh V, Ho LST, Suchard MA, and Matsen FA IV (2018). Consistency and convergence rate of phylogenetic inference via regularization. Annals of Statistics 46(4), 1481.
- Dinh VC, Ho LST, Nguyen B, and Nguyen D (2016). Fast learning rates with heavy-tailed losses. In Advances in Neural Information Processing Systems, pp. 505–513.
- Efron B, Hastie T, Johnstone I, and Tibshirani R (2004). Least angle regression. Annals of Statistics 32(2), 407–499.
- Evans SN and Speed TP (1993, March). Invariants of some probability models used in phylogenetic inference. Ann. Stat. 21(1), 355–377.
- Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456), 1348–1360.
- Fan J, Xue L, and Zou H (2014). Strong oracle optimality of folded concave penalized estimation. Annals of Statistics 42(3), 819.
- Felsenstein J (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17(6), 368–376.
- Fitch WM and Margoliash E (1967, January). Construction of phylogenetic trees. Science 155(3760), 279–284.
- Foster PG (2004). Modeling compositional heterogeneity. Syst. Biol. 53, 485–495.
- Gardy J, Loman NJ, and Rambaut A (2015, July). Real-time digital pathogen surveillance — the time is now. Genome Biol. 16(1), 155.
- Gavryushkina A, Heath TA, Ksepka DT, Stadler T, Welch D, and Drummond AJ (2016, August). Bayesian total-evidence dating reveals the recent crown radiation of penguins. Syst. Biol.
- Gavryushkina A, Welch D, Stadler T, and Drummond AJ (2014, December). Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration. PLoS Comput. Biol. 10(12), e1003919.
- Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, and Quake SR (2014, January). The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat. Biotechnol.
- Goldman N, Anderson JP, and Rodrigo AG (2000, December). Likelihood-based tests of topologies in phylogenetics. Syst. Biol. 49(4), 652–670.
- Grenfell BT, Pybus OG, Gog JR, Wood JLN, Daly JM, Mumford JA, and Holmes EC (2004, January). Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303(5656), 327–332.
- Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, and Neher RA (2018, May). Nextstrain: real-time tracking of pathogen evolution. Bioinformatics.
- Hunter DR and Li R (2005). Variable selection using MM algorithms. Annals of Statistics 33(4), 1617.
- Jackman TR, Larson A, de Queiroz K, and Losos JB (1999, June). Phylogenetic relationships and tempo of early diversification in Anolis lizards. Syst. Biol. 48(2), 254–285.
- Ji S, Kollár J, and Shiffman B (1992). A global Łojasiewicz inequality for algebraic varieties. Transactions of the American Mathematical Society 329(2), 813–818.
- Jukes TH and Cantor CR (1969). Evolution of protein molecules. In Munro HN (Ed.), Mammalian Protein Metabolism, Volume 3, pp. 21–132. New York: Academic Press.
- Kim J and Sanderson MJ (2008, October). Penalized likelihood phylogenetic inference: bridging the parsimony-likelihood gap. Syst. Biol. 57(5), 665–674.
- Kim Y, Choi H, and Oh H-S (2008). Smoothly clipped absolute deviation on high dimensions. Journal of the American Statistical Association 103(484), 1665–1673.
- Kleinstein SH, Louzoun Y, and Shlomchik MJ (2003, November). Estimating hypermutation rates from clonal tree data. J. Immunol. 171(9), 4639–4649.
- Kuiken C, Thurmond J, Dimitrijevic M, and Yoon H (2012, January). The LANL hemorrhagic fever virus database, a new platform for analyzing biothreat viruses. Nucleic Acids Res. 40 (Database issue), D587–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lewis PO, Holder MT, and Holsinger KE (2005, April). Polytomies and Bayesian phylogenetic inference. Syst. Biol 54 (2), 241–253. [DOI] [PubMed] [Google Scholar]
- Lewis PO, Holder MT, and Swofford DL (2015, January). Phycas: Software for Bayesian phylogenetic analysis. Syst. Biol. [DOI] [PubMed] [Google Scholar]
- Libin P, Vanden Eynden E, Incardona F, Nowé A, Bezenchek A, EucoHIV study group, Sönnerborg A, Vandamme A-M, Theys K, and Baele G (2017, August). PhyloGeoTool: interactively exploring large phylogenies in an epidemiological context. Bioinformatics. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu Y, Zhan Z, Cai JF, Guo D, Chen Z, and Qu X (2016). Projected iterative soft-thresholding algorithm for tight frames in compressed sensing magnetic resonance imaging. IEEE Trans. Med. Imag 35(9), 2130–2140. [DOI] [PubMed] [Google Scholar]
- Loh P-L (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics 45(2), 866–896. [Google Scholar]
- Loh P-L and Wainwright MJ (2011). High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pp. 2726–2734. [Google Scholar]
- Loh P-L and Wainwright MJ (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pp. 476–484. [Google Scholar]
- Loh P-L and Wainwright MJ (2017). Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics 45(6), 2455–2482. [Google Scholar]
- Mazumder R, Friedman JH, and Hastie T (2011). Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association 106 (495), 1125–1138. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Meinshausen N and Bühlmann P (2006). High-dimensional graphs and variable selection with the lasso. The annals of statistics 34 (3), 1436–1462. [Google Scholar]
- Meinshausen N and Yu B (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics 37(1), 246–270. [Google Scholar]
- Negahban SN, Ravikumar P, Wainwright MJ, and Yu B (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science 27(4), 538–557. [Google Scholar]
- Neher RA and Bedford T (2015, June). nextflu: Real-time tracking of seasonal influenza virus evolution in humans. Bioinformatics. [DOI] [PMC free article] [PubMed] [Google Scholar]
- O’Donoghue B and Candès E (2013). Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics 15, 715–732.
- Pan Z and Zhang C (2015). Relaxed sparse eigenvalue conditions for sparse estimation via non-convex regularized regression. Pattern Recognition 48(1), 231–243.
- Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, and Huelsenbeck JP (2012, February). MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61(3), 539–542.
- Stamatakis A (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9), 1312–1313. doi: 10.1093/bioinformatics/btu033.
- Susko E (2014, April). Tests for two trees using likelihood methods. Mol. Biol. Evol. 31(4), 1029–1039.
- Swofford DL, Olsen GJ, Waddell PJ, and Hillis DM (1996). Phylogenetic inference. In Hillis DM, Moritz C, Mable BK, and Olmstead RG (Eds.), Molecular systematics, pp. 407–514. Sunderland, MA: Sinauer Associates.
- Tibshirani R (1996, January). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58(1), 267–288.
- Van Erven T, Grünwald PD, Mehta NA, Reid MD, and Williamson RC (2015). Fast rates in statistical and online learning. Journal of Machine Learning Research 16, 1793–1861.
- Victora GD and Nussenzweig MC (2012, January). Germinal centers. Annu. Rev. Immunol. 30, 429–457.
- Walsh HE, Kidd MG, Moum T, and Friesen VL (1999). Polytomies and the power of phylogenetic inference. Evolution 53(3), 932–937.
- Wang L, Kim Y, and Li R (2013). Calibrating non-convex penalized regression in ultra-high dimension. Annals of Statistics 41(5), 2505.
- Zhang C-H and Huang J (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics 36(4), 1567–1594.
- Zhang C-H and Zhang T (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science 27(4), 576–593.
- Zhao P and Yu B (2006). On model selection consistency of Lasso. Journal of Machine Learning Research 7(Nov), 2541–2563.
- Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476), 1418–1429.
- Zou H and Li R (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics 36(4), 1509.