Abstract
Among the methods currently available for inferring species trees from gene trees, the GLASS method of Mossel and Roch (2010), the Shallowest Divergence (SD) method of Maddison and Knowles (2006), the STEAC method of Liu et al. (2009), and a related method that we call Minimum Average Coalescence (MAC) are computationally efficient and provide branch length estimates. Further, GLASS and STEAC have been shown to be consistent estimators of tree topology under a multispecies coalescent model. However, divergence time estimates obtained with these methods are all systematically biased under the model because the pairwise interspecific gene divergence times on which they rely must be more ancient than the species divergence time. Jewett and Rosenberg (2012) derived an expression for the bias of GLASS and used it to propose an improved method that they termed iGLASS. Here, we derive the biases of SD, STEAC, and MAC, and we propose improved analogues of these methods that we call iSD, iSTEAC, and iMAC. We conduct simulations to compare the performance of these methods with their original counterparts and with GLASS and iGLASS, finding that each of them decreases the bias and mean squared error of pairwise divergence time estimates. The new methods can therefore contribute to improvements in the estimation of species trees from information on gene trees.
Key words: algorithms, coalescence, phylogenetic trees
1. Introduction
Many methods exist for inferring species trees from gene trees (Maddison, 1997; Ewing et al., 2008; Degnan and Rosenberg, 2009; Kubatko et al., 2009; Liu et al., 2009; Than and Nakhleh, 2009). Among these methods, several are now available that are computationally efficient and capable of estimating branch lengths. This collection of methods includes the GLASS method of Mossel and Roch (2010), which was developed independently by Liu et al. (2010) under the name Maximum Tree, the Shallowest Divergence (SD) method of Maddison and Knowles (2006), and the Species Tree Estimation using Average Coalescence times (STEAC) method of Liu et al. (2009).
GLASS, SD, and STEAC are all distance-matrix methods. Such methods first estimate an evolutionary distance between each pair of taxa and then construct a species tree that provides an approximate representation of the pairwise distances. In GLASS, SD, and STEAC, divergence times are estimated for each pair of species, and a hierarchical clustering method is then applied to the pairwise distances to construct an estimate of the species tree. Thus, each distance-matrix method can be viewed as an estimator of pairwise species divergence times combined with a hierarchical clustering procedure.
To understand the general approach by which GLASS, SD, and STEAC estimate the divergence time between a pair of taxa A and B, consider a set of L loci indexed by ℓ, and denote by and the numbers of lineages sampled at locus ℓ in taxa A and B, respectively. Each method takes as input a set of estimated coalescence times for all pairs of taxa. Denote the sets of lineages sampled from taxa A and B at locus ℓ by and , and denote the true coalescence time between lineages and by . We assume that an estimate of has been obtained by some procedure that is left unspecified; the focus of GLASS, SD, and STEAC is not on the estimation of the themselves, but rather, on the way in which a collection of values of is used to estimate species divergence times. Throughout this paper, we use the term pairwise interspecific coalescence to refer to the ; that is, the coalescences between one of the lineages from taxon A and one of the lineages from taxon B at locus ℓ. We differentiate this concept from that of an interspecific coalescent event, at which multiple pairwise interspecific coalescences, as defined here, can simultaneously occur. For example, in the species tree depicted in each box of Figure 1, two interspecific coalescent events occur for the locus shown in blue. At each of these events, two pairwise interspecific coalescences occur. The gene tree at the other locus, in orange, contains three interspecific coalescent events. One pairwise interspecific coalescence occurs at each of the first two interspecific coalescent events, and two pairwise interspecific coalescences, between the rightmost lineage from the right taxon and each lineage from the left taxon, occur at the third interspecific coalescent event. From these pairwise interspecific coalescences, we denote the estimated minimum pairwise interspecific coalescence time between the taxa at locus ℓ by , and the estimated mean pairwise interspecific coalescence time by .
The GLASS, SD, and STEAC estimation methods each consider a minimum or mean of either the or the over all loci (Fig. 1). For example, the GLASS estimator of the divergence time TAB, denoted , is obtained by taking the minimum over loci of the locus-specific minimum pairwise interspecific coalescence times; that is, . The SD estimator of TAB is obtained by taking the mean over loci of the locus-specific minimum pairwise interspecifc coalescence times, or . The STEAC estimator of TAB is the mean over loci of the locus-specific mean pairwise interspecific coalescence times, or . We also introduce a fourth divergence time estimator equal to the minimum over loci of the locus-specific mean pairwise interspecific coalescence times, or . We will call this method Minimum Average Coalescence (MAC).
The species divergence time estimates given by GLASS, SD, STEAC, and MAC are all systematically biased under the multispecies coalescent, a basic evolutionary model for describing the evolution of gene trees conditional on species trees (Degnan and Rosenberg, 2009). This bias arises because, under the model, true pairwise interspecific gene coalescence times always exceed the species divergence time. If estimates perfectly, then because the estimates from GLASS, SD, STEAC, and MAC (, and , respectively) all consist of means and minima of the , under the model, they too must exceed the true species divergence time.
For the GLASS method, Jewett and Rosenberg (2012) addressed the issue of bias by proposing an improved method for estimating pairwise divergence times. This method, which they called iGLASS, adjusts the GLASS estimate downward by an amount related to the bias of GLASS. Jewett and Rosenberg showed that iGLASS is consistent for estimating pairwise divergence times and that it can be combined with a suitable clustering algorithm to produce a consistent estimator of the species tree topology. Through simulations, they found that iGLASS greatly reduces the bias and mean squared error of pairwise divergence time estimates produced by GLASS.
Here, we propose improved analogues for SD, STEAC, and MAC, which we call iSD, iSTEAC, and iMAC, respectively, under which bias in the estimation of species divergence times is substantially reduced or eliminated. In Section 2, we derive the improved methods for estimating pairwise divergence times. Our derivations first quantify the bias in pairwise estimates of species divergence time for each of the original three methods as a function of the true divergence time, the number of lineages sampled from each taxon, and the number of sampled loci. Given a divergence time estimate obtained using one of the original methods, an improved estimate is produced by subtracting from the original estimate its bias; more precisely, because the bias for a given scenario depends on the true divergence time, which we are attempting to estimate, we calculate improved estimates by solving an equation for the true divergence time that will be detailed in Section 2. In Section 3, we evaluate the methods for estimating pairwise divergence times, and we compare the performance of the improved methods with each other and with their original counterparts. We also expand the analysis to trees with more than two taxa; in these cases, the improved version of a given method (GLASS, SD, STEAC, or MAC) is obtained by calculating the improved pairwise divergence time estimate for each pair of taxa and then applying a clustering method to the matrix of improved estimates. We compare the original and improved methods for inferring these larger trees with respect to both the bias and mean squared error in pairwise divergence time estimates and the proportion of sampled trees for which the correct topology is inferred.
2. Derivation Of Improved Methods
We follow the assumptions of a simple multispecies coalescent model (Degnan and Rosenberg, 2009). Each branch i of the species tree is assumed to have a constant effective population size Ni. Looking backward in time from the present at a species tree branch with n sampled lineages, we assume that all pairs of lineages are equally likely to coalesce and that the waiting time until coalescence occurs is exponentially distributed with mean coalescent time units (where one unit is defined as 2Ni generations). The species tree and gene trees are assumed to be binary. Furthermore, any two species or genetic lineages are assumed to be equidistant from their common ancestor in time units of years, so that species trees and gene trees are ultrametric. We also assume that Ni = N for all i, and that generation times and mutation rates are also equivalent across species tree branches.
Bias reduction is carried out in a similar way for each of the four estimators that we consider. As we have discussed, Jewett and Rosenberg (2012) examined the bias reduction in the case of GLASS; here, we proceed more generally. Consider a specific estimator tAB chosen from among the four in our study; that is, tAB represents either , or . To obtain an improved version of tAB, let denote an observation of the estimator tAB. We then obtain the improved estimate by finding the divergence time at which the expected value of the estimator tAB under the model is equal to the observed estimate. That is, we solve
(1) |
for τAB.
In the case in which the estimate is smaller than the smallest possible value for , or , it is not biologically meaningful to solve Equation ( 1 ), since the equation produces negative divergence time estimates. To circumvent this problem, define a function g as
(2) |
The improved estimate for the pairwise species divergence time, , is now defined piecewise:
(3) |
As we will see, except in the case of STEAC, this procedure gives an estimate of τAB that is biased when more than one lineage is sampled in each taxon. However, the bias of each improved estimator is greatly reduced compared to that of the corresponding original estimator.
To simplify our derivations, we define a random variable VAB as the difference between the estimator tAB and the true speciation time τAB; that is, VAB = tAB − τAB. VAB measures the extent to which a random estimate exceeds the true speciation time. We can now express from Equation (2) as . Because the random variable VAB varies among the four methods, must be derived separately for each of the four original estimators (GLASS, SD, STEAC, and MAC) in order to define our four improved estimators (iGLASS, iSD, iSTEAC, and iMAC). Following the notation established above for the divergence times, denote by , and the values of VAB for the GLASS, SD, STEAC, and MAC methods, respectively.
2.1. Glass
As derived by Jewett and Rosenberg (2012), the iGLASS estimator can be obtained by first deriving in the case of a single locus ℓ, or , with and lineages sampled from species A and B, respectively. We briefly review this derivation here, as components of it are required for obtaining the other improved estimators.
2.1.1. Derivation of for a single locus
Define and to be the random numbers of lineages remaining from taxa A and B at locus ℓ at the divergence time. Conditioning on and , the expectation of is given by Jewett and Rosenberg as
(4) |
where is the probability that n initial lineages coalesce to k lineages in time . This probability has been calculated by Tavaré (1984), and it is
(5) |
with and ; the time τ is in units of 2N generations. The conditional expectation of , can be calculated by integrating the probability density function of conditional on and :
(6) |
The density is obtained by conditioning on M, the number of coalescent events that occur in the ancestral population up to and including the first interspecific coalescence; using a derivation from Takahata (1989), Jewett and Rosenberg obtained the expectation of as a sum over all possible values of and ,
(7) |
where and , and where
Here, and
2.1.2. Derivation of for multiple loci
To derive the iGLASS correction for multiple loci, Jewett and Rosenberg (2012) obtained the conditional expectation,
(8) |
The unconditional expectation is then found using a generalization of Equation (4) that can be plugged into Equation (2) to obtain the iGLASS correction by Equation (3).
2.2. SD
2.2.1. Derivation of for a single locus
The SD estimator, , is given by the mean over all loci of the minimum interspecific coalescence times. For a single locus, is simply equivalent to , the expectation of the minimum interspecific coalescence time at that locus, as derived by Jewett and Rosenberg (2012) and given in Equation (7).
2.2.2. Derivation of for multiple loci
For a set of L loci, we take the mean of over the L loci to obtain the SD estimator. That is, , by the linearity of the expectation operator. Using the formula for in Equation (7), we obtain
(9) |
Similarly to the procedure for GLASS, the expectation from (9) is then inserted into Equation (2), which is then solved for to obtain the iSD estimate according to Equation (3).
2.3. MAC
2.3.1. Derivation of for a single locus
To obtain the iMAC estimator, we begin by deriving the distribution of at a single locus, or . We derive the distribution of by conditioning on and , the numbers of lineages surviving until the divergence time. The density function of , however, cannot be written as simply as the analogous expression for in the GLASS derivation. The computation of the mean pairwise interspecific coalescence time requires knowledge of the number of pairwise interspecific coalescences at each coalescent event past divergence, which in turn requires information about the topology of the gene tree under consideration.
Because many gene tree topologies are possible when more than one lineage is sampled from each taxon, we further condition on a certain set of unlabeled tree topologies, , where, as described below, is the number of unlabeled tree topologies with sampled lineages and lineages remaining at the divergence time. To define this set of topologies, note that given , the mean pairwise interspecific coalescence times—and thus, —depend on the tree topologies both in the daughter branches of the species tree and in the ancestral branch. For example, in Figure 2, the trees in parts A and B have identical topologies past the divergence time, but the mean pairwise interspecific coalescence times differ due to different coalescence patterns in the daughter branches.
Given and , we find the desired set of tree topologies by declaring that lineages from the same taxon are equivalent and that lineages from different taxa are distinct. Further, in the case where , two topologies that differ only by a transposition of all of the species labels of the gene lineages are defined to be equivalent. For example, consider the case of . Then or , and or . In Table 1A, , and only one topology is possible; because , the two taxa are interchangeable, and it does not matter to the computation of coalescence time distributions whether the most recent coalescence occurs in the left or the right taxon. Thus, . Tables 1B and 1C consider ; again, because , it does not matter whether it is the left or the right taxon that experiences a coalescence before the divergence time, and this case is equivalent to . After the divergence time, the three remaining lineages form pairs that are equally likely to coalesce. Because the two lineages from the right taxon are equivalent, only two distinct possibilities exist: first, as seen in Table 1B, the two lineages from the right taxon can coalesce. Alternatively, as seen in Table 1C, the lineage from the left taxon can coalesce with either of the two equivalent lineages from the right taxon. Thus, . The four tree topologies possible when appear in Tables 1D–G.
Table 1.
The set of distinct topologies, pictured in the third row, is formed by considering the lineages from species A to be interchangeable, but distinct from the interchangeable lineages from species B. Here, since , the two taxa are also considered interchangeable. Also shown are the probabilities of each topology given and and the expected mean interspecific coalescence times conditional on each topology, denoted , which are used to derive . The probabilities, , that and lineages remain at the divergence time are also given. These probabilities are used with to calculate the unconditional density and the survival function , given in the last row.
Conditional on tree topology, we calculate the density of by recalling from Section 2 that when n lineages are available to coalesce, the waiting time until a coalescent event is assumed under the model to be exponentially distributed with rate coalescent time units. The mean pairwise interspecific coalescence time is a weighted sum of these random variables. Consider for example the tree in Table 1A, in which all four pairwise interspecific coalescences occur at the lone interspecific coalescent event; because exactly two lineages are available to form this coalescence, we denote the waiting time until it occurs by t2. Thus, the mean pairwise interspecific coalescence time given the topology in Table 1A is . Because the mean interspecific coalescence time has the same distribution as the single interspecific coalescence time in the case that only two lineages are available to coalesce, in this case,
For a more complex example, consider Table 1F. Here, the first of the four pairwise interspecific coalescences occurs at the first coalescent event past divergence, when four lineages remain; we denote the waiting time until this coalescence by t4. The second pairwise interspecific coalescence occurs at the next coalescent event, so that the waiting time past the divergence for this coalescent event is t4 + t3. Finally, the last two pairwise interspecific coalescences, between the rightmost lineage and each of the two lineages of the left taxon, occur at the last coalescent event, which occurs at time t4 + t3 + t2 past the divergence time. The mean, as given in the figure, is therefore . The distribution can now be calculated as the sum of three exponential random variables. The first of these variables has parameter . For the second and third random variables, and , we must consider the coefficients as well; because an exponential random variable multiplied by a constant yields a new exponential random variable, with mean given by the product of the constant and the mean of the first variable, we multiply the parameter λ by the reciprocal of the coefficient to obtain the parameter of the new exponential random variable produced by scaling. Thus, the two remaining random variables have parameters and . Because the three random variables are all exponentially distributed with distinct rates, the density of their sum can be calculated as follows:
(10) |
In this equation (Ross, 2007), independent exponential random variables are denoted , with parameters . Note, however, that for some topologies in cases with higher and , the rate parameters will not be pairwise distinct, and a convolution integral must be used. Here, in the case of the topology in Table 1F, either method gives .
Once has been calculated for all tree topologies, we weight the density functions by conditional probabilities of the relevant topologies given . These probabilities, denoted , are possible to compute based on the assumption of the coalescent model that each pair of lineages is equally likely to coalesce. Examining Table 1A as a simplest case, we see that if and , only one topology is possible, and . In Table 1F, however, all four lineages survive until the divergence time, so choices of two lineages are possible for the first coalescent event. Four of these choices involve one lineage from each species, and therefore the probability is that the first coalescent event is compatible with the topology in Table 1F. Assuming that one of these four equivalent coalescent events occurs, possible pairs of lineages are available for the next coalescent event. Because and the same topology will be obtained whether the lineage that coalesces with the interspecific pair is from the left or the right taxon, the probability that the next coalescence is compatible with the tree in Table 1F is also . Given the first two coalescent events, the third coalescent event and the topology are determined, and so . Probabilities for the other topologies are computed analogously (Table 1).
With the conditional densities of given a topology and the topology probabilities known, we calculate the density function for unconditional on tree topology,
(11) |
We then find the distribution of unconditional on by summing over possible values of . If , we sum over all possible values of and , as in Equation (4). When , however, we avoid overcounting equivalent topologies by requiring . This strategy results in a sum of (n2 + n)/2 terms. Then
(12) |
where we define the starting point of the summation for as
2.3.2. Derivation of for multiple loci
For one locus, the probability that exceeds a value v is given by the survival function of , obtained by integrating the density of in Equation (12) over the range [v, ∞):
(13) |
For many loci, the survival function of is given by the probability that exceeds v for all loci:
(14) |
In the case in which the same numbers of lineages are sampled at each locus, the survival functions for different loci are identical, and Equation (14) further simplifies to
(15) |
The expectation required in computing the iMAC estimator can then be found by integration:
(16) |
Equation (16) gives the expectation that is inserted into Equation (2) to obtain the iMAC estimate according to Equation (3).
2.4. STEAC
The STEAC estimator of Liu et al. (2009) is found by computing the mean pairwise interspecific coalescence time for each locus, and then taking the mean of these times over all loci. Because the expected time until two lineages from different taxa coalesce is one coalescent unit past the divergence time, or , the expected amount by which the STEAC estimate exceeds the divergence time is . The expected mean pairwise interspecific coalescence time over all pairs of lineages is therefore . Averaging across loci, . Substituting into Equations (2) and (3), we see that the iSTEAC estimate is simply one less than the STEAC estimate when the STEAC estimate is greater than one, or zero when the STEAC estimate is less than one.
3. Comparison Of Methods
3.1. Bias and mean squared error in pairwise divergence time estimates
Each of the four improved methods has the potential to substantially reduce bias in estimates of species divergence times. Some bias remains in the improved estimates, however, and we use simulations to examine bias and mean squared error (MSE) for all of the methods.
3.1.1. Simulation procedure
We performed simulations using the multispecies coalescent model, following a procedure similar to that of Jewett and Rosenberg (2012). Briefly, in each daughter branch with n sampled lineages, two lineages were chosen randomly to coalesce, with coalescence time following an exponential distribution with mean ; this procedure was repeated until only one lineage remained or until time τ, when the kA remaining lineages from taxon A joined the kB lineages from taxon B, and the procedure was repeated in the ancestral branch until all lineages coalesced. After obtaining a gene tree for each of a series of loci, the iGLASS, iSD, iMAC and iSTEAC estimates were calculated as in Equation (3) and Sections 2.1–2.4 from the true coalescence times taken from the set of simulated gene trees at the collection of loci. For iGLASS, the expected values from Equation (8) sometimes required excessive computation due to large numbers of nested sums. Consequently, for the improvements to GLASS, the approximate iGLASS method was used instead (Equations (20) and (21) of Jewett and Rosenberg, 2012). For simplicity, this approximation was also used for iSD.
We considered several simulated scenarios, with true divergence times set to 0, 0.1, 0.3, 0.6, 1.0, 1.5, 2.1, 2.8, and 3.6 coalescent units and with the number of loci varied from 1 to 10 in unit increments. These divergence times and numbers of loci were chosen so that trends in each parameter, as well as differences among the methods over a biologically realistic range, could potentially be detected. Simulations for each of the eight methods (GLASS, iGLASS, SD, iSD, MAC, iMAC, STEAC, and iSTEAC) were performed separately. For each method, we simulated a set of 50,000 replicate experiments for each of the 90 combinations of divergence time and number of loci, with two lineages sampled from each taxon. Two lineages were sampled for the small-sample-size simulations because with a single sampled lineage, the mean pairwise interspecific coalescence and the minimum pairwise interspecific coalescence are equivalent; differences between GLASS and MAC and between SD and STEAC would then be undetectable. To investigate sample-size effects, simulations were repeated with four lineages sampled per taxon. For these larger-sample-size simulations, the number of sampled lineages was limited to four because it is desirable to compare all methods at the same sample size, and computing the distributions of the mean pairwise interspecific coalescence times necessary for evaluation of iMAC was prohibitive with larger sample sizes.
Figure 3 shows the resulting bias, variance, and MSE in the 50,000 simulations for each of the four pairs of original and improved methods, for each of the 90 combinations of divergence time and number of loci, with two lineages sampled from each taxon. Figure 4 shows corresponding results for the case in which four lineages were sampled from each taxon.
3.1.2. Equivalences among special cases
In order to better analyze trends in the bias, variance, and MSE among the methods, we first highlight two special cases that produce equivalencies among some of our methods from shared features in the ways that the methods are constructed. First, suppose that only a single locus is sampled, as in the first rows of each heat map panel of Figure 3. Comparing the first rows of Figure 3A,C, we see that if a single locus is sampled, then GLASS and SD produce identical results. GLASS considers the minimum over loci of the locus-specific minimum coalescence times, while SD considers the mean over loci of the minimum coalescence times; for a single locus, the methods are equivalent. By the same logic, MAC, which considers the minimum over loci of locus-specific mean coalescence times, and STEAC, which considers the mean over loci of the locus-specific mean coalescence times, are also equivalent for a single locus; this result can be seen by comparing the first rows of the heat maps in Figure 3E,G. In the single-locus case, GLASS, SD, MAC, and STEAC are all identical when one lineage is sampled per taxon.
A second relationship can be seen by considering large divergence times, as shown in Figure 3 by the rightmost columns of the heat maps. As the divergence time becomes large, at any locus, the probability is high that only a single lineage will remain in each branch when the divergence time is reached. In this case, the mean and minimum interspecific coalescence at a given locus will be the same. Thus, GLASS and MAC are asymptotically equivalent at large divergence times, as are SD and STEAC. This result can be verified by comparing the rightmost columns of the heat maps in Figure 3A with those in Figure 3E, and those in Figure 3C to those in Figure 3G. At infinite divergence times in the case of a single locus, all four methods would be equivalent.
3.1.3. Bias
Considering the simulation results for the various methods individually, in Figure 3A we see that the original GLASS method has a relatively low bias compared with the other original methods (Fig. 3C,E,G). A notable improvement is still made by the approximate iGLASS method of Jewett and Rosenberg (2012), in Figure 3B, particularly when few loci are sampled and at high divergence times.
In Figure 3C, a striped pattern in the heat map for the SD method confirms that with divergence time held constant, is constant in the number of loci. This result is a consequence of the fact that SD considers the mean over loci of the locus-specific minimum interspecific coalescence times, and when each locus has the same sample size, this mean has the same expectation irrespective of the number of loci. While improvements in bias are seen when comparing iSD, in Figure 3D, with SD, iSD retains more bias than does iGLASS.
The bias of MAC is shown in Figure 3E. As mentioned in Section 2.4, we see that for a single locus, bias is constant at a value of one coalescent unit. For more than one locus, unlike for GLASS and SD, the bias of MAC is most pronounced for short divergence times. At short divergence times, the numbers of lineages remaining at the divergence time will be larger than when the divergence time is large, and the mean pairwise interspecific coalescence time will have a smaller variance around its expected value of one coalescent unit. With few loci, the minimum of these times across loci is still likely to be quite high. As we will see in the next section, larger divergence times or more sampled loci increase the probability that one of the mean pairwise interspecific coalescence times will be small; in such cases, the MAC estimator will be less biased. Even so, for the bias in the iMAC estimator, a considerable improvement is seen over MAC (Fig. 3F).
Figure 3G shows that the bias of STEAC is constant at one coalescent unit regardless of the divergence time or number of loci sampled, as shown in Section 2.4. A marked improvement in bias for iSTEAC compared with STEAC is shown in Figure 3H, where the bias is reduced to a maximum of about 0.26 coalescent units in the case of only one locus sampled and a divergence time of zero. Note that the bias of iSTEAC is not zero because the iSTEAC estimate is set to zero rather than a negative value whenever the unimproved STEAC estimate is smaller than its smallest expected value, leading to upwardly biased estimates.
Comparing the four improved methods in Figure 3B,D,F,H, we see that all four improved methods have bias of similar magnitude. Furthermore, for each improved method, bias decreases with increasing divergence time or number of loci, with slight exceptions in the iGLASS and iSD cases resulting from the use of the approximate correction. Some differences between methods are seen for shorter divergence times or fewer loci. For small divergence times, iGLASS has the lowest bias for all of the numbers of loci that we have considered. For intermediate and long divergence times and when only a single locus is considered, iMAC and iSTEAC provide the least-biased estimates. At intermediate and long divergence times and when more than one locus is considered, iGLASS, iMAC, and iSTEAC yield the lowest biases.
3.1.4. Variance
The variances in pairwise species divergence time estimates generated by the four original and four improved methods appear in Figures 3I–P. Each unimproved method shows a pattern of variance in which values are higher at large divergence times and small numbers of loci. For larger divergence times, it is likely that only one lineage will remain in each taxon at the divergence time, and the resulting coalescence time between these two lineages will have a higher variance than will the minimum or mean coalescence time between multiple lineages. The greater variance when fewer loci are sampled is a standard effect of sampling more data: estimates based on only a few locus-specific values will have higher variance than those based on more loci.
Little change in variance is seen between any original method and its improved analogue. This result can be explained by viewing our bias reduction process as being approximately equivalent to subtracting a constant from the original estimates, an adjustment that would leave the variance unchanged. The small changes in variance that do occur arise from two deviations between the actual bias reduction method and this simplified scenario. First, in solving Equation (3), except in the case of STEAC, the amount subtracted from an estimate to construct the improved estimate is not actually a constant, but rather, a function of itself (Figure 5). Thus, for GLASS, SD, and MAC, the variance can change slightly between the original and improved estimates, particularly for small divergence times where the greatest deviation from the idealized scenario is seen. Second, the truncation of the improved estimates in cases in which observations lie below the expectation for zero divergence time will also decrease the variance somewhat artificially, particularly when short divergence times are considered and many improved estimates are set to zero. Because the first effect is not relevant to STEAC (Fig. 5), all of the differences between Figure 3O and Figure 3P can be attributed to the truncation effect.
The four original methods have similar variances (Fig. 3I,K,M,O), and because variance changes little between the original and improved estimates, the improved methods also have similar variances (Fig. 3J,L,N,P). As in the case of the original methods, variances for the improved methods increase with increasing divergence times and decrease with increasing numbers of loci. Variance is lowest for iGLASS, followed by iMAC, then by iSD, and iSTEAC.
3.1.5. Mean squared error
Considered together, the bias and variance results explain the patterns seen in the mean squared error (MSE) of the divergence time estimates in Figure 3Q–X. The reduction in MSE from the original to the improved methods is mostly due to the decrease in bias. Comparing the four improved methods, in Figure 3R,T,V,X, we see that the MSE in the improved methods largely shows the same pattern as the variance, increasing with greater divergence times and decreasing with greater numbers of loci. For intermediate and large divergence times, iGLASS has the lowest MSE, followed by iMAC, iSD, and iSTEAC.
3.2. Sample size
To investigate the effect of sample size on the various estimation methods, we can compare Figure 4, which considers samples of size four from each species, with Figure 3, which considers samples of size two. The comparison shows that patterns in bias, MSE, and variance are similar between the two scenarios. For large divergence times, we expect the values to be identical because at large divergence times, the probability is high that only one lineage from each branch will survive until the divergence, regardless of the initial number of lineages sampled.
At small divergence times, a comparison of Figure 4A,C with Figure 3A,C shows that the original GLASS and SD methods have less bias as the number of sampled lineages increases. Because the minimum interspecific coalescent event is considered at each locus for these methods, and because we expect the first interspecific coalescent event to occur sooner after divergence when more lineages are available at the divergence time, this result is not surprising. For the MAC method in Figures 3E and 4E, particularly at small divergence times and when more than one locus is considered, we see an increase in the bias of the original method with increasing sample size. This increase occurs because the variance of the mean pairwise interspecific coalescence time at each locus decreases when more lineages are available at the divergence time to coalesce, reducing the probability that the mean pairwise interspecific coalescence time at a given locus will be near the divergence time. Figure 4G confirms that the bias in STEAC is constant with respect to sample size.
For the improved methods, slight decreases in bias are seen with increasing sample size. In Figure 3B,D,F,H, as noted above, the bias is greatest when divergence times are low. This result is due to the fact that in these cases, many estimates that lie below their expected value, and that would produce improved estimates that would underestimate the true divergence time, are set to zero by the procedure in Equation (3). The original estimates that exceed their expectations, however, still overestimate the true divergence time, and a positive bias is introduced. When sample size is increased, the variance of the original estimates decreases, as can be seen by comparing Figure 4I,K,M,O with Figure 3I,K,M,O; as a result of this decrease in variance, fewer estimates are set to zero at small divergence times, and the positive bias in the estimates is reduced. Thus, because SD experiences a relatively large reduction in variance with the increased sample size (comparing Fig. 4K with Fig. 3K), iSD shows a considerable decrease in bias with increased sample size at low divergence times (comparing Fig. 4D with Fig. 3D). MAC and STEAC, which have only slight reductions in variance with the increase in sample size, show correspondingly small reductions in the bias of their improved methods.
3.3. Trees with more than two taxa
Distance-matrix methods use pairwise distance estimates to construct a species tree through a hierarchical clustering procedure such as single-linkage clustering (Gordon, 1996) or UPGMA (Unweighted Pair Group Method with Arithmetic Mean) (Sokal and Michener, 1958). In cases with more than two species, we evaluate the accuracy in the original and improved methods by comparing divergence times of taxon pairs in the inferred tree with the corresponding divergence times in the true tree, and by comparing the topology of the inferred tree to the topology of the true tree. Note that Liu et al. (2009) performed a comparison of the unimproved GLASS, STEAC, and SD methods with respect to the inferred tree topology under a variety of scenarios; our interest here is in comparing the improved to the unimproved methods.
To evaluate the estimates of pairwise divergence times, we choose a five-taxon pectinate species tree with divergence times of 0.025, 0.02625, 0.0275, and 0.5275 coalescent units (Fig. 6); this tree, which lies in the anomaly zone (Degnan and Rosenberg 2006), is equivalent to a species tree used by Liu et. al (2010). Two lineages were sampled from each taxon. We first evaluate the estimation methods by disregarding the topology of the simulated tree and simply comparing divergence times of taxon pairs in the inferred tree with corresponding divergence times in the true tree. This approach allows us to obtain the bias and MSE in divergence time estimates for each pair of taxa. We performed the simulations with the coalescent simulator ms (Hudson, 2002), with 50,000 replicate sets of five gene trees representing five sampled loci. The trees generated in these 50,000 sets were used to evaluate all eight estimation methods.
To evaluate the accuracy with which each method reconstructs the topology, we consider the proportion of sampled sets of gene trees that accurately predict the species tree topology. For this analysis, a set of random five-taxon species trees was generated under the Yule model (Kulkarni, 2010). Briefly, for each tree, a split into two branches is taken at the root. Subsequent bifurcations have an equal probability of occurring on each branch, and the waiting time from the (i − 1)st bifurcation to the (i)th bifurcation is exponentially distributed with mean 1/(iλ For our simulations, we take λ = 1. Bifurcations continue until the external branches of a five-taxon tree have evolved, that is, until the time of the fifth bifurcation (the moment when the sixth external branch is formed). Because sampling is not likely to occur exactly at the time of the fifth bifurcation, we truncate the last branches of the species tree, with the time of truncation chosen uniformly between the times of the fourth and fifth bifurcations.
A set of 50,000 true species trees was generated following this procedure. Using the coalescent simulator ms (Hudson, 2002), one set of five gene trees, representing five loci, was then simulated for each of the true species trees, and divergence times for each pair of taxa were estimated with each of the eight methods. The same set of true species trees, and the same set of simulated gene trees for each species tree, were used to evaluate each method. Species tree estimates were constructed using a hierarchical clustering procedure. Single-linkage was defined to be the clustering method of GLASS by Mossel and Roch (2010). In single-linkage clustering, the two taxa with the shortest distance are grouped; distances between clusters are then recalculated, where the distance between two clusters A and B is defined as the shortest distance between one taxon from cluster A and one from cluster B, considering all such possible pairs of taxa. The two clusters with the shortest distance are grouped and the process is repeated until a single cluster remains. Because MAC, like GLASS, evaluates a minimum over loci, single-linkage, with its minimization procedure for computing distances between clusters, was taken to be the appropriate choice for MAC as well. To make results comparable between improved and original methods, single-linkage clustering was also used for iGLASS and iMAC. For SD, iSD, STEAC, and iSTEAC, where the mean is taken over loci, it is natural to use a clustering method that infers each node height by taking the mean over pairs of taxa that find their common ancestor at the node; because UPGMA defines the distance between cluster A and cluster B as the mean node height of all pairs of taxa, one from cluster A and the other from cluster B, UPGMA was used for these methods.
Figure 7 shows the bias and MSE observed for estimates of pairwise divergence times on the simulated data. Both bias and MSE are reduced considerably by the four improved methods, compared with the original analogues that do not incorporate a partial bias correction. The bias and MSE values are also quite similar to those in corresponding cases from Figure 3; that is, the bias and MSE in pairwise divergence time estimates in the larger tree are similar to those that would be predicted if each pair of taxa constituted a separate two-taxon tree.
Values for our second measure of accuracy, the proportion of species trees reconstructed correctly, appear in Figure 8. Each method is seen to have an increased accuracy as more loci are sampled. The differences in accuracy among the four original methods are substantial. GLASS performs best, with >90% accuracy when ten loci are considered. MAC has >80% percent accuracy at ten loci, compared with ∼70% for SD and STEAC. Each improved method infers the correct tree topology about as often as its corresponding original method, with a slight but noticeable decrease in accuracy for iGLASS compared with GLASS and iMAC compared with MAC, and almost no difference between iSTEAC and STEAC or between iSD and SD.
Why do the original and improved methods produce similar accuracy in estimating the tree topology? For each of the methods, g(τAB) is monotonic in τAB (Fig. 5B). For example, if the divergence times between taxa X and Y and between taxa Z and W were estimated to be and , respectively, with , then the improved estimates would also satisfy . Thus, none of the bias reduction procedures will change the ordering of divergence time estimates, and we would expect that the improved methods would generally construct the same species tree topology as the analogous original methods. For short divergence times, however, the truncation of estimates at zero can cause difficulties for the improved estimates. To illustrate this problem, consider a gene tree sampled from the species tree in Figure 6, with divergence time estimates obtained from any of the four original methods, such that . Under the unimproved method, the relationships among A, B, and C would be reconstructed correctly; that is, taxa A and B would be grouped first before either was grouped with taxon C. Reducing the bias of the estimates by Equation (3), however, gives , and the estimated tree topology becomes ambiguous under the improved method. For the simulations in Figure 8, we dealt with this ambiguity by randomly ordering the tied estimates. For this reason, the improved methods show a slight decrease in accuracy compared with their analogous original methods. Because the original estimates must be obtained prior to the bias reduction process, a simple solution to the problem of ties would be to maintain the ordering from the original estimates in cases in which multiple estimates of zero are obtained by the improved estimator.
4. Discussion
We have shown through simulations that the iGLASS, iSD, iMAC, and iSTEAC estimators of species divergence times have lower biases and mean squared errors than do the original GLASS, SD, MAC, and STEAC estimators, and that these improvements are seen, to varying degrees, under a range of divergence times, numbers of loci, and numbers of lineages sampled. In this section, we place these results in context with respect to the literature, describe limitations of the work, and suggest some problems for future research.
As genetic studies acquire increasingly large amounts of data, computationally fast methods that reconstruct species tree topologies from gene trees while providing accurate branch length estimates are increasingly important. GLASS, SD, MAC, and STEAC provide efficient alternatives to maximum likelihood and Bayesian methods, which can be prohibitively slow on large data sets. Each of these four methods has its own distinct features. GLASS has been shown to be a consistent estimator of tree topology under the coalescent model as the number of loci increases (Mossel and Roch, 2010). SD uses the fact that minimum coalescence times between two species are consistent estimates of species divergence times as the number of sampled lineages increases (Takahata, 1989), and it is less susceptible than GLASS to erroneous inferences resulting from incorrectly inferred gene trees (Liu et al., 2009). In using means at individual loci rather than minima, MAC and STEAC are less susceptible than GLASS and SD to the possibility of divergence time estimates of zero in cases with little genetic variation; STEAC has also been shown to be consistent for the estimation of species tree topologies as the number of loci increases (Liu et al., 2009), and it is easy to compute. Our improved iSD, iMAC, and iSTEAC methods, together with the iGLASS method of Jewett and Rosenberg (2012), provide a class of estimators that can obtain more accurate branch length estimates while preserving properties that make GLASS, SD, MAC, and STEAC appealing.
Asymptotic running times for all of the original and improved methods except iMAC can be obtained by comparison to GLASS and approximate iGLASS, the complexities of which were derived by Mossel and Roch (2010) and by Jewett and Rosenberg (2012), respectively. Because taking a mean at either step in a method (either among pairwise times within a locus, or over loci, or both) instead of a minimum requires the same amount of time, GLASS, SD, MAC, and STEAC all have running time O(n2LS2 + S3) (Mossel and Roch, 2010), where n is the maximum number of lineages sampled from any taxon, L is the number of loci, and S is the number of taxa. iSTEAC also shares this running time, since the method entails calculating the STEAC estimate and subtracting one from each estimated divergence time. Approximate iGLASS has complexity O(n2LQS + LQ3S2), where Q is a tuning parameter affecting the accuracy of the computations (Jewett and Rosenberg, 2012); when the same number of lineages is sampled at each locus from each taxon, it can be shown that approximate iSD has this same precision as well. When loci differ in sample sizes, by noting that Equation C.1 of Jewett and Rosenberg (2012) must be computed L different times, it can be observed that the running time of approximate iSD becomes O(n2LQS + L2Q3S2).
The computation of the iMAC estimator requires sums over all elements of a certain set of unlabeled tree topologies and over all possible values of the numbers of lineages remaining from each taxon at divergence. As the number of lineages increases, iMAC is factorial in nA and factorial in nB; the complexity of the calculations increases so quickly due primarily to the number of tree topologies that must be considered, but also to the increasing complexity of computing the density of the mean pairwise interspecific coalescence time conditional on each topology. We have obtained exact expressions for iMAC for cases in which up to four lineages are sampled per taxon; for large sample sizes, exact computation of the iMAC estimate is not practical, and it will be desirable to develop an approximate iMAC method similar to approximate iGLASS.
Some limitations to the applicability of our results arise from the assumptions we have made. We considered the multispecies coalescent model, in which species divergences each occur at a single point in time, with no subsequent migration or horizontal gene transfer. Further, we have assumed that the effective sizes of all populations in our model are equal, and we have not considered the effects of mutation. Perhaps most importantly, we have assumed that gene trees are known with certainty; when gene trees are inferred, errors in the inference could substantially affect the methods. A particular problem that would be encountered in real data is inferred node heights that are considerably lower than their true values. For instance, if any estimated interspecific coalescence time in the data is zero, GLASS, which takes the minimum over loci of the minimum of pairwise coalescence times, will necessarily give an estimate of zero, as will iGLASS; simulations have suggested that GLASS can have poor performance in practice (Yang and Warnow, 2011). SD, which takes the mean of minimum coalescence times, and MAC, which takes the minimum of mean coalescence times, may also be susceptible to extreme values; underestimates will then be passed on to iSD and iMAC. STEAC, on the other hand, does not employ minimization, taking a mean of mean coalescence times, and it is less likely to be affected by coalescence times of zero. STEAC has performed well for estimating species tree topologies from erroneous gene trees (Liu et al. 2009), and given a gene tree estimator that produces unbiased estimates of coalescence times, STEAC produces an unbiased mean across loci of mean coalescence times, even if gene trees are not inferred correctly. Thus, iSTEAC, which improves upon STEAC, is also expected to perform reasonably well on estimated gene trees.
We note that all of the improved methods—iGLASS, iSD, iMAC, and iSTEAC—share a common drawback. To avoid negative estimates of speciation times, in all of the methods, we set the divergence time estimate to zero when the corresponding estimate from the original method is smaller than its smallest possible expected value, E0[VAB]. While this choice allows us to reduce bias without introducing unreasonable negative estimates, it can be problematic when an estimated divergence time is less than E0[VAB]. For example, because E0[VAB] for iSTEAC is one coalescent unit, regardless of the number of loci or the numbers of lineages sampled, for any observed estimate that is below one, iSTEAC will provide an estimate of zero. This property of the method limits its utility for small divergence times. For iGLASS, iSD, and iMAC, however, this drawback is less problematic, as divergence time estimates of zero are less likely for these methods. Furthermore, for iGLASS, iSD, and iMAC, the probability of obtaining zero estimates decreases as the amount of data increases. For iGLASS and iMAC, in which a minimum is taken over loci, E0[VAB] approaches zero as the number of loci increases, and for iGLASS and iSD, E0[VAB] approaches zero as the number of lineages sampled at each locus in each population increases. To avoid the problem of producing estimates of zero, for all four methods, one solution is to obtain the improved estimate allowing for negative estimates, and then add the absolute value of the most negative estimate to each of the improved estimates. All estimates will then be nonnegative, and fewer estimates of zero will be produced. We expect that this approach will augment bias by an amount comparable to the bias in the most negative estimate, a value that is likely to be small.
Though our simulations have compared these four methods for a variety of values of the model parameters, a comparison of the results obtained with each of the four methods on real data or simulations with mutation would also be informative. Future work that mathematically considers the effects of mutation on these methods could improve their utility for a wider range of practical studies. Finally, an accurate approximation of iMAC would enable use of this method when large amounts of data are available.
Acknowledgments
Support for this work was provided by the NSF (grants DEB-0716904 and DBI-1146722) and by the Burroughs Wellcome Fund.
Disclosure Statement
No competing financial interests exist.
References
- Degnan J.H. Rosenberg N.A. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 2009;24:332–340. doi: 10.1016/j.tree.2009.01.009. [DOI] [PubMed] [Google Scholar]
- Degnan J.H. Rosenberg N.A. Discordance of species trees with their most likely gene trees. PLoS Genet. 2006;2:762–768. doi: 10.1371/journal.pgen.0020068. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ewing G.B. Ebersberger I. von Haeseler A., et al. Rooted triple consensus and anomalous gene trees. BMC Evol. Biol. 2008;8:118. doi: 10.1186/1471-2148-8-118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Felsenstein J. Inferring Phylogenies. Sinauer Associates; Sunderland, MA: 2004. [Google Scholar]
- Gordon A.D. Hierarchical clustering, 65–121. In: Arabie P., editor; Hubert L.J., editor; De Soete G., editor. Clustering and Classification. World Scientific; River Edge, NJ: 1996. [Google Scholar]
- Jewett E.M. Rosenberg N.A. iGLASS: An improvement to the GLASS method for estimating species trees from gene trees. J. Comput. Biol. 2012;19:293–315. doi: 10.1089/cmb.2011.0231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kubatko L.S. Carstens B.C. Knowles L.L. STEM: species tree estimation using maximum likelihood for gene trees under coalescence. Bioinformatics. 2009;25:971–973. doi: 10.1093/bioinformatics/btp079. [DOI] [PubMed] [Google Scholar]
- Kulkarni V.G. Modeling and Analysis of Stochastic Systems. 2nd. Chapman & Hall/CRC Press; Boca Raton, FL: 2010. [Google Scholar]
- Liu L. Yu L. Pearl D.K. Maximum tree: a consistent estimator of the species tree. J. Math. Biol. 2010;60:95–106. doi: 10.1007/s00285-009-0260-0. [DOI] [PubMed] [Google Scholar]
- Liu L. Yu L. Pearl D.K., et al. Estimating species phylogenies using coalescence times among sequences. Syst. Biol. 2009;58:468–477. doi: 10.1093/sysbio/syp031. [DOI] [PubMed] [Google Scholar]
- Maddison W.P. Gene trees in species trees. Syst. Biol. 1997;46:523–536. [Google Scholar]
- Maddison W.P. Knowles L.L. Inferring phylogeny despite incomplete lineage sorting. Syst. Biol. 2006;55:21–30. doi: 10.1080/10635150500354928. [DOI] [PubMed] [Google Scholar]
- Meng C. Kubatko L.S. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theor. Popul. Biol. 2009;75:35–45. doi: 10.1016/j.tpb.2008.10.004. [DOI] [PubMed] [Google Scholar]
- Mossel E. Roch S. Incomplete lineage sorting: consistent phylogeny estimation from multiple loci. IEEE/ACM Trans. Comput. Biol. Bioinform. 2010;7:166–171. doi: 10.1109/TCBB.2008.66. [DOI] [PubMed] [Google Scholar]
- Ross S.M. Introduction to Probability Models. 9th. Academic Press; New York: 2007. [Google Scholar]
- Sokal R.R. Michener C.D. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 1958;38:1409–1438. [Google Scholar]
- Takahata N. Gene genealogy in three related populations: consistent probability between gene and population trees. Genetics. 1989;122:957–966. doi: 10.1093/genetics/122.4.957. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Than C. Nakhleh L. Species tree inference by minimizing deep coalescences. PLoS Comput. Biol. 2009;5:e1000501. doi: 10.1371/journal.pcbi.1000501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang J. Warnow T. Fast and accurate methods for phylogenomic analyses. BMC Bioinform. 2011;12:S4. doi: 10.1186/1471-2105-12-S9-S4. [DOI] [PMC free article] [PubMed] [Google Scholar]