Abstract
The multispecies coalescent model provides a natural framework for species tree estimation accounting for gene-tree conflicts. Although a number of species tree methods under the multispecies coalescent have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here, we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood, and maximum likelihood. We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case, major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes, whereas these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.
Keywords: concatenation, efficiency, molecular clock, MSC, multispecies coalescent, species tree
Introduction
The multispecies coalescent (MSC) model (Rannala and Yang 2003) combines the phylogenetic process of species divergences with the population genetic process of coalescent and naturally accommodates “delayed coalescence” (also known as “incomplete lineage sorting,” Maddison 1997), the phenomenon in which gene sequences fail to coalesce in their most recent common ancestor but do so only in more ancient ancestors. Delayed coalescence causes the gene tree for a gene or genomic region to differ from the species tree and is the most important factor for gene-tree–species-tree discordance (Maddison 1997; Nichols 2001; Szöllősi et al. 2015). The MSC provides a natural framework for estimating species trees accounting for genealogical heterogeneity among genes or across the genome (Edwards 2009; Xu and Yang 2016; Kubatko 2019; Rannala et al. 2020).
Two lines of research into the MSC have provided the foundation for species tree methods. The first concerns the probabilities of different gene tree topologies (Hudson 1983; Pamilo and Nei 1988) and algorithms for their efficient calculation given the species tree (Degnan and Salter 2005; Degnan and Rosenberg 2006). The gene tree distribution can be used in the two-step method of species tree estimation, by inferring gene trees for the individual loci and then applying maximum likelihood (ML) to counts of gene tree topologies (as in stells, Wu 2012). Nevertheless, widely used two-step methods, including astral (Mirarab et al. 2014) and mp-est (Liu et al. 2010), are simpler, and estimate species trees for species triplets (assuming the molecular clock) or quartets (without the clock) and then assemble the subtrees to produce a species-tree estimate for all species. Studies of gene-tree probabilities led to the discovery of the “anomaly zone,” the region of the parameter space in which the most probable gene tree has a different topology from the species tree (Degnan and Salter 2005; Degnan and Rosenberg 2006). In the anomaly zone, the two-step method, which uses the most common gene tree as the species tree estimate, will be inconsistent.
The second line of research into MSC is the development of the joint probability distribution of the gene tree and coalescent times (Rannala and Yang 2003). This forms the basis for exact methods of inference, including ML (Yang 2002; Dalquen et al. 2017) and Bayesian methods (Liu and Pearl 2007; Heled and Drummond 2010; Yang and Rannala 2014; Ogilvie et al. 2017; Rannala and Yang 2017). Although heuristic methods use summaries of the data, exact methods use the multilocus sequence alignments directly and naturally accommodate phylogenetic reconstruction errors and uncertainties (Xu and Yang 2016; Kubatko 2019; Rannala et al. 2020).
Simulation has been used to examine the performance of different species-tree methods (e.g., Leaché and Rannala 2011; Mirarab et al. 2014; Chou et al. 2015; Xu and Yang 2016). A limitation of simulation is that it can examine only a small portion of the parameter space and the results often have limited applicability. Analytical results on the efficiency of different methods have been lacking. Here, we analyze species tree estimation under the MSC in the case of three species, with one sequence from each species per locus. We focus on closely related species and assume the JC mutation model (Jukes and Cantor 1969) and the molecular clock. We are in particular interested in the efficiency of the various methods, measured by the probability of recovering the correct species tree.
We consider four inference methods: 1) ML (a full likelihood method under the MSC applied to the multilocus sequence alignments), 2) 2-step (or majority-vote), 3) concatenation (concat), and 4) independent-sites ML (isml, also known as coalescent-aware concatenation or concat) (Xu and Yang 2016). ML is the full-likelihood method and calculates the likelihood function using the multilocus sequence alignments or a sufficient summary. The 2-step method estimates the gene tree at each locus and then uses the most common gene tree as the species tree estimate. It does not account for the uncertainties in the estimated gene trees. For the case of three species considered here, 2-step is equivalent to the maximum pseudolikelihood method (mp-est) (Liu et al. 2010). Concatenation applies ML to the concatenated sequences, assuming that the same tree underlies all sites in the super alignment. In the case considered here, concatenation is equivalent to steac (Liu et al. 2009), which uses average coalescent times over loci as data to infer a gene tree, which is the species tree estimate. Isml (or concat) estimates the species tree by ML under the assumption that all sites, both from the same locus and from different loci, have independent gene trees (Xu and Yang 2016). This was suggested as an improvement to SVDQuartets of Chifman and Kubatko (2014). All four methods considered here use ML, but the likelihood function is applied to different summaries of the same data. Here, we refer to the full-likelihood or full-data method as the ML method, whereas all other methods (2-step, concatenation, and isml) are considered heuristic summary methods: 2-step uses the (estimated) gene tree topologies, whereas concatenation and isml use the site-pattern counts pooled across loci. We derive approximations to the error rate of species tree estimation by the different methods and assess their accuracy. 
We use the theory to characterize the differences in the use of information in the data by different methods.
Results
Multispecies Coalescent in the Case of Three Species
For three species A, B, and C, there are three possible species trees: S1 = ((A, B), C), S2 = ((B, C), A), and S3 = ((C, A), B), each with two divergence times (τ0 and τ1) and two population sizes (θ0 and θ1) (fig. 1a). Both τs and θs are measured by the expected number of mutations per site. For each species, the population size parameter is θ = 4Nμ, where N is the (effective) population size and μ is the mutation rate per site per generation. We consider only one sequence from each species, so that θs for the modern species are not considered. The parameters have different interpretations in different species trees: in S1, the two ancestral species are AB and ABC so the parameters are Θ1 = (τ0, τ1, θ0, θ1) = (τABC, τAB, θABC, θAB).
Fig. 1.
(a) The three species trees (S1, S2, S3) for three species (A, B, C) and the parameters in each MSC model. (b) The possible gene trees with coalescent times (t0, t1) for a locus with three sequences (a, b, c) given the species tree S1. The probabilities for the gene trees are shown above them, where φ = exp{−2(τ0 − τ1)/θ1} is the probability that a and b do not coalesce in population AB or over the time interval (τ1, τ0). Note that if the species tree is S2 (or S3), it will be possible for sequences b and c (or c and a) to coalesce in the time interval (τ1, τ0).
At each locus, three sequences (a, b, and c) are sampled, one from each species. They are related through a gene tree. The three possible gene trees are G1 = ((a, b), c), G2 = ((b, c), a), and G3 = ((c, a), b), with probabilities:
(1) $$P(G_1) = 1 - \tfrac{2}{3}\varphi, \qquad P(G_2) = P(G_3) = \tfrac{1}{3}\varphi,$$
where φ = exp{−2(τ0 − τ1)/θ1} is the probability that sequences a and b do not coalesce in population AB so that all three sequences enter the ancestor ABC and the three gene trees occur with equal probability (fig. 1b) (Hudson 1983). Here, 2(τ0 − τ1)/θ1 is known as the internal branch length in coalescent units, as the average coalescent time in population AB is 2N generations or θ1/2 mutations per site.
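As a quick numerical check, the gene-tree probabilities can be evaluated directly. The sketch below assumes the standard three-species result that each mismatching gene tree has probability φ/3, with φ = exp{−2(τ0 − τ1)/θ1}; the function name is hypothetical, and the parameter values are those listed in the note to table 1.

```python
import math

def gene_tree_probs(tau0, tau1, theta1):
    """Gene-tree topology probabilities under species tree S1 = ((A, B), C).

    Illustrative helper (not from the paper's software). Assumes that
    phi = exp(-2 (tau0 - tau1) / theta1) is the probability that sequences
    a and b fail to coalesce in the ancestral population AB.
    """
    phi = math.exp(-2.0 * (tau0 - tau1) / theta1)
    p_match = 1.0 - 2.0 * phi / 3.0   # P(G1): gene tree matches the species tree
    p_mismatch = phi / 3.0            # P(G2) = P(G3)
    return p_match, p_mismatch, p_mismatch

if __name__ == "__main__":
    # (tau0, tau1, theta1) = (0.02, 0.019, 0.05), as in the note to table 1
    print(gene_tree_probs(0.02, 0.019, 0.05))
```

With these values the matching and mismatching probabilities come out at about 0.3595 and 0.3203, the values used for the two-step method when gene trees are known without error.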
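For locus i, let t0 and t1 (with t0 > t1) be the coalescent times (node ages) on the gene tree (fig. 1b). The joint MSC density for the gene tree and coalescent times given species tree S1 and parameters Θ1 = (τ0, τ1, θ0, θ1) is then:
(2) $$f(G, t_0, t_1 \mid S_1, \Theta_1) = \begin{cases} \dfrac{2}{\theta_1} e^{-\frac{2}{\theta_1}(t_1 - \tau_1)} \times \dfrac{2}{\theta_0} e^{-\frac{2}{\theta_0}(t_0 - \tau_0)}, & \text{if } a \text{ and } b \text{ coalesce in } AB \ (\tau_1 < t_1 < \tau_0 < t_0), \\[2ex] e^{-\frac{2}{\theta_1}(\tau_0 - \tau_1)} \times \dfrac{2}{\theta_0} e^{-\frac{6}{\theta_0}(t_1 - \tau_0)} \times \dfrac{2}{\theta_0} e^{-\frac{2}{\theta_0}(t_0 - t_1)}, & \text{if all three sequences enter } ABC \ (\tau_0 < t_1 < t_0), \end{cases}$$
where the second case applies to each of the three gene tree topologies (Takahata et al. 1995; Yang 2002). The probability densities for S2 and S3 are given similarly.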
The data consist of sequence alignments at m loci. Under the JC mutation model, the data at locus i can be summarized as counts of five site patterns: xxx, xxy, yxx, xyx, and xyz, where x, y, z are any three distinct nucleotides. Let those counts be xij, j = 0, 1, ..., 4, with n = xi0 + xi1 + ... + xi4 being the number of sites (sequence length) at each locus. Let fij = xij/n be the frequencies. Let data at all m loci be X = {xi}, with xi = (xi0, xi1, ..., xi4).
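Given the gene tree and coalescent times at locus i, the probability of the sequence data, P(xi | G, t0, t1), is given by the multinomial distribution for the five site patterns. For example, given gene tree G1 with node ages t0 and t1 (fig. 1b), the site-pattern probabilities, pj (j = 0, 1, ..., 4), are as follows:
(3) $$\begin{aligned} p_0 &= \tfrac{1}{16}\left(1 + 3e^{-8t_1/3} + 6e^{-8t_0/3} + 6e^{-4(2t_0 + t_1)/3}\right), \\ p_1 &= \tfrac{1}{16}\left(3 + 9e^{-8t_1/3} - 6e^{-8t_0/3} - 6e^{-4(2t_0 + t_1)/3}\right), \\ p_2 = p_3 &= \tfrac{1}{16}\left(3 - 3e^{-8t_1/3} + 6e^{-8t_0/3} - 6e^{-4(2t_0 + t_1)/3}\right), \\ p_4 &= \tfrac{1}{16}\left(6 - 6e^{-8t_1/3} - 12e^{-8t_0/3} + 12e^{-4(2t_0 + t_1)/3}\right) \end{aligned}$$
(Yang 1994b). Note that p1 > p2 = p3 as t0 > t1. The probabilities for gene trees G2 or G3 are given by symmetry. Then the sequence data or the five site-pattern counts at the locus have the multinomial probabilities:
(4) $$P(x_i \mid G, t_0, t_1) = \frac{n!}{\prod_{j=0}^{4} x_{ij}!} \prod_{j=0}^{4} p_j^{x_{ij}}.$$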
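The two quantities in this paragraph are simple to compute. The following sketch uses the standard JC69 site-pattern probabilities for three sequences under the clock, written in terms of exp(−8t/3); the function names and the explicit exponential form are our illustrative choices rather than the paper's own notation.

```python
import math

def jc_site_pattern_probs(t0, t1):
    """Probabilities of patterns (xxx, xxy, yxx, xyx, xyz) for gene tree
    ((a, b), c) under JC69 and the clock, with root age t0 and (a, b) age t1.
    Times are in expected mutations per site."""
    if not (t0 >= t1 >= 0.0):
        raise ValueError("need t0 >= t1 >= 0")
    a = math.exp(-8.0 * t1 / 3.0)
    b = math.exp(-8.0 * t0 / 3.0)
    c = math.exp(-4.0 * (2.0 * t0 + t1) / 3.0)
    p0 = (1 + 3 * a + 6 * b + 6 * c) / 16    # xxx
    p1 = (3 + 9 * a - 6 * b - 6 * c) / 16    # xxy, supports ((a, b), c)
    p2 = (3 - 3 * a + 6 * b - 6 * c) / 16    # yxx and xyx
    p4 = (6 - 6 * a - 12 * b + 12 * c) / 16  # xyz
    return [p0, p1, p2, p2, p4]

def multinomial_loglik(counts, probs):
    """Log of the multinomial probability of the five site-pattern counts."""
    n = sum(counts)
    ll = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    ll += sum(x * math.log(p) for x, p in zip(counts, probs) if x > 0)
    return ll

if __name__ == "__main__":
    probs = jc_site_pattern_probs(0.03, 0.01)
    print(probs, multinomial_loglik([95, 3, 1, 1, 0], probs))
```

Swapping the roles of the second, third, and fourth entries gives the probabilities for the other two gene tree topologies.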
The ML Method of Species Tree Estimation
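The log-likelihood function for species tree S1 with parameters Θ1 = (τ0, τ1, θ0, θ1) is given by summing over the gene trees and integrating over the coalescent times.
(5) $$\ell_1(\Theta_1) = \sum_{i=1}^{m} \log \left\{ \sum_{G} \iint f(G, t_0, t_1 \mid S_1, \Theta_1)\, P(x_i \mid G, t_0, t_1)\, \mathrm{d}t_0\, \mathrm{d}t_1 \right\},$$
where f(G, t0, t1 | S1, Θ1) is the MSC density for the gene tree and coalescent times at locus i (eq. 2), and P(xi | G, t0, t1) is the probability of the sequence data at locus i given the gene tree (eq. 4). The log likelihood functions, ℓ2(Θ2) and ℓ3(Θ3), for S2 (with parameters Θ2) and S3 (with Θ3) are defined similarly.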
Maximizing the log-likelihood function (eq. 5) with respect to the parameters will lead to a maximized log-likelihood value for the given species tree, and the species tree that achieves the highest value is the ML species tree. This is not analytically tractable. The program 3s implements the method by explicitly summing over the gene trees (Gi) and by using Gaussian quadrature to calculate the 2D integrals over the coalescent times t0 and t1 (eq. 5) (Yang 2002; Zhu and Yang 2012; Dalquen et al. 2017). We use this program in our simulations.
We present two theorems for approximating the error in species tree estimation.
Theorem 1.
(a) Suppose , are an independent and identically distributed (i.i.d.) sample of size m from a distribution with means , with , and variances , where and . Let be the sample means, with j = 1, 2, 3. For large m, . Let . Then
(6) where is the cumulative distribution function (CDF) for the normal distribution . We also write ζN as .
(b) Let and , with and . Then ζ is bounded by:
(7) where . The equality for the lower bound holds when . We write those bounds as , so that .
Proof.
A proof is given in Appendix A, in which, we discuss alternative approximations and also give a tighter pair of bounds in equation (A27), with . □
In this paper, ζ represents the error probability of species tree estimation. Thus, the bounds suggest that when m → ∞, the probit transform of the species-tree error probability, Φ⁻¹(ζ), where Φ⁻¹ is the inverse CDF of the standard normal distribution N(0, 1), decreases linearly with √m. For practical calculations for finite m in this paper, equation (6) is more accurate (see Appendix A) and will be used later.
Corollary 2. Let (y0, y1, y2, y3) be random variables from the multinomial distribution MN(m, q0, q1, q2, q3), with q2 = q3, and . Then can be approximated by:
(8) |
(9) |
Proof.
Let be the observed frequencies. We have and for . Then equation (8) follows from equation (6) in Theorem 1. The form , an alternative to equation (8), is from Yang (1996, eq. 3), based on Zharkikh and Li (1992, eq. 20). This applies the term to correct for discontinuity (Fleiss et al. 2003) and ignores correlations between y1, y2, and y3 as well as some terms of small probabilities. The discontinuity correction does not appear to be useful. If , both forms, with and without the discontinuity correction, are very close. □
The error rate for the ML method (eq. 5) is analyzed in Appendix B. When the number of loci , the MLE in species tree Sj, j = 1, 2, 3. Note that S1 represents the true model and are the true parameter values, while S2 and S3 are misspecified models and and are the “best-fitting or pseudotrue parameter values.” The Kullback–Leibler distance D12 from S2 to S1 is:
(10) where , with x being one data point (or site pattern counts at one locus), and where the integral means summation over all possible data outcomes at a locus. We use the per-locus log-likelihood values to compare the three species trees: , j = 1, 2, 3. When m is large, these have the means , with , and the variance matrix , where and . The error of the ML method, , is then given by Theorem 1 as:
(11) Equation (11) cannot be used to calculate the error rate for ML as D12 and σjk are not easily computable. It predicts a linear relationship between the probit-transformed error and √m. This is confirmed by simulation (fig. 2a′–c′).
Precise results may be obtained in special cases. In the case of one locus (m = 1), the ML gene tree is the ML species tree except for rare data sets: the true species tree S1 is recovered if . In rare data sets of extreme divergence, even if , ties for gene trees are possible, with the star tree being as good as the binary trees (Yang 2000), whereas ML under MSC favors S1. One such data set is , in which case the three gene trees as well as the star tree achieve the same likelihood, whereas ML under MSC favors S1. However, such data sets involve sequences more divergent than random sequences and have vanishingly small probability when n is large. Thus, we ignore them and consider all methods to be equivalent when m = 1. With one locus, it is impossible to identify all parameters in the MSC model: there are four parameters and only three independent site-pattern frequencies ( for S1, for example).
The case of one site per locus (n = 1) is analyzed later in the section on isml. Numerical calculations on a model species tree are presented in table 1. They will be discussed later in comparison with other methods.
In the case of n = ∞, the gene tree (including the coalescent times) at each locus is given without errors. The likelihood is then the product of MSC densities of gene trees across the loci (eq. 2). This likelihood has singularities, with one or more species trees achieving infinite likelihood (Liu et al. 2010; Yang 2014). In the case of three species considered here, only one species tree (given by the smallest coalescent time) achieves infinite likelihood and will be the unique species-tree estimate, so that the estimation can proceed despite the singularity (Yang 2014, p. 360, Problem 9.4). Let the smallest coalescent/divergence time between species across all loci be tab, tbc, and tca. If tab is the smallest among the three, then species tree S1 achieves infinite likelihood, by collapsing on the coalescent time tab; that is, as and (see eq. 2) (Yang 2014, p. 338–339), whereas the other two species trees have finite likelihood.
Given S1 as the true species tree, both tbc and tca are greater than τ0 (fig. 1b). If sequences a and b coalesce in population AB at any of the m loci, tab will be smaller than both tbc and tca, and S1 will be the ML species tree. Thus, an incorrect species tree is inferred only if a and b do not coalesce in AB at any of the m loci and are not the first to coalesce in the root population ABC. Thus,
(12) $$e_{\mathrm{ML}} = \tfrac{2}{3}\varphi^{m},$$
where φ = exp{−2(τ0 − τ1)/θ1} is the probability that a and b do not coalesce in population AB. This equation is exact and applies to both small and large m (fig. 3b).
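The argument above can be turned into a one-line calculation. The sketch below assumes our reading of that argument: the error requires no coalescence in AB at any of the m loci (probability φ^m, with φ = exp{−2(τ0 − τ1)/θ1}), after which the pair with the overall smallest coalescent time is the wrong one with probability 2/3. The function name is hypothetical.

```python
import math

def ml_error_known_gene_trees(m, tau0, tau1, theta1):
    """Error probability of ML when gene trees and coalescent times are
    known without error, under species tree S1 = ((A, B), C).

    Assumes error = (2/3) * phi**m, where phi is the probability that
    a and b do not coalesce in population AB at a locus.
    """
    phi = math.exp(-2.0 * (tau0 - tau1) / theta1)
    return (2.0 / 3.0) * phi ** m

if __name__ == "__main__":
    for m in (1, 10, 100):
        print(m, ml_error_known_gene_trees(m, 0.02, 0.019, 0.05))
```

For (τ0, τ1, θ1) = (0.02, 0.019, 0.05) this gives about 0.447 at m = 10 and 0.012 at m = 100. Unlike the summary methods, the error here decays exponentially in m rather than in √m on the probit scale.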
Fig. 2.
(a–c) Species-tree estimation error (e) at three sequence lengths (n = 1, 2, 1,000) plotted against the number of loci (m) for different methods. (a′–c′) The probit transform of the species-tree error, Φ⁻¹(e), plotted against √m. The parameters used in the simulation are , and . When n = 1, all four methods (ML, 2-step, concatenation, and isml) give the same species tree estimate, while concatenation and isml are equivalent in all cases considered in this paper. The number of replicates is for ML and for the other methods.
Table 1.
Probabilities () of Estimated Gene Trees at Different Sequence Lengths (n) and the Error Rates for the Summary Methods 2-step and isml with m = 1,000 Loci, Each with n Sites.
n | 1 | 2 | 10 | 100 | 1,000 | ∞ |
---|---|---|---|---|---|---|
2-step (mp-est) | | | | | | |
Tie | 0.92948 | 0.8673 | 0.57015 | 0.22159 | 0.05105 | 0 |
g1 | 0.02378 | 0.04474 | 0.14515 | 0.26646 | 0.33273 | 0.35947 |
g2 | 0.02337 | 0.04398 | 0.14235 | 0.25598 | 0.30811 | 0.32026 |
Error (simulation) | 0.642 | 0.633 | 0.597 | 0.470 | 0.260 | 0.114 |
ζ (eq. 8) | 0.644 | 0.635 | 0.600 | 0.472 | 0.264 | 0.113 |
Alternative form (eq. 9) | NA | NA | 0.613 | 0.482 | 0.271 | 0.117 |
Bounds (eq. 7) | (0.635, 0.953) | (0.623, 0.935) | (0.578, 0.869) | (0.430, 0.647) | (0.219, 0.331) | (0.087, 0.132) |
Tight bounds (eq. A27) | (0.637, 0.729) | (0.626, 0.714) | (0.585, 0.668) | (0.446, 0.561) | (0.242, 0.328) | (0.103, 0.132) |
ζ (mean2) | 0.683 | 0.670 | 0.627 | 0.504 | 0.285 | 0.118 |
a | 0.574051 | 0.574056 | 0.573612 | 0.569708 | 0.562911 | 0.555962 |
b | 0.0678913 | 0.0930368 | 0.190376 | 0.527747 | 1.11658 | 1.72268 |
isml (concat) | | | | | | |
Error (simulation) | 0.642 | 0.632 | 0.590 | 0.438 | 0.246 | 0.196 |
ζN | 0.644 | 0.634 | 0.592 | 0.443 | 0.254 | 0.194 |
Approximation ignoring correlation | 0.643 | 0.633 | 0.591 | 0.437 | 0.234 | 0.166 |
Bounds (eq. 7) | (0.635, 0.953) | (0.622, 0.934) | (0.568, 0.854) | (0.397, 0.598) | (0.211, 0.318) | (0.157, 0.237) |
Tight bounds (eq. A27) | (0.637, 0.728) | (0.625, 0.713) | (0.576, 0.659) | (0.416, 0.536) | (0.233, 0.316) | (0.177, 0.237) |
ζ (mean2) | 0.683 | 0.669 | 0.618 | 0.476 | 0.275 | 0.207 |
a | 0.574029 | 0.573971 | 0.57356 | 0.569747 | 0.558232 | 0.553151 |
b | 0.067892 | 0.0958963 | 0.21228 | 0.607057 | 1.14253 | 1.35035 |
Note.— is the probability for ties in gene trees, with 1. The probabilities of estimated gene trees () as well as the error rates ( and ) are estimated by simulation using a C program, with replicates. Ties are broken evenly in the error calculation. The parameter values used are (τ0, τ1, θ0, θ1) = (0.02, 0.019, 0.01, 0.05). The marginal (pooled) site pattern probabilities are () = (0.92831926, 0.023777106, 0.023372801, 0.023372801, 0.001158033), given by equation (13). For 2-step, at n = 1, the estimated gene tree is determined by the single site so that and , whereas at n = ∞, the estimated gene tree is the true gene tree, so that and (eq. 1). For 2-step, (eq. 9) is inapplicable at n = 1 or 2 as m = 1,000 is too small. For isml, ignores the correlation (eq. 6), while ζN accounts for the correlation. The bounds and are calculated using equations (7) and (A27), with k = 2 used in . “mean2” is the average of the tight bounds: .
Fig. 3.
Error rates in species-tree estimation by ML, 2-step, and isml (=concatenation). (a) Error plotted against sequence length n when the number of loci m is fixed at 100 or 1,000, generated by simulation. (b) Error plotted against m when n = ∞. Error for ML is given by equation (12), whereas those for isml and 2-step are generated by simulation. (c) Error plotted against n when is fixed, generated by simulation. Note that all four methods are equivalent when n = 1 or m = 1, while concatenation and isml are equivalent in all cases. Parameters used in the simulation are , and . The number of replicates is .
Concatenation
Sequence alignments at the m loci are merged into a super-alignment of length nm, and the data are the site-pattern counts pooled across loci: , with . The likelihood function is given by the multinomial probability of equation (4) except that is used instead of xij. The ML tree is G1 if (Yang 1994b, 2000). We discuss the error rate of concatenation below in the section on the isml method.
We also examine biases in parameter estimation using concatenation. We use species tree S1 with τABC = 0.02, τAB = 0.01, θABC = 0.02, and θAB = 0.01 to simulate loci each with n = 250 sites. We obtain MLEs of the node ages t0 and t1 on gene tree G1 from the concatenated data for comparison with the MLEs of τ0 and τ1 on species tree S1 in the MSC model (eq. 5). With so much data, both concatenation and ML recover the true tree with near certainty. The MLEs under the MSC (obtained using the 3s program) are very close to the true values, whereas concatenation (baseml in paml, Yang 2007) produced seriously biased estimates (table 2). Even the relative age, 0.0298/0.0155 = 1.92, differs from the true ratio of 0.02/0.01 = 2, which means that molecular clock dating analysis using concatenated data will produce biased time estimates (Angelis and dos Reis 2015; Ogilvie et al. 2017; Tiley et al. 2020).
Table 2.
Estimates of Divergence Times (true values in parentheses) by ML under the MSC (3s) and by Concatenation (baseml) in Two Simulated Data Sets, Each of Loci and n = 250 Sites.
Data/method | τABC (0.02) | τAB (0.01) | θABC (0.02) | θAB (0.01) |
---|---|---|---|---|
Data set 1, 3s | 0.0201 | 0.0096 | 0.0199 | 0.0101 |
Data set 2, 3s | 0.0196 | 0.0100 | 0.0201 | 0.0100 |
Data set 1, baseml | 0.0298 | 0.0155 | | |
Data set 2, baseml | 0.0298 | 0.0156 | | |
ISML
The isml method assumes that all sites in the super-alignment are i.i.d. Like concatenation, the data are summarized as pooled site-pattern counts, . However, isml is coalescent-aware and uses the MSC model to calculate the probabilities for the site patterns. By averaging the conditional site-pattern probabilities of equation (3) over the MSC density of gene trees and coalescent times of equation (2), we derive the marginal site-pattern probabilities, , as:
(13) |
where , , , , and , with . Note that are functions of a0, and , although these do not appear to permit simple biological interpretations. The cases for S2 and S3 are given by symmetry.
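The averaging described above can be carried out in closed form, because every term in the JC site-pattern probabilities is an exponential in the coalescent times. The sketch below is our own derivation under that approach, not necessarily the parametrization used in equation (13): it assumes the standard JC69 pattern probabilities for three sequences under the clock and the MSC density for S1, and all function and variable names are hypothetical.

```python
import math

def marginal_site_pattern_probs(tau0, tau1, theta0, theta1):
    """Marginal probabilities of (xxx, xxy, yxx, xyx, xyz) under S1 = ((A, B), C)."""
    r0, r1 = 2.0 / theta0, 2.0 / theta1   # pairwise coalescent rates in ABC, AB
    delta = tau0 - tau1
    phi = math.exp(-r1 * delta)           # P(a, b do not coalesce in AB)
    s = 8.0 / 3.0

    def in_ab(rate):
        # E[exp(-rate * t1); a and b coalesce in AB], with tau1 < t1 < tau0
        return (math.exp(-rate * tau1) * r1 / (r1 + rate)
                * (1.0 - math.exp(-(r1 + rate) * delta)))

    # E[exp(-8 t0 / 3)] when the last two lineages coalesce in ABC after tau0
    root = math.exp(-s * tau0) * r0 / (r0 + s)

    # Each tuple holds (weight, E[e^{-8 t1/3}], E[e^{-8 t0/3}], E[e^{-4(2 t0 + t1)/3}]),
    # with expectations already multiplied by the gene-tree weight.
    w_ab = 1.0 - phi
    ab = (w_ab, in_ab(s), w_ab * root, in_ab(s / 2.0) * root)

    w_abc = phi / 3.0                     # each of the three topologies in ABC
    first = 3.0 * r0                      # rate of the first of three coalescences
    a_abc = w_abc * math.exp(-s * tau0) * first / (first + s)
    b_abc = a_abc * r0 / (r0 + s)
    c_abc = w_abc * math.exp(-4.0 * tau0) * first / (first + 4.0) * r0 / (r0 + s)
    abc = (w_abc, a_abc, b_abc, c_abc)

    def const(w, a, b, c):      # pattern xxx
        return (w + 3 * a + 6 * b + 6 * c) / 16
    def same(w, a, b, c):       # pattern supporting the gene tree topology
        return (3 * w + 9 * a - 6 * b - 6 * c) / 16
    def other(w, a, b, c):      # either pattern conflicting with the topology
        return (3 * w - 3 * a + 6 * b - 6 * c) / 16
    def distinct(w, a, b, c):   # pattern xyz
        return (6 * w - 6 * a - 12 * b + 12 * c) / 16

    p0 = const(*ab) + 3 * const(*abc)
    p1 = same(*ab) + same(*abc) + 2 * other(*abc)
    p2 = other(*ab) + same(*abc) + 2 * other(*abc)
    p4 = distinct(*ab) + 3 * distinct(*abc)
    return [p0, p1, p2, p2, p4]

if __name__ == "__main__":
    print(marginal_site_pattern_probs(0.02, 0.019, 0.01, 0.05))
```

For (τ0, τ1, θ0, θ1) = (0.02, 0.019, 0.01, 0.05) the result agrees with the marginal probabilities listed in the note to table 1. The excess of xxy over yxx comes entirely from gene trees in which a and b coalesce in AB, since the three topologies arising in ABC contribute symmetrically.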
The likelihood function (or the probability for the pooled site-pattern counts) for each species tree is:
(14) |
Theorem 3.
(a) If the true species tree is S1 with parameters , then . (b) Isml infers the species tree S1 if .
Proof.
(a) Each of the marginal site pattern probabilities , is a sum over the four gene trees of figure 1b: and G3. The three gene trees , and G3 have the same densities (eq. 2). Together their contribution to the site pattern xxy is the same as that to the pattern yxx or pattern xyx. If the gene tree is (with any coalescent times t0 > t1), site pattern xxy will have a higher probability than yxx or xyx, with . Averaging over all the four gene trees, we have .
(b) We show that if , then , where and are the MLEs under each species tree. First note that if and , then . Let and , and we have . In other words, even if we use (the MLE for S2) to calculate the likelihood for species tree S1, tree S1 will have a higher likelihood than S2. Since may not be optimal for S1, it follows that . □
Theorem 3 means that isml infers species tree Sj if is the greatest among , and , just like concatenation.
To study the error rate for isml (or concat), let pij, be the site-pattern probabilities at any locus i. Data at each locus are represented by the site-pattern frequencies . Let be the data at locus i. The fi are i.i.d. among loci from a common distribution with mean and variance/covariance and . Let be the means over loci. Here, constitute the full data, whereas are summaries used by isml: the species tree estimate is Sj if is the largest among . Thus, , where . Below we derive the variances.
At n = 1, they are given by the multinomial distribution as:
(15) At n = ∞, we have fij = pij, given by equation (3). The variances, denoted , can be generated by simulating gene trees with coalescent times and calculating the site-pattern probabilities (eq. 3) (supplementary table S1, Supplementary Material online). This distribution is 3D (for , , and under S1), indexed by four parameters (τ0, τ1, θ0, θ1 in S1), and is a mixture distribution with 4 components corresponding to the four gene trees of figure 1b. It reflects the coalescent fluctuation in gene genealogies.
For any finite n, the variances are given by:
(16) where (eq. 13), whereas and are the variances/covariances over the coalescent process. These are calculated for a set of parameter values in supplementary table S1, Supplementary Material online. The variances of fij are thus weighted averages of variances at n = 1 and n = ∞.
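The weighted-average form follows from the law of total variance, conditioning on the gene tree and coalescent times at the locus. The derivation below is a sketch in notation introduced here for illustration: p_j is the conditional site-pattern probability of equation (3), \bar{p}_j its mean over the coalescent process (eq. 13), and v_{jk} the variances and covariances of the p_j over the coalescent process.

```latex
\begin{align*}
\operatorname{Var}(f_{ij})
  &= \mathbb{E}\bigl[\operatorname{Var}(f_{ij} \mid G, t_0, t_1)\bigr]
   + \operatorname{Var}\bigl(\mathbb{E}[f_{ij} \mid G, t_0, t_1]\bigr) \\
  &= \mathbb{E}\!\left[\frac{p_j(1 - p_j)}{n}\right] + \operatorname{Var}(p_j)
   = \frac{1}{n}\,\bar{p}_j(1 - \bar{p}_j) + \left(1 - \frac{1}{n}\right) v_{jj}, \\[1ex]
\operatorname{Cov}(f_{ij}, f_{ik})
  &= \mathbb{E}\!\left[-\frac{p_j p_k}{n}\right] + \operatorname{Cov}(p_j, p_k)
   = -\frac{1}{n}\,\bar{p}_j \bar{p}_k + \left(1 - \frac{1}{n}\right) v_{jk},
   \qquad j \neq k.
\end{align*}
```

The second equality in each line uses E[p_j^2] = v_{jj} + \bar{p}_j^2 and E[p_j p_k] = v_{jk} + \bar{p}_j \bar{p}_k. The first term is the multinomial sampling variance at n = 1 scaled by 1/n, and the second is the coalescent variance at n = ∞ scaled by 1 − 1/n.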
The approximation is very accurate, with errors <0.002 in the simulation of table 1. At large n, accommodating correlation is useful as which ignores correlation is less accurate (see fig. 4 for the case of n = ∞). For example, the correlation is , and −0.181 at n = 1, 1,000, and ∞, respectively (supplementary table S1, Supplementary Material online).
We now consider parameter estimation by isml. Theorem 3 allows species tree estimation by isml without knowledge of the MLE of the parameters. With data of , there are only three observations (three free proportions , and in the case of S1). As there are four parameters in the MSC model, it is impossible to identify all of them.
If we assume θ0 = θ1 = θ (as in Tian and Kubatko 2016), all three parameters (θ, τ0, τ1) will be identifiable. With θ0 = θ1 = θ, equation (13) simplifies to:
(17) where a0, a1, and b are defined in equation (13) with . By equating the observed site-pattern frequencies to their expected probabilities (eq. 17), we have
(18) Thus, we have a quadratic equation in :
(19) This always has a unique positive root. Given , the estimates and are given by equation (18), which are guaranteed to be positive.
Fig. 4.
Species tree error for isml at n = ∞ generated by simulation (10^8 replicates) and by approximation based on ζN either with or without accounting for correlations. The error goes from 0.64 (at m = 1) to 0.19 (at m = 1,000). Results for other methods for the same parameter settings are in figure 3b.
Thus, under the assumption θ0 = θ1 = θ, the isml method provides estimates of the three parameters in the model: θ, τ0, and τ1. As there is a one-to-one correspondence between the parameters and the multinomial proportions, the estimates are consistent and approach the true values when m → ∞ for any n ≥ 1 if the assumption of θ0 = θ1 is correct (table 3, cases c and d). However, the pooled site-pattern counts or average site-pattern frequencies are summaries of the original data and are not sufficient statistics. It then follows that the isml estimates will be less efficient and have larger asymptotic variances than the MLEs obtained from the full data under the same model assumption of θ0 = θ1 (table 3, case c). Furthermore, if θ0 ≠ θ1, assuming θ0 = θ1 will lead to biased and inconsistent parameter estimates even if the same species tree estimate is produced. In other words, if θ0 ≠ θ1, the isml method assuming θ0 = θ1 will produce a consistent estimate of the species tree and inconsistent estimates of the model parameters (table 3, cases e and f).
Table 3.
Characterization of the isml Method.
Case | True Model | Assumption | Data Size | Parameters | isml vs. ml |
---|---|---|---|---|---|
(a) | θ0 ≠ θ1 | θ0 ≠ θ1 | n > 1 | 3 out of 4 identifiable | isml ≠ ml |
(b) | θ0 ≠ θ1 | θ0 ≠ θ1 | n = 1 | 3 out of 4 identifiable | isml = ml |
(c) | θ0 = θ1 | θ0 = θ1 | n > 1 | all 3 identifiable | isml ≠ ml |
(d) | θ0 = θ1 | θ0 = θ1 | n = 1 | all 3 identifiable | isml = ml |
(e) | θ0 ≠ θ1 | θ0 = θ1 | n > 1 | 3 out of 4 identifiable, inconsistent | isml ≠ ml |
(f) | θ0 ≠ θ1 | θ0 = θ1 | n = 1 | 3 out of 4 identifiable, inconsistent | isml = ml |
Note.—In all cases, the species tree topology is identifiable and consistently estimated by isml when the number of loci m → ∞. If the parameters are identifiable, their estimates will be consistent. When isml differs from ML and the assumed model is correct, isml is less efficient than ML for parameter estimation (case c).
Two-Step Method (Majority Vote)
In the 2-step method, we estimate gene trees at individual loci and then use the most common gene tree topology as the species tree estimate. Under JC, the ML gene tree for locus i (which is also the upgma tree) is tree Gj if xij is the largest among xi1, xi2, and xi3 (Yang 1994b, 2000); site patterns xxy, yxx, and xyx “support” gene trees G1, G2, and G3, respectively. There is no need for numerical optimization to obtain the ML tree at each locus.
Let g1, g2, and g3 be the probabilities that the estimated gene tree is G1, G2, and G3, respectively; that is, , and so on. These are functions of all four parameters in the MSC model () as well as the sequence length n, and can be computed numerically (Yang 2002, eq. 12) or by simulation. Under JC and the clock, (Yang 2002). This result has several implications. First, means that phylogenetic errors inflate gene-tree–species-tree discordance and lead to underestimation of the internal branch length in the species tree (Yang 2002). Second also means that use of estimated (rather than true) gene trees leads to reduced probability for recovering the correct species tree. Third, means that the 2-step estimate of the species tree is consistent even if estimated gene trees are used.
Let the number of loci at which G1 is the ML tree be , where the indicator function if statement a is true and 0 otherwise. Similarly define m2 and m3 to be the counts for the two mismatching gene trees. The correct species tree is inferred if and only if . Thus, the error rate can be approximated by (eq. 8).
The accuracy of this approximation is assessed in table 1 at different values of n with m = 1,000 and with parameter values τ0 = 0.02, τ1 = 0.019, θ0 = 0.01, and θ1 = 0.05. Consider first the case of n = 1. The gene tree is resolved if the single site at the locus has site patterns 1, 2, or 3, but is unresolved if the site has patterns 0 or 4. Whether we ignore loci with ties (with site patterns 0 or 4) or break ties evenly (assigning 1/3 to each gene tree) does not affect the species tree estimate. Thus, g1 and g2 equal the marginal probabilities of site patterns xxy and yxx (eq. 13), and the approximate error is 0.644 (table 1). This is the same as the approximation for isml, consistent with the fact that at n = 1 all methods considered here are equivalent.
If n = ∞, the estimated gene trees will be the true gene trees so that g1 = P(G1) and g2 = P(G2) (eq. 1). The error rate is then ζ(1,000, 0.3594737, 0.3202631) = 0.1132, close to 0.114 from simulation. At n = 1,000, the proportions of estimated gene trees are g1 = 0.33273 and g2 = 0.30811, so that the approximate error is 0.264, close to 0.260 by simulation (table 1). These are much larger than 0.114 at n = ∞, suggesting that with n = 1,000 sites in the sequence, the estimated gene trees have substantial errors and uncertainties.
The approximations (eq. 9) and ζ (eq. 8) give nearly identical results. The error rate is found to be very sensitive to the precise values of g1 and g2. Overall, the approximation is good, with errors within or close to 1%.
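For moderate m, the error of the majority vote can also be computed exactly by enumerating the multinomial counts of gene tree topologies, which gives an independent check on the normal approximations. The sketch below is illustrative and makes two assumptions: every locus yields a resolved gene tree (g1 + g2 + g3 = 1, as when gene trees are known without error), and ties among species trees are broken evenly. The function name is hypothetical.

```python
import math

def majority_vote_error(m, g1, g2, g3):
    """Exact error probability of the 2-step (majority-vote) estimate.

    m loci each yield gene tree G1, G2, or G3 with probabilities g1, g2, g3
    (assumed to sum to 1); G1 matches the true species tree.
    """
    if abs(g1 + g2 + g3 - 1.0) > 1e-6 or min(g1, g2, g3) <= 0.0:
        raise ValueError("g1, g2, g3 must be positive and sum to 1")
    lgam = [math.lgamma(k + 1) for k in range(m + 1)]   # log k!
    l1, l2, l3 = math.log(g1), math.log(g2), math.log(g3)
    p_correct = 0.0
    for m1 in range(m + 1):
        for m2 in range(m - m1 + 1):
            m3 = m - m1 - m2
            top = max(m2, m3)
            if m1 < top:
                continue                 # a mismatching tree wins outright
            if m1 > top:
                share = 1.0              # G1 is the unique most common tree
            elif m2 == m3:
                share = 1.0 / 3.0        # three-way tie
            else:
                share = 0.5              # two-way tie involving G1
            logp = (lgam[m] - lgam[m1] - lgam[m2] - lgam[m3]
                    + m1 * l1 + m2 * l2 + m3 * l3)
            p_correct += share * math.exp(logp)
    return 1.0 - p_correct

if __name__ == "__main__":
    print(majority_vote_error(1000, 0.3594737, 0.3202631, 0.3202632))
```

The enumeration visits on the order of m²/2 count configurations, so it remains practical for m = 1,000. Handling unresolved gene trees would require a fourth category and is omitted here.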
Numerical Comparison of Different Methods
We use simulation to compare the different species-tree estimation methods and to assess the reliability of our approximations. We use a challenging species tree with parameters , and . The error is plotted against the number of loci (m) when the number of sites per locus is fixed at 1, 2, or 1,000 (fig. 2).
In the case of one site per sequence (n = 1), all four methods considered in this study are equivalent, with the species tree given by the most frequent pooled site pattern (i.e., the greatest of the pooled counts of patterns xxy, yxx, and xyx). With one site, the independent-sites assumption is correct, and ml and isml are exactly the same. As discussed earlier, concatenation and 2-step also select the species tree according to the pooled site patterns. Treatment of ties among these counts has very minor effects on the error rate. For n = 1 and m = 1,000, simulation gave the error estimate 0.642 if ties are broken evenly (table 1) or 0.641 if data sets with ties are ignored. As predicted by our theory, the probit transform of the error, Φ⁻¹(e), shows a linear relationship with √m (fig. 2a′).
In the case of n = 2 sites per locus, isml (=concatenation), 2-step, and ML are all distinct. To see that concatenation and 2-step may produce different species trees, consider the case of m = 3 loci and n = 2 sites. If the data at the three loci are 11, 02, and 00, where 0–4 represent the five site patterns, concatenation will infer the correct species tree S1 (as ), whereas 2-step will have a tie between S1 and S2 (as ). If the data at the three loci are 33, 01, and 14, concatenation will have a tie between S1 and S3 (as ), whereas 2-step will infer the correct species tree (as ). We also confirm that at n = 2 ML differs from all three summary methods and can identify and consistently estimate all four parameters in the MSC model. Indeed, ML is far more efficient for species tree estimation than the summary methods when n = 2 (fig. 2b and b′). Although the summary methods improve only slightly when n changes from 1 to 2, there is a major performance boost for ML (fig. 3a). This may be due to the fact that the model is fully identifiable with n = 2 but not when n = 1. The predicted linear relationship between and holds well for the three summary methods (fig. 2b′). For ML, if we remove the first two points (for m = 10 and 20), the relationship is nearly linear, with , with .
The most interesting case is with , since in real data sets n may be in the range 50–5,000, say. We used n = 1,000 in figure 2c and c′. As in the case of n = 2, there is a large performance gap between ML and the three summary methods (isml = concatenation, and 2-step), whereas the summary methods perform similarly. The approximate linear relationship between and holds well for all methods.
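The two small data sets in this paragraph can be replayed in a few lines. The sketch below assumes that site patterns 1, 2, and 3 support trees 1, 2, and 3, that patterns 0 and 4 are uninformative, and that the gene tree at a locus is the tree whose pattern is most frequent at that locus, with ties split evenly. The function names are hypothetical and the code is an illustration of the counting rules, not the authors' program.

```python
from collections import Counter

INFORMATIVE = (1, 2, 3)  # site patterns supporting trees 1, 2, 3


def best_trees(counts):
    """Return the set of trees with the largest count (more than one if tied)."""
    top = max(counts[k] for k in INFORMATIVE)
    return {k for k in INFORMATIVE if counts[k] == top}


def concatenation(loci):
    """Pool site patterns over all loci, then pick the most frequent one."""
    pooled = Counter(int(p) for locus in loci for p in locus)
    return best_trees(pooled)


def two_step(loci):
    """Infer a gene tree per locus, then pick the most frequent gene tree."""
    votes = Counter()
    for locus in loci:
        tied = best_trees(Counter(int(p) for p in locus))
        for tree in tied:
            votes[tree] += 6 // len(tied)  # 6 is divisible by 1, 2, and 3
    return best_trees(votes)


if __name__ == "__main__":
    for data in (["11", "02", "00"], ["33", "01", "14"]):
        print(data, "concatenation:", concatenation(data),
              "2-step:", two_step(data))
```

For the first data set this gives tree 1 under concatenation and a tie between trees 1 and 2 under 2-step. For the second it gives a tie between trees 1 and 3 under concatenation and tree 1 under 2-step, matching the text.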
The superior performance of ML persists in the limit of (fig. 3b). For example, 0.45 and 0.01 for ML at m = 10 and 100, respectively, compared with 0.60 and 0.46 for 2-step or 0.62 and 0.51 for isml. The differences between ML and 2-step reflect the information in the coalescent times or gene-tree branch lengths. The differences between ML and isml reflect the information in the variation of site-pattern frequencies among loci, as isml uses only the averages across loci.
Figure 3c examines the error rates of different methods, while is fixed. At the two ends (n = 1 or m = 1), all four methods are equivalent, with 0.587 at n = 1 and , and 0.646 at m = 1 and . Note that when and , the error , while if m = 1 and , the error 0.6405. The high error at m = 1 even when is because a single gene tree (with coalescent times), even if known with certainty, does not contain much information about the MSC process. Away from the two ends (n > 1 or m > 1), ML is considerably more efficient than the summary methods (fig. 3c). The case of (n = 1), at which 0.587, and the case of m = 2 (n = 5,000), at which = 0.487, make an interesting contrast. In the first case all sites are i.i.d., while in the second, there are only two independent genes, each of 5,000 sites in complete linkage. One might expect data of independent sites to be more informative than two loci with correlated sites at the same locus (e.g., Long and Kubatko 2018), but the opposite is true. With n = 1, not all model parameters are identifiable, and this nonidentifiability issue appears to impact species tree estimation as well (Shi and Yang 2018, p. 172). With nm fixed, the smallest error occurs at intermediate values of n and m, around , although performance is similar over a large range of n (fig. 3c).
In table 1, we calculated the species-tree error probability using equations (6) and (8), as well as two pairs of bounds () and () (Theorem 1, Appendix A), for comparison with the simulation results. The asymptotic results are expected to apply when the sequence length n is fixed, whereas the number of loci . Here, m is fixed at 1,000, so that b < 2 for all cases (table 1), and is too small for the asymptotic approximations to be reliable. As a result, equations (6) and (8) are more accurate.
Discussion
Errors of Species Tree Estimation by Different Methods
Under the MSC model, data at different loci are i.i.d., so that the number of loci (m) constitutes the sample size in the statistical model. Thus, we have derived approximations to the error rate for different methods when m increases, with the sequence length n fixed. For large m, the error can be approximated by , where c is a constant. This is seen to apply to all four methods considered in this study (ML, isml = concatenation, and 2-step) (see table 4 for a summary).
Table 4.
Summary of Analytical Approximations to Species-Tree Estimation Error by Different Methods.
Method | n = 1 | ||
---|---|---|---|
ml | eq. 11 | eq. 12 | |
2-step | |||
isml/concatenation |
Note.—For isml/concatenation, , and the variance–covariance matrix at n is (eq. 16). In the case of n = 1, , and 2-step, isml, concatenation, and ml are all equivalent.
The theory for ML in Appendix B applies generally to ML selection of nonnested models, where one model (which may or may not be the true model) fits the data better than the others, as judged by the K–L divergence to the true data-generating model. In particular, the theory applies to conventional phylogenetic reconstruction without the MSC model. For example, figure 5 applies the same prediction to simulation results on four-taxon trees from Yang (1997). Previously, Susko (2011) developed a large-sample approximation to the log-likelihood difference between two trees and to the probability that each tree will be the ML tree in the case of four species without the molecular clock. It was assumed that the internal branch length in the tree is small and approaches 0 at the rate of or faster when the number of sites n increases. In our analysis, we take the conventional approach of fixing the parameters when the data size increases.
Fig. 5.
The probit transform of the phylogenetic reconstruction error, , is a linear function of the square root of the number of sites in the alignment (). Simulation results from Yang (1997, fig. 1A and B) are used in the plot. The trees used in the simulation have four taxa, with branch lengths ((0.5, 0.5):0.1, 0.5, 0.5) for tree A and ((0.5, 0.5):0.1, 0.6, 1.4) for tree B. Data are simulated under the JC+G model (Yang 1994a) and analyzed under both JC and JC+G (Jukes and Cantor 1969; Yang 1994a). Note that in (B), ML under the incorrect model (JC) is more efficient than ML under the correct model (JC+G).
We note that in problems of parameter estimation, the standard error for the parameter estimate or the width of the confidence interval typically decreases at the rate of , so that quadrupling the data size halves the interval. In contrast, the probability of recovering the best-fitting model approaches 1 much faster. As the probit transform of the error decreases linearly with , it will soon reach a point beyond which the precise error probability is of no practical significance: for example, means e = 0.0013, while means . This difference in dynamics between model selection and parameter estimation as the data size grows is consistent with the fact that we tend to obtain extreme support for phylogenies inferred from large data sets (Yang and Zhu 2018).
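The contrast between the two rates can be made concrete with the standard normal distribution. The sketch below assumes a probit-linear form, probit(e) = intercept − slope × √m. Both coefficients are hypothetical placeholders chosen for illustration, not estimates from this study.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def error_from_probit(z):
    """Convert a probit value back to an error probability."""
    return STD_NORMAL.cdf(z)


def error_at(m, slope, intercept=0.0):
    """Error under the assumed form probit(e) = intercept - slope * sqrt(m)."""
    return STD_NORMAL.cdf(intercept - slope * math.sqrt(m))


if __name__ == "__main__":
    print(error_from_probit(-3.0))  # about 0.0013, as quoted in the text
    for m in (100, 400, 1600):
        # Each quadrupling of m halves 1/sqrt(m) but doubles slope * sqrt(m).
        print(m, 1.0 / math.sqrt(m), error_at(m, slope=0.1))
```

Under this form, each quadrupling of m only halves a standard error but doubles the distance of the probit from zero, so the error probability falls off far more steeply.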
Implications of Our Study to Species Tree Methods
Although the species tree problem studied here is the simplest, it has the complexities of the general problem. Furthermore, we have represented all major species tree methods in our analysis. We expect ML to be asymptotically similar to Bayesian inference as both are full-data methods.
We have assumed the JC mutation model and the molecular clock. Our results are thus applicable to shallow species phylogenies and may not apply to distantly related species for which the JC model may be inadequate for multiple-hit correction and the molecular clock may be seriously violated. In the case of three species examined in this paper, concatenation and isml always produce the same species tree estimate. However, in more general settings with four or more species and when the clock is violated and unrooted trees are used, concatenation and isml are known to be different. In particular, concatenation (as well as 2-step) can be inconsistent (Roch and Steel 2015), while isml is a coalescent-aware method and is always consistent.
The isml method considered here is similar to SVDQuartets (Chifman and Kubatko 2014). Both are summary methods based on pooled site-pattern counts. SVDQuartets is sometimes described as a site pattern-based method (e.g., Kubatko 2019). This is not a helpful description. Site-pattern counts for different loci are sufficient statistics under the model and carry the same amount of information as the sequence alignments at the same loci so that it makes no difference whether site patterns or sequences are used. Indeed virtually all methods involving likelihood calculation on sequences operate on site patterns instead of sites. Instead what matters is whether site patterns are pooled across loci. In the original data, the sites of the same locus share the same gene tree and the variation among loci provides information about parameters of the coalescent process such as the ancestral population sizes. Pooling sites across loci means that such information is lost (Shi and Yang 2018). As a result, the pooled site-pattern counts are unable to identify all parameters of the MSC model even if they can identify the species tree topology. Previously, Long and Kubatko (2018) found in simulations that SVDQuartets performed better in data sets of 600 coalescent-independent sites ( in the notation of this paper) than in data of two genes each of 300 bp (), and suggested that this is because “[t]he 600 sites observed from 600 distinct gene trees give independent genealogical information about the species tree, though indirectly, whereas the 300 sites for each of the two genes can give a reasonable indication of the individual gene trees, but still provide only two observed gene genealogies.” Our analysis suggests that this is not a correct interpretation. When the information in the data is used properly (as in the ML method), there is in fact more information in two genes each of 300 bp than in 600 independent sites (fig. 3c).
To understand the issue of parameter unidentifiability and the potential information loss for species tree estimation due to the pooling of sites across loci in SVDQuartets, consider the simple random-effects model:
$$y_{ij} = \mu + a_i + \varepsilon_{ij}, \qquad (20)$$
where the treatment effect $a_i \sim N(0, \sigma_a^2)$ and the error $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$. Parameters in the model include the grand mean μ and the variance components $\sigma_a^2$ and $\sigma_\varepsilon^2$. It is obvious that if there are no replications within treatment (n = 1) or if the observations (yij) are pooled across treatments, the between-treatment variation and within-treatment errors will be confounded so that $\sigma_a^2$ and $\sigma_\varepsilon^2$ will not be identifiable even though μ still is. In species tree estimation, pooling site patterns across loci (as in isml and SVDQuartets) causes some parameters of the MSC model to become unidentifiable even though the species tree still is. This issue of information loss due to averaging over the whole genome may be even more serious for methods designed for data of single nucleotide polymorphisms (SNPs) (Leaché and Oaks 2017), such as snapp (Bryant et al. 2012), because the removal of constant sites in the SNP data causes further loss of information (even if the ascertainment bias is accounted for in the method).
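The confounding can be seen in a small simulation. The sketch below assumes normally distributed treatment effects and errors, an assumption of this illustration, and uses hypothetical names (`var_a`, `var_e`) for the two variance components. Two parameter settings with the same total variance give the same pooled variance, whereas the within-treatment variance, available only with replication, separates them.

```python
import random
import statistics


def simulate(mu, var_a, var_e, m, n, rng):
    """Draw m treatments with n replicates each: y_ij = mu + a_i + e_ij."""
    data = []
    for _ in range(m):
        a = rng.gauss(0.0, var_a ** 0.5)
        data.append([mu + a + rng.gauss(0.0, var_e ** 0.5) for _ in range(n)])
    return data


def pooled_variance(data):
    """Variance of all observations with treatment labels ignored."""
    return statistics.variance([y for row in data for y in row])


def within_variance(data):
    """Average within-treatment variance; requires n >= 2 replicates."""
    return statistics.fmean(statistics.variance(row) for row in data)


if __name__ == "__main__":
    rng = random.Random(1)
    for var_a, var_e in ((1.0, 3.0), (3.0, 1.0)):
        single = simulate(0.0, var_a, var_e, m=5000, n=1, rng=rng)
        paired = simulate(0.0, var_a, var_e, m=5000, n=2, rng=rng)
        print(var_a, var_e, pooled_variance(single), within_variance(paired))
```

In both settings the pooled variance estimates the sum of the two components, so the pooled data cannot tell them apart. With n = 2 the within-treatment variance recovers the error component alone.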
An important difference between isml and SVDQuartets is that isml applies ML to the pooled site-pattern counts, whereas SVDQuartets uses a criterion based on linear invariants to avoid the ML optimization (Xu and Yang 2016). Use of a non-ML criterion is expected to lead to further reduction in efficiency, in addition to information loss due to the pooling of sites across loci (Chou et al. 2015; Xu and Yang 2016; Shi and Yang 2018).
The MSC model analyzed in this paper assumes free recombination among loci and no recombination between sites of the same locus. Data for such analysis are typically loosely linked short genomic segments that are far apart from each other so that recombination within a locus is rare, whereas different loci are nearly independent (e.g., Takahata et al. 1995; Burgess and Yang 2008; Lohse et al. 2016). Both assumptions of free recombination among loci and no recombination within locus are expected to be violated in real data analysis, and the impact of within-locus recombination is of particular concern. The ML method considered in this paper assumes no recombination (with r = 0), whereas isml (and SVDQuartets) assumes free recombination (r = ∞). The relative performance of the methods will depend on the true recombination rate: ML may be expected to perform better than isml if r is close to 0, while isml may be superior if r is large. At very high recombination rates, it may even be possible for ML (assuming r = 0) to be inconsistent since the method is similar to concatenation and merges sites of the same locus with different histories into one sequence. In contrast, isml is consistent for all values of r. Previously, Lanier and Knowles (2012) found in a computer simulation that species-tree estimation was robust to moderate levels of within-locus recombination (see also discussions in Edwards et al. [2016]; Xu and Yang [2016]). It will be interesting to evaluate the relative performance of modern species-tree estimation methods (including isml and SVDQuartets) under realistic recombination rates.
Materials and Methods
Simulation
We use a challenging species tree with parameters , , and (fig. 1a). A C program was written to simulate gene trees and sequence alignments for the case of three species/sequences, under the JC model (Jukes and Cantor 1969) with the clock. To simulate the gene tree and the sequence alignment for each locus, we generate an exponential coalescent waiting time (s1) with mean . If , the gene tree is , and another exponential waiting time s0 is generated with mean to get and t1 = s1. If , the gene tree is one of , chosen at random, and two coalescent waiting times (s1 and s0) are generated with means and , respectively, so that and (fig. 1b). The gene tree and node ages are then used to calculate the site-pattern probabilities for the locus (eq. 3), and the site-pattern counts are generated from multinomial sampling (eq. 4). Each data set consists of m loci with the sequence length of n sites. We use a large number of replicates (typically or $10^8$) so that sampling errors due to a limited number of replicates are not a concern. Species tree estimation by concatenation (=isml) and 2-step is done by counting site patterns.
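The gene-tree step of this procedure can be sketched as follows. This is an illustration in Python, not the authors' C program. The names `tau0`, `tau1`, `theta0`, and `theta1` are assumed labels for the two species divergence times and the two ancestral population size parameters. The sketch returns node ages measured from the present, which may differ from the definitions of t0 and t1 used in the paper, and the parameter values in the usage example are hypothetical, chosen only so that the internal branch is 0.04 in units of θ1/2.

```python
import math
import random


def simulate_gene_tree(tau0, tau1, theta0, theta1, rng):
    """Return (gene tree 1-3, younger node age, older node age) for one locus."""
    # Two lineages in the inner ancestral population: waiting time with mean theta1/2.
    s1 = rng.expovariate(2.0 / theta1)
    if s1 < tau0 - tau1:
        # Coalescence before the root population: gene tree matches the species tree.
        s0 = rng.expovariate(2.0 / theta0)  # mean theta0/2
        return 1, tau1 + s1, tau0 + s0
    # Three lineages enter the root population: first waiting time has mean theta0/6.
    s1 = rng.expovariate(6.0 / theta0)
    s0 = rng.expovariate(2.0 / theta0)
    younger = tau0 + s1
    return rng.choice((1, 2, 3)), younger, younger + s0


def prob_matching_tree(tau0, tau1, theta1):
    """Probability that the gene tree matches the species tree."""
    return 1.0 - (2.0 / 3.0) * math.exp(-2.0 * (tau0 - tau1) / theta1)


if __name__ == "__main__":
    rng = random.Random(1)
    params = dict(tau0=0.012, tau1=0.010, theta0=0.1, theta1=0.1)  # hypothetical
    trees = [simulate_gene_tree(rng=rng, **params)[0] for _ in range(100_000)]
    print(trees.count(1) / len(trees), prob_matching_tree(0.012, 0.010, 0.1))
```

With an internal branch of 0.04 the analytical probability is 1 − (2/3)e^(−0.04) ≈ 0.35947, which agrees with the gene-tree probability 0.3594737 quoted earlier for known gene trees.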
For the ML method (eq. 5), we used the simulation program MCcoal, which is part of the bpp program (Yang 2015), to simulate the gene trees and sequence alignments. The data are then analyzed using the ML program 3s (Yang 2002; Dalquen et al. 2017). The JC model is used to simulate and analyze data.
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Acknowledgments
We thank Bin Wang for discussions and two anonymous reviewers for many insightful comments. This study has been supported by Biotechnology and Biological Sciences Research Council grant (BB/P006493/1) to Z.Y. and a BBSRC equipment grant (BB/R01356X/1). T.Z. is supported by a Natural Science Foundation grant (32070685 and 31671370) and a grant from the Youth Innovation Promotion Association of Chinese Academy of Sciences (201901).
Data Availability
The C program for simulating under the MSC model with 3 species and 3 sequences is available from the authors upon request.
Appendix A. Proof of Theorem 1
(a) Define the random variable:
(A1) |
where and , with
(A2) |
Here, we treat and as normal variables, according to the central limit theorem as . As and both y1 and y2 are normal variables, they are independent. Then,
(A3) |
where , and is the probability density function (PDF) for , whereas is the PDF for . The last integral has been studied by Yang and Rodríguez (2013, SI) in a different context and can be written as:
(A4) |
or, by letting , with $d\theta = -\frac{1}{(t-a)^2+1}\,dt$, as:
(A5) |
Equations (A4) and (A5) can be calculated using Gaussian quadrature and match direct calculations using the CDF for the bivariate normal distribution for . When , we have b = 0 and:
(A6) |
In the symmetrical case of , and (with , b = 0), this gives , as expected. In this case the three variables and have the same probability of being the greatest so that the error is .
To avoid numerical integration, we note that , and is a folded normal variable with mean and variance:
(A7) |
Thus,
(A8) |
We have,
(A9) |
Collecting all terms in equation (A8), we get
(A10) |
If we assume that y is approximately normally distributed, as in Zharkikh and Li (1992) and Yang (1996), then equation (6) follows. Note that equation (6) can also be written as . Because has a folded normal distribution and is not a normal variable, the error of approximation of equation (6) does not approach zero when . For instance, in the symmetrical case (, and ), equation (6) gives , not . This level of accuracy is acceptable for our calculations for finite m in this paper, as the precise value of the error is unimportant if the error is nearly zero, but equation (6) may not give the correct asymptotic error rate when (fig. 6, a = 10).
Fig. 6.
Probit of error, , plotted against b for different values of a. Six methods for calculating ζ are shown. The first five are, from top to bottom, (brown dashed line), (orange dotted, with k = 2 in eq. A27), Exact (black solid line), (blue dotted), and (purple dashed). Equation (6) (black dotted) is included as well.
(b) To study the asymptotic behavior of the error probability ζ when , we derive bounds on ζ. From equation (A3),
(A11) |
where the first integral is , with to be the distance from the origin (0, 0) to the line (fig. 7), and the second integral is:
(A12) |
Fig. 7.
The areas of integration for integrals in Equations (A11) and (A12). The two angles are and , k > 1, with . The integral over the half-plane is , whereas the integral over the half-plane is . The integral over the sector (the shaded area) is . This is smaller than and greater than the integral over the area shaded with the brick pattern: these give the bounds () in Appendix A. The purple dashed lines are and . They cross the blue lines at A and , with the length of the line segment OA to be . Note that the integral over the circle is .
By considering the area of integration (fig. 7), it is obvious that:
(A13) |
where the equality holds when b = 0. Let,
As , we have:
(A14) |
or
(A15) |
as in equation (7). The equality in the lower bounds is achieved at . Note that the bounds apply to all a > 0 and b > 0. We use the bounds () in Theorem 1 and in the calculation of table 1. The width of the interval is , so that using any value inside the interval as the estimate will give an error of approximation that is smaller than the error probability ζ.
Note that the bounds are also given by the definition , since
(A16) |
with .
Next we consider the upper bound in equation (A15) when h or b is large. Note that:
(A17) |
where 1. For large h,
(A18) |
Thus, for large h, is bounded by:
(A19) |
or
(A20) |
Let such that for ; in other words, ɛ is the offset at the probit level to reduce the probability by a fraction. From equation (A20),
(A21) |
Thus,
(A22) |
which gives or
(A23) |
In particular, for , we have:
(A24) |
Thus, for large h, we have:
(A25) |
as in equation (7). It may be noteworthy that for large h, a very small change at the probit level, of about , changes the probability by a factor of 2.
A tighter lower bound for 2 A than zero of equation (A13) is:
(A26) |
where with k > 1 (fig. 7). Thus, we have a tighter pair of bounds on ζ,
(A27) |
where k > 1. We write this pair of bounds as . We have . These bounds, as well as the exact value and equation (6), are plotted against b in figure 6 for and 10.
Appendix B. The asymptotics of ML species tree estimation
The proof below borrows heavily from White (1982), Dawid (2011), and Yang and Zhu (2018). Let be the three species trees with parameters . Note that S1 is the true model, while S2 and S3 are mis-specified models. Let the data at m loci be . The log-likelihood function is . We also define for one data point (that is, site-pattern counts at any single locus), . When the number of loci , the MLE . We assume that both and are inner points in the parameter space. Whether is inside the parameter space or at its boundary should not affect the asymptotic rate of convergence. Here, for the true species tree S1 is the true parameter value, whereas for S2 (as well as for S3) is the pseudotrue parameter value, which minimizes the Kullback–Leibler distance from the misspecified model S2 to the true model S1.
(A28) |
where the expectation is over the true distribution . D13 is defined similarly, with .
We consider the log-likelihood ratio, , given the data (x) for any of the species tree j. We drop the subscript j for clarity. As in White (1982) and Dawid (2011), we define two matrices:
(A29) |
where the superscript T stands for transpose and where the expectation is over the true distribution, and and are the first and second derivatives with respect to .
Apply Taylor expansion to the log likelihood around the MLE :
(A30) |
where both the gradient and the Hessian are evaluated at the MLE ( ), with . Setting , we have:
(A31) |
Apply Taylor expansion to the derivative around the MLE and let , and we have:
(A32) |
and
(A33) |
Each of and is a sum of m i.i.d. elements. When , with (eq. A29). Furthermore,
(A34) |
where (eq. A29). Thus,
(A35) |
Thus, . Equation (A31) becomes:
(A36) |
Equations (A29–A36) apply to all three species trees. In the case of S1 (the true model), , the Fisher information matrix, and . For S2 or S3, is a quadratic form of normal variates and is a mixture of noncentral variables with mean and variance , both of O(1).
Now consider using , j = 1, 2, 3, to compare species trees S1, S2, and S3. We have:
(A37) |
Thus, when the number of loci have means and variance/covariance matrix , where is O(1) and independent of m. The error of the ML method, , is then given by Theorem 1 as equation (11).
References
- Angelis K, dos Reis M. 2015. The impact of ancestral population size and incomplete lineage sorting on Bayesian estimation of species divergence times. Curr Zool. 61(5):874–885.
- Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. 2012. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 29(8):1917–1932.
- Burgess R, Yang Z. 2008. Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Mol Biol Evol. 25(9):1979–1994.
- Chifman J, Kubatko L. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317–3324.
- Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, Warnow T. 2015. A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 16(Suppl 10):S2.
- Dalquen D, Zhu T, Yang Z. 2017. Maximum likelihood implementation of an isolation-with-migration model for three species. Syst Biol. 66(3):379–398.
- Dawid A. 2011. Posterior model probabilities. In: Bandyopadhyay PS, Forster M, editors. Philosophy of statistics. New York: Elsevier. p. 607–630.
- Degnan JH, Rosenberg NA. 2006. Discordance of species trees with their most likely gene trees. PLoS Genet. 2(5):e68.
- Degnan JH, Salter LA. 2005. Gene tree distributions under the coalescent process. Evolution 59(1):24–37.
- Edwards SV. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63(1):1–19.
- Edwards SV, Xi Z, Janke A, Faircloth BC, McCormack JE, Glenn TC, Zhong B, Wu S, Lemmon EM, Lemmon AR, et al. 2016. Implementing and testing the multispecies coalescent model a valuable paradigm for phylogenomics. Mol Phylogenet Evol. 94(Pt A):447–462.
- Fleiss JL, Levin B, Paik MC. 2003. Statistical methods for rates and proportions. 3rd ed. New York: John Wiley and Sons.
- Heled J, Drummond AJ. 2010. Bayesian inference of species trees from multilocus data. Mol Biol Evol. 27(3):570–580.
- Hudson R. 1983. Testing the constant-rate neutral allele model with protein sequence data. Evolution 37(1):203–217.
- Jukes T, Cantor C. 1969. Evolution of protein molecules. In: Munro H, editor. Mammalian protein metabolism. New York: Academic Press. p. 21–123.
- Kubatko L. 2019. The multispecies coalescent. In: Balding D, Moltke I, Marioni J, editors. Handbook of statistical genomics. 4th ed. New York: Wiley. p. 219–245.
- Lanier HC, Knowles LL. 2012. Is recombination a problem for species-tree analyses? Syst Biol. 61(4):691–701.
- Leaché AD, Oaks J. 2017. The utility of single nucleotide polymorphism (SNP) data in phylogenetics. Annu Rev Ecol Evol Syst. 48(1):69–84.
- Leaché AD, Rannala B. 2011. The accuracy of species tree estimation under simulation: a comparison of methods. Syst Biol. 60(2):126–137.
- Liu L, Pearl DK. 2007. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 56(3):504–514.
- Liu L, Yu L, Edwards SV. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 10(1):302.
- Liu L, Yu L, Pearl DK, Edwards SV. 2009. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 58(5):468–477.
- Lohse K, Chmelik M, Martin SH, Barton NH. 2016. Efficient strategies for calculating blockwise likelihoods under the coalescent. Genetics 202(2):775–786.
- Long C, Kubatko L. 2018. The effect of gene flow on coalescent-based species-tree inference. Syst Biol. 67(5):770–785.
- Maddison W. 1997. Gene trees in species trees. Syst Biol. 46(3):523–536.
- Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30(17):i541–548.
- Nichols R. 2001. Gene trees and species trees are not the same. Trends Ecol Evol. 16(7):358–364.
- Ogilvie HA, Bouckaert RR, Drummond AJ. 2017. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Mol Biol Evol. 34(8):2101–2114.
- Pamilo P, Nei M. 1988. Relationships between gene trees and species trees. Mol Biol Evol. 5(5):568–583.
- Rannala B, Edwards S, Leaché AD, Yang Z. 2020. The multispecies coalescent model and species tree inference. In: Scornavacca C, Delsuc F, Galtier N, editors. Phylogenetics in the genomic era. Book Section 3.3. No Commercial Publisher. p. 1–20.
- Rannala B, Yang Z. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164(4):1645–1656.
- Rannala B, Yang Z. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Syst Biol. 66(5):823–842.
- Roch S, Steel M. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 100:56–62.
- Shi C, Yang Z. 2018. Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of Gibbons. Mol Biol Evol. 35(1):159–179.
- Susko E. 2011. Large sample approximations of probabilities of correct evolutionary tree estimation and biases of maximum likelihood estimation. Stat Appl Genet Mol Biol. 10(1):10.
- Szöllősi GJ, Tannier E, Daubin V, Boussau B. 2015. The inference of gene trees with species trees. Syst Biol. 64(1):e42–e62.
- Takahata N, Satta Y, Klein J. 1995. Divergence time and population size in the lineage leading to modern humans. Theor Popul Biol. 48(2):198–221.
- Tian Y, Kubatko LS. 2016. Distribution of coalescent histories under the coalescent model with gene flow. Mol Phylogenet Evol. 105:177–192.
- Tiley GP, Poelstra JP, dos Reis M, Yang Z, Yoder AD. 2020. Molecular clocks without rocks: new solutions for old problems. Trends Genet. 36(11):845–856.
- White H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50(1):1–25.
- Wu Y. 2012. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66(3):763–775.
- Xu B, Yang Z. 2016. Challenges in species tree estimation under the multispecies coalescent model. Genetics 204(4):1353–1368.
- Yang Z. 1994a. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 39(3):306–314.
- Yang Z. 1994b. Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Syst Biol. 43(3):329–342.
- Yang Z. 1996. Phylogenetic analysis using parsimony and likelihood methods. J Mol Evol. 42(2):294–307.
- Yang Z. 1997. How often do wrong models produce better phylogenies? Mol Biol Evol. 14(1):105–108.
- Yang Z. 2000. Complexity of the simplest phylogenetic estimation problem. Proc R Soc Lond B. 267(1439):109–116.
- Yang Z. 2002. Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci. Genetics 162(4):1811–1823.
- Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24(8):1586–1591.
- Yang Z. 2014. Molecular evolution: a statistical approach. Oxford (England): Oxford University Press.
- Yang Z. 2015. The BPP program for species tree estimation and species delimitation. Curr Zool. 61(5):854–865.
- Yang Z, Rannala B. 2014. Unguided species delimitation using DNA sequence data from multiple loci. Mol Biol Evol. 31(12):3125–3135.
- Yang Z, Rodríguez CE. 2013. Searching for efficient Markov chain Monte Carlo proposal kernels. Proc Natl Acad Sci USA. 110(48):19307–19312.
- Yang Z, Zhu T. 2018. Bayesian selection of misspecified models is overconfident and may cause spurious posterior probabilities for phylogenetic trees. Proc Natl Acad Sci USA. 115(8):1854–1859.
- Zharkikh A, Li W-H. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol Biol Evol. 9:1119–1147.
- Zhu T, Yang Z. 2012. Maximum likelihood implementation of an isolation-with-migration model with three species for testing speciation with gene flow. Mol Biol Evol. 29(10):3131–3142.