Molecular Biology and Evolution. 2021 Jan 25;38(9):3993–4009. doi: 10.1093/molbev/msab009

Complexity of the simplest species tree problem

Tianqi Zhu 1,2, Ziheng Yang 1,3,
Editor: Bing Su
PMCID: PMC8382899  PMID: 33492385

Abstract

The multispecies coalescent model provides a natural framework for species tree estimation accounting for gene-tree conflicts. Although a number of species tree methods under the multispecies coalescent have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here, we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood, and maximum likelihood. We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case, major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes, whereas these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.

Keywords: concatenation, efficiency, molecular clock, MSC, multispecies coalescent, species tree

Introduction

The multispecies coalescent (MSC) model (Rannala and Yang 2003) combines the phylogenetic process of species divergences with the population genetic process of coalescent and naturally accommodates “delayed coalescence” (also known as “incomplete lineage sorting,” Maddison 1997), the phenomenon in which gene sequences fail to coalesce in their most recent common ancestor but do so only in more ancient ancestors. Delayed coalescence causes the gene tree for a gene or genomic region to differ from the species tree and is the most important factor for gene-tree–species-tree discordance (Maddison 1997; Nichols 2001; Szöllősi et al. 2015). The MSC provides a natural framework for estimating species trees accounting for genealogical heterogeneity among genes or across the genome (Edwards 2009; Xu and Yang 2016; Kubatko 2019; Rannala et al. 2020).

Two lines of research into the MSC have provided the foundation for species tree methods. The first concerns the probabilities of different gene tree topologies (Hudson 1983; Pamilo and Nei 1988) and algorithms for their efficient calculation given the species tree (Degnan and Salter 2005; Degnan and Rosenberg 2006). The gene tree distribution can be used in the two-step method of species tree estimation, by inferring gene trees for the individual loci and then applying maximum likelihood (ML) to counts of gene tree topologies (as in stells, Wu 2012). Nevertheless, widely used two-step methods, including astral (Mirarab et al. 2014) and mp-est (Liu et al. 2010), are simpler, and estimate species trees for species triplets (assuming the molecular clock) or quartets (without the clock) and then assemble the subtrees to produce a species-tree estimate for all species. Studies of gene-tree probabilities led to the discovery of the “anomaly zone,” the region of the parameter space in which the most probable gene tree has a different topology from the species tree (Degnan and Salter 2005; Degnan and Rosenberg 2006). In the anomaly zone, the two-step method, which uses the most common gene tree as the species tree estimate, will be inconsistent.

The second line of research into MSC is the development of the joint probability distribution of the gene tree and coalescent times (Rannala and Yang 2003). This forms the basis for exact methods of inference, including ML (Yang 2002; Dalquen et al. 2017) and Bayesian methods (Liu and Pearl 2007; Heled and Drummond 2010; Yang and Rannala 2014; Ogilvie et al. 2017; Rannala and Yang 2017). Although heuristic methods use summaries of the data, exact methods use the multilocus sequence alignments directly and naturally accommodate phylogenetic reconstruction errors and uncertainties (Xu and Yang 2016; Kubatko 2019; Rannala et al. 2020).

Simulation has been used to examine the performance of different species-tree methods (e.g., Leaché and Rannala 2011; Mirarab et al. 2014; Chou et al. 2015; Xu and Yang 2016). A limitation of simulation is that it can examine only a small portion of the parameter space and the results often have limited applicability. Analytical results on the efficiency of different methods have been lacking. Here, we analyze species tree estimation under the MSC in the case of three species, with one sequence from each species per locus. We focus on closely related species and assume the JC mutation model (Jukes and Cantor 1969) and the molecular clock. We are in particular interested in the efficiency of the various methods, measured by the probability of recovering the correct species tree.

We consider four inference methods: 1) ML (a full likelihood method under the MSC applied to the multilocus sequence alignments), 2) 2-step (or majority-vote), 3) concatenation (concat), and 4) independent-sites ML (isml, also known as coalescent-aware concatenation or concat) (Xu and Yang 2016). ML is the full-likelihood method and calculates the likelihood function using the multilocus sequence alignments or a sufficient summary. The 2-step method estimates the gene tree at each locus and then uses the most common gene tree as the species tree estimate. It does not account for the uncertainties in the estimated gene trees. For the case of three species considered here, 2-step is equivalent to the maximum pseudolikelihood method (mp-est) (Liu et al. 2010). Concatenation applies ML to the concatenated sequences, assuming that the same tree underlies all sites in the super alignment. In the case considered here, concatenation is equivalent to steac (Liu et al. 2009), which uses average coalescent times over loci as data to infer a gene tree, which is the species tree estimate. Isml (or concat) estimates the species tree by ML under the assumption that all sites, both from the same locus and from different loci, have independent gene trees (Xu and Yang 2016). This was suggested as an improvement to SVDQuartets of Chifman and Kubatko (2014). All four methods considered here use ML, but the likelihood function is applied to different summaries of the same data. Here, we refer to the full-likelihood or full-data method as the ML method, whereas all other methods (2-step, concatenation, and isml) are considered heuristic summary methods: 2-step uses the (estimated) gene tree topologies, whereas concatenation and isml use the site-pattern counts pooled across loci. We derive approximations to the error rate of species tree estimation by the different methods and assess their accuracy. 
We use the theory to characterize the differences in the use of information in the data by different methods.

Results

Multispecies Coalescent in the Case of Three Species

For three species A, B, and C, there are three possible species trees: S1=((AB)C), S2=((BC)A), and S3=((CA)B), each with two divergence times (τ0 and τ1) and two population sizes (θ0 and θ1) (fig. 1a). Both τs and θs are measured by the expected number of mutations per site. For each species, the population size parameter is θ=4Nμ, where N is the (effective) population size and μ is the mutation rate per site per generation. We consider only one sequence from each species, so that θs for the modern species are not considered. The parameters have different interpretations in different species trees: in S1, the two ancestral species are AB and ABC so the parameters are Θ1 = {τ0, τ1, θ0, θ1} = {τABC, τAB, θABC, θAB}.

Fig. 1.

(a) The three species trees (S1, S2, S3) for three species (A, B, C) and the parameters in each MSC model. (b) The possible gene trees with coalescent times (t0, t1) for a locus with three sequences (a, b, c) given the species tree S1. The probabilities for the gene trees are shown above them, where $\phi = e^{-\frac{2}{\theta_1}(\tau_0-\tau_1)}$ is the probability that a and b do not coalesce in population AB or over the time interval (τ1, τ0). Note that if the species tree is S2 (or S3), it will be possible for sequences b and c (or c and a) to coalesce in the time interval (τ1, τ0).

At each locus, three sequences (a, b, and c) are sampled, one from each species. They are related through a gene tree. The three possible gene trees are G1=((ab)c),G2=((bc)a), and G3=((ca)b), with probabilities:

$$P(G_1|S_1,\Theta_1)=1-\tfrac{2}{3}\phi,\qquad P(G_2|S_1,\Theta_1)=P(G_3|S_1,\Theta_1)=\tfrac{1}{3}\phi, \tag{1}$$

where $\phi = e^{-2(\tau_{ABC}-\tau_{AB})/\theta_{AB}}$ is the probability that sequences a and b do not coalesce in population AB so that all three sequences enter the ancestor ABC and the three gene trees occur with equal probability (fig. 1b) (Hudson 1983). Here, $2(\tau_{ABC}-\tau_{AB})/\theta_{AB}$ is known as the internal branch length in coalescent units, as the average coalescent time in population AB is $2N_{AB}$ generations or $\theta_{AB}/2$ mutations per site.
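As a quick numerical check of equation (1), the gene-tree probabilities can be evaluated directly. The sketch below is illustrative only: the function name `gene_tree_probs` is a hypothetical helper, and the parameter values are the ones used in the simulations of this paper (τ0 = 0.02, τ1 = 0.019, θ1 = 0.05).

```python
import math

def gene_tree_probs(tau_abc, tau_ab, theta_ab):
    """Return (P(G1), P(G2), P(G3)) given species tree S1, following eq. (1)."""
    # phi: probability that a and b fail to coalesce in population AB
    phi = math.exp(-2.0 * (tau_abc - tau_ab) / theta_ab)
    return 1.0 - 2.0 * phi / 3.0, phi / 3.0, phi / 3.0

probs = gene_tree_probs(0.02, 0.019, 0.05)
print(probs)
```

With these values the internal branch is short (0.04 coalescent units), so the matching gene tree G1 is only slightly more probable than each mismatching tree.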

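For locus i, let $t_i=\{t_{i0},t_{i1}\}$ be the coalescent times (node ages) on the gene tree (fig. 1b). The joint MSC density for the gene tree and coalescent times given species tree S1 and parameters Θ1 is then:

$$\begin{aligned}
f(G_{1a},t_i|S_1,\Theta_1)&=\frac{2}{\theta_1}e^{-\frac{2}{\theta_1}(t_{i1}-\tau_1)}\cdot\frac{2}{\theta_0}e^{-\frac{2}{\theta_0}(t_{i0}-\tau_0)}, && \tau_1<t_{i1}<\tau_0,\ t_{i0}>\tau_0,\\
f(G_k,t_i|S_1,\Theta_1)&=e^{-\frac{2}{\theta_1}(\tau_0-\tau_1)}\times\frac{2}{\theta_0}\,\frac{2}{\theta_0}\,e^{-\frac{6}{\theta_0}(t_{i1}-\tau_0)-\frac{2}{\theta_0}(t_{i0}-t_{i1})}, && t_{i1}>\tau_0,\ t_{i0}>t_{i1},
\end{aligned} \tag{2}$$

for k = 1b, 2, 3 (Takahata et al. 1995; Yang 2002). The probability densities for S2 and S3 are given similarly.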
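The density of equation (2) corresponds to a simple generative process, which can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: `sample_gene_tree` is a hypothetical helper that draws one gene tree with coalescent times under species tree S1, using exponential waiting times with rate 2/θ per pair of lineages.

```python
import math
import random

def sample_gene_tree(tau0, tau1, theta0, theta1, rng=random):
    """Draw (label, t0, t1) for one locus under species tree S1 = ((AB)C)."""
    # a and b may coalesce in population AB at rate 2/theta1
    wait = rng.expovariate(2.0 / theta1)
    if tau1 + wait < tau0:
        t1 = tau1 + wait
        # the ancestor of (a, b) and c coalesce in ABC at rate 2/theta0
        t0 = tau0 + rng.expovariate(2.0 / theta0)
        return "G1a", t0, t1
    # three lineages enter ABC: first coalescence at total rate 3 * 2/theta0,
    # and each of the three pairs is equally likely to coalesce first
    t1 = tau0 + rng.expovariate(6.0 / theta0)
    t0 = t1 + rng.expovariate(2.0 / theta0)
    return rng.choice(["G1b", "G2", "G3"]), t0, t1

rng = random.Random(1)
draws = [sample_gene_tree(0.02, 0.019, 0.01, 0.05, rng) for _ in range(20000)]
frac_g1a = sum(label == "G1a" for label, _, _ in draws) / len(draws)
print(frac_g1a)  # should be close to 1 - phi
```

The fraction of G1a draws estimates 1 − ϕ, the probability that a and b coalesce in population AB.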

The data consist of sequence alignments at m loci. Under the JC mutation model, the data at locus i can be summarized as counts of five site patterns: xxx, xxy, yxx, xyx, and xyz, where x, y, z are any three distinct nucleotides. Let those counts be $x_i=\{x_{i0},x_{i1},x_{i2},x_{i3},x_{i4}\}$, with $\sum_{j=0}^{4}x_{ij}=n$ being the number of sites (sequence length) at each locus. Let $f_{ij}=x_{ij}/n$ be the frequencies. Let the data at all m loci be $x=\{x_i\}$.

Given the gene tree and coalescent times at locus i, the probability of the sequence data, f(xi|Gi,ti), is given by the multinomial distribution for the five site patterns. For example, given gene tree G1 with node ages ti0 and ti1 (fig. 1b), the site-pattern probabilities, pi={pi0,pi1,pi2,pi3,pi4}, are as follows:

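$$\begin{aligned}
p_{i0}&=P(xxx|G_1,t_i)=\tfrac{1}{16}(1+3v^2+6u+6uv),\\
p_{i1}&=P(xxy|G_1,t_i)=\tfrac{1}{16}(3+9v^2-6u-6uv),\\
p_{i2}&=P(yxx|G_1,t_i)=\tfrac{1}{16}(3-3v^2+6u-6uv),\\
p_{i3}&=P(xyx|G_1,t_i)=p_{i2},\\
p_{i4}&=P(xyz|G_1,t_i)=\tfrac{1}{16}(6-6v^2-12u+12uv),
\end{aligned} \tag{3}$$

where $u=e^{-8t_{i0}/3}$ and $v=e^{-4t_{i1}/3}$ (Yang 1994b). Note that $p_{i1}>p_{i2}=p_{i3}$ as $t_{i0}>t_{i1}$. The probabilities for gene trees G2 or G3 are given by symmetry. Then the sequence data or the five site-pattern counts at the locus have the multinomial probabilities:

$$\begin{aligned}
f(x_i|G_1,t_i)&=p_{i0}^{x_{i0}}\,p_{i1}^{x_{i1}}\,p_{i2}^{x_{i2}+x_{i3}}\,p_{i4}^{x_{i4}},\\
f(x_i|G_2,t_i)&=p_{i0}^{x_{i0}}\,p_{i1}^{x_{i2}}\,p_{i2}^{x_{i3}+x_{i1}}\,p_{i4}^{x_{i4}},\\
f(x_i|G_3,t_i)&=p_{i0}^{x_{i0}}\,p_{i1}^{x_{i3}}\,p_{i2}^{x_{i1}+x_{i2}}\,p_{i4}^{x_{i4}}.
\end{aligned} \tag{4}$$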
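Equations (3) and (4) translate directly into code. The following is a minimal sketch under the JC model and the clock; the helper names `site_pattern_probs` and `log_lik` and the example counts are assumptions made for illustration, and the multinomial coefficient is omitted as in equation (4).

```python
import math

def site_pattern_probs(t0, t1):
    """Probabilities of patterns (xxx, xxy, yxx, xyx, xyz) on gene tree G1 (eq. 3)."""
    u = math.exp(-8.0 * t0 / 3.0)   # t0: age of the root of the gene tree
    v = math.exp(-4.0 * t1 / 3.0)   # t1: age of the younger node
    p0 = (1 + 3 * v * v + 6 * u + 6 * u * v) / 16
    p1 = (3 + 9 * v * v - 6 * u - 6 * u * v) / 16
    p2 = (3 - 3 * v * v + 6 * u - 6 * u * v) / 16
    p4 = (6 - 6 * v * v - 12 * u + 12 * u * v) / 16
    return [p0, p1, p2, p2, p4]

def log_lik(x, t0, t1, k):
    """Log of eq. (4) for gene tree Gk (k = 1, 2, 3) and counts x = (x0, ..., x4)."""
    p0, p1, p2, _, p4 = site_pattern_probs(t0, t1)
    match = x[k]                              # pattern that supports Gk
    mismatch = x[1] + x[2] + x[3] - match     # the other two informative patterns
    return (x[0] * math.log(p0) + match * math.log(p1)
            + mismatch * math.log(p2) + x[4] * math.log(p4))

x = (900, 50, 20, 20, 10)  # made-up site-pattern counts for illustration
print([log_lik(x, 0.03, 0.02, k) for k in (1, 2, 3)])
```

For fixed node ages, the gene tree supported by the largest of the three informative counts has the highest likelihood, because $p_{i1}>p_{i2}$.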

The ML Method of Species Tree Estimation

The log-likelihood function for species tree S1 with parameters Θ1 is given by summing over the gene trees and integrating over the coalescent times:

$$\ell_1(\Theta_1)=\sum_{i=1}^{m}\log f(x_i|S_1,\Theta_1)=\sum_{i=1}^{m}\log\left\{\sum_{G_i}\int f(G_i,t_i|S_1,\Theta_1)\,f(x_i|G_i,t_i)\,dt_i\right\}, \tag{5}$$

where $f(G_i,t_i|S_1,\Theta_1)$ is the MSC density for the gene tree and coalescent times at locus i (eq. 2), and $f(x_i|G_i,t_i)$ is the probability of the sequence data at locus i given the gene tree (eq. 4). The log-likelihood functions, $\ell_2(\Theta_2)$ and $\ell_3(\Theta_3)$, for S2 (with parameters Θ2) and S3 (with Θ3) are defined similarly.

Maximizing the log-likelihood function (eq. 5) with respect to the parameters gives a maximized log-likelihood value for each species tree, and the species tree with the highest value is the ML species tree. This is not analytically tractable. The program 3s implements the method by explicitly summing over the gene trees (Gi) and by using Gaussian quadrature to calculate the 2D integrals over ti (eq. 5) (Yang 2002; Zhu and Yang 2012; Dalquen et al. 2017). This is used in simulations.

We present two theorems for approximating the error in species tree estimation.

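Theorem 1

(a) Suppose $z_i=(z_{i1},z_{i2},z_{i3})^T$, $i=1,\ldots,m$, are an independent and identically distributed (i.i.d.) sample of size m from a distribution with means $\mu=(\mu_1,\mu_2,\mu_2)^T$, with $\Delta\mu=\mu_1-\mu_2>0$, and variances $\Sigma=\{\sigma_{jk}\}$, where $\sigma_{11}=\sigma_1^2$, $\sigma_{12}=\sigma_{13}=\rho_{12}\sigma_1\sigma_2$, $\sigma_{22}=\sigma_{33}=\sigma_2^2$, and $\sigma_{23}=\rho_{23}\sigma_2^2$. Let $\bar z=\{\bar z_1,\bar z_2,\bar z_3\}^T$ be the sample means, with $\bar z_j=\frac{1}{m}\sum_{i=1}^{m}z_{ij}$, j = 1, 2, 3. For large m, $\bar z\sim N_3(\mu,\frac{1}{m}\Sigma)$. Let $\zeta=P\{\bar z_1<\max(\bar z_2,\bar z_3)\}$. Then

$$\zeta\approx\Phi\left(\frac{-\Delta\mu\sqrt{m}+\sqrt{\frac{1}{\pi}(\sigma_{22}-\sigma_{23})}}{\sqrt{\sigma_1^2-2\sigma_{12}+\sigma_2^2-\frac{1}{\pi}(\sigma_{22}-\sigma_{23})}}\right)\equiv\zeta_N(m,\Delta\mu,\sigma_1^2,\sigma_2^2,\sigma_{12},\sigma_{23}), \tag{6}$$

where Φ is the cumulative distribution function (CDF) for the normal distribution N(0,1). We also write $\zeta_N$ as $\zeta_N(m,\Delta\mu,\Sigma)$.

(b) Let $a=s_2/s_1$ and $b=\Delta\mu\sqrt{m}/s_1$, with $s_1^2=\sigma_1^2-2\sigma_{12}+\sigma_2^2-\frac{1}{2}(\sigma_{22}-\sigma_{23})$ and $s_2^2=\frac{1}{2}(\sigma_{22}-\sigma_{23})$. Then ζ is bounded by:

$$\Phi(-h)\left(1+\frac{2}{\pi}\tan^{-1}a\right)\le\zeta<2\Phi(-h)=\Phi\left(-h+\frac{1}{h}\log 2+o\left(\frac{1}{h}\right)\right), \tag{7}$$

where $h=b/\sqrt{1+a^2}$. The equality for the lower bound holds when h = b = 0. We write those bounds as $\zeta_{L1}\le\zeta<\zeta_{U1}$, so that $\Phi(-h)\le\zeta_{L1}\le\zeta<2\Phi(-h)$.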

Proof.

A proof is given in Appendix A, in which we discuss alternative approximations and also give a tighter pair of bounds (ζL2, ζU2) in equation (A27), with ζL1 < ζL2 < ζ < ζU2 < ζU1. □
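The approximation of equation (6) and the bounds of equation (7) are easy to evaluate numerically. The sketch below is illustrative: the function names `zeta_n` and `zeta_bounds` are hypothetical, and the example uses multinomial variances ($\sigma_{jj}=q_j(1-q_j)$, $\sigma_{jk}=-q_jq_k$) with the true gene-tree probabilities q1 = 0.3594737 and q2 = 0.3202631 and m = 1,000 loci, for which table 1 reports the bounds (0.087, 0.132).

```python
import math

def Phi(x):
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def zeta_n(m, dmu, s11, s22, s12, s23):
    """Normal approximation of eq. (6) to P{z1 < max(z2, z3)}."""
    num = -dmu * math.sqrt(m) + math.sqrt((s22 - s23) / math.pi)
    den = math.sqrt(s11 - 2 * s12 + s22 - (s22 - s23) / math.pi)
    return Phi(num / den)

def zeta_bounds(m, dmu, s11, s22, s12, s23):
    """Lower and upper bounds of eq. (7)."""
    s1 = math.sqrt(s11 - 2 * s12 + s22 - 0.5 * (s22 - s23))
    s2 = math.sqrt(0.5 * (s22 - s23))
    a = s2 / s1
    b = dmu * math.sqrt(m) / s1
    h = b / math.sqrt(1 + a * a)
    lower = Phi(-h) * (1 + 2 / math.pi * math.atan(a))
    upper = 2 * Phi(-h)
    return lower, upper

q1, q2 = 0.3594737, 0.3202631
args = (1000, q1 - q2, q1 * (1 - q1), q2 * (1 - q2), -q1 * q2, -q2 * q2)
print(zeta_n(*args), zeta_bounds(*args))
```

In this example the approximation of equation (6) falls between the two bounds, as expected.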

In this paper, ζ represents the error probability of species tree estimation. Thus, the bounds $\Phi(-h)\le\zeta<2\Phi(-h)$ suggest that when $m\to\infty$, the probit transform of the species-tree error probability, $\Phi^{-1}(\zeta)$, where $\Phi^{-1}$ is the inverse CDF of N(0,1), decreases linearly with $\sqrt{m}$. For practical calculations for finite m in this paper, equation (6) is more accurate (see Appendix A) and will be used later.

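Corollary 2. Let (y0, y1, y2, y3) be random variables from the multinomial distribution MN(m, q0, q1, q2, q3), with $q_0=1-q_1-q_2-q_3$, $q_1>q_2=q_3$, and $\Delta q=q_1-q_2>0$. Then $P\{y_1<\max(y_2,y_3)\}$ can be approximated by:

$$\zeta(m,q_1,q_2)=\Phi\left(\frac{-\Delta q\sqrt{m}+\sqrt{q_2/\pi}}{\sqrt{q_1+q_2-(\Delta q)^2-q_2/\pi}}\right), \tag{8}$$

$$\zeta_{ZLY}(m,q_1,q_2)=\Phi\left(\frac{-\Delta q\sqrt{m-\frac{1}{\Delta q}}+\sqrt{q_2/\pi}}{\sqrt{q_1+q_2-q_2/\pi}}\right). \tag{9}$$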

Proof.

Let $\bar z_j=y_j/m$, j = 1, 2, 3, be the observed frequencies. We have $\sigma_{jj}=q_j(1-q_j)$ and $\sigma_{jk}=-q_jq_k$ for $j\ne k$. Then equation (8) follows from equation (6) in Theorem 1. The form ζZLY, an alternative to equation (8), is from Yang (1996, eq. 3), based on Zharkikh and Li (1992, eq. 20). This applies the term $1/\Delta q$ to correct for discontinuity (Fleiss et al. 2003) and ignores correlations between y1, y2, and y3 as well as some terms of small probabilities. The discontinuity correction does not appear to be useful. If $m\gg 1/\Delta q$, both forms, with and without the discontinuity correction, are very close. □
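The two approximations of Corollary 2 can be compared numerically. The following sketch is illustrative (the function names are hypothetical); it uses the estimated gene-tree probabilities at n = 1,000 from table 1 (g1 = 0.33273, g2 = 0.30811) with m = 1,000 loci, and returns `None` for ζZLY when $m\le 1/\Delta q$, where the correction term is undefined.

```python
import math

def Phi(x):
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def zeta(m, q1, q2):
    """Approximation of eq. (8) to P{y1 < max(y2, y3)}."""
    dq = q1 - q2
    num = -dq * math.sqrt(m) + math.sqrt(q2 / math.pi)
    den = math.sqrt(q1 + q2 - dq * dq - q2 / math.pi)
    return Phi(num / den)

def zeta_zly(m, q1, q2):
    """Approximation of eq. (9); None if m is too small for the correction."""
    dq = q1 - q2
    if m <= 1.0 / dq:
        return None
    num = -dq * math.sqrt(m - 1.0 / dq) + math.sqrt(q2 / math.pi)
    den = math.sqrt(q1 + q2 - q2 / math.pi)
    return Phi(num / den)

print(zeta(1000, 0.33273, 0.30811), zeta_zly(1000, 0.33273, 0.30811))
print(zeta_zly(1000, 0.02378, 0.02337))  # Delta q is tiny, so 1/Delta q > m
```

The last call illustrates why ζZLY is reported as inapplicable for very short sequences in table 1.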

The error rate for the ML method (eq. 5) is analyzed in Appendix B. When the number of loci $m\to\infty$, the MLE $\hat\Theta_j\to\Theta_j^*$ in species tree Sj, j = 1, 2, 3. Note that S1 represents the true model and $\Theta_1^*$ are the true parameter values, while S2 and S3 are misspecified models and $\Theta_2^*$ and $\Theta_3^*$ are the “best-fitting” or “pseudotrue” parameter values. The Kullback–Leibler distance D12 from S2 to S1 is:

$$D_{12}=\int f(x|S_1,\Theta_1^*)\log\frac{f(x|S_1,\Theta_1^*)}{f(x|S_2,\Theta_2^*)}\,dx=E(l_1(\Theta_1^*))-E(l_2(\Theta_2^*)), \tag{10}$$

where $l_j(\Theta_j^*)\equiv\log f(x|S_j,\Theta_j^*)$, with x being one data point (or the site-pattern counts at one locus), and where the integral means summation over all possible data outcomes at a locus. We use the per-locus log-likelihood values to compare the three species trees: $\bar z_j\equiv\frac{1}{m}\ell_j(\hat\Theta_j)$, j = 1, 2, 3. When m is large, these have the means $E(\bar z_j)\approx E(l_j(\Theta_j^*))\equiv\mu_j$, with $\mu_1-\mu_2=D_{12}$, and the variance matrix $\frac{1}{m}\Sigma$, where $\Sigma=\{\sigma_{jk}\}$ and $\sigma_{jk}\equiv\mathrm{Cov}(l_j(\Theta_j^*),l_k(\Theta_k^*))$. The error of the ML method, $e_{ML}=P\{\ell_1(\hat\Theta_1)<\max(\ell_2(\hat\Theta_2),\ell_3(\hat\Theta_3))\}$, is then given by Theorem 1 as:

$$e_{ML}=P\{\bar z_1<\max(\bar z_2,\bar z_3)\}\approx\zeta_N(m,D_{12},\Sigma). \tag{11}$$

Equation (11) cannot be used to calculate the error rate for ML as D12 and σjk are not easily computable. It nevertheless predicts a linear relationship between $\Phi^{-1}(e_{ML})$ and $\sqrt{m}$. This is confirmed by simulation (fig. 2a′–c′).

Precise results may be obtained in special cases. In the case of one locus (m = 1), the ML gene tree is the ML species tree except for rare data sets: the true species tree S1 is recovered if xi1 > max(xi2, xi3). In rare data sets of extreme divergence, even if xi1 > max(xi2, xi3), ties for gene trees are possible, with the star tree being as good as the binary trees (Yang 2000), whereas ML under MSC favors S1. One such data set is xi = (4, 13, 12, 11, 50), in which case the three gene trees as well as the star tree achieve the same likelihood, whereas ML under MSC favors S1. However, such data sets involve sequences more divergent than random sequences and have vanishingly small probability when n is large. Thus, we ignore them and consider all methods to be equivalent when m = 1. With one locus, it is impossible to identify all parameters in the MSC model: there are four parameters and only three independent site-pattern frequencies (fi0, fi1, fi2 + fi3 for S1, for example).

The case of one site per locus (n =1) is analyzed later in the section on isml. Numerical calculations on a model species tree are presented in table 1. They will be discussed later in comparison with other methods.

In the case of $n\to\infty$, the gene tree (including the coalescent times) at each locus is given without errors. The likelihood is then the product of MSC densities of gene trees across the loci (eq. 2). This likelihood has singularities, with one or more species trees achieving infinite likelihood (Liu et al. 2010; Yang 2014). In the case of three species considered here, only one species tree (given by the smallest coalescent time) achieves infinite likelihood and will be the unique species-tree estimate, so that the estimation can proceed despite the singularity (Yang 2014, p. 360, Problem 9.4). Let the smallest coalescent/divergence time between species across all loci be tab, tbc, and tca. If tab is the smallest among the three, then species tree S1 achieves infinite likelihood, by collapsing on the coalescent time tab; that is, $\ell_1(\hat\Theta_1)\to\infty$ as $\hat\tau_0=\hat\tau_1=t_{ab}$ and $\hat\theta_1\to 0$ (see eq. 2) (Yang 2014, p. 338–339), whereas the other two species trees have finite likelihood.

Given S1 as the true species tree, both tbc and tca are >τABC (fig. 1b). If sequences a and b coalesce in population AB at any of the m loci, tab will be smaller than both tbc and tca, and S1 will be the ML species tree. Thus, an incorrect species tree is inferred only if a and b do not coalesce in AB at any of the m loci and are not the first to coalesce in the root population ABC. Thus,

$$e_{ML,\infty}=\phi^{m}\times\tfrac{2}{3}, \tag{12}$$

where $\phi=e^{-\frac{2}{\theta_{AB}}(\tau_{ABC}-\tau_{AB})}$ is the probability that a and b do not coalesce in population AB. This equation is exact and applies to both small and large m (fig. 3b).
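Equation (12) is simple enough to evaluate directly. The sketch below is illustrative (the function name is hypothetical) and uses the parameter values of the simulations in this paper.

```python
import math

def ml_error_true_gene_trees(m, tau_abc, tau_ab, theta_ab):
    """Error of ML when gene trees and coalescent times are known (eq. 12)."""
    # phi: probability that a and b do not coalesce in population AB
    phi = math.exp(-2.0 * (tau_abc - tau_ab) / theta_ab)
    # wrong tree only if no locus coalesces in AB (phi^m) and a, b are not
    # the first pair to coalesce in ABC (probability 2/3)
    return phi ** m * 2.0 / 3.0

for m in (1, 10, 100):
    print(m, ml_error_true_gene_trees(m, 0.02, 0.019, 0.05))
```

Because the error is proportional to $\phi^m$, it decays exponentially with the number of loci in this limiting case.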

Fig. 2.

(a–c) Species-tree estimation error (e) at three sequence lengths (n = 1, 2, 1,000) plotted against the number of loci (m) for different methods. (a′–c′) The probit transform of the species-tree error, $\Phi^{-1}(e)$, plotted against $\sqrt{m}$. The parameters used in the simulation are τ0 = 0.02, τ1 = 0.019, θ0 = 0.01, and θ1 = 0.05. When n = 1, all four methods (ML, 2-step, concatenation, and isml) give the same species tree estimate, while concatenation and isml are equivalent in all cases considered in this paper. The number of replicates R is of the order of $10^4$ for ML and $10^6$ for the other methods.

Table 1.

Probabilities (g1,g2,g3) of Estimated Gene Trees at Different Sequence Lengths (n) and the Error Rates for the Summary Methods 2-step and isml with m =1,000 Loci, Each with n Sites.

| | n = 1 | n = 2 | n = 10 | n = 100 | n = 1,000 | n = ∞ |
|---|---|---|---|---|---|---|
| 2-step (mp-est) | | | | | | |
| P(tie) | 0.92948 | 0.8673 | 0.57015 | 0.22159 | 0.05105 | 0 |
| g1(n) | 0.02378 | 0.04474 | 0.14515 | 0.26646 | 0.33273 | 0.35947 |
| g2(n) = g3(n) | 0.02337 | 0.04398 | 0.14235 | 0.25598 | 0.30811 | 0.32026 |
| e2STEP | 0.642 | 0.633 | 0.597 | 0.470 | 0.260 | 0.114 |
| ζ(m, g1, g2) | 0.644 | 0.635 | 0.600 | 0.472 | 0.264 | 0.113 |
| ζZLY(m, g1, g2) | NA | NA | 0.613 | 0.482 | 0.271 | 0.117 |
| (ζL1, ζU1) | (0.635, 0.953) | (0.623, 0.935) | (0.578, 0.869) | (0.430, 0.647) | (0.219, 0.331) | (0.087, 0.132) |
| (ζL2, ζU2) | (0.637, 0.729) | (0.626, 0.714) | (0.585, 0.668) | (0.446, 0.561) | (0.242, 0.328) | (0.103, 0.132) |
| ζ (mean2) | 0.683 | 0.670 | 0.627 | 0.504 | 0.285 | 0.118 |
| a | 0.574051 | 0.574056 | 0.573612 | 0.569708 | 0.562911 | 0.555962 |
| b | 0.0678913 | 0.0930368 | 0.190376 | 0.527747 | 1.11658 | 1.72268 |
| isml (concat) | | | | | | |
| eISML | 0.642 | 0.632 | 0.590 | 0.438 | 0.246 | 0.196 |
| ζN | 0.644 | 0.634 | 0.592 | 0.443 | 0.254 | 0.194 |
| ζN0 | 0.643 | 0.633 | 0.591 | 0.437 | 0.234 | 0.166 |
| (ζL1, ζU1) | (0.635, 0.953) | (0.622, 0.934) | (0.568, 0.854) | (0.397, 0.598) | (0.211, 0.318) | (0.157, 0.237) |
| (ζL2, ζU2) | (0.637, 0.728) | (0.625, 0.713) | (0.576, 0.659) | (0.416, 0.536) | (0.233, 0.316) | (0.177, 0.237) |
| ζ (mean2) | 0.683 | 0.669 | 0.618 | 0.476 | 0.275 | 0.207 |
| a | 0.574029 | 0.573971 | 0.57356 | 0.569747 | 0.558232 | 0.553151 |
| b | 0.067892 | 0.0958963 | 0.21228 | 0.607057 | 1.14253 | 1.35035 |

Note: P(tie) is the probability for ties in gene trees, with P(tie) + g1 + 2g2 = 1. The probabilities of estimated gene trees (g1, g2, g3) as well as the error rates (e2STEP and eISML) are estimated by simulation using a C program, with $10^6$ replicates. Ties are broken evenly in the error calculation. The parameter values used are (τ0, τ1, θ0, θ1) = (0.02, 0.019, 0.01, 0.05). The marginal (pooled) site pattern probabilities are $\bar p=(\bar p_0,\bar p_1,\bar p_2,\bar p_3,\bar p_4)$ = (0.92831926, 0.023777106, 0.023372801, 0.023372801, 0.001158033), given by equation (13). For 2-step, at n = 1, the estimated gene tree is determined by the single site so that $g_1^{(1)}=\bar p_1$ and $g_2^{(1)}=\bar p_2$, whereas at n = ∞, the estimated gene tree is the true gene tree, so that $g_1^{(\infty)}=P(G_1)$ and $g_2^{(\infty)}=P(G_2)$ (eq. 1). For 2-step, ζZLY (eq. 9) is inapplicable at n = 1 or 2 as m = 1,000 is too small. For isml, $\zeta_{N0}=\zeta_N(m,\Delta\mu,\sigma_1^2,\sigma_2^2,0,0)$ ignores the correlation (eq. 6), while ζN accounts for the correlation. The bounds (ζL1, ζU1) and (ζL2, ζU2) are calculated using equations (7) and (A27), with k = 2 used in ζU2. “mean2” is the average of the tight bounds: (ζL2 + ζU2)/2.

Fig. 3.

Error rates in species-tree estimation by ML, 2-step, and isml (=concatenation). (a) Error plotted against sequence length n when the number of loci m is fixed at 100 or 1,000, generated by simulation. (b) Error plotted against m when n = ∞. Error for ML is given by equation (12), whereas those for isml and 2-step are generated by simulation. (c) Error plotted against n when $nm=10^4$ is fixed, generated by simulation. Note that all four methods are equivalent when n = 1 or m = 1, while concatenation and isml are equivalent in all cases. Parameters used in the simulation are τ0 = 0.02, τ1 = 0.019, θ0 = 0.01, and θ1 = 0.05. The number of replicates R is of the order of $10^4$.

Concatenation

Sequence alignments at the m loci are merged into a super-alignment of length nm, and the data are the site-pattern counts pooled across loci: $x_\cdot=\{x_{\cdot j}\}$, with $x_{\cdot j}=\sum_i x_{ij}$, $j=0,1,\ldots,4$. The likelihood function is given by the multinomial probability of equation (4) except that $x_{\cdot j}$ is used instead of $x_{ij}$. The ML tree is G1 if $x_{\cdot 1}>\max(x_{\cdot 2},x_{\cdot 3})$ (Yang 1994b, 2000). We discuss the error rate of concatenation below in the section on the isml method.

We also examine biases in parameter estimation using concatenation. We use species tree S1 with τABC = 0.02, τAB = 0.01, θABC = 0.02, and θAB = 0.01 to simulate $m=10^4$ loci each with n = 250 sites. We obtain MLEs $\hat t_0$ and $\hat t_1$ on gene tree G1 from the concatenated data for comparison with the MLEs $\hat\tau_0$ and $\hat\tau_1$ on species tree S1 in the MSC model (eq. 5). With so much data, both concatenation and ML recover the true tree with near certainty. The MLEs under the MSC (obtained using the 3s program) are very close to the true values, whereas concatenation (baseml in paml, Yang 2007) produced seriously biased estimates (table 2). Even the relative age, $\hat t_0/\hat t_1$ = 1.92, differs from τABC/τAB = 2, which means that molecular clock dating analysis using concatenated data will produce biased time estimates (Angelis and dos Reis 2015; Ogilvie et al. 2017; Tiley et al. 2020).

Table 2.

Estimates of Divergence Times (true values in parentheses) by ML under the MSC (3s) and by Concatenation (baseml) in Two Simulated Data Sets, Each of $m=10^4$ Loci and n = 250 Sites.

| Data/method | τABC (0.02) | τAB (0.01) | θABC (0.02) | θAB (0.01) |
|---|---|---|---|---|
| Data set 1, 3s | 0.0201 | 0.0096 | 0.0199 | 0.0101 |
| Data set 2, 3s | 0.0196 | 0.0100 | 0.0201 | 0.0100 |
| Data set 1, baseml | 0.0298 | 0.0155 | | |
| Data set 2, baseml | 0.0298 | 0.0156 | | |

ISML

The isml method assumes that all sites in the super-alignment are i.i.d. Like concatenation, the data are summarized as pooled site-pattern counts, $x_\cdot=\{x_{\cdot 0},x_{\cdot 1},x_{\cdot 2},x_{\cdot 3},x_{\cdot 4}\}$. However, isml is coalescent-aware and uses the MSC model to calculate the probabilities for the site patterns. By averaging the conditional site-pattern probabilities of equation (3) over the MSC density of gene trees and coalescent times of equation (2), we derive the marginal site-pattern probabilities, $\bar p=(\bar p_0,\ldots,\bar p_4)$, as:

$$\begin{aligned}
\bar p_0&=\tfrac{1}{16}(1+18a_0+54a_0b+54a_0c_0+9c_1+9a_1),\\
\bar p_1&=\tfrac{3}{16}(1-6a_0-18a_0b-18a_0c_0+9c_1+9a_1),\\
\bar p_2&=\tfrac{3}{16}(1+6a_0-18a_0b-18a_0c_0-3c_1-3a_1),\\
\bar p_3&=\bar p_2,\\
\bar p_4&=\tfrac{6}{16}(1-6a_0+18a_0b+18a_0c_0-3c_1-3a_1),
\end{aligned} \tag{13}$$

where $a_0=\frac{e^{-8\tau_0/3}}{3+4\theta_0}$, $a_1=\frac{e^{-8\tau_1/3}}{3+4\theta_1}$, $b=\frac{e^{-4\tau_1/3}}{3+2\theta_1}$, $c_0=\frac{2\phi\,(\theta_1-\theta_0)\,e^{-4\tau_0/3}}{(3+2\theta_0)(3+2\theta_1)}$, and $c_1=\frac{4\phi\,(\theta_1-\theta_0)\,a_0}{3+4\theta_1}$, with $\phi=e^{-2(\tau_0-\tau_1)/\theta_1}$. Note that $\{\bar p_j\}$ are functions of $a_0$, $b+c_0$, and $a_1+c_1$, although these do not appear to permit simple biological interpretations. The cases for S2 and S3 are given by symmetry.
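Equation (13) can be checked against the pooled site-pattern probabilities reported in the note to table 1. The sketch below is illustrative; the function name `marginal_pattern_probs` is a hypothetical helper, and the parameter values (τ0, τ1, θ0, θ1) = (0.02, 0.019, 0.01, 0.05) are those of table 1.

```python
import math

def marginal_pattern_probs(tau0, tau1, theta0, theta1):
    """Marginal probabilities of (xxx, xxy, yxx, xyx, xyz) under S1 (eq. 13)."""
    phi = math.exp(-2.0 * (tau0 - tau1) / theta1)
    a0 = math.exp(-8.0 * tau0 / 3.0) / (3 + 4 * theta0)
    a1 = math.exp(-8.0 * tau1 / 3.0) / (3 + 4 * theta1)
    b = math.exp(-4.0 * tau1 / 3.0) / (3 + 2 * theta1)
    c0 = (2 * phi * (theta1 - theta0) * math.exp(-4.0 * tau0 / 3.0)
          / ((3 + 2 * theta0) * (3 + 2 * theta1)))
    c1 = 4 * phi * (theta1 - theta0) * a0 / (3 + 4 * theta1)
    ab = a0 * (b + c0)   # the probabilities depend on a0, a0*(b + c0), a1 + c1
    ac = a1 + c1
    p0 = (1 + 18 * a0 + 54 * ab + 9 * ac) / 16
    p1 = 3 * (1 - 6 * a0 - 18 * ab + 9 * ac) / 16
    p2 = 3 * (1 + 6 * a0 - 18 * ab - 3 * ac) / 16
    p4 = 6 * (1 - 6 * a0 + 18 * ab - 3 * ac) / 16
    return [p0, p1, p2, p2, p4]

pbar = marginal_pattern_probs(0.02, 0.019, 0.01, 0.05)
print(pbar)
```

The small difference between the second and third probabilities is the only signal that the pooled counts carry about the species tree.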

The likelihood function (or the probability for the pooled site-pattern counts) for each species tree is:

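$$\begin{aligned}
f(x_\cdot|S_1,\Theta_1)&=\bar p_0^{\,x_{\cdot 0}}\,\bar p_1^{\,x_{\cdot 1}}\,\bar p_2^{\,x_{\cdot 2}+x_{\cdot 3}}\,\bar p_4^{\,x_{\cdot 4}},\\
f(x_\cdot|S_2,\Theta_2)&=\bar p_0^{\,x_{\cdot 0}}\,\bar p_1^{\,x_{\cdot 2}}\,\bar p_2^{\,x_{\cdot 3}+x_{\cdot 1}}\,\bar p_4^{\,x_{\cdot 4}},\\
f(x_\cdot|S_3,\Theta_3)&=\bar p_0^{\,x_{\cdot 0}}\,\bar p_1^{\,x_{\cdot 3}}\,\bar p_2^{\,x_{\cdot 1}+x_{\cdot 2}}\,\bar p_4^{\,x_{\cdot 4}}.
\end{aligned} \tag{14}$$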

Theorem 3

(a) If the true species tree is S1 with parameters Θ1, then $\bar p_1>\bar p_2=\bar p_3$. (b) Isml infers the species tree S1 if $x_{\cdot 1}>\max\{x_{\cdot 2},x_{\cdot 3}\}$.

Proof.

(a) Each of the marginal site pattern probabilities $\bar p_j$, $j=0,\ldots,4$, is a sum over the four gene trees of figure 1b: G1a, G1b, G2, and G3. The three gene trees G1b, G2, and G3 have the same densities (eq. 2). Together their contribution to the site pattern xxy is the same as that to the pattern yxx or pattern xyx. If the gene tree is G1a (with any coalescent times t0 > t1), site pattern xxy will have a higher probability than yxx or xyx, with p1 > p2 = p3. Averaging over all the four gene trees, we have $\bar p_1>\bar p_2=\bar p_3$.

(b) We show that if $x_{\cdot 1}>x_{\cdot 2}$, then $\ell(S_1,\hat\Theta_1)>\ell(S_2,\hat\Theta_2)$, where $\hat\Theta_1$ and $\hat\Theta_2$ are the MLEs under each species tree. First note that if $x_{\cdot 1}>x_{\cdot 2}$ and $q_1>q_2>0$, then $q_1^{x_{\cdot 1}}q_2^{x_{\cdot 2}}>q_1^{x_{\cdot 2}}q_2^{x_{\cdot 1}}$. Let $q_1=\bar p_1(S_1,\hat\Theta_2)$ and $q_2=\bar p_2(S_1,\hat\Theta_2)$, and we have $\ell(S_1,\hat\Theta_2)>\ell(S_2,\hat\Theta_2)$. In other words, even if we use $\hat\Theta_2$ (the MLE for S2) to calculate the likelihood for species tree S1, tree S1 will have a higher likelihood than S2. Since $\hat\Theta_2$ may not be optimal for S1, it follows that $\ell(S_1,\hat\Theta_1)\ge\ell(S_1,\hat\Theta_2)>\ell(S_2,\hat\Theta_2)$. □

Theorem 3 means that isml infers species tree Sj if $x_{\cdot j}$ is the greatest among $x_{\cdot 1}$, $x_{\cdot 2}$, and $x_{\cdot 3}$, just like concatenation.
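The decision rule shared by concatenation and isml therefore needs no numerical optimization. A minimal sketch follows; the function names and the toy counts are made up for illustration, and ties are reported as `None` rather than broken at random.

```python
def pooled_counts(loci):
    """Sum the five site-pattern counts (xxx, xxy, yxx, xyx, xyz) over loci."""
    return [sum(x[j] for x in loci) for j in range(5)]

def species_tree_from_pooled(loci):
    """Return j if pooled pattern j (1, 2 or 3) has the largest count, else None."""
    pooled = pooled_counts(loci)
    support = pooled[1:4]            # counts for xxy, yxx, xyx
    best = max(support)
    winners = [k + 1 for k, s in enumerate(support) if s == best]
    return winners[0] if len(winners) == 1 else None

# toy data: three loci of 100 sites each (hypothetical counts)
loci = [(95, 3, 1, 1, 0), (96, 1, 2, 1, 0), (94, 4, 1, 1, 0)]
print(pooled_counts(loci), species_tree_from_pooled(loci))
```

Note that the second locus on its own supports G2, but the pooled counts support S1.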

To study the error rate for isml (or concat), let $p_{ij}$, $j=0,\ldots,4$, be the site-pattern probabilities at any locus i. Data at each locus are represented by the site-pattern frequencies $f_{ij}=x_{ij}/n$. Let $f_i=\{f_{ij}\}$ be the data at locus i. The $f_i$ are i.i.d. among loci from a common distribution with mean $E(f_{ij})=\bar p_j$ and variance/covariance $\sigma_{jj}\equiv V(f_{ij})$ and $\sigma_{jk}\equiv\mathrm{Cov}(f_{ij},f_{ik})$. Let $\bar f_j=\frac{1}{m}\sum_{i=1}^{m}f_{ij}=x_{\cdot j}/(mn)$ be the means over loci. Here, $\{f_{ij}\}$ constitute the full data, whereas $\{\bar f_j\}$ are summaries used by isml: the species tree estimate is Sj if $\bar f_j$ is the largest among $(\bar f_1,\bar f_2,\bar f_3)$. Thus, $e_{ISML}=P(\bar f_1<\max\{\bar f_2,\bar f_3\})\approx\zeta_N(m,\bar p_1-\bar p_2,\Sigma)$, where $\Sigma=\{\sigma_{jk}\}$. Below we derive the variances.

At n = 1, they are given by the multinomial distribution as:

$$\sigma_{jj}^{(1)}=\bar p_j(1-\bar p_j),\qquad \sigma_{jk}^{(1)}=-\bar p_j\bar p_k\ (j\ne k),\qquad 1\le j,k\le 3. \tag{15}$$

At n = ∞, we have $f_{ij}=p_{ij}$, given by equation (3). The variances, denoted $\sigma_{jk}^{(\infty)}$, can be generated by simulating gene trees with coalescent times and calculating the site-pattern probabilities (eq. 3) (supplementary table S1, Supplementary Material online). This distribution is 3D (for $f_{i0}$, $f_{i1}$, and $f_{i2}=f_{i3}$ under S1), indexed by four parameters (Θ1 in S1), and is a mixture distribution with four components corresponding to the four gene trees of figure 1b. It reflects the coalescent fluctuation in gene genealogies.

For any finite $1\le n<\infty$, the variances are given by:

$$\begin{aligned}
\sigma_{jj}&=V(E(f_{ij}|p_{ij}))+E(V(f_{ij}|p_{ij}))=V(p_{ij})+E\big(p_{ij}(1-p_{ij})/n\big)\\
&=V(p_{ij})+\tfrac{1}{n}\big[E(p_{ij})(1-E(p_{ij}))-V(p_{ij})\big]=\tfrac{1}{n}\sigma_{jj}^{(1)}+\tfrac{n-1}{n}\sigma_{jj}^{(\infty)},\\
\sigma_{jk}&=\mathrm{Cov}(p_{ij},p_{ik})+E\big(\mathrm{Cov}(f_{ij},f_{ik}|p_{ij},p_{ik})\big)\\
&=\mathrm{Cov}(p_{ij},p_{ik})+\tfrac{1}{n}\big[-E(p_{ij})E(p_{ik})-\mathrm{Cov}(p_{ij},p_{ik})\big]=\tfrac{1}{n}\sigma_{jk}^{(1)}+\tfrac{n-1}{n}\sigma_{jk}^{(\infty)},
\end{aligned} \tag{16}$$

where $E(p_{ij})\equiv\bar p_j$ (eq. 13), whereas $V(p_{ij})=\sigma_{jj}^{(\infty)}$ and $\mathrm{Cov}(p_{ij},p_{ik})=\sigma_{jk}^{(\infty)}$ are the variances/covariances over the coalescent process. These are calculated for a set of parameter values in supplementary table S1, Supplementary Material online. The variances of $f_{ij}$ are thus weighted averages of the variances at n = 1 and n = ∞.
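The weighted-average form of equation (16) can be illustrated by Monte Carlo. The sketch below is an assumption-laden illustration rather than the calculation behind supplementary table S1: it simulates gene trees under S1 (hypothetical helper `locus_pattern_probs`), estimates the n = ∞ covariances from the simulated site-pattern probabilities, and uses the simulated means in place of the exact $\bar p_j$ of equation (13) for the n = 1 component.

```python
import math
import random

def locus_pattern_probs(tau0, tau1, theta0, theta1, rng):
    """Simulate one gene tree under S1; return its five site-pattern probabilities."""
    wait = rng.expovariate(2.0 / theta1)            # a and b in population AB
    if tau1 + wait < tau0:
        top, t1 = 1, tau1 + wait                    # gene tree G1a
        t0 = tau0 + rng.expovariate(2.0 / theta0)
    else:
        top = rng.choice([1, 2, 3])                 # G1b, G2, G3 equally likely
        t1 = tau0 + rng.expovariate(6.0 / theta0)
        t0 = t1 + rng.expovariate(2.0 / theta0)
    u, v = math.exp(-8.0 * t0 / 3.0), math.exp(-4.0 * t1 / 3.0)
    hi = (3 + 9 * v * v - 6 * u - 6 * u * v) / 16   # pattern matching the gene tree
    lo = (3 - 3 * v * v + 6 * u - 6 * u * v) / 16   # the two mismatching patterns
    p = [(1 + 3 * v * v + 6 * u + 6 * u * v) / 16, lo, lo, lo,
         (6 - 6 * v * v - 12 * u + 12 * u * v) / 16]
    p[top] = hi
    return p

def sigma(n, j, k, draws):
    """Covariance of f_ij and f_ik at sequence length n, following eq. (16)."""
    mj = sum(d[j] for d in draws) / len(draws)
    mk = sum(d[k] for d in draws) / len(draws)
    s_inf = sum((d[j] - mj) * (d[k] - mk) for d in draws) / len(draws)
    s_one = mj * (1 - mj) if j == k else -mj * mk   # multinomial part, eq. (15)
    return s_one / n + (n - 1) / n * s_inf

rng = random.Random(2021)
draws = [locus_pattern_probs(0.02, 0.019, 0.01, 0.05, rng) for _ in range(20000)]
for n in (1, 10, 1000):
    rho = sigma(n, 1, 2, draws) / math.sqrt(sigma(n, 1, 1, draws) * sigma(n, 2, 2, draws))
    print(n, rho)
```

As n grows, the multinomial sampling component is down-weighted and the covariances approach those of the coalescent process alone.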

The approximation $e_{ISML}\approx\zeta_N(m,\bar p_1-\bar p_2,\Sigma)$ is very accurate, with errors <0.002 in the simulation of table 1. At large n, accommodating correlation is useful as ζN0, which ignores correlation, is less accurate (see fig. 4 for the case of n = ∞). For example, the correlation $\rho(f_{i1},f_{i2})=\sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}$ is 0.124, 0.153, and −0.181 at n = 1, 1,000, and ∞, respectively (supplementary table S1, Supplementary Material online).

We now consider parameter estimation by isml. Theorem 3 allows species tree estimation by isml without knowledge of the MLE of the parameters. With data of $x_{\cdot j}$, $j=0,\ldots,4$, there are only three observations (three free proportions $\bar f_0$, $\bar f_1$, and $\bar f_2+\bar f_3$ in the case of S1). As there are four parameters in the MSC model, it is impossible to identify all of them.

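If we assume θ0 = θ1 = θ (as in Tian and Kubatko 2016), all three parameters (τ0, τ1, θ) will be identifiable. As c0 = c1 = 0, equation (13) simplifies to:

$$\begin{aligned}
\bar p_0&=\tfrac{1}{16}(1+18a_0+54a_0b+9a_1),\\
\bar p_1&=\tfrac{3}{16}(1-6a_0-18a_0b+9a_1),\\
\bar p_2&=\tfrac{3}{16}(1+6a_0-18a_0b-3a_1)=\bar p_3,\\
\bar p_4&=\tfrac{6}{16}(1-6a_0+18a_0b-3a_1),
\end{aligned} \tag{17}$$

where a0, a1, and b are defined in equation (13) with θ0 = θ1 = θ. By equating the observed site-pattern frequencies to their expected probabilities (eq. 17), we have

$$\begin{aligned}
\tfrac{1}{4}(9a_1+1)&=\bar f_0+\bar f_1\equiv h_1,\\
\tfrac{1}{4}(9a_0+1)&=\bar f_0+\tfrac{1}{2}(\bar f_2+\bar f_3)\equiv h_2,\\
\tfrac{3}{8}(-18a_0b+3a_1+1)&=\bar f_1+\tfrac{1}{2}(\bar f_2+\bar f_3)\equiv h_3.
\end{aligned} \tag{18}$$

Thus, we have a quadratic equation in $\hat\theta$:

$$4(4h_3-2h_1-1)^2\,\hat\theta^2+\left[3(4h_3-2h_1-1)^2-(4h_1-1)(4h_2-1)^2\right](4\hat\theta+3)=0. \tag{19}$$

This always has a unique positive root. Given $\hat\theta$, the estimates $\hat\tau_0$ and $\hat\tau_1$ are given by equation (18), which are guaranteed to be positive.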
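The estimator defined by equations (18) and (19) can be sketched in a few lines. This is an illustration under the assumption θ0 = θ1 = θ, with hypothetical function names; the round-trip check feeds the exact probabilities of equation (17) back into the estimator in place of observed frequencies.

```python
import math

def pooled_probs_equal_theta(tau0, tau1, theta):
    """Marginal site-pattern probabilities of eq. (17), assuming theta0 = theta1."""
    a0 = math.exp(-8.0 * tau0 / 3.0) / (3 + 4 * theta)
    a1 = math.exp(-8.0 * tau1 / 3.0) / (3 + 4 * theta)
    b = math.exp(-4.0 * tau1 / 3.0) / (3 + 2 * theta)
    p0 = (1 + 18 * a0 + 54 * a0 * b + 9 * a1) / 16
    p1 = 3 * (1 - 6 * a0 - 18 * a0 * b + 9 * a1) / 16
    p2 = 3 * (1 + 6 * a0 - 18 * a0 * b - 3 * a1) / 16
    p4 = 6 * (1 - 6 * a0 + 18 * a0 * b - 3 * a1) / 16
    return [p0, p1, p2, p2, p4]

def isml_estimates(f):
    """Solve eqs. (18)-(19) for (theta, tau0, tau1) from pooled frequencies f."""
    h1 = f[0] + f[1]
    h2 = f[0] + 0.5 * (f[2] + f[3])
    h3 = f[1] + 0.5 * (f[2] + f[3])
    w = (4 * h3 - 2 * h1 - 1) ** 2
    A = 4 * w
    B = 3 * w - (4 * h1 - 1) * (4 * h2 - 1) ** 2
    # eq. (19): A*theta^2 + 4*B*theta + 3*B = 0; take the positive root
    theta = (-4 * B + math.sqrt(16 * B * B - 12 * A * B)) / (2 * A)
    a1 = (4 * h1 - 1) / 9
    a0 = (4 * h2 - 1) / 9
    tau0 = -3.0 / 8.0 * math.log(a0 * (3 + 4 * theta))
    tau1 = -3.0 / 8.0 * math.log(a1 * (3 + 4 * theta))
    return theta, tau0, tau1

f = pooled_probs_equal_theta(0.02, 0.01, 0.01)
print(isml_estimates(f))
```

With real data the frequencies carry sampling error, so the bracketed term in equation (19) should be checked to be negative before taking the square root.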

Fig. 4.

Species tree error for isml at n = ∞ generated by simulation ($10^8$ replicates) and by approximation based on ζN either with or without accounting for correlations. The error goes from 0.64 (at m = 1) to 0.19 (at m = 1,000). Results for other methods for the same parameter settings are in figure 3b.

Thus, under the assumption θ0 = θ1, the isml method provides estimates of the three parameters in the model: θ, τ0, and τ1. As there is a one-to-one correspondence between the parameters and the multinomial proportions, the estimates are consistent and approach the true values when $m\to\infty$ for any $n\ge 1$ if the assumption of θ0 = θ1 is correct (table 3, cases c and d). However, the pooled site-pattern counts or average site-pattern frequencies are summaries of the original data and are not sufficient statistics. It then follows that the isml estimates will be less efficient and have larger asymptotic variances than the MLEs obtained from the full data under the same model assumption of θ0 = θ1 (table 3, case c). Furthermore, if θ0 ≠ θ1, assuming θ0 = θ1 will lead to biased and inconsistent parameter estimates even if the same species tree estimate is produced. In other words, if θ0 ≠ θ1, the isml method assuming θ0 = θ1 will produce a consistent estimate of the species tree and inconsistent estimates of the model parameters (table 3, cases e and f).

Table 3.

Characterization of the isml Method.

| Case | True Model | Assumption | Data Size | Parameters | isml vs. ml |
|---|---|---|---|---|---|
| (a) | θ0 ≠ θ1 | θ0 ≠ θ1 | n > 1 | 3 out of 4 identifiable | isml ≠ ml |
| (b) | θ0 ≠ θ1 | θ0 ≠ θ1 | n = 1 | 3 out of 4 identifiable | isml = ml |
| (c) | θ0 = θ1 | θ0 = θ1 | n > 1 | all 3 identifiable | isml ≠ ml |
| (d) | θ0 = θ1 | θ0 = θ1 | n = 1 | all 3 identifiable | isml = ml |
| (e) | θ0 ≠ θ1 | θ0 = θ1 | n > 1 | 3 out of 4 identifiable, inconsistent | isml ≠ ml |
| (f) | θ0 ≠ θ1 | θ0 = θ1 | n = 1 | 3 out of 4 identifiable, inconsistent | isml = ml |

Note: In all cases, the species tree topology is identifiable and consistently estimated by isml when the number of loci $m\to\infty$. If the parameters are identifiable, their estimates will be consistent. When isml differs from ML and the assumed model is correct, isml is less efficient than ML for parameter estimation (case c).

Two-Step Method (Majority Vote)

In the 2-step method, we estimate gene trees at individual loci and then use the most common gene tree topology as the species tree estimate. Under JC, the ML gene tree for locus i (which is also the upgma tree) is tree Gj if xij is the largest among xi1,xi2, and xi3 (Yang 1994b, 2000); site patterns xxy, yxx, and xyx “support” gene trees G1, G2, and G3, respectively. There is no need for numerical optimization to obtain the ML tree at each locus.

Let g1, g2, and g3 be the probabilities that the estimated gene tree is G1, G2, and G3, respectively; that is, g1=P{xi1>max(xi2,xi3)}, and so on. These are functions of all four parameters in the MSC model (θ1) as well as the sequence length n, and can be computed numerically (Yang 2002, eq. 12) or by simulation. Under JC and the clock, g2=g3<g1<P(G1|S1,θ1) (Yang 2002). This result has several implications. First, g1<P(G1) means that phylogenetic errors inflate gene-tree–species-tree discordance and lead to underestimation of the internal branch length in the species tree (Yang 2002). Second, g1<P(G1) also means that the use of estimated (rather than true) gene trees reduces the probability of recovering the correct species tree. Third, g1>g2=g3 means that the 2-step estimate of the species tree is consistent even if estimated gene trees are used.

Let the number of loci at which G1 is the ML tree be $m_1=\sum_{i=1}^{m} I_{x_{i1}>\max(x_{i2},x_{i3})}$, where the indicator function $I_a=1$ if statement a is true and 0 otherwise. Similarly define m2 and m3 to be the counts for the two mismatching gene trees. The correct species tree is inferred if and only if m1>max(m2,m3). Thus, the error rate can be approximated by $e_{\text{2STEP}}\approx\zeta(m,g_1,g_2)$ (eq. 8).

The accuracy of this approximation is assessed in table 1 at different values of n with m = 1,000 and with parameter values τ0=0.02, τ1=0.019, θ0=0.01, and θ1=0.05. Consider first the case of n = 1. The gene tree is resolved if the single site at the locus has site patterns 1, 2, or 3, but is unresolved if the site has patterns 0 or 4. Whether we ignore loci with ties (with site patterns 0 or 4) or break ties evenly (assigning 1/3 to each gene tree) does not affect the species tree estimate. Thus, $g_1^{(1)}=\bar p_1$ and $g_2^{(1)}=\bar p_2$ (eq. 13) and the error is $e_{\text{2STEP}}\approx\zeta(m,\bar p_1,\bar p_2)$. This is equivalent to $e_{\text{ISML}}\approx\zeta_N(m,\bar p_1-\bar p_2,\Sigma)$ for isml, consistent with the fact that at n = 1 all methods considered here are equivalent.

If n=∞, the estimated gene trees will be the true gene trees so that g1=P(G1) and g2=P(G2). The error rate is then ζ(m,P(G1),P(G2))=ζ(1,000, 0.3594737, 0.3202631) = 0.1132, close to 0.114 from simulation. At n=1,000, the proportions of estimated gene trees are g1 = 0.33273 and g2 = 0.30811, so that ζ(m,g1,g2)= 0.264, close to 0.260 by simulation (table 1). These error rates are much larger than 0.114 at n=∞, suggesting that with n = 1,000 sites in the sequence, the estimated gene trees have substantial errors and uncertainties.
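The two true gene-tree probabilities quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes the standard three-species MSC result, in which sequences a and b fail to coalesce in the ancestral population AB with probability exp{−2(τ0−τ1)/θ1}; the function name `gene_tree_probs` is hypothetical and is not part of any program used in the paper.

```python
import math

def gene_tree_probs(tau0, tau1, theta1):
    """Return (P(G1), P(G2), P(G3)) for the true gene trees of three sequences."""
    # Probability that a and b do NOT coalesce in ancestor AB, whose branch
    # has length tau0 - tau1 and in which the coalescent rate is 2/theta1.
    p_no_coal = math.exp(-2.0 * (tau0 - tau1) / theta1)
    p_g1 = 1.0 - (2.0 / 3.0) * p_no_coal  # gene tree matching the species tree
    p_g2 = p_no_coal / 3.0                # each of the two mismatching gene trees
    return p_g1, p_g2, p_g2

p_g1, p_g2, p_g3 = gene_tree_probs(tau0=0.02, tau1=0.019, theta1=0.05)
print(f"P(G1) = {p_g1:.7f}, P(G2) = P(G3) = {p_g2:.7f}")
```

With the parameter values used in the text, the two probabilities agree with 0.3594737 and 0.3202631 to the digits shown.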

The approximations $\zeta_{\mathrm{ZLY}}$ (eq. 9) and ζ (eq. 8) give nearly identical results. The error rate is found to be very sensitive to the precise values of g1 and g2. Overall, the approximation is good, with errors within or close to 1%.

Numerical Comparison of Different Methods

We use simulation to compare the different species-tree estimation methods and to assess the reliability of our approximations. We use a challenging species tree with parameters τ0=0.02,τ1=0.019,θ0=0.01, and θ1=0.05. The error is plotted against the number of loci (m) when the number of sites per locus is fixed at n= 1, 2, or 1,000 (fig. 2).

In the case of one site per sequence (n = 1), all four methods considered in this study are equivalent, with the species tree given by the most frequent pooled site pattern (i.e., the greatest of x·1, x·2, and x·3). With one site, the independent-sites assumption is correct, and ml and isml are exactly the same. As discussed earlier, concatenation and 2-step also select the species tree according to the pooled site patterns. Treatment of ties among x·1, x·2, x·3 has very minor effects on the error rate. For n = 1 and m = 1,000, simulation gave the error estimate e = 0.642 if ties are broken evenly (table 1) or 0.641 if data sets with ties are ignored. As predicted by our theory, the probit transform of the error, Φ⁻¹(e), shows a linear relationship with √m (fig. 2a′, R²=0.9994).

In the case of n = 2 sites per locus, isml (=concatenation), 2-step, and ML are all distinct. To see that concatenation and 2-step may produce different species trees, consider the case of m = 3 loci and n = 2 sites. If the data at the three loci are 11, 02, and 00, where 0–4 represent the five site patterns, concatenation will infer the correct species tree S1 (as x·1=2, x·2=1, x·3=0), whereas 2-step will have a tie between S1 and S2 (as m1=1, m2=1, m3=0). If the data at the three loci are 33, 01, and 14, concatenation will have a tie between S1 and S3 (as x·1=2, x·2=0, x·3=2), whereas 2-step will infer the correct species tree (as m1=2, m2=0, m3=1). We also confirm that at n = 2 ML differs from all three summary methods and can identify and consistently estimate all four parameters in the MSC model. Indeed, ML is far more efficient for species tree estimation than the summary methods when n = 2 (fig. 2b and b′). Although the summary methods improve only slightly when n changes from 1 to 2, there is a major performance boost for ML (fig. 3a). This may be due to the fact that the model is fully identifiable with n = 2 but not when n = 1. The predicted linear relationship between Φ⁻¹(e) and √m holds well for the three summary methods (fig. 2b′). For ML, if we remove the first two points (for m = 10 and 20), the relationship is nearly linear, with y=0.0022x+0.0391, with R²=0.97.
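The two toy data sets above can be replayed in code. The sketch below is only an illustration of the counting rules described in the text: each locus is written as a string of site-pattern labels 0–4, concatenation picks the largest pooled count among patterns 1, 2, and 3, and 2-step lets each locus vote for the gene tree supported by its most frequent informative pattern, ignoring loci with ties. The function names are hypothetical.

```python
from collections import Counter

def pooled_counts(loci):
    """Pooled counts (x.1, x.2, x.3) of the informative site patterns 1, 2, 3."""
    counts = Counter(ch for locus in loci for ch in locus)
    return tuple(counts[k] for k in "123")

def gene_tree_votes(loci):
    """Counts (m1, m2, m3) of loci whose ML gene tree is G1, G2, G3.

    A locus votes for Gj only if pattern j is strictly the most frequent of
    patterns 1, 2, 3 at that locus; loci with ties are ignored.
    """
    votes = [0, 0, 0]
    for locus in loci:
        x = [locus.count(k) for k in "123"]
        best = max(x)
        if x.count(best) == 1:
            votes[x.index(best)] += 1
    return tuple(votes)

def pick_tree(counts):
    """Return 'S1', 'S2', or 'S3' for a unique maximum, or 'tie' otherwise."""
    best = max(counts)
    if counts.count(best) > 1:
        return "tie"
    return "S%d" % (counts.index(best) + 1)

for loci in (["11", "02", "00"], ["33", "01", "14"]):
    x, m = pooled_counts(loci), gene_tree_votes(loci)
    print(loci, "concatenation:", x, pick_tree(x), "| 2-step:", m, pick_tree(m))
```

For the first data set concatenation returns S1 while 2-step is tied; for the second the roles are reversed, as stated in the text.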

The most interesting case is with n≫1, since in real data sets n may be in the range 50–5,000, say. We used n = 1,000 in figure 2c and c′. As in the case of n = 2, there is a large performance divide between ML and the three summary methods (isml = concatenation and 2-step), whereas the summary methods have similar performance. The approximate linear relationship between Φ⁻¹(e) and √m holds well for all methods.

The superior performance of ML persists in the limit of n=∞ (fig. 3b). For example, $e_{\text{ML},\infty}$ = 0.45 and 0.01 for ML at m = 10 and 100, respectively, compared with $e_{\text{2STEP},\infty}$ = 0.60 and 0.46 for 2-step or $e_{\text{ISML},\infty}$ = 0.62 and 0.51 for isml. The differences between ML and 2-step reflect the information in the coalescent times or gene-tree branch lengths. The differences between ML and isml reflect the information in the variation of site-pattern frequencies among loci, as isml uses only the averages across loci.

Figure 3c examines the error rates of different methods, while $nm=10^4$ is fixed. At the two ends (n = 1 or m = 1), all four methods are equivalent, with e = 0.587 at n = 1 and $m=10^4$, and e = 0.646 at m = 1 and $n=10^4$. Note that when n = 1 and m→∞, the error e→0, while if m = 1 and n→∞, the error $e=1-g_1^{(n)}\to 1-P(G_1)=0.6405$. The high error at m = 1 even when n=∞ is because a single gene tree (with coalescent times), even if known with certainty, does not contain much information about the MSC process. Away from the two ends (n > 1 or m > 1), ML is considerably more efficient than the summary methods (fig. 3c). The case of $m=10^4$ (n = 1), at which $e_{\text{ML}}$ = 0.587, and the case of m = 2 (n = 5,000), at which $e_{\text{ML}}$ = 0.487, make an interesting contrast. In the first case all sites are i.i.d., while in the second, there are only two independent genes, each of 5,000 sites in complete linkage. One might expect data of independent sites to be more informative than two loci with correlated sites at the same locus (e.g., Long and Kubatko 2018), but the opposite is true. With n = 1, not all model parameters are identifiable, and this nonidentifiability issue appears to impact species tree estimation as well (Shi and Yang 2018, p. 172). With nm fixed, the smallest error $e_{\text{ML}}$ occurs at intermediate values of n and m, around n=m=100, although performance is similar over a large range of n (fig. 3c).

In table 1, we calculated the species-tree error probability using equations (6) and (8), as well as two pairs of bounds $(\zeta_{L1},\zeta_{U1})$ and $(\zeta_{L2},\zeta_{U2})$ (Theorem 1, Appendix A), for comparison with the simulation results. The asymptotic results are expected to apply when the sequence length n is fixed, whereas the number of loci m→∞. Here, m is fixed at 1,000, so that b < 2 for all cases (table 1), and is too small for the asymptotic approximations to be reliable. As a result, equations (6) and (8) are more accurate.

Discussion

Errors of Species Tree Estimation by Different Methods

Under the MSC model, data at different loci are i.i.d., so that the number of loci (m) constitutes the sample size in the statistical model. Thus, we have derived approximations to the error rate for different methods when m increases, with the sequence length n fixed. For large m, the error can be approximated by Φ(−c√m), where c is a constant. This is seen to apply to all four methods considered in this study (ML, isml = concatenation, and 2-step) (see table 4 for a summary).

Table 4.

Summary of Analytical Approximations to Species-Tree Estimation Error by Different Methods.

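Method n = 1 n ≥ 2 n = ∞
ml eq. 11 eq. 12
2-step $\zeta(m,\bar p_1,\bar p_2)$ $\zeta(m,g_1,g_2)$ $\zeta(m,P(G_1),P(G_2))$
isml/concatenation $\zeta_N(m,\Delta p,\Sigma^{(1)})$ $\zeta_N(m,\Delta p,\Sigma^{(n)})$ $\zeta_N(m,\Delta p,\Sigma^{(\infty)})$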

Note.—For isml/concatenation, $\Delta p=\bar p_1-\bar p_2$, and the variance–covariance matrix at n is $\Sigma^{(n)}=\frac{1}{n}\Sigma^{(1)}+\frac{n-1}{n}\Sigma^{(\infty)}$ (eq. 16). In the case of n = 1, $\zeta(m,\bar p_1,\bar p_2)=\zeta_N(m,\Delta p,\Sigma^{(1)})$, and 2-step, isml, concatenation, and ml are all equivalent.

The theory for ML in Appendix B applies generally to ML selection of nonnested models, where one model (which may or may not be the true model) fits the data better than the others, judged by the K–L divergence to the true data-generating model. In particular, the theory applies to conventional phylogenetic reconstruction without the MSC model. For example, figure 5 applies the same prediction to simulation results on four-taxon trees from Yang (1997). Previously, Susko (2011) developed a large-sample approximation to the log-likelihood difference between two trees and to the probability that each tree will be the ML tree in the case of four species without the molecular clock. It was assumed that the internal branch length in the tree is small and approaches 0 at the rate of $n^{-1/2}$ or faster when the number of sites n increases. In our analysis, we take the conventional approach of fixing the parameters when the data size increases.

Fig. 5.

The probit transform of the phylogenetic reconstruction error, Φ⁻¹(e), is a linear function of the square root of the number of sites in the alignment (n). Simulation results from Yang (1997, fig. 1A and B) are used in the plot. The trees used in the simulation have four taxa, with branch lengths ((0.5, 0.5):0.1, 0.5, 0.5) for tree A and ((0.5, 0.5):0.1, 0.6, 1.4) for tree B. Data are simulated under the JC+G model (Yang 1994a) and analyzed under both JC and JC+G (Jukes and Cantor 1969; Yang 1994a). Note that in (B), ML under the incorrect model (JC) is more efficient than ML under the correct model (JC+G).

We note that in problems of parameter estimation, the standard error for the parameter estimate or the width of the confidence interval typically decreases at the rate of $n^{-1/2}$, so that quadrupling the data size halves the interval. In contrast, the probability of recovering the best-fitting model approaches 1 much faster. As the probit transform of the error decreases linearly with √n, it will soon reach a point beyond which the precise error probability is of no practical significance: for example, Φ⁻¹(e) = −3 means e = 0.0013, while Φ⁻¹(e) = −5 means e = 2.9×10⁻⁷. The different dynamics between model selection and parameter estimation when the data size grows is consistent with the fact that we tend to obtain extreme support for phylogenies inferred in large data sets (Yang and Zhu 2018).
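The contrast between the two rates can be made concrete with a short calculation. The sketch below uses the standard normal CDF from Python's `statistics.NormalDist` to convert probit values into error probabilities; the slope `c = 0.1` in `predicted_error` is a purely hypothetical value chosen for illustration, not an estimate from the paper.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def error_from_probit(z):
    """Error probability e corresponding to the probit value z = Phi^{-1}(e)."""
    return std_normal.cdf(z)

def probit(e):
    """Probit transform Phi^{-1}(e) of an error probability e."""
    return std_normal.inv_cdf(e)

for z in (-1, -2, -3, -4, -5):
    print(f"probit {z:+d} -> e = {error_from_probit(z):.2e}")

c = 0.1  # hypothetical slope of the probit against sqrt(n)

def predicted_error(n):
    """Error predicted by e = Phi(-c * sqrt(n)) for data size n."""
    return error_from_probit(-c * n ** 0.5)

# Quadrupling n doubles the probit magnitude (here from about -3 to about -6),
# whereas a confidence-interval width would only be halved.
print(predicted_error(900), predicted_error(3600))
```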

Implications of Our Study to Species Tree Methods

Although the species tree problem studied here is the simplest, it has the complexities of the general problem. Furthermore, we have represented all major species tree methods in our analysis. We expect ML to be asymptotically similar to Bayesian inference as both are full-data methods.

We have assumed the JC mutation model and the molecular clock. Our results are thus applicable to shallow species phylogenies and may not apply to distantly related species for which the JC model may be inadequate for multiple-hit correction and the molecular clock may be seriously violated. In the case of three species examined in this paper, concatenation and isml always produce the same species tree estimate. However, in more general settings with four or more species and when the clock is violated and unrooted trees are used, concatenation and isml are known to be different. In particular, concatenation (as well as 2-step) can be inconsistent (Roch and Steel 2015), while isml is a coalescent-aware method and is always consistent.

The isml method considered here is similar to SVDQuartets (Chifman and Kubatko 2014). Both are summary methods based on pooled site-pattern counts. SVDQuartets is sometimes described as a site pattern-based method (e.g., Kubatko 2019). This is not a helpful description. Site-pattern counts for different loci ({fij}) are sufficient statistics under the model and carry the same amount of information as the sequence alignments at the same loci so that it makes no difference whether site patterns or sequences are used. Indeed virtually all methods involving likelihood calculation on sequences operate on site patterns instead of sites. Instead what matters is whether site patterns are pooled across loci. In the original data, the sites of the same locus share the same gene tree and the variation among loci provides information about parameters of the coalescent process such as the ancestral population sizes. Pooling sites across loci means that such information is lost (Shi and Yang 2018). As a result, the pooled site-pattern counts are unable to identify all parameters of the MSC model even if they can identify the species tree topology. Previously, Long and Kubatko (2018) found in simulations that SVDQuartets performed better in data sets of 600 coalescent-independent sites (m=600,n=1 in the notation of this paper) than in data of two genes each of 300 bp (m=2,n=300), and suggested that this is because “[t]he 600 sites observed from 600 distinct gene trees give independent genealogical information about the species tree, though indirectly, whereas the 300 sites for each of the two genes can give a reasonable indication of the individual gene trees, but still provide only two observed gene genealogies.” Our analysis suggests that this is not a correct interpretation. When the information in the data is used properly (as in the ML method), there is in fact more information in two genes each of 300 bp than in 600 independent sites (fig. 3c).

To understand the issue of parameter unidentifiability and the potential information loss for species tree estimation due to the pooling of sites across loci in SVDQuartets, consider the simple random-effects model:

$$y_{ij}=\mu+\alpha_i+e_{ij},\qquad i=1,\ldots,m;\ j=1,\ldots,n, \tag{20}$$

where the treatment effect $\alpha_i\sim N(0,\sigma_a^2)$ and the error $e_{ij}\sim N(0,\sigma_e^2)$. Parameters in the model include the grand mean μ and the variance components $\sigma_a^2$ and $\sigma_e^2$. It is obvious that if there are no replications within treatment (n = 1) or if the observations ($y_{ij}$) are pooled across treatments, the between-treatment variation and within-treatment errors will be confounded so that $\sigma_a^2$ and $\sigma_e^2$ will not be identifiable even though μ still is. In species tree estimation, pooling site patterns across loci (as in isml and SVDQuartets) causes some parameters of the MSC model to become unidentifiable even though the species tree still is. This issue of information loss due to averaging over the whole genome may be even more serious for methods designed for data of single nucleotide polymorphisms (SNPs) (Leaché and Oaks 2017), such as snapp (Bryant et al. 2012), because the removal of constant sites in the SNP data causes further loss of information (even if the ascertainment bias is accounted for in the method).
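The confounding in the random-effects model can be seen in a small simulation. The sketch below is an illustration only: it uses one-way ANOVA (method-of-moments) estimators, which are a standard choice but are not described in the paper, and the parameter values (μ = 10, σa = 2, σe = 1, m = 2,000, n = 5) and function names are hypothetical.

```python
import random
from statistics import mean, variance

def simulate(m, n, mu, sd_a, sd_e, rng):
    """Simulate y[i][j] = mu + alpha_i + e_ij for m treatments with n replicates."""
    data = []
    for _ in range(m):
        alpha = rng.gauss(0.0, sd_a)  # treatment effect shared by one row
        data.append([mu + alpha + rng.gauss(0.0, sd_e) for _ in range(n)])
    return data

def variance_components(data):
    """Method-of-moments estimates of (sigma_a^2, sigma_e^2); requires n > 1."""
    n = len(data[0])
    group_means = [mean(row) for row in data]
    ms_within = mean(variance(row) for row in data)  # estimates sigma_e^2
    ms_between = n * variance(group_means)           # estimates sigma_e^2 + n*sigma_a^2
    return (ms_between - ms_within) / n, ms_within

rng = random.Random(1)
data = simulate(m=2000, n=5, mu=10.0, sd_a=2.0, sd_e=1.0, rng=rng)
var_a_hat, var_e_hat = variance_components(data)

# Pooling all observations discards the treatment labels: the pooled sample
# still estimates mu, but its variance only reflects sigma_a^2 + sigma_e^2.
pooled = [y for row in data for y in row]
mu_hat, total_var_hat = mean(pooled), variance(pooled)
print(var_a_hat, var_e_hat, mu_hat, total_var_hat)
```

With replicates kept within treatments, both variance components are recovered; after pooling, only their sum is, which parallels the loss of MSC parameters when site patterns are pooled across loci.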

An important difference between isml and SVDQuartets is that isml applies ML to the pooled site-pattern counts, whereas SVDQuartets uses a criterion based on linear invariants to avoid the ML optimization (Xu and Yang 2016). Use of a non-ML criterion is expected to lead to further reduction in efficiency, in addition to information loss due to the pooling of sites across loci (Chou et al. 2015; Xu and Yang 2016; Shi and Yang 2018).

The MSC model analyzed in this paper assumes free recombination among loci and no recombination between sites of the same locus. Data for such analysis are typically loosely linked short genomic segments that are far apart from each other so that recombination within a locus is rare, whereas different loci are nearly independent (e.g., Takahata et al. 1995; Burgess and Yang 2008; Lohse et al. 2016). Both assumptions of free recombination among loci and no recombination within locus are expected to be violated in real data analysis, and the impact of within-locus recombination is of particular concern. The ML method considered in this paper assumes no recombination (with r = 0), whereas isml (and SVDQuartets) assumes free recombination (r=∞). The relative performance of the methods will depend on the true recombination rate: ML may be expected to perform better than isml if r is close to 0, while isml may be superior if r is large. At very high recombination rates, it may even be possible for ML (assuming r = 0) to be inconsistent since the method is similar to concatenation and merges sites of the same locus with different histories into one sequence. In contrast, isml is consistent for all values of r. Previously, Lanier and Knowles (2012) found in a computer simulation that species-tree estimation was robust to moderate levels of within-locus recombination (see also discussions in Edwards et al. [2016]; Xu and Yang [2016]). It will be interesting to evaluate the relative performance of modern species-tree estimation methods (including isml and SVDQuartets) under realistic recombination rates.

Materials and Methods

Simulation

We use a challenging species tree with parameters τ0=0.02, τ1=0.019, θ0=0.01, and θ1=0.05 (fig. 1a). A C program was written to simulate gene trees and sequence alignments for the case of three species/sequences, under the JC model (Jukes and Cantor 1969) with the clock. To simulate the gene tree and the sequence alignment for each locus, we generate an exponential coalescent waiting time (s1) with mean θ1/2. If s1<τ0−τ1, the gene tree is G1a, and another exponential waiting time s0 is generated with mean θ0/2 to get t0=τ0+s0 and t1=τ1+s1. If s1>τ0−τ1, the gene tree is one of G1b, G2, G3, chosen at random, and two coalescent waiting times (s1 and s0) are generated with means θ0/6 and θ0/2, respectively, so that t1=τ0+s1 and t0=t1+s0 (fig. 1b). The gene tree and node ages (t0, t1) are then used to calculate the site-pattern probabilities for the locus (eq. 3), and the site-pattern counts are generated from multinomial sampling (eq. 4). Each data set consists of m loci with the sequence length of n sites. We use a large number of replicates (typically R=10⁶ or 10⁸) so that sampling errors due to a limited number of replicates are not a concern. Species tree estimation by concatenation (=isml) and 2-step is done by counting site patterns.
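The gene-tree step of this simulation can be sketched in a few lines. The code below is a Python illustration, not the authors' C program; it assumes that τ1 is the age of the AB ancestor and τ0 the age of the root, so that sequences a and b can coalesce in AB only during a time interval of length τ0−τ1. The function name `simulate_gene_tree` is hypothetical, and the site-pattern sampling step (eqs. 3 and 4) is omitted.

```python
import random

def simulate_gene_tree(rng, tau0, tau1, theta0, theta1):
    """Draw (gene tree label, t0, t1) for three sequences a, b, and c.

    t1 is the age of the first (younger) coalescence and t0 the age of the
    gene-tree root, both measured from the present.
    """
    s1 = rng.expovariate(2.0 / theta1)       # waiting time with mean theta1/2
    if s1 < tau0 - tau1:
        # a and b coalesce in AB; the final coalescence is in the root population.
        s0 = rng.expovariate(2.0 / theta0)   # mean theta0/2
        return "G1a", tau0 + s0, tau1 + s1
    # All three lineages reach the root population; a random pair joins first.
    tree = rng.choice(["G1b", "G2", "G3"])
    s1 = rng.expovariate(6.0 / theta0)       # three lineages: mean theta0/6
    s0 = rng.expovariate(2.0 / theta0)       # two lineages: mean theta0/2
    t1 = tau0 + s1
    return tree, t1 + s0, t1

rng = random.Random(2021)
draws = [simulate_gene_tree(rng, 0.02, 0.019, 0.01, 0.05) for _ in range(100000)]
freq_g1 = sum(tree in ("G1a", "G1b") for tree, _, _ in draws) / len(draws)
print(f"simulated frequency of gene tree G1: {freq_g1:.4f}")
```

Under the stated parameter values the simulated frequency of G1 should be close to P(G1) = 0.3594737 given in the text.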

For the ML method (eq. 5), we used the simulation program MCcoal, which is part of the bpp program (Yang 2015), to simulate the gene trees and sequence alignments. The data are then analyzed using the ML program 3s (Yang 2002; Dalquen et al. 2017). The JC model is used to simulate and analyze data.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

Acknowledgments

We thank Bin Wang for discussions and two anonymous reviewers for many insightful comments. This study has been supported by Biotechnology and Biological Sciences Research Council grant (BB/P006493/1) to Z.Y. and a BBSRC equipment grant (BB/R01356X/1). T.Z. is supported by a Natural Science Foundation grant (32070685 and 31671370) and a grant from the Youth Innovation Promotion Association of Chinese Academy of Sciences (201901).

Data Availability

The C program for simulating under the MSC model with 3 species and 3 sequences is available from the authors upon request.

Appendix A. Proof of Theorem 1

(a) Define the random variable:

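$$y=\bar z_1-\max(\bar z_2,\bar z_3)=\bar z_1-\tfrac12(\bar z_2+\bar z_3)-\tfrac12|\bar z_2-\bar z_3|=y_1-|y_2|, \tag{A1}$$

where $y_1=\bar z_1-\tfrac12(\bar z_2+\bar z_3)\sim N(\Delta\mu,\,s_1^2/m)$ and $y_2=\tfrac12(\bar z_2-\bar z_3)\sim N(0,\,s_2^2/m)$, with

$$\begin{aligned}
\tfrac1m s_1^2&=V(\bar z_1)+\tfrac14 V(\bar z_2+\bar z_3)-2\,\mathrm{Cov}(\bar z_1,\bar z_2)=\tfrac1m\left[\sigma_1^2+\tfrac12(\sigma_2^2+\sigma_{23})-2\sigma_{12}\right],\\
\tfrac1m s_2^2&=\tfrac14 V(\bar z_2-\bar z_3)=\tfrac{1}{2m}(\sigma_2^2-\sigma_{23}).
\end{aligned} \tag{A2}$$

Here, we treat $\bar z_1$, $\bar z_2$, and $\bar z_3$ as normal variables, according to the central limit theorem as $m\to\infty$. As $\mathrm{Cov}(y_1,y_2)=0$ and both $y_1$ and $y_2$ are normal variables, they are independent. Then,

$$\begin{aligned}
\zeta&=P\{y_1<|y_2|\}=P\{y_2<0,\,y_1<-y_2\}+P\{y_2>0,\,y_1<y_2\}=2P\{y_2>0,\,y_1<y_2\}\\
&=2\int_0^\infty \phi\!\left(y_2;0,\tfrac1m s_2^2\right)\Phi\!\left(\frac{y_2-\Delta\mu}{s_1/\sqrt m}\right)dy_2=2\int_0^\infty \phi(t)\,\Phi(at-b)\,dt,
\end{aligned} \tag{A3}$$

where $a=s_2/s_1$, $b=\Delta\mu\sqrt m/s_1$, and $\phi(x;\mu,\sigma^2)$ is the probability density function (PDF) for $N(\mu,\sigma^2)$, whereas $\phi(x)$ is the PDF for $N(0,1)$. The last integral has been studied by Yang and Rodríguez (2013, SI) in a different context and can be written as: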

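$$\zeta=\frac1\pi\int_{-\pi/2}^{\tan^{-1}a}\exp\left\{-\frac{b^2}{2(\sin\theta-a\cos\theta)^2}\right\}d\theta, \tag{A4}$$

or, by letting $t=a-\tan\theta$, with $d\theta=-\frac{1}{(t-a)^2+1}\,dt$, as:

$$\zeta=\frac1\pi\int_0^\infty\frac{1}{(t-a)^2+1}\exp\left\{-\frac{b^2\left[(t-a)^2+1\right]}{2t^2}\right\}dt. \tag{A5}$$

Equations (A4) and (A5) can be calculated using Gaussian quadrature and match direct calculations using the CDF for the bivariate normal distribution for $(\bar z_1-\bar z_2,\ \bar z_1-\bar z_3)$. When $\Delta\mu=0$, we have b = 0 and:

$$\zeta=2\int_0^\infty\phi(t)\,\Phi(at)\,dt=\frac12+\frac1\pi\tan^{-1}a. \tag{A6}$$

In the symmetrical case of $\Delta\mu=0$, $\sigma_1^2=\sigma_2^2$, and $\sigma_{12}=\sigma_{23}$ (with $a=1/\sqrt3$, b = 0), this gives $\frac12+\frac1\pi\tan^{-1}\!\left(\frac{1}{\sqrt3}\right)=\frac23$, as expected. In this case the three variables $\bar z_1$, $\bar z_2$, and $\bar z_3$ have the same probability of being the greatest so that the error is 2/3.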

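To avoid numerical integration, we note that $y_2\sim N\!\left(0,\tfrac{1}{2m}(\sigma_2^2-\sigma_{23})\right)$, and $|y_2|$ is a folded normal variable with mean and variance:

$$E(|y_2|)=\sqrt{\tfrac{1}{m\pi}(\sigma_2^2-\sigma_{23})},\qquad V(|y_2|)=\left(\tfrac{1}{2m}-\tfrac{1}{m\pi}\right)(\sigma_2^2-\sigma_{23}). \tag{A7}$$

Thus,

$$\begin{aligned}
E(y)&=\Delta\mu-\sqrt{\tfrac{1}{m\pi}(\sigma_2^2-\sigma_{23})},\\
V(y)&=V(\bar z_1)+\tfrac14V(\bar z_2+\bar z_3)+\tfrac14V(|\bar z_2-\bar z_3|)-\mathrm{Cov}(\bar z_1,\bar z_2+\bar z_3)\\
&\quad-\mathrm{Cov}(\bar z_1,|\bar z_2-\bar z_3|)+\tfrac12\mathrm{Cov}(\bar z_2+\bar z_3,|\bar z_2-\bar z_3|)\\
&=V(\bar z_1)+\tfrac14V(\bar z_2+\bar z_3)+\tfrac14V(|\bar z_2-\bar z_3|)-2\,\mathrm{Cov}(\bar z_1,\bar z_2)\\
&\quad-\mathrm{Cov}(\bar z_1,|\bar z_2-\bar z_3|)+\mathrm{Cov}(\bar z_2,|\bar z_2-\bar z_3|).
\end{aligned} \tag{A8}$$

We have,

$$\begin{aligned}
V(\bar z_2+\bar z_3)+V(|\bar z_2-\bar z_3|)&=E(\bar z_2+\bar z_3)^2+E(\bar z_2-\bar z_3)^2-E^2(\bar z_2+\bar z_3)-E^2(|\bar z_2-\bar z_3|)\\
&=4E(\bar z_2^2)-4\mu_2^2-\tfrac{4}{m\pi}(\sigma_2^2-\sigma_{23})=\tfrac4m\sigma_2^2-\tfrac{4}{m\pi}(\sigma_2^2-\sigma_{23}),\\
\mathrm{Cov}(\bar z_1,|\bar z_2-\bar z_3|)&=0,\qquad \mathrm{Cov}(\bar z_2,|\bar z_2-\bar z_3|)=0.
\end{aligned} \tag{A9}$$

Collecting all terms in equation (A8), we get

$$V(y)=\frac1m\left[\sigma_1^2-2\sigma_{12}+\sigma_2^2-\frac1\pi(\sigma_2^2-\sigma_{23})\right]. \tag{A10}$$

If we assume that y is approximately normally distributed, as in Zharkikh and Li (1992) and Yang (1996), then equation (6) follows. Note that equation (6) can also be written as $\zeta_N=\Phi\!\left(\frac{-b+a\sqrt{2/\pi}}{\sqrt{1+a^2(1-2/\pi)}}\right)$. Because $|y_2|$ has a folded normal distribution and is not a normal variable, the error of approximation of equation (6) does not approach zero when $m\to\infty$. For instance, in the symmetrical case ($\Delta\mu=0$, $\sigma_1^2=\sigma_2^2$, and $\sigma_{12}=\sigma_{23}$), equation (6) gives $\Phi\!\left(1/\sqrt{2\pi-1}\right)=0.66824$, not 2/3. This level of accuracy is acceptable for our calculations for finite m in this paper, as the precise value of the error is unimportant if the error is nearly zero, but equation (6) may not give the correct asymptotic error rate when $m\to\infty$ (fig. 6, a = 10).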
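The symmetrical-case values quoted above can be verified numerically. The sketch below evaluates the integral in equation (A3) with composite Simpson's rule (a simple substitute for the Gaussian quadrature mentioned in the text, truncating the upper limit at t = 12) and compares it with the normal approximation written in terms of a and b. The function names, the truncation point, and the number of steps are assumptions made for illustration.

```python
import math

def Phi(x):
    """Standard normal CDF, computed with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def zeta_exact(a, b, upper=12.0, steps=4000):
    """zeta = 2 * int_0^inf phi(t) Phi(a t - b) dt (eq. A3), by Simpson's rule."""
    h = upper / steps  # steps must be even
    total = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        t = i * h
        total += weight * phi(t) * Phi(a * t - b)
    return 2.0 * total * h / 3.0

def zeta_normal(a, b):
    """Normal approximation to the distribution of y = y1 - |y2| (eq. 6)."""
    c = 2.0 / math.pi
    return Phi((-b + a * math.sqrt(c)) / math.sqrt(1.0 + a * a * (1.0 - c)))

a_sym = 1.0 / math.sqrt(3.0)
print(zeta_exact(a_sym, 0.0), zeta_normal(a_sym, 0.0))
```

In the symmetrical case the integral returns 2/3, while the normal approximation returns the slightly different value 0.66824 noted in the text.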

Fig. 6.

Probit of error, $\Phi^{-1}(\zeta)$, plotted against b for different values of a. Six methods for calculating ζ are shown. The first five are, from top to bottom, $\zeta_{U1}$ (brown dashed line), $\zeta_{U2}$ (orange dotted, with k = 2 in eq. A27), Exact (black solid line), $\zeta_{L2}$ (blue dotted), and $\zeta_{L1}$ (purple dashed). Equation (6) (black dotted) is included as well.

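(b) To study the asymptotic behavior of the error probability ζ when $m\to\infty$, we derive bounds on ζ. From equation (A3),

$$\begin{aligned}
\zeta&=2\int_0^\infty\phi(t)\int_{-\infty}^{at-b}\phi(x)\,dx\,dt\\
&=2\int_{-\infty}^{\infty}\int_{-\infty}^{at-b}\phi(t)\phi(x)\,dx\,dt-2\int_{-\infty}^{0}\int_{-\infty}^{at-b}\phi(t)\phi(x)\,dx\,dt=2S-2A,
\end{aligned} \tag{A11}$$

where the first integral is $S=\Phi(-h)$, with $h=\frac{b}{\sqrt{1+a^2}}=\frac{\Delta\mu\sqrt m}{\sqrt{\sigma_1^2-2\sigma_{12}+\sigma_2^2}}$ being the distance from the origin (0, 0) to the line $x=at-b$ (fig. 7), and the second integral is:

$$2A=2\int_{-\infty}^{0}\int_{-\infty}^{at-b}\phi(t)\phi(x)\,dx\,dt=\int_{-\infty}^{-b}\phi(x)\int_{(x+b)/a}^{-(x+b)/a}\phi(t)\,dt\,dx. \tag{A12}$$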

Fig. 7.

The areas of integration for integrals in Equations (A11) and (A12). The two angles are $\psi=\tan^{-1}\frac1a$ and $\varphi=\tan^{-1}\frac{1}{ka}$, k > 1, with $\varphi<\psi<\frac\pi2$. The integral over the half-plane $x<-b$ is $\Phi(-b)$, whereas the integral over the half-plane $t>(x+b)/a$ is $S=\Phi(-h)=P\{\bar z_1<\bar z_2\}$. The integral over the sector ABA′ (the shaded area) is $2A=P\{\bar z_1<\bar z_2,\ \bar z_1<\bar z_3\}$. This is smaller than $\Phi(-b)\cdot\psi/\frac\pi2$ and greater than the integral over the area shaded with the brick pattern: these give the bounds $(\zeta_{L2},\zeta_{U2})$ in Appendix A. The purple dashed lines are $t=x/(ka)$ and $t=-x/(ka)$. They cross the blue lines at A and A′, with the length of the line segment OA being $r=\frac{b\sqrt{a^2k^2+1}}{a(k-1)}$. Note that the integral over the circle $x^2+t^2<r^2$ is $1-e^{-\frac12r^2}$.

By considering the area of integration (fig. 7), it is obvious that:

$$0<2A\le\Phi(-b)\left[1-\frac2\pi\tan^{-1}a\right], \tag{A13}$$

where the equality holds when b = 0. Let,

$$\zeta_{L2}=2\Phi(-h)-\Phi(-b)\left[1-\frac2\pi\tan^{-1}a\right].$$

As $\Phi(-b)<\Phi(-h)$, we have:

$$\zeta_{L2}>\Phi(-h)\left[1+\frac2\pi\tan^{-1}a\right]\equiv\zeta_{L1}, \tag{A14}$$

or

$$\Phi(-h)\le\zeta_{L1}\le\zeta_{L2}\le\zeta<2\Phi(-h)\equiv\zeta_{U1}, \tag{A15}$$

as in equation (7). The equality in the lower bounds is achieved at b = 0. Note that the bounds apply to all a > 0 and b > 0. We use the bounds $(\zeta_{L1},\zeta_{U1})$ in Theorem 1 and in the calculation of table 1. The width of the interval is $\Phi(-h)\left[1-\frac2\pi\tan^{-1}a\right]\le\Phi(-h)\le\zeta$, so that using any value inside the interval as the estimate will give an error of approximation that is smaller than the error probability ζ.

Note that the bounds $\Phi(-h)<\zeta<2\Phi(-h)$ are also given by the definition $\zeta=P\{\bar z_1<\bar z_2\ \cup\ \bar z_1<\bar z_3\}$, since

$$P(\bar z_1<\bar z_2)<\zeta<P(\bar z_1<\bar z_2)+P(\bar z_1<\bar z_3)=2P(\bar z_1<\bar z_2), \tag{A16}$$

with $P(\bar z_1<\bar z_2)=\Phi(-h)$.

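Next we consider the upper bound in equation (A15) when h or b is large. Note that:

$$\begin{aligned}
\Phi(-h)&=\int_h^\infty\frac{1}{\sqrt{2\pi}}e^{-y^2/2}\,dy=\int_0^\infty\frac{1}{\sqrt{2\pi}}e^{-\frac12(x+h)^2}\,dx=\frac{1}{\sqrt{2\pi}}e^{-h^2/2}\int_0^\infty e^{-(hx+\frac12x^2)}\,dx\\
&=\frac{1}{h\sqrt{2\pi}}e^{-h^2/2}\int_0^\infty e^{-t}e^{-\frac{1}{2h^2}t^2}\,dt=\frac{1}{h\sqrt{2\pi}}e^{-h^2/2}B,
\end{aligned} \tag{A17}$$

where $B=\int_0^\infty e^{-t}e^{-\frac{1}{2h^2}t^2}\,dt<\int_0^\infty e^{-t}\,dt=1$. For large h,

$$B>\int_0^{\sqrt h}e^{-t}e^{-\frac{1}{2h^2}t^2}\,dt>e^{-\frac{1}{2h}}\int_0^{\sqrt h}e^{-t}\,dt=e^{-\frac{1}{2h}}\left(1-e^{-\sqrt h}\right)=1-\frac{1}{2h}+o\!\left(\frac1h\right). \tag{A18}$$

Thus, for large h, $\Phi(-h)$ is bounded by:

$$\left(1-\frac{1}{2h}+o\!\left(\frac1h\right)\right)\frac{1}{h\sqrt{2\pi}}e^{-h^2/2}<\Phi(-h)<\frac{1}{h\sqrt{2\pi}}e^{-h^2/2}, \tag{A19}$$

or

$$\Phi(-h)=\frac{1}{h\sqrt{2\pi}}e^{-h^2/2}+O\!\left(\frac{1}{h^2}e^{-h^2/2}\right). \tag{A20}$$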

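Let $\varepsilon>0$ such that $\Phi(-(h+\varepsilon))=\alpha\,\Phi(-h)$ for $0<\alpha<1$; in other words, ε is the offset at the probit level to reduce the probability by a fraction. From equation (A20),

$$\frac{1}{(h+\varepsilon)\sqrt{2\pi}}e^{-\frac12(h^2+2\varepsilon h+\varepsilon^2)}+O\!\left(\frac{1}{(h+\varepsilon)^2}e^{-\frac12(h+\varepsilon)^2}\right)=\frac{\alpha}{h\sqrt{2\pi}}e^{-\frac12h^2}+O\!\left(\frac{1}{h^2}e^{-\frac12h^2}\right). \tag{A21}$$

Thus,

$$\frac{1}{h+\varepsilon}e^{-\frac12(h^2+2\varepsilon h+\varepsilon^2)}=\frac{\alpha}{h}e^{-\frac12h^2}+O\!\left(\frac{1}{h^2}e^{-\frac12h^2}\right), \tag{A22}$$

which gives $\varepsilon=-\frac1h\log\alpha+o\!\left(\frac1h\right)$ or

$$\Phi\!\left(-h+\frac1h\log\alpha+o\!\left(\frac1h\right)\right)=\alpha\,\Phi(-h). \tag{A23}$$

In particular, for $\alpha=\frac12$, we have:

$$\Phi\!\left(-\left(h+\frac1h\log2+o\!\left(\frac1h\right)\right)\right)=\frac12\Phi(-h). \tag{A24}$$

Thus, for large h, we have:

$$2\Phi(-h)=\Phi\!\left(-h+\frac1h\log2+o\!\left(\frac1h\right)\right), \tag{A25}$$

as in equation (7). It may be noteworthy that for large h, a very small change at the probit level, of about $\frac1h\log2$, changes the probability by a factor of 2.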
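The tail approximation and the "factor of 2" remark can be checked numerically. The sketch below is an illustration under the assumption that `math.erfc` is accurate enough in the far tail; it computes the ratio B = Φ(−h) / [e^{−h²/2}/(h√(2π))] from equation (A17) and the ratio Φ(−h + log 2/h)/Φ(−h), which should approach 2 as h grows (eq. A25). The variable and function names are hypothetical.

```python
import math

def Phi(x):
    """Standard normal CDF via erfc, which stays accurate for large negative x."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def tail_upper(h):
    """Upper bound on Phi(-h) from eq. (A19): exp(-h^2/2) / (h * sqrt(2*pi))."""
    return math.exp(-0.5 * h * h) / (h * math.sqrt(2.0 * math.pi))

results = {}
for h in (2.0, 5.0, 10.0, 20.0):
    ratio_b = Phi(-h) / tail_upper(h)                    # B in eq. (A17), below 1
    ratio_shift = Phi(-h + math.log(2.0) / h) / Phi(-h)  # tends to 2 (eq. A25)
    results[h] = (ratio_b, ratio_shift)
    print(f"h = {h:4.1f}  B = {ratio_b:.4f}  shift ratio = {ratio_shift:.4f}")
```

At h = 2 the second ratio is still visibly above 2, which is a reminder that equation (A25) is a large-h statement.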

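A tighter lower bound for 2A than the zero of equation (A13) is:

$$2A>\frac\varphi\pi\exp\left\{-\frac{b^2(a^2k^2+1)}{2a^2(k-1)^2}\right\}, \tag{A26}$$

where $\varphi=\tan^{-1}\frac{1}{ka}$ with k > 1 (fig. 7). Thus, we have a tighter pair of bounds on ζ,

$$2\Phi(-h)-\Phi(-b)\left[1-\frac2\pi\tan^{-1}a\right]\le\zeta<2\Phi(-h)-\frac1\pi\tan^{-1}\!\left(\frac{1}{ka}\right)\exp\left\{-\frac{b^2(a^2k^2+1)}{2a^2(k-1)^2}\right\}, \tag{A27}$$

where k > 1. We write this pair of bounds as $\zeta_{L2}<\zeta<\zeta_{U2}$. We have $\Phi(-b)\le\Phi(-h)\le\zeta_{L1}\le\zeta_{L2}\le\zeta<\zeta_{U2}<\zeta_{U1}=2\Phi(-h)$. These bounds, as well as the exact value and equation (6), are plotted against b in figure 6 for a = 0.01, 0.1, 1, and 10.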
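The ordering of the bounds can be checked numerically over a small grid of a and b. The sketch below is an illustration, not the code behind figure 6: the exact value is obtained from equation (A3) by composite Simpson's rule truncated at t = 12, k = 2 is used as in the figure, and the function names and grid values are assumptions.

```python
import math

def Phi(x):
    """Standard normal CDF via erfc."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def zeta_exact(a, b, upper=12.0, steps=4000):
    """zeta = 2 * int_0^inf phi(t) Phi(a t - b) dt (eq. A3), by Simpson's rule."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        t = i * h
        total += weight * phi(t) * Phi(a * t - b)
    return 2.0 * total * h / 3.0

def zeta_bounds(a, b, k=2.0):
    """Return (zeta_L1, zeta_L2, zeta_U2, zeta_U1) from eqs. (A14), (A15), (A27)."""
    h = b / math.sqrt(1.0 + a * a)
    c = (2.0 / math.pi) * math.atan(a)
    r2 = b * b * (a * a * k * k + 1.0) / (a * a * (k - 1.0) ** 2)  # squared length OA
    z_l1 = Phi(-h) * (1.0 + c)
    z_l2 = 2.0 * Phi(-h) - Phi(-b) * (1.0 - c)
    z_u2 = 2.0 * Phi(-h) - math.atan(1.0 / (k * a)) / math.pi * math.exp(-0.5 * r2)
    z_u1 = 2.0 * Phi(-h)
    return z_l1, z_l2, z_u2, z_u1

for a in (0.01, 0.1, 1.0, 10.0):
    for b in (0.5, 1.0, 2.0, 4.0):
        z = zeta_exact(a, b)
        l1, l2, u2, u1 = zeta_bounds(a, b)
        print(f"a={a:5.2f} b={b:3.1f}  {l1:.5f} <= {l2:.5f} <= {z:.5f} <= {u2:.5f} <= {u1:.5f}")
```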

Appendix B. The asymptotics of ML species tree estimation

The proof below borrows heavily from White (1982), Dawid (2011), and Yang and Zhu (2018). Let $S_j$, $j=1,2,3$, be the three species trees with parameters $\theta_j$. Note that $S_1$ is the true model, while $S_2$ and $S_3$ are misspecified models. Let the data at m loci be $x=\{x_i\}$, $i=1,\ldots,m$. The log-likelihood function is $\ell_j(\theta_j)=\log f(x|S_j,\theta_j)$. We also define $l_j(\theta_j)=\log f(x|S_j,\theta_j)$ for one data point (that is, site-pattern counts at any single locus), $x\equiv(x_{i0},x_{i1},x_{i2},x_{i3},x_{i4})$. When the number of loci $m\to\infty$, the MLE $\hat\theta_j\to\theta_j^*$. We assume that both $\hat\theta_j$ and $\theta_j^*$ are inner points in the parameter space. Whether $\hat\theta_j$ is inside the parameter space or at its boundary should not affect the asymptotic rate of convergence. Here, $\theta_1^*$ for the true species tree $S_1$ is the true parameter value, whereas $\theta_2^*$ for $S_2$ (as well as $\theta_3^*$ for $S_3$) is the pseudotrue parameter value, which minimizes the Kullback–Leibler distance from the misspecified model $S_2$ to the true model $S_1$:

$$D_{12}=\int f(x|S_1,\theta_1^*)\log\frac{f(x|S_1,\theta_1^*)}{f(x|S_2,\theta_2^*)}\,dx=E\{l_1(\theta_1^*)-l_2(\theta_2^*)\}, \tag{A28}$$

where the expectation is over the true distribution $f(x|S_1,\theta_1^*)$. $D_{13}$ is defined similarly, with $D_{13}=D_{12}$.

We consider the log-likelihood ratio, $\ell_j(\hat\theta)-\ell_j(\theta^*)$, given the data (x) for any of the species trees j. We drop the subscript j for clarity. As in White (1982) and Dawid (2011), we define two matrices:

$$\begin{aligned}
I(\theta)&=E\{\nabla\log f(x|\theta)\cdot\nabla\log f(x|\theta)^T\}=E\{\nabla l(\theta)(\nabla l(\theta))^T\},\\
J(\theta)&=-E\{\nabla^2\log f(x|\theta)\}=-E\{\nabla^2 l(\theta)\},
\end{aligned} \tag{A29}$$

where the superscript T stands for transpose and where the expectation is over the true distribution, and $\nabla$ and $\nabla^2$ are the first and second derivatives with respect to θ.

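Apply Taylor expansion to the log likelihood around the MLE $\hat\theta$:

$$\ell(\theta)\approx\ell(\hat\theta)+\nabla\ell(\hat\theta)(\theta-\hat\theta)+\tfrac12(\theta-\hat\theta)^T\nabla^2\ell(\hat\theta)(\theta-\hat\theta), \tag{A30}$$

where both the gradient and the Hessian are evaluated at the MLE ($\hat\theta$), with $\nabla\ell(\hat\theta)=0$. Setting $\theta=\theta^*$, we have:

$$\ell(\hat\theta)\approx\ell(\theta^*)+\tfrac12(\hat\theta-\theta^*)^T\left(-\nabla^2\ell(\hat\theta)\right)(\hat\theta-\theta^*). \tag{A31}$$

Apply Taylor expansion to the derivative $\nabla\ell(\theta)$ around the MLE $\hat\theta$ and let $\theta=\theta^*$, and we have:

$$\nabla\ell(\theta)\approx\nabla^2\ell(\hat\theta)(\theta-\hat\theta), \tag{A32}$$

and

$$\hat\theta-\theta^*\approx\left[-\nabla^2\ell(\hat\theta)\right]^{-1}\nabla\ell(\theta^*). \tag{A33}$$

Each of $\nabla\ell(\hat\theta)$ and $\nabla^2\ell(\hat\theta)$ is a sum of m i.i.d. elements. When $m\to\infty$, $-\nabla^2\ell(\hat\theta)\approx-mE\{\nabla^2l(\theta^*)\}=mJ^*$, with $J^*=J(\theta^*)$ (eq. A29). Furthermore,

$$E\{\nabla\ell(\theta^*)\}=0,\qquad V\{\nabla\ell(\theta^*)\}=mV\{\nabla l(\theta^*)\}=mI^*, \tag{A34}$$

where $I^*=I(\theta^*)$ (eq. A29). Thus,

$$\sqrt m(\hat\theta-\theta^*)\xrightarrow{P}N\!\left(0,\ (J^{*-1})^TI^*(J^{*-1})\right). \tag{A35}$$

Thus, $\hat\theta=\theta^*+O_p(m^{-1/2})$. Equation (A31) becomes:

$$\ell(\hat\theta)\approx\ell(\theta^*)+\tfrac12\{\sqrt m(\hat\theta-\theta^*)\}^TJ^*\{\sqrt m(\hat\theta-\theta^*)\}=\ell(\theta^*)+O_p(1). \tag{A36}$$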

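Equations (A29–A36) apply to all three species trees. In the case of $S_1$ (the true model), $J^*=I^*$, the Fisher information matrix, and $\ell(\hat\theta)-\ell(\theta^*)\sim\tfrac12\chi_d^2$. For $S_2$ or $S_3$, $\ell(\hat\theta)-\ell(\theta^*)$ is a quadratic form of normal variates and is a mixture of noncentral $\chi^2$ variables with mean $\tfrac12\mathrm{tr}(I^*J^{*-1})$ and variance $\tfrac12\mathrm{tr}((I^*J^{*-1})^2)$, both of O(1).

Now consider using $\bar z_j\equiv\frac1m\ell_j(\hat\theta_j)$, j = 1, 2, 3, to compare species trees $S_1$, $S_2$, and $S_3$. We have:

$$\begin{aligned}
E(\bar z_j)&\approx E(l_j(\theta_j^*))\equiv\mu_j,\\
V(\bar z_j)&\approx\tfrac1mV(l_j(\theta_j^*))\equiv\tfrac1m\sigma_{jj},\\
\mathrm{Cov}(\bar z_j,\bar z_k)&\approx\tfrac1m\mathrm{Cov}(l_j(\theta_j^*),l_k(\theta_k^*))\equiv\tfrac1m\sigma_{jk}.
\end{aligned} \tag{A37}$$

Thus, when the number of loci $m\to\infty$, $\{\bar z_j\}=\{\frac1m\ell_j(\hat\theta_j)\}$ have means $(\mu_1,\mu_2,\mu_2)$ and variance/covariance matrix $\frac1m\Sigma$, where $\Sigma=\{\sigma_{jk}\}$ is O(1) and independent of m. The error of the ML method, $P\{\ell_1(\hat\theta_1)<\max(\ell_2(\hat\theta_2),\ell_3(\hat\theta_3))\}=P\{\bar z_1<\max(\bar z_2,\bar z_3)\}$, is then given by Theorem 1 as equation (11).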

References

  1. Angelis K, dos Reis M.. 2015. The impact of ancestral population size and incomplete lineage sorting on Bayesian estimation of species divergence times. Curr Zool. 61(5):874–885. [Google Scholar]
  2. Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A.. 2012. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 29(8):1917–1932. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Burgess R, Yang Z.. 2008. Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Mol Biol Evol. 25(9):1979–1994. [DOI] [PubMed] [Google Scholar]
  4. Chifman J, Kubatko L.. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317–3324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, Warnow T.. 2015. A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 16(Suppl 10):S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Dalquen D, Zhu T, Yang Z.. 2017. Maximum likelihood implementation of an isolation-with-migration model for three species. Syst Biol. 66(3):379–398. [DOI] [PubMed] [Google Scholar]
  7. Dawid A.2011. Posterior model probabilities. In: Bandyopadhyay PS, Forster M, editors. Philosophy of statistics.New York: Elsevier. p. 607–630. [Google Scholar]
  8. Degnan JH, Rosenberg NA.. 2006. Discordance of species trees with their most likely gene trees. PLoS Genet. 2(5):e68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Degnan JH, Salter LA.. 2005. Gene tree distributions under the coalescent process. Evolution 59(1):24–37. [PubMed] [Google Scholar]
  10. Edwards SV.2009. Is a new and general theory of molecular systematics emerging? Evolution 63(1):1–19. [DOI] [PubMed] [Google Scholar]
  11. Edwards SV, Xi Z, Janke A, Faircloth BC, McCormack JE, Glenn TC, Zhong B, Wu S, Lemmon EM, Lemmon AR, et al. 2016. Implementing and testing the multispecies coalescent model a valuable paradigm for phylogenomics. Mol Phylogenet Evol. 94(Pt A):447–462. [DOI] [PubMed] [Google Scholar]
  12. Fleiss JL, Levin B, Paik MC. 2003. Statistical methods for rates and proportions. 3rd ed. New York: John Wiley and Sons.
  13. Heled J, Drummond AJ. 2010. Bayesian inference of species trees from multilocus data. Mol Biol Evol. 27(3):570–580.
  14. Hudson R. 1983. Testing the constant-rate neutral allele model with protein sequence data. Evolution 37(1):203–217.
  15. Jukes T, Cantor C. 1969. Evolution of protein molecules. In: Munro H, editor. Mammalian protein metabolism. New York: Academic Press. p. 21–123.
  16. Kubatko L. 2019. The multispecies coalescent. In: Balding D, Moltke I, Marioni J, editors. Handbook of statistical genomics. 4th ed. New York: Wiley. p. 219–245.
  17. Lanier HC, Knowles LL. 2012. Is recombination a problem for species-tree analyses? Syst Biol. 61(4):691–701.
  18. Leaché AD, Oaks J. 2017. The utility of single nucleotide polymorphism (SNP) data in phylogenetics. Annu Rev Ecol Evol Syst. 48(1):69–84.
  19. Leaché AD, Rannala B. 2011. The accuracy of species tree estimation under simulation: a comparison of methods. Syst Biol. 60(2):126–137.
  20. Liu L, Pearl DK. 2007. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 56(3):504–514.
  21. Liu L, Yu L, Edwards SV. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 10(1):302.
  22. Liu L, Yu L, Pearl DK, Edwards SV. 2009. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 58(5):468–477.
  23. Lohse K, Chmelik M, Martin SH, Barton NH. 2016. Efficient strategies for calculating blockwise likelihoods under the coalescent. Genetics 202(2):775–786.
  24. Long C, Kubatko L. 2018. The effect of gene flow on coalescent-based species-tree inference. Syst Biol. 67(5):770–785.
  25. Maddison W. 1997. Gene trees in species trees. Syst Biol. 46(3):523–536.
  26. Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30(17):i541–i548.
  27. Nichols R. 2001. Gene trees and species trees are not the same. Trends Ecol Evol. 16(7):358–364.
  28. Ogilvie HA, Bouckaert RR, Drummond AJ. 2017. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Mol Biol Evol. 34(8):2101–2114.
  29. Pamilo P, Nei M. 1988. Relationships between gene trees and species trees. Mol Biol Evol. 5(5):568–583.
  30. Rannala B, Edwards S, Leaché AD, Yang Z. 2020. The multispecies coalescent model and species tree inference. In: Scornavacca C, Delsuc F, Galtier N, editors. Phylogenetics in the genomic era. Book Section 3.3. No Commercial Publisher. p. 1–20.
  31. Rannala B, Yang Z. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164(4):1645–1656.
  32. Rannala B, Yang Z. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Syst Biol. 66(5):823–842.
  33. Roch S, Steel M. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 100:56–62.
  34. Shi C, Yang Z. 2018. Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of gibbons. Mol Biol Evol. 35(1):159–179.
  35. Susko E. 2011. Large sample approximations of probabilities of correct evolutionary tree estimation and biases of maximum likelihood estimation. Stat Appl Genet Mol Biol. 10(1):10.
  36. Szöllősi GJ, Tannier E, Daubin V, Boussau B. 2015. The inference of gene trees with species trees. Syst Biol. 64(1):e42–e62.
  37. Takahata N, Satta Y, Klein J. 1995. Divergence time and population size in the lineage leading to modern humans. Theor Popul Biol. 48(2):198–221.
  38. Tian Y, Kubatko LS. 2016. Distribution of coalescent histories under the coalescent model with gene flow. Mol Phylogenet Evol. 105:177–192.
  39. Tiley GP, Poelstra JP, dos Reis M, Yang Z, Yoder AD. 2020. Molecular clocks without rocks: new solutions for old problems. Trends Genet. 36(11):845–856.
  40. White H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50(1):1–25.
  41. Wu Y. 2012. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66(3):763–775.
  42. Xu B, Yang Z. 2016. Challenges in species tree estimation under the multispecies coalescent model. Genetics 204(4):1353–1368.
  43. Yang Z. 1994a. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 39(3):306–314.
  44. Yang Z. 1994b. Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Syst Biol. 43(3):329–342.
  45. Yang Z. 1996. Phylogenetic analysis using parsimony and likelihood methods. J Mol Evol. 42(2):294–307.
  46. Yang Z. 1997. How often do wrong models produce better phylogenies? Mol Biol Evol. 14(1):105–108.
  47. Yang Z. 2000. Complexity of the simplest phylogenetic estimation problem. Proc R Soc Lond B. 267(1439):109–116.
  48. Yang Z. 2002. Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci. Genetics 162(4):1811–1823.
  49. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24(8):1586–1591.
  50. Yang Z. 2014. Molecular evolution: a statistical approach. Oxford (England): Oxford University Press.
  51. Yang Z. 2015. The BPP program for species tree estimation and species delimitation. Curr Zool. 61(5):854–865.
  52. Yang Z, Rannala B. 2014. Unguided species delimitation using DNA sequence data from multiple loci. Mol Biol Evol. 31(12):3125–3135.
  53. Yang Z, Rodríguez CE. 2013. Searching for efficient Markov chain Monte Carlo proposal kernels. Proc Natl Acad Sci USA. 110(48):19307–19312.
  54. Yang Z, Zhu T. 2018. Bayesian selection of misspecified models is overconfident and may cause spurious posterior probabilities for phylogenetic trees. Proc Natl Acad Sci USA. 115(8):1854–1859.
  55. Zharkikh A, Li W-H. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol Biol Evol. 9:1119–1147.
  56. Zhu T, Yang Z. 2012. Maximum likelihood implementation of an isolation-with-migration model with three species for testing speciation with gene flow. Mol Biol Evol. 29(10):3131–3142.

Associated Data

Supplementary Materials

msab009_Supplementary_Data

Data Availability Statement

The C program for simulating under the MSC model with 3 species and 3 sequences is available from the authors upon request.


Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press
