Complexity of the simplest species tree problem

Tianqi Zhu; Ziheng Yang

doi:10.1093/molbev/msab009

2021 Jan 25;38(9):3993–4009. doi: 10.1093/molbev/msab009

Editor: Bing Su

PMCID: PMC8382899 PMID: 33492385

Abstract

The multispecies coalescent model provides a natural framework for species tree estimation accounting for gene-tree conflicts. Although a number of species tree methods under the multispecies coalescent have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here, we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood, and maximum likelihood. We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case, major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes, whereas these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.

Keywords: concatenation, efficiency, molecular clock, MSC, multispecies coalescent, species tree

Introduction

The multispecies coalescent (MSC) model (Rannala and Yang 2003) combines the phylogenetic process of species divergences with the population genetic process of coalescent and naturally accommodates “delayed coalescence” (also known as “incomplete lineage sorting,” Maddison 1997), the phenomenon in which gene sequences fail to coalesce in their most recent common ancestor but do so only in more ancient ancestors. Delayed coalescence causes the gene tree for a gene or genomic region to differ from the species tree and is the most important factor for gene-tree–species-tree discordance (Maddison 1997; Nichols 2001; Szöllősi et al. 2015). The MSC provides a natural framework for estimating species trees accounting for genealogical heterogeneity among genes or across the genome (Edwards 2009; Xu and Yang 2016; Kubatko 2019; Rannala et al. 2020).

Two lines of research into the MSC have provided the foundation for species tree methods. The first concerns the probabilities of different gene tree topologies (Hudson 1983; Pamilo and Nei 1988) and algorithms for their efficient calculation given the species tree (Degnan and Salter 2005; Degnan and Rosenberg 2006). The gene tree distribution can be used in the two-step method of species tree estimation, by inferring gene trees for the individual loci and then applying maximum likelihood (ML) to counts of gene tree topologies (as in stells, Wu 2012). Nevertheless, widely used two-step methods, including astral (Mirarab et al. 2014) and mp-est (Liu et al. 2010), are simpler, and estimate species trees for species triplets (assuming the molecular clock) or quartets (without the clock) and then assemble the subtrees to produce a species-tree estimate for all species. Studies of gene-tree probabilities led to the discovery of the “anomaly zone,” the region of the parameter space in which the most probable gene tree has a different topology from the species tree (Degnan and Salter 2005; Degnan and Rosenberg 2006). In the anomaly zone, the two-step method, which uses the most common gene tree as the species tree estimate, will be inconsistent.

The second line of research into MSC is the development of the joint probability distribution of the gene tree and coalescent times (Rannala and Yang 2003). This forms the basis for exact methods of inference, including ML (Yang 2002; Dalquen et al. 2017) and Bayesian methods (Liu and Pearl 2007; Heled and Drummond 2010; Yang and Rannala 2014; Ogilvie et al. 2017; Rannala and Yang 2017). Although heuristic methods use summaries of the data, exact methods use the multilocus sequence alignments directly and naturally accommodate phylogenetic reconstruction errors and uncertainties (Xu and Yang 2016; Kubatko 2019; Rannala et al. 2020).

Simulation has been used to examine the performance of different species-tree methods (e.g., Leaché and Rannala 2011; Mirarab et al. 2014; Chou et al. 2015; Xu and Yang 2016). A limitation of simulation is that it can examine only a small portion of the parameter space and the results often have limited applicability. Analytical results on the efficiency of different methods have been lacking. Here, we analyze species tree estimation under the MSC in the case of three species, with one sequence from each species per locus. We focus on closely related species and assume the JC mutation model (Jukes and Cantor 1969) and the molecular clock. We are in particular interested in the efficiency of the various methods, measured by the probability of recovering the correct species tree.

We consider four inference methods: 1) ML (a full likelihood method under the MSC applied to the multilocus sequence alignments), 2) 2-step (or majority-vote), 3) concatenation (concat), and 4) independent-sites ML (isml, also known as coalescent-aware concatenation or concat) (Xu and Yang 2016). ML is the full-likelihood method and calculates the likelihood function using the multilocus sequence alignments or a sufficient summary. The 2-step method estimates the gene tree at each locus and then uses the most common gene tree as the species tree estimate. It does not account for the uncertainties in the estimated gene trees. For the case of three species considered here, 2-step is equivalent to the maximum pseudolikelihood method (mp-est) (Liu et al. 2010). Concatenation applies ML to the concatenated sequences, assuming that the same tree underlies all sites in the super alignment. In the case considered here, concatenation is equivalent to steac (Liu et al. 2009), which uses average coalescent times over loci as data to infer a gene tree, which is the species tree estimate. Isml (or concat) estimates the species tree by ML under the assumption that all sites, both from the same locus and from different loci, have independent gene trees (Xu and Yang 2016). This was suggested as an improvement to SVDQuartets of Chifman and Kubatko (2014). All four methods considered here use ML, but the likelihood function is applied to different summaries of the same data. Here, we refer to the full-likelihood or full-data method as the ML method, whereas all other methods (2-step, concatenation, and isml) are considered heuristic summary methods: 2-step uses the (estimated) gene tree topologies, whereas concatenation and isml use the site-pattern counts pooled across loci. We derive approximations to the error rate of species tree estimation by the different methods and assess their accuracy. 
We use the theory to characterize the differences in the use of information in the data by different methods.

Results

Multispecies Coalescent in the Case of Three Species

For three species A, B, and C, there are three possible species trees: $S_{1} = ((A B) C)$, $S_{2} = ((B C) A)$, and $S_{3} = ((C A) B)$, each with two divergence times (τ₀ and τ₁) and two population sizes (θ₀ and θ₁) (fig. 1a). Both τs and θs are measured by the expected number of mutations per site. For each species, the population size parameter is $θ = 4 N μ$, where N is the (effective) population size and μ is the mutation rate per site per generation. We consider only one sequence from each species, so that θs for the modern species are not considered. The parameters have different interpretations in different species trees: in S₁, the two ancestral species are AB and ABC, so the parameter vector is $θ_{1} = \{τ_{0}, τ_{1}, θ_{0}, θ_{1}\} = \{τ_{ABC}, τ_{A B}, θ_{ABC}, θ_{A B}\}$ (here $θ_{1}$ on the left denotes the full parameter vector for S₁, not the scalar population size θ₁).

Fig. 1. — (a) The three species trees ($S_{1}, S_{2}, S_{3}$) for three species ($A, B, C$) and the parameters in each MSC model. (b) The possible gene trees with coalescent times (t₀, t₁) for a locus with three sequences (a, b, c) given the species tree S₁. The probabilities for the gene trees are shown above them, where $ϕ = e^{- \frac{2}{θ_{1}} (τ_{0} - τ_{1})}$ is the probability that a and b do not coalesce in population AB or over the time interval (τ₁, τ₀). Note that if the species tree is S₂ (or S₃), it will be possible for sequences b and c (or c and a) to coalesce in the time interval (τ₁, τ₀).

At each locus, three sequences (a, b, and c) are sampled, one from each species. They are related through a gene tree. The three possible gene trees are $G_{1} = ((a b) c), G_{2} = ((b c) a)$ , and $G_{3} = ((c a) b)$ , with probabilities:

$\begin{matrix} P (G_{1} | S_{1}, θ_{1}) = 1 - \frac{2}{3} ϕ, \\ P (G_{2} | S_{1}, θ_{1}) = P (G_{3} | S_{1}, θ_{1}) = \frac{1}{3} ϕ, \end{matrix}$ (1)

where $ϕ$ = $e^{- 2 (τ_{ABC} - τ_{A B}) / θ_{A B}}$ is the probability that sequences a and b do not coalesce in population AB so that all three sequences enter the ancestor ABC and the three gene trees occur with equal probability (fig. 1b) (Hudson 1983). Here, $2 (τ_{ABC} - τ_{A B}) / θ_{A B}$ is known as the internal branch length in coalescent units, as the average coalescent time in population AB is $2 N_{A B}$ generations or $θ_{A B} / 2$ mutations per site.
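As an illustration only (not part of the original analysis), equation (1) can be evaluated numerically with a few lines of Python. The function name `gene_tree_probs` is hypothetical; the parameter values are those used for table 1, and the printed values agree with the $n = \infty$ column of that table.

```python
import math

def gene_tree_probs(tau0, tau1, theta1):
    """Return (P(G1), P(G2), P(G3)) given species tree S1, following eq. (1).

    tau0, tau1: divergence times tau_ABC and tau_AB (mutations per site)
    theta1:     population size parameter theta_AB
    """
    # phi: probability that a and b do not coalesce in population AB
    phi = math.exp(-2.0 * (tau0 - tau1) / theta1)
    p_mismatch = phi / 3.0
    return 1.0 - 2.0 * p_mismatch, p_mismatch, p_mismatch

probs = gene_tree_probs(tau0=0.02, tau1=0.019, theta1=0.05)
print([round(p, 5) for p in probs])  # [0.35947, 0.32026, 0.32026]
```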

For locus i, let $t_{i} = {t_{i 0}, t_{i 1}}$ be the coalescent times (node ages) on the gene tree (fig. 1b). The joint MSC density for the gene tree and coalescent times given species tree S₁ and parameters $θ_{1}$ is then:

$\begin{matrix} f (G_{1 a}, t_{i} | S_{1}, θ_{1}) = \frac{2}{θ_{1}} e^{- \frac{2}{θ_{1}} (t_{i 1} - τ_{1})} \cdot \frac{2}{θ_{0}} e^{- \frac{2}{θ_{0}} (t_{i 0} - τ_{0})}, \quad τ_{1} < t_{i 1} < τ_{0}, \; t_{i 0} > τ_{0}, \\ f (G_{k}, t_{i} | S_{1}, θ_{1}) = e^{- \frac{2}{θ_{1}} (τ_{0} - τ_{1})} \times \frac{2}{θ_{0}} \cdot \frac{2}{θ_{0}} e^{- \frac{6}{θ_{0}} (t_{i 1} - τ_{0}) - \frac{2}{θ_{0}} (t_{i 0} - t_{i 1})}, \quad t_{i 1} > τ_{0}, \; t_{i 0} > t_{i 1}, \end{matrix}$ (2)

for $k = 1 b, 2, 3$ (Takahata et al. 1995; Yang 2002). The probability densities for S₂ and S₃ are given similarly.
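The density in equation (2) corresponds to a simple generative process, which the following Python sketch simulates for species tree S₁. It is a minimal illustration under stated assumptions, not the simulation program used in the paper: the function name `sample_gene_tree` and the string labels are hypothetical, and waiting times are drawn as exponentials with rates 2/θ (two lineages) or 6/θ (three lineages).

```python
import random

def sample_gene_tree(tau0, tau1, theta0, theta1, rng):
    """Draw (label, t0, t1) for one locus under species tree S1 (eq. 2)."""
    # a and b may coalesce in population AB at rate 2/theta1 during (tau1, tau0)
    t1 = tau1 + rng.expovariate(2.0 / theta1)
    if t1 < tau0:
        # G1a: the (a, b) ancestor and c coalesce in ABC at rate 2/theta0
        t0 = tau0 + rng.expovariate(2.0 / theta0)
        return "G1a", t0, t1
    # All three lineages enter ABC: the first coalescence occurs at total
    # rate 6/theta0 with the pair chosen uniformly at random, and the two
    # remaining lineages then coalesce at rate 2/theta0
    t1 = tau0 + rng.expovariate(6.0 / theta0)
    t0 = t1 + rng.expovariate(2.0 / theta0)
    return rng.choice(["G1b", "G2", "G3"]), t0, t1

rng = random.Random(2021)  # arbitrary seed, for reproducibility only
draws = [sample_gene_tree(0.02, 0.019, 0.01, 0.05, rng) for _ in range(20000)]
freq_g1 = sum(label in ("G1a", "G1b") for label, _, _ in draws) / len(draws)
print(f"simulated frequency of gene tree G1: {freq_g1:.3f}")
```

With enough replicates the simulated frequency of G₁ should approach $1 - \frac{2}{3} ϕ$ from equation (1).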

The data consist of sequence alignments at m loci. Under the JC mutation model, the data at locus i can be summarized as counts of five site patterns: xxx, xxy, yxx, xyx, and xyz, where x, y, and z are any three distinct nucleotides. Let those counts be $x_{i} = \{x_{i 0}, x_{i 1}, x_{i 2}, x_{i 3}, x_{i 4}\}$, with $\sum_{j = 0}^{4} x_{i j} = n$, where n is the number of sites (sequence length) at each locus. Let $f_{i j} = x_{i j} / n$ be the site-pattern frequencies, and let the data at all m loci be $x = \{x_{i}\}$.
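To make the data summary concrete, the sketch below counts the five site patterns in a three-sequence alignment. The function name `site_pattern_counts` and the toy sequences are hypothetical, and the sketch assumes equal-length sequences (in the order a, b, c) without gaps or ambiguity codes.

```python
def site_pattern_counts(seq_a, seq_b, seq_c):
    """Count the patterns (xxx, xxy, yxx, xyx, xyz) in a three-sequence alignment."""
    counts = [0, 0, 0, 0, 0]
    for a, b, c in zip(seq_a, seq_b, seq_c):
        if a == b == c:
            counts[0] += 1  # xxx: all three identical
        elif a == b:
            counts[1] += 1  # xxy: c differs
        elif b == c:
            counts[2] += 1  # yxx: a differs
        elif a == c:
            counts[3] += 1  # xyx: b differs
        else:
            counts[4] += 1  # xyz: all three distinct
    return counts

# Toy alignment of n = 6 sites (made-up data)
print(site_pattern_counts("ACGTAG", "ACGACT", "ACTAGG"))  # [2, 1, 1, 1, 1]
```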

Given the gene tree and coalescent times at locus i, the probability of the sequence data, $f (x_{i} | G_{i}, t_{i})$ , is given by the multinomial distribution for the five site patterns. For example, given gene tree G₁ with node ages $t_{i 0}$ and $t_{i 1}$ (fig. 1b), the site-pattern probabilities, $p_{i} = {p_{i 0}, p_{i 1}, p_{i 2}, p_{i 3}, p_{i 4}}$ , are as follows:

$\begin{matrix} p_{i 0} = P (xxx | G_{1}, t_{i}) = \frac{1}{16} (1 + 3 v^{2} + 6 u + 6 u v), \\ p_{i 1} = P (xxy | G_{1}, t_{i}) = \frac{1}{16} (3 + 9 v^{2} - 6 u - 6 u v), \\ p_{i 2} = P (yxx | G_{1}, t_{i}) = \frac{1}{16} (3 - 3 v^{2} + 6 u - 6 u v), \\ p_{i 3} = P (xyx | G_{1}, t_{i}) = p_{i 2}, \\ p_{i 4} = P (xyz | G_{1}, t_{i}) = \frac{1}{16} (6 - 6 v^{2} - 12 u + 12 u v), \end{matrix}$ (3)

where $u = e^{- 8 t_{i 0} / 3}$ and $v = e^{- 4 t_{i 1} / 3}$ (Yang 1994b). Note that $p_{i 1} > p_{i 2} = p_{i 3}$ as $t_{i 0} > t_{i 1}$ . The probabilities for gene trees G₂ or G₃ are given by symmetry. Then the sequence data or the five site-pattern counts at the locus have the multinomial probabilities:

$\begin{matrix} f (x_{i} | G_{1}, t_{i}) = p_{i 0}^{x_{i 0}} p_{i 1}^{x_{i 1}} p_{i 2}^{x_{i 2} + x_{i 3}} p_{i 4}^{x_{i 4}}, \\ f (x_{i} | G_{2}, t_{i}) = p_{i 0}^{x_{i 0}} p_{i 1}^{x_{i 2}} p_{i 2}^{x_{i 3} + x_{i 1}} p_{i 4}^{x_{i 4}}, \\ f (x_{i} | G_{3}, t_{i}) = p_{i 0}^{x_{i 0}} p_{i 1}^{x_{i 3}} p_{i 2}^{x_{i 1} + x_{i 2}} p_{i 4}^{x_{i 4}} . \end{matrix}$ (4)
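Equations (3) and (4) translate directly into code. The sketch below is illustrative only: the function names and the site-pattern counts in the example are hypothetical, and, as in equation (4), the multinomial coefficient (which is the same for all gene trees) is omitted.

```python
import math

def site_pattern_probs(t0, t1):
    """JC probabilities (p0, p1, p2, p3, p4) for gene tree G1 with node ages t0 > t1 (eq. 3)."""
    u = math.exp(-8.0 * t0 / 3.0)
    v = math.exp(-4.0 * t1 / 3.0)
    p0 = (1 + 3 * v * v + 6 * u + 6 * u * v) / 16
    p1 = (3 + 9 * v * v - 6 * u - 6 * u * v) / 16
    p2 = (3 - 3 * v * v + 6 * u - 6 * u * v) / 16
    p4 = (6 - 6 * v * v - 12 * u + 12 * u * v) / 16
    return p0, p1, p2, p2, p4

def log_lik_gene_trees(x, t0, t1):
    """Log of eq. (4) for gene trees G1, G2, G3, with x = (x0, x1, x2, x3, x4)."""
    p0, p1, p2, _, p4 = site_pattern_probs(t0, t1)
    common = x[0] * math.log(p0) + x[4] * math.log(p4)
    result = []
    for k in (1, 2, 3):  # x[k] is the count of the pattern favoured by G_k
        others = sum(x[1:4]) - x[k]
        result.append(common + x[k] * math.log(p1) + others * math.log(p2))
    return result

# Made-up counts for one locus of n = 1,000 sites
print(log_lik_gene_trees((900, 40, 30, 25, 5), t0=0.03, t1=0.01))
```

For fixed node ages, the gene tree whose favoured pattern has the largest count receives the highest value, because $p_{i 1} > p_{i 2}$.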

The ML Method of Species Tree Estimation

The log-likelihood function for species tree S₁ with parameters $θ_{1}$ is given by summing over the gene trees and integrating over the coalescent times.

$ℓ_{1} (θ_{1}) = \sum_{i = 1}^{m} \log f (x_{i} | S_{1}, θ_{1}) = \sum_{i = 1}^{m} \log \left\{ \sum_{G_{i}} \int f (G_{i}, t_{i} | S_{1}, θ_{1}) f (x_{i} | G_{i}, t_{i}) \, d t_{i} \right\},$ (5)

where $f (G_{i}, t_{i} | S_{1}, θ_{1})$ is the MSC density for the gene tree and coalescent times at locus i (eq. 2), and $f (x_{i} | G_{i}, t_{i})$ is the probability of the sequence data at locus i given the gene tree (eq. 4). The log likelihood functions, $ℓ_{2} (θ_{2})$ and $ℓ_{3} (θ_{3})$ , for S₂ (with parameters $θ_{2}$ ) and S₃ (with $θ_{3}$ ) are defined similarly.

Maximizing the log-likelihood function (eq. 5) with respect to the parameters will lead to a log-likelihood value for the given species tree, and the species tree that achieves the highest $ℓ$ is the ML species tree. The maximization is not analytically tractable. The program 3s implements the method by explicitly summing over the gene trees ($G_{i}$) and by using Gaussian quadrature to calculate the 2D integrals over $t_{i}$ (eq. 5) (Yang 2002; Zhu and Yang 2012; Dalquen et al. 2017). This program is used in the simulations.

We present two theorems for approximating the error in species tree estimation.

Theorem 1.

(a) Suppose $z_{i} = {(z_{i 1}, z_{i 2}, z_{i 3})}^{T}, i = 1, \dots, m$, are an independent and identically distributed (i.i.d.) sample of size m from a distribution with means $μ = {(μ_{1}, μ_{2}, μ_{2})}^{T}$, with $Δ μ = μ_{1} - μ_{2} > 0$, and variance–covariance matrix $Σ = \{σ_{j k}\}$, where $σ_{11} = σ_{1}^{2}, σ_{12} = σ_{13} = ρ_{12} σ_{1} σ_{2}, σ_{22} = σ_{33} = σ_{2}^{2}$ and $σ_{23} = ρ_{23} σ_{2}^{2}$. Let $\bar{z} = {({\bar{z}}_{1}, {\bar{z}}_{2}, {\bar{z}}_{3})}^{T}$ be the sample means, with ${\bar{z}}_{j} = \frac{1}{m} \sum_{i = 1}^{m} z_{i j}, j = 1, 2, 3$. For large m, $\bar{z} \sim N_{3} (μ, \frac{1}{m} Σ)$. Let $ζ = P \{ {\bar{z}}_{1} < \max ({\bar{z}}_{2}, {\bar{z}}_{3}) \}$. Then

$\begin{matrix} ζ & \approx Φ (\frac{- Δ μ \sqrt{m} + \sqrt{\frac{1}{π} (σ_{2}^{2} - σ_{23})}}{\sqrt{σ_{1}^{2} - 2 σ_{12} + σ_{2}^{2} - \frac{1}{π} (σ_{2}^{2} - σ_{23})}}) \\ \equiv ζ_{N} (m, Δ μ, σ_{1}^{2}, σ_{2}^{2}, σ_{12}, σ_{23}), \end{matrix}$ (6)

where $Φ$ is the cumulative distribution function (CDF) for the normal distribution $N (0, 1)$ . We also write ζ_N as $ζ_{N} (m, Δ μ, Σ)$ .

(b) Let $a = s_{2} / s_{1}$ and $b = Δ μ \sqrt{m} / s_{1}$ , with $s_{1}^{2} = σ_{1}^{2} - 2 σ_{12} + σ_{2}^{2} - \frac{1}{2} (σ_{2}^{2} - σ_{23})$ and $s_{2}^{2} = \frac{1}{2} (σ_{2}^{2} - σ_{23})$ . Then ζ is bounded by:

$\begin{matrix} Φ (- h) (1 + \frac{2}{π} {tan}^{- 1} a) & \leq ζ < 2 Φ (- h) \\ = Φ (- h + \frac{1}{h} log 2 + o (\frac{1}{h})), \end{matrix}$ (7)

where $h = \frac{b}{\sqrt{1 + a^{2}}}$. The equality for the lower bound holds when $h = b = 0$. We write those bounds as $ζ_{L 1} \leq ζ < ζ_{U 1}$, so that $Φ (- h) \leq ζ_{L 1} \leq ζ < ζ_{U 1} = 2 Φ (- h)$.

Proof.

A proof is given in Appendix A, in which we discuss alternative approximations and also give a tighter pair of bounds $(ζ_{L 2}, ζ_{U 2})$ in equation (A27), with $ζ_{L 1} < ζ_{L 2} < ζ < ζ_{U 2} < ζ_{U 1}$. □

In this paper, ζ represents the error probability of species tree estimation. Thus, the bounds $Φ (- h) \leq ζ < 2 Φ (- h)$ suggest that when $m \to \infty$ , the probit transform of the species-tree error probability, $Φ^{- 1} (ζ)$ , where $Φ^{- 1}$ is the inverse CDF of $N (0, 1)$ , decreases linearly with $\sqrt{m}$ . For practical calculations for finite m in this paper, equation (6) is more accurate (see Appendix A) and will be used later.
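The approximation of equation (6) and the bounds of equation (7) are straightforward to evaluate. The following Python sketch is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the standard normal CDF is computed from `math.erf`. As a sanity check, for three i.i.d. unit-variance variables with $Δ μ = 0$ the exact error is 2/3 by symmetry, which the lower bound reproduces (the case $h = b = 0$) and which equation (6) approximates closely.

```python
import math

def std_normal_cdf(x):
    """CDF of N(0, 1), written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def zeta_N(m, dmu, var1, var2, cov12, cov23):
    """Normal approximation to P{z1_bar < max(z2_bar, z3_bar)}, eq. (6)."""
    shift = math.sqrt((var2 - cov23) / math.pi)
    scale = math.sqrt(var1 - 2 * cov12 + var2 - (var2 - cov23) / math.pi)
    return std_normal_cdf((-dmu * math.sqrt(m) + shift) / scale)

def zeta_bounds(m, dmu, var1, var2, cov12, cov23):
    """Lower and upper bounds (zeta_L1, zeta_U1) of eq. (7)."""
    s2_sq = 0.5 * (var2 - cov23)
    s1_sq = var1 - 2 * cov12 + var2 - s2_sq
    a = math.sqrt(s2_sq / s1_sq)
    b = dmu * math.sqrt(m) / math.sqrt(s1_sq)
    h = b / math.sqrt(1 + a * a)
    lower = std_normal_cdf(-h) * (1 + 2 / math.pi * math.atan(a))
    return lower, 2 * std_normal_cdf(-h)

# Symmetric check: no mean difference, unit variances, no correlation
print(zeta_N(1, 0.0, 1.0, 1.0, 0.0, 0.0))
print(zeta_bounds(1, 0.0, 1.0, 1.0, 0.0, 0.0))
```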

Corollary 2. Let ($y_{0}, y_{1}, y_{2}, y_{3}$) be random variables from the multinomial distribution MN(m, q₀, q₁, q₂, q₃), with $q_{0} = 1 - q_{1} - q_{2} - q_{3}$, $q_{1} > q_{2} = q_{3}$, and $Δ q = q_{1} - q_{2} > 0$. Then $P \{ y_{1} < \max (y_{2}, y_{3}) \}$ can be approximated by:

$ζ (m, q_{1}, q_{2}) = Φ \left( \frac{- Δ q \sqrt{m} + \sqrt{\frac{q_{2}}{π}}}{\sqrt{q_{1} + q_{2} - {(Δ q)}^{2} - \frac{q_{2}}{π}}} \right),$ (8)

$ζ_{ZLY} (m, q_{1}, q_{2}) = Φ \left( \frac{- Δ q \sqrt{m - \frac{1}{Δ q}} + \sqrt{\frac{q_{2}}{π}}}{\sqrt{q_{1} + q_{2} - \frac{q_{2}}{π}}} \right) .$ (9)

Proof.

Let ${\bar{z}}_{j} = y_{j} / m, j = 1, 2, 3$, be the observed frequencies. We have $σ_{j j} = q_{j} (1 - q_{j})$ and $σ_{j k} = - q_{j} q_{k}$ for $j \neq k$. Then equation (8) follows from equation (6) in Theorem 1. The form $ζ_{ZLY}$, an alternative to equation (8), is from Yang (1996, eq. 3), based on Zharkikh and Li (1992, eq. 20). This applies the term $1 / Δ q$ to correct for discontinuity (Fleiss et al. 2003) and ignores correlations between y₁, y₂, and y₃ as well as some terms of small probabilities. The discontinuity correction does not appear to be useful. If $m ≫ 1 / Δ q$, both forms, with and without the discontinuity correction, are very close. □
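Equations (8) and (9) can be checked against table 1 with a short Python sketch. The function names are hypothetical, and returning NaN when $m \leq 1 / Δ q$ (where the square root in eq. 9 is undefined) is a choice made here for illustration, mirroring the "NA" entries of the table.

```python
import math

def std_normal_cdf(x):
    """CDF of N(0, 1), written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def zeta_multinomial(m, q1, q2):
    """Eq. (8): approximate P{y1 < max(y2, y3)} for MN(m, q0, q1, q2, q2)."""
    dq = q1 - q2
    num = -dq * math.sqrt(m) + math.sqrt(q2 / math.pi)
    den = math.sqrt(q1 + q2 - dq * dq - q2 / math.pi)
    return std_normal_cdf(num / den)

def zeta_zly(m, q1, q2):
    """Eq. (9): the alternative form with the discontinuity correction."""
    dq = q1 - q2
    if m <= 1.0 / dq:
        return float("nan")  # illustrative convention: eq. (9) is inapplicable
    num = -dq * math.sqrt(m - 1.0 / dq) + math.sqrt(q2 / math.pi)
    den = math.sqrt(q1 + q2 - q2 / math.pi)
    return std_normal_cdf(num / den)

# Gene-tree probabilities from the n = infinity column of table 1, m = 1,000
print(round(zeta_multinomial(1000, 0.35947, 0.32026), 3))  # 0.113
print(round(zeta_zly(1000, 0.35947, 0.32026), 3))          # 0.117
```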

The error rate for the ML method (eq. 5) is analyzed in Appendix B. When the number of loci $m \to \infty$, the MLE ${\hat{θ}}_{j} \to θ_{j}^{*}$ in species tree $S_{j}$, j = 1, 2, 3. Note that S₁ represents the true model and $θ_{1}^{*}$ are the true parameter values, while S₂ and S₃ are misspecified models and $θ_{2}^{*}$ and $θ_{3}^{*}$ are the “best-fitting” or “pseudotrue” parameter values. The Kullback–Leibler distance D₁₂ from S₂ to S₁ is:

$\begin{matrix} D_{12} = \int f (x | S_{1}, θ_{1}^{*}) log \frac{f (x | S_{1}, θ_{1}^{*})}{f (x | S_{2}, θ_{2}^{*})} dx \\ = E (l_{1} (θ_{1}^{*})) - E (l_{2} (θ_{2}^{*})), \end{matrix}$ (10)

where $l_{j} (θ_{j}^{*}) \equiv \log f (x | S_{j}, θ_{j}^{*})$, x is one data point (the site-pattern counts at one locus), and the integral denotes summation over all possible data outcomes at a locus. We use the per-locus log-likelihood values to compare the three species trees: ${\bar{z}}_{j} \equiv \frac{1}{m} ℓ_{j} ({\hat{θ}}_{j})$, j = 1, 2, 3. When m is large, these have the means $E ({\bar{z}}_{j}) \approx E (l_{j} (θ_{j}^{*})) \equiv μ_{j}$, with $μ_{1} - μ_{2} = D_{12}$, and the variance matrix $\frac{1}{m} Σ$, where $Σ = \{σ_{j k}\}$ and $σ_{j k} \equiv Cov (l_{j} (θ_{j}^{*}), l_{k} (θ_{k}^{*}))$. The error of the ML method, $e_{M L} = P \{ ℓ_{1} ({\hat{θ}}_{1}) < \max (ℓ_{2} ({\hat{θ}}_{2}), ℓ_{3} ({\hat{θ}}_{3})) \}$, is then given by Theorem 1 as:

$e_{M L} = P {{\bar{z}}_{1} < \max ({\bar{z}}_{2}, {\bar{z}}_{3})} \approx ζ_{N} (m, D_{12}, Σ) .$ (11)

Equation (11) cannot be used to calculate the error rate for ML as D₁₂ and σ_jk are not easily computable. It predicts a linear relationship between $Φ^{- 1} (e_{ML})$ and $\sqrt{m}$ . This is confirmed by simulation (fig. 2a′–c′).

Precise results may be obtained in special cases. In the case of one locus (m = 1), the ML gene tree is the ML species tree except for rare data sets: the true species tree S₁ is recovered if $x_{i 1} > \max (x_{i 2}, x_{i 3})$. In rare data sets of extreme divergence, even if $x_{i 1} > \max (x_{i 2}, x_{i 3})$, ties among gene trees are possible, with the star tree being as good as the binary trees (Yang 2000). One such data set is $x_{i} = (4, 13, 12, 11, 50)$, for which the three gene trees as well as the star tree achieve the same likelihood, whereas ML under the MSC favors S₁. However, such data sets, which involve sequences more divergent than random sequences, have vanishingly small probability when n is large. Thus, we ignore them and consider all methods to be equivalent when m = 1. With one locus, it is impossible to identify all parameters in the MSC model: there are four parameters and only three independent site-pattern frequencies ($f_{i 0}, f_{i 1}, f_{i 2} + f_{i 3}$ for S₁, for example).

The case of one site per locus (n = 1) is analyzed later in the section on isml. Numerical calculations on a model species tree are presented in table 1. They will be discussed later in comparison with other methods.

In the case of $n \to \infty$, the gene tree (including the coalescent times) at each locus is given without errors. The likelihood is then the product of MSC densities of gene trees across the loci (eq. 2). This likelihood has singularities, with one or more species trees achieving infinite likelihood (Liu et al. 2010; Yang 2014). In the case of three species considered here, only one species tree (given by the smallest coalescent time) achieves infinite likelihood and will be the unique species-tree estimate, so that the estimation can proceed despite the singularity (Yang 2014, p. 360, Problem 9.4). Let the smallest coalescent (sequence divergence) times across all loci for the three pairs of species be $t_{a b}$, $t_{b c}$, and $t_{c a}$. If $t_{a b}$ is the smallest among the three, then species tree S₁ achieves infinite likelihood, by collapsing on the coalescent time $t_{a b}$; that is, $ℓ_{1} ({\hat{θ}}_{1}) \to \infty$ as ${\hat{τ}}_{0} = {\hat{τ}}_{1} = t_{a b}$ and ${\hat{θ}}_{1} \to 0$ (see eq. 2) (Yang 2014, pp. 338–339), whereas the other two species trees have finite likelihood.

Given S₁ as the true species tree, both t_bc and t_ca are $> τ_{ABC}$ (fig. 1b). If sequences a and b coalesce in population AB at any of the m loci, t_ab will be smaller than both t_bc and t_ca, and S₁ will be the ML species tree. Thus, an incorrect species tree is inferred only if a and b do not coalesce in AB at any of the m loci and are not the first to coalesce in the root population ABC. Thus,

$e_{M L, \infty} = ϕ^{m} \times \frac{2}{3},$ (12)

where $ϕ = e^{- \frac{2}{θ_{A B}} (τ_{A B C} - τ_{A B})}$ is the probability that a and b do not coalesce in population AB. This equation is exact and applies to both small and large m (fig. 3b).
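Equation (12) is simple enough to evaluate directly. The sketch below is an illustration with a hypothetical function name, using the parameter values of figures 2 and 3; it shows how quickly the ML error decays with the number of loci when gene trees are known without error.

```python
import math

def ml_error_infinite_sites(m, tau0, tau1, theta1):
    """Eq. (12): species-tree error of ML when n -> infinity."""
    # phi: probability that a and b do not coalesce in population AB at a locus
    phi = math.exp(-2.0 * (tau0 - tau1) / theta1)
    return (2.0 / 3.0) * phi ** m

for m in (1, 10, 100, 1000):
    err = ml_error_infinite_sites(m, tau0=0.02, tau1=0.019, theta1=0.05)
    print(f"m = {m:5d}  e_ML = {err:.3e}")
```

At m = 1 the value equals $P (G_{2}) + P (G_{3}) = \frac{2}{3} ϕ$ from equation (1), as expected when the species tree estimate is simply the gene tree at the single locus.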

Fig. 2. — (a–c) Species-tree estimation error (e) at three sequence lengths (n = 1, 2, 1,000) plotted against the number of loci (m) for different methods. (a′–c′) The probit transform of the species-tree error, $Φ^{- 1} (e)$, plotted against $\sqrt{m}$. The parameters used in the simulation are $τ_{0} = 0.02, τ_{1} = 0.019, θ_{0} = 0.01$, and $θ_{1} = 0.05$. When n = 1, all four methods (ML, 2-step, concatenation, and isml) give the same species tree estimate, while concatenation and isml are equivalent in all cases considered in this paper. The number of replicates is $R \geq 10^{4}$ for ML and $\geq 10^{6}$ for the other methods.

Table 1.

Probabilities ( $g_{1}, g_{2}, g_{3}$ ) of Estimated Gene Trees at Different Sequence Lengths (n) and the Error Rates for the Summary Methods 2-step and isml with m = 1,000 Loci, Each with n Sites.

n	1	2	10	100	1,000	$\infty$
2-step (mp-est)
$P$ (tie)	0.92948	0.8673	0.57015	0.22159	0.05105	0
$g_{1} (n)$	0.02378	0.04474	0.14515	0.26646	0.33273	0.35947
$g_{2} (n) = g_{3} (n)$	0.02337	0.04398	0.14235	0.25598	0.30811	0.32026
$e_{2 - STEP}$	0.642	0.633	0.597	0.470	0.260	0.114
$ζ (m, g_{1}, g_{2})$	0.644	0.635	0.600	0.472	0.264	0.113
$ζ_{ZLY} (m, g_{1}, g_{2})$	NA	NA	0.613	0.482	0.271	0.117
$(ζ_{L 1}, ζ_{U 1})$	(0.635, 0.953)	(0.623, 0.935)	(0.578, 0.869)	(0.430, 0.647)	(0.219, 0.331)	(0.087, 0.132)
$(ζ_{L 2}, ζ_{U 2})$	(0.637, 0.729)	(0.626, 0.714)	(0.585, 0.668)	(0.446, 0.561)	(0.242, 0.328)	(0.103, 0.132)
ζ (mean2)	0.683	0.670	0.627	0.504	0.285	0.118
a	0.574051	0.574056	0.573612	0.569708	0.562911	0.555962
b	0.0678913	0.0930368	0.190376	0.527747	1.11658	1.72268
isml (concat)
$e_{ISML}$	0.642	0.632	0.590	0.438	0.246	0.196
ζ_N	0.644	0.634	0.592	0.443	0.254	0.194
$ζ_{N 0}$	0.643	0.633	0.591	0.437	0.234	0.166
$(ζ_{L 1}, ζ_{U 1})$	(0.635, 0.953)	(0.622, 0.934)	(0.568, 0.854)	(0.397, 0.598)	(0.211, 0.318)	(0.157, 0.237)
$(ζ_{L 2}, ζ_{U 2})$	(0.637, 0.728)	(0.625, 0.713)	(0.576, 0.659)	(0.416, 0.536)	(0.233, 0.316)	(0.177, 0.237)
ζ (mean2)	0.683	0.669	0.618	0.476	0.275	0.207
a	0.574029	0.573971	0.57356	0.569747	0.558232	0.553151
b	0.067892	0.0958963	0.21228	0.607057	1.14253	1.35035

Note.— $P (tie)$ is the probability for ties in gene trees, with $P (tie) + g_{1} + 2 g_{2} = 1$. The probabilities of estimated gene trees ($g_{1}, g_{2}, g_{3}$) as well as the error rates ($e_{2 - STEP}$ and $e_{ISML}$) are estimated by simulation using a C program, with $\geq 10^{6}$ replicates. Ties are broken evenly in the error calculation. The parameter values used are $(τ_{0}, τ_{1}, θ_{0}, θ_{1}) = (0.02, 0.019, 0.01, 0.05)$. The marginal (pooled) site pattern probabilities are $\bar{p} = ({\bar{p}}_{0}, {\bar{p}}_{1}, {\bar{p}}_{2}, {\bar{p}}_{3}, {\bar{p}}_{4})$ = (0.92831926, 0.023777106, 0.023372801, 0.023372801, 0.001158033), given by equation (13). For 2-step, at n = 1, the estimated gene tree is determined by the single site so that $g_{1} (1) = {\bar{p}}_{1}$ and $g_{2} (1) = {\bar{p}}_{2}$, whereas at $n = \infty$, the estimated gene tree is the true gene tree, so that $g_{1} (\infty) = P (G_{1})$ and $g_{2} (\infty) = P (G_{2})$ (eq. 1). For 2-step, $ζ_{ZLY}$ (eq. 9) is inapplicable at n = 1 or 2 as m = 1,000 is too small. For isml, $ζ_{N 0} = ζ_{N} (m, Δ μ, σ_{1}^{2}, σ_{2}^{2}, 0, 0)$ ignores the correlation (eq. 6), while $ζ_{N}$ accounts for the correlation. The bounds $(ζ_{L 1}, ζ_{U 1})$ and $(ζ_{L 2}, ζ_{U 2})$ are calculated using equations (7) and (A27), with k = 2 used in $ζ_{U 2}$. “mean2” is the average of the tight bounds: $(ζ_{L 2} + ζ_{U 2}) / 2$.

Fig. 3. — Error rates in species-tree estimation by ML, 2-step, and isml (=concatenation). (a) Error plotted against sequence length n when the number of loci m is fixed at 100 or 1,000, generated by simulation. (b) Error plotted against m when $n = \infty$ . Error for ML is given by equation (12), whereas those for isml and 2-step are generated by simulation. (c) Error plotted against n when $n m = 10^{4}$ is fixed, generated by simulation. Note that all four methods are equivalent when n = 1 or $m = 1$ , while concatenation and isml are equivalent in all cases. Parameters used in the simulation are $τ_{0} = 0.02, τ_{1} = 0.019, θ_{0} = 0.01$ , and $θ_{1} = 0.05$ . The number of replicates is $R \geq 10^{4}$ .

Concatenation

Sequence alignments at the m loci are merged into a super-alignment of length nm, and the data are the site-pattern counts pooled across loci: $x_{\cdot} = \{x_{\cdot j}\}$, with $x_{\cdot j} = \sum_{i} x_{i j}, j = 0, 1, \dots, 4$. The likelihood function is given by the multinomial probability of equation (4) except that $x_{\cdot j}$ is used instead of $x_{i j}$. The ML tree is G₁ if $x_{\cdot 1} > \max (x_{\cdot 2}, x_{\cdot 3})$ (Yang 1994b, 2000). We discuss the error rate of concatenation below in the section on the isml method.

We also examine biases in parameter estimation using concatenation. We use species tree S₁ with τ_ABC = 0.02, τ_AB = 0.01, θ_ABC = 0.02, and θ_AB = 0.01 to simulate $m = 10^{4}$ loci, each with n = 250 sites. We obtain MLEs ${\hat{t}}_{0}$ and ${\hat{t}}_{1}$ on gene tree G₁ from the concatenated data for comparison with the MLEs ${\hat{τ}}_{0}$ and ${\hat{τ}}_{1}$ on species tree S₁ in the MSC model (eq. 5). With so much data, both concatenation and ML recover the true tree with near certainty. The MLEs under the MSC (obtained using the 3s program) are very close to the true values, whereas concatenation (baseml in paml, Yang 2007) produced seriously biased estimates (table 2). Even the relative age, ${\hat{t}}_{0} / {\hat{t}}_{1} = 1.92$, differs from $τ_{ABC} / τ_{A B} = 2$, which means that molecular clock dating analysis using concatenated data will produce biased time estimates (Angelis and dos Reis 2015; Ogilvie et al. 2017; Tiley et al. 2020).

Table 2.

Estimates of Divergence Times (true values in parentheses) by ML under the MSC (3s) and by Concatenation (baseml) in Two Simulated Data Sets, Each of $m = 10^{4}$ Loci and n = 250 Sites.

Data/method	τ_ABC (0.02)	τ_AB (0.01)	θ_ABC (0.02)	θ_AB (0.01)
Data set 1, 3s	0.0201	0.0096	0.0199	0.0101
Data set 2, 3s	0.0196	0.0100	0.0201	0.0100
Data set 1, baseml	0.0298	0.0155
Data set 2, baseml	0.0298	0.0156

ISML

The isml method assumes that all sites in the super-alignment are i.i.d. As in concatenation, the data are summarized as pooled site-pattern counts, $x_{\cdot} = \{x_{\cdot 0}, x_{\cdot 1}, x_{\cdot 2}, x_{\cdot 3}, x_{\cdot 4}\}$. However, isml is coalescent-aware and uses the MSC model to calculate the probabilities for the site patterns. By averaging the conditional site-pattern probabilities of equation (3) over the MSC density of gene trees and coalescent times of equation (2), we derive the marginal site-pattern probabilities, $\bar{p} = ({\bar{p}}_{0}, \dots, {\bar{p}}_{4})$, as:

$\begin{matrix} {\bar{p}}_{0} = \frac{1}{16} (1 + 18 a_{0} + 54 a_{0} b + 54 a_{0} c_{0} + 9 c_{1} + 9 a_{1}), \\ {\bar{p}}_{1} = \frac{3}{16} (1 - 6 a_{0} - 18 a_{0} b - 18 a_{0} c_{0} + 9 c_{1} + 9 a_{1}), \\ {\bar{p}}_{2} = \frac{3}{16} (1 + 6 a_{0} - 18 a_{0} b - 18 a_{0} c_{0} - 3 c_{1} - 3 a_{1}), \\ {\bar{p}}_{3} = {\bar{p}}_{2}, \\ {\bar{p}}_{4} = \frac{6}{16} (1 - 6 a_{0} + 18 a_{0} b + 18 a_{0} c_{0} - 3 c_{1} - 3 a_{1}), \end{matrix}$ (13)

where $a_{0} = \frac{e^{- 8 τ_{0} / 3}}{3 + 4 θ_{0}}$ , $a_{1} = \frac{e^{- 8 τ_{1} / 3}}{3 + 4 θ_{1}}$ , $b = \frac{e^{- 4 τ_{1} / 3}}{3 + 2 θ_{1}}$ , $c_{0} = 2 ϕ \cdot (θ_{1} - θ_{0}) \cdot \frac{e^{- 4 τ_{0} / 3}}{(3 + 2 θ_{0}) (3 + 2 θ_{1})}$ , and $c_{1} = 4 ϕ \cdot (θ_{1} - θ_{0}) \cdot \frac{a_{0}}{3 + 4 θ_{1}}$ , with $ϕ = e^{- 2 (τ_{0} - τ_{1}) / θ_{1}}$ . Note that ${{\bar{p}}_{j}}$ are functions of a₀, $b + c_{0}$ and $a_{1} + c_{1}$ , although these do not appear to permit simple biological interpretations. The cases for S₂ and S₃ are given by symmetry.
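The marginal site-pattern probabilities of equation (13) can be computed with the following Python sketch (the function name is hypothetical and the code is an illustration, not the authors' program). For the parameter values of table 1 it reproduces the values of $\bar{p}$ quoted in the note to that table.

```python
import math

def marginal_site_pattern_probs(tau0, tau1, theta0, theta1):
    """Marginal probabilities (p0, p1, p2, p3, p4) under species tree S1, eq. (13)."""
    phi = math.exp(-2.0 * (tau0 - tau1) / theta1)
    a0 = math.exp(-8.0 * tau0 / 3.0) / (3 + 4 * theta0)
    a1 = math.exp(-8.0 * tau1 / 3.0) / (3 + 4 * theta1)
    b = math.exp(-4.0 * tau1 / 3.0) / (3 + 2 * theta1)
    c0 = (2 * phi * (theta1 - theta0) * math.exp(-4.0 * tau0 / 3.0)
          / ((3 + 2 * theta0) * (3 + 2 * theta1)))
    c1 = 4 * phi * (theta1 - theta0) * a0 / (3 + 4 * theta1)
    # The probabilities depend on the parameters only through
    # a0, b + c0, and a1 + c1, as noted in the text
    bc = b + c0
    ac = a1 + c1
    p0 = (1 + 18 * a0 + 54 * a0 * bc + 9 * ac) / 16
    p1 = 3 * (1 - 6 * a0 - 18 * a0 * bc + 9 * ac) / 16
    p2 = 3 * (1 + 6 * a0 - 18 * a0 * bc - 3 * ac) / 16
    p4 = 6 * (1 - 6 * a0 + 18 * a0 * bc - 3 * ac) / 16
    return p0, p1, p2, p2, p4

p_bar = marginal_site_pattern_probs(tau0=0.02, tau1=0.019, theta0=0.01, theta1=0.05)
print([round(p, 6) for p in p_bar])
```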

The likelihood function (or the probability for the pooled site-pattern counts) for each species tree is:

$\begin{matrix} f (x_{\cdot} | S_{1}, θ_{1}) = {\bar{p}}_{0}^{x_{\cdot 0}} {\bar{p}}_{1}^{x_{\cdot 1}} {\bar{p}}_{2}^{x_{\cdot 2} + x_{\cdot 3}} {\bar{p}}_{4}^{x_{\cdot 4}}, \\ f (x_{\cdot} | S_{2}, θ_{2}) = {\bar{p}}_{0}^{x_{\cdot 0}} {\bar{p}}_{1}^{x_{\cdot 2}} {\bar{p}}_{2}^{x_{\cdot 3} + x_{\cdot 1}} {\bar{p}}_{4}^{x_{\cdot 4}}, \\ f (x_{\cdot} | S_{3}, θ_{3}) = {\bar{p}}_{0}^{x_{\cdot 0}} {\bar{p}}_{1}^{x_{\cdot 3}} {\bar{p}}_{2}^{x_{\cdot 1} + x_{\cdot 2}} {\bar{p}}_{4}^{x_{\cdot 4}} . \end{matrix}$ (14)

Theorem 3.

(a) If the true species tree is S₁ with parameters $θ_{1}$, then ${\bar{p}}_{1} > {\bar{p}}_{2} = {\bar{p}}_{3}$. (b) Isml infers the species tree S₁ if $x_{\cdot 1} > \max \{x_{\cdot 2}, x_{\cdot 3}\}$.

Proof.

(a) Each of the marginal site pattern probabilities ${\bar{p}}_{j}, j = 0, \dots, 4$ , is a sum over the four gene trees of figure 1b: $G_{1 a}, G_{1 b}, G_{2}$ and G₃. The three gene trees $G_{1 b}, G_{2}$ , and G₃ have the same densities (eq. 2). Together their contribution to the site pattern xxy is the same as that to the pattern yxx or pattern xyx. If the gene tree is $G_{1 a}$ (with any coalescent times t₀ > t₁), site pattern xxy will have a higher probability than yxx or xyx, with $p_{1} > p_{2} = p_{3}$ . Averaging over all the four gene trees, we have ${\bar{p}}_{1} > {\bar{p}}_{2} = {\bar{p}}_{3}$ .

(b) We show that if $x_{\cdot 1} > x_{\cdot 2}$ , then $ℓ (S_{1}, {\hat{θ}}_{1}) > ℓ (S_{2}, {\hat{θ}}_{2})$ , where ${\hat{θ}}_{1}$ and ${\hat{θ}}_{2}$ are the MLEs under each species tree. First note that if $x_{\cdot 1} > x_{\cdot 2}$ and $q_{1} > q_{2} > 0$ , then $q_{1}^{x_{\cdot 1}} q_{2}^{x_{\cdot 2}} > q_{1}^{x_{\cdot 2}} q_{2}^{x_{\cdot 1}}$ . Let $q_{1} = {\bar{p}}_{1} (S_{1}, {\hat{θ}}_{2})$ and $q_{2} = {\bar{p}}_{2} (S_{1}, {\hat{θ}}_{2})$ , and we have $ℓ (S_{1}, {\hat{θ}}_{2}) > ℓ (S_{2}, {\hat{θ}}_{2})$ . In other words, even if we use ${\hat{θ}}_{2}$ (the MLE for S₂) to calculate the likelihood for species tree S₁, tree S₁ will have a higher likelihood than S₂. Since ${\hat{θ}}_{2}$ may not be optimal for S₁, it follows that $ℓ (S_{1}, {\hat{θ}}_{1}) \geq ℓ (S_{1}, {\hat{θ}}_{2}) > ℓ (S_{2}, {\hat{θ}}_{2})$ . □

Theorem 3 means that isml infers species tree S_j if $x_{\cdot j}$ is the greatest among $x_{\cdot 1}, x_{\cdot 2}$ , and $x_{\cdot 3}$ , just like concatenation.

To study the error rate for isml (or concat), let p_ij, $j = 0, \dots, 4$ , be the site-pattern probabilities at any locus i. Data at each locus are represented by the site-pattern frequencies $f_{i j} = x_{i j} / n$ . Let $f_{i} = \{f_{i j}\}$ be the data at locus i. The f_i are i.i.d. among loci from a common distribution with mean $E (f_{i j}) = {\bar{p}}_{j}$ and variance/covariance $σ_{j j} \equiv V (f_{i j})$ and $σ_{j k} \equiv Cov (f_{i j}, f_{i k})$ . Let ${\bar{f}}_{j} = \frac{1}{m} \sum_{i = 1}^{m} f_{i j} = x_{\cdot j} / (m n)$ be the means over loci. Here, $\{f_{i j}\}$ constitute the full data, whereas $\{{\bar{f}}_{j}\}$ are summaries used by isml: the species tree estimate is S_j if ${\bar{f}}_{j}$ is the largest among $({\bar{f}}_{1}, {\bar{f}}_{2}, {\bar{f}}_{3})$ . Thus, $e_{ISML} = P ({\bar{f}}_{1} < \max \{{\bar{f}}_{2}, {\bar{f}}_{3}\}) \approx ζ_{N} (m, {\bar{p}}_{1} - {\bar{p}}_{2}, Σ)$ , where $Σ = \{σ_{j k}\}$ . Below we derive the variances.

At n = 1, they are given by the multinomial distribution as:

$σ_{j j}^{(1)} = {\bar{p}}_{j} (1 - {\bar{p}}_{j}), σ_{j k}^{(1)} = - {\bar{p}}_{j} {\bar{p}}_{k}, 1 \leq j, k \leq 3.$ (15)

At $n = \infty$ , we have f_ij = p_ij, given by equation (3). The variances, denoted $σ_{j k}^{(\infty)}$ , can be generated by simulating gene trees with coalescent times and calculating the site-pattern probabilities (eq. 3) (supplementary table S1, Supplementary Material online). This distribution is 3D (for $f_{i 0}$ , $f_{i 1}$ , and $f_{i 2} = f_{i 3}$ under S₁), indexed by four parameters ( $θ_{1}$ in S₁), and is a mixture distribution with 4 components corresponding to the four gene trees of figure 1b. It reflects the coalescent fluctuation in gene genealogies.

For any finite $1 \leq n < \infty$ , the variances are given by:

$\begin{matrix} σ_{j j} = V (E (f_{i j} | p_{i j})) + E (V (f_{i j} | p_{i j})) \\ = V (p_{i j}) + E (p_{i j} (1 - p_{i j}) / n) \\ = V (p_{i j}) + \frac{1}{n} [E (p_{i j}) (1 - E (p_{i j})) - V (p_{i j})] \\ = \frac{1}{n} σ_{j j}^{(1)} + \frac{n - 1}{n} σ_{j j}^{(\infty)}, \\ σ_{j k} = Cov (p_{i j}, p_{i k}) + E (Cov (f_{i j}, f_{i k} | p_{i j}, p_{i k})) \\ = Cov (p_{i j}, p_{i k}) + \frac{1}{n} [- E (p_{i j}) E (p_{i k}) - Cov (p_{i j}, p_{i k})] \\ = \frac{1}{n} σ_{j k}^{(1)} + \frac{n - 1}{n} σ_{j k}^{(\infty)}, \end{matrix}$ (16)

where $E (p_{i j}) \equiv {\bar{p}}_{j}$ (eq. 13), whereas $V (p_{i j}) = σ_{j j}^{(\infty)}$ and $Cov (p_{i j}, p_{i k}) = σ_{j k}^{(\infty)}$ are the variances/covariances over the coalescent process. These are calculated for a set of parameter values in supplementary table S1, Supplementary Material online. The variances of f_ij are thus weighted averages of the variances at $n = 1$ and $n = \infty$ .
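The decomposition in equation (16) can be checked numerically on a toy case. The sketch below is only an illustration: it replaces the continuous distribution of p_ij over gene trees and coalescent times by a hypothetical two-component discrete mixture (the weights and probabilities are made up, not taken from supplementary table S1), computes V(f_ij) by brute-force summation over binomial counts, and compares it with the weighted average of the n = 1 and n = ∞ variances.

```python
from math import comb

def var_bruteforce(mixture, n):
    """V(f) for f = x/n, where x | p ~ Binomial(n, p) and p is drawn from `mixture`.

    `mixture` is a list of (weight, p) pairs; it is a stand-in for the
    distribution of p_ij over gene trees and coalescent times.
    """
    m1 = m2 = 0.0
    for w, p in mixture:
        for x in range(n + 1):
            pr = w * comb(n, x) * p**x * (1 - p) ** (n - x)
            m1 += pr * x / n
            m2 += pr * (x / n) ** 2
    return m2 - m1 * m1

def var_eq16(mixture, n):
    """V(f) as the weighted average of the n = 1 and n = infinity variances (eq. 16)."""
    p_bar = sum(w * p for w, p in mixture)
    var_inf = sum(w * (p - p_bar) ** 2 for w, p in mixture)  # V(p), the n = infinity term
    var_one = p_bar * (1 - p_bar)                            # eq. 15, the n = 1 term
    return var_one / n + (n - 1) / n * var_inf

# Hypothetical mixture, chosen only for illustration.
toy = [(0.04, 0.30), (0.96, 0.20)]
for n in (1, 2, 10, 50):
    print(n, var_bruteforce(toy, n), var_eq16(toy, n))
```

The two columns should agree to rounding error for every n, which is just the law of total variance applied to binomial sampling around a random p.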

The approximation $e_{ISML} \approx ζ_{N} (m, {\bar{p}}_{1} - {\bar{p}}_{2}, Σ)$ is very accurate, with errors <0.002 in the simulation of table 1. At large n, accommodating correlation is useful, as $ζ_{N 0}$ , which ignores correlation, is less accurate (see fig. 4 for the case of $n = \infty$ ). For example, the correlation $ρ (f_{i 1}, f_{i 2}) = \frac{σ_{12}}{\sqrt{σ_{11} σ_{22}}}$ is −0.124, −0.153, and −0.181 at n = 1, 1,000, and $\infty$ , respectively (supplementary table S1, Supplementary Material online).

We now consider parameter estimation by isml. Theorem 3 allows species tree estimation by isml without knowledge of the MLE of the parameters. With data of $x_{\cdot j}, j = 0, \dots, 4$ , there are only three observations (three free proportions ${\bar{f}}_{0}, {\bar{f}}_{1}$ , and ${\bar{f}}_{2} + {\bar{f}}_{3}$ in the case of S₁). As there are four parameters in the MSC model, it is impossible to identify all of them.

If we assume $θ_{0} = θ_{1} = θ$ (as in Tian and Kubatko 2016), all three parameters ( $τ_{0}, τ_{1}, θ$ ) will be identifiable. As $c_{0} = c_{1} = 0$ , equation (13) simplifies to:

$\begin{matrix} {\bar{p}}_{0} & = \frac{1}{16} (1 + 18 a_{0} + 54 a_{0} b + 9 a_{1}), \\ {\bar{p}}_{1} & = \frac{3}{16} (1 - 6 a_{0} - 18 a_{0} b + 9 a_{1}), \\ {\bar{p}}_{2} & = \frac{3}{16} (1 + 6 a_{0} - 18 a_{0} b - 3 a_{1}) = {\bar{p}}_{3}, \\ {\bar{p}}_{4} & = \frac{6}{16} (1 - 6 a_{0} + 18 a_{0} b - 3 a_{1}), \end{matrix}$ (17)

where a₀, a₁, and b are defined in equation (13) with $θ_{0} = θ_{1} = θ$ . By equating the observed site-pattern frequencies to their expected probabilities (eq. 17), we have

$\begin{matrix} \frac{1}{4} (9 a_{1} + 1) = {\bar{f}}_{0} + {\bar{f}}_{1} \equiv h_{1}, \\ \frac{1}{4} (9 a_{0} + 1) = {\bar{f}}_{0} + \frac{1}{2} ({\bar{f}}_{2} + {\bar{f}}_{3}) \equiv h_{2}, \\ \frac{3}{8} (- 18 a_{0} b + 3 a_{1} + 1) = {\bar{f}}_{1} + \frac{1}{2} ({\bar{f}}_{2} + {\bar{f}}_{3}) \equiv h_{3} . \end{matrix}$ (18)

Thus, we have a quadratic equation in $\hat{θ}$ :

$\begin{matrix} 4 {(4 h_{3} - 2 h_{1} - 1)}^{2} {\hat{θ}}^{2} + [3 {(4 h_{3} - 2 h_{1} - 1)}^{2} \\ - (4 h_{1} - 1) {(4 h_{2} - 1)}^{2}] (4 \hat{θ} + 3) = 0. \end{matrix}$ (19)

This always has a unique positive root. Given $\hat{θ}$ , the estimates ${\hat{τ}}_{0}$ and ${\hat{τ}}_{1}$ are given by equation (18), which are guaranteed to be positive.
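The linear part of this moment-matching step can be sketched directly from equations (17) and (18). The code below is a minimal illustration, not the authors' implementation: it maps the intermediate quantities (a₀, a₁, b) to the averaged site-pattern probabilities and then recovers them from the frequencies through h₁, h₂, and h₃. The numerical values of a₀, a₁, and b are hypothetical, and the final step from (a₀, a₁, b) to (θ, τ₀, τ₁), which needs the definitions in equation (13) and the quadratic (19), is deliberately left out.

```python
def pattern_probs(a0, a1, b):
    """Eq. 17: averaged site-pattern probabilities (p0, p1, p2, p3, p4) under S1
    with theta0 = theta1."""
    p0 = (1 + 18 * a0 + 54 * a0 * b + 9 * a1) / 16
    p1 = 3 * (1 - 6 * a0 - 18 * a0 * b + 9 * a1) / 16
    p2 = 3 * (1 + 6 * a0 - 18 * a0 * b - 3 * a1) / 16
    p4 = 6 * (1 - 6 * a0 + 18 * a0 * b - 3 * a1) / 16
    return p0, p1, p2, p2, p4          # p3 equals p2 under S1

def invert_eq18(f0, f1, f2, f3):
    """Eq. 18: recover (a0, a1, b) from observed pattern frequencies."""
    h1 = f0 + f1
    h2 = f0 + 0.5 * (f2 + f3)
    h3 = f1 + 0.5 * (f2 + f3)
    a1 = (4 * h1 - 1) / 9
    a0 = (4 * h2 - 1) / 9
    a0b = (3 * a1 + 1 - 8 * h3 / 3) / 18   # from (3/8)(-18 a0 b + 3 a1 + 1) = h3
    return a0, a1, a0b / a0

# Hypothetical values of (a0, a1, b), used only for a round-trip check.
probs = pattern_probs(0.05, 0.06, 0.9)
print(probs, sum(probs))
print(invert_eq18(*probs[:4]))
```

With exact probabilities in place of frequencies the round trip returns the input values, which is the one-to-one correspondence that makes the three parameters identifiable from the pooled counts when θ₀ = θ₁.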

Fig. 4. — Species tree error for isml at $n = \infty$ generated by simulation (10⁸ replicates) and by approximation based on $ζ_{N}$ either with or without accounting for correlations. The error goes from 0.64 (at m = 1) to 0.19 (at m = 1,000). Results for other methods for the same parameter settings are in figure 3b.

Thus, under the assumption $θ_{0} = θ_{1}$ , the isml method provides estimates of the three parameters in the model: θ, τ₀, and τ₁. As there is a one-to-one correspondence between the parameters and the multinomial proportions, the estimates are consistent and approach the true values when $m \to \infty$ for any $n \geq 1$ if the assumption of $θ_{0} = θ_{1}$ is correct (table 3, cases c and d). However, the pooled site-pattern counts or average site-pattern frequencies are summaries of the original data and are not sufficient statistics. It then follows that the isml estimates will be less efficient and have larger asymptotic variances than the MLEs obtained from the full data under the same model assumption of $θ_{0} = θ_{1}$ (table 3, case c). Furthermore, if $θ_{0} \neq θ_{1}$ , assuming $θ_{0} = θ_{1}$ will lead to biased and inconsistent parameter estimates even if the same species tree estimate is produced. In other words if $θ_{0} \neq θ_{1}$ , the isml method assuming $θ_{0} = θ_{1}$ will produce a consistent estimate of the species tree and inconsistent estimates of the model parameters (table 3, cases e and f).

Table 3.

Characterization of the isml Method.

	True Model	Assumption	Data Size	Parameters	isml vs. ml
(a)	$θ_{0} \neq θ_{1}$	$θ_{0} \neq θ_{1}$	n > 1	3 out of 4 identifiable	isml $\neq$ ml
(b)	$θ_{0} \neq θ_{1}$	$θ_{0} \neq θ_{1}$	n = 1	3 out of 4 identifiable	isml $=$ ml
(c)	$θ_{0} = θ_{1}$	$θ_{0} = θ_{1}$	n > 1	all 3 identifiable	isml $\neq$ ml
(d)	$θ_{0} = θ_{1}$	$θ_{0} = θ_{1}$	n = 1	all 3 identifiable	isml $=$ ml
(e)	$θ_{0} \neq θ_{1}$	$θ_{0} = θ_{1}$	n > 1	3 out of 4 identifiable, inconsistent	isml $\neq$ ml
(f)	$θ_{0} \neq θ_{1}$	$θ_{0} = θ_{1}$	n = 1	3 out of 4 identifiable, inconsistent	isml $=$ ml


Note.—In all cases, the species tree topology is identifiable and consistently estimated by isml when the number of loci $m \to \infty$ . If the parameters are identifiable, their estimates will be consistent. When isml differs from ML and the assumed model is correct, isml is less efficient than ML for parameter estimation (case c).

Two-Step Method (Majority Vote)

In the 2-step method, we estimate gene trees at individual loci and then use the most common gene tree topology as the species tree estimate. Under JC, the ML gene tree for locus i (which is also the UPGMA tree) is tree G_j if x_ij is the largest among $x_{i 1}, x_{i 2}$ , and $x_{i 3}$ (Yang 1994b, 2000); site patterns xxy, yxx, and xyx “support” gene trees G₁, G₂, and G₃, respectively. There is no need for numerical optimization to obtain the ML tree at each locus.

Let g₁, g₂, and g₃ be the probabilities that the estimated gene tree is G₁, G₂, and G₃, respectively; that is, $g_{1} = P \{x_{i 1} > \max (x_{i 2}, x_{i 3})\}$ , and so on. These are functions of all four parameters in the MSC model ( $θ_{1}$ ) as well as the sequence length n, and can be computed numerically (Yang 2002, eq. 12) or by simulation. Under JC and the clock, $g_{2} = g_{3} < g_{1} < P (G_{1} | S_{1}, θ_{1})$ (Yang 2002). This result has several implications. First, $g_{1} < P (G_{1})$ means that phylogenetic errors inflate gene-tree–species-tree discordance and lead to underestimation of the internal branch length in the species tree (Yang 2002). Second, $g_{1} < P (G_{1})$ also means that use of estimated (rather than true) gene trees leads to a reduced probability of recovering the correct species tree. Third, $g_{1} > g_{2} = g_{3}$ means that the 2-step estimate of the species tree is consistent even if estimated gene trees are used.

Let the number of loci at which G₁ is the ML tree be $m_{1} = \sum_{i = 1}^{m} I_{x_{i 1} > \max (x_{i 2}, x_{i 3})}$ , where the indicator function $I_{a} = 1$ if statement a is true and 0 otherwise. Similarly define m₂ and m₃ to be the counts for the two mismatching gene trees. The correct species tree is inferred if and only if $m_{1} > \max (m_{2}, m_{3})$ . Thus, the error rate can be approximated by $e_{2 - STEP} \approx ζ (m, g_{1}, g_{2})$ (eq. 8).

The accuracy of this approximation is assessed in table 1 at different values of n with m = 1,000 and with parameter values $τ_{0} = 0.02, τ_{1} = 0.019, θ_{0} = 0.01$ , and $θ_{1} = 0.05$ . Consider first the case of n = 1. The gene tree is resolved if the single site at the locus has site pattern 1, 2, or 3, but is unresolved if the site has pattern 0 or 4. Whether we ignore loci with ties (with site patterns 0 or 4) or break ties evenly (assigning $\frac{1}{3}$ to each gene tree) does not affect the species tree estimate. Thus, $g_{1} (1) = {\bar{p}}_{1}$ and $g_{2} (1) = {\bar{p}}_{2}$ (eq. 13) and the error is $e_{2 - STEP} \approx ζ (m, {\bar{p}}_{1}, {\bar{p}}_{2})$ . This is equivalent to $e_{ISML} \approx ζ_{N} (m, {\bar{p}}_{1} - {\bar{p}}_{2}, Σ)$ for isml, consistent with the fact that at n = 1 all methods considered here are equivalent.

If $n = \infty$ , the estimated gene trees will be the true gene trees so that $g_{1} = P (G_{1})$ and $g_{2} = P (G_{2})$ . The error rate is then $ζ (m, P (G_{1}), P (G_{2})) =$ ζ(1,000, 0.3594737, 0.3202631) = 0.1132, close to 0.114 from simulation. At n = 1,000, the proportions of estimated gene trees are g₁ = 0.33273 and g₂ = 0.30811, so that $ζ (m, g_{1}, g_{2}) =$ 0.264, close to 0.260 by simulation (table 1). This error is much larger than the 0.114 at $n = \infty$ , suggesting that with n = 1,000 sites in the sequence, the estimated gene trees have substantial errors and uncertainties.
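The majority-vote error at $n = \infty$ can also be obtained without any normal approximation, by summing the multinomial distribution of the gene-tree counts (m₁, m₂, m₃) directly. The sketch below is offered as a cross-check rather than as an implementation of ζ in equation (8), which is not reproduced here. It assumes the standard three-species MSC result that the two lineages fail to coalesce in the internal branch with probability $e^{- 2 (τ_{0} - τ_{1}) / θ_{1}}$, which reproduces the values 0.3594737 and 0.3202631 quoted above, and it assumes that ties for the most common gene tree are broken evenly.

```python
from math import exp, lgamma, log

def gene_tree_probs(tau0, tau1, theta1):
    """(P(G1), P(G2), P(G3)) for three species under the MSC with species tree S1."""
    miss = exp(-2 * (tau0 - tau1) / theta1)   # no coalescence in the internal branch
    return 1 - 2 * miss / 3, miss / 3, miss / 3

def majority_vote_error(m, g):
    """Exact probability that G1 does not win the vote among m true gene trees.

    Ties for the top count are broken evenly (an assumption of this sketch).
    """
    log_fact = [lgamma(k + 1) for k in range(m + 1)]
    log_g = [log(x) for x in g]
    p_correct = 0.0
    for m1 in range(m + 1):
        for m2 in range(m - m1 + 1):
            m3 = m - m1 - m2
            top = max(m1, m2, m3)
            if m1 < top:
                continue
            n_top = (m1 == top) + (m2 == top) + (m3 == top)
            log_p = (log_fact[m] - log_fact[m1] - log_fact[m2] - log_fact[m3]
                     + m1 * log_g[0] + m2 * log_g[1] + m3 * log_g[2])
            p_correct += exp(log_p) / n_top
    return 1 - p_correct

if __name__ == "__main__":
    g = gene_tree_probs(tau0=0.02, tau1=0.019, theta1=0.05)
    print(g)
    print(majority_vote_error(1000, g))
```

The printed error for m = 1,000 can be compared with the approximate and simulated values quoted in the paragraph above; for m = 1 the function reduces to $1 - P (G_{1})$.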

The approximations $ζ_{ZLY}$ (eq. 9) and ζ (eq. 8) give nearly identical results. The error rate is found to be very sensitive to the precise values of g₁ and g₂. Overall, the approximation is good, with errors within or close to 1%.

Numerical Comparison of Different Methods

We use simulation to compare the different species-tree estimation methods and to assess the reliability of our approximations. We use a challenging species tree with parameters $τ_{0} = 0.02, τ_{1} = 0.019, θ_{0} = 0.01$ , and $θ_{1} = 0.05$ . The error is plotted against the number of loci (m) when the number of sites per locus is fixed at $n =$ 1, 2, or 1,000 (fig. 2).

In the case of one site per sequence (n = 1), all four methods considered in this study are equivalent, with the species tree given by the most frequent pooled site pattern (i.e., the greatest of $x_{\cdot 1}, x_{\cdot 2}$ , and $x_{\cdot 3}$ ). With one site, the independent-sites assumption is correct, and ml and isml are exactly the same. As discussed earlier, concatenation and 2-step also select the species tree according to the pooled site patterns. Treatment of ties among $x_{\cdot 1}, x_{\cdot 2}, x_{\cdot 3}$ has very minor effects on the error rate. For n = 1 and $m = 1000$ , simulation gave the error estimate $e =$ 0.642 if ties are broken evenly (table 1) or 0.641 if data sets with ties are ignored. As predicted by our theory, the probit transform of the error, $Φ^{- 1} (e)$ , shows a linear relationship with $\sqrt{m}$ (fig. 2a′, $R^{2} = 0.9994$ ).

In the case of n = 2 sites per locus, isml (=concatenation), 2-step, and ML are all distinct. To see that concatenation and 2-step may produce different species trees, consider the case of m = 3 loci and n = 2 sites. If the data at the three loci are 11, 02, and 00, where 0–4 represent the five site patterns, concatenation will infer the correct species tree S₁ (as $x_{\cdot 1} = 2, x_{\cdot 2} = 1, x_{\cdot 3} = 0$ ), whereas 2-step will have a tie between S₁ and S₂ (as $m_{1} = 1, m_{2} = 1, m_{3} = 0$ ). If the data at the three loci are 33, 01, and 14, concatenation will have a tie between S₁ and S₃ (as $x_{\cdot 1} = 2, x_{\cdot 2} = 0, x_{\cdot 3} = 2$ ), whereas 2-step will infer the correct species tree (as $m_{1} = 2, m_{2} = 0, m_{3} = 1$ ). We also confirm that at n = 2, ML differs from all three summary methods and can identify and consistently estimate all four parameters in the MSC model. Indeed, ML is far more efficient for species tree estimation than the summary methods when n = 2 (fig. 2b and b′). Although the summary methods improve only slightly when n changes from 1 to 2, there is a major performance boost for ML (fig. 3a). This may be due to the fact that the model is fully identifiable with n = 2 but not when n = 1. The predicted linear relationship between $Φ^{- 1} (e)$ and $\sqrt{m}$ holds well for the three summary methods (fig. 2b′). For ML, if we remove the first two points (for m = 10 and 20), the relationship is nearly linear, with $y = - 0.0022 x + 0.0391$ and $R^{2} = 0.97$ .
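The two toy data sets above are small enough to replay in code. The sketch below encodes each locus as a string of site-pattern labels 0–4 and applies the two counting rules described in the text. The function names are illustrative, and loci whose gene tree is unresolved or tied are simply skipped in the 2-step vote, which is an assumption of this sketch (the text notes that ties could instead be broken evenly).

```python
from collections import Counter

def pick_tree(counts):
    """Indices (1, 2, 3) of the trees sharing the largest count; more than one means a tie."""
    top = max(counts)
    return {j + 1 for j, c in enumerate(counts) if c == top}

def concat(loci):
    """Concatenation (= isml here): pool site patterns over loci, then pick the largest."""
    pooled = Counter("".join(loci))
    return pick_tree([pooled[str(j)] for j in (1, 2, 3)])

def two_step(loci):
    """2-step: infer a gene tree per locus, then take a majority vote over loci."""
    votes = [0, 0, 0]
    for locus in loci:
        c = Counter(locus)
        counts = [c[str(j)] for j in (1, 2, 3)]
        winners = pick_tree(counts)
        if max(counts) > 0 and len(winners) == 1:   # skip unresolved or tied loci
            votes[winners.pop() - 1] += 1
    return pick_tree(votes)

for loci in (["11", "02", "00"], ["33", "01", "14"]):
    print(loci, "concat:", concat(loci), "2-step:", two_step(loci))
```

For the first data set concatenation returns S₁ alone while 2-step returns a tie between S₁ and S₂; for the second the roles are reversed, with concatenation tied between S₁ and S₃ and 2-step returning S₁.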

The most interesting case is with $n ≫ 1$ , since in real data sets n may be in the range 50–5,000, say. We used n = 1,000 in figure 2c and c′. As in the case of n = 2, there is a large performance divide between ML and the three summary methods (isml = concatenation and 2-step), whereas the summary methods have similar performance. The approximate linear relationship between $Φ^{- 1} (e)$ and $\sqrt{m}$ holds well for all methods.

The superior performance of ML persists in the limit of $n = \infty$ (fig. 3b). For example, $e_{ML, \infty} =$ 0.45 and 0.01 for ML at m = 10 and 100, respectively, compared with $e_{2 - STEP, \infty} =$ 0.60 and 0.46 for 2-step or $e_{ISML, \infty} =$ 0.62 and 0.51 for isml. The differences between ML and 2-step reflect the information in the coalescent times or gene-tree branch lengths. The differences between ML and isml reflect the information in the variation of site-pattern frequencies among loci, as isml uses only the averages across loci.

Figure 3c examines the error rates of different methods, while $n m = 10^{4}$ is fixed. At the two ends (n = 1 or m = 1), all four methods are equivalent, with $e =$ 0.587 at n = 1 and $m = 10^{4}$ , and $e =$ 0.646 at m = 1 and $n = 10^{4}$ . Note that when $n = 1$ and $m \to \infty$ , the error $e \to 0$ , while if m = 1 and $n \to \infty$ , the error $e = 1 - g_{1} (n) \to 1 - P (G_{1}) =$ 0.6405. The high error at m = 1 even when $n = \infty$ is because a single gene tree (with coalescent times), even if known with certainty, does not contain much information about the MSC process. Away from the two ends (n > 1 and m > 1), ML is considerably more efficient than the summary methods (fig. 3c). The case of $m = 10^{4}$ (n = 1), at which $e_{ML} =$ 0.587, and the case of m = 2 (n = 5,000), at which $e_{ML}$ = 0.487, make an interesting contrast. In the first case all sites are i.i.d., while in the second, there are only two independent genes, each of 5,000 sites in complete linkage. One might expect data of independent sites to be more informative than two loci with correlated sites at the same locus (e.g., Long and Kubatko 2018), but the opposite is true. With n = 1, not all model parameters are identifiable, and this nonidentifiability issue appears to impact species tree estimation as well (Shi and Yang 2018, p. 172). With nm fixed, the smallest error $e_{ML}$ occurs at intermediate values of n and m, around $n = m = 100$ , although performance is similar over a large range of n (fig. 3c).

In table 1, we calculated the species-tree error probability using equations (6) and (8), as well as two pairs of bounds ( $ζ_{L 1}, ζ_{U 1}$ ) and ( $ζ_{L 2}, ζ_{U 2}$ ) (Theorem 1, Appendix A), for comparison with the simulation results. The asymptotic results are expected to apply when the sequence length n is fixed and the number of loci $m \to \infty$ . Here, m is fixed at 1,000, so that b < 2 in all cases (table 1), which is too small for the asymptotic approximations to be reliable. As a result, equations (6) and (8) are more accurate.

Discussion

Errors of Species Tree Estimation by Different Methods

Under the MSC model, data at different loci are i.i.d., so that the number of loci (m) constitutes the sample size in the statistical model. Thus, we have derived approximations to the error rate for different methods when m increases, with the sequence length n fixed. For large m, the error can be approximated by $Φ (- c \sqrt{m})$ , where c is a constant. This is seen to apply to all four methods considered in this study (ML, isml = concatenation, and 2-step) (see table 4 for a summary).

Table 4.

Summary of Analytical Approximations to Species-Tree Estimation Error by Different Methods.

Method	n = 1	$n \geq 2$	$n = \infty$
ml		eq. 11	eq. 12
2-step	$ζ (m, {\bar{p}}_{1}, {\bar{p}}_{2})$	$ζ (m, g_{1}, g_{2})$	$ζ (m, P (G_{1}), P (G_{2}))$
isml/concatenation	$ζ_{N} (m, Δ p, Σ^{(1)})$	$ζ_{N} (m, Δ p, Σ^{(n)})$	$ζ_{N} (m, Δ p, Σ^{(\infty)})$


Note.—For isml/concatenation, $Δ p = {\bar{p}}_{1} - {\bar{p}}_{2}$ , and the variance–covariance matrix at n is $Σ^{(n)} = \frac{1}{n} Σ^{(1)} + \frac{n - 1}{n} Σ^{(\infty)}$ (eq. 16). In the case of n = 1, $ζ (m, {\bar{p}}_{1}, {\bar{p}}_{2}) = ζ_{N} (m, Δ p, Σ^{(1)})$ , and 2-step, isml, concatenation, and ml are all equivalent.

The theory for ML in Appendix B applies generally to ML selection of nonnested models when one model (which may or may not be the true model) fits the data better than the others, as judged by the K–L divergence to the true data-generating model. In particular, the theory applies to conventional phylogenetic reconstruction without the MSC model. For example, figure 5 applies the same prediction to simulation results on four-taxon trees from Yang (1997). Previously, Susko (2011) developed a large-sample approximation to the log-likelihood difference between two trees and to the probability that each tree will be the ML tree in the case of four species without the molecular clock. It was assumed that the internal branch length in the tree is small and approaches 0 at the rate of $n^{- \frac{1}{2}}$ or faster when the number of sites n increases. In our analysis, we take the conventional approach of fixing the parameters when the data size increases.

Fig. 5. — The probit transform of the phylogenetic reconstruction error, $Φ^{- 1} (e)$ , is a linear function of the square root of the number of sites in the alignment ( $\sqrt{n}$ ). Simulation results from Yang (1997, fig. 1A and B) are used in the plot. The trees used in the simulation have four taxa, with branch lengths ((0.5, 0.5):0.1, 0.5, 0.5) for tree A and ((0.5, 0.5):0.1, 0.6, 1.4) for tree B. Data are simulated under the JC+G model (Yang 1994a) and analyzed under both JC and JC+G (Jukes and Cantor 1969; Yang 1994a). Note that in (B), ML under the incorrect model (JC) is more efficient than ML under the correct model (JC+G).

We note that in problems of parameter estimation, the standard error for the parameter estimate or the width of the confidence interval typically decreases at the rate of $n^{- \frac{1}{2}}$ , so that quadrupling the data size halves the interval. In contrast, the probability of recovering the best-fitting model approaches 1 much faster. As the probit transform of the error decreases linearly with $\sqrt{n}$ , it will soon reach a point beyond which the precise error probability is of no practical significance: for example, $Φ^{- 1} (e) = - 3$ means e = 0.0013, while $Φ^{- 1} (e) = - 5$ means $e = 2.9 \times 10^{- 7}$ . The different dynamics between model selection and parameter estimation when the data size grows is consistent with the fact that we tend to obtain extreme support for phylogenies inferred in large data sets (Yang and Zhu 2018).
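The two tail values quoted above, and the general shape of an error curve of the form $Φ (- c \sqrt{n})$, are easy to reproduce with the standard normal distribution in the Python standard library. In this small sketch the slope c is a hypothetical constant chosen for illustration; it is not a value estimated in the paper.

```python
from statistics import NormalDist

PHI = NormalDist()   # standard normal, mean 0 and standard deviation 1

def error_rate(c, size):
    """Error predicted by Phi(-c * sqrt(size)) for a given slope c and data size."""
    return PHI.cdf(-c * size ** 0.5)

# Tail probabilities behind the probit values mentioned in the text.
for z in (-3, -5):
    print(z, PHI.cdf(z))

# With a hypothetical slope c = 0.05, quadrupling the data size doubles the
# probit of the error, so the error itself falls much faster than 1/sqrt(size).
for size in (100, 400, 1600, 6400):
    e = error_rate(0.05, size)
    print(size, e, PHI.inv_cdf(e))
```

The contrast with parameter estimation is visible in the last loop: each fourfold increase in data size halves a standard error, but moves the probit of the model-selection error by a full factor of two.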

Implications of Our Study to Species Tree Methods

Although the species tree problem studied here is the simplest, it has the complexities of the general problem. Furthermore, we have represented all major species tree methods in our analysis. We expect ML to be asymptotically similar to Bayesian inference as both are full-data methods.

We have assumed the JC mutation model and the molecular clock. Our results are thus applicable to shallow species phylogenies and may not apply to distantly related species for which the JC model may be inadequate for multiple-hit correction and the molecular clock may be seriously violated. In the case of three species examined in this paper, concatenation and isml always produce the same species tree estimate. However, in more general settings with four or more species and when the clock is violated and unrooted trees are used, concatenation and isml are known to be different. In particular, concatenation (as well as 2-step) can be inconsistent (Roch and Steel 2015), while isml is a coalescent-aware method and is always consistent.

The isml method considered here is similar to SVDQuartets (Chifman and Kubatko 2014). Both are summary methods based on pooled site-pattern counts. SVDQuartets is sometimes described as a site pattern-based method (e.g., Kubatko 2019). This is not a helpful description. Site-pattern counts for different loci $(\{f_{i j}\})$ are sufficient statistics under the model and carry the same amount of information as the sequence alignments at the same loci so that it makes no difference whether site patterns or sequences are used. Indeed, virtually all methods involving likelihood calculation on sequences operate on site patterns instead of sites. Instead, what matters is whether site patterns are pooled across loci. In the original data, the sites of the same locus share the same gene tree and the variation among loci provides information about parameters of the coalescent process such as the ancestral population sizes. Pooling sites across loci means that such information is lost (Shi and Yang 2018). As a result, the pooled site-pattern counts are unable to identify all parameters of the MSC model even if they can identify the species tree topology. Previously, Long and Kubatko (2018) found in simulations that SVDQuartets performed better in data sets of 600 coalescent-independent sites ( $m = 600, n = 1$ in the notation of this paper) than in data of two genes each of 300 bp ( $m = 2, n = 300$ ), and suggested that this is because “[t]he 600 sites observed from 600 distinct gene trees give independent genealogical information about the species tree, though indirectly, whereas the 300 sites for each of the two genes can give a reasonable indication of the individual gene trees, but still provide only two observed gene genealogies.” Our analysis suggests that this is not a correct interpretation. When the information in the data is used properly (as in the ML method), there is in fact more information in two genes each of 300 bp than in 600 independent sites (fig. 3c).

To understand the issue of parameter unidentifiability and the potential information loss for species tree estimation due to the pooling of sites across loci in SVDQuartets, consider the simple random-effects model:

y_{i j} = μ + α_{i} + e_{i j}, i = 1, \dots, m; j = 1, \dots, n,

(20)

where the treatment effect $α_{i} \sim N (0, σ_{a}^{2})$ and the error $e_{i j} \sim N (0, σ_{e}^{2})$ . Parameters in the model include the grand mean μ and the variance components $σ_{a}^{2}$ and $σ_{e}^{2}$ . It is obvious that if there are no replications within treatment (n = 1) or if the observations (y_ij) are pooled across treatments, the between-treatment variation and within-treatment errors will be confounded so that $σ_{a}^{2}$ and $σ_{e}^{2}$ will not be identifiable even though μ still is. In species tree estimation, pooling site patterns across loci (as in isml and SVDQuartets) causes some parameters of the MSC model to become unidentifiable even though the species tree still is. This issue of information loss due to averaging over the whole genome may be even more serious for methods designed for data of single nucleotide polymorphisms (SNPs) (Leaché and Oaks 2017), such as snapp (Bryant et al. 2012), because the removal of constant sites in the SNP data causes further loss of information (even if the ascertainment bias is accounted for in the method).
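The confounding in equation (20) can be made concrete with a small simulation. The sketch below is an illustration under assumed parameter values (the means and variances are arbitrary): with replication (n ≥ 2) a method-of-moments ANOVA estimator separates $σ_{a}^{2}$ from $σ_{e}^{2}$, whereas with n = 1 two settings that swap the variance components produce pooled data with the same mean and the same total variance.

```python
import random
from statistics import mean, variance

def simulate(m, n, mu, var_a, var_e, rng):
    """Draw y_ij = mu + alpha_i + e_ij for m treatments with n replicates each."""
    data = []
    for _ in range(m):
        alpha = rng.gauss(0.0, var_a ** 0.5)
        data.append([mu + alpha + rng.gauss(0.0, var_e ** 0.5) for _ in range(n)])
    return data

def anova_estimates(data):
    """Method-of-moments estimates (mu, var_a, var_e); requires n >= 2."""
    n = len(data[0])
    group_means = [mean(row) for row in data]
    msw = mean(variance(row) for row in data)   # within-treatment mean square
    msb = n * variance(group_means)             # between-treatment mean square
    return mean(group_means), (msb - msw) / n, msw

def pooled_variance(data):
    """Variance of all observations after pooling across treatments."""
    return variance([y for row in data for y in row])

if __name__ == "__main__":
    rng = random.Random(2021)
    print(anova_estimates(simulate(2000, 5, 10.0, 1.0, 4.0, rng)))
    # n = 1: swapping the variance components leaves the pooled variance near 5.
    print(pooled_variance(simulate(5000, 1, 10.0, 1.0, 4.0, rng)))
    print(pooled_variance(simulate(5000, 1, 10.0, 4.0, 1.0, rng)))
```

In the analogy, the treatments play the role of loci and the replicates the role of sites: pooling keeps the information about μ (the species tree) but discards what distinguishes between-locus from within-locus variation.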

An important difference between isml and SVDQuartets is that isml applies ML to the pooled site-pattern counts, whereas SVDQuartets uses a criterion based on linear invariants to avoid the ML optimization (Xu and Yang 2016). Use of a non-ML criterion is expected to lead to further reduction in efficiency, in addition to information loss due to the pooling of sites across loci (Chou et al. 2015; Xu and Yang 2016; Shi and Yang 2018).

The MSC model analyzed in this paper assumes free recombination among loci and no recombination between sites of the same locus. Data for such analysis are typically loosely linked short genomic segments that are far apart from each other so that recombination within a locus is rare, whereas different loci are nearly independent (e.g., Takahata et al. 1995; Burgess and Yang 2008; Lohse et al. 2016). Both assumptions of free recombination among loci and no recombination within locus are expected to be violated in real data analysis, and the impact of within-locus recombination is of particular concern. The ML method considered in this paper assumes no recombination (with r = 0), whereas isml (and SVDQuartets) assumes free recombination ( $r = \infty$ ). The relative performance of the methods will depend on the true recombination rate: ML may be expected to perform better than isml if r is close to 0, while isml may be superior if r is large. At very high recombination rates, it may even be possible for ML (assuming r = 0) to be inconsistent since the method is similar to concatenation and merges sites of the same locus with different histories into one sequence. In contrast, isml is consistent for all values of r. Previously, Lanier and Knowles (2012) found in a computer simulation that species-tree estimation was robust to moderate levels of within-locus recombination (see also discussions in Edwards et al. [2016]; Xu and Yang [2016]). It will be interesting to evaluate the relative performance of modern species-tree estimation methods (including isml and SVDQuartets) under realistic recombination rates.

Materials and Methods

Simulation

We use a challenging species tree with parameters $τ_{0} = 0.02, τ_{1} = 0.019$ , $θ_{0} = 0.01$ , and $θ_{1} = 0.05$ (fig. 1a). A C program is written to simulate gene trees and sequence alignments for the case of three species/sequences, under the JC model (Jukes and Cantor 1969) with the clock. To simulate the gene tree and the sequence alignment for each locus, we generate an exponential coalescent waiting time (s₁) with mean $θ_{1} / 2$ . If $s_{1} < τ_{0} - τ_{1}$ , the gene tree is $G_{1 a}$ , and another exponential waiting time s₀ is generated with mean $θ_{0} / 2$ to get $t_{0} = τ_{0} + s_{0}$ and $t_{1} = τ_{1} + s_{1}$ . If $s_{1} > τ_{0} - τ_{1}$ , the gene tree is one of $G_{1 b}, G_{2}, G_{3}$ , chosen at random, and two coalescent waiting times (s₁ and s₀) are generated with means $θ_{0} / 6$ and $θ_{0} / 2$ , respectively, so that $t_{1} = τ_{0} + s_{1}$ and $t_{0} = t_{1} + s_{0}$ (fig. 1b). The gene tree and node ages $(t_{0}, t_{1})$ are then used to calculate the site-pattern probabilities for the locus (eq. 3), and the site-pattern counts are generated from multinomial sampling (eq. 4). Each data set consists of m loci with the sequence length of n sites. We use a large number of replicates (typically $R = 10^{6}$ or 10⁸) so that sampling errors due to a limited number of replicates are not a concern. Species tree estimation by concatenation (=isml) and 2-step is done by counting site patterns.
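The gene-tree part of this procedure can be sketched in a few lines. This is an illustrative Python reimplementation, not the authors' C program. It takes s₁ as the waiting time measured from the species divergence at τ₁ and compares it with the length of the internal branch, τ₀ − τ₁, which is the reading that reproduces the value P(G₁) = 0.3594737 quoted for these parameters. Sequence generation (eqs. 3 and 4) is omitted because those equations are not reproduced in this section.

```python
import random

def simulate_gene_tree(tau0, tau1, theta0, theta1, rng):
    """Return (topology, t0, t1) for one locus under species tree S1 = ((A, B), C).

    t1 and t0 are the ages of the first and second coalescent events.
    """
    s1 = rng.expovariate(2.0 / theta1)            # mean theta1 / 2
    if s1 < tau0 - tau1:
        # a and b coalesce in the AB ancestral population: gene tree G1a
        t1 = tau1 + s1
        t0 = tau0 + rng.expovariate(2.0 / theta0)  # mean theta0 / 2
        return "G1a", t0, t1
    # All three lineages reach the root population; topology is random.
    topology = rng.choice(["G1b", "G2", "G3"])
    t1 = tau0 + rng.expovariate(6.0 / theta0)      # mean theta0 / 6
    t0 = t1 + rng.expovariate(2.0 / theta0)        # mean theta0 / 2
    return topology, t0, t1

if __name__ == "__main__":
    rng = random.Random(2021)
    reps = 200_000
    trees = [simulate_gene_tree(0.02, 0.019, 0.01, 0.05, rng)[0] for _ in range(reps)]
    for name in ("G1a", "G1b", "G2", "G3"):
        print(name, trees.count(name) / reps)
```

The combined frequency of G1a and G1b should settle near P(G₁), and the frequencies of G1b, G₂, and G₃ should be close to one another.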

For the ML method (eq. 5), we used the simulation program MCcoal, which is part of the BPP program (Yang 2015), to simulate the gene trees and sequence alignments. The data are then analyzed using the ML program 3s (Yang 2002; Dalquen et al. 2017). The JC model is used to simulate and analyze data.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.


Acknowledgments

We thank Bin Wang for discussions and two anonymous reviewers for many insightful comments. This study has been supported by Biotechnology and Biological Sciences Research Council grant (BB/P006493/1) to Z.Y. and a BBSRC equipment grant (BB/R01356X/1). T.Z. is supported by a Natural Science Foundation grant (32070685 and 31671370) and a grant from the Youth Innovation Promotion Association of Chinese Academy of Sciences (201901).

Data Availability

The C program for simulating under the MSC model with 3 species and 3 sequences is available from the authors upon request.

Appendix A. Proof of Theorem 1

(a) Define the random variable:

\begin{matrix} y = {\bar{z}}_{1} - \max ({\bar{z}}_{2}, {\bar{z}}_{3}) = {\bar{z}}_{1} - \frac{1}{2} ({\bar{z}}_{2} + {\bar{z}}_{3}) - \frac{1}{2} | {\bar{z}}_{2} - {\bar{z}}_{3} | \\ = y_{1} - | y_{2} |, \end{matrix}

(A1)

where $y_{1} = {\bar{z}}_{1} - \frac{1}{2} ({\bar{z}}_{2} + {\bar{z}}_{3}) \sim N (Δ μ, s_{1}^{2} / m)$ and $y_{2} = \frac{1}{2} ({\bar{z}}_{2} - {\bar{z}}_{3}) \sim N (0, s_{2}^{2} / m)$ , with

\begin{matrix} \frac{1}{m} s_{1}^{2} & = V ({\bar{z}}_{1}) + \frac{1}{4} V ({\bar{z}}_{2} + {\bar{z}}_{3}) - 2 Cov ({\bar{z}}_{1}, {\bar{z}}_{2}) \\ & = \frac{1}{m} [σ_{1}^{2} + \frac{1}{2} (σ_{2}^{2} + σ_{23}) - 2 σ_{12}], \\ \frac{1}{m} s_{2}^{2} & = \frac{1}{4} V ({\bar{z}}_{2} - {\bar{z}}_{3}) = \frac{1}{2 m} (σ_{2}^{2} - σ_{23}) . \end{matrix}

(A2)

Here, we treat ${\bar{z}}_{1}, {\bar{z}}_{2}$ and ${\bar{z}}_{3}$ as normal variables, according to the central limit theorem as $m \to \infty$ . As $Cov (y_{1}, y_{2}) = 0$ and y₁ and y₂ are jointly normal, they are independent. Then,

\begin{matrix} ζ & = P \{y_{1} < | y_{2} |\} \\ & = P \{y_{2} < 0, y_{1} < - y_{2}\} + P \{y_{2} > 0, y_{1} < y_{2}\} \\ & = 2 P \{y_{2} > 0, y_{1} < y_{2}\} \\ & = 2 \int_{0}^{\infty} ϕ (y_{2}; 0, \frac{1}{m} s_{2}^{2}) Φ (\frac{y_{2} - Δ μ}{s_{1} / \sqrt{m}}) d y_{2} \\ & = 2 \int_{0}^{\infty} ϕ (t) Φ (a t - b) d t, \end{matrix}

(A3)

where $a = s_{2} / s_{1}, b = Δ μ \sqrt{m} / s_{1}$ , and $ϕ (x; μ, σ^{2})$ is the probability density function (PDF) for $N (μ, σ^{2})$ , whereas $ϕ (x)$ is the PDF for $N (0, 1)$ . The last integral has been studied by Yang and Rodríguez (2013, SI) in a different context and can be written as:

ζ = \frac{1}{π} \int_{- \frac{π}{2}}^{{tan}^{- 1} a} \exp \{- \frac{b^{2}}{2 {(sin θ - a cos θ)}^{2}}\} d θ,

(A4)

or, by letting $t = a - tan θ$ , with $d θ = - \frac{1}{{(t - a)}^{2} + 1} d t$ , as:

ζ = \frac{1}{π} \int_{0}^{\infty} \frac{1}{{(t - a)}^{2} + 1} e^{- \frac{b^{2} [{(t - a)}^{2} + 1]}{2 t^{2}}} d t .

(A5)

Equations (A4) and (A5) can be calculated using Gaussian quadrature and match direct calculations using the CDF for the bivariate normal distribution for $({\bar{z}}_{1} - {\bar{z}}_{2}, {\bar{z}}_{1} - {\bar{z}}_{3})$ . When $Δ μ = 0$ , we have b = 0 and:

ζ = 2 \int_{0}^{\infty} ϕ (t) Φ (a t) d t = \frac{1}{2} + \frac{1}{π} {tan}^{- 1} a .

(A6)

In the symmetrical case of $Δ μ = 0, σ_{1}^{2} = σ_{2}^{2}$ , and $σ_{12} = σ_{23}$ (with $a = \frac{1}{\sqrt{3}}$ , b = 0), this gives $\frac{1}{2} + \frac{1}{π} {tan}^{- 1} (\frac{1}{\sqrt{3}}) = \frac{2}{3}$ , as expected. In this case the three variables ${\bar{z}}_{1}, {\bar{z}}_{2}$ and ${\bar{z}}_{3}$ have the same probability of being the greatest so that the error is $\frac{2}{3}$ .
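As a numerical sanity check on equations (A3), (A4), and (A6), the following minimal Python sketch evaluates ζ in two ways. It uses a plain midpoint rule rather than Gaussian quadrature, and the function names, the truncation point `t_max`, and the number of subintervals `n` are illustrative assumptions rather than part of the original analysis.

```python
import math

def norm_pdf(x):
    """Standard normal density, phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal CDF, Phi(x), via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def zeta_a3(a, b, t_max=12.0, n=20000):
    """Equation (A3): 2 * int_0^inf phi(t) Phi(a t - b) dt, truncated at t_max."""
    step = t_max / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * step  # midpoint of the i-th subinterval
        total += norm_pdf(t) * norm_cdf(a * t - b)
    return 2.0 * step * total

def zeta_a4(a, b, n=20000):
    """Equation (A4): angular form on the finite interval [-pi/2, atan(a)]."""
    lower, upper = -0.5 * math.pi, math.atan(a)
    step = (upper - lower) / n
    total = 0.0
    for i in range(n):
        theta = lower + (i + 0.5) * step
        # d vanishes only at the upper endpoint, which midpoints never hit.
        d = math.sin(theta) - a * math.cos(theta)
        total += math.exp(-b * b / (2.0 * d * d))
    return step * total / math.pi

# Arbitrary illustrative values of a and b.
print(zeta_a3(0.5, 1.0), zeta_a4(0.5, 1.0))   # the two forms should agree closely
print(zeta_a3(1.0 / math.sqrt(3.0), 0.0))     # symmetrical case, close to 2/3
```

The angular form (A4) is convenient for this kind of check because the integration interval is finite and the integrand is bounded by 1, so no truncation is needed.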

To avoid numerical integration, we note that $y_{2} \sim N (0, \frac{1}{2 m} (σ_{2}^{2} - σ_{23}))$ , and $| y_{2} |$ is a folded normal variable with mean and variance:

\begin{matrix} E (| y_{2} |) & = \sqrt{\frac{1}{m π} (σ_{2}^{2} - σ_{23})}, \\ V (| y_{2} |) & = (\frac{1}{2 m} - \frac{1}{m π}) (σ_{2}^{2} - σ_{23}) . \end{matrix}

(A7)

Thus,

\begin{matrix} E (y) & = Δ μ - \sqrt{\frac{1}{m π} (σ_{2}^{2} - σ_{23})} . \\ V (y) & = V ({\bar{z}}_{1}) + \frac{1}{4} V ({\bar{z}}_{2} + {\bar{z}}_{3}) + \frac{1}{4} V (| {\bar{z}}_{2} - {\bar{z}}_{3} |) \\ - Cov ({\bar{z}}_{1}, {\bar{z}}_{2} + {\bar{z}}_{3}) - Cov ({\bar{z}}_{1}, | {\bar{z}}_{2} - {\bar{z}}_{3} |) \\ + \frac{1}{2} Cov ({\bar{z}}_{2} + {\bar{z}}_{3}, | {\bar{z}}_{2} - {\bar{z}}_{3} |) \\ = V ({\bar{z}}_{1}) + \frac{1}{4} V ({\bar{z}}_{2} + {\bar{z}}_{3}) + \frac{1}{4} V (| {\bar{z}}_{2} - {\bar{z}}_{3} |) \\ - 2 Cov ({\bar{z}}_{1}, {\bar{z}}_{2}) - Cov ({\bar{z}}_{1}, | {\bar{z}}_{2} - {\bar{z}}_{3} |) \\ + Cov ({\bar{z}}_{2}, | {\bar{z}}_{2} - {\bar{z}}_{3} |) . \end{matrix}

(A8)

We have,

\begin{matrix} V ({\bar{z}}_{2} + {\bar{z}}_{3}) + V (| {\bar{z}}_{2} - {\bar{z}}_{3} |) = E {({\bar{z}}_{2} + {\bar{z}}_{3})}^{2} + E {({\bar{z}}_{2} - {\bar{z}}_{3})}^{2} \\ - E^{2} ({\bar{z}}_{2} + {\bar{z}}_{3}) - E^{2} (| {\bar{z}}_{2} - {\bar{z}}_{3} |) \\ = 4 E ({\bar{z}}_{2}^{2}) - 4 μ_{2}^{2} - \frac{4}{m π} (σ_{2}^{2} - σ_{23}) \\ = \frac{4}{m} σ_{2}^{2} - \frac{4}{m π} (σ_{2}^{2} - σ_{23}), \\ Cov ({\bar{z}}_{1}, | {\bar{z}}_{2} - {\bar{z}}_{3} |) = 0, \\ Cov ({\bar{z}}_{2}, | {\bar{z}}_{2} - {\bar{z}}_{3} |) = 0. \end{matrix}

(A9)

Collecting all terms in equation (A8), we get

V (y) = \frac{1}{m} [σ_{1}^{2} - 2 σ_{12} + σ_{2}^{2} - \frac{1}{π} (σ_{2}^{2} - σ_{23})]

(A10)

If we assume that y is approximately normally distributed, as in Zharkikh and Li (1992) and Yang (1996), then equation (6) follows. Note that equation (6) can also be written as $ζ_{N} = Φ (\frac{- b + a \sqrt{2 / π}}{\sqrt{1 + a^{2} (1 - \frac{2}{π})}})$ . Because $| y_{2} |$ has a folded normal distribution and is not a normal variable, the error of approximation of equation (6) does not approach zero when $m \to \infty$ . For instance, in the symmetrical case ( $Δ μ = 0, σ_{1}^{2} = σ_{2}^{2}$ , and $σ_{12} = σ_{23}$ ), equation (6) gives $Φ (\frac{1}{\sqrt{2 π - 1}}) = 0.66824$ , not $\frac{2}{3}$ . This level of accuracy is acceptable for our calculations for finite m in this paper, as the precise value of the error is unimportant if the error is nearly zero, but equation (6) may not give the correct asymptotic error rate when $m \to \infty$ (fig. 6, a = 10).
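The size of the discrepancy between the normal approximation and the exact value can be tabulated for b = 0, where equation (A6) gives the exact error in closed form. The sketch below is only an illustration: the function names are hypothetical, and it uses the (a, b) form of ζ_N quoted in this appendix.

```python
import math

def norm_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def zeta_normal(a, b):
    """Normal approximation zeta_N, written in terms of a and b."""
    num = -b + a * math.sqrt(2.0 / math.pi)
    den = math.sqrt(1.0 + a * a * (1.0 - 2.0 / math.pi))
    return norm_cdf(num / den)

def zeta_exact_b0(a):
    """Exact error for b = 0, equation (A6)."""
    return 0.5 + math.atan(a) / math.pi

# The values of a match those used in figure 6, plus the symmetrical case.
for a in (0.01, 0.1, 1.0 / math.sqrt(3.0), 1.0, 10.0):
    print(f"a={a:.4f}  zeta_N={zeta_normal(a, 0.0):.5f}  "
          f"exact={zeta_exact_b0(a):.5f}")
```

In the symmetrical case the first column reproduces the value 0.66824 quoted above, against the exact 2/3.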

Fig. 6. — Probit of error, $Φ^{- 1} (ζ)$ , plotted against b for different values of a. Six methods for calculating ζ are shown. The first five are, from top to bottom, $ζ_{U 1}$ (brown dashed line), $ζ_{U 2}$ (orange dotted, with k = 2 in eq. A27), Exact (black solid line), $ζ_{L 2}$ (blue dotted), and $ζ_{L 1}$ (purple dashed). Equation (6) (black dotted) is included as well.

(b) To study the asymptotic behavior of the error probability ζ when $m \to \infty$ , we derive bounds on ζ. From equation (A3),

\begin{matrix} ζ & = 2 \int_{0}^{\infty} ϕ (t) \int_{- \infty}^{a t - b} ϕ (x) d x d t \\ = 2 \int_{- \infty}^{\infty} \int_{- \infty}^{a t - b} ϕ (t) ϕ (x) d x d t - 2 \int_{- \infty}^{0} \int_{- \infty}^{a t - b} ϕ (t) ϕ (x) d x d t \\ = 2 S - 2 A, \end{matrix}

(A11)

where the first integral is $S = Φ (- h)$ , with $h = \frac{b}{\sqrt{1 + a^{2}}} = \frac{Δ μ \sqrt{m}}{\sqrt{σ_{1}^{2} - 2 σ_{12} + σ_{2}^{2}}}$ being the distance from the origin (0, 0) to the line $x = a t - b$ (fig. 7), and the second integral is:

\begin{matrix} 2 A & = 2 \int_{- \infty}^{0} \int_{- \infty}^{a t - b} ϕ (t) ϕ (x) d x d t \\ = \int_{- \infty}^{- b} ϕ (x) \int_{(x + b) / a}^{- (x + b) / a} ϕ (t) d t d x . \end{matrix}

(A12)

By considering the area of integration (fig. 7), it is obvious that:

0 < 2 A \leq Φ (- b) [1 - \frac{2}{π} {tan}^{- 1} a],

(A13)

where the equality holds when b = 0. Let,

ζ_{L 2} = 2 Φ (- h) - Φ (- b) [1 - \frac{2}{π} {tan}^{- 1} a] .

As $Φ (- b) < Φ (- h)$ , we have:

ζ_{L 2} > Φ (- h) [1 + \frac{2}{π} {tan}^{- 1} a] \equiv ζ_{L 1},

(A14)

Φ (- h) \leq ζ_{L 1} \leq ζ_{L 2} \leq ζ < 2 Φ (- h) \equiv ζ_{U 1},

(A15)

as in equation (7). The equality in the lower bounds is achieved at $b = 0$ . Note that the bounds apply to all a > 0 and b > 0. We use the bounds ( $ζ_{L 1}, ζ_{U 1}$ ) in Theorem 1 and in the calculation of table 1. The width of the interval is $Φ (- h) [1 - \frac{2}{π} {tan}^{- 1} a] \leq Φ (- h) \leq ζ$ , so that using any value inside the interval as the estimate will give an error of approximation that is smaller than the error probability ζ.
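The ordering in equation (A15) can be illustrated by simulation. The sketch below computes the three bounds and compares them with a Monte Carlo estimate of ζ, using the standardized form of equation (A3), in which ζ = P{x < a|t| − b} for independent standard normal x and t. The values a = 1 and b = 2, the sample size, and the seed are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def zeta_bounds(a, b):
    """Return (h, zeta_L1, zeta_L2, zeta_U1) from equations (A14) and (A15)."""
    h = b / math.sqrt(1.0 + a * a)
    angle_term = (2.0 / math.pi) * math.atan(a)
    z_l1 = norm_cdf(-h) * (1.0 + angle_term)
    z_l2 = 2.0 * norm_cdf(-h) - norm_cdf(-b) * (1.0 - angle_term)
    z_u1 = 2.0 * norm_cdf(-h)
    return h, z_l1, z_l2, z_u1

def zeta_monte_carlo(a, b, n=200000, seed=1):
    """Estimate zeta = P{x < a|t| - b} for independent standard normals x, t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        t = rng.gauss(0.0, 1.0)
        x = rng.gauss(0.0, 1.0)
        if x < a * abs(t) - b:
            hits += 1
    return hits / n

h, z_l1, z_l2, z_u1 = zeta_bounds(1.0, 2.0)
z_hat = zeta_monte_carlo(1.0, 2.0)
print(f"Phi(-h)={norm_cdf(-h):.5f}  L1={z_l1:.5f}  L2={z_l2:.5f}  "
      f"simulated={z_hat:.5f}  U1={z_u1:.5f}")
```

The simulated value carries Monte Carlo error of order n^(-1/2), so it is a plausibility check on the ordering rather than a precise evaluation of ζ.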

Note that the bounds $Φ (- h) < ζ < 2 Φ (- h)$ are also given by the definition $ζ = P \{ {\bar{z}}_{1} < {\bar{z}}_{2} \cup {\bar{z}}_{1} < {\bar{z}}_{3} \}$ , since

P ({\bar{z}}_{1} < {\bar{z}}_{2}) < ζ < P ({\bar{z}}_{1} < {\bar{z}}_{2}) + P ({\bar{z}}_{1} < {\bar{z}}_{3}) = 2 P ({\bar{z}}_{1} < {\bar{z}}_{2}),

(A16)

with $P ({\bar{z}}_{1} < {\bar{z}}_{2}) = Φ (- h)$ .

Next we consider the upper bound in equation (A15) when h or b is large. Note that:

\begin{matrix} Φ (- h) & = \int_{h}^{\infty} \frac{1}{\sqrt{2 π}} e^{- y^{2} / 2} d y \\ = \int_{0}^{\infty} \frac{1}{\sqrt{2 π}} e^{- \frac{1}{2} {(x + h)}^{2}} d x \\ = \frac{1}{\sqrt{2 π}} e^{- h^{2} / 2} \int_{0}^{\infty} e^{- (h x + \frac{1}{2} x^{2})} d x \\ = \frac{1}{h \sqrt{2 π}} e^{- h^{2} / 2} \int_{0}^{\infty} e^{- t} e^{- \frac{1}{2 h^{2}} t^{2}} d t \\ = \frac{1}{h \sqrt{2 π}} e^{- h^{2} / 2} B, \end{matrix}

(A17)

where $B = \int_{0}^{\infty} e^{- t} e^{- \frac{1}{2 h^{2}} t^{2}} d t < \int_{0}^{\infty} e^{- t} d t = 1$ . For large h,

\begin{matrix} B & > \int_{0}^{\sqrt{h}} e^{- t} e^{- \frac{1}{2 h^{2}} t^{2}} d t > e^{- \frac{1}{2 h}} \int_{0}^{\sqrt{h}} e^{- t} d t \\ = e^{- \frac{1}{2 h}} (1 - e^{- \sqrt{h}}) = 1 - \frac{1}{2 h} + o (\frac{1}{h}) . \end{matrix}

(A18)

Thus, for large h, $Φ (- h)$ is bounded by:

(1 - \frac{1}{2 h} + o (\frac{1}{h})) \frac{1}{h \sqrt{2 π}} e^{- h^{2} / 2} < Φ (- h) < \frac{1}{h \sqrt{2 π}} e^{- h^{2} / 2},

(A19)

Φ (- h) = \frac{1}{h \sqrt{2 π}} e^{- h^{2} / 2} + O (\frac{1}{h^{2}} e^{- h^{2} / 2}) .

(A20)

Let $ɛ > 0$ such that $Φ (- (h + ɛ)) = α Φ (- h)$ for $0 < α < 1$ ; in other words, ɛ is the offset at the probit level to reduce the probability by a fraction. From equation (A20),

\begin{matrix} \frac{1}{(h + ɛ) \sqrt{2 π}} e^{- \frac{1}{2} (h^{2} + 2 ɛ h + ɛ^{2})} + O (\frac{1}{{(h + ɛ)}^{2}} e^{- \frac{1}{2} {(h + ɛ)}^{2}}) \\ = \frac{α}{h \sqrt{2 π}} e^{- \frac{1}{2} h^{2}} + O (\frac{1}{h^{2}} e^{- \frac{1}{2} h^{2}}) . \end{matrix}

(A21)

Thus,

\frac{1}{h + ɛ} e^{- \frac{1}{2} (h^{2} + 2 ɛ h + ɛ^{2})} = \frac{α}{h} e^{- \frac{1}{2} h^{2}} + O (\frac{1}{h^{2}} e^{- \frac{1}{2} h^{2}}),

(A22)

which gives $ɛ = - \frac{1}{h} log α + o (\frac{1}{h})$ or

Φ (- h + \frac{1}{h} log α + o (\frac{1}{h})) = α Φ (- h) .

(A23)

In particular, for $α = \frac{1}{2}$ , we have:

Φ (- (h + \frac{1}{h} log 2 + o (\frac{1}{h}))) = \frac{1}{2} Φ (- h) .

(A24)

Thus, for large h, we have:

2 Φ (- h) = Φ (- h + \frac{1}{h} log 2 + o (\frac{1}{h})),

(A25)

as in equation (7). It may be noteworthy that for large h, a very small change at the probit level, of about $\frac{1}{h} log 2$ , changes the probability by a factor of 2.
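The tail results in equations (A19) and (A25) are easy to check numerically. The sketch below is a minimal illustration with hypothetical function names. It evaluates Φ(−h) through the complementary error function, which stays accurate far into the tail, and prints the ratio of Φ(−h) to the upper bound in (A19) together with the ratio Φ(−h + log 2 / h) / Φ(−h), which should approach 2 as h grows.

```python
import math

def norm_sf(h):
    """Upper tail probability Phi(-h), computed with erfc for accuracy."""
    return 0.5 * math.erfc(h / math.sqrt(2.0))

def mills_upper(h):
    """Upper bound in (A19): exp(-h^2 / 2) / (h sqrt(2 pi)), for h > 0."""
    return math.exp(-0.5 * h * h) / (h * math.sqrt(2.0 * math.pi))

def doubling_ratio(h):
    """Phi(-h + log(2) / h) / Phi(-h), the ratio considered in (A25)."""
    return norm_sf(h - math.log(2.0) / h) / norm_sf(h)

for h in (2.0, 5.0, 10.0, 20.0):
    print(f"h={h:5.1f}  Phi(-h)/bound={norm_sf(h) / mills_upper(h):.5f}  "
          f"ratio={doubling_ratio(h):.5f}")
```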

A tighter lower bound for 2 A than the zero in equation (A13) is:

2 A > \frac{φ}{π} exp {- \frac{b^{2} (a^{2} k^{2} + 1)}{2 a^{2} {(k - 1)}^{2}}},

(A26)

where $φ = {tan}^{- 1} \frac{1}{k a}$ with k > 1 (fig. 7). Thus, we have a tighter pair of bounds on ζ,

\begin{matrix} 2 Φ (- h) - Φ (- b) [1 - \frac{2}{π} {tan}^{- 1} a] \leq ζ < \\ 2 Φ (- h) - \frac{1}{π} {tan}^{- 1} \frac{1}{k a} exp {- \frac{b^{2} (a^{2} k^{2} + 1)}{2 a^{2} {(k - 1)}^{2}}}, \end{matrix}

(A27)

where k > 1. We write this pair of bounds as $ζ_{L 2} < ζ < ζ_{U 2}$ . We have $Φ (- b) \leq Φ (- h) \leq ζ_{L 1} \leq ζ_{L 2} \leq ζ < ζ_{U 2} < ζ_{U 1} = 2 Φ (- h)$ . These bounds, as well as the exact value and equation (6), are plotted against b in figure 6 for $a = 0.01, 0.1, 1$ and 10.
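The two bounds on 2A in equations (A13) and (A26) can be compared with a direct evaluation of equation (A12). In the sketch below, the substitution u = −(x + b) turns (A12) into a one-dimensional integral over u ≥ 0, which is evaluated with a midpoint rule. The function names, the truncation point, the grid size, and the choice k = 2 are assumptions made for illustration.

```python
import math

def norm_pdf(x):
    """Standard normal density, phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def two_a(a, b, u_max=12.0, n=20000):
    """2A from equation (A12), after substituting u = -(x + b) >= 0."""
    step = u_max / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * step
        # Inner integral of phi(t) over (-u/a, u/a) equals 2 Phi(u/a) - 1.
        total += norm_pdf(u + b) * (2.0 * norm_cdf(u / a) - 1.0)
    return step * total

def two_a_upper(a, b):
    """Upper bound on 2A in equation (A13)."""
    return norm_cdf(-b) * (1.0 - (2.0 / math.pi) * math.atan(a))

def two_a_lower(a, b, k=2.0):
    """Lower bound on 2A in equation (A26); requires k > 1."""
    angle = math.atan(1.0 / (k * a))
    expo = b * b * (a * a * k * k + 1.0) / (2.0 * a * a * (k - 1.0) ** 2)
    return angle / math.pi * math.exp(-expo)

for a in (0.1, 1.0, 10.0):
    for b in (0.5, 2.0):
        print(f"a={a:5.1f} b={b:3.1f}  lower={two_a_lower(a, b):.3e}  "
              f"2A={two_a(a, b):.3e}  upper={two_a_upper(a, b):.3e}")
```

Subtracting either bound on 2A from 2Φ(−h) gives the corresponding bound on ζ in equation (A27).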

Appendix B. The asymptotics of ML species tree estimation

The proof below borrows heavily from White (1982), Dawid (2011), and Yang and Zhu (2018). Let $S_{j}, j = 1, 2, 3$ be the three species trees with parameters $θ_{j}$ . Note that S₁ is the true model, while S₂ and S₃ are misspecified models. Let the data at m loci be $x = \{ x_{i} \}, i = 1, \dots, m$ . The log-likelihood function is $ℓ_{j} (θ_{j}) = log f (x | S_{j}, θ_{j})$ . We also define $l_{j} (θ_{j}) = log f (x | S_{j}, θ_{j})$ for one data point (that is, site-pattern counts at any single locus), $x \equiv (x_{i 0}, x_{i 1}, x_{i 2}, x_{i 3}, x_{i 4})$ . When the number of loci $m \to \infty$ , the MLE ${\hat{θ}}_{j} \to θ_{j}^{*}$ . We assume that both ${\hat{θ}}_{j}$ and $θ_{j}^{*}$ are interior points of the parameter space. Whether ${\hat{θ}}_{j}$ is inside the parameter space or at its boundary should not affect the asymptotic rate of convergence. Here, $θ_{1}^{*}$ for the true species tree S₁ is the true parameter value, whereas $θ_{2}^{*}$ for S₂ (as well as $θ_{3}^{*}$ for S₃) is the pseudotrue parameter value, which minimizes the Kullback–Leibler distance from the misspecified model S₂ to the true model S₁:

\begin{matrix} D_{12} & = \int f (x | S_{1}, θ_{1}^{*}) log \frac{f (x | S_{1}, θ_{1}^{*})}{f (x | S_{2}, θ_{2}^{*})} dx \\ & = E \{ l_{1} (θ_{1}^{*}) - l_{2} (θ_{2}^{*}) \}, \end{matrix}

(A28)

where the expectation is over the true distribution $f (x | S_{1}, θ_{1}^{*})$ . D₁₃ is defined similarly, with $D_{13} = D_{12}$ .

We consider the log-likelihood ratio, $ℓ_{j} (\hat{θ}) - ℓ_{j} (θ^{*})$ , given the data (x), for any species tree j. We drop the subscript j for clarity. As in White (1982) and Dawid (2011), we define two matrices:

\begin{matrix} I (θ) & = E {\nabla log f (x | θ) \cdot \nabla log f {(x | θ)}^{T}} \\ = E {l' (θ) {(l^{'} (θ))}^{T}}, \\ J (θ) & = E {- \nabla^{2} log f (x | θ)} = E {- l ″ (θ)}, \end{matrix}

(A29)

where the superscript T stands for transpose and where the expectation is over the true distribution, and $\nabla$ and $\nabla^{2}$ are the first and second derivatives with respect to $θ$ .

Apply Taylor expansion to the log likelihood around the MLE $\hat{θ}$ :

ℓ (θ) \approx ℓ (\hat{θ}) + ℓ' (\hat{θ}) (θ - \hat{θ}) + \frac{1}{2} {(θ - \hat{θ})}^{T} ℓ ″ (\hat{θ}) (θ - \hat{θ}),

(A30)

where both the gradient and the Hessian are evaluated at the MLE ( $\hat{θ}$ ), with $ℓ' (\hat{θ}) = 0$ . Setting $θ = θ^{*}$ , we have:

ℓ (\hat{θ}) \approx ℓ (θ^{*}) + \frac{1}{2} {(\hat{θ} - θ^{*})}^{T} (- ℓ ″ (\hat{θ})) (\hat{θ} - θ^{*}) .

(A31)

Apply Taylor expansion to the derivative $ℓ' (θ)$ around the MLE $\hat{θ}$ and let $θ = θ^{*}$ , and we have:

ℓ' (θ) \approx ℓ ″ (\hat{θ}) (θ - \hat{θ}),

(A32)

and

\hat{θ} - θ^{*} \approx - ℓ ″ {(\hat{θ})}^{- 1} ℓ' (θ^{*}) .

(A33)

Each of $ℓ' (θ^{*})$ and $ℓ ″ (\hat{θ})$ is a sum of m i.i.d. elements. When $m \to \infty, - ℓ ″ (\hat{θ}) \approx m E \{ - l ″ (θ^{*}) \} = m J^{*}$ , with $J^{*} = J (θ^{*})$ (eq. A29). Furthermore,

\begin{matrix} E {ℓ' (θ^{*})} & = 0, \\ V {ℓ' (θ^{*})} & = m V {l' (θ^{*})} = m I^{*}, \end{matrix}

(A34)

where $I^{*} = I (θ^{*})$ (eq. A29). Thus,

\sqrt{m} (\hat{θ} - θ^{*}) \overset{P}{\to} N (0, {(J^{* - 1})}^{T} I^{*} (J^{* - 1})) .

(A35)

Thus, $\hat{θ} = θ^{*} + O_{p} (m^{- 1 / 2})$ . Equation (A31) becomes:

\begin{matrix} ℓ (\hat{θ}) & \approx ℓ (θ^{*}) + \frac{1}{2} {\{ \sqrt{m} (\hat{θ} - θ^{*}) \}}^{T} J^{*} \{ \sqrt{m} (\hat{θ} - θ^{*}) \} \\ & = ℓ (θ^{*}) + O_{p} (1) . \end{matrix}

(A36)

Equations (A29–A36) apply to all three species trees. In the case of S₁ (the true model), $J^{*} = I^{*}$ , the Fisher information matrix, and $ℓ (\hat{θ}) - ℓ (θ^{*}) \sim \frac{1}{2} χ_{d}^{2}$ . For S₂ or S₃, $ℓ (\hat{θ}) - ℓ (θ^{*})$ is a quadratic form of normal variates and is a mixture of noncentral $χ^{2}$ variables with mean $\frac{1}{2} tr (I^{*} J^{* - 1})$ and variance $\frac{1}{2} tr ({(I^{*} J^{* - 1})}^{2})$ , both of O(1).
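The sandwich variance in equation (A35) and the O(1) mean of the log-likelihood gain can be illustrated with a deliberately simple misspecified model. The sketch below is not the MSC model: it is a hypothetical toy example in which data from N(0, 4) are fitted with the working model N(θ, 1). For this working model l′(θ) = x − θ and l″(θ) = −1, so J* = 1, I* = V(x) = 4, and the pseudotrue value is θ* = 0. The sample sizes and seed are arbitrary.

```python
import random
import statistics

def simulate(m=200, reps=2000, true_sd=2.0, seed=1):
    """Fit the working model N(theta, 1) to data drawn from N(0, true_sd^2).

    Returns the sample variance of sqrt(m) * (theta_hat - theta*) and the
    sample mean of the log-likelihood gain l(theta_hat) - l(theta*).
    """
    rng = random.Random(seed)
    scaled_errors = []
    loglik_gains = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, true_sd) for _ in range(m)]
        theta_hat = sum(xs) / m  # MLE under the working model is the mean
        scaled_errors.append(m ** 0.5 * theta_hat)
        # For this working model the gain is exactly m * theta_hat^2 / 2.
        loglik_gains.append(0.5 * m * theta_hat ** 2)
    return statistics.variance(scaled_errors), statistics.mean(loglik_gains)

var_hat, gain_mean = simulate()
# Compare with J^-1 I J^-1 = 4 and tr(I J^-1) / 2 = 2 for this toy model.
print(var_hat, gain_mean)
```

Had the working variance matched the true variance, I* and J* would coincide and the mean gain would be 1/2, the mean of a (1/2)χ² variable with one degree of freedom.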

Now consider using ${\bar{z}}_{j} \equiv \frac{1}{m} ℓ_{j} ({\hat{θ}}_{j})$ , j = 1, 2, 3, to compare species trees S₁, S₂, and S₃. We have:

\begin{matrix} E ({\bar{z}}_{j}) \approx E (l_{j} (θ_{j}^{*})) \equiv μ_{j}, \\ V ({\bar{z}}_{j}) \approx \frac{1}{m} V (l_{j} (θ_{j}^{*})) \equiv \frac{1}{m} σ_{j j}, \\ Cov ({\bar{z}}_{j}, {\bar{z}}_{k}) \approx \frac{1}{m} Cov (l_{j} (θ_{j}^{*}), l_{k} (θ_{k}^{*})) \equiv \frac{1}{m} σ_{j k} . \end{matrix}

(A37)

Thus, when the number of loci $m \to \infty$ , $\{ {\bar{z}}_{j} \} = \{ \frac{1}{m} ℓ_{j} ({\hat{θ}}_{j}) \}$ have means $(μ_{1}, μ_{2}, μ_{2})$ and variance/covariance matrix $\frac{1}{m} Σ$ , where $Σ = \{ σ_{j k} \}$ is O(1) and independent of m. The error of the ML method, $P \{ ℓ_{1} ({\hat{θ}}_{1}) < \max (ℓ_{2} ({\hat{θ}}_{2}), ℓ_{3} ({\hat{θ}}_{3})) \} = P \{ {\bar{z}}_{1} < \max ({\bar{z}}_{2}, {\bar{z}}_{3}) \}$ , is then given by Theorem 1 as equation (11).
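To make the dependence on the number of loci concrete, the sketch below maps per-locus moments to h and to the bounds Φ(−h) < ζ < 2Φ(−h) of Appendix A. Here Δμ = μ₁ − μ₂, which by equations (A28) and (A37) equals D₁₂. The numerical values of Δμ and of the entries of Σ are hypothetical and chosen only for illustration, as is the function name.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def error_bounds(delta_mu, s11, s22, s12, m):
    """Return (h, Phi(-h), 2 Phi(-h)) for m loci.

    h = delta_mu * sqrt(m) / sqrt(s11 - 2 * s12 + s22), as defined after (A11).
    """
    h = delta_mu * math.sqrt(m) / math.sqrt(s11 - 2.0 * s12 + s22)
    lower = STD_NORMAL.cdf(-h)
    return h, lower, 2.0 * lower

# Hypothetical per-locus moments, not estimates from any data set.
delta_mu, s11, s22, s12 = 0.01, 1.0, 1.0, 0.9
for m in (100, 400, 1600, 6400):
    h, lower, upper = error_bounds(delta_mu, s11, s22, s12, m)
    print(f"m={m:5d}  h={h:.4f}  probit(lower)={STD_NORMAL.inv_cdf(lower):.4f}  "
          f"probit(upper)={STD_NORMAL.inv_cdf(upper):.4f}")
```

Quadrupling m doubles h, so the probit of the lower bound falls linearly in the square root of m, and by equation (A25) the probit of the upper bound differs from it by about (log 2)/h when h is large.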

References

Angelis K, dos Reis M. 2015. The impact of ancestral population size and incomplete lineage sorting on Bayesian estimation of species divergence times. Curr Zool. 61(5):874–885.
Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. 2012. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 29(8):1917–1932.
Burgess R, Yang Z. 2008. Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Mol Biol Evol. 25(9):1979–1994.
Chifman J, Kubatko L. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317–3324.
Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, Warnow T. 2015. A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 16(Suppl 10):S2.
Dalquen D, Zhu T, Yang Z. 2017. Maximum likelihood implementation of an isolation-with-migration model for three species. Syst Biol. 66(3):379–398.
Dawid A. 2011. Posterior model probabilities. In: Bandyopadhyay PS, Forster M, editors. Philosophy of statistics. New York: Elsevier. p. 607–630.
Degnan JH, Rosenberg NA. 2006. Discordance of species trees with their most likely gene trees. PLoS Genet. 2(5):e68.
Degnan JH, Salter LA. 2005. Gene tree distributions under the coalescent process. Evolution 59(1):24–37.
Edwards SV. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63(1):1–19.
Edwards SV, Xi Z, Janke A, Faircloth BC, McCormack JE, Glenn TC, Zhong B, Wu S, Lemmon EM, Lemmon AR, et al. 2016. Implementing and testing the multispecies coalescent model: a valuable paradigm for phylogenomics. Mol Phylogenet Evol. 94(Pt A):447–462.
Fleiss JL, Levin B, Paik MC. 2003. Statistical methods for rates and proportions. 3rd ed. New York: John Wiley and Sons.
Heled J, Drummond AJ. 2010. Bayesian inference of species trees from multilocus data. Mol Biol Evol. 27(3):570–580.
Hudson R. 1983. Testing the constant-rate neutral allele model with protein sequence data. Evolution 37(1):203–217.
Jukes T, Cantor C. 1969. Evolution of protein molecules. In: Munro H, editor. Mammalian protein metabolism. New York: Academic Press. p. 21–123.
Kubatko L. 2019. The multispecies coalescent. In: Balding D, Moltke I, Marioni J, editors. Handbook of statistical genomics. 4th ed. New York: Wiley. p. 219–245.
Lanier HC, Knowles LL. 2012. Is recombination a problem for species-tree analyses? Syst Biol. 61(4):691–701.
Leaché AD, Oaks J. 2017. The utility of single nucleotide polymorphism (SNP) data in phylogenetics. Annu Rev Ecol Evol Syst. 48(1):69–84.
Leaché AD, Rannala B. 2011. The accuracy of species tree estimation under simulation: a comparison of methods. Syst Biol. 60(2):126–137.
Liu L, Pearl DK. 2007. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 56(3):504–514.
Liu L, Yu L, Edwards SV. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 10(1):302.
Liu L, Yu L, Pearl DK, Edwards SV. 2009. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 58(5):468–477.
Lohse K, Chmelik M, Martin SH, Barton NH. 2016. Efficient strategies for calculating blockwise likelihoods under the coalescent. Genetics 202(2):775–786.
Long C, Kubatko L. 2018. The effect of gene flow on coalescent-based species-tree inference. Syst Biol. 67(5):770–785.
Maddison W. 1997. Gene trees in species trees. Syst Biol. 46(3):523–536.
Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30(17):i541–548.
Nichols R. 2001. Gene trees and species trees are not the same. Trends Ecol Evol. 16(7):358–364.
Ogilvie HA, Bouckaert RR, Drummond AJ. 2017. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Mol Biol Evol. 34(8):2101–2114.
Pamilo P, Nei M. 1988. Relationships between gene trees and species trees. Mol Biol Evol. 5(5):568–583.
Rannala B, Edwards S, Leaché AD, Yang Z. 2020. The multispecies coalescent model and species tree inference. In: Scornavacca C, Delsuc F, Galtier N, editors. Phylogenetics in the genomic era. Book Section 3.3. No Commercial Publisher. p. 1–20.
Rannala B, Yang Z. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164(4):1645–1656.
Rannala B, Yang Z. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Syst Biol. 66(5):823–842.
Roch S, Steel M. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 100:56–62.
Shi C, Yang Z. 2018. Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of gibbons. Mol Biol Evol. 35(1):159–179.
Susko E. 2011. Large sample approximations of probabilities of correct evolutionary tree estimation and biases of maximum likelihood estimation. Stat Appl Genet Mol Biol. 10(1):10.
Szöllősi GJ, Tannier E, Daubin V, Boussau B. 2015. The inference of gene trees with species trees. Syst Biol. 64(1):e42–e62.
Takahata N, Satta Y, Klein J. 1995. Divergence time and population size in the lineage leading to modern humans. Theor Popul Biol. 48(2):198–221.
Tian Y, Kubatko LS. 2016. Distribution of coalescent histories under the coalescent model with gene flow. Mol Phylogenet Evol. 105:177–192.
Tiley GP, Poelstra JP, dos Reis M, Yang Z, Yoder AD. 2020. Molecular clocks without rocks: new solutions for old problems. Trends Genet. 36(11):845–856.
White H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50(1):1–25.
Wu Y. 2012. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66(3):763–775.
Xu B, Yang Z. 2016. Challenges in species tree estimation under the multispecies coalescent model. Genetics 204(4):1353–1368.
Yang Z. 1994a. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 39(3):306–314.
Yang Z. 1994b. Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Syst Biol. 43(3):329–342.
Yang Z. 1996. Phylogenetic analysis using parsimony and likelihood methods. J Mol Evol. 42(2):294–307.
Yang Z. 1997. How often do wrong models produce better phylogenies? Mol Biol Evol. 14(1):105–108.
Yang Z. 2000. Complexity of the simplest phylogenetic estimation problem. Proc R Soc Lond B. 267(1439):109–116.
Yang Z. 2002. Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci. Genetics 162(4):1811–1823.
Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24(8):1586–1591.
Yang Z. 2014. Molecular evolution: a statistical approach. Oxford (England): Oxford University Press.
Yang Z. 2015. The BPP program for species tree estimation and species delimitation. Curr Zool. 61(5):854–865.
Yang Z, Rannala B. 2014. Unguided species delimitation using DNA sequence data from multiple loci. Mol Biol Evol. 31(12):3125–3135.
Yang Z, Rodríguez CE. 2013. Searching for efficient Markov chain Monte Carlo proposal kernels. Proc Natl Acad Sci USA. 110(48):19307–19312.
Yang Z, Zhu T. 2018. Bayesian selection of misspecified models is overconfident and may cause spurious posterior probabilities for phylogenetic trees. Proc Natl Acad Sci USA. 115(8):1854–1859.
Zharkikh A, Li W-H. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol Biol Evol. 9:1119–1147.
Zhu T, Yang Z. 2012. Maximum likelihood implementation of an isolation-with-migration model with three species for testing speciation with gene flow. Mol Biol Evol. 29(10):3131–3142.

Data Availability Statement

The C program for simulating under the MSC model with 3 species and 3 sequences is available from the authors upon request.

[msab009-B1] Angelis K, dos Reis M.. 2015. The impact of ancestral population size and incomplete lineage sorting on Bayesian estimation of species divergence times. Curr Zool. 61(5):874–885. [Google Scholar]

[msab009-B2] Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A.. 2012. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 29(8):1917–1932. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B3] Burgess R, Yang Z.. 2008. Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Mol Biol Evol. 25(9):1979–1994. [DOI] [PubMed] [Google Scholar]

[msab009-B4] Chifman J, Kubatko L.. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317–3324. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B5] Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, Warnow T.. 2015. A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 16(Suppl 10):S2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B6] Dalquen D, Zhu T, Yang Z.. 2017. Maximum likelihood implementation of an isolation-with-migration model for three species. Syst Biol. 66(3):379–398. [DOI] [PubMed] [Google Scholar]

[msab009-B7] Dawid A.2011. Posterior model probabilities. In: Bandyopadhyay PS, Forster M, editors. Philosophy of statistics.New York: Elsevier. p. 607–630. [Google Scholar]

[msab009-B8] Degnan JH, Rosenberg NA.. 2006. Discordance of species trees with their most likely gene trees. PLoS Genet. 2(5):e68. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B9] Degnan JH, Salter LA.. 2005. Gene tree distributions under the coalescent process. Evolution 59(1):24–37. [PubMed] [Google Scholar]

[msab009-B10] Edwards SV.2009. Is a new and general theory of molecular systematics emerging? Evolution 63(1):1–19. [DOI] [PubMed] [Google Scholar]

[msab009-B11] Edwards SV, Xi Z, Janke A, Faircloth BC, McCormack JE, Glenn TC, Zhong B, Wu S, Lemmon EM, Lemmon AR, et al. 2016. Implementing and testing the multispecies coalescent model a valuable paradigm for phylogenomics. Mol Phylogenet Evol. 94(Pt A):447–462. [DOI] [PubMed] [Google Scholar]

[msab009-B12] Fleiss JL, Levin B, Palk MC.. 2003. Statistical methods for rates and proportions.New York: John Wiley and Sons.3rd ed. [Google Scholar]

[msab009-B13] Heled J, Drummond AJ.. 2010. Bayesian inference of species trees from multilocus data. Mol Biol Evol. 27(3):570–580. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B14] Hudson R.1983. Testing the constant-rate neutral alele model with protein sequence data. Evolution 37(1):203–217. [DOI] [PubMed] [Google Scholar]

[msab009-B15] Jukes T, Cantor C.. 1969. Evolution of protein molecules.In: Munro H, editor. Mammalian protein metabolism.New York: Academic Press. p. 21–123. [Google Scholar]

[msab009-B16] Kubatko L.2019. The multispecies coalescent.In: Balding D, Moltke I, Marioni J, editors. Handbook of statistical genomics.4th ed.New York: Wiley. p. 219–245. [Google Scholar]

[msab009-B17] Lanier HC, Knowles LL.. 2012. Is recombination a problem for species-tree analyses? Syst Biol. 61(4):691–701. [DOI] [PubMed] [Google Scholar]

[msab009-B18] Leaché AD, Oaks J.. 2017. The utility of single nucleotide polymorphism (SNP) data in phylogenetics. Annu Rev Ecol Evol Syst. 48(1):69–84. [Google Scholar]

[msab009-B19] Leaché AD, Rannala B.. 2011. The accuracy of species tree estimation under simulation: a comparison of methods. Syst Biol. 60(2):126–137. [DOI] [PubMed] [Google Scholar]

[msab009-B20] Liu L, Pearl DK.. 2007. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 56(3):504–514. [DOI] [PubMed] [Google Scholar]

[msab009-B21] Liu L, Yu L, Edwards SV.. 2010. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol. 10(1):302. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B22] Liu L, Yu L, Pearl DK, Edwards SV.. 2009. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 58(5):468–477. [DOI] [PubMed] [Google Scholar]

[msab009-B23] Lohse K, Chmelik M, Martin SH, Barton NH.. 2016. Efficient strategies for calculating blockwise likelihoods under the coalescent. Genetics 202(2):775–786. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B24] Long C, Kubatko L.. 2018. The effect of gene flow on coalescent-based species-tree inference. Syst Biol. 67(5):770–785. [DOI] [PubMed] [Google Scholar]

[msab009-B25] Maddison W.1997. Gene trees in species trees. Syst Biol. 46(3):523–536. [Google Scholar]

[msab009-B26] Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T.. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30(17):i541–548. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B27] Nichols R.2001. Gene trees and species trees are not the same. Trends Ecol Evol. 16(7):358–364. [DOI] [PubMed] [Google Scholar]

[msab009-B28] Ogilvie HA, Bouckaert RR, Drummond AJ.. 2017. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Mol Biol Evol. 34(8):2101–2114. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B29] Pamilo P, Nei M.. 1988. Relationships between gene trees and species trees. Mol Biol Evol. 5(5):568–583. [DOI] [PubMed] [Google Scholar]

[msab009-B30] Rannala B, Edwards S, Leaché AD, Yang Z.. 2020. The multispecies coalescent model and species tree inference. In: Scornavacca C, Delsuc F, Galtier N, editors. Phylogenetics in the genomic era. Book Section 3.3.No Commercial Publisher. p. 1–20. [Google Scholar]

[msab009-B31] Rannala B, Yang Z.. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164(4):1645–1656. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B32] Rannala B, Yang Z.. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Syst Biol. 66(5):823–842. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B33] Roch S, Steel M.. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor Popul Biol. 100:56–62. [DOI] [PubMed] [Google Scholar]

[msab009-B34] Shi C, Yang Z.. 2018. Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of Gibbons. Mol Biol Evol. 35(1):159–179. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B35] Susko E.2011. Large sample approximations of probabilities of correct evolutionary tree estimation and biases of maximum likelihood estimation. Stat Appl Genet Mol Biol. 10(1):10. [DOI] [PubMed] [Google Scholar]

[msab009-B36] Szöllősi GJ, Tannier E, Daubin V, Boussau B.. 2015. The inference of gene trees with species trees. Syst Biol. 64(1):e42–e62. [DOI] [PMC free article] [PubMed] [Google Scholar]

[msab009-B37] Takahata N, Satta Y, Klein J.. 1995. Divergence time and population size in the lineage leading to modern humans. Theor Popul Biol. 48(2):198–221. [DOI] [PubMed] [Google Scholar]

[msab009-B38] Tian Y, Kubatko LS.. 2016. Distribution of coalescent histories under the coalescent model with gene flow. Mol Phylogenet Evol. 105:177–192. [DOI] [PubMed] [Google Scholar]

Tiley GP, Poelstra JP, dos Reis M, Yang Z, Yoder AD. 2020. Molecular clocks without rocks: new solutions for old problems. Trends Genet. 36(11):845–856.

White H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50(1):1–25.

Wu Y. 2012. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66(3):763–775.

Xu B, Yang Z. 2016. Challenges in species tree estimation under the multispecies coalescent model. Genetics 204(4):1353–1368.

Yang Z. 1994a. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 39(3):306–314.

Yang Z. 1994b. Statistical properties of the maximum likelihood method of phylogenetic estimation and comparison with distance matrix methods. Syst Biol. 43(3):329–342.

Yang Z. 1996. Phylogenetic analysis using parsimony and likelihood methods. J Mol Evol. 42(2):294–307.

Yang Z. 1997. How often do wrong models produce better phylogenies? Mol Biol Evol. 14(1):105–108.

Yang Z. 2000. Complexity of the simplest phylogenetic estimation problem. Proc R Soc Lond B. 267(1439):109–116.

Yang Z. 2002. Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci. Genetics 162(4):1811–1823.

Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24(8):1586–1591.

Yang Z. 2014. Molecular evolution: a statistical approach. Oxford (England): Oxford University Press.

Yang Z. 2015. The BPP program for species tree estimation and species delimitation. Curr Zool. 61(5):854–865.

Yang Z, Rannala B. 2014. Unguided species delimitation using DNA sequence data from multiple loci. Mol Biol Evol. 31(12):3125–3135.

Yang Z, Rodríguez CE. 2013. Searching for efficient Markov chain Monte Carlo proposal kernels. Proc Natl Acad Sci USA. 110(48):19307–19312.

Yang Z, Zhu T. 2018. Bayesian selection of misspecified models is overconfident and may cause spurious posterior probabilities for phylogenetic trees. Proc Natl Acad Sci USA. 115(8):1854–1859.

Zharkikh A, Li W-H. 1992. Statistical properties of bootstrap estimation of phylogenetic variability from nucleotide sequences. I. Four taxa with a molecular clock. Mol Biol Evol. 9:1119–1147.

Zhu T, Yang Z. 2012. Maximum likelihood implementation of an isolation-with-migration model with three species for testing speciation with gene flow. Mol Biol Evol. 29(10):3131–3142.
