Systematic Biology. 2016 Dec 24;66(2):283–298. doi: 10.1093/sysbio/syw097

Displayed Trees Do Not Determine Distinguishability Under the Network Multispecies Coalescent

Sha Zhu 1, James H Degnan 2,*
PMCID: PMC5837799  PMID: 27780899

Abstract

Recent work in estimating species relationships from gene trees has included inferring networks assuming that past hybridization has occurred between species. Probabilistic models using the multispecies coalescent can be used in this framework for likelihood-based inference of both network topologies and parameters, including branch lengths and hybridization parameters. A difficulty for such methods is that it is not always clear whether, or to what extent, networks are identifiable—that is whether there could be two distinct networks that lead to the same distribution of gene trees. For cases in which incomplete lineage sorting occurs in addition to hybridization, we demonstrate a new representation of the species network likelihood that expresses the probability distribution of the gene tree topologies as a linear combination of gene tree distributions given a set of species trees. This representation makes it clear that in some cases in which two distinct networks give the same distribution of gene trees when sampling one allele per species, the two networks can be distinguished theoretically when multiple individuals are sampled per species. This result means that network identifiability is not only a function of the trees displayed by the networks but also depends on allele sampling within species. We additionally give an example in which two networks that display exactly the same trees can be distinguished from their gene trees even when there is only one lineage sampled per species.

Keywords: gene tree, hybridization, identifiability, maximum likelihood, species tree, phylogeny


Hybridization between distinct species or populations is often represented using a rooted phylogenetic network rather than a tree (Huson et al. 2010; Bapteste et al. 2013; Nakhleh 2013). In much of the literature on networks representing hybridization, there has been interest in which trees are displayed by a network, where a network displays a particular tree if removing some subset of hybridization edges results in the given tree (Huson and Scornavacca 2011; Morrison 2011). For example, several papers investigate finding a network with the minimum number of hybridization events that displays two conflicting input trees (Albrecht et al. 2012; Baroni et al. 2006; Bordewich and Semple 2007; Chen and Wang 2010; van Iersel et al. 2014). These input trees are often described as gene trees, and could arise, for example, from estimating trees from sequences from two different loci (e.g., one mitochondrial and one nuclear gene). However, it is not always clear in the literature if a displayed tree in a network refers to a gene tree or a species tree (representing species history rather than ancestry for a specific locus).

A number of methods have recently been developed to infer species networks that explicitly represent species relationships using a network while relationships at the gene level are modeled as gene trees within the network (Jones et al. 2013; Kubatko 2009; Meng and Kubatko 2009; Yu et al. 2011, 2012, 2014; Yu and Nakhleh 2015; Solís-Lemus and Ané 2016). These models are motivated by cases in which hybridization and incomplete lineage sorting are likely to occur simultaneously. In probabilistic versions of these models, gene trees are assumed to be strictly tree-like, and although they are embedded within the network, they do not have to be displayed by the network. In particular, by modeling species networks under the multispecies coalescent, all gene trees have positive probability whether or not they are displayed by the network. We refer to the multispecies coalescent model applied to networks as the Network Multispecies Coalescent (NMSC), and this model is the focus of this article.

The NMSC is intended to represent the case of two populations merging so that the hybrid population is expected to have many individuals with ancestry from both parental populations. Hybrid speciation due to changes in ploidy can result in all descendants of the hybrid having one ancestor, in which case incomplete lineage sorting would not occur. Our model is therefore restricted to homoploid hybridizations; see Jones et al. (2013) for models applicable to polyploid hybridization. The NMSC is also not intended to model horizontal gene transfer, which causes much of the reticulation in bacterial networks, or recombination, two other processes that motivate network representations, including in likelihood frameworks (Strimmer and Moulton 2000; Jin et al. 2006; Abbott et al. 2010; Nguyen and Roos 2015).

An early study concerning the multispecies coalescent approach to networks assumed that a hybrid species occurred sometime in the past, and that a single allele is sampled from a population descended from this hybrid species (Meng and Kubatko 2009). Under this assumption, an allele from the hybrid species could have descended from one of two possible ancestral populations. This results in the probability of a gene tree being a linear combination of the gene tree probabilities from two parent species trees, where the parent species trees are obtained by removing one of the two hybridization edges. This reduces the network into a set of two species trees, and takes advantage of the fact that probabilities of gene trees given species trees under the multispecies coalescent can be computed. This approach is useful in cases where only one allele is sampled per locus from any species that is the result of hybridization. However, this approach does not generalize easily to cases where an ancient hybrid subsequently speciates or in which more than one allele is sampled from a hybrid species.

To compute more general likelihoods than the approach of Meng and Kubatko (2009), Yu et al. (2012) developed an algorithm that represented a species network as a multilabeled tree (MUL-tree) where species descended from hybrids are represented more than once in the tips of the MUL-tree. The likelihood is computed by summing over possible assignments of alleles to these nonuniquely labeled tips. This approach allows multiple alleles to be sampled within populations as well as hybrids to occur anciently in the network so that populations descended from hybrids can subsequently speciate.

A problem for inferring phylogenetic networks, however, is that they are not always identifiable. That is, examples can be found where two networks that correspond to distinct biological hypotheses about speciation and hybridization events can give rise to the same distribution of gene trees. Two networks that give the same probabilities on all gene tree topologies can be said to be mathematically indistinguishable. We might also differentiate between mathematical distinguishability, by which we mean that two models lead to distinct probability distributions, and practical distinguishability, which would mean that one can perform reasonably accurate model selection from finite data. In this article, we are primarily interested in mathematical distinguishability; however, we also do simulations to address the more practical sense of distinguishability.

Mathematical indistinguishability means that there are some sets of networks for which no amount of data could determine which of the networks gave rise to the data. Although several positive results have been found for identifying species trees from gene trees and sequences evolving on gene trees (DeGiorgio and Degnan 2010; Allman et al. 2011a,b; Chifman and Kubatko 2015), the identifiability of networks is a more challenging problem theoretically. One reason for this is that the space of phylogenetic networks is much larger than that of phylogenetic trees, and is infinite if the number of hybridization events is not bounded. Networks can also have “ghost” lineages (lineages that once existed but that went extinct) that can also make identifiability more difficult for networks than for trees (Marcussen et al. 2015).

One factor that affects network identifiability is whether or not gene trees have branch lengths (Pardi and Scornavacca 2015). If only gene tree topologies are used, then many distinct networks will give equivalent gene tree topology probabilities if speciation times and hybridization parameters are allowed to vary. In some cases, the distribution of the coalescence times in the gene trees (which are functions of the gene tree branch lengths) will depend on hybridization events, thus allowing it to be possible to distinguish two networks that could not be distinguished using only topologies.

If only gene tree topologies are used, then the number of hybridization events that it is possible to infer may also be limited. An example is given in Yu et al. (2012) in which a network has three species and two hybridization events (Fig. 1). In that example, there are three times corresponding to either speciation or hybridization events, and there are two hybridization parameters. With only three observed gene tree topologies and five parameters, even if the network topology is known, this results in a system of three (estimated) equations and five unknowns (one equation for each gene tree topology). It is therefore not surprising that it is not possible to determine the five parameters using the gene tree topologies alone.

Figure 1.


Example of three-taxon network in which parameters are not identifiable from gene tree topologies. The example is taken from Yu et al. (2012), Figure 4, doi:10.1371/journal.pgen.1002660.g004.

Yu et al. (2012) show that for the three-taxon example, identifiability is improved by allele sampling. If two alleles are sampled from species B, then there are 15 possible gene tree topologies (since we now have gene trees with four leaves). The 15 gene tree probabilities can then be used to estimate the five parameters.
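The counts of 3 and 15 topologies follow from the standard formula (2n − 3)!! for the number of rooted binary tree topologies on n labeled leaves. A minimal sketch of that count is given below; the function name is ours and is used only for illustration.

```python
def num_rooted_binary_topologies(n_leaves: int) -> int:
    """Number of rooted binary tree topologies on n labeled leaves: (2n - 3)!!."""
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product when n <= 2).
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count


# Three sampled lineages versus four (two alleles from one species).
for n in (3, 4):
    print(n, num_rooted_binary_topologies(n))
```

With three leaves this gives 3 topologies, and with four leaves it gives 15, so sampling a second allele raises the number of observable topology frequencies well above the five unknown parameters.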

A more difficult case of identifiability might appear to be that given by Pardi and Scornavacca (2015) (Fig. 2). In this example, when there is one allele sampled per species, the distribution of the gene trees, including their branch lengths, is identical for two different networks. The authors point out that there are three species trees displayed by the networks and that the three species trees can have identical branch lengths given certain choices of parameters in the two networks. The authors claim that “no method based on this definition of likelihood will be able to discriminate between” the two networks.

Figure 2.


Networks Inline graphic and Inline graphic from Pardi and Scornavacca (2015), doi:10.1371/journal.pcbi.1004135.g003. The two networks both display exactly the same three trees, Inline graphic, Inline graphic, and Inline graphic.

While we agree that the likelihood used in Yu et al. (2012) cannot distinguish the two networks if one allele per species is used (whether or not branch lengths are used for this case), we disagree if the data can have multiple alleles per species and if incomplete lineage sorting is possible. Some likelihood approaches assume that sequence alignments evolve on gene trees displayed by the network (Jin et al. 2006; Park and Nakhleh 2012; Pardi and Scornavacca 2015), leading to the likelihood:

$$\prod_{i=1}^{m} \sum_{T \in \mathcal{T}(N_k)} P(A_i \mid T)\, P(T \mid N_k) \qquad (1)$$

where $\mathcal{T}(N_k)$ is the set of trees displayed by network $N_k$ and $A_i$ is the sequence alignment for the $i$th locus. This likelihood sums over the trees displayed by the network, and is motivated by cases such as horizontal gene transfer in bacteria and hybrid speciation, in which gene trees are expected to be trees displayed by the network.

The likelihood used in Meng and Kubatko (2009) and Yu et al. (2012) treats gene trees as data, and can be written instead as

$$\prod_{i=1}^{m} \sum_{W_j} \omega_j\, P(g_i \mid W_j) \qquad (2)$$

where $W_j$ are species trees called parental trees in Meng and Kubatko (2009), and the $\omega_j$ are weights based on the probability that lineages take certain paths through the network. In Meng and Kubatko (2009), in which there is only one descendant from any hybrid node, the trees $W_j$ are indeed displayed by the network, whereas in Yu et al. (2012), the trees $W_j$ are generally MUL-trees, which are not always displayed by the network. The approach in this article can also be written using equation (2), where the $W_j$ terms are uniquely labeled trees (not MUL-trees), and can be interpreted as parental trees, similarly to Meng and Kubatko (2009), but are not necessarily displayed by the network.

The description of the likelihood in Pardi and Scornavacca applies to cases where gene trees are considered known and can only arise as displayed trees within the network. This assumption might be reasonable for a number of biological processes such as hybrid speciation, in which an individual hybrid can be ancestral to a new species (Abbott et al. 2010), recombination among viruses, and horizontal gene transfer. Under the NMSC, gene trees are not necessarily displayed by the network. In the parental species tree approach of Meng and Kubatko (2009), parental species trees are displayed by the network if there is only one individual descended from each hybrid, or if all lineages are constrained to coalesce more recently than a hybridization event (such as for hybrid speciation). However, for cases where there are several lineages in a hybrid population, such as is allowed in Yu et al. (2012), parental species trees under the NMSC are also not necessarily displayed by the network.

The likelihood used in Yu et al. (2012) is calculated over a sum of probabilities based on MUL trees, with the summation being over allele assignments. Some allele assignments will correspond to displayed trees, but some may not, particularly when two lineages (whether or not they are from the same species) follow different paths up the network at a hybridization node. In these cases, the probability cannot be written in terms of a displayed tree obtained by dropping one of the hybridization edges. Consequently, equation (1) is not in general an accurate representation of the likelihood used in Yu et al. (2012).

In the next section, we describe an alternative method for representing gene tree probabilities that does not use MUL trees, and describes the probabilities of gene trees as linear combinations as in Equation (1), except that the sum is not necessarily over trees displayed by the network. This helps to explain why equivalence of displayed trees is not sufficient for determining that two networks are indistinguishable.

Gene Tree Probabilities as Linear Combinations Under Different Species Trees

An alternative way of deriving the likelihood of the network given the gene trees can be obtained by conditioning on events at hybridization nodes and branches descended from them and using recursion. This results in an alternative algorithm to that of Yu et al. (2012) for computing the likelihood of a gene tree and results in an expression more similar to the strategy of Meng and Kubatko (2009) of reducing the probability given a network to a linear combination of probabilities given species trees. Following Meng and Kubatko, we refer to these species trees as parental trees or parental species trees. For networks with more than one lineage descended from a hybridization node, the recursion results in a linear combination including some species trees that are not displayed by the network. An example is shown in Figure 3, which gives an intuitive picture of the procedure. The example generalizes the three-taxon example in Figure 1 by splitting taxon Inline graphic into two species and drawing the hybridization edges so that they are not horizontal.

Figure 3.


Decomposition of network into parental species trees. Wiggly arrows indicate conditioning on a coalescence event. Solid arrows indicate conditioning on paths taken at a hybridization node. When a taxon is labeled Inline graphic, this can be interpreted as a leaf where Inline graphic and Inline graphic have been merged, or as a two-taxon tree where there is an infinite branch for the ancestor of Inline graphic and Inline graphic, guaranteeing that lineages sampled from Inline graphic and Inline graphic coalesce with each other more recently than with any other taxa.

At each step in the recursive approach, we condition on whether lineages either coalesce or do not coalesce, or we condition on whether lineages go left versus right at a hybridization node. Each step reduces the network into a larger number of smaller networks until the process ends with a collection of species trees. Details of the algorithm are given in the Appendix.

Distinguishability of Networks with the Same Displayed Trees

Decomposition of Networks Inline graphic and Inline graphic

We use the networks Inline graphic and Inline graphic described as indistinguishable (Pardi and Scornavacca 2015) (Fig. 2).

These networks are slightly modified from Pardi and Scornavacca (2015) with species written with capital letters. We then consider a modified version in which the population descended from both hybrid nodes undergoes speciation, resulting in species Inline graphic and Inline graphic (networks Inline graphic and Inline graphic in Fig. 4). The networks Inline graphic and Inline graphic are similar to Inline graphic and Inline graphic, respectively, when there are two lineages sampled from Inline graphic (Fig. 2). The number of lineages sampled per species affects the decomposition of the networks into parental species trees.

Figure 4.


Extension of networks Inline graphic and Inline graphic to allow two species descended from the most recent hybridization node. The figure is modified from doi:10.1371/journal.pcbi.1004135.g003.

When there are two lineages sampled from species Inline graphic in Inline graphic, we denote the lineages by Inline graphic and Inline graphic. They fail to coalesce in this branch with probability Inline graphic. Assuming no coalescence, the lineages from species Inline graphic either both go to the left, one goes leftward and one rightward, or both go to the right at the lower hybridization node. The cases are listed in Supplementary Table 1 (available on Dryad). An example corresponding to case Inline graphic in Supplementary Table 1 (available on Dryad) is shown in Figure 5. We use Inline graphic and Inline graphic for the probability that a lineage goes left at the more recent and less recent hybridization nodes, respectively, in network Inline graphic. Similarly, Inline graphic and Inline graphic are the probabilities that a lineage goes left at the more recent and less recent hybridization nodes, respectively, in Inline graphic. These hybridization parameters are also called inheritance probabilities (Pardi and Scornavacca 2015). The parental species trees Inline graphicInline graphic referred to in Supplementary Table 1 (available on Dryad) are given in newick format in Supplementary Tables 2 and 3 (available on Dryad).
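The first step of this conditioning can be sketched numerically. The sketch below assumes that two lineages on a branch of length λ (in coalescent units) fail to coalesce with probability exp(−λ), and that each lineage independently goes left with the inheritance probability γ at the hybridization node. The function name, the labels "first" and "second" for the two lineages, and the numerical values are ours and purely illustrative.

```python
import math


def first_step_cases(branch_length: float, gamma: float) -> dict:
    """Probabilities of the mutually exclusive events for two lineages
    sampled from a species directly below a hybridization node.

    branch_length: length of the pendant branch in coalescent units.
    gamma: probability that a single lineage goes left at the hybrid node.
    """
    p_no_coal = math.exp(-branch_length)  # the two lineages remain distinct
    return {
        "coalesce below the hybrid node": 1.0 - p_no_coal,
        "both go left": p_no_coal * gamma * gamma,
        "first left, second right": p_no_coal * gamma * (1.0 - gamma),
        "first right, second left": p_no_coal * (1.0 - gamma) * gamma,
        "both go right": p_no_coal * (1.0 - gamma) ** 2,
    }


# Illustrative values only: unit branch length and gamma = 1/3.
cases = first_step_cases(branch_length=1.0, gamma=1.0 / 3.0)
for event, prob in cases.items():
    print(f"{event}: {prob:.4f}")
print("total:", sum(cases.values()))
```

The events partition the sample space, so their probabilities sum to 1. Each event leads to a smaller network on which the same conditioning is repeated.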

Figure 5.


Gene trees in networks that display the same trees. Here Inline graphic and Inline graphic display the same trees with the same branch lengths given suitable choices of parameters. Similarly, Inline graphic and Inline graphic display the same trees. Two gene trees are shown with coalescence times that are compatible with both Inline graphic and Inline graphic, and another two gene trees are shown with coalescence times compatible with both Inline graphic and Inline graphic so that knowing the coalescence times in the gene tree does not determine which network it evolved in. The gene trees in this figure cannot be represented as having evolved in a species tree displayed by the networks. The gene tree in Inline graphic corresponds to case Inline graphic where Inline graphic goes left, Inline graphic goes right, then left. The gene tree in Inline graphic corresponds to case Inline graphic, where both go left, then Inline graphic goes left, Inline graphic goes right.

We wish to show that there is at least one gene tree topology with different probabilities under the two networks. The calculations are simplest for a gene tree that is very unlikely, in which case calculations can be done “by hand.” For example, consider the gene tree Inline graphic. The probability of this gene tree topology conditional on the above parental species trees is given in Supplementary Table 1 (available on Dryad).

The probability of the gene tree Inline graphic under the two networks can be written as

$$P_{N_1}(g) = \sum_{i=1}^{14} \omega_i\, P(g \mid W_i), \qquad P_{N_2}(g) = \sum_{i=15}^{28} \omega_i\, P(g \mid W_i)$$

where Inline graphic. To illustrate using Supplementary Table 1 (available on Dryad), the probability of Inline graphic under Inline graphic is

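$$
\begin{aligned}
P_{N_1}(g) ={}& g_{22}(\lambda_7)\,\gamma_1^{2}\, g_{22}(\lambda_6)\, g_{33}(\lambda_1)\, g_{22}(\lambda_2)/180 \\
&+ g_{22}(\lambda_7)\,\gamma_1 (1-\gamma_1)\,\gamma_2\, g_{22}(\lambda_1)\, g_{22}(\lambda_3)\, g_{33}(\lambda_2)/180 \\
&+ \cdots + g_{21}(\lambda_7)\,(1-\gamma_1)(1-\gamma_2)\cdot 0
\end{aligned}
$$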

Here Inline graphic from equation (A.2).
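Equation (A.2) is given in the Appendix and is not reproduced here. The sketch below assumes that it is the standard coalescent formula for the probability that i lineages coalesce to j lineages within T coalescent units; under that assumption, g22(T) = exp(−T), g21(T) = 1 − exp(−T), and g33(T) = exp(−3T). The function name `g` and the evaluation of the first term of the expression above (with every branch length set to 1 and γ1 = 1/3) are illustrative only.

```python
import math


def g(i: int, j: int, t: float) -> float:
    """Probability that i lineages coalesce to j lineages within time t
    (coalescent units), using the standard closed-form sum."""
    if not 1 <= j <= i:
        return 0.0
    total = 0.0
    for k in range(j, i + 1):
        prod = 1.0
        for m in range(k):
            prod *= (j + m) * (i - m) / (i + m)
        total += (
            math.exp(-k * (k - 1) * t / 2.0)
            * (2 * k - 1)
            * (-1) ** (k - j)
            / (math.factorial(j) * math.factorial(k - j) * (j + k - 1))
            * prod
        )
    return total


# First term of the expression for the gene tree probability, assuming
# all branch lengths equal 1 and gamma_1 = 1/3 (illustrative values).
gamma1 = 1.0 / 3.0
first_term = g(2, 2, 1.0) * gamma1**2 * g(2, 2, 1.0) * g(3, 3, 1.0) * g(2, 2, 1.0) / 180
print(first_term)
```

With unit branch lengths the first term reduces to exp(−6) / (9 × 180), which shows how each factor contributes to the small overall probability of this gene tree.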

The probabilities do not depend on Inline graphic, Inline graphic, or Inline graphic due to there only being one lineage on each of these pendant edges and therefore no probability of coalescence on these edges. The terms Inline graphic are equal to 0 for five choices of Inline graphic under both networks. These cases correspond to parental species trees that have conditioned on the event that lineages Inline graphic and Inline graphic have coalesced more recently than one of the hybridization nodes, which is impossible for gene tree Inline graphic. This reduces the number of parental species trees needed in the sums from 14 to 9.

In addition, the gene tree forces all coalescences to occur more anciently than the root of the network for both networks. This means that only one coalescent history needs to be computed (instead of enumerating over several coalescent histories for each parental species tree). Because of the asymmetry in the gene tree, only one sequence of coalescences, out of Inline graphic, produces the gene tree, which leads to the denominators in the probabilities.
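The denominator of 180 can be checked by counting sequences of pairwise coalescences. The sketch below assumes five lineages in total, which is what a denominator of 180 implies; the function name is ours.

```python
from math import comb


def num_coalescence_sequences(n_lineages: int) -> int:
    """Number of equally likely sequences of pairwise coalescences that
    reduce n lineages to a single ancestor."""
    total = 1
    for k in range(2, n_lineages + 1):
        total *= comb(k, 2)  # choose which pair coalesces among k lineages
    return total


print(num_coalescence_sequences(5))  # 10 * 6 * 3 * 1 = 180
```

For a fully asymmetric (caterpillar) gene tree, exactly one of these sequences yields the topology, giving a factor of 1/180 once all lineages are present in the same population.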

When there is only one lineage sampled per species, Inline graphic and Inline graphic are indistinguishable under the following conditions (Pardi and Scornavacca 2015):

  1. Inline graphic
  2. Inline graphic
  3. Inline graphic
  4. Inline graphic

We pick a particular set of parameters to show that the networks are distinguishable when two lineages are sampled in species Inline graphic. For the choice of parameters

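$$\gamma_1 = 1/3, \quad \gamma_2 = 2/3, \quad \gamma_3 = 7/9, \quad \gamma_4 = 3/7, \quad x = y = 1/2, \quad \lambda_i = 1, \; i \in \{1, \ldots, 12\},$$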

the conditions for indistinguishability specified by Pardi and Scornavacca (2015) are met, and the probability of gene tree topology Inline graphic under the two networks is

$$P_{N_1}(g) \approx 7.7 \times 10^{-6}, \qquad P_{N_2}(g) \approx 7.6 \times 10^{-6}.$$

Thus, the gene tree probability is approximately 1.4% higher under Inline graphic than Inline graphic. Both probabilities are small because this gene tree is quite unlikely for both networks, requiring no coalescences to occur except more anciently than the root. Nevertheless, it shows that the two networks have different gene tree distributions.

Rather than using gene tree probabilities, clade probabilities could also be used to distinguish the two networks for many parameter values. Let Inline graphic and Inline graphic both be very large and let Inline graphic be very small, so that any two lineages on branches with these lengths will almost certainly coalesce. Similarly, let Inline graphic be very small, so that Inline graphic and Inline graphic are very unlikely to coalesce. For these parameters, with high probability, Inline graphic is a clade on the gene tree when, and only when, both lineages go to the left or both go to the right at the more recent hybridization node. Then using the above values of Inline graphic and Inline graphic, the probability that a gene tree has clade Inline graphic is approximately Inline graphic under Inline graphic and is approximately Inline graphic under Inline graphic. The clade probability will therefore distinguish the two networks.
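The limiting clade probability described here is the probability that two independent lineages take the same direction at a hybridization node. A minimal sketch follows, assuming that γ1 = 1/3 and γ3 = 7/9 (from the parameter choice above) are the inheritance probabilities at the more recent hybridization node of the first and second network, respectively. That assignment is our reading of the text rather than something the surviving notation confirms, and the function name is ours.

```python
from fractions import Fraction


def same_path_probability(gamma: Fraction) -> Fraction:
    """Probability that two independent lineages both go left or both go
    right at a hybridization node with left-probability gamma."""
    return gamma**2 + (1 - gamma) ** 2


# Assumed inheritance probabilities at the more recent hybridization node.
p_first = same_path_probability(Fraction(1, 3))
p_second = same_path_probability(Fraction(7, 9))
print(p_first, float(p_first))
print(p_second, float(p_second))
```

Because γ² + (1 − γ)² differs between the two assumed values of γ, the limiting clade probabilities differ even though the two networks display the same trees.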

We emphasize that if there is only one lineage sampled per species, then there is at most one lineage present at each hybrid node for Inline graphic and Inline graphic. In this case, the methods of Meng and Kubatko (2009) can be applied to calculate gene tree probabilities, but we agree with Pardi and Scornavacca (2015) that Inline graphic and Inline graphic are indistinguishable in this situation.

Rooted triples and quartets have also been used to reconstruct or infer networks under the NMSC (Yu and Nakhleh 2015; Solís-Lemus and Ané 2016). Yu and Nakhleh (2015) give an example of networks that are not distinguishable using probabilities of triples in the gene trees that have evolved in the network (Fig. 6). They give an explanation that the networks display the same sets of triples. We agree that for this particular example, triples cannot be used to distinguish their networks Inline graphic and Inline graphic. However, for the case of Inline graphic and Inline graphic, triples can be used to distinguish the networks when there are two lineages sampled from Inline graphic, even though Inline graphic and Inline graphic display the same set of rooted triples. For example, using the previous parameters, the probability of triple Inline graphic is approximately 0.081 under Inline graphic and approximately 0.089 under Inline graphic.

Figure 6.


Two networks taken from Figure 2 of Yu and Nakhleh (2015) that display the same trees and triplets. The networks are distinguishable using gene tree probabilities but not using rooted triple probabilities.

As an alternative explanation for why triples cannot be used to distinguish Inline graphic and Inline graphic, note that equating triple probabilities for the two networks, such as Inline graphic, results in a system of 12 equations. Removing linear dependencies, such as Inline graphic and the constraint that, for any three taxa, the three rooted triple probabilities sum to 1, leaves five linearly independent equations for a system with nine parameters, making the system underdetermined. This makes it possible to find parameters for Inline graphic that will make the rooted triple probabilities match those for Inline graphic.

Probabilities of unrooted quartets can also be calculated, and again, these distinguish the networks Inline graphic and Inline graphic when there are two lineages sampled from species Inline graphic. For the same parameters as above, with Inline graphic, Inline graphic, Inline graphic, and Inline graphic, the probability that a rooted gene tree displays the quartet Inline graphic is approximately 0.10 and 0.14 for networks Inline graphic and Inline graphic, respectively. We note this example in particular because the recently introduced method for inferring networks from quartets (Solís-Lemus and Ané 2016) cited the results in Pardi and Scornavacca (2015) as a reason to not apply the method to level-Inline graphic networks for Inline graphic (networks in which an edge can appear in more than one cycle of the graph, such as Inline graphic and Inline graphic).

The example from Yu and Nakhleh (2015) suggests that caution is indeed needed, since there are cases where trees but not summary statistics such as rooted triples can distinguish the networks. The example of Inline graphic and Inline graphic with multiple lineages per species, however, suggests that even more complicated networks are potentially distinguishable from rooted triples or quartets. In the Yu and Nakhleh (2015) example, we suspect that distinguishability would be achieved by sampling additional lineages. Their example is somewhat different from that of Inline graphic and Inline graphic in that the different networks do not have the same species descended from the hybrid, and the number of descendants of the hybrid is not the same for the two networks. The networks are also level 1, with only one hybridization event, and distinguishability is a problem not because of the complexity of the networks but rather because of the small number of taxa and resulting small set of linearly independent rooted triple probabilities for the number of parameters.

It is also possible to have distinguishability between two networks that each display the same set of trees when there is only one lineage sampled per species. In particular, the networks Inline graphic and Inline graphic are essentially identical to Inline graphic and Inline graphic, respectively, when the pendant branches leading to species Inline graphic and Inline graphic have length 0. In this case, the most recent ancestral population to Inline graphic and Inline graphic is a single population, and the lineages Inline graphic and Inline graphic are two lineages from the same population. As a result, the distributions of gene tree topologies under Inline graphic and Inline graphic are identical to those of Inline graphic and Inline graphic when lineage Inline graphic is replaced by Inline graphic and lineage Inline graphic is replaced by Inline graphic. Similarly, the networks Inline graphic and Inline graphic both display the trees Inline graphic, and Inline graphic (Fig. 4), which are equivalent to Inline graphic, and Inline graphic (Fig. 2), respectively, when Inline graphic is replaced by Inline graphic. The length of the pendant edges does not affect gene tree probabilities when one lineage is sampled per species. Consequently, gene tree Inline graphic has the same probabilities under Inline graphic and Inline graphic as Inline graphic has under Inline graphic and Inline graphic, respectively: Inline graphic for Inline graphic.

The important point is that there exist pairs of networks that display the same trees but can be distinguished under the NMSC model, even when there is only one lineage sampled per species. This example demonstrates that showing that two networks display the same trees (including branch lengths and inheritance probabilities) is not sufficient for showing that the networks are indistinguishable. In this particular case, the networks Inline graphic and Inline graphic are also distinguishable using rooted triplets, quartets, or clades, in spite of the two networks displaying exactly the same rooted triplets, quartets, and clades. A crucial reason for the ability to distinguish networks is the following: if there is more than one lineage descended from a hybrid node (either due to the hybrid population speciating or due to sampling more than one lineage from a population descended from a hybrid), there can exist gene trees that are not embedded in a tree displayed by the network.

Simulation

Distinguishability of Inline graphic and Inline graphic Using Model Selection

To illustrate the ability of network methods to distinguish two networks that display the same trees, we simulated gene trees from Inline graphic with two and three alleles sampled from species Inline graphic and one allele sampled from each of the other species. We used phylonet (Than et al. 2008) to compute likelihood scores by optimizing branch lengths and hybridization parameters assuming the fixed network topologies Inline graphic and Inline graphic. The network branch lengths were based on using the network Inline graphic (Fig. 5) with the height of the network being 10 coalescent units. The networks used for simulation in hybrid-Lambda (Zhu et al. 2015) are, in coalescent units:

(((A:5,(B:3)h1#.5:2)s2:5,((D:5.6,(h1#.5:1.3)h2#.6:1.3)s3:2.3,(h2#.6:.1,C:4.4)s4:3.5)s5:2.1)s6:10,O:20)r;
(((A:5,(B:1)h1#.5:4)s2:5,((D:5.6,(h1#.5:3.3)h2#.6:1.3)s3:2.3,(h2#.6:.1,C:4.4)s4:3.5)s5:2.1)s6:10,O:20)r;

where species Inline graphic is an out-group. In this notation, all internal nodes (both hybridization and speciation nodes) are labeled. After the hybridization nodes, Inline graphic, the first number represents the probability of going “left,” and the second number represents the branch length from the hybridization node to the next node (either left or right). Thus, for both networks used in the simulation, the probability of going left for Inline graphic is Inline graphic. In the extended newick string, the branch length after the first (second) instance (reading from the left) of Inline graphic is the length of the branch leading from Inline graphic to the left (right) parent of Inline graphic.
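To make the notation concrete, the hybridization annotations can be pulled out of the two strings with a regular expression. This is only a sketch for reading the strings above, not a general extended newick parser and not part of hybrid-Lambda; the names `HYBRID`, `NETWORKS`, and `hybrid_annotations` are ours.

```python
import re

# The two network strings used in the simulation, copied from the text.
NETWORKS = {
    "first": "(((A:5,(B:3)h1#.5:2)s2:5,((D:5.6,(h1#.5:1.3)h2#.6:1.3)s3:2.3,"
             "(h2#.6:.1,C:4.4)s4:3.5)s5:2.1)s6:10,O:20)r;",
    "second": "(((A:5,(B:1)h1#.5:4)s2:5,((D:5.6,(h1#.5:3.3)h2#.6:1.3)s3:2.3,"
              "(h2#.6:.1,C:4.4)s4:3.5)s5:2.1)s6:10,O:20)r;",
}

# Matches "<hybrid label>#<probability of going left>:<branch length>".
HYBRID = re.compile(r"(h\d+)#([\d.]+):([\d.]+)")


def hybrid_annotations(newick: str) -> list:
    """Return (label, probability of going left, branch length to parent)
    for each occurrence of a hybridization node, reading left to right."""
    return [(name, float(p), float(length))
            for name, p, length in HYBRID.findall(newick)]


for label, newick in NETWORKS.items():
    print(label, hybrid_annotations(newick))
```

Each hybridization node appears twice: the first occurrence carries the branch length to its left parent and the second the branch length to its right parent. The two strings differ only in branch lengths adjacent to h1.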

The networks have identical topologies, inheritance probabilities, and branch lengths, except that Inline graphic for the first network and Inline graphic for the second network. The second network has a higher probability that the lineages sampled from Inline graphic will fail to coalesce more recently than the most recent hybridization node, and therefore has a higher level of incomplete lineage sorting. We therefore refer to this as the “high ILS” network. Gene trees on this network are much less likely to have monophyly of lineages sampled from Inline graphic. The other network is referred to as the “low ILS” network.

For each set of gene trees, the likelihood under the estimated parameters was compared between the two networks and the proportion of times that Inline graphic had a higher likelihood than Inline graphic was reported. The simulation was performed with Inline graphic and 400 independent loci. For each gene tree, alignments with 500 sites were simulated using seq-gen (Rambaut and Grassly 1997) under the Inline graphic model with base frequencies of Inline graphic, and Inline graphic for Inline graphic, and Inline graphic, respectively, with four rate categories and 10% invariable sites. As is typical with multilocus simulations, gene trees were independent with no recombination within loci. An out-group was added to the network with the MRCA of the out-group and in-group taxa being 10 coalescent units deeper than the root of the in-group taxa. This ensures extremely high probabilities that the in-group taxa are monophyletic in the gene trees. Gene trees were estimated using phyml (Guindon et al. 2010) under the Inline graphic model assuming four rate categories and estimating all other parameters, and using default tree searches. Unrooted gene trees estimated by phyml were rooted using the out-group, and the out-group was then removed before inputting the estimated rooted gene trees into phylonet.

Not surprisingly, increasing the number of loci increased the ability to distinguish the two networks (Fig. 7). Increasing the number of alleles (from 2 to 3) increased the ability of maximum likelihood to distinguish the networks. An intuitive explanation is that with more alleles, it is more likely that lineages cannot be embedded in a tree displayed by the network. Having the higher level of ILS lineages (obtained by having a smaller value for Inline graphic) greatly increases the ability of phylonet to distinguish the two networks, and this also has the explanation that since Inline graphic lineages are less likely to have coalesced more recently than the first hybridization node, gene trees in the high ILS case are less likely to be embedded in a tree displayed by the network than gene trees in the low ILS case. The fact that increasing allele sampling can improve inference of species relationships has been emphasized in the species tree literature as well (Maddison and Knowles 2006; DeGiorgio and Degnan 2014; Heled and Drummond 2010; Huang et al. 2010). In this case, however, sampling multiple alleles not only improves inference, but is also crucial for being able to distinguish the networks at all.

Figure 7.

Performance of phylonet for distinguishing networks Inline graphic and Inline graphic when data was simulated from Inline graphic, with the indicated number of alleles sampled from species Inline graphic and all other species having one allele sampled. The fraction of times out of 300 iterations that the Inline graphic had a higher likelihood than Inline graphic is reported when both networks were fixed and phylonet optimized branch lengths and inheritance probabilities. “High ILS” refers to Inline graphic, and “Low ILS” refers to Inline graphic, with all other parameters kept the same.

The “high ILS” case, with a branch length of 1.0 coalescent units, is typical for cases known to have significant amounts of ILS. For example, the level of gene tree incongruence for which approximately 60–80% of trees have humans and chimpanzees being the most closely related among humans, chimpanzees, and gorillas (Ebersberger et al. 2007) suggests an internal branch length of between 0.5 and 1.2 coalescent units (Ané 2010; Degnan 2010). The probability that two lineages coalesce within 1.0 coalescent units is Inline graphic, whereas the probability that two lineages coalesce within 3.0 units (the low ILS case in our simulations) is 0.95. Thus, for the low ILS case with two alleles from Inline graphic, more than 19 in 20 gene trees evolve on a tree displayed by the network, making it difficult to distinguish the two networks; for the high ILS case, using Inline graphic, the proportion of gene trees evolving on a tree not displayed by the network is close to Inline graphic.
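The numbers in this paragraph can be checked with a short sketch. It assumes the standard coalescent results that two lineages coalesce within t coalescent units with probability 1 − e^(−t), and that on a three-taxon species tree with internal branch length t the matching gene tree topology has probability 1 − (2/3)e^(−t). The second formula is not stated in the text and is used here as an assumption; the function names are illustrative.

```python
from math import exp, log

def prob_coalesce_within(t):
    """Probability that two lineages coalesce within t coalescent units."""
    return 1.0 - exp(-t)

def branch_length_from_concordance(p):
    """Internal branch length t implied by a proportion p of matching gene trees.

    Assumes p = 1 - (2/3) * exp(-t), the three-taxon multispecies coalescent
    result, and solves for t.
    """
    return -log(1.5 * (1.0 - p))

high_ils = prob_coalesce_within(1.0)   # branch of length 1.0 (high ILS case)
low_ils = prob_coalesce_within(3.0)    # branch of length 3.0 (low ILS case)
t_low = branch_length_from_concordance(0.6)
t_high = branch_length_from_concordance(0.8)

print(f"coalescence within 1.0 units: {high_ils:.3f}")
print(f"coalescence within 3.0 units: {low_ils:.3f}")
print(f"branch length for 60% and 80% concordance: {t_low:.2f}, {t_high:.2f}")
```

Under these assumptions, 60–80% concordance corresponds to a branch length of roughly 0.5 to 1.2 coalescent units, in line with the range quoted above.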

Comparison of Phylonet and Hybrid-Coal

Both phylonet and hybrid-coal compute probabilities of gene tree topologies given species networks, but the two programs use different algorithms. The program hybrid-coal uses a recursion that allows representing the gene tree topology probability as a linear combination of probabilities given species trees. In contrast, phylonet initially represented probabilities as a sum over probabilities of coalescent histories given MUL-tree representations of networks, and more recently also implemented the ancestral configuration approach (Wu 2012), which tends to run more quickly than the coalescent history approach for larger trees (roughly more than 10 taxa, depending on tree shape). There are also many features in phylonet not implemented in hybrid-coal, such as algorithms to infer the network from a set of gene trees, searching over network topologies, branch lengths, and inheritance probabilities, and using branch lengths in the gene trees.

The main idea of hybrid-coal is compatible with both the coalescent history and ancestral configuration approaches, since once the parental species trees have been enumerated, either coalescent histories or ancestral configurations could be used to compute the probability of the gene tree given the parental species tree. Currently, only the coalescent history approach is implemented in hybrid-coal, but the ancestral configuration method could be added in the future. In comparison with phylonet, hybrid-coal breaks down the network into a larger number of smaller problems, with the parental species trees tending to be smaller trees than the MUL-tree representation of the network, which can have more leaves than there are taxa. This could potentially be an advantage in future parallel programming implementations of the algorithm.

The main advantage of having the new algorithm in hybrid-coal is perhaps the theoretical insight it gives in terms of representing gene tree probabilities in terms of parental species trees. This appears to be a fairly intuitive way to think about the relationship between gene trees and species networks (Holland et al. 2008; Meng and Kubatko 2009), although we have shown that, perhaps counterintuitively, the parental species trees are often not displayed by the network. We hope that future theoretical work will make use of the representation of gene tree probabilities as mixtures arising from different species trees. In other contexts, mixtures of trees have proved identifiable (Allman et al. 2012; Rhodes and Sullivant 2012), and this might be a useful approach for thinking about identifiability of networks.

Discussion

The Effect of Branch Lengths

Consideration of branch lengths in the gene trees can lead to the ability to distinguish networks which could not be distinguished using only topologies. For example, if a species history has multi-edges (cases where two nodes are directly connected by two distinct edges), it can still be possible to estimate parameters of this model and to distinguish it from a model in which multi-edges are collapsed. An example of networks with multi-edges is shown in Figure 8. Biologically, a multi-edge could depict a population temporarily splitting into two populations with no gene flow followed by the populations merging at a later time before either population splits. It would be desirable to be able to estimate this type of history, since it could occur, for example, due to glaciation or other episodic events that temporarily divide populations (Comes and Kadereit 1998; Marshall et al. 2009).

Figure 8.

Coalescence times for networks with multi-edges. a) The leftmost species history is a two-taxon species tree. The middle and right species histories have one or two sets of multi-edges, reflecting species diverging and subsequently hybridizing without any other speciation. Histograms of 100,000 coalescence times for single genes sampled from species Inline graphic and Inline graphic are depicted in (b)–(d). (c) and (d) correspond to the middle and rightmost species histories, respectively. Simulations are based on Inline graphic coalescent units and were done in the program ms (Hudson 2002).

Coalescence times are potentially useful in these cases because multi-edges affect the distribution of coalescence times. Theoretically, a multi-edge will result in a bimodal distribution of coalescence times. In practice, estimated coalescence times are highly variable and subject to estimation error, so that a bimodal signature might be difficult to detect. However, this does not affect the point that multi-edges are potentially inferable given ideal data.

The example from Figure 8 essentially arises from the three-taxon network in Figure 1 when taxon Inline graphic is dropped and Inline graphic. The example illustrates that this network has potentially identifiable parameters when using branch lengths in the gene trees even when topologies cannot identify the parameters in the network.

A deeper difficulty with identifiability is that it is not clear that hybridization can be distinguished from other population genetic processes that can result in gene tree incongruence and complicated distributions of coalescence times. For example, alternating bottlenecks and population expansions can result in multimodal distributions of coalescence times similar to those found in Fig. 8 (DeGiorgio et al. 2011). Bottlenecks can also be a problem in practice for distinguishing two networks since a smaller population size makes incomplete lineage sorting less likely, thereby making gene trees more likely to be displayed by the network. The extreme case is a bottleneck of size one, which can occur in hybrid speciation, and guarantees that all gene trees are displayed by the network.

As another example, it is well known that the multispecies coalescent on a three-taxon tree with no ancestral population structure predicts that one triplet is most frequent while the other two triplets are tied in probability and are less frequent (Nei 1987). Consequently, a test of equality of proportions is sometimes used for the less frequent triplets as a goodness of fit test for the multispecies coalescent (Degnan and Rosenberg 2009; White et al. 2009; Ané 2010; Cranston 2010; Song et al. 2012). Asymmetry in the less frequent triplets can be explained by hybridization, but could also be explained by ancient population structure (Slatkin and Pollack 2008). Distinguishing hybridization from processes such as ancient population structure and changing population sizes might be at least as challenging as distinguishing one hybridization network from another when assuming that there is no population structure and that population sizes do not fluctuate.

Summary

To summarize our results, we find that

  • Two networks that display the same trees, including branch lengths and inheritance probabilities, might or might not be distinguishable under the NMSC, in the sense of leading to different probability distributions of gene tree topologies. In particular, there are examples where two networks display exactly the same trees, clades, triples, and quartets, yet are distinguishable from the probabilities of trees, clades, triples, and quartets.

  • Network distinguishability can be improved in some cases by using branch length information and/or by sampling more than one individual per species descended from a hybrid population.

  • Higher levels of incomplete lineage sorting can make inference of hybridization events easier in some cases.

  • A desirable property of a network inference method is to be able to distinguish networks that are in fact distinguishable, even when they display the same trees. We have shown that maximum likelihood can do this in at least some cases.

We agree with Pardi and Scornavacca (2015) that identifiability is an important topic when trying to infer networks. Much of the effort in the literature on hybridization networks has focused on constructing networks that display a set of input trees, which are treated as data (Bordewich and Semple 2007; Holland et al. 2008; van Iersel and Linz 2013). From this point of view, it is crucial to understand when two networks might display the same set of trees.

Much less work has been done on what we are calling the NMSC, which has only recently become an active area of research. We have shown that identifiability results from the combinatorial point of view do not necessarily immediately transfer to the NMSC framework, and that many cases thought to be indistinguishable turn out to be distinguishable using this probabilistic modeling approach. An analogy is that in the case of trees (rather than networks), unrooted trees might not be expected to have any information about the root of the trees from which they came. However, under a probabilistic model, unrooted trees can have information about the root (Steel 2012), and in particular, under the multispecies coalescent, the distribution of unrooted trees determines the rooted species tree when there are five or more taxa (but not for four taxa) (Allman et al. 2011b).

We hope that there will be more of an intersection in future phylogenetic network research between combinatorial approaches and the NMSC framework. A particular problem in need of more theoretical work is that of distances between networks. In particular, standard definitions of distance between networks, such as cluster-based definitions which extend Robinson–Foulds distances (Robinson and Foulds 1981) to networks (Cardona et al. 2009), return a distance of 0 between Inline graphic and Inline graphic and between Inline graphic and Inline graphic. This makes it difficult to determine whether an inferred network is closer to Inline graphic versus Inline graphic (or Inline graphic vs. Inline graphic), even for methods capable of distinguishing these networks.

The increased ability to distinguish networks using probabilistic models is good news for biologists interested in being able to infer biologically meaningful networks. However, much is still not understood about the space of networks in which we are interested in making inferences, and more theory is needed to determine what is and what is not distinguishable or identifiable under the NMSC. We showed that the particular example given by Pardi and Scornavacca (2015) turned out to be distinguishable if there is more than one lineage sampled from species Inline graphic, and that generally there are cases of two networks that display exactly the same species trees (including branch lengths) that are nevertheless distinguishable under the NMSC, even with one lineage sampled per species.

However, we did not establish that Inline graphic and Inline graphic are distinguishable from all networks on four taxa, even if the number of hybridization nodes is capped. Nor did we establish that if, say, the topology of Inline graphic is known, then the parameters of the network would be identifiable. Here lack of identifiability would mean that two distinct sets of branch lengths and/or hybridization parameters (Inline graphic) lead to the same distribution of gene trees. Since networks of any complexity can be conceived, we can construct networks on Inline graphic taxa with more parameters than there are gene tree topologies (assuming a fixed number of alleles sampled per species), and this will certainly result in lack of identifiability of the parameters from gene tree topologies even if the network topology is known. The challenge remains to determine what is and what is not identifiable for networks under the NMSC.

Supplemental Material

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.t2d38.

Software Availability

The software hybrid-coal, which computes gene tree probabilities recursively, and hybrid-Lambda, which simulates gene trees in phylogenetic networks, are both available on GitHub at https://github.com/hybridLambda/ and can be freely used under GNU GPL Version 3 or later. Both have been tested on Mac OS X 10.9.5 and Ubuntu.

Funding

Much of this work was completed while SZ was a PhD student at University of Canterbury, supervised by JD and Mike Steel. SZ and JD were funded by the New Zealand Marsden Fund during 2010–2013. JD was additionally supported by National Institutes of Health [grant R01 GM117590].

Acknowledgments

We are grateful to David Bryant and Mike Steel for comments on the description of the algorithm. We thank Bengt Oxelman and two anonymous reviewers for additional helpful comments.

Appendix

Recursive Method to Compute Gene Tree Probabilities Given Species Networks

In this section, we introduce a novel method to compute gene tree probabilities of a given species network. Nodes of the network are visited in a modified post-order traversal so that the algorithm works on the deepest nodes descended from hybrid nodes first and works from this node toward the root until all hybrid nodes are eliminated. We introduce two operations to decompose a network into reduced networks that have a smaller number of edges or nodes. The post-order traversal ensures that we perform the simplification operations in a correct order—a node is never operated on before removing its descendant internal nodes.

Decomposition Operations

In this section, we propose two operations to simplify a complex phylogeny structure into simpler structures with fewer hybridization events. To demonstrate this procedure, we first consider simple cases where one individual is sampled from each population at the present. We first make several restrictions and assumptions for the gene tree Inline graphic and the network Inline graphic in this section:

  • The gene tree Inline graphic and the network Inline graphic are rooted.

  • Gene tree Inline graphic and network Inline graphic have the same number of external edges.

  • Gene tree Inline graphic is binary.

  • An interior node of Inline graphic can have at most two parent nodes; a hybrid node refers to an internal node of Inline graphic which has two parent nodes.

  • We do not consider the case that a hybrid node is also a leaf node.

The network Inline graphic is initially reduced to a set of simpler networks (Inline graphic) in a single step in the reduction process. Let Inline graphic be the probability of gene tree Inline graphic given a species network Inline graphic. By the law of total probability, we have the following:

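$$P(T \mid \mathcal{W}) = \sum_{w \in SG(\mathcal{W})} P(T \mid W = w, \mathcal{W})\, P(W = w \mid \mathcal{W}) = \sum_{w \in SG(\mathcal{W})} P(T \mid W = w)\, P(W = w \mid \mathcal{W}), \qquad \text{(A.1)}$$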

where Inline graphic is a random variable that depends on Inline graphic.

For any Inline graphic, Inline graphic* implies either a particular parental branch that some lineages have followed at a hybrid node or some specific coalescences that have occurred beneath a hybridization node.

Prior to decomposing a network, nodes are ranked from the leaves of the network to the top: tip nodes have rank one; an interior node’s rank is one plus the highest rank of its child nodes.
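The ranking rule in this paragraph can be sketched as follows. The dictionary `CHILDREN` is a hand-written encoding of the simulation network topology given earlier in extended Newick form (branch lengths omitted); the encoding and the function name `rank` are assumptions made for this example and are not taken from hybrid-coal.

```python
from functools import lru_cache

# Child lists for the simulation network; hybrid node h1 is a child of both
# s2 and h2, and hybrid node h2 is a child of both s3 and s4.
CHILDREN = {
    "r": ["s6", "O"],
    "s6": ["s2", "s5"],
    "s2": ["A", "h1"],
    "h1": ["B"],
    "s5": ["s3", "s4"],
    "s3": ["D", "h2"],
    "h2": ["h1"],
    "s4": ["h2", "C"],
}

@lru_cache(maxsize=None)
def rank(node):
    """Tip nodes have rank one; an interior node has one plus the highest child rank."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return 1
    return 1 + max(rank(kid) for kid in kids)

all_nodes = set(CHILDREN) | {kid for kids in CHILDREN.values() for kid in kids}
ranks = {node: rank(node) for node in sorted(all_nodes)}
for node, value in sorted(ranks.items(), key=lambda item: item[1]):
    print(node, value)
```

Caching the recursion matters for networks, because a hybrid node is reached through both of its parents and would otherwise be re-evaluated along each path.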

The key to simplifying a network is to remove the interior nodes of the network in a specific order, along with the branches that are connected to the node. Here we define several functions to help us identify which nodes should be removed first. Let Inline graphic be the set of nodes in the network; for Inline graphic, let Inline graphic be the rank of Inline graphic (the number of edges from Inline graphic to the root), and Inline graphic be the number of parent nodes of Inline graphic. We use the indicator function Inline graphic to identify whether a node Inline graphic is a hybrid node:

$$h(v) = \begin{cases} 1, & \text{if } p(v) = 2; \\ 0, & \text{otherwise.} \end{cases}$$

Let Inline graphic and Inline graphic be the indicator functions that take values

$$hd(v) = \begin{cases} 1, & \text{if } v \text{ is a descendant node of a hybrid node;} \\ 0, & \text{otherwise;} \end{cases}$$

and

$$t(v) = \begin{cases} 1, & \text{if } v \text{ is a leaf node;} \\ 0, & \text{otherwise} \end{cases}$$

respectively.

Thus, we can apply Algorithm 1 to find which node should be removed from the network. If the algorithm returns value Inline graphic, then Inline graphic is already tree-like and does not need to be simplified; otherwise, it returns the index of the node on which we need to perform the following operations.

Algorithm 1. (image: syw097a1.jpg)
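The indicator functions h(v), hd(v), and t(v) defined above can be sketched as follows. This is not Algorithm 1 itself, which is only available as an image; it only illustrates the three indicators on a hand-written encoding of the simulation network topology. The names `CHILDREN`, `PARENTS`, and `p` are assumptions made for this example.

```python
# Child lists for the simulation network (branch lengths omitted).
CHILDREN = {
    "r": ["s6", "O"],
    "s6": ["s2", "s5"],
    "s2": ["A", "h1"],
    "h1": ["B"],
    "s5": ["s3", "s4"],
    "s3": ["D", "h2"],
    "h2": ["h1"],
    "s4": ["h2", "C"],
}
NODES = set(CHILDREN) | {kid for kids in CHILDREN.values() for kid in kids}
PARENTS = {v: [u for u, kids in CHILDREN.items() if v in kids] for v in NODES}

def p(v):
    """Number of parent nodes of v."""
    return len(PARENTS[v])

def h(v):
    """1 if v is a hybrid node (two parents), 0 otherwise."""
    return 1 if p(v) == 2 else 0

def t(v):
    """1 if v is a leaf node (no children), 0 otherwise."""
    return 1 if not CHILDREN.get(v) else 0

def hd(v):
    """1 if some ancestor of v is a hybrid node, 0 otherwise."""
    stack, seen = list(PARENTS[v]), set()
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        if h(u):
            return 1
        stack.extend(PARENTS[u])
    return 0

hybrids = sorted(v for v in NODES if h(v))
below_hybrid = sorted(v for v in NODES if hd(v))
leaves = sorted(v for v in NODES if t(v))
print(hybrids, below_hybrid, leaves)
```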

Decomposition operation 1.—

If the chosen node is an interior descendant node Inline graphic of a hybrid node, then this implies that Inline graphic has a single parent node (otherwise Inline graphic is a hybrid node), and child nodes of Inline graphic are the leaf nodes of Inline graphic (since Inline graphic has the lowest rank besides the tips). The first step of operation 1 is to remove Inline graphic from Inline graphic, along with all of the edges that are connected to Inline graphic.

Let Inline graphic denote all of the leaf nodes descended from Inline graphic. We now enumerate all possible ways to partition Inline graphic. For example, if Inline graphic, let Inline graphic be one of the possible partitions of Inline graphic. Inline graphic could be Inline graphic, Inline graphic, Inline graphic, Inline graphic or Inline graphic. We treat every element of any Inline graphic as a new leaf node. In the second part of operation 1, we create a new graph Inline graphic, by connecting the elements of Inline graphic to the parent node of Inline graphic. Notice, if the element of Inline graphic contains more than one leaf node, this implies that by changing from graph Inline graphic to Inline graphic, we need to coalesce these leaves on the branch that connects Inline graphic and its parent node.
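The enumeration of partitions used in operation 1 can be sketched with a short recursive generator. The leaf labels `a`, `b`, `c` are hypothetical stand-ins for the leaves in the example, and `set_partitions` is an illustrative name rather than a function from hybrid-coal.

```python
def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition

partitions = list(set_partitions(["a", "b", "c"]))
for blocks in partitions:
    print(blocks)
```

A set of three leaves has five partitions, matching the five cases listed in the example; a block with more than one leaf corresponds to lineages that have coalesced on the branch above the removed node.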

To calculate the probability of these events, we let Inline graphic, and Inline graphic and Inline graphic be the branch length from Inline graphic to its parent node. Then the probability that Inline graphic lineages coalesce into Inline graphic lineages within time Inline graphic is (Tajima 1983; Saunders et al. 1984; Takahata and Nei 1985; Rosenberg 2002; Degnan and Salter 2005):

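$$g_{ij}(t) = \sum_{k=j}^{i} e^{-\binom{k}{2} t}\, \frac{(2k-1)(-1)^{k-j}}{j!\,(k-j)!\,(j+k-1)} \times \prod_{y=0}^{k-1} \frac{(j+y)(i-y)}{(i+y)}. \qquad \text{(A.2)}$$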

Therefore, we have:

$$P(W = w \mid \mathcal{W}) = \frac{w}{c}\, g_{ij}(t)\, I_w(T), \quad \text{for } w \in SG(\mathcal{W}), \qquad \text{(A.3)}$$

where Inline graphic is the number of ways for Inline graphic lineages to coalesce into Inline graphic lineages, which is equal to Inline graphic, and Inline graphic is the number of sequences of coalescences resulting in the same topology with Inline graphic lineages coalescing into Inline graphic lineages. This is equal to Inline graphic, where Inline graphic is the number of interior nodes that are descended from the coalesced nodes (Degnan and Salter 2005), and Inline graphic is the number of ways for Inline graphic lineages to coalesce into Inline graphic lineages, which is equal to Inline graphic. The indicator function is defined as

$$I_w(T) = \begin{cases} 1, & \text{if the lineages in } w \text{ can lead to topology } T; \\ 0, & \text{otherwise.} \end{cases}$$

For instance, if the gene tree is Inline graphic and Inline graphic in Fig. 3, then Inline graphic.

Operation 1 removes an internal node of network Inline graphic. Therefore, any reduced network Inline graphic, Inline graphic, has fewer interior nodes than network Inline graphic.
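Equation (A.2), the probability that i lineages coalesce into j lineages within time t, can be evaluated numerically with the following minimal sketch. The function name `g` is illustrative and the code is not taken from hybrid-coal.

```python
from math import comb, exp, factorial

def g(i, j, t):
    """Probability that i lineages coalesce into j lineages within time t."""
    total = 0.0
    for k in range(j, i + 1):
        # Product over y = 0, ..., k - 1 of (j + y)(i - y) / (i + y).
        product = 1.0
        for y in range(k):
            product *= (j + y) * (i - y) / (i + y)
        coefficient = ((2 * k - 1) * (-1) ** (k - j)
                       / (factorial(j) * factorial(k - j) * (j + k - 1)))
        # comb(k, 2) is k(k - 1)/2, the pairwise coalescence rate for k lineages.
        total += exp(-comb(k, 2) * t) * coefficient * product
    return total

two_to_one = g(2, 1, 1.0)                                  # equals 1 - exp(-1)
row_total = sum(g(5, j, 0.7) for j in range(1, 6))         # sums to 1
print(two_to_one, row_total)
```

For two lineages the formula reduces to 1 − e^(−t), and for fixed i and t the values over j = 1, …, i form a probability distribution, which gives two simple sanity checks.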

Decomposition operation 2.—

Before applying operation 2 on a hybrid node Inline graphic of Inline graphic, we need to make sure that operation 1 has been applied to all of the interior nodes descended from Inline graphic. This implies that all of the child nodes of Inline graphic are the leaf nodes of Inline graphic. Let Inline graphic and Inline graphic be the two parent nodes of Inline graphic. We use Inline graphic to denote the set of child nodes of Inline graphic and Inline graphic to denote the collection of all of the subsets of Inline graphic. The first step of operation 2, is to remove Inline graphic from Inline graphic, and all of the edges connected to Inline graphic.

We then introduce two new nodes, Inline graphic and Inline graphic. For any Inline graphic, we have a new graph Inline graphic, connect Inline graphic to Inline graphic, then connect Inline graphic to Inline graphic, and connect Inline graphic to Inline graphic, then connect Inline graphic to Inline graphic. Let Inline graphic, Inline graphic, and Inline graphic. The parameter Inline graphic is the probability that one lineage is attached to Inline graphic. Thus, we obtain the set of simpler networks Inline graphic and the probabilities Inline graphic for any Inline graphic:

$$P(W = w \mid \mathcal{W}) = \gamma^{m_L} (1-\gamma)^{m_R}, \quad \text{where } m_L + m_R = m. \qquad \text{(A.4)}$$

Operation 2 removes an internal node of network Inline graphic. The newly added nodes Inline graphic and Inline graphic are effectively external nodes: as all of the nodes descended from Inline graphic and Inline graphic are leaf nodes, we can treat Inline graphic and Inline graphic as leaf nodes, but sampling multiple lineages from each of them. Therefore, any reduced network Inline graphic, Inline graphic, has fewer interior nodes than network Inline graphic.
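The weights in Equation (A.4) can be illustrated by enumerating every way of sending the lineages below a hybrid node to its left or right parent. The lineage labels and the function name `hybrid_splits` are hypothetical, and the sketch assumes that each lineage chooses the left parent independently with probability γ.

```python
from itertools import combinations

def hybrid_splits(lineages, gamma):
    """Yield (left, right, probability) for each assignment of lineages to parents.

    Assumes distinct lineage labels and independent choices, so an assignment
    with m_L lineages going left has probability gamma**m_L * (1 - gamma)**m_R.
    """
    lineages = list(lineages)
    m = len(lineages)
    for m_left in range(m + 1):
        for left in combinations(lineages, m_left):
            right = tuple(x for x in lineages if x not in left)
            yield left, right, gamma ** m_left * (1 - gamma) ** (m - m_left)

splits = list(hybrid_splits(["b1", "b2", "b3"], 0.5))
total = sum(prob for _, _, prob in splits)
for left, right, prob in splits:
    print(left, right, prob)
```

With m lineages there are 2^m assignments and their probabilities sum to one, so each reduced network in the decomposition carries one of these weights.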

Simplifying a Network Recursively

Operations 1 and 2 are applied recursively on any networks in Inline graphic until all of the simplified network structures are tree-like. Gene tree probabilities can be computed using either coalescent histories (Degnan and Salter 2005) or ancestral configurations (Wu 2012). The approach outlined in this article will therefore reduce the probability of a gene tree, given a species network, to a linear combination of gene tree probabilities of given species trees.

Let Inline graphic be an ordered list of directed graphs (trees or networks). Then Inline graphic is the number of elements in the list. Here we borrow the set operations “Inline graphic” and “Inline graphic” for our use. Let Inline graphic denote appending the elements of Inline graphic, one at a time, to the end of the list Inline graphic, and then indexing the new elements of Inline graphic from Inline graphic to Inline graphic. For an element Inline graphic of Inline graphic, we define the operation Inline graphic as removing the element Inline graphic from Inline graphic, after which the index of every element behind Inline graphic is reduced by one.

Then we apply Algorithm 2 to simplify a network Inline graphic, and then compute the probability for gene tree Inline graphic.

Algorithm 2. (image: syw097a2.jpg)

During the decomposition process, different sequences of removing the hybrid nodes may lead to the same subspecies trees Inline graphic. For Inline graphic, we use Inline graphic to denote the collection of ways to decompose Inline graphic into Inline graphic. Each sequence of decomposition corresponds to a unique weight Inline graphic. Thus by simplifying Equation (A.1), we have:

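$$P(T \mid \mathcal{W}) = \sum_{W' \in AGT(\mathcal{W})} P(T \mid W = W', \mathcal{W}) \sum_{c \in C(\mathcal{W}, W')} \omega_c. \qquad \text{(A.5)}$$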

Figure 3 illustrates the decomposition of a species network on four taxa with two hybridization nodes.

Notice that even though Inline graphic and Inline graphic have the same topology, the branch lengths of these two trees differ. We consider them to be different species trees. For different gene trees, according to coalescent events, Inline graphic may differ. For example, if the gene tree is Inline graphic, Inline graphic, but when the gene tree is Inline graphic, Inline graphic.

References

  1. Abbott R.J., Hegarty M.J., Hiscock S.J., Brennan A.C. 2010. Homoploid hybrid speciation in action. Taxon 59:1375–1386. [Google Scholar]
  2. Albrecht B., Scornavacca C., Cenci A., Huson D.H. 2012. Fast computation of minimum hybridization networks. Bioinformatics 28:191–197. [DOI] [PubMed] [Google Scholar]
  3. Allman E.S., Degnan J.H., Rhodes J.A. 2011a. Determining species tree topologies from clade probabilities under the coalescent. J. Theor. Biol. 289:96–106. [DOI] [PubMed] [Google Scholar]
  4. Allman E.S., Degnan J.H., Rhodes J.A. 2011b. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J. Math. Biol. 62:833–862. [DOI] [PubMed] [Google Scholar]
  5. Allman E.S., Rhodes J.A., Sullivant S. 2012. When do phylogenetic mixture models mimic other phylogenetic models? Syst. Biol. 61:1049–1059. [DOI] [PubMed] [Google Scholar]
  6. Ané C. 2010. Reconstructing concordance trees and testing the coalescent model from genome-wide data sets. In: Knowles L. L., Kubatko L. S. editors. Estimating species trees: theoretical and practical aspects.Hoboken, (NJ): Wiley-Blackwell; p. 35–52. [Google Scholar]
  7. Bapteste E., van Iersel L., Janke A., Kelchner S., Kelk S., McInerney J.O., Morrison D.A., Nakhleh L., Steel M., Stougie L., Whitfield J. 2013. Networks: expanding evolutionary thinking. Trends Genet. 29:439–441. [DOI] [PubMed] [Google Scholar]
  8. Baroni M., Semple C., Steel M. 2006. Hybrids in real time. Syst. Biol. 55:46–56. [DOI] [PubMed] [Google Scholar]
  9. Bordewich M., Semple C. 2007. Computing the hybridization number of two phylogenetic trees is fixed-parameter tractable. IEEE/ACM Trans. Comp. Biol. Bioinform. 4:458–466. [DOI] [PubMed] [Google Scholar]
  10. Cardona G., Llabrés M., Rosselló F., Valiente G. 2009. Metrics for phylogenetic networks I: Generalizations of the Robinson-Foulds metric. IEEE/ACM Trans. Comp. Biol. Bioinform. 6:46–61. [DOI] [PubMed] [Google Scholar]
  11. Chen Z.-Z., Wang L. 2010. Hybridnet: a tool for constructing hybridization networks. Bioinformatics 26:2912–2913. [DOI] [PubMed] [Google Scholar]
  12. Chifman J., Kubatko L. 2015. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J. Theor. Biol. 374:35–47. [DOI] [PubMed] [Google Scholar]
  13. Comes H.P., Kadereit J.W. 1998. The effect of quaternary climatic changes on plant distribution and evolution. Trends Plant Sci. 3:432–438. [Google Scholar]
  14. Cranston K.A. 2010. Summarizing gene tree incongruence at multiple phylogenetic depths. In: Knowles L.L., Kubatko L.S. editors. Estimating species trees: practical and theoretical aspectsHoboken (NJ): Wiley-Blackwell; p. 129–143. [Google Scholar]
  15. DeGiorgio M., Degnan J.H., Rosenberg N.A. 2011. Coalescence-time distributions in a serial founder model of human evolutionary history. Genetics 189:579–593. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. DeGiorgio M., Degnan J.H. 2010. Fast and consistent estimation of species trees using supermatrix rooted triples. Mol. Biol. Evol. 27:552–569. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. DeGiorgio M., Degnan J.H. 2014. Robustness to divergence time underestimation when inferring species trees from estimated gene trees. Syst. Biol. 63:66–82. [DOI] [PubMed] [Google Scholar]
  18. Degnan J.H., Rosenberg N.A. 2009. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24:332–340. [DOI] [PubMed] [Google Scholar]
  19. Degnan J.H., Salter L.A. 2005. Gene tree distributions under the coalescent process. Evolution 59:24–37. [PubMed] [Google Scholar]
  20. Degnan J.H. 2010. Probabilities of gene trees with intraspecific sampling given a species tree. In: Knowles L.L., Kubatko L.S. editors. Estimating Species Trees: Practical and Theoretical AspectsWiley-Blackwell; p. 53–78. [Google Scholar]
  21. Ebersberger I., Galgoczy P., Taudien S., Taenzer S., Platzer M., von Haeseler A. 2007. Mapping human genetic ancestry. Mol. Biol. Evol. 24:2266–2277. [DOI] [PubMed] [Google Scholar]
  22. Guindon S., Dufayard J.-F., Lefort V., Anisimova M., Hordijk W., Gascuel O. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59:307–321.
  23. Heled J., Drummond A.J. 2010. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27:570–580.
  24. Holland B.R., Benthin S., Lockhart P.J., Moulton V., Huber K.T. 2008. Using supernetworks to distinguish hybridization from lineage-sorting. BMC Evol. Biol. 8:1.
  25. Huang H., He Q., Kubatko L.S., Knowles L.L. 2010. Sources of error inherent in species-tree estimation: impact of mutational and coalescent effects on accuracy and implications for choosing among different methods. Syst. Biol. 59:573–583.
  26. Hudson R. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18:337–338.
  27. Huson D.H., Scornavacca C. 2011. A survey of combinatorial methods for phylogenetic networks. Genome Biol. Evol. 3:23–35.
  28. Huson D., Rupp R., Scornavacca C. 2010. Phylogenetic networks: concepts, algorithms and applications. New York: Cambridge University Press.
  29. Jin G., Nakhleh L., Snir S., Tuller T. 2006. Maximum likelihood of phylogenetic networks. Bioinformatics 22:2604–2611.
  30. Jones G., Sagitov S., Oxelman B. 2013. Statistical inference of allopolyploid species networks in the presence of incomplete lineage sorting. Syst. Biol. 62:467–478.
  31. Kubatko L.S. 2009. Identifying hybridization events in the presence of coalescence via model selection. Syst. Biol. 58:478–488.
  32. Maddison W.P., Knowles L.L. 2006. Inferring phylogeny despite incomplete lineage sorting. Syst. Biol. 55:21–30.
  33. Marcussen T., Heier L., Brysting A.K., Oxelman B., Jakobsen K.S. 2015. From gene trees to a dated allopolyploid network: insights from the angiosperm genus Viola (Violaceae). Syst. Biol. 64:84–101.
  34. Marshall D.C., Hill K.B., Fontaine K.M., Buckley T.R., Simon C. 2009. Glacial refugia in a maritime temperate climate: cicada (Kikihia subalpina) mtDNA phylogeography in New Zealand. Mol. Ecol. 18:1995–2009.
  35. Meng C., Kubatko L.S. 2009. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theor. Popul. Biol. 75:35–45.
  36. Morrison D.A. 2011. Introduction to phylogenetic networks. Uppsala: RJR Productions.
  37. Nakhleh L. 2013. Computational approaches to species phylogeny inference and gene tree reconciliation. Trends Ecol. Evol. 28:719–728.
  38. Nei M. 1987. Molecular evolutionary genetics. New York: Columbia University Press.
  39. Nguyen Q., Roos T. 2015. Likelihood-based inference of phylogenetic networks from sequence data by PhyloDAG. In: Algorithms for computational biology. Springer; p. 126–140.
  40. Pardi F., Scornavacca C. 2015. Reconstructible phylogenetic networks: do not distinguish the indistinguishable. PLoS Comput. Biol. e1004135.
  41. Park H.J., Nakhleh L. 2012. Inference of reticulate evolutionary histories by maximum likelihood: the performance of information criteria. BMC Bioinform. 13:S12.
  42. Rambaut A., Grassly N.C. 1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comp. Appl. Biosci. 13:235–238.
  43. Rhodes J.A., Sullivant S. 2012. Identifiability of large phylogenetic mixture models. Bull. Math. Biol. 74:212–231.
  44. Robinson D., Foulds L. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131–147.
  45. Rosenberg N.A. 2002. The probability of topological concordance of gene trees and species trees. Theor. Popul. Biol. 61:225–247.
  46. Saunders I.W., Tavaré S., Watterson G.A. 1984. On the genealogy of nested subsamples from a haploid population. Adv. Appl. Prob. 16:471–491.
  47. Slatkin M., Pollack J.L. 2008. Subdivision in an ancestral species creates asymmetry in gene trees. Mol. Biol. Evol. 25:2241–2246.
  48. Solís-Lemus C., Ané C. 2016. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genet. 12:e1005896.
  49. Song S., Liu L., Edwards S.V., Wu S. 2012. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc. Natl. Acad. Sci. USA 109:14942–14947.
  50. Steel M. 2012. Root location in random trees: a polarity property of all sampling consistent phylogenetic models except one. Mol. Phylogenet. Evol. 65:345–348.
  51. Strimmer K., Moulton V. 2000. Likelihood analysis of phylogenetic networks using directed graphical models. Mol. Biol. Evol. 17:875–881.
  52. Tajima F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460.
  53. Takahata N., Nei M. 1985. Gene genealogy and variance of interpopulational nucleotide differences. Genetics 110:325–344.
  54. Than C., Ruths D., Nakhleh L. 2008. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinform. 9:322.
  55. van Iersel L., Kelk S., Lekić N., Scornavacca C. 2014. A practical approximation algorithm for solving massive instances of hybridization number for binary and nonbinary trees. BMC Bioinform. 15:1.
  56. van Iersel L., Linz S. 2013. A quadratic kernel for computing the hybridization number of multiple trees. Inform. Process. Lett. 113:318–323.
  57. White M.A., Ané C., Dewey C.N., Larget B.R., Payseur B.A. 2009. Fine-scale phylogenetic discordance across the house mouse genome. PLoS Genet. 5:e1000729.
  58. Wu Y. 2012. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 66:763–775.
  59. Yu Y., Degnan J.H., Nakhleh L. 2012. The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS Genet. 8:e1002660.
  60. Yu Y., Dong J., Liu K.J., Nakhleh L. 2014. Maximum likelihood inference of reticulate evolutionary histories. Proc. Natl. Acad. Sci. USA 111:16448–16453.
  61. Yu Y., Nakhleh L. 2015. A maximum pseudo-likelihood approach for phylogenetic networks. BMC Genom. 16:S10.
  62. Yu Y., Than C., Degnan J.H., Nakhleh L. 2011. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Syst. Biol. 60:138–149.
  63. Zhu S., Degnan J.H., Goldstien S.J., Eldon B. 2015. Hybrid-lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees. BMC Bioinform. 16:292.

Articles from Systematic Biology are provided here courtesy of Oxford University Press
