Systematic Biology. 2018 Jul 5;68(1):168–181. doi: 10.1093/sysbio/syy051

The Spectre of Too Many Species

Adam D Leaché 1,#, Tianqi Zhu 2,3,#, Bruce Rannala 4, Ziheng Yang 2,5,6
Editor: Matthew Hahn
PMCID: PMC6292489  PMID: 29982825

Abstract

Recent simulation studies examining the performance of Bayesian species delimitation as implemented in the bpp program have suggested that bpp may detect population splits but not species divergences and that it tends to over-split when data of many loci are analyzed. Here, we confirm these results and provide the mathematical justifications. We point out that the distinction between population and species splits made in the protracted speciation model (PSM) has no influence on the generation of gene trees and sequence data, which explains why no method can use such data to distinguish between population splits and speciation. We suggest that the PSM is unrealistic as its mechanism for assigning species status assumes instantaneous speciation, contradicting prevailing taxonomic practice. We confirm the suggestion, based on simulation, that in the case of speciation with gene flow, Bayesian model selection as implemented in bpp tends to detect population splits when the amount of data (the number of loci) increases. We discuss the use of a recently proposed empirical genealogical divergence index (gdi) for species delimitation and illustrate that parameter estimates produced by a full likelihood analysis as implemented in bpp provide much more reliable inference under the gdi than the approximate method phrapl. We distinguish between Bayesian model selection and parameter estimation and suggest that the model selection approach is useful for identifying sympatric cryptic species, while the parameter estimation approach may be used to implement empirical criteria for determining species status among allopatric populations.

Keywords: bpp, multispecies coalescent, species delimitation, taxonomy


In the past decade, the multispecies coalescent (MSC) model (Rannala and Yang 2003) has emerged as an important framework for statistical analysis of genomic sequence data from closely related species. Under the model, different genomic regions (called loci) may have different genealogical histories due to coalescent processes occurring in the extinct ancestral species. The MSC thus naturally accommodates gene tree heterogeneity across the genome. Likelihood-based inference under the MSC averages over the gene trees for multiple loci, achieved through either numerical integration (Yang 2002; Zhu and Yang 2012) or Bayesian Markov chain Monte Carlo (MCMC) (Edwards 2009; Heled and Drummond 2010; Yang and Rannala 2010, 2014). Averaging over gene trees incurs a heavy computational burden but has the benefit of accommodating phylogenetic uncertainty at individual loci, which is important when the species are closely related and the sequence alignment at each locus has low phylogenetic information content (Xu and Yang 2016). Given the species phylogeny, the MSC can be used to estimate important parameters concerning species divergences such as the population sizes of modern and ancestral species, species divergence times, and past migration patterns and rates (Takahata et al. 1995; Burgess and Yang 2008; Hey 2010; Mailund et al. 2012). The MSC also provides the appropriate inference framework for estimating species phylogenies, while accommodating gene tree heterogeneity caused by deep coalescence and incomplete lineage sorting (Maddison 1997; Nichols 2001; Edwards 2009; Heled and Drummond 2010; Yang and Rannala 2014). It has been applied to species identification (assignment) and found to achieve better statistical performance than DNA bar-coding based on a simple distance threshold (Yang and Rannala 2017). The MSC has also been used to address the problem of species discovery (or delimitation) (Yang and Rannala 2010, 2014). 
Different species delimitation models are formulated as competing statistical models and inferred from the genetic data through Bayesian model selection (i.e., through calculation of posterior model probabilities). Species delimitation is a complex issue, however, partly because there is no universally accepted definition of species (Mallet 2013).

Two recent studies (Jackson et al. 2017; Sukumaran and Knowles 2017) used computer simulation to evaluate the performance of Bayesian species delimitation as implemented in the software package bpp (Bayesian Phylogenetics and Phylogeography) (Yang and Rannala 2010; Rannala and Yang 2013). Both studies concluded that bpp may over-split, capturing population splits rather than species divergences. Sukumaran and Knowles (2017) simulated species phylogenies, gene trees, and sequence data under the protracted speciation model (PSM) (Rosenblum et al. 2012; Etienne et al. 2014), which distinguishes between populations (incipient species) and species. They concluded that in some cases bpp delimited population structure rather than species. Jackson et al. (2017) simulated sequence data under the MSC model on a given species tree, and then used a heuristic genealogical divergence index (gdi) to define species status. They found that their simulation-based heuristic method phrapl was more successful in inferring species status than bpp, which tended to split subdivided populations into species even in the face of high gene flow.

Here, we examine the conditions of the simulations of Sukumaran and Knowles (2017) and Jackson et al. (2017) to evaluate the performance of bpp. Two features of the simulation of Sukumaran and Knowles (2017) are noteworthy. First, the species conversion process is superimposed on the population branching process and is Markovian (memoryless) so the rate of species conversion (from incipient species to species) is fixed and independent of the duration of genetic isolation between incipient species. Moreover, the PSM distinguishes between populations and species but the species status of lineages is ignored when the gene trees and sequence data are generated under the MSC model for subsequent analysis using bpp. Second, the assignment of species status in the PSM does not appear to be consistent with current taxonomic practices or with most models of speciation.

In Jackson et al. (2017), a heuristic criterion was used to define species and that definition was used in phrapl but not in bpp when both programs were used to infer species status. We perform a fair comparison in which the same heuristic species definition is used in both phrapl and bpp analyses. We demonstrate that even though bpp ignores gene flow and is based on the simplistic JC mutation model (Jukes and Cantor 1969), it provides more accurate parameter estimates and inference of species status than phrapl when both programs use the same heuristic definition of species. We discuss the asymptotic behavior of Bayesian species delimitation through model selection as the number of loci increases.

Protracted Speciation?

A defining feature of the simulation by Sukumaran and Knowles (2017) under the PSM is that the conversion event that transforms a population (an incipient species) into a species (a true species) is independent of the process of genetic divergence among populations and of the generation of gene trees and sequence data. The PSM distinguishes between populations and species but when the population tree is used to simulate gene trees and sequences no such distinction is made. The simulation may be considered an attempt to mimic the use of the neutral genome or non-coding DNA to delineate species boundaries, but the procedure makes it clear that the simulated sequence data do not contain information concerning species status. This is a consequence of the likelihood principle in statistics, which states that all information about the competing models and model parameters is contained in the likelihood function, the probability of the data given the model and parameters (O’Hagan and Forster, 2004, pp. 61–64). If two models make the same probabilistic predictions about the observable data and thus have identical likelihoods for all possible data outcomes, the models are not identifiable and the data cannot be used to distinguish them.

The PSM used in the simulation of Sukumaran and Knowles (2017) is a simplified version in which the “species initiation rate” (rate at which incipient species arise) is identical for incipient and true species, as is the species extinction rate. Thus the model has three parameters, species initiation rate Inline graphic, species extinction rate Inline graphic, and species conversion rate Inline graphic. Because the rates Inline graphic and Inline graphic do not depend on the status of the population (incipient species or true species) it is straightforward to study the statistical properties of this model by first determining the probability density of the population tree under a conventional birth–death process and then superimposing the process of species conversion on the population tree.

Let Inline graphic be the population tree topology and Inline graphic be the set of divergence times (Fig. 1). Let Inline graphic be the set of population size parameters, with Inline graphic where Inline graphic is the effective population size and Inline graphic is the mutation rate per generation per site. Both parameters Inline graphic and Inline graphic are measured in the expected number of mutations per site. Let Inline graphic be the species delimitation (or a representation of coloring scheme in Fig. 1). Let the sequence data at the Inline graphic loci be Inline graphic, Inline graphic, and let the gene trees be Inline graphic. Bayesian species delimitation under the PSM should then involve a slight change to the formulation of Yang and Rannala (2010). The posterior probability distribution of species delimitation, species/population tree as well as the parameters in the MSC on the population tree (Inline graphics and Inline graphics) is then

Figure 1.


Figure 1 of Sukumaran and Knowles (2017) redrawn to illustrate the simulation of species (indicated by tip labels) under the protracted speciation model. The species tree is shown with one embedded gene tree (purple); a species conversion event happens when a branch on the species tree changes color.

graphic file with name M21.gif (1)

Here, the MSC density for the gene tree topology and coalescent times, Inline graphic, is given by Rannala and Yang (2003), and the probability of the sequence alignment at locus Inline graphic (known as the phylogenetic likelihood), Inline graphic, is given by Felsenstein (1981).

The joint prior for the population tree and species delimitation factors into two terms:

graphic file with name M25.gif (2)

where Inline graphic is the density for the population tree and divergence times given by the birth–death process (e.g., Rannala and Yang, 1996), while Inline graphic is the probability of the species delimitation (species conversions) given the population tree, which is specified by the Poisson process, with constant rate Inline graphic, of species conversions running along the branches of the population tree.

For example, given the population tree for eight populations of Figure 1, the probability of the delimitation in Figure 1 (i.e., the probability of the coloring scheme with the two conversion events and with the three true species) is

graphic file with name M29.gif (3)

where Inline graphic is the age of the Inline graphic ancestor, and so on. Note that the term in the square bracket is the total time length of the population tree.

From this formulation, it is clear that the impact of the PSM is to change the prior, since the sequence likelihood and the MSC density for the gene trees are unchanged. The fact that the species conversion process is conditionally independent of the population tree means that the genetic data do not allow species to be delimited without assuming the rate Inline graphic. It then follows that the posterior probabilities for the species delimitation models will be extremely sensitive to the conversion rate Inline graphic or its prior.
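The sensitivity to the conversion rate can be illustrated with a minimal sketch. It assumes the delimitation prior takes the usual homogeneous-Poisson form, that is, the rate raised to the number of conversion events times the exponential of minus the rate times the total tree length. The function name, the branch lengths, and the rate values are hypothetical and are not taken from the simulations discussed here.

```python
import math

def conversion_log_density(branch_lengths, n_conversions, rate):
    """Log density of n_conversions conversion events at fixed positions on a
    population tree, under a homogeneous Poisson process with constant rate.

    Assumed form: rate**k * exp(-rate * T), with T the total tree length.
    """
    total_length = sum(branch_lengths)
    return n_conversions * math.log(rate) - rate * total_length

# Hypothetical branch lengths (arbitrary time units) for a small population tree.
branch_lengths = [0.5, 0.5, 1.0, 0.3, 0.3, 0.7, 1.2]

# The same colouring (two conversion events) receives very different prior
# densities depending on the assumed conversion rate.
for rate in (0.1, 1.0, 10.0):
    density = math.exp(conversion_log_density(branch_lengths, 2, rate))
    print(f"rate={rate:5.1f}  prior density={density:.3e}")
```

Because the sequence likelihood is the same for every colouring of a given population tree, differences of this kind in the prior carry over directly to the posterior.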

The Protracted Speciation Model Assumes Instantaneous Speciation

The PSM assumed in the simulation of Sukumaran and Knowles (2017) has several extreme properties, making it an unrealistic model for most speciation processes in nature. The model posits an exaggerated form of punctuated equilibrium: exponentially distributed periods of stasis followed by an instantaneous conversion to a new species. At the conversion event, the new population and the parental population (which are only one generation apart) are deemed distinct species. Even though Sukumaran and Knowles (2017) used the PSM to simulate speciation as an extended process rather than an event, the PSM assumes instantaneous speciation, that is, conversion of an incipient species into a true species in one generation. Few species appear to have originated in this way. An alternative “gradualist” model would treat morphological characters involved in species classification as quantitative traits that evolve according to a diffusion model determined by the effects of underlying mutational changes and genetic drift of allele frequencies. Two populations are recognized as different species if the difference in mean trait values exceeds some threshold, which reflects the biologist’s perception of what species are and how morphologically different distinct species should be. Under such a model there will be a strong covariance between genetic isolation, population divergence time, and species status. This gradualist model is another extreme, and a more realistic model may include a mixture of morphological “jumps” as well as “diffusions” (see Discussion section).

The way in which the PSM assigns species status is also problematic, contradicting prevailing taxonomic practices. In Figure 1 of Sukumaran and Knowles (2017), the different colors on branches signify distinct species produced by conversion events under the PSM (Fig. 1). It is possible for the model to generate species near the tips of the species tree, say, Inline graphic generations ago. However, taxonomists would not recognize recent divergences of only a few generations as valid speciation events. Instead, speciation is a consequence of an extended process of genetic isolation, and species status is assigned retrospectively based on empirical measures of morphological and/or genetic divergence. It may not be possible to simulate species forward in time because the criterion of the systematist depends on the level of divergence between populations and this is only known after the simulation of population splits is completed.

Asymptotic Behavior of Bayesian Comparison of Species Delimitation Models

Jackson et al. (2017) simulated data under the MSC model with migration (Hey, 2010, the so-called isolation-with-migration or IM model) for two species/populations and analyzed them using bpp to calculate the posterior probabilities for the one-species and two-species models. They observed that the posterior probability for the two-species model increases when the number of loci increases. Here, we investigate the asymptotic behavior of Bayesian posterior model probabilities and confirm that this is the expected behavior of Bayesian model selection and of the program.

Choosing Among Wrong Models

The asymptotic dynamics of Bayesian model selection depends on how wrong the two competing models are relative to the true data-generating model (Yang and Zhu 2018). Here, we consider independent and identically distributed (i.i.d.) models only, under which the data points Inline graphic are i.i.d., with Inline graphic. Let Inline graphic. The distance from any model Inline graphic with parameters Inline graphic to the true model Inline graphic is measured by the Kullback–Leibler (KL) divergence

$$D = \int g(x)\,\log\frac{g(x)}{f(x \mid \phi^*)}\,\mathrm{d}x, \qquad (4)$$

where $g(x)$ is the density of a data point under the true model, $f(x \mid \phi)$ is its density under the model being considered, and $\phi^*$ is the limiting maximum likelihood estimate (MLE) of $\phi$ under the model when the data size $n \to \infty$, and is known as the best-fitting parameter value under the model (White 1982). The KL divergence $D = 0$ if the model encompasses the true model (or, in other words, is true), and $D > 0$ if the model is wrong.
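A toy calculation may help fix the idea of a best-fitting parameter value under a wrong model. The sketch below is not related to the MSC: it assumes a hypothetical true distribution over three outcomes and a wrong binomial model with two trials, and finds by grid search the parameter that minimizes the KL divergence. All names and numbers are illustrative assumptions.

```python
import math

def kl_divergence(g, f):
    """KL divergence from model f to true distribution g (both discrete)."""
    return sum(gi * math.log(gi / fi) for gi, fi in zip(g, f) if gi > 0)

def binomial2(p):
    """Probabilities of 0, 1, 2 successes in two trials (the 'wrong' model)."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]

# Hypothetical true distribution over the outcomes 0, 1, 2; it is not binomial.
g = [0.5, 0.3, 0.2]

# Best-fitting parameter value: the p that minimizes the KL divergence, which
# is also where the MLE converges as the sample size grows.
grid = [i / 1000 for i in range(1, 1000)]
p_star = min(grid, key=lambda p: kl_divergence(g, binomial2(p)))
d_min = kl_divergence(g, binomial2(p_star))

print(f"best-fitting p = {p_star:.3f}, minimum KL divergence = {d_min:.4f}")
```

The minimum divergence is strictly positive because no binomial distribution matches the true one, which is the sense in which a model is "wrong" in the text.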

Here the true model Inline graphic is the MSC model with migration (the IM model). Under the model, the gene trees and sequence alignments are i.i.d. among loci, so that the datasize is the number of loci (Inline graphic). Currently, bpp does not accommodate migration or introgression and implements the complete isolation model only. The two models under comparison are then the one-species model Inline graphic with a single population-size parameter Inline graphic and the two-species model Inline graphic with parameters Inline graphic, where Inline graphic (for Inline graphic) is the divergence time between the two species, and the Inline graphics are the population size parameters for the two modern species Inline graphic and Inline graphic and for the ancestral species Inline graphic, with Inline graphic (Fig. 2a). Both Inline graphic and Inline graphic are measured in the expected number of mutations per site. As the true model involves migration, both models Inline graphic and Inline graphic are wrong, with Inline graphic. Note that Inline graphic is a special case of Inline graphic since the two models are equivalent when Inline graphic in Inline graphic, in which case parameters Inline graphic and Inline graphic in Inline graphic are unidentifiable. The dynamics of the posterior probabilities for Inline graphic and Inline graphic depends on whether Inline graphic and Inline graphic are equally wrong (in which case Inline graphic) or Inline graphic is less wrong than Inline graphic (with Inline graphic), or equivalently on whether the best fitting parameter value for Inline graphic in Inline graphic is Inline graphic or Inline graphic. If Inline graphic, the two models will be equally wrong, and they are also unidentifiable in the limit of infinite data. Then Inline graphic, with fewer parameters, dominates, with its posterior probability approaching 100% when the number of loci Inline graphic increases. 
In contrast, if Inline graphic, Inline graphic is less wrong than Inline graphic, and Inline graphic will dominate. While an analytical proof is not available, we analyze increasingly larger data sets to examine the asymptotic behavior of the MLEs numerically. Our calculations suggest that the second case applies: when the true model is the MSC model for two populations with migration, the two-species isolation model Inline graphic is less wrong than the one-species model Inline graphic and dominates in the posterior when the number of loci increases.

Figure 2.


a) A species tree for two species (Inline graphic and Inline graphic) and three gene trees for two sequences (Inline graphic and Inline graphic), used to illustrate the asymptotics of Bayesian model selection. The coalescence between the two sequences occurs before species divergence in the brown and purple gene trees (with Inline graphic) and after in the green gene tree (with Inline graphic). b) A species tree for two species (Inline graphic and Inline graphic) and two gene trees for three sequences (Inline graphic and Inline graphic from species Inline graphic and Inline graphic from species Inline graphic), used to illustrate the computation of the gdi. Both gene trees have the same topology Inline graphic, but the coalescence between Inline graphic and Inline graphic occurs before species divergence (in species Inline graphic) in the green tree (with Inline graphic) and after in the brown tree (with Inline graphic).

As an example, we simulate large data sets with many loci, each of 500 sites, under the symmetrical IM model for two species with Inline graphic for the species divergence and Inline graphic for all populations, and with migration rates between the two populations to be Inline graphic immigrants per generation (Fig. 2a). In this article, the (scaled) migration rate is defined as Inline graphic, the expected number of immigrants in population Inline graphic from population Inline graphic per generation, with Inline graphic to be the proportion of immigrants in population Inline graphic. The MCcoal program, in the bpp package, was used to generate gene trees and sequence alignments under the JC model (Jukes and Cantor 1969). Each locus has two sequences, Inline graphic and Inline graphic, from species Inline graphic and Inline graphic, respectively. At those parameter values, sequences Inline graphic and Inline graphic coalesce before species divergence (with Inline graphic, as in the brown and purple gene trees of Fig. 2a) at 62.75% of loci, which is very similar to the probability for Inline graphic (63.21%) if the two sequences are from the same population.
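The within-population figure of 63.21% quoted above equals 1 − e^{−1}, which is what the standard coalescent gives when the divergence time is exactly one coalescent time unit. The sketch below assumes the usual result that two sequences from one population coalesce before time tau with probability 1 − exp(−2·tau/theta), with both quantities in expected mutations per site. The parameter values are hypothetical, chosen only so that 2·tau/theta = 1; they are not asserted to be the values used in the simulation.

```python
import math

def prob_coalesce_before(tau, theta):
    """Probability that two sequences sampled from the same population of size
    parameter theta coalesce more recently than time tau (standard coalescent,
    no migration); 2*tau/theta is the time in coalescent units."""
    return 1.0 - math.exp(-2.0 * tau / theta)

# Hypothetical values chosen so that 2*tau/theta = 1.
tau, theta = 0.005, 0.01
print(f"{100 * prob_coalesce_before(tau, theta):.2f}%")
```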

The data are then analyzed using the 3S program to obtain the MLEs for the two parameters (Inline graphic and Inline graphic) under the two-species MSC model with no migration (Inline graphic) (Yang 2002; Dalquen et al. 2017). The estimate of Inline graphic is 0.0158. The MLE Inline graphic ranged from 0.00033 to 0.00036 over ten replicates for Inline graphic and from 0.000329 to 0.000348 for Inline graphic. Based on the stability of the estimates among the replicate data sets and between the large values of Inline graphic, we suggest that at the limit of infinitely many loci, the best-fitting parameter value is Inline graphic. We note that the best-fitting parameter value depends on the configuration of the data such as the number of sequences per locus and the number of sites, as well as the parameters of the MSC model with migration (Inline graphics, Inline graphics, and Inline graphic’s). If the sequence length is 250 sites instead of 500, we obtain Inline graphic instead of 0.00034. Those results provide numerical evidence that at the limit of infinite data, Inline graphic, so that the two-species model will dominate the posterior, even though the migration rates are so high between the two populations that they should be considered one species by any species definition.

The Impact of Migration or Gene Flow

Note that if Bayesian model selection is conducted under the IM model, incorporating migration, the two-species model with migration will be correct (with Inline graphic), while the one-species model will be wrong (with Inline graphic). Then the two-species model will dominate with the posterior probability approaching 100% as the number of loci increases. This is the case even if the migration rate Inline graphic is very large (but finite). Thus if we use Bayesian model selection to infer species status (treating a population split as a speciation event) then incorporating migration into the MSC model will not correct the problem of over-splitting.

In conclusion, the concern that Bayesian model selection as implemented in bpp may over-split and recognize too many species in subdivided populations with ongoing gene flow is legitimate. Over-splitting may be of particular concern when hundreds or thousands of loci are analyzed. If two populations are truly panmictic, the model with fewer parameters will be favored, and the populations will be correctly lumped into one species. However, if there is partial subdivision (even with relatively high levels of gene flow) the method will prefer the two-species model asymptotically as the number of loci increases. One possible solution is to include a model with gene flow and use model selection to choose among three models: (i) a single population; (ii) two completely isolated populations; and (iii) two populations with gene flow. A choice of model (i) strongly suggests a single species; a choice of model (ii) suggests two species, but a final decision should be based on a consideration of the population divergence time and other relevant information (morphology, etc.); a choice of model (iii) allows either one species or two, depending on considerations such as the degree of gene flow, distinctness of morphology, and so on.

Heuristic Species Delimitation

Jackson et al. (2017) suggested a heuristic criterion for species delimitation based on a genealogical divergence index (gdi) between populations that can be calculated using estimates of parameters under the MSC model with migration ($\theta$, $\tau$, and $M$). Suppose one samples two sequences ($a_1$ and $a_2$) from population $A$ and one sequence ($b$) from population $B$ (see Fig. 2b). Let the probability that the two sequences from population $A$ coalesce first, so that the gene tree is $G_1 = ((a_1, a_2), b)$, be

$$P_1 = \mathbb{P}(G_1 \mid \theta, \tau, M). \qquad (5)$$

Obviously $P_1$ ranges from $\tfrac{1}{3}$ (when the three sequences are interchangeable, as in the case of Inline graphic) to 1. Jackson et al. (2017) rescaled $P_1$ so that the genealogical divergence index,

$$\mathrm{gdi} = \frac{3P_1 - 1}{2}, \qquad (6)$$

ranges from 0 to 1 when $P_1$ goes from $\tfrac{1}{3}$ to 1. In the special case of no migration (with $M = 0$), we have $P_1 = 1 - \tfrac{2}{3}\,e^{-2\tau/\theta_A}$ and

$$\mathrm{gdi} = 1 - e^{-2\tau/\theta_A}, \qquad (7)$$

where $2\tau/\theta_A$ is the population divergence time in coalescent units (with one coalescent time unit being $2N_A$ generations) and $e^{-2\tau/\theta_A}$ is the probability that the two sequences from population $A$ ($a_1$ and $a_2$) do not coalesce before reaching species divergence ($\tau$) when we trace the genealogy backwards in time.
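The no-migration case can be checked numerically with a short sketch. It assumes the standard coalescent argument: the two sequences from the focal population either coalesce before the divergence time, with probability 1 − exp(−2·tau/theta_A), or all three lineages enter the ancestor, where each of the three topologies has probability 1/3. The function names and the example values are hypothetical.

```python
import math

def p1_no_migration(tau, theta_a):
    """Probability that the two sequences from the focal population coalesce
    first (no migration): either they coalesce before the divergence time, or
    they survive to the ancestor and happen to coalesce first there (1/3)."""
    survive = math.exp(-2.0 * tau / theta_a)
    return (1.0 - survive) + survive / 3.0

def gdi_from_p1(p1):
    """Rescale P1 from the range [1/3, 1] to the range [0, 1]."""
    return (3.0 * p1 - 1.0) / 2.0

def gdi_no_migration(tau, theta_a):
    """Closed form implied by the two functions above."""
    return 1.0 - math.exp(-2.0 * tau / theta_a)

# Hypothetical parameter values in expected mutations per site.
for tau in (0.0005, 0.005, 0.05):
    theta_a = 0.01
    print(tau, round(gdi_from_p1(p1_no_migration(tau, theta_a)), 4),
          round(gdi_no_migration(tau, theta_a), 4))
```

The two routes agree for any input, so the index depends on the parameters only through the divergence time in coalescent units.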

The gdi Heuristic for Species Identification

Jackson et al. (2017) calculated the gdi as defined in equations 5 and 6 by simulating 10,000 gene trees under the MSC model with migration. Here, we provide its analytical computation, using the Markov chain characterization of the coalescent process with migration (Hobolth et al. 2011; Zhu and Yang 2012; Dalquen et al. 2017). For two populations (Inline graphic and Inline graphic) with gene flow and three sequences (Inline graphic, Inline graphic, and Inline graphic), the genealogical process of coalescent and migration when one traces the history of the sample backwards in time can be described by a Markov chain with 21 states. The state of the chain is specified by the number of sequences remaining in the sample and the populations in which they reside, or by the population IDs (Inline graphic and Inline graphic) and the sequence IDs (Inline graphic, etc.). For example, the state Inline graphic means that the three sequences Inline graphic, and Inline graphic are in populations Inline graphic, Inline graphic, and Inline graphic, respectively. We also write this as “Inline graphic”. This is the initial state. State Inline graphic, abbreviated “Inline graphic”, means that two sequences remain in the sample, with the ancestor of sequences Inline graphic and Inline graphic in population Inline graphic and sequence Inline graphic in population Inline graphic.
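The count of 21 states can be verified by enumeration. The sketch below uses assumed labels (populations "A" and "B", sequences "a1", "a2", and "b") purely for illustration; the source's own symbols are not reproduced here. It lists the states with three lineages, the states with two lineages after one coalescence, and a single absorbing state.

```python
from itertools import product

POPS = ("A", "B")            # assumed population labels
SEQS = ("a1", "a2", "b")     # assumed sequence labels

def enumerate_states():
    """List the states of the backward-in-time chain for three sequences."""
    states = []
    # Three lineages: each sequence may sit in either population (2**3 states).
    for locs in product(POPS, repeat=3):
        states.append(tuple(zip(SEQS, locs)))
    # Two lineages: one pair has coalesced (3 choices of pair), and the merged
    # lineage and the remaining sequence may each be in either population.
    for pair in (("a1", "a2"), ("a1", "b"), ("a2", "b")):
        other = next(s for s in SEQS if s not in pair)
        for loc_merged, loc_other in product(POPS, repeat=2):
            states.append((("+".join(pair), loc_merged), (other, loc_other)))
    # One lineage: absorbing state, with its location not tracked.
    states.append((("a1+a2+b", "A|B"),))
    return states

def abbreviate(state):
    """Concatenate population labels, e.g. the initial state becomes 'AAB'."""
    return "".join(loc for _, loc in state)

states = enumerate_states()
initial = (("a1", "A"), ("a2", "A"), ("b", "B"))
print(len(states), abbreviate(initial))
```

The breakdown is 8 three-lineage states, 12 two-lineage states, and 1 absorbing state.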

The transition rate matrix of the Markov chain Inline graphic is given in Table 1. The transition probability matrix over time Inline graphic is then Inline graphic, where Inline graphic is the probability that the Markov chain is in state Inline graphic at time Inline graphic in the past given that it is in state Inline graphic at time 0 (the present time). Suppose Inline graphic has the spectral decomposition

Table 1.

Rate matrix for Markov chain describing transitions between states in multispecies coalescent with migration model with two populations (Inline graphic and Inline graphic) and three sequences (Inline graphic, Inline graphic, and Inline graphic)

  Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic Inline graphic Inline graphic Inline graphic   Inline graphic       Inline graphic Inline graphic Inline graphic                    
Inline graphic Inline graphic Inline graphic   Inline graphic   Inline graphic                           Inline graphic  
Inline graphic Inline graphic   Inline graphic Inline graphic     Inline graphic                       Inline graphic    
Inline graphic   Inline graphic Inline graphic Inline graphic       Inline graphic             Inline graphic            
Inline graphic Inline graphic       Inline graphic Inline graphic Inline graphic                     Inline graphic      
Inline graphic   Inline graphic     Inline graphic Inline graphic   Inline graphic               Inline graphic          
Inline graphic     Inline graphic   Inline graphic   Inline graphic Inline graphic                 Inline graphic        
Inline graphic       Inline graphic   Inline graphic Inline graphic Inline graphic       Inline graphic Inline graphic Inline graphic              
Inline graphic                 Inline graphic           Inline graphic     Inline graphic     Inline graphic
Inline graphic                   Inline graphic           Inline graphic     Inline graphic   Inline graphic
Inline graphic                     Inline graphic           Inline graphic     Inline graphic Inline graphic
Inline graphic                       Inline graphic     Inline graphic     Inline graphic     Inline graphic
Inline graphic                         Inline graphic     Inline graphic     Inline graphic   Inline graphic
Inline graphic                           Inline graphic     Inline graphic     Inline graphic Inline graphic
Inline graphic                 Inline graphic     Inline graphic     Inline graphic            
Inline graphic                   Inline graphic     Inline graphic     Inline graphic          
Inline graphic                     Inline graphic     Inline graphic     Inline graphic        
Inline graphic                 Inline graphic     Inline graphic           Inline graphic      
Inline graphic                   Inline graphic     Inline graphic           Inline graphic    
Inline graphic                     Inline graphic     Inline graphic           Inline graphic  
Inline graphic                                         Inline graphic

Note: Inline graphic and Inline graphic are mutation-scaled migration rates, and Inline graphic and Inline graphic are the coalescent rates. The state of the chain is given by the population IDs (Inline graphic or Inline graphic) and sequence IDs (such as Inline graphic, Inline graphic, Inline graphic). For example, the initial state Inline graphic means that the three sequences Inline graphic, and Inline graphic are from populations Inline graphic, Inline graphic, and Inline graphic, respectively. States with three sequences are abbreviated, with the three sequences assumed to be in the order Inline graphic so that the sequence IDs are suppressed. Thus Inline graphic is ‘Inline graphic’. State Inline graphic means that two sequences remain in the sample, with the ancestor of sequences Inline graphic and Inline graphic in population Inline graphic and sequence Inline graphic in population Inline graphic. This is abbreviated ‘Inline graphic’, with the sequence ID ‘Inline graphic’ suppressed. ‘Inline graphic’ is an absorbing state in which only one sequence remains in the sample, in either Inline graphic or Inline graphic, after two coalescent events have occurred.

graphic file with name M205.gif (8)

where Inline graphic are the eigenvalues of Inline graphic, columns in Inline graphic are the corresponding right eigenvectors, and rows in Inline graphic are the left eigenvectors. Then

graphic file with name M373.gif (9)
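Computing transition probabilities from a spectral decomposition can be sketched with NumPy. The example assumes a diagonalizable rate matrix and uses a hypothetical three-state toy chain (two lineages in the same population, two lineages in different populations, and coalesced) rather than the 21-state chain of Table 1; the rates are arbitrary.

```python
import numpy as np

def transition_matrix(Q, t):
    """P(t) = exp(Q t) via the spectral decomposition Q = U diag(lambda) U^{-1}.

    Assumes Q is diagonalizable (true for the toy matrix below).
    """
    eigvals, U = np.linalg.eig(Q)      # columns of U are right eigenvectors
    V = np.linalg.inv(U)               # rows of V are left eigenvectors
    # Scale column k of U by exp(lambda_k * t), then multiply by V.
    P = (U * np.exp(eigvals * t)) @ V
    return P.real

# Hypothetical toy rates: migration moves a pair between "same population" and
# "different populations"; coalescence is possible only from "same population".
m, c = 0.5, 1.0
Q = np.array([
    [-(2 * m + c), 2 * m, c],    # same population
    [2 * m, -2 * m, 0.0],        # different populations
    [0.0, 0.0, 0.0],             # coalesced (absorbing)
])

P = transition_matrix(Q, 0.7)
print(np.round(P, 4))
```

Once the decomposition is available, integrals of transition probabilities over time reduce to integrals of exponentials, which is what makes the analytical route attractive compared with simulating gene trees.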

Gene tree Inline graphic can be generated in two ways. The first is for sequences Inline graphic and Inline graphic to coalesce before reaching the ancestral population, with Inline graphic (as in the green gene tree of Fig. 2b). Sequence Inline graphic then joins the ancestor of sequences Inline graphic and Inline graphic either before species divergence at Inline graphic, in which case the root of the gene tree is younger than species divergence, or after, in which case the root of the gene tree is older than Inline graphic (the latter case is illustrated in the green gene tree of Fig. 2b).

The probability density that sequences Inline graphic and Inline graphic coalesce at time Inline graphic is given by

graphic file with name M386.gif (10)

This is a sum of two terms, corresponding to the first coalescent (between sequences Inline graphic and Inline graphic) occurring in populations Inline graphic and Inline graphic, respectively. The first term is the probability, Inline graphic, that sequences Inline graphic and Inline graphic are in population Inline graphic right before time Inline graphic, times the rate for them to coalesce Inline graphic. Similarly, the second term is the probability density that sequences Inline graphic and Inline graphic coalesce at time Inline graphic in population Inline graphic (Fig. 2b, green gene tree).

The second way of generating gene tree Inline graphic is for sequences Inline graphic and Inline graphic to coalesce after population divergence, with Inline graphic (as in the brown gene tree of Fig. 2b). This occurs with probability Inline graphic, where Inline graphic is the set of states with three sequences, and Inline graphic is the probability that no coalescent event occurs during the time interval Inline graphic. In this scenario, the gene tree root must be older than Inline graphic.

Thus combining the two possibilities for generating gene tree Inline graphic, we have

graphic file with name M411.gif (11)

where Inline graphic is given in equation 10. To calculate the integral in equation 11, note that from equation 9,

graphic file with name M413.gif (12)

We calculated Inline graphic under the symmetrical migration model with Inline graphic and Inline graphic. Figure 3b shows Inline graphic plotted against Inline graphic (population divergence time in coalescent units) and Inline graphic under the symmetrical migration model. This is a more accurate calculation than Figure 6 of Jackson et al. (2017), which was based on simulating gene trees, even though the two approaches are equivalent if a huge number of replicates is used in the simulation.

Figure 3.

Probability Inline graphic of gene tree Inline graphic, plotted (a) as a function of population divergence in coalescent units (Inline graphic) in a pure isolation model for two populations without gene flow and (b) as a function of population divergence in coalescent units (Inline graphic) and scaled migration rate Inline graphic. According to Jackson et al. (2017), the lower and upper limits of Inline graphic for species delimitation are 0.47 and 0.8.

Based on the meta-analysis of Pinho and Hey (2010), Jackson et al. (2017) suggested the rule of thumb that gdi values < 0.2 suggest a single species and gdi values > 0.7 suggest distinct species, while gdi values between these limits indicate ambiguous delimitation. The limits of 0.2 and 0.7 for gdi correspond to 0.47 and 0.8 for Inline graphic, and in the case of no migration, to 0.22 and 1.20 for the population divergence in coalescent units (Inline graphic) (Fig. 3a).
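The correspondence between these thresholds can be checked with a few lines of Python. The sketch assumes the linear rescaling gdi = (3·P1 − 1)/2 between the gdi and the gene tree probability (written `p1` here), and, without migration, gdi = 1 − exp(−2τ/θ). These forms and the function names are assumptions of this example, although they do reproduce the numbers quoted above.

```python
import math

def p1_from_gdi(gdi):
    """Gene tree probability implied by a gdi value, assuming gdi = (3*p1 - 1)/2."""
    return (1.0 + 2.0 * gdi) / 3.0

def divergence_from_gdi(gdi):
    """Divergence 2*tau/theta implied by a gdi value, assuming no migration
    and gdi = 1 - exp(-2*tau/theta)."""
    return -math.log(1.0 - gdi)

for cutoff in (0.2, 0.7):
    print(cutoff, round(p1_from_gdi(cutoff), 2), round(divergence_from_gdi(cutoff), 2))
```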

Subjectively Defined Species

Jackson et al. (2017) simulated data under the MSC model with migration for two populations and analyzed the data using phrapl and bpp. While the true model used in the simulation always had two populations, the gdi was used to define species status. This criterion was used in the phrapl analysis of the simulated data to infer species status, but not in bpp. It was then found that phrapl out-performed bpp (Jackson et al. 2017, Fig. 4), and that bpp tended to over-split, identifying too many species.

Comparison Between phrapl and bpp

Both bpp and phrapl can estimate the parameters of the MSC model, although phrapl accommodates gene flow, while bpp in its current implementation assumes no gene flow. Here, we apply the gdi definition of species status in bpp, so that the same criterion is used by bpp and phrapl. A simple approach is to use the posterior means of the parameters under the MSC generated by bpp to calculate the gdi (equation 7). We use this method here. A more sophisticated approach, which we use later in the analysis of the empirical data sets, is to generate a posterior distribution of gdi using the sample of parameters taken during the MCMC.
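The more sophisticated approach can be sketched as follows: the gdi is evaluated for every MCMC sample and the resulting values are summarized. The sketch assumes the no-migration form gdi = 1 − exp(−2τ/θ) and takes the samples as plain Python lists; reading them from a bpp output file is not shown, and the function names and sample values are hypothetical.

```python
import math

def gdi_samples(tau_samples, theta_samples):
    """One gdi value per MCMC sample, assuming gdi = 1 - exp(-2*tau/theta)."""
    return [1.0 - math.exp(-2.0 * tau / theta)
            for tau, theta in zip(tau_samples, theta_samples)]

def summarize(values, level=0.95):
    """Mean and a simple equal-tail interval read off the sorted values."""
    xs = sorted(values)
    n = len(xs)
    lo = xs[int(math.floor((1.0 - level) / 2.0 * (n - 1)))]
    hi = xs[int(math.ceil((1.0 + level) / 2.0 * (n - 1)))]
    return sum(xs) / n, lo, hi

# Hypothetical samples, for illustration only.
taus = [0.0010, 0.0012, 0.0009, 0.0011]
thetas = [0.0040, 0.0050, 0.0045, 0.0038]
print(summarize(gdi_samples(taus, thetas)))
```

Working with the per-sample values keeps the posterior uncertainty in the gdi, whereas plugging in posterior means yields only a point value.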

We thus repeated the simulation of Jackson et al. (2017, Fig. 4), applying gdi to bpp parameter estimates. The true species tree is Inline graphic, with six sets of species divergence time parameters, with Inline graphic, Inline graphic, and Inline graphic, with Inline graphic. Note that Inline graphic is much larger than Inline graphic, so that species Inline graphic is a distant outgroup, and the focus is on whether populations Inline graphic and Inline graphic are one or two species. Migration is assumed to occur between Inline graphic and Inline graphic, with Inline graphic, and Inline graphic, where Inline graphic is the number of immigrants per generation. The sequence data were simulated under the HKY model (Hasegawa et al. 1985), with base frequencies 0.3, 0.2, 0.3, and 0.2 (for T, C, A, and G) and transition/transversion rate ratio Inline graphic. For each of the Inline graphic parameter combinations for Inline graphic and Inline graphic, 50 replicate data sets were simulated. Each data set has 50 loci, with 20 sequences from each of the three species and 500 sites per sequence. The data were simulated using the mccoal program, part of the bpp release, as detailed in Zhang et al. (2011). We used bpp version 4.0 to estimate the parameters in the MSC model on the fixed species tree Inline graphic (this is the A00 analysis of Yang, 2015). Version 4.0 of the program assigns inverse-gamma priors to the parameters. We used the shape parameter 3 in the inverse-gamma priors, with the prior means set to match the true values: Inline graphic IG(3, 0.01) with mean Inline graphic, and Inline graphic IG(3, 0.025), IG(3, 0.05), and IG(3, 0.1), for the three true Inline graphic values. Note that the value 3 for the shape parameter means that the inverse-gamma priors are diffuse, with a coefficient of variation of Inline graphic. Estimation of parameters under the MSC is known to be fairly robust to the priors, for example, to a one order-of-magnitude change to the prior means (Burgess and Yang 2008). After bpp generated the posterior distribution of the parameters, we used the posterior means to calculate gdi using equation 7, with Inline graphic and Inline graphic.
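For readers who wish to check the prior settings, the short Python sketch below evaluates the mean and coefficient of variation of an inverse-gamma distribution IG(α, β), using the standard results mean = β/(α − 1) for α > 1 and CV = 1/√(α − 2) for α > 2. The function names are introduced only for this example.

```python
import math

def inverse_gamma_mean(alpha, beta):
    """Mean of IG(alpha, beta); defined for alpha > 1."""
    return beta / (alpha - 1.0)

def inverse_gamma_cv(alpha):
    """Coefficient of variation of IG(alpha, beta); defined for alpha > 2."""
    return 1.0 / math.sqrt(alpha - 2.0)

# Priors with shape 3, as used in the analysis described above.
for beta in (0.01, 0.025, 0.05, 0.1):
    print(beta, inverse_gamma_mean(3, beta), inverse_gamma_cv(3))
```

With shape parameter 3 the prior mean is β/2 and the standard deviation equals the mean, which is why these priors are described as diffuse.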

The results are shown in Figure 4. Even though it ignores migration and uses an overly simplistic JC mutation model (while the true model is HKY), bpp performed better than phrapl in delimiting species status defined by the gdi, especially at high migration rates (with Inline graphic = 2 or 5). This result may seem counterintuitive, since the data were simulated with migration and phrapl allows for migration so that there is no model violation, while bpp ignores migration so that its model is violated.

Figure 4.

Accuracy of species delimitation using the gdi with parameters estimated from data of 50 loci using (a) phrapl and (b) bpp. Species status is defined using the gdi at different cutoffs (Inline graphic and Inline graphic). This is calculated by simulating 10,000 gene trees under the MSC model with migration for phrapl, and analytically for bpp. Along the Inline graphic-axis, each group of bars gives results for different gdi cut-offs. Below the lower bound (Inline graphic), populations Inline graphic and Inline graphic are defined as a single species; above the upper bound (Inline graphic), Inline graphic and Inline graphic are defined as separate species, while between the bounds, the species status is ambiguous. The six bars within each group represent the six sets of species divergence times (Inline graphics). The bar shadings are white = the inferred delimitation outcome matched the true outcome; light green/gray = ambiguity was inferred when the true delimitation is known (insufficient power); dark green/gray = delimitation was inferred (whether one or two species) when the truth was ambiguous (excessive confidence); and black = one species was inferred when there were two, or vice versa. The results for phrapl are recreated using the R code from Jackson et al. (2017, Fig. 4).

Shortcomings of Approximate Methods

We suggest that two factors may account for the poorer performance of phrapl in this simulation. First, phrapl is a summary method for estimating parameters, and it relies on gene tree topologies and ignores branch lengths. As a result, parameter estimates may be biased or even inconsistent due to phylogenetic errors in gene tree reconstruction (Yang 2002). Second, using gene tree topologies while ignoring branch lengths leads to information loss and may even cause identifiability problems. In the simple case of three species and three sequences, with one sequence from each species, there is only one degree of freedom in the data of gene tree topologies, which is the proportion of the most common gene tree topology. In this case the complete-isolation model (with Inline graphic) involves four parameters (two Inline graphics and two Inline graphics for the two ancestral species), but use of the gene tree topologies alone allows the estimation of only the internal branch length on the species tree in coalescent units, Inline graphic, while other parameters are unidentifiable (Xu and Yang 2016). Even the internal branch length is estimated inconsistently because phylogenetic reconstruction errors tend to inflate gene tree–species tree mismatches (Yang 2002).
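The single degree of freedom in the three-species case can be made concrete. Under complete isolation, with one sequence per species, the standard coalescent result is that the gene tree matches the species tree with probability 1 − (2/3)exp(−T), where T is the internal branch length in coalescent units. The Python sketch below inverts this relation; the function names are assumptions of this example, and the sketch treats gene tree topologies as known without error, which is exactly what the argument above cautions against.

```python
import math

def matching_probability(T):
    """Probability that the gene tree matches the species tree (three species,
    one sequence each, no migration), for internal branch length T."""
    return 1.0 - (2.0 / 3.0) * math.exp(-T)

def estimate_internal_branch(p_major):
    """Internal branch length implied by the proportion of the most common
    gene tree topology; truncated at zero when p_major <= 1/3."""
    if p_major >= 1.0:
        return math.inf
    return max(0.0, -math.log(1.5 * (1.0 - p_major)))

print(estimate_internal_branch(0.8))
```

Only T is recovered in this way; the individual divergence times and population sizes that make up T cannot be separated from topology frequencies alone.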

The cases with more than three sequences per locus and with migration may be more complex, but it should not be surprising that approximate methods that rely on summary statistics such as gene tree topologies will suffer from an information loss. In contrast, bpp is a full-likelihood method and makes use of information in the gene tree branch lengths (coalescent times) as well as topologies, while accommodating phylogenetic uncertainties due to the limited number of informative sites at each locus (Yang 2014; Xu and Yang 2016). Even though bpp operates under a wrong model that ignores migration, the sequence data at multiple loci may be informative about the expected gene tree configurations. Nevertheless, extension of bpp to allow for gene flow will provide more accurate estimation of parameters in the MSC model, which should lead to more accurate species delimitation using heuristic criteria such as gdi.

Heuristic Species Delimitation Using bpp

Here, we describe how Bayesian parameter estimation under the MSC model can be combined with gdi to delimit species using a hierarchical procedure based on a species/population tree. This is similar to the use of a “guide tree” for species delimitation by Yang and Rannala (2010), in that an ancestral node on the guide tree is merged into one species only if its descendant nodes are merged. However, here, we rely on Bayesian parameter estimation on a fixed species/population tree while Yang and Rannala (2010) used reversible-jump algorithms to calculate posterior probabilities for different species delimitation models (represented by merging nodes on the guide tree). We first demonstrate the procedure using a simulated data set and then apply it to the analysis of three empirical data sets analyzed previously by Jackson et al. (2017). The gdi is only one of many possible heuristics with rough correspondences to different species definitions.

We use a species/population tree for five populations, Inline graphic, to simulate data (Fig. 5a). Inline graphic represents a large paraphyletic species with a broad geographic distribution arranged in a stepping-stone design, with migration between any two adjacent populations including the ancestors (e.g., between Inline graphic and the ancestral population Inline graphic after the first population split, and then between Inline graphic and Inline graphic and between Inline graphic and Inline graphic after the second split, etc.). The scaled migration rate is Inline graphic for any pair of adjacent populations. Inline graphic is a new species, having separated from population Inline graphic (with Inline graphic), and there is no gene flow involving Inline graphic. The divergence times (Inline graphics) are at 0.04, 0.03, 0.02, and 0.01. The population size parameter is Inline graphic for all populations. We simulated 100 loci, each of 500 sites, for four samples per species (20 sequences per locus).

Figure 5.

Species delimitation applying heuristic index gdi to parameter estimates from bpp. a) Species tree used for simulation allows migration between populations Inline graphic, and Inline graphic and their ancestors (indicated by arrows), but no gene flow involving species Inline graphic. b) Species (guide) tree inferred from A11 analysis of bpp. In (b–g), gdi is used to collapse populations on guide tree into same species in a hierarchical procedure, with bpp used to estimate MSC parameters (Inline graphic and Inline graphic) and generate posterior distribution of gdi. For example, gdi calculated using population Inline graphic of panel b, based on Inline graphic (equation 7), is shown in panel c (labeled ‘sp. Inline graphic’). Sister populations inferred to belong to same species by gdi are collapsed, and resulting species tree is used to conduct a new bpp analysis. Procedure is repeated until distinct species are inferred or until root of tree is reached. According to Jackson et al. (2017), gdiInline graphic indicates a single species, gdiInline graphic indicates distinct species, and gdi values between 0.2 and 0.7 represent ambiguous species status.

To generate a working species/population tree (the guide tree), we run a joint analysis of species delimitation and species tree estimation (the A11 analysis in bpp, Yang, 2015). The parameters in the MSC model are assigned diffuse inverse-gamma priors Inline graphic and Inline graphic, with shape parameter 3 and with the prior means matching the true values. We used a burnin of 40,000 and a sample frequency of 10, and collected 50,000 samples. We conducted four separate runs for each analysis, with convergence ensured mainly by checking consistency between runs. The posterior probabilities for the species delimitation models calculated in the A11 analysis provided strong support for five species, and the inferred species tree incorrectly placed species Inline graphic sister to Inline graphic (Fig. 5b). This incorrect topology may be expected, as populations exchanging genes tend to form clades in species tree analyses that ignore migration (Leaché et al., 2013). Next, we run an A00 analysis, estimating parameters on the inferred guide tree (Fig. 5b) to generate the posterior distribution for the gdi for the most recent species divergences, between Inline graphic and Inline graphic and between Inline graphic and Inline graphic (Fig. 5c). Note that Inline graphic is used to decide whether population Inline graphic is a species distinct from Inline graphic, while Inline graphic is used to decide whether population Inline graphic is a species distinct from Inline graphic. Low gdi values of Inline graphic indicate that Inline graphic and Inline graphic are one species, as are Inline graphic and Inline graphic. Next, we collapse Inline graphic and Inline graphic, and Inline graphic and Inline graphic, and conduct another A00 analysis to estimate Inline graphic and Inline graphic for putative species Inline graphic and Inline graphic (Fig. 5d). The posterior distribution of gdi obtained suggests that Inline graphic and Inline graphic belong to the same species (Fig. 5e). The final iteration fits a two-species model containing species Inline graphic and species Inline graphic (Fig. 5f). The gdi value for species Inline graphic is ambiguous (with Inline graphicgdiInline graphic), while the evidence for species Inline graphic is strong (gdiInline graphic, Fig. 5g). Here, the gdi gives an ambiguous species status for Inline graphic and Inline graphic, depending on which population size (Inline graphic or Inline graphic) is used to calculate the index.
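The hierarchical procedure can be outlined in Python. This is a minimal sketch under stated assumptions, not the workflow of the bpp program: the guide tree is a nested tuple of population names, and `gdi_fn` is a hypothetical callback standing in for a fresh A00 analysis that returns the two gdi values for a pair of sister lineages (one per population size). The population names and gdi values in the mock table are invented.

```python
def delimit(tree, gdi_fn, lower=0.2):
    """Collapse sister lineages from the tips toward the root.

    A pair is merged only when both of its gdi values fall below `lower`;
    an ancestral node is examined only if both of its descendants have
    already been merged into single species.
    """
    if isinstance(tree, str):
        return tree
    left = delimit(tree[0], gdi_fn, lower)
    right = delimit(tree[1], gdi_fn, lower)
    if isinstance(left, str) and isinstance(right, str):
        gdi_left, gdi_right = gdi_fn(left, right)
        if gdi_left < lower and gdi_right < lower:
            return left + right          # merged into one putative species
    return (left, right)

# Mock gdi values (hypothetical); a real analysis would rerun the MCMC
# on the collapsed tree each time.
mock = {("a", "b"): (0.05, 0.08), ("c", "d"): (0.10, 0.10),
        ("ab", "cd"): (0.15, 0.12), ("abcd", "x"): (0.50, 0.90)}
guide_tree = ((("a", "b"), ("c", "d")), "x")
print(delimit(guide_tree, lambda l, r: mock[(l, r)]))
```

The sketch merges only on clear evidence for a single species and leaves ambiguous pairs separate; how to treat the ambiguous range is a choice for the analyst.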

Next, we re-analyzed the three empirical data sets of Jackson et al. (2017) using the hierarchical procedure described above. The three empirical data sets include eight nuclear loci from three populations of North American ground skinks (Scincella lateralis), 20 loci from three populations of southeastern United States pitcher plants (Sarracenia alata), and 50 loci from four populations of Homo sapiens. In the analysis of Jackson et al. (2017), phrapl supported a single species of Scincella lateralis and two species of Sarracenia alata, and grouped the human populations into one species, while Bayesian model selection by bpp inferred the maximum number of species in each data set.

Here, we used the MCMC samples generated in the bpp analysis (Yang, 2015, analysis A00) to estimate the posterior distribution of the gdi. We used inverse-gamma priors on parameters (Inline graphics and Inline graphics), with the shape parameter 3 and with the same prior means as used by Jackson et al. (2017). For each data set, we conducted four separate runs, each with a burnin of 10,000 and a sample frequency of 5, and collected 100,000 samples. The guide species trees are fixed at the previously published topologies from Jackson et al. (2017) (Fig. 6). We applied the hierarchical procedure to calculate gdi for population pairs by collapsing populations into a single species and conducting new MCMC analyses. Using bpp to calculate posterior distributions for gdi, we find no support for multiple species (gdiInline graphic) in any of the empirical data sets (Fig. 6).

Figure 6.

Posterior distribution of genealogical divergence index (gdi), generated in bpp analysis of three real data sets of Jackson et al. (2017). Silhouettes of species are from phylopic.org (http://phylopic.org). Colored ancestral branches were analyzed by collapsing descendant species and conducting new MCMC analyses.

Discussion

Simulation of Species Divergences

The PSM specifies a process of population splits (incipient species formation) as well as conversions of incipient species (populations) into true species. However, with time running forward, simulation under the PSM produces a new species (a conversion event) instantaneously. At a conversion event, the new true species and its parental incipient species (population) are deemed distinct species. As stated above, this process does not realistically model the biological process of speciation, nor does it mimic the way taxonomists identify new species. We consider two alternative approaches for simulating the process of population splits and species assignments, and discuss their implications for the development of methods for species delimitation using genomic sequence data. A clear specification of the simulation procedure implies a probabilistic model of data generation and statistical inference methodology, because given the model, full-likelihood methods (maximum likelihood and Bayesian inference) are known to have certain desirable statistical properties (Rannala 2015).

In the first approach, one can simulate population splits under a branching model, such as the birth–death process. The random birth and death events specify a probabilistic distribution of the population tree topology and divergence times (Inline graphics), and a certain model may be used to sample the population sizes (Inline graphics) and migration rates (Inline graphics). Gene trees (topologies and coalescent times) can be generated using the population tree with parameters (Inline graphics, Inline graphics, Inline graphics), and then used to simulate sequence alignments. At the end of this simulation, the populations at the tips of the population phylogeny are assigned species status using heuristic criteria of divergence times and migration rates. This is very similar to the simulation approach of Jackson et al. (2017).

In the second approach, one may simulate population splits as in the first approach, but in addition simulate the evolution of a continuous character along the branches of the generated population phylogeny. The difference in the continuous character between two populations is a measure of genetic incompatibility and a threshold can be used to identify species status: if the continuous character has measurements Inline graphic and Inline graphic in two populations, they are considered distinct species if and only if Inline graphic. Evolution of the continuous character may be simulated under a model for the accumulation of genetic incompatibilities (such as the Dobzhansky–Muller incompatibilities, Orr and Turelli, 2001), for example, with a small probability for “catastrophes” (mimicking large events that may establish reproductive isolation in an instant, such as chromosomal rearrangements or polyploidizations) and a large probability for Brownian motion-like drift over time (mimicking the accumulation of genetic incompatibilities over time). At the end of the simulation, species status is assigned for populations at the tips of the tree based on the differences in the continuous character.
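A rough Python sketch of this second approach, for a single pair of sister populations, is given here. Everything specific in it is an assumption made for illustration: the discretized Brownian motion, the way catastrophes are drawn, the parameter values, and the function names. It is not a model proposed or fitted in this article.

```python
import random

def evolve_character(x0, duration, sigma, jump_rate, jump_size, rng, dt=0.01):
    """Evolve a continuous character along one branch.

    Each small time step adds Brownian drift with variance sigma^2 * dt and,
    with probability jump_rate * dt, a catastrophe of magnitude jump_size.
    """
    x = x0
    for _ in range(int(round(duration / dt))):
        x += rng.gauss(0.0, sigma * dt ** 0.5)
        if rng.random() < jump_rate * dt:
            x += jump_size if rng.random() < 0.5 else -jump_size
    return x

def distinct_species(x1, x2, threshold):
    """Species status from the character difference between two populations."""
    return abs(x1 - x2) > threshold

# Two sister populations evolving independently since their split
# (all values hypothetical).
rng = random.Random(2018)
x1 = evolve_character(0.0, duration=1.0, sigma=0.5, jump_rate=0.1, jump_size=3.0, rng=rng)
x2 = evolve_character(0.0, duration=1.0, sigma=0.5, jump_rate=0.1, jump_size=3.0, rng=rng)
print(x1, x2, distinct_species(x1, x2, threshold=1.0))
```

Because the character evolves independently of the neutral sequences, the species label produced here carries no signal that sequence data from the neutral genome could recover directly.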

In both approaches, we assume that the process of sequence evolution is independent of population split events, and of the evolution of the continuous character, as expected if the neutral genome is used for species delimitation. Both scenarios seem to suggest that the only inference possible using the neutral genome is the population history and the population divergence parameters (Inline graphics, Inline graphics, and Inline graphics). Assignment of species status will then depend on our empirical knowledge about the level of genetic divergence between good species, or the expected amount of genetic incompatibility that may be accumulated over a given time period. Both approaches of simulation posit a protracted process of speciation (to allow accumulation of genetic incompatibilities or of differences in the continuous character), in contrast to the PSM, which assumes instantaneous speciation completed over one generation.

Hypothesis Testing Versus Parameter Estimation and the Functionalities of bpp

The MSC model was developed for comparative analysis of the ‘neutral’ genome to estimate parameters that characterize the history of population divergences, under the assumption that natural selection has not significantly altered the genealogical histories of genomic regions (gene tree topologies and coalescent times). The MSC model does not aim to identify speciation genes or genes responsible for establishing reproductive barriers (which may be under species-specific directional selection), even though identifying such genes, however rare they are, may greatly enrich our understanding of the origin and maintenance of species. For example, proteins involved in female and male reproduction are well-known to evolve at accelerated rates, apparently driven by natural selection due to ecological adaptations and sexual selection maintaining species boundaries (Swanson and Vacquier 2002). In a few cases where the MSC model was applied to exons or the coding genome, it was noted to produce results highly consistent with the non-coding regions of the genome (Ebersberger et al. 2007; Dalquen et al. 2017; Shi and Yang 2018). This is apparently due to the fact that most protein-coding genes are performing the same conserved functions in closely related species so that the effect of purifying selection removing nonsynonymous mutations is predominantly a reduction of the neutral mutation rate. At any rate, the MSC model treats genomic regions as neutral markers to extract information concerning genealogical histories of the populations, reflected in population divergence parameters, such as population sizes, divergence times, and migration rates.

We take it for granted that the neutral genome contains useful information about the population divergence history and about species status. In clear-cut cases, population divergence parameters should be sufficient to determine species status. For example, distantly related species can be reliably identified using a simple genetic distance threshold as in DNA-barcoding analysis (Hebert et al. 2004). The difficulty is in identifying the species boundary (the so-called boundary conditions, Moritz and Cicero, 2004) for allopatric populations with low levels of genetic divergence and possibly frequent gene flow. The definitions of races, subspecies and species are often subjective, and the neutral genome may not provide unambiguous resolution of species status (Rannala, 2015). If species divergence is due to very few genes (in the so-called speciation islands), while the rest of the genome is homogenized due to widespread interbreeding, the divergence between species may be similar to the polymorphism within species (Nadeau et al., 2012). In such cases the neutral genome may not be highly informative about the species status and use of other kinds of data, such as evidence of reproductive isolation and ecological adaptation or identification of speciation genes, may be necessary to determine species status.

The inherent subjectivity of allopatric species delimitation is clearly illustrated by the distinction between statistical significance and biological significance made by Jackson et al. (2017). Consider by analogy a coin-tossing experiment to determine whether a coin is biased. One can use a significance test to test the null hypothesis of a fair coin (with the probability of heads Inline graphic) against the alternative hypothesis of a biased coin (with Inline graphic) or calculate the posterior probabilities for the two models. With a large number of coin tosses, this approach of model selection may have the power to detect a very small bias, with Inline graphic, say. However, the bias of 0.01 is said to be statistically significant but not biologically significant, and it is considered incorrect to suggest that the coin with Inline graphic is biased. An alternative approach is to estimate the probability parameter Inline graphic using the counts of heads and tails, and then apply whatever definition of bias one assumes heuristically. Given the arbitrariness in the definition of a biased coin, this approach may be the only one feasible.
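The coin-tossing analogy can be illustrated numerically. The Python sketch below contrasts a significance test of the fair-coin hypothesis (using a normal approximation) with an estimate-then-threshold rule. The number of tosses, the tolerance of 0.05, and the function names are assumptions chosen for this example.

```python
import math

def fair_coin_test(heads, n):
    """Estimate of the heads probability, z statistic, and two-sided p-value
    for the null hypothesis of a fair coin (normal approximation)."""
    p_hat = heads / n
    z = (p_hat - 0.5) / math.sqrt(0.25 / n)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_hat, z, p_value

def heuristically_biased(p_hat, tolerance):
    """Heuristic rule: call the coin biased only if the estimated deviation
    from 0.5 exceeds a tolerance chosen by the investigator."""
    return abs(p_hat - 0.5) > tolerance

# One million tosses with a bias of 0.01 (hypothetical data).
p_hat, z, p_value = fair_coin_test(510_000, 1_000_000)
print(p_hat, z, p_value, heuristically_biased(p_hat, tolerance=0.05))
```

With enough tosses the test rejects the fair-coin hypothesis decisively, while the heuristic rule still classifies the coin as unbiased; the two analyses answer different questions.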

Similarly, we have in this article made a distinction between two kinds of analysis under the MSC model implemented in bpp: (i) Bayesian model selection to calculate posterior probabilities for different species delimitation models (the A10 and A11 analyses in Yang, 2015) and (ii) Bayesian parameter estimation when species/population assignment and phylogeny are fixed (the A00 analysis in Yang, 2015). In theory, selection of species delimitation models can also be conducted in a Frequentist framework using a likelihood ratio test, for example, with the one-species model formulated as the null hypothesis (with Inline graphic) and the two-species model the alternative (with Inline graphic). With genomic data, model selection in both the Frequentist and Bayesian frameworks may be very powerful in identifying population splits even if the age of the divergence event (Inline graphic) is very young.

We suggest that Bayesian model selection is appropriate for identifying morphologically cryptic species. Even if the genomic data or the bpp program cannot distinguish populations and species, the genetic distinctness of the populations signifies the presence of reproductive barriers or isolation mechanisms. There seems to be no controversy in assigning species status to populations that exist in sympatry and are genetically distinct.

For heuristic delimitation of allopatric species, we suggest the use of Bayesian parameter estimation. Genomic data allow reliable estimation of population-divergence parameters (Inline graphics, Inline graphics, and Inline graphics), which can then be used to apply a heuristic definition of species status.

Heuristic Criteria for Species Status

The gdi attempts to use the overall genetic divergence between two populations affected by the combined effects of genetic isolation and gene flow. The index appears to have weaknesses. First, the criterion depends on the population divergence time relative to the population size (Inline graphic in the case of no gene flow). If the population is established by a few founder individuals, Inline graphic and Inline graphic may be very small, and the use of gdi may lead to claims of species status even if the populations diverged very recently. It may be necessary to consider the (absolute) population divergence (Inline graphic) (Yang and Rannala, 2010) as well as the divergence relative to the population size. Second, there may be ambiguity when the two populations concerned have very different sizes. If Inline graphic, the use of gdi may lead to the awkward solution that Inline graphic is a distinct species from Inline graphic (if one uses sequences Inline graphic and Inline graphic to calculate the index) but Inline graphic is not a distinct species from Inline graphic (if one uses sequences Inline graphic, Inline graphic). This is the case in the analysis of the simulated data in Fig. 5g. Third, gdi has a large range of indecision (0.2–0.7), although this may reflect the arbitrary nature of species definition rather than a weakness of the index itself.
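The second weakness is easy to reproduce numerically. The Python sketch below assumes the no-migration form gdi = 1 − exp(−2τ/θ) and uses invented parameter values for two populations of very different sizes; the function names and the values are illustrative only.

```python
import math

def gdi(tau, theta):
    """gdi for one population, assuming no migration: 1 - exp(-2*tau/theta)."""
    return 1.0 - math.exp(-2.0 * tau / theta)

def classify(value, lower=0.2, upper=0.7):
    """Rule of thumb of Jackson et al. (2017) applied to a gdi value."""
    if value < lower:
        return "single species"
    if value > upper:
        return "distinct species"
    return "ambiguous"

tau = 0.001            # hypothetical divergence time
theta_small = 0.0005   # hypothetical small population (e.g., after a founder event)
theta_large = 0.02     # hypothetical large population

print(classify(gdi(tau, theta_small)))  # index computed with the small population
print(classify(gdi(tau, theta_large)))  # index computed with the large population
```

The same divergence time yields "distinct species" when the small population size is used and "single species" when the large one is used, which is the awkward outcome described above.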

There is clearly a need to refine criteria for heuristic species delimitation using genomic sequence data. It may be necessary to incorporate multiple criteria. For example, we may require a minimum species divergence time relative to the population size (Inline graphic), a minimum absolute divergence time (Inline graphic generations, say, as indicated by Inline graphic), and a maximum migration rate between species (Inline graphic). If a contact zone exists for the two populations, important indicators of pre- and post-mating reproductive isolation may be obtainable. For example, we may require the frequency of F1 hybrids (Inline graphic) to be Inline graphic of the frequency expected from population abundance, and we may further require the long-term migration rate to be much lower than the hybrid frequency (with Inline graphic, say), indicating selective rejection of introgressed alleles after hybridization.

Concluding Remarks

The MSC model and its implementation in bpp provide a powerful framework for inferring population divergence histories and estimating evolutionary parameters using the fast-accumulating genomic sequence data. There appears to be no controversy regarding the use of Bayesian model selection under MSC or bpp to identify morphologically cryptic species. For allopatric populations or species, the accurate estimation of important population parameters should allow one to apply any empirical criterion for defining species that the evolutionary biologist entertains. For these reasons, the MSC model and bpp will continue to be useful tools in the analysis of genomic data to better understand biodiversity despite the fact that the interpretation of these results in assessing species status may be debated.

Acknowledgments

We thank N. Jackson for providing the R code for generating the phrapl results of Figure 4, and for clarifying the simulation design of Jackson et al. (2017). We thank F. Burbrink, B. Carstens, K. de Queiroz, N. Jackson, P. Kornilios, J. McGuire, and J. Mallet for their helpful comments. We are grateful to three anonymous reviewers for their constructive criticisms.

Supplementary Material

Data available from the Dryad Data Repository: http://dx.doi.org/10.5061/dryad.t66gq81.

Funding

This work was supported by the National Science Foundation [1456098 to A.D.L.]; the Natural Science Foundation of China [31671370, 31301093, 11201224, and 11301294 to T.Z.]; the Youth Innovation Promotion Association of the Chinese Academy of Sciences [2015080]; the Biotechnology and Biological Sciences Research Council [BB/P006493/1 to Z.Y.]; and in part by the Radcliffe Institute for Advanced Study at Harvard University.

References

  1. Burgess R., Yang Z. 2008. Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Mol. Biol. Evol. 25:1979–1994.
  2. Dalquen D., Zhu T., Yang Z. 2017. Maximum likelihood implementation of an isolation-with-migration model for three species. Syst. Biol. 66:379–398.
  3. Ebersberger I., Galgoczy P., Taudien S., Taenzer S., Platzer M., von Haeseler A. 2007. Mapping human genetic ancestry. Mol. Biol. Evol. 24:2266–2276.
  4. Edwards S.V. 2009. Is a new and general theory of molecular systematics emerging? Evolution 63:1–19.
  5. Etienne R.S., Morlon H., Lambert A. 2014. Estimating the duration of speciation from phylogenies. Evolution 68:2430–2440.
  6. Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.
  7. Hasegawa M., Kishino H., Yano T. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.
  8. Hebert P.D., Stoeckle M.Y., Zemlak T.S., Francis C.M. 2004. Identification of birds through DNA barcodes. PLoS Biol. 2:1657–1663.
  9. Heled J., Drummond A.J. 2010. Bayesian inference of species trees from multilocus data. Mol. Biol. Evol. 27:570–580.
  10. Hey J. 2010. Isolation with migration models for more than two populations. Mol. Biol. Evol. 27:905–920.
  11. Hobolth A., Andersen L., Mailund T. 2011. On computing the coalescence time density in an isolation-with-migration model with few samples. Genetics 187:1241–1243.
  12. Jackson N., Carstens B., Morales A., O’Meara B.C. 2017. Species delimitation with gene flow. Syst. Biol. 66:799–812.
  13. Jukes T., Cantor C. 1969. Evolution of protein molecules. In: Munro H., editor. Mammalian protein metabolism. New York: Academic Press, p. 21–123.
  14. Leaché A.D., Harris R.B., Rannala B., Yang Z. 2013. The influence of gene flow on species tree estimation: a simulation study. Syst. Biol. 63:17–30.
  15. Maddison W. 1997. Gene trees in species trees. Syst. Biol. 46:523–536.
  16. Mailund T., Dutheil J.Y., Hobolth A., Lunter G., Schierup M.H. 2012. Estimating divergence time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model. PLoS Genet. 7:e1001319.
  17. Mallet J. 2013. Concepts of species. In: Levin S., editor. Encyclopedia of biodiversity, Vol. 6. MA, USA: Academic Press, p. 679–691.
  18. Moritz C., Cicero C. 2004. DNA barcoding: promise and pitfalls. PLoS Biol. 2:e354.
  19. Nadeau N.J., Whibley A., Jones R.T., Davey J.W., Dasmahapatra K.K., Baxter S.W., Quail M.A., Joron M., Ffrench-Constant R.H., Blaxter M.L., Mallet J., Jiggins C.D. 2012. Genomic islands of divergence in hybridizing Heliconius butterflies identified by large-scale targeted sequencing. Philos. Trans. R. Soc. Lond. B Biol. Sci. 367:343–353.
  20. Nichols R. 2001. Gene trees and species trees are not the same. Trends Ecol. Evol. 16:358–364.
  21. O’Hagan A., Forster J. 2004. Kendall’s advanced theory of statistics: Bayesian inference. London, England: Arnold.
  22. Orr H.A., Turelli M. 2001. The evolution of postzygotic isolation: accumulating Dobzhansky-Muller incompatibilities. Evolution 55:1085–1094.
  23. Pinho C., Hey J. 2010. Divergence with gene flow: models and data. Ann. Rev. Ecol. Evol. Syst. 41:215–230.
  24. Rannala B., Yang Z. 1996. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J. Mol. Evol. 43:304–311.
  25. Rannala B., Yang Z. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164:1645–1656.
  26. Rannala B., Yang Z. 2013. Improved reversible jump algorithms for Bayesian species delimitation. Genetics 194:245–253.
  27. Rannala B.H. 2015. The art and science of species delimitation. Curr. Zool. 61:846–853.
  28. Rosenblum E., Sarver B., Brown J., Des Roches S., Hardwick K., Hether T., Eastman J., Pennell M., Harmon L. 2012. Goldilocks meets Santa Rosalia: an ephemeral speciation model explains patterns of diversification across time scales. Evolut. Biol. 39:255–261.
  29. Shi C., Yang Z. 2018. Coalescent-based analyses of genomic sequence data provide a robust resolution of phylogenetic relationships among major groups of gibbons. Mol. Biol. Evol. 35:159–179.
  30. Sukumaran J., Knowles L. 2017. Multispecies coalescent delimits structure, not species. Proc. Natl. Acad. Sci. USA 114:1607–1612.
  31. Swanson W., Vacquier V. 2002. The rapid evolution of reproductive proteins. Nature Rev. Genet. 3:137–144.
  32. Takahata N., Satta Y., Klein J. 1995. Divergence time and population size in the lineage leading to modern humans. Theor. Popul. Biol. 48:198–221.
  33. White H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50:1–25.
  34. Xu B., Yang Z. 2016. Challenges in species tree estimation under the multispecies coalescent model. Genetics 204:1353–1368.
  35. Yang Z. 2002. Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci. Genetics 162:1811–1823.
  36. Yang Z. 2014. Molecular evolution: a statistical approach. Oxford, England: Oxford University Press.
  37. Yang Z. 2015. The BPP program for species tree estimation and species delimitation. Curr. Zool. 61:854–865.
  38. Yang Z., Rannala B. 2010. Bayesian species delimitation using multilocus sequence data. Proc. Natl. Acad. Sci. USA 107:9264–9269.
  39. Yang Z., Rannala B. 2014. Unguided species delimitation using DNA sequence data from multiple loci. Mol. Biol. Evol. 31:3125–3135.
  40. Yang Z., Rannala B. 2017. Bayesian species identification under the multispecies coalescent provides significant improvements to DNA barcoding analyses. Mol. Ecol. 26:3028–3036.
  41. Yang Z., Zhu T. 2018. Bayesian selection of misspecified models is overconfident and causes spurious posterior probabilities for phylogenetic trees. Proc. Natl. Acad. Sci. USA 115:1854–1859.
  42. Zhang C., Zhang D.-X., Zhu T., Yang Z. 2011. Evaluation of a Bayesian coalescent method of species delimitation. Syst. Biol. 60:747–761.
  43. Zhu T., Yang Z. 2012. Maximum likelihood implementation of an isolation-with-migration model with three species for testing speciation with gene flow. Mol. Biol. Evol. 29:3131–3142.

Articles from Systematic Biology are provided here courtesy of Oxford University Press