Abstract
Genomic sequence data provide a rich source of information about the history of species divergence and interspecific hybridization or introgression. Despite recent advances in genomics and statistical methods, it remains challenging to infer gene flow, and as a result, one may have to estimate introgression rates and times under misspecified models. Here we use mathematical analysis and computer simulation to examine estimation bias and issues of interpretation when the model of gene flow is misspecified in analysis of genomic datasets, for example, if introgression is assigned to the wrong lineages. In the case of two species, we establish a correspondence between the migration rate in the continuous migration model and the introgression probability in the introgression model. When gene flow occurs continuously through time but in the analysis is assumed to occur at a fixed time point, common evolutionary parameters such as species divergence times are surprisingly well estimated. However, the time of introgression tends to be estimated towards the recent end of the period of continuous gene flow. When introgression events are assigned incorrectly to the parental or daughter lineages, introgression times tend to collapse onto species divergence times, with introgression probabilities underestimated. Overall, our analyses suggest that the simple introgression model is useful for extracting information concerning between-species gene flow and divergence even when the model may be misspecified. However, for reliable inference of gene flow it is important to include multiple samples per species, in particular, from hybridizing species.
Keywords: gene flow, model misspecification, multispecies coalescent, introgression, Bayesian phylogenetics and phylogeography (BPP), species tree
Introduction
Hybridization can enhance variation in recipient species, and has long been recognized as an important process in plants that can stimulate the origin of new species (e.g., Anderson 1949; Mallet 2007). Analyses of genomic data in the past decade have highlighted the prevalence of hybridization or introgression in animals as well, including bears (Liu et al. 2014; Kumar et al. 2017), birds (Ellegren et al. 2012), and butterflies (Martin et al. 2013). Between-species gene flow may involve either sister or non-sister species and may play an important role in ecological adaptation (Mallet et al. 2016; Martin and Jiggins 2017). Gene flow can be a major contributor of genealogical variation across the genome and gene tree-species tree discordance, in addition to ancestral polymorphism or delayed coalescence (Maddison 1997; Nichols 2001).
There is a long history of studies in population genetics of models of population subdivision and migration (Wright 1943; Malecot 1948; Slatkin 1987), and a number of methods have been developed to estimate the migration rate between populations (Beerli and Felsenstein 1999, 2001; Bahlo and Griffiths 2000). An important limitation of models of population subdivision, when applied to data from different species or subspecies, is that they do not account for the divergence history of the populations or species. Introducing a population/species phylogeny into models of population subdivision not only improves the realism of the model but also opens up opportunities for addressing a number of interesting questions in evolutionary biology, such as estimating species divergence times and ancestral population sizes, delineating species boundaries, and inferring the direction, rate, and timing of gene flow (Jiao et al. 2021).
Two classes of models of gene flow have been developed that accommodate the phylogeny of the species, both of which are extensions of the multispecies coalescent (MSC) model (Rannala and Yang 2003). The first is the MSC-with-migration model (MSC-M, also known as the isolation-with-migration or IM model; Hey and Nielsen 2004; Hey 2010; Zhu and Yang 2012; Dalquen et al. 2017; Hey et al. 2018), which assumes that two species exchange migrants at a certain rate over an extended time period. The rate of gene flow from species A to B is measured by the proportion of migrants ($m$) in the receiving population B or by the population migration rate, $M = N_B m$, the expected number of immigrants from A to B per generation, where $N_B$ is the (effective) population size of species B. We note that the isolation-with-initial-migration (IIM) model of Costa and Wilkinson-Herbots (2017), which assumes that gene flow occurs initially after species divergence but stops after a period of time when reproductive isolation has been fully established, is an instance of the MSC-M or IM model (see below). The second class of models of gene flow is the MSC-with-introgression (MSci) model (Flouri et al. 2020), also known as the multispecies network coalescent model (MSNC; Wen and Nakhleh 2018; Zhang et al. 2018), which assumes that gene flow occurs at fixed time points in the past. The rate of gene flow is measured by the introgression probability ($\varphi$ or $\gamma$), which is the proportion of successful immigrants in the population at the time of introgression.
In the real world, introgressed alleles may be removed by natural selection because they are involved in hybrid incompatibility and are deleterious in the genetic background of the recipient population (Dobzhansky 1937; Muller 1942) or because they are linked to such loci (Petry 1983; Barton and Bengtsson 1986; Uecker et al. 2015). Thus the rate of gene flow (M in MSC-M or $\varphi$ in MSci), when those models are used to analyze genomic sequence data, reflects the long-term effects of selection and drift as well as hybridization or introgression (Martin and Jiggins 2017). Such an effective rate of gene flow may be expected to vary across the genome, influenced by the presence of loci in the genomic region important in ecological adaptation and by the local recombination rate (Bürger and Akerman 2011; Aeschbacher and Bürger 2014; Akerman and Bürger 2014; Schumer et al. 2018; Edelman et al. 2019; Martin et al. 2019). The rate may also vary over time, depending on geological or ecological events that cause changes in the ecology and distribution of the species and in the chance for two species to exchange genes. One can envisage models of gene flow in which the rate varies over time and across genomic regions. At present, such extended models are not implemented in the MSC framework, and the feasibility of fitting such parameter-rich models to genomic datasets is unexplored. MSC-M and MSci models implemented to date (Dalquen et al. 2017; Hey et al. 2018; Wen and Nakhleh 2018; Zhang et al. 2018; Flouri et al. 2020) assume constant rates, and should be considered first approximations when applied to genomic sequence data.
In this paper, we use mathematical analysis and computer simulation to examine the impact of model misspecification on estimation of parameters under the MSci model, such as species divergence and introgression times, population sizes, and introgression probabilities. We use the Bayesian program Bayesian phylogenetics and phylogeography (BPP) (Flouri et al. 2018, 2020) to analyze multilocus sequence data simulated under various MSci and MSC-M models. Although Bpp is our own implementation of the MSci model, our results should apply to similar exact or likelihood methods (Wen and Nakhleh 2018; Zhang et al. 2018). Our results may also apply to approximate methods, which use summaries of the data such as the genome-wide site-pattern counts (as in the D-statistic, Green et al. 2010 and Hyde, Meng and Kubatko 2009; Blischak et al. 2018), reconstructed gene trees (as in Snaq, Solis-Lemus and Ane 2016), or other summary statistics used in Approximate Bayesian Computation (Dittberner et al. 2022). However, approximate methods do not make a full use of information in the data and may not identify all parameters in the model. For example, the D-statistic is agnostic of the mode of gene flow (migration versus introgression) and cannot be applied to data sampled from only two species or populations. The computational strengths and statistical weaknesses of approximate methods have been discussed by a number of authors (Degnan 2018; Elworth et al. 2019; Jiao et al. 2021; Zhu and Yang 2021; Hibbins and Hahn 2022; Ji et al. 2022). In contrast, likelihood methods integrate over all possible gene trees underlying the sequence alignments, making use of all information about the model and parameters in the sequence data. They typically involve a heavy computational load. However, recent algorithmic improvements have made it possible to apply the MSci model to genome-scale datasets with more than 10,000 loci (Flouri et al. 2020). 
Inferring introgression events or constructing an introgression model using genomic sequence data, however, remains a challenging task, even when a binary species tree is specified, onto which introgression events can be added (Ji et al. 2022; Thawornwattana et al. 2022); see Discussion for an overview of currently available methods for inferring gene flow on a species phylogeny. For these and many other reasons, the model of gene flow assumed in our data analysis may often be incorrect. An important question is to what extent inference of gene flow and estimation of the timing and rate of gene flow can still be achieved when the model of gene flow is misspecified. The impact of model misspecification on estimation of other evolutionary parameters such as species divergence times is also of major concern.
Although there are many ways in which the assumed model may be wrong, we are particularly interested in a few types of misspecification that are likely to arise in real data analyses (Finger et al. 2022; Thawornwattana et al. 2022). First, gene flow may be occurring continuously during a time period but an MSci model is fitted to the genomic data, which assumes that gene flow occurred at a particular time point (e.g., Wen and Nakhleh 2018; Jiao et al. 2020). We are here interested in whether species divergence times and ancestral population sizes are affected by the misspecification, and how the migration rate in the migration model (M) corresponds to the introgression probability in the MSci model ($\varphi$). The case of two species is analytically tractable. We study the limit of the maximum-likelihood estimates (MLEs) of the introgression probability and introgression time as the data size (the number of loci) approaches infinity, for data generated under the MSC-M model. We use computer simulation to verify and extend the analytical calculation.
Second, the introgression event may be assigned to a wrong branch on the species tree, for example, to a parental or daughter branch of the genuine introgression lineage. Alternatively, introgression may involve species that have since gone extinct or are not included in the data sample. The presence of such ghost species is known to mislead inference of the history of gene flow for the sampled species (Beerli 2004; Tricou et al. 2022). Thus we conducted simulation to examine the impact of unsampled species on the inference of gene flow. In general, our results demonstrate the usefulness of the simple introgression model in inferring gene flow using genomic sequence data.
Results
Correspondence between the MSC-M and MSci Models in the Case of two Species
Notation and Definition of Parameters
Following Jiao et al. (2020), we study the asymptotic behavior of Bayesian parameter estimation under the introgression (MSci) model when the data are generated under the migration (IM) model in the case of two species, with one sequence per species per locus (fig. 1). We focus on this simple case because it is analytically tractable. Note that our Bayesian implementation in Bpp (Flouri et al. 2020) accommodates an arbitrary number of species and an arbitrary number of sequences per species per locus, and the likelihood calculation averages over the gene genealogy for the sequences at each locus. We assume an infinite number of loci, and the data at each locus consist of a pair of sequences ($a$, $b$) from the two species, with x differences at n sites. The coalescent time t for the locus is unknown and underlies the observed difference. Jiao et al. (2020) analyzed the MSC-M model (fig. 1a) assuming infinite sequence length ($n = \infty$) so that the true coalescent time between the two sequences (t) is known. Here we accommodate random fluctuations in the number of mutations due to finite sequence length and consider three variants of the migration model.
In the basic IM model (fig. 1a), species A and B diverged at time $\tau_R$ and there has since been gene flow from A to B at the rate of $M = N_B m$ migrants per generation. The IIM model (fig. 1b) assumes that migration occurred initially after species divergence but stopped at time $\tau_T$ (Costa and Wilkinson-Herbots 2017), and is represented by an MSC-M model for three species including a ghost species. Here the ghost does not necessarily represent a real species but is a mathematical device for specifying the IIM model. The IIM model becomes identical to the IM model when $\tau_T = 0$. We also consider a secondary contact (SC) model (fig. 1c), in which the two species were initially completely isolated but came into contact at a certain time point ($\tau_T$), with ongoing gene flow at the rate of $M$ ever since (Costa and Wilkinson-Herbots 2021). This is similarly specified using a ghost species at time point $\tau_T$ (fig. 1c). The migration model involves three types of parameters: species divergence times ($\tau$), population sizes for extant and extinct species ($\theta$), and the (population) migration rate $M$. The population size parameter for any species with (effective) population size N is defined as $\theta = 4N\mu$, where $\mu$ is the mutation rate per site per generation. We refer to a branch on the species tree by its daughter node so that branch RA is also branch A, with population size parameter $\theta_A$. Both divergence times ($\tau$) and population sizes ($\theta$) are measured by the expected number of mutations per site.
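The parameter scaling described above can be made concrete with a short Python sketch. It assumes the standard definitions $\theta = 4N\mu$ and $M = N_B m$; the function names and all numerical values are hypothetical and chosen only for illustration, not taken from figure 1.

```python
def theta(N, mu):
    """Population size parameter theta = 4*N*mu (mutations per site)."""
    return 4.0 * N * mu

def generations(tau, mu):
    """Convert a time tau (expected mutations per site) into generations."""
    return tau / mu

def scaled_migration_rate(M, theta_B):
    """Mutation-scaled migration rate 4*M/theta_B, which equals m/mu."""
    return 4.0 * M / theta_B

# Hypothetical values, for illustration only.
mu = 1e-8        # mutation rate per site per generation
N_B = 250_000    # effective population size of the recipient species B
m = 4e-7         # proportion of migrants in B per generation

theta_B = theta(N_B, mu)                   # population size parameter of B
M = N_B * m                                # expected migrants per generation
w = scaled_migration_rate(M, theta_B)      # same quantity as m / mu
gens = generations(0.02, mu)               # tau = 0.02 expressed in generations
print(theta_B, M, w, gens)
```

Because both $\tau$ and $\theta$ are on the mutation scale, only ratios such as $4M/\theta_B$ enter the calculations, so an absolute mutation rate is needed only when converting back to generations or individuals.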
Asymptotic Theory
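We first consider the IIM model of figure 1b, of which the IM model of figure 1a is a special case with $\tau_T = 0$. The backwards-in-time process of coalescent and migration in the time interval $(\tau_T, \tau_R)$ is described by a Markov chain with three states: AB, AA, and A (Notohara 1990). Here AB is the initial state, with two sequences in the sample, one in A and another in B; AA means both sequences are in A (in other words, sequence b is traced back into A); and A means one sequence in A (in other words, sequence b is traced back into A and has coalesced with sequence a). Note that in the Markov chain, time runs backwards, so the transition from AB to AA means migration of a sequence from A to B in the real world. The generator matrix for the Markov chain is (see, e.g., Notohara 1990; Jiao et al. 2020)
$$Q = \begin{pmatrix} -w & w & 0 \\ 0 & -c & c \\ 0 & 0 & 0 \end{pmatrix}, \tag{1}$$
where $w = 4M/\theta_B = m/\mu$ is the mutation-scaled migration rate, and $c = 2/\theta_A$ is the coalescent rate in population A, with one time unit being the expected time taken to accumulate one mutation per site. Q has eigenvalues $0$, $-w$, and $-c$.
Let the transition probability matrix over time t be $P(t) = \{p_{ij}(t)\} = e^{Qt}$, where $p_{ij}(t)$ is the probability that the Markov chain will be in state j time t later given that it is in state i at time 0. This is
$$P(t) = \begin{pmatrix} e^{-wt} & \dfrac{w}{w-c}\left(e^{-ct} - e^{-wt}\right) & 1 - \dfrac{w e^{-ct} - c e^{-wt}}{w-c} \\ 0 & e^{-ct} & 1 - e^{-ct} \\ 0 & 0 & 1 \end{pmatrix}. \tag{2}$$
The probability density of coalescent time t is thus
$$f_{\mathrm{iim}}(t) = \begin{cases} 0, & t < \tau_T, \\ p_{AB,AA}(t - \tau_T)\, c = \dfrac{wc}{w-c}\left(e^{-c(t-\tau_T)} - e^{-w(t-\tau_T)}\right), & \tau_T < t < \tau_R, \\ \left[p_{AB,AB}(\tau_R - \tau_T) + p_{AB,AA}(\tau_R - \tau_T)\right] \dfrac{2}{\theta_R}\, e^{-\frac{2}{\theta_R}(t - \tau_R)}, & t > \tau_R. \end{cases} \tag{3}$$
This is a function of $w = 4M/\theta_B$ but not of $M$ and $\theta_B$ individually. The parameters specifying the density are thus $\Theta_{\mathrm{iim}} = (\tau_R, \tau_T, \theta_A, \theta_R, w)$. Note that the density under the IM model is given by $f_{\mathrm{iim}}(t)$ with $\tau_T = 0$ (fig. 1b and c).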
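As a numerical check on the three-state Markov chain of equation (1), the following Python sketch compares a closed-form transition matrix with a generic approximation of $e^{Qt}$. The symbols `w` (mutation-scaled migration rate, $4M/\theta_B$) and `c` (coalescent rate, $2/\theta_A$) are naming assumptions, and the parameter values are hypothetical. The scaling-and-squaring routine is a simple stand-in for a library matrix exponential.

```python
import math

def generator(w, c):
    """Generator Q for states (AB, AA, A); time runs backwards."""
    return [[-w, w, 0.0],
            [0.0, -c, c],
            [0.0, 0.0, 0.0]]

def transition_closed_form(w, c, t):
    """P(t) = exp(Qt) written out explicitly (requires w != c)."""
    stay = math.exp(-w * t)                                       # AB -> AB
    moved = w / (w - c) * (math.exp(-c * t) - math.exp(-w * t))   # AB -> AA
    return [[stay, moved, 1.0 - stay - moved],
            [0.0, math.exp(-c * t), 1.0 - math.exp(-c * t)],
            [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transition_numeric(w, c, t, squarings=20):
    """Approximate exp(Qt) as (I + Qt/2^k)^(2^k) by repeated squaring."""
    q = generator(w, c)
    h = t / 2 ** squarings
    p = [[(1.0 if i == j else 0.0) + q[i][j] * h for j in range(3)]
         for i in range(3)]
    for _ in range(squarings):
        p = matmul(p, p)
    return p

# Hypothetical values: theta_A = theta_B = 0.01 and M = 0.1.
theta_A, theta_B, M = 0.01, 0.01, 0.1
w, c = 4.0 * M / theta_B, 2.0 / theta_A
P_exact = transition_closed_form(w, c, 0.01)
P_approx = transition_numeric(w, c, 0.01)
max_diff = max(abs(P_exact[i][j] - P_approx[i][j])
               for i in range(3) for j in range(3))
print(max_diff)
```

The entry in row AB, column AA, multiplied by `c`, is the rate at which the two sequences coalesce in A, which is how the chain enters the density of the coalescent time.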
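Similarly under the secondary-contact (SC) model (fig. 1c), the coalescent-with-migration process over the time interval (0, $\tau_T$) is described by the Markov chain of equation (1). Given the parameters $\Theta_{\mathrm{sc}} = (\tau_R, \tau_T, \theta_A, \theta_R, w)$, the probability density of coalescent time t is
$$f_{\mathrm{sc}}(t) = \begin{cases} p_{AB,AA}(t)\, c, & 0 < t < \tau_T, \\ p_{AB,AA}(\tau_T)\, c\, e^{-c(t - \tau_T)}, & \tau_T < t < \tau_R, \\ \left[p_{AB,AB}(\tau_T) + p_{AB,AA}(\tau_T)\, e^{-c(\tau_R - \tau_T)}\right] \dfrac{2}{\theta_R}\, e^{-\frac{2}{\theta_R}(t - \tau_R)}, & t > \tau_R. \end{cases} \tag{4}$$
Under the MSci model (fig. 1d), we have (e.g., Jiao et al. 2020)
$$f_{\mathrm{i}}(t) = \begin{cases} 0, & t < \tau_S, \\ \varphi\, \dfrac{2}{\theta_S}\, e^{-\frac{2}{\theta_S}(t - \tau_S)}, & \tau_S < t < \tau_R, \\ \left[(1 - \varphi) + \varphi\, e^{-\frac{2}{\theta_S}(\tau_R - \tau_S)}\right] \dfrac{2}{\theta_R}\, e^{-\frac{2}{\theta_R}(t - \tau_R)}, & t > \tau_R. \end{cases} \tag{5}$$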
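The coalescent-time densities under the IIM, SC, and MSci models can be sketched in Python and checked by confirming that each integrates to one. This is an illustration under stated assumptions: the densities are derived here from the three-state Markov chain and the model descriptions, the symbols `tau_T`, `tau_S`, `w`, and `phi` are naming assumptions, and the parameter values are hypothetical rather than those of figure 1.

```python
import math

def p_from_ab(w, c, t):
    """Return (P[AB->AB], P[AB->AA]) over time t for the chain (w != c)."""
    stay = math.exp(-w * t)
    moved = w / (w - c) * (math.exp(-c * t) - math.exp(-w * t))
    return stay, moved

def f_iim(t, tau_R, tau_T, theta_A, theta_R, w):
    """IIM density: migration during (tau_T, tau_R); tau_T = 0 gives IM."""
    c, r = 2.0 / theta_A, 2.0 / theta_R
    if t < tau_T:
        return 0.0
    if t < tau_R:
        return p_from_ab(w, c, t - tau_T)[1] * c
    stay, moved = p_from_ab(w, c, tau_R - tau_T)
    return (stay + moved) * r * math.exp(-r * (t - tau_R))

def f_sc(t, tau_R, tau_T, theta_A, theta_R, w):
    """SC density: migration during (0, tau_T), isolation before that."""
    c, r = 2.0 / theta_A, 2.0 / theta_R
    if t < tau_T:
        return p_from_ab(w, c, t)[1] * c
    stay, moved = p_from_ab(w, c, tau_T)
    if t < tau_R:
        return moved * c * math.exp(-c * (t - tau_T))
    uncoalesced = stay + moved * math.exp(-c * (tau_R - tau_T))
    return uncoalesced * r * math.exp(-r * (t - tau_R))

def f_msci(t, tau_R, tau_S, theta_S, theta_R, phi):
    """MSci density: one introgression pulse at tau_S with probability phi."""
    s, r = 2.0 / theta_S, 2.0 / theta_R
    if t < tau_S:
        return 0.0
    if t < tau_R:
        return phi * s * math.exp(-s * (t - tau_S))
    uncoalesced = (1.0 - phi) + phi * math.exp(-s * (tau_R - tau_S))
    return uncoalesced * r * math.exp(-r * (t - tau_R))

def integrate(f, breaks, steps=2000):
    """Midpoint rule applied separately between consecutive break points."""
    total = 0.0
    for lo, hi in zip(breaks, breaks[1:]):
        h = (hi - lo) / steps
        total += h * sum(f(lo + (k + 0.5) * h) for k in range(steps))
    return total

# Hypothetical parameter values, for illustration only.
tau_R, tau_T, theta, w, phi = 0.02, 0.01, 0.01, 40.0, 0.2
breaks = [0.0, tau_T, tau_R, tau_R + 0.3]
area_iim = integrate(lambda t: f_iim(t, tau_R, tau_T, theta, theta, w), breaks)
area_im = integrate(lambda t: f_iim(t, tau_R, 0.0, theta, theta, w), breaks)
area_sc = integrate(lambda t: f_sc(t, tau_R, tau_T, theta, theta, w), breaks)
area_msci = integrate(lambda t: f_msci(t, tau_R, tau_T, theta, theta, phi), breaks)
print(area_iim, area_im, area_sc, area_msci)
```

Splitting the integration at the break points avoids integrating across the discontinuities in the densities, which is where a single-interval quadrature would lose accuracy.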
This is a function of parameters $\Theta_{\mathrm{i}} = (\tau_R, \tau_S, \theta_S, \theta_R, \varphi)$. Given the coalescent time t for a locus, the probability of observing x differences at n sites under the JC mutation model (Jukes and Cantor 1969) is given by the binomial probability
$$f(x \mid t) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad p = \tfrac{3}{4}\left(1 - e^{-8t/3}\right). \tag{6}$$
The marginal probability of observing x differences at n sites, under both the migration (IM, IIM, SC) and introgression (MSci) models, is
$$f(x) = \int_0^\infty f(t)\, f(x \mid t)\, \mathrm{d}t, \tag{7}$$
where $f(t)$ is given by equations (3), (4), or (5).
For analytical tractability of the likelihood (eq. 7), we assume the infinite-sites mutation model instead of JC, and replace the binomial likelihood by a Poisson approximation
$$f(x \mid t) = \frac{(2nt)^x}{x!}\, e^{-2nt}. \tag{8}$$
Equation (7) is derived in SI text, as supplementary equation (S6), Supplementary material online for the IM (with $\tau_T = 0$) and IIM (with $\tau_T > 0$) models, supplementary equation (S7), Supplementary material online for the SC model, and supplementary equation (S9), Supplementary material online for the MSci model.
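The marginal probability of equation (7) can also be evaluated by brute-force numerical integration, which is a useful check on any closed-form expression. The Python sketch below does this for the MSci density with the Poisson likelihood and compares the JC binomial and Poisson probabilities at one value of t. The numerical integration is a substitute for the analytical results in the SI text, and all parameter values and function names are hypothetical.

```python
import math

def p_diff_jc(x, n, t):
    """Binomial probability of x differences at n sites under JC."""
    p = 0.75 * (1.0 - math.exp(-8.0 * t / 3.0))
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

def p_diff_poisson(x, n, t):
    """Infinite-sites (Poisson) approximation with mean 2*n*t, for t > 0."""
    lam = 2.0 * n * t
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def f_msci(t, tau_R, tau_S, theta_S, theta_R, phi):
    """Coalescent-time density under MSci with one sequence per species."""
    s, r = 2.0 / theta_S, 2.0 / theta_R
    if t < tau_S:
        return 0.0
    if t < tau_R:
        return phi * s * math.exp(-s * (t - tau_S))
    uncoalesced = (1.0 - phi) + phi * math.exp(-s * (tau_R - tau_S))
    return uncoalesced * r * math.exp(-r * (t - tau_R))

def marginal(x, n, density, breaks, p_diff, steps=400):
    """f(x) = integral of density(t) * p_diff(x | t) dt, by the midpoint rule."""
    total = 0.0
    for lo, hi in zip(breaks, breaks[1:]):
        h = (hi - lo) / steps
        for k in range(steps):
            t = lo + (k + 0.5) * h
            total += h * density(t) * p_diff(x, n, t)
    return total

# Hypothetical parameter values, for illustration only.
tau_R, tau_S, theta_S, theta_R, phi, n = 0.02, 0.005, 0.01, 0.01, 0.2, 200

def density(t):
    return f_msci(t, tau_R, tau_S, theta_S, theta_R, phi)

# The density is zero below tau_S, so integration starts there.
breaks = [tau_S, tau_R, tau_R + 0.1]
f_x = [marginal(x, n, density, breaks, p_diff_poisson) for x in range(150)]
print(sum(f_x))
```

Summing `f_x` over x should give a value close to one, up to the truncation of t and x and the quadrature error.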
Suppose the data are generated under the migration model (IM, IIM, or SC) and analyzed under the MSci model. When the number of loci $L \to \infty$, the MLE under MSci, $\hat\Theta_{\mathrm{i}}$, will converge to $\Theta_{\mathrm{i}}^*$, which minimizes the Kullback–Leibler (KL) divergence
$$D(\Theta_{\mathrm{m}} \,\|\, \Theta_{\mathrm{i}}) = \sum_{x=0}^{\infty} f_{\mathrm{m}}(x; \Theta_{\mathrm{m}}) \log \frac{f_{\mathrm{m}}(x; \Theta_{\mathrm{m}})}{f_{\mathrm{i}}(x; \Theta_{\mathrm{i}})}, \tag{9}$$
where the subscript “m” stands for any of the three MSC-M models (“im” for IM, “iim” for IIM, or “sc” for SC, fig. 1a–c). The KL divergence is a measure of distance from the fitting introgression model to the true migration model: here $\Theta_{\mathrm{m}}$ are fixed, whereas $\Theta_{\mathrm{i}}$ are being estimated. The limiting values $\Theta_{\mathrm{i}}^*$ as $L \to \infty$ are also known as the pseudo-true parameter values for the misspecified MSci model. The BFGS optimization routine in Paml (Yang 2007) is used to minimize equation (9) to obtain the MLEs.
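The idea of pseudo-true parameter values can be illustrated with a small Python sketch that minimizes the KL divergence from an IM model to an MSci model. Several simplifications are assumptions of this sketch and not the procedure used in the paper: a coarse grid search replaces BFGS, only the introgression probability and introgression time are varied while the other MSci parameters are fixed, the Poisson likelihood is used, and the distributions are renormalized after truncating t and x. All symbols and parameter values are hypothetical.

```python
import math

def f_im(t, tau_R, theta_A, theta_R, w):
    """Coalescent-time density under IM (migration over (0, tau_R)); w != c."""
    c, r = 2.0 / theta_A, 2.0 / theta_R
    if t < tau_R:
        return w * c / (w - c) * (math.exp(-c * t) - math.exp(-w * t))
    stay = math.exp(-w * tau_R)
    moved = w / (w - c) * (math.exp(-c * tau_R) - math.exp(-w * tau_R))
    return (stay + moved) * r * math.exp(-r * (t - tau_R))

def f_msci(t, tau_R, tau_S, theta_S, theta_R, phi):
    """Coalescent-time density under MSci with one sequence per species."""
    s, r = 2.0 / theta_S, 2.0 / theta_R
    if t < tau_S:
        return 0.0
    if t < tau_R:
        return phi * s * math.exp(-s * (t - tau_S))
    left = (1.0 - phi) + phi * math.exp(-s * (tau_R - tau_S))
    return left * r * math.exp(-r * (t - tau_R))

# Shared midpoint grid for t; every tau used below sits on a cell edge.
H, CELLS, N_SITES, X_MAX = 0.0002, 350, 200, 60
T_GRID = [(k + 0.5) * H for k in range(CELLS)]
POISSON = [[math.exp(x * math.log(2 * N_SITES * t) - 2 * N_SITES * t
                     - math.lgamma(x + 1)) for t in T_GRID]
           for x in range(X_MAX + 1)]

def pmf(density):
    """Distribution of x, renormalized after truncating t and x."""
    dens = [density(t) for t in T_GRID]
    raw = [H * sum(d * p for d, p in zip(dens, row)) for row in POISSON]
    total = sum(raw)
    return [v / total for v in raw]

def kl(p, q):
    """Kullback-Leibler divergence from the fitting pmf q to the true pmf p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

PHI_GRID = [0.04 * k for k in range(1, 16)]
TAU_GRID = [0.001, 0.002, 0.004, 0.008, 0.012, 0.016]

def fit_msci(target, tau_R, theta_S, theta_R):
    """Grid search returning (KL, phi, tau_S) with the smallest KL."""
    best = None
    for phi in PHI_GRID:
        for tau_S in TAU_GRID:
            q = pmf(lambda t: f_msci(t, tau_R, tau_S, theta_S, theta_R, phi))
            cand = (kl(target, q), phi, tau_S)
            if best is None or cand < best:
                best = cand
    return best

# Hypothetical true IM parameters, for illustration only.
tau_R, theta, w = 0.02, 0.01, 40.0
p_im = pmf(lambda t: f_im(t, tau_R, theta, theta, w))
best_kl, best_phi, best_tau_S = fit_msci(p_im, tau_R, theta, theta)
print(best_kl, best_phi, best_tau_S)
```

Precomputing the Poisson probabilities on a shared grid keeps the search fast, at the cost of restricting candidate introgression times to grid-cell edges. A gradient-based optimizer over all five MSci parameters would be the natural refinement.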
We are in particular interested in the introgression probability $\varphi$ and the introgression time $\tau_S$. Note that under the migration model, the probability that any lineage from species B traces back to A is
$$\varphi_0 = 1 - e^{-4M\Delta\tau/\theta_B}, \tag{10}$$
where $\Delta\tau$ is the time period of gene flow (fig. 1a–c). Equation (10) gives the expected proportion of migrants under the true migration model. When $M$ is small, $\varphi_0 \approx 4M\Delta\tau/\theta_B$, which is also given by equating the expected total number of migrants under the two models: $\varphi_0 N_B = M\,\Delta\tau/\mu$. Note that $M$ is the expected number of migrants per generation and $\Delta\tau/\mu$ is the number of generations with gene flow.
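The relationship between the migration rate and the expected proportion of migrants in equation (10) can be explored with a few lines of Python. The sketch assumes the form $1 - e^{-4M\Delta\tau/\theta_B}$ for equation (10) and its small-rate linear approximation; the function names and the parameter values are hypothetical.

```python
import math

def phi0(M, theta_B, delta_tau):
    """Probability that a lineage from B traces back to A during delta_tau."""
    return 1.0 - math.exp(-4.0 * M * delta_tau / theta_B)

def phi0_linear(M, theta_B, delta_tau):
    """Small-rate approximation, obtained by matching total migrant numbers."""
    return 4.0 * M * delta_tau / theta_B

# Hypothetical values: theta_B = 0.01 and gene flow lasting delta_tau = 0.02.
theta_B, delta_tau = 0.01, 0.02
for M in (0.001, 0.01, 0.1, 1.0):
    print(M, phi0(M, theta_B, delta_tau), phi0_linear(M, theta_B, delta_tau))
```

The two functions agree closely at low rates, while at high rates the exact probability saturates below one and the linear approximation does not.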
It may be noted that the theory of equation (9) can be used to study the limiting parameter estimates (when $L \to \infty$) in the migration model when the true model is the introgression model. One has only to swap the roles of $f_{\mathrm{m}}$ and $f_{\mathrm{i}}$ in equation (9). This is not pursued in this paper.
Asymptotic Results under the IM Model
We used the asymptotic theory (eq. 9) to obtain the MLEs () under the MSci model (fig. 1d) when the data consist of an infinite number of loci, with one sequence of length n per species per locus, generated under the IM, IIM, or SC models (fig. 1a–c). The true parameter values used () are shown in figure 1. The MLEs are shown in figure 2 and the true and best-fitting distributions of the coalescent time t are shown in supplementary figure S1, Supplementary Material online for the IM model. The corresponding results for the IIM and SC models are in supplementary figures S2–S5, Supplementary Material online, to be discussed in the next sections.
We use five methods (a–e) to fit the MSci model, with method d estimating all five parameters, whereas the others have some parameters fixed (fig. 2). We examined the effects of the sequence length (n) and the migration rate (). Note that five parameters are identifiable under the MSci model: (fig. 1d), and are unidentifiable as no coalescent events can occur in those populations given one sequence per species per locus. Population size is identifiable as it is possible for both sequences a and b to be traced back to population S. Nevertheless, one expects the information concerning to be weak in datasets of two sequences per locus. In methods c and d, and are estimated as free parameters. Application of the misspecified MSci model (to data generated under the IM model) led to unreasonably large estimates of (as large as 0.5 mutations per site), and the poor estimates of caused to be poorly estimated as well. This is due partly to our use of one sequence per species per locus and partly to the confounding effects between and . We discuss both effects later when we describe the simulation results. Here we focus on methods a, b, and e, in which is fixed at the true value (in methods a and b) or constrained to be equal to (in method e).
In the IM model, migration occurs throughout the time interval , at the rate of migrants per generation (fig. 1a). When such data are analyzed under the introgression model, a simple expectation might be that the introgression time should be the average , whereas the introgression probability might be given by the expected proportion of equation (10). However, as we show below, this expectation is too simplistic.
First, we discuss the introgression time $\tau_S$, assuming that the true coalescent time is known (or $n = \infty$). Given the data-generating IM model, there is a strictly positive probability for $t < \epsilon$ for any small constant $\epsilon > 0$ (supplementary fig. S1, Supplementary Material online). In other words, there must exist loci at which t is arbitrarily close to 0. In the MSci model, sequences a and b cannot coalesce until they are in the same population S, so that $t > \tau_S$. When the MSci model is fitted to data generated under the IM model, $\hat\tau_S$ is dominated by the minimum rather than the average coalescent time, and $\hat\tau_S \to 0$ when the number of loci $L \to \infty$ (and when the true coalescent time t is known). Even though migration occurs throughout the time interval $(0, \tau_R)$, the MSci model has to lump all migration events into one time point, $\tau_S$ (fig. 2b and e).
With finite sequences (), t is not observed and is reflected in the number of mutations (x). Whatever the true t, there is a positive probability of observing no mutations between the two sequences, so that an absence of mutations () is not strong evidence for . The MLE reflects not only the minimum coalescent time, but also the whole distribution (supplementary fig. S1, Supplementary Material online). Thus , different from the case where the coalescent time is known without error (). Nevertheless, one expects to be closer to 0 than to , especially if the number of sites is large. Indeed in our calculations, (fig. 2b and e).
Next we consider the introgression probability $\varphi$ and again focus on methods a, b, and e (fig. 2). The estimate $\hat\varphi$ increases nearly linearly with $M$ when $M$ is small (, say) but tails off at large $M$. All estimates are smaller than $\varphi_0$ of equation (10) but they are close at low rates (with and , say) (fig. 2a, b, and e). We defer to a later section a detailed discussion of the estimation of $\varphi$, contrasting the IM, IIM, and SC models.
Finally the estimated divergence time between the two species () matched the true values at low migration rates but was underestimated at high migration rates, with the ancestral population size overestimated (fig. 2). It may be tempting to interpret the underestimation of (and overestimation of ) by the MSci model as being due to the difficulty of distinguishing complete isolation with recent species divergence from introgression or of distinguishing migration and coalescent events close to species divergence from ancestral polymorphism. However, this does not appear to be a correct interpretation.
We examined the true and fitted distributions of the coalescent time (supplementary fig. S1, Supplementary Material online). If there is no migration (), the MSci model (with ) will be correct, and the parameter estimates will converge to the true values, with a perfect fit to the density . At low migration rates (, say), the MSci model fits the density very well, with the discontinuity point in the true and fitting distributions coinciding: . At the intermediate rate of , the species divergence time is still correctly estimated even though the fit to the density is poor (supplementary fig. S1, Supplementary Material online). At high rates (with , say), the true density has a mode in the interval , dropping off at . The best fitting density starts from 0, with an exponential decay, and has a discontinuity point at with again an exponential decay. This best-fitting density is a poor fit, and the discontinuity point is moved to smaller values as an attempt to accommodate the migration and coalescent events in the middle of the interval () to improve the fit (judged by the KL divergence). Thus is underestimated (). As a result, the population size parameter is overestimated, as those two parameters tend to be strongly negatively correlated (e.g., Burgess and Yang 2008). In other words, the intermediate coalescent times in the interval (0, ), which occur at a large proportion of loci, are accommodated or misinterpreted by the MSci model using a smaller and larger . Coalescent times in the range , which represent true migration events, are misinterpreted as coalescent events in species R, and is much less than (eq. 10).
Asymptotic Results under the IIM Model
When data are generated under the IIM model (fig. 1b) and analyzed under the MSci model (fig. 1d), the results (supplementary figs. S2 and S3, Supplementary Material online) show similar patterns to those under the IM model discussed above. Similarly, is difficult to estimate using two sequences per locus in methods c and d, and the poor estimates of affect the estimation of . Thus we focus on methods a, b, and e, in which is fixed or constrained, and on the introgression time and introgression probability.
In the IIM model, migration events occur throughout the time interval (fig. 1b), but the estimate of the introgression time is dominated or influenced by the minimum coalescent time, so that when , and when n is finite. In the latter case, is much closer to than to (supplementary fig. S2, Supplementary Material online).
The introgression probability $\hat\varphi$ grew almost linearly with $M$ when $M$ was small (with , say), and this estimate was close to the expectation $\varphi_0$ of equation (10) (supplementary fig. S2a, b, and e, Supplementary Material online). At high migration rates, equation (10) gave a serious overestimate. This “bias” in $\hat\varphi$ at high migration rates was accompanied by a reduction in and overestimation of . This can similarly be explained by the attempt of the MSci model to accommodate the coalescent times in the middle of the time interval () (supplementary fig. S3, Supplementary Material online).
Asymptotic Results under the SC Model
Under the SC model, there is initially complete isolation after species divergence but the two species come into contact at time , with ongoing gene flow ever since (fig. 1c). The best-fitting parameter values under the MSci model (), for data of two sequences per locus, are shown in supplementary figure S4, Supplementary Material online, with fitted densities of coalescent time t shown in supplementary figure S5, Supplementary Material online.
The results show patterns similar to those under the IM and IIM models discussed above. The species divergence time under the MSci model when the migration rate is small but drops at very high rates (with ). The introgression time is dominated by the minimum coalescent time, so that when , and is much closer to 0 than to when n is finite (supplementary fig. S4, Supplementary Material online). Note that in the true model migration occurs throughout the time interval .
The introgression probability $\hat\varphi$ grew almost linearly with the migration rate $M$ when $M$ was small (with , say), and was close to the expectation $\varphi_0$ (eq. 10) when (supplementary fig. S4a, b, and e, Supplementary Material online). At very high rates (), $\hat\varphi$ was much smaller than $\varphi_0$, and this “bias” was accompanied by an underestimation of and overestimation of . Similarly to the IM and IIM models discussed above, this is due to the attempt of the MSci model to accommodate the coalescent times in the middle of the interval () (supplementary fig. S5, Supplementary Material online).
The Amount of Gene Flow under the IM, IIM, and SC Models
Although the expected total amount of gene flow measured by (eq. 10) is the same under the IM, IIM, and SC models of figure 1a–c, the estimates under the MSci model differ, as summarized in figure 1e.
At low migration rates, , and in the MSci model are nearly accurately estimated to match those in the true model (figs. 2, supplementary figures S2 and S4, Supplementary Material online). Consider the case of infinitely long sequences with known coalescent time. Let , , and let the introgression time be for the IM and SC model, and for the IIM model. We also match the probability density of coalescent time t, with , for . With those simplifying assumptions, that minimizes the KL divergence (eq. 9) can be derived as
(11) |
At low migration rates, equation (11) provides accurate numerical results (methods a, b, e in figs. 2, supplementary figures S2 and S4, Supplementary Material online). From equation (11), we have
$$\varphi^*_{\mathrm{sc}} > \varphi^*_{\mathrm{im}} = \varphi^*_{\mathrm{iim}}. \tag{12}$$
In other words, recent gene flow (as in SC) is easier to recover by the MSci model than ancient gene flow (as in IM or IIM). Note that $\varphi^*_{\mathrm{im}} = \varphi^*_{\mathrm{iim}}$ holds only when one sequence is sampled per species; as there is no coalescent over $(0, \tau_T)$, IIM is essentially the same model as IM with a time shift (fig. 1). This will not be the case when multiple sequences per species are sampled or when the sequence length is finite.
Simulation Results
As our asymptotic theory was limited to a single sequence per species per locus, we used simulation to verify and augment our analytical calculations above. We simulated data under the IM, IIM, or SC models of figure 1a–c, using the same parameter values as above, and analyzed them using Bpp under the MSci model (fig. 1d). The JC mutation model (Jukes and Cantor 1969) was assumed. In the basic setting, we used sequences per species per locus, sites per sequence, and loci in each dataset, with the migration rate . We varied to examine their effects. With multiple sequences per species (), all eight parameters of the MSci model, (fig. 1d) are identifiable (Yang and Flouri 2022). The results are summarized in figure 3. We note a few common features first. In nearly all cases, population sizes for extant species () were very well estimated, with posterior means close to the true values and with very narrow highest-probability-density (HPD) credibility intervals (CIs). The exception was parameter under the IM model (note that B is the species receiving immigrants), which was less well estimated when the dataset was small and had either short sequences () or few loci (), or when the migration rate was very high. The poorer estimation of appeared to be related to the underestimation of and ; see below. The population size for the common ancestor was mostly well estimated, although overestimated at very high migration rates. Population sizes for the ancestral species () are harder to estimate; indeed they had larger CIs and were influenced by misspecification of the model of gene flow. As expected from the asymptotic results, was very well estimated, except at very high migration rates, in which case was underestimated (and overestimated).
Next we examine the effects of in turn. First, the number of sites (n) had a relatively small impact on MSci parameters, when other factors were fixed (at the basic setting of , and ). When , CIs for parameters such as the introgression time and probability ( and ) were wide. When , the CIs were much smaller for all parameters. The introgression time decreased slightly as the sequence became longer. This is consistent with the asymptotic analysis, which suggests that is dominated by the smallest coalescent time or sequence divergence and should be 0 for the IM and SC models or for the IIM model (when ) (figs. 2, supplementary figures S2 and S4, Supplementary Material online). Similarly, for the IM and SC models, increased with the increase of n when was low, as observed in the asymptotic analysis. Under the IIM model, small datasets with short sequences () produced very uncertain estimates of and (and, to a lesser extent, and ). The two parameters are nearly confounded; this is discussed below when we examine the impact of the migration rate ().
Second, we varied the number of sequences per species (S). When one sequence per species is in the data (), only five parameters in the MSci model (fig. 1d) are identifiable: ). When multiple sequences were sampled per species, all eight parameters are identifiable. They were well estimated when the dataset was large (say, with for IM and SC or for IIM). Even with sequences per species, estimates of from data generated under the IIM model involved wide CIs, with being close to , and and being very imprecise as well (fig. 3). This is due to the semi-unidentifiability or the confounding effects of the parameters, and will be discussed below. Here we note that the problem disappeared and all parameter estimates were well-behaved in large datasets when many sequences were sampled ( for IM and SC or for IIM; fig. 3).
Third, we examined the impact of the number of loci (L). The IIM model was hard to fit in small datasets with a small number of loci (), generating large CIs for parameters and . This is the same pattern as in the case of short loci () or few sequences (), discussed above. In large datasets, the parameters were well estimated. Note that the number of loci L is the sample size in the statistical model as data at different loci are independently and identically distributed. Theory predicts that in large datasets the variance should be proportional to (see O’Hagan and Forster 2004 for the case of correctly specified models and Yang and Zhu 2018 for the case of misspecified models), and thus the CI should decrease at the rate of . This prediction held for parameters that were well estimated (fig. 3). As discussed earlier, the introgression time is dominated by the smallest coalescent time or smallest sequence divergence. Thus increasing the number of loci led to a decrease in the estimated introgression time, and the trend was in particular apparent for the IIM model (under which when if ). In all cases, the estimated introgression time () was closer to the more recent end of the time interval for gene flow than to the midpoint, that is, for IM, for IIM, and for SC (see fig. 1a–c).
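The argument that the estimated introgression time is pulled down as loci are added can be illustrated with a toy calculation. The sketch below is not the MSci likelihood: it assumes, purely for illustration, that the between-species coalescent time at each locus is a hard lower bound plus an exponential waiting time, and it uses hypothetical values for the bound and the rate. It only shows that the smallest of L independent times shrinks towards the lower bound as L grows.

```python
import random

def mean_minimum_time(num_loci, lower_bound, rate, replicates=100, seed=1):
    """Average, over replicate datasets, of the smallest between-species
    coalescent time among num_loci independent loci."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        # One coalescent time per locus: a hard lower bound plus an
        # exponential waiting time (a toy stand-in for the true density).
        times = (lower_bound + rng.expovariate(rate) for _ in range(num_loci))
        total += min(times)
    return total / replicates

# Hypothetical values: lower bound 0 (gene flow up to the present), rate 100.
few_loci = mean_minimum_time(500, lower_bound=0.0, rate=100.0)
many_loci = mean_minimum_time(8000, lower_bound=0.0, rate=100.0)
print(f"mean minimum with 500 loci:  {few_loci:.2e}")
print(f"mean minimum with 8000 loci: {many_loci:.2e}")
```

With a positive lower bound (as when gene flow stopped some time ago), the same function returns values that approach that bound from above as the number of loci increases.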
Finally, we evaluated the impact of the migration rate () (fig. 3). Under the IM model, there is a near linear relationship between the introgression probability and at low rates. The amount of gene flow estimated under the MSci model is less than the true amount expected under the IM model ( of eq. 10) but the two were close at low rates (with , say). At very high rates (with , say), divergence time was increasingly underestimated and the population size was overestimated. These patterns are the same as observed in the asymptotic analysis of infinite data (), and are due to the attempt of the MSci model to accommodate intermediate coalescent times in the data, as discussed earlier (see figs. 2, supplementary figures S2 and S4, Supplementary Material online).
Under the IIM model, involved very large uncertainties at low rates (, say), with , and affected as well. Given the small , why did not converge to with narrow CIs? Note that if in the IIM model, the MSC model with no gene flow or MSci with will be the correct model. Similarly in figure 3 where was fixed, wide CIs for those parameters were observed in small datasets with short loci (), few sequences (), or few loci (), as noted above. Also in the asymptotic analysis (with ), we noted that and were grossly wrong but had no sampling errors because the data size was (fig. 2; supplementary figures S2 and S4, Supplementary Material online, methods c, d). We suggest that all those results are due to the near unidentifiability of parameters in the MSci model (in particular, and ); in other words, the parameters are confounded.
If in the MSC-M model, MSci with will be the correct model, but with a large appropriately adjusted may provide a very good fit to the data (of two sequences per locus). When is small but nonzero, the MSci model will never achieve a perfect fit, and a large with appropriately adjusted may provide a better fit than a small . Thus in infinite data (), we may get grossly wrong estimates with no uncertainty (supplementary figure S2, Supplementary Material online, methods c, d). In finite datasets (), there will be a ridge in the posterior surface involving and , leading to wide CIs for those parameters, influenced by both model misspecification and the prior (fig. 3). Including multiple samples from the same species () is useful for improving the information content in the data, but strong correlation between and may be expected nevertheless. In this regard, the large uncertainties in posterior estimates of parameters may be useful as they help the investigator avoid incorrect inferences of a large when gene flow is minimal.
Introgression Events Assigned to Wrong Branches
We conducted simulations to examine the bias in parameter estimates when the introgression event is assigned to either the parental or a daughter branch of the lineage genuinely involved in introgression. The data were simulated under model trees A or B and analyzed under models A or B of figure 4a and b.
In the A-A and B-B settings (fig. 4e), the correct MSci model was assumed, and the performance of the method serves as a reference for comparison. Most parameters, including the species divergence times (, and ) and population sizes for extant species (), were well estimated. For well-estimated parameters, the CI width reduced by a half as the number of loci (L) quadrupled, as predicted by theory. Population sizes for ancestral species (, , , , and ) were less well estimated, although performance improved with sample size: with loci, these parameters were well estimated. Introgression probability () was well estimated, but thousands of loci were necessary to obtain precise estimates with narrow CIs under the standard settings used here (four sequences per species per locus and 500 sites per sequence).
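The halving of the CI width can be checked directly against the interval limits reported in table 1. The short script below is a sketch under two assumptions: that the first three columns of table 1 give the introgression time (in units of 10^-3) for increasing numbers of loci, and that successive columns differ by a factor of four in L (the step from 1,000 to 4,000 loci is stated in the text; the smallest value is taken on trust). Under the 1/sqrt(L) scaling, each ratio of successive widths should be close to 0.5.

```python
# (lower, upper) HPD limits of the introgression time, copied from table 1.
# Columns are ordered by increasing number of loci.
ci_limits = {
    "A-A": [(2.63, 3.49), (2.80, 3.24), (2.89, 3.11)],
    "B-B": [(2.61, 3.35), (2.80, 3.18), (2.91, 3.10)],
}

def width_ratios(intervals):
    """Ratio of each CI width to the width in the preceding column."""
    widths = [upper - lower for lower, upper in intervals]
    return [later / earlier for earlier, later in zip(widths, widths[1:])]

ratios = {setting: width_ratios(iv) for setting, iv in ci_limits.items()}
for setting, values in ratios.items():
    print(setting, [round(v, 2) for v in values])
```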
In the other settings (fig. 4e), there was mismatch between the models used to simulate and to analyze data. We note that population sizes for extant species () were well estimated, as was the age of the root (). Performance for estimation of those parameters was very similar whether or not there was model misspecification (e.g., the A-B setting versus the B-B setting and C-A versus A-A). Below we focus on estimation of the other parameters.
In the A-B setting (fig. 4e), data were simulated under model A with introgression (fig. 4a) but analyzed under model B with introgression incorrectly assigned to the parental branch ST. Ancestral population sizes and were well estimated, similar to the B-B setting. Divergence times and were well estimated, but and were stuck together. We expect to be mostly determined by the smallest sequence divergence () between B and C, which should be close to . Here, we use the superscript to indicate the model in which the parameter is defined. In the fitting model B, the introgression time (which is ) should reflect the smallest sequence divergence , whereas in the true model A, is mostly determined by (which is ). Thus misidentification of the introgression lineage caused to be stuck at (fig. 5a). There was virtually no information for as the population was estimated to have near-zero time duration with no chance for coalescence. The introgression probability was seriously underestimated, converging to when the number of loci L increases (table 1), whereas the true value was 0.2. This smaller estimate of introgression probability is explained by the distribution of coalescent times between species in the true and fitting models (supplementary fig. S6, Supplementary Material online, true model A). Under the true model A, sequences from A and B are more similar than those between A and C due to the introgression, with an excess of small coalescence time . Under the analysis model B, and have the same distribution. Thus the true model predicts an excess of small , whereas the fitting model predicts an excess of small , and having a smaller in the fitting model helps to reduce the discrepancy.
Table 1.
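Analysis | Introgression time (×10⁻³), smallest L | Introgression time (×10⁻³), L = 1,000 | Introgression time (×10⁻³), L = 4,000 | Introgression probability, smallest L | Introgression probability, L = 1,000 | Introgression probability, L = 4,000 |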
---|---|---|---|---|---|---|
Figure 4 A-A | 3.06 (2.63, 3.49) | 3.02 (2.80, 3.24) | 3.00 (2.89, 3.11) | 0.23 (0.16, 0.32) | 0.21 (0.17, 0.24) | 0.20 (0.19, 0.22) |
Figure 4 B-A | 1.62 (0.95, 2.05) | 1.77 (1.54, 1.96) | 1.82 (1.72, 1.91) | 0.02 (0.00, 0.04) | 0.02 (0.01, 0.03) | 0.02 (0.02, 0.03) |
Figure 4 C-A | 1.12 (0.83, 1.40) | 1.11 (0.97, 1.25) | 1.11 (1.04, 1.18) | 0.12 (0.09, 0.15) | 0.12 (0.10, 0.13) | 0.12 (0.11, 0.12) |
Figure 4 D-A | 1.69 (1.18, 2.07) | 1.80 (1.58, 1.97) | 1.86 (1.76, 1.94) | 0.02 (0.01, 0.04) | 0.02 (0.01, 0.03) | 0.02 (0.02, 0.03) |
Figure 4 A-B | 3.82 (3.53, 4.11) | 3.75 (3.61, 3.90) | 3.73 (3.66, 3.80) | 0.18 (0.09, 0.28) | 0.13 (0.11, 0.16) | 0.12 (0.11, 0.14) |
Figure 4 B-B | 2.98 (2.61, 3.35) | 2.99 (2.80, 3.18) | 3.00 (2.91, 3.10) | 0.23 (0.14, 0.34) | 0.20 (0.17, 0.24) | 0.20 (0.18, 0.22) |
Figure 4 C-B | 2.98 (2.72, 3.24) | 2.93 (2.80, 3.06) | 2.91 (2.85, 2.98) | 0.11 (0.08, 0.14) | 0.10 (0.09, 0.12) | 0.10 (0.10, 0.11) |
Figure 4 D-B | 2.83 (2.28, 3.38) | 2.71 (2.42, 3.00) | 2.73 (2.59, 2.87) | 0.11 (0.04, 0.20) | 0.08 (0.05, 0.10) | 0.08 (0.07, 0.09) |
Figure 6 IIM | 3.40 (2.38, 4.36) | 2.93 (2.42, 3.43) | 2.83 (2.58, 3.08) | 0.24 (0.04, 0.53) | 0.10 (0.05, 0.16) | 0.08 (0.06, 0.10) |
Figure 7 | 2.81 (2.41, 3.22) | 2.80 (2.60, 3.01) | 2.79 (2.68, 2.89) | 0.23 (0.16, 0.31) | 0.21 (0.18, 0.25) | 0.21 (0.19, 0.22) |
Figure 8 | 3.12 (1.93, 4.07) | 3.05 (2.42, 3.68) | 2.98 (2.73, 3.23) | 0.03 (0.01, 0.06) | 0.03 (0.01, 0.04) | 0.02 (0.02, 0.03) |
In the B-A setting (fig. 4e), the simulation model (MSci model B of fig. 4b) assumes introgression involving the ancestral branch ST but the analysis model (model A) assigns introgression to the daughter branch TB. Posterior means and CIs for divergence times and were similar to those in the A-A setting. Note that should be mostly determined by the smallest sequence divergence () between B and C, and given that this is , was well estimated, unaffected by the mis-assigned introgression event. Although the true introgression time was 0.003, it was forced to be less than by the analysis model A. As the number of loci increases, became stuck at (fig. 5b). However, was seriously underestimated. This may be explained as follows. In the analysis model A, was mostly determined by the shortest sequence distance between A and C. In the true model B, this should be close to , due to introgression. With mutational fluctuations in the sequences, one can expect to lie between , but closer to in large datasets with many sites and/or many loci. Population sizes and were affected by the mis-assigned introgression events as well, as those populations are close to the introgression branches. In particular, was very imprecise as branch YT was very short, and was overestimated because was seriously underestimated (as those two parameters are negatively correlated). Finally, the introgression probability () was underestimated, apparently converging to when the number of loci L increased (table 1), whereas the true value was 0.2. This greatly reduced introgression probability appeared to reflect the very poor fit of the misspecified model A to data generated under model B (see the large differences between the true and fitting distributions of coalescent times in supplementary fig. S6, Supplementary Material online, second row). As and are seriously underestimated, an excess of small coalescent times () is expected in the fitting model A but does not appear in the data, so that having a smaller improves the fit.
In summary, assigning introgression events to a wrong parental or daughter branch led to biased estimates of introgression times (causing the introgression events to collapse onto speciation events) and to seriously underestimated introgression probabilities.
Continuous Migration versus Episodic Introgression
In this set of simulations, we generated data under the IM models C and D of figure 4c and d and analyzed them under the MSci models A and B, with the mode of gene flow misspecified and with gene flow assigned to either the correct branch or a wrong branch on the species tree.
In the C-A and D-B settings (fig. 4e), gene flow occurred continuously but the data were analyzed under the MSci model assuming introgression at a time point. The mode of gene flow was misspecified, but the lineages involved were correctly identified. In the C-A setting, gene flow was between non-sister species, whereas in the D-B setting it was between sister species. Speciation times (, , ) and population sizes () were well estimated, similar to the A-A setting. Surprisingly ancestral population sizes appeared to be even better estimated, with narrower CIs, in the C-A setting than in A-A. Speciation times and population sizes were extremely similar between settings D-B and B-B. Those results were consistent with the results for the case of two species (fig. 3), which showed that at low migration rates, species divergence times and population sizes were well estimated under the MSci model when the data were generated under the IM model.
In the C-A setting, the estimated introgression time appeared to converge (when L increased) to 0.0011, much more recent than the average time of gene flow (), and the introgression probability appeared to converge to (table 1), smaller than the expected proportion of total migrants: . As discussed earlier for the case of two species, the limiting value for was nonzero, as the sequence length is finite, and the MLE slightly underestimated the true amount of gene flow. In the D-B setting, the introgression time appeared to converge to , larger than but much smaller than the average time of gene flow, , and the introgression probability appeared to converge to (table 1), much smaller than from equation (10). In both the C-A and D-B settings, the estimated introgression time was within the time interval of gene flow, but closer to the time when gene flow stopped, whereas the amount of gene flow was underestimated (, ). Moreover, we have . These patterns are consistent with our analysis of the two-species case at low migration rates (eq. 12, fig. 3), which suggested that gene flow after a period of isolation (the SC model) is easier to recover than gene flow that starts at speciation but stops some time afterwards (the IIM model).
In the C-B and D-A settings (fig. 4e), the mode of gene flow was misspecified and furthermore gene flow was assigned onto the wrong branch of the species tree. In the C-B setting, divergence time was underestimated slightly, due to gene flow assigned to the wrong branch, as observed in the A-B setting. Ancestral population sizes and were affected by gene flow, similar to the A-B setting. Model B forces . Thus we expect and to get stuck together, with both being smaller than ; as the number of loci L increased, appeared to converge to 0.0029, and to (table 1).
In the D-A setting, the divergence time was underestimated, due to gene flow assigned to the wrong branch, similarly to the B-A setting. The ancestral population sizes and were well estimated as in the A-A setting, but had a slight positive bias. The ancestral population sizes and were affected by the gene flow, similar to the B-A setting. The introgression time and probability ( and ) do not exist in the simulation model D. Model A forces , so we expect to be close to ; when the number of loci L increased, appeared to converge to 0.00186, and to (table 1). Note that with and . Those results are consistent with our early results for fitting the MSci model to data generated under the migration model in the two-species case (eq. 12, fig. 3), and with the results for the A-B and B-A settings that assignment of gene flow to a wrong branch reduces the estimate of .
In summary, the estimated introgression probabilities, at 0.12, 0.08, 0.10, and 0.02 for the C-A, D-B, C-B, and D-A settings, respectively, even though the total amount of gene flow was the same in models C and D (table 1), suggest the following general patterns. First, the MSci model underestimates the total amount of gene flow if gene flow occurs continuously in every generation (i.e., ), as discussed in our analysis of the two-species case. Second, assigning gene-flow events to wrong lineages led to serious underestimation of the amount of gene flow (i.e., ). Third, recent gene flow in the data is more easily recovered (i.e., ).
Isolation with Initial Migration (IIM) Model
Next, we assessed the effects of taxon sampling when the mode of gene flow is misspecified. We used the IIM model for three species of figure 6a to simulate data and analyzed them under the MSci model of figure 6b. Species divergence times () and population sizes (, and even and ) were well estimated. We expect the estimated introgression time to converge to if the sequence length is infinite and to a higher limit for finite sequence length. In our simulation at (table 1). The estimated introgression probability () converged to a nonzero limit, (table 1), compared with by equation (10).
The IIM model of figure 6a is very similar to the two-species model of figure 1b, except that here the tree is larger, with more species; it serves to highlight the fact that the impact of misspecifying the model of gene flow is local. The case is also similar to the D-B setting of figure 4, the only difference being that here the hybridizing species T had only one descendant species sampled in the data, whereas in figure 4 (D-B) it had two. Thus estimates of parameters such as the introgression probability and introgression time were similar to those in the D-B setting of figure 4 but with wider CIs (table 1). Unlike approximate methods designed to work with species triplets or quartets only, the Bayesian approach accommodates an arbitrary number of species in the data (with arbitrary data configurations at each locus), so that a difference in taxon sampling affects only the information content in the data.
Ghost Species
We considered two scenarios in which a species that contributed migrants to extant species has gone extinct or is otherwise unsampled in the data. Note that existence of extinct or unsampled species that received genetic materials from ancestors of extant species in the sample is not relevant to the analysis of the sampled data and does not constitute a model misspecification. In the first scenario, model A of figure 7a is used to simulate data, which assumes that species XUV contributed migrants to species B but is not included in the sample. Note that this model is equivalent to model A of figure 7a. When we fit model B (fig. 7b), the only incorrect assumption is the constraint that . This is a minor misspecification. Indeed all parameters shared between the simulation model and the analysis model were well estimated (fig. 7c). The estimates of introgression time, (table 1), were close to the average of the two parameters in the true model (0.0025). Introgression probability (table 1) was also close to the true value (0.2). The existence of the ghost species (XUV) had very little effect on the inference.
In the second scenario (fig. 8a), the true model assumes continuous migration involving intermediate ancestral species that have gone extinct, and the MSci model (fig. 8b) was fitted to data sampled from extant species. Divergence times and were very well estimated, as were the population sizes shared between the simulation and analysis models (, , , ). We expect in model B to be dominated by the minimum coalescent time between sequences from A and B, and this is given by . Gene flow from branches RC to SU over the time interval () and then from SU to TB during was interpreted as introgression in the MSci model. The effective rate for this migration may be close to , giving . The estimate was (fig. 8c, table 1). The introgression time should be between and and the estimate was (fig. 8c, table 1). Note that both and were overestimated (fig. 8c). Branch T of figure 8b corresponds to branches RS and ST of figure 8a, with population size . Branch Y corresponds to a segment of branch TB over the time interval , with . Overestimation of (and ) may be because there is a deficit of over the interval due to gene flow, and the fitting MSci model, with the amount of gene flow underestimated (), used large and to compensate.
Discussion
The Mode of Gene Flow and the Utility of Misspecified Introgression Models
The asymptotic theory, even though based on only two species with one sequence sampled per species per locus, has been very useful. It generated a number of insights that were confirmed and extended in our simulation. Together the theory and simulation suggest the following correspondence between the MSC-M and MSci models. When gene flow occurs continuously over an extended time period after divergence of two species and we fit the introgression model, the estimated introgression time tends to be closer to the more recent end of the time period of gene flow, because the introgression time is dominated by the most recent coalescent time or the minimum sequence divergence between species. If the true coalescent time is known and used as data, the introgression time will converge to the time when gene flow stopped. At low migration rates (, say), the species divergence time is correctly estimated by the MSci model, and the introgression probability is lower than but close to the expected proportion of migrants (). The estimate is particularly close under the secondary-contact model (supplementary fig. S4, Supplementary Material online). At very high migration rates, the estimated introgression probability may be much less than , and furthermore the species divergence time is underestimated to account for intermediate coalescent times generated under the MSC-M model. Recent gene flow (as in the SC model) is easier to recover (with closer to ) than ancient gene flow (as in the IIM model).
The accurate estimation of species divergence times under the MSci model despite the misspecification, at least at lower migration rates (e.g., for in fig. 3), may be worth emphasizing. It is well known that ignoring gene flow between two species may lead to serious underestimation of the species divergence time. Here our results suggest that if gene flow is continuous, the MSci model assuming introgression at a fixed time point still gives reliable estimates of the species divergence time. The estimated introgression probability () may also serve as a useful guide even though it reflects both the migration rate per generation (m or M) and the time duration of the period of gene flow (eq. 10). Even if gene flow occurs continuously over time (so that the migration model is a more realistic model), the MSci model is effective in extracting historical information about species divergence times and population sizes. Note that on the evolutionary time scale, a few hundred or thousand generations may count as a fixed time point, in which case the MSci model may provide an adequate approximation.
Both the asymptotic theory and simulation have highlighted the semi-unidentifiability or confounding effects between the introgression probability () and the population size of the donor species ( in fig. 1d) (e.g., fig. 2, methods c and d). The problem is particularly acute under the IIM model applied to small datasets (with short loci, few sequences per species, or few loci), where high estimates of with wide CIs are produced even though migration occurs at very low rates (fig. 3). One such case has been observed in a recent analysis of genomic data from the erato group of Heliconius butterflies (Thawornwattana et al. 2022). The estimated H. saraH. demeter introgression probability was high with wide CIs for some chromosomal regions with a small number of loci (e.g., chromosome 21 with 4350 noncoding and 3628 coding loci, and an inversion on chromosome 15 with 149 noncoding and 167 coding loci), with the introgression time close to the species divergence time, whereas for the other large chromosomes, the estimates were nearly zero (). The true rate in this case appeared to be , but the limited data from small chromosomal segments led to poorly supported large introgression rates, as in our simulations (fig. 3).
We demonstrated that including multiple samples from the same species (in particular, from recipient species) is important for resolving unidentifiability issues or confounding effects, as well as for boosting the information content in the data concerning the rate of gene flow. In this regard, it may be noted that many approximate methods are designed to use only one sample per species, and it has been claimed that “adding more samples provides little new information with respect to introgression” (Hibbins and Hahn 2022). We suggest that this may not be a generally correct statement.
Overall, our simulations using larger species trees with more than two species suggest that misspecification of the mode of gene flow (continuous migration versus episodic hybridization/introgression) has relatively small and localized effects, restricted to divergence times and population sizes around the lineages involved in gene flow, while species divergence times, population sizes for extant species and for ancestral species not involved in gene flow are largely unaffected. If gene flow occurs between species A and B but more distantly related species are included in the data sample, parameters outside the AB clade are largely unaffected (e.g., compare results for the IIM model for two species of fig. 3 with those for three species of fig. 6). Similarly, if A represents a clade rather than one species, divergence times and population sizes inside the A clade are not affected by gene flow involving the branch ancestral to the A clade (e.g., compare the D-B setting of fig. 4 with the IIM model of fig. 3).
Assigning gene flow to parental or daughter branches causes the introgression probability to be underestimated, and the introgression time to collapse onto the species divergence time. This result may be used to diagnose the mis-assignment of introgression lineages in real data analysis (Ji et al. 2022). A number of authors have discussed the impact of ghost species on detection of between-species gene flow (Beerli 2004; Ottenburghs 2020). Tricou et al. (2022) used simulations to demonstrate that D-statistics can be misled to detect false signals of introgression when the model involved an unsampled (ghost) species. In our simulations, the impact of ghost species on Bayesian estimation of introgression rate and time was minor provided we considered the rate of gene flow in the migration and introgression models to reflect both indirect gene flow via intermediate species and direct gene flow.
Testing Models of Gene Flow
In this study, we fixed the model of introgression in our analyses, with all introgression events pre-identified, to examine the effects of model misspecification. One may ask what happens if different introgression models (which for example assign introgression events onto different branches of the species tree) are compared using genomic data. Currently, both *beast and Phylonet have implemented cross-model MCMC algorithms under the MSci model, which insert and delete introgression events on the species tree, allowing the Markov chain to move between models. Those algorithms are computationally expensive and currently the two programs can handle only very small datasets (with <100 loci, say). In the Bpp program, one may use the Bayes factor to compare two MSci models, using thermodynamic integration (Gelman and Meng 1998; Lartillot and Philippe 2006) combined with Gaussian quadrature to calculate the marginal likelihood values (Rannala and Yang 2017). In the case where the compared models are nested (e.g., one with introgression and another without), the Bayes factor may also be calculated through the Savage–Dickey density ratio (Dickey 1971), which uses only a within-model MCMC run under the more general model (Ji et al. 2022). This has a computational advantage over reversible jump MCMC (Green 1995), and has recently been applied to formulate and compare introgression models in an analysis of genomic data from the Tamias quadrivittatus group of North American chipmunks (Ji et al. 2022). Calculation of marginal likelihood values or Bayes factors may be feasible if we have only a small number of well-specified models but may not be feasible for searching in the space of MSci models for a given set of species.
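For nested models, the Savage–Dickey idea can be sketched from the output of a single MCMC run under the more general model. The example below is a minimal illustration, not the Bpp implementation: it assumes that the null hypothesis of no introgression is approximated by a small interval (the introgression probability being less than a hypothetical threshold `epsilon`), that the prior on the introgression probability is uniform on (0, 1), and that the posterior sample shown is made up. The function and variable names are ours.

```python
def savage_dickey_b01(posterior_samples, epsilon, prior_cdf):
    """Approximate Bayes factor in favor of the null (no introgression).

    Uses the ratio P(phi < epsilon | data) / P(phi < epsilon) as a stand-in
    for the ratio of posterior to prior density at phi = 0.
    """
    if not posterior_samples:
        raise ValueError("need at least one posterior sample")
    in_null = sum(1 for phi in posterior_samples if phi < epsilon)
    posterior_prob = in_null / len(posterior_samples)
    prior_prob = prior_cdf(epsilon)
    return posterior_prob / prior_prob

def uniform_cdf(x):
    # CDF of U(0, 1), assumed here as the prior on phi.
    return min(max(x, 0.0), 1.0)

# Hypothetical MCMC sample: 10% of draws lie in the null region.
samples = [0.001] * 10 + [0.2] * 90
b01 = savage_dickey_b01(samples, epsilon=0.01, prior_cdf=uniform_cdf)
print(f"B01 = {b01:.2f}; B10 = {1 / b01:.2f}")
```

If no posterior draw falls in the null region, the estimated B01 is zero and the data strongly favor introgression; in that case the reciprocal is undefined and a longer chain or a different estimator would be needed.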
Approximate methods have also been developed to infer introgression events or the so-called phylogenetic networks using summaries of the multilocus sequence data. For example, estimated gene tree topologies may be treated as data, as in Phylonet/gt (Wen et al. 2016). Some methods are designed to detect gene flow in a small tree with three or four species, including summary methods based on genome-wide site-pattern counts (such as D and Hyde discussed earlier) or on estimated gene trees (e.g., Snaq) and maximum likelihood applied to multilocus sequence alignments (e.g., 3s, Zhu and Yang 2012; Dalquen et al. 2017). Results for species subsets may then be combined to formulate an introgression model on the large tree for all species, which is a challenging task (Edelman et al. 2019; Thawornwattana et al. 2022). In summary, there is currently an acute need for improving the computational efficiency of Bayesian MCMC algorithms for inference under the MSC model with gene flow and the statistical efficiency of approximate methods.
It will also be interesting to use the same genomic data to compare the MSC-M and MSci models. The two classes of models often predict very different distributions of gene trees and coalescent times (e.g., supplementary figs. S1, S3, S5, Supplementary Material online; see also Jiao and Yang 2021). Thus, genomic data may be informative to distinguish them. A stochastic search in the combined space of MSC-M and MSci models may be infeasible, as the two types of models are very different. However, they can be compared using Bayes factors.
Materials and Methods
Simulation to Establish a Correspondence between the Migration and Introgression Models in the Case of Two Species
We analyzed the relationships between parameters when data are generated under the continuous migration model (IM, IIM, and SC; fig. 1a–c) and analyzed under the episodic introgression (MSci) model (fig. 1d). Our theory assumed an infinite number of loci (), a finite number of sites per sequence (n), with only one sequence per species per locus. We conducted computer simulations to augment the theoretical analysis. Data of multilocus sequence alignments were simulated under the IM, IIM, and SC models of figure 1a–c, and analyzed under the MSci model (fig. 1d). Population sizes on the species tree (fig. 1) were for the thin branches and for the thick branches. Migration occurred from species A to B after their divergence at in the IM model, between and in the IIM model, and between and the present time in the SC model. In the standard model, the migration rate was individuals per generation. Each dataset consisted of loci, with sequences per species, and sites per sequence. We conducted four sets of simulation to examine the impact of the number of sites per sequence (n), the number of sequences per species (S), the number of loci (L), and the migration rate (). The values used were , 1,000, 4,000, 16,000, 64,000; , 2, 4, 8, 16; , 500, 1,000, 2,000, 4,000, 8,000; and , 0.02, 0.03, 0.04, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1.0, 1.5, 2.0. With three models (IM, IIM, and SC), four factors (), and 30 replicates, a total of datasets were simulated. Data were simulated using Bpp 4.4.1 (Flouri et al. 2018, 2020), by generating the gene tree with coalescent times for each locus and then “evolving” sequences along branches of the gene tree under the JC mutation model (Jukes and Cantor 1969). Sequences at the tips of the gene tree constituted the data at the locus.
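The step of “evolving” sequences under the JC model determines how a coalescent time translates into observed sequence divergence. The sketch below uses the standard JC69 formula for the probability that two sequences differ at a site; it assumes that coalescent times are measured in expected mutations per site, so that two sequences with coalescent time t are separated by a distance of 2t. The time 0.003 and the 500 sites are used only as illustrative values, and the function names are ours; this is not the Bpp simulator, which evolves sequences along the whole gene tree.

```python
import math
import random

def jc69_p_diff(coalescent_time):
    """Probability that two sequences differ at a site under JC69.

    coalescent_time is in expected mutations per site, so the two
    sequences are separated by a distance of 2 * coalescent_time.
    """
    distance = 2.0 * coalescent_time
    return 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))

def simulate_differences(coalescent_time, num_sites, rng):
    """Number of differing sites between two sequences of num_sites sites."""
    p = jc69_p_diff(coalescent_time)
    return sum(1 for _ in range(num_sites) if rng.random() < p)

rng = random.Random(2022)
t = 0.003   # illustrative coalescent time
n = 500     # illustrative number of sites per sequence
expected = n * jc69_p_diff(t)
observed = simulate_differences(t, n, rng)
print(f"expected differences: {expected:.2f}, simulated: {observed}")
```

Because only a handful of differences are expected at such short sequences, the observed divergence at any one locus is a noisy reflection of the coalescent time, which is why the number of sites affects how closely the introgression time tracks the smallest coalescent time.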
Each dataset was analyzed using Bpp under the MSci model (fig. 1d) to estimate the parameters. This is the so-called A00 analysis, with the model fixed (Yang 2015). The Bayesian implementation of the MSci model in Bpp accommodates gene-tree reconstruction uncertainties while making use of information in both gene tree topologies and branch lengths, and allows the estimation of the direction, timing, and strength of introgression (Jiao et al. 2021). The JC mutation model was assumed in the analysis. Gamma priors were assigned to population size parameters () and to the age of the root on the species tree; and . Note that the gamma distribution G has mean and variance , so that the shape parameter means diffuse priors. Introgression probability was assigned the beta prior beta, which is .
We used 32,000 MCMC iterations as burn-in, and took samples, sampling every five iterations.
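The burn-in and thinning settings translate into a total chain length as sketched below. The burn-in (32,000) and sampling interval (5) are from the text; the number of retained samples used in the example is a placeholder, and the convention that the first sample is recorded one full interval after burn-in is an assumption. The function name is hypothetical.

```python
def mcmc_schedule(burnin, n_samples, sample_every):
    """Return the total number of iterations and the iterations at which
    samples are recorded (assumed: first sample one interval after burn-in)."""
    total = burnin + n_samples * sample_every
    recorded = range(burnin + sample_every, total + 1, sample_every)
    return total, recorded


# n_samples is a placeholder value, not the number used in the study.
total, recorded = mcmc_schedule(burnin=32_000, n_samples=1_000, sample_every=5)
```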
Introgression Events Assigned to Wrong Branches
Data were simulated under models A and B of figure 4 and analyzed under models A and B, possibly with the introgression event assigned incorrectly onto either the parental or a daughter branch of the branch truly involved in introgression. The species divergence times () are shown in the trees (fig. 4). We used sequences per species per locus, with sites in the sequence. The number of loci was L = 250, 1,000, and 4,000. We used two population sizes, with for the thin branches and for the thick branches. The number of replicates was 100.
Each dataset was analyzed using Bpp under both models A and B (fig. 4a and b). Gamma priors were assigned to parameters, with mean 0.005 and with mean 0.01. With two trees/models, three numbers of loci, and 100 replicates, a total of 600 datasets were simulated, each analyzed under models A and B. We used 32,000 MCMC iterations as burn-in, and took samples, sampling every five iterations.
Continuous Migration versus Episodic Introgression
Data were simulated under the MSC-M models C and D of figure 4c and d, with continuous migration at the rate migrants per generation, and analyzed under MSci models A and B (fig. 4a and b), resulting in four settings: C-A (simulation model C and analysis model A), C-B, D-A, and D-B. In settings C-A and D-B, gene flow was continuous in the true model but the MSci model assumed episodic introgression at a particular time point, so that the mode of gene flow was misspecified. In settings C-B and D-A, the mode of gene flow was similarly misspecified, and in addition gene flow was assigned to the wrong branches on the species tree. Other parameter settings were the same as above. With two trees and three numbers of loci (L), a total of 600 datasets were generated, each analyzed twice (under models A and B).
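The four simulation-analysis settings and the dataset counts can be enumerated as in the sketch below. The classification of settings follows the text; the loci counts (250, 1,000, 4,000) and the 100 replicates are assumed to carry over from the other simulation sections ("other parameter settings were the same as above"), and the variable names are hypothetical.

```python
from itertools import product

simulation_models = ["C", "D"]    # MSC-M models (continuous migration)
analysis_models = ["A", "B"]      # MSci models (episodic introgression)
loci_counts = [250, 1000, 4000]   # assumed from the other simulation sections
replicates = 100                  # assumed from the other simulation sections

# Settings in which only the mode of gene flow is misspecified.
mode_only = {("C", "A"), ("D", "B")}

settings = {
    f"{sim}-{ana}": "mode only" if (sim, ana) in mode_only else "mode and branch"
    for sim, ana in product(simulation_models, analysis_models)
}

n_datasets = len(simulation_models) * len(loci_counts) * replicates
n_analyses = n_datasets * len(analysis_models)
```

Under these assumptions the counts agree with the 600 datasets stated in the text, each analyzed twice.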
Isolation with Initial Migration (IIM) Model
Data were simulated under the IIM model A of figure 6a, with migration over the time period , and analyzed under the MSci model of figure 6b, assuming introgression at time . The IIM model was specified using a ghost species (U) from which no sequences were available. We generated 100 replicate datasets for each of L = 250, 1,000, and 4,000 loci, giving 300 simulated datasets in total. MCMC settings were the same as above.
Ghost Species
To assess the effects of an unsampled ghost population, we simulated data under MSci model A of figure 7a (see fig. 1A in Flouri et al. 2020) and analyzed them under MSci model B of figure 7b, with incorrectly assumed. Here introgression involved a ghost species XUV which went extinct or was otherwise unsampled in the data. This scenario is equivalent to model A of figure 7a. With three values for L (250, 1,000, 4,000) and 100 replicates each, 300 datasets were generated, all analyzed under the MSci model (fig. 7b).
We also used the IIM model of figure 8a to generate data, with migration from species RC to SU and from SU to TB, and with V and W treated as unsampled ghost species. Data (i.e., sequences from and C) were analyzed under the MSci model of figure 8b. We used three values for L (250, 1,000, 4,000) and 100 replicates, with 300 datasets simulated in total. Other settings were the same as above.
Acknowledgments
This study has been supported by Biotechnology and Biological Sciences Research Council grants (BB/T003502/1, BB/R01356X/1), as well as by Harvard University.
Contributor Information
Jun Huang, School of Biomedical Engineering, Capital Medical University, Beijing 100069, P.R. China.
Yuttapong Thawornwattana, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138.
Tomáš Flouri, Department of Genetics, Evolution and Environment, University College London, London WC1E 6BT, United Kingdom.
James Mallet, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138.
Ziheng Yang, Department of Genetics, Evolution and Environment, University College London, London WC1E 6BT, United Kingdom.
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
References
- Aeschbacher S, Bürger R. 2014. The effect of linkage on establishment and survival of locally beneficial mutations. Genetics 197(1):317–336.
- Akerman A, Bürger R. 2014. The consequences of gene flow for local adaptation and differentiation: a two-locus two-deme model. J Math Biol. 68(5):1135–1198.
- Anderson E. 1949. Introgressive hybridization. New York: John Wiley.
- Bahlo M, Griffiths RC. 2000. Inference from gene trees in a subdivided population. Theor Popul Biol. 57:79–95.
- Barton N, Bengtsson BO. 1986. The barrier to genetic exchange between hybridising populations. Heredity 57(3):357–376.
- Beerli P. 2004. Effect of unsampled populations on the estimation of population sizes and migration rates between sampled populations. Mol Ecol. 13:827–836.
- Beerli P, Felsenstein J. 1999. Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics 152:763–773.
- Beerli P, Felsenstein J. 2001. Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc Natl Acad Sci U S A. 98:4563–4568.
- Blischak PD, Chifman J, Wolfe AD, Kubatko LS. 2018. HyDe: a Python package for genome-scale hybridization detection. Syst Biol. 67(5):821–829.
- Bürger R, Akerman A. 2011. The effects of linkage and gene flow on local adaptation: a two-locus continent-island model. Theor Popul Biol. 80(4):272–288.
- Burgess R, Yang Z. 2008. Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Mol Biol Evol. 25(9):1979–1994.
- Costa RJ, Wilkinson-Herbots H. 2017. Inference of gene flow in the process of speciation: an efficient maximum-likelihood method for the isolation-with-initial-migration model. Genetics 205(4):1597–1618.
- Costa RJ, Wilkinson-Herbots HM. 2021. Inference of gene flow in the process of speciation: efficient maximum-likelihood implementation of a generalised isolation-with-migration model. Theor Popul Biol. 140:1–15.
- Dalquen D, Zhu T, Yang Z. 2017. Maximum likelihood implementation of an isolation-with-migration model for three species. Syst Biol. 66:379–398.
- Degnan JH. 2018. Modeling hybridization under the network multispecies coalescent. Syst Biol. 67(5):786–799.
- Dickey JM. 1971. The weighted likelihood ratio, linear hypotheses on normal location parameters. Ann Math Stat. 42(1):204–223.
- Dittberner H, Tellier A, de Meaux J. 2022. Approximate Bayesian computation untangles signatures of contemporary and historical hybridization between two endangered species. Mol Biol Evol. 39(2):msac015. doi: 10.1093/molbev/msac015
- Dobzhansky T. 1937. Genetics and the origin of species. New York: Columbia University.
- Edelman NB, Frandsen PB, Miyagi M, Clavijo B, Davey J, Dikow RB, García-Accinelli G, Van Belleghem SM, Patterson N, Neafsey DE, et al. 2019. Genomic architecture and introgression shape a butterfly radiation. Science 366(6465):594–599.
- Ellegren H, Smeds L, Burri R, Olason PI, Backstrom N, Kawakami T, Kunstner A, Makinen H, Nadachowska-Brzyska K, Qvarnstrom A, et al. 2012. The genomic landscape of species divergence in Ficedula flycatchers. Nature 491:756–760.
- Elworth RAL, Ogilvie HA, Zhu J, Nakhleh L. 2019. Advances in computational methods for phylogenetic networks in the presence of hybridization. Bioinform Phylogenet. 29:317–360.
- Finger N, Farleigh K, Bracken J, Leache A, Francois O, Yang Z, Flouri T, Charran T, Jezkova T, Williams D, et al. 2022. Genome-scale data reveal deep lineage divergence and a complex demographic history in the Texas horned lizard (Phrynosoma cornutum) throughout the southwestern and central USA. Genome Biol Evol. 14(1):evab260. doi: 10.1093/gbe/evab260
- Flouri T, Jiao X, Rannala B, Yang Z. 2018. Species tree inference with BPP using genomic sequences and the multispecies coalescent. Mol Biol Evol. 35(10):2585–2593.
- Flouri T, Jiao X, Rannala B, Yang Z. 2020. A Bayesian implementation of the multispecies coalescent model with introgression for phylogenomic analysis. Mol Biol Evol. 37(4):1211–1223.
- Gelman A, Meng X. 1998. Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Stat Sci. 13:163–185.
- Green PJ. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711–732.
- Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, Li H, Zhai W, Hsi-Yang M, et al. 2010. A draft sequence of the Neandertal genome. Science 328:710–722.
- Hey J. 2010. Isolation with migration models for more than two populations. Mol Biol Evol. 27:905–920.
- Hey J, Chung Y, Sethuraman A, Lachance J, Tishkoff S, Sousa VC, Wang Y. 2018. Phylogeny estimation by integration over isolation with migration models. Mol Biol Evol. 35(11):2805–2818.
- Hey J, Nielsen R. 2004. Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics 167:747–760.
- Hibbins MS, Hahn MW. 2022. Phylogenomic approaches to detecting and characterizing introgression. Genetics 220(2):iyab173.
- Ji J, Jackson DJ, Leache AD, Yang Z. 2022. Significant cross-species gene flow detected in the Tamias quadrivittatus group of North American chipmunks. BioRxiv. doi:10.1101/2021.12.07.471567
- Jiao X, Flouri T, Rannala B, Yang Z. 2020. The impact of cross-species gene flow on species tree estimation. Syst Biol. 69(5):830–847.
- Jiao X, Flouri T, Yang Z. 2021. Multispecies coalescent and its applications to infer species phylogenies and cross-species gene flow. Natl Sci Rev. 8:nwab127. doi: 10.1093/nsr/nwab127
- Jiao X, Yang Z. 2021. Defining species when there is gene flow. Syst Biol. 70(1):108–119.
- Jukes T, Cantor C. 1969. Evolution of protein molecules. In: Munro H, editor. Mammalian protein metabolism. New York: Academic Press. p. 21–123.
- Kumar V, Lammers F, Bidon T, Pfenninger M, Kolter L, Nilsson MA, Janke A. 2017. The evolutionary history of bears is characterized by gene flow across species. Sci Rep. 7:46487.
- Lartillot N, Philippe H. 2006. Computing Bayes factors using thermodynamic integration. Syst Biol. 55:195–207.
- Liu S, Lorenzen ED, Fumagalli M, Li B, Harris K, Xiong Z, Zhou L, Korneliussen TS, Somel M, Babbitt C, et al. 2014. Population genomics reveal recent speciation and rapid evolutionary adaptation in polar bears. Cell 157:785–794.
- Maddison W. 1997. Gene trees in species trees. Syst Biol. 46:523–536.
- Malecot G. 1948. Les mathématiques de l’hérédité. Paris: Masson.
- Mallet J. 2007. Hybrid speciation. Nature 446:279–283.
- Mallet J, Besansky N, Hahn MW. 2016. How reticulated are species? BioEssays 38(2):140–149.
- Martin SH, Dasmahapatra KK, Nadeau NJ, Salazar C, Walters JR, Simpson F, Blaxter M, Manica A, Mallet J, Jiggins CD. 2013. Genome-wide evidence for speciation with gene flow in Heliconius butterflies. Genome Res. 23(11):1817–1828.
- Martin SH, Davey JW, Salazar C, Jiggins CD. 2019. Recombination rate variation shapes barriers to introgression across butterfly genomes. PLoS Biol. 17(2):e2006288.
- Martin SH, Jiggins CD. 2017. Interpreting the genomic landscape of introgression. Curr Opin Genet Dev. 47:69–74.
- Meng C, Kubatko LS. 2009. Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theor Popul Biol. 75(1):35–45.
- Muller HJ. 1942. Isolating mechanisms, evolution, and temperature. Biol Symp. 6:71–125.
- Nichols R. 2001. Gene trees and species trees are not the same. Trends Ecol Evol. 16:358–364.
- Notohara M. 1990. The coalescent and the genealogical process in geographically structured population. J Math Biol. 29:59–75.
- O’Hagan A, Forster J. 2004. Kendall’s advanced theory of statistics: Bayesian inference. London: Arnold.
- Ottenburghs J. 2020. Ghost introgression: spooky gene flow in the distant past. BioEssays 42(6):e2000012.
- Petry D. 1983. The effect on neutral gene flow of selection at a linked locus. Theor Popul Biol. 23:300–313.
- Rannala B, Yang Z. 2003. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 164(4):1645–1656.
- Rannala B, Yang Z. 2017. Efficient Bayesian species tree inference under the multispecies coalescent. Syst Biol. 66:823–842.
- Schumer M, Xu C, Powell DL, Durvasula A, Skov L, Holland C, Blazier JC, Sankararaman S, Andolfatto P, Rosenthal GG. 2018. Natural selection interacts with recombination to shape the evolution of hybrid genomes. Science 360(6389):656–660.
- Slatkin M. 1987. Gene flow and the geographic structure of natural populations. Science 236(4803):787–792.
- Solis-Lemus C, Ane C. 2016. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genet. 12(3):e1005896.
- Thawornwattana Y, Seixas FA, Mallet J, Yang Z. 2022. Full-likelihood genomic analysis clarifies a complex history of species divergence and introgression: the example of the erato-sara group of Heliconius butterflies. Syst Biol. 71:1159–1177.
- Tricou T, Tannier E, de Vienne DM. 2022. Ghost lineages highly influence the interpretation of introgression tests. Syst Biol. doi: 10.1093/sysbio/syac011
- Uecker H, Setter D, Hermisson J. 2015. Adaptive gene introgression after secondary contact. J Math Biol. 70(7):1523–1580.
- Wen D, Nakhleh L. 2018. Coestimating reticulate phylogenies and gene trees from multilocus sequence data. Syst Biol. 67(3):439–457.
- Wen D, Yu Y, Hahn MW, Nakhleh L. 2016. Reticulate evolutionary history and extensive introgression in mosquito species revealed by phylogenetic network analysis. Mol Ecol. 25:2361–2372.
- Wright S. 1943. Isolation by distance. Genetics 28:114–138.
- Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 24:1586–1591.
- Yang Z. 2015. The BPP program for species tree estimation and species delimitation. Curr Zool. 61:854–865.
- Yang Z, Flouri T. 2022. Estimation of cross-species introgression rates using genomic data despite model unidentifiability. Mol Biol Evol. 39:msac083.
- Yang Z, Zhu T. 2018. Bayesian selection of misspecified models is overconfident and may cause spurious posterior probabilities for phylogenetic trees. Proc Natl Acad Sci U S A. 115(8):1854–1859.
- Zhang C, Ogilvie HA, Drummond AJ, Stadler T. 2018. Bayesian inference of species networks from multilocus sequence data. Mol Biol Evol. 35:504–517.
- Zhu T, Yang Z. 2012. Maximum likelihood implementation of an isolation-with-migration model with three species for testing speciation with gene flow. Mol Biol Evol. 29:3131–3142.
- Zhu T, Yang Z. 2021. Complexity of the simplest species tree problem. Mol Biol Evol. 38:3993–4009.