Molecular Biology and Evolution. 2024 Jul 3;41(7):msae137. doi: 10.1093/molbev/msae137

Inference of Locus-Specific Population Mixtures from Linked Genome-Wide Allele Frequencies

Carlos S Reyna-Blanco 1,2,#, Madleina Caduff 3,4,#, Marco Galimberti 5,6,7,8, Christoph Leuenberger 9, Daniel Wegmann 10,11,
Editor: Aurelien Tellier
PMCID: PMC11255385  PMID: 38958167

Abstract

Admixture between populations and species is common in nature. Since the influx of new genetic material might be either facilitated or hindered by selection, variation in mixture proportions along the genome is expected in organisms undergoing recombination. Various graph-based models have been developed to better understand these evolutionary dynamics of population splits and mixtures. However, current models assume a single mixture rate for the entire genome and do not explicitly account for linkage. Here, we introduce TreeSwirl, a novel method for inferring branch lengths and locus-specific mixture proportions by using genome-wide allele frequency data, assuming that the admixture graph is known or has been inferred. TreeSwirl builds upon TreeMix that uses Gaussian processes to estimate the presence of gene flow between diverged populations. However, in contrast to TreeMix, our model infers locus-specific mixture proportions employing a hidden Markov model that accounts for linkage. Through simulated data, we demonstrate that TreeSwirl can accurately estimate locus-specific mixture proportions and handle complex demographic scenarios. It also outperforms related D- and f-statistics in terms of accuracy and sensitivity to detect introgressed loci.

Keywords: gene flow, admixture, introgression rate, Gaussian process, linkage, hidden Markov model

Introduction

Gene flow, the exchange of genetic material between populations or different species (Slatkin 1985a), can occur through various mechanisms, such as migration, admixture, hybridization, cross-fertilization, or even by the dispersal of diaspores and pollinators (Barton and Hewitt 1985; Ellstrand et al. 2003; Tung and Barreiro 2017; Burgarella et al. 2019). This exchange may play a significant role in the maintenance of genetic variation, but also in the adaptation to multiple ecological niches (Anderson 1949; Slatkin 1985b, 1987; Rieseberg and Wendel 1993; Barton 2001). At sufficient levels, gene flow can lead to homogenization of populations, particularly in the face of opposing genetic drift (Ellstrand 2014). Gene flow might also increase genetic variation at a much higher rate than mutation (Grant and Grant 1994) and impact the process of speciation by becoming a primary source of genetic diversity and adaptive novelty for a population (Ellstrand et al. 2003; Abbott et al. 2013). Several genetic analyses have shown that gene flow, both ancient and present, is a common phenomenon in nature (Grant and Grant 1992; Mallet 2005; Patterson et al. 2006; Wegmann and Excoffier 2010; Tung and Barreiro 2017; Marchi et al. 2022), and a bifurcating tree, representing population or species historical relationships, fails to account for it (Kulathinal et al. 2009; Sousa et al. 2009; Green et al. 2010; Durand et al. 2011). This led to the development of methods that use allele frequency data and graph-based models to infer population splits and test for the presence of gene flow between divergent populations or species (Patterson et al. 2012; Pickrell and Pritchard 2012; Yang et al. 2012; Eaton and Ree 2013; Lipson et al. 2013; Martin et al. 2013; Lipson et al. 2014; Kozak et al. 2021), which, for instance, confidently settled the long-standing question whether gene flow occurred between modern humans and archaic hominins. 
However, these methods assume a genome-wide gene flow rate per migration edge, which is unrealistic in the presence of selection. In theory, the effective gene flow may vary significantly along the genome because of selection and genetic drift (Yamamichi and Innan 2012), making it essential to quantify these variations to better understand the dynamics that lead to introgression (Racimo et al. 2015, 2017; Suarez-Gonzalez et al. 2018; Sankararaman 2020).

Introgression is a lasting consequence of gene flow that leads to the assimilation of variants into the local gene pool through repeated back-crossing, resulting in their permanent inclusion (Anderson and Hubricht 1938). When introgressed loci increase the fitness of the recipient population, this is known as “adaptive introgression”. Unlike neutral introgression, which can be lost over time due to drift, adaptive introgression is sustained by selection and can eventually lead to fixation (Zhang et al. 2021). But selection may also prevent introgression if it reduces fitness (Christe et al. 2016). While introgressed loci may be identified through explicit demographic modeling (Luqman et al. 2021), the classic way is via population genetic summary statistics. Patterson’s D, for example, has been estimated in sliding windows along the genome to identify introgressed loci (Dasmahapatra et al. 2012; Kronforst et al. 2013; Smith and Kronforst 2013; Rheindt et al. 2014; Fontaine et al. 2015). Since it was originally intended for genome-wide analysis (Martin et al. 2015), more suitable related statistics have been used for analyzing specific short genomic regions, such as $f_d$, $f_{dM}$, and $d_f$ (Malinsky et al. 2015; Martin et al. 2015; Pfeifer and Kapan 2019; Malinsky et al. 2021). There are other statistics, for instance, S* and its variants that use linkage disequilibrium information to detect long introgressed haplotypes (Plagnol and Wall 2006; Wall et al. 2009; Vernot and Akey 2014; Vernot et al. 2016; Browning et al. 2018) or ArchIE that combines diverse summary statistics to detect introgressed haplotypes without a reference (Durvasula and Sankararaman 2019, 2020). However, outlier scans based on such statistics are likely to ignore valuable information present in the full data, do not model linkage explicitly, or require an arbitrary choice of a large window size and of an outlier threshold.
To overcome these constraints, probabilistic frameworks such as hidden Markov models (HMMs) (Rabiner and Juang 1986; Prüfer et al. 2014; Seguin-Orlando et al. 2014; Skov et al. 2018; Steinrücken et al. 2018), and conditional random fields (CRFs) (Sankararaman et al. 2014) have been applied to infer the ancestry state of each site. These methods are extensions of models that infer local ancestry from genotyping data (Tang et al. 2006; Price et al. 2009; Wegmann et al. 2011; Lawson et al. 2012; Maples et al. 2013) and while explicitly accounting for demographic history and linkage, they rely on phased and training sequence data, unadmixed or archaic reference, and detailed demographic models. As a consequence, such approaches are not easily applicable to nonmodel species for which only limited data and knowledge is available.

To complement these methods, we here propose a model that makes use of Gaussian processes to infer locus-specific mixture proportions. Gaussian processes have a rather long history to model allele frequency differences between populations (Cavalli-Sforza and Edwards 1967; Felsenstein 1981), but have recently seen a surge in applications due to the development of the popular tool TreeMix (Pickrell and Pritchard 2012). Our method, TreeSwirl, explicitly takes an admixture graph (e.g. inferred by TreeMix) and genome-wide allele frequencies to infer locus-specific mixture proportions. To account for linkage, we make use of a HMM, wherein the hidden states are represented by the proportion of the mixture at a particular site and the observed data are represented by the sampled allele frequencies. To evaluate the performance of our method against other tools, we simulated data using various demographic models. We estimated the mixture proportions with TreeSwirl and computed related D- and f-statistics using D-suite Dinvestigate (Malinsky et al. 2021). Our findings revealed that TreeSwirl surpasses the summary statistics estimates in detecting the simulated signal of introgression under different scenarios, although at an additional computational cost. Furthermore, by applying TreeSwirl to real data cases, we successfully identified candidate genomic regions where migration rates fluctuate and may be subject to selection.

Materials and Methods

The Model

Consider a set of populations $m = 1, 2, \ldots, M$ that are linked by a graph $G$, which represents their population history in terms of population splits and migration events. Consider as well a series of diploid, biallelic loci $l = 1, 2, \ldots, L$, where the $L$ loci might constitute, for instance, consecutive SNPs (single nucleotide polymorphisms) along the genome. At each locus $l$, a total of $\mathbf{N}_l = (N_{l1}, \ldots, N_{lM})$ alleles have been observed across the $M$ populations, of which $\mathbf{n}_l = (n_{l1}, \ldots, n_{lM})$ were derived and the remaining ancestral (or otherwise polarized). To model the sampled allele counts $\mathbf{n}_l \mid \mathbf{N}_l$, we distinguish two processes: the first models the distribution of the vector of the actual but unknown population frequencies $\mathbf{y}_l = (y_{l1}, \ldots, y_{lM})$ given the graph $G$, and the second the distribution of the sampled allele counts $\mathbf{n}_l \mid \mathbf{N}_l$ given $\mathbf{y}_l$ (Fig. 1a).

Fig. 1.


Inference example. a) Admixture graph with two migration edges marked in different colors. Parameters of interest are shown on the graph (root prior and branch lengths) as well as the untransformed and transformed ancient, sampling and population allele frequency variables. b) All possible configurations of $\mathbf{b}$ for two migration events when they are open or closed. c) Example of inference under our TreeSwirl model for each migration event. The top panel shows the posterior mean mixture proportions $\hat{w}_l$ compared to simulated estimates and the bottom panel shows the identified candidate regions under possible selection, where the false discovery rates for excess ($q_e$) and dearth ($q_d$) introgression were determined for each locus as explained in the “Inference” section.

Evolution along the Graph G

We assume, as in Pickrell and Pritchard (2012), that the change in allele frequencies from the root to the tips of $G$ is modeled as a Brownian motion (BM) process. For each locus $l$, the BM process starts at the root of $G$ at an allele frequency that we denote by $\nu_l$. It proceeds along the branches of $G$ and finally gives rise to the above-mentioned random vector $\mathbf{y}_l$ at the leaves of $G$. The probability of $\mathbf{y}_l$ is given by the multivariate normal density

$$\pi(\mathbf{y}_l \mid \nu_l, G) = \mathcal{N}(\boldsymbol{\nu}_l, V(\nu_l)),$$

where $\boldsymbol{\nu}_l = (\nu_l, \ldots, \nu_l)$ is the mean vector and $V(\nu_l)$ is the variance–covariance matrix corresponding to the BM on $G$. For the construction of $V(\nu_l)$, which depends on the topology of $G$, the lengths of the branches in $G$ and the migration rates, we follow Pickrell and Pritchard (2012). We set

$$V(\nu_l) = \nu_l (1 - \nu_l)\, W_l, \tag{1}$$

where $W_l$ only depends on the graph topology, the branch lengths and the migration rates.

However, it has long been recognized that BM with constant variance does not adequately describe allele frequency changes, especially close to the boundaries, and various transformations to alleviate the problem have been proposed (Felsenstein 1981). Here, we will consider the transformation

$$\mu_l = \arcsin(2\nu_l - 1) \tag{2}$$

from the interval $[0, 1]$ onto $[-\pi/2, \pi/2]$. This has the advantage that all factors of $\nu_l (1 - \nu_l)$ in front of the variance matrices will be canceled. We thus replace equation (1) by

$$W_l = \left(\frac{d\mu_l}{d\nu_l}\right)^2 V(\nu_l). \tag{3}$$

Let $\mathbf{x}_l = (x_{l1}, \ldots, x_{lM})$, $x_{lm} = \arcsin(2 y_{lm} - 1)$, denote the transformed population allele frequencies. The distribution of $\mathbf{x}_l$ thus follows the multivariate normal density

$$\pi(\mathbf{x}_l \mid \mu_l, G) = \mathcal{N}(\boldsymbol{\mu}_l, W_l) \tag{4}$$

with $\boldsymbol{\mu}_l = (\mu_l, \ldots, \mu_l) = \mu_l \mathbf{1}$.

The matrix $W_l$ is constructed as follows. Let $G$ be a rooted population graph with $K$ oriented branches $k = 1, \ldots, K$ of length $c_k$, $\mathbf{c} = (c_1, \ldots, c_K)$; the orientation of the branches points in the direction of the leaves. We assume that the graph also contains $I$ oriented migration edges $\tau_i$, $i = 1, \ldots, I$, to which we assign no branch length. The migration edges should be placed such that there are no cycles in the graph. However, we allow for bidirectional migration edges, which, to avoid cycles, we model effectively as two migration edges, of which the starting point of one precedes the end point of the other by an infinitesimally small branch.

We now consider paths leading from the root of the graph to a leaf taking some of the migration edges (open edges) and leaving others out (closed edges). More precisely, let

$$\mathbf{b} = (b_1, \ldots, b_I)$$

be a binary vector indicating a certain configuration of open and closed migration edges: a bit $b_i = 1$ indicates that the migration edge $\tau_i$ is open and $b_i = 0$ that the migration edge $\tau_i$ is closed (Fig. 1b). We denote by $w_{li}$ the migration rate, i.e. the probability of edge $\tau_i$ to be open, and thus we assign to the configuration $\mathbf{b}$ the probability

$$w_l(\mathbf{b}) = \prod_{i=1}^{I} w_{li}^{b_i} (1 - w_{li})^{1 - b_i}. \tag{5}$$

Now, for a given configuration $\mathbf{b}$, pick a population (leaf) $m$ and a branch $k$. There is at most one path leading from the root to the population $m$ and taking exactly the open migration edges according to $\mathbf{b}$. If, moreover, this path contains the branch $k$, we set the indicator function $I_{mk}(\mathbf{b})$ equal to 1. Otherwise we set $I_{mk}(\mathbf{b}) = 0$.

Using this notation, we can now define the $M \times M$ matrices $J_{lk}$ for each branch $k$ element-wise by

$$[J_{lk}]_{mn} = \sum_{\mathbf{b}} w_l(\mathbf{b})\, I_{mk}(\mathbf{b}) \sum_{\mathbf{b}'} w_l(\mathbf{b}')\, I_{nk}(\mathbf{b}'), \tag{6}$$

where each sum runs over all the $2^I$ possible configurations of $\mathbf{b}$ and $\mathbf{b}'$, respectively. Each matrix $J_{lk}$ thus reflects the probabilities that branch $k$ was common for any pair of leaves.

Finally, the matrix $W_l$ is given by

$$W_l(\mathbf{w}_l) = \sum_{k=1}^{K} c_k J_{lk}. \tag{7}$$

This construction of the variance matrix $W_l(\mathbf{w}_l)$ is a generalized reformulation of an argument given in Pickrell and Pritchard (2012).

To unclutter the notation, we will use $W_l = W_l(\mathbf{w}_l)$ in the rest of this article and thus not indicate its dependence on the migration rates $\mathbf{w}_l = (w_{l1}, \ldots, w_{lI})$.
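The construction in equations (5) to (7) can be illustrated with a minimal sketch. It is not the TreeSwirl implementation: the toy graph, the dictionaries `parent` and `migration`, and the function names are hypothetical. It also assumes one particular reading of "taking exactly the open migration edges": a lineage traced from a leaf towards the root follows a migration edge whenever it reaches the end point of an open one, and the tree branch otherwise.

```python
from itertools import product

def path_branches(leaf, b, parent, migration):
    """Indices of the branches on the root-to-leaf path selected by configuration b."""
    branches, node = set(), leaf
    while node in parent:                      # the root has no parent entry
        if node in migration and b[migration[node][0]] == 1:
            node = migration[node][1]          # open edge: jump to its source (no length)
        else:
            node, k = parent[node]             # closed or no edge: follow the tree branch
            branches.add(k)
    return branches

def build_W(leaves, c, w, parent, migration):
    """W = sum_k c_k J_k for branch lengths c and migration rates w (one locus)."""
    K, I = len(c), len(w)
    coef = {m: [0.0] * K for m in leaves}      # coef[m][k] = sum_b w(b) I_mk(b)
    for b in product((0, 1), repeat=I):        # all 2^I configurations
        prob = 1.0
        for i in range(I):                     # equation (5)
            prob *= w[i] if b[i] else 1.0 - w[i]
        for m in leaves:
            for k in path_branches(m, b, parent, migration):
                coef[m][k] += prob
    # equations (6) and (7)
    return [[sum(c[k] * coef[m][k] * coef[n][k] for k in range(K))
             for n in leaves] for m in leaves]

# Hypothetical toy graph: root -> X -> A and root -> Y -> B,
# with one migration edge (index 0) from X into Y.
parent = {"X": ("root", 0), "A": ("X", 1), "Y": ("root", 2), "B": ("Y", 3)}
migration = {"Y": (0, "X")}                    # end point -> (edge index, source)
c = [0.1, 0.2, 0.3, 0.4]
W = build_W(["A", "B"], c, [0.25], parent, migration)
```

With a migration rate of 0.25, the covariance between A and B in this toy graph is 0.25 times the length of the shared branch, as expected when a quarter of the ancestry of B passes through X.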

Sampling

We assume that the observed allele counts $n_{lm}$ at locus $l$ and population $m$ follow a binomial distribution with parameters $N_{lm}$ and $y_{lm}$, where $y_{lm}$ is the true allele frequency in population $m$. By independence of the samples, we have

$$\pi(\mathbf{n}_l \mid \mathbf{y}_l) = \prod_{m=1}^{M} \mathrm{Bin}(n_{lm} \mid N_{lm}, y_{lm}). \tag{8}$$

If the sample sizes are sufficiently large, we can approximate this distribution by a multivariate normal density. Let $\mathbf{f}_l = (f_{l1}, \ldots, f_{lM})$ with $f_{lm} = n_{lm} / N_{lm}$ denote the observed allele frequencies at locus $l$, which are approximately normally distributed with mean $\mathbf{y}_l$ and a diagonal variance–covariance matrix:

$$\operatorname{diag}\left[\frac{y_{l1}(1 - y_{l1})}{N_{l1}}, \ldots, \frac{y_{lM}(1 - y_{lM})}{N_{lM}}\right]. \tag{9}$$

The transformed observed allele frequencies $\mathbf{d}_l = (d_{l1}, \ldots, d_{lM})$ with $d_{lm} = \arcsin(2 f_{lm} - 1)$ are then approximated by the multivariate normal density

$$\pi(\mathbf{d}_l \mid \mathbf{x}_l) \approx \mathcal{N}(\mathbf{x}_l, \Sigma_l) \tag{10}$$

with

$$\Sigma_l = \operatorname{diag}\left[\frac{1}{N_{l1}}, \ldots, \frac{1}{N_{lM}}\right]$$

because the factors $y_{lm}(1 - y_{lm})$ are transformed away from the variance–covariance matrix (equation (9)) similar to equation (3).
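The claim that the arcsine transformation removes the dependence of the sampling variance on the true frequency can be checked numerically. The sketch below is only an illustration under assumed values (sample size 100, three arbitrary frequencies, a fixed random seed); the function names are hypothetical.

```python
import math
import random

def arcsine_transform(f):
    """Map a frequency in [0, 1] to [-pi/2, pi/2], as in equation (2)."""
    return math.asin(2.0 * f - 1.0)

def simulate_transformed(y, N, reps, rng):
    """Draw binomial counts with frequency y and size N; return transformed frequencies."""
    out = []
    for _ in range(reps):
        n = sum(1 for _ in range(N) if rng.random() < y)   # one binomial draw
        out.append(arcsine_transform(n / N))
    return out

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(1)
N = 100
# Empirical variance of the transformed frequencies for several true frequencies.
var_by_y = {y: variance(simulate_transformed(y, N, 5000, rng))
            for y in (0.2, 0.5, 0.8)}
```

For all three frequencies the empirical variance should be close to $1/N = 0.01$, whereas the untransformed variance $y(1-y)/N$ would range from 0.0016 to 0.0025.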

Full Likelihood for One Locus

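Given the ancestral frequency $\mu_l$, we obtain the likelihood by combining equation (4) and equation (10) and integrating out $\mathbf{x}_l$:

$$\pi(\mathbf{d}_l \mid \mu_l, G) = \int \pi(\mathbf{d}_l \mid \mathbf{x}_l)\, \pi(\mathbf{x}_l \mid \mu_l, G)\, d\mathbf{x}_l. \tag{11}$$

Using well-known formulae for linear systems (see Theorem 4.4.1 in Murphy 2012), we obtain for the likelihood (equation (11)) the following approximation:

$$\pi(\mathbf{d}_l \mid \mu_l, G) \approx \mathcal{N}(\boldsymbol{\mu}_l, \Sigma_l + W_l). \tag{12}$$

We now set a normal prior on $\mu_l$, namely we assume that

$$\pi(\mu_l) = \mathcal{N}(\mu, \sigma^2).$$

Again from Theorem 4.4.1 in Murphy (2012) we conclude that

$$\pi(\mathbf{d}_l \mid \mu_l, \sigma^2, G) = \mathcal{N}(\boldsymbol{\mu}_l, S_l) \tag{13}$$

with

$$\boldsymbol{\mu}_l = \mu_l \mathbf{1}, \quad S_l = \Sigma_l + W_l + \sigma^2 \mathbf{1}\mathbf{1}^\top. \tag{14}$$

Explicitly

$$\pi(\mathbf{d}_l \mid \mu_l, \sigma^2, G) \approx \frac{1}{\sqrt{(2\pi)^M |S_l|}} \exp\left[-\frac{(\mathbf{d}_l - \mu_l \mathbf{1})^\top S_l^{-1} (\mathbf{d}_l - \mu_l \mathbf{1})}{2}\right]. \tag{15}$$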

We note that the above model accounts for variation in sample size among loci, but assumes that at least one sample was observed per locus and population.
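As an illustration of equations (14) and (15), the following sketch evaluates the log of the marginal density for one locus with a hand-written Cholesky factorization. It is a minimal sketch rather than the authors' code: the function names are hypothetical, and `mu` stands for the scalar that multiplies the all-ones vector in equation (14).

```python
import math

def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_marginal(d, N, W, mu, sigma2):
    """Log of the multivariate normal density with mean mu*1 and covariance S."""
    M = len(d)
    # S = Sigma + W + sigma^2 * 1 1^T, with Sigma = diag(1/N_m); equation (14)
    S = [[W[m][n] + sigma2 + (1.0 / N[m] if m == n else 0.0)
          for n in range(M)] for m in range(M)]
    L = cholesky(S)
    r = [d[m] - mu for m in range(M)]
    y = []                                     # forward substitution: L y = r
    for i in range(M):
        y.append((r[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in y)               # r^T S^{-1} r
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(M))
    return -0.5 * (M * math.log(2.0 * math.pi) + logdet + quad)

# Hypothetical values for a single locus and two populations.
example = log_marginal([0.2, -0.1], [100, 50],
                       [[0.05, 0.01], [0.01, 0.07]], mu=0.1, sigma2=0.3)
```

Working on the log scale avoids underflow when many populations are analyzed; the emission probabilities of the HMM can then be obtained by exponentiating after subtracting a per-locus maximum.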

Hidden Markov Model

We develop a HMM for multiple loci $l = 1, \ldots, L$ with varying migration rates for each of the $I$ migration edges of graph $G$. We assume that the locus- and edge-specific migration rates $w_{li}$ take values out of a small set of discrete numbers between 0 and 1:

$$w_{li} \in \{w_{i1}, w_{i2}, \ldots, w_{iJ_i}\}.$$

We thus have $J_1 J_2 \cdots J_I$ possible combinations and these combinations will constitute the hidden states of our Markov model. We denote the hidden state at locus $l$ by $z_l$. Each state $z_l$ corresponds to a multiindex

$$\mathbf{j} = (j_1, j_2, \ldots, j_I)$$

that defines the migration values $(w_{1j_1}, \ldots, w_{Ij_I})$ of the migration edges. Thus, knowing the state $z_l$ is tantamount to knowing the combination of migration rates at the given site, which in turn determines the matrix $W$ in equation (7) via equation (5) and equation (6).
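The hidden-state space can be enumerated directly as the Cartesian product of the per-edge grids. The sketch below assumes evenly spaced grids between 0 and 1, which the text does not specify; `migration_grid` and `hidden_states` are hypothetical names.

```python
from itertools import product

def migration_grid(J):
    """J evenly spaced migration rates between 0 and 1 (an assumed discretization)."""
    return [j / (J - 1) for j in range(J)]

def hidden_states(grids):
    """All multi-indices j = (j_1, ..., j_I) with their migration rates."""
    index_ranges = [range(len(g)) for g in grids]
    return [(j, tuple(g[ji] for g, ji in zip(grids, j)))
            for j in product(*index_ranges)]

# Two migration edges with J_1 = 3 and J_2 = 5 discrete rates.
grids = [migration_grid(3), migration_grid(5)]
states = hidden_states(grids)   # J_1 * J_2 = 15 hidden states
```

Because the number of states is the product of the grid sizes, it grows quickly with the number of migration edges: two edges with 21 rates each already yield 441 states.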

To account for linkage between loci, we assume that the locus-specific transition matrix $P(z_l = \mathbf{j}' \mid z_{l-1} = \mathbf{j})$ is based on physical or genetic distances $\delta_l$ between loci. We assume independence of the transition probabilities of the different migration edges:

$$P(z_l = \mathbf{j}' \mid z_{l-1} = \mathbf{j}) = P_l(\mathbf{j}, \mathbf{j}') = \prod_{i=1}^{I} P_{li}(j_i, j_i').$$

Each one of the factors in this product is an element of a ladder-type Markov matrix $P_{li}$, which is defined via a transition rate matrix $\kappa_i \Lambda_i$:

$$P_{li} = e^{\delta_l \kappa_i \Lambda_i}. \tag{16}$$

Here, $\kappa_i$ is a positive scaling parameter pertaining to migration edge $i$, and the distances $\delta_l$ are known constants corresponding to the linkage distances. Further, the $J_i \times J_i$ matrices $\Lambda_i$ reflect a transition model for infinitesimal steps.

We consider two transition models: The first is a standard ladder-type model in which transitions are only allowed to neighboring states and at equal rates:

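$$\Lambda_i = \begin{pmatrix}
-1 & 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
1 & -2 & 1 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 & -2 & 1 & \cdots & 0 & 0 & 0 \\
\vdots & & & & \ddots & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & 0 & 1 & -1
\end{pmatrix}. \tag{17}$$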

Under this transition matrix, only κi is inferred, for which we maximize the Q-function using a linear search.
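To make equations (16) and (17) concrete, the sketch below builds the ladder-type rate matrix and exponentiates it with a plain scaling-and-squaring Taylor series. This is a swapped-in, dependency-free technique chosen for illustration; in practice a library routine such as `scipy.linalg.expm` would be the usual choice, and nothing here reflects how TreeSwirl computes the exponential. All function names are hypothetical.

```python
import math

def ladder_rate_matrix(J):
    """Rate matrix of equation (17): unit rates to neighboring states only."""
    Lam = [[0.0] * J for _ in range(J)]
    for j in range(J):
        if j > 0:
            Lam[j][j - 1] = 1.0
        if j < J - 1:
            Lam[j][j + 1] = 1.0
        Lam[j][j] = -sum(Lam[j])               # rows of a rate matrix sum to zero
    return Lam

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(A, terms=20):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    n = len(A)
    norm = max(sum(abs(x) for x in row) for row in A)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    B = [[x / 2.0 ** squarings for x in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for t in range(1, terms + 1):              # term = B^t / t!
        term = [[x / t for x in row] for row in mat_mul(term, B)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):                 # undo the scaling
        result = mat_mul(result, result)
    return result

def transition_matrix(delta, kappa, Lam):
    """P = exp(delta * kappa * Lambda), equation (16)."""
    return mat_exp([[delta * kappa * x for x in row] for row in Lam])

P = transition_matrix(delta=0.5, kappa=2.0, Lam=ladder_rate_matrix(5))
```

Each row of `P` sums to one, and loci that are close together (small `delta`) yield a matrix near the identity, so neighboring loci tend to share the same migration rate.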

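The second is also a ladder-type model, but which includes an attractor state $a_i \in \{w_{i1}, \ldots, w_{iJ_i}\}$ reflecting the background migration rate. Similar to Galimberti et al. (2020), we use two parameters to describe the fraction of loci deviating from the attractor state and the degree of that deviation, but choose a slightly different parametrization. Specifically, and given the two parameters $\phi_i$ and $\zeta_i$, we have

$$\Lambda_i = \begin{pmatrix}
-1 & 1 & 0 & 0 & \cdots & 0 & 0 & 0 & 0 \\
1 - \zeta_i & -2 & 1 + \zeta_i & 0 & \cdots & 0 & 0 & 0 & 0 \\
0 & 1 - \zeta_i & -2 & 1 + \zeta_i & \cdots & 0 & 0 & 0 & 0 \\
\vdots & & & & \ddots & & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 + \zeta_i & -2 & 1 - \zeta_i & 0 \\
0 & 0 & 0 & 0 & \cdots & 0 & 1 + \zeta_i & -2 & 1 - \zeta_i \\
0 & 0 & 0 & 0 & \cdots & 0 & 0 & 1 & -1
\end{pmatrix} \tag{18}$$

with the attractor row given by

$$(0 \;\cdots\; 0 \;\; \phi_i \;\; -2\phi_i \;\; \phi_i \;\; 0 \;\cdots\; 0).$$

Note that the $\kappa_i$, $\phi_i$, and $\zeta_i$ all must be strictly positive. However, we limit $\phi_i$ and $\zeta_i$ to the range $(0, 1]$ to ensure that the stationary probability of the attractor state $a_i$ is higher than for any other state.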
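The effect of the attractor can be illustrated by building the rate matrix of equation (18) and computing its stationary distribution. The sketch below makes several assumptions not stated in the text: the attractor is given as a zero-based state index, it is an interior state, and $\zeta_i < 1$ so that all rates are positive and the stationary distribution follows from detailed balance. The function names are hypothetical.

```python
def attractor_rate_matrix(J, a, phi, zeta):
    """Ladder-type rate matrix with an attractor at interior state index a."""
    if not 1 <= a <= J - 2:
        raise ValueError("this sketch assumes an interior attractor state")
    Lam = [[0.0] * J for _ in range(J)]
    for j in range(J):
        if j == 0:
            down, up = 0.0, 1.0                # first row: (-1, 1, 0, ...)
        elif j == J - 1:
            down, up = 1.0, 0.0                # last row: (..., 0, 1, -1)
        elif j < a:
            down, up = 1.0 - zeta, 1.0 + zeta  # below the attractor: drift upwards
        elif j == a:
            down, up = phi, phi                # attractor row: (phi, -2 phi, phi)
        else:
            down, up = 1.0 + zeta, 1.0 - zeta  # above the attractor: drift downwards
        if j > 0:
            Lam[j][j - 1] = down
        if j < J - 1:
            Lam[j][j + 1] = up
        Lam[j][j] = -(down + up)
    return Lam

def stationary_distribution(Lam):
    """Stationary distribution of a birth-death chain via detailed balance.

    Requires all rates towards lower states to be positive (zeta < 1 here).
    """
    pi = [1.0]
    for j in range(len(Lam) - 1):
        pi.append(pi[-1] * Lam[j][j + 1] / Lam[j + 1][j])
    total = sum(pi)
    return [p / total for p in pi]

Lam = attractor_rate_matrix(J=7, a=3, phi=0.5, zeta=0.4)
pi = stationary_distribution(Lam)
```

In this example the stationary probability peaks at state 3, the attractor, and decays on either side, which is the behavior the restriction of $\phi_i$ and $\zeta_i$ to $(0, 1]$ is meant to guarantee.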

Finally, the emission probabilities are generated via the marginal likelihood (equation (15)):

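$$P(\mathbf{d}_l \mid z_l = \mathbf{j}) = \pi(\mathbf{d}_l \mid \mu_l, \sigma^2, G_{\mathbf{j}}), \tag{19}$$

where $G_{\mathbf{j}}$ denotes the population graph with migration rates according to the state $z_l = \mathbf{j}$ and $\mu_l$ is the root state at site $l$.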

Inference

We developed an empirical Bayes inference scheme for the hidden states under the assumption that the topology of the admixture graph is either known or was previously obtained. Specifically, we first infer both the emission and transition probabilities using the Baum–Welch algorithm (Baum et al. 1970) and then posterior state probabilities under the inferred parameters. We implemented the SQUAREM inference scheme (Varadhan and Roland 2008) for accelerated convergence of the Baum–Welch algorithm. As detailed in the Supplementary Information (see section “Inference”), the Baum–Welch algorithm requires numerical optimization in each iteration. While the parameter of the root prior $\mu$ can be optimized analytically, we resort to Newton–Raphson optimization (Nocedal and Wright 2006; Lange 2010) for the root prior $\sigma^2$ and for parameters of the population graph (i.e. the branch lengths $\mathbf{c}$) and to Nelder–Mead optimization (Nelder and Mead 1965) for the parameters regarding the transition matrices with attractors (i.e. the $\kappa_i$, $\phi_i$, $\zeta_i$) or a linear search for transition matrices with no attractors (i.e. the $\kappa_i$).
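The posterior state probabilities mentioned above are obtained with a forward–backward pass. The following is a textbook, scaled forward–backward sketch with locus-specific transition matrices; it is not the stattools implementation, it omits the Baum–Welch parameter updates and the SQUAREM acceleration, and the function name and toy numbers are hypothetical.

```python
def forward_backward(emissions, transitions, initial):
    """Posterior state probabilities P(z_l = j | d) for a discrete HMM.

    emissions[l][j]      = P(d_l | z_l = j)
    transitions[l][j][k] = P(z_l = k | z_{l-1} = j) for l >= 1 (entry 0 is unused)
    initial[j]           = P(z_0 = j)
    """
    L, S = len(emissions), len(initial)
    alpha, prev = [], None
    for l in range(L):                          # forward pass, rescaled per locus
        if l == 0:
            a = [initial[j] * emissions[0][j] for j in range(S)]
        else:
            a = [emissions[l][k] * sum(prev[j] * transitions[l][j][k] for j in range(S))
                 for k in range(S)]
        norm = sum(a)
        prev = [x / norm for x in a]
        alpha.append(prev)
    beta = [[1.0] * S for _ in range(L)]
    for l in range(L - 2, -1, -1):              # backward pass, rescaled per locus
        b = [sum(transitions[l + 1][j][k] * emissions[l + 1][k] * beta[l + 1][k]
                 for k in range(S)) for j in range(S)]
        norm = sum(b)
        beta[l] = [x / norm for x in b]
    posterior = []
    for l in range(L):                          # combine and normalize
        g = [alpha[l][j] * beta[l][j] for j in range(S)]
        norm = sum(g)
        posterior.append([x / norm for x in g])
    return posterior

# Toy example with two hidden states and three loci.
T = [[0.8, 0.2], [0.3, 0.7]]
em = [[0.9, 0.2], [0.1, 0.7], [0.5, 0.5]]
init = [0.5, 0.5]
post = forward_backward(em, [None, T, T], init)
```

Rescaling the forward and backward variables at every locus keeps the recursion numerically stable for chromosomes with many loci, at the cost of not returning the likelihood directly (it can be recovered from the forward normalization constants).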

The Baum–Welch algorithm may be sensitive to initial conditions. We obtain initial estimates of all parameter values as follows (see Supplementary Information for more details):

  1. We use the observed variance–covariance matrix of the transformed observed frequencies as an initial guess of the variance–covariance matrix W.

  2. To account for variation in $W$ among loci, we refine this initial estimate using a Gaussian mixture model (GMM) under which the transformed observed frequencies are modeled by one of $r = 1, \ldots, R$ multivariate Gaussian distributions with variance–covariance matrices $W_r$ but shared root priors $\mu$ and $\sigma^2$. This model assumes no constraints regarding the structure of the $W_r$ and can be optimized with an expectation–maximization (EM) algorithm with analytic updates. The hidden states $\mathbf{s} = (s_1, \ldots, s_L)$ with $s_l \in \{1, \ldots, R\}$ are assumed to follow a Categorical distribution $s_l \sim \mathrm{Cat}(\boldsymbol{\pi})$ with $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_R)$.

  3. We next use a Nelder–Mead algorithm to coerce the inferred variance–covariance matrices $W_1, \ldots, W_R$ onto the population graph. Specifically, we seek to find the set of branch lengths $\mathbf{c}$ and partition-specific migration rates $\mathbf{w}_r$ that best explain the previously learned variance–covariance matrices using the weighted residual sum of squares (RSS). To mitigate convergence problems, we repeat the Nelder–Mead optimization $C$ times and use different initial values each time. We set $C = 1{,}000$ by default.

  4. To determine the optimal number of mixture components $R$, we repeat steps 2 and 3 with an increasing number of mixture components $R = 1, 2, 3, 4, \ldots$ as long as (i) each $\pi_r$ is larger than some threshold, which we set to 0.2 by default, and (ii) the weighted RSS is lower than the weighted RSS of $R - 1$.

  5. We then run a Baum–Welch algorithm to infer the branch lengths $\mathbf{c}$, the root prior $\mu$ and $\sigma^2$ and the transition parameters $\kappa_i$ for the attractor-free, ladder-type transition matrix (equation (17)). Upon convergence, we identify the top $T$ states with the highest average posterior probabilities across all loci, and consider each of these states as a potential attractor. We set $T = 5$ by default.

  6. To identify the best attractor, we then run individual Baum–Welch optimization for each potential attractor, thereby identifying the best values for the branch lengths $\mathbf{c}$, the root prior $\mu$ and $\sigma^2$ and the transition parameters $\kappa_i$, $\phi_i$, and $\zeta_i$ for the full transition matrix (equation (18)). The final result reported to the user is the one obtained with the transition matrix that resulted in the highest likelihood, which may be either the ladder-type matrix or one with a specific attractor state.

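Once maximum likelihood estimates for the branch lengths $\mathbf{c}$, the transition parameters $\kappa_i$, $\phi_i$, $\zeta_i$, and $a_i$ as well as the root prior $\mu$ and $\sigma^2$ are obtained, we infer state posterior probabilities $P(z_l \mid \mathbf{d}, \theta)$ given the full data $\mathbf{d}$ and the learned parameters collectively denoted by $\theta$, see Fig. 1c. We further determined the posterior mean migration rates as

$$\bar{w}_{il} = \sum_{\mathbf{j}} w_{ij_i}\, P(z_l = \mathbf{j} \mid \mathbf{d}, \theta). \tag{20}$$

To identify regions exhibiting either an excess or a dearth of introgression compared to the genome-wide average, which are hence candidate regions to have experienced selection, we summarized these posterior probabilities as

$$P(z_l > a_i \mid \mathbf{d}, \theta) = \sum_{\mathbf{j}} I(j_i > a_i)\, P(z_l = \mathbf{j} \mid \mathbf{d}, \theta), \quad P(z_l < a_i \mid \mathbf{d}, \theta) = \sum_{\mathbf{j}} I(j_i < a_i)\, P(z_l = \mathbf{j} \mid \mathbf{d}, \theta),$$

where $I(\cdot)$ denotes the indicator function. We then determined for each locus $l$ the false discovery rates (FDRs) for excess [$q_e(l)$] and dearth [$q_d(l)$] introgression as

$$q_e(l) = 1 - P(z_l > a_i \mid \mathbf{d}, \theta), \quad q_d(l) = 1 - P(z_l < a_i \mid \mathbf{d}, \theta).$$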
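For a single migration edge, these summaries reduce to a few sums over the posterior state probabilities at a locus. The sketch below assumes that the posterior has already been marginalized over all other edges and that the attractor is given as a zero-based index into the grid of migration rates; `summarize_edge` and the numbers are hypothetical.

```python
def summarize_edge(posterior, grid, attractor):
    """Posterior mean migration rate and FDRs for one edge at one locus.

    posterior[j] = P(j_i = j | d, theta), marginalized over the other edges
    grid[j]      = migration rate of state j
    attractor    = index of the attractor state a_i
    """
    mean = sum(w * p for w, p in zip(grid, posterior))                 # equation (20)
    p_excess = sum(p for j, p in enumerate(posterior) if j > attractor)
    p_dearth = sum(p for j, p in enumerate(posterior) if j < attractor)
    return {"mean": mean, "q_excess": 1.0 - p_excess, "q_dearth": 1.0 - p_dearth}

summary = summarize_edge(posterior=[0.1, 0.6, 0.2, 0.1],
                         grid=[0.0, 0.1, 0.2, 0.3],
                         attractor=1)
```

In this toy locus most posterior mass sits on the attractor, so both FDRs are high (0.7 for excess, 0.9 for dearth) and the locus would not be flagged as a candidate.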

Implementation

We implemented the proposed inference scheme in C++ as a user-friendly command-line tool TreeSwirl. Our tool is open-source and available through a git-repository at bitbucket.org/wegmannlab/treeswirl, along with a documentation and a custom R package to visualize the results. Our implementation makes heavy use of the HMM framework of the statistical library stattools available at bitbucket.org/wegmannlab/stattools.

To streamline computations, we employ a straightforward clustering method to reduce the number of sampling size variance matrices Σl that need to be considered:

  1. We sort the vector of sample sizes according to the frequency of each occurrence.

  2. To cluster, we identify the pair of vectors with the least occurrences and compute their weighted average.

  3. We retain the weighted vector of sample sizes, remove the pair, and update the occurrence count as the sum of the deleted pair counts.

  4. We repeat steps 1 to 3 until the desired number of Σl is obtained.
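The clustering steps above can be sketched as follows. This is an illustrative reading of the four steps rather than the TreeSwirl code; the function name, the tie-breaking rule (by vector value), and the example sample sizes are assumptions.

```python
from collections import Counter

def cluster_sample_sizes(vectors, target):
    """Merge the rarest sample-size vectors until only `target` distinct vectors remain."""
    if target < 1:
        raise ValueError("target must be at least 1")
    counts = Counter(tuple(float(x) for x in v) for v in vectors)
    while len(counts) > target:
        # Step 1: sort by occurrence; step 2: take the two rarest vectors
        # (ties are broken by vector value to keep the result deterministic).
        (v1, c1), (v2, c2) = sorted(counts.items(),
                                    key=lambda item: (item[1], item[0]))[:2]
        merged = tuple((a * c1 + b * c2) / (c1 + c2) for a, b in zip(v1, v2))
        # Step 3: replace the pair by its weighted average and sum the counts.
        del counts[v1], counts[v2]
        counts[merged] += c1 + c2
    return dict(counts)

# Per-locus sample sizes for two populations; a few loci have missing data.
vectors = [(100, 100)] * 50 + [(98, 100)] * 3 + [(90, 100)]
clusters = cluster_sample_sizes(vectors, target=2)
```

Here the two rare vectors are merged into (96.0, 100.0) with a count of 4, leaving two distinct sample-size vectors and hence two distinct matrices $\Sigma_l$.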

Given a limited number $u$ of such matrices and given that we use a finite number of discrete migration rates, there is only a finite number of matrices $S_l$, which can be precomputed in each Baum–Welch iteration to speed up the forward–backward pass through the HMM.

Simulations

Simulations under Model Assumptions

To demonstrate the power and limitations of our methods with respect to different demographic scenarios, we simulated data under the TreeSwirl model for population graphs of four populations under four different topologies, visualized in Fig. 2. In the first two cases, we designed a tree with a single migration event starting ancestral to Population 2 and directed to either Population 3 (Fig. 2a) or ancestral to Population 3 (Fig. 2b). The remaining two cases had two migration events, either bidirectional migration ancestral to the Populations 2 and 3 (Fig. 2c), or two unrelated events from ancestral of Population 2 and 4 to ancestral of Populations 3 and 1, respectively (Fig. 2d). For each tree, we simulated a single chromosome and considered both a scenario with a constant, genome-wide migration rate, as well as a scenario where the migration rate varied as shown in Fig. 2. For all cases, we simulated allele counts under the TreeSwirl model with $N = 100$, $\mu = 0.5$, $\sigma^2 = 0.3$, and $J = 21$. We ran 20 replicates each.

Fig. 2.


Performance of TreeSwirl as assessed by simulations under the TreeSwirl model. Shown are, for each of four cases of migration scenarios, the population graph (first column), estimates of locus-specific migration rates (posterior means, second and third column) and estimates of selected branch lengths (fourth and fifth column) for a case with variable (third and fifth column) and for a case without (second and fourth column) variation in migration rates along the simulated chromosome. Colors correspond to the migration edges and branch lengths of the same color in the population graph. Simulated migration rates are shown as a thick line and the estimates as a thin line per replicate. For branch lengths the simulated values are shown as a cross and estimates as open circles.

fastsimcoal2

To compare TreeSwirl to competing methods, we used fastsimcoal2 (Excoffier et al. 2021) to simulate genomic data under seven different demographic scenarios only consisting of population splits and admixture pulses (but no population growth or continuous migration, Fig. 3). We maintained a constant effective population size of $N_e = 10{,}000$ and used a sample size of $N = 100$ for each population in all cases.

Fig. 3.


Performance of TreeSwirl and f4-stats methods to measure the amount of introgression under different demographic histories with a background migration rate $\alpha_b = 0.05$. First column: simulated demographic histories. Second column: schematic of the topology models. Third column: AUC measures for TreeSwirl to identify outlier loci (elevated migration compared to background rate) as a function of the peak width (symbols, in number of blocks) and peak migration (x-axis). Fourth column: AUC measure for the combination of summary statistic (Fst, D- or f-related), window size and population subset (in case E only) that led to the highest AUC.

To simulate variation in admixture pulses along chromosomes, we simulated chromosomes of 10 Mb in length, each composed of independent DNA segments of length 1,000 bp with a per-locus recombination and mutation rate of $10^{-8}$, and a transition rate of 0.33. For each chromosome, we simulated 7,000 blocks with a background migration rate of 0.1, 1,000 blocks with the peak migration rate (either 0.2, 0.3, 0.4, 0.5, or 0.6) and 2,000 blocks with intermediate migration rate. The blocks with elevated migration rates were grouped into peaks, each consisting of $m$ blocks at intermediate migration, $m$ blocks at peak migration, and again $m$ blocks at intermediate migration. We distributed the blocks evenly along the chromosome and considered peak widths of $m = 50$, 100, 200, and 500 blocks. In case we simulated two migration events, the second migration rates were constructed analogously but with all migration rates increased by 0.1 and the peaks arranged such that some peaks of the two migration events overlap fully, some partially and some not at all.

We generated ten replicates for each parameter combination and used a custom script to transform the generated output files into standard VCF (variant call format) files and to concatenate the 10,000 blocks corresponding to a single chromosome. Unless specified otherwise, we applied a minimum minor allele frequency filter of MAF = 0.05 with VCFtools (Danecek et al. 2011), which resulted in a variable but rather low number (0 to 3) of polymorphic loci per block and thus in data similar to what would be obtained after LD-pruning.

The filtered VCFs served as input for estimating sliding window Fst for simulated data only consisting of two or three populations as well as for running D-suite Dinvestigate (Malinsky et al. 2021) with varying window sizes s=(10,50,100,150,200,250,300,350,400,450,500), a sliding locus of 1, and the true trio and corresponding outgroup for demographic scenarios with more than three populations. Concurrently, we executed TreeSwirl using the same filtered data and the expected tree topology.

We employed a receiver operating characteristic (ROC) curve analysis to assess the area under the curve (AUC), which summarizes the performance of the method in identifying loci with elevated migration rates. For the ROC analysis, we used the estimated mean posteriors obtained from TreeSwirl, along with the computed values of Fst, Patterson’s D, $f_d$, $f_{dM}$, and $d_f$, and compared them to the true migration rates. For each comparison, we used the statistics and window size that resulted in the best AUC.
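The AUC used in such a comparison can be computed from ranks without constructing the ROC curve explicitly. The sketch below uses the rank-sum (Mann–Whitney) formulation with midranks for ties; it assumes that each locus carries a binary label indicating whether its true migration rate is elevated above the background, and the function name and numbers are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum formulation (ties get midranks)."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative locus")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1                              # extend the block of tied scores
        midrank = (i + j) / 2.0 + 1.0           # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Scores could be posterior mean migration rates or a windowed summary statistic;
# labels mark loci whose simulated migration rate exceeds the background rate.
true_rates = [0.1, 0.1, 0.4, 0.4]
labels = [rate > 0.1 for rate in true_rates]
auc = roc_auc([0.10, 0.40, 0.35, 0.80], labels)
```

An AUC of 0.5 corresponds to a statistic that ranks elevated and background loci at random, and 1.0 to perfect separation; the toy example above yields 0.75.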

Real Data Application

We reanalyzed whole-genome sequencing data from a recent study that identified putatively adaptive introgression from domestic goat into Alpine ibex (Münger et al. 2024). We restricted our analysis to species with at least four individuals: modern Alpine ibex (29 individuals), the Iberian ibex (four individuals), Bezoar ibex (six individuals), and domestic goats (16 individuals). We further focused on Chromosome 23, on which significant introgression was previously reported for gene regions with immune-related genes such as MHC (Grossen et al. 2014; Münger et al. 2024), and included the flanking Chromosomes 21, 22, 24, and 25 for comparison. We used a VCF file provided by the authors. This file was previously filtered to a minimum minor allele frequency (MAF) of 0.05 and a maximal missingness of 0.9 across all samples studied in Münger et al. (2024), and was further thinned to have a minimal distance of 100 bp between consecutive SNPs (see Münger et al. 2024, for details). The admixture graph used was derived from the demographic model inferred by Münger et al. (2024), but we modeled two migration events starting ancestral to domestic goat and directed to Alpine ibex and one to Iberian ibex, respectively, since hybridization between domestic goats and Iberian ibex has also been reported (Cardoso et al. 2021). We used the physical positions of markers provided in the VCF file and set $J = 21$.

Results

Simulations under the TreeSwirl Model

We conducted simulations under the above model to investigate the power of TreeSwirl to infer branch lengths and locus-specific migration rates. We focused on a tree of four populations and considered several cases of migration events of increasing complexity (Fig. 2) and simulated a case with and without variation in migration rates along the simulated chromosome.

The first case contains a single migration event from a population ancestral to Population 2 to Population 3 (Fig. 2a). In this case, the most challenging branch lengths to estimate are those leading to and from the source of the migration edge (branches 2a and 2b in Fig. 2a) since only their sum is relevant for the variance across loci in Population 2. However, additional information about their lengths comes from the covariance between Populations 2 and 3, and as a result their lengths as well as locus-specific migration rates are well estimated both when using a constant or a variable migration rate (Fig. 2a).

The second case is similar to the first, except that the migration edge now ends ancestral to Population 3 (Fig. 2b), leading to two pairs of branches that are challenging to infer: branches 2a and 2b and branches 3a and 3b (marked in blue and pink in Fig. 2b, respectively). As in the first case, information about the lengths of branches 2a and 2b stems from the variance of Population 2 and the covariance between Populations 2 and 3. In contrast, information about the lengths of branches 3a and 3b only stems from the variance of Population 3, and they are thus nonidentifiable in the case of constant migration with some branch lengths often inferred to be negative (Fig. 2b). This nonidentifiability issue has been previously observed by Pickrell and Pritchard (2012), and TreeMix consequently only allows for migration edges that end at tips or the end of branches. As shown in Fig. 2b, however, these branch lengths do become identifiable if loci vary in their migration rates and if that variation is explicitly modeled.
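The identifiability argument can be made concrete with a small sketch. It assumes a TreeMix-style Gaussian model in which the drift accumulated on the branch above the mixture point (here `c_a`) enters the variance of the admixed population scaled by (1 - w)^2, where w is the migration rate, while the drift below the mixture point (`c_b`) enters unscaled. Only this private contribution is considered; all other variance terms are assumed to be known from the remaining populations. The function names and the labeling of the two branches are hypothetical.

```python
from typing import List, Optional, Tuple


def private_variance(w: float, c_a: float, c_b: float) -> float:
    """Private-branch contribution to the variance of the admixed population.

    Assumption: drift before the mixture (c_a) is scaled by (1 - w)^2,
    drift after the mixture (c_b) is not scaled.
    """
    return (1.0 - w) ** 2 * c_a + c_b


def solve_branches(obs: List[Tuple[float, float]]) -> Optional[Tuple[float, float]]:
    """Recover (c_a, c_b) from two (w, variance) pairs; None if not identifiable."""
    (w1, v1), (w2, v2) = obs
    a1, a2 = (1.0 - w1) ** 2, (1.0 - w2) ** 2
    det = a1 - a2  # determinant of the 2x2 system [[a1, 1], [a2, 1]]
    if abs(det) < 1e-12:
        return None  # same w at both loci: only (1 - w)^2 * c_a + c_b is known
    c_a = (v1 - v2) / det
    c_b = v1 - a1 * c_a
    return c_a, c_b


true_c_a, true_c_b = 0.02, 0.03
constant = [(0.2, private_variance(0.2, true_c_a, true_c_b))] * 2
variable = [(w, private_variance(w, true_c_a, true_c_b)) for w in (0.1, 0.5)]
print(solve_branches(constant))  # no unique solution with a single migration rate
print(solve_branches(variable))  # both lengths recovered when w differs across loci
```

With a single migration rate, any pair of lengths giving the same combination fits equally well, including pairs with a negative entry, which is in line with the negative estimates reported for the constant-migration case.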

As shown in Fig. 2c, variation in migration rates renders all branch lengths identifiable also in the case of bidirectional migration, which led to an accurate inference of locus-specific migration rates in all simulated replicates. While variation in migration rates renders all branch lengths identifiable also in the case of two migration events involving all four populations, our inference framework fails to identify the correct solutions in about two-thirds of all replicates (Fig. 2d). Closer inspection reveals that this mostly affects the branch lengths and migration rates related to the migration edge that is more ancient and has both lower average and less variability in migration rates. We note that these cases often result in negative branch lengths, which we found to be a good indicator that the population graph used has too many degrees of freedom for the data analyzed.

Comparison to Related D- and F-Statistic Methods

We generated coalescent simulations under seven demographic histories of population splits and mixtures outlined in Figs. 3 and 4. For each model, we used an effective population size of Ne=10,000, a sample size of N=100 and a shared common ancestor for all populations dating back 2,000 generations. We simulated a single chromosome composed of genomic blocks of 1,000 bp with different migration rates: 7,000 blocks were at a background migration rate, 1,000 blocks were at peak migration rate, and 2,000 blocks had an intermediate migration rate. We arranged these blocks to obtain a range of peak sizes and peak migration rates and for each case evaluated the power of TreeSwirl to identify regions with elevated migration rates (candidates for adaptive introgression, Figs. 3 and 4) and to infer the patterns of migration rates along the simulated genomes (Fig. 5). For the models involving a single migration rate, we compared the power of TreeSwirl to identify regions with elevated migration rates to the power of commonly used summary statistics, namely Fst and f-statistics. We calculated these statistics using Dsuite Dinvestigate (Malinsky et al. 2021) for various window sizes and, in case of more than four populations, for all possible population combinations. To render our comparison conservative, we then always kept the summary statistic, window size, and population combination resulting in the highest power.
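Power to identify outlier loci is summarized below as the area under the ROC curve (AUC). As an illustration only, the following sketch computes an AUC from per-locus scores (for instance posterior mean migration rates, or a window-based statistic assigned to each locus) and binary truth labels (1 for loci simulated at elevated migration, 0 for background), using the rank-sum formulation. The function name and the way loci are labeled are assumptions and not necessarily the procedure used for the figures.

```python
from typing import List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive locus scores higher than a random
    negative locus, counting ties as 1/2 (rank-sum formulation)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks: List[float] = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of loci tied with the score at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average 1-based rank within the tie
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative loci")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


print(auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # perfectly separated toy scores
print(auc([0.1, 0.9, 0.2, 0.8], [1, 0, 0, 1]))  # partially inverted toy scores
```

An AUC of 0.5 corresponds to scores that carry no information about which loci are elevated, which makes the measure comparable across methods that report scores on different scales.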

Fig. 4.

Performance of TreeSwirl to measure the amount of introgression under different demographic histories with two migration events. First column: simulated demographic histories. Second column: schematic of the topology models. Third and fourth columns: AUC measures to identify outlier loci (elevated migration compared to background rate) for migration edges α1 and α2, respectively, as a function of the peak width (symbols, in number of blocks) and peak migration (x-axis).

Fig. 5.

Accuracy of migration rate inference by TreeSwirl. Shown are the per-locus posterior mean migration rate estimates (thin lines) for the first 6 Mb (the first half of all simulated loci) obtained among ten replicates under four models (columns): Models A, B, and D from Fig. 3 and Model A from Fig. 4 (orange corresponds to the edge from Pop3 into Pop2, blue to the edge from Pop2 into Pop3). For each model, results are shown for three combinations of peak width and peak migration rate: 50 blocks and migration 0.3, 500 blocks and migration 0.3, and 50 blocks and migration 0.6. For Model 4A, the corresponding rate for the second migration event was 0.45, 0.45, and 0.75. In all cases, the simulated migration rate is shown as a thick line.

As shown in the third and fourth columns of Fig. 3, TreeSwirl has overall higher power to identify regions with elevated migration rates than competing summary statistics, with a much lower false-positive rate, particularly if the background and elevated migration rates were rather similar (i.e. low peak migration rate). Importantly, TreeSwirl also had considerable power in identifying loci with elevated migration rates for models with two- and three-taxon topologies. This presents a significant advantage over f4-statistic methods, which are constrained to four-taxon configurations and require a predefined outgroup.

The most difficult scenario was the case of only two populations, in which TreeSwirl needs to estimate migration rates along with four branch lengths. In the case of no variation in migration rates, the information in the data corresponds to only two variances and one covariance term, thus considerably less than the degrees of freedom in the model. Here, we simulated genomes with three different migration rates, but still no accurate inference of migration rates was possible in case of a peak width of 50 blocks and a peak migration of 0.3, with estimates of individual replicates varying across the entire range of possible values (Fig. 5, Model 2A). When using a larger peak width of 500 blocks (but still a peak migration of 0.3), TreeSwirl generally inferred the proper pattern in migration rate variation, but overestimated the peak migration considerably in several replicates. When using a higher peak migration rate of 0.6 (but still a peak width of 50 blocks), estimates were much more accurate, even if the peak migration rate was slightly underestimated. Consequently, the power to identify regions with elevated migration rates was higher than for Fst in these cases (Fig. 3a).

As more populations are included, the number of variances and covariances increases faster than the number of branch lengths (M(M+1)/2 vs. 2M−1 for a strictly bifurcating graph without migration), rendering model parameters easier to infer (but see Fig. 2 for how migration edges complicate things). Consequently, TreeSwirl also inferred migration rates much more accurately for simulations conducted with more populations (Fig. 5, Models 3B and 3D), although it remained more difficult in the case of very short peak width and low peak migration rates. Nonetheless, in these cases TreeSwirl showed higher power to identify regions with elevated migration rates than even the best of all summary statistics tested (Fig. 3b to d).
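The counting argument can be tabulated with a few lines of code. The sketch below compares the number of distinct variance and covariance terms among M populations, M(M+1)/2, with a branch-length count of 2M - 1. The latter is taken here as an assumption about the count intended in the text for a strictly bifurcating graph without migration, and both function names are hypothetical.

```python
def n_moments(m: int) -> int:
    """Number of distinct variances and covariances among m populations."""
    return m * (m + 1) // 2


def n_branches(m: int) -> int:
    """Assumed branch-length count for a bifurcating graph without migration."""
    return 2 * m - 1


# Second moments grow quadratically with m, branch lengths only linearly.
for m in range(2, 7):
    print(m, n_moments(m), n_branches(m))
```

For two populations both counts equal three, so every additional parameter introduced by a migration edge must be informed by variation across loci; from three populations onward the surplus of moments over branch lengths grows with each added population.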

Noteworthy is the model illustrated in Fig. 3e, which included many populations but turned out to be challenging for inference. In that scenario, the migration event occurred a relatively long time after the involved populations split, and none of the other populations provides additional information on the allele frequencies at the time of migration, rendering it difficult to distinguish migration from pure drift. Since the populations Pop1 and Pop2 branch off only after the migration event, the scenario is actually very similar to that explored in Fig. 2b, which we found to be identifiable only with sufficient variation in migration rates. Interestingly, and despite its rather low power, TreeSwirl still outperformed all summary statistics tested in identifying loci with elevated migration rates also under this model.

We next investigated the power of TreeSwirl to identify loci with elevated migration rates under two demographic models involving two migration edges. The first model, shown in Fig. 4a, extends the single migration, four-population model shown in Fig. 3c with bidirectional gene flow. For the migration edge from Population 3 into Population 2 (migration 1 in Fig. 4a), migration rates were rather accurately inferred (Fig. 5, Model 4A in blue) and consequently the power to identify loci with elevated migration rates was only slightly reduced compared to the single migration case (left panel in Fig. 3c). In contrast, inferring migration rates was more challenging for the migration edge from Population 2 into Population 3 due to the rather deep split between Populations 3 and 4 (Fig. 5, Model 4A in orange). Consequently, the power to identify such loci was considerably lower for that migration edge (migration 2 in Fig. 4a). The second model, shown in Fig. 4b, is similar to the single migration model shown in Fig. 3e, but with two recent edges involving different branches, for both of which the power to identify loci with elevated migration rates was comparable to that for single migration models. Notably, TreeSwirl had higher power to detect loci with elevated migration rates than any of the summary statistics tested in all cases, likely because it could make use of all population samples jointly while the summary statistics were limited to a subset of comparisons only. In the first model, for instance, only Fst between population pairs could be used to distinguish between the two migration edges. Similarly, in the second model, data from Pop3 could not be used to form f4-statistics to identify loci with elevated migration rates for migration edge 1 from Pop2 into Pop4, although that population would be most informative in the absence of migration edge 2.

MAF Filtering and Runtime Considerations

Across all simulations we noted that migration rates tended to be underestimated rather than overestimated (Fig. 5), which we found to be a direct consequence of filtering for minimum MAF. We illustrate this in Fig. 6 for the model from Fig. 3d and the easiest case (peak width of 500 blocks, peak migration of 0.6): the more strictly low-frequency variants are filtered out, the lower the per-locus migration estimates tend to be. The reason for this is likely that low-frequency variants, which are often population-specific, are particularly informative about migration, while high-frequency variants tend to be shared among populations regardless of migration. Although this may suggest that no MAF filter should be used, we note that low-frequency variants also lead to noisier estimates of per-locus migration rates (Fig. 6a) and considerably longer run times, which scale linearly with the number of loci.
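For illustration, a MAF filter of the kind discussed here can be sketched as follows. The sketch assumes that each locus is summarized by an alternative allele count and a total allele count pooled across all samples, and that loci are kept if the pooled minor allele frequency is at least `min_maf`. Whether MAF is computed across pooled samples or per population is an assumption of this example, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def minor_allele_frequency(alt_count: int, total: int) -> float:
    """Frequency of the rarer of the two alleles at a biallelic locus."""
    return min(alt_count, total - alt_count) / total


def filter_by_maf(loci: Sequence[Tuple[int, int]], min_maf: float) -> List[int]:
    """Return indices of loci whose pooled MAF is at least min_maf.

    Each locus is (alternative allele count, total allele count).
    Loci without any called alleles are dropped.
    """
    return [
        i
        for i, (alt, total) in enumerate(loci)
        if total > 0 and minor_allele_frequency(alt, total) >= min_maf
    ]


# Toy loci with pooled MAF of 0.005, 0.05, 0.05 and 0.5, respectively.
loci = [(1, 200), (10, 200), (190, 200), (100, 200)]
for threshold in (0.01, 0.05, 0.1):
    print(threshold, filter_by_maf(loci, threshold))
```

Raising the threshold shortens run time by reducing the number of loci, at the cost of removing the rare, often population-specific variants that the text identifies as most informative about migration.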

Fig. 6.

Effect of filtering on MAF. Shown are the per-locus posterior mean migration rate estimates (thin lines) for the simulations of Model D from Fig. 3 with a peak width of 500 blocks and peak migration of 0.6 (thick lines), for no MAF filter and for only keeping loci with a minimum MAF of 0.01, 0.05, and 0.1.

Additional factors affecting the computational costs of running TreeSwirl include: (i) the number of discrete states J, with which runtime scales quadratically, (ii) the number of migration edges, with which runtime scales exponentially, and (iii) the number of matrices Σl, with which runtime scales linearly. Computations may be efficiently distributed across multiple computer nodes by dividing the genome into independent segments, such as individual chromosomes or chromosome arms. This approach is valid because linkage does not persist across chromosome boundaries and is typically weak across the centromere.
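The linear scaling with the number of loci and the quadratic scaling with the number of states J are typical of hidden Markov model recursions. The following generic, scaled forward algorithm is meant only to make this visible; it is a textbook sketch with hypothetical names and toy inputs, not the TreeSwirl implementation, and it ignores how emission probabilities are obtained from the Gaussian model.

```python
import math
from typing import Sequence


def forward_loglik(
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    emis: Sequence[Sequence[float]],
) -> float:
    """Log-likelihood of an HMM via the scaled forward recursion.

    init[j]     : probability of state j at the first locus
    trans[i][j] : probability of moving from state i to state j
    emis[l][j]  : probability of the data at locus l given state j
    """
    n_states = len(init)
    alpha = [init[j] * emis[0][j] for j in range(n_states)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for l in range(1, len(emis)):  # one pass over loci: linear in their number
        new_alpha = [
            # summing over all previous states for each state: J * J operations
            emis[l][j] * sum(alpha[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
        scale = sum(new_alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in new_alpha]
    return loglik


# Toy example with two states and three loci.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emis = [[0.2, 0.4], [0.3, 0.1], [0.5, 0.5]]
print(forward_loglik(init, trans, emis))
```

Because segments treated as independent contribute additively to the log-likelihood, chromosomes or chromosome arms can be processed on separate nodes and their results combined afterwards.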

Introgression from Domestic Goats into Alpine Ibex

The Alpine ibex is a wild goat species native to the European Alps. It suffered a near-extinction two centuries ago, but has since recovered thanks to repopulation efforts. It is known that Alpine ibex occasionally hybridize with domestic goats, and adaptive introgression has been reported for several genomic regions, particularly for a region on Chromosome 23 that contains immune-related genes such as MHC (Grossen et al. 2014; Münger et al. 2024). Here, we used TreeSwirl to reanalyze whole-genome sequencing data of Alpine ibex, Iberian ibex, Bezoar ibex, and domestic goat to quantify locus-specific rates of introgression from domestic goat into Alpine and Iberian ibex along Chromosomes 21 to 25. As shown in Fig. 7, we generally inferred very low migration rates from domestic goat into Alpine ibex with an inferred attractor state at zero. Nonetheless, we identified a clear signal of introgression on Chromosome 23 between 20 and 24 Mb, corresponding well to the region previously reported (Münger et al. 2024) and containing MHC. We also identified an additional, smaller region of introgression on Chromosome 21.

Fig. 7.

Inference of introgressed loci in Alpine and Iberian ibex. Shown is the inferred population graph along with inferred migration rates (posterior means) along Chromosomes 21 to 25 for Alpine ibex (top, edge from Goat to Alpine) and Iberian ibex (bottom, edge from Goat to Iberian). The region reported by Münger et al. (2024) on Chromosome 23 is marked with a gray background.

We also inferred rather low overall migration rates from domestic goat into Iberian ibex, again with the attractor state estimated at zero. Interestingly, the region on Chromosome 23 inferred as introgressed in Alpine ibex was also inferred as strongly introgressed in Iberian ibex, this time along with an additional, smaller region on Chromosome 25. Since the two ibex species likely diverged prior to the domestication of goats, the signal on Chromosome 23 is thus indicative of independent adaptive introgression in both ibex species. However, we note that given the frequency-dependent selection likely acting on the MHC region, the signal may also be indicative of shared ancient polymorphisms among all goat species, although the high migration rates we inferred (close to 1.0) suggest this scenario to be less likely.

Discussion

One approach to infer historical relationships among populations is to model allele frequency changes along a phylogenetic tree as a Gaussian process (Cavalli-Sforza and Edwards 1967; Felsenstein 1981). This rather old concept was recently revived by extending the model to a graph with migration edges and by providing a user-friendly tool, TreeMix, to infer parameters under such a graph (Pickrell and Pritchard 2012). However, the TreeMix model assumes migration rates to be constant along the genome, an assumption that may not hold in the face of selection or strong genetic drift. Indeed, theory predicts variation in the rate of effective gene flow along the genome (Harrison 1993), in which local barriers to gene flow are anticipated to emerge from the random accumulation of Dobzhansky–Muller incompatibilities, both under models of secondary contact after isolation (Barton and Gale 1993) and under models of continuous gene flow during speciation (Wu 2001). In the case of gene flow between highly divergent gene pools, selection is likely to act as the primary driving force for variation in effective gene flow along the genome, with rates of introgression being particularly low in genomic regions involved in adaptation, so-called islands of speciation, but potentially much higher in regions free from such selection pressure (Dasmahapatra et al. 2012).

In light of these considerations, we here present TreeSwirl, an extension of the model described in Pickrell and Pritchard (2012) that allows for mixture proportions to vary along the genome in an auto-correlated way that reflects the effect of linkage. Using simulations, we found the power of TreeSwirl to detect loci with elevated gene flow to be a function of several parameters: First, the power generally increases with the difference between the peak and background migration rate as well as with the size of the region affected by elevated gene flow (and hence the local recombination structure). Second, the power generally increased if data from additional populations was available, ideally populations that provide a lot of information about the allele frequencies at the time of the migration events, such as populations that split off right before a population received migrants. Third, sufficient variation in migration rates along the genome is required to accurately infer some branch lengths, which in turn increases power to infer outlier loci. Finally, and although untested here, we expect sample size to affect power as well, since low sample sizes lead to a lot of uncertainty regarding the current allele frequencies. As our simulations indicated, TreeSwirl had high power to infer even relatively short regions of 50 kb in many scenarios, and may identify even much shorter regions under a favorable population graph, provided that peak (or trough) migration rates are substantially different from the background rate.

We compared the power of TreeSwirl to existing methods related to D- and f-statistics, such as Fst, Patterson’s D (Patterson et al. 2012), fd (Martin et al. 2015), fdM (Malinsky et al. 2015), and df (Pfeifer and Kapan 2019), which have been frequently applied to identify signatures of introgression using arbitrary genomic window sizes. As our simulations showed, TreeSwirl had superior accuracy and sensitivity in detecting introgressed loci under all demographic histories investigated. The approach presented here also addresses numerous constraints inherent to the use of related D- and f-statistics. First, these summary statistics are limited to bifurcating four-population topologies. In cases involving graphs of five or more populations, the simplest option is to subsample a section of the graph in the appropriate configuration, as done in Dsuite (Malinsky et al. 2021) used here. In cases involving two- or three-population topologies, one would need to resort to Fst-based metrics. In contrast, the method presented here is not constrained by topology, working well with any number of populations and also under topologies that include polytomies.

Second, our HMM-based approach to model linkage eliminates the need to specify window sizes. Instead, the parameters governing auto-correlation are directly inferred from the data along with introgression rates. In our simulations, the choice of window sizes, as well as the choice of the specific statistics to use, had a big impact on power. To ensure a fair comparison between methods, we thus tested all available summary statistics for a wide range of window sizes and only report the results of the combination of summary statistics and window size that was optimal for each individual case. In applications to real data, however, such explorations are not possible, likely leading to an even larger difference in power between TreeSwirl and these summary statistics.

Third, TreeSwirl supports graphs with multiple migration edges for which introgression rates are learned simultaneously. However, it is important to note that the performance of TreeSwirl is likely dependent on the quality of the tree topology used as input, and it may not perform well if the tree topology is poorly resolved or incorrect. Similarly, we caution that TreeSwirl, just like TreeMix, assumes that population allele frequencies can be explained through a population graph with splits, drift along branches and potential mixtures. While this approach has proved useful in many cases, it is easy to think of scenarios that violate these assumptions. One such scenario consists of populations that have diverged long enough to hardly share any polymorphisms. This leads to negative covariance between populations (the presence of a mutation in one population implies its absence in the others), which cannot be captured by a population graph. While experimenting with such cases showed TreeSwirl to be remarkably robust, it should not be confused with phylogenetic approaches.

Other scenarios that violate the underlying model assumptions are cases of extensive continuous gene flow as well as cases of population structure not well captured with a graph. Such scenarios were shown to lead to spurious inference of introgression using other methods (e.g. Eriksson and Manica 2014; Lawson et al. 2018; Tournebize and Chikhi 2023), and while we have not explored these scenarios in detail here, we are convinced they can lead to spurious inference by both TreeSwirl and TreeMix and appeal to users to carefully assess the model assumptions of these and any other tools they use. Nonetheless, and as we showed through both simulations and a data application, the model introduced here is powerful to identify introgressed loci under many scenarios and will thereby contribute to a better understanding of the role of introgression in evolution, which is expected to act as a major driver in adaptation to ongoing global changes (Suarez-Gonzalez et al. 2018).

Acknowledgments

We thank Christine Grossen for sharing the ibex data and two anonymous reviewers for their helpful comments on an earlier version of this manuscript.

Contributor Information

Carlos S Reyna-Blanco, Department of Biology, University of Fribourg, Fribourg 1700, Switzerland; Swiss Institute of Bioinformatics, Fribourg 1700, Switzerland.

Madleina Caduff, Department of Biology, University of Fribourg, Fribourg 1700, Switzerland; Swiss Institute of Bioinformatics, Fribourg 1700, Switzerland.

Marco Galimberti, Department of Biology, University of Fribourg, Fribourg 1700, Switzerland; Swiss Institute of Bioinformatics, Fribourg 1700, Switzerland; Department of Psychiatry, Yale University School of Medicine, New Haven, CT, USA; Veterans Affairs Connecticut Healthcare System, West Haven, CT, USA.

Christoph Leuenberger, Department of Mathematics, University of Fribourg, Fribourg 1700, Switzerland.

Daniel Wegmann, Department of Biology, University of Fribourg, Fribourg 1700, Switzerland; Swiss Institute of Bioinformatics, Fribourg 1700, Switzerland.

Supplementary Material

Supplementary material is available at Molecular Biology and Evolution online.

Author Contributions

D.W. conceived the idea; D.W., C.L., and C.S.R.-B. developed the model; C.S.R.-B. and M.C. implemented the method in collaboration with M.G.; C.S.R.-B. and M.C. conducted all simulations and data analyses; C.S.R.-B., M.C., and D.W. led the writing of the manuscript. All authors contributed critically to the draft and gave final approval for publication.

Funding

This work was supported by Swiss National Science Foundation grants 31003A_173062 and 310030_200420 to D.W.

Conflict of interest

None declared.

Data Availability

The authors affirm that all data required to validate the conclusions of this article are either included within the article itself or accessible through the indicated repositories. The source code for TreeSwirl can be found in the following Git repository: bitbucket.org/wegmannlab/treeswirl2, which also contains a user manual. This study did not generate any new data.

References

  1. Abbott  R, Albach  D, Ansell  S, Arntzen  JW, Baird  SJ, Bierne  N, Boughman  J, Brelsford  A, Buerkle  CA, Buggs  R, et al. Hybridization and speciation. J Evol Biol. 2013:26(2):229–246. 10.1111/jeb.2013.26.issue-2. [DOI] [PubMed] [Google Scholar]
  2. Anderson  E. Introgressive hybridization. New York (NY): J. Wiley; 1949. [Google Scholar]
  3. Anderson  E, Hubricht  L. Hybridization in tradescantia. III. the evidence for introgressive hybridization. Am J Bot. 1938:25(6):396–402. 10.1002/ajb2.1938.25.issue-6. [DOI] [Google Scholar]
  4. Barton  NH. The role of hybridization in evolution. Mol Ecol. 2001:10(3):551–568. 10.1046/j.1365-294x.2001.01216.x. [DOI] [PubMed] [Google Scholar]
  5. Barton  NH, Gale  KS. Genetic analysis of hybrid zones. In: Harrison RG, editor. Hybrid zones and the evolutionary proces. Ch 2. New York (NY): Oxford University Press; 1993. p. 13–45.
  6. Barton  NH, Hewitt  GM. Analysis of hybrid zones. Annu Rev Ecol Evol Syst. 1985:16(1):113–148. 10.1146/ecolsys.1985.16.issue-1. [DOI] [Google Scholar]
  7. Baum  LE, Petrie  T, Soules  G, Weiss  N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat. 1970:41(1):164–171. 10.1214/aoms/1177697196. [DOI] [Google Scholar]
  8. Browning  SR, Browning  BL, Zhou  Y, Tucci  S, Akey  JM. Analysis of human sequence data reveals two pulses of archaic Denisovan admixture. Cell. 2018:173(1):53–61.e9. 10.1016/j.cell.2018.02.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Burgarella  C, Barnaud  A, Kane  NA, Jankowski  F, Scarcelli  N, Billot  C, Vigouroux  Y, Berthouly-Salazar  C. Adaptive introgression: an untapped evolutionary mechanism for crop adaptation. Front Plant Sci. 2019:10:4. 10.3389/fpls.2019.00004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Cardoso  TF, Luigi-Sierra  MG, Castelló  A, Cabrera  B, Noce  A, Mármol-Sánchez  E, García-González  R, Fernández-Arias  A, Alabart  JL, López-Olvera  JR, et al. Assessing the levels of intraspecific admixture and interspecific hybridization in Iberian wild goats (Capra pyrenaica). Evol Appl. 2021:14(11):2618–2634. 10.1111/eva.v14.11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Cavalli-Sforza  LL, Edwards  AW. Phylogenetic analysis. models and estimation procedures. Am J Hum Genet. 1967:19(3 Pt 1):233. 10.2307/2406616. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Christe  C, Stölting  KN, Bresadola  L, Fussi  B, Heinze  B, Wegmann  D, Lexer  C. Selection against recombinant hybrids maintains reproductive isolation in hybridizing Populus species despite F1 fertility and recurrent gene flow. Mol Ecol. 2016:25(11):2482–2498. 10.1111/mec.2016.25.issue-11. [DOI] [PubMed] [Google Scholar]
  13. Danecek  P, Auton  A, Abecasis  G, Albers  CA, Banks  E, DePristo  MA, Handsaker  RE, Lunter  G, Marth  GT, Sherry  ST, et al. The variant call format and VCFtools. Bioinformatics. 2011:27(15):2156–2158. 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Dasmahapatra  KK, Walters  JR, Briscoe  AD, Davey  JW, Whibley  A, Nadeau  NJ, Zimin  AV, Salazar  C, Ferguson  LC, Martin  SH, et al. Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature. 2012:487(7405):94–98. 10.1038/nature11041. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Durand  EY, Patterson  N, Reich  D, Slatkin  M. Testing for ancient admixture between closely related populations. Mol Biol Evol. 2011:28(8):2239–2252. 10.1093/molbev/msr048. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Durvasula  A, Sankararaman  S. A statistical model for reference-free inference of archaic local ancestry. PLoS Genet. 2019:15(5):e1008175. 10.1371/journal.pgen.1008175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Durvasula  A, Sankararaman  S. Recovering signals of ghost archaic introgression in African populations. Sci Adv. 2020:6(7):1–9. 10.1126/sciadv.aax5097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Eaton  DA, Ree  RH. Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae). Syst Biol. 2013:62(5):689–706. 10.1093/sysbio/syt032. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Ellstrand  NC. Is gene flow the most important evolutionary force in plants?  Am J Bot. 2014:101(5):737–753. 10.3732/ajb.1400024. [DOI] [PubMed] [Google Scholar]
  20. Ellstrand  NC, Barrett  SC, Linington  S, Stephenson  AG, Comai  L. Current knowledge of gene flow in plants: implications for transgene flow. Philos Trans R Soc Lond B Biol Sci. 2003:358(1434):1163–1170. 10.1098/rstb.2003.1299. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Eriksson  A, Manica  A. The doubly conditioned frequency spectrum does not distinguish between ancient population structure and hybridization. Mol Biol Evol. 2014:31(6):1618–1621. 10.1093/molbev/msu103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Excoffier  L, Marchi  N, Marques  DA, Matthey-Doret  R, Gouy  A, Sousa  VC. fastsimcoal2: demographic inference under complex evolutionary scenarios. Bioinformatics. 2021:37(24):4882–4885. 10.1093/bioinformatics/btab468. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Felsenstein  J. Evolutionary trees from gene frequencies and quantitative characters: finding maximum likelihood estimates. Evolution. 1981:35(6):1229–1242. 10.2307/2408134. [DOI] [PubMed] [Google Scholar]
  24. Fontaine  MC, Pease  JB, Steele  A, Waterhouse  RM, Neafsey  DE, Sharakhov  IV, Jiang  X, Hall  AB, Catteruccia  F, Kakani  E, et al. Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science. 2015:347(6217):1258524. 10.1126/science.1258524. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Galimberti  M, Leuenberger  C, Wolf  B, Szilágyi  SM, Foll  M, Wegmann  D. Detecting selection from linked sites using an F-model. Genetics. 2020:216(4):1205–1215. 10.1534/genetics.120.303780. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Grant  PR, Grant  BR. Hybridization of bird species. Science. 1992:256(5054):193–197. 10.1126/science.256.5054.193. [DOI] [PubMed] [Google Scholar]
  27. Grant  PR, Grant  BR. Phenotypic and genetic effects of hybridization in Darwin’s finches. Evolution. 1994:48(2):297–316. 10.2307/2410094. [DOI] [PubMed] [Google Scholar]
  28. Green  RE, Krause  J, Briggs  AW, Maricic  T, Stenzel  U, Kircher  M, Patterson  N, Li  H, Zhai  W, Fritz  MHY, et al. A draft sequence of the Neandertal genome. Science. 2010:328(5979):710–722. 10.1126/science.1188021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Grossen  C, Keller  L, Biebach  I, Consortium  TIGG, Croll  D. Introgression from domestic goat generated variation at the major histocompatibility complex of alpine ibex. PLoS Genet. 2014:10(6):1–16. 10.1371/journal.pgen.1004438. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Harrison  RG. Hybrid zones and the evolutionary process. New York (NY): Oxford University Press; 1993. [Google Scholar]
  31. Kozak  KM, Joron  M, McMillan  WO, Jiggins  CD. Rampant genome-wide admixture across the Heliconius radiation. Genome Biol Evol. 2021:13(7):1–17. 10.1093/gbe/evab099. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Kronforst  MR, Hansen  ME, Crawford  NG, Gallant  JR, Zhang  W, Kulathinal  RJ, Kapan  DD, Mullen  SP. Hybridization reveals the evolving genomic architecture of speciation. Cell Rep. 2013:5(3):666–677. 10.1016/j.celrep.2013.09.042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Kulathinal  RJ, Stevison  LS, Noor  MA. The genomics of speciation in Drosophila: diversity, divergence, and introgression estimated using low-coverage genome sequencing. PLoS Genet. 2009:5(7):e1000550. 10.1371/journal.pgen.1000550. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Lange  K. Numerical analysis for statisticians. Statistics and Computing. New York (NY): Springer; 2010. [Google Scholar]
  35. Lawson  DJ, Hellenthal  G, Myers  S, Falush  D. Inference of population structure using dense haplotype data. PLoS Genet. 2012:8(1):e1002453. 10.1371/journal.pgen.1002453. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Lawson  DJ, Van Dorp  L, Falush  D. A tutorial on how not to over-interpret structure and admixture bar plots. Nat Commun. 2018:9(1):3258. 10.1038/s41467-018-05257-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Lipson  M, Loh  PR, Levin  A, Reich  D, Patterson  N, Berger  B. Efficient moment-based inference of admixture parameters and sources of gene flow. Mol Biol Evol. 2013:30(8):1788–1802. 10.1093/molbev/mst099. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Lipson  M, Loh  PR, Patterson  N, Moorjani  P, Ko  YC, Stoneking  M, Berger  B, Reich  D. Reconstructing Austronesian population history in Island Southeast Asia. Nat Commun. 2014:5(1):1–7. 10.1038/ncomms5689. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Luqman  H, Widmer  A, Fior  S, Wegmann  D. Identifying loci under selection via explicit demographic models. Mol Ecol Resour. 2021:21(8):2719–2737. 10.1111/men.v21.8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Malinsky  M, Challis  RJ, Tyers  AM, Schiffels  S, Terai  Y, Ngatunga  BP, Miska  EA, Durbin  R, Genner  MJ, Turner  GF. Genomic islands of speciation separate cichlid ecomorphs in an East African crater lake. Science. 2015:350(6267):1493–1498. 10.1126/science.aac9927. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Malinsky  M, Matschiner  M, Svardal  H. Dsuite - fast D-statistics and related admixture evidence from VCF files. Mol Ecol Resour. 2021:21(2):584–595. 10.1111/men.v21.2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Mallet  J. Hybridization as an invasion of the genome. Trends Ecol Evol. 2005:20(5):229–237. 10.1016/j.tree.2005.02.010. [DOI] [PubMed] [Google Scholar]
  43. Maples  BK, Gravel  S, Kenny  EE, Bustamante  CD. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet. 2013:93(2):278–288. 10.1016/j.ajhg.2013.06.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Marchi  N, Winkelbach  L, Schulz  I, Brami  M, Hofmanová  Z, Blöcher  J, Reyna-Blanco  CS, Diekmann  Y, Thiéry  A, Kapopoulou  A, et al. The genomic origins of the world’s first farmers. Cell. 2022:185(11):1842–1859.e18. 10.1016/j.cell.2022.04.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Martin  SH, Dasmahapatra  KK, Nadeau  NJ, Salazar  C, Walters  JR, Simpson  F, Blaxter  M, Manica  A, Mallet  J, Jiggins  CD. Genome-wide evidence for speciation with gene flow in Heliconius butterflies. Genome Res. 2013:23(11):1817–1828. 10.1101/gr.159426.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Martin  SH, Davey  JW, Jiggins  CD. Evaluating the use of ABBA–BABA statistics to locate introgressed loci. Mol Biol Evol. 2015:32(1):244–257. 10.1093/molbev/msu269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Münger  X, Robin  M, Dalén  L, Grossen  C. Facilitated introgression from domestic goat into Alpine ibex at immune loci. Mol Ecol. 2024:33(14):e17429. 10.1111/mec.17429. [DOI] [PubMed] [Google Scholar]
  48. Murphy  KP. Machine learning: a probabilistic perspective. Cambridge (MA): MIT Press; 2012. [Google Scholar]
  49. Nelder  JA, Mead  R. A simplex method for function minimization. Comput J. 1965:7(4):308–313. 10.1093/comjnl/7.4.308. [DOI] [Google Scholar]
  50. Nocedal  J, Wright  S. Numerical optimization. New York (NY): Springer Science & Business Media; 2006. [Google Scholar]
  51. Patterson  N, Moorjani  P, Luo  Y, Mallick  S, Rohland  N, Zhan  Y, Genschoreck  T, Webster  T, Reich  D. Ancient admixture in human history. Genetics. 2012:192(3):1065–1093. 10.1534/genetics.112.145037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Patterson  N, Richter  DJ, Gnerre  S, Lander  ES, Reich  D. Genetic evidence for complex speciation of humans and chimpanzees. Nature. 2006:441(7097):1103–1108. 10.1038/nature04789. [DOI] [PubMed] [Google Scholar]
  53. Pfeifer  B, Kapan  DD. Estimates of introgression as a function of pairwise distances. BMC Bioinformatics. 2019:20(1):1–11. 10.1186/s12859-019-2747-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Pickrell  JK, Pritchard  JK. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 2012:8(11):e1002967. 10.1371/journal.pgen.1002967. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Plagnol  V, Wall  JD. Possible ancestral structure in human populations. PLoS Genet. 2006:2(7):e105. 10.1371/journal.pgen.0020105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Price  AL, Helgason  A, Palsson  S, Stefansson  H, Andreassen  OA, Reich  D, Kong  A, Stefansson  K. The impact of divergence time on the nature of population structure: an example from Iceland. PLoS Genet. 2009:5(6):e1000505. 10.1371/journal.pgen.1000505. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Prüfer  K, Racimo  F, Patterson  N, Jay  F, Sankararaman  S, Sawyer  S, Heinze  A, Renaud  G, Sudmant  PH, De Filippo  C, et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature. 2014:505(7481):43–49. 10.1038/nature12886. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Rabiner  LR, Juang  BH. An introduction to hidden Markov models. IEEE ASSP Mag. 1986:3(1):4–16. 10.1109/MASSP.1986.1165342. [DOI] [Google Scholar]
  59. Racimo  F, Marnetto  D, Huerta-Sánchez  E. Signatures of archaic adaptive introgression in present-day human populations. Mol Biol Evol. 2017:34(2):296–317. 10.1093/molbev/msw216. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Racimo  F, Sankararaman  S, Nielsen  R, Huerta-Sánchez  E. Evidence for archaic adaptive introgression in humans. Nat Rev Genet. 2015:16(6):359–371. 10.1038/nrg3936. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Rheindt  FE, Fujita  MK, Wilton  PR, Edwards  SV. Introgression and phenotypic assimilation in Zimmerius flycatchers (Tyrannidae): population genetic and phylogenetic inferences from genome-wide SNPs. Syst Biol. 2014:63(2):134–152. 10.1093/sysbio/syt070. [DOI] [PubMed] [Google Scholar]
  62. Rieseberg  LH, Wendel  JF. Introgression and its consequences in plants. In: Harrison RG, editor. Hybrid zones and the evolutionary process. Ch. 4. New York (NY): Oxford University Press; 1993. p. 70–109.
  63. Sankararaman  S. Methods for detecting introgressed archaic sequences. Curr Opin Genet Dev. 2020:62:85–90. 10.1016/j.gde.2020.05.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Sankararaman  S, Mallick  S, Dannemann  M, Prüfer  K, Kelso  J, Pääbo  S, Patterson  N, Reich  D. The genomic landscape of Neanderthal ancestry in present-day humans. Nature. 2014:507(7492):354–357. 10.1038/nature12961. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Seguin-Orlando  A, Korneliussen  TS, Sikora  M, Malaspinas  AS, Manica  A, Moltke  I, Albrechtsen  A, Ko  A, Margaryan  A, Moiseyev  V, et al. Genomic structure in Europeans dating back at least 36,200 years. Science. 2014:346(6213):1113–1118. 10.1126/science.aaa0114. [DOI] [PubMed] [Google Scholar]
  66. Skov  L, Hui  R, Shchur  V, Hobolth  A, Scally  A, Schierup  MH, Durbin  R. Detecting archaic introgression using an unadmixed outgroup. PLoS Genet. 2018:14(9):e1007641. 10.1371/journal.pgen.1007641. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Slatkin  M. Gene flow in natural populations. Ann Rev Ecol Syst. 1985a:16(1):393–430. 10.1146/ecolsys.1985.16.issue-1. [DOI] [Google Scholar]
  68. Slatkin  M. Rare alleles as indicators of gene flow. Evolution. 1985b:39(1):53–65. 10.2307/2408516. [DOI] [PubMed] [Google Scholar]
  69. Slatkin  M. Gene flow and the geographic structure of natural populations. Science. 1987:236(4803):787–792. 10.1126/science.3576198. [DOI] [PubMed] [Google Scholar]
  70. Smith  J, Kronforst  MR. Do Heliconius butterfly species exchange mimicry alleles?  Biol Lett. 2013:9(4):20130503. 10.1098/rsbl.2013.0503. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Sousa  VC, Fritz  M, Beaumont  MA, Chikhi  L. Approximate Bayesian computation without summary statistics: the case of admixture. Genetics. 2009:181(4):1507–1519. 10.1534/genetics.108.098129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Steinrücken  M, Spence  JP, Kamm  JA, Wieczorek  E, Song  YS. Model-based detection and analysis of introgressed Neanderthal ancestry in modern humans. Mol Ecol. 2018:27(19):3873–3888. 10.1111/mec.14565. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Suarez-Gonzalez  A, Lexer  C, Cronk  QC. Adaptive introgression: a plant perspective. Biol Lett. 2018:14(3):20170688. 10.1098/rsbl.2017.0688. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Tang  H, Coram  M, Wang  P, Zhu  X, Risch  N. Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet. 2006:79(1):1–12. 10.1086/504302. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Tournebize  R, Chikhi  L. Questioning Neanderthal admixture: on models, robustness and consensus in human evolution. bioRxiv. 2023. 10.1101/2023.04.05.535686. preprint: not peer reviewed. [DOI] [Google Scholar]
  76. Tung  J, Barreiro  LB. The contribution of admixture to primate evolution. Curr Opin Genet Dev. 2017:47:61–68. 10.1016/j.gde.2017.08.010. [DOI] [PubMed] [Google Scholar]
  77. Varadhan  R, Roland  C. Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand J Stat. 2008:35(2):335–353. 10.1111/sjos.2008.35.issue-2. [DOI] [Google Scholar]
  78. Vernot  B, Akey  JM. Resurrecting surviving Neandertal lineages from modern human genomes. Science. 2014:343(6174):1017–1021. 10.1126/science.1245938. [DOI] [PubMed] [Google Scholar]
  79. Vernot  B, Tucci  S, Kelso  J, Schraiber  JG, Wolf  AB, Gittelman  RM, Dannemann  M, Grote  S, McCoy  RC, Norton  H, et al. Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals. Science. 2016:352(6282):235–239. 10.1126/science.aad9416. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Wall  JD, Lohmueller  KE, Plagnol  V. Detecting ancient admixture and estimating demographic parameters in multiple human populations. Mol Biol Evol. 2009:26(8):1823–1827. 10.1093/molbev/msp096. [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Wegmann  D, Excoffier  L. Bayesian inference of the demographic history of chimpanzees. Mol Biol Evol. 2010:27(6):1425–1435. 10.1093/molbev/msq028. [DOI] [PubMed] [Google Scholar]
  82. Wegmann  D, Kessner  DE, Veeramah  KR, Mathias  RA, Nicolae  DL, Yanek  LR, Sun  YV, Torgerson  DG, Rafaels  N, Mosley  T, et al. Recombination rates in admixed individuals identified by ancestry-based inference. Nat Genet. 2011:43(9):847–853. 10.1038/ng.894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Wu  CI. The genic view of the process of speciation. J Evol Biol. 2001:14(6):851–865. 10.1046/j.1420-9101.2001.00335.x. [DOI] [Google Scholar]
  84. Yamamichi  M, Innan  H. Estimating the migration rate from genetic variation data. Heredity. 2012:108(4):362–363. 10.1038/hdy.2011.83. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Yang  MA, Malaspinas  AS, Durand  EY, Slatkin  M. Ancient structure in Africa unlikely to explain Neanderthal and non-African genetic similarity. Mol Biol Evol. 2012:29(10):2987–2995. 10.1093/molbev/mss117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  86. Zhang  X, Witt  KE, Bañuelos  MM, Ko  A, Yuan  K, Xu  S, Nielsen  R, Huerta-Sánchez  E. The history and evolution of the Denisovan-EPAS1 haplotype in Tibetans. Proc Natl Acad Sci USA. 2021:118(22):e2020803118. 10.1073/pnas.2020803118. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

msae137_Supplementary_Data

Data Availability Statement

The authors affirm that all data required to validate the conclusions of this article are either included within the article itself or accessible through the indicated repositories. The source code for TreeSwirl can be found in the following Git repository: bitbucket.org/wegmannlab/treeswirl2, which also contains a user manual. This study did not generate any new data.


Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press
