Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2023 Jan 10.
Published in final edited form as: Curr Biol. 2021 Nov 16;32(1):111–123.e5. doi: 10.1016/j.cub.2021.10.052

Widespread introgression across a phylogeny of 155 Drosophila genomes

Anton Suvorov 1,*,$, Bernard Y Kim 2, Jeremy Wang 1, Ellie E Armstrong 2, David Peede 3, Emmanuel R R D’Agostino 3, Donald K Price 4, Peter Waddell 5, Michael Lang 6, Virginie Courtier-Orgogozo 6, Jean R David 7,8, Dmitri Petrov 2, Daniel R Matute 3,†,&, Daniel R Schrider 1,†,&, Aaron A Comeault 9,†,&,*
PMCID: PMC8752469  NIHMSID: NIHMS1753727  PMID: 34788634

SUMMARY

Genome-scale sequence data have invigorated the study of hybridization and introgression, particularly in animals. However, outside of a few notable cases, we lack systematic tests for introgression at a larger phylogenetic scale across entire clades. Here we leverage 155 genome assemblies, from 149 species, to generate a fossil-calibrated phylogeny and conduct multilocus tests for introgression across nine monophyletic radiations within the genus Drosophila. Using complementary phylogenomic approaches, we identify widespread introgression across the evolutionary history of Drosophila. Mapping gene-tree discordance onto the phylogeny revealed that both ancient and recent introgression has occurred across most of the nine clades that we examined. Our results provide the first evidence of introgression occurring across the evolutionary history of Drosophila and highlight the need to continue to study the evolutionary consequences of hybridization and introgression in this genus and across the Tree of Life.

In Brief

Suvorov et al. use multi-locus data from 155 Drosophila genome assemblies to reconstruct the phylogeny of the genus and estimate divergence times. They used multiple phylogenomic methods to shed light on patterns of gene flow within the genus. Their results suggest multiple instances of introgression across the Drosophila Tree of Life.

INTRODUCTION

The extent of gene exchange in nature has remained one of the most hotly debated questions in speciation genetics. Genomic data have revealed that introgression is common across taxa, having been identified in major groups such as fungi13, vertebrates47, insects810, and angiosperms11,12. The evolutionary effects of introgression are diverse, and are determined by multiple ecological and genomic factors13,14. Once thought to be strictly deleterious, it has become increasingly clear that introgression can serve as a source of genetic variation used during local adaptation15,16 and adaptive radiation17,18. While our understanding of introgression as a widespread phenomenon has clearly improved, it remains unclear how often it occurs across taxa. Ideally, determining the frequency of introgression across the Tree of Life would leverage the signal from systematic analyses of clade-level genomic data without an a priori selection of taxa known to hybridize in nature.

At the phylogenetic scale, hybridization has typically been explored at relatively recent timescales. For example, studies of hybridization between cats (Felidae; 10–12 My; ~40 species19), butterflies (Heliconius; 10–15 My; 15 species8), cichlid fishes from the African rift lakes (0.5–10 My; ~27 species18,20,21), and wild tomatoes (Solanum; ~4 My; ~20 species12) all rejected a purely bifurcating phylogenetic history. In each of these systems introgression has occurred relatively recently, as the common ancestor for each species group occurred no more than 15 million years ago. However, there are also notable exceptions, and evidence for introgression has been found across much deeper phylogenetic timescales within vascular plants11 and primates7. In some species, there is also evidence that introgression has been a source of adaptive genetic variation that has helped drive adaptation (e.g. refs. 2,2225). These results show how introgression has both (1) occurred in disparate taxonomic groups and (2) promoted adaptation and diversification in some. Notwithstanding key examples47,11,12, we still require systematic tests of introgression that use clade-level genomic data that spans both deep and shallow phylogenetic time to better understand introgression’s generality throughout evolution.

Species from the genus Drosophila remain one of the most powerful genetic systems to study animal evolution. Comparative analyses suggest that introgression might be common during speciation in the genus26. Genome scans of closely related drosophilid species have provided evidence of gene flow and introgression9,10,2732. There is also evidence of contemporary hybridization3335 and stable hybrid zones between a handful of species3638. These examples of hybridization and introgression show that species boundaries can be porous but cannot be taken as prima facie evidence of the commonality of introgression. We still lack a systematic understanding of the relative frequency of hybridization and subsequent introgression across Drosophila. Here we analyze patterns of introgression across a phylogeny generated using 155 whole genomes derived from 149 species of Drosophila, and the genomes of four outgroup species. These Drosophila species span over 50 million years of evolution and include multiple samples from nine major radiations within the genus Drosophila. We used two different phylogenetic approaches to test whether introgression has occurred in each of these nine radiations. We found numerous instances of introgression across the evolutionary history of drosophilid flies, some mapping to early divergences within clades up to 20–25 Mya. Our results provide a taxonomically unbiased estimate of the prevalence of introgression at a macroevolutionary scale. Despite few known observations of current hybridization in nature, introgression appears to be a widespread phenomenon across the phylogeny of Drosophila.

RESULTS

A high-confidence phylogeny of 155 Drosophila genomes

We first used genome-scale sequence data to infer phylogenetic relationships among species in our data set. To achieve this, we annotated and generated multiple sequence alignments for 2,791 Benchmarking Universal Single-Copy Orthologs (BUSCOs; v339,40) across 155 independently assembled Drosophila genomes together with four outgroups (3 additional species from Drosophilidae and Anopheles gambiae). We used these alignments, totalling 8,187,056 nucleotide positions, and fossil calibrations to reconstruct a fossil-calibrated tree of Drosophila evolutionary history. Note that the inclusion of Anopheles as an outgroup allowed us to include a fossil of Grauvogelia, the oldest known dipteran, in our fossil calibration analysis, along with several Drosophilidae fossils and/or geological information (i.e., formation of the Hawaiian Islands; Data S1).

Our phylogenetic analyses (see STAR Methods for details) using both maximum-likelihood (ML using the IQ-TREE package) and gene tree coalescent-based (ASTRAL) approaches with DNA data revealed well-supported relationships among nearly all species within our dataset. Phylogenies inferred using these two approaches only differed in three relationships (Figure S1): (i) D. villosipedis was either recovered as sister species to D. limitata + D. ochracea (ML topology) or as a sister to D. limitata + D. ochracea + D. murphyi + D. sproati (ASTRAL topology); (ii) D. vulcana and D. seguyi form monophyletic lineage sister to the D. nikananu + D. spp. aff. chauvacae + D. burlai + D. bocqueti + D. bakoue clade (ML topology) or have paraphyletic relationships where D. vulcana is sister to the D. nikananu + D. spp. aff. chauvacae + D. burlai + D. bocqueti + D. bakoue clade (ASTRAL topology); (iii) D. simulans was recovered as sister either to D. mauritiana (ML topology) or D. sechellia (ASTRAL topology, the latter of which is perhaps more likely to be the true species tree according to an analysis examining low-recombining regions, which are less prone to ILS41. The nodal supports were consistently high across both ML (Ultrafast bootstrap (UFBoot) = 100, an approximate likelihood ratio test with the nonparametric Shimodaira–Hasegawa correction (SH-aLRT) = 100, a Bayesian-like transformation of aLRT (aBayes) = 1) and ASTRAL (Local posterior probability (LPP) = 1) topologies with the exception of D. limitata + D. ochracea + D. villosipedis (UFBoot = 9, SH-aLRT = 81, aBayes = 1) and D. carrolli + D. rhopaloa + D. kurseongensis (UFBoot = 81.2, SH-aLRT = 81, aBayes = 1) on the ML tree, and D. limitata + D. ochracea + D. murphyi + D. sproati (LPP = 0.97) and D. sulfugaster bilimbata + D. sulfugaster sulfurigaster (LPP = 0.69) on the ASTRAL tree. Thus, the phylogeny we report here is the first of the genus Drosophila with almost all nodes resolved with high confidence—recent estimates of the Drosophila phylogeny lacked strong support throughout all tree depth levels4244.

Erroneous orthology inference as well as misalignment can impede accurate phylogenetic inference and create artificially long branches45. Repeating our ASTRAL analysis after removing outlier long branches via TreeShrink45 resulted in an identical tree topology with the aforementioned ASTRAL tree (Figure S1). Furthermore, an ML topology estimated from the dataset with more closely related outgroup species (see STAR Methods) results in an identical topology with the aforementioned ML tree (Figure S1). The inferred phylogeny from the protein supermatrix showed only four incongruencies with the phylogeny that was inferred from DNA data (Figure S1): (i) D. villosipedis was recovered as a sister species to D. limitata + D. ochracea + D. murphyi + D. sproati; (ii) D. watanabei + D. punjabiensis is sister to the clade containing D. bakoue and D. jambulina; (iii) D. vulcana and D. seguyi show paraphyletic relationships; (iv) Z. vittiger and Z. lachaisei show sister species relationships. We performed further assessment of nodal support with Quartet Sampling11, using the Quartet Concordance (QC) and Quartet Differential (QD) scores to identify quartet-tree species-tree discordance (STAR Methods). At some nodes, an appreciable fraction of quartets disagreed with our inferred species tree topology (QC < 1), and in most of these cases this discordance was skewed toward one of the two possible alternative topologies (i.e. QD < 1 but > 0) as is consistent with introgression. We formally explore this pattern below.

In order to estimate divergence times across the Drosophila phylogeny, we developed five calibration schemes (A, B, C, D and “Russo”; Data S1) used in MCMCtree46 and one scheme based on the Fossilized Birth-Death (FBD) process47 used in BEAST248 (BEAST2 FBD; Data S1). Overall, four of the five MCMCtree schemes yielded nearly identical age estimates with narrow 95 % credible intervals (CI), whereas scheme “Russo” (a fossil calibration strategy closely matching that from43) showed slightly older estimates (Figure S2) with notably wider 95% CIs. Throughout this manuscript we use the time estimates obtained with scheme A. This calibration analysis estimated that extant members of the genus Drosophila branched off from the other Drosophilidae (Leucophenga, Scaptodrosophila and Chymomyza) ~53 Mya (95% CI: 50 – 56.6 Mya) during the Eocene Epoch of the Paleogene Period (Figure 1). The same analysis inferred that the split between the two major lineages within Drosophila—the subgenera of Sophophora and Drosophila—occurred ~47 Mya (95% CI: 43.9– 49.9 Mya; Figure 1; “A” and “B” clades, respectively); previously published estimates of this time include ~32 Mya (95% CI: 25–40 Mya49), ~63 Mya (95% CI: 39–87 Mya50), and ~56 Mya (95% CI not available43). We also note that our divergence time estimates of the Drosophila subgenus (~34 Mya, 95% CI: 31.6 – 36.8 Mya; clades 6 through 9) are somewhat younger than ~40 Mya, a previous estimate reported in51, although the latter had fairly wide confidence intervals (95% CI: 33.4 – 47.6 Mya). On the other hand, divergence time estimates produced by the FBD scheme in BEAST2 tend to be older especially for deeper nodes (Figure S2). Also, CIs estimated by BEAST2 were wider than those from MCMCTREE. This can be explained by the fewer assumptions about fossil calibration placement and age prior specification for methods that rely on the FBD process. Additionally, we note that not all parameters of the BEAST2 FBD calibration scheme converged (i.e., effective sample size < 100) even after 6 × 108 MCMC generations. Thus, the lack of a thorough fossil record within Drosophila makes it difficult to accurately and precisely estimate divergence times, and point estimates of divergence times should be interpreted with caution.

Figure 1. Fossil calibrated maximum likelihood phylogenetic tree of the genus Drosophila inferred from a supermatrix of 2,791 BUSCO loci (total of 8,187,056 sites).

Figure 1.

The blue distributions at each divergence point on the tree represent nodal age posterior probabilities from MCMCTree. Grauvogelia and Oligophryne fossils were used to set priors on the age of the root of the tree, Phytomyzites and Electrophortica succini were used for priors for the root of the Drosophilidae family, and Electrophortica succini and Scaptomyza dominicana were used to set priors for the crown group “Scaptomyza”, i.e. Most Recent Common Ancestor (MRCA) node of the Scaptomyza species (scheme A; Data S1). The numbered red circles denote clades for which analyses of introgression were performed. Inset: the phylogenetic and temporal relationships between our distant outgroup Anopheles gambiae, more closely related outgroup species of Drosophilidae (Leucophenga varia, Scaptodrosophila lebanonensis and Chymomyza costata), and the Drosophila genus. A and B denote the two inferred major groups within Drosophila. See also Figures S2, S5 and Data S3.

Widespread signatures of introgression across the Drosophila phylogeny

To assess the prevalence of introgression across the Drosophila tree, we subdivided species into nine monophyletic lineages (herein referred to as clades 1 through 9; Figure 1) and tested for introgression within each clade. These clades correspond to the deepest divergences within the genus, with most having an MRCA during the Paleogene. Clades 4 and 5 are the two exceptions, splitting from an MRCA later in the Neogene. Within each of the nine clades, the MRCA of all sampled genomes ranged from ~10 Mya (Figure 1; clade 2) to ~32 Mya (Figure 1; clade 1). We note that Hirtodrosophila duncani, Drosophila busckii and Drosophila repletoides were not included in these clade assignments as each of these species was the only sampled descendent of a deep lineage; additional taxon sampling is required to assign them to specific monophyletic species groups that could be tested for introgression.

We tested for introgression within each of these nine clades using two complementary phylogenomic methods that rely on the counts of gene trees inferred from the BUSCO loci that are discordant with the inferred trees (hereafter referred to as the discordant-count test or DCT) and the distribution of branch lengths for discordant gene trees (hereafter termed the branch-length test or BLT), respectively, among rooted triplets of taxa (Figure 2). These methods leverage information contained across a set of gene trees to differentiate patterns of discordance that are consistent with introgression from those that can be explained by incomplete lineage sorting alone (see STAR Methods). We found at least one pair of species with evidence of introgression in 7 of the 9 clades according to both DCT and BLT (i.e. the same pair of species showed evidence for introgression that was significant for both tests in the same triplet at an FDR-corrected P-value threshold of 0.05). In clades 1 and 3 there were no species pairs for which the DCT and BLT were significant in the same triplet and both suggest the same introgressing species pair (Data S2). However, both clades had several pairs that were significant according to one test or the other (Data S2). We found even stronger support for introgression using two existing software methods: QuIBL (Data S2), which examines the branch-length distributions of all three gene tree topologies for a triplet8, and HyDe (Data S2), which tests for introgression by counting quartet site patterns52. Specifically, QuIBL detected introgression in 120 of 152 (78.9%) of species pairs detected by both DCT and BLT, as well as 894 additional species pairs not detected by DCT-BLT; we note that BLT and QuIBL approaches are not fully independent, since they both utilize branch-length information. Similarly, HyDe detected introgression in 142 of 152 (93.4%) of species pairs detected by both DCT and BLT, and 898 additional species pairs (the results of HyDe were not qualitatively affected if a more distantly related outgroup, i.e. Anopheles gambiae, was selected, see Data S2). However, we focus here on the intersection between DCT and BLT methods (after correcting each for multiple testing), as this provides a more conservative estimate of the extent of introgression. Supporting this claim, we applied these tests to a gene tree dataset simulated under high levels of ILS53 and observed low false positive rates: 0.054 for DCT, 0.089 for BLT, and 0.009 for their intersection.

Figure 2. Overview of the phylogenomic approaches used to detect introgression.

Figure 2.

(A) The Discordant Count Test (DCT) identifies introgression where a given triplet within the tree shows an excess of gene trees that support one of the two possible divergent topologies. Note that concordant gene trees and corresponding probabilities are included for completeness, although these are not used by our test. (B) The Branch Length Test (BLT) identifies introgression where branch lengths of gene trees that support introgression are shorter than branch lengths of those that support the species tree and the less frequent divergent topology (i.e., the discordant topology putatively due to ILS).

We carried out several analyses to assess the robustness of our results to data quality and evolutionary rate. Specifically, we tested whether aliment length and quality, karyotype evolution 54, and positive selection may potentially affect introgression inference by the DCT and BLT approaches by repeating our analysis after filtering our data to account for each of these potential confounders (STAR Methods). Although these filtering schemes overall reduce the number of introgression events, our major conclusions about introgression within Drosophila still hold (Table S1). Indeed, most of the reduction in signal that we observe appears to be driven by the loss of power due to the reduced size of the filtered dataset (see effect of filtering random sets of gene trees; Table S1), rather than data artifacts and/or biological confounding factors which appear to have only a modest impact.

The number of species pairs that show evidence of introgression in our initial DCT-BLT analysis is not equivalent to the number of independent introgression events among Drosophila species. This is because gene flow in the distant past can leave evidence of introgression in multiple contemporary species pairs. For example, we found evidence for introgression between D. robusta and all five species within the D. americana-D. montana group (see clade 7 in Figure 3). Rather than five independent instances of introgression between species, this pattern could reflect introgression between ancestral taxa that subsequently diverged into the contemporary species. More generally, cases where multiple introgressing species pairs each shared the same MRCA may be more parsimoniously explained by a single ancestral introgression event between the branches that coalesce at this node, while those involving only a single species pair may have resulted from introgression between the extant species pair (Data S2). Another example of the former can be seen in clade 6 where the evidence suggests introgression occurred between the Hawaiian Scaptomyza and Drosophila (Figure S3) that are estimated to have diverged from each other more than 20 Mya. This ancient introgression may have occurred prior to the formation of Kauai island ~5 Mya which is now the oldest high island with extant species in these two groups55,56.

Figure 3. Patterns of introgression inferred for the monophyletic clades 2, 7 and 9.

Figure 3.

The matrix shows inferred introgression proportions as estimated from gene tree counts for the introgressed species pairs (STAR Methods), and then mapped to internal branches using the f-branch method20. The expanded tree at the top of each matrix shows both the terminal as well as ancestral branches. The tree on the left side of each matrix represents species relationships with mapped introgression events (red arrows) derived from the corresponding f-branch matrix (STAR Methods). The fractions next to each arrow represent the number of triplets that support a specific introgression event by both DCT and BLT divided by the total number of triplets that could have detected the introgression event. Dashed arrows represent introgression events with low support (triplet support ratio < 10%). See also Figures S3S4, Tables S1S2 and Data S2, S4S5.

To summarize our DCT-BLT results and estimate both the number of introgression events and the proportion of the genome that introgressed during those events (ϒ) we adapted the f-branch heuristic20 (implemented in Dsuite57; STAR Methods). Summed across all clades, our f-branch results suggest that at least 30 introgression events are required to explain our DCT-BLT results (Figure 3 and Figure S2). Clades 2, 4, 6, 7 and 9 showed the strongest evidence of introgression, in terms of both the total number of DCT-BLT significant triplets and estimates from Dsuite that support those events (Table 1). For example, in clade 2 Dsuite suggests an ancestral introgression event between the branch leading to D. obscura and D. bifasciata and the branch that leads to the clade containing D. pseudoobscura and D. affinis. Furthermore, this particular signal is characterized by a large fraction of introgressed genetic material ( = 0.237, Table 1) and by the large number of triplets that are significant according to both DCT and BLT (26 out of 28 total triplets that could detect this event are significant according to both tests). We stress that both our methods used to detect introgression (DCT and BLT) and our approaches for counting introgression events (f-branch) are conservative, and thus the true number of events could be substantially greater, as suggested by our analyses using QuIBL and HyDe. Regardless of the method used, careful examination of results in Data S2, Figure 3 and Figure S2 reveals that deep introgression events are clearly the best explanation for some of our patterns (e.g. the case from clade 7 involving D. robusta described above), although more recent events may have occurred as well (e.g. between D. pachea and D. acanthoptera; Data S2, clade 7).

Table 1. Placements, support and timing of introgression events across the Drosophila phylogeny.

Putative introgression events (⟷) are specified between different clades indicated by the pair of species in that clade with the oldest MRCA. The triplet ratio shows the number of significant and non-significant triplets according to DCT-BLT. The average γ was obtained from the f-branch results. The durations of the two introgressing lineages are represented by predicted lower and upper boundaries of credible intervals (95% CIs) estimated by MCMCTree using calibration scheme A.

Clade Introgression Event Triplet ratio (significant/total) Average γ CI lower and upper bounds of lineage duration (Mya)
2 D. obscura…D. bifasciataD. pseudoobscura…D. affinis 26/28 0.237 9.84–5.9, 11.53–6.18
D. subobscura…D. guancheD. pseudoobscura…D. lowei 2/40 0.016 9.84–1.93, 8.39–3.94
D. loweiD. azteca…D. affinis 3/9 0.019 5.77–0, 8.39–2.88
4 D. ficusphilaD. carrolli…D. elegans 5/81 0.049 18.46–0, 16.72–8.55
D. ficusphilaD. erecta…D. eugracilis 19/65 0.035 18.46–0, 14.77–10.3
D. erecta...D. orenaD. mauritiana...D. melanogaster 4/16 0.044 6.38–2.45, 7.71–2.92
5 D. leontiaD. birchii...D. serrata 1/55 0.076 3.07–0, 8.84–6.4*
6 S. pallidaD. cyrtoloma...D. primaeva 13/42 0.03 4.9–0, 24.22–4.52
S. flava...S. montanaD. cyrtoloma...D. prolacticillia 25/56 0.032 16.14–1.87/6.56–3.05
D. primaevaD. cyrtoloma...D. silvestris 1/40 0.02 6.56–0/3.96–2
D. heteroneuraD. grimshawi...D. sproati 1/36 0.021 1.1–0/3.22–2.17*
D. primaevaD. prolacticillia 1/12 0.012 6.56–0/2.16–0
7 D. robustaD. americana...D. montana 49/60 0.113 29.36–0/26.88–4.28
D. pseudotalamancanaD. americana...D. montana 52/55 0.103 23.56–0/26.88–4.28
D. novamexicanaD. arizonae...D. hydei 7/60 0.019 2.99–0/23.56–10.24
D. americana...D. novamexicanaD. arizonae...D. seriema 5/88 0.031 2.99–1.17/12.23–6.71*
D. hydeiD. americana...D. novamexicana 2/28 0.034 13.82–0/2.99–1.17
D. hydeiD. arizonae...D. seriema 3/4 0.234 13.82–0/12.23–6.71
D. robustaD. pseudotalamancana 6/16 0.076 29.36–0/23.56–0
D. pacheaD. acanthoptera 1/1 0.05 5.36–0/4.95–0
8 Z. camerounensisZ. lachaisei 1/1 0.051 2.13–0/2.76–0
Z. camerounensisZ. vittiger 1/2 0.06 2.13–0/3–0
9 D. pruinosaD. arawakana...D. mush sãotomé 47/110 0.138 22.41–0/27.26–18.21
D. funebris...D. mush sãotoméD. albomicans…D. Pruinose 206/288 0.031 22.7–14.74/27.26–16.86
D. subbadiaD. guttifera…D. mush sãotomé 1/18 0.106 3.19–0/19.51–11.12*
D. innubila…D. mush sãotoméD. funebris 1/2 0.135 15.94–7.48/19.51–0
D. immigransD. neonasuta...D. sulfurigaster sulfurigaster 1/24 0.01 1.44–0/3.5–1.7*
D. immigrans (kari17)D. nasuta 1/9 0.013 1.44–0/2.38–0
D. pallidipennisD. pulaua 1/20 0.045 18.35–0/1.86–0
D. quadrilineataD. nasuta 1/21 0.01 30.52–0/2.38–0
*

indicates introgressing lineages with no time overlap (according to 95% CIs).

We note that some scenarios of ancestral population structure could potentially result in differences in the number and branch lengths of gene trees with either discordant topology (discussed in STAR Methods). We therefore applied a more stringent version of the DCT-BLT that compares the branch lengths of the discordant topology with those of the concordant topology; this test will not be sensitive to ancestral population structure but could potentially produce many false negatives (STAR Methods). The results of this test (Table S2) show that the majority (~2/3) of our strongly supported putative introgression events are inconsistent with the phenomenon of ancestral population structure-produced false positives. Given that this test is highly conservative, we interpret this result as evidence that the vast majority of our detected introgression events are true positives rather than artifacts of population structure.

To complement our f-branch analysis, we also used PhyloNet58,59 to identify the branches with the strongest signature of introgression in each of the nine monophyletic clades in our tree. Overall, PhyloNet’s results are largely consistent with our DCT-BCT analysis and therefore support our finding of multiple introgression events across the Drosophila phylogeny (see details in Figure S4).

Finally, we asked whether the proportion of the genome that introgressed between putatively introgressing taxa (ϒ) varied with the timing of introgression events (Figure 4). Rather than timing introgression relative to when two hybridizing taxa shared a most recent common ancestor (which would require additional data, such as haplotype lengths of introgressed regions), we leveraged divergence time estimates across the drosophila phylogeny (Figure 1) and estimated when introgression events could have occured in time relative to the present (i.e., Mya). For this analysis, we focused on the 17 “best-supported” introgression events based on the criteria that more than 10% of the total triplets that could detect introgression between a given pair of taxa were significant according to both DCT and BLT (see solid red arrows in Figs. 3 and S3; Table 1). We estimated when these events occurred by taking the maximum, minimum, and midpoint times when the two branches that experienced introgression both coexisted in our dated phylogeny. We note that this approach results in imprecise time estimates, particularly for long branches in the phylogeny; however, it allowed us to test whether there was any obvious relationship between the proportion of the genome that introgressed () and when those introgression events took place in the past. In one instance, the two branches that putatively experienced introgression did not overlap in time in our phylogeny. This situation could be explained by “ghost” introgression with unsampled or extinct lineages. For the 17 remaining introgression events, there was not a significant relationship between the midpoint estimate of timing of introgression (Mya) and (Spearman’s rank correlation: 0.43; P = 0.085; Figure 4). Our analyses therefore support introgression across the evolutionary history of Drosophila, with introgressing species pairs exchanging a similar fraction of the genome (range of average estimates = 0.013 – 0.237) regardless of whether those events were ancient or more recent.

Figure 4. Time and fraction of the genome introgressing for the 17 best-supported introgression events across the Drosophila phylogeny.

Figure 4.

Each horizontal segment summarises one of the 17 introgression events highlighted in Figure 3 and is colored by clade. Segments span the times when the two putatively introgressing taxa both existed and are based on times inferred from the dating analysis summarised in Figure 1. Fraction of the genome that introgressed was estimated as the average f-branch statistic across all triplet comparisons that supported a given introgression event. Mya = million years ago.

Discussion

A time-calibrated tree of drosophilid evolution

Drosophila, as a genus, remains a premier model in genetics, ecology, and evolutionary biology. With over 1,600 species42, the genus has the potential to reveal why some groups are more speciose than others. Yet the phylogenetic relationships among the main groups in the genus have remained largely unresolved (reviewed in 42). Here we estimated a robust time-calibrated phylogeny for the whole genus using multilocus genomic data and calibrated it using a fossil record.

Our results confirm that the genus Drosophila is paraphyletic, with the genera Zaprionus, Scaptomyza, Leucophenga, and Hirtodrosophila each nested within the larger genus Drosophila. Consistent with the subdivisions previously proposed by refs. 60 and 44, clades 1–5 of our phylogeny contain species belonging to the subgenus Sophophora, and include species from the genus Lordiphosa (group A in Figure 1). Clades 6–9 of our phylogeny contain species belonging to the subgenus Drosophila (group B in Figure 1) and include species from the Hawaiian Drosophila and the subgenera Siphlodora, Phloridosa (synonymized with the subgenus Drosophila44, and genus Zaprionus. For more recent radiations within Drosophila, the topology we present is largely congruent with previous studies42,51 but two general observations are notable. First, our results confirm that Lordiphosa is closely related to the saltans and willistoni groups (clade 1) and part of the Sophophora subgenus (consistent with ref. 61). Second, we confirm that Zaprionus is related to the cardini/qunaria/immigrans group (consistent with refs. 42 and 60, but discordant with 43). Despite our well resolved phylogeny, comparisons with other studies emphasize the need to expand species sampling, especially given the potential to generate highly contiguous genomes at relatively low cost62.

Our results from divergence time analysis via MCMCTree suggest that the origin of Drosophila (including the subgenera Sophophora (group A) and Drosophila (group B)) occurred during the Eocene Epoch of the Paleogene, which is younger than estimates by 60 and 43, but older than estimates by 49. Different estimates of divergence times may be the result of different calibration information used, such as mutation rates, the time of formation of the Hawaiian Islands, and the fossil record. However, our comparison of various calibration schemes suggests that the choice of calibration information has a minor effect on MCMCTree’s age estimation (Figure S2). Additionally, credible intervals around our estimates tend to be notably narrower than in all of the aforementioned studies. In contrast to the previous studies, we used genome-scale multilocus data which would be expected to improve both the accuracy and precision of age estimates63,64.

On the other hand, we note that our analyses in BEAST2 using the FBD model yielded significantly older ages (Figure S2) especially for deeper nodes and with markedly wider credible intervals suggesting origination of Drosophila lineage in the Late Cretaceous. These calibration inconsistencies may arise as a result of the poor fossil record within Drosophila (only Scaptomyza dominicana from Dominican amber) and selection of the oldest fossils for deeper radiations, which together can lead to overestimation of nodal ages under the FBD model65. Moreover, the poor convergence behavior we observed would also be expected to produce larger credible intervals.

The extent of introgression in Drosophila

Access to genome-scale data has reinvigorated the study of hybridization and introgression14. We used genome-scale sequence data to provide the first systematic survey of introgression across the phylogeny of drosophilid flies. Our complementary—and conservative—approaches identified overlapping evidence for introgression within seven of the nine clades we analyzed (Figs. 3 and S3, Data S2). We conclude that at least 30 pairs of lineages have experienced introgression across Drosophila’s history (Table 1), though we note that other methods recover more introgression events (Data S2) and thus we cannot rule out the possibility that the true number is substantially higher. Moreover, we find that in many cases a substantial fraction of the genome is introgressed: our estimates indicate that numerous introgression events have altered gene tree topologies for >10% of the genome (Figs. 3 and S3, Table 1). Studies in contemporary Drosophila species suggest that selection may constrain the evolution of mixed ancestry, at least in naturally occurring9,36,66 and experimental admixed populations67,68. The results we have presented here used phylogenetic signals to show that introgression has nonetheless occurred and left a detectable signal within the genomes of many extant Drosophila.

In addition to providing an estimate of the extent of introgression, our results are informative about the timing of introgression among Drosophila lineages: the approaches we used to estimate the number of introgression events, and map them onto the phylogeny could potentially overestimate the timing of introgression if multiple independent more recent events are mistaken for one ancestral event. However, as described in the Results, both our PhyloNet analyses and a careful examination of our DCT-BLT results are most consistent with ancient introgression events in many cases. We also find evidence for very recent events, and although our analyses did not search for gene flow between sister taxa, previous studies of closely related species in Drosophila have revealed evidence of introgression9,10,29,31,32. Studies that have taken phylogenomic approaches to detect introgression in other taxa have also reported evidence for introgression between both “ancient” lineages (i.e., those that predate speciation events generating extant species) and extant species8,12,18,19,21. We conclude that introgression between Drosophila flies has similarly occurred throughout their evolutionary history.

Although the signal of introgression across our phylogeny provides evidence for widespread introgression in Drosophila, the evolutionary role of introgressed alleles remains to be tested. For example, the impact of hybridization and introgression on evolution can be diverse, from redistributing adaptive genetic variation23,69,70 to generating negative epistasis between alleles that have evolved in different genomic backgrounds (refs. 7173; reviewed in refs. 15,16,74,75). The number of introgressed alleles that remain in a hybrid lineage depends on their selection coefficients7678, their location in the genome (i.e., sex chromosomes vs. autosomes7981), levels of divergence between the hybridizing species9,82,83, and recombination rates among loci6,84. Previous studies have, for example, shown that Drosophila hybrids often show maladaptive phenotypes36,8589. Similarly, experimental hybrid swarms generated from two independent species pairs of Drosophila have shown that these populations can evolve to represent only one of their two parental species within as few as 10 generations, with the genome of one of their two parental species being rapidly purged from the populations67. These results show how hybrid Drosophila can be less fit than their parents, and further work is needed to determine the evolutionary effects, and the ecological context, of the introgression that we report here. However, our results suggest that not all introgressed material is deleterious in Drosophila, as we find that for some lineages a large fraction of the genome is introgressed (i.e. our ϒ estimates shown in Figs. 3 and S3 and Table 1). These results add to the growing body of literature that document a detectable phylogenetic signal of introgession left within the genomes of a wide range of species radiations that include Drosophila, other dipterans90, lepidopterans8,84,91, humans5,92,93, fungi1,2, and angiosperm plants11,12.

Caveats and future directions

We estimated the number of events required to explain the introgression patterns across the tree and in some cases those events were recovered as relatively ancient. However, our approaches for mapping gene flow events onto the phylogeny was somewhat parsimonious in that it favors older events over repeated and recent introgressions (see STAR Methods), and thus may bias the age of introgression towards ancient events and underestimate the true number of pairs of lineages that have exchanged genetic material. For example, introgression events we inferred at deeper nodes in our phylogeny are often supported by only a subset of comparisons between species pairs that spanned those nodes (e.g. see “ancient” introgression events in clades 2, 7 and 9; Figure 3). It is also possible that some patterns we observe reflect scenarios where introgressed segments have persisted along some lineages but been purged along others. This phenomenon could also cause older gene flow between sister lineages, which should generally be undetectable according to the BLT and DCT methods, to instead appear as introgression between non-sister lineages that our methods can detect. Future work could seek to more precisely reveal the number and timing of gene flow events across this phylogeny, including more recent introgression events and gene flow between extant and extinct/unsampled lineages, a pattern referred to as “ghost” introgression94,95.

Our analyses also do not identify the precise alleles that have crossed species boundaries or reveal the manner in which these alleles may have affected fitness in the recipient population74,75. Genome alignments, complete annotations, and/or population level sampling across the genus are required to determine whether certain genes or functional categories of genes are more likely to cross species boundaries than others. More complete taxonomic sampling, combined with methodological advances for inferring the number and timing of introgression events in large phylogenies, will increase our ability to identify the specific timing and consequences of introgression across Drosophila.

Conclusions

Speciation research has moved away from the debate of whether speciation can occur with gene flow to more quantitative tests of how much introgression occurs in nature, and how this introgression affects the fitness of individuals in the recipient population. Our well-resolved phylogeny and survey of introgression revealed that gene flow has been a relatively common feature across the evolutionary history of Drosophila. Yet, identifying the specific consequences of introgression on fitness and the evolution of species and entire radiations within Drosophila and other systems remains a major challenge. Future research could combine the power of phylogenomic inference with population-level sampling to detect segregating introgression between sister species to further our understanding of the amount, timing, and fitness consequences of admixture for diversification.

STAR METHODS

RESOURCE AVAILABILITY

Lead contact

More information regarding the resources used in this study should be directed to the lead contact Anton Suvorov (antony.suvorov@med.unc.edu)

Materials availability

This study did not generate any new unique reagents.

Data and code availability

The data and code produced during this study are publically available on GitHub (https://github.com/SchriderLab/drosophila_phylogeny) and FigShare (dx.doi.org/10.6084/m9.figshare.13264697). Whole genome sequencing data generated for this study are available on NCBI (BioProject PRJNA675888, BioProject PRJNA593822, and BioProject PRJNA611543).

METHOD DETAILS

Genome assemblies and public data

Genome sequences used by this work were obtained from concurrent projects and public databases. Genome sequencing and assembly for 84 genomes is described in62. These data are available for download at NCBI BioProject PRJNA675888. For the remaining genomes: sequencing and assembly of 8 Hawaiian Drosophila were provided by E. Armstrong and D. Price, described in Armstrong et al. (in prep) and available at NCBI BioProject PRJNA593822; sequences and/or assemblies of five nannoptera group species were provided by M. Lang and V. Courtier-Orgogozo and are available at NCBI BioProject PRJNA611543; 44 were downloaded as assembled sequences from NCBI GenBank; Z. sepsoides and D. neohypocausta were sequenced as paired-end 150bp reads on Illumina HiSeq 4000 at UNC and assembled using SPAdes v3.11.1 with default parameters96; and 15 were generated by assembling short read sequences downloaded from NCBI SRA. For sets of unassembled short reads, we used ABySS v2.2.397 with parameters “k=64” with paired-end reads (typically 100–150bp) to assemble the reads. Finally, outgroup genome sequences (A. gambiae, M. domestica, L. trifolii, C. hians, and E. gracilis) were obtained from NCBI GenBank. See Data S3 for a full list of samples, strain information, accessions, and associated publications.

Orthology Inference

We identified single-copy orthologous genes in each genome using BUSCO (Benchmarking Universal Single-Copy Orthologs; v3.1.098). BUSCO was run with orthologs from the Diptera set in OrthoDB v.9 (odb9) using default parameters. For each species, all BUSCOs found in a single copy were used for phylogenetic analysis.

Assignment of BUSCO genes to Muller elements for obscura group species

Each of the BUSCO genes identified as single-copy in each of the group 12 (obscura group: D. affinis, D. athabasca, D. azteca, D. bifasciata, D. guanche, D. lowei, D. miranda, D. obscura, D. persimilis, D. pseudoobscura, D. subobscura) genome assemblies was assigned to one of the six Muller elements (elements A-F). For D. athabasca, D. bifasciata, D. lowei, D. miranda, D. pseudoobscura, and D. subobscura, contig/scaffold associations with chromosomes and/or Muller elements were simply obtained from NCBI GenBank assembly report tables. For the remaining genomes (D. affinis, D. azteca, D. guanche, D. obscura, D. persimilis), we used whole-genome alignments to infer the Muller element associated with each contig or scaffold. Using the Progressive Cactus99 software, each remaining genome was aligned to a closely related reference genome (D. affinis - D. athabasca; D. azteca - D. athabasca; D. guanche - D. subobscura; D. obscura - D. bifasciata; D. persimilis - D. miranda) with a similar karyotype54,100. Using the reference genomes as backbones, each remaining genome was then scaffolded, with Ragout101. The scaffolds allowed us to annotate each contig in the remaining genomes with Muller element information from the reference genomes (see Data S4). BUSCO genes on unplaced contigs were ignored.

Phylogenetic reconstruction

Every DNA BUSCO locus was aligned with MAFFT v7.427102 using the L-INS-i method. We removed sites that had fewer than three non-gap characters from the resulting multiple sequence alignments (MSAs). These trimmed MSAs were concatenated to form a supermatrix. To assess the quality of the assembled supermatrices we computed pairwise completeness scores in AliStat103 (Figure S5). We inferred a maximum likelihood (ML) phylogenetic tree from the supermatrix (a.k.a. concatenated alignment) using IQ-TREE v1.6.5104, and treated the supermatrix as a single partition. IQ-TREE was run under GTR+I+G substitution model, as inference under any other substitution model will not necessarily lead to better accuracy of tree topology estimation105. To estimate the support for each node in this tree, we used three different reliability measures. We did 1,000 ultrafast bootstrap (UFBoot) replicates106 and additionally performed an approximate likelihood ratio test with the nonparametric Shimodaira–Hasegawa correction (SH-aLRT) and a Bayesian-like transformation of aLRT107. We used the ML gene trees obtained by IQ-TREE with a GTR+I+G substitution model for tree inference in ASTRAL108. For the estimated ASTRAL tree we calculated the support of each node using local posterior probabilities (LPP)108. Also, we created a gene tree set by removing taxa with outlier branch lengths that were potentially produced by misaligned regions and/or incorrect orthology inference in TreeShrink45 under default parameters. This analysis resulted in a small fraction of branches removed from our gene tree set (<5.5%)

We did two additional analyses to verify the robustness of our topology inference. First, we inferred an ML tree using WAG+I+G substitution model from the protein supermatrix obtained from concatenation of protein BUSCO MSAs. MSAs based on amino acid sequences have been shown to have superior accuracy to DNA MSAs for distantly related species109. Second, to verify that long branch attraction did not distort our tree topology, we inferred an ML tree under a GTR+I+G substitution model using a different set of outgroup species from the DNA supermatrix. Specifically, instead of distantly related Anopheles gambiae, we used Musca domestica, Liriomyza trifolii, Curricula hians and Ephydra gracilis together as our outgroup species.

Phylogenetic Support Analysis via Quartet Sampling

We used quartet sampling (QS) as an additional approach to estimate phylogenetic support11. Briefly, QS provides three scores for internal nodes: (i) quartet concordance (QC), which gives an estimate of how sampled quartet topologies agree with the putative species tree; (ii) quartet differential (QD) which estimates frequency skewness of the discordant quartet topologies, and can be indicative of introgression if a skewed frequency observed, and (iii) quartet informativeness (QI) which quantifies how informative sampled quartets are by comparing likelihood scores of alternative quartet topologies. Finally, QS provides a score for terminal nodes, quartet fidelity (QF), which measures a taxon “rogueness”. We did QS analysis using the DNA BUSCO supermatrix described above, specifying an IQ-TREE engine for quartet likelihood calculations with 100 replicates (i.e. number of quartet draws per focal branch).

Fossil Dating

MCMCTREE:

We implemented the Bayesian algorithm of MCMCTree v4.9h46 with approximate likelihood computation to estimate divergence times within Drosophila using several calibration schemes (Data S1). First, we estimated branch lengths by ML and then the gradient and Hessian matrices around these ML estimates in MCMCTree using the DNA supermatrix and species tree topology estimated by IQ-TREE. Because large amounts of sequence data are not essential for accurate fossil calibration110, we performed dating analysis using a random sample of 1,000 MSA loci (out of 2,791) for the sake of computational efficiency. Thus, for this analysis the supermatrix was generated by concatenating 1,000 randomly selected gene-specific MSAs. Using fewer loci (10 and 100) for fossil calibration did not drastically affect nodal age estimation (Figure S1). We removed sites that had less than 80 non-gap characters from all these supermatrices. Second, we used the gradient and Hessian matrix, which constructs an approximate likelihood function by Taylor expansion111, to perform fossil calibration in MCMC framework. For this step we specified a GTR+G substitution model with four gamma categories; birth, death and sampling parameters of 1, 1 and 0.1, respectively. To model rate variation we used an uncorrelated relaxed clock. To ensure convergence, the analysis was run ten times independently for 8 × 106 generations (first 106 generations were discarded as burn-in), logging every 1,000 generations. We used the R package MCMCtreeR112 to visualize the calibrated tree.

BEAST 2:

Additionally we performed fossil calibration using the Fossilized Birth-Death (FBD) process47 as implemented in the Bayesian framework of BEAST 2.6.348. For scalability purposes, we randomly selected 1,000 loci and then partitioned them into 10 supermatrices each consistent of 100 different MSAs. Each of these 10 datasets was treated as a single partition in the downstream analyses. Additionally, we removed sites that had less than 128 non-gap characters from all these supermatrices. To perform fossil calibration, we used a GTR+G model with four gmamma categories, and an optimized relaxed clock113 was used to model rate variation. For the FBD prior we specified an initial origin value of 230 Mya (which corresponds to the age of oldest known dipteran fossil Grauvogelia), and the tree likelihood was conditioned on the proportion of species sampled at present (p= 0.1). The remaining priors were set to their defaults. In order to directly compare divergence time estimation between BEAST 2 and MCMCTree, we used the same fixed IQ-TREE species tree topology with several exceptions. First, we did not fix the phylogenetic positions of contemporary Scaptomyza species and fossil taxon Scaptomyza dominicana within its monophyletic group. Second, we did not constrain relationships of outgroup species L. varia, C. costata, S. lebanonensis including fossil taxon Electrophortica succini. Two additional fossils, Oligophryne and Phytomyzites, were specified for Drosophilidae stem. Furthermore, to accomodate uncertainty of fossil dates we incorporated age ranges for several fossils (Data S1). For each of the 10 datasets we ran 2 independent MCMC chains for 6 × 108 generations with sampling frequency of 10,000 for each model parameter. Additionally, we performed sampling from the prior distribution only. Convergence was assessed using ESS in Tracer114. Divergence times were generated by taking means of posterior nodal ages discarding 25% of the sampled trees as burn-in in TreeAnnotator for each dataset. To drastically improve computational efficiency of likelihood calculations in all BEAST 2 analyses we used the program in conjunction with BEAGLE library115 that enables GPU utilization.

Inferring Introgression Across the Tree

Gene tree-based methods:

In order to detect patterns of introgression we used three different methods that rely on the topologies of gene trees, and the distributions of their corresponding branch lengths, for triplets of species. If the true species tree is ((A, B), C), these tests are able to detect cases of introgression between A and C, or between B and C. These include two of the methods that we devised for this study, and which use complementary pieces of information—the counts of loci supporting either discordant topology, and the branch-length distributions of gene trees supporting these topologies, respectively—to test an introgression-free null model.

The first method we developed was the discordant-count test (DCT), which compares the number of genes supporting each of the two possible discordant gene trees: ((A, C), B) or (A, (B, C)), similar in principle to the delta statistic from116. Genes may support the two discordant topologies (denoted T1 and T2) in the presence of ILS and/or in the presence of introgression. In the absence of ancestral population structure, gene genealogies from loci experiencing ILS will show either topology with equal probability; ILS alone is not expected to bias the count towards one of the topologies. In the presence of introgression, one of the two topologies will be more frequent than the other because the pair of species experiencing gene flow will be sister lineages at all introgressed loci (illustrated in Figure 2). For example, if there is introgression between A and C, there will be an excess of gene trees with the ((A, C), B) topology. The DCT identifies pairs of species that may have experienced introgression by performing a χ2 goodness-of-fit test on the gene tree count values for a species triplet to determine whether their proportions significantly deviate from 0.5, the expected proportion for each gene genealogy under ILS. We used this test on all triplets extracted from BUSCO gene trees within each clade, and the resulting P-values were then corrected for multiple testing using the Benjamini-Hochberg procedure with a false discovery rate (FDR) cutoff of 0.05. We note that these tests are not independent since different triplets may contain overlapping taxa. Thus, while our correction results in more conservative tests57, the inferred FDRs may be somewhat inaccurate.

Second, we devised a branch-length test (BLT) to identify cases of introgression (illustrated in Figure 2). This test examines branch lengths to estimate the age of the most recent coalescence event (measured in substitutions per site). Introgression should result in more recent coalescences than expected under the concordant topology with complete lineage sorting, while ILS shows older coalescence events90. Importantly, ILS alone is not expected to result in different coalescence times between the two discordant topologies, and this forms the null hypothesis for the BLT. For a given triplet, for each gene tree we calculated the distance d (a proxy for the divergence time between sister taxa) by averaging the external branch lengths leading to the two sister taxa under that gene tree topology. We calculated d for each gene tree and denote values of d from the first discordant topology dT1 and those from the second discordant topology dT2. We then compared the distributions of dT1 and dT2 using a Mann-Whitney U test. Under ILS alone the expectation is that dT1 = dT2, while in the presence of introgression dT1 < dT2 (suggesting introgression consistent with discordant topology T1) or dT1 > dT2 (suggesting introgression with consistent with topology discordant T2). The BLT is conceptually similar to the D3 test117, which transforms the values of dT1 and dT2 in a manner similar to the D statistic for detecting introgression92. As with the DCT, we performed the BLT on all triplets within a clade and used a Benjamini-Hochberg correction with a false discovery rate cutoff (FDR) of 0.05. We note that both the DCT and BLT will be conservative in cases where, for a triplet ((A,B), C), there is introgression between A and C as well as B and C, with the extreme case of equal rates of introgression for both species pairs resulting in a complete loss of power.

Finally, we used QuIBL8, an analysis of branch-length distribution across gene trees to infer putative introgression patterns. Briefly, under coalescent theory internal branches of rooted gene trees for a set of 3 taxa (triplet) can be viewed as a mixture of two distributions: one that generates branch lengths under ILS, and the other under introgression/speciation. Thus, the estimated mixing proportions (π1 for ILS and π2 for introgression/speciation; π1 + π2 = 1) of those distribution components show which fraction of the gene trees were generated through ILS or non-ILS processes. For a given triplet, QuIBL computes the proportion of gene trees that support the three alternative topologies. Then for every alternative topology QuIBL estimates mixing proportions along with other relevant parameters via Expectation-Maximization and computes Bayesian Information Criterion (BIC) scores for ILS-only and introgression models. For concordant topologies elevated values of π2 are expected whereas for discordant ones π2 values significantly greater than zero are indicative of introgression. To identify significant cases of introgression here we used a cutoff of ΔBIC < −30 as in8. We ran QuIBL on every triplet individually under default parameters with the number of steps (the numsteps parameter) set to 50 and using Anopheles gambiae for triplet rooting; the branch length between A. gambiae and the triplet is not used for any of QuIBL’s calculations.

We note that the DCT and BLT methods are potentially impacted by ancestral population structure: if the lineages leading to B and C were in subpopulations that were more likely to interbreed in the ancestral population, then the ((B, C), A) topology might be expected to be more prevalent than ((A, C), B), along with a shorter time back to the first coalescence. However, it is unclear how much of a concern ancestral population structure should be for this analysis, as it seems less likely that it would be a pair of lineages that diverged first (i.e. A and C or B and C) that interbred more frequently in the ancestral population instead of the two lineages that went on to be sister taxa (i.e. A and B). Nonetheless, plausible scenarios of ancestral structure supporting one discordant topology over the other can be devised (e.g. ref 118). We therefore conducted a more stringent version of our DCT-BLT combined test that requires the average distance between the two introgressing taxa (when examining gene trees with the discordant topology consistent with introgression) to be less than that between the two sister species (when examining gene trees with the concordant topology). Such a pattern is consistent with introgression between non-sister species, which must occur more recently than the species split and therefore causing more recent coalescence events, but not with ancestral structure which will still result in older coalescence times for discordant trees than the concordant trees (because structure in the ancestral population is only a factor in the case of ILS). Note that this test is expected to be especially conservative because ILS, which for many triplets accounts for a sizable fraction of our discordant gene trees, will push the coalescent times for all discordant topologies back further in time.

We also examined the effect of evolutionary rate heterogeneity measured in branch-specific dN/dS values on introgression detection. To that end, we generated codon alignments for each BUSCO locus using TranslatorX119 and then calculated dN/dS ratios for each gene tree in PAML46 within each clade using a free-ratios branch model that assumes independent dN/dS for each gene tree branch. Then, we evaluated the distribution of dN/dS ratios across all gene trees to determine the 95th percentile value of dN/dS. Thus, we repeated our DCT/BLT analyses for each triplet after excluding every gene tree that had at least one branch with dN/dS > 0.53. Note, branches with dN/dS values where dS< 0.001 or >5 were deemed unreliable and thus were excluded from calculation of a critical value or from downstream filtering. Additionally, we performed random filtering of gene trees to see if this procedure would have a similar impact on downstream introgression-detection as did our dN/dS filter. First, we estimated the distribution of proportions of gene trees retained for each triplet after applying the dN/dS filter. Then, for a given triplet, we randomly drew a number of genes to remove from the aforementioned distribution, and then applied our DCT-BLT method to this triplet after removing the selected number of genes. This process was repeated for each triplet tested in our main analysis to generate a randomly filtered set of DCT-BLT results for each of our 9 clades. We then repeated this entire process 1000 times and noted the average fraction of DCT-BLT results remaining significant after randomly filtering genes.

Our DCT-BLT test assumes that there is no recombination within loci and complete inter-locus independence—these assumptions are commonly made by introgression inference methods10,120,121. We note that intra-locus recombination may interfere with the signatures of introgression by reducing discordant topology counts (because even loci experiencing introgression will have non-introgressed segments), and similarly diluting branch-length signatures of introgression, thereby reducing the sensitivity of our DCT-BLT approach. Nevertheless, site-pattern-based approaches (e.g. HyDe, see below) are not affected by intra-locus recombination as they evaluate each site in an MSA independently.

Site-pattern -based detection of introgression:

Signatures of introgression can be identified by investigating fractions of certain site patterns within MSAs of species quartets. One of the most widely used methods is based on the counts of ABBA-BABA site patterns (aka., Patterson’s D statistic122). Here we used the hybridization model implemented in HyDe52 that implements an alternative invariant-based statistic to test introgression and estimate the fraction of the introgressed genome (ϒ). We ran HyDe analysis on each of the 9 clades using the entire supermatrix and in each case selected the quartet’s outgroup from a sister clade. Additionally, to examine effects of outgroup choice, we ran HyDe analyses with a more distantly related outgroup, Anopheles gambiae for all clades. The resulting P-values for each quartet were corrected for multiple testing using the Bonferroni method. To investigate an individual contribution of each BUSCO locus to introgression, we additionally ran HyDe using BUSCO MSAs with Anopheles gambiae outgroup. We note, however, in this case HyDe’s power to detect introgression will be reduced, especially for short MSAs with <10,000 sites52. A complete summary for each BUSCO locus including introgression results from locus-specific HyDe and BLT/DCT analyses is included in Data S5.

Placing introgression events on the phylogeny:

All the aforementioned methods can infer multiple correlated signatures of introgression especially when triplets/quartets share the same taxa. Thus it can be difficult to interpret these interdependent results. To alleviate this problem,20 devised a simple heuristic metric called f-branch to disentangle and map introgression events detected in multiple correlated species pairs onto the tree. In the original formulation, f-branch examines multiple f4 statistics measured for each species pair and that quantify, the proportion of introgressed material for that pair. However, the calculation of the f4 statistic requires allele frequency measures within each sampled species. Thus, to calculate f-branch statistic, instead of f4 we used the introgression proportion derived from DCT/BLT as follows: ϒ=dis2dis1con+dis1+dis2, where con, dis1 and dis2 represent concordant and discordant counts of gene trees and dis1 < dis2. To compute f-branch statistic from DCT/BLT’s ϒ estimates and to visualize the results within each clade we used the Dsuite python package57.

Dsuite outputs a matrix of ϒ estimates that have been partially collapsed: on one axis of this matrix signals of introgression can appear on ancestral branches, but on the other axis only extant branches are shown. Thus, we manually further collapsed these signatures by parsimoniously assuming that if some lineage A showed evidence of introgression with multiple descendants of some other lineage B that is not ancestral to A, then we considered this to be caused by a single introgression event between A and B. Note that we did not require all descendants of lineage B to share this signature of introgression, and thus this approach could potentially undercount the number of introgression events and overestimate their ages.

Phylogenetic networks:

Introgression generates instances of reticulate evolution such that purely bifurcating trees cannot adequately represent evolutionary history; phylogenetic networks have been shown to provide a better fit to describe these patterns123,124. We used PhyloNet58,59 to calculate likelihood scores for networks generated by placing a single reticulation event (node) in an exhaustive manner, i.e. connecting all possible branch pairs within a clade and determining which of the resulting phylogenetic networks produced the best likelihood score. We note that networks with more reticulation events would most likely exhibit a better fit to observed patterns of introgression but the biological interpretation of complex networks with multiple reticulations is more challenging; thus, we limited the analysis to a single reticulation event even though this will produce false negatives in clades with multiple gene flow events. Because full likelihood calculations with PhyloNet can be prohibitively slow for large networks, for each of clades 1 through 9 we selected a subsample of 10 species in a manner that preserves the overall species tree topology. No subsampling was performed for clade 3 which has fewer than 10 species. Using these subsampled clade topologies, we formed all possible network topologies having a single reticulation node (with the exception of networks having reticulation nodes connecting sister taxa). Because PhyloNet takes gene trees as input, for each clade we subsampled each gene tree to include only the subset of 10 species selected for the PhyloNet analysis (or all species in the case of clade 3); any gene trees missing at least one of these species were omitted from the analysis. Finally, we used the GalGTProb program125 of the PhyloNet suite to obtain a likelihood score for each network topology for each clade. We report networks with the highest likelihood scores.

QUANTIFICATION AND STATISTICAL ANALYSIS

Nodal reliability measures including ultrafast bootstrap (UFBoot), an approximate likelihood ratio test with the nonparametric Shimodaira–Hasegawa correction (SH-aLRT) and a Bayesian-like transformation of aLRT were used to assess phylogenetic nodal support. The DCT approach conducts a χ2 goodness-of-fit test for an imbalance in the number of discordant gene trees, and the BLT approach uses the Mann-Whitney U test to assess the significance of differences in coalescent times between the discordant gene trees (see “STAR Methods” for a detailed description of these tests). Multiple testing correction techniques of a Benjamini-Hochberg with a false discovery rate (FDR) and Bonferroni were used to adjust P-values derived from DCT/BLT and HyDe, respectively, and a significance threshold of 0.05 was applied to these adjusted P-values. The difference of Bayesian Information Criterion scores (ΔBIC) was used to assess significance of QuIBL results at a cutoff of −30. The correlation between timing of introgression events and ϒ was tested using Spearman’s rank correlation.

Supplementary Material

1

Data S1. Fossil calibration schemes, related to Figure 1.

2. Data S2. Results of introgression analyses (DCT, BLT, DCT-BLT, HyDe and QuIBL) for tested species pairs, related to Figure 3.

Columns show tested species pairs, number of triplets that exhibit significant introgression signal for species pair, total number of tested triplets and fraction of significant vs. total tested triplets.

3

Data S3. Taxon sampling and genome assembly information, related to Figure 1.

4

Data S5. Correspondence between BUSCO loci and Muller elements, related to Figure 3.

5

Data S4. BUSCO summary table, related to Figure 3.

7

KEY RESOURCES TABLE

Highlights.

Resolved fossil-calibrated evolutionary history of 155 Drosophila genomes

Phylogenomic evidence of widespread introgression in Drosophila

Evidence of both phylogenetically deep and recent gene flow events in multiple clades

Conservative detection of gene flow via discordant gene tree counts and branch lengths

Acknowledgements

We thank M. Hahn, M. Turelli, A. Yassin, D. Obbard and M. Matschiner for helpful feedback on a previous draft, and M. Hibbins for sharing simulated gene trees. AS and DRS were supported by the NIH under award nos. R00HG008696 and R35GM138286. BYK was supported by the NIH under award no. F32GM135998. JW was supported by the NIH under award no. K01DK119582. DP, AAC were supported by NSF Dimensions of Biodiversity award 1737752. DRM and ERRD were supported by NIH award R01GM121750. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of interests

The authors declare that there is no conflict of interest

REFERENCES

  • 1.Tusso S, Nieuwenhuis BPS, Sedlazeck FJ, Davey JW, Jeffares DC, and Wolf JBW (2019). Ancestral Admixture Is the Main Determinant of Global Biodiversity in Fission Yeast. Mol. Biol. Evol 36, 1975–1989. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Eberlein C, Hénault M, Fijarczyk A, Charron G, Bouvier M, Kohn LM, Anderson JB, and Landry CR (2019). Hybridization is a recurrent evolutionary stimulus in wild yeast speciation. Nat. Commun 10, 923. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Leducq J-B, Nielly-Thibault L, Charron G, Eberlein C, Verta J-P, Samani P, Sylvester K, Hittinger CT, Bell G, and Landry CR (2016). Speciation driven by hybridization and chromosomal plasticity in a wild yeast. Nat. Microbiol 1, 15003. [DOI] [PubMed] [Google Scholar]
  • 4.Lamichhaney S, Berglund J, Almén MS, Maqbool K, Grabherr M, Martinez-Barrio A, Promerová M, Rubin C-J, Wang C, Zamani N, et al. (2015). Evolution of Darwin’s finches and their beaks revealed by genome sequencing. Nature 518, 371. [DOI] [PubMed] [Google Scholar]
  • 5.Racimo F, Sankararaman S, Nielsen R, and Huerta-Sánchez E (2015). Evidence for archaic adaptive introgression in humans. Nat. Rev. Genet 16, 359–371. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Schumer M, Xu C, Powell DL, Durvasula A, Skov L, Holland C, Blazier JC, Sankararaman S, Andolfatto P, Rosenthal GG, et al. (2018). Natural selection interacts with recombination to shape the evolution of hybrid genomes. Science 360, 656–660. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Vanderpool D, Minh BQ, Lanfear R, Hughes D, Murali S, Harris RA, Raveendran M, Muzny DM, Hibbins MS, Williamson RJ, et al. (2020). Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression. PLOS Biol. 18, e3000954. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Edelman NB, Frandsen PB, Miyagi M, Clavijo B, Davey J, Dikow RB, García-Accinelli G, Belleghem SMV, Patterson N, Neafsey DE, et al. (2019). Genomic architecture and introgression shape a butterfly radiation. Science 366, 594–599. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Turissini DA, and Matute DR (2017). Fine scale mapping of genomic introgressions within the Drosophila yakuba clade. PLOS Genet. 13, e1006971. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Lohse K, Clarke M, Ritchie MG, and Etges WJ (2015). Genome-wide tests for introgression between cactophilic \textlessi\textgreaterDrosophila\textless/i\textgreater implicate a role of inversions during speciation. Evolution 69, 1178–1190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Pease JB, Brown JW, Walker JF, Hinchliff CE, and Smith SA (2018). Quartet Sampling distinguishes lack of support from conflicting support in the green plant tree of life. Am. J. Bot 105, 385–403. [DOI] [PubMed] [Google Scholar]
  • 12.Pease JB, Haak DC, Hahn MW, and Moyle LC (2016). Phylogenomics reveals three sources of adaptive variation during a rapid radiation. PLOS Biol 14, e1002379. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Rhymer JM, and Simberloff D (1996). Extinction by Hybridization and Introgression. Annu. Rev. Ecol. Syst 27, 83–109. [Google Scholar]
  • 14.Taylor SA, and Larson EL (2019). Insights from genomes into the evolutionary importance and prevalence of hybridization in nature. Nat. Ecol. Evol 3, 170–177. [DOI] [PubMed] [Google Scholar]
  • 15.Hedrick PW (2013). Adaptive introgression in animals: examples and comparison to new mutation and standing variation as sources of adaptive variation. Mol. Ecol 22, 4606–4618. [DOI] [PubMed] [Google Scholar]
  • 16.Suarez-Gonzalez A, Lexer C, and Cronk QCB (2018). Adaptive introgression: a plant perspective. Biol. Lett 14, 20170688. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Marques DA, Meier JI, and Seehausen O (2019). A Combinatorial View on Speciation and Adaptive Radiation. Trends Ecol. Evol 34, 531–544. [DOI] [PubMed] [Google Scholar]
  • 18.Meier JI, Marques DA, Mwaiko S, Wagner CE, Excoffier L, and Seehausen O (2017). Ancient hybridization fuels rapid cichlid fish adaptive radiations. Nat. Commun 8, 14363. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Li G, Davis BW, Eizirik E, and Murphy WJ (2016). Phylogenomic evidence for ancient hybridization in the genomes of living cats (Felidae). Genome Res 26, 1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Malinsky M, Svardal H, Tyers AM, Miska EA, Genner MJ, Turner GF, and Durbin R (2018). Whole-genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow. Nat. Ecol. Evol 2, 1940–1955. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Svardal H, Quah FX, Malinsky M, Ngatunga BP, Miska EA, Salzburger W, Genner MJ, Turner GF, and Durbin R (2020). Ancestral Hybridization Facilitated Species Diversification in the Lake Malawi Cichlid Fish Adaptive Radiation. Mol. Biol. Evol 37, 1100–1113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Chen N, Cai Y, Chen Q, Li R, Wang K, Huang Y, Hu S, Huang S, Zhang H, Zheng Z, et al. (2018). Whole-genome resequencing reveals world-wide ancestry and adaptive introgression events of domesticated cattle in East Asia. Nat. Commun 9, 2337. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Jones MR, Mills LS, Alves PC, Callahan CM, Alves JM, Lafferty DJR, Jiggins FM, Jensen JD, Melo-Ferreira J, and Good JM (2018). Adaptive introgression underlies polymorphic seasonal camouflage in snowshoe hares. Science 360, 1355–1358. [DOI] [PubMed] [Google Scholar]
  • 24.Platt RN, McDew-White M, Le Clec’h W, Chevalier FD, Allan F, Emery AM, Garba A, Hamidou AA, Ame SM, Webster JP, et al. (2019). Ancient Hybridization and Adaptive Introgression of an Invadolysin Gene in Schistosome Parasites. Mol. Biol. Evol 36, 2127–2142. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Richards EJ, and Martin CH (2017). Adaptive introgression from distant Caribbean islands contributed to the diversification of a microendemic adaptive radiation of trophic specialist pupfishes. PLOS Genet 13, e1006919. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Turelli M, Lipkowitz JR, and Brandvain Y (2014). On the Coyne and Orr-Igin of Species: Effects of Intrinsic Postzygotic Isolation, Ecological Differentiation, X Chromosome Size, and Sympatry on Drosophila Speciation. Evolution 68, 1176–1187. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Brand CL, Kingan SB, Wu L, and Garrigan D (2013). A Selective Sweep across Species Boundaries in Drosophila. Mol. Biol. Evol 30, 2177–2186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Dyer KA, Bewick ER, White BE, Bray MJ, and Humphreys DP (2018). Fine-scale geographic patterns of gene flow and reproductive character displacement in Drosophila subquinaria and Drosophila recens. Mol. Ecol 27, 3655–3670. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Garrigan D, Kingan SB, Geneva AJ, Andolfatto P, Clark AG, Thornton KR, and Presgraves DC (2012). Genome sequencing reveals complex speciation in the Drosophila simulans clade. Genome Res 22, 1499–511. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Kang L, Garner HR, Price DK, and Michalak P (2017). A Test for Gene Flow among Sympatric and Allopatric Hawaiian Picture-Winged Drosophila. J. Mol. Evol 84, 259–266. [DOI] [PubMed] [Google Scholar]
  • 31.Mai D, Nalley MJ, and Bachtrog D (2020). Patterns of Genomic Differentiation in the Drosophila nasuta Species Complex. Mol. Biol. Evol 37, 208–220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Schrider DR, Ayroles J, Matute DR, and Kern AD (2018). Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLOS Genet 14, e1007341. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Kao JY, Lymer S, Hwang SH, Sung A, and Nuzhdin SV (2015). Postmating reproductive barriers contribute to the incipient sexual isolation of the United States and Caribbean Drosophila melanogaster. Ecol. Evol 5, 3171–3182. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Matute DR, and Ayroles JF (2014). Hybridization occurs between Drosophila simulans and D. sechellia in the Seychelles archipelago. J. Evol. Biol 27, 1057–68. [DOI] [PubMed] [Google Scholar]
  • 35.Sawamura K, Sato H, Lee C-Y, Kamimura Y, and Matsuda M (2016). A Natural Population Derived from Species Hybridization in the Drosophila ananassae Species Complex on Penang Island, Malaysia. Zoolog. Sci 33, 467–475. [DOI] [PubMed] [Google Scholar]
  • 36.Cooper BS, Sedghifar A, Nash WT, Comeault AA, and Matute DR (2018). A Maladaptive Combination of Traits Contributes to the Maintenance of a Drosophila Hybrid Zone. Curr. Biol 28, 2940–2947.e6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Lachaise D, Harry M, Solignac M, Lemeunier F, Bénassi V, and Cariou ML (2000). Evolutionary novelties in islands: Drosophila santomea, a new melanogaster sister species from São Tomé. Proc. R. Soc. Lond. B 267, 1487–1495. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Matute DR (2010). Reinforcement of gametic isolation in Drosophila. PLoS Biol 8, e1000341. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Seppey M, Manni M, and Zdobnov EM (2019). BUSCO: Assessing Genome Assembly and Annotation Completeness. In Gene Prediction: Methods and Protocols Methods in Molecular Biology, Kollmar M, ed. (Springer; ), pp. 227–245. [DOI] [PubMed] [Google Scholar]
  • 40.Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, and Zdobnov EM (2017). BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Pease JB, and Hahn MW (2013). More Accurate Phylogenies Inferred from LowRecombination Regions in the Presence of Incomplete Lineage Sorting. Evolution 67, 2376–2384. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.O’Grady PM, and DeSalle R (2018). Phylogeny of the Genus Drosophila. Genetics 209, 1–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Russo CAM, Mello B, Frazão A, and Voloch CM (2013). Phylogenetic analysis and a time tree for a large drosophilid data set (Diptera: Drosophilidae). Zool. J. Linn. Soc 169, 765–775. [Google Scholar]
  • 44.Yassin A (2013). Phylogenetic classification of the Drosophilidae Rondani (Diptera): the role of morphology in the postgenomic era. Syst. Entomol 38, 349–364. [Google Scholar]
  • 45.Mai U, and Mirarab S (2018). TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics 19, 272. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Yang Z (2007). PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol 24, 1586–1591. [DOI] [PubMed] [Google Scholar]
  • 47.Heath TA, Huelsenbeck JP, and Stadler T (2014). The fossilized birth–death process for coherent calibration of divergence-time estimates. Proc. Natl. Acad. Sci 111, E2957–E2966. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Bouckaert R, Vaughan TG, Barido-Sottani J, Duchêne S, Fourment M, Gavryushkina A, Heled J, Jones G, Kühnert D, Maio ND, et al. (2019). BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLOS Comput. Biol 15, e1006650. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Obbard DJ, Maclennan J, Kim K-W, Rambaut A, O’Grady PM, and Jiggins FM (2012). Estimating Divergence Dates and Substitution Rates in the Drosophila Phylogeny. Mol. Biol. Evol 29, 3459–3473. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Tamura K, Subramanian S, and Kumar S (2004). Temporal Patterns of Fruit Fly (Drosophila) Evolution Revealed by Mutation Clocks. Mol. Biol. Evol 21, 36–44. [DOI] [PubMed] [Google Scholar]
  • 51.Izumitani HF, Kusaka Y, Koshikawa S, Toda MJ, and Katoh T (2016). Phylogeography of the Subgenus Drosophila (Diptera: Drosophilidae): Evolutionary History of Faunal Divergence between the Old and the New Worlds. PLOS ONE 11, e0160051. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Blischak PD, Chifman J, Wolfe AD, and Kubatko LS (2018). HyDe: A Python Package for Genome-Scale Hybridization Detection. Syst. Biol 67, 821–829. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Hibbins M, and Hahn M (2021). Phylogenomic approaches to detecting and characterizing introgression [DOI] [PMC free article] [PubMed]
  • 54.Bracewell R, Chatla K, Nalley MJ, and Bachtrog D (2019). Dynamic turnover of centromeres drives karyotype evolution in Drosophila. eLife 8, e49002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Magnacca KN, and Price DK (2015). Rapid adaptive radiation and host plant conservation in the Hawaiian picture wing Drosophila (Diptera: Drosophilidae). Mol. Phylogenet. Evol 92, 226–242. [DOI] [PubMed] [Google Scholar]
  • 56.Price JP, and Clague DA (2002). How old is the Hawaiian biota? Geology and phylogeny suggest recent divergence. Proc. R. Soc. B Biol. Sci 269, 2429–2435. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Malinsky M, Matschiner M, and Svardal H (2021). Dsuite - Fast D-statistics and related admixture evidence from VCF files. Mol. Ecol. Resour 21, 584–595. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Than C, Ruths D, and Nakhleh L (2008). PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinformatics 9, 322. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Wen D, Yu Y, Zhu J, and Nakhleh L (2018). Inferring Phylogenetic Networks Using PhyloNet. Syst. Biol 67, 735–740. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Throckmorton LH (1975). The phylogeny, ecolopy, and geography of Drosophila. In Handbook of Genetics, Vol 3., King RC, ed. (Plenum Publishing Corp.), pp. 421–469. [Google Scholar]
  • 61.Katoh T, Tamura K, and Aotsuka T (2000). Phylogenetic Position of the Subgenus Lordiphosa of the Genus Drosophila (Diptera: Drosophilidae) Inferred from Alcohol Dehydrogenase (Adh) Gene Sequences. J. Mol. Evol 51, 122–130. [DOI] [PubMed] [Google Scholar]
  • 62.Kim BY, Wang JR, Miller DE, Barmina O, Delaney E, Thompson A, Comeault AA, Peede D, D’Agostino ERR, Pelaez J, et al. (2020). Highly contiguous assemblies of 101 drosophilid genomes. bioRxiv, 2020.12.14.422775. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Reis MD, and Yang Z (2013). The unbearable uncertainty of Bayesian divergence time estimation. J. Syst. Evol 51, 30–43. [Google Scholar]
  • 64.Yang Z, and Rannala B (2006). Bayesian Estimation of Species Divergence Times Under a Molecular Clock Using Multiple Fossil Calibrations with Soft Bounds. Mol. Biol. Evol 23, 212–226. [DOI] [PubMed] [Google Scholar]
  • 65.Matschiner M (2019). Selective Sampling of Species and Fossils Influences Age Estimates Under the Fossilized Birth–Death Model. Front. Genet 0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Meiklejohn CD, Landeen EL, Gordon KE, Rzatkiewicz T, Kingan SB, Geneva AJ, Vedanayagam JP, Muirhead CA, Garrigan D, Stern DL, et al. (2018). Gene flow mediates the role of sex chromosome meiotic drive during complex speciation. eLife 7, e35468. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Matute DR, Comeault AA, Earley E, Serrato-Capuchina A, Peede D, MonroyEklund A, Huang W, Jones CD, Mackay TFC, and Coyne JA (2020). Rapid and Predictable Evolution of Admixed Populations Between Two Drosophila Species Pairs. Genetics 214, 211–230. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Wang S, Nalley MJ, Chatla K, Aldaimalani R, MacPherson A, Wei K, Corbett R, Mai D, and Bachtrog D (2021). Neo-sex chromosome evolution shapes sex-dependent asymmetrical introgression barrier (Evolutionary Biology) [DOI] [PMC free article] [PubMed]
  • 69.Anderson TM, vonHoldt BM, Candille SI, Musiani M, Greco C, Stahler DR, Smith DW, Padhukasahasram B, Randi E, Leonard JA, et al. (2009). Molecular and evolutionary history of melanism in North American gray wolves. Science 323, 1339–1343. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Dasmahapatra KK (2012). Heliconius genome supplementary information. Nature [Google Scholar]
  • 71.Fishman L, and Sweigart AL (2018). When Two Rights Make a Wrong: The Evolutionary Genetics of Plant Hybrid Incompatibilities. Annu. Rev. Plant Biol 69, 707–731. [DOI] [PubMed] [Google Scholar]
  • 72.Maheshwari S, and Barbash DA (2011). The Genetics of Hybrid Incompatibilities. Annu. Rev. Genet 45, 331–355. [DOI] [PubMed] [Google Scholar]
  • 73.Nosil P, and Schluter D (2011). The genes underlying the process of speciation. Trends Ecol. Evol 26, 160–167. [DOI] [PubMed] [Google Scholar]
  • 74.Baack EJ, and Rieseberg LH (2007). A genomic view of introgression and hybrid speciation. Curr. Opin. Genet. Dev 17, 513–518. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Moran BM, Payne C, Langdon Q, Powell DL, Brandvain Y, and Schumer M (2020). The genetic consequences of hybridization. ArXiv201204077 Q-Bio [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Harris K, and Nielsen R (2016). The Genetic Cost of Neanderthal Introgression. Genetics 203, 881–891. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Kim BY, Huber CD, and Lohmueller KE (2018). Deleterious variation shapes the genomic landscape of introgression. PLOS Genet 14, e1007741. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Sachdeva H, and Barton NH (2018). Introgression of a Block of Genome Under Infinitesimal Selection. Genetics 209, 1279–1303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Geraldes A, Ferrand N, and Nachman MW (2006). Contrasting Patterns of Introgression at X-Linked Loci Across the Hybrid Zone Between Subspecies of the European Rabbit (Oryctolagus cuniculus). Genetics 173, 919–933. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Payseur BA, Krenz JG, and Nachman MW (2004). Differential Patterns of Introgression Across the X Chromosome in a Hybrid Zone Between Two Species of House Mice. Evolution 58, 2064–2078. [DOI] [PubMed] [Google Scholar]
  • 81.Storchová R, Reif J, and Nachman MW (2010). Female Heterogamety and Speciation: Reduced Introgression of the Z Chromosome Between Two Species of Nightingales. Evolution 64, 456–471. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Hamlin JAP, Hibbins MS, and Moyle LC (2020). Assessing biological factors affecting postspeciation introgression. Evol. Lett 4, 137–154. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Kronforst MR, Hansen MEB, Crawford NG, Gallant JR, Zhang W, Kulathinal RJ, Kapan DD, and Mullen SP (2013). Hybridization Reveals the Evolving Genomic Architecture of Speciation. Cell Rep, 666–677. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Martin SH, Davey JW, Salazar C, and Jiggins CD (2019). Recombination rate variation shapes barriers to introgression across butterfly genomes. PLOS Biol 17, e2006288. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Coyne JA, and Orr HA (1997). “Patterns of speciation in Drosophila” revisited. Evolution 51, 295–303. [DOI] [PubMed] [Google Scholar]
  • 86.Coyne JA, and Orr HA (1989). Patterns of speciation in Drosophila. Evolution 43, 362–381. [DOI] [PubMed] [Google Scholar]
  • 87.Serrato-Capuchina A, Schwochert TD, Zhang S, Roy B, Peede D, Koppelman C, and Matute DR (2020). Pure species discriminate against hybrids in the Drosophila melanogaster species subgroup. bioRxiv, 2020.07.22.214924. [DOI] [PubMed] [Google Scholar]
  • 88.Turissini DA, McGirr JA, Patel SS, David JR, and Matute DR (2018). The Rate of Evolution of Postmating-Prezygotic Reproductive Isolation in Drosophila. Mol. Biol. Evol 35, 312–334. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Turissini DA, Comeault AA, Liu G, Lee YCG, and Matute DR (2017). The ability of Drosophila hybrids to locate food declines with parental divergence. Evolution 71, 960–973. [DOI] [PubMed] [Google Scholar]
  • 90.Fontaine MC, Pease JB, Steele A, Waterhouse RM, Neafsey DE, Sharakhov IV, Jiang X, Hall AB, Catteruccia F, Kakani E, et al. (2015). Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science 347. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Dasmahapatra KK, Walters JR, Briscoe AD, Davey JW, Whibley A, Nadeau NJ, Zimin AV, Hughes DST, Ferguson LC, Martin SH, et al. (2012). Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature 487, 94–98. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Green RE, Krause J, Briggs AW, Maricic T, Stenzel U, Kircher M, Patterson N, Li H, Zhai W, Fritz MH-Y, et al. (2010). A Draft Sequence of the Neandertal Genome. Science 328, 710–722. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Juric I, Aeschbacher S, and Coop G (2015). The Strength of Selection Against Neanderthal Introgression. PLoS Genet 12, e1006340. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Durvasula A, and Sankararaman S (2020). Recovering signals of ghost archaic introgression in African populations. Sci. Adv 6, eaax5097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Ottenburghs J (2020). Ghost Introgression: Spooky Gene Flow in the Distant Past. BioEssays 42, 2000012. [DOI] [PubMed] [Google Scholar]
  • 96.Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. (2012). SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J. Comput. Biol 19, 455–477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, et al. (2017). ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res 27, 768–777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, and Zdobnov EM (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212. [DOI] [PubMed] [Google Scholar]
  • 99.Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, Fang Q, Xie D, Feng S, Stiller J, et al. (2020). Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Bracewell R, and Bachtrog D (2020). Complex Evolutionary History of the Y Chromosome in Flies of the Drosophila obscura Species Group. Genome Biol. Evol 12, 494–505. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Kolmogorov M, Armstrong J, Raney BJ, Streeter I, Dunn M, Yang F, Odom D, Flicek P, Keane TM, Thybert D, et al. (2018). Chromosome assembly of large and complex genomes using multiple references. Genome Res 28, 1720–1732. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Katoh K, Misawa K, Kuma K, and Miyata T (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30, 3059–3066. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Wong TKF, Kalyaanamoorthy S, Meusemann K, Yeates DK, Misof B, and Jermiin LS (2020). A minimum reporting standard for multiple sequence alignments. NAR Genomics Bioinforma 2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Nguyen L-T, Schmidt HA, von Haeseler A, and Minh BQ (2015). IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Mol. Biol. Evol 32, 268–274. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.Abadi S, Azouri D, Pupko T, and Mayrose I (2019). Model selection may not be a mandatory step for phylogeny reconstruction. Nat. Commun 10, 934. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106.Minh BQ, Nguyen MAT, and von Haeseler A (2013). Ultrafast approximation for phylogenetic bootstrap. Mol. Biol. Evol 30, 1188–1195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Anisimova M, Gil M, Dufayard J-F, Dessimoz C, and Gascuel O (2011). Survey of Branch Support Methods Demonstrates Accuracy, Power, and Robustness of Fast Likelihood-based Approximation Schemes. Syst. Biol 60, 685–699. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Sayyari E, and Mirarab S (2016). Fast Coalescent-Based Computation of Local Branch Support from Quartet Frequencies. Mol. Biol. Evol 33, 1654–1668. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Bininda-Emonds OR (2005). transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences. BMC Bioinformatics 6, 156. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Anisimova M ed. (2012). Evolutionary Genomics: Statistical and Computational Methods, Volume 2 (Springer Science+Business Media, LLC; ). [Google Scholar]
  • 111.dos Reis M, and Yang Z (2011). Approximate likelihood calculation on a phylogeny for Bayesian estimation of divergence times. Mol. Biol. Evol 28, 2161–2172. [DOI] [PubMed] [Google Scholar]
  • 112.Puttick MN (2019). MCMCtreeR: functions to prepare MCMCtree analyses and visualize posterior ages on trees. Bioinformatics 35, 5321–5322. [DOI] [PubMed] [Google Scholar]
  • 113.Douglas J, Zhang R, and Bouckaert R (2021). Adaptive dating and fast proposals: Revisiting the phylogenetic relaxed clock model. PLOS Comput. Biol 17, e1008322. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Rambaut A, Drummond AJ, Xie D, Baele G, and Suchard MA (2018). Posterior Summarization in Bayesian Phylogenetics Using Tracer 1.7. Syst. Biol 67, 901–904. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115.Ayres DL, Darling A, Zwickl DJ, Beerli P, Holder MT, Lewis PO, Huelsenbeck JP, Ronquist F, Swofford DL, Cummings MP, et al. (2012). BEAGLE: An Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics. Syst. Biol 61, 170–173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116.Huson DH, Klöpper T, Lockhart PJ, and Steel MA (2005). Reconstruction of Reticulate Networks from Gene Trees. In Research in Computational Molecular Biology Lecture Notes in Computer Science, Miyano S, Mesirov J, Kasif S, Istrail S, Pevzner PA, and Waterman M, eds. (Springer; ), pp. 233–249. [Google Scholar]
  • 117.Hahn MW, and Hibbins MS (2019). A Three-Sample Test for Introgression. Mol. Biol. Evol 36, 2878–2882. [DOI] [PubMed] [Google Scholar]
  • 118.Eriksson A, and Manica A (2012). Effect of ancient population structure on the degree of polymorphism shared between modern human populations and ancient hominins. Proc. Natl. Acad. Sci 109, 13956–13960. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 119.Abascal F, Zardoya R, and Telford MJ (2010). TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res 38, W7–W13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120.Flouri T, Jiao X, Rannala B, and Yang Z (2020). A Bayesian Implementation of the Multispecies Coalescent Model with Introgression for Phylogenomic Analysis. Mol. Biol. Evol 37, 1211–1223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 121.Hey J, Chung Y, Sethuraman A, Lachance J, Tishkoff S, Sousa VC, and Wang Y (2018). Phylogeny Estimation by Integration over Isolation with Migration Models. Mol. Biol. Evol 35, 2805–2818. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 122.Patterson N, Moorjani P, Luo Y, Mallick S, Rohland N, Zhan Y, Genschoreck T, Webster T, and Reich D (2012). Ancient Admixture in Human History. Genetics 192, 1065–1093. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 123.Huson DH, and Bryant D (2006). Application of Phylogenetic Networks in Evolutionary Studies. Mol. Biol. Evol 23, 254–267. [DOI] [PubMed] [Google Scholar]
  • 124.Solís-Lemus C, Yang M, and Ané C (2016). Inconsistency of Species Tree Methods under Gene Flow. Syst. Biol 65, 843–851. [DOI] [PubMed] [Google Scholar]
  • 125.Yu Y, Degnan JH, and Nakhleh L (2012). The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection. PLOS Genet 8, e1002660. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

1

Data S1. Fossil calibration schemes, related to Figure 1.

2. Data S2. Results of introgression analyses (DCT, BLT, DCT-BLT, HyDe and QuIBL) for tested species pairs, related to Figure 3.

Columns show tested species pairs, number of triplets that exhibit significant introgression signal for species pair, total number of tested triplets and fraction of significant vs. total tested triplets.

3

Data S3. Taxon sampling and genome assembly information, related to Figure 1.

4

Data S5. Correspondence between BUSCO loci and Muller elements, related to Figure 3.

5

Data S4. BUSCO summary table, related to Figure 3.

7

Data Availability Statement

The data and code produced during this study are publically available on GitHub (https://github.com/SchriderLab/drosophila_phylogeny) and FigShare (dx.doi.org/10.6084/m9.figshare.13264697). Whole genome sequencing data generated for this study are available on NCBI (BioProject PRJNA675888, BioProject PRJNA593822, and BioProject PRJNA611543).

RESOURCES