Plant Communications. 2023 Oct 5;5(2):100728. doi: 10.1016/j.xplc.2023.100728

Widespread incomplete lineage sorting and introgression shaped adaptive radiation in the Gossypium genus

Yanchao Xu 1,2,4, Yangyang Wei 3, Zhongli Zhou 1, Xiaoyan Cai 1,4, Scott A Boden 5, Muhammad Jawad Umer 1, Luqman B Safdar 5, Yuling Liu 3, Dingsha Jin 6, Yuqing Hou 1, Yuhong Wang 1, Sarah Brooke Wall 7, Kunbo Wang 1, Shuxun Yu 1, Baohong Zhang 7,∗∗, Renhai Peng 3,∗∗∗, Fang Liu 1,8,
PMCID: PMC10873890  PMID: 37803827

Abstract

Cotton (Gossypium) stands as a crucial economic crop, serving as the primary source of natural fiber for the textile sector. However, the evolutionary mechanisms driving speciation within the Gossypium genus remain unresolved. In this investigation, we leveraged 25 Gossypium genomes and introduced four novel assemblies—G. harknessii, G. gossypioides, G. trilobum, and G. klotzschianum (Gklo)—to delve into the speciation history of this genus. Notably, we encountered intricate phylogenies potentially stemming from introgression. These complexities are further compounded by incomplete lineage sorting (ILS), a factor likely to have been instrumental in shaping the swift diversification of cotton. Our focus subsequently shifted to the rapid radiation episode during a concise period in Gossypium evolution. For a recently diverged lineage comprising G. davidsonii, Gklo, and G. raimondii, we constructed a finely detailed ILS map. Intriguingly, this analysis revealed the non-random distribution of ILS regions across the reference Gklo genome. Moreover, we identified signs of robust natural selection influencing specific ILS regions. Noteworthy variations pertaining to speciation emerged between the closely related sister species Gklo and G. davidsonii. Approximately 15.74% of speciation structural variation genes and 12.04% of speciation-associated genes were estimated to intersect with ILS signatures. These findings enrich our understanding of the role of ILS in adaptive radiation, shedding fresh light on the intricate speciation history of the Gossypium genus.

Key words: cotton speciation, Gossypium genus, incomplete lineage sorting, ILS, phylogenetic analysis, gene tree resolution


This study reports the investigation of adaptive radiation of the Gossypium genus using four new genome assemblies and 25 publicly available Gossypium genomes. The results reveal intricate phylogenies resulting from introgression and incomplete lineage sorting, enriching our understanding of the complex speciation history of Gossypium.

Introduction

Cotton is a vital crop, providing textile fiber from four domesticated species (Grover et al., 2021). The Gossypium genus is widespread and geographically diverse, containing over 50 species, including eight diploid genome groups (A–G and K genomes) and a more recently evolved allopolyploid group (AD genome) (Wang et al., 2018). The rapid adaptive diversification of flowering plants is often attributed to hybridization, polyploidization, and introgression (Mcgrath and Lynch, 2012; Soltis and Soltis, 2016). In the Gossypium genus, species in different genome clades have a wide distribution and are separated geographically, resulting in barriers to gene flow. However, hybridization between A-genome and D-genome ancestors and subsequent polyploidization has resulted in allotetraploid cotton (Paterson et al., 2012; Wendel et al., 2012; Chen et al., 2020).

Recent studies have reported deep phylogenetic incongruence between nuclear and cytoplasmic genomes in the Gossypium genus (Chen et al., 2017; Wu et al., 2018), which could be due to hybridization or incomplete lineage sorting (ILS). However, the genomic landscape of introgression has remained elusive because its signal is difficult to distinguish from that of ILS. Understanding the genetic patterns underlying the rapid differentiation of flowering plants remains a challenge. Despite numerous sequenced genomes in the Gossypium genus, recent, rapidly diverging phylogenetic lineages remain poorly sampled. For example, no high-quality genome is yet available for an important cotton lineage, subsection Integrifolia (Grover et al., 2019).

In this study, we produced a high-quality, chromosome-scale Gossypium klotzschianum (Gklo) genome using a multi-platform approach to investigate the history of speciation in the Gossypium genus. Our results suggest that ancestral polymorphisms in some genetic loci allowed for survival after geographic speciation, providing new insights into the rapid radiation of the Gossypium genus. This study provides new insights into the population genetics and evolutionary history of cotton and deepens our understanding of the rapid diversification of flowering plants.

Results

Phylogenomic analyses of the Gossypium genus

To gain a comprehensive understanding of the evolutionary history of cotton, we assembled the genomes of four diploid cotton species, G. harknessii (Ghar; D2-2), G. gossypioides (Ggos; D6), G. trilobum (Gtri; D8), and Gklo (supplemental information section 1). In addition, we included 27 previously reported high-quality genomes from Theobroma cacao (Tc), G. kirkii (Gkir), and 25 other Gossypium species in our analysis. Using a set of 1165 conserved orthologous single-copy genes (CSGs), we constructed a Gossypium phylogenetic tree, employing both coalescent-based and concatenation-based approaches (Figure 1A; supplemental information section 2). We also incorporated data from population genomics involving 106 individuals, encompassing diploid cotton, the D subgenome of tetraploid cotton, and an outgroup, Tc (Figure 1B; supplemental information section 3). These methodologies yielded a consistent tree topology, delineating the diploid genome into two prominent clusters characterized by distinct geographic isolation, except for tetraploid cotton. These divisions include the Africa/Asia/Australia group (A clade), comprising the A, B, E–G, and K genomes and the A subgenomes of tetraploid cotton, and the America group (D clade), which encompasses the D genome and D subgenomes of tetraploid cotton. Principal-component analysis (PCA) of the polymorphism data further supported this classification (Figure 1C). Notably, within this phylogenetic tree, E1 (G. stocksii) and D6 (Ggos) emerged as sister lineages to the remaining species in the A and D clades, respectively. However, the normalized quartet score for this tree was approximately 0.818 (Figure 1A). Upon closer examination of the CSG-based trees, it became evident that the conflicting placement of G. stocksii (E1) as a sister lineage to species in the D clade, or the occurrence of admixture within the D clade, contributed to the relatively modest support scores (Figure 1D).
In addition, our analysis revealed that diploid species within the D clade exhibited genetic admixture with B-/E-/C-/G-/K-genome species that had overcome the oceanic barrier (Figure 1B). Drawing on previous findings (Feng et al., 2022), we posited that ancient introgression and ILS potentially underlie the observed contradictory tree topology.

Figure 1.

Phylogenetic analysis of the Gossypium genus.

(A) Phylogenetic relationships between 27 diploid genomes: 25 species in the Gossypium genus and two closely related species. The coalescent-based species tree was constructed in ASTRAL using 1157 conserved genes. The branch length values indicate substitutions between each species and the nearest ancestor.

(B) Ancestry plots at K = 2–6 for 106 Gossypium samples.

(C) PCA plot for 106 samples of the Gossypium genus.

(D) Proportions of gene trees with differing topologies and summary of QuIBL results.

(E) Schematic summary of DFOIL results for five-taxon phylogenies. The numbers correspond to the proportion of introgressed genes for the corresponding type of introgression.

(F) Genome-wide estimates of Patterson’s D, with Theobroma cacao as the outgroup in each test.

To address this, we used the tree-based method DFOIL (Pease and Hahn, 2015) to detect introgression in five-taxon symmetric phylogenies. The results showed significant evidence of introgression between the ancestors of the D and E genomes (Figure 1E). Likewise, a high introgression rate between the ancestor of the D genome and other species in the A clade (0.03208 ± 0.00524 for E and D; 0.01757 ± 0.01066 for E and A) suggested potential admixture of the A and D clades (Figure 1E).

We then applied another tree-based method, quantifying introgression via branch lengths (QuIBL) (Edelman et al., 2019), to evaluate potential reasons for conflicts between gene trees across the Gossypium genus. Based on the Bayesian information criterion (BIC) test, our results indicated that ILS and introgression led to gene tree conflicts in the internal branch of triplet trees (supplemental information section 4). This conclusion was consistent with the results of DFOIL, but we observed a lower introgression rate (0.0063 ± 0.0021 for E and D; 0.0068 ± 0.0035 for E and A) using QuIBL than using DFOIL (Figure 1E).
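The BIC comparison behind this kind of test can be sketched as follows. This is a minimal illustration of the general criterion, BIC = k ln(n) − 2 ln(L), and not QuIBL's own code: the function names, the parameter counts (`k_ils`, `k_mix`), the ΔBIC cutoff, and the log-likelihood values in the usage note are all assumptions chosen for illustration.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def prefers_introgression(
    ll_ils_only: float,
    ll_ils_plus_introgression: float,
    n_gene_trees: int,
    k_ils: int = 1,          # assumed parameter count for the ILS-only model
    k_mix: int = 3,          # assumed parameter count for the ILS + introgression model
    min_delta: float = 10.0  # assumed cutoff on the BIC difference
) -> tuple[bool, float]:
    """Compare an ILS-only model with an ILS + introgression model.

    Returns (decision, delta), where delta = BIC(ILS only) - BIC(ILS + introgression).
    A large positive delta favours the model that includes introgression.
    """
    delta = bic(ll_ils_only, k_ils, n_gene_trees) - bic(
        ll_ils_plus_introgression, k_mix, n_gene_trees
    )
    return delta > min_delta, delta
```

With 200 gene trees and hypothetical log-likelihoods of −520 (ILS only) and −500 (ILS plus introgression), the richer model is preferred; if the gain in log-likelihood is only 2 units, the penalty for the extra parameters outweighs it and the ILS-only model is retained.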

To further verify the presumed admixture, we conducted ABBA-BABA D-statistic tests involving different trios based on two topologies: ((E1, A clade), D) or ((D6, D), E1). With the topology ((D6, D), E1), we found two different patterns: introgression between D_G2 and E1, where D = 0.3253 ± 0.3815 for ((D6, D_G2), E1), and introgression between D6 and E1, where D = −0.3808 ± 0.1530 for ((D6, D_G1), E1). Considering the geographic barriers that likely led to reproductive isolation, we hypothesize that fortuitous introgression events between the D and E1 genomes occurred before the divergence of the D clade (Figure 1F). Overall, the results of D statistics on the topologies ((E1, A clade), D) and ((D6, D), E1) showed significant evidence of admixture between the E1 and D genomes (Figure 1F). On the basis of the resequencing dataset, we therefore posit that phylogenetic discordance was due to introgression in the rapidly differentiated Gossypium genus; however, we could not rule out the potential influence of ILS.
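For readers unfamiliar with the statistic, Patterson's D for a trio ((P1, P2), P3) with an outgroup can be sketched as below. This is a minimal illustration, assuming biallelic sites, an outgroup fixed for the ancestral allele, and derived-allele frequencies as input; the function names are hypothetical, and the sketch omits the block-jackknife step that is normally used to obtain standard errors.

```python
from typing import Sequence, Tuple


def abba_baba_counts(
    p1: Sequence[float], p2: Sequence[float], p3: Sequence[float]
) -> Tuple[float, float]:
    """Expected ABBA and BABA counts from derived-allele frequencies.

    Assumes the outgroup carries the ancestral allele at every site.
    ABBA: derived allele shared by P2 and P3; BABA: shared by P1 and P3.
    """
    abba = sum((1 - a) * b * c for a, b, c in zip(p1, p2, p3))
    baba = sum(a * (1 - b) * c for a, b, c in zip(p1, p2, p3))
    return abba, baba


def patterson_d(
    p1: Sequence[float], p2: Sequence[float], p3: Sequence[float]
) -> float:
    """D = (ABBA - BABA) / (ABBA + BABA).

    D > 0 indicates excess allele sharing between P2 and P3,
    D < 0 indicates excess sharing between P1 and P3.
    """
    abba, baba = abba_baba_counts(p1, p2, p3)
    if abba + baba == 0:
        raise ValueError("no informative (ABBA or BABA) sites")
    return (abba - baba) / (abba + baba)
```

Under this sign convention, a positive D for ((D6, D_G2), E1) points to excess sharing between D_G2 and E1, and a negative D for ((D6, D_G1), E1) points to excess sharing between D6 and E1, matching the interpretation given in the text.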

Rapid radiation of D-genome cotton

Recent studies reported that ∼13 extant species of D-genome cotton (Figures 2A and 2B) were naturally distributed in tropical and subtropical regions of America (Figure 2C). In addition to the conflicts observed in the CSG-based species trees, we also found a conflicted tree topology in the D clade when chloroplast data were used to infer phylogenies (Supplemental Figure 1). In both the species tree and the chloroplast tree, four major clades maintained the closest internal relationships: clade Inter, containing Gklo and Gdav; clade Houz, containing Gtri and Gthu; clade Cadu, containing Ghar, Gtur, and the D2-1 species G. armourianum; and clade Erio, containing the D4 species G. aridum (Gari), the D9 species G. laxum, the D7 species G. lobatum, and the D11 species G. schwendimanii (Gsch). However, the tree based on chloroplast data (chloroplast tree) combined D5 and D6 as sister species and mixed with the Erioxylum subsection; by contrast, our tree showed D5, the Integrifolia subsection, and the Houzingenia subsection as the closest relatives, whereas D6 was the sister clade to the common ancestor of other D-genome species (Supplemental Figure 1).

Figure 2.

Introgression and incomplete lineage sorting (ILS) test for the Gossypium D clade.

(A) Diversity in flower morphology of Gossypium D-clade species.

(B) Phylogenetic relationship of D-clade species based on 3527 orthologous CSGs.

(C) Schematic diagram showing the geographic origins of D-clade species.

(D) Phylogenetic network constructed with PhyloNetworks. The cladogram includes blue lines, which represent inter-specific introgression events.

(E) ML tree inferred using TreeMix with seven migration events allowed. G. somalense was assigned as the outgroup. Migration edges are depicted as arrows with color indicating the migration weight.

(F) Five-taxon phylogenies with the proportion of introgressed genes inferred via DFOIL analysis. Purple and blue numbers represent the corresponding taxa and show ancient introgression with Erio.

This indicates that there were at least three hybridization events in the D genome that would have resulted in the observed topology of the chloroplast tree: H1, hybridization between the ancestor of the Houzingenia subsection and the ancestor of Erioxylum; H2, hybridization between the ancestor of Caducibracteolata and the ancestor of Erioxylum; H3, hybridization between D5 and D6; or H4, hybridization between D5 or D6 and the ancestor of Erioxylum.

We inferred putative hybridization events using two different methods: the SNP-based TreeMix (Pickrell and Pritchard, 2012) and the tree-based PhyloNetworks (Solis-Lemus et al., 2017). PhyloNetworks infers a phylogenetic network based on maximum pseudolikelihood estimation of species networks by applying quartets under ILS. TreeMix infers population mixture from genome-wide allele frequency data. To reduce the computational time and resources required, we selected eight species that represent the major lineages from all 19 species: Gklo (clade Inter), Gthu (clade Houz), Ghar (clade Cadu), Gsch (clade Erio1), Gari (clade Erio2), Grai (clade Aust), GdarD (clade tetraploid cotton, D subgenome of G. darwinii), and Ggos (clade Sele). In PhyloNetworks, we performed 10 network searches with pre-set 1–10 putative reticulation events. The results indicated three hybridization events among the major clades of D-genome cotton (−log(pseudolikelihood) = 11.5709) (Figure 2D and Supplemental Figure 2A). This finding supported ancient hybridization from Erio2 to the ancestors of Houz, Cadu, Aust, and Inter and between clades Houz and Cadu. We also conducted gene flow inference in TreeMix using pre-set 0–10 putative hybridization events (Figure 2E). The results supported four gene flow events (the optimal number of migration edges on population trees = 7, with three events between the E and D genomes) in the D group (Supplemental Figures 2B and 2C). Gene flow was highest from Erio to Aust (H4; compared with that from Houz to Sele, from Inter to Erio, and from GdarD to Houz). Although the identified hybridization events differed between PhyloNetworks and TreeMix, events between Erio and other clades were detected using both methods. In addition, several hybridization events, including H1, H2, and H4, were detected by DFOIL (Figure 2F).
Scenarios H1, H2, and H4 (as described earlier) could plausibly account for the discordance observed between the chloroplast and species trees, aligning with the findings obtained from the ABBA-BABA D-statistic analysis (Supplemental Figure 3A).

The hybridization rate was lower than the rate of gene tree discordance with species trees in several major triplets, which suggests that hybridization alone may not sufficiently explain the gene tree conflicts. Therefore, we also used QuIBL to further assess putative hybridization and ILS rates. Based on the BIC test in the QuIBL analysis, both ILS and gene flow resulted in reticulate evolution in the D clade (Supplemental Table 31). The introgression rate supported the hybridization events mentioned above (Supplemental Figure 3B). Moreover, strong ILS signals (>0.1) were detected among members of the D group (Supplemental Figure 3C), and ILS rates were greater than hybridization rates in the D group. Regardless of the method (QuIBL, PhyloNetworks, ABBA-BABA test, or DFOIL) or dataset (CSG or population resequencing data) used, our results suggested that ancient gene flow and ILS were widespread in the D clade. Here, we outline a detailed consensus scenario showing gene flow and ILS in the D clade based on the whole-genome comparative analysis (Figures 2D–2F).

ILS contributed to the rapid divergence of Gklo and Gdav

The D-clade species tree demonstrates that several recently diverged species, even without reproductive isolation, rapidly speciated because of geographic distribution or phenotypic differences (Fryxell, 1979; Wendel and Grover, 2015). The above results suggest a complex reticulate evolution system in the D group of the genus Gossypium, which may have been caused by hybridization and ILS. Speciation of Gklo and Gdav provides a system for understanding the rapid radiation of Gossypium and potentially other flowering plants (supplemental information section 5).

Gklo and Gdav were estimated to have diverged ∼390 thousand years ago (kya) in the Middle Pleistocene Epoch. Inference of recent demographic history indicates that Gklo experienced a rapid (and still ongoing) population decline from 200 kya, at the beginning of the Last Interglacial (which occurred ∼120–200 kya), due to a drop in global temperature during this period. The Gdav population experienced a drastic fluctuation from 40 to 20 kya (supplemental information section 5).

To further understand the divergence of Gdav and Gklo, we investigated the genetic and epigenetic landscapes of the two species by studying genomic features, genome-wide profiles of three-dimensional genome structure, histone H3 lysine 4 trimethylation (H3K4me3), and H3K27me3. We found that the genomic and epigenomic landscapes of Gdav and Gklo were generally conserved and that relatively few differentially expressed genes, DNVs, and speciation structural variations (SSVs) may underlie the divergence of Gdav and Gklo (supplemental information section 5).

The higher quality and contiguous nature of the new Gklo assembly and other recently published cotton genomes (Gthu, Grai, GdarD, and Gdav) provided an opportunity to generate a higher-resolution ILS map. A total of 2.47% of the gene trees constructed using protein-coding genes exhibited inconsistent resolution (Figure 3A). These conflicts may have been the result of ILS or hybridization. Lower introgression rates (0.0198–0.0436) were detected among Grai, GdarD, Gklo, and Gdav using DFOIL and QuIBL (Figures 3B and 3C); there was limited evidence of gene flow among these four species (Figure 2). ILS was therefore the most likely explanation for conflicting gene trees. This was consistent with a recent, relatively short period of divergence for Gdav and Gklo and the geographic separation of Gklo, Gdav, and Grai. For simplicity, we considered all inconsistent tree topologies to have resulted from ILS in subsequent analyses.

Figure 3.

ILS analysis in Grai, Gklo, and Gdav.

(A) Phylogenetic relationships between Grai, Gklo, and Gdav. Whole-genome ILS cladogram for Grai/Gdav (red) and Grai/Gklo (blue).

(B) Schematic summary of DFOIL results for five-taxon phylogenies with four in-group taxa (Gklo, Gdav, Grai, and GdarD) and an outgroup (Gthu).

(C) Mean total proportion of introgression/ILS rates as inferred via QuIBL analysis.

(D) Schematic map of clustered ILS segments (500-bp resolution). The lighter density plot represents clustered ILS events mapped to intragenic regions. Vertical lines indicate the subset that overlaps with exons.

(E) Distribution of distances between ILS segments (inter-ILS) (500-bp resolution) compared with a simulated null expectation (based on 100 simulations) revealed a bimodal pattern with a subset that was clustered and significantly non-randomly distributed.

(F) Distribution of nucleotide diversity (π) estimated over 500-bp non-overlapping windows. ILS segments show significantly lower nucleotide diversity, whether or not they are located in genic regions.

We extracted ∼806.5 Mb of orthologous regions from whole-genome alignments (WGAs) of Grai, Gklo, Gdav, and Gthu. These regions were used together with six datasets (comprising genomes fragmented into non-overlapping windows of 20, 10, 5, 2, 1, and 0.5 kb) to reconstruct the Gthu, Gdav, Gklo, and Grai phylogenies. Conflicting window-based trees were classified as resulting from Grai/Gklo and Grai/Gdav ILS (Figure 3A). Although higher ILS rates were observed in the 500-bp window dataset, there was no significant variation in ILS rates across the six datasets (supplemental information section 6). In the 500-bp window dataset, ∼1.46% of the Grai genome was shown to be most closely related to Gdav or Gklo; 0.47% of the fragments were closer to the Gklo genome than the Gdav genome (i.e., Grai/Gklo ILS), whereas 0.99% of the Grai genome fragments were closer to the Gdav genome than the Gklo genome (Grai/Gdav ILS). The majority of the ILS segments were biased toward repeat regions and were significantly enriched in the Gypsy element (Supplemental Table 32). Remarkably, gene content was higher in the 500-bp window dataset than in the other datasets (Supplemental Table 33). Because the short segment length was helpful in accurately identifying ILS (Mao et al., 2021; Feng et al., 2022), we hypothesized that some ILS events led to new biological functions. By measuring the distance between ILS segments, we also found a subset of sites that were significantly more closely clustered than would be expected based on random chance (Figures 3D and 3E).
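The two steps described above, classifying each window tree by its sister pair and testing whether ILS windows cluster more tightly than chance, can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the function names are hypothetical, window positions are treated as integer window indices, and the null model (uniform random placement of the same number of windows, compared by median gap) is an assumed simplification.

```python
import random
from collections import Counter
from statistics import median
from typing import Dict, Iterable, List, Sequence, Tuple

# Sister pair expected under the species tree, with Gthu as the outgroup.
SPECIES_TREE_PAIR = frozenset({"Gklo", "Gdav"})
ILS_LABELS = {
    frozenset({"Grai", "Gklo"}): "Grai/Gklo ILS",
    frozenset({"Grai", "Gdav"}): "Grai/Gdav ILS",
}


def classify_window(sister_pair: Tuple[str, str]) -> str:
    """Label one window tree by which two ingroup taxa are sisters."""
    pair = frozenset(sister_pair)
    if pair == SPECIES_TREE_PAIR:
        return "species tree"
    if pair in ILS_LABELS:
        return ILS_LABELS[pair]
    raise ValueError(f"unexpected sister pair: {sister_pair}")


def topology_fractions(sister_pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of windows supporting each of the three topologies."""
    counts = Counter(classify_window(p) for p in sister_pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def inter_segment_distances(window_indices: Sequence[int]) -> List[int]:
    """Gaps between consecutive ILS windows (in units of windows)."""
    ordered = sorted(window_indices)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def clustering_p_value(
    ils_windows: Sequence[int], n_windows: int, n_sim: int = 100, seed: int = 0
) -> float:
    """Share of random placements whose median gap is <= the observed one.

    A small value means the observed ILS windows sit closer together
    than expected under uniform random placement.
    """
    rng = random.Random(seed)
    observed = median(inter_segment_distances(ils_windows))
    hits = 0
    for _ in range(n_sim):
        simulated = rng.sample(range(n_windows), len(ils_windows))
        if median(inter_segment_distances(simulated)) <= observed:
            hits += 1
    return hits / n_sim
```

The default of 100 simulations mirrors the number reported in the Figure 3E legend; a larger number would give a finer-grained empirical p-value.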

On the basis of the results described above, we next focused on protein-coding genes (as defined in the Gklo genome annotation). We found a total of 4361 exons that overlapped with ILS regions, specifically 456 exons in Grai/Gklo ILS regions and 3905 exons in Grai/Gdav ILS regions (Supplemental Table 34). These ILS exons were associated with 3106 genes. A Gene Ontology enrichment analysis of the genes in Grai/Gklo and Grai/Gdav ILS regions confirmed enrichment of genes related to ion transport, such as sodium transport (39 genes), potassium ion transmembrane transport (22 genes), and calcium ion transmembrane transport (25 genes); it also confirmed enrichment of genes related to macromolecule transport, e.g., intracellular protein transport (52 genes) (Supplemental Table 35). We observed that both genic and non-genic ILS regions showed significantly reduced nucleotide diversity (π), consistent with natural selection acting on these regions (Figure 3F). We then extended the ILS analysis to the recent polyploidization of tetraploids by including genome data from GdarD. As expected, ILS estimates (based on orthologous protein-coding genes, OPGs) involving the D subgenome of tetraploid cotton (Dt) increased to 15.57% (Supplemental Table 36). We found that 1.04% of genes of Grai or GdarD were inherited from Gklo. Of the conflicting gene tree relationships between Gdav and Gklo (clade Inter), 56.73% were mapped to Inter/Gdar ILS regions, and 43.27% were mapped to Inter/Grai ILS regions.
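Nucleotide diversity (π) over non-overlapping windows, as plotted in Figure 3F, can be sketched as the mean number of pairwise differences per site. This is a minimal illustration that assumes equal-length, gap-free aligned sequences and ignores missing data; the function names are hypothetical, and the study's own estimates were presumably produced with dedicated population-genetics software.

```python
from itertools import combinations
from typing import List, Sequence


def nucleotide_diversity(seqs: Sequence[str]) -> float:
    """Mean pairwise differences per site across all sequence pairs."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in combinations(seqs, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_diffs / (n_pairs * length)


def windowed_pi(seqs: Sequence[str], window: int = 500) -> List[float]:
    """π for each full non-overlapping window along the alignment."""
    length = len(seqs[0])
    return [
        nucleotide_diversity([s[start:start + window] for s in seqs])
        for start in range(0, length - window + 1, window)
    ]
```

For the toy alignment `["AAAA", "AAAT", "AATT"]`, the three pairs differ at 1, 2, and 1 sites, so π = 4 / (3 × 4) ≈ 0.333; with 2-bp windows the first window is invariant (π = 0) and the second carries all of the differences.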

We next combined genomic divergence data with a whole-genome ILS map. Of the 739 speciation-associated genes (SAGs), we observed that 286 of the corresponding OPG gene trees were significantly supported (approximately unbiased [AU] > 0.95). The topologies of 85.31% (244) of the gene trees were consistent with the Gossypium species tree (Supplemental Table 37). In addition, Grai or GdarD showed a close relationship with clade Inter in 28 trees. These gene trees supported monophyletic Gdav and Gklo clades. Gklo clustered with Grai or Gdar in four gene trees, inconsistent with the Gossypium species trees; this occurred in a tree based on an important flowering time regulation gene, KLM.chr12G160800, which encodes a CONSTANS-like protein in photoperiod-sensitive plants. Based on a high-resolution ILS map (500-bp segments), some SAGs (89 genes) were identified as overlapping with the ILS signature. Moreover, 15.74% of SSV genes were also identified as ILS genes (Figure 4A). For example, the ILS gene KLM.chr12G018200, which encodes a peroxidase (GkPrx66, homolog of AtPrx66), had a 4875-bp non-coding insertion in the intron (Figures 4B and 4C). AtPrx66 is not only involved in responses to environmental stress but also related to biosynthesis and degradation of lignin (Tokunaga et al., 2009). Interestingly, GkGlyt3 (KLM.chr6G114400), a homolog of AtGlyt3, may be involved in cell wall biosynthesis; it was also identified as an ILS gene and had a 4851-bp deletion (Figures 4B and 4C). Notably, this deletion in GkGlyt3 had high homology with the Gdav segments downstream of GdGlyt3 (EVM0008047, an ortholog of GkGlyt3) (Figure 4B). Although the coding sequence topology of GkGlyt3 was consistent with that of the Gossypium species tree, we observed that 500-bp ILS segments overlapped with the 5′ untranslated region of GkGlyt3. These two genes not only were increased in length but also had decreased expression levels (Figure 4D).
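The overlap percentages reported here are simple set intersections between gene lists. A minimal sketch, assuming each list is available as a collection of gene identifiers (the function name and the identifiers in the test are hypothetical):

```python
from typing import Iterable


def overlap_fraction(focal_genes: Iterable[str], ils_genes: Iterable[str]) -> float:
    """Fraction of the focal gene set that also appears in the ILS gene set."""
    focal = set(focal_genes)
    if not focal:
        raise ValueError("focal gene set is empty")
    return len(focal & set(ils_genes)) / len(focal)
```

As a check on the reported figures, 89 of 739 SAGs corresponds to 12.04%, and 244 of 286 well-supported gene trees corresponds to 85.31%.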

Figure 4.

ILS regions in Grai, Gklo, and Gdav.

(A) Venn diagrams showing the intersection of ILS genes with SSV genes and SAGs.

(B) Schematic maps showing insertions in two SSV genes in Grai, Gdav, and Gklo.

(C) Phylogenetic trees built from the coding sequences of Prx66 and Glyt3 paralogs using ML estimation. Numbers at nodes are bootstrap values.

(D) Distribution of gene expression levels (in fragments per kilobase of transcript per million mapped reads) of two SSV genes.

Discussion

The release of high-quality genome assemblies for members of the Gossypium genus provides critical resources for understanding genetic differences in cotton fiber development and the genomic basis of Gossypium speciation and polyploidization. By combining PacBio long reads, Illumina short reads, and Hi-C sequencing data, we constructed a high-quality Gklo genome with a contig N50 length of 17.24 Mb, which was ∼2.8- and ∼9.0-fold improved compared with recently published assemblies of Grai (Udall et al., 2019) and Gher (Huang et al., 2020). Compared with the previously published Gklo genome, our updated assembly contains fewer gaps, more complete repeat regions, and more identified gene models. Gklo is the most closely related species to Grai (or the D subgenome of tetraploid cotton) that has been assembled.
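Contig N50, the contiguity metric quoted above, is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch (the function name is hypothetical, and the toy lengths in the usage note are not from the study):

```python
from typing import Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig lengths."""
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= 50% of the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

For contigs of lengths 2, 3, 4, 5, and 6 (total 20), the two longest contigs sum to 11, which passes the halfway mark of 10, so the N50 is 5.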

The phylogenetic relationships between three important Gossypium lineages with significant differences in geographic distribution, namely the Australian lineage (C, G, and K genome cotton), the African/Asian lineage (A, B, E, and F genome cotton), and the American lineage (D and AD genome cotton), have been determined using multiple types of molecular data (Wendel and Albert, 1992; Wendel et al., 2012; Wendel and Grover, 2015). On the basis of whole-genome assemblies from 25 Gossypium species, we performed phylogenetic inference using several approaches. All analyses yielded identical phylogenetic trees, which clearly distinguished the A clade (Australian and African/Asian lineages) and the D clade (American lineage) of diploid cotton. Topologies of all phylogenetic trees highly support an identical sister-group relationship between the Australian and African/Asian lineages. However, the hypothesized relationship between the Australian lineage and the E and B genomes differed between CSG trees, particularly with respect to the E genome. The conflicting phylogenetic placement may have resulted from admixture of genetic components between those species and the American lineage. Current results suggest complex, reticulate evolution of the Gossypium genus, which may have been caused by ILS or ancient hybridization events. Previous analyses that examined the phylogenetic topology of the Gossypium genus did not observe such conflicts between trees, but they were based on fewer types of molecular evidence (Wendel and Grover, 2015). We therefore sought to identify potential causes (e.g., ILS or hybridization) that may have been responsible for these topological conflicts. This was accomplished using CSG trees and whole-genome SNPs with three different methods (QuIBL, DFOIL, and ABBA-BABA D statistics). 
Although ancient gene flow events between the E genome and the American lineage were detected using all three methods, they were not sufficient to explain all of the observed conflicts. Our analysis suggests that ILS was the primary reason for incongruent phylogenetic topology between the Gossypium species tree and some gene trees. Given huge variation in local environments, climatic differences, and the relatively short speciation history (with divergence beginning <10 Mya), a high level of adaptability in the ancestor cotton population could explain the rapid diversification. In animals, adaptive radiation fueled by niche competition accelerates rapid species diversification (Edelman et al., 2019; Ronco et al., 2021). This may also be the case for flowering plants such as woody bamboos (Guo et al., 2019) and switchgrass (Lovell et al., 2021). ILS is an important factor in rapid adaptive radiation, as reported in hominids (Mao et al., 2021), birds (Jarvis et al., 2014), primates (Mailund et al., 2014; Feng et al., 2022), and even red algae (Lee et al., 2018) and Arundinarieae (Zhang et al., 2012). ILS has been widespread but understudied and may have contributed to rapid adaptive speciation of members of the Gossypium genus.

Cotton D genomes were distributed from ∼34.26° north to ∼0.5° south in America. Consistent with a recent report (Wu et al., 2018), we also observed deep phylogenetic incongruence between nuclear and chloroplast genomes in D-genome cotton species. Based on analyses with multiple methods (QuIBL, DFOIL, ABBA-BABA D statistics, PhyloNetworks, and TreeMix), ILS and several ancient hybridization events (from clade Houz to Erio [H1], from clade Cadu to Erio [H2], and from Aust to Erio) have been proposed as possible causes for the observed incongruence. ILS occurs at much higher rates than hybridization. Therefore, our study suggests that ILS, rather than ancient gene flow, is the more likely cause of incongruent phylogenetic topology between the Gossypium species tree and the chloroplast tree in the D-genome group. However, ILS and hybridization produce similar signals in phylogenetic analysis, and it is difficult to distinguish between the two hypotheses. Analyses based on multiple datasets and methods will aid in distinguishing between the two mechanisms.

Although the natural geographic distributions of Grai, Gklo, and Gdav are very large, these three species had the closest relationships among the Gossypium species and had a short divergence time. Systematic genetic comparisons based on high-quality Grai, Gdav, and Gklo genome assemblies yielded new results that promote understanding of speciation and adaptive radiation associated with this genus. The high-quality assemblies also allowed us to reconsider ILS at a high resolution because 99% of Gossypium genus genomes could be systematically examined via phylogenetic comparisons. Consistent with a previous analysis (Mao et al., 2021), a non-random distribution was observed in the high-resolution ILS map of the Grai, Gdav, and Gklo genomes. ILS regions were depleted in genes and enriched in intergenic and repetitive regions of the genome. Nonetheless, 28.77% of ILS regions overlapped with coding sequences, and 4% of ILS regions intersected with H3K4me3-marked regions.

In general, ancestral lineages with a large effective population size (Ne) can cause ILS during subsequent speciation events (Feng et al., 2022). In our research, large fluctuations in Ne were observed after Gdav/Gklo speciation. Decreasing Ne and strong directional selection due to demographic history and huge environmental differences could account for some low levels of ILS among Grai, Gdav, and Gklo. Natural selection can fix some ILS regions, although it is not necessary for maintenance of ILS regions during the early period of speciation in descendant lineages (Feng et al., 2022). Lower nucleotide diversity in the ILS region (in Grai, Gklo, and Gdav) also suggests that parts of these regions were targets of strong natural selection. Therefore, the consistent decline in Ne and natural selection would both result in some ILS signals after the recent speciation event leading to Grai, Gklo, and Gdav. We estimated that >15% of OPGs would show signs of ILS if we considered a deeper phylogeny (including tetraploid cotton), in part because of the large Ne of the common ancestor of the clade containing GdarD, Grai, Gklo, and Gdav.

Gklo and Gdav are thought to have diverged over 390 kya, and they show nearly identical genomic and epigenomic landscapes. Therefore, less Gklo/Gdav divergence was found, but we did identify 3906 SSVs and 739 SAGs that were potentially involved in speciation events leading to Gdav and Gklo. Although the functions of SAGs and genes containing SSVs remain to be investigated in Gdav or Gklo, they may have resulted in fundamental genetic differences between the two species. For example, we identified a 4875-bp insertion in Gdav that resulted in formation of a 5524-bp intron within a gene that was 6853-bp long; the ortholog, GkPrx66, was only 2655-bp long. Prx66 has been associated with abiotic and biotic stress regulation and with biosynthesis and degradation of lignin (Tokunaga et al., 2009). Another difference between Gdav and Gklo was a large deletion (compared with the ortholog in Gdav) of a 4851-bp segment downstream of GkGlyt3, which resulted in addition of a distant exon to the genic region of GkGlyt3. Remarkably, Glyt3 is also associated with cell wall biosynthesis. Both GkPrx66 and GkGlyt3 intersected with the ILS signature, and these two genes were markedly downregulated in expression compared with their orthologs in Gdav. Another ILS signature was located in the 3′ untranslated region of GkGlyt3 but did not affect the topology of the GkGlyt3 gene tree. The effects of this deletion on speciation and rapid adaptive radiation require further study.

In summary, high-quality assemblies have greatly improved our understanding of Gossypium evolution. Here, comparative analysis of the Gklo genome and 24 other genomes of the Gossypium genus provides new insights into the rapid radiation of cotton and is beginning to illuminate adaptive diversification in flowering plants.

Methods

Sample collection, library construction, and sequencing

All samples for DNA sequencing (DNA-seq) and RNA-seq were collected from adult plants in the National Wild Cotton Nursery at the Institute of Cotton Research, Chinese Academy of Agricultural Sciences, in Sanya, China. In brief, leaves, flowers, stems, apices, bolls, and flower buds were collected, immediately frozen in liquid nitrogen, and stored at −80°C. The leaves of 15-day-old seedlings were also collected for a high-throughput chromosome conformation capture experiment.

High-molecular-weight genomic DNA (gDNA) was extracted with a standard CTAB protocol from the leaves of four cotton species: Ghar (D2-2), Ggos (D6), Gtri (D8), and Gklo. PacBio SMRTbell long-read sequencing libraries were prepared from Gklo DNA by fragmenting extracted DNA with a Covaris g-TUBE Shearing Device. Illumina (San Diego, CA, USA) paired-end sequencing libraries (with an insert size of 350 bp) were generated from the same extracted Gklo gDNA following the manufacturer’s protocol. Short-insert (220- and 500-bp) paired-end and large-insert (3-, 4-, and 5-kb) mate-pair libraries were prepared from Ggos, Ghar, and Gtri gDNA. Illumina Hi-C was performed following a previously published protocol (Berkum et al., 2010). In brief, fresh leaves from Gklo seedlings were fixed in a 1% formaldehyde solution. The nuclei and chromatin were extracted, then digested with DpnII. The overhangs resulting from DpnII digestion were filled in with biotin-14-dCTP (Invitrogen) and Klenow (New England Biolabs). After chromatin dilution and religation, gDNA was extracted and purified; purified DNA was sheared to 300–500 bp with a Bioruptor (Diagenode). Finally, the Hi-C library was prepared as previously described (Servant et al., 2015). Total RNA was extracted from six different tissues using a TRIzol Reagent RNA isolation kit (Invitrogen). RNA-seq libraries were prepared using the standard Illumina mRNA-seq library preparation kit.

The PacBio SMRTbell long-read sequencing library was sequenced on the PacBio Sequel I platform. The Hi-C library was sequenced on the Illumina HiSeq X Ten platform. PE150 sequencing was performed on the Illumina HiSeq X Ten platform for the Gklo library. PE125 sequencing was performed on the Illumina HiSeq X Ten platform for the prepared Ghar, Ggos, and Gtri libraries. Paired-end 150-bp reads were generated on the Illumina NovaSeq platform for the RNA libraries.

Phylogenetic analysis

Phylogenetic trees were constructed for the Gossypium genus using three different datasets: 1165 CSGs in all 27 Gossypium species, with one gene for each species; 4DTVs (fourfold degenerate site) based on the gene models of the Gklo genome; and SNPs from 106 samples belonging to 26 Gossypium species (Supplemental Table 17).

Protein sequences of annotated genes from 27 genomes, including 25 Gossypium species and two closely related species, were used to identify CSGs with OrthoFinder (v.2.2.7) (Emms and Kelly, 2015). Multiple amino-acid sequence alignments of the CSGs were then constructed using MAFFT (v.7.505) (Katoh and Standley, 2013). The resulting alignments were converted into corresponding codon alignments with PAL2NAL (Suyama et al., 2006), and poorly aligned positions and divergent regions of alignments were eliminated using Gblocks (v.0.91b) (Talavera and Castresana, 2007) with the following parameters: “-t=c -b5=h”. For the coalescent-based tree, the phylogenetic tree for each CSG was first reconstructed using the maximum-likelihood (ML) program IQ-TREE (v.1.6.12) (Nguyen et al., 2015) with the parameters “-m MFP -bb 1000”. The gene trees were then analyzed using ASTRAL (v.5.9.1) (Mirarab et al., 2014) to infer the species tree using quartet scores and posterior probabilities. For the concatenation-based tree, alignments of all CSGs were concatenated using a custom Python script and passed to IQ-TREE to construct the species tree using the parameters noted above. Using the gene models of the Gklo genome, 4DTVs were extracted using the msa_view program in PHAST (v.1.4) (Hubisz et al., 2011). The 4DTVs were also used to build an ML-based tree in IQ-TREE using the parameters described above.
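The custom concatenation script is not reproduced in the text. As an illustration only, the concatenation step could be sketched in Python as follows. The function names, the gap-padding of missing taxa, and the partition table are assumptions of this sketch rather than details of the authors' script.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {taxon: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {taxon: "".join(parts) for taxon, parts in records.items()}


def concatenate(alignments):
    """Concatenate per-gene alignments into one supermatrix.

    `alignments` is a list of {taxon: aligned_sequence} dictionaries.
    Taxa missing from a gene are padded with gaps (an assumption; with
    single-copy genes present in every species no padding is needed).
    Returns (supermatrix, partitions) where partitions holds
    (gene_label, start, end) in 1-based inclusive coordinates.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for index, aln in enumerate(alignments, 1):
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene {index}: sequences are not aligned to equal length")
        length = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((f"gene{index}", start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


if __name__ == "__main__":
    gene1 = parse_fasta(">Gklo\nATGAAA\n>Gdav\nATGAAG\n>Grai\nATGCAA\n")
    gene2 = parse_fasta(">Gklo\nTTTGGG\n>Gdav\nTTCGGG\n")
    matrix, parts = concatenate([gene1, gene2])
    for taxon, seq in matrix.items():
        print(f">{taxon}\n{seq}")
    print(parts)
```

The partition table is kept so that a partitioned model could be supplied to the tree-building program if desired; the text does not state whether partitions were used.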

We downloaded high-depth WGS short reads for diploid and tetraploid cotton from previously published studies and mapped them to the Gklo genome with BWA, then retained only the uniquely mapped reads. We used SAMtools (v.1.15.1) (Danecek et al., 2021) to sort and remove duplicate reads. SNP calling was performed with GATK (v.4.2.5.0) (McKenna et al., 2010). The resulting SNPs were then filtered with VCFtools (v.0.1.16) (Danecek et al., 2011) using the following parameters: “--minDP 4 --max-missing 0.2 --mac 3 --minQ 30 --maf 0.05”. The final SNPs were used in PCA and analysis of population structure, population genetics, and ABBA-BABA D statistics. SnpEff (v.4.3t) (Cingolani et al., 2012) was used to extract 4DTVs identified from the SNP data; these were then used for phylogenetic tree construction in RAxML (v.8.2.12) (Stamatakis, 2014) with the following parameters: “-# 1000 -m GTRCAT”. Values of π and FST (Fixation Index) were calculated using VCFtools. PCA was performed using Plink (v.1.90b6.25) (Purcell et al., 2007). The genetic population structure was examined using Admixture (v.1.3.0) (Alexander et al., 2009) with K set to two through six, with 10 000 iterations for each run.

Hybridization inference and ILS simulation

ABBA-BABA D-statistic tests were used to detect admixture among species in the Gossypium genus as described previously (Choi et al., 2020). In brief, given a rooted topology (((P1, P2), P3), O) where O is the outgroup (P1, P2, and P3: three different species), admixture can be inferred between P1 and P3 or P2 and P3 by calculating the ancestral (“A”) and derived (“B”) allelic state of each individual. We computed all possible D statistics using Dsuite (v.0.5) (Malinsky et al., 2021) with default parameters. We also considered all triplets of the 26 species, using Tc as the outgroup for the whole-genus Gossypium analysis and using G. somalense as the outgroup when analyzing all triplets of the 18 D-genome cotton species (including the D subgenome of tetraploid cotton). Finally, we calculated the average D statistic for each pair from all results of triplet combination analysis, for example, the average D statistic between A and C consistent with the topology ((others, A), C) and considering the remaining species as others.
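For readers unfamiliar with the test, the D statistic for a single trio can be written in a few lines. The sketch below uses the standard frequency-based definition, in which each site contributes (1 − p1)·p2·p3·(1 − pO) to the ABBA count and p1·(1 − p2)·p3·(1 − pO) to the BABA count, where p is the derived-allele frequency in each population. It is an illustration under simplified assumptions (no block jackknife, no significance test) and is not a description of Dsuite's internals; the input frequencies are hypothetical.

```python
def d_statistic(sites):
    """Patterson's D for the topology (((P1, P2), P3), O).

    `sites` is an iterable of (p1, p2, p3, p_out) derived-allele
    frequencies, one tuple per biallelic site.
    D > 0 suggests excess allele sharing between P2 and P3;
    D < 0 suggests excess sharing between P1 and P3.
    """
    abba = 0.0
    baba = 0.0
    for p1, p2, p3, p_out in sites:
        abba += (1 - p1) * p2 * p3 * (1 - p_out)
        baba += p1 * (1 - p2) * p3 * (1 - p_out)
    total = abba + baba
    if total == 0:
        return float("nan")  # no informative sites
    return (abba - baba) / total


if __name__ == "__main__":
    # Hypothetical data: three ABBA sites and one BABA site (single haplotypes).
    toy_sites = [(0, 1, 1, 0)] * 3 + [(1, 0, 1, 0)]
    print(d_statistic(toy_sites))
```

Under ILS alone, ABBA and BABA patterns are expected in equal numbers, so D is expected to be close to zero; a consistent excess of one pattern is what motivates an inference of admixture.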

Ancient introgression was tested using DFOIL (Pease and Hahn, 2015) in Gossypium species as described previously (Meleshko et al., 2021). The DFOIL statistic is an extended version of the D statistic; it is used to directly estimate gene flow and can also be used to infer gene flow between the ancestors of sister species and extant species (Vianna et al., 2020; Meleshko et al., 2021). CSG alignments (3527) were used to perform the DFOIL test for all symmetrical five-taxon combinations in the Gossypium genus, consistent with our Gossypium species trees using one ingroup clade older than another (Meleshko et al., 2021) and Tc as the outgroup. Finally, we calculated the average DFOIL statistic between the E and D genomes by calculating the DFOIL statistic between one ingroup clade (where P1 and P2 belong to the D group) and the E genome with the topology ((P1, P2), (P3, E), Tc) and considering the remaining species in the A-genome group as P3. DFOIL calculations were performed for the D-genome group as described above but using 3527 CSG alignments; Ggos was used as the outgroup.

Another tree-based method of assessing branch length distribution across gene trees was used to quantify putative hybridization (Edelman et al., 2019). Similar to the DFOIL analysis, we used CSG trees as the candidate input set for QuIBL analysis of the Gossypium genus. First, 500 trees were randomly selected from the 3527 CSG trees as input for one QuIBL estimation; random selection was repeated 100 times to generate 100 QuIBL outputs as reported previously (Feng et al., 2022). We generated QuIBL results from all triplet combinations of the 26 species, retaining Tc as the outgroup. The difference in BIC values (ΔBIC) was calculated and used to distinguish between an ILS-only model and a hybridization model (Guo et al., 2021). In brief, a model with ΔBIC >10 was considered ILS only, whereas a model with ΔBIC < −10 was considered an ILS and introgression model. Finally, we calculated the average ILS or introgression rate for each pair from triplet analysis results; for example, we calculated the ILS/introgression rate between the E and D genomes with the topology ((others, E), D), considering the remaining species in the A-genome group as others. Discordant triplets (others, (E, D)) were then used to calculate the average ILS/introgression rate between the E and D genomes as described previously (Guo et al., 2021). For the D group, we performed QuIBL analysis using 3527 CSG trees. The average ILS/introgression rate was calculated as described above with Ggos as the outgroup.
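The resampling scheme and the ΔBIC decision rule described above can be sketched in Python as follows. The tree identifiers, function names, and the "undetermined" label for values between the two thresholds are assumptions of this sketch; running QuIBL itself is outside its scope.

```python
import random


def subsample_trees(trees, n_per_replicate=500, n_replicates=100, seed=0):
    """Draw replicate tree sets without replacement within each replicate."""
    rng = random.Random(seed)
    return [rng.sample(trees, n_per_replicate) for _ in range(n_replicates)]


def classify_delta_bic(delta_bic, threshold=10.0):
    """Apply the decision rule stated in the text.

    delta_bic > threshold   -> ILS-only model
    delta_bic < -threshold  -> ILS and introgression model
    Values in between are labeled "undetermined" (an assumption; the
    text does not say how such triplets were treated).
    """
    if delta_bic > threshold:
        return "ILS only"
    if delta_bic < -threshold:
        return "ILS + introgression"
    return "undetermined"


if __name__ == "__main__":
    # Placeholder identifiers standing in for the 3527 CSG trees.
    all_trees = [f"tree_{i}" for i in range(3527)]
    replicates = subsample_trees(all_trees)
    print(len(replicates), len(replicates[0]))
    for value in (25.3, -42.0, 3.1):  # hypothetical delta-BIC values
        print(value, classify_delta_bic(value))
```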

For the D-genome group, we detected hybridization using PhyloNetworks (Solis-Lemus et al., 2017). PhyloNetworks is widely used to estimate phylogenetic networks based on the maximum pseudolikelihood. To reduce the computational time and resources required, we selected eight species that represent the major lineages from all 19 species: Gklo (clade Inter), Gthu (clade Hou), Ghar (clade Cadu), Gsch (clade Erio1), Gari (clade Erio2), Grai (clade Aust), GdarD (clade tetraploid cotton), and Ggos (clade Sele). We performed 10 network searches in PhyloNetworks using pre-set putative 1–10 reticulation events, with 100 runs each to ensure accuracy (Ma et al., 2021). We also estimated phylogenetic networks using TreeMix (Pickrell and Pritchard, 2012) based on SNPs in the D-genome group with G. somalense as the outgroup. Gene flow inference was performed using pre-set 0 to 10 putative hybridization events in TreeMix.

Genome variation between Gdav and Gklo

Genome-wide syntenic blocks were identified within and between Grai, GdarD, Gthu, Gdav, and Gklo based on the results of all-vs.-all BLAST (E value ≤ 1e−10). MCScanX v.0.8 was used to identify conserved collinear blocks (classified as >5 homologous gene pairs/block) (Wang et al., 2012). Genome-wide collinear blocks (GCBs) that could be aligned to the Gklo genome were identified using the NUCmer suite in the MUMmer4 package (Marcais et al., 2018). Alignment delta files were filtered using the delta-filter suite from the MUMmer4 package with the following parameters: “-i 89 -L 1000 -1”. GCBs were obtained by converting the filtered delta files to tables. The GC content and length of each GCB were calculated using SeqKit (v.0.15.0) (Shen et al., 2016). Large-scale inversions were identified by combining genome-wide syntenic blocks and GCBs.

We selected two resequencing samples, one each from Gdav and Gklo, for SNP and SV identification. Resequencing data were first aligned to the reference Gklo genome and the Gdav genome as appropriate. We then identified SVs using BreakDancer (v.1.4.5) (Chen et al., 2009) and the reference Gklo genome. Consistent with a previously published protocol (Wang et al., 2020), we also identified SVs using SV-calling pipelines (https://github.com/GaoLei-bio/SV). In addition, we used Assemblytics (Nattestad and Schatz, 2016) to identify SVs based on the delta file output by MUMmer4. The analyses described above yielded three SV sets. For insertions and deletions, we obtained the final SSV set by merging the three raw datasets using Jasmine v.1.0.11 (https://github.com/mkirsche/Jasmine) (Alonge et al., 2020) with the following parameters: “min_support=1 max_dist=100 k_jaccard=9 min_seq_id=0.2 spec_len=30”. Genomic SSVs were annotated using the ANNOVAR package (v.2019Oct24) (Wang et al., 2010).

Weighted gene co-expression network analysis

Gene expression levels were calculated in fragments per kilobase of transcript per million mapped reads using HTSeq (v.0.6.1) (Anders et al., 2015); RNA-seq data were mapped to the corresponding reference genome using TopHat2 (v.2.0.8) (Kim et al., 2013). Differentially expressed genes were detected using the DESeq2 package (Love et al., 2014) in R. We then constructed gene co-expression networks using all identified genes and weighted gene co-expression network analysis (v.1.47) (Langfelder and Horvath, 2008). Prior to analysis, genes without detectable expression were removed. Soft thresholds (power = 12) were set based on the scale-free topology criterion employed. The dynamic tree-cutting algorithm and associations of modules with traits were evaluated as described previously (Xu et al., 2020). For speciation traits, Gdav was set to one, and Gklo was set to zero.

Demographic analysis

We inferred historical dynamics of effective population size and the divergence time of Gklo and Gdav using the pairwise sequentially Markovian coalescent (PSMC) method (Li and Durbin, 2011). Four samples with high sequencing depth (∼40-fold) were used for PSMC. The resequencing data were mapped to the Gklo genome using BWA. The whole-genome diploid consensus sequences for each sample were generated using SAMtools and BCFtools with the parameter -C50; sites with sequencing depth <10 or >100 were removed to reduce the probability of false positives as described previously (Zhao et al., 2019). PSMC was used with default parameters (-N25 -t15 -r5 -p “4+25*2+4+6”) to infer the historical effective population size. The estimated generation time and mutation rate were set to 2e−9 and 3.61e−9, respectively.
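The -p option of PSMC encodes how atomic time intervals are grouped into free population-size parameters. Written in plain ASCII, the pattern used here is 4+25*2+4+6: the first parameter spans 4 atomic intervals, each of the next 25 parameters spans 2 intervals, and the last two parameters span 4 and 6 intervals. The short Python sketch below expands such a pattern; the function name is hypothetical and the code is an illustration, not part of PSMC.

```python
def parse_psmc_pattern(pattern):
    """Return (atomic_intervals, free_parameters) for a PSMC -p pattern.

    A term "n*w" means n parameters that each span w atomic intervals;
    a bare term "w" means one parameter spanning w atomic intervals.
    """
    n_intervals = 0
    n_parameters = 0
    for term in pattern.replace(" ", "").split("+"):
        if "*" in term:
            repeats, width = (int(value) for value in term.split("*"))
        else:
            repeats, width = 1, int(term)
        n_intervals += repeats * width
        n_parameters += repeats
    return n_intervals, n_parameters


if __name__ == "__main__":
    intervals, parameters = parse_psmc_pattern("4+25*2+4+6")
    print(intervals, parameters)  # 64 atomic intervals, 28 free parameters
```

Grouping the first and last intervals into wider bins reduces the number of free parameters in the time ranges where few coalescent events are expected and estimates are least stable.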

ILS map construction for Grai/Gklo and Gdav/Grai

Using a previously published protocol (Feng et al., 2022), we generated WGAs with the LASTZ + MULTIZ pipeline. We first carried out pairwise WGAs for Grai, Gklo, Gdav, and Gthu using LASTZ v.1.04.03 (Harris, 2007) with the following parameters: “O=400 E=30 K=3000 L=3000 H=2200 --format=axt --ambiguous=n --ambiguous=iupac”. We also used the Chain/net package with the following parameters for the axtChain program: “-minScore=5000 -linearGap=medium”; default parameters were used for other programs. A merged WGA was obtained using MULTIZ (v.11.2) (Blanchette et al., 2004) to merge all pairwise WGAs. WGAs were filtered using mafFilter with the following parameters: “-overlap -minScore=5000 -minRow=4”. This yielded an 806.5-Mb WGA using Gklo as the reference genome.

As described above, there was some evidence for limited gene flow between Grai, Gklo, and Gdav. Because these three species were largely geographically isolated, they provide a unique framework in which to understand the rapidity of genetic changes that resulted from ILS. We searched for evidence of ILS in Gdav, Gklo, and Grai at different levels of resolution by segmenting the WGA. This was accomplished using the reference Gklo genome to generate datasets with different window lengths (20, 10, 5, 2, 1, and 0.5 kbp) using the msa_split program. For each segmented dataset, we obtained multiple sequence alignments for all segments. To investigate the genome-wide landscape of introgression and ILS, we examined the phylogenetic topologies generated from each segment. In brief, for each segment, the AU test was used to identify the significantly supported topology. RAxML was used to estimate site-likelihood values, and the AU test in CONSEL v.1.20 (Shimodaira and Hasegawa, 2001) was then used to detect the significant topology. A topology with an AU test support >0.95 was selected. A similar strategy of phylogenetic hypothesis testing was used for CSGs and OPGs. Next, we regarded conflicting gene trees (i.e., those that differed from the species tree) as being due to ILS and used the “ete3” module in Python 3 to count the number of ILS segments.
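The per-segment decision rule can be illustrated with a short Python sketch. It assumes that the AU support values for the three possible rooted topologies of Gdav, Gklo, and Grai have already been obtained (for example, from CONSEL output) and that the species tree groups Gdav with Gklo. The topology labels, the "unresolved" category, and the example values are hypothetical, and plain dictionaries are used here in place of the ete3 module.

```python
from collections import Counter

# Species tree assumed for this sketch: Gdav and Gklo are sister species.
SPECIES_TREE = "((Gdav,Gklo),Grai)"
ALTERNATIVE_TREES = ("((Gdav,Grai),Gklo)", "((Gklo,Grai),Gdav)")


def classify_segment(au_support, cutoff=0.95):
    """Classify one alignment segment from its AU support values.

    `au_support` maps each candidate topology to its AU test support.
    A segment is assigned only if the best topology exceeds `cutoff`;
    a supported topology that differs from the species tree is counted
    as an ILS segment, following the rule described in the text.
    """
    best_topology, best_value = max(au_support.items(), key=lambda item: item[1])
    if best_value <= cutoff:
        return "unresolved"
    return "concordant" if best_topology == SPECIES_TREE else "ILS"


def summarize(segments):
    """Count segment classes across a list of AU support dictionaries."""
    return Counter(classify_segment(support) for support in segments)


if __name__ == "__main__":
    example_segments = [  # hypothetical AU support values
        {SPECIES_TREE: 0.99, ALTERNATIVE_TREES[0]: 0.02, ALTERNATIVE_TREES[1]: 0.01},
        {SPECIES_TREE: 0.03, ALTERNATIVE_TREES[0]: 0.97, ALTERNATIVE_TREES[1]: 0.04},
        {SPECIES_TREE: 0.60, ALTERNATIVE_TREES[0]: 0.45, ALTERNATIVE_TREES[1]: 0.20},
    ]
    print(summarize(example_segments))
```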

ILS inter-distance simulation

As in a previous report (Mao et al., 2021), an ILS inter-distance null distribution was defined by permuting the coordinates of ILS regions across the genome using no ILS discovery regions. A total of 100 permutations of the ILS coordinates were then compared to the null inter-distance to separate clustered from non-clustered ILS events.
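A simplified version of this permutation test can be sketched in Python. The sketch treats a single chromosome, places permuted regions uniformly at random, and summarizes each permutation by its median inter-distance; these choices, the function names, and the toy coordinates are assumptions made for illustration and do not reproduce the exact procedure of Mao et al. (2021).

```python
import random
from statistics import median


def inter_distances(starts):
    """Distances between consecutive region start coordinates."""
    ordered = sorted(starts)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def permuted_medians(n_regions, chrom_length, n_permutations=100, seed=1):
    """Null distribution of median inter-distances from random placement."""
    rng = random.Random(seed)
    medians = []
    for _ in range(n_permutations):
        starts = [rng.randrange(chrom_length) for _ in range(n_regions)]
        medians.append(median(inter_distances(starts)))
    return medians


def clustering_test(observed_starts, chrom_length, n_permutations=100, seed=1):
    """Return (observed median inter-distance, empirical p value).

    The p value is the fraction of permutations whose median
    inter-distance is at most the observed one; small values indicate
    that the observed regions are more clustered than expected.
    """
    observed = median(inter_distances(observed_starts))
    null = permuted_medians(len(observed_starts), chrom_length, n_permutations, seed)
    p_value = sum(value <= observed for value in null) / len(null)
    return observed, p_value


if __name__ == "__main__":
    # Toy example: 20 regions packed into a small part of a 1-Mb chromosome.
    toy_starts = [1000 + 10 * i for i in range(20)]
    print(clustering_test(toy_starts, chrom_length=1_000_000))
```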

Statistical analyses

Statistically significant differences were called at p < 0.05 or p < 0.01 for statistical analyses. Welch’s t test, the Wilcoxon rank-sum test, and false discovery rate calculation were performed using the R functions t.test, wilcox.test, and p.adjust, respectively.

Data and code availability

All raw sequencing data and genome assemblies can be accessed from the China National Genomics Data Center at https://ngdc.cncb.ac.cn under accession number PRJCA012048. These data are also available at the NCBI BioProject database with accession number PRJNA858595. In addition, genome assemblies and annotation files can be found on the CottonGen website at https://www.cottongen.org/.

Funding

We thank the National Natural Science Foundation of China (32272090, 32171994, and 32072023), the Central Plains Science and Technology Innovation Leader Project (214200510029 and 2022C01NY001), the Project of Sanya Yazhou Bay Science and Technology City (SCKY-JYRC-2022-88), and the National Key R&D Program of China (2021YFE0101200) for financial support.

Author contributions

Y.X., K.W., B.Z., R.P., and F.L. designed and supervised the study. Z.Z., Y.H., Y. Wang, and X.C. managed the field work and prepared the samples. Y. Wei and Y.L. provided the chromatin immunoprecipitation-seq data. Y.X. performed the data analysis and prepared the manuscript. F.L., B.Z., K.W., S.B.W., D.J., M.J.U., S.Y., and R.P. revised the manuscript. All authors have read and approved the manuscript.

Acknowledgments

No conflict of interest is declared.

Published: October 5, 2023

Footnotes

Published by the Plant Communications Shanghai Editorial Office in association with Cell Press, an imprint of Elsevier Inc., on behalf of CSPB and CEMPS, CAS.

Supplemental information is available at Plant Communications Online.

Contributor Information

Baohong Zhang, Email: zhangb@ecu.edu.

Renhai Peng, Email: aydxprh@163.com.

Fang Liu, Email: liufcri@163.com.

Supplemental information

Document S1. Supplemental Figures 1–15
Data S1. Supplemental Tables 1–30
Data S2. Supplemental Tables 31–37
Document S2. Article plus supplemental information

References

  1. Alexander D.H., Novembre J., Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–1664. doi: 10.1101/gr.094052.109.
  2. Alonge M., Wang X., Benoit M., Soyk S., Pereira L., Zhang L., Suresh H., Ramakrishnan S., Maumus F., Ciren D., et al. Major Impacts of Widespread Structural Variation on Gene Expression and Crop Improvement in Tomato. Cell. 2020;182:145–161.e23. doi: 10.1016/j.cell.2020.05.021.
  3. Anders S., Pyl P.T., Huber W. HTSeq-a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31:166–169. doi: 10.1093/bioinformatics/btu638.
  4. Berkum N.L.v., Lieberman-Aiden E., Williams L., Imakaev M., Gnirke A., Mirny L.A., Dekker J., Lander E.S. Hi-C: a method to study the three-dimensional architecture of genomes. JoVE. 2010:1869. doi: 10.3791/1869.
  5. Blanchette M., Kent W.J., Riemer C., Elnitski L., Smit A.F.A., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715. doi: 10.1101/gr.1933104.
  6. Chen K., Wallis J.W., McLellan M.D., Larson D.E., Kalicki J.M., Pohl C.S., McGrath S.D., Wendl M.C., Zhang Q., Locke D.P., et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods. 2009;6:677–681. doi: 10.1038/nmeth.1363.
  7. Chen Z.J., Sreedasyam A., Ando A., Song Q., De Santiago L.M., Hulse-Kemp A.M., Ding M., Ye W., Kirkbride R.C., Jenkins J., et al. Genomic diversifications of five Gossypium allopolyploid species and their impact on cotton improvement. Nat. Genet. 2020;52:525–533. doi: 10.1038/s41588-020-0614-5.
  8. Chen Z., Nie H., Wang Y., Pei H., Li S., Zhang L., Hua J. Rapid evolutionary divergence of diploid and allotetraploid Gossypium mitochondrial genomes. BMC Genom. 2017;18:876. doi: 10.1186/s12864-017-4282-5.
  9. Choi J.Y., Lye Z.N., Groen S.C., Dai X., Rughani P., Zaaijer S., Harrington E.D., Juul S., Purugganan M.D. Nanopore sequencing-based genome assembly and evolutionary genomics of circum-basmati rice. Genome Biol. 2020;21:21. doi: 10.1186/s13059-020-1938-2.
  10. Cingolani P., Platts A., Wang L.L., Coon M., Nguyen T., Wang L., Land S.J., Lu X., Ruden D.M. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 2012;6:80–92. doi: 10.4161/fly.19695.
  11. Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M., et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008. doi: 10.1093/gigascience/giab008.
  12. Danecek P., Auton A., Abecasis G., Albers C.A., Banks E., DePristo M.A., Handsaker R.E., Lunter G., Marth G.T., Sherry S.T., et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330.
  13. Edelman N.B., Frandsen P.B., Miyagi M., Clavijo B., Davey J., Dikow R.B., García-Accinelli G., Van Belleghem S.M., Patterson N., Neafsey D.E., et al. Genomic architecture and introgression shape a butterfly radiation. Science. 2019;366:594–599. doi: 10.1126/science.aaw2090.
  14. Emms D.M., Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:157. doi: 10.1186/s13059-015-0721-2.
  15. Feng S., Bai M., Rivas-González I., Li C., Liu S., Tong Y., Yang H., Chen G., Xie D., Sears K.E., et al. Incomplete lineage sorting and phenotypic evolution in marsupials. Cell. 2022;185:1646–1660.e18. doi: 10.1016/j.cell.2022.03.034.
  16. Fryxell P.A. Texas A & M University Press; 1979. The Natural History of the Cotton Tribe (Malvaceae, Tribe Gossypieae). https://www.cabdirect.org/cabdirect/abstract/19812607534
  17. Grover C.E., Yuan D., Arick M.A., Miller E.R., Hu G., Peterson D.G., Wendel J.F., Udall J.A. The Gossypium stocksii genome as a novel resource for cotton improvement. G3 (Bethesda) 2021;11 doi: 10.1093/g3journal/jkab125. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Grover C.E., Arick M.A., 2nd, Thrash A., Conover J.L., Sanders W.S., Peterson D.G., Frelichowski J.E., Scheffler J.A., Scheffler B.E., Wendel J.F. Insights into the Evolution of the New World Diploid Cottons (Gossypium, Subgenus Houzingenia) Based on Genome Sequencing. Genome Biol. Evol. 2019;11:53–71. doi: 10.1093/gbe/evy256. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Guo X., Fang D., Sahu S.K., Yang S., Guang X., Folk R., Smith S.A., Chanderbali A.S., Chen S., Liu M., et al. Chloranthus genome provides insights into the early diversification of angiosperms. Nat. Commun. 2021;12:6930. doi: 10.1038/s41467-021-26922-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Guo Z.H., Ma P.F., Yang G.Q., Hu J.Y., Liu Y.L., Xia E.H., Zhong M.C., Zhao L., Sun G.L., Xu Y.X., et al. Genome Sequences Provide Insights into the Reticulate Origin and Unique Traits of Woody Bamboos. Mol. Plant. 2019;12:1353–1365. doi: 10.1016/j.molp.2019.05.009. [DOI] [PubMed] [Google Scholar]
  21. Harris R.S. The Pennsylvania State University; 2007. Improved Pairwise Alignment of Genomic DNA. [Google Scholar]
  22. Huang G., Wu Z., Percy R.G., Bai M., Li Y., Frelichowski J.E., Hu J., Wang K., Yu J.Z., Zhu Y. Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nat. Genet. 2020;52:516–524. doi: 10.1038/s41588-020-0607-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Hubisz M.J., Pollard K.S., Siepel A. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief. Bioinform. 2011;12:41–51. doi: 10.1093/bib/bbq072. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Jarvis E.D., Mirarab S., Aberer A.J., Li B., Houde P., Li C., Ho S.Y.W., Faircloth B.C., Nabholz B., Howard J.T., et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 2014;346:1320–1331. doi: 10.1126/science.1253451. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Katoh K., Standley D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Kim D., Pertea G., Trapnell C., Pimentel H., Kelley R., Salzberg S.L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36. doi: 10.1186/gb-2013-14-4-r36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Langfelder P., Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinf. 2008;9:559. doi: 10.1186/1471-2105-9-559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Lee J.M., Song H.J., Park S.I., Lee Y.M., Jeong S.Y., Cho T.O., Kim J.H., Choi H.-G., Choi C.G., Nelson W.A., et al. Mitochondrial and Plastid Genomes from Coralline Red Algae Provide Insights into the Incongruent Evolutionary Histories of Organelles. Genome Biol. Evol. 2018;10:2961–2972. doi: 10.1093/gbe/evy222. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Li H., Durbin R. Inference of human population history from individual whole-genome sequences. Nature. 2011;475:493–496. doi: 10.1038/nature10231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Lovell J.T., MacQueen A.H., Mamidi S., Bonnette J., Jenkins J., Napier J.D., Sreedasyam A., Healey A., Session A., Shu S., et al. Genomic mechanisms of climate adaptation in polyploid bioenergy switchgrass. Nature. 2021;590:438–444. doi: 10.1038/s41586-020-03127-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Ma J., Sun P., Wang D., Wang Z., Yang J., Li Y., Mu W., Xu R., Wu Y., Dong C., et al. The Chloranthus sessilifolius genome provides insight into early diversification of angiosperms. Nat. Commun. 2021;12:6929. doi: 10.1038/s41467-021-26931-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Mailund T., Munch K., Schierup M.H. Lineage Sorting in Apes. Annu. Rev. Genet. 2014;48:519–535. doi: 10.1146/annurev-genet-120213-092532. [DOI] [PubMed] [Google Scholar]
  34. Malinsky M., Matschiner M., Svardal H. Dsuite - Fast D-statistics and related admixture evidence from VCF files. Mol. Ecol. Resour. 2021;21:584–595. doi: 10.1111/1755-0998.13265. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Mao Y., Catacchio C.R., Hillier L.W., Porubsky D., Li R., Sulovari A., Fernandes J.D., Montinaro F., Gordon D.S., Storer J.M., et al. A high-quality bonobo genome refines the analysis of hominid evolution. Nature. 2021;594:77–81. doi: 10.1038/s41586-021-03519-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Marçais G., Delcher A.L., Phillippy A.M., Coston R., Salzberg S.L., Zimin A. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 2018;14 doi: 10.1371/journal.pcbi.1005944. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Mcgrath C.L., Lynch M. 2012. Evolutionary Significance of Whole-Genome Duplication. [Google Scholar]
  38. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–1303. doi: 10.1101/gr.107524.110.
  39. Meleshko O., Martin M.D., Korneliussen T.S., Schröck C., Lamkowski P., Schmutz J., Healey A., Piatkowski B.T., Shaw A.J., Weston D.J., et al. Extensive Genome-Wide Phylogenetic Discordance Is Due to Incomplete Lineage Sorting and Not Ongoing Introgression in a Rapidly Radiated Bryophyte Genus. Mol. Biol. Evol. 2021;38:2750–2766. doi: 10.1093/molbev/msab063.
  40. Mirarab S., Reaz R., Bayzid M.S., Zimmermann T., Swenson M.S., Warnow T. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. 2014;30:i541–i548. doi: 10.1093/bioinformatics/btu462.
  41. Nattestad M., Schatz M.C. Assemblytics: a web analytics tool for the detection of variants from an assembly. Bioinformatics. 2016;32:3021–3023. doi: 10.1093/bioinformatics/btw369.
  42. Nguyen L.T., Schmidt H.A., von Haeseler A., Minh B.Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 2015;32:268–274. doi: 10.1093/molbev/msu300.
  43. Paterson A.H., Wendel J.F., Gundlach H., Guo H., Jenkins J., Jin D., Llewellyn D., Showmaker K.C., Shu S., Udall J., et al. Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres. Nature. 2012;492:423–427. doi: 10.1038/nature11798.
  44. Pease J.B., Hahn M.W. Detection and Polarization of Introgression in a Five-Taxon Phylogeny. Syst. Biol. 2015;64:651–662. doi: 10.1093/sysbio/syv023.
  45. Pickrell J.K., Pritchard J.K. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 2012;8:e1002967. doi: 10.1371/journal.pgen.1002967.
  46. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., Maller J., Sklar P., de Bakker P.I.W., Daly M.J., et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
  47. Ronco F., Matschiner M., Böhne A., Boila A., Büscher H.H., El Taher A., Indermaur A., Malinsky M., Ricci V., Kahmen A., et al. Drivers and dynamics of a massive adaptive radiation in cichlid fishes. Nature. 2021;589:76–81. doi: 10.1038/s41586-020-2930-4.
  48. Servant N., Varoquaux N., Lajoie B.R., Viara E., Chen C.J., Vert J.P., Heard E., Dekker J., Barillot E. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 2015;16:259. doi: 10.1186/s13059-015-0831-x.
  49. Shen W., Le S., Li Y., Hu F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS One. 2016;11:e0163962. doi: 10.1371/journal.pone.0163962.
  50. Shimodaira H., Hasegawa M. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics. 2001;17:1246–1247. doi: 10.1093/bioinformatics/17.12.1246.
  51. Solís-Lemus C., Bastide P., Ané C. PhyloNetworks: A Package for Phylogenetic Networks. Mol. Biol. Evol. 2017;34:3292–3298. doi: 10.1093/molbev/msx235.
  52. Soltis P.S., Soltis D.E. Ancient WGD events as drivers of key innovations in angiosperms. Curr. Opin. Plant Biol. 2016;30:159–165. doi: 10.1016/j.pbi.2016.03.015.
  53. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. doi: 10.1093/bioinformatics/btu033.
  54. Suyama M., Torrents D., Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006;34:W609–W612. doi: 10.1093/nar/gkl315.
  55. Talavera G., Castresana J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 2007;56:564–577. doi: 10.1080/10635150701472164.
  56. Tokunaga N., Kaneta T., Sato S., Sato Y. Analysis of expression profiles of three peroxidase genes associated with lignification in Arabidopsis thaliana. Physiol. Plantarum. 2009;136:237–249. doi: 10.1111/j.1399-3054.2009.01233.x.
  57. Udall J.A., Long E., Hanson C., Yuan D., Ramaraj T., Conover J.L., Gong L., Arick M.A., Grover C.E., Peterson D.G., et al. De Novo Genome Sequence Assemblies of Gossypium raimondii and Gossypium turneri. G3 (Bethesda). 2019;9:3079–3085. doi: 10.1534/g3.119.400392.
  58. Vianna J.A., Fernandes F.A.N., Frugone M.J., Figueiró H.V., Pertierra L.R., Noll D., Bi K., Wang-Claypool C.Y., Lowther A., Parker P., et al. Genome-wide analyses reveal drivers of penguin diversification. Proc. Natl. Acad. Sci. USA. 2020;117:22303–22310. doi: 10.1073/pnas.2006659117.
  59. Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164. doi: 10.1093/nar/gkq603.
  60. Wang K., Wendel J.F., Hua J. Designations for individual genomes and chromosomes in Gossypium. J. Cotton Res. 2018;1.
  61. Wang X., Gao L., Jiao C., Stravoravdis S., Hosmani P.S., Saha S., Zhang J., Mainiero S., Strickler S.R., Catala C., et al. Genome of Solanum pimpinellifolium provides insights into structural variants during tomato breeding. Nat. Commun. 2020;11:5817. doi: 10.1038/s41467-020-19682-0.
  62. Wang Y., Tang H., Debarry J.D., Tan X., Li J., Wang X., Lee T.H., Jin H., Marler B., Guo H., et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 2012;40:e49. doi: 10.1093/nar/gkr1293.
  63. Wendel J., Grover C. In: Taxonomy and Evolution of the Cotton Genus, Gossypium. Fang D.D., Percy R.G., editors. 2015.
  64. Wendel J.F., Albert V.A. Phylogenetics of the Cotton Genus (Gossypium): Character-State Weighted Parsimony Analysis of Chloroplast-DNA Restriction Site Data and Its Systematic and Biogeographic Implications. Syst. Bot. 1992;17:115–143.
  65. Wendel J.F., Flagel L.E., Adams K.L. In: Polyploidy and Genome Evolution. Soltis P.S., Soltis D.E., editors. Springer Berlin Heidelberg; 2012. Jeans, Genes, and Genomes: cotton as a model for studying polyploidy; pp. 181–207.
  66. Wu Y., Liu F., Yang D.G., Li W., Zhou X.J., Pei X.Y., Liu Y.G., He K.L., Zhang W.S., Ren Z.Y., et al. Comparative Chloroplast Genomics of Gossypium Species: Insights Into Repeat Sequence Variations and Phylogeny. Front. Plant Sci. 2018;9:376. doi: 10.3389/fpls.2018.00376.
  67. Xu Y., Magwanga R.O., Yang X., Jin D., Cai X., Hou Y., Wei Y., Zhou Z., Wang K., Liu F. Genetic regulatory networks for salt-alkali stress in Gossypium hirsutum with differing morphological characteristics. BMC Genom. 2020;21:15. doi: 10.1186/s12864-019-6375-9.
  68. Zhang Y.X., Zeng C.X., Li D.Z. Complex evolution in Arundinarieae (Poaceae: Bambusoideae): incongruence between plastid and nuclear GBSSI gene phylogenies. Mol. Phylogenet. Evol. 2012;63:777–797. doi: 10.1016/j.ympev.2012.02.023.
  69. Zhao Y.P., Fan G., Yin P.P., Sun S., Li N., Hong X., Hu G., Zhang H., Zhang F.M., Han J.D., et al. Resequencing 545 ginkgo genomes across the world reveals the evolutionary history of the living fossil. Nat. Commun. 2019;10:4201. doi: 10.1038/s41467-019-12133-5.

Associated Data

Supplementary Materials

Document S1. Supplemental Figures 1–15 (mmc1.pdf, 4.1 MB)
Data S1. Supplemental Tables 1–30 (mmc2.xlsx, 1.3 MB)
Data S2. Supplemental Tables 31–37 (mmc3.xlsx, 712.8 KB)
Document S2. Article plus supplemental information (mmc4.pdf, 7 MB)

Data Availability Statement

All raw sequencing data and genome assemblies can be accessed from the China National Genomics Data Center at https://ngdc.cncb.ac.cn under accession number PRJCA012048. These data are also available at the NCBI BioProject database with accession number PRJNA858595. In addition, genome assemblies and annotation files can be found on the CottonGen website at https://www.cottongen.org/.


Articles from Plant Communications are provided here courtesy of Elsevier
