Journal of Advanced Research. 2023 Feb 10;54:15–27. doi: 10.1016/j.jare.2023.02.002

Genome and haplotype provide insights into the population differentiation and breeding improvement of Gossypium barbadense

Nian Wang a, Yuanxue Li a, Qingying Meng a, Meilin Chen a, Mi Wu a, Ruiting Zhang a, Zhiyong Xu a, Jie Sun c, Xianlong Zhang a, Xinhui Nie c, Daojun Yuan a, Zhongxu Lin a,b,c
PMCID: PMC10703724  PMID: 36775017

Keywords: Gossypium barbadense, Perennial accession, Structural variation, Haplotype, Population structure, Selection and domestication

Highlights

  • The genome of one perennial Gb cotton accession was assembled, which enabled the identification of intraspecific and interspecific SVs associated with agronomic traits.

  • Haploblocks in sea-island cotton were associated with the improvement of agronomic traits and drove the differentiation and adaptation of sea-island cotton.

  • Our study builds a foundation for the breeding and improvement of sea-island cotton.

Abstract

Introduction

Sea-island cotton (Gossypium barbadense, Gb) is one of the major sources of high-grade natural fiber. Besides the common annual Gb cotton, perennial Gb cotton is also cultivated, but studies on perennial Gb cotton are rare.

Objectives

We aimed to systematically analyze perennial sea-island cotton and lay a foundation for its use in breeding, to identify representative structural variations (SVs) in sea-island cotton, and to reveal the population differentiation and adaptive improvement of sea-island cotton.

Methods

The genome of one perennial Gb cotton accession (named Gb_M210936) was assembled, and variations that arose during Gb cotton domestication were identified by comparing Gb_M210936 with the annual Gb accession 3–79 and with the wild allotetraploid cotton G. darwinii. Six perennial Gb accessions, combined with 1,129 resequenced cotton accessions, were used for population genetic analysis. Large haplotype blocks (haploblocks), generated by interspecific introgressions and intraspecific inversions, were identified, and their effects on the population differentiation and agronomic traits of sea-island cotton were analyzed.

Results

One reference genome of perennial sea-island cotton was assembled. Representative SVs in sea-island cotton were identified, and 31 SVs were found to be associated with agronomic traits. Perennial Gb cotton had a closer kinship with the wild-to-landrace continuum Gb cotton from South America, where Gb cotton was originally domesticated. Haploblocks were associated with the improvement of agronomic traits in sea-island cotton, promoted the differentiation of sea-island cotton into three subgroups, were subjected to breeding selection, and may have driven the adaptation of Gb cotton to central Asia.

Conclusion

Our study fills the gap left by the lack of a perennial Gb cotton genome, and clarifies that exotic introgressions improved the traits of sea-island cotton, promoted its population differentiation, and drove its adaptation to central Asia. These findings provide new insights for the genetic improvement and breeding of sea-island cotton.

Introduction

Gossypium barbadense (Gb) and G. hirsutum (Gh, upland cotton) are the two cultivated allotetraploid cotton species. Although they share the same origin [1], [2], [3], the two species underwent relatively independent domestication processes [4], [5]. Upland cotton has high yield, a wide planting area, strong adaptability, and more abundant germplasm resources, and it accounts for more than 90 % of the total annual world cotton output. Sea-island cotton, the cultivated form of Gb, has good fiber quality; however, its low yield and narrow adaptability limit its popularization and cultivation.

It is difficult to trace when sea-island cotton was first introduced into China [6]. At the beginning of the 20th century, a small number of planted sea-island cottons were found in Kaiyuan county, Yunnan province, China, where the subtropical climate is suitable for cotton growth all year. Perennial sea-island cottons are divided into two types according to whether the boll chambers are connected (Lianhemumian) or not (Lihemumian) [6]. Annual sea-island cotton was mainly introduced from the United States (USA) and Egypt in the 1930s. In the 1950s, annual sea-island cotton was re-introduced from the Soviet Union, Egypt, Sudan, Peru and the USA [6]. Subsequently, Chinese breeders developed a series of elite cultivars from these germplasms, and Xinjiang gradually became the main production area of sea-island cotton in China because of its unique climate.

Several studies have revealed the genetic improvement and population differentiation of upland cotton [4], [7], [8], [9], [10], but fewer have addressed sea-island cotton. Sea-island cottons were previously categorized into Pima type, Egypt type, and central Asia type according to their original regions of cultivation [11]. Yuan et al. resequenced 81 domesticated Gb accessions and 101 accessions spanning the wild-to-landrace continuum, which greatly enriched the genomic data from the area of origin of Gb cotton [5]. Zhao et al. divided sea-island cottons into two major gene pools and a third admixed subgroup based on 336 sea-island cotton cultivars [12]. In another study, sea-island cottons were divided into 4 groups, representing Gb landraces, obsolete Gb cultivars, modern Gb cultivars, and Xinjiang Gb cultivars, respectively [13].

Introgression is common in plants; it has shaped the genomic diversity of species [14], improved crop traits [15], and affected plant adaptation [16]. For example, an introgression from Helianthus annuus containing a functional HaFT1 gene caused early flowering in coastal Helianthus argophyllus [17]. Interspecific hybridization has been used to improve sea-island cotton, especially with upland cotton as a donor [18], [19], [20]. Nie et al. identified 17 interspecific introgression events from upland cotton in sea-island cottons that were beneficial for the improvement of fiber quality and yield traits [21]. Fang et al. identified 6 interspecific introgressions from upland cotton to sea-island cotton that were significantly associated with the phenotypic performance of sea-island cotton and were under further selection and stabilized during improvement [22]. Yuan et al. (2021) identified reciprocal introgressions between G. hirsutum and G. barbadense, and found that introgressions in the G. barbadense D subgenome (Dt, with lower-case ‘t’ denoting tetraploid) were more abundant than in the A subgenome (At). Wang et al. revealed that introgressions from upland cotton drove the population differentiation and genetic diversity of sea-island cotton [13].

Resequencing combined with genome-wide association studies (GWAS) has identified many loci associated with agronomic traits in sea-island cotton [12], [22], [23], but the roles of large structural variations (SVs) and large haplotype blocks in sea-island cotton have not been characterized. SVs often affect important agronomic traits [24]. A haplotype is a specific combination of alleles observed on a chromosome [25], which can be inherited as a whole. In upland cotton, inversions on chromosomes A06 and A08 were shown to be related to group differentiation and geographic differentiation [9], [26]. In sunflower, non-recombining haplotype blocks (haploblocks) were found to be associated with numerous ecologically relevant traits [17].

The reference genomes of three wild allotetraploid cottons [27] and of cultivated sea-island cottons, including 3–79 [28], Pima 90 [10], and Hai7124 [29], have been published. Perennial sea-island cotton retains some characteristics of wild cotton and is distinct from annual sea-island cotton. Its status in the evolution of cotton species is unknown and worth exploring. To reveal the genomic variations between perennial and annual sea-island cottons, the genome of one Lianhemumian accession, named Gb_M210936, was assembled, and six perennial sea-island cottons were also resequenced. Variations between Gb_M210936 and the cultivated sea-island cotton 3–79, and between Gb_M210936 and G. darwinii, were identified through genome comparison. G. darwinii was included because it is a wild allotetraploid cotton that is closely related to Gb and is also considered the wild species of Gb. By combining the resequencing data of the Gb cotton population, we sought to reveal the role of perennial sea-island cotton in the evolution of the cotton genus and the driving force of population differentiation in sea-island cotton.

Materials and methods

Plant materials

Six perennial sea-island cottons, including four from Peru and two from Yunnan, China, were obtained from the Institute of Cotton Research, Chinese Academy of Agricultural Sciences, Anyang, Henan 455000, China (Table S1). The six perennial sea-island cottons were planted in Wuhan, and young leaves were taken for genome sequencing. Resequencing data of cotton accessions, including Gb cotton, upland cotton, and other allotetraploid cotton species, were downloaded from NCBI (https://www.ncbi.nlm.nih.gov), as detailed in the Data availability section. Five sea-island cotton cultivars (XH33, XH37, XH58, Ashi, and LuoSaiNa) with extreme differences in fiber quality traits were used for transcriptome analysis [12].

Genome assembly and quality assessment

The genomic DNA of young cotton leaves was extracted using the CTAB method [30]. The Gb_M210936 genome was sequenced by single-molecule real-time HiFi sequencing on the PacBio platform and was assembled using HiFiasm [31]. The high-throughput chromosome conformation capture (Hi-C) library of Gb_M210936 was constructed as described by Wang et al. [32]. Contigs were ordered and anchored to chromosomes with the Hi-C data using Juicer (v1.5) [33] and 3D-DNA (v180922) [34].

Illumina reads and HiFi reads were aligned to the assembled genome by BWA (v0.7.17) [35] and minimap2 (v2.23) [36], respectively, to evaluate alignment and coverage. BUSCO (v3.1.0) (Benchmarking Universal Single-Copy Orthologs) was used to assess the completeness of the genome assembly with eukaryota_odb10 [37]. LTR_retriever (v2.8) [38] and LTR-FINDER (v1.07) [39] were used to assess genome integrity based on long terminal repeats, expressed as the LTR Assembly Index (LAI) [40].

Repeats and Non-coding RNA annotation

Homology-based prediction of repeat sequences was performed against the RepBase library (https://www.girinst.org/repbase) using RepeatMasker (v4.1) [41]. RepeatModeler (v1.0.11) [42], Piler [43], and RepeatScout [44] were used for de novo modeling and annotation of the repetitive sequence families of the genome. TRF (v4.0.9) [45] and LTR-FINDER (v1.07) [39] were used to identify tandem repeats and LTRs, respectively, by sequence characterization. Homology-based and de novo predicted repeats were integrated as the final repeat annotation. According to the structural characteristics of tRNA, tRNAscan-SE (v2.0) [46] was used to identify tRNA sequences in the genome. Based on the Rfam [47] database, INFERNAL (v1.1.3) [48] was used to predict miRNA and snRNA sequences in the genome.

Annotation of protein-coding genes

De novo annotation, transcriptome-based annotation and homology annotation were used to annotate protein-coding genes. AUGUSTUS (v3.3.1) [49], [50], Genscan (v1.0) [51] and GlimmerHMM (v3.0.2) [52] were used for ab initio prediction of gene models with default parameters. RNA-seq data for different tissues of 3–79 were downloaded from the NCBI database and aligned to the assembled genome by TopHat (v2.1.1) [53], and transcripts were assembled by Cufflinks (v2.2.1) [54]. All cotton protein sequences were downloaded from CottonGen (https://www.cottongen.org) [55] and merged with Theobroma cacao, Oryza, and Arabidopsis thaliana protein sequences downloaded from Phytozome (https://phytozome-next.jgi.doe.gov/) for homology annotation using MAKER (v3.01) [56]. The results of the above predictions were integrated by MAKER (v3.01) to obtain the final annotation file.

For functional annotation of protein-coding genes, the annotated genes were searched against the Swiss-Prot and TrEMBL [57] non-redundant protein databases using BLAST+ (v2.9) [58]. A Gene Ontology (GO) term for each gene was obtained from the corresponding InterPro [59] descriptions. The gene set was mapped to Kyoto Encyclopedia of Genes and Genomes (KEGG) [60] pathways to identify the best-match classification for each gene.

Gene family cluster

The amino acid sequences of G. barbadense 3–79 and G. darwinii (AD5) were downloaded from CottonGen (https://www.cottongen.org) [55]. Gene families of Gb_M210936, 3–79 and AD5 were clustered using OrthoMCL [61]. For genes with alternatively spliced isoforms, only the longest amino acid sequence was kept. An all-versus-all blastp (v2.9.0) (e-value < 10^-5) [58] comparison of all protein sequences from the three accessions was conducted, and orthologous genes were clustered by OrthoMCL [61] using default parameters. Finally, 36,588 gene clusters were retained.
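The isoform filter applied before clustering can be sketched as follows. This is a minimal illustration rather than the authors' script: the function name `keep_longest_isoform` and the convention that an isoform ID is the gene ID followed by `.<n>` are assumptions made only for this example.

```python
def keep_longest_isoform(proteins):
    """Keep the longest protein sequence per gene.

    `proteins` maps an isoform ID to its amino acid sequence. The gene ID is
    assumed (hypothetically) to be the isoform ID without its final ".<n>".
    """
    longest = {}
    for isoform_id, seq in proteins.items():
        gene_id = isoform_id.rsplit(".", 1)[0]
        best = longest.get(gene_id)
        # Replace the stored isoform only if this one is strictly longer.
        if best is None or len(seq) > len(best[1]):
            longest[gene_id] = (isoform_id, seq)
    return {isoform_id: seq for isoform_id, seq in longest.values()}


# Invented toy sequences, not taken from any annotation.
example = {
    "GeneA.1": "MKT",
    "GeneA.2": "MKTLLV",
    "GeneB.1": "MSS",
}
filtered = keep_longest_isoform(example)
```

The filtered set would then be the input to the all-versus-all comparison, so that each gene contributes a single protein to the clustering.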

Variant calling and population genetics analysis

The Illumina paired-end reads were filtered with Fastp (v.0.2) [62] with default parameters to obtain clean reads. Sentieon (v201808.07) was used for calling single nucleotide polymorphisms (SNPs) [63]. High-quality SNPs with a sequencing depth greater than 5, minor allele frequency (MAF) ≥ 0.05 and missing rate ≤ 0.1 were retained for further analysis. A neighbor-joining (NJ) tree was constructed with Phylip (v.3.697) [64] with 1000 bootstrap replicates, and visualized with the online tool iTOL (https://itol.embl.de/) [65]. Population structure was analyzed using Admixture (v1.3.0) [66], and the cross-validation (CV) error was used to evaluate the number of groups. PCA was carried out using Plink (v1.9) [67]. For haplotype analysis, all cultivars were first classified according to the clustering results, and heatmaps were then drawn from the genotype files with the R package Pheatmap (v1.0.12) (https://cran.r-project.org/web/packages/pheatmap/index.html). The population fixation statistic (Fst) and the nucleotide diversity (π) of each subgroup were calculated using VCFtools (v0.1.16) [68] with a 500 kb sliding window and a step size of 50 kb. Candidate domestication-sweep windows were identified as the top 5 % of genomic regions with the greatest reduction in diversity (πwild/πcultivar) and the top 5 % of regions with the greatest Fst between groups [5].
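The window-selection rule for candidate sweeps can be illustrated with a short sketch. It assumes that per-window π and Fst values have already been computed (for example by VCFtools), that a window must fall in the top 5 % for both statistics, and that πcultivar is greater than zero in every window. The function name and the toy values are hypothetical.

```python
import numpy as np


def candidate_sweep_windows(pi_wild, pi_cultivar, fst, top_fraction=0.05):
    """Flag windows in the top fraction for both pi_wild/pi_cultivar and Fst.

    Assumes every window has pi_cultivar > 0 (hypothetical pre-filtering).
    """
    pi_wild = np.asarray(pi_wild, dtype=float)
    pi_cultivar = np.asarray(pi_cultivar, dtype=float)
    fst = np.asarray(fst, dtype=float)

    ratio = pi_wild / pi_cultivar  # large ratio = strong loss of diversity
    ratio_cut = np.quantile(ratio, 1.0 - top_fraction)
    fst_cut = np.quantile(fst, 1.0 - top_fraction)
    return (ratio >= ratio_cut) & (fst >= fst_cut)


# Toy data for 100 windows (invented values, not from the study).
pi_wild = np.ones(100)
pi_cultivar = np.linspace(1.0, 0.1, 100)
fst = np.linspace(0.0, 0.99, 100)
fst[[3, 97]] = fst[[97, 3]]  # move one high-Fst value to a low-ratio window
sweep_mask = candidate_sweep_windows(pi_wild, pi_cultivar, fst)
```

In the toy data, window 97 has a high diversity ratio but a low Fst, and window 3 has the opposite pattern, so neither is flagged; only windows that are extreme for both statistics remain.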

Structural variation analysis

The Gb_M210936 genome was aligned to the reference genome 3–79 [28] using Mummer (v4.0) with default parameter settings [69]. The delta-filter program of Mummer (v4.0) was used to determine the one-to-one alignment blocks with the parameters ‘-r -q’. To identify SVs (≥30 bp), long reads were mapped to 3–79 with NGMLR (v0.2.7) [70], and SVs were called by Sniffles (v1.0.11) [70] with the parameters ‘-s 3 -l 30’. To ensure the accuracy of the data, SVs greater than 100 kb and SVs with a “0/0” genotype were removed [71]. Similar SVs identified in Gb_M210936, Hai7124, Pima90, AD5 and TM-1 were merged by svimmer (https://github.com/DecodeGenetics/svimmer) and were used to build a graph with vg (v.1.33) (https://github.com/vgteam/vg). The presence of the SVs in the graph was then determined in 336 sea-island cotton cultivars from resequencing data using the vg pipeline. The SV genotype files of all cultivars were merged, and the SV set was filtered with MAF ≥ 0.05. Phenotypic data for 14 traits of the 336 sea-island cottons were obtained from Zhao et al. (2022). The 336 sea-island cottons were phenotyped in Yacheng in Hainan Province in 2018; in Korla in Xinjiang in 2013–2016, 2018 and 2019; and in Awat in Xinjiang in 2018 and 2019. Growth period (GP), first node of fruit branch (FNFB), plant height (PH), lint percentage (LP), single boll weight (SBW), boll number (BN), fruit branch number (FBN), fiber strength (FS), fiber length (FL), fiber elongation (FE), fiber uniformity (FU), and fiber micronaire (FM) were recorded in nine environments (locations × years), while seed index (SI) and disease percentage (DP) were assessed in six and four environments, respectively. The average value of each trait was used in this study. The first three principal components of the PCA were used to represent population structure, and Tassel (v5.0) with a mixed linear model (MLM) was used for GWAS [72].
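The MAF filter applied to the merged SV genotypes can be sketched as below. The encoding is an assumption made for illustration only: rows are SVs, columns are cultivars, entries count alternate alleles (0, 1 or 2), and -1 marks a missing call. The function name and the toy matrix are hypothetical.

```python
import numpy as np


def minor_allele_frequency(genotypes):
    """Per-SV minor allele frequency from alt-allele counts (0/1/2, -1 = missing)."""
    g = np.array(genotypes, dtype=float)  # copy, so the input is not modified
    g[g < 0] = np.nan
    called = np.sum(~np.isnan(g), axis=1)  # number of non-missing cultivars
    alt_freq = np.nansum(g, axis=1) / (2.0 * called)
    return np.minimum(alt_freq, 1.0 - alt_freq)


# Invented toy genotypes: 4 SVs x 10 cultivars.
toy = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 2],   # rare but above the threshold
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # monomorphic reference
    [2, 2, 2, 2, 2, 2, 2, 2, 2, -1],  # monomorphic alternate, one missing call
    [0, 1, 2, 0, 1, 2, 0, 1, 2, -1],  # common variant, one missing call
]
maf = minor_allele_frequency(toy)
keep = maf >= 0.05
```

An SV with no called genotypes would produce a division by zero in this sketch, so such rows would need to be removed beforehand.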

Population genomic detection of haploblocks

Haploblocks were identified with the same methods that we used in upland cotton [26]. Genomic regions with abnormal population structure in sea-island cotton were detected by local PCA/population structure (lostruct) [17], [73]. Multidimensional scaling (MDS) plots with windows of 100 SNPs were used to define potential haploblocks, two MDS coordinates (MDS1 and MDS2) were used to display the haploblocks along the genome, and the haploblock boundaries were confirmed manually. The k-means clustering algorithm in R [75] was used to define three clusters from PC1, which was calculated with SNPRelate [74] using all SNPs within each haploblock [17]. The linkage disequilibrium (LD) of the whole chromosome was calculated and the heatmap was drawn using LDBlockShow (https://github.com/BGI-shenzhen/LDBlockShow). The correlations among haploblocks and between haploblocks and the different groups of GbC were calculated using the cor function in R (v4.1.0). The association between haploblocks and traits was tested using the general linear model (GLM) of Tassel (v5.0) [72] without population structure, and was further verified by testing for significant trait differences between the haplotypes of each haploblock with a Student’s t test. The Student’s t test was performed using SPSS 17.0 (SPSS, Chicago, USA).
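The idea of assigning accessions to three haplotype clusters from the first principal component can be sketched in a few lines. This is a NumPy-only illustration, not the lostruct, SNPRelate or R `kmeans` implementations used in the study; the function names, the deterministic initialization and the toy genotypes are all assumptions.

```python
import numpy as np


def first_pc(genotypes):
    """PC1 scores for an accessions x SNPs matrix of alt-allele counts (0/1/2)."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]


def kmeans_1d(values, k=3, n_iter=100):
    """Simple 1-D k-means; labels are ordered by cluster center (0 = lowest)."""
    values = np.asarray(values, dtype=float)
    # Start from evenly spaced quantiles so the result is deterministic.
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    rank = np.empty(k, dtype=int)
    rank[np.argsort(centers)] = np.arange(k)
    return rank[labels]


# Invented haploblock: 8 accessions x 5 fully linked SNPs.
toy = np.array([[0] * 5] * 3 + [[1] * 5] * 2 + [[2] * 5] * 3)
hap_labels = kmeans_1d(first_pc(toy)).tolist()
```

Because the sign of a principal component is arbitrary, the two homozygous clusters may come out as 0 and 2 in either order, while the heterozygous accessions fall in the middle cluster.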

RNA-seq and data analysis

Fiber transcriptome data at 0, 5, 10, 15, 20 and 25 days post-anthesis (DPA) from five cultivars (XH33, XH37, XH58, Ashi, and LuoSaiNa) with extreme differences in fiber quality traits were available [12]. The clean reads were mapped to the 3–79 genome with Hisat2 (v2.2.0) [76], and the expression level of each gene was determined with StringTie (v2.1.4) [77]. The transcripts per million (TPM) [78] value was used to measure the relative level of expression, and a heatmap of expression was drawn using the Pheatmap package (v1.0.12) in R (v4.1.0).
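For readers unfamiliar with the TPM unit, its definition can be written out directly. The sketch below only illustrates the formula; in the study the values were produced by StringTie, and the function name, read counts and transcript lengths here are invented.

```python
import numpy as np


def tpm(read_counts, lengths_bp):
    """Transcripts per million from read counts and transcript lengths (bp)."""
    counts = np.asarray(read_counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1000.0
    rate = counts / lengths_kb        # reads per kilobase of transcript
    return rate / rate.sum() * 1e6    # rescale so all values sum to one million


# Two hypothetical transcripts: the second has more reads but is 4x longer.
tpm_values = tpm([10, 20], [1000, 4000])
```

Normalizing by length before rescaling is what makes TPM values comparable between genes of different lengths within one sample.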

Results

Genome assembly of one perennial sea-island cotton accession

To fill the gap in genomic resources between wild cotton and annual sea-island cotton, a perennial sea-island cotton accession of the Lianhemumian type (Table S1, Fig. S1) was sequenced and its genome was assembled. The accession comes from Denggao village, Lujiang town, Longyang District, Baoshan City, Yunnan, China; its germplasm bank number at the Institute of Cotton Research, Chinese Academy of Agricultural Sciences, Anyang, Henan 455000, China, is M210936, and it is named Gb_M210936 here. HiFi sequencing produced 47.89 Gb of data, with an average sequencing depth of approximately 21.3×; the total number of reads was 3,106,696, and the read N50 reached 15,125 bp (Fig. S2a, S2b). The assembled genome was approximately 2.24 Gb in size, with 776 contigs and a contig N50 of 68.8 Mb (Table 1). When the HiFi reads were aligned to the assembled genome, the alignment rate was 99.9 % and the coverage was 99.7 % (Table S2). The Illumina reads, with a sequencing depth of approximately 60× (Fig. S2c), were also aligned to the assembled genome; the alignment rate was 99.8 % and the coverage was 99.4 % (Table S2).
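The N50 statistic and the sequencing depth quoted above follow from simple arithmetic, sketched here for clarity. The function name `n50` and the toy contig lengths are illustrative; the depth uses the 47.89 gigabases of HiFi data together with the total contig length of 2.243 gigabases listed in Table 1.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must not be empty")


# Toy contig lengths (arbitrary units, invented for the example).
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])

# Average depth = sequenced bases / assembly size (both in gigabases).
hifi_depth = 47.89 / 2.243
```

The resulting depth of about 21× is in line with the approximately 21.3× reported.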

Table 1. Summary of genome assembly and annotation for sea-island cottons.

Genomic feature | Gb_M210936 | 3–79#
Total length of contigs (Gb) | 2.243 | 2.223
Number of contigs | 776 | 4,930
Total length of assemblies (Gb) | 2.24 | 2.267
Number of scaffolds | 26 + 656 | 26 + 3006
Annotated genes | 75,609 | 71,297

# Wang et al., 2019.

Hi-C data were used to anchor the contigs to chromosomes. The original 776 contigs were broken according to the Hi-C interaction map and then reordered on the principle that loci that are closer together show stronger Hi-C interaction signals (Fig. 1a). Finally, 26 chromosomes and 656 scaffolds were constructed, with a total length of 2.24 Gb (Table 1). The scaffold N50 reached 95.8 Mb, and the chromosome anchoring rate was 98.3 %. The GC content of the whole genome was 34.4 %, and the proportion of N was close to 0 (Fig. S3a). The proportion of complete benchmarking universal single-copy orthologs (BUSCO) was 99.3 % for the whole genome (Fig. S3b), and the separate assessments of the At and Dt subgenomes were both greater than 98.0 % (Fig. S3b). The LAI score was 13.1, indicating that the Gb_M210936 genome can be used as a reference genome [40]. Genome annotation predicted 75,609 gene structures (Fig. 1b, Table S3), and 96.5 % of the genes were annotated in at least one database (Table S4). The number of annotated genes was greater than that in 3–79 (71,297) [28], less than that in Pima90 (79,613) [10], and similar to that in Hai7124 (75,071) [29].

Fig. 1. Assembly of one perennial sea-island cotton genome. a. Hi-C map of Gb_M210936. b. Genomic landscape of the Gb_M210936 genome. ① Gene density, ② GC content, ③ LTR retrotransposon density, ④ DNA transposon density, ⑤ LINE retrotransposon density, ⑥ SINE retrotransposon density. Density was calculated using a sliding window of 0.5 Mb.

Phylogenetic relationships of perennial sea-island cottons

In addition to the accession used for genome assembly, five more perennial sea-island cottons were resequenced (Fig. S1, Table S1). To explore the role of perennial sea-island cotton in cotton domestication, these data were combined with published cotton resequencing data, including 555 Gb cultivars (GbC), 98 wild-to-landrace continuum Gb (GbW), 60 Gb landraces (GbL), 394 upland cotton cultivars (Gh), and 22 other allotetraploid cotton species (Table S5). The resequencing data of a total of 1,135 cotton accessions were mapped to the reference genome of G. barbadense acc. 3–79 [28]. A total of 14,305,902 SNPs were identified, with an SNP density of 6.88/kb in At and 6.41/kb in Dt; when only sea-island cottons were considered, 5,370,660 SNPs were obtained, with an SNP density of 2.69/kb in At and 2.22/kb in Dt.

The phylogenetic tree showed that sea-island cottons and upland cottons formed two separate groups, with the accessions of each group clustering together (Fig. S4). G. darwinii (AD5) showed a closer relationship with sea-island cotton than the other allotetraploid cottons did, and GbW was evolutionarily intermediate between AD5 and GbC (Fig. S4). GbL accessions were scattered within GbW and GbC (Fig. S4). The six perennial sea-island cottons did not cluster together to form a separate branch, but were distributed within GbW (Fig. S4).

Population structure of sea-island cottons

The population structure of sea-island cottons was assessed. Because GbL accessions were scattered between GbW and GbC (Fig. S4), they were not used to assess population structure. When K (the number of populations modeled) was set to 2, upland cotton was separated from sea-island cotton (Fig. 2a); when K was set to 3, GbW was separated from GbC; when K was set to 4, GbC was divided into two groups; when K was set to 5, GbC was further divided into three groups (represented by GbC1, GbC2 and GbC3) (Fig. 2a). GbC3 showed an admixed genomic composition of GbC1 and GbC2 (Fig. 2a). The PCA result was consistent with the inferred population structure (Fig. 2b). Gh had lower population fixation statistic (Fst) values with GbC (0.728) than with GbW (0.732) (Fig. 2c). GbC1 (0.096) inherited more genetic basis from GbW, as it had lower Fst with GbW than with GbC2 (0.102) and GbC3 (0.102) (Fig. 2c). GbC2 had undergone the strongest domestication selection, as shown by the lowest SNP diversity value (π) and the highest LD decay among the four sea-island cotton groups (Fig. 2c and 2d). In terms of geographic distribution, American Pima cotton, Egyptian cotton and cultivars from inland China were clustered into either GbC1 or GbC3 (Fig. 2e), while GbC2 contained almost exclusively cultivars from central Asia, the majority of which were from Xinjiang.

Fig. 2. Population structure of sea-island cottons. a. Population structure based on different numbers of clusters (K = 2 to 5). The horizontal axis represents individual plants, and different colors represent different genetic components. Gh, upland cotton; GbC, cultivated sea-island cotton; GbW, wild-to-landrace continuum sea-island cotton. b. PCA of the cotton population when the population was divided into 5 groups (K = 5). Different colors represent different groups as shown in the legend. c. Genetic diversity and population differentiation across five groups. Values inside circles represent SNP diversity (π). The values on the connecting lines between groups indicate population fixation statistics (Fst). d. Genome-wide averages of linkage disequilibrium (LD) decay in four groups of sea-island cotton. e. Types of cultivars in the three groups of GbC. Only GbC with a clear geographic origin were used for statistics, and the numbers in the bar graph represent the number of GbC. Cultivars of GbC2 were mostly from central Asia.

Global comparisons among Gb_M210936, 3–79 and AD5

Sea-island cotton has a closer genetic relationship with G. darwinii (AD5) than with other allotetraploid cottons [5], [27], which was also confirmed in the evolutionary tree (Fig. S4); the genome sequence of Gb_M210936 was therefore compared with 3–79 and AD5. A total of 3.91 × 10^6 SNPs and 0.68 × 10^6 insertions and deletions (Indels) were identified between Gb_M210936 and 3–79 (Fig. 3a). The variant (SNP and Indel) frequency in the At (2.12/kb) was higher than that in the Dt (1.80/kb); 67.4 % of the variants were located in intergenic regions and 4.10 % in introns, and 1.08 % were identified as missense variants. A total of 7.10 × 10^6 SNPs and 0.99 × 10^6 Indels were identified between Gb_M210936 and AD5 (Fig. 3a); 65.4 % of the variants were located in intergenic regions and 3.49 % in introns, and 1.44 % were identified as missense variants.

Fig. 3. Distribution and statistics of SNPs, Indels, SVs, and rearrangements. a. Genome-wide distribution of SNPs, Indels, SVs, and rearrangements when the Gb_M210936 genome was compared with the genomes of 3–79 and AD5. The outermost circle represents the 26 chromosomes, and the second layer represents the gene distribution density. The inner circles represent the SNPs, Indels, SVs and rearrangements identified between Gb_M210936 and 3–79, and between Gb_M210936 and AD5, as shown in the figure. b. Statistics for the number and type of SVs. DEL, deletion; INS, insertion; DUP, duplication; INV, inversion. c. SV length statistics. Red represents the SVs between Gb_M210936 and 3–79. Blue represents the SVs between Gb_M210936 and AD5. The bar chart shows the number of SVs of different lengths, and the pie chart shows the proportion of SVs of different lengths; most of the SVs were less than 1 kb in size. d. Number of SVs on each chromosome. Most SVs were identified on A01. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

A total of 18,736 SVs (≥30 bp) were identified between Gb_M210936 and 3–79 (Fig. 3b), of which 1,434 INSs/DELs (insertions and deletions ≥ 30 bp, as distinct from Indels) overlapped with 1,301 gene bodies, 1,182 INSs/DELs overlapped with the 2,000 bp upstream of 1,194 genes, and 1,072 INSs/DELs were located in the 2,000 bp downstream of 1,145 genes (Table S6). In addition, 35,402 SVs were identified between Gb_M210936 and AD5 (Fig. 3b), of which 2,434 INSs/DELs overlapped with 2,304 gene bodies, 2,804 INSs/DELs were located in the 2,000 bp upstream of 2,859 genes, and 2,488 SVs were located in the 2,000 bp downstream of 2,654 genes (Table S6). Most SVs were INSs and DELs shorter than 1 kb (Fig. 3b and 3c), and A01 carried the most SVs in both the comparison between Gb_M210936 and 3–79 and the comparison between Gb_M210936 and AD5 (Fig. 3d).
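The classification of an SV relative to a gene (gene body, 2,000 bp upstream or 2,000 bp downstream) amounts to interval overlap tests. The sketch below assumes 1-based inclusive coordinates and strand-aware flanks; the function name `classify_sv` and the coordinates are hypothetical, and the authors' exact procedure is not described in the text.

```python
def classify_sv(sv_start, sv_end, gene_start, gene_end, strand, flank=2000):
    """Return the regions of a gene that an SV overlaps.

    Coordinates are 1-based and inclusive. Upstream and downstream flanks
    are swapped for genes on the minus strand.
    """
    def overlaps(a_start, a_end, b_start, b_end):
        return a_start <= b_end and b_start <= a_end

    left = (gene_start - flank, gene_start - 1)
    right = (gene_end + 1, gene_end + flank)
    upstream, downstream = (left, right) if strand == "+" else (right, left)

    hits = set()
    if overlaps(sv_start, sv_end, gene_start, gene_end):
        hits.add("gene_body")
    if overlaps(sv_start, sv_end, *upstream):
        hits.add("upstream")
    if overlaps(sv_start, sv_end, *downstream):
        hits.add("downstream")
    return hits


# Hypothetical gene at 10,000-12,000 and a 61 bp deletion at 9,500-9,560.
plus_strand_hit = classify_sv(9500, 9560, 10000, 12000, "+")
minus_strand_hit = classify_sv(9500, 9560, 10000, 12000, "-")
```

An SV that spans a gene boundary is reported in more than one category, which is one reason why the counts of SVs and of affected genes need not match.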

The genomes of Pima90 and Hai7124 were also used to identify Mb-level inversions; when aligned to 3–79, Hai7124 had more inversions than the other three accessions on 14 chromosomes (Fig. S5). Three inversions, located on A03, A09 and A13, were identified between Gb_M210936 and 3–79 and were also identified in comparisons with other G. barbadense accessions, indicating that these inversions are common in sea-island cotton and not specific to Gb_M210936. A total of 1,112 segments in Gb_M210936, with a total length of 44 Mb, were absent in 3–79, and 2,724 segments in 3–79, with a total length of 34 Mb, were absent in Gb_M210936. The largest Gb_M210936-specific segment was a 2.23 Mb segment on A09, and the largest 3–79-specific segment was a 1.37 Mb segment on A01. We found that 497 genes in Gb_M210936 and 244 genes in 3–79 were located in these presence/absence variation (PAV) regions (Table S7). In the comparison between Gb_M210936 and AD5, a total of 2,240 segments in Gb_M210936, with a total length of 76 Mb, were absent in AD5, and 2,251 segments in AD5, with a total length of 47 Mb, were absent in Gb_M210936. In these PAV regions, there were 1,193 genes in Gb_M210936 and 897 genes in AD5 (Table S7).

A total of 23,591 gene clusters were shared by Gb_M210936, 3–79 and AD5, and 1,553 clusters were shared by Gb_M210936 and 3–79 but not by AD5 (Fig. S6a). A further 817 clusters were identified only in Gb_M210936; KEGG analysis showed that the genes in these clusters were mainly involved in the photosynthesis and fatty acid biosynthesis pathways (Fig. S6b).

SVs in sea-island cottons

A total of 5,756 SVs were identified in both Gb_M210936 and AD5, but not in TM-1, when these genomes were compared with the 3–79 genome, implying that these SVs were introduced from upland cotton during the breeding of annual sea-island cotton cultivars. Of these, 4,876 INSs/DELs overlapped with 349 gene bodies and were adjacent to 736 genes (Table S8). To identify SVs in sea-island cottons more comprehensively from the existing data, the SVs identified in Gb_M210936, Hai7124, Pima90, AD5 and TM-1 by comparison with 3–79, including those introduced from upland cotton, were merged, and the presence of these SVs in each Gb cultivar was determined using resequencing data. A total of 2,357 SVs were identified only in AD5 and not in the six perennial sea-island cottons or GbC. These G. darwinii-specific SVs, which overlapped with 170 genes and were adjacent to 442 genes (Table S9), may play an important role in the differentiation of Gb cotton from G. darwinii. A total of 8,194 SVs were shared by all six perennial sea-island cottons, of which only 110 had a low frequency in GbC (≤5%), indicating that SVs specific to perennial sea-island cotton were few. These 110 SVs overlapped with 6 gene bodies, among which Gbar_D01G023100 (annotated as Expansin-A4, EXPA4) may be involved in plant-type cell wall loosening [79].

In GbC, 23,195 SVs with MAF ≥ 0.05 were retained (Fig. 4a), and 71.8 % of them had a frequency between 10 % and 90 %. Using SV-based GWAS, 31 SVs were found to be associated with 11 traits (Fig. 4b, S7, Table S10). A 52 bp DEL on chromosome A06 was associated with GP and was located in gene Gbar_A06G007270, which is annotated as IAA-amino acid hydrolase ILR1-like 4 (ILL4); cultivars carrying this SV (average GP = 126.1 days) had a GP 7 days shorter than cultivars lacking it (average GP = 133.1 days) (Fig. S8a). A 39 bp INS on chromosome D11, identified in TM-1 but not in AD5 or Gb_M210936 (Table S10), was associated with BN; it overlapped with gene Gbar_D11G032720 and was identified in only 18 cultivars (Fig. S8b). A 35 bp DEL on chromosome A04 associated with LP was identified in only 20 cultivars and had not been detected by SNP-GWAS [12]; cultivars carrying this SV (average LP = 37.4 %) had significantly higher LP than cultivars lacking it (average LP = 32.5 %) (Fig. S8c). This DEL was identified in Pima90 but not in AD5, Gb_M210936 or TM-1, and may represent a rare variation within cultivated annual sea-island cotton. On chromosome D10, three SVs were associated with FM at a locus that had also been associated with FM by SNP-GWAS [12], [13]. Among them, a 74 bp DEL (SV D10_15354941), identified in both AD5 and Gb_M210936 (Table S10), showed the most significant association with FM, and the FM of sea-island cottons carrying this SV was significantly higher than that of sea-island cottons lacking it (Fig. 4c). This SV is present in both Ashi and LuoSaiNa, which have higher FM (FM ≥ 4.9), and absent from XH58, XH37 and XH33, which have lower FM (FM ≤ 4.5). There were 31 genes within 600 kb upstream and downstream of the FM-associated SV, and two of them showed expression differences between the two genotypes (Fig. S9). Gbar_D10G011020, annotated as UDP-glucuronate: xylan alpha-glucuronosyltransferase 1 (GUX1), is involved in the metabolism of xylan [80]. Gbar_D10G011030, annotated as cell wall/vacuolar inhibitor of fructosidase 2 (C/VIF2), inhibits fructosidases from the cell wall [81]. These two genes may be the key genes responsible for the FM difference between the two genotypes.
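The frequency filter and the carrier versus non-carrier comparison used in this paragraph can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: it assumes that SV genotypes are coded as alternative-allele counts (0, 1 or 2, with None for missing calls), and the function names and toy values are hypothetical.

```python
from statistics import mean


def minor_allele_frequency(genotypes):
    """MAF of one SV locus; genotypes are alt-allele counts (0, 1, 2) or None."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(sv_genotypes, threshold=0.05):
    """Keep only the SV loci whose MAF is at least the threshold."""
    return {sv: g for sv, g in sv_genotypes.items()
            if minor_allele_frequency(g) >= threshold}


def carrier_effect(genotypes, phenotypes):
    """Mean phenotype of SV carriers minus mean phenotype of non-carriers."""
    carriers = [p for g, p in zip(genotypes, phenotypes)
                if g is not None and g > 0]
    non_carriers = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    return mean(carriers) - mean(non_carriers)


# Toy data (invented for illustration only, not taken from the study).
toy_svs = {
    "sv_common": [0, 0, 2, 2, 0, 2, 0, 0, 2, 0],  # MAF = 0.4, kept
    "sv_rare": [0] * 39 + [2],                    # MAF = 0.025, dropped
}
kept = filter_by_maf(toy_svs)
print(sorted(kept))
print(carrier_effect([2, 2, 0, 0], [126.0, 127.0, 133.0, 134.0]))
```

A negative value from `carrier_effect` means that carriers have the lower trait value, as reported here for the GP-associated DEL; testing the significance of such a difference would require an additional step such as a t test.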

Fig. 4.

SVs in sea-island cottons. a. An SV set of sea-island cottons. Different colors correspond to different genotypes, as shown in the legend. Each point on the horizontal axis represents an SV locus, and each row on the vertical axis represents a sea-island cotton cultivar. b. SV-based GWAS of fiber micronaire (FM). c. Phenotypic statistics of the SV associated with FM. The lowercase n represents the number of accessions. Significance was calculated using a t test.

Haploblocks are associated with the population differentiation and adaptation of sea-island cotton

To elucidate the driving forces of population differentiation and the effect of conserved genomic regions in sea-island cotton, 22 haploblocks with a total length of 187.59 Mb were identified (Fig. S10-S13, Table S11), covering 2,418 genes in the genome. Accessions in each haploblock can be divided into three haplotypes, represented by g0, g1 and g2 (Fig. S10e, Table S12), where g1 is the heterozygous combination of g0 and g2 (Fig. S10f). Because the SNPs and Indels within each haploblock are tightly linked and fall roughly into two haplotypes, the SNPs and Indels distinguishing the two haplotypes in each haploblock were annotated. In total, 656 genes in haploblocks had drastic sequence changes (Table S13).

These Mb-level haploblocks were large enough to affect the population structure of sea-island cottons, and nine haploblocks were correlated with the grouping of GbC1, GbC2 and GbC3 (r > 0.3) (Fig. S14), among which D04_HB1 showed the highest correlation (r = 0.54). The g2 haplotype of these GbC-group-related haploblocks had a higher proportion in GbC1, whereas the g0 haplotype had a higher proportion in GbC2 (Fig. S15). For example, for D04_HB1, 78.0 % of GbC2 cultivars were grouped into the g0 haplotype and 75.8 % of GbC1 cultivars were grouped into the g2 haplotype (Fig. S15). When the early-maturity traits were summarized for the nine group-related haploblocks, cultivars of the g0 haplotype had a shorter GP and lower FNFB than those of the g2 haplotype (Fig. S16). These GbC-group-related haploblocks were more strongly correlated with each other than with the other haploblocks (Fig. S14). Among the nine haploblocks, seven were located on five chromosomes of the Dt subgenome (Fig. S14), indicating that Dt played a greater role than At in the population differentiation of sea-island cotton.
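A correlation between haplotype and group assignment can be computed as in the following minimal Python sketch. How the groups were encoded in this study is not stated here, so the sketch makes two assumptions: haplotypes g0, g1 and g2 are coded as 0, 1 and 2, and group membership is coded as a binary indicator (1 for a hypothetical focal group, 0 otherwise). The toy vectors are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy data (invented): haplotype codes for eight cultivars and a
# binary indicator of membership in one hypothetical group.
haplotype_code = [0, 0, 0, 2, 2, 2, 0, 2]
in_focal_group = [1, 1, 1, 0, 0, 0, 0, 1]

r = pearson_r(haplotype_code, in_focal_group)
is_group_related = abs(r) > 0.3  # threshold taken from the text
print(round(r, 3), is_group_related)
```

With this encoding the sign of r only reflects which haplotype is enriched in the focal group, so the absolute value is compared with the threshold.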

Haploblocks underwent selection during sea-island cotton domestication

Seven haploblocks overlapped with the putative centromeric regions (Fig. S17a). Putative introgressions from upland cotton into sea-island cotton were identified (Fig. S17a), and 10 haploblocks overlapped with these introgression regions. In addition, A09_HB1/HB2, A11_HB1, D03_HB1, D04_HB3, D05_HB1, D05_HB2 and D12_HB1 overlapped with inversions between sea-island cotton genomes (Fig. S12). These inversions were identified in Hai7124, and some of them may have been introgressed from AD5 because the inversions on A11, D05 and D12 were also present in AD5 (Fig. S5). The inversions on A11, D05 and D12 also overlapped with the putative centromeric regions. Centromeres, intraspecific inversions and interspecific introgressions were therefore the main causes of haploblock formation.

A total of 41 putative selective-sweep regions, with a total length of 46.70 Mb, were identified using Fst and π (Fig. S17a, Table S14). The proportions of the g0, g1 and g2 haplotypes in each haploblock varied widely in both GbW and GbC (Fig. S17b, c), and for some haploblocks the proportions of the three haplotypes differed considerably between GbW and GbC (Table S12, S15, Fig. S17d). For example, the proportion of the g2 haplotype in A01_HB1 increased from 4.76 % in GbW to 87.34 % in GbC, whereas the proportion of the g2 haplotype in D11_HB1 decreased from 87.62 % in GbW to 4.69 % in GbC (Fig. S17b, c, d). In total, eight haploblocks had a proportion difference greater than 50 % (Fig. S17d). Haploblocks in which the proportions of the three haplotypes differed markedly between GbW and GbC may, as a whole, have been subjected to breeding selection during the domestication from GbW to GbC. We also found that more individuals in GbW than in GbC were assigned to the intermediate, heterozygous g1 haplotype (Fig. S17b, c), corresponding to a higher genetic diversity in GbW.
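The shift in haplotype proportion between GbW and GbC can be expressed as a simple difference in percentage points. The short Python sketch below uses only the two g2 proportions reported in this paragraph; the function names are hypothetical, and the 50 % threshold is the one stated in the text.

```python
from collections import Counter


def haplotype_proportions(labels):
    """Percentage of accessions carrying each haplotype (g0, g1, g2)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {h: 100.0 * counts.get(h, 0) / total for h in ("g0", "g1", "g2")}


def proportion_shift(wild_pct, cultivar_pct):
    """Change in percentage points from GbW to GbC."""
    return cultivar_pct - wild_pct


# g2 proportions (GbW, GbC) as reported in the text.
reported_g2 = {"A01_HB1": (4.76, 87.34), "D11_HB1": (87.62, 4.69)}

shifts = {hb: proportion_shift(w, c) for hb, (w, c) in reported_g2.items()}
strongly_shifted = [hb for hb, s in shifts.items() if abs(s) > 50]

for hb, s in shifts.items():
    print(hb, round(s, 2))
print(strongly_shifted)
```

Both example haploblocks exceed the 50 % threshold, one through an increase and one through a decrease of the g2 haplotype. `haplotype_proportions` shows how such percentages could be derived from per-accession haplotype labels.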

Haploblocks are associated with agronomic traits

Treating haploblocks as individual loci, correlation analysis of haploblocks with agronomic traits showed that 21 haploblocks were associated with at least one growth trait (-log10(P) > 2) (Fig. 5a); among them, A01_HB3, A03_HB2, A08_HB1 and D07_HB1 were each associated with three growth traits, and 12 haploblocks were each associated with two growth traits. Four haploblocks were associated with DP (Fig. 5a). A total of 21 haploblocks were associated with at least one yield trait (Fig. 5b), among which A01_HB4, D04_HB2, D04_HB3, D05_HB2, D07_HB1 and D09_HB1 were each associated with at least three yield traits. In addition, 17 haploblocks were associated with at least one fiber quality trait (Fig. 5c), among which A01_HB3, A01_HB4, D04_HB2, D04_HB3, D09_HB1 and D12_HB1 were each associated with at least four fiber quality traits. A total of 15 haploblocks were associated with growth, yield and fiber quality traits simultaneously (Fig. 5a-c).

Fig. 5.

Association analysis of haploblocks and phenotypic traits. a. Association analysis of haploblocks with sea-island cotton growth traits and the disease resistance trait. GP, growth period; FNFB, first node of fruit branch; PH, plant height; DP, disease percentage. b. Association analysis of haploblocks and yield traits. BN, boll number; FBN, fruit branch number; SBW, single boll weight; SI, seed index; LP, lint percentage. c. Association analysis of haploblocks and fiber quality traits. FS, fiber strength; FU, fiber uniformity; FL, fiber length; FE, fiber elongation; FM, fiber micronaire. The threshold line in the figure represents -log10(P value) = 2. d. The GP of sea-island cotton gradually decreased with the accumulation of shorter-GP haplotypes of haploblocks, and the LP of upland cotton gradually increased with the accumulation of higher-LP haplotypes of haploblocks.

With the accumulation of favorable haploblocks (shorter GP; higher LP), GP gradually decreased and LP gradually increased (Fig. 5d), indicating that these haploblocks had cumulative effects similar to those of favorable alleles. In addition, the nine GbC-group-related haploblocks were all associated with both GP and FNFB (Fig. 5a), which was consistent with the trait differences among the three haplotypes of these haploblocks (Fig. S16). Haploblocks under selection also showed correlations with multiple traits (Fig. 5a).
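The cumulative effect described above amounts to counting, for each accession, how many haploblocks carry the favorable haplotype and then averaging the trait within each count class. The following Python sketch illustrates this under stated assumptions: the haploblock names (HB_a, HB_b), the choice of g0 as the favorable haplotype and all trait values are invented for illustration and are not results of this study.

```python
from collections import defaultdict
from statistics import mean


def favorable_count(haplotypes, favorable):
    """Number of haploblocks at which an accession carries the favorable haplotype."""
    return sum(1 for block, fav in favorable.items()
               if haplotypes.get(block) == fav)


def mean_trait_by_count(accessions, favorable, trait):
    """Mean trait value for each number of accumulated favorable haplotypes."""
    groups = defaultdict(list)
    for acc in accessions:
        groups[favorable_count(acc["haplotypes"], favorable)].append(acc[trait])
    return {k: mean(v) for k, v in sorted(groups.items())}


# Hypothetical haploblocks and toy growth period (GP) values in days.
favorable = {"HB_a": "g0", "HB_b": "g0"}
accessions = [
    {"haplotypes": {"HB_a": "g2", "HB_b": "g2"}, "GP": 135.0},
    {"haplotypes": {"HB_a": "g0", "HB_b": "g2"}, "GP": 131.0},
    {"haplotypes": {"HB_a": "g2", "HB_b": "g0"}, "GP": 131.0},
    {"haplotypes": {"HB_a": "g0", "HB_b": "g0"}, "GP": 127.0},
]

gp_by_count = mean_trait_by_count(accessions, favorable, "GP")
print(gp_by_count)
```

A monotonic decrease of mean GP across the count classes would correspond to the additive-like pattern shown in Fig. 5d.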

Differentially expressed genes in haploblocks

D04_HB3, A11_HB1, D03_HB1, D05_HB1, D05_HB2 and D12_HB1 were covered by inversions (Fig. S12). Based on the haplotype grouping of each haploblock (Table S12), the presence of the inversion in each cultivar was accurately determined. A11_HB1 was associated with FS (Fig. 5c) and was covered by an inversion on chromosome A11 with a frequency of 3.55 % in sea-island cotton cultivars (Table S12). Cultivars of the g2 haplotype in A11_HB1 had lower FS than cultivars of the g0 haplotype (Fig. S18a), and the g0 haplotype predominated in sea-island cotton cultivars. Only LuoSaiNa (FS = 27.5 cN/tex) belonged to the g2 haplotype in A11_HB1, whereas the other four cultivars (FS ≥ 32 cN/tex) belonged to the g0 haplotype. Gbar_A11G023380 had lower expression and Gbar_A11G024150 had higher expression in LuoSaiNa than in cultivars of the g0 haplotype in A11_HB1 (Fig. S18b). D05_HB1 was mainly associated with GP and FNFB (Fig. 5a), and cultivars of the g2 haplotype accounted for 29.9 % of GbC. D05_HB2 was mainly associated with FL and FBN (Fig. 5b, c, Fig. S19a), and cultivars of the g2 haplotype accounted for 30.47 % of GbC. Ashi (FL = 25 mm) and LuoSaiNa (FL = 28.3 mm) belonged to the g2 haplotype in D05_HB2, whereas the other three cultivars belonged to the g0 haplotype (FL ≥ 34.8 mm); two genes, Gbar_D05G034100 and Gbar_D05G034350, showed lower expression in cultivars of the g2 haplotype than in cultivars of the g0 haplotype in D05_HB2 (Fig. S19b). The inversions on D03 and D12 were present in only 7.2 % and 4.25 % of sea-island cottons, respectively (Table S12).

A01_HB1-HB5, A08_HB1, A10_HB1, D04_HB1, D09_HB1 and D11_HB1 were introgression-generated haploblocks (Fig. S13), which showed correlations with multiple traits (Fig. 5a-c). Five haploblocks (A01_HB1-HB5) were identified on chromosome A01 in sea-island cottons (Fig. S20a), among which A01_HB3 and A01_HB4 were highly linked and both showed correlations with GP, FNFB, FBN, BN, FS, FE and FL (Fig. 5a-c). Taking A01_HB4 as the representative for detailed analysis, cultivars of the g0 haplotype in A01_HB4 had shorter GP and lower FNFB, BN, FBN, FS and PH, but higher FE, than cultivars of the g2 haplotype (Fig. S20b). Most GbW accessions belonged to the g0 haplotype, but the g2 haplotype generated by introgression accounted for a higher proportion in GbC. Only LuoSaiNa belonged to the g2 haplotype in A01_HB3 and A01_HB4, and six genes, Gbar_A01G013620, Gbar_A01G013660, Gbar_A01G013800, Gbar_A01G014000, Gbar_A01G014270 and Gbar_A01G013590, exhibited obvious expression differences between the two haplotypes (Fig. S20c). Variations generated by introgression that may affect gene expression were identified in this study (Table S8, Table S13). For example, an 11.9 kb DEL in the g2 haplotype resulted in the loss of Gbar_A01G014000 (Table S8).

For D11_HB1 (Fig. S21a), cultivars of the newly generated g0 haplotype had higher yield and better fiber quality than those of the g2 haplotype (Fig. S21b), and the g0 haplotype accounted for the majority of sea-island cotton cultivars. At least six genes showed obvious expression differences in fibers between cultivars of the g0 and g2 haplotypes (Fig. S21c). In addition, 25 gene bodies overlapped with SVs introduced from Gh (Table S8). Among them, AGP14 (Arabinogalactan peptide 14, Gbar_D11G025860) was rarely expressed in fibers of cultivars of the g2 haplotype but was highly expressed in fibers at all stages in cultivars of the g0 haplotype (Fig. S21c). A 31,029 bp DEL resulted in the loss of AGP14 in the g2 haplotype (Table S8). Two CSLE6 genes (Cellulose synthase-like protein E6, Gbar_D11G025800 and Gbar_D11G025810) had low expression in 5 DPA fibers of cultivars of the g0 haplotype. A 168 bp insertion was identified 101 bp upstream of the start codon of Gbar_D11G025810 when comparing Gb_M210936 with 3–79 (Fig. S22), which may be the main reason for the difference in gene expression between the two haplotypes. This insertion was also identified in AD5 but not in TM-1 (Fig. S22). Overall, introgression-generated haploblocks shortened the GP of sea-island cotton, partially increased the yield, and improved the FS and FL of sea-island cotton.

Haploblocks on chromosome D04 are associated with sea-island cotton adaptation

When the Hai7124 genome was compared with 3–79, a large deletion and an inversion on chromosome D04 covering D04_HB2 and D04_HB3 were found, and the deletion was also identified when Gb_M210936 was compared with 3–79 (Fig. 6a). D04_HB1, D04_HB2 and D04_HB3 showed high linkage with each other (Fig. 6a), and all belonged to the GbC-group-related haploblocks, which showed the highest correlation with early-maturity traits (Fig. 5a). Different haplotypes of D04_HB1, D04_HB2 and D04_HB3 showed obvious differences in geographical distribution: cultivars of the g0 haplotype had a higher proportion in high-latitude regions, including Xinjiang, China and some central Asian countries such as Tajikistan, Turkmenistan and Uzbekistan, whereas cultivars of the g2 haplotype had a higher proportion in low-latitude regions, such as the Yangtze River region (YtRR), the southern China region (SCR), Egypt and the southern USA (Fig. 6b, c, d). The haploblocks differed in that the g0 haplotype made up the majority of GbW in D04_HB1 (96.1 %), whereas the g2 haplotype had a higher proportion in GbW in D04_HB2 (75 %) and D04_HB3 (59.8 %).

Fig. 6.

Haploblocks on chromosome D04 associated with adaptation of sea-island cotton. a. The three haploblocks identified on chromosome D04. From top to bottom: three haploblocks on chromosome D04 identified by the local-PCA method, where MDS1 and MDS2 are two coordinates of multidimensional scaling (MDS) and were used to show the haploblocks along the genome; the linkage disequilibrium (LD) map of SNPs on the entire chromosome; comparison of chromosome D04 between Gb_M210936, 3–79 and Hai7124; the haplotype map based on the genotype file. b. Geographical distribution of different haplotypes of D04_HB1. c. Geographical distribution of different haplotypes of D04_HB2. d. Geographical distribution of different haplotypes of D04_HB3. For a-d, the g0 haplotype had a relatively high proportion in Xinjiang, China and some central Asian countries, and the g2 haplotype had a higher proportion in low-latitude regions, such as the Yangtze River region (YtRR), the southern China region (SCR), Egypt and the southern USA.

Discussion

Sea-island cotton includes annual and perennial types; however, the perennial type has been neglected because it is not used in production. To evaluate and explore the genetic resources of perennial sea-island cotton, the genome of one perennial sea-island cotton accession from China was assembled to make up for the lack of a perennial sea-island cotton genome (Fig. 1, Table 1). Variants identified between the perennial Gb_M210936 and the annual cultivar 3–79 might shed light on the differentiation and the transition from perennial to annual sea-island cotton (Fig. 3). Based on resequencing data, the six perennial sea-island cottons were distributed within GbW and did not cluster together to form a separate branch (Fig. S4), indicating that they have higher genetic diversity than modern cultivars and are a valuable resource for Gb genetics and breeding. They may have been introduced directly from wild or landrace Gb cotton, and the warm climate of southern China, such as in Yunnan, is suitable for the perennial growth of sea-island cotton. We took advantage of existing genomic data to identify SVs in sea-island cotton populations, and the SVs in this study were representative because 71.8 % of them had a relatively high frequency. Some SVs associated with agronomic traits were identified using SV-based GWAS (SV-GWAS) (Table S10). Although there were fewer SVs than SNPs and the distribution of SVs on chromosomes was not uniform, some loci identified using SV-GWAS overlapped with loci identified by SNP-GWAS [12], [13], demonstrating that the two methods complement each other. Using PAV-GWAS, Song et al. directly identified SVs affecting yield and flowering period in Brassica napus [82]. Although finding candidate genes within association loci is difficult, designing molecular markers for breeding selection from SVs associated with agronomic traits is easier because their sequence differences are larger than those of SNPs/Indels.

The population structure of Gb cotton may be relatively complex. When K ≥ 5, the CV error did not change much, and the most appropriate number of groups was obtained when K was set to 9 (Fig. S23). Because this study focused on the sea-island cotton group from central Asia, especially Xinjiang, K = 5 was chosen (Fig. 2a). Annual sea-island cottons were roughly divided into three groups when K = 5, which was consistent with the results of Zhao et al. [12] but not with those of Percy et al., in which sea-island cottons were divided into Pima type, Egypt type and central Asia type [11]. Almost all cultivars in GbC2 came from central Asia, indicating that central Asia has gradually developed indigenous cotton populations, but both Pima type and Egypt type cultivars were assigned to GbC1 or GbC2. Interspecific introgressions, large inversions and genomic regions near the centromere maintain high LD in local regions of the sea-island cotton genome. Such regions have seldom been functionally dissected by methods such as SNP-based GWAS because of their Mb-level linkage intervals [17], yet they often carry aggregated mutations that cover many genes. Haploblocks produced by interspecific introgressions seemed to have more obvious effects on traits and a greater impact on group differentiation than haploblocks produced by intraspecific inversions in sea-island cotton, because the frequency of the inversions identified in this study was low. The nine GbC-group-related haploblocks might be the genetic basis of sea-island cotton population differentiation (Fig. S14), and the haploblocks on D04 were probably the most prominent ones that allowed the cultivars to gradually adapt to central Asia (Fig. 6). Interspecific introgressions promoted population differentiation and adaptive improvement in both upland cotton and sea-island cotton; the difference is that in upland cotton, introgression-generated inversions on chromosomes A06 and A08 drove population and adaptive differentiation [9], [26].

Interspecific introgression plays an important role in the genetic improvement of sea-island cotton [5], [13], [21], [22], and some introgressions from upland cotton to sea-island cotton identified in this study have been reported before [13], [22]. However, we used a larger sea-island cotton population, and our analysis was more specific and comprehensive. We typed the sea-island cotton cultivars within each introgression-generated haploblock and quantified the number of cultivars in each haplotype. Some differentially expressed genes within the haploblocks associated with fiber quality were identified. The effects of haploblocks on agronomic traits were revealed (Fig. 5a, 5b, 5c), which will be helpful in breeding selection. In breeding, we hope to obtain key major genes for the rapid and effective improvement of crops, but these conserved regions cannot be ignored. Accumulation of favorable haploblocks may be another strategy for efficient breeding (Fig. 5d). Based on the effects of these haploblocks on specific traits, selective combination of haploblocks may be faster than aggregating superior alleles. These haploblocks might also be a major cause of linkage drag, because one haploblock tends to be associated with multiple traits (Fig. 5a, 5b, 5c).

Candidate domestication-sweep windows are commonly identified as the regions where the top 5 % of Fst and the top 5 % of πwild/πcultivar overlap in the genome [5], [83]. The difference in the proportions of the g0 and g2 haplotypes between GbW and GbC did indicate that either the g0 or the g2 haplotype was preferentially selected in GbC. However, haploblocks rarely overlapped with the sweep regions identified by Fst and π (Fig. S17). This is mainly because, although g0 or g2 had been preferentially selected in GbC, the other haplotype inherited from GbW still remained at a certain proportion, which narrowed the difference between GbW and GbC and led to a reduced Fst. Breeding selection acting on haploblocks as a whole has been overlooked in previous research.
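The window-overlap criterion described above can be sketched in a few lines of Python. This is only an illustration of the general idea, assuming that per-window Fst and π values have already been computed for the same set of genomic windows; the function names and the toy values are hypothetical, and the exact window size and tie handling used in practice may differ.

```python
import math


def top_fraction(values, fraction=0.05):
    """Indices of the windows whose values rank in the top fraction."""
    k = max(1, math.ceil(len(values) * fraction))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(ranked[:k])


def candidate_sweep_windows(fst, pi_wild, pi_cultivar, fraction=0.05):
    """Windows in both the top Fst fraction and the top pi_wild/pi_cultivar fraction."""
    ratio = [w / c for w, c in zip(pi_wild, pi_cultivar)]
    return sorted(top_fraction(fst, fraction) & top_fraction(ratio, fraction))


# Toy data for 40 windows (invented for illustration only).
fst = [0.1] * 40
fst[7], fst[21] = 0.8, 0.7          # two highly differentiated windows
pi_wild = [1.0] * 40
pi_cultivar = [1.0] * 40
pi_cultivar[7], pi_cultivar[30] = 0.1, 0.2   # diversity lost in cultivars

print(candidate_sweep_windows(fst, pi_wild, pi_cultivar))
```

In this toy example only one window passes both filters. A window with strongly reduced diversity but moderate Fst, or the reverse, is excluded, which is consistent with the observation that haploblocks retaining both haplotypes in GbC can escape this kind of scan.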

The boundaries of haploblocks were not absolutely conserved, which might obscure their identification, although the impact of several hundred SNPs is limited for Mb-scale genomic regions. Because haploblock analysis is a macroscopic method, identifying the key genes that have a decisive impact within each haploblock requires further analysis.

Conclusion

In this study, the genome of one perennial Gb cotton accession was assembled, leading to the identification of intraspecific and interspecific SVs associated with agronomic traits. Haploblocks were associated with the improvement of agronomic traits and drove the differentiation and adaptation of sea-island cotton. Our study lays a foundation for the breeding and improvement of sea-island cotton.

Compliance with ethics requirements

This article does not contain any studies with human or animal subjects.

CRediT authorship contribution statement

Nian Wang: Methodology, Investigation, Writing – original draft. Yuanxue Li: Resources, Data curation. Qingying Meng: Resources, Data curation. Meilin Chen: Resources, Data curation. Mi Wu: Resources, Data curation. Ruiting Zhang: Resources, Data curation. Zhiyong Xu: Resources, Data curation. Jie Sun: Writing – review & editing. Xianlong Zhang: Writing – review & editing. Xinhui Nie: Writing – review & editing. Daojun Yuan: Writing – review & editing. Zhongxu Lin: Conceptualization, Supervision, Project administration, Funding acquisition, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We thank the National Medium-term Gene Bank of Cotton in China and National cotton germplasm resources platform for providing the 6 perennial sea-island cottons. We thank the high-performance computing center at the National Key Laboratory of Crop Genetic Improvement at Huazhong Agricultural University. This research was supported by project of Hubei Hongshan Laboratory (2021hszd006).

Data availability

The sequencing data of perennial sea-island cottons were uploaded to the NCBI website (https://www.ncbi.nlm.nih.gov/) under study number PRJNA861722. The genome file of Gb_M210936 can be found at https://figshare.com/s/f4a3e8d9927817442dde. The other resequencing data used in this study can be found at NCBI as studies PRJNA720818, PRJNA613140 and PRJNA414461. The sequences of the reference genomes 3-79, Hai7124, Pima90 and AD5 were downloaded from Cottongen (https://www.cottongen.org/) [55].

Footnotes

Peer review under responsibility of Cairo University.

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.jare.2023.02.002.

Contributor Information

Nian Wang, Email: wangnian@webmail.hzau.edu.cn.

Yuanxue Li, Email: liyx1124@163.com.

Qingying Meng, Email: qymeng1996@foxmail.com.

Meilin Chen, Email: 1204514079@qq.com.

Mi Wu, Email: wumi@webmail.hzau.edu.cn.

Ruiting Zhang, Email: 934520453@qq.com.

Zhiyong Xu, Email: 17097462507@163.com.

Jie Sun, Email: sunjie@shzu.edu.cn.

Xianlong Zhang, Email: xlzhang@mail.hzau.edu.cn.

Xinhui Nie, Email: xjnxh2004130@126.com.

Daojun Yuan, Email: robert@mail.hzau.edu.cn.

Zhongxu Lin, Email: linzhongxu@mail.hzau.edu.cn.

References

  • 1.Wendel J.F. New World tetraploid cottons contain Old World cytoplasm. Proc Natl Acad Sci USA. 1989;86:4132–4136. doi: 10.2307/33638.
  • 2.Senchina D.S. Rate variation among nuclear genes and the age of polyploidy in Gossypium. Mol Biol and Evol. 2003;20:633–643. doi: 10.1093/molbev/msg065.
  • 3.Huang G., Wu Z., Percy R.G., Bai M., Li Y., Frelichowski J.E., et al. Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nat Genet. 2020;52:516–524. doi: 10.1038/s41588-020-0607-4.
  • 4.Fang L., Gong H., Hu Y., Liu C., Zhou B., Huang T., et al. Genomic insights into divergence and dual domestication of cultivated allotetraploid cottons. Genome Biol. 2017;18:33–46. doi: 10.1186/s13059-017-1167-5.
  • 5.Yuan D., Grover C.E., Hu G., Pan M., Miller E.R., Conover J.L., et al. Parallel and intertwining threads of domestication in allopolyploid cotton. Advanced Sci. 2021;8 doi: 10.1002/advs.202003634.
  • 6.Huang Z.K. Cotton varieties and their genealogy in China. Chinese Agricultural Press. 2007.
  • 7.Wang M., Tu L., Lin M., Lin Z., Wang P., Yang Q., et al. Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication. Nat Genet. 2017;49:579–587. doi: 10.1038/ng.3807.
  • 8.Ma Z., He S., Wang X., Sun J., Zhang Y., Zhang G., et al. Resequencing a core collection of upland cotton identifies genomic variation and loci influencing fiber quality and yield. Nat Genet. 2018;50:803–813. doi: 10.1038/s41588-018-0119-7.
  • 9.He S., Sun G., Geng X., Gong W., Dai P., Jia Y., et al. The genomic basis of geographic differentiation and fiber improvement in cultivated cotton. Nat Genet. 2021;53:916–924. doi: 10.1038/s41588-021-00844-9.
  • 10.Ma Z., Zhang Y., Jiang Y., Wang N., Wang G., Li X., et al. High-quality genome assembly and resequencing of modern cotton cultivars provide resources for crop improvement. Nat Genet. 2021;53:1385–1391. doi: 10.1038/s41588-021-00910-2.
  • 11.Percy R. The worldwide gene pool of Gossypium barbadense L. and its improvement. In: Genetics and genomics of cotton. US: Springer. 2009;3:53–68. doi: 10.1007/978-0-387-70810-2_3.
  • 12.Zhao N., Wang W., Grover C.E., Jiang K., Pan Z., Guo B., et al. Genomic and GWAS analyses demonstrate phylogenomic relationships of Gossypium barbadense in China and selection for fibre length, lint percentage and Fusarium wilt resistance. Plant Biotechnol J. 2021;20:691–710. doi: 10.1111/pbi.13747.
  • 13.Wang P., Dong N., Wang M., Sun G., Jia Y., Geng X., et al. Introgression from Gossypium hirsutum is a driver for population divergence and genetic diversity in Gossypium barbadense. Plant J. 2022;110:764–780. doi: 10.1111/tpj.15702.
  • 14.Zhang X., Liu T., Wang J., Wang P., Qiu Y., Zhao W., et al. Pan-genome of Raphanus highlights genetic variation and introgression among domesticated, wild, and weedy radishes. Mol Plant. 2021;14:2032–2055. doi: 10.1016/j.molp.2021.08.005.
  • 15.Akpertey A., Singh R.J., Diers B.W., Graef G.L., Mian M.A.R., Shannon J.G., et al. Genetic introgression from glycine tomentella to soybean to increase seed yield. Crop Sci. 2017;28:89–111. doi: 10.2135/cropsci2017.07.0445.
  • 16.Taylor S.A., Larson E.L., Harrison R.G. Hybrid zones: windows on climate change. Trends Ecol Evol. 2015;30:398–406. doi: 10.1016/j.tree.2015.04.010.
  • 17.Todesco M., Owens G.L., Bercovich N., Légaré J.-S., Soudi S., Burge D.O., et al. Massive haplotypes underlie ecotypic differentiation in sunflowers. Nature. 2020;584:602–607. doi: 10.1038/s41586-020-2467-6.
  • 18.Nie X., Tu J., Wang B., Zhou X., Lin Z. A BIL population derived from G. hirsutum and G. barbadense provides a resource for cotton genetics and breeding. PLoS One. 2015;10:e0141064. doi: 10.1371/journal.pone.0141064.
  • 19.Shi Y., Li W., Li A., Ge R., Zhang B., Li J., et al. Constructing a high-density linkage map for Gossypium hirsutum X Gossypium barbadense and identifying. J Integr Agr. 2014;5:18. doi: 10.1111/jipb.12288.
  • 20.Si Z., Chen H., Zhu X., Cao Z., Zhang T. Genetic dissection of lint yield and fiber quality traits of G. hirsutum in G. barbadense background. Mol Breeding. 2017;37:9. doi: 10.1007/s11032-016-0607-3.
  • 21.Nie X., Wen T., Shao P., Tang B., Nuriman-Guli A., Yu Y., et al. High-density genetic variation maps reveal the correlation between asymmetric interspecific introgressions and improvement of agronomic traits in Upland and Pima cotton varieties developed in Xinjiang, China. Plant J. 2020;103:677–689. doi: 10.1111/tpj.14760.
  • 22.Fang L., Zhao T., Hu Y., Si Z., Zhu X., Han Z., et al. Divergent improvement of two cultivated allotetraploid cotton species. Plant Biotechnol J. 2021;19:1325–1336. doi: 10.1111/pbi.13547.
  • 23.Yu J., Hui Y., Chen J., Yu H., Gao X., Zhang Z., et al. Whole genome resequencing of 240 Gossypium barbadense accessions reveals genetic variation and genes associated with fiber strength and lint percentage. Theor Appl Genet. 2022;134:3249–3261. doi: 10.1007/s00122-021-03889-w.
  • 24.Tao Y., Zhao X., Mace E., Henry R., Jordan D. Exploring and exploiting pan-genomics for crop improvement. Mol Plant. 2019;12:156–169. doi: 10.1016/j.molp.2018.12.016. PubMed PMID: 30594655.
  • 25.Gibbs R.A., Belmont J.W., Hardenbol P., Willis T.D., Yu F., Yang H., et al. The International HapMap Project. Nature. 2003;426:789–796. doi: 10.1038/nature02168.
  • 26.Wang N., Li Y., Shen C., Yang Y., Wang H., Yao T., et al. High-resolution sequencing of nine elite upland cotton cultivars uncovers genic variations and breeding improvement targets. Plant J. 2023;113:145–159. doi: 10.1111/tpj.16041.
  • 27.Chen Z.J., Sreedasyam A., Ando A., Song Q., De Santiago L.M., Hulse-Kemp A.M., et al. Genomic diversifications of five Gossypium allopolyploid species and their impact on cotton improvement. Nat Genet. 2020;52:525–533. doi: 10.1038/s41588-020-0614-5.
  • 28.Wang M., Tu L., Yuan D., Zhu C., Shen J.L., et al. Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense. Nat Genet. 2019;51:224–229. doi: 10.1038/s41588-018-0282-x.
  • 29.Hu Y., Chen J., Fang L., Zhang Z., Ma W., Niu Y., et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton. Nat Genet. 2019;51:739–748. doi: 10.1038/s41588-019-0371-5.
  • 30.Paterson A.H., Brubaker C.L., Wendel J.F. A rapid method for extraction of cotton (Gossypium spp.) genomic DNA suitable for RFLP or PCR analysis. Plant Mol Biol Rep. 1993;11:122–127. doi: 10.1007/BF02670470.
  • 31.Cheng H., Concepcion G.T., Feng X., Zhang H., Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18:170–175. doi: 10.1038/s41592-020-01056-5.
  • 32.Wang M., Wang P., Lin M., Ye Z., Li G., Tu L., et al. Evolutionary dynamics of 3D genome architecture following polyploidization in cotton. Nat Plants. 2018;4:90–97. doi: 10.1038/s41477-017-0096-3.
  • 33.Durand N.C., Shamim M.S., Machol I., Rao S.S.P., Huntley M.H., Lander E.S., et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C Experiments. Cell Syst. 2016;3:95–98. doi: 10.1016/j.cels.2016.07.002.
  • 34.Dudchenko O., Batra S.S., Omer A.D., Nyquist S.K., Hoeger M., Durand N.C., et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 2017;356:92–95. doi: 10.1126/science.aal3327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–3100. doi: 10.1093/bioinformatics/bty191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Simao F.A., Waterhouse R.M., Ioannidis P., Kriventseva E.V., Zdobnov E.M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–3212. doi: 10.1093/bioinformatics/btv351. [DOI] [PubMed] [Google Scholar]
  • 38.Ou S., Jiang N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 2018;176:1410–1422. doi: 10.1104/pp.17.01310. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Xu Z., Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007;35:265–268. doi: 10.1093/nar/gkm286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Ou S., Chen J., Jiang N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 2018;46:e126. doi: 10.1093/nar/gky730. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Tarailo-Graovac M., Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009;25 doi: 10.1002/0471250953.bi0410s25. [DOI] [PubMed] [Google Scholar]
  • 42.Flynn J.M., Hubley R., Goubert C., Rosen J., Clark A.G., Feschotte C., et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci USA. 2020;117:9451–9457. doi: 10.1073/pnas.1921046117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Edgar R.C., Myers E.W. PILER: identification and classification of genomic repeats. Bioinformatics. 2005;21:152–158. doi: 10.1093/bioinformatics/bti1003. [DOI] [PubMed] [Google Scholar]
  • 44.Price A.L., Jones N.C., Pevzner P.A. De novo identification of repeat families in large genomes. Bioinformatics. 2005;21:351–358. doi: 10.1093/bioinformatics/bti1018. [DOI] [PubMed] [Google Scholar]
  • 45.Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580. doi: 10.1093/nar/27.2.573. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Lowe T.M., Eddy S.R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–964. doi: 10.1093/nar/25.5.955. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Griffiths-Jones S., Moxon S., Marshall M., Khanna A., Eddy S.R., Bateman A. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005;33:121–124. doi: 10.1093/nar/gki081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Nawrocki E., Kolbe D., Eddy S. Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009;25:1335–1337. doi: 10.1093/bioinformatics/btp157. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Stanke M., Schöffmann O., Morgenstern B., Waack S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinf. 2006;7:62–73. doi: 10.1186/1471-2105-7-62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Stanke M., Waack S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics. 2003;19:215–225. doi: 10.1093/bioinformatics/btg1080. [DOI] [PubMed] [Google Scholar]
  • 51.Salamov A.A., Solovyev V.V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 2000;10:516–522. doi: 10.1101/gr.10.4.516. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Majoros W.H., Pertea M., Salzberg S.L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004;20:2878–2879. doi: 10.1093/bioinformatics/bth315. [DOI] [PubMed] [Google Scholar]
  • 53.Trapnell C., Pachter L., Salzberg S.L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–1111. doi: 10.1093/bioinformatics/btp120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Trapnell C., Williams B.A., Pertea G., Mortazavi A., Kwan G., van Baren M.J., et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511–517. doi: 10.1038/nbt.1621. [DOI] [PMC free article] [PubMed]
  • 55.Yu J., Jung S., Humann J., Cheng C.-H., Lee T., Zheng P., et al. CottonGen: The community database for cotton genomics, genetics, and breeding research. Plants. 2021;10 doi: 10.3390/plants10122805. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Holt C., Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinf. 2011;12:491–505. doi: 10.1186/1471-2105-12-491. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Bairoch A., Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000;28:45–48. doi: 10.1093/nar/28.1.45. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J. BLAST+: architecture and applications. BMC Bioinf. 2009;10:1–9. doi: 10.1186/1471-2105-10-421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Mulder N., Apweiler R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol. 2007;396:59–70. doi: 10.1007/978-1-59745-515-2_5. [DOI] [PubMed] [Google Scholar]
  • 60.Kanehisa M., Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;27:29–34. doi: 10.1093/nar/27.1.29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Li L., Stoeckert C.J., Roos D.S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. doi: 10.1101/gr.1224503. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Chen S., Zhou Y., Chen Y., Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:884–890. doi: 10.1093/bioinformatics/bty560. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Freed D., Aldana R., Weber J.A., Edwards J.S. The Sentieon Genomics Tools - A fast and accurate solution to variant calling from next-generation sequence data. bioRxiv. 2017. doi: 10.1101/115717.
  • 64.Retief J. Phylogenetic analysis using PHYLIP. Methods Mol Biol. 2000;132:243–258. doi: 10.1385/1-59259-192-2:243. [DOI] [PubMed] [Google Scholar]
  • 65.Letunic I., Bork P. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021;49:293–296. doi: 10.1093/nar/gkab301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Alexander D.H., Novembre J., Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655–1664. doi: 10.1101/gr.094052.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Danecek P., Auton A., Abecasis G., Albers C.A., Banks E., DePristo M.A., et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Kurtz S., Phillippy A., Delcher A.L., Smoot M., Shumway M., Antonescu C., et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5 doi: 10.1186/gb-2004-5-2-r12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Sedlazeck F.J., Rescheneder P., Smolka M., Fang H., Nattestad M., von Haeseler A., et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15:461–468. doi: 10.1038/s41592-018-0001-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Alonge M., Wang X., Benoit M., Soyk S., Pereira L., Zhang L., et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell. 2020;182:1–17. doi: 10.1016/j.cell.2020.05.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Bradbury P.J., Zhang Z., Kroon D.E., Casstevens T.M., Ramdoss Y., Buckler E.S. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics. 2007;23:2633–2635. doi: 10.1093/bioinformatics/btm308. [DOI] [PubMed] [Google Scholar]
  • 73.Li H., Ralph P. Local PCA shows how the effect of population structure differs along the genome. Genetics. 2019;211:289–304. doi: 10.1534/genetics.118.301747. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Zheng X., Levine D., Shen J., Gogarten S.M., Laurie C., Weir B.S. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012;28:3326–3328. doi: 10.1093/bioinformatics/bts606. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Hartigan J.A., Wong M.A. Algorithm AS 136: a K-means clustering algorithm. J Roy Stat Soc: Ser C (Appl Stat). 1979;28:100–108. doi: 10.2307/2346830. [DOI] [Google Scholar]
  • 76.Kim D., Paggi J.M., Park C., Bennett C., Salzberg S.L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37:907–915. doi: 10.1038/s41587-019-0201-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Pertea M., Kim D., Pertea G.M., Leek J.T., Salzberg S.L. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc. 2016;11:1650–1667. doi: 10.1038/nprot.2016.095. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Li B., Ruotti V., Stewart R.M., Thomson J.A., Dewey C.N. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010;26:493–500. doi: 10.1093/bioinformatics/btp692. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Liu W., Xu L., Lin H., Cao J. Two expansin genes, AtEXPA4 and AtEXPB5, are redundantly required for pollen tube growth and AtEXPA4 is involved in primary root elongation in Arabidopsis thaliana. Genes. 2021;12:249. doi: 10.3390/genes12020249. [DOI] [PMC free article] [PubMed]
  • 80.Mortimer J.C., Miles G.P., Brown D.M., Zhang Z., Segura M.P., Weimar T., et al. Absence of branches from xylan in Arabidopsis gux mutants reveals potential for simplification of lignocellulosic biomass. Proc Natl Acad Sci USA. 2010;107:17409–17414. doi: 10.1073/pnas.1005456107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Link M., Rausch T., Greiner S. In Arabidopsis thaliana, the invertase inhibitors AtC/VIF1 and 2 exhibit distinct target enzyme specificities and expression profiles. FEBS Lett. 2004;573:105–109. doi: 10.1016/j.febslet.2004.07.062. [DOI] [PubMed] [Google Scholar]
  • 82.Song J., Guan Z., Hu J., Guo C., Yang Z., Wang S., et al. Eight high-quality genomes reveal pan-genome architecture and ecotype differentiation of Brassica napus. Nat Plants. 2020;6:34–45. doi: 10.1038/s41477-019-0577-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Li M., Tian S., Jin L., Zhou G., Li Y., Zhang Y., et al. Genomic analyses identify distinct patterns of selection in domesticated pigs and Tibetan wild boars. Nat Genet. 2013;45:1431–1438. doi: 10.1038/ng.2811. [DOI] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Supplementary data 1
mmc1.pdf (6.2MB, pdf)

Data Availability Statement

The sequencing data of perennial sea-island cottons were uploaded to NCBI (https://www.ncbi.nlm.nih.gov/) under study number PRJNA861722. The genome file of Gb_M210936 can be found at https://figshare.com/s/f4a3e8d9927817442dde. The other resequencing data used in this study can be found at NCBI under study numbers PRJNA720818, PRJNA613140 and PRJNA414461. The sequences of the reference genomes 3-79, Hai7124, Pima90 and AD5 were downloaded from CottonGen (https://www.cottongen.org/) [55].


Articles from Journal of Advanced Research are provided here courtesy of Elsevier
