G3: Genes | Genomes | Genetics. 2022 Dec 1;13(2):jkac308. doi: 10.1093/g3journal/jkac308

The Gossypium herbaceum L. Wagad genome as a resource for understanding cotton domestication

Thiruvarangan Ramaraj 1, Corrinne E Grover 2, Azalea C Mendoza 3, Mark A Arick II 4, Josef J Jareczek 5, Alexis G Leach 6, Daniel G Peterson 7, Jonathan F Wendel 8, Joshua A Udall 9,*,2
Editor: P Morrell
PMCID: PMC9911056  PMID: 36454094

Abstract

Gossypium herbaceum is a species of cotton native to Africa and Asia and is one of the 2 domesticated diploids. Together, G. herbaceum and its sister species G. arboreum are A-genome taxa that serve as models of the extinct A-genome donor of modern polyploid cottons, which provide about 95% of cotton grown worldwide. As part of a larger effort to characterize variation and improve resources among diverse diploid and polyploid cotton genomes, we sequenced and assembled the genome of G. herbaceum cultivar (cv.) Wagad, representing the first genome of a domesticated accession of this species. This chromosome-level genome was generated using a combination of PacBio long-read technology, Hi–C, and Bionano optical mapping and compared to existing genome sequences in cotton. We compare the genome of this cultivar to the existing genome of wild G. herbaceum subspecies africanum to elucidate changes in the G. herbaceum genome concomitant with domestication and extend these analyses to gene expression using available RNA-seq. Our results demonstrate the utility of the G. herbaceum cv. Wagad genome in understanding domestication in the diploid species, which could inform modern breeding programs.

Keywords: Gossypium herbaceum, Wagad, genome sequence, cotton

Introduction

The cotton genus (Gossypium) comprises the primary source of natural fiber worldwide. While the genus itself is composed of over 50 known species (Wendel and Grover 2015; Wang et al. 2018; Hu et al. 2021), only the fiber from 4 species is suitable for textile production. Remarkably, these 4 species were each independently domesticated, providing a naturally replicated experiment with which we can understand the underlying genetic changes that lead to phenotypic convergence. While much research has been centered around understanding domestication in the 2 dominant cultivated species, i.e. G. hirsutum and G. barbadense, considerably less is known about the domestication in the “short staple cottons” (Khadi et al. 2010), G. herbaceum and G. arboreum.

G. herbaceum and G. arboreum are diploid sister species collectively known as the “A-genome” cottons, which are native to Africa and the Arabian Peninsula. Cultivated on a relatively small scale where disease and/or stress prohibit the other domesticates (Kranthi 2018), these cottons are notable both for their independent domestication (Renny-Byfield et al. 2016), as well as their phylogenetic relationship to the ancestral maternal parent of domesticated polyploid cottons (Wendel and Grover 2015; Huang et al. 2020). Accordingly, recent efforts have been made to improve resources for the A-genome cottons (Du et al. 2018; Huang et al. 2020; Grover, Arick, et al. 2021) in an attempt to better understand their domestication and underlying genetics. While resources are comparatively abundant for the sister species, G. arboreum, no wild forms have been found for that species (Vollesen 1987; Wendel et al. 1989; Renny-Byfield et al. 2016). Native to the savannahs of Southern Africa (Vollesen 1987; Wendel et al. 1989; Khadi et al. 2010), several accessions of wild G. herbaceum (subspecies africanum) are available from germplasm resources, as are additional accessions reflecting various levels of domestication and/or improvement.

Recently, a high-quality genome of G. herbaceum subspecies africanum has become available (Huang et al. 2020), providing the opportunity to evaluate the consequences of domestication in diploid cotton on a genome-wide scale. Resequencing efforts have characterized divergence of G. herbaceum from its sister species (G. arboreum) as recent [<1 million years; (Huang et al. 2020; Grover, Arick, et al. 2021)]. In addition, diversity among G. herbaceum accessions is low (Wendel et al. 1989; Jena et al. 2011; Grover, Arick, et al. 2021) even though segregating polymorphisms [characterized by heterozygous, derived single nucleotide polymorphisms (SNPs)] number in the millions within species [1.7–4.1 M in each of the 21 accessions of G. herbaceum surveyed (Grover, Arick, et al. 2021)]. Here, we report a high-quality genome of an elite domesticated line, G. herbaceum cultivar (cv.) Wagad, and compare our results with the genome of wild G. herbaceum (subsp. africanum) and with the outgroup G. longicalyx. We present the G. herbaceum cv. Wagad (hereafter, A1-Wagad) genome in detail and compare this assembly to the previously released G. herbaceum subsp. africanum (hereafter, A1-africanum) assembly to provide insight into the changes accompanying domestication.

Methods and materials

Plant material and sequencing methods

PacBio sequencing

Approximately 5 g of G. herbaceum cv. Wagad (A1-Wagad) tissue was collected from a USDA, College Station, TX, greenhouse and shipped to the DNA Sequencing Center (DNASC) at Brigham Young University (BYU). DNA was extracted from the plant material using the Qiagen Genomic Tip kit (Qiagen, Hilden, Germany), and sequencing libraries were subsequently constructed and sequenced at the DNASC. DNA shearing of both libraries was done on a Megaruptor2 (∼20 kb; Diagenode Inc., Denville, NJ, USA). High molecular weight DNA was partitioned into 13 bins using the Sage Elf (Sage Science, Beverly, MA, USA), and the top 5 bins were run on a Fragment Analyzer (Agilent Technologies, Santa Clara, CA, USA) to select the appropriate bin size range (15–18 kb). Libraries were made using the SMRTbell Express Template Prep kit as recommended by Pacific Biosciences (PacBio; Menlo Park, CA, USA). Three PacBio cells were sequenced from 1 circular consensus sequence (CCS) library on the Pacific Biosciences Sequel system.

Illumina sequencing

Plant tissue of Wagad was grown at Brigham Young University Greenhouse. Young tissues were collected, and DNA was extracted through the CTAB method (Kidwell and Osborn 1992). In our use of this method, tissue was lyophilized, ground in liquid nitrogen to disrupt membranes, resuspended in buffer and incubated with a lysis solution at 60°C for 30 minutes, treated with RNase, treated with chloroform to separate DNA from insoluble particles, precipitated for removal of salts, and rehydrated in TE buffer. DNA was then shipped to the National Center for Genome Resources (NCGR, Santa Fe, NM, USA) for library preparation and sequencing on the Illumina HiSeq system.

Genome assembly

PacBio CCS reads were assembled using Hifiasm, a fast haplotype-resolved de novo genome assembler for PacBio HiFi sequence reads (Cheng et al. 2021). The contigs were aligned to previously assembled genomes of G. herbaceum and G. arboreum (Huang et al. 2020) using minimap2 (Li 2021) and visualized by dotPlotly (https://github.com/tpoorten/dotPlotly). Multiple contigs aligning to the same contig were manually scaffolded (or concatenated) to create the final chromosomes. The final chromosomes were labeled and again aligned to the previously published A-genome using minimap2 and visualized by dotPlotly.

Hi–C libraries were constructed by Phase Genomics (Seattle, WA, USA) from seedling tissue grown at Brigham Young University Greenhouses (a different tissue source than that sequenced by PacBio above). Short-read sequencing (Illumina, San Diego, CA, USA; 150PE) of the libraries was performed by Phase Genomics. The Hi–C data were mapped to the assembled genome sequence using bwa mem (with the -5SP option for Hi–C data; Li and Durbin 2009). The Hi–C interactions were used as evidence for contig proximity and in scaffolding contig sequences. Matlock (https://github.com/phasegenomics/matlock) was used in conjunction with the mapped reads to identify linkages between different genomic regions in the bam file. Juicebox (Robinson et al. 2018) was used to visualize the linkages along the pseudomolecules. All inversions were identified in alignments created by minimap2 (Li 2021).
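The Hi–C mapping step could be invoked roughly as sketched below. This is an illustration under stated assumptions, not the authors' script: the file names, the thread count, and the helper name `hic_mapping_command` are hypothetical, and only `bwa mem` with the `-5SP` option string comes from the text.

```python
import shlex


def hic_mapping_command(reference, reads_r1, reads_r2, threads=8):
    """Build the argument list for mapping paired-end Hi-C reads with bwa mem.

    The -5SP option string is the one quoted in the text. File names and the
    thread count are placeholders (assumptions), not values from the paper.
    """
    return ["bwa", "mem", "-5SP", "-t", str(threads), reference, reads_r1, reads_r2]


# Hypothetical input names, for illustration only.
cmd = hic_mapping_command("A1_Wagad.fasta", "hic_R1.fastq.gz", "hic_R2.fastq.gz")

# bwa writes SAM to standard output, so in practice the command would be
# piped into a tool that sorts and compresses the alignments.
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps file names with spaces safe if the list is later passed to `subprocess.run`.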

Repeat and gene annotation

A Gossypium-specific repeat database generated by RepeatModeler (v2.0.1) (Flynn et al. 2020) was used in conjunction with RepeatMasker (v4.1.1) (Smit et al. 2015a) to annotate and soft mask the repeats in the G. herbaceum genome. Existing RNA-seq datasets (Supplementary Table 1) were mapped to the genome using hisat2 (v2.1.0) (Kim et al. 2015). BRAKER2 (v2.1.5) (Hoff et al. 2019), using a combination of GeneMark-ES (v4.61) (Borodovsky and Lomsadze 2011), the soft masked genome, and the RNA-seq alignments, was used to train Augustus (v3.3.3) (Stanke et al. 2006) and SNAP (v2013-02-16) (Korf 2004). The Mikado (v2.0) (Venturini et al. 2018) pipeline was used to predict transcripts based on the RNA-seq alignments. Briefly, transcripts predicted by Trinity (v2.11.0) (Grabherr et al. 2011; mapped with GMAP v2020-10-14, Wu and Watanabe 2005), Cufflinks (v2.2.1) (Ghosh and Chan 2016), and Stringtie (v2.1.2) (Pertea et al. 2015) were combined into a nonredundant set and mapped to the UniProt Swiss-Prot database (v2020-05) using BLAST+ (v2.9.0) (Altschul et al. 1990). Additionally, Prodigal (v2.6.3) (Hyatt et al. 2010) was used to find ORFs in the transcripts, and Portcullis (v1.2.2) (Mapleson et al. 2018) was used to predict splice sites from the read alignments. Finally, Mikado used the splice site predictions, the predicted transcripts, and the protein alignments to identify gene loci and the representative transcript for each locus.

Maker2 (v2.31.10) (Holt and Yandell 2011; Campbell et al. 2014) was used to predict structural annotations from (1) the RepeatModeler/RepeatMasker annotations; (2) the SNAP, Augustus, GeneMark, and Mikado predictions; (3) the protein evidence from previously annotated G. hirsutum and G. raimondii genomes (Paterson et al. 2012; Chen et al. 2020) along with the UniProt Swiss-Prot database annotations (v2020-05); and (4) all available ESTs from NCBI for Gossypium [downloaded 2020-09-03, search filter: txid3633(Organism:exp) AND is_est(filter)] as alternative EST evidence. Benchmarking Universal Single-Copy Orthologs (BUSCO v4.1.2) (Waterhouse et al. 2017) was used with the eudicot (odb10) database against the predicted transcripts to find a satisfactory annotation edit distance (AED) filter (Holt and Yandell 2011). An AED filter of 0.37 was used to determine the final number of genes. The filtered annotations were functionally annotated using InterProScan (v5.47-82.0) (Jones et al. 2014) and BLAST+ with the UniProt Swiss-Prot database.
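The AED filtering step can be illustrated with a short sketch. It assumes that the annotation is in GFF3 with an `_AED=` key in the attribute column of mRNA features (the form in which MAKER usually reports AED) and that models at or below the threshold are retained; the text does not say whether the comparison was strict. The function names and the two records are hypothetical.

```python
import re

AED_CUTOFF = 0.37  # threshold reported in the text


def aed_of(attributes):
    """Return the _AED value from a GFF3 attribute string, or None if absent."""
    match = re.search(r"(?:^|;)_AED=([0-9.]+)", attributes)
    return float(match.group(1)) if match else None


def keep_mrna(gff_line, cutoff=AED_CUTOFF):
    """True for mRNA features whose AED is at or below the cutoff (assumed <=)."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "mRNA":
        return False
    aed = aed_of(fields[8])
    return aed is not None and aed <= cutoff


# Hypothetical records, for illustration only.
records = [
    "A1_01\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.12;_eAED=0.12",
    "A1_01\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=g2-RA;Parent=g2;_AED=0.55;_eAED=0.55",
]
kept = [r.split("\t")[8].split(";")[0] for r in records if keep_mrna(r)]
print(kept)
```

Lower AED values indicate closer agreement between a gene model and its supporting evidence, so raising the cutoff keeps more, but less well supported, models.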

Comparison between G. herbaceum cv. Wagad and G. herbaceum subsp. africanum

To observe syntenic conservation, whole-genome alignments were generated between Wagad and africanum using MUMmer (Marçais et al. 2018). Alignments were visualized using dotPlotly (https://github.com/tpoorten/dotPlotly) in R (version 3.6.0; R Development Core Team et al. 2011) to identify regions with major structural variations.

Illumina DNA reads were generated and/or downloaded for A1-Wagad (PRJNA421172) and several accessions of G. herbaceum subspecies africanum, including accession “Mutema” (Huang et al. 2020), A1-073 (Page et al. 2013), and A1-155 (Page et al. 2013). These were aligned to the outgroup reference genome G. longicalyx (Grover et al. 2020) using BWA (Li and Durbin 2009) with default arguments. SNPs and small insertion–deletion polymorphisms were identified using the Sentieon (Kendig et al. 2019) DNAseq workflow and filtered to remove sites with >90% missing data or more than 2 alleles.

The resulting SNPs among A1-africanum, A1-073, A1-155, and A1-Wagad were annotated with SnpEff (Cingolani, Platts, et al. 2012). Since we are interested in finding SNPs that were derived under domestication (i.e. unique to A1-Wagad relative both to wild G. herbaceum and the outgroup reference, G. longicalyx) and likely visible to selection during domestication, we filtered these SNPs using SnpSift (Cingolani, Patel, et al. 2012) to include only those that are homozygous variant (1/1) for A1-Wagad, homozygous reference (0/0) for the 3 wild accessions, and predicted to be nonsynonymous (i.e. “ANN[*].EFFECT” = “missense_variant”, “exon_loss_variant”, or “frameshift_variant”, etc.). That is, SnpSift was used to restrict the genotype for each wild accession to homozygous ancestral [e.g. GEN(africanum).GT=“0/0”] and the A1-Wagad genotype to homozygous derived [i.e. GEN(Wagad).GT=“1/1”]. Confidence in SNPs was increased via genotype-level filters; i.e. genotypes with genotype quality (GQ) <20 (Adelson et al. 2019) or read depth (DP) <10 were removed [GEN(*).GQ>20 & GEN(*).DP≥10].
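The filtering logic described above can be restated in plain Python. This is a sketch of the selection rule only, not a reimplementation of SnpSift: the record layout, the sample labels, and the function names are hypothetical, the set of effects lists only the 3 examples given in the text (which ends its own list with "etc."), and the thresholds follow the bracketed expression (GQ>20 and DP≥10).

```python
WILD = ("africanum", "A1-073", "A1-155")  # placeholder labels for the 3 wild accessions
NONSYNONYMOUS = {"missense_variant", "exon_loss_variant", "frameshift_variant"}


def passes_quality(call, min_gq=20, min_dp=10):
    """Genotype-level filter: GQ strictly above min_gq and DP at least min_dp."""
    return call["GQ"] > min_gq and call["DP"] >= min_dp


def is_domestication_candidate(site):
    """True if Wagad is 1/1, every wild accession is 0/0, and the effect is nonsynonymous."""
    calls = site["calls"]
    if not all(passes_quality(call) for call in calls.values()):
        return False
    if calls["Wagad"]["GT"] != "1/1":
        return False
    if any(calls[name]["GT"] != "0/0" for name in WILD):
        return False
    return site["effect"] in NONSYNONYMOUS


def make_site(wagad_gt, wild_gt, effect, gq=40, dp=25):
    """Build a toy site record; all values are invented for illustration."""
    calls = {name: {"GT": wild_gt, "GQ": gq, "DP": dp} for name in WILD}
    calls["Wagad"] = {"GT": wagad_gt, "GQ": gq, "DP": dp}
    return {"calls": calls, "effect": effect}


sites = [
    make_site("1/1", "0/0", "missense_variant"),    # kept
    make_site("0/1", "0/0", "missense_variant"),    # Wagad heterozygous
    make_site("1/1", "0/0", "synonymous_variant"),  # not nonsynonymous
    make_site("1/1", "0/0", "missense_variant", dp=5),  # low depth
]
print([is_domestication_candidate(s) for s in sites])
```

Because the reads were aligned to G. longicalyx, the reference allele stands in for the ancestral state, which is why 0/0 in the wild accessions and 1/1 in A1-Wagad marks a derived, Wagad-specific change.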

Because fiber is one of the primary agronomic traits of interest, we screened the remaining genes with Wagad-unique nonsynonymous SNPs by their associated GO, Interpro, and Panther functional annotation (Grover et al. 2020) for those that have known functions in fiber. Any genes having homology to existing transposable elements were filtered out. To further identify and characterize changes under domestication, we used existing RNA-seq data from A1-Wagad and several G. herbaceum subsp. africanum accessions (SRA pending) to assess differential gene expression at 10 and 20 days post anthesis (DPA). RNA-seq reads for each DPA/accession were processed via Kallisto v0.46.1 (Bray et al. 2016; i.e. kallisto quant) to identify and quantify transcripts using the A1-Wagad annotations generated here. Differential gene expression analyses were conducted in R/4.0.2 (R Core Team 2020) using DESeq2 (Love et al. 2014) for each DPA separately. For wild G. herbaceum, we used expression from multiple G. herbaceum subsp. africanum accessions as pseudoreplicates to represent expression diversity; for A1-Wagad, 3 replicates were used for each of the 2 timepoints. Genes with a Benjamini–Hochberg (Benjamini and Hochberg 1995) adjusted P-value <0.05 (as implemented in DESeq2) were considered differentially expressed.
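For readers unfamiliar with the Benjamini–Hochberg adjustment, the procedure can be sketched in a few lines of Python. This is a generic implementation of the published step-up procedure, not the DESeq2 code; DESeq2 also applies independent filtering before adjustment by default, so its reported values can differ. The function name and the P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity and a cap of 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P-values for 4 genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
print(adj, significant)
```

Each raw P-value is scaled by the number of tests divided by its rank, and the backward pass ensures that a gene with a smaller raw P-value never receives a larger adjusted value than a gene ranked after it.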

Results and discussion

Genome assembly and annotation

Pacific Biosciences Sequel sequencing yielded ∼4.4 million (M) CCS reads with a mean read length of ∼15 kb, generating over 68.4 billion (B) bases, or approximately 40× coverage of the 1.7 gigabase (Gb) genome (Hendrix and Stewart 2005). The final assembly consisted of 13 sequences, representing the 13 chromosomes and totaling ≈1.6 Gb (Table 1). This assembled genome size represents >94% of the estimated genome size, similar to other recent reports from G. herbaceum (Huang et al. 2020). The genome assembly was validated by Hi–C and alignment to other recent genome assemblies. The Hi–C results showed discrete colocalization along the scaffolds for cross-linked pairs of Illumina reads, with no evidence of translocations (Fig. 1).
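The read-length and coverage figures quoted above follow from simple arithmetic, sketched here with the rounded values from the text. The variable names are illustrative.

```python
total_bases = 68.4e9      # bases of CCS sequence reported in the text
read_count = 4.4e6        # approximate number of CCS reads
estimated_genome = 1.7e9  # estimated genome size in bases

# Mean read length is total bases divided by the number of reads.
mean_read_length = total_bases / read_count

# Coverage is total bases divided by the estimated genome size.
coverage = total_bases / estimated_genome

print(f"mean read length: {mean_read_length / 1e3:.1f} kb")  # about 15.5 kb
print(f"coverage: {coverage:.1f}x")                          # about 40x
```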

Table 1.

The assembled genomes of A1-Wagad, A1-africanum, and G. longicalyx.

Metric | G. herbaceum cv. Wagad | G. herbaceum subsp. africanum | G. longicalyx (outgroup)
Contigs (a) | 21 | 1,782 | 97
Max contig (bp) | 141,411,576 | 9,975,897 | 49,804,510
Mean contig (bp) | 75,610,818 | 873,166 | 12,266,987
Contig N50 (bp) | 122,262,091 | 1,914,893 | 29,420,374
Total contig length (bp) | 1,587,827,183 | 1,555,981,907 | 1,189,897,770
Assembly GC % | 35.1 | 35.0 | 34.2
Scaffolds (b) | 13 | 732 | 13
Max scaffold (bp) | 142,105,803 | 132,679,160 | 110,231,091
Mean scaffold (bp) | 122,140,578 | 2,125,802 | 91,531,230
Scaffold N50 (bp) | 131,646,256 | 117,878,823 | 95,878,146
Total scaffold length (bp) | 1,587,827,514 | 1,556,086,907 | 1,189,905,995
Number of genes | 39,100 | 43,952 | 38,378

(a) Contig metrics reported are from the raw output file of hifiasm.

(b) Scaffold metrics reported are after manual scaffolding.
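The N50 statistic in Table 1 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of the calculation follows; the function name and the toy lengths are hypothetical, and the per-contig lengths of the assemblies are not reproduced here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Add sequences from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one sequence length")


# Toy example: total is 30, and the two longest sequences (10 + 8) pass 15.
toy_lengths = [5, 10, 3, 8, 4]
print(n50(toy_lengths))  # 8
```

Because N50 weights long sequences heavily, an assembly built from a few chromosome-scale contigs, as here, has a contig N50 close to its scaffold N50.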

Fig. 1.


Hi–C interaction plot for the A1-Wagad genome. The x- and y-axes both depict a linear representation of the genome from top left (chromosome 1) to bottom right (chromosome 13). The blue boxes on the diagonal represent the chromosomes. The color intensity represents the number of Hi–C interactions along the chromosome and illustrates that most chromatin interactions are with other nucleotides from the same chromosome.

We used BUSCO analysis of the genome (Waterhouse et al. 2017) to assess the completeness of the assembly, which recovered 96.5% complete BUSCOs (from the total of 2,121 BUSCO groups searched; Table 2). Most BUSCOs (88%) were both complete and single copy, with only 8.5% of BUSCOs complete and duplicated. Around 3.5% of BUSCOs were either fragmented (1.2%) or missing (2.3%), indicating a general completeness of the genome. This BUSCO recovery is similar to that reported in recent publications of other diploid cotton genomes (Huang et al. 2020; Grover et al. 2020, 2021a).

Table 2.

BUSCO metrics in the assembled genomes of A1-Wagad, A1-africanum, and G. longicalyx.

BUSCO metric | G. herbaceum cv. Wagad, % | G. herbaceum subsp. africanum, % | G. longicalyx, %
Complete BUSCOs | 96.5 | 99.0 | 97.9
Complete and single-copy BUSCOs (S) | 88.0 | 94.1 | 94.6
Complete and duplicated BUSCOs (D) | 8.5 | 4.9 | 3.3
Fragmented BUSCOs (F) | 1.2 | 0.2 | 0.7
Missing BUSCOs (M) | 2.3 | 0.8 | 1.4
Total BUSCO groups searched | 2,121 | 2,121 | 2,121
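The BUSCO categories in Table 2 are related by two identities: complete equals single-copy plus duplicated, and complete, fragmented, and missing together account for all groups searched. The sketch below checks the table against both; the dictionary layout and function name are illustrative, and the tolerance allows for rounding to one decimal place.

```python
# Percentages copied from Table 2 (C = complete, S = single copy,
# D = duplicated, F = fragmented, M = missing).
busco = {
    "A1-Wagad":     {"C": 96.5, "S": 88.0, "D": 8.5, "F": 1.2, "M": 2.3},
    "A1-africanum": {"C": 99.0, "S": 94.1, "D": 4.9, "F": 0.2, "M": 0.8},
    "G. longicalyx": {"C": 97.9, "S": 94.6, "D": 3.3, "F": 0.7, "M": 1.4},
}


def is_consistent(row, tolerance=0.05):
    """Check S + D = C and C + F + M = 100 within a rounding tolerance."""
    complete_ok = abs(row["S"] + row["D"] - row["C"]) <= tolerance
    total_ok = abs(row["C"] + row["F"] + row["M"] - 100.0) <= tolerance
    return complete_ok and total_ok


results = {name: is_consistent(row) for name, row in busco.items()}
print(results)
```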

Annotation of the genome produced 39,100 unique genes, similar to the previously published cotton diploids (Paterson et al. 2012; Li et al. 2014; Du et al. 2018; Udall et al. 2019; Grover et al. 2020, 2021a, 2021b; Huang et al. 2020). The annotation of A1-Wagad contained ∼5,000 fewer unique genes than the published annotation for A1-africanum, possibly due to annotation methods and/or filtering. While also fewer than previously reported for the sister taxon G. arboreum (40,134–43,278; Li et al. 2014; Du et al. 2018; Huang et al. 2020), this number of annotated genes is similar to the outgroup species, G. longicalyx (38,378; Grover et al. 2020). These discrepancies among annotations suggest there may be a future need for uniformity among annotation methods to further facilitate comparison among genomes.

Repeat content was predicted using a combination of RepeatMasker (Smit et al. 2015b), “One code to find them all” (Bailly-Bechet et al. 2014), and RepeatExplorer (Novák et al. 2013, 2020). Approximately 59.2% of the A1-Wagad genome was estimated to be repetitive, comprising 58% retroelements and 1.2% DNA transposons. As expected, most retroelements were annotated as LTR elements (57.3% of the genome), of which “Gypsy/DIRS1” accounted for 39.9% and “Ty1/Copia” for 10.2% of the genome; a further 23.3% of the genome was unclassified by the analysis (Table 3). The repeat analysis results are comparable to those of other closely related genomes, including A1-africanum, G. arboreum, G. longicalyx, and the A-genome of the polyploid (Table 4).

Table 3.

Repetitive element annotations of G. herbaceum cv. Wagad.

Family | Number of elements | Length (Mb) | Percentage of sequence (%)
DNA (ALL) | 21,870 | 19 | 1.22
DNA/Harbinger | 4,086 | 2.5 | 0.16
DNA/hAT | 1,829 | 1.3 | 0.08
LTR (ALL) | 1,011,698 | 921 | 58.01
LTR | 1,001,971 | 909 | 57.31
LTR/Copia | 281,785 | 161 | 10.18
LTR/Gypsy | 519,278 | 633 | 39.87
EVERYTHING_TE | | 1,310 | 82.51
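The percentage column of Table 3 can be recovered from the length column and the total scaffold length in Table 1. The sketch below assumes that percentages were computed against that total (1,587,827,514 bp); small differences arise because the lengths are rounded to whole megabases. Variable names are illustrative.

```python
ASSEMBLY_MB = 1_587_827_514 / 1e6  # total scaffold length from Table 1, in Mb

# Family: (length in Mb, percentage reported in Table 3)
table3 = {
    "DNA (ALL)": (19, 1.22),
    "LTR (ALL)": (921, 58.01),
    "LTR/Copia": (161, 10.18),
    "LTR/Gypsy": (633, 39.87),
    "EVERYTHING_TE": (1310, 82.51),
}

# Recompute each percentage from its length and compare with the reported value.
recomputed = {
    family: 100.0 * length_mb / ASSEMBLY_MB
    for family, (length_mb, _) in table3.items()
}
for family, (_, reported) in table3.items():
    print(f"{family}: recomputed {recomputed[family]:.2f}%, reported {reported:.2f}%")
```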

Table 4.

Repeat element content in A1-Wagad vs A1-africanum and G. longicalyx.

Metric | G. longicalyx | G. herbaceum cv. Wagad | G. herbaceum subsp. africanum (a)
Genome size (Mb) | 1,311 | 1,587 | 1,667
LTR/Gypsy (Ty3) (Mb) | 513 | 633 | 684
LTR/Copia (Ty1) (Mb) | 47 | 161 | 46
DNA (all element types) (Mb) | 14 | 19 | 14
Total repetitive (Mb) | 575 | 618 | 745
% genome is repeat | 44% | 58% | 45%
% genome is gypsy | 39% | 40% | Not listed
% repeat is gypsy | 89% | 83% | Not listed

(a) TE annotations are from Huang et al. (2020).

Comparison between cultivated A1-Wagad and wild A1-africanum

Domestication is an important evolutionary process whereby human-mediated phenotypic changes result in myriad (often anonymous) corresponding molecular changes. Understanding the molecular basis for improved phenotypes is a fundamental goal of many modern breeding programs, which may use a combination of traditional and cutting-edge breeding techniques to accelerate crop improvement. Comparison between wild and domesticated germplasm can improve our understanding of the molecular basis underlying the phenotypic superiority of modern crops, as well as facilitate our understanding of how wild germplasm can be used to improve modern cultivars (e.g. by conferring disease resistance).

Whole genome alignment of the A1-Wagad sequence to the recent assembly of A1-africanum revealed high co-linearity with a few exceptions, including major inversions (≈6 Mb) on A1-01, A1-02, and A1-06 and 2 inversions on A1-13 (≈43 Mb; Fig. 2 and Supplementary Figs. 1–6). When these 2 genomes were compared to the genome of their sister species G. arboreum (Huang et al. 2020; Wang et al. 2021), we found many shared inversions between the G. herbaceum genome sequences (A1-africanum and A1-Wagad) and the G. arboreum genome sequence. The shared inversions and high degree of co-linearity suggest that the independent G. herbaceum genome sequences were correctly assembled. In 1 case, however, we found a unique inversion in the A1-Wagad sequence on A1-06 (102,159,887–110,192,639), possibly indicating a structural change acquired during or after domestication (Supplementary Figs. 1–6).
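One way to shortlist candidate inversions from a whole-genome alignment is to scan for long alignment blocks on the minus strand. The sketch below assumes alignments in minimap2's PAF format (column 5 is the relative strand, columns 6, 8, and 9 are the target name, start, and end) and is not the authors' procedure, which relied on inspecting alignments and dot plots. The function name, the 1 Mb threshold, and the 2 records are hypothetical.

```python
def inverted_blocks(paf_lines, min_length=1_000_000):
    """Return (target, start, end) for minus-strand blocks of at least min_length bp."""
    hits = []
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # PAF has 12 mandatory columns
            continue
        strand = fields[4]
        target, t_start, t_end = fields[5], int(fields[7]), int(fields[8])
        if strand == "-" and t_end - t_start >= min_length:
            hits.append((target, t_start, t_end))
    return hits


# Invented PAF records: one collinear block and one 8 Mb reverse-strand block.
example = [
    "qA\t120000000\t0\t50000000\t+\ttA\t118000000\t0\t50000000\t49000000\t50000000\t60",
    "qA\t120000000\t50000000\t58000000\t-\ttA\t118000000\t50000000\t58000000\t7800000\t8000000\t60",
]
print(inverted_blocks(example))
```

A minus-strand block is only a candidate: a chromosome assembled in the opposite orientation in one genome yields minus-strand alignments along its whole length, so hits still need to be checked against the dot plot and against G. arboreum, as in the comparison described above.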

Fig. 2.


a) Genome alignment of the 13 chromosomes in the G. herbaceum var. africanum (y-axis) and the G. herbaceum cv. Wagad (x-axis) genomes. The shade of blue indicates the degree of sequence match. The orange circles indicate major and minor inversions between the 2 genome assemblies on chromosomes A01, A02, A06, and A13. b) Alignment of the chromosomes containing inversions (A01, A02, A06, and A13) between the 2 A1 genomes. A13 had 3 inversions. The major inversion on A13 was so large that a portion was colored as a translocation, instead of an inversion by plotsr (Goel et al. 2022).

At the nucleotide level, we identified 220,191 fixed SNPs between A1-Wagad and the 3 A1-africanum accessions. As expected, 84% of the SNPs were intergenic, and only 17% were within gene regions (including UTR). Because amino acid changes may be visible to selection (i.e. domestication), we focused on derived nonsynonymous mutations in A1-Wagad; that is, sites where A1-Wagad was homozygous for a derived SNP and all 3 A1-africanum accessions shared the ancestral state with G. longicalyx. This produced a set of 4,719 nonsynonymous SNPs that occurred in 2,544 genes. Notably, 26 of these genes had >10 missense mutations (accounting for 433 SNPs), likely indicating a major disruption to the gene, such as a frameshift mutation and/or pseudogenization. Many of these genes also had other variant annotations (e.g. UTR, start/stop codon gain/loss, splice site modifications), while an additional 779 genes had at least one of these other potentially functionally relevant annotations without also having a missense mutation. Overall, the number of genes with only a single missense mutation was 524, or ∼25% of the total, indicating that putative differences between wild and domesticated G. herbaceum, while affecting <10% of the genes, tend to involve multiple modifications. A full list of genes, their inferred variant effects, their G. herbaceum homologs, and their functional annotations is provided in Supplementary Table 3.

We also leveraged existing RNA-seq data to infer transcriptomic changes between wild and domesticated G. herbaceum fiber at 2 key timepoints, i.e. 10 and 20 DPA, that are representative of primary and secondary wall synthesis, respectively. Here, RNA-seq reads were mapped against the gene annotations of the newly developed A1-Wagad sequence to infer differential expression (Table DE; Supplementary Table 1). At 10 DPA, 6,034 genes were differentially expressed (3,175 down, 2,859 up), including 990 genes that were also on the list of genes with annotated SNP variants (Supplementary Table 2). Functional annotations of these genes include some with possible relevance to fiber, such as 3 upregulated tubulin-related proteins (A1_06G097900, A1_05G253000, A1_10G076200) and several upregulated kinesin-like proteins (Supplementary Table 2). Both classes are important for fiber shape, as tubulins make up the shape-conferring microtubules, while kinesins move materials along them (Preuss et al. 2004; Graham and Haigler 2021). Likewise, at 20 DPA, 6,400 genes exhibited differential expression (3,078 down, 3,322 up), including 1,040 genes exhibiting potentially consequential SNP variants. Among these are several downregulated NAC domain-containing proteins [NAM (no apical meristem, Petunia), ATAF1–2 (Arabidopsis thaliana activating factor), and CUC2 (cup-shaped cotyledon, Arabidopsis)], which regulate secondary cell wall deposition (Zhong et al. 2006; Zhang et al. 2018), and 4 probable WRKY transcription factors [named for the almost invariant WRKY amino acid sequence at the N-terminus of a domain that is about 60 residues in length], most of which are also downregulated, with the exception of A1_09G185500, which is upregulated 2.7-fold in A1-Wagad.

There were 1,979 genes differentially expressed at both timepoints, most of which exhibited either upregulation (736, or 37%) or downregulation (647, or 33%) at both timepoints. For those genes that exhibited differences in relative expression level across timepoints, 445 were downregulated at 10 DPA but upregulated at 20 DPA, and only 151 genes exhibited the inverse (i.e. upregulated in A1-Wagad at 10 DPA, but downregulated at 20 DPA). A similar pattern was observed for the 355 genes that also contained an annotated variant (Supplementary Table 2). These results provide an overview of differential expression in wild vs domesticated diploid cotton fiber using the newly generated A1-Wagad genome.
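The counts reported above are internally consistent, which can be confirmed with a short tally. The sketch uses only numbers from the text; the dictionary keys and variable names are illustrative labels.

```python
# Differentially expressed genes per timepoint: (down, up, reported total).
per_timepoint = {
    "10 DPA": (3175, 2859, 6034),
    "20 DPA": (3078, 3322, 6400),
}

# Genes differentially expressed at both timepoints, split by pattern.
shared_total = 1979
patterns = {
    "up at both": 736,
    "down at both": 647,
    "down at 10 DPA, up at 20 DPA": 445,
    "up at 10 DPA, down at 20 DPA": 151,
}

totals_match = all(down + up == total for down, up, total in per_timepoint.values())
patterns_match = sum(patterns.values()) == shared_total

# Share of the shared genes in each pattern, rounded to whole percentages.
shares = {name: round(100 * count / shared_total) for name, count in patterns.items()}
print(totals_match, patterns_match, shares)
```

The 4 patterns account for all 1,979 shared genes, so roughly 70% of them change in the same direction at both timepoints and about 30% switch direction.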

Conclusion

Cotton is an important fiber crop that has been independently domesticated multiple times. While the cultivated tetraploid species have extensive genome resources, cultivated diploid cotton species have received far less attention. Here, we report a genome sequence for a cultivated form of Gossypium herbaceum (cv. Wagad) that complements the existing genome assemblies and diversity studies (Huang et al. 2020). Our data provide a foundation for understanding the genome structure and genetic diversity within G. herbaceum thereby allowing additional perspectives on diploid cotton species.

Supplementary Material

jkac308_Supplementary_Data

Acknowledgments

This work was made possible by the USDA Supercomputer Atlas funded through the SCINet initiative. We thank the Iowa State University Research IT unit for computational resources and support.

Contributor Information

Thiruvarangan Ramaraj, School of Computing, Jarvis College of Computing and Digital Media, DePaul University, Chicago, IL 60605, USA.

Corrinne E Grover, Ecology, Evolution, and Organismal Biology Department, Iowa State University, Ames, IA 50011, USA.

Azalea C Mendoza, School of Computing, Jarvis College of Computing and Digital Media, DePaul University, Chicago, IL 60605, USA.

Mark A Arick, II, Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Mississippi State, MS 39762, USA.

Josef J Jareczek, Ecology, Evolution, and Organismal Biology Department, Iowa State University, Ames, IA 50011, USA.

Alexis G Leach, Ecology, Evolution, and Organismal Biology Department, Iowa State University, Ames, IA 50011, USA.

Daniel G Peterson, Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University, Mississippi State, MS 39762, USA.

Jonathan F Wendel, Ecology, Evolution, and Organismal Biology Department, Iowa State University, Ames, IA 50011, USA.

Joshua A Udall, Crop Germplasm Research Unit, USDA/Agricultural Research Service, College Station, TX 77845, USA.

Data availability

The assembled genome sequence of G. herbaceum cv. Wagad is available at NCBI under BioProject: PRJNA614591 (Accession number, JAOTOV000000000) and at CottonGen (https://www.cottongen.org/). PacBio CCS sequence reads for G. herbaceum cv. Wagad are also available at NCBI under BioProject: PRJNA421172 (SRA: SRP251159), and RNA-seq reads are available under BioProject: PRJNA595350 (SRA: SRP237305).

Supplemental material available at G3 online.

Funding

We thank the National Science Foundation (NSF #1339412) and USDA-ARS (58-6066-0-066, Genomics of Malvaceae) for their financial support.

 

Communicating editor: P. Morrell

Literature cited

  1. R Development Core Team, et al. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing;2011. [Google Scholar]
  2. R Core Team . R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing;2020. [Google Scholar]
  3. Adelson RP, Renton AE, Li W, Barzilai N, Atzmon G, Goate AM, Davies P, Freudenberg-Hua Y. Empirical design of a variant quality control pipeline for whole genome sequencing data using replicate discordance. Sci Rep. 2019;9(1):16156. doi: 10.1038/s41598-019-52614-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
  5. Bailly-Bechet M, Haudry A, Lerat E. “One code to find them all”: a perl tool to conveniently parse RepeatMasker output files. Mob DNA. 2014;5(1):13. doi: 10.1186/1759-8753-5-13. [DOI] [Google Scholar]
  6. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Stat Methodol. 1995;57(1):289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x. [DOI] [Google Scholar]
  7. Borodovsky M, Lomsadze A. Eukaryotic gene prediction using GeneMark.hmm-E and GeneMark-ES. Curr Protoc Bioinformatics. 2011;Chapter 4:Unit 4.6.1–Unit 4.6.10. doi: 10.1002/0471250953.bi0406s35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Bray NL, Pimentel H, Melsted P, Pachter L. Erratum: near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(8):888. doi: 10.1038/nbt0816-888d. [DOI] [PubMed] [Google Scholar]
  9. Campbell MS, Holt C, Moore B, Yandell M. Genome annotation and curation using MAKER and MAKER-P. Curr Protoc Bioinformatics. 2014;48(1):4.11.1–4.11.39. doi: 10.1002/0471250953.bi0411s48. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Chen ZJ, Sreedasyam A, Ando A, Song Q, De Santiago L, Hulse-Kemp A, Ding M. Genomic diversifications of five Gossypium allopolyploid species and their impact on cotton improvement. Nat Genet. 2020;52(5):525–533. doi: 10.1038/s41588-020-0614-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18(2):170–175. doi: 10.1038/s41592-020-01056-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Cingolani P, Patel VM, Coon M, Nguyen T, Land SJ, Ruden DM, Lu X. Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Front Genet. 2012;3:35. doi: 10.3389/fgene.2012.00035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80–92. doi: 10.4161/fly.19695. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Du X, Huang G, He S, Yang Z, Sun G, Ma X, Li N, Zhang X, Sun J, Liu M, et al. Resequencing of 243 diploid cotton accessions based on an updated A genome identifies the genetic basis of key agronomic traits. Nat Genet. 2018;50(6):796–802. doi: 10.1038/s41588-018-0116-x. [DOI] [PubMed] [Google Scholar]
  15. Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A. 2020;117(17):9451–9457. doi: 10.1073/pnas.1921046117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Ghosh S, Chan C-KK. Analysis of RNA-seq data using TopHat and Cufflinks. Methods Mol Biol. 2016;1374:339–361. doi: 10.1007/978-1-4939-3167-5_18. [DOI] [PubMed] [Google Scholar]
  17. Goel M, Schneeberger K, Robinson P. plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics. 2022;38(10):2922–2926. doi: 10.1093/bioinformatics/btac196 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–652. doi: 10.1038/nbt.1883. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Graham BP, Haigler CH. Microtubules exert early, partial, and variable control of cotton fiber diameter. Planta. 2021;253(2):e56315. doi: 10.1007/s00425-020-03557-1 [DOI] [PubMed] [Google Scholar]
  20. Grover CE, Arick MA, Thrash A, Sharbrough J, Hu G, et al. Dual domestication, diversity, and differential introgression in old world cotton diploids. bioRxiv. 2021. doi: 10.1101/2021.10.20.465142. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Grover CE, Pan M, Yuan D, Arick MA, Hu G, Brase L, Stelly DM, Lu Z, Schmitz RJ, Peterson DG, et al. The Gossypium longicalyx genome as a resource for cotton breeding and evolution. G3 (Bethesda). 2020;10(5):1457–1467. doi: 10.1534/g3.120.401050. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Grover CE, Yuan D, Arick MA, Miller ER, Hu G, Peterson DG, Wendel JF, Udall JA. The Gossypium anomalum genome as a resource for cotton improvement and evolutionary analysis of hybrid incompatibility. G3 (Bethesda). 2021a;11(11):jkab319. doi: 10.1093/g3journal/jkab319. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Grover CE, Yuan D, Arick MA, Miller ER, Hu G, Peterson DG, Wendel JF, Udall JA. The Gossypium stocksii genome as a novel resource for cotton improvement. G3 (Bethesda). 2021b;11(7):jkab125. doi: 10.1093/g3journal/jkab125. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Hendrix B, Stewart JM. Estimation of the nuclear DNA content of Gossypium species. Ann Bot. 2005;95(5):789–797. doi: 10.1093/aob/mci078. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Hoff KJ, Lomsadze A, Borodovsky M, Stanke M. Whole-genome annotation with BRAKER. Methods Mol Biol. 2019;1962:65–95. doi: 10.1007/978-1-4939-9173-0_5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 2011;12(1):491. doi: 10.1186/1471-2105-12-491. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Hu G, Grover CE, Yuan D, Dong Y, Miller E, Conover JL, Wendel JF. Evolution and diversity of the cotton genome. In: Rahman M-U, Zafar Y, Zhang T, editors. Cotton Precision Breeding. New York (NY): Springer; 2021. p. 25–78. [Google Scholar]
  28. Huang G, Wu Z, Percy RG, Bai M, Li Y, Frelichowski JE, Hu J, Wang K, Yu JZ, Zhu Y. Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nat Genet. 2020;52(5):516–524. doi: 10.1038/s41588-020-0607-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11(1):119. doi: 10.1186/1471-2105-11-119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Jena SN, Srivastava A, Singh UM, Roy S, Banerjee N, Mohan Rai K, Singh SK, Kumar V, Chaudhary LB, Roy JK, et al. Analysis of genetic diversity, population structure and linkage disequilibrium in elite cotton (Gossypium L.) germplasm in India. Crop Pasture Sci. 2011;62(10):859–875. doi: 10.1071/CP11161. [DOI] [Google Scholar]
  31. Jones P, Binns D, Chang H-Y, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30(9):1236–1240. doi: 10.1093/bioinformatics/btu031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Kendig KI, Baheti S, Bockol MA, Drucker TM, Hart SN, Heldenbrand JR, Hernaez M, Hudson ME, Kalmbach MT, Klee EW, et al. Sentieon DNASeq variant calling workflow demonstrates strong computational performance and accuracy. Front Genet. 2019;10:736. doi: 10.3389/fgene.2019.00736. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Khadi BM, Santhy V, Yadav MS. Cotton: An Introduction. In: Cotton: Biotechnological Advances. Berlin (Germany): Springer; 2010. p. 1–14. [Google Scholar]
  34. Kidwell KK, Osborn TC. Simple plant DNA isolation procedures. In: Beckmann JS, Osborn TC, editors. Plant Genomes: Methods for Genetic and Physical Mapping. Dordrecht (Netherlands): Springer; 1992. p. 1–13. [Google Scholar]
  35. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357–360. doi: 10.1038/nmeth.3317. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Korf I. Gene finding in novel genomes. BMC Bioinformatics. 2004;5(1):59. doi: 10.1186/1471-2105-5-59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Kranthi KR. Cotton production practices: snippets from global data 2017. ICAC Record. 2018;XXXVI:4–14. [Google Scholar]
  38. Li H. New strategies to improve minimap2 alignment accuracy. Bioinformatics. 2021;37(23):4572–4574. doi: 10.1093/bioinformatics/btab705 [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Li F, Fan G, Wang K, Sun F, Yuan Y, Song G, Li Q, Ma Z, Lu C, Zou C, et al. Genome sequence of the cultivated cotton Gossypium arboreum. Nat Genet. 2014;46(6):567–572. doi: 10.1038/ng.2987. [DOI] [PubMed] [Google Scholar]
  41. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Mapleson D, Venturini L, Kaithakottil G, Swarbreck D. Efficient and accurate detection of splice junctions from RNA-seq with Portcullis. Gigascience. 2018;7(12):giy131. doi: 10.1093/gigascience/giy131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. MUMmer4: a fast and versatile genome alignment system. PLoS Comput Biol. 2018;14(1):e1005944. doi: 10.1371/journal.pcbi.1005944. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Novák P, Neumann P, Macas J. Global analysis of repetitive DNA from unassembled sequence reads using RepeatExplorer2. Nat Protoc. 2020;15(11):3745–3776. doi: 10.1038/s41596-020-0400-y. [DOI] [PubMed] [Google Scholar]
  45. Novák P, Neumann P, Pech J, Steinhaisl J, Macas J. RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next-generation sequence reads. Bioinformatics. 2013;29(6):792–793. doi: 10.1093/bioinformatics/btt054. [DOI] [PubMed] [Google Scholar]
  46. Page JT, Gingle AR, Udall JA. PolyCat: a resource for genome categorization of sequencing reads from allopolyploid organisms. G3 (Bethesda). 2013;3(3):517–525. doi: 10.1534/g3.112.005298 [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Paterson AH, Wendel JF, Gundlach H, Guo H, Jenkins J, Jin D, Llewellyn D, Showmaker KC, Shu S, Udall J, et al. Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres. Nature. 2012;492(7429):423–427. doi: 10.1038/nature11798. [DOI] [PubMed] [Google Scholar]
  48. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–295. doi: 10.1038/nbt.3122. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Preuss ML, Kovar DR, Lee Y-RJ, Staiger CJ, Delmer DP, Liu B. A plant-specific kinesin binds to actin microfilaments and interacts with cortical microtubules in cotton fibers. Plant Physiol. 2004;136(4):3945–3955. doi: 10.1104/pp.104.052340 [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Renny-Byfield S, Page JT, Udall JA, Sanders WS, Peterson DG, Arick MA, Grover CE, Wendel JF. Independent domestication of two Old World cotton species. Genome Biol Evol. 2016;8(6):1940–1947. doi: 10.1093/gbe/evw129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Robinson J, Turner D, Durand N, Thorvaldsdóttir H, Mesirov J, Aiden E. Juicebox.js provides a cloud-based visualization system for Hi-C data. Cell Syst. 2018;6(2). doi: 10.1016/j.cels.2018.01.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Smit AFA, Hubley R, Green P. RepeatMasker Open-4.0. 2013–2015. 2015a.
  53. Smit AFA, Hubley R, Green P. RepeatModeler Open-1.0. 2008–2015. 2015b.
  54. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34(Web Server):W435–W439. doi: 10.1093/nar/gkl200. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Udall JA, Long E, Hanson C, Yuan D, Ramaraj T, Conover JL, Gong L, Arick MA, Grover CE, Peterson DG, et al. De novo genome sequence assemblies of Gossypium raimondii and Gossypium turneri. G3 (Bethesda). 2019;9(10):3079–3085. doi: 10.1534/g3.119.400392. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Venturini L, Caim S, Kaithakottil GG, Mapleson DL, Swarbreck D. Leveraging multiple transcriptome assembly methods for improved gene structure annotation. Gigascience. 2018;7(8):giy093. doi: 10.1093/gigascience/giy093. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Vollesen K. The native species of Gossypium (Malvaceae) in Africa, Arabia and Pakistan. Kew Bull. 1987;42(2):337–349. doi: 10.2307/4109688. [DOI] [Google Scholar]
  58. Wang M, Li J, Wang P, Liu F, Liu Z, Zhao G, Xu Z, Pei L, Grover CE, Wendel JF, et al. Comparative genome analyses highlight transposon-mediated genome expansion and the evolutionary architecture of 3D genomic folding in cotton. Mol Biol Evol. 2021;38(9):3621–3636. doi: 10.1093/molbev/msab128. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Wang K, Wendel JF, Hua J. Designations for individual genomes and chromosomes in Gossypium. J Cotton Res. 2018;1(1):3. doi: 10.1186/s42397-018-0002-1. [DOI] [Google Scholar]
  60. Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, Zdobnov EM. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol Biol Evol. 2017;35(3):543–548. doi: 10.1093/molbev/msx319. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Wendel JF, Grover CE. Taxonomy and evolution of the cotton genus, Gossypium. In: Fang DD, Percy RG, editors. Cotton: Agronomy Monographs. Madison (WI): American Society of Agronomy, Inc., Crop Science Society of America, Inc., and Soil Science Society of America, Inc; 2015. p. 25–44. [Google Scholar]
  62. Wendel JF, Olson PD, Stewart JM. Genetic diversity, introgression, and independent domestication of old world cultivated cottons. Am J Bot. 1989;76(12):1795–1806. doi: 10.1002/j.1537-2197.1989.tb15169.x. [DOI] [Google Scholar]
  63. Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005;21(9):1859–1875. doi: 10.1093/bioinformatics/bti310. [DOI] [PubMed] [Google Scholar]
  64. Zhang Y, et al. Genome-wide identification and comprehensive analysis of the NAC transcription factor family in Sesamum indicum. PLoS One. 2018;13(6):e0199262. doi: 10.1371/journal.pone.0199262 [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Zhong R, Demura T, Ye Z-H. SND1, a NAC domain transcription factor, is a key regulator of secondary wall synthesis in fibers of Arabidopsis. Plant Cell. 2006;18(11):3158–3170. doi: 10.1105/tpc.106.047399 [DOI] [PMC free article] [PubMed] [Google Scholar]

Data Availability Statement

The assembled genome sequence of G. herbaceum cv. Wagad is available at NCBI under BioProject PRJNA614591 (accession number JAOTOV000000000) and at CottonGen (https://www.cottongen.org/). PacBio CCS sequence reads for G. herbaceum cv. Wagad are available at NCBI under BioProject PRJNA421172 (SRA: SRP251159), and the RNA-seq reads under BioProject PRJNA595350 (SRA: SRP237305).

Supplemental material available at G3 online.


Articles from G3: Genes|Genomes|Genetics are provided here courtesy of Oxford University Press