Scientific Data. 2026 Jan 10;13:231. doi: 10.1038/s41597-026-06547-2

Chromosome-level genome assembly of the dwarf cattail Typha minima

Junshuai Du, Lei Huang, Xinwei Xu
PMCID: PMC12901240  PMID: 41519866

Abstract

We report the first chromosome-level genome assembly of the critically endangered dwarf cattail, Typha minima, a wetland species of ecological and medicinal importance. Utilizing PacBio HiFi long-read sequencing and Hi-C scaffolding technologies, we generated a high-quality 324.66 Mb genome, anchored onto 30 pseudochromosomes. The assembly demonstrates exceptional continuity, with contig and scaffold N50 values of 10.84 Mb and 10.90 Mb, respectively, and a near-complete chromosomal anchoring rate of 99.65%. It exhibits outstanding completeness, as reflected by a BUSCO score of 99.2%, and contains 33.20% repetitive sequences. We annotated 34,541 protein-coding genes, with 96.42% receiving functional assignments. The assembly also includes annotations for non-coding RNAs, comprising 1,261 rRNAs, 230 miRNAs, and 467 tRNAs. Integrated orthology analysis identified 10,055 consensus orthologs across five functional databases. This high-quality genomic resource provides a foundation for advancing studies in evolutionary adaptation and conservation genomics of this endangered wetland plant.

Subject terms: Molecular ecology, Genome

Background & Summary

The genus Typha (cattails) comprises large emergent aquatic macrophytes that constitute a vital component of wetland ecosystems [1]. These plants play a crucial role in maintaining ecosystem functions by enhancing structural complexity, regulating biogeochemical cycles, and contributing to overall productivity [2–4]. Notably, Typha species effectively remove pathogenic microorganisms from water through root adsorption and the secretion of antimicrobial compounds [5], highlighting their ecological importance in maintaining wetland health. Beyond their ecological contributions, Typha species are valued for their medicinal properties. In traditional medicine, their dried pollen, known as Puhuang, is recognized for its hemostatic, stasis-resolving, and diuretic effects [6]. Given these multifaceted benefits, Typha species possess considerable applied potential and merit further scientific investigation.

Typha minima, characterized by a delicate growth habit, is native to temperate Eurasia [7]. It is currently listed as endangered in Switzerland and persists only in small, isolated populations in several other European countries [8]. Phylogenetic analyses have shown that T. minima and T. elephantina form a monophyletic clade, which is sister to the clade containing all other Typha species [9]. To date, genomes have been published for three Typha species: T. latifolia, T. angustifolia and T. domingensis [10–12]. Therefore, obtaining a high-quality genome of T. minima is crucial for advancing Typha phylogenomics, elucidating the genetic mechanisms underlying its endangerment, and developing effective conservation strategies.

In this study, we constructed a high-quality chromosome-level genome assembly for T. minima by integrating PacBio HiFi long-read sequencing and Hi-C chromatin interaction data. The final assembly spans 324.66 Mb with a scaffold N50 of 10.90 Mb, and 99.65% of the assembled sequences were successfully anchored onto 30 pseudochromosomes (Table 1, Supplementary Table S1). The genome exhibits high completeness, supported by a BUSCO completeness score of 99.20% (1,601 of 1,614 conserved genes) and a sequencing reads mapping rate of 98.19% (Table 1). Repetitive elements accounted for 33.20% (107.80 Mb) of the genome (Table 2). Among these, long terminal repeat (LTR) retrotransposons were the most abundant class (12.48%), predominantly represented by Gypsy (5.13%) and Copia (1.15%) families. Non-LTR retrotransposons and DNA transposons comprised 0.75% and 0.61%, respectively. Additionally, 11.94% of repeats remained unclassified and may include novel lineage-specific elements. A total of 34,541 protein-coding genes were annotated, of which 33,304 (96.42%) were assigned functional descriptions (Table 3). This chromosome-level genome of T. minima serves as a valuable genomic resource for investigating the genetic basis of its endangered status and will support further studies on the evolution and phylogenetic relationships within the genus Typha.

Table 1. Genome assembly summary of Typha minima.

Statistical feature | Corresponding value
Assembled genome size (bp) | 324,661,732
Number of contigs | 33
Number of scaffolds | 36
Contig N50 (bp) | 10,842,424
Scaffold N50 (bp) | 10,900,986
Number of chromosomes | 30
Genome sequences anchored to chromosomes (bp) | 323,512,042
Anchoring rate | 99.65%
GC content | 37.91%
BUSCO complete genes (C) | 1,601 (99.2%)
BUSCO single-copy genes (S) | 971 (60.2%)
BUSCO duplicated genes (D) | 629 (39%)
BUSCO fragmented genes (F) | 6 (0.4%)
BUSCO missing genes (M) | 6 (0.4%)
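
The contig and scaffold N50 values in Table 1 follow the usual definition: the length L such that sequences of length L or longer together cover at least half of the assembly. The sketch below illustrates that definition only; the function name and the toy lengths are hypothetical and are not the T. minima contigs.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy (hypothetical) lengths: total = 300, so half = 150.
# 80 + 70 = 150 reaches the halfway point, hence N50 = 70.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # 70
```

Applied to the real contig or scaffold lengths of an assembly FASTA, the same function would yield the values reported in the table.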

Table 2. Summary of repetitive sequences in the Typha minima genome.

Type | Length (bp) | % in genome
DNA transposon | 5,008,465 | 1.54%
LINE | 2,436,784 | 0.75%
SINE | 266 | 0.00%
LTR | 20,630,798 | 18.83%
Satellite | 188,970 | 0.06%
Other | 255,404 | 0.08%
Unknown | 38,776,748 | 11.94%
Total | 67,297,435 | 33.20%

Table 3. Summary of the protein-coding gene annotation for the Typha minima genome.

Statistical feature | Corresponding value
Total gene number | 34,541
Total transcript number | 49,360
Mean gene length (bp) | 4,949
Functionally annotated gene number | 33,304

Methods

Genome sequencing and assembly

Fresh leaves of Typha minima were collected from Kashgar, Xinjiang, China (39°14'15.2"N, 76°09'41.4"E; elevation 1,228 m). The samples were immediately frozen in liquid nitrogen and stored at −80 °C until DNA and RNA extraction. Genomic DNA was extracted using a modified CTAB protocol [13]. DNA purity was assessed with a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, USA) to determine A260/A280 and A260/A230 ratios. Potential degradation and RNA contamination were evaluated using pulsed-field gel electrophoresis (PFGE) and Qubit 3.0 fluorometry (Thermo Fisher Scientific, USA). For the initial genome survey, we generated 65 Gb (~196× coverage) of 150-bp paired-end Illumina NovaSeq 6000 reads (Supplementary Table S1). Raw reads were quality-filtered by removing reads containing >3 N bases, adapter contamination, or ≥20% of bases with Phred Q < 20. Contamination screening was performed by aligning 10,000 randomly selected reads against the NT database [14] using BLASTN v2.12.0 [15]. K-mer analysis (k = 23) conducted with Jellyfish v1.1.11 [16] and GenomeScope v1.0 [17] estimated a genome size of 331.55 Mb (Fig. 1). The GenomeScope profile exhibited three peaks with an approximate ratio of 1:2:4 (Fig. 1), suggesting that T. minima is likely a tetraploid. This inference was further supported by Smudgeplot [18] analysis, which indicated an allotetraploid genome structure (Fig. 2). For high-quality genome assembly, we prepared PacBio HiFi libraries using the SMRTbell Express 2.0 Kit, with library quality assessed by capillary electrophoresis (Fragment Analyzer 5400, Agilent). High-fidelity reads were processed via the CCS module in SMRT Link v11.0, yielding 48.66 Gb of HiFi data (~147× coverage; Supplementary Table S1). De novo assembly was performed using Hifiasm v0.19.5 [19], followed by haplotype purging with purge_dups v1.2.5 [20] to remove heterozygous redundancies, resulting in a final genome assembly of 324.66 Mb.
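
The three read-filtering criteria stated above (more than 3 N bases, adapter contamination, or at least 20% of bases below Phred Q20) can be expressed as a single predicate. This is a minimal sketch, not the filtering tool actually used, which the text does not name. The function name, the exact-substring adapter check, and the adapter sequence are illustrative assumptions.

```python
# Assumed adapter fragment for illustration only; the adapters actually
# trimmed in the study are not stated in the text.
ASSUMED_ADAPTER = "AGATCGGAAGAGC"


def keep_read(seq, quals, adapter=ASSUMED_ADAPTER,
              max_n=3, low_q=20, max_low_frac=0.20):
    """Return True if a read passes the three filters described in the text.

    seq   -- read sequence as a string
    quals -- per-base Phred scores as a list of ints (same length as seq)
    """
    # Criterion 1: discard reads with more than `max_n` N bases.
    if seq.upper().count("N") > max_n:
        return False
    # Criterion 2: discard reads containing adapter sequence
    # (exact substring match is a simplification).
    if adapter and adapter in seq.upper():
        return False
    # Criterion 3: discard reads where >= 20% of bases have Phred < 20.
    low = sum(1 for q in quals if q < low_q)
    if quals and low / len(quals) >= max_low_frac:
        return False
    return True


good = keep_read("ACGT" * 5, [30] * 20)
too_many_n = keep_read("NNNN" + "ACGT" * 4, [30] * 20)
low_quality = keep_read("ACGT" * 5, [10] * 4 + [30] * 16)
print(good, too_many_n, low_quality)  # True False False
```

For paired-end data, a pair would typically be dropped if either mate fails; that pairing logic is omitted here.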

Fig. 1. Genome survey of Typha minima using k-mer distribution analysis (k = 23).
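
A back-of-the-envelope version of k-mer-based genome size estimation divides the total number of non-error k-mers by the depth of the main coverage peak. GenomeScope fits a full mixture model to the histogram, so the sketch below is only an illustration of the underlying idea under that simplifying assumption. The function name, the `min_depth` error cutoff, and the toy histogram are hypothetical and are not derived from the T. minima data.

```python
def estimate_genome_size(histogram, peak_depth, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram  -- dict mapping k-mer depth -> number of distinct k-mers
    peak_depth -- depth of the main coverage peak (chosen by the user)
    min_depth  -- depths below this are treated as sequencing errors
    """
    total_kmers = sum(depth * count
                      for depth, count in histogram.items()
                      if depth >= min_depth)
    return total_kmers / peak_depth


# Toy histogram: many depth-1 error k-mers plus a peak centred on depth 20.
toy_histogram = {1: 1000, 19: 50, 20: 100, 21: 50}
# Non-error k-mers: 19*50 + 20*100 + 21*50 = 4000; 4000 / 20 = 200.
print(estimate_genome_size(toy_histogram, peak_depth=20))  # 200.0
```

In a polyploid profile with several peaks, the choice of which peak to divide by determines whether the estimate is a monoploid or a total genome size, which is one reason a model-based tool is preferable in practice.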

Fig. 2. The result of Smudgeplot analysis for Typha minima.

Following assembly, we performed a comprehensive quality assessment by analyzing GC content and mean coverage depth in 10-kb non-overlapping genomic windows. This dual-parameter analysis facilitated both base composition profiling and the detection of potential exogenous contamination. The absence of discrete clusters in the bivariate GC-depth distribution (Fig. 3) indicated a contamination-free assembly. For genome annotation, we implemented an integrated pipeline based on the embryophyta_odb10 single-copy ortholog set [21], consisting of three main steps: initial screening using TBLASTN [22], gene prediction with AUGUSTUS v3.5 [23], and domain validation using HMMER v3.3.2 [24]. The quality of the genome assembly was evaluated from two aspects: gene completeness and sequence contiguity. First, gene completeness was assessed using BUSCO v5.4.3 [21] against the embryophyta_odb10 database, yielding a completeness score of 99.20%. Subsequently, raw second-generation sequencing data were mapped back to the assembled genome using BWA v0.7.17 [25] with default parameters, yielding a read mapping rate of 98.19% (Table 1). These results collectively demonstrate that the assembled genome possesses high completeness and accuracy.

Fig. 3. GC content and depth distribution.
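
The GC half of the GC-depth analysis can be sketched as a pass over non-overlapping windows of a sequence. This is an illustrative sketch, not the authors' script: the function name is hypothetical, and the per-window mean depth, which would come from read alignments, is omitted.

```python
def gc_windows(seq, window=10_000):
    """Yield (start, gc_fraction) for non-overlapping windows of `seq`.

    GC fraction is computed over unambiguous bases (A, C, G, T) only;
    windows containing no unambiguous bases are skipped.
    """
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        acgt = sum(chunk.count(base) for base in "ACGT")
        if acgt == 0:
            continue
        gc = chunk.count("G") + chunk.count("C")
        yield start, gc / acgt


# Toy sequence with a small window so the result is easy to check by eye.
toy_seq = "GC" * 10 + "AT" * 10 + "GGAT"
print(list(gc_windows(toy_seq, window=20)))
# [(0, 1.0), (20, 0.0), (40, 0.5)]
```

Plotting each window's GC fraction against its mean depth gives the kind of scatter shown in Fig. 3, where contaminant sequence would be expected to appear as a separate cluster.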

Hi-C-Assisted genome assembly

Fresh leaf tissue was fixed with 1% formaldehyde for 15 min to preserve chromatin architecture, after which the crosslinking reaction was quenched using 0.125 M glycine. Hi-C libraries were constructed according to a standard protocol [26] and sequenced as 150-bp paired-end reads on an Illumina NovaSeq 6000 platform, yielding 52.24 Gb of Hi-C data (~161× coverage; Supplementary Table S1). Raw reads were quality-filtered, and 10,000 randomly selected clean reads were screened for contamination by aligning against the NT database [14] using BLASTN v2.12.0 [15]. No exogenous contamination was detected (Supplementary Table S2). Valid chromatin interactions were aligned to the draft assembly using HiC-Pro v2.11.4 [27]. Chromosome-level scaffolding was achieved through initial scaffolding with 3D-DNA v180922 [28] and subsequent manual curation in Juicebox v1.11.08 [29]. Genome-wide contact maps were generated using HiCExplorer v3.7.2 [30] to visualize interaction intensities (Fig. 4). The final genome architecture was illustrated using Circos v0.69.9 [31], highlighting key genomic features (Fig. 5).

Fig. 4. Hi-C heatmap for the genome assembly of Typha minima.

Fig. 5. Genomic features of Typha minima. The features are arranged in the order of chromosomes, gene density, repeat density and GC content from outside to inside across the 30 pseudochromosomes.

Genome annotation

To optimize genome annotation, we performed transcriptome sequencing on root, stem, leaf, and fruit tissues of T. minima collected from Kashgar, Xinjiang. Total RNA was extracted from each tissue using the RNAprep Pure Plant Kit (Cat. No. DP441, Tiangen Biotech). RNA integrity, purity, and concentration were assessed using a NanoDrop spectrophotometer and an Agilent 2100 Bioanalyzer. Strand-specific libraries were constructed with the TruSeq mRNA-seq Kit and sequenced on the Illumina NovaSeq 6000 platform (PE150). The four libraries generated 6.82 Gb, 7.32 Gb, 5.94 Gb, and 6.83 Gb of raw data, respectively, with Q30 base percentages exceeding 91% in all samples. To complement the short-read data, we performed PacBio full-length transcriptome sequencing. A SMRTbell library was prepared from leaf RNA using the Iso-Seq Express 2.0 Kit and Kinnex Full-length RNA Kit, and sequenced on the PacBio Sequel II platform, generating 4.6 Gb of HiFi data. The resulting reads had an average length of 1,622 bp, a median length of 1,593 bp, an N50 of 1,776 bp, and a median base quality of Q35.

Repetitive elements were annotated through a multi-step pipeline. Initially, Miniature Inverted-repeat Transposable Elements (MITEs) were identified using MITE-Hunter v1.0 [32] with default parameters. This was followed by LTR retrotransposon prediction via LTRharvest [33] and LTR_Finder v1.0.7 [34], with results integrated using LTR_retriever v2.9 [35]. An initial screening against the RepBase v20170127 database [36] was conducted with RepeatMasker v4.1.1 [37], complemented by de novo prediction on masked sequences using RepeatModeler v2.0 [38]. This comprehensive analysis identified 67.3 Mb of repetitive sequences, accounting for 33.2% of the genome. LTR retrotransposons dominated the repeat landscape (12.48%), with Gypsy (5.13%) and Copia (1.15%) as the predominant subtypes (Table 2). Non-coding RNAs were annotated using tRNAscan-SE v2.0 [39] for tRNAs and INFERNAL v1.1.2 [40] against the Rfam v14.1 database [41] for other classes, yielding 1,261 rRNAs, 230 miRNAs, 65 snRNAs, and 467 tRNAs (Supplementary Table S4). Protein-coding gene prediction integrated multiple approaches: homology-based prediction with GeMoMa v1.9 [42] using related proteomes and transcript evidence (HISAT2 [43] alignments assembled by Cufflinks), and ab initio prediction using AUGUSTUS v3.5 [23], SNAP [44], GlimmerHMM v3.0.4 [45], and GeneMark-ET [46]. Results were integrated using EvidenceModeler, and UTR annotation and alternative splicing analysis were performed using PASA v2.5.2 [47]. This process identified 34,541 protein-coding genes with a mean length of 4,949 bp (Table 3). Functional annotation was performed by conducting BLASTP [15] searches against the NR v202108 database [48], Swiss-Prot release 2021_08 [49], eggNOG v5.0 [50], Gene Ontology (GO) [51], and KEGG PATHWAY [52] databases. A total of 33,304 genes (96.42%) were functionally annotated (Table 3). Subsequent Venn analysis of annotations across these databases identified 10,055 genes with consensus support (Fig. 6).

Fig. 6. Venn diagram of gene functional annotations across multiple databases.
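
The consensus count in a Venn analysis of this kind is the size of the intersection of the per-database gene sets, while the functionally annotated count is the size of their union. A minimal sketch follows; the gene identifiers and set contents are hypothetical toy data, and only the five database names are taken from the text.

```python
# Hypothetical toy annotation sets keyed by database name.
annotations = {
    "NR":         {"g1", "g2", "g3", "g4"},
    "Swiss-Prot": {"g1", "g2", "g3"},
    "eggNOG":     {"g1", "g2", "g4"},
    "GO":         {"g1", "g2"},
    "KEGG":       {"g1", "g2", "g5"},
}
total_genes = 6  # toy total, including one gene with no annotation at all

# Genes annotated in every database (the central region of the Venn diagram).
consensus = set.intersection(*annotations.values())
# Genes annotated in at least one database.
annotated = set.union(*annotations.values())

print(sorted(consensus))                      # ['g1', 'g2']
print(len(annotated))                         # 5
print(f"{100 * len(annotated) / total_genes:.2f}%")  # 83.33%
```

With the real per-database gene lists, `len(consensus)` and `len(annotated)` would correspond to the 10,055 and 33,304 figures reported above.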

Data Records

The raw sequencing data generated in this study, including genome survey short reads, PacBio HiFi reads, Hi-C reads, RNA-seq short reads, and Iso-Seq reads, have been deposited in the Genome Sequence Archive (GSA) at the National Genomics Data Center (NGDC), China National Center for Bioinformation (CNCB), under BioProject accession number PRJCA042646 [53]. The corresponding dataset accessions are CRA027553 [54] (genome survey), CRA027595 [55] (PacBio HiFi), CRA027574 [56] (Hi-C), CRA027597 [57] (RNA-seq), and CRA027598 [58] (Iso-Seq). The assembled whole-genome sequence has been deposited in the Genome Warehouse (GWH) at NGDC under accession number GWHGEOG00000000.1 [59] and is publicly accessible at https://ngdc.cncb.ac.cn/gwh. The raw sequencing data and genome assembly have also been deposited in the European Nucleotide Archive (ENA) under project accession number PRJEB102980 [60], with the genome assembly available under accession number GCA_977063535 [61]. The respective ENA accessions for the raw data are: ERR15860102 [62] (survey reads); ERR15862836 [63] (Iso-Seq reads); ERR15874483 [64] (HiFi reads); ERR15873639 [65] (Hi-C reads); and ERR15873620 [66], ERR15873621 [67], ERR15873622 [68] and ERR15873623 [69] (RNA-seq reads).

Technical Validation

The chromosome-level genome assembly of Typha minima (324.66 Mb) was comprehensively validated using orthogonal methods, confirming its high contiguity (Table 1). Hi-C scaffolding anchored 99.65% of the assembly into 30 chromosomes, with chromatin interaction heatmaps demonstrating clear chromosomal compartmentalization (Fig. 4). Assembly completeness was supported by a BUSCO score of 99.2% and by k-mer analysis (k = 23), which estimated a genome size of 331.55 Mb (Fig. 1), a deviation of only 2.1% from the final assembly size. High data fidelity was further demonstrated by mapping rates exceeding 98% for both short reads and PacBio HiFi reads (aligned using BWA v0.7.17), as well as an RNA-seq alignment rate of 98.09%. Rigorous contamination screening, including pre-assembly BLASTN alignment against the NT database (Supplementary Table S3) and post-assembly GC-depth distribution analysis, which showed no aberrant clusters (Fig. 3), confirmed the absence of detectable foreign sequences. Gene annotation yielded 34,541 protein-coding genes, 96.42% of which were functionally annotated (Table 3). Among these, 10,055 genes were consistently supported across all databases (Fig. 6).
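
Several of the validation percentages quoted above can be re-derived from the counts reported in Tables 1 and 3 and the k-mer estimate. The sketch below uses only those reported values; the helper name `percent` is introduced here for illustration, and expressing the size deviation relative to the k-mer estimate is an assumption about how the 2.1% was calculated.

```python
def percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


assembled_bp = 324_661_732       # Table 1: assembled genome size
anchored_bp = 323_512_042        # Table 1: sequence anchored to chromosomes
kmer_estimate_bp = 331_550_000   # k-mer estimate of 331.55 Mb
total_genes = 34_541             # Table 3: total gene number
annotated_genes = 33_304         # Table 3: functionally annotated genes

anchoring_rate = percent(anchored_bp, assembled_bp)
size_deviation = percent(kmer_estimate_bp - assembled_bp, kmer_estimate_bp)
annotation_rate = percent(annotated_genes, total_genes)

print(f"{anchoring_rate:.2f}")   # 99.65
print(f"{size_deviation:.2f}")   # 2.08
print(f"{annotation_rate:.2f}")  # 96.42
```

The size deviation of about 2.08% agrees with the reported 2.1% when rounded to one decimal place.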

Supplementary information

Supplementary Tables (14KB, xlsx)

Acknowledgements

This study was supported by the National Water Pollution Control and Treatment Science and Technology Major Project, China (No. 2015ZX07503005). The calculations in this paper were performed using the supercomputing system at the Supercomputing Center of Wuhan University.

Author contributions

X.X. designed the research; L.H. carried out the field collections; J.D. carried out the experiments and performed the data analysis; J.D., L.H. and X.X. wrote and revised the manuscript. All authors read and approved the manuscript.

Data availability

Raw sequence data have been deposited in the Genome Sequence Archive (GSA) at the National Genomics Data Center (NGDC) under BioProject accession number PRJCA042646 [53]. The specific accessions for the genome survey, PacBio HiFi, Hi-C, RNA-seq, and Iso-Seq reads are CRA027553 [54], CRA027595 [55], CRA027574 [56], CRA027597 [57], and CRA027598 [58], respectively. The genome assembly has been deposited in the Genome Warehouse (GWH) under accession number GWHGEOG00000000.1 [59]. All raw data and the assembly are also available in the European Nucleotide Archive (ENA) under project PRJEB102980 [60], with the following accessions: ERR15860102 [62] (survey); ERR15862836 [63] (Iso-Seq); ERR15874483 [64] (HiFi); ERR15873639 [65] (Hi-C); ERR15873620 [66], ERR15873621 [67], ERR15873622 [68] and ERR15873623 [69] (RNA-seq); and GCA_977063535 [61] (genome assembly).

Code availability

All bioinformatics tools and software used for genome assembly, annotation, and data analysis in this study were operated strictly according to their official user manuals, with no custom code employed. Software versions and parameters are comprehensively documented in the Methods section.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41597-026-06547-2.

Data Citations

  1. European Nucleotide Archive https://identifiers.org/ena.embl:PRJEB102980 (2025).
  2. European Nucleotide Archive https://identifiers.org/insdc.gca:GCA_977063535 (2025).
  3. European Nucleotide Archive https://identifiers.org/insdc.sra:ERR15860102 (2025).
  4. European Nucleotide Archive https://identifiers.org/insdc.sra:ERR15862836 (2025).
  5. European Nucleotide Archive https://identifiers.org/insdc.sra:ERR15874483 (2025).
  6. European Nucleotide Archive https://identifiers.org/insdc.sra:ERR15873639 (2025).
  7. European Nucleotide Archive https://identifiers.org/insdc.sra:ERR15873620 (2025).
  8. European Nucleotide Archive https://identifiers.org/insdc.sra:ERR15873622 (2025).
  9. European Nucleotide Archive https://identifiers.org/insdc.sra:ERR15873623 (2025).

Articles from Scientific Data are provided here courtesy of Nature Publishing Group
