Scientific Data. 2024 Aug 2;11:835. doi: 10.1038/s41597-024-03664-8

Telomere-to-telomere genome assembly of sorghum

Meng Li 1,✉,#, Chunhai Chen 2,#, Haigang Wang 1,#, Huibin Qin 1,#, Sen Hou 1,#, Xukui Yang 2,#, Jianbo Jian 2, Peng Gao 3,✉, Minxuan Liu 4,✉, Zhixin Mu 1,✉
PMCID: PMC11297213  PMID: 39095379

Abstract

“Cuohu Bazi” (CHBZ) is an ancient sorghum variety collected from the fields of China and known for agronomic traits such as dwarf stature and early maturation. In this study, we present the first telomere-to-telomere (T2T), gap-free genome assembly of CHBZ, generated using PacBio HiFi reads, Oxford Nanopore Technologies (ONT) reads, and Hi-C data. The assembled genome comprises 724.85 Mb and resolves all 3,913 gaps present in the previous sorghum BTx623 reference genome. Notably, the T2T assembly captures all 10 centromeres and all 20 telomeres, supporting the integrity of the assembly. The assembly is of high quality in terms of contiguity (contig N50: 71.1 Mb), completeness (BUSCO score: 99.01%; k-mer completeness: 98.88%), and correctness (QV: 61.60). Repetitive sequences account for 70.41% of the genome, and a total of 32,855 protein-coding genes were annotated. Furthermore, 161 CHBZ-specific presence/absence variant (PAV) genes were identified in comparison with the BTx623 genome. This study provides a valuable resource for future research on sorghum genetics, genomics, and evolutionary history.

Subject terms: Plant molecular biology, Agricultural genetics

Background & Summary

Sorghum is a widely cultivated cereal crop, particularly in Africa, and ranks fifth in global cereal production [1]. It is remarkably adaptable and stress-resistant, tolerating drought, waterlogging, saline-alkali and barren soils, and high temperatures. It serves as a staple food for approximately 500 million people in Africa and Asia, and provides a source of energy, forage, and industrial raw material for the brewing industry. The sorghum genome has been extensively studied: the first reference genome (BTx623) was published in 2009 [2], followed by the genome of the sorghum inbred transformation line Tx430 in 2018 [3] and the sweet sorghum genome in 2019 [4]. In 2021, the first sorghum pan-genome was completed [5], and in 2023, seven high-quality sorghum organelle genomes were published [6]. These milestones mark significant advances in molecular research on sorghum. However, the currently widely used sorghum reference genome (BTx623, RefSeq assembly accession: GCF_000003195.3) [2] still has relatively low contiguity and quality, with a contig N50 of 1.3 Mb and 3,913 gaps. High-quality sorghum genomes have therefore long been needed.

Recent advances in genome sequencing and assembly methodology have made telomere-to-telomere (T2T), gap-free assembly of chromosome sequences possible. T2T assemblies open up previously inaccessible regions such as telomeres and centromeres, enabling more in-depth research in both animals and plants [7–9]. T2T genome assemblies have been reported for several important crops, including banana [10], barley [11], rice [12], and maize [13]. “Cuohu Bazi” (CHBZ) is an ancient local sorghum landrace collected in China between 1982 and 1986 (Fig. 1a). Field evaluation has shown that CHBZ possesses excellent agronomic traits, such as dwarf stature and early maturation: plant height is approximately 1.30 m and the growth period is 100 days. The T2T genome assembly of CHBZ should therefore provide valuable guidance for sorghum breeding efforts.

Fig. 1.


Overview of CHBZ and its genome. (a) Photographs of CHBZ. (b) Circos plot illustrating the genome of CHBZ. The plot includes the following components, arranged from inside to outside: (I) Collinear regions within the CHBZ assembly; (II) Gene density in 1-Mb sliding windows; (III) GC content in non-overlapping 1-Mb windows; (IV) Percentage of interspersed repeats in 1-Mb sliding windows; (V) Percentage of tandem repeats in 1-Mb sliding windows; (VI) Length of pseudochromosomes in megabases (Mb).

In this study, we aim to generate the first T2T gap-free genome for CHBZ using a combination of the latest sequencing technologies, including PacBio high-fidelity (HiFi) sequencing, Oxford Nanopore Technologies (ONT) ultra-long sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing. The genomic resources and gene structures produced by this study will lay the groundwork for future research on CHBZ genetics and breeding.

Methods

Sample collection and sequencing

CHBZ, grown in the germplasm resource nursery at the Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Taiyuan, 030031, China, was chosen for DNA and RNA sequencing. Fresh, healthy, young seedlings were harvested, immediately frozen in liquid nitrogen, and then stored at −80 °C. Genomic DNA was extracted with the cetyltrimethylammonium bromide (CTAB) method and evaluated using a NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). For the PacBio HiFi libraries, the “Using SMRTbell Express Template Prep Kit 2.0 With Low DNA Input” protocol from PacBio (Pacific Biosciences, USA) was followed, with an insert size of approximately 20 kb. The libraries were then sequenced on the PacBio Sequel II platform in circular consensus sequencing (CCS) mode. For ONT ultra-long sequencing, the library was prepared using the Oxford Nanopore SQK-LSK109 kit and sequenced on a PromethION flow cell (Oxford Nanopore Technologies, Oxford, UK). For Hi-C sequencing, libraries based on the DpnII restriction enzyme were generated as previously described [14] and sequenced on the MGISEQ-2000 platform. Total RNA from roots, stems, leaves and spikes was isolated using the NEBNext Poly(A) mRNA Magnetic Isolation Module. DNase I (Thermo Fisher Scientific, Wilmington, DE, USA) was used to remove genomic DNA. RNA integrity was checked using a BioAnalyzer 2100 (Agilent Technologies, Santa Clara, USA). RNA libraries were prepared using the NEBNext Ultra RNA Library Prep Kit for Illumina with an insert size of 300 bp, and were sequenced on an MGISEQ-2000 instrument to generate 150 bp paired-end reads.

In total, we generated 304.06 Gb of ONT reads (~419X coverage) with an N50 of 52.44 kb, 28.65 Gb of PacBio HiFi CCS reads (~40X coverage) with an N50 of 16.64 kb, 304.93 Gb of Hi-C data (Illumina paired-end reads, ~421X coverage), and 123.30 Gb of RNA-seq data (Tables 1 and 2).

Table 1.

Summary of DNA sequencing data of the CHBZ genome.

Data type Total length (Gb) Genome depth* (X) Min length (bp) Max length (bp) N50 length of reads (bp)
Raw ultra-long ONT data 304.06 419 5,000 825,384 52,442
Error corrected ONT data 36.03 50 1,025 278,940 80,402
PacBio subreads data 477.89 659 100 529,636 16,559
PacBio CCS data 28.65 40 517 50,265 16,638
Raw Hi-C data 304.93 421 150 150 150
Clean Hi-C data 303.56 419 150 150 150

*Estimated based on the size of 724.95 Mb.
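The genome depth column in Table 1 follows from dividing the total bases in each dataset by the estimated genome size. The short Python sketch below reproduces that arithmetic from the values in Table 1; the function name `depth` and the rounding to the nearest integer are illustrative assumptions, not part of the authors' pipeline.

```python
# Sequencing depth = total bases / estimated genome size.
GENOME_SIZE_MB = 724.95  # k-mer-based estimate quoted under Table 1


def depth(total_gb: float, genome_size_mb: float = GENOME_SIZE_MB) -> int:
    """Return fold coverage rounded to the nearest integer."""
    return round(total_gb * 1000 / genome_size_mb)


# Total length (Gb) of each dataset, copied from Table 1.
datasets_gb = {
    "Raw ultra-long ONT": 304.06,
    "Error-corrected ONT": 36.03,
    "PacBio subreads": 477.89,
    "PacBio CCS": 28.65,
    "Raw Hi-C": 304.93,
    "Clean Hi-C": 303.56,
}

for name, gb in datasets_gb.items():
    print(f"{name}: ~{depth(gb)}X")
```

Under these assumptions the computed depths agree with the tabulated values (419, 50, 659, 40, 421 and 419).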

Table 2.

Summary of RNA-seq data for CHBZ.

Sample Tissue_Stage Total raw reads (M) Total clean reads (M) Total clean bases (Gb) Clean reads Q20 (%) Clean reads GC (%)
WHPLAolfhRAAARAA-9313 Root_seedling 62.03 59.67 8.95 98.41 53.46
WHPLAolfhRAABRAA-9314 Stem_seedling 59.49 57.41 8.61 98.33 55.11
WHPLAolfhRAACRAA-9315 Leaf_seedling 51.11 49.75 7.46 98.35 55.25
WHPLAolfhRAADRAA-9316 Root_jointing 53.21 52.21 7.83 98.33 53.2
WHPLAolfhRAAERAA-9317 Stem_jointing 62.67 60.96 9.14 98.43 52.11
WHPLAolfhRAAFRAA-9318 Leaf_jointing 59.22 56.9 8.53 98.41 53.71
WHPLAolfhRAAGRAA-9319 Ear_jointing 54.05 52.49 7.87 98.31 52.54
WHPLAolfhRAAHRAA-9320 Root_flowering 46.69 45.54 6.83 98.32 52.57
WHPLAolfhRAAIRAA-9321 Stem_flowering 62.12 60.21 9.03 98.47 53.83
WHPLAolfhRAAJRAA-9322 Leaf_flowering 57.42 56.31 8.45 98.23 53.45
WHPLAolfhRAAKRAA-9323 Ear_flowering 56.75 55.33 8.3 98.31 54.28
WHPLAolfhRAALRAA-9324 Root_filling 57.48 56.29 8.44 98.37 52.44
WHPLAolfhRAAMRAA-9325 Stem_filling 53.06 52.06 7.81 98.35 51.51
WHPLAolfhRAANRAA-9326 Leaf_filling 56.98 55.71 8.36 98.36 53.47
WHPLAolfhRAAORAA-9327 Ear_filling 52.32 51.25 7.69 98.4 52.32
Total NA 844.60 822.09 123.30 98.36 53.28

Genome assembly

The subreads generated on the PacBio Sequel II platform were processed using the CCS algorithm of SMRTLink (v11.1.0) [15] with the following parameters: “--minPasses 3 --minPredictedAccuracy 0.99 --minLength 500”. We then performed a genome survey with GCE (Genomic Charactor Estimator) (v1.0.2) [16], using HiFi reads and the parameter ‘-k 17’, to estimate the size and heterozygosity of the CHBZ genome. The estimated genome size was approximately 724.95 Mb, with a heterozygosity rate of 0.10% and a repeat content of 67.74% (Table 3). After quality control with SOAPnuke (v2.0) [17] using the parameters “-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5”, a total of 303.56 Gb of clean Hi-C data was obtained. Primary contigs were generated from the PacBio HiFi reads, ONT reads, and clean Hi-C data by Hifiasm (v0.19.5) [18] with default parameters. To anchor contigs onto chromosomes, we aligned the clean Hi-C data to the assembled contigs with BWA (v0.7.12) [19] and filtered low-quality reads using the HiC-Pro pipeline [20] with default parameters. The valid reads were used to anchor chromosomes with Juicer [21] and the 3d-dna pipeline [22]. LR_Gapcloser [23] was used to close gaps in the assembled genome with error-corrected ONT long reads generated by NECAT [24]. To further refine the genome, the T2T assembly was polished using a method similar to that described in Mc Cartney, Shafin et al. [25]. Briefly, the HiFi reads were aligned to the T2T assembly using Winnowmap2 (v2.03) [26]. The alignments were filtered with the ‘falconc bam-filter-clipped’ tool to remove all secondary alignments and alignments with excessive clipping. Finally, racon (v1.5.0) [27] was run on the filtered alignments.

Table 3.

Summary statistics of the CHBZ genome assembly.

Genomics feature Value
Estimated genome size (Mb) 724.95
Estimated heterozygosity rate (%) 0.10
Estimated repeat content (%) 67.74
Assembled genome size (Mb) 724.85
N50 value (Mb) 71.06
Number of base chromosomes 10
Number of gap-free chromosomes 10
Number of candidate telomeres 20
Number of candidate centromeres 10
GC content (%) 43.90
Quality value (QV) 61.60
LTR assembly index (LAI) 23.63
Repetitive sequences (Mb) 510.36 (70.41%)
Genome BUSCOs (%) 99.01
Number of genes 32,855
Gene BUSCOs (%) 99.38

Overall, the final CHBZ genome assembly is about 724.85 Mb with an N50 of 71.06 Mb (Table 3). The sequences were clustered and oriented into 10 pseudochromosomes (Fig. 1b, Table 4).

Table 4.

The quality of each chromosome in the CHBZ genome.

ID Length (bp) N% GC% Contig number Gene number
Chr01 86,020,532 0 44.23 1 5,427
Chr02 78,662,643 0 43.76 1 4,296
Chr03 81,090,025 0 43.90 1 4,447
Chr04 71,062,133 0 43.88 1 3,631
Chr05 78,004,805 0 43.80 1 2,418
Chr06 66,673,040 0 43.77 1 2,847
Chr07 69,900,438 0 43.83 1 2,333
Chr08 66,635,546 0 43.75 1 2,029
Chr09 63,867,428 0 44.02 1 2,564
Chr10 62,932,570 0 43.70 1 2,863
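Because every chromosome in Table 4 is a single contig, the assembly size and N50 reported in Table 3 can be recomputed directly from the chromosome lengths. The sketch below uses the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly); the function name `n50` is an illustrative choice.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50() needs at least one sequence length")


# Chromosome (= contig) lengths in bp, copied from Table 4 (Chr01..Chr10).
chrom_lengths = [
    86_020_532, 78_662_643, 81_090_025, 71_062_133, 78_004_805,
    66_673_040, 69_900_438, 66_635_546, 63_867_428, 62_932_570,
]

assembly_size = sum(chrom_lengths)
print(f"Assembly size: {assembly_size / 1e6:.2f} Mb")
print(f"N50: {n50(chrom_lengths) / 1e6:.2f} Mb")
```

With these inputs the total is 724,849,160 bp and the N50 is the length of Chr04, matching the 724.85 Mb and 71.06 Mb given in Table 3.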

Repeat annotation

Two strategies, de novo and homology-based, were used to predict repeat elements. De novo repeats were identified by RepeatModeler (v1.0.4) [28], and long terminal repeats were annotated by LTR-FINDER (v1.0.7) [29]. DNA-level and protein-level transposable elements were detected by RepeatMasker (v4.0.7) [30] and RepeatProteinMasker (v4.0.7), respectively, based on the Repbase database [31]. Finally, tandem repeats were identified by Tandem Repeats Finder (v4.10.0) [32].

In the CHBZ genome, repeat sequences accounted for 510.36 Mb, representing 70.41% of the assembly (Table 3). Long terminal repeat (LTR) retrotransposons (55.75%) were the most abundant repetitive elements, consistent with a previous study [1] (Table 5).

Table 5.

Transposable element (TE, interspersed repeat) contents in the CHBZ assembly.

Type | Repbase TEs: length (bp) | Repbase TEs: % of genome | TE proteins: length (bp) | TE proteins: % of genome | De novo: length (bp) | De novo: % of genome | Combined TEs*: length (bp) | Combined TEs*: % of genome
Class I | 371,524,099 | 51.26 | 115,028,697 | 15.87 | 364,008,956 | 50.22 | 417,863,762 | 57.65
LTR | 361,035,622 | 49.81 | 108,970,406 | 15.03 | 359,774,055 | 49.63 | 404,105,204 | 55.75
LINE | 10,408,720 | 1.44 | 6,058,291 | 0.84 | 4,234,901 | 0.58 | 13,678,801 | 1.89
SINE | 79,757 | 0.01 | 0 | 0.00 | 0 | 0.00 | 79,757 | 0.01
Class II: DNA | 71,208,394 | 9.82 | 10,201,586 | 1.41 | 28,984,885 | 4.00 | 79,678,544 | 10.99
Unclassified | 15,457 | 0.00 | 0 | 0.00 | 425,680 | 0.06 | 441,137 | 0.06
Total | 442,066,952 | 60.99 | 125,226,177 | 17.28 | 385,130,064 | 53.13 | 475,407,461 | 65.59

Note: This table does not include tandem repeats; some elements may partly include another element domain.

*Combined: the non-redundant consensus of all repeat prediction/classification methods employed.

Unclassified: predicted repeats that cannot be classified by RepeatMasker.

LINE, long interspersed nuclear elements; SINE, short interspersed nuclear elements; LTR, long terminal repeat.

Protein-coding gene prediction and functional annotation

Gene prediction combined transcriptome-based, homology-based, and ab initio methods. For transcriptome-based prediction, 123.3 Gb of clean reads sequenced by DNBSEQ-2000 from root, stem, leaf and ear tissues at four stages (seedling, jointing, heading and flowering, and filling and maturity) were assembled by Trinity (v2.15.1) [33] with the parameters “--max_memory 200G --min_contig_length 200 --genome_guided_bam merged_sorted.bam --full_cleanup --min_kmer_cov 3 --min_glue 3 --bfly_opts '-V 5 --edge-thr=0.1 --stderr' --genome_guided_max_intron 10000”, which generated 130,301 transcripts with an N50 of 2,702 bp (Table 6). These assembled transcripts were aligned to the T2T assembly with the Program to Assemble Spliced Alignments (PASA) (v2.4.1) [34]. Valid transcript alignments were clustered by genome mapping location and assembled into gene structures, and coding regions were identified with TransDecoder (v5.7.1) (https://github.com/TransDecoder/TransDecoder), yielding the PASA-set. In addition, the clean RNA-seq reads were mapped to the T2T assembly using Hisat2 (v2.0.1) [35]; Stringtie (v1.2.2) [36] and TransDecoder (v5.7.1) were then used to assemble transcripts and identify candidate coding regions, yielding the Stringtie-set. For homology-based prediction, the genomes of five plants were downloaded: rice (T2T-NIP) [12], foxtail millet (RefSeq assembly accession: GCF_000263155.2), maize (T2T Mo17) [13], Arabidopsis thaliana (Col-PEK) [37], and BTx623 sorghum (RefSeq assembly accession: GCF_000003195.3). Their sequences were used as queries against the T2T assembly with GeMoMa (v1.9) [38], together with BAM files from the RNA-seq data, yielding the Homology-set. For ab initio prediction, AUGUSTUS (v3.2.3) [39] was used to predict coding regions in the repeat-masked genome. All predicted gene models were combined by EvidenceModeler (v2.1.0) [40] into a non-redundant set of gene structures, with evidence weights of 10 for the PASA-set, 5 for the Stringtie-set, 5 for the Homology-set, and 1 for AUGUSTUS predictions. Finally, the gene models were refined with PASA (v2.4.1) [34] to add untranslated regions and alternative splicing information.

For functional annotation, the integrated gene set was translated into amino-acid sequences, which were aligned with Diamond (v0.9.30.131) [41] (E-value cutoff of 1E-05) to five public protein databases: KOG, SwissProt [42], Kyoto Encyclopedia of Genes and Genomes (KEGG) [43], the NCBI non-redundant database (NR), and the Translation of European Molecular Biology Laboratory (TrEMBL) database. Protein domains were searched with InterProScan (v5.30) [44], which was also used to extract the Gene Ontology (GO) terms for each gene. In total, 32,855 protein-coding genes were identified, of which 32,746 (99.67%) were annotated in at least one functional database (Table 7).

Table 6.

Statistics of RNAseq de novo assembly of CHBZ.

Item length(bp) number
N10 7,004 2,101
N20 5,073 5,421
N30 4,018 9,760
N40 3,301 15,110
N50 2,702 21,631
N60 2,195 29,626
N70 1,716 39,627
N80 1,187 53,115
N90 626 75,492
Total length 194,566,978
number >=100 bp 130,301
number >=2000bp 33,368
GC rate 47.10%

Table 7.

Number of functional annotations for predicted genes in the CHBZ assembly.

Type Gene number Percentage
Total 32,855 100.00%
Nr 31,243 95.09%
Swissprot 23,054 70.17%
KEGG 22,578 68.72%
KOG 22,125 67.34%
Trembl 31,025 94.43%
Interpro All 32,123 97.77%
GO 18,871 57.44%
Annotated 32,746 99.67%
Unannotated 109 0.33%

Gene expression analysis

The raw RNA-seq reads were quality-controlled with fastp (v0.19.5) [45]. The clean reads were then aligned to the CHBZ genome using Hisat2 (v2.1.0) [46] with the following parameters: ‘--phred33 -p 5 --sensitive --no-discordant --no-mixed -I 1 -X 1000’. A mapped read count matrix was generated using htseq-count (v0.12.4) [47], and gene expression levels were calculated as fragments per kilobase of exon per million mapped reads (FPKM) [48].
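For reference, FPKM normalizes a gene's fragment count by both exon length (in kilobases) and sequencing depth (in millions of mapped fragments). The following Python sketch shows the standard formula; the function name, argument names and the example numbers are hypothetical and are not taken from the study's data.

```python
def fpkm(fragments: float, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of exon per million mapped fragments.

    FPKM = fragments * 1e9 / (exon_length_bp * total_mapped_fragments),
    which is fragments / (length in kb) / (mapped fragments in millions).
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2,000 bp of exons, 25 million mapped fragments.
value = fpkm(500, 2_000, 25_000_000)
print(f"FPKM = {value:.2f}")
```

In this toy case the gene has 250 fragments per kilobase in a library of 25 million fragments, so its FPKM is 10.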

Comparative genomic analysis

To identify syntenic relationships between the CHBZ and BTx623 genomes (BTx623 RefSeq assembly accession: GCF_000003195.3, publicly released by the Sorghum Consortium in 2017), we extracted the longest coding sequence (CDS) of each gene. These CDSs were input into JCVI (v1.1.18) [49] with a minimum of 30 genes per block and the parameter “--cscore=0.99”. The analysis yielded 24,685 orthologous pairs: 24,639 (74.6%) in CHBZ and 24,637 (72.2%) in BTx623 (Fig. 2a).

Fig. 2.


Syntenic relationships and PAVs between CHBZ and BTx623. (a) Syntenic blocks between CHBZ and BTx623 gene pairs detected by JCVI. The x-axis is the CHBZ genome, and the y-axis is the BTx623 genome. (b) Heatmap of 129 CHBZ-specific PAV genes. Rows represent the 129 CHBZ-specific PAV genes, and columns represent the 15 RNA-seq samples. Blue and red boxes represent genes showing lower and higher expression levels, respectively.

Presence/absence variants (PAVs) between the CHBZ and BTx623 genomes were identified using a method similar to that described in Li, Xu et al. [50]. First, the CHBZ genome was divided using a sliding window (window size, 500 bp; step, 100 bp). All resulting sequences were then aligned to the BTx623 genome using BWA (v0.7.17-r1188) [51] with the MEM algorithm (-w 500 -M -t 16). A sequence that failed to align to the BTx623 genome, or aligned with <25% coverage, was defined as a CHBZ-specific sequence. To identify CHBZ PAV genes, the longest CDS per gene was extracted, and genes with >75% of their CDS covered by specific sequences were defined as putative PAV genes. To exclude potential false positives, the longest CHBZ CDSs were mapped to the BTx623 genome using minimap2 (v2.24-r1122) [52] with the parameters ‘-x splice -t 10 -k 12 -a -p 0.4 -N 20’. A gene with mapping quality >10 and coverage >25%, or with identity >90%, was defined as a false-positive PAV gene. In summary, our analysis yielded 161 CHBZ-specific PAV genes and 178 BTx623-specific PAV genes. Notably, 129 of the 161 CHBZ-specific PAV genes were expressed in at least one RNA-seq sample (Fig. 2b). Among the 178 BTx623-specific PAV genes, 163 overlapped with those identified in the pan-genome study [5].
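The windowing and CDS-coverage steps of the PAV screen can be illustrated with a small Python sketch. It assumes 0-based half-open coordinates, non-overlapping CDS exons, and that the list of genome-specific windows has already been obtained from the alignment step; all function names and the toy coordinates are hypothetical, and the alignment itself (BWA, minimap2) is not modeled.

```python
def sliding_windows(seq_len, size=500, step=100):
    """Yield 0-based half-open (start, end) windows along a sequence."""
    for start in range(0, max(seq_len - size, 0) + 1, step):
        yield start, min(start + size, seq_len)


def merge(intervals):
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(cds_exons, specific_windows):
    """Fraction of CDS bases that fall inside genome-specific windows."""
    specific = merge(specific_windows)
    total = sum(end - start for start, end in cds_exons)
    hit = 0
    for start, end in cds_exons:
        for s, e in specific:
            hit += max(0, min(end, e) - max(start, s))
    return hit / total


# Toy example: two windows failed to align; the gene has two CDS exons.
specific = [(200, 700), (300, 800)]
cds = [(250, 450), (600, 900)]
frac = covered_fraction(cds, specific)
print(f"{frac:.0%} of CDS covered -> putative PAV gene: {frac > 0.75}")
```

In this toy case 400 of the 500 CDS bases lie in specific sequence, so the gene passes the >75% threshold and would go on to the false-positive check described above.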

Identification of centromeres and telomeres

Using a method similar to that described for the wild blueberry T2T assembly [53], centromeres and telomeres were identified with QuarTeT (v1.1.1) [54] using the “-c plant” option. QuarTeT provides a suite of tools for automating T2T genome assembly and analysis, including the TeloExplorer module for telomere identification and CentroMiner for predicting centromere candidates. Briefly, CentroMiner identifies tandem repeat monomers, selects potential centromeric repeats based on period and copy number, clusters them to minimize redundancy, and aligns representative monomers to the corresponding chromosomes. The predicted centromere regions ranged in length from 4.31 Mb to 13.00 Mb. Notably, 98.58% of the centromere regions consisted of repetitive sequences, with LTR/Gypsy elements and tandem repeats being the predominant categories (Table 8). We also observed large blank regions in the Hi-C interaction heatmap at the centromeres, a phenomenon also reported in the Scutellaria baicalensis gap-free genome [55] (Fig. 3).

Table 8.

Characteristics of the predicted centromeres.

Chr | Start position (bp) | End position (bp) | Length (bp) | Tandem repeat (%) | Interspersed repeat, total (%) | LTR/Gypsy (%) | LTR/Copia (%) | LINE (%) | SINE (%) | DNA (%) | Unclassified (%)
Chr01 | 34,928,863 | 40,321,337 | 5,392,475 | 50.46 | 98.35 | 62.88 | 8.90 | 0.06 | 0.00 | 1.13 | 35.85
Chr02 | 25,957,327 | 38,958,868 | 13,001,542 | 19.52 | 95.07 | 71.69 | 12.30 | 0.57 | 0.00 | 3.82 | 11.93
Chr03 | 33,253,963 | 43,005,316 | 9,751,354 | 52.58 | 98.54 | 54.85 | 11.27 | 0.09 | 0.00 | 4.09 | 36.86
Chr04 | 28,116,367 | 35,694,187 | 7,577,821 | 51.22 | 97.75 | 58.00 | 7.64 | 0.44 | 0.00 | 1.42 | 38.89
Chr05 | 31,489,790 | 40,675,439 | 9,185,650 | 45.85 | 98.08 | 64.60 | 8.48 | 0.07 | 0.08 | 1.13 | 30.64
Chr06 | 22,061,995 | 28,582,296 | 6,520,302 | 67.15 | 99.13 | 44.03 | 10.54 | 0.01 | 0.00 | 0.12 | 56.41
Chr07 | 33,906,883 | 38,220,429 | 4,313,547 | 68.07 | 97.94 | 43.62 | 5.32 | 0.01 | 0.00 | 0.25 | 53.55
Chr08 | 25,494,256 | 32,886,453 | 7,392,198 | 57.78 | 98.97 | 55.28 | 5.96 | 0.00 | 0.00 | 0.44 | 46.12
Chr09 | 28,090,934 | 35,539,312 | 7,448,379 | 67.32 | 98.36 | 46.95 | 6.39 | 0.03 | 0.00 | 0.50 | 51.23
Chr10 | 25,626,568 | 34,864,952 | 9,238,385 | 42.31 | 97.82 | 62.67 | 6.51 | 0.17 | 0.00 | 2.18 | 31.44

Note: Some repetitive elements may partly include another element domain.

Unclassified: predicted repeats that cannot be classified by RepeatMasker.

LINE, long interspersed nuclear elements; SINE, short interspersed nuclear elements; LTR, long terminal repeat.
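The centromere lengths in Table 8 are consistent with 1-based, inclusive coordinates (length = end − start + 1). The sketch below recomputes them from the tabulated start and end positions and reports the shortest and longest predicted centromere; the variable and function names are illustrative.

```python
# Predicted centromere coordinates (bp), copied from Table 8.
centromeres = {
    "Chr01": (34_928_863, 40_321_337),
    "Chr02": (25_957_327, 38_958_868),
    "Chr03": (33_253_963, 43_005_316),
    "Chr04": (28_116_367, 35_694_187),
    "Chr05": (31_489_790, 40_675_439),
    "Chr06": (22_061_995, 28_582_296),
    "Chr07": (33_906_883, 38_220_429),
    "Chr08": (25_494_256, 32_886_453),
    "Chr09": (28_090_934, 35_539_312),
    "Chr10": (25_626_568, 34_864_952),
}


def span_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


lengths = {chrom: span_length(*pos) for chrom, pos in centromeres.items()}
shortest = min(lengths, key=lengths.get)
longest = max(lengths, key=lengths.get)
print(f"Shortest: {shortest}, {lengths[shortest] / 1e6:.2f} Mb")
print(f"Longest: {longest}, {lengths[longest] / 1e6:.2f} Mb")
```

The extremes are Chr07 (4,313,547 bp) and Chr02 (13,001,542 bp), in line with the 4.31 Mb to 13.00 Mb range reported in the text.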

Fig. 3.


Chromatin interactions at 100 kb resolution reveal the characteristics of the centromere regions in the CHBZ genome.

Data Records

The sequencing data have been deposited at the Sequence Read Archive with the accession number SRP472912 [56], under the project identifier PRJNA1037263. Additionally, the assembled genome sequence has been made available in GenBank with the Whole Genome Shotgun project accession GCA_040267525.1 [57]. Furthermore, files pertaining to the genome assembly, gene structure annotation, and repeat annotation have been archived in the Figshare repository [58].

Technical Validation

Evaluation of the genome assembly

Multiple approaches were employed to validate the accuracy and completeness of the CHBZ genome assembly. We predicted centromeric sequences from the assembly and found that all 10 centromeres were captured (Fig. 4a). Additionally, we identified all 20 telomeres by using the seven-base telomere repeat sequence (‘AAACCCT’) as a query (Fig. 4a). The number of telomere repeats in the CHBZ assembly was markedly higher than in the BTx623 genome (Fig. 4b). The Hi-C heatmap was highly consistent across all chromosomes, supporting the accurate sequencing, ordering, and orientation of contigs (Fig. 4c). The LTR assembly index (LAI), which measures the completeness of LTR retrotransposon assembly, was 23.63 (Table 3), higher than that of BTx623 (LAI: 19.52) and even that of the recent T2T assembly of wild blueberry (LAI: 20.22) [53]. Completeness was further supported by high read mapping rates, with 100% of ONT reads and 99.9% of HiFi reads aligning to the CHBZ assembly. In addition, the assembly achieved a quality value (QV) of 61.60 (Table 3). Finally, the Benchmarking Universal Single-Copy Orthologs (BUSCO) [59] test identified 99.01% of the 1,614 genes in the embryophyta set in the CHBZ assembly (Fig. 4d). Overall, these results demonstrate the high quality and reliability of the CHBZ genome assembly.
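As a rough illustration of querying chromosome ends with the ‘AAACCCT’ repeat, the Python sketch below counts the longest tandem run of the unit near the start of a sequence and of its reverse complement near the end. This is a simplified stand-in, not the TeloExplorer algorithm: the function name, the 1,000 bp end window and the toy sequence are assumptions made for the example.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def telomere_copies(seq, unit="AAACCCT", window=1000):
    """Longest tandem run (in copies) of `unit` near the 5' end and of its
    reverse complement near the 3' end of `seq`."""
    rc_unit = unit.translate(COMPLEMENT)[::-1]

    def longest_run(region, motif):
        runs = re.findall(f"(?:{motif})+", region)
        return max((len(run) // len(motif) for run in runs), default=0)

    seq = seq.upper()
    return longest_run(seq[:window], unit), longest_run(seq[-window:], rc_unit)


# Toy chromosome: 30 repeat copies at the start, 25 reverse-complement copies at the end.
toy = "AAACCCT" * 30 + "ACGT" * 500 + "AGGGTTT" * 25
left, right = telomere_copies(toy)
print(f"5' end: {left} copies, 3' end: {right} copies")
```

A chromosome would be counted as having both telomeres when both ends show a substantial run; the copy-number cutoff to apply is a choice left to the analyst.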

Fig. 4.


The high quality of the CHBZ genome. (a) Telomere detection map. Triangles and circles represent telomeres and centromeres, respectively, within the CHBZ assembled chromosomes; red indicates high gene density and blue indicates low gene density. (b) Number of telomere repeats in the CHBZ and BTx623 genomes. (c) Heatmap displaying Hi-C interactions of CHBZ pseudomolecules. (d) BUSCO assessments of the CHBZ genome.

Evaluation of the gene annotation

First, the exon and intron length distributions agree with those of three related species, supporting the reliability of the annotation (Fig. 5a,b). Second, a total of 32,284 genes (98.26%) received evidence-based support (Table 9). Furthermore, 32,746 protein-coding genes (99.67%) were annotated in at least one database and 19,252 (58.60%) were supported by all six databases (Table 7, Fig. 5c). Notably, 25,873 genes (78.75%) showed detectable transcriptional activity (FPKM ≥ 1) across the 15 RNA-seq datasets (Fig. 5d). Moreover, the predicted proteins achieved a complete BUSCO score of 99.38%, indicating high-quality gene annotation (Table 3). In summary, the gene annotation is highly accurate and complete.

Fig. 5.


Quality assessment of the protein-coding genes in the CHBZ assembly. (a) Comparison of exon length among four related plant gene sets. Window refers to the length of every point. (b) Comparison of intron length among four related plant gene sets. No obvious unexpected differences exist among these gene sets, indicating the high quality of the gene structure annotation. (c) Petal diagram using six public databases. (d) Proportions of genes that could be transcriptionally detected in CHBZ.

Table 9.

Summary of evidence for the CHBZ gene models.

Evidence | >=30% overlap: number | >=30% overlap: rate (%) | >=50% overlap: number | >=50% overlap: rate (%) | >=80% overlap: number | >=80% overlap: rate (%)
C | 1,104 | 3.36 | 1,152 | 3.51 | 1,453 | 4.42
H (single) | 573 | 1.74 | 656 | 2.00 | 917 | 2.79
H (more) | 1,073 | 3.27 | 1,241 | 3.78 | 2,011 | 6.12
P | 2,492 | 7.58 | 2,639 | 8.03 | 3,013 | 9.17
HC | 2,078 | 6.32 | 2,435 | 7.41 | 4,715 | 14.35
PC | 614 | 1.87 | 567 | 1.73 | 452 | 1.38
PH | 5,413 | 16.48 | 5,507 | 16.76 | 4,914 | 14.96
PHC | 18,937 | 57.64 | 18,034 | 54.89 | 14,559 | 44.31
Total | 32,284 | 98.26 | 32,231 | 98.10 | 32,034 | 97.50

P: ab initio prediction; H: homology-based prediction; C: cDNA or transcriptome-based prediction; single: with one gene source; more: with two or more gene sources.

Acknowledgements

This work was funded by the Project of Conservation and Utilization of Agricultural Germplasm Resources in Shanxi Province (sxzyk202201) and the Basic Research Program of Shanxi Province (20210302124238).

Author contributions

P.G., M.X.L. and Z.X.M. conceived the study. M.L., S.H., H.G.W. and H.B.Q. collected and prepared the samples. C.H.C., X.K.Y. and J.B.J. performed bioinformatics analysis. All authors read and approved the final manuscript.

Code availability

No specific code was developed for this study. The data analyses were conducted following the manuals and protocols provided by the developers of the relevant bioinformatics tools, which are described in the Methods section along with the versions used.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Meng Li, Chunhai Chen, Haigang Wang, Huibin Qin, Sen Hou, Xukui Yang.

Contributor Information

Meng Li, Email: nkypzslm@163.com.

Peng Gao, Email: gaopeng@genomics.cn.

Minxuan Liu, Email: liuminxuan@caas.cn.

Zhixin Mu, Email: muzx2008@sina.com.

References

  • 1.Mccormick, R. F. et al. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. The Plant Journal (2017). [DOI] [PubMed]
  • 2.Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of grasses. Nature457, 551–556 (2009). 10.1038/nature07723 [DOI] [PubMed] [Google Scholar]
  • 3.Deschamps, S. et al. A chromosome-scale assembly of the sorghum genome using nanopore sequencing and optical mapping. Nature Communications9 (2018). [DOI] [PMC free article] [PubMed]
  • 4.Cooper, E. A. et al. A new reference genome for Sorghum bicolor reveals high levels of sequence similarity between sweet and grain genotypes: implications for the genetics of sugar metabolism. BMC Genomics20 (2019). [DOI] [PMC free article] [PubMed]
  • 5.Tao, Y. et al. Extensive variation within the pan-genome of cultivated and wild sorghum. Nature Plants7, 766–773 (2021). 10.1038/s41477-021-00925-x [DOI] [PubMed] [Google Scholar]
  • 6.Zhang, S. et al. Variation in mitogenome structural conformation in wild and cultivated lineages of sorghum corresponds with domestication history and plastome evolution. BMC Plant Biology23 (2023). [DOI] [PMC free article] [PubMed]
  • 7.Nurk, S. et al. The complete sequence of a human genome. Science376, 44–53 (2021). 10.1126/science.abj6987 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science (2022). [DOI] [PMC free article] [PubMed]
  • 9.Shi, X. et al. The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding. Horticulture Research10 (2023). [DOI] [PMC free article] [PubMed]
  • 10.Huang, H. et al. Telomere-to-telomere haplotype-resolved reference genome reveals subgenome divergence and disease resistance in triploid Cavendish banana. Horticulture Research10 (2023). [DOI] [PMC free article] [PubMed]
  • 11.Navrátilová, P. et al. Prospects of telomere-to-telomere assembly in barley: Analysis of sequence gaps in the MorexV3 reference genome. Plant Biotechnology Journal20, 1373–1386 (2021). 10.1111/pbi.13816 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Shang, L. et al. A complete assembly of the rice Nipponbare reference genome. Molecular plant16, 1232–1236 (2023). 10.1016/j.molp.2023.08.003 [DOI] [PubMed] [Google Scholar]
  • 13.Chen, J. et al. A complete telomere-to-telomere assembly of the maize genome. Nature Genetics55, 1221–1231 (2023). 10.1038/s41588-023-01419-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Belton, J. M. et al. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods58, 268–276 (2012). 10.1016/j.ymeth.2012.05.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Chin, C. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods10, 563–569 (2013). 10.1038/nmeth.2474 [DOI] [PubMed] [Google Scholar]
  • 16.Wang, H. et al. Estimation of genome size using k-mer frequencies from corrected long reads, arXiv. 2003. 11817 (2020).
  • 17. Chen, Y. et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience 1 (2018).
  • 18. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 1–6 (2021). doi: 10.1038/s41592-020-01056-5
  • 19. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009). doi: 10.1093/bioinformatics/btp324
  • 20. Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology 16 (2015).
  • 21. Durand, N. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Systems 3, 95–98 (2016). doi: 10.1016/j.cels.2016.07.002
  • 22. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 6333 (2017). doi: 10.1126/science.aal3327
  • 23. Xu, G. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. GigaScience 1 (2019).
  • 24. Chen, Y. et al. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nature Communications 12 (2021).
  • 25. Rhie, A. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nature Methods 19, 687–695 (2022). doi: 10.1038/s41592-022-01440-3
  • 26. Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods 19, 705–710 (2022). doi: 10.1038/s41592-022-01457-8
  • 27. Vaser, R., Sovic, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27, 737–746 (2017). doi: 10.1101/gr.214270.116
  • 28. Chen, N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Current Protocols in Bioinformatics 4 (2004).
  • 29. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35, W265–W268 (2007). doi: 10.1093/nar/gkm286
  • 30. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21 (Suppl 1), i351–i358 (2005). doi: 10.1093/bioinformatics/bti1018
  • 31. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6 (2015).
  • 32. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27, 573–580 (1999). doi: 10.1093/nar/27.2.573
  • 33. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29, 644–652 (2011). doi: 10.1038/nbt.1883
  • 34. Haas, B. J. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31, 5654–5666 (2003). doi: 10.1093/nar/gkg770
  • 35. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12, 357–360 (2015). doi: 10.1038/nmeth.3317
  • 36. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology 20, 278 (2019). doi: 10.1186/s13059-019-1910-1
  • 37. Hou, X., Wang, D., Cheng, Z., Wang, Y. & Jiao, Y. A near-complete assembly of an Arabidopsis thaliana genome. Molecular Plant 15, 1247–1250 (2022). doi: 10.1016/j.molp.2022.05.014
  • 38. Keilwagen, J. et al. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods in Molecular Biology (2019).
  • 39. Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research 33, W465–W467 (2005). doi: 10.1093/nar/gki458
  • 40. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9, R7 (2008). doi: 10.1186/gb-2008-9-1-r7
  • 41. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60 (2015). doi: 10.1038/nmeth.3176
  • 42. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Research 27, 49–54 (1999). doi: 10.1093/nar/27.1.49
  • 43. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28, 27–30 (2000). doi: 10.1093/nar/28.1.27
  • 44. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014). doi: 10.1093/bioinformatics/btu031
  • 45. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018). doi: 10.1093/bioinformatics/bty560
  • 46. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907–915 (2019). doi: 10.1038/s41587-019-0201-4
  • 47. Anders, S., Pyl, P. T. & Huber, W. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015). doi: 10.1093/bioinformatics/btu638
  • 48. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L. & Pachter, L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology 12, R22 (2011). doi: 10.1186/gb-2011-12-3-r22
  • 49. Tang, H. et al. Synteny and Collinearity in Plant Genomes. Science 320, 486–488 (2008). doi: 10.1126/science.1153917
  • 50. Li, T. et al. Genome assembly of KA105, a new resource for maize molecular breeding and genomic research. The Crop Journal (2023).
  • 51. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Genomics (2013).
  • 52. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2017). doi: 10.1093/bioinformatics/bty191
  • 53. Zeng, T. et al. The Telomere-to-telomere gap-free reference genome of wild blueberry (Vaccinium duclouxii) provides its high soluble sugar and anthocyanin accumulation. Horticulture Research (2023).
  • 54. Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Horticulture Research (2023).
  • 55. Pei, T. et al. Gap-free genome assembly and CYP450 gene family analysis reveal the biosynthesis of anthocyanins in Scutellaria baicalensis. Horticulture Research 10 (2023).
  • 56. NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP472912 (2024).
  • 57. NCBI GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_040267525.1 (2024).
  • 58. Wang, H. Genome assembly and annotation of Sorghum bicolor CHBZ. figshare (2024). doi: 10.6084/m9.figshare.24532924.v1
  • 59. Waterhouse, R. M. et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Molecular Biology and Evolution 35, 543–548 (2018). doi: 10.1093/molbev/msx319

Data Availability Statement

No specific code was developed for this study. The data analyses were conducted following the manuals and protocols provided by the developers of the relevant bioinformatics tools, which are described in the Methods section along with the versions used.


Articles from Scientific Data are provided here courtesy of Nature Publishing Group