Telomere-to-telomere genome assembly of sorghum

Meng Li; Chunhai Chen; Haigang Wang; Huibin Qin; Sen Hou; Xukui Yang; Jianbo Jian; Peng Gao; Minxuan Liu; Zhixin Mu

doi:10.1038/s41597-024-03664-8

2024 Aug 2;11:835. doi: 10.1038/s41597-024-03664-8

PMCID: PMC11297213 PMID: 39095379

Abstract

“Cuohu Bazi” (CHBZ) is an ancient sorghum variety collected from the fields of China and known for agronomic traits such as dwarf stature and early maturation. In this study, we present the first telomere-to-telomere (T2T), gap-free genome assembly of CHBZ, generated using PacBio HiFi reads, Oxford Nanopore Technologies (ONT) reads, and Hi-C data. The assembled genome comprises 724.85 Mb and resolves all 3,913 gaps present in the previous sorghum BTx623 reference genome. Notably, the T2T assembly captures all 10 centromeres and all 20 telomeres, providing strong support for its integrity. The assembly is of high quality in terms of contiguity (contig N50: 71.1 Mb), completeness (BUSCO score: 99.01%, k-mer completeness: 98.88%), and correctness (QV: 61.60). Repetitive sequences account for 70.41% of the genome, and a total of 32,855 protein-coding genes have been annotated. Furthermore, 161 CHBZ-specific presence/absence variant (PAV) genes were identified in comparison with the BTx623 genome. This study provides valuable insights for future research on sorghum genetics, genomics, and evolutionary history.

Subject terms: Plant molecular biology, Agricultural genetics

Background & Summary

Sorghum is a widely cultivated cereal crop, particularly in Africa, and ranks fifth in global cereal production¹. It is remarkably adaptable and stress resistant, tolerating drought, waterlogging, saline-alkali soils, barren soils, and high temperatures. It serves as a staple food for approximately 500 million people in Africa and Asia, and provides a source of energy, forage, and industrial raw material for the brewing industry. The sorghum genome has been extensively studied: the first reference genome (BTx623) was published in 2009², followed by the genome of the sorghum inbred transformation line Tx430 in 2018³ and the sweet sorghum genome in 2019⁴. In 2021, the first sorghum pan-genome was completed⁵, and in 2023, seven high-quality sorghum organelle genomes were published⁶. These milestones mark significant advances in molecular research on sorghum. However, the widely used sorghum reference genome (BTx623, RefSeq assembly accession: GCF_000003195.3)² still has relatively low contiguity and quality, with a contig N50 of 1.3 Mb and 3,913 gaps in total. High-quality sorghum genomes have therefore long been needed.

Recent advances in genome sequencing and assembly methodology have made telomere-to-telomere (T2T), gap-free assembly of chromosome sequences possible. T2T assemblies enable the exploration of previously inaccessible regions such as telomeres and centromeres, and open more in-depth research directions in animals and plants⁷⁻⁹. T2T genome assemblies have been reported for several important crops, including banana¹⁰, barley¹¹, rice¹², and maize¹³. “Cuohu Bazi” (CHBZ) is an ancient local sorghum landrace collected in China between 1982 and 1986 (Fig. 1a). Field evaluation has shown that CHBZ possesses excellent agronomic traits, such as dwarf stature and early maturation: the plant height is approximately 1.30 meters, and the growth period is 100 days. The T2T genome assembly of CHBZ will provide valuable guidance for sorghum breeding efforts.

Fig. 1 — Overview of CHBZ and its genome. **(a)** Photographs of CHBZ. **(b)** Circos plot illustrating the genome of CHBZ. The plot includes the following components, arranged from inside to outside: (I) Collinear regions within the CHBZ assembly; (II) Gene density in 1-Mb sliding windows; (III) GC content in non-overlapping 1-Mb windows; (IV) Percentage of interspersed repeats in 1-Mb sliding windows; (V) Percentage of tandem repeats in 1-Mb sliding windows; (VI) Length of pseudo-chromosomes in megabases (Mb).

In this study, we aimed to generate the first T2T gap-free genome for CHBZ using a combination of the latest sequencing technologies, including PacBio high-fidelity (HiFi) sequencing, ultra-long Oxford Nanopore Technologies (ONT) sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing. The genomic resources and gene structures produced by this study lay the groundwork for future research on CHBZ genetics and breeding.

Methods

Sample collection and sequencing

CHBZ, grown in the germplasm resource nursery at the Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Taiyuan, 030031, China, was chosen for DNA and RNA sequencing. Fresh, healthy, young seedlings were harvested, immediately frozen in liquid nitrogen, and stored at −80 °C. Genomic DNA was extracted with the cetyltrimethylammonium bromide method and evaluated using a NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). For the PacBio HiFi libraries, the “Using SMRTbell Express Template Prep Kit 2.0 With Low DNA Input” protocol from PacBio was followed, with an insert size of approximately 20 kb (Pacific Biosciences, USA). The libraries were then sequenced on the PacBio Sequel II platform in circular consensus sequencing (CCS) mode. For ONT ultra-long sequencing, the library was prepared using the Oxford Nanopore SQK-LSK109 kit and sequenced on a PromethION flow cell (Oxford Nanopore Technologies, Oxford, UK). For Hi-C sequencing, libraries based on the DpnII restriction enzyme were generated as previously described¹⁴ and sequenced on the MGISEQ-2000 platform. Total RNA from roots, stems, leaves and spikes was isolated using the NEBNext Poly(A) mRNA Magnetic Isolation Module. DNase I (Thermo Fisher Scientific, Wilmington, DE, USA) was used to remove genomic DNA. RNA integrity was checked using a BioAnalyzer 2100 (Agilent Technologies, Santa Clara, USA). RNA libraries were prepared using the NEBNext Ultra RNA Library Prep Kit for Illumina with an insert size of 300 bp, and were sequenced on an MGISEQ-2000 instrument to generate 150 bp paired-end reads.

In total, we generated 304.06 Gb of ONT reads (~419X coverage) with an N50 of 52.44 kb, 28.65 Gb of PacBio HiFi CCS reads (~40X coverage) with an N50 of 16.64 kb, 304.93 Gb of Hi-C data (Illumina paired-end reads, ~421X coverage), and 123.30 Gb of RNA-seq data (Tables 1 and 2).
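The coverage figures quoted above follow from dividing each data yield by the estimated genome size. The short sketch below reproduces the "Genome depth" column of Table 1 from the yields reported there, assuming the 724.95 Mb k-mer based estimate as the denominator (as stated in the table footnote); the function and variable names are illustrative only.

```python
# Sequencing depth = total sequenced bases / estimated genome size.
GENOME_SIZE_MB = 724.95  # k-mer based estimate reported in the text

# Total yield per data type in Gb, taken from Table 1.
yields_gb = {
    "Raw ultra-long ONT": 304.06,
    "Error-corrected ONT": 36.03,
    "PacBio subreads": 477.89,
    "PacBio CCS (HiFi)": 28.65,
    "Raw Hi-C": 304.93,
    "Clean Hi-C": 303.56,
}


def depth(total_gb: float, genome_mb: float = GENOME_SIZE_MB) -> float:
    """Return fold coverage for a yield in Gb and a genome size in Mb."""
    return total_gb * 1000.0 / genome_mb


# Round to whole-fold coverage, as in Table 1.
depths = {name: round(depth(gb)) for name, gb in yields_gb.items()}

for name, fold in depths.items():
    print(f"{name}: ~{fold}X")
```

Rounded to whole numbers, these values match the depths listed in Table 1 (419, 50, 659, 40, 421 and 419).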

Table 1.

Summary of DNA sequencing data of the CHBZ genome.

Data type	Total length (Gb)	Genome depth* (X)	Min length (bp)	Max length (bp)	N50 length of reads (bp)
Raw ultra-long ONT data	304.06	419	5,000	825,384	52,442
Error corrected ONT data	36.03	50	1,025	278,940	80,402
PacBio subreads data	477.89	659	100	529,636	16,559
PacBio CCS data	28.65	40	517	50,265	16,638
Raw Hi-C data	304.93	421	150	150	150
Clean Hi-C data	303.56	419	150	150	150

*Estimated based on the size of 724.95 Mb.

Table 2.

Summary of RNAseq sequencing data of the CHBZ genome.

Sample	Tissue_Stage	Total raw reads (M)	Total clean reads (M)	Total clean bases (Gb)	Clean reads Q20 (%)	Clean reads GC (%)
WHPLAolfhRAAARAA-9313	Root_seedling	62.03	59.67	8.95	98.41	53.46
WHPLAolfhRAABRAA-9314	Stem_seedling	59.49	57.41	8.61	98.33	55.11
WHPLAolfhRAACRAA-9315	Leaf_seedling	51.11	49.75	7.46	98.35	55.25
WHPLAolfhRAADRAA-9316	Root_jointing	53.21	52.21	7.83	98.33	53.2
WHPLAolfhRAAERAA-9317	Stem_jointing	62.67	60.96	9.14	98.43	52.11
WHPLAolfhRAAFRAA-9318	Leaf_jointing	59.22	56.9	8.53	98.41	53.71
WHPLAolfhRAAGRAA-9319	Ear_jointing	54.05	52.49	7.87	98.31	52.54
WHPLAolfhRAAHRAA-9320	Root_flowering	46.69	45.54	6.83	98.32	52.57
WHPLAolfhRAAIRAA-9321	Stem_flowering	62.12	60.21	9.03	98.47	53.83
WHPLAolfhRAAJRAA-9322	Leaf_flowering	57.42	56.31	8.45	98.23	53.45
WHPLAolfhRAAKRAA-9323	Ear_flowering	56.75	55.33	8.3	98.31	54.28
WHPLAolfhRAALRAA-9324	Root_filling	57.48	56.29	8.44	98.37	52.44
WHPLAolfhRAAMRAA-9325	Stem_filling	53.06	52.06	7.81	98.35	51.51
WHPLAolfhRAANRAA-9326	Leaf_filling	56.98	55.71	8.36	98.36	53.47
WHPLAolfhRAAORAA-9327	Ear_filling	52.32	51.25	7.69	98.4	52.32
Total	NA	844.60	822.09	123.30	98.36	53.28

Genome assembly

The subreads generated on the PacBio Sequel II platform were processed using the CCS algorithm of SMRT Link (v11.1.0)¹⁵ with the following parameters: “--minPasses 3 --minPredictedAccuracy 0.99 --minLength 500”. Subsequently, we performed a genome survey on the HiFi reads using GCE (Genomic Character Estimator) (v1.0.2)¹⁶ with the parameter ‘-k 17’ to assess the size and heterozygosity of the CHBZ genome. The estimated size of the CHBZ genome was approximately 724.95 Mb, with a heterozygosity rate of 0.10% and a repeat content of 67.74% (Table 3). After quality control using SOAPnuke (v2.0)¹⁷ with the parameters “-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5”, a total of 303.56 Gb of clean Hi-C data was obtained. Using the PacBio HiFi reads, ONT reads, and clean Hi-C data, the primary contigs were generated with Hifiasm (v 0.19.5)¹⁸ using default parameters. To anchor contigs onto chromosomes, we used BWA (v 0.7.12)¹⁹ to align the clean Hi-C data to the assembled contigs and then filtered low-quality reads using the HiC-Pro pipeline²⁰ with default parameters. The valid reads were used to anchor chromosomes with Juicer²¹ and the 3d-dna pipeline²². LR_Gapcloser²³ was used to close gaps in the assembled genome, using error-corrected ONT long reads generated by NECAT²⁴. To further refine the genome, the T2T assembly was polished using a method similar to that described in Mc Cartney, Shafin et al.²⁵. Briefly, the HiFi reads were aligned to the T2T assembly using Winnowmap2 (v 2.03)²⁶. The output alignments were filtered with the ‘falconc bam-filter-clipped’ tool to remove all secondary alignments and alignments with excessive clipping. Finally, racon (v 1.5.0)²⁷ was run with the filtered alignments.
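The alignment filtering step used before polishing can be illustrated with a minimal pure-Python sketch that operates on SAM text records. It is not the `falconc bam-filter-clipped` implementation, whose exact criteria are not given in the text: the clipping threshold `max_clip` and the helper names are hypothetical, and only the general idea (drop secondary alignments, drop heavily clipped alignments) is taken from the description above.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
SECONDARY = 0x100  # SAM FLAG bit marking a secondary alignment


def clipped_fraction(cigar: str) -> float:
    """Fraction of the read that is soft- or hard-clipped."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Full read length: query-consuming operations plus hard clips.
    read_len = sum(n for n, op in ops if op in "MIS=XH")
    clipped = sum(n for n, op in ops if op in "SH")
    return clipped / read_len if read_len else 0.0


def keep_alignment(sam_line: str, max_clip: float = 0.1) -> bool:
    """Keep primary alignments whose clipped fraction is at most max_clip.

    max_clip is a hypothetical threshold chosen for illustration.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & SECONDARY:
        return False  # secondary alignment
    if cigar == "*":
        return False  # no alignment information
    return clipped_fraction(cigar) <= max_clip


def sam_record(name: str, flag: int, cigar: str) -> str:
    """Build a toy 11-column SAM record for demonstration."""
    return "\t".join([name, str(flag), "Chr01", "100", "60", cigar,
                      "*", "0", "0", "*", "*"])


records = [
    sam_record("read1", 0, "100M"),      # clean primary alignment
    sam_record("read2", 256, "100M"),    # secondary alignment
    sam_record("read3", 0, "40S60M"),    # 40% soft-clipped
    sam_record("read4", 16, "5S95M"),    # 5% soft-clipped, reverse strand
]
kept = [r.split("\t")[0] for r in records if keep_alignment(r)]
print(kept)
```

In a real workflow the same logic would be applied to BAM files with a dedicated tool; the sketch only shows which records such a filter would retain.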

Table 3.

Summary statistics of the CHBZ genome assembly.

Genomics feature	Value
Estimated genome size (Mb)	724.95
Estimated heterozygosity rate (%)	0.10
Estimated repeat content (%)	67.74
Assembled genome size (Mb)	724.85
N50 value (Mb)	71.06
Number of base chromosomes	10
Number of gap-free chromosomes	10
Number of candidate telomeres	20
Number of candidate centromeres	10
GC content (%)	43.90
Quality value (QV)	61.60
LTR assembly index (LAI)	23.63
Repetitive sequences (Mb)	510.36 (70.41%)
Genome BUSCOs (%)	99.01
Number of genes	32,855
Gene BUSCOs (%)	99.38

Overall, the final CHBZ genome assembly is approximately 724.85 Mb, with an N50 of 71.06 Mb (Table 3). The genome sequences were clustered and oriented into 10 pseudochromosomes (Fig. 1b, Table 4).
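Because each pseudochromosome is a single contig, the assembly size and N50 reported above can be recomputed directly from the chromosome lengths in Table 4. The sketch below shows the standard N50 calculation; the function name is illustrative.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together contain at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Chromosome (single-contig) lengths in bp, Chr01 to Chr10, from Table 4.
chrom_lengths = [
    86_020_532, 78_662_643, 81_090_025, 71_062_133, 78_004_805,
    66_673_040, 69_900_438, 66_635_546, 63_867_428, 62_932_570,
]

assembly_size = sum(chrom_lengths)
contig_n50 = n50(chrom_lengths)
print(assembly_size, contig_n50)  # 724849160 71062133
```

The total of 724,849,160 bp and the N50 of 71,062,133 bp (the length of Chr04) agree with the 724.85 Mb and 71.06 Mb values in Table 3.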

Table 4.

The quality of each chromosome in the CHBZ genome.

ID	Length (bp)	GC%	Contig number	Gene number
Chr01	86,020,532	44.23	1	5,427
Chr02	78,662,643	43.76	1	4,296
Chr03	81,090,025	43.90	1	4,447
Chr04	71,062,133	43.88	1	3,631
Chr05	78,004,805	43.80	1	2,418
Chr06	66,673,040	43.77	1	2,847
Chr07	69,900,438	43.83	1	2,333
Chr08	66,635,546	43.75	1	2,029
Chr09	63,867,428	44.02	1	2,564
Chr10	62,932,570	43.70	1	2,863

Repeat annotation

Two strategies, de novo and homology-based prediction, were used to identify repeat elements. De novo repeats were identified with RepeatModeler (v1.0.4)²⁸, and long terminal repeats were annotated with LTR_FINDER (v1.0.7)²⁹. Transposable elements were detected at the DNA and protein levels with RepeatMasker (v4.0.7)³⁰ and RepeatProteinMask (v4.0.7), respectively, based on the Repbase database³¹. Finally, tandem repeats were identified with Tandem Repeats Finder (v4.10.0)³².

In the CHBZ genome, repeat sequences accounted for 510.36 Mb, representing 70.41% of the assembly (Table 3). Long terminal repeat (LTR) retrotransposons (55.75% of the genome) were the most abundant repetitive elements, consistent with a previous study¹ (Table 5).

Table 5.

Transposable element (TE, interspersed repeat) contents in the CHBZ assembly.

Type	Repbase TEs		TE proteins		De novo		Combined TEs*	
	Length (bp)	% of genome	Length (bp)	% of genome	Length (bp)	% of genome	Length (bp)	% of genome
Class I	371,524,099	51.26	115,028,697	15.87	364,008,956	50.22	417,863,762	57.65
LTR	361,035,622	49.81	108,970,406	15.03	359,774,055	49.63	404,105,204	55.75
LINE	10,408,720	1.44	6,058,291	0.84	4,234,901	0.58	13,678,801	1.89
SINE	79,757	0.01	0	0.00	0	0.00	79,757	0.01
Class II: DNA	71,208,394	9.82	10,201,586	1.41	28,984,885	4.00	79,678,544	10.99
Unclassified†	15,457	0.00	0	0.00	425,680	0.06	441,137	0.06
Total	442,066,952	60.99	125,226,177	17.28	385,130,064	53.13	475,407,461	65.59

Note: This table does not include tandem repeats. Some elements may partly include the domain of another element.

*Combined: the non-redundant consensus of all repeat prediction/classification methods employed.

†Unclassified: predicted repeats that could not be classified by RepeatMasker.

LINE, long interspersed nuclear elements; SINE, short interspersed nuclear elements; LTR, long terminal repeat.

Protein-coding genes prediction and functional annotation

Gene prediction combined transcriptome-based, homology-based, and ab initio methods. For transcriptome-based prediction, 123.3 Gb of clean reads sequenced on the DNBSEQ-2000 from root, stem, leaf and ear tissues at four stages (seedling, jointing, heading and flowering, and filling and maturity) were assembled with Trinity (v 2.15.1)³³ using the parameters “--max_memory 200G --min_contig_length 200 --genome_guided_bam merged_sorted.bam --full_cleanup --min_kmer_cov 3 --min_glue 3 --bfly_opts '-V 5 --edge-thr=0.1 --stderr' --genome_guided_max_intron 10000”, which generated 130,301 transcripts with an N50 of 2,702 bp (Table 6). The assembled transcripts were aligned against the T2T assembly with the Program to Assemble Spliced Alignments (PASA) (v 2.4.1)³⁴. Valid transcript alignments were clustered by genome mapping location and assembled into gene structures, and coding regions were obtained with TransDecoder (v 5.7.1) (https://github.com/TransDecoder/TransDecoder) (PASA-set). In addition, the clean RNA-seq reads were mapped to the T2T assembly using Hisat2 (v 2.0.1)³⁵, and Stringtie (v 1.2.2)³⁶ and TransDecoder (v 5.7.1) were used to assemble the transcripts and identify candidate coding regions as gene models (Stringtie-set).

For homology-based prediction, the genomes of five plants were downloaded: rice (T2T-NIP)¹², foxtail millet (RefSeq assembly accession: GCF_000263155.2), maize (T2T Mo17)¹³, A. thaliana (Col-PEK)³⁷, and BTx623 sorghum (RefSeq assembly accession: GCF_000003195.3). These sequences were used as queries to search against the T2T assembly using GeMoMa (v 1.9)³⁸ with BAM files from the RNA-seq data, and the resulting predictions were denoted the “Homology-set”. For ab initio prediction, AUGUSTUS (v 3.2.3)³⁹ was used to predict coding regions in the repeat-masked genome. All predicted gene models were combined with EvidenceModeler (v 2.1.0)⁴⁰ into a non-redundant set of gene structures. Evidence from different sources was assigned different weights: 10 for the PASA-set, 5 for the Stringtie-set, 5 for the Homology-set, and 1 for the AUGUSTUS predictions. Finally, the gene models were further refined with PASA (v 2.4.1)³⁴ to add untranslated regions and alternative splicing information.

The integrated gene set was translated into amino-acid sequences. Using Diamond (v 0.9.30.131)⁴¹ with an E-value cutoff of 1E-05, the amino-acid sequences were aligned to five public protein databases: KOG, SwissProt⁴², Kyoto Encyclopedia of Genes and Genomes (KEGG)⁴³, the NCBI non-redundant database (NR), and Translation of European Molecular Biology Laboratory (TrEMBL). Protein domains were then searched with InterProScan (v 5.30)⁴⁴, which was also used to extract the Gene Ontology (GO) terms for each gene. Gene annotation identified 32,855 protein-coding genes, of which 32,746 (99.67%) were annotated in at least one functional database (Table 7).
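The evidence weights described above (10, 5, 5 and 1) are supplied to EvidenceModeler through a tab-delimited weights file with three columns: evidence class, source label, and weight. The sketch below writes such a file. The weights come from the text, but the source labels and the evidence class chosen for each set are assumptions for illustration; in practice the labels must match the source column of the GFF3 files actually provided, and the classes depend on how each evidence set was formatted.

```python
# (evidence class, source label, weight)
# Weights are from the text; classes and labels are illustrative assumptions.
WEIGHTS = [
    ("TRANSCRIPT", "PASA", 10),             # PASA-set
    ("OTHER_PREDICTION", "Stringtie", 5),   # Stringtie-set
    ("OTHER_PREDICTION", "GeMoMa", 5),      # Homology-set
    ("ABINITIO_PREDICTION", "AUGUSTUS", 1), # ab initio predictions
]


def render_weights(rows):
    """Render rows as tab-delimited lines: class, source, weight."""
    return "".join(f"{cls}\t{src}\t{weight}\n" for cls, src, weight in rows)


weights_text = render_weights(WEIGHTS)
print(weights_text, end="")

# To save the file (hypothetical file name):
# with open("weights.txt", "w") as handle:
#     handle.write(weights_text)
```

Higher weights make EvidenceModeler favor that evidence type when competing gene structures overlap, which is why the transcript-derived PASA models dominate here.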

Table 6.

Statistics of RNAseq de novo assembly of CHBZ.

Item	Length (bp)	Number
N10	7,004	2,101
N20	5,073	5,421
N30	4,018	9,760
N40	3,301	15,110
N50	2,702	21,631
N60	2,195	29,626
N70	1,716	39,627
N80	1,187	53,115
N90	626	75,492
Total length	194,566,978
Number ≥100 bp	130,301
Number ≥2,000 bp	33,368
GC rate	47.10%

Table 7.

Number of functional annotations for predicted genes in the CHBZ assembly.

Type		Gene number	Percentage
Total		32,855	100.00%
Nr		31,243	95.09%
Swissprot		23,054	70.17%
KEGG		22,578	68.72%
KOG		22,125	67.34%
Trembl		31,025	94.43%
Interpro	All	32,123	97.77%
Interpro	GO	18,871	57.44%
Annotated		32,746	99.67%
Unannotated		109	0.33%

Gene expression analysis

The raw RNA-seq reads were quality controlled with fastp (v0.19.5)⁴⁵. The clean reads were then aligned to the CHBZ genome using Hisat2 (v2.1.0)⁴⁶ with the following parameters: ‘--phred33 -p 5 --sensitive --no-discordant --no-mixed -I 1 -X 1000’. A matrix of mapped read counts was generated using htseq-count (v0.12.4)⁴⁷. Gene expression levels were calculated using the fragments per kilobase of exon per million mapped reads (FPKM) method⁴⁸.
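As a reminder of how FPKM is defined, the following sketch converts raw fragment counts into FPKM values. The gene names, counts, exon lengths and library size are toy values invented for illustration, not data from this study.

```python
def fpkm(fragment_count, exon_length_bp, total_mapped_fragments):
    """FPKM = fragments * 1e9 / (exon length in bp * total mapped fragments)."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (exon_length_bp * total_mapped_fragments)


# Toy inputs (hypothetical): counts per gene, summed exon length per gene,
# and the total number of mapped fragments in the library.
counts = {"geneA": 500, "geneB": 10}
exon_lengths = {"geneA": 2000, "geneB": 1000}
total_mapped = 20_000_000

expression = {g: fpkm(counts[g], exon_lengths[g], total_mapped) for g in counts}

# Genes passing an FPKM >= 1 cutoff are treated as detectably expressed.
expressed = [g for g, value in expression.items() if value >= 1.0]
print(expression, expressed)
```

The FPKM ≥ 1 cutoff in the last step is the same threshold used in the Technical Validation section to count transcriptionally active genes.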

Comparative genomic analysis

To identify syntenic relationships between the CHBZ and BTx623 (RefSeq assembly accession: GCF_000003195.3, publicly released by the Sorghum Consortium in 2017) genomes, we extracted the longest coding sequence (CDS) of each gene. These CDSs were then input into JCVI (v1.1.18)⁴⁹ with a minimum of 30 genes per block and the parameter “--cscore=0.99”. The analysis yielded 24,685 orthologous pairs, involving 24,639 genes (74.6%) in CHBZ and 24,637 genes (72.2%) in BTx623 (Fig. 2a).

Fig. 2 — The syntenic relationships and PAVs between the CHBZ and BTx623. (a) JCVI was used to detect syntenic blocks between the CHBZ and BTx623 gene pairs. The x-axis is the CHBZ genome, and the y-axis is the BTx623 genome. (b) The heatmap of 129 CHBZ-specific PAV genes. Rows represent 129 CHBZ-specific PAV genes, and columns represent 15 RNAseq samples. Blue and red boxes represent genes showing lower and higher expression levels, respectively.

The presence/absence variants (PAVs) between the CHBZ and BTx623 genomes were identified using a method similar to that described in Li, Xu et al.⁵⁰. First, the CHBZ genome was divided with a sliding window (window size, 500 bp; step, 100 bp). All resulting sequences were then aligned against the BTx623 genome using BWA (v 0.7.17-r1188)⁵¹ with the MEM algorithm (-w 500 -M -t 16). A sequence that failed to align to the BTx623 genome, or aligned with <25% coverage, was defined as a CHBZ-specific sequence. To identify CHBZ PAV genes, the longest CDS per gene was extracted, and genes with >75% of their CDS covered by specific sequences were defined as putative PAV genes. To exclude potential false positives, the longest CHBZ CDS sequences were mapped to the BTx623 genome using minimap2 (v2.24-r1122)⁵² with the parameters ‘-x splice -t 10 -k 12 -a -p 0.4 -N 20’. A gene with mapping quality >10, and coverage >25% or identity >90%, was considered a false-positive PAV gene. In summary, our analysis yielded 161 CHBZ-specific PAV genes and 178 BTx623-specific PAV genes. Notably, 129 of the 161 CHBZ-specific PAV genes were expressed in at least one RNA-seq sample (Fig. 2b). Among the 178 BTx623-specific PAV genes, 163 overlapped with those identified in the pan-genomic study⁵.
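The windowing and CDS-coverage rules described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' pipeline: the alignment step is omitted, coordinates are 0-based half-open intervals, CDS exons are assumed not to overlap one another, and all function names are hypothetical. Only the window size (500 bp), step (100 bp) and the 25% and 75% thresholds are taken from the text.

```python
def sliding_windows(chrom_len, size=500, step=100):
    """Yield (start, end) windows of `size` bp every `step` bp."""
    for start in range(0, max(chrom_len - size, 0) + 1, step):
        yield start, min(start + size, chrom_len)


def is_specific(aligned_bp, window_len, min_cov=0.25):
    """A window is CHBZ-specific if unaligned or aligned over <25% of its length."""
    return aligned_bp / window_len < min_cov


def merge(intervals):
    """Merge overlapping or touching intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(cds_exons, specific_windows):
    """Fraction of CDS bases overlapping CHBZ-specific windows."""
    specific = merge(specific_windows)
    cds_len = sum(end - start for start, end in cds_exons)
    overlap = 0
    for start, end in cds_exons:
        for s_start, s_end in specific:
            overlap += max(0, min(end, s_end) - max(start, s_start))
    return overlap / cds_len


def is_putative_pav(cds_exons, specific_windows, threshold=0.75):
    """Genes with >75% of their CDS in specific sequence are putative PAVs."""
    return covered_fraction(cds_exons, specific_windows) > threshold


windows = list(sliding_windows(1000))
specific = [(0, 500), (100, 600), (200, 700)]  # toy specific windows
gene_a = [(100, 400), (600, 800)]              # toy CDS exons
gene_b = [(600, 1000)]
print(len(windows), covered_fraction(gene_a, specific),
      is_putative_pav(gene_a, specific), is_putative_pav(gene_b, specific))
```

In the toy example, 80% of the CDS of `gene_a` lies in specific sequence, so it is flagged, whereas `gene_b` (25%) is not. The false-positive screen with minimap2 would then be applied to the flagged genes.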

Identification of centromeres and telomeres

Using a method similar to that described for the wild blueberry T2T assembly⁵³, the centromeres and telomeres were identified using QuarTeT (v 1.1.1)⁵⁴ with the “-c plant” option. QuarTeT provides a comprehensive suite of tools for automating T2T genome assembly and analysis, including the TeloExplorer module for telomere identification and CentroMiner for predicting centromere candidates. Briefly, CentroMiner identifies tandem repeat monomers, selects potential centromeric repeats based on period and copy number, clusters them to minimize redundancy, and aligns representative monomers to the corresponding chromosomes. The predicted centromere regions range in length from 4.31 Mb to 13.00 Mb. Notably, 98.58% of the centromere regions consist of repetitive sequences, with LTR-Gypsy and tandem repeat sequences being the predominant categories (Table 8). We observed large blank regions in the Hi-C interaction heatmap at the centromeres, a phenomenon also reported for the Scutellaria baicalensis gap-free genome⁵⁵ (Fig. 3).
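To make the idea of telomere detection concrete, the sketch below counts copies of the seven-base plant telomere repeat (‘AAACCCT’, the query mentioned in the Technical Validation section) or its reverse complement near each chromosome end. It is a simplified stand-in, not the TeloExplorer algorithm: the terminal window size and the minimum copy number are hypothetical parameters, and the test chromosome is synthetic.

```python
MOTIF = "AAACCCT"  # seven-base telomere repeat unit
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_repeats(seq, motif=MOTIF):
    """Count non-overlapping copies of the motif on either strand."""
    seq = seq.upper()
    return max(seq.count(motif), seq.count(revcomp(motif)))


def telomere_ends(chrom_seq, window=10_000, min_copies=50):
    """Return (left_has_telomere, right_has_telomere).

    window and min_copies are illustrative thresholds, not values
    taken from the study.
    """
    left = count_repeats(chrom_seq[:window])
    right = count_repeats(chrom_seq[-window:])
    return left >= min_copies, right >= min_copies


# Synthetic chromosome: telomeric repeats at both ends, filler in between.
toy_chrom = "CCCTAAA" * 100 + "ACGT" * 500 + "TTTAGGG" * 100
no_telomere = "ACGT" * 1000

print(telomere_ends(toy_chrom, window=1000))
print(telomere_ends(no_telomere, window=1000))
```

A chromosome is considered telomere-to-telomere only when both ends pass the check, which for 10 chromosomes corresponds to the 20 telomeres reported for this assembly.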

Table 8.

Characteristics of the predicted centromeres.

Chr	Start position (bp)	End position (bp)	Length (bp)	Repetitive sequence percentage (%)
				Tandem repeat	Interspersed repeat
					Total	LTR/Gypsy	LTR/Copia	LINE	SINE	DNA	Unclassified†
Chr01	34,928,863	40,321,337	5,392,475	50.46	98.35	62.88	8.90	0.06	0.00	1.13	35.85
Chr02	25,957,327	38,958,868	13,001,542	19.52	95.07	71.69	12.30	0.57	0.00	3.82	11.93
Chr03	33,253,963	43,005,316	9,751,354	52.58	98.54	54.85	11.27	0.09	0.00	4.09	36.86
Chr04	28,116,367	35,694,187	7,577,821	51.22	97.75	58.00	7.64	0.44	0.00	1.42	38.89
Chr05	31,489,790	40,675,439	9,185,650	45.85	98.08	64.60	8.48	0.07	0.08	1.13	30.64
Chr06	22,061,995	28,582,296	6,520,302	67.15	99.13	44.03	10.54	0.01	0.00	0.12	56.41
Chr07	33,906,883	38,220,429	4,313,547	68.07	97.94	43.62	5.32	0.01	0.00	0.25	53.55
Chr08	25,494,256	32,886,453	7,392,198	57.78	98.97	55.28	5.96	0.00	0.00	0.44	46.12
Chr09	28,090,934	35,539,312	7,448,379	67.32	98.36	46.95	6.39	0.03	0.00	0.50	51.23
Chr10	25,626,568	34,864,952	9,238,385	42.31	97.82	62.67	6.51	0.17	0.00	2.18	31.44

Note: Some repetitive elements may partly include the domain of another element.

†Unclassified: predicted repeats that could not be classified by RepeatMasker.

LINE, long interspersed nuclear elements; SINE, short interspersed nuclear elements; LTR, long terminal repeat.

Fig. 3 — Chromatin interactions at 100 kb resolution reveal the characteristics of the centromere regions in the CHBZ genome.

Data Records

The sequencing data have been deposited in the Sequence Read Archive under accession number SRP472912⁵⁶ and project identifier PRJNA1037263. The assembled genome sequence is available in GenBank under the Whole Genome Shotgun project accession GCA_040267525.1⁵⁷. Files pertaining to the genome assembly, gene structure annotation, and repeat annotation have been archived in the Figshare repository⁵⁸.

Technical Validation

Evaluation of the genome assembly

Multiple approaches were employed to validate the accuracy and completeness of the CHBZ genome assembly. We predicted centromeric sequences from the assembly and found that all 10 centromeres were captured (Fig. 4a). Additionally, we identified all 20 telomeres using the seven-base telomere repeat sequence (‘AAACCCT’) as a query (Fig. 4a). The number of telomere repeats in the CHBZ genome assembly was significantly higher than that in the BTx623 genome (Fig. 4b). The Hi-C heatmap displayed a high level of consistency across all chromosomes, providing evidence for the accurate sequencing, ordering, and orientation of contigs in the assembly (Fig. 4c). An assessment of LTR completeness gave an LTR assembly index (LAI) of 23.63 (Table 3), higher than the LAI of BTx623 (19.52) and even that of the latest T2T assembly of wild blueberry (20.22)⁵³. Completeness was also supported by the high mapping rates of the two types of long reads, with 100% of ONT reads and 99.9% of HiFi reads aligning to the CHBZ assembly. In addition, the assembly achieved a quality value (QV) of 61.60 (Table 3). Finally, the Benchmarking Universal Single-Copy Orthologs (BUSCO)⁵⁹ test identified 99.01% of the 1,614 genes in the embryophyta gene set in the CHBZ assembly (Fig. 4d). Overall, these results demonstrate the high quality and reliability of the CHBZ genome assembly.
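For readers less familiar with quality values, the QV can be translated into an expected per-base error rate. The sketch below assumes the usual Phred-scaled definition, QV = −10·log10(P), where P is the probability that a base is wrong, and uses the total assembly length computed from Table 4; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p):
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(p)


ASSEMBLY_BP = 724_849_160  # sum of the chromosome lengths in Table 4

error_rate = qv_to_error_rate(61.60)
expected_errors = error_rate * ASSEMBLY_BP
bases_per_error = 1 / error_rate

print(f"error rate per base: {error_rate:.2e}")
print(f"expected erroneous bases genome-wide: {expected_errors:.0f}")
print(f"about one error every {bases_per_error / 1e6:.2f} Mb")
```

Under this definition, a QV of 61.60 corresponds to roughly 7 errors per 10 million bases, or on the order of 500 erroneous bases across the whole 724.85 Mb assembly.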

Fig. 4 — The high quality of the CHBZ genome. (a) Telomere detection map. Triangles and circles represent telomeres and centromeres, respectively, within the CHBZ assembled chromosomes; red indicates high gene density and blue indicates low gene density. (b) Number of telomere repeats in the CHBZ and BTx623 genomes. (c) Heatmap displaying Hi-C interactions of CHBZ pseudomolecules. (d) BUSCO assessment of the CHBZ genome.

Evaluation of the gene annotation

First, the exon and intron length distributions agree with those of three related species, supporting the reliability of our annotation (Fig. 5a,b). Second, a total of 32,284 genes (98.26%) received evidence-based support (Table 9). Furthermore, 32,746 protein-coding genes (99.67%) were successfully annotated in at least one database, and 19,252 (58.60%) were supported by all six databases (Table 7, Fig. 5c). Notably, 25,873 genes (78.75%) exhibited detectable transcriptional activity (FPKM ≥ 1) across the 15 RNA-seq datasets (Fig. 5d). Moreover, the predicted proteins achieved a complete BUSCO score of approximately 99.38%, indicating high-quality gene annotation (Table 3). In summary, the gene annotation shows a high degree of accuracy and completeness.

Fig. 5 — Quality assessment of the protein-coding genes in the CHBZ assembly. (a) Comparison of exon length among four related plant gene sets. Window refers to the length of every point. (b) Comparison of intron length among four related plant gene sets. No obvious unexpected differences exist among these gene sets, indicating the high quality of the gene structure annotation. (c) Petal diagram of annotations from six public databases. (d) Proportions of genes that could be transcriptionally detected in CHBZ.

Table 9.

Summary of evidence for the CHBZ gene models.

	>=30% overlap		>=50% overlap		>=80% overlap
	Number	Rate (%)	Number	Rate (%)	Number	Rate (%)
C	1,104	3.36	1,152	3.51	1,453	4.42
H (single)	573	1.74	656	2.00	917	2.79
H (more)	1,073	3.27	1,241	3.78	2,011	6.12
P	2,492	7.58	2,639	8.03	3,013	9.17
HC	2,078	6.32	2,435	7.41	4,715	14.35
PC	614	1.87	567	1.73	452	1.38
PH	5,413	16.48	5,507	16.76	4,914	14.96
PHC	18,937	57.64	18,034	54.89	14,559	44.31
Total	32,284	98.26	32,231	98.10	32,034	97.50

P: ab initio prediction; H: homology-based prediction; C: cDNA or transcriptome-based prediction; single: with one gene source; more: with two or more gene sources.

Acknowledgements

This work was funded by Project of Conservation and Utilization of Agricultural Germplasm Resources in Shanxi Province (sxzyk202201) and Basic Research Program of Shanxi Province (20210302124238).

Author contributions

P.G., M.X.L. and Z.X.M. conceived the study. M.L., S.H., H.G.W. and H.B.Q. collected and prepared the samples. C.H.C., X.K.Y. and J.B.J. performed bioinformatics analysis. All authors read and approved the final manuscript.

Code availability

No specific code was developed for this study. The data analyses were conducted following the manuals and protocols provided by the developers of the relevant bioinformatics tools, which are described in the Methods section along with the versions used.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Meng Li, Chunhai Chen, Haigang Wang, Huibin Qin, Sen Hou, Xukui Yang.

Contributor Information

Meng Li, Email: nkypzslm@163.com.

Peng Gao, Email: gaopeng@genomics.cn.

Minxuan Liu, Email: liuminxuan@caas.cn.

Zhixin Mu, Email: muzx2008@sina.com.

References

1. McCormick, R. F. et al. The Sorghum bicolor reference genome: improved assembly, gene annotations, a transcriptome atlas, and signatures of genome organization. The Plant Journal (2017).
2. Paterson, A. H. et al. The Sorghum bicolor genome and the diversification of grasses. Nature 457, 551–556 (2009). doi:10.1038/nature07723
3. Deschamps, S. et al. A chromosome-scale assembly of the sorghum genome using nanopore sequencing and optical mapping. Nature Communications 9 (2018).
4. Cooper, E. A. et al. A new reference genome for Sorghum bicolor reveals high levels of sequence similarity between sweet and grain genotypes: implications for the genetics of sugar metabolism. BMC Genomics 20 (2019).
5. Tao, Y. et al. Extensive variation within the pan-genome of cultivated and wild sorghum. Nature Plants 7, 766–773 (2021). doi:10.1038/s41477-021-00925-x
6. Zhang, S. et al. Variation in mitogenome structural conformation in wild and cultivated lineages of sorghum corresponds with domestication history and plastome evolution. BMC Plant Biology 23 (2023).
7. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2021). doi:10.1126/science.abj6987
8. Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science (2022).
9. Shi, X. et al. The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding. Horticulture Research 10 (2023).
10. Huang, H. et al. Telomere-to-telomere haplotype-resolved reference genome reveals subgenome divergence and disease resistance in triploid Cavendish banana. Horticulture Research 10 (2023).
11. Navrátilová, P. et al. Prospects of telomere-to-telomere assembly in barley: Analysis of sequence gaps in the MorexV3 reference genome. Plant Biotechnology Journal 20, 1373–1386 (2021). doi:10.1111/pbi.13816
12. Shang, L. et al. A complete assembly of the rice Nipponbare reference genome. Molecular Plant 16, 1232–1236 (2023). doi:10.1016/j.molp.2023.08.003
13. Chen, J. et al. A complete telomere-to-telomere assembly of the maize genome. Nature Genetics 55, 1221–1231 (2023). doi:10.1038/s41588-023-01419-6
14. Belton, J. M. et al. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268–276 (2012). doi:10.1016/j.ymeth.2012.05.001
15. Chin, C. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10, 563–569 (2013). doi:10.1038/nmeth.2474
16. Wang, H. et al. Estimation of genome size using k-mer frequencies from corrected long reads. arXiv:2003.11817 (2020).
17. Chen, Y. et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience 1 (2018).
18. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 1–6 (2021). doi:10.1038/s41592-020-01056-5
19. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009). doi:10.1093/bioinformatics/btp324
20. Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology 16 (2015).
21. Durand, N. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Systems 3, 95–98 (2016). doi:10.1016/j.cels.2016.07.002
22. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 6333 (2017). doi:10.1126/science.aal3327
23. Xu, G. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. GigaScience 1 (2019).
24. Chen, Y. et al. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nature Communications 12 (2021).
25. Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nature Methods 19, 687–695 (2022). doi:10.1038/s41592-022-01440-3
26. Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods 19, 705–710 (2022). doi:10.1038/s41592-022-01457-8
27. Vaser, R., Sovic, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27, 737–746 (2017). doi:10.1101/gr.214270.116
28. Chen, N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Current Protocols in Bioinformatics 4 (2004).
29. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35, W265–W268 (2007). doi:10.1093/nar/gkm286
30. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21 (Suppl 1), i351–i358 (2005). doi:10.1093/bioinformatics/bti1018
31.Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA6 (2015). [DOI] [PMC free article] [PubMed]
32.Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research27, 573–580 (1999). 10.1093/nar/27.2.573 [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology29, 644–652 (2011). 10.1038/nbt.1883 [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Haas, B. J. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research31, 5654–5666 (2003). 10.1093/nar/gkg770 [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nature Methods12, 357–360 (2015). 10.1038/nmeth.3317 [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology20, 278 (2019). 10.1186/s13059-019-1910-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Hou, X., Wang, D., Cheng, Z., Wang, Y. & Jiao, Y. A near-complete assembly of an Arabidopsis thaliana genome. Molecular plant15, 1247–1250 (2022). 10.1016/j.molp.2022.05.014 [DOI] [PubMed] [Google Scholar]
38.Keilwagen, J. et al. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods in Molecular Biology (2019). [DOI] [PubMed]
39. Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research 33, W465–W467 (2005). doi:10.1093/nar/gki458
40. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9, R7 (2008). doi:10.1186/gb-2008-9-1-r7
41. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12, 59–60 (2015). doi:10.1038/nmeth.3176
42. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Research 27, 49–54 (1999). doi:10.1093/nar/27.1.49
43. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28, 27–30 (2000). doi:10.1093/nar/28.1.27
44. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014). doi:10.1093/bioinformatics/btu031
45. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018). doi:10.1093/bioinformatics/bty560
46. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907–915 (2019). doi:10.1038/s41587-019-0201-4
47. Anders, S., Pyl, P. T. & Huber, W. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015). doi:10.1093/bioinformatics/btu638
48. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J. L. & Pachter, L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology 12, R22 (2011). doi:10.1186/gb-2011-12-3-r22
49. Tang, H. et al. Synteny and Collinearity in Plant Genomes. Science 320, 486–488 (2008). doi:10.1126/science.1153917
50. Li, T. et al. Genome assembly of KA105, a new resource for maize molecular breeding and genomic research. The Crop Journal (2023).
51. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Genomics (2013).
52. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2017). doi:10.1093/bioinformatics/bty191
53. Zeng, T. et al. The Telomere-to-telomere gap-free reference genome of wild blueberry (Vaccinium duclouxii) provides its high soluble sugar and anthocyanin accumulation. Horticulture Research (2023).
54. Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Horticulture Research (2023).
55. Pei, T. et al. Gap-free genome assembly and CYP450 gene family analysis reveal the biosynthesis of anthocyanins in Scutellaria baicalensis. Horticulture Research 10 (2023).
56. NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP472912 (2024).
57. NCBI GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_040267525.1 (2024).
58. Wang, H. Genome assembly and annotation of Sorghum bicolor CHBZ. figshare. doi:10.6084/m9.figshare.24532924.v1 (2024).
59. Waterhouse, R. M. et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Molecular Biology and Evolution 35, 543–548 (2018). doi:10.1093/molbev/msx319


