Skip to main content
Scientific Data logoLink to Scientific Data
. 2025 Mar 18;12:452. doi: 10.1038/s41597-025-04800-8

Chromosome-scale assembly of the Xenocypris davidi using PacBio HiFi reads and Hi-C technologies

Tiezhu Yang 1,2,#, Liangjie Zhao 1,2,#, Chaoqun Su 1,2, Xusheng Guo 1,2,3,, Xinliang Peng 1,2,, Shijie Yang 4, Gaoyou Yao 1,2
PMCID: PMC11920407  PMID: 40102422

Abstract

Xenocypris davidi is a benthic fish species widely distributed in the water systems south of the Yellow River in China, playing a significant role in aquatic ecosystems. Despite its ecological and economic importance, genomic resources for X. davidi are limited, hindering a comprehensive understanding of its evolutionary adaptations and genetic improvements. This study presents the first chromosome-level genome assembly of X. davidi, utilizing PacBio long-reads, Illumina short reads, and Hi-C sequencing data. The genome assembly spans 1.05 Gb with a scaffold N50 length of 33.99 Mb, and 95.12% of the genome sequence was successfully anchored onto 24 pseudochromosomes. We identified 27,360 protein-coding genes, of which 26,672 were functionally annotated. This genome sequence provides a valuable resource for exploring the molecular basis of agronomic traits in X. davidi and will facilitate its genetic enhancement.

Subject terms: Genetic databases, DNA sequencing

Background & Summary

Xenocypris davidi Bleeker, 1871, a species of fish extensively distributed throughout the water systems south of the Yellow River in China, is commonly found in rivers, lakes, and reservoirs, and is classified as a benthic species. It primarily feeds on bottom-dwelling algae and detritus. These fish typically inhabit the mid-to-upper reaches of rivers and migrate upstream to shallow rapids for spawning during the breeding season1. Owing to its identity as a benthic scraper, it actively contributes to the enhancement of aquatic ecological environments and the amelioration of water quality, rendering it an ideal candidate for artificial propagation and release, as well as for pond polyculture practices. Research on the dietary habits of X. davidi in a certain reservoir has revealed that adult fish primarily consume a substantial amount of detritus, supplemented by a small number of sessile diatoms and oscillatorian cyanobacteria. This diet does not compete with that of other economically important fish species such as silver carp and bighead carp in the reservoir, indicating that X. davidi is a preferred species for integrated aquaculture2. Studies by Tang et al. have demonstrated that X. davidi has significant potential in managing the overgrowth of filamentous algae3. Appreciated for its delicate texture and savory taste, X. davidi has become one of the specialty fish species favored by consumers. Recent research has also explored the dietary nutritional requirements of X. davidi broodstock and the impact of adding bee pollen and glutamine to their feed on the growth of this species46. In the field of toxicology, it has been discovered that X. davidi exhibits a low tolerance to heavy metal copper ions. A hepatic transcriptome database for X. davidi has been constructed to analyze gene expression profiles, providing molecular insights into the species’ response to environmental pollutants7.

X. davidi, classified within the order Cypriniformes, family Cyprinidae, subfamily Xenocyprininae, and genus Xenocypris, has been the subject of evolutionary studies primarily based on the analysis of the complete mitochondrial genome8,9. Assessments in the field of population genetics, utilizing microsatellite markers, have evaluated the impact of stocking activities on wild populations of X. davidi in the Qiantang River, with results indicating that these activities have not posed genetic risks to the wild populations10. Additionally, research has identified genetic diversity variations among different aquaculture populations of X. davidi11. When analyzing wild populations from various water systems, significant genetic divergence was observed between populations in Qiandao Lake and the Yangtze River12. Nonetheless, the scant genomic data resources limit the understanding of X. davidi’s evolutionary adaptation and molecular mechanisms of its traits, restricting our full appreciation and exploitation.

In this research, we have achieved the construction of a chromosome-level genome assembly for X. davidi, representing the initial case of jointly employing PacBio long-reads, Illumina short reads, and Hi-C sequencing data. The obtained genome assembly has a total length of 1.05 Gb. The scaffold N50 length reaches 33.99 Mb, and a remarkable 95.12% of the genome sequence is effectively anchored to 24 pseudochromosomes. By using a combined method including de novo gene predictions, RNA-seq data, and homologous protein evidence, a sum of 27,360 protein-coding genes have been identified, among which 26,672 have been functionally annotated. The genome sequence is a valuable asset for understanding X. davidi’s agronomic trait molecular basis and facilitating its genetic improvement.

Methods

Ethics statement

The Experimental Animal Care and Ethics Committee of Xinyang Agriculture and Forestry University approved all fish sampling experiment procedures.

Samling and genome survey

A healthy female specimen of X. davidi, weighing 659.36 grams, was collected from the Pohe Reservoir in Guangshan County, Xinyang City, Henan Province, China. Following euthanasia with eugenol at an anesthetic concentration of 16 mg L^-313, the fish was rapidly rinsed three times with sterile physiological saline and dabbed dry with sterile cotton. Tissues including those from the muscle, heart, spleen, liver, and kidney were promptly dissected and promptly immersed in liquid nitrogen for conservation, and then stored at -80 °C until the DNA extraction process. High-quality genomic DNA from the muscle was extracted with the PureLink™ Genomic DNA Mini Kit (K182001, Thermo Fisher Scientific, USA). Meanwhile, RNA from diverse tissues was isolated by TRIzol™ Reagent (15596026CN, Thermo Fisher Scientific, USA). The quality and concentration of DNA were evaluated via 1% agarose gel electrophoresis and a Qubit 2.0 Fluorometer (Invitrogen, Thermo Fisher Scientific, USA). In contrast, for RNA, its purity and integrity were further appraised using a NanoDropTM One spectrophotometer (Thermo Fisher Scientific, USA) and Agilent 2100 Bioanalyzer (Agilent Technologies, Inc., USA).

In the context of the genomic survey, a 10 μg DNA sample was utilized to construct a 350 bp library, employing a paired-end 150 bp (PE150) sequencing strategy on the Illumina Novaseq 6000 platform. A total of 50.51 Gb of raw data and 168,353,928 read pairs were generated. The Jellyfish14 (version 2.2.7) was used to construct the 17-mer frequency depth distribution and estimate the genome size. Genome assembly was performed using SOAPdenovo215. Ultimately, survey analysis conducted with a Kmer size of 17 estimated the genome size to be 1,020.95 Mbp, which was refined to 1,004.32 Mbp. The heterozygosity rate was determined to be 0.52%, with the proportion of repetitive sequences constituting 53.91% (Table 1). Assembly with a Kmer size of 41 yielded a contig N50 of 2,166 bp, with a total length of 876,892,788 bp, and a scaffold N50 of 3,222 bp, with a total length of 899,312,128 bp. (Table 2).

Table 1.

Kmer = 17 Analysis of genomic characteristic statistics.

Kmer Depth N Kmer Genome size (M) Revised genome size (M) Heterozygous rate (%) Repeat_rate (%)
17 37 37,775,268,357 1,020.95 1,004.32 0.52 53.91

Table 2.

Statistical summary of the genome survey assembly results for X. davidi.

Title Total_length Total_number Max_length N50_length N90_length
Contig 876,892,788 1,361,752 79,659 2,166 183
Scaffold 899,312,128 1,111,026 104,922 3,222 258

PacBio and Hi-C based whole-genome sequencing

Regarding PacBio sequencing, high-quality DNA samples (with the main band being greater than 30 kb) were fragmented into segments ranging from 15 to 18 kb by utilizing a Covaris ultrasonic disruptor. Afterwards, the large DNA fragments were enriched and purified through magnetic beads. Subsequently, the fragmented DNA underwent damage repair as well as end repair. Hairpin sequencing adapters were then ligated to both ends of the DNA fragments, and exonucleases were used to eliminate any fragments that had unsuccessful ligation. The properly prepared library was then sequenced on the PacBio Sequel IIe platform in the CCS mode. After filtering polymerase reads, raw subreads were acquired, and these were further processed using SMARTLink (version 11.0, parameters: filter_min_qv = 20) to generate HiFi reads. Ultimately, a total of 27.86 Gb HiFi reads were obtained, along with a read number of 1,908,298 and a mean read length of 14,601 bp (as shown in Table 3 and Table 4).

Table 3.

Statics of different types of sequencing reads.

Library types Sample Platform Bases (Gb) Reads Count Mean Length (bp) N50 (bp)
SMRT Bell muscle PacBio Sequel IIe (HiFi) 27.86 1,908,298 14,601 15,148
Hi-C muscle Illumina Novaseq 6000 49.22 344,604,444 150 150
Short-read muscle Illumina Novaseq 6000 50.51 168,353,928 150 150
RNA-seq muscle Illumina Novaseq 6000 7.01 23,352,108 150 150
RNA-seq heart Illumina Novaseq 6000 6.33 21,090,726 150 150
RNA-seq spleen Illumina Novaseq 6000 6.48 21,598,661 150 150
RNA-seq liver Illumina Novaseq 6000 6.99 23,304,023 150 150
RNA-seq kidney Illumina Novaseq 6000 7.22 24,060,364 150 150

Table 4.

A statistical analysis of the sequencing data obtained from Hi-Fi.

Type HiFi reads
Read_base (bp) 27,863,295,623
Read_Number 1,908,298
Read_length(max) (bp) 56,349
Read_length(mean) (bp) 14,601
Read_length(N50) (bp) 15,148

In the case of Hi-C sequencing, muscle tissue was processed with paraformaldehyde to stabilize the intracellular DNA conformation. After cell lysis, the crosslinked DNA was digested by the restriction enzyme MboI to create sticky ends. Subsequently, the DNA termini were biotinylated, and DNA ligase was used to connect the DNA fragments. Thereafter, proteases were applied to reverse the crosslinking of DNA. The purified DNA was then randomly fragmented into pieces ranging from 300 to 500 bp and utilized to construct a Hi-C library16,17. Hi-C sequencing was carried out on the Illumina Novaseq 6000 platform following a paired-end 150 bp (PE-150) sequencing strategy. The raw sequence data obtained were processed through the HiCUP18 (v 0.8.3), which includes hicup_truncater for identifying and mapping chimeric sequences, hicup_filter for filtering mapped reads, and hicup_deduplicator for removing duplicate contacts. After conducting the aforementioned quality assessments on the Hi-C data, a total of 5,564,221 Total Reads Pairs were obtained, with 2,754,693 Total Paired (mapped) reads (accounting for 49.51%), 2,334,774 Valid Pairs, and 2,237,596 Unique di-Tags (Table 5).

Table 5.

Statistical analysis of sequencing data from Hi-C.

Type Counts
Total Reads Pairs 5,564,221
Total Paired (mapped) 2,754,693
Total Paired Ratio (%) 49.51
Valid Pairs 2,334,774
Unique di-Tags 2,237,596
Effect Rate (%) 40.21

Genome assembly

The genomic assembly of X. davidi was facilitated by employing the default settings within the Hifiasm software19 (v 0.16.1-r375). This assembly approach commences from the uncollapsed genomic data, thereby maximizing the retention of haplotype information. Hifiasm was provisioned with HiFi long reads to produce a monoploid assembly and a pair of contig graphs that resolve haplotypes. Consequently, the assembly process resulted in the construction of 134 contigs with a combined length of 1.05 Gb. This genomic assembly size is marginally larger than what was initially anticipated based on the survey results. The average length, maximum contig size, and N50 were 7.8, 57.52, and 33.99 Mb (Table 6), respectively.

Table 6.

Statistics for assembly at the contig level.

Type Contig
Total length (bp) 1,048,230,699
Total number 134
Average length (bp) 7,822,617
Max length (bp) 57,517,301
Min length (bp) 17,878
N50 length (bp) 33,985,674
N50 number 12
N90 length (bp) 12,284,600
N90 number 32

Hi-C assembly and Chromosome anchoring

The Hi-C technology was employed to differentiate contigs or scaffolds into distinct chromosomes based on the higher probability of intra-chromosomal interactions compared to inter-chromosomal interactions. Additionally, it facilitated the ordering and orientation of contigs or scaffolds on the same chromosome, as the interaction probability decreases with increasing interaction distance along the chromosome. After the Hi-C corrected contigs were integrated into the ALLHiC pipeline20 for pruning, partition, rescue, optimization, and building, a substantial 95.12% of the assembled sequences were anchored to 24 pseudochromosomes21 (Fig. 1), with chromosome lengths varying from 31.29 Mb to 60.28 Mb (Table 7). This outcome aligns with the karyotype results derived from cytological observations22, which are consistent with the chromosome numbers of 2n = 48 observed in several Xenocyprididae fish species, such as Plagiognathops microlepis23, Chanodichthys erythropterus24, M. amblycephala25, and Ctenopharyngodon idella26. Moreover, we manually refined the Hi-C scaffolding based on the chromatin contact matrix within Juicebox27 (Fig. 2). The 24 pseudochromosomes can be clearly identified on the heatmap, and there is a strong signal intensity around the diagonal, which suggests the high quality of the genome assembly. After Hi-C correction, the final assembled genome had a total span of 1.05 Gb, along with a scaffold N50 of 40.13 Mb (Table 8).

Fig. 1.

Fig. 1

A circus plot of 24 chromosomes of X. davidi. The tracks from outside to inside are: 24 chromosomes, the distributions of transposable element and bar plot for gene density profile.

Table 7.

Statistics of the 24 anchored chromosomes of X. davidi genome.

Chromosome Cluster Number Sequeues Length (bp)
Chr1 2 60,275,181
Chr2 2 57,625,182
Chr3 1 55,586,653
Chr4 1 52,022,816
Chr5 6 51,677,919
Chr6 4 44,187,633
Chr7 4 43,435,225
Chr8 2 43,302,729
Chr9 2 42,370,478
Chr10 3 40,516,972
Chr11 2 40,131,624
Chr12 4 38,782,997
Chr13 2 38,256,806
Chr14 2 38,096,289
Chr15 2 38,035,212
Chr16 3 37,432,084
Chr17 2 37,282,341
Chr18 3 35,654,014
Chr19 2 35,628,516
Chr20 3 35,330,893
Chr21 3 34,045,325
Chr22 1 33,397,843
Chr23 3 32,692,433
Chr24 2 31,287,097

Fig. 2.

Fig. 2

Hi-C chromatin interaction heatmap of the X. davidi assembly.

Table 8.

Assembly statistics for Hi-C.

Sample ID Contig length Scaffold length Contig number Scaffold number
Total 1,048,230,699 1,048,234,399 142 105
Max 57,517,301 60,275,181
Number > = 2000 142 105
N50 31,018,029 40,131,624 13 11
N60 27,366,945 38,096,289 17 14
N70 24,045,337 37,282,341 21 17
N80 13,754,553 35,330,893 26 20
N90 10,369,524 32,692,433 34 23

Genome annotation

In the annotation of repetitive sequences within the genome of X. davidi, this study employed two complementary approaches: homology-based alignment28 and de novo prediction29. The homology-based alignment method was predicated on the RepBase database30 and leveraged two software tools, RepeatMasker31 (version 4.1.0, parameters: -a -nolow -no_is -norna -parallel 4) and RepeatProteinMask (version 4.1.0, parameters: -noLowSimple -pvalue 0.0001 -engine ncbi), to identify sequences with similarity to known repetitive elements. Conversely, the de novo prediction approach commenced with the establishment of a de novo repetitive sequence library utilizing LTR_FINDER32 (version 1.06, parameters: -C -w 2), RepeatScout (version 1.0.5), and RepeatModeler33 (version 2.0.1, parameters: -engine ncbi -pa 15), followed by the application of RepeatMasker for prediction. Additionally, within the de novo prediction methodology, the software TRF34 (version 4.09, parameters: 2 7 7 80 10 50 2000 -d -h -ngs) was employed to detect tandem repeats within the X. davidi genome. Ultimately, all predicted results were consolidated and duplicates were eliminated, yielding the identification of 533.34 Mb of repetitive sequences, constituting 50.88% of the assembled genome. The predominant element among these repetitive sequences was long terminal repeats (LTR), which accounted for 45.59% (477.87 Mb) of the assembled genome (Table 9), a significant departure from the genome of the fine-scaled gudgeon, where DNA transposons were the most abundant, comprising 31.55%23. Long interspersed nuclear elements (LINE) constituted 1.81% of the genome, short interspersed nuclear elements (SINE) constituted 0.02% of the genome, and DNA elements constituted 4.56% of the genome, respectively (Table 9).

Table 9.

Summary of the transposable elements in X. davidi genome.

Type Denovo + Repbase TE Proteins Combined TEs
Length(bp) % in Genome Length(bp) % in Genome Length(bp) % in Genome
DNA 39,993,470 3.82 15,440,641 1.47 47,841,916 4.56
LINE 7,002,432 0.67 14,991,786 1.43 19,012,975 1.81
SINE 172,417 0.02 0 0 172,417 0.02
LTR 475,525,141 45.36 23,810,950 2.27 477,886,955 45.59
Unknown 13,215,147 1.26 0 0 13,215,147 1.26
Total 527,762,304 50.35 54,235,314 5.17 533,336,825 50.88

For the prediction of gene structures within the genome of X. davidi, this manuscript employed a triad of methodologies: de novo prediction, homology-based prediction, and annotation using transcriptome data35. The de novo prediction outcomes were derived from the utilization of Augustus36 (version 3.2.3, parameters: --species = pasa1--uniqueGeneId = TRUE --noInFrameStop = TRUE --GFF3 = on --genemodel = complete --strand = both), GlimmerHMM37 (version 3.0.4, parameters: -d pasa1 -f -g), SNAP (version 2013.11.29, parameters: -gff pasa1.hmm), Geneid (version 1.4, parameters: -P homo_sapiens.param -v -G -p geneid), and Genscan (version 1.0, parameters: HumanIso.smat) software. In the homology-based prediction approach, protein sequences from Carassius auratus38 (GenBank: GCA_003368295.1), Cyprinus carpio39 (GenBank: GCA_000951615.2), Ctenopharyngodon idellus26 (GenBank: GCA_019924925.1), Danio rerio40 (GenBank: GCA_000002035.4), Onychostoma macrolepis41 (GenBank: GCA_012432095.1), M. amblycephala25 (GenBank: GCA_018812025.1), and Opsariichthys bidens42 (GenBank: GCA_037194245.1) were downloaded from the NCBI database and used to predict gene structures within the genome through alignment software such as Blastall43 (version 2.2.26, parameters: -e 1e-05 -F T -m 8), Solar (version 0.9.6, parameters: -a prot2genome2 -z -f m8), and Genewise44 (version 2.4.1, parameters: -tfor -genesf -gff -sum) (Fig. 3). For the transcriptome data annotation method, high-quality RNA from muscle, heart, spleen, liver, and kidney were used to construct RNAseq libraries. Subsequently, these libraries were sequenced on the Illumina Novaseq 6000 platform, and 150 bp paired-end reads were obtained as a result. Post-sequencing, 37.73 Gb of raw data was generated, which was filtered to yield 34.03 Gb of clean data (Table 3). Subsequent de novo assembly was performed using Trinity (version 2.1.1, parameters:–normalize_reads–full_cleanup–min_glue 2–min_kmer_cov 2–KMER_SIZE 25), alignment analysis with Hisat2 (version 2.0.4), and assembly annotation with Stringtie (version 1.3.3). The gene sets predicted by the above-mentioned three methods were combined into a non-redundant gene set with the help of EvidenceModeler (EVW)45 (version 1.1.1, parameters:–segmentSize 200000 –overlapSize 20000–min_intron_length 20). Finally, PASA (http://pasa.sourceforge.net/) was utilized, in conjunction with transcriptome assembly results, to refine the EVW annotation, incorporating UTR and alternative splicing information, to arrive at the final gene set. The genomic prediction for X. davidi resulted in 27,360 genes, with an average transcript length of 10,053.32 bp, an average coding region length of 1,121.54 bp, an average of 6.28 introns per gene, an average intron length of 178.57 bp, and an average exon length of 1,691.46 bp (Table 10, Fig. 4A).

Fig. 3.

Fig. 3

Comparisons of the genomic elements of closely related species.

Table 10.

Statistical analyses of the gene structure annotation of the X. davidi.

Gene set Number Average transcript length(bp) Average CDS length(bp) Average exons per gene Average exon length(bp) Average intron length(bp)
De novo Augustus 41,765 10,053.32 1,121.54 6.28 178.57 1,691.46
GlimmerHMM 106,335 8,754.46 573.66 3.89 147.46 2,830.35
SNAP 59,634 12,134.09 673.63 4.87 138.4 2,963.41
Geneid 32,058 19,612.36 1,344.29 6.3 213.33 3,445.94
Genscan 31,763 22,650.39 1,559.14 8.21 189.9 2,925.14
Homolog Ctenopharyngodon idellus 29,826 9,959.04 1,251.03 6.87 182.16 1,484.06
Opsariichthys bidens 26,232 13,465.66 1,449.43 7.96 182.16 1,727.23
Onychostoma macrolepis 24,228 14,824.04 1,520.60 8.58 177.18 1,754.54
Megalobrama amblycephala 24,907 14,497.76 1,576.73 8.68 181.66 1,682.52
Carassius auratus 29,056 10,456.97 1,301.02 7.02 185.33 1,520.97
Cyprinus carpio 26,439 10,919.42 1,317.09 7.12 184.91 1,568.31
Danio rerio 23,752 14,024.94 1,547.84 8.39 184.57 1,689.21
RNAseq PASA 26,100 15,024.86 1,453.96 8.96 162.26 1,704.73
Transcripts 33,355 26,488.96 2,770.83 9.87 280.77 2,674.37
EVM 38,936 12,542.04 1,232.01 6.93 177.83 1,907.85
Pasa-update* 38,720 12,882.32 1,250.40 7.03 177.78 1,927.93
Final set* 27,360 16,345.93 1,566.90 8.87 176.69 1,878.35

Fig. 4.

Fig. 4

Gene prediction and functional annotation of the X. davidi genome. (A) Venn diagram of the gene set prediction. (B) Venn diagram of functional annotation based on different databases.

To conduct the functional annotation of protein-coding genes, this study employed a dual strategy utilizing Blastp46 (version 2.2.26, parameters: -max_target_seqs. 1 -evalue 1e-4) and Diamond47 (version 0.8.22, parameters:–more-sensitive -k 10 -e 1e-5 -f 6 qseqid qlen qstart qend sseqid slen sstart send pident ppos qcovhsp bitscore evalue–salltitles–threads 10) for aligning the protein-coding genes against the SwissProt48, NCBI Non-redundant protein (NR) (https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/), KEGG49, InterPro50, Gene Ontology (GO)51, and Pfam52 protein databases. The identification of protein domains and motifs was facilitated through the application of InterProScan53 (version 5.35–74.0, parameters: -cpu 20 -format tsv -appl ProDom, SMART, ProSiteProfiles, PRINTS, Pfam, Panther -iprlookup -dp -goterms). Ultimately, 26,672 (97.50%) of the 27,360 predicted genes received annotations from at least one of the databases (Table 11). Among the functionally annotated proteins, 20,599 (75.29%) were corroborated by annotations from all four databases (Fig. 4B).

Table 11.

Non-coding RNA statistical results of X. davidi.

Type Copy number Average length(bp) Total length(bp) % of genome
miRNA 2,009 105.11 211,175 0.020146
tRNA 11,325 76.03 861,093 0.082147
rRNA rRNA 13,213 147.93 1,954,630 0.19
18S 322 732.25 235,783 0.022493
28S 790 387.85 306,403 0.02923
5.8S 116 156.56 18,161 0.001733
5S 11,985 116.34 1,394,283 0.13
snRNA snRNA 2,412 160.5 387,133 0.036932
CD-box 249 152.23 37,906 0.003616
HACA-box 92 148.27 13,641 0.001301
splicing 2,031 162.25 329,527 0.031436
scaRNA 26 194.35 5,053 0.000482
Unknown 14 71.86 1,006 0.000096

The annotation of non-coding RNAs encompasses tRNA, rRNA, miRNA, and snRNA. To identify tRNA sequences within the genome, the structural characteristics of tRNA were leveraged using the tRNAscan-SE software54 (version 1.4). Given the high conservation of rRNA, the rRNA sequences of closely related species were chosen as reference sequences. Moreover, the identification of rRNA within the genome was made easier through BLAST alignment (v 2.2.26, parameters: -e 1e-10 -v 10000 -b 10000). Furthermore, employing the covariance models from the Rfam family, the INFERNAL software, which is integrated within the Rfam55 (v 14.1) suite, can be utilized to predict microRNA (miRNA) and small nuclear RNA (snRNA) sequence information on the X. davidi genome. Ultimately, four types of non-coding RNAs were identified from the X. davidi genome, including 2,009 miRNAs, 11,325 tRNAs, 13,213 rRNAs, and 2,412 snRNAs (Table 12).

Table 12.

Statistical analysis of the functional gene annotations of the X. davidi genome.

Number Percent (%)
Total 27,360
Swissprot 22,152 81
Nr 26,253 96
KEGG 22,742 83.1
InterPro 25,497 93.2
GO 16,878 61.7
Pfam 20,899 76.4
Annotated 26,672 97.5
Unannotated 688 2.5

Data Records

The raw sequence data of RNAseq data, HiC data, PacBio data and Illumina short reads data reported in this paper have been deposited in the Genome Sequence Archive in National Genomics Data Center, China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA02081456, CRA02081757, CRA02081858, CRA02081959, CRA02082060, CRA02082161, CRA02082262, CRA02082363). The whole genome sequence data reported in this paper have been deposited in the GenBank (JBLRZY000000000)64 and figshare database65. The genome annotation files will already be uploaded and shared publicly in the figshare database66.

Technical Validation

Evaluation of the genome assembly and annotation

The quality assessment of the genome assembly and annotation was conducted with meticulous attention to detail. The completeness of the assembled genome was assessed by utilizing BUSCO67 (version 5.4.3) with the actinopterygii_odb10 database, yielding a 98.3% complete BUSCO score within the assembled genomes (Table 11), which is a testament to the high degree of completeness of our genomic assemblies. The genomic consistency was further appraised by aligning Illumina short-reads to the assembled genomes with BWA68 (version 0.7.17), resulting in exceptionally high mapping rates (99.36%) and coverage (99.96%) against the assembled genomes (Table 13). Employing Merqury69 (version 1.4.1), the consensus quality value (QV), indicative of per-base consensus accuracy, was calculated to be 50.458 for the assembled genomes. Additionally, a comparative analysis of the length distributions of genes, coding sequences (CDSs), introns, and exons across the genomes of C. idellus, O. bidens, O. macrolepis, M. amblycephala, C. auratus, C. carpio, and D. rerio was performed, revealing similarities (Fig. 3), which substantiates the reliability of our genome annotation. Collectively, the outcomes from these four methodologies demonstrate the high accuracy and completeness of the final genome assembly.

Table 13.

Completeness and accuracy evaluation of the genome.

Type Number Percentage (%)
Complete BUSCOs (C) 3,578 98.3
Single-copy BUSCOs (S) 3,502 96.2
Duplicated BUSCOs (D) 76 2.1
Fragmented BUSCOs (F) 25 0.7
Missing BUSCOs (M) 36 1.0
Total BUSCOs 3,640
Short-reads mapping rate 99.36
Genome covered by reads 99.96
Quality value (QV) 50.458

Acknowledgements

The study was supported by the Youth Fund Project of Xinyang Agriculture and Forestry University (No. QN2021020), the Natural Science Foundation of Henan (No. 232300421273, No. 242300420175), the Key Scientific Research Projects of Colleges and Universities in Henan Province (No. 23B240003, No. 24B240001), and the Henan Province Science and Technology Research Project (No. 252102110075, No. 202102110263, No. 162102110053).

Author contributions

X. G. and X. P. conceived the research project. C. S. and S. Y. collected the samples. L. Z. designed the experiment. T. Y., G. Y. and L. Z. performed data analysis. T. Y and C. S. drafted the manuscript. L. Z. and G. Y. revised this manuscript. All authors have read and approved the manuscript.

Code availability

In this study, no custom-written codes were used. All data processing operations were conducted in accordance with the manuals and protocols of the relevant software. The specific parameters for different software and tools were described in the Methods section. For cases where detailed parameters were not specified, default parameters were adopted.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Tiezhu Yang, Liangjie Zhao.

Contributor Information

Xusheng Guo, Email: gxs1968@xyafu.edu.cn.

Xinliang Peng, Email: 2008210024@xyafu.edu.cn.

References

  • 1.Ding, D. Introduction to Four Species of Culter Fish. Hunan Agricuture2, 30+24 (2013). [Google Scholar]
  • 2.Xu, D. A Preliminary Analysis on the Food Habie of Xenocypris davidi Bleeker in Reservoir Guangting. Acta Hydrobiologica Sinica01, 43–53 (1988). [Google Scholar]
  • 3.Tang, Y. et al. Grazing Effects of Xenocypris davidi Bleeker (Cyprinidae, Cypriniformes) on Filamentous Algae and the Consequent Effects on Intestinal Microbiota. Aquaculture Research2023, 1–14 (2023). [Google Scholar]
  • 4.Li, C. et al. Research on the Dietary Protein and Fat Requirements of Parental Xenocypris davidi. Jiangsu Agricultural Sciences47, 220–223 (2019). [Google Scholar]
  • 5.Li, C. et al. Effects of Different Levels of Bee Pollen in Feed on Reproductive Performance of Xenocypris davidi Bleeker. Agricultural Science and Technology20, 48–52 (2019). [Google Scholar]
  • 6.Wang, Y. et al. Effects of dietary glutamine supplementation on growth performance, intestinal digestive ability, antioxidant status and hepatic lipid accumulation in Xenocypris davidi (Bleeker,1871). Aquacult Int32, 725–743 (2024). [Google Scholar]
  • 7.Peng, X., Zhao, L., Liu, J., Guo, X. & Ding, Y. Comparative transcriptome analyses of the liver between Xenocypris microlepis and Xenocypris davidi under low copper exposure. Aquatic Toxicology236, 105850 (2021). [DOI] [PubMed] [Google Scholar]
  • 8.Xu, H., Zhu, Y., Zheng, D. & Yang, S. Molecular identification and phylogenetic analysis of mitogenome of the Xenocypris davidi from Cao’e. Mitochondrial DNA Part B Resources4, 3998–3999 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Liu, Y. The complete mitochondrial genome sequence of Xenocypris davidi (Bleeker). Mitochondrial DNA25, 374–376 (2014). [DOI] [PubMed] [Google Scholar]
  • 10.Guo, A. et al. Stock enhancement effect and potential genetic risks of Xenocypris davidi by molecular markers in the upper reaches of Qiantang River, China. Journal of Fisheries of China46, 2349–2356 (2022). [Google Scholar]
  • 11.Liu, S. et al. Genetic Diversity Analysis of Four Cultured Xenocypris davidi Populations Based on Mitochondrial D-loop Sequences. Guangdong Agricultural Sciences50, 139–145 (2023). [Google Scholar]
  • 12.Zhang, H., Zhao, L., Hu, Z. & Liu, Q. Genetic variation analysis of Xenocypris davidi populations from Qiandao Lake and Yangtze River. Journal of Shanghai Ocean University24, 12–19.
  • 13.Wang, W. Study on the mechanism and protection of anaesthesia injury in Lateolabrax maculatus. (Shanghai Ocean University, Shanghai, 2020).
  • 14.Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics27, 764–770 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience1, 2047-217X-1–18 (2012). [DOI] [PMC free article] [PubMed]
  • 16.van Berkum, N. L. et al. Hi-C: A Method to Study the Three-dimensional Architecture of Genomes. JoVE (Journal of Visualized Experiments) e1869 (2010). [DOI] [PMC free article] [PubMed]
  • 17.Rao, S. S. P. et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell159, 1665–1680 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Wingett, S. W. et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000research4, 1310 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods18, 170–175 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat. Plants5, 833–845 (2019). [DOI] [PubMed] [Google Scholar]
  • 21.Hu, J. et al. Characteristics of diploid and triploid hybrids derived from female Megalobrama amblycephala Yih × male Xenocypris davidi Bleeker. Aquaculture364–365, 157–164 (2012). [Google Scholar]
  • 22.Zhang, H., Xu, X., Zhang, Y. & Wang, S. Chromosomal Karyotype Analysis of Xenocypris davidi. Jiangxi Fishery Science and Technology20, 22 (2018). [Google Scholar]
  • 23.Wu, Y., Sha, H., Luo, X., Zou, G. & Liang, H. Chromosome-level genome assembly of Plagiognathops microlepis based on PacBio HiFi and Hi-C sequencing. Sci Data11, 802 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Zhao, S. et al. A chromosome-level genome assembly of the redfin culter (Chanodichthys erythropterus). Sci Data9, 535 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Liu, H. et al. A Chromosome-Level Assembly of Blunt Snout Bream (Megalobrama amblycephala) Genome Reveals an Expansion of Olfactory Receptor Genes in Freshwater Fish. Mol Biol Evol38, 4238–4251 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Wu, C.-S. et al. Chromosome-level genome assembly of grass carp (Ctenopharyngodon idella) provides insights into its genome evolution. BMC Genomics23, 271 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Robinson, J. T. et al. Juicebox.js Provides a Cloud-Based Visualization System for Hi-C Data. cels6, 256–258.e1 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Edgar, R. C. & Myers, E. W. PILER: identification and classification of genomic repeats. Bioinformatics21, i152–i158 (2005). [DOI] [PubMed] [Google Scholar]
  • 29.Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics21, i351–i358 (2005). [DOI] [PubMed] [Google Scholar]
  • 30.Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and Genome Research110, 462–467 (2005). [DOI] [PubMed] [Google Scholar]
  • 31.Nishimura, D. RepeatMasker. Biotech Software & Internet Report1, 36–39 (2000). [Google Scholar]
  • 32.Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research35, W265–W268 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences117, 9451–9457 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research27, 573–580 (1999). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research31, 5654–5666 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Research34, W435–W439 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics20, 2878–2879 (2004). [DOI] [PubMed] [Google Scholar]
  • 38.Chen, Z. et al. De novo assembly of the goldfish (Carassius auratus) genome and the evolution of genes after whole-genome duplication. Science Advances5, eaav0547 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Xu, P. et al. Genome sequence and genetic diversity of the common carp, Cyprinus carpio. Nat Genet46, 1212–1219 (2014). [DOI] [PubMed] [Google Scholar]
  • 40.Howe, K. et al. The zebrafish reference genome sequence and its relationship to the human genome. Nature496, 498–503 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Sun, L. et al. Chromosome-level genome assembly of a cyprinid fish Onychostoma macrolepis by integration of nanopore sequencing, Bionano and Hi-C technology. Molecular Ecology Resources20, 1361–1371 (2020). [DOI] [PubMed] [Google Scholar]
  • 42.Xu, X. et al. Chromosome-Level Assembly of the Chinese Hooksnout Carp (Opsariichthys bidens) Genome Using PacBio Sequencing and Hi-C Technology. Front. Genet. 12 (2022). [DOI] [PMC free article] [PubMed]
  • 43.Kent, W. J. BLAT—The BLAST-Like Alignment Tool. Genome Res.12, 656–664 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res.14, 988–995 (2004). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol9, R7 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res25, 3389–3402 (1997). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods18, 366–368 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research28, 45–48 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research28, 27–30 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Research37, D211–D215 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat Genet25, 25–29 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Research49, D412–D419 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Zdobnov, E. M. & Apweiler, R. InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics17, 847–848 (2001). [DOI] [PubMed] [Google Scholar]
  • 54.Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A Program for Improved Detection of Transfer RNA Genes in Genomic Sequence. Nucleic Acids Research25, 955–964 (1997). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Research33, D121–D124 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.CNCB Genome Sequence Archivehttps://ngdc.cncb.ac.cn/gsa/browse/CRA020814 (2024).
  • 57.CNCB Genome Sequence Archivehttps://ngdc.cncb.ac.cn/gsa/browse/CRA020817 (2024).
  • 58.CNCB Genome Sequence Archivehttps://ngdc.cncb.ac.cn/gsa/browse/CRA020818 (2024).
  • 59.CNCB Genome Sequence Archivehttps://ngdc.cncb.ac.cn/gsa/browse/CRA020819 (2024).
  • 60.CNCB Genome Sequence Archivehttps://ngdc.cncb.ac.cn/gsa/browse/CRA020820 (2024).
  • 61.CNCB Genome Sequence Archivehttps://ngdc.cncb.ac.cn/gsa/browse/CRA020821 (2024).
  • 62.CNCB Genome Sequence Archivehttps://ngdc.cncb.ac.cn/gsa/browse/CRA020822 (2024).
  • 63.CNCB Genome Sequence Archivehttps://ngdc.cncb.ac.cn/gsa/browse/CRA020823 (2024).
  • 64.NCBI GenBankhttps://identifiers.org/ncbi/insdc:JBLRZY000000000 (2025).
  • 65.Yang, T. Genome annotation files of Xenocypris davidi. figshare10.6084/m9.figshare.28287308.v1 (2025).
  • 66.Yang, T. Genome annotation files of Xenocypris davidi. figshare10.6084/m9.figshare.27932985.v1 (2024).
  • 67.Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution38, 4647–4654 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv1303, 3997 (2013). [Google Scholar]
  • 69.Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology21, 245 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Citations

  1. Yang, T. Genome annotation files of Xenocypris davidi. figshare10.6084/m9.figshare.28287308.v1 (2025).
  2. Yang, T. Genome annotation files of Xenocypris davidi. figshare10.6084/m9.figshare.27932985.v1 (2024).

Data Availability Statement

In this study, no custom-written codes were used. All data processing operations were conducted in accordance with the manuals and protocols of the relevant software. The specific parameters for different software and tools were described in the Methods section. For cases where detailed parameters were not specified, default parameters were adopted.


Articles from Scientific Data are provided here courtesy of Nature Publishing Group

RESOURCES