Scientific Data. 2025 Aug 21;12:1457. doi: 10.1038/s41597-025-05735-w

A telomere-to-telomere gap-free genome assembly of the protandrous hermaphrodite Asian seabass (Lates calcarifer)

Xinhui Zhang 1,2, Jieming Chen 2, Wenchuan Zhou 3, Jiufu Wen 4,5, Qiong Shi 1,2
PMCID: PMC12371012  PMID: 40841804

Abstract

As a protandrous hermaphroditic fish species with natural sex change from male to female, Asian seabass (Lates calcarifer) represents an attractive model for studying sequential hermaphroditism. In this study, we constructed the first telomere-to-telomere (T2T) gap-free genome assembly of Asian seabass by integrating MGI short-read, PacBio HiFi long-read, ONT ultra-long and Hi-C sequencing technologies. The haplotypic 614.19 Mb genome sequences were successfully anchored onto 24 chromosomes, demonstrating exceptional contiguity with a contig N50 of 26.57 Mb. Comprehensive annotation revealed precise localization of telomeric repeats and centromeric regions on every chromosome. Good results from Merqury (QV: 57.8), CRAQ (99.45%) and BUSCO (100%) indicate a high level of accuracy for the assembled genome. ONT ultra-long and PacBio HiFi sequencing data were aligned with the assembly using minimap2, resulting in a mapping rate over 98%. Repetitive elements accounted for 18.18% (111.64 Mb) of the entire genome, and a total of 25,093 protein-coding genes were annotated. This high-quality T2T genome assembly provides a valuable genetic resource for in-depth comparative genomics, population genetics, molecular breeding, and functional studies of this economically important marine species. This reference assembly also facilitates investigations into the detailed molecular mechanisms underlying the unique reproductive strategy of this protandrous hermaphrodite.

Subject terms: Genome, Evolutionary genetics

Background & Summary

Sex determination is a genetic or epigenetic process that initiates and regulates the developmental trajectory of sexual differentiation, whereas sex differentiation encompasses the cascade of morphological and physiological events through which a bi-potential gonad progressively develops into either a testis or an ovary, culminating in the establishment of species-specific secondary sexual characteristics1. Compared with the highly conserved sex determination systems of mammals and birds, fishes exhibit remarkably diverse sex determination modes, including genetic sex determination (GSD), environmental sex determination (ESD), and the coexistence of both2,3. Notably, among diverse environmental cues, temperature emerges as the most influential exogenous factor modulating sexual development in fishes. Numerous species across different taxa have been documented to exhibit thermally sensitive sex determination, where incubation temperature during critical developmental windows can override genotypic sex determinants. Good examples include European seabass (Dicentrarchus labrax)4, Nile tilapia (Oreochromis niloticus)5, and Atlantic halibut (Hippoglossus hippoglossus)6,7. In these fishes, sex ratios can change significantly with variations in environmental temperature during the hatching period.

In addition to gonochorism (separate sexes), fishes also exhibit hermaphroditism as an important reproductive strategy. Approximately 2% of teleost fishes are hermaphroditic, distributed across 27 families within 7 orders8. Sex change is a biological process in which an organism transitions from its original sex to the other through specific physiological mechanisms. Organisms capable of naturally undergoing sex change are referred to as sequential hermaphrodites, which are typically categorized as protandrous (male-to-female) or protogynous (female-to-male)9. Common examples among fishes include groupers, black seabream, clownfish, and rice field eel10–13.

Asian seabass holds substantial cultural and economic value throughout the tropical Indo-West Pacific region, serving as both a key fishery resource and a commercially important aquaculture species14. As a protandrous hermaphroditic fish15, it usually first develops into a male at 3–4 years of age, and approximately 90% of individuals then undergo natural sex change to female by 6 years of age16. Despite this remarkable reproductive strategy, the genetic mechanisms underlying sex change in Asian seabass remain poorly understood, as is the case for most hermaphroditic species. Genomic resources, including DNA markers, high-resolution linkage maps, transcriptomes, and reference genome sequences with comprehensive annotations, play a pivotal role in supporting aquaculture. They provide a solid foundation for comprehensive genetic investigations and for the development of sophisticated artificial breeding strategies, and ultimately contribute to the sustainable expansion and increased productivity of the international aquaculture industry14. Given the economic value of Asian seabass and its natural sex change, a high-quality genome assembly is an essential resource for this species.

In this study, we combined MGI short-read, PacBio HiFi long-read, ONT (Oxford Nanopore Technologies) ultra-long, and Hi-C sequencing data to generate a high-fidelity T2T genome assembly of Asian seabass. This assembly was rigorously assessed for quality, and its key genomic features were systematically characterized. In fact, this gap-free and complete reference assembly represents a substantial improvement over any previous assembly of this species17. It will not only facilitate population genetic research and evolutionary study, but also provide an important genetic resource for molecular breeding and investigating molecular mechanisms of sex change in this economically important fish.

Methods

Sample collection

A male Asian seabass (Fig. 1A) was collected from a local aquaculture facility of the South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, located in Guangzhou City, Guangdong Province, China. Muscle tissue was sampled for whole-genome sequencing with MGI short-read, PacBio HiFi long-read, ONT (Oxford Nanopore Technologies) ultra-long and Hi-C technologies. Additionally, seven distinct tissues (gill, brain, liver, muscle, eye, testis, and skin) were collected for transcriptome sequencing (Table 1). After dissection into small fragments, the tissue samples were washed with ice-cold PBS (pH 7.4) to eliminate blood residues and contaminants. Surface liquid was removed by blotting, and the samples were then rapidly frozen in liquid nitrogen and maintained at −80 °C until use. For transcriptome sequencing, frozen specimens were shipped in dry ice containers to the sequencing company (BGI, Shenzhen, Guangdong, China).

Fig. 1.


Asian seabass and its whole-genome sequence distribution. (A) A morphological image of the sequenced Asian seabass. (B) A k-mer (21-mer) distribution curve for estimation of the genome size.

Table 1.

Sequencing data of the Asian seabass genome and transcriptomes.

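| Type | Library type | Raw data (Gb) | Clean data (Gb) | Read N50/length (bp) | Coverage of the genome (×) |
|---|---|---|---|---|---|
| DNA | MGI | 37.41 | 33.69 | 150 | 54 |
| DNA | PacBio HiFi | / | 90.47 | 18,366* | 141 |
| DNA | ONT Ultra-long | / | 61.34 | 71,701* | 95 |
| DNA | Hi-C | 102.32 | 93.8 | 150 | 133.33 |
| RNA | Eye | 6.104 | 5.236 | 150 | / |
| RNA | Muscle | 6.207 | 5.625 | 150 | / |
| RNA | Skin | 6.332 | 5.769 | 150 | / |
| RNA | Liver | 7.007 | 6.377 | 150 | / |
| RNA | Gill | 6.868 | 6.263 | 150 | / |
| RNA | Brain | 6.335 | 5.783 | 150 | / |
| RNA | Testis | 9.195 | 8.395 | 150 | / |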

*For the PacBio HiFi and ONT Ultra-long sequencing, this number is the read N50; for the other libraries, it denotes the read length.

DNA extraction and genome sequencing

Genomic DNA (gDNA) was extracted from muscle tissue using a QIAamp DNA Mini Kit (Qiagen, Valencia, CA, USA) following the manufacturer’s protocols18. Fragment size, purity, and quantification of the extracted gDNA were assessed via 0.75% agarose gel electrophoresis, an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA) and a Qubit Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA), respectively.

For the MGI short-read sequencing, gDNA was randomly fragmented and a library with an insert size of 350 bp was constructed using an MGIEasy Universal DNA Library Preparation Kit (MGI, Shenzhen, China). Sequencing was performed on a DNBSEQ-T7 platform (MGI), generating 37.41 Gb of raw 150-bp paired-end reads, which were then filtered with fastp (v0.12.6)19 (parameters: -n 0 -f 5 -F 5 -t 5 -T 5) to remove adaptor sequences and low-quality reads. Finally, a total of 33.69 Gb of clean reads (Table 1) were obtained for further data error correction and genome-size estimation.

For the PacBio HiFi sequencing, approximately 10 μg of high-quality gDNA was used to construct a SMRTbell library following the manufacturer’s standard protocol (SMRTbell Express Template Prep Kit 2.0; Pacific Biosciences, Menlo Park, CA, USA), which was then sequenced on a PacBio Sequel II System using the circular consensus sequencing (CCS) technology. A total of 90.47 Gb of HiFi reads with an N50 of 18,366 bp were obtained (Table 1) using the CCS software (v6.0.0)20 with the parameter “--min-passes 3”.

Two ultra-long read libraries were constructed using Oxford Nanopore Technologies (ONT) protocols and sequenced on a PromethION platform (Oxford Nanopore Technologies Co., Littlemore, Oxford, UK). Raw reads with a quality value (QV) lower than 7 were removed using NanoFilt (v2.8.0)21. Finally, a total of 1.54 million clean reads were retained, amounting to 61.32 Gb. The average read length was 39.69 kb, with an N50 length of 71.17 kb (Table 1).

For the high-throughput chromosome conformation capture (Hi-C) sequencing, one Hi-C library was generated using a GrandOmics Hi-C kit (GrandOmics, Wuhan, Hubei, China) following the manufacturer’s protocol. In brief, gDNA was first cross-linked using a 4% formaldehyde solution to stabilize chromatin structures. The DNA was then digested with the restriction enzyme MboI, and the resulting fragment ends were labeled with biotin-14-dCTP. The labeled DNA fragments were ligated using T4 DNA ligase to facilitate subsequent enrichment steps. Following ligation, the DNA was further digested to yield fragments in the size range of 200 to 600 bp. The library was sequenced on a DNBSEQ-T7 platform (MGI, Shenzhen, China) in 150-bp paired-end mode, generating 102.32 Gb of raw data. Subsequently, fastp (v0.12.6)19 was applied to filter adaptor sequences and low-quality reads. Finally, 93.8 Gb of Hi-C clean data were retained (Table 1) for chromosome assembly.

RNA extraction and transcriptome sequencing (RNA-seq)

Total RNA was extracted from each of the seven tissues separately according to a standard Trizol protocol (Invitrogen, Frederick, MD, USA), followed by purification with a Qiagen RNeasy Mini Kit (Qiagen, Germantown, MD, USA). RNA concentration and integrity were measured using a NanoDrop 8000 Spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), respectively. Only RNA samples with OD260/280 ≥ 1.8 and RNA integrity ≥ 7.0 were selected for transcriptome sequencing. cDNA libraries were constructed following the manufacturer’s guidelines and then sequenced on a HiSeq X Ten platform (Illumina, San Diego, CA, USA). A total of 48.07 Gb of transcriptome raw data were generated (Table 1), which aided in annotation of protein-coding genes and prediction of gene structures.

Genome-size estimation and construction of a T2T genome assembly

To estimate the genome size of Asian seabass, we employed jellyfish (v2.2.10)22 to perform k-mer counting with k = 21, using the parameters ‘-m 21 -s 10G -C’. The resulting histogram was used as the input file for GenomeScope (v2.0)23, which provided a sequence-derived estimate of the genome characteristics prior to assembly. This analysis estimated the genome size of Asian seabass at approximately 576.74 Mb, with a heterozygosity of about 0.46% (Fig. 1B) and repetitive sequences accounting for 32.79 Mb (5.69%).
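The idea behind a k-mer based size estimate can be illustrated with a simplified calculation: the total number of k-mer occurrences divided by the depth of the main (homozygous) peak approximates the haploid genome size. The sketch below is only an illustration of that peak-based rule and is not the mixture-model fit that GenomeScope performs; the function name, the error cutoff `min_depth`, and the toy histogram are all assumptions made for the example.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Peak-based genome size estimate from a k-mer depth histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers seen at that depth.
    min_depth: depths below this value are treated as sequencing errors (assumed cutoff).
    Returns (estimated_size, peak_depth).
    """
    # Discard low-depth k-mers, which mostly arise from sequencing errors.
    solid = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    # The main peak is the depth shared by the largest number of distinct k-mers.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences = sum over depths of (depth * distinct k-mers).
    total_kmers = sum(depth * count for depth, count in solid.items())
    return total_kmers / peak_depth, peak_depth


# Synthetic histogram (not real data): an error tail, a main peak at depth 50,
# and a small repeat peak at depth 100.
toy_histogram = {1: 900_000, 2: 50_000, 48: 2_000, 49: 6_000,
                 50: 10_000, 51: 6_000, 52: 2_000, 100: 500}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

In practice the histogram would come from the jellyfish output, and a model-based tool is preferable because it also separates heterozygous and repetitive k-mers.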

Primary contigs were initially generated by assembling PacBio HiFi and ONT data using Hifiasm (v0.19.8)24 with default parameters. Then, purge_dups (v1.2.5)25 was employed to remove haplotypic and heterozygous duplications from the de novo assembly, yielding a preliminary assembly with a total length of 614.08 Mb.

Using the preliminary assembly as the reference, Hi-C clean reads were utilized to construct chromosomes for Asian seabass. First, the Hi-C reads were mapped to the assembled contigs using bowtie2 (v2.2.5)26 with the parameters “--very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end”. Subsequently, the HiC-Pro (v2.8.1)27 pipeline was applied to detect ligation products, retaining only valid paired reads for downstream analysis. Based on these valid reads, the primary assembly was clustered, ordered, and oriented into chromosomes using the Juicer (v1.5)28 and 3D-DNA (v3.0)29 software with the parameters “-m haploid -r 2 -c 24”. Juicebox (v1.11.08)30 was employed to visualize the candidate assemblies, which were then manually adjusted.

To fill the remaining gaps, the corrected ultra-long ONT reads were applied to generate a gap-free genome assembly using TGS-GapCloser (v1.2.1)31 with the parameters “--min_match 1000 --min_nread 3” and LR_Gapcloser (v1.0)32 with the parameters “-t 35 -m 1000000 -v 500”. The final genome assembly spans 614.19 Mb and is anchored onto 24 chromosomes (Fig. 2), of which the longest and the shortest are 31.85 Mb and 14.86 Mb, respectively (Tables 2 and 3).

Fig. 2.


The first T2T genome assembly of Asian seabass. (A) Genome-wide chromatin interactions at a 500-kb resolution. Color blocks represent corresponding interactions, with various strengths from yellow (low) to red (high). (B) A Circos plot of the main genome features. From outside to inside include the 24 chromosomes, gene density, GC content, repetitive sequences density, and a colinear relationship among chromosomes of the Asian seabass genome assembly. Note that the density calculation window is set as 100 kb.

Table 2.

Comparison of the available genome assemblies for Asian seabass.

| Category | This study | L. calcarifer (ASB-BC8)17 |
|---|---|---|
| Genome survey (Mb) | 576.74 | 593–648 |
| Genome length (bp) | 614,195,649 | 668,464,831 |
| Longest scaffold (bp) | 31,852,513 | 30,776,907 |
| Number of scaffolds | 24 | 3,807 |
| Contig N50 (bp) | 26,575,253 | 1,066,117 |
| Scaffold N50 (bp) | 26,575,253 | 25,848,596 |
| GC content | 40.7% | 40.8% |
| BUSCO | 100% (S:99.84%; D:0.16%) | 99.7% (S:96%; D:3.7%) |
| Number of chromosomes | 24 | 24 |
| Chromosome length (bp) | 614,195,649 | 586,924,032 |
| Repetitive sequence | 18.18% | / |

Abbreviations: S, single copy complete genes; D, duplicated complete genes.
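As a worked check of the contiguity metrics, the N50 can be recomputed from the 24 chromosome lengths listed in Table 3. The following is a minimal sketch; the function name `n50` is ours. Because every chromosome consists of a single gap-free contig, the contig N50 and scaffold N50 coincide.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be a non-empty list of positive integers")


# Chromosome lengths (bp) for Chr01..Chr24, copied from Table 3.
chromosome_lengths = [
    31_852_513, 31_638_724, 29_918_513, 29_558_833, 29_570_087, 29_199_514,
    29_179_243, 27_751_246, 27_635_561, 26_717_877, 26_575_253, 26_190_281,
    25_913_521, 25_614_823, 25_420_547, 25_111_693, 23_846_329, 23_429_025,
    22_557_243, 21_388_025, 21_383_011, 19_598_755, 19_288_435, 14_856_597,
]

print(sum(chromosome_lengths))  # 614195649, the reported genome length
print(n50(chromosome_lengths))  # 26575253, the length of Chr11
```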

Identification of the centromere and telomere sequences

Telomeres were identified by searching for the telomeric repeat (CCCTAA/TTAGGG) at both ends of each chromosome using the Telomere-to-Telomere Toolkit quarTeT (v1.1.1)33. Centromeres, the specialized DNA regions that connect sister chromatids, have complex structures in most animals and plants, consisting of highly repetitive satellite DNA and scattered retrotransposon sequences. In this study, after repeat sequences had been identified with TRF (v4.0.4)34 and RepeatMasker (v4.0.6)35 to obtain a TE annotation file, quarTeT was applied to identify centromeres and to predict the candidate interval of each one. Ultimately, we determined that the Asian seabass genome assembly contains a complete set of 24 centromeres and 48 telomeres (Table 3; Fig. 3).
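The telomere search can be pictured as scanning each chromosome end for tandem runs of the hexamer CCCTAA or its reverse complement TTAGGG. The sketch below is a simplified illustration of that idea and is not the quarTeT implementation; the function name, the minimum of two consecutive copies, and the `window` size are assumptions chosen for the example.

```python
import re

# Two or more consecutive copies of either telomeric hexamer (assumed threshold).
TELOMERE_RE = re.compile(r"(?:CCCTAA){2,}|(?:TTAGGG){2,}")


def telomere_spans(sequence, window=50_000):
    """Report 1-based (start, end) spans of telomeric repeat runs found
    within `window` bp of the upstream and downstream chromosome ends."""
    seq = sequence.upper()
    n = len(seq)
    head_end = min(window, n)
    # Keep the two search regions from overlapping on short sequences.
    tail_start = max(n - window, head_end)
    spans = {"upstream": [], "downstream": []}
    for match in TELOMERE_RE.finditer(seq, 0, head_end):
        spans["upstream"].append((match.start() + 1, match.end()))
    for match in TELOMERE_RE.finditer(seq, tail_start, n):
        spans["downstream"].append((match.start() + 1, match.end()))
    return spans


# Toy chromosome (not real data): telomeric repeats flanking a 40-bp body.
toy_chromosome = "CCCTAA" * 5 + "ACGT" * 10 + "TTAGGG" * 4
print(telomere_spans(toy_chromosome, window=30))
```

A chromosome would count as telomere-to-telomere under this simplified rule when both the upstream and downstream lists are non-empty.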

Table 3.

Telomere and centromere positions in the assembled genome.

| Chr | Contig | Length (bp) | Gap | Upstream Te start | Upstream Te end | Downstream Te start | Downstream Te end | Ce start | Ce end |
|---|---|---|---|---|---|---|---|---|---|
| Chr01 | 1 | 31,852,513 | 0 | 251 | 3,343 | 31,848,788 | 31,852,328 | 1,505,150 | 1,784,594 |
| Chr02 | 1 | 31,638,724 | 0 | 29 | 1,056 | 31,638,090 | 31,638,401 | 4,132,572 | 4,235,565 |
| Chr03 | 1 | 29,918,513 | 0 | 64 | 6,134 | 29,917,458 | 29,918,485 | 27,923,600 | 28,634,341 |
| Chr04 | 1 | 29,558,833 | 0 | 493 | 3,991 | 29,553,555 | 29,558,758 | 16,763,413 | 16,977,890 |
| Chr05 | 1 | 29,570,087 | 0 | 38 | 7,538 | 29,569,431 | 29,570,087 | 28,554,835 | 29,493,014 |
| Chr06 | 1 | 29,199,514 | 0 | 63 | 5,031 | 29,198,324 | 29,199,475 | 9,609 | 779,212 |
| Chr07 | 1 | 29,179,243 | 0 | 5 | 3,891 | 29,179,155 | 29,179,208 | 28,437,506 | 28,968,124 |
| Chr08 | 1 | 27,751,246 | 0 | 7 | 6,788 | 27,747,489 | 27,751,129 | 1,198,505 | 1,281,772 |
| Chr09 | 1 | 27,635,561 | 0 | 3 | 5,138 | 27,632,115 | 27,635,202 | 24,779,176 | 24,919,240 |
| Chr10 | 1 | 26,717,877 | 0 | 44 | 3,433 | 26,713,583 | 26,717,829 | 11,069,449 | 11,101,393 |
| Chr11 | 1 | 26,575,253 | 0 | 102 | 4,914 | 26,568,413 | 26,575,161 | 23,814,253 | 24,022,695 |
| Chr12 | 1 | 26,190,281 | 0 | 3 | 4,913 | 26,185,410 | 26,190,261 | 1,972,765 | 2,010,536 |
| Chr13 | 1 | 25,913,521 | 0 | 2 | 2,370 | 25,908,749 | 25,913,302 | 23,680,651 | 23,817,342 |
| Chr14 | 1 | 25,614,823 | 0 | 6 | 4,500 | 25,609,695 | 25,614,798 | 130,448 | 540,036 |
| Chr15 | 1 | 25,420,547 | 0 | 26 | 3,571 | 25,420,129 | 25,420,291 | 80,051 | 483,348 |
| Chr16 | 1 | 25,111,693 | 0 | 3 | 4,038 | 25,065,727 | 25,111,549 | 1,479,578 | 1,691,299 |
| Chr17 | 1 | 23,846,329 | 0 | 304 | 4,318 | 23,841,449 | 23,846,329 | 20,554,022 | 20,584,804 |
| Chr18 | 1 | 23,429,025 | 0 | 380 | 4,462 | 23,428,592 | 23,428,950 | 118,946 | 609,424 |
| Chr19 | 1 | 22,557,243 | 0 | 558 | 3,151 | 22,550,858 | 22,557,068 | 1,711,822 | 1,903,333 |
| Chr20 | 1 | 21,388,025 | 0 | 110 | 5,748 | 21,330,373 | 21,388,021 | 20,427,089 | 21,220,047 |
| Chr21 | 1 | 21,383,011 | 0 | 29 | 5,745 | 21,378,941 | 21,382,968 | 20,010,214 | 20,126,177 |
| Chr22 | 1 | 19,598,755 | 0 | 4 | 3,098 | 19,593,794 | 19,598,751 | 194,577 | 544,417 |
| Chr23 | 1 | 19,288,435 | 0 | 148 | 3,373 | 19,285,242 | 19,288,431 | 17,836,318 | 17,972,547 |
| Chr24 | 1 | 14,856,597 | 0 | 456 | 28,012 | 14,852,448 | 14,856,551 | 405,579 | 524,259 |

Te, telomere; Ce, centromere. All positions are in bp.

Fig. 3.


Genome-wide localization of repetitive elements (REs), telomeres and centromeres. The triangles at both ends of each chromosome represent the telomere regions, and the gully area within each chromosome stands for the centromere region.

Annotation of repeat elements

For prediction of repetitive elements (REs), tandem repeats were first annotated using TRF (v4.0.4)34 and GMATA (v2.2)36. GMATA was employed to identify simple sequence repeats (SSRs), whereas TRF was used to recognize all tandem REs across the entire genome.

Transposable elements (TEs) in the assembled genome were predicted using a combination of homology-based and de novo methods. For the homology approach, TEs were identified using RepeatMasker (v4.0.6) and RepeatProteinMask (v4.0.6)35. For the de novo approach, RepeatModeler (v1.0.8)37 and LTR_FINDER (v1.0.6)38 were employed to generate a de novo repeat library, and RepeatMasker was applied to annotate REs against this repeat library. The annotation results of all repetitive sequences were merged into a comprehensive dataset, which revealed 111.64 Mb of repetitive sequences, accounting for 18.18% of the assembled Asian seabass genome (Fig. 3). The most abundant repetitive elements were DNA transposons at 9.00% (55.26 Mb), followed by long interspersed nuclear elements (LINEs) at 2.89% (17.76 Mb) and long terminal repeats (LTRs) at 2.46% (15.08 Mb) (see Table 4).

Table 4.

Classification of repetitive sequences in Asian seabass genome.

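| Class | Type | Length (bp) | Count | % of genome |
|---|---|---|---|---|
| Dispersed repeats | DNA transposons | 55,263,108 | 477,149 | 9.00 |
| Dispersed repeats | Retroelements: LINE | 17,763,761 | 124,264 | 2.89 |
| Dispersed repeats | Retroelements: LTR | 15,078,735 | 141,007 | 2.46 |
| Dispersed repeats | Retroelements: SINE | 2,414,481 | 20,229 | 0.39 |
| Dispersed repeats | Unclassified | 3,985,113 | 26,905 | 0.65 |
| Tandem repeats | Simple repeats | 1,766,456 | 149,486 | 0.29 |
| Tandem repeats | Satellites | 3,174,653 | 50,007 | 0.52 |
| Unknown | | 12,194,948 | 95,844 | 1.98 |
| Total | | 111,641,255 | 1,084,891 | 18.18 |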
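The percentages in Table 4 follow directly from dividing each class length by the assembled genome length (614,195,649 bp). The short sketch below recomputes the totals as a sanity check; the dictionary keys are labels chosen for this example.

```python
GENOME_LENGTH = 614_195_649  # assembled genome length in bp (Table 2)

# Repeat class lengths in bp, copied from Table 4.
repeat_lengths = {
    "DNA transposons": 55_263_108,
    "LINE": 17_763_761,
    "LTR": 15_078_735,
    "SINE": 2_414_481,
    "Unclassified": 3_985_113,
    "Simple repeats": 1_766_456,
    "Satellites": 3_174_653,
    "Unknown": 12_194_948,
}


def percent_of_genome(length_bp, genome_length=GENOME_LENGTH):
    """Express a length in bp as a percentage of the genome, to two decimals."""
    return round(100 * length_bp / genome_length, 2)


total_repeat_bp = sum(repeat_lengths.values())
print(total_repeat_bp, percent_of_genome(total_repeat_bp))
for name, length in repeat_lengths.items():
    print(f"{name}: {percent_of_genome(length)}%")
```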

Prediction and functional annotation of protein-coding genes

Repetitive regions of the assembled genome were masked prior to prediction of genes and their structures. Protein-coding genes were annotated by a combination of three methods: de novo, homology-based and RNA-seq-based annotation. First, AUGUSTUS (v3.2.1)39 and GlimmerHMM (v3.0.4)40 were employed to perform the ab initio gene structure prediction. Second, GeMoMa (v1.6.4)41 was applied for the homology-based prediction. We aligned homologous proteins from five representative fish species, including Epinephelus fuscoguttatus (brown-marbled grouper, GCA_011397635.1), Epinephelus moara (kelp grouper, GCA_006386435.1), Lates japonicus (Japanese lates, GCA_033238685.1), Perca flavescens (yellow perch, GCA_004354835.1) and Sebastes umbrosus (honeycomb rockfish, GCA_015220745.1), all downloaded from the NCBI. Third, the RNA-seq data from seven tissues were assembled into contigs using Trinity (v2.5.1)42, and then gene structures were identified using PASA (v2.3.3)43. Finally, gene sets were integrated by the Evidence Modeler (EVM) pipeline (v1.0)44.

A total of 25,093 protein-coding genes were annotated, with an average gene length of 13.82 kb and an average coding sequence (CDS) length of 1,721.49 bp (Table 5). Protein-coding genes were evaluated using BUSCO with the actinopterygii_odb10 database as the reference, and 98.8% of complete BUSCOs were identified within the predicted protein-coding genes.

Table 5.

Summary of the predicted gene structures using three methods.

| Method | Software/Species | Number | Average gene length (bp) | Average CDS length (bp) | Average exon length (bp) | Average intron length (bp) | Average exons per gene |
|---|---|---|---|---|---|---|---|
| De novo | Augustus | 24,459 | 14,516.87 | 1,741.71 | 159.82 | 1,290.65 | 10.9 |
| De novo | Glimmer | 39,727 | 14,042.91 | 1,020.96 | 161.29 | 2,443.21 | 6.33 |
| Homolog | E. moara | 50,987 | 23,273.19 | 1,658.37 | 180.14 | 2,634.1 | 9.21 |
| Homolog | L. japonicus | 51,010 | 19,380.37 | 1,656.83 | 179.59 | 2,154.73 | 9.23 |
| Homolog | E. fuscoguttatus | 52,748 | 23,530.9 | 1,656.82 | 180.53 | 2,674.84 | 9.18 |
| Homolog | P. flavescens | 53,847 | 24,587.81 | 1,638.19 | 181.46 | 2,858.73 | 9.03 |
| Homolog | S. umbrosus | 53,366 | 24,022.61 | 1,673.48 | 185.4 | 2,784.43 | 9.03 |
| RNA-seq | PASA | 21,217 | 16,814.96 | 3,699.28 | 302.27 | 1,167.04 | 12.24 |
| Integrated | EVM | 25,093 | 13,819.54 | 1,721.49 | 168.93 | 1,316.4 | 10.19 |
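The columns of Table 5 are related by simple arithmetic, which gives a quick plausibility check: the average CDS length is roughly the number of exons per gene times the average exon length, and the average gene length is roughly the CDS plus one intron between each pair of adjacent exons. The sketch below applies this to the integrated EVM row. It is an approximation under stated assumptions (untranslated regions are ignored, and products of averages are used in place of per-gene values), so only close agreement, not equality, is expected.

```python
def reconstruct_lengths(exons_per_gene, exon_length, intron_length):
    """Approximate average CDS and gene lengths from per-gene averages.

    Assumes a gene with n exons has n - 1 introns and ignores UTRs.
    """
    cds = exons_per_gene * exon_length
    gene = cds + (exons_per_gene - 1) * intron_length
    return cds, gene


# Integrated (EVM) row of Table 5.
reported_cds, reported_gene = 1721.49, 13819.54
approx_cds, approx_gene = reconstruct_lengths(10.19, 168.93, 1316.4)

print(f"CDS:  approx {approx_cds:.2f} bp vs reported {reported_cds} bp")
print(f"gene: approx {approx_gene:.2f} bp vs reported {reported_gene} bp")
```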

Functional annotation of the protein-coding genes was performed using Blastp (v2.2.26)45, which aligned deduced protein sequences against five public databases including NCBI Non-Redundant Protein Sequence (NR), SwissProt46, Gene Ontology (GO)47, Kyoto Encyclopedia of Genes and Genomes (KEGG)48 and EuKaryotic Orthologous Groups (KOG)49, with an E-value cutoff of <1e−5. Ultimately, 23,711 protein-coding genes (94.49% of the total predicted genes) were functionally annotated, with at least one hit for each gene in the searched databases (Table 6).

Table 6.

Functional annotation of predicted protein-coding genes.

| Database | Number | Percentage (%) |
|---|---|---|
| Total | 25,093 | 100 |
| NR | 23,699 | 94.44 |
| Swissprot | 21,269 | 84.76 |
| KEGG | 16,838 | 67.10 |
| GO | 16,260 | 64.80 |
| KOG | 15,510 | 61.81 |
| Overall | 23,711 | 94.49 |

Overall represents the total number of annotated genes with at least one hit from the five searched databases.
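The "Overall" row is a set union: a gene counts once if it has a hit in any of the five databases, so the overall figure is at least as large as the best single database and never exceeds the total gene count. A minimal sketch with hypothetical gene identifiers (the IDs and the function name are illustrative only) shows the calculation, followed by a check of the reported overall percentage.

```python
def annotation_summary(hits_by_database, total_genes):
    """Count genes with at least one hit across databases (set union)
    and return (annotated_count, percentage_of_total)."""
    annotated = set().union(*hits_by_database.values())
    return len(annotated), round(100 * len(annotated) / total_genes, 2)


# Hypothetical hits for a toy gene set of five genes (g1..g5).
toy_hits = {
    "NR": {"g1", "g2", "g3"},
    "SwissProt": {"g1", "g2"},
    "KEGG": {"g2", "g4"},
}
toy_count, toy_percent = annotation_summary(toy_hits, total_genes=5)
print(toy_count, toy_percent)

# Percentage reported in Table 6 for the real data: 23,711 of 25,093 genes.
overall_percent = round(100 * 23_711 / 25_093, 2)
print(overall_percent)
```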

Data Records

Files of the MGI, PacBio, ONT, Hi-C and transcriptome sequencing, and the assembled genome for Asian seabass were deposited at NCBI under the accession number PRJNA1245135. Raw reads are available in the Sequence Read Archive (SRA)50 under the accession numbers SRR32997291 to SRR32997305. The genome assembly, predicted coding sequences and functional annotation files of Asian seabass were stored in Figshare (No: m9.figshare.28735226)51. The genome assembly has also been deposited at NCBI/GenBank52 under the accession number GCA_051027255.1.

Technical Validation

To evaluate the quality of our genome assembly, we employed four approaches. First, BUSCO (v5.2.2)53 was employed to examine completeness. A total of 100% (single copy complete genes (S): 99.84%, duplicated complete genes (D): 0.16%) of complete BUSCOs in the actinopterygii_odb10 database were identified. Second, Merqury v1.32854 was applied to estimate the base-level accuracy and completeness on the basis of k-mer counts (generated from MGI short reads and PacBio HiFi reads), resulting in QV values of 40.59 and 57.80, respectively. Third, Clipping information for Revealing Assembly Quality (CRAQ, v1.09)55 was used to assess the accuracy of our genome assembly based on PacBio HiFi and MGI short reads, resulting in an R-AQI (assembly quality indicator) of 98.42 and an S-AQI of 99.45. Fourth, we mapped the sequencing data to the assembled genome using bwa (v0.7.17)56 and minimap2 (v2.26)57, which showed mapping rates of 99.46% for the MGI data, 99.99% for the PacBio data, and 98.43% for the ONT data. These results collectively support the high quality of the Asian seabass genome assembly. The BUSCO completeness value was calculated to be 98.8% for the predicted protein-coding genes of Asian seabass (Table 7). To further evaluate the quality of these predicted protein-coding genes, we aligned the transcriptome data to the assembled genome using STAR (v2.7.11b)58, and then calculated the exonic coverage rate with bedtools (v2.29.2)59. We observed that 94.71% of the exonic regions were covered by sequencing reads, indicating high annotation accuracy (see Table 7).
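For readers less familiar with consensus quality values, a QV is a Phred-scaled error rate, QV = −10·log10(E), so the reported QV of 57.80 corresponds to roughly 1.7 errors per million bases. The sketch below converts between the two and also shows a k-mer based estimate of E in the style described for Merqury, as we understand it: the fraction of assembly k-mers also found in the reads, raised to the power 1/k, gives the per-base probability of being correct. The function names, the k-mer counts, and k = 21 are assumptions for illustration and are not values from this study.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def kmer_based_qv(assembly_only_kmers, total_assembly_kmers, k=21):
    """Estimate QV from k-mers found only in the assembly (not in the reads).

    A k-mer is correct only if all k of its bases are correct, so the
    per-base correctness is the shared k-mer fraction to the power 1/k.
    """
    shared_fraction = 1 - assembly_only_kmers / total_assembly_kmers
    per_base_correct = shared_fraction ** (1 / k)
    return -10 * math.log10(1 - per_base_correct)


print(f"QV 57.80 -> error rate {qv_to_error_rate(57.80):.3e}")
print(f"QV 40.59 -> error rate {qv_to_error_rate(40.59):.3e}")

# Hypothetical counts: 21 assembly-only k-mers out of 1,000,000.
example_qv = kmer_based_qv(21, 1_000_000, k=21)
print(f"example QV = {example_qv:.2f}")
```

The difference between the two reported QV values reflects the read set used to build the k-mer database, so the two numbers are not directly comparable with each other.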

Table 7.

Assessment metrics of the genome assembly and annotation.

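| Type | Evaluation method | Result |
|---|---|---|
| Genome accuracy and completeness | Mapping rate, short reads | 99.46% |
| Genome accuracy and completeness | Mapping rate, HiFi reads | 99.99% |
| Genome accuracy and completeness | Mapping rate, ONT reads | 98.43% |
| Genome accuracy and completeness | QV, short reads | 40.59 |
| Genome accuracy and completeness | QV, HiFi reads | 57.80 |
| Genome accuracy and completeness | CRAQ, R-AQI | 98.42% |
| Genome accuracy and completeness | CRAQ, S-AQI | 99.45% |
| Genome accuracy and completeness | BUSCO | 100% |
| Annotation quality | Complete BUSCOs | 98.8% (3,599) |
| Annotation quality | Complete and single-copy BUSCOs (S) | 98.2% (3,576) |
| Annotation quality | Complete and duplicated BUSCOs (D) | 0.6% (23) |
| Annotation quality | Fragmented BUSCOs (F) | 0 (0) |
| Annotation quality | Missing BUSCOs (M) | 1.2% (41) |
| Annotation quality | RNA-seq coverage ratio of the exonic regions | 94.71% |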
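The exonic coverage ratio in Table 7 is the fraction of annotated exon bases that overlap at least one aligned RNA-seq read. The sketch below illustrates that calculation on a single hypothetical chromosome with half-open (BED-style) intervals; it is a simplified stand-in for the bedtools step, not the actual pipeline, and the function names and toy coordinates are assumptions.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def exonic_coverage(exons, alignments):
    """Fraction of exonic bases covered by at least one alignment."""
    exon_blocks = merge_intervals(exons)
    read_blocks = merge_intervals(alignments)
    total = sum(end - start for start, end in exon_blocks)
    covered = 0
    first = 0  # index of the first read block that may still overlap an exon
    for start, end in exon_blocks:
        # Skip read blocks that end before this exon begins.
        while first < len(read_blocks) and read_blocks[first][1] <= start:
            first += 1
        idx = first
        # Add the overlap with every read block that starts inside the exon span.
        while idx < len(read_blocks) and read_blocks[idx][0] < end:
            covered += min(end, read_blocks[idx][1]) - max(start, read_blocks[idx][0])
            idx += 1
    return covered / total


# Toy coordinates (not real data): two 100-bp exons and three alignments.
toy_exons = [(0, 100), (200, 300)]
toy_alignments = [(50, 150), (140, 220), (290, 400)]
print(exonic_coverage(toy_exons, toy_alignments))
```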

Acknowledgements

This work was supported by Shenzhen Natural Science Foundation (no. JCYJ20241202124511016) and National Key Research and Development Program of China (no. 2022YFE0139700).

Author contributions

Q.S. conceived and designed the study. X.Z., J.W. and J.C. collected the samples. X.Z., J.C. and J.W. performed data analysis. J.W. and W.Z. conducted experiments for species identification. X.Z. and J.W. wrote the manuscript. Q.S. revised the manuscript. All authors read and approved the final manuscript for publication.

Code availability

The versions and parameters of the bioinformatics tools applied in this study are described in the Methods section. Where no parameters are given, default settings were used. No custom code was used.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Jiufu Wen, Email: nhswjf@163.com.

Qiong Shi, Email: shiqiong@szu.edu.cn, Email: shiqiong@genomics.cn.

References

  • 1.Gamble, T. et al. Sex determination. Current Biology22(8), 257–262 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Penman, D. J. et al. Fish gonadogenesis. Part I: genetic and environmental mechanisms of sex determination. Reviews in Fisheries Science16(sup1), 16–34 (2008). [Google Scholar]
  • 3.Devlin, R. H. et al. Sex determination and sex differentiation in fish: an overview of genetic, physiological, and environmental influences. Aquaculture208(3-4), 191–364 (2002). [Google Scholar]
  • 4.Piferrer, F. et al. Genetic, endocrine, and environmental components of sex determination and differentiation in the European sea bass (Dicentrarchus labrax L.). General and comparative endocrinology142(1-2), 102–110 (2005). [DOI] [PubMed] [Google Scholar]
  • 5.Baroiller, J. F. et al. Tilapia sex determination: where temperature and genetics meet. Comparative Biochemistry and Physiology Part A: Molecular & Integrative Physiology153(1), 30–38 (2009). [DOI] [PubMed] [Google Scholar]
  • 6.Palaiokostas, C. et al. Mapping the sex determination locus in the Atlantic halibut (Hippoglossus hippoglossus) using RAD sequencing. BMC genomics14, 1–12 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Hughes, V. et al. Effect of rearing temperature on sex ratio in juvenile Atlantic halibut, Hippoglossus hippoglossus. Environmental biology of fishes81, 415–419 (2008). [Google Scholar]
  • 8.Avise, J. C. et al. Evolutionary perspectives on hermaphroditism in fishes. Sexual Development3(2-3), 152–163 (2009). [DOI] [PubMed] [Google Scholar]
  • 9.Kuwamura, T. et al. Sex change of primary males in a diandric labrid Halichoeres trimaculatus: coexistence of protandry and protogyny within a species. Journal of Fish Biology70(6), 1898–1906 (2007). [Google Scholar]
  • 10.Li, S. et al. Mechanisms of sex differentiation and sex reversal in hermaphrodite fish as revealed by the Epinephelus coioides genome. Molecular Ecology Resources23(4), 920–932 (2023). [DOI] [PubMed] [Google Scholar]
  • 11.Zhang, K. et al. A telomere-to-telomere genome assembly of the protandrous hermaphrodite blackhead seabream, Acanthopagrus schlegelii. Scientific Data12(1), 350 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Casas, L. et al. Sex change in clownfish: molecular insights from transcriptome analysis. Scientific Reports6(1), 35461 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Cheng, H. et al. The rice field eel as a model system for vertebrate sexual development. Cytogenetic and Genome Research101(3-4), 274–277 (2003). [DOI] [PubMed] [Google Scholar]
  • 14.Yue, G. H. et al. Genomic resources and their applications in aquaculture of Asian seabass (Lates calcarifer). Reviews in Aquaculture15(2), 853–871 (2023). [Google Scholar]
  • 15.Athauda, S. et al. Effect of rearing water temperature on protandrous sex inversion in cultured Asian Seabass (Lates calcarifer). General and Comparative Endocrinology175(3), 416–423 (2012). [DOI] [PubMed] [Google Scholar]
  • 16.Jerry, D. R. Biology and culture of Asian seabass Lates calcarifer. CRC Press (2013).
  • 17.Vij, S. et al. Chromosomal-level assembly of the Asian seabass genome using long sequence reads and multi-layered scaffolding. PLoS Genetics12(4), e1005954 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Mei, L. et al. Evaluation of QIAamp® DNA Stool Mini Kit for ecological studies of gut microbiota. Journal of Microbiological Methods54(1), 13–20 (2003). [DOI] [PubMed] [Google Scholar]
  • 19.Chen S. et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics34(17), i884–i890. [DOI] [PMC free article] [PubMed]
  • 20.Rhoads, A. et al. PacBio Sequencing and Its Applications. Genomics Proteomics & Bioinformatics13, 278–289 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.De Coster, W. et al. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics34(15), 2666–2669 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Marçais, G. et al. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics27(6), 764–770 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics33(14), 2202–2204 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Cheng, H. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods18, 170–175 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Roach, M. J. et al. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics19, 1–10 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Langmead, B. et al. Fast gapped-read alignment with Bowtie 2. Nature Methods9(4), 357–359 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Dekker, J. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology16, 259 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Systems 3(1), 95–98 (2016).
  • 29. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
  • 30. Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Systems 3, 99–101 (2016).
  • 31. Xu, M. et al. TGS-GapCloser: a fast and accurate gap closer for large genomes with low coverage of error-prone long reads. GigaScience 9(9), giaa094 (2020).
  • 32. Xu, G. C. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. GigaScience 8(1), giy157 (2019).
  • 33. Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Horticulture Research 10(8), uhad127 (2023).
  • 34. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27(2), 573–580 (1999).
  • 35. Tarailo-Graovac, M. et al. Using RepeatMasker to identify repetitive elements in genomic sequences. Current Protocols in Bioinformatics Chapter 4, 4–10 (2009).
  • 36. Wang, X. & Wang, L. GMATA: an integrated software package for genome-scale SSR mining, marker development and viewing. Frontiers in Plant Science 7, 1350 (2016).
  • 37. Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences of the United States of America 117, 9451–9457 (2020).
  • 38. Xu, Z. et al. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35, W265–268 (2007).
  • 39. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Research 34, W435–439 (2006).
  • 40. Majoros, W. H. et al. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20(16), 2878–2879 (2004).
  • 41. Keilwagen, J. et al. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods in Molecular Biology 1962, 161–177 (2019).
  • 42. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols 8, 1494–1512 (2013).
  • 43. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31(19), 5654–5666 (2003).
  • 44. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9, R7 (2008).
  • 45. Altschul, S. F. et al. Basic local alignment search tool. Journal of Molecular Biology 215(3), 403–410 (1990).
  • 46. Bairoch, A. et al. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research 28(1), 45–48 (2000).
  • 47. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nature Genetics 25(1), 25–29 (2000).
  • 48. Kanehisa, M. et al. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Research 44(D1), D457–D462 (2016).
  • 49. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
  • 50. NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP576768 (2025).
  • 51. Zhang, X. Genome assembly, predicted coding sequences and functional annotation files of L. calcarifer. Figshare. https://doi.org/10.6084/m9.figshare.28735226 (2025).
  • 52. NCBI GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_051027255.1 (2025).
  • 53. Simao, F. A. et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
  • 54. Rhie, A. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21(1), 245 (2020).
  • 55. Li, K. et al. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nature Communications 14(1), 6556 (2023).
  • 56. Li, H. et al. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009).
  • 57. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18), 3094–3100 (2018).
  • 58. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1), 15–21 (2013).
  • 59. Quinlan, A. R. et al. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6), 841–842 (2010).

Associated Data

Data Availability Statement

The versions and parameters of the bioinformatics tools used in this study are described in the Methods section. Where no parameters are specified, default settings were used. No custom code was used.


Articles from Scientific Data are provided here courtesy of Nature Publishing Group
