Scientific Data. 2025 Jan 15;12:74. doi: 10.1038/s41597-025-04418-w

Near complete genome assembly of Yadong trout (Salmo trutta)

Chen Li 1,2,3,#, Shenglei Han 1,3,#, Shuo Li 1,3, Kaiqiang Liu 1,3,4, Yuyan Liu 1,3,4, Hong-yan Wang 1,3,4, Qian Wang 1,3,4, Changlin Liu 1,3, Changwei Shao 1,3,4
PMCID: PMC11735641  PMID: 39814780

Abstract

The Yadong trout (Salmo trutta), a species endemic to the Yatung River in Tibet, China, was classified as a second-class protected species in the 20th century. It is now considered one of the most important fishery resources in China. In this study, we assembled a near-complete genome of S. trutta by integrating PacBio HiFi, Hi-C, and ONT sequencing technologies. The genome assembly spans 2.49 Gb, with 96.87% of the sequence anchored onto 40 chromosomes. A total of 12 chromosomes were assembled without gaps, and 8 of them reached the telomere-to-telomere level. BUSCO analysis estimated the completeness of the assembly at 99.50%. The genome contains approximately 63.24% repetitive sequences and is predicted to encode 41,782 protein-coding genes. This is the first near-complete genome assembly of S. trutta, providing an essential resource for molecular breeding and germplasm conservation of this important species.

Subject terms: Eukaryote, Open reading frames

Background & Summary

Salmo trutta, a member of the Salmonidae family, is characterized by a grey upper body with spots distributed above and below the lateral line1. It is native to Europe, Western Asia, and North Africa2, and can be broadly categorized into anadromous and lacustrine populations3. Since the mid-19th century, S. trutta has been introduced to 24 countries, including Russia, the United States, the United Kingdom, Japan, and countries in South America. Its strong migratory abilities and adaptability to diverse environments have enabled it to rapidly establish itself as a global species3,4.

In 1866, S. trutta was introduced to the Yadong River in Tibet, China, where it has since become a localized population, colloquially known as Yadong trout5. The specific environmental conditions of the Yadong River result in slow reproduction and growth in the Yadong trout population, which makes it highly vulnerable to overfishing6,7. The resulting steady decline in population numbers prompted its designation as a second-class protected aquatic species in the Tibet Autonomous Region in 1992 (ref. 8). In recent years, efforts by the Yellow Sea Fisheries Research Institute of the Chinese Academy of Fishery Sciences have led to the implementation of large-scale aquaculture programs to support the conservation and commercial farming of Yadong trout.

Advancements in sequencing technologies have made telomere-to-telomere (T2T) level genome assemblies feasible. As a result, several fish species, including Neosalanx taihuensis9, Lateolabrax maculatus10, and Clarias gariepinus11, have had their genomes published at this level of detail. However, the only available genome of S. trutta (GCA_901001165.1) remains the chromosome-level assembly released by the Wellcome Sanger Institute in 2019 (ref. 12), and the number of published chromosome-level genomes is still extremely limited relative to the many S. trutta populations distributed worldwide.

In this study, we integrated PacBio high-fidelity (HiFi), high-throughput chromatin conformation capture (Hi-C) and Oxford Nanopore Technologies (ONT) reads to assemble the S. trutta genome, achieving a near-complete genome sequence (Fig. 1a,b). Compared to the published S. trutta genome12, our assembly shows significant improvements in continuity and completeness. This high-quality genome will provide a valuable resource for the molecular breeding of S. trutta and facilitate comparative genomic analyses among S. trutta populations from different regions.

Fig. 1.


S. trutta genome snail plot and circos plot. (a) The snail plot presents the basic metrics of the genome assembly. (b) The circos plot represents the following metrics from outer to inner layers: (a) chromosomes, (b) gene density, (c) GC content, (d) DNA transposons, (e) LTRs, (f) LINEs, and (g) SINEs. The points on the chromosome backbone indicate detected telomeres, with red points representing chromosomes that are telomere-to-telomere with no gaps.

Methods

Sample collection and sequencing

An adult female S. trutta was sourced from the Yadong County Industrial Park, Shang Yadong Township, Shigatse City, Tibet, China. We used an SDS-based extraction method to obtain genomic DNA (gDNA) of sufficient quality and quantity. The sheared gDNA was purified using AMPure PB beads, followed by end repair, adapter ligation, and further purification to construct SMRTbell templates. These templates were then loaded into SMRT cells and sequenced on the PacBio Sequel II platform, yielding 94.96 Gb (~38×) of HiFi data (Table 1).

Table 1.

Statistics of the sequencing data.

| Library type | Platform | Tissue | Data size (Gb) | Average depth (×) | Average length (bp) |
|---|---|---|---|---|---|
| ONT ultra-long | PromethION | Fin Ray | 72.72 | 29 | 84,676 |
| PacBio SMRT | PacBio Sequel II | Muscle | 94.96 | 38 | 12,037 |
| Hi-C | Illumina Novaseq6000 | Muscle | 255.69 | 103 | 150 |
| WGS | Illumina Novaseq6000 | Muscle | 99.63 | 40 | 150 |
| RNA-Seq | Illumina Novaseq6000 | Liver | 7.38 | | 150 |
| RNA-Seq | Illumina Novaseq6000 | Heart | 6.42 | | 150 |
| RNA-Seq | Illumina Novaseq6000 | Intestine | 6.62 | | 150 |
| RNA-Seq | Illumina Novaseq6000 | Gill | 7.44 | | 150 |
| RNA-Seq | Illumina Novaseq6000 | Kidney | 6.72 | | 150 |
| RNA-Seq | Illumina Novaseq6000 | Brain | 6.61 | | 150 |
| RNA-Seq | Illumina Novaseq6000 | Muscle | 7.37 | | 150 |
| RNA-Seq | Illumina Novaseq6000 | Ovary | 6.99 | | 150 |
| RNA-Seq | Illumina Novaseq6000 | Spleen | 6.08 | | 150 |
| Iso-Seq | PacBio Sequel IIe | Tissue mixture | 10.11 | | 2,730 |
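
The depth values in Table 1 appear consistent with dividing each data size by the assembly length. The following is a minimal sketch of that arithmetic, assuming the 2.49 Gb assembly length reported in this study stands in for genome size. The source does not state how depth was calculated, and the name `coverage_depth` is illustrative rather than part of any tool used here.

```python
# Assumption: average depth = sequenced bases / genome size, using the
# 2.49 Gb assembly length reported in this study as the genome size.
GENOME_SIZE_GB = 2.49

# Data sizes (Gb) copied from Table 1.
DATA_SIZE_GB = {
    "ONT ultra-long": 72.72,
    "PacBio SMRT": 94.96,
    "Hi-C": 255.69,
    "WGS": 99.63,
}


def coverage_depth(data_gb, genome_gb=GENOME_SIZE_GB):
    """Return the average sequencing depth (fold coverage)."""
    return data_gb / genome_gb


for library, size_gb in DATA_SIZE_GB.items():
    print(f"{library}: ~{round(coverage_depth(size_gb))}x")
```

Rounded to whole numbers, these ratios match the depth column of Table 1 for the four DNA libraries.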

For ONT data, DNA was extracted from fin clip tissue using the NEB Monarch® HMW DNA Extraction Kit for Tissue. Libraries were prepared and sequenced on the PromethION platform, resulting in 72.72 Gb (~29×) of ONT data (Table 1).

For Hi-C data, muscle tissue was processed through formaldehyde crosslinking, followed by washing, lysis, enzymatic digestion, DNA end modification, fragment ligation, purification, DNA end repair, biotin labeling, and PCR amplification. Illumina PE150 sequencing was performed, yielding 255.69 Gb (~103×) of Hi-C data (Table 1).

For WGS sequencing, DNA extracted from muscle tissue was used to prepare Illumina libraries with the NEBNext® Ultra™ DNA Library Prep Kit. Cluster generation for these libraries was carried out on the cBot Cluster Generation System with the Illumina Paired-End Cluster Kit, following the manufacturer's guidelines. We ultimately obtained 99.63 Gb (~40×) of short-read data (Table 1).

For RNA-seq data, we extracted RNA from nine tissues: muscle, liver, intestine, ovary, brain, spleen, gill, kidney, and heart. RNA-Seq libraries were prepared following the protocol of the library preparation kit. Illumina sequencing yielded an average of 6.83 Gb of data per tissue sample (Table 1). Additionally, we extracted RNA from a mixed tissue sample; qualified RNA samples underwent reverse transcription, end repair, DNA fragmentation, adapter ligation, and amplification to construct the library. Sequencing was then carried out on the PacBio Sequel IIe platform, yielding 10.11 Gb of Isoform Sequencing (Iso-Seq) data (Table 1).

Genome assembly and telomere identification

To obtain a high-quality genome for S. trutta, we integrated ONT data with HiFi data and used Hifiasm13 (v0.19.9) to assemble a draft genome. We then employed CRAQ14 (v1.0.9) to identify chimeric fragments and generated a CRAQ-corrected genome. Next, we used kmerDedup15 to remove redundant sequences from the assembly and HapHiC16 (v1.0.6), together with Hi-C data, to anchor the draft genome to 40 chromosomes, consistent with the chromosome number of the previously published S. trutta genome12 (Fig. 2a). We used Juicer-box17 (v1.91) for minor manual refinements of the genome and then TGS-GapCloser18 (v1.2.1), in conjunction with ONT data, to fill gaps within the genome. Finally, we polished the genome with NextPolish2 (ref. 19; v0.2.1) using HiFi and WGS data, obtaining a 2.49 Gb genome with a contig N50 of 47.99 Mb (Fig. 1a). We used the TeloExplorer module of quarTeT20 (v1.2.1) to identify telomeres (TTAGGG) at both ends of each chromosome in the S. trutta genome. Telomeres were detected at both ends of 18 chromosomes and at one end of 17 chromosomes, and 8 chromosomes achieved gap-free telomere-to-telomere status (Fig. 1b and Table 2).
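
To illustrate the idea behind end-of-chromosome telomere detection, the sketch below counts copies of the TTAGGG motif (or its reverse complement, CCCTAA) in a window at each end of a sequence. It is not the algorithm implemented in quarTeT; the function names, the 10 kb window, and the 50-copy threshold are hypothetical choices made only for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_end_repeats(seq, motif="TTAGGG", window=10_000):
    """Count motif copies in the first and last `window` bases.

    Copies are counted anywhere in the window (non-overlapping), in either
    orientation; this simple count does not require them to be tandem.
    """
    seq = seq.upper()
    rc_motif = reverse_complement(motif)
    head, tail = seq[:window], seq[-window:]
    left = max(head.count(motif), head.count(rc_motif))
    right = max(tail.count(motif), tail.count(rc_motif))
    return left, right


def telomere_number(seq, min_repeats=50, motif="TTAGGG", window=10_000):
    """Return 0, 1, or 2: the number of ends passing the repeat threshold.

    `min_repeats` is an illustrative threshold, not a published default.
    """
    left, right = count_end_repeats(seq, motif=motif, window=window)
    return int(left >= min_repeats) + int(right >= min_repeats)


# Toy sequences, not real data.
body = "ACGT" * 5000
both_ends = "CCCTAA" * 100 + body + "TTAGGG" * 100
one_end = body + "TTAGGG" * 100
print(telomere_number(both_ends), telomere_number(one_end), telomere_number(body))
```

A value of 2 from a function like this corresponds to the "double-end" category in Table 2, and 1 to the "single-end" category.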

Fig. 2.


Hi-C heatmap and collinearity dot plot of the genome. (a) The Hi-C heatmap illustrates the chromosome interaction frequencies of the S. trutta genome, with each blue contour representing a chromosome. (b) The dot plot displays the collinearity relationship with the previously published genome assembly of S. trutta.

Table 2.

Assembly statistics of chromosomes.

| Chromosome | Length (bp) | Number of telomeres |
|---|---|---|
| Chr 1 | 90,493,090 | 1 |
| Chr 2 | 82,810,130 | 1 |
| Chr 3 | 84,936,639 | 2 |
| Chr 4 | 85,513,606 | 1 |
| Chr 5 | 72,538,494 | 1 |
| Chr 6 | 72,262,847 | 1 |
| Chr 7 | 84,217,063 | 1 |
| Chr 8 | 58,820,683 | 1 |
| Chr 9 | 50,963,598 | 2 |
| Chr 10 | 48,104,659 | 2 |
| Chr 11 | 25,970,806 | 1 |
| Chr 12 | 107,239,732 | 1 |
| Chr 13 | 101,836,447 | 1 |
| Chr 14 | 99,869,877 | 0 |
| Chr 15 | 100,024,870 | 1 |
| Chr 16 | 62,148,298 | 1 |
| Chr 17 | 64,962,358 | 1 |
| Chr 18 | 61,029,533 | 2 |
| Chr 19 | 58,673,832 | 0 |
| Chr 20 | 58,621,627 | 2 |
| Chr 21 | 53,977,117 | 2 |
| Chr 22 | 54,916,544 | 0 |
| Chr 23 | 53,462,562 | 2 |
| Chr 24 | 53,486,768 | 0 |
| Chr 25 | 50,849,851 | 1 |
| Chr 26 | 51,063,025 | 2 |
| Chr 27 | 47,770,962 | 1 |
| Chr 28 | 51,549,634 | 2 |
| Chr 29 | 54,445,351 | 2 |
| Chr 30 | 47,725,804 | 2 |
| Chr 31 | 47,930,784 | 2 |
| Chr 32 | 49,330,108 | 2 |
| Chr 33 | 47,087,300 | 2 |
| Chr 34 | 47,087,710 | 2 |
| Chr 35 | 43,920,889 | 2 |
| Chr 36 | 43,421,662 | 2 |
| Chr 37 | 48,078,177 | 2 |
| Chr 38 | 34,900,096 | 1 |
| Chr 39 | 28,802,253 | 1 |
| Chr 40 | 29,316,477 | 0 |
| Unplaced | 77,658,758 | |
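
The telomere summary in the text can be cross-checked against Table 2 with a short tally. The sketch below hard-codes the "Telomere Numbers" column for Chr 1 to Chr 40 as transcribed from the table; the variable names are illustrative.

```python
from collections import Counter

# Telomere numbers for Chr 1..Chr 40, transcribed from Table 2.
TELOMERES_PER_CHROMOSOME = [
    1, 1, 2, 1, 1, 1, 1, 1, 2, 2,  # Chr 1-10
    1, 1, 1, 0, 1, 1, 1, 2, 0, 2,  # Chr 11-20
    2, 0, 2, 0, 1, 2, 1, 2, 2, 2,  # Chr 21-30
    2, 2, 2, 2, 2, 2, 2, 1, 1, 0,  # Chr 31-40
]

tally = Counter(TELOMERES_PER_CHROMOSOME)
total_telomeres = sum(TELOMERES_PER_CHROMOSOME)

print(f"double-end: {tally[2]}, single-end: {tally[1]}, none: {tally[0]}")
print(f"total telomeres detected: {total_telomeres}")
```

The tally gives 18 double-end, 17 single-end, and 5 chromosomes without detected telomeres, for 53 telomeres in total, which agrees with the counts reported in the text and in Table 3.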

Additionally, we conducted a collinearity analysis against the published S. trutta genome using Minimap2 (ref. 21; v2.28) and visualized the results using pafCoordsDotPlotly.R (https://github.com/tpoorten/dotPlotly). Most chromosomes exhibited good collinearity, although some displayed different arrangements (Fig. 2b). We also compared assembly metrics between the two genomes. Our assembly performed better on every metric compared, with a total genome length of 2.49 Gb, an anchoring rate of 96.87%, a contig N50 of 47.99 Mb, a BUSCO completion rate of 99.50%, 97 gaps compared with 2,954, 12 gap-free chromosomes, and 53 detected telomeres (Table 3).

Table 3.

Assembly statistics of S. trutta.

| Assembly | fSalTru1.1 | This study |
|---|---|---|
| Total length | 2.37 Gb | 2.49 Gb |
| Anchoring rate | 91.45% | 96.87% |
| Contig N50 | 1.69 Mb | 47.99 Mb |
| BUSCO completion rate | 98.93% | 99.50% |
| Number of gaps | 2,954 | 97 |
| Number of gap-free chromosomes | 0 | 12 |
| Number of telomeres | 0 | 53 |
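
Contig N50, one of the continuity metrics in Table 3, is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The following sketch shows the standard calculation on toy lengths; it is not taken from the tools used in this study, and the contig lengths are invented for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    lengths = sorted(lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example: total = 30, half = 15; 10 + 8 = 18 reaches half at contig 8.
print(n50([2, 4, 6, 8, 10]))
```

A larger N50 indicates that more of the assembly sits in long contigs, which is why the increase from 1.69 Mb to 47.99 Mb reflects a substantial gain in continuity.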

Repetitive sequence annotation

To identify and characterize repetitive elements, we combined de novo prediction with homology-based annotation. We used LTR-Finder22 (v1.07) to predict LTR retrotransposons in the S. trutta genome and RepeatModeler23 (v2.0.5) to perform de novo prediction of repetitive elements. The results from both tools were combined with the Repbase database24 (v202101) to construct a repeat library specific to the S. trutta genome. Homology-based annotation was mainly conducted using the RepeatProteinMask and Repbase modules of RepeatMasker25 (v4.1.0) with default parameters. The predictions from both strategies were then integrated and filtered. Repetitive sequences account for approximately 63.24% of the genome: long interspersed nuclear elements (LINEs) make up 32.34%, DNA transposons 24.87%, long terminal repeats (LTRs) 10.80%, and short interspersed nuclear elements (SINEs) 0.55% (Fig. 1b and Table 4).

Table 4.

Statistics of repeat content.

| Type | Length (bp) | Percent of genome (%) |
|---|---|---|
| DNA | 618,594,869 | 24.87 |
| LINE | 804,466,278 | 32.34 |
| SINE | 13,567,810 | 0.55 |
| LTR | 268,565,114 | 10.80 |
| Other | 3,941,006 | 0.16 |
| Unknown | 21,935,652 | 0.88 |
| Total (exclude nested/overlapping TEs) | 1,573,346,823 | 63.24 |
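
The total in Table 4 is smaller than the sum of the individual classes because nested or overlapping elements are counted only once. A common way to obtain such a non-redundant length is to merge overlapping intervals before summing. The sketch below shows that technique on toy coordinates; it is an illustration only and is not claimed to be how the integration and filtering step in this study was implemented.

```python
def merged_length(intervals):
    """Return the number of bases covered by at least one interval.

    Intervals are half-open (start, end) pairs on a single sequence.
    Overlapping or nested intervals contribute their union only once.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Toy annotations: the second element is nested inside the first.
toy_repeats = [(0, 100), (20, 60), (150, 200)]
naive_sum = sum(end - start for start, end in toy_repeats)
print(naive_sum, merged_length(toy_repeats))
```

In the toy example the naive per-element sum is 190 bp, while the merged coverage is 150 bp, mirroring the difference between the per-class rows and the total row of Table 4.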

Genomic structure and functional annotation

Starting from the repeat-masked S. trutta genome, we used Augustus26 (v3.5.0) for de novo gene prediction with default parameters. We used Miniprot27 (v0.13) to perform homology-based annotation with protein sequences from four species: Salmo salar, S. trutta, Oncorhynchus mykiss, and Oncorhynchus tshawytscha. In parallel, we aligned RNA-seq data to the genome using HISAT2 (ref. 28; v2.2.1), followed by transcriptome assembly with StringTie29 (v2.2.3). Open reading frames (ORFs) within the assembled transcripts were identified using TransDecoder (https://github.com/TransDecoder/TransDecoder), and the predicted coding regions were used as evidence for annotation. We also processed the full-length transcriptome data using the IsoSeq3 (ref. 30; v4.0.0) pipeline and aligned the resulting transcripts to the genome with GMAP (https://github.com/juliangehring/GMAP-GSNAP) to generate further annotation evidence. Finally, we integrated the four annotation strategies (de novo, homology, transcript-based, and full-length transcriptome) using EvidenceModeler31 (v1.1.1). The final gene set was further refined with Liftoff32 (v1.6.3), using the published S. trutta genome annotation as a reference, resulting in the identification of 41,782 genes. To assess the accuracy of gene annotation, we compared the distributions of mRNA length and exon count per mRNA in the S. trutta genome with those of S. salar and O. mykiss. The distributions were highly similar among the three species (Fig. 3a).

Fig. 3.


Comparative genome plot of closely related species and gene functional annotation upset plot. (a) Comparison of mRNA length distribution and exon count per mRNA among the three closely related species. (b) Upset plot of gene functional annotation using data from EggNOG, Pfam, KEGG, NR, Kofam, and SwissProt.

The predicted protein sequences were aligned against several functional databases using Diamond33 (v2.1.6) with an E-value threshold of 1e-5. The databases comprised the Kyoto Encyclopedia of Genes and Genomes34 (KEGG), Swiss Institute of Bioinformatics Protein Database35 (Swiss-Prot), Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups36 (EggNOG), Protein Families Database37 (Pfam), KOfam Database38 (Kofam), and the Non-Redundant database39 (NR). The resulting hits were used to assign putative gene functions and to compile annotation statistics. A total of 41,629 genes, accounting for 99.63% of the predicted protein-coding genes, were annotated in at least one of these databases (Fig. 3b and Table 5).

Table 5.

Statistics of gene annotation.

| Type | Number | Percentage (%) |
|---|---|---|
| NR | 41,619 | 99.61 |
| EggNOG | 40,884 | 97.85 |
| SwissProt | 38,283 | 91.63 |
| KEGG | 33,564 | 80.33 |
| Pfam | 37,890 | 90.68 |
| Kofam | 31,161 | 74.58 |
| Total | 41,629 | 99.63 |
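
The "Total" row of Table 5 counts genes annotated in at least one database, which is the union of the per-database gene sets after applying the E-value threshold. The sketch below illustrates this under the assumption that hits are available in BLAST-style 12-column tabular format, where column 1 is the query ID and column 11 is the E-value. The gene IDs, hit lines, and function names are invented for the example and do not come from this study.

```python
def annotated_queries(tabular_lines, max_evalue=1e-5):
    """Return query IDs with at least one hit at or below `max_evalue`.

    Assumes BLAST-style tabular lines (tab-separated, 12 columns) with the
    query ID in column 1 and the E-value in column 11.
    """
    hits = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def fraction_annotated(per_database_hits, n_genes):
    """Return (count, percent) of genes annotated in at least one database."""
    union = set().union(*per_database_hits.values())
    return len(union), 100 * len(union) / n_genes


# Toy hit tables for two hypothetical databases and four hypothetical genes.
toy_hits = {
    "db_A": [
        "gene1\tP001\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200",
        "gene2\tP002\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.01\t30",
    ],
    "db_B": [
        "gene2\tQ010\t85.0\t120\t10\t0\t1\t120\t1\t120\t1e-20\t180",
        "gene3\tQ011\t70.0\t80\t20\t0\t1\t80\t1\t80\t1e-6\t90",
    ],
}
per_db = {name: annotated_queries(lines) for name, lines in toy_hits.items()}
count, percent = fraction_annotated(per_db, n_genes=4)
print(per_db, count, percent)
```

In this toy case gene2 fails the threshold in one database but passes in the other, so it still counts toward the union, which is why the total can exceed every single-database count.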

Data Records

The genome assembly data have been uploaded to GenBank under accession number JBICOJ000000000 (PRJNA1170013)40 and Figshare41. The genome annotation file has also been uploaded to Figshare41. The sequencing raw data have been uploaded to NCBI with the project number PRJNA1171601 (SRP540059)42.

Technical Validation

We used BUSCO43 (v5.3) with the Actinopterygii database (actinopterygii_odb10) to evaluate the completeness of the genome assembly. The analysis revealed an overall completeness of 99.50% (56.48% single-copy and 43.02% duplicated), with only 0.28% fragmented and 0.22% missing (Fig. 1a and Table 3). We used Inspector44 (v1.0.1) to map the PacBio HiFi reads against the genome, obtaining a Quality Value (QV) of 44.91 and a read-to-contig alignment rate of 99.95%. Additionally, we applied CRAQ14 for genome assessment, which determined an Assembly Quality Index (AQI) of 96 (>90), indicating that our assembled genome is highly complete, of good quality, and has achieved a reference-quality standard.
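
For readers less familiar with QV, the sketch below converts a quality value into an approximate per-base error rate, assuming the usual Phred scaling, QV = -10 log10(error rate). The source does not spell out this definition, so the conversion is an assumption, and the function name is illustrative.

```python
def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate.

    Assumes QV = -10 * log10(error_rate).
    """
    return 10 ** (-qv / 10)


error_rate = qv_to_error_rate(44.91)
print(f"error rate: {error_rate:.2e} (about one error per {1 / error_rate:,.0f} bases)")
```

Under this assumption, a QV of 44.91 corresponds to roughly three errors per 100,000 bases.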

Acknowledgements

This research was funded by the National Key R&D Program of China (2024YFD2400901 and 2022YFD2400100), the Key Research and Development Project of Shandong Province (2024LZGC005), the AoShan Talents Cultivation Program Supported by Qingdao National Laboratory for Marine Science and Technology (grant number 2017ASTCP-ES06), the Taishan Scholars Program (No. tstp20221149) to C.S., the National Ten-Thousands Talents Special Support Program to C.S., the Key R&D Program of Hebei Province, China (21326307D) and the Central Public-interest Scientific Institution Basal Research Fund, CAFS (grant number 2023TD19).

Author contributions

C.S. conceived the project. C.L. and S.H. analyzed the data. C.L. drafted the manuscript. S.L., K.L., H.W. and Q.W. revised the manuscript. Y.L. and C.L. prepared the sample materials and extracted the DNA. All authors read and approved the final version of the manuscript.

Code availability

In this study, we did not employ any customized scripts or software for personalized analysis. The analytical tools and parameters used are described in the methods section. For software without specific parameter descriptions, the default parameters were selected.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Chen Li, Shenglei Han.

References

  • 1. Moyle, P. B. Inland Fishes of California. (University of California Press, 1976).
  • 2. Lobry, J., Mourand, L., Rochard, E. & Elie, P. Structure of the Gironde estuarine fish assemblages: a comparison of European estuaries perspective. Aquat. Living Resour. 16, 47–58 (2003).
  • 3. Klemetsen, A. et al. Atlantic salmon Salmo salar L., brown trout Salmo trutta L. and Arctic charr Salvelinus alpinus (L.): a review of aspects of their life histories. Ecol Freshw Fish 12, 1–59 (2003).
  • 4. Elliott, J. M. Quantitative ecology and the brown trout Ch. 9 (Oxford Univ. Press, 1994).
  • 5. Liu, J. et al. Comparative Analysis of Nutrient Composition of Different-Colored Yadong Trout Eggs. Prog Fish Sci 44, 133–141 (2023).
  • 6. Kang, B. et al. Introduction of non-native fish for aquaculture in China: A systematic review. Rev Aquac 15, 676–703 (2023).
  • 7. Zhou, J., Min, Z., Zhang, C., Li, B. & Wang, W. Research progress of fishery resources in Tibet. Anim. Husb. Feed Sci. 8, 246–250 (2016).
  • 8. Tian, H.-F., Hu, Q.-M. & Li, Z. Genome-wide identification of simple sequence repeats and development of polymorphic SSR markers in swamp eel (Monopterus albus). Sci Prog 104, 368504211035597 (2021).
  • 9. Zhou, Y. et al. Gap-free genome assembly of Salangid icefish Neosalanx taihuensis. Sci Data 10, 768 (2023).
  • 10. Sun, Z. et al. Telomere-to-telomere gapless genome assembly of the Chinese sea bass (Lateolabrax maculatus). Sci Data 11, 175 (2024).
  • 11. Nguinkal, J. A. et al. Haplotype-resolved and near-T2T genome assembly of the African catfish (Clarias gariepinus). Sci. Data 11, 1095 (2024).
  • 12. Hansen, T. et al. The genome sequence of the brown trout, Salmo trutta Linnaeus 1758. Wellcome Open Res 6, 108 (2021).
  • 13. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
  • 14. Li, K., Xu, P., Wang, J., Yi, X. & Jiao, Y. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat. Commun. 14, 6556 (2023).
  • 15. Abalde, S., Tellgren-Roth, C., Heintz, J., Vinnere Pettersson, O. & Jondelius, U. The draft genome of the microscopic Nemertoderma westbladi sheds light on the evolution of Acoelomorpha genomes. Front. Genet. 14 (2023).
  • 16. Zeng, X. et al. Chromosome-level scaffolding of haplotype-resolved assemblies using Hi-C data without reference genomes. Nat. Plants 10, 1184–1200 (2024).
  • 17. Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst 3, 95–98 (2016).
  • 18. Xu, M. et al. TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. GigaScience 9, giaa094 (2020).
  • 19. Hu, J. et al. NextPolish2: A Repeat-aware Polishing Tool for Genomes Assembled Using HiFi Long Reads. Genomics Proteomics Bioinf 22, qzad009 (2024).
  • 20. Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Hortic. Res. 10, uhad127 (2023).
  • 21. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  • 22. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
  • 23. Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci USA 117, 9451–9457 (2020).
  • 24. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA 6, 11 (2015).
  • 25. Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics Chapter 4, 4.10.1–4.10.14 (2009).
  • 26. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27, 573–580 (1999).
  • 27. Li, H. Protein-to-genome alignment with miniprot. Bioinformatics 39, btad014 (2023).
  • 28. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12, 357–360 (2015).
  • 29. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290–295 (2015).
  • 30. Guizard, S. et al. nf-core/isoseq: simple gene and isoform annotation with PacBio Iso-Seq long-read sequencing. Bioinformatics 39, btad150 (2023).
  • 31. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol 9, R7 (2008).
  • 32. Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37 (2021).
  • 33. Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 18, 366–368 (2021).
  • 34. Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28, 27–30 (2000).
  • 35. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 28, 45–48 (2000).
  • 36. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res 47, D309–D314 (2019).
  • 37. Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Res 49, D412–D419 (2021).
  • 38. Aramaki, T. et al. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252 (2020).
  • 39. Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33, D501–504 (2005).
  • 40. NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_043791845.1 (2024).
  • 41. Li, C. Salmo trutta genome. Figshare https://doi.org/10.6084/m9.figshare.27282591.v3 (2024).
  • 42. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP540059 (2024).
  • 43. Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol 38, 4647–4654 (2021).
  • 44. Chen, Y., Zhang, Y., Wang, A. Y., Gao, M. & Chong, Z. Accurate long-read de novo assembly evaluation with Inspector. Genome Biology 22, 312 (2021).



Articles from Scientific Data are provided here courtesy of Nature Publishing Group
