A high-quality chromosome-level genome assembly of the Chinese medaka Oryzias sinensis

Zhongdian Dong; Jiangman Wang; Guozhu Chen; Yusong Guo; Na Zhao; Zhongduo Wang; Bo Zhang

2024 Mar 28;11:322. doi: 10.1038/s41597-024-03173-8


PMCID: PMC10978949 PMID: 38548787

Abstract

Oryzias sinensis, also known as Chinese medaka or Chinese ricefish, is a commonly used animal model for aquatic environmental assessment in the wild as well as for gene function validation and toxicology research in the laboratory. Here, a high-quality chromosome-level genome assembly of O. sinensis was generated using single-tube long fragment read (stLFR) reads, Nanopore long reads, and Hi-C sequencing data. The genome is 796.58 Mb, and a total of 712.17 Mb of the assembled sequences were anchored to 23 pseudo-chromosomes. A final set of 22,461 genes was annotated, with 98.67% being functionally annotated. The Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness of the genome assembly and gene annotation reached 95.1% (93.3% single-copy) and 94.6% (91.7% single-copy), respectively. Furthermore, we used ATAC-seq to profile chromatin accessibility in O. sinensis and performed functional enrichment analysis of the genes associated with the accessible regions. This study offers an improved foundation for future genomics research in Chinese medaka.

Subject terms: Ichthyology, Animal breeding, Genome

Background & Summary

Chinese medaka, or Oryzias sinensis, is a teleost fish closely related to Japanese medaka (Oryzias latipes), which has been used as a model organism in many genomic studies as well as in aquatic toxicology research¹. Both fish belong to the family Adrianichthyidae, commonly referred to as the Medaka family. Similar to its relative, Chinese medaka is also attracting the attention of scientists due to its small size and short generation interval². There are several differences between these two fishes, including their vertebrae and pectoral fin strip; most importantly, Chinese medaka is mainly distributed in freshwater while Japanese medaka can adapt to a certain level of salinity. Because of this, Chinese medaka could be more suitable for freshwater quality evaluation.

Although the Chinese medaka genome has been released³, the completeness and genome annotations still need to be further improved. The reported genome was only released at the scaffold level with alignments to the chromosomes of Japanese medaka. Several phylogenetic analyses and taxonomic revisions for medaka have already found that the Chinese medaka is different from its Japanese relative in chromosome constitution; the diploid chromosome number is 46 in O. sinensis, and 48 in O. latipes^4,5. Therefore, a high-quality reference genome for O. sinensis is increasingly important to support future research.

In the present study, an improved high-quality chromosome-level genome assembly of O. sinensis was generated using single-tube long fragment read (stLFR) reads, Nanopore long reads, and Hi-C sequencing data. The genome size was 796.58 Mb with a scaffold N50 length of 30.38 Mb (Table 1). A total of 712.17 Mb (89.34%) of the assembled sequences were anchored to 23 pseudo-chromosomes, with lengths ranging from 18.87 Mb to 58.48 Mb (Table 2). Based on this improved genome assembly, repeat elements and gene structures were annotated by combining de novo prediction, homology-based alignment, and transcriptome-assisted methods. Benchmarking Universal Single-Copy Orthologs (BUSCO) evaluation showed that completeness reached 95.1% for the final assembly and 94.6% for the annotation (Table 3).

Table 1.

Genome assembly statistics of Chinese medaka Oryzias sinensis.

Statistic	New chromosome-level genome			Published genome (GCA_008586565.1)	
Level	Chromosome	Scaffold	Contig	Scaffold	Contig
Total number	23	17,780	24,324	68,189	98,653
Total length (bp)	712,173,063	797,156,861	765,499,752	813,986,518	734,065,487
N50 length (bp)	31,744,625	30,386,322	373,585	991,358	28,180
N90 length (bp)	24,137,528	36,705	13,816	2,913	2,405
Maximum length (bp)	58,480,650	58,480,650	2,853,130	7,585,817	287,113
Minimum length (bp)	18,868,712	714	48	747	1
GC content (%)	40.69			40.45	

Table 2.

Chromosome length distribution of assembled genome.

Chromosome ID	Length (bp)	Percentage (%)
1	58,480,650	7.34
2	39,505,166	4.96
3	39,054,318	4.9
4	33,217,277	4.17
5	32,964,224	4.14
6	32,856,377	4.12
7	32,552,571	4.08
8	32,162,499	4.03
9	31,899,507	4
10	31,744,625	3.98
11	31,538,233	3.96
12	30,386,322	3.81
13	29,554,195	3.71
14	29,246,493	3.67
15	28,794,247	3.61
16	28,705,415	3.6
17	27,734,809	3.48
18	27,467,262	3.45
19	24,846,853	3.12
20	24,137,528	3.03
21	23,984,531	3.01
22	22,471,249	2.82
23	18,868,712	2.37
Unplaced*	84,983,798	10.66

*Unplaced refers to sequences that were not anchored to any pseudo-chromosome.
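The chromosome-level N50 and N90 values in Table 1 can be recomputed from the lengths in Table 2. The following is a minimal sketch, not the authors' script: the function name `nx_length` is hypothetical, and it assumes the usual definition of Nx as the length of the sequence at which the running total, taken from longest to shortest, first reaches x% of the summed length.

```python
# Pseudo-chromosome lengths (bp) copied from Table 2.
CHROMOSOME_LENGTHS = [
    58_480_650, 39_505_166, 39_054_318, 33_217_277, 32_964_224, 32_856_377,
    32_552_571, 32_162_499, 31_899_507, 31_744_625, 31_538_233, 30_386_322,
    29_554_195, 29_246_493, 28_794_247, 28_705_415, 27_734_809, 27_467_262,
    24_846_853, 24_137_528, 23_984_531, 22_471_249, 18_868_712,
]


def nx_length(lengths, x):
    """Return the Nx length (hypothetical helper, standard Nx definition).

    Sequences are visited from longest to shortest; the length of the
    sequence at which the running total first reaches x% of the summed
    length is returned.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Cross-multiplied comparison keeps the test in integer arithmetic.
        if running * 100 >= total * x:
            return length
    return min(lengths)  # only reachable when x > 100


print(sum(CHROMOSOME_LENGTHS))            # total anchored length (bp)
print(nx_length(CHROMOSOME_LENGTHS, 50))  # chromosome-level N50
print(nx_length(CHROMOSOME_LENGTHS, 90))  # chromosome-level N90
```

With the Table 2 lengths as input, the three printed values correspond to the chromosome column of Table 1 (total length, N50, and N90).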

Table 3.

Genome and Annotation BUSCO evaluation result.

BUSCO category	Genome		Annotation	
	Number	Percentage (%)	Number	Percentage (%)
Complete BUSCOs	4,356	95.1	4,335	94.6
Complete Single-Copy BUSCOs	4,275	93.3	4,202	91.7
Complete Duplicated BUSCOs	81	1.8	133	2.9
Fragmented BUSCOs	123	2.7	153	3.3
Missing BUSCOs	105	2.2	96	2.1
Total BUSCO groups searched	4,584	100	4,584	100

In addition, we conducted ATAC-seq to profile chromatin accessibility in O. sinensis. After peak calling, we performed functional enrichment analysis of the genes associated with the peaks to obtain clues about transcriptional activity. Taken together, this study provides a new, reliable foundation for research on the Chinese medaka, as well as for its use as a model organism.

Methods

De novo genome assembly

The samples of O. sinensis were obtained from the National Plateau Wetland Research Center, Southwest Forestry University; muscle tissue was used for nucleic acid extraction. High-quality purified RNA was used to construct a transcriptome sequencing library, and DNA was used to construct stLFR, Nanopore long-read, Hi-C, and ATAC-seq libraries. stLFR technology adds the same barcode sequence to sub-fragments of the original long DNA molecule, allowing these co-barcoded sub-fragments to be subsequently sequenced on a second-generation platform⁶. The shared barcodes link accurate short reads back to their long molecule of origin, enabling high-quality assembly and subsequent analysis. In addition, stLFR is cost-effective and has been widely used in aquatic genome projects^7,8.

The genome size and heterozygosity of O. sinensis were estimated by k-mer analysis of the clean stLFR data using Jellyfish v2.2.6⁹ and GenomeScope v1.0.0¹⁰. The clean data were then used for de novo genome assembly with the stlfr2supernova pipeline (https://github.com/BGI-Qingdao/stlfr2supernova_pipeline), which assembles stLFR data with Supernova, the de novo assembler from 10x Genomics¹¹. Nanopore long reads were then used for further scaffolding, gap closing, and polishing in a single step with TGS-GapCloser¹². The size of the assembly after these steps was larger than the k-mer estimate (approximately 906 Mb), so purge_haplotigs¹³ was applied to remove redundancy. Finally, a draft assembly that covered approximately 796 Mb of the genome with a contig N50 length of 373.71 kb was obtained.
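The idea behind k-mer-based genome size estimation can be illustrated with a simplified calculation: the total number of k-mer occurrences divided by the depth of the main coverage peak. This is only a sketch under stated assumptions. GenomeScope fits a statistical model to the histogram rather than reading off a single peak, the function name `estimate_genome_size`, the `min_multiplicity` cutoff, and the toy histogram below are all hypothetical, and a strongly heterozygous genome would show two peaks that this simple version does not handle.

```python
def estimate_genome_size(histogram, min_multiplicity=5):
    """Estimate genome size from a k-mer multiplicity histogram.

    histogram maps multiplicity -> number of distinct k-mers seen that
    many times (the two-column layout written by `jellyfish histo`).
    K-mers below min_multiplicity are treated as sequencing errors and
    ignored; this cutoff is an assumption for the sketch.
    """
    solid = {m: c for m, c in histogram.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers at or above min_multiplicity")
    # Depth of the main peak: the multiplicity with the most distinct k-mers.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences contributed by the retained k-mers.
    total_kmers = sum(m * c for m, c in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (made-up numbers): an error tail at low multiplicity
# and a single coverage peak centred on 20x.
toy_histogram = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
size, peak = estimate_genome_size(toy_histogram)
print(peak, size)
```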

Hi-C analysis and chromosome assembly

The Hi-C reads were aligned to the draft assembly using HiC-Pro¹⁴ to identify valid read pairs. Juicer¹⁵ and 3D-DNA¹⁶ were then used to construct the chromosomes, with manual correction of misjoins, ordering errors, and orientation errors in Juicebox¹⁵. The Hi-C scaffolding resulted in 23 pseudo-chromosomes with a total length of 712.17 Mb. A circular (Circos-style) chromosome map was drawn with TBtools¹⁷ and a Hi-C contact heatmap with Juicebox¹⁸ (Fig. 1A,B). Genome assembly statistics are shown in Table 1 and Table 2. BUSCO (Benchmarking Universal Single-Copy Orthologs)¹⁹ evaluation of the assembly reached 95.1% using the actinopterygii_odb9 database, indicating a largely complete fish genome assembly (Table 3).

Fig. 1 — (A) Global genome landscape of Chinese medaka. From inner to outer circles: Density of genes with 500 kbp windows, ranging from 0 to 50; Density of TE with 500 kbp windows, ranging from 0 to 600; Density of TRF with 500 kbp windows, ranging from 0 to 450; Distributions of GC content with 500 kbp windows. (B) Hi-C interaction heat map and scatter plot of genome assembly GC content and sequencing depth.

Repeat annotation

Tandem and interspersed repeats were identified before protein-coding gene prediction and functional annotation. For homology-based prediction, RepeatMasker²⁰ and RepeatProteinMask were applied to find repeat elements based on RepBase²¹. For de novo prediction, RepeatModeler²² was used; LTR_FINDER²³ and Tandem Repeats Finder (TRF)²⁴ were also used to predict repeat elements based on sequence features. Overall, 311.32 Mb of the O. sinensis genome assembly was identified as repetitive elements, accounting for 39.05% of the whole genome (Table 4).

Table 4.

Prediction of repeat elements.

Method	Repeat size (bp)	% of genome
TRF	12,455,455	1.56
RepeatMasker	104,455,645	13.10
RepeatProteinMask	31,952,649	4.01
De novo	293,230,747	36.78
Total	311,329,140	39.05
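The total in Table 4 (about 311.3 Mb) is smaller than the sum of the four rows (about 442.1 Mb), which is consistent with overlapping predictions from different tools being counted only once. The source does not state how the total was derived, so the following is a sketch of one common way to obtain a non-redundant length: merging overlapping intervals. The function name `merged_length` and the toy coordinates are hypothetical.

```python
def merged_length(intervals):
    """Return the number of bases covered by at least one interval.

    intervals is an iterable of half-open (start, end) pairs on a single
    sequence. Overlapping or touching intervals are merged so that shared
    bases are counted once.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps or touches the open block: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Toy predictions on one scaffold (made-up coordinates).
trf_calls = [(100, 200)]
homology_calls = [(150, 400), (900, 1000)]
de_novo_calls = [(380, 600), (950, 1200)]

all_calls = trf_calls + homology_calls + de_novo_calls
naive_sum = sum(end - start for start, end in all_calls)
print(naive_sum, merged_length(all_calls))
```

In the toy data the naive per-tool sum exceeds the merged length, mirroring the relationship between the individual rows and the total row of Table 4.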

Gene prediction and annotation

Protein-coding genes were predicted from multiple sources of evidence, including homology-based alignment, de novo prediction, and transcriptome-assisted methods. For homology-based alignment, the protein sequences of Danio rerio (GRCz11), Ictalurus punctatus (IpCoco_1.2), Oryzias javanicus (OJAV_1.1), Oryzias latipes (ASM223467v1), Oryzias melastigma (Om_v0.7), Poecilia formosa (PoeFor_5.1.2), and Takifugu rubripes (FUGU5) were mapped to the repeat soft-masked draft genome using BLAT²⁵, and GeneWise was applied to define gene models²⁶. For de novo prediction, Augustus²⁷ was used to predict the coding regions of genes. For the transcriptome-assisted methods, two approaches were used to obtain predicted gene sets. First, RNA reads were mapped to the genome assembly using HISAT2²⁸, and the alignments were used to build transcript-based gene models with StringTie²⁹ and TransDecoder. Second, a transcriptome was assembled from the RNA reads using Trinity, and PASA was then used to derive gene models. Finally, all evidence was merged into a consensus gene structure annotation using GLEAN³⁰. A total of 22,461 protein-coding genes were predicted. BUSCO¹⁹ assessment reached 94.6% using the actinopterygii_odb9 database, indicating relatively complete gene annotation (Table 3).

All predicted genes were compared with public biological function databases including KEGG³¹, Swissprot³², and TrEMBL³³ (https://www.uniprot.org/statistics/TrEMBL) using BLASTp³⁴ with an E-value cutoff of 1e-5 for functional annotation. InterPro was applied to provide functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites, followed by Gene Ontology annotation^35–37. Overall, 22,162 protein-coding genes (98.67%) were functionally annotated (Table 5).
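The E-value cutoff described above can be illustrated with a small filter over BLAST tabular output. This is a sketch, not the authors' pipeline: it assumes the standard 12-column tabular layout (E-value in column 11, bit score in column 12), keeps the highest-scoring hit per query as one possible selection rule, and uses hypothetical gene and protein identifiers. The function name `best_hits` is also hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Keep, for each query, the highest bit-score hit passing the cutoff.

    tabular_text is BLAST tabular output with the standard 12 columns:
    qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend,
    sstart, send, evalue, bitscore.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit does not pass the E-value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: gene1 has two significant hits, gene2 has none.
SAMPLE = "\n".join([
    "gene1\tprotA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300",
    "gene1\tprotB\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-20\t150",
    "gene2\tprotC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
])

hits = best_hits(SAMPLE)
print(hits)
```

Genes without any hit below the cutoff, like `gene2` here, would remain unannotated for that database.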

Table 5.

Genome function annotation result.

	Number	Percentage (%)
Total	22,461	100
Swissprot-Annotated	20,906	93.08
KEGG-Annotated	19,352	86.16
TrEMBL-Annotated	22,135	98.55
Interpro-Annotated	20,291	90.34
Overall	22,162	98.67

Non-coding RNAs (ncRNAs) are an important part of genome annotation as they can be active in transcriptional and translational regulation of gene expression as well as in the modulation of protein function³⁸. Since ribosomal RNA (rRNA) is highly conserved, vertebrate rRNA sequences were used as references and aligned to the draft genome using BLASTn with an E-value cutoff of 1e-5. For the discovery of transfer RNA (tRNA), tRNAscan-SE v1.3.1 was applied with eukaryotic parameters according to the characteristics of tRNA³⁹. The Rfam database was used to identify microRNAs (miRNAs) and small nuclear RNAs (snRNAs)^40,41. In total, 676 rRNA, 740 tRNA, 231 miRNA, and 1,168 snRNA genes were identified in the O. sinensis genome.

Chromosome accessibility analysis using ATAC-seq

ATAC-seq is a widely used method for identifying regions of open chromatin in genomes. SOAPnuke (v1.5.2)⁴² was used to filter out low-quality reads and reads with a high content of unknown bases. Bowtie2 (v2.2.5)⁴³ was used to align the reads to the O. sinensis genome assembly, and peak calling was performed with MACS2 (v2.1.0)⁴⁴. Genes associated with the peaks were used to carry out GO and KEGG enrichment analysis (Fig. 2).

Fig. 2 — Functional enrichment of genes associated with ATAC-seq peaks. (A) GO enrichment; (B) KEGG enrichment.
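The text does not state which statistic was used for the GO and KEGG enrichment. A common choice for this kind of over-representation analysis is the one-sided hypergeometric test, sketched below as an assumption rather than as the authors' method. The function name `hypergeom_enrichment_p` and the example counts other than the 22,461 annotated genes are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    population: total number of genes in the background
    annotated:  background genes carrying the term of interest
    sample:     genes in the test set (e.g. genes associated with peaks)
    hits:       test-set genes carrying the term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population")
    lower = max(hits, 0)
    upper = min(annotated, sample)
    # Count the ways of drawing at least `hits` annotated genes.
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(lower, upper + 1)
    )
    return favourable / comb(population, sample)


# Background of 22,461 genes (from the annotation); the other three
# numbers are made up for illustration.
p_value = hypergeom_enrichment_p(population=22461, annotated=200, sample=1500, hits=30)
print(p_value)
```

In practice such p-values are usually corrected for multiple testing across all terms examined.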

Data Records

The Nanopore long reads, stLFR genomic sequencing data, Hi-C data, and the assembled genome have been deposited at the China National GeneBank DataBase (CNGBdb) under the accession CNP0003475⁴⁵. The whole sequencing dataset of O. sinensis was deposited in the NCBI Sequence Read Archive database (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA895195) under project identification number PRJNA895195. This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JAUDJI000000000. The version described in this paper is version JAUDJI010000000⁴⁶. The Sequence Read Archive (SRA) project number is SRP410304⁴⁷. DNA sequencing data from the WGS library were deposited in the SRA at SRR22435960⁴⁸. DNA sequencing data from the Hi-C library were deposited in the SRA at SRR22435961⁴⁹. The ATAC-seq data were also deposited at Figshare⁵⁰.

Technical Validation

To evaluate the quality of the genome assembly, stLFR reads were mapped to the final reference genome assembly using BWA (v0.7.12). A total of 98.27% of reads were mapped, covering 98.29% of the genome sequence. The average sequencing depth across the genome reached 172x, and 97.34% of the genome had a sequencing depth of over 20x. The scatter plot of sequencing depth against GC content showed no abnormal GC content, indicating no exogenous contamination. The completeness of the genome assembly and annotation was assessed using BUSCO (v3.0) with the actinopterygii_odb9 database. The BUSCO completeness of the genome assembly and gene annotation reached 95.1% and 94.6%, respectively.
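The depth and coverage figures reported above are simple summaries of per-base sequencing depth. The sketch below shows how such summaries could be computed from a list of depths; it is not the authors' script, the function name `depth_summary` and the toy values are hypothetical, and the threshold is applied as "at least 20x", which is one reading of "over 20x".

```python
def depth_summary(depths, min_depth=20):
    """Summarise per-base sequencing depth.

    Returns (mean depth, fraction of positions with depth > 0,
    fraction of positions with depth >= min_depth).
    """
    n = len(depths)
    if n == 0:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / n
    covered = sum(1 for d in depths if d > 0) / n
    deep_enough = sum(1 for d in depths if d >= min_depth) / n
    return mean_depth, covered, deep_enough


# Toy per-base depths for ten positions (made-up values).
toy_depths = [0, 10, 25, 30, 40, 15, 22, 18, 50, 40]
mean_depth, covered, deep_enough = depth_summary(toy_depths)
print(mean_depth, covered, deep_enough)
```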

Acknowledgements

This research was financed by the National Natural Science Foundation of China (41806195, 31972794), Nanhai Scholar Project of GDOU (QNXZ201807, 201903), and State Laboratory of Developmental Biology of Freshwater Fish (2020KF004).

Author contributions

B.Z. and Z.W. designed the study. J.W., G.C., Y.G. and N.Z. performed the experiments and analyzed the data. Z.D. and B.Z. wrote the paper. Z.W. and N.Z. revised the manuscript.

Code availability

All software used in this study is publicly available, with parameters described in Methods and this section. Where no detailed parameters are mentioned for a tool, default parameters were used as recommended in its documentation.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Zhongdian Dong, Jiangman Wang.

Contributor Information

Zhongduo Wang, Email: aduofa@hotmail.com.

Bo Zhang, Email: zb611273@163.com.

References

1. Wittbrodt J, Shima A, Schartl M. Medaka–a model organism from the far East. Nat Rev Genet. 2002;3:53–64. doi: 10.1038/nrg704.
2. Cui L, et al. Oryzias sinensis, a new model organism in the application of eco-toxicity and water quality criteria (WQC). Chemosphere. 2020;261:127813. doi: 10.1016/j.chemosphere.2020.127813.
3. Wang Y, et al. Genome and transcriptome of Chinese medaka (Oryzias sinensis) and its uses as a model fish for evaluating estrogenicity of surface water. Environ Pollut. 2023;317:120724. doi: 10.1016/j.envpol.2022.120724.
4. Kasahara M, et al. The medaka draft genome and insights into vertebrate genome evolution. Nature. 2007;447:714–719. doi: 10.1038/nature05846.
5. Parenti LR. A phylogenetic analysis and taxonomic revision of ricefishes, Oryzias and relatives (Beloniformes, Adrianichthyidae). Zool J Linn. 2008;154:494–610. doi: 10.1111/j.1096-3642.2008.00417.x.
6. Wang O, et al. Efficient and unique cobarcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly. Genome Res. 2019;29:798–808. doi: 10.1101/gr.245126.118.
7. Zhao N, et al. Genome assembly and annotation at the chromosomal level of first Pleuronectidae: Verasper variegatus provides a basis for phylogenetic study of Pleuronectiformes. Genomics. 2021;113:717–726. doi: 10.1016/j.ygeno.2021.01.024.
8. Zhao N, et al. High-quality chromosome-level genome assembly of redlip mullet (Planiliza haematocheila). Zool Res. 2021;42:796–799. doi: 10.24272/j.issn.2095-8137.2021.255.
9. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011.
10. Vurture GW, et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33:2202–2204. doi: 10.1093/bioinformatics/btx153.
11. Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Direct determination of diploid genome sequences. Genome Res. 2017;27:757–767. doi: 10.1101/gr.214874.116.
12. Xu MY, et al. TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. Gigascience. 2020;9:giaa094. doi: 10.1093/gigascience/giaa094.
13. Roach MJ, Schmidt SA, Borneman AR. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics. 2018;19:460. doi: 10.1186/s12859-018-2485-7.
14. Servant N, et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 2015;16:259. doi: 10.1186/s13059-015-0831-x.
15. Durand NC, et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 2016;3:95–98. doi: 10.1016/j.cels.2016.07.002.
16. Dudchenko O, et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 2017;356:92–95. doi: 10.1126/science.aal3327.
17. Chen CC, et al. TBtools: an integrative toolkit developed for interactive analyses of big biological data. Mol Plant. 2020;13:1194–1202. doi: 10.1016/j.molp.2020.06.009.
18. Robinson JT, et al. Juicebox.js Provides a Cloud-Based Visualization System for Hi-C Data. Cell Syst. 2018;6:256–258. doi: 10.1016/j.cels.2018.01.001.
19. Waterhouse RM, et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Mol Biol Evol. 2018;35:543–548. doi: 10.1093/molbev/msx319.
20. Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009;Chapter 4:4.10.1–4.10.14. doi: 10.1002/0471250953.bi0410s25.
21. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9.
22. Tempel S. Using and understanding RepeatMasker. Methods Mol Biol. 2012;859:29–51. doi: 10.1007/978-1-61779-603-6_2.
23. Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007;35:W265–8. doi: 10.1093/nar/gkm286.
24. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580. doi: 10.1093/nar/27.2.573.
25. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202.
26. Birney E, Clamp M, Durbin R. GeneWise and genomewise. Genome Res. 2004;14:988–995. doi: 10.1101/gr.1865504.
27. Stanke M, Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 2005;33:W465–7. doi: 10.1093/nar/gki458.
28. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357–60. doi: 10.1038/nmeth.3317.
29. Pertea M, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–295. doi: 10.1038/nbt.3122.
30. Elsik CG, et al. Creating a honey bee consensus gene set. Genome Biol. 2007;8:R13. doi: 10.1186/gb-2007-8-1-r13.
31. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
32. UniProt Consortium T. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2018;46:2699. doi: 10.1093/nar/gky092.
33. Consortium U. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–D531. doi: 10.1093/nar/gkac1052.
34. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
35. The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 2019;47:D330–D338. doi: 10.1093/nar/gky1055.
36. Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
37. Mitchell AL, et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 2019;47:D351–D360. doi: 10.1093/nar/gky1100.
38. Szymanski M, Erdmann VA, Barciszewski J. Noncoding RNAs database (ncRNAdb). Nucleic Acids Res. 2007;35:D162–D1644. doi: 10.1093/nar/gkl994.
39. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–964. doi: 10.1093/nar/25.5.955.
40. Kalvari I, et al. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 2018;46:D335–D342. doi: 10.1093/nar/gkx1038.
41. Kalvari I, et al. Non-Coding RNA Analysis Using the Rfam Database. Curr Protoc Bioinformatics. 2018;62:e51. doi: 10.1002/cpbi.51.
42. Chen Y, et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience. 2018;7:1–6. doi: 10.1093/gigascience/gix120.
43. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
44. Zhang Y, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137. doi: 10.1186/gb-2008-9-9-r137.
45. China National GeneBank. https://db.cngb.org/search/project/CNP0003475/
46. Dong ZD, 2024. Genbank. GCA_037389245.1
47. 2023. NCBI Sequence Read Archive. SRP410304
48. 2023. NCBI Sequence Read Archive. SRR22435960
49. 2023. NCBI Sequence Read Archive. SRR22435961
50. Dong ZD, 2023. A high-quality chromosome-level genome assembly of the Chinese rice fish Oryzias sinensis. figshare.

Data Availability Statement

[CR1] 1.Wittbrodt J, Shima A, Schartl M. Medaka–a model organism from the far East. Nat Rev Genet. 2002;3:53–64. doi: 10.1038/nrg704. [DOI] [PubMed] [Google Scholar]

[CR2] 2.Cui L, et al. Oryzias sinensis, a new model organism in the application of eco-toxicity and water quality criteria (WQC) Chemosphere. 2020;261:127813. doi: 10.1016/j.chemosphere.2020.127813. [DOI] [PubMed] [Google Scholar]

[CR3] 3.Wang Y, et al. Genome and transcriptome of Chinese medaka (Oryzias sinensis) and its uses as a model fish for evaluating estrogenicity of surface water. Environ Pollut. 2023;317:120724. doi: 10.1016/j.envpol.2022.120724. [DOI] [PubMed] [Google Scholar]

[CR4] 4.Kasahara M, et al. The medaka draft genome and insights into vertebrate genome evolution. Nature. 2007;447:714–719. doi: 10.1038/nature05846. [DOI] [PubMed] [Google Scholar]

[CR5] 5.Parenti LR. A phylogenetic analysis and taxonomic revision of ricefishes, Oryzias and relatives (Beloniformes, Adrianichthyidae) Zool J Linn. 2008;154:494–610. doi: 10.1111/j.1096-3642.2008.00417.x. [DOI] [Google Scholar]

[CR6] 6.Wang O, et al. Efficient and unique cobarcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly. Genome Res. 2019;29:798–808. doi: 10.1101/gr.245126.118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Zhao N, et al. Genome assembly and annotation at the chromosomal level of first Pleuronectidae: Verasper variegatus provides a basis for phylogenetic study of Pleuronectiformes. Genomics. 2021;113:717–726. doi: 10.1016/j.ygeno.2021.01.024. [DOI] [PubMed] [Google Scholar]

[CR8] 8.Zhao N, et al. High-quality chromosome-level genome assembly of redlip mullet (Planiliza haematocheila) Zool Res. 2021;42:796–799. doi: 10.24272/j.issn.2095-8137.2021.255. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–770. doi: 10.1093/bioinformatics/btr011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Vurture GW, et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33:2202–2204. doi: 10.1093/bioinformatics/btx153. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Weisenfeld NI, Kumar V, Shah P, Church DM, Jaffe DB. Direct determination of diploid genome sequences. Genome Res. 2017;27:757–767. doi: 10.1101/gr.214874.116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Xu MY, et al. TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. Gigascience. 2020;9:giaa094. doi: 10.1093/gigascience/giaa094. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Roach MJ, Schmidt SA, Borneman AR. Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics. 2018;19:460. doi: 10.1186/s12859-018-2485-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Servant N, et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 2015;16:259. doi: 10.1186/s13059-015-0831-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Durand NC, et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 2016;3:95–98. doi: 10.1016/j.cels.2016.07.002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR16] 16.Dudchenko O, et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 2017;356:92–95. doi: 10.1126/science.aal3327. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR17] 17.Chen CC, et al. TBtools: an integrative toolkit developed for interactive analyses of big biological data. Mol Plant. 2020;13:1194–1202. doi: 10.1016/j.molp.2020.06.009. [DOI] [PubMed] [Google Scholar]

[CR18] 18.Robinson JT, et al. Juicebox.js Provides a Cloud-Based Visualization System for Hi-C Data. Cell Syst. 2018;6:256–258. doi: 10.1016/j.cels.2018.01.001. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR19] 19.Waterhouse RM, et al. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Mol Biol Evol. 2018;35:543–548. doi: 10.1093/molbev/msx319. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009;Chapter 4:4.10.1–4.10.14. doi: 10.1002/0471250953.bi0410s25. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 2015;6:11. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] 22.Tempel S. Using and understanding RepeatMasker. Methods Mol Biol. 2012;859:29–51. doi: 10.1007/978-1-61779-603-6_2. [DOI] [PubMed] [Google Scholar]

[CR23] 23.Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 2007;35:W265–8. doi: 10.1093/nar/gkm286. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR24] 24.Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580. doi: 10.1093/nar/27.2.573. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR25] 25.Kent WJBLAT–the. BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]

26. Birney E, Clamp M, Durbin R. GeneWise and genomewise. Genome Res. 2004;14:988–995. doi: 10.1101/gr.1865504.

27. Stanke M, Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 2005;33:W465–7. doi: 10.1093/nar/gki458.

28. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357–60. doi: 10.1038/nmeth.3317.

29. Pertea M, et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–295. doi: 10.1038/nbt.3122.

30. Elsik CG, et al. Creating a honey bee consensus gene set. Genome Biol. 2007;8:R13. doi: 10.1186/gb-2007-8-1-r13.

31. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.

32. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2018;46:2699. doi: 10.1093/nar/gky092.

33. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–D531. doi: 10.1093/nar/gkac1052.

34. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.

35. The Gene Ontology Consortium. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 2019;47:D330–D338. doi: 10.1093/nar/gky1055.

36. Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.

37. Mitchell AL, et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 2019;47:D351–D360. doi: 10.1093/nar/gky1100.

38. Szymanski M, Erdmann VA, Barciszewski J. Noncoding RNAs database (ncRNAdb). Nucleic Acids Res. 2007;35:D162–D164. doi: 10.1093/nar/gkl994.

39. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–964. doi: 10.1093/nar/25.5.955.

40. Kalvari I, et al. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 2018;46:D335–D342. doi: 10.1093/nar/gkx1038.

41. Kalvari I, et al. Non-Coding RNA Analysis Using the Rfam Database. Curr Protoc Bioinformatics. 2018;62:e51. doi: 10.1002/cpbi.51.

42. Chen Y, et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience. 2018;7:1–6. doi: 10.1093/gigascience/gix120.

43. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.

44. Zhang Y, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137. doi: 10.1186/gb-2008-9-9-r137.

45. China National GeneBank. https://db.cngb.org/search/project/CNP0003475/

46. Dong ZD. 2024. GenBank. GCA_037389245.1

47. 2023. NCBI Sequence Read Archive. SRP410304

48. 2023. NCBI Sequence Read Archive. SRR22435960

49. 2023. NCBI Sequence Read Archive. SRR22435961

50. Dong ZD. 2023. A high-quality chromosome-level genome assembly of the Chinese rice fish Oryzias sinensis. figshare.
