Bicolor angelfish (Centropyge bicolor) provides the first chromosome-level genome of the Pomacanthidae family

Chunhua Li; Xianwei Yang; Libin Shao; Rui Zhang; Qun Liu; Mengqi Zhang; Shanshan Liu; Shanshan Pan; Weizhen Xue; Congyan Wang; Chunyan Mao; He Zhang; Guangyi Fan

doi:10.46471/gigabyte.32

. 2021 Nov 9;2021:gigabyte32. doi: 10.46471/gigabyte.32

Bicolor angelfish (Centropyge bicolor) provides the first chromosome-level genome of the Pomacanthidae family

Chunhua Li ^1,^†, Xianwei Yang ^1,^3,^†, Libin Shao ^1,^†, Rui Zhang ¹, Qun Liu ¹, Mengqi Zhang ¹, Shanshan Liu ¹, Shanshan Pan ¹, Weizhen Xue ¹, Congyan Wang ¹, Chunyan Mao ¹, He Zhang ^1,^2,^*, Guangyi Fan ^1,^*

PMCID: PMC9650296 PMID: 36824335

Abstract

The Bicolor Angelfish, Centropyge bicolor, is a tropical coral reef fish. It is named for its striking two-color body. However, a lack of high-quality genomic data means little is known about the genome of this species. Here, we present a chromosome-level C. bicolor genome constructed using Hi-C data. The assembled genome is 650 Mbp in size, with a scaffold N50 value of 4.4 Mbp, and a contig N50 value of 114 Kbp. Protein-coding genes numbering 21,774 were annotated. Our analysis will help others to choose the most appropriate de novo genome sequencing strategy based on resources and target applications. To the best of our knowledge, this is the first chromosome-level genome for the Pomacanthidae family, which might contribute to further studies exploring coral reef fish evolution, diversity and conservation.

Data description

Background

Centropyge bicolor (NCBI:txid109723; FishbaseID: 5454; urn:lsid:marinespecies.org:taxname:211780) (Figure 1), also known as the Bicolor, Two-Colored, or Pacific Rock Beauty Angelfish, is a showy coral reef fish commonly distributed in the Indo–Pacific ocean (from East Africa to the Samoan and Phoenix Islands, north to southern Japan, south to New Caledonia; throughout Micronesia). As a member of the Pomacanthidae family, it is similar to those of the Chaetodontidae (Butterflyfishes) but is distinguished by the presence of strong preopercle spines. C. bicolor has clear boundaries between its body colors, so might be a good model in which to study body color development in coral fish [1].

Context

Although the availability of genetic, and especially genomic resources, remains limited for the Pomacanthidae family, we assembled the first C. bicolor reference genome. This will provide valuable information for genetic studies of this coral reef fish, and will contribute to studies in body color diversity. With the whole genome sequence of C. bicolor, it might be possible to explore the genetic mechanisms of body color development in coral reef fish by comparative genomic methods.

Methods and results

A protocols collection for BGISEQ-500, stLFR and Hi-C library construction is available in protocols.io (Figure 2) [2].

Sample collection and genome sequencing

A C. bicolor individual was collected from the market in Xiamen, Fujian Province, China. DNA was extracted from fresh muscle tissue according to a standard protocol. Single-tube long fragment read (stLFR) [2] and Hi-C libraries were constructed following the manufacturers’ instructions [2, 3] to sequence and assemble the genome. We obtained 130.47 Gbp (gigabase pairs; ∼197×) raw stLFR data and 134.57 Gbp (∼203.20×) raw Hi-C data (Table 1) using the BGISEQ-500 platform in 100-bp (basepair) paired-end mode.

Table 1.

Statistics of DNA sequencing data.

		Raw data		Valid data
Libraries	Read length	Total bases (Gbp)	Sequencing depth (×)	Total bases (Gbp)	Sequencing depth (×)
stLFR	100:100	130.47	197.00	60.71	91.67
Hi-C	100:100	134.57	203.20	42.51	64.19

Open in a new tab

Sequencing depth = Total bases / Genome size, where the genome size is the result of k-mer estimation, as shown in Table 2.

Low-quality reads (sequences with more than 40% of bases with a quality score lower than 8), polymerase chain reaction (PCR) duplications, adaptor sequences and reads with a high (greater than 10%) proportion of ambiguous bases (Ns) occurring in stLFR data were filtered using SOAPnuke (v1.6.5; RRID:SCR_015025) [4]. We obtained 62.6 Gbp (∼91.67×) clean data (Table 1) to assemble the draft genome. Meanwhile, HiC-Pro (v. 2.8.0) [5] was used for the quality control of raw Hi-C data, and 42.51 Gbp (∼64.19×) valid data were used to assemble the genome to the chromosome-level (Table 1).

Genome assembly

Using GenomeScope software (RRID:SCR_017014) with stLFR clean data, k-mer distribution was used to understand the genome complexity before genome assembly [6]. The genome size of C. bicolor was estimated as 662.27 Mbp (megabase pairs), with 37.6% repeat sequences and 1.16% heterozygous sites (Table 2, Figure 3).

Table 2.

Statistical information of 17-mer analysis.

k-mer	k-mer number	k-mer Depth	Heterozygosity (%)	Genome size (Mbp)
17	50,994,645,240	77	1.16	662.27

Open in a new tab

The genome size, G, was defined as G = K_num/K_depth, where K_num is the total number of k-mers, and K_depth is the most frequently occurring k-mer.

Figure 3. — The 17-mer depth distribution of *Centropyge bicolor*. The estimated genome size is 662.27 Mbp and the heterozygosity is 1.16%.

We reformatted the clean stLFR data into 10× Genomics format using an in-house script [7] and assembled the draft genome using Supernova (v.2.0.1, RRID:SCR_016756) [8] with default parameters. The draft genome was 681 Mbp, with a contig N50 of 115.5 Kbp (kilobase pairs) and scaffold N50 of 4.4 Mbp (Table 3), which is similar to the estimated genome size.

Table 3.

Statistics of the draft assembly with stLFR data.

Statistics	Contig	Scaffold
Total number (#)	40,442	29,065
Total length (bp)	655,705,062	681,285,455
Gap (N) (bp)	0	25,580,393
Average length (bp)	16,213.47	23,440.06
N50 length (bp)	115,524	4,424,004
N90 length (bp)	6,029	7,618
Maximum length (bp)	1,148,507	21,943,074
Minimum length (bp)	48	940
GC content (%)	41.74	41.74

Open in a new tab

To obtain the chromosome-level genome, we used Juicer (v3, RRID:SCR_017226) [9] to build a contact matrix and 3dDNA (v.170123) [10] to sort and anchor scaffolds with the parameters: “–m haploid –s 4 –c 24”. There are 24 distinct contact blocks, which correspond to 24 chromosomes, representing 96% of the whole genome (Figures 4A, 5, Table 4). On evaluating the completeness of the genome and gene set using Benchmarking Universal Single-Copy Orthologs (BUSCO, v.3.0.2, RRID:SCR_015008) [11] and a vertebrata database, our assembly maintained a score of 96.2% (Table 5). We also identified putative homologous chromosomal regions between C. bicolor and Oryzias latipes by MCscanx [12] (Figure 6).

Figure 4. — Annotation of the *Centropyge bicolor* genome. (A) Basic genomic elements of the *Centropyge bicolor* genome. LTR, long terminal repeat; LINE, long interspersed nuclear elements; SINE, short interspersed elements. (B) Physical map of mitochondrial assembly.

Figure 5. — Heat map of interactive intensity between chromosome sequences.

Table 4.

Statistics of the chromosome-level genome.

Statistics	Contig	Scaffold
Total number (#)	40,778	28,555
Total length (bp)	655,705,062	680,873,932
Gap (N) (bp)	0	25,168,870
Average length (bp)	16,079.87	23,844.30
N50 length (bp)	113,563	21,943,074
N90 length (bp)	5,988	7,542
Maximum length (bp)	1,148,507	28,105,280
Minimum length (bp)	43	43
GC content (%)	41.74	41.74

Open in a new tab

Table 5.

Statistics of the BUSCO assessment.

Types of BUSCOs	Gene set		Assembly
	Number	Percentage (%)	Number	Percentage (%)
Complete BUSCOs	2,408	93.1	2,486	96.2
Complete single-copy BUSCOs	2,348	90.8	2,438	94.3
Fragmented BUSCOs	81	3.1	64	2.5
Missing BUSCOs	97	3.8	36	1.3
Total BUSCO groups searched	2,586	100	2,586	100

Open in a new tab

Figure 6. — Homologous chromosomal regions between *Centropyge bicolor* and *Oryzias latipes*.

In addition, we cut off partial stLFR reads (25 M) for assembly by MitoZ with default parameters [13] and obtained a 16,961-bp circular mitochondrial genome of C. bicolor. Thirteen protein-coding genes, 24 tRNA genes and three rRNA genes were annotated by GeSeq (RRID:SCR_017336) [14] (Figure 4B).

Genomic annotation

For the annotation of repeats, we carried out homolog annotation and ab initio prediction independently. RepeatMasker (v.4.0.6, RRID:SCR_012954) [15], RepeatProteinMask (a module from RepeatMasker) and trf (Tandem Repeats Finder, v.4.07b) [16] were used to identify known repetitive sequences by comparing the whole genome with RepBase [17]. LTR_FINDER (v.1.06, RRID:SCR_015247) [16, 18] and RepeatModeler (v.1.0.8, RRID:SCR_015027) [19] were used in de novo prediction. We also classified transposable elements (TEs) from the integration of all repeats. In total, we identified 124 Mbp (18.32% of the entire genome) of repetitive sequences (Figure 4A, Table 6), including 110 Mbp of TEs (Figure 4A, Table 7).

Table 6.

Statistics of repetitive sequences.

Type	Repeat size (bp)	Percentage of genome (%)
TRF	14,165,095	2.08
RepeatMasker	43,423,877	6.38
RepeatProteinMask	12,503,750	1.84
De novo	110,871,693	16.28
Total	124,708,977	18.32

Open in a new tab

Table 7.

Statistics of transposable elements.

	Repbase TEs, n (%)	Protein TEs, n (%)	De novo TEs, n (%)	Combined TEs, n (%)
DNA	27,163,851 (3.990)	1,068,990 (0.157)	61,731,447 (9.067)	70,925,963 (10.417)
LINE	10,228,332 (1.502)	6,956,340 (1.022)	20,006,579 (2.938)	26,714,285 (3.924)
SINE	856,125 (0.126)	0 (0.000)	497,024 (0.073)	1,187,676 (0.174)
LTR	10,971,817 (1.611)	4,485,808 (0.659)	16,270,071 (2.390)	23,101,529 (3.393)
Other	10,041 (0.001)	0	0	10,041 (0.001)
Unknown	0	0	14,054,230 (2.064)	14,054,230 (2.064)
Total	43,423,877 (6.378)	12,503,750 (1.836)	99,265,690 (14.579)	109,868,166 (16.136)

Open in a new tab

Homolog-based and ab initio prediction were used to identify the protein-coding genes. Augustus (v.3.3, RRID:SCR_008417) [20] was used in ab initio prediction basing on a repeat-masked genome [21]. Protein sequences of Astatotilapia calliptera, Danio rerio, Larimichthys crocea, and Oreochromis niloticus were downloaded from the National Center for Biotechnology Information (NCBI) GenBank database and aligned to the C. bicolor genome for homolog gene annotation with Genewise (v2.4.1, RRID:SCR_015054) [22]. Finally, we used GLEAN [23] to integrate all the above evidence and obtained a total of 21,774 genes, which contained 11 exons on average and had an average coding sequence (CDS) length of 1,575 bp (Table 8).

Table 8.

Statistics of the predicted genes in the bicolor angelfish genome.

	Gene set	Gene number	Average transcript length (bp)	Average CDS length (bp)	Average intron length (bp)	Average exon length (bp)	Average exons per gene
Homolog	Astatotilapia calliptera	51,174	21,762.29	2,259.23	1,691.33	180.29	12.53
	Danio rerio	22,005	27,982.75	1,570.36	3,438.82	180.90	8.68
	Larimichthys crocea	47,419	19,884.78	2,139.39	1,575.94	174.50	12.26
	Oreochromis niloticus	47,067	17,771.04	1,906.97	1,608.29	175.53	10.86
De novo	Augustus	34,470	9,675.42	1,335.20	1,344.81	185.40	7.20
GLEAN		21,774	14,024.40	1,906.28	1,206.07	172.55	11.05

Open in a new tab

The GLEAN gene set is the integrated result of de novo gene predictions and homolog gene predictions.

To predict gene functions, 21,774 genes were aligned against several public databases, including TrEMBL [24], SwissProt [24], KEGGViewer [25] and InterProScan [26]. As a result, 99.67% of all genes were predicted functionally (Table 9, Figure 7).

Table 9.

Statistics of the functional annotation.

Database	Number	Percentage (%)
Total	21,774	100.00
SwissProt	20,784	95.45
KEGG	19,168	88.03
TrEMBL	21,688	99.61
Interpro	20,153	92.56
Overall	21,702	99.67

Open in a new tab

Figure 7. — Venn diagram of orthologous gene families. Four teleost species (*Centropyge. bicolor, Larimichthys crocea, Oreochromis niloticus, and Danio rerio*) were used to generate the Venn diagram based on gene family cluster analysis.

Phylogenetic analysis

We downloaded the gene data of seven representative teleost fishes from NCBI to study the phylogenetic relationships between C. bicolor. These seven fishes were: Danio rerio, Gasterosteus aculeatus, Gadus morhua, Larimichthys crocea, Oryzias latipes, Oreochromis niloticus and Tetraodon nigroviridis. For each dataset, the longest transcripts were selected and aligned to each other by BLASTP (v2.9.0, RRID:SCR_001010) [27] (E-value ≤ 1e-5). TreeFam (v.2.0.9, RRID:SCR _013401) [28] was used to cluster gene families, with default parameters. Among all 20,706 clustered gene families, there were 4,450 common single-copy families and 57 families specific to C. bicolor (Table 10). With single-copy sequences, we used PhyML (v.3.3, RRID:SCR_014629) [29] to construct the phylogenetic tree of C. bicolor and the seven other fishes mentioned above, setting D. rerio as an outgroup.

Table 10.

Statistics of gene family clustering.

Species	Total genes	Unclustered genes	Families	Unique families	Average number of genes per family
Centropyge bicolor	21,774	694	16,219	57	1.3
Danio rerio	30,067	2,188	18,575	726	1.5
Gasterosteus aculeatus	20,756	784	15,921	16	1.25
Gadus morhua	19,987	535	15,630	9	1.24
Larimichthys crocea	24,403	610	17,273	55	1.38
Oryzias latipes	19,535	1,048	14,805	87	1.25
Oreochromis niloticus	21,431	180	15,780	14	1.35
Tetraodon nigroviridis	19,544	901	14,803	57	1.26

Open in a new tab

Based on the phylogenetic tree and single-copy sequences, the divergence time between different species was estimated by MCMCTREE with parameters of “–model 0 –rootage 500 -clock 3”. The results showed that C. bicolor was formed ∼34.95 million years ago, when differentiated from the common ancestor with L. crocea (Figure 8).

Figure 8. — Comparative analysis of the *Centropyge bicolor* genome. (A) The protein-coding genes of the eight species were clustered into 17,849 gene families. Among these gene families, 4,450 were single-copy gene families. (B) Phylogenetic analysis of *Centropyge bicolor* (Cbi.), *Danio rerio* (Dre.), *Gasterosteus aculeatus* (Gac.), *Gadus morhua* (Gmo.), *Larimichthys crocea* (Lcr.), *Oryzias latipes* (Ola.), *Oreochromis niloticus* (Oni.), and *Tetraodon nigroviridis* (Tni.) using single-copy gene families. The species differentiation time between *Centropyge bicolor* and *Larimichthys crocea* was ∼34.95 million years.

Analysis of bicolor formation in teleosts

Current studies suggest that different pigment cells produce different pigments. Some types of pigment cells already have been identified in teleost [30]. C. bicolor has an attractive body color with clear color boundaries, but the molecular mechanism underlying this remains unknown. Compared with other teleost, there are 1,081 expanded gene families and 57 specific gene families in C. bicolor (Figure 9). Functional enrichment analysis showed that notable expansion occurred in those gene families related to visual development and enzyme metabolism (Figure 9).

Figure 9. — Statistics of gene function enrichment (Gene Ontology) for expanded genes of *Centropyge bicolor*. Nodes are colored by q-value (adjusted p-value). Node size is shown according to its enriched gene number.

Re-use potential

Coral reef fishes, with distinctive color patterns and color morphs, are important for understanding the adaptive evolution of fishes. In this study, we firstly assembled a high-quality, chromosome-level genome of C. bicolor, with a length of 681 Mbp, and annotated 21,774 genes. This is the first genome of a fish from the Pomacanthidae family. These genomic data will be useful for genome-scale comparisons and further studies on the mechanisms underlying colorful body development and adaptation.

Acknowledgements

We thank the China National Genebank for technical support in constructing and sequencing the stLFR library.

Funding Statement

This work was supported by funding from the “Blue Granary” project for scientific and technological innovation of China (2018YFD0900301-05).

Data availability

The data sets supporting the results of this article are available in the GigaScience Database [31]. Raw reads from genome sequencing and assembly are deposited at the China National Gene Bank under reference number CNP0001160, which contains sample information (CNS0315939), Hi-C raw data (CNX0286336) and stLFR raw data (CNX0286337). The project also has been deposited at NCBI under accession ID PRJNA702283.

Declarations

List of abbreviations

bp: base pair; BUSCO: Benchmarking Universal Single-Copy Orthologs; Gbp: gigabase pair; Kbp: kilobase pair; KEGG: Kyoto Enyclopedia of Genes and Genomes; Mbp: megabase pair; NCBI: National Center for Biotechnology Information; stLFR: single-tube long fragment reads; TE: transposable element.

Ethical approval

All resources used in this study were approved by the Institutional Review Board of BGI (IRB approval No. FT17007). This experiment has passed the ethics audit of the Beijing Genomics Institute (BGI) Gene Bioethics and Biosecurity Review Committee.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by funding from the “Blue Granary” project for scientific and technological innovation of China (2018YFD0900301-05).

Authors’ contributions

H.Z. and G.F. designed this project. M.Z. prepared the samples. S.L., S.P., W.X., C.W. and C.M. conducted the experiments. C.L., X.Y., L.S., R.Z. and Q.L. did the analyses. C.L., X.Y., L.S., R.Z. wrote and revised the manuscript. All authors read and approved the final version of the manuscript.

References

1.Mendoncą RC, Chen JY, Zeng C, Tsuzuki MY, . Embryonic and early larval development of two marine angelfish, Centropyge bicolor and Centropyge bispinosa. Zygote, 2020; 28(3): 196–202. doi: 10.1017/S0967199419000789. [DOI] [PubMed] [Google Scholar]
2.Li C, et al. Protocols for “Bicolor Angelfish (Centropyge bicolor) genome provided first chromosome-level reference of Pomacanthidae family and clues for bi-color body formation”. protocols.io. 2020; 10.17504/protocols.io.bpxhmpj6. [DOI]
3.Wang O, et al. Efficient and unique cobarcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly. Genome Res., 2019; 29(5): 798–808. doi: 10.1101/gr.245126.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Chen Y, et al. SOAPnuke: A MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience, 2018; 7(1): gix120. doi: 10.1093/gigascience/gix120. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Chen C-J, et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol., 2015; 16: 259. doi: 10.1186/s13059-015-0831-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Vurture GW, et al. GenomeScope: Fast reference-free genome profiling from short reads. Bioinformatics, 2017; 33(14): 2202–2204. doi: 10.1093/bioinformatics/btx153. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.BGI-QingDao . stlfr2supernova_pipeline. 2021; https://github.com/BGI-Qingdao/stlfr2supernova_pipeline.
8.Wong KHY, Levy-Sakin M, Kwok PY, . De novo human genome assemblies reveal spectrum of alternative haplotypes in diverse populations. Nat. Commun., 2018; 9: 3040. doi: 10.1038/s41467-018-05513-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Durand NC, et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst., 2016; 3(1): 95–98. doi: 10.1016/j.cels.2016.07.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Dudchenko O, et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science, 2017; 356(6333): 92–95. doi: 10.1126/science.aal3327. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Waterhouse RM, Seppey M, Sim FA, Ioannidis P, . BUSCO applications from quality assessments to gene prediction and phylogenomics. Letter Fast Track, 2017; doi: 10.1093/molbev/msx319. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Wang Y, et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res., 2012; 40(7): e49. doi: 10.1093/nar/gkr1293. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Meng G, Li Y, Yang C, Liu S, . MitoZ: A toolkit for animal mitochondrial genome assembly, annotation and visualization. Nucleic Acids Res., 2019; 47(11): e63. doi: 10.1093/nar/gkz173. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Tillich M, et al. GeSeq – versatile and accurate annotation of organelle genomes. Nucleic Acids Res., 2017; 45(W1): W6–W11. doi: 10.1093/nar/gkx391. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Tarailo-Graovac M, Chen N, . Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics, 2009; doi: 10.1002/0471250953.bi0410s25. [DOI] [PubMed] [Google Scholar]
16.Carrillo-Avila M, Resende EK, Marques DKS, Galetti PM, . Tandem repeats finder: a program to analyze DNA sequences. Conserv. Genet., 2009; 25: 4.10.1–4.10.14. doi: 10.1590/S1679-62252007000200018. [DOI] [Google Scholar]
17.Bao W, Kojima KK, Kohany O, . Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA, 2015; 6: 11. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Xu Z, Wang H, . LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res., 2007; 35(2): W265–W268. doi: 10.1093/nar/gkm286. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Storer J, Hubley R, Rosen J, Wheeler TJ, Smit AF, . The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob. DNA, 2021; 12: 2. doi: 10.1186/s13100-020-00230-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Stanke M, Schöffmann O, Morgenstern B, Waack S, . Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinform., 2006; 7: 62. doi: 10.1186/1471-2105-7-62. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B, . AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res., 2006; 34(2): W435–W439. doi: 10.1093/nar/gkl200. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Doerks T, Copley RR, Schultz J, Ponting CP, Bork P, . Systematic identification of novel protein domain families associated with nuclear functions. Genome Res., 2002; 12(1): 47–56. doi: 10.1101/gr.203201. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Lewis S, et al. Creating a honey bee consensus gene set. Genome Biol., 2002; 3: research0082.1. doi: 10.1186/gb-2002-3-12-research0082. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Bairoch A, . The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 2000; 28(1): 45–48. doi: 10.1093/nar/28.1.45. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Habermann BH, Villaveces JM, Jimenez RC, . KEGGViewer, a BioJS component to visualize KEGG pathways. F1000Research, 2014; 3: 43, doi: 10.12688/f1000research.3-43.v1. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Jones P, et al. InterProScan 5: Genome-scale protein function classification. Bioinformatics, 2014; 30(9): 1236–1240. doi: 10.1093/bioinformatics/btu031. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ, . Basic local alignment search tool. J. Mol. Biol., 1990; 215(3): 403–410. doi: 10.1016/S0022-2836(0580360-2. [DOI] [PubMed] [Google Scholar]
28.Ruan J, et al. TreeFam: 2008 update. Nucleic Acids Res., 2007; 36(suppl 1): D735–D740. doi: 10.1093/nar/gkm1005. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O, . New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst. Biol., 2010; 59(3): 307–321. doi: 10.1093/sysbio/syq010. [DOI] [PubMed] [Google Scholar]
30.Kimura T, et al. Leucophores are similar to xanthophores in their specification and differentiation processes in medaka. Proc. Natl Acad. Sci. USA, 2014; 111(20): 7343–7348. doi: 10.1073/pnas.1311254111. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Li C, et al. Genome data of the bicolor angelfish (Centropyge bicolor). GigaScience Database. 2020; 10.5524/100802. [DOI] [Google Scholar]

GigaByte. 2021 Nov 9;2021:gigabyte32.

Article Submission

Chunhua Li

GigaByte.

Assign Handling Editor

Editor: Scott Edmunds

GigaByte.

Editor Assess MS

Editor: Hongfang Zhang

Open in a new tab

GigaByte.

Export to Production

Editor: Scott Edmunds

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

[ref1] 1.Mendoncą RC, Chen JY, Zeng C, Tsuzuki MY, . Embryonic and early larval development of two marine angelfish, Centropyge bicolor and Centropyge bispinosa. Zygote, 2020; 28(3): 196–202. doi: 10.1017/S0967199419000789. [DOI] [PubMed] [Google Scholar]

[ref2] 2.Li C, et al. Protocols for “Bicolor Angelfish (Centropyge bicolor) genome provided first chromosome-level reference of Pomacanthidae family and clues for bi-color body formation”. protocols.io. 2020; 10.17504/protocols.io.bpxhmpj6. [DOI]

[ref3] 3.Wang O, et al. Efficient and unique cobarcoding of second-generation sequencing reads from long DNA molecules enabling cost-effective and accurate sequencing, haplotyping, and de novo assembly. Genome Res., 2019; 29(5): 798–808. doi: 10.1101/gr.245126.118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref4] 4.Chen Y, et al. SOAPnuke: A MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience, 2018; 7(1): gix120. doi: 10.1093/gigascience/gix120. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref5] 5.Chen C-J, et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol., 2015; 16: 259. doi: 10.1186/s13059-015-0831-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref6] 6.Vurture GW, et al. GenomeScope: Fast reference-free genome profiling from short reads. Bioinformatics, 2017; 33(14): 2202–2204. doi: 10.1093/bioinformatics/btx153. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref7] 7.BGI-QingDao . stlfr2supernova_pipeline. 2021; https://github.com/BGI-Qingdao/stlfr2supernova_pipeline.

[ref8] 8.Wong KHY, Levy-Sakin M, Kwok PY, . De novo human genome assemblies reveal spectrum of alternative haplotypes in diverse populations. Nat. Commun., 2018; 9: 3040. doi: 10.1038/s41467-018-05513-w. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref9] 9.Durand NC, et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst., 2016; 3(1): 95–98. doi: 10.1016/j.cels.2016.07.002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref10] 10.Dudchenko O, et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science, 2017; 356(6333): 92–95. doi: 10.1126/science.aal3327. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref11] 11.Waterhouse RM, Seppey M, Sim FA, Ioannidis P, . BUSCO applications from quality assessments to gene prediction and phylogenomics. Letter Fast Track, 2017; doi: 10.1093/molbev/msx319. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref12] 12.Wang Y, et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res., 2012; 40(7): e49. doi: 10.1093/nar/gkr1293. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref13] 13.Meng G, Li Y, Yang C, Liu S, . MitoZ: A toolkit for animal mitochondrial genome assembly, annotation and visualization. Nucleic Acids Res., 2019; 47(11): e63. doi: 10.1093/nar/gkz173. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref14] 14.Tillich M, et al. GeSeq – versatile and accurate annotation of organelle genomes. Nucleic Acids Res., 2017; 45(W1): W6–W11. doi: 10.1093/nar/gkx391. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref15] 15.Tarailo-Graovac M, Chen N, . Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics, 2009; doi: 10.1002/0471250953.bi0410s25. [DOI] [PubMed] [Google Scholar]

[ref16] 16.Carrillo-Avila M, Resende EK, Marques DKS, Galetti PM, . Tandem repeats finder: a program to analyze DNA sequences. Conserv. Genet., 2009; 25: 4.10.1–4.10.14. doi: 10.1590/S1679-62252007000200018. [DOI] [Google Scholar]

[ref17] 17.Bao W, Kojima KK, Kohany O, . Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA, 2015; 6: 11. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref18] 18.Xu Z, Wang H, . LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res., 2007; 35(2): W265–W268. doi: 10.1093/nar/gkm286. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref19] 19.Storer J, Hubley R, Rosen J, Wheeler TJ, Smit AF, . The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob. DNA, 2021; 12: 2. doi: 10.1186/s13100-020-00230-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref20] 20.Stanke M, Schöffmann O, Morgenstern B, Waack S, . Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinform., 2006; 7: 62. doi: 10.1186/1471-2105-7-62. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref21] 21.Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B, . AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res., 2006; 34(2): W435–W439. doi: 10.1093/nar/gkl200. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref22] 22.Doerks T, Copley RR, Schultz J, Ponting CP, Bork P, . Systematic identification of novel protein domain families associated with nuclear functions. Genome Res., 2002; 12(1): 47–56. doi: 10.1101/gr.203201. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref23] 23.Lewis S, et al. Creating a honey bee consensus gene set. Genome Biol., 2002; 3: research0082.1. doi: 10.1186/gb-2002-3-12-research0082. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref24] 24.Bairoch A, . The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 2000; 28(1): 45–48. doi: 10.1093/nar/28.1.45. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref25] 25.Habermann BH, Villaveces JM, Jimenez RC, . KEGGViewer, a BioJS component to visualize KEGG pathways. F1000Research, 2014; 3: 43, doi: 10.12688/f1000research.3-43.v1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref26] 26.Jones P, et al. InterProScan 5: Genome-scale protein function classification. Bioinformatics, 2014; 30(9): 1236–1240. doi: 10.1093/bioinformatics/btu031. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref27] 27.Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ, . Basic local alignment search tool. J. Mol. Biol., 1990; 215(3): 403–410. doi: 10.1016/S0022-2836(0580360-2. [DOI] [PubMed] [Google Scholar]

[ref28] 28.Ruan J, et al. TreeFam: 2008 update. Nucleic Acids Res., 2007; 36(suppl 1): D735–D740. doi: 10.1093/nar/gkm1005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref29] 29.Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O, . New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst. Biol., 2010; 59(3): 307–321. doi: 10.1093/sysbio/syq010. [DOI] [PubMed] [Google Scholar]

[ref30] 30.Kimura T, et al. Leucophores are similar to xanthophores in their specification and differentiation processes in medaka. Proc. Natl Acad. Sci. USA, 2014; 111(20): 7343–7348. doi: 10.1073/pnas.1311254111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref31] 31.Li C, et al. Genome data of the bicolor angelfish (Centropyge bicolor). GigaScience Database. 2020; 10.5524/100802. [DOI] [Google Scholar]

PERMALINK

Bicolor angelfish (Centropyge bicolor) provides the first chromosome-level genome of the Pomacanthidae family

Chunhua Li

Xianwei Yang

Libin Shao

Rui Zhang

Qun Liu

Mengqi Zhang

Shanshan Liu

Shanshan Pan

Weizhen Xue

Congyan Wang

Chunyan Mao

He Zhang

Guangyi Fan

Roles

Abstract

Data description

Background

Figure 1.

Context

Methods and results

Figure 2.

Sample collection and genome sequencing

Table 1.

Genome assembly

Table 2.

Figure 3.

Table 3.

Figure 4.

Figure 5.

Table 4.

Table 5.

Figure 6.

Genomic annotation

Table 6.

Table 7.

Table 8.

Table 9.

Figure 7.

Phylogenetic analysis

Table 10.

Figure 8.

Analysis of bicolor formation in teleosts

Figure 9.

Re-use potential

Acknowledgements

Funding Statement

Data availability

Declarations

List of abbreviations

Ethical approval

Consent for publication

Competing interests

Funding

Authors’ contributions

References

Article Submission

Ms Chunhua Li

Roles

Assign Handling Editor

Roles

Editor Assess MS

Roles

Curator Assess MS

Roles

Review MS

Roles

Review MS

Roles

Editor Decision

Roles

Major Revision

Ms Chunhua Li

Roles

Assess Revision

Roles

Re-Review MS

Roles

Editor Decision