Abstract
In China, 65 types of venomous snakes exist, with the Chinese Cobra Naja atra being prominent and a major cause of snakebites in humans. Furthermore, N. atra is a protected animal in some areas, as it has been listed as vulnerable by the International Union for Conservation of Nature. Recently, due to the medical value of snake venoms, venomics has experienced growing research interest. In particular, genomic resources are crucial for understanding the molecular mechanisms of venom production. Here, we report a highly continuous genome assembly of N. atra, based on a snake sample from Huangshan, Anhui, China. The size of this genome is 1.67 Gb, while its repeat content constitutes 37.8% of the genome. A total of 26,432 functional genes were annotated. This data provides an essential resource for studying venom production in N. atra. It may also provide guidance for the protection of this species.
Introduction
Elapidae is a family of snakes divided into three subfamilies (Bungarinae, Elapinae and Notechinae), with 44 genera and around 186 described species distributed widely [1]. The front of the mouth of an elapid has permanently erect tusks, which are his distinguishing features. Elapids include terrestrial and sea snakes. Terrestrial elapids, a family of venomous snakes, are distributed across the globe in tropical and subtropical regions, with most species inhabiting the Southern Hemisphere. Elapid sea snakes are mainly distributed in the Indian Ocean and the Southwest Pacific Ocean [2].
The Chinese cobra, or Naja atra (NCBI: txid8656) (Figure 1), is a species of cobra from the family Elapidae. Chinese cobras are usually between 1.2 and 1.5 m long [3], and they are among the most prevalent cobra species in China. The Chinese cobra likes to inhibit plains, hills and low mountains [4]. Humans often encounter Chinese cobras, although these snakes usually escape to avoid confrontation with humans. Chinese cobras can be observed hunting during daylight hours from March to October and up to 2–3 hours after sunset at temperatures of 20–32 °C [5]. They have a widely varied diet and prey on rodents, frogs, toads and other snakes.
The Chinese cobra is highly poisonous, its venom consisting mainly of postsynaptic neurotoxins and cardiotoxins [6]. Their venom offers them protection from predation to a certain extent; however, populations of Chinese cobra have declined by 30% to 50% due to habitat loss and hunting by humans. The venom of Chinese cobras can be used to extract anti-cobra snake venom, which is used to treat cobra snake bites. Although the Chinese cobra is currently listed as a Vulnerable species on the International Union for Conservation of Nature Red List [7], its numbers in the wild have declined from Vulnerable to Endangered due to continued hunting.
Main Content
Context
Snakebite is a serious threat to human life as it kills around 100,000 people annually. Genome-enabled research of toxin genes may facilitate the development of effective antivenoms. Here, we present a highly continuous reference genome assembly of N. atra. While there is a reference genome for the Indian cobra (Naja naja) [8], this is the first for the Chinese cobra. This resource may also provide valuable information for the conservation of this vulnerable species, which can be used for targeted protection and breeding.
Methods
The detailed methods used in this study are available via a protocol collection hosted in protocols.io [9] (Figure 2).
Sample collection and sequencing
The N. atra sample used in this study was captured in Huangshan, Anhui, China, in 2021. After collection, the specimen was quickly frozen to −80 °C using drikold dry ice for storage and transport. Methods for DNA extraction, library construction and sequencing were identical those used by Liu et al. in a previous study [10].
Sample collection, experiments and research design were all authorized by the Institutional Review Board of BGI (BGI-IRB E22001). In this research, all the procedures have been operated abiding to the guidelines from BGI-IRB strictly.
Genome survey, assembly, annotation and assessment
The single-tube long fragment read sequencing data were assembled using Supernova (v2.1.1, RRID:SCR_016756) [11]. NextPolish (v1.0.5) [12] was then used to perform a second round of correction and a third round of polishing of this assembly using the Whole Genome Sequencing data. To get a haploid representation of the genome, duplicates were purged from the genome using the purge_dups pipeline (RRID:SCR_021173) [13]. The completeness of the genome was evaluated using sets of BUSCO (v5.2.2, RRID:SCR_015008) [14] with genome mode and lineage data from vertebrata_odb10 [15].
In order to detect the presence of known repeat elements in the genome of the many-banded P. mucosa, the following approach was employed. To identify the known repetitive elements in the genome of the many-banded krait, we used Tandem repeats Finder [16], LTR_Finder (RRID:SCR_015247) [17] and RepeatModeler (v2.0.1, RRID:SCR_015027) [18]. RepeatMasker (v3.3.0, RRID:SCR_012954) [19] and RepeatProteinMask v3.3.0 [20] were used to search the genome sequences for known repeat elements. The BRAKER2 pipeline (RRID:SCR_018964) [21] was used for gene prediction. Then, the gene sets were aligned against several known databases, including SwissProt, TrEMBL [22], Kyoto encyclopedia of genes and genomes (KEGG) [23], gene ontology (GO), and the Non-Redundant Protein Sequence Database [24] database.
Results
We present a draft genome sequence of N. atra. The size of this genome is 1.67 Gb (Table 1), similar to the previously published 1.79 Gb genome of N. naja [8]. The scaffold N50 length is 234.17 Kb, and the CG content reached 37.8%. The maximal scaffold length is 2,929,773 bp, demonstrating that the reference is highly continuous according to the characteristics of the genome sequence. In addition, the integrity of the genome was assessed at 84.1% using BUSCO (Figure 3).
Table 1.
Contig | Scaffold | |
---|---|---|
Maximal length (bp) | 271,789 | 2,929,773 |
N90 (bp) | 4,371 | 7,368 |
N50 (bp) | 33,081 | 234,173 |
Number ≥ 100 bp | 194,909 | 106,418 |
Number ≥ 2 kb | 113,570 | 54,157 |
GC content (%) | 40.3 | 37.8 |
Genome size (bp) | 1,671,178,062 |
In our N. atra genome, the content of repetitive elements is up to 40.26%, and the total length is 672 Mb (Tables 2, 3). After we counted all repeat elements, we found that long interspersed nuclear elements (LINEs) accounted for 30.63%, long terminal repeats (LTRs) accounted for 14.03% and DNA accounted for 4.27% (Figure 4).
Table 2.
Type | Length (bp) | % in genome |
---|---|---|
DNA | 37,917,702 | 2.269170 |
LINE | 449,338,074 | 26.890460 |
SINE | 2,779,035 | 0.166310 |
LTR | 224,765,038 | 13.450975 |
Other | 0 | 0 |
Satellite | 632,498 | 0.037852 |
Simple_repeat | 5,080,994 | 0.304070 |
Unknown | 7,924,824 | 0.474258 |
Total | 672,795,525 | 40.263183 |
Table 3.
Repbase TEs | TE proteins | De novo | Combined TEs | |||||
---|---|---|---|---|---|---|---|---|
Type | Length (bp) | % in genome | Length (bp) | % in genome | Length (bp) | % in genome | Length (bp) | % in genome |
DNA | 44,907,141 | 2.57 | 3,638,477 | 0.20 | 41,761,899 | 2.39 | 81,259,555 | 4.66 |
LINE | 170,663,721 | 9.79 | 140,023,530 | 8.03 | 581,624,764 | 33.36 | 619,156,475 | 35.51 |
SINE | 25,759,131 | 1.47 | 0 | 0 | 8,061,060 | 0.46 | 32,226,226 | 1.84 |
LTR | 22,468,876 | 1.28 | 30,088,483 | 1.72 | 149,994,747 | 8.60 | 159,624,403 | 9.15 |
Other | 23,680 | 0.001 | 0 | 0 | 0 | 0 | 23,680 | 0.001 |
Unknown | 0 | 0 | 0 | 0 | 5,653,213 | 0.32 | 5,653,213 | 0.32 |
Total | 251,569,212 | 14.43 | 173,669,200 | 9.96 | 722,435,038 | 41.44 | 752,340,302 | 43.15 |
Finally, 29,063 functional genes were annotated. Through KEGG annotation, we found that the genes related to signal transduction are essential in N. atra (Figure 5). Furthermore, through a pathway enrichment analysis, we found that the number of Human Diseases pathways is the highest. Environmental Information Processing and Organismal systems also account for a relatively large proportion. According to the annotation and enrichment in the GO database, 6,292 genes are enriched in cellular process and 6,734 in binding.
Acknowledgements
Yunnan Agricultural University collected the samples.
Funding Statement
Our project was financially supported by the Guangdong Provincial Key Laboratory of Genome Read and Write (grant no. 2017B030301011). This work was also supported by China National GeneBank (CNGB).
Data availability
The data supporting the findings of this study have been deposited into the CNGB Sequence Archive (or CNSA) of China National GeneBank DataBase (or CNGBdb) with the accession number CNP0004141. Raw reads are available in the SRA via bioproject PRJNA955401. Additional data is in the GigaDB repository [25].
Editor’s Note
This paper is part of a series of Data Release papers presenting the genomes of different snake species [26].
Abbreviations
GO, gene ontology; KEGG, Kyoto encyclopedia of genes and genomes; LINE, long interspersed nuclear element; LTR, long terminal repeat; SINE, short interspersed nuclear element; TE, transposable element.
Declarations
Ethics approval and consent to participate
The authors declare that ethical approval was not required for this type of research.
Competing interests
The authors declare no conflict of financial interests.
Authors’ contributions
GH and FH designed and initiated the project. YW performed DNA extraction, library construction and data analysis. JW wrote the manuscript. All authors read and approved the final manuscript.
Funding
Our project was financially supported by the Guangdong Provincial Key Laboratory of Genome Read and Write (grant no. 2017B030301011). This work was also supported by China National GeneBank (CNGB).
References
- 1.Slowinski JB, Keogh JS. . Phylogenetic relationships of elapid snakes based on cytochrome b mtDNA sequences. Mol. Phylogenet. Evol., 2000; 15(1): 157–164. doi: 10.1006/mpev.1999.0725.i. [DOI] [PubMed] [Google Scholar]
- 2.Chan S. . A Field Guide to the Venomous Land Snakes of Hong Kong. Hong Kong: Cosmos Books Ltd., 2006; ISBN 988-211-326-5. [Google Scholar]
- 3.Gopalkrishnakone P, Chou LM. . Snakes of Medical Importance. Singapore: Venom and Toxins Research Group, 1990; ISBN 9971-62-217-3. [Google Scholar]
- 4. Naja atra – General details, taxonomy and biology, venom, clinical effects, treatment, first aid, antivenoms. WCH Clinical Toxinology Resource. University of Adelaide. http://www.toxinology.com/fusebox.cfm?fuseaction=main.snakes.display&id=SN0039. Accessed 18 May 2023.
- 5.Zhao EM, Adler K. . Herpetology of China. United States: Society for the Study of Amphibians and Reptiles, 1993; ISBN 0-916984-28-1. [Google Scholar]
- 6.Wang AH, Yang CC. . Crystallographic studies of snake venom proteins from Taiwan cobra (Naja nana atra). Cardiotoxin-analogue III and phospholipase A2. J. Biol. Chem., 1981; 256(17): 9279–9282. doi: 10.1016/S0021-258(19)52542-X. [DOI] [PubMed] [Google Scholar]
- 7.“Naja atra”. IUCN Red List of Threatened Species. 2014; e.T192109A2040894. 10.2305/IUCN.UK.2014-3.RLTS.T192109A2040894.en.
- 8.Suryamohan K, Krishnankutty SP, Guillory J et al. . The Indian cobra reference genome and transcriptome enables comprehensive identification of venom toxins. Nat. Genet., 2020; 52(1): 106–117. doi: 10.1038/s41588-019-0559-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Liu B, Cui L, Deng Z et al. . Protocols for the assembly and annotation of snake genomes. Protocols.io. 2023; 10.17504/protocols.io.5jyl8j6e9g2w/v2. [DOI]
- 10.Liu B, Cui L, Deng Z et al. . The genome assembly and annotation of the many-banded krait, Bungarus multicinctus . GigaByte, 2023; 2023: gigabyte82. doi: 10.46471/gigabyte.82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Weisenfeld NI, Kumar V, Shah P et al. . Direct determination of diploid genome sequences. Genome Res., 2017; 27(5): 757–767. doi: 10.1101/gr.214874.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Hu J, Fan J, Sun Z et al. . NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics, 2020; 36(7): 2253–2255. doi: 10.1093/bioinformatics/btz891. [DOI] [PubMed] [Google Scholar]
- 13.Guan D, McCarthy SA, Wood J et al. . Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics, 2020; 36(9): 2896–2898. doi: 10.1093/bioinformatics/btaa025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Simão FA, Waterhouse RM, Ioannidis P. . BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015; 31(19): 3210–3212. doi: 10.1093/bioinformatics/btv351. [DOI] [PubMed] [Google Scholar]
- 15.Wick RR, Holt KE. . Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research, 2019; 8: 2138. doi: 10.12688/f1000research.21782.4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Benson G. . Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 1999; 27(2): 573–580. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Zhao X, Hao W. . LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res., 2007; 35(suppl_2): W265–W268. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Smit A, Hubley R, Green P. . RepeatModeler Open-1.0. 2008–2015. Seattle, USA: Institute for Systems Biology, 2018. Available from: https://www.repeatmasker.org, Last Accessed May 2015; 1. [Google Scholar]
- 19.Tarailo-Graovac M, Chen N. . Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics, 2009; 25: 4.10.1–4.10.14. [DOI] [PubMed] [Google Scholar]
- 20.Tempel S. . Using and understanding RepeatMasker. In: Mobile Genetic Elements. Springer, 2012; pp. 29–51, doi: 10.1007/978-1-61779-603-6_2. [DOI] [PubMed] [Google Scholar]
- 21.Bruna T, Hoff KJ, Lomsadze A et al. . BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom. Bioinform., 2021; 3(1): lqaa108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Bairoch A, Apweiler R. . The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res., 2000; 28(1): 45–48. doi: 10.1093/nar/28.1.45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Kanehisa M. . KEGG database. Novartis Found. Symp., 2006; 247: 91–101. discussion 101–3, 119–28, 244–52. doi: 10.1002/0470857897.ch8. [DOI] [PubMed] [Google Scholar]
- 24.Pruitt KD, Tatusova T, Maglott DR. . NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 2007; 35(Database issue): D61–D65. doi: 10.1093/nar/gkl842. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Wang J, Wu Y, Wang S et al. . Supporting data for “The genome assembly and annotation of the Chinese cobra, Naja atra”. GigaScience Database, 2023; 10.5524/102476. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Snake Genomes. GigaByte. 2023; 10.46471/GIGABYTE_SERIES_0004. [DOI]