Data in Brief. 2020 Apr 25;30:105618. doi: 10.1016/j.dib.2020.105618

Chaos game representation dataset of SARS-CoV-2 genome

Raquel de M Barbosa a, Marcelo AC Fernandes b,c,⁎,1
PMCID: PMC7182522  PMID: 32341946

Abstract

As of April 16, 2020, the novel coronavirus disease (COVID-19) had spread to more than 185 countries/regions, with more than 2,000,000 confirmed cases and more than 142,000 deaths. In bioinformatics, a crucial task is the analysis of the virus nucleotide sequences using data stream, digital signal processing, and machine learning techniques and algorithms. To make these approaches feasible, however, the nucleotide sequence strings must first be transformed into a numerical representation. This dataset therefore provides a chaos game representation (CGR) of SARS-CoV-2 virus nucleotide sequences. It contains the CGR of 100 instances of the SARS-CoV-2 virus, 11540 instances of other viruses from the Virus-Host DB dataset, and three instances of Riboviria viruses from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21).

Keywords: SARS-CoV-2, CGR, COVID-19


Specification Table
Subject: Biochemistry, Genetics and Molecular Biology (General)
Specific subject area: Bioinformatics
Type of data: Table; Number
How data were acquired: NCBI - Genbank - SARS-CoV2 (https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/); Virus-Host-DB (https://www.genome.jp/virushostdb/); Matlab Software; Excel Software
Data format: Raw and analyzed data are in Matlab file (.mat), Microsoft Excel file (.xlsx), and text file (.txt).
Parameters for data collection: The entire dataset was generated using MATLAB 2019b on a Windows operating system with an Intel Core i5-6500T 2.5 GHz quad-core processor and 16 GB of RAM.
Description of data collection: The raw data were downloaded from NCBI - Genbank and Virus-Host-DB. The CGR values were generated using Matlab.
Data source location: Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte.
Data accessibility: https://data.mendeley.com/datasets/nvk5bf3m2f/1

Value of the data

  • These data are useful because they provide a numeric representation of the genome of the virus behind the COVID-19 epidemic (SARS-CoV-2). In this form, the data can be processed with data stream, digital signal processing, and machine learning algorithms.

  • Researchers in bioinformatics, computer science, and computer engineering can benefit from these data because the numeric representation lets them apply techniques such as machine learning and digital signal processing to genomic information.

  • Experiments that apply clustering and classification techniques to SARS-CoV-2 genomic information can use this dataset.

  • These data offer an easy way to evaluate the SARS-CoV-2 virus genome.

1. Data Description

This work presents a new dataset of a chaos game representation (CGR) of SARS-CoV-2 virus nucleotide sequences. The dataset contains two kinds of data: the raw data and the processed data. The raw data comprise 100 instances of the SARS-CoV-2 virus genome collected from the National Center for Biotechnology Information (NCBI) [1], 11540 instances of other viruses from the Virus-Host DB [2], [3], and three other instances of Riboviria, also collected from the NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21), which have high similarity with SARS-CoV-2 [4], [5].

The dataset provides all data in two groups of file formats. In the first group, all data are stored in Matlab file format (.mat); in the second group, part of the data is stored in Microsoft Excel files (.xlsx) and the rest in text files (.txt). The two groups contain the same information. The data are organized into three main directories: “SARS-CoV-2 data”, “Virus-Host DB data” and “Other viruses data.” Each main directory contains two sub-directories: “Matlab” and “Excel and txt.”

Each “Matlab” sub-directory contains three files called “RawDataTable.mat”, “RawData.mat” and “CGRData.mat”. The “RawDataTable.mat” and “RawData.mat” files store the raw data information from the virus databases. They hold the same information, but in “RawDataTable.mat” the attributes are stored in Matlab table format (available from version 2013b onward), while in “RawData.mat” they are stored as Matlab cell arrays. Each “CGRData.mat” file stores the CGR values of all viruses listed in the corresponding “RawDataTable.mat” and “RawData.mat” files. For the main directory “Virus-Host DB data”, the CGR values are stored in 10 files, where the k-th file is called “RawData_k.mat.”

Each “Excel and txt” sub-directory contains a file called “RawData.xlsx” and a further sub-directory called “CGRData”. Each “RawData.xlsx” file holds the raw data information from the virus databases, and each “CGRData” sub-directory holds the CGR of the viruses listed in the corresponding “RawData.xlsx” file. The CGR points associated with each virus are stored in a text file called “LocusName_COD.txt”, where COD is the code (locus name) associated with the virus in Genbank [6].
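The directory layout described above can be sketched in a few lines of Python. This is only an illustrative helper, not part of the dataset: the function names `matlab_files` and `cgr_text_file` and the locus name `AB123456` are hypothetical, and the paths are assumed to be relative to wherever the archive is unpacked. The special case of the 10 numbered files in “Virus-Host DB data” is deliberately left out.

```python
from pathlib import PurePosixPath

# Main directories and Matlab file names as listed in the data description.
MAIN_DIRS = ("SARS-CoV-2 data", "Virus-Host DB data", "Other viruses data")
MATLAB_FILES = ("RawDataTable.mat", "RawData.mat", "CGRData.mat")


def matlab_files(main_dir: str) -> list:
    """Relative paths of the three .mat files described for a main directory."""
    if main_dir not in MAIN_DIRS:
        raise ValueError(f"unknown main directory: {main_dir!r}")
    return [PurePosixPath(main_dir) / "Matlab" / name for name in MATLAB_FILES]


def cgr_text_file(main_dir: str, locus_name: str) -> PurePosixPath:
    """Relative path of the CGR text file for one virus (hypothetical helper)."""
    if main_dir not in MAIN_DIRS:
        raise ValueError(f"unknown main directory: {main_dir!r}")
    return (PurePosixPath(main_dir) / "Excel and txt" / "CGRData"
            / f"LocusName_{locus_name}.txt")


if __name__ == "__main__":
    # "AB123456" is a placeholder locus name, not a record known to be in the dataset.
    print(cgr_text_file("SARS-CoV-2 data", "AB123456"))
    for path in matlab_files("Other viruses data"):
        print(path)
```

The internal format of each text file and the variable names inside the .mat files are not specified here, so the sketch stops at path construction.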

2. Experimental Design, Materials, and Methods

The Chaos Game Representation (CGR), proposed by H. Joel Jeffrey [7], transforms a nucleotide sequence (DNA or RNA) into a sequence of two-dimensional real values. The CGR preserves the statistical properties of the nucleotide sequence and allows the investigation of local and global patterns in sequences [8], [9].

The CGR takes as input the nucleotide sequence, $\mathbf{s}$, expressed as

$$\mathbf{s} = [s_1, \ldots, s_n, \ldots, s_N] \tag{1}$$

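where $N$ is the length of the sequence and $s_n$ is the n-th nucleotide of the sequence. Each n-th nucleotide, $s_n$, is mapped to a two-dimensional symbol $(s_x(n), s_y(n))$, which can be expressed as

$$s_x(n) = \begin{cases} 1 & \text{for } s_n = \text{A} \\ -1 & \text{for } s_n = \text{T or U} \\ -1 & \text{for } s_n = \text{C} \\ 1 & \text{for } s_n = \text{G} \end{cases} \tag{2}$$

and

$$s_y(n) = \begin{cases} 1 & \text{for } s_n = \text{A} \\ 1 & \text{for } s_n = \text{T or U} \\ -1 & \text{for } s_n = \text{C} \\ -1 & \text{for } s_n = \text{G}. \end{cases} \tag{3}$$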

After the mapping, each n-th symbol $(s_x(n), s_y(n))$ is transformed into CGR values by the equations

$$p_x(n) = \frac{1}{2} s_x(n) + \frac{1}{2} p_x(n-1), \quad \text{for } n = 1, \ldots, N \tag{4}$$

and

$$p_y(n) = \frac{1}{2} s_y(n) + \frac{1}{2} p_y(n-1), \quad \text{for } n = 1, \ldots, N \tag{5}$$

where the initial condition, at $n = 0$, is $p_x(0) = \alpha_x$ and $p_y(0) = \alpha_y$ [7], [8]. The dataset was generated with $\alpha_x = 0$ and $\alpha_y = 0$. Figures 1(a), 1(b), 1(c) and 1(d) show an example of CGR points $(p_x(n), p_y(n))$ from the dataset presented in this work.
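The mapping and the recursion in Eqs. (2)-(5) can be sketched in Python as follows. This is a minimal illustration, not the Matlab code used to generate the dataset. The sign of each corner coordinate in `CORNERS` is an assumption that follows the convention of [8], [9] (A at (1, 1), T/U at (-1, 1), C at (-1, -1), G at (1, -1)), and rejecting any other symbol with an error is a choice made for this sketch; how the dataset treats ambiguous bases is not stated here.

```python
from typing import List, Tuple

# Assumed corner assignment (convention of refs [8], [9]); every coordinate is +1 or -1.
CORNERS = {
    "A": (1.0, 1.0),
    "T": (-1.0, 1.0),
    "U": (-1.0, 1.0),
    "C": (-1.0, -1.0),
    "G": (1.0, -1.0),
}


def cgr_points(sequence: str,
               alpha_x: float = 0.0,
               alpha_y: float = 0.0) -> List[Tuple[float, float]]:
    """Return the CGR points (p_x(n), p_y(n)) for n = 1..N.

    Each step moves the current point halfway toward the corner of the
    current nucleotide, as in Eqs. (4) and (5).
    """
    px, py = alpha_x, alpha_y  # initial condition p(0) = (alpha_x, alpha_y)
    points = []
    for position, nucleotide in enumerate(sequence.upper(), start=1):
        if nucleotide not in CORNERS:
            # Sketch-only behavior: refuse symbols outside A, C, G, T, U.
            raise ValueError(f"unsupported symbol {nucleotide!r} at position {position}")
        sx, sy = CORNERS[nucleotide]
        px = 0.5 * sx + 0.5 * px
        py = 0.5 * sy + 0.5 * py
        points.append((px, py))
    return points


if __name__ == "__main__":
    for point in cgr_points("ATG"):
        print(point)
```

Under the assumed corner assignment and with both initial values set to 0, the sequence "ATG" yields the points (0.5, 0.5), (-0.25, 0.75) and (0.375, -0.125). Because each step halves the distance to a corner, every point stays strictly inside the square with corners at (±1, ±1).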

Fig. 1. Example of the CGR values for the SARS-CoV-2 virus stored in this dataset.

Acknowledgments

The authors wish to acknowledge the financial support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Contributor Information

Raquel de M. Barbosa, Email: raquelmb@mit.edu.

Marcelo A.C. Fernandes, Email: mfernandes@dca.ufrn.br.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://data.mendeley.com/datasets/nvk5bf3m2f/1

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.dib.2020.105618

Appendix B. Supplementary materials

Supplementary Data S1

Supplementary Raw Research Data. This is open data under the CC BY license http://creativecommons.org/licenses/by/4.0/

mmc1.xml (1.1KB, xml)

References

  • 1. NCBI, SARS-CoV-2 (Severe acute respiratory syndrome coronavirus 2) Sequences, 2020, (https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/).
  • 2. Mihara T., Nishimura Y., Shimizu Y., Nishiyama H., Yoshikawa G., Uehara H., Hingamp P., Goto S., Ogata H. Linking virus genomes with host taxonomy. Viruses. 2016;8(3) doi: 10.3390/v8030066.
  • 3. Virus-Host DB, Virus-Host DB - Website, 2020, (https://www.genome.jp/virushostdb).
  • 4. Randhawa G.S., Soltysiak M.P., Roz H.E., de Souza C.P., Hill K.A., Kari L. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: Covid-19 case study. bioRxiv. 2020 doi: 10.1101/2020.02.03.932350.
  • 5. Randhawa G.S., Soltysiak M.P., Roz H.E., de Souza C.P., Hill K.A., Kari L. Machine learning-based analysis of genomes suggests associations between wuhan 2019-ncov and bat betacoronaviruses. bioRxiv. 2020 doi: 10.1101/2020.02.03.932350.
  • 6. NCBI, Genbank, 2020, (https://www.ncbi.nlm.nih.gov/genbank/).
  • 7. Jeffrey H. Chaos game representation of gene structure. Nucleic Acids Research. 1990;18(8):2163–2170. doi: 10.1093/nar/18.8.2163.
  • 8. Yin C. Encoding DNA sequences by integer chaos game representation. arXiv preprint arXiv:1712.04546. 2017.
  • 9. Hoang T., Yin C., Yau S.S.-T. Numerical encoding of dna sequences by chaos game representation with application in similarity comparison. Genomics. 2016;108(3):134–142. doi: 10.1016/j.ygeno.2016.08.002.

Articles from Data in Brief are provided here courtesy of Elsevier