Abstract
As of May 25, 2020, the novel coronavirus disease (COVID-19) had spread to more than 185 countries/regions, with more than 348,000 deaths and more than 5,550,000 confirmed cases. In bioinformatics, one crucial task is the analysis of virus nucleotide sequences using approaches such as data stream techniques and algorithms. However, to make this approach feasible, the nucleotide sequence strings must first be transformed into a numerical stream representation. This dataset therefore provides four kinds of data stream representation (DSR) of SARS-CoV-2 virus nucleotide sequences. It contains the DSR of 1557 instances of the SARS-CoV-2 virus, 11540 instances of other viruses from the Virus-Host DB dataset, and three instances of Riboviria viruses from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21).
Keywords: SARS-CoV-2, Data stream, COVID-19
Specifications Table | |
---|---|
Subject | Biochemistry, Genetics and Molecular Biology (General) |
Specific subject area | Bioinformatics |
Type of data | Table; Number |
How data were acquired | NCBI - Genbank - SARS-CoV2 (https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs); Virus-Host-DB (https://www.genome.jp/virushostdb); Matlab Software; Excel Software |
Data format | Raw and analyzed data are in Matlab file (.mat) and Microsoft Excel file (.xlsx). |
Parameters for data collection | The entire dataset was generated using MATLAB 2019b on a Windows operating system with an Intel Core i5-6500T 2.5 GHz quad-core processor and 16 GB of RAM. |
Description of data collection | The raw data were downloaded from NCBI - Genbank, and Virus-Host-DB. The data stream values were generated using Matlab. |
Data source location | Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte. |
Data accessibility | https://data.mendeley.com/datasets/g5ktw4y4pz/2 |
Value of the Data
• These data are useful because they provide a numeric representation of the virus behind the COVID-19 epidemic (SARS-CoV-2), which makes it possible to apply data stream algorithms.
• Researchers in bioinformatics, computing science, and computing engineering can benefit from these data because, with this numeric representation, they can apply stream algorithms and techniques such as TEDA (Typicality and Eccentricity Data Analytics), TEDA-Cloud, TEDA-Cluster and TEDA-Class to genomic information.
• Experiments that apply stream analytics techniques to SARS-CoV-2 genomic information can use this dataset.
• These data offer a simple way to evaluate the SARS-CoV-2 virus genome with stream algorithms.
• Unlike conventional bioinformatics techniques, which are based on dynamic programming (such as BLAST and others), this approach allows techniques common in other areas to be used to find similarities between genome sequences.
1. Data Description
This work presents a dataset of data stream representations (DSR) of SARS-CoV-2 virus nucleotide sequences. The dataset contains two kinds of data: the raw data and the processed data. The raw data consist of 1557 instances of the SARS-CoV-2 virus genome collected from the National Center for Biotechnology Information (NCBI) [1], 11540 instances of other viruses from the Virus-Host DB [2], [3], and three other specific viruses also collected from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21). These three viruses have high similarity with SARS-CoV-2 [4], [5]. The processed data consist of four kinds of DSR: Direct Mapping (DM), DM with Chaos Game Representation (DM-CGR), k-mers mapping (kMersM) and k-mers mapping with CGR (kMersM-CGR). k-mers is a frequency count metric used in bioinformatics. Other k-mers datasets are presented in [6], [7], [8].
In the Chaos Game Representation (CGR) [8], the genome sequence is transformed into a bi-dimensional signal (1D vector), and this signal is then passed through an infinite impulse response (IIR) filter [9]. The result of CGR is a signal that expresses the density of the bases and, at the same time, the transitions between bases, because the IIR filter is a system with memory. CGR can be used as a signature of the genome sequence. With the k-mers representation [10], the genome can be transformed into a 1D or 2D vector that represents the number of occurrences of each base (the frequency of the bases). k-mers can also be used as a signature of the genome sequence. In this manuscript, however, the genome sequence is transformed into a linear data stream, and this type of transformation can be used with stream algorithms. Another important aspect of this dataset is that CGR is applied not to the whole sequence but to each group of k bases (with k-mers or not). This strategy maintains the statistical characteristics and reduces the size of the stream.
The data is organized into three main directories: “SARS-CoV-2 data”, “Virus-Host DB data” and “Other viruses data”. Each main directory contains three files, “RawDataTable.mat”, “RawData.mat” and “RawData.xlsx”, and four sub-directories, “DirectMapping”, “DirectMappingCGR”, “kmersMapping” and “kmersMappingCGR”. The three files store the raw data from the virus databases and carry the same information in different formats: “RawDataTable.mat” stores the attributes as a Matlab table (available from version 2013b onward), “RawData.mat” stores them as Matlab cell arrays, and “RawData.xlsx” stores them in a Microsoft Excel file. The sub-directories “DirectMapping”, “DirectMappingCGR”, “kmersMapping” and “kmersMappingCGR” store the DM, DM-CGR, kMersM and kMersM-CGR data stream representations, respectively. Inside each sub-directory the files are named as follows:
• For DM, the DSR was generated for and the files are called “PointsData_1_k=k.mat”;
• For DM-CGR, the DSR was generated for and the files are called “PointsDataCGR_1_k=k.mat”;
• For kMersM, the DSR was generated for and the files are called “PointsDatakmers_1_k=k.mat”;
• For kMersM-CGR:
  • In the directories “Other viruses data” and “SARS-CoV-2 data”, the DSR was generated for and the files are called “PointsDatakmersCGR_1_k=k.mat”;
  • In the “Virus-Host DB data”, the DSR was generated for and the files are called “PointsDatakmersCGR_1_k=k.mat”;
• For the main directory “Virus-Host DB data”, the values are stored in 10 files, where the i-th file is called “PointsData_i_k=k.mat” for DM, “PointsDataCGR_i_k=k.mat” for DM-CGR, “PointsDatakmers_i_k=k.mat” for kMersM and “PointsDatakmersCGR_i_k=k.mat” for kMersM-CGR.
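The naming convention above can be captured in a small helper. The following Python sketch is illustrative only: the names `PREFIXES`, `dsr_filename` and `dsr_path` are hypothetical, and it assumes that the trailing k in “k=k” stands for the numeric value of k and that the file index i is 1 except in “Virus-Host DB data”, where it runs from 1 to 10.

```python
# Hypothetical helper (not part of the dataset): builds the expected DSR file
# name from the convention described in the text. The prefixes come from the
# text; the function and argument names are assumptions.
PREFIXES = {
    "DirectMapping": "PointsData",
    "DirectMappingCGR": "PointsDataCGR",
    "kmersMapping": "PointsDatakmers",
    "kmersMappingCGR": "PointsDatakmersCGR",
}


def dsr_filename(subdir: str, k: int, i: int = 1) -> str:
    """Return the file name for sub-directory `subdir`, group size `k`, file index `i`."""
    if subdir not in PREFIXES:
        raise ValueError(f"unknown sub-directory: {subdir!r}")
    if k < 1 or i < 1:
        raise ValueError("k and i must be positive integers")
    # Assumption: the second 'k' in 'k=k' is replaced by the numeric value of k.
    return f"{PREFIXES[subdir]}_{i}_k={k}.mat"


def dsr_path(main_dir: str, subdir: str, k: int, i: int = 1) -> str:
    """Join main directory, sub-directory and file name with forward slashes."""
    return f"{main_dir}/{subdir}/{dsr_filename(subdir, k, i)}"


if __name__ == "__main__":
    print(dsr_path("SARS-CoV-2 data", "DirectMapping", k=3))
    print(dsr_path("Virus-Host DB data", "kmersMappingCGR", k=2, i=7))
```

The variable names stored inside each .mat file are not described here, so the sketch stops at building the path.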
2. Experimental design, materials, and methods
The streams are based on the nucleotide sequence, s, expressed as
$$s = [s_1, s_2, \ldots, s_N] \qquad (1)$$
where N is the length of the sequence and s_n is the n-th nucleotide of the sequence.
For DM and DM-CGR, the nucleotide sequence, s, is grouped into sub-sequences of k bases. The group of sub-sequences can be expressed as
(2) |
where
(3) |
and the i-th vector b_i is the i-th group of k nucleotides, that is
(4) |
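As a rough illustration of this grouping step, the Python sketch below splits a sequence into consecutive groups of k bases. It assumes that the groups do not overlap and that a trailing incomplete group is discarded; neither detail is confirmed by the text, and the function name `group_bases` is hypothetical.

```python
from typing import List


def group_bases(s: str, k: int) -> List[str]:
    """Split sequence s into consecutive, non-overlapping groups of k bases."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Assumption: only complete groups are kept, so any remainder is dropped.
    m = len(s) // k
    return [s[i * k:(i + 1) * k] for i in range(m)]


if __name__ == "__main__":
    print(group_bases("ATGCGTAC", 3))  # ['ATG', 'CGT']
```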
For DM, the group of sub-sequences, stored in matrix B, is transformed into a sequence of integer values expressed as
(5) |
where c is the DM stream stored in the dataset. The calculation of the DM stream, c, can be expressed as
(6) |
where fmap( · ) is the mapping function expressed by
(7) |
and
(8) |
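To make the idea of mapping each group of k bases to one integer concrete, here is a minimal Python sketch. The actual mapping function of Eqs. (7) and (8) is not reproduced here, so the sketch substitutes a stand-in: it assumes the base values A=0, C=1, G=2, T=3 and a base-4 positional encoding of each group. The names `BASE_VALUE`, `f_map` and `direct_mapping` are hypothetical.

```python
from typing import List

# Assumed base values; the mapping used to build the dataset may differ.
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def f_map(group: str) -> int:
    """Encode one group of bases as an integer (assumed base-4 positional code)."""
    value = 0
    for base in group:
        # Raises KeyError for symbols outside A, C, G, T (e.g. ambiguous 'N').
        value = value * 4 + BASE_VALUE[base]
    return value


def direct_mapping(s: str, k: int) -> List[int]:
    """Map consecutive non-overlapping groups of k bases to an integer stream."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    m = len(s) // k  # assumption: a trailing incomplete group is dropped
    return [f_map(s[i * k:(i + 1) * k]) for i in range(m)]


if __name__ == "__main__":
    print(direct_mapping("ATGCGTAC", 2))  # [3, 9, 11, 1]
```

With this stand-in encoding, each group of k bases maps to a distinct integer between 0 and 4^k - 1, so the stream has one value per group rather than one per nucleotide.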
For DM-CGR, the stream is characterized by vector a expressed as
(9) |
where a_i is the i-th value of the CGR. In CGR (see [11], [12]), each element a_i is a bi-dimensional value expressed as
(10) |
where and are the x-axis and y-axis values in bi-dimensional space, respectively. The values of the CGR are calculated using the functions and in matrix B, that is
(11) |
The function calculates the x-axis value of the CGR and can be expressed as
(12) |
where
(13) |
and
(14) |
For the y-axis, the function can be expressed as
(15) |
where
(16) |
and
(17) |
For the initial condition, and [11], [12]. The dataset was generated with and .
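The per-group CGR described above can be sketched in Python as a first-order recursive (IIR) update applied to each group of k bases. This is a sketch under stated assumptions, not the procedure used to build the dataset: it assumes one common corner assignment (A at (0,0), C at (0,1), G at (1,1), T at (1,0)), a contraction factor of 0.5, an initial point of (0.5, 0.5), and that every group restarts from that initial point. The names `CORNERS`, `cgr_point` and `dm_cgr_stream` are hypothetical.

```python
from typing import List, Tuple

# Assumed corner assignment on the unit square; the dataset may use another.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_point(group: str, x0: float = 0.5, y0: float = 0.5) -> Tuple[float, float]:
    """Run the CGR recursion over one group and return its final (x, y) point."""
    x, y = x0, y0
    for base in group:
        cx, cy = CORNERS[base]
        # First-order IIR update: move halfway from the current point
        # toward the corner of the current base.
        x = 0.5 * (x + cx)
        y = 0.5 * (y + cy)
    return x, y


def dm_cgr_stream(s: str, k: int) -> List[Tuple[float, float]]:
    """One CGR point per consecutive non-overlapping group of k bases."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    m = len(s) // k  # assumption: a trailing incomplete group is dropped
    return [cgr_point(s[i * k:(i + 1) * k]) for i in range(m)]


if __name__ == "__main__":
    print(dm_cgr_stream("ATGC", 2))  # [(0.625, 0.125), (0.375, 0.875)]
```

Because each output point depends on all k bases of its group through the recursion, the stream keeps information about base transitions while containing only one point per group.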
For kMersM and kMersM-CGR, the nucleotide sequence, s, is grouped into k-mers sub-sequences [13], [14] in the matrix H, which can be expressed as
(18) |
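For readers unfamiliar with k-mers, the Python sketch below extracts them with a sliding window. It assumes the usual definition, in which consecutive k-mers overlap and the window advances one base at a time; the exact layout of matrix H is not reproduced here, and the function name `kmers` is hypothetical.

```python
from typing import List


def kmers(s: str, k: int) -> List[str]:
    """Return all overlapping k-mers of s, sliding the window one base at a time."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # A sequence of length N yields N - k + 1 k-mers (none if N < k).
    return [s[i:i + k] for i in range(len(s) - k + 1)]


if __name__ == "__main__":
    print(kmers("ATGCG", 3))  # ['ATG', 'TGC', 'GCG']
```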
The kMersM stream is characterized as a sequence of integer values expressed as
(19) |
where
(20) |
The function fmap( · ) is the mapping process characterized by Eqs. (7) and (8). The kMersM-CGR is stored in the vector z expressed as
(21) |
where z_i is the i-th value of the CGR. Each i-th element z_i is a bi-dimensional value expressed as
(22) |
where and are the x-axis and y-axis values in bi-dimensional space, respectively. The values of the CGR are calculated using the functions (see Eqs. (12)–(14)) and (see Eqs. (15)–(17)) in matrix H, that is
(23) |
Fig. 1, Fig. 2, Fig. 3 and Fig. 4 show DSR examples for SARS-CoV-2 from Brazil.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors wish to acknowledge the financial support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
Footnotes
Supplementary material associated with this article can be found, in the online version, at 10.1016/j.dib.2020.105829
Contributor Information
Raquel de M. Barbosa, Email: raquelmb@mit.edu.
Marcelo A.C. Fernandes, Email: mfernandes@dca.ufrn.br.
References
- 1. NCBI, SARS-CoV-2 (Severe acute respiratory syndrome coronavirus 2) Sequences, 2020, https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/.
- 2. Mihara T., Nishimura Y., Shimizu Y., Nishiyama H., Yoshikawa G., Uehara H., Hingamp P., Goto S., Ogata H. Linking virus genomes with host taxonomy. Viruses. 2016;8(3). doi: 10.3390/v8030066. URL https://www.mdpi.com/1999-4915/8/3/66
- 3. Virus-Host DB, Virus-Host DB - Website, 2020, https://www.genome.jp/virushostdb.
- 4. Randhawa G.S., Soltysiak M.P., Roz H.E., de Souza C.P., Hill K.A., Kari L. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: Covid-19 case study. bioRxiv. 2020. doi: 10.1101/2020.02.03.932350.
- 5. Randhawa G.S., Soltysiak M.P., Roz H.E., de Souza C.P., Hill K.A., Kari L. Machine learning-based analysis of genomes suggests associations between wuhan 2019-ncov and bat betacoronaviruses. bioRxiv. 2020. doi: 10.1101/2020.02.03.932350.
- 6. de M. Barbosa R., Fernandes M.A. Chaos game representation dataset of sars-cov-2 genome. Mendeley Data. 2020;v2. doi: 10.17632/nvk5bf3m2f.2.
- 7. de M. Barbosa R., Fernandes M.A. k-mers 1d and 2d representation dataset of sars-cov-2 nucleotide sequences. Mendeley Data. 2020;v2. doi: 10.17632/f5y9cggnxy.2.
- 8. de M. Barbosa R., Fernandes M.A. Chaos game representation dataset of sars-cov-2 genome. Data Brief. 2020;30:105618. doi: 10.1016/j.dib.2020.105618.
- 9. Proakis J.G., Manolakis D.K. Digital Signal Processing. 4th ed. Prentice-Hall, Inc.; USA: 2006.
- 10. Pinello L., Lo Bosco G., Yuan G.-C. Applications of alignment-free methods in epigenomics. Briefings in Bioinf. 2014;15(3):419–430. doi: 10.1093/bib/bbt078.
- 11. Jeffrey H. Chaos game representation of gene structure. Nucleic Acids Research. 1990;18(8):2163–2170. doi: 10.1093/nar/18.8.2163.
- 12. Yin C. Encoding dna sequences by integer chaos game representation, 2017, arXiv: 1712.04546.
- 13. Mapleson D., Garcia Accinelli G., Kettleborough G., Wright J., Clavijo B.J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics. 2016;33(4):574–576. doi: 10.1093/bioinformatics/btw663. URL https://academic.oup.com/bioinformatics/article-pdf/33/4/574/25146635/btw663.pdf
- 14. Chor B., Horn D., Goldman N., Levy Y., Massingham T. Genomic dna k-mer spectra: models and modalities. Genome Biol. 2009;10(10):R108. doi: 10.1186/gb-2009-10-10-r108.