Data in Brief. 2020 Jun 10;31:105829. doi: 10.1016/j.dib.2020.105829

Data stream dataset of SARS-CoV-2 genome

Raquel de M Barbosa a,b,d,⁎⁎, Marcelo AC Fernandes b,c,e,⁎,⁎⁎⁎
PMCID: PMC7306612  PMID: 32596428

Abstract

As of May 25, 2020, the novel coronavirus disease (COVID-19) had spread to more than 185 countries/regions, with more than 348,000 deaths and more than 5,550,000 confirmed cases. In bioinformatics, one crucial task is the analysis of the virus nucleotide sequences, which can be approached with data stream techniques and algorithms. To make this approach feasible, however, the nucleotide sequence strings must first be transformed into a numerical stream representation. This dataset therefore provides four kinds of data stream representation (DSR) of SARS-CoV-2 nucleotide sequences. It contains the DSR of 1557 instances of SARS-CoV-2, 11540 instances of other viruses from the Virus-Host DB dataset, and three instances of Riboviria viruses from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21).

Keywords: SARS-CoV-2, Data stream, COVID-19


Specifications Table
Subject: Biochemistry, Genetics and Molecular Biology (General)
Specific subject area: Bioinformatics
Type of data: Table; Number
How data were acquired: NCBI GenBank SARS-CoV-2 (https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs); Virus-Host DB (https://www.genome.jp/virushostdb); Matlab software; Excel software
Data format: Raw and analyzed data are in Matlab files (.mat) and Microsoft Excel files (.xlsx).
Parameters for data collection: The entire dataset was generated using MATLAB 2019b on a Windows operating system with an Intel Core i5-6500T 2.5 GHz quad-core processor and 16 GB of RAM.
Description of data collection: The raw data were downloaded from NCBI GenBank and Virus-Host DB. The data stream values were generated using Matlab.
Data source location: Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte.
Data accessibility: https://data.mendeley.com/datasets/g5ktw4y4pz/2

Value of the Data

  • These data are useful because they provide a numeric representation of the genome of the virus that causes COVID-19 (SARS-CoV-2), which makes it possible to apply data stream algorithms.

  • Researchers in bioinformatics, computer science, and computer engineering can benefit from these data because the numeric representation lets them apply stream algorithms and techniques such as TEDA (Typicality and Eccentricity Data Analytics), TEDA-Cloud, TEDA-Cluster, and TEDA-Class to genomic information.

  • The dataset can be used in experiments that apply stream analytics techniques to SARS-CoV-2 genomic information.

  • These data offer a simple way to evaluate the SARS-CoV-2 genome with stream algorithms.

  • Unlike conventional bioinformatics techniques based on dynamic programming (such as BLAST and others), this approach allows techniques that are common in other areas to be used to find similarities between genome sequences.

1. Data Description

This work presents a dataset of data stream representations (DSR) of SARS-CoV-2 nucleotide sequences. The dataset contains two kinds of data: the raw data and the processed data. The raw data comprise 1557 instances of the SARS-CoV-2 genome collected from the National Center for Biotechnology Information (NCBI) [1], 11540 instances of other viruses from the Virus-Host DB [2], [3], and three specific viruses also collected from NCBI (Betacoronavirus RaTG13, bat-SL-CoVZC45, and bat-SL-CoVZXC21). These three viruses have high similarity with SARS-CoV-2 [4], [5]. The processed data comprise four kinds of DSR, called Direct Mapping (DM), DM with Chaos Game Representation (DM-CGR), k-mers mapping (kMersM), and k-mers mapping with CGR (kMersM-CGR). k-mer counting is a frequency metric widely used in bioinformatics. Other k-mers datasets are presented in [6], [7], [8].

In the Chaos Game Representation (CGR) [8], the genome sequence is transformed into a two-dimensional signal (a sequence of (x, y) points), and this signal is then passed through an infinite impulse response (IIR) filter [9]. The result of the CGR is a signal that expresses the density of the bases and, at the same time, the transitions between bases, because the IIR filter is a system with memory. The CGR can be used as a signature of the genome sequence. With the k-mers representation [10], the genome can be transformed into a 1D or 2D vector that represents the number of occurrences of each base (the frequency of the bases). k-mers can also be used as a signature of the genome sequence. In this manuscript, however, the genome sequence is transformed into a linear data stream, and this type of transformation can be used with stream algorithms. Another important aspect of this dataset is that the CGR is applied not to the entire sequence but to each group of k bases (with k-mers or not). This strategy maintains the statistical characteristics and reduces the size of the stream.

The data are organized into three main directories: “SARS-CoV-2 data”, “Virus-Host DB data” and “Other viruses data”. Each main directory contains three files, “RawDataTable.mat”, “RawData.mat” and “RawData.xlsx”, and four sub-directories, “DirectMapping”, “DirectMappingCGR”, “kmersMapping” and “kmersMappingCGR”. The three files store the same raw information from the virus databases in different formats: “RawDataTable.mat” stores the attributes in Matlab table format (available from version 2013b onward), “RawData.mat” stores them as Matlab cell arrays, and “RawData.xlsx” stores them in a Microsoft Excel file. The sub-directories “DirectMapping”, “DirectMappingCGR”, “kmersMapping” and “kmersMappingCGR” store the DM, DM-CGR, kMersM and kMersM-CGR data stream representations, respectively. Inside each sub-directory the files are named as follows:

  • For DM, the DSR was generated for k = 1, …, 5 and the files are called “PointsData_1_k=k.mat”;

  • For DM-CGR, the DSR was generated for k = 1, …, 7 and the files are called “PointsDataCGR_1_k=k.mat”;

  • For kMersM, the DSR was generated for k = 2, …, 5 and the files are called “PointsDatakmers_1_k=k.mat”;

  • For kMersM-CGR:
    • In the directories “Other viruses data” and “SARS-CoV-2 data”, the DSR was generated for k = 2, …, 7 and the files are called “PointsDatakmersCGR_1_k=k.mat”;
    • In the “Virus-Host DB data” directory, the DSR was generated for k = 2, 3, 5, and 7 and the files are called “PointsDatakmersCGR_1_k=k.mat”.

For the main directory “Virus-Host DB data”, the values are stored in 10 files, where the i-th file is called “PointsData_i_k=k.mat” for DM (sub-directory “DirectMapping”), “PointsDataCGR_i_k=k.mat” for DM-CGR, “PointsDatakmers_i_k=k.mat” for kMersM and “PointsDatakmersCGR_i_k=k.mat” for kMersM-CGR.
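The naming scheme above can be sketched as a small path builder. This is only an illustration: the function name `dsr_path` and the `LAYOUT` table are hypothetical helpers, the directory and file-name patterns are taken from the description above, and nothing is assumed about the variables stored inside the .mat files.

```python
from pathlib import PurePosixPath

# Sub-directory and file-name prefix of each representation,
# copied from the directory description in the text.
LAYOUT = {
    "DM": ("DirectMapping", "PointsData"),
    "DM-CGR": ("DirectMappingCGR", "PointsDataCGR"),
    "kMersM": ("kmersMapping", "PointsDatakmers"),
    "kMersM-CGR": ("kmersMappingCGR", "PointsDatakmersCGR"),
}


def dsr_path(main_dir, representation, k, i=1):
    """Relative path of the i-th DSR file for a given representation and k.

    Hypothetical helper: it only assembles the name and does not check
    that a file was actually generated for this value of k.
    """
    sub_dir, prefix = LAYOUT[representation]
    return PurePosixPath(main_dir) / sub_dir / f"{prefix}_{i}_k={k}.mat"


# Tenth kMersM-CGR file of the Virus-Host DB data for k = 5.
print(dsr_path("Virus-Host DB data", "kMersM-CGR", k=5, i=10))
```

For the “SARS-CoV-2 data” and “Other viruses data” directories the index i stays at its default of 1, since the text lists a single file per value of k there.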

2. Experimental design, materials, and methods

The streams are based on a nucleotide sequence, $\mathbf{s}$, expressed as

$$\mathbf{s} = [s_1, \ldots, s_n, \ldots, s_N] \tag{1}$$

where $N$ is the length of the sequence and $s_n$ is the $n$-th nucleotide of the sequence.

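For DM and DM-CGR, the nucleotide sequence, $\mathbf{s}$, is grouped into sub-sequences of $k$ bases. The group of sub-sequences can be expressed as

$$\mathbf{B} = \begin{bmatrix} \mathbf{b}_1 \\ \vdots \\ \mathbf{b}_i \\ \vdots \\ \mathbf{b}_K \end{bmatrix} = \begin{bmatrix} s_1 & \cdots & s_k \\ \vdots & & \vdots \\ s_{k(i-1)+1} & \cdots & s_{k(i-1)+k} \\ \vdots & & \vdots \\ s_{K-k+1} & \cdots & s_K \end{bmatrix} \tag{2}$$

where

$$K = k \times \left\lfloor \frac{N}{k} \right\rfloor \tag{3}$$

and the $i$-th vector $\mathbf{b}_i$ is the $i$-th group of $k$ nucleotides, that is

$$\mathbf{b}_i = [b_{i,1}, \ldots, b_{i,j}, \ldots, b_{i,k}] = [s_{k(i-1)+1}, \ldots, s_{k(i-1)+j}, \ldots, s_{k(i-1)+k}]. \tag{4}$$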
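The grouping step of Eqs. (2)–(4) can be sketched in a few lines of Python. The function name `group_bases` is hypothetical, and the sketch assumes that Eq. (3) uses a floor, so that any trailing bases that do not fill a complete group of k are dropped.

```python
def group_bases(s, k):
    """Split s into consecutive, non-overlapping groups of k bases.

    Each returned string plays the role of one row b_i of the matrix B.
    Assumption: incomplete trailing groups are discarded (floor in Eq. (3)).
    """
    K = k * (len(s) // k)  # number of nucleotides actually used
    return [s[start:start + k] for start in range(0, K, k)]


# Eleven bases with k = 3: three full groups, the last two bases are dropped.
print(group_bases("ATGGCGTACGA", 3))  # ['ATG', 'GCG', 'TAC']
```

Under this reading, the number of rows produced is the integer part of N/k, and the last base used is $s_K$.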

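For DM, the group of sub-sequences stored in matrix $\mathbf{B}$ is transformed into a sequence of integer values expressed as

$$\mathbf{c} = [c_1, \ldots, c_i, \ldots, c_K] \tag{5}$$

where $\mathbf{c}$ is the DM stream stored in the dataset. The computation of the DM stream, $\mathbf{c}$, can be expressed as

$$\begin{bmatrix} c_1 & \cdots & c_i & \cdots & c_K \end{bmatrix}^T = f_{map}(\mathbf{B}) = \begin{bmatrix} f_{map}(\mathbf{b}_1) \\ \vdots \\ f_{map}(\mathbf{b}_i) \\ \vdots \\ f_{map}(\mathbf{b}_K) \end{bmatrix} \tag{6}$$

where $f_{map}(\cdot)$ is the mapping function expressed by

$$c_i = f_{map}(\mathbf{b}_i) = \left( \sum_{j=0}^{k-1} 4^{j} \times (u_{i,j} - 1) \right) + 1 \tag{7}$$

and

$$u_{i,j} = \begin{cases} 1 & \text{for } b_{i,j+1} = T \text{ or } U \\ 2 & \text{for } b_{i,j+1} = C \\ 3 & \text{for } b_{i,j+1} = A \\ 4 & \text{for } b_{i,j+1} = G. \end{cases} \tag{8}$$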
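A minimal Python sketch of the mapping in Eqs. (7) and (8) follows. The names `f_map` and `dm_stream` are illustrative, incomplete trailing groups are assumed to be dropped, and the handling of ambiguous bases (for example N) is not described in the text, so the sketch simply raises a `KeyError` for them.

```python
# Base-to-integer table of Eq. (8).
U = {"T": 1, "U": 1, "C": 2, "A": 3, "G": 4}


def f_map(group):
    """Eq. (7): read the group as a base-4 number, first base least significant."""
    return sum(4 ** j * (U[base] - 1) for j, base in enumerate(group)) + 1


def dm_stream(s, k):
    """DM stream: one integer per non-overlapping group of k bases."""
    K = k * (len(s) // k)  # assumption: incomplete trailing group is dropped
    return [f_map(s[start:start + k]) for start in range(0, K, k)]


# Worked example for "ACG": (3-1)*1 + (2-1)*4 + (4-1)*16 + 1 = 55.
print(f_map("ACG"))                   # 55
print(dm_stream("ATGGCGTACGA", 3))    # [51, 56, 25]
```

Each value lies between 1 (all T or U) and $4^k$ (all G), so a group of k bases is encoded as a single integer without collisions.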

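For DM-CGR, the stream is characterized by the vector $\mathbf{a}$ expressed as

$$\mathbf{a} = [a_1, \ldots, a_i, \ldots, a_K] \tag{9}$$

where $a_i$ is the $i$-th value of the CGR. In the CGR (see [11], [12]) each element $a_i$ is a two-dimensional value expressed as

$$a_i = (a_i^x, a_i^y) \tag{10}$$

where $a_i^x$ and $a_i^y$ are the x- and y-coordinates in two-dimensional space, respectively. The values of the CGR are calculated by applying the functions $f_{CGRx}(\cdot)$ and $f_{CGRy}(\cdot)$ to matrix $\mathbf{B}$, that is

$$\begin{bmatrix} (a_1^x, a_1^y) & \cdots & (a_i^x, a_i^y) & \cdots & (a_K^x, a_K^y) \end{bmatrix}^T = \left( f_{CGRx}(\mathbf{B}), f_{CGRy}(\mathbf{B}) \right) = \begin{bmatrix} \left( f_{CGRx}(\mathbf{b}_1), f_{CGRy}(\mathbf{b}_1) \right) \\ \vdots \\ \left( f_{CGRx}(\mathbf{b}_i), f_{CGRy}(\mathbf{b}_i) \right) \\ \vdots \\ \left( f_{CGRx}(\mathbf{b}_K), f_{CGRy}(\mathbf{b}_K) \right) \end{bmatrix}. \tag{11}$$

The function $f_{CGRx}(\cdot)$ calculates the x-coordinate of the CGR and can be expressed as

$$a_i^x = f_{CGRx}(\mathbf{b}_i) = p_{i,k}^x \tag{12}$$

where

$$p_{i,j}^x = \frac{1}{2} u_{i,j}^x + \frac{1}{2} p_{i,j-1}^x, \quad \text{for } j = 1, \ldots, k \tag{13}$$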

and

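$$u_{i,j}^x = \begin{cases} \pm 1 & \text{for } b_{i,j} = A \\ \pm 1 & \text{for } b_{i,j} = T \text{ or } U \\ \pm 1 & \text{for } b_{i,j} = C \\ \pm 1 & \text{for } b_{i,j} = G \end{cases} \tag{14}$$

(the sign assigned to each base is not legible in this copy; see [11], [12]).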

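For the y-coordinate, the function $f_{CGRy}(\cdot)$ can be expressed as

$$a_i^y = f_{CGRy}(\mathbf{b}_i) = p_{i,k}^y \tag{15}$$

where

$$p_{i,j}^y = \frac{1}{2} u_{i,j}^y + \frac{1}{2} p_{i,j-1}^y, \quad \text{for } j = 1, \ldots, k \tag{16}$$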

and

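$$u_{i,j}^y = \begin{cases} \pm 1 & \text{for } b_{i,j} = A \\ \pm 1 & \text{for } b_{i,j} = T \text{ or } U \\ \pm 1 & \text{for } b_{i,j} = C \\ \pm 1 & \text{for } b_{i,j} = G \end{cases} \tag{17}$$

(the sign assigned to each base is not legible in this copy; see [11], [12]).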

For the initial condition, $j = 0$, $p_{i,0}^x = \alpha_x$ and $p_{i,0}^y = \alpha_y$ [11], [12]. The dataset was generated with $\alpha_x = 0$ and $\alpha_y = 0$.
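The recursion of Eqs. (13) and (16) can be sketched as follows. The corner assignment in `CORNERS` is an assumption made only for illustration (any assignment that sends the four bases to four distinct corners of the square gives a valid CGR); it is not necessarily the convention used to generate the dataset. The function names are hypothetical, and the initial condition $\alpha_x = \alpha_y = 0$ follows the text.

```python
# ASSUMED corner assignment (u^x, u^y) for each base; the signs used by the
# dataset authors are given by Eqs. (14) and (17) and may differ.
CORNERS = {"A": (1, 1), "T": (-1, 1), "U": (-1, 1), "C": (-1, -1), "G": (1, -1)}


def f_cgr(group, alpha=(0.0, 0.0)):
    """Run the half-step recursion over one group and return (p_k^x, p_k^y)."""
    px, py = alpha
    for base in group:
        ux, uy = CORNERS[base]
        px = 0.5 * ux + 0.5 * px  # Eq. (13)
        py = 0.5 * uy + 0.5 * py  # Eq. (16)
    return px, py


def dm_cgr_stream(s, k):
    """DM-CGR stream: one (x, y) point per non-overlapping group of k bases."""
    K = k * (len(s) // k)  # assumption: incomplete trailing group is dropped
    return [f_cgr(s[start:start + k]) for start in range(0, K, k)]


print(f_cgr("ACG"))                 # (0.375, -0.625)
print(dm_cgr_stream("ACGTTT", 3))   # [(0.375, -0.625), (-0.875, 0.875)]
```

Because the recursion restarts from $(\alpha_x, \alpha_y)$ for every group, each output point depends only on the k bases of its own group, with the most recent base carrying the largest weight.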

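For kMersM and kMersM-CGR, the nucleotide sequence, $\mathbf{s}$, is grouped into k-mers sub-sequences [13], [14] in the matrix $\mathbf{H}$, which can be expressed as

$$\mathbf{H} = \begin{bmatrix} \mathbf{h}_1 \\ \mathbf{h}_2 \\ \vdots \\ \mathbf{h}_i \\ \vdots \\ \mathbf{h}_{N-k} \\ \mathbf{h}_{N-k+1} \end{bmatrix} = \begin{bmatrix} s_1 & \cdots & s_k \\ s_2 & \cdots & s_{k+1} \\ \vdots & & \vdots \\ s_i & \cdots & s_{i+k-1} \\ \vdots & & \vdots \\ s_{N-k} & \cdots & s_{N-1} \\ s_{N-k+1} & \cdots & s_N \end{bmatrix}. \tag{18}$$

The kMersM stream is characterized as a sequence of integer values expressed as

$$\mathbf{r} = [r_1, \ldots, r_i, \ldots, r_{N-k+1}] \tag{19}$$

where

$$\begin{bmatrix} r_1 & \cdots & r_i & \cdots & r_{N-k+1} \end{bmatrix}^T = f_{map}(\mathbf{H}) = \begin{bmatrix} f_{map}(\mathbf{h}_1) \\ \vdots \\ f_{map}(\mathbf{h}_i) \\ \vdots \\ f_{map}(\mathbf{h}_{N-k+1}) \end{bmatrix}. \tag{20}$$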
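The sliding-window construction of Eq. (18) and the resulting kMersM stream of Eqs. (19) and (20) can be sketched as below. The function names are illustrative; the mapping reuses the base-to-integer table and the sum of Eqs. (7) and (8), repeated here so that the example runs on its own.

```python
# Base-to-integer table of Eq. (8).
U = {"T": 1, "U": 1, "C": 2, "A": 3, "G": 4}


def f_map(group):
    """Eq. (7): encode a group of k bases as an integer in 1..4**k."""
    return sum(4 ** j * (U[base] - 1) for j, base in enumerate(group)) + 1


def kmers(s, k):
    """Rows of H: all N - k + 1 overlapping windows of length k (stride 1)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def kmers_stream(s, k):
    """kMersM stream: one integer per overlapping k-mer."""
    return [f_map(h) for h in kmers(s, k)]


print(kmers("ATGGC", 3))         # ['ATG', 'TGG', 'GGC']
print(kmers_stream("ATGGC", 3))  # [51, 61, 32]
```

Compared with DM, which advances k bases at a time, the window here advances one base at a time, so the stream has N - k + 1 values instead of roughly N/k.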

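The function $f_{map}(\cdot)$ is the mapping defined by Eqs. (7) and (8). The kMersM-CGR stream is stored in the vector $\mathbf{z}$ expressed as

$$\mathbf{z} = [z_1, \ldots, z_i, \ldots, z_{N-k+1}] \tag{21}$$

where $z_i$ is the $i$-th value of the CGR. Each $i$-th element $z_i$ is a two-dimensional value expressed as

$$z_i = (z_i^x, z_i^y) \tag{22}$$

where $z_i^x$ and $z_i^y$ are the x- and y-coordinates in two-dimensional space, respectively. The values of the CGR are calculated by applying the functions $f_{CGRx}(\cdot)$ (see Eqs. (12)–(14)) and $f_{CGRy}(\cdot)$ (see Eqs. (15)–(17)) to matrix $\mathbf{H}$, that is

$$\begin{bmatrix} (z_1^x, z_1^y) & \cdots & (z_i^x, z_i^y) & \cdots & (z_{N-k+1}^x, z_{N-k+1}^y) \end{bmatrix}^T = \left( f_{CGRx}(\mathbf{H}), f_{CGRy}(\mathbf{H}) \right) = \begin{bmatrix} \left( f_{CGRx}(\mathbf{h}_1), f_{CGRy}(\mathbf{h}_1) \right) \\ \vdots \\ \left( f_{CGRx}(\mathbf{h}_i), f_{CGRy}(\mathbf{h}_i) \right) \\ \vdots \\ \left( f_{CGRx}(\mathbf{h}_{N-k+1}), f_{CGRy}(\mathbf{h}_{N-k+1}) \right) \end{bmatrix}. \tag{23}$$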
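Combining the overlapping windows of Eq. (18) with the CGR recursion gives the kMersM-CGR stream of Eq. (23). The sketch below is self-contained and again relies on an assumed corner assignment (`CORNERS`), chosen only so that the four bases map to four distinct corners; the convention used for the dataset may differ. Function names are hypothetical.

```python
# ASSUMED corner assignment (u^x, u^y); not necessarily the authors' convention.
CORNERS = {"A": (1, 1), "T": (-1, 1), "U": (-1, 1), "C": (-1, -1), "G": (1, -1)}


def f_cgr(group, alpha=(0.0, 0.0)):
    """CGR point reached after iterating Eqs. (13) and (16) over one group."""
    px, py = alpha
    for base in group:
        ux, uy = CORNERS[base]
        px = 0.5 * ux + 0.5 * px
        py = 0.5 * uy + 0.5 * py
    return px, py


def kmers_cgr_stream(s, k):
    """kMersM-CGR stream: one (x, y) point per overlapping k-mer (stride 1)."""
    return [f_cgr(s[i:i + k]) for i in range(len(s) - k + 1)]


z = kmers_cgr_stream("ACGT", 3)
print(z)       # [(0.375, -0.625), (-0.375, 0.125)]
print(len(z))  # N - k + 1 = 2
```

Splitting the result into its x and y components gives two scalar streams of equal length that can be fed to a stream algorithm separately or jointly.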

Figs. 1–4 show examples of the DM, DM-CGR, kMersM, and kMersM-CGR DSR, respectively, for a SARS-CoV-2 sequence from Brazil.

Fig. 1. Example of the DM-DSR values for the SARS-CoV-2 sequence (i = 500, …, 600) stored in the dataset (MT126808 - Brazil).

Fig. 2. Example of the DM-CGR-DSR values for the SARS-CoV-2 sequence (i = 500, …, 600) stored in the dataset (MT126808 - Brazil).

Fig. 3. Example of the kMersM-DSR values for the SARS-CoV-2 sequence (i = 500, …, 600) stored in the dataset (MT126808 - Brazil).

Fig. 4. Example of the kMersM-CGR-DSR values for the SARS-CoV-2 sequence (i = 500, …, 600) stored in the dataset (MT126808 - Brazil).

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors acknowledge the financial support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).

Footnotes

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.dib.2020.105829

Contributor Information

Raquel de M. Barbosa, Email: raquelmb@mit.edu.

Marcelo A.C. Fernandes, Email: mfernandes@dca.ufrn.br.

Appendix A. Supplementary materials

Supplementary Data S1

Supplementary Raw Research Data. This is open data under the CC BY license http://creativecommons.org/licenses/by/4.0/

mmc1.xml (1.1KB, xml)

References


Articles from Data in Brief are provided here courtesy of Elsevier
