Genomics. 2020 Oct 16;113(1):778–784. doi: 10.1016/j.ygeno.2020.10.009

Clustering and classification of virus sequence through music communication protocol and wavelet transform

Tirthankar Paul a, Seppo Vainio b, Juha Roning a
PMCID: PMC7561519  PMID: 33069829

Abstract

The coronavirus pandemic has become a major risk to global public health. The outbreak is caused by SARS-CoV-2, a member of the coronavirus family. Although images of the virus are familiar to us, the present study attempts to hear the coronavirus by translating its spike protein into audio sequences. Musical features such as pitch, timbre, volume and duration are mapped from the coronavirus protein sequence. Three different viruses, Influenza, Ebola and Coronavirus, were studied and compared through their auditory sequences by applying the Haar wavelet transform. Sonification of the coronavirus helps in understanding the protein structures by enhancing hidden features. Further, it makes a clear difference in the representation of coronavirus compared with other viruses, which will help in various research works related to virus sequences. This emerges as a simplified and novel alternative to conventional computational representations.

Keywords: Coronavirus, Haar wavelet, SVM, Protein music, MIDI

1. Introduction

The year 2020 began with the threat of a coronavirus that originated in China and later spread to the rest of the world. The novel coronavirus has reached almost every country in the world. Multiple pneumonia cases were noticed in Wuhan city, Hubei Province, China, in December 2019. Later, the cause was recognised as a novel coronavirus. Severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) is the seventh member of the coronavirus family [1,2]. The cases were first reported from China and later from Japan, South Korea, Singapore and Thailand, mostly Asian countries. Gradually, cases were found in Europe and America [3]. By 13th June 2020, more than seven million people had been affected and around four hundred fifty thousand people had died due to this virus. In this paper, the coronavirus protein sequence was translated into music through the MIDI protocol and compared with intra-family and inter-family members. Representing genome data in a non-conventional way has always been appreciated by researchers; for example, in recent publications the genome sequence was portrayed as an image, based on chaos game representation, to analyse different biological features [4,5,6]. Here, a set of 56 virus protein sequences from Coronavirus, Influenza and Ebola were studied and classified through their auditory patterns.

SARS-CoV-2, which causes a severe respiratory syndrome, is a positive-sense single-stranded RNA virus [7]. In April 2018, the World Health Organization (WHO) placed diseases with epidemic and pandemic potential, such as Middle East respiratory syndrome (MERS), severe acute respiratory syndrome (SARS) and the unknown-cause Disease X, on its priority list of pathogens [8]. Initially, WHO suspected the disease to be of Disease X aetiology [9]. Soon after, WHO designated the pathogen a novel coronavirus (2019-nCoV) and the disease COVID-19 [10,11]. A study found that 2019-nCoV has 79.5% sequence similarity with SARS-CoV and 96% similarity with SL-CoV-RaTG13, a bat coronavirus [12]. A group of Chinese virologists renamed the virus HCoV-19 [13], while internationally the Coronavirus Study Group (CSG) renamed it SARS-CoV-2 [14].

Representing a species, DNA, RNA or genome in musical form is not a new trend. The Japanese scientist Susumu Ohno showed a connection between the recurrence of repeat units and musical repetition [15]. An early work, Biomusic, translated the four DNA bases into musical notes, which gave an auditory pattern of the DNA sequence but lacked a rhythmic and musical point of view [16]. Later, approaches such as codon reading and the physical properties of DNA bases were used to translate DNA nucleotides into music [17,18]. A few studies have addressed protein music, in which a protein is translated into music. One study showed that the secondary structure of proteins could be converted into a musical sequence [19]. Music has also been formed by mapping amino acids to different pitches: the 20 amino acids are assigned to 20 different notes, from which protein music can be created [20]. This kind of musical transformation could help identify differences in protein folding and amino acids, and so aid understanding of how cell processes are regulated [21]. DNA music can also be created from the short tandem repeats (STR) in the CODIS regions, making it unique to the individual [22]. The STR sequence and the STR frequency data were converted into musical elements and rendered in Musical Instrument Digital Interface (MIDI) format to make the music melodious. MIDI is a commonly used protocol for communication between musical devices and computers, mostly in the music industry. Initially, it was developed to create polyphonic music [23]. A musical translation of protein folds can also be a useful method for the comparative study of different genome sequences [24]. Musical mapping is not limited to biological data; mobile keystroke data can also be mapped into music to secure credentials [25].

Recently, a musical work was created from the COVID-19 virus by translating its protein into music at the Massachusetts Institute of Technology, USA [26]. Musical notes can be presented in several ways. The most common ways to present music on a musical scale are the octave scale, diatonic scale, tone scale and chromatic scale [15,20,21,27,28]. Our study presents an auditory representation of viruses. At times the audio sequence sounds like music, but no musical scale was followed for the sonification. The audio sequence of the virus was therefore presented in the form of piano sound [22,29].

Biological studies demand high-performance data analysis. Manual analysis of genomes or long nucleotide sequences is not efficient for classification or clustering, so machine algorithms are preferred. Clustering plays a very important role in bioinformatics studies. The most common way to show clustering among genomic sequences is phylogenetic analysis. An algorithm based on the occurrence and position of k-tuples in DNA sequences was introduced for phylogenetic clustering [30]. A similar method, the Accumulated Natural Vector (ANV), transforms a DNA sequence into eighteen data points, including nucleotide covariance and distribution [31]. ANV is an advanced version of the Natural Vector (NV) method, which translates DNA into twelve points, and is claimed to be the most accurate method of clustering [31,32]. Genomic datasets are large-scale collections of protein or nucleotide sequences. The complexity increases when a large set of ‘N’ items is to be clustered into a large number ‘K’ of clusters. The ‘Linclust’ algorithm reduces the time complexity of clustering large datasets [33]. Another algorithm, ‘MeShClust’, was introduced for clustering DNA sequences based on the mean shift algorithm from image processing [34]. Sequence comparison using the discrete wavelet transform (DWT) has been performed by extracting k-mers from the genome sequences; the k-mers were mapped and wavelet-transformed to obtain a numeric feature vector for clustering [35]. Liu et al. used a Haar wavelet filtering method to decompose sequences for detecting cancerous genomes [36]. The authors extracted statistical data from cancerous and non-cancerous genomes and classified them via machine learning [36]. The Haar wavelet is also capable of identifying short tandem repeats (STR) in a DNA sequence [37].

The k-means algorithm is also a standard, easy-to-apply clustering algorithm for genome sequences [38,39]. The above clustering algorithms and their significance are summarized in Table 1. Data can also be classified using a support vector machine (SVM). Nucleotide sequences of different species were classified with an SVM model by reconstructing partitions with hyperplanes in Euclidean space [40]. An SVM model applied to virus genome sequences achieved a low mean error rate in classifying the sequences [41]. Numerical representations of DNA-binding protein sequences, based on protein properties and feature transformation methods, were used to predict and classify the sequences with an SVM classifier [42].

Table 1. Genome sequence clustering algorithms.

| Reference | Algorithm | Significance | Year |
| --- | --- | --- | --- |
| [32] | Natural Vector (NV) | Converts DNA sequences into vectors of twelve statistical points. | 2011 |
| [30] | mBKM with DMk | Uses the occurrence and position of k-tuples of DNA sequences. | 2012 |
| [33] | Linclust | Reduces the time complexity for large datasets. | 2018 |
| [34] | MeShClust | Clustering method based on the mean shift algorithm of image processing. | 2018 |
| [35] | Haar wavelet filtering | Detects cancerous and non-cancerous genomes. | 2018 |
| [31] | Accumulated Natural Vector (ANV) | Converts DNA sequences into vectors of eighteen statistical points. | 2019 |

In this study, we created auditory representations of coronaviruses, Influenza and Ebola virus. Our assumption is that the sound pattern could help researchers find a virus protein by searching for an acoustic sequence. It is natural for human beings to be attracted to music. Many people have a good knowledge of music, and the human brain has excellent power to analyse sound. Our minds can identify sound features such as pitch, timbre, volume, rhythm and melody. Translating a genome sequence into an audio sequence could open a new door to hidden features of genome data in the form of a sound sequence.

2. Methodology

The outline of the work towards music is shown in Fig. 1. The RNA sequences of SARS-CoV-2 are available in the NCBI database [43]. The genome sequence of the virus has been assigned the GenBank accession number MN908947 [10]. The sequence of another family member, SARS-CoV, has the GenBank accession number AY278741 and comprises 29,727 nucleotide bases [44]. Similarly, the Middle East respiratory syndrome (MERS) coronavirus can be found in GenBank under accession number KT006149 [45]. In this paper, the musical conversion was performed in the Matlab environment. A script was written in Matlab to map the RNA sequence into music, and a MIDI toolbox was installed in Matlab to translate nucleotide data into musical elements [29]. Algorithm 1 is designed to a) download the RNA sequence, b) count the proteins present in the sequence and c) perform the MIDI musical mapping. The methodology is adopted from the authors' previous publication [22].

Fig. 1. Steps of the musical conversion.

First, the nucleotide sequence was downloaded from the open-source NCBI database. The sequence was then converted into numerical and protein sequences for further processing in Matlab. A k-mer analysis was carried out with a k-mer size of three bases, that is, one codon, and the resulting amino acids were mapped into music. A codon or amino acid may appear once or many times in the sequence. Each virus sequence has a distinct set of proteins, and the number of proteins in the sequence differs with the type of virus. Each protein was coded according to its number of occurrences in the sequence: a protein is replaced by a number that indicates the total number of times that particular protein is present in the sequence. The sequence length and the presence of each protein play a vital role in the numerical mapping of the protein sequence, and the later musical transformation makes a noticeable difference in the magnitude of the sequence. For example, MADADDAAA is a protein sequence in which the total number of ‘M’ = 1, of ‘A’ = 5 and of ‘D’ = 3, so the coded sequence is 153533555.
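The occurrence coding described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Matlab script: the function name `occurrence_code` is hypothetical, and the input is assumed to be a one-letter amino acid string that has already been translated from the nucleotide sequence.

```python
from collections import Counter


def occurrence_code(protein: str) -> list:
    """Replace each residue by the number of times it occurs in the whole sequence."""
    counts = Counter(protein)  # e.g. {'A': 5, 'D': 3, 'M': 1}
    return [counts[residue] for residue in protein]


coded = occurrence_code("MADADDAAA")
print(coded)                     # [1, 5, 3, 5, 3, 3, 5, 5, 5]
print("".join(map(str, coded)))  # 153533555
```

The coded values are kept as a list rather than a digit string because the concatenated form becomes ambiguous as soon as any count reaches ten, which is likely for real virus sequences.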

Music is generally described by elements such as pitch, volume, timbre, duration, form and texture. Here, pitch and duration were coded based on the physical appearance of amino acids, while volume, form and timbre were modulated according to the sequence. The communication message passed from the computer to musical devices takes the form of a MIDI matrix [23]. This matrix has N × 6 elements, where ‘N’ is the number of notes. In the MIDI matrix ($M_{ps}$), each of the N rows represents a note event, and the six columns define the track number, MIDI channel, note value (MIDI pitch), volume, note start time and note end time of that event. Here, the track ($tn_{ps}$) and channel number ($cn_{ps}$) for piano were set to 1. A constant volume ($v_{ps}$) of 75 was fixed for all the note events. The third column is the MIDI pitch ($mp_{ps}$), which is defined from the coded frequency ($f$) mapped from the nucleotide sequence. Because of the $\log_2$ scaling, a small difference in frequency ($f$) makes a prominent difference in MIDI pitch ($mp_{ps}$). The MIDI pitch conversion is shown in Eq. (1).

$$mp_{ps} = 69 + 12 \log_2\left(\frac{f}{440}\right) \tag{1}$$
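Eq. (1) is the standard conversion from a frequency to a MIDI pitch, with 440 Hz mapped to MIDI note 69. A minimal Python sketch follows. The text does not state how the coded values are turned into the frequency f, so the function simply takes f as given; the helper `to_note_number`, which rounds and clamps to the 0 to 127 MIDI range, is an added assumption and not part of the described method.

```python
import math


def frequency_to_midi_pitch(f: float) -> float:
    """Eq. (1): mp = 69 + 12 * log2(f / 440). The frequency must be positive."""
    if f <= 0:
        raise ValueError("frequency must be positive")
    return 69 + 12 * math.log2(f / 440.0)


def to_note_number(f: float) -> int:
    """Hypothetical helper: round to the nearest integer note and clamp to 0..127."""
    return max(0, min(127, round(frequency_to_midi_pitch(f))))


print(frequency_to_midi_pitch(440.0))  # 69.0
print(frequency_to_midi_pitch(880.0))  # 81.0
```

Doubling the frequency raises the pitch by exactly 12 semitones, which is why the logarithmic scaling spreads out coded values that are close together at the low end.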

The fifth column is the ‘start time’ ($st_{ps}$) and the sixth is the ‘end time’ ($et_{ps}$) of the note; together, the two columns represent the duration ($D_{ps}$) of the note. Music can be created from such a note matrix in Matlab with the Ken Schutte MIDI toolbox [29]. The MIDI matrix is defined in Eq. (2).

$$M_{ps} = \begin{bmatrix} tn_{ps} & cn_{ps} & mp_{ps} & v_{ps} & st_{ps} & et_{ps} \end{bmatrix} \tag{2}$$

As described earlier, $tn_{ps} = cn_{ps} = 1$ and the volume $v_{ps} = 75$, while $mp_{ps}$ and $D_{ps}$ are mapped from the protein sequences, where $D_{ps} = et_{ps} - st_{ps}$. $M_{ps}$ was processed through the matrix2midi module to generate the audio file of the protein sequences [29].
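The N × 6 note matrix can be assembled as in the following Python sketch. It only builds the rows of Eq. (2); writing the MIDI file itself is done in the paper with the Matlab toolbox and is not reproduced here. The function name `build_note_matrix` is hypothetical, and placing the notes back to back, so that each note starts when the previous one ends, is an assumption made for illustration.

```python
def build_note_matrix(pitches, durations, track=1, channel=1, volume=75):
    """Return N rows of [track, channel, pitch, volume, start, end], as in Eq. (2)."""
    if len(pitches) != len(durations):
        raise ValueError("pitches and durations must have the same length")
    rows = []
    start = 0.0
    for pitch, duration in zip(pitches, durations):
        end = start + duration  # D = end - start
        rows.append([track, channel, pitch, volume, start, end])
        start = end  # assumption: the next note begins when this one ends
    return rows


for row in build_note_matrix([60, 62, 64], [0.5, 0.25, 0.5]):
    print(row)
```

Track, channel and volume default to the constants stated in the text (1, 1 and 75), so only the pitch and duration columns vary with the sequence.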

The pitch values from the MIDI file were used to find similarities among the virus sequences. For this purpose, Euclidean distances were measured to compare the audio signals. The signals must have the same length for the distance to be computed; however, sequences of different species rarely have the same length. In this study, three groups of viruses, Influenza, Ebola and Coronavirus, were examined. A total of 56 virus genome sequences (Influenza, Ebola and Coronavirus) were downloaded from the NCBI database for this purpose. These three groups of viruses have different lengths and sequences, and their auditory translations were filtered through the discrete wavelet transform (DWT). The DWT, or more precisely the Haar wavelet, is fast to compute, is a reversible lossless transform and, most importantly, is memory efficient for comparing and detecting genome sequences, as shown in previous studies [36,37]. The Haar wavelet coefficients provide the low- and high-frequency content of the signal, together with location information, in the form of approximation coefficients and detail coefficients, respectively [37]. The approximation and detail coefficients are given by Eqs. (3) and (4). The mean value and standard deviation (SD) were obtained from the wavelet transformation to classify the viruses into different clusters. These statistical data, with the accession numbers of the 56 viruses, are given in the dataset section.

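$$\mathrm{filter}(n)_{low} = \sum_{k=-\infty}^{\infty} s[k]\, g[2n - k] \tag{3}$$

$$\mathrm{filter}(n)_{high} = \sum_{k=-\infty}^{\infty} s[k]\, h[2n - k] \tag{4}$$

where $\mathrm{filter}(n)_{low}$ and $\mathrm{filter}(n)_{high}$ are the outputs of the low-pass and high-pass filters, respectively, $s[k]$ is the protein sequence in numerical form, $g[n]$ is the low-pass filter coefficient and $h[n]$ is the high-pass filter coefficient.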
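For the Haar wavelet, the filtering and downsampling of Eqs. (3) and (4) reduce to scaled sums and differences of neighbouring samples. The Python sketch below shows one decomposition level and the mean and SD features computed from it. Several choices here are assumptions rather than details from the paper: odd-length signals are padded by repeating the last sample, only one level is computed, the statistics are taken over the detail coefficients, and the population SD is used.

```python
import math
import statistics


def haar_dwt(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    samples = list(signal)
    if not samples:
        raise ValueError("signal must not be empty")
    if len(samples) % 2 == 1:
        samples.append(samples[-1])  # assumption: pad odd lengths with the last sample
    scale = math.sqrt(2.0)
    pairs = range(0, len(samples), 2)
    approx = [(samples[i] + samples[i + 1]) / scale for i in pairs]  # low-pass output
    detail = [(samples[i] - samples[i + 1]) / scale for i in pairs]  # high-pass output
    return approx, detail


def wavelet_features(signal):
    """Mean and population SD of the detail coefficients of one Haar level."""
    _, detail = haar_dwt(signal)
    return statistics.mean(detail), statistics.pstdev(detail)


approx, detail = haar_dwt([4, 2, 5, 5])
print(approx, detail)
print(wavelet_features([4, 2, 5, 5]))
```

Because the two features have a fixed size regardless of the input length, sequences of very different lengths can be compared on the same footing.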

A Bayesian optimization model was used to obtain a low cross-validation loss for the coronavirus and non-coronavirus data, as shown in Fig. 2 [46].

Fig. 2. Cross-validation optimization fit for coronavirus and non-coronavirus data.

3. Dataset

The GenBank accession numbers of the genomes that were studied in this paper but not mentioned in the previous section are given in Table 2.

Table 2. Accession numbers of whole-genome virus sequences.

4. Result

The acoustic output of a genome sequence may reveal many hidden features that cannot be seen in detail in microscopic images. The musical elements of coronavirus protein music may have an impact on our mind. Here, the music conversions were played with a piano instrumental sound. The piano roll of the RNA sequence of SARS-CoV-2 is plotted in Fig. 3a. Similarly, Fig. 3b and Fig. 3c represent the music for SARS-CoV and MERS-CoV, respectively.

Fig. 3. Piano roll plot of virus protein sequence.

Euclidean distances were measured to show the similarities and dissimilarities among three sequences (SARS-CoV, SARS-CoV-2 and MERS-CoV). The protein sequence distances and music sequence distances are shown in the upper triangular matrix (marked in yellow) and lower triangular matrix (marked in green) of Table 3, respectively. The distance matrix shows a distance of 224.0603 for the nucleotide sequence. For sequences of more than twenty-nine thousand bases, this Euclidean distance between the two virus sequences is negligible. On the other hand, the Euclidean distance increased to 1534 when the sequence was transformed into music. The same trend in Euclidean distance was found for MERS-CoV with SARS-CoV and SARS-CoV-2: the distance is much higher for the music sequences of coronaviruses than for the protein sequences. As a result, the converted music sequence can show a noticeable difference between two very similar nucleotide sequences. This distance can be measured for intra-family members, or for sequences whose lengths are almost the same, because two sequences need to be the same length to measure the Euclidean distance. Influenza, Ebola and Coronavirus come from different families and also have a vast range of sequence lengths. A Haar wavelet transformation was therefore applied to obtain statistical values of the sequences, and the viruses were clustered on these statistical values with the K-means clustering algorithm. Fig. 4 and Fig. 5 show the clustered output of the virus genome sequences and the virus audio sequences, respectively. In Fig. 4, most of the virus sequences are placed close to each other, and the clustering algorithm did not show significant variation among the different groups of viruses. On the other hand, K-means applied to the virus pitch value sequences shows three distinct clusters of Influenza, Ebola and Coronavirus in Fig. 5.
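The pairwise comparison behind the intra-family distances can be sketched as follows. The function name `euclidean_distance` is hypothetical and the numbers in the usage lines are toy values, not data from the paper; the length check reflects the stated requirement that both sequences have the same length.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy values only: two short coded sequences that differ in two positions.
print(euclidean_distance([1, 5, 3, 5], [1, 5, 6, 1]))  # 5.0
```

How sequences of nearly but not exactly equal length were aligned or trimmed before this step is not described in the text, so the sketch raises an error rather than guessing.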

Table 3. Euclidean distance of coronavirus sequences before and after translation into music.

Fig. 4. Clustering of virus sequences.

Fig. 5. Clustering of the auditory sequences of the viruses.
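The K-means step on the wavelet statistics can be sketched with a plain implementation of Lloyd's algorithm. This is an illustrative sketch rather than the authors' Matlab code: each virus is assumed to be represented by a (mean, SD) pair, the feature values below are invented for the example, and the initial centroids are passed in explicitly so that the run is deterministic.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: returns (labels, centroids) for n-dimensional points."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Invented (mean, SD) features for six sequences in three well-separated groups.
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
labels, centres = kmeans(features, [features[0], features[2], features[4]])
print(labels)   # [0, 0, 1, 1, 2, 2]
print(centres)  # [(0.0, 0.5), (10.0, 10.5), (20.5, 0.0)]
```

Each iteration compares every point with every centroid, so the cost grows with the number of points, the number of clusters and the number of iterations.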

The average detail coefficients from the Haar wavelet are plotted for the viruses in the boxplots of Fig. 6 (a and b). For the genome sequences in Fig. 6a, the boxes lie at almost the same height. In contrast, the box heights differ for the audio sequences of the viruses in Fig. 6b. The audio sequence was created based on the physical features of the protein sequence; therefore, a small difference in protein sequences creates a considerable change in the statistical values of the music sequences. The difference can also be seen in the classifier in Fig. 7, where coronavirus and non-coronavirus audio data were classified with zero loss. By comparison, a loss of 0.0377 was recorded for the optimized genome sequence data of the viruses. The classification result improves from genome (Fig. 7a) to audio sequences (Fig. 7b) of coronavirus and non-coronavirus. Therefore, the audio translation of the virus protein sequences enhances the hidden features, which can be identified in the form of a sound signal.

Fig. 6. Boxplot of the average detail coefficient distribution for different viruses.

Fig. 7. Classification based on Haar wavelet coefficients.

5. Conclusions

This work suggests a way in which Influenza, Ebola and Coronavirus protein sequences can be represented as sound sequences instead of visual data. The auditory representation of the coronaviruses can help researchers understand the protein structures in a different way. Sometimes the primary protein structures are too small to see, but they can be effectively heard in musical form. The virus music representation algorithm can be a beneficial tool for portraying small mutations within a family (the coronavirus family) in the form of music. All three scenarios (among SARS-CoV, SARS-CoV-2 and MERS-CoV) show that the Euclidean distance of the musical data is much higher than that of the protein sequence data for intra-family members. The pathogenic effect of a coronavirus may be enhanced or limited by a small mutation, which can be identified in the audio sequences of the virus.

Moreover, in the inter-family scenario, the three different types of virus (Influenza, Ebola and Coronavirus) were classified through their translated audio sequences. The comparison shows that a more pronounced difference is captured in the auditory representation of the spike proteins. The proposed algorithm is computationally efficient, with time complexity O(n ∗ k ∗ t), where ‘n’ is the sequence length, ‘k’ is the number of clusters in the k-means algorithm and ‘t’ is the number of iterations. The numerical mapping based on the physical presence of each protein, together with the length of the virus sequence, played a dominating role in the audio translation, and the ‘log 2’ scaling factor made a noticeable difference in the magnitude of the audio sequence in the MIDI conversion. This algorithm will be a helpful tool for finding and classifying virus sequences into virus family and species, and for distinguishing a virus from other members of the same family, without studying it under laboratory conditions.

Declaration of Competing Interest

None.

Acknowledgment

This research work was supported by Infotech Oulu Doctoral Program.

Footnotes

Appendix A

Supplementary data to this article can be found online at https://doi.org/10.1016/j.ygeno.2020.10.009.

Appendix A. Supplementary data

Genome accession number and Haar wavelet coefficients.

mmc1.csv (7.9KB, csv)

Music file of MERS, SARS-CoV and SARS-CoV-2.

mmc2.zip (175.9KB, zip)

References

  • 1. Yang P., Wang X. COVID-19: a new challenge for human beings. Cell. Mol. Immunol. 2020. doi: 10.1038/s41423-020-0407-x.
  • 2. Xu B., Gutierrez B., Mekaru S., Sewalk K., Goodwin L., Loskill A., Cohn E.L., Hswen Y., Hill S.C., Cobo M.M., Zarebski A.E., Li S., Wu C.H., Hulland E., Morgan J.D., Wang L., O’Brien K., Scarpino S.V., Brownstein J.S., Pybus O.G., Pigott D.M., Kraemer M.U.G. Epidemiological data from the COVID-19 outbreak, real-time case information. Sci. Data. 2020;7:106. doi: 10.1038/s41597-020-0448-0.
  • 3. Rothan H.A., Byrareddy S.N. The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak. J. Autoimmun. 2020. doi: 10.1016/j.jaut.2020.102433.
  • 4. Sun Z., Pei S., He R.L., Yau S.S.T. A novel numerical representation for proteins: three-dimensional chaos game representation and its extended natural vector. Comput. Struct. Biotechnol. J. 2020;18:1904–1913. doi: 10.1016/j.csbj.2020.07.004.
  • 5. Hoang T., Yin C., Yau S.S.T. Numerical encoding of DNA sequences by chaos game representation with application in similarity comparison. Genomics. 2016;108:134–142. doi: 10.1016/j.ygeno.2016.08.002.
  • 6. Hoang T., Yin C., Yau S.S.T. Splice sites detection using chaos game representation and neural network. Genomics. 2020;112:1847–1852. doi: 10.1016/j.ygeno.2019.10.018.
  • 7. Yan R., Zhang Y., Li Y., Xia L., Guo Y., Zhou Q. Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2. Science. 2020;367:1444–1448. doi: 10.1126/science.abb2762.
  • 8. World Health Organization. 2020. https://www.who.int/activities/prioritizing-diseases-for-research-and-development-in-emergency-contexts Visited on 13th June, 2020.
  • 9. Xia S., Liu M., Wang C., Xu W., Lan Q., Feng S., Qi F., Bao L., Du L., Liu S., Qin C., Sun F., Shi Z., Zhu Y., Jiang S., Lu L. Inhibition of SARS-CoV-2 (previously 2019-nCoV) infection by a highly potent pan-coronavirus fusion inhibitor targeting its spike protein that harbors a high capacity to mediate membrane fusion. Cell Res. 2020;2. doi: 10.1038/s41422-020-0305-x.
  • 10. Wu F., Zhao S., Yu B., Chen Y.M., Wang W., Song Z.G., Hu Y., Tao Z.W., Tian J.H., Pei Y.Y., Yuan M.L., Zhang Y.L., Dai F.H., Liu Y., Wang Q.M., Zheng J.J., Xu L., Holmes E.C., Zhang Y.Z. A new coronavirus associated with human respiratory disease in China. Nature. 2020;579:265–269. doi: 10.1038/s41586-020-2008-3.
  • 11. Zhu N., Zhang D., Wang W., Li X., Yang B., Song J., Zhao X., Huang B., Shi W., Lu R., Niu P., Zhan F., Ma X., Wang D., Xu W., Wu G., Gao G.F., Tan W. A novel coronavirus from patients with pneumonia in China, 2019. N. Engl. J. Med. 2020;382:727–733. doi: 10.1056/NEJMoa2001017.
  • 12. Zhou P., Lou Yang X., Wang X.G., Hu B., Zhang L., Zhang W., Si H.R., Zhu Y., Li B., Huang C.L., Chen H.D., Chen J., Luo Y., Guo H., Di Jiang R., Liu M.Q., Chen Y., Shen X.R., Wang X., Zheng X.S., Zhao K., Chen Q.J., Deng F., Liu L.L., Yan B., Zhan F.X., Wang Y.Y., Xiao G.F., Shi Z.L. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579:270–273. doi: 10.1038/s41586-020-2012-7.
  • 13. Jiang S., Du L., Shi Z. An emerging coronavirus causing pneumonia outbreak in Wuhan, China: calling for developing therapeutic and prophylactic strategies. Emerg. Microbes Infect. 2020;9:275–277. doi: 10.1080/22221751.2020.1723441.
  • 14. Gorbalenya A.E., Baker S.C., Baric R.S., de Groot R.J., Drosten C., Gulyaeva A.A., Haagmans B.L., Lauber C., Leontovich A.M., Neuman B.W., Penzar D., Perlman S., Poon L.L.M., Samborskiy D.V., Sidorov I.A., Sola I., Ziebuhr J. The species severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2. Nat. Microbiol. 2020;5:536–544. doi: 10.1038/s41564-020-0695-z.
  • 15. Susumu O., Midori O. The all pervasive principle of repetitious recurrence governs not only coding sequence construction but also human endeavor in musical composition. Immunogenetics. 1986;24:71–78. doi: 10.1007/BF00373112.
  • 16. Hayashi K., Munakata N. Basically musical. Nature. 1984;310:96. doi: 10.1038/310096a0.
  • 17. Gena P., Strom C. A Physiological Approach to DNA Music. Sixth Int. Symp. Electron. Art. 1995; pp. 83–85.
  • 18. Gena P., Strom C. Musical Synthesis of DNA Sequences. XI Colloq. Di Inform. Music. 1995.
  • 19. Dunn J., Clark M.A. Life music: the Sonification of proteins. Leonardo. 1999;32:25–32. doi: 10.1162/002409499552966.
  • 20. Takahashi R., Miller J.H. Conversion of amino-acid sequence in proteins to classical music: search for auditory patterns. Genome Biol. 2007;8. doi: 10.1186/gb-2007-8-5-405.
  • 21. Castagna R., Chiolerio A., Margaria V. Music translation of tertiary protein structure: Auditory patterns of the protein folding. Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics). 2011; 6625 LNCS; pp. 214–222.
  • 22. Paul T., Vainio S., Roning J. Towards personalised, DNA signature derived music via the short tandem repeats (STR). Adv. Intell. Syst. Comput. 2019; pp. 951–964.
  • 23. Florea B.C. MIDI-based controller of electrical drives. Proc. 2014 6th Int. Conf. Electron. Comput. Artif. Intell. ECAI 2014. 2015; pp. 27–30.
  • 24. Bywater R.P., Middleton J.N. Melody discrimination and protein fold classification. Heliyon. 2016;2. doi: 10.1016/j.heliyon.2016.e00175.
  • 25. Belman A.K., Paul T., Wang L., Iyengar S.S., Śniatała P., Jin Z., Phoha V.V., Vainio S., Roning J. Authentication by mapping keystrokes to music: the melody of typing. Int. Conf. Artif. Intell. Signal Process. AISP. 2020;2020. doi: 10.1109/AISP48273.2020.9073125.
  • 26. Massachusetts Institute of Technology. 2020. http://news.mit.edu/2020/qa-markus-buehler-setting-coronavirus-and-ai-inspired-proteins-to-music-0402 Visited on 13th June.
  • 27. Marques M., Oliveira V., Vieira S., Rosa A.C. Music composition using genetic evolutionary algorithms. Proc. 2000 Congr. Evol. Comput. CEC 2000. 1. 2000; pp. 714–719.
  • 28. Bertino F., Chuan C., Peroune J. The Musical Gene: Generating Harmonic Patterns from Sequenced DNA of E. coli Bacteria to Compose Music, Work. 2020. Visited on 13th June.
  • 29. Ken Schutte. MIDI Matlab Toolbox. 2020. https://github.com/kts/matlab-midi Visited on 13th June.
  • 30. Wei D., Jiang Q., Wei Y., Wang S. A novel hierarchical clustering algorithm for gene sequences. BMC Bioinforma. 2012;13(1). doi: 10.1186/1471-2105-13-174.
  • 31. Dong R., He L., He R.L., Yau S.S.T. A novel approach to clustering genome sequences using inter-nucleotide covariance. Front. Genet. 2019;10. doi: 10.3389/fgene.2019.00234.
  • 32. Deng M., Yu C., Liang Q., He R.L., Yau S.S.T. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PLoS One. 2011;6. doi: 10.1371/journal.pone.0017293.
  • 33. Steinegger M., Söding J. Clustering huge protein sequence sets in linear time. Nat. Commun. 2018;9. doi: 10.1038/s41467-018-04964-5.
  • 34. James B.T., Luczak B.B., Girgis H.Z. MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res. 2018;46:e83. doi: 10.1093/nar/gky315.
  • 35. Lin J., Wei J., Adjeroh D., Jiang B.H., Jiang Y. SSAW: a new sequence similarity analysis method based on the stationary discrete wavelet transform. BMC Bioinforma. 2018;19:1–11. doi: 10.1186/s12859-018-2155-9.
  • 36. Liu D.W., Jia R.P., Wang C.F., Arunkumar N., Narasimhan K., Udayakumar M., Elamaran V. Automated detection of cancerous genomic sequences using genomic signal processing and machine learning. Futur. Gener. Comput. Syst. 2019;98:233–237. doi: 10.1016/j.future.2018.12.041.
  • 37. Paul T., Vainio S., Roning J. Haar wavelet based approach for Short Tandem Repeats (STR) Detection. 2019 IEEE 19th Int. Symp. Signal Process. Inf. Technol. ISSPIT 2019. 2019; pp. 1–6.
  • 38. Bakar R.B.A., Watada J., Pedrycz W. DNA approach to solve clustering problem based on a mutual order. BioSystems. 2008;91:1–12. doi: 10.1016/j.biosystems.2007.06.002.
  • 39. Kenidra B., Benmohammed M., Beghriche A., Benmounah Z. A partitional approach for genomic-data clustering combined with K-Means algorithm. Proc. - 19th IEEE Int. Conf. Comput. Sci. Eng. 14th IEEE Int. Conf. Embed. Ubiquitous Comput. 15th Int. Symp. Distrib. Comput. Appl. to Business, Engi. 2017; pp. 114–121.
  • 40. Seo T.K. Classification of nucleotide sequences using support vector machines. J. Mol. Evol. 2010;71:250–267. doi: 10.1007/s00239-010-9380-9.
  • 41. Wang T., Herbster M., Mian I.S. Virus Genome Sequence Classification using Features based on Nucleotides, Words and Compression. 2018; pp. 1–36. http://arxiv.org/abs/1809.03950
  • 42. Zou C., Gong J., Li H. An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinforma. 2013;14. doi: 10.1186/1471-2105-14-90.
  • 43. NCBI Database. 2020. https://www.ncbi.nlm.nih.gov/ Visited on 13th June.
  • 44. Drosten C., Günther S., Preiser W., Van der Werf S., Brodt H.R., Becker S., Rabenau H., Panning M., Kolesnikova L., Fouchier R.A.M., Berger A., Burguière A.M., Cinatl J., Eickmann M., Escriou N., Grywna K., Kramme S., Manuguerra J.C., Müller S., Rickerts V., Stürmer M., Vieth S., Klenk H.D., Osterhaus A.D.M.E., Schmitz H., Doerr H.W. Identification of a novel coronavirus in patients with severe acute respiratory syndrome. N. Engl. J. Med. 2003;348:1967–1976. doi: 10.1056/NEJMoa030747.
  • 45. Lu R., Wang Y., Wang W., Nie K., Zhao Y., Su J., Deng Y., Zhou W., Li Y., Wang H., Wang W., Ke C., Ma X., Wu G., Tan W. Complete genome sequence of Middle East respiratory syndrome coronavirus (MERS-CoV) from the first imported MERS-CoV case in China. Genome Announc. 2015;3:2014–2015. doi: 10.1128/genomeA.00818-15.
  • 46. Kouziokas G.N. SVM kernel based on particle swarm optimized vector and Bayesian optimized SVM in atmospheric particulate matter forecasting. Appl. Soft Comput. J. 2020;93. doi: 10.1016/j.asoc.2020.106410.
  • 47. de Groot R., Baker S., Baric R., Enjuanes L., Gorbalenya A., Holmes K., Perlman S., Poon L., Rottier P., Talbot P., Woo P., Ziebuhr J. Part II – The Positive Sense Single Stranded RNA Viruses Family Coronaviridae. Virus Taxon. Ninth Rep. Int. Comm. Taxon. Viruses. 2012; pp. 806–828.
  • 48. Jones S., Prasad R., Nair A.S., Dharmaseelan S., Usha R., Nair R.R., Pillai R.M. Whole-Genome Sequences of Influenza A(H1N1)pdm09 Virus Isolates from Kerala, India. Vol. 5. 2015; pp. 9–10.
  • 49. NCBI Database Ebolavirus. 2020. https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=186536 Visited on 13th June.


Articles from Genomics are provided here courtesy of Elsevier
