Abstract
Experimental approaches are currently used to determine viral-host interactions, but they are both time-consuming and costly. For these reasons, computational approaches are recommended. In this study, viral-host interactions between the SARS-CoV-2 virus and human proteins were predicted using computational approaches. The study consists of four stages: in the first stage, viral and host protein sequences were obtained. In the second stage, the protein sequences were converted into numerical representations by various protein mapping methods. These methods are entropy-based, AVL-tree, FIBHASH, binary encoding, CPNR, PAM250, BLOSUM62, Atchley factors, Meiler parameters, EIIP, AESNN1, Miyazawa energies, Micheletti potentials, Z-scale, and hydrophobicity. In the third stage, a deep learning model based on BiLSTM was designed. In the last stage, the protein sequences were classified and the viral-host interactions were predicted. The performances of the protein mapping methods were evaluated with accuracy, F1-score, specificity, sensitivity, and AUC scores. According to the classification results, the best performance was obtained with the entropy-based method, which achieved 94.74% accuracy and a 0.95 AUC score. The second most successful classification was performed with the Z-scale method, with 91.23% accuracy and a 0.96 AUC score. Although the other protein mapping methods were not as efficient as the Z-scale and entropy-based methods, they still achieved successful classification. The AVL-tree, FIBHASH, binary encoding, CPNR, PAM250, BLOSUM62, Atchley factors, Meiler parameters, and AESNN1 methods showed accuracy, F1-score, and AUC values above 80%, while the accuracy scores of the EIIP, Miyazawa energies, Micheletti potentials, and hydrophobicity methods remained below 80%. When the results were examined in general, it was observed that computational approaches were successful in predicting viral-host interactions between the SARS-CoV-2 virus and human proteins.
Keywords: Protein mapping, Deep learning, COVID-19, SARS-CoV-2 virus
1. Introduction
The SARS-CoV-2 virus, which emerged in 2019 in Wuhan, China, spread all over the world in a short time and became a pandemic by causing COVID-19 disease. This disease, which affects the immune system, causes fever and cough, and in some serious cases it causes pneumonia [1]. Although it is a serious disease, approximately 80% of infected people have mild to moderate disease and do not need any treatment [2,3]. However, people whose immune system is not strong enough and who have chronic disorders experience the disease severely, and in very serious cases they are hospitalized in intensive care units [4]. COVID-19 is considered a serious illness because it is highly transmissible and contagious. In addition, COVID-19 is seriously affecting health systems: the scarcity of intensive care units, long treatment periods, and insufficient hospital resources clearly demonstrate the impact of this epidemic on the health system [5]. Today, vaccines have been developed and put into use to combat COVID-19. However, the virus constantly mutates and produces new variants. Some studies have shown that the efficacy of vaccines and antibodies against new variants is reduced [6,7]. It has also been observed that the new variants are more contagious [8]. For these reasons, it is important to diagnose the disease at an early stage to prevent such problems. To fight COVID-19, the genetic characteristics of the SARS-CoV-2 virus should be well known. Knowledge of the genetic characteristics of the virus can reveal various drug and vaccine treatments for COVID-19 [9,10]. The SARS-CoV-2 virus contains four main structural proteins: spike glycoprotein (S), membrane glycoprotein (M), small envelope glycoprotein (E), and nucleocapsid protein (N) [11]. In Fig. 1, the genetic structure of the SARS-CoV-2 virus is given.
Fig. 1.
Genetic structure of the SARS-CoV-2 virus [12].
Understanding how these viral proteins interact with host cells for survival and reproduction is of great importance for the development and use of drugs [13]. Viruses communicate with host cells through protein-protein interactions. Protein interactions between virus and host are responsible for the entire life cycle of the virus [14]. This cycle includes infection of the host cell and replication of the viral genome. Therefore, the identification of interactions between virus and host proteins provides information about how viral proteins work, how they replicate, and how they cause disease. Analysis of interactions between proteins elucidates processes such as immunity, cellular signaling, replication, and cellular division. In addition, viral-host interactions constitute a potential target for antiviral therapies [15]. Experimental approaches are frequently used to determine interactions between proteins and between virus and host. These approaches generate a large amount of data and are costly in terms of both time and laboratory equipment [16]. The most commonly used experimental methods are YTH (Yeast Two-Hybrid), GST Pull-Down, NAPPA (Nucleic Acid Programmable Protein Array), and APMS (Affinity Purification Mass Spectrometry) [14]. Yet, the detected protein interactions vary according to the experimental method used. Because experimental methods depend on both environmental and operational conditions, false positive and false negative results may occur [17]. Because of these problems and disadvantages, computational approaches are increasingly preferred, and their popularity is constantly growing [18,19].
Computational approaches were also used in this study, and viral-host interactions between SARS-CoV-2 and human proteins were predicted. The study consists of four stages. In the first stage, viral-host interaction data between SARS-CoV-2 and human proteins were obtained. In the second stage, the protein sequences were mapped by numerical mapping methods. For this purpose, fifteen protein mapping methods were used: AESNN1, Atchley factors, AVL-tree (Adelson-Velsky and Landis), binary encoding, BLOSUM62 (Blocks Substitution Matrix), CPNR (Complex Prime Number Representation), EIIP (Electron-Ion Interaction Potential), FIBHASH (Fibonacci Hash), hydrophobicity, entropy-based, Meiler parameters, Micheletti potentials, Miyazawa energies, PAM250 (Point Accepted Mutation), and Z-scale. In the third stage, a deep learning model was designed and viral-host interactions between the SARS-CoV-2 virus and human proteins were classified. In the last stage, the performances of the numerical mapping methods were determined by the evaluation criteria of accuracy, sensitivity, specificity, AUC score, and F1-score. The highlights of the study can be summarized as follows:
• In this study, viral-host interactions between SARS-CoV-2 and human proteins were predicted by computational approaches.
• To the best of our knowledge, proteins were analyzed for the first time in this study using the AVL-tree, entropy-based, and FIBHASH methods.
• In this study, viral-host interactions between SARS-CoV-2 and human proteins were effectively classified by deep learning.
The remainder of the study is organized as follows. In the second section, studies in this field are reviewed and information about the methods and classifiers used is given. In the third section, the protein mapping methods are described; the data set is also presented, and the training and test data are explained. The designed deep learning model and its parameters are examined in this section as well. In the fourth section, the classification results are given and the performances of the protein mapping methods are compared. In the last section, the contribution of the study to the literature and possible directions for future work are discussed.
2. Related works
In this section, studies predicting viral-host interactions between SARS-CoV-2 and human proteins are reviewed, and the data, methods, and classification results used in these studies are analyzed. In study [13], researchers predicted viral-host interactions between SARS-CoV-2 and human proteins using machine learning techniques. SVM (Support Vector Machine), KNN (K-Nearest Neighbor), NB (Naive Bayes), and RF (Random Forest) algorithms were used for classification. Protein sequences were mapped by the AAC (Amino Acid Composition), CT (Conjoint Triad), and PseAAC (Pseudo Amino Acid Composition) methods and features were obtained. Then, feature selection was performed with the LVQ (Learning Vector Quantization) method. The performances of the classification algorithms were measured with accuracy, recall, specificity, precision, and F1-score. The best results were obtained with an ensemble model consisting of the SVM and RF algorithms, which achieved an accuracy of 72.33%. Researchers in study [20] analyzed the host-cell interactome of SARS-CoV-2 infection. A network-based approach was used for the analysis, and data were obtained from the Reactome and KEGG data sets. The pathogenesis of SARS-CoV-2 infection was predicted using three human coronaviruses (SARS-CoV, MERS-CoV, and HCoV-229E) and their protein-protein interactions. At the end of the study, it was stated that the developed method was successful and could be used to identify potential new biological targets. In another study, researchers aligned the viral-host protein-protein interaction networks of the SARS-CoV-1 and SARS-CoV-2 viruses and human proteins using integer linear programming [21]. Gene ontologies were used, and protein interactions were examined in terms of MF (Molecular Function), CC (Cellular Component), and BP (Biological Process). In the first stage of the study, the proteins of the SARS-CoV-1 and SARS-CoV-2 viruses were aligned, separately for each gene ontology. After the alignment process, human proteins were used and the proteins with which these viruses interacted were determined. The best BP and CC results for the spike protein (S) strains were obtained with the Q7Z5G4 protein; for MF, the best results were obtained with the Q7Z5G4 and Q9C0B5 proteins. For the envelope protein (E) strains, the best alignment result in the BP category was obtained with the O00203 protein. While the Q8IWA5 protein gave the best alignment result in the CC category, the best alignment interaction in the MF category was calculated for the O00203 protein. For the membrane protein (M), the best alignment score in the CC category was obtained from the Q96D53 protein. In the BP category, the best score was obtained from the P27105 protein, while in the MF category it was again obtained from the Q96D53 protein. For the nucleocapsid protein (N), the best alignment score for BP was obtained from the P67870 protein, and the most efficient alignment in the CC gene ontology was calculated with the P11940 protein; multiple proteins from the MF category were effective in the alignment process. At the end of the study, it was stated that the alignment process was effective in determining virus-host interactions between SARS-CoV-2 and human proteins. In study [22], researchers developed a novel deep learning model and predicted virus-host interactions from protein sequences and infectious disease phenotypes.
In the study, human proteins and virus taxonomies were mapped by the one-hot encoding method. After the mapping process, a CNN (Convolutional Neural Network)-based deep learning model was developed and the data were classified. The classification was carried out in three categories, family-based, taxon-based, and protein-based, and the performance of the proposed deep learning model was evaluated with the ROC (Receiver Operating Characteristic) curve. While 81.30% success was achieved in family-based classification, this rate was calculated as 82.90% in taxon-based classification and 80% in protein-based classification. At the end of the study, it was observed that the proposed deep learning model effectively determined the viral-host interactions between SARS-CoV-2 and human proteins. In another study, researchers predicted viral-host interactions between the SARS-CoV-2 virus and human proteins using a network-based method [23]. Structural and content information of virus and human proteins were combined, classification was performed with an MLP (Multi-Layer Perceptron), and the model was compared with various models in the literature. The performance of the classification algorithm was measured with accuracy, precision, recall, F1-score, and AUC values. With the proposed model, an accuracy score of 97.1% was obtained, and an AUC score of 0.997 was calculated.
3. Materials and methods
3.1. SARS-CoV-2 and human protein-protein interaction data
In this study, the data set provided in [24] was used. In that study, the researchers shared a data set containing protein-protein interactions between the SARS-CoV-2 virus and human proteins, created using affinity purification mass spectrometry. The data set contains 332 interactions between 332 human proteins and coronavirus proteins (4 structural and 20 other viral proteins). This data set was used to construct the positive training and test data sets. There is no specific gold standard for constructing a negative data set in protein-protein interaction studies. Generally, two methods, random matching and sub-cellular localization, are used when creating a negative data set [13]. In most studies, non-interacting protein pairs are generated randomly, and the pairs used in the positive data set are eliminated [25,26]. In some studies, the negative data set is created by sub-cellular localization [27]. However, negative data sets generated by these two methods are not very reliable. In random matching, positive interactions can mistakenly be treated as negative interactions, and different accuracy scores can be obtained for different random matchings. In addition, biased accuracy scores can be obtained with sub-cellular localization [26]. For these reasons, these two methods were not used when creating the negative data set in this study. The negative data set was obtained from study [13], in which the researchers used a degree-based approach to construct it. It has been determined that the negative data set created with this approach is more effective in both the training and testing phases compared to random matching and sub-cellular localization. While a total of 1,327 protein sequences were used in the positive data set, 4,554 protein sequences were used for the negative data set.
3.2. Protein mapping methods
In this study, protein sequences were mapped using various protein mapping methods. These methods are AESNN1, Atchley factors, hydrophobicity, Meiler parameters, EIIP, CPNR, binary encoding, FIBHASH, PAM250, BLOSUM62, Micheletti potentials, Miyazawa energies, entropy-based, AVL-tree, and Z-scale. In the following sections, information about these methods is given.
3.2.1. AESNN1 protein mapping method
In this protein mapping method, protein structure information is trained with a machine learning algorithm to create numerical representations for protein sequences [28]. The numerical values of this method are given in Table 1.
Table 1.
AESNN1 values of amino acid codes.
| Amino Acid Code | AESNN1 Value | Amino Acid Code | AESNN1 Value |
|---|---|---|---|
| A | −0.99 | L | −0.92 |
| R | 0.28 | K | −0.63 |
| N | 0.77 | M | −0.80 |
| D | 0.74 | F | 0.87 |
| C | 0.34 | P | −0.99 |
| Q | 0.12 | S | 0.99 |
| E | 0.59 | T | 0.42 |
| G | −0.79 | W | −0.13 |
| H | 0.08 | Y | 0.59 |
| I | −0.77 | V | −0.99 |
A protein sequence P (S) = [A R N D C Q …] is mapped as C (S) = [-0.99 0.28 0.77 0.74 0.34 0.12 …] by the AESNN1 protein mapping method.
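To make the lookup concrete, a minimal Python sketch is given below. It simply mirrors Table 1 as a dictionary; the function name and structure are illustrative and are not the authors' original implementation.

```python
# Minimal sketch: map a protein sequence to its AESNN1 representation (Table 1).
# The dictionary mirrors Table 1; this is illustrative code, not the authors' implementation.
AESNN1 = {
    'A': -0.99, 'R': 0.28, 'N': 0.77, 'D': 0.74, 'C': 0.34,
    'Q': 0.12, 'E': 0.59, 'G': -0.79, 'H': 0.08, 'I': -0.77,
    'L': -0.92, 'K': -0.63, 'M': -0.80, 'F': 0.87, 'P': -0.99,
    'S': 0.99, 'T': 0.42, 'W': -0.13, 'Y': 0.59, 'V': -0.99,
}

def map_aesnn1(sequence: str) -> list[float]:
    """Return the 1-dimensional AESNN1 encoding of a protein sequence."""
    return [AESNN1[aa] for aa in sequence if aa in AESNN1]

print(map_aesnn1("ARNDCQ"))  # [-0.99, 0.28, 0.77, 0.74, 0.34, 0.12]
```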
3.2.2. Atchley factors protein mapping method
Atchley factors were proposed to eliminate metric problems encountered in protein sequences [29]. There are five factors in total in this method: bipolar (bp), secondary structure (ss), molecular volume (mv), relative amino acid composition (raac), and electrostatic charge (ec). Since there are five factors, the method is 5-dimensional. Protein sequences are mapped using these five factors. Table 2 provides the Atchley factors of the amino acids.
Table 2.
Atchley factors of amino acid codes.
| Amino Acid Code | bp | ss | mv | raac | ec |
|---|---|---|---|---|---|
| A | −0.591 | −1.302 | −0.733 | 1.570 | −0.146 |
| R | 1.538 | −0.055 | 1.502 | 0.440 | 2.897 |
| N | 0.945 | 0.828 | 1.299 | −0.169 | 0.933 |
| D | 1.050 | 0.302 | −3.656 | −0.259 | −3.242 |
| C | −1.343 | 0.465 | −0.862 | −1.020 | 0.255 |
| E | 1.357 | −1.453 | 1.477 | 0.113 | −0.837 |
| Q | 0.931 | −0.179 | −3.005 | −0.503 | −1.853 |
| G | −0.384 | 1.652 | 1.330 | 1.045 | 2.064 |
| H | 0.336 | −0.417 | −1.673 | −1.474 | −0.078 |
| I | −1.239 | −0.547 | 2.131 | 0.393 | 0.816 |
| L | −1.019 | −0.987 | −1.505 | 1.266 | −0.912 |
| K | 1.831 | −0.561 | 0.533 | −0.277 | 1.648 |
| M | −0.663 | −1.524 | 2.219 | −1.005 | 1.212 |
| F | −1.006 | −0.590 | 1.891 | −0.397 | 0.412 |
| P | 0.189 | 2.081 | −1.628 | 0.421 | −1.392 |
| S | −0.228 | 1.399 | −4.760 | 0.670 | −2.647 |
| T | −0.032 | 0.326 | 2.213 | 0.908 | 1.313 |
| W | −0.595 | 0.009 | 0.672 | −2.128 | −0.184 |
| Y | 0.260 | 0.830 | 3.097 | −0.838 | 1.512 |
| V | −1.337 | −0.279 | −0.544 | 1.242 | −1.262 |
A protein sequence P (S) = [A R …] is mapped as C (S) = [[-0.591 -1.302 -0.733 1.570 -0.146] [1.538 -0.055 1.502 0.440 2.897] …] by the Atchley factors protein mapping method.
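A multi-dimensional mapping of this kind yields a matrix with one row of factors per residue, which is a convenient input shape for a sequence model. The sketch below is illustrative and only reproduces two rows of Table 2; the function name and the NumPy array layout are assumptions, not the authors' code.

```python
import numpy as np

# Sketch: multi-dimensional mapping with Atchley factors (Table 2).
# Only two amino acids are shown for brevity; the full table has 20 rows.
ATCHLEY = {
    'A': [-0.591, -1.302, -0.733, 1.570, -0.146],
    'R': [ 1.538, -0.055,  1.502, 0.440,  2.897],
    # ... remaining amino acids as listed in Table 2
}

def map_atchley(sequence: str) -> np.ndarray:
    """Return an (L, 5) array: one row of five factors per residue."""
    return np.array([ATCHLEY[aa] for aa in sequence if aa in ATCHLEY])

print(map_atchley("AR").shape)  # (2, 5)
```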
3.2.3. Hydrophobicity protein mapping method
In the hydrophobicity method, proteins are mapped according to the hydrophilic and hydrophobic information of the protein sequences in the polypeptide chains [30]. The hydrophobicity properties of amino acids are frequently used as an efficient way to compare and analyze amino acid sequences. The hydrophobicity protein mapping method produces 1-dimensional numerical data. Hydrophobicity values of amino acids are given in Table 3.
Table 3.
Hydrophobicity values of amino acid codes.
| Amino Acid Code | Hydrophobicity Value | Amino Acid Code | Hydrophobicity Value |
|---|---|---|---|
| A | 1.8 | M | 1.9 |
| C | 2.5 | N | −3.5 |
| D | −3.5 | P | −1.6 |
| E | −3.5 | Q | −3.5 |
| F | 2.8 | R | −4.5 |
| G | −0.4 | S | −0.8 |
| H | −3.2 | T | −0.7 |
| I | 4.5 | V | 4.2 |
| K | −3.9 | W | −0.9 |
| L | 3.8 | Y | −1.3 |
A protein sequence P (S) = [A C D E F G …] is mapped as C (S) = [1.8 2.5 -3.5 -3.5 2.8 -0.4 …] by the hydrophobicity protein mapping method.
3.2.4. Meiler parameters protein mapping method
Meiler parameters were proposed to express reduced representations of amino acids [31]. Like the Atchley factors, this method has multiple parameters: steric (s), polarizability (p), volume (v), hydrophobicity (h), isoelectric point (ip), helix probability (hp), and sheet probability (sp). With this method, 7-dimensional data are obtained when the protein sequences are mapped. The Meiler parameters of the amino acid codes are given in Table 4.
Table 4.
Meiler parameters of amino acid codes.
| Amino Acid Code | s | p | v | h | ip | hp | sp |
|---|---|---|---|---|---|---|---|
| A | 1.28 | 0.05 | 1.00 | 0.31 | 6.11 | 0.42 | 0.23 |
| R | 2.34 | 0.29 | 6.13 | −1.01 | 10.74 | 0.36 | 0.25 |
| N | 1.60 | 0.13 | 2.95 | −0.60 | 6.52 | 0.21 | 0.22 |
| D | 1.60 | 0.11 | 2.78 | −0.77 | 2.95 | 0.25 | 0.20 |
| C | 1.77 | 0.13 | 2.43 | 1.54 | 6.35 | 0.17 | 0.41 |
| E | 1.56 | 0.15 | 3.78 | −0.64 | 3.09 | 0.42 | 0.21 |
| Q | 1.56 | 0.18 | 3.95 | −0.22 | 5.65 | 0.36 | 0.25 |
| G | 0.00 | 0.00 | 0.00 | 0.00 | 6.07 | 0.13 | 0.15 |
| H | 2.99 | 0.23 | 4.66 | 0.13 | 7.69 | 0.27 | 0.30 |
| I | 4.19 | 0.19 | 4.00 | 1.80 | 6.04 | 0.30 | 0.45 |
| L | 2.59 | 0.19 | 4.00 | 1.70 | 6.04 | 0.39 | 0.31 |
| K | 1.89 | 0.22 | 4.77 | −0.99 | 9.99 | 0.32 | 0.27 |
| M | 2.35 | 0.22 | 4.43 | 1.23 | 5.71 | 0.38 | 0.32 |
| F | 2.94 | 0.29 | 5.89 | 1.79 | 5.67 | 0.30 | 0.38 |
| P | 2.67 | 0.00 | 2.72 | 0.72 | 6.80 | 0.13 | 0.34 |
| S | 1.31 | 0.06 | 1.60 | −0.04 | −5.70 | 0.20 | 0.28 |
| T | 3.03 | 0.11 | 2.60 | 0.26 | 5.60 | 0.21 | 0.36 |
| W | 3.21 | 0.41 | 8.08 | 2.25 | 5.94 | 0.32 | 0.42 |
| Y | 2.94 | 0.30 | 6.47 | 0.96 | 5.66 | 0.25 | 0.41 |
| V | 3.67 | 0.14 | 3.00 | 1.22 | 6.02 | 0.27 | 0.49 |
A protein sequence P (S) = [A R …] is mapped as C (S) = [[1.28 0.05 1.00 0.31 6.11 0.42 0.23] [2.34 0.29 6.13 -1.01 10.74 0.36 0.25] …] by the Meiler parameters protein mapping method.
3.2.5. EIIP protein mapping method
The EIIP method was proposed to predict interactions between proteins and between proteins and DNA [32]. In this method, sequences are first converted into numerical signals using the EIIP values of the amino acids; the Fourier transform of this signal is then taken and the power spectrum is calculated [32]. The EIIP values of the amino acids are given in Table 5.
Table 5.
EIIP values of amino acid codes.
| Amino Acid Code | EIIP Value | Amino Acid Code | EIIP Value |
|---|---|---|---|
| M | 0.0823 | Q | 0.0761 |
| W | 0.0548 | S | 0.0829 |
| F | 0.0946 | A | 0.0373 |
| Y | 0.0516 | N | 0.0036 |
| P | 0.0198 | G | 0.0050 |
| C | 0.0829 | R | 0.0959 |
| T | 0.0941 | I | 0 |
| H | 0.0242 | D | 0.1263 |
| V | 0.0057 | E | 0.0058 |
| L | 0 | K | 0.0371 |
A protein sequence P (S) = [M W F Y P C …] is mapped as C (S) = [0.0823 0.0548 0.0946 0.0516 0.0198 0.0829 …] by the EIIP protein mapping method.
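As a sketch of the signal-processing idea behind EIIP, the snippet below maps a short sequence to EIIP values and computes the power spectrum of the resulting signal with a NumPy FFT. It is illustrative: only a few rows of Table 5 are reproduced, and the exact pipeline in [32] may differ.

```python
import numpy as np

# Sketch of the informational-spectrum idea behind EIIP: map residues to
# EIIP values (Table 5) and take the power spectrum of the numerical signal.
# Only a subset of Table 5 is shown here for brevity.
EIIP = {'M': 0.0823, 'W': 0.0548, 'F': 0.0946, 'Y': 0.0516, 'P': 0.0198, 'C': 0.0829}

def eiip_power_spectrum(sequence: str) -> np.ndarray:
    """Return the power spectrum of the EIIP signal of a protein sequence."""
    signal = np.array([EIIP[aa] for aa in sequence if aa in EIIP])
    return np.abs(np.fft.fft(signal)) ** 2  # power spectrum of the numerical signal

print(eiip_power_spectrum("MWFYPC"))
```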
3.2.6. CPNR protein mapping method
The CPNR protein mapping method was proposed to compare protein functions [33]. In this method, protein sequences are mapped according to codons: in the first step of the mapping process, amino acids are grouped according to their codon numbers, and prime numbers are assigned to these codons. The CPNR values of the amino acids are given in Table 6.
Table 6.
CPNR values of amino acid codes.
| Amino Acid Code | CPNR Value | Amino Acid Code | CPNR Value |
|---|---|---|---|
| M | 1 | Q | 29 |
| W | 2 | S | 31 |
| F | 3 | A | 37 |
| Y | 5 | N | 41 |
| P | 7 | G | 43 |
| C | 11 | R | 47 |
| T | 13 | I | 53 |
| H | 17 | D | 59 |
| V | 19 | E | 61 |
| L | 23 | K | 67 |
A protein sequence P (S) = [M W F Y P C …] is mapped as C (S) = [1 2 3 5 7 11 …] by the CPNR protein mapping method.
3.2.7. Binary encoding protein mapping method
In the binary encoding mapping method, the amino acids in protein sequences are represented by the values "0" and "1". The most widely used binary encoding method is one-hot encoding, in which each of the twenty standard amino acids is mapped to a twenty-dimensional binary vector. In the mapping process, the amino acids are listed alphabetically, and numerical values are assigned to the amino acid codes. The binary encoding values of the amino acid codes are given in Table 7.
Table 7.
Binary encoding values of amino acid codes.
| Amino Acid Code | Binary Encoding Value | Amino Acid Code | Binary Encoding Value |
|---|---|---|---|
| A | 10000000000000000000 | M | 00000000001000000000 |
| C | 01000000000000000000 | N | 00000000000100000000 |
| D | 00100000000000000000 | P | 00000000000010000000 |
| E | 00010000000000000000 | Q | 00000000000001000000 |
| F | 00001000000000000000 | R | 00000000000000100000 |
| G | 00000100000000000000 | S | 00000000000000010000 |
| H | 00000010000000000000 | T | 00000000000000001000 |
| I | 00000001000000000000 | V | 00000000000000000100 |
| K | 00000000100000000000 | W | 00000000000000000010 |
| L | 00000000010000000000 | Y | 00000000000000000001 |
A protein sequence P (S) = [A C D …] is mapped as C (S) = [10000000000000000000 01000000000000000000 00100000000000000000 …] by the binary encoding protein mapping method.
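A minimal sketch of this one-hot scheme, assuming the alphabetical ordering of Table 7, is shown below; the function name is an illustrative choice.

```python
import numpy as np

# Sketch: one-hot (binary) encoding of a protein sequence.
# Amino acids are listed alphabetically, as in Table 7.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return an (L, 20) binary matrix with a single 1 per row."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=int)
    for row, aa in enumerate(sequence):
        encoded[row, index[aa]] = 1
    return encoded

print(one_hot_encode("ACD"))  # rows for A, C, D with a single 1 in columns 0, 1, 2
```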
3.2.8. PAM250 protein mapping method
The PAM250 protein mapping method is used to identify similarities in protein sequences by scoring aligned peptide sequences [34]. The values in the PAM250 matrix were obtained by comparing aligned protein sequences with known homology and identifying the observed point mutations. The PAM250 values of the amino acids are given in Table 8.
Table 8.
PAM250 values of amino acid codes.
| Amino Acid Code | PAM250 Value | Amino Acid Code | PAM250 Value |
|---|---|---|---|
| A | 2 | L | 6 |
| R | 6 | K | 5 |
| N | 2 | M | 6 |
| D | 4 | F | 9 |
| C | 4 | P | 6 |
| Q | 4 | S | 3 |
| E | 4 | T | 3 |
| G | 5 | W | 17 |
| H | 6 | Y | 10 |
| I | 5 | V | 4 |
A protein sequence P (S) = [A R N D C Q …] is mapped as C (S) = [2 6 2 4 4 4 …] by the PAM250 protein mapping method.
3.2.9. BLOSUM62 protein mapping method
BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. The method is position-independent and was first proposed in 1992 [35]. There are different variants of the matrix, such as BLOSUM80 and BLOSUM45, and the number in the matrix name reflects the similarity of the sequences used to build it. Alignments between two evolutionarily close organisms receive high scores, so the higher the number, the greater the similarity. For example, the BLOSUM80 matrix is used when aligning the protein sequences of two closely related organisms, whereas the proteins of two evolutionarily distant organisms can be aligned with the BLOSUM45 matrix. When the evolutionary distance between organisms is not known exactly, the BLOSUM62 matrix is used. The BLOSUM62 values of the amino acids are given in Table 9.
Table 9.
BLOSUM62 values of amino acid codes.
| Amino Acid Code | BLOSUM62 Value | Amino Acid Code | BLOSUM62 Value |
|---|---|---|---|
| A | 4 | L | 4 |
| R | 5 | K | 5 |
| N | 6 | M | 5 |
| D | 6 | F | 6 |
| C | 9 | P | 7 |
| Q | 5 | S | 4 |
| E | 5 | T | 5 |
| G | 6 | W | 11 |
| H | 8 | Y | 7 |
| I | 4 | V | 4 |
A protein sequence P (S) = [A R N D C Q …] is mapped as C (S) = [4 5 6 6 9 5 …] by the BLOSUM62 protein mapping method.
3.2.10. Miyazawa energies protein mapping method
Miyazawa energies were proposed to determine the energies between residues in protein sequences [36]. In this method, the contact energies of protein sequences are obtained by regression coefficients. In Miyazawa energies, the energy values are divided into two terms: secondary structure energies and tertiary structure energies. Tertiary structure energies are obtained by summing the residue-residue contact energies of the proteins, while secondary structure energies are calculated from the interactions between the atomic chains and between the atomic chain and the side chain. The Miyazawa energy values of the amino acids are given in Table 10.
Table 10.
Miyazawa energy values of amino acid codes.
| Amino Acid Code | Miyazawa Energy Value | Amino Acid Code | Miyazawa Energy Value |
|---|---|---|---|
| A | −0.02 | L | −0.32 |
| R | 0.08 | K | 0.30 |
| N | 0.10 | M | −0.25 |
| D | 0.19 | F | −0.33 |
| C | −0.32 | P | 0.11 |
| Q | 0.21 | S | 0.11 |
| E | 0.15 | T | 0.05 |
| G | −0.02 | W | −0.27 |
| H | −0.02 | Y | −0.23 |
| I | −0.28 | V | −0.23 |
A protein sequence P (S) = [A R N D C Q …] is mapped as C (S) = [-0.02 0.08 0.10 0.19 -0.32 0.21 …] by the Miyazawa energies protein mapping method.
3.2.11. Micheletti potentials protein mapping method
Micheletti potentials are based on the potential energy in interactions between proteins [37]. The main idea of Micheletti potentials is to identify optimal protein interactions. The Micheletti potential values of the amino acids are given in Table 11.
Table 11.
Micheletti potential values of amino acid codes.
| Amino Acid Code | Micheletti Potential Value | Amino Acid Code | Micheletti Potential Value |
|---|---|---|---|
| A | −0.001461 | L | −0.000782 |
| R | 0.009875 | K | 0.005109 |
| N | −0.001962 | M | 0.031655 |
| D | −0.000531 | F | −0.013128 |
| C | −0.002544 | P | −0.003621 |
| Q | 0.006456 | S | −0.000802 |
| E | 0.008438 | T | 0.003269 |
| G | 0.000990 | W | 0.131813 |
| H | 0.001314 | Y | −0.007699 |
| I | 0.006801 | V | 0.001445 |
A protein sequence P (S) = [A R N D C Q …] is mapped as C (S) = [-0.001461 0.009875 -0.001962 -0.000531 -0.002544 0.006456 …] by the Micheletti potentials protein mapping method.
3.2.12. FIBHASH protein mapping method
In the FIBHASH method, the researchers proposed a hybrid model using Fibonacci numbers and hash tables to map protein sequences [38]. In this method, the amino acid codes are first sorted alphabetically, and Fibonacci numbers are assigned to each amino acid in order. The values are then inserted into a hash table, and the final values of the amino acids are obtained. The FIBHASH values of the amino acids are given in Table 12.
Table 12.
FIBHASH values of amino acid codes.
| Amino Acid Code | FIBHASH Value | Amino Acid Code | FIBHASH Value |
|---|---|---|---|
| A | 1 | M | 9 |
| C | 2 | N | 7 |
| D | 3 | P | 16 |
| E | 4 | Q | 17 |
| F | 5 | R | 10 |
| G | 8 | S | 11 |
| H | 13 | T | 18 |
| I | 6 | V | 12 |
| K | 14 | W | 19 |
| L | 15 | Y | 20 |
A protein sequence P (S) = [A C D E F G …] is mapped as C (S) = [1 2 3 4 5 8 …] by the FIBHASH protein mapping method.
3.2.13. Entropy-based protein mapping method
In the entropy-based protein mapping method, proteins are mapped based on the Shannon entropy calculation [39]. Unlike most other methods, it does not have a static structure: the numerical expressions of the proteins vary for each sequence. The main reason for this is that the mapping depends on the length of the sequence and the repetition frequency of the amino acids within it. A protein sequence P (S) = [M A K A T G R N N L V S P K K …] is mapped as C (S) = [0.07 0.09 0.12 0.09 0.07 0.07 0.07 0.09 0.09 0.07 0.07 0.07 0.07 0.12 0.12 …] by the entropy-based protein mapping method.
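To illustrate the general idea, the sketch below assigns each residue a Shannon-entropy term derived from its relative frequency in the sequence, so the same amino acid can take different values in different sequences. The exact normalization used in [39] is not reproduced here, so the numbers produced are illustrative and need not match the example values above.

```python
import math
from collections import Counter

# Illustrative sketch of a Shannon-entropy-style mapping; the exact
# formulation of the entropy-based method is described in [39].
def entropy_map(sequence: str) -> list[float]:
    """Assign each residue the entropy term -p*log2(p) of its frequency."""
    counts = Counter(sequence)
    length = len(sequence)
    values = []
    for aa in sequence:
        p = counts[aa] / length          # relative frequency in this sequence
        values.append(round(-p * math.log2(p), 4))
    return values

print(entropy_map("MAKATGRNNLVSPKK")[:5])
```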
3.2.14. AVL-tree protein mapping method
In the AVL-tree method, the mapping of proteins is performed based on the AVL tree [40]. In this method, the amino acid codes are first sorted alphabetically and inserted into the AVL tree in order. After the necessary insertion and balancing operations are completed, the depth values of the amino acid codes in the AVL tree are obtained and used for mapping. The amino acid values of the protein mapping method developed with the AVL-tree are given in Table 13.
Table 13.
AVL-tree values of amino acid codes.
| Amino Acid Code | AVL-tree Value | Amino Acid Code | AVL-tree Value |
|---|---|---|---|
| A | 4 | L | 3 |
| R | 3 | K | 2 |
| N | 0 | M | 4 |
| D | 4 | F | 4 |
| C | 3 | P | 3 |
| Q | 2 | S | 1 |
| E | 2 | T | 3 |
| G | 3 | W | 2 |
| H | 1 | Y | 3 |
| I | 3 | V | 4 |
A protein sequence P (S) = [A R N D C Q …] is mapped as C (S) = [4 3 0 4 3 2 …] by the AVL-tree protein mapping method.
3.2.15. Z-scale protein mapping method
The Z-scale protein mapping method includes the physicochemical properties of amino acids and is based on NMR (Nuclear Magnetic Resonance) and TLC (Thin Layer Chromatography) data. There are five parameters in total in this method: lipophilicity, polarizability, polarity, electronegativity, and electrophilicity. In this method, the numerical expressions of the protein sequences vary from sequence to sequence. A protein sequence P (S) = [M A K A T G R N N L V S P K K …] is mapped as C (S) = [0.72 -0.50 -0.53 0.09 0.55] by the Z-scale protein mapping method.
3.3. Designing a deep learning model
Deep learning is a type of machine learning that is used effectively today. One of the biggest reasons for its popularity is that data can now be obtained quickly and easily and that the hardware needed to analyze these data is available [41]. Its success on large and complex data sets is another factor. The biggest advantage of deep learning over classical machine learning is feature extraction: while key features are extracted manually in machine learning-based approaches, deep learning learns them adaptively [42]. The large amount of data or the complexity of the data set makes manual feature extraction difficult and time-consuming [16]. Such advantages have enabled deep learning to be used in almost every field, including biomedicine [43,44], bioinformatics [45], object recognition [46], robotics [47], and energy [48]. In this study, recurrent neural networks, one of the deep learning model families, were used and a BiLSTM (Bidirectional Long Short-Term Memory) network was designed. LSTM (Long Short-Term Memory) is effective on sequential structures such as time series and text [49]; in an LSTM, information from the past is used. BiLSTM, on the other hand, is based on training two LSTMs at the same time: one takes the input and processes it forward, while the other processes the information backwards. Thanks to this structure, the network retains not only information from the past but also information about the future. In the BiLSTM architecture, forward and backward computations are run simultaneously, and the output is obtained by combining the information from both directions. The use of information in two directions therefore provides an advantage in processing sequential data and time series. The parameters of the BiLSTM network designed in this study can be summarized as follows:
• Protein sequences mapped by each protein mapping method were used in the input layer.
• BiLSTM was used in the second layer with 64 units. ReLU (Rectified Linear Unit) was used as the activation function.
• In the third layer, BiLSTM was used again with 32 units. ReLU was again used as the activation function.
• The outputs were then flattened, and the data was made 1-dimensional.
• To ensure that all data are in the same range, batch normalization was applied.
• Three fully connected layers were then used with 240, 120, and 60 neurons, respectively.
• In the last layer, classification was performed with 2 neurons to determine whether there was an interaction. Softmax was used as the activation function.
• Categorical cross-entropy was used as the loss function of the model.
• SGD (Stochastic Gradient Descent) was used for the optimization process.
• The training was carried out for 500 epochs.
• The performances of the protein mapping methods were determined by 10-fold cross-validation.
• All parameters used in the study were determined by a trial-and-error approach, and the parameters that gave the best results were used; an illustrative sketch of the resulting architecture is given below.
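The paper does not state which software framework was used; as a hedged illustration, the sketch below expresses the listed parameters with a Keras-style API. The input shape, the activations of the fully connected layers, and all names are assumptions, not the authors' original implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_bilstm(seq_len: int, feat_dim: int) -> tf.keras.Model:
    """Sketch of the BiLSTM classifier summarized above (framework assumed to be Keras)."""
    model = models.Sequential([
        tf.keras.Input(shape=(seq_len, feat_dim)),           # mapped protein sequences
        layers.Bidirectional(layers.LSTM(64, activation="relu",
                                         return_sequences=True)),
        layers.Bidirectional(layers.LSTM(32, activation="relu",
                                         return_sequences=True)),
        layers.Flatten(),
        layers.BatchNormalization(),
        layers.Dense(240, activation="relu"),                 # dense activations assumed
        layers.Dense(120, activation="relu"),
        layers.Dense(60, activation="relu"),
        layers.Dense(2, activation="softmax"),                # interaction / no interaction
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example usage (shapes are placeholders):
# model = build_bilstm(seq_len=1000, feat_dim=1)
# model.fit(X_train, y_train, epochs=500)   # evaluated with 10-fold cross-validation
```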
The flow chart of the study is given in Fig. 2.
Fig. 2.
Flow chart of the study.
4. Application results
In the study, the performances of the protein mapping methods were evaluated and compared after the classification process. Accuracy, F1-score, specificity, sensitivity, and AUC values were calculated to compare the performances of the protein mapping methods. The average classification results are given in Table 14.
Table 14.
Performances of protein mapping methods (Results are average values of 10 fold cross-validation).
| Protein Mapping Methods | Accuracy | Specificity | Sensitivity | F1-score | AUC |
|---|---|---|---|---|---|
| AESNN1 | 88.94% | 88.41% | 87.62% | 88.01% | 0.81 |
| Atchley factors | 82.58% | 76.77% | 78.79% | 77.76% | 0.83 |
| AVL-tree | 90.44% | 86.85% | 89.86% | 88.33% | 0.91 |
| Binary encoding | 86.20% | 83.84% | 84.86% | 84.34% | 0.88 |
| BLOSUM62 | 85.54% | 85.83% | 83.80% | 84.80% | 0.83 |
| CPNR | 88.52% | 84.84% | 86.84% | 85.82% | 0.87 |
| EIIP | 78.89% | 79.75% | 79.74% | 79.74% | 0.82 |
| Entropy-based | 94.74% | 88.91% | 90.90% | 89.89% | 0.95 |
| FIBHASH | 89.33% | 86.88% | 88.86% | 87.85% | 0.88 |
| Hydrophobicity | 79.65% | 76.77% | 73.74% | 75.22% | 0.77 |
| Meiler parameters | 81.10% | 82.78% | 83.83% | 83.30% | 0.82 |
| Micheletti potentials | 75.39% | 72.72% | 74.75% | 73.72% | 0.79 |
| Miyazawa energies | 77.31% | 73.70% | 75.70% | 74.68% | 0.81 |
| PAM250 | 87.63% | 85.83% | 85.84% | 85.83% | 0.87 |
| Z-scale | 91.23% | 88.90% | 88.88% | 88.85% | 0.96 |
When the results given in Table 14 were examined, it was observed that the most effective result was obtained with the entropy-based method, with an accuracy of 94.74%. The second most effective method was the Z-scale method, whose accuracy score was calculated as 91.23%. The AVL-tree method also showed a success rate of over 90%, with an accuracy score of 90.44%. Other than these methods, no protein mapping method showed a success rate of over 90%. While 89.33% accuracy was obtained with the FIBHASH protein mapping method, 88.94% accuracy was calculated with the AESNN1 method. The CPNR protein mapping method showed the closest performance to these methods, with an average accuracy of 88.52%. In addition, accuracy values of 87.63%, 86.20%, 85.54%, 82.58%, and 81.10% were obtained with the PAM250, binary encoding, BLOSUM62, Atchley factors, and Meiler parameters methods, respectively. The remaining protein mapping methods showed accuracy below 80%. The EIIP and hydrophobicity methods produced close scores, with accuracies of 78.89% and 79.65%, respectively. While an accuracy score of 75.39% was obtained with Micheletti potentials, this rate rose to 77.31% with Miyazawa energies. The AUC score is a very important evaluation criterion in bioinformatics and medical studies [[50], [51], [52]]. A classifier is considered a reasonable classifier if its AUC score is greater than 0.7 [53]. In addition, values between 0.8 and 0.9 indicate a great classifier, while a value greater than 0.9 indicates that the classifier has excellent performance [54]. When the comparison is made according to the AUC scores, only the AVL-tree, entropy-based, and Z-scale methods showed a value above 0.90; therefore, these methods performed an excellent classification. Apart from these methods, almost every method performed a great classification. However, since the AUC scores of the hydrophobicity and Micheletti potentials methods were below 0.80, these methods can only be considered reasonable classifiers. Fig. 3 shows the ROC plots of the protein mapping methods with AUC scores above 0.9.
Fig. 3.
ROC plots of protein mapping methods.
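As a hedged illustration of how the criteria reported in Table 14 can be computed from model predictions, the snippet below uses scikit-learn; the paper does not specify which tooling was used, and the variable names are placeholders.

```python
from sklearn.metrics import (accuracy_score, f1_score, recall_score,
                             roc_auc_score, confusion_matrix)

# Illustrative computation of the evaluation criteria used in Table 14.
# y_true: true labels (1 = interaction, 0 = no interaction)
# y_pred: predicted labels; y_score: predicted probability of the positive class
def evaluate(y_true, y_pred, y_score) -> dict:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),   # TP / (TP + FN)
        "specificity": tn / (tn + fp),                 # TN / (TN + FP)
        "f1_score":    f1_score(y_true, y_pred),
        "auc":         roc_auc_score(y_true, y_score),
    }

print(evaluate([0, 1, 1, 0], [0, 1, 0, 0], [0.2, 0.9, 0.4, 0.1]))
```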
5. Discussion
In this study, viral-host interactions between the SARS-CoV-2 virus and human proteins were analyzed by various protein mapping methods. In this section, the performances of these methods are compared, and the reasons why the results differ according to the protein mapping method are discussed. Micheletti potentials and Miyazawa energies are structure-based methods. The reason why these two methods were the least effective may be that structural information about the protein sequences was not used in this study; for this reason, key features may have been lost. Obtaining structure-based features would likely increase the performance of these methods. Looking at the results in Table 14, it is seen that physicochemical-based methods are also generally ineffective. Although Atchley factors, hydrophobicity, Meiler parameters, and EIIP in this category performed better than the structure-based methods, they were generally ineffective. The main reason for this is that the mapping process requires specific information, just as in the structure-based methods. Physicochemical-based methods need chemical information about the proteins; since no chemical information was analyzed in this study, information may have been lost. When the performances of the evolution-based methods (PAM250 and BLOSUM62) are examined, it is seen that the protein mapping methods in this category are more successful than the structure-based and physicochemical-based methods. Evolution-based methods are often used in alignment studies [55,56]; performing an alignment step before classification could significantly improve the performance of these methods. Character-based methods seem to be more successful: the CPNR, binary encoding, and FIBHASH methods all achieved an accuracy score of over 85%. The reason behind this is that no specific information is needed while mapping with these methods, so there may be no loss of information. The second most effective method was the Z-scale method. Obtaining SAR (Structure-Activity Relationship) features with this method was instrumental in its success. The Z-scale method is also a physicochemical-based method [57]; however, it was more effective than the other physicochemical-based methods (Atchley factors, hydrophobicity, Meiler parameters, and EIIP). The most effective method was the entropy-based method. The reason why these two methods are successful is that they have a dynamic structure. In all other methods, protein sequences have a static structure, in which the amino acids always take the same numerical values. However, since the entropy-based and Z-scale methods follow an adaptive approach, protein sequences can take different values. In the entropy-based method, the numerical values of the amino acid codes vary according to the length of the protein sequence and the repetition frequency of the amino acids in the sequence. The Z-scale method has five parameters, and its calculation varies according to the protein sequence and the order of the amino acid codes in the sequence. For these reasons, these two methods are adaptive methods. Additionally, in some protein mapping methods, different amino acids take on the same value. For example, in the AESNN1 protein mapping method, the amino acids A and P take the same values. Similarly, the amino acids L and I take the same values in the EIIP protein mapping method.
In these and similar methods, degeneracy occurs when different amino acids take the same numerical values [58]. This reduces the effectiveness of the methods.
6. Conclusion
In this study, viral-host interactions between SARS-CoV-2 and human proteins were predicted by computational approaches. The study consisted of four stages. In the first stage, viral-host interactions between the SARS-CoV-2 virus and human proteins were obtained and a positive data set was created. In addition, non-interacting protein sequences were obtained and a negative data set was created. In the second stage, the protein sequences in both the negative and positive data sets were mapped with various protein mapping methods and made ready for classification by the deep learning algorithm. In the third stage, a deep learning model based on the BiLSTM algorithm was designed. In the last stage, the performances of the protein mapping methods were determined by accuracy, F1-score, and AUC scores, and their performances were compared. The most ineffective protein mapping method was Micheletti potentials, with an accuracy of 75.39%. The most effective classification was performed with the entropy-based protein mapping method, which achieved an accuracy of 94.74%. When the results were examined in general, it was determined that the most ineffective methods were the structure-based methods. Among the other physicochemical-based methods (Atchley factors, hydrophobicity, Meiler parameters, and EIIP), the most effective was Atchley factors, with an accuracy of 82.58% and an AUC score of 0.83. The evolution-based methods were more effective; among them, the PAM250 matrix performed a successful classification with an accuracy of 87.63% and an AUC score of 0.87. The AESNN1 method, which is a machine learning-based method, reached an accuracy of 88.94% and an AUC score of 0.81, while FIBHASH, one of the character-based protein mapping methods, obtained an accuracy of 89.33% and an AUC of 0.88. The AVL-tree method, an algorithm-based method, achieved an accuracy of 90.44% and an AUC score of 0.91, and the Z-scale method was the second most effective method overall, with 91.23% accuracy and a 0.96 AUC score. However, this study has some limitations. Although the application results show that computational approaches can be used effectively in this field, there are not many studies on this topic in the literature. Therefore, the study needs to be supported by various other approaches (other mapping methods, alignment algorithms, etc.). Furthermore, identifying new viral-host interactions between SARS-CoV-2 and human proteins will increase the amount of data available, so the results obtained in this study can be further improved. A larger amount of data will enable the deep learning model to learn better, and the results will be more reliable. In addition, the results can be evaluated with different deep learning and machine learning algorithms; to preserve the diversity and performance of the study, it should also be evaluated with other deep learning algorithms. Apart from the protein mapping methods used in this study, the use of different protein mapping approaches is also important for the accuracy and reliability of the study.
When the results were examined in general, it was observed that all protein mapping methods performed a successful classification with the designed deep learning model. The use of computational approaches in this field will therefore help both to better understand the structure of the virus and to determine the results of its interactions with human proteins more quickly.
Authorship contributions
Conception and Design of the Study: T. B. Alakus.
Acquisition of Data: T. B. Alakus.
Analysis and/or Interpretation of Data: T.B. Alakus, I. Turkoglu.
Drafting the Manuscript: T.B. Alakus.
Revising the Manuscript: T.B. Alakus, I. Turkoglu.
Approval of the Version of the Manuscript to be Published: T.B. Alakus, I. Turkoglu.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
I want to thank my wife, who has been with me in every moment of my life, and I want to make her a proposal: Dilan, will you marry me?
References
- 1.Toraman S., Alakus T.B., Turkoglu I. Convolutional capsnet: a novel artificial neural network approach to detect COVID-19 disease from X-ray images using capsule networks. Chaos, Solit. Fractals. 2020;140 doi: 10.1016/j.chaos.2020.110122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Gandhi R.T., Lynch J.B., del Rio C. Mild or moderate covid-19. N. Engl. J. Med. 2020;383:1757–1766. doi: 10.1056/NEJMcp2009249. [DOI] [PubMed] [Google Scholar]
- 3.Ege I. The impact of coronavirus disaese (COVID-19) pandemic on cruise industry: case of diamond princess cruise ship. Mersin Univ. J. Marit. Fac. 2020;2(1):32–37. [Google Scholar]
- 4.Alakus T.B., Turkoglu I. Comparison of deep learning approaches to predict COVID-19 infection. Chaos, Solit. Fractals. 2020;140 doi: 10.1016/j.chaos.2020.110120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Jiang X., Coffee M., Bari A., Wang J., Jiang X., et al. Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity. Comput. Mater. Continua (CMC) 2020;63(1):537–551. doi: 10.32604/cmc.2020.010691. [DOI] [Google Scholar]
- 6.Vasireddy D., Vanaparthy R., Mohan G., Malayala S.V., Atlurie P. Review of COVID-19 variants and COVID-19 vaccine efficacy: what the clinician should know? J. Clin. Med. Res. 2021;13(6):317–325. doi: 10.14740/jocmr4518. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Khan A., Waris H., Rafique M., Suleman M., Mohammed A., et al. The Omicron (B.1.1.529) variant of SARS-CoV-2 binds to the hACE2 receptor more strongly and escapes the antibody response: insights from structural and simulation data. Int. J. Biol. Macromol. 2022;200:438–448. doi: 10.1016/j.ijbiomac.2022.01.059. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Khan A., Mohammad A., Haq I., Nasar M., Ahmad W., et al. Structural-dynamics and binding analysis of RBD from SARS-CoV-2 variants of concern (VOCs) and GRP78 receptor revealed basis for higher infectivity. Microorganisms. 2021;9(11) doi: 10.3390/microorganisms9112331. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Sanders J.M., Monogue M.L., Jodlowski T.Z., Cutrell J.B. Pharmacologic treatments for coronavirus disease 2019 (COVID-19) J. Am. Med. Assoc. 2020;323(18):1824–1836. doi: 10.1001/jama.2020.6019. [DOI] [PubMed] [Google Scholar]
- 10.Kaplan R.M., Milstein A. Influence of a COVID-19 vaccine's effectiveness and safety profile on vaccination acceptance. Proc. Natl. Acad. Sci. U.S.A. 2021;118(10) doi: 10.1073/pnas.2021726118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Gordon D.E., Jang G.M., Bouhaddou M., Xu J., Obernier K., et al. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature. 2020;583:459–468. doi: 10.1038/s41586-020-2286-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Florindo H.F., Kleiner R., Vaskovich-Koubi D., Acurcio R.C., Carreira B., et al. Immune-mediated approaches against COVID-19. Nat. Nanotechnol. 2020;15:630–645. doi: 10.1038/s41565-020-0732-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Dey L., Chakraborty S., Mukhopadhyay A. Machine learning techniques for sequence-based prediction of viral–host interactions between SARS-CoV-2 and human proteins. Biomed. J. 2020;43(5):438–450. doi: 10.1016/j.bj.2020.08.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Gillen J., Nita-Lazar A. Experimental analysis of viral–host interactions. Front. Physiol. 2019;10 doi: 10.3389/fphys.2019.00425. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Fätkenheuer G., Pozniak A.L., Johnson M.A., Plettenberg A., Staszewski S., et al. Efficacy of short-term monotherapy with maraviroc, a new CCR5 antagonist, in patients infected with HIV-1. Nat. Med. 2005;11(11):1170–1172. doi: 10.1038/nm1319. [DOI] [PubMed] [Google Scholar]
- 16.Alakus T.B., Turkoglu I. International Symposium on Multidisciplinary Studies and Innovative Technologies. 2019. Prediction of protein-protein interactions with LSTM deep learning model. 11-13 October, Ankara, Turkey. [DOI] [Google Scholar]
- 17.Lei W., Wang H.F., Liu S.R., Yan X., Song K.J. Predicting protein-protein interactions from matrix-based protein sequence using convolution neural network and feature-selective to rotation forest. Sci. Rep. 2019;9 doi: 10.1038/s41598-019-46369-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Chen K.H., Wang T.F., Hu Y.J. Protein-protein interaction prediction using a hybrid feature representation and a stacked generalization scheme. BMC Bioinf. 2019;20 doi: 10.1186/s12859-019-2907-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Sarkar D., Saha S. Machine-learning techniques for the prediction of protein–protein interactions. J. Biosci. 2019;44 doi: 10.1007/s12038-019-9909-z. [DOI] [PubMed] [Google Scholar]
- 20.Messina F., Giombini E., Agrati C., Vairo F., Bartoli T.A., et al. COVID-19: viral–host interactome analyzed by network based-approach model to study pathogenesis of SARS-CoV-2 infection. J. Transl. Med. 2020;18 doi: 10.1186/s12967-020-02405-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Llabres M., Valiente G. Alignment of virus-host protein-protein interaction networks by integer linear programming: SARS-CoV-2. PLoS One. 2020 doi: 10.1371/journal.pone.0236304. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Liu-Wei W., Kafkas Ş., Chen J., Dimonaco N.J., Tegner J., et al. DeepViral: prediction of novel virus–host interactions from protein sequences and infectious disease phenotypes. Bioinformatics. 2021;37(17):2722–2729. doi: 10.1093/bioinformatics/btab147. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Du H., Chen F., Liu H., Hong P. Network-based virus-host interaction prediction with application to SARS-CoV-2. Patterns. 2021;2(5) doi: 10.1016/j.patter.2021.100242. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Gordon D.E., Jang G.M., Bouhaddou M., Xu J., Obernier K., et al. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature. 2020;583:459–468. doi: 10.1038/s41586-020-2286-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Barman R.K., Saha S., Das S. Prediction of interactions between viral and host proteins using supervised machine learning methods. PLoS One. 2014 doi: 10.1371/journal.pone.0112034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Ben-Hur A., Noble W.S. Choosing negative examples for the prediction of protein-protein interactions. BMC Bioinf. 2006;7 doi: 10.1186/1471-2105-7-S1-S2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Sun T., Zhou B., Lai L., Pei J. Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinf. 2017;18 doi: 10.1186/s12859-017-1700-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Lin K., May A.C.W., Taylor W.R. Amino acid encoding schemes from protein structure alignments: multi-dimensional vectors to describe residue types. J. Theor. Biol. 2002;216(3):361–365. doi: 10.1006/jtbi.2001.2512. [DOI] [PubMed] [Google Scholar]
- 29.Atchley W.R., Zhao J., Fernandes A.D., Drüke T. Solving the protein sequence metric problem. Proc. Natl. Acad. Sci. U.S.A. 2005;102(18):6395–6400. doi: 10.1073/pnas.0408677102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Kyte J., Doolittle R.F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 1982;157(1):105–132. doi: 10.1016/0022-2836(82)90515-0. [DOI] [PubMed] [Google Scholar]
- 31.Meiler J., Müller M., Zeidler A., Schmaschke F. Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. Mol. Model. Ann. 2001;7:360–369. doi: 10.1007/s008940100038. [DOI] [Google Scholar]
- 32.Veljkovic N., Glisic S., Prljic J., Perovic V., Botta M., Veljkovic V. Discovery of new therapeutic targets by the informational spectrum method. Curr. Protein Pept. Sci. 2008;9(5):493–506. doi: 10.2174/138920308785915245. [DOI] [PubMed] [Google Scholar]
- 33.Chen D., Wang J., Yan M.N., Bao F.S. A complex prime numerical representation of amino acids for protein function comparison. J. Comput. Biol. 2016;23(8):669–677. doi: 10.1089/cmb.2015.0178. [DOI] [PubMed] [Google Scholar]
- 34.Dayhoff M.O., Schwartz R.M. Chapter 22: a model of evolutionary change in proteins. Atlas Protein Seq. Struct. 1978;5:345–352. [Google Scholar]
- 35.Henikoff S., Henikoff J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 1992;89(22):10915–10919. doi: 10.1073/pnas.89.22.10915. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Miyazawa S., Jernigan R.L. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules. 1985;18(3):534–552. doi: 10.1021/ma00145a039. [DOI] [Google Scholar]
- 37.Micheletti C., Seno F., Banavar J.R., Maritan A. Learning effective amino acid interactions through iterative stochastic techniques. Protein Struct. Funct. Genet. 2001;42:422–431. doi: 10.1002/1097-0134(20010215)42:3<422::aid-prot120>3.0.co;2-2. [DOI] [PubMed] [Google Scholar]
- 38.Alakus T.B., Turkoglu I. A novel Fibonacci hash method for protein family identification by using recurrent neural networks. Turk. J. Electr. Eng. Comput. Sci. 2021;29:370–386. doi: 10.3906/elk-2003-116. [DOI] [Google Scholar]
- 39.Alakus T.B., Turkoglu I. A novel entropy-based mapping method for determining the protein-protein interactions in viral genomes by using coevolution analysis. Biomed. Signal Process Control. 2021;65 doi: 10.1016/j.bspc.2020.102359. [DOI] [Google Scholar]
- 40.Alakus T.B., Turkoglu I. A novel protein mapping method for predicting the protein interactions in COVID-19 disease by deep learning. Interdiscipl. Sci. Comput. Life Sci. 2021;13:44–60. doi: 10.1007/s12539-020-00405-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Goodfellow I., Bengio Y., Courville A. MIT Press; 2016. Deep Learning. [Google Scholar]
- 42.Santur Y. Improving sentiment classification performance using deep learning and undersampling approaches. Firat Univ. J. Eng. Sci. 2020;32(2):561–570. doi: 10.35234/fumbd.759131. [DOI] [Google Scholar]
- 43.Baldi P. Deep learning in biomedical data science. Ann. Rev. Biomed. Data Sci. 2018;1:181–205. doi: 10.1146/annurev-biodatasci-080917-013343. [Google Scholar]
- 44.Zemouri R., Zerhouni N., Racoceanu D. Deep learning in the biomedical applications: recent and future status. Appl. Sci. 2019;9(8) doi: 10.3390/app9081526. [DOI] [Google Scholar]
- 45.Min S., Lee B., Yoon S. Deep learning in bioinformatics. Briefings Bioinf. 2017;18(5):851–869. doi: 10.1093/bib/bbw068. [DOI] [PubMed] [Google Scholar]
- 46.Zhao Z.Q., Zheng P., Xu S.T., Wu X. Object detection with deep learning: a review. IEEE Transact. Neural Networks Learn. Syst. 2019;30(11):3212–3232. doi: 10.1109/TNNLS.2018.2876865. [DOI] [PubMed] [Google Scholar]
- 47.Chen A.I., Balter M.L., Maguire T.J., Yarmush M.L. Deep learning robotic guidance for autonomous vascular access. Nat. Mach. Intell. 2020;2:104–115. doi: 10.1038/s42256-020-0148-7. [DOI] [Google Scholar]
- 48.Baldi P., Sadowski P., Whiteson D. Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 2014;5 doi: 10.1038/ncomms5308. [DOI] [PubMed] [Google Scholar]
- 49.Ozlem A. USD/TRY price prediction using LSTM architecture. Eur. J. Sci. Technol. 2020:452–456. doi: 10.31590/ejosat.araconf59. [DOI] [Google Scholar]
- 50.Kamarudin A.N., Cox T., Kolamunnage-Dona R. Time-dependent ROC curve analysis in medical research: current methods and applications. BMC Med. Res. Methodol. 2017;17(1) doi: 10.1186/s12874-017-0332-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Safari S., Baratloo A., Elfil M., Negida A. Evidence based emergency medicine; Part 5 receiver operating curve and area under the curve. Emergency. 2016;4(2):111–113. [PMC free article] [PubMed] [Google Scholar]
- 52.Zhao X.G., Dai W., Tian L. AUC-based biomarker ensemble with an application on gene scores predicting low bone mineral density. Bioinformatics. 2011;27(21):3050–3055. doi: 10.1093/bioinformatics/btr516. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Wigton R.S., Connor J.L., Centor R.M. Transportability of a decision rule for the diagnosis of streptococcal pharyngitis. Arch. Intern. Med. 1986;146(1):81–83. [PubMed] [Google Scholar]
- 54.Mandrekar J.N. Receiver operating characteristic curve in diagnostic test assessment. J. Thorac. Oncol. 2010;5(9):1315–1316. doi: 10.1097/JTO.0b013e3181ec173d. [DOI] [PubMed] [Google Scholar]
- 55.Altschul S.F. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 1991;219(3) doi: 10.1016/0022-2836(91)90193-A. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Striemer G.M., Akoglu A. Proceedings of the International Symposium on Parallel and Distributed Processing. 2009. Sequence alignment with GPU: performance and design challenges. 23-29 May, Rome, Italy. [DOI] [Google Scholar]
- 57.Sandberg M., Eriksson L., Jonsson J., Sjöström M., Wold S. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J. Med. Chem. 1998;41(14):2481–2491. doi: 10.1021/jm9700575. [DOI] [PubMed] [Google Scholar]
- 58.Yau S.S.T., Wang J., Niknejad A., Lu C., Jin N., Ho Y.K. DNA sequence representation without degeneracy. Nucleic Acids Res. 2003;31(12):3078–3080. doi: 10.1093/nar/gkg432. [DOI] [PMC free article] [PubMed] [Google Scholar]



