Abstract
COVID-19 causes serious respiratory illness; therefore, accurate identification of the viral infection cycle plays a key role in designing appropriate vaccines. The risk of this disease depends on proteins that interact with human receptors. In this paper, we formulate a novel model for COVID-19 named “amino acid encoding based prediction” (AAPred). This model is accurate, classifies the various coronavirus types, and distinguishes SARS-CoV-2 from other coronaviruses. With the AAPred model, we reduce the number of features to enhance its performance by selecting the most important ones employing statistical criteria. The protein sequence of SARS-CoV-2 is analyzed to understand the viral infection cycle. Six machine learning classifiers (decision trees, k-nearest neighbors, random forest, support vector machine, bagging ensemble, and gradient boosting) are used to evaluate the model in terms of accuracy, precision, sensitivity, and specificity. We implement the obtained results computationally and apply them to real data from the National Genomics Data Center. The experimental results report that the AAPred model reduces the number of features to seven. The average accuracy of the 10-fold cross-validation is 98.69%, precision is 98.72%, sensitivity is 96.81%, and specificity is 97.72%. The features are selected utilizing information gain and classified with random forest. The proposed model predicts the type of coronavirus and reduces the number of extracted features. We identify that SARS-CoV-2 has physicochemical characteristics similar to those of some regions of SARS-CoV. Also, we report that SARS-CoV-2 has an infection cycle and sequence similar to those of some regions of SARS-CoV, indicating that existing vaccines may also affect SARS-CoV-2. A comparison with deep learning shows results similar to those of our method.
Keywords: Amino acid composition, ANOVA, Artificial intelligence, Bagging ensemble and gradient boosting, Chi-square test, Deep learning, Feature extraction and selection, Information gain, LASSO, Molecular modeling, Protein sequence, SARS-CoV-2
1. Introduction, bibliographical review, notations, and objectives
1.1. Introduction and bibliographical review
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), discovered in 2019, is a potentially serious pathogen that infects humans [1]. This virus causes a pulmonary pathology called coronavirus disease 2019 (COVID-19). SARS-CoV-2 has infected an enormous number of people, provoking thousands of deaths all around the world [[2], [3], [4]]. In March 2020, the World Health Organization (WHO) declared COVID-19 a global pandemic [5].
SARS-CoV-2 is classified within the Coronavirinae family, which belongs to the Nidovirales order. This family [6,7] has four genera: alpha coronavirus, beta coronavirus, gamma coronavirus, and delta coronavirus. Human beings are infected by the alpha and beta genera. Alpha has two human coronaviruses: HCoV-229E and HCoV-NL63. Beta has five human coronaviruses: HCoV-HKU1, HCoV-OC43, Middle East respiratory syndrome coronavirus (MERS-CoV), SARS-CoV, and SARS-CoV-2.
Coronaviruses are positive single-stranded ribonucleic acid (RNA) viruses. They have uncommonly long and complex genomes in comparison to other RNA viruses. Two overlapping open reading frames (ORF1a and ORF1b) occupy 66% of the coronavirus genome. These reading frames are translated into sixteen non-structural proteins (NSPs), encoded from NSP1 to NSP16. The remaining 34% of the genome is occupied by the coding regions of structural proteins. Coronaviruses have four primary structural proteins [8]: the spike surface glycoprotein, envelope protein, membrane protein, and nucleocapsid protein. The spike proteins are responsible for interacting with the host receptors. In addition, the viral genome contains several auxiliary proteins.
According to phylogenetic studies, SARS-CoV, SARS-CoV-2, and other viruses in the beta coronavirus genus may be grouped into one cluster. Most human coronaviruses, such as SARS-CoV and MERS-CoV, originate in bats and are transmitted to humans via intermediate hosts [9]. The similarity between other coronaviruses and SARS-CoV-2 is recognized in the receptor-binding domain (RBD) of the spike protein, which plays a significant role in the drug design process for this virus and determines the potential hosts of the infection [10,11].
Coronaviruses cross the species barrier and cause very dangerous diseases in humans [12,13]. Public health is challenged due to the novelty of the antigen for the human host. Due to the danger of COVID-19, many methods have been introduced to develop models as well as to diagnose COVID-19 and differentiate it from other viruses.
In [14], eighteen radiological semantic features and seventeen clinical features were selected from 67 features (41 radiological and 26 clinical features). Univariate analysis and the least absolute shrinkage and selection operator (LASSO) were used for feature selection, and the classification performance was evaluated based on accuracy, sensitivity, and specificity. Using clinical features, accuracy, sensitivity, and specificity reached 88.8%, 89.2%, and 88.4%, respectively. Employing radiological features, accuracy was 92.9%, sensitivity was 99%, and specificity was 85.1%. Utilizing both types of features, accuracy, sensitivity, and specificity were 95.9%, 96.1%, and 95.7%, respectively.
In [15], the infection risk of non-human-origin coronaviruses was classified considering the spike protein sequence of coronaviruses. A feature extraction method named amino acid composition (AAC), which counts the frequency of the twenty amino acids in a sequence, was used to reduce the 41 features of the protein sequence to 20. Protein sequences of 2666 coronaviruses were collected from the National Genomics Data Center (NGDC) database (ngdc.cncb.ac.cn, accessed on October 24, 2021) [16]. The predictive model achieved an accuracy between 96.15% and 98.18% utilizing the random forest (RF) classifier and 10-fold cross-validation. The results showed that SARS-CoV-2 is similar to SARS-CoV, meaning both viruses use the same human receptor. To date, the origin of SARS-CoV-2 is unknown and needs further analysis and study.
In [17,18], a diagnostic model based on X-ray images was proposed. In this model, features related to image texture and shape were extracted employing the Zernike-Haralick method. The Kyoto Encyclopedia of Genes and Genomes (KEGG) dataset (www.genome.jp/kegg, accessed on October 24, 2021) was considered to test this model. It contains 464 COVID-19 images, 1490 viral images, 2783 bacterial images, and 1583 healthy images. The proposed model employed the support vector machine (SVM) classifier and achieved an accuracy of 89.78%, precision of 89.85%, sensitivity of 89.79%, and specificity of 99.63%.
Note that the selection of human monoclonal antibodies may identify immunodominant antigenic sites associated with neutralization and provide reagents for stabilizing and solving the structure of viral surface proteins [19]. Understanding the structural basis of SARS-CoV-2 can guide the selection of vaccine targets. Studying the similarity between coronaviruses and analyzing their sequences reveals whether vaccines for other viruses may also affect SARS-CoV-2. COVID-19 has an infection cycle and a sequence similar, in some protein regions, to those of other coronaviruses, which indicates the potential efficacy of existing vaccines against SARS-CoV-2.
Machine learning methods and artificial intelligence [20,21,45] can be used to analyze, predict, and diagnose infection rates. In addition, machine learning helps in the drug discovery process for a vaccine. To the best of our knowledge, there are no studies that accurately differentiate between SARS-CoV-2 and other coronavirus types with a small number of features. Therefore, machine learning models are needed to make this differentiation.
As mentioned in [44], differentiating between coronavirus types is helpful for designing an effective vaccine. Moreover, the mutation pattern of COVID-19 can be partially predicted after further analysis of the differences between the protein sequences of different strains.
1.2. Abbreviations, acronyms, notations, and symbols
In this section, all abbreviations and symbols employed in this paper are defined in Table 1 in alphabetical order. Among them are the standard one-letter abbreviations of the twenty amino acids, as found in [22].
Table 1.
Acronyms and definition of symbols.
| Acronym/Symbol | Meaning | Acronym/Symbol | Meaning |
|---|---|---|---|
| A, G, M, … | The 20 amino acids 1-letter abbreviations [19] | MI | Mutual information |
| AAC | Amino acid composition | MMS between | Mean sum of squares between groups |
| AAPred | Amino acid encoding based prediction | MMS within | Mean sum of squares within groups |
| Acc | Accuracy | NCBI | National Center for Biotechnology Information |
| ANOVA | Analysis of variance | NGDC | National Genomics Data Center |
| BE | Bagging ensemble | NSPs | Non-structural proteins |
| c, d, f | Number of classes, dataset, feature | Num | Number of occurrences of class |
| COVID-19 | Coronavirus disease 2019 | ORF | Open reading frame |
| CSV | Comma-separated values | Pr | Probability of selecting a sample of a class |
| DNA | Deoxyribonucleic acid | PK | Polynomial kernel |
| DT | Decision tree | Prec | Precision |
| Ent | Entropy | RBD | Receptor-binding domain |
| FASTA | Format of text for nucleotide or amino acids | RBF | Radial basis function |
| FN | False-negative | RF | Random forest |
| FP | False-positive | RNA | Ribonucleic acid |
| Freq | Frequency of each amino acid class | SARS-CoV | Severe acute respiratory syndrome coronavirus |
| GB | Gradient boosting | Sens | Sensitivity |
| IG | Information gain | SK | Sigmoid kernel |
| KEGG | Kyoto Encyclopedia of Genes and Genomes | Spec | Specificity |
| KNN | k-nearest neighbor | Spyder | Scientific python development environment |
| LASSO | Least absolute shrinkage and selection operator | SVM | Support vector machine |
| Len | Length of the protein sequence | TN | True-negative |
| LK | Linear kernel | TP | True-positive |
| MD | Manhattan distance | WHO | World Health Organization |
| MERS-CoV | Middle East respiratory syndrome coronavirus | χ2 | Chi-square |
1.3. Objectives and outline
The objectives of this paper are as follows:
- To formulate a novel accurate model for COVID-19 that classifies the various coronavirus types and differentiates SARS-CoV-2 from other coronaviruses.
- To use machine learning techniques for evaluating the model in terms of accuracy, precision, sensitivity, and specificity.
- To reduce the number of features of the novel model to enhance its performance.
- To analyze the protein sequence of SARS-CoV-2 for understanding the viral infection cycle.
- To implement the obtained results computationally.
- To apply the results to real-world data, in our case using the NGDC database.
Specifically, we propose a novel model named amino acid encoding-based prediction (AAPred) to carry out our investigation. The AAPred model classifies and distinguishes between COVID-19 and other coronaviruses, adopting the amino acid encoding method [23] for feature extraction and to improve the model performance. The amino acid encoding method is modified by adding a new amino acid class called “zero”, which contains unknown amino acids. Also, ambiguous amino acids are added to the amino acid classes. In addition, a feature selection phase is applied to remove irrelevant features. The proposed AAPred model employs three different methods [24] for feature selection, which are detailed in Subsection 2.1. This model optimizes the performance when predicting the virus type and utilizes mathematical and statistical methods to reduce the number of features.
The paper is organized as follows. Section 2 introduces the methodology and the AAPred model. Section 3 presents the experimental results employing the NGDC database. Finally, Section 4 provides a discussion as well as the conclusions, limitations, and ideas for future research.
2. Methodology
2.1. The proposed AAPred model
The AAPred model uses selection algorithms to extract features and then reduces the number of them eliminating those that are irrelevant. Three different methods of feature selection are utilized:
- Information gain (IG) [25],
- Analysis of variance (ANOVA) test [26], and
- Chi-square (χ2) statistic test [27].
The number of extracted features is reduced by removing irrelevant features. Specifically, protein sequence analysis requires feature extraction to convert sequence characters to a numerical form employing amino acid encoding. Lastly, the classification is performed predicting the coronavirus type using six different machine learning algorithms:
- Bagging ensemble (BE),
- Decision trees (DT),
- Gradient boosting (GB),
- k-nearest neighbors (KNN),
- RF, and
- SVM.
In summary, as shown in Fig. 1, the AAPred model consists of three phases:
- Feature extraction,
- Feature reduction, and
- Classification.
Fig. 1.
Phases of the AAPred model and their connection.
2.2. Feature extraction phase
In the feature extraction phase, the amino acid encoding method is used to obtain features from the viral protein sequences. This method employs two physicochemical properties of the amino acids: volume and dipole [30]. The volume and dipole values are calculated utilizing molecular modeling and density-functional methods [31]. Based on these values, the twenty amino acids are divided into seven classes, as shown in Table 2.
Table 2.
Amino acids classification based on their side chain volumes and dipole values. Source: Taken from Ref. [32] which is licensed under a CC BY 4.0 License (creativecommons.org, accessed on October 24, 2021).
| Class number | Dipole scale | Volume scale | Amino acids |
|---|---|---|---|
| 1 | – | – | A, G, V |
| 2 | – | + | I, L, F, P |
| 3 | + | + | Y, M, T, S |
| 4 | ++ | + | H, N, Q, W |
| 5 | +++ | + | R, K |
| 6 | +'+'+' | + | D, E |
| 7 | + | + | C |
According to Table 2, the dipole scale is defined as follows:
- “-”: dipole value is less than 1.0,
- “+”: dipole value is between 1.0 and 2.0,
- “++”: dipole value is between 2.0 and 3.0,
- “+++”: dipole value is greater than or equal to 3.0, and
- “+'+'+'”: dipole value is greater than 3.0 with opposite orientation.
Note that the volume scale “-” means that the volume is less than 50; the “+” means that volume is greater than 50; and cysteine (C) amino acid is moved from Class 3 to Class 7 because it can form disulfide bonds.
Due to the ambiguity of protein sequencing, four ambiguous amino acid characters are added to the twenty amino acids. An eighth class, labeled with zero, is added to the seven classes of amino acid encoding. The newly added class contains three ambiguous amino acids (X, Z, and B), where X is an unknown amino acid, Z is a glutamic acid (E) or glutamine (Q), and B is an aspartic acid (D) or asparagine (N). The fourth ambiguous amino acid (J) is leucine (L) or isoleucine (I), which is added to Class 2 because the I and L amino acids belong to this class. The ambiguity of amino acid codes comes from the error rate of the amino acid sequencer. During the sequencing process, the machine may be unable to determine the next amino acid, for example, whether it is L or I; in that case, the sequencer records J, meaning the amino acid may be either I or L.
Table 3 shows the amino acid classification. The amino acid encoding method replaces each amino acid character with its corresponding class number. This method converts the amino acids to numbers according to the eight classes. Then, the AAC method is applied utilizing the encoded amino acids. This method calculates the frequency of each class using the expression given by
| $\mathrm{Freq}_i = \mathrm{Num}_i / \mathrm{Len}$, $\quad i = 0, 1, \ldots, 7$, | (1) |
where $\mathrm{Freq}_i$ defined in (1) is the frequency of Class i, $\mathrm{Num}_i$ is the number of occurrences of Class i in the protein sequence, and Len is the length of the protein sequence. Each class frequency is considered as a feature of the protein sequence. Therefore, eight features are extracted based on physicochemical properties.
Table 3.
Eight classes of amino acids. Source: Taken from Ref. [33] which is licensed under a CC BY 4.0 License (creativecommons.org, accessed on October 24, 2021).
| Class | Amino acids |
|---|---|
| 0 | X (unknown), B (D or N), Z (E or Q) |
| 1 | A, G, V |
| 2 | I, L, F, P, J (I or L) |
| 3 | Y, M, T, S |
| 4 | H, N, Q, W |
| 5 | R, K |
| 6 | D, E |
| 7 | C |
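To make the encoding concrete, the following Python snippet is a minimal sketch of the feature extraction phase: each amino acid is mapped to its class according to Table 3, and the eight class frequencies of (1) are computed. The names (AA_CLASS, class_frequencies) are illustrative and not taken from the authors' implementation.

```python
# A minimal sketch of the feature extraction phase, assuming the class mapping of
# Table 3; names are illustrative, not the authors' own.
from collections import Counter

AA_CLASS = {
    'X': 0, 'B': 0, 'Z': 0,                       # Class 0: ambiguous/unknown
    'A': 1, 'G': 1, 'V': 1,                       # Class 1
    'I': 2, 'L': 2, 'F': 2, 'P': 2, 'J': 2,       # Class 2 (J = I or L)
    'Y': 3, 'M': 3, 'T': 3, 'S': 3,               # Class 3
    'H': 4, 'N': 4, 'Q': 4, 'W': 4,               # Class 4
    'R': 5, 'K': 5,                               # Class 5
    'D': 6, 'E': 6,                               # Class 6
    'C': 7,                                       # Class 7 (forms disulfide bonds)
}

def class_frequencies(sequence):
    """Encode a protein sequence and return the eight class frequencies of Eq. (1)."""
    encoded = [AA_CLASS[aa] for aa in sequence.upper() if aa in AA_CLASS]
    counts = Counter(encoded)
    return [counts.get(c, 0) / len(encoded) for c in range(8)]

# Example with a short artificial sequence.
print(class_frequencies("MKVLLAGCWD"))
```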
2.3. Feature reduction phase
In this phase, the optimal features are selected employing the mentioned feature selection techniques: information gain (denoted IG), ANOVA, and χ2.
The IG is a filter-based method that calculates the information amount for each feature item related to the target class. It measures features' importance for the classification phase. The IG of a dataset (d) concerning one feature is calculated as
| $\mathrm{IG}(d, f) = \mathrm{Ent}(d) - \mathrm{Ent}(d \mid f)$, | (2) |
where Ent(d) stated in (2) is the entropy of the dataset d and Ent(d|f) is the conditional entropy of d given the feature f. The entropy of d with respect to the classes is defined as
| $\mathrm{Ent}(d) = -\sum_{i=1}^{c} \Pr_i \log_2(\Pr_i)$, | (3) |
where $\Pr_i$ given in (3) is the probability of selecting a sample of Class i, and c is the number of classes. The conditional entropy is obtained by dividing the dataset into several groups induced by the feature values. Then, the proportion of samples in each group out of the complete dataset is multiplied by the entropy of that group, based on the expression stated as
| $\mathrm{Ent}(d \mid f) = -\sum_{d \in D}\sum_{f \in F} \Pr(d, f) \log_2\!\left(\dfrac{\Pr(d, f)}{\Pr(f)}\right)$, | (4) |
where Pr(f) given in (4) is the probability of selecting a feature, Pr(d, f) is the corresponding joint probability, and D, F are the universe of datasets and features, respectively. Mutual information (MI) is calculated between two features, with MI measuring the IG for one feature given a known value of the other feature. Thus, the MI between two features $f_1$ and $f_2$ is formulated as
| $\mathrm{MI}(f_1, f_2) = \mathrm{IG}(d, f_1) - \mathrm{IG}(f_1 \mid f_2)$, | (5) |
where MI($f_1$, $f_2$) stated in (5) is the mutual information for $f_1$ and $f_2$, IG(d, $f_1$) is the information gain for $f_1$, and IG($f_1$|$f_2$) is the conditional information gain for $f_1$ given $f_2$, defined as
| $\mathrm{IG}(f_1 \mid f_2) = \sum_{f_1, f_2} \Pr(f_1, f_2)\, \mathrm{IG}(f_1)$, | (6) |
where IG($f_1$) stated in (6) is the IG of the feature $f_1$ with respect to the dataset d and Pr($f_1$, $f_2$) is the probability of selecting the two features. In the proposed model, IG is calculated according to the information of each amino acid class.
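As a numerical illustration of (2)-(4), the following Python sketch computes the entropy of the class labels and the IG of one discretized feature. It is a minimal example with illustrative names and toy data, not the authors' code.

```python
# A minimal numerical sketch of Eqs. (2)-(4): entropy of the class labels and the
# information gain of one discretized feature.
import numpy as np

def entropy(labels):
    """Ent(d) = -sum_i Pr_i log2(Pr_i) over the class labels in d."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels):
    """IG(d, f) = Ent(d) - sum_v Pr(f = v) Ent(d | f = v)."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    cond_ent = sum(w * entropy(labels[feature == v]) for v, w in zip(values, weights))
    return entropy(labels) - cond_ent

# Toy example: a binned feature and binary labels (1 = COVID-19, 0 = non-COVID-19).
f = [0, 0, 1, 1, 1, 2, 2, 2]
y = [1, 1, 1, 0, 0, 0, 0, 1]
print(information_gain(f, y))
```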
The ANOVA is a statistical test that studies the variance of mean sum of squares in different groups. The ANOVA is constructed from the expressions stated as
| $\mathrm{MMS}_{\text{within}} = \dfrac{\sum_{g=1}^{G} \sum_{f \in g} (f - \bar{f}_g)^2}{S - G}$, | (7) |
where MMS within is the mean sum of squares within the groups, f is a feature value in group g, $\bar{f}_g$ is the mean of the features in that group, G is the number of groups, S is the total number of samples, and (S − G) are the within-group degrees of freedom, and
| $\mathrm{MMS}_{\text{between}} = \dfrac{\sum_{g=1}^{G} n_g (\bar{f}_g - \bar{f})^2}{G - 1}$, | (8) |
where MMS between defined in (8) is the mean sum of squares between groups, with $\bar{f}_g$ and G being defined as in (7), $n_g$ is the number of samples in group g, $\bar{f}$ is the mean of all features, and (G − 1) corresponds to the between-group degrees of freedom. Therefore, the statistic used to compare groups is given by
| ANOVA = MMS between / MMS within. | (9) |
Note that the ANOVA compares the variation in the frequency of each amino acid class with the variation across all virus sequences.
The χ2 test is applied to the groups of features to evaluate the association between them using the corresponding statistic defined as
| $\chi^2 = \sum_{i=1}^{G} \dfrac{(O_i - E_i)^2}{E_i}$, | (10) |
where $O_i$ and $E_i$ are the observed and predicted (expected) frequencies in Class i, respectively, with G, as mentioned in (7), being the number of groups.
All these feature selection methods, that is, IG, ANOVA, χ2 test, are efficient in selecting features on protein sequence datasets because of the relationship between the amino acid classes. The proposed AAPred model uses the IG established in (2), ANOVA defined in (9), and χ2 statistic stated in (10) to assign scores to the extracted features. Then, the features are sorted, and the highest scores are considered in the classification phase. This step is very important to select the most significant features and remove unwanted features.
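A possible realization of this scoring-and-ranking step with scikit-learn is sketched below; mutual_info_classif, f_classif, and chi2 are natural counterparts of the IG, ANOVA, and χ2 criteria, the synthetic X and y merely stand in for the eight class-frequency features and the binary virus labels, and keeping k = 7 features follows the text.

```python
# A hedged sketch of the feature reduction phase with filter-based selectors.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((200, 8))            # stand-in for the eight class-frequency features
y = rng.integers(0, 2, 200)         # stand-in labels: 1 = COVID-19, 0 = non-COVID-19

for name, score in [("IG", mutual_info_classif), ("ANOVA", f_classif), ("chi2", chi2)]:
    selector = SelectKBest(score_func=score, k=7).fit(X, y)
    print(name, "keeps feature indices:", selector.get_support(indices=True))
```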
2.4. Classification phase
In the classification phase, the significant features are used to classify the coronaviruses into COVID-19 and non-COVID-19 employing the BE, DT, GB, KNN, RF, and SVM algorithms. Prediction of the coronavirus type is considered a binary classification problem.
The BE algorithm [42] utilizes a base classifier for a given number of iterations. The BE classifier applies weights to the training set to help the next iteration of the classifier achieve better performance.
The DT classifier splits the dataset based on the entropy by means of the formula given by
| $\mathrm{Ent}(d) = -\sum_{x} \Pr(x) \log_2\!\big(\Pr(x)\big)$, | (11) |
where Ent(d) given in (11) is the entropy of the dataset d, x ranges over the set of classes (the COVID-19 class or the non-COVID-19 class), and Pr(x) is the proportion (probability) of samples in Class x among the total number of samples in d. After calculating the entropy, the DT method splits the dataset using the IG defined as
| $\mathrm{IG}(d, a) = \mathrm{Ent}(d) - \sum_{b \in B} \Pr(b)\, \mathrm{Ent}(b)$, | (12) |
with Ent(d) being defined in (11), B is the universe of subsets of the samples obtained by splitting on the random variable a, Pr(b) is the proportion between the samples in subset b and the total number of samples in the dataset, and Ent(b) is the entropy of subset b.
The GB algorithm [43] improves the classifier results iteratively by fitting each new learner to the negative gradient of the loss function.
The KNN classifier is applied using the Manhattan distance (MD), which is calculated as the sum of the absolute values of the differences of the samples and then defined as
| $\mathrm{MD}(x, y) = \sum_{i=1}^{k} \lvert x_i - y_i \rvert$, | (13) |
where k stated in (13) is the number of features (dimensions), and x, y are two samples (vectors) with components $x_i$, $y_i$.
The RF algorithm is an ensemble classifier that combines several classifiers. The RF method consists of multiple DTs, each of which works as a classifier to predict the class label and then the majority voting of these trees’ outputs is utilized to predict the class label.
The SVM algorithm is applied using different kernel functions: linear, hyperbolic tangent (sigmoid), polynomial, and radial basis function (RBF) [34]. The simplest kernel is the linear function, which is computed from the inner product of the feature vector and the class label vector and is expressed as
| $\mathrm{LK}(f, y) = \langle f, y\rangle + g$, | (14) |
where LK(f, y) is the linear kernel, f is the feature vector, y is the class label vector, “<f , y>” is the corresponding inner product, and g is a free general parameter (constant). The polynomial kernel function is calculated as
| $\mathrm{PK}(f, y) = \left(\langle f, y\rangle + g\right)^{q}$, | (15) |
where PK(f, y) given in (15) is the polynomial kernel, f, y, g are defined as in (14), and q is the polynomial degree. The sigmoid kernel utilizes the bipolar sigmoid function formulated as
| $\mathrm{SK}(f, y) = \tanh\!\left(\alpha \langle f, y\rangle + g\right)$, | (16) |
where SK(f, y) stated in (16) is the sigmoid kernel, f, y, g are defined as in (14), (15), and α is a value of the slope in the linear function. The RBF is an exponential function established as
| $\mathrm{RK}(f, y) = \exp\!\left(-\gamma\, \lVert f - y \rVert^{2}\right)$, | (17) |
where RK(f, y) given in (17) is the radial basis kernel, f and y are defined as in (14), (15), (16), and γ is an adjustable parameter.
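The classification phase can be sketched with scikit-learn as follows. The hyperparameters shown are library defaults (the paper does not report its exact settings), and the synthetic data only stand in for the selected class-frequency features; the sketch is illustrative, not the authors' implementation.

```python
# A sketch of the classification phase with the six classifiers named above.
import numpy as np
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.random((500, 7))                 # stand-in for the seven selected features
y = rng.integers(0, 2, 500)              # 1 = COVID-19, 0 = non-COVID-19
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = {
    "BE": BaggingClassifier(),
    "DT": DecisionTreeClassifier(criterion="entropy"),  # entropy split, Eqs. (11)-(12)
    "GB": GradientBoostingClassifier(),
    "KNN": KNeighborsClassifier(metric="manhattan"),    # Manhattan distance, Eq. (13)
    "RF": RandomForestClassifier(),
    "SVM": SVC(kernel="rbf"),                           # RBF kernel, Eq. (17)
}
for name, clf in classifiers.items():
    print(name, "test accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))
```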
3. Application
3.1. Data description
As mentioned, the proposed model is evaluated using the NGDC dataset, which is described by two types of files: comma-separated values (CSV) and FASTA, a text-based format for representing either nucleotide or amino acid sequences, in which nucleotides or amino acids are denoted by single-letter codes. This dataset was accessed in July 2020 and has 113,927 samples of protein sequences for COVID-19 and other coronavirus types. There are different types of coronaviruses in this dataset such as alpha coronaviruses, bat coronaviruses, MERS-CoV, SARS-CoV, and SARS-CoV-2. The NGDC dataset contains 60,539 protein sequences of COVID-19 and the remaining 53,388 sequences are non-COVID-19 viruses. Therefore, the dataset is balanced between the two coronavirus types (COVID-19 and non-COVID-19) by randomly selecting only 53,388 of the COVID-19 protein sequences. The remaining COVID-19 sequences were considered in the model training. Also, the NGDC dataset was accessed again in November 2020, and the proposed model was evaluated with the newly uploaded sequences, totaling 520,789 protein sequences. Moreover, the AAPred model is evaluated employing another dataset that contains only the spike protein of coronaviruses from the National Center for Biotechnology Information (NCBI) coronavirus dataset [35].
For COVID-19, the minimum protein sequence length is 21 amino acids, and its maximum value is 7097 amino acids. For non-COVID-19, the minimum protein sequence length is 26 amino acids, and its maximum value is 7247 amino acids. The CSV file has data about viruses’ protein sequences, such as accession numbers, collection date of protein sequences, species, genus, family, protein sequence length, isolation source, host, and geographical location as shown in Table 4 .
Table 4.
Data presented in the CSV file format.
The FASTA file contains the protein sequences, with the accession number and virus type as a header for each protein sequence, as shown in Fig. 2.
Fig. 2.
FASTA file sample.
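A simple way to read such a file is sketched below. The parsing of the header into a binary virus label is an assumption for illustration only, since the exact NGDC header layout is shown only in Fig. 2; the file name in the usage comment is hypothetical.

```python
# A simple sketch of reading protein sequences from a FASTA file.
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical usage (file name is illustrative):
# for header, seq in read_fasta("coronaviridae_protein.fasta"):
#     label = 1 if "SARS-CoV-2" in header else 0   # 1 = COVID-19, 0 = non-COVID-19
#     print(header, len(seq), label)
```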
3.2. Model evaluation criteria
As mentioned, the AAPred model is evaluated using four performance measures: accuracy, precision, sensitivity, and specificity defined next. The accuracy is stated as
| Acc = ((TP + TN)/S) × 100, | (18) |
where TP given in (18) is the total number of true positives, TN is the total number of true-negatives, and S is the total number of samples in the dataset.
The precision criterion calculates the positive prediction or the correctness of the model when the classifier predicts the protein sequence as COVID-19. The precision formula is defined by
| Prec = (TP/(TP + FP)) × 100, | (19) |
where TP is given in (18) and FP in (19) is the total number of false-positives. The summation of TP and FP represents the total number of protein sequences predicted as COVID-19 (predicted as presence of the virus).
The sensitivity states the true-positive rate of the proposed model when the classifier predicts the protein sequence as COVID-19 and it is really a COVID-19. The sensitivity is also called recall and formulated as
| Sens = (TP/(TP + FN)) × 100, | (20) |
where FN stated in (20) is the total number of false negatives. The summation of TP and FN represents the number of COVID-19 protein sequences (when the presence of the virus is true).
The specificity determines the true-negative rate of the model when the classifier predicts the protein sequence as non-COVID-19 and it is really a non-COVID-19. The specificity is defined as
| Spec = (TN/(TN + FP)) × 100, | (21) |
where TN is given in (18) and FP stated in (21) is the total number of false-positives. The summation of TN and FP represents the number of non-COVID-19 protein sequences (when the absence of the virus is true).
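The four measures (18)-(21) can be computed from a confusion matrix as in the following brief sketch; the toy label vectors are illustrative only.

```python
# Computing the measures (18)-(21) from a confusion matrix (1 = COVID-19, 0 = non-COVID-19).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc  = (tp + tn) / (tp + tn + fp + fn) * 100   # Eq. (18)
prec = tp / (tp + fp) * 100                    # Eq. (19)
sens = tp / (tp + fn) * 100                    # Eq. (20), also called recall
spec = tn / (tn + fp) * 100                    # Eq. (21)
print(acc, prec, sens, spec)
```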
3.3. Results
In the dataset, protein sequences are categorized into two classes: COVID-19 and non-COVID-19. The 106,776 protein sequences are divided into 80% for training and 20% for testing. Moreover, all protein sequences were evaluated using 10-fold cross-validation. Fig. 3 shows the frequencies of the eight amino acid classes for the samples in the dataset. In Fig. 3, the x-axis represents the eight classes of amino acids, and the y-axis is the frequency of each amino acid class in each sample.
Fig. 3.
Bar plot with frequencies of eight amino acids classes in: (a) COVID-19 and (b) non-COVID-19 samples for the NGDC dataset.
As shown in Fig. 3, the COVID-19 samples contain high amounts of Class 2 amino acids. This class has a dipole value less than 1.0 and includes the amino acids isoleucine (I), leucine (L), phenylalanine (F), and proline (P). The non-COVID-19 samples contain high amounts of amino acids of Classes 3 and 6, which include aspartic acid (D), glutamic acid (E), methionine (M), serine (S), threonine (T), and tyrosine (Y). Moreover, the dipole values of the third and sixth amino acid classes range between 1.0 and 3.0.
Therefore, SARS-CoV-2 has physicochemical characteristics similar to those of some regions of SARS-CoV. In addition, the polarity of most amino acids found in non-COVID-19 protein sequences is higher than that of the amino acids found in COVID-19 protein sequences. Therefore, the COVID-19 proteins are largely nonpolar and tend to form chemical bonds with very small dipole values.
The proposed model was implemented in the scientific Python development environment (Spyder) using the Scikit-learn package for the Python programming language. The model was applied on a computer with an Intel(R) Core(TM) i7-9750H CPU and 16 GB RAM.
In the AAPred model, amino acid encoding extracts the features and, as mentioned, three different selection methods are utilized to reduce the number of features and optimize the performance. Then, the performance measures are employed for evaluation, that is, accuracy, precision, sensitivity, and specificity.
Table 5 reports the performance of the six different classifiers based on the NGDC dataset using 80% training and 20% testing. As reported in this table, the highest performance is reached using the χ2 method for feature selection with the KNN classifier. The χ2 method reduces the number of features to seven. Accuracy, precision, sensitivity, and specificity reach 99.69%, 99.72%, 99.65%, and 99.72%, respectively.
Table 5.
AAPred model performance for the indicated method and classifier with the NGDC dataset.
| Classifier | Acc (IG) | Sens (IG) | Spec (IG) | Prec (IG) | Acc (ANOVA) | Sens (ANOVA) | Spec (ANOVA) | Prec (ANOVA) | Acc (χ²) | Sens (χ²) | Spec (χ²) | Prec (χ²) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BE | 98.53 | 98.21 | 98.43 | 98.35 | 96.91 | 96.32 | 96.44 | 96.83 | 98.89 | 98.87 | 98.33 | 98.33 |
| DT | 99.23 | 99.19 | 99.37 | 99.37 | 99.39 | 99.31 | 99.48 | 99.48 | 99.28 | 99.17 | 99.40 | 99.40 |
| GB | 97.61 | 96.80 | 97.85 | 97.51 | 95.62 | 95.34 | 95.64 | 95.64 | 97.81 | 97.56 | 97.74 | 97.73 |
| KNN | 99.66 | 99.65 | 99.67 | 99.67 | 99.63 | 99.55 | 99.71 | 99.71 | 99.69 | 99.65 | 99.72 | 99.72 |
| RF | 99.69 | 99.80 | 99.58 | 99.58 | 99.69 | 99.81 | 99.56 | 99.56 | 99.68 | 99.78 | 99.57 | 99.57 |
| SVM | 95.13 | 95.40 | 94.86 | 94.89 | 95.14 | 95.39 | 94.89 | 94.91 | 95.15 | 95.40 | 94.90 | 94.92 |
Table 6 reports the average performance using 10-fold cross-validation. The average performance, employing the χ2 method for feature selection with the KNN classifier, reaches an accuracy, precision, sensitivity, and specificity of 90.69%, 91.72%, 89.65%, and 88.72%, respectively. The highest performance employing 10-fold cross-validation is reached with the RF classifier and the IG method, as shown in Fig. 4, Fig. 5, Fig. 6.
Table 6.
AAPred model performance for the indicated method and classifier with the NGDC dataset using 10-fold cross-validation.
| Classifier | Acc (IG) | Sens (IG) | Spec (IG) | Prec (IG) | Acc (ANOVA) | Sens (ANOVA) | Spec (ANOVA) | Prec (ANOVA) | Acc (χ²) | Sens (χ²) | Spec (χ²) | Prec (χ²) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BE | 98.53 | 96.21 | 97.43 | 98.35 | 96.51 | 96.32 | 94.44 | 90.83 | 93.91 | 90.32 | 90.44 | 90.83 |
| DT | 89.23 | 89.19 | 89.37 | 89.37 | 94.39 | 96.31 | 90.48 | 90.48 | 89.28 | 89.17 | 89.4 | 89.4 |
| GB | 97.61 | 96.8 | 97.85 | 97.51 | 95.62 | 95.34 | 95.64 | 90.64 | 92.62 | 89.34 | 88.64 | 88.64 |
| KNN | 89.66 | 89.65 | 89.67 | 89.67 | 89.63 | 89.55 | 89.71 | 89.71 | 90.69 | 89.65 | 88.72 | 91.72 |
| RF | 98.69 | 96.81 | 97.72 | 98.72 | 96.69 | 95.81 | 95.56 | 91.56 | 94.68 | 90.78 | 91.57 | 89.57 |
| SVM | 85.13 | 85.4 | 84.86 | 84.89 | 85.14 | 85.39 | 84.89 | 84.91 | 94.15 | 85.4 | 84.9 | 84.92 |
Fig. 4.
Proposed model performance based on IG for the NGDC dataset using a 10-fold cross-validation.
Fig. 5.
Proposed model performance based on an ANOVA for the NGDC dataset using a 10-fold cross-validation.
Fig. 6.
Proposed model performance based on the χ2 method for the NGDC dataset using a 10-fold cross-validation.
The accuracy of the SVM classifier employing the χ2 method is 94.15%. In addition, the DT classifier reaches an accuracy of 94.39%, a precision of 90.48%, a sensitivity of 96.31%, and a specificity of 90.48% using an ANOVA. The accuracy of the RF classifier with IG is 98.69% and this is the maximum accuracy reached.
Note that the three feature selection methods reduce the number of features to seven. The selected seven features from each method are different, but there are two common features among the three feature selection methods: the frequency of Classes 2 and 6.
The proposed model performance using 10-fold cross-validation based on the IG, ANOVA, and χ2 tests is shown in Fig. 4, Fig. 5, Fig. 6. These figures reveal that the RF classifier outperforms the other classifiers, and the IG method is the best one concerning accuracy.
The proposed model was applied to the spike protein dataset to predict the infection risk. The results reached 99.01%, 96.41%, 98.56%, and 97.02% for accuracy, precision, sensitivity, and specificity, respectively. Fig. 7 shows the performance of the RF classifier based on the spike protein dataset using the three selection methods: IG, ANOVA, and χ2 test.
Fig. 7.
Proposed model performance using the spike protein dataset.
The proposed model is evaluated using six different machine learning classifiers [20,28,29]: BE, DT, GB, KNN, RF, and SVM. The proposed model reached an accuracy of 99.69%, a precision of 99.58%, a sensitivity of 99.80%, and a specificity of 99.58% based on 80% of the data for training and 20% for testing. The average performance measures using 10-fold cross-validation are 98.69%, 98.72%, 96.81%, and 97.72% for accuracy, precision, sensitivity, and specificity, respectively. The AAPred model outperforms the existing models in terms of accuracy and computation time. Moreover, it reduces the number of features to only seven. The performance of the AAPred model changes when applying different training and testing methods. As reported in Table 5, the KNN method outperforms the other classifiers using 80% of the data for training and 20% for testing. On the contrary, when using 10-fold cross-validation, the RF method outperforms the other classifiers. This indicates that the way the dataset is split into training and testing affects the prediction of the viruses, especially for the KNN classifier, which depends on the neighbors. Thus, applying the RF classifier with 10-fold cross-validation makes the prediction results more reliable.
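A 10-fold cross-validation of the RF classifier along these lines can be sketched as follows; the synthetic X and y are placeholders for the seven selected features and the virus labels, and "recall" corresponds to sensitivity among the reported measures.

```python
# A sketch of the 10-fold cross-validation evaluation with the RF classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(2)
X = rng.random((1000, 7))           # stand-in for the seven selected features
y = rng.integers(0, 2, 1000)        # stand-in labels: 1 = COVID-19, 0 = non-COVID-19

scores = cross_validate(RandomForestClassifier(random_state=0), X, y, cv=10,
                        scoring=("accuracy", "precision", "recall"))
print({k: round(v.mean(), 4) for k, v in scores.items() if k.startswith("test_")})
```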
4. Discussion and conclusions, limitations, and future research
4.1. Discussion
To verify the performance of the proposed AAPred model, we compare it with two different models for feature selection: LASSO [14] and AAC [15]. With LASSO, the selected number of features is 35 (eighteen radiological semantic features and seventeen clinical features). Using both types of features, the performance measures are 95.9%, 96.1%, and 95.7% in terms of accuracy, sensitivity, and specificity, respectively. When the AAC is utilized, the 41 features of the protein sequence are reduced to twenty important features. We employ the RF classifier, as in [15], to evaluate the models. The achieved accuracy is 98.18% using 10-fold cross-validation.
Table 7 summarizes the performance of the AAPred model compared to the method presented in [14]. The proposed AAPred model outperforms Chen et al. [14]'s method. It is evaluated employing six classifiers and four performance measures, reaching a higher accuracy of 98.69% in 6.43 seconds compared to 95.9% with Chen et al. [14]'s method in 5.62 seconds. In addition, our model reduces the number of features to seven compared to the 35 features presented in [14]. The second comparison is conducted with the spike protein dataset. The AAPred model was compared with Qiang et al. [15]'s model, which has an accuracy of 98.18%. Our AAPred model outperforms Qiang et al. [15]'s model, with an accuracy of 99.01% and seven features in 3.58 seconds, compared to Qiang et al. [15]'s model that uses twenty features, as shown in Table 8. The results report that SARS-CoV-2 has an infection cycle and sequences similar to those of some regions of the SARS-CoV proteins, indicating that existing vaccines may also affect SARS-CoV-2.
Table 7.
AAPred model performance compared to the method proposed in Ref. [14]. Source: The authors.
| Method | Number of features | Acc | Sens | Spec | Prec | Computing time (in seconds) |
|---|---|---|---|---|---|---|
| IG | 7 | 98.69 | 96.81 | 97.72 | 98.72 | 6.43 |
| ANOVA | 7 | 96.69 | 95.81 | 95.56 | 91.56 | 1.17 |
| χ2 | 7 | 94.68 | 90.78 | 91.57 | 89.57 | 1.12 |
| Chen et al. [14] | 35 | 95.90 | 96.10 | 95.70 | 98.60 | 5.62 |
Table 8.
AAPred performance compared to the method in Ref. [15]. Source: The authors.
| Method | Number of features | Acc | Sens | Spec | Prec | Computing time (in seconds) |
|---|---|---|---|---|---|---|
| IG | 7 | 99.01 | 98.56 | 97.02 | 96.41 | 3.58 |
| Qiang et al. [15] | 20 | 98.18 | 99.16 | 97.26 | 96.38 | 4.21 |
4.2. Conclusions, limitations, and future research
In this paper, a novel model, named AAPred, was proposed to predict coronavirus types. The AAPred model extracts features from the protein sequences by replacing the amino acid characters in each sequence with the normalized frequencies of the eight amino acid classes. We used the information gain, analysis of variance, and chi-square methods to reduce the number of features and select the most significant ones. The model reduces the features to only seven. The AAPred model was evaluated employing six classifiers: bagging ensemble, decision trees, gradient boosting, k-nearest neighbors, random forest, and support vector machine. Experimental results showed that the AAPred model can differentiate between COVID-19 and non-COVID-19 virus sequences. This differentiation is supported by the average accuracy of the model of 98.69%, a precision of 98.72%, a sensitivity of 96.81%, and a specificity of 97.72%. The maximum accuracy was reached utilizing the random forest classifier and the information gain method.
In addition, the 10-fold cross-validation method was used. Moreover, after studying the polarity and dipole values of the protein sequences of the coronaviruses, we conclude that the COVID-19 protein sequences have a high level of amino acids of the second class, which contains nonpolar amino acids, compared to the non-COVID-19 protein sequences.
We chose machine learning tools rather than deep learning methods because machine learning techniques reach excellent results on our dataset with less computational burden and are simpler to use. Indeed, the problem has only seven features and is not as complicated as, for example, image classification. In any case, we applied a generative adversarial network and obtained an accuracy of around 98%, similar to that of the random forest classifier.
A limitation of the proposed model is that it relies only on protein sequence analysis and does not consider other aspects such as the protein structure of the virus or its DNA sequence. As further research directions, we aim to apply the same model using different feature extraction methods according to the sequence and the structure of the proteins to obtain more detailed biological information about the virus behavior and its infection cycle. Other classification methods will also be explored in future studies such as principal components analysis and its new derivations, including supervised and unsupervised approaches, as well as functional data analysis, partial least squares structures, and other recent methodologies [[36], [37], [38], [39], [40], [41],[46], [47], [48], [49]].
Data availability statement
Publicly available datasets were analyzed in this study. These data can be found at https://ngdc.cncb.ac.cn/gwh/browse/virus/coronaviridae.
Author statement
All persons who meet authorship criteria are listed as authors, and all authors certify that they have participated sufficiently in the work to take public responsibility for the content, including participation in the concept, design, analysis, writing, or revision of the manuscript.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank the Editors and Reviewers for their constructive comments on an earlier version of this manuscript which resulted in this improved version. The research of V. Leiva was partially funded by FONDECYT, project grant number 1200525 from the National Agency for Research and Development (ANID) of the Chilean government under the Ministry of Science and Technology, Knowledge, and Innovation.
References
- 1.Coronaviridae Study Group of the International Committee on Taxonomy of Viruses The species severe acute respiratory syndrome-related coronavirus classifying 2019-CoV and naming it SARS-CoV-2. Nat. Microbiol. 2020;5:536–544. doi: 10.1038/s41564-020-0695-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Zhou P., Yang X., Wang X., Hu B., Zhang L., Zhang W., et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579:270–273. doi: 10.1038/s41586-020-2012-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Jerez-Lillo N., Lagos Alvarez B., Muñoz Gutierrez J., Figueroa-Zúñiga J.I., Leiva V. A statistical analysis for the epidemiological surveillance of COVID-19 in Chile. Signa Vitae. 2022;18:19–30. doi: 10.22514/sv.2021.130. in press. [DOI] [Google Scholar]
- 4.Martin-Barreiro C., Ramirez-Figueroa J.A., Cabezas X., Leiva V., Galindo-Villardón M.P. Disjoint and functional principal component analysis for infected cases and deaths due to COVID-19 in South American countries with sensor-related data. Sensors. 2021;21:4094. doi: 10.3390/s21124094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.World Health Organization . 09 March 2020. WHO Announces COVID-19 Outbreak a Pandemic.http://www.euro.who.int/en/health-topics/health-emergencies/coronavirus-covid19/news/news/2020/3/who-announces-covid-19-outbreak-a-pandemic Available from: [Google Scholar]
- 6.Agranovsky A.A. Structure and expression of large (+)RNA genomes of viruses of higher eukaryotes. Biochemistry. 2021;86:248–261. doi: 10.1134/S0006297921030020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.International Committee on Taxonomy of Viruses. Available from: http://ictvonline.org/virusTaxonomy.asp (accessed on 24 October 2021).
- 8.Li F. Structure, function, and evolution of coronavirus spike proteins. Annu. Rev. Virol. 2016;3:237–261. doi: 10.1146/annurev-virology-110615-042301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Jf L.S.K., Kk T., Vc C., Pc W., Ky Y. Middle East respiratory syndrome coronavirus: another zoonotic betacoronavirus causing SARS-like disease. Clin. Microbiol. Rev. 2015;28:465–522. doi: 10.1128/CMR.00102-14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Wu Y., Peng B., Huang X., Ding X., Wang P. Niu. Genome composition and divergence of the novel coronavirus (2019-nCoV) originating in China. Cell Host Microbe. 2020;27:325–328. doi: 10.1016/j.chom.2020.02.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Li F. Structure, function, and evolution of coronavirus spike proteins. Ann. Rev. Virol. 2016;3:237–261. doi: 10.1146/annurev-virology-110615-042301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Adams M.J., Lefkowitz E.J., King A.M., Harrach B., Harrison R.L., Knowles N.J., et al. Ratification vote on taxonomic proposals to the international committee on taxonomy of viruses (2016) Arch. Virol. 2016;161:2921–2949. doi: 10.1007/s00705-016-2977-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Menachery V., Yount B., Debbink K., Agnihothram S., Gralinski L., Plante J. A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence. Nat. Med. 2015;21:1508–1513. doi: 10.1038/nm.3985. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Chen X., Tang Y., Mo Y., Li S., Lin D., Yang Z., et al. A diagnostic model for coronavirus disease 2019 (COVID-19) based on radiological semantic and clinical features: a multi-center study. Eur. Radiol. 2020;30:4893–4902. doi: 10.1007/s00330-020-06829-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Qiang X., Xu P., Fang G., Liu W., Kou Z. Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus. Infect. Dis. Poverty. 2020;9:33. doi: 10.1186/s40249-020-00649-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Zhao W.M., Song S.H., Chen M.L., et al. The 2019 novel coronavirus resource. Yi Chuan. 2020;42:212–221. doi: 10.16288/j.yczz.20-030. [DOI] [PubMed] [Google Scholar]
- 17.Gomes J., Barbosa V., Santana M., Bandeira J. IKONOS: an intelligent tool to support diagnosis of COVID-19 by texture analysis of x-ray images. medRxiv. 09 May 2020 [Google Scholar]
- 18.Bustos N., Tello M., Droppelmann G., Garcia N., Feijoo F., Leiva V. Machine learning techniques as an efficient alternative diagnostic tool for COVID-19 cases. Signa Vitae. 2022;18:23–33. [Google Scholar]
- 19.V’kovski P., Kratzel A., Steiner S., Stalder H., Thiel V. Coronavirus biology and replication: implications for SARS-CoV-2. Nat. Rev. Microbiol. 2020;19:155–170. doi: 10.1038/s41579-020-00468-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Braga-Neto . Springer; New York: 2020. Fundamentals of Pattern Recognition and Machine Learning. [Google Scholar]
- 21.Palacios C.A., Reyes-Suarez J.A., Bearzotti L.A., Leiva V., Marchant C. Knowledge discovery for higher education student retention based on data mining: machine learning algorithms and case study in Chile. Entropy. 2021;23:485. doi: 10.3390/e23040485. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.The Ddbj/ENA/GenBank Feature Table Definition. International Nucleotide Sequence Database Collaboration. Available from: https://www.insdc.org/documents/feature-table (accessed on 24 October 2021).
- 23.Zhang M., Su Q., Lu Y., Zhao M., Niu B. Application of machine learning approaches for protein-protein interactions prediction. Med. Chem. 2017;13:506–514. doi: 10.2174/1573406413666170522150940. [DOI] [PubMed] [Google Scholar]
- 24.Asim S., Shah A., Shabbir H., Rehman S., Waqas M. A comparative study of feature selection approaches: 2016-2020. Int. J. Sci. Eng. Res. 2020;11:469. [Google Scholar]
- 25.Lefkovits S., Lefkovits L. Gabor feature selection based on information gain. Process Eng. 2017;181:892–898. [Google Scholar]
- 26.Ardelean F. Case study using analysis of variance to determine groups' variations. MATEC Web Conferen. 2017;126 [Google Scholar]
- 27.Benhamou E., Melot V. Seven proofs of the Pearson chi-squared independence test and its graphical interpretation. SSRN. 2010 doi: 10.2139/ssrn.3239829. [DOI] [Google Scholar]
- 28.Torsello A., Rossi L., Pelillo M., Biggio B., Robles-Kelly A. Springer; New York, USA: 2021. Structural, Syntactic, and Statistical Pattern Recognition. [Google Scholar]
- 29.Alkady W., Gad W., Bahnasy K. 2019 14th International Conference on Computer Engineering and Systems (ICCES) 2019. Swarm intelligence optimization for feature selection of biomolecules; pp. 380–385. [DOI] [Google Scholar]
- 30.Xiuquan D., Xinrui L., Zhang H., Zhang Y. Prediction of protein-protein interaction by metasample-based sparse representation. Math. Probl Eng. 2015:858256. [Google Scholar]
- 31.Philip J., Keith R., Probert I.J., Jonathan R., Stewart J., Chris J. Density functional theory in the solid-state. Phil. Trans. R. Soc. 2014;372:20130270. doi: 10.1098/rsta.2013.0270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Xiao N., Cao D.S., Zhu M.F., Xu Q.S. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics. 2015;31:1857–1859. doi: 10.1093/bioinformatics/btv042. [DOI] [PubMed] [Google Scholar]
- 33.Wang X., Wu Y., Wang R., Wei Y., Gui Y. A novel matrix of sequence descriptors for predicting protein-protein interactions from amino acid sequences. PLoS ONE. 2019;14:e0217312. doi: 10.1371/journal.pone.0217312. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Cano Lengua M.A., Papa Quiroz E.A. A systematic literature review on support vector machines applied to Classification. IEEE Eng. Int. Res. Conferen. (EIRCON) 2020:1–4. doi: 10.1109/EIRCON51178.2020.9254028. [DOI] [Google Scholar]
- 35.NCBI coronavirus datasets. Available from: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Protein (accessed on 24 October 2021).
- 36.Ramirez-Figueroa J.A., Martin-Barreiro C., Nieto A.B., Leiva V., Galindo-Villardón M.P. A new principal component analysis by particle swarm optimization with an environmental application for data science. Stoch. Environ. Res. Risk Assess. 2021;35:1969–1984. [Google Scholar]
- 37.Melendez R., Giraldo R., Leiva V. Sign, Wilcoxon and Mann-Whitney tests for functional data: an approach based on random projections. Mathematics. 2021;9:44. [Google Scholar]
- 38.Martinez J.L., Leiva V., Saulo H., Liu S. Estimating the covariance matrix of the coefficient estimator in multivariate partial least squares regression with chemical applications. Chemometr. Intell. Lab. Syst. 2021;214:104328. [Google Scholar]
- 39.Campos T.L., Korhonen P.K., Young N.D. Cross-predicting essential genes between two model eukaryotic species using machine learning. Int. J. Mol. Sci. 2021;22:5056. doi: 10.3390/ijms22105056. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Naumov V., Putin E., Pushkov S., Kozlova E., Romantsov K., et al. COVIDomic: a multi-modal cloud-based platform for identification of risk factors associated with COVID-19 severity. PLoS Comput. Biol. 2021;17 doi: 10.1371/journal.pcbi.1009183. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Alzahrani A.Y., Shaaban M.M., Elwakil B.H., Hamed M.T., Rezki N., Aouad M.R., Zakaria M.M.A., Hagar M. Anti-COVID-19 activity of some benzofused 1, 2, 3-triazolesulfonamide hybrids using in silico and in vitro analyses. Chemometr. Intell. Lab. Syst. 2021;217:104421. doi: 10.1016/j.chemolab.2021.104421. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Jafarzadeh H., Mahdianpari M., Gill E., Mohammadimanesh F., Homayouni S. Bagging and boosting ensemble classifiers for classification of multispectral, hyperspectral and PolSAR data: a comparative evaluation. Rem. Sens. 2021;13:4405. [Google Scholar]
- 43.Natekin A., Knoll A. Gradient boosting machines: a tutorial. Front. Neurorob. 2013;7:21. doi: 10.3389/fnbot.2013.00021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.M. Cascella, M. Rajnik, A. Aleem, et al. Features, evaluation, and treatment of Coronavirus (COVID-19) [Updated 2021 Sep 2]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2021 January. Available from: https://www.ncbi.nlm.nih.gov/books/NBK554776/. [PubMed]
- 45.A.K.M. Nor, S.R. Pedapati, M. Muhammad, V. Leiva, Overview of explainable artificial intelligence for prognostic and health management of industrial assets based on preferred reporting items for systematic reviews and meta-analyses. Sensors 21, 8020, 10.3390/s21238020. [DOI] [PMC free article] [PubMed]
- 46.Nor A.K.M., Pedapati S.R., Muhammad M., Leiva V. Abnormality detection and failure prediction using explainable bayesian deep learning: methodology and case study with industrial data. Mathematics. 2022;10:554. [Google Scholar]
- 47.Huerta M., Leiva V., Liu S., Rodriguez M., Villegas D. On a partial least squares regression model for asymmetric data with a chemical application in mining. Chemometr. Intell. Lab. Syst. 2019;190:55–68. [Google Scholar]
- 48.Ma L., Zhang Y., Leiva V., Liu S., Ma T. A new clustering algorithm based on a radar scanning strategy with applications to machine learning data. Expert Syst. Appl. 2022;191:116143. [Google Scholar]
- 49.Mahdi E., Leiva V., Mara'Beh S., Martin-Barreiro C. A new approach to predicting cryptocurrency returns based on the gold prices with support vector machines during the COVID-19 pandemic using sensor-related data. Sensors. 2021;21:6319. doi: 10.3390/s21186319. [DOI] [PMC free article] [PubMed] [Google Scholar]