Protein Science. 2024 Mar 19;33(4):e4928. doi: 10.1002/pro.4928

Examining evolutionary scale modeling‐derived different‐dimensional embeddings in the antimicrobial peptide classification through a KNIME workflow

Karla L Martínez-Mauricio 1, César R García-Jacas 2, Greneter Cordoves-Delgado 1
PMCID: PMC10949403  PMID: 38501511

Abstract

Molecular features play an important role in different bio-chem-informatics tasks, such as Quantitative Structure-Activity Relationships (QSAR) modeling. Several pre-trained models have recently been created to be used in downstream tasks, either by fine-tuning a specific model or by extracting features to feed traditional classifiers. In this regard, a new family of Evolutionary Scale Modeling models (termed ESM-2 models) was recently introduced, demonstrating outstanding results in protein structure prediction benchmarks. Herein, we studied the usefulness of the different-dimensional embeddings derived from the ESM-2 models to classify antimicrobial peptides (AMPs). To this end, we built a KNIME workflow that applies the same modeling methodology across experiments in order to guarantee fair analyses. As a result, the 640- and 1280-dimensional embeddings derived from the 30- and 33-layer ESM-2 models, respectively, are the most valuable, since statistically better performances were achieved by the QSAR models built from them. We also fused features of the different ESM-2 models, concluding that the fusion contributes to getting better QSAR models than using features of a single ESM-2 model. Frequency studies revealed that only a portion of the ESM-2 embeddings is valuable for modeling tasks, since between 43% and 66% of the features were never used. Comparisons against state-of-the-art deep learning (DL) models confirm that when performing methodologically principled studies in the prediction of AMPs, non-DL based QSAR models yield comparable-to-superior performances to DL-based QSAR models. The developed KNIME workflow is freely available at https://github.com/cicese-biocom/classification-QSAR-bioKom. This workflow can be valuable to avoid unfair comparisons when proposing new computational methods, as well as to build new non-DL based QSAR models.

Keywords: antimicrobial peptides, deep learning, ensemble classifiers, ESM‐2, evolutionary scale modeling, KNIME, QSAR, shallow classifiers

1. INTRODUCTION

Molecular features play an important role in different bio-chem-informatics tasks (Todeschini & Consonni, 2009), such as Quantitative Structure-Activity Relationships (QSAR) modeling (Muratov et al., 2020). Over the years, the calculation of molecular features has been the main strategy to extract useful features to build non-deep learning based QSAR models. To date, several algorithms to calculate those features have been built according to different theories (García-Jacas et al., 2020; Romero-Molina et al., 2019; Todeschini & Consonni, 2009; Valdés-Martiní et al., 2017). However, because no single algorithm can extract all the chemical information from different datasets, it is necessary to use several theoretically different algorithms to get an initial pool of molecular features. This initial pool is often a high-dimensional set from which relevant features should be selected (Bolón-Canedo & Alonso-Betanzos, 2019; Cerruela García et al., 2019; Pes, 2020). However, feature selection is an NP-hard problem, where the space of possible subsets to analyze is the power set of the n available features, of size 2^n. Therefore, obtaining a lower-dimensional pool of relevant features (often using ranking selectors) from which good subsets can be extracted (often using wrapper or embedded selectors) to build robust QSAR models is an effort-demanding and time-consuming task (Bolón-Canedo & Alonso-Betanzos, 2019; Cerruela García et al., 2019; Pes, 2020).

Given this scenario, the construction of QSAR models based on deep learning (DL) architectures (e.g., convolutional neural networks [CNN], recurrent neural networks [RNN] based on long short-term memory [LSTM] units, and gated recurrent units [GRU], among others) has gained particular relevance (Fu et al., 2020; Ma et al., 2022; Sharma et al., 2021a; Sharma et al., 2021b; Sharma et al., 2021c; Sharma et al., 2022; Singh et al., 2021; Su et al., 2019; Veltri et al., 2018; Xu et al., 2015; Xu et al., 2017; Yan et al., 2020). This is mainly due to the ability of DL to automatically learn discriminative features from unabstracted molecular representations (raw data), such as simplified molecular-input line-entry system (SMILES) strings for small- and medium-sized molecules (Xu et al., 2015; Xu et al., 2017), and amino acid sequences for proteins (Fu et al., 2020; Ma et al., 2022; Sharma et al., 2021a; Sharma et al., 2021b; Sharma et al., 2021c; Sharma et al., 2022; Singh et al., 2021; Su et al., 2019; Veltri et al., 2018; Yan et al., 2020). In this way, the burden of an a priori feature engineering process is removed. However, as has been demonstrated elsewhere (Garcia-Jacas et al., 2022; García-Jacas et al., 2022; Muratov et al., 2020), DL architectures lead to QSAR models with comparable-to-inferior performances relative to non-DL based QSAR models when using small- and medium-sized sets. This is because such datasets are not large enough (Manibardo et al., 2021; Oyedare & Park, 2019) to automatically obtain learned (non-handcrafted) features with better modeling abilities than the calculated (handcrafted) features that can be selected from them (Garcia-Jacas et al., 2022; García-Jacas et al., 2022).

However, because of the great success of pre-trained language models (i.e., trained on large unlabeled datasets) in natural language processing tasks (Acheampong et al., 2021; Floridi & Chiriatti, 2020; Research, 2023), a rising interest in building such deep models for bio-chem-informatics tasks has emerged (BenevolentAI, 2020; Fabian et al., 2020; Irwin et al., 2022; Lin et al., 2023; NVIDIA, 2022; Rives et al., 2021), in order to get useful features from large, publicly available databases (Consortium, T.U, 2018; Kim, 2019; Kim et al., 2018; Mistry et al., 2020). In this sense, models based on Transformers (Devlin et al., 2018; Lewis et al., 2018; Radford et al., 2018), coupled with SMILES strings, such as MolBERT (BenevolentAI, 2020; Fabian et al., 2020) and MegaMolBART (Irwin et al., 2022; NVIDIA, 2022), or coupled with protein sequences, such as Evolutionary Scale Modeling (ESM) (Lin et al., 2023; Rives et al., 2021), have been successfully built. These unsupervised models yield lower-dimensional representations known as embeddings, which have been shown to improve results in many downstream tasks (Rives et al., 2021; Yang et al., 2018). Thus, embeddings are an initial, lower-dimensional pool of relevant features obtained without the feature engineering process that is needed when using handcrafted features.

ESM models are among the most successfully used in several applications (Lin et al., 2023; Rives et al., 2021), including the classification of antimicrobial peptides (AMPs) (García-Jacas et al., 2022). A new family of ESM models (termed ESM-2) was recently pre-trained (Lin et al., 2023) at scales from 8 million up to 15 billion parameters. Thus, different-dimensional embeddings can be derived from them. According to the results shown in Table S1 of (Lin et al., 2023), the 8-million-parameter model yielded the worst results in the prediction of protein structures, while the 15-billion-parameter model yielded the best. These results are expected, because it is well known that increasing the model capacity leads to learning better features. However, this does not imply that the higher the model capacity, the better the results will always be. In this sense, it can be observed in Table S1 of (Lin et al., 2023) that the 3-billion-parameter and 15-billion-parameter models performed very similarly. This suggests studying the impact of the different ESM-2 embeddings on the construction of non-DL based QSAR models, since problems related to the curse of dimensionality can be avoided if good models can be built from the smallest embeddings as well as from the largest ones. It also leads to studying whether the chemical information codified in the smallest embeddings is contained in the largest ones or whether, on the contrary, different ESM-2 embeddings can be profitably combined.

Therefore, this manuscript has two main objectives. First, to study whether significantly better non-DL based models are built as higher-capacity ESM-2 models are used. Second, to study whether better non-DL based models are built when fusing different ESM-2 models. To this end, we used four state-of-the-art datasets recently used to build DL-based models to classify general-AMPs (i.e., not related to a specific antimicrobial activity), antibacterial peptides (ABP), antifungal peptides (AFP), and antiviral peptides (AVP). Building robust QSAR models to identify likely AMPs is one of the tasks carried out in the discovery of new drugs to address antimicrobial resistance (Zhang et al., 2023). To ensure fair comparisons, we built a KNIME workflow that combines different feature selectors and applies different shallow and ensemble classifiers. We performed statistical tests to determine which ESM-2 models are the most suitable, as well as whether the fusion of ESM-2 models is a promising way to build better QSAR models than using a single ESM-2 model. Finally, we compared our best models against the DL-based models originally built on the datasets considered.

2. MATERIALS AND METHODS

2.1. Antimicrobial peptide datasets

We used four state-of-the-art datasets created by Sharma et al. to build DL-based models to predict general-AMPs (model AniAMPpred) (Sharma et al., 2021b), ABPs (model Deep-ABPpred) (Sharma et al., 2021a), AFPs (model Deep-AFPpred) (Sharma et al., 2021c), and AVPs (model Deep-AVPpred) (Sharma et al., 2022). These authors collected the general-AMP sequences from the Protein (Schoch et al., 2020) and StarPep (Aguilera-Mendoza et al., 2019) databases. The ABP sequences were obtained from the Antimicrobial Peptide Database (APD) (Wang et al., 2015), the Data Repository of Antimicrobial Peptides (DRAMP) (Kang et al., 2019), and the Milk Antimicrobial Peptides Database (MilkAMP) (Théolier et al., 2014). The AFP sequences were acquired from the CAMP (Waghu et al., 2015), DRAMP (Kang et al., 2019), and StarPep (Aguilera-Mendoza et al., 2019) databases. Finally, the AVP sequences were obtained from the AVPpred (Thakur et al., 2012), DBAASP (Pirtskhalava et al., 2015), DRAMP (Kang et al., 2019), SATPDB (Singh et al., 2015), and StarPep (Aguilera-Mendoza et al., 2019) databases. As there is no repository containing peptide sequences with evidenced lack of antimicrobial activity, Sharma et al. built the negative sequences from reviewed, manually annotated protein sequences available in the Universal Protein Resource (UniProt) database (Consortium, T.U, 2018). The queries used to recover these sequences did not contain keywords such as antimicrobial, antifungal, antibacterial, antiviral, antibiotic, anti-toxin, antitumor, or defensin, among others (see (Sharma et al., 2021a; Sharma et al., 2021b; Sharma et al., 2021c; Sharma et al., 2022) for more details). The four considered datasets were divided into training and test sets (see Table 1) by their authors, and these same partitions were used in this work to ensure comparability of results. Sharma et al. created a validation set for the hyperparameter tuning of the Deep-AVPpred model (Sharma et al., 2022); this validation set was not used in this work. Data S1 of the Supporting Information contains the FASTA files of the four aforementioned sets.

TABLE 1.

Training, validation, and test sets of the antimicrobial peptide datasets used in this work.

| Dataset | Training (Total / Positive / Negative) | Validation (tuning) (Total / Positive / Negative) | Test (Total / Positive / Negative) |
|---|---|---|---|
| SharmaAMP (Sharma et al., 2021b) | 13,430 / 6657 / 6773 | – | 7179 / 3530 / 3649 |
| SharmaABP (Sharma et al., 2021a) | 3120 / 1635 / 1485 | – | 9816 / 4017 / 5799 |
| SharmaAFP (Sharma et al., 2021c) | 4124 / 2062 / 2062 | – | 2758 / 1379 / 1379 |
| SharmaAVP (Sharma et al., 2022) | 4908 / 2454 / 2454 | 1636 / 818 / 818 | 1636 / 818 / 818 |

Abbreviations: ABP, antibacterial peptides; AFP, antifungal peptides; AMP, general antimicrobial peptides; AVP, antiviral peptides.

2.2. Evolutionary scale modeling (ESM)‐2 embeddings

The new family of ESM-2 models comprises six pretrained models with 6, 12, 30, 33, 36, and 48 layers, which scale up to 8 million, 35 million, 150 million, 650 million, 3 billion, and 15 billion parameters, respectively. These ESM-2 models were trained on approximately 65 million unique sequences (Lin et al., 2023), more than twice the number of unique sequences used to train the previous ESM-1b model (27.1 million representative sequences) (Rives et al., 2021). We extracted 320-, 480-, 640-, 1280-, and 2560-dimensional embeddings for each dataset described in Table 1 with the 6-, 12-, 30-, 33-, and 36-layer models, respectively. Each embedding is calculated by averaging the per-amino-acid embeddings of the final layer obtained for the corresponding peptide sequence. These embeddings constitute the lower-dimensional feature spaces used as the initial feature pool in the modeling workflow explained below. Because of hardware limitations (NVIDIA RTX A5500, 24 GB), we were unable to use the 48-layer model to extract 5120-dimensional embeddings, since it requires more dedicated GPU memory than specified above.
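For illustration, a minimal sketch of this extraction step is shown below, assuming the fair-esm Python package that distributes the pre-trained ESM-2 models; the peptide sequence and variable names are illustrative, and the authors' exact extraction script may differ.

```python
import torch
import esm  # fair-esm package (pip install fair-esm)

# Load the 33-layer, 650M-parameter ESM-2 model (1280-dimensional embeddings)
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Toy peptide sequence (hypothetical example, not taken from the datasets)
data = [("peptide_1", "GIGKFLHSAKKFGKAFVGEIMNS")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)
per_residue = out["representations"][33]  # shape: (batch, seq_len + 2, 1280)

# Average the per-amino-acid embeddings of the final layer,
# skipping the BOS/EOS tokens, to get one vector per sequence
embedding = per_residue[0, 1 : len(strs[0]) + 1].mean(dim=0)  # shape: (1280,)
```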

2.3. KNIME workflow to build non‐deep learning‐based models by fusing feature selectors and individual classifiers

The workflow was implemented with the open-source software Konstanz Information Miner (KNIME) v4.7.2 (available at https://www.knime.com/), exploiting nodes both from the KNIME Analytics Platform and from the KNIME Extensions and Integrations (e.g., Python scripts and the classifiers available in the Weka framework v7.0). Hence, in addition to following the KNIME installation guidelines, a Python environment must be installed as described in (Martínez-Mauricio & García-Jacas, 2023). KNIME Python scripts were used to implement the relevance and redundancy filters based on Shannon entropy (Godden et al., 2000) and Spearman (or Pearson) correlation (Myers & Sirois, 2006), respectively. To this end, the SciPy library v1.10.1 (Virtanen et al., 2020) was used. KNIME Python scripts were also used to apply six ranking-type selectors available in the scikit-learn (Pedregosa et al., 2011) and scikit-feature (Li et al., 2017) tools, as well as to apply the Correlation-based Feature Subset Selection (CFS) method available in the python-weka-wrapper3 library v0.2.12 (Reutemann, 2022).

The built workflow receives several input arguments. First, the paths of the comma-separated values (CSV) files containing the ESM-2 embeddings calculated on the training and test datasets. Second, the Matthews correlation coefficient (MCC) threshold used to retain the best models. Third, the name of the target variable, which must be the same in the training CSV file and in all the test CSV files. Fourth, the Shannon entropy (SE) threshold used to eliminate irrelevant features. And finally, the Spearman and Pearson correlation-based thresholds used to remove redundant features and models, respectively. These input arguments, as well as the arguments required to run KNIME from the command line, are described in (Martínez-Mauricio & García-Jacas, 2023). It is important to highlight that this workflow was implemented in a generic way to automatically build binary classification models from any tabular dataset, not only models to classify AMPs and their functional types from the ESM-2 embeddings. The workflow operates as explained below.

On the training CSV file, an imputation step is first performed, replacing missing values with zero. After that, a kurtosis-based filter is applied to remove constant or near-constant features. The higher the kurtosis value, the heavier the tail of the distribution; kurtosis values greater than 3 indicate leptokurtic distributions, whereas very high values indicate sharply peaked (near-constant) distributions. We used a threshold equal to 11 to filter out features with high kurtosis values. Afterwards, an unsupervised relevance filter based on SE is applied to retain those features whose SE values are greater than 25% of the maximum SE that a feature can achieve. The higher the SE value, the better the ability of a feature to discriminate between different instances (Godden et al., 2000). The number of discretization bins is equal to the number of instances in each training dataset; thus, the maximum SE of a feature in the general-AMP, ABP, AFP, and AVP datasets is equal to 13.71, 11.61, 12.01, and 12.26 bits, respectively. After applying this SE-based relevance filter, correlated features are removed using a Spearman correlation-based redundancy filter (Myers & Sirois, 2006) with a threshold equal to 0.95.
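A minimal sketch of these pre-processing filters is given below, assuming NumPy/SciPy arrays (the workflow itself implements them as KNIME Python scripts using SciPy); the function name, signature, and greedy redundancy pass are illustrative.

```python
import numpy as np
from scipy.stats import entropy, kurtosis, spearmanr

def preprocess(X, se_fraction=0.25, rho_max=0.95, kurt_max=11):
    """Sketch of the unsupervised filters: kurtosis-based removal of
    near-constant features, Shannon-entropy relevance, and Spearman
    redundancy. X: (n_samples, n_features) array, missing values imputed to 0."""
    n = X.shape[0]
    max_se = np.log2(n)  # maximum entropy when discretizing into n bins
    relevant = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if kurtosis(col, fisher=False) > kurt_max:   # near-constant feature
            continue
        counts, _ = np.histogram(col, bins=n)        # one bin per instance
        if entropy(counts[counts > 0], base=2) > se_fraction * max_se:
            relevant.append(j)
    # redundancy filter: greedily keep features whose absolute Spearman
    # correlation with every already-kept feature is at most rho_max
    kept = []
    for j in relevant:
        if all(abs(spearmanr(X[:, j], X[:, k]).correlation) <= rho_max
               for k in kept):
            kept.append(j)
    return kept
```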

Once the prior pre-processing phase finishes, a supervised feature selection is carried out by applying the CFS selector (Hall, 2000) and the Relief-F (Robnik-Šikonja & Kononenko, 2003; Urbanowicz et al., 2018), ANOVA F-value (Pedregosa et al., 2011), Chi-square (Pedregosa et al., 2011), Gini index (Li et al., 2017), t-score (Li et al., 2017), and mutual information (Pedregosa et al., 2011) ranking-type selectors. The size of the subset obtained with the CFS selector determines the number of features to be selected by the ranking-type selectors. Then, a total of 23 additional feature subsets are created as follows. First, by joining the outputs of all the individual selectors. Second, by joining each of the C(7,2) = 21 pairwise combinations of the feature subsets obtained with the seven previous selectors. And third, by selecting the best features according to the average ranking, calculated by averaging the individual rank achieved by each feature across the ranking selectors initially applied; the lower the average ranking, the more important the feature. The importance of fusing the outputs of several feature selectors has been widely discussed elsewhere (Bolón-Canedo & Alonso-Betanzos, 2019; Pes, 2020).
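As an illustration, the 23 additional subsets could be assembled as in the following sketch, assuming each selector's output is available as a Python set and each ranking selector's output as a feature-to-rank mapping; all names, and the penalty assigned to unranked features, are illustrative.

```python
from itertools import combinations
import numpy as np

def fused_subsets(rankings, subsets, k):
    """rankings: dict {ranking_selector: {feature: rank}};
    subsets: dict {selector: set of selected features} for all seven selectors;
    k: number of features to keep from the average ranking (CFS subset size)."""
    names = list(subsets)
    extra = [set().union(*subsets.values())]           # join of all outputs (1)
    extra += [subsets[a] | subsets[b]                  # pairwise joins, C(7,2) = 21
              for a, b in combinations(names, 2)]
    feats = set().union(*(r.keys() for r in rankings.values()))
    # assumption: a feature missing from a ranking gets the worst possible rank
    avg = {f: np.mean([r.get(f, len(feats)) for r in rankings.values()])
           for f in feats}
    extra.append(set(sorted(avg, key=avg.get)[:k]))    # best average ranking (1)
    return extra                                       # 1 + 21 + 1 = 23 subsets
```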

Afterwards, a wrapper-based selection (Kohavi & John, 1997) is performed on each of the 30 subsets created. These wrappers use the genetic algorithm (GA) metaheuristic as the search strategy, accuracy as the fitness measure, and the random forest (RF) (Breiman, 2001), J48 (Sahu & Mehtre, 2015), reduced error pruning tree (REPTree) (Shahdad & Saber, 2022), k-nearest neighbors (k-NN) (Maleki et al., 2021), support vector machine (SVM) (Chauhan et al., 2019), random tree (Geurts et al., 2006), and Bayes net (Ben-Gal, 2007) learning methods. The KNIME Weka nodes were used to implement these wrappers. The GA-based search strategy and the RF, J48, REPTree, random tree, and Bayes net classifiers were applied with their default settings. The k-NN classifier is applied using the three distance-weighting schemes available in the KNIME node, and the best k value is determined through hold-one-out cross-validation. Moreover, the SVM classifier is applied using the Pearson VII function-based universal (Puk) kernel (Üstün et al., 2006) and the polynomial kernel (PolyKernel) (Pande et al., 2023). After the wrapper-based selection, several models are created by applying all the classifiers to all the feature subsets; for the wrapper-derived subsets, only the classifier used to obtain each subset is used to build the corresponding model. No hyperparameter tuning is carried out.
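The following toy sketch illustrates a GA-based wrapper of this kind, here with scikit-learn's random forest and cross-validated accuracy as the fitness; the actual workflow relies on the Weka implementations with default settings, so the selection, crossover, and mutation operators below are assumptions rather than the exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(seed=0)

def ga_feature_wrapper(X, y, n_generations=20, pop_size=30, cv=5):
    """Toy GA wrapper: each individual is a binary inclusion mask over features."""
    n_feat = X.shape[1]
    population = rng.integers(0, 2, size=(pop_size, n_feat)).astype(bool)

    def fitness(mask):
        if not mask.any():
            return 0.0
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        return cross_val_score(clf, X[:, mask], y, cv=cv,
                               scoring="accuracy").mean()

    for _ in range(n_generations):
        scores = np.array([fitness(m) for m in population])
        # truncation selection: keep the better half as parents
        parents = population[np.argsort(scores)[-(pop_size // 2):]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, n_feat))              # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_feat) < (1.0 / n_feat)    # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, np.array(children)])

    scores = np.array([fitness(m) for m in population])
    return population[scores.argmax()]  # best feature mask found
```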

The sensitivity (SN), specificity (SP), accuracy (ACC), and MCC metrics are calculated after applying 10-fold cross-validation (10cv), and they are also calculated to assess the generalization ability on the test CSV file(s). The models with an MCC10cv value greater than the MCC threshold given as an input argument are selected to be assessed on the test file(s), and the classifiers and feature subsets used to build those models are reused to build models based on the Bagging (Breiman, 1996), AdaBoost (Ding et al., 2022), Random Committee (Niranjan et al., 2019), and LogitBoost (Zhang & Fang, 2007) ensemble learners. Table S1 shows which ensemble algorithm is used for each base classifier. All the ensemble-based models are also evaluated on the test file(s). Finally, redundant models are removed using a Pearson correlation-based filter when their predictions are correlated above the threshold given as an input parameter. The workflow output is a directory containing all the feature subsets and models built, as well as several graphics comparing the best, non-redundant models. Files summarizing the training and test results are created as well, together with text files summarizing the execution times. All the input CSV files are normalized using a min-max strategy before performing the steps implemented in the proposed workflow. Figure 1 shows the implementation of the methodology explained above.
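For reference, the four metrics can be computed from the confusion matrix as in the sketch below (shown here with scikit-learn; the workflow computes them internally, so the function is illustrative).

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def sn_sp_acc_mcc(y_true, y_pred):
    """Sensitivity, specificity, accuracy, and Matthews correlation coefficient."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sn = tp / (tp + fn)                     # true-positive rate (sensitivity)
    sp = tn / (tn + fp)                     # true-negative rate (specificity)
    acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    mcc = matthews_corrcoef(y_true, y_pred)
    return sn, sp, acc, mcc
```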

FIGURE 1. Representation of the modeling methodology implemented in the KNIME workflow.

3. RESULTS AND DISCUSSION

We applied the built KNIME workflow to the 320-, 480-, 640-, 1280-, and 2560-dimensional ESM-2 embeddings extracted from each training set. Only the models with MCC10cv values greater than (or equal to) 0.9 for classifying general-AMPs and ABPs, and greater than (or equal to) 0.8 and 0.75 for predicting AFPs and AVPs, respectively, were retained for further analyses. These MCC10cv thresholds were established so as to study only the models with comparable-to-superior performance metrics relative to the original models (AniAMPpred [Sharma et al., 2021b], Deep-ABPpred [Sharma et al., 2021a], Deep-AFPpred [Sharma et al., 2021c], and Deep-AVPpred [Sharma et al., 2022]), that is, the models proposed in the works where the datasets used here were introduced. The MCC10cv, ACC10cv, SN10cv, and SP10cv values of all the models are shown in Data S2.

3.1. Contribution of each of the ESM‐2 embeddings to the development of models

In this section, we assess the contribution of each of the ESM-2 embeddings in terms of the number of models built and the number of features used to create them. Figure 2a (and Data S2) shows, per endpoint, the total number of models created from the ESM-2 embeddings. It can be observed that the number of models with MCC10cv values greater than the thresholds used ranges between 400 and 1150, except for the ABP endpoint, where 170 and 197 were the two smallest numbers of created models. This suggests that the implemented KNIME workflow is able to automatically create several models with good goodness-of-fit (variance). Additionally, it can be noted from Figure 2a that the largest numbers of models were mostly built from the 640-, 1280-, and 2560-dimensional embeddings, which are calculated with the highest-capacity ESM-2_t30 (30 layers), ESM-2_t33 (33 layers), and ESM-2_t36 (36 layers) models, respectively. For the AVP endpoint, the fewest models were built from the 2560-dimensional embeddings, whereas for the other three endpoints, the fewest models were derived from the 320- and 480-dimensional embeddings, which are calculated with the lowest-capacity ESM-2_t6 (6 layers) and ESM-2_t12 (12 layers) models, respectively. Notice that, independently of the endpoint, the 1280-dimensional embeddings showed the best behavior regarding the number of built models.

FIGURE 2. Bar and boxplot graphics related to (a) the number of models built and (b) the total number of features used by them to predict general-AMP, ABP, AFP, and AVP sequences, respectively.

Moreover, Figure 2b shows, per endpoint and ESM-2 embedding, boxplots of the number of features used in the built models (see Table S2 for more details). It can be observed that the larger the embedding used as the initial feature pool, the greater the number of features included in the models. This implies that, although more and better (see Section 3.2) models can be built from the largest embeddings, these models are more complex, in terms of the number of features, than the ones built from the smallest embeddings. The models built from the 2560-dimensional embeddings are the only ones that use more than 200 features. In this regard, 90.4% of all the models used fewer than 150 features, whereas 5.68% used between 150 and 200 features. This indicates that only a part of the embeddings was valuable for modeling the endpoints considered in this work.

In that sense, on the one hand, Figure 3 shows how frequently the features derived from the ESM-2 models were used in the built models. These frequencies were calculated from the feature subsets built to train the classifiers; a subset used several times to train different classifiers was counted once. The frequencies shown in Figure 3 were calculated by summing the frequency of use of each feature across all endpoints, and Table S3 shows the frequency values for each endpoint. As a result, several gaps can be seen in Figure 3, demonstrating that many features in the ESM-2 embeddings were never selected to build a model, either because they did not carry relevant chemical information or because their information was redundant with respect to other features. Indeed, when analyzing all the ESM-2 embeddings together, 3156 (59.8%) out of the 5280 total features were never used, 1014 (19.2%) features were used between 1 and 100 times, 596 (11.3%) features were used between 100 and 200 times, and 514 (9.73%) features were used 200 times or more.
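A sketch of this counting procedure is shown below, assuming the selected subsets are available as iterables of feature identifiers; as described above, a subset reused by several classifiers is counted once. The function name and data layout are illustrative.

```python
from collections import Counter

def feature_frequencies(subsets_per_endpoint):
    """subsets_per_endpoint: dict {endpoint: list of feature subsets (iterables)}.
    Returns the overall frequency of use of each feature across all endpoints."""
    total = Counter()
    for subsets in subsets_per_endpoint.values():
        # a subset reused to train several classifiers is considered once
        unique = {frozenset(s) for s in subsets}
        for s in unique:
            total.update(s)
    return total  # features never appearing have an implicit count of zero
```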

FIGURE 3. Frequency of use, in the built models, of the features belonging to each of the ESM-2 embeddings.

On the other hand, Table 2 shows, per ESM-2 embedding, the total number of features used at least once for modeling each endpoint. As can be seen, this number ranges between 59 and 428 features, representing between 10.47% and 34.38% of the dimension of the corresponding embedding. It can also be seen that a similar number of features was always used from the 1280- and 2560-dimensional embeddings for modeling the four endpoints, whereas there were larger differences in the numbers of features used from the other ESM-2 embeddings. Finally, when considering all the endpoints together, more than 49% of the features belonging to the three smallest embeddings were required for modeling, whereas less than 40% of the features belonging to the two largest embeddings were used. That is, 43.44%, 47.5%, 50.47%, 60.94%, and 65.86% of the chemical information codified in the 320-, 480-, 640-, 1280-, and 2560-dimensional ESM-2 embeddings, respectively, was useless (or redundant) for the modeling tasks carried out.

TABLE 2.

Number of features belonging to the ESM‐2 embeddings that were used at least once to build the models to predict general‐AMP, ABP, AFP, and AVP sequences.

| Embeddings (dimension) | AMP a | ABP a | AFP a | AVP a | Total a |
|---|---|---|---|---|---|
| ESM-2_t6 (320) | 59 (18.43%) | 76 (23.75%) | 69 (21.56%) | 110 (34.38%) | 181 (56.56%) |
| ESM-2_t12 (480) | 91 (18.96%) | 89 (18.54%) | 113 (23.54%) | 154 (32.83%) | 252 (52.50%) |
| ESM-2_t30 (640) | 101 (15.78%) | 99 (15.47%) | 146 (22.81%) | 188 (29.38%) | 317 (49.53%) |
| ESM-2_t33 (1280) | 201 (15.70%) | 198 (15.47%) | 200 (15.63%) | 193 (15.08%) | 500 (39.06%) |
| ESM-2_t36 (2560) | 428 (16.72%) | 314 (12.27%) | 268 (10.47%) | 292 (11.41%) | 874 (34.14%) |

Abbreviations: ABP, antibacterial peptides; AFP, antifungal peptides; AMP, general antimicrobial peptides; AVP, antiviral peptides.

a The percentages relative to the dimension of the embeddings are shown in parentheses.

Overall, it can be concluded that the 1280-dimensional embeddings calculated with the ESM-2_t33 model are those from which the greatest number of models is produced using a similar number of features for all the endpoints, followed by the 2560-dimensional embeddings calculated with the ESM-2_t36 model. Moreover, the findings described above suggest that only a part of the embeddings may be useful for downstream tasks, which would help avoid the curse of dimensionality when using small chemical datasets. However, this hypothesis needs deeper studies, which are out of the scope of this manuscript.

3.2. Analysis of each of the ESM‐2 embeddings regarding the generalization ability of the built models

In this section, we discuss the generalization ability on the test sets (see Table 1) achieved by the best models automatically built with the implemented KNIME workflow. The best models were those with MCC10cv values greater than the input MCC threshold. This study was performed to analyze from which ESM-2 embeddings the models with the best generalization abilities were built. The MCCtest, ACCtest, SNtest, and SPtest values of the best models are shown in Data S2. As a result (see Table S4), almost all the models achieved MCCtest values greater than 0.8 in the prediction of general-AMPs, with the models derived from the three largest embeddings generally obtaining MCCtest values greater than 0.9. In the prediction of ABPs and AFPs, the models mainly yielded MCCtest values between 0.7 and 0.9; for these two endpoints, the largest numbers of models with MCCtest values greater than 0.8 were built from the 2560-dimensional embeddings. Finally, it can be noted in Table S4 that almost all the models achieved MCCtest values between 0.7 and 0.8 when predicting AVPs. These results demonstrate that the KNIME workflow is able to automatically produce models with both remarkable goodness-of-fit (variance) and generalization ability (bias), considering that all the models achieved MCC10cv values greater than 0.9 to predict general-AMPs and ABPs, and MCC10cv values greater than 0.8 and 0.75 to predict AFPs and AVPs, respectively. Thus, it can be concluded that the models are neither overfitting nor underfitting.

Additionally, Figure 4 shows, per endpoint and ESM-2 embedding, boxplots of the MCCtest values achieved by the best created models. It can be seen that, for the prediction of general-AMPs, the highest MCCtest values were yielded by the models derived from the 2560-dimensional (ESM-2_t36) embeddings, followed by the models derived from the 1280-dimensional (ESM-2_t33) embeddings. However, the models generated from the 2560-dimensional embeddings used a much greater number of features for this endpoint, as shown in Figure 2b. In the prediction of ABPs, the highest MCCtest values were obtained by the models built from the 1280-dimensional embeddings; indeed, 50% of the MCCtest values achieved by those models were better than approximately 75% of the MCCtest values achieved by the models with the second-best performance, which were built from the 640-dimensional (ESM-2_t30) embeddings. As for the prediction of AFPs and AVPs, models with MCCtest values as high as the ones obtained from the three largest embeddings were also created from the 320- (ESM-2_t6) and 480-dimensional (ESM-2_t12) embeddings. Also notice that, to predict AFPs, the models generated from the 1280-dimensional embeddings generally performed worse than the models developed from the 640- and 2560-dimensional embeddings, although in the prediction of AVPs the former were generally better than the latter.

FIGURE 4. Boxplots of the MCCtest values obtained on the test sets by the models built to predict general-AMP, ABP, AFP, and AVP sequences.

A statistical analysis based on Bayesian estimation was additionally performed to determine the probability of building better models from one embedding compared with another (Benavoli et al., 2017). Figure 5 shows the barycentric coordinates of the probability distributions obtained through Monte Carlo sampling. Three regions are distinguished in each plot: a top region (denoted as rope) representing practical equivalence (no difference), and two lower regions representing the models built from the two embeddings being compared. If the probability distribution tends towards one of the two lower regions, then the models represented by that region are more likely to be better than the models represented by the other region. Only the results for the three largest embeddings are depicted in Figure 5; the results for the other embedding pairs are shown in Data S3. A rope equal to 0.01 was used.
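This kind of analysis can be reproduced with, for example, the baycomp package that accompanies Benavoli et al. (2017); the following is a hypothetical sketch with toy, paired MCCtest values (the variable names and values are illustrative, not taken from the paper's results).

```python
import numpy as np
import baycomp  # pip install baycomp

# Toy, paired MCC_test values for models built from two different embeddings
mcc_emb_a = np.array([0.91, 0.93, 0.90, 0.92, 0.94, 0.89, 0.92])
mcc_emb_b = np.array([0.88, 0.92, 0.90, 0.91, 0.93, 0.90, 0.89])

# Probabilities that A is better, that the difference lies within the
# region of practical equivalence (rope = 0.01), and that B is better
p_a, p_rope, p_b = baycomp.two_on_single(mcc_emb_a, mcc_emb_b, rope=0.01)
print(f"P(A better) = {p_a:.3f}, P(rope) = {p_rope:.3f}, P(B better) = {p_b:.3f}")
```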

FIGURE 5. Barycentric coordinates corresponding to the Bayesian estimation analysis used to study from which ESM-2 embeddings there is a greater probability of yielding the best models to predict general-AMPs (first row), ABPs (second row), AFPs (third row), and AVPs (fourth row).

As can be observed in Figure 5, there is no difference between the three largest embeddings when building models to predict general-AMPs (first row), indicating that any of those embeddings is suitable for modeling that endpoint. For the prediction of ABPs (second row) and AVPs (fourth row), the highest probabilities correspond to the models built from the 1280-dimensional (ESM-2_t33) embeddings. In the prediction of ABPs, also notice that the models built from the 640-dimensional (ESM-2_t30) embeddings have a greater probability of being better than the models built from the 2560-dimensional (ESM-2_t36) embeddings, whereas in the prediction of AVPs there is no difference between them. Finally, in the prediction of AFPs (third row), the models built from the 640- and 2560-dimensional embeddings have a greater probability of achieving better MCCtest values than the models derived from the 1280-dimensional embeddings, whereas there is no difference between the models created from the former two.

Overall, it can be concluded that the 640- (ESM-2_t30) and 1280-dimensional (ESM-2_t33) embeddings seem to be the most suitable for modeling. However, it is important to highlight that models as good as the ones created from these embeddings were also built from the other embeddings. This suggests that combining different embeddings (feature ensemble) could be a promising way to build better models, which is studied below.

3.3. Fusing ESM‐2 embeddings to improve the generalization ability

Herein, we assessed whether the fusion of features belonging to different ESM-2 embeddings is a suitable strategy to build better QSAR models. To this end, we selected, per endpoint, the best 50 models considering all the models together, that is, regardless of the embedding from which they were built. From each of these pools (one per endpoint) of top-50 non-fusion based models, we selected the best model per embedding and combined the features included in them. In this way, to predict general-AMPs and ABPs, we combined the features included in the best model created from each of the 640- and 1280-dimensional embeddings. We also fused the features used in the best model built from each of the 320-, 480-, 640-, and 1280-dimensional embeddings to predict AFPs. Lastly, we fused the features used in the best model built from each of the five embeddings to predict AVPs. Data S2 contains the fused training and test sets. The KNIME workflow was applied to the fused sets, and the best 50 models per endpoint (see Table S6) were selected according to their MCCtest values, in the same way as the best non-fusion based models.
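A sketch of this fusion step is shown below, assuming pandas DataFrames holding each embedding's features (plus the class label) and, per embedding, the list of columns used by its best model; all names are illustrative.

```python
import pandas as pd

def fuse_feature_sets(frames, best_columns, target="class"):
    """frames: dict {embedding_name: DataFrame with that embedding's features
    and the target column, rows aligned across embeddings};
    best_columns: dict {embedding_name: columns used by its best model}."""
    parts = [frames[name][cols].add_prefix(f"{name}_")
             for name, cols in best_columns.items()]
    fused = pd.concat(parts, axis=1)
    # all frames describe the same peptides, so any of them provides the labels
    fused[target] = next(iter(frames.values()))[target].to_numpy()
    return fused
```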

Table 3 shows the average MCC10cv, SNtest, SPtest, ACCtest, and MCCtest values corresponding to the best 50 models built when fusing and when not fusing embeddings; these metrics are also shown for the best model of each type. The number of features used by the best model, as well as the average number of features used by the best 50 models, are specified as well. On the one hand, regarding the number of features, it can first be observed that simpler models to predict general-AMPs and ABPs were built from the fused feature sets. In this sense, the best fused feature-based models to predict general-AMPs and ABPs used 82 and 42 features, respectively, representing 44.22% and 41.67% fewer features than the best non-fusion based models. Similar conclusions can be drawn from the average number of features used to predict these two endpoints. Moreover, more complex models were mostly created from the fused feature sets to predict AFPs, using 10% more features on average. In the prediction of AVPs, although the best fused feature-based model used 43 (52.44%) more features than the best non-fusion based model, the average number of features used by the fused feature-based models was slightly lower than that used by the non-fusion based models.

TABLE 3.

Performances of the top‐1 (and top‐50) non‐fusion based models and the top‐1 (and top‐50) fused feature‐based models. The standard deviation values are shown between parentheses.

| Endpoint | Model | No. features | MCC10cv | SNtest | SPtest | ACCtest | MCCtest |
|---|---|---|---|---|---|---|---|
| General-AMPs | Best non-fused-feature model | 147 | 0.9510 | 0.9669 | 0.9811 | 0.9741 | 0.9482 |
| General-AMPs | Top-50 non-fused-feature models | 109 (31) | 0.9465 (0.003) | 0.9614 (0.006) | 0.9826 (0.006) | 0.9722 (0.001) | 0.9447 (0.001) |
| General-AMPs | Best fused-feature model | 82 | 0.9418 | 0.9646 | 0.9819 | 0.9734 | 0.9469 |
| General-AMPs | Top-50 fused-feature models | 77 (25) | 0.9434 (0.003) | 0.9629 (0.005) | 0.9790 (0.005) | 0.9711 (0.001) | 0.9423 (0.002) |
| ABPs | Best non-fused-feature model | 72 | 0.9242 | 0.9457 | 0.9514 | 0.9491 | 0.8950 |
| ABPs | Top-50 non-fused-feature models | 80 (31) | 0.9200 (0.007) | 0.9417 (0.015) | 0.9421 (0.010) | 0.9419 (0.003) | 0.8808 (0.005) |
| ABPs | Best fused-feature model | 42 | 0.9190 | 0.9435 | 0.9602 | 0.9533 | 0.9035 |
| ABPs | Top-50 fused-feature models | 59 (20) | 0.9219 (0.008) | 0.9459 (0.015) | 0.9435 (0.014) | 0.9445 (0.004) | 0.8863 (0.007) |
| AFPs | Best non-fused-feature model | 64 | 0.8444 | 0.8970 | 0.9550 | 0.9260 | 0.8535 |
| AFPs | Top-50 non-fused-feature models | 70 (26) | 0.8319 (0.009) | 0.8963 (0.011) | 0.9445 (0.010) | 0.9204 (0.002) | 0.8420 (0.003) |
| AFPs | Best fused-feature model | 93 | 0.8517 | 0.9181 | 0.9449 | 0.9315 | 0.8633 |
| AFPs | Top-50 fused-feature models | 77 (19) | 0.8382 (0.008) | 0.9039 (0.009) | 0.9453 (0.008) | 0.9246 (0.002) | 0.8500 (0.004) |
| AVPs | Best non-fused-feature model | 82 | 0.7911 | 0.8594 | 0.9352 | 0.8973 | 0.7969 |
| AVPs | Top-50 non-fused-feature models | 104 (41) | 0.7838 (0.012) | 0.8910 (0.018) | 0.8957 (0.019) | 0.8933 (0.002) | 0.7872 (0.004) |
| AVPs | Best fused-feature model | 125 | 0.7830 | 0.8900 | 0.9156 | 0.9028 | 0.8059 |
| AVPs | Top-50 fused-feature models | 102 (24) | 0.7777 (0.008) | 0.8927 (0.008) | 0.9077 (0.009) | 0.9002 (0.002) | 0.8006 (0.003) |

Abbreviations: ABP, antibacterial peptides; AFP, antifungal peptides; AMP, general antimicrobial peptides; AVP, antiviral peptides.

On the other hand, regarding the generalization abilities (MCCtest metric), the performance of the fused feature-based models is slightly lower than that of the non-fusion based models in predicting general-AMPs; indeed, in absolute terms, the differences between the highest and the average performances of those models are equal to 0.0013 and 0.0024, respectively. However, for prospective studies, the most suitable models to predict general-AMPs are the ones based on the fusion of embeddings, because they use fewer features than, and perform very similarly to, the non-fusion based models. This is in correspondence with Occam's razor (Lazar, 2010): among QSAR models with similar performances, the simplest one should be used. Similar conclusions can be drawn for the fused feature-based models to predict ABPs; in this case, the models achieved better MCCtest values than the non-fusion based models and were always simpler in terms of the number of features. Lastly, notice that the best fused feature-based model is the only one that achieved an MCCtest value greater than 0.9 (also see Table S6).

As for the prediction of AFPs and AVPs, the best MCCtest values were yielded by the fused feature-based models, but using a greater number of features, as discussed before. The performance of the best fused feature-based model was better than that of the best non-fusion based model by 1.15% (0.0098 in absolute terms) and 1.13% (0.009) in the prediction of AFPs and AVPs, respectively. Also, the top-50 fused feature-based models were, on average, better than the top-50 non-fusion based models by 0.95% (0.008) and 1.7% (0.0134) in the prediction of AFPs and AVPs, respectively. However, it is important to remark that simpler yet comparable models can also be built from the fused feature sets to predict AFPs and AVPs. In this sense, Table S6 shows that the sixth and seventh best fused feature-based models to predict AFPs used 41 and 60 features and achieved MCCtest values equal to 0.8541 and 0.8537, respectively. In addition, the twelfth, eighteenth, and thirty-second best fused feature-based models to predict AVPs used 83, 63, and 56 features and achieved MCCtest values equal to 0.8037, 0.8012, and 0.7984, respectively. These indicators are similar to or better than the ones obtained by the best non-fusion based model to predict AFPs (64 features, MCCtest = 0.8535) and AVPs (82 features, MCCtest = 0.7969), respectively. Overall, it can be concluded that fusing different embeddings contributes to getting better QSAR models than using a single ESM-2 embedding.

3.4. Comparative analysis regarding state‐of‐the‐art deep learning architectures

In this section, we compare the highest generalization ability achieved by the built models with that achieved by the DL-based models that were built in the works where the datasets used here were introduced. To this end, we considered the best model created per endpoint, both from the individual embeddings and from the fused feature sets (see Table 3). The DL-based models are AniAMPpred (Sharma et al., 2021b), Deep-ABPpred (Sharma et al., 2021a), Deep-AFPpred (Sharma et al., 2021c), and Deep-AVPpred (Sharma et al., 2022), and, as their names indicate, they were built to predict general-AMPs, ABPs, AFPs, and AVPs, respectively. The AniAMPpred model (Sharma et al., 2021b) uses a one-dimensional convolutional neural network (1D-CNN). The Deep-ABPpred model (Sharma et al., 2021a) is based on a bidirectional long short-term memory (Bi-LSTM) recurrent neural network, whereas Deep-AFPpred (Sharma et al., 2021c) uses a 1D-CNN followed by a Bi-LSTM for the classification. Finally, the Deep-AVPpred model (Sharma et al., 2022) is based on a CNN.

Figure 6 shows a comparison in terms of the SNtest, SPtest, ACCtest, and MCCtest metrics. On the one hand, both the best non-fusion based model and the best fused feature-based model achieved better performances than AniAMPpred regarding the SNtest, ACCtest, and MCCtest metrics in the prediction of general-AMPs; AniAMPpred was only marginally better than the other two models regarding the SPtest metric. AniAMPpred uses 200 features learned through a 1D-CNN architecture to train five individual learners, making decisions by majority voting, whereas 82 and 147 ESM-2 features were used to create the best fused feature-based model and the best non-fusion based model, respectively. Thus, these ESM-2 feature-based models used 59% and 26.5% fewer features while performing better predictions than AniAMPpred. On the other hand, regarding the prediction of ABPs, Deep-ABPpred was better than the best fused feature-based model (and the best non-fusion based model) by 1.96% (1.72%), 0.49% (0.94%), and 1.16% (2.12%) regarding the SNtest, ACCtest, and MCCtest metrics, respectively, but was slightly worse than the best fused feature-based model by 0.54% regarding the SPtest metric. The best fused feature-based model used 42 features, whereas the best non-fusion based model used 72 features, and both were trained with the SVM classifier, which is simpler than the Bi-LSTM architecture used by Deep-ABPpred. So, according to Occam's razor (Lazar, 2010), the fused feature-based model would be deemed the most suitable for further studies.

FIGURE 6. Comparisons between state-of-the-art deep learning architectures and the best models built in this work to predict general-AMPs, ABPs, AFPs, and AVPs.

As for the prediction of AFPs, the best fused feature-based model performed better than Deep-AFPpred (and the best non-fusion based model) by 0.23% (2.35%), 0.51% (0.59%), and 1.09% (1.15%) regarding the SNtest, ACCtest, and MCCtest metrics, respectively. However, the best non-fusion based model achieved the highest SPtest value, outperforming the best fused feature-based model and Deep-AFPpred by 1.07% and 1.81%, respectively. Regarding the AVP predictions, Deep-AVPpred obtained a better performance than the other two models under study, outperforming the best fused feature-based model and the best non-fusion based model by 4.85% and 6.04%, respectively, according to the MCCtest metric. This result was also obtained in the benchmarking studies between non-DL based and DL-based models reported in (Garcia-Jacas et al., 2022), where Deep-AVPpred was always better than all the non-DL based models created with features calculated with the ProtDCal and iLearn tools. It is important to mention, however, that a validation set was used for hyperparameter tuning when building Deep-AVPpred, whereas no hyperparameter tuning was performed for the models built with the KNIME workflow. This could be why the models built to predict AVPs did not outperform Deep-AVPpred.

These results confirm those reported in (Garcia-Jacas et al., 2022; García-Jacas et al., 2022), where it was demonstrated that non-DL based models achieve comparable-to-superior performances to DL-based models in the prediction of AMPs when methodologically principled studies are performed. Several works claiming that DL-based models improve AMP classification, often by narrow performance margins, present modeling biases related to the quality and diversity of the sequence-based protein features, the lack of feature selection approaches, and the poor use of ensemble classifiers. To avoid some of these biases, as well as to propose new non-DL based models, we built the KNIME workflow to encompass several of the methodological aspects to be considered when building non-DL based models from tabular data, allowing hundreds or thousands of non-DL based models to be obtained in a reasonable time. For instance, the time required to build the best 900, 150, 500, and 400 models to predict general-AMPs, ABPs, AFPs, and AVPs from the 2560-dimensional embeddings was approximately 137, 6, 13, and 17 hours, respectively. This demands less effort and fewer computational resources than building DL-based models. The results achieved with the modeling strategies implemented in the KNIME workflow are presented below.

3.5. Contribution of the feature selection strategies and classifiers applied

In this section, we discuss the contribution of each of the feature selection strategies and learning algorithms in terms of the number of built models and their generalization abilities. This study did not include the results obtained when fusing the embeddings, but only those obtained from each embedding separately, and the analysis was performed considering all the endpoints together. Figure 7 depicts the number of models created from each feature selection strategy. As can be observed, all the feature selectors contributed to generating a large number of models; the only strategy that produced fewer than 250 models is the one based on the Gini index (Li et al., 2017). The strategy that contributed the most was the join of the seven individual selectors, followed by the fusion of the subsets obtained with the CFS method (Hall, 2000) with the subsets obtained with the mutual information (MI) (Pedregosa et al., 2011), Relief-F (Robnik-Šikonja & Kononenko, 2003; Urbanowicz et al., 2018), and Gini index (Li et al., 2017) ranking-type selectors, in that order. Notice that the selection strategy based on the average-ranking fusion had a comparable-to-inferior performance regarding the other strategies, suggesting that averaging is not an effective way to combine different individual rankings.

FIGURE 7. Number of models created from each feature selection strategy implemented in the KNIME workflow.

Moreover, Figure 8 shows, per individual and ensemble classifier, the total number of built models, as well as how many of them achieved a performance within a specific MCCtest value range. On the one hand, it can first be noted in Figure 8a that only 2 out of the 6 classification algorithms (RF is considered an ensemble classifier) contributed to developing good models. Specifically, the k-NN and SVM algorithms, with their three weighting schemes and two kernels, respectively, were the most useful, with the SVM classifier with the Puk kernel contributing the most (more than 1000 models). It can also be observed that more than 1500 models were developed with the three k-NN classifiers (one per weighting scheme) together, which is greater than the number of models built with the two SVM-based classifiers together. Nonetheless, Figure 8b reveals that the models based on the SVM classifier with the Puk kernel achieved the highest generalization abilities; indeed, more than 600 models with MCCtest values between 0.8 and 1 were yielded by that classifier, whereas between 300 and 400 models in that MCCtest range were obtained with each of the three k-NN classifiers.

FIGURE 8. Bar graphs representing the number of models yielded by the best individual (a, b) and ensemble (c, d) classifiers, both globally and per MCCtest value range.

Additionally, Figure 8c shows the number of models built with each ensemble classifier. As can be noted, the greatest numbers of models were derived from the AdaBoost and Bagging (not including RF) ensemble classifiers, with 2810 and 2552 models, respectively. This result reflects the fact that AdaBoost was applied to all the weak classifiers, whereas Bagging was applied to 8 out of the 9 weak classifiers (see Table S1 for specific details). LogitBoost was only applied to 4 out of the 8 weak classifiers accounted for. Notice that the RF classifier was the worst of all in terms of the number of built models, even worse than 4 out of the 5 individual classifiers represented in Figure 8a. Regarding the generalization abilities, Figure 8d shows that the AdaBoost and Bagging ensemble classifiers perform similarly, yielding 1691 and 1634 models with MCCtest values between 0.8 and 1, respectively; the LogitBoost and RF ensemble classifiers yielded 766 and 267 models in that range, respectively. Finally, notice in Figure S1 that similar numbers of models with MCCtest values between 0.8 and 1 were built with and without wrapper-type selectors.

4. CONCLUSIONS

This work focused on assessing the usefulness of the different-dimensional embeddings derived from the Evolutionary Scale Modeling (ESM-2) models. To this end, we implemented a KNIME workflow that guarantees using the same modeling methodology, thus enabling fair comparisons. We analyzed the contribution of each of the ESM-2 embeddings in terms of the number of models created from each of them, the number of features used in the models, and the performance of the models built. As a result, the 1280-dimensional embeddings calculated with the ESM-2_t33 model are those from which the greatest number of models can be built using a similar number of features for all the endpoints. Additionally, it can be concluded that the 640- and 1280-dimensional embeddings are the most appropriate for modeling, since statistically better results were achieved with them. Nonetheless, these results do not imply that good models cannot be obtained from the other ESM-2 embeddings. For that reason, we combined features from different embeddings and concluded that fusing features of different embeddings contributes to getting better models than using a single ESM-2 model.

Moreover, we carried out comparisons against state-of-the-art deep learning models. In this regard, we can conclude that when performing methodologically principled studies in the prediction of AMPs, non-DL based models yield comparable-to-superior performances to DL-based models, which is in correspondence with results reported elsewhere (Garcia-Jacas et al., 2022; García-Jacas et al., 2022). The KNIME workflow is freely available at https://github.com/cicese-biocom/classification-QSAR-bioKom, and we consider that it can be a suitable tool to prevent unfair comparisons when proposing new computational methods, as well as to develop new non-DL based models. This workflow implements several of the methodological aspects to be considered when building non-DL based models from tabular data, and it allows hundreds (or thousands) of non-DL based models to be obtained in a reasonable time.

AUTHOR CONTRIBUTIONS

César R. García‐Jacas: Conceptualization; methodology; supervision; validation; investigation; funding acquisition; writing – original draft; writing – review and editing; formal analysis; visualization. Karla L. Martínez‐Mauricio: Methodology; software; validation; visualization. Greneter Cordoves‐Delgado: Conceptualization; validation; resources.

FUNDING INFORMATION

Consejo Nacional de Ciencia y Tecnología (CONACYT)-funded project 320658, under the program "Ciencia Básica y/o Ciencia de Frontera, Modalidad: Paradigmas y Controversias de la Ciencia 2022".

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

Supporting information

Data S1. Supporting Information.

PRO-33-e4928-s001.docx (47.7KB, docx)

Figure S1. Bar graphs representing the number of models yielded when applying wrapper‐type feature selection approaches.

PRO-33-e4928-s002.pdf (4.5KB, pdf)

Data S2. Supporting Information.

PRO-33-e4928-s003.zip (1.3GB, zip)

ACKNOWLEDGMENTS

CRGJ acknowledges the program "Cátedras CONACYT" of the "Consejo Nacional de Ciencia y Tecnología (CONACYT), México" for supporting the endowed chair 501/2018 at the "Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE)". CRGJ also acknowledges the support of CONACYT under grant 320658.

Biographies

Karla L. Martínez‐Mauricio, MSc in Computer Science, Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico; BSc in Applied Mathematics (2019), Universidad Autónoma de Aguascalientes, Mexico.

César R. García‐Jacas, Conacyt Researcher at the Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico. He received his PhD in Technical Science (2015) and M.Sc. in Computer Science (2013) from the Universidad Central “Marta Abreu” de las Villas, Cuba; as well as his B.Eng. in Informatic Science (2009) from the Universidad de las Ciencias Informáticas, La Habana, Cuba.

Greneter Cordoves-Delgado, MSc student in Computer Science, Department of Computer Science, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Baja California, Mexico; B.Eng. in Informatics (2011), Universidad de Cienfuegos, Cuba.

Martínez‐Mauricio KL, García‐Jacas CR, Cordoves‐Delgado G. Examining evolutionary scale modeling‐derived different‐dimensional embeddings in the antimicrobial peptide classification through a KNIME workflow. Protein Science. 2024;33(4):e4928. 10.1002/pro.4928

Review Editor: Nir Ben‐Tal

DATA AVAILABILITY STATEMENT

All supplementary data and materials are freely available at https://drive.google.com/drive/folders/1WZ24Y4klj5xnrWjM6IjnkiVu9MBjODLU?usp=sharing. The KNIME workflow is available at https://github.com/cicese-biocom/classification-QSAR-bioKom.

REFERENCES

  1. Acheampong FA, Nunoo‐Mensah H, Chen W. Transformer models for text‐based emotion detection: a review of BERT‐based approaches. Artif Intell Rev. 2021;54(8):5789–5829. [Google Scholar]
  2. Aguilera‐Mendoza L, Marrero‐Ponce Y, Beltran JA, Tellez Ibarra R, Guillen‐Ramirez HA, Brizuela CA. Graph‐based data integration from bioactive peptide databases of pharmaceutical interest: toward an organized collection enabling visual network analysis. Bioinformatics. 2019;35(22):4739–4747. [DOI] [PubMed] [Google Scholar]
  3. Benavoli A, Corani G, Demšar J, Zaffalon M. Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. J Mach Learn Res. 2017;18(77):1–36. [Google Scholar]
  4. BenevolentAI. MolBERT. 2020. https://github.com/BenevolentAI/MolBERT.
  5. Ben‐Gal I. Bayesian networks. In: Encyclopedia of statistics in quality and reliability. John Wiley & Sons, Ltd; 2007. [Google Scholar]
  6. Bolón‐Canedo V, Alonso‐Betanzos A. Ensembles for feature selection: a review and future trends. Inf Fusion. 2019;52:1–12. [Google Scholar]
  7. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–140. [Google Scholar]
  8. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. [Google Scholar]
  9. Cerruela García G, Pérez‐Parras Toledano J, De Haro García A, García‐Pedrajas N. Filter feature selectors in the development of binary QSAR models. SAR QSAR Environ Res. 2019;30(5):313–345. [DOI] [PubMed] [Google Scholar]
  10. Chauhan VK, Dahiya K, Sharma A. Problem formulations and solvers in linear SVM: a review. Artif Intell Rev. 2019;52(2):803–855. [Google Scholar]
  11. The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2018;47(D1):D506–D515. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Devlin J, Chang M‐W, Lee K, Toutanova K. BERT: pre‐training of deep bidirectional transformers for language understanding. arXiv. 2018;12:1–13. [Google Scholar]
  13. Ding Y, Zhu H, Chen R, Li R. An efficient AdaBoost algorithm with the multiple thresholds classification. Appl Sci. 2022;12(12):5872. [Google Scholar]
  14. Fabian B, Edlich T, Gaspar H, Segler M, Meyers J, Fiscato M, Ahmed M. Molecular representation learning with language models and domain‐relevant auxiliary tasks. arXiv. 2020:1–12. https://ml4molecules.github.io/papers2020/ML4Molecules_2020_paper_74.pdf [Google Scholar]
  15. Floridi L, Chiriatti M. GPT‐3: its nature, scope, limits, and consequences. Minds Mach. 2020;30(4):681–694. [Google Scholar]
  16. Fu H, Cao Z, Li M, Wang S. ACEP: improving antimicrobial peptides recognition through automatic feature fusion and amino acid embedding. BMC Genom. 2020;21(1):597. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. García‐Jacas CR, García‐González LA, Martinez‐Rios F, Tapia‐Contreras IP, Brizuela CA. Handcrafted versus non‐handcrafted (self‐supervised) features for the classification of antimicrobial peptides: complementary or redundant? Brief Bioinform. 2022;23(6):1–16. [DOI] [PubMed] [Google Scholar]
  18. García‐Jacas CR, Marrero‐Ponce Y, Vivas‐Reyes R, Suárez‐Lezcano J, Martinez‐Rios F, Terán JE, et al. Distributed and multicore QuBiLS‐MIDAS software v2.0: computing chiral, fuzzy, weighted and truncated geometrical molecular descriptors based on tensor algebra. J Comput Chem. 2020;41(12):1209–1227. [DOI] [PubMed] [Google Scholar]
  19. Garcia‐Jacas CR, Pinacho‐Castellanos SA, García‐González LA, Brizuela CA. Do deep learning models make a difference in the identification of antimicrobial peptides? Brief Bioinform. 2022;23(3):1–16. [DOI] [PubMed] [Google Scholar]
  20. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63(1):3–42. [Google Scholar]
  21. Godden JW, Stahura FL, Bajorath J. Variability of molecular descriptors in compound databases revealed by Shannon entropy calculations. J Chem Inf Comput Sci. 2000;40(3):796–800. [DOI] [PubMed] [Google Scholar]
  22. Hall MA. Correlation‐based feature selection for discrete and numeric class machine learning. In: Langley P, editor. Proceedings of the Seventeenth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann Publishers Inc; 2000. p. 359–366. [Google Scholar]
  23. Irwin R, Dimitriadis S, He J, Bjerrum EJ. Chemformer: a pre‐trained transformer for computational chemistry. Mach Learn‐Sci Technol. 2022;3(1):015022. [Google Scholar]
  24. Kang X, Dong F, Shi C, Liu S, Sun J, Chen J, et al. DRAMP 2.0, an updated data repository of antimicrobial peptides. Sci Data. 2019;6(1):148. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Kim S. Public chemical databases. In: Ranganathan S et al., editors. Encyclopedia of bioinformatics and computational biology. Oxford: Academic Press; 2019. p. 628–639. [Google Scholar]
  26. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2018;47(D1):D1102–D1109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell. 1997;97(1):273–324. [Google Scholar]
  28. Lazar N. Ockham's razor. Wiley Interdiscip Rev Comput Stat. 2010;2(2):243–246. [Google Scholar]
  29. Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, et al. BART: denoising sequence‐to‐sequence pre‐training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics; 2020.
  30. Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H. Feature selection: a data perspective. ACM Comput Surv. 2017;50(6):1–45. [Google Scholar]
  31. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary‐scale prediction of atomic‐level protein structure with a language model. Science. 2023;379(6637):1123–1130. [DOI] [PubMed] [Google Scholar]
  32. Ma Y, Guo Z, Xia B, Zhang Y, Liu X, Yu Y, et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat Biotechnol. 2022;40(6):921–931. [DOI] [PubMed] [Google Scholar]
  33. Maleki N, Zeinali Y, Niaki STA. A k‐NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection. Expert Syst Appl. 2021;164:113981. [Google Scholar]
  34. Manibardo EL, Laña I, Ser JD. Deep learning for road traffic forecasting: does it make a difference? IEEE Trans Intell Transp Syst. 2021;23:6164–6188. 10.1109/TITS.2021.3083957 [DOI] [Google Scholar]
  35. Martínez‐Mauricio KL, Cordoves‐Delgado G, García‐Jacas CR. KNIME workflow to build classification models for QSAR studies. 2023. https://github.com/cicese-biocom/classification-QSAR-bioKom. [DOI] [PMC free article] [PubMed]
  36. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2020;49(D1):D412–D419. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Muratov EN, Bajorath J, Sheridan RP, Tetko IV, Filimonov D, Poroikov V, et al. QSAR without borders. Chem Soc Rev. 2020;49(11):3525–3564. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Myers L, Sirois MJ. Spearman correlation coefficients, differences between. In: Kotz S, et al., editors. Encyclopedia of statistical sciences. Wiley Online Library; 2006. [Google Scholar]
  39. Niranjan A, Haripriya DK, Pooja R, Sarah S, Deepa Shenoy P, Venugopal KR. EKRV: ensemble of kNN and random committee using voting for efficient classification of phishing. Springer Singapore: Singapore; 2019. [Google Scholar]
  40. NVIDIA. MegaMolBART. 2022. https://github.com/NVIDIA/MegaMolBART.
  41. Oyedare T, Park JJ. Estimating the required training dataset size for transmitter classification using deep learning. 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN). 2019.
  42. Pande CB, Kushwaha NL, Orimoloye IR, Kumar R, Abdo HG, Tolche AD, et al. Comparative assessment of improved SVM method under different kernel functions for predicting multi‐scale drought index. Water Resour Manag. 2023;37(3):1367–1399. [Google Scholar]
  43. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V. Scikit‐learn: machine learning in python. J Mach Learn Res. 2011;12:2825–2830. [Google Scholar]
  44. Pes B. Ensemble feature selection for high‐dimensional data: a stability analysis across multiple domains. Neural Comput Applic. 2020;32(10):5951–5973. [Google Scholar]
  45. Pirtskhalava M, Gabrielian A, Cruz P, Griggs HL, Squires RB, Hurt DE, et al. DBAASP v.2: an enhanced database of structure and antimicrobial/cytotoxic activity of natural and synthetic peptides. Nucleic Acids Res. 2015;44(D1):D1104–D1112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre‐training. 2018. https://s3‐us‐west‐2.amazonaws.com/openai‐assets/research‐covers/language‐unsupervised/language_understanding_paper.pdf.
  47. Meta Research. LLaMA: open and efficient foundation language models. 2023. https://research.facebook.com/publications/llama‐open‐and‐efficient‐foundation‐language‐models/.
  48. Reutemann P. Python wrapper for the Java machine learning workbench Weka using the python‐javabridge library. 2022. https://pypi.org/project/python-weka-wrapper3/.
  49. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118(15):e2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Robnik‐Šikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn. 2003;53(1):23–69. [Google Scholar]
  51. Romero‐Molina S, Ruiz‐Blanco YB, Green JR, Sanchez‐Garcia E. ProtDCal‐suite: a web server for the numerical codification and functional analysis of proteins. Protein Sci. 2019;28(9):1734–1743. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Sahu S, Mehtre BM. Network intrusion detection system using J48 decision tree. In: 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI); 2015.
  53. Schoch CL, Ciufo S, Domrachev M, Hotton CL, Kannan S, Khovanskaya R, et al. NCBI taxonomy: a comprehensive update on curation, resources and tools. Database. 2020;2020:1–21. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Shahdad M, Saber B. Drought forecasting using new advanced ensemble‐based models of reduced error pruning tree. Acta Geophys. 2022;70(2):697–712. [Google Scholar]
  55. Sharma R, Shrivastava S, Kumar Singh S, Kumar A, Saxena S, Kumar Singh R. Deep‐ABPpred: identifying antibacterial peptides in protein sequences using bidirectional LSTM with word2vec. Brief Bioinform. 2021a;22(5):1–19. [DOI] [PubMed] [Google Scholar]
  56. Sharma R, Shrivastava S, Kumar Singh S, Kumar A, Saxena S, Kumar Singh R. AniAMPpred: artificial intelligence guided discovery of novel antimicrobial peptides in animal kingdom. Brief Bioinform. 2021b;22(6):1–23. [DOI] [PubMed] [Google Scholar]
  57. Sharma R, Shrivastava S, Kumar Singh S, Kumar A, Saxena S, Kumar Singh R. Deep‐AFPpred: identifying novel antifungal peptides using pretrained embeddings from seq2vec with 1DCNN‐BiLSTM. Brief Bioinform. 2021c;23(1):1–16. [DOI] [PubMed] [Google Scholar]
  58. Sharma R, Shrivastava S, Singh SK, Kumar A, Singh AK, Saxena S. Deep‐AVPpred: artificial intelligence driven discovery of peptide drugs for viral infections. IEEE J Biomed Health Inform. 2022;26(10):5067–5074. [DOI] [PubMed] [Google Scholar]
  59. Singh S, Chaudhary K, Dhanda SK, Bhalla S, Usmani SS, Gautam A, et al. SATPdb: a database of structurally annotated therapeutic peptides. Nucleic Acids Res. 2015;44(D1):D1119–D1126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Singh V, Shrivastava S, Kumar Singh S, Kumar A, Saxena S. StaBle‐ABPpred: a stacked ensemble predictor based on biLSTM and attention mechanism for accelerated discovery of antibacterial peptides. Brief Bioinform. 2021;23(1):1–17. [DOI] [PubMed] [Google Scholar]
  61. Su X, Xu J, Yin Y, Quan X, Zhang H. Antimicrobial peptide identification using multi‐scale convolutional network. BMC Bioinf. 2019;20(1):730. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Thakur N, Qureshi A, Kumar M. AVPpred: collection and prediction of highly effective antiviral peptides. Nucleic Acids Res. 2012;40(W1):W199–W204. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Théolier J, Fliss I, Jean J, Hammami R. MilkAMP: a comprehensive database of antimicrobial peptides of dairy origin. Dairy Sci Technol. 2014;94(2):181–193. [Google Scholar]
  64. Todeschini R, Consonni V. Methods and principles in medicinal chemistry. In: Mannhold R, Kubinyi H, Folkers G, editors. Handbook of molecular descriptors. Volume 11. 1st ed. Weinheim, Germany: WILEY‐VCH Verlag GmbH; 2009. p. 667. [Google Scholar]
  65. Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH. Relief‐based feature selection: introduction and review. J Biomed Inform. 2018;85:189–203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Üstün B, Melssen WJ, Buydens LMC. Facilitating the application of support vector regression by using a universal Pearson VII function based kernel. Chemom Intel Lab Syst. 2006;81(1):29–40. [Google Scholar]
  67. Valdés‐Martiní JR, Marrero‐Ponce Y, García‐Jacas CR, Martinez‐Mayorga K, Barigye SJ, Vaz d'Almeida YS, et al. QuBiLS‐MAS, open source multi‐platform software for atom‐ and bond‐based topological (2D) and chiral (2.5D) algebraic molecular descriptors computations. J Chem. 2017;9(35):1–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics. 2018;34(16):2740–2747. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in python. Nat Methods. 2020;17(3):261–272. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Waghu FH, Barai RS, Gurung P, Idicula‐Thomas S. CAMPR3: a database on sequences, structures and signatures of antimicrobial peptides. Nucleic Acids Res. 2015;44(D1):D1094–D1097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Wang G, Li X, Wang Z. APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 2015;44(D1):D1087–D1093. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Xu Y, Dai Z, Chen F, Gao S, Pei J, Lai L. Deep learning for drug‐induced liver injury. J Chem Inf Model. 2015;55(10):2085–2093. [DOI] [PubMed] [Google Scholar]
  73. Xu Y, Pei J, Lai L. Deep learning based regression and multiclass models for acute Oral toxicity prediction with automatic chemical feature extraction. J Chem Inf Model. 2017;57(11):2672–2685. [DOI] [PubMed] [Google Scholar]
  74. Yan J, Bhadra P, Li A, Sethiya P, Qin L, Tai HK, Wong KH, Siu SWI. Deep‐AmPEP30: improve short antimicrobial peptides prediction with deep learning. Mol Ther Nucleic Acids. 2020;20:882–894. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Yang KK, Wu Z, Bedbrook CN, Arnold FH. Learned protein embeddings for machine learning. Bioinformatics. 2018;34(15):2642–2648. [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Zhang G, Fang B. LogitBoost classifier for discriminating thermophilic and mesophilic proteins. J Biotechnol. 2007;127(3):417–424. [DOI] [PubMed] [Google Scholar]
  77. Zhang Y, Su JQ, Liao H, Breed MF, Yao H, Shangguan H, et al. Increasing antimicrobial resistance and potential human bacterial pathogens in an invasive land snail driven by urbanization. Environ Sci Technol. 2023;57(18):7273–7284. [DOI] [PubMed] [Google Scholar]
