Abstract
Background & objectives:
Discovery of new antibiotics is urgently needed to treat infectious diseases. An ever-increasing repertoire of multidrug-resistant pathogens poses an imminent threat to human lives across the globe. However, the low success rate of existing approaches and technologies for antibiotic discovery remains a major bottleneck. In silico methods such as machine learning (ML) appear more promising than conventional experimental approaches for meeting these challenges. The goal of this study was to develop ML models that can predict new antimicrobial compounds.
Methods:
In this article, we employed eight different ML algorithms namely, extreme gradient boosting, random forest, gradient boosting classifier, deep neural network, support vector machine, multilayer perceptron, decision tree, and logistic regression. These models were trained using a dataset comprising 312 antibiotic drugs and a negative set of 936 non-antibiotic drugs in a five-fold cross validation approach.
Results:
The top four ML classifiers (extreme gradient boosting, random forest, gradient boosting classifier and deep neural network) were able to achieve an accuracy of 80 per cent and above during the evaluation of testing and blind datasets.
Interpretation & conclusions:
We aggregated the top performing four models through a soft-voting technique to develop an ensemble-based ML method and incorporated it into a freely accessible online prediction server named ABDpred (http://clinicalmedicinessd.com.in/abdpred/).
Keywords: Extreme gradient boosting, five-fold cross-validation, machine learning, multidrug resistant, neural network, random forest
Nearly a century after the discovery of penicillin, the human race is facing one of the greatest threats to its existence from pathogenic microorganisms that have developed resistance against currently available antibiotics and are rapidly spreading worldwide1. As the widely acclaimed ‘golden era’ of antibiotic discovery came to an end, the rapid emergence and spread of multidrug-resistant (MDR) and extensively drug-resistant bacterial pathogens, the so-called ‘superbugs’, outpaced new antibiotic discovery, magnifying manifold the threat from difficult-to-treat and potentially untreatable infections2. The most common bacterial pathogens causing such infections include Mycobacterium tuberculosis, methicillin-resistant Staphylococcus aureus and vancomycin-resistant enterococci2. However, during the past three decades only two new classes of antibiotics (oxazolidinones and cyclic lipopeptides) have been added to our armamentarium, both of which are active exclusively against Gram-positive bacteria2.
Bacterial resistance to antibiotics may be intrinsic or acquired, and multiple mechanisms could underlie its development. Intrinsic antibiotic resistance is generally chromosome encoded and may involve efflux pumps (often non-specific), antibiotic-inactivating enzymes or increased permeability barriers. Acquired resistance, on the other hand, is mediated by horizontal gene transfer and includes plasmid-encoded efflux pumps (specific) and antibiotic- or target-modifying enzymes3. The latter mechanisms are of major concern to human health because of the risk of enhanced expression and dissemination of the resistance determinants3. The timeline of the appearance of antibiotic resistance suggests that it develops naturally, but misuse of antibiotics accelerates the process. Thus, in many cases, resistance was reported at the same time as, or even before, the use of the first dose of an antibiotic4. However, it started to spread rapidly several years after the antibiotic was approved for clinical application.
Several new approaches have been taken towards antibiotic discovery by invoking high throughput experimental techniques and high-tech computational platforms and prediction tools5. A high throughput, target-based screening protocol has been used extensively to identify pathogen-specific genes and validate them as drug targets. As this technique failed to discover new antibiotics, ‘reverse genomics’ was introduced, in which compounds were first screened for antibacterial activity6. The identification of targets in bacteria and understanding of their working mechanisms using chemical genetics, fluorescence-based high-content screening and ultraviolet resonance-based methods provide an opportunity for improving leads and their commercial success. These strategies sparked a strong interest in phenotypic assays, which may provide hope for discovering newer antibiotics7. For example, teixobactin was recently identified as a novel antibiotic from soil bacteria using iChip-based technology5. Quantitative structure-activity relationship models used statistical methods and successfully predicted antibacterial functions of 1,2,4-oxadiazole and 5-oxo and 5-thio derivatives of the 1,4-disubstituted tetrazole class of compounds; however, they are unable to screen large and diverse chemical libraries8.
Over the last two decades, ML techniques have been used to discover lead candidates. These techniques offer a promising way to curtail both the time and cost of developing new classes of antibiotics compared with experimental approaches. García-Domenech’s9 group developed linear discriminant analysis and artificial neural network models with a prediction success of over 90 per cent. Jaén-Oltra’s10 group built a neural network model to predict the antibacterial activity of fluoroquinolones. Yang’s11 group showed that a support vector classification model was able to predict antibacterial compounds with an accuracy of over 96 per cent. Ivanenkov’s12 group used three top-performing ML models, including a neural network-based self-organizing map, a feedforward neural network and a support vector machine, to predict the antibacterial activity of compounds. Thirteen antibacterial compounds were successfully predicted by this approach and subsequently validated by high throughput screening and other chemical assays. In 2020, Stokes’s13 group built a deep learning model that used atom and bond features of 2335 compounds from the ZINC15 database and predicted halicin (SU 3327), a known antidiabetic drug, to possess potent antibacterial activity. This was confirmed by in vitro and in vivo assays against Mycobacterium tuberculosis, carbapenem-resistant Enterobacteriaceae, Clostridioides difficile and pan-resistant Acinetobacter baumannii13.
Despite the current need for discovering new antibacterial drugs, no ML models other than the Chemprop server are publicly available for further use to accelerate antibiotic discovery13. However, this webserver only provides inhibitory concentrations of the input chemicals against E. coli. We report here the development of ML models, using antibiotic drugs from DrugBank as the positive set and Food and Drug Administration (FDA)-approved drugs without antibiotic activity as the negative set (Fig. 1). Five-fold cross-validation, learning parameter tuning and threshold tuning of different ML classifiers were applied to optimize the performance of the models. A blind dataset from DrugBank and an independent set of chemicals from the National Cancer Institute (NCI) were subsequently screened using the best-performing models. The consensus prediction results and scores were computed using a soft-voting technique. Finally, we compared our ML models with the existing models and embedded them in the ABDpred server (available at http://clinicalmedicinessd.com.in/abdpred/) to provide a simple and reliable tool for predicting antibiotics active against a wide range of Gram-positive and Gram-negative bacteria from large chemical databases.
Fig. 1.

ABDpred workflow (http://clinicalmedicinessd.com.in/abdpred/).
Material & Methods
This study was undertaken at the division of Clinical Medicine, ICMR-National Institute of Cholera and Enteric Diseases, Kolkata, West Bengal, India between June 2021 and August 2022. It employed a prediction modelling approach using data archived in different databases. ML classifiers such as gradient boosting, random forest, deep learning and support vector machine have previously been employed by different research groups10,11,12; the learning parameter tuning, cross-validation approach, regularization and training with different positive-to-negative ratios were performed exclusively in this study. Furthermore, the ABDpred prediction server was developed to provide the output of the trained ML models through an HTML framework.
Dataset: For the training, testing and blind datasets, a total of 390 antibiotic drugs were curated from the DrugBank database (https://go.drugbank.com/) and 1170 FDA-approved drugs were randomly chosen from the FDA drug database (https://www.fda.gov/drugs, accessed on June 23, 2021). The positive dataset (class 1) comprised antibiotic drugs, while the negative dataset (class 0) was composed of randomly collected, FDA-approved drugs that had no antibiotic, antifungal or antibacterial activity reported in the ChEMBL database14. No two drugs were exactly the same in the training and testing datasets. Multidimensional scaling (MDS) in two-dimensional (2D) space was done separately for the positive and negative classes with a chosen Tanimoto similarity cut-off value of 0.4 using the ChemMine tool15. All datasets (positive and negative) were divided into 80 per cent for training and 20 per cent for blind validation. The models were developed using a positive set of 312 antibiotic drugs and a negative set of 936 non-antibiotic drugs. We randomly selected data from the total dataset by varying the ratio of positives to negatives to 1:1, 1:2 and 1:3 to avoid bias towards either class; a sketch of this split is given below. In the same manner, the blind set was created with the same three ratios of positive to negative sets (78 positive drugs and 234 negative drugs). The blind set of drugs was never used in training or testing and played a crucial role in identifying the best mathematical model among all the developed models. An independent dataset consisting of 173,714 small chemicals from the NCI was curated for drug screening purposes and to evaluate the performance of the best models16.
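The split and class-ratio construction described above can be sketched as follows; the file names, column layout and random seed are illustrative assumptions rather than the authors' exact code.

```python
# Illustrative sketch of the 80/20 split and the 1:1, 1:2 and 1:3 positive:negative
# mixes described above (file names and the random seed are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split

positives = pd.read_csv("antibiotic_drugs.csv")       # 390 DrugBank antibiotics, class 1
negatives = pd.read_csv("non_antibiotic_drugs.csv")   # 1170 FDA-approved non-antibiotics, class 0

# Hold out 20% of each class as the blind set (78 positives, 234 negatives)
pos_train, pos_blind = train_test_split(positives, test_size=0.2, random_state=42)
neg_train, neg_blind = train_test_split(negatives, test_size=0.2, random_state=42)

def mix(pos, neg, ratio, seed=42):
    """Return a shuffled dataset with a positive:negative ratio of 1:ratio."""
    neg_sample = neg.sample(n=min(len(pos) * ratio, len(neg)), random_state=seed)
    return pd.concat([pos, neg_sample]).sample(frac=1, random_state=seed)

train_1_1 = mix(pos_train, neg_train, 1)   # 312 positives : 312 negatives
train_1_2 = mix(pos_train, neg_train, 2)
train_1_3 = mix(pos_train, neg_train, 3)
```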
Molecular descriptors as features: Molecular descriptors, including physicochemical and structural properties, have been widely used in drug discovery research to characterize drugs and small chemicals17. The features of antibiotic and non-antibiotic drugs were computed using the Python modules PaDELPy and PubChemPy18,19, which returned vectors of 1440 and 17 features, respectively. The 1440 descriptors available in the PaDELPy module included acidic group count, ALogP, APol, aromatic atom count, aromatic bond count, atom count, autocorrelation, Barysz matrix and basic group count. PubChemPy provided 17 important descriptors that help characterize molecules: molecular weight, XLogP3-AA, hydrogen bond donor count, hydrogen bond acceptor count, rotatable bond count, exact mass, monoisotopic mass, topological polar surface area, heavy atom count, formal charge, complexity, isotope atom count, defined atom stereocenter count, undefined atom stereocenter count, defined bond stereocenter count, undefined bond stereocenter count and covalently bonded unit count. Together, these descriptors provide a thorough picture of the molecular structure and characteristics of a given compound. Three well-known feature selection methods, namely univariate, L1-based and tree-based selection, were used to train and test the models feature-wise (17 and 1440 features) as well as at different ratios (1:1, 1:2 and 1:3) of the training and testing sets; the models were simultaneously tested on the blind dataset to obtain the best-optimized models with selected features. For both vector lengths (17 and 1440), models were also built without any feature selection (all features). In addition, feature importance was assessed based on ranks using a supervised ML technique, the ExtraTreesClassifier20. A sketch of the descriptor generation and feature-selection steps is given below.
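A minimal sketch of the descriptor generation and the three feature-selection routes, assuming a labelled feature matrix X and class vector y have already been assembled; the helper function name, example SMILES and percentile value are illustrative.

```python
import pubchempy as pcp
from padelpy import from_smiles
from sklearn.feature_selection import SelectPercentile, chi2, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier

def pubchem_features(cid):
    """The 17 PubChem descriptors listed above, fetched for one compound."""
    c = pcp.Compound.from_cid(cid)
    return [c.molecular_weight, c.xlogp, c.h_bond_donor_count, c.h_bond_acceptor_count,
            c.rotatable_bond_count, c.exact_mass, c.monoisotopic_mass, c.tpsa,
            c.heavy_atom_count, c.charge, c.complexity, c.isotope_atom_count,
            c.defined_atom_stereo_count, c.undefined_atom_stereo_count,
            c.defined_bond_stereo_count, c.undefined_bond_stereo_count,
            c.covalent_unit_count]

# PaDEL 2D descriptors (~1440 values) from a SMILES string
padel_descriptors = from_smiles("CC(=O)OC1=CC=CC=C1C(=O)O")

# The three feature-selection strategies compared in this work
univariate = SelectPercentile(chi2, percentile=30)     # chi2 requires non-negative features
l1_based   = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))
tree_based = SelectFromModel(ExtraTreesClassifier(n_estimators=100))
# e.g. X_selected = univariate.fit_transform(X, y)
```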
Five-fold cross-validation (5-fold CV): Five-fold cross-validation (5-fold CV) is a popular technique for selecting the best-trained model: in five iterations, 80 per cent of the samples are used for training and the remaining 20 per cent for testing, and the average score across folds is taken as the final performance score of the trained model. The main purpose of 5-fold CV here was to identify the best learning parameters of each ML algorithm and thereby select the best model. The blind dataset was also screened using the best-performing models based on sensitivity, specificity, accuracy, precision, F1 score, Matthews correlation coefficient (MCC) and area under the ROC curve (AUC). These metrics are computed from the 2×2 confusion matrix, which contains true positives (TP), true negatives (TN), false negatives (FN) and false positives (FP):
Sensitivity = TP/(TP+FN);
Specificity= TN/(TN+FP);
Precision = TP/(TP+FP);
Accuracy = (TP+TN)/(TP+TN+FP+FN);
F1 score = 2TP/(2TP+FP+FN);
MCC = (TP*TN-FP*FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN))
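The cross-validation loop and the metrics above can be sketched with scikit-learn as follows; X and y are assumed to be NumPy arrays and clf stands for any of the classifiers described below.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluate(clf, X, y, n_splits=5, threshold=0.5):
    """Average the per-fold metrics defined above over a stratified 5-fold CV."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    folds = []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]
        pred = (prob >= threshold).astype(int)
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
        folds.append({
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "precision":   tp / (tp + fp),
            "accuracy":    (tp + tn) / (tp + tn + fp + fn),
            "f1":          2 * tp / (2 * tp + fp + fn),
            "mcc":         matthews_corrcoef(y[test_idx], pred),
            "auc":         roc_auc_score(y[test_idx], prob),
        })
    return {k: np.mean([f[k] for f in folds]) for k in folds[0]}
```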
Machine learning classifiers: The identification of antibiotic drugs can be viewed as a binary classification system, with class 1 representing drugs with antibiotic activity and class 0 representing drugs with non-antibiotic activity. Different mathematical models were developed in 5-fold CV using the supervised classifiers, such as extreme gradient boosting, random forest (RF), gradient boosting classifier (GBC), deep neural network, support vector machine (SVM), multilayer perceptron, decision tree (DT) and logistic regression (LR), to screen antibiotics from the non-antibiotic drugs available in the public domain (ChEMBL)14. Each of the classifiers tuned its specific parameters iteratively to get the optimized scores. The chances of over-fitting were also minimized using 5-fold CV and regularization parameters (L1 and L2 regularization).
Extreme gradient boosting (XGBoost): In this study, the supervised ensemble XGBoost algorithm was used as a classifier model. It is a high-performing, multi-layer decision tree (DT)-based method that remains relatively interpretable. During 5-fold CV, XGBoost was trained with the following parameter settings: maximum depth of 5 (max_depth), learning rate of 0.1 (learning_rate), number of tree estimators set to 1000 (n_estimators), L2 and L1 regularization parameters set to 1 (reg_lambda) and 0 (reg_alpha), respectively, and a subsample of 0.8 to prevent overfitting. These parameters and the threshold values were tuned with a grid search to optimize the predicted model performance.
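A sketch of the XGBoost set-up with the parameter values quoted above; the grid-search ranges shown are illustrative and not the authors' exact search space.

```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

xgb = XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=1000,
                    reg_lambda=1, reg_alpha=0, subsample=0.8)

# Grid search over a few plausible values (the actual grid is not reported)
grid = GridSearchCV(xgb,
                    param_grid={"max_depth": [3, 5, 7],
                                "learning_rate": [0.05, 0.1, 0.2]},
                    scoring="roc_auc", cv=5)
# grid.fit(X_train, y_train); grid.best_params_ gives the optimized settings
```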
Random forest (RF): RF is another supervised ensemble ML classifier that classifies by constructing multiple DTs. During training, several uncorrelated trees form a committee-like environment that has a higher predictive ability than individual trees. In 5-fold CV, the RF parameters were set as follows: number of estimators (n_estimators) of 200, random state (random_state) of 92, bootstrap enabled and max_features set to sqrt, to obtain better model performance. Different threshold values were tuned simultaneously, and the RF parameters were also explored with the grid-search technique to find their optimum values.
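The reported RF settings translate directly to scikit-learn, for example:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=92,
                            bootstrap=True, max_features="sqrt")
# rf.fit(X_train, y_train); class probabilities from rf.predict_proba(X_test)[:, 1]
# are then thresholded (0.3-0.8 were explored; 0.5 performed best).
```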
Gradient boosting classifier (GBC): GBC is an ensemble ML classifier that builds many DTs of fixed size, with boosting providing high-quality learning across the trees. The parameters set during 5-fold CV of GBC included a random state (random_state) of 42, loss function set to deviance, learning rate of 0.1 and n_estimators set to 100, which were chosen to obtain better model performance.
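A corresponding scikit-learn sketch with the reported GBC settings:

```python
from sklearn.ensemble import GradientBoostingClassifier

# loss="deviance" matches the setting reported here; scikit-learn >=1.1 names it "log_loss"
gbc = GradientBoostingClassifier(loss="deviance", learning_rate=0.1,
                                 n_estimators=100, random_state=42)
```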
Deep neural network (DNN): The DNN classifier is a subtype of artificial neural network (ANN) used in artificial intelligence. A DNN consists of several layers of neurons, including input, hidden and output layers. It is a robust and powerful tool that is nowadays applied extensively to solve complex classification problems. The ‘TensorFlow (version 2.4.0)’ and ‘Keras’ deep-learning Python packages were used to predict antibiotic and non-antibiotic drugs. The DNN was built as a sequential network model with dense layers. Several parameters, such as the kernel initializer (set to uniform), activation functions (relu and sigmoid), optimizer (adam), loss function (binary_crossentropy) and dropout (0.45), were used to train and test the model with 5-fold CV.
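A minimal Keras sequential network reflecting the layer types and parameters mentioned above; the hidden-layer sizes (64 and 32) are assumptions, as they are not reported.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(64, input_dim=17, kernel_initializer="uniform", activation="relu"),
    Dropout(0.45),
    Dense(32, kernel_initializer="uniform", activation="relu"),
    Dropout(0.45),
    Dense(1, kernel_initializer="uniform", activation="sigmoid"),  # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, ...) is called within each cross-validation fold
```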
Support vector machine (SVM): SVM is a supervised binary ML classifier capable of handling non-linear and complex data using kernel functions (RBF, sigmoid and polynomial). The parameters cost (C) equal to 350, gamma equal to 0.00001 and kernel set to radial basis function (RBF) were used to obtain a well-performing model.
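In scikit-learn terms the reported SVM configuration looks like this; probability=True is needed so that thresholds other than the default can be applied to the predicted probabilities.

```python
from sklearn.svm import SVC

svm = SVC(C=350, gamma=1e-5, kernel="rbf", probability=True)
```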
Multi-layer perceptron (MLP): MLP is a three-layered feedforward neural network with input, hidden and output layers, in which the hidden and output nodes apply an activation function, and training is performed using backpropagation. The parameters hidden layer size and maximum iterations set to 100, random state set to 62 and batch size equal to 320 were used to develop the ML model.
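The corresponding scikit-learn sketch with the reported MLP settings (a single hidden layer of 100 units is assumed from the text):

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=100,
                    random_state=62, batch_size=320)
```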
Decision tree (DT): DT is another popular supervised ML classifier that performs classification using a tree model of decisions, computed based on cost, outcome and utility. A DT consists mainly of three types of nodes: decision, chance and end nodes.
Logistic regression (LR): LR is a probability-based ML classifier that performs binary classification by assigning a probability score between 0 and 1.
Python packages: The features of positive antibiotic drugs and negative drugs were curated using ‘PubChemPy’ and ‘PaDELPy’. The feature selection, model development, model selection, computing model performance, best model selection and best model evaluation were done using the ‘scikit-learn’, ‘tensorflow’, ‘keras’ and ‘XGBoost’ packages in Python 3.8.6.
Webserver: The ABDpred web server was developed to predict antibiotic drugs using the four best-performing ML classifiers: XGBoost, RF, GBC and DNN. The web pages of the server were designed using PHP (7.3.25), HTML and JavaScript. In the back end of the server, Python 3.8.6 was embedded inside CentOS Linux 7 (core). The four ML classifiers were aggregated through a soft-voting technique for better prediction results. The consensus prediction scores (x′ and y′) were computed by taking the mean prediction scores of the four ML classifiers in terms of positiveness and negativeness, i.e. x′ = (1/n) ΣP_i(positive | Z) and y′ = (1/n) ΣP_i(negative | Z), where n is equal to 4 (the top four performing ML classifiers) and Z is the input vector (features). Source codes of ABDpred are available at http://clinicalmedicinessd.com.in/abdpred/download.php
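The soft-voting aggregation can be sketched as below, assuming the four classifiers have been fitted and each exposes a predict_proba-style interface (the Keras DNN would need a thin wrapper):

```python
import numpy as np

def consensus(models, Z):
    """Mean class probabilities of the top four classifiers for feature vectors Z."""
    probs = np.array([m.predict_proba(Z) for m in models])  # shape: (n, samples, 2)
    mean = probs.mean(axis=0)                                # average over the n=4 models
    x_pos = mean[:, 1]   # x' = mean probability of the positive (antibiotic) class
    y_neg = mean[:, 0]   # y' = mean probability of the negative class
    return x_pos, y_neg
```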
Results
Feature importance and molecular descriptor selection: To identify the important features for developing the ML classifier models, different combinations of features (PubChemPy and PaDELPy) and different ratios of positive and negative data (1:1, 1:2 and 1:3) were applied to the training and blind datasets to maximize the area under the curve (AUC). Three feature selection algorithms of the Python ‘scikit-learn’ package, namely the univariate, L1-based and tree-based algorithms, were used for this purpose. With Chi-square-based univariate feature selection (percentile) and the PubChemPy Python module, XGBoost achieved an AUC value of 81.52 per cent on the training set and 45.62 per cent on the blind dataset at a positive:negative (P:N) ratio of 1:1, with a vector length of five selected features (Table I). In contrast, when the univariate feature selection (percentile) method was applied to the PaDELPy module, XGBoost achieved AUC values of 86.63 and 61.97 per cent on the training and blind datasets, respectively, with a vector length of 351 features at an equal (1:1) ratio of positives to negatives. When all features were considered (no feature selection) in the PubChemPy module, XGBoost achieved an AUC value of 88.39 per cent on the training set and 80 per cent on the blind set at a P:N ratio of 1:1, with 17 features. A similar trend was observed with all features in the PaDELPy module. The AUC scores decreased when the ratio of positives to negatives was increased. Interestingly, the AUC scores of the XGBoost model did not increase further when more than 17 features were considered. Similar observations were made with RF and GBC, as shown in Supplementary Tables SI and SII. Using the ExtraTreesClassifier, a powerful ensemble-based ML method, we gained further confidence in the 17 significant features identified, based on their rankings, as shown in Figure 2. The 17 features of the PubChemPy Python module, which achieved a more balanced AUC value on the training as well as the blind dataset, were therefore chosen for model development and analysis.
Table I.
Feature-wise performance of the ensemble tree-based XGBoost classification was evaluated during both training/testing and blind set evaluation through 5-fold cross-validation with different feature selection techniques
| Source of features | Features selection | P:N | Vector length | AUC for training set per cent | AUC for blind set per cent |
|---|---|---|---|---|---|
| PubChemPy | All features | 1:1 | 17 | 88.39 | 80 |
| | Univariate | 1:1 | 5 | 81.52 | 45.62 |
| | L1 based | 1:1 | 14 | 87.37 | 78.75 |
| | Tree based | 1:1 | 6 | 87.89 | 75.62 |
| | All features | 1:2 | 17 | 88.27 | 76.25 |
| | Univariate | 1:2 | 5 | 78.07 | 48.75 |
| | L1 based | 1:2 | 16 | 77.08 | 70 |
| | Tree based | 1:2 | 6 | 85.8 | 68.44 |
| | All features | 1:3 | 17 | 85.32 | 73.96 |
| | Univariate | 1:3 | 5 | 79.43 | 58.33 |
| | L1 based | 1:3 | 15 | 83.93 | 70.83 |
| | Tree based | 1:3 | 6 | 83.23 | 70.2 |
| PaDELPy | All features | 1:1 | 1440 | 87.82 | 78.17 |
| | Univariate | 1:1 | 351 | 86.63 | 61.97 |
| | L1 based | 1:1 | 194 | 87.4 | 79.58 |
| | Tree based | 1:1 | 454 | 87.63 | 76.76 |
| | All features | 1:2 | 1440 | 86.48 | 77.4 |
| | Univariate | 1:2 | 347 | 80.37 | 69.25 |
| | L1 based | 1:2 | 242 | 87.21 | 78.47 |
| | Tree based | 1:2 | 490 | 80.27 | 78.84 |
| | All features | 1:3 | 1440 | 87.84 | 76.52 |
| | Univariate | 1:3 | 347 | 87.25 | 47.7 |
| | L1 based | 1:3 | 247 | 87.44 | 79.14 |
| | Tree based | 1:3 | 462 | 86.68 | 77.02 |
The best performances were highlighted in bold text. AUC, area under the curve
Table SI.
The random forest classifier-based assessment of various features during training/testing and blind set evaluations
| Source of features | Features selection | P:N | Vector length | AUC for training set per cent | AUC for blind set per cent |
|---|---|---|---|---|---|
| PubChemPy | All features | 1:1 | 17 | 88.55 | 80 |
| | Univariate | 1:1 | 5 | 84.15 | 48.12 |
| | L1 based | 1:1 | 14 | 88.5 | 78.75 |
| | Tree based | 1:1 | 6 | 89.51 | 78.75 |
| | All features | 1:2 | 17 | 87.5 | 75.93 |
| | Univariate | 1:2 | 5 | 78.37 | 47.18 |
| | L1 based | 1:2 | 16 | 83.74 | 70 |
| | Tree based | 1:2 | 7 | 84.37 | 72.189 |
| | All features | 1:3 | 17 | 85.54 | 71.2 |
| | Univariate | 1:3 | 5 | 73.94 | 52.5 |
| | L1 based | 1:3 | 15 | 83.37 | 69.79 |
| | Tree based | 1:3 | 7 | 83.23 | 71.67 |
| PaDELPy | All features | 1:1 | 1440 | 88.94 | 82.39 |
| | Univariate | 1:1 | 351 | 86.62 | 50.7 |
| | L1 based | 1:1 | 193 | 86.06 | 79.57 |
| | Tree based | 1:1 | 472 | 88.18 | 79.58 |
| | All features | 1:2 | 1440 | 86.14 | 78.13 |
| | Univariate | 1:2 | 347 | 85.32 | 51.74 |
| | L1 based | 1:2 | 243 | 85.02 | 78.47 |
| | Tree based | 1:2 | 512 | 88.6 | 79 |
| | All features | 1:3 | 1440 | 86 | 77.79 |
| | Univariate | 1:3 | 347 | 83.11 | 50 |
| | L1 based | 1:3 | 244 | 84.17 | 75.87 |
| | Tree based | 1:3 | 477 | 84.53 | 76.38 |
The best performances were highlighted in bold text. Multiple feature selection techniques were utilized, along with 5-fold cross-validation. AUC, area under the curve
Table SII.
The ensemble gradient boosting classifier-based assessment of various features during training/testing and blind set evaluations
| Source of features | Features selection | P:N | Vector length | AUC score % training set | AUC score % blind set |
|---|---|---|---|---|---|
| PubChemPy | All features | 1:1 | 17 | 87.1 | 78.13 |
| | Univariate | 1:1 | 5 | 80.38 | 45.62 |
| | L1 based | 1:1 | 14 | 90.39 | 78.1 |
| | Tree based | 1:1 | 7 | 88.77 | 76.87 |
| | All features | 1:2 | 17 | 87.5 | 73 |
| | Univariate | 1:2 | 5 | 81.53 | 50.31 |
| | L1 based | 1:2 | 16 | 86.27 | 69.06 |
| | Tree based | 1:2 | 6 | 85.01 | 71.56 |
| | All features | 1:3 | 17 | 84.73 | 71.88 |
| | Univariate | 1:3 | 5 | 75.34 | 56.04 |
| | L1 based | 1:3 | 15 | 83.51 | 69.79 |
| | Tree based | 1:3 | 6 | 82.8 | 74.37 |
| PaDELPy | All features | 1:1 | 1440 | 88.77 | 78.16 |
| | Univariate | 1:1 | 351 | 87.84 | 54.93 |
| | L1 based | 1:1 | 142 | 87.4 | 78.87 |
| | Tree based | 1:1 | 443 | 86 | 79 |
| | All features | 1:2 | 1440 | 85.31 | 75.56 |
| | Univariate | 1:2 | 347 | 88 | 70.96 |
| | L1 based | 1:2 | 207 | 88.59 | 75.96 |
| | Tree based | 1:2 | 524 | 89.58 | 78.1 |
| | All features | 1:3 | 1440 | 87.64 | 73.83 |
| | Univariate | 1:3 | 347 | 82.79 | 47.27 |
| | L1 based | 1:3 | 239 | 85.73 | 72.74 |
| | Tree based | 1:3 | 459 | 85.33 | 75.87 |
Multiple feature selection techniques were applied, along with 5-fold cross-validation. The highest-performing features are indicated in bold text.
Fig. 2.

Identification of important features was done using extra tree classifier.
Significant differences between the antibiotic and non-antibiotic drugs were observed for the above features, as shown in Table II. The differences between median and mean values indicated that the data were not normally distributed, an observation that is important because it has been extensively reported in the recent past21. In the 2D MDS plot generated with the ChemMine tool using a Tanimoto similarity cut-off of 0.4 for the positive as well as the negative set, most of the positive drugs clustered below a similarity score of 0.3, while the negative drugs largely clustered below a cut-off value of 0.33 (Supplementary Fig. S1). Distribution plots of all 17 features were generated for both the positive and negative sets (Supplementary Fig. S2).
Table II.
The physicochemical properties of the antibiotic and non-antibiotic drugs were obtained using the PubChemPy module
| Physicochemical properties | Antibiotic drug (n=390), mean | Antibiotic drug (n=390), SD | Non-antibiotic drug (n=1170), mean | Non-antibiotic drug (n=1170), SD | P value, t-test of antibiotic drug (n=390) versus non-antibiotic drug (n=390) |
|---|---|---|---|---|---|
| Molecular weight (g/mol) | 561.72 | 345.15 | 549.57 | 696.25 | 0.09 |
| XLogP3-AA | 0.64 | 3.18 | 1.74 | 4.37 | 0* |
| Hydrogen bond donor count | 4.58 | 4.87 | 4.25 | 9.66 | 0.11 |
| Hydrogen bond acceptor count | 10.73 | 6.11 | 8.44 | 11.85 | 0* |
| Rotatable bond count | 7.54 | 7.18 | 9.55 | 21.15 | 0.12 |
| Exact mass (g/mol) | 561.24 | 344.91 | 549.04 | 696.02 | 0.09 |
| Monoisotopic mass (g/mol) | 561.23 | 344.88 | 548.98 | 695.8 | 0.09 |
| Topological polar surface area (Ų) | 193.3 | 131.78 | 158.84 | 298.27 | 0* |
| Heavy atom count | 38.81 | 24.44 | 36.94 | 48.26 | 0.04* |
| Formal charge | 0.01 | 0.12 | 0.01 | 0.22 | 0.37 |
| Complexity | 987.52 | 744.96 | 843.69 | 1702.96 | 0* |
| Isotope atom count | 0 | 0 | 0.03 | 0.29 | 0* |
| Defined atom stereocenter count | 5.72 | 6.46 | 3.04 | 7.16 | 0* |
| Undefined atom stereocenter count | 0.53 | 2.29 | 0.6 | 2.03 | 0.13 |
| Defined bond stereocenter count | 0.6 | 1.26 | 0.18 | 0.75 | 0* |
| Undefined bond stereocenter count | 0 | 0.05 | 0.01 | 0.21 | 0.5 |
| Covalently-bonded unit count | 1.07 | 0.33 | 1.85 | 1.67 | 0* |
P* <0.05 was considered significant. SD, standard deviation
Performance comparison of different classifiers: We used different classifiers, namely XGBoost, RF, GBC, DNN, SVM, MLP, DT and LR, to develop models and computed their performance measures after parameter tuning. The best results of all ML classifiers are reported here. Different threshold values of these ML classifiers were also evaluated to achieve the most promising and optimized results; all the ML classifiers performed best at a threshold value of 0.5 (Supplementary Table SIII). The performance measures were also compared across the ML classifiers. To evaluate model performance, stratified k-fold computations (k=3, 5 and 10) were conducted simultaneously (Supplementary Table SIV). XGBoost and RF performed better than GBC, DNN, SVM, MLP, DT and LR (Table III). XGBoost achieved sensitivity and specificity values of 87.42 and 89.35 per cent, respectively, with an accuracy of 88.39 per cent, MCC of 76.51 per cent, precision of 89.57 per cent, F1 score of 88.16 per cent and AUC of 88.38 per cent at the threshold value of 0.5. RF achieved sensitivity, specificity, accuracy, MCC and precision values of 87.74, 89.35, 88.55, 77.48 and 89.53 per cent, respectively, with an F1 score of 88.41 per cent and AUC of 88.55 per cent at the threshold of 0.5. Compared to the other ML classifiers, the performance of XGBoost and RF was more balanced (Fig. 3). The MCC values showed that the RF model outperformed XGBoost when the performances of the two models were compared. DTs of XGBoost and RF were generated to interpret the decision rules, using the selected 17 features on 312 antibiotic (positive) and 312 non-antibiotic (negative) drugs (Supplementary Figs S3 and S4).
Table SIII.
The performance of 17 features was measured on the training set using different machine learning classifiers and 5-fold cross-validation while maintaining P:N ratios (1:1) at different thresholds
| Methods / ML classifiers | Threshold value | Vector length | P:N | Sensitivity % | Specificity % | Accuracy % | Precision % | F1 % |
|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.3 | 17 | 1:1 | 90.97 | 85.16 | 88.06 | 86.39 | 88.4 |
| | 0.4 | 17 | 1:1 | 88.42 | 87.09 | 87.9 | 87.71 | 87.96 |
| | 0.5 | 17 | 1:1 | 87.42 | 89.35 | 88.39 | 89.57 | 88.16 |
| | 0.6 | 17 | 1:1 | 86.12 | 91.62 | 88.07 | 91.39 | 88.38 |
| | 0.7 | 17 | 1:1 | 83.87 | 93.54 | 88.7 | 92.92 | 87.92 |
| | 0.8 | 17 | 1:1 | 81.29 | 95.16 | 88.12 | 94.4 | 87.06 |
| RF | 0.3 | 17 | 1:1 | 94.84 | 76.45 | 85.64 | 80.49 | 86.92 |
| | 0.4 | 17 | 1:1 | 91.29 | 84.52 | 87.9 | 85.97 | 88.35 |
| | 0.5 | 17 | 1:1 | 87.74 | 89.35 | 88.55 | 89.53 | 88.41 |
| | 0.6 | 17 | 1:1 | 83.87 | 93.87 | 88.87 | 93.28 | 88.14 |
| | 0.7 | 17 | 1:1 | 76.77 | 96.45 | 86.61 | 95.57 | 84.95 |
| | 0.8 | 17 | 1:1 | 68.71 | 97.74 | 83.32 | 96.84 | 80.07 |
| GBC | 0.3 | 17 | 1:1 | 91.61 | 80.97 | 86.29 | 83.26 | 86.96 |
| | 0.4 | 17 | 1:1 | 89.67 | 83.55 | 86.61 | 85 | 86.94 |
| | 0.5 | 17 | 1:1 | 88.06 | 86.13 | 87.09 | 86.38 | 87.18 |
| | 0.6 | 17 | 1:1 | 86.13 | 89.67 | 87.9 | 89.62 | 87.63 |
| | 0.7 | 17 | 1:1 | 80.64 | 92.58 | 86.61 | 91.91 | 85.58 |
| | 0.8 | 17 | 1:1 | 75.81 | 94.52 | 85.16 | 93.64 | 83.37 |
| DNN | 0.3 | 17 | 1:1 | 93.87 | 48.71 | 71.29 | 65.12 | 76.69 |
| | 0.4 | 17 | 1:1 | 87.41 | 58.06 | 72.74 | 68.98 | 76.4 |
| | 0.5 | 17 | 1:1 | 80.97 | 70.64 | 75.81 | 73.99 | 76.94 |
| | 0.52 | 17 | 1:1 | 83.87 | 65.16 | 74.52 | 70.79 | 76.56 |
| | 0.54 | 17 | 1:1 | 76.45 | 74.84 | 75.65 | 75.81 | 75.52 |
| | 0.55 | 17 | 1:1 | 84.19 | 73.55 | 78.87 | 76.88 | 79.52 |
| | 0.6 | 17 | 1:1 | 52.58 | 79.68 | 66.12 | 56.88 | 52.54 |
| | 0.7 | 17 | 1:1 | 01.29 | 99.35 | 50.32 | 0 | 02.42 |
| | 0.8 | 17 | 1:1 | 0 | 99.38 | 49.68 | 0 | 0 |
| SVM | 0.3 | 17 | 1:1 | 93.55 | 56.45 | 75 | 68.27 | 78.9 |
| | 0.4 | 17 | 1:1 | 88.38 | 69.68 | 79.03 | 74.44 | 80.79 |
| | 0.5 | 17 | 1:1 | 80.32 | 76.45 | 78.39 | 77.15 | 78.58 |
| | 0.6 | 17 | 1:1 | 74.52 | 83.23 | 78.87 | 81.59 | 77.53 |
| | 0.7 | 17 | 1:1 | 57.74 | 90.32 | 74.03 | 84.87 | 67.91 |
| | 0.8 | 17 | 1:1 | 40.32 | 95.81 | 68.06 | 90.25 | 55.23 |
| MLP | 0.3 | 17 | 1:1 | 95.81 | 55.81 | 75.81 | 68.44 | 79.84 |
| | 0.4 | 17 | 1:1 | 90.65 | 65.16 | 77.9 | 72.21 | 80.32 |
| | 0.5 | 17 | 1:1 | 87.74 | 71.29 | 79.51 | 77.15 | 78.58 |
| | 0.6 | 17 | 1:1 | 77.1 | 79.03 | 78.06 | 78.7 | 77.8 |
| | 0.7 | 17 | 1:1 | 53.87 | 88.39 | 71.29 | 83.03 | 64.9 |
| | 0.8 | 17 | 1:1 | 29.68 | 95.81 | 62.74 | 86.79 | 42.5 |
| DT | 0.3 | 17 | 1:1 | 83.78 | 84.51 | 84.19 | 84.38 | 84 |
| | 0.4 | 17 | 1:1 | 85.08 | 84.38 | 85.32 | 85.06 | 85.26 |
| | 0.5 | 17 | 1:1 | 85.8 | 84.19 | 85 | 84.57 | 85 |
| | 0.6 | 17 | 1:1 | 84.52 | 82.58 | 83.55 | 82.97 | 83.55 |
| | 0.7 | 17 | 1:1 | 84.38 | 83.22 | 84.03 | 83.57 | 84.01 |
| | 0.8 | 17 | 1:1 | 83.54 | 85.48 | 84.51 | 85.23 | 84.24 |
| LR | 0.3 | 17 | 1:1 | 99.68 | 08.39 | 53.65 | 52.12 | 68.45 |
| | 0.4 | 17 | 1:1 | 99.03 | 22.58 | 59.94 | 56.27 | 71.71 |
| | 0.5 | 17 | 1:1 | 43.22 | 83.87 | 63.54 | 73.36 | 54.13 |
| | 0.6 | 17 | 1:1 | 04.19 | 96.77 | 50.48 | 41.5 | 07.57 |
| | 0.7 | 17 | 1:1 | 01.29 | 97.42 | 49.84 | 33.67 | 02.43 |
| | 0.8 | 17 | 1:1 | 0 | 97.42 | 48.71 | 0 | 0 |
The best performances with their corresponding thresholds are indicated in bold text. RF, random forest; GBC, gradient boosting classifier; DNN, deep neural network; SVM, support vector machine; MLP, multilayer perceptron; DT, decision tree; LR, logistic regression
Table SIV.
K-fold model comparison
| Methods | K-fold | Sensitivity % | Specificity % | Accuracy % | Precision % | F1-score % | MCC % | AUC % |
|---|---|---|---|---|---|---|---|---|
| XGBoost | K=3 | 87.08 | 88.07 | 87.58 | 88.19 | 87.52 | 75.37 | 87.57 |
| | K=5 | 87.42 | 89.35 | 88.39 | 89.57 | 88.16 | 76.51 | 88.38 |
| | K=10 | 87.38 | 88.71 | 88.54 | 88.96 | 88.59 | 76.24 | 88.55 |
| RF | K=3 | 86.13 | 88.71 | 87.42 | 88.49 | 87.27 | 74.9 | 87.42 |
| | K=5 | 87.74 | 89.35 | 88.55 | 89.53 | 88.41 | 77.48 | 88.55 |
| | K=10 | 86 | 89.03 | 87.9 | 89.08 | 87.76 | 76.08 | 87.9 |
| GBC | K=3 | 87.41 | 87.09 | 87.26 | 87.14 | 87.27 | 74.52 | 87.26 |
| | K=5 | 88.06 | 86.13 | 87.09 | 86.38 | 87.18 | 74.66 | 87.09 |
| | K=10 | 88.03 | 85.03 | 88.03 | 89.32 | 88.98 | 74.38 | 88.03 |
| DNN | K=3 | 83.55 | 77.42 | 80.49 | 78.7 | 80.98 | 61.26 | 80.49 |
| | K=5 | 84.19 | 73.55 | 78.87 | 76.88 | 79.52 | 62.57 | 78.87 |
| | K=10 | 86.77 | 72.9 | 79.83 | 76.84 | 81.16 | 60.9 | 79.84 |
| SVM | K=3 | 79.02 | 78.38 | 78.71 | 78.5 | 78.74 | 57.44 | 78.7 |
| | K=5 | 80.32 | 76.45 | 78.39 | 77.15 | 78.58 | 57.02 | 78.39 |
| | K=10 | 81.93 | 77.41 | 79.67 | 78.35 | 79.97 | 59.63 | 79.67 |
| MLP | K=3 | 84.2 | 73.88 | 79.04 | 76.34 | 80.05 | 58.46 | 79.04 |
| | K=5 | 87.74 | 71.29 | 79.51 | 75.28 | 81.02 | 59.91 | 79.51 |
| | K=10 | 89.03 | 69.35 | 79.19 | 74.75 | 81.06 | 60.01 | 79.19 |
| DT | K=3 | 87.41 | 83.54 | 85.48 | 84.25 | 85.73 | 71.14 | 85.47 |
| | K=5 | 85.8 | 84.19 | 85 | 84.57 | 85 | 69.54 | 84.35 |
| | K=10 | 85.8 | 85.48 | 85.64 | 85.77 | 85.54 | 71.68 | 85.64 |
| LR | K=3 | 42.92 | 84.49 | 63.71 | 74.01 | 53.98 | 30.39 | 63.71 |
| | K=5 | 43.25 | 83.87 | 63.54 | 73.36 | 54.13 | 29.89 | 63.54 |
| | K=10 | 42.9 | 82.58 | 62.74 | 71.42 | 53.23 | 27.91 | 62.74 |
Table III.
Performance measures of eight different ML classifiers at 5-fold cross-validation using the 17 selected features and P:N ratio of 1:1
| Methods / ML classifiers | Vector length | P:N | Sensitivity % | Specificity % | Accuracy % | Precision % | F1-score % | MCC % | AUC % |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 17 | 1:1 | 87.42 | 89.35 | 88.39 | 89.57 | 88.16 | 76.51 | 88.38 |
| RF | 17 | 1:1 | 87.74 | 89.35 | 88.55 | 89.53 | 88.41 | 77.48 | 88.55 |
| GBC | 17 | 1:1 | 88.06 | 86.13 | 87.09 | 86.38 | 87.18 | 74.66 | 87.09 |
| DNN | 17 | 1:1 | 84.19 | 73.55 | 78.87 | 76.88 | 79.52 | 62.57 | 78.87 |
| SVM | 17 | 1:1 | 80.32 | 76.45 | 78.39 | 77.15 | 78.58 | 57.02 | 78.39 |
| MLP | 17 | 1:1 | 87.74 | 71.29 | 79.51 | 75.28 | 81.02 | 59.91 | 79.51 |
| DT | 17 | 1:1 | 85.8 | 84.19 | 85 | 84.57 | 85 | 69.54 | 84.35 |
| LR | 17 | 1:1 | 43.25 | 83.87 | 63.54 | 73.36 | 54.13 | 29.89 | 63.54 |
The best performances are highlighted in bold text at the threshold value of 0.5. ML, machine learning; RF, random forest; GBC, gradient boosting classifier; DNN, deep neural network; SVM, support vector machine; MLP, multilayer perceptron; DT, decision tree; LR, logistic regression; MCC, Matthews correlation coefficient
Fig. 3.

Receiver operating characteristic (ROC) curves of different machine learning classifiers on the training dataset with P:N=1:1, based on the 17 selected features. ROC, receiver operating characteristic; XGBoost, extreme gradient boosting; RF, random forest; GBC, gradient boosting classifier; DNN, deep neural network; SVM, support vector machine; MLPC, multi-layer perceptron; DT, decision tree; LR, logistic regression; AUC, area under the curve.
Model performance using the blind set: The blind set was tested using the best models of all ML classifiers used in this study. As shown in Table IV, the RF model had a sensitivity of 68.75 per cent, specificity of 92.5 per cent, accuracy of 80.62 per cent, MCC of 60.12 per cent, precision of 90.16 per cent and AUC of 80 per cent, outperforming all other ML classifiers. The XGBoost model achieved the sensitivity (70%), specificity (90%), accuracy (80%), MCC (59.39%), precision (87.5%) and AUC (80%) values, which were quite similar to RF. These two models performed better than GBC, DNN, SVM, MLP, DT and LR classifiers. However, DNN and GBC also did reasonably well with sensitivity over 68 per cent, accuracy above 78 per cent and AUC value higher than 78 per cent. Interestingly, we found that DNN performed better when the number of features increased. Finally, considering the results of both the training set and the blind set, XGBoost, RF, GBC and DNN were chosen for further analysis and webserver development.
Table IV.
Performance measures with selected features (all: 17) at P:N=1:1 ratio on blind set using best models of different machine learning (ML) classifiers
| Methods / ML classifiers | Vector length | P:N | Sensitivity % | Specificity % | Accuracy % | Precision % | F1 % | MCC % | AUC % |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 17 | 1:1 | 70 | 90 | 80 | 87.5 | 77.78 | 59.39 | 80 |
| RF | 17 | 1:1 | 68.75 | 92.5 | 80.62 | 90.16 | 78.04 | 60.12 | 80 |
| GBC | 17 | 1:1 | 67.5 | 88.75 | 78.12 | 85.71 | 75.52 | 52.92 | 78.12 |
| DNN | 17 | 1:1 | 78.75 | 77.5 | 78.13 | 77.78 | 78.26 | 44.54 | 78.12 |
| SVM | 17 | 1:1 | 67.5 | 82.5 | 73.75 | 78.79 | 71.23 | 48.24 | 73.75 |
| MLP | 17 | 1:1 | 75 | 80 | 77.5 | 78.94 | 76.92 | 55.02 | 77.5 |
| DT | 17 | 1:1 | 58.75 | 71.25 | 65 | 67 | 62.67 | 35.72 | 65 |
| LR | 17 | 1:1 | 35 | 92.5 | 63.75 | 82.35 | 49.12 | 33.2 | 63.75 |
The best performances were highlighted in bold text
Model performance using the training set as negative/unknown: The XGBoost, RF, GBC and DNN models were also used to re-evaluate the training set (312 antibiotic drugs) treated as a negative or unknown-label set, to confirm that the models had learned the underlying patterns in the data and could make accurate predictions on unseen data; this helps to guard against overfitting, which can lead to poor performance on unseen data. The top four models identified positive drugs as positive with higher accuracy than the other models. The models performed well with a threshold (cut-off) probability set at 0.56 when the training set was used as a negative dataset (Supplementary Fig. S5).
Model performance on 14 widely used antibiotic drugs: Fourteen well-known antibiotic drugs, namely penicillin G, ampicillin, cloxacillin, vancomycin, ticarcillin, meropenem, azithromycin, ceftazidime, daptomycin, rifampicin, chloramphenicol, ciprofloxacin, tetracycline and gentamicin C1a, were evaluated using the top four ML classifiers: XGBoost, RF, GBC and DNN. As shown in Figure 4, the average cut-off probability was 0.56, but XGBoost predicted these antibiotic drugs with a probability score above 0.9.
Fig. 4.

Comparison of the probability scores of the top 14 antibiotic drugs using the best-performing classifier models (XGBoost, random forest, gradient boosting classifier and deep neural network).
Model performance on an independent dataset: The XGBoost, RF, GBC and DNN models were tested on the NCI dataset of small chemicals (n=173,714), taken as an independent set. Both XGBoost and RF performed better than GBC and DNN, with an accuracy above 80 per cent, and XGBoost was even better than RF. The best XGBoost model predicted a total of 13,552 small chemicals as antibiotic drugs with a probability score greater than 0.9, of which 596 chemicals reached the same score with both RF and XGBoost (Supplementary Fig. S6). To validate the prediction results, these 596 NCI small chemicals were studied individually in the literature. Interestingly, the majority of these 596 NCI chemicals were found to have antibacterial activity reported in the literature (Supplementary Table SIV).
Webserver prediction: The ABDpred webserver computes structural similarities of input chemicals to the 390 known antibiotic drugs based on the Tanimoto coefficient score. On the home page of the webserver, users can provide a PubChem CID of choice in the text box (Fig. 5A). Once the prediction button is pressed, the webserver processes the input and returns a result page named ‘prediction_result_updated.php’. The result page presents the output in several sections. In the first section, ‘Input compound brief details’, the user gets the input compound CID, structure and synonyms, as shown in Figure 5B (for example, gramicidin D, CID: 42567103). The next section illustrates the ‘drug-like properties and bioavailability profile of the input compound’ using Lipinski’s rule, Pfizer’s 3/75 rule, Veber’s rule and Egan’s rule22,23. Another section presents the ‘prediction results given by different ML classifiers (XGBoost, RF, GBC and DNN)’ (Fig. 5C). A fourth section contains the ‘Aggregation of prediction results’ (Fig. 5D). Finally, a separate section shows the similarity of the input compound to the known antibiotic drugs used in this study (Fig. 5E). ABDpred is freely accessible at http://clinicalmedicinessd.com.in/abdpred/. A sketch of the similarity and drug-likeness checks is given below.
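An illustrative sketch (an assumption, not the server's actual code) of two of the checks reported on the result page, here computed with RDKit: Lipinski's rule of five and the Tanimoto similarity of a query to a known antibiotic.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, Lipinski, AllChem

def lipinski_pass(smiles):
    """Lipinski's rule of five: MW <= 500, logP <= 5, HBD <= 5, HBA <= 10."""
    mol = Chem.MolFromSmiles(smiles)
    return (Descriptors.MolWt(mol) <= 500 and Descriptors.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5 and Lipinski.NumHAcceptors(mol) <= 10)

def tanimoto(smiles_a, smiles_b):
    """Tanimoto coefficient between Morgan fingerprints of two molecules."""
    fa = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), 2, 2048)
    fb = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), 2, 2048)
    return DataStructs.TanimotoSimilarity(fa, fb)
```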
Fig. 5.

(A-E) ABDpred webserver’s input page and result page, showing different output sections using the example query as input.
Comparison of the ABDpred server with existing machine learning (ML) methods: Ivanenkov’s12 group developed ML models to predict antibacterial chemicals and validated some of the chemicals with the highest predicted antibacterial potency by in vitro and in vivo experiments12. Stokes’s13 group developed the Chemprop webserver, which was used to discover halicin as an antimicrobial compound. The Chemprop webserver achieved only 51.5 per cent accuracy and 51.5 per cent sensitivity, whereas when we screened these chemicals with the top-performing models of the ABDpred server, the best model (RF) achieved 67 per cent accuracy and 67 per cent sensitivity in identifying compounds with antimicrobial activity (Supplementary Table SVI). In addition, most of the compounds were identified as likely antibiotic drugs, and their drug-likeness was determined based on the 17 selected features.
Table SVI.
ABDpred server comparison with existing server: Chemprop
| Proposed server | Methods | Sensitivity % | Accuracy % | AUC % | Existing server | Method | Sensitivity % | Accuracy % |
|---|---|---|---|---|---|---|---|---|
| ABDpred | XGB | 62.5 | 62.5 | 62.5 | Chemprop | DNN | 51.5 | 51.5 |
| | RF | 67 | 67 | 67.5 | | | | |
| | GBC | 62.5 | 62.5 | 62.5 | | | | |
| | DNN | 50 | 50 | 50 | | | | |
The best performances were highlighted in bold text.
Table SVII.
PaDELPy descriptors list
| Descriptor type | n | Descriptor | Class |
|---|---|---|---|
| Acidic group count | 1 | nAcid | 2D |
| ALOGP | 3 | ALogP, ALogp2, AMR | 2D |
| APol | 1 | apol | 2D |
| Aromatic atom count | 1 | naAromAtom | 2D |
| Aromatic bond count | 1 | nAromBond | 2D |
| Atom count | 14 | nAtom, nHeavyAtom, nH, nB, nC, nN, nO, nS, nP, nF, nCl, nBr, nI, nX | 2D |
| Autocorrelation | 346 | ATS0m, ATS1m, ATS2m, ATS3m, ATS4m, ATS5m, ATS6m, ATS7m, ATS8m, ATS0v, ATS1v, ATS2v, ATS3v, ATS4v, ATS5v, ATS6v, ATS7v, ATS8v, ATS0e, ATS1e, ATS2e, ATS3e, ATS4e, ATS5e, ATS6e, ATS7e, ATS8e, ATS0p, ATS1p, ATS2p, ATS3p, ATS4p, ATS5p, ATS6p, ATS7p, ATS8p, ATS0i, ATS1i, ATS2i, ATS3i, ATS4i, ATS5i, ATS6i, ATS7i, ATS8i, ATS0s, ATS1s, ATS2s, ATS3s, ATS4s, ATS5s, ATS6s, ATS7s, ATS8s, AATS0m, AATS1m, AATS2m, AATS3m, AATS4m, AATS5m, AATS6m, AATS7m, AATS8m, AATS0v, AATS1v, AATS2v, AATS3v, AATS4v, AATS5v, AATS6v, AATS7v, AATS8v, AATS0e, AATS1e, AATS2e, AATS3e, AATS4e, AATS5e, AATS6e, AATS7e, AATS8e, AATS0p, AATS1p, AATS2p, AATS3p, AATS4p, AATS5p, AATS6p, AATS7p, AATS8p, AATS0i, AATS1i, AATS2i, AATS3i, AATS4i, AATS5i, AATS6i, AATS7i, AATS8i, AATS0s, AATS1s, AATS2s, AATS3s, AATS4s, AATS5s, AATS6s, AATS7s, AATS8s, ATSC0c, ATSC1c, ATSC2c, ATSC3c, ATSC4c, ATSC5c, ATSC6c, ATSC7c, ATSC8c, ATSC0m, ATSC1m, ATSC2m, ATSC3m, ATSC4m, ATSC5m, ATSC6m, ATSC7m, ATSC8m, ATSC0v, ATSC1v, ATSC2v, ATSC3v, ATSC4v, ATSC5v, ATSC6v, ATSC7v, ATSC8v, ATSC0e, ATSC1e, ATSC2e, ATSC3e, ATSC4e, ATSC5e, ATSC6e, ATSC7e, ATSC8e, ATSC0p, ATSC1p, ATSC2p, ATSC3p, ATSC4p, ATSC5p, ATSC6p, ATSC7p, ATSC8p, ATSC0i, ATSC1i, ATSC2i, ATSC3i, ATSC4i, ATSC5i, ATSC6i, ATSC7i, ATSC8i, ATSC0s, ATSC1s, ATSC2s, ATSC3s, ATSC4s, ATSC5s, ATSC6s, ATSC7s, ATSC8s, AATSC0c, AATSC1c, AATSC2c, AATSC3c, AATSC4c, AATSC5c, AATSC6c, AATSC7c, AATSC8c, AATSC0m, AATSC1m, AATSC2m, AATSC3m, AATSC4m, AATSC5m, AATSC6m, AATSC7m, AATSC8m, AATSC0v, AATSC1v, AATSC2v, AATSC3v, AATSC4v, AATSC5v, AATSC6v, AATSC7v, AATSC8v, AATSC0e, AATSC1e, AATSC2e, AATSC3e, AATSC4e, AATSC5e, AATSC6e, AATSC7e, AATSC8e, AATSC0p, AATSC1p, AATSC2p, AATSC3p, AATSC4p, AATSC5p, AATSC6p, AATSC7p, AATSC8p, AATSC0i, AATSC1i, AATSC2i, AATSC3i, AATSC4i, AATSC5i, AATSC6i, AATSC7i, AATSC8i, AATSC0s, AATSC1s, AATSC2s, AATSC3s, AATSC4s, AATSC5s, AATSC6s, AATSC7s, AATSC8s, MATS1c, MATS2c, MATS3c, MATS4c, MATS5c, MATS6c, MATS7c, MATS8c, MATS1m, MATS2m, MATS3m, MATS4m, MATS5m, MATS6m, MATS7m, MATS8m, MATS1v, MATS2v, MATS3v, MATS4v, MATS5v, MATS6v, MATS7v, MATS8v, MATS1e, MATS2e, MATS3e, MATS4e, MATS5e, MATS6e, MATS7e, MATS8e, MATS1p, MATS2p, MATS3p, MATS4p, MATS5p, MATS6p, MATS7p, MATS8p, MATS1i, MATS2i, MATS3i, MATS4i, MATS5i, MATS6i, MATS7i, MATS8i, MATS1s, MATS2s, MATS3s, MATS4s, MATS5s, MATS6s, MATS7s, MATS8s, GATS1c, GATS2c, GATS3c, GATS4c, GATS5c, GATS6c, GATS7c, GATS8c, GATS1m, GATS2m, GATS3m, GATS4m, GATS5m, GATS6m, GATS7m, GATS8m, GATS1v, GATS2v, GATS3v, GATS4v, GATS5v, GATS6v, GATS7v, GATS8v, GATS1e, GATS2e, GATS3e, GATS4e, GATS5e, GATS6e, GATS7e, GATS8e, GATS1p, GATS2p, GATS3p, GATS4p, GATS5p, GATS6p, GATS7p, GATS8p, GATS1i, GATS2i, GATS3i, GATS4i, GATS5i, GATS6i, GATS7i, GATS8i, GATS1s, GATS2s, GATS3s, GATS4s, GATS5s, GATS6s, GATS7s, GATS8s | 2D |
| Barysz matrix | 91 | SpAbs_DzZ, SpMax_DzZ, SpDiam_DzZ, SpAD_DzZ, SpMAD_DzZ, EE_DzZ, SM1_DzZ, VE1_DzZ, VE2_DzZ, VE3_DzZ, VR1_DzZ, VR2_DzZ, VR3_DzZ, SpAbs_Dzm, SpMax_Dzm, SpDiam_Dzm, SpAD_Dzm, SpMAD_Dzm, EE_Dzm, SM1_Dzm, VE1_Dzm, VE2_Dzm, VE3_Dzm, VR1_Dzm, VR2_Dzm, VR3_Dzm, SpAbs_Dzv, SpMax_Dzv, SpDiam_Dzv, SpAD_Dzv, SpMAD_Dzv, EE_Dzv, SM1_Dzv, VE1_Dzv, VE2_Dzv, VE3_Dzv, VR1_Dzv, VR2_Dzv, VR3_Dzv, SpAbs_Dze, SpMax_Dze, SpDiam_Dze, SpAD_Dze, SpMAD_Dze, EE_Dze, SM1_Dze, VE1_Dze, VE2_Dze, VE3_Dze, VR1_Dze, VR2_Dze, VR3_Dze, SpAbs_Dzp, SpMax_Dzp, SpDiam_Dzp, SpAD_Dzp, SpMAD_Dzp, EE_Dzp, SM1_Dzp, VE1_Dzp, VE2_Dzp, VE3_Dzp, VR1_Dzp, VR2_Dzp, VR3_Dzp, SpAbs_Dzi, SpMax_Dzi, SpDiam_Dzi, SpAD_Dzi, SpMAD_Dzi, EE_Dzi, SM1_Dzi, VE1_Dzi, VE2_Dzi, VE3_Dzi, VR1_Dzi, VR2_Dzi, VR3_Dzi, SpAbs_Dzs, SpMax_Dzs, SpDiam_Dzs, SpAD_Dzs, SpMAD_Dzs, EE_Dzs, SM1_Dzs, VE1_Dzs, VE2_Dzs, VE3_Dzs, VR1_Dzs, VR2_Dzs, VR3_Dzs | 2D |
| Basic group count | 1 | nBase | 2D |
| BCUT | 6 | BCUTw-1l, BCUTw-1h, BCUTc-1l, BCUTc-1h, BCUTp-1l, BCUTp-1h | 2D |
| Bond count | 10 | nBonds, nBonds2, nBondsS, nBondsS2, nBondsS3, nBondsD, nBondsD2, nBondsT, nBondsQ, nBondsM | 2D |
| BPol | 1 | bpol | 2D |
| Burden modified eigenvalues | 96 | SpMax1_Bhm, SpMax2_Bhm, SpMax3_Bhm, SpMax4_Bhm, SpMax5_Bhm, SpMax6_Bhm, SpMax7_Bhm, SpMax8_Bhm, SpMin1_Bhm, SpMin2_Bhm, SpMin3_Bhm, SpMin4_Bhm, SpMin5_Bhm, SpMin6_Bhm, SpMin7_Bhm, SpMin8_Bhm, SpMax1_Bhv, SpMax2_Bhv, SpMax3_Bhv, SpMax4_Bhv, SpMax5_Bhv, SpMax6_Bhv, SpMax7_Bhv, SpMax8_Bhv, SpMin1_Bhv, SpMin2_Bhv, SpMin3_Bhv, SpMin4_Bhv, SpMin5_Bhv, SpMin6_Bhv, SpMin7_Bhv, SpMin8_Bhv, SpMax1_Bhe, SpMax2_Bhe, SpMax3_Bhe, SpMax4_Bhe, SpMax5_Bhe, SpMax6_Bhe, SpMax7_Bhe, SpMax8_Bhe, SpMin1_Bhe, SpMin2_Bhe, SpMin3_Bhe, SpMin4_Bhe, SpMin5_Bhe, SpMin6_Bhe, SpMin7_Bhe, SpMin8_Bhe, SpMax1_Bhp, SpMax2_Bhp, SpMax3_Bhp, SpMax4_Bhp, SpMax5_Bhp, SpMax6_Bhp, SpMax7_Bhp, SpMax8_Bhp, SpMin1_Bhp, SpMin2_Bhp, SpMin3_Bhp, SpMin4_Bhp, SpMin5_Bhp, SpMin6_Bhp, SpMin7_Bhp, SpMin8_Bhp, SpMax1_Bhi, SpMax2_Bhi, SpMax3_Bhi, SpMax4_Bhi, SpMax5_Bhi, SpMax6_Bhi, SpMax7_Bhi, SpMax8_Bhi, SpMin1_Bhi, SpMin2_Bhi, SpMin3_Bhi, SpMin4_Bhi, SpMin5_Bhi, SpMin6_Bhi, SpMin7_Bhi, SpMin8_Bhi, SpMax1_Bhs, SpMax2_Bhs, SpMax3_Bhs, SpMax4_Bhs, SpMax5_Bhs, SpMax6_Bhs, SpMax7_Bhs, SpMax8_Bhs, SpMin1_Bhs, SpMin2_Bhs, SpMin3_Bhs, SpMin4_Bhs, SpMin5_Bhs, SpMin6_Bhs, SpMin7_Bhs, SpMin8_Bhs | 2D |
| Carbon types | 9 | C1SP1, C2SP1, C1SP2, C2SP2, C3SP2, C1SP3, C2SP3, C3SP3, C4SP3 | 2D |
| Chi chain | 10 | SCH-3, SCH-4, SCH-5, SCH-6, SCH-7, VCH-3, VCH-4, VCH-5, VCH-6, VCH-7 | 2D |
| Chi cluster | 8 | SC-3, SC-4, SC-5, SC-6, VC-3, VC-4, VC-5, VC-6 | 2D |
| Chi path cluster | 6 | SPC-4, SPC-5, SPC-6, VPC-4, VPC-5, VPC-6 | 2D |
| Chi path | 32 | SP-0, SP-1, SP-2, SP-3, SP-4, SP-5, SP-6, SP-7, ASP-0, ASP-1, ASP-2, ASP-3, ASP-4, ASP-5, ASP-6, ASP-7, VP-0, VP-1, VP-2, VP-3, VP-4, VP-5, VP-6, VP-7, AVP-0, AVP-1, AVP-2, AVP-3, AVP-4, AVP-5, AVP-6, AVP-7 | 2D |
| Constitutional | 12 | Sv, Sse, Spe, Sare, Sp, Si, Mv, Mse, Mpe, Mare, Mp, Mi | 2D |
| Crippen logP and MR | 2 | CrippenLogP, CrippenMR | 2D |
| Detour matrix | 11 | SpMax_Dt, SpDiam_Dt, SpAD_Dt, SpMAD_Dt, EE_Dt, VE1_Dt, VE2_Dt, VE3_Dt, VR1_Dt, VR2_Dt, VR3_Dt | 2D |
| Eccentric connectivity index | 1 | ECCEN | 2D |
| Atom type electrotopological state | 489 | nHBd, nwHBd, nHBa, nwHBa, nHBint2, nHBint3, nHBint4, nHBint5, nHBint6, nHBint7, nHBint8, nHBint9, nHBint10, nHsOH, nHdNH, nHsSH, nHsNH2, nHssNH, nHaaNH, nHsNH3p, nHssNH2p, nHsssNHp, nHtCH, nHdCH2, nHdsCH, nHaaCH, nHCHnX, nHCsats, nHCsatu, nHAvin, nHother, nHmisc, nsLi, nssBe, nssssBem, nsBH2, nssBH, nsssB, nssssBm, nsCH3, ndCH2, nssCH2, ntCH, ndsCH, naaCH, nsssCH, nddC, ntsC, ndssC, naasC, naaaC, nssssC, nsNH3p, nsNH2, nssNH2p, ndNH, nssNH, naaNH, ntN, nsssNHp, ndsN, naaN, nsssN, nddsN, naasN, nssssNp, nsOH, ndO, nssO, naaO, naOm, nsOm, nsF, nsSiH3, nssSiH2, nsssSiH, nssssSi, nsPH2, nssPH, nsssP, ndsssP, nddsP, nsssssP, nsSH, ndS, nssS, naaS, ndssS, nddssS, nssssssS, nSm, nsCl, nsGeH3, nssGeH2, nsssGeH, nssssGe, nsAsH2, nssAsH, nsssAs, ndsssAs, nddsAs, nsssssAs, nsSeH, ndSe, nssSe, naaSe, ndssSe, nssssssSe, nddssSe, nsBr, nsSnH3, nssSnH2, nsssSnH, nssssSn, nsI, nsPbH3, nssPbH2, nsssPbH, nssssPb, SHBd, SwHBd, SHBa, SwHBa, SHBint2, SHBint3, SHBint4, SHBint5, SHBint6, SHBint7, SHBint8, SHBint9, SHBint10, SHsOH, SHdNH, SHsSH, SHsNH2, SHssNH, SHaaNH, SHsNH3p, SHssNH2p, SHsssNHp, SHtCH, SHdCH2, SHdsCH, SHaaCH, SHCHnX, SHCsats, SHCsatu, SHAvin, SHother, SHmisc, SsLi, SssBe, SssssBem, SsBH2, SssBH, SsssB, SssssBm, SsCH3, SdCH2, SssCH2, StCH, SdsCH, SaaCH, SsssCH, SddC, StsC, SdssC, SaasC, SaaaC, SssssC, SsNH3p, SsNH2, SssNH2p, SdNH, SssNH, SaaNH, StN, SsssNHp, SdsN, SaaN, SsssN, SddsN, SaasN, SssssNp, SsOH, SdO, SssO, SaaO, SaOm, SsOm, SsF, SsSiH3, SssSiH2, SsssSiH, SssssSi, SsPH2, SssPH, SsssP, SdsssP, SddsP, SsssssP, SsSH, SdS, SssS, SaaS, SdssS, SddssS, SssssssS, SSm, SsCl, SsGeH3, SssGeH2, SsssGeH, SssssGe, SsAsH2, SssAsH, SsssAs, SdsssAs, SddsAs, SsssssAs, SsSeH, SdSe, SssSe, SaaSe, SdssSe, SssssssSe, SddssSe, SsBr, SsSnH3, SssSnH2, SsssSnH, SssssSn, SsI, SsPbH3, SssPbH2, SsssPbH, SssssPb, minHBd, minwHBd, minHBa, minwHBa, minHBint2, minHBint3, minHBint4, minHBint5, minHBint6, minHBint7, minHBint8, minHBint9, minHBint10, minHsOH, minHdNH, minHsSH, minHsNH2, minHssNH, minHaaNH, minHsNH3p, minHssNH2p, minHsssNHp, minHtCH, minHdCH2, minHdsCH, minHaaCH, minHCHnX, minHCsats, minHCsatu, minHAvin, minHother, minHmisc, minsLi, minssBe, minssssBem, minsBH2, minssBH, minsssB, minssssBm, minsCH3, mindCH2, minssCH2, mintCH, mindsCH, minaaCH, minsssCH, minddC, mintsC, mindssC, minaasC, minaaaC, minssssC, minsNH3p, minsNH2, minssNH2p, mindNH, minssNH, minaaNH, mintN, minsssNHp, mindsN, minaaN, minsssN, minddsN, minaasN, minssssNp, minsOH, mindO, minssO, minaaO, minaOm, minsOm, minsF, minsSiH3, minssSiH2, minsssSiH, minssssSi, minsPH 2, minssPH, minsssP, mindsssP, minddsP, minsssssP, minsSH, mindS, minssS, minaaS, mindssS, minddssS, minssssssS, minSm, minsCl, minsGeH3, minssGeH2, minsssGeH, minssssGe, minsAsH2, minssAsH, minsssAs, mindsssAs, minddsAs, minsssssAs, minsSeH, mindSe, minssSe, minaaSe, mindssSe, minssssssSe, minddssSe, minsBr, minsSnH3, minssSnH2, minsssSnH, minssssSn, minsI, minsPbH3, minssPbH2, minsssPbH, minssssPb, maxHBd, maxwHBd, maxHBa, maxwHBa, maxHBint2, maxHBint3, maxHBint4, maxHBint5, maxHBint6, maxHBint7, maxHBint8, maxHBint9, maxHBint10, maxHsOH, maxHdNH, maxHsSH, maxHsNH2, maxHssNH, maxHaaNH, maxHsNH3p, maxHssNH2p, maxHsssNHp, maxHtCH, maxHdCH2, maxHdsCH, maxHaaCH, maxHCHnX, maxHCsats, maxHCsatu, maxHAvin, maxHother, maxHmisc, maxsLi, maxssBe, maxssssBem, maxsBH2, maxssBH, maxsssB, maxssssBm, maxsCH3, maxdCH2, maxssCH2, maxtCH, maxdsCH, maxaaCH, maxsssCH, maxddC, maxtsC, maxdssC, maxaasC, maxaaaC, maxssssC, maxsNH3p, maxsNH2, maxssNH2p, 
maxdNH, maxssNH, maxaaNH, maxtN, maxsssNHp, maxdsN, maxaaN, maxsssN, maxddsN, maxaasN, maxssssNp, maxsOH, maxdO, maxssO, maxaaO, maxaOm, maxsOm, maxsF, maxsSiH3, maxssSiH2, maxsssSiH, maxssssSi, maxsPH2, maxssPH, maxsssP, maxdsssP, maxddsP, maxsssssP, maxsSH, maxdS, maxssS, maxaaS, maxdssS, maxddssS, maxssssssS, maxSm, maxsCl, maxsGeH3, maxssGeH2, maxsssGeH, maxssssGe, maxsAsH2, maxssAsH, maxsssAs, maxdsssAs, maxddsAs, maxsssssAs, maxsSeH, maxdSe, maxssSe, maxaaSe, maxdssSe, maxssssssSe, maxddssSe, maxsBr, maxsSnH3, maxssSnH2, maxsssSnH, maxssssSn, maxsI, maxsPbH3, maxssPbH2, maxsssPbH, maxssssPb, sumI, meanI, hmax, gmax, hmin, gmin, LipoaffinityIndex, MAXDN, MAXDP, DELS, MAXDN2, MAXDP2, DELS2 | 2D |
| Extended topochemical atom | 43 | ETA_Alpha, ETA_AlphaP, ETA_dAlpha_A, ETA_dAlpha_B, ETA_Epsilon_1, ETA_Epsilon_2, ETA_Epsilon_3, ETA_Epsilon_4, ETA_Epsilon_5, ETA_dEpsilon_A, ETA_dEpsilon_B, ETA_dEpsilon_C, ETA_dEpsilon_D, ETA_Psi_1, ETA_dPsi_A, ETA_dPsi_B, ETA_Shape_P, ETA_Shape_Y, ETA_Shape_X, ETA_Beta, ETA_BetaP, ETA_Beta_s, ETA_BetaP_s, ETA_Beta_ns, ETA_BetaP_ns, ETA_dBeta, ETA_dBetaP, ETA_Beta_ns_d, ETA_BetaP_ns_d, ETA_Eta, ETA_EtaP, ETA_Eta_R, ETA_Eta_F, ETA_EtaP_F, ETA_Eta_L, ETA_EtaP_L, ETA_Eta_R_L, ETA_Eta_F_L, ETA_EtaP_F_L, ETA_Eta_B, ETA_EtaP_B, ETA_Eta_B_RC, ETA_EtaP_B_RC | 2D |
| FMFDescriptor | 1 | FMF | 2D |
| Fragment complexity | 1 | fragC | 2D |
| Hbond acceptor count | 4 | nHBAcc, nHBAcc2, nHBAcc3, nHBAcc_Lipinski | 2D |
| Hbond donor count | 2 | nHBDon, nHBDon_Lipinski | 2D |
| Hybridization ratio | 1 | HybRatio | 2D |
| Information content | 42 | IC0, IC1, IC2, IC3, IC4, IC5, TIC0, TIC1, TIC2, TIC3, TIC4, TIC5, SIC0, SIC1, SIC2, SIC3, SIC4, SIC5, CIC0, CIC1, CIC2, CIC3, CIC4, CIC5, BIC0, BIC1, BIC2, BIC3, BIC4, BIC5, MIC0, MIC1, MIC2, MIC3, MIC4, MIC5, ZMIC0, ZMIC1, ZMIC2, ZMIC3, ZMIC4, ZMIC5 | 2D |
| Kappa shape indices | 3 | Kier1, Kier2, Kier3 | 2D |
| Largest chain | 1 | nAtomLC | 2D |
| Largest Pi system | 1 | nAtomP | 2D |
| Longest aliphatic chain | 1 | nAtomLAC | 2D |
| Mannhold LogP | 1 | MLogP | 2D |
| McGowan volume | 1 | McGowan_Volume | 2D |
| Molecular distance edge | 19 | MDEC-11, MDEC-12, MDEC-13, MDEC-14, MDEC-22, MDEC-23, MDEC-24, MDEC-33, MDEC-34, MDEC-44, MDEO-11, MDEO-12, MDEO-22, MDEN-11, MDEN-12, MDEN-13, MDEN-22, MDEN-23, MDEN-33 | 2D |
| Molecular linear free energy relation | 6 | MLFER_A, MLFER_BH, MLFER_BO, MLFER_S, MLFER_E, MLFER_L | 2D |
| Path counts | 22 | MPC2, MPC3, MPC4, MPC5, MPC6, MPC7, MPC8, MPC9, MPC10, TPC, piPC1, piPC2, piPC3, piPC4, piPC5, piPC6, piPC7, piPC8, piPC9, piPC10, TpiPC, R_TpiPCTPC | 2D |
| Petitjean number | 1 | PetitjeanNumber | 2D |
| Ring count | 68 | nRing, n3Ring, n4Ring, n5Ring, n6Ring, n7Ring, n8Ring, n9Ring, n10Ring, n11Ring, n12Ring, nG12Ring, nFRing, nF4Ring, nF5Ring, nF6Ring, nF7Ring, nF8Ring, nF9Ring, nF10Ring, nF11Ring, nF12Ring, nFG12Ring, nHeteroRing, n3HeteroRing, n4HeteroRing, n5HeteroRing, n6HeteroRing, n7HeteroRing, n8HeteroRing, n9HeteroRing, n10HeteroRing, n11HeteroRing, n12HeteroRing, nG12HeteroRing, nFHeteroRing, nF4HeteroRing, nF5HeteroRing, nF6HeteroRing, nF7HeteroRing, nF8HeteroRing, nF9HeteroRing, nF10HeteroRing, nF11HeteroRing, nF12HeteroRing, nFG12HeteroRing, nTHeteroRing, nT4HeteroRing, nT5HeteroRing, nT6HeteroRing, nT7HeteroRing, nT8HeteroRing, nT9HeteroRing, nT10HeteroRing, nT11HeteroRing, nT12HeteroRing, nTG12HeteroRing | 2D |
| Rotatable bond count | 4 | nRotB, RotBFrac, nRotBt, RotBtFrac | 2D |
| Rule of five | 1 | LipinskiFailures | 2D |
| Topological | 3 | topoRadius, topoDiameter, topoShape | 2D |
| Topological charge | 21 | GGI1, GGI2, GGI3, GGI4, GGI5, GGI6, GGI7, GGI8, GGI9, GGI10, JGI1, JGI2, JGI3, JGI4, JGI5, JGI6, JGI7, JGI8, JGI9, JGI10, JGT | 2D |
| Topological distance matrix | 11 | SpMax_D, SpDiam_D, SpAD_D, SpMAD_D, EE_D, VE1_D, VE2_D, VE3_D, VR1_D, VR2_D, VR3_D | 2D |
| Topological polar surface area | 1 | TopoPSA | 2D |
| Van der Waals volume | 1 | VABC | 2D |
| Vertex adjacency information (magnitude) | 1 | vAdjMat | 2D |
| Walk counts | 20 | MWC2, MWC3, MWC4, MWC5, MWC6, MWC7, MWC8, MWC9, MWC10, TWC, SRW2, SRW3, SRW4, SRW5, SRW6, SRW7, SRW8, SRW9, SRW10, TSRW | 2D |
| Weight | 2 | MW, AMW | 2D |
| Weighted path | 5 | WTPT-1, WTPT-2, WTPT-3, WTPT-4, WTPT-5 | 2D |
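As an illustration of how the two-dimensional descriptors listed above can be generated, the sketch below uses the padelpy wrapper around PaDEL-Descriptor to compute descriptors directly from SMILES strings. The example molecules and settings are placeholders for illustration and do not reproduce the exact configuration used in this study.

```python
# Minimal sketch: computing PaDEL 2D descriptors from SMILES with the padelpy wrapper.
# The SMILES strings below are illustrative placeholders, not compounds from this study.
from padelpy import from_smiles

examples = {
    "penicillin G": "CC1(C)SC2C(NC(=O)Cc3ccccc3)C(=O)N2C1C(=O)O",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}

for name, smi in examples.items():
    # from_smiles() returns a dict mapping descriptor names (e.g. MW, TopoPSA, nHBDon)
    # to their computed values; fingerprints are switched off here.
    desc = from_smiles(smi, descriptors=True, fingerprints=False)
    print(f"{name}: MW={desc['MW']}, TopoPSA={desc['TopoPSA']}, nHBDon={desc['nHBDon']}")
```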
Discussion
MDR and extensively drug-resistant bacterial pathogens have emerged as a major threat to public health1. There is a growing global demand for new antimicrobial drug discovery, since the projected cost of antimicrobial resistance in terms of loss of human lives, economic impact and burden on the health system is enormous. However, traditional drug discovery pipelines are expensive and slow, typically costing billions of dollars and taking years to reach the clinical trial phase, and they are unlikely to keep pace with fast-evolving pathogens that acquire resistance through multiple mechanisms. Under such circumstances, computational approaches, especially ML techniques, can yield the double benefit of speeding up the drug discovery process considerably while lowering the overall cost of compound screening and bioactivity prediction12,24.
Several models have been developed to predict the antibacterial activities of small-molecule compounds. However, some models focus on individual classes of compounds8,25 and hence are not applicable to diverse compound libraries. Others, being restricted to individual organisms13,26, are not helpful for the discovery of broad-spectrum antimicrobials. Although models for predicting antibacterial drugs from heterogeneous molecules exist6,11, significant scope remains for their improvement through the use of advanced ML algorithms such as XGBoost, RF and DNN.
For every ML algorithm, it is critical to choose the positive and negative datasets prudently. It is difficult to build a reliable model without considerable differences between the features of the positive and the negative data. We achieved this by curating 390 antibiotic drugs from the DrugBank database as the positive set, while 1,170 randomly chosen FDA-approved drugs that lack any kind of antimicrobial function (antiviral, antifungal, antiprotozoal and antiparasitic), as confirmed by the ChEMBL database14, were considered the negative set. More than 80 per cent of the drugs from the positive dataset clustered at Tanimoto coefficient values <0.3 in the MDS using ChemMine, indicating that they had diverse structures. In contrast, the negative dataset had Tanimoto coefficient values below 0.33 in the MDS for 80 per cent of the compounds. A previous study by Yang's11 group used 230 antibacterial and 381 non-antibacterial compounds. Masalha's6 group used a set of 628 antibacterial drugs from the Comprehensive Medicinal Chemistry database as positives and another 2,892 phytochemicals from Analyticon Discovery GmbH as the negative dataset. However, none of the above studies excluded antimicrobial compounds from the negative dataset. On the other hand, Stokes's13 group screened 2,335 unique compounds, including 1,760 molecules from the FDA-approved drug library and 800 natural products isolated from plant, animal and microbial sources, for growth inhibition against E. coli BW25113 and used the positive hits as the training dataset.
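The structural-diversity assessment described above was performed with the ChemMine tool; a comparable check can be sketched with RDKit by computing pairwise Tanimoto coefficients over Morgan fingerprints. The molecules and fingerprint settings below are illustrative assumptions, not those of the study.

```python
# Sketch of a pairwise structural-diversity check using RDKit Morgan fingerprints
# and Tanimoto coefficients; the molecules below are illustrative placeholders.
from itertools import combinations

from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

smiles = {
    "penicillin G": "CC1(C)SC2C(NC(=O)Cc3ccccc3)C(=O)N2C1C(=O)O",
    "ampicillin": "CC1(C)SC2C(NC(=O)C(N)c3ccccc3)C(=O)N2C1C(=O)O",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}

# 2048-bit Morgan fingerprints (radius 2), a common stand-in for structural keys.
fps = {name: AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
       for name, s in smiles.items()}

# Pairwise Tanimoto coefficients; values below ~0.3 suggest structurally diverse pairs.
for a, b in combinations(fps, 2):
    print(f"{a} vs {b}: {DataStructs.TanimotoSimilarity(fps[a], fps[b]):.2f}")
```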
Unlike Stokes's13 group, we used only known drug molecules for the construction of both the positive and negative datasets because, first, their drug-likeness and bioavailability are already known; second, many of these have passed clinical trials; and third, the majority of them are currently available in the market. Our approach might prove useful in light of considerable interest in repurposing already-marketed drugs as antibiotics, which would cut down the cost, time and effort required for clinical trials.
In this study, 80 per cent of the positive and negative datasets were used for training and testing by stratified five-fold cross-validation, along with two other k-fold cross-validation schemes (k=3 and k=5), while the remaining 20 per cent of the data were kept aside as a blind set for model validation. Although Yang's11 group followed a similar approach, Masalha's6 group used only 3-fold cross-validation, which is less than optimal for shuffling the data sufficiently to remove inherent bias in the dataset. In both the training and blind datasets, three different positive-to-negative data ratios (1:1, 1:2 and 1:3) were used to check for overfitting of the models. We observed that the balanced dataset, with an equal number of positive and negative instances, performed better than the unbalanced sets. Other studies did not check the effects of balanced versus unbalanced datasets on the performance of the prediction algorithms.
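A minimal sketch of this evaluation protocol, holding out 20 per cent of the data as a blind set and running stratified five-fold cross-validation on the remainder, is given below; the synthetic data and the random forest settings are placeholders rather than the study's actual configuration.

```python
# Sketch of the evaluation protocol: 20% blind hold-out plus stratified 5-fold CV.
# The synthetic descriptor matrix and model settings are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Placeholder data standing in for the descriptor matrix (rows = drugs, columns = features).
X, y = make_classification(n_samples=624, n_features=17, weights=[0.5, 0.5], random_state=0)

X_train, X_blind, y_train, y_blind = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=500, random_state=0)

scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Final fit on the training portion, then evaluation on the untouched blind set.
model.fit(X_train, y_train)
print("Blind-set accuracy: %.3f" % model.score(X_blind, y_blind))
```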
In this study, 2D physicochemical properties comprising 17 features (PubChemPy) and all 1,440 features (PaDELpy) were compiled for the datasets. In contrast, Yang's11 group used only 198 manually selected molecular descriptors, while Stokes's13 group combined a graph representation of each molecule with molecular features computed by RDKit. A combination of univariate, L1-based and tree-based feature selection methods was used in this study, whereas previous studies used only a single method such as recursive feature elimination11 or iterative stochastic elimination6.
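The combination of univariate, L1-based and tree-based feature selection can be sketched with scikit-learn as follows; the feature counts, regularization strength and the rule of keeping only features retained by all three filters are assumptions made for illustration, not the study's exact procedure.

```python
# Sketch of combining univariate, L1-based and tree-based feature selection.
# Thresholds, feature counts and the intersection rule are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for a 1,440-descriptor matrix.
X, y = make_classification(n_samples=624, n_features=1440, n_informative=40, random_state=0)

# 1. Univariate ANOVA F-test, keeping the top 100 features.
univariate = SelectKBest(f_classif, k=100).fit(X, y).get_support()

# 2. L1-regularized logistic regression (sparse coefficients).
l1_based = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit(X, y).get_support()

# 3. Tree-based feature importances.
tree_based = SelectFromModel(
    ExtraTreesClassifier(n_estimators=200, random_state=0)).fit(X, y).get_support()

# Keep the features retained by all three methods.
selected = np.where(univariate & l1_based & tree_based)[0]
print("Number of features retained:", selected.size)
```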
Our study employed eight ML classifiers, namely XGBoost, RF, GBC, DNN, SVM, MLP, DT and LR, to construct the prediction models, from which the top four models, namely XGBoost, RF, GBC and DNN, were chosen for further analysis. Yang’s11 group used three algorithms, namely SVC, k-NN and C4.5 DT, for their study, whereas the Chemprop server developed by Stokes’s13 group was solely based on DNN.
Our antibiotic prediction server, ABDpred, is the first to use a soft-voting technique for antibiotic discovery, aggregating the top four classifiers (XGBoost, RF, GBC and DNN). Soft voting can yield improved prediction results because it takes each classifier's uncertainty into account when determining how strongly that classifier influences the final ensemble decision27. The ABDpred server also reports information on the druggability and bioavailability of the input chemicals. At the same time, the server compares the structural similarity of the input drugs/chemicals to known antibiotics based on Tanimoto similarity scores (cut-off 0.6). A higher similarity score indicates a greater potential for the input drug/chemical to be used as an antimicrobial compound in the future. The Chemprop server, the only other existing web server with similar functionality, does not provide this kind of information about the input chemicals.
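A minimal sketch of such a soft-voting ensemble with scikit-learn is shown below; an MLPClassifier stands in for the deep neural network, and the hyperparameters and synthetic data are illustrative assumptions rather than the actual ABDpred models.

```python
# Minimal sketch of a soft-voting ensemble over four classifiers.
# An MLPClassifier stands in for the deep neural network; all settings are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=624, n_features=17, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=300, eval_metric="logloss")),
        ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
        ("gbc", GradientBoostingClassifier(random_state=0)),
        ("dnn", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)),
    ],
    voting="soft",  # average predicted class probabilities rather than hard labels
)
ensemble.fit(X, y)

# Probability of the positive class (assumed here to be the "antibiotic" label).
print("Predicted antibiotic probability:", ensemble.predict_proba(X[:1])[0, 1])
```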
We also evaluated the NCI chemical database (n=173,714) using the XGBoost and RF models and predicted 596 NCI chemicals as potential antibiotics. Interestingly, most of these predicted molecules had literature evidence of antibiotic activity28,29. Further, we tested our top four models to see whether they could accurately predict the antibiotic activity of more than 10 compounds reported by Ivanenkov's12 group, including halicin, which Stokes's13 group had predicted as a potent antibiotic. Our best RF model classified these compounds as antimicrobial with 67 per cent accuracy, 67 per cent sensitivity and an AUC of 0.67.
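Screening an external library and retaining only high-confidence predictions can be sketched as below; the fitted model, the probability cut-off of 0.9 and the placeholder descriptor matrices are assumptions for illustration only.

```python
# Sketch of flagging high-confidence hits when screening a compound library
# with a trained classifier; the data and model below are stand-in placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=624, n_features=17, random_state=0)
X_library, _ = make_classification(n_samples=5000, n_features=17, random_state=1)

model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

# Probability of the positive ("antibiotic") class for every library compound,
# keeping only predictions above a stringent 0.9 cut-off.
probabilities = model.predict_proba(X_library)[:, 1]
hits = np.flatnonzero(probabilities > 0.9)
print(f"{hits.size} of {len(X_library)} library compounds flagged as potential antibiotics")
```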
Overall, developing ML classifiers that can accurately identify antibacterial compounds from chemical databases would be advantageous for drug development. In this study, an ML-based method was developed by aggregating four classifiers. The RF model was the best performer among the four, achieving an accuracy of 88.55 per cent on the training set and 80.62 per cent on the blind dataset for predicting antimicrobial compounds. Our models predicted 596 small chemicals from the NCI chemical database as potential future antimicrobial compounds with high confidence (probability score higher than 0.9). Most of these predicted chemicals have already been reported to possess antibacterial activity and might be developed into antimicrobial compounds in the near future. Overall, this resource will be helpful for those working in antibiotic research and development, including pharmaceutical companies, clinical researchers and healthcare providers.
Financial support and sponsorship
This study received funding support from the Indian Council of Medical Research (No. 2019-3127/GEN/ADHOC-BMS) to SD.
Conflicts of interest
None.
Acknowledgment:
The authors acknowledge Drs. Sudipto Saha and Pujarini Dutta for critically reading the manuscript and providing valuable suggestions.
References
- 1. Laxminarayan R, Van Boeckel T, Frost I, Kariuki S, Khan EA, Limmathurotsakul D, et al. The Lancet Infectious Diseases Commission on antimicrobial resistance: 6 years later. Lancet Infect Dis. 2020;20:e51–60. doi: 10.1016/S1473-3099(20)30003-7.
- 2. Hutchings MI, Truman AW, Wilkinson B. Antibiotics: Past, present and future. Curr Opin Microbiol. 2019;51:72–80. doi: 10.1016/j.mib.2019.10.008.
- 3. Peterson E, Kaur P. Antibiotic resistance mechanisms in bacteria: Relationships between resistance determinants of antibiotic producers, environmental bacteria, and clinical pathogens. Front Microbiol. 2018;9:2928. doi: 10.3389/fmicb.2018.02928.
- 4. Acharya KP, Wilson RT. Antimicrobial resistance in Nepal. Front Med (Lausanne). 2019;6:105. doi: 10.3389/fmed.2019.00105.
- 5. Lewis K. The science of antibiotic discovery. Cell. 2020;181:29–45. doi: 10.1016/j.cell.2020.02.056.
- 6. Masalha M, Rayan M, Adawi A, Abdallah Z, Rayan A. Capturing antibacterial natural products with in silico techniques. Mol Med Rep. 2018;18:763–70. doi: 10.3892/mmr.2018.9027.
- 7. Lewis K. Platforms for antibiotic discovery. Nat Rev Drug Discov. 2013;12:371–87. doi: 10.1038/nrd3975.
- 8. Leemans E, Mahasenan KV, Kumarasiri M, Spink E, Ding D, O’Daniel PI, et al. Three-dimensional QSAR analysis and design of new 1,2,4-oxadiazole antibacterials. Bioorg Med Chem Lett. 2016;26:1011–5. doi: 10.1016/j.bmcl.2015.12.041.
- 9. García-Domenech R, de Julián-Ortiz JV. Antimicrobial activity characterization in a heterogeneous group of compounds. J Chem Inf Comput Sci. 1998;38:445–9. doi: 10.1021/ci9702454.
- 10. Jaén-Oltra J, Salabert-Salvador MT, García-March FJ, Pérez-Giménez F, Tomás-Vert F. Artificial neural network applied to prediction of fluorquinolone antibacterial activity by topological methods. J Med Chem. 2000;43:1143–8. doi: 10.1021/jm980448z.
- 11. Yang XG, Chen D, Wang M, Xue Y, Chen YZ. Prediction of antibacterial compounds by machine learning approaches. J Comput Chem. 2009;30:1202–11. doi: 10.1002/jcc.21148.
- 12. Ivanenkov YA, Zhavoronkov A, Yamidanov RS, Osterman IA, Sergiev PV, Aladinskiy VA, et al. Identification of novel antibacterials using machine learning techniques. Front Pharmacol. 2019;10:913. doi: 10.3389/fphar.2019.00913.
- 13. Stokes JM, Yang K, Swanson K, Jin W, Cubillos-Ruiz A, Donghia NM, et al. A deep learning approach to antibiotic discovery. Cell. 2020;180:688–702.e13. doi: 10.1016/j.cell.2020.01.021.
- 14. Gaulton A, Hersey A, Nowotka M, Bento AP, Chambers J, Mendez D, et al. The ChEMBL database in 2017. Nucleic Acids Res. 2017;45:D945–54. doi: 10.1093/nar/gkw1074.
- 15. Backman TW, Cao Y, Girke T. ChemMine tools: An online service for analyzing and clustering small molecules. Nucleic Acids Res. 2011;39:W486–91. doi: 10.1093/nar/gkr320.
- 16. Poroikov VV, Filimonov DA, Ihlenfeldt WD, Gloriozova TA, Lagunin AA, Borodina YV, et al. PASS biological activity spectrum predictions in the enhanced open NCI database browser. J Chem Inf Comput Sci. 2003;43:228–36. doi: 10.1021/ci020048r.
- 17. Bertoni M, Duran-Frigola M, Badia-I-Mompel P, Pauls E, Orozco-Ruiz M, Guitart-Pla O, et al. Bioactivity descriptors for uncharacterized chemical compounds. Nat Commun. 2021;12:3932. doi: 10.1038/s41467-021-24150-4.
- 18. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, et al. PubChem in 2021: New data content and improved web interfaces. Nucleic Acids Res. 2021;49:D1388–95. doi: 10.1093/nar/gkaa971.
- 19. Yap CW. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem. 2011;32:1466–74. doi: 10.1002/jcc.21707.
- 20. Majumder AB, Gupta S, Singh D, Acharya B, Gerogiannis VC, Kanavos A, et al. Heart disease prediction using concatenated hybrid ensemble classifiers. Algorithms. 2023;16:538.
- 21. Araya-Cloutier C, Vincken JP, van de Schans MGM, Hageman J, Schaftenaar G, den Besten HMW, et al. QSAR-based molecular signatures of prenylated (iso)flavonoids underlying antimicrobial potency against and membrane-disruption in Gram positive and Gram negative bacteria. Sci Rep. 2018;8:9267. doi: 10.1038/s41598-018-27545-4.
- 22. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev. 2001;46:3–26. doi: 10.1016/s0169-409x(00)00129-0.
- 23. Yusof I, Shah F, Hashimoto T, Segall MD, Greene N. Finding the rules for successful drug optimisation. Drug Discov Today. 2014;19:680–7. doi: 10.1016/j.drudis.2014.01.005.
- 24. Serafim MSM, Kronenberger T, Oliveira PR, Poso A, Honório KM, Mota BEF, et al. The application of machine learning techniques to innovative antibacterial discovery and development. Expert Opin Drug Discov. 2020;15:1165–80. doi: 10.1080/17460441.2020.1776696.
- 25. Morjan RY, Al-Attar NH, Abu-Teim OS, Ulrich M, Awadallah AM, Mkadmh AM, et al. Synthesis, antibacterial and QSAR evaluation of 5-oxo and 5-thio derivatives of 1,4-disubstituted tetrazoles. Bioorg Med Chem Lett. 2015;25:4024–8. doi: 10.1016/j.bmcl.2015.04.070.
- 26. Badura A, Krysiński J, Nowaczyk A, Buciński A. Application of artificial neural networks to prediction of new substances with antimicrobial activity against Escherichia coli. J Appl Microbiol. 2021;130:40–9. doi: 10.1111/jam.14763.
- 27. Sherazi SWA, Bae JW, Lee JY. A soft voting ensemble classifier for early prediction and diagnosis of occurrences of major adverse cardiovascular events for STEMI and NSTEMI during 2-year follow-up in patients with acute coronary syndrome. PLoS One. 2021;16:e0249338. doi: 10.1371/journal.pone.0249338.
- 28. Pothineni VR, Wagh D, Babar MM, Inayathullah M, Watts RE, Kim KM, et al. Screening of NCI-DTP library to identify new drug candidates for Borrelia burgdorferi. J Antibiot (Tokyo). 2016;70:308–12. doi: 10.1038/ja.2016.131.
- 29. Arora G, Gagandeep, Behura A, Gosain TP, Shaliwal RP, Kidwai S, et al. NSC 18725, a pyrazole derivative inhibits growth of intracellular Mycobacterium tuberculosis by induction of autophagy. Front Microbiol. 2019;10:3051. doi: 10.3389/fmicb.2019.03051.
Associated Data
Supplementary Materials
The 596 National Cancer Institute chemicals were screened as an independent set using the best machine-learning classifier models
(A and B) Multidimensional scaling (MDS) of the positive set (antibiotic compounds) and the negative set (non-antimicrobial compounds) using the ChemMine tool; a Tanimoto similarity cut-off of 0.4 was chosen, and the two-dimensional coordinates V1 and V2 were derived from the Tanimoto similarity scores.
Density plots of the 17 features (PubChemPy), including molecular weight (mw), xlogp3-aa (xlogp3), hydrogen bond donor count (hbdc), hydrogen bond acceptor count (hbac), rotatable bond count (rbc), exact mass (exact_mass), monoisotopic mass (mono_mass), topological polar surface area (tpsc), heavy atom count (hac), formal charge (charge), complexity, isotope atom count (iac), defined atom stereocenter count (dasc), undefined atom stereocenter count (uasc), defined bond stereocenter count (dbsc), undefined bond stereocenter count (ubsc) and covalently-bonded unit count (cbu). Antimicrobial compounds are shown in blue and non-antibiotic compounds in red.
XGBoost decision tree trained on the 17 selected features with P:N=1:1.
Decision tree of the random forest algorithm that was applied to the selected 17 features with a P:N ratio of 1:1.
(A-D) Positive training set (312) evaluation using the top four machine learning classifier models (extreme gradient boosting, random forest, gradient boosting classifier and deep neural network) based on 17 selected features, P: N= 1:1. XGBoost, Extreme gradient boosting; RF, random forest; GBC, gradient boosting classifier; DNN, deep neural network.
The Venn diagram illustrates the antimicrobial compounds identified amongst the NCI chemicals, which were analyzed as an independent set using the top machine learning classifiers, extreme gradient boosting (XGBoost) and random forest. XGBoost predicted 13,552 NCI chemicals to have a probability score above 0.9 (threshold probability value of 0.56) for antimicrobial activity. Of these, 596 small chemicals were predicted with high confidence to exhibit antimicrobial activity by both the XGBoost and random forest classifiers.
