Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was responsible for over 4 million confirmed cases of severe acute respiratory syndrome, of which more than 300,000 had resulted in death as of May 2020. The virulent endocytotic activities of SARS-CoV-2 have been associated with angiotensin-converting enzyme 2 (ACE2) and transmembrane protease serine 2 (TMPRSS2). Previous studies on the viral activation of TMPRSS2 have more often than not focused on isoform 2 of TMPRSS2, but isoform 1 (529 residues) has also been shown to be expressed in target cells and to contribute to viral activation in the host. The inhibition of TMPRSS2 has been reported to grossly reduce the pathogenic effects of SARS-CoV-2 endocytotic activities.
In this study therefore, we developed two machine learning models using random forest classifier (RFC) and neural networks (NNs) based on 2251 serine protease inhibitors to screen a database of 21,000,000 virtual compounds. We screened the hit compounds using absorption, distribution, metabolism, and excretion (ADME) properties and finally docked the filtered compounds into the predicted binding site of TMPRSS2 isoform 1 homology model to determine their corresponding binding affinity and plausible molecular interactions.
One lead compound (ASOINN) and four lead compounds (ASOIRFC1–4) were obtained from the ADME-filtered NN and RFC hits, respectively. These leads had better binding affinity and lead-likeness properties than camostat, which could be due to their extensive hydrogen bond and hydrophobic interactions.
Keywords: Camostat, Molecular docking, Neural network, Random forest, SARS-CoV-2, TMPRSS2
1. Introduction
Severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) have been linked to SARS coronavirus (SARS-CoV) and MERS coronavirus (MERS-CoV), respectively [1,2]. As of September 1, 2021, a total of 218,946,836 laboratory-confirmed cases of SARS infection by SARS-CoV-2 had been recorded, of which 4,539,724 resulted in death [3].
SARS-CoV-2, a novel coronavirus detected late 2019 and closely related to SARS-CoV, has been etiologically implicated in the new pulmonary disease [[4], [5], [6]]. The viral endocytotic activities of coronaviruses and specifically those of SARS-CoV-2 in the targeted host cells are aided by the S1 unit of the spike protein (S), thus enhancing viral attachment and entry into the host cells via angiotensin-converting enzyme 2 (ACE2), which can be blocked via transmembrane protease serine 2 (TMPRSS2) inhibitors [7].
In addition, the priming of S protein for SARS-CoV-2 viral entry is also aided by TMPRSS2 [[8], [9], [10]] and endosomal cysteine proteases cathepsin B and L (CatB/L). Nevertheless, the activity of CatB/L is negligible compared to the activity of TMPRSS2, which is necessary for viral pathogenesis and spread in the host cell [[11], [12], [13]]. The isoform 1 of TMPRSS2 has been shown to activate SARS-CoV spike protein for independent entry into the host cells via cathepsin L [[14], [61]].
Currently, no broad-spectrum antiviral drugs are available for the treatment of highly virulent respiratory viral diseases, which include MERS and SARS. In this study, therefore, we developed two machine learning models using a random forest classifier (RFC) and neural networks (NNs) based on 2251 inhibitors of serine proteases to screen a database of 21,000,000 virtual compounds. We screened the hit compounds using absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties and finally docked the filtered compounds into the active site of TMPRSS2 to determine their corresponding binding affinity and plausible molecular interactions.
2. Materials and methods
2.1. Software and hardware
All analyses were carried out using Linux Ubuntu 18.04 system running on a 12-GB RAM, core i5, 4 core, 2.5-GHz hp Pavilion dm4 laptop.
All python packages were run on python 3.6 using Jupyter Lab 1.2.6. Python packages used include Scikit-learn 0.22.2, Tensorflow 2.1.0, Feature Selector, Pandas 1.0.3, Matplotlib 3.2.1, Seaborn 0.10.0, Numpy 1.18.1, Skater 1.0.2, and TPOT (Tree-Based Pipeline Optimization Tool) 0.11.1.
2.2. Data extraction and descriptor calculation
Simplified Molecular Input Line Entry System (SMILES) strings for a total of 2253 inhibitors of TMPRSS2 isoform 1 were downloaded from the Chemical European Molecular Biology Laboratory (CHEMBL) database, together with their corresponding half-maximal inhibitory concentration (IC50) values. A Python script using the pandas module was written to clean up the downloaded data (i.e., remove duplicates and SMILES with missing IC50 values, and extract only the SMILES and their corresponding IC50).
Molecular Operating Environment (MOE) [15] was used to calculate 2D descriptors for the SMILES. A total of 206 descriptors were calculated.
2.3. Exploratory data analysis
Exploratory data analysis was carried out using a Python script (Pandas, Feature Selector, Matplotlib, and Seaborn). IC50 values were converted to pIC50, the negative base-10 logarithm of the IC50 expressed in molar units, using the formula pIC50 = 9 − log10(IC50), with IC50 in nanomolar. The pIC50 values were converted into categorical values of active (1) and nonactive (0), and the activity threshold was set at 8.0 (active ≥ 8.0, nonactive < 8.0). Correlation among descriptors was assessed using the Python Feature Selector module, with the correlation threshold set at 0.75. All intercorrelated descriptors were removed.
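The conversion and labeling step above can be sketched in a few lines of Python. This is a minimal illustration, assuming IC50 values are given in nanomolar (as the constant 9 in the formula implies); the function names and the example IC50 values are hypothetical and are not taken from the study's script.

```python
import math

ACTIVITY_THRESHOLD = 8.0  # pIC50 cutoff stated in the text


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = 9 - log10(IC50 [nM])."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def activity_label(pic50, threshold=ACTIVITY_THRESHOLD):
    """Return 1 (active) if pIC50 >= threshold, otherwise 0 (nonactive)."""
    return 1 if pic50 >= threshold else 0


# Hypothetical IC50 values in nM, for illustration only.
example_ic50_nm = [1.0, 10.0, 250.0]
example_pic50 = [pic50_from_nm(value) for value in example_ic50_nm]
example_labels = [activity_label(value) for value in example_pic50]
```

Under this threshold, a compound needs an IC50 of 10 nM or lower to be labeled active.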
2.4. Tree-based pipeline optimization tool analysis
A Python script implementing TPOT analysis [16] was run to identify the best machine learning algorithm and its corresponding hyperparameters for our modeling task. The TPOT classifier class was used with the early stopping parameter set to true, verbose set to 4, and 10-fold cross-validation with shuffling. All other parameters were left at their default values.
2.5. Model building
To reduce the number of training features, we used a Python script implementing the Feature Selector module. The module uses the XGBoost algorithm [17] to identify zero- and low-importance features. The cumulative importance value was set at 0.95, the task was set to classification, and the evaluation metric used was l2. All other parameters were left at default. The number of features was reduced to 43. The 43 descriptors were normalized using the Scikit-Learn RobustScaler [18].
The RFC was the best according to our TPOT analysis, hence an RFC algorithm was used to build a model based on our 43 descriptors using Scikit-Learn RFC class [18]. The following hyperparameters were used: bootstrap: False, criterion: entropy, max_features: 0.2, min_samples_leaf: 1, min_samples_split: 15, n_estimators: 100, and class weight was set at 0: 0.60, 1: 0.4. All other parameters were left at default.
A deep NN model was built using the Python TensorFlow module [19]. To find the best values for the NN hyperparameters, namely epoch, batch size, and optimization function, we wrote a Python script implementing the Scikit-learn Keras wrapper and the GridSearchCV class [18]. Cross-validation was set to 10-fold and verbose to 4; the search used an NN with three layers of 100 units each and ReLU as the activation function.
After finding the best values for the abovementioned hyperparameters, the NN for our modeling task was constructed. The NN architecture had the following parameters: input dimension: 44; three dense layers with 100 neurons (units) each; and an output dense layer with 2 units (to accommodate both active (1) and nonactive (0)). Each of the dense layers used the ReLU activation function and an l2(0.01) regularizer, and the output layer used the sigmoid activation function. Finally, the network was compiled using the binary cross-entropy loss function, the Adam optimizer, and categorical accuracy as the evaluation metric, with early stopping and model checkpoint as the callbacks, epoch: 20, batch size: 5, verbose: 2, and class weight: 0: 60, 1: 40. All other parameters were left at default.
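As a worked piece of arithmetic, the size of the network described above can be estimated by counting weights and biases layer by layer. The sketch below assumes standard fully connected layers with one bias term per unit and takes the input dimension of 44 stated above; the function names are illustrative only.

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus biases of one fully connected (dense) layer."""
    return n_inputs * n_units + n_units


def mlp_param_count(input_dim, layer_units):
    """Total trainable parameters of a stack of dense layers."""
    total = 0
    previous = input_dim
    for units in layer_units:
        total += dense_param_count(previous, units)
        previous = units
    return total


# 44 inputs, three hidden layers of 100 units each, and a 2-unit output layer.
total_params = mlp_param_count(44, [100, 100, 100, 2])
```

Under these assumptions the count is 4500 + 10,100 + 10,100 + 202 = 24,902 trainable parameters, which is large relative to a training set of 1576 compounds and helps explain why regularization and early stopping were applied.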
2.6. Classification metric evaluation
The classification models were evaluated using Scikit learn metric module [18] to calculate various classification evaluation metrics as described by Abadi et al. [19].
2.7. Database screening
SCUBIDOO [20], a database of 21,000,000 compounds, was screened with both models. SCUBIDOO provides representative samples of the database: 9994 (S), 99,977 (M), and 999,794 (L) compounds. We screened the M sample with our models, and hits were filtered using two criteria: synthesis probability (≥ 0.9) and drug-likeness (= 1). These criteria were calculated using MOE software [15], while the filtration was executed using a Python script. The hit compounds from this filtration process were then subjected to ADMET property verification and further reduced.
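The two-criterion filter described above can be sketched as follows. This is a minimal illustration rather than the study's script: the record layout and the field names `synthesis_probability` and `drug_likeness` are assumptions, and the example records are invented.

```python
def filter_hits(hits, min_synthesis_probability=0.9, required_drug_likeness=1):
    """Keep hits that satisfy both the synthesis and drug-likeness criteria."""
    return [
        hit
        for hit in hits
        if hit["synthesis_probability"] >= min_synthesis_probability
        and hit["drug_likeness"] == required_drug_likeness
    ]


# Hypothetical predicted-active compounds, for illustration only.
example_hits = [
    {"id": "cmpd_1", "synthesis_probability": 0.95, "drug_likeness": 1},
    {"id": "cmpd_2", "synthesis_probability": 0.85, "drug_likeness": 1},
    {"id": "cmpd_3", "synthesis_probability": 0.99, "drug_likeness": 0},
]
filtered_ids = [hit["id"] for hit in filter_hits(example_hits)]
```

Only `cmpd_1` passes both criteria in this toy example.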
2.8. Sequence retrieval, alignment, and homology modeling of transmembrane protease serine 2 isoform 1 target protein
No crystal structure of either TMPRSS2 isoform 1 or isoform 2 was available in the PDB database; thus a homology model was generated for this study. The protein FASTA query sequence of TMPRSS2 isoform 1 was retrieved from the NCBI database with the accession number NP_001128571.1. The template 5ce1.1.A was selected for the homology modeling using the SWISS-MODEL webserver [21].
2.9. Optimization and refinement of transmembrane protease serine 2 isoform 1 modeled protein
The generated homology model of TMPRSS2 isoform 1 was uploaded to the 3Drefine webserver [22], which combines iterative optimization of the hydrogen bonding network with atomic-level energy minimization of the optimized protein model, using a composite physics- and knowledge-based force field for efficient protein structure refinement. The output of the 3Drefine webserver was further refined using the GalaxyRefine webserver [23], which works by rebuilding and repacking side chains and relaxing the overall protein structure using molecular dynamics simulation (MDS) [23].
2.10. Validation and quality estimation of optimized transmembrane protease serine 2 isoform 1 modeled protein
The optimized TMPRSS2 isoform 1 model was validated using RAMPAGE and PROSESS [24]. Quality estimation of the modeled protein was carried out using SAVES server and ProSA-web [25]. Ramachandran plot quality estimation was done using WHAT IF webserver. The resolution of the optimized modeled protein was calculated using ResProx server [26] and visualized using Discovery Studio 3.0 [27].
2.11. Physiochemical characterization of transmembrane protease serine 2 isoform 1
Using the ProtParam webserver, the individual percentage of amino acid residues, molecular weight, theoretical pI, atomic composition, extinction coefficient, and instability index of the model protein were determined [28].
2.12. Binding site prediction of transmembrane protease serine 2 isoform 1 modeled protein
The recognition of protein-ligand binding sites is of great importance in drug discovery. The binding site of the proposed lead compound(s) was predicted using P2Rank, which is a template-free machine learning algorithm embedded in PrankWeb [29,30]. This works based on local chemical neighborhood ligandability unified on junctures placed on a solvent-accessible protein surface. Junctures or points with increased ligandability score are then clustered to form the resulting ligand binding sites [29].
2.13. Submission of the model in protein model database
The model generated for TMPRSS2 isoform 1 was successfully submitted to the Protein Model Database (PMDB) with PMDB ID: PM0083140.
2.14. Molecular docking using AutoDock Vina
For our analysis, we used the AutoDock Vina exhaustive search docking function in PyRx. After the minimization process, the grid box for the TMPRSS2 homology model was centered at (1.1075, −1.3338, 15.7311) along the x, y, and z axes, respectively, with grid dimensions of 70.919 × 58.432 × 58.519 Å to define the binding site of the protein. Camostat was first docked within the binding site of TMPRSS2, and the resulting interactions were compared with those of the hits docked into the same binding site using the same grid box dimensions.
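For readers who prefer the command-line version of AutoDock Vina to the PyRx interface, the grid box above maps onto a Vina configuration file. The sketch below only builds the configuration text from the reported center and dimensions; the receptor and ligand file names are hypothetical placeholders, and the exhaustiveness setting is left unset because its value is not reported here.

```python
def vina_config(receptor, ligand, center, size):
    """Build AutoDock Vina configuration text for a single docking run."""
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {center[0]}",
        f"center_y = {center[1]}",
        f"center_z = {center[2]}",
        f"size_x = {size[0]}",
        f"size_y = {size[1]}",
        f"size_z = {size[2]}",
    ]
    return "\n".join(lines) + "\n"


# Grid box center and dimensions (in Å) as reported in the text.
config_text = vina_config(
    receptor="tmprss2_isoform1_model.pdbqt",  # hypothetical file name
    ligand="camostat.pdbqt",                  # hypothetical file name
    center=(1.1075, -1.3338, 15.7311),
    size=(70.919, 58.432, 58.519),
)
```

Reusing the same configuration for camostat and for each hit keeps the search space identical, which is what makes the resulting binding affinities comparable.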
3. Results and discussion
3.1. Model building and database screening
The use of machine learning algorithms to virtually screen databases of compounds with unknown biological activity for hit compounds has become an established protocol in drug discovery [31]. Machine learning algorithms can be classified into various categories, including logic-based algorithms (decision trees), statistical algorithms (support vector machine, Bayesian network, K-nearest neighbor), and perceptron-based algorithms (NNs) [32], but for the sake of this study, we classified them into two: NNs and non-NNs (logic-based and statistical algorithms).
In this study, we built models based on both types of machine learning algorithms and screened for hit compounds.
3.2. Data preprocessing
A total of 2251 serine protease inhibitors were extracted from CHEMBL, with pIC50 ranging from 3 to 11 (Fig. 28.1). 2D descriptors (206) were calculated using MOE software, descriptors with a correlation greater than 0.75 were removed (Fig. 28.1), and descriptors with zero or low importance were removed using the Feature Selector Python package. A total of 44 descriptors were carried forward for modeling. The pIC50 values were converted into categorical data of active and nonactive using a Python script, with compounds having a pIC50 of 8 and above labeled active.
Figure 28.1.

Distribution plot of (A) pIC50, (B) active and nonactive, and (C) correlation plot.
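The correlation-based descriptor pruning can be illustrated with a short sketch. It is a simplified stand-in for the Feature Selector module, not its actual implementation: it assumes a greedy rule that keeps a descriptor only if its absolute Pearson correlation with every previously kept descriptor does not exceed the threshold. The descriptor names and values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.75):
    """Return names of descriptors kept after greedy correlation pruning."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor columns, for illustration only.
descriptors = {
    "desc_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "desc_b": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly proportional to desc_a
    "desc_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept_descriptors = drop_correlated(descriptors)
```

Here `desc_b` is dropped because it is almost perfectly correlated with `desc_a`, while `desc_c` (correlation of −0.3 with `desc_a`) is retained.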
3.3. Non-neural networks
There is a huge array of non-NN algorithms, and each has its own set of hyperparameters that must be tuned to achieve a high-performance model. Selecting among these algorithms can be a daunting task; we therefore used a Python package, TPOT [16], to assist in this task.
TPOT is a genetic programming AutoML (Auto machine learning) protocol with the primary aim of maximizing classification or regression accuracy. TPOT classifier class was trained on the compounds and validated (training set, 1576; test set, 675). The analysis produced RFC as the best model for our modeling task with an accuracy of 0.802.
The RFC was trained on a training set of 1576 compounds and validated on a test set of 675 compounds. The RFC model was subjected to a 10-fold cross-validation and was evaluated using various categorical classification evaluation metrics (Table 28.1). A confusion matrix (Fig. 28.2) was constructed, from which the classification evaluation metrics were calculated.
Table 28.1.
The 10-fold cross-validation evaluation of random forest classifier.
| Metrics | Score |
|---|---|
| Precision | 0.85 |
| Recall (sensitivity) | 0.77 |
| Accuracy | 0.85 |
| Error rate | 0.18 |
| F1 | 0.80 |
| ROC_AUC | 0.84 |
| Specificity | 0.85 |
| Balanced accuracy | 0.81 |
Figure 28.2.

Random forest confusion matrix.
3.4. Non-neural network performance evaluation
Precision measures the positive predictive value; that is, how well the model identifies true positives as against false positives. The RFC model has a precision score of 0.85 (Table 28.1). Recall, also known as sensitivity, is a measure of the true positive rate; that is, it evaluates how well the model distinguishes true positives from false negatives. F1 is the harmonic mean of precision and recall. Specificity measures the true negative rate; that is, how well a model classifies true negatives correctly.
To get a full picture of how well the model identifies true positives (sensitivity) and true negatives (specificity), we calculated another metric called balanced accuracy (Table 28.1). Balanced accuracy is the mean of sensitivity and specificity, and it enables us to evaluate the model's ability to identify both active and nonactive inhibitors. The RFC model had a balanced accuracy of 0.808. The ROC_AUC score measures how often the model ranks a true positive ahead of a false positive (Fig. 28.3). The RFC model had an ROC_AUC score of 0.84. Note that all the metric values range from 0 to 1, with 1 being the best value and 0 the worst value possible (except for the error rate, for which lower is better).
Figure 28.3.

The ROC curve of random forest classifier (ROC_AUC score: 0.84).
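The ranking interpretation of the ROC_AUC score can be made concrete with a small sketch: the score equals the fraction of (active, nonactive) pairs in which the active compound receives the higher predicted score, with ties counted as half. The labels and scores below are invented for illustration and are unrelated to the study's data.

```python
def roc_auc_pairwise(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1: active, 0: nonactive) and predicted scores.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = roc_auc_pairwise(toy_labels, toy_scores)
```

In this toy case, 8 of the 9 pairs are ranked correctly, giving an AUC of about 0.89.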
The evaluation of the RFC model with the abovementioned metrics therefore suggests that the model is robust and reliable for screening compounds with unknown biological activity.
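The threshold-based metrics discussed in this section all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions; the counts used are hypothetical and are not those of the confusion matrix in Fig. 28.2.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "f1": 2.0 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2.0,
    }


# Hypothetical counts, for illustration only.
toy_metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

With these counts, recall is 0.80, specificity is 0.90, and balanced accuracy is their mean, 0.85.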
3.5. Model interpretation
Machine learning models provide an immense opportunity for predicting unknowns, but they come with the challenge of interpretability. Most of the best performing algorithms are black boxes in nature, and how they arrive at their predictions from the input features is not known [9]. It is therefore imperative to provide some form of model explanation for every selected model. We used the Skater Python library to provide a suitable model explanation. Skater explains machine learning models using two major plots: the feature importance plot and the partial dependence plot (PDP). The feature importance plot ranks features in the order of their importance to the model. It revealed SLogP_VSA4, SMR_VSA4, vsa_other, apol, and GCUT_SLOGP_0 as the top descriptors important to the model (Fig. 28.4).
Figure 28.4.

Feature importance of the random forest classifier model.
The PDP investigates the marginal effect of different values of a descriptor on a predicted outcome of the machine learning model [33]. For categorical models, PDP investigates the marginal effect of the descriptor on the prediction probability of a class. For our study, the class in consideration is the active class.
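The idea behind a partial dependence curve can be sketched without any plotting library: for each grid value of the chosen descriptor, that value is substituted into every compound's descriptor vector and the predicted active-class probabilities are averaged. The code below is a generic illustration of this definition, not Skater's implementation; `toy_model` is a hypothetical stand-in for a trained classifier.

```python
def partial_dependence(predict_proba, rows, feature_index, grid):
    """Average predicted probability with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)             # copy so the input is not changed
            modified[feature_index] = value  # fix the feature of interest
            total += predict_proba(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Hypothetical classifier returning the probability of the active class."""
    score = 0.5 * row[0] + 0.1 * row[1]
    return min(1.0, max(0.0, score))


# Hypothetical two-descriptor data set, for illustration only.
toy_rows = [[0.2, 1.0], [0.4, 3.0], [0.6, 5.0]]
toy_curve = partial_dependence(toy_model, toy_rows, 0, [0.0, 0.5, 1.0])
```

A rising curve, as in this toy case, indicates that increasing the descriptor raises the average predicted probability of the active class.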
The PDP suggests that increasing the values of SLogP_VSA4 has the highest marginal effect on the predicted outcome (i.e., active class) of the model. SMR_VSA4 had the next highest marginal effect; however, this effect was only visible within a narrow range of 0.8–1.0. Further increases in the values of SMR_VSA4 resulted in decreased effects, which were stable afterward (Fig. 28.5). vsa_other, apol, and GCUT_SLOGP_0 had an equal marginal effect on the active class predictions (Fig. 28.5).
Figure 28.5.

Partial dependence plot of random forest classifier model: (A) SLogP_VSA4, (B) SMR_VSA4, (C) vsa_other, (D) apol, and (E) GCUT_SLOGP_0.
3.5.1. Neural network
In this study, an NN was used to build a model on the inhibitors of TMPRSS2. The NN model was built using the TensorFlow Python library [19]. The training set consisted of 1576 compounds, the test set for evaluation consisted of 675 compounds, and the validation data was 10% of the training set.
As noted for non-NN algorithms, the hyperparameters of NNs are also quite extensive. To address this, the Python Scikit-learn Keras wrapper and GridSearchCV class were used to search for the optimum values of epoch, batch size, and optimization function. The optimum values obtained from this exhaustive search after a 10-fold cross-validation were epoch: 20, batch size: 5, and optimizer: Adam. These values were therefore used to build the NN.
Another common problem of NNs is overfitting of the model to the training data. To avoid this, an l2 kernel regularizer (0.01) was set for each layer of the neural architecture, and an early stopping callback was set when compiling the neural architecture.
The model was trained over eight epochs: although the epoch parameter was set to 20, training was stopped early at the eighth epoch. The training loss, training categorical accuracy, validation loss, and validation categorical accuracy were measured at each epoch and plotted (Fig. 28.6).
Figure 28.6.

Model loss and accuracy, and validation loss and accuracy.
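The logic of an early stopping rule can be sketched as follows. This is a generic illustration, not the Keras callback itself: it assumes that training halts once the validation loss has failed to improve for `patience` consecutive epochs, a convention that may differ from the one used by the library. The loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=1, min_delta=0.0):
    """Return (stopped_epoch, best_epoch), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses per epoch, for illustration only.
toy_val_losses = [0.60, 0.45, 0.38, 0.40, 0.37]
stopped_at, best_at = early_stopping_epoch(toy_val_losses, patience=1)
```

With a patience of 1, training in this toy run stops at epoch 4 and the weights from epoch 3 would be the ones worth saving, which is the role a model checkpoint callback plays alongside early stopping.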
3.5.2. Neural network performance evaluation
The eighth epoch had a training loss of 0.26, a training categorical accuracy of 0.80, a validation loss of 0.25, and a validation categorical accuracy of 0.81 (Table 28.2). The eighth epoch was the best and was saved. The model was finally evaluated on the test set, giving a categorical accuracy of 0.84 and a loss of 0.46 (Table 28.2).
Table 28.2.
The eighth epoch neural network categorical evaluation.
| Metrics | Score |
|---|---|
| Training loss | 0.26 |
| Training categorical accuracy | 0.83 |
| Validation loss | 0.25 |
| Validation categorical accuracy | 0.81 |
| Test loss | 0.46 |
| Test categorical accuracy | 0.84 |
| Recall (sensitivity) | 0.82 |
| Precision | 0.79 |
| Accuracy | 0.84 |
| Error rate | 0.12 |
| F1 | 0.80 |
| Specificity | 0.85 |
| Balanced accuracy | 0.84 |
A confusion matrix was constructed (Fig. 28.7 ) and categorical metrics evaluated (Table 28.2) (it should be noted, however, that evaluation is on the eighth epoch only).
Figure 28.7.

The eighth epoch neural network confusion matrix.
The NN model had the following metric score: precision 0.79, recall (sensitivity) 0.82, F1 0.80, specificity 0.85, and balanced accuracy 0.84 (Table 28.2). The results, therefore, suggest that the NN model is reliable and can be used for extrapolations.
3.5.3. Model interpretation
As stated earlier, model interpretation is important in understanding machine learning model predictions. The Skater Python library was also used to provide explanations for the NN model through feature importance and PDP (Figs. 28.8 and 28.9). Of the 43 descriptors used to build the NN model, SLogP_VSA4, PEOE_VSA_FHYD, a_nf, and SMR_VSA4 were the most important features (Fig. 28.8). A PDP showed the marginal effect of these descriptors on the predicted outcome (active class). Of these top descriptors, a_nf had the highest marginal effect on the active class prediction probability, while PEOE_VSA_FHYD had the lowest (Fig. 28.9). For all of these descriptors, however, increasing values had an increasing marginal effect on the model predictions (Fig. 28.9).
Figure 28.8.

Feature importance of deep neural network model.
Figure 28.9.

Partial dependence plot of deep neural network model: (A) SLogP_VSA4, (B) PEOE_VSA_FHYD, (C) a_nf, and (D) SMR_VSA4.
3.5.4. Model comparison
Although comparing the two models was not the primary aim of this study, we evaluated both on several performance metrics to see which performed best.
From this comparison (Table 28.3), the NN had the lower error rate and the higher balanced accuracy; its sensitivity to true positives was also 0.05 higher than that of the RFC (0.82 vs. 0.77). This suggests that although both the NN and the RFC had a high level of accuracy and an equal level of true negative identification (specificity), the NN would most likely screen the database with higher sensitivity to compounds with inhibitory activity.
Table 28.3.
Machine learning model comparison.
| Metric | Random forest classifier | Neural network |
|---|---|---|
| Sensitivity (recall) | 0.77 | 0.82 |
| Specificity | 0.85 | 0.85 |
| Error rate | 0.18 | 0.12 |
| Balanced accuracy | 0.81 | 0.84 |
3.5.5. Database screening
Having developed and validated these two models (RFC and NN), we screened the SCUBIDOO database [20] sample of 100,000 compounds. The random forest model classified over 80,000 compounds as active and the NN model classified 3000 compounds as active (Fig. 28.10). The high number of compounds predicted as active by the RFC was, however, not surprising given the random forest's lower sensitivity score (Table 28.3). To reduce this number, a Python script was written to filter the compounds on two criteria: synthesis probability and drug-likeness. The script reduced the RFC model hits to 1600 compounds and the NN model hits to 250 compounds. Furthermore, using ADMET properties on SwissADME [34], the compounds were further screened down to 784 and 70, respectively.
Figure 28.10.

SCUBIDOO database screening activity prediction (1: active, 0: nonactive): (A) random forest classifier and (B) neural network.
3.6. Physiochemical properties of transmembrane protease serine 2 isoform 1
The physiochemical properties of TMPRSS2 isoform 1 were predicted and analyzed using the ProtParam webserver. The protein sequence for TMPRSS2 isoform 1 consisted of 346 amino acid residues, with Ser (8.7%) having the highest and Met and Phe (2.6%) the lowest composition, as indicated in Fig. 28.11. The computed isoelectric point (pI) of TMPRSS2 isoform 1 was 8.58 (pI > 7), which revealed the slightly alkaline nature of the model protein. The isoelectric point of a protein plays a critical role in experimental studies such as isoelectric focusing and 2D electrophoresis.
Figure 28.11.

Amino acid residue composition of transmembrane protease serine 2 isoform 1.
The stability of the protein was assessed by analyzing its instability index. The index for this protein was 35.18. This protein may be stable because its predicted value is below the threshold of 40 [35,36].
3.7. Subcellular localization of transmembrane protease serine 2 isoform 1
The Hum-mPLoc3 webserver [36] was used to predict the subcellular localization of TMPRSS2 isoform 1. The analysis revealed that the human TMPRSS2 isoform 1 is predominantly a plasma membrane protein (Fig. 28.12 ).
Figure 28.12.

Subcellular localization of transmembrane protease serine 2 isoform 1.
3.8. Homology modeling for predicting 3D structure of the human transmembrane protease serine 2 isoform 1
Quaternary structures with complex interactions and their physiologic roles are necessary for a detailed comprehension of the human system. Experimental elucidation of protein structures using either nuclear magnetic resonance spectroscopy or X-ray crystallography is more realistic but time consuming and occasionally unsuccessful in the case of membrane proteins, which necessitates the use of homology modeling techniques. The serine protease hepsin (template 5ce1.1.A), with a 33.82% sequence identity and 2.5 Å resolution, was used in the homology modeling of TMPRSS2 isoform 1.
The Global Model Quality Estimate helps in the identification and selection of an optimal template in the modeling process. For this study, the score was 0.48. The sequence similarity to the query sequence was 0.38, covering residues 183–528. The quality of the built model was assessed by the QMEAN scoring function, which uses potentials of mean force to generate global and per-residue quality estimates. The QMEAN score [37] of the best model generated was −1.39, as indicated in Fig. 28.13D. The quality of the model developed was assessed using the SAVES [38], PROSESS [24], and PROSA webservers [25]. The percentage of residues in the most favored regions and the percentage of Phi/Psi pairs in the favored regions of the Ramachandran plot of the nonoptimized protein were 86% and 92%, compared to the expected values of 93.08% and 97.60%, respectively, as indicated in Fig. 28.13A.
Figure 28.13.

(A–D) Homology model protein, quality, and validation check of transmembrane protease serine 2 (TMPRSS2) isoform 1 before optimization.
The overall quality factor of the model was assessed further using ERRAT and was observed to be 92.923%. A good model of high resolution is expected to have an ERRAT quality value of around 95% or higher. Thus the model developed has a lower resolution, as indicated in Fig. 28.13B.
One of the computational limitations of protein modeling is the variation of the experimental and predicted model from the native true structure of a protein, thus necessitating the need for refinement and optimization of protein models. In this study, an improvement in the model was made using 3Drefine [22], which led to the generation of five refined protein models from an initially input SWISS model.
3Drefine works by optimizing the hydrogen bonding network in addition to atomic-level energy minimization. The best model of the five generated, as indicated in Fig. 28.14D, has the lowest potential energy according to the 3Drefine force field. A quality check was carried out on this lowest-energy 3Drefine model. An improvement in the quality of the model was observed compared to the unrefined model, as indicated in Fig. 28.13D. The percentage of residues in the most favored regions and the percentage of Phi/Psi pairs in the favored regions of the Ramachandran plot of the 3Drefine-optimized protein were 89% and 96.51%, compared to 86% and 92% for the unrefined model, respectively. However, these values remained below the expected values of 93.08% and 97.60%. The ERRAT and Z-score qualities also improved, to 95.5% and −8.68, respectively, as indicated in Figs. 28.14 and 28.15.
Figure 28.14.

(A–D) Homology model protein, quality, and validation check of transmembrane protease serine 2 (TMPRSS2) isoform 1 after optimization using 3Drefine.
Figure 28.15.

(A–D) Homology model protein, quality, and validation check of transmembrane protease serine 2 (TMPRSS2) isoform 1 after optimization using GalaxyWEB.
The quality of the 3Drefine model was further optimized using GalaxyWEB [39], which works by rebuilding the side chains and relaxing the overall protein using MDS. The best of the five models generated was model one. A quality check and validation were also carried out on this model, which showed a marked increase in the quality of the protein. The percentage of residues in the most favored regions of the Ramachandran plot was 92%, which exceeded the expected by 2%. The ERRAT quality factor score also improved to 97.01%, which also exceeded the expected by 2%. The quality of the model assessed by PROSA also indicated an improvement in Z-score, as indicated in Fig. 28.15C. Thus, compared to the unrefined and 3Drefine models, GalaxyWEB generated the best model, as corroborated by the quality and validation scores.
3.9. Binding pocket prediction of transmembrane protease serine 2 isoform 1
P2Rank, a template-free machine learning algorithm embedded in PrankWeb [29,30], was used to predict the ligand binding sites on TMPRSS2. P2Rank scores points placed on the solvent-accessible protein surface for ligandability based on their local chemical neighborhood. The resulting binding sites are formed by clustering the points with high ligandability scores. The predicted binding sites are indicated in Table 28.4.
3.10. Molecular docking analysis
The set of computational methodologies used to analyze large databases with the aim of identifying potential hit candidates is referred to as virtual screening [[40], [41], [42]]. Ligand-based and structure-based (or receptor-based) screening are the two main types of virtual screening methods used in potential hit identification, and both were carried out in this study. Through ligand-based virtual screening with SwissADME, the 1600 and 250 compounds obtained from the RFC and NN models were reduced to 784 and 70 compounds, respectively; all of these compounds passed Lipinski's [43], Ghose's [44], Oprea's [45], Veber's [46], Varma's [47], Egan's [48], and Muegge's [49] rule filters for drug-likeness evaluation. In addition, no PAINS (pan assay interference compounds) [50] or Brenk alerts [51] were recorded, and all compounds passed the lead-likeness rules.
These compounds were then used in receptor-based virtual screening to obtain potential leads for comparison with camostat, an established TMPRSS2 inhibitor used here as the standard drug. The leads obtained from the receptor-based virtual screening of the compounds are shown in Figs. 28.16 and 28.17A–D. The binding affinities were −8.7, −8.4, −8.4, −8.5, and −8.5 kcal/mol, respectively, compared to −7.4 kcal/mol obtained for camostat (Fig. 28.18). All the amino acids involved in bound interactions are within the range of the predicted binding pockets, as indicated in Table 28.4.
Figure 28.17A.

(A) ASOIRFC1 interaction with transmembrane protease serine 2 isoform 1 at the binding site.
Figure 28.17B.

(B) ASOIRFC2 interaction with transmembrane protease serine 2 isoform 1 at the binding site.
Figure 28.17C.

(C) ASOIRFC3 interaction with transmembrane protease serine 2 isoform 1 at the binding site.
Figure 28.17D.

(D) ASOIRFC4 interaction with transmembrane protease serine 2 isoform 1 at the binding site.
Figure 28.16.

ASOINN interaction with transmembrane protease serine 2 isoform 1 at the binding site.
Figure 28.18.

Camostat interaction with the binding site of transmembrane protease serine 2 isoform 1.
Table 28.4.
Predicted binding sites and scores.
| Name | Score | Predicted residues of the binding pockets |
|---|---|---|
| Pocket 1 | 7.8177 | GLY296 ILE418 SER419 GLY420 GLY422 ALA423 THR424 ASN435 ALA436 ALA437 ASN470 VAL471 ASP472 SER473 ASP477 CYS502 ALA503 |
| Pocket 2 | 7.7105 | VAL317 HIS333 CYS334 GLU336 LEU339 GLU426 ASP472 SER473 CYS474 GLN475 GLY476 ASP478 TRP497 GLY498 SER499 GLY501 CYS502 GLY509 |
| Pocket 3 | 7.2816 | ARG184 ARG187 LEU188 GLY190 PHE193 MET225 TYR227 ARG277 CYS278 ALA280 CYS281 VAL283 SER487 ARG489 SER491 |
The compound ASOINN, indicated in Fig. 28.16, was the lead obtained from the structure-based virtual screening of the 70 compounds. The lead forms nine hydrophobic interactions, involving HIS333 (two), TRP498 (four), LYS379 (two), and LEU456, as well as two hydrogen bond interactions with SER497.
Compounds ASOIRFC1–4, indicated in Fig. 28.17A–D, were obtained from the structure-based virtual screening against TMPRSS2 isoform 1 of the 784 compounds that remained after SwissADME screening of the 1600 RFC compounds.
As indicated in Fig. 28.17A, ASOIRFC1 formed six hydrogen bond interactions, which included SER478, SER497, SER473, two GLY499, and GLY501, as well as five hydrophobic bonds (three TRP498, LEU456, and LYS379); these are in contrast to the nine hydrophobic interactions and two hydrogen bonds formed in ASOINN.
ASOIRFC2, indicated in Fig. 28.17B, formed five hydrophobic interactions, four hydrogen bond interactions, and one electrostatic interaction; the electrostatic interaction is not formed in ASOINN or ASOIRFC1.
As indicated in Fig. 28.17C, ASOIRFC3 formed four hydrogen bond interactions, with ALA423, GLY422, ASP477, and GLY296, and two hydrophobic interactions, with ALA503 and ALA437. ASOIRFC3 formed the lowest number of interactions among the lead compounds.
ASOIRFC4, as indicated in Fig. 28.17D, formed seven interactions: two hydrogen bonds, three hydrophobic interactions, one pi-lone pair interaction, and one pi-sulfur interaction.
Camostat, the standard drug used as an inhibitor of TMPRSS2, forms seven hydrogen bonds, one electrostatic interaction, and one hydrophobic interaction, as indicated in Fig. 28.18.
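The statement that interacting residues fall within the predicted pockets can be checked with a simple set comparison. The sketch below does this for ASOIRFC3 only, using the Pocket 1 residues from Table 28.4 and the contacts listed above; the helper name is hypothetical and the check is illustrative rather than part of the original workflow.

```python
# Pocket 1 residues as listed in Table 28.4.
POCKET_1 = set(
    "GLY296 ILE418 SER419 GLY420 GLY422 ALA423 THR424 ASN435 ALA436 "
    "ALA437 ASN470 VAL471 ASP472 SER473 ASP477 CYS502 ALA503".split()
)

# ASOIRFC3 contacts as reported in the text.
asoirfc3_hydrogen_bonds = {"ALA423", "GLY422", "ASP477", "GLY296"}
asoirfc3_hydrophobic = {"ALA503", "ALA437"}


def residues_outside_pocket(contacts, pocket):
    """Return the sorted list of contact residues not found in the pocket."""
    return sorted(contacts - pocket)


contacts = asoirfc3_hydrogen_bonds | asoirfc3_hydrophobic
missing = residues_outside_pocket(contacts, POCKET_1)
print("ASOIRFC3 contacts outside Pocket 1:", missing)
```

An empty list means every reported contact residue belongs to the predicted pocket; the same comparison can be repeated for the other leads and pockets.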
Hydrogen bonds are the prevailing directional intermolecular interactions in biological complexes and the predominant contribution to the specificity of molecular recognition. In drug design, hydrogen bonds are exploited to gain specificity owing to their strict distance and geometric constraints. Furthermore, previous studies indicated that the contribution of a hydrogen bond to the free energy depends on its local environment: a solvent-exposed hydrogen bond contributes significantly less to the net interaction energy than the same hydrogen bond in a buried hydrophobic pocket [51a,51b]. To validate the docking protocol, we redocked one of the lead compounds (ASOINN) into the predicted binding pocket of TMPRSS2. As indicated in Fig. 28.19, the redocked pose overlapped almost completely with the previous orientation, indicating the reliability of the docking protocol and the scores obtained.
Figure 28.19.

Validation of docking: comparison of the redocked binding mode with the original pose of ASOINN, shown with the accompanying residues of the transmembrane protease serine 2 isoform 1 binding pocket.
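The overlap between a docked and a redocked pose is commonly quantified as the root-mean-square deviation (RMSD) between matched heavy atoms. The study reports the overlap qualitatively and gives no RMSD value, so the sketch below is only an illustration: the coordinates are hypothetical placeholders, and the function assumes both poses list the same atoms in the same order within the same receptor frame (no superposition is performed).

```python
import math


def pose_rmsd(coords_a, coords_b):
    """RMSD (in the units of the input, typically angstroms) between two
    poses given as equal-length lists of (x, y, z) tuples in matching order."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    squared_sum = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


# HYPOTHETICAL coordinates for three ligand heavy atoms (not from the study).
docked_pose = [(1.0, 2.0, 3.0), (2.5, 2.0, 3.0), (3.0, 3.5, 3.0)]
redocked_pose = [(1.1, 2.0, 3.0), (2.5, 2.1, 3.0), (3.0, 3.5, 3.1)]

value = pose_rmsd(docked_pose, redocked_pose)
print(f"RMSD = {value:.2f}")
```

A redocking RMSD below about 2.0 Å is a widely used rule of thumb for a reproduced pose, although the appropriate threshold depends on ligand size and flexibility.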
Hydrogen bonds are specific, directional, short-range noncovalent interactions. They form between a hydrogen atom covalently bound to an electronegative atom (usually N, O, or S) and a second electronegative atom. A hydrogen bond is strongest when the atoms involved are aligned, that is, when the H donor points directly at the lone pair of the electron acceptor. Although hydrogen bonds are weaker than covalent or ionic bonds, they contribute to the specificity of molecular recognition [52,53]. In previous works, the heavy atoms in N–H⋯O, N–H⋯N, and O–H⋯O hydrogen bonds were all found to be separated by similar median distances of approximately 3.0 Å. In this study, the bond distances for the majority of the lead complexes are within this median distance of 3.0 Å [58].
Hydrophobic contacts are by far the most common interactions in protein–ligand complexes, accounting for 66,772 contacts at a distance cut-off of 4.0 Å between a carbon and a carbon, halogen, or sulfur atom. The group formed by an aliphatic carbon in the receptor and an aromatic carbon in the ligand is the most populous, accounting for over 42,000 interactions [59]. This suggests that aromatic rings are common in small-molecule inhibitors. In fact, one or more aromatic rings are present in 76% of marketed drugs, with benzene being by far the most commonly encountered ring system [60]. Not surprisingly, leucine side-chains are the most frequently involved in hydrophobic interactions, followed by valine, isoleucine, and alanine side-chains [58].
The number of hydrogen bonds in a drug molecule is limited by the requirements of absorption and permeation. The Lipinski rule of five, for example, states that drug molecules with more than 10 hydrogen bond acceptors or more than 5 hydrogen bond donors are most likely to have poor absorption or permeation properties [62].
A single hydrogen bond, whether formed by intra- or intermolecular interaction, is weaker than an ionic or covalent bond but stronger than a van der Waals interaction, and is thus unable to sustain a drug-receptor interaction alone. The formation of multiple drug-receptor hydrogen bonds, however, confers stability, which is an essential feature of drug-receptor interactions [57].
In this study, the lead compounds, ASOINN and ASOIRFC1–4, are within the accepted range for the number of hydrogen bond acceptors and donors; thus, they may all have good absorption or permeation properties. Among the leads, ASOIRFC1 forms the highest number of hydrogen bonds with the receptor (6, compared with 7 for camostat); thus, ASOIRFC1 may be more stable than ASOINN and ASOIRFC2–4.
Previous studies have shown that synergistic receptor-ligand H-bond pairings potentiate high-affinity binding, which corresponds to an increase in binding affinity [54]. In addition, hydrogen bonding and optimized hydrophobic interactions have both been shown to stabilize ligands at the target site and to help alter binding affinity and drug efficacy [55,56]. The higher binding affinities recorded for ligands ASOINN and ASOIRFC1–4 relative to camostat may be due to the observed increase in hydrophobic and hydrogen bond interactions, in addition to electrostatic and π-interactions.
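The hydrogen bond criteria of the rule of five lend themselves to a simple programmatic check. The sketch below is illustrative only: the descriptor values are hypothetical rather than taken from the lead compounds, the class and function names are invented for this example, and the molecular weight (≤ 500) and logP (≤ 5) thresholds are the remaining standard rule-of-five criteria, which the text above does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (e.g., exported from an ADME tool)."""
    mol_weight: float  # g/mol
    logp: float        # octanol-water partition coefficient
    h_donors: int      # hydrogen bond donors
    h_acceptors: int   # hydrogen bond acceptors


def lipinski_violations(d):
    """Return a list naming each rule-of-five criterion the compound violates."""
    violations = []
    if d.h_donors > 5:
        violations.append("hydrogen bond donors > 5")
    if d.h_acceptors > 10:
        violations.append("hydrogen bond acceptors > 10")
    if d.mol_weight > 500:
        violations.append("molecular weight > 500")
    if d.logp > 5:
        violations.append("logP > 5")
    return violations


# HYPOTHETICAL descriptor values, for illustration only.
compliant = Descriptors(mol_weight=350.4, logp=2.1, h_donors=2, h_acceptors=6)
non_compliant = Descriptors(mol_weight=620.7, logp=5.8, h_donors=7, h_acceptors=12)

print(lipinski_violations(compliant))
print(lipinski_violations(non_compliant))
```

Returning the list of violated criteria, rather than a single pass/fail flag, makes it easy to see which property would need to change during lead optimization.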
3.11. Conclusion
The search for a therapeutic treatment of SARS-CoV-2 infection (coronavirus disease 2019 [COVID-19]) is not only of utmost importance but also time sensitive. Hence, fast methods to screen for plausible therapeutic drugs against different SARS-CoV-2 targets have been employed all over the world. In this study, we selected TMPRSS2 isoform 1 as a therapeutic target for SARS-CoV-2 and used machine learning to develop two models (random forest and NNs) based on 2251 inhibitors of serine proteases downloaded from ChEMBL. These models were used to screen a sample (M: 99,977) of the SCUBIDOO database, which contains 21,000,000 virtual compounds. We thereby identified five possible lead compounds showing good ADMET properties, binding affinity, and molecular interactions with TMPRSS2.
To further improve this work, the five lead compounds can be used to search for similar compounds in the SCUBIDOO database, and the resulting candidates can then be validated experimentally.
References
- 1. Corman V.M., Landt O., Kaiser M., Molenkamp R., Meijer A., Chu D.K., Bleicker T., Brünink S., Schneider J., Schmidt M.L., Mulders D.G. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Euro Surveill. 2020;25(3):2000045. doi: 10.2807/1560-7917.ES.2020.25.3.2000045.
- 2. Fehr A.R., Channappanavar R., Perlman S. Middle East respiratory syndrome: emergence of a pathogenic human coronavirus. Annu. Rev. Med. 2017;68:387–399. doi: 10.1146/annurev-med-051215-031152.
- 3. World Health Organization. (2020). Coronavirus disease (COVID-19).
- 4. Huang C., Wang Y., Li X., Ren L., Zhao J., Hu Y., Zhang L., Fan G., Xu J., Gu X., Cheng Z. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395(10223):497–506. doi: 10.1016/S0140-6736(20)30183-5.
- 5. Wang C., Horby P.W., Hayden F.G., Gao G.F. A novel coronavirus outbreak of global health concern. Lancet. 2020;395(10223):470–473. doi: 10.1016/S0140-6736(20)30185-9.
- 6. Zhu N., Zhang D., Wang W., Li X., Yang B., Song J., Zhao X., Huang B., Shi W., Lu R., Niu P. A novel coronavirus from patients with pneumonia in China, 2019. N. Engl. J. Med. 2020 doi: 10.1056/NEJMoa2001017.
- 7. Zhou P., Yang X.L., Wang X.G., Hu B., Zhang L., Zhang W., Si H.R., Zhu Y., Li B., Huang C.L., Chen H.D. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579(7798):270–273. doi: 10.1038/s41586-020-2012-7.
- 8. Glowacka I., Bertram S., Müller M.A., Allen P., Soilleux E., Pfefferle S., Steffen I., Tsegaye T.S., He Y., Gnirss K., Niemeyer D. Evidence that TMPRSS2 activates the severe acute respiratory syndrome coronavirus spike protein for membrane fusion and reduces viral control by the humoral immune response. J. Virol. 2011;85(9):4122–4134. doi: 10.1128/JVI.02232-10.
- 9. Matsuyama S., Nagata N., Shirato K., Kawase M., Takeda M., Taguchi F. Efficient activation of the severe acute respiratory syndrome coronavirus spike protein by the transmembrane protease TMPRSS2. J. Virol. 2010;84(24):12658–12664. doi: 10.1128/JVI.01542-10.
- 10. Shulla A., Heald-Sargent T., Subramanya G., Zhao J., Perlman S., Gallagher T. A transmembrane serine protease is linked to the severe acute respiratory syndrome coronavirus receptor and activates virus entry. J. Virol. 2011;85(2):873–882. doi: 10.1128/JVI.02062-10.
- 11. Iwata-Yoshikawa N., Okamura T., Shimizu Y., Hasegawa H., Takeda M., Nagata N. TMPRSS2 contributes to virus spread and immunopathology in the airways of murine models after coronavirus infection. J. Virol. 2019;93(6):e01815–e01818. doi: 10.1128/JVI.01815-18.
- 12. Shirato K., Kanou K., Kawase M., Matsuyama S. Clinical isolates of human coronavirus 229E bypass the endosome for cell entry. J. Virol. 2017;91(1):e01387–16. doi: 10.1128/JVI.01387-16.
- 13. Shirato K., Kawase M., Matsuyama S. Wild-type human coronaviruses prefer cell-surface TMPRSS2 to endosomal cathepsins for cell entry. Virology. 2018;517:9–15. doi: 10.1016/j.virol.2017.11.012.
- 14. Zmora P., Moldenhauer A.S., Hofmann-Winkler H., Pöhlmann S. TMPRSS2 isoform 1 activates respiratory viruses and is expressed in viral target cells. PLoS One. 2015;10(9) doi: 10.1371/journal.pone.0138380.
- 15. MOE (The Molecular Operating Environment), software available from Chemical Computing Group Inc., 1010 Sherbrooke Street West, Suite 910, Montreal, Canada H3A 2R7. http://www.chemcomp.com.
- 16. Olson R.S., Moore J.H. Automated Machine Learning. Springer; Cham: 2019. TPOT: a tree-based pipeline optimization tool for automating machine learning; pp. 151–160.
- 17. Chen T., Guestrin C. Proceedings of the 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA. 2015. XGBoost: reliable large-scale tree boosting system; pp. 13–17.
- 18. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
- 19. Abadi M., Barham P., Chen J., Chen Z., Davis A., Dean J., Devin M., Ghemawat S., Irving G., Isard M., Kudlur M. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) 2016. Tensorflow: a system for large-scale machine learning; pp. 265–283.
- 20. Chevillard F., Kolb P. SCUBIDOO: a large yet screenable and easily searchable database of computationally created chemical compounds optimized toward high likelihood of synthetic tractability. J. Chem. Inf. Model. 2015;55(9):1824–1835. doi: 10.1021/acs.jcim.5b00203.
- 21. Biasini M., Bienert S., Waterhouse A., Arnold K., Studer G., Schmidt T., Kiefer F., Cassarino T.G., Bertoni M., Bordoli L., Schwede T. SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res. 2014;42(W1):W252–W258. doi: 10.1093/nar/gku340.
- 22. Bhattacharya D., Nowotny J., Cao R., Cheng J. 3Drefine: an interactive web server for efficient protein structure refinement. Nucleic Acids Res. 2016;44(W1):W406–W409. doi: 10.1093/nar/gkw336.
- 23. Heo L., Park H., Seok C. GalaxyRefine: protein structure refinement driven by side-chain repacking. Nucleic Acids Res. 2013;41(W1):W384–W388. doi: 10.1093/nar/gkt458.
- 24. Berjanskii M., Liang Y., Zhou J., Tang P., Stothard P., Zhou Y., Cruz J., MacDonell C., Lin G., Lu P., Wishart D.S. PROSESS: a protein structure evaluation suite and server. Nucleic Acids Res. 2010;38(Suppl. 2):W633–W640. doi: 10.1093/nar/gkq375.
- 25. Wiederstein M., Sippl M.J. ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res. 2007;35(Suppl. 2):W407–W410. doi: 10.1093/nar/gkm290.
- 26. Berjanskii M., Zhou J., Liang Y., Lin G., Wishart D.S. Resolution-by-proxy: a simple measure for assessing and comparing the overall quality of NMR protein structures. J. Biomol. NMR. 2012;53(3):167–180. doi: 10.1007/s10858-012-9637-2.
- 27. Discovery Studio. Version 3.0. Accelrys Software Inc; San Diego: 2010.
- 28. Garg V.K., Avashthi H., Tiwari A., Jain P.A., Ramkete P.W., Kayastha A.M., Singh V.K. MFPPI–Multi FASTA ProtParam interface. Bioinformation. 2016;12(2):74. doi: 10.6026/97320630012074.
- 29. Jendele L., Krivak R., Skoda P., Novotny M., Hoksza D. PrankWeb: a web server for ligand binding site prediction and visualization. Nucleic Acids Res. 2019;47(W1):W345–W349. doi: 10.1093/nar/gkz424.
- 30. Krivák R., Hoksza D. P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. J. Cheminf. 2018;10(1):39. doi: 10.1186/s13321-018-0285-8.
- 31. Vamathevan J., Clark D., Czodrowski P., Dunham I., Ferran E., Lee G., Li B., Madabhushi A., Shah P., Spitzer M., Zhao S. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 2019;18(6):463–477. doi: 10.1038/s41573-019-0024-5.
- 32. Kotsiantis S.B., Zaharakis I., Pintelas P. Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 2007;160:3–24.
- 33. Zhao Q., Hastie T. Causal interpretations of black-box models. J. Bus. Econ. Stat. 2019;1–0 doi: 10.1080/07350015.2019.1624293.
- 34. Cipollone R., Ascenzi P., Tomao P., Imperi F., Visca P. Enzymatic detoxification of cyanide: clues from Pseudomonas aeruginosa Rhodanese. J. Mol. Microbiol. Biotechnol. 2008;15(2–3):199–211. doi: 10.1159/000121331.
- 35. Carugo O., Djinović-Carugo K. Half a century of Ramachandran plots. Acta Crystallogr. Sect. D Biol. Crystallogr. 2013;69(8):1333–1341. doi: 10.1107/S090744491301158X.
- 36. Zhou H., Yang Y., Shen H.B. Hum-mPLoc 3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of gene ontology and functional domain features. Bioinformatics. 2017;33(6):843–853. doi: 10.1093/bioinformatics/btw723.
- 37. Waterhouse A., Bertoni M., Bienert S., Studer G., Tauriello G., Gumienny R., Heer F.T., de Beer T.A., Rempfer C., Bordoli L., Lepore R. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 2018;46(W1):W296–W303. doi: 10.1093/nar/gky427.
- 38. Colovos C., Yeates T.O. Verification of protein structures: patterns of nonbonded atomic interactions. Protein Sci. 1993;2(9):1511–1519. doi: 10.1002/pro.5560020916.
- 39. Ko J., Park H., Heo L., Seok C. GalaxyWEB server for protein structure prediction and refinement. Nucleic Acids Res. 2012;40(W1):W294–W297. doi: 10.1093/nar/gks493.
- 40. Clark D.E. What has virtual screening ever done for drug discovery? Expert Opin. Drug Discov. 2008;3(8):841–851. doi: 10.1517/17460441.3.8.841.
- 41. Cereto-Massagué A., Ojeda M.J., Valls C., Mulero M., Garcia-Vallvé S., Pujadas G. Molecular fingerprint similarity search in virtual screening. Methods. 2015;71:58–63. doi: 10.1016/j.ymeth.2014.08.005.
- 42. Kumar A., Zhang K.Y. Hierarchical virtual screening approaches in small molecule drug discovery. Methods. 2015;71:26–37. doi: 10.1016/j.ymeth.2014.07.007.
- 43. Lipinski C.A. Poor aqueous solubility—an industry wide problem in drug discovery. Am. Pharm. Rev. 2002;5(3):82–85.
- 44. Ghose A.K., Herbertz T., Hudkins R.L., Dorsey B.D., Mallamo J.P. Knowledge-based, central nervous system (CNS) lead selection and lead optimization for CNS drug discovery. ACS Chem. Neurosci. 2012;3(1):50–68. doi: 10.1021/cn200100h.
- 45. Oprea T.I. Current trends in lead discovery: are we looking for the appropriate properties? Mol. Divers. 2000;5(4):199–208. doi: 10.1023/a:1021368007777.
- 46. Pollastri M.P. Overview on the rule of five. Curr. Protoc. Pharmacol. 2010;49(1):9–12. doi: 10.1002/0471141755.ph0912s49.
- 47. Varma M.V., Khandavilli S., Ashokraj Y., Jain A., Dhanikula A., Sood A., Thomas N.S., Pillai O., Sharma P., Gandhi R., Agrawal S. Biopharmaceutic classification system: a scientific framework for pharmacokinetic optimization in drug research. Curr. Drug Metabol. 2004;5(5):375–388. doi: 10.2174/1389200043335423.
- 48. Egan W.J. Predicting ADME properties in drug discovery. Drug Des. Struct. Ligand-Based Approaches. 2010:165–180.
- 49. Muegge I. Pharmacophore features of potential drugs. Chem. Euro. J. 2002;8(9):1976–1981. doi: 10.1002/1521-3765(20020503)8:9<1976::AID-CHEM1976>3.0.CO;2-K.
- 50. Baell J.B., Ferrins L., Falk H., Nikolakopoulos G. PAINS: relevance to tool compound discovery and fragment-based screening. Aust. J. Chem. 2014;66(12):1483–1494.
- 51. Brenk R., Schipani A., James D., Krasowski A., Gilbert I.H., Frearson J., Wyatt P.G. Lessons learnt from assembling screening libraries for drug discovery for neglected diseases. ChemMedChem. 2008;3(3):435–444. doi: 10.1002/cmdc.200700139.
- 51a. Shoichet B.K. No free energy lunch. Nat. Biotechnol. 2007;25(10):1109–1110.
- 51b. El-Magboub A.S. Computational Models for Drug Design and Delivery (Doctoral dissertation). University of Southern California; 2017.
- 52. Schaeffer L. The Practice of Medicinal Chemistry. Academic Press; 2008. The role of functional groups in drug–receptor interactions; pp. 464–480.
- 53. Nishio M. The CH/π hydrogen bond in chemistry. Conformation, supramolecules, optical resolution and interactions involving carbohydrates. Phys. Chem. Chem. Phys. 2011;13(31):13873–13900. doi: 10.1039/c1cp20404a.
- 54. Chen D., Oezguen N., Urvil P., Ferguson C., Dann S.M., Savidge T.C. Regulation of protein-ligand binding affinity by hydrogen bond pairing. Sci. Adv. 2016;2(3):e1501240. doi: 10.1126/sciadv.1501240.
- 55. Patil R., Das S., Stanley A., Yadav L., Sudhakar A., Varma A.K. Optimized hydrophobic interactions and hydrogen bonding at the target-ligand interface leads the pathways of drug-designing. PLoS One. 2010;5(8) doi: 10.1371/journal.pone.0012029.
- 56. Eldehna W.M., Abou-Seri S.M., El Kerdawy A.M., Ayyad R.R., Hamdy A.M., Ghabbour H.A., Ali M.M., El Ella D.A. Increasing the binding affinity of VEGFR-2 inhibitors by extending their hydrophobic interaction with the active site: design, synthesis and biological evaluation of 1-substituted-4-(4-methoxybenzyl) phthalazine derivatives. Eur. J. Med. Chem. 2016;113:50–62. doi: 10.1016/j.ejmech.2016.02.029.
- 57. Maher T.J., Johnson D.A. vol. 85. 2008. Receptors and drug action. (Foye's Principles of Medicinal Chemistry).
- 58. de Freitas R.F., Schapira M. A systematic analysis of atomic protein–ligand interactions in the PDB. Med. Chem. Comm. 2017;8(10):1970–1981. doi: 10.1039/c7md00381a.
- 59. Ritchie T.J., Macdonald S.J. J. Med. Chem. 2014;57(17):7206–7215. doi: 10.1021/jm500515d.
- 60. Taylor R.D., MacCoss M., Lawson A.D. J. Med. Chem. 2014;57(14):5845–5859. doi: 10.1021/jm4017625.
- 61. Idris M.O., Yekeen A.A., Alakanse O.S., Durojaye O.A. J. Biomol. Struct. Dyn. 2020:1–19. doi: 10.1080/07391102.2020.1792346.
- 62. Daina A., Michielin O., Zoete V. SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci. Rep. 2017;7:42717. doi: 10.1038/srep42717.
