Abstract
Machine learning (ML) techniques have become powerful tools in both industrial and academic settings. Their ability to facilitate analysis of complex data and generation of predictive insights is transforming how scientific problems are approached across a wide range of disciplines. In this tutorial, we present a brief introduction to three widely used ML techniques (logistic regression, random forest, and multilayer perceptron) applied to the analysis of molecular dynamics (MD) trajectory data. We apply our chosen ML models to the study of the SARS-CoV-2 spike protein receptor binding domain interacting with the receptor ACE2. We develop a pipeline for processing MD simulation trajectory data and identifying residues that significantly impact the stability of the complex.


Background
A novel coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has spread rapidly throughout the world since 2019. SARS-CoV-2 is less deadly but more transmissible than SARS-CoV, which appeared in late 2002. The membrane-enveloped coronavirus has a spherical shape with oblong protrusions called spike proteins covering the surface. To initiate infection, the ectodomain of a spike protein must contact its receptor, the human angiotensin-converting enzyme 2 (ACE2). The region of the spike protein that makes contact with the ACE2 receptor is called the receptor binding domain (RBD) and, intuitively, alteration of the structure of this domain impacts ACE2 binding and thus infectivity (Figure 1). Although a large portion of the SARS-CoV spike protein sequence is conserved in SARS-CoV-2, a substantial number of residues (∼50%) within the RBD differ, and these contribute to differences in binding affinity between the two variants. X-ray crystallography and cryo-electron microscopy, in combination with biophysical simulations, allow for rigorous analysis of the differences in interactions between the two SARS variants at the atomistic level. Studies have shown that the SARS-CoV-2 RBD binds preferentially to the ACE2 receptor when compared to SARS-CoV, resulting in a more stable complex. Furthermore, particular mutations of the RBD that stabilize RBD-ACE2 complex interactions will correspond to higher binding affinity, so understanding how changes in sequence stabilize this structure is crucial to understanding changes in infectivity. The stability of the RBD alone can vary dramatically depending on the mutation, with some mutations yielding less stable structures and others, such as the Omicron N501Y mutation, increasing stability.
MD simulations have been used to quantify these differences in interaction strength, which contribute to the infectivity of the virus. Fatouros et al. used computational methods to quantify ACE2 binding affinities despite lacking experimental structures. Additionally, Kumar et al. utilized computational methods to determine which mutations in the Delta and Omicron variants are responsible for their differences in binding. Furthermore, Jena et al. utilized MD simulations to understand how the compounds catechin and curcumin can exhibit antiviral properties via differential binding to the RBD-ACE2 complex. A common theme across these studies is the strategic use of computational methods to analyze complex, multidimensional data and rapidly address current public health crises. Here, we focus on a previous study by Pavlova et al. in which MD simulations were utilized to determine which residues are most important for distinguishing between the SARS-CoV and SARS-CoV-2 variants. Since analyzing each residue’s contribution individually is practically infeasible, they utilized three different ML algorithms to extract which residues contribute the most to the difference in binding affinity between SARS-CoV and SARS-CoV-2 and, thus, the increase in infectivity.
Figure 1. Visualization of the SARS-CoV RBD in complex with the extracellular domain of the ACE2 receptor. Molecular images rendered with PyMOL and VMD.
ML methods provide great utility when parsing very large data sets. In particular, they prove useful when drawing conclusions from methodologies such as MD simulations, which produce gigabytes (or more) of atomistic coordinate data. Numerous groups have taken advantage of this fact and utilized various ML methods to analyze complex trajectory data and create tools that aid others in this process. In particular, we focus on the use of ML methods to perform advanced and rigorous structural classifications using MD data. Following the prior study by Pavlova et al., we provide a Python tutorial showcasing the utility of three different ML models to compare how the SARS-CoV and SARS-CoV-2 RBDs differentially bind to the ACE2 receptor.
Theory
We utilize three different ML methods to determine which structural factors allow for the increased binding affinity of the SARS-CoV-2 RBD to the ACE2 receptor. Generally, ML models are useful tools for finding patterns in very large collections of data. In this case, we apply ML methods to more easily and rigorously quantify the differences in dynamics between the SARS-CoV and SARS-CoV-2 RBDs interacting with ACE2 that would not be clear from visual observation alone. More generally, input vectors, or “features”, are mapped to output values, or “classifications”, and a “training” process occurs whereby weights of input values are adjusted to minimize error in the output classifications. We utilize supervised learning algorithms, where we provide categorical classifications of training data up front. In this case, the classifications correspond to either the SARS-CoV or SARS-CoV-2 RBD. Our process of model training involves splitting our data into training and testing components (training to determine weights and testing to evaluate the accuracy of the results), optimizing parameters by minimizing error on the training data, and then evaluating performance (Figure 2).
Figure 2. Schematic visualizing the pathway of data processing and analysis covered within this tutorial.
MD simulations were previously conducted with NAMD, sampling both the SARS-CoV and SARS-CoV-2 RBDs, and supervised ML algorithms were utilized to determine which residues are most important to the differential binding affinities. Here, we provide explicit instructions and advice regarding the implementation of these supervised ML methods for drawing conclusions from MD trajectories. More specifically, this tutorial focuses on the implementation of logistic regression, random forest, and multilayer perceptron methods to determine residue importance for distinguishing between RBD-ACE2 complexes. While not exhaustive, these three methods represent commonly used approaches for classification problems. We provide brief theory on the fundamentals of each of these three methods in the following sections.
Logistic Regression
To describe logistic regression models, we first introduce linear regression models. Linear regression is a technique whereby a response variable is estimated by a linear function of the corresponding explanatory variables; in other words, if we assume that a response $y$ is linearly related to our descriptor variables $x_1, x_2, \ldots, x_N$, then we can use this relationship to estimate $y$ for any arbitrary series of inputs. Linear regression models, therefore, output an estimate of a given value, $y$, based on an input, $X$, assuming a linear relationship between input and output variables. In this case, we assume that $X = (x_1, x_2, \ldots, x_N)$ can adequately describe $y$ via the following relationship
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_N x_N$$
where $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_N$ are the weights associated with each input.
Importantly, a linear regression model produces a continuous response variable and thus is not useful for our purposes (since we only need a binary classification). However, we can apply a generalized linear model (GLM), which is a type of model utilizing a fundamental logic similar to linear regression models, but expanded to include more general categorical target variables. This is done by relating the linear combination of terms $\beta_i x_i$ to the response through a nonlinear function, typically involving an exponential. More specifically, we will be implementing a logistic regression model, which is an example of a simple GLM.
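Here, let us classify distances associated with the SARS-CoV RBD as “0” and with the SARS-CoV-2 RBD as “1”. For each input vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$, we can define a probability that our input vector corresponds to the SARS-CoV-2 RBD ($y_i = 1$) as
$$\pi_i = P(y_i = 1 \mid X_i)$$
In order to relate $\pi_i$ to a linear combination of the input values, we require a final transformation function to ensure that our outputs will always smoothly produce a number between 0 and 1. Therefore, we finally define the following
$$\ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_N x_{iN}$$
which also takes the forms
$$\frac{\pi_i}{1 - \pi_i} = \exp\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_N x_{iN}\right)$$
and
$$\pi_i = \frac{\exp\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_N x_{iN}\right)}{1 + \exp\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_N x_{iN}\right)}$$
Here, we will train our model to determine which weights ($\beta_0, \ldots, \beta_N$) are optimal for accurately predicting a particular RBD variant given a series of input distances. We will then use these weights to help determine which residues are most important in differentiating between SARS-CoV and SARS-CoV-2 RBDs.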
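As a minimal sketch of the transformation described above, the following NumPy snippet maps a linear combination of features to a probability between 0 and 1 and thresholds it at 0.5. The weights and inputs are made-up illustrative values, not fitted parameters from the tutorial, and the function name predict_probability is hypothetical.

```python
import numpy as np

def predict_probability(x, beta0, beta):
    """Return P(y = 1 | x) for a logistic regression model."""
    # Linear predictor: beta0 + sum_j beta_j * x_j
    eta = beta0 + np.dot(x, beta)
    # The logistic (sigmoid) function maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical weights for three scaled inverse-distance features
beta0 = 0.0
beta = np.array([1.5, -2.0, 0.5])

# Two hypothetical input vectors (one per trajectory frame)
X = np.array([[0.2, -0.1, 1.0],
              [-1.0, 0.8, 0.0]])

probabilities = predict_probability(X, beta0, beta)
# Classify a frame as SARS-CoV-2 ("1") when the probability exceeds 0.5
labels = (probabilities > 0.5).astype(int)
print(probabilities, labels)
```

Features with large positive weights push the prediction toward class “1”, and features with large negative weights push it toward class “0”, which is why the magnitude of each weight can serve as a rough measure of feature importance.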
Random Forest
Decision tree classifiers are simple models where “decisions” are made sequentially in order to optimally divide a given set of data into individual categories (classifications). Random forest classifiers are expansions on decision tree classifiers, whereby each “tree” is generated via the classification and regression tree (CART) algorithm. The CART algorithm implements a recursive division process with the goal of finding the fewest binary divisions required to adequately separate two (or more) target conditions. Put more simply, a random forest model consists of a combination of decision trees, where each tree makes an individual prediction (in this case, predicting whether a given set of residue–residue distances belongs to SARS-CoV or SARS-CoV-2), and the combination of predictions across the multiple decision trees generates a more accurate answer.
Additionally, rather than training every tree on the full data set, bootstrap aggregating (bagging) is often used to reduce variance in the output predictions. Bagging is an ensemble method where subsamples of data are randomly drawn (with replacement) from the original data set and corresponding predictions are generated. Once all predictions have been constructed, they are combined to obtain the final random forest model. A useful consequence of this procedure is that, for each tree, approximately 37% of the data remain unused as “out of bag” (OOB) data. These OOB data can then be used as effective testing data to validate a random forest model. Similar to our logistic regression model, we can use the importance associated with each feature (that is, how strongly each feature influences the trained decision trees) to pinpoint which residues contribute more to the differences in binding between the RBDs of the two viruses.
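The figure of roughly 37% for the OOB data follows from sampling with replacement: the chance that a given sample is never drawn in n draws is (1 - 1/n)^n, which approaches 1/e for large n. The short sketch below checks this numerically for a single bootstrap sample; the number of frames is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2000  # hypothetical number of trajectory frames

# One bootstrap sample: draw n_samples indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Frames that were never drawn are "out of bag" for this tree
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1.0 - in_bag.mean()

# Expected value: (1 - 1/n)^n, which tends to exp(-1) for large n
expected = (1.0 - 1.0 / n_samples) ** n_samples
print(f"observed OOB fraction: {oob_fraction:.3f}, expected: {expected:.3f}")
```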
Multilayer Perceptron
A multilayer perceptron (MLP) is a neural network containing an input layer, one or more hidden layers, and an output layer, connected by corresponding weights. Simply put, a neural network consists of layers of nodes, whereby each node processes information, which is then propagated to nodes in the next layer. The combination of values from the final layer is used to determine the prediction that the network outputs.
Neural networks are highly flexible models because they take in information from a series of inputs and combine the results in a variety of different ways, with connections among nodes that propagate across multiple hidden layers, allowing for an inherently nonlinear response to be produced. More specifically, each neuron is partially defined by a summation function, typically of the form $z = \sum_i w_i x_i + b$ (where $x_i$ are the incoming values, $w_i$ are the connection weights, and $b$ is a bias term), which combines all information obtained from incoming neuronal connections. Each neuron then takes the summed information and produces an output based upon an activation function, which is typically a smooth sigmoidal function.
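To make the summation and activation steps concrete, the following is a minimal NumPy sketch of a forward pass through a small fully connected network. The layer sizes, random weights, and input values are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Smooth sigmoidal activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate an input vector through a fully connected network."""
    activation = x
    for W, b in zip(weights, biases):
        z = activation @ W + b    # summation function for every neuron in the layer
        activation = sigmoid(z)   # activation function applied to the summed input
    return activation

rng = np.random.default_rng(1)
# Hypothetical architecture: 4 inputs -> 3 hidden neurons -> 2 outputs
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
biases = [np.zeros(3), np.zeros(2)]

x = np.array([0.5, -1.2, 0.3, 0.9])
output = forward(x, weights, biases)
print(output)
```

Training adjusts the entries of the weight matrices and bias vectors so that the outputs match the known classifications as closely as possible.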
Prerequisites
To gain the most from this tutorial, prior experience with Python is recommended. We recommend that the user begin with a Python environment containing these packages:
scikit-learn 1.2.2
pandas 1.5.3
NumPy 1.23.5
Jupyter Notebook 6.5.2
Matplotlib 3.6.2
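One possible way to compare an existing environment against the versions listed above is sketched below using only the Python standard library. The distribution names are assumptions about how the packages are registered with pip (for example, Jupyter Notebook is assumed to be installed as "notebook"), and the helper name check_environment is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the prerequisites, keyed by assumed pip distribution name
required = {
    "scikit-learn": "1.2.2",
    "pandas": "1.5.3",
    "numpy": "1.23.5",
    "notebook": "6.5.2",
    "matplotlib": "3.6.2",
}

def check_environment(required):
    """Return {package: (expected version, installed version or None)}."""
    report = {}
    for name, expected in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (expected, installed)
    return report

report = check_environment(required)
for name, (expected, installed) in report.items():
    status = "missing" if installed is None else installed
    print(f"{name}: tutorial used {expected}, found {status}")
```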
Exercises
In this tutorial, we provide two data sets, each containing pairwise distances of nearby residues between the spike-protein RBD and the ACE2 receptor for both SARS-CoV and SARS-CoV-2 (Figure 3); in addition to the processed data sets, the original simulation trajectories are available online at Zenodo (DOI: 10.5281/zenodo.15376189). We also provide residue–residue pairing information corresponding to each index within the distance matrices. The data were collected from two independent runs, each consisting of 2 μs MD simulations, initialized for both SARS-CoV and SARS-CoV-2 RBDs in complex with the ACE2 receptor. In order to reduce computational burden, each data set contains distances collected across only 1000 frames evenly spaced across the corresponding trajectories. We note that increasing the number of processed frames can increase the accuracy of calculated residue importances.
Figure 3. (a) SARS-CoV-2 spike protein (red, green, and yellow) bound to the ACE2 receptor (blue). (b) Closer view of the RBD-ACE2 interaction scaffold. Interacting residues are colored by residue type (red/blue: charged, green: polar, white: hydrophobic). (c) Sequence alignment of the SARS-CoV and SARS-CoV-2 receptor binding motif (RBM) with identical residues highlighted in red. The figure was reproduced from Figure 1 in ref . Copyright 2021 American Chemical Society.
Our data sets require a nontrivial amount of preprocessing in order to effectively train our ML models. This will function as the first exercise.
Exercise: Data Preprocessing
We begin by creating a pandas dataframe containing all of the pairwise distances calculated across simulation time for two independent runs of SARS-CoV and SARS-CoV-2.
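A minimal sketch of this step is shown below. Because the file names and layout of the provided data sets are not reproduced here, random numbers stand in for the distances; the column names, the "variant" and "run" bookkeeping columns, and the tiny frame count are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_frames, n_pairs = 5, 4  # tiny stand-in for the 1000 frames per run

# Hypothetical residue-pair column labels
pair_labels = [f"pair_{i}" for i in range(n_pairs)]

frames = []
for variant in ("SARS-CoV", "SARS-CoV-2"):
    for run in (1, 2):
        # Synthetic distances (in Angstrom) standing in for the provided data
        distances = rng.uniform(3.0, 15.0, size=(n_frames, n_pairs))
        df_run = pd.DataFrame(distances, columns=pair_labels)
        df_run["variant"] = variant
        df_run["run"] = run
        frames.append(df_run)

# One row per trajectory frame, one column per residue pair
df = pd.concat(frames, ignore_index=True)
print(df.shape)
```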
To increase the accuracy of model predictions, it is common practice in ML analytics to implement a standard scaler in the data pipeline (applied to training and test data before being fed into the ML algorithm), which removes the mean and scales each feature to unit variance.
In our case, as residues that are closest in physical space are likely to interact and therefore have a greater impact on the dynamics of the binding interface, it is reasonable to define co-residue importance by an inverse-distance relationship. Therefore, after storing our data, we take the inverse of each data point and scale our entire dataframe, obtaining a coarse measurement which will later be used to define residue “importance” for each residue pairing.
Finally, in order to conduct our supervised ML, we need to add a target variable classification to each of our frames that corresponds to the given RBD variant.
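The inversion, scaling, and labeling steps described above can be sketched as follows. Synthetic distances again stand in for the real data, and the variable names are hypothetical; StandardScaler is the scikit-learn class that removes the mean and scales each feature to unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_frames, n_pairs = 50, 4
columns = [f"pair_{i}" for i in range(n_pairs)]

# Synthetic distances (Angstrom) standing in for the two variants
dist_cov = pd.DataFrame(rng.uniform(3.0, 15.0, (n_frames, n_pairs)), columns=columns)
dist_cov2 = pd.DataFrame(rng.uniform(3.0, 12.0, (n_frames, n_pairs)), columns=columns)
distances = pd.concat([dist_cov, dist_cov2], ignore_index=True)

# Inverse distance: closer residue pairs receive larger values
inverse = 1.0 / distances

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
features = pd.DataFrame(scaler.fit_transform(inverse), columns=columns)

# Target variable: 0 for SARS-CoV frames, 1 for SARS-CoV-2 frames
target = np.concatenate([np.zeros(n_frames, dtype=int),
                         np.ones(n_frames, dtype=int)])
print(features.describe().loc[["mean", "std"]])
```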
Before we train our models, we want to remove any features that share a very high correlation, since correlated features are redundant and greatly increase the computational burden of our analysis. We generate a correlation matrix corresponding to the original dataframe and plot the initial correlation matrix to visually observe the areas of high correlation (Figure 4).
Figure 4. Correlation matrices before and after dropping highly correlated data.
We then shuffle the correlation matrix and drop any features that fall above the correlation threshold.
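One way to implement the shuffle-and-drop step is sketched below: features are visited in a random order, and a feature is kept only if it is not too strongly correlated with any feature already kept. The function name drop_correlated, the threshold of 0.9, and the synthetic data are assumptions for illustration, not the tutorial's actual values.

```python
import numpy as np
import pandas as pd

def drop_correlated(features, threshold=0.9, seed=0):
    """Keep a subset of columns whose pairwise |correlation| is <= threshold."""
    rng = np.random.default_rng(seed)
    # Visit columns in a random order so that a different member of each
    # correlated group can survive for each seed
    order = rng.permutation(len(features.columns))
    shuffled = [features.columns[i] for i in order]
    corr = features.corr().abs()
    kept = []
    for col in shuffled:
        if all(corr.loc[col, other] <= threshold for other in kept):
            kept.append(col)
    # Return the surviving columns in their original order
    return features[[col for col in features.columns if col in kept]]

rng = np.random.default_rng(42)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "pair_a": a,
    "pair_b": a + 0.01 * rng.normal(size=200),  # nearly identical to pair_a
    "pair_c": rng.normal(size=200),             # independent feature
})
reduced = drop_correlated(demo, threshold=0.9, seed=0)
print(list(reduced.columns))
```

Changing the seed changes which member of a correlated pair is retained, which is why repeating the analysis over several reshufflings gives a more stable picture of residue importance.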
We plot the resulting correlation matrix to visually inspect the results and to showcase the amount of data that has been removed by the dropping process (Figure 4). Importantly, multiple runs of the ML training should be conducted, where each run corresponds to a different reshuffling of the correlation matrix. This allows different combinations of correlated residue pairings to be removed, giving a much more stable end result. We provide exercises for the user to conduct only one run, but we strongly recommend implementing multiple correlation reshufflings in practice.
Finally, we initialize an instance of each of the logistic regression, random forest, and MLP models using a series of defined parameters. We note that the ML parameters chosen for this tutorial were previously optimized by performing a sweep across numerous value ranges and choosing those that gave the best performance. We leave the user with the additional exercise of determining optimal parameter choices if they so desire.
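A sketch of the initialization with scikit-learn is given below. The dictionary names follow the RF_tuned_params/LR_tuned_params/MLP_tuned_params naming used later in this tutorial, but the values shown here are placeholders chosen for illustration and are not the optimized parameters from the accompanying notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Placeholder hyperparameters (assumed values, not the tutorial's tuned ones)
LR_tuned_params = {"C": 1.0, "max_iter": 1000}
RF_tuned_params = {"n_estimators": 100, "max_depth": None, "random_state": 0}
MLP_tuned_params = {"hidden_layer_sizes": (32,), "activation": "relu",
                    "max_iter": 500, "random_state": 0}

models = {
    "LR": LogisticRegression(**LR_tuned_params),
    "RF": RandomForestClassifier(**RF_tuned_params),
    "MLP": MLPClassifier(**MLP_tuned_params),
}

for name, model in models.items():
    print(name, type(model).__name__)
```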
We recommend generating separate blank dataframes corresponding to the RBD and the ACE2 receptor residues that will be used to store summed residue importances.
Exercise: Calculating Residue Importance
Our dataframe contains information regarding the inverse distances associated with each residue pairing. We will train our ML models to determine which residue pairs are most important in differentiating between SARS-CoV and SARS-CoV-2. These importances will be assigned to each residue pair and calculated from the coefficients of our trained LR model, the feature importances of our RF model, and the relevances from a layer-wise relevance propagation method (explained in more detail below) for our MLP model. In this way, we treat each residue pair as a feature and can then quantify how “important” each individual residue is based on the importances of the corresponding residue pairs. In order to do this, we provide an exercise to generate a simple function that sums all of the inverse distances associated with each residue across the entire dataframe. We also provide a different method of determining per-residue importance factors, which could potentially (but not necessarily) alter the conclusions from each model. This function will be utilized in later components of the tutorial.
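The idea of accumulating pair-level values onto individual residues can be sketched as follows. This is an illustrative helper, not the tutorial's own function: the name sum_importance_per_residue, the use of absolute values, and the example residue pairs and numbers are all assumptions.

```python
import pandas as pd

def sum_importance_per_residue(pair_importance, pair_to_residues):
    """Accumulate pair importances onto the residues that form each pair."""
    rbd, ace2 = {}, {}
    for pair, importance in pair_importance.items():
        rbd_res, ace2_res = pair_to_residues[pair]
        # Use the magnitude so positive and negative weights both count
        rbd[rbd_res] = rbd.get(rbd_res, 0.0) + abs(importance)
        ace2[ace2_res] = ace2.get(ace2_res, 0.0) + abs(importance)
    return (pd.Series(rbd, name="importance"),
            pd.Series(ace2, name="importance"))

# Hypothetical mapping of feature index -> (RBD residue, ACE2 residue)
pair_to_residues = {0: ("N501", "Y41"), 1: ("N501", "K353"), 2: ("K417", "D30")}
# Hypothetical pair importances (for example, model coefficients)
pair_importance = {0: 0.4, 1: -0.1, 2: 0.25}

rbd_importance, ace2_importance = sum_importance_per_residue(
    pair_importance, pair_to_residues)
print(rbd_importance.sort_values(ascending=False))
```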
Exercise: Developing a Trained Logistic Regression Model
Generally, it is standard practice to split any data set into “training” and “testing” inputs such that the performance of the model can be easily evaluated after sufficient training. The percentage of data used for the training set can vary, but we suggest using an 80/20 training/testing split. This exercise splits the data into a training and testing component and fits a logistic regression model to the training data. We point out that one should loop through multiple random states so that the results do not depend on one particular split.
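A minimal sketch of the split-and-fit procedure with scikit-learn is shown below, looping over a few random states. The synthetic data from make_classification stand in for the scaled inverse distances, and the number of random states and model parameters are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (frames x residue-pair) feature matrix
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

accuracies = []
for random_state in range(3):
    # 80/20 training/testing split, stratified to keep the classes balanced
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    # Accuracy on data the model has not seen during training
    accuracies.append(model.score(X_test, y_test))

print(accuracies)
# One coefficient per feature; these are the pair-level weights
print(model.coef_.shape)
```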
Exercise: Evaluating Logistic Regression Model Performance
Evaluating the accuracy of a generated model is an important and useful procedure. We recommend implementing a variety of tests in order to evaluate each ML model. Once the model reaches sufficient accuracy (up to 100% as targeted here), we can move forward with our study.
We note that when training our models, we were able to achieve a sufficiently high accuracy in all evaluation metrics. However, there are cases in which that is not possible. In these situations, it may be necessary to prioritize one score over another.
“Accuracy” is defined as the ratio of all true positives and true negatives to the total number of samples. This metric gives a simple indication of how many correct classifications were made, but it can easily be skewed by uneven sizes in data sets. For our purposes, this is not a problem because we ensured that the numbers of trajectory frames corresponding to SARS-CoV and SARS-CoV-2 were equal. “Recall” describes the ratio of true positive classifications to all samples which should have been classified as positive. “Precision” is defined as the number of correctly assigned samples for a particular category (positive or negative) divided by the total number of samples assigned to that category. This metric is very useful if incorrectly classified samples within a particular class (positive or negative) are more detrimental to the output of the model. The F1 score is the harmonic mean of precision and recall, and so it is a convenient metric to use if extreme values of precision or recall need to be taken into consideration to properly evaluate a model of interest. Finally, the “ROC_AUC” (receiver operating characteristic curve/area under the curve) metric takes into account how effectively a model can distinguish between both positive and negative cases, which is useful in our case as we care equally about both the SARS-CoV and SARS-CoV-2 classifications.
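The metrics above are all available in scikit-learn. The sketch below evaluates them on a small hand-made example (eight hypothetical frames with three true positives, three true negatives, one false positive, and one false negative) so that each value can be checked by hand.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical true labels, predicted labels, and predicted probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.4]

scores = {
    # (TP + TN) / total = (3 + 3) / 8
    "accuracy": accuracy_score(y_true, y_pred),
    # TP / (TP + FP) = 3 / 4
    "precision": precision_score(y_true, y_pred),
    # TP / (TP + FN) = 3 / 4
    "recall": recall_score(y_true, y_pred),
    # Harmonic mean of precision and recall
    "f1": f1_score(y_true, y_pred),
    # Fraction of (positive, negative) pairs ranked correctly = 15 / 16
    "roc_auc": roc_auc_score(y_true, y_score),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note that the ROC AUC is computed from the predicted probabilities rather than the thresholded labels, so it reflects how well the model ranks the two classes.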
Plotting any of the performance metrics will show that our model has been well parametrized. We encourage the reader to adjust initialized parameters to see how sensitive the models are to changes in parametrization.
Exercise: Determining Logistic Regression Per-Residue Importance
Here, we sort the pairwise residue importances by CoV, CoV-2, and ACE2 before adding these values to our summary dataframes.
Exercise: Developing and Evaluating a Trained Random Forest Model
We provide an exercise identical to the one above but utilizing the random forest model instead.
Exercise: Determining Random Forest Per-Residue Importance
As before, we will extract the importance of each residue as determined by the random forest model.
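In scikit-learn, a fitted random forest exposes per-feature importances through its feature_importances_ attribute, and the OOB score can be requested with oob_score=True. The sketch below uses synthetic data and arbitrary parameters as stand-ins for the tutorial's data and tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the (frames x residue-pair) feature matrix
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1 across features
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first

print("OOB accuracy:", forest.oob_score_)
print("feature ranking:", ranking)
```

Each entry of importances corresponds to one residue pair, so these values can be summed per residue in the same way as the logistic regression coefficients.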
Exercise: Developing a Trained Multilayer Perceptron Model
Finally, we showcase the utilization of a neural network MLP model. Training this model follows a very similar protocol to that of the other two models but with some key differences.
First, we transform our classification array using one-hot encoding. One-hot encoding is a common practice that creates a binary representation associated with each classification and can increase the effectiveness of an MLP model. For our two classes, each label is replaced by a two-element binary vector with a single nonzero entry.
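The encoding described above can be sketched in a few lines of NumPy. The label array is a made-up example; which column corresponds to which variant simply follows the 0/1 labeling convention used earlier.

```python
import numpy as np

# Hypothetical class labels: 0 = SARS-CoV, 1 = SARS-CoV-2
labels = np.array([0, 1, 1, 0])

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(2, dtype=int)[labels]
print(one_hot)

# The original labels can be recovered from the position of the nonzero entry
recovered = one_hot.argmax(axis=1)
```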
Once our model is trained, we cannot directly access residue pair importances like we were able to with our logistic regression and random forest models. Instead, to quantify the importances of each of our residue pairs, we need to utilize layer-wise relevance propagation, a technique to determine feature importance based on back-propagation of neural signals across layers of the network. With these differences in mind, we can proceed with training our MLP model.
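To illustrate the idea, the following is a minimal NumPy sketch of one common variant of layer-wise relevance propagation, the epsilon rule, for a network with ReLU hidden layers and a linear output score. It is not necessarily the variant used in the accompanying notebook; the function name lrp_epsilon and the random weights are assumptions. For a trained scikit-learn MLPClassifier, the weight matrices and bias vectors are available as the coefs_ and intercepts_ attributes.

```python
import numpy as np

def lrp_epsilon(x, weights, biases, target, eps=1e-6):
    """Epsilon-rule LRP for a ReLU network with a linear output layer."""
    # Forward pass, storing the activation that enters each layer
    activations = [x]
    for layer, (W, b) in enumerate(zip(weights, biases)):
        z = activations[-1] @ W + b
        is_output = layer == len(weights) - 1
        activations.append(z if is_output else np.maximum(z, 0.0))

    # Start from the output score of the class being explained
    relevance = np.zeros_like(activations[-1])
    relevance[target] = activations[-1][target]

    # Backward pass: redistribute relevance in proportion to each input's
    # contribution a_j * w_jk to the pre-activation z_k of the next layer
    for W, b, a in zip(weights[::-1], biases[::-1], activations[-2::-1]):
        z = a @ W + b
        z = z + eps * np.where(z >= 0, 1.0, -1.0)  # avoid division by zero
        s = relevance / z
        relevance = a * (W @ s)
    return relevance

rng = np.random.default_rng(0)
# Hypothetical network: 5 input features -> 4 hidden neurons -> 2 outputs
weights = [rng.normal(size=(5, 4)), rng.normal(size=(4, 2))]
biases = [np.zeros(4), np.zeros(2)]
x = rng.normal(size=5)

relevance = lrp_epsilon(x, weights, biases, target=1)

# With zero biases, the input relevances sum (approximately) to the output score
hidden = np.maximum(x @ weights[0], 0.0)
output_score = (hidden @ weights[1])[1]
print(relevance, relevance.sum(), output_score)
```

The resulting relevance vector has one entry per input feature (residue pair), and its magnitude can be used as a pair-level importance in the same way as the coefficients and feature importances of the other two models.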
Exercise: Evaluating MLP Model Performance
We will evaluate the accuracy of our MLP model in the same manner as we did for our logistic regression and random forest models.
Once again, plotting the resulting accuracies will demonstrate the validity of our chosen initialization parameters.
Exercise: Determining MLP Per-Residue Importance
Using layer-wise relevance propagation, we will obtain our residue importances calculated from the MLP model and add them to the corresponding dataframes.
Exercise: Plotting and Interpreting Results
Finally, we plot and interpret which residues are most important to differentiating the binding between SARS-CoV and SARS-CoV-2 RBDs to ACE2 as determined via our three ML models (Figure 5). Once highly important residues are determined, their importances can be projected onto the protein structure to better visualize locations of greatest importance.
Figure 5. Visualization of calculated residue importance via three different ML models as described in this tutorial utilizing our truncated data set. Importances are plotted and projected onto the SARS-CoV-2 structure; residues colored darker blue correspond to having larger calculated importances.
Different ML methods (RF/LR/MLP) will generally not produce identical results because of the differing mathematical construction of the models. Because of this, we examine the areas of greatest overlap and individually high performance to validate the success of our approach. We note that although this tutorial uses a truncated data set, our method has successfully pinpointed important residues consistent with other studies. For instance, Chen et al. have previously used ML methods and determined that mutations at the N439 (identified by RF/LR/MLP methods), T500 (identified by MLP), N501 (identified by RF/LR/MLP), and Y505 (identified by LR/MLP) positions are associated with large changes in binding free energies between the RBD and ACE2 receptor and, thus, potentially differences in infectivity. Additionally, da Costa et al. previously discussed that notable mutations at positions N440 (identified by RF/LR/MLP), Q493 (identified by RF/LR/MLP), and Q498 (identified by RF/MLP) are associated with greater infectivity via strengthening binding affinity.
We further note that regardless of the ML method used, the K417, N439, and N501 positions are deemed to be important for differentiating binding (Figure 5). In particular, mutation of N501 was initially associated with the Alpha variant and mutation of K417 was associated with the Beta variant of SARS-CoV-2. Cheng et al. note that the N501Y mutation increases affinity to the ACE2 receptor via increased amino-aromatic and aromatic–aromatic interactions, and the K417N/T mutations decrease affinity, marking these two residue positions as essential in regulating the global dynamics of the RBD-ACE2 interactions. Additionally, Geng et al. have elucidated the role of the K417 residue in modifying the structural conformation of the SARS-CoV-2 RBD. Thomson et al. also found the N439 position to play a major role in determining the binding affinity of the RBD to ACE2, pointing out that the prevalent N439K mutation increases binding affinity for the ACE2 receptor. In other words, numerous studies have confirmed the impact of the highly important residues identified by this study, supporting the implementation of our ML-based classification methods.
In cases where significant overlap of importances between models is not observed, we recommend investigating the individual model accuracies to confirm they are performing within acceptable margins. The model accuracies can be improved by increasing the scope and reliability of training data (for instance, by increasing the MD trajectory resolution). Further improvements can be made by tuning the hyperparameters for each model (RF_tuned_params/LR_tuned_params/MLP_tuned_params in this tutorial). General methods such as grid searching, random searching, and Bayesian optimization can be used to identify which parameters produce higher model accuracies. If the computational resources can be spared, automating the parameter search is recommended, as optimized parameters will yield the most consistent results.
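As a sketch of an automated grid search, scikit-learn's GridSearchCV can evaluate every combination in a parameter grid with cross-validation. The grid values, the synthetic data, and the choice of three folds below are illustrative assumptions; a realistic search would cover wider ranges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the (frames x residue-pair) feature matrix
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative grid of random forest hyperparameters
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X, y)

# Best combination found, in the form expected by RandomForestClassifier(**...)
RF_tuned_params = search.best_params_
print(RF_tuned_params, search.best_score_)
```

Random search (RandomizedSearchCV in scikit-learn) follows the same pattern and is often cheaper when the grid is large.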
Conclusions
This tutorial demonstrates how to apply ML models to MD simulation data. We identified the key differences in the RBDs of SARS-CoV and SARS-CoV-2 that influence binding affinity. We note that this same method could be utilized to determine differences in binding interactions among SARS-CoV-2 variants, which could have implications for potential therapeutic strategies targeting the spike protein. However, comparing highly similar variants (such as the two-residue mutations T22N and F59S between KP.3 and XEC) will only yield meaningful residue importances if these mutations affect the RBD-ACE2 binding affinity. The developed technique is not limited to exploration of SARS-CoV and SARS-CoV-2 and can be applied generally to protein-receptor binding sites, such as FAB-epitope binding, as long as the binding region is well characterized. Our implementation of logistic regression, random forest, and MLP models provides a brief survey of different ML methods, which can be generalized to a variety of classification problems. This article serves as a practical guide that can be used to conduct an ML-based analysis to further biological understanding.
Supplementary Material
Acknowledgments
The authors thank the Quantitative Biosciences program at Georgia Tech, notably the 2022 cohort including Sayantan Datta, Jiyeon Maeng, Maryam Hejri, Akash Arani, Hayley Hassler, and Raymond Copeland. This work was modified from instructional material presented at the 2023 Quantitative Biosciences workshop funded by the Burroughs Wellcome Fund and National Science Foundation TRIPODS+X:EDU. A.-L.R.B. was supported by the InQuBATE training program (National Institutes of Health T32GM142616), and K.M.K. was supported by an administrative supplement to the InQuBATE program.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jpcb.4c08824.
The full unabridged Python analysis can be found alongside this publication as a Jupyter Notebook (ZIP).
The authors declare no competing financial interest.
References
- Hu B., Guo H., Zhou P., Shi Z.-L.. Characteristics of SARS-CoV-2 and COVID-19. Nat. Rev. Microbiol. 2021;19:141–154. doi: 10.1038/s41579-020-00459-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xu J., Zhao S., Teng T., Abdalla A. E., Zhu W., Xie L., Wang Y., Guo X.. Systematic Comparison of Two Animal-to-Human Transmitted Human Coronaviruses: SARS-CoV-2 and SARS-CoV. Viruses. 2020;12:244. doi: 10.3390/v12020244. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hoffmann M., Kleine-Weber H., Schroeder S., Krüger N., Herrler T., Erichsen S., Schiergens T. S., Herrler G., Wu N.-H., Nitsche A., Müller M. A.. SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor. Cell. 2020;181:271–280.e8. doi: 10.1016/j.cell.2020.02.052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shang J., Wan Y., Luo C., Ye G., Geng Q., Auerbach A., Li F.. Cell entry mechanisms of SARS-CoV-2. Proc. Natl. Acad. Sci. U. S. A. 2020;117:11727–11734. doi: 10.1073/pnas.2003138117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li W., Zhang C., Sui J., Kuhn J. H., Moore M. J., Luo S., Wong S., Huang I., Xu K., Vasilieva N.. Receptor and viral determinants of SARS-coronavirus adaptation to human ACE2. Embo J. 2005;24:1634–1643. doi: 10.1038/sj.emboj.7600640. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yan R., Zhang Y., Li Y., Xia L., Guo Y., Zhou Q.. Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2. Science. 2020;367:1444–1448. doi: 10.1126/science.abb2762. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lan J., Ge J., Yu J., Shan S., Zhou H., Fan S., Zhang Q., Shi X., Wang Q., Zhang L. Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor. Nature. 2020;581:215–220. doi: 10.1038/s41586-020-2180-5.
- Wang Q., Zhang Y., Wu L., Niu S., Song C., Zhang Z., Lu G., Qiao C., Hu Y., Yuen K.-Y., Wang Q. Structural and Functional Basis of SARS-CoV-2 Entry by Using Human ACE2. Cell. 2020;181:894–904.e9. doi: 10.1016/j.cell.2020.03.045.
- Roy U. Comparative structural analyses of selected spike protein-RBD mutations in SARS-CoV-2 lineages. Immunol. Res. 2022;70:143–151. doi: 10.1007/s12026-021-09250-z.
- Ghorbani M., Brooks B. R., Klauda J. B. Critical Sequence Hotspots for Binding of Novel Coronavirus to Angiotensin Converter Enzyme as Evaluated by Molecular Simulations. J. Phys. Chem. B. 2020;124:10034–10047. doi: 10.1021/acs.jpcb.0c05994.
- Wang Y., Liu M., Gao J. Enhanced receptor binding of SARS-CoV-2 through networks of hydrogen-bonding and hydrophobic interactions. Proc. Natl. Acad. Sci. U. S. A. 2020;117:13967–13974. doi: 10.1073/pnas.2008209117.
- Nguyen H. L., Lan P. D., Thai N. Q., Nissley D. A., O’Brien E. P., Li M. S. Does SARS-CoV-2 Bind to Human ACE2 More Strongly Than Does SARS-CoV? J. Phys. Chem. B. 2020;124:7336–7347. doi: 10.1021/acs.jpcb.0c04511.
- Chen J., Wang R., Wang M., Wei G.-W. Mutations Strengthened SARS-CoV-2 Infectivity. J. Mol. Biol. 2020;432:5212–5226. doi: 10.1016/j.jmb.2020.07.009.
- Kumar S., Thambiraja T. S., Karuppanan K., Subramaniam G. Omicron and Delta variant of SARS-CoV-2: A comparative computational study of spike protein. J. Med. Virol. 2022;94:1641–1649. doi: 10.1002/jmv.27526.
- Pavlova A., Zhang Z., Acharya A., Lynch D. L., Pang Y. T., Mou Z., Parks J. M., Chipot C., Gumbart J. C. Machine Learning Reveals the Critical Interactions for SARS-CoV-2 Spike Protein Binding to ACE2. J. Phys. Chem. Lett. 2021;12:5494–5502. doi: 10.1021/acs.jpclett.1c01494.
- Sang P., Chen Y.-Q., Liu M.-T., Wang Y.-T., Yue T., Li Y., Yin Y.-R., Yang L.-Q. Electrostatic Interactions Are the Primary Determinant of the Binding Affinity of SARS-CoV-2 Spike RBD to ACE2: A Computational Case Study of Omicron Variants. Int. J. Mol. Sci. 2022;23:14796. doi: 10.3390/ijms232314796.
- Polydorides S., Archontis G. Computational optimization of the SARS-CoV-2 receptor-binding-motif affinity for human ACE2. Biophys. J. 2021;120:2859–2871. doi: 10.1016/j.bpj.2021.02.049.
- Fatouros P., Roy U., Sur S. Modeling Substrate Coordination to Zn-Bound Angiotensin Converting Enzyme 2. Int. J. Pept. Res. Ther. 2022;28(2):65. doi: 10.1007/s10989-022-10373-6.
- Jena A., Kanungo N., Nayak V., Chainy G. B. N., Dandapat J. Catechin and curcumin interact with S protein of SARS-CoV2 and ACE2 of human cell membrane: insights from computational studies. Sci. Rep. 2021;11(1):2043. doi: 10.1038/s41598-021-81462-7.
- Schrödinger LLC. The PyMOL Molecular Graphics System, Version 3.0. http://www.pymol.org/pymol.
- Humphrey W., Dalke A., Schulten K. VMD – Visual Molecular Dynamics. J. Mol. Graphics. 1996;14:33–38. doi: 10.1016/0263-7855(96)00018-5.
- Fleetwood O., Kasimova M. A., Westerlund A. M., Delemotte L. Molecular Insights from Conformational Ensembles via Machine Learning. Biophys. J. 2020;118:765–780. doi: 10.1016/j.bpj.2019.12.016.
- Wang Y., Lamim Ribeiro J. M., Tiwary P. Machine learning approaches for analyzing and enhancing molecular dynamics simulations. Curr. Opin. Struct. Biol. 2020;61:139–145. doi: 10.1016/j.sbi.2019.12.016.
- Mitrovic D., McComas S. E., Alleva C., Bonaccorsi M., Drew D., Delemotte L. Reconstructing the transport cycle in the sugar porter superfamily using coevolution-powered machine learning. eLife. 2023;12:e84805. doi: 10.7554/eLife.84805.
- Ramírez-Palacios C., Marrink S. J. Computational prediction of ω-transaminase selectivity by deep learning analysis of molecular dynamics trajectories. QRB Discovery. 2023;4:e1. doi: 10.1017/qrd.2022.22.
- Plante A., Shore D. M., Morra G., Khelashvili G., Weinstein H. A Machine Learning Approach for the Discovery of Ligand-Specific Functional Mechanisms of GPCRs. Molecules. 2019;24:2097. doi: 10.3390/molecules24112097.
- Noé F., Tkatchenko A., Müller K.-R., Clementi C. Machine Learning for Molecular Simulation. Annu. Rev. Phys. Chem. 2020;71:361–390. doi: 10.1146/annurev-physchem-042018-052331.
- Jin W., Pei J., Xie P., Chen J., Zhao H. Machine Learning-Based Prediction of Mechanical Properties and Performance of Nickel–Graphene Nanocomposites Using Molecular Dynamics Simulation Data. ACS Appl. Nano Mater. 2023;6:12190–12199. doi: 10.1021/acsanm.3c01919.
- Marchetti F., Moroni E., Pandini A., Colombo G. Machine Learning Prediction of Allosteric Drug Activity from Molecular Dynamics. J. Phys. Chem. Lett. 2021;12:3724–3732. doi: 10.1021/acs.jpclett.1c00045.
- Karamzadeh R., Karimi-Jafari M. H., Sharifi-Zarchi A., Chitsaz H., Salekdeh G. H., Moosavi-Movahedi A. A. Machine Learning and Network Analysis of Molecular Dynamics Trajectories Reveal Two Chains of Red/Ox-specific Residue Interactions in Human Protein Disulfide Isomerase. Sci. Rep. 2017;7:3666. doi: 10.1038/s41598-017-03966-5.
- Wang X., Hong Y., Wang M., Xin G., Yue Y., Zhang J. Mechanical properties of molybdenum diselenide revealed by molecular dynamics simulation and support vector machine. Phys. Chem. Chem. Phys. 2019;21:9159–9167. doi: 10.1039/C8CP07881E.
- Geisel D., Lenz P. Machine learning classification of trajectories from molecular dynamics simulations of chromosome segregation. PLoS One. 2022;17:e0262177. doi: 10.1371/journal.pone.0262177.
- Davies J. G., Menzies G. E., Fraternali F. Utilizing biological experimental data and molecular dynamics for the classification of mutational hotspots through machine learning. Bioinform. Adv. 2024;4(1):vbae125. doi: 10.1093/bioadv/vbae125.
- Xiao J., Melvin R. L., Salsbury F. R. Probing light chain mutation effects on thrombin via molecular dynamics simulations and machine learning. J. Biomol. Struct. Dyn. 2019;37:982–999. doi: 10.1080/07391102.2018.1445032.
- Phillips J. C., Hardy D. J., Maia J. D. C., Stone J. E., Ribeiro J. V., Bernardi R. C., Buch R., Fiorin G., Hénin J., Jiang W., McGreevy R. Scalable molecular dynamics on CPU and GPU architectures with NAMD. J. Chem. Phys. 2020;153:044130. doi: 10.1063/5.0014475.
- Marill K. A. Advanced Statistics: Linear Regression, Part I: Multiple Linear Regression. Acad. Emerg. Med. 2004;11:94–102. doi: 10.1197/j.aem.2003.09.006.
- Yan X., Su X. G. Linear Regression Analysis: Theory and Computing. World Scientific Publishing Co., Inc.: USA, 2009.
- Goldstein B. A., Polley E. C., Briggs F. B. S. Random Forests for Genetic Association Studies. Stat. Appl. Genet. Mol. Biol. 2011;10:32. doi: 10.2202/1544-6115.1691.
- Faris H., Aljarah I., Mirjalili S. Training feedforward neural networks using multi-verse optimizer for binary classification problems. Appl. Intell. 2016;45:322–332. doi: 10.1007/s10489-016-0767-1.
- Dobbin K. K., Simon R. M. Optimally splitting cases for training and testing high dimensional classifiers. BMC Med. Genomics. 2011;4(1):31. doi: 10.1186/1755-8794-4-31.
- Hicks S., Strümke I., Thambawita V., Hammou M., Riegler M., Halvorsen P., Parasa S. On evaluation metrics for medical applications of artificial intelligence. Sci. Rep. 2022;12(1):5979. doi: 10.1038/s41598-022-09954-8.
- Cooper C. J., Krishnamoorthy G., Wolloscheck D., Walker J. K., Rybenkov V. V., Parks J. M., Zgurskaya H. I. Molecular Properties That Define the Activities of Antibiotics in Escherichia coli and Pseudomonas aeruginosa. ACS Infect. Dis. 2018;4:1223–1234. doi: 10.1021/acsinfecdis.8b00036.
- Ward M. D., Zimmerman M. I., Meller A., Chung M., Swamidass S. J., Bowman G. R. Deep learning the structural determinants of protein biochemical properties by comparing structural ensembles with DiffNets. Nat. Commun. 2021;12(1):3023. doi: 10.1038/s41467-021-23246-1.
- Kouba P., Kohout P., Haddadi F., Bushuiev A., Samusevich R., Sedlar J., Damborsky J., Pluskal T., Sivic J., Mazurenko S. Machine Learning-Guided Protein Engineering. ACS Catal. 2023;13:13863–13895. doi: 10.1021/acscatal.3c02743.
- Samek W., Müller K.-R. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Samek W., Montavon G., Vedaldi A., Hansen L. K., Müller K.-R., Eds.; Springer International Publishing: Cham, 2019; pp. 193–209.
- da Costa C., de Freitas C., Alves C., Lameira J. Assessment of mutations on RBD in the Spike protein of SARS-CoV-2 Alpha, Delta and Omicron variants. Sci. Rep. 2022;12(1):8540. doi: 10.1038/s41598-022-12479-9.
- Frampton D., Rampling T., Cross A., Bailey H., Heaney J., Byott M., Scott R., Sconza R., Price J., Margaritis M. Genomic characteristics and clinical effect of the emergent SARS-CoV-2 B.1.1.7 lineage in London, UK: a whole-genome sequencing and hospital-based cohort study. Lancet Infect. Dis. 2021;21:1246–1256. doi: 10.1016/S1473-3099(21)00170-5.
- Cox M., Peacock T., Harvey W., Hughes J., Wright D., Willett B., Thomson E., Gupta R., Peacock S., Robertson D. L., Carabelli A. M. COVID-19 Genomics UK (COG-UK) Consortium. SARS-CoV-2 variant evasion of monoclonal antibodies based on in vitro studies. Nat. Rev. Microbiol. 2023;21:112–124. doi: 10.1038/s41579-022-00809-7.
- Cheng M., Krieger J., Banerjee A., Xiang Y., Kaynak B., Shi Y., Arditi M., Bahar I. Impact of new variants on SARS-CoV-2 infectivity and neutralization: A molecular assessment of the alterations in the spike-host protein interactions. iScience. 2022;25:103939. doi: 10.1016/j.isci.2022.103939.
- Geng Q., Wan Y., Hsueh F.-C., Shang J., Ye G., Bu F., Herbst M., Wilkens R., Liu B., Li F. Lys417 acts as a molecular switch that regulates the conformation of SARS-CoV-2 spike protein. eLife. 2023;12:e74060. doi: 10.7554/eLife.74060.
- Thomson E. C., Rosen L. E., Shepherd J. G., Spreafico R., da Silva Filipe A., Wojcechowskyj J. A., Davis C., Piccoli L., Pascall D. J., Dillen J., Lytras S. Circulating SARS-CoV-2 spike N439K variants maintain fitness while evading antibody-mediated immunity. Cell. 2021;184:1171–1187.e20. doi: 10.1016/j.cell.2021.01.037.
- Schratz P., Muenchow J., Iturritxa E., Richter J., Brenning A. Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol. Modell. 2019;406:109–120. doi: 10.1016/j.ecolmodel.2019.06.002.
- Machine Learning Model Optimization with Hyper Parameter Tuning Approach. International Conference on Advanced Engineering, Technology and Applications (ICAETA); 2021.
- Smaoui M. R., Yahyaoui H. Unraveling the stability landscape of mutations in the SARS-CoV-2 receptor-binding domain. Sci. Rep. 2021;11(1):9166. doi: 10.1038/s41598-021-88696-5.
- Liu J., Yu Y., Jian F., Yang S., Song W., Wang P., Yu L., Shao F., Cao Y. Enhanced immune evasion of SARS-CoV-2 KP.3.1.1 and XEC through NTD glycosylation. bioRxiv. 2024. doi: 10.1101/2024.10.23.619754.
- Capelli R., Serapian S. A., Colombo G. In Computer-Aided Antibody Design; Tsumoto K., Kuroda D., Eds.; Springer US: New York, NY, 2023; pp. 255–266.
- Epitope Mapping Protocols; Rockberg J., Nilvebrant J., Malm M., Berndt Thalén N.; Springer New York: New York, NY, 2018; pp. 1–10.