Benchmarking the Ligand–HER2 Interactions Using Machine Learning and Molecular Dynamics Simulations

Duc Toan Truong; Quang Tung Dao; Thi Thuy Mai Tran; Ngoc Ha Nguyen; My-Kristyna Nguyen-Thao; Nguyen-Hai Nam; Thi Mai Dung Do; Minh Tho Nguyen

doi:10.1021/acsomega.5c10459

ACS Omega. 2026 Feb 8;11(7):11818–11832. doi: 10.1021/acsomega.5c10459


PMCID: PMC12947145 PMID: 41768642

Abstract

Understanding the inhibitor–HER2 interaction mechanism remains a critical challenge in combating breast cancer. In the present work, the role of five critical residues that are deeply located in the HER2 active site was recognized. To win the race against time in learning the activities of the HER2 tyrosine kinase protein, we employed a stepwise computational procedure including a machine learning predictive regression model, atomistic molecular dynamics (MD) simulations, and the umbrella sampling MD method. A systematic mining of a data set of 8 million chemical compounds allowed us to finally identify 13 candidates whose capacities as anti-HER2 have not been reported before. Based on the computed results, a benchmark for the strength of the ligand–HER2 interaction has been established. Although van der Waals potential energy tends to stabilize ligand–protein associations, the ligand that electrostatically interacts with five residues, Lys753, Leu796, Thr798, Asp863, and Asp880, is a key factor in deciding the inhibitor strength. Significantly, the strong binding of compound lig233 was exemplified by its ability to form hydrogen bonds with Asp863 and Asp880 and maintain exceptionally short distances to many key residues, indicating the formation of strong chemical bonds. Lig233 also exhibits a binding free energy of −47 kcal/mol, two times as large as that of −21 kcal/mol for the known drug lapatinib. The fresh understanding achieved in the present study can lead to the necessary adjustments in the experimental development of HER2 inhibitors.


1. Introduction

Breast cancer remains one of the most prevalent and life-threatening malignancies worldwide, affecting millions of female individuals annually. Despite great advancements in the early detection and subsequent handling, the treatment of breast cancer continues to pose great challenges due to its heterogeneous nature and inherent development of drug resistance. Among the various subtypes, the human epidermal growth factor receptor 2 (HER2)-enriched group in the related breast cancer is characterized by its overexpression, which plays a pivotal role in tumor progression, metastasis, and therapeutic resistance. HER2 is a transmembrane tyrosine kinase receptor that belongs to the ErbB family, and its amplification is observed in ∼20% of breast cancer cases. This protein subtype is associated with a more aggressive disease course and a poorer prognosis if it is left untreated.

The emergence of targeted therapies against HER2 has significantly improved clinical outcomes in HER2-positive breast cancer patients. The HER2 protein consists of three major domains: an extracellular ligand-binding domain, a transmembrane region, and an intracellular tyrosine kinase domain. Unlike other members of the ErbB family, HER2 does not have a known ligand; instead, it is activated through homodimerization or heterodimerization with other ErbB receptors, particularly HER3. This leads to the activation of downstream signaling pathways, including PI3K/AKT/mTOR and MAPK, which are critical in cancer cell proliferation, survival, and metastasis. The tyrosine kinase domain of HER2 contains the primary region for ATP binding and the phosphorylation events that propagate oncogenic signaling. HER2-targeted therapies in breast cancer treatment, including monoclonal antibodies, tyrosine kinase inhibitors (TKIs), and antibody–drug conjugates (ADCs), have led to remarkable improvements in survival. Compared to monoclonal antibodies, small molecules offer several advantages such as oral bioavailability, deeper tumor penetration, and the ability to overcome resistance mechanisms. TKIs are developed not only to directly occupy the ATP-binding region but also to prevent HER2 dimerization. Indeed, targeting the HER2 tyrosine kinase domain remains a cornerstone strategy in the treatment of HER2-positive breast cancer.

First-generation reversible inhibitors, such as lapatinib, showed a certain efficacy by inhibiting both EGFR and HER2. However, due to the emergence of resistance mechanisms and the limited selectivity of this class of inhibitors, second-generation irreversible inhibitors, such as neratinib, were developed to form covalent bonds with the kinase domain and provide sustained inhibition. More recently, highly selective HER2 inhibitors such as tucatinib demonstrated superior efficacy with reduced off-target effects. Current research efforts focus on optimizing small-molecule inhibitors, addressing resistance mechanisms, and integrating them with other targeted therapies to improve clinical outcomes. The discovery of novel HER2 inhibitors continues to hold significant promise for the advancement of breast cancer therapy.

Clinical trials are costly, resource-intensive, and slow, and, more seriously, new drug resistance usually appears faster than the FDA drug approval process. In this context, it is of crucial importance to reliably predict the potential of drugs for effective treatments. Nowadays, advanced computational methods such as molecular dynamics (MD) simulations and binding free energy calculations offer us a chance to tackle these issues and compete in the race against time. In addition, the emergence of machine learning (ML) methods allows us to explore large databases containing millions of relevant compounds. A combination of both approaches can provide us with new insights into the physicochemical mechanism at the atomistic level.

For HER2-positive cancers, ML models have shown great promise in accelerating the identification of kinase inhibitors, offering insights into structure–activity relationships and facilitating scaffold innovation. However, studies specifically targeting HER2 using ML remain relatively limited. Matrouk et al. combined ML-QSAR modeling, pharmacophore descriptors, and docking to identify some novel HER2 inhibitors. The data set employed included 1,397 HER2 inhibitors curated from ChEMBL, with compounds categorized as “potent” (IC₅₀ ≤ 500 nM) or “less active” (IC₅₀ ≥ 5000 nM). An innovative aspect of this study lies in the use of pharmacophore-derived descriptors extracted from the docking poses that were then integrated into the ML pipeline. After optimizations using the genetic function algorithm (GFA), the GFA-Bagging model achieved the best performance with 84% accuracy and an area under the curve (AUC) of 0.77, while GFA-J48Graft achieved 78% accuracy. Notably, the removal of pharmacophore features reduced the model performance significantly, underlining the importance of capturing ligand–target interaction features. Screening of the NCI library led to the experimental identification of three novel HER2 inhibitors having IC₅₀ values from 3.85 to 6.92 μM. A principal component analysis (PCA) showed that these compounds represent some novel scaffolds and thus a potential for scaffold hopping. Kleandrova et al. proposed a perturbation-based ML framework to identify dual inhibitors of HER2 and CDK4, addressing the polypharmacology challenge in kinase inhibition. A data set of 2209 compounds was compiled, including both selective and dual inhibitors. A multilayer perceptron neural network using 7 molecular descriptors achieved 84.0% training and 80.4% test accuracy, with Matthews correlation coefficient (MCC) values of 0.683 and 0.616, respectively. The model allowed the rational design of six new structures including three dual inhibitors and three HER2-selective inhibitors, all of which satisfied Lipinski’s rule of five, supporting their drug-likeness and oral bioavailability. Collectively, these previous studies illustrate the expanding role of ML in the discovery of HER2 inhibitors, from QSAR modeling and virtual screening to kinase selectivity prediction and multitarget drug design. As computational methodologies continue to evolve, integrating ML with structural biology and cheminformatics holds immense potential for accelerating drug discovery, particularly in addressing challenges related to kinase inhibitor selectivity and scaffold innovation.

As far as we are aware, most ML models developed for HER2 inhibitors have primarily focused on classification tasks, assigning compounds to broad activity classes (e.g., active vs inactive). However, for virtual screening and hit prioritization, it is more informative to predict a continuous measure of potency rather than a coarse activity label. In this context, regression-based ML models that estimate the specific inhibitory strength of individual molecular structures against HER2, and thereby enable a more fine-grained ranking of candidate compounds, remain largely underexplored.

On the other side of computational advances, MD simulations have also proved their impact in the field of rational drug design. As a powerful tool working at the atomistic level, MD simulations are able to provide detailed insights into the dynamic interactions of proteins and ligands, the inhibition mechanisms of drugs, and the underlying thermodynamics of macromolecular systems. In particular, several computational methods can effectively predict the binding affinities between a small ligand and a receptor, including docking algorithms, molecular mechanics/Poisson–Boltzmann surface area (MM-PBSA) or molecular mechanics/generalized Born surface area (MM-GBSA), free energy perturbation, thermodynamic integration, steered molecular dynamics (SMD), and umbrella sampling molecular dynamics (USMD). In the case of HER2, Verma et al. performed MD simulations followed by MM-PBSA analysis and reported calculated binding energies for lapatinib in the range of several hundred kilojoules per mole; for the wild-type HER2–lapatinib complex, the MM-PBSA binding energy was approximately −554 kJ/mol (≈ −133 kcal/mol). Bolaji et al. performed MM-GBSA analysis after induced-fit docking and MD simulations and reported ΔG binding values of −64 kcal/mol for lapatinib and −63 to −79 kcal/mol for their top-ranked HER2 inhibitors. In a study by Alvarado-Lozano et al., MD simulations were combined with MM-GBSA calculations to evaluate the binding characteristics of two natural compounds, mangiferin and silybin, against both wild-type and mutant forms of HER2 and EGFR. The results showed that gefitinib exhibits the strongest binding across all systems, with a ΔG value of −50 kcal/mol for the wild-type HER2. The two natural compounds bind more weakly, with ΔG values of −31.3 kcal/mol for mangiferin and −31.1 kcal/mol for silybin. While studies on anti-HER2 compounds remain of intense interest, to the best of our knowledge, free energies of HER2–ligand complexes have most often been reported using end point methods such as MM-PBSA or MM-GBSA. The reported numbers correspond to protocol-dependent MM-PBSA or MM-GBSA computed binding energies, which are best interpreted as relative scoring functions for comparing ligands within the same setup, rather than as absolute thermodynamic binding free energies directly comparable to experimental K_d values. Although these approaches perform well in evaluating the free energy difference between two single states, their primary limitation is that they give no information on the free energy profile connecting these states. The free energy profile is important for exploring how ligand–protein complexes transform in going from a bound state to an unbound state, separated by a transition state.

In this context, we set out to carry out a combined ML/MD study and present herewith a predictive regression model and a comprehensive investigation based on a large database containing 8 million compounds for screening. In the first part, the ML predictive regression models promote a group of 64 top-hit compounds to all-atom MD simulations. In the second part, devoted to MD runs, we construct a workflow whose effectiveness has already been demonstrated in our previous work, including docking, classical MD, SMD, and USMD simulations. More significantly, this is the first time the absolute binding free energies between HER2 and its inhibitors are computed, making use of the potential of mean force (PMF) method, which remains so far one of the most accurate methods to evaluate the small ligand–receptor affinity. With the support of computing facilities, a total of 75 μs of MD simulations and 20 TB of output are generated. This computational work helps us identify 13 compounds that possess affinities larger than that of lapatinib, a currently used drug in breast cancer treatment. In addition, this is the first time the role of critical residues located in the HER2 active site has been reported, in the sense that these residues can decide whether the binding interaction between HER2 and an inhibitor is strong or weak. The present study sets a benchmark for the evaluation of interaction energies and provides new insights into them.

2. Materials and Methods

To systematically discover a group of novel HER2 kinase inhibitors with better binding energy, we devise an integrated virtual screening strategy that leverages both data-driven prediction and physicochemical-based simulations. Our present study is separated into two independent parts, namely, (i) screening of top-hit compounds by an ML predictive regression model and (ii) determination of binding affinities by the USMD simulations method and PMF calculations (Figure ).

Spatial structure of the HER2 tyrosine kinase domain: (A) front view, with HER2 illustrated in green; (B) side view, with HER2 illustrated as the blue surface. Lapatinib is shown as magenta spheres. Images are obtained by using the crystal structure PDB-ID 3RCD and the PyMOL package. (C) Systematic scheme illustrating our screening and verification protocol in this study.

The first part begins with the construction of a high-performance ML regression model, trained on a curated data set of 625 HER2 inhibitors, to predict biological activities from molecular fingerprints. This model is subsequently employed to screen a vast compound library, narrowing down candidates with a high predicted potency. Applied to a database containing 8 million compounds, this first part suggests 64 compounds for subsequent processing using MD simulations. The second part of the work explores, at the atomic level, the interactions between each compound considered and the HER2 tyrosine kinase domain. Several software packages, including Avogadro, Gaussian 16, and Antechamber, are used in sequence to derive the atomic point charges of the ligands. Subsequently, the 64 selected ligands are docked into the HER2 binding pocket with the AutoDock Vina software before running 100 ns MD simulations per system. Analysis of the MD outcomes allows us to observe the interaction dynamics of the biosystems as each ligand fluctuates within the protein active site. The number of hydrogen bonds, the number of contacts, and the van der Waals and electrostatic interaction energies between the 64 ligands and HER2 are also determined. Following careful consideration of all obtained information, we then select 20 high-priority compounds for the final step, the USMD simulations. In brief, the protocol we develop and apply here, as summarized in Figure , aims to detect and determine a group of effective compounds, out of eight million, which are expected to yield good results in further experimental in vitro testing.

2.1. ML and Deep Learning Application in Ligand-Based Virtual Screening

2.1.1. Data Collection and Curation

2.1.1.1. Data Collection

Compounds with HER2 inhibitory activity were curated from research articles indexed in ChEMBL 33, using the search criteria “Research article”, “Novel inhibitor”, and “HER2 enzyme assay”. As a result, 642 datapoints with IC₅₀ values were collected and used in the subsequent steps. The list of data, along with publication details, is available at https://github.com/daoquangtung2411/her2_prediction.

2.1.2. Data Cleaning

The data obtained were first cleaned and preprocessed through the following steps: (1) remove salts; (2) remove duplicates, identified by SMILES and InChIKey, by averaging the duplicated data; (3) convert IC₅₀ to pIC₅₀. Since the data were curated from research articles, most duplicated entries are reference compounds. After cleaning and preprocessing, 625 datapoints remained. The biological activity of the preprocessed data set ranges from 0.5 to 47,000 nM. To improve the distribution of the data, pIC₅₀ (−log₁₀ IC₅₀, with IC₅₀ expressed in mol/L) was used instead of IC₅₀. The box plot in Figure S5 shows that the pIC₅₀ values of the data set are much closer to a normal distribution than the IC₅₀ values.
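The conversion and duplicate handling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and averaging duplicates on the IC₅₀ scale (rather than on the pIC₅₀ scale) is an assumption based on the order of the steps listed.

```python
import math
from collections import defaultdict


def ic50_nm_to_pic50(ic50_nm):
    """Convert IC50 given in nM to pIC50 = -log10(IC50 in mol/L)."""
    # 1 nM = 1e-9 mol/L, so -log10(x * 1e-9) = 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


def average_duplicates(records):
    """Average IC50 values (nM) of entries sharing the same structure key.

    records: iterable of (structure_key, ic50_nm) pairs, where the key could
    be an InChIKey or a canonical SMILES string (assumed, hypothetical input).
    """
    grouped = defaultdict(list)
    for key, ic50_nm in records:
        grouped[key].append(ic50_nm)
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


# The reported activity range of 0.5 to 47,000 nM on the pIC50 scale
print(round(ic50_nm_to_pic50(0.5), 2), round(ic50_nm_to_pic50(47000), 2))
```

On this scale the reported activity range of 0.5 to 47,000 nM spans roughly 4.3 to 9.3 pIC₅₀ units, which is far more compact than the five orders of magnitude covered by the raw IC₅₀ values.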

2.1.3. Feature Generation

Extended-connectivity fingerprints (ECFPs), a type of circular fingerprint developed by Rogers and Hahn in 2010, are one of the most widely used features for supervised ML. ECFPs have been proven successful in many applications, including ligand-based virtual screening. However, ECFP4 is prone to bit collision due to the hash function, and its lack of interconnectedness in the structure might lead to a lower-performance model. Therefore, our hypothesis was that combining ECFP with other descriptors or fingerprints would improve the model’s predictive ability and represent a broader range of compound structures. The Molecular ACCess System (MACCS) key, a dictionary-based fingerprint, is a set of 166 bits of predefined substructural keys. Despite lacking substructure connectivity, MACCS can serve as an appropriate supplement to ECFPs due to its emphasis on key substructures. A recent study showed that the combination of MACCS and ECFP4 achieved higher performance than either fingerprint alone for predicting logP using ML algorithms. In this work, we used a combination of ECFP4 (2048 bits) and the MACCS key (166 bits) as molecular features for building and evaluating models. The RDKit package was used to generate the ECFP4 fingerprint and the MACCS key.

2.1.4. Dimensionality Reduction

In this study, the data set was relatively limited (625 datapoints), and the dimension of features was high (2214 features). Therefore, dimensionality-reduction techniques were subsequently utilized to achieve an appropriate feature subset to train the model.

Feature selection techniques were used to reduce the dimensionality for model building and evaluation. The process included (1) removing features that have over 95% of zero values and then (2) removing collinear columns with a correlation coefficient over 0.7. The feature selection process is applied to the whole data set for the clustering step and the training data set.
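The two filtering steps can be sketched in plain Python as below. This is only an illustration under stated assumptions: the correlation is taken as the absolute Pearson coefficient, collinear columns are removed greedily in column order (the first of a correlated pair is kept), and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(columns, zero_frac=0.95, corr_cut=0.7):
    """columns: dict mapping feature name -> list of 0/1 bits, one per compound."""
    # Step 1: drop features whose fraction of zero values exceeds zero_frac
    dense = [name for name, col in columns.items()
             if col.count(0) / len(col) <= zero_frac]
    # Step 2: keep a feature only if it is not collinear with any kept feature
    kept = []
    for name in dense:
        if all(abs(pearson(columns[name], columns[k])) <= corr_cut for k in kept):
            kept.append(name)
    return kept


bits_a = [1, 0] * 10
bits_d = [1] * 10 + [0] * 10
features = {"a": bits_a, "b": list(bits_a), "z": [0] * 20, "d": bits_d}
print(select_features(features))  # "b" duplicates "a"; "z" is all zeros
```

When the thresholds are derived from the training set only, the same list of kept columns should then be applied unchanged to the test set.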

In addition, a variational autoencoder (VAE) was applied as an advanced dimensionality-reduction technique for clustering. Binary cross-entropy (BCE) was used as the reconstruction loss between the output and the input features (which were binary), and the output layer activation function was sigmoid. Meanwhile, the Kullback–Leibler (KL) loss measures the divergence between the encoded distribution and a normal distribution with mean 0 and variance 1.0. We also applied the beta-VAE variant, in which a weight is added to the KL loss to obtain better latent space representations. In this study, the VAE was implemented with TensorFlow, and PCA was also applied to the latent representation for further reduction of dimensionality.
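For a single compound, the loss combination described above can be written out explicitly. The sketch below assumes a diagonal Gaussian encoder parameterized by a mean and a log-variance per latent dimension, which is the usual VAE setup but is not stated in the text; the function names are hypothetical, and the actual model was implemented in TensorFlow.

```python
import math


def bce(x, x_hat, eps=1e-7):
    """Binary cross-entropy between binary input bits x and sigmoid outputs x_hat."""
    total = 0.0
    for target, prob in zip(x, x_hat):
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        total -= target * math.log(prob) + (1 - target) * math.log(1.0 - prob)
    return total


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction loss plus beta-weighted KL term (beta = 1 gives a plain VAE)."""
    return bce(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


# Two input bits reconstructed with probability 0.5 each, one latent dimension
print(beta_vae_loss([1, 0], [0.5, 0.5], mu=[1.0], log_var=[0.0], beta=4.0))
```

A larger beta pushes the encoded distribution closer to the standard normal prior at the cost of reconstruction quality, which is the trade-off that the weight on the KL loss controls.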

2.1.5. Data Variation and Clustering

In this work, clustering is used solely for the visualization and assessment of chemical space diversity. To determine data variation, the pairwise similarity between compounds in the input data set was calculated using the pdist function provided by SciPy and based on Jaccard distances, since binary features were used. Pairwise Jaccard similarity calculations are shown in Equations . Moreover, the Gini coefficient was also calculated to determine the diversity of the input data set (equation ). The Gini coefficient ranges from 0 to 1: a value closer to 0 can be interpreted as sparser, more evenly spread data, in which no group of similar datapoints dominates the data set.

Jaccard similarity J(A, B) = 1 - \frac{|A \,\Delta\, B|}{|A \cup B|}

J(A,B): Jaccard similarity between fingerprints A and B; A Δ B: symmetric difference (logical XOR) of A and B (positions that have a 1-bit in exactly one of A and B); A ∪ B: logical OR of A and B (positions that have a 1-bit in either A or B).

Gini coefficient G(S) = \frac{1}{n - 1} \left( n + 1 - 2 \, \frac{\sum_{i=1}^{n} (n + 1 - i) \, y_{(i)}}{\sum_{i=1}^{n} y_{(i)}} \right)

G(S): Gini coefficient; n: number of observations; y_{(i)}: the ith value after sorting all y values in increasing order.
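Both quantities can be computed directly from their definitions. The sketch below reads Δ in the Jaccard formula as the symmetric difference, so that the subtracted term equals the Jaccard distance; returning 1.0 for two all-zero fingerprints is a convention assumed here, and the function names are hypothetical.

```python
def jaccard_similarity(a, b):
    """Jaccard (Tanimoto) similarity of two equal-length binary fingerprints."""
    union = sum(1 for x, y in zip(a, b) if x or y)
    if union == 0:
        return 1.0  # assumed convention for two empty fingerprints
    sym_diff = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    return 1.0 - sym_diff / union


def gini(values):
    """Gini coefficient with the 1/(n - 1) prefactor used in the equation above."""
    y = sorted(values)
    n = len(y)
    weighted = sum((n + 1 - i) * y_i for i, y_i in enumerate(y, start=1))
    return (n + 1 - 2.0 * weighted / sum(y)) / (n - 1)


print(jaccard_similarity([1, 1, 0, 0], [1, 0, 1, 0]))  # one shared bit out of three set
print(gini([1, 1, 1, 1]), gini([0, 0, 0, 1]))          # perfectly even vs fully concentrated
```

The two limiting cases show the scale of the coefficient: identical values give 0, and a single nonzero value gives 1.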

Clustering was utilized to visualize the distribution of the input data set with the aim of determining the number of representative structure clusters. Furthermore, clustering can show the coverage of the train and test data set. In this study, K-Medoids was used for the first time in combination with VAE to perform clustering.

The ultimate goal of the clustering method is to minimize the intracluster distances and maximize the intercluster distances. The Davies–Bouldin index (DBI, eq ) and the Silhouette score (eq ) were used to determine how well the clusters are separated from each other. Silhouette scores range from −1 to 1, where a good cluster has a silhouette score close to 1. Lower DBI values (closer to 0) indicate good clusters that are compact and well separated.

DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{S_{i} + S_{j}}{M_{i,j}}

DB: Davies–Bouldin index; K: the number of clusters; i, j: cluster indices with i, j ∈ {1,···, K} and i ≠ j; S_i, S_j: intracluster dispersion of clusters i and j, respectively (average distance of the elements of a cluster to its center); M_{i,j}: intercluster separation (distance between the centers of clusters i and j).

s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

s(i): Silhouette score; a(i): mean intracluster distance; b(i): mean nearest-cluster distance.
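The silhouette definition can be applied to a precomputed distance matrix, which suits Jaccard distances between fingerprints. The following is a minimal sketch with a hypothetical function name; it assumes at least two clusters and assigns a score of 0 to singleton clusters, a common convention that is not specified in the text.

```python
def silhouette_scores(dist, labels):
    """Per-point silhouette s(i) from a full N x N distance matrix and cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster (assumed convention)
            continue
        # a(i): mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Four points on a line forming two tight, well-separated pairs
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
scores = silhouette_scores(dist, [0, 0, 1, 1])
print(sum(scores) / len(scores))
```

In this toy case every point scores close to 1, the situation described above as a good clustering; overlapping clusters would pull the mean toward 0.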

2.1.6. Model Building and Evaluation

First, the input data set was divided into a train and a test set with proportions of 85% and 15% (531 and 94 datapoints) using the train-test split function of scikit-learn. Then feature generation and selection were applied to obtain a data set with merged ECFP4–MACCS features (151 features, referred to as EMM). Furthermore, the same process was applied to create ECFP4-only and MACCS-only feature data sets (which included 265 and 118 variables, respectively) for pretesting with a simple model. Surprisingly, in our preliminary tests with a linear model (linear regression) on the 3 sets of features, EMM performed significantly better than ECFP4 and MACCS. The results are shown in Table S6 (SI file).

Based on the findings in Table , the merged ECFP4–MACCS features were carried forward, and their performance was evaluated with several ML and deep learning algorithms, namely linear regression, random forest regression, XGBoost regression, and multilayer perceptron.

Table 1. Pretesting of Three Feature Sets with a Linear Model

feature set	R² (train)	R² (test)	RMSE (train)	RMSE (test)
EMM	0.86	0.80	0.44	0.57
ECFP4	0.92	0.54	0.34	0.86
MACCS	0.75	inf	0.60	>2¹³
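The split sizes and the two metrics in the table can be reproduced from their definitions. The sketch below is illustrative only: the helper names are hypothetical, and the test-set count is assumed to be rounded up, which yields the 94/531 split reported in the text.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error, here in pIC50 units."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# 85/15 split of the 625 curated datapoints
n_total = 625
n_test = math.ceil(0.15 * n_total)
n_train = n_total - n_test
print(n_train, n_test)

# Always predicting the mean gives R2 = 0; worse predictions give negative R2
print(r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

Because R² has no lower bound on unseen data, a model that fits the training set well can still score far below zero on the test set, which is why the train and test columns are reported side by side.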


2.1.6.1. Linear Regression

Linear regression was implemented with the scikit-learn library, using the ordinary least-squares (OLS) algorithm. The only parameter tuned was fit_intercept: if False, the y-intercept is fixed at 0; if True, it is estimated from the line of best fit.

2.1.6.2. Random Forest Regression

This work uses the Random Forest Regressor provided by the scikit-learn library, with decision tree regressors as base learners. The tuned parameters were the total number of trees (n_estimators), the maximum depth of each tree (max_depth; deeper trees can fit the training data more closely), the minimum number of samples required to split an internal node (min_samples_split), the minimum number of samples required at a leaf node (min_samples_leaf), and the number of features randomly considered at each split (max_features).

2.1.6.3. XGBoost Regressor

The tuned parameters were the booster type (booster), the subsample ratio of training instances (subsample), the subsample ratio of columns when constructing each tree (colsample_bytree), the minimum loss reduction required to make a further partition on a leaf node of a tree (gamma), the maximum depth of each tree, which controls the complexity of the tree and its tendency to overfit (max_depth), the step size shrinkage used in tree updates to prevent overfitting (learning_rate), the minimum sum of instance weights needed in each child node (min_child_weight), the L2 regularization term (reg_lambda), the L1 regularization term (reg_alpha), the tree construction algorithm (tree_method), and the number of trees (n_estimators).

2.1.6.4. Multilayer Perceptron

The multilayer perceptron was implemented with TensorFlow Keras. Adam, a stochastic gradient descent method, was used to optimize the cost function, and its learning rate was tuned. Several activation functions, namely ReLU, Sigmoid, and Tanh, were also compared to find the most appropriate one. The activation function of the output layer is linear since the research focuses on a regression model. Other parameters considered include the number of hidden layers, the number of nodes, and the batch size.

2.1.6.5. Hyperparameter Tuning

Hyperopt, which is based on sequential model-based optimization (SMBO) and the tree of Parzen estimators (TPE) algorithm, was used to tune the hyperparameters of each model. Hyperopt uses expected improvement (EI) as the optimization criterion; in the case of TPE, EI is calculated as illustrated in eq . A 5-fold cross-validation was used to find the optimal hyperparameters.

EI_{y^{*}}(x) = \frac{\gamma y^{*} l(x) - l(x) \int_{-\infty}^{y^{*}} p(y) \, dy}{\gamma l(x) + (1 - \gamma) g(x)} \propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1}

EI_{y*}(x): expected improvement of optimal gain with configuration x; x: hyperparameter configuration; y: corresponding loss given hyperparameter x; y*: threshold for optimal gain; p(y): the marginal probability density of the loss; γ: quantile of the observed y values, equal to p(y < y*); l(x) = p(x | y < y*): density formed by using the observations {x(i)} whose corresponding loss y is less than the threshold y*; g(x) = p(x | y ≥ y*): density formed by using the remaining observations (corresponding loss y higher than or equal to the threshold y*).
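The proportionality on the right-hand side means that TPE only needs the ratio g(x)/l(x) to rank candidate configurations. A minimal sketch follows; the Gaussian densities, their centers, and the function names are hypothetical stand-ins for the Parzen estimators that Hyperopt builds internally.

```python
import math


def tpe_score(l_x, g_x, gamma):
    """Quantity proportional to EI: (gamma + g(x)/l(x) * (1 - gamma))^-1."""
    return 1.0 / (gamma + (g_x / l_x) * (1.0 - gamma))


def normal_pdf(x, mean, sd):
    """Gaussian density, used here as a stand-in for l(x) and g(x)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def pick_candidate(candidates, l_density, g_density, gamma=0.25):
    """Return the candidate configuration with the highest score."""
    return max(candidates,
               key=lambda x: tpe_score(l_density(x), g_density(x), gamma))


# Hypothetical example: good trials cluster near a learning rate of 0.1, poor ones near 1.0
best = pick_candidate(
    [0.1, 0.5, 1.0],
    l_density=lambda x: normal_pdf(x, 0.1, 0.3),
    g_density=lambda x: normal_pdf(x, 1.0, 0.3),
)
print(best)
```

The score is highest where configurations are likely under l(x) and unlikely under g(x), that is, where past trials with low loss are concentrated.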

2.1.7. Virtual Screening Data Set

The virtual screening data set contains over 8 million compounds from ChEMBL 33 and PubChem.

2.1.8. Visualization and Supporting Libraries

In this research, the Matplotlib, Seaborn, NumPy, and Pandas libraries were used to process and visualize the data.

2.2. Molecular Docking

To prepare for docking simulations, the receptor (PDB ID: 3RCD) and ligand structures were processed using the AutoDockTools (ADT) package within the MGLTools software suite. The binding poses and affinities of ligands were initially assessed using AutoDock Vina version 1.2.3. All ligands were docked to the ATP-binding site of the TK domain. The docking grid box was defined with dimensions of 80 × 80 × 80 Å and was centered at (8.156, 11.818, 14.228). Docking simulation was carried out with an exhaustiveness value of 400 and a grid spacing of 0.375 Å. The docking protocol was validated for HER2 by redocking, which succeeded with an RMSD lower than 2.0 Å.
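The redocking criterion compares the docked pose with the reference pose in the receptor frame. A minimal sketch of the RMSD calculation is given below; it assumes both poses list the same heavy atoms in the same order, applies no superposition, and ignores symmetry-equivalent atoms. The function name and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (in Å) between two poses given as lists of (x, y, z)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                  for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical two-atom pose shifted by 1 Å along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
value = rmsd(reference, docked)
print(value, value < 2.0)
```

No alignment is applied because both poses are expressed in the coordinate frame of the same receptor, so any rigid shift of the ligand counts as a genuine docking error.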

2.3. MD Simulations

The tyrosine kinase domain of HER2, like all protein kinases, contains a smaller N-terminal lobe (N-lobe), predominantly composed of β-strands and a single α-helix, and a larger C-terminal lobe (C-lobe), which is primarily α-helical. These two lobes are connected by a flexible hinge region, forming a deep cleft that serves as the ATP-binding site. Most residues critical for catalytic activity are located within or near this cleft. The key regions and residues within the HER2 kinase domain and their roles include the following:

1. Glycine-rich binding loop: It is located in the N-lobe, specifically Leu726–Val734. This loop forms a lid over the phosphates of ATP. A portion of this loop, Gly729–Gly732, is partially disordered, and the side chain of Phe731 is solvent-exposed, indicating conformational flexibility.
2. αC-helix: Spanning Pro761–Ala775 in HER2, this helix is located in the N-lobe. Its position relative to the ATP-binding site is a key determinant of the kinase’s activation state.
3. DFG motif: Consisting of Asp863–Gly865 in HER2, it is found in the C-lobe. In the active state, the aspartate residue (Asp863) typically points toward the ATP-binding pocket.
4. Catalytic loop: Containing Arg844–Asn850 in the HER2 structure, these residues are located in the C-lobe. This loop is critical for the actual phosphoryl transfer, with the catalytic aspartate acting as a base to abstract a proton from the tyrosine hydroxyl group of the substrate.
5. Activation loop (A-loop): It is located in the C-lobe, from Asp863 to Val884 in HER2. In active kinase conformations, this loop is typically extended and well-ordered, without forming any secondary structure, facilitating catalytic activity.
6. Lys753 (in β-strand 3) and Glu770 (in αC-helix): In most active kinase conformations, a canonical salt bridge forms between these two highly conserved residues. However, in published HER2 structures, this salt bridge is often not observed, even in an active-like conformation, with the bulky group of inhibitors disrupting it and shifting the αC-helix away from the active site. This unique feature contributes to the distinctive activation mechanism of HER2.
7. Ser783: This residue plays a significant role in the selectivity of certain HER2 inhibitors. For example, its side chain in HER2 forms a hydrogen bond with Thr860, which prevents the binding mode observed for TAK-285 in EGFR, contributing to SYR127063’s HER2 selectivity over EGFR.
8. Thr798: This residue was reported as a gate-keeper residue located at the end of the β-strand. Mutations at this position, such as T798I, are linked to lapatinib resistance in HER2-positive cancers by causing steric clashes with the inhibitor. The gate-keeper residue plays a critical role in regulating access to a hydrophobic back pocket adjacent to the ATP-binding site.

The conformational dynamics of the HER2 and ligand complexes were explored using GROMACS software (version 2023.1). The receptor’s topology was generated based on the Amber99SB-ILDN force field combined with the TIP3P water model. The solvated protein–ligand system, along with ions, was placed inside a periodic rectangular simulation box of 10.0 × 10.0 × 10.0 nm. This system, consisting of more than 98,000 atoms, underwent energy minimization to ensure that no atom experiences a force greater than 100 kJ/mol/nm. To equilibrate the system, two preliminary simulations were performed, namely, a 100 ps NVT ensemble (constant number of particles, volume, and temperature), followed by a 100 ps NPT ensemble (constant number of particles, pressure, and temperature). After equilibration, a 100 ns MD simulation was conducted at 300 K and 1 atm pressure.

2.4. US and PMF Calculations

For free energy calculations, umbrella sampling (US) is employed, involving multiple ligand–protein configurations known as US windows. In this approach, the ligand is incrementally displaced by 0.025 nm between adjacent windows. Although US provides reliable equilibrium free energy estimates, it does not by itself simulate ligand unbinding. To generate the unbinding pathway, initial SMD simulations were performed at a slow pulling velocity of 0.1 m/s. SMD applied an external force to the ligand’s center of mass (COM) at a constant pulling velocity of 1 m/s, using a spring constant of 600 kJ/mol/nm². The pulling direction, determined with the CAVER program, was aligned with the z-axis of the simulation box. A harmonic restraint with a force constant of 100 kJ/mol/nm² is applied to all Cα atoms of the protein. More than 120 US windows were generated, each capturing a 0.025 nm ligand movement. Prior to data collection, NVT and NPT equilibration steps were completed for each window, followed by a 10 ns MD simulation to track the ligand’s COM. The resulting free energy profile was then constructed using the PMF method implemented within GROMACS.
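The choice of US windows from an SMD trajectory can be sketched as follows: for each target displacement, spaced by 0.025 nm, the frame whose ligand COM distance is closest to the target is taken as the starting configuration. This is an illustration only; the function name and the synthetic COM trace are hypothetical, and in practice the distances would be read from the SMD pull output.

```python
def select_windows(com_distances, spacing=0.025):
    """Return the frame index closest to each target COM distance (nm)."""
    start = com_distances[0]
    end = max(com_distances)
    # small tolerance guards against floating-point round-off in the division
    n_windows = int((end - start) / spacing + 1e-9) + 1
    frames = []
    for k in range(n_windows):
        target = start + k * spacing
        best = min(range(len(com_distances)),
                   key=lambda i: abs(com_distances[i] - target))
        frames.append(best)
    return frames


# Hypothetical SMD trace: COM distance growing from 0.5 nm to 3.5 nm in 0.005 nm steps
trace = [0.5 + 0.005 * i for i in range(601)]
windows = select_windows(trace)
print(len(windows), windows[:3], windows[-1])
```

With a 0.025 nm spacing, a pull of 3.0 nm yields 121 windows, consistent with the more than 120 windows mentioned above; at 10 ns per window this amounts to about 1.2 μs of sampling per ligand.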

3. Results and Discussion

3.1. ML and Deep Learning Application in Ligand-Based Virtual Screening

3.1.1. Data Variances

Data variances can be defined as the degree of spread in a data set, which reflects the distribution of data in a real-world pattern.

After data processing, the input data set contains 625 compounds collected from ChEMBL, each described by 153 EMM features (merged ECFP4–MACCS fingerprint). The pairwise Jaccard distances between compounds are calculated to quantify how similar each pair of compounds in the data set is. The histogram of pairwise distances is illustrated in Figure S1 of the Supporting Information (SI) file. The average Tanimoto similarity of the input data set amounts to 0.37; being smaller than 0.4, this value shows that most compounds in the data set are not similar to each other. The Gini coefficient is 0.17, which is close to 0 and can be interpreted as the data being sparse and diverse. A dispersed data set means that the model can learn from different scaffolds and may be more applicable when tested in real-world settings.
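As a brief illustration of the similarity measure, the Tanimoto similarity of two binary fingerprints is the number of bits set in both divided by the number of bits set in either, and the Jaccard distance is one minus that value. The sketch below uses three hypothetical 8-bit fingerprints instead of the 153-bit EMM fingerprints of the study.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length binary fingerprints."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 1.0


def jaccard_distance(fp_a, fp_b):
    """Jaccard distance is the complement of the Tanimoto similarity."""
    return 1.0 - tanimoto(fp_a, fp_b)


def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto similarity over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy fingerprints (not data from the study).
fps = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 0],
]
print(tanimoto(fps[0], fps[1]), mean_pairwise_tanimoto(fps))
```

A low mean pairwise value, such as the 0.37 reported above, indicates that most pairs share only a minority of their set bits.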

3.1.2. Clustering

Clustering is the process of partitioning data into subsets, that is, grouping similar datapoints together. It has many applications in ligand-based virtual screening. However, in the scope of this study, clustering is utilized to visualize the distribution of the input data set with the aim of determining the number of representative structure clusters that can guide one to a highly potent subset. The results of the clustering algorithms with different dimensionality-reduction techniques are shown in Table S1 and Figure S2 (SI file). Clustering is largely affected by the feature dimensions. When the original data set is clustered using the Jaccard distance for binary features, the resulting clusters exhibit considerable overlap, and the evaluation metrics suggest that the clustering quality is suboptimal. When PCA is applied with the number of components set to 24, explaining 80% of the variance of the original data set, the clustering performance is only slightly improved, and the clusters still cannot be well separated. Therefore, advanced techniques are explored for further improvement in data clustering.
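The choice of the number of PCA components can be sketched as follows, assuming the explained-variance ratio of each component is already available from a PCA fit. The helper `n_components_for_variance` and the ratios below are hypothetical; in the study, 24 components were needed to reach 80%.

```python
def n_components_for_variance(explained_ratios, threshold=0.80):
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches the threshold."""
    cumulative = 0.0
    ordered = sorted(explained_ratios, reverse=True)
    for k, ratio in enumerate(ordered, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    return len(ordered)


# Hypothetical explained-variance ratios of five components.
ratios = [0.45, 0.25, 0.15, 0.10, 0.05]
n_keep = n_components_for_variance(ratios, threshold=0.80)
print(n_keep)
```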

Parameters of the variational autoencoder (VAE), including latent dimension, layer density, batch size, and beta, are determined using a custom grid search. Optimal parameters for both VAE (beta = 1) and beta-VAE (beta = 1.5) are as follows: latent dimension = 8, layer density = 8, and batch size = 8. With VAE, the clustering results are significantly improved compared to the previous algorithms. Although the beta-VAE remains debated, when the penalized weight for the Kullback–Leibler divergence (KL loss) is introduced in this study, the latent representation helps improve the separation of clusters and thereby achieves a good silhouette score and Davies–Bouldin index. Upon visualization, seven clusters are clearly separated, but some points are still located on the boundary of another cluster. K-Medoids is a suitable clustering algorithm for the VAE latent space because it determines the clusters well and uses representative datapoints, rather than average points, as cluster centers. Details on the structure of the representative of each cluster are illustrated in Table 2.

Table 2. Representative Structures of Each Cluster.

It can be seen from the clustering results that clusters 1, 5, and 6 share one similar root structure (shown in bold in Table 2), and each possesses a high mean pIC₅₀, suggesting that the root structure contributes to the ability to inhibit the HER2 kinase domain. Furthermore, clusters 1, 2, 5, and 6 can be considered as highly potent clusters that include an aza-aromatic ring, and they differ from clusters 3, 4, and 7, which contain small rings. A rationale for this result is that an appropriate HER2 kinase domain inhibitor needs a lipophilic moiety to bind in the active site while maintaining high solubility to reach HER2 in the intracellular region.

3.1.3. Model Building and Evaluation

After splitting, the data set contains 531 datapoints in the training set and 94 in the test set. Feature selection is applied to the training set, and 153 binary fingerprint features are retained for training the models. The resulting metrics, including the R² score and RMSE of the regression models, are presented in Table 3.

Table 3. Features Selected for Training Models.

model (best parameters)	R² (train)	R² (test)	RMSE (train)	RMSE (test)
linear regression (fit_intercept = False)	0.86	0.80	0.44	0.57
random forest regressor (max depth = 12, max features = 75, min samples leaf = 12, min samples split = 4, n estimators = 100)	0.78	0.84	0.56	0.50
XGBoost regressor (booster = ‘gbtree’, colsample_bytree = 0.6027, gamma = 0.8254, learning_rate = 0.0421, max_depth = 8, min_child_weight = 1, n_estimators = 245)	0.89	0.88	0.39	0.44
multilayer perceptron (activation: relu, 2 hidden layers, number of hidden layer nodes: 64, learning rate: 0.001, batch size: 16)	0.81	0.86	0.53	0.48

Nonlinear models generalize better on the test set, especially the XGBoost Regressor model. Surprisingly, the deep learning model (multilayer perceptron) does not perform as well as XGBoost on either the training set or the test set. XGBoost Regressor generalizes well to the unseen data set, with an RMSE of 0.44 and an R² score of 0.88. From the residual plots in Figures S3 and S4 (SI file), the XGBoost Regressor model also shows the smallest residuals among the four models considered, with the majority of residual values ranging from −0.5 to 0.5. This model also does not suffer from overfitting, while the other three models show a slightly overfitted result between the training and test sets. From the recorded metrics and visualization, the XGBoost Regressor model shows the highest performance. Therefore, the XGBoost Regressor is selected to run the subsequent virtual screening.
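For reference, the two metrics used to compare the models can be written out explicitly. The following is a minimal sketch with hypothetical pIC₅₀ values; it is not the evaluation code of the study.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical observed and predicted pIC50 values.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 6.0, 6.5, 8.0]
print(r2_score(y_true, y_pred), rmse(y_true, y_pred))
```

Comparing each metric between the training and test sets, as done in Table 3, is what reveals whether a model generalizes or overfits.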

3.1.4. Virtual Screening Result

The virtual screening is performed on a data set that contains over 8 million compounds. The XGBoost Regressor model predicts 33,481 compounds (0.4%) to be more potent than lapatinib, an approved HER2 inhibitor used here as the reference. After filtering out compounds with available bioactivity records on HER2, the top 64 compounds are chosen as the hit data set (Table S2, SI file).

3.2. Ligand–HER2 Interaction Dynamics via MD Simulation Analysis

The hit data set of 64 compounds described above is first docked into the active site of HER2 (PDB ID: 3RCD) using AutoDock Vina. The selected docking poses of the 64 compounds and lapatinib are illustrated in Table S3 (SI file). Since a docking process does not fully account for the flexibility of the target, it may overlook certain compounds with strong activities. To address this limitation, we use these docked conformations as the input for MD simulations and perform a 100 ns simulation for each compound. The root-mean-square deviation (RMSD) versus time plots of the protein–ligand complexes (Table S4 in the SI file) indicate that all systems reach stable configurations. These results also confirm that a simulated time of 100 ns is long enough for our ligands to find their best binding modes and to explore their fluctuations within the narrow space of the HER2 active site. The average RMSD values of the first and the last 20 ns (Table S5, SI file) reveal a consistent structural rearrangement compared to the initial docking configurations. Specifically, the average RMSD over all ligands studied increases from ∼0.162 nm during the first 20 ns to ∼0.175 nm in the final 20 ns. Accordingly, ligand structures undergo noticeable conformational changes during the simulations. Among the ligands, lig116 exhibits the most significant structural shift, with its RMSD increasing from 0.159 to 0.182 nm, again suggesting an extensive structural rearrangement.

Simultaneously, to quantify the changes in ligand–protein interactions during the 100 ns MD simulation, important physicochemical quantities including the number of contacts, the number of hydrogen bonds, and the Coulomb and van der Waals interaction energies between the ligand and the HER2 protein are computed. The average values of these quantities are calculated separately for the first and the last 20 ns of the 100 ns trajectory and are presented in Tables S6 and S7 (SI file). These numerical data also reveal significant differences between the MD simulations and the docking results. According to the percentage difference in the number of contacts from the first to the last 20 ns (Table S6, SI file), 44 compounds exhibit an increase in the number of contacts over time. This underscores a general trend toward enhanced engagement with the protein binding site. Notably, lapatinib has the largest increase in contacts (14.9%), followed by lig230 (7.8%) and lig118 (6.9%). These ligands likely undergo favorable reorientation or induced-fit adjustments within the binding pocket, leading to an improved interaction network as the simulation progresses. Conversely, several ligands show pronounced declines in contact frequency, including lig235, lig250, and lig224 (Table S6, SI file), suggesting a partial disengagement from the binding pocket or less stable binding conformations under dynamic, solvent-exposed conditions. The number of hydrogen bonds formed between the protein and the ligands in the first and the last 20 ns also illustrates the differences in stabilization trends. The ligands lig236, lig231, and lig227 end up having the most significant increases in H-bonds, indicating the formation of new or more persistent polar interactions over time. In contrast, ligands such as lig130, lig131, and lig241 experience substantial reductions, potentially reflecting unfavorable conformational shifts that disrupt key donor–acceptor alignments (Table S6, SI file).

To better capture the stabilized interaction behavior of each ligand, data collected from the final 20 ns of the 100 ns MD simulations are analyzed in more detail. Based on the calculated results (Table S6, SI file), 19 ligands form a larger number of contacts than lapatinib, and five of them, lig222, lig205, lig233, lig210, and lig102, have the highest numbers of contacts, all exceeding the value observed for the reference lapatinib. Among these five, lig102 features the smallest relative increase, 9.5%, whereas lig222 shows the most pronounced enhancement, 23.2%. The remaining compounds also achieve notable gains, with lig210, lig233, and lig205 surpassing the reference by 11.3, 15.8, and 17.3% in terms of contacts, respectively. In addition, 38 ligands form more hydrogen bonds than the reference compound, lapatinib. Notably, lig232 shows the highest average number of hydrogen bonds per frame, 2.91, closely followed by lig233 (2.88 hydrogen bonds per frame). To illustrate the dynamic interaction profiles between ligands and proteins, lig233 was selected as an example for time-dependent analyses, including the RMSD, number of contacts, electrostatic potential (Coulomb), and total interaction energies, as shown in Figure .

Time-dependent (A) root-mean-square deviation (RMSD) of all protein–ligand atoms, (B) number of contacts, (C) electrostatic potential (Coulomb), and (D) total interaction energies between lig233 (in black), lapatinib (in red), and the HER2 tyrosine kinase domain.
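The relative gains in contacts quoted above can be checked against the averaged contact numbers listed in Table 4. The sketch below copies the contact counts of the five ligands and of lapatinib from that table; the helper `relative_increase` is a hypothetical name. Because the tabulated counts are rounded averages, the recomputed percentages may differ from the quoted ones by about 0.1 percentage point.

```python
def relative_increase(value, reference):
    """Percentage by which value exceeds reference."""
    return 100.0 * (value - reference) / reference


# Average number of contacts in the last 20 ns (values from Table 4).
contacts = {
    "lig102": 3998,
    "lig205": 4284,
    "lig210": 4065,
    "lig222": 4499,
    "lig233": 4229,
}
lapatinib_contacts = 3653

gains = {name: relative_increase(n, lapatinib_contacts)
         for name, n in contacts.items()}
for name in sorted(gains, key=gains.get, reverse=True):
    print(f"{name}: +{gains[name]:.1f}% contacts vs lapatinib")
```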

A thousand ligand–protein configurations from each MD trajectory are collected in order to calculate the polar and nonpolar interaction energy profiles. The average values of the polar and nonpolar interaction energies from the last 20 ns of simulation for the 64 complexes are presented in Table S7 (SI file). The nonpolar interaction energies between the ligands and the HER2 protein dominate over the polar energies in all 64 complexes, which points to the hydrophobic character of HER2 inhibitors. According to the fourth column of Table S7, the ratios between the van der Waals and Coulomb potential energies range from 1.95 (lig117) to 13.57 (lig208). The compound lig222 interacts with the HER2 binding region through the largest nonpolar (van der Waals) potential energy, −80.5 kcal/mol, while its electrostatic counterpart is quite small (−10.8 kcal/mol). This explains why lig222 ranks only fifth: its total interaction energy of −91.3 kcal/mol is smaller in magnitude than the −101.0 kcal/mol of lig233, −97.0 kcal/mol of lig232, −92.8 kcal/mol of lig107, and −92.1 kcal/mol of lig227. Regarding the polar interaction energy (Coulomb potential), the largest magnitude, −29.9 kcal/mol, is obtained for lig233. The well-known reference drug, lapatinib, ranks 11th in van der Waals potential energy (−70.7 kcal/mol), 20th in Coulomb potential energy (−14.3 kcal/mol), and eighth in total interaction energy. Because US simulations cannot be performed for all 64 compounds, 19 ligands with noticeable interaction energies are selected as a group of top-hit compounds on the basis of the full profile of both polar and nonpolar interaction energies. Two compounds, lig205 and lig207, are chosen for their high van der Waals potential energies, −75.0 and −74.4 kcal/mol, respectively. The remaining 17 possess the highest total interaction energies, ranking from 1st to 17th.

To present the data more clearly, only the parameters of the 19 selected compounds are detailed in Table 4. Although the increase or decrease in the interaction strength of an individual protein–ligand complex is not predictable, as it differs from one complex to another, the MD simulations successfully drive the systems into more stable conformations. Moreover, they also detect a consistent feature, hydrophobicity, of all compounds tested in the HER2 binding site. These findings highlight the importance of postdocking MD simulations in refining ligand ranking and thereby understanding mechanistic binding behaviors that are not captured by the docking step. The dynamic evolution of the system reinforces the place of MD as a critical tool in rational drug design.

Table 4. Averaged Numerical Data from the Last 20 ns MD Simulations.

No	compound	contact	H-bond	Van der Waals (kcal/mol)	Coulomb (kcal/mol)	interaction energy (kcal/mol)
1	lig101	3407	2.46	–66.1	–16.5	–82.5
2	lig102	3998	1.24	–76.4	–9.5	–85.9
3	lig107	3904	1.83	–73.9	–18.8	–92.8
4	lig112	3693	1.28	–66.3	–12.5	–78.8
5	lig118	3964	1.36	–72.7	–10.0	–82.7
6	lig205	4284	0.06	–75.0	–5.8	–53.9
7	lig207	3774	0.28	–74.4	–6.9	–65.1
8	lig209	3930	1.79	–69.0	–19.2	–80.8
9	lig210	4065	1.37	–77.7	–13.0	–81.3
10	lig222	4499	0.96	–80.5	–10.8	–91.3
11	lig227	3699	2.56	–66.2	–25.9	–92.1
12	lig228	3034	2.28	–62.5	–18.9	–81.5
13	lig231	3869	1.44	–69.1	–13.3	–82.4
14	lig232	3954	2.91	–74.8	–22.1	–96.9
15	lig233	4229	2.88	–71.0	–30.0	–101.0
16	lig234	3809	2.27	–69.9	–15.3	–85.2
17	lig237	3286	1.03	–62.8	–16.6	–79.5
18	lig239	3787	1.57	–64.2	–19.9	–84.1
19	lig242	3209	1.05	–63.1	–16.2	–79.4
20	Lapatinib	3653	0.94	–70.7	–14.3	–85.0

Number of contacts, number of hydrogen bonds, nonpolar and polar potential energies, and total interaction energy arising from the 19 selected ligands and proteins.

3.3. USMD Method Reveals Strong Binding Affinities between 13 Top-Hit Agents and the HER2 Tyrosine Kinase Domain

First, to establish the ranking of binding affinities, the representative conformations of the 19 ligand–protein systems have to be determined. Hence, we collected 2000 frames from the last 20 ns of simulation for each ligand. The free energy surfaces (FES) are constructed from two components, namely, the number of contacts and the solvent accessible surface area (SASA) of the ligand. The frame that belongs to the global minimum is chosen as the representative configuration of each system. Once again, the Gibbs energy analysis of all ligands reveals that the systems undergo a conversion during the MD process, as shown in Table S8 (SI file). Figure shows the Gibbs (free) energy landscapes and representative interactions of two systems, lig233 and lapatinib.

(A,B) Gibbs energy landscapes of lig233 and lapatinib. (C,D) Key interactions between lig233/lapatinib and the HER2 active site residues in representative conformations. Hydrogen bonds are shown as yellow dashed lines connected to cyan-colored residues; halogen bonds are shown as blue dashed lines linked to purple-colored residues; hydrophobic contacts are indicated by interactions with orange-colored residues.
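The construction of a free energy surface from two collective variables can be sketched with the usual Boltzmann inversion, G = −kT ln(P/P_max), applied to a two-dimensional histogram. The text does not specify the binning or the software used, so the bin widths, the temperature default, and all names in this sketch are assumptions, and the five frames are hypothetical.

```python
import math
from collections import Counter

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def free_energy_surface(x, y, x_width, y_width, temperature=300.0):
    """Return {(ix, iy): G} with G = -kT ln(P / P_max) in kcal/mol.

    x, y are per-frame values of the two collective variables
    (here: number of contacts and ligand SASA)."""
    counts = Counter(
        (math.floor(a / x_width), math.floor(b / y_width))
        for a, b in zip(x, y)
    )
    p_max = max(counts.values())
    kt = KB * temperature
    return {cell: -kt * math.log(n / p_max) for cell, n in counts.items()}


def global_minimum(fes):
    """Histogram cell with the lowest free energy."""
    return min(fes, key=fes.get)


# Hypothetical per-frame data: number of contacts and SASA (nm^2).
n_contacts = [4200, 4210, 4230, 4400, 4215]
sasa = [1.10, 1.15, 1.12, 2.30, 1.18]
fes = free_energy_surface(n_contacts, sasa, x_width=100, y_width=0.5)
best_cell = global_minimum(fes)
print(best_cell, fes)
```

Any frame falling in the lowest-energy cell can then serve as the representative configuration, which corresponds to the selection described above.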

Among a variety of free energy calculation methods, such as perturbation theory, thermodynamic integration (TI), and WExplore, the USMD method is one of the most popular, as it can generate high-precision binding free energies of a ligand–protein system. The main advantage of this method is that it not only provides the free energy difference between the bound and unbound states but also constructs the free energy profile along a reaction coordinate. To prepare the starting structures for the USMD calculations (cf. details given in Section 2.4), one SMD trajectory with a low pulling velocity (v = 1 m/s) must first be generated. More than 120 windows are prepared such that the ligand moves 0.25 Å (0.025 nm) from one window to the adjacent one. The shorter the distance moved by the ligand between two windows, the larger the number of MD runs required. All initial configurations are submitted in parallel to 10 ns MD runs. We need to ensure that the overlap between the histograms of two adjacent windows is greater than 5%, although this condition is the hardest to achieve, especially when the system is in a transition state. Detailed US diagrams of the 19 selected ligands are shown in Table S9 (SI file).
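The 5% overlap criterion between adjacent windows can be illustrated as follows. The text does not state how the overlap is measured, so this sketch assumes the overlap coefficient, that is, the sum over bins of the smaller of the two normalized histogram heights. The bin width and the sampled reaction-coordinate values are hypothetical.

```python
import math


def histogram_overlap(samples_a, samples_b, bin_width=0.01):
    """Overlap coefficient of two sampled distributions:
    sum over bins of min(p_a, p_b), between 0 and 1."""
    def normalized(samples):
        counts = {}
        for s in samples:
            b = math.floor(s / bin_width)
            counts[b] = counts.get(b, 0) + 1
        return {b: c / len(samples) for b, c in counts.items()}

    p_a, p_b = normalized(samples_a), normalized(samples_b)
    return sum(min(p, p_b.get(b, 0.0)) for b, p in p_a.items())


# Hypothetical COM distances (nm) sampled in two adjacent windows.
window_a = [1.005] * 2 + [1.015] * 5 + [1.025] * 3
window_b = [1.025] * 2 + [1.035] * 5 + [1.045] * 3
overlap = histogram_overlap(window_a, window_b)
print(f"overlap = {overlap:.0%}, sufficient = {overlap >= 0.05}")
```

A pair of windows failing such a check would typically call for an additional intermediate window or longer sampling before the PMF is reconstructed.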

In Figure , the red and black lines represent the lig233 and lapatinib systems, respectively. The main curve of lig233 indicates a binding free energy of −46.6 kcal/mol, which is more than twice as large in magnitude as that of lapatinib, −21.0 kcal/mol. Additionally, the bootstrapping statistical method is used to assess the reliability of the data acquired. One hundred subcurves are produced for each system, giving values ranging from −17.6 to −24.7 kcal/mol for lapatinib and from −40.0 to −51.2 kcal/mol for lig233. The larger |ΔG| values and an error range that is clearly distinct from that of lapatinib indicate that lig233 is a promising candidate. After taking all bootstrapping samples into account, this analysis enables us to support 13 compounds rather than 19. In further experimental testing, this smaller set of compounds is expected to produce encouraging results. Details of the numerical data are shown in Table 5, while Figure illustrates the PMF profile of lig233 in comparison with that of lapatinib. Overall, lig233 (ranking first), followed by lig107, lig242, lig222, lig205, lig239, lig102, lig210, lig118, lig234, lig227, and lig237 and ending with lig228, emerge with high potential to be better anti-breast-cancer agents than the reference lapatinib.

(A) Displacement-dependent free energy profiles of lig233 (red line) and lapatinib (black line) calculated via the USMD method. One hundred subcurves extracted by the bootstrapping method are also shown to indicate the reliability of the method. (B, C) Histograms of more than 120 windows, each overlapping the adjacent one by at least 5%. (D) Free energy differences of 20 systems, including lapatinib, ranked from the largest to the smallest.

Table 5. Absolute Free Energy Differences of the 19 High-Priority Ligands Are Larger than that of Lapatinib.

No	compound	ΔG (kcal/mol)	minimum bootstrapping value (kcal/mol)	maximum bootstrapping value (kcal/mol)
1	lig101	–24.9	–20.0	–29.3
2	lig102	–34.5	–31.0	–38.1
3	lig107	–41.3	–38.5	–46.2
4	lig112	–26.8	–23.8	–30.2
5	lig118	–32.9	–29.8	–34.9
6	lig205	–35.5	–31.0	–41.4
7	lig207	–20.6	–18.9	–22.8
8	lig209	–23.7	–21.2	–25.7
9	lig210	–34.2	–28.4	–39.0
10	lig222	–39.0	–34.9	–42.2
11	lig227	–31.1	–28.4	–33.8
12	lig228	–30.1	–26.2	–36.1
13	lig231	–27.5	–24.4	–29.8
14	lig232	–23.4	–20.9	–25.4
15	lig233	–46.5	–40.5	–51.2
16	lig234	–31.5	–26.4	–36.9
17	lig237	–30.5	–28.4	–33.0
18	lig239	–35.4	–31.6	–39.0
19	lig242	–39.4	–37.0	–41.8
20	Lapatinib	–21.0	–18.0	–24.0

Full diagram plots of 20 compounds are provided in Tables S9 and S10 (SI file).

3.4. Key Residues Decide the Strong and Weak Ligand–HER2 Interactions

Relying on the results of the preceding MD simulations and PMF binding free energy calculations, the 64 compounds are classified into three subgroups: (A) 13 compounds whose predicted affinities are better than that of lapatinib (green columns, Figure ); (B) 6 compounds with “less-credible” strength (red columns, Figure ); and (C) the remaining 45 compounds showing weaker interactions with HER2 than lapatinib. To elucidate the critical residues involved in the ligand–protein association, we now consider all residues located in the binding region of the HER2 protein. These residues are found to contribute over 85% of the total contacts between the ligands and HER2. The main goal of this section is to identify the residues that contribute the most to, or decide, the strength of the binding interaction of the ligand–protein system. The Coulomb, van der Waals, and total interaction energies contributed per residue in the last 20 ns of the simulations are averaged over the strong (A) and weak (C) groups. The 10 residues with the highest averaged interaction energies are ranked in Table 6.

Table 6. Ranking the Best 10 Residues Having an Average Interaction Energy.

rank	residue	Coulomb, strong (kcal/mol)	Coulomb, weak (kcal/mol)	VDW, strong (kcal/mol)	VDW, weak (kcal/mol)	interaction energy, strong (kcal/mol)	interaction energy, weak (kcal/mol)	location
1	Asp863	–5.2	–4.6	–5.8	–4.8	–11.0	–9.4	activation loop
2	Asp880	–6.6	–2.3	–1.7	–1.3	–8.3	–3.6	activation loop
3	Lys753	–1.7	0.1	–6.5	–5.0	–8.2	–5.0	salt bridge
4	Leu796	–2.4	–0.4	–4.2	–3.6	–6.7	–4.1
5	Thr798	–1.0	–0.1	–3.0	–2.9	–4.0	–3.0	gatekeeper
6	Val734	–0.05	–0.02	–3.9	–3.1	–4.0	–3.1	glycine-rich loop
7	Asn850	–0.5	–0.4	–3.3	–2.7	–3.8	–3.2	catalytic loop
8	Thr862	–0.7	–0.84	–3.1	–2.6	–3.8	–3.4
9	Leu852	0.03	0.01	–3.6	–2.3	–3.6	–2.3
10	Leu785	–0.02	–0.05	–2.6	–2.9	–2.6	–2.9

Data are collected from the last 20 ns of MD simulations and separated into two subgroups: (A) 13 stronger binders and (C) 45 weaker compounds compared to lapatinib.

Focusing on the (C)-weak and (A)-strong groups, Asp863 is found to contribute the largest interaction energies in both subgroups: −9.5 and −11.0 kcal/mol, respectively. Figure displays the numerical data on the per-residue contributions to the ligand–HER2 interaction in both the (C)-weak and (A)-strong groups. There is a large difference between the two groups in the contributions of Asp880, Lys753, and Leu796. Among them, Asp880 gives −3.6 and −8.3 kcal/mol in the weak and strong groups, respectively, so that the weak-group value amounts to only 43% of the strong-group value. Asp863 and Asp880 are amino acids located in the activation loop of HER2, which is responsible for the flexibility of the protein kinase. Reasonably, aspartic acid carries a negative charge that can easily give rise to large electrostatic energies with polar groups of the ligand, −5.2 and −6.6 kcal/mol in the strong group for Asp863 and Asp880, respectively. Due to the flexibility of the activation loop, both Asp863 and Asp880 have a great influence on the active conformation of the protein as well as on its ability to bind a ligand.

Comparative panel of total interaction energies between the weak group (blue bars) and the strong group (red bars). Data are collected from the last 20 ns of the 100 ns MD simulations and averaged over 45 complexes and 13 complexes for the weak and strong groups, respectively. The differences between weak and strong ligands are clearly caused by four key amino acids: Asp863, Asp880, Lys753, and Leu796.

Additionally, the interaction energies of Lys753 and Leu796 also differ considerably between the two groups. From the weak to the strong group, the average interaction energy of Lys753 with the ligands changes from −5.1 to −8.3 kcal/mol, while that of Leu796 changes from −4.1 to −6.7 kcal/mol. The positive charge of Lys753 allows it not only to create a salt bridge with Glu770 but also to easily connect to a negatively charged group of the ligand. Owing to the high conservation of the Lys753 residue in the protein kinase family and its important role in interacting with the phosphate group of ATP, a good interaction of Lys753 with the ligand helps increase the effectiveness of inhibiting the signaling process of the HER2 protein. Although the van der Waals potential energy still plays a significant role in the ligand–protein association, the classification of ligands as strong or weak binders seems to depend on the Coulomb potential energies that the ligands can form with five key amino acids. The four residues Asp863, Asp880, Lys753, and Leu796 appear, in cooperation with the Thr798 gatekeeper, to construct a polar interacting space in the HER2 binding site. A ligand deeply occupying this region can become a strong inhibitor.
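The role of the Coulomb term can be made explicit with a short worked example based on the per-residue Coulomb energies of Table 6. The sketch computes the strong-minus-weak difference for each of the 10 ranked residues and sorts them; the variable names are hypothetical, and the numbers are copied from the table.

```python
# Per-residue Coulomb energies (kcal/mol) from Table 6: (strong, weak).
coulomb = {
    "Asp863": (-5.2, -4.6),
    "Asp880": (-6.6, -2.3),
    "Lys753": (-1.7, 0.1),
    "Leu796": (-2.4, -0.4),
    "Thr798": (-1.0, -0.1),
    "Val734": (-0.05, -0.02),
    "Asn850": (-0.5, -0.4),
    "Thr862": (-0.7, -0.84),
    "Leu852": (0.03, 0.01),
    "Leu785": (-0.02, -0.05),
}

# Negative gap: the residue attracts strong binders more than weak ones.
gap = {res: round(strong - weak, 2)
       for res, (strong, weak) in coulomb.items()}
ranked = sorted(gap, key=gap.get)  # most negative gap first
top_five = ranked[:5]

for res in ranked:
    print(f"{res}: {gap[res]:+.2f} kcal/mol")
```

Under this simple ranking, the five residues with the largest electrostatic gain from the weak to the strong group are Asp880, Leu796, Lys753, Thr798, and Asp863, which are the five key amino acids discussed in the text.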

3.5. Lig233 Interactive Profile in the Active Site of the HER2 Tyrosine Kinase Domain

The relationship between the ligand and the key residues, especially Asp863, Asp880, Lys753, and Leu796, allows strong and weak binding affinities of compounds to be distinguished from each other. This understanding can be deepened in the case of lig233, which features the strongest interactions. The time-dependent minimum distances between lig233 and the active residue Asp880 are plotted in Figure . The minimum distance between lig233 and the key residues is almost always shorter than 2.2 Å. In Figure C, the representative conformation of lig233 reveals key interactions with HER2 active site residues, including four hydrogen bonds (with Asp863 and Asp880) and multiple hydrophobic contacts involving Leu726, Phe731, Ala751, Lys753, Ile767, Met774, Leu785, Leu796, Leu800, Leu852, and Leu866. These facts explain how large the total interaction energy between lig233 and the HER2 domain can be. In addition, Figure illustrates that the time-dependent minimum distance between lig233 and residue Asp880 is frequently smaller than 1.8 Å. The data collected from the last 20 ns of the MD simulations show that in 11 of 200 frames (5.5%), lig233 has all distances to Lys753, Leu796, Thr798, Asp863, and Asp880 shorter than 2.2 Å. The corresponding fractions increase to 24.5 and 51% of frames when four and three strong chemical bonds appear at the same time, respectively. In the case of lapatinib, the related values are estimated to be 0, 1, and 15%, respectively. All the evidence consistently reinforces the view that lig233 emerges as a highly promising HER2 inhibitor.

Comparison of the time-dependent minimum distances between the residue Asp880 and lig233 (in black), lapatinib (in red), and a weak-group member, lig250 (in green).
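The frame statistics quoted above can be reproduced in principle by counting, for every frame, how many of the five key residues lie within the distance cutoff. The sketch below assumes that "four" and "three" mean at least four and at least three residues within 2.2 Å, which the text does not state explicitly. The function name and the four example frames are hypothetical.

```python
def fraction_with_close_contacts(frames, cutoff=2.2, min_count=5):
    """Fraction of frames in which at least min_count of the minimum
    ligand-residue distances (in angstrom) are below the cutoff."""
    hits = sum(
        1 for distances in frames
        if sum(1 for d in distances if d < cutoff) >= min_count
    )
    return hits / len(frames)


# Hypothetical minimum distances (angstrom) from the ligand to
# Lys753, Leu796, Thr798, Asp863, and Asp880 in four frames.
frames = [
    [1.8, 2.0, 2.1, 1.9, 1.7],
    [2.5, 2.0, 2.1, 1.9, 1.7],
    [2.5, 2.6, 2.1, 1.9, 1.7],
    [3.0, 3.1, 2.9, 2.8, 1.7],
]
for k in (5, 4, 3):
    frac = fraction_with_close_contacts(frames, min_count=k)
    print(f"at least {k} residues within 2.2 A: {frac:.0%} of frames")
```

For comparison, the 11 of 200 frames reported for lig233 correspond to a fraction of 0.055, that is, 5.5%.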

4. Concluding Remarks

In the present theoretical study, we establish a hybrid pipeline that demonstrates significant advantages over more traditional approaches, with the aim not only to identify novel HER2 inhibitors but also to capture the thermodynamic properties of HER2 complexes. This work is completed using advanced computational methods including ML algorithms, all-atom MD simulations, and the US binding free energy method.

Among the ML and deep learning algorithms evaluated, the XGBoost Regressor demonstrates the highest performance, achieving an R² score of 0.88 and an RMSE of 0.44 on the unseen test data set. The XGBoost Regressor also shows minimal overfitting compared to the other models. Leveraging the high performance of the XGBoost Regressor, the virtual screening process identifies 33,481 compounds (0.4%) out of 8 million that are predicted to have a better IC₅₀ index than lapatinib. Clustering insights reveal that the highly potent clusters (1, 2, 5, 6) contain aza-aromatic rings and suggest that an appropriate HER2 kinase domain inhibitor requires lipophilicity to bind to the active site while maintaining high solubility. The top 64 compounds are selected as the hit data set.

From the RMSD profiles, our study also shows that MD simulations are necessary to overcome the inherent limitations of docking methods, allowing us to explore the ligand fluctuations in the HER2 active site and to identify the optimal binding modes. These simulations reveal significant differences with respect to the docking results. Moreover, MD simulations show a general trend toward enhanced engagement, with 44 compounds exhibiting an increasing number of contacts over time, including lapatinib. Nineteen ligands form a larger number of contacts than lapatinib, and 38 ligands form a greater number of hydrogen bonds. For all 64 complexes selected, the nonpolar (van der Waals) interactions are dominant over the polar (Coulomb) interactions. This points to the hydrophobic nature of the HER2 inhibitors.

The binding affinity is quantified using the USMD method for the first time for the HER2 tyrosine kinase domain. It indicates that lig233 has a binding free energy of −47 kcal/mol, which is more than double that of lapatinib (−21 kcal/mol). The reported values correspond to the depth of the PMF profile along the reaction coordinate. In particular, we do not apply the corresponding (1 M) standard-state correction. Our intention is to use the PMF profiles for a relative comparison and the ranking of ligands under an identical simulation setup (same box size, pulling protocol, restraints, and analysis). These high values are strictly computational predictions and may not represent the real experimental binding energies. Statistical bootstrapping further supports lig233 as an effective candidate and leads to the identification of 13 top-hit compounds, namely lig233, lig107, lig242, lig222, lig205, lig239, lig102, lig210, lig118, lig234, lig227, lig237, and lig228, which can be regarded as promising anticancer inhibitors in comparison with lapatinib.

The study also successfully identifies crucial residues that contribute to strong ligand–HER2 interactions. Asp863, Asp880, Lys753, and Leu796 are highlighted for their significant differences in interaction energies between strong and weak binding groups. More importantly, while van der Waals interaction plays a major role in stabilizing the ligand–protein association, electrostatic interactions with Asp863, Asp880, Lys753, Leu796, and Thr798 constitute a determining factor for a strong or weak inhibitor.

Starting from mining a huge data set of 8 million compounds, our study achieves the discovery of a specific region that is constructed by five key amino acids located in the protein active site.

Supplementary Material

ao5c10459_si_001.pdf (12.1 MB, PDF)

Acknowledgments

This work was supported by resources provided by the Pawsey Supercomputing Research Centre’s Setonix Supercomputer (10.48569/18sb-8s43), with funding from the Australian Government and the Government of Western Australia. We thank the Polish high-performance computing infrastructure PLGrid for providing computing facilities within Grant No. PLG/2024.017274. D.T.T. thanks Van Lang University for support. T.M.D.D. is grateful to Hanoi University of Pharmacy. M.T.N. is indebted to VinUniversity for a distinguished professorship.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.5c10459.

Additional cheminformatics, ML, docking, and MD results: Tanimoto similarity distribution of the data set (Figure S1); clustering performance metrics and visualization using different dimensionality-reduction approaches (Table S1 and Figure S2); model performance plots (actual vs predicted and residual analyses) for LR, Random Forest, XGBoost, and multilayer perceptron models (Figures S3 and S4); distribution of IC₅₀ and pIC₅₀ values (Figure S5); SMILES, docking scores, and predicted pIC₅₀ for 64 ML-screened top hits (Table S2); docking poses of the 64 hits and lapatinib (Table S3); RMSD plots and summarized RMSD statistics from 100 ns MD simulations (Tables S4 and S5); contact and hydrogen-bond statistics (Table S6); interaction energy components (van der Waals and Coulomb) and total interaction energies (Table S7); Gibbs free energy landscapes (Table S8); US diagrams for selected complexes (Table S9); and bootstrap results for selected compounds (Table S10) (PDF)

The authors declare no competing financial interest.
