Author manuscript; available in PMC: 2021 Feb 1.
Published in final edited form as: J Comput Aided Mol Des. 2019 Nov 16;34(2):131–147. doi: 10.1007/s10822-019-00237-5

MathDL: Mathematical deep learning for D3R Grand Challenge 4

Duc Duy Nguyen 1, Kaifu Gao 1, Menglun Wang 1, Guo-Wei Wei 1,2,3,*
PMCID: PMC7376411  NIHMSID: NIHMS1543387  PMID: 31734815

Abstract

We present the performance of our mathematical deep learning (MathDL) models in D3R Grand Challenge 4 (GC4). This challenge involves pose prediction, affinity ranking, and free energy estimation for beta secretase 1 (BACE), as well as affinity ranking and free energy estimation for Cathepsin S (CatS). We have developed advanced mathematical representations, based on differential geometry, algebraic graphs, and/or algebraic topology, to accurately and efficiently encode high-dimensional physical/chemical interactions into scalable low-dimensional rotationally and translationally invariant representations. These representations are integrated with deep learning models, such as generative adversarial networks (GAN) and convolutional neural networks (CNN), for pose prediction and energy evaluation, respectively. Overall, our MathDL models achieved the top place in pose prediction for BACE ligands in Stage 1a. Moreover, our submissions obtained the highest Spearman correlation coefficient on the affinity ranking of 460 CatS compounds, and the smallest centered root mean square error on the free energy set of 39 CatS molecules. It is worth mentioning that our docking pose predictions have improved significantly over our previous ones.

1. Introduction

The Drug Design Data Resource (D3R) offers blind community-wide challenges for ligand pose and binding affinity ranking predictions.1–3 Benchmarks in D3R contests contain high-quality structures and reliable binding energies supplied by experimental groups before publication. These challenges provide the computer-aided drug design (CADD) community a great opportunity to validate, calibrate, and develop drug virtual screening (VS) models. The latest D3R Grand Challenge 4 (GC4) took place from September 4th, 2018 to December 4th, 2018. GC4 presented two different protein targets, Cathepsin S (CatS) and beta secretase 1 (BACE), which were generously supplied by Janssen Pharmaceuticals and Novartis, respectively. There were two stages in GC4. The first one had two subchallenges, namely Stage 1a and Stage 1b. In Stage 1a, participants were asked to predict the pose, rank the affinity, and estimate the free energy of BACE ligands. Following Stage 1a, Stage 1b revealed the receptor structures, and participants were asked again to predict the crystallographic poses of 20 BACE ligands. There was no affinity calculation in Stage 1b. The second part of GC4, Stage 2, contained the affinity ranking and free energy challenges for both BACE and CatS compounds. In this last stage, participants were able to take advantage of the experimental structures of BACE complexes released right after Stage 1b.

A successful VS model requires reliable ligand conformation generation and a highly accurate scoring function to predict binding affinities. There are several state-of-the-art software packages that take care of the first component of VS, for example, Autodock Vina,4 GOLD,5 GLIDE,6 ICM,7 etc. Unfortunately, one may fail dramatically to achieve decent poses if these software programs are used blindly. The pose prediction results in Grand Challenge 3 (GC3) clearly demonstrated this issue.3 The second component of VS relates to the development of scoring functions (SFs) for binding affinity predictions. Basically, one can classify SF methods into four different types, namely force-field-based SFs, knowledge-based SFs, empirical-based SFs, and machine learning-based SFs.8 The force-field-based SFs commonly emphasize van der Waals (vdW) interactions, electrostatic energy, hydrogen bonding descriptions, solvation effects, and so on. Well-known SFs in this category are COMBINE9 and MedusaScore,10 to name only a few. Typical examples of knowledge-based SFs are,11 DrugScore,12 KECSA,13 and IT-Score,14 which utilize protein-ligand pairwise statistical potentials in an additive manner to predict binding affinities. One can regard the empirical-based SFs as simple machine learning-based SFs, since these SFs employ linear regression schemes to construct predictive models using various physical features, for instance vdW interactions, Lennard-Jones potentials, hydrogen bonds, electrostatics, solvation, and torsion information. PLP,15 ChemScore,16 and X-Score17 are well-known representatives of this category. The last type of binding affinity SFs comprises machine learning-based approaches, which have recently arisen as the most advanced technique in CADD. One of the pioneering works in this SF category is RF-Score,18 which is based on the Random Forest (RF) algorithm19 and uses the numbers of atom pairwise contacts as features. Thanks to the nonlinear representations of sophisticated machine learning frameworks, machine learning-based SFs can characterize the non-additive contributions from functional group interactions in binding affinity calculations.20–26

The availability of massive biological datasets, along with access to high-performance computing clusters (HPCC), has made machine learning-based models an emerging technology in biomolecular data analysis and prediction. However, the accuracy of machine learning-based SFs depends strongly on whether their features are able to capture the physical and chemical information in protein-ligand interactions. Moreover, the direct use of three-dimensional (3D) biomolecular structures in a deep learning network is immensely expensive. This hindrance is mainly caused by the large number of degrees of freedom in 3D macromolecular representations and by the number of atoms varying among different structures. Therefore, there is a pressing need to develop innovative representations of protein-ligand complexes for machine learning methods.

Mathematical deep learning (MathDL) encompasses a family of scalable low-dimensional rotationally and translationally invariant mathematical representations integrated with advanced machine learning, including deep learning algorithms.27 Its hypothesis is that the intrinsic physics of macromolecular interactions lies in low-dimensional manifolds. Based on this hypothesis, we have developed a number of mathematical tools originating from geometry, topology, graph theory, combinatorics, and analysis to simplify macromolecular complexity and reduce its dimensionality. For example, differential geometry provides a high-level abstraction of macromolecular complexes.28 In molecular biophysics, differential geometry-based frameworks have shown their efficiency in modeling solvation free energies29, 30 and ion channel transport.31–35 However, in those applications, differential geometry information is largely restricted to the separation of solvent and solute domains in facilitating the Poisson-Boltzmann model or the Poisson-Nernst-Planck model. In geometric modeling, differential geometry has been utilized for the qualitative analysis of biomolecular properties.36, 37 Also, potential protein-ligand binding sites can be recognized via concave and convex regions of molecular surfaces indicated by minimum and/or maximum curvatures.37, 38 Most recently, the roles of different kinds of curvature in solvation free energy models have been investigated.39 However, the efficiency of the aforementioned differential geometry models is limited because they neglect atomic-level information. Element interactive manifolds (EIMs) were proposed to address this problem in differential geometry-based geometric learning (DG-GL).25 These EIMs successfully encode the pivotal physical, chemical, and biological information stored in high-dimensional data into low-dimensional manifolds, rendering a powerful approach for predicting solvation free energy, drug toxicity, and protein-ligand binding affinity.25

Another low-dimensional mathematical approach is the topological representation of biomolecular structures. In topological data analysis, one can capture the connectivity of macromolecules or molecular components. Topological invariants, such as independent components, rings, cavities, and higher-dimensional faces, expressed in terms of Betti numbers, help to characterize the conformational change upon protein-ligand binding, the folding and unfolding of proteins, and the opening or closing of ion channels.40 The traditional topological descriptors, unfortunately, cannot discriminate the geometric differences among various macromolecular structures. Persistent homology (PH), a new branch of algebraic topology, utilizes a filtration parameter to generate a family of topological spaces and associated invariants, which contain richer geometric information.41,42 PH has been applied to computational biology.43–45 However, these applications were mostly limited to qualitative analysis. Recently, we have devised PH for the quantitative analysis of protein folding energy, protein flexibility,46 ill-posed inverse problems of cryo-EM structures,47 predictive models of curvature energies of fullerene isomers,48, 49 and protein pocket detection.50 In 2015, we introduced one of the first combinations of PH descriptors and machine learning algorithms.51 Since then, the integration of PH and machine learning has become a very popular approach in topological data analysis. Nonetheless, this approach alone is not sufficient for biomolecular systems, because PH neglects chemical and biological information in its topological simplification of geometric complexity. Element-specific PH was introduced to retain chemical and biological information.22 The integration of element-specific PH and machine learning algorithms has found great success in the predictions of protein folding free energy changes upon mutation,52 binding affinity,22–24 drug toxicity,53 partition coefficient, and aqueous solubility.54 It has also been employed for the classification of active ligands and decoys.24 All of these new topological models outperformed other state-of-the-art methods on various common benchmarks.

Similarly to topology, graph theory also accentuates the connectivity between vertices to define graph edges. There are two major types of graphs: geometric graphs and algebraic graphs. Geometric graphs concern the pairwise connectivity between graph nodes and represent it in terms of “topological index”,55,56 graph centrality,57–59 and contact map.60,61 Algebraic graph theory expresses the connectivity via eigenvalues, particularly the second-smallest eigenvalue of the Laplacian matrix, known as the Fiedler value, which is often used to analyze the stability of dynamical systems.62 Graph theory has been widely used in many interdisciplinary studies. In biophysics, it is employed to model protein flexibility and long-time dynamics in normal mode analysis (NMA)63–66 and the elastic network model (ENM).60, 67–72 Since graph theory offers a natural representation of molecular structure, it is a common approach for analyzing chemical datasets56, 73–77 and biomolecular datasets.60, 78–83 Although much effort has gone into constructing various graph representations in the past, graph-based quantitative models are often less accurate than other competitive models in the analysis and prediction of biomolecular properties from massive and diverse datasets. Indeed, in the analysis of protein stability changes upon mutation, other models23, 52, 84 are more accurate than the graph-based approach.85 In addition, the graph theory-based Gaussian network model (GNM) is not competitive in protein B-factor predictions.86 One of the main reasons is that there is no systematic representation of interactions among different chemical element types in a molecular structure. Additionally, many graph approaches do not describe non-covalent interactions. To overcome these limitations, we have proposed novel multiscale weighted colored subgraphs in both geometric graph and algebraic graph schemes to achieve state-of-the-art performance in the predictions of protein B-factor,87 protein-ligand binding affinity,21, 26 docking,26 and virtual screening.26

Our MathDL models using graph theory and algebraic topology have been employed in the D3R Grand Challenges since GC2 and have obtained many encouraging results. Specifically, our prediction of the free energy set in Stage 2 was ranked the best in GC2, our first participation in D3R competitions.27 In our second participation, i.e., GC3, our submissions achieved the top places in 10 out of 26 official contests.27 These achievements have confirmed the predictive power and efficiency of our MathDL models in drug design and discovery. However, our previous approaches still had some shortcomings, mostly concerning pose generation performance and the ability to rank the affinities of compounds with diverse chemical structures.

In the current D3R challenge, i.e. GC4, we have brought in two new technological aspects in our approach. First, we have further developed powerful differential geometry and algebraic graph-based MathDL models to assist our algebraic topology based methods. Additionally, we have extended our MathDL approach with more advanced deep learning architectures like generative adversarial networks (GAN).88 We have achieved very promising results with top places in pose prediction, affinity ranking and free energy estimation. The rest of this paper is devoted to more detailed discussions of our methodologies and their performances in D3R GC4.

2. Methods

We describe the mathematical methods underpinning our MathDL models in this section.

2.1. Differential geometry representation

2.1.1. Multiscale discrete-to-continuum mapping

Consider a molecule having $N$ atoms. Denote by $\mathbf{r}_j$ and $q_j$, $j = 1, \ldots, N$, the atomic coordinate and the partial charge of the $j$th atom, respectively. A discrete-to-continuum mapping89–91 represents the unnormalized molecular density at an arbitrary point $\mathbf{r} \in \mathbb{R}^3$ as follows

$$\rho(\mathbf{r}, \{\eta_k\}, \{w_k\}) = \sum_{j=1}^{N} w_j \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j), \tag{1}$$

where $\|\mathbf{r} - \mathbf{r}_j\|$ is the Euclidean distance between the point $\mathbf{r}$ and the $j$th atom in a given molecule. If all $w_j$ are set to 1, $\rho(\mathbf{r}, \{\eta_k\}, \{w_k\})$ is a molecular density, whereas it serves as a molecular charge density when $w_j = q_j$ for all $j$. In the present work, we utilize Autodock Tools (http://autodock.scripps.edu/resources/adt/index_html) to assign the Gasteiger charges for small molecules and macromolecules. Additionally, $\eta_j$ are characteristic distances and $\Phi$ is a monotonically decreasing kernel featuring the similarity between two 3D data points. To ensure the existence of geometric representations such as curvatures, $\Phi$ is chosen to be a monotonically decreasing $C^2$ function satisfying the following conditions

$$\Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j) = 1, \quad \text{as } \|\mathbf{r} - \mathbf{r}_j\| \to 0, \tag{2}$$
$$\Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j) = 0, \quad \text{as } \|\mathbf{r} - \mathbf{r}_j\| \to \infty. \tag{3}$$

It is noted that radial basis functions meet admissibility conditions (2) and (3). Commonly used correlation kernels are generalized exponential functions

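$$\Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j) = e^{-(\|\mathbf{r} - \mathbf{r}_j\|/\eta_j)^{\kappa}}, \quad \kappa > 0; \tag{4}$$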

and generalized Lorentz functions

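$$\Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_j) = \frac{1}{1 + (\|\mathbf{r} - \mathbf{r}_j\|/\eta_j)^{\nu}}, \quad \nu > 0. \tag{5}$$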
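As a quick illustration of the two kernels in Eqs. (4) and (5), the following minimal Python sketch evaluates them as functions of distance. The function names and the sample parameter values are assumptions made for this example only; they are not taken from the authors' code.

```python
import math


def exponential_kernel(distance, eta, kappa):
    """Generalized exponential kernel, Eq. (4): exp(-(d / eta)^kappa)."""
    if eta <= 0 or kappa <= 0:
        raise ValueError("eta and kappa must be positive")
    return math.exp(-((distance / eta) ** kappa))


def lorentz_kernel(distance, eta, nu):
    """Generalized Lorentz kernel, Eq. (5): 1 / (1 + (d / eta)^nu)."""
    if eta <= 0 or nu <= 0:
        raise ValueError("eta and nu must be positive")
    return 1.0 / (1.0 + (distance / eta) ** nu)


# Illustrative parameters (assumed): eta = 3 angstroms, kappa = 2, nu = 4.
for d in (0.0, 1.0, 3.0, 10.0):
    print(d, exponential_kernel(d, 3.0, 2.0), lorentz_kernel(d, 3.0, 4.0))
```

Both functions return 1 at zero distance and decay monotonically toward 0, which is the behavior required by conditions (2) and (3).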

Moreover, one can use correlation kernels to model the electrostatic interaction between two charged particles as follows

$$\Phi(\|\mathbf{r}_i - \mathbf{r}_j\|, q_i, q_j; c) = \frac{1}{1 + e^{-c q_i q_j/\|\mathbf{r}_i - \mathbf{r}_j\|}}, \tag{6}$$

where $q_i$ and $q_j$ are the partial charges of two atoms, and $c$ is a nonzero tunable parameter. Note that the $\Phi$ described in Eq. (6) does not satisfy the admissibility conditions (2) and (3). It is, therefore, only utilized to generate electrostatic persistent homology. All the other kernels $\Phi$ discussed in the current work were determined by one of Eqs. (4)–(5). Here, $\Phi$ takes 3D coordinates and kernel parameters as the input variables and maps them to a real number, $\Phi: \mathbb{R}^3 \to \mathbb{R}$. Because $\Phi$ depends on atom coordinates or grid point positions only through distances, its values are rotationally and translationally invariant.

It is expected that the $C^2$ delta sequences of the positive type discussed in an earlier work92 can function well as correlation kernels. To obtain a multiscale discrete-to-continuum mapping, one can employ more than one set of scale parameters. In the current work, the aforementioned mapping was applied to protein-ligand complexes.

2.1.2. Element interactive densities

In order for differential geometry (DG) representations to effectively capture the crucial physical and biological information of large and diverse biomolecular datasets, we must employ DG to describe non-covalent intramolecular interactions in a molecule and intermolecular interactions in molecular complexes, such as protein-protein and protein-ligand complexes.

Additionally, the accuracy of the DG representations can be improved by element-level descriptions, which result in scalable low-dimensional manifold representations of high-dimensional structures. For instance, to describe the pairwise interactions between protein and ligand, we consider the frequently occurring element types in proteins and ligands. In particular, the commonly occurring element types in proteins are C, N, O, S, and the commonly occurring element types in ligands are H, C, N, O, S, P, F, Cl, Br, I. That gives rise to 4 × 10 = 40 element pairwise groups. We do not include hydrogen in the protein element types since H is absent from most structures in the Protein Data Bank (PDB). Note that during our validation process, the pairwise interactions between different atom types did not enhance the overall performance of our models (this may be due to the limited data size). Thus, we only carried out the element-specific interactions for the sake of simplicity.

Based on a statistical analysis, the frequently occurring element types in the biomolecular dataset are denoted as $\mathcal{C} = \{\mathrm{H, C, N, O, S, P, F, Cl}, \cdots\}$. For convenience, $\mathcal{C}_k$ represents the $k$th element in the set $\mathcal{C}$. For example, $\mathcal{C}_5 = \mathrm{S}$. The $i$th atom in a given molecule is associated with its coordinate $\mathbf{r}_i$, element type $\alpha_i$, and partial charge $q_i$. The non-covalent interactions between atoms of element types $\mathcal{C}_k$ and $\mathcal{C}_{k'}$ are assumed to be described by the correlation kernel $\Phi$

$$\{\Phi(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{kk'}) \mid \alpha_i = \mathcal{C}_k, \alpha_j = \mathcal{C}_{k'};\ i, j = 1, 2, \ldots, N;\ \|\mathbf{r}_i - \mathbf{r}_j\| > r_i + r_j + \sigma\}, \tag{7}$$

where $r_i$ and $r_j$ are the atomic radii of the $i$th and $j$th atoms, respectively, and $\sigma$ is the mean value of the standard deviations of $r_i$ and $r_j$ in the dataset of interest. The covalent interactions are excluded by the constraint $\|\mathbf{r}_i - \mathbf{r}_j\| > r_i + r_j + \sigma$. In addition, $\eta_{kk'}$ is a characteristic distance between the atoms, which depends only on their element types.

To construct the element interactive densities, we define the atomic-radius-parametrized van der Waals domain of all atoms of the $k$th element type as25

$$D_k := \bigcup_{\mathbf{r}_i,\ \alpha_i = \mathcal{C}_k} B(\mathbf{r}_i, r_k), \tag{8}$$

in which $B(\mathbf{r}_i, r_i)$ is a ball with center $\mathbf{r}_i$ and radius $r_i$, and $r_k$ is the atomic radius of the $k$th element type. Thus, $D_k$ depends on the atom coordinates $\mathbf{r}_i$ and the atomic radius. Note that $D_k$ does not define any vdW interactions; it is only a domain on which to construct the surface density. The element interactive density between the domain $D_k$ and all atoms of the $k'$th ($k \neq k'$) element type is given by

$$\rho_{kk'}(\mathbf{r}, \eta_{kk'}) = \sum_{\substack{j:\ \alpha_j = \mathcal{C}_{k'} \\ \|\mathbf{r}_i - \mathbf{r}_j\| > r_i + r_j + \sigma,\ \forall \alpha_i \in \mathcal{C}_k}} w_j \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_{kk'}), \quad \mathbf{r} \in D_k. \tag{9}$$

When $k' = k$, the element interactive density $\rho_{kk}$ is induced only by the van der Waals domain $D_k$. In this case, we exclude the covalent interactions based on the position of the density input. Assuming $\mathbf{r} \in D_k^i$, with $D_k^i = B(\mathbf{r}_i, r_i)$, $\alpha_i = \mathcal{C}_k$, the element interactive density is then formulated as

$$\rho_{kk}(\mathbf{r}, \eta_{kk}) = \sum_{\substack{j:\ \alpha_j = \mathcal{C}_k \\ \|\mathbf{r}_i - \mathbf{r}_j\| > 2 r_j + \sigma}} w_j \Phi(\|\mathbf{r} - \mathbf{r}_j\|; \eta_{kk}). \tag{10}$$

For the sake of simplicity, we chose $w_j = 1$ for all cases. Since the element interactive density is obtained by the addition of correlation kernels, it belongs to $C^2$ on the closed domain of $D_k$. We construct element interactive manifolds by restricting the set of points to a given level set of the density, as shown in Fig. 1.
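To make the sums in Eqs. (9) and (10) concrete, here is a minimal Python sketch that evaluates the element interactive density at the center of one atom, with $w_j = 1$ and a generalized Lorentz kernel. The radii table, the default $\sigma$, the function names, and the toy coordinates are all assumptions for illustration and do not reproduce the parameters used in this work.

```python
import math

# Illustrative van der Waals radii in angstroms (assumed values).
VDW_RADIUS = {"C": 1.7, "N": 1.55, "O": 1.52}


def lorentz_kernel(distance, eta, nu=4.0):
    """Generalized Lorentz kernel of Eq. (5)."""
    return 1.0 / (1.0 + (distance / eta) ** nu)


def interactive_density_at_atom(i, atoms, elem_k2, eta, sigma=0.0):
    """Density at the center of atom i induced by atoms of element elem_k2.

    atoms is a list of (element, (x, y, z)) pairs. The sum follows Eq. (9)
    with w_j = 1, evaluated at r = r_i. Pairs no farther apart than
    r_i + r_j + sigma are treated as covalent and skipped.
    """
    elem_i, pos_i = atoms[i]
    total = 0.0
    for j, (elem_j, pos_j) in enumerate(atoms):
        if j == i or elem_j != elem_k2:
            continue
        d = math.dist(pos_i, pos_j)
        if d > VDW_RADIUS[elem_i] + VDW_RADIUS[elem_j] + sigma:
            total += lorentz_kernel(d, eta)
    return total


# Toy example: one carbon, a bonded oxygen at 1.2 A, a distant oxygen at 5 A.
atoms = [("C", (0.0, 0.0, 0.0)), ("O", (1.2, 0.0, 0.0)), ("O", (5.0, 0.0, 0.0))]
rho_CO = interactive_density_at_atom(0, atoms, "O", eta=5.0)
print(rho_CO)
```

Only the distant oxygen contributes in this toy case, because the bonded one fails the distance constraint. When both atoms share the same element, the same threshold reduces to the $2r_j + \sigma$ cutoff of Eq. (10).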

Figure 1:

Illustration of some element-specific selections and the corresponding element interactive manifolds obtained at a given level set of the element interactive density. Each sphere illustrates an atomic position. Cyan, red, and blue colors represent carbon, oxygen, and nitrogen, respectively. The transparent surfaces are the isosurfaces extracted from the volume data represented in Eq. (8).

2.1.3. Element interactive curvatures

Given an element interactive density ρ(r), one can calculate the Gaussian curvature (K), the mean curvature (H), the minimum curvature (κmin), and the maximum curvature (κmax) for the resulting manifold as the following:37, 93

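$$K = \frac{1}{g^2}\Big[\, 2\rho_x\rho_y\rho_{xz}\rho_{yz} + 2\rho_x\rho_z\rho_{xy}\rho_{yz} + 2\rho_y\rho_z\rho_{xy}\rho_{xz} - 2\rho_x\rho_z\rho_{xz}\rho_{yy} - 2\rho_y\rho_z\rho_{xx}\rho_{yz} - 2\rho_x\rho_y\rho_{xy}\rho_{zz} + \rho_z^2\rho_{xx}\rho_{yy} + \rho_x^2\rho_{yy}\rho_{zz} + \rho_y^2\rho_{xx}\rho_{zz} - \rho_x^2\rho_{yz}^2 - \rho_y^2\rho_{xz}^2 - \rho_z^2\rho_{xy}^2 \Big], \tag{11}$$
$$H = \frac{1}{2g^{3/2}}\Big[\, 2\rho_x\rho_y\rho_{xy} + 2\rho_x\rho_z\rho_{xz} + 2\rho_y\rho_z\rho_{yz} - (\rho_y^2 + \rho_z^2)\rho_{xx} - (\rho_x^2 + \rho_z^2)\rho_{yy} - (\rho_x^2 + \rho_y^2)\rho_{zz} \Big], \tag{12}$$
$$\kappa_{\min} = H - \sqrt{H^2 - K}, \tag{13}$$
$$\kappa_{\max} = H + \sqrt{H^2 - K}, \tag{14}$$

where $g = \rho_x^2 + \rho_y^2 + \rho_z^2$.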
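The curvature formulas in Eqs. (11)–(14) can be checked numerically with a short Python sketch that takes the first and second partial derivatives of a density at one point. The function name and argument layout are assumptions for this example; the sanity check uses the implicit sphere $\rho = x^2 + y^2 + z^2$ rather than a molecular density.

```python
import math


def curvatures(grad, hess):
    """Return (K, H, kappa_min, kappa_max) following Eqs. (11)-(14).

    grad = (rx, ry, rz) holds the first derivatives of the density;
    hess = ((rxx, rxy, rxz), (rxy, ryy, ryz), (rxz, ryz, rzz)).
    """
    rx, ry, rz = grad
    (rxx, rxy, rxz), (_, ryy, ryz), (_, _, rzz) = hess
    g = rx * rx + ry * ry + rz * rz
    if g == 0.0:
        raise ValueError("gradient vanishes; curvature is undefined here")
    K = (2 * rx * ry * rxz * ryz + 2 * rx * rz * rxy * ryz
         + 2 * ry * rz * rxy * rxz - 2 * rx * rz * rxz * ryy
         - 2 * ry * rz * rxx * ryz - 2 * rx * ry * rxy * rzz
         + rz * rz * rxx * ryy + rx * rx * ryy * rzz + ry * ry * rxx * rzz
         - rx * rx * ryz * ryz - ry * ry * rxz * rxz
         - rz * rz * rxy * rxy) / (g * g)
    H = (2 * rx * ry * rxy + 2 * rx * rz * rxz + 2 * ry * rz * ryz
         - (ry * ry + rz * rz) * rxx - (rx * rx + rz * rz) * ryy
         - (rx * rx + ry * ry) * rzz) / (2.0 * g ** 1.5)
    # Clamp tiny negative values caused by floating-point rounding.
    root = math.sqrt(max(H * H - K, 0.0))
    return K, H, H - root, H + root


# Sanity check on rho = x^2 + y^2 + z^2 at (1, 2, 2), i.e. a sphere of radius 3.
grad = (2.0, 4.0, 4.0)
hess = ((2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0))
K, H, k_min, k_max = curvatures(grad, hess)
print(K, H, k_min, k_max)
```

For a sphere of radius $R$ the sketch gives $K = 1/R^2$ and $H = -1/R$ under the sign convention of Eq. (12), so both principal curvatures coincide.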

To construct unified curvature quantities for various biomolecular structures, we study the element interactive curvatures (EIC) at the atomic center and formulate them as25

$$K^{EI}_{kk'}(\eta_{kk'}) = \sum_i K_{kk'}(\mathbf{r}_i, \eta_{kk'}), \quad \mathbf{r}_i \in D_k;\ k \neq k' \tag{15}$$

and

$$K^{EI}_{kk}(\eta_{kk}) = \sum_i K_{kk}(\mathbf{r}_i, \eta_{kk}), \quad \mathbf{r}_i \in D_k^i,\ D_k^i \subset D_k. \tag{16}$$

Eqs. (15) and (16), which define the element interactive Gaussian curvature (EIGC), are applied to protein-ligand complexes in the current work. Thus, the atomic centers in Eqs. (15) and (16) can be either ligand atoms or protein atoms. In the same manner, one can define $H^{EI}_{kk'}(\eta_{kk'})$, $\kappa^{EI}_{kk',\min}(\eta_{kk'})$, and $\kappa^{EI}_{kk',\max}(\eta_{kk'})$ for the element interactive mean curvature, element interactive minimum curvature, and element interactive maximum curvature, respectively.

It is worth noting that the expressions of the curvatures defined in (11), (12), (13), and (14) are in analytical form. Thus, the EIC formulations avoid numerical differentiation error and preserve the reference geometric information of the molecules.

2.2. Multiscale weighted colored geometric subgraphs

For a given molecular dataset, we denote by $\mathcal{C}$ a set consisting of the most frequently appearing element types. For a molecule of interest, we define a graph with the following vertices

$$V = \{(\mathbf{r}_j, \alpha_j) \mid \mathbf{r}_j \in \mathbb{R}^3;\ \alpha_j \in \mathcal{C};\ j = 1, 2, \ldots, N\}, \tag{17}$$

where $N$ is the number of atoms, and $\mathbf{r}_j$ and $\alpha_j$ are, respectively, the coordinates and element type of the $j$th atom. As in the differential geometry representation section, we only consider non-covalent interactions represented by correlation kernels

$$E_{kk'} = \{\Phi(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{kk'}) \mid \alpha_i = \mathcal{C}_k, \alpha_j = \mathcal{C}_{k'};\ i, j = 1, 2, \ldots, N;\ \|\mathbf{r}_i - \mathbf{r}_j\| > r_i + r_j + \sigma\}, \tag{18}$$

where all the notations are adopted from Sec. 2.1. Here, $\Phi$ is the edge weight, which represents the potential interaction between the two nodes forming that edge. We now form weighted colored subgraphs $G(V, E_{kk'})$ to describe pairwise interactions in a given molecule. To unify the geometric graph-based descriptors for a diverse dataset, we construct the multiscale weighted colored subgraph rigidity between the $k$th element type $\mathcal{C}_k$ and the $k'$th element type $\mathcal{C}_{k'}$ via a graph centrality type of scheme

$$RI^G(\eta_{kk'}) = \sum_i \mu_i^G(\eta_{kk'}) = \sum_{\substack{i \\ \alpha_i = \mathcal{C}_k}} \ \sum_{\substack{j:\ \alpha_j = \mathcal{C}_{k'} \\ \|\mathbf{r}_i - \mathbf{r}_j\| > r_i + r_j + \sigma}} \Phi(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{kk'}). \tag{19}$$

The proposed subgraph rigidity index $RI^G(\eta_{kk'})$ in Eq. (19) is the aggregation of the collective subgraph centrality $\mu_i^G(\eta_{kk'})$, which was used in our previous B-factor prediction model.87 This formulation represents a coarse-grained description at the element level, capturing important physical and biological information in a molecule or biomolecule, such as van der Waals interactions, hydrogen bonds, electrostatics, etc. This description is scalable, i.e., independent of the size of an individual protein-ligand complex. In fact, when describing protein-ligand interactions, the labeled subgraph $G(V, E_{kk'})$ gives rise to a bipartite graph with its edges connecting protein atoms to ligand atoms. The positive and negative eigenvalues of the adjacency matrix of a bipartite graph are reflective, i.e., symmetric about zero, which enables us to select only the positive or only the negative eigenvalues in machine learning. Moreover, Eq. (19) generalizes our previous binding affinity prediction model21 and was utilized for the D3R Grand Challenge 3.27
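The rigidity index of Eq. (19) and the resulting fixed-length feature vector can be sketched in a few lines of Python. This is an illustrative sketch only: the radii, $\sigma$, kernel parameters, function names, and toy coordinates are assumptions, and the element lists are the ones quoted in Sec. 2.1.2.

```python
import math
from itertools import product

PROTEIN_ELEMENTS = ("C", "N", "O", "S")
LIGAND_ELEMENTS = ("H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I")


def exp_kernel(distance, eta, kappa=2.0):
    """Generalized exponential kernel of Eq. (4)."""
    return math.exp(-((distance / eta) ** kappa))


def rigidity_index(protein, ligand, elem_p, elem_l, eta, radii, sigma=0.0):
    """Sum of kernel weights over non-covalent (elem_p, elem_l) pairs, Eq. (19).

    protein and ligand are lists of (element, (x, y, z)) pairs.
    """
    total = 0.0
    for (ep, rp), (el, rl) in product(protein, ligand):
        if ep != elem_p or el != elem_l:
            continue
        d = math.dist(rp, rl)
        if d > radii[ep] + radii[el] + sigma:
            total += exp_kernel(d, eta)
    return total


def rigidity_features(protein, ligand, eta, radii, sigma=0.0):
    """One rigidity index per protein-ligand element pair (40 values)."""
    return [rigidity_index(protein, ligand, p, l, eta, radii, sigma)
            for p, l in product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS)]


# Toy complex with assumed radii (angstroms).
radii = {"C": 1.7, "N": 1.55, "O": 1.52}
protein = [("C", (0.0, 0.0, 0.0)), ("O", (0.0, 4.0, 0.0))]
ligand = [("N", (4.0, 0.0, 0.0)), ("C", (10.0, 0.0, 0.0))]
features = rigidity_features(protein, ligand, eta=4.0, radii=radii)
print(len(features), rigidity_index(protein, ligand, "C", "N", 4.0, radii))
```

The length of the feature vector depends only on the chosen element types, not on the number of atoms, which is what makes the description scalable across complexes of different sizes.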

2.3. Multiscale weighted colored algebraic subgraphs

Still based on multiscale weighted colored subgraphs as defined in Section 2.2, we have recently developed a novel algebraic graph approach or spectral graph formulation to describe molecules, biomolecules and their interactions at atomic levels.25 We here utilize the Laplacian matrix and adjacency matrix to represent the interactions between nodes in a given subgraph.

Based on a weighted colored subgraph $G(V, E_{kk'})$, we define the weighted colored Laplacian matrix $L_{ij}(\eta_{kk'})$ as follows

$$L_{ij}(\eta_{kk'}) = \begin{cases} -\Phi(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{kk'}) & \text{if } i \neq j,\ \alpha_i = \mathcal{C}_k,\ \alpha_j = \mathcal{C}_{k'} \text{ and } \|\mathbf{r}_i - \mathbf{r}_j\| > r_i + r_j + \sigma; \\ -\sum_j L_{ij} & \text{if } i = j. \end{cases} \tag{20}$$

Because the Laplacian matrix $L_{ij}(\eta_{kk'})$ is symmetric, diagonally dominant, and positive-semidefinite, all of its eigenvalues are nonnegative. Moreover, its smallest eigenvalue is zero. It is worth noting that the number of zero eigenvalues equals the zero-dimensional topological invariant, i.e., the number of connected components in the graph. If a graph is connected, exactly one eigenvalue is zero. The smallest non-zero eigenvalue is called the Fiedler value, which represents the algebraic connectivity. It is interesting to see that one can reconstruct the geometric graph rigidity via the following formulation

$$RI^G(\eta_{kk'}) = \mathrm{Tr}\, L(\eta_{kk'}).$$

In addition, we can form the adjacency matrix $A_{ij}$ for the aforementioned subgraph $G(V, E_{kk'})$ by

$$A_{ij}(\eta_{kk'}) = \begin{cases} \Phi(\|\mathbf{r}_i - \mathbf{r}_j\|; \eta_{kk'}) & \text{if } i \neq j,\ \alpha_i = \mathcal{C}_k,\ \alpha_j = \mathcal{C}_{k'} \text{ and } \|\mathbf{r}_i - \mathbf{r}_j\| > r_i + r_j + \sigma; \\ 0 & \text{if } i = j. \end{cases} \tag{21}$$

Clearly, adjacency matrix A(ηkk′) is a symmetric non-negative matrix. As a result, its spectrum is real. The Laplacian and adjacency matrices for subgraph including only oxygen and nitrogen atoms in molecule C5H6N2O2 are depicted in Fig. 2. Note that for different molecules, one can expect to have different graph structures. We only utilized one unique 3D representation for each ligand; thus there was only one single graph structure to represent one corresponding compound.
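The spectral properties stated above can be checked on a small example. The following Python sketch, which assumes NumPy is available, builds the Laplacian and adjacency matrices of a weighted colored subgraph and inspects their eigenvalues. The helper name, the radii, and the square N/O toy geometry are assumptions for illustration; edges are symmetrized here so that both matrices are symmetric.

```python
import numpy as np


def subgraph_matrices(coords, elements, radii, elem_k, elem_k2, eta,
                      sigma=0.0, nu=4.0):
    """Return (L, A) for the subgraph on atoms of element elem_k or elem_k2.

    Edge weights use the generalized Lorentz kernel of Eq. (5).
    """
    keep = [i for i, e in enumerate(elements) if e in (elem_k, elem_k2)]
    pts = np.asarray([coords[i] for i in keep], dtype=float)
    els = [elements[i] for i in keep]
    n = len(keep)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if {els[i], els[j]} != {elem_k, elem_k2}:
                continue  # keep only k-k' pairs
            d = np.linalg.norm(pts[i] - pts[j])
            if d > radii[els[i]] + radii[els[j]] + sigma:  # non-covalent only
                A[i, j] = A[j, i] = 1.0 / (1.0 + (d / eta) ** nu)
    L = np.diag(A.sum(axis=1)) - A  # weighted degrees on the diagonal
    return L, A


# Toy geometry: O, N, N, O on the corners of a 4 A square (assumed radii).
coords = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (4, 4, 0)]
elements = ["O", "N", "N", "O"]
radii = {"O": 1.52, "N": 1.55}
L, A = subgraph_matrices(coords, elements, radii, "N", "O", eta=4.0)
lap_eigs = np.linalg.eigvalsh(L)  # ascending order
adj_eigs = np.linalg.eigvalsh(A)
print(lap_eigs, adj_eigs, np.trace(L))
```

In this toy case the graph is connected and bipartite, so the Laplacian has a single zero eigenvalue and the adjacency spectrum is symmetric about zero. The trace of the Laplacian sums the weighted degrees, so each edge weight is counted once per endpoint.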

Figure 2:

Illustration of the weighted colored subgraph $G_{NO}$, including its Laplacian matrix (Left) and adjacency matrix (Right), deduced from the molecular graph of C5H6N2O2 (Middle). Atoms 1 and 4 are oxygen, while atoms 2 and 3 are nitrogen. Graph edges, $\Phi_{ij}$, are shown as green dashed lines representing the noncovalent bonds. In addition, one can get 9 other nontrivial subgraphs for this molecule, namely $G_{CC}$, $G_{CN}$, $G_{CO}$, $G_{CH}$, $G_{NN}$, $G_{NH}$, $G_{OO}$, $G_{OH}$, and $G_{HH}$.

In general, the element-level information decoded from the Laplacian matrix and the adjacency matrix is quite similar, despite the different behaviors of their eigenvalues and eigenvectors. Specifically, the correlation between the adjacency matrix and the Laplacian matrix can be found in the Perron-Frobenius theorem via the following inequalities

$$\min_i \sum_j A_{ij} \leq \rho(A) \leq \max_i \sum_j A_{ij}. \tag{22}$$

In other words, since the row sums of $A$ are the diagonal elements of $L$, the spectral radius $\rho(A)$ of the adjacency matrix $A$ is bounded by the interval of the diagonal elements of the corresponding Laplacian matrix $L$.

In the algebraic approach, we are interested in describing the interactions between elements in the subgraph by the eigenvalues of its matrix. Thus, we design the weighted colored Laplacian matrix based descriptor at the element-level by

$$RI^L(\eta_{kk'}) = \sum_i \mu_i^L(\eta_{kk'}), \tag{23}$$

and the weighted colored adjacency matrix based descriptor is proposed in a similar manner. Note that GNM60 is a special case of the proposed Laplacian matrix. Thus, one can utilize its spectrum $\mu_i^L(\eta_{kk'})$ for protein B-factor prediction. To enrich the algebraic graph-based description, we consider statistics of the eigenvalues such as the sum, mean, maximum, minimum, and standard deviation.
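The eigenvalue statistics mentioned above can be collected as follows. This Python sketch assumes NumPy is available; the function name, the choice to use all eigenvalues, and the population standard deviation are assumptions for illustration, since the text does not fix these details.

```python
import numpy as np


def spectral_statistics(matrix):
    """Summary statistics of the eigenvalues of a symmetric matrix.

    Assumption: all eigenvalues are used and the standard deviation is the
    population one (ddof = 0).
    """
    eigs = np.linalg.eigvalsh(np.asarray(matrix, dtype=float))
    return {
        "sum": float(eigs.sum()),
        "mean": float(eigs.mean()),
        "max": float(eigs.max()),
        "min": float(eigs.min()),
        "std": float(eigs.std()),
    }


# Laplacian of a single edge with unit weight; its eigenvalues are 0 and 2.
stats = spectral_statistics([[1.0, -1.0], [-1.0, 1.0]])
print(stats)
```

Applying the same function to every element-pair Laplacian or adjacency matrix yields a fixed number of values per pair, regardless of how many atoms each subgraph contains.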

2.4. Algebraic topology-based molecular signature

By employing topological analysis, one can construct sophisticated topological spaces to capture the key interactions at the element level of a molecule or biomolecule of interest. This physical and chemical information is encoded in spaces of different dimensions through topological invariants, the so-called Betti numbers. Based on this topological information, rich and systematic descriptions are formulated and integrated with advanced machine learning frameworks.

2.4.1. Persistent homology

From the geometric point of view, collections of points, edges, triangles, and their higher-dimensional counterparts form topological spaces. The general form of a triangle or a tetrahedron is called a simplex. Mathematically, a set of $(k + 1)$ affinely independent points in $\mathbb{R}^n$ with $n \geq k$ gives rise to a $k$-simplex. To further characterize the topological spaces, a face is introduced as the convex hull of a subset of the points defining a simplex. In addition, a finite collection of simplices defines a simplicial complex $X$ provided that two requirements are met. First, the faces of any simplex in $X$ are also in $X$. Second, the intersection of two simplices $\sigma_1$ and $\sigma_2$ in $X$ is either empty or a face of both $\sigma_1$ and $\sigma_2$. In a given simplicial complex $X$, a $k$-chain $c$ is a formal sum of the $k$-simplices in $X$, defined as $c = \sum_i a_i \sigma_i$. Here, $a_i$ is an integer coefficient chosen in a finite field $\mathbb{Z}_p$ with a prime $p$. With the addition operator on the coefficients of the $k$-chains, one can form a group of $k$-chains denoted $C_k(X)$. The boundary operator on simplices is defined as

$$\partial_k(\sigma) = \sum_{i=0}^{k} (-1)^i [v_0, \ldots, \hat{v}_i, \ldots, v_k], \tag{24}$$

where $v_0, \ldots, v_k$ are the vertices of the $k$-simplex $\sigma$ and $[v_0, \ldots, \hat{v}_i, \ldots, v_k]$ denotes the codimension-1 face of $\sigma$ obtained by omitting the vertex $v_i$. The boundary operator $\partial_k$ is a homomorphism from $C_k(X)$ to $C_{k-1}(X)$ with the important property $\partial_k \partial_{k+1} = 0$. Therefore, one can form the following chain complex

$$\cdots \xrightarrow{\partial_{i+1}} C_i(X) \xrightarrow{\partial_i} C_{i-1}(X) \xrightarrow{\partial_{i-1}} \cdots \xrightarrow{\partial_2} C_1(X) \xrightarrow{\partial_1} C_0(X) \xrightarrow{\partial_0} 0. \tag{25}$$

In algebraic topology, homology is used to distinguish two shapes by detecting their holes. To define the $k$th homology group, we consider the image of the boundary operator $\partial_{k+1}$, denoted $B_k(X) = \mathrm{Im}(\partial_{k+1})$, and the kernel of $\partial_k$, denoted $Z_k(X) = \mathrm{Ker}(\partial_k)$, which are illustrated in Fig. 3. Then, the quotient group of the aforementioned kernel by the image gives rise to the $k$th homology group

$$H_k(X) = Z_k(X)/B_k(X). \tag{26}$$

Figure 3:

Illustration of boundary operators, chain, cycle, and boundary groups in $\mathbb{R}^3$. Yellow circles are empty sets.
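The relation $H_k = Z_k / B_k$ translates into the rank formula $\beta_k = (n_k - \mathrm{rank}\,\partial_k) - \mathrm{rank}\,\partial_{k+1}$, where $n_k$ is the number of $k$-simplices. The following Python sketch computes Betti numbers of a small simplicial complex this way. It works over $\mathbb{Z}_2$ (the case $p = 2$, where the signs in Eq. (24) vanish), and the function names and data layout are assumptions made for illustration.

```python
from itertools import combinations


def boundary_rows(k_simplices, km1_simplices):
    """Boundary matrix over Z_2, one integer bitmask per k-simplex.

    Bit c of a row is set when the c-th (k-1)-simplex is a face of it.
    """
    index = {s: c for c, s in enumerate(km1_simplices)}
    rows = []
    for simplex in k_simplices:
        mask = 0
        for face in combinations(simplex, len(simplex) - 1):
            mask |= 1 << index[face]
        rows.append(mask)
    return rows


def rank_gf2(rows):
    """Rank over Z_2 using an XOR basis keyed by the leading bit."""
    basis = {}
    rank = 0
    for row in rows:
        cur = row
        while cur:
            lead = cur.bit_length() - 1
            if lead not in basis:
                basis[lead] = cur
                rank += 1
                break
            cur ^= basis[lead]
    return rank


def betti_numbers(simplices_by_dim):
    """simplices_by_dim[k] lists the k-simplices as sorted vertex tuples."""
    top = len(simplices_by_dim) - 1
    ranks = [0]  # the boundary of a vertex is zero
    for k in range(1, top + 1):
        ranks.append(rank_gf2(boundary_rows(simplices_by_dim[k],
                                            simplices_by_dim[k - 1])))
    ranks.append(0)  # there are no simplices above the top dimension
    return [len(simplices_by_dim[k]) - ranks[k] - ranks[k + 1]
            for k in range(top + 1)]


vertices = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
hollow = betti_numbers([vertices, edges])
filled = betti_numbers([vertices, edges, [(0, 1, 2)]])
print(hollow, filled)
```

The hollow triangle has one connected component and one ring, while filling in the 2-simplex removes the ring.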

The homology group described above applies to a fixed topological space. To accommodate multiscale objects, we can construct a sequence of subspaces of a topological space. Such a sequence is called a filtration, $\emptyset = X_0 \subseteq X_1 \subseteq \cdots \subseteq X_{m-1} \subseteq X_m = X$, which naturally induces a series of homology groups of different dimensions connected by homomorphisms

$$I_k^{t,s}: H_k(X_t) \to H_k(X_s), \quad \text{with } 0 \leq t \leq s \leq m. \tag{27}$$

The images of these homomorphisms are called the $k$th persistent homology groups, and the ranks of these groups define the $k$th persistent Betti numbers, which are used to recognize topological spaces via the number of $k$-dimensional holes. In the physical interpretation, Betti-0 counts the number of independent components, Betti-1 gives the number of rings, and Betti-2 encodes the cavities.

2.4.2. Topological description of molecular systems

We carry out persistent homology on the labeled subgraph $G(V, E_{kk'})$ defined in the previous sections to describe molecular properties. The resulting topological formulation is called element specific persistent homology.22, 52

There are two common types of filtration, namely the Vietoris-Rips complex and the alpha complex.94 The Vietoris-Rips complex, a distance-based filtration, is used to directly address the protein-ligand interactions. For a set of atoms in the subgraph $G(V, E_{kk'})$, the subcomplex associated to $\varepsilon$ is defined as

$$X_{\mathrm{Rips}}(\varepsilon) = \{\sigma \in X \mid \sigma = [v_0, \ldots, v_k],\ d(v_i, v_j) \leq 2\varepsilon \text{ for } 0 \leq i, j \leq k\}, \tag{28}$$

where $X$ is the collection of all possible simplices and $d$ is the distance between two atoms. To capture a complex protein geometry, one can utilize the alpha complex. The alpha filtration is built upon the non-empty intersection of the $(k + 1)$ Voronoi cells of the vertices of a $k$-simplex. In general, in the alpha filtration, the subcomplex associated to $\varepsilon$ is defined as

$$X_{\mathrm{alpha}}(\varepsilon) = \Big\{\sigma \in X \mid \sigma = [v_0, \ldots, v_k],\ \bigcap_i \big(V(v_i) \cap B_{\varepsilon}(v_i)\big) \neq \emptyset\Big\}, \tag{29}$$

where $V(v_i)$ is the Voronoi cell of $v_i$ and $B_{\varepsilon}(v_i)$ is an $\varepsilon$-ball centered at $v_i$. For the details of building an alpha filtration, we refer interested readers to our published work.46
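To illustrate Eq. (28), the following Python sketch enumerates the simplices of a Vietoris-Rips complex at a given scale $\varepsilon$ using plain Euclidean distances. The function name, the `max_dim` cutoff, and the unit-square point set are assumptions for this example; the element-specific distance functions used in this work are not reproduced here.

```python
import math
from itertools import combinations


def rips_complex(points, eps, max_dim=2):
    """Simplices whose vertices are pairwise within 2 * eps, as in Eq. (28).

    Simplices are returned as tuples of point indices, up to dimension
    max_dim. The enumeration is brute force and only suits small inputs.
    """
    n = len(points)
    close = [[math.dist(points[i], points[j]) <= 2.0 * eps for j in range(n)]
             for i in range(n)]
    simplices = []
    for k in range(max_dim + 1):
        for verts in combinations(range(n), k + 1):
            if all(close[a][b] for a, b in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


# Corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
small = rips_complex(square, eps=0.5)   # only the four sides appear as edges
large = rips_complex(square, eps=0.75)  # diagonals and triangles appear too
print(len(small), len(large))
```

Increasing $\varepsilon$ only ever adds simplices, so the complexes obtained at increasing scales are nested and form a filtration.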

Similar to the multiscale weighted colored subgraphs in the algebraic graph theory approaches, element specific persistent homology has been shown to capture crucial physical interactions by adjusting the distance functions used in the filtration.22, 52 For example, hydrophobic effects can be described by carrying out the persistent homology computation on the collection of all carbon atoms. To describe the hydrophilic behavior of the molecular system, element specific persistent homology is carried out only for nitrogen and oxygen atoms. In addition, an appropriate choice of distance function can characterize the covalent bonds and non-covalent interactions in small molecules.24
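To make the element-specific idea concrete, the following sketch selects atoms by element type and computes the dimension-0 barcode of the distance-based (Rips) filtration on that subset using a union-find structure. It is an illustrative assumption rather than the pipeline used in this work: the function names are hypothetical, the plain Euclidean distance is used, and death values are reported as pairwise distances (the edge of length d enters Eq. (28) at ε = d/2).

```python
import numpy as np

def h0_barcode(coords):
    """Dimension-0 bars of the Rips filtration on a point set.

    Every component is born at 0; a component dies at the length of the
    edge that merges it into another one. One bar never dies.
    """
    pts = np.asarray(coords, dtype=float)
    n = len(pts)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    pairs = sorted(
        (float(np.linalg.norm(pts[i] - pts[j])), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    bars = []
    for d, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:           # this edge merges two components
            parent[ri] = rj
            bars.append((0.0, d))
    if n > 0:
        bars.append((0.0, float("inf")))
    return bars

def element_specific_h0(elements, coords, selected):
    """Restrict to atoms whose element symbol is in `selected`, then compute H0."""
    keep = [i for i, e in enumerate(elements) if e in selected]
    return h0_barcode(np.asarray(coords, dtype=float)[keep])

# Toy system (coordinates in angstroms, purely illustrative).
elements = ["C", "C", "N", "C", "O"]
coords = [(0, 0, 0), (1.5, 0, 0), (0, 0, 3), (4.5, 0, 0), (0, 0, 5)]
print(element_specific_h0(elements, coords, {"C"}))       # carbon-only network
print(element_specific_h0(elements, coords, {"N", "O"}))  # nitrogen/oxygen network
```

Higher-dimensional bars (rings and cavities) require a full persistent homology computation, for which a dedicated library would normally be used.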

There are several ways to incorporate barcodes generated by persistent homology into machine learning models. One can use the Wasserstein metric to measure the similarity between the barcodes of two molecules. As a result, distance-based machine learning approaches such as nearest neighbors and kernel methods can be exploited.24 To make use of advanced machine learning algorithms such as ensembles of trees and deep neural networks, we vectorize persistent homology barcodes by discretizing them into bins and counting the persistence, birth, and death events in each bin. Furthermore, the statistics of element-specific persistent homology barcodes are included as fixed-length features.24 In the convolutional neural networks, such featurizations of barcodes are represented as 1-dimensional and 2-dimensional image-like arrays.23,24
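The binning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact featurization of Refs. 23 and 24: the function name `vectorize_barcode`, the cutoff `r_max`, and the number of bins are hypothetical choices, and infinite bars are truncated at the cutoff.

```python
import numpy as np

def vectorize_barcode(bars, r_max=12.0, n_bins=24):
    """Turn a list of (birth, death) bars into a fixed-length feature vector.

    For each bin, count (1) bars born in it, (2) bars dying in it, and
    (3) bars that persist through it. Infinite deaths are capped at r_max.
    """
    edges = np.linspace(0.0, r_max, n_bins + 1)
    births = np.array([b for b, _ in bars], dtype=float)
    deaths = np.minimum(np.array([d for _, d in bars], dtype=float), r_max)

    birth_counts, _ = np.histogram(births, bins=edges)
    death_counts, _ = np.histogram(deaths, bins=edges)

    lo, hi = edges[:-1], edges[1:]
    # A bar persists in a bin when the interval [birth, death] overlaps it.
    overlaps = (births[:, None] < hi[None, :]) & (deaths[:, None] > lo[None, :])
    persist_counts = overlaps.sum(axis=0)

    return np.concatenate([birth_counts, death_counts, persist_counts]).astype(float)

bars = [(0.0, 1.5), (0.0, 2.5), (1.2, float("inf"))]
features = vectorize_barcode(bars, r_max=4.0, n_bins=4)
print(features.shape, features)
```

Because the output length depends only on the number of bins, barcodes of molecules with different numbers of atoms map to vectors of the same size, which is what tree ensembles and neural networks require. Stacking such vectors over element pairs gives the image-like arrays mentioned above.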

2.5. MathDL energy prediction models

We integrate the mathematical features with deep learning networks to form a powerful predictive model. The convolutional neural network (CNN) is a well-known algorithm with much success in image recognition and computer vision analysis. Essentially, a CNN is a regularized version of the artificial neural network consisting of many convolutional layers followed by several fully connected layers. To enhance the learning process, dropout techniques have been exploited in the network layers.95 The neural networks we use are feed-forward networks, in which all the information in the current layer is linearly combined and then passed through a nonlinear activation function before being sent to the next layer. The predictive power of the CNN models relies on the characterization of local interactions in the spatial dimension by the discrete convolution operator. The choice of feature inputs to the CNN gives rise to variants of the binding energy predictive models. Fig. 4 depicts the MathDL energy prediction models, and their network architectures are described in Fig. S1 in the Supporting Information. In D3R GC4, we utilized two different models. In the first approach, a combination of algebraic topology and differential geometry features was employed in the network; we named this model BP1. In the second approach, algebraic topology, differential geometry, and algebraic graph representations were combined, leading to another binding energy prediction model named BP2. The details of the feature generation procedures for the algebraic topology, differential geometry, and algebraic graph models can be found in our earlier work.24–26
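The feed-forward computation described above (convolution, nonlinear activation, dropout, fully connected layers) can be written out in a few lines. The sketch below is a toy forward pass in NumPy and is only an assumption-laden illustration: the layer sizes, the parameter names in `params`, and the single convolutional layer are hypothetical and do not reproduce the architectures of Fig. S1.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1D convolution. x: (length, in_ch); kernels: (width, in_ch, out_ch)."""
    width = kernels.shape[0]
    n_out = x.shape[0] - width + 1
    out = np.empty((n_out, kernels.shape[2]))
    for t in range(n_out):
        # Linear combination of a local window of the input.
        out[t] = np.tensordot(x[t:t + width], kernels, axes=([0, 1], [0, 1])) + bias
    return out

def relu(z):
    return np.maximum(z, 0.0)

def dropout(a, rate, rng, training):
    """Inverted dropout: active only during training, identity at inference."""
    if not training or rate == 0.0:
        return a
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

def forward(x, params, rng=None, training=False, rate=0.5):
    """Conv layer -> ReLU -> dropout -> dense -> ReLU -> dense (scalar output)."""
    h = relu(conv1d(x, params["k"], params["kb"]))
    h = dropout(h.ravel(), rate, rng, training)
    h = relu(h @ params["w1"] + params["b1"])
    return float(h @ params["w2"] + params["b2"])

# Hypothetical sizes: 5 bins x 2 feature channels, 4 filters of width 3.
rng = np.random.default_rng(0)
params = {
    "k": rng.normal(size=(3, 2, 4)), "kb": np.zeros(4),
    "w1": rng.normal(size=(12, 2)), "b1": np.zeros(2),
    "w2": rng.normal(size=2), "b2": 0.0,
}
x = rng.normal(size=(5, 2))
print(forward(x, params))                           # inference
print(forward(x, params, rng=rng, training=True))   # training-mode pass with dropout
```

In practice the parameters would be fitted by backpropagation in a deep learning framework; the sketch only shows how a binned mathematical representation flows through the layers to a single predicted binding energy.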

Figure 4:

A framework of MathDL energy prediction model which integrates advanced mathematical representations with sophisticated CNN architectures

2.6. MathDeep docking models

We here present an innovative pose generation scheme, denoted MGAN, which uses generative adversarial networks (GAN) preconditioned on advanced mathematical representations. A GAN is a deep learning model consisting of a generator G that learns the data distribution and a discriminator D that distinguishes the structural information of the training set from that produced by the generator G.88 The G model is iteratively improved using the feedback from D until D cannot tell the difference between the structural information of the training set and that generated by G. To improve the GAN performance and avoid vanishing gradients and mode collapse, we employ the Wasserstein GAN (WGAN)96 in our model. To further enhance the quality of the generated structures, we take advantage of the conditional GAN technique.97 The deep learning (DL) models G and D are partially adapted from our binding energy prediction networks, which are fed with data encoded in intrinsically low-dimensional manifolds using differential geometry, algebraic topology, and graph theory. Fig. 5 depicts the MGAN framework. The network architectures of the autodecoder and autoencoder are illustrated in Figs. S2 and S3, respectively. By varying the combination of mathematical representations, we obtain several docking models. Specifically, if the DL networks G and D exploit only algebraic topology, we name the docking model DM1. Similarly, we obtain DM2 and DM3 when the GAN model includes only algebraic graph and only differential geometry based representations, respectively. Finally, DM4 is constructed with algebraic topology, algebraic graph, and differential geometry together. We employed the PDBbind v2018 dataset to train the MathDL and MGAN models. The hyperparameters of the MathDL model were initially chosen by experience and finalized with the hyperopt Python package (http://github.com/hyperopt/hyperopt). The MGAN model was trained following the Wasserstein GAN settings discussed in the original work.96 Furthermore, to enhance the pose generation quality, we apply transfer learning to further optimize the MGAN model with protein family-specific structures.
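For readers unfamiliar with the Wasserstein GAN objective, the sketch below writes out the two losses and the weight-clipping step of the original WGAN formulation. It is a generic illustration and not the MGAN training code: the function names are hypothetical, the discriminator (critic) scores are passed in as plain arrays, and whether MGAN uses weight clipping with this threshold is an assumption made here.

```python
import numpy as np

def critic_loss(d_real, d_fake):
    """Loss minimized by the discriminator (critic) D.

    D is trained to score training-set structures higher than generated ones,
    so it minimizes mean(D(fake)) - mean(D(real)).
    """
    return float(np.mean(d_fake) - np.mean(d_real))

def generator_loss(d_fake):
    """Loss minimized by the generator G: raise the critic score of its samples."""
    return float(-np.mean(d_fake))

def clip_weights(weights, c=0.01):
    """Clip critic weights to [-c, c], the Lipschitz constraint of the original WGAN."""
    return [np.clip(w, -c, c) for w in weights]

# Critic scores for a batch of training poses and a batch of generated poses.
d_real = np.array([1.0, 3.0])
d_fake = np.array([0.0, 1.0])
print(critic_loss(d_real, d_fake), generator_loss(d_fake))
print(clip_weights([np.array([-0.5, 0.005, 0.2])]))
```

The negative of the critic loss approximates the Wasserstein distance between the training and generated distributions, which provides useful gradients even when the two distributions barely overlap; this is the property that helps against the vanishing gradient and mode collapse issues mentioned above.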

Figure 5:

Illustration of our docking approach using mathematical representations integrated with GAN architectures. The generator contains an autodecoder, a latent space (LS), and a noise source. The discriminator consists of an autoencoder and latent space. The Math center encodes 3D structures into low-dimensional mathematical representations using algebraic topology, differential geometry, and/or graph theory.

3. Results and discussion

In this section, we present MathDL results and discuss our performances in the latest Grand Challenge named GC4.

3.1. Pose prediction results and discussion

We have participated in the docking challenge task since D3R GC2. Before the current challenge, i.e., GC4, our docking results in terms of RMSD were not competitive with those of other participants. Specifically, our mean RMSD values were 6.03 Å and 3.78 Å for GC2 and GC3, respectively. These results reflect an improvement in our docking approaches, but their accuracy was still behind the top submissions in GC3. Instead of depending on docking programs such as Autodock Vina4 and GLIDE6 as we did in the previous challenges, our GC4 docking schemes were driven by advanced mathematical representations and sophisticated deep learning architectures. Consequently, we achieved remarkable performances on the pose prediction tasks. The rest of this section is devoted to a discussion of these results.

Despite there being two protein receptors in GC4, all the pose predictions were for BACE ligands only and were organized in two stages, Stage 1a and Stage 1b. In Stage 1a, participants were provided with the SMILES strings of the 20 ligands to be docked, the FASTA sequence of the BACE protein, and the reference protein structure (PDBID: 5ygx, chain A) for the superimposition process. Stage 1b took place right after the end of Stage 1a. Stage 1b provided the experimental protein structures from the complexes with the 20 ligands requested for pose prediction, with the ligand structures removed. Participants were again asked to predict the ligand poses. Therefore, Stage 1b is often referred to as a self-docking challenge. There are two evaluation metrics for the pose prediction tasks, namely the median and the mean calculated over all RMSD values between the predicted poses and the crystal structures.
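The two pose metrics can be illustrated with a short sketch. It assumes that predicted and crystallographic ligand coordinates are already in the same reference frame and share the same atom ordering; symmetry correction and superimposition, which an official evaluation may apply, are not handled here. The function names are hypothetical.

```python
import numpy as np

def pose_rmsd(pred, ref):
    """RMSD (in the units of the input, e.g. angstroms) between two poses.

    Assumes matching atom order and a common reference frame.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

def summarize(rmsds):
    """Return the (median, mean) RMSD over all ligands in a submission."""
    rmsds = np.asarray(rmsds, dtype=float)
    return float(np.median(rmsds)), float(np.mean(rmsds))

# Toy ligand with two heavy atoms; the prediction is shifted by 1 angstrom along z.
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(pose_rmsd(predicted, crystal))
print(summarize([0.5, 1.0, 3.0]))
```

The median is less sensitive than the mean to a few badly docked ligands, which is why a submission can rank differently under the two metrics.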

In Stage 1a, we submitted two results. Fig. 6 illustrates the performances of the 70 submissions having a median RMSD of less than 10 Å. Our best submission, with receipt ID 5t302 and median RMSD = 0.53 Å, is highlighted in red. This docking model was DM1. In Stage 1b, we delivered 4 submissions; unfortunately, none of them was ranked first in either the median or the mean metric. However, our results were very promising. In particular, our submission based on docking model DM3 with receipt ID itzv6 achieved a mean RMSD of 0.73 Å, which placed second and is slightly less accurate than the top submission with a mean RMSD of 0.61 Å (receipt ID 5od5g). It may be noted that the best result in Stage 1b is not as good as that in Stage 1a. Fig. 7 compares the poses predicted by our submission ID 0invp with the corresponding experimental structures at different levels of accuracy.

Figure 6:

Performance comparison of different submissions on the pose prediction challenge of Stage 1a for the BACE dataset in terms of median RMSD. Our submissions are highlighted in red, of which the best one is 5t302 with median RMSD = 0.55 Å.

Figure 7:

Illustration of pose predictions by our MathGAN docking model with receipt ID 0invp. The top-left panel shows the original binding pocket of the BACE receptor. The top-right panel shows our most accurate pose prediction, obtained for BACE03 with RMSD = 0.23 Å. The bottom-left panel shows our median performance, obtained for BACE05 with RMSD = 0.53 Å. The bottom-right panel shows our worst performance, obtained for BACE07 with RMSD = 2.63 Å. The experimental structures are in yellow, while the predicted structures are in purple.

It is interesting to find that the additional information from the co-crystal structures did not help our docking models. For example, our docking approach DM4 with submission ID 0invp attained a median RMSD of 0.53 Å and a mean RMSD of 0.8 Å in Stage 1a. However, in Stage 1b, the same model, labeled by receipt ID 2ieqo, produced a median RMSD and a mean RMSD of 0.55 Å and 0.84 Å, respectively. These observations support the robustness of our models and their predictive value in realistic CADD situations where little or no co-crystal information is available.

3.2. Affinity prediction results and discussion

There were two subchallenges for the affinity prediction tasks. Subchallenge 1 concerned BACE ligands, while Subchallenge 2 concerned CatS ligands. Both subchallenges involved affinity ranking of a diverse dataset and relative binding affinity predictions on a designated free energy set. There were two stages in the BACE affinity prediction task, namely Stage 1 and Stage 2, whereas there was only one stage for the CatS ligands. Unfortunately, we did not participate in Stage 1 of the BACE target because we overlooked this contest in the announcement email.

In total, there were 154 compounds in the BACE dataset for the affinity ranking contest and 34 compounds for the calculation of relative or absolute binding affinities for the same receptor target. In the CatS dataset, participants were asked to rank the affinities of 460 ligands and predict the binding energies of a smaller subset of 39 molecules. Kendall’s τ and Spearman’s ρ were the evaluation metrics for the affinity ranking challenges. For the binding free energy predictions, Pearson’s r and the centered root mean square error (RMSEc) were used in addition to the aforementioned metrics.
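The four evaluation metrics can be sketched in plain NumPy as follows. This is an illustrative implementation under simplifying assumptions, not the organizers' evaluation script: the function names are hypothetical, the rank-based metrics assume no tied values (Kendall's τ is the tau-a variant), and RMSEc is taken as the root mean square error after subtracting the mean from both the predicted and the experimental values.

```python
import numpy as np

def pearson_r(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def _ranks(v):
    """Ranks 1..n of the entries of v (assumes no ties)."""
    order = np.argsort(v)
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    return ranks

def spearman_rho(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson_r(_ranks(np.asarray(x)), _ranks(np.asarray(y)))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return float(s / (n * (n - 1) / 2))

def rmse_c(pred, expt):
    """Centered RMSE: remove each set's mean before comparing."""
    p, e = np.asarray(pred, dtype=float), np.asarray(expt, dtype=float)
    return float(np.sqrt(np.mean(((p - p.mean()) - (e - e.mean())) ** 2)))

# Toy example: predicted vs. experimental binding free energies (kcal/mol).
expt = [-9.0, -8.0, -7.0, -6.0]
pred = [-8.5, -6.9, -7.4, -5.0]
print(kendall_tau(pred, expt), spearman_rho(pred, expt))
print(pearson_r(pred, expt), rmse_c(pred, expt))
```

Because RMSEc removes any constant offset between predictions and experiment, it rewards correct relative free energies, whereas τ and ρ depend only on the ordering of the compounds.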

Overall, the official results from the D3R organizers have placed us among the top performers in these energy prediction contests. Considering specific evaluation metrics, we were ranked first in the combined ligand and structure based scoring*, structure based scoring, and free energy set subcategories, all belonging to the CatS dataset. For illustration, Fig. 8 presents the Spearman’s ρ performance of different submissions in the CatS affinity ranking contest combining ligand and structure based scoring models. Our best submissions, with receipt IDs 3c8nw and 0xvrb, are highlighted in red. Both of them achieved a Spearman’s ρ of 0.73 and shared first place with another group’s submission having ID x4svd. In submission ID 3c8nw, we employed docking model DM4 for pose generation and model BP2 for affinity prediction. In submission ID 0xvrb, the docking approach was DM3 and the binding prediction protocol was BP2. In addition, our best result, with ID ar5p6, achieved the lowest RMSEc for the free energy prediction of the 39 designated CatS molecules. This successful submission used docking model DM4 and affinity prediction model BP2. Fig. 9 presents the RMSEc performance of various groups for the free energy prediction of the CatS dataset. Table 1 summarizes the performance of our group in all categories of D3R GC4. We only counted the number of our submissions in the top three, including ties. “No participation” in the results column means that we did not participate in the corresponding contest. Blank results indicate that our predictions were not ranked within the top three.

Figure 8:

Performance comparison of different submissions on the combined ligand and structure based scoring of the CatS dataset in terms of Spearman’s ρ. Our submissions are highlighted in red, of which our top-ranked submissions are 3c8nw and 0xvrb with ρ = 0.73.

Figure 9:

Performance comparison of D3R GC4 participants on the free energy set of the CatS contest in terms of the centered RMSE (RMSEc). Our submissions are highlighted in red, of which our top-ranked prediction is ar5p6 with RMSEc = 0.47 kcal/mol.

Table 1:

Overview of MathDL’s performance in D3R GC4. The notation “(a/b)” indicates that a of our predictions achieved the given ranking, out of a total of b submissions sharing that ranking.

| Dataset | Contest | Results |
|---|---|---|
| Pose Prediction | | |
| BACE Stage 1A | Pose Prediction | Ranked 1st (1/2)^i; Ranked 2nd (3/3)^ii |
| BACE Stage 1B | Pose Prediction | Ranked 2nd (2/2)^iii; Ranked 3rd (1/2)^iv |
| Affinity Predictions | | |
| Cathepsin Stage 2 | Combined Ligand and Structure Based Scoring | Ranked 1st (2/5)^v; Ranked 2nd (2/3)^vi; Ranked 3rd (2/4)^vii |
| Cathepsin Stage 2 | Ligand Based Scoring | No participation |
| Cathepsin Stage 2 | Structure Based Scoring | Ranked 1st (2/4)^viii; Ranked 2nd (3/3)^ix; Ranked 3rd (3/3)^x |
| Cathepsin Stage 2 | Free Energy Set | Ranked 1st (1/7)^xi; Ranked 2nd (1/7)^xii; Ranked 3rd (3/5)^xiii |
| BACE Stage 1 | Combined Ligand and Structure | No participation |
| BACE Stage 1 | Ligand Based Scoring | No participation |
| BACE Stage 1 | Structure Based Scoring | No participation |
| BACE Stage 1 | Free Energy Set | No participation |
| BACE Stage 2 | Combined Ligand and Structure | |
| BACE Stage 2 | Ligand Based Scoring | No participation |
| BACE Stage 2 | Structure Based Scoring | |
| BACE Stage 2 | Free Energy Set | Ranked 2nd (3/4)^xiv; Ranked 3rd (1/4)^xv |

| Superscript | Submission ID | Evaluation Metric | Docking Protocol | Scoring Protocol |
|---|---|---|---|---|
| i | 5t302 | Median RMSD | DM1 | |
| ii | 5t302 | Mean RMSD | DM1 | |
| | 0invp | Median RMSD | DM4 | |
| | 0invp | Mean RMSD | DM4 | |
| iii | 2ieqo | Median RMSD | DM4 | |
| | itzv6 | Mean RMSD | DM3 | |
| iv | 4myne | Mean RMSD | DM1 | |
| v | 0xvrb | Spearman’s ρ | DM3 | BP2 |
| | 3c8nw | Spearman’s ρ | DM4 | BP2 |
| vi | 0xvrb | Kendall’s τ | DM3 | BP2 |
| | 3c8nw | Kendall’s τ | DM4 | BP2 |
| vii | qb2s2 | Kendall’s τ | DM1 | BP2 |
| | qb2s2 | Spearman’s ρ | DM1 | BP2 |
| viii | 0xvrb | Spearman’s ρ | DM3 | BP2 |
| | 3c8nw | Spearman’s ρ | DM4 | BP2 |
| ix | 0xvrb | Kendall’s τ | DM3 | BP2 |
| | 3c8nw | Kendall’s τ | DM4 | BP2 |
| | qb2s2 | Spearman’s ρ | DM1 | BP2 |
| x | qb2s2 | Kendall’s τ | DM1 | BP2 |
| | qi5ev | Spearman’s ρ | DM3 | BP1 |
| | kohoc | Spearman’s ρ | DM2 | BP2 |
| xi | ar5p6 | RMSEc | DM4 | BP2 |
| xii | 24b03 | RMSEc | DM3 | BP2 |
| xiii | 24b03 | Kendall’s τ | DM3 | BP2 |
| | 24b03 | Spearman’s ρ | DM3 | BP2 |
| | 24b03 | Pearson’s r | DM3 | BP2 |
| xiv | 8frur | Kendall’s τ | DM1 | BP2 |
| | 8frur | Spearman’s ρ | DM1 | BP2 |
| | 8frur | RMSEc | DM1 | BP2 |
| xv | 8frur | Pearson’s r | DM1 | BP2 |

Note that our results were not in the top three for BACE affinity ranking. Nevertheless, our team trailed only the two teams that collected all of the top three places in BACE affinity ranking, which indicates the consistency of our MathDL models across the GC4 competitions.

Overall, BP2 was our best model for binding affinity prediction on both the CatS and BACE datasets (see Table S1). The strong performance of BP2 was expected, since it combines algebraic topology, differential geometry, and graph theory features, which enrich the feature space and cover the most important aspects of the physical and biological properties. However, the picture was mixed regarding the best solution for pose prediction. Models DM3 and DM4 worked well for the CatS dataset, while DM1 was the only good solution for producing high quality poses for the BACE dataset (see Table S1). These docking models helped the predictor BP2 achieve the best rankings among our submitted models. One can argue that, because DM1 achieved the best pose prediction for BACE ligands in Stage 1A, it was expected to help the BACE energy prediction tasks. The same behavior was observed for the CatS dataset. According to our pre-validation results, DM4, which was our best model for CatS pose prediction, achieved a mean RMSD of 1.8 Å on the CatS pose prediction Stage 1B challenge in GC3. Note that the best submission in that subchallenge achieved a mean RMSD of 2.13 Å. It seems that the pose quality of our pose generation models correlates well with the accuracy of our binding affinity predictors.

4. Conclusion

The performances of our mathematical deep learning (MathDL) models on D3R GC4 are presented and discussed in this paper. We participated in a variety of D3R GC4 contests, including pose prediction, affinity ranking, and absolute free energy prediction. Overall, our submissions were ranked first in pose prediction in Stage 1a and in affinity ranking and free energy prediction for Cathepsin ligands. Unfortunately, we did not achieve first place on the BACE affinity datasets. Our best submission there was ranked second on the free energy set for BACE in the Stage 2 contest. In comparison to our previous D3R challenges, i.e., D3R GC2 and D3R GC3, we made two improvements in D3R GC4. The first improvement was in pose prediction. This was the first time we won this contest, thanks to our newly developed docking model, which integrates scalable, low-dimensional, rotationally and translationally invariant mathematical representations, such as differential geometry, algebraic graph, and algebraic topology, with well-designed generative adversarial networks. The second improvement was in affinity ranking for a dataset with diverse chemical properties. In previous challenges, our approaches performed well on free energy predictions but not on affinity ranking. In GC4, we successfully unified our newly established models, i.e., differential geometry and algebraic graph, with our well-known algebraic topology approach into powerful and robust convolutional neural network models for binding affinity prediction.

In terms of efficiency, at this point, our MathDL models are quite automated. With sufficient computer resources, our MathDL models can finish all the GC4 competition tasks in a week or so.

It is worth noting that our GC4 models were less competitive in BACE affinity ranking and free energy predictions. Additionally, it seems that our docking model did not improve when the co-crystal structures became available. These issues are under investigation.

Supplementary Material

10822_2019_237_MOESM2_ESM
10822_2019_237_MOESM1_ESM

Acknowledgments

This work was supported in part by NSF Grants DMS-1721024, DMS-1761320, and IIS-1900473 and NIH grant GM126189. DDN and GWW are also funded by Bristol-Myers Squibb and Pfizer.

Footnotes

Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and so may therefore differ from this version.

* This subcategory is the common list of the ligand based and structure based scoring subcategories.

References

  • [1] Gathiaka S, Liu S, Chiu M, Yang H, Stuckey JA, Kang YN, Delproposto J, Kubish G, Dunbar JB, Carlson HA, et al., “D3R grand challenge 2015: evaluation of protein–ligand pose and affinity predictions,” Journal of Computer-Aided Molecular Design, vol. 30, no. 9, pp. 651–668, 2016.
  • [2] Gaieb Z, Liu S, Gathiaka S, Chiu M, Yang H, Shao C, Feher VA, Walters WP, Kuhn B, Rudolph MG, et al., “D3R grand challenge 2: blind prediction of protein–ligand poses, affinity rankings, and relative binding free energies,” Journal of Computer-Aided Molecular Design, vol. 32, no. 1, pp. 1–20, 2018.
  • [3] Gaieb Z, Parks CD, Chiu M, Yang H, Shao C, Walters WP, Lambert MH, Nevins N, Bembenek SD, Ameriks MK, et al., “D3R grand challenge 3: blind prediction of protein–ligand poses and affinity rankings,” Journal of Computer-Aided Molecular Design, vol. 33, no. 1, pp. 1–18, 2019.
  • [4] Trott O and Olson AJ, “AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading,” J Computat Chem, vol. 31, no. 2, pp. 455–461, 2010.
  • [5] Jones G, Willett P, Glen RC, Leach AR, and Taylor R, “Development and validation of a genetic algorithm for flexible docking,” Journal of Molecular Biology, vol. 267, no. 3, pp. 727–748, 1997.
  • [6] Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, Repasky MP, Knoll EH, Shelley M, Perry JK, Shaw DE, Francis P, and Shenkin PS, “Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy,” J. Med. Chem, vol. 47, p. 1739, 2004.
  • [7] Abagyan R, Totrov M, and Kuznetsov D, “ICM—a new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation,” Journal of Computational Chemistry, vol. 15, no. 5, pp. 488–506, 1994.
  • [8] Liu J and Wang R, “Classification of current scoring functions,” Journal of Chemical Information and Modeling, vol. 55, no. 3, pp. 475–482, 2015.
  • [9] Ortiz AR, Pisabarro MT, Gago F, and Wade RC, “Prediction of drug binding affinities by comparative binding energy analysis,” J. Med. Chem, vol. 38, pp. 2681–2691, 1995.
  • [10] Yin S, Biedermannova L, Vondrasek J, and Dokholyan NV, “MedusaScore: An accurate force field-based scoring function for virtual drug screening,” Journal of Chemical Information and Modeling, vol. 48, pp. 1656–1662, 2008.
  • [11] Muegge I and Martin Y, “A general and fast scoring function for protein-ligand interactions: a simplified potential approach,” J Med Chem, vol. 42, no. 5, pp. 791–804, 1999.
  • [12] Velec HFG, Gohlke H, and Klebe G, “Knowledge-based scoring function derived from small molecule crystal data with superior recognition rate of near-native ligand poses and better affinity prediction,” J. Med. Chem, vol. 48, pp. 6296–6303, 2005.
  • [13] Zheng Z, Wang T, Li P, and Merz KM Jr, “KECSA-Movable type implicit solvation model (KMTISM),” Journal of Chemical Theory and Computation, vol. 11, pp. 667–682, 2015.
  • [14] Huang SY and Zou X, “An iterative knowledge-based scoring function to predict protein-ligand interactions: I. Derivation of interaction potentials,” J. Comput. Chem, vol. 27, pp. 1865–1875, 2006.
  • [15] Verkhivker G, Appelt K, Freer ST, and Villafranca JE, “Empirical free energy calculations of ligand-protein crystallographic complexes. I. Knowledge based ligand-protein interaction potentials applied to the prediction of human immunodeficiency virus protease binding affinity,” Protein Eng, vol. 8, pp. 677–691, 1995.
  • [16] Eldridge MD, Murray CW, Auton TR, Paolini GV, and Mee RP, “Empirical scoring functions: the development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes,” J. Comput. Aided. Mol. Des, vol. 11, pp. 425–445, 1997.
  • [17] Wang R, Lai L, and Wang S, “Further development and validation of empirical scoring functions for structural based binding affinity prediction,” J. Comput-Aided Mol. Des, vol. 16, pp. 11–26, 2002.
  • [18] Ballester PJ and Mitchell JBO, “A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking,” Bioinformatics, vol. 26, no. 9, pp. 1169–1175, 2010.
  • [19] Breiman L, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [20] Li H, Leung K-S, Wong M-H, and Ballester PJ, “Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study,” BMC Bioinformatics, vol. 15, no. 1, p. 1, 2014.
  • [21] Nguyen DD, Xiao T, Wang ML, and Wei GW, “Rigidity strengthening: A mechanism for protein-ligand binding,” Journal of Chemical Information and Modeling, vol. 57, pp. 1715–1721, 2017.
  • [22] Cang ZX and Wei GW, “Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction,” International Journal for Numerical Methods in Biomedical Engineering, vol. 34(2), p. DOI: 10.1002/cnm.2914, 2018.
  • [23] Cang ZX and Wei GW, “TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions,” PLOS Computational Biology, vol. 13(7), pp. e1005690, 10.1371/journal.pcbi.1005690, 2017.
  • [24] Cang ZX, Mu L, and Wei GW, “Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening,” PLOS Computational Biology, vol. 14(1), pp. e1005929, 10.1371/journal.pcbi.1005929, 2018.
  • [25] Nguyen DD and Wei G-W, “DG-GL: Differential geometry-based geometric learning of molecular datasets,” International Journal for Numerical Methods in Biomedical Engineering, vol. 35, no. 3, p. e3179, 2019.
  • [26] Nguyen D and Wei G-W, “AGL-Score: Algebraic graph learning score for protein-ligand binding scoring, ranking, docking, and screening,” Journal of Chemical Information and Modeling, 2019.
  • [27] Nguyen DD, Cang Z, Wu K, Wang M, Cao Y, and Wei G-W, “Mathematical deep learning for pose and binding affinity prediction and ranking in D3R grand challenges,” Journal of Computer-Aided Molecular Design, vol. 33, no. 1, pp. 71–82, 2019.
  • [28] Wei GW, “Differential geometry based multiscale models,” Bulletin of Mathematical Biology, vol. 72, pp. 1562–1622, 2010.
  • [29] Chen Z, Zhao S, Chun J, Thomas DG, Baker NA, Bates PB, and Wei GW, “Variational approach for nonpolar solvation analysis,” Journal of Chemical Physics, vol. 137, no. 084101, 2012.
  • [30] Wang B and Wei G-W, “Parameter optimization in differential geometry based solvation models,” Journal of Chemical Physics, vol. 143, p. 134119, 2015.
  • [31] Chen D and Wei GW, “Quantum dynamics in continuum for proton transport III: Generalized correlation,” J Chem. Phys, vol. 136, p. 134109, 2012.
  • [32] Chen D and Wei GW, “Quantum dynamics in continuum for proton transport—Generalized correlation,” J Chem. Phys, vol. 136, p. 134109, 2012.
  • [33] Wei G-W, Zheng Q, Chen Z, and Xia K, “Variational multiscale models for charge transport,” SIAM Review, vol. 54, no. 4, pp. 699–754, 2012.
  • [34] Wei GW, “Multiscale, multiphysics and multidomain models I: Basic theory,” Journal of Theoretical and Computational Chemistry, vol. 12, no. 8, p. 1341006, 2013.
  • [35] Chen D and Wei GW, “Quantum dynamics in continuum for proton transport I: Basic formulation,” Commun. Comput. Phys, vol. 13, pp. 285–324, 2013.
  • [36] Feng X, Xia K, Tong Y, and Wei G-W, “Geometric modeling of subcellular structures, organelles and large multiprotein complexes,” International Journal for Numerical Methods in Biomedical Engineering, vol. 28, pp. 1198–1223, 2012.
  • [37] Xia KL, Feng X, Tong YY, and Wei GW, “Multiscale geometric modeling of macromolecules I: Cartesian representation,” Journal of Computational Physics, vol. 275, pp. 912–936, 2014.
  • [38] Mu L, Xia K, and Wei G, “Geometric and electrostatic modeling using molecular rigidity functions,” Journal of Computational and Applied Mathematics, vol. 313, pp. 18–37, 2017.
  • [39] Nguyen DD and Wei GW, “The impact of surface area, volume, curvature and Lennard-Jones potential to solvation modeling,” Journal of Computational Chemistry, vol. 38, pp. 24–36, 2017.
  • [40] Kaczynski T, Mischaikow K, and Mrozek M, Computational homology. Springer-Verlag, 2004.
  • [41].Edelsbrunner H, Letscher D, and Zomorodian A, “Topological persistence and simplification,” Discrete Comput. Geom, vol. 28, pp. 511–533, 2001.
  • [42].Zomorodian A and Carlsson G, “Computing persistent homology,” Discrete Comput. Geom, vol. 33, pp. 249–274, 2005.
  • [43].Kasson PM, Zomorodian A, Park S, Singhal N, Guibas LJ, and Pande VS, “Persistent voids: a new structural metric for membrane fusion,” Bioinformatics, vol. 23, pp. 1753–1759, 2007.
  • [44].Dabaghian Y, Mémoli F, Frank L, and Carlsson G, “A topological paradigm for hippocampal spatial map formation using persistent homology,” PLoS computational biology, vol. 8, no. 8, p. e1002581, 2012.
  • [45].Gameiro M, Hiraoka Y, Izumi S, Kramar M, Mischaikow K, and Nanda V, “Topological measurement of protein compressibility via persistence diagrams,” Japan Journal of Industrial and Applied Mathematics, vol. 32, pp. 1–17, 2014.
  • [46].Xia KL and Wei GW, “Persistent homology analysis of protein structure, flexibility and folding,” International Journal for Numerical Methods in Biomedical Engineering, vol. 30, pp. 814–844, 2014.
  • [47].Xia KL and Wei GW, “Persistent topology for cryo-EM data analysis,” International Journal for Numerical Methods in Biomedical Engineering, vol. 31, p. e02719, 2015.
  • [48].Xia KL, Feng X, Tong YY, and Wei GW, “Persistent homology for the quantitative prediction of fullerene stability,” Journal of Computational Chemistry, vol. 36, pp. 408–422, 2015.
  • [49].Wang B and Wei GW, “Object-oriented persistent homology,” Journal of Computational Physics, vol. 305, pp. 276–299, 2016.
  • [50].Liu B, Wang B, Zhao R, Tong Y, and Wei G-W, “ESES: Software for Eulerian solvent excluded surface,” Journal of Computational Chemistry, vol. 38, no. 7, pp. 446–466, 2017.
  • [51].Cang ZX, Mu L, Wu K, Opron K, Xia K, and Wei G-W, “A topological approach to protein classification,” Molecular based Mathematical Biology, vol. 3, pp. 140–162, 2015.
  • [52].Cang ZX and Wei GW, “Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology,” Bioinformatics, vol. 33, pp. 3549–3557, 2017.
  • [53].Wu K and Wei GW, “Quantitative Toxicity Prediction Using Topology Based Multitask Deep Neural Networks,” Journal of Chemical Information and Modeling, vol. 58, pp. 520–531, 2018.
  • [54].Wu K, Zhao Z, Wang R, and Wei GW, “TopP-S: Persistent Homology-Based Multi-Task Deep Neural Networks for Simultaneous Predictions of Partition Coefficient and Aqueous Solubility,” Journal of Computational Chemistry, vol. 39, pp. 1444–1454, 2018.
  • [55].Hosoya H, “Topological index. a newly proposed quantity characterizing the topological nature of structural isomers of saturated hydrocarbons,” Bulletin of the Chemical Society of Japan, vol. 44, no. 9, pp. 2332–2339, 1971.
  • [56].Hansen PJ and Jurs PC, “Chemical applications of graph theory. part i. fundamentals and topological indices,” J. Chem. Educ, vol. 65, no. 7, p. 574, 1988.
  • [57].Newman M, Networks: an introduction. Oxford University Press, 2010.
  • [58].Bavelas A, “Communication patterns in task-oriented groups,” The Journal of the Acoustical Society of America, vol. 22, no. 6, pp. 725–730, 1950.
  • [59].Dekker A, “Conceptual distance in social network analysis,” Journal of Social Structure (JOSS), vol. 6, 2005.
  • [60].Bahar I, Atilgan AR, and Erman B, “Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential,” Folding and Design, vol. 2, pp. 173–181, 1997.
  • [61].Yang LW and Chng CP, “Coarse-grained models reveal functional dynamics–I. elastic network models–theories, comparisons and perspectives,” Bioinformatics and Biology Insights, vol. 2, pp. 25–45, 2008.
  • [62].Wei GW, Zhan M, and Lai CH, “Tailoring wavelets for chaos control,” Phys. Rev. Lett, vol. 89, p. 284103, 2002.
  • [63].Go N, Noguti T, and Nishikawa T, “Dynamics of a small globular protein in terms of low-frequency vibrational modes,” Proc. Natl. Acad. Sci, vol. 80, pp. 3696–3700, 1983.
  • [64].Tasumi M, Takeuchi H, Ataka S, Dwivedi AM, and Krimm S, “Normal vibrations of proteins: Glucagon,” Biopolymers, vol. 21, pp. 711–714, 1982.
  • [65].Brooks BR, Bruccoleri RE, Olafson BD, States D, Swaminathan S, and Karplus M, “Charmm: A program for macromolecular energy, minimization, and dynamics calculations,” J. Comput. Chem, vol. 4, pp. 187–217, 1983.
  • [66].Levitt M, Sander C, and Stern PS, “Protein normal-mode dynamics: Trypsin inhibitor, crambin, ribonuclease and lysozyme,” J. Mol. Biol, vol. 181, no. 3, pp. 423–447, 1985.
  • [67].Flory PJ, “Statistical thermodynamics of random networks,” Proc. Roy. Soc. Lond. A, vol. 351, pp. 351–378, 1976.
  • [68].Bahar I, Atilgan AR, Demirel MC, and Erman B, “Vibrational dynamics of proteins: Significance of slow and fast modes in relation to function and stability,” Phys. Rev. Lett, vol. 80, pp. 2733–2736, 1998.
  • [69].Atilgan AR, Durrell SR, Jernigan RL, Demirel MC, Keskin O, and Bahar I, “Anisotropy of fluctuation dynamics of proteins with an elastic network model,” Biophys. J, vol. 80, pp. 505–515, 2001.
  • [70].Hinsen K, “Analysis of domain motions by approximate normal mode calculations,” Proteins, vol. 33, pp. 417–429, 1998.
  • [71].Tama F and Sanejouand YH, “Conformational change of proteins arising from normal mode calculations,” Protein Eng, vol. 14, pp. 1–6, 2001.
  • [72].Cui Q and Bahar I, Normal mode analysis: theory and applications to biological and chemical systems. Chapman and Hall/CRC, 2010.
  • [73].Balaban AT, Chemical applications of graph theory. Academic Press, 1976.
  • [74].Trinajstic N, “Chemical graph theory,” Boca Raton, 1983.
  • [75].Schultz HP, “Topological organic chemistry. 1. graph theory and topological indices of alkanes,” Journal of Chemical Information and Computer Sciences, vol. 29, no. 3, pp. 227–228, 1989.
  • [76].Foulds LR, Graph theory applications. Springer Science & Business Media, 2012.
  • [77].Ozkanlar A and Clark AE, “Chemnetworks: A complex network analysis tool for chemical systems,” Journal of computational chemistry, vol. 35, no. 6, pp. 495–505, 2014.
  • [78].Di Paola L and Giuliani A, “Protein contact network topology: a natural language for allostery,” Current opinion in structural biology, vol. 31, pp. 43–48, 2015.
  • [79].Canutescu AA, Shelenkov AA, and Dunbrack RL, “A graph-theory algorithm for rapid protein side-chain prediction,” Protein science, vol. 12, no. 9, pp. 2001–2014, 2003.
  • [80].Ryslik GA, Cheng Y, Cheung K-H, Modis Y, and Zhao H, “A graph theoretic approach to utilizing protein structure to identify non-random somatic mutations,” BMC bioinformatics, vol. 15, no. 1, p. 86, 2014.
  • [81].Jacobs DJ, Rader AJ, Kuhn LA, and Thorpe MF, “Protein flexibility predictions using graph theory,” Proteins-Structure, Function, and Genetics, vol. 44, pp. 150–165, August 1 2001.
  • [82].Vishveshwara S, Brinda K, and Kannan N, “Protein structure: insights from graph theory,” Journal of Theoretical and Computational Chemistry, vol. 1, no. 01, pp. 187–211, 2002.
  • [83].Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, and Pande V, “Moleculenet: A benchmark for molecular machine learning,” arXiv preprint arXiv:1703.00564, 2017.
  • [84].Quan L, Lv Q, and Zhang Y, “Strum: structure-based prediction of protein stability changes upon single-point mutation,” Structural Bioinformatics, In press, 2016.
  • [85].Pires DEV, Ascher DB, and Blundell TL, “mcsm: predicting the effects of mutations in proteins using graph-based signatures,” Structural Bioinformatics, vol. 30, pp. 335–342, 2014.
  • [86].Park JK, Jernigan R, and Wu Z, “Coarse grained normal mode analysis vs. refined gaussian network model for protein residue-level structural fluctuations,” Bulletin of Mathematical Biology, vol. 75, pp. 124–160, 2013.
  • [87].Bramer D and Wei GW, “Weighted multiscale colored graphs for protein flexibility and rigidity analysis,” Journal of Chemical Physics, vol. 148, p. 054103, 2018.
  • [88].Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, and Bengio Y, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
  • [89].Xia KL, Opron K, and Wei GW, “Multiscale multiphysics and multidomain models — Flexibility and rigidity,” Journal of Chemical Physics, vol. 139, p. 194109, 2013.
  • [90].Opron K, Xia KL, and Wei GW, “Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis,” Journal of Chemical Physics, vol. 140, p. 234105, 2014.
  • [91].Nguyen DD, Xia KL, and Wei GW, “Generalized flexibility-rigidity index,” Journal of Chemical Physics, vol. 144, p. 234106, 2016.
  • [92].Wei GW, “Wavelets generated by using discrete singular convolution kernels,” Journal of Physics A: Mathematical and General, vol. 33, pp. 8577–8596, 2000.
  • [93].Soldea O, Elber G, and Rivlin E, “Global segmentation and curvature analysis of volumetric data sets using trivariate b-spline functions,” IEEE Trans. on PAMI, vol. 28, no. 2, pp. 265–278, 2006.
  • [94].Edelsbrunner H, “Weighted alpha shapes,” tech. rep, Champaign, IL, USA, 1992.
  • [95].Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, and Salakhutdinov R, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [96].Arjovsky M, Chintala S, and Bottou L, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, pp. 214–223, 2017.
  • [97].Mirza M and Osindero S, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
