PLOS Computational Biology. 2022 Apr 6;18(4):e1009943. doi: 10.1371/journal.pcbi.1009943

Dowker complex based machine learning (DCML) models for protein-ligand binding affinity prediction

Xiang Liu 1,2,3, Huitao Feng 2,4, Jie Wu 3,5, Kelin Xia 1,*
Editor: Joanna Slusky6
PMCID: PMC8985993  PMID: 35385478

Abstract

With great advances in experimental data, computational power and learning algorithms, artificial intelligence (AI) based drug design has begun to gain momentum. AI-based drug design holds great promise to revolutionize the pharmaceutical industry by significantly reducing the time and cost of drug discovery. However, a major issue remains for all AI-based learning models: efficient molecular representations. Here we propose, for the first time, Dowker complex (DC) based molecular interaction representations and Riemann Zeta function based molecular featurization. Molecular interactions between proteins and ligands (or other molecules) are modeled as Dowker complexes. A multiscale representation is generated by a filtration process, during which a series of DCs is generated at different scales. Combinatorial (Hodge) Laplacian matrices are constructed from these DCs, and the Riemann Zeta functions computed from their spectral information are used as molecular descriptors. To validate our models, we consider protein-ligand binding affinity prediction. Our DC-based machine learning (DCML) models, in particular the DC-based gradient boosting tree (DC-GBT), are tested on the three most commonly used datasets, i.e., PDBbind-2007, PDBbind-2013 and PDBbind-2016, and extensively compared with existing state-of-the-art models. We find that our DC-based descriptors achieve state-of-the-art results and outperform all machine learning models with traditional molecular descriptors. Our Dowker complex based machine learning models can also be used in other tasks in AI-based drug design and molecular data analysis.

Author summary

With the ever-increasing accumulation of chemical and biomolecular data, data-driven artificial intelligence (AI) models will usher in an era of faster, cheaper and more efficient drug design and drug discovery. However, unlike image, text, video and audio data, molecular data from chemistry and biology have far more complicated three-dimensional structures, as well as physical and chemical properties. Efficient molecular representations and descriptors are key to the success of machine learning models in drug design. Here, we propose, for the first time, Dowker complex based molecular representation and Riemann Zeta function based molecular featurization. To characterize complicated molecular structures and interactions at the atomic level, Dowker complexes are constructed. From them, intrinsic mathematical invariants are derived and used as molecular descriptors, which can be further combined with machine learning and deep learning models. Our model achieves state-of-the-art results in protein-ligand binding affinity prediction, demonstrating its great potential for other drug design and discovery problems.

Introduction

Featurization (or feature engineering) is of essential importance for AI-based drug design. The performance of quantitative structure activity/property relationship (QSAR/QSPR) models and machine learning models for biomolecular data analysis is largely determined by the design of proper molecular descriptors/fingerprints. Currently, more than 5000 types of molecular descriptors, based on molecular structural, chemical, physical and biological properties, have been proposed [1, 2]. Among these molecular features, structural descriptors are the most widely used and can be classified into one-dimensional (1D), two-dimensional (2D), three-dimensional (3D), and four-dimensional (4D) [1, 2]. In general, 1D molecular descriptors are atom counts, bond counts, molecular weight, fragment counts, functional group counts, and other summarized general properties. The 2D molecular descriptors are topological indices, graph properties, combinatorial properties, molecular profiles, autocorrelation coefficients, and other topological/graphic/combinatorial properties. The 3D molecular descriptors are molecular surface properties, volume properties, autocorrelation descriptors, substituent constants, quantum chemical descriptors, and other geometric or density-function related properties. The 4D chemical descriptors are usually generated from a dynamic process that covers various molecular configurations. Further, various molecular fingerprints have been proposed, including substructure key based fingerprints [3], path-based fingerprints [4, 5], circular fingerprints [6], pharmacophore fingerprints [7, 8], and autoencoded fingerprints. Different from molecular descriptors, a molecular fingerprint is a large vector of molecular features that are systematically generated based on molecular properties, in particular structural properties. Deep learning models, such as autoencoders, CNNs, and GNNs, have also been used in molecular fingerprint generation [9-13].

The generalizability and transferability of QSAR/QSPR and machine learning models are highly related to the molecular descriptors or fingerprints. Features that characterize more intrinsic and fundamental properties can be better shared between data and “understood” by machine learning models. Mathematical invariants, from geometry, topology, algebra, combinatorics and number theory, are highly abstract quantities that describe the most intrinsic and fundamental rules and properties in the natural sciences. In particular, topological and geometric invariant based molecular descriptors have achieved great success in various steps of drug design, including protein-ligand binding affinity prediction [14-18], protein stability change upon mutation prediction [19, 20], toxicity prediction [21], solvation free energy prediction [22, 23], partition coefficient and aqueous solubility prediction [24], binding pocket detection [25], and drug discovery [26]. These models have also demonstrated great advantages over traditional molecular representations in the D3R Grand Challenge [27-29]. Recently, persistent models, including hypergraph-based persistent homology [30, 31], persistent spectral [32], and persistent Ricci curvature [33-36], have been developed for molecular characterization and have delivered great performance in protein-ligand binding affinity prediction.

The Dowker complex (DC) was developed for the characterization of relations between two sets [37-39]. Mathematically, a Dowker complex is defined on two sets X and Y with a relation R, which is a subset of the product set X × Y. Elements in the same set form a simplex in the DC if they all have a relation with a common element from the other set. Note that only elements in the same set, i.e., either X or Y, can form simplexes. Stated differently, a simplex in a DC can never be formed among elements from both sets. In this way, a DC can be separated into two disjoint simplicial complexes, i.e., one with elements all from X and the other with elements all from Y. These two simplicial complexes share the same homology groups, homotopy groups, and homotopy types [37, 38]. Moreover, DCs are equivalent to neighborhood complexes (NCs) for all bipartite graphs. In fact, if the relations between two sets are represented by a bipartite graph, the associated DC is exactly the same as the NC. Further, the Riemann Zeta function, or Euler-Riemann Zeta function, is a mathematical function of a complex variable. The Riemann Zeta function plays a pivotal role in analytic number theory and has applications in physics, probability theory, and applied statistics. Mathematically, the Riemann Zeta function can be used to characterize intrinsic information of a system.

Here we propose, for the first time, Dowker complex based molecular interaction representations and Riemann Zeta function based molecular featurization. More specifically, a bipartite graph can be used to model the interactions between two molecules, such as a protein and a ligand. Mathematically, a bipartite graph can be viewed as a relation between two sets, and a Dowker complex can be generated naturally from it. Further, a DC has two disjoint components, which share the same homology groups. For a protein-ligand complex, the protein-based DC and the ligand-based DC have the same Betti numbers. Further, DC-based persistent spectral models can be constructed from a filtration process, and persistent Riemann Zeta functions are used as molecular descriptors or fingerprints. Our DC-based machine learning models, in particular the DC-based gradient boosting tree (DC-GBT), are extensively tested on the three most commonly used datasets from the well-established protein-ligand binding databank PDBbind. We find that our DC-GBT model achieves state-of-the-art results and is better than all machine learning models with traditional molecular descriptors.

Results

DC-based biomolecular interaction analysis

Molecular representation and featurization are of essential importance for the analysis of molecular data from materials, chemistry and biology. Mathematical invariant based molecular descriptors have greater transferability and thus have achieved better performance in AI-based drug design [20, 40-42]. Here we propose the first DC-based representations for molecular interaction analysis.

Bipartite graph-based molecular interaction characterization

Graph theory is widely used for the description and characterization of molecular structures and interactions. A molecular graph G = (V, E) is composed of a set of vertices V, with each vertex representing an atom, residue, motif, or even an entire molecule, and a set of edges E, representing interactions of various kinds, including covalent bonds, van der Waals, electrostatic, and other non-covalent forces. Both intra- and inter-molecular interactions, i.e., interactions within and between molecules, can be represented as bipartite graphs (also known as bigraphs or 2-mode networks). Mathematically, a bipartite graph G(V1, V2, E) has two vertex sets V1 and V2, and all its edges are formed only between the two vertex sets. Recently, bipartite-graph based interactive matrices have been used in machine learning models for drug design and achieved great success [20, 30-34, 40]. Mathematically, these interactive matrices, which are based on atomic distances and electrostatic interactions, can be transformed into weighted biadjacency matrices between protein and ligand atoms. More specifically, if we let VP = {vi|i = 0, 1, …, NP} and VL = {vj|j = 0, 1, …, NL} represent the coordinate sets of protein and ligand atoms respectively, the biadjacency matrix B of size NP × NL is defined as follows,

B(vi, vj) = wij,  vi ∈ VP and vj ∈ VL. (1)

The weights wij can be chosen as the Euclidean distances or electrostatic interactions [20]. Essentially, inter-molecular interactions between protein and ligand atoms are characterized in the above biadjacency matrix.

Two individual unipartite graphs G1 and G2 can be constructed from a bipartite graph G(V1, V2, E) through a projection process [43, 44]. More specifically, the unipartite graph G1 is generated from the vertex set V1, and its edges are defined between any two vertices that have a common neighborhood vertex in V2. Similarly, the unipartite graph G2 is based on the vertex set V2, and any two vertices that have a common neighborhood vertex in V1 form an edge in G2. Mathematically, the connection matrices for the unipartite graphs can be generated from the biadjacency matrix in Eq (1). The connection matrices for the protein and the ligand are BBT and BTB, respectively. Note that the two matrices are of different sizes. Further, based on the unipartite graphs G1 and G2, two flag complexes (or clique complexes), KF,1(G) and KF,2(G), can be constructed respectively. More specifically, in the two flag complexes, a k-simplex is formed among k + 1 vertices when any two of them are connected by an edge.
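As a sketch of the projection step (with a made-up 3 × 2 biadjacency matrix, not data from the paper), the protein-protein and ligand-ligand connection matrices can be computed with NumPy:

```python
import numpy as np

# Toy biadjacency matrix B for a hypothetical system with 3 protein
# atoms (rows) and 2 ligand atoms (columns); a nonzero entry w_ij
# marks a protein-ligand contact.  Illustrative values only.
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

# Projection onto each vertex set, as described above:
# protein-protein connection matrix (N_P x N_P) and
# ligand-ligand connection matrix (N_L x N_L).
protein_conn = B @ B.T
ligand_conn = B.T @ B

# Protein atoms 0 and 1 share ligand neighbour 0, so they are connected
# in G1; protein atoms 0 and 2 share no ligand neighbour, so they are not.
```

A nonzero off-diagonal entry of BBT (or BTB) indicates a shared neighbor, i.e., an edge of the corresponding unipartite graph.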

Bipartite graph-based DC models

Mathematically, a bipartite graph can be seen as a relation. Two Dowker complexes KD,1(G) and KD,2(G) can be naturally constructed from a bipartite graph G(V1, V2, E). The DC KD,1(G) is defined on V1, and its k-simplices are formed among k + 1 vertices that have “relations”, i.e., form edges, with a common vertex in V2. Similarly, the DC KD,2(G) is based on V2, and its k-simplices are formed among k + 1 vertices that are “related to” a common vertex in V1. Note that when vertices are related to a common vertex, they share the same common neighborhood vertex and a DC-based simplex is formed among them.
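A minimal sketch of this construction, assuming the relation is stored as a map from each second-set vertex to its first-set neighbors (toy data, not the authors' code):

```python
from itertools import combinations

def dowker_simplices(relation, max_dim=2):
    """Simplices of the Dowker complex K_D,1 on the first vertex set.

    `relation` maps each vertex of the second set to the subset of
    first-set vertices related to it; every nonempty subset of such a
    neighbourhood spans a simplex (illustrative sketch only).
    """
    simplices = set()
    for neighbours in relation.values():
        max_size = min(len(neighbours), max_dim + 1)
        for k in range(1, max_size + 1):
            for s in combinations(sorted(neighbours), k):
                simplices.add(s)
    return simplices

# Toy relation: two ligand atoms w1, w2 and their protein-atom neighbours.
rel = {"w1": {"a", "b", "c"}, "w2": {"c", "d"}}
S = dowker_simplices(rel)
# ('a', 'b', 'c') is a 2-simplex because a, b, c share the neighbour w1;
# a and d share no neighbour, so ('a', 'd') is not a simplex.
```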

Further, the Dowker theorem states that the homologies of KD,1(G) and KD,2(G) are isomorphic, i.e., Hp(KD,1(G)) ≅ Hp(KD,2(G)) (p > 0), and the 0-th homologies are also isomorphic if the bipartite graph G is connected [38, 39]. It is worth mentioning that the flag complexes KF,1(G) and KF,2(G) from the unipartite graphs are usually different from the DCs. Fig 1 illustrates the bipartite graph-based DCs and their persistent barcodes. The bipartite graph is generated from the phosphorus-phosphorus (P-P) interactions between the two chains of DNA 1D77. The corresponding DC has two disjoint components, one from chain A and the other from chain B. A distance-based filtration process is considered, and two persistent barcodes, for chain A and chain B, are generated. It can be observed that the β1 persistent barcodes are exactly the same. Note that the β0 persistent barcodes are not the same, as the bipartite graph is not always connected during the filtration process.

Fig 1. Dowker complex based representation for the atomic interactions between two chains of DNA 1D77.

Fig 1

Only the phosphorus (P) atoms of the DNA are considered. A bipartite graph is constructed between the two DNA chains, i.e., chain A and chain B, using a cutoff distance of 16.5 Å. The corresponding Dowker complex is generated and consists of two disjoint components, one from chain A and the other from chain B. The cutoff distance can be used as a filtration parameter, and two persistent barcodes are obtained. It can be seen that the β1 persistent barcodes are exactly the same for the two DCs from chain A and chain B. The β0 persistent barcodes are different because the bipartite graph is not always connected during the filtration process.

DC-based persistent spectral models

For all persistent models, including persistent homology/cohomology, persistent spectral and persistent functions, the key point is the filtration process. There are various ways to define the filtration parameter, leading to different filtration processes. For topology-based protein-ligand interaction models, we can define the filtration parameter f as the weight value of the biadjacency matrix in Eq (1). As the filtration value increases, a series of nested bipartite graphs is generated,

Gf0 ⊆ Gf1 ⊆ … ⊆ Gfn. (2)

Here f0 ⩽ f1 ⩽ … ⩽ fn are the filtration values. The corresponding DCs can be constructed accordingly as follows,

KD(Gf0) ⊆ KD(Gf1) ⊆ … ⊆ KD(Gfn). (3)

In fact, we have two disjoint series of nested DCs as follows,

KD,1(Gf0) ⊆ KD,1(Gf1) ⊆ … ⊆ KD,1(Gfn). (4)
KD,2(Gf0) ⊆ KD,2(Gf1) ⊆ … ⊆ KD,2(Gfn). (5)

Note that the first DC series {KD,1(Gfi)} is for the protein part; all of its vertices are protein atoms. In contrast, the second DC series {KD,2(Gfi)} is based entirely on ligand atoms. By Dowker’s theorem, these two DC series share the same homology groups, i.e., Hp(KD,1(Gfi)) ≅ Hp(KD,2(Gfi)) (p > 0, i = 1, 2, …, n).
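The nested bipartite graphs of Eq (2) can be sketched by thresholding a weighted biadjacency matrix (toy values here; in the actual model the weights are distances or electrostatic terms):

```python
import numpy as np

def nested_bipartite_graphs(B, filtration_values):
    """Edge sets of the nested bipartite graphs G_f0 ⊆ ... ⊆ G_fn.

    B is a weighted biadjacency matrix; an edge (i, j) enters the
    graph once B[i, j] <= f.  Illustrative sketch only.
    """
    graphs = []
    for f in filtration_values:
        edges = {(i, j)
                 for i in range(B.shape[0])
                 for j in range(B.shape[1])
                 if B[i, j] <= f}
        graphs.append(edges)
    return graphs

# Toy 2 x 2 weight matrix (e.g. pairwise distances in Angstrom).
B = np.array([[2.0, 5.0],
              [3.0, 4.0]])
gs = nested_bipartite_graphs(B, [2.5, 4.0, 6.0])
# Nestedness: each edge set contains the previous one.
assert gs[0] <= gs[1] <= gs[2]
```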

Persistent spectral (PerSpect) models are proposed to study the persistence and variation of the spectral information of topological representations during a filtration process [32]. This spectral information can be used as molecular descriptors or fingerprints and combined with machine learning models for drug design. Here we study DC-based persistent spectral models. From Eqs (4) and (5), two sequences of Hodge Laplacian matrices can be generated respectively (see Materials and methods). These matrices characterize the interactions between protein and ligand atoms at various scales. The spectral information derived from these Hodge Laplacian matrices is used for the characterization of protein-ligand interactions. Fig 2 illustrates the DC-based filtration process and the corresponding Hodge Laplacian matrices for the protein-ligand complex 2POG. From the bipartite sequence, two separate series of DCs are generated, based on protein atoms and ligand atoms respectively. We consider the 0-dimensional (0-D) and 1-dimensional (1-D) Hodge Laplacian matrices. Note that 0-D Hodge Laplacian matrices represent topological connections between vertices, while 1-D matrices characterize topological connections between edges. Similarly, higher-dimensional Hodge Laplacian matrices can be generated for higher-dimensional simplexes. Further, the multiplicity of the zero eigenvalues of the k-th dimensional Hodge Laplacian matrix is βk, i.e., the k-th Betti number. Additionally, information from the non-zero eigenvalues indicates “geometric” properties of the simplicial complexes [32].
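In the 0-D case, the Hodge Laplacian reduces to the familiar graph Laplacian, and the multiplicity of its zero eigenvalue gives β0, the number of connected components. A minimal sketch (toy graph, not the paper's data):

```python
import numpy as np

def hodge_laplacian_0(n_vertices, edges):
    """0-dimensional Hodge (graph) Laplacian L0 = D - A.

    The multiplicity of its zero eigenvalue equals beta_0, the number
    of connected components (illustrative sketch).
    """
    L = np.zeros((n_vertices, n_vertices))
    for i, j in edges:
        L[i, i] += 1
        L[j, j] += 1
        L[i, j] -= 1
        L[j, i] -= 1
    return L

# Two components: a triangle {0, 1, 2} and an isolated edge {3, 4}.
L0 = hodge_laplacian_0(5, [(0, 1), (1, 2), (0, 2), (3, 4)])
eigvals = np.linalg.eigvalsh(L0)
beta0 = int(np.sum(np.isclose(eigvals, 0.0)))
print(beta0)  # → 2
```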

Fig 2. Persistent combinatorial Laplacian matrixes for Dowker complex from C-C pair of PDBID 2POG.

Fig 2

As shown in the figure, based on the filtration process of bipartite graphs, a filtration of Dowker complexes can be generated and further divided into two disjoint filtration processes, one in the protein and one in the ligand. For each filtration process, the two sequences of Laplacian matrices in dimensions 0 and 1 are depicted. The cutoff for extracting the binding core region is 5 Å; the filtration values are 3.5 Å, 4 Å, 4.2 Å, 4.5 Å and 5 Å. For the 0-D Laplacian matrices, as the filtration value increases, the matrix size stays the same, off-diagonal entries decrease from 0 to -1 until all become -1, and diagonal entries increase until all reach the number of 0-simplexes minus 1. For the 1-D Laplacian matrices, the matrix size increases until it reaches a constant; off-diagonal entries take the nonzero values 1 and -1 according to edge orientation, the number of nonzero off-diagonal entries first increases and then decreases until all become zero, and diagonal entries increase until all reach the number of 0-simplexes.

Spectral information, i.e., eigenvalues and eigenvectors, from our PerSpect models cannot be directly used in machine learning models, because their sizes vary dramatically during the filtration process. As seen in Fig 2, the number of 1-simplexes (edges) increases greatly during the filtration; accordingly, the size of the 1-D Hodge Laplacian matrices and the number of associated eigenvalues and eigenvectors increase with the filtration. In our PerSpect models, a series of persistent attributes is considered [32]. The persistent attributes are statistical and combinatorial properties of the eigenvalues of the sequences of Hodge Laplacian matrices. They characterize the persistence and variation of spectral information during the filtration.

Here we can use eigenvalue-based Riemann Zeta functions. More specifically, for a set of eigenvalues {λ1, λ2, …, λn}, the Riemann Zeta functions are defined as,

ζ(s) = ∑i=1,…,n 1/λi^s.

They can be used as molecular features for machine learning. In our model, we consider 11 different Riemann Zeta functions, i.e., ζ(s) with s = 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5. Note that these Riemann Zeta functions are related to different persistent spectral moments.
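A sketch of the eigenvalue-based zeta descriptor; here we skip zero eigenvalues so that both negative and positive powers of s are defined (a choice made for this sketch; the paper may treat zeros differently):

```python
import numpy as np

def spectral_zeta(eigenvalues, s, tol=1e-10):
    """Zeta-style descriptor zeta(s) = sum_i 1 / lambda_i^s.

    Zero eigenvalues are dropped so the sum is defined for all the
    exponents s = 5, ..., -5 used in the text (illustrative sketch).
    """
    lam = np.asarray(eigenvalues, dtype=float)
    lam = lam[np.abs(lam) > tol]
    return float(np.sum(lam ** (-s)))

eigs = [1.0, 2.0, 4.0]
# s = -1 recovers the trace (sum of eigenvalues): 1 + 2 + 4 = 7.
print(spectral_zeta(eigs, -1))  # → 7.0
```

As the comment notes, negative exponents recover the spectral moments (trace, trace of the square, and so on), which is the connection to persistent spectral moments mentioned above.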

DC-based machine learning models for protein-ligand binding affinity prediction

The prediction of protein-ligand binding affinity is arguably the most important step in virtual screening and AI-based drug design. Here we consider DC-based machine learning models. To characterize the detailed interactions between protein and ligand atoms, we consider element-specific bipartite graph representations. More specifically, we decompose the protein atoms in the binding core region into four groups according to their atom types, i.e., C, N, O, and S. Ligand atoms in the binding core region are decomposed into nine groups according to their atom types, i.e., C, N, O, S, P, F, Cl, Br, and I. In this way, there are in total 36 = 4 × 9 atom-combination groups, and protein-ligand interactions can be represented by 36 types of bipartite graphs from these atom combinations.

The bipartite graphs can be generated based on atomic distances or electrostatic interactions. As stated above, the bipartite biadjacency matrix is represented as in Eq (1). There are two different ways to define the weights. One is based on the Euclidean distance between atoms, that is, wij = d(vi, vj), with d(vi, vj) the distance between atoms vi and vj. The other is based on atomic electrostatic interactions, that is, wij = 1/(1 + exp(-c qi qj/d(vi, vj))), with qi and qj the partial charges of atoms vi and vj and the parameter c a constant (usually taken as 100). Given the importance of hydrogen atoms for electrostatic interactions, H atoms are also taken into consideration, and a total of 50 atom-combination types are considered for electrostatic interactions. The software PDB2PQR [45] is used to generate partial charges for the protein, while the partial charges of the ligand can be found in the PDBbind database.
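Assuming the electrostatic weight has the logistic form wij = 1/(1 + exp(-c qi qj/d(vi, vj))) with c = 100, as described above, it can be sketched as:

```python
import math

def electrostatic_weight(qi, qj, dij, c=100.0):
    """Electrostatic edge weight w_ij = 1 / (1 + exp(-c * qi * qj / dij)).

    qi, qj are partial charges, dij the interatomic distance; c = 100
    follows the text.  Illustrative sketch of the weighting scheme.
    """
    return 1.0 / (1.0 + math.exp(-c * qi * qj / dij))

# Opposite partial charges (attractive pair) give a weight below 0.5;
# like charges give a weight above 0.5; zero charge gives exactly 0.5.
w_attract = electrostatic_weight(-0.4, 0.3, 3.0)
w_repel = electrostatic_weight(0.4, 0.3, 3.0)
```

The logistic form maps the unbounded product qi qj/d into (0, 1), which makes it directly usable as a filtration parameter on the interval [0, 1], matching the electrostatic filtration range described below.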

We consider the three most commonly used datasets from the PDBbind databank, i.e., PDB-v2007, PDB-v2013 and PDB-v2016, as benchmarks for our DC-based machine learning models. The detailed training and testing information is listed in Table 1. The binding core region is defined by a cutoff distance of 10 Å, that is, all protein atoms within 10 Å of any ligand atom. For distance-based DC models, the filtration goes from 2 Å to 10 Å with a step of 0.1 Å, and for electrostatic-based DC models, the filtration goes from 0 to 1 with a step of 0.02. We only consider 0-D persistent spectral information. In this way, the sizes of the feature vectors are 63360 = 36 (atom combinations) × 80 (filtration values) × 11 (Riemann Zeta functions) × 2 (two DCs) for the distance-based model, 55000 = 50 (atom combinations) × 50 (filtration values) × 11 (Riemann Zeta functions) × 2 (two DCs) for the electrostatic-based model, and 118360 = 63360 + 55000 for the combined model. A gradient boosting tree is used to alleviate overfitting. The GBT settings are listed in Table 2.

Table 1. Detailed information of the three PDBbind databases, i.e., PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016.

Dataset Refined set Training set Test set (Core set)
PDBbind-v2007 1300 1105 195
PDBbind-v2013 2959 2764 195
PDBbind-v2016 4057 3772 285

Table 2. The parameters for our DC-based gradient boosting tree (GBT) models.

No. of Estimators Learning rate Max depth Subsample
40000 0.001 6 0.7
Min_samples_split Loss function Max features Repetitions
2 Least square SQRT 10

Scoring power

The results for our DC-GBT models are listed in Table 3. Note that 10 independent regressions are performed and the median values of the Pearson correlation coefficient (PCC) and root mean square error (RMSE) are taken as the final performance of our model. Further, we systematically compare our model with existing models based on traditional molecular descriptors [46-54]. Detailed comparison results can be found in Fig 3. It can be seen that our model outperforms all the other machine learning models with traditional molecular descriptors, for all three datasets.

Table 3. The PCCs and RMSEs (pKd/pKi) for our DC-GBT models in three test cases, i.e., PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016.

Three DC-GBT models are considered, with features from different types of bipartite graphs. The DC-GBT(Dist) model uses features from distance-based bipartite graphs; the DC-GBT(Chrg) model uses features from electrostatic-based bipartite graphs; and the DC-GBT(Dist+Chrg) model uses features from both.

Dataset Dist Chrg Dist+Chrg
PDBbind-v2007 0.816(1.416) 0.811(1.437) 0.824(1.402)
PDBbind-v2013 0.789(1.457) 0.790(1.456) 0.799(1.432)
PDBbind-v2016 0.836(1.270) 0.834(1.284) 0.843(1.255)
Average 0.813(1.377) 0.812(1.385) 0.822(1.357)
Fig 3. Performance comparison between our models and other models.

Fig 3

The comparison of PCCs between our model and other molecular descriptor based models, for the prediction of protein-ligand binding affinity. The PCCs are calculated based on the core set (test set) of PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016.

Further, we compare our DC-GBT model with advanced mathematics-based machine learning models [14, 18, 20, 30]. The results are presented in Table 4. Our DC-GBT model ranks second on both the PDBbind-v2016 and PDBbind-v2013 datasets (after TopBP). Note that the accuracy of our DC-based models could be further improved if convolutional neural network models, such as the one used in TopBP, were designed and employed.

Table 4. The comparison of our DC-GBT model with advanced mathematics-based machine learning models [14, 18, 20, 30, 32, 33].

Note that values marked with * use the PDBbind-v2016 core set (N = 290).

Model PDBbind-v2007 PDBbind-v2013 PDBbind-v2016 Average
AGL-Score 0.830 0.792 0.833 0.818
HPC-GBT 0.829 0.784 0.831 0.815
TNet-BP 0.826 N/A 0.810* N/A
TopBP 0.827 0.808 0.861* 0.832
PerSpect 0.836 0.793 0.840 0.823
OPRC 0.821 0.789 0.838 0.816
DC-GBT 0.824 0.799 0.843 0.822

More recently, several deep learning models for protein-ligand binding affinity prediction have been proposed, such as the graphDelta model [55], ECIF model [56], OnionNet-2 model [57], DeepAtom model [58] and others [54, 59-64]. Note that these newer models usually employ a larger training set with extra data from the PDBbind general sets. Details of the training sets, testing sets, and performance (PCC) of these models are listed in Table 5.

Table 5. The performance in terms of PCCs and RMSEs (pKd/pKi) for recently proposed models using different training sets [54-64].

Note that values marked with * use the PDBbind-v2016 core set (N = 290), and values marked with + use the PDBbind-v2013 core set (N = 180) and the PDBbind-v2016 core set (N = 276).

Model Training set Testing set1 core(PDB-v2013) Testing set2 core(PDB-v2016)
graphDelta PDB-v2018(8766) 0.87(1.05)
ECIF PDB-v2019(9299) 0.866(1.169)
OnionNet-2 PDB-v2019(>9000) 0.821(1.357) 0.864(1.164)
DeepAtom PDB-v2018(9383) 0.831(1.232)*
Ligand-based PDB-v2018(11663) 0.780(N/A)+ 0.821(N/A)+
SE-OnionNet PDB-v2018(11663) 0.812(1.692) 0.83(N/A)
DeepDTAF PDB-v2016(11906) N/A(1.443)*
Deep Fusion PDB-v2016(9226) 0.803(1.327*)

Docking power

We test the docking power of our model, i.e., its ability to identify native poses among those generated by docking software [65], on the CASF-2013 benchmark. There are in total 195 test ligands in CASF-2013; each ligand has 100 poses generated by three docking programs, GOLD v5.1, Surflex-Dock in SYBYL v8.1 and MOE v2011. A pose is considered native if its RMSD with respect to the true binding pose is less than 2 Å. Detailed RMSD information for all the ligands can be found in CASF-2013. If the pose with the highest predicted binding energy is a native one, the prediction is regarded as successful. Once this process has been performed for all 195 test ligands, an overall success rate can be computed for the given scoring function.
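The per-ligand docking-power criterion described above can be sketched as follows (hypothetical scores and RMSDs, for illustration only):

```python
def docking_success(predicted_scores, rmsds, threshold=2.0):
    """CASF-style docking-power check for one ligand.

    Success iff the pose with the highest predicted binding score has
    RMSD below the threshold (2 A) from the crystal pose, as in the
    criterion above.  Illustrative sketch.
    """
    best = max(range(len(predicted_scores)),
               key=predicted_scores.__getitem__)
    return rmsds[best] < threshold

scores = [6.1, 7.8, 5.4]   # hypothetical predicted binding energies
rmsds = [4.2, 1.1, 6.0]    # RMSD of each pose to the native pose
print(docking_success(scores, rmsds))  # → True
```

Running this check over all 195 ligands and averaging gives the overall docking-power success rate.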

For each ligand, an individual training and testing process is needed. We repeat our DC-GBT(Dist) model for each of the 195 ligands, following the procedure in the work [18]. Note that GOLD v5.6.3 [66] is used to generate 1000 training poses for each ligand. These poses and their scores are available at https://weilab.math.msu.edu/AGL-Score.

In our implementation, ten independent regressions are performed for each ligand. The ligand is regarded as a successful one if at least three regressions successfully identify the native poses. In this case, our success rate can reach 88%. If we use at least six successful regressions as standard, our success rate drops slightly to 86%.

Screening power

The screening power of a scoring function is its ability to identify the true binders of a given target protein among decoy structures. We test our model on the CASF-2013 benchmark. There are in total 65 different proteins in CASF-2013. For each protein, there are at least three true binders, while the rest of the 195 ligands are regarded as decoys. There are two kinds of screening power measurements. One is the enrichment factor (EF) among the top x% ranked molecules,

EFx% = (number of true binders among the top x% ranked molecules) / ((total number of true binders for the given protein) × x%),

where the top-ranked molecules are the predicted candidates with the highest binding energies. The average EF value over all 65 proteins is used to assess the screening power of a scoring function. The other measurement is the success rate of identifying the best true binder. For each target protein, if the best binder is found among the top x% ranked molecules, the protein is counted as a success. The overall success rate is then the number of successful proteins divided by 65.
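The EF computation can be sketched as follows (a hypothetical ranking of 100 molecules with 5 true binders; not data from the benchmark):

```python
def enrichment_factor(ranked_is_binder, n_true_binders, x_percent):
    """EF_x% = (#true binders in top x%) / (total true binders * x%).

    `ranked_is_binder` lists, from highest to lowest predicted score,
    whether each molecule is a true binder (sketch of the formula above).
    """
    n_top = max(1, round(len(ranked_is_binder) * x_percent / 100))
    hits = sum(ranked_is_binder[:n_top])
    return hits / (n_true_binders * x_percent / 100)

# 100 ranked molecules, 5 true binders in total, 2 of them in the
# top 10% (first 10 positions).
ranked = [True, True] + [False] * 8 + [True] * 3 + [False] * 87
print(enrichment_factor(ranked, 5, 10))  # → 4.0
```

An EF of 1 corresponds to random ranking; values well above 1 indicate that true binders are concentrated near the top of the list.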

For each protein, an individual training and testing process is considered. Our DC-GBT(Dist) model is used for each of the 65 proteins, following [18]. More specifically, for a given target protein, AutoDock Vina is used to dock all the ligands in the PDBbind-v2015 refined set, excluding the core set and the true binders of this protein. This procedure gives rise to a few thousand training poses and associated energy labels for each target protein. The binding scores (kcal/mol) generated by AutoDock Vina are converted to binding energies (pKd) by multiplying by a constant, -0.7335. The binders in the refined set that do not bind to the target protein are regarded as decoys, and their binding energies should be smaller than those of the true binders. Therefore, if the energy of a decoy generated by AutoDock Vina is higher than the lower bound of the true binders’ energies, the decoy is relabeled with that lower bound. Note that this procedure may cause many different decoys to share the same label. The binding energy of a protein-ligand complex is usually a positive value, so we exclude entries with negative binding energies from the training sets for all 65 protein cases. The poses and associated scores can be found at https://weilab.math.msu.edu/AGL-Score.

In our implementation, ten independent regressions are performed for each protein. As in the PDBbind dataset, two decimal places are kept for the predicted binding energies of the 195 test ligands. For each regression, an EF value is obtained and it is determined whether the protein counts as a success. The average EF value over the ten regressions is taken as the final EF value for this protein in our model. Note that many entries in the training set share the same labels, so our model may predict the same binding energy for different ligands. In this case, we take the first- and second-ranked ligands as the top 1% candidates; hence the number of top 1% candidates may be larger than 2, and we set the EF to 66.6. For the overall success rate, if at least two of the ten regressions assert that the given protein is a success, the protein is regarded as successful. In this case, our success rate reaches 68%. If at least six successful regressions are used as the standard, our success rate drops to about 58%.

Note that in our DC-GBT model, we consider target-specific scoring models and train them separately for the scoring, docking and screening power tests. Previously, general scoring function models were developed, i.e., the same general scoring function was used for all tasks. Recently, data-driven learning models make use of different types of training datasets to improve the performance of scoring functions. In particular, the incorporation of decoy data in the training set can significantly improve the docking and screening power [67-72]. In our DC-GBT models, different training sets are used, which results in different state-of-the-art scoring models for the scoring, docking and screening power tests.

Discussion

Machine learning models have made tremendous progress in text, video, audio and image data analysis. In particular, convolutional neural network (CNN) models have achieved revolutionary advances in the analysis of image data. However, molecular data from material, chemical and biological systems are fundamentally different from text and image data, as their properties are usually directly determined by their topological structures. Persistent models, including persistent homology/cohomology, persistent functions, and persistent spectral models, provide a series of highly effective molecular descriptors that not only preserve intrinsic structural information, but also capture molecular multiscale properties. Here we propose Dowker complex based machine learning models for drug design. The Dowker complex is used for molecular interaction representation, and Riemann zeta functions are defined on persistent spectral models and further used as molecular descriptors. Our Dowker complex based machine learning models have achieved state-of-the-art results for protein-ligand binding affinity prediction. They can also be used in AI-based drug design and other molecular data analysis.

Materials and methods

Our model contains two essential components, i.e., a DC-based molecular representation and DC-based PerSpect models. For a molecular interaction-based bipartite graph, the associated DC can be decomposed into two disjoint DCs that have the same homology groups. The DC-based Hodge-Laplacian matrices and Riemann zeta functions can then be constructed and used to generate molecular features for machine learning models.

DC-based persistent homology

The Dowker complex is based on the "relations" between two sets and was originally developed to explore the homology of relations [38]. Mathematically, a relation is equivalent to a bipartite graph, so for each bipartite graph a Dowker complex can be naturally constructed. More specifically, let G = (V1, V2, E) be a connected bipartite graph, where V1 and V2 are two vertex sets and E is the edge set, with edges formed only between V1 and V2. The DC KD(G) = KD,1(G) ∪ KD,2(G) has two disjoint components KD,1(G) and KD,2(G), which are defined as follows:

  • KD,1(G): for a set of vertices {x_{i_0}, x_{i_1}, …, x_{i_p}} in V1, a p-simplex is formed among these vertices in KD,1(G) if there exists a vertex y ∈ V2 such that {(x_{i_m}, y) | 0 ≤ m ≤ p} ⊆ E.

  • KD,2(G): for a set of vertices {y_{i_0}, y_{i_1}, …, y_{i_p}} in V2, a p-simplex is formed among these vertices in KD,2(G) if there exists a vertex x ∈ V1 such that {(x, y_{i_m}) | 0 ≤ m ≤ p} ⊆ E.

An example can be found in Fig 4. It can be seen that the Dowker complex has exactly two disjoint components, one on the black points and the other on the green points, and a simplex is formed whenever its vertices have a common neighbor vertex in the bipartite graph. For the triangle on the black points, for instance, its three vertices have a common green point as a neighbor in the bipartite graph. We have H_p(K_{D,1}(G)) ≅ H_p(K_{D,2}(G)) for all p ≥ 0; in fact, K_{D,1}(G) and K_{D,2}(G) are homotopy equivalent.

Fig 4. A bipartite graph and its associated Dowker complex.

Fig 4

It can be seen that there are two disjoint components in DC, one is from the black points and the other is from the green points. Note that a triangle (2-simplex) is formed among the black point set in DC, as the corresponding three black vertices have a common neighbor blue vertex in the bipartite graph.
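The component K_{D,1}(G) can be built directly from the definition: for each vertex y in V2, its neighborhood in V1 spans a simplex, and every nonempty subset of that neighborhood is a simplex as well. A minimal sketch (the edge-set encoding and names are ours):

```python
from itertools import combinations

def dowker_complex(edges):
    # edges: set of (x, y) pairs of a bipartite graph G = (V1, V2, E).
    # Returns the simplices of K_{D,1}(G), each as a sorted tuple of
    # V1-vertices: vertices span a simplex iff they have a common
    # neighbor y in V2.
    neighborhoods = {}
    for x, y in edges:
        neighborhoods.setdefault(y, set()).add(x)
    simplices = set()
    for nbhd in neighborhoods.values():
        for p in range(1, len(nbhd) + 1):
            for simplex in combinations(sorted(nbhd), p):
                simplices.add(simplex)
    return simplices

# Three V1-vertices sharing the common neighbor 'y1' form a triangle.
E = {('a', 'y1'), ('b', 'y1'), ('c', 'y1'), ('c', 'y2'), ('d', 'y2')}
K1 = dowker_complex(E)
```

K_{D,2}(G) is obtained the same way with the roles of V1 and V2 swapped.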

Finally, if we construct the bipartite graph filtration process as in Eq (2) and induce the DC-based filtration as in Eq (3), two separate DC sequences are generated, as in Eqs (4) and (5). The corresponding persistent barcodes of these two DC sequences are exactly the same.

DC-based PerSpect models

Persistent spectral theory studies the evolution of the spectral information of the combinatorial Laplacian matrices associated with a filtration process. An oriented DC is needed for the construction of the combinatorial Laplacian matrices, but different orientations share the same spectral information. We therefore define an orientation based on the sequence of atoms (as in the PDB file) for simplicity.

For an oriented DC K_D = {δ_i^k; k = 0, 1, …; i = 1, 2, …}, its k-th boundary matrix B_k can be defined as follows,

B_k(i, j) =
  1,  if δ_i^{k-1} ⊂ δ_j^k and δ_i^{k-1} ∼ δ_j^k,
  -1, if δ_i^{k-1} ⊂ δ_j^k and δ_i^{k-1} ≁ δ_j^k,
  0,  if δ_i^{k-1} ⊄ δ_j^k.

Here δ_i^{k-1} ⊂ δ_j^k means that δ_i^{k-1} is a face of δ_j^k, and δ_i^{k-1} ⊄ δ_j^k means the opposite. The notation δ_i^{k-1} ∼ δ_j^k means the two simplices have the same orientation, i.e., they are similarly oriented, and δ_i^{k-1} ≁ δ_j^k means the opposite.

The k-th Laplacian matrix is defined as follows,

L_k = B_k^T B_k + B_{k+1} B_{k+1}^T.

More specifically, L0 can be expressed explicitly as,

L_0(i, j) =
  d(δ_i^0), if i = j,
  -1,       if i ≠ j and δ_i^0 ∼_U δ_j^0,
  0,        if i ≠ j and δ_i^0 ≁_U δ_j^0.   (8)

Further, Lk(k > 0) can be expressed as,

L_k(i, j) =
  d(δ_i^k) + k + 1, if i = j,
  1,  if i ≠ j, δ_i^k ≁_U δ_j^k, δ_i^k ∼_L δ_j^k and δ_i^k ∼ δ_j^k,
  -1, if i ≠ j, δ_i^k ≁_U δ_j^k, δ_i^k ∼_L δ_j^k and δ_i^k ≁ δ_j^k,
  0,  if i ≠ j and either δ_i^k ∼_U δ_j^k or δ_i^k ≁_L δ_j^k.

Here d(δ_i^k) is the (upper) degree of the k-simplex δ_i^k, i.e., the number of (k + 1)-simplices of which δ_i^k is a face. The notation δ_i^k ∼_U δ_j^k means the two simplices are upper adjacent, i.e., they are faces of a common (k + 1)-simplex, and δ_i^k ≁_U δ_j^k means the opposite. The notation δ_i^k ∼_L δ_j^k means the two simplices are lower adjacent, i.e., they share a common (k - 1)-simplex as a face, and δ_i^k ≁_L δ_j^k means the opposite. The notation δ_i^k ∼ δ_j^k means the two simplices have the same orientation, i.e., they are similarly oriented, and δ_i^k ≁ δ_j^k means the opposite.
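The boundary matrices and Hodge Laplacians above can be assembled directly from a list of simplices. The sketch below is our own minimal implementation (orientations are the ones induced by sorted vertex order, as described above), illustrated on a filled triangle:

```python
import numpy as np

def boundary_matrix(faces, simplices):
    # faces: list of (k-1)-simplices; simplices: list of k-simplices,
    # each a sorted tuple of vertex indices. Entry (i, j) is (-1)^m when
    # faces[i] is obtained from simplices[j] by deleting its m-th vertex,
    # matching the orientation convention induced by the vertex ordering.
    B = np.zeros((len(faces), len(simplices)))
    index = {f: i for i, f in enumerate(faces)}
    for j, s in enumerate(simplices):
        for m in range(len(s)):
            face = s[:m] + s[m + 1:]
            B[index[face], j] = (-1) ** m
    return B

def hodge_laplacian(B_k, B_k1):
    # L_k = B_k^T B_k + B_{k+1} B_{k+1}^T
    return B_k.T @ B_k + B_k1 @ B_k1.T

# A filled triangle (one 2-simplex) with its edges and vertices.
verts = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
tris = [(0, 1, 2)]
B1 = boundary_matrix(verts, edges)
B2 = boundary_matrix(edges, tris)
L1 = hodge_laplacian(B1, B2)
```

For this example each edge has upper degree 1, so the diagonal of L_1 is d + k + 1 = 1 + 1 + 1 = 3, and all edge pairs are upper adjacent, so the off-diagonal entries vanish, in agreement with the formula above.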

In our model, we consider Riemann zeta functions on the spectra as our persistent attributes. More specifically, for a set of eigenvalues {λ_1, λ_2, …, λ_n}, the spectral moment of the simplicial complex can be defined as the zeta function ζ(s) = Σ_{i=1}^n 1/λ_i^s. We then use the persistent spectral moments as the persistent attributes for machine learning.
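Given the eigenvalues of a Laplacian matrix, the zeta descriptor can be computed as below. This is a sketch with our own naming; we restrict the sum to nonzero eigenvalues to keep it finite (our assumption, since zero eigenvalues would make 1/λ^s diverge), and the s values shown are illustrative:

```python
import numpy as np

def spectral_zeta(eigenvalues, s, tol=1e-10):
    # zeta(s) = sum_i 1 / lambda_i^s over the nonzero eigenvalues;
    # near-zero eigenvalues (below tol) are skipped to keep the sum finite.
    lam = np.asarray(eigenvalues, dtype=float)
    lam = lam[lam > tol]
    return float(np.sum(lam ** (-float(s))))

# Example: the 0-Laplacian spectrum of a single-edge graph is {0, 2}.
zeta_1 = spectral_zeta([0.0, 2.0], s=1)
zeta_2 = spectral_zeta([0.0, 2.0], s=2)
```

Evaluating ζ(s) at a few fixed s values for each Laplacian along the filtration then yields a fixed-length feature vector.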

Data Availability

The PDBbind datasets are available at http://www.pdbbind.org.cn/, and the code is available on GitHub at https://github.com/LiuXiangMath/Dowker-Complex-Based-ML.

Funding Statement

This work was supported in part by Nanyang Technological University Startup Grant M4081842 and Singapore Ministry of Education Academic Research fund Tier 1 RG109/19, MOE-T2EP20120-0013 and MOE-T2EP20220-0010. The first author (XL) was supported by Nankai Zhide foundation. The second author (HF) was supported by Natural Science Foundation of China (NSFC grant no. 11931007, 11221091, 11271062, 11571184). The third author (JW) was supported by Natural Science Foundation of China (NSFC grant no. 11971144) and High-level Scientific Research Foundation of Hebei Province. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Puzyn T, Leszczynski J, Cronin MT. Recent advances in QSAR studies: methods and applications. vol. 8. Springer Science & Business Media; 2010. [Google Scholar]
  • 2. Lo YC, Rensi SE, Torng W, Altman RB. Machine learning in chemoinformatics and drug discovery. Drug discovery today. 2018;23(8):1538–1546. doi: 10.1016/j.drudis.2018.05.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of MDL keys for use in drug discovery. Journal of chemical information and computer sciences. 2002;42(6):1273–1280. doi: 10.1021/ci010132r [DOI] [PubMed] [Google Scholar]
  • 4. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR. Open Babel: An open chemical toolbox. Journal of cheminformatics. 2011;3(1):33. doi: 10.1186/1758-2946-3-33 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Hall LH, Kier LB. Electrotopological state indices for atom types: a novel combination of electronic, topological, and valence state information. Journal of Chemical Information and Computer Sciences. 1995;35(6):1039–1045. doi: 10.1021/ci00028a014 [DOI] [Google Scholar]
  • 6. Rogers D, Hahn M. Extended-connectivity fingerprints. Journal of chemical information and modeling. 2010;50(5):742–754. doi: 10.1021/ci100050t [DOI] [PubMed] [Google Scholar]
  • 7.Landrum G. RDKit: Open-source cheminformatics. 2006;.
  • 8. Stiefl N, Watson IA, Baumann K, Zaliani A. ErG: 2D pharmacophore descriptions for scaffold hopping. Journal of chemical information and modeling. 2006;46(1):208–220. doi: 10.1021/ci050457y [DOI] [PubMed] [Google Scholar]
  • 9. Merkwirth C, Lengauer T. Automatic generation of complementary descriptors with molecular graph networks. Journal of chemical information and modeling. 2005;45(5):1159–1168. doi: 10.1021/ci049613b [DOI] [PubMed] [Google Scholar]
  • 10. Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, et al. Convolutional networks on graphs for learning molecular fingerprints. In: Advances in neural information processing systems; 2015. p. 2224–2232. [Google Scholar]
  • 11. Coley CW, Barzilay R, Green WH, Jaakkola TS, Jensen KF. Convolutional embedding of attributed molecular graphs for physical property prediction. Journal of chemical information and modeling. 2017;57(8):1757–1772. doi: 10.1021/acs.jcim.6b00601 [DOI] [PubMed] [Google Scholar]
  • 12. Xu Y, Pei J, Lai L. Deep learning based regression and multiclass models for acute oral toxicity prediction with automatic chemical feature extraction. Journal of chemical information and modeling. 2017;57(11):2672–2685. doi: 10.1021/acs.jcim.7b00244 [DOI] [PubMed] [Google Scholar]
  • 13. Winter R, Montanari F, Noé F, Clevert DA. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chemical science. 2019;10(6):1692–1701. doi: 10.1039/c8sc04175j [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Cang ZX, Wei GW. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLOS Computational Biology. 2017;13(7):e1005690. doi: 10.1371/journal.pcbi.1005690 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Cang ZX, Wei GW. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. International journal for numerical methods in biomedical engineering. 2017;. [DOI] [PubMed] [Google Scholar]
  • 16. Nguyen DD, Xiao T, Wang ML, Wei GW. Rigidity Strengthening: A Mechanism for Protein–Ligand Binding. Journal of chemical information and modeling. 2017;57(7):1715–1721. doi: 10.1021/acs.jcim.7b00226 [DOI] [PubMed] [Google Scholar]
  • 17. Cang ZX, Wei GW. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. International journal for numerical methods in biomedical engineering. 2018;34(2):e2914. doi: 10.1002/cnm.2914 [DOI] [PubMed] [Google Scholar]
  • 18. Nguyen DD, Wei GW. AGL-Score: Algebraic Graph Learning Score for Protein-Ligand Binding Scoring, Ranking, Docking, and Screening. Journal of chemical information and modeling. 2019;59(7):3291–3304. doi: 10.1021/acs.jcim.9b00334 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Cang ZX, Wei GW. Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics. 2017;33(22):3549–3557. [DOI] [PubMed] [Google Scholar]
  • 20. Cang ZX, Mu L, Wei GW. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS computational biology. 2018;14(1):e1005929. doi: 10.1371/journal.pcbi.1005929 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Wu KD, Wei GW. Quantitative toxicity prediction using topology based multi-task deep neural networks. Journal of chemical information and modeling. 2018;. doi: 10.1021/acs.jcim.7b00558 [DOI] [PubMed] [Google Scholar]
  • 22. Wang B, Zhao ZX, Wei GW. Automatic parametrization of non-polar implicit solvent models for the blind prediction of solvation free energies. The Journal of chemical physics. 2016;145(12):124110. doi: 10.1063/1.4963193 [DOI] [PubMed] [Google Scholar]
  • 23. Wang B, Wang CZ, Wu KD, Wei GW. Breaking the polar-nonpolar division in solvation free energy prediction. Journal of computational chemistry. 2018;39(4):217–233. doi: 10.1002/jcc.25107 [DOI] [PubMed] [Google Scholar]
  • 24. Wu KD, Zhao ZX, Wang RX, Wei GW. TopP–S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility. Journal of computational chemistry. 2018;39(20):1444–1454. doi: 10.1002/jcc.25213 [DOI] [PubMed] [Google Scholar]
  • 25. Zhao RD, Cang ZX, Tong YY, Wei GW. Protein pocket detection via convex hull surface evolution and associated Reeb graph. Bioinformatics. 2018;34(17):i830–i837. doi: 10.1093/bioinformatics/bty598 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Grow C, Gao KF, Nguyen DD, Wei GW. Generative network complex (GNC) for drug discovery. arXiv preprint arXiv:191014650. 2019;. [DOI] [PMC free article] [PubMed]
  • 27. Nguyen DD, Cang ZX, Wu KD, Wang ML, Cao Y, Wei GW. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. Journal of computer-aided molecular design. 2019;33(1):71–82. doi: 10.1007/s10822-018-0146-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Nguyen DD, Gao KF, Wang ML, Wei GW. MathDL: Mathematical deep learning for D3R Grand Challenge 4. Journal of computer-aided molecular design. 2019; p. 1–17. doi: 10.1007/s10822-019-00237-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Nguyen DD, Cang ZX, Wu KD, Wang ML, Cao Y, Wei GW. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. Journal of computer-aided molecular design. 2019;33(1):71–82. doi: 10.1007/s10822-018-0146-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Liu X, J WX, Wu J, Xia KL. Hypergraph based persistent cohomology (HPC) for molecular representations in drug design. Briefings in Bioinformatics. 2021;. [DOI] [PubMed] [Google Scholar]
  • 31. Liu X, Feng H, Wu J, Xia K. Persistent spectral hypergraph based machine learning (PSH-ML) for protein-ligand binding affinity prediction. Briefings in Bioinformatics;. [DOI] [PubMed] [Google Scholar]
  • 32. Meng ZY, Xia KL. Persistent spectral based machine learning (PerSpect ML) for drug design. Science Advances, 2021;. doi: 10.1126/sciadv.abc5329 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Wee J, Xia K. Ollivier Persistent Ricci Curvature-Based Machine Learning for the Protein–Ligand Binding Affinity Prediction. Journal of Chemical Information and Modeling;. [DOI] [PubMed] [Google Scholar]
  • 34. Wee J, Xia K. Forman persistent Ricci curvature (FPRC) based machine learning models for protein-ligand binding affinity prediction. Briefings in Bioinformatics, 2021;. doi: 10.1093/bib/bbab136 [DOI] [PubMed] [Google Scholar]
  • 35. Wang R, Nguyen DD, Wei GW. Persistent spectral graph. International Journal for Numerical Methods in Biomedical Engineering. 2020; p. e3376. doi: 10.1002/cnm.3376 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Wang R, Zhao R, Ribando-Gros E, Chen J, Tong Y, Wei GW. HERMES: Persistent spectral graph software. Foundations of Data Science. 2020;3:67–97. doi: 10.3934/fods.2021006 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Björner A. Topological methods. Handbook of combinatorics. 1995;2:1819–1872. [Google Scholar]
  • 38. Dowker CH. Homology groups of relations. Annals of mathematics. 1952; p. 84–95. doi: 10.2307/1969768 [DOI] [Google Scholar]
  • 39. Chowdhury S, Mémoli F. A functorial Dowker theorem and persistent homology of asymmetric networks. Journal of Applied and Computational Topology. 2018;2(1):115–175. doi: 10.1007/s41468-018-0020-6 [DOI] [Google Scholar]
  • 40. Nguyen DD, Cang ZX, Wei GW. A review of mathematical representations of biomolecular data. Physical Chemistry Chemical Physics. 2020;. doi: 10.1039/c9cp06554g [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Gao K, Nguyen DD, Sresht V, Mathiowetz AM, Tu M, Wei GW. Are 2D fingerprints still valuable for drug discovery? Physical chemistry chemical physics. 2020;22(16):8373–8390. doi: 10.1039/d0cp00305k [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Nguyen DD, Gao K, Wang M, Wei GW. MathDL: mathematical deep learning for D3R Grand Challenge 4. Journal of computer-aided molecular design. 2020;34(2):131–147. doi: 10.1007/s10822-019-00237-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Zhou T, Ren J, Medo M, Zhang YC. Bipartite network projection and personal recommendation. Physical review E. 2007;76(4):046115. doi: 10.1103/PhysRevE.76.046115 [DOI] [PubMed] [Google Scholar]
  • 44. Pavlopoulos GA, Kontou PI, Pavlopoulou A, Bouyioukos C, Markou E, Bagos PG. Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience. 2018;7(4):giy014. doi: 10.1093/gigascience/giy014 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Dolinsky TJ, Czodrowski P, Li H, Nielsen JE, Jensen JH, Klebe G, et al. PDB2PQR: Expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Nucleic Acids Res. 2007;35:W522–525. doi: 10.1093/nar/gkm276 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Liu J, Wang RX. Classification of current scoring functions. Journal of chemical information and modeling. 2015;55(3):475–482. doi: 10.1021/ci500731a [DOI] [PubMed] [Google Scholar]
  • 47. Li HJ, Leung KS, Wong MH, Ballester PJ. Improving AutoDock Vina using random forest: the growing accuracy of binding affinity prediction by the effective exploitation of larger data sets. Molecular informatics. 2015;34(2-3):115–126. doi: 10.1002/minf.201400132 [DOI] [PubMed] [Google Scholar]
  • 48. Wójcikowski M, Kukiełka M, Stepniewska-Dziubinska MM, Siedlecki P. Development of a protein–ligand extended connectivity (PLEC) fingerprint and its application for binding affinity predictions. Bioinformatics. 2019;35(8):1334–1341. doi: 10.1093/bioinformatics/bty757 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Jiménez J, Skalic M, Martinez-Rosell G, De Fabritiis G. KDEEP: Protein–ligand absolute binding affinity prediction via 3D-convolutional neural networks. Journal of chemical information and modeling. 2018;58(2):287–296. doi: 10.1021/acs.jcim.7b00650 [DOI] [PubMed] [Google Scholar]
  • 50. Stepniewska-Dziubinska MM, Zielenkiewicz P, Siedlecki P. Development and evaluation of a deep learning model for protein–ligand binding affinity prediction. Bioinformatics. 2018;34(21):3666–3674. doi: 10.1093/bioinformatics/bty374 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Su MY, Yang QF, Du Y, Feng GQ, Liu ZH, Li Y, et al. Comparative assessment of scoring functions: The CASF-2016 update. Journal of chemical information and modeling. 2018;59(2):895–913. doi: 10.1021/acs.jcim.8b00545 [DOI] [PubMed] [Google Scholar]
  • 52. Afifi K, Al-Sadek AF. Improving classical scoring functions using random forest: The non-additivity of free energy terms’ contributions in binding. Chemical biology & drug design. 2018;92(2):1429–1434. doi: 10.1111/cbdd.13206 [DOI] [PubMed] [Google Scholar]
  • 53. Feinberg EN, Sur D, Wu ZQ, Husic BE, Mai HH, Li Y, et al. PotentialNet for molecular property prediction. ACS central science. 2018;4(11):1520–1530. doi: 10.1021/acscentsci.8b00507 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Boyles F, Deane CM, Morris GM. Learning from the ligand: using ligand-based features to improve binding affinity prediction. Bioinformatics. 2020;36(3):758–764. [DOI] [PubMed] [Google Scholar]
  • 55. Karlov DS, Sosnin S, Fedorov MV, Popov P. graphDelta: MPNN scoring function for the affinity prediction of protein–ligand complexes. ACS omega. 2020;5(10):5150–5159. doi: 10.1021/acsomega.9b04162 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Sánchez-Cruz N, Medina-Franco JL, Mestres J, Barril X. Extended connectivity interaction features: Improving binding affinity prediction through chemical description. Bioinformatics. 2021;37(10):1376–1382. doi: 10.1093/bioinformatics/btaa982 [DOI] [PubMed] [Google Scholar]
  • 57.Wang Z, Zheng L, Liu Y, Qu Y, Li YQ, Zhao M, et al. OnionNet-2: A Convolutional Neural Network Model for Predicting Protein-Ligand Binding Affinity based on Residue-Atom Contacting Shells. arXiv preprint arXiv:210311664. 2021;. [DOI] [PMC free article] [PubMed]
  • 58.Rezaei MA, Li Y, Wu DO, Li X, Li C. Deep Learning in Drug Design: Protein-Ligand Binding Affinity Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2020;. [DOI] [PMC free article] [PubMed]
  • 59. Song T, Wang S, Liu D, Ding M, Du Z, Zhong Y, et al. SE-OnionNet: A convolution neural network for protein-ligand binding affinity prediction. Frontiers in Genetics. 2020;11:1805. doi: 10.3389/fgene.2020.607824 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Zhu F, Zhang X, Allen JE, Jones D, Lightstone FC. Binding Affinity Prediction by Pairwise Function Based on Neural Network. Journal of chemical information and modeling. 2020;60(6):2766–2772. doi: 10.1021/acs.jcim.0c00026 [DOI] [PubMed] [Google Scholar]
  • 61. Wang K, Zhou R, Li Y, Li M. DeepDTAF: a deep learning method to predict protein–ligand binding affinity. Briefings in Bioinformatics. 2021;. [DOI] [PubMed] [Google Scholar]
  • 62.Zhou J, Li S, Huang L, Xiong H, Wang F, Xu T, et al. Distance-aware Molecule Graph Attention Network for Drug-Target Binding Affinity Prediction. arXiv preprint arXiv:201209624. 2020;.
  • 63. Jones D, Kim H, Zhang X, Zemla A, Stevenson G, Bennett WD, et al. Improved Protein–Ligand Binding Affinity Prediction with Structure-Based Deep Fusion Inference. Journal of Chemical Information and Modeling. 2021;61(4):1583–1592. doi: 10.1021/acs.jcim.0c01306 [DOI] [PubMed] [Google Scholar]
  • 64. Hassan-Harrirou H, Zhang C, Lemmin T. RosENet: Improving Binding Affinity Prediction by Leveraging Molecular Mechanics Energies with an Ensemble of 3D Convolutional Neural Networks. Journal of Chemical Information and Modeling. 2020;. doi: 10.1021/acs.jcim.0c00075 [DOI] [PubMed] [Google Scholar]
  • 65. Cheng T, Li X, Li Y, Liu Z, Wang R. Comparative assessment of scoring functions on a diverse test set. Journal of chemical information and modeling. 2009;49(4):1079–1093. doi: 10.1021/ci9000053 [DOI] [PubMed] [Google Scholar]
  • 66. Jones G, Willett P, Glen RC, Leach AR, Taylor R. Development and validation of a genetic algorithm for flexible docking. Journal of molecular biology. 1997;267(3):727–748. doi: 10.1006/jmbi.1996.0897 [DOI] [PubMed] [Google Scholar]
  • 67. Pham TA, Jain AN. Parameter estimation for scoring protein- ligand interactions using negative training data. Journal of medicinal chemistry. 2006;49(20):5856–5868. doi: 10.1021/jm050040j [DOI] [PubMed] [Google Scholar]
  • 68. Durrant JD, McCammon JA. NNScore: a neural-network-based scoring function for the characterization of protein- ligand complexes. Journal of chemical information and modeling. 2010;50(10):1865–1871. doi: 10.1021/ci100244v [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Li L, Wang B, Meroueh SO. Support vector regression scoring of receptor–ligand complexes for rank-ordering and virtual screening of chemical libraries. Journal of chemical information and modeling. 2011;51(9):2132–2138. doi: 10.1021/ci200078f [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Ragoza M, Hochuli J, Idrobo E, Sunseri J, Koes DR. Protein–ligand scoring with convolutional neural networks. Journal of chemical information and modeling. 2017;57(4):942–957. doi: 10.1021/acs.jcim.6b00740 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71. Wang C, Zhang Y. Improving scoring-docking-screening powers of protein–ligand scoring functions using random forest. Journal of computational chemistry. 2017;38(3):169–177. doi: 10.1002/jcc.24667 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Bao J, He X, Zhang JZ. DeepBSP—a machine learning method for accurate prediction of protein–ligand docking structures. Journal of Chemical Information and Modeling. 2021;61(5):2231–2240. doi: 10.1021/acs.jcim.1c00334 [DOI] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009943.r001

Decision Letter 0

Arne Elofsson, Joanna Slusky

12 Oct 2021

Dear Dr. Xia,

Thank you very much for submitting your manuscript "Dowker complex based machine learning (DCML) models for protein-ligand binding affinity prediction" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Joanna Slusky, Ph.D.

Associate Editor

PLOS Computational Biology

Arne Elofsson

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Here the authors proposed a Dowker complex based molecular interaction representations, which used a bipartite graph to model the interactions between a protein and a ligand. Then a DC-based persistent spectral model was constructed and the persistent Riemann Zeta functions were calculated as molecular descriptors. Finally, a DC-based gradient boosting tree model was trained to predict protein-ligand binding affinity.

This is a novel method to represent protein-ligand interaction as bipartite graph and calculate the descriptors from the knowledge of topology and graph theory. When it was applied to protein-ligand affinity prediction, however, I have some concerns about the representations and models:

1. In order to calculate the descriptors, the binding core region was defined using a cutoff distance of 10A. I wonder how you defined the cutoff. Actually, I have seen some different definitions about the binding core region, ranging from 5A to 12A. Does the cutoff distance influence the results a lot?

2. According to the manuscript, the size of feature vectors depend on the filtration values and the number of Riemann Zeta functions. Did they have physical or mathematical significance? Or were they selected by hyper-parameter optimization?

3. In page 8/17, line 226: “Note that the accuracy of our DC-based models can be further improved if convolutional neural network models, such as the one used in TopBP models…” Have you already tried the convolutional neural network models or you just imagined that?

4. The Table 1 listed the detailed information of the three PDBBind databases. I noticed that the Training set includes all the remained data when removing Test set from Refined set. Is there a validation set when you train your model? And how the hyper-parameters listed in Table 2 were selected?

5. There are many different type of protein-ligand affinity prediction models, which can also be called scoring functions. The scoring power is not the only problem we concerned, there are test sets for testing the docking power and screening power in CASF-2016 (or other version). We are very interested in the docking power and screening power of the model. We suggested that you provide the related results.

6. In page 8/17, line 231: “We do not compare with these models because the training and testing sets of these models are different from the standard ones in PDBbind datasets” Considering that all the PDBBind datasets are public, it is not difficult to make a comparison. I think more evidence should be given to prove the advantage of DC-based molecular interaction representations.

Reviewer #2: This work proposes novel molecular descriptors for protein-ligand binding affinity predictions. These descriptors are constructed from Dowker complex and spectral graph information. The authors have validated the robustness and the efficiency of the proposed features against a series of PDBbind benchmarks. Overall this manuscript is well-written and easy to follow. Besides these positive sides, there are some downsides I would like to bring up here

1) Proposed models use charges, distances, DC-based features, etc. General readers will appreciate it if authors carefully investigate the performances of the separated features. There might be some redundant features.

2) I do not know how atom charges were obtained. Please provide such a discussion in the revised version.

3) Lines 226-228, authors claim that CNN can further improve their current model. Are there any hard proofs? If yes please provide them otherwise I suggest removing these sentences.

4) Please include TopBP in figure 3 since it is discussed in Table 4

5) There are missing data files/features files from the authors’ provided Github link. Please update them.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: None

Reviewer #2: No: Missing data files/feature files

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009943.r003

Decision Letter 1

Arne Elofsson, Joanna Slusky

6 Jan 2022

Dear Dr. Xia,

Thank you very much for submitting your manuscript "Dowker complex based machine learning (DCML) models for protein-ligand binding affinity prediction" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

After reviewing your new manuscript I noted that a significant number of the clarifications and requests by reviewers resulted in responses to the reviewers that did not yield changes to the manuscript. Please consider the reviewers as representatives of your broader audience. Almost anything on which the reviewers needed clarification, future readers will need clarification on as well. Therefore, if a concept or detail needs to be explained to the reviewers, it also needs to be explained to the audience of PLoS Computational Biology in the manuscript. I cannot send this revision back to reviewers until you have added your responses to the manuscript as well.

In addition, it would be helpful to add quotes to the response to reviewers with the precise language you use in the manuscript to address the reviewers concerns. This allows the reviewers and editor to find your changes more easily and see in context how you addressed previous concerns.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Joanna Slusky, Ph.D.

Associate Editor

PLOS Computational Biology

Arne Elofsson

Deputy Editor

PLOS Computational Biology

***********************


PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009943.r005

Decision Letter 2

Arne Elofsson, Joanna Slusky

31 Jan 2022

Dear Dr. Xia,

Thank you very much for submitting your manuscript "Dowker complex based machine learning (DCML) models for protein-ligand binding affinity prediction" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Joanna Slusky, Ph.D.

Associate Editor

PLOS Computational Biology

Arne Elofsson

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have responded adequately to most of my comments, but there are some problems in the response to the question about docking power and screening power. The scoring, docking and screening powers should be evaluated using the same scoring function model, but the authors retrained their model for each ligand in the docking power test and each protein in the screening power test. The performance of these target-specific scoring models cannot be compared to the performance of the general scoring functions listed in Figure 4, except the AGL-Score. In other words, the model used to evaluate the docking power and screening power should be the same model used to evaluate the scoring power.

Reviewer #2: The authors have addressed all of my concerns.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g., participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No


PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009943.r007

Decision Letter 3

Arne Elofsson, Joanna Slusky

21 Feb 2022

Dear Dr. Xia,

We are pleased to inform you that your manuscript 'Dowker complex based machine learning (DCML) models for protein-ligand binding affinity prediction' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Joanna Slusky, Ph.D.

Associate Editor

PLOS Computational Biology

Arne Elofsson

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009943.r008

Acceptance letter

Arne Elofsson, Joanna Slusky

22 Mar 2022

PCOMPBIOL-D-21-01678R3

Dowker complex based machine learning (DCML) models for protein-ligand binding affinity prediction

Dear Dr Xia,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Livia Horvath

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Reply_Letter.pdf

    Attachment

    Submitted filename: Response_Letter.pdf

    Attachment

    Submitted filename: Reply_Letter_v2.pdf

    Data Availability Statement

    The PDBbind datasets are available at http://www.pdbbind.org.cn/, and the code is available on GitHub at https://github.com/LiuXiangMath/Dowker-Complex-Based-ML.

