Proceedings of the National Academy of Sciences of the United States of America. 2016 Nov 16;113(48):13564–13569. doi: 10.1073/pnas.1611138113

Predicting protein–ligand affinity with a random matrix framework

Alpha A Lee a,b,1, Michael P Brenner a,b, Lucy J Colwell c,1
PMCID: PMC5137738  PMID: 27856761

Significance

Developing computational methods to screen ligands against protein targets is a major challenge for drug discovery. We present a robust mathematical framework, inspired by random matrix theory, which predicts ligand binding to a target given the known ligand set of that target. Our method considers binding prediction as a denoising problem, recognizing that only some of the chemically important features associated with each ligand contribute to binding to a particular receptor. We use correlations among chemical features in the known ligand set, combined with random matrix theory, to eliminate statistically insignificant correlations. Our method outperforms existing algorithms in the literature. We show that our algorithm has the physical interpretation of estimating the ligand–target binding energy.

Keywords: drug discovery, random matrix theory, protein–ligand affinity, computational pharmacology, statistical physics

Abstract

Rapid determination of whether a candidate compound will bind to a particular target receptor remains a stumbling block in drug discovery. We use an approach inspired by random matrix theory to decompose the known ligand set of a target in terms of orthogonal “signals” of salient chemical features, and distinguish these from the much larger set of ligand chemical features that are not relevant for binding to that particular target receptor. After removing the noise caused by finite sampling, we show that the similarity of an unknown ligand to the remaining, cleaned chemical features is a robust predictor of ligand–target affinity, performing as well as or better than any algorithm in the published literature. We interpret our algorithm as deriving a model for the binding energy between a target receptor and the set of known ligands, where the underlying binding energy model is related to the classic Ising model in statistical physics.


Finding new ligands that bind to a given target is both a crucial step and a major stumbling block in modern drug discovery. Numerous attempts have been made to develop computational algorithms to predict the binding affinity of a ligand to a given receptor, which would allow potential compounds to be screened in silico, reducing costs and saving time. In particular, in response to the wealth of experimental data that exists both within pharmaceutical companies, and also in freely accessible online databases such as ChEMBL (1), approaches that attempt to “learn” from these data are increasingly gaining attention (2).

An intuitive data-driven approach builds on the hypothesis that chemical commonalities among the known ligand set reveal salient features of the binding site. A corollary is that ligands with similar chemical functionality are expected to share similar binding affinity toward a particular receptor (3, 4). This suggests that the known ligand set of a given target can be used to learn criteria that predict whether a novel ligand will bind to the target. This ligand-based approach is a powerful paradigm that does not require structural information about the receptor, which is potentially arduous to obtain, unlike other more atomistic methods such as docking or molecular dynamics.

Any ligand-based method requires a way to quantify the chemical functionalities of a ligand, and various chemical descriptors have been proposed. Examples include a vector of measured or predicted physical properties (5–8), a vector enumerating the presence or absence of known functional groups on the ligand (9, 10), a vectorial representation of connectivities in the molecular graph (11, 12) (known also as molecular fingerprints), and simply the 3D shape of the ligand (13–16). Existing approaches then take the descriptor associated with each ligand and compare ligands with each other, for example through the Tanimoto coefficient (17, 18).
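For concreteness, the Tanimoto coefficient between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The following is a minimal sketch; the function name `tanimoto`, the toy 8-bit fingerprints, and the convention for two all-zero fingerprints are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto coefficient: |a AND b| / |a OR b| for binary fingerprints."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumed convention for two all-zero fingerprints
    return float(np.logical_and(a, b).sum() / union)

# Toy 8-bit fingerprints (real fingerprints are much longer).
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 0, 1, 1, 0]
similarity = tanimoto(fp_a, fp_b)  # 3 shared bits out of 5 bits set in either
print(similarity)
```

Note that this pairwise score weights every bit equally, which is the limitation the denoising approach described next is meant to address.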

Nonetheless, regardless of how ligand chemical functionalities are quantified, without fortuitously knowing a priori which ligand features determine binding, most of the chemical features describing the ligand are likely irrelevant. Whereas some of the features in the descriptor determine binding to the receptor of interest, others do not and simply add background noise. Moreover, for any particular receptor, the known set of ligands that bind to it is often smaller, or of the same order of magnitude as the number of potentially relevant chemical features. As such, the problem of ligand-based binding prediction can be recast as a problem in signal processing––can we identify those chemical ligand features that determine binding (i.e., the “signal”) amid many irrelevant ones (the “noise”) in the regime where the amount of data is not significantly larger than the number of variables being measured?

Random matrix theory (RMT) provides a natural mathematical framework for addressing this issue. Physical applications of RMT include Wigner’s study of the spectra of heavy atoms (19). In the context of data analysis, RMT gives a null model for the similarity between samples (ligands) that can be expected by chance due to finite sampling (20). Powerful analytical tools from RMT define a precise threshold that distinguishes the similarity that can be expected by chance from that which is caused by signal. These tools enable an effective and simple denoising algorithm, which allows us to recover the statistically significant signals. This denoising algorithm has been used in different fields, ranging from finance (21–23) to face recognition (24, 25).

This article contains three major results: First, we show that for a randomly chosen set of molecules, the eigenvalue distribution of the covariance matrix of chemical descriptors agrees with the canonical Marčenko–Pastur (MP) distribution (26) of RMT, expected in the absence of any significant signal. Second, if we consider descriptors of pharmacologically similar molecules, i.e., those that bind to the same protein receptor, then part of the eigenvalue spectrum agrees with the MP distribution, but crucially there are eigenvalues that deviate from it significantly. These eigenvalues, and their corresponding eigenvectors, describe the statistically significant signals. The most common substructure of these eigenvectors corresponds to pharmacophores. Using these two results, we can predict with higher accuracy than known methods when an unknown ligand will bind to a receptor, constructing a unique model for each protein receptor. Finally, we provide a physical interpretation of the success of the algorithm––namely, that it is effectively inferring a model of the ligand–protein binding energy from the covariance structure of fingerprints that bind to a target protein. The underlying mathematical model is closely related to the classic Ising model in statistical physics.

RMT Framework

To motivate the RMT framework, we focus on a popular set of descriptors that are often used in cheminformatics. Molecular fingerprints are typically constructed by first representing a ligand as a 2D molecular graph, and then considering all possible bond paths within the molecule (11, 12). The set of bond paths that characterize each molecule is unique, so that only identical molecules share exactly the same bond paths; similar molecules share most bond paths. Because the set of all possible bond paths is vast, typically fingerprints are defined by first considering bond paths that are below some threshold length (i.e., within some radius of every atom of the structure) and then mapping these bond paths to a bit string of defined length [a molecular “fingerprint” (27)] through a hash function.

The fundamental aim is to detect similarity among a set of binary strings of the same length, p, where each bit represents the presence or absence of a molecular feature. There is significant noise in these bit strings, because only some of the bits are truly informative––for any particular receptor, not all bond paths are equally relevant to ligand–target binding. If the individual bits of the binary strings were chosen randomly, with no information about ligand–target binding, then RMT predicts that the eigenvalue distribution of the covariance matrix of the bit strings obeys a specific analytical function known as the MP distribution. Therefore, a highly accurate test for detecting the presence of nonrandom commonalities among a set of strings is to compare the eigenvalue spectrum of their covariance matrix to the MP distribution. Any deviation necessarily reflects the presence of a signal in the data, which in this case are sets of molecular features that characterize the chosen ligand–target interaction.

Mathematically, we represent the kth ligand associated with the chosen receptor as a row vector of bits $f_k$ using the Morgan fingerprint algorithm with radius 3, implemented using the package RDKit (28). The ensemble of N ligands that bind to the chosen receptor can be arranged as a data matrix $A = [f_1; f_2; \ldots; f_N]_{N \times p}$, where the value of N will vary between receptors. We then remove repeated columns of the data matrix, which correspond to redundant information, and convert the data matrix to z scores by subtracting the column mean and normalizing each column to have unit variance. This allows us to construct the p × p correlation matrix $C = A^{T}A/N$, whose entries measure correlations between pairs of fingerprint bits. In general, for well-sampled data, large entries in C would indicate relationships between specific molecular features, suggesting that these features do not occur independently of one another in this dataset.
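The preprocessing just described can be sketched in a few lines of NumPy. This is a sketch under stated assumptions rather than the authors' code: the fingerprints are taken as an already computed N × p 0/1 array (random bits stand in for real Morgan fingerprints here), the function name `zscore_correlation` is hypothetical, and columns with zero variance are dropped because they cannot be z-scored, a detail the text does not discuss.

```python
import numpy as np

def zscore_correlation(fingerprints):
    """Return (C, Z, cols): feature correlation matrix, z-scored data, kept columns."""
    A = np.asarray(fingerprints, dtype=float)
    # Remove repeated columns, keeping the first occurrence of each.
    _, first = np.unique(A, axis=1, return_index=True)
    cols = np.sort(first)
    # Assumption: also drop constant columns, whose variance is zero.
    cols = cols[A[:, cols].std(axis=0) > 0]
    kept = A[:, cols]
    Z = (kept - kept.mean(axis=0)) / kept.std(axis=0)
    C = Z.T @ Z / Z.shape[0]          # features x features
    return C, Z, cols

# Synthetic stand-in for a ligand set: 200 "ligands", 30 bits.
rng = np.random.default_rng(0)
A = (rng.random((200, 30)) < 0.3).astype(int)
A[:, 5] = A[:, 2]                     # plant one redundant column
C, Z, cols = zscore_correlation(A)
print(C.shape, len(cols))
```

The returned `cols` index array records which original bits survive, which is needed later to apply the same column selection and scaling to an unknown ligand.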

A fundamental result from random matrix theory describes the eigenvalue distribution of the correlation matrix C analytically: under certain weak assumptions, if entries in A are drawn from a Gaussian distribution with zero mean and unit variance, the probability density of C having an eigenvalue λ is given by the MP distribution (26)

$$\rho(\lambda) = \frac{\sqrt{\left[(1+\sqrt{\gamma})^{2} - \lambda\right]_{+}\left[\lambda - (1-\sqrt{\gamma})^{2}\right]_{+}}}{2\pi\gamma\lambda},$$ [1]

where $\gamma = p/N$ describes how well-sampled the dataset is. The probability that a random matrix has eigenvalues larger than $(1+\sqrt{\gamma})^{2}$ in the absence of any signal is vanishingly small. Thus, the key insight gained from Eq. 1 is that those eigenvalues above $(1+\sqrt{\gamma})^{2}$ correspond to statistically significant signals.
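As an illustration of Eq. 1, the sketch below evaluates the MP density and its upper edge, then compares the edge with the spectrum of a purely random data matrix. The function names, the Gaussian test matrix, and the sizes N = 1000 and p = 200 are assumptions chosen for the example; the density as written covers the case γ ≤ 1 (for γ > 1 there is an additional point mass at zero that is not modeled here).

```python
import numpy as np

def mp_bounds(gamma):
    """Lower and upper edges of the MP bulk for gamma = p / N."""
    root = np.sqrt(gamma)
    return (1.0 - root) ** 2, (1.0 + root) ** 2

def mp_density(lam, gamma):
    """MP density of Eq. 1; zero outside the bulk. Requires lam > 0."""
    lam = np.asarray(lam, dtype=float)
    lo, hi = mp_bounds(gamma)
    inside = np.clip(hi - lam, 0.0, None) * np.clip(lam - lo, 0.0, None)
    return np.sqrt(inside) / (2.0 * np.pi * gamma * lam)

# Null experiment: z-scored Gaussian noise with no planted signal.
rng = np.random.default_rng(0)
N, p = 1000, 200
A = rng.standard_normal((N, p))
A = (A - A.mean(axis=0)) / A.std(axis=0)
eigenvalues = np.linalg.eigvalsh(A.T @ A / N)

gamma = p / N
lower, upper = mp_bounds(gamma)
n_above = int(np.sum(eigenvalues > upper))  # expected to be zero or very small
print(lower, upper, eigenvalues.min(), eigenvalues.max(), n_above)
```

Because the edge is an asymptotic result, the largest eigenvalue of a finite random sample can land slightly above it, which is why a small tolerance is sensible when counting outliers.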

Fig. 1A shows that the eigenvalue distribution of the correlation matrix of 1,000 ligands drawn randomly from ChEMBL (1) agrees quantitatively with the MP distribution. However, if instead we choose the ligands nonrandomly, by choosing the ligand sets associated with a particular protein receptor, we find a significant number of eigenvalues above the MP threshold. As examples, Fig. 1 B and C shows the eigenvalue distribution from ligand sets from ChEMBL associated with two G protein coupled receptors, the adenosine A2a receptor (AA2AR) and the β1-adrenergic receptor (ADRB1).

Fig. 1.

MP distribution (red curve) provides the null hypothesis for ligand–ligand correlations expected in the absence of signal. The eigenvalue distribution is plotted for the correlation matrix of (A) a random sample of 1,000 ligands from ChEMBL, and the ligand set of (B) AA2AR and (C) ADRB1.

The MP distribution thus suggests an intuitive denoising algorithm for ligands that bind to a particular receptor: only eigenvectors with eigenvalues larger than the MP upper bound correspond to statistically significant features of the receptor; the other eigenvectors simply reflect random noise caused by finite sampling. The set of statistically significant features, represented as orthonormal eigenvectors, are thus orthogonal chemical features relevant for ligand binding. In other words, if there are m eigenvalues greater than the MP upper bound, then the linear space spanned by the m associated eigenvectors, $V = \mathrm{span}(v_1, v_2, \ldots, v_m)$, is the subspace of chemical feature space that facilitates binding to that particular receptor.
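The denoising step can be sketched as an eigendecomposition followed by a cut at the MP upper bound. In the sketch below, `significant_subspace` is a hypothetical helper name, and the test data are synthetic: Gaussian noise with one planted direction `u_true` standing in for a real ligand set.

```python
import numpy as np

def significant_subspace(Z):
    """Eigenvectors of C = Z^T Z / N whose eigenvalues exceed the MP upper bound.

    Z is an N x p matrix of z scores. Returns (V, evals, bound), where the
    columns of V are orthonormal and sorted by decreasing eigenvalue.
    """
    N, p = Z.shape
    evals, evecs = np.linalg.eigh(Z.T @ Z / N)   # ascending eigenvalues
    bound = (1.0 + np.sqrt(p / N)) ** 2
    keep = evals > bound
    return evecs[:, keep][:, ::-1], evals[keep][::-1], bound

# Synthetic example: noise plus one planted "feature" direction.
rng = np.random.default_rng(1)
N, p = 400, 100
u_true = rng.standard_normal(p)
u_true /= np.linalg.norm(u_true)
X = rng.standard_normal((N, p)) + 3.0 * rng.standard_normal((N, 1)) * u_true
Z = (X - X.mean(axis=0)) / X.std(axis=0)

V, evals, bound = significant_subspace(Z)
overlap = abs(V[:, 0] @ u_true)   # alignment of the top eigenvector with the plant
print(V.shape[1], evals[:3], bound, overlap)
```

In this toy setting the leading retained eigenvector should align closely with the planted direction, while the bulk of the spectrum stays below the bound.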

Classification of Unknown Ligands

Intuitively, if an unknown ligand is sufficiently similar to the set of known ligands that bind to a receptor, the unknown ligand will likely also bind to the receptor. The random matrix framework gives a precise mathematical statement for this intuition: An unknown ligand is predicted to bind to a receptor if the bit-string vector corresponding to the unknown ligand (after transformation to z score by subtracting the sample mean and normalizing by sample variance) lies close to the subspace V.

Let u be the vector of z scores corresponding to the unknown ligand. The projection of u onto V is given by

$$u_p = \sum_{i=1}^{m} (v_i \cdot u)\, v_i.$$ [2]

Here, u lies in the subspace V if and only if $u = u_p$. The distance between u and $u_p$ is thus a quantitative metric of similarity between the unknown ligand and the set of ligands that bind to the receptor in question. The ligand is predicted to bind if and only if

$$\lVert u - u_p \rVert < \epsilon,$$ [3]

where $\lVert \cdot \rVert$ is the Euclidean norm, and ε is a threshold parameter. Eq. 3 has the chemical interpretation that one can be confident a ligand will bind to the receptor if it contains pharmacophores found in known ligands, and is minimally decorated with other functional groups. A pharmacophore is typically a small fragment (see Fig. 4), and the chemical properties of the resulting molecule will increasingly deviate from those of the pharmacophore as one incorporates additional functional groups. The threshold parameter ε allows the tolerance of the analysis to the presence of other functional groups to be controlled, and hence an appropriate false positive/false negative tradeoff selected; this is discussed in detail below.
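Eqs. 2 and 3 amount to an orthogonal projection followed by a distance test. A minimal sketch follows, assuming the columns of `V` are the orthonormal significant eigenvectors and that `u` has already been z-scored with the training-set column means and standard deviations; the function names and the toy vectors are illustrative.

```python
import numpy as np

def distance_to_subspace(u, V):
    """Euclidean distance between u and its projection u_p onto span(V) (Eq. 2)."""
    u = np.asarray(u, dtype=float)
    u_p = V @ (V.T @ u)               # sum_i (v_i . u) v_i for orthonormal columns
    return float(np.linalg.norm(u - u_p))

def predicts_binding(u, V, epsilon):
    """Decision rule of Eq. 3."""
    return distance_to_subspace(u, V) < epsilon

# Toy subspace: the first two coordinate axes of a 4-dimensional feature space.
V = np.eye(4)[:, :2]
inside = np.array([3.0, -1.0, 0.0, 0.0])    # lies in V, distance 0
outside = np.array([1.0, 0.0, 3.0, 4.0])    # residual (0, 0, 3, 4), distance 5
print(distance_to_subspace(inside, V), distance_to_subspace(outside, V))
print(predicts_binding(outside, V, epsilon=6.0))
```

The distance depends only on the component of `u` orthogonal to V, so features inside the significant subspace never count against a candidate ligand.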

Fig. 4.

Chemical motif corresponding to the first and second eigenvector of AA2AR and ADRB1. The motif is obtained by computing the common structure among the top 20 ligands, ordered by the magnitude of the dot product between each ligand's fingerprint and the eigenvector.

To test this, we consider human G protein coupled receptors (GPCRs) reported in ChEMBL. A ligand is considered to bind to a given target if its Ki, Kd, IC50, or EC50 is 1 μM or less. We consider only GPCRs with more than 120 known ligands reported in ChEMBL. We randomly sort ligands into a training set (80%) and a verification set (20%). To test for false positives, we need compounds that do not bind to the receptor. Negative results are seldom reported and the judicious selection of decoys is still a subject of intense research effort (29). In our analysis, we use a random selection of 1,000 compounds from ChEMBL as a proxy. The median number of ligands associated with each GPCR is 400; thus, even if the actual ligand set is an order of magnitude larger than those that are known, it still represents a negligible proportion of the 1,583,897 compounds in ChEMBL. Therefore, a random selection of 1,000 ligands from ChEMBL is unlikely to contain any ligand that binds to a particular GPCR.

The receiver operating characteristic (ROC) curve plots the accuracy of identifying ligands (true positives) as a function of false-positive predictions. This characteristic is commonly used to quantify the performance of classification algorithms. In particular, the area under the ROC (the so-called AUC) is the crucial figure of merit: the closer the AUC is to 1, the better the classifier. Fig. 2A shows that our algorithm has a mean AUC of 0.9, surpassing methods commonly used in the literature, which have a mean AUC of 0.7–0.8 (30). As such, our algorithm comfortably outperforms commonly used methods.
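To make the figure of merit concrete, the sketch below computes ROC points and the AUC from projection distances, where a smaller distance means a more confident "binds" prediction. The AUC is evaluated as the fraction of (ligand, decoy) pairs in which the ligand has the smaller distance, which equals the area under the ROC curve. The function names and the toy distances are assumptions for illustration, not data from the paper.

```python
import numpy as np

def roc_points(ligand_dist, decoy_dist):
    """(fpr, tpr) obtained by sweeping the threshold epsilon of Eq. 3."""
    lig = np.asarray(ligand_dist, dtype=float)
    dec = np.asarray(decoy_dist, dtype=float)
    eps = np.append(np.sort(np.concatenate([lig, dec])), np.inf)
    tpr = np.array([(lig < e).mean() for e in eps])
    fpr = np.array([(dec < e).mean() for e in eps])
    return fpr, tpr

def auc_from_distances(ligand_dist, decoy_dist):
    """Probability that a random ligand is closer to V than a random decoy."""
    lig = np.asarray(ligand_dist, dtype=float)[:, None]
    dec = np.asarray(decoy_dist, dtype=float)[None, :]
    wins = (lig < dec).sum() + 0.5 * (lig == dec).sum()   # ties count one half
    return float(wins / (lig.size * dec.size))

# Toy distances (illustrative only).
ligand_dist = [0.2, 0.5, 0.9]
decoy_dist = [0.4, 1.0, 1.5, 2.0]
auc = auc_from_distances(ligand_dist, decoy_dist)   # 10 of 12 pairs ordered correctly
fpr, tpr = roc_points(ligand_dist, decoy_dist)
print(auc)
```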

Fig. 2.

Our RMT-inspired algorithm classifies ligands with high accuracy. (A) The ROC curve of our algorithm. The AUC of the mean ROC curve is 0.9. The shaded region shows 1 SD in the true positive, corresponding to AUC = 0.86–0.95. (B) Accuracy at identifying ligands and rejecting decoys plotted as a function of percent of the training set rejected by the choice of the threshold ε.

The ROC curve is plotted by varying ε, the threshold parameter in Eq. 3. Fig. 2B shows the effect of varying ε, represented as the percent of the training set accounted for by each choice of ε. A stringent choice of ε corresponds to a large portion of the training set being rejected by the threshold in Eq. 3, resulting in a low false-positive rate but a high false-negative rate. Vice versa, an ε value that accounts for a larger portion of the training set has higher false-positive rate but lower false-negative rate. In the remainder of this paper, we choose ε so that 95% of the training set lies within the threshold in Eq. 3. With this heuristic choice, the algorithm picks out 84% of the verification set as ligands with a 7% false-positive rate (i.e., it rejects 93% of randomly selected ligands from ChEMBL).
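The heuristic for choosing ε can be written as a quantile of the training-set distances. The sketch below uses synthetic, normally distributed distances purely to exercise the code; the function names, the sample sizes, and the distributions are assumptions, and the resulting rates are not the paper's results.

```python
import numpy as np

def choose_epsilon(train_dist, coverage=0.95):
    """Threshold such that roughly `coverage` of training ligands satisfy Eq. 3."""
    return float(np.quantile(np.asarray(train_dist, dtype=float), coverage))

def rates(epsilon, ligand_dist, decoy_dist):
    """True-positive and false-positive rates at a given threshold."""
    tpr = float(np.mean(np.asarray(ligand_dist) < epsilon))
    fpr = float(np.mean(np.asarray(decoy_dist) < epsilon))
    return tpr, fpr

# Synthetic distances: ligands lie closer to the subspace than decoys.
rng = np.random.default_rng(0)
train_dist = rng.normal(5.0, 1.0, size=800)
verify_dist = rng.normal(5.0, 1.0, size=200)
decoy_dist = rng.normal(9.0, 1.0, size=1000)

epsilon = choose_epsilon(train_dist, coverage=0.95)
tpr, fpr = rates(epsilon, verify_dist, decoy_dist)
print(epsilon, tpr, fpr)
```

Lowering `coverage` makes the threshold more stringent, trading a lower false-positive rate for a higher false-negative rate, as described above.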

The random matrix distribution (Eq. 1) is crucial to the success of our algorithm. Fig. 3 shows that including too many eigenvectors into V increases the false-positive rate, whereas including too few eigenvectors decreases the success rate of picking out ligands from the verification set. The balance between overfitting and underfitting is achieved close to the MP bound (as the bound is probabilistic, slight sample-to-sample deviation is expected). Although Fig. 3 only shows the results for AA2AR, ADRB1, the μ1 opioid receptor, and the cannabinoid CB1 receptor, the near optimality of the MP bound is general.

Fig. 3.

MP bound strikes a balance between overfitting and underfitting. The percent accuracy in identifying ligands from the verification set and rejecting ligands randomly selected from ChEMBL is shown as a function of the number of eigenvectors included in V for (A) AA2AR, (B) ADRB1, (C) μ1 opioid receptor, and (D) cannabinoid CB1 receptor.

We also report that the statistically significant eigenvectors picked out by our algorithm represent pharmacophores. Formally, a fingerprint cannot be inverted directly to give a unique chemical structure because multiple structures can lead to the same fingerprint. Nonetheless we can infer the structural motif that an eigenvector represents by the common substructure among those ligands that lie closest to that eigenvector. Fig. 4 shows the structural motif corresponding to the top two eigenvalues of AA2AR and ADRB1. Strikingly, the first eigenvector of AA2AR is precisely the adenine motif. The second eigenvector contains a thymine motif fused to a more complex scaffold. For ADRB1, the top eigenvector is the structural motif of β-blockers (e.g., propranolol), a class of successful antagonists which are used, e.g., to treat hypertension.

Physical Model

Before concluding, we address the question of why this algorithm might prove effective. What is the physics encoded in those eigenvalues larger than the MP threshold, and their associated eigenvectors?

The clearest way of determining which ligands bind to a given protein would be to accurately predict the binding energy of every possible ligand to the protein. The ligand set of the protein is then given by the set of ligands with a binding affinity greater than some threshold. Accurate determination of this binding energy is extremely computationally intensive. Nonetheless, even without a first-principles determination of the ligand binding energy, we might still hope to parameterize a model of protein–ligand binding, where the parameters are determined from the set of ligands that bind to a given protein target. If sufficiently accurate, such a model of the binding energy could potentially still give accurate predictions as to which ligands bind to a given target protein.

We now demonstrate that there is a natural class of models for ligand binding where our algorithm precisely picks out the set of strongly binding ligands. To begin, we note that because we are describing ligands through their fingerprints f, the ligand binding energy is a function of the fingerprints, i.e., E=E(f). We can expand E in powers of f, so that to leading order

$$E(f) = \sum_{i=1}^{p} w_i f_i + \sum_{i,j=1}^{p} f_i J_{ij} f_j + \cdots.$$ [4]

Here, $w_i$ and $J_{ij}$ are protein-specific quantities; they parameterize how well ligands (characterized by their fingerprints) bind to the binding pocket of the protein in question. The values of $w_i$, $J_{ij}$, and p also depend on the nature of the fingerprints that we use to describe the ligands. More detailed fingerprints have a better chance of accurately modeling the binding energy between the ligand and receptor than those that do not take into account parts of the molecule that bind to the receptor. The fact that the Morgan 3 fingerprints used herein have long been shown to have predictive power for ligand–target association means that they plausibly contain sufficient information to model the binding energy. It is noteworthy that because fingerprints are binary strings of length p, the model in Eq. 4 is equivalent to the Ising model, well known in statistical physics.

Can we deduce w and J from the fingerprints of those ligands that bind to a protein target? Here we take as input the correlation matrix of the fingerprints that bind to each protein target in question. Indeed, determining the Ising model interaction matrix J from the correlation matrix is a classic problem in statistical physics and biophysics (31–35). We now argue that our random-matrix-based procedure effectively removes noise caused by finite sampling from this problem. The essence of our algorithm is the derivation of a protein-specific binding energy model J.

We can directly compute the correlation matrix of the fingerprints that bind to a given protein target (characterized by $w_i$, $J_{ij}$) by noting that our model implies that the equilibrium probability of observing a fingerprint f is given by

$$P(f) = \frac{e^{-\beta E(f)}}{Z},$$ [5]

where $\beta = (k_B T)^{-1}$ characterizes the temperature and Z is the partition function, summing $e^{-\beta E}$ over all possible fingerprints. The correlation matrix follows directly from this model via $C_{ij} = \langle f_i f_j \rangle - \langle f_i \rangle \langle f_j \rangle$, where $\langle \cdot \rangle$ denotes an average over the probability model in Eq. 5. The correlation matrix $C_{ij}$ is a function of temperature T: at high temperatures, where $\beta E \ll 1$, all f are equally probable, and the nontrivial correlations disappear. At lower temperatures, the set of fingerprints that are probable will reflect the structure of the interaction matrix J in Eq. 4.
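For a very small number of bits, the equilibrium correlation matrix can be computed exactly by enumerating every fingerprint, which makes the temperature dependence explicit. The sketch below is illustrative only: it assumes spin-like bits taking values ±1, a Boltzmann weight proportional to exp(−βE) with E truncated at second order as in Eq. 4, and a hypothetical two-bit coupling chosen so that aligned bits have lower energy.

```python
import itertools
import numpy as np

def connected_correlations(w, J, beta):
    """Exact C_ij = <f_i f_j> - <f_i><f_j> under P(f) ~ exp(-beta * E(f))."""
    p = len(w)
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))
    # E(f) = sum_i w_i f_i + sum_ij f_i J_ij f_j for every state at once.
    energy = states @ w + np.einsum("si,ij,sj->s", states, J, states)
    logits = -beta * energy
    weights = np.exp(logits - logits.max())       # stabilized Boltzmann weights
    prob = weights / weights.sum()                # division by Z
    mean = prob @ states
    second = states.T @ (states * prob[:, None])
    return second - np.outer(mean, mean)

p = 4
w = np.zeros(p)
J = np.zeros((p, p))
J[0, 1] = J[1, 0] = -0.5        # E contains -f_0 f_1: aligned bits are favored

C_cold = connected_correlations(w, J, beta=1.0)
C_hot = connected_correlations(w, J, beta=1e-6)
print(C_cold[0, 1], np.tanh(1.0))   # for this two-bit coupling <f_0 f_1> = tanh(beta)
print(abs(C_hot[0, 1]))
```

At high temperature the off-diagonal entries vanish, while at lower temperature the coupled pair becomes correlated and the uncoupled bits remain independent.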

Correspondingly, ligand–protein target binding only occurs over a range of temperatures, and we assume that we are in the range of temperatures where the binding is effective. Our algorithm computes the correlation matrix Cij not from taking equilibrium averages but instead by averaging over n samples, where n is the number of ligands that bind to the target in question. Critically, n is the same order of magnitude as the fingerprint length p, so our computed covariance matrix does not converge to the equilibrium expectation––it is corrupted by noise. Our procedure of extracting the eigenvalues above the MP threshold corresponds to estimating the binding energy from the data matrix.

To see this, Fig. 5 shows a set of simulations of the Ising model. We consider fingerprints of length p=50, drawn from the distribution of Eq. 5. We take the first-order coefficients to vanish ($w_i = 0$; in the case of the fingerprints this corresponds to using the z score) and choose $J = \alpha u_J' u_J$, where α>0. This is a rank-1 matrix, where $u_J$ is the (randomly chosen) direction that by construction will minimize the energy. Fig. 5A shows the spectrum of the resulting correlation matrix, formed by considering n=200 samples from Eq. 5 with βα=0.1. The temperature is sufficiently high that the fingerprints are uncorrelated, so the spectrum is well fit by the MP distribution (red line). Fig. 5B shows the corresponding spectrum of the correlation matrix when βα=0.6. Here the bulk spectrum agrees well with the MP distribution (red line), but there is a single eigenvalue that escapes from the bulk with $\lambda \approx 9$. Fig. 5C shows that the eigenvector corresponding to this eigenvalue is extremely well correlated with $u_J$.
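A simulation in the spirit of Fig. 5 can be sketched as follows. Several choices here are assumptions, since the text does not spell them out: bits are treated as spins ±1, $u_J$ is a random unit vector, and the sign convention is the one in which alignment with $u_J$ lowers the energy, so that P(f) ∝ exp(βα (u_J·f)²). Sampling uses a Hubbard–Stratonovich auxiliary field m, discretized on a grid, instead of whatever sampler the authors used: given m, the spins are independent with P(f_i = +1) = 1/(1 + exp(−2 m u_i)).

```python
import numpy as np

def sample_rank_one_ising(u, beta_alpha, n, rng):
    """Draw n spin vectors with P(f) proportional to exp(beta_alpha * (u . f)^2)."""
    grid = np.linspace(-20.0, 20.0, 8001)          # discretized auxiliary field m
    mu = np.outer(grid, u)
    log_cosh = np.logaddexp(mu, -mu) - np.log(2.0)
    # Marginal of m: exp(-m^2 / (4 beta_alpha)) * prod_i cosh(m u_i), up to a constant.
    log_w = -grid ** 2 / (4.0 * beta_alpha) + log_cosh.sum(axis=1)
    w = np.exp(log_w - log_w.max())
    m = rng.choice(grid, size=n, p=w / w.sum())
    prob_up = 1.0 / (1.0 + np.exp(-2.0 * np.outer(m, u)))
    return np.where(rng.random((n, u.size)) < prob_up, 1.0, -1.0)

def top_eigenpair(F):
    """Largest eigenvalue and eigenvector of the z-scored correlation matrix."""
    Z = (F - F.mean(axis=0)) / F.std(axis=0)
    evals, evecs = np.linalg.eigh(Z.T @ Z / len(Z))
    return float(evals[-1]), evecs[:, -1]

rng = np.random.default_rng(0)
p, n = 50, 200
u_J = rng.standard_normal(p)
u_J /= np.linalg.norm(u_J)
mp_upper = (1.0 + np.sqrt(p / n)) ** 2

results = {}
for beta_alpha in (0.1, 0.6):
    F = sample_rank_one_ising(u_J, beta_alpha, n, rng)
    lam, v = top_eigenpair(F)
    results[beta_alpha] = (lam, abs(float(v @ u_J)))
    print(beta_alpha, lam, results[beta_alpha][1], mp_upper)
```

Under these assumptions the high-temperature run should leave the top eigenvalue near the MP edge, while the low-temperature run should produce a clear outlier whose eigenvector aligns with $u_J$.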

Fig. 5.

Eigenvalue spectrum of n=200 fingerprints of length p=50 sampled from P(f) in Eq. 5, with w=0 and $J = \alpha u_J' u_J$ a rank-1 matrix described in the text. (A) The spectrum with βα=0.1 agrees quantitatively with the MP distribution (red line). At high temperature the covariance structure of J is irrelevant and the fingerprints are uncorrelated up to sampling noise. (B) The spectrum with βα=0.6 has a bulk that agrees with the MP distribution (red line), but has a single eigenvalue escape from the bulk, near $\lambda \approx 9$. (C) The eigenvector v associated with this eigenvalue is highly correlated with $u_J$, the direction of J.

This correlation between the eigenvector and the coupling matrix J gives a physical interpretation of the projection onto the subspace of eigenvectors that escape the MP distribution in Eq. 2: We have used the data to derive a model for the binding energy of the ligand in fingerprint “coordinates,” and to determine whether an arbitrary ligand binds to the target, we are simply evaluating this binding energy. The correlation structure is lost when we use a dataset of random ligands instead of those corresponding to a single protein receptor, because in this case there is no underlying energy model to learn. Although our simulations (Fig. 5) use a rank-1 J for simplicity, if J is of higher rank, more eigenvectors will be pushed outside the MP distribution. Indeed, ref. 36 showed that random matrix denoising is related to putting in a prior that the rank of J (in our case the number of independent pharmacophores) is less than the number of variables (2,048 for the Morgan 3 fingerprint). We note that the Ising energy in Eq. 4 provides another way to score ligands. However, the classification accuracy does not significantly improve if the energy is estimated using the leading-order mean-field approximation (37).

Although interpreting our algorithm in terms of a binding energy function requires experimental verification through binding energy measurements, we note that this interpretation offers several conceptual insights. First, new candidate compounds could be uncovered by exploring the potential energy landscape of Eq. 4, and jumping between different energy minima could be related to “scaffold hopping” in drug discovery (38) as the minima would correspond to structures with pharmacophores. Investigating the topology of the energy landscape and those paths that connect distinct basins (39), as well as the statistics of energy minima, could reveal properties of the binding site. Second, relating our algorithm to an interaction energy provides a way to extend our method to regression problems, such as predicting solubility (40).

Third, we note that chemical fingerprints may be improved by incorporating physically relevant terms such as charge and molecular volume. This is facilitated by our approach, which accounts for additional noise introduced by increasing the number of fingerprint variables. Finally, the binding energy interpretation highlights the importance of high-quality negative data, i.e., which molecules do not bind to the desired receptor. Ref. 36 shows that including repulsive patterns could improve high-dimensional inference with inverse Ising/Hopfield models. Empirically, for our system, the repulsive patterns (small eigenvalues) inferred from the data are noisy and uninformative. This can be addressed either through identification of many more ligands that bind to each protein receptor, or, perhaps more efficiently, the incorporation of negative data into this framework.

Conclusion

We have developed a classification algorithm that predicts whether a compound will bind to a particular receptor of interest, given the known ligand set of that receptor. Our algorithm decomposes signal from noise using a robust bound that is derived from RMT. Applying our approach to human GPCRs reported in ChEMBL successfully identifies 84% of known ligands with a 7% false-positive rate, yielding an average AUC of 0.9. The methodology developed here complements the vast literature on optimizing fingerprint design, for example through the use of high-throughput screening data (7) or through application of neural networks to molecular graphs (41). The random matrix framework described here provides a robust threshold for maximizing the information extracted from correlations between structural features, while avoiding overfitting the data. The algorithm has the natural interpretation as a data-driven model for the binding energy of the ligands to the target protein, in fingerprint coordinates. This model gives a different perspective on the validity and uses of chemical fingerprints for both ligand binding predictions and other purposes such as predicting ligand solubility (40) or aggregation (42), as well as revealing insights in fingerprint design.

Acknowledgments

This research was funded by a grant from Roche Pharmaceuticals. A.A.L. acknowledges the support of a Fulbright Fellowship. L.J.C. was supported by a Next Generation Fellowship, and a Marie Curie Career Integration Grant (Evo-Couplings, Grant 631609). M.P.B. is an investigator of the Simons Foundation, and also acknowledges support from the National Science Foundation through DMS-1411694.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

References

  • 1.Bento AP, et al. The ChEMBL bioactivity database: An update. Nucleic Acids Res. 2014;42(Database issue):D1083–D1090. doi: 10.1093/nar/gkt1031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Jacoby E. Computational chemogenomics. Wiley Interdiscip Rev Comput Mol Sci. 2011;1(1):57–67. [Google Scholar]
  • 3.Johnson MA, Maggiora GM. Concepts and Applications of Molecular Similarity. Wiley; New York: 1990. [Google Scholar]
  • 4.Maggiora G, Vogt M, Stumpfe D, Bajorath J. Molecular similarity in medicinal chemistry. J Med Chem. 2014;57(8):3186–3204. doi: 10.1021/jm401411z. [DOI] [PubMed] [Google Scholar]
  • 5.Larsson J, Gottfries J, Muresan S, Backlund A. ChemGPS-NP: Tuned for navigation in biologically relevant chemical space. J Nat Prod. 2007;70(5):789–794. doi: 10.1021/np070002y. [DOI] [PubMed] [Google Scholar]
  • 6.García-Sosa AT, Oja M, Hetényi C, Maran U. DrugLogit: Logistic discrimination between drugs and nondrugs including disease-specificity by assigning probabilities based on molecular properties. J Chem Inf Model. 2012;52(8):2165–2180. doi: 10.1021/ci200587h. [DOI] [PubMed] [Google Scholar]
  • 7.Petrone PM, et al. Rethinking molecular similarity: Comparing compounds on the basis of biological activity. ACS Chem Biol. 2012;7(8):1399–1409. doi: 10.1021/cb3001028. [DOI] [PubMed] [Google Scholar]
  • 8.Buonfiglio R, et al. Investigating pharmacological similarity by charting chemical space. J Chem Inf Model. 2015;55(11):2375–2390. doi: 10.1021/acs.jcim.5b00375. [DOI] [PubMed] [Google Scholar]
  • 9.Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci. 2002;42(6):1273–1280. doi: 10.1021/ci010132r. [DOI] [PubMed] [Google Scholar]
  • 10.Vilar S, et al. Similarity-based modeling in large-scale prediction of drug-drug interactions. Nat Protoc. 2014;9(9):2147–2163. doi: 10.1038/nprot.2014.151. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50(5):742–754. doi: 10.1021/ci100050t. [DOI] [PubMed] [Google Scholar]
  • 12.Sastry M, Lowrie JF, Dixon SL, Sherman W. Large-scale systematic analysis of 2D fingerprint methods and parameters to improve virtual screening enrichments. J Chem Inf Model. 2010;50(5):771–784. doi: 10.1021/ci100062n. [DOI] [PubMed] [Google Scholar]
  • 13.Nikolova N, Jaworska J. Approaches to measure chemical similarity–a review. QSAR Comb Sci. 2003;22(9-10):1006–1026.
  • 14.Putta S, Beroza P. Shapes of things: Computer modeling of molecular shape in drug discovery. Curr Top Med Chem. 2007;7(15):1514–1524. doi: 10.2174/156802607782194770.
  • 15.Kubinyi H. Handbook of Chemoinformatics: From Data to Knowledge in 4 Volumes. Wiley-VCH; Weinheim, Germany: 2008. Comparative molecular field analysis (CoMFA) pp. 1555–1574.
  • 16.Shin WH, Zhu X, Bures MG, Kihara D. Three-dimensional compound comparison methods and their application in drug discovery. Molecules. 2015;20(7):12841–12862. doi: 10.3390/molecules200712841.
  • 17.Todeschini R, et al. Similarity coefficients for binary chemoinformatics data: Overview and extended comparison using simulated and real data sets. J Chem Inf Model. 2012;52(11):2884–2901. doi: 10.1021/ci300261r.
  • 18.Bajusz D, Rácz A, Héberger K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform. 2015;7:20. doi: 10.1186/s13321-015-0069-3.
  • 19.Wigner EP. Characteristic vectors of bordered matrices with infinite dimensions. I. Ann Math. 1955;62(3):548–564.
  • 20.Edelman A, Wang Y. Advances in Applied Mathematics, Modeling, and Computational Science. Springer; New York: 2013. pp. 91–116.
  • 21.Laloux L, Cizeau P, Bouchaud JP, Potters M. Noise dressing of financial correlation matrices. Phys Rev Lett. 1999;83(7):1467.
  • 22.Plerou V, Gopikrishnan P, Rosenow B, Amaral LAN, Stanley HE. Universal and nonuniversal properties of cross correlations in financial time series. Phys Rev Lett. 1999;83(7):1471.
  • 23.Bouchaud JP, Potters M. In: The Oxford Handbook of Random Matrix Theory. Akemann G, Baik J, Di Francesco P, editors. Oxford Univ Press; Oxford: 2011. pp. 824–848.
  • 24.Turk MA, Pentland AP. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Press; Washington, DC: 1991. Face recognition using eigenfaces; pp. 586–591.
  • 25.Turk M, Pentland A. Eigenfaces for recognition. J Cogn Neurosci. 1991;3(1):71–86. doi: 10.1162/jocn.1991.3.1.71.
  • 26.Marčenko VA, Pastur LA. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik. 1967;1(4):457.
  • 27.Cereto-Massagué A, et al. Molecular fingerprint similarity search in virtual screening. Methods. 2015;71:58–63. doi: 10.1016/j.ymeth.2014.08.005.
  • 28.Landrum G. 2016 RDKit: Open-source cheminformatics. Available at www.rdkit.org. Accessed October 18, 2016.
  • 29.Lagarde N, Zagury JF, Montes M. Benchmarking data sets for the evaluation of virtual ligand screening methods: Review and perspectives. J Chem Inf Model. 2015;55(7):1297–1307. doi: 10.1021/acs.jcim.5b00090.
  • 30.Unterthiner T, et al. 2014 Deep learning as an opportunity in virtual screening. Neural Information Processing Systems 2014 (NIPS 2014): Deep Learning and Representation Learning Workshop. Available at www.bioinf.jku.at/publications/2014/NIPS2014f.pdf. Accessed November 2, 2016.
  • 31.Ackley DH, Hinton GE, Sejnowski TJ. A learning algorithm for Boltzmann machines. Cogn Sci. 1985;9(1):147–169.
  • 32.Schneidman E, Berry MJ, 2nd, Segev R, Bialek W. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature. 2006;440(7087):1007–1012. doi: 10.1038/nature04701.
  • 33.Cocco S, Leibler S, Monasson R. Neuronal couplings between retinal ganglion cells inferred by efficient inverse statistical physics methods. Proc Natl Acad Sci USA. 2009;106(33):14058–14062. doi: 10.1073/pnas.0906705106.
  • 34.Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein-protein interaction by message passing. Proc Natl Acad Sci USA. 2009;106(1):67–72. doi: 10.1073/pnas.0805923106.
  • 35.Bailly-Bechet M, Braunstein A, Pagnani A, Weigt M, Zecchina R. Inference of sparse combinatorial-control networks from gene-expression data: A message passing approach. BMC Bioinformatics. 2010;11(1):355. doi: 10.1186/1471-2105-11-355.
  • 36.Cocco S, Monasson R, Sessak V. High-dimensional inference with the generalized Hopfield model: Principal component analysis and corrections. Phys Rev E. 2011;83:051123. doi: 10.1103/PhysRevE.83.051123.
  • 37.Tanaka T. Mean-field theory of Boltzmann machine learning. Phys Rev E. 1998;58(2):2302.
  • 38.Sun H, Tawa G, Wallqvist A. Classification of scaffold-hopping approaches. Drug Discov Today. 2012;17(7-8):310–324. doi: 10.1016/j.drudis.2011.10.024.
  • 39.Wales D. Energy Landscapes: Applications to Clusters, Biomolecules and Glasses. Cambridge Univ Press; Cambridge, UK: 2003.
  • 40.Llinàs A, Glen RC, Goodman JM. Solubility challenge: Can you predict solubilities of 32 molecules using a database of 100 reliable measurements? J Chem Inf Model. 2008;48(7):1289–1303. doi: 10.1021/ci800058v.
  • 41.Duvenaud DK, et al. 2015 Convolutional networks on graphs for learning molecular fingerprints. Advances in Neural Information Processing Systems 28, eds Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R. Available at https://arxiv.org/pdf/1509.09292.pdf. Accessed October 18, 2016.
  • 42.Irwin JJ, et al. An aggregation advisor for ligand discovery. J Med Chem. 2015;58(17):7076–7087. doi: 10.1021/acs.jmedchem.5b01105.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
