Author manuscript; available in PMC 2011 Jul 26.
Published in final edited form as: J Chem Inf Model. 2010 Jul 26;50(7):1205–1222. doi: 10.1021/ci100010v

When is Chemical Similarity Significant? The Statistical Distribution of Chemical Similarity Scores and Its Extreme Values

Pierre Baldi,1 Ramzi Nasr1
PMCID: PMC2914517  NIHMSID: NIHMS213669  PMID: 20540577

Abstract

As repositories of chemical molecules continue to expand and become more open, it becomes increasingly important to develop tools to search them efficiently and assess the statistical significance of chemical similarity scores. Here we develop a general framework for understanding, modeling, predicting, and approximating the distribution of chemical similarity scores and its extreme values in large databases. The framework can be applied to different chemical representations and similarity measures but is demonstrated here using the most common binary fingerprints with the Tanimoto similarity measure. After introducing several probabilistic models of fingerprints, including the Conditional Gaussian Uniform model, we show that the distribution of Tanimoto scores can be approximated by the distribution of the ratio of two correlated Normal random variables associated with the corresponding unions and intersections. This remains true even when the distribution of similarity scores is conditioned on the size of the query molecules in order to derive more fine-grained results and improve chemical retrieval. The corresponding extreme value distributions for the maximum scores are approximated by Weibull distributions. From these various distributions and their analytical forms, Z-scores, E-values, and p-values are derived to assess the significance of similarity scores. In addition, the framework also allows one to predict the values of standard chemical retrieval metrics, such as Sensitivity and Specificity at fixed thresholds, or ROC (Receiver Operating Characteristic) curves at multiple thresholds, and to detect outliers in the form of atypical molecules. Numerous and diverse experiments carried out in part with large sets of molecules from the ChemDB show remarkable agreement between theory and empirical results.

Introduction

As chemical repositories of molecules continue to grow and become more open,1–5 it becomes increasingly important to develop the tools to search them efficiently. In one of the most typical settings, a query molecule is used to search millions of other compounds not only for exact matches, but also frequently for approximate similarity matches. In a drug discovery project, for instance, one may be interested in retrieving all the commercially-available compounds that are “similar” to a given lead, with the aim of finding compounds with better physical, chemical, biological, or pharmacological properties.

The idea of searching for molecular “cousins” is of course not new, and constitutes one of the pillars of bioinformatics, where one routinely searches for homologs of nucleotide or amino acid sequences. Search tools such as BLAST6 and its significance “E-values” have become de facto standards of modern biology, and have driven the exponential expansion of bioinformatics methods in the life sciences.

In chemoinformatics, several approaches have been developed for chemical searches, including different molecular representations and different similarity scores. However, no consensus tool such as BLAST has emerged, for several reasons. Some of the reasons have to do with the cultural differences between the two fields, especially in terms of openness and data sharing. But there are also more technical and fundamental reasons: in particular, there has been no systematic derivation of a theory that can account for molecular similarity scores and their distributions and significance levels. As a result, many existing search engines do not return a score with the molecules they retrieve, let alone any measure of significance.

Examples of fundamental questions one would like to address include: What threshold should one use to assess significance in a typical search? For instance, is a Tanimoto score of 0.5 significant or not? And how many molecules with a similarity score above 0.5 should one expect to find? How does the answer to these questions depend on the size of the database being queried, or the type of queries used? A clear answer to these questions is important for developing better standards in chemoinformatics and unifying existing search methods for assessing the significance of a similarity score, and ultimately for better understanding the nature of chemical space.

These questions are addressed here systematically by conducting a detailed empirical and theoretical study of chemical similarity scores and their extreme values. Previous work related to these questions is surprisingly rare; it includes an interesting study by Keiser et al.,7 which uses empirical fitting of distributions to extreme chemical similarity scores but does not derive a predictive mathematical theory of chemical scores and their extreme values, and a short preliminary report of some of our own results.8 Here we provide a more general, complete, and self-contained treatment of these questions, including both new theoretical and new simulation results. In particular, we extend previous work by studying several different ways of assessing the significance of chemical similarity scores, by analyzing in detail how the results depend on the parameters of the query molecule as well as the size of the database being searched, by applying the general framework to the analysis and prediction of ROC curves for molecular retrieval, by applying the general framework to the detection of outlier molecules, and by providing a more complete and predictive mathematical theory of the distribution of similarity scores and its extreme values.

The rest of this paper is organized as follows. Section 2 defines the molecular representations and similarity scores that are used throughout the study. Sections 3 and 4 develop the probabilistic models required both to approximate empirical distributions of similarity scores and to create random background models against which significance can be assessed. Section 5 presents the main theory for the distribution of chemical similarity scores, followed by Section 6, which presents the theory for the distribution of the extreme values of the score distributions. Corresponding experimental results to illustrate and corroborate the theory are described in Sections 7 and 8, followed by a Discussion and Conclusions section. To improve readability, the details of the mathematical derivations are given in the Appendix.

Molecular Representations and Similarity Scores

Many different representations and similarity scores have been developed in chemoinformatics. The methods to be described here are broadly applicable but, for exposition purposes, we illustrate the theory using the framework that is most commonly used across many different chemoinformatics platforms, namely binary fingerprint representations with Tanimoto similarity scores. When appropriate, we also briefly describe how the same approach can be extended to other implementations and settings.

Molecular Representations: Fingerprints

Multiple representations have been developed for small molecules, from one-dimensional SMILES strings to 3D pharmacophores,9 and different representations can be used for different purposes. To search large databases of compounds by similarity, most modern chemoinformatics systems use a fingerprint vector representation9–15 whereby a molecule is represented by a vector whose components index the presence/absence, or the number of occurrences, of a particular functional group, feature, or substructure in the molecular bond graph. Because binary fingerprints are used in the great majority of cases, here we present the theory for these fingerprints, but it should be clear that the theory can readily be adapted to fingerprints based on counts. We use 𝒜 to denote a molecule and A⃗ = (Ai) to denote the corresponding fingerprint. We let A denote the number of 1-bits in the fingerprint A⃗ (A = |A⃗|).

In early chemoinformatics systems, fingerprint vectors were relatively short, containing typically a few dozen components selected from a small set of features, hand-picked by chemists. In most modern systems, however, the major trend is towards the combinatorial construction of extremely long feature vectors with a number of components N that can vary in the 10³–10⁶ range, depending on the set of features. Examples of typical features include all possible labeled paths or labeled trees, up to a certain depth. The advantage of these longer, combinatorially-based representations is twofold. First, they do not require expert chemical knowledge, which may be incomplete or unavailable. Second, they can support extremely large numbers of compounds containing both existing and unobserved molecular structures, such as those that are starting to become available in public repositories and commercial catalogs, as well as the recursively enumerable space of virtual molecules.16 The particular nature of the fingerprint components is not essential for the theory to be presented. To illustrate the principles, in the simulations we have used both fingerprints based on labeled paths and fingerprints based on labeled shallow trees with qualitatively similar results. For completeness, the details of the fingerprints used in the simulations are given below in the Data subsection. For brevity and consistency, the examples reported in the Results are derived primarily using fingerprints based on paths.

Fingerprint Compression

In many chemoinformatics systems, the long sparse fingerprint vectors are often compressed to much shorter and denser binary fingerprint vectors. The most widely used method of compression is a lossy compression method based on the application of the logical OR operator to the binary fingerprint vector after modulo wrapping to 512, 1,024, or 2,048 bits.12 Other, more efficient, lossless methods of compression have recently been developed.15 With the proper and obvious adjustments, our results are applicable to both lossy compressed and uncompressed fingerprints. Because these are widely used, the majority of the simulation examples we report are obtained using modulo-OR compressed binary fingerprints of length N = 1,024. Due to their shorter length, these fingerprints also have the advantage of speeding up Monte Carlo sampling simulations.
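To make the folding operation concrete, here is a minimal Python sketch of modulo-OR compression; the function name and the representation of the uncompressed fingerprint as a set of 1-bit indices are illustrative assumptions, not the ChemDB implementation.

```python
def fold_fingerprint(on_bits, n_folded=1024):
    """Lossy modulo-OR compression: each 1-bit at position i of the long,
    sparse fingerprint sets bit (i mod n_folded) of the short fingerprint.
    Distinct features can collide on the same folded bit; the OR simply
    absorbs the collision, which is where information is lost."""
    folded = [0] * n_folded
    for i in on_bits:
        folded[i % n_folded] = 1
    return folded

# Two of these three features collide at position 5 (1029 mod 1024 = 5),
# so only 2 bits are set in the folded fingerprint.
print(sum(fold_fingerprint({5, 1029, 77000})))  # -> 2
```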

Similarity Scores

Several similarity measures have been developed for molecular fingerprints.17,18 Given two molecules 𝒜 and ℬ, the Tanimoto similarity score is given by

$$S(\mathcal{A},\mathcal{B}) = \frac{A \cap B}{A \cup B} \qquad (1)$$

Here (A ∩ B) denotes the size of the intersection, i.e., the number of 1-bits common to A⃗ and B⃗, and (A ∪ B) denotes the size of the union, i.e., the number of 1-bits in A⃗ or B⃗. Because the Tanimoto similarity is by far the most widely used, the theory and experimental results reported here are based on the Tanimoto similarity. However, we also briefly describe how the same theory can be extended to other measures. Because Tanimoto similarity scores are built from intersections and unions, it will be natural to begin the theoretical analysis by studying the distribution of these intersections and unions, in particular their means, variances, and covariances.
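As a concrete illustration, the following Python sketch computes the Tanimoto score of Equation 1 directly from the 1-bit positions of two fingerprints; the set-based representation and function name are illustrative choices.

```python
def tanimoto(a_bits, b_bits):
    """Tanimoto similarity of Equation 1 for two binary fingerprints,
    each given as the set of positions of its 1-bits."""
    union = len(a_bits | b_bits)           # A ∪ B
    return len(a_bits & b_bits) / union if union else 0.0

print(tanimoto({1, 2, 3}, {2, 3, 4}))      # 2 common bits / 4 total -> 0.5
```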

Data

In the simulations, we illustrate the methods using fingerprints that are either randomly generated using one of the stochastic models described in Section 3, or randomly selected from the 5M molecules or so available in the ChemDB database.1 In the case of the actual molecules, we use fingerprints associated with two schemes:15 labeled paths of length up to eight (i.e. 9 atoms and 8 bonds), or labeled circular substructures of depth up to two, with Element (E) and Extended Connectivity (EC) labeling. In the first scheme, referred to as paths throughout the paper, for each chemical we extract all labeled paths of length up to eight starting from each vertex and using depth-first traversal of the edges in the corresponding molecular graph. For this scheme, each vertex is labeled by the element (C,N,O, etc) of the corresponding atom and each edge is labeled by the type (single, double, triple, aromatic, and amide) of the corresponding bond. This scheme is closely related to the scheme used in many existing chemoinformatics systems, including the Daylight system.12 In the second scheme, for each chemical we extract every circular substructure, of depth up to two, from the corresponding molecular graph. Circular substructures (see Hert et al.,19 Bender et al.,20 and Hassan et al.21) are fully explored labeled trees of a particular depth, rooted at a particular vertex. For this scheme, molecular graphs are labeled as follows: each vertex is labeled by the element (C,N,O, etc) and degree (1, 2, 3, etc) of the corresponding atom, and each edge is labeled as above. The degree of a vertex is given by the number of edges incident to that vertex or, equivalently, the number of atoms bonded to the corresponding atom.
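The following Python sketch illustrates the flavor of the path-extraction scheme on a toy molecular graph; the adjacency-dictionary representation, the restriction to simple paths, and the label lookups are simplifying assumptions for illustration, not the actual ChemDB code.

```python
def labeled_paths(adj, atom_label, bond_label, max_bonds=8):
    """Enumerate the labeled simple paths of length up to max_bonds starting
    from every atom, by depth-first traversal of the molecular graph.
    adj maps an atom to its neighbors; atom_label and bond_label are
    lookup tables (bond_label is assumed symmetric in its key pair)."""
    paths = set()

    def dfs(path, labels):
        paths.add("".join(labels))
        if len(path) - 1 >= max_bonds:     # a path of k atoms has k-1 bonds
            return
        for nbr in adj[path[-1]]:
            if nbr not in path:            # simple paths only
                dfs(path + [nbr],
                    labels + [bond_label[(path[-1], nbr)], atom_label[nbr]])

    for atom in adj:
        dfs([atom], [atom_label[atom]])
    return paths

# Toy C-C-O chain with single bonds
adj = {0: [1], 1: [0, 2], 2: [1]}
atom_label = {0: "C", 1: "C", 2: "O"}
bond_label = {(a, b): "-" for a in adj for b in adj[a]}
print(sorted(labeled_paths(adj, atom_label, bond_label)))
```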

Both for randomly generated fingerprints and for actual molecular fingerprints, we used both uncompressed fingerprints (equivalent to lossless compressed fingerprints) and lossy compressed fingerprints obtained using the standard modulo-OR-compression algorithm to generate fingerprint vectors of length 1,024. For both the randomly generated fingerprints and the actual molecular fingerprints, we ran typical simulations using a sample of n = 100 queries against background sets ranging from 5,000 to 1 million fingerprints in order to study the effects associated with database size.

Probabilistic Models of Fingerprints

One of the main goals of this work is to derive good statistical models and approximations for the distribution of similarity scores. At the most fundamental level this can be addressed by building probabilistic models of fingerprints. Statistical models of fingerprints are essential for a variety of tasks. For instance, in fingerprint compression, fingerprints can be viewed as “messages” produced by a stochastic source, and understanding the statistical regularities of the source is essential for deriving efficient compression algorithms that use short code words for the most frequent events. Here, statistical models are essential in at least two different ways: (1) to model and approximate the distribution of statistical scores; and (2) to assess significance against a random background. Similar observations can of course be made in bioinformatics, for instance to assess the probability of observing a particular sequence or alignment score against a random generative or evolutionary model of protein or DNA sequences. It is worth noting that, as a default, we assume that the distribution over the queries is the same as the distribution over the molecules in the database. However, these statistical models can also be used to model particular distributions over the space of queries that may differ from the overall background distribution.

Single-Parameter Bernoulli and Binomial Model

The simplest statistical model for binary fingerprints is a sequence of independent identically distributed Bernoulli trials (coin flips) with probability p of producing a 1-bit and q = 1 − p of producing a 0-bit. This model can be applied both to long fingerprints with a very low p and to modulo-OR compressed fingerprints with a higher value of p. The coin flip model corresponds to fingerprint features that are randomly ordered and statistically exchangeable, in fact even independent, and leads to a Binomial model ℬ(N, p), with only two parameters N and p, for the total number of 1-bits in the corresponding fingerprints. The single-parameter Bernoulli model is a weak model of real fingerprints for two reasons. First, the probabilities of the individual components are not identical: some features are more likely to occur than others. Second, the components are not strictly independent. These shortcomings are further addressed in the more complex models described below. Nevertheless, the single-parameter Bernoulli model remains useful because of its simplicity and tractability, and it provides a point of reference or baseline for other models.

The Bernoulli model can be used to approximate the distribution of fingerprints in an entire database such as ChemDB by setting p to the average fingerprint density in the database. If one then compares the behavior of the number A of 1-bits in the Bernoulli-generated fingerprints and in the actual database, one typically observes that the average of A is the same in both cases, by construction of p, but the variance is quite different. The variance of A in the Bernoulli-generated fingerprints is given by Npq and is always at most equal to the expectation Np, whereas in large databases of compounds one typically observes a larger variance (Figure 1). In general, a better model for A is provided by a Normal distribution 𝒩(μ, σ²) where the mean μ = Np and variance σ² ≥ Npq are fitted empirically to the data.

Figure 1. Distributions of the number of 1-bits in fingerprints from the ChemDB (blue solid line) and fingerprints from the matching Single-Parameter Bernoulli model (red solid line) with p ≈ 205/1,024. Both distributions are constructed using a random sample of 100,000 fingerprints. Though both distributions have similar means, the standard deviations differ significantly. The distributions are also fit using two Normal distributions, which approximate the data well (dotted lines).

In some analyses, it is useful to consider fingerprints that contain A 1-bits. These can be modeled using Bernoulli coin flips with p = A/N, although this is at best an approximation since in the resulting fingerprints the number of 1-bits is not constant and varies around the mean value A, introducing some additional variability with respect to the case where A is held fixed (see the Conditional Distribution Uniform model below). Finally, a distribution over queries that is different from the overall database distribution can be modeled using two Bernoulli models: one with parameter r for the queries, and one with parameter p (p ≠ r) for the database.

Multiple-Parameter Bernoulli Model

While the coin flip model is useful for deriving a number of approximations, it is clear that chemical fingerprints have a more complex structure, and their components are not exactly exchangeable since the individual feature probabilities p1,…, pN are not identical and equal to p but vary significantly. In particular, when the fingerprint components are reordered in decreasing frequency order, they typically follow a power-law distribution,15 especially in the uncompressed case. The probability of the j-ranked component is given approximately by $p_j = C j^{-\alpha}$, resulting in a line of slope −α in a log-log plot. Thus the statistical model at the next level of approximation is a sequence of non-stationary independent coin flips where the probability pj of each coin flip varies. This Multiple-Parameter Bernoulli model has N parameters: p1, p2, …, pN. In this case, using the independence, the expectation of the total number A of 1-bits is given by $\sum_{i=1}^{N} p_i$ and its variance by $\sum_{i=1}^{N} p_i q_i$. In general, this variance is still an underestimate of the variance observed in actual large databases, in spite of the larger number of parameters compared to the Single-Parameter Bernoulli model (not shown). As in the case of the Single-Parameter Bernoulli model, a distribution over queries different from the overall distribution could be modeled using a Multiple-Parameter Bernoulli model with a different set of parameters r1,…,rN.
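A minimal sketch of sampling from the Multiple-Parameter Bernoulli model with power-law feature probabilities; the values of C and α below are hypothetical placeholders that would normally be fitted to the ranked feature frequencies of an actual database.

```python
import random

def sample_multi_bernoulli(n=1024, alpha=1.5, c=0.5, rng=random):
    """Draw one fingerprint from the Multiple-Parameter Bernoulli model
    with power-law feature probabilities p_j = C * j^(-alpha), capped at 1.
    alpha and c are hypothetical values, not fitted parameters."""
    p = [min(1.0, c * (j + 1) ** (-alpha)) for j in range(n)]
    return [1 if rng.random() < pj else 0 for pj in p]
```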

Conditional Distribution Uniform Model

Both the Single-Parameter and Multiple-Parameter Bernoulli models consider the fingerprint components as independent random variables. The Conditional Distribution Uniform model is an exchangeable model where the components are weakly coupled and thus not independent. To generate a fingerprint vector under this model, one first samples the value A corresponding to the total number of 1-bits in the fingerprint, using a given distribution, typically a Normal distribution (Figure 1). The model then assumes that, conditioned on the value of A, all fingerprints with A 1-bits are equally probable (uniform distribution). Thus, for example, the Conditional Normal Uniform model has only three parameters: the mean μ, the variance σ², and N. Compared to the Binomial model, the additional parameter in the Conditional Normal Uniform model allows for a better fit of the variance of A in the data. As we shall see, for the questions considered here the Conditional Normal Uniform model performs best, in spite of the fact that it does not model the probability differences between different fingerprint components.
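A minimal sketch of fingerprint generation under the Conditional Normal Uniform model, assuming the mean and standard deviation of A have already been fitted to the database (e.g., values of the kind underlying Figure 1):

```python
import random

def sample_cnu_fingerprint(mu, sigma, n=1024, rng=random):
    """Draw one fingerprint from the Conditional Normal Uniform model:
    sample A ~ Normal(mu, sigma^2), round and clip it to [0, n], then
    choose the positions of the A 1-bits uniformly at random."""
    a = max(0, min(n, round(rng.gauss(mu, sigma))))
    fingerprint = [0] * n
    for i in rng.sample(range(n), a):  # uniform over fingerprints with A 1-bits
        fingerprint[i] = 1
    return fingerprint
```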

Spin Models

More complex, second-order models are possible but will not be considered here. These models are essentially spin models from statistical physics, and are also known as Markov random fields or Boltzmann machines.22,23 In these models, one would also have to take into account the correlations between pairs of features, which can be superimposed over the Multiple Bernoulli model. In real data, these correlations are often (though not always) weak, but not exchangeable, and thus behave differently from those introduced in the Conditional Distribution Uniform model. The slight improvements in modeling accuracy that may result from spin models come in general at a significant computational cost, since these models cannot be solved analytically and therefore cannot be used in a straightforward manner to derive the probability distribution of the similarity scores. The study of spin models is left for future work.

Probabilistic Models of Distribution Scores

While in the following sections and the Appendix we show how the distribution of similarity scores can be estimated from “first principles”, i.e., from the corresponding probabilistic model of fingerprints, it is also possible to model or approximate the score distribution directly, for instance by assuming that the scores are approximately normally distributed and obtaining the mean and standard deviation by sampling methods. In a similar way, one can also use a Gamma distribution model to completely avoid negative scores, or a Beta model to insist on scores bounded between 0 and 1, to model the overall distribution of scores. Another intermediate alternative, which is less direct but still avoids modeling the fingerprints themselves, is to model the intersections and unions that are used to derive the Tanimoto scores, and then to derive the distribution of the scores from those models. For example, one could model both the intersections and the corresponding unions using two different Normal distributions and derive the means and standard deviations of these Normal distributions by sampling methods. The commonalities, differences, and tradeoffs between these various modeling and approximation approaches to the distribution of chemical similarity scores will become clear in the following sections. The most complex case, where everything is derived from the probabilistic models of fingerprints, is treated in detail in the Appendix.

The Distribution of the Similarity Scores

With these preliminaries in place, we are now set to analyze the distribution of Tanimoto scores under the various probabilistic models.

Main Result

Since the Tanimoto score is the ratio of an intersection over a union, the basic strategy is to first study the distributions of the corresponding intersection and union, their means and variances. Note that the intersection and union are, in general, not two independent random variables, but have a non-zero correlation that must be estimated analytically or through simulations. In turn, from these results one can derive a closed-form approximation for the distribution of the Tanimoto scores and its extreme properties. This analysis can be carried out using empirical fingerprint data as well as fingerprints generated by the probabilistic models. Furthermore, the analysis can be conducted by conditioning on the total number of 1-bits contained in the query molecule (A) and the molecules being searched (B), by conditioning on only one of these quantities, typically the number A of 1-bits in the query molecule, and integrating over the other, or with no conditioning at all by integrating over both the fingerprints in the queries and the fingerprints in the database being searched. These forms of conditioning are practically relevant, especially conditioning on A, which will be shown to lead to much better retrieval results.

In all these cases, one in general finds that:

  1. The intersection and union have approximately a Normal distribution with means and variances that can be estimated empirically or computed analytically in the case of the probabilistic models.

  2. The intersection and union have a non-zero (positive) covariance that can be estimated empirically or computed analytically in the case of the probabilistic models.

  3. As a consequence, the distribution of the corresponding Tanimoto scores can be modeled and approximated by the distribution of the ratio of two correlated Normal random variables.

These facts are demonstrated in the Results section using simulations. In Appendix A, mathematical proofs are provided for the probabilistic models of fingerprints, together with analytical formulas for the means, variances, and covariances of the intersections and unions.

Ratio of Two Correlated Normal Random Variables Approximation

Whether one uses the Single Bernoulli/Binomial, Multiple Bernoulli, or Conditional Distribution Uniform models, or the empirical intersection and union data, in the end the Tanimoto score distribution can be approximated by the distribution of the ratio of two correlated Normal random variables approximating the numerator and the denominator. The different models will yield different estimates of the means, variances, and covariance of the respective Normal distributions.

The density of the ratio of two correlated Normal random variables has been studied in the literature and can be obtained analytically, although its expression is somewhat involved.24–27 The probability density for Z = X/Y, where X ∼ 𝒩(μX, σX²), Y ∼ 𝒩(μY, σY²), and ρ = Corr(X, Y) ≠ ±1, is given by the product of two terms

$$P_Z(z) = \frac{\sigma_X \sigma_Y \sqrt{1-\rho^2}}{\pi\left(\sigma_Y^2 z^2 - 2\rho\,\sigma_X\sigma_Y z + \sigma_X^2\right)} \left[\exp\!\left(-\frac{1}{2}\sup R^2\right)\left(1 + R\,\frac{\Phi(R)}{\varphi(R)}\right)\right] \qquad (2)$$

or

$$P_Z(z) = \frac{\sigma_X \sigma_Y \sqrt{1-\rho^2}}{\pi\left(\sigma_Y^2 z^2 - 2\rho\,\sigma_X\sigma_Y z + \sigma_X^2\right)} \left[\exp\!\left(-\frac{1}{2}\sup R^2\right) + \sqrt{2\pi}\,R\,\Phi(R)\,\exp\!\left(-\frac{1}{2}\left[\sup R^2 - R^2\right]\right)\right] \qquad (3)$$

where

$$R = R(z) = \frac{\left(\sigma_Y^2\mu_X - \rho\,\sigma_X\sigma_Y\mu_Y\right)z - \rho\,\sigma_X\sigma_Y\mu_X + \sigma_X^2\mu_Y}{\sigma_X\sigma_Y\sqrt{1-\rho^2}\,\sqrt{\sigma_Y^2 z^2 - 2\rho\,\sigma_X\sigma_Y z + \sigma_X^2}} = \frac{\left(\frac{\mu_X}{\sigma_X} - \rho\frac{\mu_Y}{\sigma_Y}\right)z - \left(\rho\frac{\mu_X}{\sigma_X} - \frac{\mu_Y}{\sigma_Y}\right)\frac{\sigma_X}{\sigma_Y}}{\sqrt{1-\rho^2}\,\sqrt{z^2 - 2\rho\frac{\sigma_X}{\sigma_Y} z + \left(\frac{\sigma_X}{\sigma_Y}\right)^2}} \qquad (4)$$

$$\sup R^2 = \frac{\sigma_Y^2\mu_X^2 - 2\rho\,\sigma_X\sigma_Y\mu_X\mu_Y + \sigma_X^2\mu_Y^2}{\sigma_X^2\sigma_Y^2\left(1-\rho^2\right)} = \frac{\left(\frac{\mu_X}{\sigma_X}\right)^2 - 2\rho\frac{\mu_X}{\sigma_X}\frac{\mu_Y}{\sigma_Y} + \left(\frac{\mu_Y}{\sigma_Y}\right)^2}{1-\rho^2} \qquad (5)$$

and

$$\sup R^2 - R^2 = \frac{\left(\mu_X - \mu_Y z\right)^2}{\sigma_Y^2 z^2 - 2\rho\,\sigma_X\sigma_Y z + \sigma_X^2} = \frac{\left(\frac{\mu_X}{\sigma_X}\frac{\sigma_X}{\sigma_Y} - \frac{\mu_Y}{\sigma_Y} z\right)^2}{z^2 - 2\rho\frac{\sigma_X}{\sigma_Y} z + \left(\frac{\sigma_X}{\sigma_Y}\right)^2} \qquad (6)$$

Thus, anytime one can approximate the intersections and the unions by two correlated Normal random variables, the distribution of the Tanimoto scores can be approximated using Equations 2–6 with X = I and Y = U. This approach can be used, for instance, to derive the mean and standard deviation of the Tanimoto scores under various assumptions, including: (1) the Single- and Multiple-Parameter Bernoulli models with p = r (or pi = ri) for the average Tanimoto scores across all queries; (2) the Single- and Multiple-Parameter Bernoulli models with p ≠ r (or pi ≠ ri) for queries modeled by a different Bernoulli model than the one used for the database being searched; (3) the Conditional Distribution Uniform model with A fixed, or A integrated over the database distribution, or a distribution over queries; and (4) the empirically-derived Normal models for the union and intersection averaged over the entire database, or focused on a particular class of molecules.

A Python code implementation for the density of the ratio of two correlated Normal random variables (Equations 2–6) is available from the ChemDB chemoinformatics portal (cdb.ics.uci.edu), under Supplements.
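For readers who prefer to see the computation spelled out, the sketch below re-implements the density of Equations 3–5 in plain Python; it is not the ChemDB supplement code. Equation 3 is used rather than Equation 2 because both of its exponents are non-positive, which avoids overflow for large R; the moment values in the usage example are hypothetical placeholders.

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ratio_density(z, mu_x, mu_y, sig_x, sig_y, rho):
    """Density of Z = X/Y for correlated Normal X and Y (Equations 3-5).
    For Tanimoto scores, X plays the role of the intersection I and Y of
    the union U."""
    q = sig_y**2 * z**2 - 2.0 * rho * sig_x * sig_y * z + sig_x**2
    root = math.sqrt(1.0 - rho**2)
    # R(z), Equation 4
    r = (((sig_y**2 * mu_x - rho * sig_x * sig_y * mu_y) * z
          - rho * sig_x * sig_y * mu_x + sig_x**2 * mu_y)
         / (sig_x * sig_y * root * math.sqrt(q)))
    # sup R^2, Equation 5
    sup_r2 = ((sig_y**2 * mu_x**2 - 2.0 * rho * sig_x * sig_y * mu_x * mu_y
               + sig_x**2 * mu_y**2)
              / (sig_x**2 * sig_y**2 * (1.0 - rho**2)))
    # Equation 3: both exponents are <= 0, so no overflow can occur
    prefactor = sig_x * sig_y * root / (math.pi * q)
    return prefactor * (math.exp(-0.5 * sup_r2)
                        + math.sqrt(2.0 * math.pi) * r * norm_cdf(r)
                          * math.exp(-0.5 * (sup_r2 - r * r)))

# Hypothetical moments for intersections (X) and unions (Y) of compressed
# fingerprints; real values would be estimated as described in the text.
print(ratio_density(0.3, 105.0, 305.0, 45.0, 70.0, 0.82))
```

As a sanity check, with μX = μY = 0, σX = σY = 1, and ρ = 0 the function reduces to the standard Cauchy density 1/(π(1 + z²)), as expected for the ratio of two independent standard Normals.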

Extensions to Other Measures

While we have described the theory for Tanimoto similarity scores, the same theory can readily be adapted to most other fingerprint similarity measures.17,18 To see this, it suffices to note that the other measures consist of algebraic expressions built from A ∩ B and A ∪ B, as well as other obvious terms such as A, B, and sometimes N. For example, the Tversky measure28,29 is an important generalization of the Tanimoto measure, defined by

$$S_{\alpha\beta}(\mathcal{A},\mathcal{B}) = \frac{A \cap B}{\alpha A + \beta B + (1-\alpha-\beta)\,(A \cap B)} \qquad (7)$$

where the parameters α and β can be used to tune the search towards sub-structures or super-structures of the query molecule. The numerator and denominator in the Tversky measure can again be modeled by two correlated Normal random variables. The only difference is in the mean and variance of the denominator, and its covariance with the numerator. The new mean, variance, and covariance can be computed empirically. They can also be derived analytically for the simple probabilistic models, as described in Appendix A. Similar considerations apply to all the other measures described in refs 17 and 18. Thus the distribution and statistical properties of all the other similarity measures17,18 can readily be derived from the general framework presented here.

Alternatives and Related Approaches

Because the intersections and unions always have positive values, it is also possible in some cases to approximate their distributions using Gamma distributions. The distribution of Tanimoto scores can then be modeled using the distribution of the ratio of two correlated Gamma distributions, for which some theory exists.30–32 Likewise, in regimes where the finite [0, N] range of the intersections and unions becomes important, the intersection and union can be rescaled by 1/N and the corresponding distributions modeled using Beta distributions. In this case, the distribution of Tanimoto scores can be modeled using the distribution of the ratio of two correlated Beta distributions.33 Finally, as already mentioned, it is also possible to model or approximate a distribution of Tanimoto scores directly using a Normal, Gamma, or Beta distribution (or a mixture of these distributions) without having to first consider the intersections and unions.

It is not possible to give a general prescription as to which approach may work best in a practical application, since this may depend on the details and goals of a particular implementation. However, the theory presented provides a general framework for predicting and modeling the distribution of Tanimoto scores that can be adapted to any particular implementation. And using this distribution, it is possible to derive measures and visualization tools to assess the quality and significance of the molecules being retrieved with a given query and the corresponding rates of false positives and false negatives, as described in the next sections.

Theory: Z-Scores, E-Values, P-Values, Outliers, and ROC Curves

There are various computational approaches for determining the significance of similarity scores. All these approaches derive from the distribution of similarity scores. Significance scores include Z-scores, E-values, and p-values associated with the extreme value distribution34–36 of similarity scores. The distribution of similarity scores can also be used to detect outliers and predict ROC (Receiver Operating Characteristic) curves in chemical retrieval. As we shall see, these significance analyses yield better results when conditioned on the size of the queries.

Z-Scores

In the Z-score approach, one simply looks at the distance of a score from the mean of the corresponding family of scores, in numbers of standard deviations. Therefore the Z-score is given by

$$Z = \frac{t - \mu}{\sigma} \qquad (8)$$

The parameters μ and σ can be determined either empirically from a database of fingerprints, or analytically using the statistical models described above. While Z-scores can be useful, their focus is on the global mean and standard deviation of the distribution of the scores, not on the tail of extreme values. Thus we next consider two measures that are more focused on the extreme values.

E-Values

When considering a particular similarity value or selecting a similarity threshold t for a given query, an important consideration is the expected number of hits in the database above that threshold. To use a terminology similar to what is used for BLAST, we refer to this number as the E-value. From the distribution of scores in a database of size D, the E-value corresponding to a Tanimoto threshold t is estimated by

$$E = \left[1 - F(t)\right] \times D \qquad (9)$$

where F(t) is the cumulative distribution of the corresponding similarity scores, which can be approximated using the methods described above.
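In practice, when F(t) is not available in closed form, the E-value can be estimated directly from a sample of background scores, as in this minimal sketch (the function name and the sampling scheme are illustrative assumptions):

```python
def e_value(t, background_scores, db_size):
    """Empirical E-value of Equation 9: the fraction of background scores
    at or above threshold t estimates 1 - F(t), scaled by the database size."""
    frac_above = sum(s >= t for s in background_scores) / len(background_scores)
    return frac_above * db_size
```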

Extreme Value Distributions and P-Values

The second approach focused on extreme values corresponds to computing p-values. For a given score t, its p-value is the probability of finding a score equal to or greater than t under a random model. Thus in this case one is interested in modeling the tail of the distribution of the scores, and more precisely the distribution of the maximum score.34–36 This distribution depends on the size of the database being searched since, for a given query and everything else being equal, we can expect the maximum similarity value to increase with the database size.

Consider a query molecule 𝒜 used to search a database containing D molecules, yielding D similarity scores t1,…,tD. The cumulative distribution of the maximum score t_max is given by

$$F_{\max}(t) = P(t_{\max} \le t) = P(t_1 \le t)\cdots P(t_D \le t) = F(t)^D \qquad (10)$$

under the usual assumption that the scores are independent and identically distributed. Here F(t) is the cumulative distribution of a single score. A p-value is obtained by computing the probability

$$p = 1 - F_{\max}(t) \qquad (11)$$

that the maximum score be larger than t under F. The density of the maximum is obtained by differentiation

$$f_{\max}(t) = D\, f(t)\,\left[F(t)\right]^{D-1} \qquad (12)$$

where f(t) is the density of a single score. In the case of Tanimoto similarity scores, f(t) can be approximated using the ratio of two correlated Normal random variables approach described above, and F(t) is obtained from f(t) by integration. F(t) can also be approximated by25

$$F(t) \approx \Phi\!\left(\frac{\mu_Y t - \mu_X}{\sigma_X\sigma_Y\, a(t)}\right) \qquad (13)$$

where $\Phi(u) = \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx$ is the cumulative distribution of the standardized Normal distribution and

$$a(t) = \left(\frac{t^2}{\sigma_X^2} - \frac{2\rho\, t}{\sigma_X\sigma_Y} + \frac{1}{\sigma_Y^2}\right)^{1/2} \qquad (14)$$

This approximation is good when the denominator of the ratio of two correlated Normal random variables is positive, with its average much larger than its standard deviation, so that the denominator is almost surely positive. In any case, by combining Equations 2, 12, and 13, we get:

$$f_{\max}(t) \approx D\,\frac{\sigma_X \sigma_Y \sqrt{1-\rho^2}}{\pi\left(\sigma_Y^2 t^2 - 2\rho\,\sigma_X\sigma_Y t + \sigma_X^2\right)} \left[\exp\!\left(-\frac{1}{2}\sup R^2\right)\left(1 + R\,\frac{\Phi(R)}{\varphi(R)}\right)\right]\left[\Phi\!\left(\frac{\mu_Y t - \mu_X}{\sigma_X\sigma_Y\, a(t)}\right)\right]^{D-1} \qquad (15)$$

Finally, because the Tanimoto scores are bounded by one, the theory of extreme value distributions shows that the cumulative distribution of the normalized maximum score n_D, normalized linearly in the form n_D = a_D t_max + b_D using appropriate sequences a_D and b_D of normalizing constants, converges to a type-III extreme value distribution, or Weibull distribution function, of the form

$$F(x) = P(n_D \le x) = \exp\!\left[-\left(\frac{\mu - x}{\sigma}\right)^{\xi}\right] \qquad (16)$$

The linear normalization can be ignored since it is absorbed into the parameters of the Weibull distribution. The advantage of the Weibull formula is that it represents Fmax in a closed form that can be easily and efficiently computed. How to fit the Weibull distribution to the data in practice is described in Appendix B.
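As an illustration, the sketch below fits the Weibull form of Equation 16 by maximum likelihood using SciPy's weibull_max (the type-III family for maxima bounded above), rather than the two-point fit of Appendix B; the max_scores values are placeholders standing in for one maximum Tanimoto score per query.

```python
import numpy as np
from scipy.stats import weibull_max

# Hypothetical maxima: one top Tanimoto score per query, each obtained by
# searching the same background database (placeholder values only).
max_scores = np.array([0.31, 0.28, 0.35, 0.30, 0.33, 0.29, 0.36, 0.27])

# Tanimoto scores are bounded above by 1, so fix the Weibull location
# (the upper end point mu of Equation 16) at 1 and fit shape and scale.
xi, mu, sigma = weibull_max.fit(max_scores, floc=1.0)

# p-value of an observed score t: probability that the maximum exceeds t
t = 0.5
p_value = weibull_max.sf(t, xi, loc=mu, scale=sigma)
print(xi, sigma, p_value)
```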

Outliers

The framework allows us to detect molecules that are atypical, within their group, in the following sense. From the framework we can predict the typical (average) distribution of Tanimoto scores for a given query size S, or the expected number of hits above any given threshold t, given S. If we are dealing with an actual query molecule 𝒜 with a fingerprint containing A = S 1-bits, and the distribution of observed scores for 𝒜 differs from the typical distribution given A, then the molecule 𝒜 can be viewed as atypical within the class of molecules in the database containing A 1-bits in their fingerprints. The difference between the typical distribution of scores for molecules with A 1-bits and the distribution of scores generated by the actual query 𝒜 can be measured in many ways, for instance by using the relative entropy or Kullback-Leibler divergence between the two distributions.37 Similar considerations can be made using the expected number of scores above a given threshold for molecules with A 1-bits versus the actual number observed for molecule 𝒜.

ROC Curves

Finally, the general framework can be used to predict false positive and false negative rates, as well as standard ROC (Receiver Operating Characteristic) curves. For conciseness, we describe the approach for ROC curves, which plot false positive rates on the x-axis versus true positive rates on the y-axis. Consider a set of molecules (e.g., a set of Estrogen Receptor binding molecules) as a set of positive examples used to search a large database for similar molecules. Empirically, or using the ratio of correlated Normal random variables approach, one can derive a density f and a corresponding cumulative distribution F for the similarity scores of the positive examples, and a density g and a corresponding cumulative distribution G for the similarity scores of the negative examples provided by the overwhelming majority of the molecules in the large database. Thus, for a given threshold t on the Tanimoto similarity, the corresponding point on the ROC curve is easily obtained and given by

$$x = 1 - G(t) \qquad \text{and} \qquad y = 1 - F(t) \qquad (17)$$

In other words, using continuous approximations, the equation of the ROC curve is given by y = 1 − F(G⁻¹(1 − x)). Similarly, other measures such as Specificity or Sensitivity can be estimated at any given threshold.
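A minimal sketch of this construction from two empirical score samples; the positive and negative score arrays are assumed to have been collected as described above, and the threshold grid is an arbitrary choice.

```python
import numpy as np

def predicted_roc(pos_scores, neg_scores, thresholds=None):
    """ROC curve from two score samples: for each threshold t the point is
    (x, y) = (1 - G(t), 1 - F(t)), where F and G are the cumulative
    distributions of the positive and negative scores (Equation 17)."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    pos = np.sort(np.asarray(pos_scores))
    neg = np.sort(np.asarray(neg_scores))
    # searchsorted counts scores below t, so 1 - count/n is the
    # fraction of scores at or above the threshold
    y = 1.0 - np.searchsorted(pos, thresholds) / len(pos)  # true positive rate
    x = 1.0 - np.searchsorted(neg, thresholds) / len(neg)  # false positive rate
    return x, y
```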

Now, armed with this theoretical framework, we can proceed with simulations to demonstrate how the framework can be applied and assess the quality of the corresponding predictions. The following sections describe experimental results obtained using actual molecules from the ChemDB. A large number of experiments were carried out, and only a sample of the main results is described here for brevity. Unless otherwise specified, the results reported here are obtained using path fingerprints compressed to 1,024 bits using lossy modulo-OR compression.

Simulation Results: The Distribution of Similarity Scores

We first examine the quality of the ratio of correlated Normal random variables approximation. Figure 2 shows in the left column the empirical distributions of the sizes of the intersections and unions averaged across the entire database and obtained by Monte Carlo methods, for both lossy compressed fingerprints (upper plots) and uncompressed fingerprints (lower plots) together with their Normal approximations. The positive covariance between the intersections and unions is Cov(I,U) = 3048.5 (with corresponding correlation Corr(I,U) = 0.82) for the lossy compressed fingerprints, and Cov(I,U) = 1253.2 (Corr(I,U) = 0.35) for the uncompressed fingerprints. In the right column, one can see the corresponding histogram of Tanimoto scores and the ratio of correlated Normal random variables approximation. Overall, the ratio of correlated Normal random variables approach approximates the histograms very well in this case, where one is using averaging over all molecules.

Figure 2. Results obtained with 100 molecules randomly selected from ChemDB used as queries against a sample of 100,000 molecules randomly selected from ChemDB. The two upper figures correspond to fingerprints of length 1,024 with modulo-OR lossy compression, while the two lower figures correspond to fingerprints with lossless compression (equivalent to uncompressed fingerprints). The figures in the left column display the histograms of the sizes of the intersections and unions and their direct Normal approximations, in blue and green respectively. The figures in the right column display the histograms of the Tanimoto scores (blue bars), while the solid black line shows the corresponding approximation derived using the ratio of correlated Normal random variables approach.

To test whether the ratio of correlated Normal random variables approximation works well at a finer-grained level, we repeat a similar experiment but condition on the size A of the query molecules. In fact, in this experiment we use an even more stringent theoretical model. Instead of fitting Normal distributions to the intersections and unions (as in Figure 2), we assume that the data is generated by the Conditional Normal Uniform model with only two parameters, fit to the mean and variance of B across the entire ChemDB. As described in the Appendix, this gives us analytical formulas for the means and variances of the intersections and unions for each value of A, as well as their covariances. Figure 3 provides heat maps showing the corresponding empirical and predicted distributions of the intersections (first row) and unions (second row) as a function of A. The last row compares the observed Tanimoto score distribution to the predicted Tanimoto score distribution, using the ratio of correlated Normals approach. Overall, there is remarkable agreement between the theoretical predictions and the corresponding empirical observations at all values of A and at all Tanimoto scores, especially considering that the Conditional Normal Uniform model used in these heat maps has only two parameters fit to the actual data (the mean and variance of B in ChemDB).

Figure 3. Empirical (left) and predicted (right) heat maps corresponding to the distribution of the intersections (top), unions (middle), and Tanimoto scores (bottom). The distribution is conditioned on the size of the query molecule, A, shown on the vertical axis. The empirical results are obtained by using, for each A, 100 molecules randomly selected from the molecules in ChemDB with size A. The theoretical results for the intersection and union distributions use the Conditional Normal Uniform model. At each value of A, the mean and variance of the intersection and union are obtained from Equations 29, 30, 33, and 36, respectively. The theoretical score distribution results from the ratio of correlated Normal random variables approximation given by Equations 2–6.

Likewise, Figure 4 shows how for each value of A, the covariance between the union and the intersection is well predicted by the Conditional Normal Uniform model, with a small deviation observed for molecules with a high bit count where the covariance is slightly smaller than predicted by the theory, probably as a result of a decrease in the variability of the size of the union for queries associated with molecules from ChemDB with a large A (the size of the union tends to be close to A since the components in the complement have exceedingly small probabilities).

Figure 4. The empirical and theoretical covariance Cov(I,U) between the intersection and the union, conditioned on the size A of the query molecule, shown in blue and green respectively. Empirical results are obtained by using, for each A, 100 molecules randomly selected from the molecules in ChemDB with size A. A is shown on the vertical axis for consistency with the previous heat map figures. Theoretical predictions are derived with the Conditional Normal Uniform model conditioned on A (Equation 39).

In sum, these results show that the distribution of Tanimoto scores can be modeled, predicted, or approximated accurately with the framework proposed here. Among the simplest models, the Conditional Normal Uniform Model performs best. Conditioning on the size A of the query can play an important role, since there are significant variations in the distribution of the scores as A varies.

Z-Scores, E-Values, P-Values, Outliers, and ROC Curves

We now turn to the assessment of significance using Z-scores, E-values, extreme value distributions and p-values, and ROC curves. Figure 5 provides four examples, one in each column, of pairs of molecules where the top molecule can be viewed as the query, and the bottom molecule can be viewed as a potential “hit” retrieved while searching a random subset of 100,000 molecules taken from the ChemDB. The four queries have different sizes corresponding to A = 16, 109, 199, and 258. The corresponding four Tanimoto similarity scores are 0.200, 0.400, 0.571, and 0.233. Columns a and d correspond to similar Tanimoto scores, although they should be viewed quite differently due to the disparity in the size A of the corresponding queries, as shown in the following analyses.

Figure 5. The first row shows four query molecules. The second row shows four corresponding potential “hits” in the corresponding columns. The table shows the size A of the four query molecules, followed by the corresponding Tanimoto scores, Z-scores, E-values, and p-values observed empirically or predicted from the theory, with and without conditioning on the size A of the query molecule. Molecules are represented by Daylight-style fingerprints of length 1,024 with modulo-OR lossy compression.

Z-Scores

The Z-score is the distance of a Tanimoto score from the mean, measured in number of standard deviations. As usual, for a given query of size A, the mean and standard deviation can be computed over all molecules, or over molecules of size A only. As expected, Figure 5 shows that the Z-scores computed over all molecules are not very informative and would indicate that all four pairs of molecules have Z-scores above 9 and are therefore very significantly similar. The Z-scores computed by conditioning on A are slightly more informative: while they still return three of the matches, corresponding to columns a, b, and c, as highly significant (Z-scores above 5) compared to random, they begin to separate these cases from the case of column d, which is scored as not very significant (Z-score of 0.810).

E-Values

As described previously, the E-value for a Tanimoto score t represents the expected number of “hits” above t and can be estimated empirically or predicted from the distribution of the scores and the size D of the database (Equation 9). Figure 5 shows again that theoretical E-values obtained by conditioning on A are more useful and in the range of the E-values observed empirically (here D = 100,000). The empirical E-values, or the predicted E-values conditioned on A, now clearly separate the similarity in column d as being non-significant. The empirical and predicted E-values show that when the size of the molecular fingerprints is taken into consideration, the E-value in column d is considerably less significant than the E-value in column a, in spite of the fact that the Tanimoto score in d (0.233) is higher than the Tanimoto score in a (0.200). In addition, the E-values identify column c as the only one with really significant similarity (E-value of 0.1).

Figure 7 provides further evidence of the utility of E-values and of conditioning on A. This figure is obtained by using 55 Estrogen Receptor binding molecules38 together with a random sample of 100,000 molecules extracted from the ChemDB. For each one of the 55 Estrogen Receptor ligands and any threshold, the figure essentially plots the number of molecules that have a score above that threshold. Thus, for example, for t = 0 all 100,000 molecules have a score above t, corresponding to 55 superimposed dots on the graph. Likewise, for t = 1 there are 55 superimposed dots with vertical value equal to 0 because no molecule scores higher than 1. Note how the number of hits varies greatly in the threshold interval [0.1, 0.3]. The solid red line represents the predicted E-values obtained using the Conditional Normal Uniform model and the corresponding ratio of two correlated Normal random variables approximation integrated over all the molecules. The red curve is slightly shifted with respect to the empirical points because the molecules in the Estrogen Receptor dataset have an average size of 143 bits, while molecules in ChemDB have an average size of 205 bits. Deviations from the red line are observed in the actual data. The right side of Figure 7 shows how this can be corrected by looking at individual molecules, in this case the molecules with the smallest (A = 64) and largest (A = 305) size among the dataset of Estrogen Receptor ligands. The predicted curves obtained by conditioning on the corresponding values of A are in excellent agreement with the corresponding empirical values.

Figure 7. Left: 55 Estrogen Receptor ligands are used to query a sample of 100,000 molecules randomly selected from the ChemDB. The horizontal axis represents Tanimoto threshold scores. The vertical axis represents the number of scores above the threshold (hits). Each dot represents a query's number of hits above the corresponding threshold on the horizontal axis. Superimposed dots are indistinguishable (see text). The solid red line represents the predicted E-values based on the ratio of two correlated Normal random variables approximation integrated over all values of A in the sample. Right: Dots associated with the Estrogen Receptor ligand with the largest A (cyan) and the smallest A (green) are isolated. The solid lines show predicted E-values based on the ratio of two correlated Normal random variables conditioned on the size of the two query molecules: A = 305 (cyan) and A = 64 (green).

Extreme Value Distributions and P-values

The p-value for a score t is computed from the extreme value distribution and corresponds to the probability of observing a maximum score above t. It is thus given by the complement (Equation 11) of the cumulative distribution of the maximum scores (Equation 10). It can again be measured empirically by Monte Carlo sampling or predicted from the distribution of the scores and its extreme value distribution, in particular using the Weibull form of Equation 16 (see also Appendix B). Figure 5 again shows examples of p-value results for actual molecule searches in ChemDB. Each search yields one binary result of whether or not the maximum score is greater than a threshold. Multiple searches of the query molecule against different samples are thus needed to derive a probability that compares directly to the computed p-value. As in the case of E-values, the figure shows that the p-values obtained by conditioning on A closely approximate the p-values obtained by Monte Carlo simulations. These p-values very clearly identify column c as the only column corresponding to a significant Tanimoto similarity with respect to ChemDB. Unlike the E-value above, the p-value is not effective for separating columns a and d. This is because the p-value is useful for assessing Tanimoto scores that are in the tail of high scores and does not work well on average scores. Additional results showing good agreement between predicted and empirical p-values, as well as additional technical details, are given in Appendix B (Figure 10 and Figure 12).

Figure 10. Plot of Fmax(t), the cumulative distribution of the maximum score, computed on a random sample of 100,000 molecules from the ChemDB in three different ways. The solid blue curve represents the approach of Equation 50. The dashed red line represents the Poisson approach of Equation 53. The green solid line shows the Weibull distribution approach of Equation 54. The left and right brackets on the curve indicate the acceptable boundary within which t1 and t2 ought to be selected (Equations 56 and 57).

Figure 12. Cumulative extreme value distribution Fmax(t) computed on a random sample of D = 100,000 molecules from the ChemDB, conditioned on different values of A, using 100 query molecules at each value of A. The solid blue curve represents the values obtained using Equation 50 applied with the empirical distribution F(t) of the scores. The dashed red line shows the corresponding Weibull distribution obtained using the polynomial fit for the parameters σ and ξ as a function of A (solid red line in Figure 11).

Outliers

The notion of outliers, like the notion of significance, is relative to a particular background distribution. For instance, we can apply the general framework to easily detect molecules that are outliers, or behave atypically, with respect to the rest of the molecules in a database such as ChemDB. This is illustrated in Figure 6 with an example focusing on molecules satisfying A = 220, showing the distribution of the scores for 100 such molecules. The red curve represents essentially the predicted distribution conditioned on A = 220. The green and blue curves identify two different molecules in this group, with very typical and very atypical behavior, as measured in terms of Kullback-Leibler divergence. The KL divergence is given by $D_{KL}(P\|M) = \int_0^1 P(t) \log \frac{P(t)}{M(t)}\, dt$ and can be used to measure the dissimilarity between any distribution P(t) of scores and the expected distribution M(t). The typical molecule has a KL divergence of 0.003 (green) while the atypical molecule has a KL divergence of 1.075 (blue).

Figure 6. Empirical score distributions for 100 query molecules satisfying A = 220. Each black curve is associated with one of the molecules and is obtained by scoring the molecule against a random sample of 100,000 molecules from the ChemDB. The red curve corresponds to the mean of the 100 curves and is essentially identical to the predicted distribution of scores conditioned on A = 220. The green curve corresponds to a molecule in the group that is typical, and the blue curve to a molecule that is atypical. The difference between the distributions is measured here in terms of Kullback-Leibler (KL) divergence or relative entropy.
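A minimal sketch of the KL-based outlier measure used above, with the expected distribution M estimated from the pooled scores of molecules in the same size class; the histogram bin count and the smoothing constant are arbitrary illustrative choices.

```python
import numpy as np

def kl_divergence(query_scores, expected_scores, bins=50, eps=1e-10):
    """Discretized Kullback-Leibler divergence D(P||M) between the score
    distribution P of one query and the expected distribution M of its
    size class, both estimated by histograms on [0, 1]."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(query_scores, bins=edges)
    m, _ = np.histogram(expected_scores, bins=edges)
    p = (p + eps) / (p + eps).sum()   # smooth to avoid log(0) and 0/0
    m = (m + eps) / (m + eps).sum()
    return float(np.sum(p * np.log(p / m)))
```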

ROC Curves

Figure 8 compares empirical and predicted ROC curves for six diverse data sets of molecules taken from the literature: (1) 55 Estrogen Receptor ligands;38 (2) 17 Neuraminidase inhibitors;38 (3) 24 p38 MAP Kinase inhibitors;38 (4) 40 Gelatinase A and general MMP ligands;38 (5) 36 Androgen Receptor ligands;39 and (6) 28 steroids with Corticosteroid Binding Globulin (CBG) Receptor affinity,40 against a random background of 100,000 molecules selected from the ChemDB and used as negative examples. The distribution of positive scores is obtained empirically by deriving all the pairwise scores in each dataset. The distribution of negative scores is obtained in each case from all the pairwise scores between molecules in the corresponding positive set and the 100,000 molecules in the negative set. The distribution of negative scores is modeled here as a ratio of two correlated Normal random variables, or as a single Normal, Gamma, or Beta (after rescaling) distribution. In all cases, the predicted ROC curves approximate the empirical ROC curves well. Thus, in short, the framework presented here allows one to predict true and false positive and negative rates at all possible thresholds quite accurately, and to estimate retrieval measures such as Specificity, Sensitivity, Precision, and Recall at a given threshold, or ROC curves over the entire set of possible score thresholds.

Figure 8. ROC curves for six data sets of active molecules (from left to right and top to bottom): (1) 55 Estrogen Receptor ligands; (2) 17 Neuraminidase inhibitors; (3) 24 p38 MAP Kinase inhibitors; (4) 40 Gelatinase A and general MMP ligands; (5) 36 Androgen Receptor ligands; and (6) 28 steroids with Corticosteroid Binding Globulin (CBG) receptor affinity. Empirical ROC curves are in black. Various approximations of the negative molecule score distribution are used to obtain the theoretical curves, including a ratio of two correlated Normal random variables distribution (red), a single Normal distribution (blue), a single Gamma distribution (green), and a single Beta distribution (cyan), using a random sample of 100,000 molecules from the ChemDB.

Discussion and Conclusions

This paper develops the statistical theory for modeling, predicting, approximating, and understanding the distributions of chemical similarity scores and their extreme values. The framework allows one to answer the questions raised in the Introduction and yields simple guidelines for determining the significance of chemical similarity scores by computing Z-scores, E-values, and p-values. To demonstrate the advantages of Z-scores, E-values, and p-values, consider for example a Tanimoto score of 0.5. The significance of this score depends on many considerations, including the size of the database, the kinds of molecules represented in the database, and the molecular representations used. For instance, the significance of a 0.5 score varies when it is obtained in a database containing 1,024-bit modulo-OR-compressed path fingerprints versus one containing circular substructure fingerprints with lossless compression. This makes the 0.5 Tanimoto score very specific to the particular implementation. In contrast, the Z-score, E-value, and p-value corresponding to a Tanimoto score of 0.5 take into account the global distribution of the scores and are more intrinsic and comparable across different implementations and experiments.

The parameters describing the score distributions can be derived from various models of fingerprints, or they can be learned empirically. The detailed derivation in Appendix A demonstrates how the models can be conditioned on the size of the query molecule (A) and/or the database molecule (B), providing multiple sets of parameters specific to those sizes. Parameters learned from empirical data can also be conditioned on molecule size, by sampling correspondingly from molecule fingerprints containing A and/or B 1-bits. Conditioning the parameters on both A and B greatly increases the number of parameters. For instance, in a typical implementation using 1,024-bit modulo-OR-compressed path fingerprints, the values of A and B could span the 1–500 range, requiring a number of parameters on the order of 500². A large number of parameters may increase the look-up time, thus adding complexity to the search. The results presented here show that conditioning on the query molecule size A alone offers a good tradeoff because of the manageable number of parameters (~500) and the considerably improved retrieval results compared to no conditioning at all. Furthermore, using probabilistic models such as the Conditional Normal Uniform model, the number of parameters can be further reduced to a very small number (~2), although it may still be desirable to precompute and store the parameters of the score distributions at each possible value of A. Likewise, in order to condition the parameters of the Weibull extreme value distribution on A, Appendix B demonstrates a simple approach in which the parameters are computed economically as simple polynomial functions of A. For several applications, it is also possible to condition the distribution of scores or extreme values on groups of related molecules, for instance molecules known to have the same biological function or to bind to the same receptor. In this case, theoretical distributions, such as the distribution of the ratio of two correlated Normal random variables or the Weibull distribution for extreme values, must be fitted to empirical data.7

This work has been inspired in part by analogies with the field of bioinformatics and the problem of searching large databases of nucleotide or amino acid sequences using standard tools such as BLAST. However, in considering future applications of the theory to chemoinformatics, it is important also to take into consideration some of the differences between the two fields, including differences in culture with respect to data sharing, openness, and standardization. In addition, BLAST was originally created to detect homology due to evolution. While natural evolution is different from the process that has led to the small molecules found in chemoinformatics databases today, we do not believe that this alone results in a fundamental difference, especially in light of the fact that increasing numbers of synthetic biological sequences are being bioengineered. Perhaps more significant is the fact that simple die-toss models are in general better at modeling biological sequences than simple coin flips are at modeling small-molecule fingerprints. This is due to the sequential nature of biological sequences and the non-sequential nature of molecular fingerprints: a small die-toss perturbation of a biological sequence results in another valid sequence, whereas a coin-flip perturbation of a fingerprint does not correspond, in general, to a valid fingerprint. However, even such a difference appears to be more quantitative than qualitative, and implies only that different probabilistic models may be needed in different domains.

As far as other domains are concerned, it is worth noting that the general methods presented here are not limited to chemoinformatics, but could be applied to other areas of information retrieval, particularly text retrieval, which is formally very similar to chemical retrieval. Text retrieval methods often represent documents by binary fingerprints similar to molecular fingerprints, using the well-known “bag of words” approach: each document is viewed as a bag of words, and the components of the corresponding fingerprint indicate the presence or absence of each word of the vocabulary in the document. In addition, similarity between documents is often computed from the corresponding fingerprints using precisely the same Jaccard-Tanimoto similarity measure.
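
As a small illustration of this correspondence (a toy example, not a production text-retrieval system), the following computes the Jaccard-Tanimoto similarity of two documents from their binary bag-of-words fingerprints:

```python
def tanimoto(doc_a: str, doc_b: str) -> float:
    """Jaccard-Tanimoto similarity between two documents represented as
    binary bag-of-words fingerprints (presence/absence of each word)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# {the, sat, on} is the intersection (3 words) out of 7 distinct words in total.
print(tanimoto("the cat sat on the mat", "the dog sat on the log"))  # 3/7
```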

In the short term, however, it is likely that users of chemoinformatics search engines, like users of Google and other text search engines, will continue to inspect the top hits returned by a search manually, relying on the raw Jaccard-Tanimoto scores or the corresponding Z-scores, E-values, and p-values to assist them with their inspections. It is in high-throughput data-mining applications, with large numbers of queries applied to one or multiple databases possibly orders of magnitude larger than the ones available today, when manual inspection becomes impossible, that the framework developed here may find its most fruitful applications.

Acknowledgments

Work supported by NIH Biomedical Informatics Training grant LM-07443-01 and NSF grants EIA-0321390 and IIS-0513376 to PB. We would also like to acknowledge the OpenBabel project and OpenEye Scientific Software for their free academic software licenses, as well as additional support from the Camille and Henry Dreyfus Foundation, and to thank Dr. R. Benz for useful discussions.

Appendix A: Probabilistic Models of Fingerprints

In this Appendix, we show how the probabilistic models can be treated analytically and how the means, standard deviations, and covariances of the intersections and unions can be computed.

A1. Single-Parameter Bernoulli Model

For this model, we assume that the fingerprints B⃗ in the database are generated by N coin flips with a constant probability p of producing a 1-bit. The distribution of the number of 1-bits in the database fingerprints is then given by a Binomial distribution ℬ(N, p), which can be approximated by a Normal distribution 𝒩(Np, Npq) (with q = 1 − p) for N large. Likewise, we can assume that the query fingerprints are produced by coin flips with a constant probability r of producing a 1-bit; the case where r ≠ p can be treated at no extra cost. Consider a query fingerprint A⃗ with A 1-bits, where A has Binomial distribution ℬ(N, r), which can be approximated by 𝒩(Nr, Nrs) (with s = 1 − r) for N large. Let I⃗ = (Ii) and U⃗ = (Ui) denote the intersection and union fingerprints. Then the intersection size I = A ∩ B = Σi Ii = Σi (Ai ∧ Bi) is a random variable with Binomial distribution ℬ(N, pr), which can be approximated by a Normal distribution 𝒩(Npr, Npr(1 − pr)) for N large, or by a Poisson distribution 𝒫(Npr) when N is large and pr is very small. Similarly, the union size U = A ∪ B = Σi Ui = Σi (Ai ∨ Bi) is a random variable with Binomial distribution ℬ(N, 1 − qs) = ℬ(N, p + r − pr), which can be approximated by a Normal distribution 𝒩(N(1 − qs), N(1 − qs)qs) for N large, or by a Poisson distribution 𝒫(N(p + r − pr)) when N is large and p + r − pr is small.

Under the Binomial model, we can obtain an exact expression for the distribution of the Tanimoto scores. The Tanimoto score T = I/U can only take rational values t between 0 and 1. Writing t = n/m as an irreducible fraction, with 0 ≤ n ≤ m, the probability P(T = t) is given exactly by

$$P(T=t) = P\left(\frac{I}{U}=\frac{n}{m}\right) = \sum_{k=1}^{K} P(I=kn,\ U=km) = \sum_{k=1}^{K} \binom{N}{kn}(pr)^{kn}\binom{N-kn}{km-kn}(ps+qr)^{km-kn}(qs)^{N-km} \qquad (18)$$

where K is the largest integer such that Km ≤ N, i.e., K = ⌊N/m⌋. Clearly, if t is not rational, this probability is 0. From this distribution we can derive all the properties of the score distribution, including its mean and variance, under the assumptions of the Binomial model.
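
The sum in Equation 18 is straightforward to evaluate numerically. The sketch below (the fingerprint length N and the probabilities p and r are illustrative values only) computes P(T = n/m) directly from the formula:

```python
from math import floor
from scipy.special import comb

def p_tanimoto_exact(n, m, N, p, r):
    """P(T = n/m) under the Single-Parameter Bernoulli model (Equation 18);
    n/m must be an irreducible fraction, with q = 1 - p and s = 1 - r."""
    q, s = 1.0 - p, 1.0 - r
    total = 0.0
    for k in range(1, floor(N / m) + 1):
        total += (comb(N, k * n) * (p * r) ** (k * n)
                  * comb(N - k * n, k * (m - n)) * (p * s + q * r) ** (k * (m - n))
                  * (q * s) ** (N - k * m))
    return total

# Illustrative values only: probability of a score of exactly 1/2 with N = 64.
print(p_tanimoto_exact(1, 2, N=64, p=0.2, r=0.2))
```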

To further simplify matters, by approximating the numerator I and the denominator U by Normal distributions as described above, we can view the Tanimoto score as the ratio of two correlated Normal random variables. We thus need to compute the covariance between I and U. Noticing that the components Ii and Uj are independent for i ≠ j, we have

$$\mathrm{Cov}(I,U) = \sum_i \mathrm{Cov}(I_i, U_i) = N\,\mathrm{Cov}(I_i, U_i) \qquad (19)$$

A direct calculation gives

$$\mathrm{Cov}(I_i,U_i) = E(I_iU_i) - E(I_i)E(U_i) = pr - pr(p+r-pr) = pr(1-p-r+pr) \qquad (20)$$

so that

$$\mathrm{Cov}(I,U) = Npr(1-p-r+pr) \qquad (21)$$

Thus we can approximate the distribution of the Tanimoto scores under the simple Bernoulli model by studying the ratio of two correlated Normal random variables approximating the numerator I and denominator U, with means, variances, and covariances as described above.
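
This approximation is easy to verify by simulation. The following sketch (the trial count and random seed are arbitrary; p is set to the ChemDB-like value used below) draws fingerprint pairs under the Single-Parameter Bernoulli model and compares the empirical moments of I and U with the theoretical values given above and in Equations 19–21:

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
N, p, r, trials = 1024, 205 / 1024, 205 / 1024, 10_000

# Draw query/database fingerprint pairs under the Bernoulli model.
A = rng.random((trials, N)) < r
B = rng.random((trials, N)) < p
I = (A & B).sum(axis=1)                 # intersection sizes
U = (A | B).sum(axis=1)                 # union sizes

q, s = 1 - p, 1 - r
print("E(I):  ", I.mean(), "theory:", N * p * r)
print("Var(I):", I.var(), "theory:", N * p * r * (1 - p * r))
print("E(U):  ", U.mean(), "theory:", N * (p + r - p * r))
print("Var(U):", U.var(), "theory:", N * (1 - q * s) * q * s)
print("Cov:   ", np.cov(I, U)[0, 1], "theory:", N * p * r * (1 - p - r + p * r))
```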

In Figure 1, the distributions of the number of 1-bits in actual fingerprints contained in the ChemDB and in fingerprints generated by the Single-Parameter Bernoulli model, with p chosen to fit the average, are compared. Though both distributions have the same mean by construction, the variance of the ChemDB distribution is significantly larger. Both distributions are also well approximated by Normal distributions: 𝒩(μ = 218, σ² = 9552) for the empirical distribution, and 𝒩(μ = 218, σ² = 172) for the Binomial model. The additional width parameter of the Normal model helps capture the diverse sizes of the molecules represented in the empirical fingerprints and can be used effectively in a Conditional Normal Uniform model.

Figure 9 shows the distributions of the intersections and unions of fingerprints generated using a Single-Parameter Bernoulli model with p fit to approximate the mean of the fingerprints in ChemDB (p ≈ 205/1,024) for both the queries and the database. Normal approximations to these distributions are superimposed on the two histograms. Figure 9 also shows the empirical distribution of the scores together with the corresponding ratio of correlated Normal random variables approximation.

Figure 9.

Results obtained using 100 query fingerprints to search 100,000 fingerprints. All fingerprints have length N = 1,024 and are generated using a Single-Parameter Bernoulli model with p = 205/1,024 to fit the average values in the actual ChemDB fingerprints. Left: histograms for the size of the intersections (blue) and the unions (green), together with their Normal approximations (solid black lines). Right: histogram for the corresponding Tanimoto scores (red), together with the corresponding ratio of correlated Normal random variables approximation (solid black line).

A2. Multiple-Parameter Bernoulli Model

The analysis above for the Single-Parameter Bernoulli model is easily extended to the Multiple-Parameter Bernoulli model by using similar expressions for the means, variances, and covariances of the individual variables Ii and Ui, and combining them using the linearity of expectation and the independence of components associated with different indices. In this case, we let p1, p2, …, pN be the vector of probabilities for the database and r1, r2, …, rN the vector of probabilities for the queries, with qi = 1 − pi and si = 1 − ri. The mean and variance of I are given by Σi piri and Σi piri(1 − piri), respectively; thus I can be approximated by a Normal distribution 𝒩(Σi piri, Σi piri(1 − piri)). Likewise, the mean and variance of U are given by Σi(1 − qisi) and Σi(1 − qisi)qisi, respectively; thus U can be approximated by a Normal distribution 𝒩(Σi(1 − qisi), Σi(1 − qisi)qisi). Finally, for the individual covariance terms we have Cov(Ii, Ui) = piri(1 − pi − ri + piri) and Cov(Ii, Uj) = 0 for i ≠ j, so that the full covariance is given by the sum Cov(I, U) = Σi piri(1 − pi − ri + piri). Thus one can again proceed with the ratio of two correlated Normal random variables approximation, as for the Single-Parameter Bernoulli model.
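
For concreteness, the sketch below packages these sums into a small function; the per-bit frequencies used in the example are hypothetical (empirically, they would be the per-position bit counts divided by the database size):

```python
import numpy as np

def multi_bernoulli_moments(p, r):
    """Normal-approximation parameters for I and U under the
    Multiple-Parameter Bernoulli model, with per-bit probabilities p_i, r_i."""
    p, r = np.asarray(p), np.asarray(r)
    q, s = 1 - p, 1 - r
    mean_I, var_I = (p * r).sum(), (p * r * (1 - p * r)).sum()
    mean_U, var_U = (1 - q * s).sum(), ((1 - q * s) * q * s).sum()
    cov_IU = (p * r * (1 - p - r + p * r)).sum()
    return mean_I, var_I, mean_U, var_U, cov_IU

# Hypothetical per-bit frequencies for a 1,024-bit fingerprint.
p_i = np.random.default_rng(1).uniform(0.05, 0.4, size=1024)
print(multi_bernoulli_moments(p_i, p_i))
```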

In spite of its many parameters, the Multiple-Parameter Bernoulli model suffers from some of the same weaknesses as the Single-Parameter Bernoulli model when compared to empirical fingerprints (results not shown). In particular, when using the empirical probabilities pi, the model can easily fit the mean of A but tends to underestimate the variance of A, where A is the number of 1-bits in the empirical fingerprints.

A3. Conditional Distribution Uniform Model

With the Conditional Distribution Uniform model, we can first fit a distribution, typically a Normal, to the size A of the fingerprints in the database or in the set of queries. This in general provides a better fit than what can be obtained using the Bernoulli models. Second, this model allows exact conditioning on the size A of the queries or of the molecules being searched; in the Single-Parameter Bernoulli approach, conditioning on A is not implemented exactly but only approximated by setting p = A/N. Finally, once A is fixed, the uniform portion of the model ensures exchangeability, though not independence, of the components.

Conditioning on A and B

In the Conditional Distribution Uniform model, it is easy to see that for fixed A and B the intersection I = A ∩ B has a hypergeometric distribution with probabilities given by

$$P(I=k \mid A,B) = \frac{\binom{A}{k}\binom{N-A}{B-k}}{\binom{N}{B}} = \frac{\binom{B}{k}\binom{N-B}{A-k}}{\binom{N}{A}} \qquad (22)$$

for A + B − N ≤ k ≤ min(A, B), and 0 otherwise.

To study the Tanimoto scores directly, we have the conditional density

$$P\left(T=\frac{I}{U}=t \,\Big|\, A,B\right) = P\left(\frac{I}{A+B-I}=t \,\Big|\, A,B\right) = P\left(I=\frac{t(A+B)}{1+t} \,\Big|\, A,B\right) \qquad (23)$$

and conditional cumulative distribution

$$P(T \le t \mid A,B) = P\left(\frac{I}{A+B-I} \le t \,\Big|\, A,B\right) = P\left(I \le \frac{t(A+B)}{1+t} \,\Big|\, A,B\right) \qquad (24)$$

Therefore, the probability distribution for the similarity T can be derived from the hypergeometric distribution of I, given A, B and N. In particular, we have the conditional distribution

$$P(T=t \mid A) = \sum_{B=0}^{N} P(t \mid A,B)\,P(B) \qquad (25)$$

where the sum is over the distribution P(B). This approach is thus consistent with the Conditional Distribution model, which depends on the model chosen for P(B). To model this distribution, we can use the Binomial model $P(B) = \binom{N}{B} p^B (1-p)^{N-B}$. But it is often preferable, as previously discussed, to use a more flexible Normal model with

$$P(B) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-(B-\mu)^2/2\sigma^2\right] \qquad (26)$$

where the mean and standard deviation are fitted to the empirical values. The unconditional distribution of Tanimoto scores is given by a second integration over the distribution P(A) of queries

$$P(T=t) = \sum_{A=0}^{N} P(T=t \mid A)\,P(A) = \sum_{A=0}^{N}\sum_{B=0}^{N} P(t \mid A,B)\,P(B)\,P(A) \qquad (27)$$

For the Conditional Distribution Uniform model, we can derive the means, variances, and covariances of the intersections and unions at all levels of conditioning. First, conditioned on A and B, the mean of this hypergeometric distribution is E(I|A,B) = AB/N, and the variance is given by Var(I|A,B) = A(B/N)(1 − B/N)(N − A)/(N − 1). The union can be studied from the intersection by writing U = A + B − I, so that P(U = k|A,B) = P(I = A + B − k|A,B). Thus, conditioned on A and B, the expectation of U is given by E(U) = A + B − E(I), using the linearity of expectation. Likewise, Var(U|A,B) = Var(I|A,B), i.e., the variance of U is equal to the variance of I. In the same way, we can also compute the covariance by writing Cov(I,U) = E(IU) − E(I)E(U); substituting U = A + B − I yields

$$\mathrm{Cov}(I,U) = E(I(A+B-I)) - E(I)E(U) = (A+B)E(I) - E(I^2) - E(I)(A+B-E(I)) = -\mathrm{Var}(I) \qquad (28)$$
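
The negative sign reflects the fact that, with A and B fixed, a larger intersection necessarily means a smaller union. Since I given A and B is a standard hypergeometric variable, the full conditional distribution of T (Equations 22–23) can be computed directly; the sketch below uses scipy.stats.hypergeom for illustration, with arbitrary values of A, B, and N:

```python
import numpy as np
from scipy.stats import hypergeom

def tanimoto_pmf_given_AB(A, B, N):
    """Distribution of T = I/(A + B - I) for fixed sizes A and B
    (Equations 22-23); I is hypergeometric with parameters (N, A, B)."""
    ks = np.arange(max(0, A + B - N), min(A, B) + 1)
    probs = hypergeom.pmf(ks, N, A, B)   # P(I = k | A, B)
    ts = ks / (A + B - ks)               # Tanimoto score for each k
    return ts, probs

ts, probs = tanimoto_pmf_given_AB(A=200, B=220, N=1024)
print("E(T | A, B) =", (ts * probs).sum())
```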

Conditioning on A only (or B only)

We can now condition on A only (or on B only, mutatis mutandis) by integrating over B. For this we assume that B has a distribution g(B) with mean μB and variance σB². This distribution could be Normal but does not have to be: only its mean and variance are used in the following calculations, and similarly for the distribution of A. Integrating over B, the means of the intersection and the union are given by

$$E(I \mid A) = \int_B \frac{AB}{N}\, g(B)\, dB = \frac{A\mu_B}{N} \qquad (29)$$

and

$$E(U \mid A) = \int_B \left(A + B - \frac{AB}{N}\right) g(B)\, dB = A + \mu_B - \frac{A\mu_B}{N} \qquad (30)$$

To compute the corresponding variances, we write

$$\mathrm{Var}(I \mid A) = \int_B \left(I - \frac{A\mu_B}{N}\right)^2 g(B)\, dB = \int_B \sum_{\vec{B}:\,|\vec{B}|=B} \left(\left(I - \frac{AB}{N}\right) + \left(\frac{AB}{N} - \frac{A\mu_B}{N}\right)\right)^2 g(B)\, dB \qquad (31)$$

where B⃗ denotes the fingerprints in the database, and B = |B⃗| is the bit-counting function. By expanding the square and integrating first over the molecules satisfying B = |B⃗|, and then over B, we get

$$\mathrm{Var}(I \mid A) = \int_B \frac{AB(N-A)(N-B)}{(N-1)N^2}\, g(B)\, dB + \int_B \left(\frac{AB}{N} - \frac{A\mu_B}{N}\right)^2 g(B)\, dB \qquad (32)$$

These integrals can easily be calculated and yield

$$\mathrm{Var}(I \mid A) = \frac{A(N-A)}{(N-1)N^2}\left(N\mu_B - \sigma_B^2 - \mu_B^2\right) + \frac{A^2}{N^2}\,\sigma_B^2 \qquad (33)$$

This is an example of the law of total variance or variance decomposition formula

$$\mathrm{Var}(X \mid Y) = E\left(\mathrm{Var}(X \mid Y, Z)\right) + \mathrm{Var}\left(E(X \mid Y, Z)\right) \qquad (34)$$

which will be used again in the following calculations without further mention. The same decomposition for the union yields

$$\mathrm{Var}(U \mid A) = \frac{A(N-A)}{(N-1)N^2}\left(N\mu_B - \sigma_B^2 - \mu_B^2\right) + \int_B \left(\left(A + B - \frac{AB}{N}\right) - \left(A + \mu_B - \frac{A\mu_B}{N}\right)\right)^2 g(B)\, dB \qquad (35)$$

and finally

$$\mathrm{Var}(U \mid A) = \frac{A(N-A)}{(N-1)N^2}\left(N\mu_B - \sigma_B^2 - \mu_B^2\right) + \left(1 - \frac{A}{N}\right)^2 \sigma_B^2 \qquad (36)$$

Similarly, to calculate the covariance, we have

$$\mathrm{Cov}(I, U \mid A) = \int_B \sum_{|\vec{B}|=B} \left(I - \frac{A\mu_B}{N}\right)\left(U - A - \mu_B + \frac{A\mu_B}{N}\right) g(B)\, dB = \int_B \sum_{|\vec{B}|=B} \left(\left(I - \frac{AB}{N}\right) + \left(\frac{AB}{N} - \frac{A\mu_B}{N}\right)\right)\left(\left(U - \left(A + B - \frac{AB}{N}\right)\right) + \left(\left(A + B - \frac{AB}{N}\right) - A - \mu_B + \frac{A\mu_B}{N}\right)\right) g(B)\, dB \qquad (37)$$

which gives

$$\mathrm{Cov}(I, U \mid A) = \int_B \left[-\frac{AB(N-A)(N-B)}{(N-1)N^2} + \frac{A}{N}\,\frac{N-A}{N}\,(B - \mu_B)^2\right] g(B)\, dB \qquad (38)$$

After integration over B, we finally get

$$\mathrm{Cov}(I, U \mid A) = -\frac{A(N-A)}{(N-1)N^2}\left(N\mu_B - \sigma_B^2 - \mu_B^2\right) + \frac{A}{N}\left(1 - \frac{A}{N}\right)\sigma_B^2 \qquad (39)$$
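
Equations 29, 30, 33, 36, and 39 can be packaged into a few lines of code. In the sketch below, the negative sign in the first covariance term follows from Cov(I, U | A, B) = −Var(I | A, B) (Equation 28); the ChemDB-like values of μB and σB² are placeholders used purely for illustration:

```python
def moments_given_A(A, N, mu_B, var_B):
    """Conditional means, variances, and covariance of I and U given the
    query size A (Equations 29, 30, 33, 36, and 39)."""
    c = A * (N - A) / ((N - 1) * N**2) * (N * mu_B - var_B - mu_B**2)
    E_I = A * mu_B / N                                # Equation 29
    E_U = A + mu_B - A * mu_B / N                     # Equation 30
    var_I = c + (A / N) ** 2 * var_B                  # Equation 33
    var_U = c + (1 - A / N) ** 2 * var_B              # Equation 36
    cov_IU = -c + (A / N) * (1 - A / N) * var_B       # Equation 39
    return E_I, E_U, var_I, var_U, cov_IU

# Placeholder ChemDB-like values: N = 1024 bits, mean 218, variance 9552.
print(moments_given_A(A=220, N=1024, mu_B=218.0, var_B=9552.0))
```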

No Conditioning

If we now integrate with respect to the query molecules A, the means are given by

$$E(I) = \int_A \frac{A\mu_B}{N}\, g(A)\, dA = \frac{\mu_A\mu_B}{N} \qquad (40)$$

and

$$E(U) = \int_A \left(A + \mu_B - \frac{A\mu_B}{N}\right) g(A)\, dA = \mu_A + \mu_B - \frac{\mu_A\mu_B}{N} \qquad (41)$$

To compute the variances, we apply again the law of total variance to obtain:

$$\mathrm{Var}(I) = \int_A \left[\frac{A(N-A)}{(N-1)N^2}\left(N\mu_B - \sigma_B^2 - \mu_B^2\right) + \frac{A^2}{N^2}\sigma_B^2\right] g(A)\, dA + \int_A \left(\frac{A\mu_B}{N} - \frac{\mu_A\mu_B}{N}\right)^2 g(A)\, dA \qquad (42)$$

which yields

$$\mathrm{Var}(I) = \frac{N\mu_B - \sigma_B^2 - \mu_B^2}{(N-1)N^2}\left[N\mu_A - \sigma_A^2 - \mu_A^2\right] + \sigma_B^2\left[\frac{\sigma_A^2 + \mu_A^2}{N^2}\right] + \sigma_A^2\left(\frac{\mu_B}{N}\right)^2 \qquad (43)$$

Likewise, for the union

$$\mathrm{Var}(U) = \int_A \left[\frac{A(N-A)}{(N-1)N^2}\left(N\mu_B - \sigma_B^2 - \mu_B^2\right) + \left(1 - \frac{A}{N}\right)^2\sigma_B^2\right] g(A)\, dA + \int_A \left(\left(A + \mu_B - \frac{A\mu_B}{N}\right) - \left(\mu_A + \mu_B - \frac{\mu_A\mu_B}{N}\right)\right)^2 g(A)\, dA \qquad (44)$$

which yields

$$\mathrm{Var}(U) = \frac{N\mu_B - \sigma_B^2 - \mu_B^2}{(N-1)N^2}\left[N\mu_A - \sigma_A^2 - \mu_A^2\right] + \left[1 - \frac{2\mu_A}{N} + \frac{\sigma_A^2 + \mu_A^2}{N^2}\right]\sigma_B^2 + \left(1 - \frac{\mu_B}{N}\right)^2\sigma_A^2 \qquad (45)$$

And finally for the covariance we have

$$\mathrm{Cov}(I,U) = \int_A \left(I - \frac{\mu_A\mu_B}{N}\right)\left(U - \mu_A - \mu_B + \frac{\mu_A\mu_B}{N}\right) g(A)\, dA \qquad (46)$$

which can again be expanded as

$$\mathrm{Cov}(I,U) = \int_A \sum_{|\vec{A}|=A} \left(\left(I - \frac{A\mu_B}{N}\right) + \left(\frac{A\mu_B}{N} - \frac{\mu_A\mu_B}{N}\right)\right)\left(\left(U - \left(A + \mu_B - \frac{A\mu_B}{N}\right)\right) + \left(\left(A + \mu_B - \frac{A\mu_B}{N}\right) - \mu_A - \mu_B + \frac{\mu_A\mu_B}{N}\right)\right) g(A)\, dA \qquad (47)$$

from which

$$\mathrm{Cov}(I,U) = \int_A \left(-\frac{A(N-A)}{(N-1)N^2}\left(N\mu_B - \sigma_B^2 - \mu_B^2\right) + \frac{A}{N}\left(1 - \frac{A}{N}\right)\sigma_B^2\right) g(A)\, dA + \int_A \left(\frac{A\mu_B}{N} - \frac{\mu_A\mu_B}{N}\right)\left(\left(A + \mu_B - \frac{A\mu_B}{N}\right) - \mu_A - \mu_B + \frac{\mu_A\mu_B}{N}\right) g(A)\, dA \qquad (48)$$

which yields

$$\mathrm{Cov}(I,U) = -\frac{N\mu_B - \sigma_B^2 - \mu_B^2}{(N-1)N^2}\left(N\mu_A - \sigma_A^2 - \mu_A^2\right) + \left(\frac{\mu_A}{N} - \frac{\mu_A^2 + \sigma_A^2}{N^2}\right)\sigma_B^2 + \frac{\mu_B}{N}\left(1 - \frac{\mu_B}{N}\right)\sigma_A^2 \qquad (49)$$

When the queries come from the database itself with the same distribution, we can simplify the last set of formulae for the mean, variance, and covariance using μA = μB and σA = σB.

In short, we have derived general analytical formulas for the means, variances, and covariances of the intersections and unions of fingerprints under the Conditional Distribution Uniform model, with various degrees of conditioning. These formulas can be used directly to derive the corresponding ratio of correlated Normal random variables approximations to the corresponding Tanimoto score distributions.
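
As a worked illustration of the unconditional formulas (Equations 40, 41, 43, 45, and 49), the sketch below computes the full set of moments. The numerical values of N, μ, and σ² are placeholders patterned after the empirical ChemDB figures quoted above, and the example takes μA = μB and σA = σB, as in the case where the queries come from the database itself:

```python
def unconditional_moments(N, mu_A, var_A, mu_B, var_B):
    """Unconditional means, variances, and covariance of I and U
    (Equations 40, 41, 43, 45, and 49)."""
    c = ((N * mu_B - var_B - mu_B**2) / ((N - 1) * N**2)
         * (N * mu_A - var_A - mu_A**2))
    E_I = mu_A * mu_B / N
    E_U = mu_A + mu_B - mu_A * mu_B / N
    var_I = c + var_B * (var_A + mu_A**2) / N**2 + var_A * (mu_B / N) ** 2
    var_U = (c + (1 - 2 * mu_A / N + (var_A + mu_A**2) / N**2) * var_B
             + (1 - mu_B / N) ** 2 * var_A)
    cov_IU = (-c + (mu_A / N - (mu_A**2 + var_A) / N**2) * var_B
              + (mu_B / N) * (1 - mu_B / N) * var_A)
    return E_I, E_U, var_I, var_U, cov_IU

# Queries drawn from the database itself: mu_A = mu_B and sigma_A = sigma_B.
print(unconditional_moments(N=1024, mu_A=218.0, var_A=9552.0,
                            mu_B=218.0, var_B=9552.0))
```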

Appendix B: Extreme Value Distribution

This Appendix describes a complementary approach to the Extreme Value Distribution (EVD) of the similarity scores using a Poisson distribution. It also describes how the parameters of the Weibull approximation to the EVD (Equation 16) can be fit to the data.

B1. Extreme Value Distribution Using the Poisson Distribution

As described in the main text, the cumulative distribution Fmax(t) of the maximum score in a database of size D is given by

$$F_{\max}(t) = F(t)^D \qquad (50)$$

F(t) is the cumulative distribution of the similarity scores, which can be obtained empirically through Monte Carlo experiments, or analytically using, for instance, the ratio of correlated Normal random variables approach. For a large enough Tanimoto threshold t, it is reasonable to assume that obtaining a score above t is a rare event, which thus ought to follow a Poisson distribution. In other words, the probability of obtaining k similarity scores above t with a database of size D should be approximately given by the Poisson distribution

$$P_{t,D}(k) = \frac{\lambda_{t,D}^{k}\, e^{-\lambda_{t,D}}}{k!} \qquad (51)$$

with parameter λt,D dependent on the threshold t and the size D of the database being searched. Note that λ could also depend on the size A of the query. λ is also the expectation of the corresponding Poisson distribution; thus λt,D is the expected number of scores above threshold t, which is also called the E-value, and can be computed by Equation 9

$$\lambda_{t,D} = [1 - F(t)]\, D \qquad (52)$$

The cumulative distribution Fmax(t) corresponds to the probability of having no scores above the threshold t and therefore is given by Pt,D(0):

$$F_{\max}(t) = P_{t,D}(0) = e^{-[1-F(t)]\times D} \qquad (53)$$

Equations 50 and 53 give indistinguishable results for Fmax (Figure 10). This is expected, since F(t)^D = exp[D log F(t)] and the Taylor expansion of log F(t) around F(t) = 1 gives log F(t) ≈ −[1 − F(t)].
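
The near-identity of the two expressions is easy to check numerically. The sketch below compares F(t)^D with exp(−[1 − F(t)]D) and also reports the E-value of Equation 52; a hypothetical Normal CDF stands in for F(t), purely for illustration:

```python
import numpy as np
from scipy.stats import norm

D = 100_000
t = np.linspace(0.45, 0.70, 6)
F = norm.cdf(t, loc=0.18, scale=0.07)     # placeholder background CDF

exact = F ** D                            # Equation 50
poisson = np.exp(-(1 - F) * D)            # Equation 53
evalue = (1 - F) * D                      # Equation 52

for row in zip(t, exact, poisson, evalue):
    print("t=%.2f  F^D=%.4g  exp(-(1-F)D)=%.4g  E-value=%.4g" % row)
```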

B2. Fitting the Weibull Distribution Function

As previously described, the distribution of the maximum score can be represented by a type-III extreme value distribution, or Weibull distribution function, of the form

$$F_{\max}(t) = \exp\left[-\left(\frac{\mu - t}{\sigma}\right)^{\xi}\right] \qquad (54)$$

where μ is the location parameter, σ is the scale parameter, and ξ is the shape parameter. μ is set to one because the Tanimoto score t is bounded by one. The parameters σ and ξ depend on the underlying cumulative distribution of scores F(t) and on the size D of the database. Substituting Fmax(t) from Equation 50 into Equation 54 and solving for the parameters, we get the following equation, defined for F(t) > 0 and t ≠ 1:

$$\xi \log(1-t) - \xi \log(\sigma) - \log\left[-D\log(F(t))\right] = 0 \qquad (55)$$

To solve for σ and ξ, we substitute two values of t (t1 ≠ t2) into the equation to obtain the following solutions:

$$\xi = \frac{\log\left[-D\log(F(t_1))\right] - \log\left[-D\log(F(t_2))\right]}{\log(1-t_1) - \log(1-t_2)} \qquad (56)$$
$$\sigma = \exp\left[\frac{\xi\log(1-t_1) - \log\left[-D\log(F(t_1))\right]}{\xi}\right] = \exp\left[\frac{\xi\log(1-t_2) - \log\left[-D\log(F(t_2))\right]}{\xi}\right] \qquad (57)$$

Equations 56 and 57 show very explicitly how the parameters can be calculated for a specific database size (D) and Tanimoto cumulative distribution (F(t)). F(t) can be computed either empirically, by sampling Tanimoto scores from random fingerprints in the database, or theoretically, by using the derivations presented in this work. In principle, one could use arbitrary values of t1 and t2 in Equations 56 and 57. However, it is clear that values too close to t = 0 or t = 1 do not work, and one must be careful to select values of t1 and t2 in the region where Fmax(t) = F(t)^D is not flat (Figure 10). In practice, values of t1 and t2 such that F(t1)^D and F(t2)^D fall in the interval [0.01, 0.99] give consistent results and allow for a good fit by the Weibull distribution. Figure 10 shows the fit of Fmax by the Weibull distribution and the corresponding [0.01, 0.99] interval. When fitting the Weibull parameters, one can also condition F(t) in Equations 56 and 57 on A, which yields a family of parameters ξ(A) and σ(A), one pair for each value of A. These values can be tabulated, or one can fit them using a simple regression model. Figure 11 shows how the parameters can be fit with simple polynomial curves, and Figure 12 shows Fmax as well as the Weibull distribution for different values of A; for each A, the parameters are calculated using the polynomial functions shown in Figure 11.
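
The closed-form solutions of Equations 56 and 57 translate directly into code. In the following sketch, the background CDF and the thresholds t1 and t2 are illustrative placeholders, chosen so that F(t)^D falls in the recommended [0.01, 0.99] interval:

```python
import numpy as np
from scipy.stats import norm

def fit_weibull(F, D, t1, t2):
    """Solve Equations 56-57 for the Weibull shape xi and scale sigma, given
    a score CDF F (callable), database size D, and thresholds t1 != t2."""
    g1 = np.log(-D * np.log(F(t1)))
    g2 = np.log(-D * np.log(F(t2)))
    xi = (g1 - g2) / (np.log(1 - t1) - np.log(1 - t2))
    sigma = np.exp((xi * np.log(1 - t1) - g1) / xi)
    return xi, sigma

# Placeholder background CDF; t1, t2 chosen so that F(t)^D lies in [0.01, 0.99].
F = lambda t: norm.cdf(t, loc=0.18, scale=0.07)
xi, sigma = fit_weibull(F, D=100_000, t1=0.46, t2=0.54)
print(xi, sigma)   # F_max(t) is then approximated by exp(-((1 - t)/sigma)**xi)
```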

Figure 11.

Polynomial fitting of the parameters σ (left) and ξ (right) of the Weibull distribution (Equation 54), using a first- and a third-degree polynomial, respectively (in red), as a function of the size A of the query. The empirical values (black) are obtained using a random sample of D = 100,000 molecules from the ChemDB. The range of A used for fitting is [70, 520]. The polynomials are σ = −0.00044423A + 0.26429116 and ξ = 0.00000009A³ − 0.00007387A² + 0.01643368A + 2.12103400.

References

1. Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics. 2005;21:4133–4139. doi: 10.1093/bioinformatics/bti683.
2. Irwin JJ, Shoichet BK. ZINC–A Free Database of Commercially Available Compounds for Virtual Screening. J Chem Inf Comput Sci. 2005;45:177–182. doi: 10.1021/ci049714.
3. Chen J, Linstead E, Swamidass SJ, Wang D, Baldi P. ChemDB Update–Full Text Search and Virtual Chemical Space. Bioinformatics. 2007;23:2348–2351. doi: 10.1093/bioinformatics/btm341.
4. Wheeler D, Barrett T, Benson D, Bryant S, Canese K, Chetvernin V, Church D, DiCuccio M, Edgar R, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2006;35:D5–D12. doi: 10.1093/nar/gkl1031.
5. Wang Y, Xiao J, Suzek T, Zhang J, Wang J, Bryant S. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009;37:W623–W633. doi: 10.1093/nar/gkp456.
6. Altschul S, Madden T, Shaffer A, Zhang J, Zhang Z, Miller W, Lipman D. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
7. Keiser M, Roth B, Armbruster B, Ernsberger P, Irwin J, Shoichet B. Relating protein pharmacology by ligand chemistry. Nat Biotechnol. 2007;25(2):197–206. doi: 10.1038/nbt1284.
8. Baldi P, Benz RW. BLASTing Small Molecules–Statistics and Extreme Statistics of Chemical Similarity Scores. Bioinformatics. 2008;24:i357–i365. doi: 10.1093/bioinformatics/btn187.
9. Leach AR, Gillet VJ. An Introduction to Chemoinformatics. Springer; Dordrecht, The Netherlands: 2005.
10. Fligner MA, Verducci JS, Blower PE. A Modification of the Jaccard/Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings. Technometrics. 2002;44:110–119.
11. Flower DR. On the Properties of Bit String-Based Measures of Chemical Similarity. J Chem Inf Comput Sci. 1998;38:379–386.
12. James CA, Weininger D, Delany J. Daylight Theory Manual. Available through: http://www.daylight.com/dayhtml/doc/theory/ (accessed January 5, 2010).
13. Xue L, Godden JF, Stahura FL, Bajorath J. Profile scaling increases the similarity search performance of molecular fingerprints containing numerical descriptors and structural keys. J Chem Inf Comput Sci. 2003;43:1218–1225. doi: 10.1021/ci030287u.
14. Xue L, Stahura FL, Bajorath J. Similarity search profiling reveals effects of fingerprint scaling in virtual screening. J Chem Inf Comput Sci. 2004;44:2032–2039. doi: 10.1021/ci0400819.
15. Baldi P, Benz RW, Hirschberg D, Swamidass S. Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval. J Chem Inf Model. 2007;47:2098–2109. doi: 10.1021/ci700200n.
16. Bohacek RS, McMartin C, Guida WC. The art and practice of structure-based drug design: a molecular modelling perspective. Med Res Rev. 1996;16:3–50. doi: 10.1002/(SICI)1098-1128(199601)16:1<3::AID-MED1>3.0.CO;2-6.
17. Holliday JD, Hu CY, Willett P. Grouping of coefficients for the calculation of intermolecular similarity and dissimilarity using 2D fragment bit-strings. Comb Chem High Throughput Screening. 2002;5:155–166. doi: 10.2174/1386207024607338.
18. Swamidass S, Baldi P. Bounds and Algorithms for Exact Searches of Chemical Fingerprints in Linear and Sub-Linear Time. J Chem Inf Model. 2007;47:302–317. doi: 10.1021/ci600358f.
19. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A. Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures. Org Biomol Chem. 2004;2:3256–3266. doi: 10.1039/B409865J.
20. Bender A, Mussa H, Glen R, Reiling S. Similarity Searching of Chemical Databases Using Atom Environment Descriptors (MOLPRINT 2D): Evaluation of Performance. J Chem Inf Model. 2004;44:1708–1718. doi: 10.1021/ci0498719.
21. Hassan M, Brown RD, Varma-O'Brien S, Rogers D. Cheminformatics analysis and learning in a data pipelining environment. Mol Diversity. 2006;10:283–299. doi: 10.1007/s11030-006-9041-5.
22. Ackley DH, Hinton GE, Sejnowski TJ. A learning algorithm for Boltzmann machines. Cogn Sci. 1985;9:147–169.
23. Frey B. Graphical Models for Machine Learning and Digital Communication. MIT Press; Cambridge, MA: 1998.
24. Marsaglia G. Ratios of Normal Variables and Ratios of Sums of Uniform Variables. J Am Stat Assoc. 1965;60:193–204.
25. Hinkley DV. On the Ratio of Two Correlated Normal Random Variables. Biometrika. 1969;56:635–639.
26. Cedilnik A, Kosmelj K, Blejec A. The Distribution of the Ratio of Jointly Normal Variables. Metodoloski Zvezki. 2004;1:99–108.
27. Pham-Gia T, Turkkan N, Marchand E. Density of the Ratio of Two Normal Random Variables and Applications. Commun Stat Theory Methods. 2006;35:1569–1591.
28. Tversky A. Features of similarity. Psychol Rev. 1977;84:327–352.
29. Rouvray D. Definition and role of similarity concepts in the chemical and physical sciences. J Chem Inf Comput Sci. 1992;32:580–586.
30. Lee RY, Holland BS, Flueck JA. Distribution of a Ratio of Correlated Gamma Random Variables. SIAM J Appl Math. 1979;36:304–320.
31. Tubbs JD, Smith OE. A note on the ratio of positively correlated gamma variables. Commun Stat Theory Methods. 1985;14:13–23.
32. Loaiciga HA, Leipnik RB. Correlated gamma variables in the analysis of microbial densities in water. Adv Water Resour. 2005;28:329–335.
33. Nagar DKM, Orozco-Castaneda J, Gupta AK. Product and Quotient of Correlated Beta Variables. Appl Math Lett. 2009;22:105–109.
34. Galambos J. The Asymptotic Theory of Extreme Order Statistics. John Wiley; New York: 1978.
35. Leadbetter MR, Lindgren G, Rootzen H. Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag; New York: 1983.
36. Coles S. An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag; London: 2001.
37. Cover TM, Thomas JA. Elements of Information Theory. John Wiley; New York: 1991.
38. Stahl M, Rarey M. Detailed Analysis of Scoring Functions for Virtual Screening. J Med Chem. 2001;44:1035–1042. doi: 10.1021/jm0003992.
39. Oprea T, Davis A, Teague S, Leeson P. Is There a Difference between Leads and Drugs? A Historical Perspective. J Chem Inf Comput Sci. 2001;41:1308–1315. doi: 10.1021/ci010366a.
40. Coats E. The CoMFA steroids as a benchmark dataset for development of 3D QSAR methods. Perspect Drug Discov. 1998;12–14:199–213.
