Abstract
The SCISSORS method for approximating chemical similarities has shown excellent empirical performance on a number of real-world chemical data sets, but lacks theoretically-proven bounds on its worst-case error performance. This paper first proves reductions showing SCISSORS to be equivalent to two previous kernel methods: kernel principal components analysis and the rank-k Nyström approximation of a Gram matrix. These reductions allow the use of generalization bounds on these techniques to show that the expected error in SCISSORS approximations of molecular similarity kernels is bounded in expected pairwise inner product error, in matrix 2-norm and Frobenius norm for full kernel matrix approximations, and in RMS deviation for approximated matrices. Finally, we show that the actual performance of SCISSORS is significantly better than these worst-case bounds, indicating that chemical space is well-structured for chemical sampling algorithms.
Introduction
The SCISSORS method is a technique for accelerating chemical similarity search by transforming Tanimoto similarity scores to inner products, computing a metric embedding for a small “basis set” of molecules that optimally reconstructs the given inner products, and then projecting remaining non-basis “library” molecules into this vector space.1 SCISSORS similarities are then computed as Tanimotos on these embedded vectors. Significant speedups can be achieved for certain similarity measures (those which are expensive to compute, and have highly concentrated eigenvalue spectra) for repeated queries into a static database: the work done to compute vector projections for each database molecule can be amortized easily across a large number of queries. In the original SCISSORS paper, Haque and Pande report that for a database of approx. 57,000 molecules, a basis set of 1,000 molecules and embedding dimension of 100 was sufficient to accurately reproduce the shape similarity over the whole database.
The embedding used in SCISSORS is computed by first calculating the pairwise inner product matrix G between all pairs of basis molecules. G is then decomposed into eigenvectors V and eigenvalues along the diagonal of a matrix D; the vector embeddings for the basis molecules lie along the rows of matrix B in the following equation:
$$B = V D^{1/2} \qquad (1)$$
The rank of the approximation can be controlled by sorting the eigenvalues in decreasing order, setting all eigenvalues beyond a desired count to zero, and truncating the resulting zero dimensions from the vectors.
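A minimal NumPy sketch of this embedding step may help make it concrete. The function name `scissors_basis_embedding` and the toy Gram matrix are illustrative assumptions, not part of the original SCISSORS implementation; the code simply forms B = V D^{1/2} and keeps the d largest eigenvalues.

```python
import numpy as np

def scissors_basis_embedding(G, d):
    """Return B = V D^{1/2}, truncated to the d largest eigenvalues of G."""
    eigvals, eigvecs = np.linalg.eigh(G)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]       # indices of the d largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)    # guard against tiny negative round-off
    return eigvecs[:, order] * np.sqrt(lam)     # scale column i by sqrt(lambda_i)

# Toy stand-in for a basis set: 6 "molecules" with an exactly rank-3 Gram matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
G = X @ X.T

B_full = scissors_basis_embedding(G, 3)    # reproduces G exactly (up to round-off)
B_trunc = scissors_basis_embedding(G, 2)   # rank-2 approximation
err_2norm = np.linalg.norm(G - B_trunc @ B_trunc.T, 2)
```

In this toy case the 2-norm error of the rank-2 embedding equals the first discarded eigenvalue of G, which is the behavior the truncation rule is meant to produce.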
Figure 1 shows an example of SCISSORS applied to a molecular similarity kernel. In this example, SCISSORS is used to approximate the intersection size (SMILES overlap), as used for the LINGO similarity measure,2 between two molecules from the Maybridge Screening Collection: (S)-mandelate (molecule 1) and (2R)-3-(4-chlorophenoxy)propane-1,2-diol (molecule 2). The true intersection sizes, as computed by the SIML implementation of LINGO,3 are shown in the first row: between molecule 1 and itself, molecule 2 and itself, and the two molecules against each other. We then constructed a basis set of molecules from 3,072 isomeric SMILES strings drawn at random from the Maybridge Screening Collection, and embedded molecules 1 and 2 into SCISSORS vector spaces of varying dimensionalities: 64, 256, and 1,024. The three SCISSORS data rows in the table show the approximated values of each intersection, as a function of embedding dimension. As the dimension grows, the approximation error (difference between LINGO true value and SCISSORS-approximated value) decreases. Our objective in this paper is to derive theoretical bounds on the magnitude of this error.
Figure 1.
Example of SCISSORS applied to a molecular similarity kernel (LINGO intersection size). Table indicates LINGO true kernel value and SCISSORS-approximated kernel value for various dimensionalities.
A number of methods used in chemical informatics are mathematically similar to SCISSORS. In particular, the “molecular basis set” approach taken by Raghavendra and Maggiora4 is closely related. The Raghavendra and Maggiora (RM) method skips Tanimoto-to-inner product conversion (treating Tanimotos as inner products directly), does not restrict the dimensionality of the vector expansion, and is derived using a different justification, but is otherwise very similar. Both this method and SCISSORS are variants of kernel principal components analysis.
While the RM method and SCISSORS in particular seem to have good empirical performance, they lack theoretically-rigorous guarantees on their approximations. In this paper, we derive theoretical guarantees on the SCISSORS approximation error by reducing SCISSORS to previously-described kernel methods from machine learning.
Preliminaries
SCISSORS as a kernel method
The key insight of the SCISSORS technique is that molecular similarity measures, after appropriate transformation, can be treated as “kernel functions” taking pairs of molecules to scalar values that can be interpreted as inner products. Kernels are mathematical objects widely used in machine learning which can be used to adapt linear machine learning models (e.g., support vector machines) to nonlinear spaces. Informally, a kernel function is one taking two “objects” (often vectors, but in the chemical context molecules, strings, or fingerprints) and returning a non-negative real scalar satisfying particular properties of the real dot product (including symmetry and positive-semi-definiteness). While molecular similarity scores such as Tanimotos are not in themselves inner products or the result of kernel functions, they are often constructed from intermediate quantities which are. For example, the set intersection in LINGO2 is a kernel function, and the shape overlap volume from Gaussian shape overlay5 is approximately a kernel (non-negative and symmetric, but not positive-definite).
The advantage of interpreting SCISSORS as working on kernel functions or inner products is that it allows leveraging the body of machine learning literature on kernel methods. The SCISSORS pipeline can be roughly segmented into the following operations:
1. Convert Tanimotos to inner products (basis-vs-basis or library-vs-basis)
2. Compute a vector embedding on the inner products (by eigendecomposition or least-squares)
3. Compute vector-space inner products (standard dot product in ℝ^N)
4. Convert vector-space inner products to Tanimotos using standard vector Tanimoto equation
Steps 1 and 4 in this pipeline involve ratios of inner products (or kernel values), and as such, introduce nonlinearities into the analysis. However, if one assumes that exact kernel values are given or easily obtained (as demonstrated for the shape overlap volume in ref 1), and that the goal is to directly approximate these kernel values rather than the Tanimoto, then SCISSORS directly resembles a typical kernel method. Therefore, in this paper, we will consider only the error in these inner-product-space stages, rather than error introduced at the Tanimoto stages. Accordingly, we replace the notion of a “molecular similarity function” with that of a “molecular similarity kernel”, which can be thought of as the composition of a similarity function with the Tanimoto-to-inner-product operation from SCISSORS.
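The nonlinearity in steps 1 and 4 can be illustrated with a short sketch. It assumes the standard vector Tanimoto, T = ⟨a, b⟩ / (⟨a, a⟩ + ⟨b, b⟩ − ⟨a, b⟩), and that the self-inner-products are known; the function names are hypothetical and chosen only for this example.

```python
import numpy as np

def inner_to_tanimoto(ip_ab, ip_aa, ip_bb):
    """Standard vector Tanimoto from three inner products (pipeline step 4)."""
    return ip_ab / (ip_aa + ip_bb - ip_ab)

def tanimoto_to_inner(t_ab, ip_aa, ip_bb):
    """Invert the Tanimoto equation for <a, b> (pipeline step 1).

    Assumes the self-inner-products <a, a> and <b, b> are known or fixed.
    """
    return t_ab * (ip_aa + ip_bb) / (1.0 + t_ab)

# Round trip on two toy vectors standing in for molecules.
a = np.array([1.0, 2.0, 0.0])
b = np.array([0.5, 1.0, 1.0])
ip_ab, ip_aa, ip_bb = a @ b, a @ a, b @ b
t_ab = inner_to_tanimoto(ip_ab, ip_aa, ip_bb)
ip_recovered = tanimoto_to_inner(t_ab, ip_aa, ip_bb)
```

Both conversions are ratios of inner products, which is why the analysis in this paper is restricted to the linear stages in between.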
The following lemma will be useful in demonstrating the equivalence of SCISSORS to various other kernel methods. Proof is provided in the Supplemental Information.
Lemma 1 (SCISSORS library vectors are projections onto eigenvectors of the basis inner product matrix)
Let K be an N × N SCISSORS basis inner product matrix (that is, a similarity matrix post-Tanimoto-to-inner-product conversion). Let the eigenvalues (resp. eigenvectors) of K be denoted λi and Vi, with eigenvalues sorted in descending order of value. Let the matrix of all eigenvectors be named V = [V1 V2 ··· VN]. The SCISSORS vector w for a new molecule with library-vs-basis inner product vector L, in d dimensions, is defined by the expression:
$$w = \left[\lambda_1^{-1/2}\langle V_1, L\rangle,\ \lambda_2^{-1/2}\langle V_2, L\rangle,\ \cdots,\ \lambda_d^{-1/2}\langle V_d, L\rangle\right]^T \qquad (2)$$
Note that Lemma 1 suggests a method to compute SCISSORS vectors that is distinct from, but equivalent to, the least-squares calculation specified by Haque and Pande.1 Given a vector L of basis-vs-library inner products and a matrix $M = V D^{1/2}$ of basis SCISSORS vectors, the original SCISSORS calculation suggested solving the least-squares equation Mx = L for the SCISSORS vector x of the new molecule. This lemma shows that the same problem is solved by the matrix multiplication $D^{-1/2} V^T L$. This provides a computational shortcut for the projection of large numbers of library molecules: the projection matrix $D^{-1/2} V^T$ can be computed once for a basis set; after library-vs-basis Tanimotos have been computed and converted to inner products, the SCISSORS vector can be computed by a simple matrix multiplication rather than least-squares.
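The equivalence between the least-squares solve and the projection shortcut can be checked numerically. The following is a sketch on synthetic data (a linear kernel on random vectors stands in for a molecular similarity kernel, and all variable names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
feats_basis = rng.normal(size=(8, 5))       # hypothetical features for 8 basis molecules
K = feats_basis @ feats_basis.T             # basis inner product matrix
d = 3                                       # embedding dimension

eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1][:d]       # d largest eigenvalues
lam_d, V_d = eigvals[order], eigvecs[:, order]

M = V_d * np.sqrt(lam_d)                    # basis SCISSORS vectors, M = V D^{1/2}
projector = (V_d / np.sqrt(lam_d)).T        # D^{-1/2} V^T, computed once per basis set

feat_new = rng.normal(size=5)               # a new "library molecule"
L = feats_basis @ feat_new                  # library-vs-basis inner products

w_proj = projector @ L                               # Lemma 1 shortcut
w_lstsq = np.linalg.lstsq(M, L, rcond=None)[0]       # original least-squares route
```

Because M^T M = D for orthonormal eigenvectors, the normal equations reduce to the same matrix product, so `w_proj` and `w_lstsq` agree up to round-off; the projector can then be reused for every library molecule.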
Assumptions
The analysis in this paper rests on the following assumptions:
SCISSORS is given molecular similarity kernel values, not Tanimotos, to analyze. While the conversion from Tanimoto to inner product will introduce distortion (particularly if different molecules x and y have very different values of κ(x, x) and κ(y, y) for similarity kernel κ), we will not consider this distortion here.
It is assumed that the similarity kernel κ is symmetric positive semidefinite (SPSD). Similarity kernels that are not SPSD are not Mercer kernels and some proofs will fail in the presence of negative kernel eigenvalues. However, given non-SPSD κ, the results of this paper can still be applied to a modified kernel κ′, the nearest SPSD approximation to κ. If κ is symmetric but indefinite, then certain divergence terms can be easily calculated between the kernel matrices K and K′ induced by κ and κ′:
$\|K - K'\|_2$ = absolute value of the negative eigenvalue with largest magnitude
$\|K - K'\|_F = \sqrt{\sum_{\lambda<0} \lambda^2}$, where λ<0 are the negative eigenvalues
It is assumed that kernel values are exactly computable. In particular, the case in which kernel values themselves are subject to noise or inexactitude is not considered here.
Under these assumptions, it is possible to bound the additional error made by SCISSORS in choosing a small random basis rather than using the eigendecomposition of the full kernel matrix over the entire library. Two different types of bounds will be shown in this paper, arising from reductions to two different kernel methods: kernel principal components analysis, and the rank-k Nyström approximation.
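The SPSD assumption above can be illustrated with a small sketch of the nearest-SPSD construction. It assumes the approximation is taken in the Frobenius norm, where the nearest SPSD matrix to a symmetric matrix is obtained by setting negative eigenvalues to zero; the helper name `nearest_spsd` and the test matrix are illustrative.

```python
import numpy as np

def nearest_spsd(K):
    """Clip negative eigenvalues of a symmetric matrix to zero."""
    K = 0.5 * (K + K.T)                          # enforce exact symmetry
    lam, V = np.linalg.eigh(K)
    K_prime = (V * np.clip(lam, 0.0, None)) @ V.T
    return K_prime, lam

# Symmetric indefinite toy matrix with known eigenvalues 3, 1, -0.5, -2.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal basis
K = Q @ np.diag([3.0, 1.0, -0.5, -2.0]) @ Q.T

K_prime, lam = nearest_spsd(K)
neg = lam[lam < 0]
two_norm_gap = np.linalg.norm(K - K_prime, 2)        # largest |negative eigenvalue|
fro_gap = np.linalg.norm(K - K_prime, "fro")         # sqrt of sum of squared negative eigenvalues
```

For this matrix the two divergence terms come out as 2 and sqrt(0.5² + 2²), matching the expressions listed under the SPSD assumption.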
Reduction of SCISSORS to Kernel PCA
Overview of Kernel PCA
Kernel principal components analysis6,7 is a generalization of traditional principal components analysis from the data space to a feature space defined by a Mercer kernel function κ. Given a sample of N data points, kernel PCA computes up to N directions of maximum variance of the data, in the kernel’s feature space. Points can then be projected into this N-dimensional subspace by a projection of their kernel values against the original (training) data points; thus, kernel PCA can be considered to perform a metric embedding of data points into a subspace of the feature space defined by a given kernel.
Similar to traditional (linear) PCA, kernel PCA can be preceded by a centering step, in which the data are centered in feature space; this ensures that the data mean is not reflected in the recovered coordinates. However, the uncentered case has relevance to SCISSORS, so we now proceed to derive the kernel PCA algorithm without data centering (following the approach of Schölkopf6).
Derivation of kernel PCA
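Given a set of data points xi, i ∈ [1, ···, ℓ], and a Mercer kernel κ(x, y), defined by κ(x, y) = 〈Φ(x), Φ(y)〉 for some feature-space projection Φ, consider the feature covariance matrix C̄:
$$\bar{C} = \frac{1}{\ell} \sum_{j=1}^{\ell} \Phi(x_j)\,\Phi(x_j)^T$$
Let the eigenvalues and eigenvectors of C̄ be named λk and Vk respectively, such that λkVk = C̄Vk for all k. All such Vk (with λk ≠ 0) must lie in the span of Φ(x1) ··· Φ(xℓ). Thus the following system is equivalent:
$$\lambda \langle \Phi(x_j), V \rangle = \langle \Phi(x_j), \bar{C} V \rangle \quad \forall j \in [1, \cdots, \ell]$$
and there exist a1 ··· aℓ such that
$$V = \sum_{i=1}^{\ell} a_i \Phi(x_i)$$
Defining matrix Kij = 〈Φ(xi), Φ(xj)〉 and vector α = [a1 ··· aℓ]T we get ℓλKα = K²α, so we solve the eigenvalue problem ℓλα = Kα. Solutions λk, αk correspond to eigenvalues/vectors of the kernel matrix (from here on, λk denotes the kth eigenvalue of K, absorbing the factor ℓ).
We normalize the resulting solutions by requiring that the feature-space eigenvectors (Vk) be unit magnitude. This implies:
$$1 = \langle V_k, V_k \rangle = \sum_{i,j=1}^{\ell} a_i^k a_j^k K_{ij} = \langle \alpha_k, K \alpha_k \rangle = \lambda_k \langle \alpha_k, \alpha_k \rangle$$
We can compute the projection of a new data point x onto the feature-space correlation matrix eigenvectors Vk by:
$$\langle V_k, \Phi(x) \rangle = \sum_{i=1}^{\ell} a_i^k \langle \Phi(x_i), \Phi(x) \rangle = \sum_{i=1}^{\ell} a_i^k \kappa(x_i, x)$$
So for d eigenvectors, the projected KPCA coordinate vector w is:
$$w = \left[\sum_{i=1}^{\ell} a_i^1 \kappa(x_i, x),\ \cdots,\ \sum_{i=1}^{\ell} a_i^d \kappa(x_i, x)\right]^T$$
Equivalently:
$$w = [\alpha_1\ \alpha_2\ \cdots\ \alpha_d]^T L, \qquad L_i = \kappa(x_i, x)$$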
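The derivation can be sanity-checked with a linear kernel, where Φ(x) = x and the feature-space eigenvectors are directly computable. The sketch below is an illustration under that assumption; `kpca_fit` and `kpca_project` are hypothetical helper names, and no centering is applied.

```python
import numpy as np

def kpca_fit(K, d):
    """Uncentered kernel PCA: return top-d eigenvalues of K and coefficient
    vectors alpha_k normalized so that lambda_k * <alpha_k, alpha_k> = 1."""
    lam, V = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:d]
    lam_d, V_d = lam[order], V[:, order]
    return lam_d, V_d / np.sqrt(lam_d)       # column k is alpha_k

def kpca_project(alphas, L):
    """Project a new point given its kernel values L against the sample."""
    return alphas.T @ L

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 4))     # sample points; linear kernel, so Phi(x) = x
K = X @ X.T
lam_d, alphas = kpca_fit(K, 2)

x_new = rng.normal(size=4)
w = kpca_project(alphas, X @ x_new)

# With a linear kernel, V_k = sum_i a_i^k x_i can be formed explicitly.
feat_dirs = X.T @ alphas         # columns are the feature-space eigenvectors V_k
```

In this setting the columns of `feat_dirs` have unit norm and are eigenvectors of X^T X, and `w` coincides with the direct projection of `x_new` onto them, which is what the normalization and projection formulas claim.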
Reduction Proof
We now demonstrate that SCISSORS is equivalent to kernel PCA performed without data centering.
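As proven in Lemma 1, the SCISSORS vector w corresponding to a library molecule is defined by weighted inner products between the eigenvectors of the kernel matrix and the library-versus-basis inner product vector L. Define new vectors $\tilde{V}_i = \lambda_i^{-1/2} V_i$, so that the ith SCISSORS coordinate is $\langle \tilde{V}_i, L \rangle$. Recall that the kernel matrix and vector L are already identical between methods, and both $\tilde{V}_i$ and αi are defined to be eigenvectors of the kernel matrix. To prove equivalence, all that is left to prove is that the SCISSORS projection vectors $\tilde{V}_i$ have the same normalization as the KPCA αi; KPCA requires λk 〈αk, αk〉 = 1.
Proof
We hypothesize that $\alpha_k = \tilde{V}_k = \lambda_k^{-1/2} V_k$. Then:
$$\lambda_k \langle \alpha_k, \alpha_k \rangle = \lambda_k \langle \lambda_k^{-1/2} V_k, \lambda_k^{-1/2} V_k \rangle = \langle V_k, V_k \rangle = 1$$
where the last equality holds because the kernel matrix eigenvectors Vk of Lemma 1 have unit norm.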
Reduction of SCISSORS to the Nyström Rank-k Approximation
Overview of the Nyström Method
In many large-scale machine learning methods, the computation and eigendecomposition of very large kernel matrices is a bottleneck, as the time complexity of eigendecomposition scales as O(N³). Williams and Seeger introduced a method, based on the Nyström approximation from integral equation theory, that computes a low-rank approximation to a large kernel matrix from approximate eigenvectors of the entire matrix, obtained from a random sample of a small number of points.8 Precisely, using notation from Drineas et al.,9 given an n × n kernel matrix K, a desired rank k, and a number of basis elements ℓ, the Nyström approximation computes K̃k, a rank-k approximation to K by the following procedure:
Algorithm Sketch 1 (Nyström approximation)
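Given a kernel matrix $K \in \mathbb{R}^{n \times n}$, choose ℓ columns (equivalently, ℓ basis/landmark input points) [b1, b2, ···, bℓ] to obtain matrices C and W:
$$C \in \mathbb{R}^{n \times \ell},\ C_{ij} = K_{i, b_j}; \qquad W \in \mathbb{R}^{\ell \times \ell},\ W_{ij} = K_{b_i, b_j}$$
Let Wk be the best rank-k approximation to matrix W and $W_k^{+}$ be the Moore-Penrose pseudoinverse of Wk. Then the rank-k Nyström approximation to matrix K is defined by $\tilde{K}_k = C W_k^{+} C^T$.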
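The algorithm sketch translates fairly directly into NumPy. The version below is a minimal sketch, not a reference implementation: the function name `nystrom_rank_k` is illustrative, and it assumes the top k eigenvalues of W are strictly positive so that the pseudoinverse can be formed by inverting them.

```python
import numpy as np

def nystrom_rank_k(K, basis_idx, k):
    """Rank-k Nystrom approximation C W_k^+ C^T from the columns in basis_idx."""
    C = K[:, basis_idx]                          # n x l: every point vs. basis points
    W = K[np.ix_(basis_idx, basis_idx)]          # l x l: basis vs. basis
    lam, V = np.linalg.eigh(W)
    order = np.argsort(lam)[::-1][:k]            # k largest eigenvalues of W
    lam_k, V_k = lam[order], V[:, order]
    W_k_pinv = (V_k / lam_k) @ V_k.T             # assumes lam_k > 0 (see lead-in)
    return C @ W_k_pinv @ C.T

# Toy check: a rank-3 kernel matrix sampled with l = 5 basis points and k = 3.
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
K = X @ X.T
basis_idx = rng.choice(20, size=5, replace=False)   # uniform sampling without replacement
K_tilde = nystrom_rank_k(K, basis_idx, k=3)
```

In this toy case W has the same rank as K, so the approximation reproduces K up to round-off; for a kernel matrix of higher rank, the result is only an approximation whose error is the subject of the bounds later in the paper.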
Preliminaries
Consider a SCISSORS computation of full pairwise similarity over some large set of molecules ℳ. Partition this set, by random selection without replacement, into a basis set ℬ and a library set ℒ. Then, the matrix W in Algorithm Sketch 1 corresponds to the SCISSORS basis inner-product matrix on ℬ; similarly, C is an aggregation of transposed library-vs-basis inner-product vectors. To prove the equivalence of SCISSORS and the Nyström method, we will demonstrate that the inner-product matrix computed by the SCISSORS-approximated vectors is identical to that computed by the Nyström method. It is sufficient to show (by Lemma 1) that $\tilde{K}_k$, the Nyström-approximated Gram matrix, factorizes as $\tilde{K}_k = S_k S_k^T$ where:
$$S_k = C \bar{V} \bar{D}^{-1/2} \qquad (3)$$
with V̄ and $\bar{D}$ as defined in Lemma 2 below. Sk is the matrix with library (and basis) vectors along the rows, so $S_k S_k^T$ is the SCISSORS-approximated Gram matrix. The following lemma is helpful for the proof. Proof of the lemma is provided in the Supplemental Information.
Lemma 2 (The pseudoinverse of Wk)
$W_k^{+} = \bar{V} \bar{D}^{-1} \bar{V}^T$, where V̄ = [V1 V2 ··· Vk], the matrix formed from the first k columns of the basis matrix eigenvectors, and $\bar{D}^{-1} = \mathrm{diag}(\lambda_1^{-1}, \cdots, \lambda_k^{-1})$, the diagonal matrix of the reciprocals of the first k eigenvalues of the basis matrix.
Final Reduction
We must show that $\tilde{K}_k = C W_k^{+} C^T$ is equal to $S_k S_k^T$ where $S_k = C \bar{V} \bar{D}^{-1/2}$.
Proof
$$S_k S_k^T = C \bar{V} \bar{D}^{-1/2} \left(C \bar{V} \bar{D}^{-1/2}\right)^T = C \bar{V} \bar{D}^{-1/2} \bar{D}^{-1/2} \bar{V}^T C^T = C \bar{V} \bar{D}^{-1} \bar{V}^T C^T = C W_k^{+} C^T = \tilde{K}_k$$
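The reduction can also be checked numerically: the Gram matrix of SCISSORS vectors should match the Nyström approximation built from the same basis. This is a sketch on synthetic data; for simplicity the first seven rows serve as the basis set (the rows are already random), and the pseudoinverse is taken with `np.linalg.pinv` using an explicit cutoff to ignore round-off-level singular values.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 6))
K = X @ X.T                          # full kernel matrix over all molecules
basis_idx = np.arange(7)             # basis set; remaining rows are library molecules
k = 4                                # embedding dimension / Nystrom rank

C = K[:, basis_idx]
W = K[np.ix_(basis_idx, basis_idx)]
lam, V = np.linalg.eigh(W)
order = np.argsort(lam)[::-1][:k]
lam_k, V_bar = lam[order], V[:, order]

# SCISSORS side: project every row of C with the Lemma 1 projector (eq 3).
S_k = C @ (V_bar / np.sqrt(lam_k))
gram_scissors = S_k @ S_k.T

# Nystrom side: pseudoinverse of the best rank-k approximation to W.
W_k = (V_bar * lam_k) @ V_bar.T
K_nystrom = C @ np.linalg.pinv(W_k, 1e-10) @ C.T
```

The two Gram matrices agree to round-off, and the basis rows of `S_k` reduce to the basis embedding V D^{1/2} from equation 1, since W V̄ D̄^{-1/2} = V̄ D̄^{1/2}.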
The expected error in individual SCISSORS inner products is bounded with high probability
Statement of the theorem
Theorem 1 (Bounded expected inner product error)
Given a chemical similarity kernel κ defined over pairs of molecules from some distribution D, such that κ(x, x) < R2 for some positive real constant R for all x ∈ D. Construct a SCISSORS basis set from a random sample S of ℓ molecules drawn uniformly at random from D. Denote by the SCISSORS-approximated kernel of k dimensions from basis set S. Then, with probability at least (1 − δ)2, the expected error in SCISSORS approximation, over pairs of independently-chosen molecules x, y ∈ D, is bounded:
| (4) |
Where si are the basis molecules and λ̂>d(S) is the sum of the eigenvalues of the basis matrix not used in SCISSORS:
$$\hat{\lambda}^{>d}(S) = \sum_{i>d} \hat{\lambda}_i$$
Proof Overview
The proof of Theorem 1 relies on a bound on the generalization error of kernel PCA projections due to Shawe-Taylor.10 This theorem bounds the expected residual from projecting new data onto a sampled kernel PCA basis; we extend this proof to bound the expected error in inner products from projecting two points onto a kernel PCA basis. Then, the translation to SCISSORS follows trivially from the reduction of SCISSORS to kernel PCA. Because the full proof is lengthy, it has been included in the Supplemental Information; this section presents a sketch.
The proof sketch relies on the following definitions from the Shawe-Taylor work:10
V̂k is the space spanned by the first k eigenvectors of the sample correlation matrix of a sample of vectors S; $\hat{V}_k^{\perp}$ is the orthogonal complement to this space.
λk is the kth true eigenvalue of the kernel operator κ, computed over the entire distribution generating our data.
λ̂k is the kth empirical eigenvalue (i.e., the kth eigenvalue, in descending order of value, of the kernel matrix on S).
λ>k is the sum $\sum_{i>k} \lambda_i$, and similarly for λ̂>k.
The residual $P_{\hat{V}_k^{\perp}}(x)$ is the projection of x onto the space $\hat{V}_k^{\perp}$.
We make use of the following theorem:
Theorem 2 (Theorem 1 from10)
If we perform PCA in the feature space defined by kernel κ, then over random samples of points S s.t. |S| = ℓ (ℓ-samples), for all 1 ≤ k ≤ ℓ, if we project new data onto the space V̂k, the expected squared residual is bounded by the following, with probability greater than 1 − δ:
| (5) |
Where the support of the distribution is in a ball of radius R in feature space.
Using Theorem 2, it is possible to compute a bound on the projection error for each of the two points. The proof then bounds the variance of the resulting inner product error, and uses this to bound the overall error.
The error in SCISSORS-approximated Gram matrices is bounded in 2-norm, Frobenius norm, and RMS deviation
Statement of Theorems
Let κ be a chemical similarity kernel, and consider a set of n input molecules drawn from some probability distribution such that κ(x, x) < R² for all molecules x. Let the true kernel matrix be denoted K and the best possible rank-k approximation to K be denoted Kk. Compute a SCISSORS-approximated kernel matrix K̃ based on a size-ℓ uniform random sample of these vectors and a k-dimensional vector expansion. Then the following three theorems hold:
Theorem 3 (Bounded error 2-norm)
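With probability at least 1 − δ, the error in the SCISSORS kernel matrix is worse than the lowest possible error from a rank k-approximated kernel matrix by at most a bounded amount in 2-norm:
$$\|K - \tilde{K}\|_2 \le \|K - K_k\|_2 + \frac{2n}{\sqrt{\ell}} R^2 \left[1 + 2\sqrt{\log\frac{1}{\delta}}\right]$$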
Theorem 4 (Bounded error Frobenius norm)
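With probability at least 1 − δ, the error in the SCISSORS kernel matrix is worse than the lowest possible error from a rank k-approximated kernel matrix by at most a bounded amount in Frobenius norm:
$$\|K - \tilde{K}\|_F \le \|K - K_k\|_F + \left[\frac{64k}{\ell}\right]^{1/4} n R^2 \left[1 + 2\sqrt{\log\frac{1}{\delta}}\right]^{1/2}$$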
Theorem 5 (Bounded RMS error)
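With probability at least 1 − δ, the elementwise root-mean-square (RMS) error in the SCISSORS kernel matrix is worse than the lowest possible RMS error from a rank k-approximated kernel matrix by at most a bounded amount:
$$\mathrm{RMS}\{K - \tilde{K}\} \le \mathrm{RMS}\{K - K_k\} + \left[\frac{64k}{\ell}\right]^{1/4} R^2 \left[1 + 2\sqrt{\log\frac{1}{\delta}}\right]^{1/2}$$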
Proof Overview
The proofs of Theorems 3, 4, and 5 rely on the following theorem, due to Talwalkar11 bounding the error of the rank-k Nyström approximation of a Gram matrix:
Theorem 6 (Theorem 5.2 from11)
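Let K̃ denote the rank-k Nyström approximation of an n × n Gram matrix K based on ℓ columns sampled uniformly at random without replacement from K, and Kk the best rank-k approximation of K. Then, with probability at least 1 − δ, the following inequalities hold for any sample of size ℓ:
$$\|K - \tilde{K}\|_2 \le \|K - K_k\|_2 + \frac{2n}{\sqrt{\ell}} K_{\max} \left[1 + \sqrt{\frac{n-\ell}{n-1/2}\,\frac{1}{\beta(\ell, n)}\,\log\frac{1}{\delta}}\ \frac{d_{\max}^{K}}{K_{\max}^{1/2}}\right]$$
$$\|K - \tilde{K}\|_F \le \|K - K_k\|_F + \left[\frac{64k}{\ell}\right]^{1/4} n K_{\max} \left[1 + \sqrt{\frac{n-\ell}{n-1/2}\,\frac{1}{\beta(\ell, n)}\,\log\frac{1}{\delta}}\ \frac{d_{\max}^{K}}{K_{\max}^{1/2}}\right]^{1/2}$$
Where:
Kmax = maxi Kii
$d_{\max}^{K} = \max_{ij} \sqrt{K_{ii} + K_{jj} - 2K_{ij}}$ is the maximum distance implied in K
$\beta(\ell, n) = 1 - \frac{1}{2\max\{\ell,\ n - \ell\}}$.
For SCISSORS, we are particularly interested in the case in which ℓ≪n, so $\frac{n-\ell}{n-1/2} \approx 1$ and $\beta(\ell, n) \approx 1$.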
Proof of Theorems 3, 4, and 5
Let κ be a kernel and consider a distribution of input vectors such that their distribution in the feature space implied by κ is D, and such that the support of D is contained within a ball of radius R in feature space. Then, Kmax in the above equations is bounded above by R² and $d_{\max}^{K} \le 2R$. Note that this boundedness assumption holds for any finite sample of vectors from D, as we can construct an empirical distribution of vectors from the sample, which will be guaranteed to be of bounded radius.
Theorems 3 and 4 immediately follow from Theorem 6 by applying the reduction of SCISSORS to the Nyström method, the definitions of Kmax and $d_{\max}^{K}$, and the assumption above that ℓ ≪ n. Theorem 5 requires one additional step:
Lemma 3
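Given an n × n matrix M, the root-mean-square value of each element of M, RMS{M} is related to the Frobenius norm of M, ||M||F by the relationship:
$$\mathrm{RMS}\{M\} = \frac{1}{n} \|M\|_F$$
Proof
$$\mathrm{RMS}\{M\} = \sqrt{\frac{1}{n^2} \sum_{i,j} M_{ij}^2} = \frac{1}{n} \sqrt{\sum_{i,j} M_{ij}^2} = \frac{1}{n} \|M\|_F$$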
Then Theorem 5 follows by multiplying each term of Theorem 4 by 1/n.
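Lemma 3 is easy to confirm numerically. The snippet below is a small illustrative check on a random square matrix; the variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
M = rng.normal(size=(n, n))

rms = np.sqrt(np.mean(M ** 2))           # elementwise root-mean-square of M
fro = np.linalg.norm(M, "fro")           # Frobenius norm of M
rms_from_fro = fro / n                   # Lemma 3: RMS{M} = ||M||_F / n
```

Dividing a Frobenius-norm bound by n therefore converts it into a bound on the per-element RMS error, which is the step used to pass from Theorem 4 to Theorem 5.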
Discussion
Reduction to existing kernel methods makes it possible to prove rigorous probabilistic bounds on the approximation error made by SCISSORS under fairly mild restrictions on the input molecule distribution. However, because very few assumptions are made about the input distribution, the resulting bounds end up being very loose. For example, consider the added RMS error from basis-sampling (Theorem 5) under conditions similar to those in Figure 1, if we were to approximate 50,000 molecules in Maybridge rather than just two. Specifically, consider 256 dimensions, 3,000 basis molecules, and a desired confidence of 1 − e^{−3} ≈ 95%: n = 50,000, k = 256, ℓ = 3,000, δ = e^{−3}:
So with 95% confidence, the RMS kernel error will be less than 1.8 times the maximum value of the kernel. Looking back at Figure 1 shows that this is clearly a very loose result: 1.8 times the largest kernel value in the source data is an RMS error of 50.4 (1.8 × 28), whereas we achieve much smaller errors on the (randomly chosen) molecules given. However, it is notable that this result holds with no assumptions about the distribution of input molecules, except boundedness in the kernel values. The performance of SCISSORS on real-world data sets is significantly better than this worst-case estimate (see for example, the statistics on the full Maybridge data set in the original SCISSORS paper1), indicating that the distribution of molecules in the similarity space considered is somehow friendly to sampling-based algorithms.
Conclusion
Sampling algorithms, both kernel PCA-based methods such as SCISSORS and the Raghavendra/Maggiora method4 and non-PCA-based diversity selection and clustering methods, are widespread in chemical informatics. In this paper we have provided theoretical performance guarantees on the approximation error arising from dataset sampling and rank-reduction of chemical kernels. Our results relate chemical dimensionality reduction algorithms to well-known methods in machine learning. In particular, the fact that the worst-case bounds are significantly looser than the real-world performance of sampling algorithms suggests that in practice, many chemical kernels are representable in few dimensions and that chemical space is well-structured, such that sampling is a viable strategy.
Acknowledgments
ISH and VSP acknowledge support from Simbios (NIH Roadmap GM072970). ISH acknowledges support from an NSF graduate fellowship.
Detailed proofs for Lemmas 1 and 2 and Theorem 1 are included in the Supporting Information. This information is available free of charge via the Internet at http://pubs.acs.org/.
References
1. Haque IS, Pande VS. SCISSORS: A Linear-Algebraical Technique to Rapidly Approximate Chemical Similarities. J Chem Inf Model. 2010;50:1075–1088. doi: 10.1021/ci1000136.
2. Vidal D, Thormann M, Pons M. LINGO, an Efficient Holographic Text Based Method To Calculate Biophysical Properties and Intermolecular Similarities. J Chem Inf Model. 2005;45:386–393. doi: 10.1021/ci0496797.
3. Haque IS, Pande VS, Walters WP. SIML: A Fast SIMD Algorithm for Calculating LINGO Chemical Similarities on GPUs and CPUs. J Chem Inf Model. 2010;50:560–564. doi: 10.1021/ci100011z.
4. Raghavendra AS, Maggiora GM. Molecular Basis Sets — A General Similarity-Based Approach for Representing Chemical Spaces. J Chem Inf Model. 2007;47:1328–1340. doi: 10.1021/ci600552n.
5. Grant JA, Pickup BT. A Gaussian Description of Molecular Shape. J Phys Chem. 1995;99:3503–3510.
6. Schölkopf B, Smola A, Müller K-R. Technical Report 44. Max-Planck-Institut für biologische Kybernetik; Tuebingen, Germany: 1996. Nonlinear Component Analysis as a Kernel Eigenvalue Problem.
7. Schölkopf B, Smola A, Müller K-R. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Comput. 1998;10:1299–1319.
8. Williams C, Seeger M. Advances in Neural Information Processing Systems. Vol. 13. MIT Press; Cambridge, MA: 2001. Using the Nyström Method to Speed Up Kernel Machines; pp. 682–688.
9. Drineas P, Mahoney MW. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. J Mach Learn Res. 2005;6:2153–2175.
10. Shawe-Taylor J, Williams CKI, Cristianini N, Kandola J. On the Eigenspectrum of the Gram Matrix and the Generalization Error of Kernel-PCA. IEEE Trans Info Theory. 2005;51:2510–2522.
11. Talwalkar A. PhD thesis. Courant Institute of Mathematical Sciences, New York University; New York, NY: 2010.