Error Bounds on the SCISSORS Approximation Method

Imran S Haque; Vijay S Pande

doi:10.1021/ci200251a

Author manuscript; available in PMC: 2012 Sep 26.

Published in final edited form as: J Chem Inf Model. 2011 Sep 8;51(9):2248–2253. doi: 10.1021/ci200251a

PMCID: PMC3183166 NIHMSID: NIHMS320426 PMID: 21851122

Abstract

The SCISSORS method for approximating chemical similarities has shown excellent empirical performance on a number of real-world chemical data sets, but lacks theoretically-proven bounds on its worst-case error performance. This paper first proves reductions showing SCISSORS to be equivalent to two previous kernel methods: kernel principal components analysis and the rank-k Nyström approximation of a Gram matrix. These reductions allow the use of generalization bounds on these techniques to show that the expected error in SCISSORS approximations of molecular similarity kernels is bounded in expected pairwise inner product error, in matrix 2-norm and Frobenius norm for full kernel matrix approximations, and in RMS deviation for approximated matrices. Finally, we show that the actual performance of SCISSORS is significantly better than these worst-case bounds, indicating that chemical space is well-structured for chemical sampling algorithms.

Introduction

The SCISSORS method is a technique for accelerating chemical similarity search by transforming Tanimoto similarity scores to inner products, computing a metric embedding for a small “basis set” of molecules that optimally reconstructs the given inner products, and then projecting remaining non-basis “library” molecules into this vector space.¹ SCISSORS similarities are then computed as Tanimotos on these embedded vectors. Significant speedups can be achieved for certain similarity measures (those which are expensive to compute, and have highly concentrated eigenvalue spectra) for repeated queries into a static database: the work done to compute vector projections for each database molecule can be amortized easily across a large number of queries. In the original SCISSORS paper, Haque and Pande report that for a database of approx. 57,000 molecules, a basis set of 1,000 molecules and embedding dimension of 100 was sufficient to accurately reproduce the shape similarity over the whole database.

The embedding used in SCISSORS is computed by first calculating the pairwise inner product matrix G between all pairs of basis molecules. G is then decomposed into eigenvectors V and eigenvalues along the diagonal of a matrix D; the vector embeddings for the basis molecules lie along the rows of matrix B in the following equation:

\begin{array}{l} G = B B^{T} = V D V^{T} = V D^{1 / 2} D^{1 / 2} V^{T} \\ ∴ B = V D^{1 / 2} \end{array}

(1)

The rank of the approximation can be controlled by sorting the eigenvalues in decreasing order, setting all eigenvalues beyond a desired count to zero, and truncating the resulting zero dimensions from the embedded vectors.
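As a minimal sketch of Equation 1 and the rank truncation just described, the following NumPy code embeds a small synthetic basis set. The function name `scissors_basis_embedding` and the random test matrix are illustrative assumptions, not part of the original SCISSORS implementation.

```python
import numpy as np

def scissors_basis_embedding(G, d):
    """Return a d-column embedding B_d such that B_d @ B_d.T approximates G.

    G is assumed to be a symmetric positive semidefinite inner product matrix.
    """
    eigvals, eigvecs = np.linalg.eigh(G)          # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    top = np.clip(eigvals[:d], 0.0, None)         # guard against tiny negative round-off
    return eigvecs[:, :d] * np.sqrt(top)          # B = V D^{1/2}, truncated to d columns

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                      # hypothetical hidden coordinates
G = X @ X.T                                       # exact rank-8 inner product matrix

B_full = scissors_basis_embedding(G, 8)           # keeps every nonzero eigenvalue
B_low = scissors_basis_embedding(G, 3)            # keeps only the three largest
err_full = np.linalg.norm(G - B_full @ B_full.T, 2)
err_low = np.linalg.norm(G - B_low @ B_low.T, 2)
fourth_eig = np.sort(np.linalg.eigvalsh(G))[::-1][3]
```

In this synthetic case the full-rank embedding reproduces G up to round-off, while the 2-norm error of the three-dimensional embedding equals the largest discarded eigenvalue.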

Figure 1 shows an example of SCISSORS applied to a molecular similarity kernel. In this example, SCISSORS is used to approximate the intersection size (SMILES overlap), as used for the LINGO similarity measure,² between two molecules from the Maybridge Screening Collection: (S)-mandelate (molecule 1) and (2R)-3-(4-chlorophenoxy)propane-1,2-diol (molecule 2). The true intersection sizes, as computed by the SIML implementation of LINGO,³ are shown in the first row: between molecule 1 and itself, molecule 2 and itself, and the two molecules against each other. We then constructed a basis set of molecules from 3,072 isomeric SMILES strings drawn at random from the Maybridge Screening Collection, and embedded molecules 1 and 2 into SCISSORS vector spaces of varying dimensionalities: 64, 256, and 1,024. The three SCISSORS data rows in the table show the approximated values of each intersection, as a function of embedding dimension. As the dimension grows, the approximation error (difference between LINGO true value and SCISSORS-approximated value) decreases. Our objective in this paper is to derive theoretical bounds on the magnitude of this error.

Figure 1. Example of SCISSORS applied to a molecular similarity kernel (LINGO intersection size). Table indicates LINGO true kernel value and SCISSORS-approximated kernel value for various dimensionalities.

A number of methods used in chemical informatics are mathematically similar to SCISSORS. In particular, the “molecular basis set” approach taken by Raghavendra and Maggiora⁴ is closely related. The Raghavendra and Maggiora (RM) method skips Tanimoto-to-inner product conversion (treating Tanimotos as inner products directly), does not restrict the dimensionality of the vector expansion, and is derived using a different justification, but is otherwise very similar. In particular, both this method and SCISSORS are variants of kernel principal components analysis.

While the RM method and SCISSORS in particular seem to have good empirical performance, they lack theoretically-rigorous guarantees on their approximations. In this paper, we derive theoretical guarantees on the SCISSORS approximation error by reducing SCISSORS to previously-described kernel methods from machine learning.

Preliminaries

SCISSORS as a kernel method

The key insight of the SCISSORS technique is that molecular similarity measures, after appropriate transformation, can be treated as “kernel functions” taking pairs of molecules to scalar values that can be interpreted as inner products. Kernels are mathematical objects widely used in machine learning which can be used to adapt linear machine learning models (e.g., support vector machines) to nonlinear spaces. Informally, a kernel function is one taking two “objects” (often vectors, but in the chemical context molecules, strings, or fingerprints) and returning a non-negative real scalar satisfying particular properties of the real dot product (including symmetry and positive-semi-definiteness). While molecular similarity scores such as Tanimotos are not in themselves inner products or the result of kernel functions, they are often constructed from intermediate quantities which are. For example, the set intersection in LINGO² is a kernel function, and the shape overlap volume from Gaussian shape overlay⁵ is approximately a kernel (non-negative and symmetric, but not positive-definite).

The advantage of interpreting SCISSORS as working on kernel functions or inner products is that it allows leveraging the body of machine learning literature on kernel methods. The SCISSORS pipeline can be roughly segmented into the following operations:

1. Convert Tanimotos to inner products (basis-vs-basis or library-vs-basis)
2. Compute a vector embedding on the inner products (by eigendecomposition or least-squares)
3. Compute vector-space inner products (standard dot product in ℜ^N)
4. Convert vector-space inner products to Tanimotos using standard vector Tanimoto equation

Steps 1 and 4 in this pipeline involve ratios of inner products (or kernel values), and as such, introduce nonlinearities into the analysis. However, if one assumes that exact kernel values are given or easily obtained (as demonstrated for the shape overlap volume in¹), and that the goal is to directly approximate these kernel values rather than the Tanimoto, then SCISSORS directly resembles a typical kernel method. Therefore, in this paper, we will consider only the error in these inner-product-space stages, rather than error introduced at the Tanimoto stages. Accordingly, we replace the notion of a “molecular similarity function” with that of a “molecular similarity kernel”, which can be thought of as the composition of a similarity function with the Tanimoto-to-inner-product operation from SCISSORS.
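For reference, the two Tanimoto stages of the pipeline can be sketched as the standard vector Tanimoto equation and its algebraic inverse. This sketch assumes the self-kernel values κ(x, x) and κ(y, y) are available; the exact conversion used in the original SCISSORS paper is not reproduced in this text, and the function names are hypothetical.

```python
import numpy as np

def tanimoto_from_inner(k_xy, k_xx, k_yy):
    """Standard vector Tanimoto built from kernel (inner product) values."""
    return k_xy / (k_xx + k_yy - k_xy)

def inner_from_tanimoto(t_xy, k_xx, k_yy):
    """Solve the Tanimoto equation for the cross term, given the self terms."""
    return t_xy * (k_xx + k_yy) / (1.0 + t_xy)

rng = np.random.default_rng(1)
a, b = rng.random(16), rng.random(16)       # hypothetical feature vectors
k_ab, k_aa, k_bb = a @ b, a @ a, b @ b      # exact inner products
t_ab = tanimoto_from_inner(k_ab, k_aa, k_bb)
k_ab_recovered = inner_from_tanimoto(t_ab, k_aa, k_bb)
```

The round trip recovers the cross inner product exactly, which illustrates why the nonlinearity lives entirely in these two stages and can be set aside when analyzing the inner-product stages.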

The following lemma will be useful in demonstrating the equivalence of SCISSORS to various other kernel methods. Proof is provided in the Supplemental Information.

Lemma 1 (SCISSORS library vectors are projections onto eigenvectors of the basis inner product matrix)

Let K be an N × N SCISSORS basis inner product matrix (that is, a similarity matrix post-Tanimoto-to-inner-product conversion). Let the eigenvalues (resp. eigenvectors) of K be denoted λ_i and V_i, with eigenvalues sorted in descending order of value. Let the matrix of all eigenvectors be named V = [V₁V₂ ··· V_N]. The SCISSORS vector w for a new molecule with library-vs-basis inner product vector L, in d dimensions, is defined by the expression:

w = [\begin{matrix} λ_{1}^{- 1 / 2} 〈 V_{1}, L 〉 \\ λ_{2}^{- 1 / 2} 〈 V_{2}, L 〉 \\ ⋮ \\ λ_{d}^{- 1 / 2} 〈 V_{d}, L 〉 \end{matrix}]

(2)

Note that Lemma 1 suggests a method to compute SCISSORS vectors that is distinct from, but equivalent to, the least-squares calculation specified by Haque and Pande.¹ Given a vector L of basis-vs-library inner products and a matrix M = VD^{1/2} of basis SCISSORS vectors, the original SCISSORS calculation suggested solving the least-squares equation Mx = L for the SCISSORS vector x of the new molecule. This lemma shows that the same problem is solved by the matrix multiplication D^{−1/2}V^{T}L. This provides a computational shortcut for the projection of large numbers of library molecules: the projection matrix D^{−1/2}V^{T} can be computed once for a basis set; after library-vs-basis Tanimotos have been computed and converted to inner products, each SCISSORS vector can be computed by a simple matrix multiplication rather than a least-squares solve.
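The equivalence between the least-squares formulation and the projection-matrix shortcut can be checked numerically. The sketch below uses synthetic Gaussian coordinates as a stand-in for real molecules; all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 40))               # hypothetical basis coordinates
x_new = rng.normal(size=40)                 # hypothetical library molecule
K = X @ X.T                                 # basis-vs-basis inner products
L = X @ x_new                               # library-vs-basis inner products

eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]           # descending eigenvalue order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d = 10
V_d, lam_d = eigvecs[:, :d], eigvals[:d]
M = V_d * np.sqrt(lam_d)                    # basis vectors, M = V D^{1/2}

# Original formulation: least-squares solution of M x = L.
x_lstsq, *_ = np.linalg.lstsq(M, L, rcond=None)

# Lemma 1 shortcut: a projection matrix computed once per basis set.
P = (V_d / np.sqrt(lam_d)).T                # D^{-1/2} V^T, shape (d, N)
x_proj = P @ L
```

Because `P` depends only on the basis set, it can be reused for every library molecule, so projecting a whole library reduces to one matrix product.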

Assumptions

The analysis in this paper rests on the following assumptions:

1. SCISSORS is given molecular similarity kernel values, not Tanimotos, to analyze. While the conversion from Tanimoto to inner product will introduce distortion (particularly if different molecules x and y have very different values of κ(x, x) and κ(y, y) for similarity kernel κ), we will not consider this distortion here.
2. It is assumed that the similarity kernel κ is symmetric positive semidefinite (SPSD). Similarity kernels that are not SPSD are not Mercer kernels, and some proofs will fail in the presence of negative kernel eigenvalues. However, given non-SPSD κ, the results of this paper can still be applied to a modified kernel κ′, the nearest SPSD approximation to κ. If κ is symmetric but indefinite, then certain divergence terms can be easily calculated between the kernel matrices K and K′ induced by κ and κ′:
- ||K − K′||₂ = absolute value of the negative eigenvalue with largest magnitude
- ${| | K - K^{'} | |}_{F} = \sqrt{\sum λ_{< 0}^{2}}$ , where λ_<0 are the negative eigenvalues
3. It is assumed that kernel values are exactly computable. In particular, the case in which kernel values themselves are subject to noise or inexactitude is not considered here.
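The divergence terms in the SPSD assumption can be illustrated with a short sketch. It assumes that the nearest SPSD approximation is obtained by zeroing the negative eigenvalues of a symmetric kernel matrix; the function name and the random indefinite test matrix are hypothetical.

```python
import numpy as np

def nearest_spsd(K):
    """Zero out the negative eigenvalues of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(K)
    clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * clipped) @ eigvecs.T, eigvals

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 30))
K = (A + A.T) / 2.0                       # symmetric but indefinite test matrix
K_spsd, eigvals = nearest_spsd(K)

negative = eigvals[eigvals < 0]           # the eigenvalues that were removed
two_norm_gap = np.linalg.norm(K - K_spsd, 2)
frobenius_gap = np.linalg.norm(K - K_spsd, "fro")
```

Here `two_norm_gap` matches the magnitude of the most negative eigenvalue, and `frobenius_gap` matches the square root of the sum of squared negative eigenvalues.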

Under these assumptions, it is possible to bound the additional error made by SCISSORS in choosing a small random basis rather than using the eigendecomposition of the full kernel matrix over the entire library. Two different types of bounds will be shown in this paper, arising from reductions to two different kernel methods: kernel principal components analysis, and the rank-k Nyström approximation.

Reduction of SCISSORS to Kernel PCA

Overview of Kernel PCA

Kernel principal components analysis⁶,⁷ is a generalization of traditional principal components analysis from the data space to a feature space defined by a Mercer kernel function κ. Given a sample of N data points, kernel PCA computes up to N directions of maximum variance of the data, in the kernel’s feature space. Points can then be projected into this N-dimensional subspace by a projection of their kernel values against the original (training) data points; thus, kernel PCA can be considered to perform a metric embedding of data points into a subspace of the feature space defined by a given kernel.

Similar to traditional (linear) PCA, kernel PCA can be preceded by a centering step, in which the data are centered in feature space; this ensures that the data mean is not reflected in the recovered coordinates. However, the uncentered case has relevance to SCISSORS, so we now proceed to derive the kernel PCA algorithm without data centering (following the approach of Schölkopf et al.⁶).

Derivation of kernel PCA

Consider a set of data points x_i, i ∈ [1, ···, ℓ], and a Mercer kernel κ(x, y), defined by κ(x, y) = 〈Φ(x), Φ(y)〉 for some feature-space projection Φ. The feature covariance matrix C̄ is:

\bar{C} = \frac{1}{ℓ} \sum_{j = 1}^{ℓ} Φ (x_{j}) Φ (x_{j})^{T}

Let the eigenvalues and eigenvectors of C̄ be named λ_k and V_k respectively, such that λ_k V_k = C̄V_k for all k. All such V_k must lie in the span of Φ(x₁) ··· Φ(x_ℓ). Thus, dropping the index for a single eigenpair (λ, V), the following system is equivalent:

λ (Φ (x_{k}) \cdot V) = (Φ (x_{k}) \cdot \bar{C} V) \forall k

and there exist a₁ ··· a_ℓ such that

V = \sum_{i = 1}^{ℓ} a_{i} Φ (x_{i})

Defining matrix K_{i j} = 〈Φ(x_i), Φ(x_j)〉 and vector α = [a₁ ··· a_ℓ]^T we get ℓλKα = K²α, so we solve the eigenvalue problem ℓλα = Kα. Absorbing the factor ℓ into the eigenvalue, solutions λ_k, α^k correspond to eigenvalues and eigenvectors of the kernel matrix.

We normalize the resulting solutions by requiring that the feature-space eigenvectors (V_k) be unit magnitude. This implies:

\sum_{i = 1}^{ℓ} \sum_{j = 1}^{ℓ} a_{i}^{k} a_{j}^{k} 〈 Φ (x_{i}), Φ (x_{j}) 〉 = 〈 α^{k}, K α^{k} 〉 = λ_{k} 〈 α^{k}, α^{k} 〉 = 1

We can compute the projection of a new data point x onto the feature-space correlation matrix eigenvectors V_k by:

〈 V_{k}, Φ (x) 〉 = \sum_{i = 1}^{ℓ} a_{i}^{k} 〈 Φ (x_{i}), Φ (x) 〉

So for d eigenvectors, the projected KPCA coordinate vector w is:

w = KPCA {x} = [\begin{matrix} \sum_{i} a_{i}^{1} 〈 Φ (x_{i}), Φ (x) 〉 \\ \sum_{i} a_{i}^{2} 〈 Φ (x_{i}), Φ (x) 〉 \\ ⋮ \\ \sum_{i} a_{i}^{d} 〈 Φ (x_{i}), Φ (x) 〉 \end{matrix}]

Equivalently:

\begin{array}{l} L = {[〈 Φ (x_{1}), Φ (x) 〉, \dots, 〈 Φ (x_{ℓ}), Φ (x) 〉]}^{T} \\ w = KPCA {x} = {[〈 α^{1}, L 〉, \dots, 〈 α^{d}, L 〉]}^{T} \end{array}
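The derivation can be checked with a minimal sketch of uncentered kernel PCA. To make the feature-space eigenvectors inspectable, it assumes a linear kernel with an explicit feature map (the rows of a hypothetical matrix `X`); with a general Mercer kernel only the kernel-only route would be available.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(25, 6))          # rows are Phi(x_i); hypothetical explicit features
x_new = rng.normal(size=6)            # Phi(x) for a new point
K = X @ X.T                           # K_ij = <Phi(x_i), Phi(x_j)>

eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]     # descending eigenvalue order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d = 4
# Coefficient vectors alpha^k, scaled so that lambda_k <alpha^k, alpha^k> = 1.
alphas = eigvecs[:, :d] / np.sqrt(eigvals[:d])

# Kernel-only projection: w_k = <alpha^k, L>, with L_i = <Phi(x_i), Phi(x)>.
L = X @ x_new
w_kernel = alphas.T @ L

# Explicit check: V_k = sum_i a_i^k Phi(x_i), then w_k = <V_k, Phi(x)>.
V_feature = X.T @ alphas              # columns are feature-space eigenvectors
w_explicit = V_feature.T @ x_new
```

The columns of `V_feature` come out orthonormal, confirming the normalization condition, and both routes give the same projected coordinates.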

Reduction Proof

We now demonstrate that SCISSORS is equivalent to kernel PCA performed without data centering.

As proven in Lemma 1, the SCISSORS vector w corresponding to a library molecule is defined by weighted inner products between the eigenvectors of the kernel matrix and the library-versus-basis inner product vector L. Define new vectors $V_{i}^{'} = λ_{i}^{- 1 / 2} V_{i}$ . Recall that the kernel matrix and vector L are already identical between methods, and both $V_{i}^{'}$ and α^i are defined to be eigenvectors of the kernel matrix. To prove equivalence, all that is left to prove is that the SCISSORS projection vectors $V_{i}^{'}$ have the same normalization as the KPCA α^i; KPCA requires λ_k 〈α^k, α^k〉 = 1.

Proof

We hypothesize that $V_{k}^{'} = α^{k}$ . Then:

\begin{array}{l} λ_{k} 〈 V_{k}^{'}, V_{k}^{'} 〉 = λ_{k} 〈 λ_{k}^{- 1 / 2} V_{k}, λ_{k}^{- 1 / 2} V_{k} 〉 \\ = λ_{k} λ_{k}^{- 1} 〈 V_{k}, V_{k} 〉 = 1 \end{array}
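A small numerical check of this normalization, on synthetic data, is sketched below. It also verifies a consequence of the reduction: projecting a basis molecule through its own kernel column reproduces its row of the basis embedding B = VD^{1/2}. The random coordinates and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 5))                   # hypothetical hidden coordinates
K = X @ X.T                                    # basis kernel matrix (rank 5)

eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]              # descending eigenvalue order
lam, V = eigvals[order][:5], eigvecs[:, order][:, :5]

V_prime = V / np.sqrt(lam)                     # V'_k = lambda_k^{-1/2} V_k
normalization = lam * np.sum(V_prime * V_prime, axis=0)   # lambda_k <V'_k, V'_k>

B = V * np.sqrt(lam)                           # basis embedding, B = V D^{1/2}
W = K @ V_prime                                # row j: basis molecule j projected via its kernel column
```

Every entry of `normalization` equals one up to round-off, and `W` coincides with `B`.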

Reduction of SCISSORS to the Nyström Rank-k Approximation

Overview of the Nyström Method

In many large-scale machine learning methods, the computation and eigendecomposition of very large kernel matrices is a bottleneck, as the time complexity of eigendecomposition scales as O(N³). Williams and Seeger introduced a method, based on the Nyström approximation from integral equation theory, that computes a low-rank approximation to a large kernel matrix by estimating approximate eigenvectors for the entire matrix from a random sample of a small number of points.⁸ Precisely, using notation from Drineas and Mahoney,⁹ given an n × n kernel matrix K, a desired rank k, and a number of basis elements ℓ, the Nyström approximation computes K̃_k, a rank-k approximation to K, by the following procedure:

Algorithm Sketch 1 (Nyström approximation)

Given a kernel matrix K ∈ ℝ^{n× n}, choose ℓ columns (equivalently, ℓ basis/landmark input points) [b₁, b₂, ···, b_ℓ] to obtain matrices C and W:

C = [\begin{matrix} K_{1 b_{1}} & K_{1 b_{2}} & \dots & K_{1 b_{ℓ}} \\ K_{2 b_{1}} & K_{2 b_{2}} & \dots & K_{2 b_{ℓ}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ K_{n b_{1}} & K_{n b_{2}} & \dots & K_{n b_{ℓ}} \end{matrix}]

W = [\begin{matrix} K_{b_{1} b_{1}} & K_{b_{1} b_{2}} & \dots & K_{b_{1} b_{ℓ}} \\ K_{b_{2} b_{1}} & K_{b_{2} b_{2}} & \dots & K_{b_{2} b_{ℓ}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ K_{b_{ℓ} b_{1}} & K_{b_{ℓ} b_{2}} & \dots & K_{b_{ℓ} b_{ℓ}} \end{matrix}]

Let W_k be the best rank-k approximation to matrix W and $W_{k}^{+}$ be the Moore-Penrose pseudoinverse of W_k. Then the rank-k Nyström approximation to matrix K is defined by $\tilde{K}_{k} = C W_{k}^{+} C^{T}$ .
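Algorithm Sketch 1 translates into a few lines of NumPy. The following is a minimal sketch under stated assumptions: the function name `nystrom_rank_k` is hypothetical, the k retained eigenvalues of W are assumed strictly positive, and the test kernel is a synthetic rank-5 Gram matrix.

```python
import numpy as np

def nystrom_rank_k(K, landmarks, k):
    """Rank-k Nystrom approximation C W_k^+ C^T from the chosen landmark columns."""
    C = K[:, landmarks]                           # n x l block of sampled columns
    W = K[np.ix_(landmarks, landmarks)]           # l x l landmark-vs-landmark block
    eigvals, eigvecs = np.linalg.eigh(W)
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest eigenvalues
    lam, V = eigvals[order], eigvecs[:, order]    # assumed strictly positive
    W_k_pinv = (V / lam) @ V.T                    # pseudoinverse of the best rank-k approximation
    return C @ W_k_pinv @ C.T

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))                     # hypothetical data with a rank-5 kernel
K = X @ X.T
landmarks = rng.choice(200, size=20, replace=False)

K_exact = nystrom_rank_k(K, landmarks, k=5)
K_coarse = nystrom_rank_k(K, landmarks, k=2)
rel_err_exact = np.linalg.norm(K - K_exact, "fro") / np.linalg.norm(K, "fro")
rel_err_coarse = np.linalg.norm(K - K_coarse, "fro") / np.linalg.norm(K, "fro")
```

When k matches the true rank of the kernel and the landmarks span the feature space, the approximation is exact up to round-off; a smaller k leaves a visible residual.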

Preliminaries

Consider a SCISSORS computation of full pairwise similarity over some large set of molecules ℳ. Partition this set, by random selection without replacement, into a basis set ℬ and a library set ℒ. Then, the matrix W in Algorithm Sketch 1 corresponds to the SCISSORS basis inner-product matrix on ℬ; similarly, C is an aggregation of transposed library-vs-basis inner-product vectors. To prove the equivalence of SCISSORS and the Nyström method, we will demonstrate that the inner-product matrix computed by the SCISSORS-approximated vectors is identical to that computed by the Nyström method. It is sufficient to show (by Lemma 1) that $C W_{k}^{+} C^{T}$ , the Nyström-approximated Gram matrix, factorizes as $S_{k} S_{k}^{T}$ where:

S_{k}^{T} = D_{k}^{- 1 / 2} [\begin{matrix} V_{1}^{T} \\ V_{2}^{T} \\ ⋮ \\ V_{k}^{T} \end{matrix}] C^{T}

(3)

S_k is the matrix with library (and basis) vectors along the rows, so $S_{k} S_{k}^{T}$ is the SCISSORS-approximated Gram matrix. The following lemma is helpful for the proof. Proof of the lemma is provided in the Supplemental Information.

Lemma 2 (The pseudoinverse of W_k)

$W_{k}^{+} = \bar{V} D_{k}^{- 1} \bar{V}^{T}$ , where V̄ = [V₁V₂ ··· V_k], the matrix formed from the first k columns of the basis matrix eigenvectors, and $D_{k}^{- 1} = diag [λ_{1}^{- 1}, λ_{2}^{- 1}, \dots, λ_{k}^{- 1}]$ , the diagonal matrix of the reciprocals of the first k eigenvalues of the basis matrix.

Final Reduction

We must show that $C W_{k}^{+} C^{T}$ is equal to $S_{k} S_{k}^{T}$ where $S_{k}^{T} = D_{k}^{- 1 / 2} \bar{V}^{T} C^{T}$ .

Proof

\begin{array}{lll} S_{k} S_{k}^{T} & = C \bar{V} D_{k}^{- 1 / 2} D_{k}^{- 1 / 2} \bar{V}^{T} C^{T} & \text{by definition of } S_{k}^{T} \\ & = C (\bar{V} D_{k}^{- 1} \bar{V}^{T}) C^{T} & \\ & = C W_{k}^{+} C^{T} & \text{by Lemma 2} \end{array}
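The identity can also be confirmed numerically. In this sketch the Nyström side is computed independently with `np.linalg.pinv` on an explicitly formed W_k, while the SCISSORS side uses the vectors S_k = C V̄ D_k^{−1/2}; the synthetic data, the sizes, and the explicit `rcond` cutoff are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 12))                    # hypothetical molecules in a hidden feature space
K = X @ X.T
basis = rng.choice(120, size=30, replace=False)   # random basis set, sampled without replacement
k = 6

C = K[:, basis]                                   # all-vs-basis inner products
W = K[np.ix_(basis, basis)]                       # basis-vs-basis inner products
eigvals, eigvecs = np.linalg.eigh(W)
order = np.argsort(eigvals)[::-1][:k]             # k largest eigenvalues
lam, V_bar = eigvals[order], eigvecs[:, order]

# SCISSORS vectors for every molecule: rows of S_k = C V_bar D_k^{-1/2}.
S_k = C @ (V_bar / np.sqrt(lam))
gram_scissors = S_k @ S_k.T

# Nystrom route: explicit best rank-k approximation of W, then its pseudoinverse.
W_k = (V_bar * lam) @ V_bar.T
gram_nystrom = C @ np.linalg.pinv(W_k, rcond=1e-10) @ C.T
```

The two Gram matrices agree to floating-point tolerance, as the proof requires.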

The expected error in individual SCISSORS inner products is bounded with high probability

Statement of the theorem

Theorem 1 (Bounded expected inner product error)

Given a chemical similarity kernel κ defined over pairs of molecules from some distribution D, such that κ(x, x) < R² for some positive real constant R for all x ∈ D. Construct a SCISSORS basis set from a random sample S of ℓ molecules drawn uniformly at random from D. Denote by $κ_{k}^{S}$ the SCISSORS-approximated kernel of k dimensions from basis set S. Then, with probability at least (1 − δ)², the expected error in SCISSORS approximation, over pairs of independently-chosen molecules x, y ∈ D, is bounded:

\begin{array}{l} 0 \leq E [κ (x, y) - κ_{k}^{S} (x, y)] \leq [\min_{1 \leq d \leq k} (\frac{1}{ℓ} \hat{λ}^{> d} (S) + \frac{1 + \sqrt{d}}{\sqrt{ℓ}} \sqrt{\frac{2}{ℓ} \sum_{i = 1}^{ℓ} κ (s_{i}, s_{i})^{2}}) \\ + R^{2} (\frac{1}{4} \sqrt{\frac{18}{ℓ} \ln (\frac{2 ℓ}{δ})})] \end{array}

(4)

Where s_i are the basis molecules and λ̂^{>d}(S) is the sum of all but the d largest eigenvalues of the basis matrix (for d = k, exactly the eigenvalues not used in SCISSORS):

\hat{λ}^{> d} (S) = \sum_{i = d + 1}^{ℓ} \hat{λ}_{i}
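To make the bound concrete, the right-hand side of Equation 4 can be evaluated directly from a basis kernel matrix. The sketch below transcribes the expression as written, taking λ̂^{>d}(S) to be the sum of all but the d largest basis eigenvalues (an assumption of this sketch, consistent with the definitions used for Theorem 2). The function name, the unit-norm synthetic data, and the chosen δ are hypothetical.

```python
import numpy as np

def theorem1_bound(K_basis, k, R2, delta):
    """Evaluate the right-hand side of Equation 4 for a given basis kernel matrix."""
    ell = K_basis.shape[0]
    eigvals = np.sort(np.linalg.eigvalsh(K_basis))[::-1]       # descending order
    diag_term = np.sqrt(2.0 / ell * np.sum(np.diag(K_basis) ** 2))
    candidates = []
    for d in range(1, k + 1):
        tail = np.sum(eigvals[d:])                              # eigenvalues beyond the d largest
        candidates.append(tail / ell + (1.0 + np.sqrt(d)) / np.sqrt(ell) * diag_term)
    confidence_term = R2 * 0.25 * np.sqrt(18.0 / ell * np.log(2.0 * ell / delta))
    return min(candidates) + confidence_term

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # kappa(x, x) = 1 for every synthetic molecule
K_basis = X @ X.T
bound_k2 = theorem1_bound(K_basis, k=2, R2=1.0 + 1e-9, delta=0.05)
bound_k8 = theorem1_bound(K_basis, k=8, R2=1.0 + 1e-9, delta=0.05)
```

Because the minimum runs over 1 ≤ d ≤ k, increasing the embedding dimension k can only keep the bound the same or lower it.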

Proof Overview

The proof of Theorem 1 relies on a bound on the generalization error of kernel PCA projections due to Shawe-Taylor.¹⁰ This theorem bounds the expected residual from projecting new data onto a sampled kernel PCA basis; we extend this proof to bound the expected error in inner products from projecting two points onto a kernel PCA basis. Then, the translation to SCISSORS follows trivially from the reduction of SCISSORS to kernel PCA. Because the full proof is lengthy, it has been included in the Supplemental Information; this section presents a sketch.

The proof sketch relies on the following definitions from the Shawe-Taylor work:¹⁰

V̂_k is the space spanned by the first k eigenvectors of the sample correlation matrix of a sample of vectors S; $\hat{V}_{k}^{\perp}$ is the orthogonal complement to this space.
λ_k is the kth true eigenvalue of the kernel operator κ, computed over the entire distribution generating our data.
λ̂_k is the kth empirical eigenvalue (i.e., the kth eigenvalue, in descending order of value, of the kernel matrix on S).
λ^{>k} is the sum Σ_{i>k} λ_i, and similarly for λ̂^{>k}.
The residual $P_{\hat{V}_{k}}^{\perp} (x)$ is the projection of x onto the space $\hat{V}_{k}^{\perp}$ .

We make use of the following theorem:

Theorem 2 (Theorem 1 from¹⁰)

If we perform PCA in the feature space defined by kernel κ, then over random samples of points S s.t. |S| = ℓ (ℓ-samples), for all 1 ≤ k ≤ ℓ, if we project new data onto the space V̂_k, the expected squared residual is bounded by the following, with probability greater than 1 − δ:

\begin{array}{l} λ^{> k} \leq E [{‖ P_{\hat{V}_{k}}^{\perp} (Φ (x)) ‖}^{2}] \\ \leq \min_{1 \leq d \leq k} [\frac{1}{ℓ} \hat{λ}^{> d} (S) + \frac{1 + \sqrt{d}}{\sqrt{ℓ}} \sqrt{\frac{2}{ℓ} \sum_{i = 1}^{ℓ} κ (x_{i}, x_{i})^{2}}] + R^{2} \sqrt{\frac{18}{ℓ} \ln (\frac{2 ℓ}{δ})} \end{array}

(5)

Where the support of the distribution is in a ball of radius R in feature space.

Using Theorem 2, it is possible to compute a bound on the projection error for each of the two points. The proof then bounds the variance of the resulting inner product error, and uses this to bound the overall error.

The error in SCISSORS-approximated Gram matrices is bounded in 2-norm, Frobenius norm, and RMS deviation

Statement of Theorems

Consider a chemical similarity kernel κ and a set of n input molecules drawn from some probability distribution such that κ(x, x) < R² for all molecules x. Let the true kernel matrix be denoted K and the best possible rank-k approximation to K be denoted K_k. Compute a SCISSORS-approximated kernel matrix K̃ based on a size-ℓ uniform random sample of these molecules and a k-dimensional vector expansion. Then the following three theorems hold:

Theorem 3 (Bounded error 2-norm)

With probability at least 1 − δ, the error in the SCISSORS kernel matrix is worse than the lowest possible error from a rank-k approximated kernel matrix by at most a bounded amount in 2-norm:

{| | K - \tilde{K} | |}_{2} \leq {| | K - K_{k} | |}_{2} + \frac{2 n}{\sqrt{ℓ}} R^{2} [1 + 2 \sqrt{\frac{{(n - ℓ)}^{2}}{(n - 1 / 2) (n - ℓ - 1 / 2)} log \frac{1}{δ}}]

Theorem 4 (Bounded error Frobenius norm)

With probability at least 1 − δ, the error in the SCISSORS kernel matrix is worse than the lowest possible error from a rank-k approximated kernel matrix by at most a bounded amount in Frobenius norm:

{| | K - \tilde{K} | |}_{F} \leq {| | K - K_{k} | |}_{F} + {[\frac{64 k}{ℓ}]}^{1 / 4} n R^{2} {[1 + 2 \sqrt{\frac{{(n - ℓ)}^{2}}{(n - 1 / 2) (n - ℓ - 1 / 2)} log \frac{1}{δ}}]}^{1 / 2}

Theorem 5 (Bounded RMS error)

With probability at least 1 − δ, the elementwise root-mean-square (RMS) error in the SCISSORS kernel matrix is worse than the lowest possible RMS error from a rank-k approximated kernel matrix by at most a bounded amount:

RMS {K - \tilde{K}} \leq RMS {K - K_{k}} + {[\frac{64 k}{ℓ}]}^{1 / 4} R^{2} {[1 + 2 \sqrt{\frac{{(n - ℓ)}^{2}}{(n - 1 / 2) (n - ℓ - 1 / 2)} log \frac{1}{δ}}]}^{1 / 2}
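The additive terms in Theorems 3 to 5 depend only on n, ℓ, k, R² and δ, so they can be tabulated before any kernel values are computed. The following sketch transcribes them; the function names are hypothetical, and the example setting (library and basis sizes quoted in the Introduction, with R² = 1 and δ = 0.05) is an assumption for illustration.

```python
import math

def sampling_penalty(n, ell, delta):
    """Bracketed factor 1 + 2 sqrt(...) shared by Theorems 3 to 5."""
    ratio = (n - ell) ** 2 / ((n - 0.5) * (n - ell - 0.5))
    return 1.0 + 2.0 * math.sqrt(ratio * math.log(1.0 / delta))

def added_two_norm(n, ell, R2, delta):
    """Additive term of Theorem 3."""
    return 2.0 * n / math.sqrt(ell) * R2 * sampling_penalty(n, ell, delta)

def added_frobenius(n, ell, k, R2, delta):
    """Additive term of Theorem 4."""
    return (64.0 * k / ell) ** 0.25 * n * R2 * math.sqrt(sampling_penalty(n, ell, delta))

def added_rms(n, ell, k, R2, delta):
    """Additive term of Theorem 5: the Frobenius term divided by n."""
    return added_frobenius(n, ell, k, R2, delta) / n

# Hypothetical setting: 57,000 molecules, 1,000 basis molecules, 100 dimensions.
n, ell, k, delta = 57_000, 1_000, 100, 0.05
two_norm_term = added_two_norm(n, ell, 1.0, delta)
frobenius_term = added_frobenius(n, ell, k, 1.0, delta)
rms_term = added_rms(n, ell, k, 1.0, delta)
rms_term_larger_basis = added_rms(n, 4_000, k, 1.0, delta)
```

Enlarging the basis set shrinks the added RMS term, but only as the fourth root of k/ℓ, which is one reason the worst-case bounds tighten slowly.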

Proof Overview

The proofs of Theorems 3, 4, and 5 rely on the following theorem, due to Talwalkar,¹¹ which bounds the error of the rank-k Nyström approximation of a Gram matrix:

Theorem 6 (Theorem 5.2 from¹¹)

Let K̃ denote the rank-k Nyström approximation of an n × n Gram matrix K based on ℓ columns sampled uniformly at random without replacement from K, and K_k the best rank-k approximation of K. Then, with probability at least 1 − δ, the following inequalities hold for any sample of size ℓ:

\begin{matrix} {| | K - \tilde{K} | |}_{2} \leq {| | K - K_{k} | |}_{2} + \frac{2 n}{\sqrt{ℓ}} K_{\max} [1 + \sqrt{\frac{n - ℓ}{n - 1 / 2} \frac{1}{β (ℓ, n)} log \frac{1}{δ}} \frac{d_{\max}^{K}}{K_{\max}^{1 / 2}}] \\ {| | K - \tilde{K} | |}_{F} \leq {| | K - K_{k} | |}_{F} + {[\frac{64 k}{ℓ}]}^{1 / 4} n K_{\max} {[1 + \sqrt{\frac{n - ℓ}{n - 1 / 2} \frac{1}{β (ℓ, n)} log \frac{1}{δ}} \frac{d_{\max}^{K}}{K_{\max}^{1 / 2}}]}^{1 / 2} \end{matrix}

Where:

K_max = max_i K_ii
$d_{\max}^{K}$ is the maximum distance implied in K: $d_{\max}^{K} = \max_{i, j} \sqrt{K_{i i} + K_{j j} - 2 K_{i j}}$
β(ℓ, n) = 1 − (2max{ℓ, n − ℓ})⁻¹.

For SCISSORS, we are particularly interested in the case in which ℓ ≪ n, so $β (ℓ, n) = 1 - \frac{1}{2 n - 2 ℓ}$ and $1 / β (ℓ, n) = \frac{n - ℓ}{n - ℓ - 1 / 2}$ .

Proof of Theorems 3, 4, and 5

Consider a kernel κ and a distribution of input vectors whose image in the feature space implied by κ is D, with the support of D contained within a ball of radius R in feature space. Then K_max in the above equations is bounded above by R² and $d_{\max}^{K} \leq 2 R$ . Note that this boundedness assumption holds for any finite sample of vectors from D, as we can construct an empirical distribution of vectors from the sample, which will be guaranteed to be of bounded radius.
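The substitution into Theorem 6 can be written out as a short worked step. It uses the Cauchy-Schwarz bound |K_ij| ≤ K_max, which holds for any SPSD kernel matrix, together with the ℓ ≪ n form of β(ℓ, n); all symbols are those of Theorem 6.

```latex
\begin{aligned}
(d_{\max}^{K})^{2} &= \max_{i,j} \left( K_{ii} + K_{jj} - 2 K_{ij} \right) \leq 4 K_{\max}
  \quad\Longrightarrow\quad \frac{d_{\max}^{K}}{K_{\max}^{1/2}} \leq 2, \\
\sqrt{\frac{n - \ell}{n - 1/2} \, \frac{1}{\beta(\ell, n)} \, \log \frac{1}{\delta}} \;
  \frac{d_{\max}^{K}}{K_{\max}^{1/2}}
  &\leq 2 \sqrt{\frac{(n - \ell)^{2}}{(n - 1/2)(n - \ell - 1/2)} \log \frac{1}{\delta}} .
\end{aligned}
```

Replacing the remaining leading factor K_max by its upper bound R² then yields the bracketed terms in the 2-norm and Frobenius-norm bounds.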

Theorems 3 and 4 immediately follow from Theorem 6 by applying the reduction of SCISSORS to the Nyström method, the definitions of K_max and $d_{\max}^{K}$ , and the assumption above that ℓ ≪ n. Theorem 5 requires one additional step:

Lemma 3

Given an n × n matrix M, the root-mean-square value of each element of M, RMS{M} is related to the Frobenius norm of M, ||M||_F by the relationship:

RMS {M} = \frac{1}{n} {| | M | |}_{F}

Proof

\begin{array}{l} {| | M | |}_{F} = \sqrt{\sum_{i, j} M_{i j}^{2}} \\ RMS {M} = \sqrt{\frac{1}{n^{2}} \sum_{i, j} M_{i j}^{2}} = \frac{1}{n} \sqrt{\sum_{i, j} M_{i j}^{2}} = \frac{1}{n} {| | M | |}_{F} \end{array}

Then Theorem 5 follows by multiplying each term of Theorem 4 by 1/n.
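Lemma 3 is easy to confirm numerically; the sketch below compares the elementwise RMS of an arbitrary random square matrix (a hypothetical test input) with its Frobenius norm divided by n.

```python
import numpy as np

rng = np.random.default_rng(9)
M = rng.normal(size=(64, 64))                 # arbitrary square test matrix
n = M.shape[0]
rms = np.sqrt(np.mean(M ** 2))                # elementwise root-mean-square
frobenius = np.linalg.norm(M, "fro")          # square root of the sum of squared entries
```

The two quantities `rms` and `frobenius / n` agree to machine precision.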

Discussion

Reduction to existing kernel methods makes it possible to prove rigorous probabilistic bounds on the approximation error made by SCISSORS under fairly mild restrictions on the input molecule distribution. However, because very few assumptions are made about the input distribution, the resulting bounds end up being very loose. For example, consider the added RMS error from basis-sampling (Theorem 5) under conditions similar to those in Figure 1, if we were to approximate 50,000 molecules in Maybridge rather than just two. Specifically, consider 256 dimensions, 3,000 basis molecules, and a desired confidence of 1 − e⁻³ ≈ 95% (n = 50,000, k = 256, ℓ = 3,000, δ = e⁻³):

{[\frac{64·256}{3000}]}^{1 / 4} K_{\max} {[1 + 2 \sqrt{\frac{{(50000 - 3000)}^{2}}{(50000 - 1 / 2) (50000 - 3000 - 1 / 2)} log \frac{1}{e^{- 3}}}]}^{1 / 2} \approx 3.2 K_{\max}

So with 95% confidence, the RMS kernel error added by basis sampling will be less than 3.2 times the maximum value of the kernel. Looking back at Figure 1 shows that this is clearly a very loose result: 3.2 times the largest kernel value in the source data is an RMS error of 89.6 (3.2 × 28), whereas we achieve much smaller errors on the (randomly chosen) molecules given. However, it is notable that this result holds with no assumptions about the distribution of input molecules, except boundedness in the kernel values. The performance of SCISSORS on real-world data sets is significantly better than this worst-case estimate (see for example, the statistics on the full Maybridge data set in the original SCISSORS paper¹), indicating that the distribution of molecules in the similarity space considered is somehow friendly to sampling-based algorithms.

Conclusion

Sampling algorithms, both kernel PCA-based methods such as SCISSORS and the Raghavendra/Maggiora method⁴ and non-PCA-based diversity selection and clustering methods, are widespread in chemical informatics. In this paper we have provided theoretical performance guarantees on the approximation error arising from dataset sampling and rank-reduction of chemical kernels. Our results relate chemical dimensionality reduction algorithms to well-known methods in machine learning. In particular, the fact that the worst-case bounds are significantly looser than the real-world performance of sampling algorithms suggests that in practice, many chemical kernels are representable in few dimensions and that chemical space is well-structured, such that sampling is a viable strategy.

Acknowledgments

ISH and VSP acknowledge support from Simbios (NIH Roadmap GM072970). ISH acknowledges support from an NSF graduate fellowship.

Detailed proofs for Lemmas 1 and 2 and Theorem 1 are included in the Supporting Information. This information is available free of charge via the Internet at http://pubs.acs.org/.

References

1. Haque IS, Pande VS. SCISSORS: A Linear-Algebraical Technique to Rapidly Approximate Chemical Similarities. J Chem Inf Model. 2010;50:1075–1088. doi: 10.1021/ci1000136.
2. Vidal D, Thormann M, Pons M. LINGO, an Efficient Holographic Text Based Method To Calculate Biophysical Properties and Intermolecular Similarities. J Chem Inf Model. 2005;45:386–393. doi: 10.1021/ci0496797.
3. Haque IS, Pande VS, Walters WP. SIML: A Fast SIMD Algorithm for Calculating LINGO Chemical Similarities on GPUs and CPUs. J Chem Inf Model. 2010;50:560–564. doi: 10.1021/ci100011z.
4. Raghavendra AS, Maggiora GM. Molecular Basis Sets — A General Similarity-Based Approach for Representing Chemical Spaces. J Chem Inf Model. 2007;47:1328–1340. doi: 10.1021/ci600552n.
5. Grant JA, Pickup BT. A Gaussian Description of Molecular Shape. J Phys Chem. 1995;99:3503–3510.
6. Schölkopf B, Smola A, Müller K-R. Technical Report 44. Max-Planck-Institut für biologische Kybernetik; Tuebingen, Germany: 1996. Nonlinear Component Analysis as a Kernel Eigenvalue Problem.
7. Schölkopf B, Smola A, Müller K-R. Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Comput. 1998;10:1299–1319.
8. Williams C, Seeger M. Advances in Neural Information Processing Systems. Vol. 13. MIT Press; Cambridge, MA: 2001. Using the Nyström Method to Speed Up Kernel Machines; pp. 682–688.
9. Drineas P, Mahoney MW. On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. J Mach Learn Res. 2005;6:2153–2175.
10. Shawe-Taylor J, Williams CKI, Cristianini N, Kandola J. On the Eigenspectrum of the Gram Matrix and the Generalization Error of Kernel-PCA. IEEE Trans Info Theory. 2005;51:2510–2522.
11. Talwalkar A. PhD thesis. Courant Institute of Mathematical Sciences, New York University; New York, NY: 2010.
