Flexible Analog Search with Kernel PCA Embedded Molecule Vectors

Stefano Rensi; Russ B Altman

doi:10.1016/j.csbj.2017.03.003

2017 Mar 24;15:320–327. doi: 10.1016/j.csbj.2017.03.003

PMCID: PMC5396859 PMID: 28458783

Abstract

Studying analog series to find structural transformations that enhance the activity and ADME properties of lead compounds is an important part of drug development. Matched molecular pair (MMP) search is a powerful tool for analog analysis that imitates researchers' ability to select pairs of compounds that differ only by small well-defined transformations. Abstraction is a challenge for existing MMP search algorithms, which can result in the omission of relevant, inexact MMPs, and inclusion of irrelevant, contextually dissimilar MMPs. In this work, we present a new method for MMP search that returns approximate results and enables flexible control over abstraction of contextual information. We illustrate the concepts and mechanics of our method with a series of exemplar MMP queries, and then benchmark search accuracy using MMPs found by fragment indexing. We show that we can search for MMPs in a context dependent manner, and accurately approximate context independent fragment index based MMP search over a range of fingerprint and dataset conditions. Our method can be used to search for pairwise correspondences among analog sets and bolster MMP datasets where data is missing or incomplete.

1. Introduction

1.1. Analog Search is Important for Lead Optimization

Successful optimization of lead compounds requires the iterative application of structural modifications that yield favorable changes in target activity profiles and ADMET properties. Traditionally, this process has been the sole domain of medicinal chemistry teams. Researchers assembled analog series data by hand, guided by their knowledge of compounds that had been synthesized and tested within their organization. They would then generate hypotheses using techniques such as Free-Wilson analysis [1], Hansch analysis [2], Topliss schemes [3], and Craig plots [4], and combine data-driven insights with expertise to prioritize compounds for synthesis and testing in each design iteration. Today, the overall process of lead optimization is similar, but the volume of chemical information in corporate and public databases is too large for development teams to process without computational support. This challenge has driven innovation in the searching and indexing of analog sets in chemical libraries.

1.2. MMPs and MMP Search Are Useful Computational Constructs

Matched molecular pairs (MMPs) are a concept in analog analysis that formalizes a particular type of analog relationship: two molecules that differ at a single site by a specific transformation. MMP analysis has been successfully used to study the effects of transformations on ADME properties [5], [6], solubility [7], chemical activities [8], as well as bioisosterism and activity cliffs [9], [10], [11]. A diverse set of supervised and unsupervised approaches to MMP search has been described using 2D fingerprints [12], maximum common substructure alignment [9], [12], [13], SMILES/SMARTS editing [5], [7], [14], [15], and molecule fragmentation [16], [17], though many are difficult to evaluate because the software or detailed descriptions of the underlying algorithms are not in the public domain. The unsupervised algorithm published by Hussain and Rea is currently the most widely used [17]. It efficiently discovers all MMPs in a dataset by exhaustive fragmentation at non-ring single bond sites followed by indexing of cores and substituents into a bipartite data structure.

1.3. MMP Search is Limited by Abstraction

The key limitation of MMP search is abstraction. By definition, MMPs group together pairs of molecules with the exact same transformation. Unfortunately, this excludes many near-MMPs relevant to the analysis. Griffen et al. gave the example of using a methyl sulfone addition for an analysis where the effect of the desired ethyl sulfone addition had not been previously observed [18]. Furthermore, transformation indexing is done in a context independent manner; all scaffolds undergoing a particular transformation are grouped together. We refer to the molecular context as the common substructure and transformation site shared by two analogs. For example, for the methylation of L-Dopa into 3-O-methyl-L-Dopa, the transformation would be the replacement of hydrogen by a methyl group, and the context would be the 3-hydroxyl substituent of L-Dopa. Papadatos et al. demonstrated how much context matters by comparing context specific and global distributions of transformation effects on hERG inhibition [19]. We also note that the issues arising from the exactness of fragment indexing and the looseness of scaffold grouping exacerbate each other. As we narrow the range of contexts to be more specific, we would also expect to find fewer MMPs that match any particular transformation. Thus there is a need for methods of querying MMPs that can return approximate matches and flexibly integrate varying amounts of contextual information.

1.4. We Propose More Flexible MMP Search

In this work, we develop a method for flexible MMP search that allows varying levels of abstraction. Our method represents chemical transformations as vector differences, in a manner similar to Sheridan et al. [12], with the difference that we use dimensionality reduction techniques to embed molecule vectors in continuous space and reformulate MMP search as a supervised query. Our queries return approximate MMPs, and our continuous vectors allow us dynamic control over the level of abstraction at which we search. First we develop our search method and demonstrate its mechanics on an exemplar analog series of nucleic acids and nucleosides. Second, we benchmark MMP recall on datasets of varying size and diversity using embedded molecule vectors derived from several different underlying fingerprint representations. Finally, we summarize our results and highlight some considerations for using our method.

2. Methods

2.1. 2D Fingerprint and MMP Generation

We generated 2D chemical fingerprints and MMPs using RDKit for Python 3.0 [20]. We encoded chemical structures using four types of fingerprints: extended connectivity fingerprints (ECFP_6), atom pair fingerprints (APFP), topological torsion fingerprints (TTFP), and RDKit path fingerprints (RDFP). For each fingerprint type, we used the default settings as inputs, and returned fingerprints as unfolded, sparse vectors. For MMPs we used a max fragment length of 10 atoms, and a max fragment ratio of 0.3. We implemented vector generation in R-3.0.3.

2.2. Molecule Vector Generation

We embedded 2D fingerprints in continuous space using kernel principal components analysis (KPCA), a non-linear dimensionality reduction technique similar to principal components analysis (PCA) and multidimensional scaling [21]. We used Tanimoto similarity (aka Jaccard index) as our kernel function. Vector generation can be broken down into three steps:

(1) Compute the Tanimoto matrix: Compute a kernel matrix of Tanimoto similarities T(X, X) over all pairs of data instances (x_i, x_j) ∈ X × X, where each fingerprint is treated as a set of features.
$T (x_{i}, x_{j}) = \frac{|x_{i} \cap x_{j}|}{|x_{i} \cup x_{j}|}$

(2) Solve the eigenvalue problem: Factor the Tanimoto matrix by solving the eigenvalue problem. Here ϕ(X) denotes the data represented as continuous vectors, and the matrices Q and Λ are the eigenvectors and eigenvalues of the Tanimoto similarity matrix.
$ϕ (X) \cdot ϕ {(X)}^{T} = Q Λ Q^{T} = T (X, X)$
$ϕ (X) = Q Λ^{1 / 2}$

(3) Embed the data in continuous space: Compute the vector of Tanimoto similarities T(x_i, X) between a data instance and the dataset, and multiply it by the eigenvectors and the inverse square root of the eigenvalues to project the data instance into the principal component space.
$ϕ (x_{i}) = T (x_{i}, X) \cdot Q Λ^{- 1 / 2}$

Like classical PCA, the features returned by KPCA are orthogonal and ordered by their explanatory power. Similar considerations for dimensionality reduction also apply. Unlike classical PCA, KPCA is non-linear and dot products of Tanimoto KPCA embedded vectors approximate Tanimoto similarities.
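The three steps above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the R implementation used in this work: fingerprints are assumed to be Python sets of feature identifiers, the function names (`tanimoto_kernel`, `fit_kpca`, `embed`) and the toy fingerprints are hypothetical, and the kernel matrix is left uncentered because the formulas above do not include a centering step.

```python
import numpy as np

def tanimoto_kernel(fps_a, fps_b):
    """Tanimoto similarity between two lists of fingerprints stored as sets."""
    K = np.zeros((len(fps_a), len(fps_b)))
    for i, x in enumerate(fps_a):
        for j, y in enumerate(fps_b):
            union = len(x | y)
            K[i, j] = len(x & y) / union if union else 0.0
    return K

def fit_kpca(fps, tol=1e-10):
    """Step 2: factor T(X, X) = Q Lambda Q^T, keeping eigenvalues above tol."""
    T = tanimoto_kernel(fps, fps)
    eigvals, eigvecs = np.linalg.eigh(T)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > tol
    return eigvals[keep], eigvecs[:, keep]

def embed(new_fps, train_fps, eigvals, eigvecs):
    """Step 3: phi(x) = T(x, X) . Q . Lambda^(-1/2)."""
    return tanimoto_kernel(new_fps, train_fps) @ eigvecs / np.sqrt(eigvals)

# Toy fingerprints (hypothetical feature ids, not real molecules).
fps = [{1, 2, 3}, {2, 3, 4}, {1, 5}, {5, 6, 7}]
eigvals, eigvecs = fit_kpca(fps)
phi = embed(fps, fps, eigvals, eigvecs)
```

For the training set itself, `phi @ phi.T` should reproduce the Tanimoto matrix up to numerical error when all non-negligible components are kept, which is the dot product property noted above.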

2.3. Analog Search

We have broken down our method into four component concepts that build on each other to form approximate MMP search: the analog score, basic search, basic feature selection, and uncoupled feature sets. Our method was implemented using built-in functions in R-3.0.3.

2.4. Analog Score

The analog score is a measure of similarity between relationships. We score the similarity of pairwise relationships {a, b} and {c, d} with the following formula:

$\mathrm{Score} (a : b :: c : d) = \frac{(b - a + c) \cdot d}{\lVert b - a + c \rVert \, \lVert d \rVert}$

We point out that the algebraic structure of the score gives a non-intuitive logical equivalence, a : b ∷ c : d ≡ a : c ∷ b : d, because the query vectors b - a + c and c - a + b are identical.
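As a minimal sketch of the score, assuming molecules are already embedded as NumPy vectors, the formula is a cosine similarity between the query vector b - a + c and the candidate d. The function name `analog_score` and the toy vectors are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def analog_score(a, b, c, d):
    """Cosine similarity between the query vector (b - a + c) and candidate d."""
    query = b - a + c
    denom = np.linalg.norm(query) * np.linalg.norm(d)
    return float(query @ d / denom) if denom > 0 else 0.0

# Toy embedded vectors (hypothetical values).
a = np.array([1.0, 0.0, 0.2])
b = np.array([1.0, 1.0, 0.2])
c = np.array([0.0, 0.1, 1.0])
d = np.array([0.0, 1.1, 1.0])

score = analog_score(a, b, c, d)     # a : b :: c : d
swapped = analog_score(a, c, b, d)   # a : c :: b : d
```

In this toy case d equals b - a + c, so the score is 1 up to rounding, and swapping b and c leaves it unchanged, as the equivalence above predicts.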

2.5. Basic Search

Our search objective is to find the molecules that correctly or approximately complete an MMP as an analogy:

Given molecules {a, b, c}, find d ∈ D such that a : b ∷ c : d

To search a list of candidates (D), we compute analog scores for the input triple (a, b, c) with each molecule (d) in the list, order D by score, and return the top n results.
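The ranking procedure can be sketched as follows, assuming the candidate list D is stored as the rows of a NumPy array of embedded vectors. The function name `basic_search` and the toy data are hypothetical.

```python
import numpy as np

def basic_search(a, b, c, candidates, n=5):
    """Score every candidate row against the query (b - a + c), return the top n."""
    query = b - a + c
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    # Cosine similarity per candidate; zero-norm rows keep a score of 0.
    scores = np.divide(candidates @ query, norms,
                       out=np.zeros(len(candidates)), where=norms > 0)
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]

# Toy input triple and candidate matrix (hypothetical values).
a = np.array([1.0, 0.0, 0.2])
b = np.array([1.0, 1.0, 0.2])
c = np.array([0.0, 0.1, 1.0])
D = np.array([
    [0.0, 1.1, 1.0],   # exact completion of the analogy
    [1.0, 0.0, 0.2],
    [0.5, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])

hits = basic_search(a, b, c, D, n=3)
```

Each entry of `hits` is a (row index, score) pair. The search rank of a known correct molecule, as used in the examples and benchmarks, is its position in the fully ordered list.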

2.6. Basic Feature Selection

The subset of components used to represent the data is an independent parameter in the search. We call this the dimensionality parameter p. While dimensionality reduction is typically done by using the first p < n principal components, it is not the case that the components must be included or excluded in consecutive order (order of decreasing variance). Thus we introduce the more general notation ω ⊆ {1, … , n} to denote an active subset of components used to represent the data, and the indicator function I(ω), where I(ω)_jj = 1 if j ∈ ω else it is 0. For example, if we wish to represent the data using the first three principal components, then ω = {j : j ≤ 3} and I(ω)₁₁ = I(ω)₂₂ = I(ω)₃₃ = 1. The reduced dimension representation of a molecule vector ϕ(x)_ω is given by:

${ϕ (x_{i})}_{ω} = ϕ (x_{i}) \cdot I (ω)$
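A small sketch of the selector follows, assuming a molecule vector stored as a NumPy array. Note that the code uses 0-based component indices, whereas the notation above is 1-based. The name `selector` and the example values are illustrative.

```python
import numpy as np

def selector(omega, n):
    """Diagonal indicator matrix I(omega) for a set of 0-based component indices."""
    diag = np.zeros(n)
    diag[list(omega)] = 1.0
    return np.diag(diag)

# Hypothetical embedded molecule vector with five components.
phi_x = np.array([0.9, -0.4, 0.3, 0.1, -0.05])

first_three = selector(range(3), len(phi_x))    # consecutive components
non_consecutive = selector({0, 3}, len(phi_x))  # arbitrary active subset

reduced = phi_x @ first_three
```

Multiplying by I(ω) zeroes the inactive components rather than dropping them, so reduced vectors built from different active sets keep the same length and can be added together.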

2.7. Uncoupled Feature Sets

To flexibly control the level of abstraction at which we search, we decompose query vectors into two components: (a) the difference of two molecules – the relation, and (b) the remaining summand molecule – the target.

$\underbrace{(b - a + c)}_{\text{query}} = \underbrace{b - a}_{\text{relation}} + \underbrace{c}_{\text{target}}$

Importantly, we select features independently for relation and target. Specifically, we assign each term a feature selector parameter – ω_r and ω_t. For example, we may wish to compute a query with a coarse grained relationship, using only the first three principal components ω_r = {j : j ≤ 3}, and a more detailed representation of the target, using the first 20 components ω_t = {j : j ≤ 20}. We compute the query vector using the following formula:

$(b - a) \cdot I (ω_{r}) + c \cdot I (ω_{t})$
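The uncoupled query can be sketched as below. Because I(ω) is diagonal, multiplying by it is equivalent to an elementwise 0/1 mask, which is what this sketch uses. The names `mask` and `uncoupled_query`, the 0-based indices, and the toy vectors are assumptions made for illustration.

```python
import numpy as np

def mask(omega, n):
    """0/1 vector holding the diagonal of I(omega) for 0-based indices."""
    m = np.zeros(n)
    m[list(omega)] = 1.0
    return m

def uncoupled_query(a, b, c, omega_r, omega_t):
    """(b - a) . I(omega_r) + c . I(omega_t), with independent feature sets."""
    n = len(a)
    return (b - a) * mask(omega_r, n) + c * mask(omega_t, n)

# Toy embedded vectors (hypothetical values).
a = np.array([1.0, 0.0, 0.5, 0.2])
b = np.array([1.0, 1.0, 0.1, 0.2])
c = np.array([0.0, 0.1, 1.0, 0.3])

# Coarse relation (first two components), detailed target (all four).
query = uncoupled_query(a, b, c, omega_r=range(2), omega_t=range(4))
```

Setting both feature sets to the full component range recovers the basic query b - a + c, so basic search is a special case of the uncoupled formulation.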

2.8. Search Examples

We provide a series of four examples, centered on the analog series of nucleic acids and nucleosides, to illustrate the analog score, basic search, basic feature selection, and uncoupled feature sets. Our example dataset consists of 1398 small molecule, FDA approved drugs from DrugBank [22], and 635 biologically important metabolites from the KEGG BRITE ontology [23]. We excluded classes not native to humans from the KEGG BRITE set, such as marine, fungal, and phytochemical compounds. After duplicate and salt removal, we filtered by molecular weight, excluding molecules outside of the range 100–700 to focus on molecules in the “druglike” size range.

For the analog score, we computed scores for all 8 unique analogous nucleobase pair permutations, e.g. {a : g ∷ c : t, a : t ∷ g : c, …}. For basic search, we queried a set of analogous nucleobase/nucleoside MMPs, and returned the top 5 search results as well as the search rank of the correct nucleoside. For basic feature selection, we computed the same set of queries, and tracked the search ranks of all nucleosides as we varied the number of principal components used to represent the data from 3 to 150. For uncoupled feature selection, we independently varied the feature selector parameters ω_r and ω_t, each over the range 4–50, and recorded the search rank of the correct nucleoside. We computed analog scores and basic search using the first 20 principal components.

2.9. Approximate Context Independent MMP Search Benchmarking

We tested the accuracy of approximate MMP recall with the following protocol. For each dataset: (a) find a set of true positive analogous MMPs using fragment indexing, (b) compute completion queries for all unique analogous pairs, and (c) record the search rank of the correct result in each case. We executed this over a set of 72,251 compounds, grouped into 102 activity class datasets, and further consolidated into Easy, Intermediate, and Difficult superclasses on the basis of dataset size and diversity [24], [25]. The number of compounds and mean ECFP_4 Tanimoto coefficient for each superclass reported by Jasial et al. are shown in Table 1 [25]. We downloaded structures from ChEMBL in SMILES format, and removed salts.

Table 1.

Size and diversity of MMP search benchmarking dataset superclasses.

Superclass	N compounds (mean per class)	Mean ECFP_4 T_c
Easy	2967 (135)	0.28
Intermediate	25,175 (504)	0.19
Difficult	47,109 (1570)	0.11


We used the following feature selection strategy to approximate the context independence of fragment indexing MMP search. Given molecule vectors a and b, we set ω_r = {j : a_j b_j < 0}. In other words, we represent the transformations using features for which the molecules that define the transformation differ in sign. For target molecules we used the full rank representation ω_t = {j : λ_j > 10^{-10}}.
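A sketch of this heuristic is given below, assuming embedded vectors and the KPCA eigenvalues are available as NumPy arrays. The function names, the 0-based indices, and the toy values are hypothetical; only the two set definitions come from the text above.

```python
import numpy as np

def sign_mismatch_features(a, b):
    """omega_r = {j : a_j * b_j < 0}: components where a and b differ in sign."""
    return np.flatnonzero(a * b < 0)

def full_rank_features(eigvals, tol=1e-10):
    """omega_t = {j : lambda_j > tol}: components with non-negligible eigenvalue."""
    return np.flatnonzero(eigvals > tol)

def context_independent_query(a, b, c, eigvals, tol=1e-10):
    """Uncoupled query using the sign-mismatch relation set and full-rank target set."""
    r = np.zeros(len(a))
    r[sign_mismatch_features(a, b)] = 1.0
    t = np.zeros(len(a))
    t[full_rank_features(eigvals, tol)] = 1.0
    return (b - a) * r + c * t

# Toy embedded vectors and eigenvalues (hypothetical values).
a = np.array([0.8, -0.3, 0.2, 0.1])
b = np.array([0.7, 0.4, -0.1, 0.1])
c = np.array([0.5, 0.2, 0.3, -0.2])
eigvals = np.array([2.0, 1.0, 0.5, 1e-14])

omega_r = sign_mismatch_features(a, b)
query = context_independent_query(a, b, c, eigvals)
```

Here only components 1 and 2 (0-based) change sign between a and b, so only those contribute to the relation term, while the target keeps every component whose eigenvalue exceeds the tolerance.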

3. Results

3.1. Molecule Vector Generation

Fig. 1 shows a plot of the nucleobases and nucleosides along the first and third principal axes. Analogous nucleobase/nucleoside MMPs have similar spatial orientations.

3.2. Analog Score

The sorted scores of the 8 unique possible analogous nucleobase pair arrangements are shown in Fig. 2 along with the structures of the nucleobases: adenine, guanine, cytosine, and thymine. The top scoring pairwise alignment gives the correct purine/pyrimidine, hydrogen bond donor/acceptor correspondences.

3.3. Basic Search

Fig. 3 shows the results of 4 basic supervised MMP searches. MMP queries across hydrogen bond donor/acceptor contexts (A : A^∗ ∷ G : G^∗) perform better than queries across purine/pyrimidine contexts (A : A^∗ ∷ C : C^∗), which perform better than queries across both contexts (A : A^∗ ∷ T : T^∗).

3.4. Basic Feature Selection

Fig. 4 shows the effect of dimensionality reduction on search. Plots display the search ranks of all four nucleosides as we vary the number of principal components in each search example. All nucleosides are highly ranked at low dimension. Rankings diverge as dimension increases. At high dimension the nucleoside used to construct the query saturates at the top result. Basic feature selection was not able to recover queries across purine/pyrimidine contexts (Fig. 4(c), (d)). We did not observe any meaningful change in rankings above 150 principal components.

3.5. Uncoupled Feature Sets

Fig. 5 shows the search rank of the correct nucleoside for each query as we independently vary the number of principal components used to represent relationships (transformations) and targets (molecules). Computing the relationship/transformation vector at low dimension (4–12), and adding it to a high dimension target/molecule vector (20–25) rescued queries across purine/pyrimidine contexts (Fig. 5(c), (d)), and did not significantly diminish performance on queries across hydrogen bond donor/acceptor contexts (Fig. 5(a), (b)).

3.6. Approximate Context Independent MMP Search Benchmarking

Fig. 6 shows the distribution of search ranks assigned to the correct result in MMP queries for continuous vectors derived from four 2D fingerprint types: APFPs, ECFPs, RDFPs, and TTFPs. Note that we report the absolute rather than the percentile rank. We computed 44,371 (mean = 2016/class), 1,011,564 (mean = 20,231/class), and 4,958,361 (mean = 165,288/class) unique MMP queries for activity class datasets in the Easy, Intermediate, and Difficult superclasses, respectively. The distribution of absolute ranks did not change significantly relative to the diversity and size of the datasets in each superclass.

4. Discussion

4.1. Analog Score Is a Way to Test Similarity of Relationships

The analog score measures the similarity of chemical transformations. Our use of vector differences is similar in concept to the T-Analyze program [12] with the difference that we have algebraically rearranged the scoring function.

$\underbrace{b - a \approx d - c}_{\text{T-Analyze}} \Longleftrightarrow \underbrace{b - a + c \approx d}_{\text{Analog Score}}$

Our function is conceptually similar to supervised MMP queries where a specified transformation is applied to a template molecule [7], [26]. However, we can run in unsupervised mode where we exhaustively compute all pairwise analogies among a set of molecules; or supervised mode, in which one or more of the molecules are specified in advance and held constant. The results of unsupervised mode show some of the flexibility of vector based MMP search. The analogous transformation A : C ∷ G : T would not be found by fragmentation at acyclic bonds, but here it is an equivalent top result.

4.2. Supervised MMP Search Returns Ordered Lists of Similarly Related Molecules; Context Dependence is a Challenge

MMP search returns lists of near MMPs. In Fig. 3(a), (b), we return closely related, approximate MMP molecules along with the correct nucleoside. We show unsuccessful searches in Fig. 3(c), (d) to demonstrate that transformation vectors encode information about the molecular contexts in which they occur. Sheridan et al. reported a similar effect where transformation vectors clustered together on the basis of context [12]; here it has a confounding effect on searches across purine/pyrimidine contexts. To recover those examples, we would like to operate closer to the context independent manner of fragment indexing schemes.

4.3. Basic Feature Selection Allows Us to Include and Exclude Contextual Information

Fig. 4 shows how feature selection can be used to exclude contextual information, but not without challenges. Adding features in order of variance can be interpreted as moving from a coarse to a fine-grained representation of the data. At low dimension we have excluded a significant amount of contextual information, but retained information about the transformation. The correct nucleosides are highly ranked, but indistinguishable from other nucleosides. This is problematic because we would like to discriminate between nucleosides and other closely related molecules to order search rankings. Conversely, we have enough information at high dimension to resolve nucleosides, but the transformations are now context specific. The problem is that we have two types of entities, transformations and molecules, and we would like to represent them at different levels of abstraction. But this is not permitted by typical modes of feature selection. Fortunately, we can separate the level of abstraction for transformations (relationships) and molecules (targets).

4.4. Uncoupling Relation and Target Feature Sets is the Secret Sauce

Decomposing query vectors into relationship and target terms with separate feature sets gives us the flexibility to represent transformations and molecules at different levels of abstraction. Our interpretation of Fig. 4 suggests that the combination of a low dimensional relationship vector, with a high dimensional target vector would recover queries across purine/pyrimidine contexts. This is verified in Fig. 5.

4.5. Adaptive Feature Selection Approximates Fragment Index MMP search

With our adaptive feature selection heuristic, we are able to accurately approximate fragment index based MMP search (Fig. 6). Recall of fragment indexed MMPs was robust over a range of dataset size and diversity conditions.

4.6. Feature Selection Is the Primary Consideration of Our Work

The key idea in our work is that uncoupled feature sets enable us to represent relationships and targets at different levels of abstraction, and thus achieve a more flexible search capability. However, this raises the question of how one should select feature sets. Our results show our method is quite sensitive to feature selection (Fig. 4, Fig. 5). The question is difficult to answer because it is closely linked to how much contextual information should be used in search computations. This is often unclear because what constitutes an equivalent transformation is highly subjective and application specific. Thus, the feature selection strategy should be dictated by the particular needs of the application.

For the demonstration examples, our objective was to illustrate different aspects of our method. We chose a grid search over consecutive feature sets because of its intuitive interpretation of moving from a coarse to a detailed representation of the data. In our examples, we showed that we were able to access parameter settings that recovered queries of transformations in dissimilar contexts; but the non-smoothness of search rankings in Fig. 4 and Fig. 5 suggests that contextual information is not stored in consecutive feature sets, and that the strategy would not generally suffice for context independent search.

For benchmarking, our objective was to approximate context independence. We used a rough heuristic to identify context from information contained in the query vectors. For query molecules a and b, features that match in sign positively contribute to their similarity score and encode their similar parts (context); and those that do not match negatively contribute and encode their dissimilar parts (transformation).

$\mathrm{sim} (a, b) \propto a \cdot b = \underbrace{\sum_{\text{match}} |a_{j} b_{j}|}_{\text{context}} - \underbrace{\sum_{\text{mismatch}} |a_{j} b_{j}|}_{\text{transformation}}$
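As a small worked example of this decomposition, take the hypothetical three-component vectors a = (0.8, -0.3, 0.2) and b = (0.7, 0.4, -0.1), chosen only for illustration. The first component matches in sign and the other two do not:

$a \cdot b = (0.8)(0.7) + (-0.3)(0.4) + (0.2)(-0.1) = 0.56 - 0.12 - 0.02 = 0.42$

$\sum_{\text{match}} |a_{j} b_{j}| = 0.56, \qquad \sum_{\text{mismatch}} |a_{j} b_{j}| = 0.12 + 0.02 = 0.14, \qquad 0.56 - 0.14 = 0.42$

Under the heuristic, component 1 would be treated as context and components 2 and 3 as the transformation.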

Between full context dependence and independence, there is a range of strategies that can be used to tune searches, including but not limited to reordering transformation features relative to their expected variance and incorporating information about the contextual difference of the target molecule.

4.7. Embedded Vectors Are Only as Good as Underlying Representations

A key consideration is the underlying representation used to encode the molecules. Fig. 6 shows it has a measurable effect on accuracy and robustness of search. We found that properties of the 2D fingerprints, such as the repetition insensitivity of binary ECFPs, were passed on to their continuous vectors. We suspect that the RDKit fingerprint representation performs better because of its property that substructure relationships between molecules correspond to subset relationships between their fingerprints. Fingerprint hyperparameters should also be taken into account. We encoded 2D fingerprints as unfolded sparse vectors because in some cases folding resulted in Tanimoto matrices with negative eigenvalues. This is problematic because kernel matrices are required to be positive semidefinite, and violation of this constraint can cause computations to break down in unpredictable ways.
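One simple way to screen a kernel matrix for this problem is to inspect its smallest eigenvalue before embedding. The sketch below is an assumption about how such a check could be done, not a description of the procedure used in this work; the function names, the tolerance, and both example matrices are hypothetical, and the second matrix is hand-made rather than derived from fingerprints.

```python
import numpy as np

def min_eigenvalue(kernel):
    """Smallest eigenvalue of a kernel matrix, symmetrized to absorb round-off."""
    sym = (kernel + kernel.T) / 2.0
    return float(np.linalg.eigvalsh(sym)[0])   # eigvalsh sorts ascending

def is_psd(kernel, tol=1e-10):
    """True if no eigenvalue is below -tol."""
    return min_eigenvalue(kernel) >= -tol

# A valid 2 x 2 similarity matrix.
good = np.array([[1.0, 0.5],
                 [0.5, 1.0]])

# A symmetric matrix with unit diagonal that is not positive semidefinite.
bad = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.9],
                [0.1, 0.9, 1.0]])
```

If `is_psd` returns False, the inverse square root of the eigenvalues in the embedding step is undefined for the offending components, which is one way the computation can fail.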

4.8. Embedding Technique is Another Hyperparameter

Another key consideration is the technique used to embed the data in continuous space. We prefer Tanimoto KPCA because it returns uncorrelated features whose dot products approximate Tanimoto coefficients; but it is just one of a variety of techniques for embedding molecules in continuous space. In addition to using alternative similarity metrics and distance-based embedding methods, neural network embedded graph convolution fingerprints [27], [28] are a new type of continuous representation that has shown superior QSAR performance, and could be used with our method. Our method is adaptable to any continuous coordinate representation where the following condition is met: similar chemical structure transformations yield similar transformation vectors. This corollary of linear algebraic consistency should be kept in mind because it is not necessarily the case that differences computed using any set of continuous descriptors should correspond to meaningful chemical structure relationships.

5. Conclusion

MMPs are a useful tool to study analog relationships and local QSAR, but current MMP search methods are brittle compared to intuitive notions of what constitutes a matched analog pair. Efficient index based search methods enforce precisely defined context independent transformations that can miss near MMPs relevant to an analysis. Likewise, previous iterations of vector based MMP search enforce strict context dependence and feature set coupling that can fail to group together transformations occurring in different contexts.

We demonstrate a new vector based method for computing approximate MMP queries that allows us to flexibly include or exclude varying degrees of contextual information. We have benchmarked its accuracy for approximate context-independent MMP recall. Given the incompleteness of high confidence assay data in chemical databases, our method can be used to find suitable approximate replacements in cases where the properties of a specific analog found by fragment indexing are uncertain or have not been observed. Our method can also be used to bolster the size of MMP datasets to improve statistical power. Perhaps the most interesting aspect of our work is that kernel embedding can be applied to any symbolic representation that supports similarity computation, opening the prospect of searching and characterizing relationships between non-structural aspects of chemicals such as binding affinity profiles; or even higher order chemical entities such as analog series.

References

1.Kubinyi H. Free Wilson analysis. Theory, applications and its relationship to Hansch analysis. Mol Inform. 1988;7:121–133. [Google Scholar]
2.Hansch C., Fujita T. p-σ-π Analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc. 1964;86:1616–1626. [Google Scholar]
3.Topliss J.G. Utilization of operational schemes for analog synthesis in drug design. J Med Chem. 1972;15:1006–1011. doi: 10.1021/jm00280a002. [DOI] [PubMed] [Google Scholar]
4.Craig P.N. Interdependence between physical parameters and selection of substituent groups for correlation studies. J Med Chem. 1971;14:680–684. doi: 10.1021/jm00290a004. [DOI] [PubMed] [Google Scholar]
5.Gleeson P., Bravi G., Modi S., Lowe D. ADMET rules of thumb II: a comparison of the effects of common substituents on a range of ADMET parameters. Bioorg Med Chem. 2009;17:5906–5919. doi: 10.1016/j.bmc.2009.07.002. [DOI] [PubMed] [Google Scholar]
6.Keefer C.E., Chang G., Kauffman G.W. Extraction of tacit knowledge from large ADME data sets via pairwise analysis. Bioorg Med Chem. 2011;19:3739–3749. doi: 10.1016/j.bmc.2011.05.003. [DOI] [PubMed] [Google Scholar]
7.Leach A.G., Jones H.D., Cosgrove D.A., Kenny P.W., Ruston L., MacFaul P. Matched molecular pairs as a guide in the optimization of pharmaceutical properties; a study of aqueous solubility, plasma protein binding and oral exposure. J Med Chem. 2006;49:6672–6682. doi: 10.1021/jm0605233. [DOI] [PubMed] [Google Scholar]
8.Schönherr H., Cernak T. Profound methyl effects in drug discovery and a call for new CH methylation reactions. Angew Chem Int Ed. 2013;52:12256–12267. doi: 10.1002/anie.201303207. [DOI] [PubMed] [Google Scholar]
9.Sheridan R.P. The most common chemical replacements in drug-like compounds. J Chem Inf Comput Sci. 2002;42:103–108. doi: 10.1021/ci0100806. [DOI] [PubMed] [Google Scholar]
10.Wassermann A.M., Bajorath J.r. Chemical substitutions that introduce activity cliffs across different compound classes and biological targets. J Chem Inf Model. 2010;50:1248–1256. doi: 10.1021/ci1001845. [DOI] [PubMed] [Google Scholar]
11.Hu X., Hu Y., Vogt M., Stumpfe D., Bajorath J.r. MMP-cliffs: systematic identification of activity cliffs on the basis of matched molecular pairs. J Chem Inf Model. 2012;52:1138–1145. doi: 10.1021/ci3001138. [DOI] [PubMed] [Google Scholar]
12.Sheridan R.P., Hunt P., Culberson J.C. Molecular transformations as a way of finding and exploiting consistent local QSAR. J Chem Inf Model. 2006;46:180–192. doi: 10.1021/ci0503208. [DOI] [PubMed] [Google Scholar]
13.Raymond J.W., Watson I.A., Mahoui A. Rationalizing lead optimization by associating quantitative relevance with molecular structure modification. J Chem Inf Model. 2009;49:1952–1962. doi: 10.1021/ci9000426. [DOI] [PubMed] [Google Scholar]
14.Hajduk P.J., Sauer D.R. Statistical analysis of the effects of common chemical substituents on ligand potency. J Med Chem. 2008;51:553–564. doi: 10.1021/jm070838y. [DOI] [PubMed] [Google Scholar]
15.Stewart K.D., Shiroda M., James C.A. Drug Guru: a computer software program for drug design using medicinal chemistry rules. Bioorg Med Chem. 2006;14:7011–7022. doi: 10.1016/j.bmc.2006.06.024. [DOI] [PubMed] [Google Scholar]
16.Haubertin D.Y., Bruneau P. A database of historically-observed chemical replacements. J Chem Inf Model. 2007;47:1294–1302. doi: 10.1021/ci600395u. [DOI] [PubMed] [Google Scholar]
17.Hussain J., Rea C. Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J Chem Inf Model. 2010;50:339–348. doi: 10.1021/ci900450m. [DOI] [PubMed] [Google Scholar]
18.Griffen E., Leach A.G., Robb G.R., Warner D.J. Matched molecular pairs as a medicinal chemistry tool: miniperspective. J Med Chem. 2011;54:7739–7750. doi: 10.1021/jm200452d. [DOI] [PubMed] [Google Scholar]
19.Papadatos G., Alkarouri M., Gillet V.J., Willett P., Kadirkamanathan V., Luscombe C.N. Lead optimization using matched molecular pairs: inclusion of contextual information for enhanced prediction of hERG inhibition, solubility, and lipophilicity. J Chem Inf Model. 2010;50:1872–1886. doi: 10.1021/ci100258p. [DOI] [PubMed] [Google Scholar]
20.Landrum G. RDKit: open-source cheminformatics. http://www.rdkit.org (Online) (Accessed 2006, 3, 2012)
21.Schölkopf B., Smola A., Müller K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 1998;10:1299–1319. [Google Scholar]
22.Wishart D.S., Knox C., Guo A.C., Shrivastava S., Hassanali M., Stothard P. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006;34:D668–D672. doi: 10.1093/nar/gkj067. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Kanehisa M., Goto S., Furumichi M., Tanabe M., Hirakawa M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010;38:D355–D360. doi: 10.1093/nar/gkp896. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Heikamp K., Bajorath J.r. Large-scale similarity search profiling of ChEMBL compound data sets. J Chem Inf Model. 2011;51:1831–1839. doi: 10.1021/ci200199u. [DOI] [PubMed] [Google Scholar]
25.Jasial S., Hu Y., Vogt M., Bajorath J. Activity-relevant similarity values for fingerprints and implications for similarity searching. F1000Res. 2016:5. doi: 10.12688/f1000research.8357.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Lewis M.L., Cucurull-Sanchez L. Structural pairwise comparisons of HLM stability of phenyl derivatives: introduction of the Pfizer metabolism index (PMI) and metabolism-lipophilicity efficiency (MLE) J Comput Aided Mol Des. 2009;23:97–103. doi: 10.1007/s10822-008-9242-3. [DOI] [PubMed] [Google Scholar]
27.Duvenaud D.K., Maclaurin D., Iparraguirre J., Bombarell R., Hirzel T., Aspuru-Guzik A. Vol. 2015. 2015. Convolutional networks on graphs for learning molecular fingerprints; pp. 2224–2232. (Advances in neural information processing systems). [Google Scholar]
28.Kearnes S., McCloskey K., Berndl M., Pande V., Riley P. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des. 2016;30:595–608. doi: 10.1007/s10822-016-9938-8.

