2024 Dec 5;44(1):e202400265. doi: 10.1002/minf.202400265

From High Dimensions to Human Insight: Exploring Dimensionality Reduction for Chemical Space Visualization

Alexey A Orlov, Tagir N Akhmetshin, Dragos Horvath, Gilles Marcou, Alexandre Varnek
PMCID: PMC11733715  PMID: 39633514

Abstract

Dimensionality reduction is an important exploratory data analysis method that allows high‐dimensional data to be represented in a human‐interpretable lower‐dimensional space. It is extensively applied in the analysis of chemical libraries, where chemical structure data, represented as high‐dimensional feature vectors, are transformed into 2D or 3D chemical space maps. In this paper, four commonly used dimensionality reduction techniques, namely Principal Component Analysis (PCA), t‐Distributed Stochastic Neighbor Embedding (t‐SNE), Uniform Manifold Approximation and Projection (UMAP), and Generative Topographic Mapping (GTM), are evaluated in terms of neighborhood preservation and visualization capability on sets of small molecules from the ChEMBL database.

Keywords: chemical libraries, chemical space, chemography, dimensionality reduction, Generative Topographic Mapping, principal component analysis, t-distributed Stochastic Neighbor Embedding, Uniform Manifold Approximation and Projection



Introduction

Dimensionality reduction (DR) is an important machine learning (ML) technique used to produce a compressed low‐dimensional embedding of a given high‐dimensional dataset. It serves either as a data preprocessing step before other machine learning algorithms are applied or as a tool for visualizing data in two or three human‐interpretable dimensions (2D and 3D) [1, 2, 3]. As visualization tools, DR techniques are ubiquitous across many fields, with extensive application in omics studies [4] and chemical space analysis [5]. In the latter, they are known by the term “chemography”, by analogy to geography [6]. Chemography aims at converting data on chemical structures, which are frequently represented as feature vectors of high dimensionality, into a 2D chemical space map. Beyond their illustrativeness and artistic visual appeal [7], chemical space maps can be combined with other tools, such as deep generative models [8, 9], to effectively guide chemical space exploration or to accelerate similarity‐based virtual screening [10].

Numerous benchmarking studies have been conducted to compare DR methods, both for tasks in specific domains and across numerous fields [4, 11, 12, 13]. These studies highlight the non‐linear DR algorithms t‐Distributed Stochastic Neighbor Embedding (t‐SNE) [14] and Uniform Manifold Approximation and Projection (UMAP) [15] as the best‐performing methods. However, a linear DR method, Principal Component Analysis (PCA), is also very popular and is sometimes reported as more efficient [16]. Therefore, no single method is universally superior; the choice of method should be guided by its suitability for a particular set of tasks. While chemical datasets have been benchmarked in some studies [11, 17, 18, 19], a detailed discussion of the relevance of DR methods to chemical space analysis of medicinal chemistry‐relevant small organic molecules is lacking in the literature.

This paper compares dimensionality reduction (DR) techniques for exploring chemical space. Specifically, we evaluate the effectiveness of three non‐linear methods (t‐SNE, UMAP, and Generative Topographic Mapping (GTM)) and one linear method (PCA), all commonly used for visualizing chemical spaces [20, 21, 22]. The analysis utilizes subsamples from the ChEMBL database [23], focusing on compounds tested against specific biological targets. Various representations of different dimensionalities were used to describe chemical compounds, and a grid‐based search was conducted to optimize hyperparameters with neighborhood preservation as the objective. The results confirmed the strong performance of the non‐linear methods in neighborhood preservation. Additionally, scatterplot diagnostics (scagnostics) [24] were applied to quantitatively assess the characteristics of the chemical space maps that can be relevant to human perception. The strengths and weaknesses of these methods are discussed, highlighting their effectiveness and potential limitations.

Methods

Workflow for Comparing Dimensionality Reduction Approaches for Chemical Space Analysis

The dimensionality reduction techniques were assessed in two ways: the accuracy of neighborhood preservation and visual interpretability. The general scheme for the comparative analysis of DR techniques used in this paper is presented in Figure 1. To optimize hyperparameters, a grid‐based search was conducted using the percentage of preserved nearest 20 neighbors from the high‐dimensional space as the optimization metric. The optimized models were then evaluated using additional neighborhood preservation metrics. Additionally, to quantitatively assess the visual interpretability of the plots, scatterplot diagnostics (scagnostics) [24] were calculated; these characterize features of a visualization that are relevant to human perception.

Figure 1.

Workflow for comparing dimensionality reduction methods for chemical space visualization. Initially, 13 datasets with low intrinsic dimensionality are retrieved from ChEMBL. These datasets are then processed using three distinct descriptor types: Morgan count fingerprints, MACCS keys, and ChemDist. Subsequently, dimensionality reduction techniques, including t‐SNE, GTM, and UMAP, are applied, with hyperparameters optimized for neighborhood preservation. A comparative analysis focused on evaluating the preservation of neighborhood structures and the visual interpretability of the chemical space maps (low‐dimensional embeddings) is performed.

In summary, the objective of this study is to evaluate the performance of various dimensionality reduction (DR) methods across different scenarios:

  1. Comparison of Neighborhood Preservation for in‐sample DR: This involves assessing how well the methods maintain neighborhood relationships within a series of target‐specific ChEMBL subsets, using the entire dataset for training.

  2. Comparison of Neighborhood Preservation for out‐of‐sample DR: The neighborhood preservation was assessed in a Leave‐One‐Library‐Out Scenario (LOLO). This focuses on evaluating the DR methods when applied to new data, where one library is excluded during training.

  3. Quantitative Evaluation of Visualizations: The visualizations generated by the DR methods are quantitatively assessed using scagnostics metrics.
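The leave‐one‐library‐out scenario from item 2 can be sketched as a simple data split. This is a minimal illustration rather than the authors' code: the function name, the library names, and the compound identifiers are hypothetical, and the sketch does not remove compounds shared between libraries.

```python
def leave_one_library_out(libraries):
    """Yield (held_out_name, training_items, held_out_items) for each library.

    `libraries` maps a library name to its list of compounds (hypothetical input).
    """
    for name in libraries:
        # Training set: every compound from every library except the held-out one.
        train = [
            item
            for other, items in libraries.items()
            if other != name
            for item in items
        ]
        yield name, train, list(libraries[name])


# Hypothetical toy libraries; real inputs would be descriptor vectors.
libraries = {
    "lib_A": ["a1", "a2"],
    "lib_B": ["b1"],
    "lib_C": ["c1", "c2", "c3"],
}
splits = list(leave_one_library_out(libraries))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

In such a setup, the DR model would be fitted on the training items and then used to project the held‐out library, after which neighborhood preservation is scored on the projected compounds.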

This contribution does not address Neighborhood Behavior [25] (NB, the hypothesis that structurally similar compounds exhibit similar biological properties). It focuses only on the impact of DR on the neighborhood of items and does not explore whether the resulting maps effectively group activity‐related molecules. The NB compliance of a map is expected to be worse than that of the original descriptor space, but this issue has been extensively explored for at least one method, GTM, known to support highly NB‐compliant property “landscapes” [26, 27]. Therefore, the biological properties of the compounds were not included in the analysis.

Data Collection and Preprocessing

Subsets of chemical compounds were retrieved from a pool of preprocessed target‐specific subsets from the ChEMBL database, version 33 [23], prepared according to an in‐house protocol as previously described [28, 29]. The selection of datasets was based on two criteria: each subset contains more than 400 compounds, and the intrinsic dimensionality, calculated using Fisher's separability algorithm [30] on the data represented as Morgan count fingerprints (see below), should cover a wide range of values. In addition to these target‐specific subsets, three random subsets of sizes 500, 1500, and 9269 were also retrieved from ChEMBL.

Descriptor Calculation

Three types of descriptors with varying numbers of dimensions were used: Morgan count fingerprints [31], MACCS keys [32], and embeddings from a deep neural network [33] (ChemDist). Morgan count fingerprints are circular fingerprints that capture the presence and frequency of substructures within a molecule by encoding its atomic environments up to a certain radius. MACCS keys are a fixed‐length binary representation based on predefined structural features, indicating the presence or absence of specific functional groups or substructures. The ChemDist embeddings are obtained from a graph neural network trained using deep metric learning, where molecules are viewed as graphs with atoms as nodes and bonds as edges. This approach generates continuous vector representations that quantify molecular similarity based on distances in the embedding space. The model is trained so that the Euclidean distance between embeddings reflects chemical similarity.

Morgan count fingerprints and MACCS keys were calculated using the RDKit (v.2022.09.5) library [34]. For Morgan count fingerprints radius 2 and fingerprint size 1024 were used. Default RDKit parameters were used to generate MACCS keys.

ChemDist embeddings were obtained using the pretrained network [33]. The following parameters of the original network were used to obtain embeddings: number of node features 74, number of edge features 12, final embedding size used for metric calculation 16.

For each dataset, all zero‐variance features were removed, and the remaining features were standardized before applying a dimensionality reduction algorithm.
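The preprocessing step can be illustrated with a small standard‐library sketch. It assumes that "standardized" means scaling each feature to zero mean and unit variance using the population standard deviation; the source does not state which scaler was used, and the function name is hypothetical.

```python
from statistics import fmean, pstdev


def drop_constant_and_standardize(rows):
    """Remove zero-variance columns, then scale the rest to zero mean, unit variance."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    scaled = []
    for col in columns:
        sigma = pstdev(col)  # population standard deviation (an assumption)
        if sigma == 0.0:
            continue  # zero-variance feature: dropped
        mu = fmean(col)
        scaled.append([(x - mu) / sigma for x in col])
    return [list(row) for row in zip(*scaled)]  # transpose back to samples x features


# Toy descriptor matrix: the middle feature is constant and will be removed.
X = [[1, 0, 3], [2, 0, 5], [3, 0, 7]]
Z = drop_constant_and_standardize(X)
print(Z)
```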

Dimensionality Reduction Methods

The following implementations of dimensionality reduction algorithms were used: the PCA algorithm implemented in scikit‐learn (version 1.4.1.post1); the t‐SNE algorithm from the OpenTSNE (version 1.0.1) library [35]; the UMAP algorithm from the umap‐learn (version 0.5.5) library [15]; and an in‐house algorithm for GTM, which is available upon request [36].

Defining “Neighbors” in Descriptor and Latent Space Respectively

In latent space (on the maps) the first k neighbors of an item i are found by calculating the Euclidean distances between the projection of i and all the projections of the remaining set members and ranking the latter.

In descriptor spaces, “default” neighbor definition used the same approach, based on respective Euclidean distances. However, an alternative definition of distance as the complement of the Tanimoto similarity score (1‐T) was also employed. No normalization of descriptors was undertaken for calculating Tanimoto similarity values.
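The two neighbor definitions can be compared with a short sketch. The Tanimoto form used here, the sum of element‐wise minima divided by the sum of element‐wise maxima, is an assumption: it reduces to the usual Tanimoto coefficient on binary vectors, but the source does not specify which generalization was applied to count fingerprints. All function names are illustrative.

```python
import math


def tanimoto_distance(a, b):
    """1 - T, with T = sum(min) / sum(max) (assumed form for count vectors)."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(max(x, y) for x, y in zip(a, b))
    return 0.0 if total == 0 else 1.0 - shared / total


def k_nearest(points, i, k, distance):
    """Indices of the k nearest neighbors of item i, ranked by increasing distance."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: distance(points[i], points[j]))
    return others[:k]


# Toy count fingerprints (hypothetical values).
fps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [2, 0, 0, 0]]
nn_euclid = k_nearest(fps, 0, 2, math.dist)
nn_tanimoto = k_nearest(fps, 0, 2, tanimoto_distance)
print(nn_euclid, nn_tanimoto)
```

The same `k_nearest` function applies unchanged to 2D map coordinates when `math.dist` is used as the distance.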

Neighborhood Preservation Analysis

As a primary metric of the neighborhood preservation for the optimization, an average number of nearest neighbors preserved between the original and the latent spaces was used [11]:

$$PNN_k = \frac{\sum_{i=1}^{N} S_i^k}{k \times N} \qquad (1)$$

where $PNN_k$ is the neighborhood preservation score, $k$ is the number of nearest neighbors considered, $N$ is the number of compounds, and $S_i^k$ is the number of $k$‐nearest neighbors of the $i$‐th compound that are shared between the original and latent spaces.
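As a worked illustration of Equation (1), the score can be computed from precomputed neighbor lists. This is a minimal sketch with hypothetical names; it assumes each item comes with its neighbors already ranked by increasing distance in both spaces.

```python
def pnn(neighbors_high, neighbors_low, k):
    """Equation (1): mean fraction of the k nearest neighbors shared by both spaces."""
    n = len(neighbors_high)
    shared = sum(
        len(set(neighbors_high[i][:k]) & set(neighbors_low[i][:k]))
        for i in range(n)
    )
    return shared / (k * n)


# Ranked neighbor indices for four items (toy data).
high = [[1, 2, 3], [0, 2, 3], [3, 1, 0], [2, 1, 0]]  # descriptor space
low = [[1, 3, 2], [0, 3, 2], [3, 0, 1], [2, 0, 1]]   # latent space (map)
print(pnn(high, low, 1), pnn(high, low, 2))
```

In this toy case every item keeps its first neighbor but swaps its second and third, so the score is 1.0 at k = 1 and 0.5 at k = 2.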

Additionally, the following metrics for evaluating neighborhood preservation suggested in the literature were calculated [37]: co‐k‐nearest neighbor size (QNN) (2), the area under the QNN curve, AUC(QNN) (3), the local continuity meta criterion (LCMC) (4) and its maximum point kmax (5), the local (Qlocal) (6) and global (Qglobal) (7) properties, trustworthiness (8), and continuity (9).

The calculation of the metrics is based on building the co‐ranking matrix Q (Figure 2). The element Qkl of the matrix Q counts how many samples of rank k in the descriptor space have rank l in the latent space [37]. The off‐diagonal elements of Q are used to calculate all the metrics. In the case of ties in the ranking, the ranks were assigned randomly.
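The construction of the co‐ranking matrix can be sketched as follows. The sketch assumes that every item has a complete, tie‐free ordering of the other items in both spaces, so Q has N − 1 rows and columns for N items. Random tie breaking is not implemented, and the function name is illustrative.

```python
def coranking_matrix(order_high, order_low):
    """Q[k-1][l-1] counts pairs (i, j) where j has rank k around i in the
    descriptor space and rank l around i in the latent space."""
    n = len(order_high)
    m = n - 1  # each item has n - 1 ranked neighbors
    Q = [[0] * m for _ in range(m)]
    for i in range(n):
        # Zero-based latent-space rank of every neighbor j of item i.
        rank_low = {j: r for r, j in enumerate(order_low[i])}
        for rank_high, j in enumerate(order_high[i]):
            Q[rank_high][rank_low[j]] += 1
    return Q


# Full neighbor orderings for four items (toy data).
order_high = [[1, 2, 3], [0, 2, 3], [3, 1, 0], [2, 1, 0]]
order_low = [[1, 3, 2], [0, 3, 2], [3, 0, 1], [2, 0, 1]]
Q = coranking_matrix(order_high, order_low)
for row in Q:
    print(row)
```

A perfect embedding would place all counts on the diagonal. Here the second and third neighbors are swapped for every item, which moves those counts off the diagonal.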

Figure 2.

Metrics used for the analysis of neighborhood preservation (a) The descriptor space and the corresponding latent space are compared through (i) distances between data points that are then (ii) converted into a ranking of the nearest neighbors for each instance. The co‐ranking matrix Q with elements (Qkl) is calculated based on the ranking matrices. (b) The illustration of the calculation of metrics based on the co‐ranking matrix and Equations (2)–(9).

Co‐k‐Nearest Neighbor Size

$$Q_{NN}(k) = \frac{1}{km} \sum_{i=1}^{k} \sum_{j=1}^{k} Q_{ij} \qquad (2)$$

where $k$ is a row and a column index, $m$ is the number of rows/columns in the co‐ranking matrix, and $Q_{ij}$ are the elements of the co‐ranking matrix Q.

This measures the number of cases in which the neighborhood is preserved within a given tolerance, up to rank k. It corresponds to the green region in the Q matrix represented in Figure 2b.

Area Under the QNN Curve

$$AUC = \frac{1}{m} \sum_{k=1}^{m} Q_{NN}(k) \qquad (3)$$

where k is a row index, m – number of rows/columns in the co‐ranking matrix Q, QNN(k) – co‐k‐nearest neighbor size as calculated by the Equation (2).

AUC characterizes the global neighborhood preservation based on the QNN curve as shown in Figure 2b.
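Equations (2) and (3) translate directly into code. The sketch below follows the equations as written, dividing by m, the size of the co‐ranking matrix. If Q is built over the N − 1 ranked neighbors of N items, values can therefore slightly exceed 1 for very small N, as in the toy matrix used here. Function names are illustrative.

```python
def qnn(Q, k):
    """Equation (2): sum of the upper-left k x k block of Q, divided by k * m."""
    m = len(Q)
    block = sum(Q[i][j] for i in range(k) for j in range(k))
    return block / (k * m)


def auc_qnn(Q):
    """Equation (3): mean of QNN(k) over k = 1..m."""
    m = len(Q)
    return sum(qnn(Q, k) for k in range(1, m + 1)) / m


# Toy co-ranking matrix for four items whose second and third neighbors are swapped.
Q = [[4, 0, 0], [0, 0, 4], [0, 4, 0]]
curve = [qnn(Q, k) for k in range(1, len(Q) + 1)]
print(curve, auc_qnn(Q))
```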

Local Continuity Meta Criterion (LCMC)

$$LCMC(k) = Q_{NN}(k) - \frac{k}{m-1} \qquad (4)$$

where k is a row index, m – number of rows/columns in the co‐ranking matrix Q, QNN(k) – co‐k‐nearest neighbor size as calculated by the Equation (2).

LCMC is a version of QNN normalized by the number of neighbors that can be retrieved randomly, $\frac{k}{m-1}$ [37].

The maximum point of the LCMC curve is defined as:

$$k_{max} = \arg\max_{k} LCMC(k) \qquad (5)$$

where $k$ is a row index in the co‐ranking matrix Q and $LCMC(k)$ is the local continuity meta criterion as calculated by Equation (4).

The maximum point of the LCMC curve, kmax, corresponds to a number of neighbors (Figure 2b).

Local and Global Property Metric

$$Q_{local} = \frac{1}{k_{max}} \sum_{k=1}^{k_{max}} Q_{NN}(k) \qquad (6)$$

$$Q_{global} = \frac{1}{m - k_{max}} \sum_{k=k_{max}}^{m-1} Q_{NN}(k) \qquad (7)$$

where $k_{max}$ is the maximum point of the LCMC curve as calculated by Equation (5), $k$ is a row index, and $m$ is the number of rows/columns in the co‐ranking matrix.

Qlocal and Qglobal represent local and global neighborhood preservation, respectively, as calculated from the LCMC curve (Figure 2b). These correspond to the green and red areas in Figure 2b when k equals kmax.
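Equations (4)–(7) can be chained on a precomputed QNN curve. In this sketch, `qnn_curve[k - 1]` holds QNN(k) for k = 1..m. The curve values are made up for illustration, ties in the maximum are resolved by taking the smallest k (an assumption), and the function names are hypothetical.

```python
def lcmc_curve(qnn_curve):
    """Equation (4): LCMC(k) = QNN(k) - k / (m - 1) for k = 1..m."""
    m = len(qnn_curve)
    return [q - k / (m - 1) for k, q in enumerate(qnn_curve, start=1)]


def k_max(qnn_curve):
    """Equation (5): the k at which LCMC is largest (smallest k on ties)."""
    lcmc = lcmc_curve(qnn_curve)
    return max(range(1, len(lcmc) + 1), key=lambda k: lcmc[k - 1])


def q_local_global(qnn_curve):
    """Equations (6) and (7): mean QNN over k = 1..k_max and over k = k_max..m-1."""
    m = len(qnn_curve)
    km = k_max(qnn_curve)
    if km >= m:
        raise ValueError("Equation (7) needs k_max < m")
    q_local = sum(qnn_curve[:km]) / km
    q_global = sum(qnn_curve[km - 1 : m - 1]) / (m - km)
    return q_local, q_global


# Hypothetical QNN curve for m = 5.
curve = [0.6, 0.9, 0.95, 1.0, 1.0]
km = k_max(curve)
q_local, q_global = q_local_global(curve)
print(km, q_local, q_global)
```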

Trustworthiness and Continuity

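$$T(k) = 1 - \frac{2}{mk(2m - 3k - 1)} \sum_{i=k}^{m} \sum_{j=1}^{k} Q_{ij} \times (i - k) \qquad (8)$$

$$C(k) = 1 - \frac{2}{mk(2m - 3k - 1)} \sum_{i=1}^{k} \sum_{j=k}^{m} Q_{ij} \times (j - k) \qquad (9)$$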

where k is a row index, m – number of rows/columns in the co‐ranking matrix Q, Qij ‐ elements of the co‐ranking matrix.

Trustworthiness and continuity correspond to hard intrusions (bottom left corner in Figure 2b) and hard extrusions (top right corner in Figure 2b), respectively. Hard intrusions occur when data points (compounds) that are distant in the descriptor space appear close in the latent space. Hard extrusions happen when compounds that are close in the descriptor space appear far apart in the latent space.
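A direct transcription of Equations (8) and (9) is sketched below. It assumes that rows of Q index descriptor‐space ranks and columns index latent‐space ranks, and that k is small enough for 2m − 3k − 1 to stay positive. The toy matrix and function names are hypothetical.

```python
def trustworthiness(Q, k):
    """Equation (8): penalizes intrusions (rows i >= k, columns j <= k) by i - k."""
    m = len(Q)
    penalty = sum(
        Q[i - 1][j - 1] * (i - k)
        for i in range(k, m + 1)
        for j in range(1, k + 1)
    )
    return 1.0 - 2.0 * penalty / (m * k * (2 * m - 3 * k - 1))


def continuity(Q, k):
    """Equation (9): penalizes extrusions (rows i <= k, columns j >= k) by j - k."""
    m = len(Q)
    penalty = sum(
        Q[i - 1][j - 1] * (j - k)
        for i in range(1, k + 1)
        for j in range(k, m + 1)
    )
    return 1.0 - 2.0 * penalty / (m * k * (2 * m - 3 * k - 1))


# Toy 4 x 4 co-ranking matrix (rows: descriptor-space rank, columns: map rank).
Q = [[4, 1, 0, 0], [0, 3, 1, 1], [1, 0, 4, 0], [0, 1, 0, 4]]
print(trustworthiness(Q, 1), continuity(Q, 1))
```

In this example one pair with descriptor‐space rank 3 appears as a first neighbor on the map (an intrusion), and one true first neighbor is pushed to rank 2 (an extrusion), so both scores fall below 1.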

Chemical Library Diversity Analysis

Chemical library diversity analysis was conducted using the Consensus Diversity Plots (CDPs) methodology [38]. This approach evaluates the diversity of chemical libraries by considering both scaffold diversity (y‐axis) and the distribution of chemical similarity calculated based on fingerprints (x‐axis). In this work, the fraction of scaffolds required to retrieve 50% of the database (F50) was used as a measure of scaffold diversity [38]. Bemis‐Murcko scaffolds calculated using RDKit were employed, and the F50 values were plotted against the median intra‐library similarity calculated using Morgan count fingerprints and MACCS keys.
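The F50 value can be illustrated with a short sketch. It assumes that scaffolds are supplied as one label per compound (for example a scaffold SMILES string) and that scaffolds are taken in order of decreasing population until at least half of the compounds are covered. The function name and toy labels are hypothetical.

```python
from collections import Counter


def f50(scaffolds):
    """Fraction of distinct scaffolds needed to cover at least 50% of the compounds,
    taking the most populated scaffolds first."""
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    half = 0.5 * len(scaffolds)
    covered = 0
    for n_used, size in enumerate(counts, start=1):
        covered += size
        if covered >= half:
            return n_used / len(counts)
    raise ValueError("empty scaffold list")


# Toy libraries: one dominated by a single scaffold, one made of singletons.
focused = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
diverse = [f"S{i}" for i in range(10)]
print(f50(focused), f50(diverse))
```

A low F50 indicates that a few scaffolds account for most compounds, whereas a library of singletons gives F50 = 0.5.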

Quantitative Assessment of Chemical Space Visualization Using Scagnostics

To quantitatively evaluate the interpretability of the visualizations, we used scagnostics (scatterplot diagnostics) [24, 39]. These are visual representation quality metrics that map a visual pattern to a real number and are frequently used to automatically find visualizations that contain interesting patterns [40]. Scagnostics were calculated using the R package scagnostics [41] (version 0.2–6).

The scagnostics were calculated according to Equations (10–19) as suggested in the original publications by Wilkinson et al. [24, 39]. In brief, scagnostics are calculated using three principal geometric concepts: the minimum spanning tree, the convex hull, and the alpha hull (Figure 3).

Figure 3.

Three fundamental geometric constructs utilized in the calculation of scagnostics: the Minimum Spanning Tree (MST), the Alpha Hull, and the Convex Hull. The MST (left panel) represents the shortest possible tree that connects all points. The Alpha Hull (middle panel) provides a nuanced boundary of the point set by incorporating concavities, thus capturing more detailed structural features and revealing local patterns that the convex hull might overlook. The Convex Hull (right panel) represents the smallest convex polygon encompassing all points, offering a broad overview of the dataset′s outer boundary.

In the following Equations (10–19), H stands for the convex hull, A stands for the alpha hull, and T stands for the minimum spanning tree.

The Outlying scagnostic is calculated based on the minimum spanning tree T and measures the ratio of outlying dots, i. e., dots that are relatively far from others in the plot. It is defined as the proportion of the total edge length of the minimum spanning tree T that is accounted for by the total length of edges adjacent to these outlying points [39]:

$$c_{outlying} = \frac{\mathrm{length}(T_{outliers})}{\mathrm{length}(T)} \qquad (10)$$

where length(T) is the total edge length for the MST graph, length (Toutliers) is the total edge length for the outliers in the MST graph.

An outlier is defined to be a vertex with degree 1 and associated edge weight greater than w, where w is calculated using Equation (11):

$$w = q_{75} + 1.5\,(q_{75} - q_{25}) \qquad (11)$$

where q75 and q25 are the 75th and 25th percentiles of the MST edge lengths.
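Equations (10) and (11) can be sketched with a small pure‐Python implementation. This is not the R scagnostics package: the MST is built with Prim's algorithm on the complete Euclidean graph, percentiles use linear interpolation (an assumption, since other conventions exist), and all function names are illustrative.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j, length) edges."""
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda v: best[v][0])  # closest vertex not yet in the tree
        length, i = best.pop(j)
        edges.append((i, j, length))
        for v in best:  # relax distances through the newly added vertex
            d = math.dist(points[j], points[v])
            if d < best[v][0]:
                best[v] = (d, j)
    return edges


def percentile(values, p):
    """Linearly interpolated percentile (assumed convention)."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def outlying(points):
    edges = minimum_spanning_tree(points)
    lengths = [length for _, _, length in edges]
    q25, q75 = percentile(lengths, 25), percentile(lengths, 75)
    w = q75 + 1.5 * (q75 - q25)  # Equation (11)
    degree = {}
    for i, j, _ in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    # Edges longer than w that end in a degree-1 vertex belong to outliers.
    outlier_length = sum(
        length
        for i, j, length in edges
        if length > w and (degree[i] == 1 or degree[j] == 1)
    )
    return outlier_length / sum(lengths)  # Equation (10)


# Five evenly spaced points and one distant point (toy data).
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (20, 0)]
score = outlying(points)
print(score)
```

For these points the MST edge lengths are 1, 1, 1, 1, and 16, so w = 1 and only the edge to the distant point counts as outlying.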

The following metrics (Equations 12–15) characterize the shape of the set of scattered points.

The Convex scagnostic characterizes the convexity of the shape of the 2D point distribution and is calculated as the ratio of the area of the alpha hull to the area of the convex hull:

$$c_{convex} = \frac{\mathrm{area}(A)}{\mathrm{area}(H)} \qquad (12)$$

where area(A) and area(H) are the areas of the alpha hull and the convex hull, respectively.

The Skinny scagnostic evaluates the ratio of perimeter to area of a polygon:

$$c_{skinny} = 1 - \frac{\sqrt{4\pi\,\mathrm{area}(A)}}{\mathrm{perimeter}(A)} \qquad (13)$$

where area(A) and perimeter(A) are the area and the perimeter of the alpha hull, respectively.

The Stringy scagnostic describes how “path‐like” the MST is and is calculated using Equation (14):

$$c_{stringy} = \frac{\mathrm{diameter}(T)}{\mathrm{length}(T)} \qquad (14)$$

where diameter(T) is the length of the longest shortest path in the MST and length(T) is the total length of all MST edges.

Straight scagnostic is defined as the Euclidean distance between the points at the ends of the longest shortest path of the MST divided by the diameter:

$$c_{straight} = \frac{\mathrm{dist}(t_j, t_k)}{\mathrm{diameter}(T)} \qquad (15)$$

where tj and tk are the vertices in T on which the diameter is defined.
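Equations (14) and (15) both depend on the diameter of the MST, which for a tree can be found with two traversals: the farthest vertex from any start is one end of the diameter, and the farthest vertex from that end is the other. The sketch below assumes the MST edges are already known and given as index pairs; all names are illustrative.

```python
import math


def farthest(adjacency, start):
    """Return (vertex, path_length) farthest from `start` along tree edges."""
    best_vertex, best_dist = start, 0.0
    stack = [(start, None, 0.0)]
    while stack:
        vertex, parent, dist = stack.pop()
        if dist > best_dist:
            best_vertex, best_dist = vertex, dist
        for nxt, length in adjacency[vertex]:
            if nxt != parent:  # a tree has no cycles, so skipping the parent suffices
                stack.append((nxt, vertex, dist + length))
    return best_vertex, best_dist


def stringy_and_straight(points, mst_edges):
    adjacency = {i: [] for i in range(len(points))}
    total = 0.0
    for i, j in mst_edges:
        length = math.dist(points[i], points[j])
        adjacency[i].append((j, length))
        adjacency[j].append((i, length))
        total += length
    t_j, _ = farthest(adjacency, 0)           # one end of the diameter
    t_k, diameter = farthest(adjacency, t_j)  # other end and the diameter itself
    stringy = diameter / total                                   # Equation (14)
    straight = math.dist(points[t_j], points[t_k]) / diameter    # Equation (15)
    return stringy, straight


# A straight backbone of four points with one short side branch (toy data).
points = [(0, 0), (1, 0), (2, 0), (3, 0), (1, 0.5)]
mst_edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
stringy, straight = stringy_and_straight(points, mst_edges)
print(stringy, straight)
```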

Monotonic scagnostic is defined as the square of Spearman correlation coefficient (rspearman) between coordinates and characterizes the presence of a clear trend on the plot:

$$c_{monotonic} = r_{spearman}^{2} \qquad (16)$$
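As an illustration of Equation (16), the Spearman coefficient can be computed as the Pearson correlation of rank‐transformed coordinates. The sketch assigns average ranks to ties, which is a common convention but an assumption here; the function names are illustrative.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for idx in order[pos : end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def monotonic(xs, ys):
    """Equation (16): squared Spearman correlation between the two map coordinates."""
    r = pearson(average_ranks(xs), average_ranks(ys))
    return r * r


# Toy map coordinates with a mostly increasing trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(monotonic(xs, ys))
```

Because the coefficient is squared, perfectly increasing and perfectly decreasing trends both give a value of 1.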

The Skewed scagnostic evaluates the relative density of points in a scattered configuration:

$$c_{skew} = \frac{q_{90} - q_{50}}{q_{90} - q_{10}} \qquad (17)$$

where q90, q50, q10 are quantiles of the MST edge lengths.

Clumpy scagnostic characterizes the presence of clusters and is calculated as:

$$c_{clumpy}(T) = \max_{j} \left[ 1 - \frac{\max_{k} \left[ \mathrm{length}(e_k) \right]}{\mathrm{length}(e_j)} \right] \qquad (18)$$

where $j$ indexes edges in the MST and $k$ indexes edges in each runt set derived from an edge indexed by $j$. The runt set of an edge $e_j$ is the smaller of the two subsets of edges that are still connected to each of the two vertices of $e_j$ after deleting edges in the MST with lengths less than $\mathrm{length}(e_j)$.

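The Striated scagnostic assesses the presence of multiple parallel lines and is defined as:

$$c_{striate}(T) = \frac{1}{|V^{(2)}|} \sum_{v \in V^{(2)}} \left| \cos \theta_{e(v,a),\,e(v,b)} \right| \qquad (19)$$

where $V^{(2)} \subseteq V$ is the set of all vertices of degree 2 in $V$, and $\theta_{e(v,a),\,e(v,b)}$ is the angle between the two MST edges adjacent to vertex $v$.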

The metrics used in this study are summarized in Table 1.

Table 1.

List of measures used to assess the neighborhood preservation and scagnostics of dimensionality reduction techniques.

| Name | Range | Comment | Equation |
| --- | --- | --- | --- |
| Neighborhood preservation metrics | | | |
| Neighborhood preservation score (PNN) | 0–1 (1: all neighbors preserved at given k; 0: no neighbors preserved at given k) | Real‐valued metric. Characterizes the preservation of neighbors without considering their ranks in the descriptor and latent spaces. | (1) |
| Co‐k‐nearest neighbor size (QNN(k)) | 0–1 (1: ideal neighborhood preservation at given k) | Real‐valued metric. Evaluates the preservation of neighbors at a given nearest neighborhood size k, considering their ranks in both descriptor and latent spaces. | (2) |
| Area under the QNN curve (AUC) | 0.5–1 (1: ideal neighborhood preservation) | Real‐valued metric. Summarizes the global preservation of neighbors based on QNN. | (3) |
| Local Continuity Meta Criterion (LCMC(k)) | 0–1 (1: all neighbors preserved at given k; 0: no neighbors preserved at given k) | Real‐valued metric. The QNN(k) value is normalized by the number of neighbors that can be drawn randomly for a given k. | (4) |
| kmax | 1–N, where N is the number of data points in the dataset (larger kmax values signify larger preserved local neighborhoods) | Integer. The maximum value point of the LCMC curve. | (5) |
| Qlocal | 0–1 (1: high local neighborhood preservation; 0: low local neighborhood preservation) | Real number. Metric characterizing local neighborhood preservation based on LCMC for k<kmax. | (6) |
| Qglobal | 0–1 (1: high global neighborhood preservation; 0: low global neighborhood preservation) | Real number. Metric characterizing global neighborhood preservation based on LCMC for k>kmax. | (7) |
| Trustworthiness | 0–1 (1: no hard intrusions; 0: large number of hard intrusions) | Real number. Indicates the presence of hard intrusions. | (8) |
| Continuity | 0–1 (1: no hard extrusions; 0: large number of hard extrusions) | Real number. Indicates the presence of hard extrusions. | (9) |
| Scagnostics | | | |
| Outlying | 0–1 | Characterizes the presence of outlying data points on a scatter plot. | (10) |
| Convex | 0–1 | Characterizes various aspects of the shape distribution of data points on a scatter plot. | (12) |
| Stringy | 0–1 | Characterizes various aspects of the shape distribution of data points on a scatter plot. | (14) |
| Straight | 0–1 | Characterizes various aspects of the shape distribution of data points on a scatter plot. | (15) |
| Monotonic | 0–1 | Determines if a clear trend can be identified on a scatter plot. | (16) |
| Skewed | 0–1 | Provides characteristics of the density of the data point distribution on a scatter plot. | (17) |
| Clumpy | 0–1 | Provides characteristics of the density of the data point distribution on a scatter plot. | (18) |
| Striated | 0–1 | Characterizes the coherence of the plots. | (19) |

Optimization of Hyperparameters

The following hyperparameters were optimized for the non‐linear methods, with a total of 72 hyperparameter combinations tested for each method.

The PCA calculations were performed using scikit‐learn v. 1.5.0. The implementation is based on singular value decomposition. The default value ('auto') of the 'svd_solver' parameter was used. Only the first two principal components were used to project the datasets.

For t‐SNE, the hyperparameters were chosen according to the suggestions from Gove et al. [42]. Perplexity values were chosen to be [1, 2, 4, 8, 16, 32, 64, 128]. Exaggeration values were chosen to be [1, 2, 3, 4, 5, 6, 8, 16, 32]. The learning rate was kept at the OpenTSNE default, since t‐SNE is more robust to changes in learning rate around the empirical hyperparameter guideline than to changes in perplexity or exaggeration [35]. The Fast Fourier Transform accelerated interpolation method was used to calculate gradients.

For UMAP, the parameter grid included 9 values for nearest neighbors (n_neighbors): [2, 4, 6, 8, 16, 32, 64, 128, 256] and 8 values for minimal distance (min_dist): [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.99]. The values were selected to span a wide range of neighbor/distance values, including those typically regarded as identifying “true” neighbors, reflecting chemical similarity from a chemist's perspective within the chemical space [43, 44].
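The t‐SNE and UMAP grids can be enumerated with `itertools.product`, which also confirms the count of 72 combinations for each. This is a sketch only: the dictionary keys are illustrative labels, and `score` stands for a hypothetical user‐supplied function that fits a model with the given settings and returns the preserved fraction of the 20 nearest neighbors.

```python
from itertools import product

# Grid values as listed in the text; the dictionary keys are illustrative labels.
tsne_grid = [
    {"perplexity": p, "exaggeration": e}
    for p, e in product(
        [1, 2, 4, 8, 16, 32, 64, 128],
        [1, 2, 3, 4, 5, 6, 8, 16, 32],
    )
]
umap_grid = [
    {"n_neighbors": n, "min_dist": d}
    for n, d in product(
        [2, 4, 6, 8, 16, 32, 64, 128, 256],
        [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.99],
    )
]


def best_setting(grid, score):
    """Return the grid entry with the highest score (hypothetical scoring callback)."""
    return max(grid, key=score)


print(len(tsne_grid), len(umap_grid))

# Dummy score for demonstration only: prefers n_neighbors close to 20.
demo = best_setting(umap_grid, lambda s: -abs(s["n_neighbors"] - 20))
print(demo)
```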

The GTM parameter grid included the following hyperparameters: the number of nodes set to 225, 625, and 1600; the number of basis functions set to 100, 400, and 1225; regularization coefficients of 1, 10, and 100; and basis widths of 0.1, 0.4, 0.8, and 1.2. These values were selected based on the range observed in previous applications of GTM for chemical space analysis [22].

Intrinsic Dimension Analysis

Intrinsic dimension (ID) analysis with Fisher separability algorithm was performed using scikit‐dimension library (v. 0.3.3) [45].

Results

Neighborhood Preservation Analysis

Neighborhood Preservation Analysis for In‐Sample Dimensionality Reduction

Although numerous chemical datasets are available for benchmarking supervised machine learning methods [46, 47, 48], to our knowledge there are no datasets designed to evaluate the quality of DR neighborhood preservation and visualization. In this study, we focused on small organic molecules tested against specific ChEMBL targets. Following observations on the importance of low intrinsic dimension for achieving meaningful visualization [11, 16, 45], these datasets were chosen to cover a wide range of intrinsic dimension values as assessed by the Fisher separation method using Morgan count fingerprints as features [30]. In total, 103 datasets with the number of compounds from 406 to 4376 were selected (Supplementary Figure SF1) and intrinsic dimensionality ranged between 3 and 26.

One of the most common methods to evaluate the usability of a visualization obtained using a DR technique is to assess how well close neighbors in the original space are preserved in the latent space [2]. While numerous metrics have been suggested for this type of evaluation, in this work, we focused on one of the simplest: the number of k‐neighbors preserved in the latent space, also known as the neighborhood hit [11]. To optimize hyperparameters, a grid‐based search was conducted using the percentage of preserved nearest 20 neighbors from the high‐dimensional space as the optimization metric. All non‐linear techniques were able to retrieve, on average, 40% to 75% of the 20 closest neighbors depending on the descriptor set, outperforming PCA by 20% or more (Figure 4a, Supplementary Table ST1). The descriptors used in this study offer different levels of granularity in capturing chemical structure similarity. MACCS keys provide a coarse‐grained perspective by using fixed‐length binary vectors (166 bits) based on predefined structural features, potentially missing subtle molecular details. In contrast, Morgan count fingerprints offer a fine‐grained view with high‐dimensional vectors (size 1024 in this work), capturing detailed substructural counts that reflect richer molecular information. ChemDist embeddings retain the advantages of Morgan count fingerprints in capturing substructural nuances while compressing the information into a low‐dimensional (16‐dimensional) real‐valued vector. On average, a lower dimensionality of the ambient space corresponds to a higher preservation score (Figure 4), while the relative performance of the methods remains consistent across different descriptor sets.

Figure 4.

Figure 4

Average neighborhood preservation metrics for optimized models across 59 ChEMBL subsets for various feature sets (Morgan fingerprints, MACCS keys, ChemDist embeddings). The models’ hyperparameters were selected to maximize the preservation of neighbors among the 20 nearest ones (Euclidean distance in the original space was used). Color scheme: PCA ‐ blue, t‐SNE ‐ orange, UMAP ‐ green, GTM ‐ red. The ratio of nearest neighbors (PNN) preserved at different k‐values, trustworthiness, continuity, co‐k‐nearest neighbor size (QNN), and Local Continuity Meta Criterion (LCMC) as functions of the k‐nearest neighbors are shown. Standard deviation values calculated across datasets are shown as bars or filled areas. Corresponding AUC, Qlocal, Qglobal, k‐max values can be found in Supplementary Table ST1.

Consistently, all non‐linear methods demonstrated similar trends in other neighborhood preservation metrics, avoiding significant intrusions and extrusions: cases where compounds positioned far apart in the original space appear close on the map, and vice versa, where compounds far apart on the map are actually close in the original space (Figure 4). For non‐linear methods, co‐k‐nearest neighbor size (QNN) and Local Continuity Meta Criterion (LCMC) exhibited a sharp increase for low k‐values, indicating their strong performance in preserving the closest neighbors (Figure 4). In contrast, PCA demonstrated a more uniform performance across various values of k.

While all the methods were able to preserve a significant number of neighboring compounds for the aforementioned datasets, the percentage of preserved neighbors for nine randomly selected ChEMBL datasets was significantly lower (Supplementary Table ST1, Supplementary Figure SF2) for all considered descriptor spaces. As was shown before, this discrepancy suggests that the effectiveness of these DR methods can vary considerably depending on the dataset's characteristics [16, 45]. If the data inherently resides in a high‐dimensional space, traditional dimensionality reduction methods might struggle to preserve the neighborhood structure accurately [16, 45]. Therefore, assessing the intrinsic dimension of a dataset is important for evaluating the applicability of DR to that dataset. A correlation between intrinsic dimension (ID) and neighborhood preservation (PNN) was observed in datasets using Morgan fingerprints and MACCS keys as features (Figure 5a, Supplementary Figure SF3, SF4). Additionally, the neighborhood tends to be better preserved for compounds that are closer analogs to each other in the descriptor space (Figure 5c, Supplementary Figure SF5). Therefore, neighborhood preservation scores are meaningful only when there are items within a relevant neighborhood. To put this into a chemical perspective, congeneric organic compound series have a much lower intrinsic dimension than random collections. Random sets of a few thousand ChEMBL compounds consist almost entirely of singletons, i.e., compounds whose nearest neighbor is very distant. To illustrate this, chemical library diversity was analyzed using the consensus diversity plots (CDP) approach [38]. Here, scaffold diversity, represented by the fraction of scaffolds needed to cover 50% of compounds (F50), is plotted against the median intra‐library similarity calculated using chemical fingerprints.
The CDP analysis reveals that all methods struggle to maintain neighborhood structures in highly diverse libraries with numerous scaffolds (large F50 values, Figure 5b, Supplementary Figure SF6). However, some libraries with comparatively low neighborhood preservation consist of highly similar compounds within only a few chemotypes, indicating that a broad range of similarity values can be important for effective preservation of high dimension neighborhoods (Figure 5b, Supplementary Figure SF6). Further investigation is required to draw more solid conclusions across various dataset sizes and feature sets.

Figure 5.

a. Negative correlation between the adjusted neighborhood preservation (P*NN), normalized by the number of neighbors that can be selected randomly, and the intrinsic dimension (ID) calculated using the Fisher algorithm. This correlation is observed across different dimensionality reduction techniques (PCA, t‐SNE, UMAP, GTM). b. Consensus diversity plot: the fraction of scaffolds needed to cover 50% of compounds (F50) as a function of the median intra‐dataset similarity (Median Tc). The color scheme shows the adjusted neighborhood preservation (P*NN). For a and b, Morgan fingerprints were used as descriptors, the size of the data points reflects the number of compounds in the dataset, and the random ChEMBL subsets are shown as crosses. c. The ratio of preserved nearest neighbors (RNN) within a given Tanimoto similarity range (from 0.0–0.1 to 0.9–1.0): for each compound, the k nearest neighbors in the descriptor space that fall within the specified Tanimoto coefficient range (Ndesc) were identified, and the number of these neighbors among the k nearest in the latent space (Nlat) was determined. The RNN value was then calculated by dividing Nlat by Ndesc. Morgan fingerprints were used as descriptors. Color scheme: PCA ‐ blue, t‐SNE ‐ orange, UMAP ‐ green, GTM ‐ red. Standard deviation values calculated across datasets are shown as filled areas.

Few datasets combined a large size with a low intrinsic dimension (Figure 6, Supplementary Figure SF1). To assess whether a relatively large dataset can have a low intrinsic dimension, we selected 18 partially overlapping datasets (low‐ID datasets, Supplementary Table ST1, Supplementary Figure SF7). Each dataset contained between 411 and 1756 compounds and had an intrinsic dimension of less than 6 when represented as Morgan count fingerprints. When combined into a single dataset, there were 16287 unique compounds, and the intrinsic dimension was equal to 7.5. The results for the fused dataset were similar to those for the individual datasets: in both cases, all non‐linear methods significantly outperformed PCA in terms of neighborhood preservation (Supplementary Table ST1, Supplementary Figure SF7). Among non‐linear methods, t‐SNE and UMAP outperformed GTM in preserving the closest nearest neighbors in all descriptor spaces.

Figure 6.

Average Nearest Neighbor (PNN) preservation metric for the leave‐one‐library‐out (LOLO) setup. As with the in‐sample case (Figure 4), non‐linear methods outperform PCA in neighborhood preservation for Morgan FP and MACCS keys, albeit with lower PNN values. Among them, GTM demonstrated the most robust performance. The models’ hyperparameters were selected to maximize the preservation of neighbors among the 20 nearest ones using Euclidean distance in the original descriptor space with Morgan fingerprints, MACCS keys and ChemDist embeddings used as feature sets. Color scheme: PCA ‐ blue, t‐SNE ‐ orange, UMAP ‐ green, GTM ‐ red. The ratio of nearest neighbors (PNN) preserved at different k‐values, trustworthiness, continuity, co‐k‐nearest neighbor size (QNN), and Local Continuity Meta Criterion (LCMC) as functions of the k‐nearest neighbors are shown. Standard deviation values calculated across datasets are shown as bars or filled areas. Corresponding AUC, Qlocal, Qglobal, k‐max values can be found in Supplementary Table ST1.

The similarity between chemical compounds is typically analyzed using the Tanimoto score rather than Euclidean distance [49]. These metrics do not necessarily produce the same neighborhoods, and numerical transformation of the descriptors can significantly alter the results. We assessed whether the neighborhood preservation metrics would differ if the methods were optimized while keeping track of nearest neighbors in the descriptor space with Tanimoto similarity for the low‐ID ChEMBL datasets. All methods preserved more nearest neighbors when using Tanimoto similarity to evaluate neighborhoods in the descriptor space (Supplementary Table ST1). Since the Tanimoto kernel can be used in combination with all methods as a distance metric in the original space (for example, as in kernel PCA [2] and GTM [50]), its usage presents a promising avenue for further enhancing neighborhood preservation.
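To make the point about differing neighborhoods concrete, the following sketch compares the 20 nearest neighbors found with Euclidean distance and with Tanimoto dissimilarity on random count vectors. The toy data, the helper names (`tanimoto_similarity`, `knn_indices`, `pnn`) and the use of the continuous Tanimoto formula for count vectors are assumptions made for illustration, not the code used to produce the reported results.

```python
import numpy as np


def tanimoto_similarity(X):
    """Pairwise Tanimoto similarity for non-negative (binary or count) vectors."""
    X = np.asarray(X, dtype=float)
    dot = X @ X.T
    sq = np.diag(dot)
    denom = sq[:, None] + sq[None, :] - dot
    # Two all-zero vectors give denom == 0; treat them as identical (similarity 1).
    return np.divide(dot, denom, out=np.ones_like(dot), where=denom > 0)


def knn_indices(dist, k):
    """Indices of the k nearest neighbours of each row, excluding the row itself."""
    d = np.array(dist, dtype=float)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1, kind="stable")[:, :k]


def pnn(nn_a, nn_b):
    """Mean fraction of the neighbours listed in nn_a that also appear in nn_b."""
    k = nn_a.shape[1]
    shared = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(shared)) / k


rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 64))  # toy stand-in for count fingerprints
eucl = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
tani = 1.0 - tanimoto_similarity(X)
overlap = pnn(knn_indices(eucl, 20), knn_indices(tani, 20))
print(f"fraction of 20-NN shared by both metrics: {overlap:.2f}")
```

The same `pnn` helper can compare neighbor lists from the descriptor space and from a 2D embedding, which is the sense in which neighborhood preservation is used in this section.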

Neighborhood Preservation for Out‐of‐Sample Dimensionality Reduction

While standard dimensionality reduction techniques can be straightforwardly applied to small and medium‐sized libraries, their application to large (millions to tens of millions) and ultra‐large (over 1 billion) datasets remains challenging because it is time‐consuming and resource‐intensive, and often necessitates the use of certain approximations [51, 52]. To handle such large volumes of data, a common approach is to select a subsample of the entire dataset, often referred to as a frameset or reference set, and then project the remaining data points onto the map built using this subset of the original data [22, 53]. In this case, the DR algorithm should be able to project new (out‐of‐sample) data onto an already built embedding. We assessed the algorithms for the effectiveness of out‐of‐sample projection using a leave‐one‐library‐out (LOLO) scenario. In this scenario, a library was removed from the pool of 18 low‐ID ChEMBL libraries, the method was fitted to the remaining data (the frameset), its parameters were optimized towards neighborhood preservation, the removed (out‐of‐sample) library was projected onto the built embedding, and neighborhood preservation metrics were calculated. On average, GTM demonstrated more robust out‐of‐sample neighborhood preservation compared to other non‐linear methods (Supplementary Table ST1, Figure 6), preserving more neighbors when using Morgan count fingerprints and MACCS keys, although it showed reduced performance with ChemDist embeddings.
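The leave‐one‐library‐out loop described above can be sketched as follows. For brevity the sketch uses a plain NumPy PCA as the projection method, skips the hyperparameter optimization step, and evaluates neighborhoods only among the compounds of the held‐out library; these choices, the random toy data and all function names are assumptions for illustration and do not reproduce the benchmarking code of this study.

```python
import numpy as np


def fit_pca(X, n_components=2):
    """Fit a PCA on the frameset: return the mean and the leading components."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Project (possibly out-of-sample) points onto an already fitted map."""
    return (X - mean) @ components.T


def knn(X, k):
    """Indices of the k nearest neighbours (Euclidean), excluding the point itself."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1, kind="stable")[:, :k]


def pnn(X_high, X_low, k):
    """Mean fraction of the k descriptor-space neighbours kept in the embedding."""
    hi, lo = knn(X_high, k), knn(X_low, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(hi, lo)]))


def lolo_scores(libraries, k=20):
    """Hold out each library in turn, fit on the rest, score the held-out projection."""
    scores = {}
    for name, held_out in libraries.items():
        frameset = np.vstack([X for n, X in libraries.items() if n != name])
        mean, comps = fit_pca(frameset)
        scores[name] = pnn(held_out, project(held_out, mean, comps), k)
    return scores


rng = np.random.default_rng(1)
libraries = {f"lib{i}": rng.normal(loc=i, size=(60, 16)) for i in range(4)}
scores = lolo_scores(libraries, k=10)
print(scores)
```

Any method that exposes separate fitting and projection steps could replace `fit_pca` and `project` in this loop.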

Quantitative Analysis of Chemical Space Maps Visualization Using Scagnostics

While neighborhood preservation is an important parameter for assessing the quality of a low‐dimensional embedding, the primary goal of DR‐based visualization is to present the data in a form that can be easily understood by humans. Such visualizations should reveal data patterns within the dataset. For instance, chemical space maps built in this work show that neighborhood preservation is not evenly distributed across them: in some areas, nearly all of the 20 closest neighbors are preserved, while in other areas, the percentage of preserved neighbors is much lower (Figure 7).

Figure 7.

Scatter and hexbin plot visualizations of chemical space using PCA, t‐SNE, UMAP, and GTM using Morgan count fingerprints as descriptors. These visualizations are shown for one out of 18 low‐ID datasets (CHEMBL3638344) (a) and the combined dataset of 18 low‐ID ChEMBL subsets (b). The color scheme corresponds to PNN (k=20): black indicates all neighbors are preserved, while pale brown indicates none are preserved.

While individual data points can be mostly recognized in smaller datasets (Figure 7a), larger datasets (Figure 7b) present a challenge, as most zones are too dense to explore effectively on a static image. To alleviate this problem, we use a hexagonal grid to render the density of data points covered by the grid. Alternatively, one can apply interpolation techniques such as kernel density estimation or Voronoi diagrams [54, 55]. In contrast to other methods, grid‐based visualization is a built‐in feature of the GTM, allowing data to be visualized not only as scatter plots but also as grid‐based landscapes without the need for auxiliary binning tools, which can be especially attractive for the visualization of large‐scale datasets.
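A hexagonal‐grid rendering of a dense map can be sketched with Matplotlib's `hexbin`. The 2D coordinates and the per‐compound preservation values below are random placeholders, and the grid size is an arbitrary choice, so this is only an illustration of the plotting idea rather than the procedure used for Figure 7.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
xy = rng.normal(size=(5000, 2))            # placeholder 2D embedding
pnn_per_compound = rng.uniform(size=5000)  # placeholder per-compound PNN values

fig, (ax_density, ax_pnn) = plt.subplots(1, 2, figsize=(9, 4))

# Left panel: number of compounds falling into each hexagon.
hb_density = ax_density.hexbin(xy[:, 0], xy[:, 1], gridsize=30, mincnt=1)
ax_density.set_title("Compound density")
fig.colorbar(hb_density, ax=ax_density, label="compounds per cell")

# Right panel: mean neighbourhood preservation of the compounds in each hexagon.
hb_pnn = ax_pnn.hexbin(
    xy[:, 0], xy[:, 1], C=pnn_per_compound, reduce_C_function=np.mean, gridsize=30
)
ax_pnn.set_title("Mean PNN per cell")
fig.colorbar(hb_pnn, ax=ax_pnn, label="mean PNN")

fig.tight_layout()
```

Calling `fig.savefig` with a file name of choice writes the two panels to disk.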

An orthogonal approach to neighborhood preservation for comparing DR techniques is to assess the interpretability of the visualization, specifically how effectively a human can comprehend the patterns shown on the map [56]. The analysis of factors influencing human perception of statistical pattern visualization is an active and evolving area of research [40, 57, 58]. One of the most frequently used sets of metrics for assessing how easily scatter plots can be read is scatterplot diagnostics, commonly known as scagnostics [40, 57, 58, 59]. Scagnostics provide a quantitative way to evaluate the visual characteristics of scatterplots, such as shape, density, skewness, and the presence of outliers, and can help to determine which plots are more likely to be easily understood by human observers [59]. For example, they were found to be aligned with human perception of correlations, clusters, and trends [40].

Scagnostics were calculated for all low‐dimensional embeddings built in this work (Figure 8, Supplementary Figure SF8). They show high variance in the obtained values, indicating that even with similar neighborhood scores, one method may be preferred over another in terms of visualization quality. For instance, scagnostics calculated for embeddings of the dataset CHEMBL3638344 (Figure 8a) highlight different characteristics of GTM, t‐SNE, and UMAP‐based generalizations. UMAP shows clear clustering of compounds, resulting in high Clumpy values, while the GTM plot offers a complementary, more striated representation. Additionally, different descriptors show varying scagnostic values across different methods (Supplementary Figure SF8), providing further options for choosing the most relevant representation for visualization.

Figure 8.

Radar chart representation of scagnostics calculated for scatter plot visualizations of chemical space using PCA, t‐SNE, UMAP, and GTM with parameters optimized for preserving the 20 nearest neighbors using Morgan count fingerprints as descriptors (as shown in Figure 7). These visualizations are shown for CHEMBL3638344 dataset (a) and the combined dataset of 18 low‐ID ChEMBL subsets (b). Color scheme: PCA ‐ blue, t‐SNE ‐ orange, UMAP ‐ green, GTM ‐ red.

Discussion

Benchmarking Dimensionality Reduction Methods For Chemical Space Exploration

Representing chemical structure data embedded in chemical libraries in a manner suitable for chemist comprehension poses a significant challenge. Among the various approaches, one can use similarity heatmaps [20], network‐based approaches [60, 61], scaffolds [62], and DR techniques. The latter represents one of the main strategies, especially in the context of big data [36, 52]. When used properly, DR methods can provide valuable insights into the inner structure of chemical spaces, as demonstrated in numerous studies [20, 21]. While reducing complex data to two dimensions may result in information loss, this information can potentially be recovered by chemists analyzing the maps, as can be seen when combining “human intuition” with machine learning [63, 64]. However, to be maximally useful for a chemist, a good dimensionality reduction method should have the following features [2, 65, 66]:

  1. The produced projections should be a sufficiently accurate lower‐dimensional (2D or 3D) representation of the input data;

  2. The method should provide options for projection of new data points to facilitate library comparison and large chemical space data analysis;

  3. The method should be big‐data compatible, ensuring fast training and new data projection with minimal resources.

The findings of our study in this context are summarized in Figure 9. Among the dimensionality reduction algorithms benchmarked, all non‐linear methods proved effective in neighborhood preservation, outperforming PCA. t‐SNE demonstrated the strongest performance in preserving the closest neighbors, which is not surprising since it is designed to maximize this criterion. On the other hand, GTM offers more robust out‐of‐sample performance and out‐of‐the‐box big‐data compatible visualization due to its grid‐based nature. Therefore, the choice of method should be based on the specific task at hand. For example, if one wants to analyze a small chemical library, t‐SNE might be preferred for its performance. On the other hand, for visualizing large libraries and potentially projecting new compounds onto them, GTM might be the better choice.

Figure 9.

Visual representations of data reduced by PCA, t‐SNE, UMAP, and GTM are displayed. The performance of each technique is assessed based on several criteria: the necessity for hyperparameter tuning, neighborhood preservation quality, interpretability of hyperparameters, robustness of out‐of‐sample neighborhood preservation, and suitability for out‐of‐the‐box big data visualization.

Conclusions

In this work, the effectiveness of commonly used dimensionality reduction techniques for the visualization of chemical space was assessed across three case studies commonly encountered in practice. It was found that non‐linear methods significantly outperformed a linear method (PCA) in neighborhood preservation tasks for several subsets of congeneric organic small molecule compounds. Among non‐linear methods t‐SNE was shown to excel at preserving the very closest neighbors. For out‐of‐sample visualization, commonly used methods like t‐SNE and UMAP were found to demonstrate less robust behavior as compared to the GTM. Additionally, GTM was recognized for its out‐of‐the‐box, big‐data compatible visualization capabilities. However, this work has some limitations, and further improvements can be made to provide a more comprehensive assessment of the performance of DR methods in various scenarios.

Limitations of the Current Work and Future Outlook

Data

The datasets used in this study comprise partially overlapping congeneric series of small organic molecules and thus cover a very small part of chemical space. Further research is necessary to design datasets with diverse distributions of chemical similarity for thorough benchmarking.

A significant aspect of this paper is the focus on compounds from a specific region of chemical space, namely small organic molecules derived from published medicinal chemistry data. This choice was made deliberately to maintain a clear scope for the study. Consequently, combinatorial libraries (e.g., DNA‐encoded libraries, Enamine REAL), which typically exhibit a narrower distribution of chemical similarities and a more densely populated chemical space, were not explored. Additionally, datasets related to materials, polymers, and other chemical entities were not investigated. These areas, where DR techniques are increasingly being applied for visualization [19, 67], present additional layers of complexity and variability. Future studies could expand into these broader chemical spaces to further validate and benchmark DR techniques in diverse contexts.

Algorithms

Only close‐to‐original versions of the non‐linear DR algorithms (t‐SNE, UMAP, GTM) were tested in this paper, while numerous enhancements have been suggested in the literature [1, 68]. These enhanced versions may be more suitable than the “vanilla” algorithms in certain scenarios. For example, the question of algorithm run‐time was not addressed in our comparison, as multiple optimized versions exist that can significantly reduce computation time, including those capable of running on graphics processing units (GPUs) [69, 70]. Therefore, benchmarking the original versions in terms of time efficiency would not yield solid conclusions on the applicability of the methods in general.

The hyperparameter grid was fixed in our studies for both in‐sample and out‐of‐sample scenarios, and we did not specifically attempt to alter the hyperparameter grid for the latter, which may be required in such cases [42]. For example, one can choose to lower the learning rate while projecting new data onto existing t‐SNE embeddings [35]. Alternatively, several parametric versions of both UMAP and t‐SNE have been proposed for out‐of‐sample visualizations, which can be more efficient than the setups used in this paper. For instance, the parametric t‐SNE was successfully applied for the analysis of chemical space [71]. A thorough analysis of the applicability domains [72] of the DR techniques and, more generally, their out‐of‐distribution performance [73, 74] is left for future studies. Additionally, various optimization techniques, such as genetic algorithms [75], can be used for a more thorough search for optimal model hyperparameters.

Methodology

While the metrics used in this study are widely applied in the analysis of DR results, further improvements can be made. For instance, some compounds may have identical or very similar distances in the descriptor space, yet they could be ranked differently, impacting the final metrics. Although this paper investigates the influence of various distance thresholds on the neighborhood preservation score, future research could explore the influence on the other neighborhood preservation metrics, as well as the behavior of alternative types of similarity metrics (e.g., graph edit distance).

A significant challenge posed to dimensionality reduction (DR) techniques for chemical space analysis is the rapid expansion of chemical libraries, which now contain up to 10^26 virtual compounds [76]. In this case, the maps become “too crowded” and lose specific resolution details, as seen in Figure 3c, with too many data points projected onto the same zones. One way to deal with this issue is to organize maps in a hierarchical way [77, 78, 79]. Hierarchical versions of all considered methods were developed [28, 29, 30, 31]. For example, this approach was applied to build a chemical space atlas [81] ‐ a set of hierarchically organized GTMs. These approaches remain to be benchmarked against more recently suggested DR algorithms, such as TMAP [18], which were specifically designed to address the challenges of big data.

In this work, DR techniques were assessed in terms of neighborhood preservation. The visualization of properties on the maps and the use of labeled data, such as biological activity, as an optimization parameter were not evaluated in this study. This is left for future studies evaluating approaches that incorporate labeled information when building visualizations in a supervised or semi‐supervised fashion [28, 82, 83]. Additionally, future work could include a thorough analysis of the chemical relevance of the obtained maps, such as the distribution and preservation of chemical structure patterns like scaffolds [84]. In particular, comparing clustering in the high‐dimensional initial descriptor space with that in latent spaces, both widely used for chemical space analysis and exploration [44, 85, 86], represents an important area for future research. While metrics like scagnostics can reflect the ease of human perception of scatter plots, future studies could explore other chemistry‐relevant aspects of low‐dimensional visualizations.

Overall, while there is an ongoing discussion about how effective DR methods are at revealing patterns within datasets [87], we believe that these methods, when used properly, represent a promising tool for chemical space analysis. The unification of methods under a common theoretical framework [88, 89], along with the development of benchmarking datasets with controlled data complexity as well as protocols for evaluating DR algorithms in various scenarios, will enable a more thorough understanding of which techniques are best suited for specific scenarios.

Supporting Information

Additional supporting information can be found online in the Supporting Information section at the end of this article.

Conflict of Interests

The authors declare no conflicts of interest.

Acknowledgments

The authors thank Erik Egyan for preparing the datasets for the analysis using the in‐house workflow.

Orlov A. A., Akhmetshin T. N., Horvath D., Marcou G., Varnek A., Molecular Informatics 2025, 44, e202400265. 10.1002/minf.202400265

Data Availability Statement

The data that support the findings of this study are openly available in GitHub at https://github.com/AxelRolov/cdrbench.

References

  • 1. B. Ghojogh, M. Crowley, A. Ghodsi, and F. Karray, Elements of Dimensionality Reduction and Manifold Learning (Cham: Springer International Publishing AG, 2023), 10.1007/978-3-031-10602-6.
  • 2. J. A. Lee, and M. Verleysen, Nonlinear Dimensionality Reduction (New York: Springer, 2007).
  • 3. Lovrić M., “Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints”, Pharmaceuticals 14 (2021): 758, 10.3390/ph14080758.
  • 4. R. Xiang, et al. A Comparison for Dimensionality Reduction Methods of Single-Cell RNA-seq Data. Frontiers in Genetics 12 (2021).
  • 5. Y. Zabolotna, et al. Chemography: Searching for Hidden Treasures. Journal of Chemical Information and Modeling (2020): acs.jcim.0c00936, doi: 10.1021/acs.jcim.0c00936.
  • 6. Oprea T. I., and Gottfries J., “Chemography: The Art of Navigating in Chemical Space”, Journal of Combinatorial Chemistry 3 (2001): 157–166, 10.1021/cc0000388.
  • 7. Gaytán-Hernández D., “Art driven by visual representations of chemical space”, Journal of Cheminformatics 15 (2023): 100, 10.1186/s13321-023-00770-4.
  • 8. Sattarov B., “De Novo Molecular Design by Combining Deep Autoencoder Recurrent Neural Networks with Generative Topographic Mapping”, Journal of Chemical Information and Modeling 59 (2019): 1182–1196, 10.1021/acs.jcim.8b00751.
  • 9. Bort W., “Discovery of novel chemical reactions by deep generative recurrent neural network”, Scientific Reports 11 (2021): 3178, 10.1038/s41598-021-81889-y.
  • 10. Bonachera F., Marcou G., Kireeva N., Varnek A., and Horvath D., “Using self-organizing maps to accelerate similarity search”, Bioorganic & Medicinal Chemistry 20 (2012): 5396–5409, 10.1016/j.bmc.2012.04.024.
  • 11. Espadoto M., Martins R. M., Kerren A., Hirata N. S. T., and Telea A. C., “Toward a Quantitative Survey of Dimension Reduction Techniques”, IEEE Transactions on Visualization and Computer Graphics 27 (2021): 2153–2173, 10.1109/TVCG.2019.2944182.
  • 12. Wang K., “Comparative analysis of dimension reduction methods for cytometry by time-of-flight data”, Nature Communications 14 (2023): 1836.
  • 13. M. Vikram, R. Pavan, N. D. Dineshbhai, and B. Mohan, Performance Evaluation of Dimensionality Reduction Techniques on High Dimensional Data. in 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI) 1169–1174 (2019). doi: 10.1109/ICOEI.2019.8862526.
  • 14. van der Maaten L., and Hinton G. E., “Visualizing Data using t-SNE”, Journal of Machine Learning Research 9 (2008): 2579–2605.
  • 15. McInnes L., Healy J., Saul N., and Großberger L., “UMAP: Uniform Manifold Approximation and Projection”, Journal of Open Source Software 3 (2018): 861, 10.21105/joss.00861.
  • 16. T. Tr, Dimensionality Reduction: A Comparative Review.
  • 17. Alsenan S. A., Al-Turaiki I. M., and Hafez A. M., “Feature Extraction Methods in Quantitative Structure–Activity Relationship Modeling: A Comparative Study”, IEEE Access 8 (2020): 78737–78752, 10.1109/ACCESS.2020.2990375.
  • 18. Probst D., and Reymond J.-L., “Visualization of very large high-dimensional data sets as minimum spanning trees”, Journal of Cheminformatics 12 (2020): 12, 10.1186/s13321-020-0416-x.
  • 19. M. Villares, C. M. Saunders, and N. Fey, Comparison of Dimensionality Reduction Techniques for the Visualisation of Chemical Space in Organometallic Catalysis. Artificial Intelligence Chemistry (2024): 100055, doi: 10.1016/j.aichem.2024.100055.
  • 20. Osolodkin D. I., “Progress in visual representations of chemical space”, Expert Opinion on Drug Discovery 10 (2015): 959–973, 10.1517/17460441.2015.1060216.
  • 21. Medina-Franco J. L., Martinez-Mayorga K., Giulianotti M. A., Houghten R. A., and Pinilla C., “Visualization of the Chemical Space in Drug Discovery”, Current Computer-Aided Drug Design 4 (2008): 322–333, 10.2174/157340908786786010.
  • 22. D. Horvath, G. Marcou, and A. Varnek, “Generative topographic mapping in drug design.” Drug Discovery Today: Technologies (2020): S1740674920300044, doi: 10.1016/j.ddtec.2020.06.003.
  • 23. Gaulton A., “ChEMBL: a large-scale bioactivity database for drug discovery”, Nucleic Acids Research 40 (2012): D1100–D1107, 10.1093/nar/gkr777.
  • 24. L. Wilkinson, A. Anand, and R. Grossman, Graph-theoretic scagnostics. in IEEE Symposium on Information Visualization, 2005. INFOVIS 2005. 157–164 (2005). doi: 10.1109/INFVIS.2005.1532142.
  • 25. Horvath D., and Jeandenans C., “Neighborhood behavior of in silico structural spaces with respect to in vitro activity spaces-a novel understanding of the molecular similarity principle in the context of multiple receptor binding profiles”, Journal of Chemical Information and Computer Sciences 43 (2003): 680–690, 10.1021/ci025634z.
  • 26. Sidorov P., Gaspar H., Marcou G., Varnek A., and Horvath D., “Mappability of drug-like space: towards a polypharmacologically competent map of drug-relevant compounds”, Journal of Computer-Aided Molecular Design 29 (2015): 1087–1108, 10.1007/s10822-015-9882-z.
  • 27. Gaspar H. A., Baskin I. I., Marcou G., Horvath D., and Varnek A., “GTM-Based QSAR Models and Their Applicability Domains”, Molecular Informatics 34 (2015): 348–356, 10.1002/minf.201400153.
  • 28. Casciuc I., “Virtual Screening with Generative Topographic Maps: How Many Maps Are Required?”, Journal of Chemical Information and Modeling 59 (2019): 564–572, 10.1021/acs.jcim.8b00650.
  • 29. Lin A., Horvath D., Marcou G., Beck B., and Varnek A., “Multi-task generative topographic mapping in virtual screening”, Journal of Computer-Aided Molecular Design 33 (2019): 331–343, 10.1007/s10822-019-00188-x.
  • 30. L. Albergante, J. Bac, and A. Zinovyev, Estimating the effective dimension of large biological datasets using Fisher separability analysis. at http://arxiv.org/abs/1901.06328 (2019).
  • 31. Morgan H. L., “The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service”, Journal of Chemical Documentation 5 (1965): 107–113, 10.1021/c160017a018.
  • 32. Durant J. L., Leland B. A., Henry D. R., and Nourse J. G., “Reoptimization of MDL Keys for Use in Drug Discovery”, Journal of Chemical Information and Computer Sciences 42 (2002): 1273–1280, 10.1021/ci010132r.
  • 33. Coupry D. E., and Pogány P., “Application of deep metric learning to molecular graph similarity”, Journal of Cheminformatics 14 (2022): 11, 10.1186/s13321-022-00595-7.
  • 34. RDKit: Open-Source Cheminformatics; http://www.rdkit.org.
  • 35. Poličar P. G., Stražar M., and Zupan B., “openTSNE: A Modular Python Library for t-SNE Dimensionality Reduction and Embedding”, Journal of Statistical Software 109 (2024): 1–30.
  • 36. Gaspar H. A., Baskin I. I., Marcou G., Horvath D., and Varnek A., “Chemical Data Visualization and Analysis with Incremental Generative Topographic Mapping: Big Data Challenge”, Journal of Chemical Information and Modeling 55 (2015): 84–94, 10.1021/ci500575y.
  • 37. Zhang Y., Shang Q., and Zhang G., “pyDRMetrics - A Python toolkit for dimensionality reduction quality assessment”, Heliyon 7 (2021): e06199, 10.1016/j.heliyon.2021.e06199.
  • 38. González-Medina M., Prieto-Martínez F. D., Owen J. R., and Medina-Franco J. L., “Consensus Diversity Plots: a global diversity analysis of chemical libraries”, Journal of Cheminformatics 8 (2016): 63, 10.1186/s13321-016-0176-9.
  • 39. Wilkinson L., and Wills G., “Scagnostics Distributions”, Journal of Computational and Graphical Statistics 17 (2008): 473–491, 10.1198/106186008X320465.
  • 40. Lehmann D. J., Hundt S., and Theisel H., “A study on quality metrics vs. human perception: Can visual measures help us to filter visualizations of interest?”, it - Information Technology 57 (2015): 11–21, 10.1515/itit-2014-1070.
  • 41.scagnostics: Compute scagnostics - scatterplot diagnostics.
  • 42. Gove R., Cadalzo L., Leiby N., Singer J. M., and Zaitzeff A., “New guidance for using t-SNE: Alternative defaults, hyperparameter selection automation, and comparative evaluation”, Visual Informatics 6 (2022): 87–97, 10.1016/j.visinf.2022.04.003. [DOI] [Google Scholar]
  • 43. Gandini E., “Molecular Similarity Perception Based on Machine-Learning Models”, International Journal of Molecular Sciences 23 (2022): 6114, 10.3390/ijms23116114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. López-Pérez K., “Molecular similarity: Theory, applications, and perspectives”, Artificial Intelligence Chemistry 2 (2024): 100077, 10.1016/j.aichem.2024.100077. [DOI] [Google Scholar]
  • 45. Bac J., Mirkes E. M., Gorban A. N., Tyukin I., and Zinovyev A., “Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation”, Entropy 23 (2021): 1368, 10.3390/e23101368. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.R. R. Mittal, R. A. McKinnon, and M. J. Sorich, Comparison Data Sets for Benchmarking QSAR Methodologies in Lead Optimization. Journal of Chemical Information and Modeling 49, (2009): 1810–1820. [DOI] [PubMed]
  • 47. Tian T., “Benchmarking compound activity prediction for real-world drug discovery applications”, Communication Chemistry 7 (2024): 1–19, 10.1038/s42004-024-01204-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Rohrer S. G., and Baumann K., “Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening Based on PubChem Bioactivity Data”, Journal of Chemical Information and Modeling 49 (2009): 169–184, 10.1021/ci8002649. [DOI] [PubMed] [Google Scholar]
  • 49. Bajusz D., Rácz A., and Héberger K., “Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?”, Journal of Cheminformatics 7 (2015): 20, 10.1186/s13321-015-0069-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.I. Olier, A. Vellido, and J. Giraldo, Kernel Generative Topographic Mapping. ESANN Machine learning and Computational Intelligence Proceedings (2010): 481-486, ISBN-2-930307-10-2.
  • 51.V. Wei, N. Ivkin, V. Braverman, and A. Szalay, Sketch and Scale: Geo-distributed tSNE and UMAP. at http://arxiv.org/abs/2011.06103 (2020).
  • 52. Lin A., “Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling”, Molecular Informatics 39 (2020): 2000009, 10.1002/minf.202000009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Zheng Q., “From Whole to Part: Reference-Based Representation for Clustering Categorical Data”, IEEE Trans. Neural Netw. Learning Syst. 31 (2020): 927–937, 10.1109/TNNLS.2019.2911118. [DOI] [PubMed] [Google Scholar]
  • 54.R. Whitaker, and I. Hotz, Transformations, Mappings, and Data Summaries. in Foundations of Data Visualization (eds. Chen, M., Hauser, H., Rheingans, P. & Scheuermann, G.) (Springer International Publishing, Cham, 2020). doi:10.1007/978-3-030-34444-36.
  • 55. Cihan Sorkun M., Mullaj D., Koelman J. M. V. A., and Er S., “ChemPlot, a Python Library for Chemical Space Visualization**”, Chemistry–Methods 2 (2022): e202200005. [Google Scholar]
  • 56. Foundations of Data Visualization. (Springer International Publishing, Cham, 2020), doi: 10.1007/978-3-030-34444-3. [DOI]
  • 57.A. Filipowicz, et al. Visual Elements and Cognitive Biases Influence Interpretations of Trends in Scatter Plots. at http://arxiv.org/abs/2310.15406 (2023).
  • 58.A. V. Pandey, J. Krause, C. Felix, J. Boy, and E. Bertini, Towards Understanding Human Similarity Perception in the Analysis of Large Sets of Scatter Plots. in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (Association for Computing Machinery, New York, NY, USA, 2016). doi: 10.1145/2858036.2858155. [DOI]
  • 59. Etemadpour R., Shintree S., and Shereen A. D., “Brain Activity is Influenced by How High Dimensional Data are Represented: An EEG Study of Scatterplot Diagnostic (Scagnostics) Measures”, Journal of Healthcare Informatics Research 8 (2024): 19–49, 10.1007/s41666-023-00145-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Amoroso N., “Making sense of chemical space network shows signs of criticality”, Sci Rep 13 (2023): 21335, 10.1038/s41598-023-48107-3.
  • 61. Scalfani V. F., Patel V. D., and Fernandez A. M., “Visualizing chemical space networks with RDKit and NetworkX”, Journal of Cheminformatics 14 (2022): 87, 10.1186/s13321-022-00664-x.
  • 62. Velkoborsky J., and Hoksza D., “Scaffold analysis of PubChem database as background for hierarchical scaffold-based visualization”, Journal of Cheminformatics 8 (2016): 74, 10.1186/s13321-016-0186-7.
  • 63. P. Llompart, et al. Harnessing Medicinal Chemical Intuition from Collective Intelligence.
  • 64. Teso S., Alkan Ö., Stammer W., and Daly E., “Leveraging explanations in interactive machine learning: An overview”, Frontiers in Artificial Intelligence 6 (2023): 1066049, 10.3389/frai.2023.1066049.
  • 65. P. Wassenaar, P. Guetschel, and M. Tangermann, Approximate UMAP allows for high-rate online visualization of high-dimensional data streams. Preprint at http://arxiv.org/abs/2404.04001 (2024).
  • 66. A. N. Gorban, B. Kégl, D. W. Wunsch, and A. Zinovyev, in Principal Manifolds for Data Visualization and Dimension Reduction (Eds.: A. N. Gorban, B. Kégl, D. W. Wunsch, and A. Zinovyev), Springer, Berlin Heidelberg, 2008, pp. V-XII.
  • 67. Park H., Onwuli A., Butler K., and Walsh A., “Mapping inorganic crystal chemical space”, Faraday Discuss. (2024), 10.1039/D4FD00063C.
  • 68. Bishop C. M., Svensén M., and Williams C. K. I., “Developments of the generative topographic mapping”, Neurocomputing 21 (1998): 203–224, 10.1016/S0925-2312(98)00043-5.
  • 69. D. M. Chan, R. Rao, F. Huang, and J. F. Canny, T-SNE-CUDA: GPU-Accelerated T-SNE and its Applications to Modern Data. in 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) 330–338 (2018). doi: 10.1109/CAHPC.2018.8645912.
  • 70. C. J. Nolet, et al. Bringing UMAP Closer to the Speed of Light with GPU Acceleration.
  • 71. Karlov D. S., Sosnin S., Tetko I. V., and Fedorov M. V., “Chemical space exploration guided by deep neural networks”, RSC Advances 9 (2019): 5151–5157, 10.1039/C8RA10182E.
  • 72. Varnek A., and Baskin I. I., “Chemoinformatics as a Theoretical Chemistry Discipline”, Molecular Informatics 30 (2011): 20–32, 10.1002/minf.201000100.
  • 73. Tossou P., Wognum C., Craig M., Mary H., and Noutahi E., “Real-World Molecular Out-Of-Distribution: Specification and Investigation”, Journal of Chemical Information and Modeling 64 (2024): 697–711, 10.1021/acs.jcim.3c01774.
  • 74. J. Liu, et al. Towards Out-Of-Distribution Generalization: A Survey. Preprint at http://arxiv.org/abs/2108.13624 (2023).
  • 75. Horvath D., Marcou G., and Varnek A., “Generative topographic mapping in drug design”, Drug Discovery Today: Technologies 32–33 (2019): 99–107, 10.1016/j.ddtec.2020.06.003.
  • 76. W. A. Warr, M. C. Nicklaus, C. A. Nicolaou, and M. Rarey, Exploration of Ultralarge Compound Collections for Drug Discovery. Journal of Chemical Information and Modeling 62 (2022): 2021–2034.
  • 77. VanHorn K. C., and Çobanoğlu M. C., “Haisu: Hierarchically supervised nonlinear dimensionality reduction”, PLOS Computational Biology 18 (2022): e1010351, 10.1371/journal.pcbi.1010351.
  • 78. W. E. Marcílio-Jr, D. M. Eler, F. V. Paulovich, and R. M. Martins, HUMAP: Hierarchical Uniform Manifold Approximation and Projection. Preprint at http://arxiv.org/abs/2106.07718 (2023).
  • 79. Tino P., and Nabney I., “Hierarchical GTM: constructing localized nonlinear projection manifolds in a principled way”, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002): 639–656, 10.1109/34.1000238.
  • 80. M. Avellaneda, Hierarchical PCA and Applications to Portfolio Management. Preprint at http://arxiv.org/abs/1910.02310 (2019).
  • 81. Zabolotna Y., “Chemspace Atlas: Multiscale Chemography of Ultralarge Libraries for Drug Discovery”, Journal of Chemical Information and Modeling 62 (2022): 4537–4548, 10.1021/acs.jcim.2c00509.
  • 82. L. Hajderanj, I. Weheliye, and D. Chen, A New Supervised t-SNE with Dissimilarity Measure for Effective Data Visualization and Classification. in Proceedings of the 8th International Conference on Software and Information Engineering 232–236 (Association for Computing Machinery, New York, NY, USA, 2019). doi: 10.1145/3328833.3328853.
  • 83. B. Ghojogh, A. Ghodsi, F. Karray, and M. Crowley, Uniform Manifold Approximation and Projection (UMAP) and its Variants: Tutorial and Survey. Preprint at http://arxiv.org/abs/2109.02508 (2021).
  • 84. Zahoránszky-Kőhalmi G., Wan K. K., and Godfrey A. G., “Hilbert-curve assisted structure embedding method”, Journal of Cheminformatics 16 (2024): 87, 10.1186/s13321-024-00850-z.
  • 85. Kireeva N., “Generative Topographic Mapping (GTM): Universal Tool for Data Visualization, Structure-Activity Modeling and Dataset Comparison”, Molecular Informatics 31 (2012): 301–312, 10.1002/minf.201100163.
  • 86. K. L. Pérez, V. Jung, L. Chen, K. Huddleston, and R. A. Miranda-Quintana, “Efficient clustering of large molecular libraries”, bioRxiv (2024): 10.1101/2024.08.10.607459.
  • 87. V. Marx, Seeing data as t-SNE and UMAP do. Nature Methods 1–4 (2024). doi: 10.1038/s41592-024-02301-x.
  • 88. A. Ravuri, and N. D. Lawrence, Towards One Model for Classical Dimensionality Reduction: A Probabilistic Perspective on UMAP and t-SNE. Preprint at http://arxiv.org/abs/2405.17412 (2024).
  • 89. A. Ravuri, F. Vargas, V. Lalchand, and N. D. Lawrence, Dimensionality Reduction as Probabilistic Inference. Preprint at http://arxiv.org/abs/2304.07658 (2023).

Supplementary Materials

As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials are peer reviewed and may be re‐organized for online delivery, but are not copy‐edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.

Supporting Information

Data Availability Statement

The data that support the findings of this study are openly available in GitHub at https://github.com/AxelRolov/cdrbench.


Articles from Molecular Informatics are provided here courtesy of Wiley
