Journal of Cheminformatics. 2020 Feb 12;12:12. doi: 10.1186/s13321-020-0416-x

Visualization of very large high-dimensional data sets as minimum spanning trees

Daniel Probst, Jean-Louis Reymond
PMCID: PMC7015965  PMID: 33431043

Abstract

The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrary high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most widely used chemistry data sets, including databases of molecules such as ChEMBL, FDB17, the Natural Products Atlas, and DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature.

Keywords: Data visualization, Chemistry databases, Algorithms, Big data, Dimensionality reduction

Introduction

The recent development of new and often very accessible frameworks and powerful hardware has enabled the implementation of computational methods to generate and collect large high-dimensional data sets and created an ever-increasing need to explore as well as understand these data [1–9]. Generally, large high-dimensional data sets are matrices where rows are samples and columns are measured variables, each column defining a dimension of the space which contains the data. Visualizing such data sets is challenging because reducing the dimensionality, which is required in order to make the data visually interpretable for humans, is both lossy and computationally expensive [10].

Large high-dimensional data sets are frequently used in the chemical sciences. For instance, the ChEMBL database (n=1,159,881) of bioactive molecules from the scientific literature and their associated biological assay data is used daily in the area of drug discovery [11]. Further examples of large databases containing molecules include FDB17 (n=10,101,204), a fragment-like subset of the enumerated database GDB17 listing theoretically possible molecules up to 17 atoms [12–14], and DSSTox (n=848,816), containing molecules investigated for toxicity [15]. Examples of smaller data sets include the Natural Products Atlas (n=24,594), collecting microbially-derived natural products [16]; Drugbank (n=9300), listing molecules marketed or investigated as drugs [17]; and the MoleculeNet benchmark, containing a collection of 16 data sets of small organic molecules [18].

To visualize such databases, simple linear dimensionality reduction methods such as principal component analysis and similarity mapping readily produce 2D- or 3D-representations of global features [19–25]. However, local features defining the relationships between close or even nearest neighbor (NN) molecules, which are very important to understand the structure of data, are mostly lost, limiting the applicability of linear dimensionality reduction methods for visualization. The important NN relationships are much better preserved using non-linear manifold learning algorithms, which assume that the data lies on a lower-dimensional manifold embedded within the high-dimensional space. Algorithms such as nonlinear principal component analysis (NLPCA), t-distributed stochastic neighbor embedding (t-SNE), and more recently uniform manifold approximation and projection (UMAP) are based on this assumption [26–28]. Other techniques used are probabilistic generative topographic maps (GTM) and self-organizing maps (SOM), which are based on artificial neural networks [29, 30]. However, these algorithms have time complexities between at least $O(n^{1.14})$ and $O(n^{5})$, limiting the size of the data sets that can be visualized [31]. The same limitations in terms of data set size apply when distributing data in a tree by implementing the neighbor joining algorithm or similar methods used to create phylogenetic trees [32, 33]. This limiting behavior has been documented by the ChemTreeMap tool, which can only visualize up to approximately 10,000 data points (molecules or clusters of molecules) [34]. Due to the described challenges, large scientific data sets are generally visualized in aggregated or reduced form [35, 36].

Here we present an algorithm, named TMAP (Tree MAP), to generate and distribute intuitive visualizations of large data sets on the order of up to $10^{7}$ data points with arbitrary dimensionality in a tree. Our method is based on a combination of locality sensitive hashing, graph theory, and modern web technology, and also integrates into established data analysis and plotting workflows. This tree-based layout facilitates visual inspection of the data with a high resolution by explicitly visualizing the closest distance between clusters and the detailed structure of clusters through branches and sub-branches. We demonstrate the performance of TMAP with toy data sets from computer graphics and with ChEMBL subsets of different size and composition, and show that it surpasses comparable algorithms such as t-SNE and UMAP in terms of time and space complexity. We further exemplify the use of TMAP for visualizing large high-dimensional data sets from chemistry as well as from further scientific fields (Table 1).

Table 1.

Data sets visualized using TMAP

| Data set | Description | Data type | Size |
| --- | --- | --- | --- |
| Toy data sets | | | |
| COIL20 | Gray-scale images of 20 objects, each rotated 72 times at 5° intervals | Images | 1440 |
| MNIST | Gray-scale images of handwritten digits | Images | 70,000 |
| Fashion MNIST | Gray-scale images of fashion items from 10 classes | Images | 70,000 |
| Chemical compound databases and PDB | | | |
| ChEMBL | Bioactive molecules with drug-like properties | SMILES | 1,159,881 |
| FDB17 and ChEMBL | Fragment database (up to 17 atoms) and ChEMBL | SMILES | 11,261,085 |
| Natural Products Atlas | Bacterial and fungal natural products | SMILES | 24,594 |
| DSSTox | U.S. EPA information on toxicity of chemicals | SMILES | 848,816 |
| PDB | Information on the 3D structures of proteins and nucleic acids | Atomic coordinates | 131,236 |
| Drugbank | Approved, investigational, experimental, and withdrawn drugs | SMILES | 9300 |
| MoleculeNet benchmark data sets | | | |
| QM8 | Subset of GDB-13 with associated QM properties | SMILES | 21,786 |
| QM9 | Subset of GDB-13 with associated QM properties | SMILES | 133,885 |
| ESOL | Common organic small molecules with solubility information | SMILES | 1128 |
| FreeSolv | Calculated and experimental hydration free energy of molecules | SMILES | 642 |
| Lipophilicity | Experimental results of logD for organic small molecules | SMILES | 4200 |
| PCBA | PubChem subset with biological activities | SMILES | 437,929 |
| MUV | PubChem subset for virtual screening validation | SMILES | 93,087 |
| HIV | Experimental results for HIV replication inhibition | SMILES | 41,127 |
| PDBbind | Binding affinities for ligands in biomolecular complexes | SMILES | 11,908 |
| BACE | IC50 values against BACE-1 (human β-secretase 1) | SMILES | 1513 |
| BBBP | Ability of organic molecules to cross the blood–brain barrier | SMILES | 2039 |
| Tox21 | Toxicity measurements on 12 targets | SMILES | 7831 |
| ToxCast | Toxicity measurements on more than 600 targets | SMILES | 8575 |
| SIDER | Adverse drug reactions of a selection of marketed drugs | SMILES | 1427 |
| ClinTox | FDA-approved drugs and drugs that failed clinical trials for toxicity reasons | SMILES | 1478 |
| Other data sets | | | |
| PubMed Central | Full-text archive of biomedical and life sciences journal literature | Text | 327,628 |
| Gutenberg | A subset of public domain Project Gutenberg eBooks | Text | 3036 |
| NIPS | Abstracts of NIPS conference papers from 1987 to 2015 | Text | 7241 |
| RNA sequencing | A subset of the PANCAN database | Gene expression | 801 |
| ProteomeHD | Human proteome co-regulation data | Co-regulation scores | 5013 |
| Flow cytometry | Data gathered from a flow cytometry experiment | Signal intensity | 436,877 |
| MiniBooNE | Data gathered by the MiniBooNE particle physics experiment | Particle ID | 130,065 |

Methods

Given an arbitrary data set as an input, TMAP encompasses four phases: (I) LSH forest indexing [37, 38], (II) construction of a c-approximate k-nearest neighbor graph, (III) calculation of a minimum spanning tree (MST) of the c-approximate k-nearest neighbor graph [39], and (IV) generation of a layout for the resulting MST [40].

During phase I, the input data are indexed in an LSH forest data structure, enabling c-approximate k-nearest neighbor (k-NN) searches with a time complexity sub-linear in n. Text and binary data are encoded using the MinHash algorithm, while integer and floating-point data are encoded using a weighted variation of the algorithm [41–43]. The LSH Forest data structure for both MinHash and weighted MinHash data is initialized with the number of hash functions d used in encoding the data, and the number of prefix trees l. Increasing either parameter increases main memory usage; higher values of l additionally decrease query speed. The effect of parameters d and l on the final visualization is shown in Additional file 1: Fig. S1. The use of a combination of (weighted) MinHash and LSH Forest, which supports fast estimation of the Jaccard distance between two binary sets, has been shown to perform very well for molecules [44]. Note that other data structures and algorithms implementing a variety of different distance metrics may show better performance on other data and can be used as drop-in replacements of phase I.
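The MinHash encoding used in phase I can be illustrated with a minimal, standard-library sketch. It is not the TMAP implementation: the function names, the use of salted BLAKE2b as the family of d hash functions, and the toy feature sets are assumptions made for illustration only. The fraction of matching signature positions estimates the Jaccard similarity of the two input sets.

```python
import hashlib


def minhash_signature(items, d=256):
    """Return a d-value MinHash signature for a non-empty set of strings.

    Each of the d hash functions is emulated by salting BLAKE2b with the
    function index (an illustrative choice, not the one used by TMAP).
    """
    signature = []
    for seed in range(d):
        salt = seed.to_bytes(8, "little")
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(
                    item.encode("utf-8"), digest_size=8, salt=salt
                ).digest(),
                "little",
            )
            for item in items
        )
        signature.append(smallest)
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, used here only to check the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical feature sets standing in for two binary fingerprints.
set_a = {f"feature{i}" for i in range(0, 80)}
set_b = {f"feature{i}" for i in range(40, 120)}

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)
print("exact:", exact_jaccard(set_a, set_b))
print("estimated:", estimate_jaccard(sig_a, sig_b))
```

A larger d tightens the estimate at the cost of memory, which matches the role of parameter d described above.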

In phase II, an undirected weighted c-approximate k-nearest neighbor graph (ck-NNG) is constructed from the data points indexed in the LSH forest, where an augmented variant of the LSH forest query algorithm we previously introduced for virtual screening tasks is used to increase efficiency [45]. The ck-NNG construction phase takes two arguments, namely $k$, the number of nearest neighbors to be searched for, and $k_c$, the factor used by the augmented query algorithm. The variant of the query algorithm increases the time complexity of a single query from $O(\log n)$ to $O(k \cdot k_c + \log n)$, resulting in an overall time complexity of $O(n(k \cdot k_c + \log n))$ for the ck-NNG construction, where in practice $k \cdot k_c > \log n$. The edges of the ck-NNG are assigned the Jaccard distance of their incident vertices as their weight. Depending on the distribution and the hashing of the data, the ck-NNG can be disconnected (1) if outliers exist which have a Jaccard distance of 1.0 to all other data points and are therefore not connected to any other nodes or (2) if, due to highly connected clusters of size $k$ in the Jaccard space, connected components are created. However, the following phases are agnostic to whether this phase yields a disconnected graph. The effect of parameters $k$ and $k_c$ on the final visualization is shown in Additional file 1: Fig. S2. Alternatively, an arbitrary undirected graph can be supplied to the algorithm as a (weighted) edge list.
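The structure of the graph built in phase II can be sketched as follows. This example swaps the approximate LSH forest query for an exact brute-force search, so it only illustrates the output (an undirected edge list weighted by Jaccard distance), not the efficiency of the method. The function names and toy sets are hypothetical. Points at distance 1.0 from every other point stay unconnected, as described for outliers above.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_graph(points, k):
    """Build an undirected weighted k-NN graph by exact brute force.

    Returns a dict mapping (i, j) with i < j to the Jaccard distance.
    Neighbors at distance 1.0 (no shared element) are never connected.
    """
    edges = {}
    for i, p in enumerate(points):
        candidates = []
        for j, q in enumerate(points):
            if i == j:
                continue
            dist = jaccard_distance(p, q)
            if dist < 1.0:
                candidates.append((dist, j))
        for dist, j in sorted(candidates)[:k]:
            edges[(min(i, j), max(i, j))] = dist
    return edges


# Toy data: two groups of similar sets and one outlier (index 5).
points = [
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c", "d"},
    {"x", "y"},
    {"x", "y", "z"},
    {"q"},
]
knn_edges = knn_graph(points, k=2)
for (i, j), weight in sorted(knn_edges.items()):
    print(i, j, round(weight, 3))
```

The resulting graph has two connected components plus an isolated vertex, which is the kind of disconnected input the later phases are stated to tolerate.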

During phase III, a minimum spanning tree (MST) is constructed on the weighted ck-NNG using Kruskal’s algorithm, which represents the central and differentiating phase of the described algorithm. Whereas comparable algorithms such as UMAP or t-SNE attempt to embed pruned graphs, TMAP removes all cycles from the initial graph using the MST algorithm, significantly lowering the computational complexity of a low-dimensional embedding. The algorithm reaches a globally optimal solution by applying a greedy approach of selecting locally optimal solutions at each stage, properties which are also desirable in data visualization. The time complexity of Kruskal’s algorithm is $O(E \log V)$, where $E$ and $V$ are the numbers of edges and vertices of the ck-NNG, rendering this phase negligible compared to phase II in terms of execution time. In the case of a disconnected ck-NNG, a minimum spanning forest is created.
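As a minimal sketch of phase III, the following pure-Python version of Kruskal’s algorithm uses a union-find structure and returns a minimum spanning forest when the input graph is disconnected. The function name and the toy edge list are illustrative assumptions; TMAP itself relies on a compiled implementation.

```python
def kruskal_forest(num_vertices, weighted_edges):
    """Return the edges of a minimum spanning forest.

    weighted_edges is an iterable of (weight, u, v) tuples. Edges are
    visited in order of increasing weight and kept only if they join
    two different components, so no cycle is ever created.
    """
    parent = list(range(num_vertices))

    def find(v):
        # Follow parent links to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            forest.append((u, v, weight))
    return forest


# Toy graph: a triangle (0, 1, 2), a separate edge (3, 4), isolated vertex 5.
toy_edges = [(0.5, 0, 1), (0.2, 1, 2), (0.9, 0, 2), (0.3, 3, 4)]
forest = kruskal_forest(6, toy_edges)
print(forest)
```

The heaviest edge of the triangle is dropped because it would close a cycle, and the two components are kept as separate trees.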

Phase IV lays out the tree on the Euclidean plane. As the MST is unrooted and to keep the drawing compact, the tree is laid out using a graph layout algorithm rather than a tree layout algorithm. In order to draw MSTs of considerable size (millions of vertices), a spring-electrical model layout algorithm with multilevel multipole-based force approximation is applied. This algorithm is provided by the Open Graph Drawing Framework (OGDF), a modular C++ library [40]. In addition, the use of the OGDF allows for effortless adjustments to the graph layout algorithm in terms of both aesthetics and computational time requirements. Whereas several parameters can be configured for the layout phase, only parameter p must be adjusted based on the size of the input data set (Additional file 1: Fig. S3). This phase constitutes the bottleneck regarding computational complexity.
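To give an intuition for the spring-electrical model, the sketch below implements a naive force-directed layout in the style of Fruchterman and Reingold. It is a deliberately simplified stand-in: it computes all pairwise repulsions directly and therefore scales quadratically, whereas the OGDF algorithm used by TMAP relies on multilevel and multipole approximations. All names and constants are assumptions chosen for illustration.

```python
import math
import random


def spring_layout(num_vertices, edges, iterations=200, seed=0):
    """Naive spring-electrical layout returning a list of [x, y] positions."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
           for _ in range(num_vertices)]
    ideal = 1.0 / math.sqrt(num_vertices)  # ideal edge length
    temperature = 0.1                      # maximum step per iteration

    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(num_vertices)]

        # Repulsive ("electrical") forces between every pair of vertices.
        for i in range(num_vertices):
            for j in range(i + 1, num_vertices):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = ideal * ideal / dist
                disp[i][0] += dx / dist * force
                disp[i][1] += dy / dist * force
                disp[j][0] -= dx / dist * force
                disp[j][1] -= dy / dist * force

        # Attractive ("spring") forces along the tree edges.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / ideal
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force

        # Move each vertex along its net force, capped by the temperature.
        for i in range(num_vertices):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temperature *= 0.98  # cool down so the layout settles

    return pos


# Lay out a small path-shaped tree 0-1-2-3-4.
tree_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
layout = spring_layout(5, tree_edges)
for vertex, (x, y) in enumerate(layout):
    print(vertex, round(x, 3), round(y, 3))
```

Because a tree has only n-1 edges, the attractive term is cheap; the pairwise repulsion is the part that the multipole approximation accelerates for large inputs.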

Results and discussion

TMAP performance assessment with toy data sets and ChEMBL subsets

The quality of our TMAP algorithm is first assessed by comparing TMAP and UMAP to visualize the common benchmarking data sets MNIST, FMNIST, and COIL20 (Fig. 1). UMAP generally represents clusters as tightly packed patches and tries to reach maximal separation between them. On the other hand, TMAP visualizes the relations between, as well as within, clusters as branches and sub-branches. While UMAP can represent the circular nature of the COIL20 subsets, TMAP cuts the circular clusters at the edge of largest difference and joins subsets through one or more edges of smallest difference (Fig. 1a, b). However, the plot shows that this removal of local connectivity leads to an untangling of highly similar data (shown in dark green, orange, dark red, dark purple, and light blue). This behavior has been assessed and compared to UMAP in Additional file 1: Figures S4 and S5, where it is shown that both TMAP and UMAP have to sacrifice locality preservation for more complex examples. For the MNIST and FMNIST data sets, the tree structure results in a higher resolution of both variances and errors within clusters as it becomes apparent how sub clusters (branches within clusters) are linked and which true positives connect to false positives (Fig. 1c–f).

Fig. 1.

Comparison between TMAP and UMAP on benchmark data sets. Please use the interactive versions of the TMAP visualizations at http://tmap.gdb.tools to see images associated with each point on the map. TMAP explicitly visualizes the relations between as well as within clusters. a, b While UMAP represents the circular nature of the COIL20 subsets, TMAP cuts the circular clusters at the edge of largest difference and joins clusters through an edge of smallest difference. c–f For the MNIST and FMNIST data sets, the tree structure allows for a higher resolution of both variances and errors within clusters as it becomes apparent how sub clusters (branches within clusters) are linked and which true positives connect to false positives. The image data of all three sets was binarized using the average intensity per image as a threshold

In a second, more applied comparison example, we visualize data from ChEMBL using TMAP and UMAP. For this analysis, molecular structures are encoded using ECFP4 (extended connectivity fingerprint up to 4 bonds, 512-D binary vector), a molecular fingerprint which encodes circular substructures and performs well in virtual screening and target prediction [46–48]. We consider a subset St of the top 10,000 ChEMBL compounds by insertion date, as well as a random subset Sr of 10,000 ChEMBL molecules.

Taking the more homogeneous set St as an input, the 2D-maps produced by each representation, plotted using the Python library matplotlib, illustrate that TMAP, which distributes clusters in branches and subbranches of the MST, produces a much more even distribution of compounds on the canvas compared to UMAP, thus enabling better visual resolution (Fig. 2a, b). Furthermore, in a visualization of the heterogeneous set Sr, nearest neighbor relationships (locality) are better preserved in TMAP compared to UMAP, as illustrated by the positioning of the 20 structurally nearest neighbors of compound CHEMBL370160 [2, 49] reported as a potent inhibitor of human tyrosine-protein kinase SYK. The 20 structurally similar nearest neighbors are defined as 20 nearest neighbors in the original 512-dimensional fingerprint space. TMAP directly connects the query compound to three of the 20 nearest neighbors, CHEMBL3701630, CHEMBL3701611, and CHEMBL38911457, its nearest, second nearest, and 15th nearest neighbor respectively. The nearest neighbors 1 through 7 are all within a topological distance of 3 around the query (Fig. 2c). In contrast, UMAP has positioned nearest neighbors 2, 3, 9, and 18, among several even more distant data points, closer to the query than the nearest neighbor from the original space (Fig. 2d). Indeed, TMAP preserves locality in terms of retaining 1-nearest neighbor relationships much better than UMAP, applying both topological and Euclidean metrics (Fig. 2e, f; Additional file 1: Fig. S6). The quality of the preservation of locality largely depends on parameter d, with adjustments to parameters k and kc only having a minor influence (Additional file 1: Fig. S7). Moreover, TMAP yields reproducible results when running on identical parameters and input data, whereas results of comparable algorithms such as UMAP change considerably with every run (Additional file 1: Fig. S8) [26].
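The topological distance used in this comparison is the number of tree edges separating two data points. A minimal sketch of how it could be computed with a breadth-first search is given below; the function name, the toy tree, and the ranked neighbor list are hypothetical and serve only to illustrate the metric.

```python
from collections import deque


def topological_distances(num_vertices, tree_edges, source):
    """Return {vertex: number of edges from source} via breadth-first search."""
    adjacency = [[] for _ in range(num_vertices)]
    for u, v in tree_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return distances


# Toy tree with vertex 0 as the query point.
toy_tree = [(0, 1), (1, 2), (1, 3), (3, 4)]
hops = topological_distances(5, toy_tree, source=0)

# Hypothetical ranking of true nearest neighbors in the original space.
true_neighbors = [1, 3, 4]
print([hops[v] for v in true_neighbors])
```

A neighbor is directly connected to the query when its topological distance is 1, which is the criterion applied to the three directly connected neighbors above.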

Fig. 2.

Comparing TMAP and UMAP for visualizing ChEMBL. The first n compounds St (a, b, e) and a random sample Sr (c, d, f), each of size n=10,000, were drawn from the 512-D ECFP-encoded ChEMBL data set to visualize the distribution of biological entity classes and k-nearest neighbors respectively. a TMAP lays out the data as a single connected tree, whereas (b) UMAP draws what appears to be a highly disconnected graph, with the connection between components becoming impossible to ascertain. TMAP keeps the intra- and inter-cluster distances at the same magnitude, increasing the visual resolution of the plot. c, d The 20 nearest neighbors of a randomly selected compound from a random sample. c TMAP directly connects the query compound to three of the 20 nearest neighbors (1, 2, 15); nearest neighbors 1 through 7 are all within a topological distance of 3 around the query compound. d The closest nearest neighbors of the same query compound in the UMAP visualization are true nearest neighbors 2, 3, 18, 9, and 1, with 1 being the farthest of the five. e, f Ranked distances from true nearest neighbor in original high dimensional space after embedding based on topological and Euclidean distance for data sets St and Sr respectively. g Computing the coordinates for a random sample (n=1,000,000) highlights the running time behavior of TMAP and allows an inspection of the time and space requirements of the different phases of the algorithm. Four random samples increasing in size (n=10,000, n=100,000, n=500,000, and n=1,000,000) detail the differences in memory usage a (h) and running time t (i) between TMAP and UMAP. For n=10,000: t(TMAP)=4.865 s, a(TMAP)=0.223 GB; t(UMAP)=20.985 s, a(UMAP)=0.383 GB. For n=100,000: t(TMAP)=33.485 s, a(TMAP)=1.12 GB; t(UMAP)=115.661 s, a(UMAP)=2.488 GB. For n=500,000: t(TMAP)=175.89 s, a(TMAP)=4.521 GB; t(UMAP)=3,577.768 s, a(UMAP)=18.854 GB. For n=1,000,000: t(TMAP)=354.682 s, a(TMAP)=8.553 GB; t(UMAP)=41,325.944 s, a(UMAP)=48.507 GB

In terms of calculation times, while TMAP and UMAP have comparable running time t and memory usage a for small random subsets of the 512-D ECFP-encoded ChEMBL data set (n=10,000 and n=100,000), TMAP significantly outperforms UMAP for larger random subsets (n=500,000 and n=1,000,000) (Fig. 2h, i). Further insight into the computational behavior of TMAP is provided by analyzing running times for the different phases based on a larger subset (n=1,000,000) of the ECFP4-encoded ChEMBL data set (Fig. 2g). During phase I of the algorithm, which accounts for 180 s of the execution time and approximately 5 GB of main memory usage, data is loaded and indexed in the LSH Forest data structure in chunks of 100,000, as expressed by 10 distinct jumps in memory consumption. The construction of the ck-NNG during phase II requires a negligible amount of main memory and takes approximately 110 s. MST creation (phase III) takes 10 s and occupies a further 2 GB of main memory, of which approximately 1 GB is retained to store the tree data structure. The graph layout algorithm (phase IV) requires 2 GB throughout 55 s, after which the algorithm completes with a total wall clock run time of 355 s and peak main memory usage of 8.553 GB.

Note that TMAP supports Jaccard similarity estimation through MinHash and weighted MinHash for binary and weighted sets, respectively. While the Jaccard metric is very suitable for chemical similarity calculations based on molecular fingerprints, it may not be the best available option for other data sets. However, there exists a wide range of LSH families supporting distance and similarity metrics such as Hamming distance, $l_p$ distance, Levenshtein distance, or cosine similarity, which are compatible with TMAP [50, 51]. Furthermore, the modularity of TMAP allows arbitrary nearest-neighbor-graph creation techniques to be plugged in and existing graphs to be loaded from files.

TMAPs of small molecule data sets: ChEMBL, FDB17, DSSTox, and the Natural Products Atlas

The high performance and relatively low memory usage of TMAP, as well as the ability to generate highly detailed and interpretable representations of high-dimensional data sets, is illustrated here by interactive visualization of a series of small molecule data sets available in the public domain. In these examples we use MHFP6 (512 MinHash permutations), a molecular fingerprint related to ECFP4 but with better performance for virtual screening tasks and the ability to be directly indexed in an LSH Forest data structure, which considerably speeds up computation for large data sets [45].

As a first example, we discuss the TMAP of the full data set of the ChEMBL database containing the 1,159,881 ChEMBL compounds associated with biological assay data. TMAP completes the calculation within 613 s with a peak memory usage of 20.562 GB. Note that approximately half of the main memory usage is accounted for by SMILES, activities, and biological entity classes which are loaded for later use in the visualization. To facilitate data analysis, the coordinates computed by TMAP are exported as an interactive portable HTML file using Faerun, where molecules are displayed using the JavaScript library SmilesDrawer (Fig. 3a) [25, 52].

Fig. 3.

TMAP visualization of ChEMBL, FDB17, DSSTox, and the Natural Products Atlas in the MHFP6 chemical space. Please use the interactive versions at https://tmap.gdb.tools to visualize molecular structures associated with each point. a Visualization of all ChEMBL compounds associated with biological assay data (n=1,159,881) colored by target class. The inset shows molecules with a high binding affinity for serotonin, norepinephrine, and dopamine neurotransmitters (cyan); inhibitors of the phenylethanolamine N-methyltransferase (orange); and structurally related compounds with high binding affinities for nicotinic acetylcholine receptors and inhibitory effects on cytochrome p450s (red, dark blue). b The ChEMBL data set was merged with fragment database (FDB17) compounds (n=11,261,085) and visualized. FDB17 molecules are shown in light gray. The inset shows a branch of steroid and steroid-like ChEMBL compounds, as well as dominantly FDB17 branches which are sparsely populated by ChEMBL molecules. c Visualization of DSSTox compounds colored by reported toxicity level. The inset shows a subtree containing a high number of toxic compounds structurally similar or related to naphthalenes and other polycyclic aromatic hydrocarbons. d The Natural Products Atlas chemical space colored by origin genus of the 9 largest groups. The inset shows that structurally similar compounds are grouped into distinct branches and subbranches and are usually produced by plants and fungi from the same genus

Analyzing the distribution of molecules on the tree shows that TMAP groups molecules according to their structure and their biological activity, accurately reflecting similarities calculated in the high-dimensional MHFP6 space. This is well illustrated for a subset of the map (Fig. 3a, inset). In this area of the map, data points in cyan indicate molecules with a high binding affinity for serotonin, norepinephrine, and dopamine neurotransmitters in two connected branches (right side of inset), while data points in orange show inhibitors of the phenylethanolamine N-methyltransferase (PNMT) (left side of inset), and red and dark blue data points indicate nicotinic acetylcholine receptor (nAChR) ligands and cytochrome p450 (CYP) inhibitors, respectively.

As a second example, we visualize the ChEMBL set merged with FDB17 (n=10,101,204) into a superset of size n=11,261,085 (Fig. 3b), which corresponds to the largest data set that TMAP can successfully handle. As above, the TMAP 2D-layout accurately reflects structural and functional similarities computed in the high-dimensional MHFP6 space. In this TMAP visualization, the majority of ChEMBL compounds accumulate in closely connected clusters (branches) due to the prevalence of aromatic carbocycles. A notable exception is a relatively sizable branch of steroids and steroid-like compounds, which is connected to a branch of FDB17 molecules containing non-aromatic 5-membered carbocycles and ketones (Fig. 3b, inset). Many more detailed insights can be gained by inspecting the interactive map in Faerun (http://tmap-fdb.gdb.tools).

Further examples include MHFP6-encoded compounds from the Distributed Structure-Searchable Toxicity (DSSTox) Database (n=848,816) and the Natural Products Atlas (n=24,594). Visualizing DSSTox and coloring the resulting tree by toxicity rating, TMAP creates several subtrees and branches representing structural regions with a high incidence of highly toxic compounds (shown in red, Fig. 3c). An example of such a subtree contains naphthalenes and other polycyclic aromatic hydrocarbons (Fig. 3c, inset). The TMAP tree of the Natural Products Atlas was colored according to origin genus and reveals that branches and subbranches containing distinct substructures usually correlate with a certain genus, such as various combinations of phenols, fused cyclopentanes, lactones, and steroids produced by the fungal genus Ganoderma (colored purple in Fig. 3d, inset).

Visualization of the MoleculeNet benchmark data sets

We further illustrate the use of TMAP by visualizing MoleculeNet, a benchmark for molecular machine learning which has found wide adoption in cheminformatics and encompasses 16 data sets ranging in size and composition (Table 1) [18]. As for the other small molecule data sets above, we computed MHFP6 fingerprints of the associated molecules and the corresponding TMAPs, which we then color-coded according to various numerical values available in the benchmarks. The procedure was applied to all MoleculeNet data sets except for QM7/b, for which no SMILES have been provided.

The resulting TMAP representations, accessible at the TMAP website (http://tmap.gdb.tools), reveal the detailed structure of the data sets as well as the behavior of methods applied to these data sets as a function of the chemical structures of the molecules. For example, TMAPs of the QM8 and QM9 data sets (n=21,786 and n=133,885), which contain small molecules and DFT-modelled parameters, reveal relationships between molecular structures and the various computed physico-chemical values. For instance, the TMAP of the QM8 data set color-coded by the oscillator strengths of the lowest two singlet electronic states reveals how the value correlates with molecular structure and explains the performance differences in machine learning models trained on Coulomb matrices versus those trained on structure-sensitive molecular fingerprints [53]. In the case of the ESOL data set containing measured and calculated water solubility values of common small molecules (n=1128), its TMAP color-coded with the difference between computed and measured values reveals the limitation of the ESOL model when estimating solubility of polycyclic aromatic hydrocarbons and compounds containing pyridines. For the FreeSolv data set (n=642) containing small molecules and their measured and calculated hydration free energy in water, the TMAP visualization hints at possible limitations of the method when calculating hydration free energies of sugars. Finally, for the MUV data set (n=93,087), which contains active small drug-like molecules against 17 different protein targets mixed in each case with inactive decoy molecules, the various TMAPs reveal differences in the structural distribution of actives among decoys. Actives are usually well distributed but appear to form clusters in certain subsets (e.g. MUV-548 and MUV-846), explaining the generally higher performance of fingerprint benchmarks for these subsets [47].

Application to other scientific data sets

We further illustrate the general applicability of TMAP to visualize data sets from the fields of linguistics, biology, and particle physics. All produced maps are available as interactive Faerun plots on the TMAP website (http://tmap.gdb.tools).

Our first example concerns visualization of the RCSB Protein Data Bank, which contains experimental 3D-structures of proteins and nucleic acids (n=131,236) [54]. The PDB files were extracted from the Protein Data Bank and encoded using the protein shape fingerprint 3DP (136-D integer vector, 256 weighted MinHash samples). 3DP encodes the structural shape of large molecules stored as PDB files based on through-space distances of atoms [22]. Processing data extracted from the PDB and indexed using a weighted variant of MinHash demonstrates the ability of TMAP to visualize both global and local structure, improving on previous efforts on the visualization of the database [22, 55]. The global structure of the 3DP-encoded PDB data is dominated by the size (heavy atom count) of the proteins (Fig. 4a). The local structure, on the other hand, is defined by properties such as the fraction of negative charges (Fig. 4b).

Fig. 4.

TMAP visualizations of the RCSB Protein Data Bank (PDB), PANCAN, and ProteomeHD data. For a and b, please use the interactive versions at http://pdb-tmap.gdb.tools to visualize protein structures associated with each point. 3DP-encoded PDB entries visualized using TMAP with weighted MinHash indexing; the color bars show the log–log distribution of the property values. a Colored according to the macromolecular size (heavy atom count). The resulting map reflects the size-sensitivity of the 3DP fingerprint. b Colored according to the fraction of negative charges in the molecules. Macromolecules with a high fraction of negatively charged atoms, predominantly nucleic acids, are visible as clusters of red branches. c The PANCAN data set (n = 801, d = 20,531) consists of gene expression data of five types of tumors (PRAD, KIRC, LUAD, COAD, and BRCA) and was indexed using a weighted variant of the MinHash algorithm. d Visualization of the ProteomeHD data set (n = 5013, d = 5013) based on co-regulation scores of proteins. The data points have been colored according to the associated cellular location

As an additional example from biology, we consider the PANCAN data set (n=801, d=20,531), which consists of gene expression data of patients having different types of tumors (PRAD, KIRC, LUAD, COAD, and BRCA), randomly extracted from The Cancer Genome Atlas database [56]. Here we index the PANCAN data directly using the LSH Forest data structure and weighted MinHash. The output produced by processing the PANCAN data set shows that the algorithm successfully differentiates tumor types based on RNA sequencing data (Fig. 4c). We also visualize the ProteomeHD data set using TMAP [57]. This data set consists of co-regulation scores of 5013 proteins, annotated with their respective cellular location. In addition to the ProteomeHD data set, Kustatscher et al. also released an R script to create a map of the set using t-SNE, which took a total of 400 s to complete; in contrast, TMAP visualized the data set within 32 s (Fig. 4d), successfully clustering proteins by their cellular location based on co-regulation scores. As a further biology example, our TMAP webpage also features flow cytometry measurements (n=436,877, d=14), exemplifying the method's application to the visualization of relatively low-dimensional data [17, 58].

As an example from physics, we represent the MiniBooNE data set (n=130,065, d=50), which consists of measurements extracted from Fermilab’s MiniBooNE experiment and contains the detection of signal (electron neutrinos) and background (muon neutrinos) events [59]. As the attributes in MiniBooNE are real numbers, in phase I of the algorithm we use the Annoy indexing library, which supports the cosine metric, to index the data for k-NNG construction; this demonstrates the modularity of TMAP [60]. This example reflects the independence of the MST and layout phases of the algorithm from the input data, displaying the distribution of the signal over the background data (Fig. 5a).
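The modularity described above can be illustrated with a small sketch: any procedure that yields a weighted k-nearest-neighbor graph can feed the same MST step. The code below is an assumption-laden stand-in, not the TMAP implementation: a brute-force cosine search replaces the Annoy index, Kruskal's algorithm is used as one standard way to compute the tree, and all function names and the six toy points are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero real-valued vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def knn_edges(points, k):
    """Brute-force k-NN graph (stand-in for an approximate index)."""
    edges = {}
    for i, p in enumerate(points):
        dists = sorted(
            (cosine_distance(p, q), j) for j, q in enumerate(points) if j != i
        )
        for d, j in dists[:k]:
            # Store each undirected edge once, keyed by (smaller, larger) index.
            edges[(min(i, j), max(i, j))] = d
    return [(d, i, j) for (i, j), d in edges.items()]


def kruskal(n, edges):
    """Minimum spanning tree (or forest, if the graph is disconnected)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for d, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, d))
    return tree


# Two toy groups of directions in the plane (hypothetical data).
points = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [0.1, 1.0], [0.2, 0.9], [0.3, 0.8]]
mst = kruskal(len(points), knn_edges(points, k=3))
```

Because the MST step only sees edge weights, swapping the brute-force search for a different index or metric leaves `kruskal` unchanged.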

Fig. 5.

Visualizing linguistics, RNA sequencing, and particle physics data sets. a The MiniBooNE data set (n=130,065, d=50) consists of measurements extracted from Fermilab’s MiniBooNE experiment. TMAP visualizes the distribution of the signal data among the background. b The GUTENBERG data set is a selection of books by 142 authors (n=3036, d=1,217,078). The works of five different authors are shown to occupy distinct branches. Interactive versions of these maps and further examples can be found at http://tmap.gdb.tools

Outside of the natural sciences, we apply TMAP to the GUTENBERG data set as an example from linguistics. This data set features a selection of n=3036 books by 142 authors written in English [61]. To analyze this data, we define a book fingerprint as a dense-form binary vector indicating which words from the universe of all words extracted from all books occurred at least once in a given book (yielding a dimensionality of d=1,217,078), and index this book fingerprint using the LSH Forest data structure with MinHash. The visualization of the GUTENBERG data set exemplifies the ability of TMAP to handle input with extremely high dimensionality efficiently (Fig. 5b). The works of different authors tend to populate specific branches, with notable expected exceptions such as the autobiography of Charles Darwin, which does not lie on the same branch as his other works. Meanwhile, the works of Alfred Russel Wallace are found on subbranches of the Darwin branch.
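The book fingerprint described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: whitespace tokenization, the salted BLAKE2b hash family, the signature length of 128, and the two toy "books" are all hypothetical choices, and the fingerprint is held as a set of words, which carries the same information as the binary occurrence vector.

```python
import hashlib


def book_fingerprint(text):
    """Set of distinct lowercase words occurring at least once in the text."""
    return set(text.lower().split())


def minhash(words, num_perm=128):
    """MinHash signature: per hash function, the minimum hash over the set.

    Raises ValueError for an empty set, which has no minimum.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salted hash function per seed
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        w.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "little",
                )
                for w in words
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


# Toy texts standing in for whole books.
book_a = book_fingerprint("the origin of species by means of natural selection")
book_b = book_fingerprint("the descent of man and selection in relation to sex")

exact = exact_jaccard(book_a, book_b)
estimate = estimated_jaccard(minhash(book_a), minhash(book_b))
```

The signature length is fixed regardless of vocabulary size, which is why the dimensionality of the underlying binary vector does not directly limit indexing.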

Related to linguistics, the TMAP webpage further features a map of the distribution of different scientific journals (Nature, Cell, Angewandte Chemie, Science, the Journal of the American Chemical Society, and Demography) over the entire PubMed article space (n=327,628, d=1,633,762), revealing specialization, diversification, and overlaps, as well as a TMAP of the NeurIPS conference papers (n=7,241, d=225,423), visualizing the increase in occurrence of the word “deep” in conference paper abstracts over time (1987–2016).

Conclusion

In this study, we introduced TMAP as a visualization method for very large, high-dimensional data sets enabling high data interpretability by preserving and visualizing both global and local features. By using TMAP in combination with the MHFP6 fingerprint, we can visualize databases of millions of organic small molecules and the associated property data with a high degree of resolution, which was not possible with previous methods. TMAP is also well-suited to visualize arbitrary data sets such as images, text, or RNA-seq data, hinting at its usefulness in a wide range of fields including computational linguistics or biology.

TMAP excels with its low memory usage and running time, with performance superior to other visualization algorithms such as t-SNE, UMAP or PCA. By adjusting the available parameters to balance output quality against memory usage, TMAP does not require specialized hardware for high-quality visualizations of data sets containing millions of data points. Most importantly, TMAP generates visualizations with an empirical sub-linear time complexity of O(n^0.931), allowing the visualization of much larger high-dimensional data sets than previous methods.
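An empirical exponent such as 0.931 is typically obtained as the slope of a straight-line fit of log running time against log data set size. The sketch below shows that calculation under explicit assumptions: the timings are synthetic values generated from an exact power law with a hypothetical constant, not measurements from this study, and the function name is illustrative.

```python
import math


def fit_power_law_exponent(sizes, times):
    """Least-squares slope of log(time) vs. log(size): b in time ~ c * n**b."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic timings from an exact power law (NOT data from the paper).
sizes = [10_000, 100_000, 1_000_000, 10_000_000]
times = [2e-4 * n ** 0.931 for n in sizes]

exponent = fit_power_law_exponent(sizes, times)
# Factor by which the running time grows when the data set grows tenfold.
growth_per_tenfold = 10 ** exponent
```

With an exponent of 0.931, a tenfold increase in data points corresponds to roughly an 8.5-fold increase in running time, slightly less than linear scaling would give.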

All TMAP visualizations presented are available as interactive online versions, together with installation and usage instructions (http://tmap.gdb.tools). The source code for TMAP is available on GitHub (https://github.com/reymond-group/tmap) and a Python package can be obtained using the conda package manager.

Acknowledgements

This work was supported financially by the Swiss National Science Foundation, NCCR TransCure (Grant No. 51NF40-185544).

Abbreviations

DSSTox

Distributed structure-searchable toxicity

ECFP

Extended connectivity fingerprint

FDB17

Fragment database 17

GDB17

Generated database 17

GTM

Generative topographic maps

LSH

Locality sensitive hashing

MHFP

MinHash fingerprint

MST

Minimum spanning tree

NLPCA

Nonlinear principal component analysis

NN

Nearest neighbor

NNG

Nearest neighbor graph

OGDF

Open graph drawing framework

PANCAN

Pan-cancer (The Cancer Genome Atlas Pan-Cancer analysis project)

PCA

Principal component analysis

PDB

Protein data bank

SMILES

Simplified molecular input line entry specification

SOM

Self-organizing maps

TMAP

Tree MAP

t-SNE

t-distributed stochastic neighbor embedding

UMAP

Uniform manifold approximation and projection

Authors’ contributions

DP designed and realized the study and wrote the paper. JLR supervised the study and wrote the paper. Both authors read and approved the final manuscript.

Availability of data and materials

The datasets generated during and/or analysed during the current study are available in the tmap repository, http://tmap.gdb.tools.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Daniel Probst, Email: daniel.probst@dcb.unibe.ch.

Jean-Louis Reymond, Email: jean-louis.reymond@dcb.unibe.ch.

Supplementary information

Supplementary information accompanies this paper at 10.1186/s13321-020-0416-x.

References

  • 1. Callahan SP, et al (2006) VisTrails: Visualization meets data management. In: Proceedings of the 2006 ACM SIGMOD international conference on management of data. ACM. pp 745–747. doi: 10.1145/1142473.1142574
  • 2. Fox P, Hendler J. Changing the equation on scientific data visualization. Science. 2011;331:705–708. doi: 10.1126/science.1197654.
  • 3. Michel J-B, et al. Quantitative analysis of culture using millions of digitized books. Science. 2011;331:176–182. doi: 10.1126/science.1199644.
  • 4. Keim D, Qu H, Ma K. Big-data visualization. IEEE Comput Graphics Appl. 2013;33:20–21. doi: 10.1109/MCG.2013.54.
  • 5. Costa FF. Big data in biomedicine. Drug Disc Today. 2014;19:433–440. doi: 10.1016/j.drudis.2013.10.012.
  • 6. Stephens ZD, et al. Big data: astronomical or genomical? PLoS Biol. 2015;13:e1002195. doi: 10.1371/journal.pbio.1002195.
  • 7. Bikakis N, Sellis T (2016) Exploration and visualization in the web of big linked data: a survey of the state of the art. arXiv:1601.08059
  • 8. Kahles A, et al. Comprehensive analysis of alternative splicing across tumors from 8,705 patients. Cancer Cell. 2018;34:211–224.e6. doi: 10.1016/j.ccell.2018.07.001.
  • 9. Arús-Pous J, et al. Exploring the GDB-13 chemical space using deep generative models. J Cheminform. 2019;11:20. doi: 10.1186/s13321-019-0341-z.
  • 10. van der Maaten L, Postma EO, van der Herik HJ. Dimensionality reduction: a comparative review. J Mach Learn Res. 2009;10:66–71.
  • 11. Gaulton A, et al. The ChEMBL database in 2017. Nucleic Acids Res. 2017;45:D945–D954. doi: 10.1093/nar/gkw1074.
  • 12. Ruddigkeit L, van Deursen R, Blum LC, Reymond J-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J Chem Inf Model. 2012;52:2864–2875. doi: 10.1021/ci300415d.
  • 13. Visini R, Awale M, Reymond J-L. Fragment database FDB-17. J Chem Inf Model. 2017;57:700–709. doi: 10.1021/acs.jcim.7b00020.
  • 14. Awale M, Visini R, Probst D, Arús-Pous J, Reymond J-L. Chemical space: big data challenge for molecular diversity. Chimia. 2017;71:661–666. doi: 10.2533/chimia.2017.661.
  • 15. Richard AM, Williams CR. Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat Res. 2002;499:27–52. doi: 10.1016/S0027-5107(01)00289-5.
  • 16. Natural Products Atlas. https://www.npatlas.org/joomla/
  • 17. Wishart DS, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018;46:D1074–D1082. doi: 10.1093/nar/gkx1037.
  • 18. Wu Z, et al (2017) MoleculeNet: A benchmark for molecular machine learning. arXiv:1703.00564 [physics, stat]
  • 19. Oprea TI, Gottfries J. Chemography: the art of navigating in chemical space. J Comb Chem. 2001;3:157–166. doi: 10.1021/cc0000388.
  • 20. Awale M, van Deursen R, Reymond J-L. MQN-Mapplet: visualization of chemical space with interactive maps of DrugBank, ChEMBL, PubChem, GDB-11, and GDB-13. J Chem Inf Model. 2013;53:509–518. doi: 10.1021/ci300513m.
  • 21. Awale M, Reymond J-L. Similarity Mapplet: interactive visualization of the directory of useful decoys and ChEMBL in high dimensional chemical spaces. J Chem Inf Model. 2015;55:1509–1516. doi: 10.1021/acs.jcim.5b00182.
  • 22. Jin X, et al. PDB-explorer: a web-based interactive map of the protein data bank in shape space. BMC Bioinform. 2015;16:339. doi: 10.1186/s12859-015-0776-9.
  • 23. Awale M, Reymond J-L. Web-based 3D-visualization of the DrugBank chemical space. J Cheminform. 2016;8:25. doi: 10.1186/s13321-016-0138-2.
  • 24. Awale M, Probst D, Reymond J-L. WebMolCS: a web-based interface for visualizing molecules in three-dimensional chemical spaces. J Chem Inf Model. 2017;57:643–649. doi: 10.1021/acs.jcim.6b00690.
  • 25. Probst D, Reymond J-L. FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web. Bioinformatics. 2018;34:1433–1435. doi: 10.1093/bioinformatics/btx760.
  • 26. McInnes L, Healy J, Melville J (2018) UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 [cs, stat]
  • 27. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–2605.
  • 28. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313:504–507. doi: 10.1126/science.1127647.
  • 29. Bishop CM, Svensén M, Williams CKI. GTM: the generative topographic mapping. Neural Comput. 1998;10:215–234. doi: 10.1162/089976698300017953.
  • 30. Kohonen T (1997) Exploration of very large databases by self-organizing maps. In: Proceedings of international conference on neural networks (ICNN’97), vol 1, pp PL1–PL6
  • 31. Dong W, Moses C, Li K (2011) Efficient k-nearest neighbor graph construction for generic similarity measures. In: Proceedings of the 20th international conference on World wide web (WWW’11), p 577. ACM Press. doi: 10.1145/1963405.1963487
  • 32. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
  • 33. Zhou Z, et al. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res. 2018;28:1395–1404. doi: 10.1101/gr.232397.117.
  • 34. Lu J, Carlson HA. ChemTreeMap: an interactive map of biochemical similarity in molecular datasets. Bioinformatics. 2016;32:3584–3592. doi: 10.1093/bioinformatics/btw523.
  • 35. P’ng C, et al. BPG: seamless, automated and interactive visualization of scientific data. BMC Bioinform. 2019;20:42. doi: 10.1186/s12859-019-2610-2.
  • 36. Idreos S, Papaemmanouil O, Chaudhuri S (2015) Overview of data exploration techniques. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, pp 277–281. doi: 10.1145/2723372.2731084
  • 37. Andoni A, Razenshteyn I, Nosatzki NS (2017) LSH Forest: practical algorithms made theoretical. In: Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, pp 67–78. doi: 10.1137/1.9781611974782.5
  • 38. Bawa M, Condie T, Ganesan P (2005) LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th international conference on World Wide Web (WWW’05), p 651. ACM Press. doi: 10.1145/1060745.1060840
  • 39. Kruskal JB. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc Am Math Soc. 1956;7:48. doi: 10.1090/S0002-9939-1956-0078686-7.
  • 40. Chimani M, et al. The open graph drawing framework (OGDF). Handbook Graph Draw Vis. 2013;2011:543–569.
  • 41. Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp 21–29. doi: 10.1109/sequen.1997.666900
  • 42. Manber U (1994) Finding similar files in a large file system. In: Usenix Winter 1994 technical conference, pp 1–10
  • 43. Wu W, Li B, Chen L, Zhang C, Yu P (2017) Improved consistent weighted sampling revisited. arXiv:1706.01172 [cs]
  • 44. Bajusz D, Rácz A, Héberger K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J Cheminform. 2015;7:20. doi: 10.1186/s13321-015-0069-3.
  • 45. Probst D, Reymond J-L. A probabilistic molecular fingerprint for big data settings. J Cheminform. 2018;10:66. doi: 10.1186/s13321-018-0321-8.
  • 46. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50:742–754. doi: 10.1021/ci100050t.
  • 47. Riniker S, Landrum GA. Open-source platform to benchmark fingerprints for ligand-based virtual screening. J Cheminform. 2013;5:26. doi: 10.1186/1758-2946-5-26.
  • 48. Awale M, Reymond J-L. Polypharmacology Browser PPB2: target prediction combining nearest neighbors with machine learning. J Chem Inf Model. 2019;59:10–17. doi: 10.1021/acs.jcim.8b00524.
  • 49. Binding DB (2014) BindingDB Entry 6310: Compounds and compositions as Syk kinase inhibitors. doi: 10.7270/q24q7sns
  • 50. Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. arXiv:1408.2927 [cs]
  • 51. Marcais G, DeBlasio D, Pandey P, Kingsford C (2019) Locality sensitive hashing for the edit distance. http://biorxiv.org/lookup/doi/10.1101/534446
  • 52. Probst D, Reymond J-L. SmilesDrawer: parsing and drawing SMILES-encoded molecular structures using client-side JavaScript. J Chem Inf Model. 2018;58:1–7. doi: 10.1021/acs.jcim.7b00425.
  • 53. Ramakrishnan R, Hartmann M, Tapavicza E, von Lilienfeld OA. Electronic spectra from TDDFT and machine learning in chemical space. J Chem Phys. 2015;143:084111. doi: 10.1063/1.4928757.
  • 54. Berman HM, et al. The protein data bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  • 55. Awale M, Reymond J-L. Atom pair 2D-fingerprints perceive 3D-molecular shape and pharmacophores for very fast virtual screening of ZINC and GDB-17. J Chem Inf Model. 2014;54:1892–1907. doi: 10.1021/ci500232g.
  • 56. The Cancer Genome Atlas Research Network et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45:1113–1120. doi: 10.1038/ng.2764.
  • 57. Kustatscher G, et al. Co-regulation map of the human proteome enables identification of protein functions. Nat Biotechnol. 2019;37:1361–1371. doi: 10.1038/s41587-019-0298-5.
  • 58. Hanley MB, Lomas W, Mittar D, Maino V, Park E. Detection of low abundance RNA molecules in individual cells by flow cytometry. PLoS ONE. 2013;8:e57002. doi: 10.1371/journal.pone.0057002.
  • 59. Roe BP, et al. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nucl Instrum Methods Phys Res. 2005;543:577–584. doi: 10.1016/j.nima.2004.12.018.
  • 60. Bernhardsson E. Annoy (Approximate Nearest Neighbors Oh Yeah). https://github.com/spotify/annoy
  • 61. Lahiri S (2013) Complexity of word collocation networks: a preliminary structural analysis. arXiv:1310.5111 [physics]


Articles from Journal of Cheminformatics are provided here courtesy of BMC
