Briefings in Bioinformatics. 2024 Nov 4;25(6):bbae565. doi: 10.1093/bib/bbae565

Prototype-based contrastive substructure identification for molecular property prediction

Gaoqi He, Shun Liu, Zhuoran Liu, Changbo Wang, Kai Zhang, Honglin Li
PMCID: PMC11533112  PMID: 39494969

Abstract

Substructure-based representation learning has emerged as a powerful approach to featurize complex attributed graphs, with promising results in molecular property prediction (MPP). However, existing MPP methods mainly rely on manually defined rules to extract substructures. It remains an open challenge to adaptively identify meaningful substructures from numerous molecular graphs to accommodate MPP tasks. To this end, this paper proposes Prototype-based cOntrastive Substructure IdentificaTion (POSIT), a self-supervised framework to autonomously discover substructural prototypes across graphs so as to guide end-to-end molecular fragmentation. During pre-training, POSIT emphasizes two key aspects of substructure identification: firstly, it imposes a soft connectivity constraint to encourage the generation of topologically meaningful substructures; secondly, it aligns resultant substructures with derived prototypes through a prototype-substructure contrastive clustering objective, ensuring attribute-based similarity within clusters. In the fine-tuning stage, a cross-scale attention mechanism is designed to integrate substructure-level information to enhance molecular representations. The effectiveness of the POSIT framework is demonstrated by experimental results from diverse real-world datasets, covering both classification and regression tasks. Moreover, visualization analysis validates the consistency of chemical priors with identified substructures. The source code is publicly available at https://github.com/VRPharmer/POSIT.

Keywords: molecular property prediction, Graph Neural Networks, self-supervised learning, contrastive learning

Introduction

Molecular property prediction (MPP) is a significant task in modern drug discovery. Accurate prediction methods can accelerate lead compound discovery, virtual screening, and other drug discovery processes [1, 2]. However, conventional wet-lab experiments entail considerable time and labor costs [3]. In recent years, deep learning techniques have been widely applied in various downstream tasks and have reported excellent results. With the continuous accumulation of biochemical data, deep learning-based MPP methods are attracting an ever-increasing interest from researchers, showing potential in prediction performance and generalization capability [4].

Deep learning-based MPP methods can be divided into sequence-based and graph-based methods according to the representation of molecules [5]. Each presents unique advantages and challenges for predicting molecular properties. Sequence-based methods mainly leverage Simplified Molecular Input Line Entry System (SMILES) strings [6], which can benefit from advanced language models [7, 8]. However, the string representations lose spatial structural information [9, 10]. In contrast, graph-based methods offer a topology-aware representation by viewing atoms as nodes and bonds as edges. Consequently, Graph Neural Networks (GNNs) have become well-suited tools for MPP tasks. GNN-based methods continue to emerge and achieve promising results [1, 11].

Existing enhancements to graph-based MPP methods are primarily driven by the unique characteristics of molecules. Typical contributions include D-MPNN [12], which fuses edge features into the message passing phase to incorporate attributes of chemical bonds. Attentive FP introduces a multi-level attention mechanism to enable both reasoning capabilities and interpretability [13]. Combining chemical domain knowledge, FP-GNN integrates fingerprints into the graph representations of molecules [14]. These GNN-based models mainly consider atom-level feature aggregation, while the rich structural information of functional groups remains to be explored. According to chemical domain knowledge, functional groups (substructures) form the basic building blocks that determine molecular properties and are shared across molecules [15, 16]. For example, the carboxyl group (-COOH) often indicates compounds with high water solubility.

To this end, recent MPP-related works have considered substructures in the learning process. These methods mainly adopt manually defined fragmentation rules. For instance, FraGAT randomly breaks acyclic single bonds to generate substructures [17]; HiGNN employs the Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) algorithm to partition molecules [18]; MgRX combines the BRICS and Retrosynthetic Combinatorial Analysis Procedure (RECAP) algorithms to obtain fine-grained fragments [19]; and CAFE-MPP detects breakable bonds for fragmentation according to chemistry-aware rules [20].

These rule-based methods leverage chemical knowledge to cleave molecules, but they lack flexibility and are not aware of substructure classifications. For example, Fig. 1(A) shows a fragment that mixes a halogen (Cl) with a phenolic hydroxyl group, even though the two belong to different classes of functional groups. Such mixtures obscure the functional groups that may be important for prediction.

Figure 1. (A) Rule-based molecular fragmentation. Using the BRICS algorithm, the molecule is fragmented into four substructures by pre-defined rules. Each substructure carries no global substructure class information. (B) Adaptive fragmentation with POSIT. The molecule is divided into seven substructures by probabilistically assigning each atom to a global set of substructure classes. The partitioning is implemented by imposing both topology-based and attribute-based constraints simultaneously.

As for adaptive substructure identification methods, though not directly designed for MPP tasks, there are two related directions: graph mining and graph clustering. Graph mining aims at the data-driven discovery of frequent subgraphs, which may demand extensive domain knowledge and incur high computational costs [21, 22]. Notable methods for graph clustering, such as DiffPool and MinCutPool [23, 24], hierarchically partition nodes into different clusters and generate pooled representations. Moreover, MICRO forms various motifs via node clustering with the EM algorithm [25]. SLIM models structural interactions by mapping rooted subgraphs to finite landmarks, but it cannot discover flexibly shaped substructures [26]. These works primarily utilize node-level similarities to cluster nodes on each graph. However, substructure-level relationships across graphs are not fully exploited to model the intra-class consistency and inter-class discrimination of substructures.

To address the above concerns, we propose the Prototype-based cOntrastive Substructure IdentificaTion (POSIT) framework to adaptively mine substructures, consequently augmenting molecular representations for MPP tasks. Compared to existing rule-based methods, POSIT has several merits: (1) it allows fine-tuning the fragmentation process based on downstream supervised signals, leading to more informative substructure representations; (2) it results in flexible, class-aware substructures with global coherence; and (3) it eliminates the reliance on chemistry domain knowledge. The framework incorporates two learning stages. During pre-training, a graph encoder and a partitioner are pre-trained to partition molecules. Specifically, the partitioner softly assigns nodes to various substructure classes with a connectivity constraint. On this basis, a prototypical contrastive objective is designed on substructure-level representations, thereby encouraging salient clustering of substructures with intra-class consistency and inter-class discrimination.

In the fine-tuning stage, the predictor is trained via supervised MPP data, where a cross-scale attention mechanism is introduced to capture the interaction between the substructure-level and the graph-level representations. Meanwhile, the encoder and the partitioner are fine-tuned for downstream MPP tasks. We conducted extensive experiments on 10 datasets to validate the performance of POSIT, covering classification and regression tasks.

The contributions of this work are summarized as follows:

  • We introduce an innovative self-supervised framework capable of adaptively extracting informative substructural prototypes from biochemical data, consequently identifying meaningful molecular substructures. Further visualization studies illustrate the consistency of the partitioned substructures with chemical priors.

  • The prototypical contrastive substructure identification is explored as a novel pretext task for further fine-tuning. During fine-tuning, a cross-scale attention mechanism is integrated, which fuses substructure-level information to enhance molecular representations.

  • Comprehensive experiments are used to evaluate the performance of POSIT, covering classification and regression MPP tasks on 10 real-world datasets. Results compared to baseline models and ablation studies demonstrate the effectiveness and generalizability of POSIT.

Materials and methods

In this section, we first state the problem definition of MPP and then introduce the preliminary requirements of molecular fragmentation. Subsequently, we elaborate on the design of the two-stage framework, emphasizing both pre-training and fine-tuning stages.

Problem definition

Typically, a molecule can be viewed as an undirected graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the node set representing atoms and $\mathcal{E}$ denotes the edge set representing chemical bonds. The initial features of the $N$ nodes within a graph are represented as $X \in \mathbb{R}^{N \times d_0}$, where $d_0$ is the feature dimension. Connection relations are represented as an adjacency matrix $A$, each element of which is either 0 or 1, together with edge features $X^e$. Given a set of $M$ molecular graphs $\{G_1, \dots, G_M\}$ and their task labels on various properties $\{y_1, \dots, y_M\}$, the objective of MPP is to train a model $f(\cdot)$ whose predictions fit the various ground truths well. This model requires informative molecular representations that are applicable across varied tasks.

Substructure identification

Matching exact functional groups in molecules is related to graph mining, which may incur expensive computational costs [22]. Hence, we expect to discover substructures that potentially encapsulate functional groups. During this process, the challenge lies in designing a method to partition atoms into coherent and meaningful substructures that benefit downstream tasks.

Following the requirements of graph mining, substructures should adhere to certain constraints [22, 27]: (1) Connectivity: since functional groups are composed of local atomic clusters, substructures should maintain tight connectivity. (2) Topological similarity: substructures within the same class should represent similar connectivity patterns. (3) Attribute consistency: as for attributed graphs, substructures within the same class should share the same types and counts of nodes. To meet these challenges, we consider relaxing these constraints to probabilistic versions, thereby enabling learnable fragmentation in a data-driven manner. Details are described in the following subsections.

Overview

The overall architecture of the two-stage network is illustrated in Fig. 2. The pre-training stage aims at adaptively identifying meaningful substructures. Specifically, a GNN-based encoder and a partitioner are jointly pre-trained under two objectives: (1) a local connectivity constraint that generates geometrically meaningful subgraphs within a molecule; and (2) a global prototype-based contrastive clustering loss that encourages substructures to form salient clusters, using both topological and attribute-based similarities.

Figure 2. The overall architecture of POSIT. (1) Stage I: pre-training. Encoded molecular atoms are softly partitioned into substructures. The prototypical contrastive objective directs these substructures to cluster by pulling them closer to their own class prototypes and distancing them from others. (2) Stage II: fine-tuning. Substructures are first identified by the pre-trained network. Next, the cross-scale attention fuses the substructure-level information with the global representation, which is finally fed into the predictor.

The fine-tuning stage leverages the supervised data to update pre-training parameters and build accurate predictors for MPP tasks. In particular, based on the identified substructures, the cross-scale attention mechanism is introduced to integrate substructure-level representations into graph-level representations as informative features.

Pre-training stage

Given a molecular graph $G$ with $N$ nodes, we first extract $d$-dimensional node embeddings with a GNN-based encoder. Utilizing message passing, the representation of node $v$ at the $l$-th layer is aggregated iteratively from its neighbour set $\mathcal{N}(v)$:

$$h_v^{(l)} = \mathrm{UPDATE}\left(h_v^{(l-1)},\; \mathrm{AGG}\left(\left\{h_u^{(l-1)} : u \in \mathcal{N}(v)\right\}\right)\right) \quad (1)$$

where $\mathrm{AGG}(\cdot)$ is the aggregation function, and $\mathrm{UPDATE}(\cdot)$ is the updating function. Any variant of GNNs is applicable [28]. Here, Attentive FP [13], an attention-based graph encoder, is used in the implementation of POSIT.
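As a concrete illustration, the following is a minimal sketch of the message passing in Equation (1) in plain PyTorch. The mean aggregation and the MLP update are illustrative choices, not the paper's exact encoder; POSIT itself uses Attentive FP.

```python
import torch
import torch.nn as nn

class SimpleMPLayer(nn.Module):
    """One message-passing layer: AGG = degree-normalized sum, UPDATE = MLP."""
    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # h: (N, d) node embeddings; edge_index: (2, E) with rows (source, target).
        src, dst = edge_index
        agg = torch.zeros_like(h).index_add_(0, dst, h[src])  # sum over neighbours
        deg = torch.zeros(h.size(0), device=h.device)
        deg = deg.index_add_(0, dst, torch.ones(src.size(0), device=h.device))
        agg = agg / deg.clamp(min=1).unsqueeze(-1)            # mean aggregation (AGG)
        return self.update(torch.cat([h, agg], dim=-1))       # combine and update (UPDATE)
```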

Next, we aim at partitioning molecules into various substructures. Instead of directly specifying the number of substructures within each molecule, $K$ classes (or clusters) of substructures are defined globally across all molecules. Consequently, nodes within a molecule that are partitioned into the same class will naturally form a substructure. Given a molecular graph $G$ with $N$ nodes, we assign the embeddings of these nodes into $K$ classes via an MLP-based partitioner:

$$Z = \mathrm{MLP}(H) \quad (2)$$

where $Z \in \mathbb{R}^{N \times K}$, $H$ is the matrix of node embeddings, and $K$ denotes the pre-defined number of substructure classes. Next, the probability of node $i$ belonging to a certain substructure class $k$ can be calculated as follows:

$$p_{i,k} = \frac{\exp(z_{i,k} / \tau)}{\sum_{k'=1}^{K} \exp(z_{i,k'} / \tau)} \quad (3)$$

where $\tau$ is the temperature parameter that controls the peakedness of the Softmax function. Hence, the node assignment matrix of $G$ is described as $S$:

$$S = \left[p_1, p_2, \dots, p_N\right]^\top \in \mathbb{R}^{N \times K} \quad (4)$$

which assigns each of the $N$ nodes into the $K$ substructure classes.
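A minimal sketch of the partitioner in Equations (2)-(4) is given below; the hidden-layer width and the temperature value are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Partitioner(nn.Module):
    """Maps node embeddings H (N, d) to a soft assignment matrix S (N, K)."""
    def __init__(self, dim: int, num_classes: int, tau: float = 0.5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, num_classes))
        self.tau = tau  # temperature controlling the peakedness of the Softmax

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = self.mlp(h)                         # logits Z, Equation (2)
        return F.softmax(z / self.tau, dim=-1)  # rows p_i, Equations (3)-(4)
```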

Connectivity constraint

After partitioning molecular nodes into substructures through the partitioner, this subsection focuses on applying specific constraints to ensure that the resultant substructures are geometrically meaningful. Generally, atoms in the same functional group are tightly connected, while connections between different functional groups are relatively sparse [16]. Inspired by this observation, a modularity-based regularization is introduced as a constraint; modularity is a spectral clustering metric that describes the connectivity of graph partitions. This constraint enforces that the assignment matrix $S$ should create geometrically connected substructures. Here, we employ the relaxed version of modularity [29], which is formulated as follows:

$$\mathcal{L}_{\mathrm{mod}} = -\frac{1}{2m} \mathrm{Tr}\left(S^\top \left(A - \frac{\mathbf{d}\mathbf{d}^\top}{2m}\right) S\right) \quad (5)$$

where $\mathbf{d}$ is the vector of node degrees, $A$ is the adjacency matrix of the input graph, and $m$ denotes the edge count in the molecular graph $G$.

In practice, optimizing this objective tends to assign all nodes to a single partition [24]. To avoid such degenerate solutions, an orthogonality regularization is additionally included [30]. The formula is

$$\mathcal{L}_{\mathrm{orth}} = \frac{\sqrt{K}}{N} \left\lVert \sum_{i=1}^{N} p_i \right\rVert_2 - 1 \quad (6)$$

This objective regularizes the sizes of the substructure classes, reaching 0 when the sizes of different classes are strictly balanced.

Therefore, the overall loss function for connectivity is formulated as follows:

$$\mathcal{L}_{\mathrm{con}} = \mathcal{L}_{\mathrm{mod}} + \lambda \mathcal{L}_{\mathrm{orth}} \quad (7)$$

where $\lambda$ controls the balance of the two loss terms.
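For a single graph with a dense adjacency matrix, the connectivity objective can be sketched as below. The DMoN-style form of the balance regularizer is an assumption based on the cited clustering literature [29, 30], not the paper's verbatim implementation.

```python
import torch

def connectivity_loss(S: torch.Tensor, A: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # S: (N, K) soft assignment matrix; A: (N, N) dense adjacency matrix.
    d = A.sum(dim=1, keepdim=True)        # node degree vector, (N, 1)
    m = A.sum() / 2                       # number of edges
    B = A - (d @ d.T) / (2 * m)           # modularity matrix
    L_mod = -torch.trace(S.T @ B @ S) / (2 * m)        # Equation (5)
    N, K = S.shape
    L_orth = (K ** 0.5 / N) * S.sum(dim=0).norm() - 1  # Equation (6), 0 when balanced
    return L_mod + lam * L_orth                        # Equation (7)
```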

Prototypical contrastive substructure clustering

Besides tight connectivity, substructures of the same class are expected to share topological similarity and attribute consistency, which coincides with the goal of clustering. To explicitly enforce clustering among the substructure instances, we introduce the concept of prototypes.

Prototypes are the representative embeddings of classes, i.e. the centroids of classes [31, 32]. Leveraging prototypes to represent classes allows capturing intra-class similarity and inter-class dissimilarity [33, 34]. In this work, prototypes are defined as representative substructures that guide substructures to form respective clusters. Concretely, POSIT minimizes the distance between substructure embeddings and their assigned prototypes, while maximizing the distance between substructure embeddings and all other prototypes.

First, substructure embeddings are derived. The probability of each node $i$ being assigned to the substructure classes is indicated by the probability vector $p_i$ from Equation (3). Therefore, the substructure embeddings in $G$ can be obtained by performing average pooling of the nodes inside each substructure. Let $S$ be the node assignment matrix of $G$ as in Equation (4); the substructure embeddings in $G$ can then be computed as follows:

$$E_G = \left(S^\top H\right) \oslash \Phi\left(S^\top \mathbf{1}_N\right) \quad (8)$$

where $E_G \in \mathbb{R}^{K \times d}$ is the embedding matrix of substructures in $G$, $H$ is the embedding matrix of nodes in graph $G$, $p_i$ is a $K$-dimensional vector representing the probabilities that the $i$-th node is assigned to the $K$ substructure classes, $\oslash$ is the element-wise division operator, and $\Phi(\cdot)$ expands a $K$-dimensional vector to a $K \times d$ matrix. Each graph contains $K$ substructure embeddings in $E_G$ with dimension $d$, representing the occurrences of the $K$ substructure classes within this graph. Substructure classes that do not appear in $G$ are naturally represented as zero vectors in $E_G$. Therefore, the embeddings of the substructures that emerge in $G$ can be conveniently represented by the substructure embedding matrix $E_G$. Subsequently, substructure prototypes will be derived from these classes of substructure embeddings across the dataset, with the count also equal to $K$.
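In code, Equation (8) reduces to a soft class-wise average, sketched below under the notation above (a minimal sketch, not the released implementation).

```python
import torch

def substructure_embeddings(S: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
    # S: (N, K) soft assignments; H: (N, d) node embeddings.
    mass = S.sum(dim=0)                                 # soft node count per class, (K,)
    E = (S.T @ H) / mass.clamp(min=1e-8).unsqueeze(-1)  # class-wise average, (K, d)
    return E  # classes absent from the graph remain (near-)zero vectors
```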

Substructure-prototype contrastive clustering. After obtaining the substructure embeddings, we employ prototype-based contrastive clustering to enforce that the substructure instances form salient groups. Here, the prototype of each substructure class is maintained as a non-parametric embedding for better generalization ability [35, 36]. Specifically, the prototype vector $c_k$ is the averaged embedding of the $k$-th substructure class. It is updated in the momentum style using the batched substructure embeddings:

$$\bar{e}_k = \frac{1}{\lvert \Omega_k \rvert} \sum_{j \in \Omega_k} e_j \quad (9)$$
$$c_k \leftarrow \mu\, c_k + (1 - \mu)\, \bar{e}_k \quad (10)$$

where $\Omega_k$ is the set of substructure indices that have the highest probability of being assigned to prototype class $k$ in an entire batch of molecules. Since substructures are composed of nodes sharing the same class index, $\Omega_k$ can be derived from the assignment matrix $S$. $\bar{e}_k$ is the average of the batched substructure embeddings in class $k$, and $\mu$ denotes the momentum coefficient. In other words, the global estimation of the prototype $c_k$ is updated incrementally by its local version $\bar{e}_k$ in each mini-batch.
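A minimal sketch of the momentum update in Equations (9)-(10) follows; the coefficient value is illustrative, and `assign` is assumed to hold each batched substructure's hard class index.

```python
import torch

@torch.no_grad()
def update_prototypes(protos: torch.Tensor, E: torch.Tensor,
                      assign: torch.Tensor, mu: float = 0.99) -> torch.Tensor:
    # protos: (K, d) global prototypes; E: (B, d) batched substructure embeddings;
    # assign: (B,) hard class index (argmax of the assignment probabilities).
    for k in assign.unique():
        e_bar = E[assign == k].mean(dim=0)             # Equation (9): batch average
        protos[k] = mu * protos[k] + (1 - mu) * e_bar  # Equation (10): momentum update
    return protos
```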

To promote clustering, prototypes and substructures of the same class are viewed as positive pairs, and the rest are viewed as negative pairs. The relation between a prototype and a substructure is measured with cosine similarity:

$$\mathrm{sim}(u, v) = \frac{u^\top v}{\lVert u \rVert \, \lVert v \rVert} \quad (11)$$

where $u$ and $v$ are vector embeddings. We use $c_{y_i}$ to denote the prototype assigned to the $i$-th substructure, where $y_i$ is the class index with the highest probability in $p_i$ from Equation (3). Then, the prototypical contrastive clustering objective is formulated as follows:

$$\mathcal{L}_{\mathrm{proto}} = -\sum_{i} \log \frac{\exp\left(\mathrm{sim}(e_i, c_{y_i}) / \tau'\right)}{\sum_{k=1}^{K} \exp\left(\mathrm{sim}(e_i, c_k) / \tau'\right)} \quad (12)$$

where $\tau'$ is the temperature hyper-parameter. Intuitively, minimizing $\mathcal{L}_{\mathrm{proto}}$ pushes each transformed substructure embedding $e_i$ towards its assigned class prototype $c_{y_i}$ and away from other prototypes.
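Because Equation (12) is a softmax over prototype similarities, it can be sketched as a cross-entropy over cosine-similarity logits; the temperature value here is illustrative.

```python
import torch
import torch.nn.functional as F

def proto_contrastive_loss(E: torch.Tensor, protos: torch.Tensor,
                           assign: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # E: (B, d) substructure embeddings; protos: (K, d) prototypes;
    # assign: (B,) index of each substructure's own prototype.
    sim = F.normalize(E, dim=-1) @ F.normalize(protos, dim=-1).T  # cosine sims, (B, K)
    return F.cross_entropy(sim / tau, assign)  # pull to own prototype, push from the rest
```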

By clustering substructures of the same class under the guidance of prototypes, the GNN encoder and the partitioner coordinately generate high-quality fragmentations. The resultant substructures can thus satisfy the constraints on both topological similarity and attribute consistency.

Intra-class compactness optimization. Equation (12) mainly motivates the distinction between substructures and other class prototypes, i.e. inter-class discrimination. Meanwhile, the intra-class similarities should also be considered. Therefore, we further encourage the compactness within a class by the following loss function [36]:

$$\mathcal{L}_{\mathrm{comp}} = \sum_{i} \left(1 - \mathrm{sim}(e_i, c_{y_i})\right) \quad (13)$$

Equation (13) directly minimizes the distance between substructure embeddings and their assigned prototypes. Through this objective, substructures within the same class can have better attribute-based consistency.

To sum up, the clustering objective is formed as follows:

$$\mathcal{L}_{\mathrm{clu}} = \mathcal{L}_{\mathrm{proto}} + \mathcal{L}_{\mathrm{comp}} \quad (14)$$

Considering the connectivity constraint (7), the overall self-supervised training objective is formulated as follows:

$$\mathcal{L}_{\mathrm{ssl}} = \beta_1 \mathcal{L}_{\mathrm{con}} + \beta_2 \mathcal{L}_{\mathrm{clu}} \quad (15)$$

where $\beta_1$ and $\beta_2$ are the weights of the objectives.

After pre-training, POSIT can adaptively identify substructure instances from the input molecules in an unsupervised manner. If new molecular graphs come in, the pre-trained network can be applied conveniently to obtain meaningful substructures, which can then be leveraged to empower molecular representations.

Fine-tuning stage

The second stage of POSIT is a fine-tuning step. It further optimizes the pre-trained network using labelled MPP data, where a cross-scale attention mechanism is introduced to extract informative molecular features for the predictive tasks.

Cross-scale attention

Substructures not only provide more contextualized information than individual nodes, but are also at a finer granularity compared with the whole molecule. Therefore, they are supposed to offer rich topological and attribute information. To this end, a cross-scale attention mechanism is devised to explicitly capture the interaction between the substructure-level (local) representations and the graph-level (global) representations to generate informative molecular features.

Globally, a pooling function is operated on the node embeddings of the molecular graph:

$$g = \mathrm{READOUT}\left(\left\{h_v : v \in \mathcal{V}\right\}\right) \quad (16)$$

where $g$ is the global representation of graph $G$, $\mathcal{V}$ is the node set of $G$, and $\mathrm{READOUT}(\cdot)$ is a pooling function that compacts all nodes in $G$ into a single vector.

Locally, substructure representations are utilized as processing units. Then, the global and local representations are bridged via the cross-scale attention mechanism, which is formed as follows:

$$\alpha_k^{(t)} = \sigma\left(a_t^\top \left[W_t\, g \,\Vert\, W_t\, e_k\right]\right) \quad (17)$$
$$e_{\mathrm{att}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{K} \alpha_k^{(t)}\, e_k \quad (18)$$

where $e_k$ is the substructure embedding, $W_t$ and $a_t$ are trainable parameters that transform the embedding to a scalar, $\Vert$ denotes the concatenation operator, $\sigma$ is the activation function, and $T$ is the head count. In this process, substructures serve as fundamental building blocks of the molecule. They are combined linearly, with each substructure weighted according to its attention coefficient relative to the global molecular representation $g$.
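The mechanism can be sketched as follows with a single head for brevity; the sigmoid activation and the shared linear transform are illustrative assumptions, and POSIT averages over several heads.

```python
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    """Scores each substructure against the global embedding; Equations (17)-(18)."""
    def __init__(self, dim: int):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)    # shared linear transform
        self.a = nn.Linear(2 * dim, 1, bias=False)  # maps [global || local] to a scalar

    def forward(self, g: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
        # g: (d,) global graph embedding; E: (K, d) substructure embeddings.
        pair = torch.cat([self.W(g).expand(E.size(0), -1), self.W(E)], dim=-1)
        alpha = torch.sigmoid(self.a(pair))  # attention coefficients, (K, 1)
        return (alpha * E).sum(dim=0)        # linear combination of substructures
```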

Supervised loss function

Given a molecular graph $G$, we combine its summarized embedding $g$ and the linear combination of the substructure embeddings $e_{\mathrm{att}}$ as the final molecular representation:

$$r = \left[g \,\Vert\, e_{\mathrm{att}}\right] \quad (19)$$

For binary classification, the representation is fed into another MLP to generate the prediction $\hat{y}$:

$$\hat{y} = \mathrm{sigmoid}\left(\mathrm{MLP}(r)\right) \quad (20)$$

Considering the potential class imbalance, we adopted the focal loss to tackle the problem [37]. The formula is

$$p_i^{t} = \begin{cases} \hat{y}_i, & y_i = 1 \\ 1 - \hat{y}_i, & y_i = 0 \end{cases} \quad (21)$$
$$\mathcal{L}_{\mathrm{focal}} = -\frac{1}{n} \sum_{i=1}^{n} \alpha_i^{t} \left(1 - p_i^{t}\right)^{\gamma} \log p_i^{t} \quad (22)$$

where $y_i$ is the ground-truth label, $\hat{y}_i$ is the predicted value, and $n$ is the size of the dataset. In experiments, the focusing parameter $\gamma$ is set to 0.25, and the class-balancing weight $\alpha$ is set to the proportion of negative samples.
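A minimal sketch of Equations (21)-(22) follows, with the two parameters passed in explicitly so they can be set per dataset as described above.

```python
import torch

def focal_loss(y_hat: torch.Tensor, y: torch.Tensor,
               alpha: float, gamma: float) -> torch.Tensor:
    # y_hat: (n,) predicted probabilities in (0, 1); y: (n,) binary labels.
    p_t = torch.where(y == 1, y_hat, 1 - y_hat)  # probability of the true class, Eq. (21)
    a_t = torch.where(y == 1, torch.full_like(y_hat, alpha),
                      torch.full_like(y_hat, 1 - alpha))
    # Down-weight easy examples via (1 - p_t)^gamma, Eq. (22).
    return (-a_t * (1 - p_t) ** gamma * p_t.clamp(min=1e-8).log()).mean()
```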

For regression tasks, the final prediction $\hat{y}_i$ is first produced by an MLP. Then, the mean squared error (MSE) is adopted as the loss function:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2 \quad (23)$$

The overall loss function, including the self-supervised objective and the supervised objective, is formed as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \eta\, \mathcal{L}_{\mathrm{ssl}} \quad (24)$$

where $\mathcal{L}_{\mathrm{sup}}$ is the supervised loss (focal or MSE), and $\eta$ is a hyper-parameter for the trade-off between the supervised loss term and the self-supervised loss term.

Experiments and results

In this section, we report an extensive set of experimental results including the comparisons with baselines, ablation studies, and visual analysis to demonstrate the effectiveness and substructure identification capability of POSIT.

Datasets

We mainly consider MPP tasks, including classification and regression. For the pre-training stage, HIV is used as unlabeled pre-training data. It is a molecular dataset originating from MoleculeNet [38], which contains 41 127 molecules in total. (Although this dataset is relatively small, it already demonstrates encouraging performance when used as the pre-training dataset for POSIT. We will investigate larger pre-training datasets in our future studies.) The performance of our framework on downstream MPP tasks is evaluated on 10 commonly used labelled datasets from MoleculeNet. These datasets cover a wide range of molecular properties, including physical chemistry, physiology, and biophysics. Among the 10 datasets, 7 are classification tasks and 3 are regression tasks. The statistical information of the datasets is summarized in Table 1, and their detailed descriptions are listed in Supplementary Table S3.

Table 1. The statistical information of datasets

| Category | Dataset | Task type | #Molecules | #Tasks | #Avg nodes | Split | Metric |
|---|---|---|---|---|---|---|---|
| Biophysics | BACE | Classification | 1513 | 1 | 34.1 | Scaffold | ROC-AUC |
| Biophysics | HIV | Classification | 41 127 | 1 | 25.5 | Scaffold | ROC-AUC |
| Physiology | BBBP | Classification | 2050 | 1 | 23.9 | Scaffold | ROC-AUC |
| Physiology | Tox21 | Classification | 7831 | 12 | 18.6 | Random | ROC-AUC |
| Physiology | ToxCast | Classification | 8597 | 617 | 18.7 | Random | ROC-AUC |
| Physiology | ClinTox | Classification | 1484 | 2 | 26.1 | Random | ROC-AUC |
| Physiology | SIDER | Classification | 1427 | 27 | 33.6 | Random | ROC-AUC |
| Physical Chemistry | ESOL | Regression | 1128 | 1 | 13.3 | Random | RMSE |
| Physical Chemistry | FreeSolv | Regression | 642 | 1 | 8.7 | Random | RMSE |
| Physical Chemistry | Lipophilicity | Regression | 4200 | 1 | 27.0 | Random | RMSE |

Data preprocessing

The molecular data were initially obtained as SMILES strings and then transformed into graph structures using RDKit [39]. In the first stage, the HIV dataset is used for pre-training. In the fine-tuning stage, all datasets were split into training, validation, and testing subsets with a ratio of 8:1:1. For fair comparisons, we adopted the same data splitting strategies as previous works in terms of random and scaffold splitting [18, 38], which are listed in Table 1. Compared to random splitting, scaffold splitting typically generates datasets that are more challenging for predictive models. Following MoleculeNet [38], the ROC-AUC metric is used for evaluating the performance of the classification tasks, where a higher score means a more accurate prediction. For regression tasks, the Root Mean Squared Error (RMSE) is used as the metric, where a lower score indicates a better result.
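As an illustration of this conversion step, the sketch below builds a toy graph from a SMILES string with RDKit. Only two atom features are extracted here for brevity; the full feature set actually used is listed in Supplementary Table S1.

```python
import torch
from rdkit import Chem

def smiles_to_graph(smiles: str):
    """Convert a SMILES string to (node features, edge index) tensors."""
    mol = Chem.MolFromSmiles(smiles)
    # Toy atom features: atomic number and degree (the paper uses a richer set).
    x = torch.tensor([[a.GetAtomicNum(), a.GetDegree()] for a in mol.GetAtoms()],
                     dtype=torch.float)
    bonds = [[b.GetBeginAtomIdx(), b.GetEndAtomIdx()] for b in mol.GetBonds()]
    bonds = bonds + [[j, i] for i, j in bonds]  # add reverse edges (undirected graph)
    edge_index = torch.tensor(bonds, dtype=torch.long).t()
    return x, edge_index

x, edge_index = smiles_to_graph("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```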

Baseline models

To thoroughly validate our framework, we compare its performance with 8 advanced baseline models. All these baseline models follow the same data preprocessing strategy. The baselines are mainly divided into two categories:

MPP-oriented GNN variants. These baselines are introduced to compare the performance of POSIT with models using domain knowledge in chemistry. Among them, Attentive FP is a GNN variant based on a multi-level attention mechanism designed for MPP tasks [13]. D-MPNN updates relations between chemical bonds during message passing phases to explicitly include bond information [12]. TrimNet employs a novel triplet message mechanism to calculate messages from atom-bond-atom information for molecular representation [40]. HRGCN+ combines molecular graphs and molecular descriptors as inputs to the GNN model [41]. FP-GNN leverages molecular graphs and fingerprints simultaneously for MPP tasks [14].

Models with molecular fragmentation rules. These baselines are introduced to compare the effectiveness of adaptive substructure mining with those methods using pre-defined fragmentation rules. Among them, FraGAT defines all acyclic single bonds as breakable bonds, and randomly chooses a breakable bond to partition the molecule into two fragments [17]. This form of fragmentation is efficient and easy to process. Differently, HiGNN segments a molecule based on the BRICS algorithm, which is a decomposition algorithm based on biochemical domain knowledge [18]. It uses 16 cleavage rules that are pre-defined to decompose the molecules. This approach can obtain an indefinite number of substructures from various molecules.

Besides the GNN-based models and those methods using pre-defined fragmentation rules, we include XGBoost [42, 43], an advanced machine learning algorithm that is a commonly used baseline in MPP tasks.

The input atomic and bond features shared by the baseline models are consistent with those introduced by Xiong et al. and Zhu et al. [13, 18]. These features are detailed in Supplementary Table S1.

Experimental setting

Parameter setup. In the pre-training stage, we used Attentive FP with edge attributes as the GNN encoder [13]. The model was pre-trained for 100 epochs with the Adam optimizer. During pre-training, the learning rate was adjusted by a cosine-based scheduler. After that, the model was fine-tuned for 200 epochs with early stopping. The selection of the prototype count $K$ is described in the Sensitivity Analysis subsection below. The trade-off coefficients between the loss terms were set to fixed values in all experiments. Other hyper-parameters were optimized on the validation set, and their selection ranges are detailed in Supplementary Table S4. We performed five independent runs for each dataset with different random seeds. Then, the mean value and standard deviation of their metrics are reported.

Experimental environment. Code for the experiments was implemented in Python. In particular, PyTorch and PyTorch Geometric are the primary third-party libraries we utilized for implementing POSIT. All experiments were carried out on a single 2080 Ti GPU.

Performance validation

To verify the performance of our framework, extensive experiments are conducted on 10 biochemical datasets, covering classification and regression tasks. We compare the performance with MPP-specific models, fragmentation rule-based models, and XGBoost. The results of Attentive FP, HRGCN+, and XGBoost were collected from Cai et al. [14]. Other results were collected from their original studies. The performance comparisons of POSIT are demonstrated in Table 2.

Table 2. Performance comparisons with baseline models

Dataset Attentive FP D-MPNN HRGCN+ TrimNet FP-GNN FraGAT HiGNN XGBoost POSIT
BACE 0.846 0.857 0.878 0.860 0.801 0.882 0.900 ± 0.028
Tox21 0.852 0.854 0.848 0.860 0.815 0.843 0.856 0.836 0.861 ± 0.017
ToxCast 0.794 0.764 0.793 0.777 0.714 0.781 0.774 0.796 ± 0.017
BBBP 0.909 0.886 0.850 0.916 0.923 0.929 0.938 ± 0.024
ClinTox 0.904 0.897 0.899 0.948 0.840 0.964 0.930 0.911 0.845 ± 0.087
SIDER 0.623 0.658 0.641 0.657 0.661 0.618 0.651 0.642 0.662 ± 0.037
HIV 0.757 0.794 0.804 0.824 0.761 0.802 0.782 ± 0.025
ESOL 0.587 0.587 0.563 0.675 0.536 0.645 0.582 0.524 ± 0.013
FreeSolv 1.091 1.009 0.926 0.905 1.020 0.915 1.025 1.074 ± 0.035
Lipophilicity 0.549 0.563 0.603 0.625 0.651 0.549 0.574 0.609 ± 0.030

Note: the best performance on each dataset is shown in bold. ‘-’ means the results were not reported in the original studies.

As shown in Table 2, POSIT outperforms eight competing baselines in five out of seven datasets for classification tasks. For three regression datasets, it achieves the best performance in ESOL and exhibits competitive performance in Lipophilicity. All these observations indicate that POSIT is capable of effectively predicting a wide range of molecular properties.

The performance is less satisfactory on the FreeSolv, ClinTox, and HIV datasets. We notice that FreeSolv has a very small data size according to Table 1, making it challenging to adapt the pre-trained model to downstream domains. Meanwhile, the average node count of molecules in FreeSolv is the smallest among all datasets. Thus, the substructure classes may be simple and limited, making it difficult to benefit from diverse substructural features. Besides, the distribution of data labels in HIV is extremely unbalanced, which may hinder the model's generalization ability. As for ClinTox, it has both a small data size and a highly unbalanced data distribution, causing suboptimal results.

Ablation studies

To further explore the effectiveness of the key components of POSIT, several variants of POSIT have been designed. These variants focus on evaluating the effectiveness of the connectivity objective, the contrastive clustering objective, and the cross-scale attention mechanism:

  • POSIT without connectivity constraint (w/o Con). This variant removes the modularity and orthogonality-based objectives, which are used to encourage internal connectivity within substructures.

  • POSIT without prototypical contrastive clustering (w/o Clu). This variant removes the substructure clustering module based on prototype learning. Therefore, substructure identification only relies on the connectivity objective.

  • POSIT without cross-scale attention (w/o Att). This variant retains the complete pre-training architecture but removes the cross-scale attention mechanism during fine-tuning. Only the global graph embedding $g$ is used for prediction.

We have conducted the ablation studies on all datasets adhering to the above experimental setups. The results are demonstrated in Fig. 3.

Figure 3. Results of the ablation experiments. Left: the results of classification tasks, using ROC-AUC as the metric. Right: the results of regression tasks, using RMSE as the metric.

Impact of the two pre-training components. The pre-training consists of two components: connectivity optimization and prototypical contrastive clustering. In Fig. 3, it is evident that POSIT performs better than POSIT w/o Con on all classification and regression datasets, indicating the contribution of the connectivity constraint; performance degradation of up to 4% was observed across the datasets. In addition, without the connectivity objective, distant atoms may be partitioned into the same substructure, violating chemical priors. Likewise, an average degradation of 2% is observed on most datasets, except for BACE and HIV, when using POSIT w/o Clu. These results indicate that prototypical contrastive clustering leads to more effective substructure identification. Furthermore, the performance rank of POSIT is the most stable across all datasets compared to the variants. Therefore, both components are indispensable for the pre-training stage.

Impact of the cross-scale attention. It is observed that POSIT outperforms POSIT w/o Att on all datasets. When removing the cross-scale attention, the performance drops by 3% and 5% on the BACE and HIV datasets, respectively, and an average of 1.5% on other datasets. As a result, it can be inferred that capturing substructural information and fusing hierarchical graph representations can effectively contribute to prediction performance.

Visualization analysis

In this subsection, we present a qualitative analysis of POSIT through visualization to validate its capability for substructure identification. Concretely, we visualize the distribution of substructures and corresponding prototypes, as well as the identified substructures in each molecule. Additionally, the distributions of substructure counts in each class on different datasets are visualized in Supplementary Fig. S2.

Distribution of substructures. Figure 5 visualizes the distribution of substructures and prototypes on six datasets after pre-training and fine-tuning. T-SNE is utilized to reduce the dimension of embeddings for visualization [44]. The visualization results of the other datasets are demonstrated in Supplementary Fig. S1. In pre-training, the number of prototypes is set to 30 for visual clarity. It can be clearly observed that, around the prototypes, semantically similar substructures gather closely and form clusters. Meanwhile, there are clear boundaries between different substructure classes. Moreover, there are similar numbers of substructures in different clusters, showing that the orthogonality objective avoids degenerate solutions. Thus, the pre-training stage can effectively identify meaningful substructures in datasets.

Figure 5. T-SNE visualization of the substructure and prototype distribution on six datasets. The prototype count $K$ is set to 30. Each dot represents a substructure, and prototypes are shown with a distinct marker. Substructures identified as the same class share the same color.

Identification of substructures. In Fig. 6, we selected three molecules from each of the four datasets and visualized the substructure instances identified in them through the pre-trained network. Specifically, the substructure assignment of node $i$ is determined by the highest probability in $p_i$. Substructures of different classes are marked with distinct background styles (colors and filling types), while those of the same class have identical styles.

Figure 6. Visualization of the identified substructures. Atoms in a molecule with the same background form a substructure instance. Substructures belonging to the same class across datasets are marked accordingly, as are substructures that do not match chemical priors (which are typically small subgraphs).

It can be observed that nodes in the same substructure are closely connected, and most identified substructures are consistent with chemical priors. For instance, three substructures are identified in the molecule of Fig. 6(d)(1), which correspond to the phenyl group, the phenolic hydroxyl group, and the carboxyl group, respectively. Besides, substructures marked with the same color share consistent attributes and topology across molecules, and can thus represent the semantics of a certain functional group. For instance, the carboxyl groups in Fig. 6(a)(1), (d)(1), and (d)(3) are identified as the same class. Also, multiple identical substructures within a molecule are assigned to the same class, and they can also be clearly identified. For instance, in Fig. 6(a)(3), two symmetric hydrazine groups are identified in one molecule. Moreover, some substructures with the same attributes but different structural contexts are distinguished. For instance, the -OH group in Fig. 6(a)(1) and the -OH group in Fig. 6(b)(1) are identified as different classes, where one represents an alcoholic hydroxyl group and the other represents a phenolic hydroxyl group.

Despite the above advantages, several suboptimal cases are also observed. First, some unrelated atoms are partitioned into certain substructure classes. For instance, although the alcoholic hydroxyl group in Fig. 6(a)(2) is identified, it contains extra carbon atoms compared with the same group in Fig. 6(b)(1). Additionally, rings in the molecules may be broken. For instance, one of the pyridines in Fig. 6(b)(3) is not clearly identified. However, the probabilistic nature of the partition mitigates such undesired partitions. Different arrangements of the same atoms may also lead to different identifications. For instance, the acylamino groups in Fig. 6(c)(1) and Fig. 6(c)(2) are identified as two different classes by POSIT. There are also substructures that cannot be aligned to known functional groups, such as those marked in gray in Fig. 6(a)(2) and (a)(3).

Overall, despite the suboptimal cases, the formation of the numerous substructure classes and their prototypes is satisfactory in general and corresponds to chemically meaningful functional groups. This validates the advanced substructure identification capability of the proposed method.

Sensitivity analysis

In this part, we analyzed the impact of the pre-defined number of prototypes $K$. Figure 4 illustrates the performance for different selections of $K$ ranging from 10 to 120 on SIDER and Tox21. On both datasets, when $K$ is small, performance increases as $K$ increases. When $K$ reaches around 50, performance peaks and then decreases slightly. The performance does not fluctuate drastically as $K$ changes. Overall, the performance of POSIT is relatively robust to the choice of the hyper-parameter $K$.

Figure 4. Impact of the number of prototypes $K$ on performance. The shaded area represents the standard deviation.

Conclusion

In this paper, we introduced POSIT, a novel self-supervised approach designed for MPP tasks. The key innovations of POSIT lie in (1) its ability to adaptively identify substructures from molecules, and (2) its ability to explicitly incorporate substructure-level information to enhance molecular representations. During pre-training, the connectivity constraint and the prototypical contrastive clustering objective together generate meaningful substructures. In fine-tuning, the cross-scale attention mechanism is leveraged to integrate the substructure-level information into graph-level representations. On this basis, POSIT allows for effective molecular property prediction.

We provided a detailed analysis of the performance and capabilities of POSIT. The results of extensive experiments demonstrated POSIT’s effectiveness on MPP tasks. The visualization analysis further validates the advanced capability of POSIT in identifying substructures and aligning them with chemical priors.

Despite the promising results of POSIT, there are still directions for further improvement. For example, real-world molecules are 3D in nature, so incorporating 3D information into the pre-training would be useful to further enhance its effectiveness.

Key Points

  • We introduce the Prototype-based cOntrastive Substructure IdentificaTion (POSIT) framework, a self-supervised learning approach designed to autonomously discover substructural prototypes across molecular graphs. This innovation allows for the adaptive identification of meaningful substructures to enhance MPP tasks without manual rule definition.

  • POSIT employs a two-stage learning process consisting of pre-training and fine-tuning. During pre-training, a graph encoder and partitioner work in tandem to identify substructures, emphasizing connectivity and attribute-based similarity. The fine-tuning stage integrates substructure-level information through a cross-scale attention mechanism, enhancing molecular representations to improve prediction performance.

  • Extensive experiments on various real-world datasets demonstrate POSIT’s effectiveness in both classification and regression MPP tasks. The results highlight POSIT’s superior performance compared to multiple baseline models, validating its capability for accurate molecular property prediction.

Supplementary Material

Revised_Supplementary_Material_bbae565

Contributor Information

Gaoqi He, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.

Shun Liu, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.

Zhuoran Liu, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.

Changbo Wang, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.

Kai Zhang, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.

Honglin Li, Innovation Center for AI and Drug Discovery, East China Normal University, 200062 Shanghai, China; Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, 200237 Shanghai, China.

Funding

This work was supported in part by National Key Research and Development Program of China (No. 2022YFC3400501), Fundamental Research Funds for the Central Universities, National Natural Science Foundation of China (No. 62276099, 62002121 and 62072183), Natural Science Foundation of Chongqing, China (No. CSTB2022NSCQ-MSX0552), the Open Projects Program of State Key Laboratory of Multimodal Artificial Intelligence Systems (No. MAIS2024111).

Conflict of interest: None declared.

Data availability

Datasets and source codes described in this paper are available at https://github.com/VRPharmer/POSIT.

References

  • 1. Li Z, Jiang M, Wang S et al. Deep learning methods for molecular representation and property prediction. Drug Discov Today 2022;27:103373. doi: 10.1016/j.drudis.2022.103373.
  • 2. Schneider P, Walters WP, Plowright AT et al. Rethinking drug design in the artificial intelligence era. Nat Rev Drug Discov 2020;19:353–64. doi: 10.1038/s41573-019-0050-3.
  • 3. Yi H-C, You Z-H, Huang D-S et al. Graph representation learning in bioinformatics: trends, methods and applications. Brief Bioinform 2022;23:bbab340. doi: 10.1093/bib/bbab340.
  • 4. Deng J, Yang Z, Ojima I et al. Artificial intelligence in drug discovery: applications and techniques. Brief Bioinform 2022;23:bbab430. doi: 10.1093/bib/bbab430.
  • 5. Mancuso CA, Johnson KA, Liu R et al. Joint representation of molecular networks from multiple species improves gene classification. PLoS Comput Biol 2024;20:e1011773. doi: 10.1371/journal.pcbi.1011773.
  • 6. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–6. doi: 10.1021/ci00057a005.
  • 7. Jiang J, Zhang R, Zhao Z et al. MultiGran-SMILES: multi-granularity SMILES learning for molecular property prediction. Bioinformatics 2022;38:4573–80. doi: 10.1093/bioinformatics/btac550.
  • 8. Zhang X-C, Wu C-K, Yang Z-J et al. MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. Brief Bioinform 2021;22:bbab152. doi: 10.1093/bib/bbab152.
  • 9. Atz K, Grisoni F, Schneider G. Geometric deep learning on molecular representations. Nat Mach Intell 2021;3:1023–32. doi: 10.1038/s42256-021-00418-8.
  • 10. Wu T, Tang Y, Sun Q et al. Molecular joint representation learning via multi-modal information of SMILES and graphs. IEEE/ACM Trans Comput Biol Bioinform 2023;20:3044–55. doi: 10.1109/TCBB.2023.3253862.
  • 11. Wieder O, Kohlbacher S, Kuenemann M et al. A compact review of molecular property prediction with Graph Neural Networks. Drug Discov Today Technol 2020;37:1–12. doi: 10.1016/j.ddtec.2020.11.009.
  • 12. Yang K, Swanson K, Jin W et al. Analyzing learned molecular representations for property prediction. J Chem Inf Model 2019;59:3370–88. doi: 10.1021/acs.jcim.9b00237.
  • 13. Xiong Z, Wang D, Liu X et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem 2020;63:8749–60. doi: 10.1021/acs.jmedchem.9b00959.
  • 14. Cai H, Zhang H, Zhao D et al. FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction. Brief Bioinform 2022;23:bbac408. doi: 10.1093/bib/bbac408.
  • 15. Bader RFW, Popelier PLA, Keith TA. Theoretical definition of a functional group and the molecular orbital paradigm. Angew Chem Int Ed Engl 1994;33:620–31. doi: 10.1002/anie.199406201.
  • 16. Kotera M, McDonald AG, Boyce S et al. Functional group and substructure searching as a tool in metabolomics. PLoS One 2008;3:e1537. doi: 10.1371/journal.pone.0001537.
  • 17. Zhang Z, Guan J, Zhou S. FraGAT: a fragment-oriented multi-scale graph attention model for molecular property prediction. Bioinformatics 2021;37:2981–7. doi: 10.1093/bioinformatics/btab195.
  • 18. Zhu W, Zhang Y, Zhao D et al. HiGNN: a hierarchical informative Graph Neural Network for molecular property prediction equipped with feature-wise attention. J Chem Inf Model 2023;63:43–55. doi: 10.1021/acs.jcim.2c01099.
  • 19. Sun H, Wang G, Liu Q et al. An explainable molecular property prediction via multi-granularity. Inform Sci 2023;642:119094. doi: 10.1016/j.ins.2023.119094.
  • 20. Xie A, Zhang Z, Guan J et al. Self-supervised learning with chemistry-aware fragmentation for effective molecular property prediction. Brief Bioinform 2023;24:1–13. doi: 10.1093/bib/bbad296.
  • 21. Kong X, Huang W, Tan Z et al. Molecule generation by principal subgraph mining and assembling. Adv Neural Inf Process Syst 2022;35:2550–63.
  • 22. Nguyen LBQ, Zelinka I, Snasel V et al. Subgraph mining in a large graph: a review. Wiley Interdiscip Rev Data Min Knowl Discov 2022;12:e1454. doi: 10.1002/widm.1454.
  • 23. Ying Z, You J, Morris C et al. Hierarchical graph representation learning with differentiable pooling. Adv Neural Inf Process Syst 2018;31:4805–15.
  • 24. Bianchi FM, Grattarola D, Alippi C. Spectral clustering with Graph Neural Networks for graph pooling. In: International Conference on Machine Learning, pp. 874–83. PMLR, 2020.
  • 25. Subramonian A. Motif-driven contrastive learning of graph representations. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 15980–1. 2021. doi: 10.1609/aaai.v35i18.17986.
  • 26. Zhu Y, Zhang K, Wang J et al. Structural landmarking and interaction modelling: a "SLIM" network for graph classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, pp. 9251–9. 2022. doi: 10.1609/aaai.v36i8.20912.
  • 27. Du B, Zhang S, Cao N et al. FIRST: fast interactive attributed subgraph matching. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1447–56. ACM, 2017.
  • 28. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations, 2017.
  • 29. Brandes U, Delling D, Gaertler M et al. Maximizing modularity is hard. arXiv preprint physics/0608255, 2006.
  • 30. Tsitsulin A, Palowitch J, Perozzi B et al. Graph clustering with Graph Neural Networks. J Mach Learn Res 2023;24:1–21.
  • 31. Caron M, Misra I, Mairal J et al. Unsupervised learning of visual features by contrasting cluster assignments. Adv Neural Inf Process Syst 2020;33:9912–24.
  • 32. Li J, Zhou P, Xiong C et al. Prototypical contrastive learning of unsupervised representations. In: International Conference on Learning Representations, 2021.
  • 33. Lin S, Liu C, Zhou P et al. Prototypical graph contrastive learning. IEEE Trans Neural Netw Learn Syst 2022;35:2747–58. doi: 10.1109/TNNLS.2022.3191086.
  • 34. Ren Y, Ke L, Dong L et al. Incremental graph classification by class prototype construction and augmentation. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 2136–45. ACM, 2023.
  • 35. Peng M, Juan X, Li Z. Graph prototypical contrastive learning. Inform Sci 2022;612:816–34. doi: 10.1016/j.ins.2022.09.013.
  • 36. Zhou T, Wang W, Konukoglu E et al. Rethinking semantic segmentation: a prototype view. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2582–93. 2022.
  • 37. Lin T-Y, Goyal P, Girshick R et al. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–8. 2017.
  • 38. Wu Z, Ramsundar B, Feinberg EN et al. MoleculeNet: a benchmark for molecular machine learning. Chem Sci 2018;9:513–30.
  • 39. Landrum G. RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling. 2013.
  • 40. Li P, Li Y, Hsieh C-Y et al. TrimNet: learning molecular representation from triplet messages for biomedicine. Brief Bioinform 2021;22:bbaa266. doi: 10.1093/bib/bbaa266.
  • 41. Wu Z, Jiang D, Hsieh C-Y et al. Hyperbolic relational graph convolution networks plus: a simple but highly efficient QSAR-modeling method. Brief Bioinform 2021;22:bbab112.
  • 42. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–94. ACM, 2016.
  • 43. Deng D, Chen X, Zhang R et al. XGraphBoost: extracting Graph Neural Network-based features for a better prediction of molecular properties. J Chem Inf Model 2021;61:2697–705. doi: 10.1021/acs.jcim.0c01489.
  • 44. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605.
