Abstract
Substructure-based representation learning has emerged as a powerful approach to featurize complex attributed graphs, with promising results in molecular property prediction (MPP). However, existing MPP methods mainly rely on manually defined rules to extract substructures. It remains an open challenge to adaptively identify meaningful substructures from numerous molecular graphs to accommodate MPP tasks. To this end, this paper proposes Prototype-based cOntrastive Substructure IdentificaTion (POSIT), a self-supervised framework to autonomously discover substructural prototypes across graphs so as to guide end-to-end molecular fragmentation. During pre-training, POSIT emphasizes two key aspects of substructure identification: firstly, it imposes a soft connectivity constraint to encourage the generation of topologically meaningful substructures; secondly, it aligns resultant substructures with derived prototypes through a prototype-substructure contrastive clustering objective, ensuring attribute-based similarity within clusters. In the fine-tuning stage, a cross-scale attention mechanism is designed to integrate substructure-level information to enhance molecular representations. The effectiveness of the POSIT framework is demonstrated by experimental results from diverse real-world datasets, covering both classification and regression tasks. Moreover, visualization analysis validates the consistency of chemical priors with identified substructures. The source code is publicly available at https://github.com/VRPharmer/POSIT.
Keywords: molecular property prediction, Graph Neural Networks, self-supervised learning, contrastive learning
Introduction
Molecular property prediction (MPP) is a significant task in modern drug discovery. Accurate prediction methods can accelerate lead compound discovery, virtual screening, and other drug discovery processes [1, 2]. However, conventional wet-lab experiments entail considerable time and labor costs [3]. In recent years, deep learning techniques have been widely applied to various downstream tasks and have achieved excellent results. With the continuous accumulation of biochemical data, deep learning-based MPP methods are attracting ever-increasing interest from researchers, showing potential in prediction performance and generalization capability [4].
Deep learning-based MPP methods can be divided into sequence-based and graph-based methods according to the representation of molecules [5]. Each presents unique advantages and challenges for predicting molecular properties. Sequence-based methods mainly leverage Simplified Molecular Input Line Entry System (SMILES) strings [6], which can benefit from advanced language models [7, 8]. However, the string representations lose spatial structural information [9, 10]. In contrast, graph-based methods offer a topology-aware representation by viewing atoms as nodes and bonds as edges. Consequently, Graph Neural Networks (GNNs) have become well-suited tools for MPP tasks. GNN-based methods continue to emerge and achieve promising results [1, 11].
Existing enhancements to graph-based MPP methods are primarily driven by the unique characteristics of molecules. Typical contributions include D-MPNN [12], which fuses edge features into the message passing phase to incorporate attributes of chemical bonds. Attentive FP introduces a multi-level attention mechanism to enable both reasoning capability and interpretability [13]. Combining chemical domain knowledge, FP-GNN integrates fingerprints into the graph representations of molecules [14]. These GNN-based models mainly consider atom-level feature aggregation, while the rich structural information of functional groups remains to be explored. According to chemical domain knowledge, functional groups (substructures) form the basic building blocks that determine molecular properties and are shared across molecules [15, 16]. For example, the carboxyl group (-COOH) often indicates compounds with high water solubility.
To this end, recent MPP-related works have considered substructures in the learning process. These methods mainly adopt manually defined fragmentation rules. For instance, FraGAT randomly breaks acyclic single bonds to generate substructures [17]; HiGNN employs the Breaking of Retrosynthetically Interesting Chemical Substructures algorithm to partition molecules [18]; MgRX combines the BRICS and the Retrosynthetic Combinatorial Analysis Procedure algorithm to obtain fine-grained fragments [19]; and CAFE-MPP detects breakable bonds for fragmentation according to chemistry-aware rules [20].
These rule-based methods leverage chemical knowledge to cleave molecules, but they lack flexibility and are not aware of substructure classifications. For example, Fig. 1(A) shows that the halogen (Cl) and the phenolic hydroxyl group, which belong to different classes of functional groups, are treated without any class distinction. This mixture obscures the functional groups that may be important for prediction.
Figure 1.

(A) Rule-based molecular fragmentation. Using the BRICS algorithm, the molecule is fragmented into four substructures by pre-defined rules. None of the substructures carries global substructure class information. (B) Adaptive fragmentation with POSIT. The molecule is divided into seven substructures by probabilistically assigning each atom to a global set of substructure classes. The partitioning is implemented by imposing topology-based and attribute-based constraints simultaneously.
As for adaptive substructure identification methods, though not directly designed for MPP tasks, there are two related directions: graph mining and graph clustering. Graph mining aims at the data-driven discovery of frequent subgraphs, which may demand extensive domain knowledge and high computational costs [21, 22]. Notable graph clustering methods, such as DiffPool and MinCutPool [23, 24], hierarchically partition nodes into different clusters and generate pooled representations. Moreover, MICRO forms various motifs via node clustering with the EM algorithm [25]. SLIM models structural interaction by mapping rooted subgraphs to finite landmarks, but it cannot discover flexibly shaped substructures [26]. These works primarily utilize node-level similarities to cluster nodes within each graph. However, substructure-level relationships across graphs are not fully exploited to model the intra-class consistency and inter-class discrimination of substructures.
To address the above concerns, we propose the Prototype-based cOntrastive Substructure IdentificaTion (POSIT) framework to adaptively mine substructures, consequently augmenting molecular representations for MPP tasks. Compared to existing rule-based methods, POSIT has several merits: (1) it allows fine-tuning the fragmentation process based on downstream supervised signals, leading to more informative substructure representations; (2) it results in flexible, class-aware substructures with global coherence; and (3) it eliminates the reliance on chemistry domain knowledge. The framework incorporates two learning stages. During pre-training, a graph encoder and a partitioner are pre-trained to partition molecules. Specifically, the partitioner softly assigns nodes to various substructure classes under a connectivity constraint. On this basis, a prototypical contrastive objective is designed on substructure-level representations, thereby encouraging salient clustering of substructures with intra-class consistency and inter-class discrimination.
In the fine-tuning stage, the predictor is trained via supervised MPP data, where a cross-scale attention mechanism is introduced to capture the interaction between the substructure-level and the graph-level representations. Meanwhile, the encoder and the partitioner are fine-tuned for downstream MPP tasks. We conducted extensive experiments on 10 datasets to validate the performance of POSIT, covering classification and regression tasks.
The contributions of this work are summarized as follows:
We introduce an innovative self-supervised framework capable of adaptively extracting informative substructural prototypes among biochemical data, consequently identifying meaningful molecular substructures. Further visualization studies illustrate the consistency of the partitioned substructures with chemical priors.
The prototypical contrastive substructure identification is explored as a novel pretext task for further fine-tuning. During fine-tuning, a cross-scale attention mechanism is integrated, which fuses substructure-level information to enhance molecular representations.
Comprehensive experiments are used to evaluate the performance of POSIT, covering classification and regression MPP tasks on 10 real-world datasets. Results compared to baseline models and ablation studies demonstrate the effectiveness and generalizability of POSIT.
Materials and methods
In this section, we first state the problem definition of MPP and then introduce the preliminary requirements of molecular fragmentation. Subsequently, we elaborate on the design of the two-stage framework, emphasizing both pre-training and fine-tuning stages.
Problem definition
Typically, a molecule can be viewed as an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the node set representing atoms and $\mathcal{E}$ denotes the edge set representing chemical bonds. The initial features of the $N$ nodes within a graph are represented as $X \in \mathbb{R}^{N \times d_0}$, where $d_0$ is the feature dimension. Connection relations are represented as an adjacency matrix $A$ and edge features $X_e$, where each element of $A$ is either 0 or 1. Given a set of $N_D$ molecular graphs $\{\mathcal{G}_1, \ldots, \mathcal{G}_{N_D}\}$ and their task labels on various properties $\{y_1, \ldots, y_{N_D}\}$, the objective of MPP is to train a model $f(\cdot)$ whose predictions $\hat{y}_i = f(\mathcal{G}_i)$ fit the various ground truths well. This model requires informative molecular representations that are applicable across varied tasks.
Substructure identification
Matching exact functional groups in molecules is related to graph mining, which may incur expensive computational costs [22]. Hence, we expect to discover substructures that potentially encapsulate functional groups. During this process, the challenge lies in designing a method to partition atoms into coherent and meaningful substructures that benefit downstream tasks.
Following the requirements of graph mining, substructures should adhere to certain constraints [22, 27]: (1) Connectivity: since functional groups are composed of local atomic clusters, substructures should maintain tight connectivity. (2) Topological similarity: substructures within the same class should exhibit similar connectivity patterns. (3) Attribute consistency: for attributed graphs, substructures within the same class should share identical node types and counts. To meet these challenges, we consider relaxing these constraints into probabilistic versions, thereby enabling learnable fragmentation in a data-driven manner. Details are described in the following subsections.
Overview
The overall architecture of the two-stage network is illustrated in Fig. 2. The pre-training stage aims at adaptively identifying meaningful substructures. Specifically, a GNN-based encoder and a partitioner are jointly pre-trained under two objectives: (1) a local connectivity constraint that generates geometrically meaningful subgraphs within a molecule; and (2) a global prototype-based contrastive clustering loss that encourages substructures to form salient clusters, using both topological and attribute-based similarities.
Figure 2.
The overall architecture of POSIT. (1) Stage I: pre-training stage. Encoded molecular atoms are softly partitioned into substructures. The prototypical contrastive objective directs these substructures to cluster by pulling them closer to their own class prototypes and distancing them from others. (2) Stage II: fine-tuning stage. Substructures are first identified by the pre-trained network. Next, the cross-scale attention fuses the substructure-level information with the global representation, which is finally fed into the predictor.
The fine-tuning stage leverages the supervised data to update pre-training parameters and build accurate predictors for MPP tasks. In particular, based on the identified substructures, the cross-scale attention mechanism is introduced to integrate substructure-level representations into graph-level representations as informative features.
Pre-training stage
Given a molecular graph $\mathcal{G}$ with $N$ nodes, we first extract $d$-dimensional node embeddings with a GNN-based encoder. Utilizing message passing, the representation of node $v$ at the $l$-th layer is aggregated iteratively from its neighbour set $\mathcal{N}(v)$:

$$h_v^{(l)} = \mathrm{UPDATE}\left(h_v^{(l-1)},\ \mathrm{AGG}\left(\left\{h_u^{(l-1)} : u \in \mathcal{N}(v)\right\}\right)\right) \tag{1}$$

where $\mathrm{AGG}(\cdot)$ is the aggregation function and $\mathrm{UPDATE}(\cdot)$ is the updating function. Any variant of GNNs is applicable [28]. Here, Attentive FP [13], an attention-based graph encoder, is used in the implementation of POSIT.
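To make the abstraction in Equation (1) concrete, below is a minimal sketch of one message-passing layer written with PyTorch Geometric. POSIT itself uses Attentive FP as the encoder, so this generic layer, with its assumed dimensions and sum aggregator, only illustrates the AGG/UPDATE pattern.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import MessagePassing

class MPLayer(MessagePassing):
    """Illustrative message-passing layer implementing Eq. (1)."""
    def __init__(self, dim: int):
        super().__init__(aggr="add")  # AGG: sum over the neighbour set N(v)
        self.update_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h, edge_index):
        m = self.propagate(edge_index, h=h)                # aggregated messages
        return self.update_mlp(torch.cat([h, m], dim=-1))  # UPDATE(h, m)

    def message(self, h_j):
        return h_j  # each neighbour contributes its own embedding
```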
Next, we aim at partitioning molecules into various substructures. Instead of directly specifying the number of substructures within each molecule, $K$ classes (or clusters) of substructures are defined globally across all molecules. Consequently, nodes within a molecule that are partitioned into the same class will naturally form a substructure. Given a molecular graph $\mathcal{G}$ with $N$ nodes, we assign the embeddings of these nodes into $K$ classes via an MLP-based partitioner:

$$Z = \mathrm{MLP}(H) \tag{2}$$

where $Z \in \mathbb{R}^{N \times K}$, $H \in \mathbb{R}^{N \times d}$ is the node embedding matrix, and $K$ denotes the pre-defined number of substructure classes. Next, the probability of node $i$ belonging to a certain substructure class $k$ can be calculated as follows:

$$p_{i,k} = \frac{\exp\left(z_{i,k}/\tau\right)}{\sum_{k'=1}^{K}\exp\left(z_{i,k'}/\tau\right)} \tag{3}$$

where $\tau$ is the temperature parameter that controls the peakedness of the Softmax function. Hence, the node assignment matrix of $\mathcal{G}$ is described as $S \in \mathbb{R}^{N \times K}$:

$$S = \left[p_1, p_2, \ldots, p_N\right]^{\top} \tag{4}$$

which assigns each of the $N$ nodes into $K$ substructure classes.
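A minimal sketch of this partitioner, covering Equations (2)-(4), is given below; the hidden size, depth, and temperature are assumed values rather than the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Partitioner(nn.Module):
    """MLP-based soft partitioner producing the assignment matrix S."""
    def __init__(self, dim: int = 128, num_classes: int = 50, tau: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))
        self.tau = tau  # temperature controlling the Softmax peakedness

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = self.mlp(h)                         # Eq. (2): logits Z, [N, K]
        return F.softmax(z / self.tau, dim=-1)  # Eqs. (3)-(4): S, [N, K]
```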
Connectivity constraint
After partitioning molecular nodes into substructures through the partitioner, this subsection focuses on applying specific constraints to ensure that the resultant substructures are geometrically meaningful. Generally, atoms in the same functional group are tightly connected, while connections between different functional groups are relatively sparse [16]. Inspired by this observation, a modularity-based regularization is introduced as a constraint; modularity is a spectral clustering metric that describes the connectivity of graph partitions. This constraint enforces that the assignment matrix $S$ creates geometrically connected substructures. Here, we employ the relaxed version of modularity [29], which is formulated as follows:

$$\mathcal{L}_{\mathrm{mod}} = -\frac{1}{2m}\,\mathrm{Tr}\left(S^{\top}\left(A - \frac{\mathbf{d}\mathbf{d}^{\top}}{2m}\right)S\right) \tag{5}$$

where $\mathbf{d} \in \mathbb{R}^{N}$ contains the degrees of nodes, $A$ is the adjacency matrix of the input graph, and $m$ denotes the edge count in the molecular graph $\mathcal{G}$.

In practice, optimizing this objective tends to assign all nodes to a single partition [24]. To avoid such degenerate solutions, an orthogonality regularization is additionally included [30]. The formula is

$$\mathcal{L}_{\mathrm{orth}} = \frac{\sqrt{K}}{N}\left\lVert\sum_{i=1}^{N} S_{i}\right\rVert_{F} - 1 \tag{6}$$

where $S_i$ denotes the $i$-th row of $S$. This objective counts the size of substructure classes as regularization, reaching 0 when the sizes of different classes are strictly balanced.

Therefore, the overall loss function for connectivity is formulated as follows:

$$\mathcal{L}_{\mathrm{con}} = \mathcal{L}_{\mathrm{mod}} + \lambda\,\mathcal{L}_{\mathrm{orth}} \tag{7}$$

where $\lambda$ controls the balance of the two loss terms.
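For a single graph, the connectivity objective of Equations (5)-(7) can be sketched as below; the exact normalizations follow our reading of the relaxed modularity [29] and the collapse regularizer of [30], so treat them as assumptions.

```python
import torch

def connectivity_loss(s, adj, lam=1.0):
    """Eqs. (5)-(7) for one graph.
    s: [N, K] soft assignment matrix; adj: [N, N] dense adjacency."""
    deg = adj.sum(dim=1, keepdim=True)           # node degrees d, [N, 1]
    two_m = adj.sum()                            # 2m: total degree
    b = adj - deg @ deg.t() / two_m              # modularity matrix B
    l_mod = -torch.trace(s.t() @ b @ s) / two_m  # Eq. (5), minimised
    n, k = s.shape
    cluster_sizes = s.sum(dim=0)                 # soft size of each class
    l_orth = torch.norm(cluster_sizes) * (k ** 0.5) / n - 1.0  # Eq. (6)
    return l_mod + lam * l_orth                  # Eq. (7)
```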
Prototypical contrastive substructure clustering
Besides tight connectivity, substructures of the same class are expected to share topological similarity and attribute consistency, which is exactly the property of clustering. To explicitly enforce clustering among the substructure instances, we introduce the concept of prototypes.
Prototypes are the representative embeddings of classes, i.e. the centroids of classes [31, 32]. Leveraging prototypes to represent classes allows capturing intra-class similarity and inter-class dissimilarity [33, 34]. In this work, prototypes are defined as representative substructures that guide substructures to form respective clusters. Concretely, POSIT minimizes the distance between substructure embeddings and their assigned prototypes, while maximizing the distance between substructure embeddings and all other prototypes.
First, substructure embeddings are derived. The probability of each node $i$ being assigned to the substructure classes is indicated by the probability vector $p_i$ from Equation (3). Therefore, the substructure embeddings in $\mathcal{G}$ can be obtained by performing average pooling of the nodes inside each substructure. Let $S$ be the node assignment matrix of $\mathcal{G}$ as in Equation (4); the substructure embeddings in $\mathcal{G}$ can then be computed as follows:

$$E = \left(S^{\top} H\right) \oslash \Phi\left(\sum_{i=1}^{N} p_{i}\right) \tag{8}$$

where $E \in \mathbb{R}^{K \times d}$ is the embedding matrix of substructures in $\mathcal{G}$, $H \in \mathbb{R}^{N \times d}$ is the embedding matrix of nodes in graph $\mathcal{G}$, $p_i$ is a $K$-dimensional vector representing the probability that the $i$-th node is assigned to the $K$ substructure classes, $\oslash$ is the element-wise division operator, and $\Phi(\cdot)$ expands a $K$-dimensional vector to a $K \times d$ matrix. Each graph contains $K$ substructure embeddings in $E$ with dimension $d$, representing the occurrences of the $K$ substructure classes within this graph. Substructure classes that do not appear in $\mathcal{G}$ are naturally represented as zero vectors in $E$. Therefore, the embeddings of the substructures that emerge in $\mathcal{G}$ can be conveniently read off from the substructure embedding matrix $E$. Subsequently, substructure prototypes will be derived from these classes of substructure embeddings across the dataset, with the prototype count also equal to $K$.
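A sketch of Equation (8) in PyTorch follows; the small `eps` guard for classes absent from the graph is our addition.

```python
import torch

def substructure_embeddings(s, h, eps=1e-8):
    """Eq. (8): probability-weighted average pooling of nodes per class.
    s: [N, K] assignment matrix; h: [N, d] node embeddings -> E: [K, d]."""
    summed = s.t() @ h                   # [K, d] weighted sums per class
    counts = s.sum(dim=0).unsqueeze(-1)  # [K, 1] soft node counts per class
    return summed / (counts + eps)       # element-wise division; classes
                                         # absent from the graph stay ~zero
```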
Substructure-prototype contrastive clustering. After obtaining the substructure embeddings, we employ a prototype-based contrastive clustering to enforce that the substructure instances form salient groups. Here, the prototype of each substructure class is maintained as a non-parametric embedding for better generalization ability [35, 36]. Specifically, the prototype vector $c_k$ is the averaged embedding of the $k$-th substructure class. It is updated in the momentum style using the batched substructure embeddings:

$$\hat{c}_k = \frac{1}{\lvert\mathcal{I}_k\rvert}\sum_{j \in \mathcal{I}_k} e_j \tag{9}$$

$$c_k \leftarrow \gamma\,c_k + (1 - \gamma)\,\hat{c}_k \tag{10}$$

where $\mathcal{I}_k$ is the set of substructure indices that have the highest probability of being assigned to prototype class $k$ in an entire batch of molecules. Since substructures are composed of nodes sharing the same class index, $\mathcal{I}_k$ can be derived from the assignment matrix $S$. $\hat{c}_k$ is the average of the batched substructure embeddings in class $k$, and $\gamma$ denotes the momentum coefficient. In other words, the global estimation of the prototype $c_k$ is updated by its local version $\hat{c}_k$ in each mini-batch incrementally.
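The momentum update of Equations (9)-(10) can be sketched as follows; the momentum value and the hard-assignment derivation of $\mathcal{I}_k$ are assumptions consistent with the description above.

```python
import torch

@torch.no_grad()
def update_prototypes(protos, sub_emb, assign_idx, gamma=0.99):
    """Eqs. (9)-(10): momentum update of non-parametric prototypes.
    protos: [K, d] prototype vectors; sub_emb: [B, d] batched substructure
    embeddings; assign_idx: [B] hard class index of each substructure."""
    for k in assign_idx.unique():
        mask = assign_idx == k
        c_hat = sub_emb[mask].mean(dim=0)                    # Eq. (9)
        protos[k] = gamma * protos[k] + (1 - gamma) * c_hat  # Eq. (10)
    return protos
```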
To promote clustering, prototypes and substructures of the same class are viewed as positive pairs, and the rest are viewed as negative pairs. The relation between a prototype and a substructure is measured with cosine similarity:

$$\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u\rVert\,\lVert v\rVert} \tag{11}$$

where $u$ and $v$ are vector embeddings. We use $c_{y_j}$ to denote the prototype assigned to the $j$-th substructure, where $y_j$ is the class index with the highest probability according to Equation (3). Then, the prototypical contrastive clustering objective is formulated as follows:

$$\mathcal{L}_{\mathrm{proto}} = -\frac{1}{M}\sum_{j=1}^{M}\log\frac{\exp\left(\mathrm{sim}(e_j, c_{y_j})/\tau'\right)}{\sum_{k=1}^{K}\exp\left(\mathrm{sim}(e_j, c_k)/\tau'\right)} \tag{12}$$

where $\tau'$ is the temperature hyper-parameter and $M$ is the number of substructures in a batch. Intuitively, minimizing $\mathcal{L}_{\mathrm{proto}}$ pushes each substructure embedding $e_j$ towards its assigned class prototype $c_{y_j}$ and away from other prototypes.
By clustering substructures of the same class under the guidance of prototypes, the GNN encoder and the partitioner jointly generate high-quality fragmentations. The resultant substructures can thus satisfy the constraints on both topological similarity and attribute consistency.
Intra-class compactness optimization. Equation (12) mainly promotes the distinction between substructures and other class prototypes, i.e. inter-class discrimination. Meanwhile, the intra-class similarities should also be considered. Therefore, we further encourage the compactness within a class by the following loss function [36]:

$$\mathcal{L}_{\mathrm{comp}} = \frac{1}{M}\sum_{j=1}^{M}\left(1 - \mathrm{sim}(e_j, c_{y_j})\right) \tag{13}$$

Equation (13) directly minimizes the distance between substructure embeddings and their assigned prototypes. Through this objective, substructures within the same class can achieve better attribute-based consistency.

To sum up, the clustering objective is formed as follows:

$$\mathcal{L}_{\mathrm{clu}} = \mathcal{L}_{\mathrm{proto}} + \mathcal{L}_{\mathrm{comp}} \tag{14}$$
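The two clustering terms can be sketched together as below; the temperature and the exact compactness form (we use $1 - \mathrm{sim}$, matching Equation (13)) are assumptions.

```python
import torch
import torch.nn.functional as F

def clustering_loss(sub_emb, protos, assign_idx, tau=0.1):
    """Eqs. (11)-(14): prototype-substructure contrastive clustering loss
    plus intra-class compactness. assign_idx holds each substructure's class."""
    sub_n = F.normalize(sub_emb, dim=-1)        # normalise so dot products
    proto_n = F.normalize(protos, dim=-1)       # equal cosine similarities
    sim = sub_n @ proto_n.t() / tau             # Eq. (11): [B, K] similarities
    l_proto = F.cross_entropy(sim, assign_idx)  # Eq. (12): pull to own
                                                # prototype, push from others
    pos = (sub_n * proto_n[assign_idx]).sum(-1) # cosine to assigned prototype
    l_comp = (1.0 - pos).mean()                 # Eq. (13): compactness
    return l_proto + l_comp                     # Eq. (14)
```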
Considering the connectivity constraint (7), the overall self-supervised training objective is formulated as follows:

$$\mathcal{L}_{\mathrm{ssl}} = \alpha\,\mathcal{L}_{\mathrm{con}} + \beta\,\mathcal{L}_{\mathrm{clu}} \tag{15}$$

where $\alpha$ and $\beta$ are the weights of the two objectives.
After pre-training, POSIT can adaptively identify substructure instances from the input molecules in an unsupervised manner. If new molecular graphs come in, the pre-trained network can be applied conveniently to obtain meaningful substructures, which can then be leveraged to empower molecular representations.
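Putting the pieces together, one pre-training step under Equation (15) might look like the following sketch. The data loader, the dense-adjacency conversion, the loss weights, and all module names refer to the illustrative sketches above rather than the released code.

```python
import torch
from torch_geometric.utils import to_dense_adj

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(partitioner.parameters()), lr=1e-3)
alpha, beta = 1.0, 1.0                       # objective weights (assumed)
protos = torch.zeros(50, 128)                # K prototype vectors, [K, d]

for graph in loader:                         # hypothetical PyG DataLoader
    h = encoder(graph.x, graph.edge_index)   # node embeddings H, [N, d]
    s = partitioner(h)                       # assignment matrix S, [N, K]
    adj = to_dense_adj(graph.edge_index)[0]  # dense adjacency A, [N, N]
    l_con = connectivity_loss(s, adj)        # Eq. (7)
    e = substructure_embeddings(s, h)        # substructure embeddings, [K, d]
    present = s.argmax(dim=1).unique()       # classes occurring in this graph
    l_clu = clustering_loss(e[present], protos, present)   # Eq. (14)
    protos = update_prototypes(protos, e[present].detach(), present)
    loss = alpha * l_con + beta * l_clu      # Eq. (15)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```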
Fine-tuning stage
The second stage of POSIT is a fine-tuning step. It further optimizes the pre-trained network using labelled MPP data, where a cross-scale attention mechanism is introduced to extract informative molecular features for the predictive tasks.
Cross-scale attention
Substructures not only provide more contextualized information than individual nodes, but are also at a finer granularity compared with the whole molecule. Therefore, they are supposed to offer rich topological and attribute information. To this end, a cross-scale attention mechanism is devised to explicitly capture the interaction between the substructure-level (local) representations and the graph-level (global) representations to generate informative molecular features.
Globally, a pooling function operates on the node embeddings of the molecular graph:

$$h_{\mathcal{G}} = \mathrm{READOUT}\left(\left\{h_v : v \in \mathcal{V}\right\}\right) \tag{16}$$

where $h_{\mathcal{G}}$ is the global representation of graph $\mathcal{G}$, $\mathcal{V}$ is the node set of $\mathcal{G}$, and $\mathrm{READOUT}(\cdot)$ is a pooling function that compacts all nodes in $\mathcal{G}$ into a single vector.
Locally, substructure representations are utilized as processing units. Then, the global and local representations are bridged via the cross-scale attention mechanism, which is formed as follows:

$$\alpha_k = \underset{k}{\mathrm{softmax}}\left(\sigma\left(a^{\top}\left[W e_k \,\Vert\, h_{\mathcal{G}}\right]\right)\right) \tag{17}$$

$$h_{\mathrm{sub}} = \Big\Vert_{t=1}^{T}\ \sum_{k=1}^{K} \alpha_k^{(t)}\, e_k \tag{18}$$

where $e_k$ is the substructure embedding, $a$ and $W$ are trainable parameters that transform the embedding to a scalar, $\Vert$ denotes the concatenation operator, $\sigma$ is the activation function, and $T$ is the head count. In this process, substructures serve as fundamental building blocks of the molecule. They are combined linearly, with each substructure weighted according to its attention coefficient relative to the global molecular representation $h_{\mathcal{G}}$.
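A single-head sketch of this mechanism is shown below; folding $W$ and $a$ into one linear map and using $\tanh$ as $\sigma$ are our simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttention(nn.Module):
    """Eqs. (17)-(18), single head: attend over substructures conditioned
    on the global graph embedding."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # maps [e_k || h_G] to a scalar

    def forward(self, e, h_g):              # e: [K, d], h_g: [d]
        h_rep = h_g.unsqueeze(0).expand(e.size(0), -1)  # broadcast h_G
        logits = self.score(torch.cat([e, h_rep], dim=-1)).squeeze(-1)
        att = F.softmax(torch.tanh(logits), dim=0)      # Eq. (17) coefficients
        return (att.unsqueeze(-1) * e).sum(dim=0)       # Eq. (18): h_sub
```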
Supervised loss function
Given a molecular graph $\mathcal{G}$, we combine its summarized embedding $h_{\mathcal{G}}$ and the linear combination of the substructure embeddings $h_{\mathrm{sub}}$ as the final molecular representation:

$$h = h_{\mathcal{G}} \,\Vert\, h_{\mathrm{sub}} \tag{19}$$

For binary classification, the representation is fed into another MLP to generate the prediction $\hat{y}$:

$$\hat{y} = \mathrm{MLP}(h) \tag{20}$$
Considering the potential class imbalance, we adopt the focal loss to tackle the problem [37]. The formula is

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N_D}\sum_{i=1}^{N_D} \alpha_i\left(1 - p_{t,i}\right)^{\gamma}\log\left(p_{t,i}\right) \tag{21}$$

$$p_{t,i} = \begin{cases} \hat{y}_i, & y_i = 1\\ 1 - \hat{y}_i, & y_i = 0 \end{cases} \tag{22}$$

where $y_i$ is the ground-truth label, $\hat{y}_i$ is the predicted value, and $N_D$ is the size of the dataset. $\gamma$ is set to 0.25, and the balancing weight $\alpha_i$ is set according to the proportion of negative samples in experiments.
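A sketch of this class-balanced focal loss follows; the mapping of $\gamma$ and $\alpha$ to the reported settings reflects our reading of the text, since the original equations are not recoverable.

```python
import torch

def focal_loss(y_hat, y, alpha=0.5, gamma=0.25):
    """Eqs. (21)-(22): focal loss with class balancing.
    y_hat: predicted probabilities in (0, 1); y: binary labels (same shape).
    alpha would be set to the proportion of negative samples."""
    p_t = torch.where(y == 1, y_hat, 1.0 - y_hat)           # Eq. (22)
    alpha_t = torch.where(y == 1,
                          torch.full_like(y_hat, alpha),    # weight positives
                          torch.full_like(y_hat, 1 - alpha))
    return -(alpha_t * (1 - p_t) ** gamma * torch.log(p_t + 1e-8)).mean()
```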
For regression tasks, the final prediction $\hat{y}$ is first produced by the MLP. Then, the MSE loss is adopted as the loss function:

$$\mathcal{L}_{\mathrm{reg}} = \frac{1}{N_D}\sum_{i=1}^{N_D}\left(\hat{y}_i - y_i\right)^2 \tag{23}$$
The overall loss function, including the self-supervised objective and the supervised objective, is formed as follows:

$$\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \eta\,\mathcal{L}_{\mathrm{ssl}} \tag{24}$$

where $\eta$ is a hyper-parameter for the trade-off between the supervised loss term and the self-supervised loss term.
Experiments and results
In this section, we report an extensive set of experimental results including the comparisons with baselines, ablation studies, and visual analysis to demonstrate the effectiveness and substructure identification capability of POSIT.
Datasets
We mainly consider MPP tasks, including classification and regression. For the pre-training stage, HIV is used as unlabeled pre-training data. It is a molecular dataset originating from MoleculeNet [38] that contains 41 127 molecules in total. (Although this dataset is relatively small, it already demonstrates encouraging performance when used as the pre-training dataset for POSIT; we will investigate larger pre-training datasets in future studies.) The performance of our framework on downstream MPP tasks is evaluated on 10 commonly used labelled datasets from MoleculeNet. These datasets cover a wide range of molecular properties, including physical chemistry, physiology, and biophysics. Among the 10 datasets, 7 are classification tasks and 3 are regression tasks. The statistical information of the datasets is summarized in Table 1, and their detailed descriptions are listed in Supplementary Table S3.
Table 1.
The statistical information of datasets
| Category | Dataset | Task type | #Molecules | #Tasks | #Avg nodes | Split | Metric |
|---|---|---|---|---|---|---|---|
| Biophysics | BACE | Classification | 1513 | 1 | 34.1 | Scaffold | ROC-AUC |
| Biophysics | HIV | Classification | 41 127 | 1 | 25.5 | Scaffold | ROC-AUC |
| Physiology | BBBP | Classification | 2050 | 1 | 23.9 | Scaffold | ROC-AUC |
| Physiology | Tox21 | Classification | 7831 | 12 | 18.6 | Random | ROC-AUC |
| Physiology | ToxCast | Classification | 8597 | 617 | 18.7 | Random | ROC-AUC |
| Physiology | ClinTox | Classification | 1484 | 2 | 26.1 | Random | ROC-AUC |
| Physiology | SIDER | Classification | 1427 | 27 | 33.6 | Random | ROC-AUC |
| Physical chemistry | ESOL | Regression | 1128 | 1 | 13.3 | Random | RMSE |
| Physical chemistry | FreeSolv | Regression | 642 | 1 | 8.7 | Random | RMSE |
| Physical chemistry | Lipophilicity | Regression | 4200 | 1 | 27.0 | Random | RMSE |
Data preprocessing
The molecular data were initially obtained as SMILES strings and then transformed into graph structures using RDKit [39]. In the first stage, the HIV dataset is used for pre-training. In the fine-tuning stage, all datasets were split into training, validation, and testing subsets with a ratio of 8:1:1. For fair comparisons, we adopted the same data splitting strategies as previous works, namely random and scaffold splitting [18, 38], as listed in Table 1. Compared to random splitting, scaffold splitting typically generates datasets that are more challenging for predictive models. Following MoleculeNet [38], the ROC-AUC metric is used for evaluating the performance on the classification tasks, where a higher score means a more accurate prediction. For regression tasks, the Root Mean Squared Error (RMSE) is used as the metric, where a lower score indicates a better result.
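As a concrete illustration of this conversion, a minimal SMILES-to-graph sketch with RDKit is shown below; the actual feature set follows Supplementary Table S1, so the single atomic-number feature used here is a placeholder.

```python
import torch
from rdkit import Chem

def smiles_to_graph(smiles: str):
    """Convert a SMILES string into node features and an edge index."""
    mol = Chem.MolFromSmiles(smiles)
    x = torch.tensor([[atom.GetAtomicNum()] for atom in mol.GetAtoms()],
                     dtype=torch.float)              # placeholder node features
    src, dst = [], []
    for bond in mol.GetBonds():                      # add both edge directions
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        src += [i, j]
        dst += [j, i]
    edge_index = torch.tensor([src, dst], dtype=torch.long)
    return x, edge_index
```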
Baseline models
To thoroughly validate our framework, we compare its performance with 8 advanced baseline models. All these baseline models follow the same data preprocessing strategy. The baselines are mainly divided into two categories:
MPP-oriented GNN variants. These baselines are introduced to compare the performance of POSIT with models using domain knowledge in chemistry. Among them, Attentive FP is a GNN variant based on a multi-level attention mechanism designed for MPP tasks [13]. D-MPNN updates relations between chemical bonds during the message passing phase to explicitly include bond information [12]. TrimNet employs a novel triplet message mechanism to calculate messages from atom-bond-atom information for molecular representation [40]. HRGCN+ combines molecular graphs and molecular descriptors as inputs to the GNN model [41]. FP-GNN leverages molecular graphs and fingerprints simultaneously for MPP tasks [14].
Models with molecular fragmentation rules. These baselines are introduced to compare the effectiveness of adaptive substructure mining with methods using pre-defined fragmentation rules. Among them, FraGAT defines all acyclic single bonds as breakable bonds and randomly chooses a breakable bond to partition the molecule into two fragments [17]. This form of fragmentation is efficient and easy to process. Differently, HiGNN segments a molecule based on the BRICS algorithm, a decomposition algorithm based on biochemical domain knowledge [18]. It uses 16 pre-defined cleavage rules to decompose the molecules. This approach can obtain an indefinite number of substructures from various molecules.
Besides the GNN-based models and those methods using pre-defined fragmentation rules, we include XGBoost [42, 43], an advanced machine learning algorithm that is a commonly used baseline in MPP tasks.
The input atomic and bond features shared by the baseline models are consistent with those introduced by Xiong et al. and Zhu et al. [13, 18]. These features are detailed in Supplementary Table S1.
Experimental setting
Parameter setup. In the pre-training stage, we used Attentive FP with edge attributes as the GNN encoder [13]. The model was pre-trained for 100 epochs with the Adam optimizer. During pre-training, the learning rate was adjusted by a cosine-based scheduler. After that, the model was fine-tuned for 200 epochs with early stopping. The selection of the prototype count $K$ is described in the Sensitivity analysis subsection below. The trade-off coefficients between the loss terms were set to fixed values. Other hyper-parameters were optimized on the validation set, and their selection ranges are detailed in Supplementary Table S4. We performed five independent runs for each dataset with different random seeds and report the mean value and standard deviation of the metrics.
Experimental environment. Code for the experiments was implemented in Python. In particular, PyTorch and PyTorch Geometric are the primary third-party libraries we utilized for implementing POSIT. All experiments were carried out on a single 2080 Ti GPU.
Performance validation
To verify the performance of our framework, extensive experiments were conducted on 10 biochemical datasets, covering classification and regression tasks. We compare the performance with MPP-specific models, fragmentation rule-based models, and XGBoost. The results of Attentive FP, HRGCN+, and XGBoost were collected from Cai et al. [14]. Other results were collected from the respective original studies. The performance comparisons of POSIT are presented in Table 2.
Table 2.
Performance comparisons with baseline models
| Dataset | Attentive FP | D-MPNN | HRGCN+ | TrimNet | FP-GNN | FraGAT | HiGNN | XGBoost | POSIT |
|---|---|---|---|---|---|---|---|---|---|
| BACE | 0.846 | 0.857 | – | 0.878 | 0.860 | 0.801 | 0.882 | – | **0.900 ± 0.028** |
| Tox21 | 0.852 | 0.854 | 0.848 | 0.860 | 0.815 | 0.843 | 0.856 | 0.836 | **0.861 ± 0.017** |
| ToxCast | 0.794 | 0.764 | 0.793 | 0.777 | – | 0.714 | 0.781 | 0.774 | **0.796 ± 0.017** |
| BBBP | 0.909 | 0.886 | – | 0.850 | 0.916 | 0.923 | 0.929 | – | **0.938 ± 0.024** |
| ClinTox | 0.904 | 0.897 | 0.899 | 0.948 | 0.840 | **0.964** | 0.930 | 0.911 | 0.845 ± 0.087 |
| SIDER | 0.623 | 0.658 | 0.641 | 0.657 | 0.661 | 0.618 | 0.651 | 0.642 | **0.662 ± 0.037** |
| HIV | 0.757 | 0.794 | – | 0.804 | **0.824** | 0.761 | 0.802 | – | 0.782 ± 0.025 |
| ESOL | 0.587 | 0.587 | 0.563 | – | 0.675 | 0.536 | 0.645 | 0.582 | **0.524 ± 0.013** |
| FreeSolv | 1.091 | 1.009 | 0.926 | – | **0.905** | 1.020 | 0.915 | 1.025 | 1.074 ± 0.035 |
| Lipophilicity | **0.549** | 0.563 | 0.603 | – | 0.625 | 0.651 | **0.549** | 0.574 | 0.609 ± 0.030 |

Note: the best performance on each dataset is shown in bold. '–' means the results were not reported in the original studies.
As shown in Table 2, POSIT outperforms the eight competing baselines on five out of seven classification datasets. For the three regression datasets, it achieves the best performance on ESOL and exhibits competitive performance on Lipophilicity. All these observations indicate that POSIT is capable of effectively predicting a wide range of molecular properties.
The performance is less satisfactory on the FreeSolv, ClinTox, and HIV datasets. We notice that FreeSolv has a very small data size according to Table 1, making it challenging to adapt the pre-trained model to downstream domains. Meanwhile, the average node count of molecules in FreeSolv is the smallest among all datasets; thus, the substructure classes may be simple and limited, making it difficult to benefit from diverse substructural features. Besides, the distribution of data labels in HIV is extremely unbalanced, which may hinder the model's generalization ability. As for ClinTox, it combines the traits of a small data size and a highly unbalanced label distribution, causing suboptimal results.
Ablation studies
To further explore the effectiveness of the key components of POSIT, several variants of POSIT have been designed. These variants evaluate the effectiveness of the connectivity objective, the contrastive clustering objective, and the cross-scale attention mechanism:
POSIT without connectivity constraint (w/o Con). This variant removes the modularity and orthogonality-based objectives, which are used to encourage internal connectivity within substructures.
POSIT without prototypical contrastive clustering (w/o Clu). This variant removes the substructure clustering module based on prototype learning. Therefore, substructure identification only relies on the connectivity objective.
POSIT without cross-scale attention (w/o Att). This variant retains the complete pre-training architecture but removes the cross-scale attention mechanism during fine-tuning. Only the global graph embedding $h_{\mathcal{G}}$ is used for prediction.
We have conducted the ablation studies on all datasets adhering to the above experimental setups. The results are demonstrated in Fig. 3.
Figure 3.

Results of the ablation experiments. Left: The results of classification tasks. ROC-AUC is used as the metric. Right: The results of regression tasks, using RMSE as the metric.
Impact of the two pre-training components. The pre-training consists of two components: connectivity optimization and prototypical contrastive clustering. In Fig. 3, it is evident that POSIT performs better than POSIT w/o Con on all classification and regression datasets, indicating the contribution of the connectivity constraint; a performance degradation of up to 4% was observed across datasets. In addition, without the connectivity objective, distant atoms may be partitioned into the same substructure, violating chemical priors. Likewise, an average degradation of 2% is observed for POSIT w/o Clu on most datasets except BACE and HIV. These results indicate that prototypical contrastive clustering leads to more effective substructure identification. Furthermore, the performance rank of POSIT is the most stable across all datasets compared with the variants. Therefore, both components are indispensable for the pre-training stage.
Impact of the cross-scale attention. It is observed that POSIT outperforms POSIT w/o Att on all datasets. When the cross-scale attention is removed, the performance drops by 3% and 5% on the BACE and HIV datasets, respectively, and by an average of 1.5% on the other datasets. As a result, it can be inferred that capturing substructural information and fusing hierarchical graph representations effectively contribute to prediction performance.
Visualization analysis
In this subsection, we present a qualitative analysis of POSIT through visualization to validate its capability for substructure identification. Concretely, we visualize the distribution of substructures and corresponding prototypes, as well as the identified substructures in each molecule. Additionally, the distributions of substructure counts in each class on different datasets are visualized in Supplementary Fig. S2.
Distribution of substructures. Figure 5 visualizes the distribution of substructures and prototypes on six datasets after pre-training and fine-tuning. T-SNE is utilized to reduce the dimension of the embeddings for visualization [44]. The visualization results of the other datasets are shown in Supplementary Fig. S1. In pre-training, the number of prototypes is set to 30 for visual clarity. It can be clearly observed that semantically similar substructures gather closely around their prototypes and form clusters. Meanwhile, there are clear boundaries between different substructure classes. Moreover, the clusters contain similar numbers of substructures, showing that the orthogonality objective avoids degenerate solutions. Thus, the pre-training stage can effectively identify meaningful substructures in datasets.
Figure 5.
T-SNE visualization of the substructure and prototype distributions on six datasets. The prototype count $K$ is set to 30. Each dot represents a substructure, and a distinct marker represents a prototype. Substructures identified as the same class share the same color.
Identification of substructures. In Fig. 6, we selected three molecules from each of the four datasets and visualized the substructure instances identified in them by the pre-trained network. Specifically, the substructure assignment of node $i$ is determined by the highest probability in its assignment vector $p_i$ from Equation (3). Substructures of different classes are marked with distinct background styles (colors and filling types), while those of the same class have identical styles.
Figure 6.
Visualization of the substructures identified. Atoms in a molecule with the same background form a substructure instance. Substructures belonging to the same class across datasets are marked. Substructures that do not match the chemical prior are also marked (which are typically small subgraphs).
It can be observed that nodes in the same substructure are closely connected, and most identified substructures are consistent with chemical priors. For instance, three substructures are identified in the molecule of Fig. 6(d)(1), which correspond to the phenyl group, the phenolic hydroxyl group, and the carboxyl group, respectively. Besides, substructures marked with the same color share consistent attributes and topology across molecules, which can represent the semantics of a certain functional group. For instance, the carboxyl groups in Fig. 6(a)(1), (d)(1), and (d)(3) are identified as the same class. Also, multiple identical substructures within a molecule are assigned to the same class and can be clearly identified; in Fig. 6(a)(3), two symmetric hydrazine groups are identified in one molecule. Moreover, some substructures with the same attributes but different structural contexts are distinguished. For instance, the -OH group in Fig. 6(a)(1) and the -OH group in Fig. 6(b)(1) are identified as different classes, where one represents an alcoholic hydroxyl group and the other a phenolic hydroxyl group.
Despite the above advantages, several suboptimal cases are also observed. First, some unrelated atoms are partitioned into certain substructure classes. For instance, although the alcoholic hydroxyl group in Fig. 6(a)(2) is identified, it contains extra carbon atoms compared with the same group in Fig. 6(b)(1). Additionally, rings in the molecules may be broken; for instance, one of the pyridines in Fig. 6(b)(3) is not clearly identified. However, the probabilistic nature of the partition mitigates such undesired partitions. Different arrangements of the same atoms may also lead to different identifications; for instance, the acylamino groups in Fig. 6(c)(1) and Fig. 6(c)(2) are identified as two different classes by POSIT. There are also substructures that cannot be aligned to known functional groups, such as those marked in gray in Fig. 6(a)(2) and (a)(3).
Overall, despite the suboptimal cases, the formation of numerous substructure classes and their prototypes is satisfactory in general, corresponding to chemically meaningful functional groups. This validates the advanced substructure identification capability of the proposed method.
Sensitivity analysis
In this part, we analyze the impact of the pre-defined number of prototypes $K$. Figure 4 illustrates the performance for different selections of $K$ ranging from 10 to 120 on SIDER and Tox21. On both datasets, when the initial $K$ is small, performance increases as $K$ increases. When $K$ reaches around 50, performance peaks and then decreases slightly. The performance does not fluctuate violently as $K$ changes. Overall, the performance of POSIT is relatively robust against the choice of the hyper-parameter $K$.
Figure 4.

Impact of the number of prototypes K on performance. The shaded area represents the standard deviation.
Conclusion
In this paper, we introduced POSIT, a novel self-supervised approach designed for MPP tasks. The key innovations of POSIT lie in (1) its ability to adaptively identify substructures from molecules; and (2) its ability to explicitly incorporate substructure-level information to enhance molecular representations. During pre-training, the connectivity constraint and the prototypical contrastive clustering objective together generate meaningful substructures. In fine-tuning, the cross-scale attention mechanism is leveraged to integrate the substructure-level information into graph-level representations. On this basis, POSIT allows for effective molecular property prediction.
We provided a detailed analysis of the performance and capabilities of POSIT. The results of extensive experiments demonstrated POSIT’s effectiveness on MPP tasks. The visualization analysis further validates the advanced capability of POSIT in identifying substructures and aligning them with chemical priors.
Despite the promising results of POSIT, there are still directions for further improvement. For example, real-world molecules are 3D in nature, so incorporating 3D geometric information into pre-training could further enhance its effectiveness.
Key Points
We introduce the Prototype-based cOntrastive Substructure IdentificaTion (POSIT) framework, a self-supervised learning approach designed to autonomously discover substructural prototypes across molecular graphs. This innovation allows for the adaptive identification of meaningful substructures to enhance MPP tasks without manual rule definition.
POSIT employs a two-stage learning process consisting of pre-training and fine-tuning. During pre-training, a graph encoder and partitioner work in tandem to identify substructures, emphasizing connectivity and attribute-based similarity. The fine-tuning stage integrates substructure-level information through a cross-scale attention mechanism, enhancing molecular representations to improve prediction performance.
Extensive experiments on various real-world datasets demonstrate POSIT’s effectiveness in both classification and regression MPP tasks. The results highlight POSIT’s superior performance compared to multiple baseline models, validating its capability for accurate molecular property prediction.
Supplementary Material
Contributor Information
Gaoqi He, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
Shun Liu, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
Zhuoran Liu, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
Changbo Wang, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
Kai Zhang, School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
Honglin Li, Innovation Center for AI and Drug Discovery, East China Normal University, 200062 Shanghai, China; Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science & Technology, 200237 Shanghai, China.
Funding
This work was supported in part by National Key Research and Development Program of China (No. 2022YFC3400501), Fundamental Research Funds for the Central Universities, National Natural Science Foundation of China (No. 62276099, 62002121 and 62072183), Natural Science Foundation of Chongqing, China (No. CSTB2022NSCQ-MSX0552), the Open Projects Program of State Key Laboratory of Multimodal Artificial Intelligence Systems (No. MAIS2024111).
Conflict of interest: None declared.
Data availability
Datasets and source codes described in this paper are available at https://github.com/VRPharmer/POSIT.
References
- 1. Li Z, Jiang M, Wang S. et al. Deep learning methods for molecular representation and property prediction. Drug Discov Today 2022;27:103373. 10.1016/j.drudis.2022.103373.
- 2. Schneider P, Walters WP, Plowright AT. et al. Rethinking drug design in the artificial intelligence era. Nat Rev Drug Discov 2020;19:353–64. 10.1038/s41573-019-0050-3.
- 3. Yi H-C, You Z-H, Huang D-S. et al. Graph representation learning in bioinformatics: trends, methods and applications. Brief Bioinform 2022;23:bbab340. 10.1093/bib/bbab340.
- 4. Deng J, Yang Z, Ojima I. et al. Artificial intelligence in drug discovery: applications and techniques. Brief Bioinform 2022;23:bbab430. 10.1093/bib/bbab430.
- 5. Mancuso CA, Johnson KA, Liu R. et al. Joint representation of molecular networks from multiple species improves gene classification. PLoS Comput Biol 2024;20:e1011773. 10.1371/journal.pcbi.1011773.
- 6. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–6. 10.1021/ci00057a005.
- 7. Jiang J, Zhang R, Zhao Z. et al. MultiGran-SMILES: multi-granularity SMILES learning for molecular property prediction. Bioinformatics 2022;38:4573–80. 10.1093/bioinformatics/btac550.
- 8. Zhang X-C, Wu C-K, Yang Z-J. et al. MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. Brief Bioinform 2021;22:bbab152. 10.1093/bib/bbab152.
- 9. Atz K, Grisoni F, Schneider G. Geometric deep learning on molecular representations. Nat Mach Intell 2021;3:1023–32. 10.1038/s42256-021-00418-8.
- 10. Wu T, Tang Y, Sun Q. et al. Molecular joint representation learning via multi-modal information of SMILES and graphs. IEEE/ACM Trans Comput Biol Bioinform 2023;20:3044–55. 10.1109/TCBB.2023.3253862.
- 11. Wieder O, Kohlbacher S, Kuenemann M. et al. A compact review of molecular property prediction with Graph Neural Networks. Drug Discov Today Technol 2020;37:1–12. 10.1016/j.ddtec.2020.11.009.
- 12. Yang K, Swanson K, Jin W. et al. Analyzing learned molecular representations for property prediction. J Chem Inf Model 2019;59:3370–88. 10.1021/acs.jcim.9b00237.
- 13. Xiong Z, Wang D, Liu X. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem 2020;63:8749–60. 10.1021/acs.jmedchem.9b00959.
- 14. Cai H, Zhang H, Zhao D. et al. FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction. Brief Bioinform 2022;23:bbac408. 10.1093/bib/bbac408.
- 15. Bader RFW, Popelier PLA, Keith TA. Theoretical definition of a functional group and the molecular orbital paradigm. Angew Chem Int Ed Engl 1994;33:620–31. 10.1002/anie.199406201.
- 16. Kotera M, McDonald AG, Boyce S. et al. Functional group and substructure searching as a tool in metabolomics. PLoS One 2008;3:e1537. 10.1371/journal.pone.0001537.
- 17. Zhang Z, Guan J, Zhou S. FraGAT: a fragment-oriented multi-scale graph attention model for molecular property prediction. Bioinformatics 2021;37:2981–7. 10.1093/bioinformatics/btab195.
- 18. Zhu W, Zhang Y, Zhao D. et al. HiGNN: a hierarchical informative Graph Neural Network for molecular property prediction equipped with feature-wise attention. J Chem Inf Model 2023;63:43–55. 10.1021/acs.jcim.2c01099.
- 19. Sun H, Wang G, Liu Q. et al. An explainable molecular property prediction via multi-granularity. Inform Sci 2023;642:119094. 10.1016/j.ins.2023.119094.
- 20. Xie A, Zhang Z, Guan J. et al. Self-supervised learning with chemistry-aware fragmentation for effective molecular property prediction. Brief Bioinform 2023;24:1–13. 10.1093/bib/bbad296.
- 21. Kong X, Huang W, Tan Z. et al. Molecule generation by principal subgraph mining and assembling. Adv Neural Inf Process Syst 2022;35:2550–63.
- 22. Nguyen LBQ, Zelinka I, Snasel V. et al. Subgraph mining in a large graph: a review. Wiley Interdiscip Rev Data Min Knowl Discov 2022;12:e1454. 10.1002/widm.1454.
- 23. Ying Z, You J, Morris C. et al. Hierarchical graph representation learning with differentiable pooling. Adv Neural Inf Process Syst 2018;31:4805–15.
- 24. Bianchi FM, Grattarola D, Alippi C. Spectral clustering with Graph Neural Networks for graph pooling. In: International Conference on Machine Learning, pp. 874–83. PMLR, 2020.
- 25. Subramonian A. Motif-driven contrastive learning of graph representations. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 15980–1. 2021. 10.1609/aaai.v35i18.17986.
- 26. Zhu Y, Zhang K, Wang J. et al. Structural landmarking and interaction modelling: a "SLIM" network for graph classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, pp. 9251–9. 2022. 10.1609/aaai.v36i8.20912.
- 27. Du B, Zhang S, Cao N. et al. FIRST: fast interactive attributed subgraph matching. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1447–56. ACM, 2017.
- 28. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations, 2017.
- 29. Brandes U, Delling D, Gaertler M. et al. Maximizing modularity is hard. arXiv preprint physics/0608255, 2006.
- 30. Tsitsulin A, Palowitch J, Perozzi B. et al. Graph clustering with Graph Neural Networks. J Mach Learn Res 2023;24:1–21.
- 31. Caron M, Misra I, Mairal J. et al. Unsupervised learning of visual features by contrasting cluster assignments. Adv Neural Inf Process Syst 2020;33:9912–24.
- 32. Li J, Zhou P, Xiong C. et al. Prototypical contrastive learning of unsupervised representations. In: International Conference on Learning Representations, 2021.
- 33. Lin S, Liu C, Zhou P. et al. Prototypical graph contrastive learning. IEEE Trans Neural Netw Learn Syst 2022;35:2747–58. 10.1109/TNNLS.2022.3191086.
- 34. Ren Y, Ke L, Dong L. et al. Incremental graph classification by class prototype construction and augmentation. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 2136–45. ACM, 2023.
- 35. Peng M, Juan X, Li Z. Graph prototypical contrastive learning. Inform Sci 2022;612:816–34. 10.1016/j.ins.2022.09.013.
- 36. Zhou T, Wang W, Konukoglu E. et al. Rethinking semantic segmentation: a prototype view. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2582–93. IEEE, 2022.
- 37. Lin T-Y, Goyal P, Girshick R. et al. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–8. IEEE, 2017.
- 38. Wu Z, Ramsundar B, Feinberg EN. et al. MoleculeNet: a benchmark for molecular machine learning. Chem Sci 2018;9:513–30.
- 39. Landrum G. RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling. 2013.
- 40. Li P, Li Y, Hsieh C-Y. et al. TrimNet: learning molecular representation from triplet messages for biomedicine. Brief Bioinform 2021;22:bbaa266. 10.1093/bib/bbaa266.
- 41. Wu Z, Jiang D, Hsieh C-Y. et al. Hyperbolic relational graph convolution networks plus: a simple but highly efficient QSAR-modeling method. Brief Bioinform 2021;22:bbab112.
- 42. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–94. ACM, 2016.
- 43. Deng D, Chen X, Zhang R. et al. XGraphBoost: extracting Graph Neural Network-based features for a better prediction of molecular properties. J Chem Inf Model 2021;61:2697–705. 10.1021/acs.jcim.0c01489.
- 44. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605.