Abstract
In this paper, we propose DGCL, a dual-graph neural network (GNN)-based contrastive learning (CL) framework integrated with mixed molecular fingerprints (MFPs) for molecular property prediction. The DGCL method consists of two stages. In the first, pretraining stage, we use two different GNNs as encoders to construct the CL objective, rather than generating augmented graphs as in previous work. Specifically, DGCL aggregates and enhances features of the same molecule with the Graph Isomorphism Network and the Graph Attention Network: representations extracted from the same molecule serve as positive samples, while those from other molecules are marked as negative. In the downstream training stage, the features extracted by the two pretrained graph networks are concatenated with carefully selected MFPs to predict molecular properties. Our experiments show that DGCL enhances the performance of existing GNNs, matching or surpassing state-of-the-art self-supervised learning models on multiple benchmark datasets. Specifically, DGCL increases the average performance on classification tasks by 3.73% and improves performance on the regression task Lipo by 0.126. Through ablation studies, we validate the impact of network fusion strategies and MFPs on model performance. In addition, DGCL's predictive performance is further enhanced by weighting the different molecular features based on the Extended Connectivity Fingerprint. The code and datasets of DGCL will be made publicly available.
Keywords: molecular property prediction, contrastive learning, graph neural network
Introduction
In molecular biology and systems biology, there is an abundance of unlabeled data in contrast to a relatively small proportion of labeled data [1, 2]. The unannotated datasets hold valuable insights into new biological patterns and phenomena that remain unexplored. However, the lack of labels limits the direct applicability of supervised learning methods [3–9]. Self-supervised learning (SSL) enables models to learn from unlabeled data and has attracted considerable attention in molecular property prediction. Recent advancements in SSL have significantly improved molecular property prediction. For example, N-Gram [10] captures local structural features by counting the frequency of fixed-length molecular fragments. PretrainGNN [11] leverages SSL by designing node-level and graph-level tasks to capture complex dependencies within molecular graphs. GROVER [12] utilizes node-level and edge-level graph neural network (GNN) architectures, designing contextual property prediction and graph-level motif prediction tasks to achieve SSL. MG-BERT [13] integrates GNNs with BERT's [14] masked language model pretraining to effectively learn context-sensitive atomic representations. GEM [15] employs mutual information to capture intricate dependencies within molecular graphs. Meanwhile, Uni-Mol [16] utilizes a unified model architecture to facilitate learning across various molecular datasets, resulting in more generalized and robust representations.
Contrastive learning (CL), as an effective SSL paradigm, consistently demonstrates outstanding capabilities across various fields [17, 18]. Simultaneously, CL has also made significant advancements in several biological domains, including protein structure prediction [19–21], gene expression analysis [22–24], and MFP generation [25–27]. In the context of molecular property prediction, many existing CL methods for molecular representation utilize diverse strategies to enhance prediction accuracy. For instance, GeomGCL [28] and GraphMVP [29] explore the complementarity of various molecular representations, enabling models to assimilate multifaceted information. However, these methods may encounter information overlap between representations, potentially limiting the enhancement of model learning capabilities. Alternatively, methods such as GraphCL [30] and MolCLR [31] concentrate on enhancing molecular graphs to improve the model’s comprehension of molecular structures and properties. However, these techniques often lack robust methods for generating enhanced representations based on established prior knowledge. Based on the above analysis, we find that existing CL approaches primarily emphasize data augmentation, whereas our work shifts the focus to the methods of molecular feature extraction, which can extract the local and global features from convolution-based and transformer-based models, respectively.
In this paper, we propose DGCL, a novel CL framework that uses dual GNNs to simultaneously extract distinct features of the same molecule. The framework of DGCL is illustrated in Fig. 1. In the pretraining stage, DGCL alters the molecular feature extraction strategy instead of generating augmented graphs, ensuring that the intrinsic properties of molecules remain unchanged. The core idea is to let different networks aggregate features from the same molecule through CL: features from the same molecule are marked as positive, while all others are considered negative. This approach enhances the complementarity of the extracted molecular information. In this work, we adopt two widely used graph models with different strengths, the Graph Attention Network (GAT) and the Graph Isomorphism Network (GIN), as the GNN encoders in DGCL.
Figure 1.
Illustrative flowchart of the proposed DGCL framework for molecular property prediction. (A) DGCL pretraining stage. Two feature representations are generated for each molecular graph using GIN and GAT and passed through projection heads for contrastive analysis; representations derived from the same molecular graph via different graph networks are treated as positive pairs, while representations of the other samples in the batch are treated as negative pairs. (B) DGCL downstream task stage. Molecular features from the pretrained GIN and GAT networks are concatenated with carefully selected mixed MFPs, which are transformed via fully connected layers to match the dimensionality of the graph-network features; the prediction layer then outputs the results for molecular property prediction. (C) DGCL with attention modulation. Based on ECFP, weights are assigned to the molecular features from the pretrained GIN and GAT networks and the mixed MFPs, which are then combined to predict downstream tasks.
In the downstream stage, the molecular representations obtained from the pretrained graph network models are combined with carefully selected mixed molecular fingerprints (MFPs) to collaboratively predict molecular properties. Graph networks primarily extract the topological structures of molecules, while MFPs effectively supplement information about functional groups. The mixed MFPs combine three fingerprint types, chosen according to the standard classification of MFPs, to characterize molecular structures and functional groups more comprehensively. Additionally, weighting the three representations, those from the two pretrained graph networks and the MFPs, based on the extended connectivity fingerprint (ECFP) can further enhance the accuracy of molecular property predictions.
We conduct experiments on a series of downstream tasks related to molecular property prediction, including seven classification tasks and four regression tasks, to evaluate the effectiveness of DGCL. The results demonstrate that DGCL achieves state-of-the-art (SOTA) performance on multiple tasks compared with multiple baselines. Specifically, DGCL achieves the best average performance on six classification tasks, improving by 3.73% over the suboptimal model; on the regression task Lipo, it outperforms the suboptimal method by 0.126. Furthermore, this paper explores the impacts of the network fusion strategy and the MFP strategy, confirming the superiority of our chosen strategies. We also use visualization to illustrate that DGCL maps the representations of similar molecules to close positions in high-dimensional space, providing new insights for the further development of SSL models.
In summary, DGCL makes the following contributions:
DGCL is the first to propose a general molecular pretraining framework from the perspective of encoder-based CL, focusing on different molecular feature extraction strategies rather than data augmentation.
In downstream tasks, DGCL has incorporated carefully selected MFPs to add pharmacophore information to the model, further enhancing its performance in predicting molecular properties.
Benefiting from pretraining on a large amount of unlabeled data, simple GNN models trained with DGCL demonstrate performance superior to other baselines across multiple datasets.
Through ablation studies, we have demonstrated the impact of the network integration strategy, MFPs, and feature dimensions on DGCL, confirming the significant advantages of the strategies we selected.
Materials and methods
Graph neural network
A molecular graph is natural graph-structured data. Each molecule can be converted into an undirected graph $G=(V,E)$, where $V$ represents the node set of atoms and $E$ represents the edge set of chemical bonds. GNNs update the characteristics of each node by aggregating information from itself and its neighbors [32]:

$$h_v^{(k)} = \mathrm{COMBINE}^{(k)}\left(h_v^{(k-1)},\ \mathrm{AGGREGATE}^{(k)}\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)\right) \tag{1}$$

where $h_v^{(k)}$ represents the feature vector of node $v$ at layer $k$; $\mathcal{N}(v)$ denotes the set of neighboring nodes of $v$; $\mathrm{AGGREGATE}$ is a node-level function that typically computes a summary statistic such as the sum, mean, or max of the features of neighboring nodes $u$ of node $v$; and $\mathrm{COMBINE}$ is a function that integrates the original node feature with the aggregated neighborhood features. Ultimately, the node vectors are aggregated into a graph-level vector $h_G$ as the output:

$$h_G = \mathrm{READOUT}\left(\left\{h_v^{(K)} : v \in V\right\}\right) \tag{2}$$

where $\mathrm{READOUT}$ is a function that integrates features from all nodes in the graph $G$ to produce a single vector representation of the graph.
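The aggregate-combine-readout scheme in Eqs (1) and (2) can be sketched in a few lines of plain Python. Sum aggregation, an element-wise mean as a toy COMBINE, and a mean readout are illustrative choices for the sketch, not DGCL's actual operators:

```python
# Minimal sketch of one GNN message-passing layer (Eq. 1) and a mean
# readout (Eq. 2). Every node is assumed to have at least one neighbor.

def aggregate(neighbor_feats):
    # AGGREGATE: element-wise sum over the neighbor feature vectors
    dim = len(neighbor_feats[0])
    return [sum(f[d] for f in neighbor_feats) for d in range(dim)]

def combine(self_feat, agg_feat):
    # COMBINE: here simply the element-wise mean of self and aggregate
    return [(s + a) / 2.0 for s, a in zip(self_feat, agg_feat)]

def gnn_layer(node_feats, edges):
    # edges: undirected pairs (u, v); build adjacency lists first
    neighbors = {v: [] for v in node_feats}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return {
        v: combine(node_feats[v],
                   aggregate([node_feats[u] for u in neighbors[v]]))
        for v in node_feats
    }

def readout(node_feats):
    # READOUT: mean over all node vectors gives the graph-level vector h_G
    n = len(node_feats)
    dim = len(next(iter(node_feats.values())))
    return [sum(f[d] for f in node_feats.values()) / n for d in range(dim)]

# Toy 3-node path graph (e.g. a C-C-O fragment) with 2-dim features
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
updated = gnn_layer(feats, [(0, 1), (1, 2)])
h_g = readout(updated)
```

Swapping `aggregate`, `combine`, and `readout` for learned functions recovers concrete architectures such as the GAT and GIN encoders used below.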
In the pretraining stage, we employ the GAT [33] and GIN [34] as GNN encoders due to their distinctive benefits. The GAT enhances model expressiveness through its multi-head attention mechanism, which significantly reduces overfitting. Meanwhile, the GIN excels in recognizing topological differences between graphs, matching the robustness of the Weisfeiler-Lehman (WL) graph isomorphism test [35], thereby ensuring accurate structural representations in our DGCL.
Graph attention network
GAT [33] incorporates an attention mechanism to capture local dependencies within graph structures. In DGCL, we consider both node features and edge features. Initially, the attention from node $j$ to node $i$, denoted as $e_{ij}$, is computed as

$$e_{ij} = \mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[\mathbf{W}h_i \,\|\, \mathbf{W}h_j \,\|\, \mathbf{W}_e b_{ij}\right]\right) \tag{3}$$

where $h_i$ and $h_j$ represent the embedding representations of nodes $i$ and $j$, respectively, and $b_{ij}$ is the embedding representation of the edge connecting nodes $i$ and $j$. It is important to note that $\mathbf{a}$, $\mathbf{W}$, and $\mathbf{W}_e$ are trainable parameters. Then, the attention score $\alpha_{ij}$ between the two nodes can be obtained according to

$$\alpha_{ij} = \mathrm{softmax}_j\left(e_{ij}\right) = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(e_{ik}\right)} \tag{4}$$

The feature representation of nodes is updated according to the attention scores:

$$h_i' = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\,\mathbf{W}h_j\right) \tag{5}$$

After updating all nodes, the model computes the average of all node representations as the output for the entire molecular graph:

$$h_G = \frac{1}{N}\sum_{i=1}^{N} h_i' \tag{6}$$

where $N$ is the number of all nodes in the graph.
We employ the GAT with multiple attention heads, where each head can learn a different neighborhood weight distribution, capturing a more diverse range of information and enhancing the model's expressive power. While the multi-head mechanism improves robustness by reducing dependency on any single attention head, the number of heads must be balanced: too many heads increase model complexity and the risk of overfitting. Therefore, careful consideration is given to selecting the number of attention heads to achieve better generalization across various datasets [36, 37].
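The attention mechanics of Eqs (3) to (5) reduce to three steps: score, normalize, mix. The sketch below uses scalar node features and a toy additive score in place of the learned parameters $\mathbf{a}$, $\mathbf{W}$, and $\mathbf{W}_e$, so it illustrates only the mechanism, not the trained GAT:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def softmax(scores):
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gat_update(h_i, neighbor_feats):
    # Eq. (3): score each neighbor j of node i (toy scoring: h_i + h_j)
    raw = [leaky_relu(h_i + h_j) for h_j in neighbor_feats]
    # Eq. (4): attention scores alpha_ij sum to 1 over the neighborhood
    alpha = softmax(raw)
    # Eq. (5): attention-weighted sum of neighbor features
    return sum(a * h_j for a, h_j in zip(alpha, neighbor_feats)), alpha

h_new, alpha = gat_update(1.0, [0.5, 2.0, -1.0])
```

A multi-head variant would run several such updates with independent parameters and average (or concatenate) the results.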
Graph isomorphism network
GIN [34], inspired by the WL isomorphism test, is effective in distinguishing graphs with unique structures by capturing local node information through message-passing mechanisms. These networks demonstrate a strong ability to differentiate non-isomorphic graphs, similar to the capabilities of the WL test, making them particularly useful for identifying subtle structural differences. While GIN may encounter challenges in certain cases, especially where the WL test itself is limited, it remains a powerful tool for many graph-based tasks. A key strength of GIN is its design principle, which ensures that nodes from distinct neighborhoods are not conflated during aggregation. The use of an injective aggregation method further enhances its ability to accurately represent structural differences, ensuring that nodes with different structures are distinctly represented [38].
Based on this concept, GIN incorporates edge features into the formula for updating node features as follows:

$$h_i^{(k)} = \mathrm{MLP}^{(k)}\left(\left(1+\epsilon\right)h_i^{(k-1)} + \sum_{j \in \mathcal{N}(i)}\left(h_j^{(k-1)} + e_{ij}\right)\right) \tag{7}$$

where $h_i$ and $h_j$ denote the embedding representations of nodes $i$ and $j$, respectively, and $e_{ij}$ represents the embedding of the edge connecting nodes $i$ and $j$. Notably, both the MLP weights and $\epsilon$ are trainable, with $\epsilon$ a learnable scalar. To ensure uniformity of output for ease of comparison, the same aggregation approach as in GAT is employed. After updating all nodes, the model computes the average of all node representations as the output for the entire molecular graph.
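Equation (7) can be illustrated with scalar embeddings, where a single linear map with ReLU stands in for the learned MLP and $\epsilon$ is fixed rather than trained; the values below are placeholders, not learned parameters:

```python
# Toy sketch of the GIN update in Eq. (7): (1 + eps) * h_i plus the sum of
# neighbor-plus-edge embeddings, passed through a stand-in "MLP".

def gin_update(h_i, neighbors_with_edges, eps=0.1, w=0.5):
    # neighbors_with_edges: list of (h_j, e_ij) scalar pairs
    agg = sum(h_j + e_ij for h_j, e_ij in neighbors_with_edges)
    pre = (1 + eps) * h_i + agg
    return max(0.0, w * pre)   # stand-in MLP: one linear layer + ReLU

# node with feature 1.0 and two neighbors, each with an edge embedding
h = gin_update(1.0, [(0.5, 0.1), (2.0, 0.3)])
```

The injectivity argument behind GIN relies on the sum aggregator: unlike mean or max, the sum distinguishes neighborhoods that differ only in multiplicity.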
Initial molecule featurization
Before feeding compounds into the graph network model, we initialize the node and edge features of the molecular graph using molecular properties. The node features are constituted by the atomic properties within the molecule, while the edge features are determined by the bond properties between atoms. We adopt a comprehensive set of features to initialize the molecular graph, encompassing not only the basic chemical properties of atoms and bonds but also more complex structural information, such as the atom's degree, implicit valence, and the stereochemistry of bonds. These features are detailed in Table 1. By configuring the node and edge features in this way, we aim to enrich the graph network with more information, enhancing the model's capability to perceive the structural details of compound molecules.
Table 1.
Molecular atom & bond feature initialization
| Feature type | Attribute | Size | Description |
|---|---|---|---|
| Atom features (78) | Atom type | 44 | Atomic number of the element |
| | Degree | 11 | No. of bonds in which the atom is involved |
| | Implicit valence | 11 | No. of electrical charges |
| | Number of H | 11 | No. of bonded hydrogen atoms |
| | Aromaticity | 1 | Whether the atom is a component of an aromatic system |
| Bond features (11) | Bond type | 4 | Single, double, triple, or aromatic |
| | Wedge bond | 2 | Assists in representing the relative 3D configuration between atoms |
| | Stereo | 5 | Stereochemistry of bonds (none, E/Z, or cis/trans) |
Molecular fingerprints
MFPs can be broadly classified into three categories: substructure key-based fingerprints, topological or path-based fingerprints, and circular fingerprints [39]. In this model, referencing the classification of MFPs, three complementary types of fingerprints are selected and combined [40]: the Molecular ACCess System (MACCS) fingerprint (https://rdrr.io/bioc/Rcpi/man/extractDrugMACCS.html), the PubChem fingerprint [41], and the Pharmacophore Extended Reduced Graph (Pharmacophore ErG) fingerprint [42]. To demonstrate the impact of MFPs on the model, these are compared with the basic ECFP [43] in our proposed DGCL method.
ECFP: This fingerprint captures the topological environment of atoms within a molecule, encoding this information into a binary vector with a specified radius. It focuses on the molecule’s localized chemical environments and topological information.
MACCS fingerprint: Based on 166 predefined chemical substructures (following SMARTS rules), it generates a binary vector by identifying the frequencies of these substructures’ occurrences.
PubChem fingerprint: Originating from the substructures and chemical properties of compounds known in the PubChem database, it matches molecular structures with predefined substructure patterns, producing an 881-bit binary vector.
Pharmacophore ErG fingerprint: It is generated based on the pharmacophores and their topological relationships within a molecule. By identifying pharmacophoric elements and arranging them according to their relative positions into a topological pharmacophore graph, it is encoded into a 441-bit binary vector.
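In practice these fingerprints are computed from molecules with cheminformatics toolkits (RDKit, for instance, provides MACCS keys and Morgan/ECFP bit vectors). The sketch below only illustrates how three bit vectors combine into one mixed fingerprint, using random placeholder bits in place of real fingerprints and assuming the 167-bit MACCS variant so the lengths total 1489:

```python
import random

# Placeholder bit vectors standing in for real fingerprints, which would
# be computed from a molecule with a toolkit such as RDKit.
random.seed(0)

maccs   = [random.randint(0, 1) for _ in range(167)]  # MACCS keys (assumed 167-bit variant)
pubchem = [random.randint(0, 1) for _ in range(881)]  # PubChem fingerprint
erg     = [random.randint(0, 1) for _ in range(441)]  # Pharmacophore ErG fingerprint

mfp = maccs + pubchem + erg   # concatenated mixed fingerprint
```

The concatenation preserves each fingerprint's bit semantics; the downstream model is left to learn which regions of the combined vector matter.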
DGCL-based pretraining
The specific framework of the DGCL pretraining stage is shown in Fig. 1A.
During the pretraining stage, DGCL passes the graph of the same molecule, obtained from SMILES [44], through both a GIN and a GAT, converting the molecular representation into latent vectors $h_i$ and $h_j$. These are then mapped to $z_i$ and $z_j$ through a nonlinear projection head, after which the similarity between the two projections, $\mathrm{sim}(z_i, z_j)$, is calculated. Following the design of the loss function in SimCLR [17], the NT-Xent loss is used as the contrastive loss function:

$$\ell_{i,j} = -\log\frac{\exp\left(\mathrm{sim}\left(z_i, z_j\right)/\tau\right)}{\sum_{k=1}^{2N}\mathbb{1}_{[k \neq i]}\exp\left(\mathrm{sim}\left(z_i, z_k\right)/\tau\right)} \tag{8}$$

where $\mathbb{1}_{[k \neq i]}$ is an indicator function that equals 1 if $k \neq i$ and 0 if $k = i$, $\tau$ denotes the temperature parameter, and $N$ represents the batch size. In this model, the cosine similarity is utilized to assess the similarity between two representations of the same molecule, expressed as $\mathrm{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\|z_i\|\,\|z_j\|}$.
DGCL introduces innovations in constructing positive and negative sample pairs. Firstly, molecular representations obtained from the same molecular graph via different graph networks are considered positive sample pairs. This effectively utilizes the intrinsic molecular structure information, capturing the molecule's multidimensional features through different graph networks and thereby enhancing the model's representation capability. Secondly, during training, the framework treats the encoded representations of the remaining samples in the batch as negative samples, significantly increasing the number of negative samples and enriching their diversity. This strengthens the model's ability to distinguish between positive and negative samples, encouraging it to focus on the key information in molecular representations and avoiding the risk of overfitting due to a small number of negatives.
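The NT-Xent objective of Eq. (8) can be sketched directly from its definition, with cosine similarity and an explicit positive-partner index; this is a didactic reimplementation, not DGCL's training code:

```python
import math

def cosine(u, v):
    # cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent(embeddings, pos_of, tau=0.5):
    # pos_of[i] gives the index of i's positive partner; all other
    # embeddings in the batch act as negatives for i.
    total = 0.0
    n = len(embeddings)
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[k]) / tau for k in range(n)]
        denom = sum(math.exp(s) for k, s in enumerate(sims) if k != i)
        total += -math.log(math.exp(sims[pos_of[i]]) / denom)
    return total / n

# Two molecules, two encoder views each: (0,1) and (2,3) are positive pairs
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss = nt_xent(z, pos_of=[1, 0, 3, 2])
```

With the correct pairing the loss is lower than with a mismatched pairing, which is exactly the signal that pulls the two encoders' views of the same molecule together.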
Downstream task training
The specific framework of the downstream task stage of DGCL is shown in Fig. 1B.
Because graph networks mostly extract only the topological structure of molecules, we enhance the molecular representations output by the graph networks with MFPs in downstream tasks. MFPs can, to a certain extent, represent the properties of functional groups. By selecting MFPs that contain more comprehensive information and combining them with the molecular embeddings generated by the well-trained encoders, we can further improve the capability of the molecular representation. This approach leverages the intrinsic chemical information of molecules, providing a richer and more informative representation for downstream analytical tasks.
In DGCL, three complementary fingerprints (MACCS, PubChem, and Pharmacophore ErG) are concatenated to form what is called the mixed fingerprint (MFP):

$$\mathrm{MFP} = \left[\mathrm{MACCS} \,\|\, \mathrm{PubChem} \,\|\, \mathrm{ErG}\right] \tag{9}$$

and the dimensionality of the concatenated mixed fingerprint is 1489.

During the downstream task stage, this mixed fingerprint is transformed through a fully connected network to match the dimensionality of the output from the graph network. It is then combined with the molecular representation output from the graph network to jointly make predictions for downstream tasks. Thus, the input to the prediction head is composed as

$$h = \left[h_{\mathrm{GAT}} \,\|\, h_{\mathrm{GIN}} \,\|\, \mathrm{FC}\left(\mathrm{MFP}\right)\right] \tag{10}$$

which means that the final input representation for downstream tasks includes embeddings from the GAT and GIN models, along with the transformed mixed fingerprint representation, offering a comprehensive view of the molecule for enhanced predictive performance.
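Equation (10) amounts to one linear projection followed by concatenation. The sketch below uses a random, untrained fully connected layer and an assumed embedding size of 8; both are placeholders for DGCL's actual dimensions and learned weights:

```python
import random

random.seed(0)
EMB = 8   # assumed graph-embedding size for the sketch

def fc(x, weights):
    # one dense layer: weights holds EMB rows of len(x) columns
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# placeholder embeddings and fingerprint bits
h_gat = [random.random() for _ in range(EMB)]
h_gin = [random.random() for _ in range(EMB)]
mfp   = [random.randint(0, 1) for _ in range(1489)]
W     = [[random.uniform(-0.01, 0.01) for _ in range(1489)] for _ in range(EMB)]

# Eq. (10): project the mixed fingerprint, then concatenate all three parts
h = h_gat + h_gin + fc(mfp, W)   # input to the prediction head
```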
In the experimental stage of downstream tasks, we freeze the model weights obtained during pretraining. While this saves time and computational resources, it may also restrict the model's ability to further refine molecular representations. To counteract the potential performance loss caused by freezing the weights, we meticulously design the classification and regression heads of DGCL, aiming to exploit the pretrained embeddings to their fullest without further training the underlying graph network layers. This design allows the model to adapt to new tasks and datasets while remaining efficient and effective despite the limitations imposed by weight freezing.
In this part, we further refine the DGCL method by employing attention modulation to aggregate the three types of feature representations instead of direct concatenation, as illustrated in the framework shown in Fig. 1C. Considering computational efficiency and recognition, we select ECFP as the basis for attention modulation to determine the weights of the three feature representations. The final result can be expressed as

$$h = w_1 h_{\mathrm{GAT}} + w_2 h_{\mathrm{GIN}} + w_3 h_{\mathrm{MFP}} \tag{11}$$

where $h_{\mathrm{GAT}}$ and $h_{\mathrm{GIN}}$ represent the molecular feature representations output by the graph encoders GAT and GIN, respectively, and $h_{\mathrm{MFP}}$ denotes the MFPs processed through a fully connected layer, with the ECFP-derived attention weights satisfying $w_1 + w_2 + w_3 = 1$.
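One way to realize Eq. (11) is sketched below, under the assumption that the weights come from a softmax over ECFP-conditioned scores; the scoring vectors are random and untrained here, and the actual attention-modulation design may differ:

```python
import math
import random

random.seed(1)
EMB = 4   # assumed common embedding size

# placeholder feature branches and a toy 64-bit ECFP
h_gat = [0.2] * EMB
h_gin = [0.5] * EMB
h_mfp = [0.9] * EMB
ecfp  = [random.randint(0, 1) for _ in range(64)]

# one ECFP-conditioned logit per branch (untrained random scorers)
logits = []
for _ in range(3):
    v = [random.uniform(-0.1, 0.1) for _ in range(len(ecfp))]
    logits.append(sum(a * b for a, b in zip(v, ecfp)))

# softmax turns the logits into weights w1, w2, w3 that sum to 1
exps = [math.exp(l) for l in logits]
w = [e / sum(exps) for e in exps]

# Eq. (11): convex combination of the three branches
h = [w[0] * a + w[1] * b + w[2] * c for a, b, c in zip(h_gat, h_gin, h_mfp)]
```

Because the weights form a convex combination, every entry of `h` stays between the smallest and largest corresponding branch values.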
Experiments
Experiment settings
Datasets
Pretraining datasets The dataset for pretraining is downloaded from ZINC15 [45], including the SMILES descriptors of 306 347 small molecules with biological activity. The open-source tool RDKit (https://www.rdkit.org/docs/index.html) is employed to transform each SMILES descriptor into a molecular graph, in which nodes represent atoms and edges denote chemical bonds. Features for both nodes and edges are derived from molecular properties.
Downstream datasets The downstream tasks selected from the MoleculeNet [46] dataset encompass 10 tasks, including six classification tasks and four regression tasks. The classification tasks consist of BBBP [47], SIDER [48], ClinTox [49], Tox21 (https://tripod.nih.gov/tox21/challenge/), BACE [50], and HIV (https://www.hiv.lanl.gov/), with ROC-AUC serving as the evaluation metric. The regression tasks ESOL [51], FreeSolv [52], and Lipo [53] use RMSE as the evaluation metric, and QM7 [1] uses MAE. The selected tasks represent the most commonly used and highly recognized datasets in molecular property prediction, covering various molecular characteristics such as membrane permeability, toxicity, biological activity, and solubility. In addition, we also select a small Escherichia coli dataset, E. coli [54], to verify the potential for drug discovery; it is a binary classification task with ROC-AUC as the evaluation metric.
Baselines
To comprehensively assess the performance of the proposed DGCL model, it is compared with 11 other competitive methods currently in the field, including 3 supervised learning methods and 8 SSL methods. A brief introduction to these methods is as follows:
D-MPNN [3] and AttentiveFP [4] are supervised GNN methods based on message passing and attention mechanism, respectively.
TransFoxMol [9]: A supervised approach, a transformer-based model is introduced to fully exploit chemical knowledge.
N-Gram [10]: Utilizes node embedding techniques to combine the embeddings of adjacent nodes within a molecular graph over short time steps, constructing a compact representation of the graph.
PretrainGNN [11]: Conducts pretraining at both individual node and entire graph levels, aiming to learn useful local and global representations simultaneously.
GraphCL [30]: Generates diverse augmented graphs by applying various graph augmentation techniques, including node dropping, edge perturbation, attribute masking, and subgraph extraction.
GROVER [12]: This model integrates GNNs with the Transformer architecture, learning fine-grained features of molecules through the implementation of context prediction tasks and functional group prediction tasks.
MolCLR [31]: Utilizes a GNN within a graph CL framework to process augmented graphs generated by randomly removing nodes, edges, or subgraphs.
GraphMVP [29]: Combines CL with generative tasks, treating 2D and 3D graphs of the same molecule as augmented views in the CL process.
GEM [15]: Designs a multi-task learning framework that incorporates 3D molecular information by fusing key angle information through a dual graph structure.
Uni-Mol [16]: Employs a Transformer architecture specifically designed for processing 3D spatial information, with the 3D conformation of a molecule as both input and output.
Implementation details
During the pretraining stage, CL is applied to large-scale unlabeled data to obtain molecular representations and trained encoders that can be used for downstream tasks. The GIN encoder has three layers, each combining node features and edge features for node updates, with mean pooling used for aggregation in the last layer. The GAT encoder comprises two layers, each with ten attention heads, also with mean pooling used for feature aggregation. The loss function is optimized with the Adam optimizer using mini-batch gradient descent, with a learning rate of 1e-4, a batch size of 128, and 200 pretraining epochs.
In the transfer learning process for molecular property prediction, the encoder's weights are frozen so that it acts as a feature extractor, with two fully connected layers added afterward. During the fine-tuning stage, only these two fully connected layers are optimized. For classification tasks, a weighted binary cross-entropy loss is used; for regression tasks except QM7, a mean squared error loss is used; for QM7, due to its large size, a mean absolute error (MAE) loss is used. All tasks use the Adam optimizer with a batch size of 256. The evaluation metric is ROC-AUC for classification tasks, RMSE for regression tasks except QM7, and MAE for QM7.
Furthermore, an early stopping strategy is employed during the downstream task stage to optimize the model training process, aiming to prevent overfitting while improving training efficiency. Specifically, training is prematurely terminated if there is no improvement in performance on the validation set over 100 consecutive training epochs. This ensures that the model ceases training upon achieving optimal performance, allowing the model parameters at this state to be used for testing set evaluation.
Each dataset used for predicting molecular properties is divided into training, validation, and test datasets in an 8:1:1 ratio based on the molecular scaffold [55].
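Scaffold splitting keeps all molecules sharing the same (Bemis-Murcko) scaffold in one fold, so the test set contains structurally novel molecules. A minimal sketch follows, assuming the scaffold strings are precomputed; real pipelines derive them with a toolkit such as RDKit, and this greedy largest-first assignment is one common convention, not necessarily the exact procedure of [55]:

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    # group molecule indices by scaffold string
    groups = defaultdict(list)
    for idx, s in enumerate(scaffolds):
        groups[s].append(idx)
    # assign the biggest scaffold groups first, so train absorbs them
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for g in ordered:
        if len(train) + len(g) <= frac_train * n:
            train += g
        elif len(valid) + len(g) <= frac_valid * n:
            valid += g
        else:
            test += g
    return train, valid, test

# toy example: 10 molecules over 4 hypothetical scaffolds
labels = ["benzene"] * 6 + ["pyridine"] * 2 + ["furan", "pyrrole"]
tr, va, te = scaffold_split(labels)
```

Because whole scaffold groups move together, the resulting 8:1:1 split is approximate for real datasets, and no scaffold ever appears in two folds.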
Performance evaluation
Overall comparison
In this section, we evaluate the DGCL method and compare it with 11 of the most competitive current methods. These comparisons cover seven classification tasks and four regression tasks. For tasks on the MoleculeNet dataset, most baseline results come from GEM's paper, except for the recent TransFoxMol and Uni-Mol results; the Uni-Mol results come from its own paper. Since TransFoxMol uses random scaffold splitting, we rerun it with the same scaffold splitting as the other baselines for a fair comparison, using TransFoxMol's automatic search procedure to obtain the best hyperparameters for each dataset. For the E. coli dataset, we pick only four baselines for comparison due to time constraints. Detailed experimental results are shown in Tables 2 and 3. In terms of classification tasks, DGCL achieves SOTA results on five datasets, with Tox21 the only exception, reflecting its excellent ability to handle complex molecular classification problems. Specifically, DGCL reaches 91.48% on the BACE dataset, 97.12% on the ClinTox dataset, and 78.12% on the SIDER dataset. In addition, DGCL achieves an overall relative improvement of 5.3% in average ROC-AUC compared with the previous SOTA result.
Table 2.
DGCL performance on molecular property prediction classification tasks (ROC-AUC %)

| Datasets (#Molecules, #Tasks) | BACE (1513, 1) | BBBP (2039, 1) | ClinTox (1478, 2) | HIV (41127, 1) | SIDER (1427, 27) | Tox21 (7831, 12) | Avg. | E. coli (2335, 1) |
|---|---|---|---|---|---|---|---|---|
| D-MPNN | 80.9(0.6) | 71.0(0.3) | 90.6(0.6) | 77.1(0.5) | 57.0(0.7) | 75.9(0.7) | 75.42 | - |
| Attentive FP | 78.4(0.022) | 64.3(1.8) | 84.7(0.3) | 75.7(1.4) | 60.6(3.2) | 76.1(0.5) | 73.3 | - |
| TransFoxMol | 85.3(1.1) | 70.5(0.7) | 87.4(3.2) | - | 62.7(1.0) | 73.1(0.7) | - | 93.9(4.5)^b |
| N-Gram | 79.1(1.3) | 69.1(0.8) | 87.5(2.7) | 77.2(0.1) | 66.8(0.7)^c | 74.3(0.4) | 75.67 | - |
| PretrainGNN | 84.5(0.7) | 68.7(1.3) | 72.6(1.5) | 79.9(0.7) | 62.7(0.8) | 78.1(0.6)^b | 74.42 | - |
| GraphCL | 75.4(1.4) | 70.0(1.7) | 76.0(2.7) | 78.5(1.2) | 60.5(0.9) | 73.9(0.7) | 72.38 | - |
| GROVER | 82.6(0.7) | 70.0(0.1) | 81.2(3.0) | 62.5(0.9) | 64.8(0.6) | 74.3(0.1) | 72.57 | 85.4(3.7) |
| MolCLR | 82.4(0.9) | 72.2(2.1) | 91.2(3.5)^c | 78.1(0.5) | 58.9(1.4) | 75.0(0.2) | 76.30 | - |
| GraphMVP | 81.2(0.9) | 72.4(1.6)^c | 77.5(2.8) | 77.0(1.2) | 63.9(1.2) | 75.9(0.5) | 74.65 | - |
| GEM | 85.6(1.1)^c | 72.4(0.4)^c | 90.1(1.3) | 80.6(0.9)^c | 67.2(0.4)^b | 78.1(0.1)^b | 79.00^c | 90.9(2.1)^c |
| Uni-Mol | 85.7(0.2)^b | 72.9(0.6)^b | 91.9(1.8)^b | 80.8(0.3)^b | 65.9(1.3) | 79.6(0.5)^a | 79.47^b | 90.5(6.9) |
| DGCL | 91.48(1.68)^a | 73.78(0.55)^a | 97.12(2.87)^a | 81.49(1.10)^a | 78.12(2.23)^a | 77.16(0.31)^c | 83.19^a | 94.9(3.0)^a |

Note: Higher ROC-AUC values indicate better performance. The top three performers are marked with ^a, ^b, and ^c for the best, second-best, and third-best performance, respectively. Standard deviations are in brackets. We were unable to run TransFoxMol on HIV because the original paper did not process this dataset.
Table 3.
DGCL performance on molecular property prediction regression tasks

| Datasets (#Molecules, #Tasks) | ESOL (1128, 1) | FreeSolv (642, 1) | Lipo (4200, 1) | Avg. | QM7 (6830, 1) |
|---|---|---|---|---|---|
| D-MPNN | 1.050(0.008) | 2.082(0.082) | 0.683(0.016) | 1.272 | 103.5(8.6) |
| Attentive FP | 0.877(0.029)^c | 2.073(0.183) | 0.721(0.001) | 1.224 | 72.0(2.7) |
| TransFoxMol | 0.992(0.027) | 1.867(0.114)^b | 0.716(0.022) | 1.192^c | 65.7(2.1)^c |
| N-Gram | 1.074(0.107) | 2.688(0.085) | 0.812(0.028) | 1.525 | 81.9(1.9) |
| PretrainGNN | 1.100(0.006) | 2.764(0.002) | 0.739(0.003) | 1.534 | 113.2(0.6) |
| GROVER | 0.983(0.090) | 2.176(0.052) | 0.817(0.008) | 1.325 | 92.0(0.9) |
| MolCLR | 1.271(0.040) | 2.594(0.249) | 0.691(0.004) | 1.519 | 66.8(2.3) |
| GraphMVP | 1.029(0.033) | – | 0.681(0.010) | – | – |
| GEM | 0.798(0.029)^b | 1.877(0.094)^c | 0.660(0.008)^c | 1.112^b | 58.9(0.8)^b |
| Uni-Mol | 0.788(0.029)^a | 1.620(0.035)^a | 0.603(0.010)^b | 1.004^a | 41.8(0.2)^a |
| DGCL | 1.046(0.050) | 2.080(0.027) | 0.477(0.031)^a | 1.207 | 100.9(2.5) |

Note: Lower RMSE (MAE for QM7) values indicate better performance. The top three performers are marked with ^a, ^b, and ^c for the best, second-best, and third-best performance, respectively. Standard deviations are in brackets.
For regression tasks, DGCL performs worse than Uni-Mol overall but achieves the best performance among all models on the Lipo dataset at 0.477, outperforming the suboptimal model by 0.126. This result suggests that DGCL has significant potential for certain types of regression tasks.
In addition, the average performance of DGCL and the 11 other baseline models on classification and regression tasks is also calculated. For classification tasks, DGCL shows the best average performance, improving by 3.725% over the suboptimal model. For regression tasks using RMSE as the metric, DGCL ranks fourth among all baseline models, trailing the optimal model by only 0.220 and the suboptimal model by 0.210.
DGCL achieves significant improvements on the classification datasets but struggles on the regression tasks. We conjecture that this is because the regression datasets are small and regression is inherently harder than binary classification: each sample corresponds to a specific value and demands a more precise output. The relatively simple design of our regression head may also limit performance on small regression datasets. For the QM7 dataset, the poor performance may stem from the lack of 3D information: QM7 is a quantum mechanics dataset whose targets depend strongly on the 3D conformation of molecules, and such quantum chemical properties are highly correlated with molecular geometry.
Ablation studies
In this section, we conduct a series of experiments on six classification datasets and three regression datasets from MoleculeNet to evaluate the impact and contribution of each component of the DGCL method on overall performance. To ensure reliable and fair results, every model is evaluated over three independent runs with three different random seeds (0, 1, 2). In the following, we discuss in turn the influence of network combination and MFPs, feature dimensionality, and pretraining strategies on the performance of the DGCL method.
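The three-seed protocol above can be sketched as follows; `run_fn` is a hypothetical stand-in for one full training-and-evaluation run, and the mean(std) formatting mirrors the convention used in the tables:

```python
import statistics

def evaluate_with_seeds(run_fn, seeds=(0, 1, 2)):
    """Run an experiment once per seed and report the mean and sample std.

    `run_fn` is a hypothetical callable mapping a random seed to a scalar
    metric (ROC-AUC or RMSE), matching the three-seed protocol described above.
    """
    scores = [run_fn(seed) for seed in seeds]
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Toy stand-in for a training/evaluation run:
mean, std = evaluate_with_seeds(lambda seed: 0.90 + 0.01 * seed)
print(f"{mean:.3f}({std:.3f})")  # e.g. 0.910(0.010)
```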
The influence of network integration and MFPs
In this section, the effects of network combination and mixed fingerprints on the performance of the DGCL model are discussed. The complete experimental results are shown in Table 4. In the table, “GAT” denotes the pretrained GAT model, and likewise “GIN”; “DGN” denotes the concatenation of the pretrained GIN and GAT models; “GAT-MFP” denotes the model combining the pretrained GAT model with mixed MFPs, and similarly “GIN-MFP”; “GAT-ECFP” denotes the model combining the pretrained GAT model with the ECFP fingerprint, and similarly “GIN-ECFP” and “DGN-ECFP.”
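As a rough illustration of the variants listed above, the fusion step reduces to vector concatenation before the prediction head. This is a minimal sketch with numpy vectors standing in for the learned per-molecule representations; the 512-dimensional sizes are assumptions for illustration:

```python
import numpy as np

def fuse_features(gin_emb, gat_emb, mfp_vec=None):
    """Concatenate dual-graph-network embeddings, optionally with a mixed
    fingerprint embedding, before feeding the prediction head.
    """
    parts = [gin_emb, gat_emb]
    if mfp_vec is not None:
        parts.append(mfp_vec)
    return np.concatenate(parts, axis=-1)

gin = np.zeros(512)      # pretrained-GIN representation (assumed 512-d)
gat = np.ones(512)       # pretrained-GAT representation
mfp = np.full(512, 0.5)  # mixed-fingerprint embedding after a linear layer

print(fuse_features(gin, gat).shape)       # DGN variant  -> (1024,)
print(fuse_features(gin, gat, mfp).shape)  # DGCL variant -> (1536,)
```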
Table 4.
Ablation study results on the effects of network integration and MFPs
Columns BACE–Tox21: classification tasks (AUC%); columns ESOL–Lipo: regression tasks (RMSE).
| Datasets | BACE | BBBP | ClinTox | HIV | SIDER | Tox21 | ESOL | FreeSolv | Lipo |
|---|---|---|---|---|---|---|---|---|---|
| GAT | 83.60(2.77) | 65.48(0.43) | 91.07(4.74) | 69.26(1.47) | 68.91(1.90) | 72.70(1.14) | 1.483(0.065) | 2.894(0.100) | 0.721(0.044) |
| GIN | 73.38(4.52) | 61.25(1.23) | 76.74(5.56) | 60.33(1.66) | 56.05(0.91) | 60.00(1.11) | 1.603(0.080) | 2.988(0.064) | 0.801(0.140) |
| DGN | 79.88(2.35) | 65.01(1.39) | 91.57(6.73) | 75.63(0.86) | 64.24(0.85) | 75.96(2.25) | 1.360(0.021) | 2.815(0.060) | 0.620(0.033) |
| GAT-ECFP | 90.99(1.14) | 67.71(0.37) | 96.55(0.71) | 75.14(3.12) | 74.65(4.06) | 74.04(2.22) | 1.534(0.117) | 3.061(0.071) | 0.595(0.040) |
| GIN-ECFP | 82.84(3.38) | 64.36(0.53) | 84.01(2.60) | 54.92(2.49) | 62.95(1.19) | 57.07(1.22) | 1.443(0.064) | 3.346(0.381) | 0.564(0.056) |
| DGN-ECFP | 89.56(1.11) | 64.17(0.18) | 91.22(4.32) | 73.19(0.39) | 70.35(1.16) | 74.14(1.17) | 1.366(0.049) | 2.853(0.115) | 0.577(0.036) |
| GAT-MFP | 89.84(1.13) | 71.70(0.09) | 92.15(1.64) | 76.47(1.55) | 74.58(1.31) | 75.17(1.08) | 1.083(0.091) | 2.066(0.213) | 0.514(0.047) |
| GIN-MFP | 82.03(0.69) | 71.04(0.35) | 73.84(2.87) | 62.68(0.85) | 62.75(0.85) | 68.25(0.28) | 1.100(0.078) | **1.909(0.252)** | 0.505(0.063) |
| DGCL | **91.48(1.68)** | **73.78(0.55)** | **97.12(2.87)** | **81.49(1.10)** | **78.12(2.23)** | **77.16(0.31)** | **1.008(0.018)** | 2.080(0.027) | **0.477(0.031)** |
Note: Higher ROC-AUC value and lower RMSE value indicate better performance. The SOTA results are shown in bold. The standard deviations are in brackets.
At the same time, we also calculated the average performance of each model on these tasks, as shown in Fig. 2.
Figure 2.

Average performance analysis: the effects of network integration and MFPs; (a) classification tasks and (b) regression tasks.
First, the impact of network combination on prediction performance is analyzed. Comparing GAT-MFP, GIN-MFP, and DGCL, DGCL performs better than GAT-MFP and GIN-MFP on all six classification tasks and on the regression tasks except FreeSolv. In terms of average performance, DGCL is 13.09% higher than GIN-MFP and 3.21% higher than GAT-MFP on classification tasks; on regression tasks, DGCL performs worse than GIN-MFP but better than GAT-MFP, with an average RMSE 0.02 lower.
The impact of MFPs on prediction performance is then analyzed. Taking GAT as an example and comparing it with GAT-ECFP and GAT-MFP, its performance on all six classification tasks is lower than that of the fingerprint-augmented models. In terms of average performance, GAT-ECFP is 4.68% higher than GAT on classification tasks, and GAT-MFP is 4.82% higher than GAT. On regression tasks, GAT-MFP is better than GAT by 0.478.
Finally, the impact of mixed fingerprints on prediction performance is analyzed. The combination of a single GNN and MFPs may perform differently across datasets, and not every model using mixed fingerprints outperforms its ECFP-only counterpart. Comparing DGN-ECFP and DGCL, DGCL performs better on all classification and regression tasks: its average performance on classification tasks is 6.09% higher, and its average RMSE on regression tasks is better by 0.410.
The influence of feature dimensionality
Due to the high complexity of compound structural information, the dimensionality of molecular representations directly affects the model’s performance on molecular property prediction tasks. On one hand, a lower feature dimension may limit the model’s ability to represent molecular features, thereby affecting its predictive performance; on the other hand, a higher feature dimension can enhance the model’s representational capacity but may also increase the computational burden, reducing computational efficiency.
Therefore, in this part, we demonstrate that our chosen feature dimension achieves superior performance by comparing results across different dimensionalities of the molecular representations fed to the prediction heads. Detailed results are shown in Tables 5 and 6, where “DGCL(128)” denotes setting to 128 dimensions the molecular representations output by the pretrained graph networks, the representation transformed from the mixed MFPs by the fully connected layer, and the aggregated embedding of graph networks and mixed MFPs; likewise for “DGCL(256)” and “DGCL(512).” Additionally, the average performance across the three feature dimensions for both classification and regression tasks is reported in the last column of the respective tables.
Table 5.
The influence of feature dimensionality on classification tasks (AUC%)
| Datasets | BACE | BBBP | ClinTox | HIV | SIDER | Tox21 | Average Performance |
|---|---|---|---|---|---|---|---|
| DGCL(128) | 87.85(3.21) | 69.53(0.20) | 93.14(0.84) | 76.95(0.76) | 70.81(1.19) | 74.71(1.52) | 78.83 |
| DGCL(256) | 90.01(1.69) | 73.52(0.95) | 95.08(2.07) | 78.19(0.85) | 74.55(1.25) | 75.64(0.16) | 81.17 |
| DGCL(512) | 91.48(1.68) | 73.78(0.55) | 97.12(2.87) | 81.49(1.10) | 78.12(2.23) | 77.16(0.31) | 83.19 |
Table 6.
The influence of feature dimensionality on regression tasks (RMSE)
| Datasets | ESOL | FreeSolv | Lipo | Average Performance |
|---|---|---|---|---|
| DGCL(128) | 1.200(0.069) | 2.101(0.166) | 0.542(0.066) | 1.281 |
| DGCL(256) | 1.124(0.029) | 2.287(0.098) | 0.501(0.039) | 1.304 |
| DGCL(512) | 1.008(0.018) | 2.080(0.027) | 0.477(0.031) | 1.188 |
Analysis reveals that the dimensionality of the molecular representations affects the model’s performance to a certain extent: as the feature dimension increases, performance improves. DGCL(512) surpasses DGCL(128) and DGCL(256) on both classification and regression tasks, and this trend is also evident in the average performance.
The choice to cap the dimensionality at 512 primarily considers computational efficiency. While further increasing the dimensionality of molecular representations could theoretically improve the model’s representational capacity and predictive performance, it would also significantly demand more computational resources, thereby affecting the model’s practicality and efficiency.
The influence of pretraining
In this part, we compare the traditional supervised learning GAT with its pre-trained counterpart, denoted as “GAT” and “GAT(pre),” respectively, to explore the impact of pretraining on model performance. The results for classification and regression tasks are presented in their respective Tables 7 and 8, with the last column of each table depicting the average performances.
Table 7.
The influence of pretraining on classification tasks (AUC%)
| Datasets | BACE | BBBP | ClinTox | HIV | SIDER | Tox21 | Average performance |
|---|---|---|---|---|---|---|---|
| GAT | 60.78(2.01) | 61.94(1.17) | 84.90(0.74) | 72.24(0.64) | 60.15(0.30) | 69.16(1.24) | 68.19 |
| GAT(pre) | 83.60(2.77) | 65.48(0.43) | 91.07(4.74) | 69.26(1.47) | 68.91(1.90) | 72.70(1.14) | 75.17 |
Table 8.
The influence of pretraining on regression tasks (RMSE)
| Datasets | ESOL | FreeSolv | Lipo | Average performance |
|---|---|---|---|---|
| GAT | 1.375(0.027) | 3.138(0.202) | 0.933(0.027) | 1.815 |
| GAT(pre) | 1.483(0.065) | 2.894(0.100) | 0.721(0.044) | 1.699 |
In classification tasks, except for the HIV dataset, the performance of GAT(pre) surpasses that of GAT, with an average performance improvement of 6.98%. In regression tasks, the average performance of GAT(pre) is better than that of GAT by 0.116. These results demonstrate that our pretraining strategy can indeed enhance the model’s performance on molecular property prediction tasks to a certain extent.
Visualization of interpretability results
In this section, we evaluate the performance of the DGCL method through visual analysis. Specifically, we selected 16 molecules from the BBBP dataset and displayed their SMILES format, true labels, and the labels predicted by the model in Table 9. Molecules M1 through M8 have positive labels, while M9 through M16 have negative labels. Based on the mixed features obtained from model training, we calculated the cosine similarity of these feature representations and presented the results in a heatmap, as shown in Fig. 3. Analysis reveals that scores within the group of molecules with positive labels and within the group with negative labels are significantly higher than the scores between molecules with positive and negative labels. This indicates that the model can effectively distinguish between different categories of molecules in the high-dimensional feature space, bringing molecules with the same label closer together and distancing those with different labels. This conclusion is consistent with common sense and further confirms that the DGCL method is capable of extracting molecular representations with discriminatory power to some extent.
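The similarity analysis behind Fig. 3 can be reproduced in a few lines; the two clustered sets of toy features below stand in for the mixed representations of the positive and negative molecules (an illustrative sketch, not the paper's code):

```python
import numpy as np

def cosine_similarity_matrix(feats):
    """Pairwise cosine similarities between row-wise feature vectors,
    the quantity plotted as a heatmap in Fig. 3."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Toy features: two well-separated clusters mimic positive/negative molecules.
rng = np.random.default_rng(0)
pos = rng.normal(loc=+1.0, scale=0.1, size=(8, 16))
neg = rng.normal(loc=-1.0, scale=0.1, size=(8, 16))
sim = cosine_similarity_matrix(np.vstack([pos, neg]))

within = sim[:8, :8].mean()   # positive-positive block
between = sim[:8, 8:].mean()  # positive-negative block
print(within > between)       # within-class similarity dominates, as in Fig. 3
```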
Table 9.
Selected molecules from the BBBP dataset
| No. | SMILES | True | Prediction |
|---|---|---|---|
| M1 | C1CCN(CC1)Cc1cccc(c1)OCCCNC(=O)C | 1 | 0.906 |
| M2 | c1c(nccc1)CCN(C)C | 1 | 0.824 |
| M3 | Clc1ccc2N(CC3CC3)C(=O)CN=C(c4ccccc4)c2c1 | 1 | 0.827 |
| M4 | C1CN(CCC1)Cc1cccc(c1)OCCCO | 1 | 0.874 |
| M5 | C1=CC=CC3=C1N=C([N]2C(=NC(=N2)C)C3)N4CCN(C)CC4 | 1 | 0.880 |
| M6 | CN1CCN(CC1)C2=C3C=CC=CC3=Nc4ccc(Cl)cc4N2 | 1 | 0.763 |
| M7 | CN1CCN(CC1)C2=C3C=CC=CC3=Nc4ccc(Cl)cc4N2 | 1 | 0.874 |
| M8 | c1ccc(C(NCCCOc2cc(CN3CCCCC3)ccc2)=O)cc1 | 0 | 0.624 |
| M9 | NCC(O)c1ccc(O)c(O)c1 | 0 | 0.052 |
| M10 | CNC[C@H](O)c1ccc(O)c(O)c1 | 0 | 0.027 |
| M11 | CC(Cc1ccc(O)cc1)NCC(O)c2cc(O)cc(O)c2 | 0 | 0.040 |
| M12 | CC(C)NCC(O)c1ccc(O)c(O)c1 | 0 | 0.030 |
| M13 | NCC(O)c1cccc(O)c1 | 0 | 0.090 |
| M14 | CC(CCc1ccccc1)NC(C)C(O)c2ccc(O)cc2 | 0 | 0.074 |
| M15 | CN1C2CCC1CC(C2)OC(=O)C(O)c3ccccc3 | 0 | 0.107 |
| M16 | CNCC(O)c1ccc(O)c(O)c1 | 0 | 0.033 |
Figure 3.
Visualization of correlation heatmaps for molecular feature representations; molecules M1 through M8 have positive labels, while M9 through M16 have negative labels.
Attention modulation
In this part, the DGCL method with added attention modulation is compared with the initial method, with results across the datasets shown in Table 10. DGCL(attention) refers to the DGCL method enhanced with attention modulation. Analysis reveals that in classification tasks the performance of DGCL(attention) consistently exceeds that of the original model; an improvement is also observed on the Lipo regression task. In classification tasks, the average performance of DGCL(attention) is superior to the initial model by 1.18%.
Table 10.
Performance of DGCL and DGCL(attention)
Columns BACE–Tox21: classification tasks (AUC%); columns ESOL–Lipo: regression tasks (RMSE).
| Datasets | BACE | BBBP | ClinTox | HIV | SIDER | Tox21 | ESOL | FreeSolv | Lipo |
|---|---|---|---|---|---|---|---|---|---|
| DGCL | 91.48(1.68) | 73.78(0.55) | 97.12(2.87) | 81.49(1.10) | 78.12(2.23) | 77.16(0.31) | **1.008(0.018)** | **2.080(0.027)** | 0.477(0.031) |
| DGCL(attention) | **92.16(2.41)** | **76.07(0.72)** | **98.31(0.83)** | **81.60(1.98)** | **78.18(1.63)** | **79.92(2.19)** | 1.021(0.047) | 2.194(0.061) | **0.469(0.044)** |
Note: Higher ROC-AUC value and lower RMSE value indicate better performance. The SOTA results are shown in bold. The standard deviations are in brackets.
To further explore the behavior of the attention modulation method within the DGCL framework, we conducted an in-depth analysis on the BACE dataset, with results presented in Fig. 4. The analysis shows that the model assigns varied weight distributions to different types of compounds, in contrast to direct concatenation, which treats all features uniformly. This outcome suggests that assigning appropriate weights to different feature representations, using ECFP as the basis, is meaningful.
Figure 4.

Weight distribution of feature representations in the BACE dataset.
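The modulation idea can be sketched as each feature source receiving a softmax weight computed from the molecule's ECFP. The weight projection `w_query` and all shapes below are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def attention_fuse(features, ecfp, w_query):
    """Weight each feature representation by an ECFP-conditioned softmax
    score instead of plain concatenation (a sketch of the modulation idea).

    features: (k, d) array of k per-molecule representations (e.g. GIN, GAT, MFP).
    ecfp:     (m,) ECFP bit vector of the molecule.
    w_query:  (m, k) hypothetical projection from the ECFP to one score per source.
    """
    scores = ecfp @ w_query                # (k,) raw scores derived from ECFP
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over the k feature sources
    fused = (weights[:, None] * features).sum(axis=0)
    return weights, fused

rng = np.random.default_rng(1)
feats = rng.normal(size=(3, 512))                  # GIN, GAT, MFP representations
ecfp = rng.integers(0, 2, size=1024).astype(float)
w, fused = attention_fuse(feats, ecfp, rng.normal(size=(1024, 3)) * 0.01)
print(w, fused.shape)  # weights sum to 1; fused keeps the feature dimension
```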
Discussion and conclusion
This paper introduces DGCL, a molecular representation learning model based on dual-graph-network CL. The approach overcomes the limitations of traditional CL methods by changing the way molecular features are extracted, thereby avoiding the step of generating augmented graphs. During the pretraining stage, DGCL strengthens the model’s feature extraction and generalization capabilities by using two different networks to aggregate feature representations of the same molecule and encouraging these representations to be as similar as possible. In the downstream task stage, DGCL further improves molecular property prediction by integrating mixed MFPs. Compared with the eight most competitive SSL methods currently available, DGCL achieves the best average performance on classification tasks, improving by 3.73% over the second-best model; on the Lipo regression task, it performs 0.126 better than the second-best method.
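The pretraining objective summarized above, with the GIN and GAT views of the same molecule as positives and other molecules in the batch as negatives, resembles an NT-Xent loss. A minimal numpy sketch (illustrative, not DGCL's implementation; the temperature value is an assumption):

```python
import numpy as np

def ntxent_loss(z1, z2, temperature=0.1):
    """NT-Xent-style contrastive loss over a batch: row i of z1 (one encoder's
    view) is the positive of row i of z2 (the other encoder's view); all other
    rows act as negatives."""
    def normalize(z):
        return z / np.linalg.norm(z, axis=1, keepdims=True)
    a, b = normalize(z1), normalize(z2)
    logits = (a @ b.T) / temperature  # (n, n) cosine-similarity matrix
    # Cross-entropy with the diagonal (same molecule) as the target class.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
shared = rng.normal(size=(4, 8))
aligned = ntxent_loss(shared, shared + 0.01 * rng.normal(size=(4, 8)))
random_pairs = ntxent_loss(shared, rng.normal(size=(4, 8)))
print(aligned < random_pairs)  # matched views should yield a lower loss
```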
Additionally, a series of experiments demonstrates the impact of network fusion, MFPs, feature dimensionality, and pretraining on the DGCL method. For downstream tasks, DGCL freezes the encoders’ weights and fine-tunes only the classification or regression head, showing that after pretraining the encoders learn molecular representations beneficial to downstream tasks.
Compared with other SSL models, DGCL requires a relatively smaller dataset for pretraining and less time per epoch, achieving the same or better performance than other models. This indicates the potential of DGCL to extend to models with stronger representational capabilities in the future.
Key Points
DGCL is the first to propose a general molecular pre-training framework from the perspective of encoder-based contrastive learning, focusing on changing the method of molecular feature extraction rather than data augmentation.
In downstream tasks, DGCL employs selectively mixed MFPs, adding pharmacophore details to improve its performance in molecular property prediction.
Benefiting from pretraining on a large amount of unlabeled data, simple GNN models trained with DGCL demonstrated performance superior to other self-supervised learning methods across multiple datasets.
Through ablation studies, we have demonstrated the impact of the network integration strategy and mixed MFPs, confirming the significant advantages of the strategies we selected.
DGCL’s performance can be further improved by employing ECFP-based attention weighting.
Contributor Information
Xiuyu Jiang, School of Computer Science and Engineering, Sun Yat-sen University, Waihuan East Street, Guangzhou 510006, China.
Liqin Tan, School of Computer Science and Engineering, Sun Yat-sen University, Waihuan East Street, Guangzhou 510006, China.
Qingsong Zou, School of Computer Science and Engineering, Sun Yat-sen University, Waihuan East Street, Guangzhou 510006, China.
Conflict of interest: No competing interest is declared.
Funding
This work is supported in part by the National Science and Technology Major Project (2022ZD0117804), by the National Natural Science Foundation of China under grants 92370113 and 12071496, and by the Natural Science Foundation of the Guangdong Province under the grant 2023A1515012079.
Data availability
Source code and all datasets used in this study are available at https://github.com/Sysuzqs/DGCL.
Author contributions
X.J. contributed to the model construction, numerical experiments, and the writing of the first draft of the paper; L.T. contributed to the writing of the revised version of the paper and additional numerical examples in the second version; Q.Z. contributed to the design of the project and the writing of the first and second version of the paper.
References
- 1. Blum LC, Reymond J-L. 970 million druglike small molecules for virtual screening in the chemical universe database gdb-13. J Am Chem Soc 2009;131:8732–3. [DOI] [PubMed] [Google Scholar]
- 2. Ruddigkeit L, Van Deursen R, Blum LC. et al. Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. J Chem Inf Model 2012;52:2864–75. [DOI] [PubMed] [Google Scholar]
- 3. Yang K, Swanson K, Jin W. et al. Analyzing learned molecular representations for property prediction. J Chem Inf Model 2019;59:3370–88. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Xiong Z, Wang D, Liu X. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem 2019;63:8749–60. [DOI] [PubMed] [Google Scholar]
- 5. Li P, Li Y, Hsieh C-Y. et al. Trimnet: learning molecular representation from triplet messages for biomedicine. Brief Bioinform 2021;22:bbaa266. [DOI] [PubMed] [Google Scholar]
- 6. Masumshah R, Aghdam R, Eslahchi C. A neural network-based method for polypharmacy side effects prediction. BMC bioinformatics 2021;22:1–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Meng Y, Changcheng L, Jin M. et al. A weighted bilinear neural collaborative filtering approach for drug repositioning. Brief Bioinform 2022;23:bbab581. [DOI] [PubMed] [Google Scholar]
- 8. Masumshah R, Eslahchi C. Dpsp: a multimodal deep learning framework for polypharmacy side effects prediction. Bioinformatics Advances 2023;3:vbad110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Gao J, Shen Z, Xie Y. et al. Transfoxmol: predicting molecular property with focused attention. Brief Bioinform 2023;24:bbad306. [DOI] [PubMed] [Google Scholar]
- 10. Liu S, Demirel MF, Liang Y. N-gram graph: simple unsupervised representation for graphs, with applications to molecules. Adv Neural Inf Process Syst 2019;32:8466–78. [Google Scholar]
- 11. Hu W, Liu B, Gomes J. et al. Strategies for pre-training graph neural networks. In: Proceedings of the 8th International Conference on Learning Representations (ICLR). Virtual Conference, 2020.
- 12. Rong Y, Bian Y, Tingyang X. et al. Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst 2020;33:12559–71. [Google Scholar]
- 13. Zhang X-C, Cheng-Kun W, Yang Z-J. et al. Mg-bert: leveraging unsupervised atomic representation learning for molecular property prediction. Brief Bioinform 2021;22:bbab152. [DOI] [PubMed] [Google Scholar]
- 14. Devlin J, Chang M-W, Lee K. et al. Bert: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019;1:4171–86. [Google Scholar]
- 15. Fang X, Liu L, Lei J. et al. Geometry-enhanced molecular representation learning for property prediction. Nat Mach Intell 2022;4:127–34. [Google Scholar]
- 16. Zhou G, Gao Z, Ding Q. et al. Uni-Mol: a universal 3d molecular representation learning framework. In: The Eleventh International Conference on Learning Representations (ICLR). Kigali, Rwanda, 2023.
- 17. Chen T, Kornblith S, Norouzi M. et al. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning (ICML). PMLR, 2020, 1597–607. [Google Scholar]
- 18. Zhang Y, Zhu H, Wang Y. et al. A contrastive framework for learning sentence representations from pairwise and triple-wise perspective in angular space. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022;1:4892–903. [Google Scholar]
- 19. Sanchez-Fernandez A, Rumetshofer E, Hochreiter S, Klambauer G. Contrastive learning of image-and structure-based representations in drug discovery. In: ICLR2022 Machine Learning for Drug Discovery. 2022.
- 20. Tianhao Y, Cui H, Li JC. et al. Enzyme function prediction using contrastive learning. Science 2023;379:1358–63. [DOI] [PubMed] [Google Scholar]
- 21. Singh R, Sledzieski S, Bryson B. et al. Contrastive learning in protein language space predicts interactions between drugs and protein targets. Proc Natl Acad Sci 2023;120:e2220778120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Zheng L, Liu Z, Yang Y. et al. Accurate inference of gene regulatory interactions from spatial gene expression with deep contrastive learning. Bioinformatics 2022;38:746–53. [DOI] [PubMed] [Google Scholar]
- 23. Tao W, Liu Y, Lin X. et al. Prediction of multi-relational drug–gene interaction via dynamic hypergraph contrastive learning. Brief Bioinform 2023;24:bbad371. [DOI] [PubMed] [Google Scholar]
- 24. Zhao S, Zhang J, Nie Z. Large-scale cell representation learning via divide-and-conquer contrastive learning. arXiv preprint arXiv:230604371. 2023.
- 25. Fang Y, Zhang Q, Yang H. et al. Molecular contrastive learning with chemical element knowledge graph. In: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2022;36:3968–76. [Google Scholar]
- 26. Moon K, Im H-J, Kwon S. 3D graph contrastive learning for molecular property prediction. Bioinformatics 2023;39:btad371. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Zhu J, Xia Y, W Lijun. et al. Dual-view molecular pre-training. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 2023, p. 3615–27.
- 28. Li S, Zhou J, T Xu. et al. GeomGCL: geometric graph contrastive learning for molecular property prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, 2022;36:4541–9. [Google Scholar]
- 29. Liu S, Wang H, Liu W. et al. Pre-training molecular graph representation with 3d geometry. arXiv preprint arXiv:211007728. 2021.
- 30. You Y, Chen T, Sui Y. et al. Graph contrastive learning with augmentations. Adv Neural Inf Process Syst 2020;33:5812–23. [Google Scholar]
- 31. Wang Y, Wang J, Cao Z. et al. Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell 2022;4:279–87. [Google Scholar]
- 32. Gilmer J, Schoenholz SS, Riley PF. et al. Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning (ICML). Sydney, Australia, 2017, 1263–72.
- 33. Velickovic P, Cucurull G, Casanova A. et al. Graph attention networks. The Sixth International Conference on Learning Representations. Vancouver CANADA: ICLR, Vancouver Convention Center, 2018. [Google Scholar]
- 34. Xu K, Hu W, Leskovec J, Jegelka S. How powerful are graph neural networks? arXiv preprint arXiv:181000826. 2018.
- 35. Shervashidze N, Schweitzer P, Van Leeuwen EJ. et al. Weisfeiler-Lehman graph kernels. J Mach Learn Res 2011;12:2539−61. [Google Scholar]
- 36. Lee J, Lee I, Kang J. Self-attention graph pooling. In: Proceedings of the 36th International Conference on Machine Learning (ICML). Long Beach, California, USA, 2019, 3734–43.
- 37. Brody S, Alon U, Yahav E. How attentive are graph attention networks?The 10th International Conference on Learning Representations (ICLR). Virtual Conference, 2022.
- 38. Luan S. On addressing the limitations of graph neural networks. arXiv preprint arXiv:230612640. 2023.
- 39. Muegge I, Mukherjee P. An overview of molecular fingerprint similarity search in virtual screening. Expert Opin Drug Discovery 2016;11:137–48. [DOI] [PubMed] [Google Scholar]
- 40. Cai H, Zhang H, Zhao D. et al. FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction. Brief Bioinform 2022;23:bbac408. [DOI] [PubMed] [Google Scholar]
- 41. Bolton EE, Wang Y, Thiessen PA. et al. PubChem: integrated platform of small molecules and biological activities [M]. Ann rep Comput Chem 2008;4:217–41. [Google Scholar]
- 42. Stiefl N, Watson IA, Baumann K. et al. ErG: 2D pharmacophore descriptions for scaffold hopping. J Chem Inf Model 2006;46:208–20. [DOI] [PubMed] [Google Scholar]
- 43. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model 2010;50:742–54. [DOI] [PubMed] [Google Scholar]
- 44. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–6. [Google Scholar]
- 45. Sterling T, Irwin JJ. ZINC 15–ligand discovery for everyone. J Chem Inf Model 2015;55:2324–37. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Zhenqin W, Ramsundar B, Feinberg EN. et al. MoleculeNet: a benchmark for molecular machine learning. Chem Sci 2018;9:513–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Martins IF, Teixeira AL, Pinheiro L. et al. A bayesian approach to in silico blood-brain barrier penetration modeling. J Chem Inf Model 2012;52:1686–97. [DOI] [PubMed] [Google Scholar]
- 48. Kuhn M, Letunic I, Jensen LJ. et al. The SIDER database of drugs and side effects. Nucleic Acids Res 2016;44:D1075–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Gayvert KM, Madhukar NS, Elemento O. A data-driven approach to predicting successes and failures of clinical trials. Cell Chemical Biology 2016;23:1294–301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Subramanian G, Ramsundar B, Pande V. et al. Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. J Chem Inf Model 2016;56:1936–49. [DOI] [PubMed] [Google Scholar]
- 51. Delaney JS. ESOL: Estimating aqueous solubility directly from molecular structure. J Chem Inf Comput Sci 2004;44:1000–5. [DOI] [PubMed] [Google Scholar]
- 52. Mobley DL, Peter J, Guthrie. FreeSolv: a database of experimental and calculated hydration free energies, with input files. J Comput Aided Mol Des 2014;28:711–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Gaulton A, Bellis LJ, Patricia Bento A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 2012;40:D1100–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Stokes JM, Yang K, Swanson K. et al. A deep learning approach to antibiotic discovery. Cell 2020;180:688–702. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Bemis GW, Murcko MA. The properties of known drugs. 1. Molecular frameworks. J Med Chem 1996;39:2887–93. [DOI] [PubMed] [Google Scholar]