Abstract
In recent years, machine learning models have shown substantial progress in predicting molecular properties. However, integrating molecular graph structures with sequence information continues to present a significant challenge. In this paper, we introduce Multi-MoleScale, a novel multi-scale framework designed to address this challenge. By combining Graph Contrastive Learning (GCL) with sequence-based models like BERT, Multi-MoleScale enhances the prediction of molecular properties by capturing both structural and contextual representations of molecules. Specifically, the model leverages GCL to effectively capture the intrinsic graph-based features of molecules while utilizing BERT’s pretraining capabilities to learn the contextual relationships within molecular sequences. The contrastive learning component enables Multi-MoleScale to distinguish between relevant and irrelevant molecular features, thereby enhancing its predictive accuracy across diverse molecular types. To assess the performance of our method, we conducted experiments on several widely used public datasets, including 12 molecular property datasets, 15 ADMET datasets, and 14 breast cancer cell line datasets. The results show that Multi-MoleScale consistently outperforms existing deep learning and self-supervised learning approaches. Notably, the model does not require handcrafted features, making it highly adaptable and versatile for a variety of molecular discovery tasks. This makes Multi-MoleScale a promising tool for applications in drug discovery, materials science, and other molecular research fields. Our data and code are available at https://github.com/pdssunny/Multi-MoleScale.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13321-025-01126-w.
Keywords: GCL, Molecular property prediction, Multi-MoleScale, Deep learning, Self-supervised learning
Scientific Contribution
We introduce Multi-MoleScale, a novel multi-scale framework that integrates Graph Contrastive Learning (GCL) with sequence-based models such as BERT. This innovative dual approach effectively combines molecular graph structures with sequence information, significantly enhancing predictive accuracy. By capturing both intrinsic graph-based features and contextual relationships within molecular sequences, Multi-MoleScale enables the differentiation between relevant and irrelevant molecular features.
Introduction
The functions of small molecules in living systems have been explored over the last century, beginning with the discovery of glucose, amino acids, vitamins, hormones, and many others [1, 2]. Small molecules elicit a therapeutic response by binding to a target biological macromolecule. Once bound, small-molecule ligands either inhibit the binding of other ligands or allosterically modulate the target’s conformational ensemble. Some 260 FDA-approved drugs are small organic molecules with various properties and mechanisms [3]. Understanding the types of targets and processes that can be modulated with small molecules has helped define the principles that underlie the rational discovery of small-molecule therapeutics [4]. Accurate prediction of properties such as physicochemical characteristics or toxicity can help researchers prioritize compounds for further experimentation, assess the potential toxicity of chemicals, optimize drug candidates, and design new materials with desired properties [5].
In molecular property prediction, traditional ML-based QSAR/QSPR models have notable limitations, as they rely on chemists to define and select molecular descriptors or fingerprints (e.g., ECFP), which are then computed using software tools. The process of manually designing, curating, and optimizing these descriptors is often time-consuming and subject to subjective biases: choices of descriptors may vary across researchers, potentially limiting the generalizability of models. Additionally, this reliance on handcrafted features restricts the scalability of descriptors and the applicability of the models to diverse molecular systems [6, 7]. Among traditional algorithms, both SVM [8] and XGBoost [9] perform well on small datasets; however, SVM becomes inefficient on high-dimensional and large datasets, while XGBoost requires extensive feature engineering and hyperparameter tuning, making it difficult to model complex molecular graphs. In contrast, deep learning (DL) methods present a promising alternative, offering several advantages for molecular property prediction. DL models can automatically extract features from molecular graphs [10], SMILES sequences [11], or images [12], thus eliminating the need for manual descriptor selection. This improves both the flexibility and accuracy of predictions, with Graph Neural Networks (GNNs) demonstrating particularly strong performance. However, GNNs can struggle with small datasets, suffer from over-smoothing, and are typically limited to 2–4 layers of graph convolutions, which restricts their ability to extract complex features. Variants of GNNs, such as Graph Attention Networks (GAT) [13] and Graph Convolutional Networks (GCN) [14], are capable of capturing both local and global interactions, yet both exhibit certain limitations in addressing long-range dependencies. Message Passing Neural Networks (MPNNs) [15] improve the modeling of dependencies within molecules but are complex to train and limited by graph size. SchNet [16] and AttentiveFP [17] excel in quantum property prediction and molecular fingerprint optimization, respectively, but they struggle with large molecules and complex structures, requiring significant computational resources. HRGCN+ [18] constructs multi-level interaction models but faces severe computational complexity and efficiency bottlenecks when applied to large datasets.
In recent years, the application of pre-trained models in molecular property prediction has gained significant momentum, driving substantial progress in this field. To enhance the representational capabilities of GNNs, Hu et al. [19] proposed a pre-training framework that incorporates both node-level and graph-level tasks. However, the graph-level tasks rely heavily on supervised learning, which is constrained by the scarcity of labeled data, limiting their effectiveness in scenarios with limited or no labeled data. In molecular representation learning, K-BERT [20], a pre-trained model based on SMILES strings, has significantly advanced molecular property prediction. Nonetheless, it faces challenges such as its dependence on limited labeled data, difficulties in handling complex molecular structures, and insufficient capacity to capture both local and global features of molecular graphs. Another pre-trained model based on the BERT architecture, Mole-BERT [21], leverages SMILES strings for training, effectively capturing molecular structures and chemical information. Mole-BERT employs the AttrMask task, which involves masking atoms and predicting their types. However, due to the small and imbalanced atom vocabulary, this task is overly simplistic, causing the model to focus on the more dominant atom types. Consequently, the model converges quickly, limiting its ability to learn more transferable knowledge. To address these limitations, FG-BERT [22] combines both graph and sequence information by converting molecular structure graphs into sequence formats, aiming to improve the accuracy of molecular property predictions. However, this model may lose critical topological information during the graph-to-sequence conversion, which can negatively affect its overall performance.
Among the various GNNs, Graph Convolutional Networks (GCNs) stand out for their exceptional ability to capture dependencies between nodes and edges within graph structures, making them particularly suitable for modeling molecules [23]. However, while traditional GNNs excel at capturing local structural information in molecular graphs, they often struggle with handling long-range dependencies and modeling multi-scale molecular interactions. Molecular properties are governed by complex relationships that span multiple scales, from local interactions between atoms to long-range effects between distant molecular regions. To address these challenges, we propose Multi-MoleScale, a novel multi-scale framework that integrates GCL with sequence-based models like BERT to enhance molecular property prediction. Multi-MoleScale is designed to capture both local and global structural features of molecules by utilizing a multi-scale approach, which allows it to model interactions across varying scales, from atomic-level to molecular-level dependencies. The GCL module, which introduces a contrastive learning framework, significantly improves the representational capacity of molecular graphs by maximizing the similarity between positive samples (graph structures derived from the same molecule) and minimizing the similarity between negative samples (graph structures from different molecules) [24]. This contrastive learning mechanism enhances the expressiveness of molecular representations, improving the generalization ability and stability of downstream prediction models. On the other hand, the BERT model, originally developed for natural language processing, provides a bidirectional self-attention mechanism that excels at capturing interdependencies between elements in a sequence from a global perspective [25]. In molecular graphs, predicting molecular properties often requires understanding complex interactions between distant nodes, and BERT’s bidirectional nature allows it to simultaneously learn contextual information from both directions, significantly improving its ability to model long-range dependencies. To combine these strengths, Multi-MoleScale integrates GCL’s graph-based learning and BERT’s sequence-based contextual learning into a unified framework. The model utilizes a co-attention mechanism [26, 27] to effectively fuse molecular graph structures with sequence information, enabling a complementary and synergistic interaction between graph and sequence data. This dual mechanism, enhanced by the multi-scale approach, allows Multi-MoleScale to better capture both short-range and long-range molecular interactions, resulting in more accurate predictions of molecular properties. Through this innovative joint learning approach, Multi-MoleScale sets a new benchmark for molecular property predictions, demonstrating superior representational capabilities and predictive performance across a diverse array of molecular datasets.
Methods
The Multi-MoleScale framework
Multi-MoleScale is a bi-modal co-attention framework, as illustrated in Fig. 1. The input drug molecular graph is processed through a GNN model, pre-trained using the GCL technique and enhanced by multi-scale data augmentation, to obtain the compound’s graph embedding representation. The multi-scale data augmentation approach introduces diverse transformations to the molecular graph, such as node dropping, edge perturbation, feature masking, and random-walk subgraphs. These augmentations capture both local and global structural features of the graph, ensuring a more robust and generalized graph embedding and significantly improving the model’s ability to recognize and learn complex graph patterns and structural information. Simultaneously, the corresponding SMILES sequence is input into a pre-trained sequence model, generating the sequence’s embedding. After obtaining both graph and sequence embeddings, Multi-MoleScale predicts molecular properties through cross-modal transformation and fusion [28]. The model first encodes both embeddings using a self-attention mechanism [29]; the encoded representations are then passed through two co-attention layers: GSA (Graph-to-Sequence Attention) and SGA (Sequence-to-Graph Attention). GSA focuses on the influence of the graph structure on the sequence, while SGA focuses on the influence of the sequence on the graph structure. Ultimately, after merging the two modalities, the prediction of a compound’s property is produced by a fully connected network.
Fig. 1.
Overview of Multi-MoleScale. The Multi-MoleScale framework accepts two inputs: the molecular graph and the SMILES sequence of a drug. The molecular graph is processed through a pre-trained GNN model utilizing graph contrastive learning, producing the graph embedding representation of the compound. Simultaneously, the SMILES sequence is input into a pre-trained sequence model, generating its corresponding sequence embedding. Multi-MoleScale incorporates a co-attention module to capture the interactions between the two modalities, employing an intersected variant of this module. The classifier module consists of two fully connected layers, both activated by ReLU functions
Graph pre-training
Graph convolution module
Molecules can naturally be represented as graph-structured data, enabling a straightforward mapping to a graph $G(V, E)$. Here, the node set $V = \{x_1, x_2, \dots, x_N\}$ represents the $N$ atoms in the molecule, while the edge set $E = \{e_1, e_2, \dots, e_M\}$ represents the $M$ chemical bonds, with $e_{ij}$ denoting the bond between atoms $x_i$ and $x_j$. The graph is encoded by two primary matrices: the node feature matrix $X \in \mathbb{R}^{N \times F}$, where $F$ is the feature dimension of each node, and the adjacency matrix $A \in \mathbb{R}^{N \times N}$, which represents connectivity between nodes. Atomic and edge features are shown in Supplementary Table S1.
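As a concrete illustration, the mapping from a SMILES string to the pair $(X, A)$ can be sketched with RDKit. This is a minimal sketch: the three atom features used here (atomic number, degree, formal charge) are illustrative stand-ins, not the actual feature set of Supplementary Table S1, and edge features are omitted.

```python
import numpy as np
from rdkit import Chem

def smiles_to_graph(smiles: str):
    """Convert a SMILES string into a node feature matrix X (N x F)
    and an adjacency matrix A (N x N). Features are illustrative only."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles}")
    # Node features: atomic number, degree, formal charge (placeholder set).
    X = np.array(
        [[a.GetAtomicNum(), a.GetDegree(), a.GetFormalCharge()]
         for a in mol.GetAtoms()],
        dtype=np.float32,
    )
    # Adjacency from the bond structure (1 where a bond exists).
    A = np.asarray(Chem.GetAdjacencyMatrix(mol), dtype=np.float32)
    return X, A

X, A = smiles_to_graph("CCO")  # ethanol: N = 3 heavy atoms, M = 2 bonds
```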
The propagation rule for a graph convolutional layer can be expressed as:
$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}\,W^{(l)}\right) \tag{1}$$

where $\tilde{A} = A + I$, $I$ is the identity matrix, $\tilde{D}$ represents the degree matrix of $\tilde{A}$, $W^{(l)}$ is the learnable weight matrix of layer $l$, and $H^{(l)}$ denotes the features of the nodes at each layer, with $H^{(0)} = X$ for the input layer. $\sigma(\cdot)$ represents the nonlinear activation function.
From a spatial perspective, the graph convolution layer in graph neural networks can be decomposed into two steps, $a_v = \rho(\{h_u : u \in \mathcal{N}(v)\})$ followed by $h_v' = \phi(a_v)$, where $\rho(\cdot)$ is the aggregation function and $\phi(\cdot)$ is the feature extraction function [30]. The aggregation function gathers and combines information from the local neighborhood of each node, ensuring that the features of neighboring nodes influence each other; common aggregation methods include mean, sum, or max pooling. The feature extraction function then applies a learnable transformation to the aggregated features, often using a nonlinear activation function to enhance the expressive power of the network.
After updating the features of all nodes, the output is the average value of the complete molecular graph:
$$X_G = \frac{1}{N}\sum_{i=1}^{N} h_i \tag{2}$$

where $h_i$ is the updated feature vector of node $i$ and $X_G$ represents the embedding vector of the molecular graph in the graph convolution module [31].
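To make Eqs. (1) and (2) concrete, the following NumPy sketch implements one graph convolution step and the mean readout. The tanh activation and the toy dimensions are assumptions for illustration; the paper does not tie itself to this particular implementation.

```python
import numpy as np

def gcn_layer(H, A, W, activation=np.tanh):
    """One graph convolution step, Eq. (1): sigma(D~^-1/2 A~ D~^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])                       # A~ = A + I (self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_norm = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D~^-1/2 A~ D~^-1/2
    return activation(A_norm @ H @ W)

def readout(H):
    """Mean readout over all node features, Eq. (2): the graph embedding X_G."""
    return H.mean(axis=0)

# Toy usage: 5 atoms, 3 input features, 8 hidden units.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
A = (rng.random((5, 5)) > 0.6).astype(float)
A = np.triu(A, 1) + np.triu(A, 1).T                        # symmetric, no self-loops
X_G = readout(gcn_layer(X, A, rng.normal(size=(3, 8))))
```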
Multi-scale data augmentation
The purpose of data augmentation is to generate realistic, diverse data through transformations that preserve semantic labels [32, 33]. In this study, we focus on graph-level data augmentation. Given a graph G from a dataset, its augmented graph G′∼T(G) is sampled from a distribution T(G), reflecting certain prior assumptions. In drug-like small molecule classification, transformations like atom deletion and bond perturbation can help capture the structural features of the molecule, encoding important information about its molecular structure. For instance, deleting a hydrogen atom or adding a chemical bond generates an augmented molecule with similar pharmacological properties to the original [34]. Contrastive learning between the original and augmented molecules enables the model to focus on key structural features, improving its ability to distinguish molecules with different functionalities.
We adopt four different augmentation strategies, each corresponding to a different prior assumption:
Node Dropping (ND): A fixed proportion of nodes and their associated edges are randomly removed. The underlying assumption is that the graph’s semantics are robust to the removal of some nodes. The drop probability for each node follows a uniform distribution. This strategy is designed to capture local-scale structural features, as the removal of specific atoms or functional groups encourages the model to focus on the overall molecular topology and connectivity. By forcing the model to rely on a broader view of the molecular structure, Node Dropping helps enhance the robustness of the graph embedding, enabling the model to generalize better to unseen molecular structures.
Edge Perturbation (EP): A fixed proportion of edges are randomly added or deleted. This assumes that the graph’s semantics exhibit some robustness to changes in edge connectivity. The addition or deletion of each edge follows the uniform distribution. Edge Perturbation can be viewed as modifying the intermediate-scale features of the graph, simulating variations in chemical bonds or interactions, while still preserving the key molecular properties and functionality. This augmentation helps the model better capture the dynamic nature of molecular interactions, improving its ability to generalize across different molecular structures.
Feature Masking (FM): This technique involves masking a fixed proportion of node features, encouraging the model to predict these missing features using the context provided by the unmasked ones. The underlying assumption is that the omission of certain node attributes does not significantly degrade the model’s performance. Feature Masking primarily addresses local node-level information, as the model must rely on the remaining features and the node’s neighborhood to infer the missing details.
Subgraphs induced by Random walks (RW): A subgraph is sampled through random walks, based on the assumption that the graph’s semantics are largely preserved within its local structure [35]. Random walks capture subgraphs that span from local to global scales, offering a way to learn multi-scale features, from small, localized regions to broader molecular environments.
These four strategies, especially when combined with contrastive learning, enable the model to capture structural features at different scales of the molecular graph, which we refer to as multi-scale data augmentation. The multi-scale nature of these augmentations reflects different levels of molecular information—from local node attributes and bonds to larger-scale molecular structures. By leveraging this diversity in augmentation, the model is able to learn more robust, generalized representations of drug-like molecules, enhancing classification or regression performance and improving the model’s ability to distinguish molecules with different functionalities.
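The four strategies above can be sketched as plain NumPy operations on the $(X, A)$ graph representation. This is a rough sketch: the 25% ratios mirror those used in the ablation study later in the paper, while the walk length and other implementation details are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def node_dropping(X, A, ratio=0.25):
    """ND: remove a random subset of nodes and their incident edges."""
    keep = rng.random(X.shape[0]) > ratio
    return X[keep], A[np.ix_(keep, keep)]

def edge_perturbation(A, ratio=0.25):
    """EP: flip (add or delete) a random fraction of node pairs."""
    A = A.copy()
    n = A.shape[0]
    n_flip = int(ratio * A.sum() / 2)          # relative to the current edge count
    for _ in range(n_flip):
        i, j = rng.integers(n), rng.integers(n)
        if i != j:
            A[i, j] = A[j, i] = 1 - A[i, j]
    return A

def feature_masking(X, ratio=0.25):
    """FM: zero out the feature vectors of a random subset of nodes."""
    X = X.copy()
    X[rng.random(X.shape[0]) < ratio] = 0.0
    return X

def random_walk_subgraph(X, A, walk_len=16):
    """RW: keep the subgraph induced by a random walk from a random start node."""
    node, visited = rng.integers(A.shape[0]), set()
    for _ in range(walk_len):
        visited.add(node)
        nbrs = np.flatnonzero(A[node])
        if len(nbrs) == 0:
            break
        node = rng.choice(nbrs)
    idx = sorted(visited)
    return X[idx], A[np.ix_(idx, idx)]
```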
Graph contrastive learning
The contrastive learning method, illustrated in Fig. 2, aims to learn effective graph representations by maximizing the consistency between two views of the same graph [36, 37]. This method consists of four key components:
Graph Data Augmentation Module: Graph G employs a data augmentation module to generate two distinct views, Gi and Gj, both of which are treated as positive samples. In a mini-batch of N molecular graphs, the augmentation process produces 2N augmented graphs. Positive sample pairs are formed from the two views of the same graph, while negative sample pairs are constructed by pairing views derived from different graphs in the batch.
GNN Encoder Module: The encoder f(.) processes the augmented graphs Gi and Gj, deriving their respective representation vectors hi and hj. Both graphs are processed by the same encoder, which leverages graph convolution and readout operations to effectively capture structural and contextual information.
Projection Head: A projection head g(.), implemented as a two-layer MLP, maps the graph representations hi and hj to a latent space, producing the vectors zi and zj. Both representations are mapped using the same MLP, ensuring better alignment in the latent space.
Contrastive Loss Function: The contrastive loss function $\mathcal{L}(\cdot)$ is designed to maximize the consistency between $z_i$ and $z_j$, utilizing the normalized temperature-scaled cross entropy loss (NT-Xent). With cosine similarity defined as $\mathrm{sim}(z_i, z_j) = z_i^{\top} z_j / (\lVert z_i \rVert\,\lVert z_j \rVert)$, the NT-Xent loss is then defined as:

$$\mathcal{L}_{i,j} = -\log \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \tag{3}$$
Fig. 2.
Overview of graph pre-training: A SMILES sequence sn is extracted from a mini-batch of N molecules and transformed into a molecular graph Gn. Then, random augmentation operations are applied to the graph to produce the original graph Gi and an augmented graph Gj. A feature encoder, based on graph convolution and readout operations, extracts the graph representations hi and hj for Gi and Gj, respectively. NT-Xent loss is used to maximize the alignment between the latent vectors zi and zj, which are generated by the MLP projection head. This pre-training process enables the GNN to learn robust and representative features, which can subsequently be utilized for downstream molecular property prediction tasks
Here, τ represents the temperature parameter that scales the similarity scores, and $\mathbb{1}_{[k \neq i]}$ is an indicator function equal to 1 when $k \neq i$.
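A compact PyTorch sketch of Eq. (3) follows. The batch construction assumes each sample’s positive is its counterpart in the other augmented half of the batch, and the temperature value is an illustrative choice rather than the paper’s setting.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i: torch.Tensor, z_j: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent loss (Eq. 3) for N positive pairs of projections z_i, z_j: (N, d)."""
    N = z_i.shape[0]
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / tau                                  # cosine similarities / tau
    mask = torch.eye(2 * N, dtype=torch.bool)              # exclude the k = i terms
    sim = sim.masked_fill(mask, float("-inf"))
    # The positive of row k is row k+N (first half) or row k-N (second half).
    targets = torch.cat([torch.arange(N) + N, torch.arange(N)])
    return F.cross_entropy(sim, targets)                   # -log softmax at the positive

loss = nt_xent_loss(torch.randn(8, 128), torch.randn(8, 128))
```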
This pre-training approach enables the model to effectively leverage structural and contextual information, allowing it to learn robust graph representations that are well-suited for downstream tasks. During the pre-training phase of GCL, we utilized a large set of unlabeled molecular data to extract contextual information from the molecules. In this phase, approximately 1.04 million compounds were randomly selected from the ChEMBL [38] database for pre-training.
Sequence pre-training module
The BERT model is pre-trained using Masked Language Model (MLM) and Next Sentence Prediction (NSP) [39]. The MLM task involves randomly masking specific molecular fragments in the input sequence and training the model to predict the masked fragments based on the surrounding context. Through this process, the model captures essential dependencies between molecular fragments, enabling it to infer the masked portions with high accuracy. This bidirectional context learning allows BERT to model complex relationships within molecular sequences, significantly enhancing its ability to understand the overall molecular structure. By extracting meaningful features from these sequences, BERT builds a comprehensive understanding of molecular dependencies and chemical information, supporting advanced applications in molecular modeling and cheminformatics [40]. In natural language processing (NLP), words within a sentence are often interdependent, requiring the model to consider relationships between all words. Similarly, in molecular graphs, atoms and functional groups are primarily connected through chemical bonds, and their properties are influenced by the adjacent atoms or groups. During pre-training, the model undergoes self-supervised learning using large-scale chemical molecule datasets, aiming to capture useful patterns and features embedded in molecular structures. Specifically, BERT learns to understand these structural relationships by modeling the contextual dependencies in SMILES sequences, which is conceptually similar to how BERT models relationships between words in NLP tasks [41, 42].
In Fig. 3, a CLS node is used to connect all atoms in the SMILES representation, enabling comprehensive information exchange. The output from the CLS node then serves as the final molecular representation, which is subsequently utilized for downstream classification or regression tasks. To further enhance feature extraction, the MLM pre-training protocol from ChemBERTa [43] is utilized. In this implementation, 15% of tokens in each input string are randomly masked, and the model is trained to accurately predict these masked tokens based on the surrounding context. The vocabulary is constructed from frequently occurring SMILES characters, encompassing atomic symbols, bond types, and special tokens for structural features. The resulting dictionary contains 591 tokens, while the maximum sequence length is set to 512 tokens to accommodate diverse molecular representations.
Fig. 3.
The overall pretraining and fine-tuning procedures of BERT
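The 15% masking step can be sketched as below, assuming the standard BERT 80/10/10 replacement scheme (80% of selected tokens become the mask token, 10% become random tokens, and 10% are left unchanged); the token IDs are placeholders for the 591-token SMILES vocabulary described above.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """BERT-style masking: returns (masked_inputs, labels); positions that are
    not selected get label -100 so the cross-entropy loss ignores them."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob            # pick ~15% of tokens
    labels[~selected] = -100
    masked = input_ids.clone()
    to_mask = selected & (torch.rand(input_ids.shape) < 0.8)     # 80% -> [MASK]
    masked[to_mask] = mask_token_id
    to_random = selected & ~to_mask & (torch.rand(input_ids.shape) < 0.5)  # 10% -> random
    masked[to_random] = torch.randint(vocab_size, (int(to_random.sum()),))
    return masked, labels

ids = torch.randint(5, 591, (2, 64))        # toy batch of tokenized SMILES
masked, labels = mask_tokens(ids, mask_token_id=4, vocab_size=591)
```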
Co-attention module
Integrating the graph embedding $X_G$ and sequence embedding $X_S$ is critical for constructing a unified molecular representation. Traditional approaches generally employ a simple concatenation of these embeddings, followed by a fully connected network for further processing. However, such methods often fail to capture the intricate interactions between graph and sequence representations, potentially leading to suboptimal performance. To address this limitation, this study introduces a co-attention mechanism to fuse information from both modalities before conducting classification or regression tasks. The co-attention mechanism employs a dual-modality attention structure, where graph and sequence embeddings dynamically attend to each other, ensuring efficient utilization of attention resources and swift extraction of modality-specific, high-value features from extensive datasets. Figure 4 illustrates the three attention mechanisms utilized in this study: SA [44], GSA, and SGA. SA captures intra-modality dependencies, while GSA and SGA focus on bidirectional interactions between the two modalities, enhancing the representation’s overall expressiveness.
Fig. 4.
Overview of the three attention mechanisms. A Self-attention (SA): Operates within a single modality, capturing internal dependencies and relationships. B Graph-Sequence attention (GSA): Uses the molecular graph feature vector as the query, while the molecular sequence feature vectors as the key and value. C Sequence-Graph attention (SGA): Uses the molecular sequence feature vector as the query, with the molecular graph feature vectors acting as the key and value
The SA module first inputs XG and XS into the self-attention mechanism to learn the hidden information within each modality, as shown in Fig. 4a.
$$Q = X_{emb} W_q \tag{4}$$

$$K = X_{emb} W_k \tag{5}$$

$$V = X_{emb} W_v \tag{6}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{7}$$

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i), \quad i = 1, \dots, H \tag{8}$$
For this module, $X_{emb}$ represents the input embedding, where $X_{emb} = X_G$ when the input is the molecular graph embedding and $X_{emb} = X_S$ when it is the molecular sequence embedding. The matrices $W_q$, $W_k$, and $W_v$ are shared, learnable parameters applied across all nodes, with $W_q, W_k, W_v \in \mathbb{R}^{d_{model} \times d_k}$ and $d_k = d_{model}/H$, where $d_{model}$ is the embedding size and $H$ is the number of attention heads.
To capture information across diverse feature spaces, we incorporate a multi-head attention mechanism into the co-attention model with m attention heads. The outputs from each head are concatenated and passed through a learnable matrix Wo for linear transformation to produce the final attention output.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_m)\, W_o \tag{9}$$
Here, $W_o \in \mathbb{R}^{m d_k \times d_{model}}$ is also a learnable shared matrix.
The GSA captures the complementary information that the graph structure of a drug molecule provides to its sequence. Specifically, GSA processes two feature inputs, XG and XS. Keys and values are obtained from XG, while queries are computed from XS. The multi-head attention mechanism leverages scaled dot-product calculations to learn pairwise relationships between XG and XS. As a result, GSA effectively embeds the graph structure information of the drug into its sequence representation. Similarly, the SGA processes XS and XG, but it assesses the impact of the drug’s sequence structure on the molecular graph structure. In SGA, keys and values are derived from XS, while queries are computed from XG.
$$X_{GSA} = \mathrm{MultiHead}(Q_S, K_G, V_G), \qquad X_{SGA} = \mathrm{MultiHead}(Q_G, K_S, V_S) \tag{10}$$

where the subscripts indicate whether the queries, keys, and values are projected from the sequence embedding $X_S$ or the graph embedding $X_G$.
For all attention units, the updated features $X_{update}$ are processed through a feed-forward layer and a dropout layer. To further improve the module’s robustness, residual connections and layer normalization are additionally applied. Details of the computational complexity are provided in Supplementary Materials (Appendix A1).
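A minimal PyTorch sketch of one SA plus GSA/SGA round is given below, using nn.MultiheadAttention as a stand-in for the attention units described above. The dimensions, head count, and exact placement of the residual connections and layer normalization are illustrative assumptions, not the paper’s reference implementation.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One SA + GSA/SGA round over graph embeddings X_G and sequence embeddings X_S."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        mha = lambda: nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                            batch_first=True)
        self.sa_g, self.sa_s, self.gsa, self.sga = mha(), mha(), mha(), mha()
        self.norm_g, self.norm_s = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, X_G, X_S):
        # SA: self-attention within each modality (Eqs. 4-9).
        X_G = self.norm_g(X_G + self.sa_g(X_G, X_G, X_G)[0])
        X_S = self.norm_s(X_S + self.sa_s(X_S, X_S, X_S)[0])
        # GSA: queries from the sequence, keys/values from the graph,
        # injecting graph-structure information into the sequence representation.
        X_S_out = X_S + self.gsa(X_S, X_G, X_G)[0]
        # SGA: queries from the graph, keys/values from the sequence.
        X_G_out = X_G + self.sga(X_G, X_S, X_S)[0]
        return X_G_out, X_S_out

block = CoAttentionBlock()
X_G, X_S = block(torch.randn(2, 30, 256), torch.randn(2, 64, 256))  # (batch, len, d)
```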
Hyperparameter optimization and training protocol
Seven hyperparameters were investigated in this study, including the graph augmentation strategies for GCL, the type of aggregation function, and the dropout rate, learning rate, batch size, and number of heads for co-attention. Multi-MoleScale employs the dropout mechanism in both the GCL and co-attention modules to prevent overfitting during training. Additionally, the dataset is partitioned into training, validation, and test sets, with the validation set used to select the final model for optimal generalization. Multi-MoleScale was developed using PyTorch.
Benchmark datasets and performance evaluation metric
The performance of the Multi-MoleScale model was thoroughly evaluated using three collections of benchmark datasets. First, 12 widely used public datasets related to drug discovery were employed to assess the model’s performance. These datasets, derived from the paper by Wu et al. [45], are widely recognized as benchmarks in the field of drug property prediction; details of the 12 benchmarking datasets can be found in Supplementary Table S2. The datasets include 3 physicochemical datasets (ESOL [46], FreeSolv [47], and Lipophilicity [48]), 5 bioactivity and biophysics datasets (Malaria [49], MUV [50], HIV [51], BACE [52], and CEP [21]), and 4 physiology and toxicity datasets (BBBP [53], Tox21 [54], SIDER [55], and ClinTox [56]). The sizes of the datasets vary significantly, with smaller ones, such as FreeSolv, containing only 642 molecules, and larger ones, such as MUV, consisting of 17 sub-tasks with a total of 93,087 molecules. Second, 15 ADMET [20] datasets related to drug discovery (Supplementary Table S3) were used to evaluate the performance of Multi-MoleScale. These datasets, collected from ADMETlab 2.0 [57], are 15 small drug-likeness datasets, each containing fewer than 2000 molecules. Finally, 14 phenotype screening datasets related to breast cancer cell lines (Supplementary Table S4) were utilized to assess the predictive performance of Multi-MoleScale [58].
The datasets were divided into training, validation, and test sets with a ratio of 8:1:1. Specifically, the 12 commonly utilized public benchmark datasets and 14 phenotype screening datasets pertaining to breast cancer cell lines were segregated according to this ratio. For the 15 ADMET datasets, their partitioning adhered to the predefined splits established in reference [22], thereby maintaining consistency with the original experimental setup of these datasets. To address the impact of randomness in data partitioning and ensure the robustness of results, we performed 10 independent experiments with different random seeds for all datasets. The reported performance metrics (ROC-AUC for classification tasks and RMSE for regression tasks) are the average values across these 10 iterations, which effectively reduces the influence of random biases in a single partition and enhances the reliability of our conclusions.
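The split-and-repeat protocol can be sketched as follows, with train_and_predict standing in as a hypothetical wrapper around fitting Multi-MoleScale and scoring the test set; the 0.2/0.5 split fractions reproduce the 8:1:1 ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def evaluate_over_seeds(X, y, train_and_predict, n_seeds: int = 10):
    """Repeat the 8:1:1 split over n_seeds seeds; return mean and std ROC-AUC."""
    aucs = []
    for seed in range(n_seeds):
        # 80% train, then split the remaining 20% evenly into validation/test.
        X_tr, X_rest, y_tr, y_rest = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        X_va, X_te, y_va, y_te = train_test_split(
            X_rest, y_rest, test_size=0.5, random_state=seed)
        y_score = train_and_predict(X_tr, y_tr, X_va, y_va, X_te)
        aucs.append(roc_auc_score(y_te, y_score))
    return float(np.mean(aucs)), float(np.std(aucs))
```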
Results and discussion
Performance of Multi-MoleScale on molecular property datasets
In this study, the predictive performance of the Multi-MoleScale model was evaluated using 12 publicly available benchmark datasets, comprising seven classification tasks and five regression tasks. Nine supervised learning models (SVM, XGBoost, GCN, GAT, MPNN, SchNet, ChemBERTa, 3D-InfoMax [59], and 3D-PGT [60]) and three self-supervised or pre-trained models (Hu et al. [19], Mole-BERT, and FG-BERT) were selected for comparative experiments.
Table 1 shows the ROC-AUC test results of our Multi-MoleScale model on classification tasks compared against both supervised-only and self-supervised/pre-trained baselines. The BACE, HIV, and MUV datasets focus on bioactivity and biophysical properties, specifically the measurement of binding affinity for various biological targets; accurate prediction of these properties is essential for effective drug discovery. Multi-MoleScale achieved the highest performance on the HIV and MUV datasets, with AUC values of 0.865 and 0.837, respectively. Although XGBoost slightly outperformed Multi-MoleScale on the BACE dataset, our model remained competitive. For the BBBP, SIDER, TOX21, and ClinTox datasets, which relate to physiological and toxicity characteristics critical for early-stage toxicity identification in drug development, Multi-MoleScale showed superior classification performance, achieving AUC values of 0.956, 0.663, 0.844, and 0.913, respectively. The following observations can be drawn from Table 1: (1) Compared to supervised or pre-trained models, Multi-MoleScale achieved the best results on 6 out of 7 classification benchmarking datasets, with an average improvement of 2.8% in terms of ROC-AUC. This suggests that Multi-MoleScale, which does not rely on domain expertise in molecular chemistry and biology, can accurately identify the physiological and toxic characteristics of molecules from their raw data representations. (2) Compared to advanced ensemble learning and graph neural network algorithms, Multi-MoleScale showed competitive or superior performance. On datasets such as ClinTox, BBBP, and MUV, Multi-MoleScale surpassed state-of-the-art GNNs, which often require complex graph-based operations or extensive feature engineering. These results highlight Multi-MoleScale’s potential for simplifying and improving tasks in drug development. For instance, on the BBBP benchmark, Multi-MoleScale improved the ROC-AUC by 3.4% over the GCN model.
Table 1.
Comparison of model performance (ROC-AUC) in classification tasks on seven benchmarking datasets
| Dataset | BACE | HIV | MUV | TOX21 | BBBP | ClinTox | SIDER |
|---|---|---|---|---|---|---|---|
| #Molecules | 1513 | 41,127 | 93,087 | 7831 | 2019 | 1478 | 1427 |
| #Tasks | 1 | 1 | 17 | 12 | 1 | 2 | 27 |
| SVM | 0.847 ± 0.002 | 0.639 ± 0.006 | 0.500 ± 0.012 | 0.635 ± 0.013 | 0.824 ± 0.008 | 0.561 ± 0.055 | 0.534 ± 0.009 |
| XGBoost | 0.888 ± 0.022 | 0.816 ± 0.007 | 0.657 ± 0.015 | 0.777 ± 0.033 | 0.921 ± 0.006 | 0.678 ± 0.046 | 0.625 ± 0.022 |
| GAT | 0.867 ± 0.019 | 0.755 ± 0.012 | 0.717 ± 0.041 | 0.788 ± 0.080 | 0.918 ± 0.034 | 0.900 ± 0.016 | 0.596 ± 0.011 |
| GCN | 0.886 ± 0.021 | 0.851 ± 0.030 | 0.812 ± 0.035 | 0.791 ± 0.025 | 0.922 ± 0.011 | 0.869 ± 0.027 | 0.618 ± 0.015 |
| MPNN | 0.873 ± 0.007 | 0.837 ± 0.017 | 0.806 ± 0.027 | 0.783 ± 0.019 | 0.920 ± 0.032 | 0.907 ± 0.043 | 0.613 ± 0.021 |
| ChemBERTa | 0.859 ± 0.009 | 0.789 ± 0.004 | – | 0.803 ± 0.002 | 0.956 ± 0.005 | 0.601 ± 0.000 | 0.618 ± 0.018 |
| 3D-InfoMax | 0.786 | 0.761 | 0.762 | 0.745 | 0.691 | 0.627 | 0.568 |
| 3D-PGT | 0.809 | 0.781 | 0.694 | 0.738 | 0.721 | 0.794 | 0.606 |
| SchNet | 0.766 ± 0.011 | 0.702 ± 0.034 | 0.713 ± 0.030 | 0.772 ± 0.023 | 0.848 ± 0.022 | 0.715 ± 0.037 | 0.539 ± 0.037 |
| Hu et al | 0.859 ± 0.008 | 0.802 ± 0.009 | 0.814 ± 0.020 | 0.787 ± 0.004 | 0.708 ± 0.015 | 0.789 ± 0.024 | 0.652 ± 0.090 |
| Mole-BERT | 0.808 ± 0.015 | 0.782 ± 0.008 | 0.786 ± 0.018 | 0.768 ± 0.050 | 0.719 ± 0.016 | 0.789 ± 0.030 | 0.628 ± 0.011 |
| FG-BERT | 0.845 ± 0.015 | 0.774 ± 0.010 | 0.753 ± 0.024 | 0.784 ± 0.080 | 0.702 ± 0.009 | 0.832 ± 0.016 | 0.640 ± 0.007 |
| Multi-MoleScale | 0.880 ± 0.005 | 0.865 ± 0.009 | 0.837 ± 0.104 | 0.844 ± 0.039 | 0.956 ± 0.004 | 0.913 ± 0.005 | 0.663 ± 0.063 |
| Multi-MoleScale (ChemBERTa) | 0.871 ± 0.007 | 0.862 ± 0.013 | 0.829 ± 0.023 | 0.863 ± 0.058 | 0.962 ± 0.007 | 0.927 ± 0.032 | 0.694 ± 0.061 |
The top-performing models based on supervised learning and self-supervised learning in each benchmark are highlighted in bold. “–” means the results were not reported in the references
Table 2 summarizes the performance of Multi-MoleScale compared to other models on regression tasks. The FreeSolv, ESOL, and Lipophilicity datasets focus on the physicochemical properties of molecules, which are critical for drug discovery and development; accurate predictions of these properties can streamline the identification and optimization of potential drug candidates. Multi-MoleScale achieved the best overall performance, including the lowest RMSE values on the ESOL, FreeSolv, and Malaria datasets (an average RMSE of 0.953), demonstrating its advantage in predicting drug-relevant properties.
Table 2.
Comparison of model performance (RMSE) in regression tasks on five benchmarking datasets
| Dataset | ESOL | FreeSolv | Lipophilicity | Malaria | CEP |
|---|---|---|---|---|---|
| #Molecules | 1128 | 642 | 4200 | 9999 | 29,978 |
| #Tasks | 1 | 1 | 1 | 1 | 1 |
| SVM | 1.185 ± 0.012 | 2.459 ± 0.016 | 1.014 ± 0.002 | 1.099 ± 0.013 | 1.698 ± 0.019 |
| XGBoost | 1.190 ± 0.019 | 2.653 ± 0.009 | 0.867 ± 0.011 | 1.300 ± 0.025 | 1.336 ± 0.018 |
| GAT | 0.749 ± 0.009 | 1.279 ± 0.038 | 0.942 ± 0.003 | 1.251 ± 0.033 | 1.640 ± 0.033 |
| GCN | 1.430 ± 0.040 | 2.850 ± 0.021 | 0.848 ± 0.004 | 1.085 ± 0.021 | 1.441 ± 0.012 |
| MPNN | 0.980 ± 0.026 | 2.131 ± 0.033 | 0.655 ± 0.005 | 1.267 ± 0.012 | 1.107 ± 0.050 |
| ChemBERTa | 0.682 ± 0.089 | 1.399 ± 0.051 | 0.615 ± 0.007 | – | – |
| 3D-InfoMax | 0.894 | 2.337 | 0.706 | 1.121 | 1.218 |
| 3D-PGT | 1.061 | 1.062 | 0.687 | 1.104 | 1.215 |
| SchNet | 1.070 ± 0.006 | 3.125 ± 0.076 | 0.911 ± 0.010 | 1.564 ± 0.008 | 1.566 ± 0.035 |
| Hu et al | 1.220 ± 0.002 | 2.733 ± 0.012 | 0.740 ± 0.000 | 1.879 ± 0.011 | 1.945 ± 0.032 |
| Mole-BERT | 1.015 ± 0.030 | 1.110 ± 0.016 | 0.677 ± 0.017 | 1.061 ± 0.009 | 1.230 ± 0.059 |
| FG-BERT | 0.944 ± 0.025 | 1.076 ± 0.024 | 0.655 ± 0.009 | 1.043 ± 0.006 | 1.031 ± 0.029 |
| Multi-MoleScale | 0.621 ± 0.016 | 1.022 ± 0.054 | 0.730 ± 0.006 | 1.038 ± 0.062 | 1.386 ± 0.024 |
| Multi-MoleScale (ChemBERTa) | 0.612 ± 0.012 | 1.080 ± 0.025 | 0.752 ± 0.019 | 1.042 ± 0.080 | 1.379 ± 0.027 |
The top-performing models based on supervised learning and self-supervised learning in each benchmark are highlighted in bold. “–” means the results were not reported in the references
Tables 1 and 2 show that the Multi-MoleScale model achieved the best performance in 9 out of 12 benchmarking datasets, demonstrating its effectiveness. Furthermore, Multi-MoleScale outperformed or matched mainstream supervised learning and pre-training models in all baseline comparisons. This highlights the potential of combining pre-trained molecular graph and sequence embeddings to improve the generalization ability of graph-based deep learning algorithms for molecular property prediction, a key factor in drug discovery and design. The hyperparameters of the best Multi-MoleScale model for each learning task are provided in Supplementary Table S5, the model’s results on the validation dataset are presented in Table S6, and performance on the test dataset is shown in Table S7.
Performance of Multi-MoleScale on ADMET datasets
To validate the advantages of our method in predicting molecular properties for drug discovery, we compared the performance of Multi-MoleScale with several other methods across 15 ADMET datasets; details of these datasets are provided in Supplementary Table S3. Following the study of Li et al. [22], we selected several advanced models as baselines for comparison: two graph-based methods (HRGCN+ and Attentive FP), a fingerprint-based XGBoost model (XGBoost-ECFP4), and two pre-training models (K-BERT and FG-BERT). To ensure a consistent and fair comparison, we used the same data, partitioning method, and data split ratio for Multi-MoleScale. Additionally, we averaged the results of 10 trials with different random seeds for each dataset and present the final outcomes for the test set in Fig. 5a. The heatmap shows the performance of the models across the different tasks, with darker shades indicating better performance. The main observation is that the Multi-MoleScale model consistently outperforms the baseline models in most tasks, suggesting that it is more effective at capturing the underlying patterns in these classification tasks.
Fig. 5.
Performance of Multi-MoleScale on 15 ADMET datasets compared to the baseline models. A presents a detailed comparison of AUC values on the test set between Multi-MoleScale and the baseline models. B shows the comparison of their average AUC values over all prediction tasks. The performance data for all five baseline models are sourced from the study by Li et al. [22]
Overall, as shown in Fig. 5b, Multi-MoleScale achieved the best ADMET prediction performance with the highest mean AUC value of 0.820. The FG-BERT pretraining model ranked second, followed by K-BERT, HRGCN+, XGBoost-ECFP4, and Attentive FP. This result clearly shows that the pretraining methods (Multi-MoleScale, FG-BERT, K-BERT) are superior to the non-pretraining methods (HRGCN+, Attentive FP, XGBoost-ECFP4) on the ADMET datasets, highlighting the advantages of pretraining approaches in drug discovery. These benefits likely stem from the ability of BERT-based pretraining models to learn accurate and useful molecular representations from large, unlabeled datasets. Furthermore, Multi-MoleScale surpassed FG-BERT owing to its graph contrastive learning, which excels at capturing the dependencies between nodes and edges in molecular graph structures, thereby enhancing the model’s ability to learn structured features. Additionally, BERT sequence learning is particularly effective at capturing contextual information in sequence data and modeling long-range dependencies. The co-attention mechanism, which fuses molecular graph learning and sequence learning, enables Multi-MoleScale to extract crucial chemical structure and semantic information from molecules. Our findings underscore Multi-MoleScale’s strong predictive capability, positioning it as one of the most competitive methods for predicting molecular ADMET properties. The hyperparameters of the best Multi-MoleScale model for each ADMET learning task are provided in Supplementary Table S5.
Performance of Multi-MoleScale on cell-based phenotypic screening datasets
In recent years, phenotypic screening (such as cell line-based assays) has gained significant attention in the field of drug screening. This study evaluates Multi-MoleScale’s performance in phenotypic screening using 13 breast cancer cell lines and 1 normal breast cell line; details of the 14 cell phenotypic screening datasets are shown in Supplementary Table S4. Three graph-based methods (GCN, GAT, and MPNN) and two fingerprint-based methods (XGBoost and SVM) were selected as baselines for predicting molecular activity against these cell lines.
As shown in Fig. 6, Multi-MoleScale outperformed all models in 10 of the 14 cell lines (i.e., MDA-MB-453, SK-BR-3, MDA-MB-435, MDA-MB-361, BT-474, BT-20, BT-549, HS-578T, Bcap-37, and HBL-100). XGBoost showed the best performance in 2 cell lines, T-47D and MDA-MB-231. GCN achieved the best performance in MDA-MB-468, while MPNN excelled in MCF-7. Importantly, Multi-MoleScale achieved the best overall performance across these 14 cell lines, with the highest mean AUC of 0.856. Multi-MoleScale’s outstanding performance on the cell line-based phenotypic screening dataset demonstrates its great potential in phenotype-based drug discovery. The hyperparameters of the best Multi-MoleScale model for each cell-based phenotypic screening dataset are provided in Supplementary Table S5.
Fig. 6.
Performance of Multi-MoleScale on 14 breast cancer cell line datasets compared to the baseline models. A presents a detailed comparison of AUC values on the test set between Multi-MoleScale and the baseline models. B shows the comparison of their average AUC values over all prediction tasks
Ablation studies
Impact of GCL and BERT on model performance
The ablation study of Multi-MoleScale was performed using the cell-based phenotypic screening datasets. For each target, the Multi-MoleScale model was reduced to either its GCL or its BERT component and trained with the same hyperparameters as the full model. As shown in Fig. 7, the full model performed better than both the GCL-only and BERT-only models in 13 out of the 14 targets. For the remaining target (HS-578T), the full Multi-MoleScale model exhibited moderate performance, slightly lower than the GCL-only model but significantly better than the BERT-only model. These results suggest that Multi-MoleScale effectively integrates the strengths of GCL and BERT: combining molecular graphs and sequences captures complementary local and global structural information, leading to more accurate molecular property predictions. Furthermore, both the GCL and BERT components individually enhance prediction performance. Notably, the inclusion of GCL produced the larger performance improvement, indicating that molecular properties are primarily influenced by neighboring atoms.
Fig. 7.
Results of the ablation study on cell-based phenotypic screening datasets
Impact of Multi-Scale graph augmentations on model performance
To understand the impact of molecular graph augmentation strategies on prediction performance, models applying different strategies and their combinations were compared: Feature Masking (FM), Edge Perturbation (EP), Node Dropping (ND), and subgraphs induced by random walks (RW). Seven configurations were considered: (1) FM with a 25% ratio; (2) EP with a 25% ratio; (3) ND with a 25% ratio; (4) RW; (5) a combination of FM and EP, both with a 25% ratio; (6) a combination of FM, EP, and ND, each with a 10% ratio; and (7) a combination of all four augmentation methods (FM, EP, ND, and RW).
As depicted in Fig. 8a, b, the combination of all four multi-scale graph augmentation methods achieved the overall best performance. This performance is due to the combined strategy’s ability to integrate the advantages of each augmentation method, capturing the diversity and underlying relationships within the data and enhancing the model’s ability to learn complex patterns. When evaluating the performance of individual multi-scale graph augmentation methods in classification tasks, most methods yielded similar results in terms of classification accuracy. However, the ND augmentation method stood out as particularly unstable, showing significant fluctuations in performance. This suggests that while ND may provide some benefit in specific scenarios, its effectiveness is not consistent and could be highly dependent on the task or data characteristics. In contrast, when combining two or three augmentation methods, the improvement in model performance was not as significant as expected. Although combining different augmentations theoretically expands the feature space, the performance gain appears to plateau when fewer than four augmentations are used. This indicates that two or three augmentations may not introduce enough diversity to fully capture the range of useful variations in the data. The model that uses all four multi-scale graph augmentations, however, demonstrated the best performance in prediction tasks, either outperforming or matching the performance of models with fewer augmentations. This suggests that a combination of multiple augmentations results in a more robust and generalized model, capable of handling a wider variety of scenarios. The inclusion of all four augmentations likely introduces a diverse set of transformations, enabling the model to learn better representations of the data.
Fig. 8.
Impact of molecular graph augmentations and co-attention mechanism on Multi-MoleScale model performance in the molecular property benchmarking datasets for (A, C) classification tasks and (B, D) regression tasks. The ‘Composition of four augmentations’ bar in (A, B) and the ‘co-attention’ bar in (C, D) represent the original model
However, in the case of the ESOL dataset, the combination of all four multi-scale graph augmentation methods underperformed. This is likely because aqueous solubility is highly sensitive to molecular structure: even slight topological changes can lead to substantial property variations. Therefore, while a diversified augmentation strategy can generally enhance model performance, its impact may be limited when the target property is highly sensitive to structural changes. Nevertheless, for datasets with complex features and less sensitivity to topological variations, leveraging the complementary nature of different multi-scale graph augmentation methods significantly improves the model’s ability to capture and represent data, thereby boosting the efficacy of contrastive learning.
We analyzed the molecular representations learned through graph pre-training with the four multi-scale graph augmentation methods using the t-SNE [61] embedding algorithm, which maps similar molecular representations to adjacent points in two-dimensional space. As shown in Fig. 9, we collected 2045 small molecules from the ChEMBL database [62] and embedded their representations into two-dimensional space using t-SNE. The graph pre-training method learns similar representations for molecules with comparable topological structures or functional groups, suggesting that, even in the absence of labels, the model captures the intrinsic relationships between molecules, as those with similar properties tend to exhibit similar features.
Fig. 9.
t-SNE visualization of molecular representations obtained from graph pre-training integrated with four multi-scale graph augmentation methods, derived from a random sample of 2045 small molecules additionally collected from the ChEMBL database
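This analysis can be reproduced with scikit-learn’s t-SNE; the random array below is a placeholder for the actual pre-trained graph embeddings.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(2045, 256)   # placeholder for learned representations
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=4)
plt.title("t-SNE of pre-trained molecular representations")
plt.show()
```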
Impact of co-attention on model performance
To assess whether co-attention enhances the accuracy of molecular property prediction, models with and without co-attention were compared under the same graph and sequence embedding settings; without co-attention, the graph and sequence embeddings were simply concatenated. As seen in Fig. 8c, d, models with co-attention outperformed those without: performance improved by about 2% in the classification tasks and by about 16.7% in the regression tasks. These results suggest that the co-attention mechanism significantly enhances the model’s representational capacity by simultaneously focusing on structural (graph-related) and semantic (sequence-related) information. By establishing a close connection between graph data and sequence data, this dual focus enables the model to better understand the intrinsic relationships within molecules and capture finer-grained features, yielding richer, more accurate, and more discriminative molecular representations. As a result, the model generalizes better across tasks and data distributions, with markedly improved predictive accuracy, robustness, and flexibility on downstream tasks.
Interpretability analysis of model representations
To further explore Multi-MoleScale and improve the interpretability of its predictions, we visualize the learned molecular representations from the BBBP and ESOL datasets, following Zhu et al. [63].
As shown in Fig. 10 for the BBBP dataset (a binary classification task), the molecules tend to cluster into two distinct groups. This separation corresponds with the task’s objective of distinguishing molecules based on their ability to penetrate the blood–brain barrier. The observed clustering pattern indicates that Multi-MoleScale may capture structural features relevant to blood–brain barrier permeability, providing some basis for understanding the model’s classification behavior.
Fig. 10.
Molecular representation visualization on BBBP and ESOL datasets
In the case of the ESOL dataset (a regression task predicting aqueous solubility), the molecular representations show a gradual distribution pattern. Molecules are arranged along a gradient from regions associated with lower solubility to those linked to higher solubility. This continuous pattern is consistent with the nature of solubility as a continuous property, suggesting that Multi-MoleScale’s latent representations are sensitive to subtle structural variations affecting solubility.
Taken together, these visualizations suggest that Multi-MoleScale can extract meaningful latent representations related to molecular properties, including the binary permeability classification in BBBP and continuous solubility prediction in ESOL. Although only two datasets were visualized, the distinct patterns observed (clustering for classification and gradation for regression) imply that the model captures features pertinent to each task, offering preliminary evidence of its interpretability.
To enhance the statistical robustness of our results—particularly given the dependence of deep learning models on initial weight initialization—we performed 10 independent experiments with distinct random seeds. These experiments enabled the calculation of standard deviations for all performance metrics (Supplementary Table S7), which reflect variations arising from random initialization and data partitioning. The generally low standard deviations (most < 0.05) demonstrate stable performance across runs.
Furthermore, paired t-tests on the 10 replicate results confirmed statistical significance: for instance, Multi-MoleScale’s AUC on the BBBP dataset (0.956 ± 0.004) significantly outperformed that of GCN (0.922 ± 0.011, p < 0.01) and FG-BERT (0.702 ± 0.009, p < 0.001). Supplementary Table S7 further includes diverse metrics (Precision, F1 score, PR-AUC), with consistent trends across these metrics validating the reliability of our results. These analyses reinforce confidence that the observed performance improvements are not coincidental, thereby enhancing the rigor of our findings.
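Such a paired comparison can be computed with SciPy as sketched below; the per-run AUC lists are illustrative placeholders on the scale of the reported means and standard deviations, not the actual experimental values.

```python
from scipy.stats import ttest_rel

# Illustrative placeholders for the 10 paired runs (same seeds/splits per model).
auc_multi_molescale = [0.956, 0.951, 0.958, 0.953, 0.960,
                       0.955, 0.952, 0.957, 0.954, 0.959]
auc_gcn = [0.922, 0.915, 0.930, 0.918, 0.925,
           0.921, 0.910, 0.928, 0.919, 0.924]

t_stat, p_value = ttest_rel(auc_multi_molescale, auc_gcn)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```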
To more intuitively assess the contributions of individual atoms to the molecular representations learned by the graph contrastive learning model, we performed systematic perturbations of the atoms and bonds in the target molecule and quantified their effects. For a given molecule, such perturbation analysis can quantitatively identify which chemical fragments contribute most to property prediction. As shown in Fig. 11a, the two chlorine atoms exhibited markedly higher importance upon perturbation, which may be attributed to chlorine’s strong electron‑withdrawing character and its influence on molecular polarity. In Fig. 11b, darker regions exert greater influence on the blood–brain barrier permeability (BBBP) prediction, whereas lighter regions have smaller effects. Most substructures of this compound are hydrophobic, which favors BBB permeation; notably, the phenyl ring (C7–C12) shows the lowest polarity and contributes most to BBB permeation. ClogP values for each fragment, computed using ChemBioDraw, indicate that the phenyl ring has a higher ClogP, thereby corroborating the perturbation analysis and supporting the interpretability of the model.
Fig. 11.
Analysis of Molecular Fragment Importance Based on Atom/Bond Perturbation
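The exact perturbation scheme is not spelled out in the text; a simple leave-one-atom-out sketch that captures the idea, with predict standing in for the trained model’s property score, is:

```python
import numpy as np

def atom_importance(X, A, predict):
    """Score each atom by the absolute change in the model's prediction
    when that atom (and its incident bonds) is removed."""
    base = predict(X, A)
    scores = []
    for i in range(X.shape[0]):
        keep = np.arange(X.shape[0]) != i                  # leave atom i out
        scores.append(abs(base - predict(X[keep], A[np.ix_(keep, keep)])))
    return np.array(scores)                                # larger = more important

# Toy usage with a stand-in "model": degree-weighted mean of node features.
toy_predict = lambda X, A: float((X.mean(axis=1) * A.sum(axis=1)).mean())
rng = np.random.default_rng(0)
X = rng.random((6, 4))
A = (rng.random((6, 6)) > 0.6).astype(float)
A = np.triu(A, 1) + np.triu(A, 1).T                        # symmetric, zero diagonal
print(atom_importance(X, A, toy_predict))
```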
Conclusion
In the field of molecular property prediction, the scarcity of labeled data presents a significant challenge. To address this, we propose a novel self-supervised learning method called Multi-MoleScale, which integrates graph-based and sequence-based learning techniques for molecular property prediction. For graph learning, Multi-MoleScale employs GCL to capture structural information from molecular graphs. Through a multi-scale learning strategy, GCL simultaneously learns both local and global graph features, yielding a more detailed and accurate representation of molecular structures. For sequence learning, BERT is applied to molecular SMILES sequences to extract embedded chemical information. Additionally, Multi-MoleScale introduces a co-attention mechanism to establish stronger connections between graph and sequence representations, facilitating better integration of structural and sequential information. This co-attention mechanism captures complex interactions within molecules, thereby enhancing the accuracy of molecular property prediction.
Our Multi-MoleScale architecture synergistically combines the strengths of GCL and BERT through the co-attention mechanism, effectively integrating molecular graph structures with sequence information. This strategy facilitates the fusion of information between graph and sequence data, thereby enhancing the model’s expressiveness. Furthermore, the chemical context embedded in the sequence is utilized to refine the understanding of molecular properties, leading to improved accuracy and robustness of predictions. The evaluation on three collections of benchmark datasets demonstrates that Multi-MoleScale outperforms state-of-the-art supervised learning algorithms, graph-based deep learning models, and self-supervised pre-training methods, exhibiting superior capabilities in molecular classification and regression tasks.
A promising avenue for future work involves the integration of molecular scaffold information with functional group data. Molecular scaffolds are pivotal in determining the overall shape and rigidity of molecules, factors that are crucial for predicting properties such as bioactivity and pharmacokinetics. By developing specialized encoding mechanisms for scaffolds within the graph representation, the model can more effectively capture these fundamental structural features. Another promising direction is fine-tuning the model with domain-specific datasets. Although Multi-MoleScale has been pre-trained on large-scale, general datasets, adapting the model for specific molecular classes—such as kinase inhibitors or antibiotics—could improve prediction accuracy within these domains. Techniques such as transfer learning and semi-supervised learning could help mitigate the challenges posed by smaller, less diverse datasets, thereby enhancing the model’s performance in specialized applications.
Supplementary Information
Author contributions
SWI.S. conceived the idea. X.L. and J.C. were involved in the data collection process, implementation, and experimentation. X.L., J.C., and SWI.S. were involved in the writing of the manuscript and in the interpretation of the results. All authors read and approved the final manuscript.
Funding
This project is supported by Macao Polytechnic University under grant RP/FCA-06/2024. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. J.C. was the recipient of a Macao Polytechnic University graduate scholarship. This work is part of the thesis work of X.L. (submission number s/c fca.2940.e12f.3).
Data availability
The experimental code and data are provided at https://github.com/pdssunny/Multi-MoleScale.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Wu D, Li Y, Zheng L et al (2023) Small molecules targeting protein–protein interactions for cancer therapy. Acta Pharm Sin B 13(10):4060–4088
- 2. Odoemelam CS, Percival B, Wallis H et al (2020) G-protein coupled receptors: structure and function in drug discovery. RSC Adv 10(60):36337–36348
- 3. Brown DG, Wobst HJ (2021) A decade of FDA-approved drugs (2010–2019): trends and future directions. J Med Chem 64(5):2312–2338
- 4. Liu Z, Hu M, Yang Y et al (2022) An overview of PROTACs: a promising drug discovery paradigm. Mol Biomed 3(1):46
- 5. Elbadawi M, Gaisford S, Basit AW (2021) Advanced machine-learning techniques in drug discovery. Drug Discov Today 26(3):769–777
- 6. Wu Z, Zhu M, Kang Y et al (2021) Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets. Brief Bioinform 22(4):bbaa321
- 7. Tian Y, Wang X, Yao X et al (2023) Predicting molecular properties based on the interpretable graph neural network with multistep focus mechanism. Brief Bioinform 24(1):bbac534
- 8. Xia J, Zhang L, Zhu X et al (2023) Understanding the limitations of deep models for molecular property prediction: insights and solutions. Adv Neural Inf Process Syst 36:64774–64792
- 9. Deng D, Chen X, Zhang R et al (2021) XGraphBoost: extracting graph neural network-based features for a better prediction of molecular properties. J Chem Inf Model 61(6):2697–2705
- 10. Li Y, Hsieh CY, Lu R et al (2022) An adaptive graph learning method for automated molecular interactions and properties predictions. Nat Mach Intell 4(7):645–651
- 11. Wang S, Guo Y, Wang Y et al (2019) SMILES-BERT: large scale unsupervised pre-training for molecular property prediction. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp 429–436
- 12. Zeng X, Xiang H, Yu L et al (2022) Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nat Mach Intell 4(11):1004–1016
- 13. Feng YH, Zhang SW (2022) Prediction of drug-drug interaction using an attention-based graph neural network on drug molecular graphs. Molecules 27(9):3004
- 14. Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. 10.48550/arXiv.1609.02907
- 15. Gilmer J, Schoenholz SS, Riley PF et al (2017) Neural message passing for quantum chemistry. In: International Conference on Machine Learning. PMLR, pp 1263–1272
- 16. Schütt KT, Sauceda HE, Kindermans PJ et al (2018) SchNet: a deep learning architecture for molecules and materials. J Chem Phys. 10.1063/1.5019779
- 17. Xiong Z, Wang D, Liu X et al (2019) Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem 63(16):8749–8760. 10.1021/acs.jmedchem.9b00959
- 18. Wu Z, Jiang D, Hsieh CY et al (2021) Hyperbolic relational graph convolution networks plus: a simple but highly efficient QSAR-modeling method. Brief Bioinform 22(5):bbab112
- 19. Hu W, Liu B, Gomes J et al (2019) Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265
- 20. Wu Z, Jiang D, Wang J et al (2022) Knowledge-based BERT: a method to extract molecular features like computational chemists. Brief Bioinform 23(3):bbac131
- 21. Xia J, Zhao C, Hu B et al (2023) Mole-BERT: rethinking pre-training graph neural networks for molecules. In: ICLR
- 22. Li B, Lin M, Chen T et al (2023) FG-BERT: a generalized and self-supervised functional group-based molecular representation learning framework for properties prediction. Brief Bioinform 24(6):bbad398
- 23. Wu Z, Pan S, Chen F et al (2020) A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 32(1):4–24
- 24. You Y, Chen T, Sui Y et al (2020) Graph contrastive learning with augmentations. Adv Neural Inf Process Syst 33:5812–5823
- 25. Wang Y, Wang J, Cao Z et al (2022) Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell 4(3):279–287
- 26. Liu Y, Zhang X, Zhang Q et al (2021) Dual self-attention with co-attention networks for visual question answering. Pattern Recognit 117:107956
- 27. Jiang W, Wang W, Hu H (2021) Bi-directional co-attention network for image captioning. ACM Trans Multimedia Comput Commun Appl 17(4):1–20
- 28. Srinivas SS, Runkana V (2024) Cross-modal learning for chemistry property prediction: large language models meet graph machine learning. arXiv preprint arXiv:2408.14964
- 29. Liu P, Ren Y, Tao J et al (2024) GIT-Mol: a multi-modal large language model for molecular science with graph, image, and text. Comput Biol Med 171:108073. 10.1016/j.compbiomed.2024.108073
- 30. Bongini P, Bianchini M, Scarselli F (2021) Molecular generative graph neural networks for drug discovery. Neurocomputing 450:242–252
- 31. Nguyen T, Nguyen GTT, Nguyen T et al (2021) Graph convolutional networks for drug response prediction. IEEE ACM Trans Comput Biol Bioinform 19(1):146–154
- 32. Rong Y, Bian Y, Xu T et al (2020) Self-supervised graph transformer on large-scale molecular data. Adv Neural Inf Process Syst 33:12559–12571
- 33. You J, Ying Z, Leskovec J (2020) Design space for graph neural networks. Adv Neural Inf Process Syst 33:17009–17021
- 34. Li QZ, Zou WL, Yu ZY et al (2024) Remote site-selective arene C-H functionalization enabled by N-heterocyclic carbene organocatalysis. Nat Catal 7(8):900–911
- 35. Han S, Fu H, Wu Y et al (2023) HimGNN: a novel hierarchical molecular graph representation learning framework for property prediction. Brief Bioinform 24(5):bbad305
- 36. Li S, Zhou J, Xu T et al (2022) GeomGCL: geometric graph contrastive learning for molecular property prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 36, pp 4541–4549
- 37. Xia J, Wu L, Chen J et al (2022) SimGRACE: a simple framework for graph contrastive learning without data augmentation. In: Proceedings of the ACM Web Conference 2022, pp 1070–1079
- 38. Gaulton A, Bellis LJ, Bento AP et al (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(D1):D1100–D1107
- 39. Kotkondawar RR, Sutar SR, Kiwelekar AW et al (2024) Integrating transformer-based language model for drug discovery. In: 2024 11th International Conference on Computing for Sustainable Global Development (INDIACom). IEEE, pp 1096–1101
- 40. Méndez-Lucio O, Nicolaou C, Earnshaw B (2022) MolE: a molecular foundation model for drug discovery. arXiv preprint arXiv:2211.02657
- 41. Wen N, Liu G, Zhang J et al (2022) A fingerprints based molecular property prediction method using the BERT model. J Cheminform 14(1):71
- 42. Liu Y, Zhang R, Li T et al (2023) MolRoPE-BERT: an enhanced molecular representation with rotary position embedding for molecular property prediction. J Mol Graph Model 118:108344. 10.1016/j.jmgm.2022.108344
- 43. Chithrananda S, Grand G, Ramsundar B (2020) ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885
- 44. Hu S, Cai W, Gao T et al (2021) Robust wave-feature adaptive heartbeat classification based on self-attention mechanism using a transformer model. Physiol Meas 42(12):125001. 10.1088/1361-6579/ac3e88
- 45. Wu Z, Ramsundar B, Feinberg EN et al (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9(2):513–530
- 46. Delaney JS (2004) ESOL: estimating aqueous solubility directly from molecular structure. J Chem Inf Comput Sci 44(3):1000–1005
- 47. Mobley DL, Guthrie JP (2014) FreeSolv: a database of experimental and calculated hydration free energies, with input files. J Comput Aided Mol Des 28:711–720
- 48. Mendez D, Gaulton A, Bento AP et al (2019) ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47(D1):D930–D940
- 49. Kato N, Comer E, Sakata-Kato T et al (2016) Diversity-oriented synthesis yields novel multistage antimalarial inhibitors. Nature 538(7625):344–349
- 50. Rohrer SG, Baumann K (2009) Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. J Chem Inf Model 49(2):169–184
- 51. NIH/NCI (2017) AIDS antiviral screen data
- 52. Subramanian G, Ramsundar B, Pande V et al (2016) Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. J Chem Inf Model 56(10):1936–1949
- 53. Martins IF, Teixeira AL, Pinheiro L et al (2012) A Bayesian approach to in silico blood-brain barrier penetration modeling. J Chem Inf Model 52(6):1686–1697
- 54. Born J, Markert G, Janakarajan N et al (2023) Chemical representation learning for toxicity prediction. Digit Discov 2(3):674–691
- 55. Kuhn M, Letunic I, Jensen LJ et al (2016) The SIDER database of drugs and side effects. Nucleic Acids Res 44(D1):D1075–D1079
- 56. Gayvert KM, Madhukar NS, Elemento O (2016) A data-driven approach to predicting successes and failures of clinical trials. Cell Chem Biol 23(10):1294–1301
- 57. Xiong G, Wu Z, Yi J et al (2021) ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Res 49(W1):W5–W14
- 58. He S, Zhao D, Ling Y et al (2021) Machine learning enables accurate and rapid prediction of active molecules against breast cancer cells. Front Pharmacol 12:796534. 10.3389/fphar.2021.796534
- 59. Stärk H, Beaini D, Corso G et al (2022) 3D Infomax improves GNNs for molecular property prediction. In: International Conference on Machine Learning. PMLR, pp 20479–20502
- 60. Wang X, Zhao H, Tu W et al (2023) Automated 3D pre-training for molecular property prediction. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp 2419–2430
- 61. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
- 62. Zdrazil B, Felix E, Hunter F et al (2024) The ChEMBL database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res 52(D1):D1180–D1192
- 63. Zhu W, Zhang Y, Zhao D et al (2022) HiGNN: a hierarchical informative graph neural network for molecular property prediction equipped with feature-wise attention. J Chem Inf Model 63(1):43–55. 10.1021/acs.jcim.2c01099