Abstract
Absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are critical determinants of the pharmacokinetic and safety profiles of drug candidates. Accurate early-stage prediction of ADMET characteristics is essential for reducing late-stage attrition rates, lowering development costs, and accelerating the drug discovery process. Recent advances in deep learning have shown great promise in molecular property prediction, especially with the emergence of Transformer-based architectures that can effectively model long-range dependencies in molecular representations. However, most existing methods rely heavily on atom-level encodings (e.g. SMILES strings or molecular graphs), which often lack structural interpretability and generalization across heterogeneous tasks. Previously, we developed a de novo, flexible molecular representation framework named MSformer (available at https://github.com/ZJUFanLab/MSformer), which demonstrated success in bioactivity prediction. We have now adapted and specialized this architecture for ADMET property prediction. The adapted implementation, designated MSformer-ADMET, extends the framework’s capabilities to pharmacokinetic and toxicity endpoints while maintaining its flexible, fragmentation-based approach to molecular representation learning. MSformer-ADMET is fine-tuned on 22 tasks collected from the Therapeutics Data Commons (TDC), covering both classification and regression settings. Results demonstrate that MSformer-ADMET achieves superior performance across a wide range of ADMET endpoints, consistently outperforming conventional SMILES-based and graph-based models. Notably, we further conducted interpretability analyses by leveraging the model’s attention distributions and fragment-to-atom mappings, enabling the identification of key structural fragments that are highly associated with molecular properties.
This post hoc interpretability provides more transparent insights into the structure–property relationship. Collectively, results demonstrate that MSformer-ADMET is a highly effective and broadly applicable model for ADMET prediction.
Keywords: ADMET, transformer, deep learning, meta structure, natural products, drug discovery
Introduction
Drug development is fraught with high costs, risks, and failure rates [1–5]. On average, it takes over a decade and billions of dollars for a candidate compound to go from initial screening to market launch [6, 7]. Despite rigorous selection, over 90% of candidates fail in clinical trials, with many failures due to poor ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties [8–10].
The convergence of artificial intelligence and pharmaceutical sciences has revolutionized biomedical research [11, 12]. In drug discovery, graph convolutional networks (GCNs) and generative adversarial networks (GANs) now enable high-fidelity prediction of protein-ligand binding affinities and de novo design of molecules with optimized pharmacokinetic profiles, significantly shortening the preclinical development cycle [13]. Similarly, the interpretable deep learning (DL) architecture developed by Zhao et al. leverages gradient-weighted class activation mapping (Grad-CAM) to visualize critical protein-ligand interaction hotspots, bridging the “black box” gap between prediction and mechanistic insight [14]. Beyond molecular design, DL-driven systematic reviews have cataloged and analyzed more than 148 biomedical segmentation models, classifying them into two-stage detectors, single-stage detectors, and point-based approaches to address organ-specific diagnostic challenges [15]. In omics research, although DL faces hurdles in biological interpretability, it has outperformed traditional machine learning in preprocessing sparse single-cell RNA-seq data and identifying rare cell populations critical for understanding tumor heterogeneity [16, 17].
Concurrently, the therapeutic efficacy of drugs is closely dependent on their safety profiles. Common risks such as heavy metal contamination and toxic metabolites (e.g. aristolochic acid I) can lead to treatment failure and even organ damage [18–20]. To address these challenges, multimodal deep learning frameworks have emerged as a new paradigm for safety assessment by enabling efficient prediction of multidrug adverse effects. For example, Masumshah et al. proposed the DPSP framework, which integrates five-dimensional drug features with a lightweight neural network architecture. This model demonstrated superior predictive performance, and its interpretability analysis confirmed that pathway-level features are critical for identifying toxicity mechanisms [21]. Similarly, the NNPS model combines single-drug side effect data with drug-protein interaction information, employing PCA for dimensionality reduction to reduce training time from 15 days to just 8 h. It achieved significantly better performance than mainstream models in predicting 964 high-risk adverse events, including malignant hypertension [22]. These advances in computational toxicity screening, as exemplified by DPSP and NNPS, directly catalyze the evolution of ADMET prediction frameworks, shifting from reactive risk assessment to proactive safety-by-design paradigms.
To reduce research and development (R&D) risks and enhance drug screening efficiency, computational prediction methods for ADMET properties have been advancing [23–26]. Traditional quantitative structure–activity relationship (QSAR) methods, relying on manual features and expert knowledge, have achieved limited success in some tasks. However, they struggle to handle increasingly complex drug structures and diverse toxicological tasks [27]. With the broad application of deep learning in molecular modeling, models based on graph neural networks (GNNs) and Transformers have become mainstream [28–33]. GNNs, through their message-passing mechanisms between nodes and edges, effectively model local molecular structures and have performed well in various ADMET prediction tasks [34, 35]. However, graph-based models such as GCN [36] and Chemprop [37], while strong in capturing localized interactions, are inherently limited in their ability to model long-range dependencies due to their reliance on local connectivity, which restricts their capacity to represent global chemical context. Similarly, NeuralFP, although proposed for out-of-distribution detection, is not optimized for multitask learning or property-specific molecular representations, which are essential in comprehensive ADMET modeling [38]. Morgan + MLP, which combines handcrafted circular fingerprints with shallow feedforward networks, lacks the flexibility of deep architectures and may underperform in capturing complex molecular patterns in an end-to-end fashion [39].
In contrast, the Transformer architecture, leveraging its self-attention mechanism, directly models relationships between any pair of atoms and can adequately capture long-range dependencies and global semantics within molecules. In recent years, it has enabled advances in molecular representation learning [40–43]. However, when dealing with small-molecule structures, Transformers also encounter challenges such as input granularity selection, structural normalization, and modeling efficiency. Moreover, most existing deep-learning methods model molecules as a whole, either as atomic graphs or SMILES sequences. This approach makes it difficult to capture fragment-level information about molecules during complex processes in biological environments, such as dissociation, metabolism, and structural rearrangement [44–46].
Recently developed Transformer-based and hybrid models attempt to address some of these limitations. For instance, SPMM employs a multimodal strategy by integrating SMILES strings with predefined molecular property vectors. While this approach introduces auxiliary descriptors, it may insufficiently reflect spatial or contextual features, limiting its generalizability to diverse structure–function relationships [47]. Mamba, built upon self-supervised sequence models, depends on sequential encoders and may underperform in scenarios requiring explicit structural reasoning, particularly for molecules with intricate or cyclic topologies [48]. CFA, a fusion-based ensemble architecture, aggregates predictions from multiple models, achieving competitive performance. However, it relies heavily on the quality and diversity of its base learners and often suffers from high computational cost and limited interpretability [49]. HFST, which integrates SMILES sequences with learned fragment tokens, introduces a promising representation paradigm but still faces challenges in effectively fusing heterogeneous molecular information, potentially leading to trade-offs in stability and interpretability across different tasks [50].
Against this backdrop, we present MSformer-ADMET (see Fig. 1), an ADMET prediction pipeline built upon MSformer (https://github.com/ZJUFanLab/MSformer). Leveraging a curated fragment library derived from natural product structures, MSformer-ADMET is pretrained to capture context-dependent relationships among structural fragments, thereby enabling more nuanced molecular representation learning. Subsequently, the model is fine-tuned and systematically evaluated on 22 ADMET-related tasks sourced from the Therapeutics Data Commons (TDC), encompassing both classification and regression settings. Experimental results demonstrate that MSformer-ADMET outperforms baselines across multiple endpoints, exhibiting superior multitask predictive performance, structural interpretability, and prediction robustness.
Figure 1.
Illustration of MSformer-ADMET. This model adopts the MSformer architecture, which first constructs all possible meta-structures of a molecule. These meta-structures are then encoded into computable vector sets, merged through global averaging, and finally connected to an MLP (multilayer perceptron) layer for output, which is used for ADMET (absorption, distribution, metabolism, excretion, and toxicity) task prediction.
Material and methods
Architectural overview of MSformer-ADMET
In this work, we present MSformer-ADMET, a novel molecular representation architecture specifically optimized for ADMET property prediction. Building upon the original MSformer framework, our model integrates three key functional modules: (i) a meta-structure encoder for capturing hierarchical chemical patterns, (ii) a structural feature extractor for learning discriminative molecular representations, and (iii) a multilayer perceptron (MLP) classifier for endpoint-specific predictions (https://github.com/ZJUFanLab/MSformer). Unlike traditional language models that rely on character- or token-level frequency patterns, the underlying model of MSformer-ADMET adopts interpretable fragments as its fundamental modeling units, introducing chemically meaningful structural representations at the input level. Each fragment is treated as a representative of a local structural motif, and their combinations collectively capture the global conformational characteristics of the molecule.
For downstream property prediction tasks, MSformer-ADMET first converts each query molecule into a set of corresponding meta-structures. These fragments are then encoded into fixed-length embeddings using a pretrained encoder. This encoding process enables molecular-level structural alignment, allowing the model to represent diverse molecules in a shared vector space. The resulting structural embeddings are passed into a feature extraction module, which refines task-specific semantic information and feeds it into an MLP-based classifier, enabling an end-to-end prediction workflow. To effectively integrate contributions from different fragments, global average pooling (GAP) is applied to aggregate fragment-level features into molecule-level representations. Furthermore, MSformer-ADMET incorporates a multihead parallel MLP structure to support multitask learning.
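The forward pass described above can be sketched in plain NumPy as below. This is a simplified illustration only: the embedding dimensions, head sizes, endpoint names, and weights are all placeholders, not the actual MSformer-ADMET implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pretrained encoder output: one fixed-length embedding per
# meta-structure fragment. Sizes here are illustrative, not MSformer-ADMET's.
n_fragments, embed_dim = 12, 64
fragment_embeddings = rng.normal(size=(n_fragments, embed_dim))

# Global average pooling (GAP): aggregate fragment-level features into a
# single molecule-level representation.
molecule_vec = fragment_embeddings.mean(axis=0)  # shape: (embed_dim,)

def mlp_head(x, w1, b1, w2, b2):
    """A small two-layer MLP head with a ReLU hidden layer."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

# Multihead parallel MLPs: one endpoint-specific head per ADMET task, all
# sharing the pooled molecular vector. Weights are random placeholders and
# the endpoint names are hypothetical.
hidden_dim = 32
heads = {
    task: (rng.normal(size=(embed_dim, hidden_dim)), np.zeros(hidden_dim),
           rng.normal(size=(hidden_dim, 1)), np.zeros(1))
    for task in ("BBB", "hERG", "Caco2")
}
predictions = {task: float(mlp_head(molecule_vec, *params)[0])
               for task, params in heads.items()}
```

Because all heads read the same pooled vector, adding a new endpoint only requires adding a head, which is what makes the multitask setup lightweight.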
Pretraining-finetuning strategy
To fully exploit the structural diversity of natural products and construct a generalizable, information-rich molecular representation for ADMET prediction, we adapted the previously published MSformer architecture into a dedicated framework termed MSformer-ADMET. MSformer-ADMET therefore inherits weights pretrained on a corpus of 234 million representative structures. In this work, we focus on reconfiguring and fine-tuning this framework for comprehensive ADMET property modeling.
Each of the 22 datasets from the TDC [51] is first subjected to the same meta-structure fragmentation, after which the pretrained encoder is used to generate molecular embeddings. These embeddings are then aggregated via GAP and subsequently passed through a task-specific feature extraction module and MLP classifier for property prediction. The fine-tuning strategy is adapted to the nature of each task. MSformer-ADMET employs a multihead MLP output layer to enable simultaneous modeling of multiple ADMET endpoints, with shared encoder weights supporting efficient cross-task transfer learning. By combining a structure-aware pretraining mechanism with a task-oriented fine-tuning pipeline, MSformer-ADMET achieves efficient, robust, and interpretable molecular property predictions, demonstrating strong adaptability and scalability across diverse and complex ADMET scenarios.
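Since the fine-tuning objective adapts to each endpoint's task type, the dispatch might be sketched as below. The function name and the specific objectives (binary cross-entropy for classification, mean squared error for regression) are illustrative assumptions; the text does not spell out the published loss functions.

```python
import numpy as np

def task_loss(y_true, y_pred, task_type):
    """Pick a fine-tuning objective by endpoint type: binary cross-entropy
    (on logits) for classification endpoints, mean squared error for
    regression ones. A minimal sketch; the actual MSformer-ADMET objectives
    may differ."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if task_type == "classification":
        p = 1.0 / (1.0 + np.exp(-y_pred))        # logits -> probabilities
        eps = 1e-12                              # numerical safety
        return float(-np.mean(y_true * np.log(p + eps)
                              + (1.0 - y_true) * np.log(1.0 - p + eps)))
    return float(np.mean((y_true - y_pred) ** 2))  # regression: MSE
```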
Ablation studies
To systematically assess the contribution of key architectural components in MSformer-ADMET, we conducted four sets of ablation experiments covering the meta-structure encoding strategy, the pretraining initialization, the fragment-level pooling method, and the attention mechanism type. Each ablation was conducted under consistent experimental conditions, with only the target component altered, and all models were trained five times with different random seeds to ensure robustness and enable statistical comparison.
(i) Meta-structure encoding: to evaluate the benefit of incorporating fragment-level meta-structure representations, we replaced the input fragment embeddings with vanilla SMILES-level embeddings while keeping all other model components and training configurations unchanged.
(ii) Pretraining initialization: to validate the effectiveness of the pretraining stage, we removed pretrained weights and instead randomly initialized both the Transformer encoder and downstream network parameters.
(iii) Pooling strategy: given the concern that GAP may dilute substructure-specific signals by assigning equal importance to all fragments, we conducted an ablation study by replacing GAP with a learnable attention pooling mechanism.
(iv) Attention type: to evaluate the role of the full self-attention mechanism, we replaced it with a linear attention configuration (attention_type = linear).
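For ablation (iii), the difference between GAP and a learnable attention pooling can be sketched as below. The query-vector formulation is one common attention-pooling variant and is an assumption, since the paper does not specify the exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)
# Fragment embeddings for one molecule (sizes are illustrative only).
frag_emb = rng.normal(size=(10, 64))

# Baseline: global average pooling (GAP) weighs every fragment equally.
gap_vec = frag_emb.mean(axis=0)

# Ablation (iii): learnable attention pooling. A trainable query vector
# (here a random placeholder) scores each fragment; softmax weights then
# form a weighted sum, letting informative substructures dominate.
query = rng.normal(size=64)
scores = frag_emb @ query
weights = np.exp(scores - scores.max())          # numerically stable softmax
weights = weights / weights.sum()
attn_vec = weights @ frag_emb                    # weighted molecule vector
```

Under GAP every fragment contributes 1/10 of the molecule vector; under attention pooling the contribution of each fragment is learned, which is exactly the dilution concern the ablation tests.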
Results
Introduction to the dataset
To systematically evaluate the performance of the proposed MSformer-ADMET model in ADMET prediction tasks, we utilized two distinct levels of data sources. The first is a set of standardized downstream ADMET datasets obtained from the TDC platform; the second is a large-scale molecular library derived from the COCONUT database, from which fragments were generated using MassKG [52, 53], supporting the pretraining phase of MSformer-ADMET. TDC is an open and structured benchmarking framework that spans a wide range of critical tasks in the drug discovery pipeline. It is widely used for validating new computational methods and enabling fair comparisons across models. In this study, we selected 22 tasks from TDC that are directly related to ADME properties and toxicity prediction. These tasks cover the following subcategories: Absorption, including Caco2 (cell permeability), PAMPA permeability, P-glycoprotein (Pgp) inhibition, lipophilicity, solubility, and hydration free energy (Hyfr); Distribution, including blood–brain barrier (BBB) penetration and plasma protein binding rate (PPBR); Metabolism, focusing on metabolic stability and substrate/inhibitor classification of cytochrome P450 (CYP450) enzyme isoforms; Excretion, particularly renal clearance-related endpoints such as half-life and clearance; Toxicity, involving both acute and chronic indicators such as hERG channel blockade, AMES mutagenicity, drug-induced liver injury (DILI), and Ld50. These datasets vary in sample size, ranging from several hundred to tens of thousands of molecules, and offer high-quality annotations with diverse distributions (refer to Table 1). Collectively, they provide a robust foundation for evaluating the real-world applicability and stability of MSformer-ADMET in ADMET prediction scenarios.
Table 1.
The statistical results of the datasets for modeling
| Category | Property | Total | Positive | Negative | Train | Valid | Test |
|---|---|---|---|---|---|---|---|
| Absorption | Caco2 | 910 | - | - | 728 | 91 | 91 |
| | Panc | 2035 | 1740 | 295 | 1628 | 203 | 204 |
| | Pbrocc | 1219 | 651 | 568 | 975 | 122 | 122 |
| | Lipoas | 4200 | - | - | 3360 | 420 | 420 |
| | Soaq | 9982 | - | - | 7985 | 998 | 999 |
| | Hyfr | 642 | - | - | 513 | 64 | 65 |
| Distribution | Bbbm | 2039 | 1560 | 479 | 1631 | 204 | 204 |
| | Ppaz | 2828 | - | - | 2262 | 283 | 283 |
| Metabolism | Cyb2c19 | 12 665 | 5819 | 6846 | 10 132 | 1266 | 1267 |
| | Cyb2d6 | 13 130 | 2514 | 10 616 | 10 504 | 1313 | 1313 |
| | Cyb3a4 | 12 328 | 5110 | 7218 | 9862 | 1233 | 1233 |
| | Cyb1a2 | 12 579 | 5829 | 6750 | 10 063 | 1258 | 1258 |
| | Cyb2c9 | 12 092 | 4045 | 8047 | 9673 | 1209 | 1210 |
| | Cyb2c9sub | 669 | 141 | 528 | 535 | 67 | 67 |
| | Cyb2d6sub | 667 | 191 | 476 | 533 | 67 | 67 |
| | Cyb3a4sub | 670 | 355 | 315 | 536 | 67 | 67 |
| Excretion | Halo | 667 | - | - | 534 | 66 | 67 |
| | Clhe | 1213 | - | - | 970 | 121 | 122 |
| Toxicity | hERG | 655 | 451 | 204 | 524 | 65 | 66 |
| | DILI | 475 | 236 | 239 | 380 | 47 | 48 |
| | Ld50 | 7385 | - | - | 5908 | 738 | 739 |
| | AMES | 7278 | 3974 | 3304 | 5822 | 727 | 729 |
Table 1 presents the statistical results of ADMET-related datasets from the TDC website, covering absorption, distribution, metabolism, excretion, and toxicity. It lists dataset names, total sample counts, and the splits into training, validation, and test sets. Total counts indicate dataset size, while the splits reflect the partitions used for machine-learning modeling.
Evaluation indicators and experimental settings
To ensure fairness and reproducibility in model evaluation, all ADMET prediction tasks strictly followed the standardized experimental protocols provided by the TDC platform. Each dataset was randomly partitioned into training, validation, and test sets at a fixed ratio of 8:1:1. To comprehensively assess modeling capability across diverse task types, we employed a suite of standardized evaluation metrics tailored to both classification and regression settings. The area under the receiver operating characteristic curve (AUROC), used for classification tasks, quantifies the model’s ability to distinguish between positive and negative samples. The relative mean absolute error (RMAE), applied in regression tasks, measures the relative deviation between predicted and true values, providing an interpretable estimate of prediction accuracy. The Spearman rank correlation coefficient, a nonparametric metric, evaluates the rank-order correlation between predicted and actual values, making it particularly suitable for datasets that do not exhibit linear relationships. Together, these metrics ensure a robust, multidimensional assessment of model performance across the various ADMET endpoints.
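The three metrics can be computed as sketched below with NumPy/SciPy on toy data. Note that the exact RMAE normalization is not given in the text, so the mean-absolute-target normalization used here is an assumption.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Toy labels/predictions, purely to illustrate the three metrics.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUROC via its rank-based (Mann-Whitney) formulation: the probability that
# a randomly chosen positive is scored above a randomly chosen negative.
ranks = rankdata(y_score)
n_pos = int(y_true.sum())
n_neg = len(y_true) - n_pos
auroc = (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y_reg_true = np.array([1.0, 2.0, 3.0, 4.0])
y_reg_pred = np.array([1.1, 1.8, 3.4, 3.9])

# RMAE: taken here as MAE normalized by the mean absolute target value;
# this normalization is an assumption, as the text gives no formula.
rmae = np.mean(np.abs(y_reg_true - y_reg_pred)) / np.mean(np.abs(y_reg_true))

# Spearman rank correlation between predicted and true values.
rho, _ = spearmanr(y_reg_true, y_reg_pred)
```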
We conducted a systematic hyperparameter sensitivity analysis to assess the stability and generalization ability of MSformer-ADMET under different architecture settings. We explored the following key hyperparameter dimensions: number of attention heads (2–12); number of Transformer hidden layers (2–12); batch size (8–32); dropout rate (0.1–0.3); maximum meta-structure count per molecule (200–1000); and attention type (full/linear). The experimental results demonstrate that variations in these parameters directly impact the model’s performance. Table 2 lists the optimal hyperparameters used during model training. During training, we introduced an early stopping strategy to prevent overfitting and improve training efficiency: if the validation loss shows no significant improvement over 20 consecutive epochs, training is automatically terminated and the model is rolled back to the checkpoint with the best validation performance. All experiments were replicated in a unified environment to ensure the robustness and reproducibility of the results.
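The early stopping rule just described (patience of 20 epochs, rollback to the best checkpoint) can be sketched as a small helper. This is an illustrative reimplementation, not the project's actual training code.

```python
class EarlyStopper:
    """Stop training once the validation loss has not improved for
    `patience` consecutive epochs, remembering the best epoch so the model
    can be rolled back to that checkpoint afterwards."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: a real training loop would checkpoint here.
            self.best_loss, self.best_epoch, self.bad_epochs = val_loss, epoch, 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

For example, with `patience=3` the loop stops after three consecutive non-improving epochs, and `best_epoch` identifies the checkpoint to restore.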
Table 2.
Optimal hyperparameter configurations used in MSformer-ADMET for regression and classification tasks
| Hyperparameter | Regression tasks | Classification tasks |
|---|---|---|
| attention_type | full | full |
| n_layer | 6 | 6 |
| n_head | 2 | 4 |
| dropout | 0.1 | 0.1 |
| batch_size | 32 | 8 |
| lr_start | 3e-5 | 3e-5 |
| maxfrags | 100 | 500 |
| max_epochs | 50 | 50 |
| desc_skip_connection | false | false |
| device | 2 GPUs | 2 GPUs |
| num_workers | 16 | 16 |
| num_classes | - | 2 |
Table 2 lists the optimal hyperparameters used during model training. All experiments were conducted using 2 GPUs in a unified computing environment. num_classes was set to two to reflect binary prediction endpoints. desc_skip_connection indicates whether descriptor-level residual connections were applied. All training runs used 16 CPU workers (num_workers = 16) for parallel data loading.
Overall performance comparison
To systematically evaluate the performance of MSformer-ADMET across the 22 tasks, we compared it with eight state-of-the-art models: Chemprop [37], GCN [36], NeuralFP [38], Morgan + MLP [39], SPMM [47], Mamba [48], CFA [49], and HFST [50].
MSformer-ADMET achieves strong performance across all five categories of TDC ADMET datasets (see Tables 3 and 4). Specifically, it ranked first in 11 of the 22 tasks and in the top two in 17. We further performed Wilcoxon signed-rank tests to statistically assess whether the performance improvements of MSformer-ADMET over each baseline model are significant across the 22 benchmark tasks. The results (summarized in Table 5) show that MSformer-ADMET significantly outperforms six of the eight baseline models (P < 0.05), namely Mamba, Morgan + MLP, GCN, NeuralFP, HFST, and Chemprop, confirming that the improvements are robust beyond random fluctuation. In addition, we conducted per-task model ranking and computed average ranks across all models. MSformer-ADMET achieved the best average rank (2.611), followed by CFA (3.111) and Chemprop (4.222), further confirming its superiority. These results indicate that MSformer-ADMET robustly captures the internal structural semantics of molecules. This combination of chemical prior knowledge and deep structural perception not only improves performance on ADMET prediction tasks but also offers a route to more interpretable and generalizable molecular representations, highlighting the importance of structural-hierarchy modeling in drug discovery.
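The statistical comparison pipeline (paired Wilcoxon signed-rank tests plus per-task average ranks) can be reproduced in miniature as follows, using made-up scores for three hypothetical models rather than the paper's actual results.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Toy per-task scores for three models on six tasks (made-up numbers used
# only to illustrate the procedure; the real scores are in Tables 3 and 4).
scores = {
    "Ours":     np.array([0.94, 0.87, 0.86, 0.88, 0.93, 0.85]),
    "Baseline": np.array([0.90, 0.84, 0.85, 0.82, 0.91, 0.80]),
    "Weaker":   np.array([0.86, 0.80, 0.79, 0.78, 0.88, 0.76]),
}

# Paired Wilcoxon signed-rank test: "Ours" vs each baseline across tasks.
tests = {name: wilcoxon(scores["Ours"], scores[name])
         for name in ("Baseline", "Weaker")}

# Average rank per model (rank 1 = best score on a task).
matrix = np.vstack(list(scores.values()))
per_task_ranks = np.column_stack(
    [rankdata(-matrix[:, j]) for j in range(matrix.shape[1])])
avg_ranks = dict(zip(scores, per_task_ranks.mean(axis=1)))
```

Averaging per-task ranks, as done for Table 5, rewards consistency across endpoints rather than occasional large wins on single tasks.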
Table 3.
Test performance of different models on eight regression benchmarks
| Datasets | Ours | SPMM | Mamba | Morgan + MLP | GCN | NeuralFP | CFA | HFST | Chemprop |
|---|---|---|---|---|---|---|---|---|---|
| Halo(S) | 0.360 ± 0.022 | 0.494 ± 0.900 | 0.247 ± 0.100 | 0.329 ± 0.083 | 0.239 ± 0.100 | 0.177 ± 0.165 | 0.576 ± 0.025 | 0.232 ± 0.072 | 0.265 ± 0.032 |
| Clhe(S) | 0.408 ± 0.087 | 0.436 ± 0.675 | 0.501 ± 0.049 | 0.492 ± 0.020 | 0.532 ± 0.033 | 0.529 ± 0.015 | 0.625 ± 0.012 | 0.585 ± 0.032 | 0.555 ± 0.022 |
| Lipoas(M) | 0.590 ± 0.001 | 0.670 ± 0.004 | 0.583 ± 0.020 | 0.701 ± 0.009 | 0.541 ± 0.011 | 0.563 ± 0.023 | 0.626 ± 0.013 | 0.867 ± 0.023 | 0.470 ± 0.009 |
| Caco2(M) | 0.400 ± 0.026 | 0.325 ± 0.010 | 0.438 ± 0.030 | 0.908 ± 0.060 | 0.599 ± 0.104 | 0.530 ± 0.102 | 0.335 ± 0.033 | 0.605 ± 0.081 | 0.344 ± 0.015 |
| Hyfr(M) | 1.480 ± 0.029 | 4.975 ± 0.041 | - | - | - | - | - | - | - |
| Soaq(M) | 0.760 ± 0.020 | 2.367 ± 0.008 | 0.819 ± 0.020 | 1.203 ± 0.019 | 0.907 ± 0.020 | 0.947 ± 0.016 | 0.939 ± 0.030 | 1.176 ± 0.038 | 0.829 ± 0.022 |
| Ppaz(M) | 7.290 ± 0.145 | 194.1 ± 0.107 | 9.371 ± 0.311 | 12.85 ± 0.362 | 10.19 ± 0.373 | 9.292 ± 0.384 | 8.680 ± 0.262 | 9.638 ± 1.014 | 7.788 ± 0.210 |
| Ld50(M) | 0.580 ± 0.026 | 0.426 ± 0.022 | 0.678 ± 0.012 | 0.649 ± 0.019 | 0.649 ± 0.026 | 0.667 ± 0.020 | 0.630 ± 0.012 | 0.781 ± 0.025 | 0.606 ± 0.024 |
Table 3 presents a comparison of the performance of various state-of-the-art algorithms on regression tasks. In the table, “S” in parentheses denotes the Spearman metric, where higher values indicate better performance, while “M” denotes the relative mean absolute error (RMAE), where lower values are better. For each task, the best-performing model is indicated in boldface. Results are reported as the mean ± standard deviation across five runs with different random seeds.
Table 4.
Test performance of different models on 14 classification benchmarks
| Dataset | Ours | SPMM | Mamba | Morgan + MLP | GCN | NeuralFP | CFA | HFST | Chemprop |
|---|---|---|---|---|---|---|---|---|---|
| Pbrocc | 0.942 ± 0.010 | 0.770 ± 0.010 | 0.930 ± 0.017 | 0.880 ± 0.006 | 0.895 ± 0.021 | 0.902 ± 0.020 | 0.928 ± 0.010 | 0.870 ± 0.018 | 0.860 ± 0.036 |
| Panc | 0.736 ± 0.059 | 0.752 ± 0.004 | - | - | - | - | - | - | - |
| Cyb3a4sub | 0.689 ± 0.024 | 0.544 ± 0.041 | 0.664 ± 0.027 | 0.633 ± 0.013 | 0.590 ± 0.023 | 0.578 ± 0.020 | 0.667 ± 0.019 | 0.571 ± 0.032 | 0.596 ± 0.018 |
| Cyb3a4 | 0.861 ± 0.009 | 0.866 ± 0.022 | 0.893 ± 0.012 | 0.827 ± 0.009 | 0.840 ± 0.010 | 0.849 ± 0.004 | 0.855 ± 0.004 | 0.666 ± 0.014 | 0.862 ± 0.003 |
| Cyb2d6sub | 0.871 ± 0.023 | 0.714 ± 0.051 | 0.748 ± 0.012 | 0.671 ± 0.066 | 0.617 ± 0.039 | 0.572 ± 0.062 | 0.704 ± 0.015 | 0.501 ± 0.072 | 0.632 ± 0.037 |
| Cyb2d6 | 0.825 ± 0.008 | 0.853 ± 0.028 | 0.747 ± 0.013 | 0.587 ± 0.011 | 0.616 ± 0.020 | 0.627 ± 0.009 | 0.664 ± 0.012 | 0.373 ± 0.049 | 0.649 ± 0.016 |
| Cyb2c9sub | 0.679 ± 0.026 | 0.570 ± 0.010 | 0.365 ± 0.021 | 0.380 ± 0.015 | 0.344 ± 0.051 | 0.359 ± 0.059 | 0.417 ± 0.010 | 0.388 ± 0.052 | 0.382 ± 0.019 |
| Cyb2c9 | 0.873 ± 0.006 | 0.836 ± 0.014 | 0.845 ± 0.011 | 0.715 ± 0.004 | 0.735 ± 0.004 | 0.739 ± 0.010 | 0.751 ± 0.006 | 0.620 ± 0.016 | 0.754 ± 0.002 |
| Cyb2c19 | 0.870 ± 0.010 | 0.840 ± 0.053 | - | - | - | - | - | - | - |
| Cyb1a2 | 0.930 ± 0.003 | 0.885 ± 0.100 | - | - | - | - | - | - | - |
| Bbbm | 0.896 ± 0.037 | 0.756 ± 0.034 | 0.852 ± 0.018 | 0.823 ± 0.015 | 0.842 ± 0.016 | 0.836 ± 0.009 | 0.920 ± 0.006 | 0.769 ± 0.037 | 0.821 ± 0.112 |
| hERG | 0.881 ± 0.038 | 0.855 ± 0.026 | 0.708 ± 0.045 | 0.736 ± 0.023 | 0.738 ± 0.038 | 0.722 ± 0.034 | 0.875 ± 0.014 | 0.713 ± 0.040 | 0.721 ± 0.045 |
| AMES | 0.855 ± 0.015 | 0.881 ± 0.018 | 0.801 ± 0.030 | 0.794 ± 0.008 | 0.818 ± 0.010 | 0.823 ± 0.006 | 0.852 ± 0.005 | 0.656 ± 0.014 | 0.842 ± 0.014 |
| DILI | 0.853 ± 0.037 | 0.891 ± 0.009 | 0.928 ± 0.022 | 0.832 ± 0.021 | 0.859 ± 0.033 | 0.851 ± 0.026 | 0.919 ± 0.014 | 0.777 ± 0.011 | 0.899 ± 0.008 |
Table 4 displays a comparison of the performance of various state-of-the-art algorithms on classification tasks, where model performance was assessed using the AUROC metric. Higher AUROC values indicate better performance. For each task, the best-performing model is indicated in boldface. Results are reported as the mean ± standard deviation across five runs with different random seeds.
Table 5.
Wilcoxon signed-rank test results and average rankings
| Model | Average rankings | Statistic | P_value | Significant |
|---|---|---|---|---|
| Ours | 2.611 | - | - | - |
| CFA | 3.111 | 56.0 | 0.212 | FALSE |
| Chemprop | 4.222 | 35.0 | 0.027 | TRUE |
| Mamba | 4.222 | 28.0 | 0.010 | TRUE |
| SPMM | 4.555 | 52.5 | 0.167 | FALSE |
| GCN | 5.888 | 14.0 | 0.000 | TRUE |
| NeuralFP | 6.000 | 12.0 | 0.000 | TRUE |
| Morgan + MLP | 6.666 | 9.0 | 0.000 | TRUE |
| HFST | 7.611 | 7.0 | 0.000 | TRUE |
Table 5 summarizes the average rankings of all models across the 22 benchmark datasets, along with the results of Wilcoxon signed-rank tests comparing MSformer-ADMET (Ours) against each baseline model. The Statistic column reports the Wilcoxon test statistic, P_value gives the corresponding p-value, and Significant marks whether the performance difference is statistically significant (TRUE for P < 0.05).
No single method dominated all tasks, as performance depends on feature types and specific tasks. This variation comes from different molecular representations and machine-learning models capturing diverse information types. In low-sample and complex toxicity-classification tasks, MSformer-ADMET effectively transferred structural knowledge from pretraining, boosting prediction stability. This confirms the effectiveness of pretraining strategies in drug modeling and suggests that MSformer’s molecular representations could be useful for other downstream molecular tasks, thus laying the foundation for a general-purpose molecular language model.
The influence of different architectural components
The performance was evaluated across multiple regression and classification tasks, with results summarized in Table 6 and Table 7.
Table 6.
Ablation study results of MSformer-ADMET across regression tasks
| Tasks | Normal | Only-smiles | No_pretrain | Attention pooling | Attention type_linear |
|---|---|---|---|---|---|
| Halo(S) | 0.356 ± 0.022 | 0.197 ± 0.045 | 0.364 ± 0.068 | 0.303 ± 0.052 | 0.440 ± 0.023 |
| Clhe(S) | 0.408 ± 0.087 | 0.283 ± 0.041 | 0.390 ± 0.023 | 0.349 ± 0.042 | 0.428 ± 0.024 |
| Soaq(M) | 0.760 ± 0.020 | 1.000 ± 0.038 | 0.770 ± 0.011 | 0.775 ± 0.012 | 0.803 ± 0.017 |
| Ppaz(M) | 7.290 ± 0.145 | 8.790 ± 0.319 | 8.200 ± 0.186 | 7.902 ± 0.368 | 8.661 ± 0.207 |
| Lipoas(M) | 0.590 ± 0.001 | 0.770 ± 0.025 | 0.590 ± 0.010 | 0.607 ± 0.014 | 0.627 ± 0.024 |
| Hyfr(M) | 1.480 ± 0.029 | 2.010 ± 0.095 | 1.500 ± 0.171 | 1.251 ± 0.062 | 1.387 ± 0.026 |
| Caco2(M) | 0.400 ± 0.026 | 0.410 ± 0.069 | 0.420 ± 0.026 | 0.427 ± 0.021 | 0.427 ± 0.021 |
| Ld50(M) | 0.580 ± 0.026 | 0.650 ± 0.053 | 0.610 ± 0.014 | 0.595 ± 0.016 | 0.587 ± 0.018 |
Table 6 reports the predictive performance of MSformer-ADMET under the default configuration and four ablated variants across eight regression tasks from the TDC benchmark. “Normal” represents the default setting of the model. “Only-smiles” replaces fragment-based meta-structure encoding with SMILES-level embeddings. “No_pretrain” removes the pretraining stage, randomly initializing all weights. “Attention pooling” replaces the default global average pooling (GAP) with a learnable attention pooling module. “Attention type_linear” replaces full attention with linear attention. For each task, the best-performing result is indicated in boldface, and all results are presented as mean ± standard deviation over five random seeds.
Table 7.
Ablation study results of MSformer-ADMET across classification tasks
| Tasks | Normal | Only-smiles | No_pretrain | Attention pooling | Attention type_linear |
|---|---|---|---|---|---|
| Pbrocc | 0.942 ± 0.010 | 0.909 ± 0.011 | 0.858 ± 0.010 | 0.866 ± 0.014 | 0.874 ± 0.020 |
| Panc | 0.736 ± 0.059 | 0.673 ± 0.017 | 0.588 ± 0.109 | 0.700 ± 0.040 | 0.662 ± 0.034 |
| Cyb3a4sub | 0.689 ± 0.024 | 0.530 ± 0.027 | 0.635 ± 0.030 | 0.615 ± 0.025 | 0.615 ± 0.025 |
| Cyb3a4 | 0.861 ± 0.009 | 0.788 ± 0.013 | 0.848 ± 0.016 | 0.857 ± 0.010 | 0.853 ± 0.010 |
| Cyb2d6sub | 0.871 ± 0.023 | 0.746 ± 0.079 | 0.811 ± 0.026 | 0.807 ± 0.017 | 0.809 ± 0.036 |
| Cyb2d6 | 0.825 ± 0.008 | 0.748 ± 0.034 | 0.813 ± 0.013 | 0.816 ± 0.011 | 0.811 ± 0.006 |
| Cyb2c9sub | 0.697 ± 0.026 | 0.687 ± 0.079 | 0.658 ± 0.020 | 0.636 ± 0.053 | 0.539 ± 0.031 |
| Cyb2c9 | 0.873 ± 0.006 | 0.791 ± 0.012 | 0.854 ± 0.009 | 0.850 ± 0.007 | 0.855 ± 0.009 |
| Cyb2c19 | 0.870 ± 0.010 | 0.800 ± 0.015 | 0.850 ± 0.013 | 0.870 ± 0.008 | 0.864 ± 0.010 |
| Cyb1a2 | 0.930 ± 0.003 | 0.875 ± 0.010 | 0.918 ± 0.008 | 0.921 ± 0.005 | 0.912 ± 0.006 |
| Bbbm | 0.896 ± 0.037 | 0.875 ± 0.018 | 0.842 ± 0.008 | 0.847 ± 0.017 | 0.847 ± 0.016 |
| hERG | 0.881 ± 0.038 | 0.782 ± 0.005 | 0.937 ± 0.007 | 0.914 ± 0.011 | 0.905 ± 0.015 |
| AMES | 0.855 ± 0.015 | 0.806 ± 0.016 | 0.819 ± 0.012 | 0.841 ± 0.013 | 0.835 ± 0.008 |
| DILI | 0.853 ± 0.037 | 0.775 ± 0.024 | 0.814 ± 0.061 | 0.843 ± 0.033 | 0.822 ± 0.019 |
Table 7 reports the predictive performance of MSformer-ADMET under the default setting and four ablated configurations across 14 classification tasks from the TDC benchmark. “Normal” denotes the default model. “Only-smiles” replaces fragment-based meta-structure encoding with smiles-level embeddings. “No_pretrain” removes the pretraining stage and randomly initializes all weights. “Attention pooling” replaces the default global average pooling (GAP) with a learnable attention pooling module. “Attention type_linear” replaces full attention with linear attention. For each task, the best-performing result is shown in boldface, and all results are reported as mean ± standard deviation over five random seeds.
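As the table captions note, every entry is aggregated over five random seeds. A minimal sketch of that aggregation is shown below; the scores are illustrative, `summarize_runs` is a hypothetical helper, and the use of the sample (rather than population) standard deviation is an assumption:

```python
import statistics

def summarize_runs(per_seed_scores):
    """Aggregate per-seed metric values into the 'mean ± std' format
    used in Tables 6 and 7 (five random seeds per task)."""
    mean = statistics.mean(per_seed_scores)
    std = statistics.stdev(per_seed_scores)  # sample standard deviation (assumed)
    return f"{mean:.3f} ± {std:.3f}"

# Hypothetical AUROC values for one classification task across five seeds
print(summarize_runs([0.941, 0.952, 0.930, 0.945, 0.942]))  # → "0.942 ± 0.008"
```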
(i) Meta-structure encoding: replacing the input fragment embeddings with smiles-level embeddings degraded performance by 26.04% on regression tasks and 3.38% on classification tasks. This underscores the importance of meta-structure fragments in capturing the structural detail necessary for accurate ADMET property prediction.
(ii) Pretraining initialization: compared with pretrained models, random initialization reduced performance by 3.43% on regression tasks and 4.75% on classification tasks. This demonstrates that pretraining equips the model with foundational structural knowledge that substantially aids downstream task performance.
(iii) Pooling strategy: replacing the default GAP with a learnable attention pooling module did not yield improvements. Specifically, attention pooling resulted in a 4.56% performance drop on regression tasks and a 3.53% decrease on classification tasks compared with the original GAP-based configuration. These findings suggest that GAP remains an effective choice in our framework, particularly when combined with structurally informative, pretrained fragment embeddings.
(iv) Attention type: to evaluate the role of the full self-attention mechanism, we replaced it with a linear attention configuration (attention_type = linear). This modification led to a decline in predictive performance across multiple tasks. Specifically, classification performance dropped by 5.27%, while regression performance showed only a marginal decrease of 0.49% compared to the original full attention model. Surprisingly, the linear attention variant resulted in a more than fivefold increase in training time, indicating potential inefficiencies or implementation bottlenecks in its backend. The observed performance degradation highlights the importance of the full, multihead self-attention mechanism in modeling molecular systems. Its capacity to dynamically assign contextual importance to different substructures enables the model to capture complex intramolecular dependencies.
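The two pooling strategies compared in (iii) can be sketched as follows. This is an illustrative NumPy sketch, not the model's actual implementation: the fragment embeddings and the scoring vector `w` are random placeholders standing in for learned parameters.

```python
import numpy as np

def global_average_pool(frag_emb):
    """GAP: unweighted mean over fragment embeddings -> molecule vector."""
    return frag_emb.mean(axis=0)

def attention_pool(frag_emb, w):
    """Learnable attention pooling: score each fragment with vector w,
    softmax the scores, and take the weighted sum of embeddings."""
    scores = frag_emb @ w                  # one score per fragment
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # softmax attention weights
    return alpha @ frag_emb                # weighted molecule vector

rng = np.random.default_rng(0)
frags = rng.normal(size=(6, 8))   # 6 fragment embeddings of dimension 8
w = rng.normal(size=8)            # placeholder "learned" scoring vector
assert global_average_pool(frags).shape == attention_pool(frags, w).shape == (8,)
```

With a zero scoring vector the softmax weights are uniform and attention pooling reduces exactly to GAP, which makes the two strategies directly comparable.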
Performance and interpretability analysis of MSformer-ADMET on Pbrocc
Leveraging its meta-structure-based architecture, MSformer-ADMET enables multifaceted interpretation of model features. Globally, the weight distribution of meta-structures correlates with model performance. Examples are shown in Fig. 2, including meta-structures exclusively associated with active compounds. We further dissected atomic-level contributions using (S)-5-(5-cyclopropyl-1H-pyrazol-3-ylamino)-3-(1-[5-fluoropyridin-2-yl]ethylamino)pyrazine-2-carbonitrile (PubChem CID: 25171876), a compound active in the Pbrocc task, as a case study. By normalizing meta-structure weights, we observed complementary attention patterns between tasks (Fig. 2). MSformer-ADMET captures the weight distribution over multiple meta-structures for each atom, enabling nuanced interpretation. For example, in Pbrocc inhibition, attention weights decreased with increasing meta-structure molecular weight. This consistency suggests the model's ability to recognize fundamental physicochemical determinants of bioactivity, providing guidance for rational drug modification.
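One plausible way to turn meta-structure attention into atom-level maps such as those in Fig. 2 is to normalize the meta-structure weights and distribute each weight over the atoms in its fragment-to-atom mapping. The weights and mapping below are invented for illustration, and this equal-share aggregation is only one possible scheme, not necessarily the paper's exact procedure:

```python
import numpy as np

# Hypothetical attention weights over one molecule's meta-structures
ms_weights = np.array([2.0, 1.0, 4.0, 1.0])
ms_weights = ms_weights / ms_weights.sum()          # normalize to sum to 1

# Hypothetical fragment-to-atom mapping: meta-structure index -> atom indices
ms_to_atoms = {0: [0, 1, 2], 1: [2, 3], 2: [4, 5, 6, 7], 3: [7, 8]}

n_atoms = 9
atom_weights = np.zeros(n_atoms)
for ms_idx, atoms in ms_to_atoms.items():
    # each atom inherits an equal share of every meta-structure containing it
    atom_weights[atoms] += ms_weights[ms_idx] / len(atoms)

print(atom_weights.round(3))  # atoms covered by overlapping fragments score highest
```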
Figure 2.
Atom-level attention maps for a compound exhibiting activity. (A) Visualization of atomic-level attention weights. The highlighted meta-structure indicates the part of the molecule that receives focus in the Pbrocc task. (B) Frequency statistics of different atoms in the meta-structure. The x-axis represents molecular IDs. (C) Correlation analysis between the molecular weight of the meta-structure and its attention weights. (D) Distribution of attention weights for each atom across different meta-structures.
Discussion
In this study, we applied MSformer-ADMET to ADMET tasks and achieved superior performance over existing methods. Notably, the performance gains are not merely attributable to deeper or wider networks, but rather to innovations in structural representation. By deconstructing molecules into chemically meaningful meta-structural fragments and adopting these as the fundamental modeling units, MSformer-ADMET more effectively captures the local structural determinants of pharmacokinetic properties and their contextual dependencies. This fragment-level modeling approach not only enhances robustness under data-scarce or label-imbalanced conditions but also improves model interpretability and decision transparency.
From the perspective of the pretraining-fine-tuning paradigm, MSformer-ADMET is pretrained on large-scale unlabeled meta-structures, significantly improving its ability to learn common structural patterns in molecules. Compared with traditional molecular representation learning approaches that directly model smiles strings or molecular graphs, MSformer-ADMET achieves a favorable balance among representational granularity, task adaptability, and structural generalization capacity. Particularly in prediction tasks with limited samples or skewed class distributions, MSformer-ADMET demonstrates markedly better transferability and stability than nonpretrained counterparts, underscoring the positive impact of incorporating structural priors on downstream modeling.
Furthermore, MSformer-ADMET incorporates an interpretability analysis module that enables fragment-level explanations of prediction outcomes from attention-based perspectives. This capability allows the model to maintain high performance while providing traceable structural rationales. Such interpretability aids medicinal chemists in eliminating high-risk candidates, thereby offering actionable insights for structural optimization and lead compound design. In the future, this approach holds promise for broader applications in more complex cross-modal and multiscale prediction scenarios, and may serve as a versatile structural modeling interface for high-throughput drug screening and molecular generation models.
Limitations and future work
While MSformer-ADMET demonstrates compelling performance across diverse ADMET prediction tasks, several challenges remain to be addressed. The primary focus of the current framework relies on 2D structural information represented through molecular fragments. Although this design facilitates efficient modeling and interpretability, it inherently omits three-dimensional (3D) conformational features such as stereochemistry, spatial strain, and chiral centers, which are known to influence pharmacokinetic and toxicological behaviors. Therefore, MSformer-ADMET may underperform in tasks where 3D topology or geometric complementarity plays a central role. Beyond this, the model depends on a predefined vocabulary of fragments derived from natural product-like structures. While this vocabulary captures a wide range of chemically meaningful substructures, its coverage may be limited in certain domains, particularly for inorganic compounds, metal–organic complexes, or synthetically modified molecules containing atypical scaffolds. This constraint may restrict model generalizability when applied to underrepresented or emerging chemical spaces, such as organometallic drugs, radiopharmaceuticals, or boron- and platinum-containing agents.
In addition to the above, although the self-attention mechanism in MSformer-ADMET allows for flexible aggregation of fragment-level information, its current implementation is tailored to fixed-length fragment sets. This may pose challenges when scaling to ultra-large molecules or macromolecular assemblies, where the number of fragments and their structural diversity can grow substantially. Moreover, the reliance on fragment quality, determined during the preprocessing stage, introduces an upstream dependency: poorly defined or incorrectly parsed fragments can propagate noise into downstream representation learning and ultimately impact prediction fidelity. Notably, rigorous validation of MSformer-ADMET is currently lacking for rare disease compounds, low-frequency pharmacophores, and low-data regimes, where sample scarcity and data imbalance may compromise both model training and interpretability. Addressing these constraints such as incorporating 3D-aware embeddings, dynamic fragment vocabularies, or hybrid graph representations will be crucial in future work to extend the scope and reliability of MSformer-ADMET in real-world pharmacological pipelines.
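The vocabulary-coverage limitation discussed above can be made operational with a simple pre-screening check before inference. The helper `fragment_coverage` and the fragment identifiers below are hypothetical illustrations, not part of MSformer-ADMET:

```python
def fragment_coverage(mol_fragments, vocabulary):
    """Fraction of a molecule's fragments found in the predefined
    meta-structure vocabulary; low coverage flags molecules (e.g.
    organometallics) for which predictions may be unreliable."""
    if not mol_fragments:
        return 0.0
    known = sum(1 for frag in mol_fragments if frag in vocabulary)
    return known / len(mol_fragments)

# Hypothetical vocabulary and fragment identifiers (SMILES-like strings)
vocab = {"c1ccccc1", "C(=O)O", "CCN"}
frags = ["c1ccccc1", "C(=O)O", "[Pt]"]  # platinum fragment not in vocab
print(round(fragment_coverage(frags, vocab), 3))  # → 0.667 (2 of 3 covered)
```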
Conclusion
This study presents MSformer-ADMET, an application of the fragment-aware framework MSformer, tailored for comprehensive prediction of ADMET properties. By incorporating chemically meaningful meta-structures derived from mass spectrometry-inspired fragmentation and modeling their contextual relationships through a cascaded Transformer architecture, the proposed approach effectively enhances molecular semantic representation. Through large-scale pretraining and task-specific fine-tuning, MSformer-ADMET achieves consistent improvements across 22 diverse ADMET prediction tasks. The model demonstrates strong generalization, adaptability to multitask learning, and resilience in low-resource settings. Its interpretability module enables fragment-level analysis of predictive outcomes, providing mechanistic insights into structure–property relationships and supporting rational optimization in drug discovery workflows. Collectively, MSformer-ADMET provides a novel and effective paradigm for ADMET modeling, with significant implications for enhancing early-stage safety assessment and candidate prioritization in drug discovery pipelines.
Key Points
Development of MSformer-ADMET, a deep learning framework that integrates mass spectrometry-inspired meta-structural fragments into a cascaded Transformer architecture. This enables chemically meaningful molecular representations beyond traditional atom-level approaches (e.g. smiles or graphs).
Systematic evaluation on 22 TDC datasets shows MSformer-ADMET consistently outperforms state-of-the-art smiles-based and graph-based models in both classification and regression tasks.
An integrated module provides fragment-level explanations of ADMET predictions, offering insights into structure–property relationships and supporting molecular optimization.
Dr. Liao is a ZJU100 Young Professor at the College of Pharmaceutical Sciences, Zhejiang University, and serves as Associate Director and Principal Investigator at the Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University. His research interests focus on AI for TCM, spatiotemporal omics, and bioinformatics.
Contributor Information
Huihui Liu, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China.
Bingjie Zhu, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China.
Shuyang Nie, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China.
Haoran Li, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China.
Yugang Lin, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Department of Pharmacy, Affiliated Jinhua Hospital, Zhejiang University School of Medicine, Jinhua, 321000, China.
Tianyi Ma, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China.
Xin Shao, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China.
Qian Chen, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; Hangzhou TCM Hospital Affiliated to Zhejiang Chinese Medical University, Hangzhou, 310007, China.
Minjie Shen, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China.
Yanrong Zheng, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Collaborative Innovation Center for the Brain Diseases with Integrative Medicine, Zhejiang Key Laboratory of Neuropsychopharmacology, Zhejiang Chinese Medical University, Hangzhou, 310053, China.
Xiaohui Fan, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; Hangzhou TCM Hospital Affiliated to Zhejiang Chinese Medical University, Hangzhou, 310007, China.
Jie Liao, Department of Pharmaceutical Sciences, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China; Zhejiang Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; State Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314100, China; Hangzhou TCM Hospital Affiliated to Zhejiang Chinese Medical University, Hangzhou, 310007, China.
Author contributions
Huihui Liu (Methodology, Formal Analysis, Visualization, Writing—original draft), Bingjie Zhu (Validation, Formal Analysis, Data Curation, Writing—original draft), Shuyang Nie (Validation, Formal Analysis), Haoran Li (Methodology, Validation), Yugang Lin (Validation), Tianyi Ma (Software), Xin Shao (Validation), Qian Chen (Validation), Minjie Shen (Validation), Yanrong Zheng (Supervision, Writing—review & editing), Xiaohui Fan (Supervision, Writing—review & editing), and Jie Liao (Conceptualization, Supervision, Writing—review & editing).
Conflict of interest: The authors declare no conflicts of interest.
Funding
This work was supported by Noncommunicable Chronic Diseases-National Science and Technology Major Project [No. 2024ZD0530704, J.L.]; ‘Pioneer’ and ‘Leading Goose’ R&D Program of Zhejiang [No. 2024C03106, X.F.]; Zhejiang Provincial Natural Science Foundation of China [No. LD25H090002, Y.Z.]; National Natural Science Foundation of China [No. 82204772, J.L.]; Starlit South Lake Leading Elite Program [No. 2023A303005, X.F.].
Data availability
The TDC datasets are available at https://tdcommons.ai/single_pred_tasks/adme/.