Abstract
Data heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments as well as inconsistent property annotations between gold-standard and popular benchmark sources, such as Therapeutic Data Commons. These dataset discrepancies, which can arise from factors such as differences in experimental conditions during data collection or in chemical space coverage, can introduce noise and ultimately degrade model performance. Data standardization, despite harmonizing discrepancies and increasing the training set size, may not always improve predictive performance. This highlights the importance of rigorous data consistency assessment (DCA) prior to modeling. To facilitate systematic DCA across diverse datasets, we developed AssayInspector, a model-agnostic package that leverages statistics, visualizations, and diagnostic summaries to identify outliers, batch effects, and discrepancies. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13321-025-01103-3.
Keywords: Data reporting, Molecular property, ADME, Physicochemical, Machine learning, Data aggregation, Predictive accuracy, Benchmark
Scientific contribution
By systematically analyzing public ADME datasets, we uncovered substantial distributional misalignments and annotation discrepancies between benchmark and gold-standard sources. These challenges were shown to undermine predictive modeling, as naive integration or standardization often degraded performance. To address them, we present AssayInspector, a tool that enables both consistency assessment and informed data integration, providing a foundation for more reliable predictive modeling in drug discovery.
Graphical Abstract
Introduction
Machine learning (ML) has emerged as a powerful tool for scientific discovery, providing cost-effective predictions that accelerate research across various domains [1–3]. Yet, the accuracy and reliability of ML models depend on the quality, size, and consistency of training data [4, 5]. In this context, integrating publicly available datasets offers an opportunity to increase sample sizes and expand feature space coverage, potentially enhancing predictive accuracy and model generalizability [6]. However, despite these advantages, data integration presents significant challenges. In particular, variability in experimental protocols, feature shifts, and differences in applicability domain can introduce inconsistencies that obscure biological signals, ultimately undermining model performance [7, 8]. These limitations have become a growing source of concern within the research community, raising critical questions about the overreliance on AI-driven modeling and its potential to mislead interpretations [9]. To ensure robust integration, effective data consistency inspection is essential to systematically identify and address these discrepancies [10].
These challenges are particularly critical in drug discovery pipelines, where high-stakes decisions rely on sparse, heterogeneous, and limited datasets [11–14]. A major bottleneck in this process is the optimization of a drug candidate’s pharmacokinetic (PK) properties, commonly referred to as its absorption, distribution, metabolism, and excretion (ADME) profile. Unlike binding affinity data, which is primarily derived from high-throughput in vitro experiments and widely available in public databases [15, 16], ADME data is largely obtained from in vivo studies using animal models or clinical trials. This makes PK data costly and labor-intensive to generate, limiting its accessibility. As a result, ADME datasets have historically remained proprietary and restricted to pharmaceutical companies, with only a few small, property-specific datasets in the public domain [17].
Recognizing the need for publicly accessible ADME data, early efforts focused on curating pharmacokinetic parameters from scientific literature. One of the first publicly available datasets was collected by Obach and collaborators, who extracted intravenous PK parameters from published studies [18]. Since then, the scientific community has made significant progress in producing ADME datasets, sourced from high-throughput assay screenings and curated literature efforts [19]. More recently, initiatives such as Therapeutic Data Commons (TDC) have begun assembling molecular property data, including ADME datasets, to offer standardized benchmarks for predictive models [6, 20].
As public datasets are increasingly available, integrating data from multiple sources offers an opportunity to increase predictive accuracy and model generalizability. By incorporating a broader range of experimental data, researchers can increase the number of samples for a given molecular property while expanding coverage of chemical space, potentially leading to more reliable models. For instance, integrating the benchmark aqueous solubility dataset AqSolDB [21] with additional curated sources nearly doubled the molecule coverage, resulting in better model performance [22]. Similarly, a recent study by Napoli et al. demonstrated that integrating data from Genentech and Roche into a multitask model improves predictive accuracy and generalization, which was attributed to the expanded chemical space that, in turn, broadened the model’s applicability domain [23].
While diverse ML strategies for integrating heterogeneous molecular data have been extensively explored in molecular predictive modeling and drug discovery [24, 25], the systematic aggregation of physicochemical and ADME data across publicly available sources remains insufficiently addressed. In this work, we explore the critical task of data consistency assessment, illustrating its impact on integrating molecular property datasets to enhance ADME prediction models. To this end, we developed AssayInspector, a computational tool designed to systematically characterize datasets by detecting distributional differences, outliers, and batch effects that could impact ML model performance. Unlike general data visualization tools such as Tableau [26] or AutoViz [27], AssayInspector is specifically tailored to compare experimental datasets from distinct sources before aggregation in ML pipelines. Notably, this tool can be applied to any scientific assay that may exhibit variations in experimental conditions across different sources, such as in vitro binding, cytotoxicity, or enzyme inhibition assays.
Using AssayInspector, we examined multiple ADME datasets, focusing on those properties more widely represented in public and benchmark sources, such as half-life and clearance. Through this analysis, we identified significant misalignments between commonly used benchmark and gold-standard sources, highlighting discrepancies in data distributions that can affect model performance. Additionally, we examined how data integration decisions influence ML model accuracy, demonstrating that directly aggregating property datasets without addressing distributional inconsistencies introduces noise, ultimately decreasing predictive performance. These findings emphasize the importance of evaluating dataset misalignments prior to modeling to ensure the reliability and generalizability of predictive models. To support future research and guide the integration of ADME datasets, we have made AssayInspector publicly available at https://github.com/chemotargets/assay_inspector.
Methods
AssayInspector package
AssayInspector is a computational tool designed to provide statistics-informed data aggregation and cleaning recommendations prior to ML pipelines. Developed in pure Python, this software supports data analysis, visualization, statistical testing, and preprocessing for physicochemical and PK prediction tasks (Fig. 1). It is compatible with both regression and classification modeling, allowing users to analyze individual datasets and generate detailed reports to identify discrepancies between data sources. In addition to supporting precomputed features, the tool incorporates built-in functionality to calculate traditional chemical descriptors, such as ECFP4 fingerprints and 1D and 2D descriptors, on the fly using RDKit v2022.09.5 [28]. By default, the Tanimoto Coefficient is used to measure molecular similarity when ECFP4 fingerprints are selected, while the standardized Euclidean distance is applied for RDKit descriptors. Alternatively, users can choose any other similarity metric supported by SciPy [29] or specify a custom similarity function.
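The Tanimoto coefficient mentioned above is simply the ratio of shared on-bits to total on-bits between two binary fingerprints. As a minimal, package-independent illustration (not AssayInspector's actual API), on fingerprints represented as sets of on-bit indices:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two binary fingerprints,
    each given as the set of indices of its on-bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    intersection = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return intersection / union

# Toy fingerprints sharing 2 of 6 distinct on-bits
a = {1, 5, 9, 42}
b = {5, 9, 100, 200}
print(round(tanimoto(a, b), 3))  # 2 shared / 6 total = 0.333
```

In practice, ECFP4 fingerprints would be computed with RDKit and compared with its built-in Tanimoto routines; the set-based form above only illustrates the metric itself.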
Fig. 1.
Overview of the AssayInspector package. Given an initial ADME property dataset containing endpoint values for each molecule, AssayInspector generates a statistical summary, various visualization plots, and an insight report with warnings and recommendations. These outputs provide support and recommendations for dataset aggregation prior to ML modeling pipelines
The tool's functionalities can be categorized into three main components. First, it generates a tabular file that summarizes key descriptive parameters for each data source. These include the number of molecules and endpoint statistics such as mean, standard deviation, minimum, maximum, and quartiles for regression tasks, as well as class counts and ratios for classification tasks. Additionally, the tool performs statistical comparisons of endpoint distributions, applying the two-sample Kolmogorov–Smirnov (KS) test for regression tasks and the Chi-square test for classification tasks. It also computes within- and between-source feature similarity values in a one-vs-other setting. For regression endpoints specifically, it provides skewness and kurtosis calculation, along with the identification of outliers and out-of-range data points across datasets. Most of these statistics are computed using the SciPy package [29].
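The two-sample KS test used here measures the maximum gap between the empirical cumulative distribution functions of two samples; AssayInspector delegates it to SciPy (`scipy.stats.ks_2samp`). The statistic itself can be sketched in plain Python:

```python
def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the two empirical CDFs, evaluated at every
    observed value. O(n^2) sketch for clarity, not efficiency."""
    xs, ys = sorted(x), sorted(y)
    nx, ny = len(xs), len(ys)
    d = 0.0
    for v in xs + ys:
        fx = sum(1 for t in xs if t <= v) / nx  # empirical CDF of x at v
        fy = sum(1 for t in ys if t <= v) / ny  # empirical CDF of y at v
        d = max(d, abs(fx - fy))
    return d

print(ks_statistic([1, 2, 3], [10, 11, 12]))  # disjoint samples -> 1.0
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # partial overlap -> 0.5
```

A statistic near 1 flags strongly misaligned endpoint distributions of the kind reported for the Fan et al. and TDC datasets below.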
Second, AssayInspector generates a comprehensive set of visualization plots that facilitate the detection of inconsistencies across data sources. These visualizations cover various aspects, including property distribution, chemical space, dataset discrepancies, dataset intersection, and feature similarity. The property distribution plots illustrate endpoint distribution across datasets, highlighting significantly different distributions using the pairwise two-sample KS test, calculated using the SciPy package [29]. The dataset intersection analysis visually represents molecular overlap among datasets. Users can also provide reference molecule sets to examine their distribution within each dataset and their intersection across different data sources. To assess dataset discrepancies, AssayInspector evaluates molecule overlap and quantifies the numerical differences in annotations for shared compounds across datasets. Additionally, feature similarity plots examine whether any data source deviates in terms of input representation (e.g. chemical structure) from others. The feature (e.g. chemical) space visualization uses the UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction technique [30] to provide insights into dataset coverage and potential applicability domains in property space. By default, UMAP computes distances in the embedding space using the same similarity metric selected for molecular comparisons. Plotly (https://plot.ly), Matplotlib [31] and Seaborn [32] are the three main visualization libraries used to generate all the figures.
Finally, AssayInspector generates an insight report containing multiple alerts and recommendations to guide data cleaning and preprocessing. This report helps identify dissimilar datasets based on descriptor profiles, conflicting datasets with differing annotations for shared molecules, divergent datasets with low molecule overlap, and redundant datasets with high proportion of shared molecules. For regression tasks, it also detects datasets with significantly different endpoint distributions, inconsistent value ranges, skewed distributions, and the presence of outliers or out-of-range data points. This functionality supports the assessment of dataset compatibility and ensures data consistency before finalizing the dataset for model training.
Half-life datasets
We gathered data from five different sources (see Table S1 in Supplementary material 1). As reference datasets in the field, we used Obach et al. [18] and Lombardo et al. [33], containing human intravenous half-life measurements for 670 and 1,352 molecules, respectively, curated from the literature. Of note, the Obach et al. dataset is used as a benchmark of half-life in the TDC [20]. Additionally, we collected the gold-standard dataset published by Fan et al. in 2024 [34], as it is also the primary source used by other platforms such as ADMETlab 3.0 [35]. This dataset comprises half-life data for 3,512 compounds, primarily retrieved from the ChEMBL database [36]. Finally, two additional publicly available databases containing experimental PK data of small-molecule drugs were incorporated: DDPD 1.0 [37] and e-Drug3D [38].
Clearance datasets
We gathered data from seven different sources (see Table S2 in Supplementary material 1). As with the half-life dataset, we included the two reference datasets of Obach et al. [18] and Lombardo et al. [33]. Additionally, we incorporated the clearance data used by the TDC as clearance benchmark [20], containing experimental in vitro data deposited in ChEMBL by AstraZeneca [39]. Moreover, we included the human total clearance dataset released by Iwata et al. [40], containing data for 741 compounds primarily retrieved from ChEMBL [41] and other studies [42]. Finally, we also considered three supplementary minor datasets from Gombar and Hall [43] and Varma [44, 45], each containing clearance data for approximately 300–500 molecules.
While the TDC dataset includes intrinsic clearance in human liver microsomes [46] and hepatocytes [47], we focused on the microsomes data as it contained more compounds after our preprocessing. Although in vitro clearance data have known limitations [48], we retained the TDC dataset given its role as a leading benchmark resource in molecule property prediction. To scale the in vitro intrinsic clearance values from cells to fully healthy liver, we applied the following equations using in vivo-based scaling factors from Obach and collaborators [8], which were approximated for a healthy adult human weighing 70 kg [49]:
CL_int, in vivo = CL_int, in vitro × (mg microsomal protein / g liver) × (g liver / kg body weight)
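As an arithmetic illustration of this scaling step, the sketch below uses commonly cited physiological factors (45 mg of microsomal protein per gram of liver and an 1,800 g liver for a 70 kg adult). These factor values are assumptions chosen for illustration and may differ from the exact values used in the study:

```python
# Assumed physiological scaling factors (illustrative, not necessarily
# the exact values from the study):
MG_MICROSOMAL_PROTEIN_PER_G_LIVER = 45.0  # mg protein per g liver
LIVER_WEIGHT_G = 1800.0                   # g liver, healthy adult
BODY_WEIGHT_KG = 70.0                     # kg body weight

def scale_microsomal_clint(clint_ul_min_mg):
    """Scale in vitro intrinsic clearance (uL/min/mg microsomal protein)
    to whole-body intrinsic clearance (mL/min/kg body weight)."""
    # uL/min/mg -> mL/min/g liver
    ml_min_per_g_liver = clint_ul_min_mg * MG_MICROSOMAL_PROTEIN_PER_G_LIVER / 1000.0
    # mL/min/g liver -> mL/min/kg body weight
    return ml_min_per_g_liver * LIVER_WEIGHT_G / BODY_WEIGHT_KG

print(round(scale_microsomal_clint(10.0), 2))  # 10 uL/min/mg -> 11.57 mL/min/kg
```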
Molecule standardization and data preprocessing
The Simplified Molecular Input Line Entry System (SMILES) and property values of each molecule from every individual dataset were retrieved. For datasets that did not provide SMILES representations, such as e-Drug3D [38], Obach [18], and Varma [44, 45], we obtained the SMILES from PubChem [50] using the corresponding CAS number. Manual curation was performed to ensure the correct association between SMILES and molecule names. To ensure consistency across datasets, molecular representations were first standardized with RDKit [28]. This process involved repairing molecules, disconnecting counterions and other non-essential components, and selecting the largest fragment of the molecule. The chemical representations were then normalized by adjusting formal charges and standardizing tautomers, while preserving stereochemical information. The standardized SMILES were then converted into InChIKeys for consistent identification. We additionally checked and cleaned the SMILES in the Fan et al. dataset [34].
Regarding the endpoint values, half-life and clearance were log-transformed prior to analysis to improve numerical stability and reduce skewness in the data distribution. Moreover, we excluded censored measurements included in the TDC clearance dataset [39]. Specifically, we retained only intrinsic clearance values from human microsomes that were not annotated with '>' or '<' symbols, as we found these categorical annotations to bias the regressor. This refinement reduced the number of molecules from 1,102 to 744.
To aggregate multiple value annotations for a single molecule, compound-value annotations were deduplicated based on the InChIKey by taking the mean. When the number of annotations exceeded five, extreme values were excluded by retaining only those within two standard deviations from the mean.
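A minimal sketch of this aggregation rule (an illustration, not the package's internal implementation):

```python
from statistics import mean, stdev

def aggregate_annotations(values):
    """Collapse multiple endpoint annotations for one molecule into a
    single value. With more than five annotations, values beyond two
    standard deviations from the mean are excluded before averaging."""
    if len(values) > 5:
        m, s = mean(values), stdev(values)
        kept = [v for v in values if abs(v - m) <= 2 * s] or values
        return mean(kept)
    return mean(values)

# Few annotations: plain mean
print(aggregate_annotations([1.0, 1.2, 0.8]))
# More than five annotations: the extreme value 10.0 is filtered out
print(aggregate_annotations([1.0, 1.1, 0.9, 1.0, 1.2, 10.0]))
```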
Regarding the drug reference set, we used all small-molecule drugs with known chemical structures listed in the Anatomical Therapeutic Chemical (ATC) classification system [51], as described in our previous work [22].
Finally, to evaluate the impact of the standardization procedure on model performance, we categorized property sources as either divergent or homogeneous. We then applied independent standardization to the endpoint values using the RobustScaler function from scikit-learn [52].
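RobustScaler centers each value on the median and scales by the interquartile range (IQR), making the transform insensitive to the outliers discussed above. An equivalent plain-Python sketch, assuming scikit-learn's default 25–75% quantile range:

```python
from statistics import median

def percentile(sorted_vals, q):
    """Percentile with linear interpolation on pre-sorted data."""
    k = (len(sorted_vals) - 1) * q / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (k - lo) * (sorted_vals[hi] - sorted_vals[lo])

def robust_scale(values):
    """Center by the median and scale by the IQR, mirroring the default
    behavior of scikit-learn's RobustScaler on a single feature."""
    ordered = sorted(values)
    med = median(ordered)
    iqr = percentile(ordered, 75) - percentile(ordered, 25)
    if iqr == 0:
        return [v - med for v in values]  # degenerate case: center only
    return [(v - med) / iqr for v in values]

print(robust_scale([0, 1, 2, 3, 4]))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

Because median and IQR are resistant to extreme values, a handful of out-of-range endpoint measurements barely shifts the transform, unlike mean/standard-deviation scaling.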
Performance evaluation of predictive models
Each molecular structure was encoded as 1,024-bit numerical vectors using ECFP4 binary fingerprints alongside 1D and 2D descriptors calculated with RDKit [28].
Before training, we first combined the molecule universe of all datasets and generated five equal-sized folds. Duplicated molecules across sources were aggregated by computing the arithmetic mean, excluding extreme values when the number of annotations exceeded five. Additionally, to prevent any data leakage from overlapping compounds, we clustered molecules based on ECFP4 Tanimoto similarity (≥ 0.95) using a connected component approach, assigning entire molecule clusters to the same fold. These folds were then used to create dataset-specific splits, thereby ensuring that molecules in the test set of a given split never appeared in the training data of that split in any dataset. By keeping the test splits fixed across all integration variations, we ensured that any observed differences in performance metrics were primarily due to changes in the training data (i.e. different dataset combinations). The whole process was repeated five times with different random seeds. After benchmarking multiple ML algorithms, including XGBoost, Support Vector Machine, Random Forest, and K-Nearest Neighbors, as well as diverse molecular representations (ECFP4 fingerprints alone and in combination with 1D and 2D RDKit descriptors; see Supplementary material 5), the XGBoost Regressor algorithm alongside the complete molecular descriptor set was selected as the reference modeling approach. Hyperparameter optimization (HPO) was performed using hyperopt [53] with the Tree-structured Parzen Estimators (TPE) optimization algorithm and employing a fivefold cross-validation approach, searching for the optimal combination of parameters within a predefined search space (see Table S3 in Supplementary material 1). HPO was conducted over a maximum of 50 optimization runs within a 600-second time limit.
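The cluster-to-fold assignment described above can be sketched as a union-find pass over high-similarity molecule pairs, followed by greedy balancing of whole clusters across folds. The pair list would come from ECFP4 Tanimoto comparisons (threshold 0.95), omitted here for brevity; the sketch is illustrative, not the study's exact code:

```python
def assign_folds(n_molecules, similar_pairs, n_folds=5):
    """Group molecules connected by high-similarity pairs into clusters
    (union-find), then place each whole cluster in one fold, largest
    clusters first, always into the currently smallest fold."""
    parent = list(range(n_molecules))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n_molecules):
        clusters.setdefault(find(i), []).append(i)

    fold_of = [0] * n_molecules
    fold_sizes = [0] * n_folds
    for members in sorted(clusters.values(), key=len, reverse=True):
        f = fold_sizes.index(min(fold_sizes))  # smallest fold so far
        for m in members:
            fold_of[m] = f
        fold_sizes[f] += len(members)
    return fold_of

# Molecules 0-1-2 form one near-duplicate cluster; 3, 4, 5 are singletons
print(assign_folds(6, [(0, 1), (1, 2)], n_folds=2))
```

Keeping each connected component intact guarantees that near-duplicate compounds never straddle a train/test boundary.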
At each iteration, model performance was evaluated using various metrics on each individual source validation set, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Spearman's rank correlation coefficient, and coefficient of determination (R2). R2 was selected for discussion as it provides a task-independent measure of predictive power, unlike MAE or RMSE, which vary depending on the modeling task and require domain expertise to interpret. Furthermore, a negative R2 indicates that the model performs worse than a baseline that simply predicts the mean, signaling strong dataset biases that hinder reliable predictions. The standard deviation across folds and repetitions was analyzed to assess process stability and determine whether random data splitting introduced any bias [54].
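The interpretation of a negative R2 follows directly from its definition: the residual sum of squares exceeds the variance around the mean. A minimal implementation makes this concrete:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination. Negative whenever the model's
    squared error exceeds that of always predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect fit -> 1.0
print(r_squared([1.0, 2.0, 3.0], [3.0, 3.0, 3.0]))  # worse than the mean -> -1.5
```

The strongly negative R2 values in Tables 1 and 2 (e.g. when models trained on homogeneous sources are evaluated on the divergent one) are therefore direct evidence of systematic distribution shift rather than mere noise.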
To statistically compare the model performance across different integration and standardization strategies, we applied repeated measures ANOVA (analysis of variance) with the post hoc Tukey’s HSD (honestly significant difference) test for pairwise comparison between models, ensuring that all assumptions for these tests were met in accordance with the guidelines provided by Ash and collaborators [55].
Results
In this study, we assessed the alignment of multiple datasets across key ADME endpoints, focusing on clearance and half-life as representative PK properties (see Supplementary material 2 for a detailed analysis of additional PK endpoints). Clearance measures the rate at which a drug is removed from the bloodstream, primarily influenced by liver and kidney-driven elimination and serum protein binding [56, 57]. Half-life refers to the time it takes for a drug’s active ingredient to decrease to half its peak concentration, determined by metabolism and elimination processes [58, 59]. These two endpoints were selected due to their significance in drug discovery (e.g. in determining dosing regimens), the availability of multiple public datasets, and the discrepancies we observed across different sources. Given the lack of widely accepted ADME benchmarks, we included multiple datasets in this study.
After molecule standardization, deduplication, and cleaning (see Methods section), we selected 5 and 7 representative datasets for half-life and clearance, respectively. Although AssayInspector accepts any molecular descriptor, we chose to represent molecules using extended circular fingerprints (ECFP4), as they provide a strong baseline across various molecule prediction tasks [60–62]. For each ADME endpoint, we employed AssayInspector to analyze: (i) the number of contributed molecules from each dataset, (ii) the endpoint distributions, (iii) the overlap between datasets, (iv) the molecule annotation discrepancies, and (v) the presence of outliers. AssayInspector generates tailored visualizations for each analysis type, providing a final report with anticipated challenges and recommendations on dataset integration (Fig. 1).
Analyzing half-life datasets with AssayInspector
Focusing first on half-life, the Insight Report (see Figure S1 in Supplementary material 1) highlighted that the dataset from Fan et al. exhibits a significantly lower half-life distribution than the rest of the datasets (two-sample KS test p < 0.0001) (Fig. 2A), with median difference values ranging from 0.778 (6.0 h) to 1.026 (10.6 h) (see Figure S1A in Supplementary material 1). Notably, this dataset is both the largest and the one contributing the highest proportion of unique molecules, having relatively low overlap (< 10%) with the rest of the sources (Fig. 2B). As a result, molecules provided by the Fan et al. dataset explore a broader portion of the chemical space, reaching regions that are not covered by other sources (Fig. 2C). Despite these distributional differences, compounds shared between Fan’s dataset and the rest of the sources exhibit consistent annotations (Fig. 2D and see Figure S1B in Supplementary material 1). While this consistency could in part reflect shared provenance from common primary sources, it nonetheless indicates agreement in the reported values across datasets. This suggests that, although Fan et al.’s dataset differs in molecule composition, the annotations for shared compounds are aligned across sources. Outliers, in turn, were found to be marginal across datasets (Fig. 2F).
Fig. 2.
AssayInspector analysis of half-life datasets. Collection of plots generated using the AssayInspector package. A Violin plots showing the distribution of half-life values across data sources. Each embedded boxplot displays the median, upper and lower quartiles, and individual outliers. The two additional bar charts at the top indicate the number of molecules and the percentage of unique molecules contributed by each source to the final dataset. B Pairwise heatmaps illustrating the proportion of shared molecules. Color code shows the proportions calculated using the source on the y-axis as the reference. C Kernel Density Estimate (KDE) plot depicting the coverage of the descriptor space, projected onto the first two UMAP dimensions, for the two most relevant data sources, Fan et al. and Lombardo et al. D Scatter plot comparing half-life values for shared molecules between Fan et al. and Lombardo et al. data sources. E Half-life distribution comparison between the subset of molecules corresponding to approved drugs (orange) and the rest of molecules in the dataset (green). F Summary of key insights extracted from the Insight Report file
The distinct endpoint distribution and expanded chemical space suggest that the Fan et al. dataset may have a different applicability domain. In fact, while the other sources primarily rely on literature curation, the Fan et al. dataset was principally compiled from assays deposited in ChEMBL, which predominantly contain data from preclinical compounds, potentially explaining its diversity. We hypothesized that datasets derived from literature curation would be more enriched in approved drugs. To test this, we provided a reference set of drugs to AssayInspector to examine the presence and half-life distribution of drug-like molecules within these datasets (Fig. 2E and see Figure S2A in Supplementary material 1). This analysis revealed that, while all other datasets contained a high proportion of approved drugs (ranging from 42 to 87%), Fan et al. included only 6%. Interestingly, the drug molecules within the Fan et al. dataset exhibited much higher half-life values than the rest of the compounds, even comparable to those in other datasets. Since PK properties of approved drugs are already optimized for clinical use [63], we hypothesize that the shorter half-life observed in the Fan et al. dataset may be driven by its higher proportion of molecules in pre-clinical stages.
Overall, these results suggest that the observed differences between datasets may be partially explained by variations in their applicability domains, likely influenced by their respective sources of screened compounds. Importantly, the peculiarities of the Fan et al. dataset should be carefully considered, especially since it is one of the largest and most recent in the field, making it particularly well-suited for training prediction platforms like ADMETlab 3.0 [35].
Analyzing clearance datasets with AssayInspector
Focusing next on clearance, the Insight Report (see Figure S3 in Supplementary material 1) highlighted a higher degree of overlap between some sources and a significantly higher clearance distribution of the TDC dataset compared to the rest (two-sample KS test p < 0.0001) (Fig. 3A, B), with mean difference values ranging from 0.680 (4.8 ml/min/kg) to 0.855 (7.2 ml/min/kg) (see Figure S3A in Supplementary material 1). Notably, TDC is the second largest dataset, representing ~ 29% of the unique molecules, although its chemical space highly overlaps with other sources (Fig. 3C).
Fig. 3.
AssayInspector analysis of clearance datasets. Collection of plots generated using the AssayInspector package. A Violin plots showing the distribution of clearance values across data sources. Each embedded boxplot displays the median, upper and lower quartiles, and individual outliers. The two additional bar charts at the top indicate the number of molecules and the percentage of unique molecules contributed by each source to the final dataset. B Pairwise heatmaps illustrating the proportion of shared molecules. Color code shows the proportions calculated using the source on the y-axis as the reference. C Kernel Density Estimate (KDE) plot depicting the coverage of the descriptor space, projected onto the first two UMAP dimensions, for the two most relevant data sources, Lombardo and TDC (AstraZeneca). D Scatter plot comparing clearance values for shared molecules between TDC (AstraZeneca) and Lombardo et al. E Clearance distribution comparison between the subset of molecules corresponding to approved drugs (orange) and the rest of molecules in the dataset (green). F Summary of key insights extracted from the Insight Report file
Strikingly, the analysis of shared annotations across sources reveals inconsistencies, with TDC systematically reporting higher clearance values for molecules in common with other datasets (Fig. 3D and see Figure S3B in Supplementary material 1). Interestingly, although the presence of approved drugs is significantly lower in TDC (10% vs. 43–67%), their clearance values are more similar to those of other molecules within the dataset (Fig. 3E and see Figure S2B in Supplementary material 1). As in the half-life scenario, outliers were marginal across datasets (Fig. 3F).
The variation in annotations for the same molecules across datasets, which closely mirrors the distribution of their source data, suggests that these datasets were likely generated under different experimental conditions. Indeed, the TDC dataset originates from an AstraZeneca in vitro screening deposited in ChEMBL [39]. This distinction is important because in vitro intrinsic clearance was measured using human liver microsomes, which do not account for key in vivo processes, such as the proportion of the drug bound to plasma proteins that cannot be metabolized by the liver [64]. This factor is critical in determining effective drug accessibility by liver cells and, when unaccounted for, assays will tend to overestimate total clearance.
Since the other sources are primarily based on human intravenous data, we hypothesize that the distribution patterns and inconsistencies identified in the TDC dataset are likely driven by its in vitro experimental conditions, which fail to capture the full complexity of the biological environment drugs encounter in the human body [7, 65]. This is particularly relevant, because TDC is a leading open-access platform providing ADME benchmarks for modeling. Given the discrepancies observed with other datasets, models trained on alternative sources may not directly generalize well to this benchmark. Additionally, combining these datasets could introduce inconsistencies, potentially compromising ML model quality and limiting their applicability.
Informed data integration leads to more accurate predictive models
To assess the translational impact of the AssayInspector analysis, we trained ML models for both ADME properties using different dataset combinations, either by following or disregarding the recommendations provided by the Insight Report (see Figures S1 and S3 in Supplementary material 1, and Supplementary material 2 and 3). Specifically, we categorized datasets as either divergent (e.g. Fan et al. for half-life and TDC for clearance) or homogeneous (i.e. with similar distributions according to AssayInspector). We then evaluated how training models on these datasets alone or in combination affected the performance on each dataset individually. To ensure our observations hold across different modeling choices, we explored multiple ML models and descriptor combinations (see Supplementary material 5) based on top-performing submissions in the TDC leaderboard [66]. For result discussion, we selected the top-scoring model-descriptor combination. Importantly, our objective is not to identify the best model or descriptor, but rather to assess how data consistency influences predictive performance. We employed a 5 × 5 repeated cross-validation strategy and conducted basic hyperparameter tuning to ensure reliability and robustness of the results (see Methods section). The predictive performance of the models was primarily assessed using the coefficient of determination (R2), summarized in Tables 1 and 2, although we additionally measured other evaluation metrics (see Tables S4 and S5 in Supplementary material 1).
Table 1.
Performance of models trained on different aggregated half-life sources
| Aggregated training datasets | Fan et al. (Divergent source) | Obach et al. (TDC benchmark) | Lombardo et al. | e-Drug3D | DDPD 1.0 |
|---|---|---|---|---|---|
| Benchmark source (Obach et al.) | −5.34 ± 0.22 | 0.31 ± 0.04 | 0.23 ± 0.03 | 0.19 ± 0.02 | 0.18 ± 0.04 |
| All sources | 0.38 ± 0.04 | 0.32 ± 0.04 | 0.17 ± 0.03 | −0.08 ± 0.04 | 0.03 ± 0.05 |
| Divergent source (Fan et al.) | **0.76 ± 0.01** | −1.45 ± 0.14 | −1.52 ± 0.03 | −2.13 ± 0.13 | −1.56 ± 0.09 |
| Homogeneous sources (all but Fan et al.) | −8.18 ± 0.39 | **0.41 ± 0.04** | **0.33 ± 0.02** | **0.34 ± 0.02** | **0.28 ± 0.03** |
R2 scores (± standard deviation) evaluated on individual test sets (columns) for models trained on different combinations of aggregated datasets (rows). Each row indicates a unique training set composition derived from the aggregation of specific data sources, while each column corresponds to the R2 performance on a particular held-out test dataset. Results are averaged over five cross-validation folds, repeated five times independently.
Best performance per dataset (column) is highlighted in bold
Table 2.
Performance of models trained on different aggregated clearance sources
| Aggregated training datasets | TDC (Divergent source) | Lombardo et al. | Iwata et al. | Obach et al. | Gombar and Hall | Varma et al. [44] | Varma et al. [45] |
|---|---|---|---|---|---|---|---|
| Benchmark source (Obach et al.) | −2.52 ± 0.21 | 0.15 ± 0.04 | 0.31 ± 0.09 | 0.32 ± 0.06 | 0.22 ± 0.06 | 0.38 ± 0.04 | 0.24 ± 0.05 |
| All sources | −0.06 ± 0.13 | 0.29 ± 0.05 | 0.35 ± 0.05 | 0.38 ± 0.06 | **0.33 ± 0.06** | 0.36 ± 0.11 | **0.29 ± 0.10** |
| Divergent source (TDC) | **0.19 ± 0.03** | −0.68 ± 0.06 | −0.90 ± 0.08 | −0.97 ± 0.11 | −1.16 ± 0.19 | −1.15 ± 0.22 | −1.52 ± 0.23 |
| Homogeneous sources (all but TDC) | −2.47 ± 0.26 | **0.33 ± 0.01** | **0.36 ± 0.04** | **0.41 ± 0.04** | 0.29 ± 0.09 | **0.40 ± 0.03** | **0.29 ± 0.09** |
R2 scores (± standard deviation) evaluated on individual test sets (columns) for models trained on different combinations of aggregated datasets (rows). Each row indicates a unique training set composition derived from the aggregation of specific data sources, while each column corresponds to the R2 performance on a particular held-out test dataset. Results are averaged over five cross-validation folds, repeated five times independently.
Best performance per dataset (column) is highlighted in bold
Focusing first on half-life, integrating data from all sources resulted in overall low predictive performance, with the highest performance observed for the Fan et al. dataset (R2 = 0.38), followed by Obach et al. (R2 = 0.32). However, when the divergent Fan et al. dataset was modeled independently, predictive performance increased substantially (R2 = 0.76). Moreover, excluding the Fan dataset from the integration led to significant improvements in the remaining datasets, achieving moderate predictive performance (R2 > 0.3). Notably, integrating homogeneous datasets improved predictive performance, as illustrated by the nearly twofold increase in R2 when augmenting the Obach dataset, in contrast to the divergent dataset, for which the model performed better when trained alone.
A similar trend was observed for clearance, where excluding the divergent TDC dataset from the integration and modelling it independently led to an overall increase in R2. Indeed, modeling the divergent dataset alone restored its predictive performance, shifting it from non-predictive (R2 < 0) to weakly predictive (R2 = 0.19). Furthermore, as in the half-life case, supplementing dataset-specific models with data from similar sources enhanced predictive power, as illustrated by Obach’s dataset, where R2 increased from 0.32 to 0.41.
These results reinforce the findings from AssayInspector, highlighting how uncritical integration of publicly available molecular property datasets can introduce noise that significantly reduces predictive performance. This issue is particularly pronounced when substantial differences exist across sources, especially if a divergent source contributes a larger number of molecules to the integrated dataset. In contrast, combining datasets with similar characteristics can improve model performance by leveraging a larger data pool, outperforming models trained on individual datasets. Consequently, datasets identified as divergent by AssayInspector are better modelled separately, while those with similar distributions can be combined to increase predictive performance.
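This integration rule is straightforward to operationalize. The sketch below, using hypothetical source names and toy values rather than the actual study data, pools the sources flagged as homogeneous into a single training table while keeping each divergent source as its own modelling task:

```python
import pandas as pd

# Hypothetical per-source tables of SMILES and measured endpoint values;
# the names and numbers are illustrative only.
sources = {
    "obach":    pd.DataFrame({"smiles": ["CCO", "CCN"], "value": [1.2, 0.8]}),
    "lombardo": pd.DataFrame({"smiles": ["CCC", "CCO"], "value": [1.1, 1.3]}),
    "fan":      pd.DataFrame({"smiles": ["CCCl", "CBr"], "value": [5.0, 7.2]}),
}
divergent = {"fan"}  # e.g. sources flagged by a consistency assessment

# Pool only the homogeneous sources into one training table, tagging the
# provenance of each record; keep divergent sources as separate tasks.
homogeneous = pd.concat(
    [df.assign(source=name) for name, df in sources.items() if name not in divergent],
    ignore_index=True,
)
separate = {name: sources[name] for name in divergent}

print(len(homogeneous), list(separate))
```

One model is then trained on `homogeneous` and one per entry of `separate`, matching the best-performing rows of Tables 1 and 2.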
Evaluating standardization strategies for modeling divergent ADME datasets
Standardizing data ranges before modeling is a common practice in ML to prevent models from being biased toward overrepresented value ranges. We investigated whether standardizing each dataset’s endpoint distribution before training would be sufficient to harmonize sources and improve generalization, particularly when dealing with divergent datasets. To this end, we transformed each dataset’s distribution by subtracting the median and dividing by the interquartile range (see Methods section).
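In code, this robust standardization amounts to a per-dataset median/IQR scaling (equivalent to scikit-learn's `RobustScaler` with default settings), applied to each source independently before aggregation; the endpoint values below are illustrative:

```python
import numpy as np

def robust_standardize(values):
    """Per-dataset standardization: subtract the median, divide by the IQR."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)

# Each source is scaled on its own scale; a dataset reporting values ten
# times larger (e.g. in different units) maps onto the same range.
half_life_a = robust_standardize([1.0, 2.0, 4.0, 8.0])
half_life_b = robust_standardize([10.0, 20.0, 40.0, 80.0])
```

Because the transform removes each dataset's location and spread, systematic shifts between sources are absorbed, which is precisely why it may also erase chemically meaningful absolute differences.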
Our results showed that standardizing the distributions before integrating all sources restored predictive power (R2 > 0) across all datasets (Table 3). For clearance, performance was comparable to that obtained when training the divergent dataset independently. However, although half-life models trained on the standardized datasets achieved moderate predictive performance (R2 = 0.69 on the divergent test set and R2 = 0.27 on the homogeneous test set), their performance remained significantly lower (p < 0.001) than that of models trained separately on the divergent dataset and on the aggregated homogeneous sources. One possible explanation for this performance drop is that, whereas clearance differences may primarily reflect experimental variability, the distinct half-life distributions could result from fundamental differences in the dominant chemical spaces. We hypothesize that, because experimental variability often introduces systematic shifts across datasets, standardization enables the model to focus on relative differences rather than absolute values. In contrast, if half-life differences are driven by chemically meaningful features rather than experimental discrepancies, standardization may obscure relevant patterns between drug-optimized features and half-life, ultimately reducing model performance.
Table 3.
Performance of models trained on standardized clearance and half-life aggregated sources
| Aggregated training datasets | Homogeneous sources | Divergent source (Fan et al./TDC) |
|---|---|---|
| Half-life | | |
| Homogeneous sources | **0.36 ± 0.06** | −9.10 ± 0.50 |
| All sources | 0.27 ± 0.04 | **0.69 ± 0.02** |
| Clearance | | |
| Homogeneous sources | **0.36 ± 0.04** | −2.78 ± 0.17 |
| All sources | 0.33 ± 0.05 | **0.20 ± 0.03** |
R2 scores (± standard deviation) evaluated on individual test sets (columns) for models trained on different combinations of aggregated datasets (rows). Each row indicates a unique training set composition derived from the aggregation of specific data sources, while each column corresponds to the R2 performance on a particular held-out test dataset. Training datasets were standardized independently before aggregation. Results are averaged over five cross-validation folds, repeated five times independently.
Best performance per dataset (column) is highlighted in bold
Importantly, AssayInspector helps anticipate these discrepancies by analyzing feature space similarity and value discrepancies among overlapping molecules. If two datasets span distinct regions of chemical space but common compounds retain similar values, this suggests that their differences reflect distinct sampling of chemical space rather than inconsistencies in measurement. Conversely, if datasets cover similar chemical spaces but overlapping molecules exhibit systematic value differences, it points to systematic shifts, potentially driven by experimental variability. Regardless of the underlying cause, AssayInspector enables users to critically assess potential sources of variability, providing valuable insights to guide informed decisions on data integration strategies.
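The overlap-based part of this diagnostic can be illustrated with a minimal pandas sketch (hypothetical function, column names, and toy data, not AssayInspector's actual API): for compounds shared between two sources, a consistent nonzero median difference with a tight spread suggests a systematic offset, likely experimental, whereas near-zero shifts with little overlap point to distinct chemical-space sampling.

```python
import pandas as pd

def overlap_value_shift(df_a, df_b, key="smiles", value="value"):
    """Summarize per-compound value differences for molecules present in
    both datasets. A large median shift with small spread suggests a
    systematic (e.g. experimental) offset between the sources."""
    merged = df_a.merge(df_b, on=key, suffixes=("_a", "_b"))
    diff = merged[f"{value}_a"] - merged[f"{value}_b"]
    return {
        "n_overlap": len(merged),
        "median_shift": float(diff.median()),
        "iqr_shift": float(diff.quantile(0.75) - diff.quantile(0.25)),
    }

# Illustrative sources whose shared compounds differ by a constant offset.
a = pd.DataFrame({"smiles": ["CCO", "CCN", "CCC"], "value": [1.0, 2.0, 3.0]})
b = pd.DataFrame({"smiles": ["CCO", "CCN", "CBr"], "value": [2.0, 3.0, 9.0]})
print(overlap_value_shift(a, b))
```

Here the two shared compounds differ by a constant offset with zero spread, the signature of a systematic shift rather than chemical-space divergence.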
Discussion
In this work, we highlight key challenges in ADME modeling using public datasets, particularly those arising from dataset heterogeneity and distributional misalignments. To address these challenges, we developed AssayInspector, a Python tool designed to analyze and anticipate potential pitfalls in data aggregation, enabling more informed and effective data handling. Through statistical analyses, visualizations, and diagnostic summaries, AssayInspector helps identify distributional differences, outliers, and batch effects while providing recommendations to mitigate their negative impact on predictive modeling.
While AssayInspector applies to any assay with cross-source variability, we focus on ADME datasets due to their relevance in drug discovery and the limited public data available. In this context, integration approaches offer a promising strategy to boost predictive performance and model robustness. Applying AssayInspector to various ADME datasets, we uncovered significant misalignments between widely used benchmark datasets (e.g., TDC) and gold-standard datasets for key ADME properties such as half-life and clearance. Our findings demonstrate that while directly integrating these sources can reduce model performance, an informed integration improves generalizability. Alternatively, data standardization can help accommodate discrepancies arising from experimental conditions in the datasets we modeled, but the resulting increase in dataset size did not consistently translate into performance improvements.
To better integrate property datasets with differing value distributions, alternative strategies should be explored. When experimental conditions are well-documented, tailored preprocessing can help minimize discrepancies between datasets. For instance, in the case of intrinsic clearance, annotations for the fraction unbound in plasma (fub) can be used to estimate hepatic metabolic clearance via in vitro–in vivo extrapolation (IVIVE) methods [25, 67]. However, implementing this approach requires domain expertise, detailed experimental knowledge, and access to the necessary measurements for data correction. Alternatively, advanced ML strategies, such as transfer learning or Model-Agnostic Meta-Learning (MAML) [68, 69], can leverage prior knowledge while quickly adapting to dataset-specific characteristics, thereby mitigating source-specific biases and improving generalizability. However, such techniques are not model-agnostic: they are largely constrained to neural networks, which struggle in low-data regimes, a common challenge in ADME datasets [4, 5, 70]. An alternative approach is to adopt a multi-task setting, where divergent datasets are treated as independent tasks rather than a single aggregated dataset. This strategy has shown promise in modelling multiple ADME properties simultaneously, improving both predictive performance and generalization [23].
Overall, AssayInspector provides a model-agnostic, data-driven approach to automatically anticipate challenges in data integration, supporting more informed decision-making in machine learning modeling. This approach can prove particularly relevant in federated learning scenarios, where models must integrate datasets from multiple sources and ensure consistency to effectively leverage the combined data [71]. Additionally, anticipating data inconsistencies helps prevent performance deterioration and identify areas for improvement in existing benchmarks. For example, Obach et al.’s half-life dataset is included as one of the ADME benchmarks in TDC. Our results suggest that models incorporating data from similar datasets, such as Lombardo et al., e-Drug3D and DDPD, can generalize well to the TDC benchmark, improving performance. However, in the case of clearance, the TDC benchmark differs significantly from the other datasets, making transfer learning between sources more challenging. These results emphasize the need for comprehensive benchmarks that more accurately reflect the in vivo PK context, ultimately leading to more relevant, reliable and effective predictive models.
Acknowledgements
The authors gratefully acknowledge all researchers at Chemotargets for their input and feedback during the construction of the package.
Abbreviations
- ADME
Absorption, Distribution, Metabolism, Excretion
- ML
Machine learning
- PK
Pharmacokinetic
- MAE
Mean absolute error
- RMSE
Root mean squared error
- R2
Coefficient of determination
- TDC
Therapeutics Data Commons
Author contributions
Conceptualization, R.P., A.F.-T. and J.M.; Data assembly & curation, R.P. and L.M.; Methodology, R.P., A.F.-T., L.M., and R.G.; Model validation and analysis, R.P. and A.F.-T.; Original draft preparation, R.P. and A.F.-T.; Final review and editing, R.P., A.F.-T., L.M., R.G. and J.M.
Funding
This research received no external funding.
Data availability
The datasets supporting the conclusions of this article are available in the AssayInspector Github repository, https://github.com/chemotargets/assay_inspector.
Declarations
Competing interests
R.P., L.M., R.G., A.F.-T., and J.M. are currently employees of the company Chemotargets, of which J.M. is co-founder and co-owner. The company had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Adrià Fernández-Torras, Email: adria.fernandez@chemotargets.com.
Jordi Mestres, Email: jordi.mestres@udg.edu.
References
- 1. Danishuddin KV, Faheem M, Woo Lee K (2022) A decade of machine learning-based predictive models for human pharmacokinetics: advances and challenges. Drug Discov Today 27:529–537. 10.1016/j.drudis.2021.09.013
- 2. Pillai N, Abos A, Teutonico D, Mavroudis PD (2024) Machine learning framework to predict pharmacokinetic profile of small molecule drugs based on chemical structure. Clin Transl Sci 17:e13824. 10.1111/cts.13824
- 3. Wainberg M, Merico D, Delong A, Frey BJ (2018) Deep learning in biomedicine. Nat Biotechnol 36:829–838. 10.1038/nbt.4233
- 4. Mohammed S, Budach L, Feuerpfeil M et al (2025) The effects of data quality on machine learning performance on tabular data. Inf Syst 132:102549. 10.1016/j.is.2025.102549
- 5. van Tilborg D, Brinkmann H, Criscuolo E et al (2024) Deep learning for low-data drug discovery: hurdles and opportunities. Curr Opin Struct Biol 86:102818. 10.1016/j.sbi.2024.102818
- 6. Goecks J, Jalili V, Heiser LM, Gray JW (2020) How machine learning will transform biomedicine. Cell 181:92–101. 10.1016/j.cell.2020.03.022
- 7. Smith DA, Beaumont K, Maurer TS, Di L (2019) Clearance in drug design. J Med Chem 62(5):2245–2255. 10.1021/acs.jmedchem.8b01263
- 8. Di L, Keefer C, Scott DO et al (2012) Mechanistic insights from comparing intrinsic clearance values between human liver microsomes and hepatocytes to guide drug design. Eur J Med Chem 57:441–448. 10.1016/j.ejmech.2012.06.043
- 9. Narayanan A, Kapoor S (2025) Why an overreliance on AI-driven modelling is bad for science. Nature 640:312–314. 10.1038/d41586-025-01067-2
- 10. Bala B, Behal S (2024) A brief survey of data preprocessing in machine learning and deep learning techniques. In: 2024 8th International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). pp 1755–1762
- 11. Wouters OJ, McKee M, Luyten J (2020) Estimated research and development investment needed to bring a new medicine to market, 2009–2018. JAMA 323:844–853. 10.1001/jama.2020.1166
- 12. Harrison RK (2016) Phase II and phase III failures: 2013–2015. Nat Rev Drug Discov 15:817–818. 10.1038/nrd.2016.184
- 13. Dowden H, Munro J (2019) Trends in clinical success rates and therapeutic focus. Nat Rev Drug Discov 18:495–496. 10.1038/d41573-019-00074-z
- 14. Sun D, Gao W, Hu H, Zhou S (2022) Why 90% of clinical drug development fails and how to improve it? Acta Pharm Sin B 12:3049–3062. 10.1016/j.apsb.2022.02.002
- 15. Liu T, Hwang L, Burley SK et al (2025) BindingDB in 2024: a FAIR knowledgebase of protein-small molecule binding data. Nucleic Acids Res 53:D1633–D1644. 10.1093/nar/gkae1075
- 16. Wang R, Fang X, Lu Y et al (2005) The PDBbind database: methodologies and updates. J Med Chem 48:4111–4119. 10.1021/jm048957q
- 17. Leo A, Hansch C, Elkins D (1971) Partition coefficients and their uses. Chem Rev 71:525–616. 10.1021/cr60274a001
- 18. Obach RS, Lombardo F, Waters NJ (2008) Trend analysis of a database of intravenous pharmacokinetic parameters in humans for 670 drug compounds. Drug Metab Dispos 36:1385–1405. 10.1124/dmd.108.020479
- 19. Przybylak KR, Madden JC, Covey-Crump E et al (2018) Characterisation of data resources for in silico modelling: benchmark datasets for ADME properties. Expert Opin Drug Metab Toxicol 14:169–181. 10.1080/17425255.2017.1316449
- 20. Huang K, Fu T, Gao W et al (2021) Therapeutics data commons: machine learning datasets and tasks for drug discovery and development. Preprint at 10.48550/arXiv.2102.09548
- 21. Sorkun MC, Khetan A, Er S (2019) AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Sci Data 6:143. 10.1038/s41597-019-0151-1
- 22. Menestrina L, Parrondo-Pizarro R, Gomez I et al (2025) Refined ADME profiles for ATC drug classes. Pharmaceutics 17:308. 10.3390/pharmaceutics17030308
- 23. Napoli JA, Reutlinger M, Brandl P et al (2025) Multitask deep learning models of combined industrial absorption, distribution, metabolism, and excretion datasets to improve generalization. Mol Pharm 22:1892–1900. 10.1021/acs.molpharmaceut.4c01086
- 24. Liyaqat T, Ahmad T, Saxena C (2024) Advancements in molecular property prediction: a survey of single and multimodal approaches. Preprint at 10.48550/arXiv.2408.09461
- 25. Güvenç Paltun B, Mamitsuka H, Kaski S (2021) Improving drug response prediction by integrating multiple data sources: matrix factorization, kernel and network-based approaches. Brief Bioinform 22:346–359. 10.1093/bib/bbz153
- 26. Salesforce, Inc. Tableau Desktop (2025.2.1) [Computer software]. Salesforce, Inc., San Francisco, CA 94105, United States. https://www.tableau.com/. Accessed 2 Aug 2025
- 27. Seshadri R. AutoViz: The One-Line Automatic Data Visualization Library (0.1.905) [Computer software]. GitHub. https://github.com/AutoViML/AutoViz. Accessed 2 Aug 2025
- 28. Landrum G. RDKit: Open-source cheminformatics (2024.09.4). https://www.rdkit.org. Accessed 2 Aug 2025
- 29. Virtanen P, Gommers R, Oliphant TE et al (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272. 10.1038/s41592-019-0686-2
- 30. McInnes L, Healy J, Saul N, Großberger L (2018) UMAP: Uniform manifold approximation and projection. J Open Source Softw 3:861. 10.21105/joss.00861
- 31. Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9:90–95. 10.1109/MCSE.2007.55
- 32. Waskom ML (2021) Seaborn: statistical data visualization. J Open Source Softw 6(60):3021. 10.21105/joss.03021
- 33. Lombardo F, Berellini G, Obach RS (2018) Trend analysis of a database of intravenous pharmacokinetic parameters in humans for 1352 drug compounds. Drug Metab Dispos 46:1466–1477. 10.1124/dmd.118.082966
- 34. Fan J, Shi S, Xiang H et al (2024) Predicting elimination of small-molecule drug half-life in pharmacokinetics using ensemble and consensus machine learning methods. J Chem Inf Model 64:3080–3092. 10.1021/acs.jcim.3c02030
- 35. Fu L, Shi S, Yi J et al (2024) ADMETlab 3.0: an updated comprehensive online ADMET prediction platform enhanced with broader coverage, improved performance, API functionality and decision support. Nucleic Acids Res 52:W422–W431. 10.1093/nar/gkae236
- 36. Zdrazil B, Felix E, Hunter F et al (2024) The ChEMBL database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Res 52:D1180–D1192. 10.1093/nar/gkad1004
- 37. Li Q, Ma S, Zhang X et al (2022) DDPD 1.0: a manually curated and standardized database of digital properties of approved drugs for drug-likeness evaluation and drug development. Database 2022:baab083. 10.1093/database/baab083
- 38. Pihan E, Colliandre L, Guichou J-F, Douguet D (2012) e-Drug3D: 3D structure collections dedicated to drug repurposing and fragment-based drug design. Bioinformatics 28:1540–1541. 10.1093/bioinformatics/bts186
- 39. Wenlock M, Tomkinson N (2025) Experimental in vitro DMPK and physicochemical data on a set of publicly disclosed compounds. 10.6019/CHEMBL3301361. Accessed 8 Aug 2025
- 40. Iwata H, Matsuo T, Mamada H et al (2022) Predicting total drug clearance and volumes of distribution using the machine learning-mediated multimodal method through the imputation of various nonclinical data. J Chem Inf Model 62:4057–4065. 10.1021/acs.jcim.2c00318
- 41. Gaulton A, Bellis LJ, Bento AP et al (2012) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40:D1100–D1107. 10.1093/nar/gkr777
- 42. Lombardo F, Waters NJ, Argikar UA et al (2013) Comprehensive assessment of human pharmacokinetic prediction based on in vivo animal pharmacokinetic data, Part 2: Clearance. J Clin Pharmacol 53:178–191. 10.1177/0091270012440282
- 43. Gombar VK, Hall SD (2013) Quantitative structure-activity relationship models of clinical pharmacokinetics: clearance and volume of distribution. J Chem Inf Model 53:948–957. 10.1021/ci400001u
- 44. Varma MVS, Feng B, Obach RS et al (2009) Physicochemical determinants of human renal clearance. J Med Chem 52:4844–4852. 10.1021/jm900403j
- 45. Varma MVS, Obach RS, Rotter C et al (2010) Physicochemical space for optimum oral bioavailability: contribution of human intestinal absorption and first-pass elimination. J Med Chem 53:1098–1108. 10.1021/jm901371v
- 46. PubChem Bioassay Record for AID 1159394, Source: ChEMBL (2025). In: Natl. Cent. Biotechnol. Inf. https://pubchem.ncbi.nlm.nih.gov/bioassay/1159394. Accessed 7 Mar 2025
- 47. PubChem Bioassay Record for AID 1159396, Source: ChEMBL (2025). In: Natl. Cent. Biotechnol. Inf. https://pubchem.ncbi.nlm.nih.gov/bioassay/1159396. Accessed 7 Mar 2025
- 48. Foster JA, Houston JB, Hallifax D (2011) Comparison of intrinsic clearances in human liver microsomes and suspended hepatocytes from the same donor livers: clearance-dependent relationship and implications for prediction of in vivo clearance. Xenobiotica 41:124–136. 10.3109/00498254.2010.530700
- 49. Hosea NA, Collard WT, Cole S et al (2009) Prediction of human pharmacokinetics from preclinical information: comparative accuracy of quantitative prediction approaches. J Clin Pharmacol 49:513–533. 10.1177/0091270009333209
- 50. Kim S, Chen J, Cheng T et al (2025) PubChem 2025 update. Nucleic Acids Res 53:D1516–D1525. 10.1093/nar/gkae1059
- 51. WHO Collaborating Centre for Drug Statistics Methodology (2025) ATC classification index with DDDs, 2024. Oslo, Norway. https://www.who.int/tools/atc-ddd-toolkit/atc-classification. Accessed 8 Nov 2024
- 52. Pedregosa F, Varoquaux G, Gramfort A et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
- 53. Bergstra J, Yamins D, Cox DD (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In: Dasgupta S, McAllester D (eds) Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol 28. PMLR, p 115–123
- 54. Rainio O, Teuho J, Klén R (2024) Evaluation metrics and statistical tests for machine learning. Sci Rep 14:6086. 10.1038/s41598-024-56706-x
- 55. Ash JR, Wognum C, Rodríguez-Pérez R et al (2025) Practically significant method comparison protocols for machine learning in small molecule drug discovery. J Chem Inf Model 65:9398–9411. 10.1021/acs.jcim.5c01609
- 56. Toutain PL, Bousquet-Mélou A (2004) Plasma clearance. J Vet Pharmacol Ther 27:415–425. 10.1111/j.1365-2885.2004.00605.x
- 57. Rowland M (1984) Protein binding and drug clearance. Clin Pharmacokinet 9:10–17. 10.2165/00003088-198400091-00002
- 58. Toutain PL, Bousquet-Mélou A (2004) Plasma terminal half-life. J Vet Pharmacol Ther 27:427–439. 10.1111/j.1365-2885.2004.00600.x
- 59. Smith DA, Beaumont K, Maurer TS, Di L (2018) Relevance of half-life in drug design. J Med Chem 61:4273–4282. 10.1021/acs.jmedchem.7b00969
- 60. Venkatraman V (2021) FP-ADMET: a compendium of fingerprint-based ADMET prediction models. J Cheminform 13:75. 10.1186/s13321-021-00557-5
- 61. Kausar S, Falcao AO (2019) Analysis and comparison of vector space and metric space representations in QSAR modeling. Molecules 24:1698. 10.3390/molecules24091698
- 62. Sato A, Miyao T, Jasial S, Funatsu K (2021) Comparing predictive ability of QSAR/QSPR models using 2D and 3D molecular representations. J Comput Aided Mol Des 35:179–193. 10.1007/s10822-020-00361-7
- 63. Xia Y, Wang Y, Wang Z, Zhang W (2024) A comprehensive review of molecular optimization in artificial intelligence-based drug discovery. Quant Biol 12:15–29. 10.1002/qub2.30
- 64. Bohnert T, Gan L-S (2013) Plasma protein binding: from discovery to development. J Pharm Sci 102:2953–2994. 10.1002/jps.23614
- 65. Chao P, Uss AS, Cheng K (2010) Use of intrinsic clearance for prediction of human hepatic clearance. Expert Opin Drug Metab Toxicol 6:189–198. 10.1517/17425250903405622
- 66. Jiang N, Quazi M, Schweikert C et al (2023) Enhancing ADMET property models performance through combinatorial fusion analysis. Preprint at 10.26434/chemrxiv-2023-dh70x
- 67. Hallifax D, Foster JA, Houston JB (2010) Prediction of human metabolic clearance from in vitro systems: retrospective analysis and prospective view. Pharm Res 27:2150–2161. 10.1007/s11095-010-0218-3
- 68. Zhuang F, Qi Z, Duan K et al (2021) A comprehensive survey on transfer learning. Proc IEEE 109:43–76. 10.1109/JPROC.2020.3004555
- 69. Finn C, Abbeel P, Levine S (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In: Precup D, Teh YW (eds) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol 70. PMLR, p 1126–1135
- 70. Bhhatarai B, Walters WP, Hop CECA et al (2019) Opportunities and challenges using artificial intelligence in ADME/tox. Nat Mater 18:418–422. 10.1038/s41563-019-0332-5
- 71. Hanser T, Ahlberg E, Amberg A et al (2025) Data-driven federated learning in drug discovery with knowledge distillation. Nat Mach Intell 7:423–436. 10.1038/s42256-025-00991-2