CaliciBoost: Performance-driven evaluation of molecular representations for caco-2 permeability prediction

Huong Van Le; Weibin Ren; Junhong Kim; Yukyung Yun; Young Bin Park; Young Jun Kim; Bok Kyung Han; Inho Choi; Jong-Il Park; Hwi-yeol Yun; Jae-Mun Choi

doi:10.1186/s13321-025-01137-7

2025 Dec 22;17:184. doi: 10.1186/s13321-025-01137-7



PMCID: PMC12752011 PMID: 41430621

Abstract

Caco-2 permeability serves as a critical in vitro indicator for predicting the oral absorption of drug candidates during early-stage drug discovery. To enhance the accuracy and efficiency of computational predictions, we systematically investigated the impact of eight molecular feature representation types, including 2D/3D descriptors, structural fingerprints, and deep learning-based embeddings, combined with automated machine learning techniques to predict Caco-2 permeability. We evaluated model performance across various molecular representations using two datasets differing in scale and chemical diversity, namely the TDC benchmark and curated OCHEM data. Among the tested fingerprints and descriptors, PaDEL, Mordred, and RDKit emerged as particularly effective for predicting Caco-2 permeability. Notably, our model CaliciBoost, identified through training optimization, achieved the lowest MAE and secured the top position on the TDC Caco-2 Leaderboard. Furthermore, for both PaDEL and Mordred on the TDC data, incorporating 3D descriptors appears to improve performance over using 2D features alone, as supported by feature importance analyses. These findings highlight the effectiveness of automated machine learning approaches in ADMET modeling and offer practical guidance for feature selection in data-limited prediction tasks.

Scientific contribution

This work provides a systematic benchmarking of eight molecular feature representation types in conjunction with AutoML for Caco-2 permeability prediction. It highlights the critical role of 3D descriptors in enhancing predictive accuracy and establishes a PaDEL-based AutoML model that achieves top-ranked performance on a public leaderboard. The study also emphasizes the value of interpretable feature selection (via SHAP and permutation importance), offering insights into feature contributions and generalizable modeling strategies for cheminformatics applications.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13321-025-01137-7.

Keywords: Caco-2 permeability, Molecular descriptors, Molecular fingerprints, Feature representation, AutoML, ADMET prediction, PaDEL, Mordred, SHAP analysis

Introduction

Caco-2 cell permeability is a widely used in vitro proxy for assessing the intestinal absorption of drug candidates in early-stage drug discovery. Accurately modeling this property enables effective compound prioritization, optimizes experimental resources, and reduces both the cost and time of ADMET screening. Given that oral bioavailability is a critical determinant of clinical success, predictive modeling of Caco-2 permeability plays a central role in rational drug design and high-throughput screening pipelines. However, despite its importance, building robust predictive models remains challenging due to limited data availability and the complexity of molecular feature engineering.

Caco-2 permeability has long been utilized as a representative in vitro indicator for predicting the oral absorption of drug candidates in humans [1]. In addition to synthetic drug candidates, there has been growing attention on natural products including herbal medicines, functional foods, and dietary supplements due to their increasing use in supporting health and treating chronic diseases such as HIV/AIDS and cancer [2, 3]. These products often contain bioactive compounds that may interact with co-administered medications, influencing absorption, metabolism, and therapeutic efficacy. However, many of these interactions remain underexplored, and healthcare professionals are often unaware of their potential impact [4]. To address this, the Caco-2 cell model has been widely adopted to study the intestinal permeability of natural compounds and their potential for herb–drug interactions. These assessments are vital for identifying risks of adverse effects or therapeutic failure, especially when natural products are used concomitantly with conventional drugs. The use of Caco-2 screening for such interactions is therefore essential not only in pharmaceutical research but also in functional ingredient development and regulatory safety evaluation, supporting the rational and safe use of natural medicines within modern drug discovery frameworks.

Various machine learning approaches have been developed to model Caco-2 permeability, typically relying on molecular descriptors and structural fingerprints. Early computational efforts to predict Caco-2 permeability primarily relied on traditional machine learning algorithms such as random forests, support vector machines, and multiple linear regression, trained on basic physicochemical descriptors and molecular fingerprints. Commonly used features included molecular weight, topological polar surface area (TPSA), hydrogen bond donors and acceptors, LogP, and binary fingerprints such as Morgan (ECFP). Descriptor sets calculated via cheminformatics toolkits such as RDKit, PaDEL, and Mordred have been widely adopted for their accessibility and coverage of structural, electronic, and topological features. For example, Welling et al. employed random forests on a small set of surfactant-like enhancers using standard molecular descriptors [5], while Wang et al. utilized boosting algorithms on descriptors calculated by Molecular Operating Environment software to improve prediction accuracy [6]. Esaki et al. demonstrated the utility of Mordred descriptors in regression-based modeling [7]. Recursive feature selection combined with random forest on RDKit descriptors was used by Falcón-Cano et al. to predict permeability [8]. Victor Acuña-Guzmán et al. applied a hybrid Support Vector Machine—Random Forest—Gradient Boosting model using PaDEL-derived descriptors [9] while Wang et al. used RDKit 2D descriptors and Morgan fingerprints for large-scale industry-relevant Caco-2 prediction tasks [10].

Although these studies have made progress, they typically focused on a limited subset of features or a single modeling approach. Few have conducted systematic comparisons across a wide spectrum of fingerprint and descriptor types. Moreover, the relative contributions of different representations to model performance and generalization remain poorly quantified. This gap limits our ability to make informed decisions when selecting molecular features for Caco-2 permeability prediction tasks.

Recent years have witnessed a growing interest in deep learning approaches such as Graph Neural Networks (GNNs), which can directly learn molecular representations from graph-structured inputs without relying on precomputed molecular features. Graph Convolutional Networks (GCNs) have been applied to ADMET prediction tasks, including Caco-2 cell permeability, using modular frameworks like DeepPurpose integrated with the Therapeutics Data Commons (TDC) benchmark suite [11, 12]. In DeepPurpose’s pipeline, molecules are encoded using the drug_encoding='GCN' option, where atoms are treated as nodes and bonds as edges. GCNs then apply neighborhood aggregation operations to learn graph-based feature representations, which are subsequently fed into an end-to-end neural network for regression prediction. However, GCN models often require large and diverse datasets to generalize well across chemical space. On small to medium-sized datasets such as those available for Caco-2 permeability, their performance is often hindered by limited data availability. This is evident in benchmark results from the Therapeutics Data Commons (TDC), where classical ensemble models such as MapLight, BaseBoosting, and XGBoost consistently outperform CNN- and GNN-based models (Morgan + MLP (DeepPurpose), RDKit2D + MLP (DeepPurpose)) [13].

Given these limitations, selecting the optimal molecular feature representation is critical. Fingerprints such as Morgan [14], Avalon [15], ErG [16], and MACCS [17] offer efficient substructure-based encodings. In parallel, descriptors derived from RDKit [18], PaDEL [19], Mordred [20], and CDDD [21] capture physicochemical and structural properties, often providing complementary information. However, balancing their use, especially when datasets are not large, requires careful experimentation and validation.

To address the complexity and interdependence of model development tasks, Automated Machine Learning (AutoML) has emerged as a powerful paradigm. AutoML automates key components of the machine learning pipeline, including feature selection, preprocessing, algorithm choice, and hyperparameter optimization. In cheminformatics, where datasets are typically high-dimensional, heterogeneous, and small to medium in size, AutoML offers a scalable, reproducible, and expert-free approach to model building. It is increasingly applied in QSAR modeling, ADMET property prediction, and virtual screening [22–26].

Among various frameworks, AutoGluon [26] was selected in this study due to its superior performance on high-dimensional tabular data, strong ensemble learning capabilities, and efficient handling of missing or sparse inputs. It combines multiple model types such as LightGBM, XGBoost, CatBoost, neural networks, and k-NN and performs joint optimization using Bayesian search strategies [27, 28]. AutoGluon also supports preprocessing operations like normalization, categorical encoding, and imputation, and its automated ensemble construction boosts generalization while minimizing manual tuning. These characteristics make AutoGluon especially well-suited for cheminformatics tasks, where model interpretability, consistency, and scalability are essential.

This study aims to fill the gap in systematic benchmarking of molecular feature representations for Caco-2 permeability prediction. We evaluate eight distinct feature types including Morgan, Avalon, ErG, MACCS fingerprints and descriptors from RDKit, PaDEL, Mordred, and CDDD across two datasets with different sizes and properties. Using AutoGluon, we assess individual predictive performance, investigating how representation choice affects accuracy, robustness, and feature importance in a data-limited setting.

The significance of this study lies in its contribution to practical cheminformatics workflows. By identifying optimal molecular features and demonstrating the utility of AutoML, this work provides actionable guidance for researchers building ADMET prediction models under real-world constraints. Furthermore, the findings support evidence-based decisions in featurization and model selection, advancing reproducibility and effectiveness in drug discovery pipelines.

Experimental design

This study employs two datasets, TDC and OCHEM, to predict Caco-2 permeability using eight different molecular feature representations: Morgan FP, Avalon FP, ErG FP, RDKit descriptors, MACCS FP, PaDEL, Mordred, and CDDD. Each representation is used to generate the features on which an AutoML model is trained. Feature importance is assessed via permutation importance and SHAP values, followed by selection of top-ranked features. These top features are used to retrain the model, and hyperparameters are optimized using Bayesian optimization. The final models are evaluated using MAE, RMSE, R², and Pearson correlation to identify the best-performing and most robust feature representation for Caco-2 permeability prediction. The full pipeline of the experimental design is illustrated in Fig. 1.

Fig. 1 — Overall workflow for Caco-2 permeability prediction using AutoML and multi-representation molecular features

Datasets

TDC

The TDC.Caco2_Wang dataset provided by the Therapeutics Data Commons (TDC) [13] was used to train and evaluate AutoML models for Caco-2 permeability prediction. This dataset contains 906 compounds with experimentally measured Caco-2 permeability values, originally curated from Wang et al. [6]. TDC supplies a scaffold-based data split, which is commonly used to evaluate generalization to structurally novel compounds. Each entry includes a SMILES string and the corresponding permeability value, making it well-suited for QSAR modeling in standardized machine learning workflows. We retained the scaffold-based split as provided by TDC without further modification. No additional filtering or curation was applied, as the dataset has already been pre-processed and cleaned, making it a reliable benchmark for ADMET prediction models.

OCHEM

A curated dataset of Caco-2 permeability values was obtained from the OCHEM (Online Chemical Modeling Environment) platform, a web-based system for managing and automating QSAR modeling [29]. The dataset contains 9402 compound entries with experimental apparent Caco-2 permeability (Papp) values measured across Caco-2 cell monolayers. Each entry includes SMILES, Papp values, and additional metadata such as compound name, PubMed ID, pH, temperature, and P-gp inhibition status. To ensure consistency and remove potential noise from assay variability, we applied a systematic preprocessing and filtering strategy. We retained only measurements performed under standard physiological conditions (~pH 7.4, ~37 °C) and excluded records with ambiguous or missing experimental metadata (e.g., undefined pH, temperature), deduplicated entries by SMILES keeping the most complete representative record, and prioritized values measured in the apical → basolateral (A → B) direction, which are most commonly used to estimate intestinal absorption. This curated OCHEM dataset was then filtered and subjected to scaffold-based splitting, using the same strategy as TDC, to promote generalization toward novel chemical scaffolds. This curation process was inspired by the filtering criteria adopted by Wang et al. [6] for the TDC dataset, but adapted to the richer metadata and greater heterogeneity present in OCHEM.

Data preprocessing

TDC

In this study, we used the scaffold split provided by TDC [13], which partitions the dataset into:

Training set: 728 compounds
Test set: 182 compounds

Although the TDC dataset provides predefined train and test splits, we conducted an additional examination of the data distribution and structural clustering to verify the integrity of the split. As visualized in Fig. 2, the distribution of Caco-2 permeability values remains consistent across the training and test subsets. Furthermore, Principal Component Analysis (PCA) and clustering, implemented using the scikit-learn package, reveal that chemical structures are well spread and that the test set adequately covers the structural diversity observed in the training set. These observations affirm the reliability and representativeness of the TDC split for robust model evaluation.

Fig. 2 — Overview of data distribution and structural clustering in the TDC dataset. A Distribution of Caco-2 permeability in the training and test sets, showing consistent permeability ranges across subsets. B PCA projection of molecular structures colored by structural clusters, confirming chemical diversity within the dataset. C PCA projection showing separation between train and test sets, indicating broad structural coverage and no major distributional bias

OCHEM

In the OCHEM dataset, permeability direction was variably labeled, with some entries specified as apical-to-basolateral (A → B), others as basolateral-to-apical (B → A), and the majority lacking direction metadata. Since Caco-2 permeability assays are typically conducted in the A → B direction to model drug absorption in the discovery phase [30, 31], we assumed that entries without explicit direction labels followed this conventional setup. This assumption aligns with standard practices reported in the literature. For example, the study by Pade and Stavchansky [30] demonstrated that effective permeability coefficients (Pe) are typically evaluated in the A → B direction under physiological conditions (pH 7.4, 37 °C) and that for many passively transported drugs (e.g., propranolol, naproxen, salicylic acid), Pe(ap → bl) and Pe(bl → ap) do not differ significantly. This indicates that passive diffusion dominates and that the A → B direction is a valid reference. Accordingly, the 36 data points explicitly labeled as B → A were excluded. All remaining records, whether labeled A → B or unlabeled, were retained and treated as representative of A → B transport, yielding a dataset of 9366 entries for the next processing step.

Subsequent filtering was applied based on pH conditions. As most Caco-2 assays are conducted at pH 7.4 [32], entries reporting different pH values were excluded to minimize experimental variability. A total of 573 entries were removed in this step. For records with missing pH values, we imputed pH 7.4, assuming standard assay conditions.

In addition, temperature information was used to ensure consistency with typical biological assays conducted at 37 °C [33]. Entries with reported temperatures differing from 37 °C or with missing temperature data were excluded, resulting in the removal of 14 entries.

Next, we standardized chemical structure representations by converting all SMILES strings to their canonical form using RDKit [18]. One compound that could not be successfully converted was discarded. We also removed 53 entries with negative Papp values, which are physiologically implausible and likely the result of data entry errors.
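The record-level filters described so far (direction, pH, temperature, and sign of Papp) can be summarized as a single predicate. The following is a minimal sketch rather than the authors' code: the function name, argument names, the "B->A" label string, and the numeric tolerance are all hypothetical.

```python
from typing import Optional

PH_TARGET = 7.4        # assay pH retained in the curated set
TEMP_TARGET_C = 37.0   # assay temperature retained in the curated set
TOLERANCE = 1e-9       # hypothetical tolerance for float comparison


def keep_record(direction: Optional[str], ph: Optional[float],
                temp_c: Optional[float], papp: float) -> bool:
    """Return True if one record passes the curation rules in the text."""
    # Entries explicitly labeled B -> A are excluded; unlabeled ones are kept.
    if direction == "B->A":
        return False
    # Missing pH is imputed as 7.4; any other reported pH is excluded.
    if ph is None:
        ph = PH_TARGET
    if abs(ph - PH_TARGET) > TOLERANCE:
        return False
    # Temperature must be reported and equal to 37 degrees C.
    if temp_c is None or abs(temp_c - TEMP_TARGET_C) > TOLERANCE:
        return False
    # Negative Papp values are treated as data entry errors.
    if papp < 0:
        return False
    return True
```

Note the asymmetry that the text describes: a missing pH is imputed, whereas a missing temperature leads to exclusion.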

Furthermore, duplicate records were removed to eliminate redundancy and ensure the uniqueness of compound-permeability pairs. Entries were considered duplicates if they shared identical canonical SMILES strings and experimental conditions. We selected the record whose permeability value (Papp) was closest to the mean of all replicates for that compound, rather than averaging all values. This approach allows us to preserve a real experimental value rather than a synthetic average, while still mitigating the impact of outliers. A total of 3244 duplicates were excluded, ensuring that each data point reflected a distinct and reliable experimental observation.
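The replicate-selection rule can be illustrated with a short sketch. It assumes that experimental conditions are already uniform after the preceding filters, so records are grouped by canonical SMILES alone; the function name and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pick_representatives(records):
    """Keep, per compound, the measured Papp closest to the replicate mean.

    records: iterable of (canonical_smiles, papp) pairs.
    Returns a dict mapping each SMILES to one real measured value.
    """
    groups = defaultdict(list)
    for smiles, papp in records:
        groups[smiles].append(papp)

    chosen = {}
    for smiles, values in groups.items():
        center = mean(values)
        # min() keeps an observed value instead of a synthetic average;
        # ties resolve to the first value encountered.
        chosen[smiles] = min(values, key=lambda v: abs(v - center))
    return chosen
```

For replicates 1.0, 2.0, and 9.0 the mean is 4.0, so the retained value is 2.0: the outlier shifts the mean but is never selected itself.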

Finally, to stabilize the regression task and reduce skewness in permeability values, all Papp measurements were transformed using a base-10 logarithm (log₁₀). This transformation, commonly used in QSAR modeling, normalizes data distributions and improves model performance.

After applying all preprocessing steps, a total of 5481 curated records were retained for model development.
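As a consistency check, the removal counts reported in the preceding paragraphs can be tallied against the starting size of 9402 entries. The step labels below are shorthand introduced for this sketch.

```python
# Removal counts as reported in the text, in processing order.
steps = [
    ("explicit B->A direction", 36),
    ("pH other than 7.4", 573),
    ("temperature not 37 C or missing", 14),
    ("SMILES not convertible", 1),
    ("negative Papp", 53),
    ("duplicate records", 3244),
]

remaining = 9402  # raw OCHEM entries
for reason, removed in steps:
    remaining -= removed
    print(f"{reason:<34} -{removed:>5} -> {remaining}")
```

The running total passes through 9366 after the direction filter and ends at 5481, matching the figures stated in the text.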

To enable cluster-aware and value-balanced splitting of these 5481 entries into training and testing subsets, the following steps were applied:

Molecular structures were encoded using 1024-bit Morgan fingerprints, and then projected into a lower-dimensional space using Principal Component Analysis (PCA) for visualization and structural diversity assessment. In this 2D PCA space, KMeans clustering was performed to group molecules by structural similarity. In parallel, the log-transformed Caco-2 permeability values were binned into discrete intervals to allow for stratified sampling by permeability levels.

Using both cluster assignments and Caco-2 permeability bins, we performed a stratified split to construct training and test sets that preserved the overall distribution of both chemical structure and permeability. As a result, the dataset was divided into:

Training set: 4377 compounds
Test set: 1095 compounds

As shown in Fig. 3, the distribution of Papp between the train and test sets remains consistent. The PCA projections further confirm structural diversity, with distinct cluster separation and a balanced distribution of train and test compounds in chemical space. These results indicate that the sampling process preserved both label and structural diversity without introducing significant bias.
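The stratified split can be sketched as follows, taking the KMeans cluster labels and the permeability bins as already computed inputs. This is an illustration under stated assumptions, not the authors' implementation: the function name, the test fraction of 0.2, and the seed are hypothetical choices.

```python
import random
from collections import defaultdict


def stratified_split(cluster_ids, bin_ids, test_fraction=0.2, seed=42):
    """Split sample indices so that every (cluster, bin) stratum
    contributes roughly test_fraction of its members to the test set."""
    strata = defaultdict(list)
    for idx, key in enumerate(zip(cluster_ids, bin_ids)):
        strata[key].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(strata):            # fixed order for reproducibility
        members = strata[key]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)
```

Because rounding happens per stratum, very small strata may contribute no test samples, so the realized test share can deviate slightly from the nominal fraction.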

Feature extraction

To represent molecular structures in a machine-readable format, we extracted features using multiple cheminformatics tools and packages. Morgan, Avalon, ErG, and MACCS fingerprints, as well as RDKit descriptors, were computed using the RDKit toolkit (ver. 2023.9.6) [18]. CDDD descriptors were generated via a pretrained sequence-to-sequence autoencoder, while PaDEL and Mordred descriptors were calculated using the padelpy (ver. 0.1.14) [34] and Mordred (community ver. 2.0.6) [35] Python packages, respectively.

Morgan fingerprints (1024 bits) were computed using GetHashedMorganFingerprint, which encodes circular atom environments based on atomic neighborhoods.
Avalon fingerprints (1024 bits) were generated using GetAvalonCountFP, capturing predefined substructure patterns.
ErG fingerprints (315 dimensions) were calculated via GetErGFingerprint, encoding topological relationships between pharmacophoric features.
RDKit descriptors were extracted using MolecularDescriptorCalculator, yielding over 200 physicochemical and topological properties.
MACCS fingerprints (167 bits) were generated using GenMACCSKeys, capturing the presence or absence of predefined structural keys commonly associated with bioactivity and chemical functionality.

All RDKit-based features were converted to NumPy arrays and used as input for downstream machine learning models.
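The RDKit-based featurization listed above might be assembled roughly as follows. This is a sketch under stated assumptions: the Morgan radius of 2 is not given in the text, the full RDKit descriptor list is used, and the `featurize` function and `EXPECTED_LENGTHS` table are names introduced here for illustration.

```python
# Fingerprint lengths as stated in the text.
EXPECTED_LENGTHS = {"morgan": 1024, "avalon": 1024, "erg": 315, "maccs": 167}


def featurize(smiles, radius=2):
    """Return a dict of NumPy feature vectors for one SMILES string.

    radius=2 is an assumption; the text does not state the Morgan radius.
    """
    # Imported lazily so this module can be loaded without RDKit installed.
    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Avalon import pyAvalonTools
    from rdkit.Chem import Descriptors, MACCSkeys, rdMolDescriptors, rdReducedGraphs
    from rdkit.ML.Descriptors import MoleculeDescriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    def counts_to_array(fp, n_bits):
        # Sparse count vectors expose non-zero entries as {index: count}.
        arr = np.zeros(n_bits, dtype=float)
        for idx, count in fp.GetNonzeroElements().items():
            arr[idx] = count
        return arr

    morgan = rdMolDescriptors.GetHashedMorganFingerprint(mol, radius, nBits=1024)
    avalon = pyAvalonTools.GetAvalonCountFP(mol, nBits=1024)

    maccs = np.zeros((0,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(MACCSkeys.GenMACCSKeys(mol), maccs)

    names = [name for name, _ in Descriptors._descList]
    calculator = MoleculeDescriptors.MolecularDescriptorCalculator(names)

    return {
        "morgan": counts_to_array(morgan, 1024),
        "avalon": counts_to_array(avalon, 1024),
        "erg": np.asarray(rdReducedGraphs.GetErGFingerprint(mol), dtype=float),
        "maccs": maccs.astype(float),
        "rdkit_desc": np.array(calculator.CalcDescriptors(mol), dtype=float),
    }
```

Calling `featurize("CCO")` in an environment with RDKit installed should yield vectors whose lengths match `EXPECTED_LENGTHS` for the four fingerprint types.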

PaDEL descriptors were calculated using the padelpy wrapper for the PaDEL-Descriptor software, producing 1875 2D and 3D descriptors per molecule. PaDEL internally generates 3D coordinates from SMILES using its default preprocessing pipeline and then calculates the 3D descriptors.
Mordred descriptors were computed with the Mordred Python package, which supports over 1800 descriptors including physicochemical, topological, and 3D properties. Mordred performs automated preprocessing for each descriptor, including hydrogen handling, Kekulization, and aromaticity perception. Unlike PaDEL, Mordred does not handle 3D conformer generation internally. We therefore generated 3D conformers from SMILES using Open Babel [36] and performed energy minimization before calculating 3D descriptors.
CDDD descriptors were generated using a pretrained sequence-to-sequence autoencoder that translates between different SMILES representations. The encoder produces a 512-dimensional continuous embedding per molecule. We used the official CDDD Python package provided by Winter et al. [21].

AutoML training and optimization

To develop predictive models for Caco-2 permeability, we employed AutoGluon-Tabular (v0.7.0), an AutoML framework optimized for tabular datasets with high-dimensional feature spaces. For each feature representation type (Morgan, Avalon, ErG, RDKit descriptors, MACCS, CDDD, PaDEL, and Mordred), a separate AutoGluon model was trained using the best_quality preset with a random seed of 42. The target variable was Caco-2 permeability, and the primary evaluation metric was mean absolute error (MAE), consistent with the Therapeutics Data Commons (TDC) leaderboard. Additional metrics including root mean squared error (RMSE), R-squared (R²), and Pearson correlation coefficient (R) were also calculated to provide a more comprehensive performance assessment.
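For reference, the four evaluation metrics can be computed from paired observed and predicted values as in the dependency-free sketch below. The function name and the dictionary keys are illustrative, and R² is taken here as the coefficient of determination, 1 − SS_res/SS_tot.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R^2 and Pearson R for paired sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot          # coefficient of determination

    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    pearson = cov / math.sqrt(ss_tot * var_p)

    return {"MAE": mae, "RMSE": rmse, "R2": r2, "PearsonR": pearson}
```

A constant offset between predictions and observations shows why the metrics are reported together: Pearson R stays at 1 while MAE, RMSE, and R² all register the bias.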

Model performance across feature representation types was compared based on MAE. For each feature representation type, top-ranked features were identified using permutation importance and SHAP values. Permutation feature importance is a model-agnostic technique that estimates the importance of each feature by measuring the decrease in model performance when the feature’s values are randomly shuffled, thereby breaking its relationship with the target variable [37]. SHAP (SHapley Additive exPlanations) is a unified framework for interpreting model predictions by assigning each feature an importance value for individual predictions, combining principles from cooperative game theory with additive feature attribution methods to ensure consistency and local accuracy [38]. The model was then retrained using only the most informative features of each feature representation type to assess the impact of dimensionality reduction on performance and interpretability. In the final stage, we applied Bayesian optimization within the AutoGluon framework to fine-tune the model’s hyperparameters. This was implemented using the predictor.fit() function with hyperparameter_tune_kwargs enabled (searcher="auto", num_trials=20), which allows AutoGluon to automatically perform Bayesian optimization using an internal surrogate model. The specific hyperparameters tuned depend on each model type and are selected internally by AutoGluon. This procedure aimed to maximize predictive accuracy while maintaining model robustness.
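A configuration along these lines could be passed to AutoGluon. The sketch below is an assumption-laden illustration: the helper names, the label column "Y", the "local" scheduler, and the combination of the best_quality preset with hyperparameter tuning in a single call are not confirmed by the text, and seeding is omitted.

```python
def build_fit_kwargs(num_trials=20):
    """Keyword arguments for TabularPredictor.fit reflecting the settings
    named in the text (searcher="auto", num_trials=20)."""
    return {
        "presets": "best_quality",
        "hyperparameter_tune_kwargs": {
            "searcher": "auto",
            "num_trials": num_trials,
            "scheduler": "local",   # assumption: run trials on the local machine
        },
    }


def train_predictor(train_df, label="Y"):
    """Fit an AutoGluon predictor optimized for MAE.

    train_df: pandas DataFrame of features plus the label column.
    label="Y" is a hypothetical column name.
    """
    # Imported lazily so the configuration helper works without AutoGluon.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label, eval_metric="mean_absolute_error")
    return predictor.fit(train_df, **build_fit_kwargs())
```

Keeping the settings in a plain dictionary makes it straightforward to reuse the same configuration across the eight feature representations.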

Results and discussion

Effect of dataset split: TDC vs. OCHEM

To evaluate the influence of dataset split strategy on model performance, we systematically compared model outcomes using the same feature representations and AutoML frameworks under both the TDC scaffold split and the OCHEM custom split. Results are summarized in Supplementary Table 1 (TDC) and Supplementary Table 2 (OCHEM), with a graphical comparison provided in Fig. 4 (TDC) and Fig. 5 (OCHEM).

Fig. 4 — MAE of models trained with the TDC dataset across different feature representations using both all and top features. Figure presents the mean absolute error (MAE) of models trained on the TDC dataset using eight molecular feature representation types: Morgan FP, Avalon FP, ErG FP, RDKit descriptors, MACCS FP, PaDEL, Mordred, and CDDD. For each type, performance is compared between models trained on the full feature set (all) and those retrained using top-ranked features with hyperparameter optimization. Models trained with top features consistently outperformed their counterparts using all features, confirming the effectiveness of feature selection and model refinement. Among all representations, models trained with top features from PaDEL, Mordred, and RDKit descriptors achieved the lowest MAE values, highlighting their strong predictive power for Caco-2 permeability

Fig. 5 — MAE of models trained with the OCHEM dataset across different feature representations using both all and top features. Figure shows the mean absolute error (MAE) of models trained on the OCHEM dataset using eight molecular feature representation types: Morgan FP, Avalon FP, ErG FP, RDKit descriptors, MACCS FP, PaDEL, Mordred, and CDDD. In most cases, retraining models on a refined subset of top-ranked features combined with hyperparameter optimization led to comparable or even superior performance compared to using the full feature set, highlighting the benefit of targeted feature selection. Nonetheless, for certain representations like ErG FP and MACCS FP, models trained on the complete feature set showed marginally better results than their top-feature counterparts

Across all eight feature representation types (Morgan fingerprints, Avalon fingerprints, ErG fingerprints, RDKit descriptors, MACCS fingerprints, PaDEL, Mordred, and CDDD), models trained and evaluated on the TDC split consistently achieved lower MAE and RMSE and higher R² and Pearson correlation than those on the OCHEM split, indicating better predictive accuracy. This trend highlights TDC’s stability as a benchmark for evaluating Caco-2 permeability models.

However, it is important to note that the OCHEM dataset, being approximately five times larger than TDC, provides a broader and more diverse chemical space. Despite this increased complexity, model performance on OCHEM remains reasonably strong across all representations, as shown in Fig. 9. The results suggest that the AutoML framework retains predictive capacity even under more challenging, real-world-like conditions, demonstrating the robustness of the trained models beyond controlled benchmark splits.

Fig. 9 — Comparative model performance using PaDEL and Mordred descriptors on the OCHEM dataset. A–C Model trained using PaDEL descriptors using 2D features, 3D features and combined 2D + 3D top features. D–F Model trained using Mordred descriptors using 2D features, 3D features and combined 2D + 3D top features

In Fig. 4, we visualize MAE comparisons for both all and top features across each molecular representation on the TDC dataset. A similar analysis for the OCHEM dataset is shown in Fig. 5, where overall MAE values tend to be higher. These plots further emphasize the increased difficulty of the OCHEM split, likely stemming from its greater structural diversity, larger chemical space, and less uniform data distribution, in contrast to the more stable scaffold-based split used in TDC.

In summary, the TDC benchmark provides a more stable, balanced, and reproducible environment for evaluating model performance, making it well-suited for internal validation. On the other hand, the OCHEM dataset introduces a more realistic and challenging setting, thereby serving as a more rigorous testbed for assessing model robustness and generalization. These insights are critical when selecting appropriate benchmarks for model development and deployment in cheminformatics and virtual screening applications.

Performance of individual feature representations

Using all features in each feature representation

Each molecular representation (Morgan, Avalon, ErG, RDKit descriptors, MACCS, PaDEL, Mordred, and CDDD) was independently evaluated by training AutoML models using its entire set of features, without prior filtering or feature selection. The evaluation was conducted separately on the TDC and OCHEM splits to examine the robustness and generalizability of each representation.

On the TDC dataset, models trained on RDKit descriptors, Morgan fingerprints, and Avalon fingerprints achieved the lowest MAE values (0.2898, 0.2995, and 0.3009, respectively), closely followed by Mordred and PaDEL (0.3033 and 0.3058). These models also reached high R² (up to 0.7304) and Pearson correlation coefficients (up to 0.8642), as shown in Supplementary Table 1. This indicates that these representations capture structural patterns highly relevant to Caco-2 permeability and are effective even without additional dimensionality reduction.

In contrast, on the OCHEM dataset, overall performance is weaker than on TDC across all representations (see Supplementary Table 2), with higher MAE and lower R², consistent with the broader observation that TDC provides a more stable benchmark (see “Effect of dataset split: TDC vs. OCHEM” section). However, among the representations, PaDEL descriptors, RDKit descriptors, and Morgan fingerprints maintained relatively better performance on OCHEM.

These findings highlight that RDKit descriptors, Avalon fingerprints, Morgan fingerprints, PaDEL, and Mordred contribute substantially to model performance in Caco-2 permeability prediction even without prior feature selection, underscoring their intrinsic value as robust input representations in AutoML-based modeling frameworks.

Using top features in each feature representation

To improve both interpretability and modeling efficiency, we performed feature importance analysis using two complementary methods: permutation importance [39] and SHAP (SHapley Additive exPlanations) values [40]. These techniques enabled us to quantify how much each feature contributed to model performance across different molecular representations—permutation importance capturing the impact of feature perturbation on accuracy, and SHAP providing fine-grained attribution of features to individual predictions. This dual approach ensured robust identification of the most informative features.
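The permutation importance idea can be made concrete with a small model-agnostic sketch. It is a generic illustration, not the routine used in the study: the function names, the number of repeats, and the seed are hypothetical, and importance is expressed as the average increase in MAE after shuffling one column.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average MAE increase per feature when that feature is shuffled.

    predict: callable mapping a list of feature rows to predictions.
    X: list of rows (each a list of feature values); y: list of targets.
    """
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)   # break the feature-target relationship
            X_perm = [row[:j] + [value] + row[j + 1:]
                      for row, value in zip(X, column)]
            increases.append(mean_absolute_error(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature the model never uses receives an importance of exactly zero, since shuffling it leaves every prediction unchanged.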

Initially, we evaluated retraining models using a fixed number of top features (e.g., top 200, 100, 50, or 30). However, this strategy did not consistently improve performance and, in some cases, resulted in worse metrics than models trained on all features of the same representation. This outcome suggests that the optimal number of top features is not universal across representations and that rigid cutoffs may exclude features that contribute to non-linear interactions important for prediction.

To address this, we conducted a screening experiment, evaluating model performance using the top N features, where N ranged from 1 to 200. Together with the SHAP and permutation rankings, this approach enabled us to identify an optimal subset of features that not only preserved or improved predictive accuracy compared to the full feature set but also reduced dimensionality and enhanced model interpretability. Performance trends for each representation (MAE vs. number of top features) are provided in Supplementary Fig. 1 (TDC dataset) and Supplementary Fig. 2 (OCHEM dataset).
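The screening described above can be sketched as a simple loop over N. The example below is a hedged illustration rather than the authors' code. The data are synthetic, the least-squares learner stands in for the AutoML models, the correlation-based ranking stands in for the SHAP and permutation rankings, and the function name `screen_top_n` is hypothetical.

```python
import numpy as np


def screen_top_n(X_train, y_train, X_valid, y_valid, ranking, max_n=200):
    """Return {N: validation MAE} for models fitted on the top-N ranked features."""
    results = {}
    for n in range(1, min(max_n, len(ranking)) + 1):
        cols = ranking[:n]
        # Stand-in learner: least squares with an intercept column.
        A_train = np.column_stack([X_train[:, cols], np.ones(len(X_train))])
        A_valid = np.column_stack([X_valid[:, cols], np.ones(len(X_valid))])
        coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
        results[n] = float(np.mean(np.abs(A_valid @ coef - y_valid)))
    return results


# Synthetic data: 20 features, of which only the first three carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + 0.3 * rng.normal(size=300)
X_train, X_valid = X[:200], X[200:]
y_train, y_valid = y[:200], y[200:]

# Hypothetical importance ranking: absolute correlation with the target,
# computed on the training split only.
corr = [abs(np.corrcoef(X_train[:, j], y_train)[0, 1]) for j in range(X.shape[1])]
ranking = np.argsort(corr)[::-1]

mae_by_n = screen_top_n(X_train, y_train, X_valid, y_valid, ranking)
best_n = min(mae_by_n, key=mae_by_n.get)  # N with the lowest validation MAE
```

Because the full feature set is one of the candidates, the selected subset can never score worse than the full set on the split used for screening. Whether the gain carries over to new data should be checked on a separate test split.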

The refined feature subsets of each representation type were then used to retrain the models with Bayesian optimization in the AutoGluon framework, resulting in consistent improvements in MAE and correlation metrics. This analysis not only improved model performance but also provided insight into which chemical substructures or properties were most predictive of Caco-2 permeability within each molecular representation.

Notably, among models trained on the TDC dataset, the PaDEL top-feature model achieved the best overall results, with the lowest MAE of 0.2525, RMSE of 0.3216, R² of 0.7805, and a Pearson correlation of 0.8835. Based on the permutation importance screening and subsequent SHAP analysis, several features consistently contributed most to model performance. The most influential descriptors combined physicochemical and autocorrelation features, including nAcid (number of acidic functional groups), ATSC6p, AATSC1p, ATSC2m, ATSC0c (centered autocorrelation descriptors weighted by polarizability, mass, or charge), ATS8m (mass-weighted Broto-Moreau autocorrelation), ATSC3c and ATSC3m (centered autocorrelations weighted by charge and mass, respectively), SpMax7_Bhe (Burden modified eigenvalue), and VE2_Dzi (Barysz matrix-derived eigenvector coefficient weighted by ionization potential), among others. The contribution of each feature to the model’s prediction was further interpreted using SHAP values, which quantify the marginal impact of a feature on the predicted log-transformed Caco-2 permeability. A detailed breakdown of SHAP values for the top-ranked descriptors is provided in Supplementary Figs. 3–6. The Mordred top-feature model followed closely (MAE = 0.2613, RMSE = 0.3413, R² = 0.7533, Pearson R = 0.8687). The model trained with RDKit top descriptors also showed significant performance gains, with MAE = 0.2637, RMSE = 0.3324, R² = 0.7655, and Pearson R = 0.8750 (Fig. 6).

Fig. 6 — Performance comparison of models trained across eight molecular feature representations on the TDC dataset. A Morgan fingerprints, B Avalon fingerprints, C ErG fingerprints, D RDKit descriptors, E MACCS fingerprints, F PaDEL descriptors, G Mordred descriptors, and H CDDD embeddings. Each bar reflects the model’s predictive performance using top-ranked features selected for each representation. Among these, the PaDEL-based model yielded the best overall results (MAE = 0.2525), followed closely by Mordred (MAE = 0.2613) and RDKit (MAE = 0.2637). These findings highlight the strong predictive power of descriptor-based feature representations, especially those incorporating 3D information, for Caco-2 permeability modeling

A similar trend was observed with the OCHEM dataset, where retraining models on top-ranked features followed by Bayesian optimization also led to noticeable improvements in prediction performance (Fig. 7). While overall scores on OCHEM remained slightly lower due to its increased complexity and chemical diversity, the relative gains reinforce the effectiveness of this two-step refinement strategy for enhancing model quality across datasets.

Fig. 7 — Performance comparison of models trained across eight molecular feature representations on the OCHEM dataset. A Morgan fingerprints, B Avalon fingerprints, C ErG fingerprints, D RDKit descriptors, E MACCS fingerprints, F PaDEL descriptors, G Mordred descriptors, and H CDDD embeddings. Similar to the results from the TDC dataset, retraining models using top-ranked features and applying Bayesian hyperparameter optimization improved prediction accuracy across almost all representations. Although overall MAE scores were slightly higher on OCHEM due to its larger size and structural diversity, the observed performance gains confirm the robustness and generalizability of the two-step refinement strategy across complex, real-world datasets

Among the eight molecular feature representation types evaluated in this study, only PaDEL and Mordred can generate both 2D and 3D molecular descriptors. This provides a unique opportunity to investigate the relative importance and contribution of 3D structural information in predicting Caco-2 permeability. To this end, we conducted additional experiments to assess model performance when trained separately on 2D descriptors, 3D descriptors, and the full descriptor sets (2D + 3D) from both PaDEL and Mordred, using the TDC dataset (Fig. 8) and the OCHEM dataset (Fig. 9).

Fig. 8 — Comparative model performance using PaDEL and Mordred descriptors on the TDC dataset. A–C Models trained on PaDEL descriptors using 2D features, 3D features, and combined 2D + 3D top features. D–F Models trained on Mordred descriptors using 2D features, 3D features, and combined 2D + 3D top features

Through this analysis, we aimed to understand whether incorporating 3D descriptors enhances model accuracy and whether these features appear among the most informative descriptors selected during model refinement. In certain cases, such as PaDEL descriptors on the TDC dataset and Mordred descriptors on both the TDC and OCHEM datasets, the inclusion of 3D features led to further improvements, especially when combined with top feature selection and hyperparameter optimization. However, on the OCHEM dataset, the model trained solely on PaDEL 2D features performed slightly better than the combined 2D + 3D model (Fig. 9). This indicates that 2D structural information alone can be highly informative for permeability prediction. Thus, while 2D descriptors remain foundational, incorporating 3D information may enhance model performance depending on the feature set and dataset characteristics.

A complete list of the selected top features, along with their corresponding SHAP values, is provided in Supplementary Figs. 3–6 for both PaDEL and Mordred representations. Although the vast majority of top-ranked descriptors in the model trained with the PaDEL representation on the TDC dataset were 2D features, a small subset of 3D descriptors still provided significant predictive value. Specifically, of the top 73 features identified by feature importance analysis on the TDC dataset, only three were 3D descriptors: TDB6u and TDB3v (topological distance-based autocorrelations) and WNSA-1 (3D charged partial surface area) (Supplementary Fig. 3). Despite their limited representation, including these 3D features led to an approximately 15.5% reduction in model MAE, confirming their non-negligible contribution to the model’s performance. This highlights that while 2D descriptors predominantly drive predictive accuracy, carefully selected 3D descriptors can offer valuable complementary information.
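A percentage reduction such as the one quoted above is conventionally computed relative to the reference model's error. The sketch below assumes that the reference is the model without the 3D descriptors; the text does not state the baseline MAE, so the numbers in the example are placeholders, and the function name `relative_mae_reduction` is hypothetical.

```python
def relative_mae_reduction(mae_reference, mae_new):
    """Percentage reduction in MAE relative to the reference model."""
    if mae_reference <= 0:
        raise ValueError("reference MAE must be positive")
    return 100.0 * (mae_reference - mae_new) / mae_reference


# Placeholder values for illustration only; not taken from the paper.
reduction = relative_mae_reduction(mae_reference=0.40, mae_new=0.30)
```

With these placeholder values the reduction is 25%: the absolute drop of 0.10 is divided by the reference MAE of 0.40.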

To benchmark the best model developed during this process, referred to as CaliciBoost, we followed the official submission and verification guidelines provided by the Therapeutics Data Commons (TDC) team. The best-performing model was an XGBoost regressor trained on the top-ranked PaDEL descriptors, which include both 2D and 3D molecular features (MAE = 0.2525, RMSE = 0.3216, R² = 0.7805, Pearson R = 0.8835). Using the official TDC Caco-2 benchmark, we compared CaliciBoost against several baseline models with diverse architectural backbones, including boosting-based models such as XGBoost and BaseBoosting, graph-aware variants such as MapLight GNN, and convolutional models such as MolMapNet-D. As shown in Figs. 10 and 11, CaliciBoost achieved an MAE of 0.2560 ± 0.006 over the first five seeds, as required by TDC, outperforming all other entries and reaching the top-ranked position on the TDC Caco-2 Leaderboard. These results suggest that CaliciBoost offers a state-of-the-art solution for Caco-2 permeability prediction and holds promise for future official benchmarking. They also support the effectiveness of our AutoML-based pipeline and feature refinement strategy and indicate that the model generalizes across diverse compound structures.
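A "mean ± deviation over five seeds" figure of the kind reported above can be obtained by aggregating one test MAE per seed. The sketch below is an assumption-laden illustration. The per-seed values are placeholders rather than the paper's results, the helper name `summarize_seeds` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator the leaderboard applies.

```python
from statistics import fmean, pstdev


def summarize_seeds(maes, min_runs=5):
    """Return (mean, population standard deviation) of per-seed MAE values."""
    if len(maes) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(maes)}")
    return fmean(maes), pstdev(maes)


# Placeholder per-seed test MAEs for illustration only; not the paper's values.
seed_maes = [0.30, 0.32, 0.28, 0.31, 0.29]
mean_mae, sd_mae = summarize_seeds(seed_maes)
report = f"{mean_mae:.3f} ± {sd_mae:.3f}"
```

Reporting the spread alongside the mean shows how sensitive the model is to the random seed, which matters when leaderboard entries differ by only a few thousandths in MAE.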

Fig. 10 — MAE comparison between models on the TDC Caco-2 Permeability Leaderboard and our model CaliciBoost. CaliciBoost achieved a mean absolute error (MAE) of 0.2560 ± 0.006 averaged over five seeds, indicating potential state-of-the-art performance compared to current leaderboard entries

Fig. 11 — Performance of CaliciBoost on the TDC Caco-2 Permeability Leaderboard. CaliciBoost achieved a mean absolute error (MAE) of 0.2560 ± 0.006 across five seeds, outperforming all other models on TDC Caco-2 Leaderboard

Furthermore, CaliciBoost is readily accessible through the Pharmaco-Net platform, an integrated AI-driven drug discovery platform [41]. Users can run the CaliciBoost module directly within Pharmaco-Net to predict Caco-2 permeability for new compounds from their SMILES strings. This integration not only enhances practical usability but also accelerates early-stage ADMET profiling by providing rapid and accurate permeability predictions in a user-friendly environment. Detailed usage instructions for the CaliciBoost module on the Pharmaco-Net platform are provided in Supplementary Fig. 7.

Conclusion

In this study, we investigated the effectiveness of various molecular feature representations for predicting Caco-2 permeability using AutoML-based modeling. Our key contributions are summarized as follows.

Comprehensive benchmarking of molecular representations

We conducted a systematic evaluation of eight molecular feature representation types, including 2D/3D descriptors and fingerprints, across two datasets (TDC and OCHEM). This benchmarking revealed that PaDEL, Mordred, and RDKit descriptors are among the most effective for Caco-2 permeability prediction, offering a practical reference for future model development.

Evidence of the importance of 2D and 3D descriptors

Our experiments demonstrate that 2D descriptors remain critically important for Caco-2 permeability prediction. For example, the model trained with PaDEL 2D features alone on the OCHEM dataset achieved performance comparable to, and slightly better than, the combined 2D + 3D model (Fig. 9). However, for certain feature representations, such as Mordred on both the TDC and OCHEM datasets, incorporating 3D descriptors led to modest performance gains, particularly when combined with top feature selection and hyperparameter optimization. These findings suggest that while 2D features provide strong baseline predictive power, the inclusion of 3D structural information can offer additional benefit depending on the feature representation and dataset characteristics. However, we acknowledge that using a single 3D conformer for 3D descriptor calculation may limit descriptor robustness, and future work will explore multi-conformer or ensemble-based approaches.

Development of a state-of-the-art model

Leveraging the insights from our feature evaluation and AutoML optimization, we developed the CaliciBoost model, which achieved rank 1 on the TDC Caco-2 permeability prediction leaderboard with an MAE of 0.2560 ± 0.006, outperforming all previously reported methods. The model is integrated into the Pharmaco-Net platform for direct use.

These findings suggest that combining top-ranked molecular descriptors with AutoML-based modeling can provide robust and generalizable models for permeability prediction. Notably, our analysis also revealed that feature selection improved the performance of descriptor-based models, especially for the TDC dataset, while fingerprint-based models benefited less from this step. This observation is consistent with the nature of fingerprints, which are typically fixed-length binary vectors designed to efficiently capture substructural presence or absence patterns and are already relatively sparse and noise-tolerant. In contrast, molecular descriptors, often high-dimensional and continuous, contain redundant or irrelevant features that can obscure learning signals unless properly selected. For future work, combining multiple feature representation types—especially those shown to contribute meaningfully such as PaDEL, Mordred, and RDKit—may further enhance model performance. In addition, expanding the dataset by incorporating Caco-2 permeability measurements from alternative sources like PubChem BioAssay or by merging complementary datasets such as TDC and OCHEM to construct a more comprehensive benchmark could support training of more complex architectures, including deep graph-based models, and help address current limitations in data diversity and volume. Furthermore, linking predicted Caco-2 permeability values with experimentally validated human intestinal absorption metrics such as Fa (fraction absorbed) could strengthen biological relevance and support the translation of in silico predictions to in vivo contexts.

Supplementary Information

Additional file 1 (11.9 MB, docx)

Additional file 2 (416.5 KB, csv)

Acknowledgements

H.V.L. gratefully acknowledges the valuable advice and encouragement provided by Dr. Quoc-Khanh Nguyen throughout the development of this research work.

Abbreviations

Caco-2: Human colon epithelial cancer cell line
Papp: Caco-2 permeability
ADMET: Chemical absorption, distribution, metabolism, excretion, toxicity
TDC: Therapeutics data commons
OCHEM: Online chemical modeling environment
FP: Fingerprint
ErG: Extended reduced graph
CDDD: Continuous data-driven descriptors
PCA: Principal component analysis
QSAR: Quantitative structure activity relationship
AutoML: Automated machine learning
CNN: Convolutional neural network
GNN: Graph neural network
SHAP: SHapley Additive exPlanations
MAE: Mean absolute error
RMSE: Root mean square error
R²: Coefficient of determination
Pearson R: Pearson correlation coefficient

Author contributions

H.V.L., S.L., Y.B.P., Y.J.K., H.Y.Y., J.I.P., I.C., B.K.H., and J.M.C. conceptualized the study. H.V.L. and W.R. implemented methods. J.K. and Y.Y. performed data analysis and preprocessing. H.V.L. and W.R. featurized and trained the models. H.V.L., W.R., J.K., and Y.Y. contributed to the manuscript. H.V.L. assembled and revised the final manuscript. All authors read and approved the final manuscript.

Funding

This research was supported by the Ministry of Food and Drug Safety (MFDS), Republic of Korea, through a grant (Project No. RS-2024-00332490) and the Graduate School Education Program of Regulatory Sciences for Functional Food (21153MFDS604) in 2025. It was also supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (Grant No. RS-2022-KH130308). Additional financial support was provided by the Ministry of Trade, Industry, and Energy (MOTIE), Korea, under the “Infrastructure Program for Industrial Innovation” supervised by the Korea Institute for Advancement of Technology (KIAT) (Grant No. RS-2024-00434342). Furthermore, this study was supported by the 2025 project titled “Establishing the Foundation for In Silico Industrialization in Uiseong-gun, Gyeongsangbuk-do.”

Availability of data and materials

The TDC dataset used in this study is publicly available from the Therapeutics Data Commons (TDC) Caco-2 Permeability task page: https://tdcommons.ai/single_pred_tasks/adme/#caco-2-cell-effective-permeability-wang-et-al. The OCHEM dataset was curated, filtered, and processed by the authors, and is available at: https://huggingface.co/datasets/junhong1222/Caco2-Ochem-dataset. The data are also provided as Supplementary Data. The GitHub repository containing the full implementation, including code and the pretrained model for CaliciBoost (which achieved an MAE of 0.2560 ± 0.006, ranked 1st on the TDC Caco-2 permeability prediction leaderboard), is accessible at: https://github.com/Calici/CaliciBoost.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Bok Kyung Han, Email: hanmoo@korea.ac.kr.

Jae-Mun Choi, Email: jm.choi@calici.co.

References

1. Yee S (1997) In vitro permeability across Caco-2 cells (colonic) can predict in vivo (small intestinal) absorption in man—fact or myth. Pharm Res 14(6):763–766. 10.1023/A:1012188625088
2. Awodele O, Olayemi SO, Adeyemo TA, Sanya TA, Dolapo DC (2012) Use of complementary medicine amongst patients on antiretroviral drugs in an HIV treatment centre in Lagos, Nigeria. Curr Drug Saf 7(2):120–125. 10.2174/157488612802715627
3. Moltó J, Miranda C, Malo S, Valle M, Andreu A, Bonafont X, Clotet B (2012) Use of herbal remedies among HIV-infected patients: patterns and correlates. Med Clin (Barc) 138(3):93–98. 10.1016/j.medcli.2011.04.031
4. Klepser TB, Doucette WR, Horton MR (2000) Assessment of patients’ perceptions and beliefs regarding herbal therapies. Pharmacotherapy 20(1):83–87. 10.1592/phco.20.1.83.34658
5. Welling SH, Clemmensen LKH, Buckley ST, Hovgaard L, Brockhoff PB, Refsgaard HHF (2015) In silico modelling of permeation enhancement potency in Caco-2 monolayers based on molecular descriptors and random forest. Eur J Pharm Biopharm 94:152–159. 10.1016/j.ejpb.2015.05.012
6. Wang NN, Dong J, Deng YH, Zhu MF, Wen M, Yao ZJ, Lu AP, Wang JB, Cao DS (2016) ADME properties evaluation in drug discovery: prediction of Caco-2 cell permeability using a combination of NSGA-II and boosting. J Chem Inf Model 56(4):763–773. 10.1021/acs.jcim.5b00642
7. Esaki T, Yonezawa T, Yamazaki D, Ikeda K (2022) Prediction models for fraction of absorption and membrane permeability using Mordred descriptors. Chem-Bio Inf J 22:46–54. 10.1273/cbij.22.46
8. Falcón-Cano G, Molina C, Cabrera-Pérez M (2022) Reliable prediction of Caco-2 permeability by supervised recursive machine learning approaches. Pharmaceutics. 10.3390/pharmaceutics14101998
9. Acuña-Guzman V, Montoya-Alfaro ME, Negrón-Ballarte LP, Solis-Calero C (2024) A machine learning approach for predicting Caco-2 cell permeability in natural products from the biodiversity in Peru. Pharmaceuticals 17(6):750. 10.3390/ph17060750
10. Wang D, Jin J, Shi G, Bao J, Wang Z, Li S, Pan P, Li D, Kang Y, Hou T (2025) ADMET evaluation in drug discovery: 21. application and industrial validation of machine learning algorithms for Caco-2 permeability prediction. J Cheminform 17:3. 10.1186/s13321-025-00947-z
11. Huang K, Fu T, Glass LM, Zitnik M, Xiao C, Sun J (2020) DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics 36(22–23):5545–5547. 10.1093/bioinformatics/btaa1005
12. Therapeutics Data Commons (TDC)—Single‑pred ADMET examples (2024) GitHub repository: https://github.com/mims-harvard/TDC/blob/master/examples/single_pred/admet. Accessed 4 Nov 2025
13. Therapeutics Data Commons (TDC) (2024) Caco-2 Permeability Leaderboard—TDC.Caco2_Wang. Available via TDC: https://tdcommons.ai/benchmark/admet_group/01caco2/. Accessed 19 May 2025
14. Morgan HL (1965) The generation of a unique machine description for chemical structures—a technique developed at Chemical Abstracts Service. J Chem Doc 5(2):107–113. 10.1021/c160017a018
15. Gedeck P, Rohde B, Bartels C (2006) QSAR—How good is it in practice? Comparison of descriptor sets on an unbiased cross section of corporate data sets. J Chem Inf Model 46(5):1924–1936. 10.1021/ci050413p
16. Stiefl N, Watson IA, Baumann K, Zaliani A (2006) ErG: 2D pharmacophore descriptions for scaffold hopping. J Chem Inf Model 46(1):208–220. 10.1021/ci050457y
17. Kuwahara H, Gao X (2021) Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach. J Cheminform 13(1):27. 10.1186/s13321-021-00506-2
18. RDKit: Open-source cheminformatics. https://www.rdkit.org
19. Yap CW (2011) PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32(7):1466–1474. 10.1002/jcc.21707
20. Moriwaki H, Tian Y, Kawashita N, Takagi T (2018) Mordred: a molecular descriptor calculator. J Cheminform. 10.1186/s13321-018-0258-y
21. Winter R, Montanari F, Noé F, Clevert DA (2019) Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 10(6):1692–1701. 10.1039/C8SC04175J
22. Feurer M, Eggensperger K, Falkner S, Lindauer M, Hutter F (2022) Auto-Sklearn 2.0: hands-free AutoML via meta-learning. J Mach Learn Res 23:1–61
23. Olson RS, Moore JH (2019) TPOT: a tree-based pipeline optimization tool for automating machine learning. In: Hutter F, Kotthoff L, Vanschoren J (eds) Automated machine learning: methods, systems, challenges. Springer, Cham, pp 151–160
24. LeDell E, Poirier S (2020) H2O AutoML: scalable automatic machine learning. In: 7th ICML workshop on automated machine learning. Available via AUTOML: https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf. Accessed 15 May 2025
25. Wang C, Wu Q, Weimer M, Zhu E (2021) FLAML: a fast and lightweight AutoML library. In: Proceedings of the 4th MLSys conference, San Jose, CA, USA. Available via MLSys: https://proceedings.mlsys.org/paper_files/paper/2021/file/1ccc3bfa05cb37b917068778f3c4523a-Paper.pdf. Accessed 15 May 2025
26. Erickson N, Mueller J, Shirkov A, Zhang H, Larroy P, Li M, Smola A (2020) AutoGluon-Tabular: robust and accurate AutoML for structured data. Available via ARXIV: https://arxiv.org/pdf/2003.06505. Accessed 15 May 2025
27. Gui Y, Zhan D, Li T (2024) Taking another step: a simple approach to high-dimensional Bayesian optimization. Inf Sci. 10.1016/j.ins.2024.121056
28. Malu M, Dasarathy G, Spanias A (2021) Bayesian optimization in high-dimensional spaces: a brief survey. In: 2021 12th International conference on information, intelligence, systems & applications (IISA), pp 1–8. 10.1109/IISA52424.2021.9555522
29. Sushko I, Novotarskyi S, Körner R, Pandey AK, Rupp M, Teetz W, Brandmaier S, Abdelaziz A, Prokopenko VV, Tanchuk VY, Todeschini R, Varnek A, Marcou G, Ertl P, Potemkin V, Grishina M, Gasteiger J, Schwab C, Baskin II, Palyulin VA, Radchenko EV, Welsh WJ, Kholodovych V, Chekmarev D, Cherkasov A, Aires-de-Sousa J, Zhang QY, Bender A, Nigsch F, Patiny L, Williams A, Tetko IV (2011) Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. J Comput Aided Mol Des 25(6):533–554. 10.1007/s10822-011-9440-2
30. Pade V, Stavchansky S (1998) Link between drug absorption solubility and permeability measurements in Caco-2 cells. J Pharm Sci 87(12):1604–1607. 10.1021/js980111k
31. Thomas S, Brightman F, Gill H, Lee S, Pufong B (2008) Simulation modelling of human intestinal absorption using Caco-2 permeability and kinetic solubility data for early drug discovery. J Pharm Sci 97(10):4557–4574. 10.1002/jps.21305
32. Lee KJ, Johnson N, Castelo J, Sinko PJ, Grass G, Holme K, Lee YH (2005) Effect of experimental pH on the in vitro permeability in intact rabbit intestines and Caco-2 monolayer. Eur J Pharm Sci 25(2–3):193–200. 10.1016/j.ejps.2005.02.012
33. Hubatsch I, Ragnarsson EGE, Artursson P (2007) Determination of drug permeability and prediction of drug absorption in Caco-2 monolayers. Nat Protoc 2(9):2111–2119. 10.1038/nprot.2007.303
34. PaDEL-Descriptor software github: https://github.com/ecrl/padelpy
35. Mordred python package github: https://github.com/mordred-descriptor/mordred
36. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersh T, Hutchison GR (2011) Open babel: an open chemical toolbox. J Cheminform. 10.1186/1758-2946-3-33
37. Breiman L (2001) Random forests. Mach Learn 45(1):5–32. 10.1023/A:1010933404324
38. Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Proceedings of the 31st conference on neural information processing systems (NeurIPS 2017), Long Beach, CA, USA. Available via NeurIPS: https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf
39. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Joly A, Duchesnay D, Perrot M (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
40. Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: NIPS’17: proceedings of the 31st international conference on neural information processing systems, pp 4768–4777
41. Pharmaco-Net: Empowering Discovery for Life. https://pharmaco-net.org/

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Additional file 1.^{(11.9MB, docx)}

Additional file 2.^{(416.5KB, csv)}

Data Availability Statement

[CR1] 1.Yee S (1997) In vitro permeability across Caco-2 cells (colonic) can predict in vivo (small intestinal) absorption in man—fact or myth. Pharm Res 14(6):763–766. 10.1023/A:1012188625088 [DOI] [PubMed] [Google Scholar]

[CR2] 2.Awodele O, Olayemi SO, Adeyemo TA, Sanya TA, Dolapo DC (2012) Use of complementary medicine amongst patients on antiretroviral drugs in an HIV treatment centre in Lagos, Nigeria. Curr Drug Saf 7(2):120–125. 10.2174/157488612802715627 [DOI] [PubMed] [Google Scholar]

[CR3] 3.Moltó J, Miranda C, Malo S, Valle M, Andreu A, Bonafont X, Clotet B (2012) Use of herbal remedies among HIV-infected patients: patterns and correlates. Med Clin (Barc) 138(3):93–98. 10.1016/j.medcli.2011.04.031 [DOI] [PubMed] [Google Scholar]

[CR4] 4.Klepser TB, Doucette WR, Horton MR (2000) Assessment of patients’ perceptions and beliefs regarding herbal therapies. Pharmacotherapy 20(1):83–87. 10.1592/phco.20.1.83.34658 [DOI] [PubMed] [Google Scholar]

[CR5] 5.Welling SH, Clemmensen LKH, Buckley ST, Hovgaard L, Brockhoff PB, Refsgaard HHF (2015) In silico modelling of permeation enhancement potency in Caco-2 monolayers based on molecular descriptors and random forest. Eur J Pharm Biopharm 94:152–159. 10.1016/j.ejpb.2015.05.012 [DOI] [PubMed] [Google Scholar]

[CR6] 6.Wang NN, Dong J, Deng YH, Zhu MF, Wen M, Yao ZJ, Lu AP, Wang JB, Cao DS (2016) ADME properties evaluation in drug discovery: prediction of Caco-2 cell permeability using a combination of NSGA-II and boosting. J Chem Inf Model 56(4):763–773. 10.1021/acs.jcim.5b00642 [DOI] [PubMed] [Google Scholar]

[CR7] 7.Esaki T, Yonezawa T, Yamazaki D, Ikeda K (2022) Prediction models for fraction of absorption and membrane permeability using Mordred descriptors. Chem-Bio Inf J 22:46–54. 10.1273/cbij.22.46 [Google Scholar]

[CR8] 8.Falcón-Cano G, Molina C, Cabrera-Pérez M (2022) Reliable prediction of Caco-2 permeability by supervised recursive machine learning approaches. Pharmaceutics. 10.3390/pharmaceutics14101998 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Acuña-Guzman V, Montoya-Alfaro ME, Negrón-Ballarte LP, Solis-Calero C (2024) A machine learning approach for predicting Caco-2 cell permeability in natural products from the biodiversity in Peru. Pharmaceuticals 17(6):750. 10.3390/ph17060750 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Wang D, Jin J, Shi G, Bao J, Wang Z, Li S, Pan P, Li D, Kang Y, Hou T (2025) ADMET evaluation in drug discovery: 21. application and industrial validation of machine learning algorithms for Caco-2 permeability prediction. J Cheminform 17:3. 10.1186/s13321-025-00947-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Huang K, Fu T, Glass LM, Zitnik M, Xiao C, Sun J (2020) DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics 36(22–23):5545–5547. 10.1093/bioinformatics/btaa1005 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Therapeutics Data Commons (TDC)—Single‑pred ADMET examples (2024) GitHub repository: https://github.com/mims-harvard/TDC/blob/master/examples/single_pred/admet. Accessed 4 Nov 2025

[CR13] 13.Therapeutics Data Commons (TDC) (2024) Caco-2 Permeability Leaderboard—TDC.Caco2_Wang. Available via TDC: https://tdcommons.ai/benchmark/admet_group/01caco2/. Accessed 19 May 2025

[CR14] 14.Morgan HL (1965) The generation of a unique machine description for chemical structures—a technique developed at Chemical Abstracts Service. J Chem Doc 5(2):107–113. 10.1021/c160017a018 [Google Scholar]

[CR15] 15.Gedeck P, Rohde B, Bartels C (2006) QSAR—How good is it in practice? Comparison of descriptor sets on an unbiased cross section of corporate data sets. J Chem Inf Model 46(5):1924–1936. 10.1021/ci050413p [DOI] [PubMed] [Google Scholar]

[CR16] 16.Stiefl N, Watson IA, Baumann K, Zaliani A (2006) ErG: 2D pharmacophore descriptions for scaffold hopping. J Chem Inf Model 46(1):208–220. 10.1021/ci050457y [DOI] [PubMed] [Google Scholar]

[CR17] 17.Kuwahara H, Gao X (2021) Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach. J Cheminform 13(1):27. 10.1186/s13321-021-00506-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.RDKit: Open-source cheminformatics. https://www.rdkit.org

[CR19] 19.Yap CW (2011) PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32(7):1466–1474. 10.1002/jcc.21707

[CR20] 20.Moriwaki H, Tian Y, Kawashita N, Takagi T (2018) Mordred: a molecular descriptor calculator. J Cheminform. 10.1186/s13321-018-0258-y

[CR21] 21.Winter R, Montanari F, Noé F, Clevert DA (2019) Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci 10(6):1692–1701. 10.1039/C8SC04175J

[CR22] 22.Feurer M, Eggensperger K, Falkner S, Lindauer M, Hutter F (2022) Auto-Sklearn 2.0: hands-free AutoML via meta-learning. J Mach Learn Res 23:1–61

[CR23] 23.Olson RS, Moore JH (2019) TPOT: a tree-based pipeline optimization tool for automating machine learning. In: Hutter F, Kotthoff L, Vanschoren J (eds) Automated machine learning: methods, systems, challenges. Springer, Cham, pp 151–160

[CR24] 24.LeDell E, Poirier S (2020) H2O AutoML: scalable automatic machine learning. In: 7th ICML workshop on automated machine learning. Available via AUTOML: https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf. Accessed 15 May 2025

[CR25] 25.Wang C, Wu Q, Weimer M, Zhu E (2021) FLAML: a fast and lightweight AutoML library. In: Proceedings of the 4th MLSys conference, San Jose, CA, USA. Available via MLSys: https://proceedings.mlsys.org/paper_files/paper/2021/file/1ccc3bfa05cb37b917068778f3c4523a-Paper.pdf. Accessed 15 May 2025

[CR26] 26.Erickson N, Mueller J, Shirkov A, Zhang H, Larroy P, Li M, Smola A (2020) AutoGluon-Tabular: robust and accurate AutoML for structured data. Available via ARXIV: https://arxiv.org/pdf/2003.06505. Accessed 15 May 2025

[CR27] 27.Gui Y, Zhan D, Li T (2024) Taking another step: a simple approach to high-dimensional Bayesian optimization. Inf Sci. 10.1016/j.ins.2024.121056

[CR28] 28.Malu M, Dasarathy G, Spanias A (2021) Bayesian optimization in high-dimensional spaces: a brief survey. In: 2021 12th International conference on information, intelligence, systems & applications (IISA), pp 1–8. 10.1109/IISA52424.2021.9555522

[CR29] 29.Sushko I, Novotarskyi S, Körner R, Pandey AK, Rupp M, Teetz W, Brandmaier S, Abdelaziz A, Prokopenko VV, Tanchuk VY, Todeschini R, Varnek A, Marcou G, Ertl P, Potemkin V, Grishina M, Gasteiger J, Schwab C, Baskin II, Palyulin VA, Radchenko EV, Welsh WJ, Kholodovych V, Chekmarev D, Cherkasov A, Aires-de-Sousa J, Zhang QY, Bender A, Nigsch F, Patiny L, Williams A, Tetko IV (2011) Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. J Comput Aided Mol Des 25(6):533–554. 10.1007/s10822-011-9440-2

[CR30] 30.Pade V, Stavchansky S (1998) Link between drug absorption solubility and permeability measurements in Caco-2 cells. J Pharm Sci 87(12):1604–1607. 10.1021/js980111k

[CR31] 31.Thomas S, Brightman F, Gill H, Lee S, Pufong B (2008) Simulation modelling of human intestinal absorption using Caco-2 permeability and kinetic solubility data for early drug discovery. J Pharm Sci 97(10):4557–4574. 10.1002/jps.21305

[CR32] 32.Lee KJ, Johnson N, Castelo J, Sinko PJ, Grass G, Holme K, Lee YH (2005) Effect of experimental pH on the in vitro permeability in intact rabbit intestines and Caco-2 monolayer. Eur J Pharm Sci 25(2–3):193–200. 10.1016/j.ejps.2005.02.012

[CR33] 33.Hubatsch I, Ragnarsson EGE, Artursson P (2007) Determination of drug permeability and prediction of drug absorption in Caco-2 monolayers. Nat Protoc 2(9):2111–2119. 10.1038/nprot.2007.303

[CR34] 34.PaDEL-Descriptor software github: https://github.com/ecrl/padelpy

[CR35] 35.Mordred python package github: https://github.com/mordred-descriptor/mordred

[CR36] 36.O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminform. 10.1186/1758-2946-3-33

[CR37] 37.Breiman L (2001) Random forests. Mach Learn 45(1):5–32. 10.1023/A:1010933404324

[CR38] 38.Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Proceedings of the 31st conference on neural information processing systems (NeurIPS 2017), Long Beach, CA, USA. Available via NeurIPS: https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf

[CR39] 39.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Joly A, Duchesnay E, Perrot M (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830

[CR40] 40.Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: NIPS’17: proceedings of the 31st international conference on neural information processing systems, pp 4768–4777

[CR41] 41.Pharmaco-Net: Empowering Discovery for Life. https://pharmaco-net.org/
