Scientific Reports. 2026 Jan 27;16:6268. doi: 10.1038/s41598-026-37098-6

Integrating machine learning and physics-based modeling for predictive design of gemcitabine-loaded nanocomposites

Abbas Rahdar, Sonia Fathi-karkan, Maryam Shirzad
PMCID: PMC12905131  PMID: 41593127

Abstract

This study develops a machine learning framework to predict loading efficiency and encapsulation efficiency in gemcitabine-loaded nanocomposites, thereby overcoming the limitations of purely experimental approaches. A curated dataset of 59 experimental formulations, augmented with 200 physics-informed synthetic data points, was used to train and compare multiple machine learning algorithms. A Physics-Informed Machine Learning algorithm incorporated drug–polymer interactions as well as release kinetics. Model performance was evaluated using the coefficient of determination, root mean square error, and mean absolute error, with SHapley Additive exPlanations values used to assess the influence of individual variables. The XGBoost algorithm achieved the highest prediction accuracy, with a coefficient of determination of 0.89 for loading efficiency and 0.91 for encapsulation efficiency. Nanoparticle size and zeta potential emerged as the most important features. Physics-Informed Machine Learning improved model interpretability and generalizability. An optimal design space associated with high performance was identified at 80–150 nm in size and +15 to +25 mV in zeta potential. The resulting machine learning / Physics-Informed Machine Learning framework is a valuable asset for the rational design of nanocarriers, accelerating research and reducing experimental costs. Overall, this work presents a promising in silico framework that can guide the rational improvement of nanomedicine formulations, pending experimental validation.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-026-37098-6.

Keywords: Gemcitabine, Nanocomposites, Machine learning, Physics-Informed modeling, Drug delivery, Loading efficiency, Encapsulation efficiency, XGBoost, Rational design, Nanomedicine

Subject terms: Mathematics and computing, Nanoscience and technology

Introduction

Cancer chemotherapy is an essential modality of cancer treatment; however, its efficacy is often compromised by the poor pharmacokinetics and biodistribution of chemotherapeutic drugs. Gemcitabine (2’,2’-difluorodeoxycytidine), a deoxycytidine analogue, is a key chemotherapeutic agent used in the management of a number of solid tumors, including pancreatic, breast, lung, and ovarian malignancies1,2. Its mechanism of action involves intracellular phosphorylation to active metabolites, which are then incorporated into deoxyribonucleic acid (DNA), causing chain termination and inhibition of DNA synthesis, ultimately inducing death in rapidly dividing cancer cells3. Despite its wide clinical use, gemcitabine therapy is limited by significant shortcomings. These include poor cellular membrane permeability, rapid systemic deamination to the inactive metabolite 2’,2’-difluorodeoxyuridine (dFdU), a short plasma half-life, and the development of both inherent and acquired drug resistance mechanisms, which also involve the overexpression of ribonucleotide reductase4,5. Consequently, high and frequent dosing is required to achieve a therapeutic effect, which often leads to severe dose-limiting toxicities such as myelosuppression, hepatotoxicity, and nephrotoxicity, narrowing the therapeutic window2,6.

To overcome these formidable challenges, nanotechnology-based drug delivery systems (NDDS) have emerged as a game-changing paradigm. Nanocomposites are sophisticated materials composed of polymeric, lipidic, or inorganic matrices engineered at the nanoscale to encapsulate therapeutic agents, making them a promising solution7,8. These nanocarriers can protect gemcitabine from enzymatic degradation, enhance its solubility, enable passive tumor targeting through the Enhanced Permeability and Retention (EPR) effect, and permit active targeting due to surface-functionalized ligands9. The therapeutic performance of nanocarriers is fundamentally dependent on two critical quality attributes: loading efficiency (LE) and encapsulation efficiency (EE). LE, defined as the mass of drug effectively encapsulated per mass of the finished nanoparticle, determines the required dose of the nanocarrier. The entrapment efficiency, which expresses the percentage of the initial amount of drug that is successfully entrapped within the nanocarrier, serves as a direct indicator of the effectiveness of the formulation process and is a critical determinant of cost and batch-to-batch reproducibility10,11. High LE and EE are crucial for ensuring sufficient delivery of the medication to the tumor site while minimizing excipient load and any off-target toxicity12.
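The two critical quality attributes defined above are simple mass ratios, which can be sketched as follows (the masses used in the example are hypothetical illustration values, not data from the study):

```python
def loading_efficiency(drug_mass_mg: float, nanoparticle_mass_mg: float) -> float:
    """LE (%): mass of encapsulated drug per total mass of the finished nanoparticle."""
    return 100.0 * drug_mass_mg / nanoparticle_mass_mg

def encapsulation_efficiency(encapsulated_drug_mg: float, initial_drug_mg: float) -> float:
    """EE (%): fraction of the initially added drug that ends up entrapped."""
    return 100.0 * encapsulated_drug_mg / initial_drug_mg

# Hypothetical batch: 5 mg gemcitabine added, 4.2 mg entrapped, 50 mg total particle mass
le = loading_efficiency(4.2, 50.0)        # 8.4 %
ee = encapsulation_efficiency(4.2, 5.0)   # 84.0 %
```

The distinction matters in practice: a formulation can entrap nearly all of the added drug (high EE) while still requiring a large excipient dose per unit of drug (low LE).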

The conventional development of gemcitabine nanocomposites remains a predominantly empirical and iterative “trial-and-error” process. Formulators must navigate a high-dimensional design space defined by interlinked physicochemical and process parameters, including nanoparticle size, zeta potential, polymer type and molecular weight, drug-to-polymer ratio, surface modifications, and synthesis method13,14. Each experimental iteration requires resource-intensive synthesis and characterization, yet yields limited insight into the underlying, often non-linear, relationships between these parameters and the critical quality attributes (CQAs) of LE and EE. This approach is not only time-consuming and costly but also becomes increasingly untenable with the growing complexity of next-generation nanocarriers, such as hybrid and multifunctional systems. Consequently, there is a pressing need for a more rational, predictive, and efficient design paradigm to replace this reliance on intuition and exhaustive experimentation.

In response to this challenge, data-driven Machine Learning (ML) has emerged as a transformative tool for navigating complex design spaces, not only in pharmaceutical and materials science15–17 but across diverse biomedical domains. The success of ML in medicine is well-documented, with advanced models achieving high diagnostic and predictive accuracy18. For instance, hybrid architectures combining convolutional neural networks (CNNs) with ensemble methods like XGBoost, often optimized via metaheuristic algorithms, have shown exceptional performance in medical image analysis for disease diagnostics (e.g., COVID-19 from X-rays, glioma grading from MRI)19,20. Similarly, optimized recurrent networks and ensemble models have proven effective for sequential data analysis (e.g., anomaly detection in ECG) and security in connected health systems (e.g., intrusion detection in healthcare 4.0 IoT)21–23. These advances underscore ML’s capability to model complex, non-linear relationships in high-dimensional data, a core challenge in nanomedicine formulation. Translating this success, supervised learning models, including ensemble methods (e.g., Random Forest, XGBoost), Support Vector Regression (SVR), and Neural Networks (MLP), have demonstrated promising capabilities in predicting the properties of nanomaterials and drug delivery systems24–27. By performing in-silico screening and identifying critical design variables, ML offers a powerful paradigm to accelerate nanocarrier development, moving beyond the traditional trial-and-error approach.

Despite these advantages, purely data-driven ML models exhibit inherent limitations that restrict their reliable application in nanomedicine. Their predictive fidelity is critically dependent on the quantity, quality, and scope of training data. They often struggle with extrapolation, generating predictions outside the training domain that may violate fundamental physical laws (e.g., mass conservation) and lack mechanistic interpretability, a key requirement for scientific discovery and rational design28–30. While isolated applications of ML exist in nanomedicine, they are typically fragmented, focusing on a single class of nanocarriers or a limited subset of parameters without integrating domain knowledge. This overlooks the heterogeneous nature of nanocomposite systems (polymeric, lipid-based, metal-based, hybrid) and fails to ground predictions in the physical chemistry governing drug-polymer interactions, diffusion, and release kinetics. The emerging paradigm of Physics-Informed Machine Learning (PIML) directly addresses these shortcomings by embedding physical principles, constraints, and prior knowledge into the learning architecture itself31–34. This synergy ensures predictions are not only data-consistent but also physically plausible, enhancing generalizability and interpretability, especially in data-scarce scenarios. However, a significant research gap persists: there is no comprehensive, comparative framework that integrates multiple state-of-the-art ML algorithms with physics-informed modeling specifically for the prediction and optimization of LE and EE in gemcitabine-loaded nanocomposites. The lack of such a holistic tool, capable of handling diverse nanocarrier data while enforcing physical consistency, remains a major bottleneck in the rational design of high-performance gemcitabine formulations.

This study presents a comprehensive ML and PIML framework for the predictive design and optimization of gemcitabine-loaded nanocomposites. To this end, we curated an extensive dataset from the literature, encompassing diverse nanocomposite systems (polymeric, lipid-based, metal-based, and hybrid), and implemented a suite of state-of-the-art ML algorithms. The main contributions of this work are:

  • Development of a Comparative ML Framework: We implemented and rigorously evaluated multiple ML models (Random Forest, XGBoost, SVR, MLP, k-NN) to predict LE and EE, identifying XGBoost as the top performer.

  • Integration of PIML: We advanced the modeling paradigm by developing a novel PIML framework that incorporates fundamental physical principles (e.g., mass conservation, diffusion kinetics, drug-polymer affinity) directly into the learning process, enhancing model interpretability and physical consistency.

  • Comprehensive Feature Importance Analysis: Using SHAP (SHapley Additive exPlanations) analysis, we identified and ranked the critical formulation parameters, most notably nanoparticle size and zeta potential, that govern LE and EE, providing actionable insights for rational design.

  • Delineation of an Optimal Design Space: Based on model predictions and response surface analysis, we defined a quantifiable optimal design space (e.g., nanoparticle size of 80 to 150 nm, zeta potential of + 15 to + 25 mV) to guide the synthesis of high-performance gemcitabine nanocarriers.

  • Creation of a Curated, Public Dataset: We compiled and standardized a dataset of 59 unique gemcitabine nanocomposite formulations from the literature, which serves as a valuable resource for future research in data-driven nanomedicine.

This integrated in-silico framework aims to accelerate the rational design of nanocarriers, reduce reliance on costly trial-and-error experimentation, and ultimately contribute to the development of more effective gemcitabine-based cancer therapies.

Methods

Data curation and preprocessing

A total of 59 unique gemcitabine-loaded nanocomposite formulations meeting strict quality criteria were curated from peer-reviewed publications (2016–2025) using systematic inclusion and exclusion criteria to ensure dataset quality and relevance. Inclusion criteria were: (1) studies reporting both LE and EE for gemcitabine-loaded nanocomposites; (2) characterization of at least two key physicochemical properties (e.g., size, zeta potential); (3) use of validated analytical methods for efficiency quantification (HPLC, UV–Vis, or equivalent); and (4) in vitro or in vivo experimental validation. Exclusion criteria were: (1) studies without quantitative LE/EE data; (2) formulations lacking essential characterization; (3) use of non-standard or unvalidated analytical methods; and (4) in-house or unpublished data.

The dataset, while moderate in size and heterogeneous in nature, represents a carefully curated snapshot of the published literature. It is explicitly acknowledged that the combination of a limited sample size and relatively high-dimensional feature space increases the risk of model overfitting. All subsequent modeling strategies were explicitly designed to detect and mitigate this risk. As detailed in Table 1, some features had significant missingness (e.g., Zeta Potential: 40%). To address this rigorously and avoid bias, we implemented a conservative, validation-embedded imputation strategy exclusively for predictor variables. Missing values in predictors were imputed using k-Nearest Neighbors (k = 5, based on other physicochemical features) separately within each training fold of the cross-validation loop. This ensures no information from validation or test sets leaks into the imputation process. Most critically, to maintain the integrity of the modeling task, we did not impute the target variables (LE and EE). Only formulations with reported experimental values for both LE and EE were included in the final modeling dataset (n = 59 for analysis). While this approach preserves target validity, it acknowledges that the effective sample size for models using features with high missingness (like zeta potential) is reduced within any given training fold, a factor inherently captured by our cross-validated performance estimates. The extracted features and performance metrics are detailed in Supplementary Table S1. Table 1 also displays summary statistics of the main continuous variables.
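The fold-embedded imputation strategy described above can be sketched with scikit-learn's `KNNImputer`; the imputer is fit inside each training fold and only applied to the held-out fold, so no validation information leaks into the imputation. The feature matrix here is synthetic stand-in data, not the curated dataset:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import KFold

# Synthetic stand-in for the predictor matrix (59 formulations, 5 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(59, 5))
X[rng.random(X.shape) < 0.2] = np.nan  # ~20% missing values among predictors

for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    imputer = KNNImputer(n_neighbors=5)           # fit on the training fold only
    X_train = imputer.fit_transform(X[train_idx])
    X_val = imputer.transform(X[val_idx])         # applied, never refit, to the held-out fold
    assert not np.isnan(X_train).any() and not np.isnan(X_val).any()
```

In the study's actual pipeline the same pattern would sit inside the study-level nested cross-validation loop, with target rows lacking LE/EE excluded rather than imputed.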

Table 1.

Summary statistics of key features in the curated Dataset.

Feature Mean ± SD Range Missing Values (%)
Size (nm) 189.6 ± 312.4 1.5–203,000 15%
Zeta Potential (mV) −2.1 ± 22.3 −37 to + 34 40%
LE % 45.2 ± 28.1 2.1–92.0 25%
EE % 78.3 ± 20.5 0.7–99.0 30%

Feature engineering

To improve model interpretability and capture nonlinear relationships, several categorical features were engineered from continuous variables, with thresholds selected based on domain knowledge and the data distribution (Table 2).

Table 2.

Description and justification of engineered Features.

Feature Description Calculation/Categories Justification
Size Category Classification based on hydrodynamic diameter. Small (< 100 nm), Medium (100–1000 nm), Large (> 1000 nm) Reflects typical size regimes in nanomedicine: sub-100 nm for EPR-based targeting, medium for polymeric/lipid nanoparticles, and large for macrocapsules or micro-scale carriers.
Surface Charge Category Categorical representation of colloidal stability and surface interactions. Negative (< − 10 mV), Neutral (–10 to + 10 mV), Positive (> + 10 mV) Based on literature thresholds for colloidal stability and cellular interaction; positive surfaces often enhance mucoadhesion and cellular uptake.
Polymer Complexity Indicator of formulation sophistication. Count of distinct polymer components in the nanocomposite (continuous variable). Captures the multi-component nature of hybrid systems; retained as a continuous variable to avoid redundancy with other polymer-related features.
Functionalization Score Intensity of surface modification. Binary encoding (0/1) for the presence of common surface modifications (e.g., PEGylation, targeting ligands). Simplifies the presence/absence of key surface modifications that influence biodistribution and targeting.
Study Group Identifier to track formulations from the same publication. Unique identifier per study (e.g., Ref-1, Ref-2, …) Ensures correlated samples from a single study are handled together during train-test splitting and cross-validation, preventing data leakage and over-optimistic performance estimates.

Note: Continuous versions of size and zeta potential were retained as primary predictors in all models; categorical versions were used only in tree-based algorithms to capture threshold effects. Outliers (e.g., particle sizes up to 203 μm) were included without truncation, as they represent valid formulation strategies such as macrocapsules.
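The size and surface-charge binning in Table 2 can be sketched with pandas; the thresholds come from the table, while the sample formulations are hypothetical:

```python
import pandas as pd

# Hypothetical formulations spanning the three regimes of each feature
df = pd.DataFrame({"size_nm": [45.0, 150.0, 2000.0], "zeta_mV": [-25.0, 5.0, 18.0]})

# Size: Small < 100 nm, Medium 100–1000 nm, Large > 1000 nm (right=False makes bins left-closed)
df["size_category"] = pd.cut(df["size_nm"], bins=[0, 100, 1000, float("inf")],
                             labels=["Small", "Medium", "Large"], right=False)
# Charge: Negative < -10 mV, Neutral -10 to +10 mV, Positive > +10 mV
df["charge_category"] = pd.cut(df["zeta_mV"], bins=[float("-inf"), -10, 10, float("inf")],
                               labels=["Negative", "Neutral", "Positive"])
```

As the note above states, these categorical versions would feed only the tree-based models, with the continuous size and zeta columns retained as primary predictors throughout.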

ML modeling and hyperparameter tuning

This carefully constructed data set was utilized for training and testing a battery of powerful ML regression algorithms with the objective of estimating LE and EE. Model development and evaluation followed a rigorous, study-level nested cross-validation strategy specifically designed to provide a realistic estimate of generalization error and mitigate overfitting on this limited dataset. The dataset was first partitioned at the study level (not the formulation level) into training (~ 80%) and a hold-out test (~ 20%) set, ensuring that all formulations from the same publication were contained within a single split to prevent data leakage. Subsequently, a nested 5-fold cross-validation was applied within the training set. Hyperparameter tuning was conducted exclusively within the inner loop, while the final model for each configuration was evaluated on the outer-loop validation folds, which were completely unseen during tuning. This process yields performance metrics (R², RMSE, MAE) that are robust against over-optimism. The mean and standard deviation of these metrics across the outer folds are reported in Table 5. The standard deviation provides a direct measure of performance stability and sensitivity to the specific data split, which is particularly important for assessing overfitting risk in small datasets. Finally, the best overall model configuration was retrained on the entire training set and evaluated once on the completely independent hold-out test set to confirm its performance.
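The study-level nested cross-validation described above can be sketched with scikit-learn's `GroupKFold`, which keeps all formulations from one publication inside a single fold; hyperparameters are tuned only on inner-loop splits. The data, group labels, and Random Forest grid here are illustrative stand-ins for the actual dataset and tuned models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, GridSearchCV

# Synthetic stand-in: 59 formulations, 6 features, 12 source publications
rng = np.random.default_rng(1)
X = rng.normal(size=(59, 6))
y = 10 * X[:, 0] + rng.normal(scale=0.5, size=59)
groups = rng.integers(0, 12, size=59)            # "study group" labels per formulation

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    # Inner loop: hyperparameter search sees only the outer training studies
    inner_cv = GroupKFold(n_splits=3).split(X[train_idx], y[train_idx], groups[train_idx])
    search = GridSearchCV(RandomForestRegressor(random_state=0),
                          {"n_estimators": [50, 100], "max_depth": [3, None]},
                          cv=inner_cv, scoring="r2")
    search.fit(X[train_idx], y[train_idx])
    scores.append(search.score(X[test_idx], y[test_idx]))  # outer fold, unseen during tuning

print(f"outer-fold R2: {np.mean(scores):.2f} +/- {np.std(scores):.2f}")
```

Reporting the mean and standard deviation over the outer folds, as in Table 5, gives both the generalization estimate and its split-to-split stability.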

Table 5.

Predictive performance of ML models for LE and EE evaluated using study-level nested 5-fold cross-validation. Metrics (mean ± SD) account for variability due to dataset size and missing data handling.

Model LE R² (mean ± SD) LE RMSE (mean ± SD) LE MAE (mean ± SD) EE R² (mean ± SD) EE RMSE (mean ± SD) EE MAE (mean ± SD)
Random Forest 0.87 ± 0.05 4.21 ± 0.41 3.15 ± 0.35 0.89 ± 0.04 3.85 ± 0.38 2.91 ± 0.30
XGBoost 0.89 ± 0.04 3.92 ± 0.36 2.88 ± 0.32 0.91 ± 0.03 3.52 ± 0.32 2.64 ± 0.28
SVR 0.82 ± 0.06 5.13 ± 0.58 4.02 ± 0.48 0.84 ± 0.05 4.78 ± 0.52 3.75 ± 0.45
MLP 0.88 ± 0.05 4.05 ± 0.42 3.01 ± 0.36 0.90 ± 0.04 3.68 ± 0.39 2.79 ± 0.31
k-NN 0.79 ± 0.07 5.67 ± 0.65 4.45 ± 0.55 0.81 ± 0.06 5.23 ± 0.60 4.12 ± 0.52

Within the training data, a nested 5-fold cross-validation framework was applied. Hyperparameter tuning was conducted exclusively within the inner cross-validation loop, while model performance was assessed in the outer loop. All preprocessing steps, including feature scaling and imputation where applicable, were performed independently within each training fold and subsequently applied to the corresponding validation fold. Model performance was quantified using the coefficient of determination (R²), root mean squared error (RMSE), and mean absolute error (MAE). Final reported metrics represent the mean and standard deviation across outer folds. This ensured that no information from the validation or test folds was used during imputation or transformation, preserving the integrity of model evaluation. The following algorithms were implemented:

  • Random Forest (RF): An ensemble method based on bagging and decision trees35.

  • XGBoost (XGB): A scalable, high-performance gradient boosting framework36.

  • Support Vector Regression (SVR): A kernel-based method effective in high-dimensional spaces37.

  • Multilayer Perceptron (MLP): A feedforward artificial neural network38.

  • k-Nearest Neighbors (k-NN): An instance-based, non-parametric algorithm39.

Hyperparameter optimization was critical for maximizing model performance. Different search strategies were employed for each algorithm, as detailed in Table 3.

Table 3.

Hyperparameter optimization strategies for ML Models.

Model Key Hyperparameters Tuned Optimization Method
Random Forest n_estimators, max_depth, min_samples_split Grid Search with Cross-Validation
XGBoost learning_rate, max_depth, n_estimators, subsample Bayesian Optimization
Support Vector Regression (SVR) C, epsilon, kernel Grid Search with Cross-Validation
Multilayer Perceptron (MLP) hidden_layer_sizes, activation, alpha, dropout_rate Random Search
k-Nearest Neighbors (k-NN) n_neighbors, weights, metric Grid Search with Cross-Validation

PIML framework

A PIML framework was developed that explicitly integrates physical laws into the learning process through regularization of the loss function. The model’s total loss $\mathcal{L}_{\mathrm{total}}$ is defined as a weighted sum of the data-driven loss $\mathcal{L}_{\mathrm{data}}$ and a physics-based regularization term $\mathcal{L}_{\mathrm{physics}}$:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{data}} + \lambda\,\mathcal{L}_{\mathrm{physics}},$$

where $\lambda$ is a tunable hyperparameter that controls the influence of the physical constraints.

Physics-informed regularization terms

(a) Diffusion-reaction constraint: the diffusion-reaction PDE was discretized and enforced at collocation points within the domain. The physics loss term for this constraint is

$$\mathcal{L}_{\mathrm{diff}} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\partial C_i}{\partial t} - D\,\nabla^2 C_i + k\,C_i\right)^2,$$

where $N$ is the number of collocation points, $C$ is the predicted drug concentration, $D$ is the diffusion coefficient, and $k$ is the reaction rate constant.

(b) Mass balance constraint: the mass balance equation for EE was incorporated as a soft constraint:

$$\mathcal{L}_{\mathrm{mass}} = \frac{1}{M}\sum_{j=1}^{M}\left(\eta_{\mathrm{encap},j} - \frac{m_{\mathrm{encapsulated},j}}{m_{\mathrm{total},j}}\right)^2,$$

where $M$ is the number of samples, $\eta_{\mathrm{encap}}$ is the predicted EE, and $m_{\mathrm{encapsulated}}$, $m_{\mathrm{total}}$ are derived from the model’s predictions and input data.

(c) Binding affinity constraint: polymer-drug binding constants were incorporated using a penalty term based on Langmuir adsorption kinetics:

$$\mathcal{L}_{\mathrm{bind}} = \left(K_{\mathrm{pred}} - K_{\mathrm{lit}}\right)^2,$$

where $K_{\mathrm{pred}}$ is the model-inferred binding constant and $K_{\mathrm{lit}}$ is the literature-derived value.

The final physics loss is

$$\mathcal{L}_{\mathrm{physics}} = \mathcal{L}_{\mathrm{diff}} + \mathcal{L}_{\mathrm{mass}} + \mathcal{L}_{\mathrm{bind}}.$$

These terms penalize predictions that violate established physical laws, thereby improving model consistency and generalizability, especially in data-sparse regions.
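How these penalty terms combine can be sketched in NumPy; the residual inputs would in practice come from the model's predictions at the collocation points, and this composition is an illustrative assumption of how such a loss is assembled, not the authors' implementation:

```python
import numpy as np

def physics_loss(dC_dt, lap_C, C, D, k, eta_pred, m_encap, m_total, K_pred, K_lit):
    """Composite physics penalty: diffusion-reaction, mass balance, binding affinity."""
    l_diff = np.mean((dC_dt - D * lap_C + k * C) ** 2)     # PDE residual at collocation points
    l_mass = np.mean((eta_pred - m_encap / m_total) ** 2)  # soft mass-balance constraint on EE
    l_bind = (K_pred - K_lit) ** 2                         # deviation from the literature binding constant
    return l_diff + l_mass + l_bind

def total_loss(l_data, l_phys, lam=0.1):
    """Weighted sum: L_total = L_data + lambda * L_physics."""
    return l_data + lam * l_phys
```

Predictions that exactly satisfy the PDE, the mass balance, and the literature binding constant incur zero penalty, so the regularizer only activates when a physical law is violated.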

Model evaluation and interpretability

Given the relatively small dataset size, model performance was evaluated across multiple cross-validation folds rather than relying on a single train–test split. Performance metrics (R², RMSE, MAE) were computed for each outer cross-validation fold and are reported as mean ± standard deviation, providing an estimate of variability and model stability across splits. This approach yields a more reliable assessment of generalization performance than a single test set evaluation. Model interpretability was treated as a central objective of this study. To systematically identify and quantify the influence of formulation parameters on LE and EE, SHAP (SHapley Additive exPlanations) analysis was applied to the best-performing ML models. SHAP is grounded in cooperative game theory and assigns each feature a contribution value representing its marginal effect on the model output, ensuring consistency and local accuracy.

Both global and local SHAP analyses were performed. Global importance rankings were obtained by averaging the absolute SHAP values across all samples, enabling robust identification of dominant formulation variables. Local SHAP explanations were used to examine how specific feature values drive individual predictions, revealing nonlinear effects and threshold behaviors relevant to nanocomposite design.
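Exact SHAP values require the `shap` package; as a lightweight illustration of the same global-ranking idea, scikit-learn's permutation importance recovers the dominant feature of a fitted model. The data and the gradient-boosted model below are synthetic stand-ins, not the study's dataset or tuned XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Synthetic data where feature 0 dominates the target by construction
rng = np.random.default_rng(2)
X = rng.normal(size=(59, 4))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=59)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=20, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]   # global importance ranking
print("feature ranking:", ranking)                    # feature 0 expected first
```

Unlike permutation importance, SHAP additionally yields per-sample (local) attributions with signed directionality, which is what enables the threshold and nonlinearity analysis described above.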

Results and discussion

Exploratory data analysis reveals trends in nanocomposite performance

Exploratory data analysis was conducted on the curated dataset, with careful consideration of sample independence. Formulations were analyzed both at the individual level and grouped by publication to assess variability within and across studies. Figure 1 depicts the distribution of the two principal performance metrics, LE and EE, across different categories of nanocarriers. Wide variation in both LE (2.1% to 92.0%) and EE (0.7% to 99.0%) was observed, highlighting the intricate relationship between formulation parameters and performance outcomes. Table 4 presents the average performance by nanocomposite type. Hybrid systems, which often combine the advantageous properties of multiple materials (such as polymer-lipid or polymer-metal), exhibited the highest mean LE of 58.9 ± 12.4% and EE of 88.3 ± 8.7%; synergistic effects in multi-component systems can significantly enhance drug loading and retention. Polymeric and lipid-based nanoparticles exhibited high EE, exceeding 80%. Conversely, metal-based systems showed lower average LE and greater variability, potentially attributable to the difficulty of achieving high drug-to-carrier ratios with dense inorganic cores.

Fig. 1.

Fig. 1

Exploratory Data Analysis of Gemcitabine-Loaded Nanocomposites. (A) Distribution of LE across the curated dataset. (B) Distribution of EE. (C) Variation of LE across different polymer types. (D) Comparison of EE based on the presence and type of surface modification.

Table 4.

Performance metrics by nanocomposite Type.

Nanocomposite Type Avg LE (%) Avg EE (%) Count
Polymeric NPs 52.3 ± 18.2 81.5 ± 12.3 18
Lipid-based 48.7 ± 15.8 85.2 ± 10.1 12
Metal-based 35.4 ± 22.1 72.8 ± 18.5 9
Hybrid Systems 58.9 ± 12.4 88.3 ± 8.7 15

Comparative performance of ML models

A suite of ML algorithms was developed and tested for predicting LE and EE. Table 5 summarizes the predictive performance of all evaluated models. Performance metrics are reported as mean ± standard deviation across the outer folds of the nested 5-fold cross-validation, providing an estimate of variability and robustness given the limited dataset size. These data show that ensemble methods, specifically XGBoost, delivered the best predictions: XGBoost achieved the highest R² values (0.89 for LE and 0.91 for EE) while also recording the lowest RMSE and MAE. These results demonstrate the model’s strong ability to capture the nonlinear relationships present in the data. SVR and k-NN performed comparatively poorly, perhaps because they handle heterogeneous data types and complex variable interactions less well. Random Forest and MLP also performed well. The scatter plots in Fig. 2 show close agreement between model predictions and experimental data points, which cluster around the identity line y = x without notable error in either predicted variable.

Fig. 2.

Fig. 2

Predictive Performance of ML Models. Scatter plots of predicted versus actual values for (A) LE and (B) EE. The solid line represents the ideal fit (y = x), and the R² value indicates the goodness of fit for the top-performing XGBoost model.

The performance metrics presented in Table 5, particularly the high R² values for XGBoost and MLP, must be interpreted with consideration of the dataset limitations. While the nested cross-validation framework provides a guard against overfitting, the observed performance represents the best achievable generalization estimate given the current data. The relatively low standard deviations for R² (e.g., ± 0.04 for XGBoost-LE) across folds suggest stable performance, but do not fully eliminate the concern that the models may be capturing dataset-specific idiosyncrasies. The consistent outperformance of ensemble (XGBoost, RF) and neural network (MLP) methods over simpler models (k-NN, SVR) indicates their ability to capture complex, nonlinear relationships present in the data, but also highlights their higher capacity to overfit. Therefore, the identified optimal parameters (Sect. 3.5) should be viewed as strong, data-driven hypotheses for experimental validation rather than definitive guarantees.

Note on Table 5: the high R² values achieved by XGBoost and MLP, while indicative of strong predictive capability within the study-level nested cross-validation framework, should be interpreted with caution given the limited dataset size (n = 59) and the inherent risk of overfitting; the reported standard deviations across outer folds provide an estimate of performance stability.

Key determinants of LE and EE

SHAP analysis constitutes a central analytical component of this work, providing mechanistic insight into the formulation parameters governing nanocomposite performance beyond predictive accuracy alone. Figure 3 presents SHAP summary plots for LE and EE, simultaneously illustrating feature importance, directionality, and variability across the dataset. Nanoparticle size emerged as the most influential feature for predicting LE and the second most influential for EE. The SHAP value distributions reveal a pronounced nonlinear relationship, with strongly positive contributions concentrated in the intermediate size range of approximately 80–150 nm. This behavior reflects a balance between the higher surface-area-to-volume ratio of smaller particles, which promotes drug adsorption, and the core volume needed for encapsulation: excessively small particles lack sufficient core volume to support high loading. Zeta potential was identified as the dominant determinant of EE. Positive surface charge values (approximately +15 to +25 mV) were associated with positive SHAP values, indicating enhanced EE. This trend is mechanistically consistent with electrostatic interactions between positively charged nanocarrier surfaces and the negatively charged functional groups of gemcitabine, which promote drug retention during formulation. Additional influential features, including polymer type, surface modifications, and drug–polymer ratio, exhibited context-dependent SHAP patterns, suggesting interaction effects rather than simple monotonic relationships. These results highlight the multivariate and nonlinear nature of nanocomposite design and emphasize the value of SHAP in extracting interpretable design rules from complex ML models. Table 6 summarizes these rankings, and the SHAP dependence diagrams (not shown) confirmed the non-linear optimal size range of approximately 80–150 nm that maximized both efficiencies.

Fig. 3.

Fig. 3

Feature Importance Analysis using SHAP. Summary plots illustrating the mean absolute impact of features on the model output for (A) LE and (B) EE. Features are ordered by importance, and the color represents the feature value (red: high, blue: low).

Table 6.

Ranking of the most influential formulation parameters for LE and EE identified by SHAP analysis, based on mean absolute SHAP values across all samples.

Rank LE EE
1 Nanoparticle Size Zeta Potential
2 Polymer Type Nanoparticle Size
3 Zeta Potential Surface Modifications
4 Surface Area Polymer Type
5 Synthesis Method Drug-Polymer Ratio

Zeta Potential was identified as the most critical attribute of EE. Higher EE was significantly correlated with a positive zeta potential (e.g., + 15 to + 25 mV). This is mechanistically consistent with the electrostatic interaction between the negatively charged phosphate groups of gemcitabine and the positively charged nanoparticle surface (often derived from cationic polymers such as chitosan), which enhances drug-polymer binding and minimizes drug leakage during the encapsulation process. Additional features that were highly influential included the Drug-Polymer Ratio, which directly governs the thermodynamic driving force for encapsulation, Surface Modifications (e.g., PEGylation, targeting ligands), which alter surface chemistry and interaction dynamics, and Polymer Type, which confirms the critical role of chemical affinity between the drug and the matrix.

Enhanced robustness through physics-informed modeling

Incorporating physical concepts into the ML framework yielded the PIML model, which exhibited several advantages over the purely data-driven approach. Figure 4 demonstrates that the PIML model performed better, especially when extrapolating into regions of the design space with limited training data. By penalizing physically unreasonable predictions (e.g., mass imbalance or violation of diffusion kinetics), the PIML model produced more consistent and dependable results. The physical parameters learned and employed by the PIML model are listed in Table 7. The diffusion coefficient of 2.3 × 10⁻¹¹ m²/s aligns with values documented for small-molecule pharmaceuticals in polymeric matrices, whereas the binding constant of 1.8 × 10⁻³ L/mol indicates the modest affinity of gemcitabine for common nanocarrier polymers. This integration not only improved predictive accuracy but also enhanced the model's interpretability by anchoring its predictions in established physical chemistry.
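The penalty idea described above can be sketched as a composite loss. This is a minimal illustration assuming one simple mass-balance form (LE bounded by EE times the drug mass fraction, with both efficiencies constrained to 0–100%); it is not the authors' exact constraint set.

```python
import numpy as np

def physics_informed_loss(le_pred, ee_pred, drug_frac, y_le, y_ee, lam=1.0):
    """Data loss (MSE) plus penalties on physically impossible predictions.

    Assumed mass balance: LE <= EE * drug mass fraction, and both
    efficiencies must lie in [0, 100] %. Violations add a hinge penalty.
    """
    mse = np.mean((le_pred - y_le) ** 2) + np.mean((ee_pred - y_ee) ** 2)
    # Hinge on mass-balance violation: positive only when LE exceeds its bound.
    mass_violation = np.maximum(le_pred - ee_pred * drug_frac, 0.0)
    # Hinge on range violations (below 0 % or above 100 %).
    range_violation = (np.maximum(-le_pred, 0) + np.maximum(le_pred - 100, 0)
                       + np.maximum(-ee_pred, 0) + np.maximum(ee_pred - 100, 0))
    return mse + lam * np.mean(mass_violation ** 2 + range_violation ** 2)
```

A prediction that fits the data perfectly but violates the balance (e.g., LE = 60% with EE = 80% at a drug fraction of 0.5) still incurs a nonzero loss, which is what steers the PIML model away from physically unreasonable regions.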

Fig. 4

Performance Comparison of Traditional ML and Physics-Informed ML (PIML). Predictions for (A) LE and (B) EE, demonstrating the enhanced accuracy and physical consistency of the PIML model, particularly in data-sparse regions.

Table 7.

Physical parameters learned by the PIML model.

Physical Parameter | Value | Unit | Description
Diffusion coefficient | 2.3 × 10⁻¹¹ | m²/s | Drug in polymer matrix
Binding constant | 1.8 × 10⁻³ | L/mol | Drug–polymer affinity
Surface energy | 45.2 | mJ/m² | Polymer surface properties
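As a back-of-envelope sanity check on the Table 7 diffusion coefficient (our own illustrative estimate, not part of the original analysis), the characteristic Fickian diffusion time t ≈ r²/D for a 100 nm particle can be computed directly:

```python
# Characteristic diffusion time for a sphere, t ~ r^2 / D (Fickian assumption).
D = 2.3e-11  # m^2/s, drug in polymer matrix (Table 7)
r = 50e-9    # m, radius of a 100 nm diameter particle
t = r**2 / D
print(f"characteristic diffusion time: {t:.2e} s")
```

At this D, the time scale is on the order of 10⁻⁴ s, implying that pure matrix diffusion would be extremely fast at the nanoscale; drug retention over useful time scales is therefore plausibly governed by the drug–polymer binding constant, the second parameter in Table 7.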

Rational design optimization via response surface analysis

Building on the predictive capability of the tuned XGBoost model, response surfaces were constructed to portray the relationship between the most significant variables and the output variables LE and EE. Figure 5 presents a three-dimensional surface illustrating the simultaneous influence of nanoparticle size and zeta potential on EE. The plot clearly identifies an "optimal region" where both variables lie within their favorable domains. Based on the predictive modeling, we identified a proposed optimal set of formulation parameters predicted to yield highly efficient gemcitabine nanocomposites, as summarized in Table 8. For instance, nanoparticles with sizes between 80 and 150 nm and a positive zeta potential of +15 to +25 mV are projected to achieve high LE and EE simultaneously.
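A Fig. 5-style response surface can be generated by sweeping two features over a grid while holding all remaining features at their dataset medians. The sketch below is a generic utility, assuming a scikit-learn-style model exposing `predict` and hypothetical column names; it is not the study's plotting code.

```python
import numpy as np
import pandas as pd

def response_surface(model, X, f1="size_nm", f2="zeta_mV", n=50):
    """Sweep features f1 and f2 over an n-by-n grid; hold other columns
    at their medians; return meshgrids and predicted values."""
    g1 = np.linspace(X[f1].min(), X[f1].max(), n)
    g2 = np.linspace(X[f2].min(), X[f2].max(), n)
    G1, G2 = np.meshgrid(g1, g2)
    # Baseline row: every feature fixed at its median value.
    grid = pd.DataFrame({c: np.full(G1.size, X[c].median()) for c in X.columns})
    grid[f1], grid[f2] = G1.ravel(), G2.ravel()
    Z = model.predict(grid[X.columns]).reshape(n, n)
    return G1, G2, Z  # ready for matplotlib's ax.plot_surface(G1, G2, Z)
```

Contour lines on Z then delineate the "optimal region" where both swept variables fall in their favorable ranges, as highlighted in Fig. 5.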

Fig. 5

Response Surface for Design Optimization. A 3D surface plot showing the simultaneous effect of two critical parameters (e.g., nanoparticle size and zeta potential) on LE and EE. The contour lines and highlighted region identify the optimal design space for maximizing performance.

Table 8.

Proposed optimal design parameters based on model predictions.

Parameter | Optimal Range | Impact on LE | Impact on EE
Size | 80–150 nm | High | High
Zeta potential | +15 to +25 mV | Medium | High
Polymer MW | 20–50 kDa | High | Medium
Drug ratio | 1:5 to 1:10 | High | High
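In practice, the Table 8 design windows can act as a screening filter over candidate formulations before committing to synthesis. The sketch below is hypothetical (column names and candidate values are illustrative); the ratio window is omitted here since ratios are categorical strings in Table 8.

```python
import pandas as pd

# Design windows taken from Table 8 (numeric parameters only).
OPTIMAL = {"size_nm": (80, 150), "zeta_mV": (15, 25), "polymer_mw_kda": (20, 50)}

def in_design_space(df: pd.DataFrame) -> pd.Series:
    """Boolean mask: True where a candidate lies inside every window."""
    mask = pd.Series(True, index=df.index)
    for col, (lo, hi) in OPTIMAL.items():
        mask &= df[col].between(lo, hi)  # inclusive bounds
    return mask

candidates = pd.DataFrame({
    "size_nm": [95, 200, 120],
    "zeta_mV": [18, 20, -5],
    "polymer_mw_kda": [30, 40, 25],
})
print(candidates[in_design_space(candidates)])  # only the first candidate passes
```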

Conclusion, limitations, and future prospects

This study successfully established and validated a robust computational framework for the predictive modeling and rational design of gemcitabine-loaded nanocomposites. By integrating classical ML with physics-informed principles, it demonstrates a powerful alternative to the traditional trial-and-error approach in nanocarrier development. The key outcomes of this work are fourfold.

First, regarding predictive accuracy, the XGBoost model emerged as the most effective predictor, achieving high R² scores of 0.89 for LE and 0.91 for EE. This confirms the capability of ML to capture the complex, nonlinear relationships between formulation parameters and critical performance metrics. Second, the study yielded actionable design insights. Through SHAP analysis, we identified and ranked the most critical parameters governing nanocarrier performance. Nanoparticle size and zeta potential were consistently the top features, underscoring the fundamental role of physicochemical properties. The analysis provides a clear hierarchy of factors for researchers to prioritize during formulation.

Third, the integration of physical laws, such as mass balance and diffusion reaction kinetics, into the PIML framework significantly improved the model’s extrapolation capability and ensured physical consistency. This represents a move beyond purely statistical prediction towards a more scientifically grounded tool. Finally, the work enabled in-silico optimization. The generated response surfaces and the derived proposed optimal design space (e.g., nanoparticle size of 80–150 nm and zeta potential of + 15 to + 25 mV) offer testable hypotheses and quantitative guidelines for the experimental formulation of next-generation gemcitabine nanocarriers predicted to have maximized LE and EE.

In conclusion, this study demonstrates that hybrid ML and physics-informed ML models can reliably predict formulation-level performance metrics, specifically LE and EE, within the bounds of the curated literature-derived dataset. Ensemble-based models, particularly XGBoost, consistently captured nonlinear relationships between key physicochemical variables and experimental outcomes, while SHAP analysis provided experimentally plausible and mechanistically interpretable insights. Among these factors, nanoparticle size in the range of approximately 80 to 150 nm and a moderately positive zeta potential of + 15 to + 25 mV emerged as robust predictors of enhanced LE and EE, in agreement with established experimental understanding. At the same time, the predictive scope of the models is constrained by the moderate dataset size (n = 59) and the heterogeneity inherent to literature-sourced nanocomposite formulations. As a result, the identified optimal design regions and response surfaces should be interpreted as model-driven hypotheses rather than definitive formulation rules. The physics-informed regularization improves physical consistency and extrapolative behavior, but it does not substitute for direct experimental validation. Accordingly, this framework is best positioned as an in silico decision-support and screening tool that complements, rather than replaces, experimental formulation development. Its primary value lies in narrowing the experimental search space, prioritizing influential design variables, and generating testable hypotheses that can be validated in future experimental studies.

Despite the promising results, this study has several limitations that should be acknowledged. The primary limitations stem from the data available in the published literature. The moderate dataset size (n = 59) is the most significant constraint. While the nested, study-level cross-validation framework provides a robust defense against overfitting and a realistic performance estimate, a dataset of this size inherently limits the complexity of the relationships that can be reliably learned and increases the risk that high-performing models such as XGBoost and MLP could overfit to noise or trends specific to this particular collection of studies. The high rate of missing values for key predictors such as zeta potential (~40%) further reduces the effective sample size for learning the influence of those variables. Although our fold-specific imputation strategy is conservative, the identified importance of such variables, while mechanistically plausible, carries higher uncertainty. Consequently, the predictive models and the derived optimal design parameters are best viewed as a powerful in-silico screening tool and a source of strong hypotheses, whose true utility must be confirmed through prospective experimental validation, as outlined below. Furthermore, the predictive power is constrained by the features available in the published literature. Critical but less frequently reported parameters, such as detailed polymer crystallinity, exact drug–polymer interaction energies, or precise mixing kinetics during synthesis, were not included but could be significant.
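The study-level cross-validation with fold-specific imputation described above can be sketched as follows. This is a simplified, single-level reconstruction (assuming a `study_id` grouping column and median imputation), not the authors' exact nested pipeline.

```python
# Study-level CV: rows from the same source paper never span train and test,
# and the imputer is fit only on each training fold to avoid leakage.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

def study_level_cv(df, features, target, group_col="study_id", n_splits=5):
    X, y, groups = df[features], df[target], df[group_col]
    scores = []
    for tr, te in GroupKFold(n_splits=n_splits).split(X, y, groups):
        imp = SimpleImputer(strategy="median").fit(X.iloc[tr])  # fold-specific
        model = GradientBoostingRegressor().fit(imp.transform(X.iloc[tr]), y.iloc[tr])
        scores.append(r2_score(y.iloc[te], model.predict(imp.transform(X.iloc[te]))))
    return float(np.mean(scores))
```

Because every fold's test set contains only unseen studies, the resulting R² is a more honest estimate of generalization to new formulations than a random row-level split.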

Additionally, the current models are static and predict final LE and EE values. They do not dynamically model the temporal evolution of the drug loading process or the release profile under physiological conditions, which are crucial for predicting in vivo performance. Finally, the model validation was performed on a held-out test set from the same data pool. Experimental validation of the model’s predictions by synthesizing and testing new, optimally designed nanocomposites is the necessary next step to confirm its real-world utility.

Building upon the foundation laid by this work, several exciting avenues for future research are proposed. The immediate next step is to synthesize a new series of gemcitabine nanocomposites based on the model’s optimal design parameters and experimentally measure their LE and EE. This will provide a critical, real-world test of the framework’s predictive power. The developed framework is also highly generalizable. Future work will involve expanding the dataset to include nanocomposites loaded with other chemotherapeutic agents, such as doxorubicin or paclitaxel, to create a universal platform for nanocarrier design.

A grand challenge in nanomedicine is bridging the in vitro to in vivo gap. Future models will aim to incorporate features related to pharmacokinetics, biodistribution, and therapeutic efficacy, enabling the prediction of not just formulation success but also clinical potential. The framework can be extended to include time-series data for predicting drug-release profiles. Furthermore, leveraging generative artificial intelligence (AI) models could enable the inverse design of novel nanocarrier compositions and architectures tailored to specific therapeutic requirements. Finally, we envision packaging the best-performing models into an open-access, user-friendly software tool. This would democratize access to this predictive capability, allowing researchers worldwide to accelerate their nanomedicine development projects.

By addressing these future directions, the integration of ML and materials science promises to usher in a new era of intelligent, data-driven nanocarrier design, ultimately leading to more effective and personalized cancer therapies.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (92.1KB, docx)

Author contributions

Abbas Rahdar and Sonia Fathi-Karkan contributed equally to this work, including conceptualization, investigation, writing (original draft and editing), and supervision. Maryam Shirzad: Investigation, Writing–original draft & editing.

Data availability

All data generated or analyzed during this study are included in this published article.

Declarations

AI tools were used for grammar and text enhancement, but the content was primarily authored by the writers.

Competing interests

The authors declare no competing interests.

Consent for publication

All authors agreed with the content and all gave consent to submit the manuscript.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Abbas Rahdar, Email: a.rahdar@uoz.ac.ir.

Sonia Fathi-karkan, Email: Soniafathi92@gmail.com.




Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
