Scientific Reports. 2025 Aug 22;15:30887. doi: 10.1038/s41598-025-16577-2

Predictive analysis of solubility data with pressure and temperature in assessing nanomedicine preparation via supercritical carbon dioxide

Hashem O Alsaab 1, Yusuf S Althobaiti 2,3
PMCID: PMC12373915  PMID: 40846787

Abstract

This work presents a study on predicting the solubility of phenytoin in the supercritical state using machine learning. The solubility of this small-molecule pharmaceutical was analyzed and calculated with the aim of enhancing its solubility and bioavailability. The models were employed to approximate the solubility at various pressures and temperatures. The dataset comprises temperature (T), pressure (P), and solubility (y) values, along with the corresponding solvent density measurements used in the models. Three models, namely Automatic Relevance Determination Regression (ARD), Gaussian process regression (GPR), and Linear Regression (LR), were designed and tuned to build predictive models. The AdaBoost ensemble technique was applied to strengthen the predictive capabilities of the models, while hyperparameter tuning was conducted using the Jellyfish Optimization (JO) algorithm. For phenytoin solubility prediction, the ADA-GPR model was the most accurate, obtaining an R² of 0.99644. The ADA-ARD model yielded an R² of 0.95249, and the ADA-LR model an R² of 0.93381. For solvent density prediction, the ADA-GPR model again outperformed the others, with an R² of 0.9933.

Keywords: Automatic relevance determination, Nanoparticles, Green processing, Solubility

Subject terms: Molecular biology, Molecular medicine, Biomedical engineering

Introduction

Estimating drug solubility and dissolution at different temperatures and times is essential for controlling the bioavailability of medicines, as poor drug solubility is a major issue in the pharmaceutical industry. Methods such as amorphous solid dispersion (ASD) and co-crystallization have been recognized as useful for understanding the interactions between solvent and solute and for improving the dissolution and solubility of drugs in aqueous solutions1–3. Solubility therefore plays an important role in the development of novel drug formulations.

Machine learning can be integrated with molecular-level models to boost the predictive capability of molecular modeling techniques in the design of drugs with enhanced solubility. Machine learning techniques have transformed scientific research by enabling powerful data-driven modeling and predictive capabilities across diverse disciplines. Regression models play a crucial role in this domain and can also be used to predict drug solubility in different solvents4–6. Compared with the thermodynamic approach to simulating pharmaceutical solubility, regression models based on machine learning can be applied to a large number of compounds with high accuracy, and they do not require estimation of intermolecular interaction parameters7.

Here, we used several regression models to simulate and correlate pharmaceutical solubility in supercritical CO2. The models are based on machine learning (ML) and include Automatic Relevance Determination Regression (ARD), Gaussian process regression (GPR), and Linear Regression (LR). They were used to model the solubility of the small-molecule drug phenytoin and the corresponding solvent density. The solvent is supercritical carbon dioxide, which is a green solvent with considerable advantages for green processing of nanomedicine8–10. AdaBoost is employed to enhance the models' precision, and the Jellyfish Optimizer (JO) is employed to optimize the models' hyperparameters.

The Adaboost Regression technique is an effective approach to ensemble learning, whereby a robust predictor is constructed by amalgamating several weak regression models. Adaboost Regression operates by sequentially fitting a series of weak regressors to the training data, while dynamically updating the weights assigned to individual samples based on the prediction errors in each iteration. Poorly predicted samples get higher weights, directing later models to focus more on these tough cases. This iterative process gradually improves performance.
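The reweighting loop described above can be sketched with scikit-learn's `AdaBoostRegressor`, which implements the AdaBoost.R2 variant. This is an illustrative sketch on synthetic data, not the authors' code: the data-generating function, `n_estimators`, and the random seed are assumptions made only for the example.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, mildly nonlinear data (an assumption; not the phenytoin dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 2))
y = 1.5 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0.0, 0.01, size=60)

# A single weak learner, fitted once, for reference.
weak = LinearRegression().fit(X, y)

# Boosted ensemble of the same weak learner. Each round resamples the
# training set with weights that grow for poorly predicted samples.
boosted = AdaBoostRegressor(LinearRegression(), n_estimators=20, random_state=0)
boosted.fit(X, y)

print("weak R2:   ", weak.score(X, y))
print("boosted R2:", boosted.score(X, y))
```

The base learner is passed positionally because the keyword for it has changed name across scikit-learn releases.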

Jellyfish Optimizer (JO) is inspired by the movement patterns of jellyfish, offering efficient solutions for regression problems. Automatic Relevance Determination Regression (ARD) determines the relevance of input features in regression models, aiding in feature selection and regularization11. Linear Regression (LR) models the relationship between input variables and a target variable by fitting a linear equation to observed data, assuming the target can be expressed as a weighted combination of inputs plus noise12. Gaussian process regression (GPR) utilizes Gaussian processes to model continuous outputs, providing a flexible and probabilistic approach to regression tasks13,14.

Description of solubility

We modeled the solubility of a small molecule in a supercritical solvent at various pressures and temperatures to optimize the operating conditions and identify the parameters that govern changes in solubility. The dataset contains the solubility of phenytoin (y) and the density of carbon dioxide at multiple temperatures (T) and pressures (P). The data are taken from ref. 15 and have been used by Ghazwani et al.7,9 for machine learning correlation; their method is adopted in this work. The data are organized into four columns7,9:

  • T (K): Temperature, in the range 313–345 K,

  • P (bar): Pressure, in the range 95–250 bar,

  • y: Solubility of phenytoin, the model drug. It represents the fraction of phenytoin dissolved in the solution,

  • Solvent density: Density of carbon dioxide. The density values range from 222.2849 to 858.2483.

Figure 1 displays the frequencies of input and output parameters in histogram plots.

Fig. 1. Histograms of parameter frequencies for the four variables in the dataset.

In the data preprocessing phase, prior to ML modeling, normalization was performed using Min-Max scaling so that all input features, namely temperature, pressure, and other continuous variables, were scaled to a uniform range, typically between 0 and 1. This normalization avoids bias toward high-magnitude features and improves optimization efficiency. Subsequently, the dataset was randomly divided into training-validation and testing groups using an 80:20 split ratio. This random partitioning ensured that the models were trained on a substantial portion of the data while preserving a hold-out set for unbiased evaluation of model performance.
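The two preprocessing steps can be sketched in a few lines of NumPy. The helper names (`min_max_scale`, `train_test_split_80_20`), the grid of (T, P) values, and the placeholder response are hypothetical and chosen only to stay inside the ranges reported above; they are not the published data.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X linearly to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant columns
    return (X - lo) / span

def train_test_split_80_20(X, y, seed=0):
    """Randomly split rows into 80% training-validation and 20% testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(0.2 * len(X)))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], X[test], y[train], y[test]

# Hypothetical (T, P) grid within the reported ranges; y is a placeholder.
T = np.array([313.0, 323.0, 333.0, 345.0])
P = np.array([95.0, 130.0, 170.0, 210.0, 250.0])
X_raw = np.array([(t, p) for t in T for p in P])
y = np.arange(len(X_raw), dtype=float)

X = min_max_scale(X_raw)
X_train, X_test, y_train, y_test = train_test_split_80_20(X, y)
print(X_train.shape, X_test.shape)
```

The sketch scales before splitting, following the order described in the text. Computing the minimum and maximum on the training portion only is a common alternative that keeps test-set information out of the scaling step.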

Methodology

Jellyfish optimizer (JO)

The optimization and regression methods employed in this study are discussed in the following sections. The JO is a metaheuristic algorithm inspired by the collective behavior of jellyfish swarms, particularly their foraging strategies and movement dynamics, and is used to address complex optimization tasks. It is designed to navigate the solution space effectively and converge toward optimal outcomes16.

The algorithm iteratively improves the solutions through a combination of exploration and exploitation strategies. The positional update for each jellyfish individual is mathematically represented by the following equation16:

$$X_{i,d}(t+1) = X_{i,d}(t) + \Delta X_{i,d}(t)$$

Here, $X_{i,d}(t+1)$ represents the new position of jellyfish $i$ in dimension $d$ at time $t+1$, $X_{i,d}(t)$ is its current location, and $\Delta X_{i,d}(t)$ represents the displacement vector17.
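To make the additive position update concrete, the following is a simplified sketch loosely following the jellyfish search scheme described in ref. 16. The constants 3 and 0.1, the clipping at the bounds, the greedy acceptance step, and all function and parameter names are assumptions of this sketch. It is not the authors' implementation and omits details of the original algorithm.

```python
import numpy as np

def jellyfish_search(f, lower, upper, n_pop=20, n_iter=100, seed=0):
    """Minimize f over a box with a simplified jellyfish search."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = lower.size
    X = rng.uniform(lower, upper, size=(n_pop, dim))   # initial swarm
    fit = np.array([f(x) for x in X])
    history = [fit.min()]
    for t in range(1, n_iter + 1):
        best = X[fit.argmin()].copy()
        # Time control: its upper bound shrinks as t grows, so following
        # the ocean current can only happen in the early iterations.
        c = abs((1 - t / n_iter) * (2 * rng.random() - 1))
        for i in range(n_pop):
            if c >= 0.5:
                # Ocean current: drift toward the best position.
                step = rng.random(dim) * (best - 3 * rng.random() * X.mean(axis=0))
            elif rng.random() > 1 - c:
                # Passive motion: small random move around the current position.
                step = 0.1 * rng.random(dim) * (upper - lower)
            else:
                # Active motion: move toward a better peer or away from a worse one.
                j = rng.integers(n_pop)
                direction = X[j] - X[i] if fit[j] < fit[i] else X[i] - X[j]
                step = rng.random(dim) * direction
            x_new = np.clip(X[i] + step, lower, upper)  # new position = current + displacement
            f_new = f(x_new)
            if f_new < fit[i]:                          # keep only improvements
                X[i], fit[i] = x_new, f_new
        history.append(fit.min())
    return X[fit.argmin()].copy(), float(fit.min()), history

# Usage on a toy objective (the sphere function, minimum 0 at the origin).
sphere = lambda x: float(np.sum(x ** 2))
best_x, best_f, history = jellyfish_search(sphere, [-5, -5], [5, 5])
print(best_x, best_f)
```

For hyperparameter tuning, `f` would be replaced by a function that trains a model with the candidate hyperparameters and returns a validation error.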

Automatic relevance determination regression (ARD)

ARD regression is a Bayesian modeling approach that performs feature selection and coefficient estimation simultaneously by inferring the importance of each predictor from the data, which gives a principled route to feature selection and regularization18.

In ARD Regression, the objective is to predict the target variable $y$ based on a given input vector $\mathbf{x}$. The model is written as11:

$$y = \mathbf{w}^{\top}\mathbf{x} + \varepsilon$$

Inference and learning in ARD are conducted within a Bayesian framework. Variational Bayesian techniques are employed to estimate the posterior distributions of both the model weights and associated hyperparameters10. These techniques seek to reduce the discrepancy between the exact posterior distribution and its approximate counterpart. The regularization parameters (α) are usually learned by maximizing the model evidence or the posterior probability, utilizing methods like type-II maximum likelihood or empirical Bayes estimation.

ARD regression offers the advantage of automatic feature selection by assigning individual relevance weights to input features, effectively reducing model complexity and improving interpretability. However, it can be computationally intensive for high-dimensional data and may require careful tuning of hyperparameters to avoid overfitting or underfitting.
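The automatic pruning of irrelevant features can be illustrated with scikit-learn's `ARDRegression`, which estimates the per-weight precisions by evidence maximization. The synthetic data below are an assumption made for illustration: only the first of three features influences the target.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

# Synthetic data (an assumption): three features, only column 0 matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.1, size=100)

model = ARDRegression().fit(X, y)

print("coefficients:", model.coef_)       # weights of columns 1 and 2 shrink toward zero
print("weight precisions:", model.lambda_)
```

In scikit-learn's naming, `lambda_` holds the estimated precision of each weight. A large precision pins the corresponding weight near zero, which marks that feature as irrelevant.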

Linear regression (LR)

LR postulates a linear association between the input feature(s) and the target variable, and aims to find the best-fitting linear equation that describes this relationship19,20.

Let us consider a dataset including $n$ data points, where each sample is denoted by a vector $\mathbf{x}_i$ of $p$ input variables, and the corresponding target variable is indicated by $y_i$. The objective of LR is to find a set of coefficients $\beta_0, \beta_1, \ldots, \beta_p$ such that the linear equation12:

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$$

provides the best approximation of the relationship between the input features and the target output. Here, $\beta_0$ represents the intercept term, $\beta_j$ (for $j = 1, \ldots, p$) denotes the coefficient associated with the $j$-th input variable, and $\varepsilon_i$ signifies the error term, which captures the unexplained variability in the target variable21.

To estimate the coefficients, the Ordinary Least Squares (OLS) method is frequently used in LR. The objective of OLS is to minimize the sum of squared discrepancies between the observed target values and the values predicted by the linear equation. Mathematically, this can be formulated as12:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$

Once the coefficients are estimated, the Linear Regression model can estimate the target variable $\hat{y}$ for new observations based on their predictor values $\mathbf{x}^{*}$. With estimated coefficients $\hat{\beta}_j$, the predicted value is calculated as12,22:

$$\hat{y} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x^{*}_j$$

where $x^{*}_j$ represents the input values for the new observation.

LR is simple, fast, and easy to interpret, making it a strong baseline for many predictive tasks. Its main limitation is the assumption of linear relationships between variables, which can lead to poor performance on complex, nonlinear data. Additionally, it can be sensitive to multicollinearity and outliers.
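The least-squares fit and the prediction step can be sketched directly with NumPy. The helper names `fit_ols` and `predict_ols` and the noise-free toy data are hypothetical, introduced only to make the two steps concrete.

```python
import numpy as np

def fit_ols(X, y):
    """Return [beta_0, beta_1, ..., beta_p] minimizing the sum of squared residuals."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X_new):
    """Evaluate beta_0 + sum_j beta_j * x_j for each row of X_new."""
    return beta[0] + np.asarray(X_new) @ beta[1:]

# Noise-free toy data generated from y = 2 + 0.5*x1 - 1.0*x2 (an assumption).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 2.0 + 0.5 * X[:, 0] - 1.0 * X[:, 1]

beta = fit_ols(X, y)
print(beta)                               # close to [2.0, 0.5, -1.0]
print(predict_ols(beta, [[3.0, 2.0]]))    # close to [1.5]
```

`np.linalg.lstsq` is used instead of explicitly inverting the normal-equations matrix because it remains stable when input columns are nearly collinear.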

Gaussian process regression (GPR)

A Gaussian process (GP) is a collection of random variables, any finite subset of which follows a joint Gaussian distribution23,24. A GP is fully specified by its mean and covariance functions25. In this sense, a GP extends the Gaussian distribution (GD) from finite-dimensional vectors to functions. Gaussian process regression assumes that the observations are correlated, with the correlation encoded by the covariance function. Because a GD describes random vectors whereas a GP describes distributions over functions, the GP is the more general object14.

One distinguishing characteristic of GPR compared with other regression models is that it does not require a precise specification of a fitting function. The observations are instead treated as a sample drawn from a multivariate normal distribution26,27.

The dataset $\mathcal{D}$ corresponds to a set of multidimensional pairs $(\mathbf{x}_i, y_i)$, where each target value $y_i \in \mathbb{R}$ and the input vector $\mathbf{x}_i \in \mathbb{R}^{d}$ represents the feature data14.

$$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$$

A GP is defined in terms of a latent function $f(\mathbf{x})$ as14:

$$f(\mathbf{x}) \sim \mathcal{GP}\big(m(\mathbf{x}), K(\mathbf{x}, \mathbf{x}')\big)$$

The mean function is denoted by m(x) and the kernel covariance by K, which depends on the input values and the chosen kernel28.

GPR provides flexible, nonparametric modeling with built-in uncertainty quantification, making it well-suited for capturing complex, nonlinear relationships. Optimal implementation requires balancing computational demands against model flexibility: the kernel matrix inversion becomes computationally intensive for large data sets, while suboptimal kernel specifications degrade generalization capability.
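The posterior mean and the built-in uncertainty estimate can be sketched with a zero-mean GP and a squared-exponential kernel in plain NumPy. The kernel choice, the hyperparameter values, the noise level, and the toy sine data are assumptions made for illustration; the source does not state which kernel was used.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between rows of A and rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-4, **kernel_args):
    """Posterior mean and standard deviation of a zero-mean GP at X_test."""
    K = rbf_kernel(X_train, X_train, **kernel_args) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, **kernel_args)
    K_ss = rbf_kernel(X_test, X_test, **kernel_args)
    L = np.linalg.cholesky(K)                       # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

X_train = np.linspace(0.0, 1.0, 8)[:, None]
y_train = np.sin(2 * np.pi * X_train[:, 0])
X_test = np.array([[0.25], [2.0]])   # one point inside the data, one far outside

mean, std = gp_posterior(X_train, y_train, X_test, length_scale=0.2)
print(mean, std)
```

The predictive standard deviation is small between training points and returns to the prior level far from the data, which is the uncertainty quantification referred to above. The Cholesky factorization is the step whose cost grows cubically with the number of training points.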

Modeling results

This section presents the comprehensive modeling results obtained from the application of ADA-ARD, ADA-GPR, and ADA-LR models on the provided dataset.

For the prediction of solubility, the models were boosted using ADABOOST, and hyper-parameter tuning was performed using the JO algorithm. The modeling results for solubility prediction are summarized as follows:

  • ADA-GPR: The ADA-GPR model achieved the best performance, with an R² of 0.99644, an MSE of 7.4007E-02, and a MAPE of 4.60427E-02.

  • ADA-LR: The ADA-LR model was the least accurate of the three, with an R² of 0.93381, an MSE of 8.8473E-02, and a MAPE of 5.1172E-02.

  • ADA-ARD: The ADA-ARD model fitted the data well, with an R² of 0.95249, an MSE of 8.1420E-02, and a MAPE of 4.94415E-02.

For solvent density prediction, we employed the same ensemble of models (ADA-GPR, ADA-ARD, and ADA-LR), with comparative results detailed in the following items:

  • ADA-GPR: The ADA-GPR model predicted density most precisely, with an R² of 0.9933, an MSE of 2.9764E+02, and a MAPE of 3.09560E-02.

  • ADA-ARD: The ADA-ARD model also performed well, with an R² of 0.9493, an MSE of 1.9115E+03, and a MAPE of 7.5128E-02.

  • ADA-LR: The ADA-LR model gave satisfactory predictions of solvent density, with an R² of 0.92743 and a MAPE of 9.36355E-02.
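The source does not state the formulas behind the reported scores. Assuming the standard definitions, with $y_i$ the measured value, $\hat{y}_i$ the model prediction, $\bar{y}$ the mean of the measured values, and $n$ the number of samples, the three metrics read:

```latex
\begin{align}
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \\
\mathrm{MSE} &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
\mathrm{MAPE} &= \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|.
\end{align}
```

Under these assumed definitions MAPE is a fraction rather than a percentage, so a value of 4.6E-02 would correspond to an average relative error of about 4.6%. MSE carries the squared units of the output, which would explain why the density values are several orders of magnitude larger than those for solubility.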

Based on these results, the ADA-GPR model was used for the rest of the analyses in this study. Additionally, 5-fold cross-validation was performed to evaluate the generalization ability of the models (see Table 1). The ADA-GPR model showed excellent consistency, with an R² of 0.99452 and a MAPE of 0.04879 for solubility, and an R² of 0.99191 and a MAPE of 0.03296 for density prediction, with low standard deviations across folds. These results indicate that the model is robust and show no sign of overfitting.

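Table 1. Cross-validation results of the models using the 5-fold method.

| Output | Model | Metric | Mean | Std |
|---|---|---|---|---|
| Solubility | ADA-GPR | R² | 0.99452 | 0.00183 |
| Solubility | ADA-GPR | MSE | 0.08281 | 0.00712 |
| Solubility | ADA-GPR | MAPE | 0.04879 | 0.00297 |
| Solubility | ADA-LR | R² | 0.92841 | 0.01345 |
| Solubility | ADA-LR | MSE | 0.09560 | 0.01128 |
| Solubility | ADA-LR | MAPE | 0.05398 | 0.00674 |
| Solubility | ADA-ARD | R² | 0.94897 | 0.00936 |
| Solubility | ADA-ARD | MSE | 0.08759 | 0.00859 |
| Solubility | ADA-ARD | MAPE | 0.05137 | 0.00521 |
| Density | ADA-GPR | R² | 0.99191 | 0.00294 |
| Density | ADA-GPR | MSE | 311.025 | 63.2145 |
| Density | ADA-GPR | MAPE | 0.03296 | 0.00167 |
| Density | ADA-ARD | R² | 0.94587 | 0.01182 |
| Density | ADA-ARD | MSE | 2000.23 | 217.4839 |
| Density | ADA-ARD | MAPE | 0.07874 | 0.00635 |
| Density | ADA-LR | R² | 0.92236 | 0.01769 |
| Density | ADA-LR | MSE | 2455.04 | 284.917 |
| Density | ADA-LR | MAPE | 0.09661 | 0.00892 |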
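The 5-fold procedure summarized in Table 1 can be sketched with scikit-learn as follows. This is a hedged illustration, not the authors' pipeline: the synthetic data, the RBF kernel, the `alpha` jitter, `normalize_y`, and `n_estimators` are all assumptions, and the hyperparameters are fixed here rather than tuned with JO.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for scaled (T, P) inputs and a smooth positive response.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(40, 2))
y = 1.0 + X[:, 0] * X[:, 1] + 0.5 * X[:, 1]

# AdaBoost over a GPR base learner. alpha adds diagonal jitter because
# boosting resamples with replacement and can duplicate rows.
base = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-3,
                                normalize_y=True)
model = AdaBoostRegressor(base, n_estimators=5, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scoring = {
    "r2": "r2",
    "mse": "neg_mean_squared_error",
    "mape": "neg_mean_absolute_percentage_error",
}
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)

for name in scoring:
    values = scores[f"test_{name}"]
    if name != "r2":
        values = -values          # scikit-learn reports errors as negated scores
    print(f"{name}: mean={values.mean():.5f}, std={values.std():.5f}")
```

The mean and standard deviation over the five folds correspond to the two numeric columns of Table 1.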

The performance of the ADA-GPR model for both outputs was visualized using residual plots, presented in Figs. 2 and 3. These figures show the close agreement between actual and predicted values. The density decreases with temperature owing to expansion of the solvent, which can reduce solubility in the solvent7.

Fig. 2. Residual results for solubility.

Fig. 3. Residual results for CO2 density.

Finally, the 2D plots in Figs. 4 and 5 show the individual effect of T on both outputs. Solubility increases with both temperature and pressure, whereas solvent density decreases with increasing temperature. This suggests that molecular interactions between the drug and the solvent dominate the solubility enhancement at higher T, so that solubility increases with T despite the reduction in solvent density29. Moreover, pressure has a direct relationship with both solubility and density, so increasing pressure is favorable for supercritical processing because solubility rises with P9. This behavior has been analyzed and reported by Ghazwani et al.7,9 for the solubility of phenytoin, with similar results.

Fig. 4. Temperature effect on solubility at multiple constant pressures.

Fig. 5. Temperature effect on CO2 density at multiple constant pressures.

To demonstrate the generality of the best-performing model, ADA-GPR, its predictive capabilities were extended to a diverse group of drugs, as detailed in Table 2 and visualized in Fig. 6. The model was applied to a dataset comprising over 450 data points across multiple pharmaceutical compounds, including Azathioprine, Sunitinib malate, Levonorgestrel, Docetaxel, Anastrozole, Exemestane, Tamsulosin, Nilotinib hydrochloride monohydrate, Crizotinib, Ketoconazole, Ketotifen Fumarate, Quetiapine hemifumarate, Amlodipine Besylate, Losartan Potassium, and Regorafenib monohydrate. The ADA-GPR model achieved high R² scores for these drugs, ranging from 0.94841 for Docetaxel to 0.99469 for Levonorgestrel, illustrating its robust performance across varied chemical structures and conditions in supercritical processing. Figure 6 illustrates the close alignment between actual and predicted solubility values for these compounds, confirming the model’s versatility and effectiveness in predicting solubility for a broad range of pharmaceuticals in supercritical carbon dioxide, thus supporting its potential for widespread application in nanomedicine development.

Table 2. Generalizability of the best model for a group of different drugs.

| Drug | R² score |
|---|---|
| Azathioprine | 0.99218 |
| Sunitinib malate | 0.98192 |
| Levonorgestrel | 0.99469 |
| Docetaxel | 0.94841 |
| Anastrozole | 0.98009 |
| Exemestane | 0.99107 |
| Tamsulosin | 0.95894 |
| Nilotinib hydrochloride monohydrate | 0.98783 |
| Crizotinib | 0.95137 |
| Ketoconazole | 0.99188 |
| Ketotifen fumarate | 0.96311 |
| Quetiapine hemifumarate | 0.95329 |
| Amlodipine besylate | 0.97587 |
| Losartan potassium | 0.96218 |
| Regorafenib monohydrate | 0.97420 |

Fig. 6. Comparison of actual and predicted values for a group of different drugs with more than 450 data points.

Conclusion

In this study, we used ARD, GPR, and LR models to predict phenytoin solubility and solvent density from temperature (T) and pressure (P). The model was a hybrid comprehensive model combining a numerical optimizer and molecular-level analysis for understanding solvent-drug interactions and finding the controlling step in drug dissolution. This would help determine and control the rate of drug dissolution in the aqueous phase and design new drug delivery systems with the desired solubility and rate of dissolution. We achieved outstanding modeling results by using AdaBoost for model boosting and the JO algorithm for hyperparameter tuning. Both solubility and solvent density were estimated using the hybrid model developed in this study.

All three models performed well in predicting solubility. ADA-GPR stood out with an R² of 0.99644, indicating an exceptional level of accuracy. ADA-LR and ADA-ARD also performed well, with R² values of 0.93381 and 0.95249, respectively. These findings demonstrate the models' ability to estimate solubility accurately, providing valuable insights for drug formulation processes.

For solvent density prediction, ADA-GPR again performed best, achieving an R² of 0.9933. ADA-ARD also produced good results, with an R² of 0.9493. Despite a slightly lower R² of 0.92743, ADA-LR demonstrated reasonable predictive capability. These findings show that the models have the potential to estimate solvent density accurately, which can have significant implications in environmental studies.

Acknowledgements

The authors extend their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2024-30).

Author contributions

H.O.A.: Writing, Methodology, Investigation, Software, Supervision, Funding. Y.S.A.: Writing, Conceptualization, Investigation, Software, Validation, Visualization. All authors reviewed the manuscript.

Data availability

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. He, Z. et al. Innovative medicinal chemistry strategies for enhancing drug solubility. Eur. J. Med. Chem. 279, 116842 (2024).
  • 2. Cools, L. & Van den Mooter, G. A comprehensive overview of the role of intermolecular interactions in amorphous solid dispersions. Int. J. Pharm. 674, 125441 (2025).
  • 3. Pendam, D. et al. Advances in formulation strategies and stability considerations of amorphous solid dispersions. J. Drug Deliv. Sci. Technol. 108, 106922 (2025).
  • 4. Abu Lila, A. S. et al. Numerical optimization of lenalidomide immunomodulatory drug inside the supercritical carbon dioxide system using different machine learning models. J. Mol. Liq. 393, 123647 (2024).
  • 5. Alzhrani, R. M. et al. Novel numerical simulation of drug solubility in supercritical CO2 using machine learning technique: lenalidomide case study. Arab. J. Chem. 15 (11), 104180 (2022).
  • 6. Xia, S. & Wang, Y. Preparation of solid-dosage nanomedicine via green chemistry route: advanced computational simulation of nanodrug solubility prediction using machine learning models. J. Mol. Liq. 375, 121319 (2023).
  • 7. Ghazwani, M. & Begum, M. Y. Machine learning aided drug development: assessing improvement of drug efficiency by correlation of solubility in supercritical solvent for nanomedicine preparation. J. Mol. Liq. 387, 122511 (2023).
  • 8. Chinh Nguyen, H. et al. Computational prediction of drug solubility in supercritical carbon dioxide: thermodynamic and artificial intelligence modeling. J. Mol. Liq. 354, 118888 (2022).
  • 9. Ghazwani, M. et al. Development of advanced model for understanding the behavior of drug solubility in green solvents: machine learning modeling for small-molecule API solubility prediction. J. Mol. Liq. 386, 122446 (2023).
  • 10. Obaidullah, A. J. Implementing and tuning machine learning-based models for description of solubility variations of nanomedicine in supercritical solvent for development of green processing. Case Stud. Therm. Eng. 49, 103200 (2023).
  • 11. Marwala, T. Automatic relevance determination in economic modeling. Econ. Model. Artif. Intell. Methods 45–64 (2013).
  • 12. Montgomery, D. C., Peck, E. A. & Vining, G. G. Introduction to Linear Regression Analysis (Wiley, 2021).
  • 13. Williams, C. K. & Rasmussen, C. E. Gaussian processes for regression (1996).
  • 14. Rasmussen, C. E. Gaussian processes in machine learning. In Summer School on Machine Learning (Springer, 2003).
  • 15. Notej, B. et al. Increasing solubility of phenytoin and raloxifene drugs: application of supercritical CO2 technology. J. Mol. Liq. 121246 (2023).
  • 16. Chou, J. S. & Truong, D. N. A novel metaheuristic optimizer inspired by behavior of jellyfish in ocean. Appl. Math. Comput. 389, 125535 (2021).
  • 17. Chou, J. S. & Molla, A. Recent advances in use of bio-inspired jellyfish search algorithm for solving optimization problems. Sci. Rep. 12 (1), 19157 (2022).
  • 18. Wipf, D. & Nagarajan, S. A new view of automatic relevance determination. Adv. Neural Inf. Process. Syst. 20 (2007).
  • 19. Faraway, J. J. Linear Models with R (CRC Press, 2014).
  • 20. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009).
  • 21. Agresti, A. & Franklin, C. The Art and Science of Learning from Data 88 (Upper Saddle River, 2007).
  • 22. Kutner, M. H. et al. Applied Linear Regression Models 4 (McGraw-Hill/Irwin New York, 2004).
  • 23. Grbić, R., Kurtagić, D. & Slišković, D. Stream water temperature prediction based on Gaussian process regression. Expert Syst. Appl. 40 (18), 7407–7414 (2013).
  • 24. Ma, X., Xu, F. & Chen, B. Interpolation of wind pressures using Gaussian process regression. J. Wind Eng. Ind. Aerodyn. 188, 30–42 (2019).
  • 25. Song, H. et al. Advancing nanomedicine production via green method: modeling and simulation of pharmaceutical solubility at different temperatures and pressures. J. Mol. Liq. 411, 125806 (2024).
  • 26. Quinonero-Candela, J. & Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res. 6, 1939–1959 (2005).
  • 27. Jiang, Y. et al. Prediction of gas-liquid two-phase choke flow using Gaussian process regression. Flow Meas. Instrum. 81, 102044 (2021).
  • 28. Wu, C. et al. Deep Kernel Learning for Clustering. In Proceedings of the 2020 SIAM International Conference on Data Mining (SIAM, 2020).
  • 29. Wu, S. et al. Intelligence modeling of nanomedicine manufacture by supercritical processing in estimation of solubility of drug in supercritical CO2. Sci. Rep. 15 (1), 23193 (2025).
