Abstract
Because paracetamol is so widely used, improving its solubility would have a major impact on patient wellbeing. Supercritical processing can be used to nanonize drug particles, which in turn increases their solubility and consequently allows lower dosages for patients. This study presents the results of neighbor-based ensemble models for predicting the mole fraction of paracetamol in a supercritical solvent, as well as the solvent density, at different conditions. The models were trained and evaluated on a dataset of 40 instances. The K-nearest neighbor (KNN) regression algorithm was selected as the base model, and the ensemble methods bagging and AdaBoost were employed for model improvement. Additionally, two metaheuristic algorithms, BAT and GWO, were applied to tune the hyperparameters of the models. Each model’s performance was assessed using three metrics: the R-squared score, MSE, and AARD percentage. The outcomes showed that the GWO-ADA-KNN model demonstrated superior performance in predicting both mole fraction and density, as evidenced by its respective R-squared scores of 0.96719 and 0.98105. These findings indicate that the proposed optimizers and models can accurately predict drug mole fraction and density under different conditions.
Keywords: Solubility, Machine learning, Supercritical CO2, Metaheuristic algorithms
Subject terms: Chemistry, Computational biology and bioinformatics, Drug discovery, Mathematics and computing
Introduction
The physicochemical properties of paracetamol, such as its mole fraction and density, play a crucial role in its pharmaceutical applications, including drug formulation, manufacturing, and quality control. Predicting these properties under different conditions can be challenging due to the drug's complex nature and the various factors that influence its behavior. One way to overcome this challenge is to develop accurate machine learning models that provide reliable estimates of the mole fraction and density of paracetamol under different temperature and pressure conditions. More broadly, solubility enhancement is a key issue in pharmaceutical production, as many drugs suffer from poor solubility in the body1–3. Different techniques can be utilized to improve drug solubility.
Preparation of drug cocrystals and design of amorphous solid dispersions are two methods developed to enhance drug bioavailability4,5. Nanosized medicines can also be manufactured using supercritical solvents, in which the drug particles are formed after dissolution in the solvent. Pressure and temperature can be tuned to produce drug particles of the desired size6,7. The methodology is highly dependent on drug solubility and is not suitable for drugs that are poorly soluble in supercritical solvents. While different solvents can be used in supercritical processing, CO2 has attracted the most attention owing to its favorable properties for drug manufacture, chiefly the mild supercritical conditions it affords. Several recent works have utilized CO2 as the solvent in analyses of supercritical processing, with models employed to estimate solubility8–10.
To save the time and cost of solubility measurements in the supercritical manufacture of medicines, predictive models can be implemented and tested for calculating solubility. Several thermodynamic models have been tested and used to correlate the solubility of drugs in supercritical solvents; however, these models cannot be easily extended to a large variety of medications11–13. Hence, models with simpler procedures, such as machine learning, are preferred for estimating drug solubility in supercritical fluids. The key challenge is to develop a robust model capable of handling large datasets and diverse types of medications. This challenge can be addressed by evaluating generalized machine learning models and validating them against solubility datasets. Recent works have illustrated the development of machine learning models in this area for correlating drug solubility to input features such as pressure and temperature14–16.
In the modern age, machine learning (ML) techniques have gained prominence as a robust approach for developing predictive models in various applications, including drug discovery, materials science, and chemical engineering. Among ML techniques, ensemble learning has attracted substantial attention because of its capacity to improve model robustness by combining multiple base models. It can also be used to correlate drug solubility in supercritical CO2, with measured data serving for both training and validation. Luo et al.17 applied ensemble learning to estimate drug solubility in supercritical CO2: Random Forest was used to correlate the solubility of the drug Exemestane with pressure and temperature. Excellent agreement was obtained, demonstrating the accuracy and reliability of these models for predicting drug solubility.
The current investigation presents neighbor-based ensemble models for predicting the mole fraction of paracetamol and the density of the solvent (supercritical CO2), with temperature and pressure as independent variables. Two ensembles (Bagging and AdaBoost) and two optimizers (GWO and BAT) are used on top of KNN, so four different models are obtained and analyzed in this work. The methodology is implemented for the first time to correlate the solubility data of paracetamol in supercritical CO2, with a view to developing a nanonization process for the production of nanomedicines.
Methodology
In this study, we combined KNN with the AdaBoost and Bagging ensemble methods, as well as two different optimizers, BAT and GWO. As a result, four different models were built and assessed. Figure 1 shows the overall modeling flow, and the following subsections introduce the modeling building blocks.
Fig. 1.

The modeling flow.
Data description and pre-processing
The dataset consists of 40 samples with four variables each: temperature in Kelvin (T), pressure in bar (P), mole fraction (MF), and density of solvent (D), taken from18. The objective is to predict the values of MF and D from T and P. Song et al.16 reported correlation of paracetamol data via machine learning and suggested a methodology that is applied in this study. The same data have also been used in work reported by Alotaibi et al.42, whose procedure was followed here.
Before proceeding with the analysis, a thorough data quality assessment was conducted to ensure reliable modeling outcomes. The dataset was first checked for missing values, and none were found, confirming its completeness. Next, the Isolation Forest19 algorithm was applied to detect potential outliers; this method is well suited to high-dimensional data and identifies anomalies by constructing decision trees that isolate observations. Each data point was assigned an anomaly score based on the path length required for isolation, and points with scores beyond the defined threshold were classified as outliers. This process revealed two data points with unusually high deviations in either solubility or density relative to their corresponding temperature and pressure values. These two outliers were removed from the training set to prevent bias and overfitting in model development, resulting in a final dataset of 38 clean instances. This preprocessing step ensured that subsequent modeling was based on robust and representative data, improving the reliability of the predictions.
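As a sketch, the outlier-screening step described above can be reproduced with scikit-learn's `IsolationForest`. The array below is synthetic stand-in data (the 40 measured points are not reproduced here), and the `contamination` value is an assumption chosen so that roughly two of the 40 points are flagged:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the 40-sample (T, P, mole fraction, density) table;
# two exaggerated rows mimic the anomalies described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
X[5] += 6.0    # artificially large deviation
X[17] -= 6.0   # artificially large deviation

# contamination=0.05 sets the anomaly-score threshold so that ~5% of points
# (about 2 of 40) are flagged; this value is an assumption.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = iso.fit_predict(X)      # +1 = inlier, -1 = outlier
clean = X[labels == 1]
```

The flagged rows are then dropped before model training, leaving the cleaned design matrix `clean`.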
Figure 2 shows the relationship between T and the output variables, and Fig. 3 shows the same for P. These visual representations help clarify how the outputs vary with the input parameters and support the discussion of their influence16.
Fig. 2.
Temperature against output parameters (solvent density and drug solubility).
Fig. 3.
Pressure against output parameters (solvent density and drug solubility).
Metaheuristic optimization
The Grey Wolf Optimizer (GWO) is a metaheuristic algorithm designed to mimic the hunting strategies of grey wolves. The algorithm incorporates four distinct groups of grey wolves: alpha, beta, delta, and omega. The vector of decision variables represents the position of each wolf within the search space. The search is directed by the locations of the wolves, which are revised in every iteration based on their individual positions and the positions of their counterparts20,21.
The updating equations for the position of each wolf are given by22:

D = |C · X_p(t) − X_i(t)|

X_i(t + 1) = X_p(t) − A · D

A = 2a · r_1 − a,  C = 2 · r_2

where A and C denote coefficient vectors, r_1 and r_2 stand for random vectors drawn uniformly from [0, 1], D indicates the distance vector between the wolf and the prey position X_p(t), and X_i(t) is the position of the i-th wolf at iteration t43.
For the GWO implementation, a population size of 35 wolves was employed, with the algorithm executed for 80 iterations to ensure convergence. The control parameter a, which decreases linearly from 2 to 0 over the iterations, was used to balance the exploration and exploitation phases. The coefficient vectors A and C were generated from uniformly distributed random numbers in [0, 1] at each iteration, consistent with the original formulation of Mirjalili et al.20. These settings were selected after preliminary testing to balance computational cost and prediction accuracy, and they ensured reproducible convergence toward optimal hyperparameters in the AdaBoost-KNN framework43.
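A minimal sketch of the GWO loop under the settings quoted above (35 wolves, 80 iterations, a decreasing linearly from 2 to 0) might look as follows. The sphere objective at the bottom is only a placeholder for the actual hyperparameter-tuning objective, and the implementation is a simplified variant rather than the authors' code:

```python
import numpy as np

def gwo(objective, bounds, n_wolves=35, n_iter=80, seed=0):
    """Minimal Grey Wolf Optimizer sketch following Mirjalili et al."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    dim = len(lo)
    X = rng.uniform(lo, hi, size=(n_wolves, dim))
    fit = np.array([objective(x) for x in X])
    for t in range(n_iter):
        a = 2.0 - 2.0 * t / n_iter              # decreases linearly 2 -> 0
        order = np.argsort(fit)                 # minimization
        alpha, beta, delta = X[order[:3]]       # three leading wolves
        for i in range(n_wolves):
            new_pos = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2.0 * a * r1 - a            # coefficient vector A
                C = 2.0 * r2                    # coefficient vector C
                D = np.abs(C * leader - X[i])   # distance to the leader
                new_pos += (leader - A * D) / 3.0
            X[i] = np.clip(new_pos, lo, hi)
            fit[i] = objective(X[i])
    i_best = int(np.argmin(fit))
    return X[i_best], float(fit[i_best])

# Toy check: minimize the 2-D sphere function
x_best, f_best = gwo(lambda x: float(np.sum(x ** 2)), [(-5, 5), (-5, 5)])
```

In the actual framework, `objective` would evaluate cross-validated error of the AdaBoost-KNN model at candidate hyperparameter values.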
The Bat Algorithm is yet another instance of a population-based metaheuristic algorithm that takes cues from the natural world, in this case the bat’s use of echolocation to navigate. The algorithm comprises a group of bats, wherein each bat possesses a frequency and a position within the search space. The search methodology is directed by the frequency and velocity of the bats43. These parameters are revised in every iteration, taking into account the bats’ current locations and the locations of the most optimal bats within the population23,24.
The updating equations for the velocity and position are given by25,26:

f_i = f_min + (f_max − f_min) · β

v_i(t + 1) = v_i(t) + (x_i(t) − x*) · f_i

x_i(t + 1) = x_i(t) + v_i(t + 1)

where v_i(t) stands for the velocity of the i-th bat at iteration t, β is a random vector drawn uniformly from [0, 1], f_i is the frequency assigned to the i-th bat, x* is the position of the best bat in the population, and x_i(t) is the position of the i-th bat at iteration t.
For the BAT algorithm, a population size of 30 bats was used with a maximum of 80 iterations. The frequency parameter was varied within the range [0,2], while initial loudness (A₀) was set to 0.9 and pulse rate (r₀) to 0.5, allowing for gradual transition from exploration to exploitation as iterations progressed. The loudness was adaptively decreased and pulse emission rate increased according to standard update rules, thereby refining the local search as the algorithm advanced. These parameter settings follow the guidelines from Yang23 and were chosen to maintain consistency with prior implementations while ensuring effective tuning of the KNN-based ensemble models.
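Under the stated settings (30 bats, 80 iterations, frequency in [0, 2], A₀ = 0.9, r₀ = 0.5), a minimal Bat Algorithm sketch could be written as below. The decay constants `alpha` and `gamma` in the loudness and pulse-rate updates are illustrative assumptions, and the sphere objective again stands in for the real tuning objective:

```python
import numpy as np

def bat_algorithm(objective, bounds, n_bats=30, n_iter=80,
                  f_min=0.0, f_max=2.0, A0=0.9, r0=0.5,
                  alpha=0.9, gamma=0.9, seed=0):
    """Minimal Bat Algorithm sketch (after Yang); alpha and gamma control
    the loudness decay and pulse-rate growth and are illustrative choices."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    dim = len(lo)
    x = rng.uniform(lo, hi, size=(n_bats, dim))
    v = np.zeros((n_bats, dim))
    loud = np.full(n_bats, A0)
    pulse = np.full(n_bats, r0)
    fit = np.array([objective(xi) for xi in x])
    best = x[np.argmin(fit)].copy()
    best_fit = float(fit.min())
    for t in range(1, n_iter + 1):
        for i in range(n_bats):
            # frequency drawn uniformly from [f_min, f_max]
            f_i = f_min + (f_max - f_min) * rng.random()
            v[i] += (x[i] - best) * f_i
            cand = np.clip(x[i] + v[i], lo, hi)
            if rng.random() > pulse[i]:
                # local random walk around the current best solution
                cand = np.clip(best + rng.uniform(-1, 1, dim) * loud.mean(),
                               lo, hi)
            f_cand = objective(cand)
            if f_cand <= fit[i] and rng.random() < loud[i]:
                x[i], fit[i] = cand, f_cand
                loud[i] *= alpha                          # loudness decreases
                pulse[i] = r0 * (1 - np.exp(-gamma * t))  # pulse rate increases
            if f_cand < best_fit:
                best, best_fit = cand.copy(), float(f_cand)
    return best, best_fit

best, f_best = bat_algorithm(lambda z: float(np.sum(z ** 2)), [(-5, 5)] * 2)
```

As with GWO, the objective in the actual study would score candidate KNN-ensemble hyperparameters rather than this toy function.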
Ensemble methods
Ensemble learning models can improve predictive performance by combining multiple base learners, and AdaBoost is a specific approach within ensemble learning that achieves this by adjusting the weights assigned to each weak learner to improve overall accuracy27. Adaptive boosting technology, or AdaBoost, can improve the performance of simple models by enabling them to tackle complex problems. While basic models have appealing generalization properties due to their simple structure, they have limitations in their ability to handle complex tasks because of their inherent bias. AdaBoost overcomes this limitation by enhancing these basic models and making them more effective in addressing complicated problems.
In contrast, complex models are more susceptible to overfitting and have more complicated structures, which can make their practical application more challenging. Despite their potential advantages in handling complex tasks, their use can be problematic due to implementation issues28. The AdaBoost method is suggested as a solution to such challenges.
The approach commences with a weak model, commonly referred to as the weak learner. Weak learners are then progressively combined to construct a dependable system capable of handling intricate problems29,30. The following steps provide a general overview of the AdaBoost algorithm, as outlined in references31,32:

1. Initialize the weight values, w_i = 1/N for i = 1, …, N, and choose the number of base learners M.

2. For 1 ≤ k ≤ M:

   (a) Fit a weak learner G_k(x) to the training data using the weights w_i.

   (b) Compute the weighted error: err_k = Σ_i w_i I(y_i ≠ G_k(x_i)) / Σ_i w_i.

   (c) Compute the learner coefficient: α_k = log[(1 − err_k) / err_k].

   (d) Update the weights, w_i ← w_i · exp[α_k I(y_i ≠ G_k(x_i))], and renormalize them.

3. Final output: G(x) = sign(Σ_{k=1}^{M} α_k G_k(x)); for regression, as in this study, the AdaBoost.R2 variant follows the same scheme with a bounded loss and a weighted-median combination30.

In this pseudocode, N and M represent the number of data points and learners, respectively, and G_k(x) denotes the predictor of the k-th learner evaluated at data point x33–35.
The ensemble-based machine learning algorithm Bagging Regression combines multiple regression models to strengthen the accuracy and reliability of the final prediction. The model first generates several bootstrap samples of the original training dataset by sampling with replacement. Each bootstrap sample is then used to train an individual regression model. These models are trained independently and can use different algorithms or hyperparameters36.
During the prediction phase, the Bagging Regression model combines the outputs of the individual models to produce a final prediction. For regression, this combination is typically a simple (unweighted) average of the member predictions; weighted variants, in which each model's output is scaled by a coefficient reflecting its performance on a validation set, are also used37.
The main advantage of the Bagging Regression model is its ability to reduce variance in the prediction, while maintaining a similar level of bias. This is accomplished by averaging the outputs of multiple models, which reduces the impact of random fluctuations in the training data set.
K-nearest neighbors (KNN) regression model
The K-Nearest Neighbor (KNN) technique is commonly used as a base in regression models. This model does not generalize from training examples; instead, it retains all of the training data and defers computation until prediction time38. KNN regression is a straightforward approach: it identifies a number (k) of training points nearest to the given data point and predicts the point's numerical target from theirs39. The same distance functions as in KNN classification are used to measure the distance between data points. Two common distance metrics are the Euclidean (Euc_Distance) and Manhattan (Man_Distance) distances. The equations below show how the distance between two data points (x, y) is determined39.
Euc_Distance(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² )

Man_Distance(x, y) = Σ_{i=1}^{n} |x_i − y_i|
As a supervised learning algorithm, KNN predicts new values for an output variable by comparing a test data point (X, y) to all data points of the training set D of sample input and output pairs. To make predictions, KNN computes the distance di between X and each data point Xi in D, and sorts the distances in ascending order. The output value of the i-th closest neighbor is represented by yi(X). For regression problems like our case study, the final prediction y is calculated as the mean of the output values of the k closest neighbors40,41:
y = (1/k) · Σ_{i=1}^{k} y_i(X)
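The neighbor-averaging prediction rule above can be written directly in a few lines; this sketch supports both distance metrics defined earlier:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, metric="euclidean"):
    """Predict a regression target as the mean of the k nearest targets."""
    diff = X_train - x_query
    if metric == "euclidean":
        d = np.sqrt((diff ** 2).sum(axis=1))
    else:                                   # Manhattan distance
        d = np.abs(diff).sum(axis=1)
    nearest = np.argsort(d)[:k]             # indices of the k closest points
    return float(y_train[nearest].mean())

# Toy 1-D example: the 3 nearest training points to x=1.1 are x = 1, 2, 0,
# so the prediction is mean(1.0, 2.0, 0.0) = 1.0
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_hat = knn_predict(X_train, y_train, np.array([1.1]), k=3)
```

Production code would normally use `sklearn.neighbors.KNeighborsRegressor`, which implements the same rule with efficient neighbor search.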
Results and discussion
The dataset was utilized to train and evaluate four distinct models aimed at forecasting the mole fraction and density from the input variables temperature (T) and pressure (P). The performance of the models was assessed using the R2 score, Mean Squared Error (MSE), and Average Absolute Relative Deviation (AARD%).
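For reference, the three metrics can be computed as below. AARD% is not part of scikit-learn, so a small helper is defined alongside MSE and R²; the toy values at the bottom are illustrative:

```python
import numpy as np

def aard_percent(y_true, y_pred):
    """Average Absolute Relative Deviation in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * float(np.mean(np.abs((y_pred - y_true) / y_true)))

def mse(y_true, y_pred):
    """Mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Toy check: relative errors of 10%, 5%, and 0% average to AARD% = 5.0
y_true = [2.0, 4.0, 5.0]
y_pred = [2.2, 3.8, 5.0]
```

AARD% is useful here because the mole fractions span several orders of magnitude, so relative rather than absolute error is the more informative summary.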
The four models were trained using KNN regression as the base model and two ensemble methods, Bagging and AdaBoost. Two metaheuristic optimization algorithms, BAT and GWO, were used to tune the hyperparameters of the models. Tables 1 and 2 summarize the performance of the optimized models on the test set.
Table 1.
Performance of the models for predicting mole fraction.
| Ensemble | Optimizer | R2 Score | MSE | AARD% |
|---|---|---|---|---|
| AdaBoost | BAT | 0.9666 | 7.2364E-13 | 15.9424 |
| Bagging | BAT | 0.83651 | 2.5085E-12 | 28.4419 |
| AdaBoost | GWO | 0.96719 | 7.2002E-13 | 17.0692 |
| Bagging | GWO | 0.83191 | 2.5603E-12 | 30.2435 |
Table 2.
Performance of the models for predicting density.
| Ensemble | Optimizer | R2 Score | MSE | AARD% |
|---|---|---|---|---|
| AdaBoost | BAT | 0.9778 | 1.0902E+03 | 4.02914 |
| Bagging | BAT | 0.95086 | 2.4825E+03 | 9.03404 |
| AdaBoost | GWO | 0.98105 | 9.1456E+02 | 3.76726 |
| Bagging | GWO | 0.95161 | 2.5266E+03 | 8.78734 |
From the tables above, it can be seen that the GWO-ADA-KNN model performed best for predicting both the mole fraction and the density of the solvent, achieving the highest R2 score and lowest MSE among all the models implemented in this study. The Bagging-based models performed worst: GWO-BAG-KNN gave the lowest R2 score and highest AARD% for mole fraction, while BAT-BAG-KNN gave the lowest R2 score and highest AARD% for density. A comparison of ensemble methods and optimizers in terms of R-squared for both outputs is shown in Figs. 4 and 5.
Fig. 4.
Comparison of Ensemble methods and optimizers by using R-squared as a measure of how well the model predicts mole fraction.
Fig. 5.
Comparison of ensemble methods and optimizers by using R-squared as a measure of how well the model predicts density.
The comparative results reveal that AdaBoost consistently outperforms Bagging in predicting both mole fraction and density. This improvement can be attributed to the adaptive weighting mechanism of AdaBoost, which assigns higher importance to difficult-to-predict samples, thereby reducing bias and enhancing overall accuracy. In contrast, Bagging primarily reduces variance by aggregating bootstrap samples, but it does not systematically address misclassified or poorly predicted points. As a result, Bagging models in this study demonstrated lower R² values and higher AARD% compared to AdaBoost, confirming the advantage of boosting strategies in handling nonlinear and complex relationships in solubility data.
The outcomes of this work align with earlier research on the strength of ensemble and optimization techniques in supercritical solubility modeling. Luo et al.17 reported strong predictive performance using Random Forest for drug solubility in supercritical CO₂, highlighting the importance of ensemble strategies. Similarly, He et al.11 demonstrated that combining thermodynamic modeling with machine learning improved accuracy for pharmaceutical solubility estimation. The present work advances this line of research by introducing a hybrid GWO-ADA-KNN framework, which achieves comparable or superior predictive accuracy while offering greater flexibility for application to diverse solubility datasets.
To provide a fairer assessment of the proposed ensemble–optimization framework, baseline comparisons were also carried out using Random Forest (RF) and Support Vector Regression (SVR), two widely adopted models in solubility prediction. For mole fraction prediction, RF achieved an R² of 0.921 with an AARD% of 22.5, while SVR obtained an R² of 0.895 with an AARD% of 25.8. Similarly, for density prediction, RF and SVR reached R² values of 0.935 and 0.902, respectively. Although these baseline models demonstrated reasonable accuracy, they consistently underperformed compared to the proposed GWO-ADA-KNN framework, which achieved R² scores of 0.967 (mole fraction) and 0.981 (density) with substantially lower error metrics. This comparison highlights that while conventional models are competent, the hybrid AdaBoost–KNN optimized by GWO provides superior generalization and predictive reliability for solubility estimation under supercritical conditions.
Overall, the results suggest that ensemble methods and hyperparameter tuning with metaheuristic algorithms can considerably enhance the performance of KNN regression models for predicting the mole fraction and density from temperature and pressure. For the final analysis, the AdaBoost-KNN model with hyperparameters obtained from GWO was selected as the most accurate and general model.
To further validate the robustness of the proposed GWO-ADA-KNN model and address concerns of potential overfitting, a 5-fold cross-validation was conducted on the full dataset for both outputs. The results confirmed stable and reliable performance across folds, with mole fraction predictions achieving a mean R² of 0.962 (± 0.012), MSE of 7.85 × 10⁻¹³ (± 1.4 × 10⁻¹³), and AARD% of 17.3 (± 1.9), while density predictions obtained a mean R² of 0.978 (± 0.008), MSE of 9.20 × 10² (± 6.1 × 10¹), and AARD% of 3.9 (± 0.6). These consistent values, closely aligned with the single-split test performance, demonstrate that the hybrid ensemble–optimization framework generalizes well to unseen data. In addition, a paired t-test was performed between the fold-level MSE values of the GWO-ADA-KNN model and a strong baseline (Random Forest). The results showed statistically significant improvements for mole fraction (t(4) = − 4.12, p = 0.014) and density (t(4) = − 3.76, p = 0.019), confirming that the superior accuracy of the proposed model is not due to random variation but reflects a genuine performance advantage. Together, the cross-validation stability and statistical significance reinforce the predictive reliability of the hybrid model while minimizing the likelihood of overfitting despite the limited dataset size42.
The actual and estimated values from this model for both outputs are shown in Figs. 6 and 7. Figures 8, 9, 10 and 11 show the individual effects of the input parameters on the outputs, and Figs. 12 and 13 show three-dimensional plots of the outputs as functions of the inputs. The drug solubility showed a direct relationship with pressure and temperature, which is due to the variation of the thermodynamic properties of the solvent with T and P16. When the solvent is compressed at elevated pressure, unlike organic liquid solvents, its solvation power is greatly enhanced and it can dissolve a greater amount of drug. Song et al.16 reported correlation of paracetamol data via machine learning, and our results agree with their outputs for the paracetamol data. Alotaibi et al.42 reported similar observations for paracetamol solubility.
Fig. 6.

Predicted and actual mole fraction values using AdaBoost-KNN model.
Fig. 7.

Predicted and actual density values using AdaBoost-KNN model.
Fig. 8.

Effect of pressure on mole fraction.
Fig. 9.

Effect of temperature on mole fraction.
Fig. 10.

Effect of pressure on density.
Fig. 11.

Effect of temperature on density.
Fig. 12.

3D visualization of final AdaBoost-KNN model for drug solubility.
Fig. 13.

3D visualization of final AdaBoost-KNN model for density.
The superiority of GWO over BAT in hyperparameter tuning is evident from the higher predictive accuracy and lower error metrics achieved with GWO-optimized models. This can be explained by the efficient exploration–exploitation balance inherent in GWO, where the cooperative hunting behavior of wolves ensures robust global search and faster convergence to optimal solutions. On the other hand, BAT relies on echolocation-inspired randomization, which, while effective in certain contexts, may lead to premature convergence or less efficient parameter exploration. The results suggest that GWO’s structured optimization strategy aligns better with the nonlinear dynamics of solubility prediction.
Figures 14 and 15 present SHAP summary analyses of the AdaBoost-KNN model for paracetamol solubility and solvent density, respectively, quantifying how the temperature and pressure inputs contribute to each prediction across the range of operating conditions and reinforcing the model's ability to capture the nonlinear relationships between inputs and outputs.
Fig. 14.
SHAP summary plot for mole fraction.
Fig. 15.
SHAP summary plot for density.
Overall, the comparative analysis of the four machine learning models highlights that the GWO-optimized AdaBoost-KNN approach provides the most accurate predictions for both paracetamol solubility and solvent density, achieving the highest R² scores and lowest error metrics. These results demonstrate the value of combining ensemble learning with metaheuristic optimization to capture the nonlinear effects of temperature and pressure in supercritical CO₂ systems. The findings suggest this framework could significantly reduce experimental efforts in pharmaceutical process design, particularly for drug nanonization. However, the study is limited by the relatively small dataset size and focus on a single drug–solvent system, which may restrict model generalizability to other compounds or process conditions. Expanding the dataset, incorporating additional features such as co-solvent effects, and testing deep learning approaches could further strengthen predictive performance and robustness.
Conclusion
This study demonstrated the effectiveness of neighbor-based ensemble learning models for predicting the mole fraction of paracetamol and the density of supercritical CO₂ under varying temperature and pressure conditions. Using KNN as the base model, in combination with AdaBoost and Bagging ensembles and tuned by metaheuristic optimizers (BAT and GWO), four hybrid frameworks were developed and evaluated. Among them, the GWO-ADA-KNN model consistently achieved the best performance, with R² values of 0.98105 for density and 0.96719 for mole fraction, along with the lowest error metrics. These results highlight the significant advantage of coupling ensemble methods with metaheuristic optimization for enhancing predictive accuracy in pharmaceutical solubility modeling.
The novelty of this work lies in the integration of hybrid AdaBoost-KNN models with GWO optimization, applied for the first time to paracetamol solubility in supercritical CO₂. This hybrid strategy not only improves predictive capability but also offers a flexible framework that can be extended to other poorly soluble drugs in supercritical systems.
Future research could explore experimental validation of the proposed models to confirm their applicability under laboratory and industrial conditions. Additionally, extending this approach to real-time prediction systems and larger, more diverse datasets could further establish its utility in accelerating drug formulation and nanonization processes.
Acknowledgements
The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through Large Research Project under grant number RGP2/316/46.
Author contributions
Kamal Y. Thajudeen: Writing, Methodology, Computation, Supervision. Saad Ali Alshehri: Writing, Methodology, Computation, Resources. Mohamed Rahamathulla: Writing, Validation, Computation, Investigation. Mohammed Muqtader Ahmed: Writing, Methodology, Computation, Software. All authors reviewed the manuscript.
Data availability
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Altalbawy, F. M. A. et al. Universal data-driven models to estimate the solubility of anti-cancer drugs in supercritical carbon dioxide: correlation development and machine learning modeling. J. CO2 Utilization. 92, 103021 (2025). [Google Scholar]
- 2.Chang, K. H. & Chua, H. N. A machine learning model for the classification of illicit drug substances with fourier transform infrared spectroscopy. Microchem. J.212, 113427 (2025). [Google Scholar]
- 3.Huang, L. et al. On QSPR analysis of glaucoma drugs using machine learning with XGBoost and regression models. Comput. Biol. Med.187, 109731 (2025). [DOI] [PubMed] [Google Scholar]
- 4.Huang, L. et al. Preparation of IBU-NTM cocrystals via HME and their exploitation in FDM-3D printing for advanced pharmaceutical applications. J. Drug Deliv. Sci. Technol.108, 106917 (2025). [Google Scholar]
- 5.Pantwalawalkar, J. et al. Pharmaceutical cocrystals: unlocking the potential of challenging drug candidates. J. Drug Deliv. Sci. Technol.104, 106572 (2025). [Google Scholar]
- 6.Sofia, D., Moffa, M. & Trucillo, P. Supercritical particle formation (SPAF) process for the versatile production of ready-to-market drug delivery systems. Chem. Eng. Sci.302, 120918 (2025). [Google Scholar]
- 7.Türk, M. & Bolten, D. Formation of submicron poorly water-soluble drugs by rapid expansion of supercritical solution (RESS): results for Naproxen. J. Supercrit. Fluids. 55 (2), 778–785 (2010). [Google Scholar]
- 8.Liu, Y. et al. Machine learning based modeling for estimation of drug solubility in supercritical fluid by adjusting important parameters. Chemometr. Intell. Lab. Syst.254, 105241 (2024). [Google Scholar]
- 9.Roosta, A., Esmaeilzadeh, F. & Haghbakhsh, R. Predicting the solubility of drugs in supercritical carbon dioxide using machine learning and atomic contribution. Eur. J. Pharm. Biopharm.211, 114720 (2025). [DOI] [PubMed] [Google Scholar]
- 10.Tabebordbar, M. et al. New solubility data of Amoxapine (anti-depressant) drug in supercritical CO2: application of cubic EoSs. J. Drug Deliv. Sci. Technol.101, 106281 (2024). [Google Scholar]
- 11.He, L. et al. Theoretical understanding of pharmaceutics solubility in supercritical CO2: thermodynamic modeling and machine learning study. J. Supercrit. Fluids. 223, 106605 (2025). [Google Scholar]
- 12.Obaidullah, A. J. Thermodynamic and experimental analysis of drug nanoparticles preparation using supercritical thermal processing: solubility of chlorothiazide in different co-solvents. Case Stud. Therm. Eng.49, 103212 (2023). [Google Scholar]
- 13.Zhang, C. et al. Thermodynamic modeling of anticancer drugs solubilities in supercritical CO2 using the PC-SAFT equation of state. Fluid. Phase. Equilibria. 587, 114202 (2025). [Google Scholar]
- 14.Chen, C. Artificial intelligence aided pharmaceutical engineering: development of hybrid machine learning models for prediction of nanomedicine solubility in supercritical solvent. J. Mol. Liq.397, 124127 (2024). [Google Scholar]
- 15.Ghazwani, M. & Yasmin Begum, M. Machine learning aided drug development: assessing improvement of drug efficiency by correlation of solubility in supercritical solvent for nanomedicine preparation. J. Mol. Liq.387, 122511 (2023). [Google Scholar]
- 16.Song, H. et al. Advancing nanomedicine production via green method: modeling and simulation of pharmaceutical solubility at different temperatures and pressures. J. Mol. Liq.411, 125806 (2024). [Google Scholar]
- 17.Luo, B. et al. Experimental validation and modeling study on the drug solubility in supercritical solvent: case study on exemestane drug. J. Mol. Liq.377, 121517 (2023). [Google Scholar]
- 18.Bagheri, H. et al. Supercritical carbon dioxide utilization in drug delivery: experimental study and modeling of Paracetamol solubility. Eur. J. Pharm. Sci.177, 106273 (2022). [DOI] [PubMed] [Google Scholar]
- 19.Liu, F. T., Ting, K. M. & Zhou, Z. H. Isolation forest. In Eighth IEEE International Conference on Data Mining (ICDM) (IEEE, 2008).
- 20.Mirjalili, S., Mirjalili, S. M. & Lewis, A. Grey Wolf optimizer. Adv. Eng. Softw.69, 46–61 (2014). [Google Scholar]
- 21.Dereli, S. A new modified grey Wolf optimization algorithm proposal for a fundamental engineering problem in robotics. Neural Comput. Appl.33 (21), 14119–14131 (2021). [Google Scholar]
- 22.Li, Z. et al. A novel discrete grey wolf optimizer for solving the bounded knapsack problem. In Computational Intelligence and Intelligent Systems: 10th International Symposium, ISICA 2018, Jiujiang, China, October 13–14, 2018, Revised Selected Papers (Springer, 2019).
- 23.Taha, A. M. & Tang, A. Y. Bat algorithm for rough set attribute reduction. J. Theoretical Appl. Inform. Technol.51 (1), 1–8 (2013). [Google Scholar]
- 24.Yang, X. S. Bat algorithm: literature review and applications. arXiv preprint arXiv:1308.3900, (2013).
- 25.Yang, X. S. & Gandomi, A. H. Bat algorithm: a novel approach for global engineering optimization. Engineering computations, (2012).
- 26.Fister, I. et al. Bat algorithm: Recent advances. in 2014 IEEE 15th International symposium on computational intelligence and informatics (CINTI). IEEE. (2014).
- 27.Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci.55 (1), 119–139 (1997). [Google Scholar]
- 28.Buitinck, L. et al. API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238, (2013).
- 29.Lemaître, G., Nogueira, F. & Aridas, C. K. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res.18 (1), 559–563 (2017). [Google Scholar]
- 30.Drucker, H. Improving Regressors Using Boosting techniques. In ICML (Citeseer, 1997).
- 31. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. Springer Series in Statistics (Springer, 2001).
- 32. Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
- 33. Hastie, T. et al. Multi-class AdaBoost. Stat. Interface 2 (3), 349–360 (2009).
- 34. Berk, R. A. An introduction to ensemble methods for data analysis. Sociol. Methods Res. 34 (3), 263–295 (2006).
- 35. Ouyang, Z., Ravier, P. & Jabloun, M. STL decomposition of time series can benefit forecasting done by statistical methods but not by machine learning ones. Eng. Proc. 5 (1) (2021).
- 36. Chen, K. et al. Bagging based ensemble learning approaches for modeling the emission of PCDD/Fs from municipal solid waste incinerators. Chemosphere 274, 129802 (2021).
- 37. Seyghaly, R. et al. Interference recognition for fog enabled IoT architecture using a novel tree-based method. In 2022 IEEE International Conference on Omni-layer Intelligent Systems (COINS) (IEEE Computer Society, 2022).
- 38. Song, Y. et al. An efficient instance selection algorithm for k nearest neighbor regression. Neurocomputing 251, 26–34 (2017).
- 39. Deng, B. Machine learning on density and elastic property of oxide glasses driven by large dataset. J. Non-Cryst. Solids 529, 119768 (2020).
- 40. Cheng, J. C. & Ma, L. J. A non-linear case-based reasoning approach for retrieval of similar cases and selection of target credits in LEED projects. Build. Environ. 93, 349–361 (2015).
- 41. Devroye, L. et al. On the strong universal consistency of nearest neighbor regression function estimates. Ann. Stat. 1371–1385 (1994).
- 42. Alotaibi, H. F. et al. Machine learning estimation and optimization for evaluation of pharmaceutical solubility in supercritical carbon dioxide for improvement of drug efficacy. Sci. Rep. 15 (1) (2025). 10.1038/s41598-025-19873-z
- 43. Alsaab, H. O. & Althobaiti, Y. S. Intelligence modeling of solubility of raloxifene and density of solvent for green supercritical processing of medicines for enhanced solubility. Sci. Rep. 15 (1) (2025). 10.1038/s41598-025-18223-3
Data Availability Statement
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.