Abstract
This research explores the application of fatty acid ethyl esters (FAEEs) in the pharmaceutical industry, motivated by their biodegradable, renewable nature and their versatility as excipients and drug delivery agents. It develops predictive models using a range of machine learning methods to estimate the speed of sound in FAEEs under varying temperature, pressure, molar mass, and elemental composition. Experimental data from earlier research were used to train the models. Among the models developed, the CNN was identified as the most accurate for predicting the speed of sound, a conclusion drawn from extensive statistical evaluation and visualization. The CNN achieved an R² of 0.9996, with low average absolute relative error and mean squared error, outperforming the other tested algorithms. The dataset of 371 experimental data points was validated with the Leverage algorithm to ensure reliability. Further analysis showed that pressure is the most influential factor, followed by temperature, as confirmed by sensitivity and SHAP analyses. The proposed framework provides a reliable, cost-effective alternative to experimental methods for estimating the speed of sound in FAEEs under various physical conditions.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-16095-1.
Keywords: Pharmaceutical industry, Machine learning, Leverage algorithm, Fatty acid ethyl esters, Speed of sound prediction, SHAP analysis
Subject terms: Chemistry, Engineering, Mathematics and computing
Introduction
Fatty acid ethyl esters (FAEEs) have received growing attention in the pharmaceutical industry due to their favorable physicochemical and biocompatible properties1. They play a crucial role in improving solubility, enhancing bioavailability, and enabling controlled drug delivery in pharmaceutical formulations2. Moreover, their biodegradable and non-toxic nature makes them suitable as carriers for active pharmaceutical ingredients (APIs), contributing to safer and more sustainable therapeutic systems3–5. FAEEs are also involved in the synthesis of bioactive compounds and pharmaceutical excipients, aligning with the increasing demand for eco-friendly and renewable solutions in pharmaceutical research and development6–9.
Several experimental studies have explored the speed of sound and related thermophysical properties of FAEEs and their derivatives under various temperature and pressure conditions. Ndiaye et al. measured the speed of sound in ethyl caprate and methyl caprate using the pulse-echo technique at pressures up to 210 MPa and temperatures between 283.15 and 403.15 K; they also performed density measurements up to 100 MPa and derived isothermal and isentropic compressibilities10. Habrioux et al. applied a dual acoustic sensor pulse-echo method to study the speed of sound in ethyl laurate and methyl laurate at pressures up to 200 MPa and temperatures from 293.15 to 353.15 K, using the Newton-Laplace equation to evaluate density and compressibility11. Another study by Ndiaye et al. focused on ethyl myristate, methyl palmitate, and methyl myristate, examining their speed of sound and proposing an equation of state based on the experimental data12. Tat et al. investigated the speed of sound and isentropic bulk modulus of biodiesel and its pure esters at temperatures between 20 and 100 °C and pressures up to 34.5 MPa; their findings suggested that variations in the speed of sound may affect fuel injection timing and NOx emissions13. Zhang et al. used light scattering to measure the thermal diffusivity and speed of sound of ethyl myristate across a broad range of temperatures (303.15–560.15 K) and pressures (up to 10 MPa), providing reliable empirical correlations for these properties14.
Despite the benefits of FAEEs, accurately predicting their speed of sound remains a significant challenge owing to the intricate interplay of factors such as temperature, pressure, molar mass, and elemental composition. The speed of sound is a critical parameter influencing fuel performance, combustion efficiency, and transport properties, making accurate prediction essential for industrial applications. Experimental methods for measuring the speed of sound are reliable but often resource-intensive and time-consuming, highlighting the need for efficient computational models that deliver rapid and precise predictions. This study addresses this challenge by applying advanced machine learning algorithms, including Ensemble Learning, Random Forest, Ridge Regression, Decision Tree, AdaBoost, Linear Regression, KNN, CNN, SVM, and MLP-ANN models, to predict the speed of sound of FAEEs. The study employs a refined dataset of 371 data points. The Leverage algorithm was used for anomaly detection to ensure data reliability, and a feature-importance analysis was conducted to assess the effect of each input variable on the speed of sound of FAEEs. Various performance indicators and visualization techniques were employed to evaluate the accuracy of the predictive algorithms. Furthermore, SHapley Additive exPlanations (SHAP) analysis was carried out to analyze the influence of key features on the speed-of-sound predictions, offering valuable insights into the roles of elemental composition, molar mass, temperature, and pressure. The overall methodology is depicted in Fig. 1.
Fig. 1.
Process for identifying the highest-performing data-driven model.
Origin of machine learning techniques
Decision tree (DT)
DT is a widely used supervised learning algorithm for both classification and prediction tasks. It recursively divides the data based on feature values, creating a tree-shaped structure in which internal nodes represent decision criteria and leaves hold final predictions. This approach is valued for its straightforwardness, making it especially useful when understanding the decision process is important15,16.
Decision Trees can handle categorical and numerical data, require minimal preprocessing, and can model nonlinear relationships. However, they tend to overfit, particularly when grown too deep. This can be mitigated through techniques such as pruning or by using methods that combine multiple models, such as Random Forest and Gradient Boosting. Despite not always attaining the highest accuracy, Decision Trees remain popular in domains like medicine, finance, and marketing owing to their transparency and ease of use17,18. Figure 2 shows the structure of a typical Decision Tree model, emphasizing its simplicity and interpretability.
Fig. 2.
The framework of decision tree.
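For illustration, the following minimal sketch (not the authors' implementation) fits a regression tree to synthetic stand-ins for the inputs used in this study; the depth limit and the data are assumptions chosen only to show how limiting tree depth controls the overfitting discussed above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))            # toy stand-ins for T, P, molar mass, O/C/H content
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]  # synthetic speed-of-sound-like target (m/s)

# A shallow tree: limiting max_depth is one way to curb overfitting.
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
print(tree.predict(X[:3]))                # predictions for the first three points
```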
Adaptive boosting
AdaBoost is a well-known ensemble learning method that combines multiple weak learners into a robust predictive model. It works by iteratively adjusting the weights of training samples according to the performance of previous learners, placing more emphasis on misclassified examples19. This approach improves both accuracy and robustness by exploiting the strengths of each model while mitigating their weaknesses. AdaBoost is especially proficient at limiting overfitting and handling complex classification tasks20,21. Figure 3 illustrates the general structure of the AdaBoost method, highlighting its ensemble-based nature and the integration of multiple weak learners.
Fig. 3.
The structure of AdaBoost.
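A minimal sketch of AdaBoost regression follows, reusing the synthetic data pattern from the Decision Tree sketch; the learning rate is an assumption, while 17 estimators matches the value this study later reports as optimal. By default, scikit-learn boosts shallow (depth-3) regression trees as the weak learners.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]

# Each boosting round re-weights samples so later trees focus on poorly fit points.
ada = AdaBoostRegressor(n_estimators=17, learning_rate=0.5, random_state=0).fit(X, y)
print(ada.predict(X[:3]))
```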
Random forest (RF)
RF is a widely used ensemble learning technique that constructs numerous decision trees on random subsets of the data and features to enhance predictive performance. For classification it employs majority voting, whereas for regression it averages the predictions of all trees. This approach improves accuracy and generalization while reducing the likelihood of overfitting. By exploiting the diversity of the individual trees, Random Forest effectively captures intricate patterns and delivers robust, reliable results across various tasks22,23.
Figure 4 illustrates the overall structure of the Random Forest method, emphasizing its ensemble nature and the combination of numerous decision trees. This graphical depiction offers insight into the workings of Random Forest, showcasing its capacity to generate accurate and robust forecasts by capitalizing on the strengths of multiple individual learners. The ensemble approach enables Random Forest to successfully identify patterns and connections within the data, ultimately improving overall forecasting effectiveness24.
Fig. 4.
Random forest structure.
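The sketch below (illustrative hyperparameters, synthetic data) shows the bootstrap-and-average idea; a max_depth of 10 mirrors the value tuned later in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]

# Each of the 200 trees sees a bootstrap sample; the forest prediction is their mean.
rf = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=0).fit(X, y)
print(rf.predict(X[:3]))
```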
Ensemble learning
Ensemble learning is a widely employed method in statistical analysis and machine learning that integrates multiple models to enhance the precision and reliability of predictions25,26. In this approach, several base learners (often of different types) are trained separately, and their individual predictions are then combined into a final decision. This strategy leverages the diversity and complementary strengths of the individual models to reduce generalization error and increase overall performance.
The most common ensemble strategies are hard voting and soft voting. In hard voting, each model casts a vote for a class label, and the label with the most votes is selected as the final prediction. In soft voting, the predicted probabilities for each class from all models are averaged (possibly weighted), and the class with the highest average probability is chosen. Soft voting tends to be more effective when the individual models are well calibrated.
Ensemble methods can also assign different weights to the predictions of base learners, or adopt sequential strategies in which the output of one model influences the training of the next (e.g., in boosting methods)27,28. These techniques can significantly enhance predictive performance, particularly when handling complex, noisy, or high-dimensional data29,30.
Figure 5 depicts the typical architecture of an ensemble learning algorithm, highlighting its multi-model structure and collaborative prediction mechanism. This visual representation emphasizes the adaptive and modular nature of ensemble learning, showcasing its strength in tackling complex prediction problems by integrating diverse models31.
Fig. 5.
The structure of ensemble learning.
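For a regression target such as the speed of sound, the counterpart of (weighted) soft voting is prediction averaging; the sketch below uses scikit-learn's VotingRegressor with illustrative base learners and weights, not the configuration used in this paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]

# Weighted average of heterogeneous base learners (regression analogue of soft voting).
ens = VotingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=0)),
                ("ridge", Ridge()),
                ("knn", KNeighborsRegressor(n_neighbors=3))],
    weights=[2.0, 1.0, 1.0],
).fit(X, y)
print(ens.predict(X[:3]))
```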
CNNs
CNNs are a deep learning architecture originally developed for processing structured, grid-like data such as images. They use convolutional layers to extract localized patterns or features, such as contours and textures, from the input, making them particularly effective for tasks such as image classification, object detection, and medical image analysis32.
Although CNNs are traditionally used for image data, recent studies have demonstrated their effectiveness in non-image domains such as tabular datasets, particularly when patterns or interactions among input features can be spatially structured. In this study, the tabular dataset containing temperature, pressure, molar mass, and elemental composition was reshaped into a 2D matrix format that mimics the spatial locality of image data. This allowed the convolutional filters to extract interaction patterns between features and learn more complex feature hierarchies compared to traditional fully connected layers33.
By applying CNNs to this structured tabular input, the model was able to:
Capture localized dependencies among physical variables (e.g., interaction between pressure and temperature),
Reduce overfitting through parameter sharing and local connectivity,
Ultimately improve generalization performance in predicting the speed of sound.
This approach reflects a growing trend in machine learning in which CNNs are repurposed beyond images, especially in scientific and engineering applications with spatially interpretable feature sets. Figure 6 illustrates a simplified architecture of the CNN model applied in this research34.
Fig. 6.
The structure of Convolutional Neural Networks.
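A minimal Keras sketch of the reshaping idea follows; the 2×3 grid, filter count, and layer sizes are assumptions for illustration, not the paper's exact network.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6)).astype("float32")      # six tabular features
y = (1200 + 300 * X[:, 1] - 150 * X[:, 0]).astype("float32")

X_img = X.reshape(-1, 2, 3, 1)                        # reshape features into a 2x3 "image"

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (2, 2), activation="relu", padding="same",
                           input_shape=(2, 3, 1)),    # filters scan local feature groups
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),                         # predicted speed of sound
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_img, y, epochs=50, verbose=0)
```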
KNN
The KNN method is a simple yet effective supervised learning approach used for regression and classification tasks. It operates on the concept of similarity, determining the value or class of a point from the most common category, or the mean value, of its k nearest neighbors in feature space, as illustrated in Fig. 735,36. The parameter k, the number of nearest neighbors, is specified by the user.
Fig. 7.
The structure of K-nearest neighbors.
A key challenge with KNN is its computational cost on large datasets, since it requires computing distances between the query point and all other points in the dataset. Despite this, KNN remains popular in domains such as recommendation systems and medical diagnostics because of its simplicity and its effectiveness in capturing local patterns in the data37,38.
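The sketch below (synthetic data) averages the three nearest neighbours, matching the k this study later finds optimal.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]

# Prediction = mean target of the 3 closest points in feature space.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(knn.predict(X[:3]))
```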
SVM
SVM is an effective machine learning technique used for classification and regression tasks, grounded in statistical learning theory and nonlinear optimization. SVM operates by locating a hyperplane that separates two or more classes of data. The key idea is to find the decision boundary that maximizes the margin between data points of different classes (margin maximization), improving the model's ability to generalize.
When the data cannot be separated linearly, SVM employs kernel functions, such as Gaussian or polynomial kernels, to map the data into a higher-dimensional space where separation is easier. A significant advantage of SVM is its capacity to handle high-dimensional data, making it suitable for applications such as image recognition and natural language processing39,40.
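A minimal sketch of kernel-based support vector regression follows; the RBF kernel and the C and epsilon values are illustrative assumptions, and features are standardized first because SVR is distance-based.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]

# The RBF kernel implicitly maps inputs to a higher-dimensional space.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.1)).fit(X, y)
print(svr.predict(X[:3]))
```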
MLP-ANN
MLP is a popular ANN architecture commonly used in machine learning applications. An MLP typically comprises an input layer, one or more hidden layers, and an output layer, all composed of artificial neurons41,42.
In the MLP-ANN algorithm, input data is first fed to the input layer; the neurons in each hidden layer then receive and process the outputs of the previous layer, producing new outputs. This sequential processing continues until the data reaches the output layer, which delivers the final prediction43,44. Figure 8 depicts the standard design of an MLP-ANN, showing the flow of information through the interconnected layers of neurons. This visual representation offers useful insight into the workings of the MLP-ANN algorithm, emphasizing its ability to capture intricate relationships and trends in the input data.
Fig. 8.
The structure of MLP-ANN method.
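A compact sketch with two hidden layers follows; the layer widths and iteration budget are illustrative assumptions, not the tuned architecture.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]

# Two hidden layers (64 and 32 neurons) pass signals forward to a single output.
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                   max_iter=2000, random_state=0).fit(X, y)
print(mlp.predict(X[:3]))
```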
Ridge regression
Ridge Regression is a regularization technique applied to linear regression to avoid overfitting, especially when there are more variables than data points or when multicollinearity exists. Unlike ordinary least squares (OLS) regression, which minimizes the sum of squared residuals, Ridge Regression modifies the cost function by adding a penalty proportional to the squared magnitude of the coefficients. Whereas Lasso Regression can eliminate features by setting their coefficients to zero, Ridge Regression shrinks the coefficients toward zero without fully discarding them, ensuring that all features remain in the model, but with reduced weights45,46.
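The contrast with Lasso can be seen directly from the fitted coefficients; in this sketch the penalty strengths are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can zero out coefficients
print(ridge.coef_)                   # every feature retained with reduced weight
print(lasso.coef_)                   # some entries driven exactly to zero
```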
Linear regression
Linear Regression is a fundamental machine learning algorithm used to predict a dependent variable from one or more independent features. It represents the relationship with a linear equation optimized to minimize prediction error. The method is valued for its simplicity and interpretability, but it presumes a linear relationship, which may not always hold in practice, and it is sensitive to outliers and multicollinearity. Despite these limitations, Linear Regression remains widely used owing to its ease of implementation and broad applicability47,48.
Data collection and assessment metrics
Data gathering overview
The data for building the machine learning models in this study were sourced from previous research that performed comprehensive experimental analyses of the speed-of-sound characteristics of several FAEEs under varied laboratory conditions. The dataset comprises 371 data points covering the essential input variables of temperature, pressure, molar mass, and elemental composition (oxygen, carbon, and hydrogen content), derived from studies conducted by various researchers10–14. The FAEEs studied include ethyl caprate, ethyl linoleate, ethyl laurate, ethyl myristate, and ethyl stearate.
The dataset consists of speed of sound as the primary response variable, with temperature, pressure, molar mass, and elemental composition (oxygen, carbon, and hydrogen content) serving as the main input variables. To achieve a clearer comprehension of how these factors impact the speed of sound in FAEEs and to analyze their distribution, scatter matrix diagrams are shown in Fig. 9. These graphics offer an in-depth viewpoint of the dataset, revealing patterns, relationships, and possible outliers. Analyzing these aspects is crucial for grasping the data’s underlying structure and ensuring the creation of a precise forecasting model.
Fig. 9.
Scatter matrix visualization illustrating connections among variables.
Furthermore, Fig. 10 shows the frequency distribution and cumulative distribution of each variable, offering a detailed representation of their distributions. These visualizations contribute to a deeper understanding of the data, enabling a thorough analysis of the variables' characteristics and their potential influence on the speed of sound in FAEEs. This knowledge aids in refining the predictive model and interpreting the significance of each input variable.
Fig. 10.
Representation of frequency and cumulative distribution for data.
K-fold cross-validation is a technique used to improve the reliability of machine learning model evaluation by systematically employing the complete dataset over 'K' cycles. The dataset is divided into 'K' equal parts (folds); each part serves once as the validation set while the remaining 'K−1' folds are used for training. The results of the 'K' validation cycles are then merged into a single evaluation, reducing the bias associated with an arbitrary split of the data into training and validation sets49–51. Furthermore, this technique significantly lowers the risk of overfitting. Figure 11 depicts the k-fold cross-validation procedure. In this research, 5-fold cross-validation was employed during the training of each machine learning algorithm to improve predictive effectiveness.
Fig. 11.
Schematic depiction of k-fold cross-validation algorithm.
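A sketch of the 5-fold scheme on synthetic data follows; the Random Forest stand-in and R² scoring are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # each fold validates once
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=cv, scoring="r2")
print(scores.mean())                                   # R² averaged over the 5 folds
```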
Indices for evaluating models
To evaluate and compare the forecasting performance of each developed machine learning method, several key metrics are computed for each algorithm, including the coefficient of determination (R²), mean squared error (MSE), root mean squared error (RMSE), and average absolute relative error (AARE%)52–56:
$$R^{2} = 1 - \frac{\sum_{i=1}^{N}\left(y_{i}^{\exp} - y_{i}^{\mathrm{pred}}\right)^{2}}{\sum_{i=1}^{N}\left(y_{i}^{\exp} - \bar{y}^{\exp}\right)^{2}} \tag{1}$$

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{i}^{\exp} - y_{i}^{\mathrm{pred}}\right)^{2} \tag{2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}^{\exp} - y_{i}^{\mathrm{pred}}\right)^{2}} \tag{3}$$

$$\mathrm{AARE\%} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_{i}^{\exp} - y_{i}^{\mathrm{pred}}}{y_{i}^{\exp}}\right| \tag{4}$$
In these formulas, the subscript 'i' denotes the index of a particular data point in the dataset, the superscripts 'pred' and 'exp' refer to the predicted and experimental values associated with each point, and 'N' denotes the total number of points in the dataset57.
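For reference, a direct NumPy transcription of Eqs. (1)–(4) might look as follows (a sketch; the RMSE definition of Eq. (3) is a standard assumption).

```python
import numpy as np

def metrics(y_exp, y_pred):
    """Return R2, MSE, RMSE and AARE% as defined in Eqs. (1)-(4)."""
    res = y_exp - y_pred
    mse = np.mean(res ** 2)
    r2 = 1 - np.sum(res ** 2) / np.sum((y_exp - y_exp.mean()) ** 2)
    aare = 100 * np.mean(np.abs(res / y_exp))
    return r2, mse, np.sqrt(mse), aare

y_exp = np.array([1350.0, 1402.5, 1298.7])   # placeholder experimental values (m/s)
y_pred = np.array([1348.9, 1405.1, 1300.2])  # placeholder model predictions
print(metrics(y_exp, y_pred))
```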
In this study, the input variables used for developing the predictive models are temperature, pressure, molar mass, and elemental composition (oxygen, carbon, and hydrogen content), with the speed of sound of FAEEs as the dependent variable. To guarantee a thorough evaluation of model performance, the dataset was randomly divided into three distinct subsets: training (90% of the data), validation, and testing (10% of the data).
To reduce the impact of differences in variable scale, the input and output factors are normalized using the following formula prior to model development. This procedure reduces disparities in scale and magnitude among variables, leading to more accurate and reliable predictions. Through standardization, the models can more efficiently recognize patterns and correlations, diminishing the undue influence of particular parameters or groups. The normalization formula is:
$$n_{\mathrm{norm}} = \frac{n - n_{\min}}{n_{\max} - n_{\min}} \tag{5}$$
In this formula, n represents the raw, non-normalized value, nmax the greatest value in the dataset, nmin the lowest value, and nnorm the normalized value.
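A one-line, column-wise implementation of Eq. (5) is sketched below (assuming the simple [0, 1] min-max form).

```python
import numpy as np

def min_max(a):
    # Eq. (5) applied column-wise: each variable is mapped onto [0, 1]
    return (a - a.min(axis=0)) / (a.max(axis=0) - a.min(axis=0))

X = np.array([[300.0, 0.1], [350.0, 100.0], [400.0, 210.0]])  # e.g. T (K), P (MPa)
print(min_max(X))
```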
SHAP analysis for feature importance
To better understand the contribution of each input parameter to the model predictions, we employed SHAP, a unified framework for interpreting machine learning models. SHAP is grounded in cooperative game theory and assigns each feature an importance value reflecting its contribution to a particular prediction. This enables both local and global understanding of the model's behavior, revealing which factors most influence the output across the dataset. In this study, SHAP was applied to the Random Forest model to quantify the impact of temperature, pressure, molar mass, and elemental composition on the predicted speed of sound.
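A sketch of how such an analysis is typically run (assuming the open-source shap package; feature names and data are placeholders, not the study's dataset):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]
rf = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(rf)        # exact Shapley values for tree ensembles
shap_values = explainer.shap_values(X)    # per-sample, per-feature contributions
shap.summary_plot(shap_values, X,
                  feature_names=["T", "P", "M", "O", "C", "H"])
```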
Results and analysis
Identifying outliers
To detect outliers, the Leverage technique is used, which combines residual analysis with the Hat matrix (H). This matrix is computed via the formula58:
$$H = X\left(X^{T}X\right)^{-1}X^{T} \tag{6}$$
In this equation, X denotes the data matrix, whose N rows correspond to the data points and whose k columns correspond to the input parameters; XT is the transpose of X. The leverage threshold (H*), used to identify potential outliers, is computed as:
$$H^{*} = \frac{3\,(k+1)}{N} \tag{7}$$
The Leverage approach flags potential outliers by comparing their leverage values against the threshold H*. A Williams plot is then used to demarcate regions of reliable and suspect data, aiding the identification of outlier points. Figure 12 shows how the leverage and suspicion thresholds establish the reliable region, supporting the assessment of data integrity and highlighting the possible impact of outliers on subsequent evaluations.
Fig. 12.
Recognition of questionable speed-of-sound data through a proven outlier detection technique.
While most speed-of-sound measurements are deemed trustworthy, some points (highlighted in red) are flagged as potentially doubtful. This visual depiction aids in evaluating data quality and highlights the potential influence of anomalous data points on subsequent analyses. Importantly, to ensure the development of broadly applicable models, every point was retained during the model development phase.
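The sketch below computes the leverage of each point and the threshold of Eqs. (6)–(7) on synthetic data shaped like this study's dataset (371 points, 6 inputs).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(371, 6))                 # N = 371 points, k = 6 input parameters

H = X @ np.linalg.inv(X.T @ X) @ X.T           # Eq. (6): hat matrix
h = np.diag(H)                                 # leverage of each data point
H_star = 3 * (X.shape[1] + 1) / X.shape[0]     # Eq. (7): leverage threshold
print(np.where(h > H_star)[0])                 # candidate outliers for the Williams plot
```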
Sensitivity analysis
This part of the study examines the impact of temperature, pressure, molar mass, and elemental composition on the speed of sound of FAEEs and assesses the importance of each parameter. The importance of each factor is quantified using the relevance factor, defined by the following formula59:
$$r_{j} = \frac{\sum_{i=1}^{N}\left(X_{j,i} - \bar{X}_{j}\right)\left(Y_{i} - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{N}\left(X_{j,i} - \bar{X}_{j}\right)^{2}\;\sum_{i=1}^{N}\left(Y_{i} - \bar{Y}\right)^{2}}} \tag{8}$$
In the above equation, the subscript 'j' denotes the specific input parameter being analyzed. The relevance factor ranges from −1 to +1: values closer to +1 indicate a stronger direct (positive) correlation between the input and output parameters, whereas values closer to −1 indicate a stronger inverse relationship60–63.
As demonstrated in Fig. 13, the correlation matrix illustrates the connections between the input parameters (temperature, pressure, molar mass, and elemental composition: oxygen, carbon, and hydrogen content) and the output parameter (speed of sound of FAEEs). The correlations of molar mass and elemental composition with the speed of sound are relatively weak, ranging from −0.39 to 0.42. Pressure shows a strong positive correlation (0.84) with the speed of sound, implying that increasing pressure significantly raises the speed of sound of FAEEs. Temperature, on the other hand, shows a moderate negative correlation (−0.66), indicating a noticeable inverse effect on the speed of sound of FAEEs. These results emphasize the dominant influence of pressure and temperature, while the contributions of molar mass and elemental composition remain relatively minor.
Fig. 13.
The computed relevance measure for every input parameter.
Assessment of models
This section outlines the procedure for optimizing the hyperparameters of the various machine learning models. Figure 14a demonstrates that a maximum depth of 15 minimizes the MSE for the Decision Tree algorithm, while Fig. 14b identifies 17 estimators as optimal for AdaBoost. Figure 14c illustrates that a maximum depth of 10 minimizes the Mean Squared Error (MSE) for the Random Forest model. For the K-Nearest Neighbors (KNN) method, Fig. 14d shows that 3 neighbors yield the best results. Additionally, Fig. 14e and f show the loss trends for the Convolutional Neural Network (versus epoch) and Support Vector Regression (versus the C hyperparameter), respectively, highlighting their convergence to optimal performance. These findings emphasize the significance of hyperparameter tuning for model accuracy and efficiency; a minimal sketch of such a sweep follows Fig. 14.
Fig. 14.
The optimal hyperparameter value for various machine learning methods.
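The sketch below reproduces the depth sweep of Fig. 14a in miniature with scikit-learn's GridSearchCV; the search range, scoring, and data are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 6))
y = 1200 + 300 * X[:, 1] - 150 * X[:, 0]

grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    {"max_depth": list(range(2, 21))},
                    scoring="neg_mean_squared_error", cv=5).fit(X, y)
print(grid.best_params_)            # depth minimising cross-validated MSE
```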
Table 1 reports the performance metrics for the full set of data-driven algorithms: DT, RF, KNN, AdaBoost, CNN, Ensemble Learning, SVM, Ridge Regression, Linear Regression, and MLP-ANN. The metrics include MSE, R², and AARE%. Figure 15 provides a visual depiction of these criteria for the testing phase, enabling a more thorough comparison.
Table 1.
The values of the evaluation indices obtained for all developed algorithms concerning all segments.
| Model | R² (Train) | R² (Test) | R² (Total) | MSE (Train) | MSE (Test) | MSE (Total) | AARE% (Train) | AARE% (Test) | AARE% (Total) |
|---|---|---|---|---|---|---|---|---|---|
| Decision tree | 1 | 0.9877749 | 0.998612 | 0 | 874.94053 | 89.61655 | 0 | 1.962774 | 0.201039 |
| AdaBoost | 0.999816 | 0.9873096 | 0.998396 | 11.68284 | 908.2457 | 103.5141 | 0.049028 | 2.0167033 | 0.250568 |
| Random forest | 0.998877 | 0.9938321 | 0.998306 | 71.45457 | 441.43629 | 109.3503 | 0.522292 | 1.2746318 | 0.599351 |
| KNN | 0.995793 | 0.9909112 | 0.995245 | 267.7302 | 650.47875 | 306.9336 | 0.758256 | 1.5837447 | 0.842808 |
| Ensemble learning | 0.999478 | 0.9941084 | 0.998869 | 33.19793 | 421.6589 | 72.98639 | 0.311831 | 1.2146281 | 0.404301 |
| CNN | 0.999489 | 0.9995753 | 0.999499 | 32.53207 | 30.393282 | 32.31301 | 0.339601 | 0.3274238 | 0.338354 |
| SVR | 0.995183 | 0.9980559 | 0.995517 | 306.5239 | 139.13604 | 289.3791 | 0.802766 | 0.6995348 | 0.792193 |
| MLP-ANN | 0.98711 | 0.9895765 | 0.987411 | 820.2769 | 746.00069 | 812.6691 | 1.660529 | 1.648544 | 1.659302 |
| Linear regression | 0.984623 | 0.9853604 | 0.984731 | 978.5468 | 1047.7514 | 985.6351 | 1.821798 | 1.9798891 | 1.837991 |
| Ridge regression | 0.984623 | 0.9853636 | 0.984732 | 978.5656 | 1047.5161 | 985.6279 | 1.821681 | 1.9782273 | 1.837716 |
Fig. 15.

R2, MSE, and AARE% for all models in this research (testing stage).
Based on the test results, the CNN method outperforms the other algorithms, exhibiting the lowest MSE (30.393282) and the highest R² (0.9995753), indicating its superior predictive accuracy. Additionally, it achieves the lowest AARE% (0.3274238), further confirming its robustness and reliability. In contrast, Linear Regression appears to be the least accurate in this study, yielding the highest MSE (1047.7514) and AARE% (1.9798891) values, along with a relatively low coefficient of determination (R² = 0.9853604).
The study uses several graphical evaluations to gauge the predictive accuracy of the trained methods. Initial assessments involve cross-plots for each proposed algorithm, as illustrated in Fig. 16. The CNN algorithm demonstrates high accuracy, evidenced by the tight clustering of points around the unit-slope line and regression equations closely aligned with the bisector. Figure 17 further visualizes the distribution of relative deviations for each predictive algorithm; points close to the y = 0 line indicate higher forecasting precision. This analysis identifies the CNN model as the most proficient predictive tool among those assessed.
Fig. 16.
Crossplots comparing estimated and actual values for all data in different method.
Fig. 17.
Relative error percent for all segments (training, testing and validation) for all constructed algorithms in this study.
Figure 18 shows the SHAP values of the input features (temperature, pressure, molar mass, and elemental composition) and their relative importance, as assessed by the Random Forest model, with respect to the speed-of-sound output. The plot ranks the variables by their mean absolute SHAP values, representing the average effect of each parameter on the model's predictions. Pressure stands out as the most impactful variable, with the greatest SHAP score, underscoring its dominant role in determining the speed of sound of FAEEs. Temperature is the second most significant factor, with a moderate SHAP score, indicating a noticeable but less pronounced effect. Elemental composition and molar mass have the least impact among the variables considered, as reflected by their lower SHAP scores. This examination underscores the relative importance of each factor, with pressure serving as the key element.
Fig. 18.
SHAP feature importance.
Figure 19 presents the SHAP assessment for the model forecasting the speed of sound of FAEEs, with molar mass, pressure, temperature, and elemental composition as input variables. The plot shows the effect of each input on the model output: positive SHAP values indicate an increase in the speed of sound, while negative values represent a reduction. Pressure has the largest influence, as evidenced by its wide range of SHAP values, whereas temperature exhibits a moderate influence with a narrower range. This analysis highlights the relative importance of each variable, with pressure the primary factor, followed by temperature, elemental composition, and molar mass. These results provide valuable guidance for future research and real-world applications of speed-of-sound modeling.
Fig. 19.
SHAP feature contributions.
Future work
This research developed a comprehensive data-driven approach to predict the speed of sound in FAEEs, using molecular and physical parameters, namely temperature, pressure, molar mass, and elemental composition, as inputs. Multiple machine learning techniques were employed, with a focus on constructing reliable predictive models. The workflow involved careful assembly and validation of a dataset sourced from prior experimental studies, alongside application of the Leverage method for outlier detection to ensure data integrity. Among the tested algorithms, the convolutional neural network (CNN) demonstrated superior predictive performance, achieving high accuracy and low error metrics. To enhance interpretability, SHAP analysis was used to identify the key factors affecting the forecasts, revealing pressure and temperature as the dominant variables. By integrating domain-specific physical descriptors with rigorous model validation and explainability tools, this framework provides a precise and affordable substitute for conventional experimental procedures for estimating the speed of sound in FAEEs across varied conditions. Future investigations are encouraged to extend this framework to a wide array of applications. Potential fields include aerospace engineering, where optimizing flight trajectories and planning adaptive maneuvers for satellite systems can benefit from such modeling64,65; cutting-edge materials science, where the approach could help analyze phase transformations, solidification processes, and advanced ceramic composites66–69; and biomedical and pharmaceutical arenas, where the method could support stem cell-based therapies and targeted drug delivery systems for inflammatory conditions70,71. Moreover, emerging areas such as wireless communication infrastructure and next-generation 5G antenna design stand to gain from this framework72. Civil engineering applications, such as acoustic insulation design, structural stress evaluation of cable domes, and ecological monitoring of marine environments, are additional promising targets73. Adopting this versatile methodology across disciplines could accelerate knowledge transfer, minimize reliance on resource-intensive experiments, and enable swift, accurate predictive modeling of complex phenomena.
Conclusions
This research created sophisticated data-driven models using various machine learning algorithms, including DT, RF, KNN, AdaBoost, CNN, Ensemble Learning, SVM, Ridge Regression, Linear Regression, and MLP-ANN, to predict the speed of sound of FAEEs. The models were built on a comprehensive dataset of 371 points, incorporating pressure, temperature, molar mass, and elemental composition as input variables. In addition to the typical factors of temperature and pressure, the study integrates molar mass and elemental composition as key variables; these additional factors allow a deeper understanding of the chemical properties and behavior of fatty acid ethyl esters, offering a more accurate and cost-effective solution for predicting the speed of sound of FAEEs. The dataset was sourced from experimental studies and provides a comprehensive representation of FAEE characteristics under varying conditions. The correlation analysis shows weak relationships between the speed of sound and input variables such as molar mass and elemental composition (O, C, H). In contrast, pressure exhibits a strong positive correlation (0.84), while temperature shows a moderate negative correlation (−0.66) with the speed of sound. These results indicate that pressure has a significant enhancing effect, whereas temperature has a noticeable inverse impact on the speed of sound of FAEEs. Among the machine learning models evaluated, the CNN algorithm consistently outperformed the others in prediction accuracy. Although pressure and temperature were identified as the dominant factors, the inclusion of molar mass and elemental composition, despite their relatively weaker correlations, still contributed to overall model performance. The sensitivity and SHAP analyses confirmed that all input variables played a role, with pressure and temperature having the strongest impact. By utilizing these advanced machine learning techniques, this study provides an efficient, cost-effective tool for accurately predicting the speed of sound of FAEEs, offering an alternative to expensive and time-consuming laboratory experiments. The tool developed here contributes to the optimization of industrial processes involving FAEEs, providing a practical solution for predicting their physical properties. These results illustrate the potential of machine learning models, particularly CNNs, for accurately predicting the speed of sound in FAEEs under varying conditions. The proposed framework offers a quick and economical substitute for experimental methods, which is highly valuable for industrial applications such as biodiesel quality assessment and pharmaceutical formulation. Given the critical role of acoustic properties in process design and control, the model can support real-time decision-making in manufacturing environments. Future research could extend the model to broader classes of esters, incorporate additional physicochemical features, and deploy the trained models in simulation or monitoring tools for industrial use.
Supplementary Information
Below is the link to the electronic supplementary material.
Author contributions
All authors contributed equally to this paper.
Funding
None.
Data availability
Data is available on request from the corresponding author.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Bloch, M. et al. Fatty acid esters in Europe: market trends and technological perspectives. Oil Gas Sci. Technol. – Rev. IFP 63(4), 405–417 (2008).
- 2. Hayes, D. G. Fatty acids-based surfactants and their uses. In Fatty Acids 355–384 (2017).
- 3. Wallek, T., Knöbelreiter, K. & Rarey, J. Estimation of pure-component properties of biodiesel-related components: fatty acid ethyl esters. Ind. Eng. Chem. Res. 57(9), 3382–3396 (2018).
- 4. Ferreira-Dias, S. et al. The potential use of lipases in the production of fatty acid derivatives for the food and nutraceutical industries. Electron. J. Biotechnol. 16(3), 12 (2013).
- 5. Ma, H. et al. Development and validation of an automatic machine learning model to predict abnormal increase of transaminase in valproic acid-treated epilepsy. Arch. Toxicol. 98(9), 3049–3061 (2024).
- 6. Doyle, K. M. et al. Fatty acid ethyl esters are present in human serum after ethanol ingestion. J. Lipid Res. 35(3), 428–437 (1994).
- 7. Lamprecht, A., Schäfer, U. & Lehr, C. M. Influences of process parameters on preparation of microparticle used as a carrier system for O-3 unsaturated fatty acid ethyl esters used in supplementary nutrition. J. Microencapsul. 18(3), 347–357 (2001).
- 8. Saerens, S. M. G. et al. The Saccharomyces cerevisiae EHT1 and EEB1 genes encode novel enzymes with medium-chain fatty acid ethyl ester synthesis and hydrolysis capacity. J. Biol. Chem. 281(7), 4446–4456 (2006).
- 9. Yu, X. et al. Deep learning for fast denoising filtering in ultrasound localization microscopy. Phys. Med. Biol. 68(20), 205002 (2023).
- 10. Ndiaye, E. H. I., Nasri, D. & Daridon, J. L. Speed of sound, density, and derivative properties of fatty acid methyl and ethyl esters under high pressure: methyl caprate and ethyl caprate. J. Chem. Eng. Data 57(10), 2667–2676 (2012).
- 11. Habrioux, M., Nasri, D. & Daridon, J. L. Measurement of speed of sound, density, compressibility and viscosity in liquid methyl laurate and ethyl laurate up to 200 MPa by using acoustic wave sensors. J. Chem. Thermodyn. 120, 1–12 (2018).
- 12. Ndiaye, E. H. I. et al. Speed of sound, density, and derivative properties of ethyl myristate, methyl myristate, and methyl palmitate under high pressure. J. Chem. Eng. Data 58(5), 1371–1377 (2013).
- 13. Tat, M. E. & Van Gerpen, J. H. Measurement of Biodiesel Speed of Sound and Its Impact on Injection Timing: Final Report; Report 4 in a Series of 6 (National Renewable Energy Laboratory (NREL), United States, 2003).
- 14. Zhang, Y. et al. Speed of sound and thermal diffusivity of ethyl myristate. J. Chem. Thermodyn. 140, 105899 (2020).
- 15. Lin, J. et al. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning (PMLR, 2020).
- 16. Zhou, H. et al. A feature selection algorithm of decision tree based on feature weight. Expert Syst. Appl. 164, 113842 (2021).
- 17. Feng, Y. et al. Application of machine learning decision tree algorithm based on big data in intelligent procurement. (2024).
- 18. Xiang, X. et al. Enhancing beef tallow flavor through enzymatic hydrolysis: unveiling key aroma precursors and volatile compounds using machine learning. Food Chem. 477, 143559 (2025).
- 19. Yu, Y. et al. CrowdFPN: crowd counting via scale-enhanced and location-aware feature pyramid network. Appl. Intell. 55(5), 359 (2025).
- 20. Ghanizadeh, A. R., Amlashi, A. T. & Dessouky, S. A novel hybrid adaptive boosting approach for evaluating properties of sustainable materials: a case of concrete containing waste foundry sand. J. Build. Eng. 72, 106595 (2023).
- 21. Xiao, H. et al. Prediction of shield machine posture using the GRU algorithm with adaptive boosting: a case study of Chengdu subway project. Transp. Geotech. 37, 100837 (2022).
- 22. Noviyanti, C. N. & Alamsyah, A. Early detection of diabetes using random forest algorithm. J. Inform. Syst. Explor. Res. 2(1) (2024).
- 23. Zhu, M. et al. Robust modeling method for thermal error of CNC machine tools based on random forest algorithm. J. Intell. Manuf. 34(4), 2013–2026 (2023).
- 24. Feng, K. et al. Statistical tests for replacing human decision makers with algorithms. Manage. Sci. (2025).
- 25. Rincy, T. N. & Gupta, R. Ensemble learning techniques and its efficiency in machine learning: a survey. In 2nd International Conference on Data, Engineering and Applications (IDEA) (IEEE, 2020).
- 26. Keser, S. B. & Aghalarova, S. HELA: a novel hybrid ensemble learning algorithm for predicting academic performance of students. Educ. Inf. Technol. 27(4), 4521–4552 (2022).
- 27. Alojail, M. & Bhatia, S. A novel technique for behavioral analytics using ensemble learning algorithms in e-commerce. IEEE Access 8, 150072–150080 (2020).
- 28. Kumar, M. et al. A comparative performance assessment of optimized multilevel ensemble learning model with existing classifier models. Big Data 10(5), 371–387 (2022).
- 29. Toche Tchio, G. M. et al. A comprehensive review of supervised learning algorithms for the diagnosis of photovoltaic systems, proposing a new approach using an ensemble learning algorithm. Appl. Sci. 14(5), 2072 (2024).
- 30. Khan, A. A., Chaudhari, O. & Chandra, R. A review of ensemble learning and data augmentation models for class imbalanced problems: combination, implementation and evaluation. Expert Syst. Appl. 244, 122778 (2024).
- 31. Guan, Y., Cui, Z. & Zhou, W. Reconstruction in off-axis digital holography based on hybrid clustering and the fractional Fourier transform. Opt. Laser Technol. 186, 112622 (2025).
- 32. Raja Sarobin, M. & Panjanathan, R. V. Diabetic retinopathy classification using CNN and hybrid deep convolutional neural networks. Symmetry 14(9), 1932 (2022).
- 33. Giusti, A. et al. Fast image scanning with deep max-pooling convolutional neural networks. (IEEE, 2013).
- 34. Liu, K. et al. Pixel-level noise mining for weakly supervised salient object detection. IEEE Trans. Neural Netw. Learn. Syst. (2025).
- 35. Samet, H. K-nearest neighbor finding using MaxNearestDist. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 243–252 (2007).
- 36. Sha, X. et al. SSC-Net: a multi-task joint learning network for tongue image segmentation and multi-label classification. Digit. Health 11, 20552076251343696 (2025).
- 37. Ertuğrul, Ö. F. & Tağluk, M. E. A novel version of k nearest neighbor: dependent nearest neighbor. Appl. Soft Comput. 55, 480–490 (2017).
- 38. Zhang, H. H. et al. Enhanced two-step deep-learning approach for electromagnetic-inverse-scattering problems: frequency extrapolation and scatterer reconstruction. IEEE Trans. Antennas Propag. 71(2), 1662–1672 (2022).
- 39. Valkenborg, D. et al. Support vector machines. Am. J. Orthod. Dentofac. Orthop. 164(5), 754–757 (2023).
- 40. Zhang, H. H. & Chen, R. S. Coherent processing and superresolution technique of multi-band radar data based on fast sparse Bayesian learning algorithm. IEEE Trans. Antennas Propag. 62(12), 6217–6227 (2014).
- 41. Valles, J. Application of a multilayer perceptron artificial neural network (MLP-ANN) in hydrological forecasting in El Salvador. In Machine Learning and Optimization for Water Resources 213–239 (2024).
- 42. Sha, X. et al. ZHPO-LightXBoost: an integrated prediction model based on small samples for pesticide residues in crops. Environ. Model. Softw. 188, 106440 (2025).
- 43. Paluang, P., Thavorntam, W. & Phairuang, W. Application of multilayer perceptron artificial neural network (MLP-ANN) algorithm for PM2.5 mass concentration estimation during open biomass burning episodes in Thailand. Int. J. Geoinform. (2024).
- 44. Zhang, H. H., Wei, E. I. & Jiang, L. J. Fast monostatic scattering analysis based on Bayesian compressive sensing. (IEEE, 2017).
- 45. Qasim, M., Månsson, K. & Golam Kibria, B. M. On some beta ridge regression estimators: method, simulation and application. J. Stat. Comput. Simul. 91(9), 1699–1712 (2021).
- 46. Tian, A. et al. Resistance reduction method for building transmission and distribution systems based on an improved random forest model: a tee case study. Build. Environ. 113256 (2025).
- 47. Marcellino, M., Castelblanco, G. & Marco, A. D. Multiple linear regression model for project's risk profile and DSCR. (AIP Publishing, 2023).
- 48. Sha, X. et al. Automatic three-dimensional reconstruction of transparent objects with multiple optimization strategies under limited constraints. Image Vis. Comput. 105580 (2025).
- 49. Zhang, X. & Liu, C. A. Model averaging prediction by K-fold cross-validation. J. Econom. 235(1), 280–301 (2023).
- 50. Yan, T. et al. Prediction of geological characteristics from shield operational parameters by integrating grid search and K-fold cross validation into stacking classification algorithm. J. Rock Mech. Geotech. Eng. 14(4), 1292–1303 (2022).
- 51. Gao, D. et al. A comprehensive adaptive interpretable Takagi-Sugeno-Kang fuzzy classifier for fatigue driving detection. IEEE Trans. Fuzzy Syst. (2024).
- 52. Bemani, A., Madani, M. & Kazemi, A. Machine learning-based estimation of nano-lubricants viscosity in different operating conditions. Fuel 352, 129102 (2023).
- 53. Madani, M. et al. Modeling of CO2-brine interfacial tension: application to enhanced oil recovery. Pet. Sci. Technol. 35(23), 2179–2186 (2017).
- 54. Bassir, S. M. & Madani, M. A new model for predicting asphaltene precipitation of diluted crude oil by implementing LSSVM-CSA algorithm. Pet. Sci. Technol. 37(22), 2252–2259 (2019).
- 55. Bassir, S. M. & Madani, M. Predicting asphaltene precipitation during titration of diluted crude oil with paraffin using artificial neural network (ANN). Pet. Sci. Technol. 37(24), 2397–2403 (2019).
- 56. Madani, M. & Alipour, M. Gas-oil gravity drainage mechanism in fractured oil reservoirs: surrogate model development and sensitivity analysis. Comput. Geosci. 26(5), 1323–1343 (2022).
- 57. Aigbe, U. O. et al. Optimization and prediction of biogas yield from pretreated Ulva intestinalis Linnaeus applying statistical-based regression approach and machine learning algorithms. Renew. Energy 235, 121347 (2024).
- 58. Dufrenois, F. & Noyer, J. C. Formulating robust linear regression estimation as a one-class LDA criterion: discriminative hat matrix. IEEE Trans. Neural Netw. Learn. Syst. 24(2), 262–273 (2012).
- 59. Abbasi, P., Aghdam, S. K. & Madani, M. Modeling subcritical multi-phase flow through surface chokes with new production parameters. Flow Meas. Instrum. 89, 102293 (2023).
- 60. Bemani, A. et al. Estimation of adsorption capacity of CO2, CH4, and their binary mixtures in Quidam shale using LSSVM: application in CO2 enhanced shale gas recovery and CO2 storage. J. Nat. Gas Sci. Eng. 76, 103204 (2020).
- 61. Bemani, A., Baghban, A. & Mosavi, A. Estimating CO2-brine diffusivity using hybrid models of ANFIS and evolutionary algorithms. Eng. Appl. Comput. Fluid Mech. 14(1), 818–834 (2020).
- 62. Soltanian, M. R. et al. Data driven simulations for accurately predicting thermodynamic properties of H2 during geological storage. Fuel 362, 130768 (2024).
- 63. Keybondorian, E., Soltani Soulgani, B. & Bemani, A. Application of ANFIS-GA algorithm for forecasting oil flocculated asphaltene weight% in different operation conditions. Pet. Sci. Technol. 36(12), 862–868 (2018).
- 64. Wang, C. et al. On-demand airport slot management: tree-structured capacity profile and coadapted fire-break setting and slot allocation. Transportmetrica A: Transp. Sci. 1–35 (2024).
- 65. Ye, D. et al. PO-SRPP: a decentralized pivoting path planning method for self-reconfigurable satellites. IEEE Trans. Ind. Electron. 71(11), 14318–14327 (2024).
- 66. Lv, H. et al. Study on prestress distribution and structural performance of heptagonal six-five-strut alternated cable dome with inner hole. (Elsevier, 2024).
- 67. Lv, S. et al. Effect of axial misalignment on the microstructure, mechanical, and corrosion properties of magnetically impelled arc butt welding joint. Mater. Today Commun. 40, 109866 (2024).
- 68. Du, J. et al. Solidification microstructure reconstruction and its effects on phase transformation, grain boundary transformation mechanism, and mechanical properties of TC4 alloy welded joint. Metall. Mater. Trans. A 55(4), 1193–1206 (2024).
- 69. Bao, W. et al. Keyhole critical failure criteria and variation rule under different thicknesses and multiple materials in K-TIG welding. J. Manuf. Process. 126, 48–59 (2024).
- 70. Yue, T. et al. Monascus pigment-protected bone marrow-derived stem cells for heart failure treatment. Bioact. Mater. 42, 270–283 (2024).
- 71. Liao, H. et al. Ropinirole suppresses LPS-induced periodontal inflammation by inhibiting the NAT10 in an ac4C-dependent manner. BMC Oral Health 24(1), 510 (2024).
- 72. Zhang, H. H. et al. 5G base station antenna array with heatsink radome. IEEE Trans. Antennas Propag. 72(3), 2270–2278 (2024).
- 73. Zhang, Z. et al. An AUV-enabled dockable platform for long-term dynamic and static monitoring of marine pastures. IEEE J. Ocean. Eng. (2024).