Abstract
Sooting propensity is a critical property for estimating the combustion efficiency and pollution emissions of a fuel and for discovering the next generation of cleaner and more efficient fuels. The yield sooting index (YSI) is an important metric for characterizing the sooting propensity; however, it is inefficient to measure experimentally. Thus, machine learning (ML)-based predictive models are an important tool for predicting the YSI in fuel design. Herein, this work compares the accuracy and interpretability of four ML models for predicting the YSI based on different kinds of descriptors. It is demonstrated that the best ML model differs with the kind of descriptor used. The multilayer perceptron (MLP) regressor neural network (NN), gradient boosting (GB), and random forest (RF) models are the best models for the PaDEL, mordred, and quantum mechanical (QM) descriptors, respectively. The NN model is suitable for the combination of QM descriptors with the full PaDEL and mordred descriptors, while the RF model is better for the combination of QM descriptors with PaDEL and mordred descriptors after the permutation feature importance (PFI) filtering procedure. The use of QM descriptors can slightly improve the performance of the deep-learning-based ML model. The developed ML models can all predict the YSI with high accuracy, i.e., the coefficient of determination (R²) is close to 1.0, and the mean absolute error between the experimental and predicted data is less than 20 for the training, validation, and test sets. Among the developed ML models, the GB model using the PFI-filtered mordred descriptors exhibits the best performance. The present work is valuable for the selection of descriptors in the development of ML models to predict fuel properties.
1. Introduction
There is a pressing requirement to design and use high-performance and low-carbon fuels to achieve high combustion efficiency and low pollution emissions. Among the various factors influencing the fuel’s combustion efficiency and pollution emissions, sooting propensity is a key property related to the engine’s particulate matter emissions, which attracts significant interest from both academic and industrial areas to discover the next generation of cleaner and more efficient fuels or fuel additives.
Generally, the sooting propensity is a quantitative measure of how much particulate matter is formed when a fuel is burned. Over the past decades, various metrics have been proposed and used to characterize the sooting propensity of fuels, including the smoke point (SP), threshold sooting index (TSI), yield sooting index (YSI), and oxygen extended sooting index (OESI). SP was the first widely used measure of the sooting propensity of a fuel, and it has been incorporated into the ASTM standard test method for routinely certifying aviation fuels. SP also continues to be developed as a tool for characterizing the sooting propensity of different fuels in fundamental combustion research. The measurement of SP is relatively simple and is now a common procedure; however, the experiment can require up to 20 mL of the fuel in question. Based on SP, a series of normalization procedures have been proposed. Among them, the TSI proposed by Calcote and Manos has been extensively used owing to its low fuel requirements and simple experimental requirements. TSI is derived from the measured SP by using two reference compounds to create a 0–100 scale and has been shown to correlate well with engine and turbine soot formation. The TSI was later revised as the OESI for oxygenated fuels. However, these indices share the problems of SP because of its narrow dynamic range. In 2007, McEnally and Pfefferle proposed the YSI method to represent the sooting propensity of fuels, and it is now the most widely used method in the combustion community. The underlying idea of the YSI method is that if the test fuel has a larger sooting propensity, it will produce more soot, and a larger value of the maximum soot concentration will be measured. Procedurally, the YSI method needs only a very small concentration of the fuel and enables high sample throughput.
Thus, this method is now widely used to characterize the sooting propensity of fuels and has also been adopted as a key index in fuel design. To date, a series of molecules with sufficiently diverse molecular structures have been measured, and the effects of unsaturation, heteroatoms, aromatic structures, and isomers on the sooting propensity have been analyzed. The final goal is to develop a quantitative structure–property relationship (QSPR) by analyzing the experimentally measured results via simple regression analysis or machine learning (ML) models.
QSPR and ML approaches are extensively used in chemistry and are now increasingly used in combustion and fuel research. Various types of predictive models have been successfully developed to predict the physicochemical properties of a variety of fuel molecules. Li et al. developed ML-QSPR models for 15 properties of fuels, including melting point, boiling point, vapor pressure, enthalpy of vaporization, cetane number, flash point, surface tension, and so on, for high-throughput fuel screening toward internal combustion engines. The octane number and derived cetane number were also studied via different ML models. For the sooting propensity, various ML-based prediction models have also been developed. For example, Ahmed Qasem et al. developed an ML-based model using artificial neural networks (ANNs) for the prediction of the TSI of fuels containing oxygenated chemical classes such as alcohols and ethers, and the absolute prediction error obtained was 2.46, which is close to the uncertainty of the experimental measurements. A similar ANN model was also used to develop predictive models for SP. As the YSI database has grown, extensive studies have also been conducted to develop ML models for the prediction of YSI. Alboqami et al. employed an ANN and an adaptive network-based fuzzy inference system (ANFIS) model to predict the YSI of 294 hydrocarbons and oxygenates. It was found that the ANN model can predict the test set with high accuracy, while the ANFIS model cannot generalize the training results to the unseen test set data. However, the data set was relatively small and restricted to specific fuel molecules, and the use of R² as the only performance metric could be a critical issue. Kessler et al. compared the accuracy and interpretability of three YSI models generated via ANNs, graph neural networks (GNNs), and multivariate equations.
One of the major conclusions is that ANNs are highly accurate; however, no significant interpretation of the QSPR descriptors relating to YSI can be drawn from them. On the other hand, GNNs and the derived multivariate equation can provide insight into the structural effect on the YSI, although their prediction accuracies were poorer than those of the ANNs. Besides the above studies, a deep learning method using a cogent set of molecular descriptors has also been developed for the YSI. From the perspective of fuel design, the ultimate goal of developing ML models is to uncover the QSPR between molecular structures and fuel properties, including the YSI. However, as shown by Kessler et al., most of the developed ANN models are likely to be “black-box” solutions owing to their lack of interpretability. Besides, the ML models developed by different researchers often employ very different molecular descriptors as input, and the effect of this choice on the developed ML models and their interpretation remains to be explored further.
Based on the above considerations, this work mainly intends to compare the effect of typical molecular descriptors on the developed ML models and their accuracy for YSI. To this end, descriptors from traditional cheminformatics and from quantum chemistry calculations are considered. Several ML models are also compared using combinations of these descriptors. The paper is organized as follows: Section 2 presents the data set, descriptor calculation methods, and ML methods. The results and discussion are presented in detail in Section 3, and Section 4 summarizes the major conclusions.
2. Computational Details
2.1. Database and Descriptors
The YSI database used in the present work is based on the most recent compilation of experimental results from the past years by Pfefferle et al. A total of 663 fuel molecules covering different classes, including normal alkanes, cyclic alkanes, aromatics, alcohols, and esters, are considered. The YSI values range from −4.8 to 1339, and most of them fall within the range of tens to hundreds. The uncertainty of the measured YSI value for a given molecule in the data set is 5% of the measured result. These data are divided into three subsets: the training set (80%), validation set (10%), and test set (10%). The SMILES strings of the molecules in the data set are used as input for the computation of molecular descriptors. Two widely used sets of molecular descriptors, from PaDEL (version 2.21) and mordred (version 1.1.0), are computed for the molecules in the data set. PaDEL computes a total of 1444 1D and 2D descriptors, while mordred calculates 1344 descriptors for each fuel molecule. The descriptors include topological indices, electronic states, and geometric information. For clarity and simplicity, the descriptors computed with the PaDEL and mordred software are denoted as PaDEL and mordred descriptors. Mordred incorporates all of the descriptors implemented by RDKit, and thus, the descriptors solely from RDKit are not considered in this work. Besides the two sets of traditional molecular descriptors, quantum mechanical (QM) descriptors are considered in this work to check their effect on the predicted YSI and on the interpretability. Specifically, owing to the large number of molecules, the semiempirical tight-binding (TB) method (xTB version 6.7.1) is used to calculate the QM descriptors. The xTB method achieves a good balance between computational cost and accuracy, making it suitable for large molecular systems.
It has shown excellent performance in benchmarking against extensive data sets, achieving accuracy comparable to low-cost density functional theory (DFT) methods while maintaining the speed of semiempirical approaches. Although QM descriptors have been used in ML studies, the computational methods are usually based on low-level DFT methods owing to the large data sets. Thus, the xTB method is another potential method for ML studies. Computing the QM descriptors for the 663 molecules with this method requires only 2 h of computational time on a single 36-core desktop. The major QM descriptors based on the xTB method in this work include the HOMO–LUMO gap, HOMO energy, LUMO energy, ionization potential (IP), electron affinity (EA), dipole moment, Fermi level, polarization, second IP, second EA, S0–T1 gap, electronegativity, and so on. Specifically, the HOMO–LUMO-related properties have been shown to be important for describing the soot formation process. Generally, the computed QM descriptors are limited to several key properties, mostly related to electronic energies, that cannot fully describe the molecular properties. Thus, the combination of these QM descriptors with PaDEL and mordred descriptors is also considered for ML model optimization and validation. The QM descriptors are calculated through the automated quantum mechanical environment (AQME version 1.7.2).
2.2. ML Models
Before ML model screening and optimization, the data set is divided into a training set and a test set (10%). A 10% test set is widely employed in ML studies for YSI model development. The YSI data set in this work is, in fact, small and imbalanced. Thus, to make full use of the data set and reduce the risk of overfitting, the K-nearest neighbor classification (KN) data splitting method is employed instead of random splitting to ensure the heterogeneity of the training and test sets. Considering the large number of descriptors, permutation feature importance (PFI) is employed to identify the most influential descriptors. It should be noted that ML models both with and without PFI-filtered descriptors are optimized and analyzed. Specifically, for ML models without the PFI filter procedure, a threshold of R² higher than 0.7 is used to remove correlated descriptors, while a threshold of R² lower than 0.001 is used to remove descriptors with low correlation to the target. For PFI models, features that contribute less than 4% to the model’s R² value are removed. The hyperparameters of the ML models are optimized using Bayesian optimization with an objective function that combines the error from a 10-times repeated 5-fold CV (10 × 5-fold CV) and a sorted CV to minimize overfitting. Specifically, the optimization procedure uses a combined RMSE calculated from the different CV methods and evaluates a model’s generalization capability by averaging the interpolation and extrapolation CV performance. Interpolation is assessed on the training data, while extrapolation is assessed via a selective sorted 5-fold CV approach, which sorts and partitions the data based on the target value and takes the higher RMSE of the top and bottom partitions. Hyperparameter optimization is performed independently for each descriptor set and each filter type during the model optimization process.
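The combined objective described above can be illustrated with a minimal sketch. It is not the ROBERT implementation: the function names, the equal weighting of the two terms, the partition size, and the toy mean-value model standing in for RF, GB, or NN are all assumptions made for illustration, and the data are synthetic.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_mean(X_train, y_train):
    """Toy stand-in for a real regressor: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda X: [mean for _ in X]


def holdout_rmse(X, y, test_idx, fit):
    """Train on everything outside test_idx and return the RMSE on test_idx."""
    held_out = set(test_idx)
    train_idx = [i for i in range(len(y)) if i not in held_out]
    model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    return rmse([y[i] for i in test_idx], model([X[i] for i in test_idx]))


def repeated_kfold_rmse(X, y, fit, k=5, repeats=10, seed=0):
    """Interpolation term: mean RMSE over `repeats` shuffled k-fold splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        for fold in (idx[i::k] for i in range(k)):
            scores.append(holdout_rmse(X, y, fold, fit))
    return sum(scores) / len(scores)


def sorted_cv_rmse(X, y, fit, k=5):
    """Extrapolation term: hold out the lowest and the highest target
    partitions in turn and keep the larger of the two RMSE values."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    size = len(y) // k
    bottom, top = order[:size], order[-size:]
    return max(holdout_rmse(X, y, bottom, fit), holdout_rmse(X, y, top, fit))


def combined_rmse(X, y, fit, k=5, repeats=10, seed=0):
    """Objective to minimize: equal-weight average of both terms (assumed)."""
    return 0.5 * (repeated_kfold_rmse(X, y, fit, k, repeats, seed)
                  + sorted_cv_rmse(X, y, fit, k))


# Synthetic data, not YSI measurements.
X = [[float(i)] for i in range(20)]
y = [float(i) for i in range(20)]
interp = repeated_kfold_rmse(X, y, fit_mean)
extrap = sorted_cv_rmse(X, y, fit_mean)
objective = combined_rmse(X, y, fit_mean)
```

For this toy model, the extrapolation term exceeds the interpolation term because the held-out extremes lie outside the range seen in training, which is the behavior the sorted CV is meant to penalize.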
The ML models used for YSI in this work are random forests (RF), multivariate linear models (MVL), gradient boosting (GB), and the multilayer perceptron (MLP) regressor neural network (termed the NN for simplicity). The hidden layer of the MLP NN is 10, while the default depths for RF and GB are 100 and 40, respectively. The ML model screening and optimization are conducted through the Scikit-learn module implemented in the Robust Optimization-Based Environment for Research Tools (the ROBERT software, version 2.0.1). Besides this, a deep learning-based QSPR method (Deep-QSPR) via the feedforward neural network (FNN) implemented in the fastprop software (version 1) is also employed to validate and check the descriptor effect on the predicted YSI. The FNN architecture and the use of physically meaningful molecular descriptors make the method easy to apply and make model training dramatically faster. This approach is successful even on the smallest of real-world data sets and has been validated for a number of fuel properties. Finally, the SHAP (SHapley Additive exPlanations) method is used to analyze the effect of the descriptors on the ML model performance and to explain the relationship between molecular structures and the YSI of fuel molecules. Figure 1 shows the flowchart used in this work for the development of ML models to predict the YSI, together with the effect of descriptors on the ML models. The ML model training process takes approximately 1 h on a single desktop computer. Including the QM descriptor calculation for the 663 molecules requires only an additional 2 h of computational time owing to the efficiency of the selected semiempirical QM method. Error indicators, including the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²) between the experimental data and model predictions, defined in eqs 1, 2, and 3 below, are used to estimate the accuracy of the final ML models.
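$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{YSI}_i - \widehat{\mathrm{YSI}}_i\right)^{2}} \tag{1}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\mathrm{YSI}_i - \widehat{\mathrm{YSI}}_i\right| \tag{2}$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(\mathrm{YSI}_i - \widehat{\mathrm{YSI}}_i\right)^{2}}{\sum_{i=1}^{n}\left(\mathrm{YSI}_i - \overline{\mathrm{YSI}}\right)^{2}} \tag{3}$$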
In the above equations, $n$ is the number of samples in the data set, $\mathrm{YSI}_i$ is the i-th experimental value of YSI, $\widehat{\mathrm{YSI}}_i$ is the predicted value of the i-th YSI, and $\overline{\mathrm{YSI}}$ is the average value of the YSIs in the data set. Specifically, the RMSE and MAE describe the magnitude of the prediction error, while R² evaluates the degree of agreement between the predicted and experimental values.
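As a brief worked illustration of these three error indicators, the following sketch evaluates them with their standard definitions on a small set of made-up values. The numbers are hypothetical and are not taken from the YSI database.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical experimental and predicted values, for illustration only.
ysi_exp = [10.0, 50.0, 100.0, 400.0]
ysi_pred = [12.0, 46.0, 105.0, 390.0]

print(f"RMSE = {rmse(ysi_exp, ysi_pred):.2f}")
print(f"MAE  = {mae(ysi_exp, ysi_pred):.2f}")
print(f"R2   = {r2(ysi_exp, ysi_pred):.4f}")
```

Because the RMSE squares the residuals before averaging, it is never smaller than the MAE for the same predictions and is more sensitive to a few large errors, which is worth keeping in mind when the two indicators are compared in the tables.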
Figure 1. Flowchart of the development of ML models for the prediction of YSI in this work.
3. Results and Discussion
3.1. Developed ML Models
Table 1 lists the descriptors used and the best ML models screened for YSI with and without PFI in this work. Before ML model optimization, the original descriptors are filtered by one of two procedures, as shown in Table 1: No PFI and PFI. The No PFI filter procedure removes correlated descriptors (R² higher than 0.7), descriptors with very low correlation to the target values (R² lower than 0.001), and duplicates. The PFI method is employed to further retain only the important descriptors for ML model development. Table 1 shows that the best ML model differs with the descriptor set. Generally, the models derived from QM descriptors alone are less accurate than the models derived from the other four descriptor sets. The likely reason is the small number of QM descriptors, whose calculated values probably cannot effectively describe the differences among these molecules. For the ML models developed using the other four descriptor sets, the numbers of descriptors retained by each filter are very close: 197–198 with the standard procedure and 68–97 important descriptors with PFI. Generally, for a given descriptor set, the error indicators for the test set show that the models using the standard filter (with more descriptors) are slightly more accurate than the models using only PFI-selected descriptors. Specifically, combining PaDEL and QM descriptors improves the model accuracy compared with that of pure PaDEL descriptors.
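The No PFI filter can be sketched as follows, using the two thresholds quoted above. This is an illustrative reading and not the ROBERT code: the function names, the order of the steps, and the rule that the member of a correlated pair with the higher correlation to the target is kept are assumptions, and the descriptor names and values are hypothetical.

```python
def r_squared(a, b):
    """Squared Pearson correlation; returns 0.0 for a constant column."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((x - mean_b) ** 2 for x in b)
    if s_aa == 0 or s_bb == 0:
        return 0.0
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return s_ab * s_ab / (s_aa * s_bb)


def filter_descriptors(descriptors, target, high=0.7, low=0.001):
    """Return the names of the descriptors that survive the three checks."""
    # 1) Drop exact duplicates, keeping the first occurrence.
    unique, seen = {}, set()
    for name, values in descriptors.items():
        key = tuple(values)
        if key not in seen:
            seen.add(key)
            unique[name] = values
    # 2) Drop descriptors with very low correlation to the target.
    relevance = {name: r_squared(values, target) for name, values in unique.items()}
    candidates = [name for name in unique if relevance[name] >= low]
    # 3) Greedily drop descriptors correlated with one already kept,
    #    visiting the most target-relevant descriptors first (assumed rule).
    candidates.sort(key=lambda name: relevance[name], reverse=True)
    kept = []
    for name in candidates:
        if all(r_squared(unique[name], unique[other]) <= high for other in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table for six molecules.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
descriptors = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "a_copy": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],     # duplicate of "a"
    "a_scaled": [2.0, 4.0, 6.0, 8.0, 10.0, 13.0],  # strongly correlated with "a"
    "const": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],       # no correlation with the target
    "b": [1.0, -1.0, 2.0, 0.0, 3.0, 1.0],          # weakly correlated, independent
}
kept = filter_descriptors(descriptors, target)
print(kept)
```

A greedy pass of this kind depends on the visiting order, so a different tie-breaking rule could retain a different but equally valid subset of correlated descriptors.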
Table 1. Descriptors, the Best ML Models Developed, and the Corresponding Error Indicators for the Test Set

| descriptors | filter | features | best ML model | R² | MAE | RMSE |
|---|---|---|---|---|---|---|
| PaDEL (1444) | no PFI | 197 | NN | 0.99 | 16.0 | 26.0 |
| | PFI | 97 | NN | 0.98 | 32.0 | 32.0 |
| mordred (1344) | no PFI | 198 | GB | 0.99 | 16.0 | 30.0 |
| | PFI | 78 | GB | 0.99 | 15.0 | 27.0 |
| QM (22) | no PFI | 18 | RF | 0.89 | 30.0 | 81.0 |
| | PFI | 11 | RF | 0.89 | 35.0 | 81.0 |
| PaDEL + QM (1466) | no PFI | 197 | NN | 0.99 | 13.0 | 21.0 |
| | PFI | 81 | RF | 0.99 | 17.0 | 32.0 |
| mordred + QM (1366) | no PFI | 198 | NN | 0.99 | 16.0 | 28.0 |
| | PFI | 68 | RF | 0.99 | 16.0 | 27.0 |
The numbers in brackets represent the original computed numbers of the descriptors.
The features denote the descriptor numbers used for the ML models via standard descriptor filter (No PFI) and only important descriptors from PFI, respectively.
Another noteworthy point is that the best ML model differs with the descriptor set. The NN model is best for the PaDEL descriptors, while the GB model demonstrates the best performance for the mordred descriptors. However, for the combinations of QM with either the PaDEL or mordred descriptors, the NN and RF models are the best models for the standard filter procedure and the PFI-filtered important descriptors, respectively. Among the optimized ML models, the NN model built from the combination of PaDEL and QM descriptors via the standard filter procedure performs better than the other models on the test set, with R², MAE, and RMSE values of 0.99, 13.0, and 21.0, respectively. The models using mordred and its combination with QM descriptors also exhibit good performance on the test set. For the PaDEL descriptors, the NN model with only the PFI-screened important descriptors displays larger deviations than the NN model using the full set of filtered PaDEL descriptors, and the MAE increases from 16.0 to 32.0. Overall, most of the models can predict most of the YSI values with good accuracy, except the RF models generated solely from QM descriptors, which show larger errors.
To validate the present results and compare the effect of different descriptors on the deep-learning-based ML models of YSI, Table 2 lists the error indicators of the Deep-QSPR models developed via FNN using the different descriptors. The results are generally consistent with the results in Table 1 from the other four ML methods. The prediction accuracy based on QM descriptors is comparable to that of the RF models shown in Table 1. However, the Deep-QSPR models based on the original mordred and the mordredcommunity descriptors in the fastprop software show different prediction accuracies, and the default FNN procedure applied to the present YSI data set generates a model with good prediction accuracy that is competitive with the GB models. Specifically, the mordredcommunity descriptors are computed with a maintained fork of the original mordred descriptor calculator, which is itself no longer maintained. One notable point is that the combination of QM descriptors with PaDEL or mordred descriptors can slightly improve the Deep-QSPR models for YSI. For comparison, Burns et al. developed ML models using the fastprop and Chemprop software for a data set of 442 molecules. The MAEs were 25.0 and 28.9, respectively, while the RMSEs were 52 and 63 for the two models. The data set employed in this work is larger than that of Burns et al., and the comparison of the error indicators with those of the models developed by Burns et al. further validates the efficiency of the present models. Generally, although the data sets and their dividing ratios were different, the ML models developed in this work have relatively high accuracy.
Table 2. Error Indicators of the Deep-QSPR Models Developed with Different Descriptors for the Test Set

| descriptors | R² | MAE | RMSE |
|---|---|---|---|
| PaDEL | 0.95 | 20.3 | 34.7 |
| mordred | 0.89 | 23.6 | 50.3 |
| mordredcommunity | 0.95 | 19.0 | 32.4 |
| QM | 0.88 | 36.6 | 73.7 |
| PaDEL + QM | 0.96 | 19.9 | 34.0 |
| mordred + QM | 0.96 | 21.0 | 38.5 |
| fastprop | | 25.0 | 52 |
| Chemprop | | 28.9 | 63 |

The mordredcommunity represents the default descriptors implemented in fastprop via a fork of mordred called mordredcommunity that is maintained by community-contributed patches. The fastprop and Chemprop entries lack R² because it is not used as an error indicator.
Table 3 summarizes the best ML models for the different descriptor sets studied in this work. It can be seen that the NN, GB, and RF models are suitable for the PaDEL, mordred, and QM descriptors, respectively. When the QM descriptors are incorporated, the NN model works better for the QM combination with either PaDEL or mordred descriptors without the PFI filtering procedure. For the YSI prediction, the GB model with PFI-filtered descriptors demonstrates the best performance, as shown in Table 3.
Table 3. Best ML Models for the Different Descriptor Sets Studied in This Work

| descriptors | best ML model | descriptor filtering | data set | R² | MAE | RMSE |
|---|---|---|---|---|---|---|
| PaDEL | NN | no PFI | training | 1.00 | 7.0 | 12.0 |
| | | | validation | 0.99 | 11.0 | 19.0 |
| | | | test | 0.99 | 16.0 | 26.0 |
| mordred | GB | PFI | training | 1.00 | 7.2 | 11.0 |
| | | | validation | 1.00 | 9.7 | 14.0 |
| | | | test | 0.99 | 15.0 | 27.0 |
| QM | RF | no PFI | training | 0.97 | 19.0 | 41.0 |
| | | | validation | 0.99 | 9.7 | 15.0 |
| | | | test | 0.89 | 30.0 | 81.0 |
| PaDEL + QM | NN | no PFI | training | 1.00 | 10.0 | 15.0 |
| | | | validation | 0.99 | 13.0 | 22.0 |
| | | | test | 0.99 | 13.0 | 21.0 |
| mordred + QM | NN | no PFI | training | 1.00 | 8.1 | 13.0 |
| | | | validation | 0.99 | 9.7 | 15.0 |
| | | | test | 0.99 | 16.0 | 28.0 |
Figures 2, 3, and 4 show the optimized ML models using the different descriptor sets for the training, validation, and test sets more clearly. From Figure 2, it is more obvious that the GB models based on mordred descriptors perform slightly better than the NN models using the PaDEL descriptors. The predicted YSI values are more evenly distributed for the GB models. Furthermore, the accuracy of the GB model using only the important descriptors from the PFI analysis of the mordred descriptors is nearly identical to that based on the full mordred descriptors, while the NN model performance is slightly affected when only the important descriptors from the PFI-filtered PaDEL descriptors are used. The RF models generated from the QM descriptors, shown in Figure 3, are less accurate than the models from PaDEL and mordred descriptors shown in Figure 2. Thus, this small set of descriptors from QM calculations alone is not sufficient for the development of accurate ML models for YSI.
Figure 2. Scatter plots of the predicted YSI values against the experimental data using the developed NN and GB models via PaDEL and mordred descriptors, respectively.

Figure 3. Scatter plots of the predicted YSI values against the experimental data using the developed RF models via QM descriptors.

Figure 4. Scatter plots of the predicted YSI values against the experimental data using the optimized ML models via the combination of PaDEL/mordred with QM descriptors.
Figure 4 compares the model performance based on the combinations of QM descriptors with PaDEL and mordred descriptors. The most important effect of including QM descriptors in the development of ML models for YSI is that the best model can change. As shown in Table 1, the best models for pure PaDEL and mordred descriptors are the NN and GB models, respectively. However, in combination with QM descriptors, the best model for PaDEL with PFI-selected important descriptors changes to the RF model, while the best models for mordred without and with the PFI screening procedure are NN and RF, respectively, instead of the previous GB model. From Table 1, the inclusion of QM descriptors hardly affects the performance of the best model without the PFI filter procedure for both PaDEL and mordred descriptors. However, the performance of the models based on the PFI-filtered important descriptors is affected by the QM descriptors, and the models tend to be slightly less accurate. This effect is more obvious for the RF model using mordred with QM descriptors. Overall, the effect of adding QM descriptors to the PaDEL and mordred descriptors is complex and can change both the best ML model and its performance.
3.2. ML Model Analysis
One of the most useful outcomes of developing ML models to predict YSI is insight into the physical relationship between molecular structures and properties. Thus, a systematic analysis of the effect of the descriptors on the model performance is critical for this purpose. Herein, this section provides a SHAP analysis of the effect of descriptors on the model performance. SHAP analysis is a method based on cooperative game theory that is used to understand how the features (descriptors) affect the ML model predictions. A detailed analysis of the PFI and SHAP results shows that the most important descriptors for the models developed with and without the PFI filter procedure are nearly identical. Thus, only the results for the ML models based on the PFI-filtered important descriptors are shown. The results from the PFI and SHAP analyses are shown in Figures 5 and 6, respectively.
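To make the game-theoretic idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy three-descriptor model by enumerating every feature subset. It is an illustration of the underlying definition only, not the approximation schemes used by SHAP software: the function name, the use of a single baseline vector to represent "absent" features, and the toy model are all assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline.

    A feature outside the coalition takes its baseline value. The cost grows
    as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight of a coalition of this size that excludes i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


def toy_model(z):
    """Hypothetical model: two linear terms plus one pairwise interaction."""
    return 3.0 * z[0] + 2.0 * z[1] + z[0] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the sample and the prediction for the baseline, and the interaction term is shared equally between the two descriptors involved, which is why a descriptor can receive a nonzero SHAP value even when it has no standalone linear term.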
Figure 5. Most important descriptors from PFI analysis for the developed ML models: (a) NN model based on the PaDEL descriptors; (b) GB model based on the mordred descriptors; (c) RF model based on the combination of PaDEL and QM descriptors; and (d) RF model based on the combination of mordred and QM descriptors.

Figure 6. Important descriptors demonstrating a significant effect on the developed ML models from SHAP analysis: (a) NN model based on the PaDEL descriptors; (b) GB model based on the mordred descriptors; (c) RF model based on the combination of PaDEL and QM descriptors; and (d) RF model based on the combination of mordred and QM descriptors.
Figure 5 shows the PFI analysis of the most important descriptors used in the ML model development. The RF model from pure QM descriptors is not shown because of its lower accuracy. For the PaDEL descriptor-based NN model, the three most important descriptors are C3SP2, R_TpiPCTPC, and TPC. The descriptor C3SP2, designed to represent unsaturated branched aliphatic systems, has the most significant influence on YSI, which agrees well with the understanding that unsaturated bonds can increase the YSI values. Besides this, another descriptor, R_TpiPCTPC, defined as the ratio of total bond order to total path count, also shows a reasonable correlation with the YSI values. Generally, a larger value of R_TpiPCTPC corresponds to a molecule with more cyclic subgroups or non-single-bond atoms, which could increase the YSI values. When the QM descriptors are included with the PaDEL descriptors, the PFI analysis results differ from those for pure PaDEL descriptors. This could be caused by correlations between some of the QM and PaDEL descriptors. The piPC7, minaasC, and the previous R_TpiPCTPC descriptors are, in fact, all related to the bond order and unsaturated bonds in the molecules. Thus, the PFI analysis results generally reflect the impact of molecular structural effects on YSI prediction.
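The PFI procedure behind this ranking can be sketched in a few lines: each descriptor column is shuffled in turn, and the resulting drop in R² is taken as its importance. The sketch below is illustrative and is not the ROBERT implementation. The function names, the number of repeats, the synthetic data, and the reading of the 4% criterion of Section 2.2 as "importance of at least 0.04 times the baseline R²" are assumptions.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_feature_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R2 when each column is shuffled, plus the baseline R2."""
    rng = random.Random(seed)
    baseline = r2_score(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - r2_score(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


def select_features(baseline, importances, threshold=0.04):
    """Keep features whose importance is at least threshold * baseline R2."""
    return [j for j, imp in enumerate(importances) if imp >= threshold * baseline]


# Synthetic data: the target depends strongly on column 0, weakly on
# column 1, and not at all on column 2.
rng = random.Random(1)
X = [[rng.random() for _ in range(3)] for _ in range(50)]
y = [5.0 * row[0] + row[1] for row in X]


def predict(rows):
    """Stand-in for a fitted model that reproduces the target exactly."""
    return [5.0 * row[0] + row[1] for row in rows]


baseline, importances = permutation_feature_importance(predict, X, y)
selected = select_features(baseline, importances)
print(importances, selected)
```

Because shuffling breaks only the link between one column and the target, two strongly correlated descriptors can share or mask each other's importance, which is one reason the ranking can change when QM descriptors are added to an existing descriptor set.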
For mordred and its combination with QM descriptors, the three most important descriptors are BertzCT, AMID_C, and BCUTv-1h in both cases, even though their relative importance changes. From Table 4, the BertzCT descriptor, a topological index designed to quantify the complexity of molecules as the sum of the complexity of the bonding and the complexity of the distribution of heteroatoms, has the most important effect in the GB model developed without QM descriptors. This strongly indicates that the YSI can be significantly affected by the complex bonding types and heteroatoms in the molecules. The AMID_C descriptor, representing the averaged molecular ID on C atoms, also shows a large effect on the YSI prediction, indicating that the YSI is also related to the carbon quantity in molecules. The BCUT descriptors are eigenvalue-based descriptors. Specifically, they are based on a weighted version of the Burden matrix, which takes into account both the connectivity and the atomic properties of a molecule. Three weighting schemes, including atomic weight, partial charge, and polarizability, are computed. It can be seen that the BCUTp-1h and BCUTv-1h descriptors have an important effect on YSI prediction. The overall results shown in Figure 5 and Table 4 are, in fact, partially in good accordance with the results of Kessler et al. Besides this, Comesana et al. also conducted a study on the selection of molecular descriptors as features for the training of ML models to predict physicochemical properties of fuels, including YSI. Their analysis indicated that the highest-ranked descriptor was nAromBond, which counts the number of aromatic bonds. This agrees well with the experimental observation that molecules with aromatic atoms can have significantly high YSI. However, the nAromBond descriptor cannot capture the structural effect of nonaromatic molecules. The second most important descriptor found by Comesana et al. was ABC, defined as the atom-bond connectivity index. The other four descriptors are related to the BCUT module, as shown in this work. The difference between the work of Comesana et al. and the present work lies mainly in the relative importance of the descriptors. This could be affected by the analysis method and also by the data set. Overall, comparing the work of Comesana et al. and Kessler et al. with the present work, it can probably be concluded that even though a large number of descriptors exist and can be computed, the key descriptors for YSI prediction should be similar, and they mostly belong to similar descriptor modules that describe the unsaturation and complexity of the molecules. Furthermore, the IP descriptor from the QM calculations also shows a small effect on the YSI prediction, indicating that QM descriptors could be important for developing more accurate ML models for YSI. Specifically, the SHAP values of the IP descriptor range from approximately −4.5 to 12.0. It demonstrates a larger negative effect on the model output, while its overall effect on the model output is smaller than that of the other descriptors.
Table 4. Meanings of the Important Descriptors from PFI and SHAP Analyses
| descriptor | meanings | type |
|---|---|---|
| C3SP2 | doubly bound carbon bound to three other carbons | PaDEL/mordred |
| R_TpiPCTPC | the ratio of total bond order to total path count: a larger value refers to the molecule with more cyclic subgroups or non-single-bond atoms | PaDEL |
| TPC | total path count (up to order 10) | PaDEL |
| maxHBa | maximum E-States for (strong) hydrogen bond acceptors | PaDEL |
| SaasC | sum of atom-type E-State: :C:- | PaDEL/mordred |
| piPC7 | conventional bond order ID number of order 7 | PaDEL/mordred |
| BertzCT | a topological index meant to quantify complexity of molecules: sum of two terms (the complexity of the bonding + the complexity of the distribution of heteroatoms) | mordred |
| AMID_C | averaged molecular ID on C atoms | mordred |
| BCUTv-1h | first highest eigenvalue of Burden matrix weighted by vdw volume | mordred |
| BCUTp-1h | first highest eigenvalue of Burden matrix weighted by polarizability | mordred |
| minaasC | minimum atom-type E-State: :C:- | PaDEL/mordred |
| MPC8 | molecular path count of order 8 | PaDEL/mordred |
| ETA_dBeta | a measure of relative unsaturation content | PaDEL/mordred |
| RNCG | relative negative charge: most negative charge/total negative charge | PaDEL/mordred |
| IP | ionization potential | QM |
The SHAP analysis figure shows the important descriptors that have a significant effect on the developed ML models. In general, the SHAP analysis results are in good accordance with the PFI analysis results. Table 5 lists the computed values of the important descriptors for several typical fuel molecules. As shown in Table 5, the descriptor values of 1,2-diphenylbenzene, the fuel molecule with the largest YSI value, differ significantly from those of the other molecules. Specifically, for the GB model with mordred descriptors, the computed values of BertzCT and AMID_C correlate well with the YSI trend of the five molecules, confirming the rationality of the PFI and SHAP analyses. In addition, the C3SP2 descriptor, which counts the doubly bound carbons bound to three other carbons and is one of the most important descriptors for the NN model based on PaDEL descriptors, takes its largest value for 1,2-diphenylbenzene. A more intuitive descriptor, ETA_dBeta, which represents the relative unsaturation of the fuel molecules, directly reflects the link between unsaturation and the YSI values. The analysis further confirms the interpretability of the ML models developed in this work. Literature studies revealed that the ANN model could be another choice for YSI prediction. However, this method is usually regarded as a “black-box” solution that lacks interpretability.
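To make the SHAP attribution concrete, the sketch below computes exact Shapley values for a single sample by enumerating all descriptor subsets. It is an illustration under stated assumptions rather than the procedure used in this work: practical SHAP explainers rely on model-specific or sampling-based approximations, and here "absent" descriptors are simply replaced by a single background reference row. The toy model and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one sample.

    Descriptors outside the coalition are set to the background value
    (a simplifying assumption; real explainers average over many rows).
    """
    n = len(x)

    def value(subset):
        row = [x[i] if i in subset else background[i] for i in range(n)]
        return predict(row)

    phi = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        total = 0.0
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(row):
    """Hypothetical model: one additive term plus one interaction term."""
    return 3.0 * row[0] + 2.0 * row[1] * row[2]


x = [1.0, 2.0, 3.0]
background = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, background)

# The attributions add up to the difference between the prediction for x
# and the prediction for the background row (the efficiency property).
print(phi, sum(phi), toy_model(x) - toy_model(background))
```

The additive term is attributed entirely to the first descriptor, while the interaction term is split equally between the two descriptors involved, which is why a SHAP value can be positive or negative for the same descriptor depending on the sample.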
Table 5. Computed Key Descriptors for Typical Fuel Molecules
| descriptor | n-dodecane (67.06) | 1-heptene (48.4) | 1,2-diphenylbenzene (1338.90) | benzene (100.30) | 1-octanol (41.1) |
|---|---|---|---|---|---|
| C3SP2 | 0.00 | 0.00 | 4.00 | 0.00 | 0.00 |
| R_TpiPCTPC | 1.00 | 1.21 | 18.18 | 3.46 | 1.00 |
| TPC | 77.00 | 28.00 | 464.00 | 36.00 | 45.00 |
| maxHBa | 0.00 | 0.00 | 0.00 | 0.00 | 8.42 |
| SaasC | 0.00 | 0.00 | 5.09 | 0.00 | 0.00 |
| piPC7 | 1.79 | 0.00 | 6.37 | 0.00 | 1.10 |
| BertzCT | 56.44 | 37.30 | 565.97 | 71.96 | 37.83 |
| AMID_C | 1.90 | 1.83 | 2.08 | 1.97 | 1.68 |
| BCUTv-1h | 20.78 | 20.82 | 20.93 | 20.88 | 20.77 |
| BCUTp-1h | 1.87 | 1.91 | 2.02 | 1.97 | 1.86 |
| minaasC | 0.00 | 0.00 | 1.26 | 0.00 | 0.00 |
| MPC8 | 4.00 | 0.00 | 56.00 | 0.00 | 1.00 |
| ETA_dBeta | –5.50 | –2.00 | 8.00 | 3.00 | –4.25 |
| RNCG | 0.10 | 0.23 | 0.07 | 0.17 | 0.52 |
| IP | 8.14 | 8.61 | 7.60 | 9.37 | 8.45 |
The numbers in parentheses after the names of the molecules are the corresponding YSI values.
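The statement that BertzCT and AMID_C follow the YSI trend of the five molecules can be checked with a rank correlation on the tabulated values. The sketch below uses Spearman's coefficient as one possible measure (an assumption; the text does not name a specific statistic), and the simple formula used is valid only because none of these columns contains tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for tie-free data."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Values copied from the table above, in the order: n-dodecane, 1-heptene,
# 1,2-diphenylbenzene, benzene, 1-octanol.
ysi = [67.06, 48.4, 1338.90, 100.30, 41.1]
bertz_ct = [56.44, 37.30, 565.97, 71.96, 37.83]
amid_c = [1.90, 1.83, 2.08, 1.97, 1.68]

rho_bertz = spearman(bertz_ct, ysi)
rho_amid = spearman(amid_c, ysi)
print(round(rho_bertz, 3), round(rho_amid, 3))  # 0.9 1.0
```

With only five molecules these coefficients are illustrative rather than statistically meaningful, but they agree with the qualitative trend described in the text.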
4. Conclusions
This work studies the effect of three different sets of descriptors on the accuracy of ML models developed for the prediction of YSI, a property of critical importance for the estimation of fuel combustion and for advanced fuel design. The widely used PaDEL and mordred descriptors, together with QM descriptors from semiempirical quantum chemistry calculations, are used to develop ML models for YSI. The ML models considered in this work are RF, NN, GB, MVL, and Deep-QSPR models via FNN. It is shown that the best ML model differs depending on the descriptor set. Cheminformatics descriptors (PaDEL, mordred) can be used to develop accurate ML models for YSI. A comparison of the error indicators of the ML models based on different algorithms and descriptor sets shows that the GB model using the PFI-filtered mordred descriptors exhibits the best performance among the considered models. The QM descriptors employed in this work can predict the YSI to some extent; however, the accuracy is lower than that obtained with the PaDEL and mordred descriptors. Combining PaDEL/mordred descriptors with QM descriptors does not significantly improve the performance of the GB, NN, and RF models; however, it slightly improves the performance of the Deep-QSPR model. Furthermore, the PFI and SHAP analyses indicate that some of the selected QM descriptors have some effect on the predicted YSI values. Thus, while the practical benefit of QM descriptors for the specific ML models developed in this work is limited, QM descriptors are expected to be important for capturing the electronic properties critical to the sooting tendency. In addition, QM descriptors from accurate calculations could improve interpretability compared with traditional molecular descriptors because of the distinct relationship between the molecules and the computed descriptor values.
However, more ML algorithms, together with other potential QM descriptors, remain to be tested and optimized. The present work not only provides valuable information for the selection of descriptors in developing ML models for fuel properties but also develops an efficient GB model for the accurate prediction of YSI.
Acknowledgments
This work was funded by the Joint Research Fund from the National Natural Science Foundation of China and Civil Aviation Administration of China (No. U2133215) and the National Natural Science Foundation of China (No. 12172335).
All the developed ML models are publicly available at https://github.com/quandewang/Descriptor_effect_on_YSI.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.5c08720.
Q.-D.W.: conceptualization, formal analysis, funding acquisition, supervision, writing – review and editing; L.D.: data curation, formal analysis, investigation; Q.Y.: data curation, formal analysis, validation; B.-Y.W.: methodology, investigation, writing – original draft; J.L.: data curation, formal analysis, investigation; P.Z.: data curation, validation; Z.-X.X.: conceptualization, funding acquisition.
The authors declare no competing financial interest.
References
- Pfefferle L. D., Kim S., Kumar S., McEnally C. S., Pérez-Soto R., Xiang Z., Xuan Y. Sooting tendencies: Combustion science for designing sustainable fuels with improved properties. Proc. Combust. Inst. 2024;40(1–4):105750. doi: 10.1016/j.proci.2024.105750.
- Barrientos E. J., Anderson J. E., Maricq M. M., Boehman A. L. Particulate matter indices using fuel smoke point for vehicle emissions with gasoline, ethanol blends, and butanol blends. Combust. Flame. 2016;167:308–319. doi: 10.1016/j.combustflame.2016.01.034.
- Calcote H. F., Manos D. M. Effect of molecular structure on incipient soot formation. Combust. Flame. 1983;49(1):289–304. doi: 10.1016/0010-2180(83)90172-4.
- McEnally C. S., Pfefferle L. D. Improved sooting tendency measurements for aromatic hydrocarbons and their implications for naphthalene formation pathways. Combust. Flame. 2007;148(4):210–222. doi: 10.1016/j.combustflame.2006.11.003.
- Barrientos E. J., Lapuerta M., Boehman A. L. Group additivity in soot formation for the example of C-5 oxygenated hydrocarbon fuels. Combust. Flame. 2013;160(8):1484–1498. doi: 10.1016/j.combustflame.2013.02.024.
- ASTM D1322-24, Standard Test Method for Smoke Point of Kerosene and Aviation Turbine Fuel; ASTM International, 2024.
- Ceriotti M., Clementi C., Anatole von Lilienfeld O. Introduction: Machine Learning at the Atomic Scale. Chem. Rev. 2021;121(16):9719–9721. doi: 10.1021/acs.chemrev.1c00598.
- Stergiou K., Ntakolia C., Varytis P., Koumoulos E., Karlsson P., Moustakidis S. Enhancing property prediction and process optimization in building materials through machine learning: A review. Comput. Mater. Sci. 2023;220:112031. doi: 10.1016/j.commatsci.2023.112031.
- Banerjee A., Roy K., Gramatica P. A bibliometric analysis of the Cheminformatics/QSAR literature (2000–2023) for predictive modeling in data science using the SCOPUS database. Molecular Diversity. 2025;29(4):3703–3715. doi: 10.1007/s11030-024-11056-8.
- Kuzhagaliyeva N., Horváth S., Williams J., Nicolle A., Sarathy S. M. Artificial intelligence-driven design of fuel mixtures. Commun. Chem. 2022;5(1):111. doi: 10.1038/s42004-022-00722-3.
- Sarathy S. M., Eraqi B. A. Artificial intelligence for novel fuel design. Proc. Combust. Inst. 2024;40(1–4):105630. doi: 10.1016/j.proci.2024.105630.
- Ihme M., Chung W. T., Mishra A. A. Combustion machine learning: Principles, progress and prospects. Prog. Energy Combust. Sci. 2022;91:101010. doi: 10.1016/j.pecs.2022.101010.
- Hosni Z., Chen X., Achour S., Saadi F. Predictive Modeling of Yield Sooting Index Using Machine Learning with Uncertainty Estimation. ACS Omega. 2025;10(24):25336–25349. doi: 10.1021/acsomega.5c00042.
- Li R., Herreros J. M., Tsolakis A., Yang W. Integrated machine learning-quantitative structure property relationship (ML-QSPR) and chemical kinetics for high throughput fuel screening toward internal combustion engine. Fuel. 2022;307:121908. doi: 10.1016/j.fuel.2021.121908.
- vom Lehn F., Brosius B., Broda R., Cai L., Pitsch H. Using machine learning with target-specific feature sets for structure-property relationship modeling of octane numbers and octane sensitivity. Fuel. 2020;281:118772. doi: 10.1016/j.fuel.2020.118772.
- Sheyyab M., Lynch P. T., Mayhew E. K., Brezinsky K. Optimized synthetic data and semi-supervised learning for Derived Cetane Number prediction. Combust. Flame. 2024;259:113184. doi: 10.1016/j.combustflame.2023.113184.
- Freitas R. S. M., Jiang X. Descriptors-based machine-learning prediction of cetane number using quantitative structure–property relationship. Energy and AI. 2024;17:100385. doi: 10.1016/j.egyai.2024.100385.
- Ahmed Qasem M. A., van Oudenhoven V. C. O., Pasha A. A., Pillai S. N., Reddy V. M., Ahmed U., Razzak S. A., Al-Mutairi E. M., Abdul Jameel A. G. A machine learning model for predicting threshold sooting index (TSI) of fuels containing alcohols and ethers. Fuel. 2022;322:123941. doi: 10.1016/j.fuel.2022.123941.
- Ahmed Qasem M. A., Al-Mutairi E. M., Abdul Jameel A. G. Smoke point prediction of oxygenated fuels using neural networks. Fuel. 2023;332:126026. doi: 10.1016/j.fuel.2022.126026.
- Alboqami F. D., Pasha A. A., Alam M. I., Abdulraheem A., Abdul Jameel A. G. Prediction of Yield Sooting Index Utilizing Artificial Neural Networks and Adaptive-Network-Based Fuzzy Inference Systems. Arab. J. Sci. Eng. 2023;48(7):8901–8909. doi: 10.1007/s13369-022-07561-3.
- Kessler T., St. John P. C., Zhu J., McEnally C. S., Pfefferle L. D., Mack J. H. A comparison of computational models for predicting yield sooting index. Proc. Combust. Inst. 2021;38(1):1385–1393. doi: 10.1016/j.proci.2020.07.009.
- Burns J. W., Green W. H. Generalizable, fast, and accurate DeepQSPR with fastprop. J. Cheminformatics. 2025;17(1):73. doi: 10.1186/s13321-025-01013-4.
- Yap C. W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011;32(7):1466–1474. doi: 10.1002/jcc.21707.
- Moriwaki H., Tian Y.-S., Kawashita N., Takagi T. Mordred: a molecular descriptor calculator. J. Cheminformatics. 2018;10:4. doi: 10.1186/s13321-018-0258-y.
- Grimme S., Bannwarth C., Shushkov P. A Robust and Accurate Tight-Binding Quantum Chemical Method for Structures, Vibrational Frequencies, and Noncovalent Interactions of Large Molecular Systems Parametrized for All spd-Block Elements (Z = 1–86). J. Chem. Theory Comput. 2017;13(5):1989–2009. doi: 10.1021/acs.jctc.7b00118.
- Guan Y., Coley C. W., Wu H., Ranasinghe D., Heid E., Struble T. J., Pattanaik L., Green W. H., Jensen K. F. Regio-selectivity prediction with a machine-learned reaction representation and on-the-fly quantum mechanical descriptors. Chem. Sci. 2021;12(6):2198–2208. doi: 10.1039/D0SC04823B.
- Li S.-C., Wu H., Menon A., Spiekermann K. A., Li Y.-P., Green W. H. When Do Quantum Mechanical Descriptors Help Graph Neural Networks to Predict Chemical Properties? J. Am. Chem. Soc. 2024;146(33):23103–23120. doi: 10.1021/jacs.4c04670.
- Singh S., Pareek M., Changotra A., Banerjee S., Bhaskararao B., Balamurugan P., Sunoj R. B. A unified machine-learning protocol for asymmetric catalysis as a proof of concept demonstration using asymmetric hydrogenation. Proc. Natl. Acad. Sci. U. S. A. 2020;117(3):1339–1345. doi: 10.1073/pnas.1916392117.
- Pracht P., Pillai Y., Kapil V., Csányi G., Gönnheimer N., Vondrák M., Margraf J. T., Wales D. J. Efficient Composite Infrared Spectroscopy: Combining the Double-Harmonic Approximation with Machine Learning Potentials. J. Chem. Theory Comput. 2024;20(24):10986–11004. doi: 10.1021/acs.jctc.4c01157.
- Chen D., Wang H. HOMO-LUMO energy splitting in polycyclic aromatic hydrocarbons and their derivatives. Proc. Combust. Inst. 2019;37(1):953–959. doi: 10.1016/j.proci.2018.06.120.
- Kateris N., Jayaraman A. S., Wang H. HOMO-LUMO gaps of large polycyclic aromatic hydrocarbons and their implication on the quantum confinement behavior of flame-formed carbon nanoparticles. Proc. Combust. Inst. 2023;39(1):1069–1077. doi: 10.1016/j.proci.2022.07.168.
- Alegre-Requena J. V., Sowndarya S. V. S., Pérez-Soto R., Alturaifi T. M., Paton R. S. AQME: Automated quantum mechanical environments for researchers and educators. WIREs Comput. Mol. Sci. 2023;13(5):e1663. doi: 10.1002/wcms.1663.
- Taunk K., De S., Verma S., Swetapadma A. A Brief Review of Nearest Neighbor Algorithm for Learning and Classification. In 2019 International Conference on Intelligent Computing and Control Systems (ICCS), 15–17 May 2019; pp 1255–1260.
- Jones D. R., Schonlau M., Welch W. J. Efficient Global Optimization of Expensive Black-Box Functions. J. Global Optim. 1998;13(4):455–492. doi: 10.1023/A:1008306431147.
- Dalmau D., Sigman M. S., Alegre-Requena J. V. Machine learning workflows beyond linear models in low-data regimes. Chem. Sci. 2025;16(19):8555–8560. doi: 10.1039/D5SC00996K.
- Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
- Dalmau D., Alegre-Requena J. V. ROBERT: Bridging the Gap Between Machine Learning and Chemistry. WIREs Comput. Mol. Sci. 2024;14(5):e1733. doi: 10.1002/wcms.1733.
- Bertz S. H. The first general index of molecular complexity. J. Am. Chem. Soc. 1981;103(12):3599–3601. doi: 10.1021/ja00402a071.
- Pearlman R. S., Smith K. M. Metric Validation and the Receptor-Relevant Subspace Concept. J. Chem. Inf. Comput. Sci. 1999;39(1):28–35. doi: 10.1021/ci980137x.
- Burden F. R. Molecular identification number for substructure searches. J. Chem. Inf. Comput. Sci. 1989;29(3):225–227. doi: 10.1021/ci00063a011.
- Comesana A. E., Huntington T. T., Scown C. D., Niemeyer K. E., Rapp V. H. A systematic method for selecting molecular descriptors as features when training models for predicting physiochemical properties. Fuel. 2022;321:123836. doi: 10.1016/j.fuel.2022.123836.