Abstract

This work proposes several machine learning models that predict B3LYP-D4/def2-TZVP outputs from HF-3c outputs for supramolecular structures. The data set consists of 1031 entries of dimer, trimer, and tetramer cyclic structures, containing molecules both with and without heteroatoms in the ring. Six quantum chemistry descriptors and features are calculated with both computational methods: Gibbs energy, electronic energy, entropy, enthalpy, dipole moment, and band gap. Statistical analysis shows a good correlation between the energy properties and a poor correlation only for the dipole moment. The machine learning models are separated into three groups: linear, tree-based, and neural networks. The best models for the prediction of density functional theory features are LASSO among the linear models, XGBoost among the tree-based models, and the single-layer perceptron among the neural networks, with the energy-related features predicted best and the dipole moment predicted worst.
Introduction
Quantum chemistry methods allow us to gain insights into several important molecular properties, such as the electronic structure, molecular geometry, HOMO–LUMO gap, vibrational frequencies, and thermodynamic properties. These properties allow us to predict the paths of chemical reactions and the thermodynamic stability of products. However, the clear downside of quantum chemistry methods is their computational cost; depending on the method and computational resources, calculation times range from several minutes to days. Therefore, speeding up such calculations has long been a major goal. Several approaches have been proposed to ease this enormous task, such as creating more efficient, less time-consuming composite methods,1 improving the computational efficiency of programs,2 and switching from CPU to GPU.3,4 However, a more popular and successful approach lies in the field of machine learning algorithms. In ref (5), the authors used a machine learning approach to predict density functional theory (DFT) electronic energies and Gibbs free energies from molecular mechanics data on bond lengths, angles, and dihedrals of molecules from the QM9 data set.6 They achieved an absolute error margin of 2.46 kcal/mol. In ref (7), the authors used semiempirical methods to predict electronic features of druglike molecules from the QMugs8 database, utilizing an equivariant graph neural network to build the prediction model. Although the models presented in these works show great performance in their respective tasks, they predominantly used single-molecule data sets, which hinders their ability to predict quantum chemistry features of supramolecular structures. Therefore, there is still a substantial gap in knowledge concerning supramolecular systems.
The importance of calculating the properties of supramolecular systems with the DFT approach lies in their vast application in a number of fields, from drug delivery9,10 to radical trap systems11−13 and DNA trapping and storage.14 Concerning the prediction of electronic properties of supramolecular systems, ref (15) suggested a machine learning algorithm based on a graph convolutional network (GCN) to convert B3LYP inputs into DLPNO–CCSD(T) and CCSD(T) outputs with absolute error margins of 0.78 and 0.50 kcal/mol, respectively. Also, at the DLPNO–CCSD(T) level of theory, the authors separately used a dimer data set containing 2000 original dimers, reaching an absolute error margin of 0.18 kcal/mol. Therefore, this work aims to utilize various ML approaches (linear, tree-based, and neural network models) to predict the electronic features calculated by the DFT method (B3LYP-D4) from electronic descriptors calculated by the semiempirical combined method (HF-3c) for dimer structures obtained from our previous works9,10,16 and for dimers constructed manually from cyclic monomer structures presented in the QM9 database.
Methods and Models
Database
The database of our project includes 1031 observations and 12 features, each of which represents one of six molecular descriptors and features (Gibbs energy, electronic energy, entropy, enthalpy, dipole moment, and HOMO–LUMO gap) calculated using two different quantum chemical approximations. The thorough data set representation is shown in Figure 1.
Figure 1.
Data set representation by different groups: (a) percentile distribution of the data set by type of heteroatom; (b) percentile distribution of the data set by number of molecules in the supramolecular system; and (c) percentile distribution of the data set by the size of the rings.
A total of 984 isolated molecules were selected from the QM9 database, from which dimers were generated manually. All of these dimers consist of nonconjugated cyclic and heterocyclic molecules with a neutral charge. Another 47 supramolecular systems represent different spatial variations of real melamine-barbiturate and cyanurate assemblies. Heterocyclic molecules in the database contain only two types of heteroatoms, oxygen and nitrogen, as depicted in Figure 1a. The database also contains nonheterocyclic molecules, which account for the majority of observations. The systems are two-, three-, and four-molecule supramolecular assemblies, with the absolute majority being two-molecule systems from the QM9 data set (Figure 1b). Structurally, they are 3-, 4-, 5-, 6-, and 7-membered rings and heterocycles located on parallel planes; there is no angular or stacking arrangement in the dimers. The percentile distribution of the data set by ring size is shown in Figure 1c. The most abundant are molecules with three-, four-, and five-membered rings, while seven-membered rings constitute only 1% of the whole data set. The three-dimensional structures of the supramolecular systems used in our database are available in the “.xyz” and “.sdf” file formats in the Supporting Information. The molecular descriptor values are presented in the “.csv” format.
Quantum Chemical Descriptors Calculations
The calculations were carried out using two quantum-chemical approximations: the B3LYP/def2-TZVP level of theory with D4 dispersion corrections and the RIJCOSX approximation17,18 was used to obtain reference values, relative to which the predictive accuracy of the machine learning algorithms and neural networks was assessed, while the composite HF-3c approximation was used to build the training data set. The RIJCOSX method combines the resolution of identity (RI) approximation for Coulomb integrals with the chain-of-spheres (COSX) numerical integration technique for Hartree–Fock exchange integrals. This approach significantly accelerates the computational process by reducing the complexity of evaluating these integrals, making it particularly advantageous for larger molecular systems. The RIJCOSX approximation is now commonly used as the default for hybrid DFT calculations in software like ORCA, providing a balance between computational speed and accuracy.17 HF-3c is a fast Hartree–Fock based method developed for the computation of structures, vibrational frequencies, and noncovalent interaction energies in huge molecular systems. The HF-3c method is part of a family of computational approaches designed to provide accurate results with reduced computational resources. It utilizes a specific double-ζ basis set (the so-called MINIX basis set) and incorporates corrections for London dispersion interactions, basis set superposition error, and a short-range correction for basis set deficiencies that occur when small or minimal basis sets are used, making it suitable for studying noncovalent interactions in molecular systems.
The HF-3c method is particularly known for its efficiency in providing reliable geometries and energies while being less demanding than traditional HF methods.19 The calculations were performed using the ORCA 5.0.4 software.20−22 For both methods, the convergence tolerances for the geometry optimization were energy change = 5.0 × 10–6 Eh, maximal gradient = 3.0 × 10–4 Eh/bohr, RMS gradient = 1.0 × 10–4 Eh/bohr, maximal displacement = 4.0 × 10–3 bohr, and RMS displacement = 2.0 × 10–3 bohr. The Hessian matrices were calculated for all optimized model structures to confirm that correct stationary points were located on the potential energy surfaces (no imaginary frequencies were found in any case) and to estimate the thermodynamic properties (viz., enthalpy, entropy, and Gibbs free energy) for all model systems at 298.15 K and 1 atm. Enthalpy is calculated in ORCA using eq 1.
H = U + kBT (1)
where kB is Boltzmann’s constant, T is the temperature, and U is the inner energy, which is calculated by eq 2.
U = Eel + EZPE + Evib + Erot + Etrans (2)
where Eel is the total energy from the electronic structure calculation, EZPE is the zero temperature vibrational energy from the frequency calculation, Evib is the finite temperature correction to EZPE due to population of excited vibrational states, Erot is the rotational thermal energy, and Etrans is the translational thermal energy. Eel is calculated using eq 3.
Eel = Ekin–el + Enuc–el + Eel–el + Enuc–nuc (3)
where Ekin–el is the kinetic energy of electrons, Enuc–el is the potential energy of nucleus–electron interaction, Eel–el is the potential energy of electron–electron interaction, and Enuc–nuc is the potential energy of nucleus–nucleus interaction.
The entropy term is calculated by eq 4.
S = Sel + Svib + Srot + Strans (4)
where Sel is electronic entropy, Svib is vibrational entropy, Srot is rotational entropy, and Strans is translational entropy.
Finally, the Gibbs free energy is calculated using eq 5:

G = H – TS (5)
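As a compact illustration, eqs 1–5 can be chained in a few lines. All numeric inputs below are made-up placeholders, not values from this work:

```python
# Sketch of the thermochemistry assembled by eqs 1-5. All numeric inputs
# are invented placeholders (in hartree, Eh), not values from the data set.
kB = 3.166811563e-6  # Boltzmann constant, Eh/K
T = 298.15           # temperature, K

# eq 3: electronic energy from its components (placeholders)
E_kin_el, E_nuc_el, E_el_el, E_nuc_nuc = 750.0, -2200.0, 600.0, 100.0
E_el = E_kin_el + E_nuc_el + E_el_el + E_nuc_nuc

# eq 2: inner energy U = Eel + EZPE + Evib + Erot + Etrans (placeholders)
E_ZPE, E_vib, E_rot, E_trans = 0.15, 0.002, 0.0014, 0.0014
U = E_el + E_ZPE + E_vib + E_rot + E_trans

# eq 1: enthalpy
H = U + kB * T

# eq 4: total entropy S = Sel + Svib + Srot + Strans (placeholders, Eh/K)
S = 1e-5 + 2e-5 + 1.5e-5 + 2.5e-5

# eq 5: Gibbs free energy
G = H - T * S
print(H, G)
```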
Machine Learning
Descriptors and Features for Molecular Systems
In this work, a standard machine learning approach was used: descriptors calculated using low-level approximation were fed to train models; meanwhile, results obtained via time-consuming and precise calculations were used as target values (features). Different models used in the ΔML technique, such as the connectivity-based hierarchy scheme and DeepDelta, have shown promising results in correcting deficiencies in DFT calculations, achieving coupled cluster accuracy and predicting property differences between molecules with high fidelity and transparency.23−28 Although there are many ML studies on improving DFT calculations to match CC accuracy, HF to DFT predictions remain underexplored and encourage further investigations. Scikit-learn,29 TensorFlow,30 PyTorch,31 and XGBoost32 Python libraries were used to build the skeleton of the models. Besides the conventional molecular descriptors, such as the electronic energy, dipole moment, etc., three-dimensional coordinates of systems were fed for training to enhance the reliability. For 9 out of 10 models, xyz values were passed as NumPy arrays. Each molecular structure was presented as a single vector, therefore making them compatible with traditional regression algorithms. To standardize the input dimensions, the coordinate vectors were padded with zeros to match the length of the longest coordinate array in the data set. This padding was implemented to ensure that each feature vector representing the atomic coordinates is of consistent length. Once standardized, vectors were expanded into individual feature columns within the data set. This approach enabled our models to effectively learn patterns from the spatial arrangement of atoms along with other molecular descriptors. The GCN was the only model that distinctly used coordinates. In chemistry, molecules are intricate networks of atoms linked by bonds, where the properties of the whole are shaped by the interactions among its parts. 
Similarly, GNNs are designed to capture complex relationships within data by modeling them as graphs, where nodes (analogous to atoms) interact through edges (akin to chemical bonds).15,33 Each atom in the molecule brings with it a set of unique characteristics or features that define its behavior and interactions. These atomic features, derived from the RDKit34 library, include atomic number, explicit valence, formal charge, aromaticity, and atomic mass. While atomic features provide a local view of each atom, understanding the molecule as a whole requires a global perspective. This is where Morgan fingerprints came into play. They represent the molecule’s structure as a binary vector, capturing the presence or absence of specific substructures and patterns that define the molecule’s identity.35,36 Unlike atomic features, which differ from atom to atom, the molecular fingerprint is consistent across all nodes within the same molecule. Overall, the node features were as follows: molecular descriptors obtained by a fast but inaccurate method, atomic features, and Morgan fingerprints. Except for the GCN, for all models, the three-dimensional coordinates of systems were extracted from “.xyz”-files. Since the data on the bonding of atoms were necessary to properly generate the atomic features, the “connectivity” information present in “.sdf”-files helped us to mitigate issues while converting the data set into the graph representation. Before being fed for training, data underwent classical preprocessing steps, such as normalization and dimensionality adjustments. All models in this study were designed to handle multitask learning (MTL), meaning that they were predicting multiple molecular properties simultaneously. 
Since the targets are interrelated and influenced by similar molecular structures and interactions, MTL enhances the models’ ability to capture shared patterns and correlations among these properties, which results in improved accuracy and generalization.37,38
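The zero-padding of the coordinate vectors described above can be sketched as follows; the coordinates are invented for illustration:

```python
import numpy as np

# Minimal sketch of the coordinate preprocessing described above: each
# structure's xyz array is flattened into a single vector and padded with
# zeros to the length of the longest one, so every feature vector has a
# consistent dimension. The coordinates here are invented.
coords = [
    np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, 1.1]]),                 # 2-atom structure
    np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]),                 # 3-atom structure
]

flat = [c.ravel() for c in coords]               # one vector per structure
max_len = max(len(v) for v in flat)              # longest coordinate array
X_xyz = np.vstack([np.pad(v, (0, max_len - len(v))) for v in flat])

print(X_xyz.shape)  # (2, 9): consistent-length feature columns
```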
Data Set Train-Test Division and Hyperparameters
Ten percent of the data were allocated to the test set, resulting in 104 instances. This proved a good balance between having sufficient data for training and ensuring a reliable evaluation of model performance. Every model (except linear regression (LR)) was optimized with the Optuna hyperparameter framework39 using 3-fold cross-validation over a 1000-trial optimization process. Optuna outperforms classical GridSearchCV primarily due to its more efficient and adaptive search strategy: unlike GridSearchCV, which exhaustively evaluates all combinations of hyperparameters in a predefined grid, Optuna employs a probabilistic approach based on techniques such as Bayesian optimization. As a result, Optuna often converges to optimal hyperparameters more quickly and effectively, achieving better performance with far fewer evaluations than the exhaustive search of GridSearchCV.40−42
The study explored three different types of algorithms: linear models, tree-based methods, and neural networks. It began with less complex models, some with few or no hyperparameters, and then gradually moved toward more sophisticated techniques.
Linear Models
Four linear models were tested, starting with LR as a baseline and progressing to more versatile models like Ridge, Lasso, and Elastic Net regressions. Elastic Net regression is reported to generally outperform simple linear models since it includes both L1 and L2 penalties.43−45 The inclusion of regularization was the key focus, aiming to analyze its impact on model performance, particularly in enhancing predictive accuracy and preventing overfitting.
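The four models can be instantiated side by side; the toy data below stand in for the HF-3c descriptors, and the alpha values are illustrative, not the tuned ones reported later:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Sketch of the four linear models compared here. ElasticNet's l1_ratio
# blends the L1 (Lasso) and L2 (Ridge) penalties: l1_ratio=1 recovers
# Lasso, l1_ratio=0 pure Ridge. Alphas are placeholders, not tuned values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

models = {
    "LR": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    print(name, model.fit(X, y).score(X, y))  # in-sample R^2
```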
Tree-Based Models
Decision tree (DT) serves as an ideal transition from simple linear models to more complex ones. As a baseline for tree models, it offers simplicity, focusing on understanding the impact of individual features through hyperparameters like “max_depth”, “min_samples_split”, and “min_samples_leaf”. These parameters control the tree’s growth and complexity, helping to prevent overfitting during the training process. Moving to a random forest (RF), which builds upon the DT by constructing an ensemble of trees, the model becomes more robust and less prone to overfitting.46 In contrast to the DT, auxiliary hyperparameters such as “max_features” and “n_estimators” were fine-tuned to enhance the model performance. Finally, XGBoost represents the most advanced tree-based method, widely regarded as one of the most powerful algorithms in this category.47 Unlike the DT and RF, XGBoost introduces the “learning_rate” hyperparameter, which controls the contribution of each tree in the ensemble. This gradual learning process, along with the additional “subsample” and “colsample_bytree” hyperparameters, allows XGBoost to achieve higher accuracy by improving generalization.
Neural Network Models
Following the exploration of linear and tree-based models, the study extended into neural networks, beginning with the single-layer perceptron (SLP) and advancing to more complex architectures like the multilayer perceptron (MLP) and GCN. Early stopping was employed across all neural network models to monitor validation performance during training, preventing overfitting by halting training when the performance began to degrade. This technique is crucial in NN models to preserve the computing time and improve the model performance.48,49 The SLP served as a basic neural network model, utilizing a single hidden layer with adjustable hyperparameters such as the number of neurons (“num_neurons”), activation function, and optimizer. Regularization (l1_reg and l2_reg) was also tuned to maintain a balance between model flexibility and overfitting. The transition to MLP introduced multiple hidden layers (“num_hidden_layers”), allowing for deeper representation learning and the capture of more intricate patterns within the data. Despite the added complexity, the optimization strategy remained consistent, focusing on the same key hyperparameters as those in the SLP, with additional attention to the number of layers. Both models were constructed by using the TensorFlow library. The GCN represents the most advanced neural network model in this study. Its architecture is distinguished by multiple graph convolutional layers (“num_layers”) that enable the model to capture and propagate information across the connected nodes within the graph. The GCN model was built using the PyTorch library, and its hidden dimensions (“hidden_dim”) were optimized alongside other hyperparameters such as “dropout_rate”, “weight_decay”, and “learning_rate” to enhance its ability to generalize complex patterns. 
Unlike the SLP and MLP, which primarily focus on individual data points, the GCN is uniquely equipped to leverage the inherent connectivity of the data, making it highly effective for tasks involving relational learning.50−52 Performance metrics and results for the models are detailed in the “Results and Discussion” section, which includes an analysis and comparison of their effectiveness. Insights into the overall contributions to improving molecular property predictions are provided there.
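Since the SLP and MLP in this study were built with TensorFlow, the following scikit-learn MLPRegressor is only a lightweight stand-in illustrating the same ingredients: a single hidden layer (the SLP case), L2 regularization, and early stopping on a validation split. The layer size and alpha are illustrative, not tuned values.

```python
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

# Toy data replace the real descriptor matrix.
X, y = make_regression(n_samples=500, n_features=12, noise=0.1, random_state=0)

slp = MLPRegressor(
    hidden_layer_sizes=(64,),      # single hidden layer -> SLP analogue
    activation="relu",
    solver="adam",
    alpha=1e-4,                    # L2 penalty
    early_stopping=True,           # hold out part of the data as validation
    validation_fraction=0.1,
    n_iter_no_change=10,           # stop when validation stops improving
    max_iter=2000,
    random_state=0,
)
slp.fit(X, y)
print(slp.n_iter_)  # iterations actually run before (early) stopping
```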
Results and Discussion
Quantum Chemical Descriptors Calculations
Exploratory data analysis was performed for the calculated training and target descriptors. In particular, the following statistics were calculated: confidence interval for the target data set, distribution histograms, correlation matrix of the training and target data set, histograms of the distribution of values for all features, box plots for all features, Pearson coefficients, and the number of outliers in each feature.
Correlation Matrix
The correlation matrix (Figure 2) provides a visual representation of the linear relationship between different variables in a data set.53−56 In this case, we analyze the correlations between the features calculated at the Hartree–Fock (HF) level and the target values obtained using DFT.
Figure 2.

Correlation matrix for quantum chemical descriptors.
Gibbs free energy, electronic energy, and enthalpy show very high positive correlations among themselves and with the corresponding DFT targets. This is expected, since these quantities are closely related by thermodynamic relations. Entropy has a strongly negative correlation with the energy features, which is also consistent with thermodynamic laws. The dipole moment correlates weakly with all features other than itself, which may indicate that it is a more independent feature, less related to the energy of the molecule. The HF gap shows a moderate positive correlation with the energy features and a high positive correlation with the DFT gap; the energy gap is thus related to the overall energy of the system, but this relationship is not as strong as for the other energy features. Overall, the high correlation between HF features and DFT targets indicates that HF calculations provide sufficiently accurate inputs for predicting the energy features of molecules, while the dipole moment correlates weakly with all other features.
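A matrix like the one in Figure 2 is simply a table of pairwise Pearson coefficients; the synthetic columns below merely mimic the qualitative pattern described above (tightly coupled energies, a nearly independent dipole moment):

```python
import numpy as np
import pandas as pd

# Sketch of how a correlation matrix is built with pandas' .corr().
# The columns are synthetic and only imitate the qualitative pattern.
rng = np.random.default_rng(0)
n = 200
gibbs_hf = rng.normal(-20000, 200, n)
df = pd.DataFrame({
    "Gibbs (HF)": gibbs_hf,
    "Enthalpy (HF)": gibbs_hf + rng.normal(0, 1, n),  # strongly correlated
    "Dipole (HF)": rng.normal(3, 0.5, n),             # nearly independent
})
corr = df.corr(method="pearson")
print(corr.round(2))
```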
Confidence Interval
Confidence intervals cover the expected value of the corresponding quantity with a probability of 95%.57 Table 1 shows confidence intervals for all target features calculated using the DFT method.
Table 1. Confidence Intervals Cover the Expected Value of the Corresponding Quantity with a Probability of 95%.
| feature | 95% CI lower | 95% CI upper | 95% CI width |
|---|---|---|---|
| Gibbs energy, eV | –20513.23 | –19809.36 | 703.87 |
| electronic energy, eV | –20521.47 | –19817.56 | 703.91 |
| entropy, eV | 1.81 | 1.83 | 0.02 |
| enthalpy, eV | –20511.39 | –19807.55 | 703.85 |
| dipole moment, D | 2.91 | 3.16 | 0.25 |
| band gap, eV | 6.67 | 6.78 | 0.11 |
The width of the confidence interval varies between properties; wider intervals indicate greater uncertainty in the estimate of the mean. The energy properties (Gibbs free energy, electronic energy, and enthalpy) have the widest intervals in absolute terms, reflecting the large spread of total energies across systems of different sizes. The confidence interval for entropy has the smallest width, indicating relatively small variability of the entropy values in the target data set. The confidence interval for the dipole moment has a medium width, indicating moderate variability in the dipole moment values, while the confidence interval for the band gap is also narrow, indicating a high accuracy in estimating the mean band gap.
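A 95% confidence interval of this kind is computed from the sample mean and the standard error; the sample below is synthetic, loosely shaped like the dipole-moment column, not the real data:

```python
import numpy as np
from scipy import stats

# Sketch of the 95% confidence interval for a feature mean, as in Table 1.
# The sample is a synthetic placeholder (in debye), not the real data set.
rng = np.random.default_rng(0)
dipole = rng.normal(3.0, 1.3, 1031)

mean = dipole.mean()
sem = stats.sem(dipole)  # standard error of the mean
lower, upper = stats.t.interval(0.95, df=len(dipole) - 1, loc=mean, scale=sem)
print(f"95% CI: [{lower:.2f}, {upper:.2f}], width {upper - lower:.2f}")
```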
Distribution Histograms for All Features
Distribution histograms are used for visual analysis of the data distribution, detection of outliers, and assessment of the homogeneity of features.58 Figures S1 and S2 show distribution histograms for the DFT- and HF-calculated features, respectively. Key findings from the distribution histograms: (1) skewness: most distributions exhibit left-sided skewness, especially for the energy characteristics, meaning that most values are concentrated on the right side of the graph and the tail of the distribution extends to the left; (2) outliers: outliers are observed in some distributions, especially for the energy gap; these can be either abnormal values or extreme cases that may carry important information; and (3) multimodality: some distributions have multiple peaks, which may indicate the presence of several subgroups of data with different characteristics.
Box and Whisker Plots for All Features
Box and whisker plots are used to visually represent the distribution of data and identify the median, quartiles, range, and potential outliers, allowing for easy comparison of different data sets and identification of anomalies.59 Box and whisker plots for the DFT and HF features are shown in Figures S3 and S4, respectively. Key takeaways from the box and whisker plots: (1) presence of outliers: all features have outliers, especially those with a wider range of values, indicating that some observations differ significantly from the rest of the data; (2) variation in range: the range of values clearly differs between features, indicating different variability; and (3) skewness: some distributions are skewed to one side, so the median does not coincide with the middle of the box.
Number of Outliers in Each Feature
Outliers are data whose values deviate significantly from the rest of the observations in the data set. They can distort model results, lead to inaccurate predictions, and reduce the generalization ability of models.60 The number of outliers is shown in Table 2.
Table 2. Number of Outliers in the Target and Training Datasets.
| data set | Gibbs energy | electronic energy | entropy | enthalpy | dipole moment | band gap |
|---|---|---|---|---|---|---|
| train | 40 | 40 | 37 | 40 | 43 | 12 |
| target | 40 | 40 | 41 | 40 | 19 | 0 |
The number of outliers varies depending on the level of theory (e.g., for dipole moment). This may indicate different levels of noise or variability for different features. However, outliers were not removed due to the limited number of systems in the real data set.
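The paper does not state the outlier criterion explicitly, so as an assumption the sketch below uses the common 1.5 × IQR rule for counting outliers per feature; the sample values are synthetic:

```python
import numpy as np

# Assumed 1.5*IQR rule for counting outliers in one feature column.
def count_outliers(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((x < lower) | (x > upper)))

# Synthetic feature: mostly normal values plus three injected extremes.
rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(0, 1, 1000), [8.0, -9.0, 12.0]])
print(count_outliers(feature))
```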
Pearson Coefficients
The Pearson correlation coefficient is used to measure the degree of linear relationship between two variables; it shows how strongly and in what direction the two variables are related.61 The p-values of the Pearson test for the linear correlation between the HF-3c descriptors (columns) and the DFT features (rows) are presented in Table 3.
Table 3. p-Values of the Pearson Correlations between HF-3c Descriptors and DFT Features.
| DFT/HF | Gibbs energy | electronic energy | entropy | enthalpy | dipole moment | band gap |
|---|---|---|---|---|---|---|
| Gibbs energy | 0 | 0 | 0 | 0 | 0.938608 | 0 |
| electronic energy | 0 | 0 | 0 | 0 | 0.940995 | 0 |
| entropy | 0 | 0 | 0 | 0 | 0.000111 | 0 |
| enthalpy | 0 | 0 | 0 | 0 | 0.938508 | 0 |
| dipole moment | 0.121795 | 0.121232 | 0 | 0.12183 | 0 | 0 |
| band gap | 0 | 0 | 0 | 0 | 0 | 0 |
All p-values between the energy properties (Gibbs free energy, electronic energy, and enthalpy) and their corresponding target values are zero to the reported precision, indicating that the correlations between these quantities are highly statistically significant. The correlation between the dipole moment and the energy properties is statistically insignificant (p > 0.05), indicating a weak relationship between the dipole moment and the energetic properties of the molecules. For the band gap, the p-values are zero, indicating a highly significant correlation between the quantities.
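The r and p values behind Table 3 come from the standard Pearson test; with synthetic data mimicking the pattern above, a strong linear relation yields p ≈ 0 while an unrelated pair does not:

```python
import numpy as np
from scipy.stats import pearsonr

# Sketch of the Pearson r / p-value computation; data are synthetic and
# only imitate the pattern in Table 3.
rng = np.random.default_rng(0)
hf_energy = rng.normal(-20000, 200, 500)
dft_energy = hf_energy * 1.01 + rng.normal(0, 5, 500)  # tightly related
dipole = rng.normal(3, 0.5, 500)                       # unrelated

r_e, p_e = pearsonr(hf_energy, dft_energy)
r_d, p_d = pearsonr(hf_energy, dipole)
print(f"energy: r={r_e:.4f}, p={p_e:.2e}")
print(f"dipole: r={r_d:.4f}, p={p_d:.2f}")
```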
Machine Learning
For the prediction of quantum chemistry features, we used four linear models: LR, LASSO regression (LASSO), Ridge regression (Ridge), and Elastic Net regression (Elastic Net); three tree-based models: DT, RF and Extreme Gradient Boosting (XGBoost); three neural network models: SLP, MLP, and GCN. MAE metrics for all models and all features are shown in Figure 3.
Figure 3.
MAE metrics for all models and all features: (a) MAE for Gibbs energy; (b) MAE for electronic energy; (c) MAE for entropy; (d) MAE for enthalpy; (e) MAE for dipole moment; and (f) MAE for HOMO–LUMO band gap. Downside arrows denote the best models in their respective classes.
Linear Models
Linear Regression
Starting with an LR model is a good idea due to its simplicity, interpretability, and effectiveness in predicting outcomes when the relationship between the features and the target variable is approximately linear. LR assumes an approximately linear relationship between the predictors X and the quantitative response Y. Figure 4 shows an “Actual vs Predicted” scatter plot for the LR test results.
Figure 4.
Test results for the LR model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
Gibbs energy, electronic energy, and enthalpy are fitted almost perfectly, with an R2 of 1.0000, low MSE, RMSE, and MAE, and a MAPE of 0.00%. Entropy shows very high accuracy with an R2 of 0.9502 and extremely low error metrics. The band gap has a moderate fit with an R2 of 0.6855 but still maintains low error metrics. The dipole moment has a weak fit with an R2 of 0.3330 and higher error metrics, suggesting that the model may need improvement for predicting this property. The very high results may stem from the simplicity and overfitting of the LR (models using Lasso, Ridge, and Elastic Net regression could help by introducing regularization to prevent overfitting and improve generalization).
Lasso Regression
Lasso regression surpasses LR due to its ability to handle collinearity and obtain sparse solutions, enhancing computational efficiency in artificial intelligence applications. By incorporating an L1 regularization term, Lasso regression overcomes collinearity issues and matrix irreversibility, improving solution sparsity and computational efficiency. Figure 5 shows an “Actual vs Predicted” scatter plot for the Lasso regression test results.
Figure 5.
Test results for the LASSO regression model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
The LASSO model proves to be slightly less accurate for Gibbs energy, electronic energy, and enthalpy compared to the LR model while improving accuracy for entropy and band gap predictions. This model also shows significant improvements for dipole moment in terms of fit and errors, although the relative error (MAPE) increased slightly. The Lasso regression model is particularly beneficial for entropy, band gap, and dipole moment where it provides better fits and reduced errors while maintaining high accuracy for the other metrics. After optimization, the corresponding alpha regularization hyperparameter is equal to 0.0203 (range from 1 × 10–3 to 1000).
Ridge Regression
Ridge regression is considered to be better than Lasso regression and simple LR due to its ability to handle multicollinearity effectively, provide more stable solutions, and prevent overfitting by introducing a regularization term that shrinks the coefficients. Figure 6 shows an “Actual vs Predicted” scatter plot for Ridge regression test results.
Figure 6.
Test results for the Ridge regression model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
The Ridge regression model demonstrates lower accuracy for Gibbs energy, electronic energy, and enthalpy compared to both the baseline and Lasso regression models, with significantly higher errors. It shows performance comparable to the baseline for entropy but is less accurate than Lasso regression, while also demonstrating improved fit and error metrics for the dipole moment compared to the baseline, though slightly less accurate than Lasso regression in relative terms. The performance for the band gap is similar to that of both the baseline and Lasso regression models. Overall, Ridge regression shows high performance in some areas but significantly higher errors in others, indicating that it may not be the best choice for all metrics. After optimization, the corresponding alpha regularization hyperparameter is equal to 34.5258 (range from 1 × 10–3 to 1000).
Elastic Net Regression
Elastic Net regression combines the strengths of both Ridge and Lasso regression methods, offering advantages over each of them as well as over traditional LR. By incorporating both L1 (LASSO) and L2 (Ridge) penalties, Elastic Net mitigates the limitations inherent in each method when used independently. It effectively handles multicollinearity by maintaining the regularization advances of Ridge, while also performing variable selection and sparsity like Lasso, making it particularly useful when dealing with data sets where predictors are highly correlated or when there are more predictors than observations. This dual regularization framework allows Elastic Net to provide a more balanced and robust approach to model fitting, improving prediction accuracy and model interpretability compared to using Ridge, Lasso, or LR alone. Figure 7 shows an “Actual vs Predicted” scatter plot for Elastic Net regression test results.
Figure 7.
Test results for the Elastic Net regression model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
The Elastic Net model demonstrates high accuracy for Gibbs energy, electronic energy, and enthalpy, similar to Lasso regression but with higher error metrics than the baseline, while also showing excellent performance for entropy, similar to Lasso regression. The improved fit and error metrics for dipole moment and band gap are comparable to the baseline and Ridge regression, with performance similar to Lasso regression. After optimization, the corresponding best hyperparameters are equal to “alpha” = 0.0202 (range from 1 × 10–3 to 1000) and “l1_ratio” = 0.9998 (range from 0 to 1).
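A minimal Elastic Net sketch with the reported hyperparameters (scikit-learn; the synthetic descriptors and target are illustrative only — note that l1_ratio ≈ 1 makes the penalty almost pure L1, i.e., Lasso-like):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(1031, 6))            # placeholder HF-3c descriptors
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=1031)

Xs = StandardScaler().fit_transform(X)
# alpha and l1_ratio are the optimized values reported in the text;
# l1_ratio = 0.9998 means the model behaves almost identically to Lasso.
enet = ElasticNet(alpha=0.0202, l1_ratio=0.9998).fit(Xs, y)
r2 = enet.score(Xs, y)
```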
Tree-Based Models
DT Model
DT is often considered a good next step after mastering simple linear models due to several key advantages: the ability to capture nonlinear relationships, built-in feature importance, and robustness to outliers. Figure 8 shows an “Actual vs Predicted” scatter plot for DT test results.
Figure 8.
Test results for the DT model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
The DT model demonstrates lower overall accuracy for Gibbs energy, electronic energy, and enthalpy than the linear models, with significantly higher errors, and worse performance for entropy, with higher errors and a lower fit. It shows a weak fit and higher errors for dipole moment compared to Lasso and Elastic Net regression but better relative accuracy (MAPE). However, an improved performance for band gap, with a better fit and lower error metrics, is achieved compared with the linear models. One should also mention that the DT model tends to produce discretized, classification-like predictions for the entropy, dipole moment, and band gap features. After optimization, the corresponding best hyperparameters are equal to “max_depth” = 6 (range from 3 to 15); “min_samples_split” = 7 (range from 2 to 10); “min_samples_leaf” = 7 (range from 1 to 10); and “max_features” = 1.0.
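The classification-like behavior follows directly from the tree structure: a depth-limited regression tree can only output one constant per leaf. A sketch with the reported hyperparameters on a synthetic nonlinear target (illustrative data only):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(1031, 6))            # placeholder HF-3c descriptors
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=1031)  # toy nonlinear target

# Hyperparameters are the optimized values reported in the text.
tree = DecisionTreeRegressor(
    max_depth=6, min_samples_split=7, min_samples_leaf=7,
    max_features=1.0, random_state=0,
).fit(X, y)

# A depth-6 tree has at most 2**6 = 64 leaves, so it can predict at most
# 64 distinct values — hence the step-like look of its scatter plots.
n_distinct = len(np.unique(tree.predict(X)))
```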
RF Model
RF outperforms DT due to its ability to combine multiple decision trees, resulting in higher accuracy in classification and regression tasks. DT is prone to overfitting, especially with complex data sets; RF mitigates this by creating multiple trees and averaging their predictions, reducing the variance. There are further advantages: due to its ensemble nature, RF often achieves higher accuracy than a single DT, and RF can handle missing values effectively, making it more robust. Figure 9 shows an “Actual vs Predicted” scatter plot for RF test results.
Figure 9.
Test results for the RF model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
The RF model demonstrates very good performance for Gibbs energy, electronic energy, and enthalpy, with error metrics lower than those of DT but higher than those of Lasso and Elastic Net regression. It also shows good performance for entropy, comparable to Lasso and Elastic Net regression but slightly less precise. Improved performance for dipole moment compared to DT regression is observed with moderate fit and better relative accuracy. This model also shows a strong performance for band gap with good fit and low error metrics. After optimization, the corresponding best hyperparameters are equal to “n_estimators” = 417 (range from 50 to 1000); “max_depth” = 12 (range from 3 to 15); “min_samples_split” = 3 (range from 2 to 10); “min_samples_leaf” = 1 (range from 1 to 10); and “max_features” = 0.8.
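A sketch of the RF regressor with the reported hyperparameters (scikit-learn; the synthetic data are a placeholder for the HF-3c descriptor set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(1031, 6))            # placeholder HF-3c descriptors
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=1031)

# Hyperparameters are the optimized values reported in the text;
# max_features=0.8 means each split considers 80% of the features.
forest = RandomForestRegressor(
    n_estimators=417, max_depth=12, min_samples_split=3,
    min_samples_leaf=1, max_features=0.8, random_state=0, n_jobs=-1,
).fit(X, y)
r2 = forest.score(X, y)
```

Averaging over 417 bootstrapped trees smooths the step-like single-tree predictions, which is consistent with the improved dipole-moment fit noted above.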
Extreme Gradient Boosting
XGBoost regression outperforms RF regression and DT regression due to its superior accuracy and efficiency in handling complex data. Figure 10 shows an “Actual vs Predicted” scatter plot for XGBoost test results.
Figure 10.
Test results for the XGBoost model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
XGBoost demonstrates very good performance for Gibbs energy, electronic energy, and enthalpy, with lower error metrics compared to DT and RF but higher than Lasso, Ridge, and Elastic Net regression. The model shows good performance for entropy, comparable to Lasso and Elastic Net regression but with a slightly lower fit. Moderate performance for dipole moment with lower errors compared to DT regression but higher than Lasso and Elastic Net regression is observed, with slightly lower relative accuracy compared to RF. Moreover, the model demonstrates strong performance for band gap with a good fit and low error metrics, similar to RF and better than DT regression.
After optimization, the corresponding best hyperparameters are equal to “n_estimators” = 841 (range from 50 to 1000); “max_depth” = 9 (range from 3 to 15); “learning_rate” = 0.0087 (range from 1 × 10–5 to 1 × 10–2); “subsample” = 0.6243 (range from 0.5 to 1); “colsample_bytree” = 0.8243 (range from 0.5 to 1); and “min_child_weight” = 1 (range from 1 to 10).
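For illustration, the reported configuration can be approximated with scikit-learn's gradient boosting as a stand-in for the XGBoost library (subsample maps to “subsample”, max_features to “colsample_bytree”; “min_child_weight” has no exact analog here). The data are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(1031, 6))            # placeholder HF-3c descriptors
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=1031)

# Reported XGBoost values mapped onto scikit-learn's gradient boosting:
# many shallow-shrinkage steps (learning_rate = 0.0087 over 841 rounds)
# trade per-round fit for ensemble robustness.
gbm = GradientBoostingRegressor(
    n_estimators=841, max_depth=9, learning_rate=0.0087,
    subsample=0.6243, max_features=0.8243, random_state=0,
).fit(X, y)
r2 = gbm.score(X, y)
```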
Neural Network Models
Single- and Multilayer Perceptron
SLP is the simplest neural network model of the three presented in this work, while MLP differs from it only in having additional hidden layers. Test results for the SLP model are shown in Figure 11. Test results for the MLP model are shown in Figure 12.
Figure 11.
Test results for the SLP model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
Figure 12.
Test results for the MLP model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
With regard to Gibbs energy, electronic energy, and enthalpy, SLP and MLP outperform all tree-based models except XGBoost. Meanwhile, the entropy prediction accuracy of the SLP and MLP models is close to that of the XGBoost model. However, for the remaining features, SLP and MLP show a prediction accuracy lower than that of most tree-based models. Both the SLP and MLP models failed to predict the dipole moment. The best hyperparameters for SLP are “optimizer” = “RMSprop”; “activation_function” = “relu”; “learning_rate” = 8.4235 × 10–5 (range from 1 × 10–5 to 1 × 10–2); “neurons” = 100 (range from 5 to 200); “l1_reg” = 0.0005 (range from 1 × 10–8 to 100); and “l2_reg” = 9.7312 × 10–6 (range from 1 × 10–8 to 100). The best hyperparameters for MLP are “optimizer” = “RMSprop”; “activation_function” = “leaky_relu”; “learning_rate” = 0.0002 (range from 1 × 10–5 to 1 × 10–2); “neurons” = 87 (range from 5 to 200); “l1_reg” = 0.0004 (range from 1 × 10–8 to 100); and “l2_reg” = 2.8993 × 10–5 (range from 1 × 10–8 to 100). The train-validation loss graphs for the SLP and MLP models are shown in Figures S5 and S6, respectively.
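The SLP architecture — one hidden layer of 100 ReLU neurons — can be sketched with scikit-learn's MLPRegressor as a stand-in for the Keras model (the reported RMSprop optimizer, learning rate, and L1 term are not reproduced by this class; data are synthetic placeholders):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(1031, 6))            # placeholder HF-3c descriptors
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=1031)

Xs = StandardScaler().fit_transform(X)
# The reported best hyperparameters (RMSprop, lr = 8.4235e-5, L1 = 5e-4,
# L2 = 9.7312e-6) belong to the Keras implementation; scikit-learn
# defaults (adam, L2 only) are used here for a self-contained sketch.
slp = MLPRegressor(hidden_layer_sizes=(100,), activation="relu",
                   max_iter=2000, random_state=0).fit(Xs, y)
r2 = slp.score(Xs, y)
```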
Graph Convolution Network
The GCN is designed to handle graph-structured data. In this context, molecular data inherently form graphs where atoms are nodes and bonds are edges. Figure 13 shows an “Actual vs Predicted” scatter plot for GCN test results.
Figure 13.
Test results for the GCN model for every descriptor: (a) Gibbs energy; (b) electronic energy; (c) entropy; (d) enthalpy; (e) dipole moment; and (f) HOMO–LUMO band gap. Blue dots are data for dimers generated from the QM9 data set, and red dots are data for structures of different spatial variations of real melamine-barbiturate and cyanurate assemblies.
Despite this, the GCN shows the worst performance in the prediction of Gibbs energy, electronic energy, and enthalpy among all models, while being only slightly better at band gap prediction than the linear models. It outperforms many classical models in the prediction of entropy, although it is less accurate than XGBoost, SLP, and MLP. The best hyperparameters for GCN are “hidden_dim” = 35 (range from 5 to 200); “num_layers” = 3 (range from 2 to 20); “learning_rate” = 5 × 10–5 (range from 1 × 10–5 to 1 × 10–2); “weight_decay” = 5 × 10–5 (range from 1 × 10–8 to 1 × 10–2); “dropout_rate” = 0.45 (range from 0.1 to 0.5); “optimizer” = “RMSprop”; and “activation_function” = “relu”. The train-validation loss graph is shown in Figure S7.
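A single GCN layer implements the propagation rule H′ = σ(D̂^(−1/2) Â D̂^(−1/2) H W), where Â = A + I is the adjacency matrix with self-loops. A minimal numpy sketch of one such layer (the 4-node ring graph and random weights are illustrative only, not the trained model):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: symmetrically normalized neighbor averaging."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D_hat^(-1/2)
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)  # ReLU

# Toy 4-node cycle graph (e.g., a four-atom ring), 3 input features per node.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(6)
H = rng.normal(size=(4, 3))                 # node feature matrix
W = rng.normal(size=(3, 8))                 # layer weights (random here, learned in practice)
H1 = gcn_layer(A, H, W)
```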
Graphical User Interface
We introduce an application with a graphical user interface that allows users to easily apply the described models to predict DFT features from HF data. The user interface and the basic workflow are shown in Figure 14. The pipeline uses the most efficient linear, tree-based, and neural network model for each calculated descriptor.
Figure 14.
Screenshot of the user interface and the basic workflow of the application.
The application allows users to input HF-3c quantum chemistry descriptors and molecular geometry in the xyz format and receive DFT (B3LYP-D4/def2-TZVP) quantum chemistry features as the output. The application is available on the GitHub page.
Conclusions
The data set used in this work consists of 1031 entries of dimer, trimer, and tetramer structures of cyclic organic molecules, both with and without heteroatoms in the ring. All structures were calculated using two quantum chemical approximations: the DFT-based B3LYP/def2-TZVP method (including D4 dispersion corrections) and the composite method HF-3c. The following descriptors and features were obtained with both methods: Gibbs energy, electronic energy, entropy, enthalpy, dipole moment, and band gap. Statistical analysis of these descriptors and features shows high correlations between the energy characteristics at the HF and DFT levels, confirming the reliability of HF calculations for predicting energy properties. However, a weak correlation of the dipole moment with the energy characteristics is observed, which may indicate the need for its separate consideration when constructing models. The results of the machine learning model evaluation were separated into three main groups based on model type: linear, tree-based, and neural networks. Among the linear models, for energy prediction (G, E, and H), the LR model is the best, with LASSO a close second; however, the high results may stem from the simplicity and overfitting of the LR model. The LASSO model shows the best predictions for the dipole moment, HOMO–LUMO band gap, and entropy. Lasso and Elastic Net regression consistently showed the best performance for most metrics (Gibbs energy, electronic energy, enthalpy, and entropy), with high R2 values and low error metrics, whereas Ridge regression had higher error metrics, suggesting that it may not be the best choice in such a case. Among the tree-based models, XGBoost provided the best performance, especially for Gibbs energy, electronic energy, and band gap, making it a reliable model across multiple metrics.
RF regression also showed competitive performance, particularly for dipole moment and band gap, with error metrics lower than those of DT regression. DT regression generally had lower performance, with higher error metrics and lower R2 values than the other tree-based models. Across the neural network models, SLP showed the best results for all energy metrics, being outperformed by MLP only in dipole moment prediction. Unfortunately, the GCN model showed the worst results for all metrics. Overall, among all demonstrated models from the three classes, LASSO, XGBoost, and SLP were the best. Gibbs energy, electronic energy, and enthalpy, being highly correlated quantities, all show good prediction metrics, while entropy and band gap have average prediction values. However, the dipole moment shows a very high MAE for all models, which might stem from it being a property highly dependent on the geometry of a structure. The relatively poor performance of the neural networks in comparison with the tree-based models (such as XGBoost) might be linked to the fact that NN models require large data sets to learn complex patterns with good accuracy. Overall, in this work we demonstrated several machine learning models for the accurate prediction of DFT features from HF-3c data for supramolecular structures. In future work, we plan to augment our database with more diverse supramolecular systems that include more varied chemical structures and more components in each system. We also intend to broaden our research with other machine learning models, such as graph attention networks and graph autoencoders.
Acknowledgments
This work was supported by grant no. 075-15-2021-1349.
Data Availability Statement
The data that support the findings of this study are openly available in our GitHub repository at https://github.com/Saadiallakh/HF_to_DFT and in the Supporting Information. Also, raw data that support the findings of this study are available from the authors upon reasonable request.
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.4c09861.
Distribution diagrams for DFT features and HF descriptors, box and whiskers diagrams for DFT features and HF descriptors, train-loss graphs for SLP, MLP, and GCN models (PDF)
GitHub repository containing the source code used for the analysis as well as the data sets utilized for training and testing the model (ZIP)
Multilayer perceptron and predicted DFT properties (MP4)
Author Contributions
† S.N. and P.V.N. contributed equally.
The authors declare no competing financial interest.
References
- Grimme S.; Hansen A.; Ehlert S.; Mewes J. M. r2SCAN-3c: A “swiss army knife” composite electronic-structure method. J. Chem. Phys. 2021, 154, 064103. 10.1063/5.0040021. [DOI] [PubMed] [Google Scholar]
- Bochevarov A. D.; Harder E.; Hughes T. F.; Greenwood J. R.; Braden D. A.; Philipp D. M.; Rinaldo D.; Halls M. D.; Zhang J.; Friesner R. A. Jaguar: A high-performance quantum chemistry software program with strengths in life and materials sciences. Int. J. Quantum Chem. 2013, 113, 2110–2142. 10.1002/qua.24481. [DOI] [Google Scholar]
- Seritan S.; Thompson K.; Martínez T. J. TeraChem Cloud: a high-performance computing service for scalable distributed GPU-accelerated electronic structure calculations. J. Chem. Inf. Model. 2020, 60, 2126–2137. 10.1021/acs.jcim.9b01152. [DOI] [PubMed] [Google Scholar]
- Manathunga M.; Jin C.; Cruzeiro V. W. D.; Miao Y.; Mu D.; Arumugam K.; Keipert K.; Aktulga H. M.; Merz K. M.; Götz A. W. Harnessing the power of multi-GPU acceleration into the quantum interaction computational Kernel program. J. Chem. Theory Comput. 2021, 17, 3955–3966. 10.1021/acs.jctc.1c00145. [DOI] [PubMed] [Google Scholar]
- Lee S.; Ermanis K.; Goodman J. M. MolE8: finding DFT potential energy surface minima values from force-field optimized organic molecules with new machine learning representations. Chem. Sci. 2022, 13, 7204–7214. 10.1039/D1SC06324C. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hansen K.; Biegler F.; Ramakrishnan R.; Pronobis W.; Von Lilienfeld O. A.; Müller K. R.; Tkatchenko A. Machine learning predictions of molecular properties: accurate many-body potentials and nonlocality in chemical space. J. Phys. Chem. Lett. 2015, 6, 2326–2331. 10.1021/acs.jpclett.5b00831. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Atz K.; Isert C.; Böcker M. N.; Jiménez-Luna J.; Schneider G. Δ-Quantum machine-learning for medicinal chemistry. Phys. Chem. Chem. Phys. 2022, 24, 10775–10783. 10.1039/D2CP00834C. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Isert C.; Atz K.; Jiménez-Luna J.; Schneider G. QMugs, quantum mechanical properties of drug-like molecules. Sci. Data 2022, 9, 273. 10.1038/s41597-022-01390-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Imoro N.; Shilovskikh V. V.; Nesterov P. V.; Timralieva A. A.; Gets D.; Nebalueva A.; Lavrentev F. V.; Novikov A. S.; Kondratyuk N. D.; Orekhov N. D.; Skorb E. V. Biocompatible pH-degradable functional capsules based on melamine cyanurate self-assembly. ACS Omega 2021, 6, 17267–17275. 10.1021/acsomega.1c01124. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nesterov P. V.; Shilovskikh V. V.; Sokolov A. D.; Gurzhiy V. V.; Novikov A. S.; Timralieva A. A.; Belogub E. V.; Kondratyuk N. D.; Orekhov N. D.; Skorb E. V. Encapsulation of rhodamine 6G dye molecules for affecting symmetry of supramolecular crystals of melamine-barbiturate. Symmetry 2021, 13, 1119. 10.3390/sym13071119. [DOI] [Google Scholar]
- Shilovskikh V. V.; Timralieva A. A.; Nesterov P. V.; Novikov A. S.; Sitnikov P. A.; Konstantinova E. A.; Kokorin A. I.; Skorb E. V. Melamine–barbiturate supramolecular assembly as a pH-dependent organic radical trap material. Chem.—Eur. J. 2020, 26, 16603–16610. 10.1002/chem.202002947. [DOI] [PubMed] [Google Scholar]
- Shilovskikh V. V.; Timralieva A. A.; Belogub E. V.; Konstantinova E. A.; Kokorin A. I.; Skorb E. V. Radical activity of binary melamine-based hydrogen-bonded self-assemblies. Appl. Magn. Reson. 2020, 51, 939–949. 10.1007/s00723-020-01254-6. [DOI] [Google Scholar]
- Timralieva A. A.; Moskalenko I. V.; Nesterov P. V.; Shilovskikh V. V.; Novikov A. S.; Konstantinova E. A.; Kokorin A. I.; Skorb E. V. Melamine barbiturate as a light-induced nanostructured supramolecular material for a bioinspired oxygen and organic radical trap and stabilization. ACS Omega 2023, 8, 8276–8284. 10.1021/acsomega.2c06510. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Aliev T. A.; Timralieva A. A.; Kurakina T. A.; Katsuba K. E.; Egorycheva Y. A.; Dubovichenko M. V.; Kutyrev M. A.; Shilovskikh V. V.; Orekhov N.; Kondratyuk N.; Semenov S. N.; Kolpashchikov D. M.; Skorb E. V. Designed assembly and disassembly of DNA in supramolecular structure: from ion regulated nuclear formation and machine learning recognition to running DNA cascade. Nano Sel. 2022, 3, 1526–1536. 10.1002/nano.202200092. [DOI] [Google Scholar]
- Ruth M.; Gerbig D.; Schreiner P. R. Machine learning for bridging the gap between density functional theory and coupled cluster energies. J. Chem. Theory Comput. 2023, 19, 4912–4920. 10.1021/acs.jctc.3c00274. [DOI] [PubMed] [Google Scholar]
- Orekhov N.; Kondratyuk N.; Logunov M.; Timralieva A.; Shilovskikh V.; Skorb E. V. Insights into the early stages of melamine cyanurate nucleation from aqueous solution. Cryst. Growth Des. 2021, 21, 1984–1992. 10.1021/acs.cgd.0c01285. [DOI] [Google Scholar]
- Neese F.; Wennmohs F.; Hansen A.; Becker U. Efficient, approximate and parallel Hartree-Fock and hybrid DFT calculations. A “chain-of-spheres” algorithm for the Hartree-Fock exchange. Chem. Phys. 2009, 356, 98–109. 10.1016/j.chemphys.2008.10.036. [DOI] [Google Scholar]
- Neese F. An improvement of the resolution of the identity. J. Comput. Chem. 2003, 24, 1740–1747. 10.1002/jcc.10318. [DOI] [PubMed] [Google Scholar]
- Sure R.; Grimme S. Corrected small basis set Hartree-Fock method for large systems. J. Comput. Chem. 2013, 34, 1672–1685. 10.1002/jcc.23317. [DOI] [PubMed] [Google Scholar]
- Neese F. The ORCA program system. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2, 73–78. 10.1002/wcms.81. [DOI] [Google Scholar]
- Neese F. Software update: the ORCA program system, version 4.0. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2018, 8, e1327 10.1002/wcms.1327. [DOI] [Google Scholar]
- Neese F.; Wennmohs F.; Becker U.; Riplinger C. The ORCA quantum chemistry program package. J. Chem. Phys. 2020, 152, 224108. 10.1063/5.0004608. [DOI] [PubMed] [Google Scholar]
- Chen C.; Deng S.; Li S. Using small database and energy descriptors to predict molecular thermodynamic energies through mediated learning models. Chem. Eng. J. 2024, 488, 150607. 10.1016/j.cej.2024.150607. [DOI] [Google Scholar]
- Chen X.; Li P.; Hruska E.; Liu F. Δ-Machine learning for quantum chemistry prediction of solution-phase molecular properties at the ground and excited states. Phys. Chem. Chem. Phys. 2023, 25, 13417–13428. 10.1039/D3CP00506B. [DOI] [PubMed] [Google Scholar]
- Maier S.; Collins E. M.; Raghavachari K. Quantitative prediction of vertical ionization potentials from DFT via a graph-network-based delta machine learning model incorporating electronic descriptors. J. Phys. Chem. A 2023, 127, 3472–3483. 10.1021/acs.jpca.2c08821. [DOI] [PubMed] [Google Scholar]
- Fralish Z.; Chen A.; Skaluba P.; Reker D. DeepDelta: predicting ADMET improvements of molecular derivatives with deep learning. J. Cheminf. 2023, 15, 101. 10.1186/s13321-023-00769-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ramakrishnan R.; Dral P. O.; Rupp M.; Von Lilienfeld O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 2014, 1, 140022. 10.1038/sdata.2014.22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Heinen S.; Khan D.; Falk von Rudorff G.; Karandashev K.; Jose Arismendi Arrieta D.; Price A. J.; Nandi S.; Bhowmik A.; Hermansson K.; Anatole von Lilienfeld O. Reducing training data needs with minimal multilevel machine learning (M3L). Mach. Learn.: Sci. Technol. 2024, 5, 025058. 10.1088/2632-2153/ad4ae5. [DOI] [Google Scholar]
- Pedregosa F.; et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Abadi M.; et al. TensorFlow, Large-scale machine learning on heterogeneous systems. 2015, arXiv:1603.04467. arXiv preprint. [Google Scholar]
- Ansel J.; Yang E.; He H.; Gimelshein N.; Jain A.; Voznesensky M.; Bao B.; Bell P.; Berard D.; Burovski E.; et al. PyTorch 2: faster machine learning through dynamic Python Bytecode transformation and graph compilation. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’24), 2024; Vol. 2; pp 929–947. 10.1145/3620665.3640366. [DOI]
- Chen T.; Guestrin C.. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Data Mining: New York, NY, USA, 2016, pp 785–794.
- Song Z.; Baur B. A.; Roy S.. Comparison of deep and shallow graph representation learning algorithms for detecting modules in molecular networks. In Proceedings—2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022, 2022, pp 2375–2377. 10.1109/bibm55620.2022.9995141. [DOI]
- https://www.rdkit.org/ (accessed Jan 14, 2025).
- Zhou H.; Skolnick J. Utility of the Morgan fingerprint in structure-based virtual ligand screening. J. Phys. Chem. B 2024, 128, 5363–5370. 10.1021/acs.jpcb.4c01875. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ali S.; Chourasia P.; Patterson M.. Expanding chemical representation with k-mers and fragment-based fingerprints for molecular fingerprinting. 2024, arXiv:2403.19844. arXiv preprint. [Google Scholar]
- Fraser C.Business Statistics for Competitive Advantage with Excel and JMP: Basics, Model Building, Simulation, and Cases; Springer Nature Switzerland: Cham, 2024; pp 179–200. [Google Scholar]
- Bonamente M.Statistics and Analysis of Scientific Data; Springer Nature Singapore: Singapore, 2022; pp 247–261. [Google Scholar]
- Akiba T.; Sano S.; Yanase T.; Ohta T.; Koyama M.. Optuna: A next-generation hyperparameter optimization framework. In The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp 2623–2631. 10.1145/3292500.3330701. [DOI]
- Almarzooq H.; bin Waheed U. Automating hyperparameter optimization in geophysics with Optuna: a comparative study. Geophys. Prospect. 2024, 72, 1778–1788. 10.1111/1365-2478.13484. [DOI] [Google Scholar]
- Akiba T.; Sano S.; Yanase T.; Ohta T.; Koyama M.. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Data Mining: New York, NY, USA, 2019, pp 2623–2631.
- Tikaningsih A.; Lestari P.; Nurhopipah A.; Tahyudin I.; Winarto E.; Hassa N. Optuna based hyperparameter tuning for improving the performance prediction mortality and hospital length of stay for stroke patients. Telematika 2024, 17, 1–16. 10.35671/telematika.v17i1.2816. [DOI] [Google Scholar]
- Laufer B.; Docherty P. D.; Murray R.; Krueger-Ziolek S.; Jalal N. A.; Hoeflinger F.; Rupitsch S. J.; Reindl L.; Moeller K. Sensor selection for tidal volume determination via linear regression—impact of Lasso versus ridge regression. Sensors 2023, 23, 7407. 10.3390/s23177407. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gana R. Ridge regression and the elastic net: How do they do as finders of true regressors and their coefficients?. Mathematics 2022, 10, 3057. 10.3390/math10173057. [DOI] [Google Scholar]
- Abqorunnisa F.; Erfiani E.; Djuraidah A. Performance of LASSO and elastic-net methods on non-invasive blood glucose measurement calibration modeling. BAREKENG: J. Math. Its Appl. 2023, 17, 0037–0042. 10.30598/barekengvol17iss1pp0037-0042. [DOI] [Google Scholar]
- Salman H. A.; Kalakech A.; Steiti A. Random forest algorithm overview. Babylonian J. Mach. Learn. 2024, 2024, 69–79. 10.58496/BJML/2024/007. [DOI] [Google Scholar]
- Didavi A. B. K.; Agbokpanzo R. G.; Agbomahena M.. Comparative study of decision tree, random forest and XGBoost performance in forecasting the power output of a photovoltaic system. In 2021 4th International Conference on Bio-Engineering for Smart Technologies (BioSMART), 2021, pp 1–5. 10.1109/biosmart54244.2021.9677566. [DOI]
- Bergman E.; Purucker L.; Hutter F.. Don’t waste your time: early stopping cross-validation. 2024, arXiv:2405.03389. arXiv preprint. [Google Scholar]
- Miseta T.; Fodor A.; Vathy-Fogarassy Á. Surpassing early stopping: a novel correlation-based stopping criterion for neural networks. Neurocomputing 2024, 567, 127028. 10.1016/j.neucom.2023.127028. [DOI] [Google Scholar]
- Ma X.; Deng Z.; Xu Z.; Zhang Y.. GCN-based group level analysis of brain functional connectivity in fMRI. In Fourteenth International Conference on Graphics and Image Processing (ICGIP 2022), 2023, p 127050Y.
- Shin Y.; Yoon Y. PGCN progressive graph convolutional networks for spatial–temporal traffic forecasting. IEEE Trans. Intell. Transport. Syst. 2024, 25, 7633–7644. 10.1109/tits.2024.3349565. [DOI] [Google Scholar]
- Yadav R. K.; Abhishek A.; S S.; Verma S.. GCN with Clustering coefficients and attention module. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), 2020, pp 185–190. 10.1109/icmla51294.2020.00038. [DOI]
- Zhou Y.; He D. Multi-target feature selection with adaptive graph learning and target correlations. Mathematics 2024, 12, 372. 10.3390/math12030372. [DOI] [Google Scholar]
- Wang J.; Chen Z.; Sun K.; Li H.; Deng X. Multi-target regression via target specific features. Knowl. Base Syst. 2019, 170, 70–78. 10.1016/j.knosys.2019.01.030. [DOI] [Google Scholar]
- Zhu L. Selection of multi-level deep features via spearman rank correlation for synthetic aperture radar target recognition using decision fusion. IEEE Access 2020, 8, 133914–133927. 10.1109/ACCESS.2020.3010969. [DOI] [Google Scholar]
- Yang X.; Zhang H.; Yang L.; Yang C.; Liu P. X. A joint multi-feature and scale-adaptive correlation filter tracker. IEEE Access 2018, 6, 34246–34253. 10.1109/ACCESS.2018.2849420. [DOI] [Google Scholar]
- Reynolds P. S.A Guide to Sample Size for Animal-based Studies. Chapter 11; John Wiley & Sons, Ltd, 2023; pp 127–132. [Google Scholar]
- Nuzzo R. L. Histograms: a useful data analysis visualization. PM&R 2019, 11, 309–312. 10.1002/pmrj.12145. [DOI] [PubMed] [Google Scholar]
- Cox N. J. Speaking stata: creating and varying box Plots. STATA J. 2009, 9, 478–496. 10.1177/1536867X0900900309. [DOI] [Google Scholar]
- Onoz B.; Oguz B.. In Integrated Technologies for Environmental Monitoring and Information Production; Harmancioglu N. B., Ozkul S. D., Fistikoglu O., Geerders P., Eds.; Springer Netherlands: Dordrecht, 2003; pp 173–180. [Google Scholar]
- Sedgwick P. Pearson’s correlation coefficient. BMJ 2012, 345, e4483 10.1136/bmj.e4483. [DOI] [Google Scholar]