Abstract
This study presents an integrated approach combining Density Functional based Tight Binding (DFTB) calculations with machine learning (ML) techniques to predict the density of states (DOS) in pristine and Zn-doped MgO nanoparticles (NPs). More than 60 ML models, including linear models, tree-based ensembles, and neural networks, were evaluated for predictive performance. Among these, the weighted k-nearest neighbor (wkNN) algorithm, particularly when using triweight and biweight kernels, consistently outperformed the others, achieving a median RMSE of 0.241 for pristine MgO and 0.386 for Zn-doped samples. The models demonstrated robust performance across doping concentrations (5–25%) and NP sizes (0.8 nm and 0.9 nm), with minimal impact of doping level on prediction accuracy. This integration of DFTB with ML offers a powerful and efficient framework for accelerating electronic property predictions in materials science, supporting the rapid design of advanced materials for applications in electronics, catalysis, and energy storage. The code and data are publicly available at: https://github.com/KurbanIntelligenceLab/DOS-Nanoparticles-Weighted-kNN.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-07887-6.
Keywords: DFTB, ML, DOS, Zn-doped MgO nanoparticles, wkNN
Subject terms: Nanoscale materials, Electronic properties and materials, Atomic and molecular physics
Introduction
Magnesium oxide (MgO) nanoparticles (NPs) have garnered significant attention in materials science due to their unique properties and diverse applications. Their high thermal stability, wide bandgap, and excellent insulating characteristics make them suitable for various technological advancements1. In catalysis, MgO NPs serve as effective catalysts in numerous organic transformations, including oxidation, reduction, epoxidation, and condensation reactions2. Their basic nature and high surface area facilitate these processes, making them valuable in the synthesis of fine chemicals and organic intermediates. In the electronics industry, MgO NPs are utilized in the fabrication of thin-film transistors due to their high thermal stability, excellent insulating properties, and wide bandgap3. These characteristics contribute to enhanced efficiency and durability in electronic devices. Doping MgO NPs with zinc (Zn) has been shown to modify their electronic properties significantly. Zn incorporation can alter the band structure and optical characteristics of MgO, leading to potential applications in optoelectronic devices4. Studies have indicated that Zn doping affects the microstructural and optical properties of MgO NPs, which can be tailored for specific technological applications5,6. In summary, MgO NPs are highly versatile materials with promising roles in catalysis and electronics and Zn doping further broadens their potential for advanced technological applications.
Density Functional Theory (DFT) has been instrumental in advancing computational chemistry by providing detailed insights into the electronic structures of molecules and materials7,8. Despite its widespread use, DFT has inherent limitations9–11. One significant challenge is its computational expense, which grows rapidly with the complexity and size of the systems under study. As the number of atoms in a system increases, the computational resources and time required for DFT calculations grow substantially, often making it impractical for large-scale simulations12. This limitation hinders the exploration of extensive chemical spaces and the study of large biomolecules or complex materials. To overcome these challenges, the integration of machine learning (ML) into computational chemistry has emerged as a promising approach13. ML algorithms can be trained on existing DFT data to predict molecular properties with high accuracy while drastically reducing computational demands14,15. For instance, ML models have been developed to predict potential energy surfaces, electronic properties, and reaction outcomes, enabling rapid screening of vast chemical spaces that would be computationally prohibitive with traditional DFT methods alone13,16,17. This synergy between DFT and ML not only accelerates the discovery and design of new materials and molecules but also opens avenues for exploring complex chemical systems that were previously beyond reach.
Recent advances in ML have led to promising approaches for predicting electronic properties such as the density of states (DOS) with reduced computational cost and enhanced scalability. Deep learning frameworks now aim to emulate the full pipeline of DFT by predicting electron densities and DOS directly from atomic configurations without solving the Kohn–Sham equations18. Other studies have incorporated atomic-based electronic populations into ML models to enhance the physical fidelity of potential energy surfaces and DOS predictions16. Additionally, physically informed ML models have been shown to improve the accuracy and interpretability of DOS predictions in complex materials systems14,19. Graph-based neural networks have also demonstrated high accuracy in learning DOS patterns across different length scales, offering generalizable frameworks for large-scale electronic structure prediction13. Moreover, descriptor-based ML approaches have been employed to capture key electronic signatures such as d-band center shifts in alloy systems, facilitating more precise electronic behavior modeling20. ML methods have even been extended to predict phonon DOS using crystal-attentive architectures21. Recent studies by Kong et al.22 and Fung et al.19 have proposed graph-based ML models for predicting spectral properties such as the electronic DOS across chemically diverse crystalline materials.
While several ML-based approaches have been proposed to assist or replace quantum simulations for electronic structure prediction, few have directly addressed the prediction of DOS in doped nanomaterials. Prior studies often rely on complex neural network architectures, large-scale graph-based models, or kernel ridge regression methods that require significant amounts of training data and computational tuning13,16,23,24. For example, Fiedler et al.13 and Kirschbaum et al.24 demonstrated large-scale ML models for predicting orbital energies or DOS features, while Sun et al.23 applied ML to enhance DFTB predictions of DOS in periodic systems.
However, these approaches typically rely on large datasets, require significant computational resources, and rarely address size-dependent nanostructures or doping effects. Furthermore, previous studies do not systematically compare a broad range of ML algorithms, nor do they explore the potential of nonparametric, instance-based learners such as weighted k-nearest neighbor (wkNN), particularly with kernel optimization strategies. This study fills that gap by (i) targeting realistic, Zn-doped MgO NPs with varying sizes and doping levels, (ii) evaluating over 60 ML models for DOS prediction, and (iii) demonstrating the superior performance of kernel-optimized wkNN models, which offer lightweight, interpretable, and highly accurate alternatives to more complex black-box approaches. While this study does not introduce a new ML algorithm, it presents a comprehensive and transferable evaluation framework for assessing DOS prediction performance across a diverse range of models, identifying lightweight yet high-performing algorithms, particularly kernel-optimized wkNN, that are well suited for interpretable electronic structure prediction. These insights are valuable for researchers aiming to accelerate electronic structure predictions using interpretable and computationally efficient ML methods.
The objective of this study is to integrate Density Functional based Tight Binding (DFTB) calculations with ML models to predict the DOS in pristine and Zn-doped MgO NPs. DFTB, an approximate quantum mechanical method derived from a second-order Taylor expansion of the total DFT energy functional around a reference electron density, offers a computationally efficient means of modeling electronic structures while maintaining reasonable accuracy25. Unlike conventional DFT, DFTB relies on precomputed, tabulated matrix elements (e.g., Slater–Koster (SK) integrals) that depend on interatomic distances, allowing efficient electronic structure calculations at significantly reduced computational cost. However, even with its computational efficiency, DFT/DFTB can become resource-intensive when applied to large datasets or when high-throughput screening is required for materials discovery26. By leveraging ML techniques, we aim to overcome these limitations by training predictive models on DFTB-calculated data, allowing rapid and accurate DOS predictions across a wide range of NP configurations and doping levels23. This hybrid approach enables the exploration of vast material spaces that would otherwise be computationally prohibitive, significantly accelerating the discovery and design of materials with tailored electronic properties. Furthermore, integrating ML with DFTB not only reduces computational costs but also enables the exploration of how structural variations, such as changes in doping concentration and size, affect the electronic density of states. While these structural factors are not used as direct input features, their influence is embedded in the DFTB-derived eigenvalue spectra that serve as model descriptors. Overall, this methodology offers a practical and efficient pathway for accelerating electronic structure modeling and supports the data-driven design of advanced materials for use in electronics, catalysis, and energy storage24,27,28.
Computational methods
DFTB calculations
The DFTB calculations were performed using the DFTB+ software package29, employing the third-order parametrization (DFTB3). DFTB is an efficient, parameterized approximation to DFT that retains essential quantum mechanical accuracy while significantly reducing computational cost25. It is derived from a second-order Taylor expansion of the Kohn–Sham total energy with respect to a reference electron density25. In its basic formulation (DFTB1), the total energy is expressed as:

$$E_{\mathrm{DFTB1}} = \sum_{i} n_i \varepsilon_i + \frac{1}{2} \sum_{A \neq B} V^{\mathrm{rep}}_{AB},$$

where $n_i$ and $\varepsilon_i$ are the occupation numbers and molecular orbital energies, respectively, and $V^{\mathrm{rep}}_{AB}$ is the pairwise repulsive potential between atoms A and B, pre-parameterized from DFT reference calculations. The molecular orbitals are represented using a linear combination of atomic orbitals (LCAO), resulting in a generalized eigenvalue problem of the form:

$$\sum_{\nu} c_{\nu i} \left( H_{\mu\nu} - \varepsilon_i S_{\mu\nu} \right) = 0,$$

where $H_{\mu\nu}$ and $S_{\mu\nu}$ are the Hamiltonian and overlap matrix elements, respectively, and $c_{\nu i}$ are the expansion coefficients. All required matrix elements are tabulated in SK parameter files, enabling efficient evaluation without the need for explicit electron density calculations. More advanced variants such as DFTB2 and DFTB3 include self-consistent charge (SCC) and third-order corrections to better handle charge transfer, polarization, and hydrogen bonding effects. Owing to its parameterized nature, DFTB reduces computational cost by two to three orders of magnitude relative to conventional DFT while retaining reasonable accuracy for structures, energetics, and electronic properties of medium-sized systems.
Structural optimizations were performed using the Conjugate Gradient method30, allowing full atomic relaxation. The convergence criteria included a maximum force component of 10⁻⁵ and a limit of 100,000 optimization steps. The Hamiltonian parameters were obtained from the 3ob-3-1 SK files31,32, which were carefully selected to accurately describe interactions between Mg, O, and Zn atoms. Particular attention was given to ensuring reliable predictions of the DOS. MgO NPs were modeled with radial dimensions of 0.8 nm and 0.9 nm to assess size-dependent electronic properties. For the Zn-doped MgO systems, doping concentrations of 5%, 10%, 15%, 20%, and 25% were systematically introduced, replacing Mg atoms in the lattice with Zn atoms at random but physically reasonable sites, following substitutional doping mechanisms reported in the literature33–35. These sites were chosen based on structural symmetry and known energetic favorability. Although a systematic variation of dopant positions was beyond the scope of this work, the observed DOS trends agree well with existing theoretical and experimental findings, supporting the validity of the chosen configurations. In particular, Wang et al. demonstrated using DFT calculations that zinc doping significantly lowers the band gap of MgO and that this reduction is closely correlated with Zn concentration, a trend also captured in our DFTB-based results33. The structural evolution of MgO NPs under Zn doping is depicted in Fig. 1, where the 0.8 nm model is used as a reference to visualize the doping-induced modifications.
Fig. 1.

Schematic representation of the Zn doping process in MgO NPs with a 0.8 nm radius. The central structure depicts the pristine MgO NP, while the surrounding structures illustrate the progressive substitution of Mg atoms with Zn at 5%, 10%, 15%, 20%, and 25% doping concentrations. Each doped configuration exhibits distinct atomic rearrangements, highlighting the structural impact of Zn incorporation. The 0.8 nm model is presented as a representative case to visualize the doping effects; however, a 0.9 nm model has also been investigated to examine size-dependent structural and electronic variations, though it is not included in this figure.
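For orientation, the settings above can be collected into a minimal DFTB+ input in HSD format. The sketch below is illustrative rather than the exact input used in this work: the geometry file name and the SK path are placeholders, and the Hubbard derivatives are the values tabulated for the 3ob parametrization31,32, which should be verified against the 3ob-3-1 documentation before use.

```
Geometry = GenFormat {
  <<< "mgo_r08_zn10.gen"        # NP structure in DFTB+ gen format (illustrative name)
}

Driver = ConjugateGradient {
  MaxForceComponent = 1.0e-5    # force convergence criterion used in this work
  MaxSteps = 100000
}

Hamiltonian = DFTB {
  SCC = Yes
  ThirdOrderFull = Yes          # DFTB3
  HubbardDerivs {               # 3ob values; verify against the parameter set docs
    Mg = -0.02
    O  = -0.1575
    Zn = -0.03
  }
  SlaterKosterFiles = Type2FileNames {
    Prefix = "3ob-3-1/"         # path to the 3ob-3-1 SK set (illustrative)
    Separator = "-"
    Suffix = ".skf"
  }
  MaxAngularMomentum {
    Mg = "p"
    O  = "p"
    Zn = "d"
  }
}
# The orbital eigenvalues used later as ML descriptors are written to band.out.
```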
Construction of Zn-doped MgO nanoparticles
A 60 × 60 × 60 supercell was generated from the wurtzite ZnO crystal structure, and approximately spherical MgO NPs with radii of 0.8 nm and 0.9 nm were extracted from it. This spherical truncation was performed to reflect realistic nanoscale geometries and to enable the evaluation of size-dependent electronic effects.
For the doped configurations, Zn atoms were introduced by substituting Mg atoms at specific lattice sites. Doping concentrations of 5%, 10%, 15%, 20%, and 25% were implemented by replacing the corresponding number of Mg atoms with Zn atoms. The doping was performed using the Avogadro software36, which allowed manual selection of dopant positions while preserving structural consistency and ensuring a reasonable spatial distribution. Dopant sites were selected randomly, but under symmetry and distance constraints to avoid clustering and to mimic physically plausible doping scenarios. For each doping level and NP size, one representative configuration was generated and used in the subsequent DFTB calculations.
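The constrained random site selection described above can be sketched as a simple rejection-sampling routine. The R snippet below is illustrative only (the actual doping was performed manually in Avogadro36); the coordinate matrix `mg_xyz` and the minimum separation `min_sep` are hypothetical inputs, not values from this work.

```r
# Illustrative rejection sampling of substitutional dopant sites: pick n_dope
# Mg sites at random, rejecting candidates closer than min_sep (angstroms)
# to an already chosen dopant, to avoid clustering.
pick_dopant_sites <- function(mg_xyz, n_dope, min_sep = 4.0, max_tries = 10000) {
  chosen <- integer(0)
  tries  <- 0
  while (length(chosen) < n_dope && tries < max_tries) {
    avail <- setdiff(seq_len(nrow(mg_xyz)), chosen)
    cand  <- avail[sample.int(length(avail), 1)]
    ok <- length(chosen) == 0 ||
      min(sqrt(rowSums(sweep(mg_xyz[chosen, , drop = FALSE], 2,
                             mg_xyz[cand, ])^2))) >= min_sep
    if (ok) chosen <- c(chosen, cand)
    tries <- tries + 1
  }
  chosen  # row indices of Mg atoms to replace with Zn
}

# e.g., 10% doping: set.seed(1); sites <- pick_dopant_sites(mg_xyz, round(0.10 * nrow(mg_xyz)))
```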
Machine learning pipeline
We implemented an extensive pipeline to evaluate 60 distinct ML models. The pipeline architecture encompasses crucial stages including data ingestion, preliminary exploration, preprocessing (centering, scaling, and filtering invalid values), partitioning of data into training and testing sets, and systematic model training with cross-validation. For each Zn@MgO NP, the input descriptors used in the ML models are the molecular orbital energies (eigenvalues) obtained from DFTB calculations. These spectral values serve as compact numerical representations of the system’s electronic structure and are used as the feature vectors for training the regression models that predict the corresponding DOS profiles. An illustration of our end-to-end model training pipeline is given in Fig. 2. The empirical evaluation framework generates vital performance metrics, including Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), and computes summary statistics to enable robust model comparison and selection. RMSE and MAE values for the models are given in Figs. S1–S24. The diverse collection of models spans multiple categories, including linear models (lm), tree-based ensembles (random forest37, gradient boosting machine38, XGBoost39), support vector machines (svmRadial40), neural networks (nnet, avNNet41), and specialized regression techniques (earth42, least angle regression43, glmnet44).
Fig. 2.
An overview of the data analysis and model building pipeline. (1) A snapshot of the raw data. (2) Steps involved in data cleaning, i.e., removing invalid values and adjusting features to the same scale. (3) Partitioning the processed data into training and test sets. (4) An illustrative example of different models used in the pipeline. (5) Model evaluation and selection.
While our analysis evaluated multiple models, experiments revealed that instance-based non-parametric techniques, such as k-nearest neighbor and kernel-weighted k-nearest neighbor (referred to as wkNN)45, consistently outperformed the other models. We focus our detailed discussion on weighted k-nearest neighbors due to its strong performance across the different datasets (median RMSE: 0.918, median MAE: 0.384). The complete model comparison results are given in the supplementary material (the mapping of datasets to the corresponding figures is given in Tables S1 and S2, and summary statistics are given in Tables S3–S14). In the following sections, we introduce the mathematical notation, followed by a discussion of wkNN, the motivation for kernel selection, and the fundamental outline of the wkNN algorithm.
In this study, the ML models are not trained on raw atomic coordinates or structural files. Instead, we first obtain molecular orbital energies (i.e., eigenvalues) through DFTB calculations for each pristine and Zn-doped MgO configuration. These eigenvalue spectra are then used to construct numerical representations of the DOS. The models are trained to learn a mapping from these energy-level configurations to the corresponding DOS profiles, effectively approximating DFTB-level DOS through data-driven regression. This approach enables efficient prediction of DOS across doping levels and NP sizes using spectral features as input.
It is important to note that the DOS is not reconstructed from eigenvalues in a trivial sense (e.g., by direct histogramming), but instead, the models learn complex correlations across doping levels and NP sizes embedded in the spectral patterns.
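For context, numerical DOS representations of the kind mentioned above are typically obtained by broadening an eigenvalue spectrum over an energy grid. The R sketch below shows a generic Gaussian-broadening construction; the width `sigma` and the grid are assumptions for illustration, not settings taken from this work, and the ML task goes beyond this step by learning correlations across doping levels and sizes.

```r
# Gaussian-broadened DOS evaluated on a regular energy grid (generic sketch).
dos_from_eigenvalues <- function(eigvals, e_grid, sigma = 0.1) {
  sapply(e_grid, function(E) {
    sum(exp(-(E - eigvals)^2 / (2 * sigma^2))) / (sigma * sqrt(2 * pi))
  })
}

e_grid <- seq(-10, 2, by = 0.01)   # energy window matching Tables 13-14 (eV)
# dos  <- dos_from_eigenvalues(eigvals, e_grid)
```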
Notation: Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ represent the training dataset, where $\mathbf{x}_i \in \mathbb{R}^d$ is a d-dimensional feature vector and $y_i \in \mathbb{R}$ is the corresponding target value. Given a query point $\mathbf{x}_q$, the goal is to estimate its target value $\hat{y}_q$ using the weighted k-nearest neighbor algorithm. We denote the $k$ nearest neighbors of $\mathbf{x}_q$ in $\mathcal{D}$ by $N_k(\mathbf{x}_q) = \{(\mathbf{x}_{(1)}, y_{(1)}), \ldots, (\mathbf{x}_{(k)}, y_{(k)})\}$, where $\mathbf{x}_{(i)}$ represents the $i$-th nearest neighbor. The distance between $\mathbf{x}_q$ and a neighbor $\mathbf{x}_{(i)}$ is computed using a distance metric $d(\cdot, \cdot)$, such as the Euclidean distance. $\mathbb{1}[\cdot]$ is the indicator function, which evaluates to 1 when its condition is true and 0 otherwise, e.g. $\mathbb{1}[\lvert d \rvert \leq 1]$. Uppercase $K(\cdot)$ denotes the kernel function.
Weighted k-nearest neighbors (wkNN) focus
The weighted k-nearest neighbor (wkNN) algorithm extends the classic k-nearest neighbor (kNN) technique with a more nuanced approach to prediction. At its core, wkNN operates as follows: when estimating the value for a new data point, closer neighbors should have more influence on the prediction than farther ones. Given a new observation $\mathbf{x}_q$ for which a prediction is required, wkNN finds the $k$ training examples that are most similar to $\mathbf{x}_q$ according to some distance metric (such as the Euclidean or Manhattan distance). It then takes a distance-weighted average of the target values of those $k$ neighbors to arrive at the final prediction $\hat{y}_q$.
Kernel selection for predicting DOS: There are various kernels that can be used with the wkNN algorithm (see Table 1).
Table 1.
Kernel functions and their mathematical formulations used in wkNN for DOS prediction.
| Kernel type | Formulation |
|---|---|
| Rectangular kernel | $K(d) = \frac{1}{2}\,\mathbb{1}[\lvert d\rvert \leq 1]$ |
| Triangular kernel | $K(d) = (1 - \lvert d\rvert)\,\mathbb{1}[\lvert d\rvert \leq 1]$ |
| Epanechnikov kernel | $K(d) = \frac{3}{4}(1 - d^2)\,\mathbb{1}[\lvert d\rvert \leq 1]$ |
| Biweight kernel | $K(d) = \frac{15}{16}(1 - d^2)^2\,\mathbb{1}[\lvert d\rvert \leq 1]$ |
| Triweight kernel | $K(d) = \frac{35}{32}(1 - d^2)^3\,\mathbb{1}[\lvert d\rvert \leq 1]$ |
| Cosine kernel | $K(d) = \frac{\pi}{4}\cos\!\left(\frac{\pi d}{2}\right)\mathbb{1}[\lvert d\rvert \leq 1]$ |
| Gaussian kernel | $K(d) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{d^2}{2}\right)$ |
| Inversion kernel | $K(d) = \frac{1}{\lvert d\rvert}$ |
Choice of kernel is often determined by the characteristics of the data and by empirical validation. The DOS is a continuous variable that exhibits a relatively smooth distribution without significant outliers (see Fig. 3, where the DOS profiles for different Zn doping levels demonstrate only minor fluctuations in peak intensity and energy distribution, maintaining a stable electronic structure). Therefore, kernels like triweight and biweight, which assign higher weights to closer neighbors and gradually decrease the influence of distant points, are effective in capturing the underlying regression function. The triweight kernel has a wider bandwidth compared to the biweight kernel, allowing for a smoother regression estimate. This property is advantageous when the data points are evenly distributed and the relationship between the input and output variables is expected to be smooth. We empirically checked the prediction performance of eight different kernels on the pristine 0.8 nm MgO NP, and the results in Table 2 show that the biweight and triweight kernels produce consistently better results.
Fig. 3.

Total Density of States (DOS) as a function of energy for pristine and Zn-doped MgO NPs with varying doping concentrations (5%, 10%, 15%, 20%, and 25%). The DOS profiles for both R8 (0.8 nm radius) and R9 (0.9 nm radius) models are presented, showing the effect of Zn incorporation on the electronic structure. While the overall DOS shape remains consistent, subtle variations in peak intensity and energy distribution can be observed, particularly near the valence and conduction bands.
Table 2.
Prediction error of different kernels for k = 5 (error is measured via RMSE; refer to the subsection “Model evaluation metrics”). The entry in bold indicates the lowest error. Predictions are made for the pristine 0.8 nm MgO NP.
| Kernel | Prediction error (RMSE) |
|---|---|
| Rectangular | 0.409 |
| Triangular | 0.256 |
| Inverse | 0.259 |
| Epanechnikov | 0.279 |
| Cosine | 0.270 |
| Gaussian | 0.305 |
| Biweight | 0.246 |
| **Triweight** | **0.241** |
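The comparison in Table 2 can be reproduced in spirit with the kknn R package45, the implementation behind the kknn model referenced later. This is a sketch under assumed names: `train_df` and `test_df` are illustrative data frames with an `energy` predictor column and a `dos` target column.

```r
library(kknn)  # weighted k-nearest-neighbor regression (Hechenbichler & Schliep)

kernels <- c("rectangular", "triangular", "inv", "epanechnikov",
             "cos", "gaussian", "biweight", "triweight")

rmse_by_kernel <- sapply(kernels, function(kern) {
  fit <- kknn(dos ~ energy, train = train_df, test = test_df,
              k = 5, kernel = kern)
  sqrt(mean((test_df$dos - fit$fitted.values)^2))
})
sort(rmse_by_kernel)  # in Table 2, the biweight and triweight kernels rank lowest
```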
Algorithm: We present the primary steps of the algorithm below.

1. Distance computation: calculate the distances $d_i = d(\mathbf{x}_q, \mathbf{x}_i)$ between the query point $\mathbf{x}_q$ and all data points in $\mathcal{D}$ using the chosen distance metric.
2. Nearest neighbor selection: select the $k$ nearest neighbors of $\mathbf{x}_q$ based on the computed distances, forming the set $N_k(\mathbf{x}_q)$ such that $d_{(1)} \leq d_{(2)} \leq \cdots \leq d_{(k)}$.
3. Normalization of distances: to account for differences in scale, the distances of the $k$ nearest neighbors are normalized by the distance of the $(k+1)$-th neighbor:
$$\tilde{d}_{(i)} = \frac{d_{(i)}}{d_{(k+1)}}, \quad i = 1, \ldots, k.$$
4. Kernel weighting: transform the normalized neighbor distances into weights $w_{(i)} = K(\tilde{d}_{(i)})$, where $K(\cdot)$ denotes the kernel function.
5. Prediction: the regression estimate $\hat{y}_q$ for the query point $\mathbf{x}_q$ is obtained as the kernel-weighted mean of the target values of the $k$ nearest neighbors:
$$\hat{y}_q = \frac{\sum_{i=1}^{k} w_{(i)}\, y_{(i)}}{\sum_{i=1}^{k} w_{(i)}}.$$
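These five steps condense into a short, self-contained R function. The sketch below is didactic rather than the kknn package implementation; its default kernel is the triweight kernel of Table 1, up to the constant factor, which cancels in the weighted mean.

```r
# Didactic wkNN regression for a single query point, following steps 1-5.
# X: n x d matrix of training features; y: length-n target vector; x_q: query.
wknn_predict <- function(X, y, x_q, k = 5,
                         kernel = function(d) (1 - d^2)^3 * (abs(d) <= 1)) {
  # Step 1: Euclidean distances between the query and all training points.
  d_all <- sqrt(colSums((t(X) - x_q)^2))
  # Step 2: indices of the k nearest neighbors ((k+1)-th kept for step 3).
  ord <- order(d_all)
  nn  <- ord[1:k]
  # Step 3: normalize neighbor distances by the (k+1)-th neighbor distance.
  d_norm <- d_all[nn] / d_all[ord[k + 1]]
  # Step 4: kernel weighting (default: triweight, up to a constant factor).
  w <- kernel(d_norm)
  # Step 5: kernel-weighted mean of the neighbors' target values.
  sum(w * y[nn]) / sum(w)
}

# e.g., yhat <- wknn_predict(X_train, y_train, x_query, k = 5)
```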
Experimental Overview: Empirical investigations are conducted to determine the most effective ML algorithm for predicting electron state distributions as a function of energy level. The experimental data encompass electron configurations from twelve distinct NP systems: pristine MgO NPs with radial dimensions of 0.8 nm and 0.9 nm, as well as their corresponding Zn-doped derivatives. For each size, the Zn dopant concentration was systematically modulated at 5%, 10%, 15%, 20%, and 25% on an atomic basis, resulting in five distinct doped configurations per NP size. The experimental framework involves the implementation and evaluation of 75 distinct ML models. The algorithmic diversity encompasses ordinary regression frameworks, support vector machine-based regression models, penalized linear regression methodologies, random forest (rf) implementations, and deep neural networks. Out of the 75 models, 46 achieved successful computational convergence without errors and were subsequently selected for the experiments. Each model is trained independently on each dataset. We report the top-performing models along with their corresponding summary statistics. Note that in the subsequent section (Results and analysis), the knn and kknn models denote the k-nearest neighbor and kernel-weighted k-nearest neighbor algorithms, respectively.
Model evaluation metrics
Performance is evaluated using three metrics: RMSE, MAE, and R². RMSE and MAE are fundamental metrics for evaluating regression model performance on continuous variables. RMSE emphasizes larger errors through squared penalties, making it particularly sensitive to outliers, while MAE provides a more robust estimate of the average error magnitude, as it is less affected by extreme values. Both metrics are scale-dependent and expressed in the same units as the target variable, facilitating direct and intuitive interpretation of model accuracy; lower values indicate better performance. R², the coefficient of determination, measures the proportion of variance in the dependent variable that is explained by the model. It ranges from 0 to 1, with higher values indicating a better fit. The metrics are expressed mathematically as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|,$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2},$$

where $y_i$ represents the actual value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the actual values, and $N$ is the total number of observations in the dataset.
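These definitions translate directly into R; here `y` is the vector of actual values and `yhat` the predictions.

```r
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))   # penalizes large errors
mae  <- function(y, yhat) mean(abs(y - yhat))        # robust average error
r2   <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
```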
Data processing: The datasets are split following an 80:20 partition ratio into training and testing sets, with the training subset further subdivided for validation purposes as explained subsequently. The dependent variable is the DOS, while the energy parameter serves as the independent predictor variable.
Implementation details: Experiments were executed on a 64-bit Ubuntu system with 512 GB of memory and 24 AMD processor cores, using R programming language version 4.2.2.
Addressing model overfitting: To ensure robust model generalization and prevent overfitting, we use a combination of several techniques in our ML pipeline.
Cross-validation framework: We employed k-fold cross-validation (k = 5) to provide a more comprehensive assessment of model performance. Each k-fold cross-validation cycle was repeated five times with different random seeds to minimize the variance in performance estimates and ensure statistical robustness. This approach provided a more reliable assessment of out-of-sample performance compared to single train-test splits.
Hyperparameter optimization: For the wkNN algorithm, we conducted systematic evaluation of the optimal number of neighbors (k) and kernel function. Table 2 demonstrates our empirical analysis of eight different kernel functions, with triweight and biweight kernels consistently producing the lowest generalization error. The careful selection of these hyperparameters balanced model complexity against generalization capability.
Data preprocessing: Our pipeline incorporated centering and scaling transformations to normalize input features, ensuring that distance calculations in the feature space were not dominated by arbitrary scale differences. This preprocessing step is particularly important for distance-based algorithms like wkNN to avoid sensitivity to specific energy ranges.
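As a concrete sketch, these safeguards, together with the 80:20 split described earlier, can be combined with the caret package. This is a minimal illustration under assumed names (a data frame `df` with columns `energy` and `dos`), not the exact training script used in this work.

```r
library(caret)

set.seed(42)
idx      <- createDataPartition(df$dos, p = 0.8, list = FALSE)  # 80:20 split
train_df <- df[idx, ]
test_df  <- df[-idx, ]

# 5-fold cross-validation repeated 5 times to stabilize performance estimates.
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

fit <- train(dos ~ ., data = train_df,
             method     = "kknn",                # kernel-weighted kNN
             preProcess = c("center", "scale"),  # normalize features
             trControl  = ctrl,
             tuneGrid   = expand.grid(kmax = 5:9, distance = 2,
                                      kernel = c("biweight", "triweight")))

postResample(predict(fit, newdata = test_df), test_df$dos)  # RMSE, R2, MAE
```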
Performance validation across multiple datasets: We assessed model generalization across 12 distinct NP configurations, which served as an additional validation of model robustness. The consistent performance across these configurations (shown in Figs. 4 and 5 and Tables 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12) shows that the models captured underlying physical relationships rather than overfitting to specific datasets.
Fig. 4.
Comparative performance of top models while predicting the DOS for the pristine 0.8 nm MgO NP.
Fig. 5.
Comparative performance of top models while predicting the DOS for the pristine 0.9 nm MgO NP.
Table 3.
Average (mean RMSE and MAE on validation set) prediction error for top 5 models on 0.8 nm MgO NP with 5% Zn doping concentration. Models are sorted in ascending order.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 0.879 | 0.31 | kknn | 0.408 | 0.155 | xgbLinear | 0.999 | 0.99 |
| 2 | RRF | 0.982 | 0.486 | RRF | 0.463 | 0.247 | rf | 0.999 | 0.99 |
| 3 | RRFGlobal | 0.984 | 0.480 | RRFGlobal | 0.463 | 0.247 | RRFGlobal | 0.999 | 0.99 |
| 4 | rf | 0.985 | 0.480 | rf | 0.463 | 0.245 | RRF | 0.999 | 0.99 |
| 5 | knn | 1.038 | 0.362 | knn | 0.504 | 0.199 | kknn | 0.999 | 1 |
Table 4.
Average (mean RMSE and MAE on validation set) prediction error for top 5 models on 0.8 nm MgO NP with 10% Zn doping concentration.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 0.843 | 0.290 | kknn | 0.4 | 0.152 | xgbLinear | 0.999 | 0.99 |
| 2 | RRF | 0.898 | 0.420 | RRF | 0.437 | 0.229 | rf | 0.999 | 1 |
| 3 | RRFGlobal | 0.905 | 0.420 | RRFGlobal | 0.438 | 0.233 | RRFGlobal | 0.999 | 1 |
| 4 | rf | 0.904 | 0.419 | rf | 0.439 | 0.231 | RRF | 0.999 | 1 |
| 5 | knn | 0.98 | 0.324 | knn | 0.477 | 0.196 | kknn | 0.999 | 1 |
Table 5.
Average (mean RMSE and MAE on validation set) prediction error for top 5 models on 0.8 nm MgO NP with 15% Zn doping concentration.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 0.843 | 0.290 | kknn | 0.4 | 0.152 | knn | 0.999 | 1 |
| 2 | RRF | 0.898 | 0.420 | RRF | 0.437 | 0.229 | rf | 0.999 | 1 |
| 3 | RRFGlobal | 0.905 | 0.420 | RRFGlobal | 0.438 | 0.233 | RRFGlobal | 0.999 | 1 |
| 4 | rf | 0.904 | 0.419 | rf | 0.439 | 0.231 | RRF | 0.999 | 1 |
| 5 | knn | 0.98 | 0.324 | knn | 0.477 | 0.196 | kknn | 0.999 | 1 |
Table 6.
Average (mean RMSE and MAE on validation set) prediction error for top 5 models on 0.8 nm MgO NP with 20% Zn doping concentration.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 0.844 | 0.274 | kknn | 0.39 | 0.147 | knn | 0.999 | 1 |
| 2 | RRFGlobal | 0.889 | 0.425 | RRF | 0.431 | 0.233 | rf | 0.999 | 1 |
| 3 | RRF | 0.892 | 0.422 | RRFGlobal | 0.431 | 0.234 | RRFGlobal | 0.999 | 1 |
| 4 | rf | 0.891 | 0.416 | rf | 0.43 | 0.229 | RRF | 0.999 | 1 |
| 5 | knn | 0.948 | 0.337 | knn | 0.463 | 0.207 | kknn | 0.999 | 1 |
Table 7.
Average (mean RMSE and MAE on validation set) prediction error for top 5 models on 0.8 nm MgO NP with 25% Zn doping concentration.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 0.803 | 0.270 | kknn | 0.386 | 0.150 | knn | 0.999 | 1 |
| 2 | RRF | 0.847 | 0.428 | RRF | 0.418 | 0.244 | rf | 0.999 | 1 |
| 3 | RRFGlobal | 0.85 | 0.422 | RRFGlobal | 0.419 | 0.241 | RRFGlobal | 0.999 | 1 |
| 4 | rf | 0.851 | 0.426 | rf | 0.419 | 0.244 | RRF | 0.999 | 1 |
| 5 | knn | 0.918 | 0.301 | knn | 0.457 | 0.182 | kknn | 0.999 | 1 |
Table 8.
Average (RMSE and MAE on validation set) prediction error for top 5 models on 0.9 nm MgO NP with 5% Zn doping concentration. Models are sorted in ascending order.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 1.212 | 0.292 | kknn | 0.495 | 0.141 | knn | 0.999 | 1 |
| 2 | RRF | 1.28 | 0.521 | RRF | 0.526 | 0.253 | RF | 0.999 | 1 |
| 3 | RRFGlobal | 1.281 | 0.512 | RRFGlobal | 0.526 | 0.257 | RRFGlobal | 0.999 | 1 |
| 4 | rf | 1.294 | 0.531 | rf | 0.529 | 0.256 | RRF | 0.999 | 1 |
| 5 | knn | 1.323 | 0.366 | knn | 0.553 | 0.193 | kknn | 0.999 | 1 |
Table 9.
Average (mean RMSE and MAE on validation set) prediction error for top 5 models on 0.9 nm MgO NP with 10% Zn doping concentration.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 1.019 | 0.326 | kknn | 0.49 | 0.176 | knn | 0.999 | 1 |
| 2 | RRF | 1.074 | 0.527 | RRF | 0.514 | 0.298 | RF | 0.999 | 1 |
| 3 | RRFGlobal | 1.079 | 0.523 | RRFGlobal | 0.514 | 0.297 | RRFGlobal | 0.999 | 1 |
| 4 | rf | 1.08 | 0.532 | rf | 0.514 | 0.299 | RRF | 0.999 | 1 |
| 5 | knn | 1.11 | 0.366 | knn | 0.551 | 0.193 | kknn | 0.999 | 1 |
Table 10.
Average (mean RMSE and MAE on validation set) prediction error for top 5 models on 0.9 nm MgO NP with 15% Zn doping concentration.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 0.866 | 0.274 | kknn | 0.418 | 0.154 | knn | 0.999 | 1 |
| 2 | RRF | 0.939 | 0.413 | RRF | 0.445 | 0.234 | RF | 0.999 | 1 |
| 3 | RRFGlobal | 0.942 | 0.419 | RRFGlobal | 0.445 | 0.232 | RRFGlobal | 0.999 | 1 |
| 4 | rf | 0.94 | 0.423 | rf | 0.445 | 0.235 | RRF | 0.999 | 1 |
| 5 | knn | 0.963 | 0.335 | knn | 0.465 | 0.213 | kknn | 0.999 | 1 |
Results and analysis
0.8 nm MgO nanoparticle
The performance of the top 10 models for the 0.8 nm pristine MgO structure is shown in Fig. 4. Additionally, the top 5 models for the doped versions of the 0.8 nm MgO NP are given in Tables 3, 4, 5, 6 and 7. Here, we analyze the results of the top-performing algorithms; results for all models are provided in the supplementary material. Box plots in sub-figures A and C show the distribution of RMSE and MAE on the validation data, and the bar plots in sub-figures B and D indicate the average prediction error on the test data. We note that, irrespective of the doping concentration, the top 5 models across validation and test data are based on non-parametric nearest neighbor and tree approaches, specifically kknn, RRFGlobal, rf, RRF, and knn.

Upon analysis of model performance metrics, the validation dataset yielded RMSE values ranging from 0.93 to 1.29 and MAE values between 0.39 and 0.53 across the evaluated algorithms. The kernel-weighted k-nearest neighbors (kknn) algorithm demonstrates superior predictive capability, achieving minimal error metrics with a mean RMSE of 0.93 and MAE of 0.39 on the validation data, while maintaining good performance on the test set with RMSE and MAE values of 0.25 and 0.15, respectively. The performance distribution exhibits a dichotomy between instance-based learning methods utilizing nearest neighbor regression (kknn, knn) and ensemble methodologies employing decision tree architectures. Furthermore, both unweighted and kernel-weighted nearest neighbor regression implementations consistently exhibit optimal performance across the evaluation metrics. This performance stratification can be attributed to underlying methodological differences: (1) the incorporation of distance-based weighting schemes for neighboring instances, which provides more nuanced predictions than simple arithmetic means, and (2) the hierarchical feature importance determination inherent in tree-based ensemble methods, which enables both feature-specific weighting and prediction smoothing through model aggregation, in contrast to single-model regression approaches.

On the test data, of the 10 models with the lowest average error (RMSE < 0.93), three (cubist, bstTree, and xgbTree) are tree-based models, xgbLinear is a linear regressor, qrf, rf, RRF, and RRFGlobal are ensembles based on rf, and knn and kknn are nearest neighbor methods that predict from the closest neighbors. The rf variants, RRF and RRFGlobal, use a two-step approach that first generates importance scores for all features and then selects features based on those scores. Although all top 10 models achieve RMSE < 0.93, kknn attains the best performance by weighting the DOS values of the nearest neighbors.

Evidently, the same pattern is observed among the Zn-doped MgO NPs, and doping concentration did not impact the performance of the top models. Results for the doped NPs, reported in Tables 3, 4, 5, 6 and 7, show the average RMSE and MAE on the validation and test data, respectively. Results for all models are given in Supplementary Tables S4–S9.
0.9 nm MgO nanoparticle
Figure 5 presents a comprehensive comparative analysis of the predictive performance of the various ML models, evaluated through multiple error metrics (RMSE and MAE) assessed on both validation and test datasets. The predictive performance of the top 5 models on the Zn-doped data is given in Tables 8, 9, 10, 11 and 12. The performance of all models is given in Supplementary Tables S9–S13.
Table 11.
Average (mean RMSE and MAE on validation set) prediction error for top 5 models on 0.9 nm MgO NP with 20% Zn doping concentration.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 0.826 | 0.324 | kknn | 0.415 | 0.173 | knn | 0.999 | 1 |
| 2 | RRFGlobal | 0.866 | 0.524 | RRF | 0.447 | 0.264 | RF | 0.999 | 1 |
| 3 | RRF | 0.867 | 0.522 | RRFGlobal | 0.447 | 0.264 | RRFGlobal | 0.999 | 1 |
| 4 | rf | 0.869 | 0.524 | rf | 0.448 | 0.266 | RRF | 0.999 | 1 |
| 5 | knn | 0.927 | 0.324 | knn | 0.482 | 0.214 | kknn | 0.999 | 1 |
In the validation phase (subplot A), the RMSE distribution exhibits notable heterogeneity across models. The cubist algorithm demonstrates the highest median RMSE with substantial variance, while xgbTree and xgbLinear models display more constrained error distributions, suggesting superior stability in their predictions. The box plots indicate the presence of outliers, particularly in the rf and knn implementations (Table 12).
Table 12.
Average (mean RMSE and MAE on validation set) prediction error for top 5 models on 0.9 nm MgO NP with 25% Zn doping concentration.
| S.no. | Model | Val. RMSE | Test RMSE | Model | Val. MAE | Test MAE | Model | Val. R² | Test R² |
|---|---|---|---|---|---|---|---|---|---|
| 1 | kknn | 0.818 | 0.281 | kknn | 0.413 | 0.166 | xgbLinear | 0.999 | 1 |
| 2 | RRFGlobal | 0.883 | 0.545 | RRF | 0.46 | 0.293 | RF | 0.999 | 1 |
| 3 | RRF | 0.885 | 0.558 | RRFGlobal | 0.46 | 0.300 | RRF | 0.999 | 1 |
| 4 | rf | 0.891 | 0.560 | rf | 0.462 | 0.300 | RRFGlobal | 0.999 | 1 |
| 5 | knn | 0.965 | 0.399 | knn | 0.499 | 0.237 | kknn | 0.999 | 1 |
Test data performance metrics (subplot B) reveal a hierarchical error structure where:
bstTree exhibits the highest average error (1.041).
cubist and xgbTree occupy intermediate positions (0.886 and 0.844 respectively).
The ensemble-based methods (RRF, RRFglobal) demonstrate superior performance with errors clustered around 0.515.
knn variants show the lowest absolute error magnitudes (0.292–0.367).
The MAE analysis (subplots C and D) presents a complementary perspective, with the following key observations: (1) an overall reduction in error magnitudes compared to RMSE, (2) maintenance of the relative model performance rankings, and (3) enhanced discrimination between model performances in the test set evaluation. An interesting observation is the consistently superior performance of the instance-based learners (knn, kknn) when evaluated via MAE, suggesting these algorithms may be more robust to outlier influence in the prediction space. The gradient boosting variants (xgbTree, xgbLinear) maintain intermediate performance across all metrics, indicating reliable but non-optimal generalization characteristics.
Comparative analysis of the prediction landscape for MgO nanoparticle models
We observe consistent preservation of the algorithmic hierarchy across both NP sizes. The predictions share the following similarities: (1) kknn maintains optimal performance, (2) ensemble methods retain intermediate efficiency in both NP sizes, and (3) tree-based methods (bstTree, cubist) demonstrate relatively poorer performance. Furthermore, the MAE/RMSE ratio remains relatively stable within each NP size, indicating consistent error distribution characteristics within a given system. However, for the 0.9 nm NP, the average RMSE increases by roughly 20–30% across models; the performance degradation is more pronounced in tree-based methods, and the variance in validation metrics is larger, particularly for the cubist and qrf implementations. The results suggest that while the relative performance hierarchy of the different ML approaches remains consistent, absolute prediction accuracy demonstrates significant size-dependent characteristics, with larger NPs presenting substantially increased prediction complexity. The differential scaling of error metrics across algorithmic families provides crucial insights for method selection in size-dependent DOS predictions. The results indicate that instance-based learning methods maintain robust performance across NP sizes, while tree-based and gradient boosting approaches show increased sensitivity to system size variations, potentially due to the enhanced complexity in the underlying electronic structure of larger NPs.
Model predictions and physical properties of the material
We analyzed the prediction errors of the wkNN and tree-based cubist models across different energy regions to identify where the models perform optimally or struggle (Table 13). The superior performance of kernel-weighted k-nearest neighbors, particularly with the triweight and biweight kernels, can be attributed to their ability to capture the fundamental physical characteristics of the DOS in doped semiconductor NPs. In the valence energy band (−8 to −5 eV), the wkNN model demonstrated significantly lower prediction errors (RMSE ≤ 0.2) than in other regions. The low prediction error (and high accuracy) agrees with the ground-truth DOS in this region for both the R8 and R9 NPs (Fig. 3). In contrast, the energy bands below −8 eV and above −5 eV showed higher prediction uncertainties, likely due to more complex hybridization effects between Mg, O, and Zn orbitals that create highly localized states. When analyzing doping effects, the prediction error generally decreased with increasing doping concentration (from 5% through 15% to 25%), with the RMSE decreasing more prominently in R9. This trend suggests that as the Zn concentration increases, it becomes easier for the models to learn and predict the DOS, reflecting the underlying contribution of Zn atoms to stabilizing the electronic structure. In comparison to wkNN, the tree-based cubist model showed higher errors in all energy bands (Table 14). This suggests that the discrete nature of decision trees may struggle to capture smooth hybridization effects, which are better modeled by the continuous weighting schemes in wkNN.
Table 13.
RMSE of the WkNN model on various energy regions in the pure and doped MgO NPs.
| Energy band | R8 | R8 (5% Zn) | R8 (15% Zn) | R8 (25% Zn) | R9 | R9 (5% Zn) | R9 (15% Zn) | R9 (25% Zn) |
|---|---|---|---|---|---|---|---|---|
| [− 10, − 8) eV | 0.84 | 0.80 | 0.81 | 0.78 | 1.18 | 1.02 | 0.91 | 0.95 |
| [− 8, − 5) eV | 0 | 0.0000013 | 0.0002 | 0.05 | 0.14 | 0.15 | 0.17 | 0.21 |
| [− 5, 0) eV | 0.41 | 0.42 | 0.48 | 0.31 | 0.38 | 0.42 | 0.36 | 0.35 |
| [0, 2] eV | 0.41 | 0.48 | 0.33 | 0.37 | 0.39 | 0.45 | 0.40 | 0.47 |
Table 14.
RMSE of the tree-based cubist model on various energy regions in the pure and doped MgO NPs.
| Energy band | R8 | R8 (5% Zn) | R8 (15% Zn) | R8 (25% Zn) | R9 | R9 (5% Zn) | R9 (15% Zn) | R9 (25% Zn) |
|---|---|---|---|---|---|---|---|---|
| [− 10, − 8) eV | 1.66 | 1.18 | 1.21 | 1.09 | 2.16 | 1.37 | 1.35 | 1.40 |
| [− 8, − 5) eV | 1.3e−8 | 0.0000035 | 0.03 | 0.08 | 0.23 | 0.27 | 0.39 | 0.45 |
| [− 5, 0) eV | 0.99 | 0.93 | 1.31 | 0.85 | 1.01 | 1.07 | 1.09 | 1.006 |
| [0, 2] eV | 1.41 | 1.44 | 0.93 | 1.01 | 1.03 | 0.78 | 0.92 | 1.44 |
The differential scaling of prediction errors across algorithmic families provides not only methodological insights but also reveals underlying physical characteristics of the electronic structure. The increased prediction complexity in larger NPs directly reflects the fundamental quantum mechanical principle that as quantum confinement decreases, the DOS becomes more continuous with finer features. Instance-based methods like wkNN maintain robust performance across NP sizes because their local averaging approach naturally adapts to this increasing state density, while tree-based methods struggle with the more continuous nature of larger systems’ electronic structures. This observation aligns with the understanding that electronic states in semiconductor NPs transition from discrete, molecule-like levels to more continuous, bulk-like bands as NP size increases.
Conclusion
This study presents an integrated computational approach that combines DFTB calculations with ML techniques to predict the density of states (DOS) in pristine and Zn-doped MgO NPs. Through the systematic evaluation of a diverse set of ML models, we identified the weighted k-nearest neighbor (wkNN) algorithm, particularly with triweight and biweight kernels, as the most effective predictor of electronic properties across varying doping concentrations (5–25%) and NP sizes (0.8 nm and 0.9 nm). Our analysis reveals that the superior performance of wkNN algorithms is not merely empirical but is fundamentally connected to the physical nature of the electronic structure in doped nanomaterials. The distance-weighting approach inherently mirrors quantum mechanical principles whereby states with similar energies exhibit similar characteristics, and the smooth kernel functions effectively capture the continuous nature of electronic interactions. This alignment between algorithm design and underlying physics suggests promising directions for the further development of physics-informed ML approaches in materials science. The results demonstrate that Zn doping minimally affects prediction accuracy, highlighting the robustness of the ML models in capturing electronic structure variations in doped nanomaterials.
Moreover, our findings emphasize the computational efficiency of ML-augmented DFTB modeling, which significantly reduces the computational cost associated with electronic structure calculations while maintaining high predictive accuracy. This hybrid methodology provides a scalable framework for accelerating materials discovery by enabling rapid screening of electronic properties in complex systems. The insights derived from this study have direct implications for the design and optimization of functional nanomaterials for applications in catalysis, energy storage, and optoelectronic devices. Future research directions include extending the framework to other material compositions, incorporating additional quantum mechanical descriptors, and refining the ML models to enhance transferability across diverse material classes. It is also worth noting that the ML models in this study were trained to predict DOS values derived from DFTB calculations. While hybrid DFT methods (e.g., B3LYP, HSE06) offer higher accuracy, our focus was to emulate DFTB-level predictions efficiently; future work may involve training models on hybrid DFT datasets to further enhance predictive fidelity.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Acknowledgements
The numerical calculations reported were partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Centre (TRUBA resources), Türkiye.
Author contributions
H.K., P.S., M.M.D., and M.K. contributed to the conceptualization and methodology of the study. H.K. and M.K. supervised the research and provided critical insights on the study design. P.S. performed the machine learning modeling, data analysis, and algorithm optimization. M.K. conducted the DFTB calculations and prepared the computational datasets. H.K., M.K. and P.S. wrote the original draft of the manuscript, while M.M.D. contributed to the review and editing process. P.S., M.M.D., and M.K. were responsible for data visualization and figure preparation. All authors discussed the results, revised the manuscript, and approved the final version for submission.
Data availability
The data supporting the findings of this study are available from the corresponding authors upon reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Hasan Kurban, Email: hkurban@hbku.edu.qa.
Mustafa Kurban, Email: kurbanm@ankara.edu.tr, Email: mkurbanphys@gmail.com.
References
- 1. Hornak, J. Synthesis, properties, and selected technical applications of magnesium oxide nanoparticles: A review. Int. J. Mol. Sci. 22, 12752. 10.3390/ijms222312752 (2021).
- 2. Dabhane, H. et al. MgO nanoparticles: synthesis, characterization, and applications as a catalyst for organic transformations. Eur. J. Chem. 12, 86–108. 10.5155/eurjchem.12.1.86-108.2060 (2021).
- 3. Tharani, K., Jegatha Christy, A., Sagadevan, S. & Nehru, L. C. Fabrication of magnesium oxide nanoparticles using combustion method for a biological and environmental cause. Chem. Phys. Lett. 763, 138216. 10.1016/j.cplett.2020.138216 (2021).
- 4. Kant, R., Agarwal, Y. K., Kumar, K., Bansal, S. & Kaul, S. Superior dielectric behaviour and band gap tuning of Zn doped MgO nanoparticles. Mater. Technol. 37, 3017–3024. 10.1080/10667857.2022.2110795 (2022).
- 5. Yathisha, R. O. et al. Investigation the influence of Zn2+ doping on the photovoltaic properties (DSSCs) of MgO nanoparticles. J. Mol. Struct. 1217, 128407. 10.1016/j.molstruc.2020.128407 (2020).
- 6. Mohan, A. C. et al. Multifaceted properties of Ni and Zn codoped MgO nanoparticles. Sci. Rep. 14, 32067. 10.1038/s41598-024-83779-5 (2024).
- 7. van Mourik, T., Bühl, M. & Gaigeot, M. P. Density functional theory across chemistry, physics and biology. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 372, 20120488. 10.1098/rsta.2012.0488 (2014).
- 8. Hasnip, P. J. et al. Density functional theory in the solid state. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 372, 20130270. 10.1098/rsta.2013.0270 (2014).
- 9. Cohen, A. J., Mori-Sánchez, P. & Yang, W. Insights into current limitations of density functional theory. Science 321, 792–794. 10.1126/science.1158722 (2008).
- 10. Cohen, A. J., Mori-Sánchez, P. & Yang, W. Challenges for density functional theory. Chem. Rev. 112, 289–320. 10.1021/cr200107z (2012).
- 11. Verma, P. & Truhlar, D. G. Status and challenges of density functional theory. Trends Chem. 2, 302–318. 10.1016/j.trechm.2020.02.005 (2020).
- 12. Dawson, W. et al. Density functional theory calculations of large systems: interplay between fragments, observables, and computational complexity. WIREs Comput. Mol. Sci. 12. 10.1002/wcms.1574 (2022).
- 13. Fiedler, L. et al. Predicting electronic structures at any length scale with machine learning. NPJ Comput. Mater. 9, 115. 10.1038/s41524-023-01070-z (2023).
- 14. Kurban, M., Polat, C., Serpedin, E. & Kurban, H. Enhancing the electronic properties of TiO2 nanoparticles through carbon doping: an integrated DFTB and computer vision approach. Comput. Mater. Sci. 244, 113248. 10.1016/j.commatsci.2024.113248 (2024).
- 15. Polat, C., Kurban, M. & Kurban, H. Multimodal neural network-based predictive modeling of nanoparticle properties from pure compounds. Mach. Learn. Sci. Technol. 5, 045062. 10.1088/2632-2153/ad9708 (2024).
- 16. Xie, X., Persson, K. A. & Small, D. W. Incorporating electronic information into machine learning potential energy surfaces via approaching the ground-state electronic energy as a function of atom-based electronic populations. J. Chem. Theory Comput. 16, 4256–4270. 10.1021/acs.jctc.0c00217 (2020).
- 17. Nandi, A., Qu, C., Houston, P. L., Conte, R. & Bowman, J. M. Δ-machine learning for potential energy surfaces: A PIP approach to bring a DFT-based PES to CCSD(T) level of theory. J. Chem. Phys. 154. 10.1063/5.0038301 (2021).
- 18. del Rio, B. G., Phan, B. & Ramprasad, R. A deep learning framework to emulate density functional theory. NPJ Comput. Mater. 9, 158. 10.1038/s41524-023-01115-3 (2023).
- 19. Fung, V., Ganesh, P. & Sumpter, B. G. Physically informed machine learning prediction of electronic density of states. Chem. Mater. 34, 4848–4855. 10.1021/acs.chemmater.1c04252 (2022).
- 20. Ishikawa, A. Machine-learning descriptor search on the density of states profile of bimetallic alloy systems and comparison with the d-band center theory. J. Comput. Chem. 45, 1682–1689. 10.1002/jcc.27360 (2024).
- 21. Al-Fahdi, M., Lin, C., Shen, C., Zhang, H. & Hu, M. Rapid prediction of phonon density of states by crystal attention graph neural network and high-throughput screening of candidate substrates for wide bandgap electronic cooling. Mater. Today Phys. 50, 101632. 10.1016/j.mtphys.2024.101632 (2025).
- 22. Kong, S. et al. Density of states prediction for materials discovery via contrastive learning from probabilistic embeddings. Nat. Commun. 13, 949. 10.1038/s41467-022-28543-x (2022).
- 23. Sun, W. et al. Machine learning enhanced DFTB method for periodic systems: learning from electronic density of states. J. Chem. Theory Comput. 19, 3877–3888. 10.1021/acs.jctc.3c00152 (2023).
- 24. Kirschbaum, T., von Seggern, B., Dzubiella, J., Bande, A. & Noé, F. Machine learning frontier orbital energies of nanodiamonds. J. Chem. Theory Comput. 19, 4461–4473. 10.1021/acs.jctc.2c01275 (2023).
- 25. Elstner, M. & Seifert, G. Density functional tight binding. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 372, 20120483. 10.1098/rsta.2012.0483 (2014).
- 26. Goyal, P. et al. Molecular simulation of water and hydration effects in different environments: challenges and developments for DFTB based models. J. Phys. Chem. B 118, 11007–11027. 10.1021/jp503372v (2014).
- 27. Panosetti, C., Anniés, S. B., Grosu, C., Seidlmayer, S. & Scheurer, C. DFTB modeling of lithium-intercalated graphite with machine-learned repulsive potential. J. Phys. Chem. A 125, 691–699. 10.1021/acs.jpca.0c09388 (2021).
- 28. Chang, C. & Medford, A. J. Application of density functional tight binding and machine learning to evaluate the stability of biomass intermediates on the Rh(111) surface. J. Phys. Chem. C 125, 18210–18216. 10.1021/acs.jpcc.1c05715 (2021).
- 29. Hourahine, B. et al. DFTB+, a software package for efficient approximate density functional theory based atomistic simulations. J. Chem. Phys. 152, 124101. 10.1063/1.5143190 (2020).
- 30. Hestenes, M. R. & Stiefel, E. Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49, 409. 10.6028/jres.049.044 (1952).
- 31. Gaus, M., Goez, A. & Elstner, M. Parametrization and benchmark of DFTB3 for organic molecules. J. Chem. Theory Comput. 9, 338–354. 10.1021/ct300849w (2013).
- 32. Lu, X., Gaus, M., Elstner, M. & Cui, Q. Parametrization of DFTB3/3OB for magnesium and zinc for chemical and biological applications. J. Phys. Chem. B 119, 1062–1082. 10.1021/jp506557r (2015).
- 33. Wang, J., Tu, Y., Yang, L. & Tolner, H. Theoretical investigation of the electronic structure and optical properties of zinc-doped magnesium oxide. J. Comput. Electron. 15, 1521–1530. 10.1007/s10825-016-0906-2 (2016).
- 34. Sharma, U. & Jeevanandam, P. Synthesis of Zn2+-doped MgO nanoparticles using substituted brucite precursors and studies on their optical properties. J. Solgel Sci. Technol. 75, 635–648. 10.1007/s10971-015-3734-0 (2015).
- 35. Mohd Saidi, N. S., Badar, N., Mohd Yusoff, H. & Elong, K. Green combustion synthesis of magnesium oxide nanoparticles and doped variants (Zn, Sn, Ti) using Persicaria odorata leaves extract: a comprehensive study on synthesis, characterization and band gap analysis. Preprint at SSRN. 10.2139/ssrn.4797098 (2024).
- 36. Hanwell, M. D. et al. Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. J. Cheminform. 4, 17. 10.1186/1758-2946-4-17 (2012).
- 37. Breiman, L. Random forests. Mach. Learn. 45, 5–32. 10.1023/A:1010933404324 (2001).
- 38. Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232. 10.1214/aos/1013203451 (2001).
- 39. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. 10.1145/2939672.2939785 (ACM, 2016).
- 40. Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297. 10.1007/BF00994018 (1995).
- 41. Ripley, B. D. Pattern Recognition and Neural Networks. 10.1017/CBO9780511812651 (Cambridge University Press, 1996).
- 42. Friedman, J. H. Multivariate adaptive regression splines. Ann. Stat. 19, 1–67. 10.1214/aos/1176347963 (1991).
- 43. Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. Least angle regression. Ann. Stat. 32, 407–499. 10.1214/009053604000000067 (2004).
- 44. Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22. 10.18637/jss.v033.i01 (2010).
- 45. Hechenbichler, K. & Schliep, K. Weighted k-nearest-neighbor techniques and ordinal classification. http://epub.ub.uni-muenchen.de/ (2004).