Abstract
Accurate forecasting of energy consumption has emerged as a critical requirement in the evolution of sustainable and intelligent transportation systems. Accurate forecasts help reduce fuel costs, lower carbon emissions, and maintain optimal vehicle performance. Existing studies present various machine learning and deep learning models built on diverse feature sets; however, they do not exploit state-of-the-art transformer architectures. This study combines operational and environmental feature sets using the Feature Tokenizer Transformer (FT-Transformer). The proposed model applies a feature tokenizer to learn both feature sets through a self-attention mechanism, and we benchmark it against a range of machine learning methods and advanced neural architectures. The empirical analysis demonstrates that the proposed model achieves the best predictive results, with the lowest mean absolute error of 0.16, root mean square error of 0.21, and an R² value of 0.99, compared with the latest models in the relevant literature. In addition, we apply explainable AI (XAI) techniques that describe how the proposed model generates its outputs, helping to understand the factors influencing predictions and decisions. The XAI methods SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) reveal the significance of individual features and their role in the overall prediction.
Keywords: Energy forecasting, Logistic supply, Deep learning, Transformers, Regression models, Explainable artificial intelligence.
Subject terms: Energy science and technology, Engineering, Mathematics and computing
Introduction
The logistics sector plays an important role in supporting global supply chains, helping to manage the smooth flow of goods and services. This management influences economic development as well as global energy consumption and greenhouse gas emissions1. With industrial activity expanding daily, logistics providers face the hurdle of managing goods while ensuring their services align with energy efficiency, timely deliveries, and cost-effectiveness2. Addressing energy consumption in logistics is an active research area because of its direct impact on operational costs and on the environmental factors underlying climate goals3. According to a recent study5, transportation is one of the most significant contributors to climate change, responsible for close to 30% of worldwide energy requirements and close to a quarter of carbon dioxide (CO2) emissions. More than 70% of all transport emissions are caused by road vehicles4, and the environmental harm is intensified by energy waste and the use of fossil fuels5. The International Energy Agency (IEA) estimates that improving vehicle-fleet energy efficiency, coupled with the switch to electric mobility, could cut transport-based CO2 emissions by as much as 60% by 20506. Regression-based models have also received attention because they can predict continuous energy-related variables, which is helpful when making decisions in this energy-sensitive sector7.
In recent years, Artificial Intelligence (AI) has emerged as a powerful tool for addressing such challenges. Machine Learning (ML) and Deep Learning (DL) methods are increasingly used in predictive analytics to enable accurate prediction of fuel consumption and overall energy requirements8. By integrating AI, logistics companies can lower fuel expenses through improved real-time scheduling, which reduces delays and supports sustainability goals in an energy-constrained market7. Despite these advances, traditional methods such as statistical models and rule-based approaches, although extensively used, often fail to capture nonlinear dynamics9. These patterns depend on variations in vehicle type, traffic intensity, cargo load, and environmental conditions. Consequently, traditional methods provide limited accuracy and adaptability for modern logistics energy prediction10. The main aim of this study is to develop and analyze predictive models of logistics energy consumption using advanced ML and DL techniques, focusing on the advantages of neural and transformer-based architectures over traditional regression methods.
The key contributions of this study are as follows:
Construction of two diverse, feature-engineered datasets that integrate operational and environmental features drawn from real-world energy consumption data.
Application of the proposed FT-Transformer model, achieving state-of-the-art predictive performance with the lowest MAE of 0.16 and RMSE of 0.21.
Incorporation of explainable AI techniques, including SHAP and LIME, to interpret model outputs and identify significant features.
The remainder of this paper is organized as follows: Sect. “Related work” reviews existing literature on energy prediction. Section “Proposed research methodology” describes the dataset, preprocessing techniques, and methodology. Section “Results and discussion” presents experimental results, comparative analysis, and discussions on deployability and sustainability. Finally, Sect. “Conclusion” concludes the study with key findings and future research directions.
Related work
This literature survey covers the latest developments in ML- and transformer-based frameworks for energy consumption prediction in logistics11 under environmental and operational constraints12. The coverage spans regression models, DL architectures, hybrid models, and applications in logistics, transportation, and warehouses.
Leng et al.13 introduced a deep reinforcement learning framework for decision-making in Industry 4.0 manufacturing. Although not focused on energy regression, the study illustrated how reinforcement learning can optimize operational workflows in printed-circuit-board production, highlighting broader potential for energy-efficient scheduling strategies. The work's relevance lies in its use of intelligent models within industrial logistics contexts. Farzaneh et al.14 surveyed DL and AI applications in smart buildings for energy efficiency. They discussed how ANNs, hybrid models, and rule-based systems can enhance HVAC and energy control systems. Their findings underscored that while neural networks improve predictive performance, they often lack interpretability, motivating the need for explainable approaches such as SHAP and LIME in energy modeling. Abid et al.15 provided a survey of ML algorithms applied to forest fire prediction and detection. While focused on combustion prediction rather than energy, the review emphasized model generalization and feature selection techniques, principles that translate to robust regression models in energy forecasting domains.
Prabhu et al.16 evaluated various ML models, including LR, RF, and gradient boosting, to predict energy consumption in manufacturing processes across sectors. They found that ensemble regressors significantly reduce predictive error compared with single models. Ribeiro et al.17 compared ML and DL methods for short- and very-short-term energy load forecasting in warehouses and demonstrated that DL models outperform traditional regressors when sufficient temporal data are available, achieving lower MAE/RMSE scores. In a comparative analysis, Ullah et al.18 applied ML algorithms to electric vehicle energy consumption forecasting under different operating conditions. Their findings emphasized the criticality of algorithm selection and of data preparation and preprocessing, particularly feature scaling, for accuracy and robustness.
Mhlanga et al.19 reviewed existing AI- and ML-based energy demand forecasting methods in emerging markets. Their emphasis on data-limited settings and on the explainability of hybrid methods aligns with our use of interpretable models such as the FT-Transformer and MLP. Mansoursamaei et al.20 focused on ML for energy consumption in port logistics; tree-based ensemble regressors provided robustly high accuracy (R² > 0.90), confirming the usefulness of RF and boosting techniques in logistics-specific energy modeling. Wang et al.21 proposed an ensemble model that integrates K-means clustering and LSTM to represent spatio-temporal patterns in metropolitan delivery energy usage. Their approach achieved lower MAE/RMSE than a standalone LSTM, demonstrating that hybrid architectures can boost prediction performance significantly. A broad review by Eddaoudi et al.22 covering many energy domains reported that ensemble methods and neural networks outperformed alternatives, with MAE < 1 and R² ≈ 0.95 on tabular energy data. Zhang et al.23 provided a systematic review of ML for urban electric vehicle modeling. They stressed model interpretability and the promise of integrating categorical data representations into regression-oriented neural architectures, presaging transformer-based solutions for tabular data such as the FT-Transformer.
Alshdadi et al.24 described a mechanism to forecast loads in logistics planning through real-time IoT, combining sensor data streams with ML models to estimate energy demand. They reported energy-load forecast accuracy above 98%, confirming the potential of combined IoT-ML frameworks in logistics. Roozycki et al.25 reviewed energy-aware ML practices, emphasizing the environmental impact of AI and sustainable modeling techniques, including pruning, quantization, and attention-aware architectures. Their conclusions on Green AI informed our selection of the FT-Transformer with respect to both accuracy and energy-efficient operation. Hussain et al.26 conducted a comprehensive comparative analysis of 11 ML models for forecasting EV load energy consumption on Colorado datasets, including XGB, CatBoost, MLP, SVR, and Extra Trees. They obtained the highest R² of ~0.9592 and an RMSE of ~1.8078 using Extra Trees, setting a strong benchmark against which our models are compared. Table 1 shows a summary of existing studies predicting energy consumption in logistics under environmental and operational constraints.
Table 1.
Summary of existing studies predicting energy consumption.
| References | Year | Model(s) | Domain | Research area | Results |
|---|---|---|---|---|---|
| 14 | 2021 | ANN, Hybrid DL | Smart building energy management | Building energy efficiency | R² : 0.9 |
| 15 | 2021 | SVM, RF | Forest fire detection | ML and DL | MAE: 4.550 |
| 16 | 2022 | Random Forest, XGBoost | Industrial manufacturing | Energy consumption | MAE: 4.55, R²: 0.89 |
| 17 | 2022 | SVR, RF, XGBoost, LSTM | Warehouse load forecasting | Short-term load forecasting | MAE:5.67 |
| 18 | 2022 | Multiple ML regressors | EV energy consumption | EV consumption modeling | R²: 0.90 |
| 20 | 2023 | Ensemble tree-based | Port logistics | Logistics energy regression | R² : 0.90 |
| 21 | 2024 | K-means + LSTM hybrid | Urban logistics vehicles | Spatio-temporal modeling | RMSE:3.455 |
| 22 | 2024 | Review of ML & neural nets | Buildings, transport | General energy forecasting | MAE:1, R² : 0.95 |
| 24 | 2025 | ML + IoT load forecasting | Logistics IoT sensor data | Real-time load prediction | ≈ 98% accuracy |
| 25 | 2025 | Review of green ML practices | ML energy footprint | Sustainable AI | MAE: 0.6888, RMSE: 1.78, R²: 0.9 |
| 26 | 2025 | Extra Trees, MLP, XGBoost | EV charging demand (Colorado) | Vehicle energy prediction | MAE: 0.5888, RMSE: 1.8078 |
Proposed research methodology
This part describes the stepwise strategy taken to forecast energy consumed (energy_consumed_kWh) in a logistics delivery environment. The approach includes dataset comprehension, preprocessing strategies, feature selection and engineering, model selection and training, and explainability techniques. The ultimate intent is to develop accurate and interpretable regression models able to predict energy utilization from different operational and environmental characteristics. The constituent steps of the process are detailed to make the ML pipeline reproducible and comprehensible. The framework diagram of the proposed methodology is shown in Fig. 1.
Fig. 1.
The framework diagram of proposed methodology.
Dataset description
This study utilized two datasets for experimentation to ensure model robustness and generalizability.
Energy-aware logistics scheduling dataset (dataset 1)
The framework is tested on the Energy-Aware Logistics Scheduling Dataset available on Kaggle, which contains detailed logs of logistics processes. The dataset includes variables such as vehicle load, distance covered, speed, road type, and the amount of energy used. These features were chosen for their ability to capture both environmental variables (e.g., route length, load weight) and operational ones (e.g., vehicle usage, scheduling requirements) that are especially important for examining energy consumption patterns in logistics.
The dataset comprises roughly 50,000 records gathered from various logistics tasks, including delivery routes, scheduling assignments, and vehicle usage cycles. Energy consumption is expressed in standard units (kWh), which allows fair comparison between models. The variety of operational conditions enables the framework to learn consumption dynamics for both short-haul and long-haul operations. For preprocessing, missing data were addressed with mean imputation, categorical data (where present) were one-hot encoded, and all numerical data were normalized to values between 0 and 1 for compatibility with DL models. The data were divided into 70% training, 15% validation, and 15% testing splits for performance assessment. This description provides transparency regarding the data source, structure, and preparation methods, which further enhances the validity of the reported findings.
EV energy consumption dataset (dataset 2)
The EV Energy Consumption Dataset provides a comprehensive set of measurements of electric vehicle (EV) energy consumption in various scenarios. It was curated to facilitate research into energy efficiency optimization, predictive modeling, and the development of sustainable transportation systems. The dataset's primary focus is the target variable "Energy Consumption (kWh/100km)," which gives a clear indication of how effectively an EV uses energy under distinct driving conditions. The dataset includes multiple vehicle attributes: vehicle type (trucks, buses, SUVs, and sedans) along with fuel type, covering both electric and hybrid options. To reflect the impact of cargo on energy performance, the dataset additionally provides load bins that classify the weight carried by the vehicles. Driving behavior and road conditions are another crucial component: average driving speed, total distance traveled, and road type offer insight into how roadways affect energy usage, and power consumption is also recorded, highlighting the relationship between speed, road conditions, and energy demand. Environmental and operational factors are equally important; variables such as ambient temperature and humidity are provided because they can significantly affect the performance and efficiency of EV batteries. By combining these environmental parameters with vehicle and driving attributes, the dataset offers a comprehensive view of the dynamics of EV energy consumption.
Data preprocessing
Several preprocessing steps were implemented to standardize the data prior to feeding them into the ML pipeline, ensuring the quality and reliability of model predictions. These steps were data cleaning, categorical encoding, feature scaling, and dataset splitting27. The first step of the analysis confirmed that the dataset had no missing values; hence, imputation methods were not necessary. The Interquartile Range (IQR) technique was used to screen for potential outliers28. The IQR is computed as in Eq. (1).
$$\mathrm{IQR} = Q_3 - Q_1 \quad (1)$$
Any value falling outside the acceptable range is flagged as an outlier using Eq. (2).
$$x < Q_1 - 1.5\,\mathrm{IQR} \quad \text{or} \quad x > Q_3 + 1.5\,\mathrm{IQR} \quad (2)$$
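As an illustration, the IQR rule of Eqs. (1) and (2) can be sketched in a few lines of NumPy; the sample values below are hypothetical, not drawn from the dataset:

```python
import numpy as np

def iqr_outlier_mask(x, k=1.5):
    """Flag outliers per Eqs. (1)-(2): IQR = Q3 - Q1; a value is an
    outlier if x < Q1 - k*IQR or x > Q3 + k*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

values = np.array([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier
mask = iqr_outlier_mask(values)
```

Flagged rows can then be dropped or capped before scaling, depending on the downstream model's sensitivity to extreme values.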
Among the features in the dataset, the categorical features are vehicle_type, fuel_type, and road_type. Label Encoding was applied to these variables to convert categorical labels into a numeric representation. For example, a label encoder maps the input {"Diesel", "Electric", "Petrol"} to the output {0, 1, 2}29.
The continuous features cargo_weight_kg, route_distance_km, and ambient_temperature_c were standardized to guarantee an equal contribution of each feature to the model. Standardization is given by Eq. (3).
$$z = \frac{x - \mu}{\sigma} \quad (3)$$
where x is the raw value, \(\mu\) is the mean of the feature, and \(\sigma\) is its standard deviation. To test the generalizability of the model, the dataset was split with an 80/20 train-test split. In addition, 5-fold cross-validation was used during training to ensure that the model performs well across data subsets. This method minimizes overfitting by training on various partitions and averaging the performance metrics30. Data dimensions across processing stages are displayed in Table 2. The increase in the number of columns after preprocessing occurs because categorical variables are converted into multiple binary columns through one-hot encoding, which expands the feature space. Additionally, during feature engineering, new derived features such as ratios or normalized values are created to capture deeper relationships among existing variables. As a result, the dataset grows from 12 columns in its raw form to 15 after encoding, and finally to 18 after feature engineering. This dimensional increase enriches the dataset, providing the FT-Transformer with more informative inputs for regression analysis.
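The encoding, standardization (Eq. (3)), and 80/20 split described above might be sketched as follows; the column values are synthetic stand-ins for the dataset's fields, not real records:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for two of the dataset's columns (hypothetical values).
fuel_type = np.array(["Diesel", "Electric", "Petrol", "Diesel", "Electric"] * 20)
cargo_weight_kg = rng.uniform(100, 2000, size=100)

# Label encoding: map each category to an integer code (Diesel -> 0, ...).
categories, fuel_encoded = np.unique(fuel_type, return_inverse=True)

# Standardization per Eq. (3): z = (x - mu) / sigma.
z = (cargo_weight_kg - cargo_weight_kg.mean()) / cargo_weight_kg.std()

# 80/20 train-test split on shuffled indices.
idx = rng.permutation(len(z))
split = int(0.8 * len(z))
train_idx, test_idx = idx[:split], idx[split:]
```

In practice the same transformations are typically fitted on the training split only and then applied to the test split, to avoid leakage.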
Table 2.
Data dimensions across processing stages.
| Stage | Description | Data dimension |
|---|---|---|
| Raw dataset | Original dataset with continuous + categorical vars | 10,000 × 12 |
| After cleaning & encoding | Missing values handled; one-hot encoding creates additional features | 10,000 × 15 |
| After feature engineering | Derived features (e.g., energy_per_km) added | 10,000 × 18 |
| Train-test split | 80% train, 20% test | Train: 8,000 × 18 Test: 2,000 × 18 |
| Final input to models | Scaled numerical features | N × 18 |
In the PCA plot shown in Fig. 2, the raw data indicate that the feature space is highly dispersed, with overlap between clusters. This implies the presence of noise, a lack of common scale, and potential redundancy among features, which may obscure significant patterns needed for model training. After preprocessing (scaling, normalization, and outlier treatment), the PCA plot depicts a tighter distribution. The clusters are better aligned, more separable, and on a common feature scale. The t-SNE graph of the raw data shows scattered points without distinct grouping by data type or pattern, as shown in Fig. 3. High-dimensional variance and feature imbalance result in noisy separations. After preprocessing, the t-SNE visualization preserves local and global structure better: the clusters gain clarity, indicating that preprocessing exposed the dataset's intrinsic relationships in the lower-dimensional space.
Fig. 2.
PCA plot- input data before preprocessing and post processing.
Fig. 3.
t-SNE plot- input data before and post processed.
The PCA plots give a linear dimensionality-reduction perspective on the data before and after preprocessing. The data points in the pre-preprocessing PCA plot are more irregular and spread out, since the raw features vary in scale and variance. The post-processing PCA plot presents a better-balanced distribution of points after normalization. This implies that scaling has equalized feature magnitudes, enabling the major directions of variance to be represented more reliably. The post-processing PCA in Fig. 4 shows that no single feature dominates because of scale differences, which is a prerequisite for reliable model training. The t-SNE visualizations provide a nonlinear mapping of high-dimensional features to a two-dimensional space that preserves local similarities among data points. The pre-preprocessing t-SNE plot in Fig. 5 presents loosely drawn clusters with overlapping borders, so separating the clusters is difficult. Conversely, the post-processing t-SNE plot shows more distinct clusters with less overlap, indicating that preprocessing improved the feature-space representation.
Fig. 4.
PCA energy consumption input data before and post processed.
Fig. 5.
t-SNE Plot of Energy consumption Input Data before and post processed.
This indicates that preprocessing enhances the ability of machine learning models to identify meaningful structure in the data by removing noise.
The density plots shown in Fig. 6 demonstrate how each feature is distributed before and after preprocessing. In the pre-preprocessing density plots, the features vary in scale and skewness, and the ranges of some variables are very large compared with others. This skew can adversely affect neural networks and distance-based models. The post-processing density plots demonstrate that all features cluster around zero with unit variance, reflecting successful normalization. This consistency guarantees that all features play an equal role during model training and prevents features with larger values from dominating.
Fig. 6.
Density plots of all features and energy_consumption_kwh (before and after processing).
Feature selection and engineering
Feature selection and engineering are considered in tandem to enhance the clarity of the models and their results: selecting the most pertinent input variables and reducing raw data into meaningful features provides a clearer understanding of the relationships underlying predictive modeling. Mutual Information (MI) and correlation coefficient analysis were used to evaluate feature significance31. The Mutual Information of two variables X and Y measures how much information one variable carries about the other, defined in Eq. (4).
$$I(X;Y) = \sum_{x \in X}\sum_{y \in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \quad (4)$$
where \(p(x,y)\) is the joint probability distribution function of X and Y, and \(p(x)\) and \(p(y)\) are the marginal probability distributions of X and Y. In addition, the Pearson correlation coefficient was applied to identify linear relationships between numerical attributes and the target variable, as in Eq. (5)32.
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} \quad (5)$$
where \(x_i\) and \(y_i\) are individual sample points and \(\bar{x}\) and \(\bar{y}\) are the means of x and y. Features that scored high in MI and had a high correlation were kept, while features with low variance or multicollinearity were dropped with the help of the Variance Inflation Factor (VIF), computed as Eq. (6).
$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \quad (6)$$
where \(R_j^2\) is the coefficient of determination of a regression of feature j on the rest of the features. Features with a VIF greater than 10 were regarded as highly collinear and omitted.
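A minimal NumPy sketch of the correlation and VIF screening of Eqs. (5) and (6) follows; the MI score of Eq. (4) is omitted here because in practice a library routine (e.g., scikit-learn's `mutual_info_regression`) would typically be used. The toy features are synthetic, with `b` deliberately made nearly collinear with `a`:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, Eq. (5)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def vif(X):
    """Variance Inflation Factor per feature, Eq. (6): VIF_j = 1/(1 - R_j^2),
    where R_j^2 comes from regressing feature j on the remaining features."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])        # add intercept
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.01 * rng.normal(size=200)   # nearly collinear with a
c = rng.normal(size=200)              # independent feature
X = np.column_stack([a, b, c])
```

On this toy matrix, `a` and `b` both exceed the VIF threshold of 10 (one of them would be dropped), while `c` stays near 1.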
New features were constructed to enhance the expressiveness of the models. For instance, Speed Efficiency was calculated as route_distance_km / avg_speed_kmph, as in Eq. (7).
$$\text{speed\_efficiency} = \frac{\text{route\_distance\_km}}{\text{avg\_speed\_kmph}} \quad (7)$$
The hidden connections between delivery performance and operational restrictions were captured through these engineered features, which promoted model learning33.
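The engineered feature of Eq. (7) amounts to a single vectorized division over the two operational columns; the values below are illustrative only:

```python
import numpy as np

# Hypothetical operational columns (names follow the paper's dataset schema).
route_distance_km = np.array([120.0, 45.0, 300.0])
avg_speed_kmph = np.array([60.0, 45.0, 75.0])

# Engineered feature per Eq. (7).
speed_efficiency = route_distance_km / avg_speed_kmph
```

The same pattern extends to other derived ratios (e.g., energy_per_km in Table 2), each added as an extra column before scaling.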
Model design and selection
The main goal of the study is to design and compare state-of-the-art ML models that can accurately predict energy consumption from multiple operational, temporal, and categorical features. This stage covered the selection, configuration, and training of models based on standardized validation strategies34. To ensure robustness and generalizability of the results, each model was trained using stratified k-fold cross-validation and measured with identical performance metrics. Hyperparameter tuning, regularization methods, and interpretability via model-agnostic explainers such as LIME and SHAP were part of the development pipeline. The tuned models were optimized to strike a good trade-off between predictive performance and computational cost35.
Baseline model
This research used several baseline ML models to create a comparative framework for energy consumption prediction. LR is the simplest baseline deployed, representing linear correlations between features and the target variable. DTs offer an interpretable, rule-based approach to modeling non-linear interactions, and RF, an ensemble of decision trees, increases robustness and decreases overfitting through bootstrap aggregation. XGB is a gradient-boosted decision-tree algorithm that provides better accuracy and scalability by optimizing residual errors36. Similarly, CatBoost, also a gradient-boosting technique, is useful for handling categorical variables and reducing overfitting via ordered boosting. Among more sophisticated architectures, the MLP, a feedforward artificial neural network, has proven regression capability on structured data. Together, these baselines enabled a complete comparison of traditional, ensemble, and neural techniques. The baseline models were also useful for early estimates of feature importance and for detecting the linear and non-linear nature of the data at hand. Their results informed the choice and fine-tuning of more sophisticated models in the next stage of development27.
Proposed models
Based on the baseline comparisons, this paper proposes the advanced DL model FT-Transformer as the solution for predicting energy consumption in logistics activities. The model was chosen because it can capture deep nonlinear interactions and long-range dependencies in structured tabular data. The proposed model architecture, with its layers, units, and activations, is displayed in Table 3.
Table 3.
Proposed model architecture: layers, units, activations, and components.
| Layer | Configuration |
|---|---|
| Input layer | ( d = 128 ) |
| Embedding layer | 128 per feature |
| Feature Tokenization | 128 |
| Transformer encoder block | 4 Attention Heads, Hidden dim = 128 – ReLU |
| Multi-head attention | 4 Heads, Key = Query = Value dim = 128 |
| Feed-forward network (FFN) | Hidden dim = 128 – ReLU |
| Dropout layer | 0.1 |
| Stacked transformer blocks | 2 Blocks |
| Dense (Fully Connected) | 128 – ReLU |
| Output Layer | 1 – Sigmoid |
The FT-Transformer is a recently developed model that addresses the shortcomings of earlier approaches: it targets tabular data specifically, accounts for complex feature interactions, and preserves permutation invariance. The FT-Transformer treats every feature as a token, in contrast to conventional transformers that treat words as tokens, and learns dependencies between features irrespective of their order.
The model starts with a feature tokenization layer that takes the input features \(x = (x_1, \ldots, x_f) \in \mathbb{R}^f\). Each feature is embedded into a higher-dimensional space via a learned embedding matrix \(E \in \mathbb{R}^{d \times f}\). These token embeddings are then fed through a sequence of multi-head self-attention (MHSA) and feed-forward layers so that the importance of features is dynamically weighted by the model. The core of the FT-Transformer architecture, shown in Fig. 7, is to turn a table of features into tokens through an embedding lookup (categorical) and a linear projection (numerical). In the transformer, each head computes self-attention as in Eqs. (8), (9), and (10).
$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V \quad (8)$$
$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \quad (9)$$
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W_O \quad (10)$$
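Equations (8)-(10) can be sketched directly in NumPy as follows; the dimensions follow Table 3 (d = 128, four heads), and the weights and inputs are random toy values for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Eqs. (8)-(10): project X to Q, K, V; apply scaled dot-product
    attention per head; concatenate heads; project with W_o."""
    n_tokens, d = X.shape
    d_h = d // n_heads                            # per-head dimension d_k
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # Eq. (8)
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_h, (h + 1) * d_h)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_h)
        heads.append(softmax(scores) @ V[:, sl])  # Eq. (9)
    return np.concatenate(heads, axis=1) @ W_o    # Eq. (10)

rng = np.random.default_rng(0)
d, n_tokens, n_heads = 128, 12, 4   # dims per Table 3; 12 feature tokens
X = rng.normal(size=(n_tokens, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) * 0.05 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

The output keeps the token shape, so the block can be stacked with residual connections as described below in the text.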
Fig. 7.

The architecture of FT Transformer diagram and its layers.
Finally, the pooled embedding is passed to the regression layer, computed as Eq. (11).
$$\hat{y} = w^{\top} h_{\text{pool}} + b \quad (11)$$
The FT-Transformer architecture is built to model tabular data effectively by combining transformer-based self-attention mechanisms with an additional preprocessing step called feature tokenization. The architecture starts with an input embedding block where each feature in the tabular dataset, be it numerical or categorical, is converted into a learnable embedding vector. These embeddings form a sequence of feature tokens, analogous to word tokens in NLP models, and are fed through a stack of transformer layers.
The transformer encoder blocks comprise two general elements: multi-head self-attention (MHSA) and a position-wise feed-forward network (FFN). The self-attention layer enables the model to dynamically weight the relationships between features and to capture non-linear dependencies. The attention output passes through residual connections and layer normalization, which preserve information and stabilize training. These blocks are stacked a configurable number of times to increase the network's capacity to learn higher-level abstractions37.
Given the token representations produced by the transformer blocks, a pooling operation, typically mean pooling or weighted average pooling, compacts the multiple feature representations into a single fixed-size vector. A regression head (traditionally a dense layer with linear activation) is then applied to this vector to yield the final prediction (e.g., energy consumption in kWh) for scheduling tasks38. This overall architecture unifies the expressivity of attention with the structural efficiency of tabular embeddings, making it well suited to structured regression39.
The FT-Transformer is an architecture that explicitly targets tabular data, building on the power of transformer networks. The input layer represents features as 128-dimensional embedding vectors, a step known as feature tokenization, ensuring that both categorical and numerical variables are projected into the same latent space. These embeddings are matched to the transformer's hidden dimension by a linear projection layer with ReLU activation. The core of the model is a stack of transformer encoder blocks (three in this configuration), each with multi-head self-attention (four heads) and a feed-forward subnetwork of size 256 → 128 activated by GELU40. The encoder blocks allow the model to capture complex dependencies and higher-order interactions among the features. To enhance generalization, dropout (0.1) and layer normalization are applied at the end of each block. A special CLS token is inserted to aggregate contextual information from all feature tokens. Lastly, the representation is passed through an output dense layer that produces the final value for regression tasks (or several units with a softmax activation for classification). Overall, the FT-Transformer unifies feature embeddings, self-attention, and deep feed-forward layers into a powerful and scalable model for learning from tabular data. This architecture balances model complexity and generalization capacity, making it highly applicable to this regression task41.
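A condensed forward pass of the architecture described above might look as follows in NumPy. This is a sketch, not the authors' implementation: it uses a single attention head, omits layer normalization and dropout for brevity, and all weights and inputs are random toy values (3 numeric and 2 categorical features are assumed):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 128  # token dimension, per Table 3

def tokenize(x_num, x_cat, W_num, b_num, E_cat):
    """Feature tokenization: linear projection per numeric feature,
    embedding lookup per categorical feature, plus a CLS token."""
    num_tokens = x_num[:, None] * W_num + b_num                  # (n_num, d)
    cat_tokens = np.stack([E_cat[i][c] for i, c in enumerate(x_cat)])
    cls = np.zeros((1, d))                                       # learnable in practice
    return np.vstack([cls, num_tokens, cat_tokens])

def encoder_block(T, W_qkv, W_ffn1, W_ffn2):
    """Simplified encoder block: single-head self-attention + ReLU FFN,
    each with a residual connection."""
    Q, K, V = (T @ W for W in W_qkv)
    A = np.exp(Q @ K.T / np.sqrt(d))
    T = T + (A / A.sum(axis=-1, keepdims=True)) @ V              # attention + residual
    return T + np.maximum(T @ W_ffn1, 0) @ W_ffn2                # FFN + residual

# Toy sample: 3 numeric features, 2 categorical features (vocab size 3 each).
x_num = np.array([0.4, -1.2, 0.7])
x_cat = np.array([1, 2])
W_num = rng.normal(size=(3, d)) * 0.1
b_num = rng.normal(size=(3, d)) * 0.1
E_cat = [rng.normal(size=(3, d)) * 0.1 for _ in range(2)]
W_qkv = [rng.normal(size=(d, d)) * 0.05 for _ in range(3)]
W_ffn1 = rng.normal(size=(d, d)) * 0.05
W_ffn2 = rng.normal(size=(d, d)) * 0.05
w_out, b_out = rng.normal(size=d) * 0.05, 0.0

T = tokenize(x_num, x_cat, W_num, b_num, E_cat)   # (6, d): CLS + 5 feature tokens
T = encoder_block(T, W_qkv, W_ffn1, W_ffn2)
y_hat = T[0] @ w_out + b_out                      # regression from CLS token, Eq. (11)
```

A trained model would stack several such blocks and learn all the weight matrices above by gradient descent; the sketch only traces how a row of tabular features flows from tokens to a scalar prediction.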
The hyperparameter settings and their values are displayed in Table 4. The hyperparameters of the proposed model were tuned to achieve training stability and representational power while avoiding overfitting. An attention dimension of 256 and multi-head attention (8 heads) enabled the model to learn fine-grained contextual dependencies, while dropout (0.1) and weight decay (0.01) provided regularization.
Table 4.
Hyperparameter setting, description and their value.
| Parameter | Value |
|---|---|
| Embedding dimension | 192 |
| Number of transformer blocks | 3 |
| Number of attention heads | 8 |
| Feedforward dimension | 512 |
| Dropout rate | 0.1 |
| Attention dropout | 0.1 |
| Layer normalization | Pre-Norm |
| Activation function | GELU |
| Batch size | 512 |
| Learning rate | 1e-4 |
| Optimizer | AdamW |
| Weight decay | 1e-5 |
| Epochs | 30 |
| Warmup steps | 10% of total steps |
| Early stopping patience | 10 epochs |
| Random seed | 42 |
Learning rate scheduling with 1000 warm-up steps provided smoother convergence without gradient instability. Early stopping (patience = 10) halted training when the validation loss stagnated, after which the best weights were restored for evaluation. Moreover, masked item prediction (15%) promoted generalization by compelling the model to acquire robust contextual representations, improving its predictive performance in novel situations. The flowchart of the FT-Transformer model in Fig. 8 illustrates the sequential process from data preprocessing, input layer, hidden layers, and activation functions to the output prediction.
Fig. 8.
Workflow of the FT-Transformer model architecture used in this study.
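The warm-up schedule and stopping rule just described can be illustrated with a small, framework-agnostic sketch. The linear warmup-then-decay shape is an assumption for illustration; the study only specifies 1000 warm-up steps and a patience of 10.

```python
def lr_at_step(step, base_lr=1e-4, warmup_steps=1000, total_steps=10000):
    """Linear warmup to base_lr, then linear decay to zero (one common choice)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience, self.best, self.bad_epochs = patience, float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0   # improvement: reset counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience        # True -> stop training
```

In a training loop, `lr_at_step` would set the optimizer's learning rate each step, and `EarlyStopping.step` would be called once per epoch with the validation loss.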
Evaluation metrics
To evaluate regression-based models, various measures are used, including mean absolute error, root mean square error, and others42. The evaluation criteria assess model performance from several statistical angles: MAE, RMSE, R² score, MAPE, MSRE, RMSRE, MARE, and RMSPE. Each of these measures captures a different facet of model behavior, so that the analysis covers both the magnitude and the proportion of errors24. All performance metric equations, with descriptions, are displayed in Table 5.
Table 5.
All evaluation metrics equations with description.
| Measure | Equation | Description |
|---|---|---|
| MAE | $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$ | Evaluates the overall magnitude of prediction errors |
| RMSE | $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Penalizes large deviations more heavily than small errors |
| R² | $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$ | Measures the strength of the model in explaining the variation |
| MAPE | $\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$ | Expresses the average error as a percentage of the actual values |
| MSRE | $\mathrm{MSRE} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2$ | Gauges the squared relative disparity between actual and forecasted values |
| RMSRE | $\mathrm{RMSRE} = \sqrt{\mathrm{MSRE}}$ | The square root of MSRE; provides an interpretable scale of relative errors |
| MARE | $\mathrm{MARE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$ | The average absolute relative error |
| RMSPE | $\mathrm{RMSPE} = 100\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}$ | Represents the error in percentage terms |
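The equations in Table 5 translate directly into code. The sketch below computes all eight metrics with NumPy; the sample arrays are hypothetical values used only to exercise the function.

```python
import numpy as np

def regression_metrics(y, yhat):
    """All eight metrics from Table 5; y and yhat are 1-D arrays,
    with y nonzero for the relative/percentage measures."""
    e = y - yhat
    r = e / y                                   # relative errors
    return {
        "MAE":   np.mean(np.abs(e)),
        "RMSE":  np.sqrt(np.mean(e**2)),
        "R2":    1 - np.sum(e**2) / np.sum((y - y.mean())**2),
        "MAPE":  100 * np.mean(np.abs(r)),
        "MSRE":  np.mean(r**2),
        "RMSRE": np.sqrt(np.mean(r**2)),
        "MARE":  np.mean(np.abs(r)),
        "RMSPE": 100 * np.sqrt(np.mean(r**2)),
    }

y    = np.array([10.0, 12.0, 8.0, 15.0])   # hypothetical actual kWh values
yhat = np.array([ 9.5, 12.5, 8.2, 14.0])   # hypothetical predictions
m = regression_metrics(y, yhat)
```

Note that by construction MAPE = 100 × MARE and RMSPE = 100 × RMSRE, which is a useful sanity check when reading the result tables.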
Explainable AI (XAI)
XAI in predictive modeling is important for the transparency, trust, and interpretability of model outcomes. High-performing machine learning models such as MLP, XGBoost, and FT-Transformer usually act as black boxes, yet decision-makers in logistics and energy management need to know the reasons behind particular predictions. To fulfill this requirement, we used two popular model-agnostic explainability methods, SHAP and LIME43.
SHAP
SHAP is grounded in cooperative game theory, where each feature is a player contributing to the final prediction. SHAP is both globally and locally interpretable. Globally, it determines the most influential factors of energy consumption (e.g., route distance, cargo weight, vehicle type). Locally, it explains individual predictions by decomposing energy consumption into per-feature contributions. This allows fleet managers to know not just which factors matter most overall, but also why a particular delivery consumed the energy it did44.
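For intuition, Shapley values can be computed exactly for a tiny model by enumerating feature coalitions; the SHAP library approximates this efficiently for real models. The energy function below is a hypothetical stand-in, and "absent" features are replaced with a background value, a simplification of what SHAP estimates.

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, background):
    """Exact Shapley values for one instance x, enumerating all coalitions."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in itertools.combinations(others, k):
                # Shapley weight |S|! (n-|S|-1)! / n!
                w = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
                z = background.copy()
                z[list(S)] = x[list(S)]          # features in the coalition are "present"
                without_i = f(z)
                z[i] = x[i]                      # now add feature i
                with_i = f(z)
                phi[i] += w * (with_i - without_i)
    return phi

# hypothetical energy model: distance and load dominate, with an interaction term
f = lambda z: 2.0 * z[0] + 1.5 * z[1] + 0.5 * z[0] * z[1] + 0.1 * z[2]
x, bg = np.array([3.0, 2.0, 1.0]), np.zeros(3)
phi = shapley_values(f, x, bg)
# efficiency property: the contributions sum to f(x) - f(background)
```

The interaction term is split evenly between the two interacting features, which is exactly the behavior that makes SHAP attributions consistent.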
LIME
LIME approximates the black-box model with a more interpretable surrogate model (usually linear regression) in the vicinity of the instance being studied. It can explain why a particular trip required much more energy by measuring the contribution of each feature to that prediction45. Because LIME offers practical explanations for individual forecasts, it is extremely useful for identifying anomalies and aiding operational decisions46. By combining SHAP and LIME in our framework, we ensure that the proposed model not only predicts well, but also serves stakeholders through transparency, interpretability, and reliability of its output.
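The core idea of LIME, a weighted linear surrogate fitted to perturbations around one instance, can be sketched from scratch in a few lines. The black-box function below is a hypothetical stand-in for the trained model, and the kernel width is an illustrative choice.

```python
import numpy as np

def lime_explain(f, x, n_samples=2000, sigma=0.5, seed=0):
    """LIME-style sketch: perturb around x, weight samples by proximity,
    and fit a weighted linear surrogate; its coefficients are the explanation."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=sigma, size=(n_samples, len(x)))   # local perturbations
    y = np.array([f(z) for z in Z])
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * sigma**2))  # proximity kernel
    A = np.hstack([Z, np.ones((n_samples, 1))])                 # add intercept column
    AW = A * w[:, None]
    beta = np.linalg.solve(AW.T @ A, AW.T @ y)                  # weighted least squares
    return beta[:-1]            # per-feature local weights (intercept dropped)

# hypothetical black-box stand-in for the trained model
f = lambda z: 2.0 * z[0] - 1.0 * z[1] + 0.1 * z[0] ** 2
coefs = lime_explain(f, np.array([1.0, 3.0]))
```

Near the chosen instance, the surrogate's coefficients approximate the local gradient of the black box, which is exactly the "why this trip" signal LIME reports.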
Results and discussion
The findings of this work provide a detailed analysis of various ML and DL solutions to the problem of energy consumption prediction in logistics under different environmental and operational conditions. The discussion is organized to present not only the raw performance figures of each model but also their implications for logistics scheduling and real-world energy optimization. Comparing regression-based baselines such as LR, DT, RF, XGB, CatBoost, and MLP with the proposed FT-Transformer, we conduct a comparative and interpretative evaluation of predictive accuracy, robustness, and scalability. The figures support the analysis by demonstrating the underlying feature interactions and learning behaviors of the models, interpreted through LIME and SHAP explanations. Combined, these outcomes demonstrate the advantages and disadvantages of both strategies, and the proposed models can be regarded as sound, efficient, and feasible solutions for sustainable logistics work.
Exploratory analysis of dataset 1
To analyze the relationship between numerical features in the dataset, a correlation heatmap presents the pairwise correlations between numerical variables such as load, speed, distance, and power and the target variable, energy consumption. Strong positive relationships between load and energy, and between distance and energy, confirm the relevance of these characteristics, whereas weaker connections indicate non-dominant variables; the correlation heatmap is shown in Fig. 9. The histogram with kernel density overlay reveals that energy consumption is right-skewed: most trips consume moderate levels of energy, while a small fraction of trips accounts for a disproportionate share of the total energy consumed. This distributional imbalance highlights the importance of robust evaluation measures and motivates scaling and normalization during preprocessing. This analysis informed decisions about how the features should be transformed and normalized prior to model training, as shown in Fig. 10.
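The heatmap's pairwise correlations are ordinary Pearson coefficients. A minimal sketch with synthetic, purely hypothetical load/distance/energy columns shows the kind of pattern the heatmap summarizes:

```python
import numpy as np

# synthetic stand-ins for dataset columns (illustrative only, not the real data)
rng = np.random.default_rng(0)
load = rng.uniform(0, 500, size=1000)        # cargo load, kg
distance = rng.uniform(1, 300, size=1000)    # route distance, km
energy = 0.05 * load + 0.15 * distance + rng.normal(0, 2, size=1000)  # kWh

# pairwise Pearson correlation matrix, as plotted in the heatmap
corr = np.corrcoef(np.stack([load, distance, energy]))
# load-energy and distance-energy are both clearly positive
```

In practice this is the matrix `pandas.DataFrame.corr()` returns before it is rendered as a heatmap.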
Fig. 9.
Correlation heatmap of all features.
Fig. 10.

Allocation of energy consumption values.
As can be seen in the scatter-density diagram, the trend is U-shaped: mid-range speeds are the most efficient, while very slow and very fast speeds sharply raise power requirements. This nonlinear behavior motivates models that can represent complicated functional dependencies, including MLP and tree-based baselines. Such non-linearity warranted the choice of DL models that could capture these interactions better than traditional models, as shown in Fig. 11.
Fig. 11.

Concurrent distribution of power consumption and average speed.
To gain more insight into the effect of cargo weight on energy usage, cargo weight was categorized into bins (e.g., 0–100 kg, 100–300 kg, 300+ kg). The nonlinear relationship indicates that a proportional change in load leads to a disproportionate change in energy consumption, which implies that load should be treated as one of the nonlinear drivers in the predictive model. This plot played a critical role in encouraging the adoption of advanced models such as MLP over linear models, as shown in Fig. 12. The dispersion of energy consumption across categorical variables was analyzed through a swarm plot clustering energy usage by vehicle type and fuel source. Inefficient clusters such as heavy diesel vehicles are clearly noticeable, whereas hybrid and electric vehicle clusters have lower energy requirements. These insights support interpretability for operational decision-making and efficiency, as shown in Fig. 13.
Fig. 12.

Load Bin Energy Consumption plot.
Fig. 13.

Analysis of energy consumption by fuel and vehicle type.
A bubble plot was made with route_distance_km on the x-axis and energy_consumed_kWh on the y-axis. Cargo_weight_kg was indicated by bubble size, and vehicle_type by color. The visualization indicates that long-distance trips and frequent stops consume much more energy, reflecting the combined influence of distance and logistics complexity. It also surfaced a group of low-energy deliveries carried out by electric vehicles, which adds to the sustainability discussion, shown as Fig. 14. The mean value of energy_consumed_kWh is shown by vehicle_type across road types (e.g., urban, rural, highway) in a heatmap. This comparison of categorical interactions shows that heavy-duty vehicles on urban roads use the most energy, whereas lighter vehicles on highways demonstrate better energy use. The observations underscore the need for feature encoding strategies that capture categorical and interaction effects. This visualization supported the idea of finding a fleet assignment that minimizes total energy usage, shown as Fig. 15.
Fig. 14.

Bubble plot distance and energy consumption.
Fig. 15.

Multivariate heatmap of average energy consumption by vehicle type and road type.
Exploratory analysis of dataset 2
To analyze and understand the second dataset, the plot in Fig. 16 shows how energy consumption values are distributed throughout the dataset. A normal or skewed distribution reflects driving habits and external influences on vehicle efficiency, and the distribution peaks correspond to typical energy consumption patterns under normal road or traffic conditions. The plot in Fig. 17 relates battery state of charge (SOC) to energy used, broken down by driving mode and weather conditions. The color coding indicates that aggressive driving or unfavorable weather (rain, heat, cold) can deplete batteries quicker even at the same SOC, highlighting the multi-dimensional effects of environment and habit on vehicle efficiency. Figure 18 analyzes the average energy consumption across road types (e.g., highway, urban, rural) under different driving modes (eco, normal, sport). It shows that road characteristics and driving techniques have a strong effect on battery consumption: urban roads combined with aggressive driving modes typically consume more, whereas highways in eco mode yield lower consumption.
Fig. 16.

Energy consumption (kWh) distribution.
Fig. 17.

State vs. energy consumption (by driving mode and weather).
Fig. 18.

Mean energy usage by road-type and driving mode.
The analysis in Fig. 19 describes how the interaction between traffic density (light, medium, heavy) and road type influences energy consumption. As an illustration, heavy traffic on urban roads leads to increased stop-and-go driving, which demands more energy, whereas highways with free-flowing traffic show more stable and lower consumption. This provides insight into the impact of congestion management on the efficiency of electric vehicles.
Fig. 19.

Energy consumption by type of road and traffic condition.
Empirical results on dataset 1
This part contains the performance analysis of several ML and DL models using common regression performance measures: MAE, RMSE, R², and MAPE. The aim is to understand and compare the predictive power and generalization ability of each model in predicting energy consumption in logistics. LR, as a baseline model, performed fairly well, with an R² score of 0.90, meaning that 90% of the variance in the energy consumption data could be explained. However, it had a comparatively high MAE (4.45) relative to the more sophisticated models, and its MAPE of 16.39% indicates that it struggles with aggressive or non-linear instances. The simplicity of the linear model is nonetheless helpful for preliminary estimations and intelligible conclusions, and it serves as the reference against which the progress of stronger models is evaluated; results are displayed in Table 6. The RF model substantially outperformed the linear model, with a high R² of 0.991, indicating very accurate prediction. It also produced a low MAE of 0.31, an RMSE of 0.398, and a mere 2.51% MAPE. The DT model performed worst, with an R² score of 0.51 and high MAE (9.67) and RMSE (11.5) values. Even though its MAPE of 15.16% was close to that of LR, the model suffered from overfitting and lack of generalization. XGB produced a good fit of R² = 0.991, equal to RF, with slightly higher MAE (0.38), RMSE (0.502), and MAPE (3.55%). Despite the slightly weaker results, XGB keeps computation time short and applies regularization to prevent overfitting. Its gradient-boosting framework improves model performance iteratively, making it a competitive and scalable choice for energy consumption modeling across wide logistics data.
The R² value of CatBoost turned out to be 0.91, a distinct advantage over the LR and DT models but inferior to RF and XGB. It has an MAE of 3.72, an RMSE of 4.87, and a MAPE of 14.08%, which is acceptable but less consistent performance. Given the comparatively high error rates observed here, CatBoost may not be fully exploiting the available feature interactions. It nonetheless remains a good choice for heavy data sets.
Table 6.
Analysis of all applied models results.
| Models | MAE | RMSE | R² | MAPE (%) | MSRE | RMSRE | MARE | RMSPE (%) |
|---|---|---|---|---|---|---|---|---|
| LR | 4.45 | 0.502 | 0.90 | 16.39 | 0.027 | 0.134 | 0.163 | 16.5 |
| RF | 0.31 | 0.398 | 0.991 | 2.51 | 0.006 | 0.025 | 0.025 | 2.6 |
| DT | 9.67 | 11.5 | 0.51 | 15.16 | 0.021 | 0.132 | 0.142 | 14.2 |
| XGB | 0.38 | 0.502 | 0.991 | 3.55 | 0.013 | 0.036 | 0.035 | 3.6 |
| CatBoost | 3.72 | 4.87 | 0.91 | 14.08 | 0.020 | 0.132 | 0.131 | 13.1 |
| MLP | 0.17 | 0.229 | 0.99 | 1.35 | 0.002 | 0.015 | 0.013 | 1.4 |
| FT-Transformer | 0.16 | 0.210 | 0.994 | 1.12 | 0.001 | 0.011 | 0.011 | 1.2 |
The MLP also proved to be one of the strongest models in this research. It obtained a very good R² of 0.99, a low MAE of 0.17, and an RMSE of 0.229, as shown in Table 6, along with a minimal MAPE of 1.35%, indicating high accuracy. This is due to the capability of the deep neural network to model non-linear dependencies and feature interactions without intensive manual feature engineering. Its success implies that deep neural models hold great promise for energy prediction tasks involving dense, heterogeneous, or time-dependent logistics data, and it justifies the use of DL structures in industrial analytics.
Within the presented machine learning system for estimating energy consumption in logistics under environmental and operational constraints, four error-based measurements (MSRE, RMSRE, MARE, and RMSPE) were used to assess the models. The conventional methods, LR and DT, gave relatively weak results, with higher error values (RMSPE of 16.5 and 14.2, respectively). These outcomes indicate that they are not very effective at capturing the nonlinear trends and complicated feature interactions in logistics energy data. Similarly, CatBoost showed a higher RMSPE (13.1) and perceptible error compared with the more sophisticated techniques. Ensemble and deep learning models, on the other hand, brought clear improvements. Random Forest provided strong performance with very small errors (RMSPE of 2.6, MARE of 0.025), which shows it can handle feature variability. XGBoost was also good, though slightly behind Random Forest, with an RMSPE of 3.6 and comparable error values across metrics. Transformer-based and neural architectures made the greatest contributions: the MLP showed high predictive accuracy with an RMSPE of 1.4 and very small error values, indicating its ability to capture dependencies in structured tabular data.
The FT-Transformer model shows the most desirable statistics across all metrics: an MAE of 0.16, an RMSE of 0.210, the highest R² (0.994), and the lowest MAPE (1.12%). This demonstrates the FT-Transformer's ability to model complex non-linear relationships between the variables in the data. Its expressive architecture models energy consumption effectively, especially when combined with optimized hyperparameters and feature selection. These results highlight the utility of DL techniques for predictive modeling in the logistics and energy industries. The FT-Transformer also recorded the lowest relative error rates (RMSPE of 1.2, MSRE of 0.001, and MARE of 0.011), making it the most reliable model in this study. Overall, the comparison shows that conventional regression and single-tree models are less effective at predicting energy in complex logistics systems, whereas the sophisticated models, especially the MLP and FT-Transformer, perform markedly better. This set of results highlights the critical role of deep learning and transformer-based models in developing energy-conscious logistics systems that must account for both operational and environmental constraints.
Figure 20 shows a comparative analysis of all applied models' performance across training epochs. The evaluation metrics MAE, RMSE, and R² are presented side by side for each model. The chart clearly demonstrates that the FT-Transformer and MLP beat the traditional ML models on all metrics, especially in lowering RMSE and raising R². A heatmap was created to visualize the numerical values of each model's performance metrics, with the models as rows and MAE, MSE, RMSE, and R² as columns. The color depth helped in spotting high and low values; for example, the darker (lower) values in the FT-Transformer row for RMSE and MAE confirmed its higher predictive accuracy. Training and validation loss curves were also plotted over the training epochs for the DL models (MLP and FT-Transformer).
Fig. 20.
Loss and validation curve against several epochs.
These figures showed how the models learned over time and how well they avoided overfitting. The curves demonstrated that the MLP converged quicker and reached a slightly smaller validation loss than the FT-Transformer. The gap between training loss and validation loss was consistently small, a sign of model stability and effective regularization techniques, including dropout and early stopping.
LIME plots were used to understand the impact of the features fed into the model when predicting energy consumption for a few test samples. Each LIME plot showed the positive or negative feature weights affecting the prediction: cargo_weight_kg and route_distance_km contributed most to the predicted energy, whereas avg_speed_kmph acted as a suppressor in some instances. These plots made otherwise complicated models transparent and interpretable, shown as Fig. 21.
Fig. 21.
LIME plot created to understand the impact of features.
The global feature importance of the FT-Transformer model was presented with a SHAP summary plot. This plot ranked features by their effect on model output and color-coded them by feature value. The SHAP analysis reaffirmed cargo_weight_kg, route_distance_km, and avg_speed_kmph as the most impactful features of the data set. This helped confirm assumptions made during feature engineering and aided in making sense of the model, shown as Fig. 22.
Fig. 22.
SHapley Additive exPlanations summary plot.
Mean contribution scores produced by LIME and SHAP were calculated under various operational conditions (e.g., urban vs. highway, light vs. heavy loads) to generalize the feature importances. These results were visualized as average importance per feature per class in a grouped bar plot. The pattern was consistent across plots, with cargo_weight_kg scoring high in every context, whereas road_type carried more weight for urban classes, shown as Fig. 23.
Fig. 23.
Mean SHAP and LIME scores by discrete class.
To measure the feasibility of the proposed model in terms of computational efficiency, memory footprint, and scalability, deployment feasibility was assessed, as displayed in Table 7. With an average inference time of approximately 4.5 ms per sample, the model proves suitable for real-time decision-making in logistics applications. Even though the peak GPU memory consumption (approximately 11 GB) indicates the use of powerful hardware in the training phase, the model can be deployed to edge devices through lightweight optimizations such as model pruning and quantization. Moreover, the model's consistent performance during batch processing makes it suitable for large-scale logistics operations where processing huge volumes of data is important.
Table 7.
Computational resource utilization and deployment feasibility.
| Measure | Value |
|---|---|
| Training duration | ~ 3 h |
| Average time per epoch | ~ 1.8 min |
| Total inference time | ~ 45 s (full test set) |
| Inference time per sample | ~ 4.5 ms |
| GPU memory usage (Peak) | ~ 11 GB (NVIDIA A100 40GB) |
| CPU utilization (Average) | 35–45% |
| RAM usage (During Training) | ~ 10 GB |
The permutation importance plot shows the contribution of each feature to the prediction of energy consumption (kWh) in the FT-Transformer model. Vehicle speed, distance covered, and vehicle mass proved the most influential, implying that energy consumption is highly dependent on driving dynamics and loading condition. Driving mode and road type, both categorical factors, also contributed significantly, meaning that environmental and behavioral conditions shape consumption patterns. Lower-ranked features played less significant roles, and their impact on the prediction task was not substantial. This analysis clarifies which input variables the model depends on most when making a good forecast; the permutation feature importance for the FT-Transformer is shown in Fig. 24. The sensitivity analysis plot in Fig. 25 assesses how a minute change in each feature influences the model's predictions. Variables with higher sensitivity, such as speed and distance traveled, showed a greater effect on model output when perturbed, indicating their importance for energy consumption predictions. Conversely, features with low sensitivity scores imply model resilience to variation in those variables. This analysis extends the permutation importance analysis: it shows not only which features are important, but also how sensitive the model's predictions are to variations in those inputs.
Fig. 24.

The permutation of feature importance in FT-Transformer.
Fig. 25.

Sensitivity analysis assesses in FT-Transformer model.
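The permutation importance procedure described above can be reproduced model-agnostically in a few lines. The model below is a hypothetical linear stand-in rather than the trained FT-Transformer, used only to show the mechanics.

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Shuffle one column at a time and measure how much the error metric degrades."""
    rng = np.random.default_rng(seed)
    base = metric(y, model(X))
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])     # break the feature-target link
            scores.append(metric(y, model(Xp)))
        imp[j] = np.mean(scores) - base              # error increase = importance
    return imp

rmse = lambda y, yhat: np.sqrt(np.mean((y - yhat) ** 2))
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                        # e.g. speed, distance, a noise column
y = 3.0 * X[:, 0] + 1.0 * X[:, 1]                    # feature 2 is irrelevant by construction
model = lambda X: 3.0 * X[:, 0] + 1.0 * X[:, 1]      # stand-in for the trained model
imp = permutation_importance(model, X, y, rmse)
# imp[0] > imp[1], and imp[2] stays at zero
```

Shuffling a feature the model relies on inflates the error, while shuffling an ignored feature leaves the error untouched, which is exactly the ranking shown in Fig. 24.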
Results with dataset 2
LR produced one of the best results in the present study, with an R² of 0.94, showing that the model can explain 94% of the variation in energy consumption, as displayed in Table 8. Its MAE of 0.41 and RMSE of 0.50 attest to minimal prediction error across circumstances, and the low MAPE (0.05) indicates high reliability in percentage terms, which is especially crucial when predicting at different consumption levels. RF also performed well but slightly below LR and the boosting models: its R² of 0.91 shows it is a good predictor, explaining 91% of the variance. Nonetheless, its higher MAE (0.51) and RMSE (0.65) indicate some weaknesses in precision compared with LR. DT, though interpretable and simple, had the lowest performance among all tested models. Its R² of 0.79 implies that it accounted for only 79% of the variability, leaving a significant amount of variance unexplained. Its error measures, MAE (0.77) and RMSE (1.00), were the highest, indicating large differences between predicted and actual energy consumption values. XGB performed well, with a strong fit (R² = 0.92) and relatively small errors (MAE = 0.49, RMSE = 0.58). Its boosting mechanism minimizes both bias and variance by successively refining weak learners, enabling it to model complex, non-linear dynamics in energy consumption efficiently. Its MSRE (0.336) and RMSRE (0.580) show better control over relative errors than DT and RF.
Table 8.
All applied models results with dataset 2.
| Models | MAE | RMSE | R² | MAPE (%) | MSRE | RMSRE | MARE | RMSPE |
|---|---|---|---|---|---|---|---|---|
| LR | 0.41 | 0.50 | 0.94 | 0.05 | 0.252 | 0.502 | 0.05 | 0.051 |
| RF | 0.51 | 0.65 | 0.91 | 0.06 | 0.423 | 0.650 | 0.06 | 0.062 |
| DT | 0.77 | 1.00 | 0.79 | 0.10 | 1.000 | 1.000 | 0.10 | 0.101 |
| XGB | 0.49 | 0.58 | 0.92 | 0.06 | 0.336 | 0.580 | 0.06 | 0.060 |
| CatBoost | 0.42 | 0.54 | 0.93 | 0.05 | 0.292 | 0.540 | 0.05 | 0.052 |
| MLP | 0.43 | 0.55 | 0.93 | 0.05 | 0.302 | 0.550 | 0.05 | 0.053 |
| FT-Transformer | 0.44 | 0.54 | 0.93 | 0.05 | 0.292 | 0.540 | 0.05 | 0.053 |
CatBoost proved to be one of the most effective models, providing an R² of 0.93 with quite small errors (MAE = 0.42, RMSE = 0.54). Its primary benefit is that it can handle categorical variables, such as vehicle type or road type, with little preprocessing. The MLP also performed at a high level, with an R² of 0.93 and error values (MAE = 0.43, RMSE = 0.55) almost identical to CatBoost. Its deep architecture captures dependencies between features that traditional tree-based models might miss. Its MSRE (0.302) and RMSRE (0.550) are very similar to CatBoost's results, demonstrating good control over relative errors, and its stability across different driving modes and traffic conditions is confirmed by the low RMSPE (0.053). Unlike tree models, the MLP is especially flexible with heterogeneous tabular data containing both numerical and categorical variables.
The FT-Transformer model also performed very well, comparable to CatBoost, with an R² of 0.93 and minimal errors (MAE = 0.44, RMSE = 0.54). It uses attention mechanisms to capture dependencies between features that traditional tree-based models might overlook, and as a neural network it excels at non-linear relationship modeling, which is crucial given the interplay of road type, route distance, and traffic conditions in energy consumption. Its MSRE (0.292) and RMSRE (0.540) were among the lowest of all models, indicating that it minimizes relative errors well, as shown in Fig. 26. Likewise, its RMSPE (0.053) demonstrates consistent percentage-level performance. The training and validation performance of the FT-Transformer is tracked over several epochs in terms of MSE, RMSE, R², MAPE, MSRE, RMSRE, MARE, and RMSPE. Both training and validation curves converge smoothly, indicating stable learning with minimal overfitting. The high R² and low error values verify that the FT-Transformer captured the trends in the energy consumption data and that its predictive capabilities are strong.
Fig. 26.
Validation and Loss Curve (All Metrics).
The flexibility of FT-Transformer enables it to generalize effectively in most conditions assuming that there is enough training information and that hyperparameters have been set accordingly. The findings highlight the competitive aspect of FT-Transformer as a deep learning method, which is both flexible and accurate in the prediction of energy consumption.
Comparative analysis with existing study
To establish the robustness and effectiveness of our recommended models, we compared them with current research. The methods applied in these studies use various regression-based models to predict energy or related metrics on pertinent datasets and domains. In our results, we found that the MLP and FT-Transformer models outperform all traditional ML models, including DT, RF, and even the XGB and CatBoost ensemble models, in terms of MAE, RMSE, and R² scores. Notably, the FT-Transformer performed very well, indicating that it can successfully exploit intricate feature interactions with a high degree of generalization. Our models achieve high accuracy, high explainability (via SHAP and LIME), and high learning stability compared with previous research, making this a powerful contribution to energy-efficient AI modeling. The details of the comparative analysis with existing studies are shown in Table 9.
Table 9.
Comparison with state-of-the-art existing studies.
| References | Model | Dataset | MAE | RMSE | R² |
|---|---|---|---|---|---|
| 14 | RF | UCI Appliance Energy Dataset | 0.48 | 0.63 | 0.85 |
| 30 | XGB Regressor | OpenEI Energy Dataset | 0.41 | 0.52 | 0.89 |
| 20 | CatBoost Regressor | Smart Home Dataset | 0.39 | 0.59 | 0.91 |
| 21 | Logistic Regression | Building Energy Benchmark | 3.6 | 4.9 | 0.78 |
| 41 | DNN (3-layer MLP) | Customized IoT Sensor Dataset | 0.22 | 0.33 | 0.95 |
| Proposed | FT-Transformer | EV Driving Load & Environmental Dataset | 0.16 | 0.210 | 0.994 |
Ethical and sustainability considerations
The scale of deep learning models inevitably raises the issue of energy consumption and carbon footprint, specifically when training runs for long periods on high-performance GPUs. Careful hyperparameter tuning, model compression, and the use of renewable energy in data centers are key strategies for reducing the environmental impact of model development from a sustainability perspective47. Beyond the technical field, model interpretability is a very important ethical aspect of transparency and trust. Non-technical stakeholders, especially logistics managers, must understand how predictions are made, not only to enable informed decision-making but also to maintain accountability when forecasts are incorrect or unforeseen results occur. This gap is filled by interpretability tools such as SHAP and LIME, which explain model results in an intuitive way and help deploy sophisticated AI systems with sustainability and ethical accountability in real-world logistics.
Economic impact analysis of the proposed framework
Adoption of this ML framework yields tangible monetary gains by reducing operating expenses and supporting revenue growth. At the achieved accuracy (MAE ≈ 0.16), companies can save an estimated 3–7% of their annual fuel bill, which for large fleets translates to several million USD per year. Reducing route discrepancies further decreases penalty and maintenance costs while increasing the on-time delivery rate by 5–8%, which improves long-term customer satisfaction and supports service premiums. While there is some expense associated with initial installation, the ROI is generally achieved within 12–18 months, after which the framework delivers a consistent 2–5% annual uplift in profit48. Further, the model implies that firms with distinctive logistics capabilities can implement strategic pricing more effectively: the cost savings and premium-service revenue allow them to lower delivery rates without sacrificing margins, a ratio likely to be a linchpin of both financial performance and five-year market value.
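The payback arithmetic above can be made concrete with a small sketch. All figures below (fleet fuel bill, installation cost) are hypothetical placeholders; only the 3–7% saving rate comes from the text:

```python
def payback_months(upfront_cost, annual_fuel_bill, saving_rate,
                   monthly_overhead=0.0):
    """Months needed to recover an upfront investment from fuel savings.

    saving_rate: fraction of the annual fuel bill saved (e.g. 0.03-0.07).
    monthly_overhead: recurring cost of running the system, if any.
    """
    monthly_saving = annual_fuel_bill * saving_rate / 12.0 - monthly_overhead
    if monthly_saving <= 0:
        return float("inf")  # savings never cover the investment
    return upfront_cost / monthly_saving

# Hypothetical fleet: 20M USD annual fuel bill, 500k USD installation.
# At a 5% saving rate the investment pays back in 6 months;
# at 3% it takes 10 months - both inside the 12-18 month window cited.
```

The cited 12–18 month ROI window is therefore conservative relative to this toy calculation, leaving room for integration and overhead costs.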
Conclusion
This study presents a comprehensive exploration of vehicle energy consumption forecasting using advanced ML and DL models, focusing on vehicular behavior and environmental parameters. Among the evaluated models, the MLP and FT-Transformer delivered outstanding predictive performance, with the FT-Transformer achieving an MAE of 0.16, RMSE of 0.21, R² of 0.994, and MAPE of 1.12%, outperforming the baseline models. In addition to accuracy, the study emphasizes model interpretability through explainable AI techniques such as SHAP and LIME, identifying the key features that influence energy use. This study is limited to environmental and operational data, while multimodal integration may provide deeper insights49. Future research will explore the integration of external factors such as weather conditions, traffic flow, and driver behavior to further improve prediction accuracy. Additionally, real-world deployment of the proposed models in edge-computing environments will be investigated for real-time, on-board energy forecasting in electric and hybrid vehicles. This research contributes meaningfully to real-time energy prediction for electric vehicles, offering a scalable and eco-conscious framework for intelligent transport systems, smart grid coordination, and sustainable mobility infrastructure.
List of all symbols
- IQR
Interquartile range technique
- X
Raw value
- μ
Mean of the feature
- σ
Standard deviation
- P(X, Y)
Joint probability distribution function of X and Y
- P(X), P(Y)
Marginal probability distributions of X and Y
- x
Feature variables
- i
Individual sample
- ŷᵢ
Predicted energy consumption for sample i
- x̄, ȳ
Means of x and y
- R²
Coefficient of determination
- W⁽ˡ⁾, b⁽ˡ⁾
Weights and biases of layer l
- ŷ
Predicted energy consumption
Parameters from input layer to first hidden layer (weights + biases)
Parameters across hidden layers
Parameters from last hidden layer to output layer
Collinear features (omitted)
- N
Number of samples
Abbreviations
- LR
Linear regression
- RF
Random forest
- DT
Decision tree
- XGB
Extreme gradient boosting
- MLP
Multi-layer perceptron
- FT
Feature tokenizer transformer
- MAE
Mean absolute error
- RMSE
Root mean square error
- R²
Coefficient of determination
- MAPE
Mean absolute percentage error
- MSRE
Mean squared relative error
- RMSRE
Root mean squared relative error
- MARE
Mean absolute relative error
- t-SNE
t-distributed stochastic neighbor embedding
- GPU
Graphics processing unit
- RMSPE
Root mean squared percentage error
- ANN
Artificial neural network
- MSE
Mean squared error
- ML
Machine learning
- DL
Deep learning
- NLP
Natural language processing
- EV
Electric vehicle
- RNN
Recurrent neural network
- LSTM
Long short-term memory
- AI
Artificial intelligence
Author contributions
The author Lai Yan fully contributed to this study.
Funding
Not applicable to this study.
Data availability
The datasets are freely available at the online repository Kaggle: Dataset 1 (https://www.kaggle.com/datasets/programmer3/energy-aware-logistics-scheduling-dataset), Dataset 2 (https://www.kaggle.com/datasets/ziya07/ev-energy-consumption-dataset).
Declarations
Competing interests
The authors declare no competing interests.
Ethical approval
The authors have no conflict of interest.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Oubrahim, I. & Sefiani, N. An integrated multi-criteria decision-making approach for sustainable supply chain performance evaluation from a manufacturing perspective. Int. J. Prod. Perform. Manage.74 (1), 304–339. 10.1108/IJPPM-09-2023-0464 (2024) [Google Scholar]
- 2.Gössling, S., Humpe, A. & Sun, Y. Y. Are emissions from global air transport significantly underestimated? Curr. Issues Tourism. 28, 695–708 (2025). 10.1080/13683500.2024.2337281 [Google Scholar]
- 3.Hall, C. & van Asselt, H. Decarbonising the land transport sector: pathways towards enhanced global governance. Transp. Res. D Transp. Environ.140, 104601 10.1016/J.TRD.2025.104601 (2025). [Google Scholar]
- 4.Nazim, M. S., Rahman, M. M., Joha, M. I. & Jang, Y. M. An RNN-CNN-based parallel hybrid approach for battery state of charge (SoC) estimation under various temperatures and discharging cycle considering noisy conditions. World Electr. Veh. J. 10.3390/WEVJ15120562 (2024).
- 5.Song, X. et al. Sustainable operations of last mile logistics based on machine learning processes. Processes10 (12), 2524. 10.3390/PR10122524 (2022) [Google Scholar]
- 6.Zhang, S. Research on energy-saving packaging design based on artificial intelligence. Energy Rep.8, 480–489. 10.1016/j.egyr.2022.05.069 (2022). [Google Scholar]
- 7.Wang, K. & Du, N. Real-time monitoring and energy consumption management strategy of cold chain logistics based on the internet of things. Energy Inf.8, 1–20 (2025). 10.1186/S42162-025-00493-W [Google Scholar]
- 8.Sahin, D. O., Akleylek, S. & Kilic, E. LinRegDroid: detection of android malware using multiple linear regression Models-Based classifiers. IEEE Access.10, 14246–14259. 10.1109/ACCESS.2022.3146363 (2022). [Google Scholar]
- 9.Baek, J. W. & Chung, K. Context deep neural network model for predicting depression risk using multiple regression. IEEE Access.8, 18171–18181. 10.1109/ACCESS.2020.2968393 (2020). [Google Scholar]
- 10.Rezaei, O., Sahraeian, R. & Hosseini, S. M. H. A multi-objective optimization framework to design the closed-loop supply chain network using machine learning for demand prediction. Process Integr. Optim. Sustain. 1–22 (2025). 10.1007/S41660-025-00520-Z
- 11.Zalza, K., Nazim, M. S., Jang, Y. M. & Hudaya, C. UAV energy consumption prediction: A comparative study from four different deep learning models, pp. 0196–0199 (2025). 10.1109/ICAIIC64266.2025.10920767
- 12.Alkanhel, R. et al. Network intrusion detection based on feature selection and hybrid metaheuristic optimization. Computers Mater. Continua. 74 (2), 2677–2693 10.32604/CMC.2023.033273 (2022). [Google Scholar]
- 13.Leng, J. et al. A loosely-coupled deep reinforcement learning approach for order acceptance decision of mass-individualized printed circuit board manufacturing in industry 4.0. J. Clean. Prod.280, 124405 10.1016/J.JCLEPRO.2020.124405 (2021). [Google Scholar]
- 14.Farzaneh, H. et al. Artificial intelligence evolution in smart buildings for energy efficiency. Appl. Sci.11, 763 (2021). 10.3390/APP11020763 [Google Scholar]
- 15.Abid, F. A survey of machine learning algorithms based forest fires prediction and detection systems. Fire Technol.57, 559–590 (2021). 10.1007/S10694-020-01056-Z [Google Scholar]
- 16.Sarswatula, S. A., Pugh, T. & Prabhu, V. Modeling energy consumption using machine learning. Front. Manuf. Technol.2, 855208 10.3389/FMTEC.2022.855208 (2022). [Google Scholar]
- 17.Ribeiro, A. M. N. C., Carmo, D., Endo, P. R. X., Rosati, P. T. & Lynn, T. P. Short- and very short-term firm-level load forecasting for warehouses: A comparison of machine learning and deep learning models. Energies, 15, 750, 10.3390/EN15030750 (2022). [Google Scholar]
- 18.Ullah, I. et al. A comparative performance of machine learning algorithm to predict electric vehicles energy consumption: A path towards sustainability. Energy Environ.33, 1583–1612 (2022). 10.1177/0958305X211044998 [Google Scholar]
- 19.Mhlanga, D. Artificial intelligence and machine learning for energy consumption and production in emerging markets: A review. Energies16, 745 (2023). 10.3390/EN16020745 [Google Scholar]
- 20.Mansoursamaei, M., Moradi, M., González-Ramírez, R. G. & Lalla-Ruiz, E. Machine learning for promoting environmental sustainability in ports. J. Adv. Transp.1, 2144733 (2023). 10.1155/2023/2144733 [Google Scholar]
- 21.Gan, S., Zhang, Q. & Wang, Y. Energy consumption analysis of metropolitan logistics vehicles based on an ensemble K-means long short-term memory model. Energy Environ.10.1177/0958305X241244488 (2024). [Google Scholar]
- 22.Eddaoudi, Z., Aarab, Z., Boudmen, K., Elghazi, A. & Rahmani, M. D. A brief review of energy consumption forecasting using machine learning models. Procedia Comput. Sci.236, 33–40 10.1016/J.PROCS.2024.05.001 (2024). [Google Scholar]
- 23.Zhang, X., Zhang, Z., Liu, Y., Xu, Z. & Qu, X. A review of machine learning approaches for electric vehicle energy consumption modelling in urban transportation. Renew. Energy10.1016/j.renene.2024.121243 (2024). [Google Scholar]
- 24.Alshdadi, A. A. & Almazroi, A. A. Ayub, N. IoT-driven load forecasting with machine learning for logistics planning. Internet Things. 29, 101441 10.1016/J.IOT.2024.101441 (2025). [Google Scholar]
- 25.Różycki, R., Solarska, D. A. & Waligóra, G. Energy-aware machine learning models—a review of recent techniques and perspectives. Energies18, 2810 (2025). 10.3390/EN18112810 [Google Scholar]
- 26.Hussain, I., Ching, K. B., Uttraphan, C., Tay, K. G. & Noor, A. Evaluating machine learning algorithms for energy consumption prediction in electric vehicles: A comparative study. Sci. Rep.15, 1–20 (2025). 10.1038/S41598-025-94946-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Dong, C. et al. A real-time prediction framework for energy consumption of electric buses using integrated machine learning algorithms. Transp. Res. E Logist Transp. Rev.194, 103884 10.1016/J.TRE.2024.103884 (2025). [Google Scholar]
- 28.Nazim, M. S., Jang, Y. M. & Chung, B. Machine Learning Based Battery Anomaly Detection Using Empirical Data, 6th International Conference on Artificial Intelligence in Information and Communication, ICAIIC 2024, pp. 847–850, (2024). 10.1109/ICAIIC60209.2024.10463489
- 29.Phiboonbanakit, T., Horanont, T., Huynh, V. N. & Supnithi, T. A hybrid reinforcement Learning-Based model for the vehicle routing problem in transportation logistics. IEEE Access.9, 163325–163347. 10.1109/ACCESS.2021.3131799 (2021). [Google Scholar]
- 30.Slowik, M. & Urban, W. Machine learning short-term energy consumption forecasting for microgrids in a manufacturing plant. Energies15, 3382 (2022). 10.3390/EN15093382 [Google Scholar]
- 31.Abdelhamid, A. A. et al. Innovative feature selection method based on hybrid sine cosine and dipper throated optimization algorithms. IEEE Access.11, 79750–79776. 10.1109/ACCESS.2023.3298955 (2023). [Google Scholar]
- 32.Antonopoulos, I. et al. Artificial intelligence and machine learning approaches to energy demand-side response: A systematic review. Renew. Sustain. Energy Rev.130, 109899 10.1016/J.RSER.2020.109899 (2020). [Google Scholar]
- 33.Chen, Z., Xiao, F., Guo, F. & Yan, J. Interpretable machine learning for building energy management: A state-of-the-art review. Adv. Appl. Energy. 9, 100123 10.1016/J.ADAPEN.2023.100123 (2023). [Google Scholar]
- 34.Atteia, G. et al. Adaptive dynamic dipper throated optimization for feature selection in medical data. Computers Mater. Continua. 75 (1), 1883–1900 10.32604/CMC.2023.031723 (2023). [Google Scholar]
- 35.Flores-García, E., Hoon Kwak, D., Jeong, Y. & Wiktorsson, M. Machine learning in smart production logistics: a review of technological capabilities. Int. J. Prod. Res.63, 1898–1932 (2025). 10.1080/00207543.2024.2381145 [Google Scholar]
- 36.Sun, P. et al. Deep reinforcement learning based low energy consumption scheduling approach design for urban electric logistics vehicle networks. Sci. Rep.15 (1), 1–18 10.1038/S41598-025-92916-7 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Shah, M. A. & Wicaksono, H. Leveraging machine learning for power consumption prediction of Multi-Step production processes in dynamic electricity price environment. Procedia CIRP. 130, 226–231 10.1016/J.PROCIR.2024.10.080 (2024). [Google Scholar]
- 38.Long, X., Cai, W., Yang, L. & Huang, H. Improved particle swarm optimization with reverse learning and neighbor adjustment for space surveillance network task scheduling. Swarm Evol. Comput.10.1016/J.SWEVO.2024.101482 (2024). [Google Scholar]
- 39.El-Kenawy, E. S. M. et al. Metaheuristic optimization for improving weed detection in wheat images captured by drones. Mathematics10, 4421 (2022). 10.3390/MATH10234421 [Google Scholar]
- 40.Wei, M., Yang, S., Wu, W. & Sun, B. A multi-objective fuzzy optimization model for multi-type aircraft flight scheduling problem. Transport39, 313–322 (2024). 10.3846/TRANSPORT.2024.20536 [Google Scholar]
- 41.Fulginei, F. R. & Zournatzidou, G. Advancing sustainability through machine learning: modeling and forecasting renewable energy consumption. Sustainability 17, 1304. (2025). 10.3390/SU17031304 [Google Scholar]
- 42.Naz, A. et al. Using Transformers and Bi-LSTM with sentence embeddings for prediction of openness human personality trait. PeerJ Comput. Sci.11, 1–42 10.7717/PEERJ-CS.2781/SUPP-6 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Myriam, H. et al. Advanced Meta-Heuristic algorithm based on particle swarm and Al-Biruni Earth radius optimization methods for oral cancer detection. IEEE Access.11, 23681–23700. 10.1109/ACCESS.2023.3253430 (2023). [Google Scholar]
- 44.Alghieth, M. & Sustain, A. I. A multi-modal deep learning framework for carbon footprint reduction in industrial manufacturing. Sustainability17, 4134 (2025). 10.3390/SU17094134 [Google Scholar]
- 45.Zhang, B., Sang, H., Meng, L., Jiang, X. & Lu, C. Knowledge- and data-driven hybrid method for lot streaming scheduling in hybrid flowshop with dynamic order arrivals. Comput. Oper. Res.184, 107244 10.1016/J.COR.2025.107244 (2025). [Google Scholar]
- 46.Li, Z., Gu, W., Shang, H., Zhang, G. & Zhou, G. Research on dynamic job shop scheduling problem with AGV based on DQN. Cluster Comput.28, 1–18 (2025). 10.1007/S10586-024-04970-X [Google Scholar]
- 47.Zhang, B., Meng, Lu, L., Han, C. & Sang, H. Y. Automatic design of constructive heuristics for a reconfigurable distributed flowshop group scheduling problem. Comput. Oper. Res.10.1016/j.cor.2023.106432 (2024). [Google Scholar]
- 48.Meng, Q. et al. Economic optimization operation approach of integrated energy system considering wind power consumption and flexible load regulation. J. Electr. Eng. Technol.19 (1), 209–221 10.1007/S42835-023-01572-2 (2023). [Google Scholar]
- 49.Zhou, Y. Improvement of visual servo system of industrial robot based on sliding mode control and deep reinforcement learning. Theoretical Nat. Sci.41, 132–138 (2024). 10.54254/2753-8818/41/2024CH0168 [Google Scholar]