Abstract
Wind energy serves as a cornerstone in the global transition toward carbon neutrality, standing among the most vital renewable energy resources. Accurate wind power forecasting is critical for efficient grid integration and market operations, as a 15% error in prediction can result in profit losses of up to 13.8% for energy market participants. Nonetheless, precise wind power forecasting remains a challenging task, primarily due to the presence of high levels of noise in wind speed forecasts, which significantly undermines both the accuracy and robustness of predictive models. To address this issue, this study proposes a novel method named PCT (Physics-Constrained Transformer) for wind power forecasting. PCT integrates the efficient Transformer architecture with domain knowledge of wind power curves, aiming to not only enhance forecast accuracy but also maintain strong robustness in the presence of high-noise conditions. By embedding physical constraints into the Transformer structure, PCT ensures that its outputs conform to the expected probabilistic behavior of real-world wind power generation, thereby improving resilience against noisy input data. Experimental results based on operational data from 25 distinct wind turbines demonstrate the effectiveness of PCT. Under low-noise scenarios, the proposed method achieves an average performance improvement of approximately 15% over the previous state-of-the-art model. More notably, under scenarios involving different levels of input noise, PCT exhibits considerably better stability and accuracy compared to the standard Transformer model trained solely with MSE loss, highlighting the effectiveness of integrating domain knowledge into the Transformer model.
Keywords: Wind power forecasting, High noise, Physics-Constrained, Transformer, Wind power curve
Subject terms: Renewable energy, Wind energy
Introduction
As a rapidly expanding renewable energy source, wind power has become increasingly influential in both industrial and societal contexts1. Unlike conventional energy forms, wind energy is inherently intermittent and subject to environmental variability. It lacks dispatchability and is highly sensitive to transient meteorological disturbances such as thunderstorms and heavy precipitation. In the absence of energy storage infrastructure, wind-generated electricity must be consumed immediately upon production, which imposes significant operational constraints. Given these characteristics, precise forecasting of wind power output plays a pivotal role in optimizing the management and integration of wind farms into modern power systems2,3. The economic significance of accurate forecasts is underscored by a study4, which reports that a 15% error in prediction can result in profit losses ranging from 2% to 13.8% for market participants. While weather forecasts are commonly used as inputs for renewable energy prediction models, wind speed data tend to exhibit higher volatility and contain more uncertainty. This leads to greater levels of noise in wind speed forecasts, making wind power forecasting particularly challenging compared to other renewable sources such as photovoltaics5,6.
Wind power forecasting spans multiple temporal scales, with day-ahead forecasting being one of the most practically relevant for energy market operations and grid scheduling. Day-ahead forecasts support decision-making processes related to energy trading and resource allocation, underscoring their importance in maximizing wind energy utilization efficiency7,8. Therefore, this study focuses specifically on improving day-ahead wind power forecasting.
The increasing demand for accurate wind power forecasts has driven significant research efforts into developing advanced modeling techniques. These methods can be broadly grouped into two categories: knowledge-based methods and data-driven methods.
Knowledge-based methods utilize underlying physical laws to model wind dynamics, enabling the accurate prediction of wind power output. For example, researchers have employed hybrid strategies that combine traditional signal processing techniques—such as empirical mode decomposition—with ensemble learning frameworks to enhance forecasting reliability9. Another prominent class of knowledge-based methods includes Numerical Weather Prediction (NWP) systems10,11, which simulate future meteorological conditions by solving fluid dynamics equations over defined spatial and temporal domains. Wind power is then derived using deterministic mappings like the wind power curve12. Although these methods offer strong interpretability and physical consistency, they often demand extensive feature engineering and struggle to fully exploit historical data. In contrast, data-driven models rely solely on historical observations to make predictions. These models span from classical time series analysis methods such as ARMA and ARIMA13,14, which are effective for short-term linear trends, to modern deep learning architectures capable of capturing complex nonlinear dependencies. Within the category of deep learning models, multilayer perceptrons (MLPs) provide basic functional approximations, while recurrent neural networks (RNNs)15,16 demonstrate good performance in handling sequential data. Additionally, convolutional neural networks (CNNs)17 and autoencoder-based structures18 have been used to identify spatial and temporal correlations in wind data. Despite their representational ability, however, purely data-driven models typically demand large training sets and may suffer from overfitting or poor generalization.
Despite their respective strengths, neither knowledge-based nor data-driven models alone effectively mitigate the issue of high-noise wind speed forecasts while maintaining good performance. Recent advancements in modeling methodologies have focused on integrating domain knowledge with data-driven approaches to better handle complex, real-world problems. This hybrid modeling paradigm has been explored across various domains. For instance, Physics-Informed Neural Networks (PINNs)19 have shown remarkable success in computational fluid dynamics by embedding differential equations as soft constraints during training. Other studies have introduced frameworks that enforce strict adherence to known physical laws through constraint-based formulations20, while still others have developed more flexible structures such as the TgDLF and Adaptive-TgDLF21,22, which have shown effectiveness in electrical load forecasting. While these models effectively incorporate deterministic physical knowledge, many aspects of domain expertise—such as the relationship between wind speed and power output—are inherently probabilistic in nature23. In response to this challenge, theory-guided deep-learning wind power forecasting (TgDPF)24 was developed to leverage probabilistic domain knowledge in the form of probability distributions. Specifically, TgDPF integrates the statistical characteristics of wind power curves into the training of an LSTM-based deep learning model. The wind power curve is modeled using kernel density estimation25, and discrepancies between predicted and target distributions are quantified via Jensen-Shannon divergence26, allowing the model to learn under distributional constraints.
Although TgDPF has demonstrated promising performance in handling noisy wind speed data, it relies on LSTM architecture as its core temporal modeling component—an approach that no longer represents the state-of-the-art in sequence modeling. While LSTMs were once widely adopted27,28 for their ability to capture short-term temporal dependencies, they exhibit several limitations in modern forecasting tasks. These limitations include difficulty in modeling long-range temporal patterns, potential issues of gradient instability, including vanishing and exploding gradients29, and suboptimal performance when dealing with high-dimensional and complex input features. As a result, there exists a strong motivation to explore more advanced sequential architectures for improved wind power forecasting. The Transformer30 model, originally introduced in the field of natural language processing (NLP), has recently emerged as one of the most efficient and versatile architectures for time series forecasting31. LSTMs process series sequentially through a fixed recurrent structure, which can make it difficult to model long-range dependencies and may propagate noise through the hidden state over time. In contrast, the Transformer employs a self-attention mechanism that enables it to dynamically assign different weights to all historical time steps when making a prediction. This dynamic weighting is crucial for handling noisy inputs. Specifically, when the input wind speed forecast is corrupted by noise, the self-attention mechanism allows the model to adaptively down-weight the contribution of unreliable or noisy input features at specific time steps. Instead of relying solely on the noisy wind speed, the model can learn to place greater emphasis on more reliable historical patterns that are less affected by forecast errors. 
In this way, the Transformer can effectively filter out irrelevant noise by focusing on the most informative parts of the input sequence, acting as a form of implicit noise resilience. This capability, combined with their fully parallelizable structure, makes Transformer models particularly well-suited for high-dimensional and temporally complex tasks like renewable energy forecasting.
Building upon the conceptual framework of TgDPF and leveraging the superior temporal modeling capacity of the Transformer, this study introduces PCT (Physics-Constrained Transformer) for wind power forecasting, a novel hybrid forecasting method that integrates the Transformer deep learning model with domain knowledge of wind turbine power curves. The loss function of PCT consists of two complementary components: the first component quantifies the discrepancy between the model-predicted wind power distribution and the expected distribution derived from domain knowledge (e.g., the wind power curve). This probabilistic constraint ensures that the Transformer’s outputs remain physically plausible even under high-noise conditions. Specifically, we employ Jensen–Shannon (JS) divergence to measure the difference between the predicted and target distributions. The second component corresponds to the standard Mean Squared Error (MSE) loss, which drives the Transformer model to minimize the deviation between predicted and observed values based solely on empirical data. These two loss components are jointly optimized during training, allowing the model to benefit from both knowledge-based regularization and data-driven learning. By embedding physical constraints into the Transformer architecture, PCT not only improves predictive accuracy but also enhances robustness against highly noisy wind speed forecasts—addressing a key limitation of purely data-driven approaches.
To assess the effectiveness of PCT, we perform comprehensive experiments using real-world operational data collected from 25 different wind turbines. The results demonstrate that PCT achieves substantially better forecasting accuracy than the baseline TgDPF model under low-noise conditions, with an average performance improvement of approximately 15%. Moreover, PCT demonstrates substantially improved robustness over the standard Transformer model trained exclusively with MSE loss, particularly in high-noise scenarios. Specifically, under Gaussian noise levels exceeding N(0, 0.3), PCT achieves MSE reductions of over 50% compared to the baseline Transformer. This result highlights the value of integrating domain knowledge into deep learning frameworks—not only for enhancing predictive accuracy but also for improving model stability and reliability in complex, real-world environments characterized by noisy input data.
Methodology
This study proposes the Physics-Constrained Transformer (PCT) framework, designed to enhance wind power forecasting accuracy while effectively mitigating the adverse effects of high-noise wind speed forecasts. To facilitate a clear understanding of the proposed methodology, this section begins with a detailed overview of the three core elements that constitute the PCT: the Transformer, an efficient deep learning model for temporal modeling; kernel density estimation–based representation of wind power curves as probability distributions; and the Jensen–Shannon (JS) divergence, used to quantify discrepancies between these distributions. Finally, building upon these elements, the comprehensive framework of PCT is introduced.
Transformer
The Transformer architecture has attracted considerable interest across multiple disciplines and has been successfully applied in diverse domains such as natural language processing (NLP)32, computer vision (CV)33, medical AI34, and signal analysis35. Among recent advances in time series forecasting36,37, the Transformer has emerged as a particularly efficient architecture, distinguished by its ability to capture long-range dependencies across sequential data elements38,39.
At its core, the Transformer model consists of two primary components: an encoder and a decoder, as shown in Fig. 1. The encoder processes input sequences through a stack of identical blocks, each containing a multi-head self-attention mechanism and a position-wise feed-forward network (FFN). These modules in the encoder allow the Transformer model to capture complex interactions between different time steps. The decoder similarly comprises a sequence of blocks, each integrating two multi-head attention layers—one for attending to encoder outputs and another for self-attention over previous decoder states—alongside an FFN module. To facilitate training deeper architectures, residual connections are applied within each block30.
Fig. 1.
Structure of the Transformer model. The architecture features an encoder-decoder framework with self-attention mechanisms, enabling effective modeling of long-term dependencies in sequential data.
Central to the Transformer’s effectiveness is the attention mechanism, which enables dynamic feature interaction by computing relevance scores between different parts of the input data. The attention mechanism operates through three components: query, key, and value vectors. Queries represent the current element seeking relevant context, keys act as matching indices, and values provide the corresponding information. This design allows the Transformer model to selectively focus on informative parts of the input, enabling it to capture intricate temporal patterns that are challenging for RNN-based approaches. More formally, the attention mechanism computes similarity scores using the scaled dot product between queries and keys, followed by a softmax operation to derive weights that are then applied to the values:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{D_k}}\right)V \tag{1}$$
where Q, K, and V denote the query, key, and value matrices respectively, and Dk is the scaling factor based on the dimensionality of the keys. The resulting attention matrix A, given by the softmax term, guides the aggregation of the values. Furthermore, to enhance representation learning, the Transformer model usually employs multi-head attention, where multiple independent attention functions operate in parallel across different linear projections of the input.
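As a concrete illustration, the scaled dot-product attention of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal, framework-free sketch; the toy sequence length and model dimension are illustrative, not the values used in PCT:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (1): softmax(Q K^T / sqrt(D_k)) applied to the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # attention matrix: rows sum to 1
    return A @ V, A                               # weighted aggregation of values

rng = np.random.default_rng(0)
T, d = 6, 4                                       # toy sequence length and model dim
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out, A = scaled_dot_product_attention(Q, K, V)
print(out.shape, A.shape)                         # (6, 4) (6, 6)
```

Because each row of A sums to one, every output step is a convex combination of the value vectors, which is what allows the model to down-weight unreliable time steps.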
Wind power curve and kernel density estimation
Wind turbines are electromechanical systems designed to transform kinetic wind energy into electrical power. A fundamental characteristic of a wind turbine is its wind power curve, which describes the functional relationship between wind speed and the corresponding wind power output. This curve serves as an indicator of the wind turbine’s operational efficiency and mechanical behavior under varying wind conditions40.
As illustrated in Fig. 2, the wind power curve typically maps wind speed to the expected wind power output. However, in real-world settings, this relationship is not strictly deterministic. Instead, it is better characterized as a joint probability distribution over wind speed and wind power output. Due to environmental variability and internal system dynamics, even under identical atmospheric conditions—including temperature, pressure, and humidity—the same wind speed may yield different wind power outputs. Two critical thresholds define the operational range of a wind turbine: the cut-in speed, below which the wind turbine does not generate power, and the cut-out speed, beyond which the wind turbine is shut down for safety reasons, even if higher wind speeds would otherwise produce more wind power.
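To make the cut-in and cut-out behavior concrete, the following sketch implements an idealised deterministic power curve. The cut-in, rated, and cut-out speeds and the rated power are hypothetical illustrative values, not parameters of the turbines studied here, and the real curve is probabilistic rather than this clean piecewise function:

```python
import numpy as np

def power_curve(v, cut_in=3.0, rated_speed=12.0, cut_out=25.0, rated_power=2000.0):
    """Idealised deterministic wind power curve (speeds in m/s, power in kW;
    all parameter values are hypothetical, for illustration only)."""
    v = np.asarray(v, dtype=float)
    p = np.zeros_like(v)                                 # zero below cut-in
    ramp = (v >= cut_in) & (v < rated_speed)
    p[ramp] = rated_power * (v[ramp]**3 - cut_in**3) / (rated_speed**3 - cut_in**3)
    p[(v >= rated_speed) & (v < cut_out)] = rated_power  # plateau at rated power
    return p                                             # zero again at/above cut-out

p = power_curve([2.0, 8.0, 15.0, 26.0])
print(p)   # zero below cut-in, partial on the ramp, rated in the plateau, zero past cut-out
```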
Fig. 2.

Wind Power Curve. This curve characterizes the functional dependence of wind power generation on wind speed, with the x-axis denoting wind speed and the y-axis representing the corresponding wind power output of the turbine.
In line with the methodology used in TgDPF, we employ a non-parametric statistical technique known as kernel density estimation (KDE) to model the probabilistic nature of the wind power curve. Also referred to as Parzen’s window25, KDE is a data-driven approach for estimating the underlying probability density function of a random variable without assuming a specific parametric form. Given a set of independent and identically distributed samples X1, X2, …, Xn drawn from an unknown distribution P with density function f, the kernel density estimator can be defined as:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) \tag{2}$$
where n denotes the number of training samples, h is the bandwidth parameter that controls the smoothness of the estimate, and K(⋅) is the kernel function. Studies have shown that the choice of kernel function has a relatively minor impact on the overall quality of the density estimation41,42. In contrast, the selection of the bandwidth h plays a crucial role in determining the accuracy and stability of the resulting estimate. In PCT, the bandwidth h is calculated as in TgDPF:
$$h = \lambda \hat{\sigma} n^{-1/5} \tag{3}$$

where $\hat{\sigma}$ represents the unbiased standard deviation, n is the number of bins, and λ is a tunable scaling factor, which is determined through a grid search procedure. Specifically, we conducted a hyperparameter search over a small set of candidate values: {0.07, 0.1, 0.12}. This search was performed on a representative subset of the data to ensure computational efficiency while still providing a reliable estimate. Based on our empirical evaluation, we observed that λ = 0.1 consistently yielded optimal or near-optimal performance across different turbines in terms of the quality of the estimated wind power curve distribution. Importantly, all steps involved in bandwidth selection and kernel density estimation are differentiable, enabling gradient-based optimization within the PCT framework.
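A minimal sketch of Gaussian-kernel KDE follows, assuming the TgDPF-style bandwidth takes the Silverman-like form h = λ·σ̂·n^(−1/5) with n taken as the sample size; the synthetic wind-speed sample and evaluation grid are illustrative:

```python
import numpy as np

def kde(x_grid, samples, lam=0.1):
    """Gaussian-kernel KDE, Eq. (2), with bandwidth h = lam * sigma * n^(-1/5), Eq. (3)."""
    n = samples.size
    sigma = samples.std(ddof=1)                  # unbiased standard deviation
    h = lam * sigma * n ** (-1 / 5)              # lam = 0.1 chosen by grid search in PCT
    u = (x_grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return kernel.sum(axis=1) / (n * h)

rng = np.random.default_rng(1)
speeds = rng.normal(8.0, 2.0, size=500)          # synthetic wind speeds (m/s)
grid = np.linspace(0.0, 16.0, 801)
density = kde(grid, speeds)
integral = float(density.sum() * (grid[1] - grid[0]))
print(round(integral, 3))                        # close to 1: a valid probability density
```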
Jensen–Shannon (JS) loss
Given that the wind power curve is treated as a probability distribution in the optimization process—similar to the approach taken in TgDPF—the Jensen–Shannon (JS) divergence26 is employed as a measure of distance between the predicted and actual distributions. This divergence metric is rooted in Kullback–Leibler (KL) divergence43, also known as relative entropy or information divergence, which quantifies the difference in information content between two probability distributions.
For two continuous probability distributions P and Q, the KL divergence can be defined as:
$$D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx \tag{4}$$
While KL divergence provides a foundational measure of distributional disparity, it lacks symmetry—that is, $D_{KL}(P \,\|\, Q)$ is not equal to $D_{KL}(Q \,\|\, P)$—which limits its suitability as a distance metric in many applications. This asymmetry implies that the penalty for predicting a distribution P when the true distribution is Q differs from the reverse scenario. In the context of model training, this can lead to inconsistent optimization behavior depending on the direction of the divergence. To address this limitation, the JS divergence introduces a symmetric formulation based on the average of the two distributions:
$$D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}\!\left(P \,\middle\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\middle\|\, \frac{P+Q}{2}\right) \tag{5}$$
In addition, KL divergence is highly sensitive to zero probabilities and outliers. Specifically, if the model assigns a non-zero probability to a region where the target distribution has zero density (or vice versa), the KL divergence can become extremely large or even undefined, leading to unstable gradients during training. Unlike KL divergence, JS divergence yields values within the bounded interval [0, 1] when base-2 logarithms are used, where a value of 0 indicates perfect alignment between the two distributions. Its symmetric nature makes it a more robust and interpretable measure for comparing probability distributions. In the context of wind power forecasting, the JS divergence serves as a key loss component in model training. Specifically, by treating P as the wind power distribution predicted by the deep learning model (i.e., Transformer) and Q as the empirically observed wind power curve, the resulting JS loss guides the model toward generating wind power forecasts that are both accurate and probabilistically consistent with real-world data.
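A discretised form of Eqs. (4) and (5) can be sketched as follows, using base-2 logarithms so the result is bounded in [0, 1]; the small epsilon guards against zero probabilities, and the histogram inputs are illustrative:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Discrete JS divergence with base-2 logs, bounded in [0, 1] (Eq. 5)."""
    p = p / p.sum()
    q = q / q.sum()                              # normalise to probability vectors
    m = 0.5 * (p + q)                            # the average distribution
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))  # Eq. (4), discretised
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

same = js_divergence(np.array([0.1, 0.4, 0.5]), np.array([0.1, 0.4, 0.5]))
disjoint = js_divergence(np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.5, 0.5]))
print(same, round(disjoint, 6))                  # 0 for identical, ~1 for disjoint supports
```

The disjoint case saturating at 1 is exactly the gradient-vanishing scenario that PCT's dynamic weighting later addresses.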
Physics-Constrained Transformer for wind power forecasting
Leveraging the strong temporal modeling capabilities of the Transformer architecture and incorporating domain-specific knowledge embedded in wind power curves, this study proposes the Physics-Constrained Transformer (PCT) framework for wind power forecasting. A schematic illustration of PCT’s flowchart is provided in Fig. 3.
Fig. 3.
Flowchart of PCT. The framework combines domain knowledge of wind power curve and the deep learning model Transformer through a weighted combination of JS divergence loss (enforcing distributional alignment with the actual wind power curve) and MSE loss (minimizing point-wise prediction errors), guiding the optimization of the Transformer model.
At its core, PCT leverages the Transformer model as the primary forecasting engine, enhanced through a hybrid learning strategy that integrates both data-driven loss minimization and physics-informed regularization. Prior to model training, the actual wind power curve is constructed. Given the similarity in turbine characteristics across the dataset, an averaged wind power curve is derived by computing mean wind speeds and corresponding power outputs across all turbines at each time step. This averaged curve is then refined using KDE to approximate the underlying joint probability distribution of the wind speed and wind power.
The loss function of PCT combines two components, as illustrated in Fig. 3: (1) MSE loss, which measures the point-wise discrepancy between predicted and real wind power values, serving as the standard data-driven objective; (2) JS loss, which quantifies the divergence between the predicted and actual wind power distributions, acting as a domain knowledge regularization term. These two loss terms are linearly combined during training to optimize the Transformer model. While the MSE loss ensures accurate numerical wind power predictions, the JS loss encourages distributional consistency with real-world wind power behavior, significantly improving the Transformer model’s performance under high-noise conditions. The integration of these objectives results in a forecasting system that achieves both precision and resilience, particularly in challenging operational environments.
In the PCT framework, the Transformer model takes inputs with the following dimensions: (1) batch size of 300, indicating the number of training samples processed per iteration; (2) time steps of 576, corresponding to four days of historical data sampled at 10-minute intervals; and (3) feature dimension of 3, consisting of wind speed, pitch angle, and historical wind power. The Transformer model contains one encoder layer and one decoder layer, with a hidden dimension size of 12. The multi-head attention employs two heads to learn diverse representations. The attention mechanism is used to learn interactions across all time steps and features.
The output of the Transformer also maintains three dimensions: (1) batch size consistent with the input batch; (2) forecasting horizon of 144 time steps, representing a 24-hour ahead forecast at 10-minute intervals; and (3) output feature dimension of 1, reflecting the predicted wind power value for each future time step.
In the PCT framework, a dynamic threshold-based switching mechanism is adopted to address a key technical challenge associated with the JS divergence loss. Specifically, while the JS divergence is theoretically differentiable and provides a valuable gradient signal for aligning distributions, it has a critical limitation: when the predicted wind power distribution produced by the Transformer model has no overlap with the target wind power curve distribution, the JS divergence saturates to a fixed value of log 224. In this scenario, the gradient of the JS loss becomes zero, rendering it ineffective for guiding model updates—a phenomenon known as gradient vanishing. This situation is highly likely during the initial stages of training, when the model’s outputs are essentially random and may lie far outside the physically plausible range defined by the wind power curve. To mitigate this issue, a dynamic weighting strategy is implemented for the JS loss. The MSE loss weight is fixed at 1 throughout training. The JS loss weight (JS ratio), however, is dynamically adjusted based on the characteristics of the model’s current predictions. Specifically, the JS ratio is set to 1 if the model’s predictions are plausible—defined by a standard deviation greater than 1e-4 and a mean power within the range covered by the wind power curve computation. Otherwise, the JS ratio is set to 0, and only the MSE loss is used. This allows the Transformer model to first stabilize using MSE before incorporating the JS divergence for physical consistency. In practice, the JS ratio typically switches from 0 to 1 within the first few iterations and remains active thereafter.
In summary, the PCT model is trained as follows: (1) The target wind power curve distribution is constructed from historical data using KDE. (2) Input series are fed into the Transformer model to predict 24-hour ahead wind power. (3) A dynamic loss is computed: the MSE loss is always applied; the JS divergence loss, measuring the discrepancy between the predicted and target power distributions, is activated when the model’s predictions are physically plausible (i.e., predicted power has sufficient variance and its mean lies within the operational range). Otherwise, only the MSE loss is used to stabilize the initial training phase. (4) The combined loss is backpropagated to update the model. This process ensures robust convergence and enforces physical consistency.
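The dynamic loss described in steps (3) and (4) can be sketched as follows. This is a simplified NumPy illustration of the gating logic under stated assumptions: the histogram binning, the plausibility thresholds, and the uniform toy data are all illustrative, not the actual training configuration:

```python
import numpy as np

def _js(p, q, eps=1e-12):
    """Base-2 JS divergence between two (unnormalised) histograms."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pct_loss(pred, target, target_hist, bins, std_floor=1e-4):
    """MSE always applies; the JS term switches on (JS ratio = 1) only when the
    predictions are plausible: enough spread and a mean inside the curve's range."""
    mse = np.mean((pred - target) ** 2)
    plausible = pred.std() > std_floor and bins[0] <= pred.mean() <= bins[-1]
    if not plausible:
        return mse                               # JS ratio = 0: stabilise with MSE only
    pred_hist, _ = np.histogram(pred, bins=bins, density=True)
    return mse + _js(pred_hist, target_hist)     # JS ratio = 1

bins = np.linspace(0.0, 1.0, 21)
rng = np.random.default_rng(2)
target = rng.uniform(0.0, 1.0, 1000)
target_hist, _ = np.histogram(target, bins=bins, density=True)
cold = np.full(1000, -5.0)                       # early-training-like output: constant, out of range
warm = rng.uniform(0.0, 1.0, 1000)               # plausible predictions
loss_cold = pct_loss(cold, target, target_hist, bins)
loss_warm = pct_loss(warm, target, target_hist, bins)
print(loss_cold > loss_warm)                     # True
```

In the degenerate case the gate falls back to MSE alone, mirroring how PCT stabilises early training before the JS constraint takes over.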
Experiment
Data description and experiment setting
This study utilizes experimental data collected from 25 wind turbines located in Yancheng, Jiangsu Province, China. The dataset spans the entire year of 2020 and includes three key variables: wind speed, pitch angle, and generated wind power. All signals are recorded at a 10-minute sampling interval, resulting in approximately four million total data points across all turbines. The code is available at https://github.com/daxin007/PCT-WPF. Given the high integrity and consistency of the original measurements, only minimal preprocessing was performed. Specifically, negative values—considered non-physical in this context—were set to zero, and missing entries were imputed using the mean value of the respective feature. As illustrated in Fig. 4, the statistical distributions of the features across all 25 turbines exhibit strong similarity, which supports the use of the average wind power curve as the reference or target wind power curve.
Fig. 4.
Distribution of wind power, wind speed, and pitch angle. The analysis reveals a high degree of consistency in the statistical patterns of wind power, wind speed, and pitch angle across the various wind turbines, suggesting broadly similar behavioral trends among the units.
For model development, data from January to October were used for training, while November and December were reserved for testing. Each input sample spans a duration of four days and includes historical records of pitch angle, wind speed (comprising three days of past observations and one day of forecasted values), and historical wind power generation. Samples are generated using a sliding-window approach: the initial window covers the first four days of input data, and the corresponding output is the wind power for the fifth day. Subsequently, both input and output windows are shifted forward by one day to generate additional training or evaluation samples.
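The sliding-window sample generation described above can be sketched as follows; the function and variable names are illustrative, and ten days of random data stand in for the real measurements:

```python
import numpy as np

def make_samples(features, power, in_days=4, out_days=1, steps_per_day=144):
    """Sliding-window sampling: a 4-day input window, a 1-day target window,
    both shifted forward by one day per sample (10-minute resolution)."""
    L_in = in_days * steps_per_day                      # 576 input steps
    L_out = out_days * steps_per_day                    # 144 output steps
    X, y, start = [], [], 0
    while start + L_in + L_out <= len(power):
        X.append(features[start:start + L_in])          # days 1-4: input features
        y.append(power[start + L_in:start + L_in + L_out])  # day 5: target power
        start += steps_per_day                          # advance by one day
    return np.stack(X), np.stack(y)

rng = np.random.default_rng(4)
T = 10 * 144                                            # ten days of 10-minute data
feats = rng.random((T, 3))                              # wind speed, pitch angle, power
power = rng.random(T)
X, y = make_samples(feats, power)
print(X.shape, y.shape)                                 # (6, 576, 3) (6, 144)
```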
To simulate realistic forecasting conditions, synthetic wind speed forecasts were generated by introducing additive Gaussian noise to the original wind speed measurements. The perturbed (forecasted) wind speed at time step t + 1, denoted $\hat{v}_{t+1}$, is computed as:
$$\hat{v}_{t+1} = v_{t+1} + \varepsilon, \qquad \varepsilon \sim N(0, x) \tag{6}$$
where
is the standardized real wind speed at the time step t+1, and the noise term follows a normal distribution N(0, x), representing unbiased random fluctuations around the true value. This simulates the uncertainty typically present in wind speed forecasts. The noise range in our experiment is N(0, 0.1) to N(0, 0.7), which is based on statistical analysis of real-world wind speed forecast errors44. Specifically, the normalized root mean square error wind speed forecasts typically falls within the range of 0.2 to 0.4. To comprehensively evaluate the performance and robustness limits of the proposed method under extreme conditions, we extended the noise levels up to N(0, 0.6) and N(0, 0.7). These high-noise scenarios represent severely distorted wind speed forecasts. Model performance is evaluated using the Mean Squared Error (MSE), a widely adopted metric that quantifies the average squared deviation between predicted and actual wind power outputs. It is defined as:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2 \tag{7}$$

where $y_i$ denotes the standardized actual wind power at time step i, $\hat{y}_i$ represents the corresponding standardized wind power prediction, and N is the number of forecasted points. Lower MSE values indicate better alignment between predicted and observed wind power output.
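Eqs. (6) and (7) can be sketched together as follows. Note one assumption: the code reads the x in N(0, x) as the variance of the Gaussian perturbation; the series length and seed are illustrative:

```python
import numpy as np

def add_forecast_noise(v, x, rng):
    """Eq. (6): synthetic forecast = standardised real speed + Gaussian noise.
    Assumption: x in N(0, x) is interpreted as the noise variance."""
    return v + rng.normal(0.0, np.sqrt(x), size=v.shape)

def mse(y, y_hat):
    """Eq. (7): mean squared error between standardised series."""
    return np.mean((y - y_hat) ** 2)

rng = np.random.default_rng(5)
v = rng.normal(0.0, 1.0, size=1000)                  # standardised real wind speed
m_low = mse(v, add_forecast_noise(v, 0.1, rng))      # mild noise
m_high = mse(v, add_forecast_noise(v, 0.7, rng))     # severe noise
print(m_high > m_low)                                # True: error tracks the noise level
```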
Wind power forecasting under varying noise conditions
To assess the effectiveness and robustness of the proposed PCT framework, wind power forecasting experiments were conducted under different levels of input noise. Specifically, controlled amounts of unbiased (zero-mean) Gaussian noise were introduced into the wind speed data of the training set. This experimental setup was designed to simulate real-world forecasting scenarios where wind speed forecasts may be subject to random disturbances. The baseline model used for comparison was a standard Transformer model trained exclusively with MSE loss.
Table 1 summarizes the results obtained under varying levels of unbiased Gaussian noise, denoted as N(0, x), introduced into the wind speed data during training. Each major row—such as N(0, 0.1)—represents a separate experimental condition in which that specific level of noise was applied. Model performance is evaluated using the MSE computed on the test set. The columns labeled from ‘Exp 0’ to ‘Exp 4’ represent five independent experimental runs, with the final column reporting the average performance across all trials. Within each noise condition, three sub-rows are presented: one for the Transformer model trained solely with MSE loss (denoted as ‘Transformer’), one for the proposed PCT model (denoted as ‘PCT’), and one indicating the relative improvement of PCT over the Transformer in percentage terms (denoted as ‘Impro(%)’). For each experimental run, the ‘Impro(%)’ is computed as:
$$\mathrm{Impro}(\%) = \frac{\mathrm{MSE}_{\mathrm{Transformer}} - \mathrm{MSE}_{\mathrm{PCT}}}{\mathrm{MSE}_{\mathrm{Transformer}}} \times 100\% \tag{8}$$
Table 1.
Wind power forecasting performance of models under training data perturbed with N(0, x) noise. The results are evaluated using the MSE metric, reflecting the models’ predictive accuracy under noisy input conditions.
| Noise | Model | Exp 0 | Exp 1 | Exp 2 | Exp 3 | Exp 4 | AVG |
|---|---|---|---|---|---|---|---|
| N(0, 0.1) | Transformer | 0.0271 | 0.0287 | 0.0298 | 0.0271 | 0.0283 | 0.0282 |
| | PCT | 0.0264 | 0.0250 | 0.0272 | 0.0264 | 0.0259 | 0.0262 |
| | Impro(%) | 2.6 | 14.8 | 8.7 | 2.6 | 8.4 | 7.1 |
| N(0, 0.2) | Transformer | 0.0362 | 0.0392 | 0.0402 | 0.0389 | 0.0384 | 0.0386 |
| | PCT | 0.0298 | 0.0277 | 0.0283 | 0.0297 | 0.0287 | 0.0288 |
| | Impro(%) | 17.7 | 29.3 | 29.6 | 23.7 | 25.3 | 25.4 |
| N(0, 0.3) | Transformer | 0.0663 | 0.0582 | 0.0672 | 0.0691 | 0.0764 | 0.0674 |
| | PCT | 0.0345 | 0.0337 | 0.0348 | 0.0349 | 0.0358 | 0.0347 |
| | Impro(%) | 47.9 | 42.1 | 48.2 | 49.4 | 53.1 | 48.5 |
| N(0, 0.4) | Transformer | 0.1157 | 0.1130 | 0.1128 | 0.1149 | 0.1163 | 0.1145 |
| | PCT | 0.0396 | 0.0408 | 0.0406 | 0.0392 | 0.0389 | 0.0398 |
| | Impro(%) | 65.8 | 64.0 | 64.0 | 65.9 | 66.6 | 65.2 |
| N(0, 0.5) | Transformer | 0.1875 | 0.1748 | 0.1958 | 0.1665 | 0.1813 | 0.1812 |
| | PCT | 0.0567 | 0.0553 | 0.0599 | 0.0657 | 0.0657 | 0.0607 |
| | Impro(%) | 69.8 | 68.4 | 69.4 | 60.5 | 63.8 | 66.3 |
| N(0, 0.6) | Transformer | 0.2574 | 0.2424 | 0.2422 | 0.2260 | 0.2235 | 0.2383 |
| | PCT | 0.0807 | 0.0796 | 0.0870 | 0.0979 | 0.0978 | 0.0886 |
| | Impro(%) | 68.7 | 67.2 | 64.1 | 56.7 | 56.2 | 62.8 |
| N(0, 0.7) | Transformer | 0.3449 | 0.3652 | 0.3278 | 0.2984 | 0.3106 | 0.3294 |
| | PCT | 0.1262 | 0.1343 | 0.1565 | 0.1541 | 0.1450 | 0.1432 |
| | Impro(%) | 63.4 | 63.2 | 52.3 | 48.4 | 43.3 | 58.3 |
where MSE_PCT is the MSE of the proposed PCT model, and MSE_baseline is the MSE of the corresponding baseline model.
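The improvement metric can be verified numerically against the entries of Table 1; for example, the first run under N(0, 0.2):

```python
def improvement_pct(mse_baseline, mse_pct):
    """Relative MSE improvement of PCT over a baseline, as defined in Eq. (8)."""
    return (mse_baseline - mse_pct) / mse_baseline * 100.0

# First run under N(0, 0.2) in Table 1: Transformer 0.0362, PCT 0.0298.
print(round(improvement_pct(0.0362, 0.0298), 1))  # -> 17.7, matching the table
```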
The forecasting results clearly show that PCT consistently outperforms the Transformer across all noise levels. Under low-noise conditions, specifically N(0, 0.1), N(0, 0.2), and N(0, 0.3), PCT achieves MSE reductions of 7.1%, 25.4%, and 48.5%, respectively. As the noise intensity increases, PCT continues to deliver substantial improvements, often exceeding a 50% reduction in MSE compared to the Transformer, highlighting its superior robustness to noisy wind speed inputs. The radar chart in Fig. 5 provides a more intuitive visualization of PCT's performance advantage over the MSE-trained Transformer under different noise conditions.
Fig. 5.
Radar chart comparing PCT and the MSE-trained Transformer under varying noise levels N(0, x). The radial axes indicate the noise level, and the actual MSE values are also annotated on the chart. PCT consistently outperforms the Transformer, especially under high-noise conditions (x > 0.3), where it achieves loss reductions exceeding 50%.
Figure 6 presents a qualitative comparison of wind power forecasting performance between PCT and the baseline MSE-trained Transformer on a representative wind turbine, across three distinct time intervals. In the figure, the red curve denotes the actual wind power output, the violet line represents PCT's predictions, and the light-blue line corresponds to the Transformer's forecasts. Visually, it is evident that PCT produces predictions that align more closely with the true power output, accurately capturing both the general trend and fine-grained variations in wind power generation. This level of fidelity demonstrates PCT's ability to maintain predictive accuracy even in the presence of substantial noise, outperforming the baseline approach. The results further highlight PCT's enhanced capacity to preserve temporal dynamics and resist the disruptive effects of noisy wind speed inputs. By integrating the Transformer's advanced temporal modeling capabilities with domain knowledge of the wind power curve, PCT ensures that its predictions remain consistent with the physical characteristics of wind power generation, leading to enhanced wind power forecasting accuracy and reliability.
Fig. 6.
Comparative visualization of PCT and MSE-trained Transformer forecasts on the test set following the injection of N(0, 0.5) noise into the training data. The figure illustrates the predicted versus actual wind power output, clearly demonstrating the enhanced robustness and accuracy of PCT in handling noisy wind speed inputs.
The integration of the JS loss into the Transformer architecture introduces additional computational overhead. Specifically, the additional cost primarily stems from two operations: (1) estimating the predicted power distribution from the model's output batch, and (2) computing the JS divergence between this predicted distribution and the pre-computed target wind power curve distribution. Both operations are efficiently implemented on the GPU and benefit from batch-level parallelization. Based on experiments conducted on identical hardware (NVIDIA RTX 4090 GPU), the inclusion of the physics-constrained JS loss results in an approximate 27.5% increase in computational cost compared to the MSE-trained Transformer. Although this represents a non-negligible overhead, it is considered acceptable given the substantial gains in prediction accuracy and robustness, particularly under high-noise conditions. The improved performance, as demonstrated by over 50% reduction in MSE in challenging scenarios, justifies the additional computational expense, making PCT a practical and effective solution for noise-resilient wind power forecasting.
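A minimal numpy sketch of these two operations, assuming (for illustration) that the predicted power distribution is estimated with a simple histogram rather than the KDE used in the paper; the function names and sample data are hypothetical:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def physics_loss(pred_power, target_hist, bins):
    """Histogram the batch of predicted power, then measure its JS
    divergence from the pre-computed target power distribution."""
    pred_hist, _ = np.histogram(pred_power, bins=bins)
    return js_divergence(pred_hist.astype(float) + 1e-12, target_hist)

bins = np.linspace(0.0, 1.0, 21)             # standardized power in [0, 1]
rng = np.random.default_rng(0)
# Stand-in "target" distribution derived from 10,000 synthetic power samples.
target = np.histogram(rng.beta(2, 5, 10_000), bins=bins)[0].astype(float)

close = physics_loss(rng.beta(2, 5, 2_000), target, bins)     # matches target
far = physics_loss(rng.uniform(0, 1, 2_000), target, bins)    # does not
print(close < far)  # a matching predicted distribution yields a smaller loss
```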
Comparative analysis of PCT and TgDPF
This subsection presents a comparative evaluation of PCT and TgDPF under different noise conditions, as summarized in Table 2. As previously mentioned, the two frameworks share a similar conceptual foundation, both integrating domain knowledge—specifically wind power curves—into deep learning architectures to enhance forecasting accuracy and robustness. However, unlike TgDPF, which employs the LSTM architecture for temporal modeling, PCT utilizes the Transformer—a more advanced and effective sequence model known for its superior ability to capture long-range dependencies and complex temporal patterns.
Table 2.
Comparison of PCT and TgDPF under different input noise N(0, x) in training data. The results are evaluated using the MSE metric, reflecting the models’ predictive accuracy under noisy input conditions.
| Noise | Model | Exp 0 | Exp 1 | Exp 2 | Exp 3 | Exp 4 | AVG |
|---|---|---|---|---|---|---|---|
| N(0, 0.1) | TgDPF | 0.0322 | 0.0307 | 0.0300 | 0.0301 | 0.0319 | 0.0310 |
|  | PCT (Ours) | 0.0264 | 0.0250 | 0.0272 | 0.0264 | 0.0259 | 0.0262 |
|  | Impro(%) | 18.0 | 18.6 | 9.3 | 12.3 | 18.8 | 15.4 |
| N(0, 0.2) | TgDPF | 0.0353 | 0.0328 | 0.0329 | 0.0323 | 0.0327 | 0.0332 |
|  | PCT (Ours) | 0.0298 | 0.0277 | 0.0283 | 0.0297 | 0.0287 | 0.0288 |
|  | Impro(%) | 15.6 | 15.5 | 14.0 | 8.0 | 12.3 | 13.3 |
| N(0, 0.3) | TgDPF | 0.0374 | 0.0356 | 0.0372 | 0.0376 | 0.0385 | 0.0373 |
|  | PCT (Ours) | 0.0345 | 0.0337 | 0.0348 | 0.0349 | 0.0358 | 0.0347 |
|  | Impro(%) | 7.7 | 5.3 | 6.5 | 7.2 | 7.0 | 6.7 |
| N(0, 0.4) | TgDPF | 0.0407 | 0.0427 | 0.0405 | 0.0398 | 0.0390 | 0.0405 |
|  | PCT (Ours) | 0.0396 | 0.0408 | 0.0406 | 0.0392 | 0.0389 | 0.0398 |
|  | Impro(%) | 2.7 | 4.5 | −0.2 | 1.5 | 0.3 | 1.7 |
| N(0, 0.5) | TgDPF | 0.0466 | 0.0464 | 0.0463 | 0.0443 | 0.0439 | 0.0455 |
|  | PCT (Ours) | 0.0567 | 0.0553 | 0.0599 | 0.0657 | 0.0657 | 0.0607 |
|  | Impro(%) | −21.7 | −19.2 | −29.4 | −48.3 | −49.7 | −33.4 |
As illustrated by the forecasting results, PCT demonstrates substantial performance gains over TgDPF under low-noise conditions. Specifically, when the wind speed input in the training data is perturbed with unbiased Gaussian noise N(0, 0.1), PCT achieves a 15.4% reduction in MSE compared to TgDPF. Under slightly increased noise levels N(0, 0.2), PCT still delivers a 13.3% improvement, and even at N(0, 0.3), it maintains a consistent gain of 6.7% in terms of reduced MSE. The findings demonstrate that replacing the LSTM architecture in TgDPF with the Transformer is a justified design choice under low-noise conditions, due to its superior ability to learn temporal dependencies and maintain forecasting performance stability.
However, as the noise level increases to N(0, 0.4), the relative improvement of PCT over TgDPF diminishes significantly, with only a 1.7% gain observed. When the noise level becomes more severe (i.e., N(0, x) with x ≥ 0.5), PCT’s robustness begins to decline compared to TgDPF. This degradation may be attributed to the Transformer’s relatively complex architecture, which can amplify the impact of noisy inputs during both training and inference. Nevertheless, it is worth noting that wind speed forecasts with such high levels of noise are rarely encountered in real-world applications.
As shown in Fig. 7, when the noise level is not higher than 0.3, the perturbed wind speed data still retains a clear trend that resembles the actual wind speed. In contrast, at higher noise levels, the discrepancy between the noisy and actual wind speed becomes so large that the underlying trend is no longer discernible. Wind speed forecasts with such high levels of noise are unlikely to contribute meaningfully to accurate wind power forecasting and are generally not employed in practical applications.
Fig. 7.
Wind speed data with added noise N(0, x), where x denotes the noise level. The left side of the dashed line represents the original observed wind speed data, while the right side presents the wind speed corrupted by varying levels of noise.
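The observation in Fig. 7 can be reproduced on a toy signal: correlation with the clean series remains high at noise level 0.3 but drops markedly at 0.7. The sinusoidal "wind speed" below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0, 8 * np.pi, 10_000)
v = np.sin(t)  # toy standardized wind-speed trend

def corr_with_noisy(clean, noise_std):
    """Pearson correlation between a clean series and its noisy copy."""
    noisy = clean + rng.normal(0.0, noise_std, clean.size)
    return float(np.corrcoef(clean, noisy)[0, 1])

c_low = corr_with_noisy(v, 0.3)   # trend still clearly discernible
c_high = corr_with_noisy(v, 0.7)  # trend largely buried in noise
print(c_low > c_high)  # -> True
```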
Additional analysis of PCT
This subsection presents an additional analysis of PCT, encompassing sensitivity analysis on the bandwidth parameter, experiments on model depth, comparisons with additional baseline models, and attention heatmap visualizations. These analyses collectively provide a comprehensive evaluation of the model’s design choices and underlying mechanisms.
To assess the impact of the bandwidth parameter on PCT performance, we conducted a sensitivity analysis under the noise condition N(0, 0.3), with results summarized in Table 3. The results show that the model achieves its best performance with a bandwidth of 0.10, which is slightly superior to the settings of 0.07 and 0.12.
Table 3.
Sensitivity analysis of the bandwidth parameter under the noise condition N(0, 0.3). The results are evaluated using the MSE metric, reflecting the models' predictive accuracy under noisy input conditions.
| Bandwidth value | Exp 0 | Exp 1 | Exp 2 | Exp 3 | Exp 4 | AVG |
|---|---|---|---|---|---|---|
| 0.10 (Ours) | 0.0345 | 0.0337 | 0.0348 | 0.0349 | 0.0358 | 0.0347 |
| 0.07 | 0.0358 | 0.0363 | 0.0355 | 0.0370 | 0.0348 | 0.0359 |
| 0.12 | 0.0374 | 0.0369 | 0.0360 | 0.0364 | 0.0356 | 0.0365 |
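For intuition, a bandwidth-parameterized Gaussian KDE (the estimator underlying the wind-power-curve distribution) can be sketched as follows; the sample data are synthetic stand-ins, and the bandwidths are the three values examined in Table 3:

```python
import numpy as np

def kde_density(samples, grid, bandwidth):
    """Gaussian kernel density estimate evaluated on a grid."""
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
power = rng.beta(2, 5, 5_000)          # stand-in for standardized power samples
grid = np.linspace(-0.5, 1.5, 401)

for h in (0.07, 0.10, 0.12):           # bandwidths from Table 3
    dens = kde_density(power, grid, h)
    area = float(dens.sum() * (grid[1] - grid[0]))
    print(h, area)                     # each estimate integrates to ~1
```

Smaller bandwidths yield spikier density estimates, larger ones smoother estimates; Table 3 indicates 0.10 balances the two best for this task.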
In addition to parameter sensitivity, we investigated the influence of model depth on PCT performance. The PCT model adopts a deliberately lightweight architecture, employing a single encoder layer and a single decoder layer. This choice was primarily driven by empirical performance and computational efficiency, as validated through an ablation study on model depth. We conducted experiments comparing the performance of PCT models with 1, 2, and 3 Transformer layers under identical training conditions.
As shown in Table 4, increasing the model depth did not lead to a significant improvement in the model's performance. In our view, for this specific forecasting task, a single-layer Transformer, empowered by its self-attention mechanism, is already highly effective at capturing the necessary long-range temporal dependencies. Furthermore, adding more layers significantly increases the computational burden. The primary innovation of PCT lies in the integration of physical constraints via the JS divergence loss, which acts as a physics-informed regularizer. This physics-informed regularization helps prevent overfitting, reducing the need for a deeper, more complex architecture to achieve robust performance. A deeper model did not provide a corresponding benefit to justify the increased computational cost and model complexity. Therefore, the single-layer design represents an optimal trade-off between performance, efficiency, and simplicity for this application.
Table 4.
Performance of PCT with varying numbers of Transformer layers under different noise levels. Each value represents the average of five independent experimental runs. The results are evaluated using the MSE metric, reflecting the models’ predictive accuracy under noisy input conditions.
| Noise | PCT (one layer) | PCT (two layers) | PCT (three layers) |
|---|---|---|---|
| N(0, 0.1) | 0.0262 | 0.0265 | 0.0266 |
| N(0, 0.2) | 0.0288 | 0.0293 | 0.0280 |
| N(0, 0.3) | 0.0347 | 0.0347 | 0.0355 |
| N(0, 0.4) | 0.0398 | 0.0400 | 0.0395 |
| N(0, 0.5) | 0.0607 | 0.0603 | 0.0612 |
To further demonstrate the superiority of the probabilistic representation of physical knowledge in PCT, we explored a PINN-inspired approach for comparison. Specifically, a polynomial regression model was fitted to the historical data to establish an empirical relationship between the input features (wind speed and pitch angle) and the output (wind power). This fitted polynomial function was then treated as a simplified "physical law" and incorporated into the model as a soft constraint. During the training of a PolyConstrained-Transformer (Poly_Transformer) model, the total loss function comprised two components: the standard MSE loss between the model's predictions and the actual power output, and a "physics loss" that penalized the deviation between the model's prediction and the value given by the polynomial function. However, this approach has a fundamental limitation. The polynomial regression provides only an approximate, deterministic mapping, which fails to capture the inherent probabilistic nature and stochastic variability of real-world wind power generation. Consequently, this simplistic "physical constraint" is inherently inaccurate and may misguide the learning process. As shown in Table 5, the performance of the Poly_Transformer model is generally on par with the standard Transformer baseline and significantly inferior to our proposed PCT model across all noise levels. This result underscores that merely adding a poorly representative physical constraint does not enhance model performance and further demonstrates the superiority of PCT, which uses a more accurate, probabilistic representation of physical knowledge. Additionally, to explicitly demonstrate the value of sophisticated deep learning architectures, we introduced a pure physics-based baseline: NWP + Power Curve (NWP + PC). This baseline uses the noisy wind speed input and maps it directly to a power output using the averaged wind power curve derived from training data.
As expected, this baseline performs poorly, especially under high-noise conditions, where its MSE is substantially higher than all data-driven and hybrid models. This stark contrast clearly highlights the critical role of deep learning: its ability to learn complex, non-linear temporal patterns from historical data enables it to mitigate input noise and maintain robust performance.
Table 5.
Comparison of PCT and additional baseline models under different noise. Each value represents the average of five independent experimental runs. The results are evaluated using the MSE metric, reflecting the models’ predictive accuracy under noisy input conditions.
| Noise | PCT (Ours) | Transformer | Poly_Transformer | NWP + PC |
|---|---|---|---|---|
| N(0, 0.1) | 0.0262 | 0.0282 | 0.0289 | 0.0657 |
| N(0, 0.2) | 0.0288 | 0.0386 | 0.0373 | 0.0730 |
| N(0, 0.3) | 0.0347 | 0.0674 | 0.0660 | 0.1339 |
| N(0, 0.4) | 0.0398 | 0.1145 | 0.1163 | 0.2551 |
| N(0, 0.5) | 0.0607 | 0.1812 | 0.1190 | 0.4372 |
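The NWP + PC baseline amounts to pushing the (noisy) wind speed directly through a fixed power curve. A toy version with an illustrative cubic curve (the actual averaged curve is derived from training data) shows why its error deteriorates rapidly with input noise:

```python
import numpy as np

def power_curve(v):
    """Toy averaged power curve in standardized units: zero below cut-in,
    a cubic ramp up to rated speed, then flat at rated power."""
    cut_in, rated = 0.2, 0.7
    ramp = np.clip(((v - cut_in) / (rated - cut_in)) ** 3, 0.0, 1.0)
    return np.where(v < cut_in, 0.0, ramp)

rng = np.random.default_rng(1)
v = rng.uniform(0.0, 1.0, 20_000)      # "true" standardized wind speed
p_true = power_curve(v)

mse = {}
for x in (0.1, 0.5):                   # two of the paper's noise levels
    p_hat = power_curve(v + rng.normal(0.0, x, v.size))  # NWP + PC forecast
    mse[x] = float(np.mean((p_hat - p_true) ** 2))
print(mse[0.1] < mse[0.5])  # error grows quickly with input noise -> True
```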
Finally, to enhance the interpretability of our model, we have analyzed the self-attention weights from the trained PCT model and generated attention heatmaps to visualize how the model allocates its focus across the 4-day (576-step) input series during prediction. For comparison, we also present the attention heatmaps of the MSE-trained Transformer under the same input conditions. As shown in Fig. 8, when presented with identical inputs under high noise conditions (N(0, 0.5)), the attention heatmaps reveal distinct behaviors between the two models. The Transformer’s attention heatmap is generally dim and exhibits irregular patterns, indicating that it struggles to identify reliable features in the noisy input data. This lack of focus leads to inaccurate predictions. In contrast, PCT’s attention heatmap is significantly brighter and displays a clear periodic structure. This demonstrates that PCT, guided by the physical constraints embedded through the wind power curve, effectively filters out misleading information from the noisy inputs and extracts meaningful features. As a result, PCT generates more robust predictions, even when the current input is corrupted by significant noise.
Fig. 8.
Comparative attention heatmap visualization of PCT and MSE-trained Transformer under high noise conditions (N(0, 0.5)). Each heatmap represents the distribution of attention weights across the 4-day (576-step) input series. Brighter colors indicate higher attention weights. The results show that the PCT model’s heatmap is consistently brighter and exhibits a distinct periodic structure, reflecting its ability to extract meaningful features.
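A self-attention weight matrix can be turned into the kind of heatmap shown in Fig. 8 by a row-wise softmax over attention scores; the periodic scores below are synthetic stand-ins for the trained model's weights, and the resulting matrix can be rendered directly with `plt.imshow`:

```python
import numpy as np

def attention_map(scores):
    """Row-wise softmax turning raw attention scores into weights;
    each row of the result sums to 1."""
    z = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Synthetic scores with a daily (144-step) periodic structure over the
# 576-step (4-day) input series attended over in Fig. 8.
steps = np.arange(576)
scores = np.cos(2 * np.pi * (steps[None, :] - steps[:, None]) / 144)
attn = attention_map(scores)
print(attn.shape, bool(np.allclose(attn.sum(axis=1), 1.0)))  # -> (576, 576) True
```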
Conclusion
This study proposes the Physics-Constrained Transformer (PCT), a novel hybrid framework for day-ahead wind power forecasting that effectively integrates the advanced temporal modeling capabilities of the Transformer architecture with domain knowledge of wind power curves, represented as a probabilistic distribution via KDE. By introducing a JS loss as a physics-based regularizer, PCT ensures that its predictions remain physically plausible, thereby enhancing robustness against the high levels of noise commonly found in wind speed forecasts. Comprehensive experiments on real-world data from 25 wind turbines demonstrate that PCT achieves substantial performance gains, significantly outperforming both a standard Transformer and the previous state-of-the-art TgDPF model under varying noise conditions. We have also conducted extensive ablation studies and comparative analyses to further validate the effectiveness and interpretability of PCT. These results underscore the potential of PCT as a promising framework for wind power forecasting and suggest its broader applicability in the field of renewable energy forecasting.
Abbreviations
- PCT
Physics-constrained transformer
- TgDPF
Theory-guided deep-learning wind power forecasting
- LSTM
Long short-term memory
- KDE
Kernel density estimation
- KL Divergence
Kullback-Leibler divergence
- JS Divergence
Jensen-Shannon divergence
- N(0, x)
Unbiased Gaussian noise with standard deviation x
- MSE
Mean Squared Error
Author contributions
Ding Wang conceived the study, developed the methodology, implemented the software, and wrote the initial manuscript. Qiang Luo performed formal analysis, contributed to software development, and conducted investigations. Jiaxin Gao validated results, collected field data, and participated in manuscript revision. Yuntian Chen supervised the research and reviewed the manuscript. Dongxiao Zhang contributed to conceptualization and critical manuscript editing. All authors reviewed the manuscript.
Funding
This work was supported by the High Performance Computing Centers at Eastern Institute of Technology, Ningbo.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Jiaxin Gao, Email: jiaxingao@sjtu.edu.cn.
Yuntian Chen, Email: ychen@eitech.edu.cn.
References
- 1.Liu, H. & Chen, C. Data processing strategies in wind energy forecasting models and applications: A comprehensive review. Appl. Energy. 249, 392–408. 10.1016/j.apenergy.2019.04.188 (2019). [Google Scholar]
- 2.Wang, D. et al. Enhancing wind power forecasting accuracy through LSTM with adaptive wind speed calibration (C-LSTM). Sci. Rep. 15(1), 5352. 10.1038/s41598-025-89398-y (2025). [DOI] [PMC free article] [PubMed]
- 3.Hanifi, S., Liu, X., Lin, Z. & Lotfian, S. A critical review of wind power forecasting methods—past, present and future. Energies 13(15), 3764. 10.3390/en13153764 (2020).
- 4.Huang, S. et al. A hybrid framework for day-ahead electricity spot-price forecasting: A case study in China. Appl. Energy 373, 123863. 10.1016/j.apenergy.2024.123863 (2024).
- 5.Gao, J., Cao, Q., Chen, Y. & Zhang, D. Cross-variable Linear Integrated ENhanced Transformer for Photovoltaic power forecasting, arXiv preprint arXiv:2406.03808, Jun. (2024).
- 6.Das, U. K. et al. Forecasting of photovoltaic power generation and model optimization: A review. Renew. Sustain. Energy Rev. 10.1016/j.rser.2017.08.017 (2018). [Google Scholar]
- 7.Wang, Y., Zou, R., Liu, F., Zhang, L. & Liu, Q. A review of wind speed and wind power forecasting with deep neural networks. Appl. Energy. 304 (117766). 10.1016/j.apenergy.2021.117766 (2021).
- 8.Jung, J. Current status and future advances for wind speed and power forecasting. Renew. Sustain. Energy Rev.31, 762–777. 10.1016/j.rser.2013.12.054 (2014). [Google Scholar]
- 9.Shao, H., Deng, X. & Cui, F. Short-term wind speed forecasting using the wavelet decomposition and adaboost technique in wind farm of East China. IET Generation Transmission Distribution. 10 (11), 2585–2592. 10.1049/iet-gtd.2015.0911 (2016). [Google Scholar]
- 10.Hu, J., Heng, J., Wen, J. & Zhao, W. Deterministic and probabilistic wind speed forecasting with de-noising-reconstruction strategy and quantile regression based algorithm. Renew. Energy. 162, 1208–1226. 10.1016/j.renene.2020.08.077 (2020). [Google Scholar]
- 11.Hu, S. et al. Hybrid forecasting method for wind power integrating spatial correlation and corrected numerical weather prediction. Appl. Energy 293, 116951. 10.1016/j.apenergy.2021.116951 (2021).
- 12.Lydia, M., Kumar, S. S., Selvakumar, A. I. & Prem Kumar, G. E. A comprehensive review on wind turbine power curve modeling techniques. Renew. Sustain. Energy Rev.30, 452–460. 10.1016/j.rser.2013.10.030 (2014). [Google Scholar]
- 13.Wang, J., Zhou, Q. & Zhang, X. Wind power forecasting based on time series ARMA model. IOP Conf. Ser. Earth Environ. Sci. 199, 022015. 10.1088/1755-1315/199/2/022015 (2018).
- 14.Grigonytė, E. & Butkevičiūtė, E. Short-term wind speed forecasting using ARIMA model. Energetika62, 1–2. 10.6001/energetika.v62i1-2.3313 (2016). [Google Scholar]
- 15.Ko, M. S. et al. Deep concatenated residual network with bidirectional LSTM for one-hour-ahead wind power forecasting. IEEE Trans. Sustain. Energy. 12,2, 1321–1335. 10.1109/TSTE.2020.3043884 (2021). [Google Scholar]
- 16.Sarp, A. O., Menguc, E. C., Peker, M. & Guvenc, B. C. Data-adaptive censoring for short-term wind speed predictors based on MLP, RNN, and SVM. IEEE Syst. J.16 (3), 3625–3634. 10.1109/JSYST.2022.3150749 (2022). [Google Scholar]
- 17.Wu, Q., Guan, F., Lv, C. & Huang, Y. Ultra-short‐term multi‐step wind power forecasting based on CNN‐LSTM. IET Renew. Power Gener.15,5, 1019–1029. 10.1049/rpg2.12085 (2021). [Google Scholar]
- 18.Zhang, Y., Qin, C., Srivastava, A. K., Jin, C. & Sharma, R. K. Data-driven day-ahead pv Estimation using autoencoder-LSTM and persistence model. IEEE Trans. Ind. Appl.56,6, 7185–7192. 10.1109/TIA.2020.3025742 (2020). [Google Scholar]
- 19.Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707. 10.1016/j.jcp.2018.10.045 (2019). [Google Scholar]
- 20.Chen, Y. et al. Theory-guided hard constraint projection (HCP): A knowledge-based data-driven scientific machine learning method. J. Comput. Phys. 445, 110624. 10.1016/j.jcp.2021.110624 (2021).
- 21.Chen, Y. & Zhang, D. Theory-guided deep-learning for electrical load forecasting (TgDLF) via ensemble long short-term memory. Adv. Appl. Energy. 1 (100004). 10.1016/j.adapen.2020.100004 (2021).
- 22.Gao, J., Chen, Y., Hu, W. & Zhang, D. An adaptive deep-learning load forecasting framework by integrating transformer and domain knowledge. Adv. Appl. Energy. 10 (100142). 10.1016/J.ADAPEN.2023.100142 (2023).
- 23.Yan, J., Zhang, H., Liu, Y., Han, S. & Li, L. Uncertainty Estimation for wind energy conversion by probabilistic wind turbine power curve modelling. Appl. Energy. 239, 1356–1370. 10.1016/j.apenergy.2019.01.180 (2019). [Google Scholar]
- 24.Gao, J., Cheng, Y., Zhang, D. & Chen, Y. Physics-constrained wind power forecasting aligned with probability distributions for noise-resilient deep learning. Appl. Energy. 383 (125295). 10.1016/j.apenergy.2025.125295 (2025).
- 25.Parzen, E. On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3), 1065–1076 (1962). [Online]. Available: http://www.jstor.org/stable/2237880
- 26.Menéndez, M. L., Pardo, J. A., Pardo, L. & Pardo, M. C. The Jensen-Shannon divergence. J. Frankl. Inst.334, 2,307–318. 10.1016/S0016-0032(96)00063-4 (1997). [Google Scholar]
- 27.Zhang, D., Chen, Y. & Meng, J. Synthetic well logs generation via recurrent neural networks. Pet. Explor. Dev. 45(4), 629–639. 10.1016/S1876-3804(18)30068-5 (2018). [Google Scholar]
- 28.Gao, J. et al. A dilated convolution-based method with time series fine tuning for data‐driven crack length Estimation. Fatigue Fract. Eng. Mater. Struct.10.1111/ffe.14305 (2024). [Google Scholar]
- 29.Noh, S. H. Analysis of gradient vanishing of RNNs and performance comparison. Information12 (11), 442. 10.3390/info12110442 (2021). [Google Scholar]
- 30.Vaswani, A. et al. Attention is All you Need, in Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., Curran Associates, Inc., [Online]. (2017). Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
- 31.Gao, J., Hu, W., Zhang, D. & Chen, Y. Client: Cross-variable linear integrated enhanced transformer for multivariate long-term time series forecasting. AI Open.6, 93–107. 10.1016/j.aiopen.2025.06.001 (2025). [Google Scholar]
- 32.Gillioz, A., Casas, J., Mugellini, E. & Khaled, O. A. Overview of the transformer-based models for NLP tasks, 179–183. (2020). 10.15439/2020F20
- 33.Han, K. et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell.45,1, 87–110. 10.1109/TPAMI.2022.3152247 (2023). [DOI] [PubMed] [Google Scholar]
- 34.Shamshad, F. et al. Transformers in medical imaging: A survey. Med. Image Anal.88 (102802). 10.1016/j.media.2023.102802 (2023). [DOI] [PubMed]
- 35.Gao, J., Cao, Q. & Chen, Y. MoE-AMC: Enhancing automatic modulation classification performance using mixture-of-experts, arXiv preprint arXiv:2312.02298, (2023).
- 36.Gao, J., Hu, Y., Cao, Q., Dai, S. & Chen, Y. CLeaRForecast: contrastive learning of high-purity representations for time series forecasting, arXiv preprint arXiv:2312.05758, (2023).
- 37.Gao, J., Cao, Q. & Chen, Y. Auto-regressive moving diffusion models for time series forecasting, Proceedings of the AAAI Conference on Artificial Intelligence,39,16,16727–16735, (2025). 10.1609/aaai.v39i16.33838
- 38.Zhou, H. et al. Informer: beyond efficient transformer for long sequence time-series forecasting. Proc. AAAI Conf. Artif. Intell.35 (12), 11106–11115. 10.1609/aaai.v35i12.17325 (2021). [Google Scholar]
- 39.Nie, Y., Nguyen, N. H., Sinthong, P. & Kalagnanam, J. A time series is worth 64 words: long-term forecasting with transformers, arXiv preprint arXiv:2211.14730, (2022).
- 40.Panahi, D., Deilami, S. & Masoum, M. A. S. Evaluation of parametric and non-parametric methods for power curve modelling of wind turbines, in 9th International Conference on Electrical and Electronics Engineering (ELECO), IEEE. 996–1000. 10.1109/ELECO.2015.7394497 (2015).
- 41.Sheather, S. J. Density estimation. Statistical Science 19(4), 588–597 (2004). [Online]. Available: http://www.jstor.org/stable/4144429
- 42.Scott, D. W. Multivariate density Estimation and visualization. In Handbook of Computational Statistics. 549–569. 10.1007/978-3-642-21551-3_19 (Springer, 2012).
- 43.van Erven, T. & Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 60(7), 3797–3820. 10.1109/TIT.2014.2320500 (2014). [Google Scholar]
- 44.Lange, M. On the uncertainty of wind power predictions—analysis of the forecast accuracy and statistical distribution of errors. J. Sol Energy Eng.127,2, 177–184. 10.1115/1.1862266 (2005). [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.